Changes from 35 of 47 commits
c45a4dc
docs: add DwC-A export plan and technical references
mihow Feb 11, 2026
3c9945b
feat(exports): add Darwin Core Archive (DwC-A) export format
mihow Feb 11, 2026
0cf0b92
test(exports): add DwC-A export tests
mihow Feb 11, 2026
928d9fc
docs: add feature context and roadmap to DwC-A export plan
mihow Feb 11, 2026
dd2309e
docs: add review findings and follow-up roadmap to DwC-A plan
mihow Feb 11, 2026
576cba6
fix(exports): fix null guards and field semantics in DwC-A mappings
mihow Feb 11, 2026
a74aee9
fix(exports): filter null event/determination and fix PII leak in EML
mihow Feb 11, 2026
ad1b910
fix(exports): temp file cleanup, timezone, and EML schema fixes
mihow Feb 11, 2026
556e104
fix(exports): update tests and docs for DwC-A review fixes
mihow Feb 11, 2026
0927ddb
fix(exports): meta.xml field mappings, enclosure char, and progress u…
mihow Feb 11, 2026
2f5d381
test(exports): optimize DwC-A tests with setUpClass shared export
mihow Feb 11, 2026
c43d406
fix(exports): enable filter backends and derive events from filtered …
mihow Feb 11, 2026
d11976e
test(exports): add DwC-A collection filter test and fix event count a…
mihow Feb 11, 2026
e14139d
docs(exports): add API and operations reference for export system
mihow Feb 12, 2026
c8aadb7
docs(exports): merge API reference into export-framework.md
mihow Feb 12, 2026
fe73fc3
feat(exports): apply default filters, populate identifiedBy, add proj…
mihow Apr 14, 2026
92b2c04
refactor(exports): replace tuple DwC field defs with DwCAField dataclass
mihow Apr 14, 2026
d87156a
test(exports): add offline DwC-A structural validator
mihow Apr 15, 2026
2e3725a
docs(exports): add PR review and mapping spec
mihow Apr 15, 2026
6682f30
docs(exports): stash DwC-A April draft design in progress
mihow Apr 21, 2026
72332a3
docs(exports): record GBIF guide findings, confirm Event Core + Humboldt
mihow Apr 21, 2026
aebd708
docs(exports): scope targetTaxonomicScope to TaxaList longer-term
mihow Apr 21, 2026
ed80334
docs(exports): add DwC-A April 2026 implementation plan
mihow Apr 21, 2026
b07be30
refactor(exports): split dwca.py into package
mihow Apr 21, 2026
e15c6e1
feat(exports): derive targetTaxonomicScope via LCA of include taxa
mihow Apr 21, 2026
4822064
feat(exports): add Humboldt eco: terms as event.txt columns
mihow Apr 21, 2026
9a89887
feat(exports): add associatedMedia column to occurrence.txt
mihow Apr 21, 2026
cebed07
feat(exports): add multimedia extension field catalogue and row gener…
mihow Apr 21, 2026
0440a72
feat(exports): wire multimedia.txt into DwC-A archive
mihow Apr 21, 2026
0def56f
feat(exports): add measurementorfact.txt extension
mihow Apr 21, 2026
0563686
feat(exports): upgrade eml.xml to EML 2.2.0 with coverage and methods
mihow Apr 21, 2026
9979f75
feat(exports): validate occurrenceID cross-references between extensions
mihow Apr 21, 2026
d67146e
feat(exports): validate DwC-A archive structure before returning
mihow Apr 21, 2026
1350d51
feat(ui): expose dwca as a user-selectable export format
mihow Apr 21, 2026
45dec2e
docs(exports): update DwC-A format reference for April 2026 draft
mihow Apr 21, 2026
182587f
fix(exports): harden DwC-A validator and clean up on failure
mihow Apr 22, 2026
6d3b31b
fix(main): avoid PII leak in DwC-A identifiedBy, prefer Classificatio…
mihow Apr 22, 2026
5abf979
docs(exports): fix nested markdown fences in DwC-A plan
mihow Apr 22, 2026
5fc2478
Merge remote-tracking branch 'origin/main' into feat/dwca-export
mihow Apr 22, 2026
ce1e17f
fix(exports): exclude withdrawn identifications from DwC-A verificati…
mihow Apr 23, 2026
73eb3dc
feat(exports): default DwC-A license to "All rights reserved" when blank
mihow Apr 23, 2026
9475664
feat(exports): label DwC-A archives as April 2026 draft schema
mihow Apr 23, 2026
71dda42
test(exports): cover DwC-A hardcap + validation error file behaviors
mihow Apr 23, 2026
7488162
test(exports): add HTTP end-to-end DwC-A export test
mihow Apr 23, 2026
6d0eddd
docs(exports): refresh DwC-A follow-up tracking after PR #1131 round 2
mihow Apr 23, 2026
30bd47b
Merge remote-tracking branch 'origin/main' into feat/dwca-export
mihow Apr 23, 2026
77939ba
fix(exports): switch DwCA e2e test away from TransactionTestCase
mihow Apr 23, 2026
207 changes: 207 additions & 0 deletions .agents/planning/dwca-export-plan.md
# Plan: Add DwC-A (Darwin Core Archive) Export Format

## Why

AMI projects produce biodiversity occurrence data (species observations from automated insect monitoring stations). To make this data discoverable and citable in the global biodiversity research community, it needs to be published to GBIF (Global Biodiversity Information Facility). GBIF's standard ingestion format is the Darwin Core Archive (DwC-A).

**Roadmap:**
1. **This PR** — Static DwC-A export: the user triggers an export and downloads a ZIP file that validates against GBIF's Data Validator. This serves as the foundation for all downstream GBIF integration.
2. **Near follow-up** — Enrich the archive with additional DwC extensions (multimedia, measurement/fact) and a more complete EML metadata profile. Apply project default filters to the export.
3. **Eventual** — Automated publishing: either push archives to a hosted GBIF IPT (Integrated Publishing Toolkit) server, or implement the IPT's RSS/DwC-A endpoint protocol directly within Antenna so it can act as its own IPT, serving a feed that GBIF crawls on a schedule.

## Context

The export framework already exists (`ami/exports/`) with JSON and CSV formats registered via a simple registry pattern. Adding a new format requires: an exporter class, field mappings, and a one-line registration. The `DataExport` model and async job infrastructure handle storage, progress tracking, and file serving.

**Decisions made:**
- **Event-core architecture** (events as core, occurrences as extension) — This matches AMI's data model (monitoring sessions containing species observations) and is the recommended GBIF pattern for sampling-event datasets, which enables richer ecological analysis than occurrence-only archives.
- **URN format for IDs**: `urn:ami:event:{project_slug}:{id}`, `urn:ami:occurrence:{project_slug}:{id}` — Globally unique, stable, and human-readable. The project slug provides namespacing across AMI instances.
- **Coordinates from Deployment lat/lon only** (text locality fields like country/stateProvince deferred) — Deployments store coordinates; reverse geocoding for text fields is a separate concern.
- **`basisOfRecord` = `"MachineObservation"`** — GBIF's standard term for automated/sensor-derived observations, distinct from `HumanObservation`.
- **No DRF serializer** — DwC fields are flat extractions, not nested API representations. Direct TSV writing is simpler and faster.
- **Taxonomy from `parents_json`** — Avoids N+1 parent chain queries by walking the pre-computed `parents_json` list on each Taxon.

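The URN decision above reduces to a one-line helper (a sketch; the helper name is hypothetical and the real code may differ):

```python
def make_urn(record_type: str, project_slug: str, record_id: int) -> str:
    # Stable, globally unique, human-readable identifier; the project
    # slug namespaces records across AMI instances (see decision above).
    return f"urn:ami:{record_type}:{project_slug}:{record_id}"
```

For example, `make_urn("event", "my-project", 42)` yields `urn:ami:event:my-project:42`.
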
## Implementation Steps

### Step 1: Create DwC-A exporter class

**File:** `ami/exports/format_types.py` (add to existing file)

Create `DwCAExporter(BaseExporter)` with:
- `file_format = "zip"`
- `export()` method that orchestrates the full pipeline:
1. Write `event.txt` (tab-delimited) from Event queryset
2. Write `occurrence.txt` (tab-delimited) from Occurrence queryset
3. Generate `meta.xml`
4. Generate `eml.xml`
5. Package all into a ZIP, return temp file path

**Querysets:**
- Events: `Event.objects.filter(project=self.project)` with `select_related('deployment', 'deployment__research_site')`
- Occurrences: `Occurrence.objects.valid().filter(project=self.project)` with `select_related('determination', 'event', 'deployment')` and `.with_timestamps().with_detections_count()`

**Override `get_filter_backends()`** to return backends appropriate for events+occurrences (or empty list if collection filtering doesn't apply to events).

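The five-step pipeline can be sketched end to end with stubbed writers (a structural sketch only — `BaseExporter`, the querysets, and the real generators live in `ami/exports/`; every method body here is stand-in code):

```python
import zipfile
from pathlib import Path


class DwCAExporterSketch:
    """Stand-in for DwCAExporter: orchestrates TSV + XML generation
    and packages the four archive members into a ZIP."""

    file_format = "zip"

    def export(self, workdir: str) -> str:
        members = {
            "event.txt": self.write_event_tsv(),            # step 1
            "occurrence.txt": self.write_occurrence_tsv(),  # step 2
            "meta.xml": self.generate_meta_xml(),           # step 3
            "eml.xml": self.generate_eml_xml(),             # step 4
        }
        zip_path = Path(workdir) / "dwca.zip"
        with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
            for name, body in members.items():              # step 5
                zf.writestr(name, body)
        return str(zip_path)

    # Stubs standing in for the real queryset-driven writers:
    def write_event_tsv(self) -> str:
        return "eventID\teventDate\n"

    def write_occurrence_tsv(self) -> str:
        return "eventID\toccurrenceID\n"

    def generate_meta_xml(self) -> str:
        return '<archive xmlns="http://rs.tdwg.org/dwc/text/"/>'

    def generate_eml_xml(self) -> str:
        return "<eml/>"
```
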
### Step 2: Define DwC field mappings

**File:** `ami/exports/dwca.py` (new file)

Contains:
- `EVENT_FIELDS`: ordered list of `(dwc_term_uri, header_name, getter_function)` tuples
- `OCCURRENCE_FIELDS`: same structure
- Helper functions to extract taxonomy hierarchy from `determination.parents_json` (walk the `list[TaxonParent]` for kingdom, phylum, class, order, family, genus)
- `get_specific_epithet(name)` - split binomial to get second word
- `generate_meta_xml(event_fields, occurrence_fields, event_filename, occurrence_filename)` - builds the XML string
- `generate_eml_xml(project, events_queryset)` - builds minimal EML metadata from project info

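The taxonomy walk and epithet split can be sketched as below (helper names are hypothetical, and the `rank` casing in `parents_json` is an assumption, so the sketch normalizes it):

```python
LINNAEAN_RANKS = {"kingdom", "phylum", "class", "order", "family", "genus"}


def extract_hierarchy(parents_json: list) -> dict:
    # Walk the pre-computed parent chain ({"id", "name", "rank"} entries)
    # instead of issuing N+1 parent-chain queries per Taxon.
    hierarchy = {}
    for parent in parents_json:
        rank = str(parent.get("rank", "")).lower()
        if rank in LINNAEAN_RANKS:
            hierarchy[rank] = parent["name"]
    return hierarchy


def get_specific_epithet(scientific_name: str) -> str:
    # Second word of a binomial ("Actias luna" -> "luna"); empty for
    # monomials such as genus- or family-level determinations.
    parts = scientific_name.split()
    return parts[1] if len(parts) >= 2 else ""
```
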
**Event field mapping (event.txt):**

| Column | DwC Term | Source |
|--------|----------|--------|
| 0 | eventID | `urn:ami:event:{project_slug}:{event.id}` |
| 1 | eventDate | `event.start`/`event.end` as ISO date interval |
| 2 | eventTime | time portion of `event.start` |
| 3 | year | from `event.start` |
| 4 | month | from `event.start` |
| 5 | day | from `event.start` |
| 6 | samplingProtocol | `"automated light trap with camera"` (constant, could be project-level setting later) |
| 7 | sampleSizeValue | `event.captures_count` |
| 8 | sampleSizeUnit | `"images"` |
| 9 | samplingEffort | duration formatted |
| 10 | locationID | `deployment.name` |
| 11 | decimalLatitude | `deployment.latitude` |
| 12 | decimalLongitude | `deployment.longitude` |
| 13 | geodeticDatum | `"WGS84"` |
| 14 | datasetName | `project.name` |
| 15 | modified | `event.updated_at` ISO format |

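Column 1 (`eventDate`) as an ISO 8601 value might look like this (a sketch under the assumption that sessions spanning midnight are expressed as a date interval):

```python
from __future__ import annotations

from datetime import datetime


def format_event_date(start: datetime, end: datetime | None) -> str:
    # DwC eventDate: a single date when the session falls on one day,
    # a "start/end" ISO 8601 interval when it spans midnight.
    if end is None or end.date() == start.date():
        return start.date().isoformat()
    return f"{start.date().isoformat()}/{end.date().isoformat()}"
```
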
**Occurrence field mapping (occurrence.txt):**

| Column | DwC Term | Source |
|--------|----------|--------|
| 0 | eventID | same URN as core (foreign key) |
| 1 | occurrenceID | `urn:ami:occurrence:{project_slug}:{occurrence.id}` |
| 2 | basisOfRecord | `"MachineObservation"` |
| 3 | occurrenceStatus | `"present"` |
| 4 | scientificName | `determination.name` |
| 5 | taxonRank | `determination.rank` (lowercase) |
| 6 | kingdom | from `determination.parents_json` |
| 7 | phylum | from `determination.parents_json` |
| 8 | class | from `determination.parents_json` |
| 9 | order | from `determination.parents_json` |
| 10 | family | from `determination.parents_json` |
| 11 | genus | from `determination.parents_json` |
| 12 | specificEpithet | second word of species name |
| 13 | vernacularName | `determination.common_name_en` |
| 14 | taxonID | `determination.gbif_taxon_key` (if available) |
| 15 | individualCount | `detections_count` |
| 16 | identificationVerificationStatus | "verified" if identifications exist, else "unverified" |
| 17 | modified | `occurrence.updated_at` ISO format |

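Column 16 reduces to a tiny predicate (a sketch; per a later commit in this PR, the real check also has to exclude withdrawn identifications):

```python
def get_verification_status(identification_count: int) -> str:
    # "verified" once any human identification exists;
    # ML-only determinations stay "unverified".
    return "verified" if identification_count > 0 else "unverified"
```
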
### Step 3: Register the exporter

**File:** `ami/exports/registry.py`

Add: `ExportRegistry.register("dwca")(DwCAExporter)`

This is all that's needed for it to appear in the API's valid format choices.

### Step 4: Override `generate_filename()` behavior

`DataExport.generate_filename()` uses `exporter.file_format` for the extension. Since `file_format = "zip"`, the filename will be `{project_slug}_export-{pk}.zip`, which is correct.

No changes needed to `DataExport` model.

### Step 5: Write tests

**File:** `ami/exports/tests.py` (add to existing)

- Test that `DwCAExporter` is registered and retrievable
- Test that export produces a valid ZIP with expected files (event.txt, occurrence.txt, meta.xml, eml.xml)
- Test that event.txt has correct headers and row count matches events
- Test that occurrence.txt has correct headers and row count matches valid occurrences
- Test that meta.xml is valid XML with correct core/extension structure
- Test that all occurrence eventIDs reference existing event eventIDs (referential integrity)
- Test taxonomy hierarchy extraction from `parents_json`

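The referential-integrity test can be sketched as a pure function over parsed TSV rows (a hypothetical structure; the real test reads the files back out of the ZIP):

```python
def find_orphaned_occurrences(event_rows: list, occurrence_rows: list) -> list:
    # GBIF rejects extension rows whose coreid has no matching core row,
    # so every occurrence eventID must appear in event.txt.
    event_ids = {row["eventID"] for row in event_rows}
    return [
        row["occurrenceID"]
        for row in occurrence_rows
        if row["eventID"] not in event_ids
    ]
```
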
### Step 6: Update documentation

**File:** `docs/claude/dwca-format-reference.md` (already created, update with final field mappings)

## Key Files to Modify

| File | Action |
|------|--------|
| `ami/exports/dwca.py` | **New** - DwC field mappings, meta.xml/eml.xml generators, taxonomy helpers |
| `ami/exports/format_types.py` | **Modify** - Add `DwCAExporter` class |
| `ami/exports/registry.py` | **Modify** - Register `"dwca"` format |
| `ami/exports/tests.py` | **Modify** - Add DwC-A tests |

## Key Files to Read (not modify)

| File | Why |
|------|-----|
| `ami/exports/base.py` | BaseExporter interface |
| `ami/exports/models.py` | DataExport model, run_export() flow |
| `ami/exports/utils.py` | get_data_in_batches(), generate_fake_request() |
| `ami/main/models.py:1025` | Event model fields |
| `ami/main/models.py:2808` | Occurrence model fields |
| `ami/main/models.py:3329` | TaxonParent pydantic model (parents_json schema) |
| `ami/main/models.py:3349` | Taxon model fields |
| `docs/claude/reference/example_dwca_exporter.md` | Reference DwC-A implementation |

## Design Decisions

1. **No DRF serializer for DwC-A** - Unlike JSON/CSV exporters that use DRF serializers via `get_data_in_batches()`, the DwC-A exporter writes TSV directly. DwC fields are simple extractions, not nested API representations. This avoids the overhead of serializer instantiation per record.

2. **Direct queryset iteration** - Use `queryset.iterator(chunk_size=500)` for memory efficiency, writing rows directly to the TSV file.

3. **Taxonomy from parents_json** - Walk the `parents_json` list (which contains `{id, name, rank}` dicts) to extract kingdom/phylum/class/order/family/genus. This avoids N+1 queries on the Taxon parent chain.

4. **meta.xml generated from field definitions** - The same field list used for writing TSV columns also drives meta.xml generation, ensuring they stay in sync.

5. **Minimal eml.xml** - Start with project name, description, and owner. Can be enriched later with geographic bounding box, temporal coverage, etc.

6. **Scope for follow-up** - Species checklist (taxon.txt) and multimedia extension (multimedia.txt) are explicitly out of scope for this PR, as stated in the task.

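Decision 4 — driving meta.xml from the same ordered field list used for the TSV columns — can be sketched with `ElementTree` (an illustration, not the project's actual generator):

```python
from xml.etree import ElementTree as ET


def fields_to_meta_xml(term_uris: list) -> str:
    # Emit one <field index=... term=.../> per TSV column, in column
    # order, so the header row and meta.xml can never drift apart.
    ext = ET.Element("extension", rowType="http://rs.tdwg.org/dwc/terms/Occurrence")
    ET.SubElement(ext, "coreid", index="0")  # column 0 links back to the event core
    for index, term in enumerate(term_uris[1:], start=1):
        ET.SubElement(ext, "field", index=str(index), term=term)
    return ET.tostring(ext, encoding="unicode")
```
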
## Verification

1. Run existing export tests to ensure no regression: `docker compose run --rm django python manage.py test ami.exports`
2. Run new DwC-A tests
3. Manual test: create a DwC-A export via the API or admin, download the ZIP, inspect contents
4. Validate with GBIF Data Validator: https://www.gbif.org/tools/data-validator

## Known issues to fix before merge

1. **Occurrences without events produce empty `coreid`** — GBIF rejects orphaned extension rows. Need `.filter(event__isnull=False)` on occurrence queryset. (`ami/exports/format_types.py:199`)
2. **Occurrences without determinations produce empty `scientificName`** — GBIF treats this as required. Need `.filter(determination__isnull=False)`. (`ami/exports/format_types.py:199`)
3. **`individualCount` semantics wrong** — `detections_count` = bounding boxes across frames, not individuals. Each AMI occurrence is one individual. Should emit `1` or omit. (`ami/exports/dwca.py:87`)
4. **`vernacularName` operator precedence** — `x or "" if y else ""` should be `(x or "") if y else ""`. (`ami/exports/dwca.py:78-79`)
5. **Temp files never cleaned up** — event.txt, occurrence.txt, zip temp file leak on worker. (`ami/exports/format_types.py:238-264`)

## Near follow-up (before real GBIF submission)

- **Apply project default filters** to the occurrence queryset — without this, low-confidence ML determinations get published to GBIF. This is the biggest data-quality risk.
- **Add `license` field** on events — GBIF requires a dataset license for reuse terms.
- **Add `identifiedBy` / `dateIdentified`** — provenance for who/what made the determination.
- **Add `associatedMedia`** — detection image URLs (pipe-separated). Primary evidence for an image-based platform.
- **Runtime validation before packaging** — check for missing required fields, orphaned references, before creating the ZIP.
- **Upgrade EML to 2.2.0** — the current code uses 2.1.1; GBIF recommends 2.2.0. The reference doc already shows 2.2.0.

## Eventual follow-up

- EML geographic/temporal coverage computed from actual data (bounding box, date range)
- `country`, `stateProvince`, `locality` on events (requires reverse geocoding or Site model fields)
- `coordinateUncertaintyInMeters`
- `institutionCode`, `collectionCode` (project-level settings)
- `scientificNameAuthorship` from `Taxon.author`
- `eventType` field
- Multimedia extension file (`multimedia.txt`)
- GBIF Data Validator automated integration test
- IPT server integration / acting as IPT endpoint for GBIF crawling

## Nice to haves

- Use `default` attribute in meta.xml for constant fields (`basisOfRecord`, `geodeticDatum`, etc.) to reduce file size
- Filter events to only those that have occurrences in the export
- Guard against `ZeroDivisionError` in progress callback when `total_records` is 0
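
The last nice-to-have is a one-line guard (a sketch; whether an empty export should report its progress as complete is an assumption here):

```python
def progress_fraction(records_done: int, total_records: int) -> float:
    # Avoid ZeroDivisionError when an export matches zero records;
    # treat an empty export as already complete.
    return records_done / total_records if total_records else 1.0
```
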
53 changes: 53 additions & 0 deletions ami/exports/dwca/__init__.py
"""Public surface of the DwC-A export package.

Re-exports keep existing imports (format_types.py, tests) working unchanged
while internal code is organized by responsibility.
"""

from ami.exports.dwca.eml import generate_eml_xml
from ami.exports.dwca.fields import (
DC,
DWC,
ECO,
EVENT_FIELDS,
MOF_FIELDS,
MULTIMEDIA_FIELDS,
OCCURRENCE_FIELDS,
DwCAField,
)
from ami.exports.dwca.helpers import (
_format_coord,
_format_datetime,
_format_duration,
_format_event_date,
_format_time,
_get_rank_from_parents,
_get_verification_status,
get_specific_epithet,
)
from ami.exports.dwca.meta import generate_meta_xml
from ami.exports.dwca.tsv import write_tsv
from ami.exports.dwca.zip import create_dwca_zip

__all__ = [
"DC",
"DWC",
"ECO",
"DwCAField",
"EVENT_FIELDS",
"MOF_FIELDS",
"MULTIMEDIA_FIELDS",
"OCCURRENCE_FIELDS",
"create_dwca_zip",
"generate_eml_xml",
"generate_meta_xml",
"get_specific_epithet",
"write_tsv",
"_format_coord",
"_format_datetime",
"_format_duration",
"_format_event_date",
"_format_time",
"_get_rank_from_parents",
"_get_verification_status",
]
132 changes: 132 additions & 0 deletions ami/exports/dwca/eml.py
"""Generate EML 2.2.0 metadata for the DwC-A.

EML 2.2.0 is the current ratified version and what GBIF expects. Geographic
and temporal coverage are computed from the event list; a methods section
documents the automated capture + ML pipeline and the quality-control filters
applied at export time.
"""

from __future__ import annotations

from xml.etree import ElementTree as ET

from django.utils import timezone
from django.utils.text import slugify

EML_NS = "https://eml.ecoinformatics.org/eml-2.2.0"
XSI_NS = "http://www.w3.org/2001/XMLSchema-instance"


def generate_eml_xml(project, events=None) -> str:
"""Return the eml.xml body.

If `events` is provided (iterable of Event), geographic and temporal
coverage are computed from it. If absent, the coverage element is omitted.
"""
project_slug = slugify(project.name)
now = timezone.now().strftime("%Y-%m-%dT%H:%M:%S")

eml = ET.Element("eml:eml")
eml.set("xmlns:eml", EML_NS)
eml.set("xmlns:dc", "http://purl.org/dc/terms/")
eml.set("xmlns:xsi", XSI_NS)
eml.set("xsi:schemaLocation", f"{EML_NS} https://eml.ecoinformatics.org/eml-2.2.0/eml.xsd")
eml.set("packageId", f"urn:ami:dataset:{project_slug}:{now}")
eml.set("system", "AMI")

dataset = ET.SubElement(eml, "dataset")
_add_text(dataset, "title", project.name)

creator = ET.SubElement(dataset, "creator")
_add_text(creator, "organizationName", "Automated Monitoring of Insects (AMI)")
if project.owner and project.owner.name:
individual = ET.SubElement(creator, "individualName")
_add_text(individual, "surName", project.owner.name)

abstract = ET.SubElement(dataset, "abstract")
_add_text(abstract, "para", project.description or f"Biodiversity monitoring data from {project.name}.")

_add_intellectual_rights(dataset, project)

if events is not None:
_add_coverage(dataset, events)

_add_methods(dataset)

contact = ET.SubElement(dataset, "contact")
_add_text(contact, "organizationName", "Automated Monitoring of Insects (AMI)")

ET.indent(eml, space=" ")
xml_str = ET.tostring(eml, encoding="unicode", xml_declaration=False)
return '<?xml version="1.0" encoding="UTF-8"?>\n' + xml_str + "\n"


def _add_text(parent, tag, text):
child = ET.SubElement(parent, tag)
child.text = text or ""
return child


def _add_intellectual_rights(dataset, project):
rights = ET.SubElement(dataset, "intellectualRights")
para = ET.SubElement(rights, "para")
project_license = (getattr(project, "license", "") or "").strip()
para.text = project_license if project_license else "All rights reserved. No license specified."
if getattr(project, "rights_holder", ""):
additional = ET.SubElement(dataset, "additionalInfo")
_add_text(additional, "para", f"Rights holder: {project.rights_holder}")


def _add_coverage(dataset, events):
lats = [e.deployment.latitude for e in events if e.deployment and e.deployment.latitude is not None]
lons = [e.deployment.longitude for e in events if e.deployment and e.deployment.longitude is not None]
starts = [e.start for e in events if e.start]
ends = [e.end for e in events if e.end] or starts

if not (lats and lons) and not starts:
return

coverage = ET.SubElement(dataset, "coverage")

if lats and lons:
geo = ET.SubElement(coverage, "geographicCoverage")
_add_text(geo, "geographicDescription", "Computed from event deployment coordinates")
bounding = ET.SubElement(geo, "boundingCoordinates")
_add_text(bounding, "westBoundingCoordinate", f"{min(lons):.6f}")
_add_text(bounding, "eastBoundingCoordinate", f"{max(lons):.6f}")
_add_text(bounding, "northBoundingCoordinate", f"{max(lats):.6f}")
_add_text(bounding, "southBoundingCoordinate", f"{min(lats):.6f}")

if starts:
temporal = ET.SubElement(coverage, "temporalCoverage")
range_of_dates = ET.SubElement(temporal, "rangeOfDates")
begin = ET.SubElement(range_of_dates, "beginDate")
_add_text(begin, "calendarDate", min(starts).date().isoformat())
end = ET.SubElement(range_of_dates, "endDate")
_add_text(end, "calendarDate", max(ends).date().isoformat())


def _add_methods(dataset):
methods = ET.SubElement(dataset, "methods")
step = ET.SubElement(methods, "methodStep")
description = ET.SubElement(step, "description")
_add_text(
description,
"para",
"Images captured at a fixed interval by an automated camera trap with light attractant. "
"Each image is processed through an ML detector (bounding-box extraction) and an ML "
"classifier (species prediction). Individual detections are aggregated into occurrences "
"by spatiotemporal grouping and assigned a consensus determination.",
)
sampling = ET.SubElement(methods, "sampling")
study_extent = ET.SubElement(sampling, "studyExtent")
_add_text(study_extent, "description", "See <coverage> for geographic and temporal extent.")
_add_text(sampling, "samplingDescription", "Automated overnight monitoring with continuous image capture.")
qc = ET.SubElement(methods, "qualityControl")
qc_description = ET.SubElement(qc, "description")
_add_text(
qc_description,
"para",
"Project default filters applied before export: score thresholds, include/exclude taxa "
"lists, soft-delete exclusion. Only occurrences with at least one detection are included.",
)