Changes from 35 of 47 commits
c45a4dc
docs: add DwC-A export plan and technical references
mihow Feb 11, 2026
3c9945b
feat(exports): add Darwin Core Archive (DwC-A) export format
mihow Feb 11, 2026
0cf0b92
test(exports): add DwC-A export tests
mihow Feb 11, 2026
928d9fc
docs: add feature context and roadmap to DwC-A export plan
mihow Feb 11, 2026
dd2309e
docs: add review findings and follow-up roadmap to DwC-A plan
mihow Feb 11, 2026
576cba6
fix(exports): fix null guards and field semantics in DwC-A mappings
mihow Feb 11, 2026
a74aee9
fix(exports): filter null event/determination and fix PII leak in EML
mihow Feb 11, 2026
ad1b910
fix(exports): temp file cleanup, timezone, and EML schema fixes
mihow Feb 11, 2026
556e104
fix(exports): update tests and docs for DwC-A review fixes
mihow Feb 11, 2026
0927ddb
fix(exports): meta.xml field mappings, enclosure char, and progress u…
mihow Feb 11, 2026
2f5d381
test(exports): optimize DwC-A tests with setUpClass shared export
mihow Feb 11, 2026
c43d406
fix(exports): enable filter backends and derive events from filtered …
mihow Feb 11, 2026
d11976e
test(exports): add DwC-A collection filter test and fix event count a…
mihow Feb 11, 2026
e14139d
docs(exports): add API and operations reference for export system
mihow Feb 12, 2026
c8aadb7
docs(exports): merge API reference into export-framework.md
mihow Feb 12, 2026
fe73fc3
feat(exports): apply default filters, populate identifiedBy, add proj…
mihow Apr 14, 2026
92b2c04
refactor(exports): replace tuple DwC field defs with DwCAField dataclass
mihow Apr 14, 2026
d87156a
test(exports): add offline DwC-A structural validator
mihow Apr 15, 2026
2e3725a
docs(exports): add PR review and mapping spec
mihow Apr 15, 2026
6682f30
docs(exports): stash DwC-A April draft design in progress
mihow Apr 21, 2026
72332a3
docs(exports): record GBIF guide findings, confirm Event Core + Humboldt
mihow Apr 21, 2026
aebd708
docs(exports): scope targetTaxonomicScope to TaxaList longer-term
mihow Apr 21, 2026
ed80334
docs(exports): add DwC-A April 2026 implementation plan
mihow Apr 21, 2026
b07be30
refactor(exports): split dwca.py into package
mihow Apr 21, 2026
e15c6e1
feat(exports): derive targetTaxonomicScope via LCA of include taxa
mihow Apr 21, 2026
4822064
feat(exports): add Humboldt eco: terms as event.txt columns
mihow Apr 21, 2026
9a89887
feat(exports): add associatedMedia column to occurrence.txt
mihow Apr 21, 2026
cebed07
feat(exports): add multimedia extension field catalogue and row gener…
mihow Apr 21, 2026
0440a72
feat(exports): wire multimedia.txt into DwC-A archive
mihow Apr 21, 2026
0def56f
feat(exports): add measurementorfact.txt extension
mihow Apr 21, 2026
0563686
feat(exports): upgrade eml.xml to EML 2.2.0 with coverage and methods
mihow Apr 21, 2026
9979f75
feat(exports): validate occurrenceID cross-references between extensions
mihow Apr 21, 2026
d67146e
feat(exports): validate DwC-A archive structure before returning
mihow Apr 21, 2026
1350d51
feat(ui): expose dwca as a user-selectable export format
mihow Apr 21, 2026
45dec2e
docs(exports): update DwC-A format reference for April 2026 draft
mihow Apr 21, 2026
182587f
fix(exports): harden DwC-A validator and clean up on failure
mihow Apr 22, 2026
6d3b31b
fix(main): avoid PII leak in DwC-A identifiedBy, prefer Classificatio…
mihow Apr 22, 2026
5abf979
docs(exports): fix nested markdown fences in DwC-A plan
mihow Apr 22, 2026
5fc2478
Merge remote-tracking branch 'origin/main' into feat/dwca-export
mihow Apr 22, 2026
ce1e17f
fix(exports): exclude withdrawn identifications from DwC-A verificati…
mihow Apr 23, 2026
73eb3dc
feat(exports): default DwC-A license to "All rights reserved" when blank
mihow Apr 23, 2026
9475664
feat(exports): label DwC-A archives as April 2026 draft schema
mihow Apr 23, 2026
71dda42
test(exports): cover DwC-A hardcap + validation error file behaviors
mihow Apr 23, 2026
7488162
test(exports): add HTTP end-to-end DwC-A export test
mihow Apr 23, 2026
6d0eddd
docs(exports): refresh DwC-A follow-up tracking after PR #1131 round 2
mihow Apr 23, 2026
30bd47b
Merge remote-tracking branch 'origin/main' into feat/dwca-export
mihow Apr 23, 2026
77939ba
fix(exports): switch DwCA e2e test away from TransactionTestCase
mihow Apr 23, 2026
207 changes: 207 additions & 0 deletions .agents/planning/dwca-export-plan.md
# Plan: Add DwC-A (Darwin Core Archive) Export Format

## Why

AMI projects produce biodiversity occurrence data (species observations from automated insect monitoring stations). To make this data discoverable and citable in the global biodiversity research community, it needs to be published to GBIF (Global Biodiversity Information Facility). GBIF's standard ingestion format is the Darwin Core Archive (DwC-A).

**Roadmap:**
1. **This PR** — Static DwC-A export: the user triggers an export and downloads a ZIP file that validates against GBIF's Data Validator. This serves as the foundation for all downstream GBIF integration.
2. **Near follow-up** — Enrich the archive with additional DwC extensions (multimedia, measurement/fact) and a more complete EML metadata profile. Apply project default filters to the export.
3. **Eventual** — Automated publishing: either push archives to a hosted GBIF IPT (Integrated Publishing Toolkit) server, or implement the IPT's RSS/DwC-A endpoint protocol directly within Antenna so it can act as its own IPT, serving a feed that GBIF crawls on a schedule.

## Context

The export framework already exists (`ami/exports/`) with JSON and CSV formats registered via a simple registry pattern. Adding a new format requires: an exporter class, field mappings, and a one-line registration. The `DataExport` model and async job infrastructure handle storage, progress tracking, and file serving.

**Decisions made:**
- **Event-core architecture** (events as core, occurrences as extension) — This matches AMI's data model (monitoring sessions containing species observations) and is the recommended GBIF pattern for sampling-event datasets, which enables richer ecological analysis than occurrence-only archives.
- **URN format for IDs**: `urn:ami:event:{project_slug}:{id}`, `urn:ami:occurrence:{project_slug}:{id}` — Globally unique, stable, and human-readable. The project slug provides namespacing across AMI instances.
- **Coordinates from Deployment lat/lon only** (text locality fields like country/stateProvince deferred) — Deployments store coordinates; reverse geocoding for text fields is a separate concern.
- **`basisOfRecord` = `"MachineObservation"`** — GBIF's standard term for automated/sensor-derived observations, distinct from `HumanObservation`.
- **No DRF serializer** — DwC fields are flat extractions, not nested API representations. Direct TSV writing is simpler and faster.
- **Taxonomy from `parents_json`** — Avoids N+1 parent chain queries by walking the pre-computed `parents_json` list on each Taxon.

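The URN decision above reduces to a one-line helper (a sketch; the helper name is hypothetical and the real code may differ):

```python
def make_urn(record_type: str, project_slug: str, record_id: int) -> str:
    # Stable, globally unique, human-readable identifier; the project
    # slug namespaces records across AMI instances (see decision above).
    return f"urn:ami:{record_type}:{project_slug}:{record_id}"
```

For example, `make_urn("event", "my-project", 42)` yields `urn:ami:event:my-project:42`.
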
## Implementation Steps

### Step 1: Create DwC-A exporter class

**File:** `ami/exports/format_types.py` (add to existing file)

Create `DwCAExporter(BaseExporter)` with:
- `file_format = "zip"`
- `export()` method that orchestrates the full pipeline:
1. Write `event.txt` (tab-delimited) from Event queryset
2. Write `occurrence.txt` (tab-delimited) from Occurrence queryset
3. Generate `meta.xml`
4. Generate `eml.xml`
5. Package all into a ZIP, return temp file path

**Querysets:**
- Events: `Event.objects.filter(project=self.project)` with `select_related('deployment', 'deployment__research_site')`
- Occurrences: `Occurrence.objects.valid().filter(project=self.project)` with `select_related('determination', 'event', 'deployment')` and `.with_timestamps().with_detections_count()`

**Override `get_filter_backends()`** to return backends appropriate for events+occurrences (or empty list if collection filtering doesn't apply to events).

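The five-step pipeline can be sketched end to end with stubbed writers (a structural sketch only — `BaseExporter`, the querysets, and the real generators live in `ami/exports/`; every method body here is stand-in code):

```python
import zipfile
from pathlib import Path


class DwCAExporterSketch:
    """Stand-in for DwCAExporter: orchestrates TSV + XML generation
    and packages the four archive members into a ZIP."""

    file_format = "zip"

    def export(self, workdir: str) -> str:
        members = {
            "event.txt": self.write_event_tsv(),            # step 1
            "occurrence.txt": self.write_occurrence_tsv(),  # step 2
            "meta.xml": self.generate_meta_xml(),           # step 3
            "eml.xml": self.generate_eml_xml(),             # step 4
        }
        zip_path = Path(workdir) / "dwca.zip"
        with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
            for name, body in members.items():              # step 5
                zf.writestr(name, body)
        return str(zip_path)

    # Stubs standing in for the real queryset-driven writers:
    def write_event_tsv(self) -> str:
        return "eventID\teventDate\n"

    def write_occurrence_tsv(self) -> str:
        return "eventID\toccurrenceID\n"

    def generate_meta_xml(self) -> str:
        return '<archive xmlns="http://rs.tdwg.org/dwc/text/"/>'

    def generate_eml_xml(self) -> str:
        return "<eml/>"
```
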
### Step 2: Define DwC field mappings

**File:** `ami/exports/dwca.py` (new file)

Contains:
- `EVENT_FIELDS`: ordered list of `(dwc_term_uri, header_name, getter_function)` tuples
- `OCCURRENCE_FIELDS`: same structure
- Helper functions to extract taxonomy hierarchy from `determination.parents_json` (walk the `list[TaxonParent]` for kingdom, phylum, class, order, family, genus)
- `get_specific_epithet(name)` - split binomial to get second word
- `generate_meta_xml(event_fields, occurrence_fields, event_filename, occurrence_filename)` - builds the XML string
- `generate_eml_xml(project, events_queryset)` - builds minimal EML metadata from project info

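The taxonomy walk and epithet split can be sketched as below (helper names are hypothetical, and the `rank` casing in `parents_json` is an assumption, so the sketch normalizes it):

```python
LINNAEAN_RANKS = {"kingdom", "phylum", "class", "order", "family", "genus"}


def extract_hierarchy(parents_json: list) -> dict:
    # Walk the pre-computed parent chain ({"id", "name", "rank"} entries)
    # instead of issuing N+1 parent-chain queries per Taxon.
    hierarchy = {}
    for parent in parents_json:
        rank = str(parent.get("rank", "")).lower()
        if rank in LINNAEAN_RANKS:
            hierarchy[rank] = parent["name"]
    return hierarchy


def get_specific_epithet(scientific_name: str) -> str:
    # Second word of a binomial ("Actias luna" -> "luna"); empty for
    # monomials such as genus- or family-level determinations.
    parts = scientific_name.split()
    return parts[1] if len(parts) >= 2 else ""
```
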
**Event field mapping (event.txt):**

| Column | DwC Term | Source |
|--------|----------|--------|
| 0 | eventID | `urn:ami:event:{project_slug}:{event.id}` |
| 1 | eventDate | `event.start`/`event.end` as ISO date interval |
| 2 | eventTime | time portion of `event.start` |
| 3 | year | from `event.start` |
| 4 | month | from `event.start` |
| 5 | day | from `event.start` |
| 6 | samplingProtocol | `"automated light trap with camera"` (constant, could be project-level setting later) |
| 7 | sampleSizeValue | `event.captures_count` |
| 8 | sampleSizeUnit | `"images"` |
| 9 | samplingEffort | duration formatted |
| 10 | locationID | `deployment.name` |
| 11 | decimalLatitude | `deployment.latitude` |
| 12 | decimalLongitude | `deployment.longitude` |
| 13 | geodeticDatum | `"WGS84"` |
| 14 | datasetName | `project.name` |
| 15 | modified | `event.updated_at` ISO format |

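Column 1 (`eventDate`) as an ISO 8601 value might look like this (a sketch under the assumption that sessions spanning midnight are expressed as a date interval):

```python
from __future__ import annotations

from datetime import datetime


def format_event_date(start: datetime, end: datetime | None) -> str:
    # DwC eventDate: a single date when the session falls on one day,
    # a "start/end" ISO 8601 interval when it spans midnight.
    if end is None or end.date() == start.date():
        return start.date().isoformat()
    return f"{start.date().isoformat()}/{end.date().isoformat()}"
```
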
**Occurrence field mapping (occurrence.txt):**

| Column | DwC Term | Source |
|--------|----------|--------|
| 0 | eventID | same URN as core (foreign key) |
| 1 | occurrenceID | `urn:ami:occurrence:{project_slug}:{occurrence.id}` |
| 2 | basisOfRecord | `"MachineObservation"` |
| 3 | occurrenceStatus | `"present"` |
| 4 | scientificName | `determination.name` |
| 5 | taxonRank | `determination.rank` (lowercase) |
| 6 | kingdom | from `determination.parents_json` |
| 7 | phylum | from `determination.parents_json` |
| 8 | class | from `determination.parents_json` |
| 9 | order | from `determination.parents_json` |
| 10 | family | from `determination.parents_json` |
| 11 | genus | from `determination.parents_json` |
| 12 | specificEpithet | second word of species name |
| 13 | vernacularName | `determination.common_name_en` |
| 14 | taxonID | `determination.gbif_taxon_key` (if available) |
| 15 | individualCount | `detections_count` |
| 16 | identificationVerificationStatus | "verified" if identifications exist, else "unverified" |
| 17 | modified | `occurrence.updated_at` ISO format |

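Column 16 reduces to a tiny predicate (a sketch; per a later commit in this PR, the real check also has to exclude withdrawn identifications):

```python
def get_verification_status(identification_count: int) -> str:
    # "verified" once any human identification exists;
    # ML-only determinations stay "unverified".
    return "verified" if identification_count > 0 else "unverified"
```
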
### Step 3: Register the exporter

**File:** `ami/exports/registry.py`

Add: `ExportRegistry.register("dwca")(DwCAExporter)`

This is all that's needed for it to appear in the API's valid format choices.

### Step 4: Override `generate_filename()` behavior

`DataExport.generate_filename()` uses `exporter.file_format` for the extension. Since `file_format = "zip"`, the filename will be `{project_slug}_export-{pk}.zip`, which is correct.

No changes needed to `DataExport` model.

### Step 5: Write tests

**File:** `ami/exports/tests.py` (add to existing)

- Test that `DwCAExporter` is registered and retrievable
- Test that export produces a valid ZIP with expected files (event.txt, occurrence.txt, meta.xml, eml.xml)
- Test that event.txt has correct headers and row count matches events
- Test that occurrence.txt has correct headers and row count matches valid occurrences
- Test that meta.xml is valid XML with correct core/extension structure
- Test that all occurrence eventIDs reference existing event eventIDs (referential integrity)
- Test taxonomy hierarchy extraction from `parents_json`

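The referential-integrity test can be sketched as a pure function over parsed TSV rows (a hypothetical structure; the real test reads the files back out of the ZIP):

```python
def find_orphaned_occurrences(event_rows: list, occurrence_rows: list) -> list:
    # GBIF rejects extension rows whose coreid has no matching core row,
    # so every occurrence eventID must appear in event.txt.
    event_ids = {row["eventID"] for row in event_rows}
    return [
        row["occurrenceID"]
        for row in occurrence_rows
        if row["eventID"] not in event_ids
    ]
```
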
### Step 6: Update documentation

**File:** `docs/claude/dwca-format-reference.md` (already created, update with final field mappings)

## Key Files to Modify

| File | Action |
|------|--------|
| `ami/exports/dwca.py` | **New** - DwC field mappings, meta.xml/eml.xml generators, taxonomy helpers |
| `ami/exports/format_types.py` | **Modify** - Add `DwCAExporter` class |
| `ami/exports/registry.py` | **Modify** - Register `"dwca"` format |
| `ami/exports/tests.py` | **Modify** - Add DwC-A tests |

## Key Files to Read (not modify)

| File | Why |
|------|-----|
| `ami/exports/base.py` | BaseExporter interface |
| `ami/exports/models.py` | DataExport model, run_export() flow |
| `ami/exports/utils.py` | get_data_in_batches(), generate_fake_request() |
| `ami/main/models.py:1025` | Event model fields |
| `ami/main/models.py:2808` | Occurrence model fields |
| `ami/main/models.py:3329` | TaxonParent pydantic model (parents_json schema) |
| `ami/main/models.py:3349` | Taxon model fields |
| `docs/claude/reference/example_dwca_exporter.md` | Reference DwC-A implementation |

## Design Decisions

1. **No DRF serializer for DwC-A** - Unlike JSON/CSV exporters that use DRF serializers via `get_data_in_batches()`, the DwC-A exporter writes TSV directly. DwC fields are simple extractions, not nested API representations. This avoids the overhead of serializer instantiation per record.

2. **Direct queryset iteration** - Use `queryset.iterator(chunk_size=500)` for memory efficiency, writing rows directly to the TSV file.

3. **Taxonomy from parents_json** - Walk the `parents_json` list (which contains `{id, name, rank}` dicts) to extract kingdom/phylum/class/order/family/genus. This avoids N+1 queries on the Taxon parent chain.

4. **meta.xml generated from field definitions** - The same field list used for writing TSV columns also drives meta.xml generation, ensuring they stay in sync.

5. **Minimal eml.xml** - Start with project name, description, and owner. Can be enriched later with geographic bounding box, temporal coverage, etc.

6. **Scope for follow-up** - Species checklist (taxon.txt) and multimedia extension (multimedia.txt) are explicitly out of scope for this PR, as stated in the task.

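Decision 4 — driving meta.xml from the same ordered field list used for the TSV columns — can be sketched with `ElementTree` (an illustration, not the project's actual generator):

```python
from xml.etree import ElementTree as ET


def fields_to_meta_xml(term_uris: list) -> str:
    # Emit one <field index=... term=.../> per TSV column, in column
    # order, so the header row and meta.xml can never drift apart.
    ext = ET.Element("extension", rowType="http://rs.tdwg.org/dwc/terms/Occurrence")
    ET.SubElement(ext, "coreid", index="0")  # column 0 links back to the event core
    for index, term in enumerate(term_uris[1:], start=1):
        ET.SubElement(ext, "field", index=str(index), term=term)
    return ET.tostring(ext, encoding="unicode")
```
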
## Verification

1. Run existing export tests to ensure no regression: `docker compose run --rm django python manage.py test ami.exports`
2. Run new DwC-A tests
3. Manual test: create a DwC-A export via the API or admin, download the ZIP, inspect contents
4. Validate with GBIF Data Validator: https://www.gbif.org/tools/data-validator

## Known issues to fix before merge

1. **Occurrences without events produce empty `coreid`** — GBIF rejects orphaned extension rows. Need `.filter(event__isnull=False)` on occurrence queryset. (`ami/exports/format_types.py:199`)
2. **Occurrences without determinations produce empty `scientificName`** — GBIF treats this as required. Need `.filter(determination__isnull=False)`. (`ami/exports/format_types.py:199`)
3. **`individualCount` semantics wrong** — `detections_count` = bounding boxes across frames, not individuals. Each AMI occurrence is one individual. Should emit `1` or omit. (`ami/exports/dwca.py:87`)
4. **`vernacularName` operator precedence** — `x or "" if y else ""` should be `(x or "") if y else ""`. (`ami/exports/dwca.py:78-79`)
5. **Temp files never cleaned up** — event.txt, occurrence.txt, zip temp file leak on worker. (`ami/exports/format_types.py:238-264`)

## Near follow-up (before real GBIF submission)

- **Apply project default filters** to the occurrence queryset — without this, low-confidence ML determinations get published to GBIF. This is the biggest data-quality risk.
- **Add `license` field** on events — GBIF requires a dataset license for reuse terms.
- **Add `identifiedBy` / `dateIdentified`** — provenance for who/what made the determination.
- **Add `associatedMedia`** — detection image URLs (pipe-separated). Primary evidence for an image-based platform.
- **Runtime validation before packaging** — check for missing required fields, orphaned references, before creating the ZIP.
- **Upgrade EML to 2.2.0** — the current code uses 2.1.1; GBIF recommends 2.2.0. The reference doc already shows 2.2.0.

## Eventual follow-up

- EML geographic/temporal coverage computed from actual data (bounding box, date range)
- `country`, `stateProvince`, `locality` on events (requires reverse geocoding or Site model fields)
- `coordinateUncertaintyInMeters`
- `institutionCode`, `collectionCode` (project-level settings)
- `scientificNameAuthorship` from `Taxon.author`
- `eventType` field
- Multimedia extension file (`multimedia.txt`)
- GBIF Data Validator automated integration test
- IPT server integration / acting as IPT endpoint for GBIF crawling

## Nice to haves

- Use `default` attribute in meta.xml for constant fields (`basisOfRecord`, `geodeticDatum`, etc.) to reduce file size
- Filter events to only those that have occurrences in the export
- Guard against `ZeroDivisionError` in progress callback when `total_records` is 0
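
The last nice-to-have is a one-line guard (a sketch; whether an empty export should report its progress as complete is an assumption here):

```python
def progress_fraction(records_done: int, total_records: int) -> float:
    # Avoid ZeroDivisionError when an export matches zero records;
    # treat an empty export as already complete.
    return records_done / total_records if total_records else 1.0
```
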
53 changes: 53 additions & 0 deletions ami/exports/dwca/__init__.py
"""Public surface of the DwC-A export package.

Re-exports keep existing imports (format_types.py, tests) working unchanged
while internal code is organized by responsibility.
"""

from ami.exports.dwca.eml import generate_eml_xml
from ami.exports.dwca.fields import (
DC,
DWC,
ECO,
EVENT_FIELDS,
MOF_FIELDS,
MULTIMEDIA_FIELDS,
OCCURRENCE_FIELDS,
DwCAField,
)
from ami.exports.dwca.helpers import (
_format_coord,
_format_datetime,
_format_duration,
_format_event_date,
_format_time,
_get_rank_from_parents,
_get_verification_status,
get_specific_epithet,
)
from ami.exports.dwca.meta import generate_meta_xml
from ami.exports.dwca.tsv import write_tsv
from ami.exports.dwca.zip import create_dwca_zip

__all__ = [
"DC",
"DWC",
"ECO",
"DwCAField",
"EVENT_FIELDS",
"MOF_FIELDS",
"MULTIMEDIA_FIELDS",
"OCCURRENCE_FIELDS",
"create_dwca_zip",
"generate_eml_xml",
"generate_meta_xml",
"get_specific_epithet",
"write_tsv",
"_format_coord",
"_format_datetime",
"_format_duration",
"_format_event_date",
"_format_time",
"_get_rank_from_parents",
"_get_verification_status",
]
132 changes: 132 additions & 0 deletions ami/exports/dwca/eml.py
"""Generate EML 2.2.0 metadata for the DwC-A.

EML 2.2.0 is the current ratified version and what GBIF expects. Geographic
and temporal coverage are computed from the event list; a methods section
documents the automated capture + ML pipeline and the quality-control filters
applied at export time.
"""

from __future__ import annotations

from xml.etree import ElementTree as ET

from django.utils import timezone
from django.utils.text import slugify

EML_NS = "https://eml.ecoinformatics.org/eml-2.2.0"
XSI_NS = "http://www.w3.org/2001/XMLSchema-instance"


def generate_eml_xml(project, events=None) -> str:
"""Return the eml.xml body.

If `events` is provided (iterable of Event), geographic and temporal
coverage are computed from it. If absent, the coverage element is omitted.
"""
project_slug = slugify(project.name)
now = timezone.now().strftime("%Y-%m-%dT%H:%M:%S")

eml = ET.Element("eml:eml")
eml.set("xmlns:eml", EML_NS)
eml.set("xmlns:dc", "http://purl.org/dc/terms/")
eml.set("xmlns:xsi", XSI_NS)
eml.set("xsi:schemaLocation", f"{EML_NS} https://eml.ecoinformatics.org/eml-2.2.0/eml.xsd")
eml.set("packageId", f"urn:ami:dataset:{project_slug}:{now}")
eml.set("system", "AMI")

dataset = ET.SubElement(eml, "dataset")
_add_text(dataset, "title", project.name)

creator = ET.SubElement(dataset, "creator")
_add_text(creator, "organizationName", "Automated Monitoring of Insects (AMI)")
if project.owner and project.owner.name:
individual = ET.SubElement(creator, "individualName")
_add_text(individual, "surName", project.owner.name)

abstract = ET.SubElement(dataset, "abstract")
_add_text(abstract, "para", project.description or f"Biodiversity monitoring data from {project.name}.")

_add_intellectual_rights(dataset, project)

if events is not None:
_add_coverage(dataset, events)

_add_methods(dataset)

contact = ET.SubElement(dataset, "contact")
_add_text(contact, "organizationName", "Automated Monitoring of Insects (AMI)")

ET.indent(eml, space=" ")
xml_str = ET.tostring(eml, encoding="unicode", xml_declaration=False)
return '<?xml version="1.0" encoding="UTF-8"?>\n' + xml_str + "\n"


def _add_text(parent, tag, text):
child = ET.SubElement(parent, tag)
child.text = text or ""
return child


def _add_intellectual_rights(dataset, project):
rights = ET.SubElement(dataset, "intellectualRights")
para = ET.SubElement(rights, "para")
project_license = (getattr(project, "license", "") or "").strip()
para.text = project_license if project_license else "All rights reserved. No license specified."
if getattr(project, "rights_holder", ""):
additional = ET.SubElement(dataset, "additionalInfo")
_add_text(additional, "para", f"Rights holder: {project.rights_holder}")


def _add_coverage(dataset, events):
lats = [e.deployment.latitude for e in events if e.deployment and e.deployment.latitude is not None]
lons = [e.deployment.longitude for e in events if e.deployment and e.deployment.longitude is not None]
starts = [e.start for e in events if e.start]
ends = [e.end for e in events if e.end] or starts

if not (lats and lons) and not starts:
return

coverage = ET.SubElement(dataset, "coverage")

if lats and lons:
geo = ET.SubElement(coverage, "geographicCoverage")
_add_text(geo, "geographicDescription", "Computed from event deployment coordinates")
bounding = ET.SubElement(geo, "boundingCoordinates")
_add_text(bounding, "westBoundingCoordinate", f"{min(lons):.6f}")
_add_text(bounding, "eastBoundingCoordinate", f"{max(lons):.6f}")
_add_text(bounding, "northBoundingCoordinate", f"{max(lats):.6f}")
_add_text(bounding, "southBoundingCoordinate", f"{min(lats):.6f}")

if starts:
temporal = ET.SubElement(coverage, "temporalCoverage")
range_of_dates = ET.SubElement(temporal, "rangeOfDates")
begin = ET.SubElement(range_of_dates, "beginDate")
_add_text(begin, "calendarDate", min(starts).date().isoformat())
end = ET.SubElement(range_of_dates, "endDate")
_add_text(end, "calendarDate", max(ends).date().isoformat())


def _add_methods(dataset):
methods = ET.SubElement(dataset, "methods")
step = ET.SubElement(methods, "methodStep")
description = ET.SubElement(step, "description")
_add_text(
description,
"para",
"Images captured at a fixed interval by an automated camera trap with light attractant. "
"Each image is processed through an ML detector (bounding-box extraction) and an ML "
"classifier (species prediction). Individual detections are aggregated into occurrences "
"by spatiotemporal grouping and assigned a consensus determination.",
)
sampling = ET.SubElement(methods, "sampling")
study_extent = ET.SubElement(sampling, "studyExtent")
_add_text(study_extent, "description", "See <coverage> for geographic and temporal extent.")
_add_text(sampling, "samplingDescription", "Automated overnight monitoring with continuous image capture.")
qc = ET.SubElement(methods, "qualityControl")
qc_description = ET.SubElement(qc, "description")
_add_text(
qc_description,
"para",
"Project default filters applied before export: score thresholds, include/exclude taxa "
"lists, soft-delete exclusion. Only occurrences with at least one detection are included.",
)