diff --git a/.agents/planning/dwca-export-plan.md b/.agents/planning/dwca-export-plan.md new file mode 100644 index 000000000..8b198ffac --- /dev/null +++ b/.agents/planning/dwca-export-plan.md @@ -0,0 +1,207 @@ +# Plan: Add DwC-A (Darwin Core Archive) Export Format + +## Why + +AMI projects produce biodiversity occurrence data (species observations from automated insect monitoring stations). To make this data discoverable and citable in the global biodiversity research community, it needs to be published to GBIF (Global Biodiversity Information Facility). GBIF's standard ingestion format is the Darwin Core Archive (DwC-A). + +**Roadmap:** +1. **This PR** — Static DwC-A export: user triggers an export, downloads a ZIP file. Validates against GBIF's data validator. Serves as the foundation for all downstream GBIF integration. +2. **Near follow-up** — Enrich the archive with additional DwC extensions (multimedia, measurement/fact) and a more complete EML metadata profile. Apply project default filters to the export. +3. **Eventual** — Automated publishing: either push archives to a hosted GBIF IPT (Integrated Publishing Toolkit) server, or implement the IPT's RSS/DwC-A endpoint protocol directly within Antenna so it can act as its own IPT, serving a feed that GBIF crawls on a schedule. + +## Context + +The export framework already exists (`ami/exports/`) with JSON and CSV formats registered via a simple registry pattern. Adding a new format requires: an exporter class, field mappings, and a one-line registration. The `DataExport` model and async job infrastructure handle storage, progress tracking, and file serving. + +**Decisions made:** +- **Event-core architecture** (events as core, occurrences as extension) — This matches AMI's data model (monitoring sessions containing species observations) and is the recommended GBIF pattern for sampling-event datasets, which enables richer ecological analysis than occurrence-only archives. 
+- **URN format for IDs**: `urn:ami:event:{project_slug}:{id}`, `urn:ami:occurrence:{project_slug}:{id}` — Globally unique, stable, and human-readable. The project slug provides namespacing across AMI instances. +- **Coordinates from Deployment lat/lon only** (text locality fields like country/stateProvince deferred) — Deployments store coordinates; reverse geocoding for text fields is a separate concern. +- **`basisOfRecord` = `"MachineObservation"`** — GBIF's standard term for automated/sensor-derived observations, distinct from `HumanObservation`. +- **No DRF serializer** — DwC fields are flat extractions, not nested API representations. Direct TSV writing is simpler and faster. +- **Taxonomy from `parents_json`** — Avoids N+1 parent chain queries by walking the pre-computed `parents_json` list on each Taxon. + +## Implementation Steps + +### Step 1: Create DwC-A exporter class + +**File:** `ami/exports/format_types.py` (add to existing file) + +Create `DwCAExporter(BaseExporter)` with: +- `file_format = "zip"` +- `export()` method that orchestrates the full pipeline: + 1. Write `event.txt` (tab-delimited) from Event queryset + 2. Write `occurrence.txt` (tab-delimited) from Occurrence queryset + 3. Generate `meta.xml` + 4. Generate `eml.xml` + 5. Package all into a ZIP, return temp file path + +**Querysets:** +- Events: `Event.objects.filter(project=self.project)` with `select_related('deployment', 'deployment__research_site')` +- Occurrences: `Occurrence.objects.valid().filter(project=self.project)` with `select_related('determination', 'event', 'deployment')` and `.with_timestamps().with_detections_count()` + +**Override `get_filter_backends()`** to return backends appropriate for events+occurrences (or empty list if collection filtering doesn't apply to events). 
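The Step 1 pipeline (write the tab-delimited files, generate the two XML documents, package everything into a ZIP) can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the exporter itself: `write_tsv_text` and `package_archive` are hypothetical names, rows are plain dicts rather than Django model instances, and the XML bodies are placeholder strings.

```python
import csv
import io
import zipfile


def write_tsv_text(headers, rows):
    """Render rows (dicts keyed by header) as tab-delimited text with one header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(headers)
    for row in rows:
        # Missing keys become empty cells so every row has the same column count.
        writer.writerow([row.get(header, "") for header in headers])
    return buffer.getvalue()


def package_archive(members):
    """Zip a mapping of archive filename -> text content and return the ZIP bytes."""
    out = io.BytesIO()
    with zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, text in members.items():
            archive.writestr(name, text)
    return out.getvalue()


# Hypothetical sample rows; the real exporter reads these from querysets.
events = [{"eventID": "urn:ami:event:demo:1", "eventDate": "2024-06-01"}]
occurrences = [{"eventID": "urn:ami:event:demo:1", "occurrenceID": "urn:ami:occurrence:demo:7"}]

archive_bytes = package_archive(
    {
        "event.txt": write_tsv_text(["eventID", "eventDate"], events),
        "occurrence.txt": write_tsv_text(["eventID", "occurrenceID"], occurrences),
        "meta.xml": "<archive/>",  # placeholder; the real descriptor lists every column
        "eml.xml": "<eml/>",  # placeholder metadata document
    }
)
```

Building the archive in memory keeps the sketch short and leaves no temporary files behind. A production exporter would more likely stream rows to disk so that large projects do not have to fit in memory.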
+ +### Step 2: Define DwC field mappings + +**File:** `ami/exports/dwca.py` (new file) + +Contains: +- `EVENT_FIELDS`: ordered list of `(dwc_term_uri, header_name, getter_function)` tuples +- `OCCURRENCE_FIELDS`: same structure +- Helper functions to extract taxonomy hierarchy from `determination.parents_json` (walk the `list[TaxonParent]` for kingdom, phylum, class, order, family, genus) +- `get_specific_epithet(name)` - split binomial to get second word +- `generate_meta_xml(event_fields, occurrence_fields, event_filename, occurrence_filename)` - builds the XML string +- `generate_eml_xml(project, events_queryset)` - builds minimal EML metadata from project info + +**Event field mapping (event.txt):** + +| Column | DwC Term | Source | +|--------|----------|--------| +| 0 | eventID | `urn:ami:event:{project_slug}:{event.id}` | +| 1 | eventDate | `event.start`/`event.end` as ISO date interval | +| 2 | eventTime | time portion of `event.start` | +| 3 | year | from `event.start` | +| 4 | month | from `event.start` | +| 5 | day | from `event.start` | +| 6 | samplingProtocol | `"automated light trap with camera"` (constant, could be project-level setting later) | +| 7 | sampleSizeValue | `event.captures_count` | +| 8 | sampleSizeUnit | `"images"` | +| 9 | samplingEffort | duration formatted | +| 10 | locationID | `deployment.name` | +| 11 | decimalLatitude | `deployment.latitude` | +| 12 | decimalLongitude | `deployment.longitude` | +| 13 | geodeticDatum | `"WGS84"` | +| 14 | datasetName | `project.name` | +| 15 | modified | `event.updated_at` ISO format | + +**Occurrence field mapping (occurrence.txt):** + +| Column | DwC Term | Source | +|--------|----------|--------| +| 0 | eventID | same URN as core (foreign key) | +| 1 | occurrenceID | `urn:ami:occurrence:{project_slug}:{occurrence.id}` | +| 2 | basisOfRecord | `"MachineObservation"` | +| 3 | occurrenceStatus | `"present"` | +| 4 | scientificName | `determination.name` | +| 5 | taxonRank | `determination.rank` 
(lowercase) | +| 6 | kingdom | from `determination.parents_json` | +| 7 | phylum | from `determination.parents_json` | +| 8 | class | from `determination.parents_json` | +| 9 | order | from `determination.parents_json` | +| 10 | family | from `determination.parents_json` | +| 11 | genus | from `determination.parents_json` | +| 12 | specificEpithet | second word of species name | +| 13 | vernacularName | `determination.common_name_en` | +| 14 | taxonID | `determination.gbif_taxon_key` (if available) | +| 15 | individualCount | `detections_count` | +| 16 | identificationVerificationStatus | "verified" if identifications exist, else "unverified" | +| 17 | modified | `occurrence.updated_at` ISO format | + +### Step 3: Register the exporter + +**File:** `ami/exports/registry.py` + +Add: `ExportRegistry.register("dwca")(DwCAExporter)` + +This is all that's needed for it to appear in the API's valid format choices. + +### Step 4: Override `generate_filename()` behavior + +The `DataExport.generate_filename()` uses `exporter.file_format` for the extension. Since `file_format = "zip"`, the filename will be `{project_slug}_export-{pk}.zip` which is correct. + +No changes needed to `DataExport` model. 
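The registration in Step 3 and the filename behaviour in Step 4 can be illustrated together. The sketch below is an assumption about how such a registry might look, not the code in `ami/exports/registry.py`: the `ExportRegistry` methods other than `register`, the bare `DwCAExporter` class, and the standalone `generate_filename` function are hypothetical stand-ins.

```python
class ExportRegistry:
    """Hypothetical stand-in for the export registry: maps format names to exporter classes."""

    _exporters = {}

    @classmethod
    def register(cls, format_name):
        """Return a decorator that stores the exporter class under format_name."""

        def decorator(exporter_class):
            cls._exporters[format_name] = exporter_class
            return exporter_class

        return decorator

    @classmethod
    def get_exporter(cls, format_name):
        return cls._exporters[format_name]

    @classmethod
    def get_supported_formats(cls):
        return sorted(cls._exporters)


class DwCAExporter:
    """Minimal placeholder; the real class extends BaseExporter."""

    file_format = "zip"


# The one-line registration from Step 3.
ExportRegistry.register("dwca")(DwCAExporter)


def generate_filename(project_slug, pk, exporter_class):
    """Mirror the Step 4 pattern: the extension comes from the exporter's file_format."""
    return f"{project_slug}_export-{pk}.{exporter_class.file_format}"
```

Because the extension is read from `file_format`, registering the exporter under the name `"dwca"` while declaring `file_format = "zip"` yields a `.zip` download without any change to the model.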
+ +### Step 5: Write tests + +**File:** `ami/exports/tests.py` (add to existing) + +- Test that `DwCAExporter` is registered and retrievable +- Test that export produces a valid ZIP with expected files (event.txt, occurrence.txt, meta.xml, eml.xml) +- Test that event.txt has correct headers and row count matches events +- Test that occurrence.txt has correct headers and row count matches valid occurrences +- Test that meta.xml is valid XML with correct core/extension structure +- Test that all occurrence eventIDs reference existing event eventIDs (referential integrity) +- Test taxonomy hierarchy extraction from `parents_json` + +### Step 6: Update documentation + +**File:** `docs/claude/dwca-format-reference.md` (already created, update with final field mappings) + +## Key Files to Modify + +| File | Action | +|------|--------| +| `ami/exports/dwca.py` | **New** - DwC field mappings, meta.xml/eml.xml generators, taxonomy helpers | +| `ami/exports/format_types.py` | **Modify** - Add `DwCAExporter` class | +| `ami/exports/registry.py` | **Modify** - Register `"dwca"` format | +| `ami/exports/tests.py` | **Modify** - Add DwC-A tests | + +## Key Files to Read (not modify) + +| File | Why | +|------|-----| +| `ami/exports/base.py` | BaseExporter interface | +| `ami/exports/models.py` | DataExport model, run_export() flow | +| `ami/exports/utils.py` | get_data_in_batches(), generate_fake_request() | +| `ami/main/models.py:1025` | Event model fields | +| `ami/main/models.py:2808` | Occurrence model fields | +| `ami/main/models.py:3329` | TaxonParent pydantic model (parents_json schema) | +| `ami/main/models.py:3349` | Taxon model fields | +| `docs/claude/reference/example_dwca_exporter.md` | Reference DwC-A implementation | + +## Design Decisions + +1. **No DRF serializer for DwC-A** - Unlike JSON/CSV exporters that use DRF serializers via `get_data_in_batches()`, the DwC-A exporter writes TSV directly. DwC fields are simple extractions, not nested API representations. 
This avoids the overhead of serializer instantiation per record. + +2. **Direct queryset iteration** - Use `queryset.iterator(chunk_size=500)` for memory efficiency, writing rows directly to the TSV file. + +3. **Taxonomy from parents_json** - Walk the `parents_json` list (which contains `{id, name, rank}` dicts) to extract kingdom/phylum/class/order/family/genus. This avoids N+1 queries on the Taxon parent chain. + +4. **meta.xml generated from field definitions** - The same field list used for writing TSV columns also drives meta.xml generation, ensuring they stay in sync. + +5. **Minimal eml.xml** - Start with project name, description, and owner. Can be enriched later with geographic bounding box, temporal coverage, etc. + +6. **Scope for follow-up** - Species checklist (taxon.txt) and multimedia extension (multimedia.txt) are explicitly out of scope for this PR, as stated in the task. + +## Verification + +1. Run existing export tests to ensure no regression: `docker compose run --rm django python manage.py test ami.exports` +2. Run new DwC-A tests +3. Manual test: create a DwC-A export via the API or admin, download the ZIP, inspect contents +4. Validate with GBIF Data Validator: https://www.gbif.org/tools/data-validator + +## Known issues to fix before merge + +1. **Occurrences without events produce empty `coreid`** — GBIF rejects orphaned extension rows. Need `.filter(event__isnull=False)` on occurrence queryset. (`ami/exports/format_types.py:199`) +2. **Occurrences without determinations produce empty `scientificName`** — GBIF treats this as required. Need `.filter(determination__isnull=False)`. (`ami/exports/format_types.py:199`) +3. **`individualCount` semantics wrong** — `detections_count` = bounding boxes across frames, not individuals. Each AMI occurrence is one individual. Should emit `1` or omit. (`ami/exports/dwca.py:87`) +4. **`vernacularName` operator precedence** — `x or "" if y else ""` already parses as `(x or "") if y else ""` in Python, but should be written with the explicit parentheses so the intent is unambiguous. 
(`ami/exports/dwca.py:78-79`) +5. **Temp files never cleaned up** — event.txt, occurrence.txt, zip temp file leak on worker. (`ami/exports/format_types.py:238-264`) + +## Near follow-up (before real GBIF submission) + +- **Apply project default filters** to occurrence queryset — without this, low-confidence ML determinations get published to GBIF. Biggest data quality risk. +- **Add `license` field** on events — GBIF requires a dataset license for reuse terms. +- **Add `identifiedBy` / `dateIdentified`** — provenance for who/what made the determination. +- **Add `associatedMedia`** — detection image URLs (pipe-separated). Primary evidence for an image-based platform. +- **Runtime validation before packaging** — check for missing required fields, orphaned references, before creating the ZIP. +- **Upgrade EML to 2.2.0** — current code uses 2.1.1, GBIF recommends 2.2.0. The reference doc already shows 2.2.0. + +## Eventual follow-up + +- EML geographic/temporal coverage computed from actual data (bounding box, date range) +- `country`, `stateProvince`, `locality` on events (requires reverse geocoding or Site model fields) +- `coordinateUncertaintyInMeters` +- `institutionCode`, `collectionCode` (project-level settings) +- `scientificNameAuthorship` from `Taxon.author` +- `eventType` field +- Multimedia extension file (`multimedia.txt`) +- GBIF Data Validator automated integration test +- IPT server integration / acting as IPT endpoint for GBIF crawling + +## Nice to haves + +- Use `default` attribute in meta.xml for constant fields (`basisOfRecord`, `geodeticDatum`, etc.) 
to reduce file size +- Filter events to only those that have occurrences in the export +- Guard against `ZeroDivisionError` in progress callback when `total_records` is 0 diff --git a/ami/exports/base.py b/ami/exports/base.py index 389480d5e..b44c3cc98 100644 --- a/ami/exports/base.py +++ b/ami/exports/base.py @@ -11,6 +11,7 @@ class BaseExporter(ABC): """Base class for all data export handlers.""" file_format = "" # To be defined in child classes + filename_label = "" # Optional slug token inserted into export filenames (e.g. "dwca_draft-2026-04") serializer_class = None filter_backends = [] diff --git a/ami/exports/dwca/__init__.py b/ami/exports/dwca/__init__.py new file mode 100644 index 000000000..7a2c22ef2 --- /dev/null +++ b/ami/exports/dwca/__init__.py @@ -0,0 +1,53 @@ +"""Public surface of the DwC-A export package. + +Re-exports keep existing imports (format_types.py, tests) working unchanged +while internal code is organized by responsibility. +""" + +from ami.exports.dwca.eml import generate_eml_xml +from ami.exports.dwca.fields import ( + DC, + DWC, + ECO, + EVENT_FIELDS, + MOF_FIELDS, + MULTIMEDIA_FIELDS, + OCCURRENCE_FIELDS, + DwCAField, +) +from ami.exports.dwca.helpers import ( + _format_coord, + _format_datetime, + _format_duration, + _format_event_date, + _format_time, + _get_rank_from_parents, + _get_verification_status, + get_specific_epithet, +) +from ami.exports.dwca.meta import generate_meta_xml +from ami.exports.dwca.tsv import write_tsv +from ami.exports.dwca.zip import create_dwca_zip + +__all__ = [ + "DC", + "DWC", + "ECO", + "DwCAField", + "EVENT_FIELDS", + "MOF_FIELDS", + "MULTIMEDIA_FIELDS", + "OCCURRENCE_FIELDS", + "create_dwca_zip", + "generate_eml_xml", + "generate_meta_xml", + "get_specific_epithet", + "write_tsv", + "_format_coord", + "_format_datetime", + "_format_duration", + "_format_event_date", + "_format_time", + "_get_rank_from_parents", + "_get_verification_status", +] diff --git a/ami/exports/dwca/eml.py 
b/ami/exports/dwca/eml.py new file mode 100644 index 000000000..a1f650fe3 --- /dev/null +++ b/ami/exports/dwca/eml.py @@ -0,0 +1,146 @@ +"""Generate EML 2.2.0 metadata for the DwC-A. + +EML 2.2.0 is the current ratified version and what GBIF expects. Geographic +and temporal coverage are computed from the event list; a methods section +documents the automated capture + ML pipeline and the quality-control filters +applied at export time. +""" + +from __future__ import annotations + +from xml.etree import ElementTree as ET + +from django.utils import timezone +from django.utils.text import slugify + +EML_NS = "https://eml.ecoinformatics.org/eml-2.2.0" +XSI_NS = "http://www.w3.org/2001/XMLSchema-instance" + + +def generate_eml_xml(project, events=None) -> str: + """Return the eml.xml body. + + If `events` is provided (iterable of Event), geographic and temporal + coverage are computed from it. If absent, the coverage element is omitted. + """ + project_slug = slugify(project.name) + now = timezone.now().strftime("%Y-%m-%dT%H:%M:%S") + + eml = ET.Element("eml:eml") + eml.set("xmlns:eml", EML_NS) + eml.set("xmlns:dc", "http://purl.org/dc/terms/") + eml.set("xmlns:xsi", XSI_NS) + eml.set("xsi:schemaLocation", f"{EML_NS} https://eml.ecoinformatics.org/eml-2.2.0/eml.xsd") + eml.set("packageId", f"urn:ami:dataset:{project_slug}:{now}") + eml.set("system", "AMI") + + dataset = ET.SubElement(eml, "dataset") + _add_text(dataset, "title", project.name) + + creator = ET.SubElement(dataset, "creator") + _add_text(creator, "organizationName", "Automated Monitoring of Insects (AMI)") + if project.owner and project.owner.name: + individual = ET.SubElement(creator, "individualName") + _add_text(individual, "surName", project.owner.name) + + abstract = ET.SubElement(dataset, "abstract") + _add_text(abstract, "para", project.description or f"Biodiversity monitoring data from {project.name}.") + + _add_intellectual_rights(dataset, project) + + if events is not None: + 
_add_coverage(dataset, events) + + _add_methods(dataset) + + _add_draft_notice(dataset) + + contact = ET.SubElement(dataset, "contact") + _add_text(contact, "organizationName", "Automated Monitoring of Insects (AMI)") + + ET.indent(eml, space=" ") + xml_str = ET.tostring(eml, encoding="unicode", xml_declaration=False) + return '<?xml version="1.0" encoding="UTF-8"?>\n' + xml_str + "\n" + + +def _add_text(parent, tag, text): + child = ET.SubElement(parent, tag) + child.text = text or "" + return child + + +def _add_intellectual_rights(dataset, project): + rights = ET.SubElement(dataset, "intellectualRights") + para = ET.SubElement(rights, "para") + project_license = (getattr(project, "license", "") or "").strip() + para.text = project_license if project_license else "All rights reserved. No license specified." + if getattr(project, "rights_holder", ""): + additional = ET.SubElement(dataset, "additionalInfo") + _add_text(additional, "para", f"Rights holder: {project.rights_holder}") + + +def _add_coverage(dataset, events): + # Materialize once: a one-shot iterable would be exhausted by the first comprehension. + events = list(events) + lats = [e.deployment.latitude for e in events if e.deployment and e.deployment.latitude is not None] + lons = [e.deployment.longitude for e in events if e.deployment and e.deployment.longitude is not None] + starts = [e.start for e in events if e.start] + ends = [e.end for e in events if e.end] or starts + + if not (lats and lons) and not starts: + return + + coverage = ET.SubElement(dataset, "coverage") + + if lats and lons: + geo = ET.SubElement(coverage, "geographicCoverage") + _add_text(geo, "geographicDescription", "Computed from event deployment coordinates") + bounding = ET.SubElement(geo, "boundingCoordinates") + _add_text(bounding, "westBoundingCoordinate", f"{min(lons):.6f}") + _add_text(bounding, "eastBoundingCoordinate", f"{max(lons):.6f}") + _add_text(bounding, "northBoundingCoordinate", f"{max(lats):.6f}") + _add_text(bounding, "southBoundingCoordinate", f"{min(lats):.6f}") + + if starts: + temporal = ET.SubElement(coverage, "temporalCoverage") + range_of_dates = 
ET.SubElement(temporal, "rangeOfDates") + begin = ET.SubElement(range_of_dates, "beginDate") + _add_text(begin, "calendarDate", min(starts).date().isoformat()) + end = ET.SubElement(range_of_dates, "endDate") + _add_text(end, "calendarDate", max(ends).date().isoformat()) + + +def _add_methods(dataset): + methods = ET.SubElement(dataset, "methods") + step = ET.SubElement(methods, "methodStep") + description = ET.SubElement(step, "description") + _add_text( + description, + "para", + "Images captured at a fixed interval by an automated camera trap with light attractant. " + "Each image is processed through an ML detector (bounding-box extraction) and an ML " + "classifier (species prediction). Individual detections are aggregated into occurrences " + "by spatiotemporal grouping and assigned a consensus determination.", + ) + sampling = ET.SubElement(methods, "sampling") + study_extent = ET.SubElement(sampling, "studyExtent") + _add_text(study_extent, "description", "See the coverage element for geographic and temporal extent.") + _add_text(sampling, "samplingDescription", "Automated overnight monitoring with continuous image capture.") + qc = ET.SubElement(methods, "qualityControl") + qc_description = ET.SubElement(qc, "description") + _add_text( + qc_description, + "para", + "Project default filters applied before export: score thresholds, include/exclude taxa " + "lists, soft-delete exclusion. Only occurrences with at least one detection are included.", + ) + + +def _add_draft_notice(dataset): + additional = ET.SubElement(dataset, "additionalInfo") + _add_text( + additional, + "para", + "DRAFT SCHEMA (April 2026). This archive is a preview of the Darwin Core Archive format " + "being developed for AMI data. Schema details (terms, extensions, required fields) are " + "subject to change. 
Do not submit to GBIF or other biodiversity aggregators without first " + "confirming the current schema with the AMI team.", + ) diff --git a/ami/exports/dwca/fields.py b/ami/exports/dwca/fields.py new file mode 100644 index 000000000..38a8d8ab6 --- /dev/null +++ b/ami/exports/dwca/fields.py @@ -0,0 +1,285 @@ +"""DwC-A column catalogues. + +Each DwCAField ties a term URI, a TSV header, and a row extractor together so +meta.xml cannot drift from the TSV. Catalogues here: EVENT_FIELDS, +OCCURRENCE_FIELDS, MOF_FIELDS, and MULTIMEDIA_FIELDS. +""" + +from __future__ import annotations + +from collections.abc import Callable +from dataclasses import dataclass +from typing import Any + +from ami.exports.dwca.helpers import ( + DEFAULT_LICENSE, + _format_coord, + _format_datetime, + _format_duration, + _format_event_date, + _format_time, + _get_rank_from_parents, + _get_verification_status, + get_specific_epithet, +) + +DWC = "http://rs.tdwg.org/dwc/terms/" +DC = "http://purl.org/dc/terms/" +ECO = "http://rs.tdwg.org/eco/terms/" + + +def _humboldt_effort_value(event) -> str: + """Sampling effort value: the image count, or an empty string when no captures are recorded.""" + count = getattr(event, "captures_count", None) or 0 + return str(count) if count else "" + + +def _associated_media(occurrence) -> str: + """Pipe-separated distinct public URLs of source captures for this occurrence. + + Ordered by detection timestamp; timestamp-less detections sort last. Uses + prefetched detections + source_image; the exporter ensures the prefetch + chain so this stays O(N) at write time. 
+ """ + import datetime as _dt + + seen: set[str] = set() + urls: list[str] = [] + _far_future = _dt.datetime.max + + def _sort_key(d): + ts = d.timestamp or (d.source_image.timestamp if d.source_image else None) + return ts or _far_future + + detections = sorted(occurrence.detections.all(), key=_sort_key) + for det in detections: + si = det.source_image + if si is None: + continue + url = si.public_url() + if not url or url in seen: + continue + seen.add(url) + urls.append(url) + return "|".join(urls) + + +@dataclass(frozen=True) +class DwCAField: + """A single column mapping in a DwC-A text file. + + Ties together the Darwin Core term URI (written to meta.xml), the + TSV header, and the extractor that produces the cell value from a + model instance. Consolidating all three here makes the field the + unit of test and review, and lets meta.xml be derived from the + same list instead of reconstructed in parallel. + """ + + term: str + header: str + extract: Callable[[Any, str], str] + required: bool = False # GBIF acceptance bar; informational today, validated later. 
+ + +EVENT_FIELDS: list[DwCAField] = [ + DwCAField(DWC + "eventID", "eventID", lambda e, slug: f"urn:ami:event:{slug}:{e.id}", required=True), + DwCAField(DWC + "eventDate", "eventDate", lambda e, slug: _format_event_date(e), required=True), + DwCAField(DWC + "eventTime", "eventTime", lambda e, slug: _format_time(e.start)), + DwCAField(DWC + "year", "year", lambda e, slug: str(e.start.year) if e.start else ""), + DwCAField(DWC + "month", "month", lambda e, slug: str(e.start.month) if e.start else ""), + DwCAField(DWC + "day", "day", lambda e, slug: str(e.start.day) if e.start else ""), + DwCAField(DWC + "samplingProtocol", "samplingProtocol", lambda e, slug: "automated light trap with camera"), + DwCAField(DWC + "sampleSizeValue", "sampleSizeValue", lambda e, slug: str(e.captures_count or 0)), + DwCAField(DWC + "sampleSizeUnit", "sampleSizeUnit", lambda e, slug: "images"), + DwCAField(DWC + "samplingEffort", "samplingEffort", lambda e, slug: _format_duration(e)), + DwCAField(DWC + "locationID", "locationID", lambda e, slug: e.deployment.name if e.deployment else ""), + DwCAField( + DWC + "decimalLatitude", + "decimalLatitude", + lambda e, slug: _format_coord(e.deployment.latitude if e.deployment else None), + required=True, + ), + DwCAField( + DWC + "decimalLongitude", + "decimalLongitude", + lambda e, slug: _format_coord(e.deployment.longitude if e.deployment else None), + required=True, + ), + DwCAField(DWC + "geodeticDatum", "geodeticDatum", lambda e, slug: "WGS84"), + DwCAField(DWC + "datasetName", "datasetName", lambda e, slug: e.project.name if e.project else ""), + DwCAField( + DC + "license", + "license", + lambda e, slug: (e.project.license if e.project else "") or DEFAULT_LICENSE, + ), + DwCAField( + DC + "rightsHolder", "rightsHolder", lambda e, slug: (e.project.rights_holder if e.project else "") or "" + ), + DwCAField(DC + "modified", "modified", lambda e, slug: _format_datetime(e.updated_at)), + # ── Humboldt Extension (eco:) terms flattened onto 
event.txt ── + DwCAField( + ECO + "isSamplingEffortReported", + "isSamplingEffortReported", + lambda e, slug: "true", + ), + DwCAField( + ECO + "samplingEffortValue", + "samplingEffortValue", + lambda e, slug: _humboldt_effort_value(e), + ), + DwCAField( + ECO + "samplingEffortUnit", + "samplingEffortUnit", + lambda e, slug: "images", + ), + DwCAField( + ECO + "samplingEffortProtocol", + "samplingEffortProtocol", + lambda e, slug: ( + "automated camera trap with light attractant; continuous overnight monitoring " + "with fixed image-capture interval; images processed by ML detector + classifier pipeline" + ), + ), + DwCAField( + ECO + "isAbsenceReported", + "isAbsenceReported", + lambda e, slug: "true", + ), + DwCAField( + ECO + "targetTaxonomicScope", + "targetTaxonomicScope", + lambda e, slug: getattr(e, "_target_taxonomic_scope", "") or "", + ), + DwCAField( + ECO + "inventoryTypes", + "inventoryTypes", + lambda e, slug: "trap or sample", + ), + DwCAField( + ECO + "protocolNames", + "protocolNames", + lambda e, slug: "AMI ML detector + classifier pipeline", + ), + DwCAField( + ECO + "protocolDescriptions", + "protocolDescriptions", + lambda e, slug: ( + "Images captured at a fixed interval by an automated monitoring station; each image " + "processed through a detector (bounding-box extraction) and classifier (species " + "prediction). Occurrences grouped from co-located detections; default filters applied." 
+ ), + ), + DwCAField( + ECO + "hasMaterialSamples", + "hasMaterialSamples", + lambda e, slug: "true", + ), + DwCAField( + ECO + "materialSampleTypes", + "materialSampleTypes", + lambda e, slug: "digital images", + ), +] + + +OCCURRENCE_FIELDS: list[DwCAField] = [ + DwCAField( + DWC + "eventID", + "eventID", + lambda o, slug: f"urn:ami:event:{slug}:{o.event_id}" if o.event_id else "", + required=True, + ), + DwCAField( + DWC + "occurrenceID", "occurrenceID", lambda o, slug: f"urn:ami:occurrence:{slug}:{o.id}", required=True + ), + DwCAField(DWC + "basisOfRecord", "basisOfRecord", lambda o, slug: "MachineObservation", required=True), + DwCAField(DWC + "occurrenceStatus", "occurrenceStatus", lambda o, slug: "present"), + DwCAField( + DWC + "scientificName", + "scientificName", + lambda o, slug: o.determination.name if o.determination else "", + required=True, + ), + DwCAField( + DWC + "taxonRank", + "taxonRank", + lambda o, slug: (o.determination.rank.lower() if o.determination and o.determination.rank else ""), + ), + DwCAField(DWC + "kingdom", "kingdom", lambda o, slug: _get_rank_from_parents(o, "KINGDOM")), + DwCAField(DWC + "phylum", "phylum", lambda o, slug: _get_rank_from_parents(o, "PHYLUM")), + DwCAField(DWC + "class", "class", lambda o, slug: _get_rank_from_parents(o, "CLASS")), + DwCAField(DWC + "order", "order", lambda o, slug: _get_rank_from_parents(o, "ORDER")), + DwCAField(DWC + "family", "family", lambda o, slug: _get_rank_from_parents(o, "FAMILY")), + DwCAField(DWC + "genus", "genus", lambda o, slug: _get_rank_from_parents(o, "GENUS")), + DwCAField( + DWC + "specificEpithet", + "specificEpithet", + lambda o, slug: get_specific_epithet(o.determination.name if o.determination else ""), + ), + DwCAField( + DWC + "vernacularName", + "vernacularName", + lambda o, slug: (o.determination.common_name_en or "") if o.determination else "", + ), + DwCAField( + DWC + "taxonID", + "taxonID", + lambda o, slug: ( + str(o.determination.gbif_taxon_key) if 
o.determination and o.determination.gbif_taxon_key else "" + ), + ), + DwCAField(DWC + "individualCount", "individualCount", lambda o, slug: "1"), + DwCAField(DWC + "identifiedBy", "identifiedBy", lambda o, slug: o.get_identified_by()), + DwCAField( + DWC + "dateIdentified", + "dateIdentified", + lambda o, slug: _format_datetime(o.get_identified_date()), + ), + DwCAField( + DWC + "identificationVerificationStatus", + "identificationVerificationStatus", + lambda o, slug: _get_verification_status(o), + ), + DwCAField(DC + "modified", "modified", lambda o, slug: _format_datetime(o.updated_at)), + DwCAField( + DWC + "associatedMedia", + "associatedMedia", + lambda o, slug: _associated_media(o), + ), +] + + +MOF_FIELDS: list[DwCAField] = [ + DwCAField(DWC + "eventID", "eventID", lambda r, slug: r["eventID"], required=True), + DwCAField(DWC + "occurrenceID", "occurrenceID", lambda r, slug: r.get("occurrenceID", "")), + DwCAField(DWC + "measurementID", "measurementID", lambda r, slug: r.get("measurementID", "")), + DwCAField(DWC + "measurementType", "measurementType", lambda r, slug: r["measurementType"], required=True), + DwCAField(DWC + "measurementValue", "measurementValue", lambda r, slug: r.get("measurementValue", "")), + DwCAField(DWC + "measurementUnit", "measurementUnit", lambda r, slug: r.get("measurementUnit", "")), + DwCAField( + DWC + "measurementDeterminedBy", + "measurementDeterminedBy", + lambda r, slug: r.get("measurementDeterminedBy", ""), + ), + DwCAField( + DWC + "measurementRemarks", + "measurementRemarks", + lambda r, slug: r.get("measurementRemarks", ""), + ), +] + + +MULTIMEDIA_FIELDS: list[DwCAField] = [ + DwCAField(DWC + "eventID", "eventID", lambda r, slug: r["eventID"], required=True), + DwCAField(DWC + "occurrenceID", "occurrenceID", lambda r, slug: r.get("occurrenceID", "")), + DwCAField(DC + "type", "type", lambda r, slug: r.get("type", "StillImage")), + DwCAField(DC + "format", "format", lambda r, slug: r.get("format", "image/jpeg")), + 
DwCAField(DC + "identifier", "identifier", lambda r, slug: r["identifier"], required=True), + DwCAField(DC + "references", "references", lambda r, slug: r.get("references", "")), + DwCAField(DC + "created", "created", lambda r, slug: r.get("created", "")), + DwCAField(DC + "license", "license", lambda r, slug: r.get("license", "")), + DwCAField(DC + "rightsHolder", "rightsHolder", lambda r, slug: r.get("rightsHolder", "")), + DwCAField(DC + "creator", "creator", lambda r, slug: r.get("creator", "")), + DwCAField(DC + "description", "description", lambda r, slug: r.get("description", "")), +] diff --git a/ami/exports/dwca/helpers.py b/ami/exports/dwca/helpers.py new file mode 100644 index 000000000..5f0134ca6 --- /dev/null +++ b/ami/exports/dwca/helpers.py @@ -0,0 +1,91 @@ +"""Small pure helpers used by DwC-A field extractors.""" + +from __future__ import annotations + +import datetime +import logging + +logger = logging.getLogger(__name__) + +DEFAULT_LICENSE = "All rights reserved" + + +def _format_event_date(event) -> str: + """Format event date as ISO date or date interval.""" + if not event.start: + return "" + start_date = event.start.date().isoformat() + if event.end and event.end.date() != event.start.date(): + return f"{start_date}/{event.end.date().isoformat()}" + return start_date + + +def _format_time(dt) -> str: + if not dt: + return "" + return dt.strftime("%H:%M:%S") + + +def _format_datetime(dt) -> str: + if not dt: + return "" + if isinstance(dt, datetime.datetime): + return dt.isoformat() + return str(dt) + + +def _format_coord(value) -> str: + if value is None: + return "" + return str(round(value, 6)) + + +def _format_duration(event) -> str: + """Format event duration as human-readable string.""" + if not event.start or not event.end: + return "" + delta = event.end - event.start + total_seconds = int(delta.total_seconds()) + if total_seconds < 0: + return "" + hours, remainder = divmod(total_seconds, 3600) + minutes, _ = divmod(remainder, 60) + 
if hours > 0: + return f"{hours}h {minutes}m" + return f"{minutes}m" + + +def _get_rank_from_parents(occurrence, rank: str) -> str: + """Extract a taxon name at a specific rank from determination.parents_json.""" + if not occurrence.determination: + return "" + parents = occurrence.determination.parents_json + if not parents: + return "" + for parent in parents: + # parents_json contains TaxonParent objects (or dicts with id, name, rank) + parent_rank = parent.rank if hasattr(parent, "rank") else parent.get("rank", "") + # TaxonRank enum values are uppercase strings + parent_rank_str = parent_rank.name if hasattr(parent_rank, "name") else str(parent_rank) + if parent_rank_str.upper() == rank: + return parent.name if hasattr(parent, "name") else parent.get("name", "") + # Also check the determination itself if it matches the requested rank + det_rank = occurrence.determination.rank + if det_rank and det_rank.upper() == rank: + return occurrence.determination.name + return "" + + +def get_specific_epithet(name: str) -> str: + """Extract the specific epithet (second word) from a binomial name.""" + parts = name.split() + if len(parts) >= 2: + return parts[1] + return "" + + +def _get_verification_status(occurrence) -> str: + """Return "verified" when a non-withdrawn human Identification exists, else "unverified".""" + if hasattr(occurrence, "_prefetched_objects_cache") and "identifications" in occurrence._prefetched_objects_cache: + return "verified" if any(not i.withdrawn for i in occurrence.identifications.all()) else "unverified" + return "verified" if occurrence.identifications.filter(withdrawn=False).exists() else "unverified" diff --git a/ami/exports/dwca/meta.py b/ami/exports/dwca/meta.py new file mode 100644 index 000000000..4725e2e38 --- /dev/null +++ b/ami/exports/dwca/meta.py @@ -0,0 +1,78 @@ +"""Generate the DwC-A descriptor (meta.xml). + +meta.xml is derived from the field catalogues so TSV columns cannot drift from +declared term URIs. 
The core/extension list is passed in so the caller composes +the archive shape. +""" + +from __future__ import annotations + +from xml.etree import ElementTree as ET + +from ami.exports.dwca.fields import DwCAField + + +def generate_meta_xml(tables: list[dict]) -> str: + """Build meta.xml from a list of table descriptors. + + Each descriptor is a dict: + { + "role": "core" | "extension", + "row_type": <rowType URI>, + "filename": "event.txt", + "fields": list[DwCAField], + } + + The first descriptor must have role="core"; remaining are extensions. + """ + if not tables or tables[0]["role"] != "core": + raise ValueError("First table must be the core (role='core')") + + archive = ET.Element("archive") + archive.set("xmlns", "http://rs.tdwg.org/dwc/text/") + archive.set("metadata", "eml.xml") + + for table in tables: + tag = table["role"] + _append_table( + archive, + tag=tag, + row_type=table["row_type"], + filename=table["filename"], + fields=table["fields"], + id_tag="id" if tag == "core" else "coreid", + ) + + ET.indent(archive, space=" ") + xml_str = ET.tostring(archive, encoding="unicode", xml_declaration=False) + return '<?xml version="1.0" encoding="UTF-8"?>\n' + xml_str + "\n" + + +def _append_table( + archive: ET.Element, + *, + tag: str, + row_type: str, + filename: str, + fields: list[DwCAField], + id_tag: str, +) -> None: + table = ET.SubElement(archive, tag) + table.set("rowType", row_type) + table.set("encoding", "UTF-8") + table.set("fieldsTerminatedBy", "\\t") + table.set("linesTerminatedBy", "\\n") + table.set("fieldsEnclosedBy", '"') + table.set("ignoreHeaderLines", "1") + + files = ET.SubElement(table, "files") + location = ET.SubElement(files, "location") + location.text = filename + + id_elem = ET.SubElement(table, id_tag) + id_elem.set("index", "0") + + for i, field in enumerate(fields): + field_elem = ET.SubElement(table, "field") + field_elem.set("index", str(i)) + field_elem.set("term", field.term) diff --git a/ami/exports/dwca/rows.py b/ami/exports/dwca/rows.py new file mode 100644 index 
000000000..57033af50 --- /dev/null +++ b/ami/exports/dwca/rows.py @@ -0,0 +1,181 @@ +"""Row generators for DwC-A extension TSVs (multimedia, measurementorfact). + +These generators yield plain dicts so the existing write_tsv + DwCAField +pattern handles both query-backed tables and computed row streams uniformly. +""" + +from __future__ import annotations + +import json + +from ami.exports.dwca.helpers import DEFAULT_LICENSE, _format_datetime + + +def _event_id(event, slug: str) -> str: + return f"urn:ami:event:{slug}:{event.id}" + + +def _occurrence_id(occurrence, slug: str) -> str: + return f"urn:ami:occurrence:{slug}:{occurrence.id}" + + +def iter_multimedia_rows(events_qs, occurrences, project_slug: str): + """Yield dicts for multimedia.txt rows. + + Two row types: + - Capture row: one per SourceImage linked to >=1 occurrence in filter set. + occurrenceID is blank; identifier is the capture URL. + - Crop row: one per Detection whose occurrence is in filter set + AND which has a usable crop URL. occurrenceID populated; + references = source capture URL. + + `occurrences` is an already-materialized sequence with detections and + source_image prefetched; no DB query is issued here beyond what `events_qs` + triggers. + """ + events_list = list(events_qs) + license_value = _project_license(events_list) + rights_holder = _project_rights_holder(events_list) + + occurrences_by_event: dict[int, list] = {} + for occ in occurrences: + if occ.event_id is None: + continue + occurrences_by_event.setdefault(occ.event_id, []).append(occ) + + for event in events_list: + eid = _event_id(event, project_slug) + occurrences_for_event = occurrences_by_event.get(event.id, []) + + # Deduplicate capture images across all occurrences in this event. 
+ seen_captures: set[int] = set() + for occ in occurrences_for_event: + for det in occ.detections.all(): + si = det.source_image + if si is None or si.id in seen_captures: + continue + seen_captures.add(si.id) + capture_url = si.public_url() + if not capture_url: + continue + yield { + "eventID": eid, + "occurrenceID": "", + "type": "StillImage", + "format": "image/jpeg", + "identifier": capture_url, + "references": "", + "created": _format_datetime(si.timestamp), + "license": license_value, + "rightsHolder": rights_holder, + "creator": "", + "description": "Source capture image from automated monitoring station", + } + + # Detection crop rows. + for occ in occurrences_for_event: + occ_urn = _occurrence_id(occ, project_slug) + for det in occ.detections.all(): + crop_url = det.url() if hasattr(det, "url") else None + if not crop_url: + continue + si = det.source_image + capture_url = si.public_url() if si else "" + created_ts = getattr(det, "timestamp", None) or (si.timestamp if si else None) + yield { + "eventID": eid, + "occurrenceID": occ_urn, + "type": "StillImage", + "format": "image/jpeg", + "identifier": crop_url, + "references": capture_url, + "created": _format_datetime(created_ts), + "license": license_value, + "rightsHolder": rights_holder, + "creator": "", + "description": "Cropped detection from source capture", + } + + +def iter_mof_rows(occurrences, project_slug: str): + """Yield dicts for measurementorfact.txt rows. + + Per occurrence: + - classificationScore (value = occurrence.determination_score, unit = proportion) + + Per detection: + - detectionScore (value = detection.detection_score, unit = proportion) + - boundingBox (value = JSON [x1,y1,x2,y2], unit = pixels) + + `occurrences` is an already-materialized sequence with determination/event + and detection prefetches; no DB query is issued. 
+ """ + for occ in occurrences: + eid = _event_id(occ.event, project_slug) if occ.event_id else "" + occ_urn = _occurrence_id(occ, project_slug) + if eid and occ.determination_score is not None: + yield { + "eventID": eid, + "occurrenceID": occ_urn, + "measurementID": f"{occ_urn}:classificationScore", + "measurementType": "classificationScore", + "measurementValue": f"{occ.determination_score:.6f}", + "measurementUnit": "proportion", + "measurementDeterminedBy": _classifier_name(occ), + "measurementRemarks": "ML classifier softmax score", + } + for det in occ.detections.all(): + det_urn = f"urn:ami:detection:{project_slug}:{det.id}" + if det.detection_score is not None: + yield { + "eventID": eid, + "occurrenceID": occ_urn, + "measurementID": f"{det_urn}:detectionScore", + "measurementType": "detectionScore", + "measurementValue": f"{det.detection_score:.6f}", + "measurementUnit": "proportion", + "measurementDeterminedBy": det.detection_algorithm.name if det.detection_algorithm else "", + "measurementRemarks": "ML detector confidence score", + } + if det.bbox: + yield { + "eventID": eid, + "occurrenceID": occ_urn, + "measurementID": f"{det_urn}:boundingBox", + "measurementType": "boundingBox", + "measurementValue": json.dumps(det.bbox), + "measurementUnit": "pixels", + "measurementDeterminedBy": det.detection_algorithm.name if det.detection_algorithm else "", + "measurementRemarks": "Bounding box [x1, y1, x2, y2]", + } + + +def _classifier_name(occurrence) -> str: + """Best-effort: name + version of the classifier that produced this determination.""" + best = None + for det in occurrence.detections.all(): + for cls in det.classifications.all(): + if cls.taxon_id == occurrence.determination_id: + best = cls + break + if best: + break + if best and best.algorithm: + name = best.algorithm.name or "" + version = getattr(best.algorithm, "version", "") or "" + return f"{name} {version}".strip() + return "" + + +def _project_license(events) -> str: + for e in events: + if 
e.project and getattr(e.project, "license", ""): + return e.project.license + return DEFAULT_LICENSE + + +def _project_rights_holder(events) -> str: + for e in events: + if e.project and getattr(e.project, "rights_holder", ""): + return e.project.rights_holder + return "" diff --git a/ami/exports/dwca/targetscope.py b/ami/exports/dwca/targetscope.py new file mode 100644 index 000000000..b2daf7fee --- /dev/null +++ b/ami/exports/dwca/targetscope.py @@ -0,0 +1,41 @@ +"""Derive eco:targetTaxonomicScope from a project's include-taxa filter. + +The scope is the lowest common ancestor (LCA) across all taxa in +Project.default_filters_include_taxa. Empty include-list -> empty string +(meta.xml still declares the column; EML notes the gap). + +This is the v1 sourcing strategy. v2 will move to a per-Site TaxaList so each +deployment can declare its own expected species pool (the groundwork for +per-taxon absence occurrence rows). +""" + +from __future__ import annotations + + +def derive_target_taxonomic_scope(project) -> str: + """Return the name of the LCA of the project's include-taxa filter. + + `parents_json` on each Taxon is ordered root-to-leaf (kingdom first). + The LCA is the deepest (longest) common prefix of the + `parents_json + [self]` chains across all selected taxa. 
+ """ + taxa = list(project.default_filters_include_taxa.all()) + if not taxa: + return "" + + def ancestry(t) -> list[tuple[int, str]]: + chain: list[tuple[int, str]] = [(p.id, p.name) for p in (t.parents_json or [])] + chain.append((t.id, t.name)) + return chain + + chains = [ancestry(t) for t in taxa] + if any(not c for c in chains): + return "" + + lca_name = "" + for position in zip(*chains): + ids = {entry[0] for entry in position} + if len(ids) != 1: + break + lca_name = position[0][1] + return lca_name diff --git a/ami/exports/dwca/tsv.py b/ami/exports/dwca/tsv.py new file mode 100644 index 000000000..089357720 --- /dev/null +++ b/ami/exports/dwca/tsv.py @@ -0,0 +1,37 @@ +"""TSV writing for DwC-A text files.""" + +from __future__ import annotations + +import csv + +from ami.exports.dwca.fields import DwCAField + + +def write_tsv( + filepath: str, + fields: list[DwCAField], + source, + project_slug: str, + progress_callback=None, +): + """Write a tab-delimited file from a queryset or any iterable of row objects. + + Returns the number of records written. A row object can be a model instance + or a plain dict — each field's extract callable handles attribute access. 
+ """ + headers = [f.header for f in fields] + records_written = 0 + iterator = source.iterator(chunk_size=500) if hasattr(source, "iterator") else iter(source) + + with open(filepath, "w", encoding="utf-8", newline="") as f: + writer = csv.writer(f, delimiter="\t", quoting=csv.QUOTE_MINIMAL, lineterminator="\n") + writer.writerow(headers) + + for obj in iterator: + row = [field.extract(obj, project_slug) for field in fields] + writer.writerow(row) + records_written += 1 + if progress_callback and records_written % 500 == 0: + progress_callback(records_written) + + return records_written diff --git a/ami/exports/dwca/zip.py b/ami/exports/dwca/zip.py new file mode 100644 index 000000000..bc6a5a5b5 --- /dev/null +++ b/ami/exports/dwca/zip.py @@ -0,0 +1,22 @@ +"""Package DwC-A files into a single ZIP.""" + +from __future__ import annotations + +import tempfile +import zipfile + + +def create_dwca_zip(files: dict[str, str], meta_xml: str, eml_xml: str) -> str: + """Build the archive. + + `files` maps archive-internal name -> source tempfile path. + Returns the path to the new ZIP. + """ + temp_zip = tempfile.NamedTemporaryFile(delete=False, suffix=".zip") + temp_zip.close() + with zipfile.ZipFile(temp_zip.name, "w", zipfile.ZIP_DEFLATED) as zf: + for archive_name, source_path in files.items(): + zf.write(source_path, archive_name) + zf.writestr("meta.xml", meta_xml) + zf.writestr("eml.xml", eml_xml) + return temp_zip.name diff --git a/ami/exports/dwca_validator.py b/ami/exports/dwca_validator.py new file mode 100644 index 000000000..f11d39a0f --- /dev/null +++ b/ami/exports/dwca_validator.py @@ -0,0 +1,325 @@ +""" +Offline structural validator for Darwin Core Archives produced by this app. 
+ +Checks the invariants that tests and code review can't catch reliably: +zip contents, meta.xml parses and references files that actually exist, +column count matches meta.xml field declarations, core ids are unique, +every extension coreid resolves to a core id, required columns are +populated on every row, and eml.xml parses. + +This is not a GBIF-compliance validator — those concerns (vocabularies, +geographic coverage, taxonomic backbone matching) require the official +GBIF validator. This catches the class of bug where meta.xml and the +TSVs drift apart, which is historically where DwC-A producers break. +""" + +from __future__ import annotations + +import csv +import io +import zipfile +from dataclasses import dataclass, field +from xml.etree.ElementTree import Element as _XmlElement + +from defusedxml import ElementTree as ET +from defusedxml.common import DefusedXmlException + +META_NS = "http://rs.tdwg.org/dwc/text/" +_OCCURRENCE_ID_TERM = "http://rs.tdwg.org/dwc/terms/occurrenceID" + + +@dataclass +class ValidationResult: + errors: list[str] = field(default_factory=list) + warnings: list[str] = field(default_factory=list) + + @property + def ok(self) -> bool: + return not self.errors + + def add_error(self, msg: str) -> None: + self.errors.append(msg) + + def add_warning(self, msg: str) -> None: + self.warnings.append(msg) + + +@dataclass +class _TableSpec: + role: str # "core" or "extension" + row_type: str + filename: str + id_index: int # <id> for core, <coreid> for extension + field_terms: dict[int, str] # index -> term URI + required: bool = False + + +def validate_dwca_zip(zip_path: str, required_terms: set[str] | None = None) -> ValidationResult: + """Validate a DwC-A zip structurally. + + `required_terms` is an optional set of DwC term URIs that must be + present AND non-empty on every row of the table that declares them. 
+ When omitted, the validator only checks structural invariants + (parseable meta.xml, coreid referential integrity, consistent + column counts, unique core ids). + """ + result = ValidationResult() + + if not zipfile.is_zipfile(zip_path): + result.add_error(f"Not a zip file: {zip_path}") + return result + + with zipfile.ZipFile(zip_path, "r") as zf: + names = set(zf.namelist()) + + for required in ("meta.xml", "eml.xml"): + if required not in names: + result.add_error(f"Archive missing required file: {required}") + + if "meta.xml" not in names: + return result + + meta_bytes = zf.read("meta.xml") + try: + meta_root = ET.fromstring(meta_bytes) + except (ET.ParseError, DefusedXmlException) as exc: + result.add_error(f"meta.xml does not parse: {exc}") + return result + + tables = _parse_meta(meta_root, result) + if not tables: + return result + + core_ids = _validate_core(zf, tables[0], result, required_terms or set()) + + for ext in tables[1:]: + _validate_extension(zf, ext, core_ids, result, required_terms or set()) + + occurrence_ids = _collect_occurrence_ids(zf, tables) + for ext in tables[1:]: + if ext.filename == "occurrence.txt": + continue + _validate_occurrence_id_references(zf, ext, occurrence_ids, result) + + if "eml.xml" in names: + try: + ET.fromstring(zf.read("eml.xml")) + except (ET.ParseError, DefusedXmlException) as exc: + result.add_error(f"eml.xml does not parse: {exc}") + + return result + + +def _parse_meta(meta_root: _XmlElement, result: ValidationResult) -> list[_TableSpec]: + tables: list[_TableSpec] = [] + + # meta.xml uses the dwc/text namespace; handle both namespaced and + # unqualified tags so we don't choke on hand-written meta files. 
+ def strip_ns(tag: str) -> str: + return tag.split("}", 1)[-1] if "}" in tag else tag + + core_elems = [child for child in meta_root if strip_ns(child.tag) == "core"] + ext_elems = [child for child in meta_root if strip_ns(child.tag) == "extension"] + + if len(core_elems) != 1: + result.add_error(f"meta.xml must declare exactly one <core>, found {len(core_elems)}") + return [] + + for elem in [core_elems[0], *ext_elems]: + role = strip_ns(elem.tag) + id_tag = "id" if role == "core" else "coreid" + row_type = elem.get("rowType", "") + + files = [c for c in elem if strip_ns(c.tag) == "files"] + location = "" + if files: + for loc in files[0]: + if strip_ns(loc.tag) == "location" and loc.text: + location = loc.text.strip() + break + if not location: + result.add_error(f"meta.xml {role} missing <files>/<location>") + continue + + id_elems = [c for c in elem if strip_ns(c.tag) == id_tag] + if len(id_elems) != 1: + result.add_error(f"meta.xml {role} ({location}) must declare exactly one <{id_tag}>") + continue + try: + id_index = int(id_elems[0].get("index", "")) + if id_index < 0: + raise ValueError + except ValueError: + result.add_error(f"meta.xml {role} ({location}) <{id_tag}> index is not a non-negative integer") + continue + + field_terms: dict[int, str] = {} + for fld in elem: + if strip_ns(fld.tag) != "field": + continue + term = fld.get("term", "") + idx_raw = fld.get("index") + if idx_raw is None or not term: + result.add_error(f"meta.xml {role} ({location}) missing index or term") + continue + try: + idx = int(idx_raw) + if idx < 0: + raise ValueError + except ValueError: + result.add_error( + f"meta.xml {role} ({location}) index is not a non-negative integer: {idx_raw}" + ) + continue + if idx in field_terms: + result.add_error( + f"meta.xml {role} ({location}) duplicate index {idx}: " f"{field_terms[idx]} and {term}" + ) + field_terms[idx] = term + + tables.append( + _TableSpec( + role=role, + row_type=row_type, + filename=location, + id_index=id_index, + field_terms=field_terms, 
+ ) + ) + + return tables + + +def _collect_occurrence_ids(zf: zipfile.ZipFile, tables: list[_TableSpec]) -> set[str]: + """Return the set of occurrenceID values declared in occurrence.txt.""" + for t in tables: + if t.filename != "occurrence.txt": + continue + rows = _read_tsv(zf, t.filename, ValidationResult()) + if rows is None: + return set() + occ_col = None + for idx, term in t.field_terms.items(): + if term == _OCCURRENCE_ID_TERM: + occ_col = idx + break + if occ_col is None: + return set() + return {row[occ_col].strip() for row in rows[1:] if occ_col < len(row) and row[occ_col].strip()} + return set() + + +def _validate_occurrence_id_references( + zf: zipfile.ZipFile, + ext: _TableSpec, + occurrence_ids: set[str], + result: ValidationResult, +) -> None: + """Any extension declaring dwc:occurrenceID must only carry values from occurrence.txt.""" + occ_col = None + for idx, term in ext.field_terms.items(): + if term == _OCCURRENCE_ID_TERM: + occ_col = idx + break + if occ_col is None: + return + rows = _read_tsv(zf, ext.filename, result) + if rows is None: + return + missing: set[str] = set() + for row in rows[1:]: + if occ_col >= len(row): + continue + val = row[occ_col].strip() + if val and val not in occurrence_ids: + missing.add(val) + if missing: + sample = sorted(missing)[:5] + result.add_error( + f"{ext.filename}: {len(missing)} occurrenceID value(s) do not exist in occurrence.txt. 
" f"First: {sample}" + ) + + +def _read_tsv(zf: zipfile.ZipFile, filename: str, result: ValidationResult) -> list[list[str]] | None: + if filename not in zf.namelist(): + result.add_error(f"meta.xml references {filename} but it is missing from the archive") + return None + try: + raw = zf.read(filename).decode("utf-8") + except UnicodeDecodeError as exc: + result.add_error(f"{filename}: file is not valid UTF-8: {exc}") + return None + reader = csv.reader(io.StringIO(raw), delimiter="\t", quoting=csv.QUOTE_MINIMAL) + return list(reader) + + +def _validate_core( + zf: zipfile.ZipFile, core: _TableSpec, result: ValidationResult, required_terms: set[str] +) -> set[str]: + rows = _read_tsv(zf, core.filename, result) + if rows is None: + return set() + return _validate_table(rows, core, result, required_terms, collect_ids=True) + + +def _validate_extension( + zf: zipfile.ZipFile, + ext: _TableSpec, + core_ids: set[str], + result: ValidationResult, + required_terms: set[str], +) -> None: + rows = _read_tsv(zf, ext.filename, result) + if rows is None: + return + ext_ids = _validate_table(rows, ext, result, required_terms, collect_ids=False) + # coreid referential integrity + missing = ext_ids - core_ids + if missing: + sample = sorted(missing)[:5] + result.add_error(f"{ext.filename}: {len(missing)} coreid value(s) do not exist in core. 
" f"First: {sample}") + + +def _validate_table( + rows: list[list[str]], + spec: _TableSpec, + result: ValidationResult, + required_terms: set[str], + collect_ids: bool, +) -> set[str]: + if not rows: + result.add_error(f"{spec.filename}: file is empty (not even a header)") + return set() + + header = rows[0] + data_rows = rows[1:] + expected_cols = max([spec.id_index, *spec.field_terms.keys()], default=-1) + 1 + if len(header) != expected_cols: + result.add_error(f"{spec.filename}: header has {len(header)} columns but meta.xml declares {expected_cols}") + + required_indices = [i for i, term in spec.field_terms.items() if term in required_terms] + + seen_ids: set[str] = set() + ids: set[str] = set() + for row_num, row in enumerate(data_rows, start=2): # 1-based + header + if len(row) != len(header): + result.add_error(f"{spec.filename}:L{row_num}: row has {len(row)} columns, " f"expected {len(header)}") + continue + + id_value = row[spec.id_index].strip() if spec.id_index < len(row) else "" + if not id_value: + result.add_error(f"{spec.filename}:L{row_num}: empty id/coreid value") + continue + + if collect_ids: + if id_value in seen_ids: + result.add_error(f"{spec.filename}:L{row_num}: duplicate core id: {id_value!r}") + seen_ids.add(id_value) + ids.add(id_value) + + for idx in required_indices: + if idx >= len(row) or not row[idx].strip(): + term = spec.field_terms[idx] + result.add_error(f"{spec.filename}:L{row_num}: required term {term} is empty") + + return ids diff --git a/ami/exports/format_types.py b/ami/exports/format_types.py index 9c57b5c25..5ebc058e9 100644 --- a/ami/exports/format_types.py +++ b/ami/exports/format_types.py @@ -1,6 +1,7 @@ import csv import json import logging +import os import tempfile from django.core.serializers.json import DjangoJSONEncoder @@ -247,3 +248,225 @@ def export(self): self.update_job_progress(records_exported) self.update_export_stats(file_temp_path=temp_file.name) return temp_file.name # Return the file path + + +def 
_append_validation_report_to_zip(zip_path, validation) -> None: + """Append a human-readable VALIDATION_ERRORS.txt to a failed DwC-A archive. + + The archive is left on disk so it can be persisted to storage for the user to + download and inspect. The exporter still raises ValueError afterwards so the + DataExport is marked failed. + """ + import zipfile + + lines = ["DwC-A archive failed structural validation.", ""] + lines.append(f"Errors ({len(validation.errors)}):") + lines.extend(f" - {e}" for e in validation.errors) + if validation.warnings: + lines.append("") + lines.append(f"Warnings ({len(validation.warnings)}):") + lines.extend(f" - {w}" for w in validation.warnings) + lines.append("") + body = "\n".join(lines).encode("utf-8") + + with zipfile.ZipFile(zip_path, "a", compression=zipfile.ZIP_DEFLATED) as zf: + zf.writestr("VALIDATION_ERRORS.txt", body) + + +class DwCAExporter(BaseExporter): + """Handles Darwin Core Archive (DwC-A) export with Event Core and Occurrence Extension.""" + + file_format = "zip" + filename_label = "dwca_draft-2026-04" + + DWCA_MAX_OCCURRENCES = 100_000 + + def get_queryset(self): + """Return the occurrence queryset (used by BaseExporter for record count). + + Applies the project's default filters (score threshold, include/exclude taxa). + Low-confidence ML output is gated here to avoid publishing unreviewed + classifications to downstream consumers (e.g. GBIF). + + Prefetches cover every reader downstream (occurrence.txt, multimedia.txt, + measurementorfact.txt) so the queryset can be materialized to a list once + in `export()` and reused without extra DB passes. 
+ """ + return ( + Occurrence.objects.valid() # type: ignore[union-attr] + .filter(project=self.project, event__isnull=False, determination__isnull=False) + .apply_default_filters(self.project) # type: ignore[union-attr] + .select_related( + "determination", + "event", + "deployment", + ) + .prefetch_related( + "detections__source_image", + "detections__detection_algorithm", + "detections__classifications__algorithm", + ) + .with_detections_count() + .with_identifications() + ) + + def get_events_queryset(self): + from ami.main.models import Event + + event_ids = self.queryset.values_list("event_id", flat=True).distinct() + return Event.objects.filter( + project=self.project, + id__in=event_ids, + ).select_related( + "deployment", + "project", + ) + + def export(self): + """Export project data as a Darwin Core Archive ZIP.""" + from django.utils.text import slugify + + from ami.exports.dwca import ( + EVENT_FIELDS, + MOF_FIELDS, + MULTIMEDIA_FIELDS, + OCCURRENCE_FIELDS, + create_dwca_zip, + generate_eml_xml, + generate_meta_xml, + write_tsv, + ) + from ami.exports.dwca.rows import iter_mof_rows, iter_multimedia_rows + from ami.exports.dwca.targetscope import derive_target_taxonomic_scope + + project_slug = slugify(self.project.name) + + def _tmp_txt(): + tf = tempfile.NamedTemporaryFile(delete=False, suffix=".txt", mode="w", encoding="utf-8") + tf.close() + return tf.name + + event_path = _tmp_txt() + occ_path = _tmp_txt() + multimedia_path = _tmp_txt() + mof_path = _tmp_txt() + + try: + if self.total_records > self.DWCA_MAX_OCCURRENCES: + raise ValueError( + f"DwC-A export refused: project has {self.total_records} occurrences, " + f"hard cap is {self.DWCA_MAX_OCCURRENCES}. The current exporter materializes " + f"the queryset in memory; streaming fan-out is planned as a follow-up." 
+ ) + + events_qs = self.get_events_queryset() + events_list = list(events_qs) + target_scope = derive_target_taxonomic_scope(self.project) + for e in events_list: + e._target_taxonomic_scope = target_scope + + # Materialize the occurrence queryset once with all prefetches in place + # so all three extension writers iterate the same in-memory list. + occurrences_list = list(self.queryset) + + event_count = write_tsv(event_path, EVENT_FIELDS, events_list, project_slug) + logger.info(f"DwC-A: wrote {event_count} events") + + occ_count = write_tsv( + occ_path, + OCCURRENCE_FIELDS, + occurrences_list, + project_slug, + progress_callback=self.update_job_progress, + ) + logger.info(f"DwC-A: wrote {occ_count} occurrences") + + mm_count = write_tsv( + multimedia_path, + MULTIMEDIA_FIELDS, + iter_multimedia_rows(events_list, occurrences_list, project_slug), + project_slug, + ) + logger.info(f"DwC-A: wrote {mm_count} multimedia rows") + + mof_count = write_tsv( + mof_path, + MOF_FIELDS, + iter_mof_rows(occurrences_list, project_slug), + project_slug, + ) + logger.info(f"DwC-A: wrote {mof_count} measurementOrFact rows") + + if self.total_records: + self.update_job_progress(occ_count) + + meta_xml = generate_meta_xml( + [ + { + "role": "core", + "row_type": "http://rs.tdwg.org/dwc/terms/Event", + "filename": "event.txt", + "fields": EVENT_FIELDS, + }, + { + "role": "extension", + "row_type": "http://rs.tdwg.org/dwc/terms/Occurrence", + "filename": "occurrence.txt", + "fields": OCCURRENCE_FIELDS, + }, + { + "role": "extension", + "row_type": "http://rs.gbif.org/terms/1.0/Multimedia", + "filename": "multimedia.txt", + "fields": MULTIMEDIA_FIELDS, + }, + { + "role": "extension", + "row_type": "http://rs.gbif.org/terms/1.0/MeasurementOrFact", + "filename": "measurementorfact.txt", + "fields": MOF_FIELDS, + }, + ] + ) + eml_xml = generate_eml_xml(self.project, events_list) + + zip_path = create_dwca_zip( + { + "event.txt": event_path, + "occurrence.txt": occ_path, + 
"multimedia.txt": multimedia_path, + "measurementorfact.txt": mof_path, + }, + meta_xml, + eml_xml, + ) + + from ami.exports.dwca_validator import validate_dwca_zip + + validation = validate_dwca_zip(zip_path) + for warning in validation.warnings: + logger.warning(f"DwC-A validation warning: {warning}") + if not validation.ok: + for err in validation.errors: + logger.error(f"DwC-A validation error: {err}") + _append_validation_report_to_zip(zip_path, validation) + try: + file_url = self.data_export.save_export_file(zip_path) + self.data_export.file_url = file_url + self.data_export.save(update_fields=["file_url"]) + except OSError as exc: + logger.error(f"Could not persist failed DwC-A archive for inspection: {exc}") + raise ValueError( + f"DwC-A archive failed structural validation ({len(validation.errors)} errors). " + f"First: {validation.errors[0]}. " + f"See VALIDATION_ERRORS.txt inside the archive for the full report." + ) + + self.update_export_stats(file_temp_path=zip_path) + return zip_path + finally: + for path in (event_path, occ_path, multimedia_path, mof_path): + try: + os.unlink(path) + except OSError: + pass diff --git a/ami/exports/management/__init__.py b/ami/exports/management/__init__.py new file mode 100644 index 000000000..e69de29bb diff --git a/ami/exports/management/commands/__init__.py b/ami/exports/management/commands/__init__.py new file mode 100644 index 000000000..e69de29bb diff --git a/ami/exports/management/commands/validate_dwca_archive.py b/ami/exports/management/commands/validate_dwca_archive.py new file mode 100644 index 000000000..ca103247d --- /dev/null +++ b/ami/exports/management/commands/validate_dwca_archive.py @@ -0,0 +1,46 @@ +"""Offline structural validator for DwC-A zips. + +Run against a fresh export before shipping to GBIF to catch drift between +meta.xml and the TSVs. Required-term enforcement defaults to the terms +this app marks as required in its field catalogue; pass --no-required +to do only structural checks. 
+""" + +from __future__ import annotations + +from django.core.management.base import BaseCommand, CommandError + +from ami.exports.dwca import EVENT_FIELDS, OCCURRENCE_FIELDS +from ami.exports.dwca_validator import validate_dwca_zip + + +class Command(BaseCommand): + help = "Validate a Darwin Core Archive zip against structural invariants." + + def add_arguments(self, parser): + parser.add_argument("archive", help="Path to the DwC-A zip to validate") + parser.add_argument( + "--no-required", + action="store_true", + help="Skip required-field checks (structural only)", + ) + + def handle(self, *args, **options): + archive = options["archive"] + required_terms: set[str] = set() + if not options["no_required"]: + required_terms = {f.term for f in EVENT_FIELDS if f.required} + required_terms |= {f.term for f in OCCURRENCE_FIELDS if f.required} + + result = validate_dwca_zip(archive, required_terms=required_terms) + + for warn in result.warnings: + self.stdout.write(self.style.WARNING(f"WARN: {warn}")) + for err in result.errors: + self.stdout.write(self.style.ERROR(f"FAIL: {err}")) + + if result.ok: + self.stdout.write(self.style.SUCCESS(f"OK: {archive} passes structural validation")) + return + + raise CommandError(f"Validation failed with {len(result.errors)} error(s)") diff --git a/ami/exports/models.py b/ami/exports/models.py index 073a396da..dc47c46e8 100644 --- a/ami/exports/models.py +++ b/ami/exports/models.py @@ -72,8 +72,10 @@ def generate_filename(self): registry = ExportRegistry.get_exporter(self.format) assert registry, f"Export format '{self.format}' not found in registry" extension = registry.file_format - project_slug = slugify(self.project.name) # Convert project name to a slug - return f"{project_slug}_export-{self.pk}.{extension}" + project_slug = slugify(self.project.name) + label = getattr(registry, "filename_label", "") + stem = f"{project_slug}_{label}_export-{self.pk}" if label else f"{project_slug}_export-{self.pk}" + return 
f"{stem}.{extension}" def save_export_file(self, file_temp_path): """ diff --git a/ami/exports/registry.py b/ami/exports/registry.py index 29a4cc0e7..695ded051 100644 --- a/ami/exports/registry.py +++ b/ami/exports/registry.py @@ -27,3 +27,4 @@ def get_supported_formats(cls): ExportRegistry.register("occurrences_api_json")(format_types.JSONExporter) ExportRegistry.register("occurrences_simple_csv")(format_types.CSVExporter) +ExportRegistry.register("dwca")(format_types.DwCAExporter) diff --git a/ami/exports/tests.py b/ami/exports/tests.py index 866b1af61..fe2e00b38 100644 --- a/ami/exports/tests.py +++ b/ami/exports/tests.py @@ -1,6 +1,9 @@ import csv import json import logging +import zipfile +from io import StringIO +from xml.etree import ElementTree as ET from django.core.files.base import ContentFile from django.core.files.storage import default_storage @@ -305,6 +308,736 @@ def test_non_member_cannot_create_export(self): ) +class DwCAExportHttpE2ETest(TestCase): + """End-to-end DwC-A export via the HTTP API. + + Exercises the full path: POST /api/v2/exports/ -> permission check -> serializer + validation -> ExportViewSet.create -> Job.enqueue -> eager Celery task -> + DwCAExporter.export -> validator -> zip on storage -> DataExport.file_url set. + + Uses captureOnCommitCallbacks(execute=True) so `transaction.on_commit` + callbacks (scheduled inside Job.enqueue to dispatch the Celery task) fire + at the end of the context manager without needing TransactionTestCase — + which would truncate tables and disrupt fixture state for later tests. 
+ """ + + @classmethod + def setUpClass(cls): + super().setUpClass() + from django.test import override_settings + + cls._eager_override = override_settings( + CELERY_TASK_ALWAYS_EAGER=True, + CELERY_TASK_EAGER_PROPAGATES=True, + ) + cls._eager_override.enable() + + @classmethod + def tearDownClass(cls): + cls._eager_override.disable() + super().tearDownClass() + + def setUp(self): + self.project, self.deployment = setup_test_project(reuse=False) + self.user = self.project.owner + self.client = APIClient() + self.client.force_authenticate(user=self.user) + create_captures(deployment=self.deployment, num_nights=2, images_per_night=3, interval_minutes=1) + group_images_into_events(self.deployment) + create_taxa(self.project) + create_occurrences(num=6, deployment=self.deployment) + + images = list(self.project.captures.all()[:4]) + self.collection = SourceImageCollection.objects.create( + name="E2E DwC-A Collection", + project=self.project, + method="manual", + kwargs={"image_ids": [img.pk for img in images]}, + ) + self.collection.populate_sample() + + def test_dwca_export_end_to_end_via_http(self): + """POST /api/v2/exports/ with format=dwca triggers the full pipeline.""" + with self.captureOnCommitCallbacks(execute=True): + response = self.client.post( + "/api/v2/exports/", + data={ + "project": self.project.pk, + "format": "dwca", + "filters": {"collection_id": self.collection.pk}, + }, + format="json", + ) + self.assertEqual(response.status_code, 201, f"POST failed: {response.status_code} {response.content[:400]}") + export_id = response.data["id"] + + data_export = DataExport.objects.get(pk=export_id) + self.assertTrue(data_export.file_url, "DataExport.file_url should be populated by eager Celery task") + self.assertTrue(data_export.file_url.endswith(".zip")) + self.assertIn("dwca_draft-2026-04", data_export.file_url, "Draft filename label should be present") + self.assertGreater(data_export.record_count, 0, "record_count should reflect exported occurrences") + 
+ file_path = data_export.file_url.replace("/media/", "") + self.assertTrue(default_storage.exists(file_path), f"Storage missing: {file_path}") + + with default_storage.open(file_path, "rb") as f: + self.assertTrue(zipfile.is_zipfile(f)) + f.seek(0) + with zipfile.ZipFile(f, "r") as zf: + names = set(zf.namelist()) + self.assertIn("event.txt", names) + self.assertIn("occurrence.txt", names) + self.assertIn("multimedia.txt", names) + self.assertIn("measurementorfact.txt", names) + self.assertIn("meta.xml", names) + self.assertIn("eml.xml", names) + self.assertNotIn( + "VALIDATION_ERRORS.txt", + names, + "Clean exports should not emit VALIDATION_ERRORS.txt", + ) + eml = zf.read("eml.xml").decode("utf-8") + self.assertIn("DRAFT SCHEMA (April 2026)", eml, "EML should contain draft notice") + occ = zf.read("occurrence.txt").decode("utf-8") + occ_rows = list(csv.DictReader(StringIO(occ), delimiter="\t")) + self.assertGreater(len(occ_rows), 0) + for row in occ_rows: + self.assertEqual(row["basisOfRecord"], "MachineObservation") + + event_rows = list(csv.DictReader(StringIO(zf.read("event.txt").decode("utf-8")), delimiter="\t")) + self.assertGreater(len(event_rows), 0) + for row in event_rows: + self.assertEqual(row["license"], "All rights reserved") + + default_storage.delete(file_path) + + +class DwCAExportTest(TestCase): + """Tests for Darwin Core Archive (DwC-A) export format. + + Uses setUpClass to run the export once and share the ZIP across + structural validation tests for better performance. 
+ """ + + @classmethod + def setUpClass(cls): + super().setUpClass() + cls.project, cls.deployment = setup_test_project(reuse=False) + cls.user = cls.project.owner + create_captures(deployment=cls.deployment, num_nights=2, images_per_night=4, interval_minutes=1) + group_images_into_events(cls.deployment) + create_taxa(cls.project) + create_occurrences(num=10, deployment=cls.deployment) + + # Run the export once and cache the file path + cls._export_file_path = cls._create_export(cls.project, cls.user) + + @classmethod + def tearDownClass(cls): + if cls._export_file_path and default_storage.exists(cls._export_file_path): + default_storage.delete(cls._export_file_path) + super().tearDownClass() + + @staticmethod + def _create_export(project, user): + """Run a DwC-A export and return the storage file path.""" + from django.conf import settings + + data_export = DataExport.objects.create( + user=user, + project=project, + format="dwca", + job=None, + ) + file_url = data_export.run_export() + assert file_url is not None, "Export did not produce a file URL" + file_path = file_url.replace(settings.MEDIA_URL, "") + assert default_storage.exists(file_path), f"Export file not found: {file_path}" + return file_path + + def _open_zip(self): + """Open the cached export ZIP for reading.""" + return default_storage.open(self._export_file_path, "rb") + + def test_dwca_exporter_is_registered(self): + """DwC-A exporter should be registered and retrievable.""" + from ami.exports.registry import ExportRegistry + + exporter_cls = ExportRegistry.get_exporter("dwca") + self.assertIsNotNone(exporter_cls, "DwC-A exporter not found in registry") + self.assertEqual(exporter_cls.file_format, "zip") + + def test_export_produces_valid_zip(self): + """Export should produce a valid ZIP with expected files.""" + with self._open_zip() as f: + self.assertTrue(zipfile.is_zipfile(f)) + f.seek(0) + with zipfile.ZipFile(f, "r") as zf: + names = zf.namelist() + self.assertIn("event.txt", names) + 
self.assertIn("occurrence.txt", names) + self.assertIn("meta.xml", names) + self.assertIn("eml.xml", names) + + def test_event_headers_and_row_count(self): + """event.txt should have correct headers and row count matching events.""" + with self._open_zip() as f: + with zipfile.ZipFile(f, "r") as zf: + event_data = zf.read("event.txt").decode("utf-8") + reader = csv.DictReader(StringIO(event_data), delimiter="\t") + rows = list(reader) + + # Check headers + self.assertIn("eventID", reader.fieldnames) + self.assertIn("eventDate", reader.fieldnames) + self.assertIn("decimalLatitude", reader.fieldnames) + self.assertIn("samplingProtocol", reader.fieldnames) + + # Row count should match events referenced by valid occurrences + expected_count = ( + Occurrence.objects.valid() # type: ignore[union-attr] + .filter(project=self.project, event__isnull=False, determination__isnull=False) + .values("event_id") + .distinct() + .count() + ) + self.assertEqual(len(rows), expected_count, "Event row count mismatch") + + def test_occurrence_headers_and_row_count(self): + """occurrence.txt should have correct headers and row count matching valid occurrences.""" + with self._open_zip() as f: + with zipfile.ZipFile(f, "r") as zf: + occ_data = zf.read("occurrence.txt").decode("utf-8") + reader = csv.DictReader(StringIO(occ_data), delimiter="\t") + rows = list(reader) + + # Check headers + self.assertIn("occurrenceID", reader.fieldnames) + self.assertIn("scientificName", reader.fieldnames) + self.assertIn("basisOfRecord", reader.fieldnames) + self.assertIn("taxonRank", reader.fieldnames) + + # Row count should match valid occurrences with event and determination + expected_count = ( + Occurrence.objects.valid() # type: ignore[union-attr] + .filter(project=self.project, event__isnull=False, determination__isnull=False) + .count() + ) + self.assertEqual(len(rows), expected_count, "Occurrence row count mismatch") + + # All rows should have basisOfRecord = MachineObservation + for row in 
rows: + self.assertEqual(row["basisOfRecord"], "MachineObservation") + + def test_meta_xml_structure(self): + """meta.xml should be valid XML with correct core/extension structure.""" + with self._open_zip() as f: + with zipfile.ZipFile(f, "r") as zf: + meta_xml = zf.read("meta.xml").decode("utf-8") + root = ET.fromstring(meta_xml) + + # Default namespace + ns = "http://rs.tdwg.org/dwc/text/" + + # Should have a core element with Event rowType + core = root.find(f"{{{ns}}}core") + self.assertIsNotNone(core, "meta.xml missing <core> element") + self.assertIn("Event", core.get("rowType", "")) + + # Should have an extension element with Occurrence rowType + ext = root.find(f"{{{ns}}}extension") + self.assertIsNotNone(ext, "meta.xml missing <extension> element") + self.assertIn("Occurrence", ext.get("rowType", "")) + + # Core should reference event.txt + core_location = core.find(f".//{{{ns}}}location") + self.assertIsNotNone(core_location, "meta.xml core missing <location>") + self.assertEqual(core_location.text, "event.txt") + + # Extension should reference occurrence.txt + ext_location = ext.find(f".//{{{ns}}}location") + self.assertIsNotNone(ext_location, "meta.xml extension missing <location>") + self.assertEqual(ext_location.text, "occurrence.txt") + + def test_referential_integrity(self): + """All occurrence eventIDs should reference existing event eventIDs.""" + with self._open_zip() as f: + with zipfile.ZipFile(f, "r") as zf: + # Read event IDs + event_data = zf.read("event.txt").decode("utf-8") + event_reader = csv.DictReader(StringIO(event_data), delimiter="\t") + event_ids = {row["eventID"] for row in event_reader} + + # Read occurrence eventIDs + occ_data = zf.read("occurrence.txt").decode("utf-8") + occ_reader = csv.DictReader(StringIO(occ_data), delimiter="\t") + occ_event_ids = {row["eventID"] for row in occ_reader if row["eventID"]} + + # All occurrence eventIDs should exist in events + orphaned = occ_event_ids - event_ids + self.assertEqual( + len(orphaned), + 0, + f"Orphaned occurrence
eventIDs (not in events): {orphaned}", + ) + + def test_taxonomy_hierarchy_extraction(self): + """Taxonomy fields should be extracted from parents_json.""" + from ami.exports.dwca import _get_rank_from_parents + + # Get an occurrence with a determination that has parents + occurrence = ( + Occurrence.objects.valid() # type: ignore[union-attr] + .filter(project=self.project, determination__isnull=False) + .select_related("determination") + .first() + ) + self.assertIsNotNone(occurrence, "No occurrence with determination found") + + # Update parents_json on the taxon so we can test extraction + taxon = occurrence.determination + taxon.save(update_calculated_fields=True) + taxon.refresh_from_db() + + # Ensure parents_json is populated so this test doesn't pass vacuously + self.assertTrue(taxon.parents_json, "Test taxon should have parents_json populated") + + ranks_found = [] + for rank in ["KINGDOM", "PHYLUM", "CLASS", "ORDER", "FAMILY", "GENUS"]: + value = _get_rank_from_parents(occurrence, rank) + if value: + ranks_found.append(rank) + self.assertGreater(len(ranks_found), 0, "No taxonomy ranks extracted from parents_json") + + def test_specific_epithet_extraction(self): + """get_specific_epithet should extract the second word of a binomial name.""" + from ami.exports.dwca import get_specific_epithet + + self.assertEqual(get_specific_epithet("Vanessa cardui"), "cardui") + self.assertEqual(get_specific_epithet("Vanessa"), "") + self.assertEqual(get_specific_epithet(""), "") + self.assertEqual(get_specific_epithet("Homo sapiens sapiens"), "sapiens") + + def test_verification_status_ignores_withdrawn_identifications(self): + """identificationVerificationStatus should flip to 'verified' only for non-withdrawn human IDs.""" + from ami.exports.dwca import _get_verification_status + + occurrence = ( + Occurrence.objects.valid() # type: ignore[union-attr] + .filter(project=self.project, determination__isnull=False) + .first() + ) + self.assertIsNotNone(occurrence) + 
occurrence.identifications.all().delete() + self.assertEqual(_get_verification_status(occurrence), "unverified") + + Identification.objects.create( + user=self.user, + taxon=occurrence.determination, + occurrence=occurrence, + withdrawn=True, + ) + self.assertEqual(_get_verification_status(occurrence), "unverified") + + Identification.objects.create( + user=self.user, + taxon=occurrence.determination, + occurrence=occurrence, + ) + self.assertEqual(_get_verification_status(occurrence), "verified") + + def test_eml_xml_valid(self): + """eml.xml should be valid EML 2.2.0 with coverage, methods, and license.""" + with self._open_zip() as f: + with zipfile.ZipFile(f, "r") as zf: + eml_xml = zf.read("eml.xml").decode("utf-8") + root = ET.fromstring(eml_xml) + + self.assertIn("eml-2.2.0", eml_xml) + ns = {"eml": "https://eml.ecoinformatics.org/eml-2.2.0"} + dataset = root.find("eml:dataset", ns) or root.find("dataset") + self.assertIsNotNone(dataset, "eml.xml missing <dataset>") + + title = dataset.find("eml:title", ns) or dataset.find("title") + self.assertIsNotNone(title) + self.assertEqual(title.text, self.project.name) + + coverage = dataset.find("eml:coverage", ns) or dataset.find("coverage") + self.assertIsNotNone(coverage, "Missing <coverage>") + self.assertIsNotNone( + coverage.find("eml:geographicCoverage", ns) or coverage.find("geographicCoverage") + ) + self.assertIsNotNone(coverage.find("eml:temporalCoverage", ns) or coverage.find("temporalCoverage")) + + methods = dataset.find("eml:methods", ns) or dataset.find("methods") + self.assertIsNotNone(methods, "Missing <methods>") + method_step = methods.find("eml:methodStep", ns) or methods.find("methodStep") + self.assertIsNotNone(method_step) + + def test_dwca_export_with_collection_filter(self): + """DwC-A export with collection_id filter should only include matching occurrences and their events.""" + # Create a collection with a subset of images + images = self.project.captures.all() + collection_images = images[: images.count() // 2]
self.assertGreater(len(collection_images), 0) + + collection = SourceImageCollection.objects.create( + name="DwCA Filter Test Collection", + project=self.project, + method="manual", + kwargs={"image_ids": [img.pk for img in collection_images]}, + ) + collection.populate_sample() + + # Run filtered export + data_export = DataExport.objects.create( + user=self.user, + project=self.project, + format="dwca", + filters={"collection_id": collection.pk}, + job=None, + ) + file_url = data_export.run_export() + self.assertIsNotNone(file_url) + + from django.conf import settings + + file_path = file_url.replace(settings.MEDIA_URL, "") + self.assertTrue(default_storage.exists(file_path)) + + try: + # Count expected filtered occurrences + expected_occ_count = ( + Occurrence.objects.valid() # type: ignore[union-attr] + .filter( + project=self.project, + event__isnull=False, + determination__isnull=False, + detections__source_image__collections=collection, + ) + .distinct() + .count() + ) + total_occ_count = ( + Occurrence.objects.valid() # type: ignore[union-attr] + .filter(project=self.project, event__isnull=False, determination__isnull=False) + .count() + ) + self.assertGreater(expected_occ_count, 0, "Filtered occurrences should not be empty") + self.assertLess(expected_occ_count, total_occ_count, "Filtered should be fewer than total") + + with default_storage.open(file_path, "rb") as f: + with zipfile.ZipFile(f, "r") as zf: + # Verify occurrence count + occ_data = zf.read("occurrence.txt").decode("utf-8") + occ_reader = csv.DictReader(StringIO(occ_data), delimiter="\t") + occ_rows = list(occ_reader) + self.assertEqual(len(occ_rows), expected_occ_count, "Filtered occurrence count mismatch") + + # Verify event count matches only events from filtered occurrences + event_data = zf.read("event.txt").decode("utf-8") + event_reader = csv.DictReader(StringIO(event_data), delimiter="\t") + event_rows = list(event_reader) + event_ids_in_file = {row["eventID"] for row in event_rows} + 
+ # Events should only be those referenced by filtered occurrences + occ_event_ids = {row["eventID"] for row in occ_rows if row["eventID"]} + self.assertEqual( + event_ids_in_file, + occ_event_ids, + "Event IDs should match exactly those referenced by filtered occurrences", + ) + + # Referential integrity: no orphaned eventIDs in occurrences + orphaned = occ_event_ids - event_ids_in_file + self.assertEqual(len(orphaned), 0, f"Orphaned occurrence eventIDs: {orphaned}") + finally: + default_storage.delete(file_path) + + def test_dwca_export_refuses_over_hardcap(self): + """Export should refuse with a clear message when queryset exceeds DWCA_MAX_OCCURRENCES.""" + from unittest.mock import patch + + from ami.exports.format_types import DwCAExporter + + data_export = DataExport.objects.create( + user=self.user, + project=self.project, + format="dwca", + job=None, + ) + exporter = data_export.get_exporter() + with patch.object(DwCAExporter, "DWCA_MAX_OCCURRENCES", 1): + with self.assertRaisesRegex(ValueError, "hard cap"): + exporter.export() + + def test_validation_failure_writes_errors_into_zip(self): + """_append_validation_report_to_zip should add a readable VALIDATION_ERRORS.txt.""" + import tempfile + + from ami.exports.dwca_validator import ValidationResult + from ami.exports.format_types import _append_validation_report_to_zip + + tf = tempfile.NamedTemporaryFile(delete=False, suffix=".zip") + tf.close() + with zipfile.ZipFile(tf.name, "w") as zf: + zf.writestr("meta.xml", "") + result = ValidationResult() + result.add_error("meta.xml references event.txt but file is missing") + result.add_error("occurrence.txt:L4: duplicate core id: 'E1'") + result.add_warning("eml.xml is unusually small") + + _append_validation_report_to_zip(tf.name, result) + + with zipfile.ZipFile(tf.name, "r") as zf: + self.assertIn("VALIDATION_ERRORS.txt", zf.namelist()) + body = zf.read("VALIDATION_ERRORS.txt").decode("utf-8") + self.assertIn("Errors (2)", body) + self.assertIn("duplicate 
core id", body) + self.assertIn("Warnings (1)", body) + + def test_validator_runs_on_produced_zip(self): + """The exporter's own zip should pass its own validator cleanly.""" + import tempfile + + from ami.exports.dwca_validator import validate_dwca_zip + + with self._open_zip() as f: + tf = tempfile.NamedTemporaryFile(delete=False, suffix=".zip") + tf.write(f.read()) + tf.close() + result = validate_dwca_zip(tf.name) + self.assertTrue( + result.ok, + f"Self-produced DwC-A failed own validator: {result.errors}", + ) + + def test_measurementorfact_txt_in_archive(self): + with self._open_zip() as f: + with zipfile.ZipFile(f, "r") as zf: + self.assertIn("measurementorfact.txt", zf.namelist()) + data = zf.read("measurementorfact.txt").decode("utf-8") + reader = csv.DictReader(StringIO(data), delimiter="\t") + rows = list(reader) + self.assertGreater(len(rows), 0) + types = {r["measurementType"] for r in rows} + self.assertIn("classificationScore", types) + for r in rows: + self.assertTrue(r["eventID"], "MoF row missing eventID") + self.assertTrue(r["occurrenceID"], "MoF row missing occurrenceID in this PR") + + def test_meta_xml_declares_mof_extension(self): + with self._open_zip() as f: + with zipfile.ZipFile(f, "r") as zf: + meta_xml = zf.read("meta.xml").decode("utf-8") + self.assertIn("measurementorfact.txt", meta_xml) + self.assertIn("http://rs.gbif.org/terms/1.0/MeasurementOrFact", meta_xml) + + def test_multimedia_txt_in_archive(self): + with self._open_zip() as f: + with zipfile.ZipFile(f, "r") as zf: + self.assertIn("multimedia.txt", zf.namelist()) + data = zf.read("multimedia.txt").decode("utf-8") + reader = csv.DictReader(StringIO(data), delimiter="\t") + rows = list(reader) + self.assertGreater(len(rows), 0, "multimedia.txt has no rows") + ids = {row["eventID"] for row in rows if row["eventID"]} + event_data = zf.read("event.txt").decode("utf-8") + event_ids = {r["eventID"] for r in csv.DictReader(StringIO(event_data), delimiter="\t")} + 
self.assertTrue(ids.issubset(event_ids), f"Orphaned multimedia eventIDs: {ids - event_ids}") + + def test_meta_xml_declares_multimedia_extension(self): + with self._open_zip() as f: + with zipfile.ZipFile(f, "r") as zf: + meta_xml = zf.read("meta.xml").decode("utf-8") + self.assertIn("multimedia.txt", meta_xml) + self.assertIn("http://rs.gbif.org/terms/1.0/Multimedia", meta_xml) + + def test_occurrence_has_associated_media_column(self): + """occurrence.txt should carry associatedMedia as pipe-separated URLs.""" + with self._open_zip() as f: + with zipfile.ZipFile(f, "r") as zf: + occ_data = zf.read("occurrence.txt").decode("utf-8") + reader = csv.DictReader(StringIO(occ_data), delimiter="\t") + fieldnames = set(reader.fieldnames or []) + self.assertIn("associatedMedia", fieldnames) + rows = list(reader) + non_empty = [r for r in rows if r.get("associatedMedia")] + self.assertGreater(len(non_empty), 0, "No occurrences have associatedMedia") + for r in non_empty: + self.assertFalse(r["associatedMedia"].endswith("|")) + for part in r["associatedMedia"].split("|"): + self.assertTrue(part.startswith("http"), f"Not a URL: {part}") + + def test_event_has_humboldt_eco_columns(self): + """event.txt should carry the Humboldt eco: columns as flattened columns.""" + expected_columns = { + "isSamplingEffortReported", + "samplingEffortValue", + "samplingEffortUnit", + "samplingEffortProtocol", + "isAbsenceReported", + "targetTaxonomicScope", + "inventoryTypes", + "protocolNames", + "protocolDescriptions", + "hasMaterialSamples", + "materialSampleTypes", + } + with self._open_zip() as f: + with zipfile.ZipFile(f, "r") as zf: + event_data = zf.read("event.txt").decode("utf-8") + reader = csv.DictReader(StringIO(event_data), delimiter="\t") + fieldnames = set(reader.fieldnames or []) + self.assertTrue( + expected_columns.issubset(fieldnames), + f"event.txt missing Humboldt columns: {expected_columns - fieldnames}", + ) + rows = list(reader) + self.assertGreater(len(rows), 0) + for 
row in rows: + self.assertEqual(row["isSamplingEffortReported"], "true") + self.assertEqual(row["isAbsenceReported"], "true") + self.assertEqual(row["hasMaterialSamples"], "true") + self.assertEqual(row["materialSampleTypes"], "digital images") + self.assertEqual(row["inventoryTypes"], "trap or sample") + + def test_event_humboldt_terms_in_meta_xml(self): + """meta.xml core should declare eco: term URIs for Humboldt columns.""" + with self._open_zip() as f: + with zipfile.ZipFile(f, "r") as zf: + meta_xml = zf.read("meta.xml").decode("utf-8") + self.assertIn("http://rs.tdwg.org/eco/terms/isSamplingEffortReported", meta_xml) + self.assertIn("http://rs.tdwg.org/eco/terms/isAbsenceReported", meta_xml) + self.assertIn("http://rs.tdwg.org/eco/terms/targetTaxonomicScope", meta_xml) + + def test_offline_structural_validator(self): + """Full archive passes the offline DwC-A structural validator. + + This catches the class of drift bugs (meta.xml term count diverges + from TSV columns, dangling coreids, duplicate core ids, empty + required fields) that don't show up in diff review. Fast enough to + run in unit tests; no network required. + """ + import tempfile + + from ami.exports.dwca import DWC, EVENT_FIELDS, OCCURRENCE_FIELDS + from ami.exports.dwca_validator import validate_dwca_zip + + required_terms = {f.term for f in EVENT_FIELDS if f.required} + required_terms |= {f.term for f in OCCURRENCE_FIELDS if f.required} + # occurrenceID is required inside occurrence.txt but legitimately blank + # on multimedia capture-rows; the current validator takes a flat + # required set, so scope it out here. The Task 9 cross-reference check + # covers the stronger integrity condition for extensions. 
+ required_terms.discard(DWC + "occurrenceID") + + with tempfile.NamedTemporaryFile(suffix=".zip", delete=False) as tmp: + with self._open_zip() as src: + tmp.write(src.read()) + tmp_path = tmp.name + + result = validate_dwca_zip(tmp_path, required_terms=required_terms) + self.assertTrue( + result.ok, + msg="DwC-A structural validator failed:\n" + "\n".join(result.errors), + ) + + +class MultimediaExtensionTest(TestCase): + """Unit tests for multimedia.txt row generator (in isolation from a full export).""" + + def test_field_catalogue_present(self): + from ami.exports.dwca.fields import MULTIMEDIA_FIELDS + + headers = [f.header for f in MULTIMEDIA_FIELDS] + for required in [ + "eventID", + "occurrenceID", + "type", + "format", + "identifier", + "references", + "created", + "license", + "rightsHolder", + ]: + self.assertIn(required, headers) + + def test_iter_multimedia_rows_emits_capture_and_crop_rows(self): + from ami.exports.dwca.rows import iter_multimedia_rows + + project, deployment = setup_test_project(reuse=False) + create_captures(deployment=deployment, num_nights=1, images_per_night=4, interval_minutes=1) + group_images_into_events(deployment) + create_taxa(project) + create_occurrences(num=4, deployment=deployment) + + events_qs = project.events.all() + occurrences_qs = Occurrence.objects.valid().filter( # type: ignore[union-attr] + project=project, event__isnull=False, determination__isnull=False + ) + rows = list(iter_multimedia_rows(events_qs, occurrences_qs, "test-project")) + + capture_rows = [r for r in rows if not r["occurrenceID"]] + crop_rows = [r for r in rows if r["occurrenceID"]] + self.assertGreater(len(capture_rows), 0, "Expected capture rows with blank occurrenceID") + # Detection-crop rows require det.url() to return non-empty; depends on fixture + # setup (the image_dimensions stub may not produce crop URLs for test fixtures). + # Check at least the capture-row invariants here. 
+ for r in capture_rows: + self.assertTrue(r["identifier"], "Capture row missing identifier") + self.assertEqual(r["type"], "StillImage") + # Crop rows, if present, must have both identifier and references. + for r in crop_rows: + self.assertTrue(r["identifier"], "Crop row missing identifier") + self.assertTrue(r["references"], "Crop row missing references (source capture URL)") + self.assertEqual(r["type"], "StillImage") + + +class TargetTaxonomicScopeTest(TestCase): + """Tests for eco:targetTaxonomicScope derivation from project include taxa.""" + + @classmethod + def setUpClass(cls): + super().setUpClass() + cls.project, cls.deployment = setup_test_project(reuse=False) + create_taxa(cls.project) + + def test_empty_include_taxa_returns_empty_string(self): + from ami.exports.dwca.targetscope import derive_target_taxonomic_scope + + self.project.default_filters_include_taxa.clear() + self.assertEqual(derive_target_taxonomic_scope(self.project), "") + + def test_single_taxon_returns_its_name(self): + from ami.exports.dwca.targetscope import derive_target_taxonomic_scope + from ami.main.models import Taxon + + taxon = Taxon.objects.filter(projects=self.project).first() + self.assertIsNotNone(taxon, "Expected at least one taxon on fixture project") + self.project.default_filters_include_taxa.set([taxon]) + self.assertEqual(derive_target_taxonomic_scope(self.project), taxon.name) + + def test_multiple_taxa_returns_lca_name(self): + from ami.exports.dwca.targetscope import derive_target_taxonomic_scope + from ami.main.models import Taxon + + taxa = list(Taxon.objects.filter(projects=self.project).exclude(parents_json=[])[:2]) + if len(taxa) < 2: + self.skipTest("Fixture does not have two taxa with shared ancestry") + for t in taxa: + t.save(update_calculated_fields=True) + t.refresh_from_db() + self.project.default_filters_include_taxa.set(taxa) + + result = derive_target_taxonomic_scope(self.project) + self.assertTrue(result, "LCA should resolve to a non-empty ancestor 
name") + for t in taxa: + ancestor_names = [p.name for p in t.parents_json] + [t.name] + self.assertIn(result, ancestor_names, f"{result} not in ancestry of {t.name}") + + class ExportNewFieldsTest(TestCase): """Test the new machine prediction, verification, and detection fields in CSV exports.""" diff --git a/ami/exports/tests_dwca_validator.py b/ami/exports/tests_dwca_validator.py new file mode 100644 index 000000000..1b840b169 --- /dev/null +++ b/ami/exports/tests_dwca_validator.py @@ -0,0 +1,186 @@ +"""Unit tests for the offline DwC-A structural validator. + +Covers the error paths that the integration test against a real export +will never hit (malformed meta.xml, duplicate core ids, dangling coreids, +empty required fields, column count mismatches). +""" + +from __future__ import annotations + +import tempfile +import zipfile + +from django.test import SimpleTestCase + +from ami.exports.dwca_validator import validate_dwca_zip + +DWC = "http://rs.tdwg.org/dwc/terms/" + + +META_OK = """ + + + event.txt + + + + + + occurrence.txt + + + + + +""" + +EML_OK = '' + + +def _build_zip(files: dict[str, str]) -> str: + tmp = tempfile.NamedTemporaryFile(suffix=".zip", delete=False) + tmp.close() + with zipfile.ZipFile(tmp.name, "w") as zf: + for name, content in files.items(): + zf.writestr(name, content) + return tmp.name + + +class DwCAValidatorTests(SimpleTestCase): + def test_well_formed_archive_passes(self): + path = _build_zip( + { + "meta.xml": META_OK, + "eml.xml": EML_OK, + "event.txt": "eventID\teventDate\nE1\t2024-06-15\nE2\t2024-06-16\n", + "occurrence.txt": "eventID\toccurrenceID\nE1\tO1\nE2\tO2\n", + } + ) + result = validate_dwca_zip(path, required_terms={DWC + "eventID"}) + self.assertTrue(result.ok, msg="\n".join(result.errors)) + + def test_missing_meta_xml_fails(self): + path = _build_zip({"eml.xml": EML_OK}) + result = validate_dwca_zip(path) + self.assertFalse(result.ok) + self.assertTrue(any("meta.xml" in e for e in result.errors)) + + def 
test_dangling_coreid_is_detected(self): + path = _build_zip( + { + "meta.xml": META_OK, + "eml.xml": EML_OK, + "event.txt": "eventID\teventDate\nE1\t2024-06-15\n", + # O2 references event E2, which doesn't exist in event.txt + "occurrence.txt": "eventID\toccurrenceID\nE1\tO1\nE2\tO2\n", + } + ) + result = validate_dwca_zip(path) + self.assertFalse(result.ok) + self.assertTrue(any("coreid" in e and "E2" in e for e in result.errors)) + + def test_duplicate_core_id_is_detected(self): + path = _build_zip( + { + "meta.xml": META_OK, + "eml.xml": EML_OK, + "event.txt": "eventID\teventDate\nE1\t2024-06-15\nE1\t2024-06-16\n", + "occurrence.txt": "eventID\toccurrenceID\nE1\tO1\n", + } + ) + result = validate_dwca_zip(path) + self.assertFalse(result.ok) + self.assertTrue(any("duplicate core id" in e for e in result.errors)) + + def test_empty_required_term_is_detected(self): + path = _build_zip( + { + "meta.xml": META_OK, + "eml.xml": EML_OK, + "event.txt": "eventID\teventDate\nE1\t\n", + "occurrence.txt": "eventID\toccurrenceID\nE1\tO1\n", + } + ) + result = validate_dwca_zip(path, required_terms={DWC + "eventDate"}) + self.assertFalse(result.ok) + self.assertTrue(any("eventDate" in e and "empty" in e for e in result.errors)) + + def test_column_count_mismatch_is_detected(self): + path = _build_zip( + { + "meta.xml": META_OK, + "eml.xml": EML_OK, + # Only 1 column but meta.xml declares 2 + "event.txt": "eventID\nE1\n", + "occurrence.txt": "eventID\toccurrenceID\nE1\tO1\n", + } + ) + result = validate_dwca_zip(path) + self.assertFalse(result.ok) + self.assertTrue(any("columns" in e and "meta.xml" in e for e in result.errors)) + + def test_malformed_meta_xml_fails_gracefully(self): + path = _build_zip( + { + "meta.xml": "not closed", + "eml.xml": EML_OK, + } + ) + result = validate_dwca_zip(path) + self.assertFalse(result.ok) + self.assertTrue(any("meta.xml" in e and "parse" in e for e in result.errors)) + + def test_not_a_zip_fails(self): + tmp = 
tempfile.NamedTemporaryFile(suffix=".zip", delete=False) + tmp.write(b"not a zip") + tmp.close() + result = validate_dwca_zip(tmp.name) + self.assertFalse(result.ok) + self.assertTrue(any("Not a zip" in e for e in result.errors)) + + def test_orphaned_occurrence_id_on_extension_row_fails(self): + """A multimedia row whose occurrenceID isn't in occurrence.txt should error.""" + meta = _META_WITH_MULTIMEDIA + path = _build_zip( + { + "meta.xml": meta, + "eml.xml": EML_OK, + "event.txt": "eventID\teventDate\nE1\t2024-06-15\n", + "occurrence.txt": "eventID\toccurrenceID\tbasisOfRecord\nE1\tO1\tMachineObservation\n", + "multimedia.txt": "eventID\toccurrenceID\tidentifier\nE1\tO_MISSING\thttp://example.com/a.jpg\n", + } + ) + result = validate_dwca_zip(path) + self.assertFalse(result.ok) + self.assertTrue(any("occurrenceID" in e for e in result.errors)) + + +_META_WITH_MULTIMEDIA = """ + + + event.txt + + + + + + occurrence.txt + + + + + + + multimedia.txt + + + + + + +""" diff --git a/ami/main/migrations/0084_project_license.py b/ami/main/migrations/0084_project_license.py new file mode 100644 index 000000000..51d73bf1d --- /dev/null +++ b/ami/main/migrations/0084_project_license.py @@ -0,0 +1,34 @@ +from django.db import migrations, models + + +class Migration(migrations.Migration): + dependencies = [ + ("main", "0083_dedupe_taxalist_names"), + ] + + operations = [ + migrations.AddField( + model_name="project", + name="license", + field=models.CharField( + blank=True, + default="", + help_text=( + "Data license for published occurrence records. " + "Use an SPDX identifier (e.g. 'CC-BY-4.0', 'CC0-1.0') or a license URL. " + "Required by GBIF for DwC-A publication." 
+ ), + max_length=255, + ), + ), + migrations.AddField( + model_name="project", + name="rights_holder", + field=models.CharField( + blank=True, + default="", + help_text="Name of the organization or individual owning rights to the data.", + max_length=255, + ), + ), + ] diff --git a/ami/main/models.py b/ami/main/models.py index c604c6319..7e3b31dfb 100644 --- a/ami/main/models.py +++ b/ami/main/models.py @@ -285,6 +285,23 @@ class Project(ProjectSettingsMixin, BaseModel): active = models.BooleanField(default=True) priority = models.IntegerField(default=1) + license = models.CharField( + max_length=255, + blank=True, + default="", + help_text=( + "Data license for published occurrence records. " + "Use an SPDX identifier (e.g. 'CC-BY-4.0', 'CC0-1.0') or a license URL. " + "Required by GBIF for DwC-A publication." + ), + ) + rights_holder = models.CharField( + max_length=255, + blank=True, + default="", + help_text="Name of the organization or individual owning rights to the data.", + ) + # Backreferences for type hinting captures: models.QuerySet["SourceImage"] deployments: models.QuerySet["Deployment"] @@ -3191,6 +3208,39 @@ def get_determination_score(self) -> float | None: else: return None + def get_identified_by(self) -> str: + # Museum-style: the most recent authoritative identifier owns the record. + # A human identification (if present) supersedes any ML prediction, + # mirroring update_occurrence_determination. + top_identification = self.best_identification + if top_identification and top_identification.user: + user = top_identification.user + # Do NOT fall back to user.email — this value is published in DwC-A archives (PII / GDPR). 
+ return user.name or getattr(user, "username", "") or f"user:{user.pk}" + + top_prediction = self.best_prediction + if top_prediction and top_prediction.algorithm: + algo = top_prediction.algorithm + if algo.version_name: + return f"{algo.name} {algo.version_name}" + if algo.version: + return f"{algo.name} v{algo.version}" + return algo.name + + return "" + + def get_identified_date(self) -> datetime.datetime | None: + # Prefer the identification/classification event time (set when the model + # or user actually produced the result) over created_at, which is the DB + # insert time and can lag by years for backfills and reprocessing jobs. + top_identification = self.best_identification + if top_identification: + return getattr(top_identification, "timestamp", None) or top_identification.created_at + top_prediction = self.best_prediction + if top_prediction: + return getattr(top_prediction, "timestamp", None) or top_prediction.created_at + return None + def predictions(self): # Retrieve the classification with the max score for each algorithm. # select_related avoids per-row taxon/algorithm lazy loads when callers diff --git a/ami/ml/schemas.py b/ami/ml/schemas.py index 9322e4116..1c9514951 100644 --- a/ami/ml/schemas.py +++ b/ami/ml/schemas.py @@ -9,6 +9,15 @@ class BoundingBox(pydantic.BaseModel): + """Detection bounding box in source-image pixel coordinates. + + Coordinates are absolute pixels relative to the source image origin + (top-left), with (x1, y1) the upper-left corner and (x2, y2) the + lower-right. Values are passed directly to PIL.Image.crop(), so + normalized [0, 1] floats are NOT supported — producers must convert + to pixel coordinates before populating this field. 
+ """ + x1: float y1: float x2: float diff --git a/docs/claude/dwc-terms-reference.md b/docs/claude/dwc-terms-reference.md new file mode 100644 index 000000000..09c112f09 --- /dev/null +++ b/docs/claude/dwc-terms-reference.md @@ -0,0 +1,4062 @@ + +# Darwin Core Quick Reference Guide + +This document is intended to be an easy-to-read reference of the currently (as of 2023-09-18) recommended terms maintained as part of the [Darwin Core standard](https://www.tdwg.org/standards/dwc/) and is maintained by the [Darwin Core Maintenance Group](https://www.tdwg.org/community/dwc/). + +**Need help?** Read more about how to use Darwin Core in the [Darwin Core Questions & Answers site](https://github.com/tdwg/dwc-qa/blob/master/README.md). Still have questions? Submit a new issue (question/problem) to the [dwc-qa issues page in GitHub](https://github.com/tdwg/dwc-qa/issues), or use the [form](https://tinyurl.com/darwin-qa). See the bottom of this document for [how to cite Darwin Core](https://dwc.tdwg.org/terms/#cite-darwin-core). + +**Want to contribute?** For information about how to contribute to the Darwin Core Standard, including how to propose changes, see the [Guidelines for contributing](https://github.com/tdwg/dwc/blob/master/.github/CONTRIBUTING.md). + +This page is not part of the standard, but combines the normative term names and definitions with the non-normative comments and examples that are meant to help people use the terms consistently. Definitions, comments, and examples may include namespace abbreviations (e.g., "dwc:"). Each abbreviation shows that the word it is attached to means specifically the term as defined in that namespace. Thus, dwc:Event means Event as defined by Darwin Core at https://dwc.tdwg.org/terms/#event. 
Capitalized terms that follow a namespace abbreviation, such as dwc:Occurrence, are Darwin Core class terms, which are a special category of terms used to group sets of property terms (terms that begin with lowercase names that follow the namespace abbreviation, e.g., dwc:eventID) for convenience. Comprehensive metadata for current and obsolete terms in human-readable form are found in the document [List of Darwin Core terms](../list/). + +Additional [files with just the current term names](https://github.com/tdwg/dwc/tree/master/dist) and a [file with the full term history](https://github.com/tdwg/dwc/blob/master/vocabulary/term_versions.csv) can be found in the [Darwin Core repository](https://github.com/tdwg/dwc). + + +## Record-level + +This category contains terms that are generic in that they might apply to any type of record in a dataset. + + + + + + + + + + + +
type
Identifierhttp://purl.org/dc/elements/1.1/type
DefinitionThe nature or genre of the resource.
CommentsMust be populated with a value from the DCMI type vocabulary (https://www.dublincore.org/specifications/dublin-core/dcmi-type-vocabulary/2010-10-11/).
Examples
  • StillImage
  • MovingImage
  • Sound
  • PhysicalObject
  • Event
  • Text
+ + + + + + + + + +
modified
Identifierhttp://purl.org/dc/terms/modified
DefinitionDate on which the resource was changed.
CommentsRecommended best practice is to use a date that conforms to ISO 8601-1:2019.
Examples
  • 1963-03-08T14:07-06:00 (8 Mar 1963 at or after 2:07pm and before 2:08pm in the time zone six hours earlier than UTC)
  • 2009-02-20T08:40Z (20 February 2009 at or after 8:40am and before 8:41 UTC)
  • 2018-08-29T15:19 (29 August 2018 at or after 3:19pm and before 3:20pm local time)
  • 1809-02-12 (within the day 12 February 1809)
  • 1906-06 (in the month of June 1906)
  • 1971 (in the year 1971)
  • 2007-03-01T13:00:00Z/2008-05-11T15:30:00Z (some time within the interval beginning 1 March 2007 at 1pm UTC and before 11 May 2008 at 3:30pm UTC)
  • 1900/1909 (some time within the interval between the beginning of the year 1900 and before the year 1909)
  • 2007-11-13/15 (some time in the interval between the beginning of 13 November 2007 and before 15 November 2007)
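
The timestamp examples above can be produced from Python `datetime` objects with the standard library alone. The following is a minimal sketch, not part of the Darwin Core guide or of this PR; the helper names `to_iso8601` and `to_interval` are hypothetical, and minute precision is an assumption chosen to match the examples.

```python
from datetime import datetime, timedelta, timezone


def to_iso8601(dt: datetime) -> str:
    """Format a datetime to minute precision, writing a zero offset as 'Z'."""
    text = dt.isoformat(timespec="minutes")
    # isoformat() renders a zero UTC offset as "+00:00"; the examples use "Z".
    if text.endswith("+00:00"):
        text = text[:-6] + "Z"
    return text


def to_interval(start: datetime, end: datetime) -> str:
    """Join two instants into an ISO 8601 interval of the form start/end."""
    return f"{to_iso8601(start)}/{to_iso8601(end)}"


if __name__ == "__main__":
    print(to_iso8601(datetime(2009, 2, 20, 8, 40, tzinfo=timezone.utc)))
    print(to_iso8601(datetime(1963, 3, 8, 14, 7, tzinfo=timezone(timedelta(hours=-6)))))
```

A naive datetime (no `tzinfo`) is emitted without an offset, which corresponds to the "local time" example in the list.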
+ + + + + + + + + +
language
Identifierhttp://purl.org/dc/elements/1.1/language
DefinitionA language of the resource.
CommentsRecommended best practice is to use a controlled vocabulary such as RFC 5646. This term has an equivalent in the dcterms: namespace that allows only an IRI as a value, whereas this term allows for any string literal value.
Examples
  • en (for English)
  • es (for Spanish)
+ + + + + + + + + +
license
Identifierhttp://purl.org/dc/terms/license
DefinitionA legal document giving official permission to do something with the resource.
Comments
Examples
+ + + + + + + + + +
rightsHolder
Identifierhttp://purl.org/dc/terms/rightsHolder
DefinitionA person or organization owning or managing rights over the resource.
Comments
ExamplesThe Regents of the University of California
+ + + + + + + + + +
accessRights
Identifierhttp://purl.org/dc/terms/accessRights
DefinitionInformation about who can access the resource or an indication of its security status.
CommentsAccess Rights may include information regarding access or restrictions based on privacy, security, or other policies.
Examples
+ + + + + + + + + +
bibliographicCitation
Identifierhttp://purl.org/dc/terms/bibliographicCitation
DefinitionA bibliographic reference for the resource.
CommentsFrom Dublin Core, "Recommended practice is to include sufficient bibliographic detail to identify the resource as unambiguously as possible." The intended usage of this term in Darwin Core is to provide the preferred way to cite the resource itself - "how to cite this record". Note that the intended usage of dcterms:references in Darwin Core, by contrast, is to point to the definitive source representation of the resource - "where to find the as-close-to-original reference", if one is available.
Examples
+ + + + + + + + + +
references
Identifierhttp://purl.org/dc/terms/references
DefinitionA related resource that is referenced, cited, or otherwise pointed to by the described resource.
CommentsFrom Dublin Core, "This property is intended to be used with non-literal values. This property is an inverse property of Is Referenced By." The intended usage of this term in Darwin Core is to point to the definitive source representation of the resource (e.g., dwc:Taxon, dwc:Occurrence, dwc:Event), if one is available. Note that the intended usage of dcterms:bibliographicCitation in Darwin Core, by contrast, is to provide the preferred way to cite the resource itself.
Examples
+ + + + + + + + + +
feedbackURL
Identifierhttp://rs.tdwg.org/dwc/terms/feedbackURL
DefinitionA uniform resource locator (URL) that points to a webpage on which a form may be submitted to gather feedback about the record.
CommentsRecommended best practice is to optionally include query strings that act to pre-populate web page form elements and communicate the context.
Exampleshttps://example.com/new?title=New+issue&body=This+comment+is+about+CAN12345
+ + + + + + + + + +
institutionID
Identifierhttp://rs.tdwg.org/dwc/terms/institutionID
DefinitionAn identifier for the institution having custody of the object(s) or information referred to in the record.
CommentsFor physical specimens, the recommended best practice is to use a globally unique and resolvable identifier from a collections registry such as the Research Organization Registry (ROR) or the Global Registry of Scientific Collections (https://scientific-collections.gbif.org/).
Examples
+ + + + + + + + + +
collectionID
Identifierhttp://rs.tdwg.org/dwc/terms/collectionID
DefinitionAn identifier for the collection or dataset from which the record was derived.
CommentsFor physical specimens, the recommended best practice is to use a globally unique and resolvable identifier from a collections registry such as the Global Registry of Scientific Collections (https://scientific-collections.gbif.org/).
Exampleshttps://scientific-collections.gbif.org/collection/fbd3ed74-5a21-4e01-b86a-33d36f032d9c
+ + + + + + + + + +
datasetID
Identifierhttp://rs.tdwg.org/dwc/terms/datasetID
DefinitionAn identifier for the set of data. May be a global unique identifier or an identifier specific to a collection or institution.
Comments
Examplesb15d4952-7d20-46f1-8a3e-556a512b04c5
+ + + + + + + + + +
institutionCode
Identifierhttp://rs.tdwg.org/dwc/terms/institutionCode
DefinitionThe name (or acronym) in use by the institution having custody of the object(s) or information referred to in the record.
Comments
Examples
  • MVZ
  • FMNH
  • CLO
  • UCMP
+ + + + + + + + + +
collectionCode
Identifierhttp://rs.tdwg.org/dwc/terms/collectionCode
DefinitionThe name, acronym, coden, or initialism identifying the collection or data set from which the record was derived.
Comments
Examples
  • Mammals
  • Hildebrandt
  • EBIRD
  • VP
+ + + + + + + + + +
datasetName
Identifierhttp://rs.tdwg.org/dwc/terms/datasetName
DefinitionThe name identifying the data set from which the record was derived.
Comments
Examples
  • Grinnell Resurvey Mammals
  • Lacey Ctenomys Recaptures
+ + + + + + + + + +
ownerInstitutionCode
Identifierhttp://rs.tdwg.org/dwc/terms/ownerInstitutionCode
DefinitionThe name (or acronym) in use by the institution having ownership of the object(s) or information referred to in the record.
Comments
Examples
  • NPS
  • APN
  • InBio
+ + + + + + + + + +
basisOfRecord
Identifierhttp://rs.tdwg.org/dwc/terms/basisOfRecord
DefinitionThe specific nature of the data record.
CommentsRecommended best practice is to use a controlled vocabulary such as the set of local names of the identifiers for classes in Darwin Core.
Examples
  • MaterialEntity
  • PreservedSpecimen
  • FossilSpecimen
  • LivingSpecimen
  • MaterialSample
  • Event
  • HumanObservation
  • MachineObservation
  • Taxon
  • Occurrence
  • MaterialCitation
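
A controlled vocabulary like this one is easy to enforce before a row is written. The sketch below is illustrative only: the set contains just the example values listed above, and the name `is_valid_basis_of_record` is hypothetical rather than something defined in this PR.

```python
# Values copied from the basisOfRecord examples above (local names of
# Darwin Core classes). Matching is case-sensitive on purpose.
BASIS_OF_RECORD_VALUES = frozenset(
    {
        "MaterialEntity",
        "PreservedSpecimen",
        "FossilSpecimen",
        "LivingSpecimen",
        "MaterialSample",
        "Event",
        "HumanObservation",
        "MachineObservation",
        "Taxon",
        "Occurrence",
        "MaterialCitation",
    }
)


def is_valid_basis_of_record(value: str) -> bool:
    """Return True if value is one of the listed vocabulary terms."""
    return value in BASIS_OF_RECORD_VALUES


if __name__ == "__main__":
    print(is_valid_basis_of_record("MachineObservation"))
```

For an automated monitoring export such as the one planned in this PR, the expected value is `MachineObservation`.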
+ + + + + + + + + +
informationWithheld
Identifierhttp://rs.tdwg.org/dwc/terms/informationWithheld
DefinitionAdditional information that exists, but that has not been shared in the given record.
CommentsThis term has an equivalent in the dwciri: namespace that allows only an IRI as a value, whereas this term allows for any string literal value.
Examples
  • location information not given for endangered species
  • collector identities withheld | ask about tissue samples
+ + + + + + + + + +
dataGeneralizations
Identifierhttp://rs.tdwg.org/dwc/terms/dataGeneralizations
DefinitionActions taken to make the shared data less specific or complete than in its original form. Suggests that alternative data of higher quality may be available on request.
CommentsThis term has an equivalent in the dwciri: namespace that allows only an IRI as a value, whereas this term allows for any string literal value.
ExamplesCoordinates generalized from original GPS coordinates to the nearest half degree grid cell.
+ + + + + + + + + +
dynamicProperties
Identifierhttp://rs.tdwg.org/dwc/terms/dynamicProperties
DefinitionA list of additional measurements, facts, characteristics, or assertions about the record. Meant to provide a mechanism for structured content.
CommentsRecommended best practice is to use a key:value encoding schema for a data interchange format such as JSON.
Examples
  • {"heightInMeters":1.5}
  • {"tragusLengthInMeters":0.014, "weightInGrams":120}
  • {"natureOfID":"expert identification", "identificationEvidence":"cytochrome B sequence"}
  • {"relativeHumidity":28, "airTemperatureInCelsius":22, "sampleSizeInKilograms":10}
  • {"aspectHeading":277, "slopeInDegrees":6}
  • {"iucnStatus":"vulnerable", "taxonDistribution":"Neuquén, Argentina"}
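
Values in the style of the examples above can be generated with the standard `json` module. This is a small sketch under the assumption that the properties are already collected in a plain `dict`; the function name `encode_dynamic_properties` is hypothetical. The custom separators only reproduce the spacing used in the examples, and any valid JSON would serve equally well.

```python
import json


def encode_dynamic_properties(properties: dict) -> str:
    """Serialize key:value pairs as a compact JSON object string."""
    # ensure_ascii=False keeps characters such as "é" readable in the output.
    return json.dumps(properties, ensure_ascii=False, separators=(", ", ":"))


if __name__ == "__main__":
    print(encode_dynamic_properties({"aspectHeading": 277, "slopeInDegrees": 6}))
```

Because the result is written into a tab-delimited file, it is worth checking that no value contains a literal tab or newline before export.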
+ + +## Occurrence + + + + + + + + + + + +
Occurrence Class
Identifierhttp://rs.tdwg.org/dwc/terms/Occurrence
DefinitionAn existence of a dwc:Organism at a particular place at a particular time.
Comments
Examples
  • a wolf pack on the shore of Kluane Lake in 1988
  • a virus in a plant leaf in the New York Botanical Garden at 15:29 on 2014-10-23
  • a fungus in Central Park in the summer of 1929
+ + + + + + + + + + +
occurrenceID
Identifierhttp://rs.tdwg.org/dwc/terms/occurrenceID
DefinitionAn identifier for the dwc:Occurrence (as opposed to a particular digital record of the dwc:Occurrence). In the absence of a persistent global unique identifier, construct one from a combination of identifiers in the record that will most closely make the dwc:occurrenceID globally unique.
CommentsRecommended best practice is to use a persistent, globally unique identifier.
Examples
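
The export plan in this PR settles on identifiers of the form `urn:ami:occurrence:{project_slug}:{id}`, which follows the advice above to construct an ID from identifiers already in the record. A minimal sketch of such a builder follows; the function name `occurrence_urn` and the slug pattern are assumptions for illustration, not the exporter's actual code.

```python
import re

# Assumed slug shape: lowercase letters and digits separated by single hyphens.
_SLUG_RE = re.compile(r"^[a-z0-9]+(?:-[a-z0-9]+)*$")


def occurrence_urn(project_slug: str, occurrence_pk: int) -> str:
    """Build an occurrenceID from a project slug and a primary key."""
    if not _SLUG_RE.match(project_slug):
        # A colon or space in the slug would make the URN ambiguous to parse.
        raise ValueError(f"unexpected project slug: {project_slug!r}")
    return f"urn:ami:occurrence:{project_slug}:{occurrence_pk}"


if __name__ == "__main__":
    print(occurrence_urn("panama-2024", 1532))
```

The identifier stays stable only as long as neither the slug nor the primary key changes, so renaming a project after publication would alter every occurrenceID.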
+ + + + + + + + + +
catalogNumber
Identifierhttp://rs.tdwg.org/dwc/terms/catalogNumber
DefinitionAn identifier (preferably unique) for the record within the data set or collection.
Comments
Examples
  • 145732
  • 145732a
  • 2008.1334
  • R-4313
+ + + + + + + + + +
recordNumber
Identifierhttp://rs.tdwg.org/dwc/terms/recordNumber
DefinitionAn identifier given to the dwc:Occurrence at the time it was recorded. Often serves as a link between field notes and a dwc:Occurrence record, such as a specimen collector's number.
CommentsThis term has an equivalent in the dwciri: namespace that allows only an IRI as a value, whereas this term allows for any string literal value.
ExamplesOPP 7101
+ + + + + + + + + +
recordedBy
Identifierhttp://rs.tdwg.org/dwc/terms/recordedBy
DefinitionA list (concatenated and separated) of names of people, groups, or organizations responsible for recording the original dwc:Occurrence. The primary collector or observer, especially one who applies a personal identifier (dwc:recordNumber), should be listed first.
CommentsRecommended best practice is to separate the values in a list with space vertical bar space ( | ). This term has an equivalent in the dwciri: namespace that allows only an IRI as a value, whereas this term allows for any string literal value.
Examples
  • José E. Crespo
  • Oliver P. Pearson | Anita K. Pearson (where the value in recordNumber OPP 7101 corresponds to the collector number for the specimen in the field catalog of Oliver P. Pearson)
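
The "space vertical bar space" convention recommended above takes only a few lines to implement. The helpers below are a hedged sketch with hypothetical names (`join_values`, `split_values`); they assume that no individual name contains a vertical bar.

```python
SEPARATOR = " | "


def join_values(values: list[str]) -> str:
    """Concatenate non-empty values with the recommended ' | ' separator."""
    cleaned = [v.strip() for v in values if v and v.strip()]
    return SEPARATOR.join(cleaned)


def split_values(text: str) -> list[str]:
    """Split a concatenated field back into its individual values."""
    return [part.strip() for part in text.split("|") if part.strip()]


if __name__ == "__main__":
    print(join_values(["Oliver P. Pearson", "Anita K. Pearson"]))
```

List order is preserved, which matters for this term because the primary collector or observer should come first.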
+ + + + + + + + + +
recordedByID
Identifierhttp://rs.tdwg.org/dwc/terms/recordedByID
DefinitionA list (concatenated and separated) of the globally unique identifier for the person, people, groups, or organizations responsible for recording the original dwc:Occurrence.
CommentsRecommended best practice is to provide a single identifier that disambiguates the details of the identifying agent. If a list is used, it is recommended to separate the values in the list with space vertical bar space ( | ). The order of the identifiers on any list for this term can not be guaranteed to convey any semantics.
Examples
+ + + + + + + + + +
individualCount
Identifierhttp://rs.tdwg.org/dwc/terms/individualCount
DefinitionThe number of individuals present at the time of the dwc:Occurrence.
Comments
Examples
  • 0
  • 1
  • 25
+ + + + + + + + + +
organismQuantity
Identifierhttp://rs.tdwg.org/dwc/terms/organismQuantity
DefinitionA number or enumeration value for the quantity of dwc:Organisms.
CommentsA dwc:organismQuantity must have a corresponding dwc:organismQuantityType.
Examples
  • 27 (organismQuantity) with individuals (organismQuantityType)
  • 12.5 (organismQuantity) with % biomass (organismQuantityType)
  • r (organismQuantity) with Braun-Blanquet Scale (organismQuantityType)
  • many (organismQuantity) with individuals (organismQuantityType)
+ + + + + + + + + +
organismQuantityType
Identifierhttp://rs.tdwg.org/dwc/terms/organismQuantityType
DefinitionThe type of quantification system used for the quantity of dwc:Organisms.
CommentsA dwc:organismQuantityType must have a corresponding dwc:organismQuantity. This term has an equivalent in the dwciri: namespace that allows only an IRI as a value, whereas this term allows for any string literal value.
Examples
  • 27 (organismQuantity) with individuals (organismQuantityType)
  • 12.5 (organismQuantity) with % biomass (organismQuantityType)
  • r (organismQuantity) with Braun-Blanquet Scale (organismQuantityType)
  • many (organismQuantity) with individuals (organismQuantityType)
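
Since dwc:organismQuantity and dwc:organismQuantityType must accompany each other, a row-level check can catch half-filled pairs before export. The following is a sketch only; it assumes each row is a `dict` keyed by term name, and `quantity_pair_errors` is a hypothetical helper, not part of the validator in this PR.

```python
def quantity_pair_errors(row: dict) -> list[str]:
    """Return messages for a row where only one of the paired terms is set."""
    quantity = (row.get("organismQuantity") or "").strip()
    quantity_type = (row.get("organismQuantityType") or "").strip()
    errors = []
    if quantity and not quantity_type:
        errors.append("organismQuantity is set but organismQuantityType is empty")
    if quantity_type and not quantity:
        errors.append("organismQuantityType is set but organismQuantity is empty")
    return errors


if __name__ == "__main__":
    print(quantity_pair_errors({"organismQuantity": "27"}))
```

A row with both terms empty passes, because the pairing rule only applies once one of the two is present.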
+ + + + + + + + + +
sex
Identifierhttp://rs.tdwg.org/dwc/terms/sex
DefinitionThe sex of the biological individual(s) represented in the dwc:Occurrence.
CommentsRecommended best practice is to use a controlled vocabulary. This term has an equivalent in the dwciri: namespace that allows only an IRI as a value, whereas this term allows for any string literal value.
Examples
  • female
  • male
  • hermaphrodite
+ + + + + + + + + +
lifeStage
Identifierhttp://rs.tdwg.org/dwc/terms/lifeStage
DefinitionThe age class or life stage of the dwc:Organism(s) at the time the dwc:Occurrence was recorded.
CommentsRecommended best practice is to use a controlled vocabulary. This term has an equivalent in the dwciri: namespace that allows only an IRI as a value, whereas this term allows for any string literal value.
Examples
  • zygote
  • larva
  • juvenile
  • adult
  • seedling
  • flowering
  • fruiting
+ + + + + + + + + +
reproductiveCondition
Identifierhttp://rs.tdwg.org/dwc/terms/reproductiveCondition
DefinitionThe reproductive condition of the biological individual(s) represented in the dwc:Occurrence.
CommentsRecommended best practice is to use a controlled vocabulary. This term has an equivalent in the dwciri: namespace that allows only an IRI as a value, whereas this term allows for any string literal value.
Examples
  • non-reproductive
  • pregnant
  • in bloom
  • fruit-bearing
+ + + + + + + + + +
caste
Identifierhttp://rs.tdwg.org/dwc/terms/caste
DefinitionCategorisation of individuals for eusocial species (including some mammals and arthropods).
CommentsRecommended best practice is to use a controlled vocabulary that aligns best with the dwc:Taxon. This term has an equivalent in the dwciri: namespace that allows only an IRI as a value, whereas this term allows for any string literal value.
Examples
  • queen
  • male alate
  • intercaste
  • minor worker
  • soldier
  • ergatoid
+ + + + + + + + + +
behavior
Identifierhttp://rs.tdwg.org/dwc/terms/behavior
DefinitionThe behavior shown by the subject at the time the dwc:Occurrence was recorded.
CommentsThis term has an equivalent in the dwciri: namespace that allows only an IRI as a value, whereas this term allows for any string literal value.
Examples
  • roosting
  • foraging
  • running
+ + + + + + + + + +
vitality
Identifierhttp://rs.tdwg.org/dwc/terms/vitality
DefinitionAn indication of whether a dwc:Organism was alive or dead at the time of collection or observation.
CommentsRecommended best practice is to use a controlled vocabulary. Intended to be used with records having a dwc:basisOfRecord of PreservedSpecimen, MaterialEntity, MaterialSample, or HumanObservation. This term has an equivalent in the dwciri: namespace that allows only an IRI as a value, whereas this term allows for any string literal value.
Examples
  • alive
  • dead
  • mixedLot
  • uncertain
  • notAssessed
+ + + + + + + + + +
establishmentMeans
Identifierhttp://rs.tdwg.org/dwc/terms/establishmentMeans
DefinitionStatement about whether a dwc:Organism has been introduced to a given place and time through the direct or indirect activity of modern humans.
CommentsRecommended best practice is to use controlled value strings from the controlled vocabulary designated for use with this term, listed at http://rs.tdwg.org/dwc/doc/em/. For details, refer to https://doi.org/10.3897/biss.3.38084. This term has an equivalent in the dwciri: namespace that allows only an IRI as a value, whereas this term allows for any string literal value.
Examples
  • native
  • nativeReintroduced
  • introduced
  • introducedAssistedColonisation
  • vagrant
  • uncertain
+ + + + + + + + + +
degreeOfEstablishment
Identifierhttp://rs.tdwg.org/dwc/terms/degreeOfEstablishment
DefinitionThe degree to which a dwc:Organism survives, reproduces, and expands its range at the given place and time.
CommentsRecommended best practice is to use controlled value strings from the controlled vocabulary designated for use with this term, listed at http://rs.tdwg.org/dwc/doc/doe/. For details, refer to https://doi.org/10.3897/biss.3.38084. This term has an equivalent in the dwciri: namespace that allows only an IRI as a value, whereas this term allows for any string literal value.
Examples
  • native
  • captive
  • cultivated
  • released
  • failing
  • casual
  • reproducing
  • established
  • colonising
  • invasive
  • widespreadInvasive
+ + + + + + + + + +
pathway
Identifierhttp://rs.tdwg.org/dwc/terms/pathway
DefinitionThe process by which a dwc:Organism came to be in a given place at a given time.
CommentsRecommended best practice is to use controlled value strings from the controlled vocabulary designated for use with this term, listed at http://rs.tdwg.org/dwc/doc/pw/. For details, refer to https://doi.org/10.3897/biss.3.38084. This term has an equivalent in the dwciri: namespace that allows only an IRI as a value, whereas this term allows for any string literal value.
Examples
  • releasedForUse
  • otherEscape
  • transportContaminant
  • transportStowaway
  • corridor
  • unaided
+ + + + + + + + + +
georeferenceVerificationStatus
Identifierhttp://rs.tdwg.org/dwc/terms/georeferenceVerificationStatus
DefinitionA categorical description of the extent to which the georeference has been verified to represent the best possible spatial description for the dcterms:Location of the dwc:Occurrence.
CommentsRecommended best practice is to use a controlled vocabulary. This term has an equivalent in the dwciri: namespace that allows only an IRI as a value, whereas this term allows for any string literal value.
Examples
  • unable to georeference
  • requires georeference
  • requires verification
  • verified by data custodian
  • verified by contributor
+ + + + + + + + + +
occurrenceStatus
Identifierhttp://rs.tdwg.org/dwc/terms/occurrenceStatus
DefinitionA statement about the presence or absence of a dwc:Taxon at a dcterms:Location.
CommentsFor dwc:Occurrences, the default vocabulary is recommended to consist of present and absent, but can be extended by implementers with good justification. This term has an equivalent in the dwciri: namespace that allows only an IRI as a value, whereas this term allows for any string literal value.
Examples
  • present
  • absent
+ + + + + + + + + +
associatedMedia
Identifierhttp://rs.tdwg.org/dwc/terms/associatedMedia
DefinitionA list (concatenated and separated) of identifiers (publication, global unique identifier, URI) of media associated with the dwc:Occurrence.
Comments
Exampleshttps://arctos.database.museum/media/10520962 | https://arctos.database.museum/media/10520964
+ + + + + + + + + +
associatedOccurrences
Identifierhttp://rs.tdwg.org/dwc/terms/associatedOccurrences
DefinitionA list (concatenated and separated) of identifiers of other dwc:Occurrence records and their associations to this dwc:Occurrence.
CommentsThis term can be used to provide a list of associations to other dwc:Occurrences. Note that the dwc:ResourceRelationship class is an alternative means of representing associations, and with more detail. Recommended best practice is to separate the values in a list with space vertical bar space ( | ).
Examples
+ + + + + + + + + +
associatedReferences
Identifierhttp://rs.tdwg.org/dwc/terms/associatedReferences
DefinitionA list (concatenated and separated) of identifiers (publication, bibliographic reference, global unique identifier, URI) of literature associated with the dwc:Occurrence.
CommentsRecommended best practice is to separate the values in a list with space vertical bar space ( | ). Note that the dwc:ResourceRelationship class is an alternative means of representing associations, and with more detail. Note also that the intended usage of the term dcterms:references in Darwin Core when applied to a dwc:Occurrence is to point to the definitive source representation of that dwc:Occurrence if one is available. Note also that the intended usage of dcterms:bibliographicCitation in Darwin Core when applied to a dwc:Occurrence is to provide the preferred way to cite the dwc:Occurrence itself.
Examples
  • http://www.sciencemag.org/cgi/content/abstract/322/5899/261
  • Christopher J. Conroy, Jennifer L. Neuwald. 2008. Phylogeographic study of the California vole, Microtus californicus Journal of Mammalogy, 89(3):755-767.
  • Steven R. Hoofer and Ronald A. Van Den Bussche. 2001. Phylogenetic Relationships of Plecotine Bats and Allies Based on Mitochondrial Ribosomal Sequences. Journal of Mammalogy 82(1):131-137. | Walker, Faith M., Jeffrey T. Foster, Kevin P. Drees, Carol L. Chambers. 2014. Spotted bat (Euderma maculatum) microsatellite discovery using illumina sequencing. Conservation Genetics Resources.
+ + + + + + + + + +
associatedTaxa
Identifierhttp://rs.tdwg.org/dwc/terms/associatedTaxa
DefinitionA list (concatenated and separated) of identifiers or names of dwc:Taxon records and the associations of this dwc:Occurrence to each of them.
CommentsThis term can be used to provide a list of associations to dwc:Taxon records other than the one defined in the dwc:Occurrence. Note that the dwc:ResourceRelationship class is an alternative means of representing associations, and with more detail. This term is not apt for establishing relationships between dwc:Taxon records, only between specific dwc:Occurrences of a dwc:Organism with other dwc:Taxon records. Recommended best practice is to separate the values in a list with space vertical bar space ( | ).
Examples
  • "host":"Quercus alba"
  • "host":"gbif.org/species/2879737"
  • "parasitoid of":"Cyclocephala signaticollis" | "predator of":"Apis mellifera"
+ + + + + + + + + +
otherCatalogNumbers
Identifierhttp://rs.tdwg.org/dwc/terms/otherCatalogNumbers
DefinitionA list (concatenated and separated) of previous or alternate fully qualified catalog numbers or other human-used identifiers for the same dwc:Occurrence, whether in the current or any other data set or collection.
CommentsRecommended best practice is to separate the values in a list with space vertical bar space ( | ).
Examples
  • FMNH:Mammal:1234
  • NPS YELLO6778 | MBG 33424
+ + + + + + + + + +
occurrenceRemarks
Identifierhttp://rs.tdwg.org/dwc/terms/occurrenceRemarks
DefinitionComments or notes about the dwc:Occurrence.
Comments
Examplesfound dead on road
+ + +## Organism + + + + + + + + + + + +
Organism Class
Identifierhttp://rs.tdwg.org/dwc/terms/Organism
DefinitionA particular organism or defined group of organisms considered to be taxonomically homogeneous.
CommentsInstances of the dwc:Organism class are intended to facilitate linking one or more dwc:Identification instances to one or more dwc:Occurrence instances. Therefore, things that are typically assigned scientific names (such as viruses, hybrids, and lichens) and aggregates whose dwc:Occurrences are typically recorded (such as packs, clones, and colonies) are included in the scope of this class.
Examples
  • a specific bird
  • a specific wolf pack
  • a specific instance of a bacterial culture
+ + + + + + + + + + +
organismID
Identifierhttp://rs.tdwg.org/dwc/terms/organismID
DefinitionAn identifier for the dwc:Organism instance (as opposed to a particular digital record of the dwc:Organism). May be a globally unique identifier or an identifier specific to the data set.
Comments
Exampleshttp://arctos.database.museum/guid/WNMU:Mamm:1249
+ + + + + + + + + +
organismName
Identifierhttp://rs.tdwg.org/dwc/terms/organismName
DefinitionA textual name or label assigned to a dwc:Organism instance.
Comments
Examples
  • Huberta
  • Boab Prison Tree
  • J pod
+ + + + + + + + + +
organismScope
Identifierhttp://rs.tdwg.org/dwc/terms/organismScope
DefinitionA description of the kind of dwc:Organism instance. Can be used to indicate whether the dwc:Organism instance represents a discrete organism or if it represents a particular type of aggregation.
CommentsRecommended best practice is to use a controlled vocabulary. This term is not intended to be used to specify a type of dwc:Taxon. To describe the kind of dwc:Organism using a URI object in RDF, use rdf:type (http://www.w3.org/1999/02/22-rdf-syntax-ns#type) instead.
Examples
  • multicellular organism
  • virus
  • clone
  • pack
  • colony
+ + + + + + + + + +
causeOfDeath
Identifierhttp://rs.tdwg.org/dwc/terms/causeOfDeath
DefinitionAn indication of the known or suspected cause of death of a dwc:Organism.
CommentsThe cause may be due to natural causes (e.g., disease, predation), human-related activities (e.g., roadkill, pollution), or other environmental factors (e.g., extreme weather events).
Examples
  • trap
  • poison
  • starvation
  • drowning
  • shooting
  • old age
  • vehicle collision
  • disease
  • herbicide
  • burning
  • infanticide
+ + + + + + + + + +
associatedOrganisms
Identifierhttp://rs.tdwg.org/dwc/terms/associatedOrganisms
DefinitionA list (concatenated and separated) of identifiers of other dwc:Organisms and the associations of this dwc:Organism to each of them.
CommentsThis term can be used to provide a list of associations to other dwc:Organisms. Note that the dwc:ResourceRelationship class is an alternative means of representing associations, and with more detail. Recommended best practice is to separate the values in a list with space vertical bar space ( | ).
Examples
+ + + + + + + + + +
previousIdentifications
Identifierhttp://rs.tdwg.org/dwc/terms/previousIdentifications
DefinitionA list (concatenated and separated) of previous assignments of names to the dwc:Organism.
CommentsRecommended best practice is to separate the values in a list with space vertical bar space ( | ).
Examples
  • Chalepidae
  • Pinus abies
  • Anthus sp., field ID by G. Iglesias | Anthus correndera, expert ID by C. Cicero 2009-02-12 based on morphology
+ + + + + + + + + +
organismRemarks
Identifierhttp://rs.tdwg.org/dwc/terms/organismRemarks
DefinitionComments or notes about the dwc:Organism instance.
Comments
ExamplesOne of a litter of six

## MaterialEntity

MaterialEntity Class
Identifierhttp://rs.tdwg.org/dwc/terms/MaterialEntity
DefinitionAn entity that can be identified, exists for some period of time, and consists in whole or in part of physical matter while it exists.
CommentsThe term is defined at the most general level to admit descriptions of any subtype of material entity within the scope of Darwin Core. In particular, any kind of material sample, preserved specimen, fossil, or exemplar from living collections is intended to be subsumed under this term.
Examples
  • an instance of a fossil
  • an instance of a herbarium sheet with its attached plant specimen
  • a particular part of the plant-derived material affixed to a herbarium sheet
  • an instance of a frozen tissue sample
  • a specific water sample
  • an instance of a meteorite fragment
  • a particular wolf in a zoo
  • a particular pack of wolves in the wild
  • an isolated molecule of DNA
  • a specific deep-frozen DNA sample
  • a particular field notebook
  • a particular paper page from a field notebook
  • an instance of a printed photograph
+ + + + + + + + + + +
materialEntityID
Identifierhttp://rs.tdwg.org/dwc/terms/materialEntityID
DefinitionAn identifier for a particular instance of a dwc:MaterialEntity.
CommentsValues of dwc:materialEntityID are intended to uniquely and persistently identify a particular dwc:MaterialEntity within some context. Examples of context include a particular sample collection, an organization, or the worldwide scale. Recommended best practice is to use a persistent, globally unique identifier. The identifier is bound to a physical object (the dwc:MaterialEntity) as opposed to a particular digital record (representation) of that physical object.
Examples06809dc5-f143-459a-be1a-6f03e63fc083
+ + + + + + + + + +
digitalSpecimenID
Identifierhttp://rs.tdwg.org/dwc/terms/digitalSpecimenID
DefinitionAn identifier for a particular instance of a Digital Specimen.
CommentsA Digital Specimen is defined in https://doi.org/10.3897/rio.7.e67379. A dwc:digitalSpecimenID is intended to uniquely and persistently identify a Digital Specimen. Recommended best practice is to use a DOI with machine readable metadata in the DOI record that uses a community agreed metadata profile (also known as FDO profile) for a Digital Specimen. For an example see: https://doi.org/10.3535/N75-CR4-0SM?noredirect. The identifier is for a digital information artifact (the Digital Specimen) as opposed to an identifier for a specific instance of a dwc:MaterialEntity.
Examples
+ + + + + + + + + +
materialEntityType
Identifierhttp://rs.tdwg.org/dwc/terms/materialEntityType
DefinitionA category that best matches the nature of a dwc:MaterialEntity.
CommentsA more generic classification of a dwc:MaterialEntity than dwc:preparations. Recommended best practice is to use a controlled vocabulary. This term has an equivalent in the dwciri: namespace that allows only an IRI as a value, whereas this term allows for any string literal value.
Examples
  • Macro-object
  • Micro-object
  • Oversized object
  • Cut/polished gemstone
  • Compound Specimen
  • Core
  • Mixed Materials
  • Environmental sample
  • Microscope slide
  • Spore print
  • Macrofossil
  • Mesofossil
  • Microfossil
  • Pinned object/specimen
  • Taxidermy mount
  • Blood sampling cards
  • Oversized fossil
  • Anthropogenic Artifact
+ + + + + + + + + +
discipline
Identifier: http://rs.tdwg.org/dwc/terms/discipline
Definition: The primary branch or branches of knowledge represented by the record.
Comments: This term can be used to classify records according to branches of knowledge. Recommended best practice is to use a controlled vocabulary and to separate the values in a list with space vertical bar space ( | ). This term has an equivalent in the dwciri: namespace that allows only an IRI as a value, whereas this term allows for any string literal value. It is also recommended to use this field to describe specimenType in MIDS.
Examples
  • Botany
  • Botany | Virology | Taxonomy
+ + + + + + + + + +
preparations
Identifierhttp://rs.tdwg.org/dwc/terms/preparations
DefinitionA list (concatenated and separated) of preparations and preservation methods for a dwc:MaterialEntity.
CommentsRecommended best practice is to separate the values in a list with space vertical bar space ( | ). This term has an equivalent in the dwciri: namespace that allows only an IRI as a value, whereas this term allows for any string literal value.
Examples
  • fossil
  • cast
  • photograph
  • DNA extract
  • skin | skull | skeleton
  • whole animal (EtOH) | tissue (EDTA)
+ + + + + + + + + +
disposition
Identifierhttp://rs.tdwg.org/dwc/terms/disposition
DefinitionThe current state of a dwc:MaterialEntity with respect to a collection.
CommentsRecommended best practice is to use a controlled vocabulary. This term has an equivalent in the dwciri: namespace that allows only an IRI as a value, whereas this term allows for any string literal value.
Examples
  • in collection
  • missing
  • on loan
  • used up
  • destroyed
  • deaccessioned
+ + + + + + + + + +
verbatimLabel
Identifierhttp://rs.tdwg.org/dwc/terms/verbatimLabel
DefinitionThe content of this term should include no embellishments, prefixes, headers or other additions made to the text. Abbreviations must not be expanded and supposed misspellings must not be corrected. Lines or breakpoints between blocks of text that could be verified by seeing the original labels or images of them may be used. Examples of material entities include preserved specimens, fossil specimens, and material samples. Best practice is to use UTF-8 for all characters. Best practice is to add comment “verbatimLabel derived from human transcription” in dwc:occurrenceRemarks.
CommentsExamples can be found at https://dwc.tdwg.org/examples/verbatimLabel.
Examples
+ + + + + + + + + +
associatedSequences
Identifierhttp://rs.tdwg.org/dwc/terms/associatedSequences
DefinitionA list (concatenated and separated) of identifiers (publication, global unique identifier, URI) of genetic sequence information associated with the dwc:MaterialEntity.
Comments
Examples
+ + + + + + + + + +
materialEntityRemarks
Identifierhttp://rs.tdwg.org/dwc/terms/materialEntityRemarks
DefinitionComments or notes about the dwc:MaterialEntity instance.
Comments
Examples
  • found in association with charred remains
  • some original fragments missing

## MaterialSample

MaterialSample Class
Identifierhttp://rs.tdwg.org/dwc/terms/MaterialSample
DefinitionA material entity that represents an entity of interest in whole or in part.
Comments
Examples
  • a whole organism preserved in a collection
  • a part of an organism isolated for some purpose
  • a soil sample
  • a marine microbial sample
+ + + + + + + + + + +
materialSampleID
Identifierhttp://rs.tdwg.org/dwc/terms/materialSampleID
DefinitionAn identifier for the dwc:MaterialSample (as opposed to a particular digital record of the dwc:MaterialSample). In the absence of a persistent global unique identifier, construct one from a combination of identifiers in the record that will most closely make the dwc:materialSampleID globally unique.
CommentsRecommended best practice is to use a persistent, globally unique identifier.
Examples06809dc5-f143-459a-be1a-6f03e63fc083

## Event

Event Class
Identifierhttp://rs.tdwg.org/dwc/terms/Event
DefinitionAn action that occurs at some location during some time.
Comments
Examples
  • a specimen collecting event
  • a camera trap image capture
  • a marine trawl
+ + + + + + + + + + +
eventID
Identifierhttp://rs.tdwg.org/dwc/terms/eventID
DefinitionAn identifier for the set of information associated with a dwc:Event (something that occurs at a place and time). May be a global unique identifier or an identifier specific to the data set.
Comments
ExamplesINBO:VIS:Ev:00009375
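
The decisions listed earlier in this plan settle on URN-style identifiers of the form `urn:ami:event:{project_slug}:{id}`. A minimal sketch of building such an eventID follows. The function name `make_event_id` and the assumed slug shape (lowercase letters, digits, and single hyphens) are illustrative assumptions, not the exporter's actual code.

```python
import re

# Assumed slug shape: lowercase alphanumerics separated by single hyphens.
_SLUG_PATTERN = r"[a-z0-9]+(?:-[a-z0-9]+)*"


def make_event_id(project_slug: str, event_pk: int) -> str:
    """Build an eventID following the urn:ami:event:{project_slug}:{id} pattern."""
    if not re.fullmatch(_SLUG_PATTERN, project_slug):
        raise ValueError(f"unexpected project slug: {project_slug!r}")
    return f"urn:ami:event:{project_slug}:{event_pk}"


print(make_event_id("demo-project", 42))
```

Rejecting unexpected slugs keeps colons and whitespace out of the identifier, so each URN component stays unambiguous.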
+ + + + + + + + + +
parentEventID
Identifierhttp://rs.tdwg.org/dwc/terms/parentEventID
DefinitionAn identifier for the broader dwc:Event that groups this and potentially other dwc:Events.
CommentsUse a globally unique identifier for a dwc:Event or an identifier for a dwc:Event that is specific to the data set.
ExamplesA1 (parentEventID to identify the main Whittaker Plot in nested samples, each with its own eventID - A1:1, A1:2).
+ + + + + + + + + +
eventType
Identifierhttp://rs.tdwg.org/dwc/terms/eventType
DefinitionThe nature of the dwc:Event.
CommentsRecommended best practice is to use a controlled vocabulary. Regardless of the dwc:eventType, the interval of the dwc:Event can be captured in dwc:eventDate. This term has an equivalent in the dwciri: namespace that allows only an IRI as a value, whereas this term allows for any string literal value.
Examples
  • Sample
  • Observation
  • Site Visit
  • Biotic Interaction
  • Bioblitz
  • Expedition
  • Survey
  • Project
+ + + + + + + + + +
fieldNumber
Identifierhttp://rs.tdwg.org/dwc/terms/fieldNumber
DefinitionAn identifier given to the dwc:Event in the field. Often serves as a link between field notes and the dwc:Event.
CommentsThis term has an equivalent in the dwciri: namespace that allows only an IRI as a value, whereas this term allows for any string literal value.
ExamplesRV Sol 87-03-08
+ + + + + + + + + +
projectTitle
Identifierhttp://rs.tdwg.org/dwc/terms/projectTitle
DefinitionA list (concatenated and separated) of titles or names for projects that contributed to a dwc:Event.
CommentsUse this term to provide the official name or title of a project as it is commonly known and cited. Avoid abbreviations unless they are widely understood. The recommended best practice is to separate the values in a list with space vertical bar space ( | ).
Examples
  • The Nansen Legacy
  • Scalidophora i Noreg
  • Arctic Deep
+ + + + + + + + + +
projectID
Identifierhttp://rs.tdwg.org/dwc/terms/projectID
DefinitionA list (concatenated and separated) of identifiers for projects that contributed to a dwc:Event.
CommentsA projectID may be shared in multiple distinct datasets. The nature of the association can be described in the metadata project description element. This term should be used to provide a globally unique identifier (GUID) for a project, if available. This could be a DOI, URI, or any other persistent identifier that ensures a project can be uniquely distinguished from others. The recommended best practice is to separate the values in a list with space vertical bar space ( | ).
Examples
+ + + + + + + + + +
fundingAttribution
Identifierhttp://rs.tdwg.org/ac/terms/fundingAttribution
DefinitionText description of organizations or individuals who funded the creation of the resource.
CommentsSpecify the full official name of the funding body. This should include the complete name without abbreviations, unless the abbreviation is an official and commonly recognized form (e.g., NSF for the National Science Foundation). The recommended best practice is to separate the values in a list with space vertical bar space ( | ).
Examples
  • Norges forskningsråd
  • Artsdatabanken
  • Ocean Census | Nippon Foundation
+ + + + + + + + + +
fundingAttributionID
Identifierhttp://rs.tdwg.org/dwc/terms/fundingAttributionID
DefinitionA list (concatenated and separated) of the globally unique identifiers for the funding organizations or agencies that supported the project.
CommentsProvide a unique identifier for the funding body, such as an identifier used in governmental or international databases. If no official identifier exists, use a persistent and unique identifier within your organization or dataset. The recommended best practice is to separate the values in a list with space vertical bar space ( | ).
Examples
+ + + + + + + + + +
eventDate
Identifierhttp://rs.tdwg.org/dwc/terms/eventDate
DefinitionThe date-time or interval during which a dwc:Event occurred. For occurrences, this is the date-time when the dwc:Event was recorded. Not suitable for a time in a geological context.
CommentsRecommended best practice is to use a date that conforms to ISO 8601-1:2019.
Examples
  • 1963-03-08T14:07-06:00 (8 Mar 1963 at or after 2:07pm and before 2:08pm in the time zone six hours earlier than UTC)
  • 2009-02-20T08:40Z (20 February 2009 at or after 8:40am and before 8:41 UTC)
  • 2018-08-29T15:19 (29 August 2018 at or after 3:19pm and before 3:20pm local time)
  • 1809-02-12 (within the day 12 February 1809)
  • 1906-06 (in the month of June 1906)
  • 1971 (in the year 1971)
  • 2007-03-01T13:00:00Z/2008-05-11T15:30:00Z (some time within the interval beginning 1 March 2007 at 1pm UTC and before 11 May 2008 at 3:30pm UTC)
  • 1900/1909 (some time within the interval between the beginning of the year 1900 and before the year 1910)
  • 2007-11-13/15 (some time in the interval between the beginning of 13 November 2007 and before 16 November 2007)
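
For a monitoring session with a known start and end, the interval form shown in the examples can be produced with the standard library. This is a sketch under the assumption that the exporter has timezone-aware `datetime` values available; `format_event_date` is a hypothetical helper name.

```python
from datetime import datetime, timezone


def format_event_date(start, end=None):
    """Return an ISO 8601 date-time, or a start/end interval joined by '/'."""
    if end is None or end == start:
        return start.isoformat()
    if end < start:
        raise ValueError("end must not be earlier than start")
    return f"{start.isoformat()}/{end.isoformat()}"


start = datetime(2007, 3, 1, 13, 0, 0, tzinfo=timezone.utc)
end = datetime(2008, 5, 11, 15, 30, 0, tzinfo=timezone.utc)
print(format_event_date(start, end))
```

Python's `isoformat()` writes the UTC offset as `+00:00` rather than `Z`. Both are valid ISO 8601 designations for UTC.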
+ + + + + + + + + +
eventTime
Identifierhttp://rs.tdwg.org/dwc/terms/eventTime
DefinitionThe time or interval during which a dwc:Event occurred.
CommentsRecommended best practice is to use a time of day that conforms to ISO 8601-1:2019.
Examples
  • 14:07-06:00 (at or after 2:07pm and before 2:08pm in the time zone six hours earlier than UTC)
  • 08:40:21Z (at or after 8:40:21am and before 8:40:22am UTC)
  • 13:00:00Z/15:30:00Z (at or after 1pm and before 3:30pm UTC)
+ + + + + + + + + +
startDayOfYear
Identifierhttp://rs.tdwg.org/dwc/terms/startDayOfYear
DefinitionThe earliest integer day of the year on which the dwc:Event occurred (1 for January 1, 365 for December 31, except in a leap year, in which case it is 366).
Comments
Examples
  • 1 (1 January)
  • 366 (31 December)
  • 365 (30 December in a leap year, 31 December in a non-leap year)
+ + + + + + + + + +
endDayOfYear
Identifierhttp://rs.tdwg.org/dwc/terms/endDayOfYear
DefinitionThe latest integer day of the year on which the dwc:Event occurred (1 for January 1, 365 for December 31, except in a leap year, in which case it is 366).
Comments
Examples
  • 1 (1 January)
  • 32 (1 February)
  • 366 (31 December)
  • 365 (30 December in a leap year, 31 December in a non-leap year)
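
The day-of-year values in the examples above, including the leap-year cases, can be derived directly from a calendar date. A small sketch follows; `day_of_year` is a hypothetical helper name.

```python
from datetime import date


def day_of_year(d: date) -> int:
    """Return the integer day of the year (1 for 1 January)."""
    return d.timetuple().tm_yday


# 2024 is a leap year, 2023 is not.
for d in (date(2023, 1, 1), date(2024, 2, 1), date(2024, 12, 30),
          date(2024, 12, 31), date(2023, 12, 31)):
    print(d.isoformat(), day_of_year(d))
```

For an event spanning several days, `startDayOfYear` would take the value for the first date and `endDayOfYear` the value for the last.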
+ + + + + + + + + +
year
Identifierhttp://rs.tdwg.org/dwc/terms/year
DefinitionThe four-digit year in which the dwc:Event occurred, according to the Common Era Calendar.
Comments
Examples
  • 1160
  • 2008
+ + + + + + + + + +
month
Identifierhttp://rs.tdwg.org/dwc/terms/month
DefinitionThe integer month in which the dwc:Event occurred.
Comments
Examples
  • 1 (January)
  • 10 (October)
+ + + + + + + + + +
day
Identifierhttp://rs.tdwg.org/dwc/terms/day
DefinitionThe integer day of the month on which the dwc:Event occurred.
Comments
Examples
  • 9
  • 28
+ + + + + + + + + +
verbatimEventDate
Identifierhttp://rs.tdwg.org/dwc/terms/verbatimEventDate
DefinitionThe verbatim original representation of the date and time information for a dwc:Event.
Comments
Examples
  • spring 1910
  • Marzo 2002
  • 1999-03-XX
  • 17IV1934
+ + + + + + + + + +
habitat
Identifierhttp://rs.tdwg.org/dwc/terms/habitat
DefinitionA category or description of the habitat in which the dwc:Event occurred.
CommentsThis term has an equivalent in the dwciri: namespace that allows only an IRI as a value, whereas this term allows for any string literal value.
Examples
  • oak savanna
  • pre-cordilleran steppe
+ + + + + + + + + +
samplingProtocol
Identifier: http://rs.tdwg.org/dwc/terms/samplingProtocol
Definition: The names of, references to, or descriptions of the methods or protocols used during a dwc:Event.
Comments: Recommended best practice is to describe a dwc:Event with no more than one sampling protocol. In the case of a summary Event with multiple protocols, in which a specific protocol cannot be attributed to specific dwc:Occurrences, the recommended best practice is to separate the values in a list with space vertical bar space ( | ). This term has an equivalent in the dwciri: namespace that allows only an IRI as a value, whereas this term allows for any string literal value.
Examples
+ + + + + + + + + +
sampleSizeValue
Identifierhttp://rs.tdwg.org/dwc/terms/sampleSizeValue
DefinitionA numeric value for a measurement of the size (time duration, length, area, or volume) of a sample in a sampling dwc:Event.
CommentsA dwc:sampleSizeValue must have a corresponding dwc:sampleSizeUnit.
Examples5 (sampleSizeValue) with metre (sampleSizeUnit)
+ + + + + + + + + +
sampleSizeUnit
Identifierhttp://rs.tdwg.org/dwc/terms/sampleSizeUnit
DefinitionThe unit of measurement of the size (time duration, length, area, or volume) of a sample in a sampling dwc:Event.
CommentsA dwc:sampleSizeUnit must have a corresponding dwc:sampleSizeValue, e.g., 5 for dwc:sampleSizeValue with m for dwc:sampleSizeUnit. This term has an equivalent in the dwciri: namespace that allows only an IRI as a value, whereas this term allows for any string literal value.
Examples
  • minute
  • hour
  • day
  • metre
  • square metre
  • cubic metre
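
Because `sampleSizeValue` and `sampleSizeUnit` must accompany each other, an exporter can enforce the pairing before writing a row. The sketch below is one way to do that; the function name and the choice to emit empty strings when neither is known are assumptions for illustration.

```python
def sample_size_fields(value=None, unit=None):
    """Return both sample-size columns, refusing a value without a unit or vice versa."""
    if (value is None) != (unit is None):
        raise ValueError("sampleSizeValue and sampleSizeUnit must be given together")
    if value is None:
        # Neither is known: leave both columns empty.
        return {"sampleSizeValue": "", "sampleSizeUnit": ""}
    return {"sampleSizeValue": str(value), "sampleSizeUnit": unit}


print(sample_size_fields(5, "metre"))
```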
+ + + + + + + + + +
samplingEffort
Identifierhttp://rs.tdwg.org/dwc/terms/samplingEffort
DefinitionThe amount of effort expended during a dwc:Event.
Comments
Examples
  • 40 trap-nights
  • 10 observer-hours
  • 10 km by foot
  • 30 km by car
+ + + + + + + + + +
fieldNotes
Identifierhttp://rs.tdwg.org/dwc/terms/fieldNotes
DefinitionOne of a) an indicator of the existence of, b) a reference to (publication, URI), or c) the text of notes taken in the field about the dwc:Event.
CommentsThis term has an equivalent in the dwciri: namespace that allows only an IRI as a value, whereas this term allows for any string literal value.
ExamplesNotes available in the Grinnell-Miller Library.
+ + + + + + + + + +
eventRemarks
Identifierhttp://rs.tdwg.org/dwc/terms/eventRemarks
DefinitionComments or notes about the dwc:Event.
Comments
ExamplesAfter the recent rains the river is nearly at flood stage.

## Location

Location Class
Identifierhttp://purl.org/dc/terms/Location
DefinitionA spatial region or named place.
Comments
Examples
  • the municipality of San Carlos de Bariloche, Río Negro, Argentina
  • the place defined by a georeference
+ + + + + + + + + + +
locationID
Identifierhttp://rs.tdwg.org/dwc/terms/locationID
DefinitionAn identifier for the set of dcterms:Location information. May be a global unique identifier or an identifier specific to the data set.
Comments
Exampleshttps://opencontext.org/subjects/768A875F-E205-4D0B-DE55-BAB7598D0FD1
+ + + + + + + + + +
higherGeographyID
Identifierhttp://rs.tdwg.org/dwc/terms/higherGeographyID
DefinitionAn identifier for the geographic region within which the dcterms:Location occurred.
CommentsRecommended best practice is to use a persistent identifier from a controlled vocabulary such as the Getty Thesaurus of Geographic Names.
Exampleshttp://vocab.getty.edu/tgn/1002002 (Antártida e Islas del Atlántico Sur, Territorio Nacional de la Tierra del Fuego, Argentina).
+ + + + + + + + + +
higherGeography
Identifierhttp://rs.tdwg.org/dwc/terms/higherGeography
DefinitionA list (concatenated and separated) of geographic names less specific than the information captured in the dwc:locality term.
CommentsRecommended best practice is to separate the values in a list with space vertical bar space ( | ), with terms in order from least specific to most specific.
Examples
  • North Atlantic Ocean
  • South America | Argentina | Patagonia | Parque Nacional Nahuel Huapi | Neuquén | Los Lagos (with accompanying values South America (continent), Argentina (country), Neuquén (first order division), and Los Lagos (second order division))
+ + + + + + + + + +
continent
Identifierhttp://rs.tdwg.org/dwc/terms/continent
DefinitionThe name of the continent in which the dcterms:Location occurs.
CommentsRecommended best practice is to use a controlled vocabulary such as the Getty Thesaurus of Geographic Names. Recommended best practice is to leave this field blank if the dcterms:Location spans multiple entities at this administrative level or if the dcterms:Location might be in one or another of multiple possible entities at this level. Multiplicity and uncertainty of the geographic entity can be captured either in the term dwc:higherGeography or in the term dwc:locality, or both.
Examples
  • Africa
  • Antarctica
  • Asia
  • Europe
  • North America
  • Oceania
  • South America
+ + + + + + + + + +
waterBody
Identifierhttp://rs.tdwg.org/dwc/terms/waterBody
DefinitionThe name of the water body in which the dcterms:Location occurs.
CommentsRecommended best practice is to use a controlled vocabulary such as the Getty Thesaurus of Geographic Names.
Examples
  • Indian Ocean
  • Baltic Sea
  • Hudson River
  • Lago Nahuel Huapi
+ + + + + + + + + +
islandGroup
Identifierhttp://rs.tdwg.org/dwc/terms/islandGroup
DefinitionThe name of the island group in which the dcterms:Location occurs.
CommentsRecommended best practice is to use a controlled vocabulary such as the Getty Thesaurus of Geographic Names.
Examples
  • Alexander Archipelago
  • Archipiélago Diego Ramírez
  • Seychelles
+ + + + + + + + + +
island
Identifierhttp://rs.tdwg.org/dwc/terms/island
DefinitionThe name of the island on or near which the dcterms:Location occurs.
CommentsRecommended best practice is to use a controlled vocabulary such as the Getty Thesaurus of Geographic Names.
Examples
  • Nosy Be
  • Bikini Atoll
  • Vancouver
  • Viti Levu
  • Zanzibar
+ + + + + + + + + +
country
Identifierhttp://rs.tdwg.org/dwc/terms/country
DefinitionThe name of the country or major administrative unit in which the dcterms:Location occurs.
CommentsRecommended best practice is to use a controlled vocabulary such as the Getty Thesaurus of Geographic Names. Recommended best practice is to leave this field blank if the dcterms:Location spans multiple entities at this administrative level or if the dcterms:Location might be in one or another of multiple possible entities at this level. Multiplicity and uncertainty of the geographic entity can be captured either in the term dwc:higherGeography or in the term dwc:locality, or both.
Examples
  • Denmark
  • Colombia
  • España
+ + + + + + + + + +
countryCode
Identifierhttp://rs.tdwg.org/dwc/terms/countryCode
DefinitionThe standard code for the country in which the dcterms:Location occurs.
CommentsRecommended best practice is to use an ISO 3166-1-alpha-2 country code, or ZZ (for an unknown location or a location unassignable to a single country code), or XZ (for the high seas beyond national jurisdictions).
Examples
  • AR
  • SV
  • XZ
  • ZZ
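
A cheap shape check can catch obviously malformed `countryCode` values before export. The sketch below only tests that a value is two uppercase ASCII letters. It does not check membership in the ISO 3166-1 list, and the helper name is hypothetical.

```python
import re

# ZZ and XZ are the two special-purpose codes described in the comments above.
SPECIAL_CODES = {
    "ZZ": "unknown or unassignable location",
    "XZ": "high seas",
}


def looks_like_country_code(code: str) -> bool:
    """Return True if the value has the two-uppercase-letter shape of an alpha-2 code."""
    return re.fullmatch(r"[A-Z]{2}", code) is not None


for candidate in ("AR", "XZ", "ar", "ARG"):
    print(candidate, looks_like_country_code(candidate))
```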
+ + + + + + + + + +
stateProvince
Identifierhttp://rs.tdwg.org/dwc/terms/stateProvince
DefinitionThe name of the next smaller administrative region than country (state, province, canton, department, region, etc.) in which the dcterms:Location occurs.
CommentsRecommended best practice is to use a controlled vocabulary such as the Getty Thesaurus of Geographic Names. Recommended best practice is to leave this field blank if the dcterms:Location spans multiple entities at this administrative level or if the dcterms:Location might be in one or another of multiple possible entities at this level. Multiplicity and uncertainty of the geographic entity can be captured either in the term dwc:higherGeography or in the term dwc:locality, or both.
Examples
  • Montana
  • Minas Gerais
  • Córdoba
+ + + + + + + + + +
county
Identifierhttp://rs.tdwg.org/dwc/terms/county
DefinitionThe full, unabbreviated name of the next smaller administrative region than stateProvince (county, shire, department, etc.) in which the dcterms:Location occurs.
CommentsRecommended best practice is to use a controlled vocabulary such as the Getty Thesaurus of Geographic Names. Recommended best practice is to leave this field blank if the dcterms:Location spans multiple entities at this administrative level or if the dcterms:Location might be in one or another of multiple possible entities at this level. Multiplicity and uncertainty of the geographic entity can be captured either in the term dwc:higherGeography or in the term dwc:locality, or both.
Examples
  • Missoula
  • Los Lagos
  • Mataró
+ + + + + + + + + +
municipality
Identifierhttp://rs.tdwg.org/dwc/terms/municipality
DefinitionThe full, unabbreviated name of the next smaller administrative region than county (city, municipality, etc.) in which the dcterms:Location occurs. Do not use this term for a nearby named place that does not contain the actual dcterms:Location.
CommentsRecommended best practice is to use a controlled vocabulary such as the Getty Thesaurus of Geographic Names. Recommended best practice is to leave this field blank if the dcterms:Location spans multiple entities at this administrative level or if the dcterms:Location might be in one or another of multiple possible entities at this level. Multiplicity and uncertainty of the geographic entity can be captured either in the term dwc:higherGeography or in the term dwc:locality, or both.
Examples
  • Holzminden
  • Araçatuba
  • Ga-Segonyana
+ + + + + + + + + +
locality
Identifierhttp://rs.tdwg.org/dwc/terms/locality
DefinitionThe specific description of the place.
CommentsLess specific geographic information can be provided in other geographic terms (dwc:higherGeography, dwc:continent, dwc:country, dwc:stateProvince, dwc:county, dwc:municipality, dwc:waterBody, dwc:island, dwc:islandGroup). This term may contain information modified from the original to correct perceived errors or standardize the description.
Examples
  • Bariloche, 25 km NNE via Ruta Nacional 40 (=Ruta 237)
  • Queets Rainforest, Olympic National Park
+ + + + + + + + + +
verbatimLocality
Identifierhttp://rs.tdwg.org/dwc/terms/verbatimLocality
DefinitionThe original textual description of the place.
Comments
Examples25 km NNE Bariloche por R. Nac. 237
+ + + + + + + + + +
minimumElevationInMeters
Identifierhttp://rs.tdwg.org/dwc/terms/minimumElevationInMeters
DefinitionThe lower limit of the range of elevation (altitude, usually above sea level), in meters.
Comments
Examples
  • -100
  • 802
+ + + + + + + + + +
maximumElevationInMeters
Identifierhttp://rs.tdwg.org/dwc/terms/maximumElevationInMeters
DefinitionThe upper limit of the range of elevation (altitude, usually above sea level), in meters.
Comments
Examples
  • -205
  • 1236
+ + + + + + + + + +
verbatimElevation
Identifierhttp://rs.tdwg.org/dwc/terms/verbatimElevation
DefinitionThe original description of the elevation (altitude, usually above sea level) of the Location.
Comments
Examples100-200 m
+ + + + + + + + + +
verticalDatum
Identifierhttp://rs.tdwg.org/dwc/terms/verticalDatum
DefinitionThe vertical datum used as the reference upon which the values in the elevation terms are based.
CommentsRecommended best practice is to use a controlled vocabulary. This term has an equivalent in the dwciri: namespace that allows only an IRI as a value, whereas this term allows for any string literal value.
Examples
  • EGM84
  • EGM96
  • EGM2008
  • PGM2000A
  • PGM2004
  • PGM2006
  • PGM2007
  • EPSG:7030
  • not recorded
+ + + + + + + + + +
minimumDepthInMeters
Identifierhttp://rs.tdwg.org/dwc/terms/minimumDepthInMeters
DefinitionThe lesser depth of a range of depth below the local surface, in meters.
Comments
Examples
  • 0
  • 100
+ + + + + + + + + +
maximumDepthInMeters
Identifierhttp://rs.tdwg.org/dwc/terms/maximumDepthInMeters
DefinitionThe greater depth of a range of depth below the local surface, in meters.
Comments
Examples
  • 0
  • 200
+ + + + + + + + + +
verbatimDepth
Identifierhttp://rs.tdwg.org/dwc/terms/verbatimDepth
DefinitionThe original description of the depth below the local surface.
Comments
Examples100-200 m
+ + + + + + + + + +
minimumDistanceAboveSurfaceInMeters
Identifierhttp://rs.tdwg.org/dwc/terms/minimumDistanceAboveSurfaceInMeters
DefinitionThe lesser distance in a range of distance from a reference surface in the vertical direction, in meters. Use positive values for locations above the surface, negative values for locations below. If depth measures are given, the reference surface is the location given by the depth, otherwise the reference surface is the location given by the elevation.
Comments
Examples
  • -1.5 (below the surface)
  • 4.2 (above the surface)
  • For a 1.5 meter sediment core from the bottom of a lake (at depth 20m) at 300m elevation: verbatimElevation: 300m minimumElevationInMeters: 300, maximumElevationInMeters: 300, verbatimDepth: 20m, minimumDepthInMeters: 20, maximumDepthInMeters: 20, minimumDistanceAboveSurfaceInMeters: 0, maximumDistanceAboveSurfaceInMeters: -1.5.
+ + + + + + + + + +
maximumDistanceAboveSurfaceInMeters
Identifierhttp://rs.tdwg.org/dwc/terms/maximumDistanceAboveSurfaceInMeters
DefinitionThe greater distance in a range of distance from a reference surface in the vertical direction, in meters. Use positive values for locations above the surface, negative values for locations below. If depth measures are given, the reference surface is the location given by the depth, otherwise the reference surface is the location given by the elevation.
Comments
Examples
  • -1.5 (below the surface)
  • 4.2 (above the surface)
  • For a 1.5 meter sediment core from the bottom of a lake (at depth 20m) at 300m elevation: verbatimElevation: 300m, minimumElevationInMeters: 300, maximumElevationInMeters: 300, verbatimDepth: 20m, minimumDepthInMeters: 20, maximumDepthInMeters: 20, minimumDistanceAboveSurfaceInMeters: 0, maximumDistanceAboveSurfaceInMeters: -1.5.
+ + + + + + + + + +
locationAccordingTo
Identifierhttp://rs.tdwg.org/dwc/terms/locationAccordingTo
DefinitionInformation about the source of this dcterms:Location information. Could be a publication (gazetteer), institution, or team of individuals.
CommentsThis term has an equivalent in the dwciri: namespace that allows only an IRI as a value, whereas this term allows for any string literal value.
Examples
  • Getty Thesaurus of Geographic Names
  • GADM
+ + + + + + + + + +
locationRemarks
Identifierhttp://rs.tdwg.org/dwc/terms/locationRemarks
DefinitionComments or notes about the dcterms:Location.
Comments
Examplesunder water since 2005
+ + + + + + + + + +
decimalLatitude
Identifierhttp://rs.tdwg.org/dwc/terms/decimalLatitude
DefinitionThe geographic latitude (in decimal degrees, using the spatial reference system given in dwc:geodeticDatum) of the geographic center of a dcterms:Location. Positive values are north of the Equator, negative values are south of it. Legal values lie between -90 and 90, inclusive.
Comments
Examples-41.0983423
+ + + + + + + + + +
decimalLongitude
Identifierhttp://rs.tdwg.org/dwc/terms/decimalLongitude
DefinitionThe geographic longitude (in decimal degrees, using the spatial reference system given in dwc:geodeticDatum) of the geographic center of a dcterms:Location. Positive values are east of the Greenwich Meridian, negative values are west of it. Legal values lie between -180 and 180, inclusive.
Comments
Examples-121.1761111
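
The legal ranges given for dwc:decimalLatitude and dwc:decimalLongitude can be checked before a row is written. The following is a minimal sketch, not the exporter's actual code: the function name `dwc_coordinates` and the choice to emit empty strings for unusable values are assumptions made for illustration.

```python
from typing import Optional, Tuple


def dwc_coordinates(lat: Optional[float], lon: Optional[float]) -> Tuple[str, str]:
    """Return (decimalLatitude, decimalLongitude) as strings.

    Hypothetical helper: yields two empty strings when either value is
    missing or falls outside the legal ranges stated in the definitions
    (-90..90 for latitude, -180..180 for longitude, both inclusive).
    """
    if lat is None or lon is None:
        return "", ""
    if not (-90.0 <= lat <= 90.0) or not (-180.0 <= lon <= 180.0):
        return "", ""
    # Seven decimal places matches the precision of the examples above.
    return f"{lat:.7f}", f"{lon:.7f}"
```

Emitting both fields as empty when one is invalid avoids publishing a half-specified point; whether that is the right policy for a given dataset is a separate decision.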
+ + + + + + + + + +
geodeticDatum
Identifierhttp://rs.tdwg.org/dwc/terms/geodeticDatum
DefinitionThe ellipsoid, geodetic datum, or spatial reference system (SRS) upon which the geographic coordinates given in dwc:decimalLatitude and dwc:decimalLongitude are based.
CommentsRecommended best practice is to use the EPSG code of the SRS, if known. Otherwise use a controlled vocabulary for the name or code of the geodetic datum, if known. Otherwise use a controlled vocabulary for the name or code of the ellipsoid, if known. If none of these is known, use the value not recorded. This term has an equivalent in the dwciri: namespace that allows only an IRI as a value, whereas this term allows for a string literal value.
Examples
  • EPSG:4326
  • WGS84
  • NAD27
  • Campo Inchauspe
  • European 1950
  • Clarke 1866
  • not recorded
+ + + + + + + + + +
coordinateUncertaintyInMeters
Identifierhttp://rs.tdwg.org/dwc/terms/coordinateUncertaintyInMeters
DefinitionThe horizontal distance (in meters) from the given dwc:decimalLatitude and dwc:decimalLongitude describing the smallest circle containing the whole of the dcterms:Location. Leave the value empty if the uncertainty is unknown, cannot be estimated, or is not applicable (because there are no coordinates). Zero is not a valid value for this term.
Comments
Examples
  • 30 (reasonable lower limit on or after 2000-05-01 of a GPS reading under good conditions if the actual precision was not recorded at the time)
  • 100 (reasonable lower limit before 2000-05-01 of a GPS reading under good conditions if the actual precision was not recorded at the time)
  • 71 (uncertainty for a UTM coordinate having 100 meter precision and a known spatial reference system)
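
The rule that dwc:coordinateUncertaintyInMeters must be left empty when unknown, and may never be zero, can be sketched as below. The helper name `dwc_uncertainty` and the round-up policy are assumptions. The value 71 in the last example is consistent with half the diagonal of a 100 m square (about 70.7 m) rounded up; that reading is an interpretation, not something the table states.

```python
import math
from typing import Optional


def dwc_uncertainty(meters: Optional[float]) -> str:
    """Format coordinateUncertaintyInMeters, or return '' when it must be empty.

    Hypothetical helper: unknown (None), zero, and negative values are all
    written as an empty field, since zero is not a valid value for this term.
    """
    if meters is None or meters <= 0:
        return ""
    # Round up so the stated circle never understates the uncertainty.
    return str(math.ceil(meters))


# Half the diagonal of a 100 m x 100 m grid cell, in meters.
half_diagonal = math.hypot(100, 100) / 2
```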
+ + + + + + + + + +
coordinatePrecision
Identifierhttp://rs.tdwg.org/dwc/terms/coordinatePrecision
DefinitionA decimal representation of the precision of the coordinates given in the dwc:decimalLatitude and dwc:decimalLongitude.
Comments
Examples
  • 0.00001 (normal GPS limit for decimal degrees)
  • 0.000278 (nearest second)
  • 0.01667 (nearest minute)
  • 1.0 (nearest degree)
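
The example values for dwc:coordinatePrecision follow from expressing one unit of the original measurement as a fraction of a degree. A small sketch of that arithmetic (the function name and the unit labels are illustrative only):

```python
# Number of each unit contained in one degree.
UNITS_PER_DEGREE = {"degree": 1, "minute": 60, "second": 3600}


def coordinate_precision(unit: str) -> float:
    """Return the decimal-degree precision for coordinates recorded to the nearest unit."""
    return 1 / UNITS_PER_DEGREE[unit]
```

Rounded for display, `coordinate_precision("second")` is about 0.000278 and `coordinate_precision("minute")` is about 0.01667, matching the examples above.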
+ + + + + + + + + +
pointRadiusSpatialFit
Identifierhttp://rs.tdwg.org/dwc/terms/pointRadiusSpatialFit
DefinitionThe ratio of the area of the point-radius (dwc:decimalLatitude, dwc:decimalLongitude, dwc:coordinateUncertaintyInMeters) to the area of the true (original, or most specific) spatial representation of the dcterms:Location. Legal values are 0, greater than or equal to 1, or undefined. A value of 1 is an exact match or 100% overlap. A value of 0 should be used if the given point-radius does not completely contain the original representation. The dwc:pointRadiusSpatialFit is undefined (and should be left empty) if the original representation is any geometry without area (e.g., a point or polyline) and without uncertainty and the given georeference is not that same geometry (without uncertainty). If both the original and the given georeference are the same point, the dwc:pointRadiusSpatialFit is 1.
CommentsDetailed explanations with graphical examples can be found in the Georeferencing Best Practices, Chapman and Wieczorek, 2020 (https://doi.org/10.15468/doc-gg7h-s853).
Examples
  • 0
  • 1
  • 1.5708
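
The definition of dwc:pointRadiusSpatialFit reduces to an area ratio with two special cases. The sketch below is a simplified illustration under stated assumptions: the function name and parameters are hypothetical, areas are taken as planar square meters, and the "same point" case (fit of 1) is not handled. The example value 1.5708 equals π/2, which is what a circle circumscribing a square footprint would give; that interpretation is an assumption.

```python
import math
from typing import Optional


def point_radius_spatial_fit(
    radius_m: float,
    original_area_m2: Optional[float],
    contains_original: bool = True,
) -> Optional[float]:
    """Ratio of the point-radius circle's area to the original shape's area.

    Returns None (leave the field empty) when the original shape has no area,
    and 0.0 when the circle does not completely contain the original shape.
    """
    if original_area_m2 is None or original_area_m2 <= 0:
        return None
    if not contains_original:
        return 0.0
    return math.pi * radius_m ** 2 / original_area_m2


# A 100 m square enclosed by its circumscribed circle (radius = half the diagonal).
side = 100.0
fit = point_radius_spatial_fit(math.sqrt(2) / 2 * side, side * side)
```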
+ + + + + + + + + +
verbatimCoordinates
Identifierhttp://rs.tdwg.org/dwc/terms/verbatimCoordinates
DefinitionThe verbatim original spatial coordinates of the dcterms:Location. The coordinate ellipsoid, geodeticDatum, or full Spatial Reference System (SRS) for these coordinates should be stored in dwc:verbatimSRS and the coordinate system should be stored in dwc:verbatimCoordinateSystem.
Comments
Examples
  • 41 05 54S 121 05 34W
  • 17T 630000 4833400
+ + + + + + + + + +
verbatimLatitude
Identifierhttp://rs.tdwg.org/dwc/terms/verbatimLatitude
DefinitionThe verbatim original latitude of the dcterms:Location. The coordinate ellipsoid, geodeticDatum, or full Spatial Reference System (SRS) for these coordinates should be stored in dwc:verbatimSRS and the coordinate system should be stored in dwc:verbatimCoordinateSystem.
Comments
Examples41 05 54.03S
+ + + + + + + + + +
verbatimLongitude
Identifierhttp://rs.tdwg.org/dwc/terms/verbatimLongitude
DefinitionThe verbatim original longitude of the dcterms:Location. The coordinate ellipsoid, geodeticDatum, or full Spatial Reference System (SRS) for these coordinates should be stored in dwc:verbatimSRS and the coordinate system should be stored in dwc:verbatimCoordinateSystem.
Comments
Examples121d 10' 34" W
+ + + + + + + + + +
verbatimCoordinateSystem
Identifierhttp://rs.tdwg.org/dwc/terms/verbatimCoordinateSystem
DefinitionThe coordinate format for the dwc:verbatimLatitude and dwc:verbatimLongitude or the dwc:verbatimCoordinates of the dcterms:Location.
CommentsRecommended best practice is to use a controlled vocabulary. This term has an equivalent in the dwciri: namespace that allows only an IRI as a value, whereas this term allows for any string literal value.
Examples
  • decimal degrees
  • degrees decimal minutes
  • degrees minutes seconds
  • UTM
+ + + + + + + + + +
verbatimSRS
Identifierhttp://rs.tdwg.org/dwc/terms/verbatimSRS
DefinitionThe ellipsoid, geodetic datum, or spatial reference system (SRS) upon which coordinates given in dwc:verbatimLatitude and dwc:verbatimLongitude, or dwc:verbatimCoordinates are based.
CommentsRecommended best practice is to use the EPSG code of the SRS, if known. Otherwise use a controlled vocabulary for the name or code of the geodetic datum, if known. Otherwise use a controlled vocabulary for the name or code of the ellipsoid, if known. If none of these is known, use the value not recorded. This term has an equivalent in the dwciri: namespace that allows only an IRI as a value, whereas this term allows for any string literal value.
Examples
  • EPSG:4326
  • WGS84
  • NAD27
  • Campo Inchauspe
  • European 1950
  • Clarke 1866
  • not recorded
+ + + + + + + + + +
footprintWKT
Identifierhttp://rs.tdwg.org/dwc/terms/footprintWKT
DefinitionA Well-Known Text (WKT) representation of the shape (footprint, geometry) that defines the dcterms:Location. A dcterms:Location may have both a point-radius representation (see dwc:decimalLatitude) and a footprint representation, and they may differ from each other.
CommentsThis term has an equivalent in the dwciri: namespace that allows only an IRI as a value, whereas this term allows for any string literal value.
ExamplesPOLYGON ((10 20, 11 20, 11 21, 10 21, 10 20)) (the one-degree bounding box with opposite corners at longitude=10, latitude=20 and longitude=11, latitude=21)
+ + + + + + + + + +
footprintSRS
Identifierhttp://rs.tdwg.org/dwc/terms/footprintSRS
DefinitionThe ellipsoid, geodetic datum, or spatial reference system (SRS) upon which the geometry given in dwc:footprintWKT is based.
CommentsRecommended best practice is to use the EPSG code of the SRS, if known. Otherwise use a controlled vocabulary for the name or code of the geodetic datum, if known. Otherwise use a controlled vocabulary for the name or code of the ellipsoid, if known. If none of these is known, use the value not recorded. It is also permitted to provide the SRS in Well-Known-Text, especially if no EPSG code provides the necessary values for the attributes of the SRS. Do not use this term to describe the SRS of the dwc:decimalLatitude and dwc:decimalLongitude, nor of any verbatim coordinates - use the dwc:geodeticDatum and dwc:verbatimSRS instead. This term has an equivalent in the dwciri: namespace that allows only an IRI as a value, whereas this term allows for any string literal value.
Examples
  • EPSG:4326
  • GEOGCS["GCS_WGS_1984", DATUM["D_WGS_1984", SPHEROID["WGS_1984",6378137,298.257223563]], PRIMEM["Greenwich",0], UNIT["Degree",0.0174532925199433]] (WKT for the standard WGS84 Spatial Reference System EPSG:4326)
  • not recorded
+ + + + + + + + + +
footprintSpatialFit
Identifierhttp://rs.tdwg.org/dwc/terms/footprintSpatialFit
DefinitionThe ratio of the area of the dwc:footprintWKT to the area of the true (original, or most specific) spatial representation of the dcterms:Location. Legal values are 0, greater than or equal to 1, or undefined. A value of 1 is an exact match or 100% overlap. A value of 0 should be used if the given dwc:footprintWKT does not completely contain the original representation. The dwc:footprintSpatialFit is undefined (and should be left empty) if the original representation is any geometry without area (e.g., a point or polyline) and without uncertainty and the given georeference is not that same geometry (without uncertainty). If both the original and the given georeference are the same point, the dwc:footprintSpatialFit is 1.
CommentsDetailed explanations with graphical examples can be found in the Georeferencing Best Practices, Chapman and Wieczorek, 2020 (https://doi.org/10.15468/doc-gg7h-s853).
Examples
  • 0
  • 1
  • 1.5708
+ + + + + + + + + +
georeferencedBy
Identifierhttp://rs.tdwg.org/dwc/terms/georeferencedBy
DefinitionA list (concatenated and separated) of names of people, groups, or organizations who determined the georeference (spatial representation) for the dcterms:Location.
CommentsRecommended best practice is to separate the values in a list with space vertical bar space ( | ). This term has an equivalent in the dwciri: namespace that allows only an IRI as a value, whereas this term allows for any string literal value.
Examples
  • Brad Millen (ROM)
  • Kristina Yamamoto | Janet Fang
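
Several terms in this reference recommend separating list values with space, vertical bar, space. A minimal sketch of that convention, with a hypothetical helper name:

```python
from typing import Iterable, Optional


def join_dwc_list(values: Iterable[Optional[str]]) -> str:
    """Join values with ' | ', skipping empty entries and trimming whitespace."""
    cleaned = [v.strip() for v in values if v and v.strip()]
    return " | ".join(cleaned)
```

For example, `join_dwc_list(["Kristina Yamamoto", "Janet Fang"])` produces the second example value above.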
+ + + + + + + + + +
georeferencedDate
Identifierhttp://rs.tdwg.org/dwc/terms/georeferencedDate
DefinitionThe date on which the dcterms:Location was georeferenced.
CommentsRecommended best practice is to use a date that conforms to ISO 8601-1:2019.
Examples
  • 1963-03-08T14:07-06:00 (8 Mar 1963 at or after 2:07pm and before 2:08pm in the time zone six hours earlier than UTC)
  • 2009-02-20T08:40Z (20 February 2009 at or after 8:40am and before 8:41 UTC)
  • 2018-08-29T15:19 (29 August 2018 at or after 3:19pm and before 3:20pm local time)
  • 1809-02-12 (within the day 12 February 1809)
  • 1906-06 (in the month of June 1906)
  • 1971 (in the year 1971)
  • 2007-03-01T13:00:00Z/2008-05-11T15:30:00Z (some time within the interval beginning 1 March 2007 at 1pm UTC and before 11 May 2008 at 3:30pm UTC)
  • 1900/1909 (some time within the interval between the beginning of the year 1900 and before the year 1909)
  • 2007-11-13/15 (some time in the interval between the beginning of 13 November 2007 and before 15 November 2007)
+ + + + + + + + + +
georeferenceProtocol
Identifierhttp://rs.tdwg.org/dwc/terms/georeferenceProtocol
DefinitionA description or reference to the methods used to determine the spatial footprint, coordinates, and uncertainties.
CommentsThis term has an equivalent in the dwciri: namespace that allows only an IRI as a value, whereas this term allows for any string literal value.
ExamplesGeoreferencing Quick Reference Guide (Zermoglio et al. 2020, https://doi.org/10.35035/e09p-h128)
+ + + + + + + + + +
georeferenceSources
Identifierhttp://rs.tdwg.org/dwc/terms/georeferenceSources
DefinitionA list (concatenated and separated) of maps, gazetteers, or other resources used to georeference the dcterms:Location, described specifically enough to allow anyone in the future to use the same resources.
CommentsRecommended best practice is to separate the values in a list with space vertical bar space ( | ). This term has an equivalent in the dwciri: namespace that allows only an IRI as a value, whereas this term allows for any string literal value.
Examples
+ + + + + + + + + +
georeferenceRemarks
Identifierhttp://rs.tdwg.org/dwc/terms/georeferenceRemarks
DefinitionComments or notes about the spatial description determination, explaining assumptions made in addition or opposition to those formalized in the method referred to in dwc:georeferenceProtocol.
Comments
ExamplesAssumed distance by road (Hwy. 101)

## GeologicalContext

GeologicalContext Class
Identifierhttp://rs.tdwg.org/dwc/terms/GeologicalContext
DefinitionGeological information, such as stratigraphy, that qualifies a region or place.
Comments
Examplesa lithostratigraphic layer
+ + + + + + + + + + +
geologicalContextID
Identifierhttp://rs.tdwg.org/dwc/terms/geologicalContextID
DefinitionAn identifier for the set of information associated with a dwc:GeologicalContext (the location within a geological context, such as stratigraphy). May be a global unique identifier or an identifier specific to the data set.
Comments
Exampleshttps://opencontext.org/subjects/e54377f7-4452-4315-b676-40679b10c4d9
+ + + + + + + + + +
earliestEonOrLowestEonothem
Identifierhttp://rs.tdwg.org/dwc/terms/earliestEonOrLowestEonothem
DefinitionThe full name of the earliest possible geochronologic eon or lowest chrono-stratigraphic eonothem or the informal name ("Precambrian") attributable to the stratigraphic horizon from which the dwc:MaterialEntity was collected.
Comments
Examples
  • Phanerozoic
  • Proterozoic
+ + + + + + + + + +
latestEonOrHighestEonothem
Identifierhttp://rs.tdwg.org/dwc/terms/latestEonOrHighestEonothem
DefinitionThe full name of the latest possible geochronologic eon or highest chrono-stratigraphic eonothem or the informal name ("Precambrian") attributable to the stratigraphic horizon from which the dwc:MaterialEntity was collected.
Comments
Examples
  • Phanerozoic
  • Proterozoic
+ + + + + + + + + +
earliestEraOrLowestErathem
Identifierhttp://rs.tdwg.org/dwc/terms/earliestEraOrLowestErathem
DefinitionThe full name of the earliest possible geochronologic era or lowest chronostratigraphic erathem attributable to the stratigraphic horizon from which the dwc:MaterialEntity was collected.
Comments
Examples
  • Cenozoic
  • Mesozoic
+ + + + + + + + + +
latestEraOrHighestErathem
Identifierhttp://rs.tdwg.org/dwc/terms/latestEraOrHighestErathem
DefinitionThe full name of the latest possible geochronologic era or highest chronostratigraphic erathem attributable to the stratigraphic horizon from which the dwc:MaterialEntity was collected.
Comments
Examples
  • Cenozoic
  • Mesozoic
+ + + + + + + + + +
earliestPeriodOrLowestSystem
Identifierhttp://rs.tdwg.org/dwc/terms/earliestPeriodOrLowestSystem
DefinitionThe full name of the earliest possible geochronologic period or lowest chronostratigraphic system attributable to the stratigraphic horizon from which the dwc:MaterialEntity was collected.
Comments
Examples
  • Neogene
  • Tertiary
  • Quaternary
+ + + + + + + + + +
latestPeriodOrHighestSystem
Identifierhttp://rs.tdwg.org/dwc/terms/latestPeriodOrHighestSystem
DefinitionThe full name of the latest possible geochronologic period or highest chronostratigraphic system attributable to the stratigraphic horizon from which the dwc:MaterialEntity was collected.
Comments
Examples
  • Neogene
  • Tertiary
  • Quaternary
+ + + + + + + + + +
earliestEpochOrLowestSeries
Identifierhttp://rs.tdwg.org/dwc/terms/earliestEpochOrLowestSeries
DefinitionThe full name of the earliest possible geochronologic epoch or lowest chronostratigraphic series attributable to the stratigraphic horizon from which the dwc:MaterialEntity was collected.
Comments
Examples
  • Holocene
  • Pleistocene
  • Ibexian Series
+ + + + + + + + + +
latestEpochOrHighestSeries
Identifierhttp://rs.tdwg.org/dwc/terms/latestEpochOrHighestSeries
DefinitionThe full name of the latest possible geochronologic epoch or highest chronostratigraphic series attributable to the stratigraphic horizon from which the dwc:MaterialEntity was collected.
Comments
Examples
  • Holocene
  • Pleistocene
  • Ibexian Series
+ + + + + + + + + +
earliestAgeOrLowestStage
Identifierhttp://rs.tdwg.org/dwc/terms/earliestAgeOrLowestStage
DefinitionThe full name of the earliest possible geochronologic age or lowest chronostratigraphic stage attributable to the stratigraphic horizon from which the dwc:MaterialEntity was collected.
Comments
Examples
  • Atlantic
  • Boreal
  • Skullrockian
+ + + + + + + + + +
latestAgeOrHighestStage
Identifierhttp://rs.tdwg.org/dwc/terms/latestAgeOrHighestStage
DefinitionThe full name of the latest possible geochronologic age or highest chronostratigraphic stage attributable to the stratigraphic horizon from which the dwc:MaterialEntity was collected.
Comments
Examples
  • Atlantic
  • Boreal
  • Skullrockian
+ + + + + + + + + +
lowestBiostratigraphicZone
Identifierhttp://rs.tdwg.org/dwc/terms/lowestBiostratigraphicZone
DefinitionThe full name of the lowest possible geological biostratigraphic zone of the stratigraphic horizon from which the dwc:MaterialEntity was collected.
Comments
ExamplesMaastrichtian
+ + + + + + + + + +
highestBiostratigraphicZone
Identifierhttp://rs.tdwg.org/dwc/terms/highestBiostratigraphicZone
DefinitionThe full name of the highest possible geological biostratigraphic zone of the stratigraphic horizon from which the dwc:MaterialEntity was collected.
Comments
ExamplesBlancan
+ + + + + + + + + +
lithostratigraphicTerms
Identifierhttp://rs.tdwg.org/dwc/terms/lithostratigraphicTerms
DefinitionThe combination of all lithostratigraphic names for the rock from which the dwc:MaterialEntity was collected.
Comments
ExamplesPleistocene-Weichselien
+ + + + + + + + + +
group
Identifierhttp://rs.tdwg.org/dwc/terms/group
DefinitionThe full name of the lithostratigraphic group from which the dwc:MaterialEntity was collected.
Comments
Examples
  • Bathurst
  • Lower Wealden
+ + + + + + + + + +
formation
Identifierhttp://rs.tdwg.org/dwc/terms/formation
DefinitionThe full name of the lithostratigraphic formation from which the dwc:MaterialEntity was collected.
Comments
Examples
  • Notch Peak Formation
  • House Limestone
  • Fillmore Formation
+ + + + + + + + + +
member
Identifierhttp://rs.tdwg.org/dwc/terms/member
DefinitionThe full name of the lithostratigraphic member from which the dwc:MaterialEntity was collected.
Comments
Examples
  • Lava Dam Member
  • Hellnmaria Member
+ + + + + + + + + +
bed
Identifierhttp://rs.tdwg.org/dwc/terms/bed
DefinitionThe full name of the lithostratigraphic bed from which the dwc:MaterialEntity was collected.
Comments
ExamplesHarlem coal

## Identification

Identification Class
Identifierhttp://rs.tdwg.org/dwc/terms/Identification
DefinitionA taxonomic determination (e.g., the assignment to a dwc:Taxon).
Comments
Examplesa subspecies determination of an organism
+ + + + + + + + + + +
identificationID
Identifierhttp://rs.tdwg.org/dwc/terms/identificationID
DefinitionAn identifier for the dwc:Identification (the body of information associated with the assignment of a scientific name). May be a global unique identifier or an identifier specific to the data set.
Comments
Examples9992
+ + + + + + + + + +
verbatimIdentification
Identifierhttp://rs.tdwg.org/dwc/terms/verbatimIdentification
DefinitionA string representing the taxonomic identification as it appeared in the original record.
CommentsThis term is meant to allow the capture of an unaltered original identification/determination, including identification qualifiers, hybrid formulas, uncertainties, etc. This term is meant to be used in addition to dwc:scientificName (and dwc:identificationQualifier etc.), not instead of it.
Examples
  • Peromyscus sp.
  • Ministrymon sp. nov. 1
  • Anser anser × Branta canadensis
  • Pachyporidae?
  • Potentilla × pantotricha Soják
  • Aconitum pilipes × A. variegatum
  • Lepomis auritus x cyanellus
+ + + + + + + + + +
identificationQualifier
Identifierhttp://rs.tdwg.org/dwc/terms/identificationQualifier
DefinitionA brief phrase or a standard term ("cf.", "aff.") to express the determiner's doubts about the dwc:Identification.
CommentsThis term has an equivalent in the dwciri: namespace that allows only an IRI as a value, whereas this term allows for any string literal value.
Examples
  • aff. agrifolia var. oxyadenia (for Quercus aff. agrifolia var. oxyadenia with accompanying values Quercus in genus, agrifolia in specificEpithet, oxyadenia in infraspecificEpithet, and var. in taxonRank)
  • cf. var. oxyadenia (for Quercus agrifolia cf. var. oxyadenia with accompanying values Quercus in genus, agrifolia in specificEpithet, oxyadenia in infraspecificEpithet, and var. in taxonRank)
+ + + + + + + + + +
typeStatus
Identifierhttp://rs.tdwg.org/dwc/terms/typeStatus
DefinitionA list (concatenated and separated) of nomenclatural types (type status, typified scientific name, publication) applied to the subject.
CommentsRecommended best practice is to separate the values in a list with space vertical bar space ( | ). This term has an equivalent in the dwciri: namespace that allows only an IRI as a value, whereas this term allows for any string literal value.
Examples
  • holotype of Ctenomys sociabilis. Pearson O. P., and M. I. Christie. 1985. Historia Natural, 5(37):388
  • holotype of Pinus abies | holotype of Picea abies
+ + + + + + + + + +
typifiedName
Identifierhttp://rs.tdwg.org/dwc/terms/typifiedName
DefinitionA scientific name that is based on a type specimen.
CommentsRecommended best practice is also to indicate the dwc:typeStatus of the specimen.
ExamplesPolysiphonia amphibolis Womersley
+ + + + + + + + + +
identifiedBy
Identifierhttp://rs.tdwg.org/dwc/terms/identifiedBy
DefinitionA list (concatenated and separated) of names of people, groups, or organizations who assigned the dwc:Taxon to the subject.
CommentsWhen used in the context of an Event (such as in the Humboldt Extension), the subject consists of all of the dwc:Organisms related to the Event. Recommended best practice is to separate the values in a list with space vertical bar space ( | ). This term has an equivalent in the dwciri: namespace that allows only an IRI as a value, whereas this term allows for any string literal value.
ExamplesJames L. Patton | Theodore Pappenfuss | Robert Macey
+ + + + + + + + + +
identifiedByID
Identifierhttp://rs.tdwg.org/dwc/terms/identifiedByID
DefinitionA list (concatenated and separated) of the globally unique identifier for the person, people, groups, or organizations responsible for assigning the dwc:Taxon to the subject.
CommentsRecommended best practice is to provide a single identifier that disambiguates the details of the identifying agent. If a list is used, the order of the identifiers on the list should not be assumed to convey any semantics. Recommended best practice is to separate the values in a list with space vertical bar space ( | ).
Examples
+ + + + + + + + + +
dateIdentified
Identifierhttp://rs.tdwg.org/dwc/terms/dateIdentified
DefinitionThe date on which the subject was determined as representing the dwc:Taxon.
CommentsRecommended best practice is to use a date that conforms to ISO 8601-1:2019.
Examples
  • 1963-03-08T14:07-06:00 (8 Mar 1963 at or after 2:07pm and before 2:08pm in the time zone six hours earlier than UTC)
  • 2009-02-20T08:40Z (20 February 2009 at or after 8:40am and before 8:41 UTC)
  • 2018-08-29T15:19 (29 August 2018 at or after 3:19pm and before 3:20pm local time)
  • 1809-02-12 (within the day 12 February 1809)
  • 1906-06 (in the month of June 1906)
  • 1971 (in the year 1971)
  • 2007-03-01T13:00:00Z/2008-05-11T15:30:00Z (some time within the interval beginning 1 March 2007 at 1pm UTC and before 11 May 2008 at 3:30pm UTC)
  • 1900/1909 (some time within the interval between the beginning of the year 1900 and before the year 1909)
  • 2007-11-13/15 (some time in the interval between the beginning of 13 November 2007 and before 15 November 2007)
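
The first three example forms for dwc:dateIdentified (offset-qualified, UTC with `Z`, and local time without an offset) can be produced from Python `datetime` values. This is a sketch under assumptions: the helper name `dwc_datetime` is hypothetical, and minute resolution is chosen only to match the examples above.

```python
from datetime import datetime, timedelta, timezone


def dwc_datetime(dt: datetime) -> str:
    """Format a datetime as an ISO 8601 string at minute resolution.

    Naive datetimes are written without an offset (local time), UTC is
    written with a trailing 'Z', and other zones keep their +HH:MM offset.
    """
    if dt.tzinfo is None or dt.utcoffset() is None:
        return dt.strftime("%Y-%m-%dT%H:%M")
    if dt.utcoffset() == timedelta(0):
        return dt.strftime("%Y-%m-%dT%H:%M") + "Z"
    return dt.isoformat(timespec="minutes")
```

Interval forms such as `1900/1909` would need a separate start/end pair and are not covered here.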
+ + + + + + + + + +
identificationReferences
Identifierhttp://rs.tdwg.org/dwc/terms/identificationReferences
DefinitionA list (concatenated and separated) of references (publication, global unique identifier, URI) used in the dwc:Identification.
CommentsWhen used in the context of an Event (such as in the Humboldt Extension), the subject consists of all of the dwc:Organisms related to the Event. Recommended best practice is to separate the values in a list with space vertical bar space ( | ).
Examples
  • Aves del Noroeste Patagonico. Christie et al. 2004.
  • Stebbins, R. Field Guide to Western Reptiles and Amphibians. 3rd Edition. 2003. | Irschick, D.J. and Shaffer, H.B. (1997). The polytypic species revisited: Morphological differentiation among tiger salamanders (Ambystoma tigrinum) (Amphibia: Caudata). Herpetologica, 53(1), 30-49.
+ + + + + + + + + +
identificationVerificationStatus
Identifierhttp://rs.tdwg.org/dwc/terms/identificationVerificationStatus
DefinitionA categorical indicator of the extent to which the taxonomic identification has been verified to be correct.
CommentsRecommended best practice is to use a controlled vocabulary such as that used in HISPID and ABCD. This term has an equivalent in the dwciri: namespace that allows only an IRI as a value, whereas this term allows for any string literal value.
Examples0 ("unverified" in HISPID/ABCD).
+ + + + + + + + + +
identificationRemarks
Identifierhttp://rs.tdwg.org/dwc/terms/identificationRemarks
DefinitionComments or notes about the dwc:Identification.
Comments
ExamplesDistinguished between Anthus correndera and Anthus hellmayri based on the comparative lengths of the uñas.

## Taxon

Taxon Class
Identifierhttp://rs.tdwg.org/dwc/terms/Taxon
DefinitionA group of organisms (sensu http://purl.obolibrary.org/obo/OBI_0100026) considered by taxonomists to form a homogeneous unit.
Comments
Examplesthe genus Truncorotaloides as published by Brönnimann et al. in 1953 in the Journal of Paleontology Vol. 27(6) p. 817-820
+ + + + + + + + + + +
taxonID
Identifierhttp://rs.tdwg.org/dwc/terms/taxonID
DefinitionAn identifier for the set of dwc:Taxon information. May be a global unique identifier or an identifier specific to the data set.
Comments
Examples
+ + + + + + + + + +
scientificNameID
Identifierhttp://rs.tdwg.org/dwc/terms/scientificNameID
DefinitionAn identifier for the nomenclatural (not taxonomic) details of a scientific name.
Comments
Examplesurn:lsid:ipni.org:names:37829-1:1.3
+ + + + + + + + + +
acceptedNameUsageID
Identifierhttp://rs.tdwg.org/dwc/terms/acceptedNameUsageID
DefinitionAn identifier for the name usage (documented meaning of the name according to a source) of the currently valid (zoological) or accepted (botanical) taxon.
CommentsThis term should be used for synonyms or misapplied names to refer to the dwc:taxonID of a dwc:Taxon record that represents the accepted (botanical) or valid (zoological) name. For Darwin Core Archives the related record should be present locally in the same archive.
Examples
  • tsn:41107 (ITIS)
  • urn:lsid:ipni.org:names:320035-2 (IPNI)
  • 2704179 (GBIF)
  • 6W3C4 (COL)
+ + + + + + + + + +
parentNameUsageID
Identifierhttp://rs.tdwg.org/dwc/terms/parentNameUsageID
DefinitionAn identifier for the name usage (documented meaning of the name according to a source) of the direct, most proximate higher-rank parent taxon (in a classification) of the most specific element of the dwc:scientificName.
CommentsThis term should be used for accepted names to refer to the dwc:taxonID of a dwc:Taxon record that represents the next higher taxon rank in the same taxonomic classification. For Darwin Core Archives the related record should be present locally in the same archive.
Examples
  • tsn:41074 (ITIS)
  • urn:lsid:ipni.org:names:30001404-2 (IPNI)
  • 2704173 (GBIF)
  • 6T8N (COL)
+ + + + + + + + + +
originalNameUsageID
Identifierhttp://rs.tdwg.org/dwc/terms/originalNameUsageID
DefinitionAn identifier for the name usage (documented meaning of the name according to a source) in which the terminal element of the dwc:scientificName was originally established under the rules of the associated dwc:nomenclaturalCode.
CommentsThis term should be used to refer to the dwc:taxonID of a dwc:Taxon record that represents the usage of the terminal element of the dwc:scientificName as originally established under the rules of the associated dwc:nomenclaturalCode. For example, for names governed by the ICNafp, this term would establish the relationship between a record representing a subsequent combination and the record for its corresponding basionym. Unlike basionyms, however, this term can apply to scientific names at all ranks. For Darwin Core Archives the related record should be present locally in the same archive.
Examples
  • tsn:41107 (ITIS)
  • urn:lsid:ipni.org:names:320035-2 (IPNI)
  • 2704179 (GBIF)
  • 6W3C4 (COL)
+ + + + + + + + + +
nameAccordingToID
Identifierhttp://rs.tdwg.org/dwc/terms/nameAccordingToID
DefinitionAn identifier for the source in which the specific taxon concept circumscription is defined or implied. See dwc:nameAccordingTo.
Comments
Exampleshttps://doi.org/10.1016/S0269-915X(97)80026-2
+ + + + + + + + + +
namePublishedInID
Identifierhttp://rs.tdwg.org/dwc/terms/namePublishedInID
DefinitionAn identifier for the publication in which the dwc:scientificName was originally established under the rules of the associated dwc:nomenclaturalCode.
CommentsA citation of the first publication of the name in its given combination, not the basionym / original name. Recombinations are often not published in zoology, in which case dwc:namePublishedInID should be empty.
Examples
+ + + + + + + + + +
taxonConceptID
Identifierhttp://rs.tdwg.org/dwc/terms/taxonConceptID
DefinitionAn identifier for the taxonomic concept to which the record refers - not for the nomenclatural details of a dwc:Taxon.
Comments
Examples8fa58e08-08de-4ac1-b69c-1235340b7001
+ + + + + + + + + +
scientificName
Identifierhttp://rs.tdwg.org/dwc/terms/scientificName
DefinitionThe full scientific name, with authorship and date information if known. When forming part of a dwc:Identification, this should be the name in lowest level taxonomic rank that can be determined. This term should not contain identification qualifications, which should instead be supplied in the dwc:identificationQualifier term.
CommentsThis term should not contain identification qualifications, which should instead be supplied in the dwc:identificationQualifier term. When applied to an Organism or Occurrence, this term should be used to represent the scientific name that was applied to the associated Organism in accordance with the Taxon to which it was or is currently identified. Names should be compliant with the most recent nomenclatural code. For example, names of hybrids for algae, fungi and plants should follow the rules of the International Code of Nomenclature for algae, fungi, and plants (Shenzhen Code Articles H.1, H.2 and H.3). Thus, use the multiplication sign × (Unicode U+00D7, HTML &times;) to identify a hybrid, not x or X, if possible.
Examples
  • Coleoptera (order)
  • Vespertilionidae (family)
  • Manis (genus)
  • Ctenomys sociabilis (genus + specificEpithet)
  • Ambystoma tigrinum diaboli (genus + specificEpithet + infraspecificEpithet)
  • Roptrocerus typographi (Györfi, 1952) (genus + specificEpithet + scientificNameAuthorship)
  • Quercus agrifolia var. oxyadenia (Torr.) J.T. Howell (genus + specificEpithet + taxonRank + infraspecificEpithet + scientificNameAuthorship)
  • ×Agropogon littoralis (Sm.) C. E. Hubb.
  • Mentha ×smithiana R. A. Graham
  • Agrostis stolonifera L. × Polypogon monspeliensis (L.) Desf.
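
The hybrid-sign recommendation above can be sketched as a small cleanup step. This is only an illustration: the function name `normalize_hybrid_sign` is hypothetical, and the rule it applies (a lone, whitespace-delimited `x` or `X` is a hybrid marker) is an assumption that may not hold for every name string.

```python
import re

# Assumption: a lone "x" or "X" token (whitespace-delimited) is a hybrid marker.
_HYBRID_MARKER = re.compile(r"(?<!\S)[xX](?=\s)")


def normalize_hybrid_sign(name):
    """Replace a standalone x/X hybrid marker with the multiplication sign (U+00D7)."""
    return _HYBRID_MARKER.sub("\u00d7", name)


print(normalize_hybrid_sign("Agrostis stolonifera L. x Polypogon monspeliensis (L.) Desf."))
```

The sketch leaves spacing untouched, so it does not decide whether the sign should be attached to the following epithet, as in `Mentha ×smithiana`.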
+ + + + + + + + + +
acceptedNameUsage
Identifierhttp://rs.tdwg.org/dwc/terms/acceptedNameUsage
DefinitionThe full name, with authorship and date information if known, of the currently valid (zoological) or accepted (botanical) dwc:Taxon.
CommentsThe full scientific name, with authorship and date information if known, of the accepted (botanical) or valid (zoological) name in cases where the provided dwc:scientificName is considered by the reference indicated in the dwc:nameAccordingTo property, or of the content provider, to be a synonym or misapplied name. When applied to a dwc:Organism or dwc:Occurrence, this term should be used in cases where a content provider regards the provided dwc:scientificName to be inconsistent with the taxonomic perspective of the content provider. For example, there are many discrepancies within specimen collections and observation datasets between the recorded name (e.g., the most recent identification from an expert who examined a specimen, or a field identification for an observed dwc:Organism), and the name asserted by the content provider to be taxonomically accepted.
ExamplesTamias minimus (valid name for Eutamias minimus)
+ + + + + + + + + +
parentNameUsage
Identifierhttp://rs.tdwg.org/dwc/terms/parentNameUsage
DefinitionThe full name, with authorship and date information if known, of the direct, most proximate higher-rank parent dwc:Taxon (in a classification) of the most specific element of the dwc:scientificName.
Comments
Examples
  • Rubiaceae
  • Gruiformes
  • Testudinae
+ + + + + + + + + +
originalNameUsage
Identifierhttp://rs.tdwg.org/dwc/terms/originalNameUsage
DefinitionThe taxon name, with authorship and date information if known, as it originally appeared when first established under the rules of the associated dwc:nomenclaturalCode. The basionym (botany) or basonym (bacteriology) of the dwc:scientificName or the senior/earlier homonym for replaced names.
CommentsThe full scientific name, with authorship and date information if known, of the name usage in which the terminal element of the dwc:scientificName was originally established under the rules of the associated dwc:nomenclaturalCode. For example, for names governed by the ICNafp, this term would indicate the basionym of a record representing a subsequent combination. Unlike basionyms, however, this term can apply to scientific names at all ranks.
Examples
  • Pinus abies
  • Gasterosteus saltatrix Linnaeus 1768
+ + + + + + + + + +
nameAccordingTo
Identifierhttp://rs.tdwg.org/dwc/terms/nameAccordingTo
DefinitionThe reference to the source in which the specific taxon concept circumscription is defined or implied - traditionally signified by the Latin "sensu" or "sec." (from secundum, meaning "according to"). For taxa that result from identifications, a reference to the keys, monographs, experts and other sources should be given.
CommentsThis term provides context to the dwc:scientificName. Together with the dwc:scientificName, separated by sensu or sec., it forms the taxon concept label, which may be seen as having the same relationship to dwc:taxonConceptID as, for example, dwc:acceptedNameUsage has to dwc:acceptedNameUsageID. When not provided, in Taxon Core data sets the dwc:nameAccordingTo can be taken to be the data set. In this case the data set mostly provides sufficient context to infer the delimitation of the taxon and its relationship with other taxa. In Occurrence Core data sets, when not provided, dwc:nameAccordingTo can be an underlying taxonomy of the data set, e.g. Plants of the World Online (http://powo.science.kew.org/) for vascular plant records in iNaturalist (in which case it should be provided), or, which is the case for most dwc:PreservedSpecimen data sets, the dwc:Identification, in which case there is no further context.
ExamplesFranz NM, Cardona-Duque J (2013) Description of two new species and phylogenetic reassessment of Perelleschus Wibmer & O’Brien, 1986 (Coleoptera: Curculionidae), with a complete taxonomic concept history of Perelleschus sec. Franz & Cardona-Duque, 2013. Syst Biodivers. 11: 209–236. (as the full citation of the Franz & Cardona-Duque (2013) in Perelleschus splendida sec. Franz & Cardona-Duque (2013))
+ + + + + + + + + +
namePublishedIn
Identifierhttp://rs.tdwg.org/dwc/terms/namePublishedIn
DefinitionA reference for the publication in which the dwc:scientificName was originally established under the rules of the associated dwc:nomenclaturalCode.
CommentsA citation of the first publication of the name in its given combination, not the basionym / original name. Recombinations are often not published in zoology, in which case dwc:namePublishedIn should be empty.
Examples
  • Pearson O. P., and M. I. Christie. 1985. Historia Natural, 5(37):388
  • Forel, Auguste, Diagnoses provisoires de quelques espèces nouvelles de fourmis de Madagascar, récoltées par M. Grandidier., Annales de la Societe Entomologique de Belgique, Comptes-rendus des Seances 30, 1886
+ + + + + + + + + +
namePublishedInYear
Identifierhttp://rs.tdwg.org/dwc/terms/namePublishedInYear
DefinitionThe four-digit year in which the dwc:scientificName was published.
Comments
Examples
  • 1915
  • 2008
+ + + + + + + + + +
higherClassification
Identifierhttp://rs.tdwg.org/dwc/terms/higherClassification
DefinitionA list (concatenated and separated) of taxa names terminating at the rank immediately superior to the referenced dwc:Taxon.
CommentsRecommended best practice is to separate the values in a list with space vertical bar space ( | ), with terms in order from the highest taxonomic rank to the lowest.
Examples
  • Plantae | Tracheophyta | Magnoliopsida | Ranunculales | Ranunculaceae | Ranunculus
  • Animalia
  • Animalia | Chordata | Vertebrata | Mammalia | Theria | Eutheria | Rodentia | Hystricognatha | Hystricognathi | Ctenomyidae | Ctenomyini | Ctenomys
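
As a minimal sketch of the recommended list format, the helper below joins ancestor names with ` | `, ordered from the highest rank to the lowest. The function name `build_higher_classification` and the plain list-of-names input are assumptions for illustration; neither comes from Darwin Core or from the AMI codebase.

```python
def build_higher_classification(ancestor_names):
    """Join ancestor taxon names (highest rank first) with ' | '.

    The caller is expected to stop the list at the rank immediately
    above the taxon being described.
    """
    # Drop empty or missing names so the separator is never doubled.
    cleaned = [name.strip() for name in ancestor_names if name and name.strip()]
    return " | ".join(cleaned)


print(build_higher_classification(
    ["Plantae", "Tracheophyta", "Magnoliopsida", "Ranunculales", "Ranunculaceae", "Ranunculus"]
))
```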
+ + + + + + + + + +
kingdom
Identifierhttp://rs.tdwg.org/dwc/terms/kingdom
DefinitionThe full scientific name of the kingdom in which the dwc:Taxon is classified.
Comments
Examples
  • Animalia
  • Archaea
  • Bacteria
  • Chromista
  • Fungi
  • Plantae
  • Protozoa
  • Viruses
+ + + + + + + + + +
phylum
Identifierhttp://rs.tdwg.org/dwc/terms/phylum
DefinitionThe full scientific name of the phylum or division in which the dwc:Taxon is classified.
Comments
Examples
  • Chordata (phylum)
  • Bryophyta (division)
+ + + + + + + + + +
class
Identifierhttp://rs.tdwg.org/dwc/terms/class
DefinitionThe full scientific name of the class in which the dwc:Taxon is classified.
Comments
Examples
  • Mammalia
  • Hepaticopsida
+ + + + + + + + + +
order
Identifierhttp://rs.tdwg.org/dwc/terms/order
DefinitionThe full scientific name of the order in which the dwc:Taxon is classified.
Comments
Examples
  • Carnivora
  • Monocleales
+ + + + + + + + + +
superfamily
Identifierhttp://rs.tdwg.org/dwc/terms/superfamily
DefinitionThe full scientific name of the superfamily in which the dwc:Taxon is classified.
CommentsA taxonomic category subordinate to an order and superior to a family. According to ICZN article 29.2, the suffix -oidea is used for a superfamily name.
Examples
  • Achatinoidea
  • Cerithioidea
  • Helicoidea
  • Hypsibioidea
  • Valvatoidea
  • Zonitoidea
+ + + + + + + + + +
family
Identifierhttp://rs.tdwg.org/dwc/terms/family
DefinitionThe full scientific name of the family in which the dwc:Taxon is classified.
Comments
Examples
  • Felidae
  • Monocleaceae
+ + + + + + + + + +
subfamily
Identifierhttp://rs.tdwg.org/dwc/terms/subfamily
DefinitionThe full scientific name of the subfamily in which the dwc:Taxon is classified.
Comments
Examples
  • Periptyctinae
  • Orchidoideae
  • Sphindociinae
+ + + + + + + + + +
tribe
Identifierhttp://rs.tdwg.org/dwc/terms/tribe
DefinitionThe full scientific name of the tribe in which the dwc:Taxon is classified.
Comments
Examples
  • Ortaliini
  • Arethuseae
+ + + + + + + + + +
subtribe
Identifierhttp://rs.tdwg.org/dwc/terms/subtribe
DefinitionThe full scientific name of the subtribe in which the dwc:Taxon is classified.
Comments
Examples
  • Plotinini
  • Typhaeini
+ + + + + + + + + +
genus
Identifierhttp://rs.tdwg.org/dwc/terms/genus
DefinitionThe full scientific name of the genus in which the dwc:Taxon is classified.
Comments
Examples
  • Puma
  • Monoclea
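
The plan in this document takes taxonomy from `parents_json`, so the rank terms above (kingdom through genus) could be filled in one pass over that list. The sketch below assumes, hypothetically, that each entry is a dict with `name` and `rank` keys; the real structure in AMI may differ, and `rank_columns` is an illustrative name only.

```python
# Darwin Core higher-rank terms, in the order they appear in this section.
DWC_RANK_TERMS = [
    "kingdom", "phylum", "class", "order", "superfamily",
    "family", "subfamily", "tribe", "subtribe", "genus",
]


def rank_columns(parents):
    """Return a dict of DwC rank term -> name; ranks not present map to ''.

    `parents` is assumed (hypothetically) to be a list of
    {"name": ..., "rank": ...} dicts.
    """
    by_rank = {p["rank"].lower(): p["name"] for p in parents}
    return {term: by_rank.get(term, "") for term in DWC_RANK_TERMS}


row = rank_columns([
    {"name": "Animalia", "rank": "KINGDOM"},
    {"name": "Chordata", "rank": "PHYLUM"},
    {"name": "Mammalia", "rank": "CLASS"},
    {"name": "Carnivora", "rank": "ORDER"},
    {"name": "Felidae", "rank": "FAMILY"},
    {"name": "Puma", "rank": "GENUS"},
])
print(row["family"], row["genus"])
```

Returning an empty string for absent ranks keeps every row the same width, which suits a tab-delimited file with a fixed column list.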
+ + + + + + + + + +
genericName
Identifierhttp://rs.tdwg.org/dwc/terms/genericName
DefinitionThe genus part of the dwc:scientificName without authorship.
CommentsFor synonyms the accepted genus and the genus part of the name may be different. The term dwc:genericName should be used together with dwc:specificEpithet to form a binomial and with dwc:infraspecificEpithet to form a trinomial. The term dwc:genericName should only be used for combinations. Uninomials of generic rank do not have a dwc:genericName.
ExamplesFelis (for scientificName Felis concolor, with accompanying values of Puma concolor in acceptedNameUsage and Puma in genus)
+ + + + + + + + + +
subgenus
Identifierhttp://rs.tdwg.org/dwc/terms/subgenus
DefinitionThe full scientific name of the subgenus in which the dwc:Taxon is classified.
CommentsA value for this term should be a complete subgenus name as required by the appropriate nomenclatural code.
Examples
  • Abacetus (Parastygis)
  • Dicranum subgen. Orthodicranum
+ + + + + + + + + +
infragenericEpithet
Identifierhttp://rs.tdwg.org/dwc/terms/infragenericEpithet
DefinitionThe infrageneric part of a binomial name at ranks above species but below genus.
CommentsThe term dwc:infragenericEpithet should be used in conjunction with dwc:genericName, dwc:specificEpithet, dwc:infraspecificEpithet, dwc:taxonRank and dwc:scientificNameAuthorship to represent the individual elements of the complete dwc:scientificName. It can be used to indicate the subgenus placement of a species, which in zoology is often given in parentheses. Can also be used to share infrageneric names such as botanical sections (e.g., Vicia sect. Cracca).
Examples
  • Abacetillus (for scientificName Abacetus (Abacetillus) ambiguus)
  • Cracca (for scientificName Vicia sect. Cracca)
+ + + + + + + + + +
specificEpithet
Identifierhttp://rs.tdwg.org/dwc/terms/specificEpithet
DefinitionThe name of the first or species epithet of the dwc:scientificName.
Comments
Examples
  • concolor
  • gottschei
+ + + + + + + + + +
infraspecificEpithet
Identifierhttp://rs.tdwg.org/dwc/terms/infraspecificEpithet
DefinitionThe name of the lowest or terminal infraspecific epithet of the dwc:scientificName, excluding any rank designation.
CommentsIn botany, name strings in literature and identifications may have multiple infraspecific ranks. According to the International Code of Nomenclature for algae, fungi, and plants (Shenzhen Code Articles 6.7 & Art. 24.1), valid names only have two epithets, with the lowest rank being the dwc:infraspecificEpithet. For example: the dwc:infraspecificEpithet in the string Indigofera charlieriana subsp. sessilis var. scaberrima is scaberrima and the dwc:scientificName is Indigofera charlieriana var. scaberrima (Schinz) J.B.Gillett. Use dwc:verbatimIdentification for the full name string used in a dwc:Identification.
Examples
  • concolor (for scientificName Puma concolor concolor (Linnaeus, 1771))
  • oxyadenia (for scientificName Quercus agrifolia var. oxyadenia (Torr.) J.T. Howell)
  • laxa (for scientificName Cheilanthes hirta f. laxa (Kunze) W.Jacobsen & N.Jacobsen)
  • scaberrima (for scientificName Indigofera charlieriana var. scaberrima (Schinz) J.B.Gillett)
+ + + + + + + + + +
cultivarEpithet
Identifierhttp://rs.tdwg.org/dwc/terms/cultivarEpithet
DefinitionPart of the name of a cultivar, cultivar group or grex that follows the dwc:scientificName.
CommentsAccording to the Rules of the Cultivated Plant Code, a cultivar name consists of a botanical name followed by a cultivar epithet. The value given as the dwc:cultivarEpithet should exclude any quotes. The term dwc:taxonRank should be used to indicate which type of cultivated plant name (e.g. cultivar, cultivar group, grex) is concerned. This epithet, including any enclosing apostrophes or suffix, should be provided in dwc:scientificName as well.
Examples
  • King Edward (for scientificName Solanum tuberosum 'King Edward' and taxonRank cultivar)
  • Mishmiense (for scientificName Rhododendron boothii Mishmiense Group and taxonRank cultivar group)
  • Atlantis (for scientificName Paphiopedilum Atlantis grex and taxonRank grex)
+ + + + + + + + + +
taxonRank
Identifierhttp://rs.tdwg.org/dwc/terms/taxonRank
DefinitionThe taxonomic rank of the most specific name in the dwc:scientificName.
CommentsRecommended best practice is to use a controlled vocabulary. The taxon ranks of algae, fungi and plants are defined in the International Code of Nomenclature for algae, fungi, and plants (Shenzhen Code Articles H3.2, H4.4 and H.3.1).
Examples
  • subspecies
  • varietas
  • forma
  • species
  • genus
  • nothogenus
  • nothospecies
  • nothosubspecies
+ + + + + + + + + +
verbatimTaxonRank
Identifierhttp://rs.tdwg.org/dwc/terms/verbatimTaxonRank
DefinitionThe taxonomic rank of the most specific name in the dwc:scientificName as it appears in the original record.
Comments
Examples
  • Agamospecies
  • sub-lesus
  • prole
  • apomict
  • nothogrex
  • sp.
  • subsp.
  • var.
+ + + + + + + + + +
scientificNameAuthorship
Identifierhttp://rs.tdwg.org/dwc/terms/scientificNameAuthorship
DefinitionThe authorship information for the dwc:scientificName formatted according to the conventions of the applicable dwc:nomenclaturalCode.
Comments
Examples
  • (Torr.) J.T. Howell
  • (Martinovský) Tzvelev
  • (Györfi, 1952)
+ + + + + + + + + +
vernacularName
Identifierhttp://rs.tdwg.org/dwc/terms/vernacularName
DefinitionA common or vernacular name.
Comments
Examples
  • Andean Condor
  • Condor Andino
  • American Eagle
  • Gänsegeier
+ + + + + + + + + +
nomenclaturalCode
Identifierhttp://rs.tdwg.org/dwc/terms/nomenclaturalCode
DefinitionThe nomenclatural code (or codes in the case of an ambiregnal name) under which the dwc:scientificName is constructed.
CommentsRecommended best practice is to use a controlled vocabulary.
Examples
  • ICN
  • ICZN
  • BC
  • ICNCP
  • BioCode
+ + + + + + + + + +
taxonomicStatus
Identifierhttp://rs.tdwg.org/dwc/terms/taxonomicStatus
DefinitionThe status of the use of the dwc:scientificName as a label for a taxon. Requires taxonomic opinion to define the scope of a dwc:Taxon. Rules of priority are then used to define the taxonomic status of the nomenclature contained in that scope, combined with the expert's opinion. It must be linked to a specific taxonomic reference that defines the concept.
CommentsRecommended best practice is to use a controlled vocabulary.
Examples
  • invalid
  • misapplied
  • homotypic synonym
  • accepted
+ + + + + + + + + +
nomenclaturalStatus
Identifierhttp://rs.tdwg.org/dwc/terms/nomenclaturalStatus
DefinitionThe status related to the original publication of the name and its conformance to the relevant rules of nomenclature. It is based essentially on an algorithm according to the business rules of the code. It requires no taxonomic opinion.
Comments
Examples
  • nom. ambig.
  • nom. illeg.
  • nom. subnud.
+ + + + + + + + + +
taxonRemarks
Identifierhttp://rs.tdwg.org/dwc/terms/taxonRemarks
DefinitionComments or notes about the taxon or name.
Comments
Examplesthis name is a misspelling in common use

## MeasurementOrFact

MeasurementOrFact Class
Identifierhttp://rs.tdwg.org/dwc/terms/MeasurementOrFact
DefinitionA measurement of or fact about an rdfs:Resource (http://www.w3.org/2000/01/rdf-schema#Resource).
CommentsResources can be thought of as identifiable records or instances of classes and may include, but need not be limited to instances of dwc:Occurrence, dwc:Organism, dwc:MaterialEntity, dwc:Event, dcterms:Location, dwc:GeologicalContext, dwc:Identification, or dwc:Taxon.
Examples
  • the weight of a dwc:Organism in grams
  • the number of placental scars
  • surface water temperature in Celsius
+ + + + + + + + + + +
measurementID
Identifierhttp://rs.tdwg.org/dwc/terms/measurementID
DefinitionAn identifier for the dwc:MeasurementOrFact (information pertaining to measurements, facts, characteristics, or assertions). May be a global unique identifier or an identifier specific to the data set.
Comments
Examples9c752d22-b09a-11e8-96f8-529269fb1459
+ + + + + + + + + +
parentMeasurementID
Identifierhttp://rs.tdwg.org/dwc/terms/parentMeasurementID
DefinitionAn identifier for a broader dwc:MeasurementOrFact that groups this and potentially other dwc:MeasurementOrFacts.
CommentsMay be a globally unique identifier or an identifier specific to the data set.
Examples
  • 9c752d22-b09a-11e8-96f8-529269fb1459
  • E1_E1_O1_M1
+ + + + + + + + + +
measurementType
Identifierhttp://rs.tdwg.org/dwc/terms/measurementType
DefinitionThe nature of the measurement, fact, characteristic, or assertion.
CommentsRecommended best practice is to use a controlled vocabulary. This term has an equivalent in the dwciri: namespace that allows only an IRI as a value, whereas this term allows for any string literal value.
Examples
  • tail length
  • temperature
  • trap line length
  • survey area
  • trap type
+ + + + + + + + + +
verbatimMeasurementType
Identifierhttp://rs.tdwg.org/dwc/terms/verbatimMeasurementType
DefinitionA string representing the type of measurement or fact as it appeared in the original record.
CommentsThis term is meant to allow the capture of an unaltered original name for a measurement or fact type. This term is meant to be used in addition to dwc:measurementType, not instead of it.
Examples
  • water_temp
  • Fish biomass
  • sampling net mesh size
+ + + + + + + + + +
measurementValue
Identifierhttp://rs.tdwg.org/dwc/terms/measurementValue
DefinitionThe value of the measurement, fact, characteristic, or assertion.
CommentsThis term has an equivalent in the dwciri: namespace that allows only an IRI as a value, whereas this term allows for any string literal value.
Examples
  • 45
  • 20
  • 1
  • 14.5
  • UV-light
+ + + + + + + + + +
measurementAccuracy
Identifierhttp://rs.tdwg.org/dwc/terms/measurementAccuracy
DefinitionThe description of the potential error associated with the dwc:measurementValue.
Comments
Examples
  • 0.01
  • normal distribution with variation of 2 m
+ + + + + + + + + +
measurementUnit
Identifierhttp://rs.tdwg.org/dwc/terms/measurementUnit
DefinitionThe units associated with the dwc:measurementValue.
CommentsRecommended best practice is to use the International System of Units (SI). This term has an equivalent in the dwciri: namespace that allows only an IRI as a value, whereas this term allows for any string literal value.
Examples
  • m
  • g
  • l
  • °C
  • mm
  • km²
  • %
  • hh:mm:ss
+ + + + + + + + + +
measurementDeterminedBy
Identifierhttp://rs.tdwg.org/dwc/terms/measurementDeterminedBy
DefinitionA list (concatenated and separated) of names of people, groups, or organizations who determined the value of the dwc:MeasurementOrFact.
CommentsRecommended best practice is to separate the values in a list with space vertical bar space ( | ). This term has an equivalent in the dwciri: namespace that allows only an IRI as a value, whereas this term allows for any string literal value.
Examples
  • Rob Guralnick
  • Peter Desmet | Stijn Van Hoey
+ + + + + + + + + +
measurementDeterminedDate
Identifierhttp://rs.tdwg.org/dwc/terms/measurementDeterminedDate
DefinitionThe date on which the dwc:MeasurementOrFact was made.
CommentsRecommended best practice is to use a date that conforms to ISO 8601-1:2019.
Examples
  • 1963-03-08T14:07-06:00 (8 Mar 1963 at or after 2:07pm and before 2:08pm in the time zone six hours earlier than UTC)
  • 2009-02-20T08:40Z (20 February 2009 at or after 8:40am and before 8:41 UTC)
  • 2018-08-29T15:19 (29 August 2018 at or after 3:19pm and before 3:20pm local time)
  • 1809-02-12 (within the day 12 February 1809)
  • 1906-06 (in the month of June 1906)
  • 1971 (in the year 1971)
  • 2007-03-01T13:00:00Z/2008-05-11T15:30:00Z (some time within the interval beginning 1 March 2007 at 1pm UTC and before 11 May 2008 at 3:30pm UTC)
  • 1900/1909 (some time within the interval between the beginning of the year 1900 and before the year 1909)
  • 2007-11-13/15 (some time in the interval between the beginning of 13 November 2007 and before 15 November 2007)
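
To illustrate the ISO 8601 recommendation, here is a minimal sketch that formats timezone-aware datetimes as UTC timestamps and as `start/end` intervals. The helper names `to_dwc_datetime` and `to_dwc_interval` are hypothetical, and normalizing everything to UTC is one possible choice rather than a Darwin Core requirement.

```python
from datetime import datetime, timedelta, timezone


def to_dwc_datetime(dt):
    """Format a timezone-aware datetime as UTC ISO 8601 with a 'Z' suffix."""
    if dt.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def to_dwc_interval(start, end):
    """Format a start/end pair as an ISO 8601 interval 'start/end'."""
    return f"{to_dwc_datetime(start)}/{to_dwc_datetime(end)}"


start = datetime(2007, 3, 1, 13, 0, tzinfo=timezone.utc)
end = datetime(2008, 5, 11, 15, 30, tzinfo=timezone.utc)
print(to_dwc_interval(start, end))  # 2007-03-01T13:00:00Z/2008-05-11T15:30:00Z

# A local time six hours behind UTC is shifted to UTC before formatting.
local = datetime(1963, 3, 8, 14, 7, tzinfo=timezone(timedelta(hours=-6)))
print(to_dwc_datetime(local))  # 1963-03-08T20:07:00Z
```

Rejecting naive datetimes avoids silently guessing a time zone for station clocks.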
+ + + + + + + + + +
measurementMethod
Identifierhttp://rs.tdwg.org/dwc/terms/measurementMethod
DefinitionA description of or reference to (publication, URI) the method or protocol used to determine the measurement, fact, characteristic, or assertion.
CommentsThis term has an equivalent in the dwciri: namespace that allows only an IRI as a value, whereas this term allows for any string literal value.
Examples
  • minimum convex polygon around burrow entrances (for a home range area)
  • barometric altimeter (for an elevation)
+ + + + + + + + + +
measurementRemarks
Identifierhttp://rs.tdwg.org/dwc/terms/measurementRemarks
DefinitionComments or notes accompanying the dwc:MeasurementOrFact.
Comments
Examplestip of tail missing
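
The MeasurementOrFact terms above map naturally onto a tab-delimited extension file. The sketch below is a rough illustration using only the standard library: the column selection, the `write_measurements` helper, the `demo` project slug, and the `urn:ami:measurement:...` identifier are all assumptions (only the `urn:ami:event:{project_slug}:{id}` pattern comes from this plan).

```python
import csv
import io

# Column order for the extension file; the first column links back to the core event.
MOF_FIELDS = ["eventID", "measurementID", "measurementType", "measurementValue", "measurementUnit"]


def write_measurements(rows):
    """Write MeasurementOrFact rows as tab-delimited text and return it."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=MOF_FIELDS, delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


text = write_measurements([
    {
        "eventID": "urn:ami:event:demo:1",                # hypothetical slug and id
        "measurementID": "urn:ami:measurement:demo:1",    # hypothetical URN pattern
        "measurementType": "trap type",
        "measurementValue": "UV-light",
        "measurementUnit": "",
    },
])
print(text)
```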

## ResourceRelationship

ResourceRelationship Class
Identifierhttp://rs.tdwg.org/dwc/terms/ResourceRelationship
DefinitionA relationship of one rdfs:Resource (http://www.w3.org/2000/01/rdf-schema#Resource) to another.
CommentsResources can be thought of as identifiable records or instances of classes and may include, but need not be limited to instances of dwc:Occurrence, dwc:Organism, dwc:MaterialEntity, dwc:Event, dcterms:Location, dwc:GeologicalContext, dwc:Identification, or dwc:Taxon.
Examples
  • an instance of a dwc:Organism is the mother of another instance of a dwc:Organism
  • a uniquely identified dwc:Occurrence represents the same dwc:Occurrence as another uniquely identified dwc:Occurrence
  • a dwc:MaterialEntity is a subsample of another dwc:MaterialEntity
+ + + + + + + + + + +
resourceRelationshipID
Identifierhttp://rs.tdwg.org/dwc/terms/resourceRelationshipID
DefinitionAn identifier for an instance of relationship between one resource (the subject) and another (dwc:relatedResource, the object).
Comments
Examples04b16710-b09c-11e8-96f8-529269fb1459
+ + + + + + + + + +
resourceID
Identifierhttp://rs.tdwg.org/dwc/terms/resourceID
DefinitionAn identifier for the resource that is the subject of the relationship.
Comments
Examplesf809b9e0-b09b-11e8-96f8-529269fb1459
+ + + + + + + + + +
relationshipOfResourceID
Identifierhttp://rs.tdwg.org/dwc/terms/relationshipOfResourceID
DefinitionAn identifier for the relationship type (predicate) that connects the subject identified by dwc:resourceID to its object identified by dwc:relatedResourceID.
CommentsRecommended best practice is to use the identifiers of the terms in a controlled vocabulary, such as the OBO Relation Ontology.
Examples
+ + + + + + + + + +
relatedResourceID
Identifierhttp://rs.tdwg.org/dwc/terms/relatedResourceID
DefinitionAn identifier for a related resource (the object, rather than the subject of the relationship).
Comments
Examplesdc609808-b09b-11e8-96f8-529269fb1459
+ + + + + + + + + +
relationshipOfResource
Identifierhttp://rs.tdwg.org/dwc/terms/relationshipOfResource
DefinitionThe relationship of the subject (identified by dwc:resourceID) to the object (identified by dwc:relatedResourceID).
CommentsRecommended best practice is to use a controlled vocabulary.
Examples
  • same as
  • duplicate of
  • mother of
  • offspring of
  • sibling of
  • parasite of
  • host of
  • valid synonym of
  • located within
  • pollinator of members of taxon
  • pollinated specific plant
  • pollinated by members of taxon
  • on slab with
+ + + + + + + + + +
relationshipAccordingTo
Identifierhttp://rs.tdwg.org/dwc/terms/relationshipAccordingTo
DefinitionThe source (person, organization, publication, reference) establishing the relationship between the two resources.
Comments
ExamplesJulie Woodruff
+ + + + + + + + + +
relationshipEstablishedDate
Identifierhttp://rs.tdwg.org/dwc/terms/relationshipEstablishedDate
DefinitionThe date-time on which the relationship between the two resources was established.
CommentsRecommended best practice is to use a date that conforms to ISO 8601-1:2019.
Examples
  • 1963-03-08T14:07-06:00 (8 Mar 1963 at or after 2:07pm and before 2:08pm in the time zone six hours earlier than UTC)
  • 2009-02-20T08:40Z (20 February 2009 at or after 8:40am and before 8:41 UTC)
  • 2018-08-29T15:19 (29 August 2018 at or after 3:19pm and before 3:20pm local time)
  • 1809-02-12 (within the day 12 February 1809)
  • 1906-06 (in the month of June 1906)
  • 1971 (in the year 1971)
  • 2007-03-01T13:00:00Z/2008-05-11T15:30:00Z (some time within the interval beginning 1 March 2007 at 1pm UTC and before 11 May 2008 at 3:30pm UTC)
  • 1900/1909 (some time within the interval between the beginning of the year 1900 and before the year 1909)
  • 2007-11-13/15 (some time in the interval between the beginning of 13 November 2007 and before 15 November 2007)
+ + + + + + + + + +
relationshipRemarks
Identifierhttp://rs.tdwg.org/dwc/terms/relationshipRemarks
DefinitionComments or notes about the relationship between the two resources.
Comments
Examples
  • mother and offspring collected from the same nest
  • pollinator captured in the act
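
The subject, predicate, and object roles described by the ResourceRelationship terms can be sketched as a small record type. The dataclass below is illustrative only; its field names reuse the Darwin Core term names, the values are taken from the examples in this section, and nothing here reflects an existing AMI model.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ResourceRelationship:
    """One subject -> predicate -> object statement using Darwin Core term names."""

    resourceRelationshipID: str
    resourceID: str              # subject of the relationship
    relationshipOfResource: str  # predicate, e.g. "mother of"
    relatedResourceID: str       # object of the relationship
    relationshipRemarks: str = ""


rel = ResourceRelationship(
    resourceRelationshipID="04b16710-b09c-11e8-96f8-529269fb1459",
    resourceID="f809b9e0-b09b-11e8-96f8-529269fb1459",
    relationshipOfResource="mother of",
    relatedResourceID="dc609808-b09b-11e8-96f8-529269fb1459",
    relationshipRemarks="mother and offspring collected from the same nest",
)
print(asdict(rel))
```

`asdict` preserves field order, so the result can be passed directly to a row writer with matching column names.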

## UseWithIRI

For more information on `UseWithIRI`, see [Section 2.5 of the RDF Guide](https://dwc.tdwg.org/rdf/#25-terms-in-the-dwciri-namespace-normative).

behavior
Identifierhttp://rs.tdwg.org/dwc/iri/behavior
DefinitionA description of the behavior shown by the subject at the time the dwc:Occurrence was recorded.
CommentsRecommended best practice is to use a controlled vocabulary. Terms in the dwciri: namespace are intended to be used in RDF with non-literal objects.
Examples
+ + + + + + + + + +
caste
Identifierhttp://rs.tdwg.org/dwc/iri/caste
DefinitionCategorisation of individuals for eusocial species (including some mammals and arthropods).
CommentsRecommended best practice is to use a controlled vocabulary that aligns best with the dwc:Taxon. Terms in the dwciri: namespace are intended to be used in RDF with non-literal objects.
Examples
+ + + + + + + + + +
dataGeneralizations
Identifierhttp://rs.tdwg.org/dwc/iri/dataGeneralizations
DefinitionActions taken to make the shared data less specific or complete than in its original form. Suggests that alternative data of higher quality may be available on request.
CommentsTerms in the dwciri: namespace are intended to be used in RDF with non-literal objects.
Examples
+ + + + + + + + + +
degreeOfEstablishment
Identifierhttp://rs.tdwg.org/dwc/iri/degreeOfEstablishment
DefinitionThe degree to which a dwc:Organism survives, reproduces, and expands its range at the given place and time.
CommentsRecommended best practice is to use IRIs from the controlled vocabulary designated for use with this term, listed at http://rs.tdwg.org/dwc/doc/doe/. For details, refer to https://doi.org/10.3897/biss.3.38084 . Terms in the dwciri: namespace are intended to be used in RDF with non-literal objects.
Examples
+ + + + + + + + + +
disposition
Identifierhttp://rs.tdwg.org/dwc/iri/disposition
DefinitionThe current state of a specimen with respect to the collection identified in dwc:collectionCode or dwc:collectionID.
CommentsRecommended best practice is to use a controlled vocabulary. Terms in the dwciri: namespace are intended to be used in RDF with non-literal objects.
Examples
+ + + + + + + + + +
earliestGeochronologicalEra
Identifierhttp://rs.tdwg.org/dwc/iri/earliestGeochronologicalEra
DefinitionUse to link a dwc:GeologicalContext instance to chronostratigraphic time periods at the lowest possible level in a standardized hierarchy. Use this property to point to the earliest possible geological time period from which the dwc:MaterialEntity was collected.
CommentsRecommended best practice is to use an IRI from a controlled vocabulary. A "convenience property" that replaces Darwin Core literal-value terms related to geological context. See Section 2.7.6 of the Darwin Core RDF Guide for details.
Examples
+ + + + + + + + + +
establishmentMeans
Identifierhttp://rs.tdwg.org/dwc/iri/establishmentMeans
DefinitionStatement about whether a dwc:Organism has been introduced to a given place and time through the direct or indirect activity of modern humans.
CommentsRecommended best practice is to use IRIs from the controlled vocabulary designated for use with this term, listed at http://rs.tdwg.org/dwc/doc/em/. For details, refer to https://doi.org/10.3897/biss.3.38084 . Terms in the dwciri: namespace are intended to be used in RDF with non-literal objects.
Examples
+ + + + + + + + + +
eventType
Identifierhttp://rs.tdwg.org/dwc/iri/eventType
DefinitionThe nature of the dwc:Event.
CommentsRecommended best practice is to use a controlled vocabulary. Regardless of the dwc:eventType, the interval of the dwc:Event can be captured in dwc:eventDate. Terms in the dwciri: namespace are intended to be used in RDF with non-literal objects.
Examples
+ + + + + + + + + +
fieldNotes
Identifierhttp://rs.tdwg.org/dwc/iri/fieldNotes
DefinitionOne of a) an indicator of the existence of, b) a reference to (publication, URI), or c) the text of notes taken in the field about the dwc:Event.
CommentsThe subject is a dwc:Event instance and the object is a (possibly IRI-identified) resource that is the field notes.
Examples
+ + + + + + + + + +
fieldNumber
Identifierhttp://rs.tdwg.org/dwc/iri/fieldNumber
DefinitionAn identifier given to the event in the field. Often serves as a link between field notes and the dwc:Event.
CommentsThe subject is a (possibly IRI-identified) resource that is the field notes and the object is a dwc:Event instance.
Examples
+ + + + + + + + + +
footprintSRS
Identifierhttp://rs.tdwg.org/dwc/iri/footprintSRS
DefinitionThe ellipsoid, geodetic datum, or spatial reference system (SRS) upon which the geometry given in dwc:footprintWKT is based.
CommentsRecommended best practice is to use an IRI for the EPSG code of the SRS, if known. Otherwise use a controlled vocabulary IRI for the name or code of the geodetic datum, if known. Otherwise use a controlled vocabulary IRI for the name or code of the ellipsoid, if known. Otherwise use an IRI for the value corresponding to not recorded.
Examples
+ + + + + + + + + +
footprintWKT
Identifierhttp://rs.tdwg.org/dwc/iri/footprintWKT
DefinitionA Well-Known Text (WKT) representation of the shape (footprint, geometry) that defines the dcterms:Location. A dcterms:Location may have both a point-radius representation (see dwc:decimalLatitude) and a footprint representation, and they may differ from each other.
CommentsTerms in the dwciri: namespace are intended to be used in RDF with non-literal objects.
Examples
+ + + + + + + + + +
fundingAttribution
Identifierhttp://rs.tdwg.org/dwc/iri/fundingAttribution
DefinitionAn organization or agency that provided funding for a project.
CommentsTerms in the dwciri: namespace are intended to be used in RDF with non-literal objects.
Examples
+ + + + + + + + + +
fromLithostratigraphicUnit
Identifierhttp://rs.tdwg.org/dwc/iri/fromLithostratigraphicUnit
DefinitionUse to link a dwc:GeologicalContext instance to an IRI-identified lithostratigraphic unit at the lowest possible level in a hierarchy.
CommentsRecommended best practice is to use an IRI from a controlled vocabulary. A "convenience property" that replaces Darwin Core literal-value terms related to geological context. See Section 2.7.7 of the Darwin Core RDF Guide for details.
Examples
+ + + + + + + + + +
geodeticDatum
Identifierhttp://rs.tdwg.org/dwc/iri/geodeticDatum
DefinitionThe ellipsoid, geodetic datum, or spatial reference system (SRS) upon which the geographic coordinates given in dwc:decimalLatitude and dwc:decimalLongitude are based.
CommentsRecommended best practice is to use an IRI for the EPSG code of the SRS, if known. Otherwise use a controlled vocabulary for the name or code of the geodetic datum, if known. Otherwise use a controlled vocabulary for the name or code of the ellipsoid, if known. If none of these is known, use an IRI corresponding to the value not recorded.
Examples
+ + + + + + + + + +
georeferencedBy
Identifierhttp://rs.tdwg.org/dwc/iri/georeferencedBy
DefinitionA person, group, or organization who determined the georeference (spatial representation) for the dcterms:Location.
CommentsTerms in the dwciri: namespace are intended to be used in RDF with non-literal objects.
Examples
+ + + + + + + + + +
georeferenceProtocol
Identifierhttp://rs.tdwg.org/dwc/iri/georeferenceProtocol
DefinitionA description or reference to the methods used to determine the spatial footprint, coordinates, and uncertainties.
CommentsTerms in the dwciri: namespace are intended to be used in RDF with non-literal objects.
Examples
+ + + + + + + + + +
georeferenceSources
Identifierhttp://rs.tdwg.org/dwc/iri/georeferenceSources
DefinitionA map, gazetteer, or other resource used to georeference the dcterms:Location.
CommentsTerms in the dwciri: namespace are intended to be used in RDF with non-literal objects.
Examples
+ + + + + + + + + +
georeferenceVerificationStatus
Identifierhttp://rs.tdwg.org/dwc/iri/georeferenceVerificationStatus
DefinitionA categorical description of the extent to which the georeference has been verified to represent the best possible spatial description for the dcterms:Location of the dwc:Occurrence.
CommentsRecommended best practice is to use a controlled vocabulary. Terms in the dwciri: namespace are intended to be used in RDF with non-literal objects.
Examples
+ + + + + + + + + +
habitat
Identifierhttp://rs.tdwg.org/dwc/iri/habitat
DefinitionA category or description of the habitat in which the dwc:Event occurred.
CommentsTerms in the dwciri: namespace are intended to be used in RDF with non-literal objects.
Examples
+ + + + + + + + + +
identificationQualifier
Identifierhttp://rs.tdwg.org/dwc/iri/identificationQualifier
DefinitionA controlled value to express the determiner's doubts about the dwc:Identification.
CommentsTerms in the dwciri: namespace are intended to be used in RDF with non-literal objects.
Examples
+ + + + + + + + + +
identificationVerificationStatus
Identifierhttp://rs.tdwg.org/dwc/iri/identificationVerificationStatus
DefinitionA categorical indicator of the extent to which the taxonomic identification has been verified to be correct.
CommentsTerms in the dwciri: namespace are intended to be used in RDF with non-literal objects. Recommended best practice is to use a controlled vocabulary such as that used in HISPID and ABCD.
Examples
+ + + + + + + + + +
identifiedBy
Identifierhttp://rs.tdwg.org/dwc/iri/identifiedBy
DefinitionA person, group, or organization who assigned the dwc:Taxon to the subject.
CommentsWhen used in the context of an Event (such as in the Humboldt Extension), the subject consists of all of the dwc:Organisms related to the Event. Terms in the dwciri: namespace are intended to be used in RDF with non-literal objects.
Examples
+ + + + + + + + + +
inCollection
Identifierhttp://rs.tdwg.org/dwc/iri/inCollection
DefinitionUse to link any subject resource that is part of a collection to the collection containing the resource.
CommentsRecommended best practice is to use an IRI from a controlled registry. A "convenience property" that replaces literal-value terms related to collections and institutions. See Section 2.7.3 of the Darwin Core RDF Guide for details.
Examples
+ + + + + + + + + +
inDataset
Identifierhttp://rs.tdwg.org/dwc/iri/inDataset
DefinitionUse to link a subject dataset record to the dataset which contains it.
CommentsA string literal name of the dataset can be provided using the term dwc:datasetName. See the Darwin Core RDF Guide for details.
Examples
+ + + + + + + + + +
inDescribedPlace
Identifierhttp://rs.tdwg.org/dwc/iri/inDescribedPlace
DefinitionUse to link a dcterms:Location instance subject to the lowest level standardized hierarchically-described resource.
CommentsRecommended best practice is to use an IRI from a controlled registry. A "convenience property" that replaces Darwin Core literal-value terms related to locations. See Section 2.7.5 of the Darwin Core RDF Guide for details.
Exampleshttp://vocab.getty.edu/tgn/1019987
+ + + + + + + + + +
informationWithheld
Identifierhttp://rs.tdwg.org/dwc/iri/informationWithheld
DefinitionAdditional information that exists, but that has not been shared in the given record.
CommentsTerms in the dwciri: namespace are intended to be used in RDF with non-literal objects.
Examples
+ + + + + + + + + +
language
Identifierhttp://purl.org/dc/terms/language
DefinitionA language of the resource.
CommentsRecommended best practice is to use an IRI from the Library of Congress ISO 639-2 scheme http://id.loc.gov/vocabulary/iso639-2
Examples
+ + + + + + + + + +
latestGeochronologicalEra
Identifierhttp://rs.tdwg.org/dwc/iri/latestGeochronologicalEra
DefinitionUse to link a dwc:GeologicalContext instance to chronostratigraphic time periods at the lowest possible level in a standardized hierarchy. Use this property to point to the latest possible geological time period from which the dwc:MaterialEntity was collected.
CommentsRecommended best practice is to use an IRI from a controlled vocabulary. A "convenience property" that replaces Darwin Core literal-value terms related to geological context. See Section 2.7.6 of the Darwin Core RDF Guide for details.
Examples
+ + + + + + + + + +
lifeStage
Identifierhttp://rs.tdwg.org/dwc/iri/lifeStage
DefinitionThe age class or life stage of the dwc:Organism(s) at the time the dwc:Occurrence was recorded.
CommentsRecommended best practice is to use a controlled vocabulary. Terms in the dwciri: namespace are intended to be used in RDF with non-literal objects.
Examples
+ + + + + + + + + +
locationAccordingTo
Identifierhttp://rs.tdwg.org/dwc/iri/locationAccordingTo
DefinitionInformation about the source of this dcterms:Location information. Could be a publication (gazetteer), institution, or team of individuals.
CommentsTerms in the dwciri: namespace are intended to be used in RDF with non-literal objects.
Examples
+ + + + + + + + + +
measurementDeterminedBy
Identifierhttp://rs.tdwg.org/dwc/iri/measurementDeterminedBy
DefinitionA person, group, or organization who determined the value of the dwc:MeasurementOrFact.
CommentsTerms in the dwciri: namespace are intended to be used in RDF with non-literal objects.
Examples
+ + + + + + + + + +
measurementMethod
Identifierhttp://rs.tdwg.org/dwc/iri/measurementMethod
DefinitionThe method or protocol used to determine the measurement, fact, characteristic, or assertion.
CommentsTerms in the dwciri: namespace are intended to be used in RDF with non-literal objects.
Examples
+ + + + + + + + + +
measurementType
Identifierhttp://rs.tdwg.org/dwc/iri/measurementType
DefinitionThe nature of the measurement, fact, characteristic, or assertion.
CommentsRecommended best practice is to use a controlled vocabulary. Terms in the dwciri: namespace are intended to be used in RDF with non-literal objects.
Examples
+ + + + + + + + + +
measurementUnit
Identifierhttp://rs.tdwg.org/dwc/iri/measurementUnit
DefinitionThe units associated with the dwc:measurementValue.
CommentsRecommended best practice is to use a controlled vocabulary such as the Ontology of Units of Measure http://www.wurvoc.org/vocabularies/om-1.8/ of SI units, derived units, or other non-SI units accepted for use within the SI.
Examples
+ + + + + + + + + +
measurementValue
Identifierhttp://rs.tdwg.org/dwc/iri/measurementValue
DefinitionThe value of the measurement, fact, characteristic, or assertion.
CommentsTerms in the dwciri: namespace are intended to be used in RDF with non-literal objects.
Exampleshttp://vocab.nerc.ac.uk/collection/L22/current/TOOL0960/
+ + + + + + + + + +
occurrenceStatus
Identifierhttp://rs.tdwg.org/dwc/iri/occurrenceStatus
DefinitionA statement about the presence or absence of a dwc:Taxon at a dcterms:Location.
CommentsRecommended best practice is to use a controlled vocabulary. Terms in the dwciri: namespace are intended to be used in RDF with non-literal objects.
Examples
+ + + + + + + + + +
organismQuantityType
Identifierhttp://rs.tdwg.org/dwc/iri/organismQuantityType
DefinitionThe type of quantification system used for the quantity of organisms.
CommentsA dwc:organismQuantityType must have a corresponding dwc:organismQuantity.
Examples
+ + + + + + + + + +
pathway
Identifierhttp://rs.tdwg.org/dwc/iri/pathway
DefinitionThe process by which a dwc:Organism came to be in a given place at a given time.
CommentsRecommended best practice is to use IRIs from the controlled vocabulary designated for use with this term, listed at http://rs.tdwg.org/dwc/doc/pw/. For details, refer to https://doi.org/10.3897/biss.3.38084 . Terms in the dwciri: namespace are intended to be used in RDF with non-literal objects.
Examples
+ + + + + + + + + +
preparations
Identifierhttp://rs.tdwg.org/dwc/iri/preparations
DefinitionA preparation or preservation method for a specimen.
CommentsTerms in the dwciri: namespace are intended to be used in RDF with non-literal objects.
Examples
+ + + + + + + + + +
recordedBy
Identifierhttp://rs.tdwg.org/dwc/iri/recordedBy
DefinitionA person, group, or organization responsible for recording the original dwc:Occurrence.
CommentsTerms in the dwciri: namespace are intended to be used in RDF with non-literal objects.
Examples
+ + + + + + + + + +
recordNumber
Identifierhttp://rs.tdwg.org/dwc/iri/recordNumber
DefinitionAn identifier given to the dwc:Occurrence at the time it was recorded. Often serves as a link between field notes and a dwc:Occurrence record, such as a specimen collector's number.
CommentsThe subject is a dwc:Occurrence and the object is a (possibly IRI-identified) resource that is the field notes.
Examples
+ + + + + + + + + +
reproductiveCondition
Identifierhttp://rs.tdwg.org/dwc/iri/reproductiveCondition
DefinitionThe reproductive condition of the biological individual(s) represented in the dwc:Occurrence.
CommentsRecommended best practice is to use a controlled vocabulary. Terms in the dwciri: namespace are intended to be used in RDF with non-literal objects.
Examples
+ + + + + + + + + +
sampleSizeUnit
Identifierhttp://rs.tdwg.org/dwc/iri/sampleSizeUnit
DefinitionThe unit of measurement of the size (time duration, length, area, or volume) of a sample in a sampling dwc:Event.
CommentsA dwciri:sampleSizeUnit must have a corresponding dwc:sampleSizeValue. Recommended best practice is to use a controlled vocabulary such as the Ontology of Units of Measure http://www.wurvoc.org/vocabularies/om-1.8/ of SI units, derived units, or other non-SI units accepted for use within the SI.
Examples
+ + + + + + + + + +
samplingProtocol
Identifierhttp://rs.tdwg.org/dwc/iri/samplingProtocol
DefinitionThe methods or protocols used during a dwc:Event, denoted by an IRI.
CommentsRecommended best practice is to describe a dwc:Event with no more than one sampling protocol. In the case of a summary dwc:Event in which a specific protocol cannot be attributed to specific dwc:Occurrences, the recommended best practice is to repeat the property for each IRI that denotes a different sampling protocol that applies to the dwc:Occurrence.
Exampleshttps://doi.org/10.1111/j.1466-8238.2009.00467.x
+ + + + + + + + + +
sex
Identifierhttp://rs.tdwg.org/dwc/iri/sex
DefinitionThe sex of the biological individual(s) represented in the dwc:Occurrence.
CommentsRecommended best practice is to use a controlled vocabulary. Terms in the dwciri: namespace are intended to be used in RDF with non-literal objects.
Examples
+ + + + + + + + + +
toDigitalSpecimen
Identifierhttp://rs.tdwg.org/dwc/iri/toDigitalSpecimen
DefinitionUse to link a dwc:MaterialEntity instance subject to a Digital Specimen entity.
Comments
Examples
+ + + + + + + + + +
toTaxon
Identifierhttp://rs.tdwg.org/dwc/iri/toTaxon
DefinitionUse to link a dwc:Identification instance subject to a taxonomic entity such as a taxon, taxon concept, or taxon name use.
CommentsA "convenience property" that replaces Darwin Core literal-value terms related to taxonomic entities. See Section 2.7.4 of the Darwin Core RDF Guide for details.
Examples
+ + + + + + + + + +
typeStatus
Identifierhttp://rs.tdwg.org/dwc/iri/typeStatus
DefinitionA nomenclatural type (type status, typified scientific name, publication) applied to the subject.
CommentsTerms in the dwciri: namespace are intended to be used in RDF with non-literal objects.
Examples
+ + + + + + + + + +
verbatimCoordinateSystem
Identifierhttp://rs.tdwg.org/dwc/iri/verbatimCoordinateSystem
DefinitionThe spatial coordinate system for the dwc:verbatimLatitude and dwc:verbatimLongitude or the dwc:verbatimCoordinates of the dcterms:Location.
CommentsRecommended best practice is to use a controlled vocabulary. Terms in the dwciri: namespace are intended to be used in RDF with non-literal objects.
Examples
+ + + + + + + + + +
verbatimSRS
Identifierhttp://rs.tdwg.org/dwc/iri/verbatimSRS
DefinitionThe ellipsoid, geodetic datum, or spatial reference system (SRS) upon which coordinates given in dwc:verbatimLatitude and dwc:verbatimLongitude, or dwc:verbatimCoordinates are based.
CommentsRecommended best practice is to use an IRI for the EPSG code of the SRS, if known. Otherwise use a controlled vocabulary IRI for the name or code of the geodetic datum, if known. Otherwise use a controlled vocabulary IRI for the name or code of the ellipsoid, if known. Otherwise use an IRI for the value corresponding to not recorded.
Examples
+ + + + + + + + + +
verticalDatum
Identifierhttp://rs.tdwg.org/dwc/iri/verticalDatum
DefinitionThe vertical datum used as the reference upon which the values in the elevation terms are based.
CommentsRecommended best practice is to use a controlled vocabulary. Terms in the dwciri: namespace are intended to be used in RDF with non-literal objects.
Examples
+ + + + + + + + + +
vitality
Identifierhttp://rs.tdwg.org/dwc/iri/vitality
DefinitionAn indication of whether a dwc:Organism was alive or dead at the time of collection or observation.
CommentsRecommended best practice is to use a controlled vocabulary. Intended to be used with records having a dwc:basisOfRecord of PreservedSpecimen, MaterialEntity, MaterialSample, or HumanObservation. Terms in the dwciri: namespace are intended to be used in RDF with non-literal objects.
Examples
+
+## LivingSpecimen
+
+
+ + + + + + + + + +
LivingSpecimen Class
Identifierhttp://rs.tdwg.org/dwc/terms/LivingSpecimen
DefinitionA specimen that is alive.
Comments
Examples
  • a living plant in a botanical garden
  • a living animal in a zoo
+
+## PreservedSpecimen
+
+
+ + + + + + + + + +
PreservedSpecimen Class
Identifierhttp://rs.tdwg.org/dwc/terms/PreservedSpecimen
DefinitionA specimen that has been preserved.
Comments
Examples
  • a plant on an herbarium sheet
  • a cataloged lot of fish in a jar
+
+## FossilSpecimen
+
+
+ + + + + + + + + +
FossilSpecimen Class
Identifierhttp://rs.tdwg.org/dwc/terms/FossilSpecimen
DefinitionA preserved specimen that is a fossil.
Comments
Examples
  • a body fossil
  • a coprolite
  • a gastrolith
  • an ichnofossil
  • a piece of a petrified tree
+
+## MaterialCitation
+
+
+ + + + + + + + + +
MaterialCitation Class
Identifierhttp://rs.tdwg.org/dwc/terms/MaterialCitation
DefinitionA reference to or citation of one, a part of, or multiple specimens in scholarly publications.
CommentsThis class constitutes a new value for the controlled vocabulary in the recommendations for basisOfRecord. When importing Darwin Core Archives of literature-based datasets to GBIF, the basisOfRecord should be changed from "Occurrence", "PreservedSpecimen" or "Literature" to "MaterialCitation".
Examples
  • a citation of a physical specimen from a scientific collection in a taxonomic treatment in a scientific publication
  • a citation of a group of physical specimens, such as paratypes in a taxonomic treatment in a scientific publication
+
+## HumanObservation
+
+
+ + + + + + + + + +
HumanObservation Class
Identifierhttp://rs.tdwg.org/dwc/terms/HumanObservation
DefinitionAn output of a human observation process.
Comments
Examples
  • evidence of a dwc:Occurrence taken from field notes or literature
  • a record of a dwc:Occurrence without physical evidence or evidence captured with a machine
+
+## MachineObservation
+
+
+ + + + + + + + + +
MachineObservation Class
Identifierhttp://rs.tdwg.org/dwc/terms/MachineObservation
DefinitionAn output of a machine observation process.
Comments
Examples
  • a photograph
  • a video
  • an audio recording
  • a remote sensing image
  • a dwc:Occurrence record based on telemetry
+
+## Cite Darwin Core
+
+To cite Darwin Core in general, use the peer-reviewed article on Darwin Core:
+
+> Wieczorek J, Bloom D, Guralnick R, Blum S, Döring M, et al. (2012) Darwin Core: An Evolving Community-Developed Biodiversity Data Standard. PLoS ONE 7(1): e29715.
+
+To cite the standard document upon which this page is built, use the following:
+
+> Darwin Core Maintenance Group. 2021. List of Darwin Core terms. Biodiversity Information Standards (TDWG).
+
+To cite this document specifically, use the following:
+
+> Darwin Core Maintenance Group. 2021. Darwin Core Quick Reference Guide. Biodiversity Information Standards (TDWG).
\ No newline at end of file
diff --git a/docs/claude/dwca-format-reference.md b/docs/claude/dwca-format-reference.md
new file mode 100644
index 000000000..8863d13ff
--- /dev/null
+++ b/docs/claude/dwca-format-reference.md
@@ -0,0 +1,208 @@
+# Darwin Core Archive (DwC-A) Format Reference
+
+## What is DwC-A?
+
+A ZIP archive containing standardized biodiversity data files. It is the standard format for sharing occurrence and sampling-event data with GBIF, OBIS, and other biodiversity data aggregators.
+
+## Archive Structure
+
+AMI's April 2026 draft emits an archive of four data files plus two metadata documents (`meta.xml` and `eml.xml`), laid out as follows:
+
+```
+project_export.zip
+├── meta.xml               DwC-A text-archive descriptor
+├── eml.xml                EML 2.2.0 dataset metadata
+├── event.txt              Core: one Event row per AMI Event, with
+│                          Humboldt eco: columns flattened in
+├── occurrence.txt         Extension: coreid=eventID, one row per
+│                          published Occurrence. associatedMedia
+│                          column carries pipe-separated capture URLs.
+├── multimedia.txt         Extension: coreid=eventID. Two row types:
+│                          - capture rows (occurrenceID blank)
+│                          - detection-crop rows (occurrenceID populated)
+└── measurementorfact.txt  Extension: coreid=eventID. Per-occurrence
+                           classificationScore; per-detection
+                           detectionScore and boundingBox.
+```
+
+## Humboldt Extension columns on event.txt
+
+Humboldt Extension (`eco:`) terms are flattened onto Event Core rows as the pragmatic alternative to a separate `humboldt.txt` extension (GBIF accepts this shape). The values encode the scientific contribution of automated monitoring: sampling-effort structure plus provable absence during known sampling windows.
+
+| Column | Term | Source |
+|---|---|---|
+| isSamplingEffortReported | eco:isSamplingEffortReported | constant `true` |
+| samplingEffortValue | eco:samplingEffortValue | `Event.captures_count` |
+| samplingEffortUnit | eco:samplingEffortUnit | constant `images` |
+| samplingEffortProtocol | eco:samplingEffortProtocol | constant protocol description |
+| isAbsenceReported | eco:isAbsenceReported | constant `true` (per-taxon rows deferred) |
+| targetTaxonomicScope | eco:targetTaxonomicScope | LCA of `Project.default_filters_include_taxa` |
+| inventoryTypes | eco:inventoryTypes | constant `trap or sample` |
+| protocolNames | eco:protocolNames | constant `AMI ML detector + classifier pipeline` |
+| protocolDescriptions | eco:protocolDescriptions | constant pipeline description |
+| hasMaterialSamples | eco:hasMaterialSamples | constant `true` |
+| materialSampleTypes | eco:materialSampleTypes | constant `digital images` |
+
+## Star Schema
+
+One **core** file, surrounded by **extension** files. Extensions link back to the core via an ID column.
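How that ID-column link is declared in `meta.xml` can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the PR's `generate_meta_xml()`; the helper names `add_table` and `build_meta_xml`, the choice of column 0 as the key, and the sample term lists are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET

DWC_TERMS = "http://rs.tdwg.org/dwc/terms/"
DWC_TEXT_NS = "http://rs.tdwg.org/dwc/text/"


def add_table(archive, tag, row_type, filename, id_tag, terms, first_index):
    """Append a <core> or <extension> element describing one TSV file (hypothetical helper)."""
    table = ET.SubElement(archive, tag, {
        "rowType": DWC_TERMS + row_type,
        "encoding": "UTF-8",
        "fieldsTerminatedBy": "\\t",  # the descriptor stores the escape sequence, not a raw tab
        "linesTerminatedBy": "\\n",
        "fieldsEnclosedBy": "",
        "ignoreHeaderLines": "1",
    })
    files = ET.SubElement(table, "files")
    ET.SubElement(files, "location").text = filename
    # Assumption: column 0 carries the key. It is <id> in the core, <coreid> in an extension.
    ET.SubElement(table, id_tag, {"index": "0"})
    for offset, term in enumerate(terms):
        ET.SubElement(table, "field", {
            "index": str(first_index + offset),
            "term": DWC_TERMS + term,
        })
    return table


def build_meta_xml(core_terms, occurrence_terms):
    """Return meta.xml text for an Event core with one Occurrence extension."""
    archive = ET.Element("archive", {"xmlns": DWC_TEXT_NS, "metadata": "eml.xml"})
    add_table(archive, "core", "Event", "event.txt", "id", core_terms, 0)
    # Extension fields start at column 1, so column 0 is declared only as <coreid>.
    add_table(archive, "extension", "Occurrence", "occurrence.txt", "coreid", occurrence_terms, 1)
    return ET.tostring(archive, encoding="unicode")


meta_xml = build_meta_xml(
    ["eventID", "eventDate", "decimalLatitude", "decimalLongitude"],
    ["occurrenceID", "basisOfRecord", "scientificName"],
)
```

Deriving the descriptor from the same ordered term lists that drive the TSV writer keeps column indexes and term URIs from drifting apart.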
+
+For sampling-event datasets (like AMI):
+- **Core**: Event (one row per sampling event)
+- **Extension**: Occurrence (many occurrences per event)
+
+## meta.xml Specification
+
+```xml
+<?xml version="1.0" encoding="UTF-8"?>
+<archive xmlns="http://rs.tdwg.org/dwc/text/" metadata="eml.xml">
+  <core rowType="http://rs.tdwg.org/dwc/terms/Event" encoding="UTF-8"
+        fieldsTerminatedBy="\t" linesTerminatedBy="\n" ignoreHeaderLines="1">
+    <files>
+      <location>event.txt</location>
+    </files>
+    <id index="0"/>
+    <!-- representative fields only; the full list comes from the exporter's field catalogue -->
+    <field index="0" term="http://rs.tdwg.org/dwc/terms/eventID"/>
+    <field index="1" term="http://rs.tdwg.org/dwc/terms/eventDate"/>
+  </core>
+  <extension rowType="http://rs.tdwg.org/dwc/terms/Occurrence" encoding="UTF-8"
+             fieldsTerminatedBy="\t" linesTerminatedBy="\n" ignoreHeaderLines="1">
+    <files>
+      <location>occurrence.txt</location>
+    </files>
+    <coreid index="0"/>
+    <field index="1" term="http://rs.tdwg.org/dwc/terms/occurrenceID"/>
+    <field default="MachineObservation" term="http://rs.tdwg.org/dwc/terms/basisOfRecord"/>
+  </extension>
+</archive>
+```
+
+### Key Attributes
+
+| Attribute | Default | Notes |
+|-----------|---------|-------|
+| `rowType` | Required | URI: `http://rs.tdwg.org/dwc/terms/Event`, `...Occurrence`, `...Taxon` |
+| `fieldsTerminatedBy` | `,` | Use `\t` for TSV (recommended for DwC-A) |
+| `linesTerminatedBy` | `\n` | Standard newline |
+| `fieldsEnclosedBy` | `"` | Quote character |
+| `encoding` | `UTF-8` | Always use UTF-8 |
+| `ignoreHeaderLines` | `0` | Set to `1` if header row present |
+| `dateFormat` | `YYYY-MM-DD` | ISO 8601 |
+
+### Field Element
+
+- `index` (0-based): column position in the data file
+- `term`: Darwin Core term URI
+- `default`: constant value for all rows (no index needed)
+
+### ID Elements
+
+- `<id>` in core: column containing the unique record ID
+- `<coreid>` in extensions: column containing the core record's ID (foreign key)
+
+## EML Metadata (eml.xml)
+
+Describes the dataset: title, abstract, creators, geographic/temporal coverage, methods, etc. GBIF provides an EML profile. Minimum useful content:
+
+```xml
+<?xml version="1.0" encoding="UTF-8"?>
+<eml:eml xmlns:eml="https://eml.ecoinformatics.org/eml-2.2.0"
+         packageId="{package_id}" system="{system}">
+  <dataset>
+    <title>{project.name}</title>
+    <creator>
+      <organizationName>{project.owner or institution}</organizationName>
+    </creator>
+    <abstract>
+      <para>{project.description}</para>
+    </abstract>
+    <intellectualRights>
+      <para>License information here</para>
+    </intellectualRights>
+  </dataset>
+</eml:eml>
+```
+
+## Key DwC Terms for AMI Data
+
+### Event Terms (Core)
+
+| DwC Term | AMI Source | Notes |
+|----------|-----------|-------|
+| eventID | `urn:ami:event:{project_slug}:{event.id}` | Globally unique |
+| parentEventID | | Empty for now (could link to deployment-level events) |
+| eventType | `"CameraTrapSession"` | Or custom vocabulary |
+| eventDate | `event.start` / `event.end` as ISO interval | `2024-06-15/2024-06-16` |
+| year | from `event.start` | |
+| month | from `event.start` | |
+| day | from `event.start` | |
+| samplingProtocol | `"automated light trap with camera"` | Project-level constant |
+| sampleSizeValue | `event.captures_count` | Number of images |
+| sampleSizeUnit | `"images"` | |
+| samplingEffort | `event.duration` formatted | e.g. "12 hours" |
+| eventRemarks | | |
+| **Location terms (on event)** | | |
+| locationID | `deployment.name` or `site.name` | |
+| decimalLatitude | `deployment.latitude` | |
+| decimalLongitude | `deployment.longitude` | |
+| geodeticDatum | `"WGS84"` | Assumed |
+| coordinateUncertaintyInMeters | | Not currently stored |
+
+### Occurrence Terms (Extension)
+
+| DwC Term | AMI Source | Notes |
+|----------|-----------|-------|
+| eventID | Same as core eventID | Links occurrence to event |
+| occurrenceID | `urn:ami:occurrence:{project_slug}:{occurrence.id}` | Globally unique |
+| basisOfRecord | `"MachineObservation"` | All records |
+| occurrenceStatus | `"present"` | Always present (we don't record absences) |
+| scientificName | `occurrence.determination.name` | |
+| taxonRank | `occurrence.determination.rank` | Lowercase |
+| kingdom | from `determination.parents_json` | Walk parent chain |
+| phylum | from `determination.parents_json` | |
+| class | from `determination.parents_json` | |
+| order | from `determination.parents_json` | |
+| family | from `determination.parents_json` | |
+| genus | from `determination.parents_json` | |
+| specificEpithet | split from species name | Second word of binomial |
+| vernacularName | `determination.common_name_en` | |
+| taxonID | `determination.gbif_taxon_key` or internal URN | |
+| individualCount | `occurrence.detections_count` | Number of detections |
+| associatedMedia | Detection image URLs | Pipe-separated |
+| identifiedBy | `"AMI ML Pipeline"` or identification user | |
+| dateIdentified | `occurrence.created_at` or identification date | |
+| identificationRemarks | Score info, algorithm used | |
+| identificationVerificationStatus | Verified/Not verified | Based on identifications |
+
+## Validation
+
+- Core IDs must be unique
+- Extension coreid values must reference existing core IDs
+- No literal "NULL" values
+- UTF-8 encoding throughout
+- GBIF validator: https://www.gbif.org/tools/data-validator
+
+## References
+
+- DwC Text Guide: https://dwc.tdwg.org/text/
+- GBIF DwC-A Guide: https://ipt.gbif.org/manual/en/ipt/latest/dwca-guide
+- DwC Terms: https://dwc.tdwg.org/terms/
+- Full terms reference downloaded to: `docs/claude/dwc-terms-reference.md`
diff --git a/docs/claude/dwca-pr-review-and-mapping-spec.md b/docs/claude/dwca-pr-review-and-mapping-spec.md
new file mode 100644
index 000000000..b8abdfd94
--- /dev/null
+++ b/docs/claude/dwca-pr-review-and-mapping-spec.md
@@ -0,0 +1,222 @@
+# DwC-A Export (PR #1131): Review & Mapping Spec
+
+> Scope: two distinct parts.
+> **Part A** reviews the export implementation (class structure, robustness, operations).
+> **Part B** revises the data mapping against project-specific context (CamtrapDP, GBIF camera-trap guide, AMI working-group decisions, InsectAI-Metadata-Standards sheet, Dandjoo conventions).
+
+Related:
+- `docs/claude/dwca-format-reference.md` (PR's own reference, authored earlier)
+- `docs/claude/export-framework.md` (PR's API/ops reference)
+- Google Doc: `1xShA-aRfzSwFQ78MMomUerjU-LD4wOA1VyanukLnIPw`
+
+---
+
+## Part A: Code / Export Approach Review
+
+### A1. Class structure & integration
+
+- `DwCAExporter(BaseExporter)` in `ami/exports/format_types.py:192-290` plugs into the existing export framework (good: inherits filters, progress, artifact upload).
+- Term catalogue split into plain Python tuples (`EVENT_FIELDS`, `OCCURRENCE_FIELDS`) in `ami/exports/dwca.py:26-98`. Simple and readable, but:
+  - **Mapping logic is colocated with term declarations** via lambdas. This is fine for ~20 fields, but becomes hard to test in isolation. Consider a small `DwCField` dataclass (`term`, `header`, `required`, `extract`, `domain`) to enable per-field unit tests, per-field null-handling rules, and programmatic generation of `meta.xml`.
+  - `generate_meta_xml()` reconstructs the same term URIs it already has in the tuples, so the two are not a single source of truth. A field catalogue object would let `meta.xml` be *derived*, not re-written.
+- **Extension model used**: Event Core + Occurrence extension only. No Multimedia / MeasurementOrFact yet. This is the biggest structural decision to revisit in Part B: the working group's Oct 2024 notes and Slide 14 of the Montreal 2024 deck both point to Occurrence Core + Multimedia (Audubon) as the mainstream GBIF pattern. The PR's Event-Core choice is defensible but underjustified.
+
+### A2. Robustness / correctness issues visible in the diff
+
+| Severity | Issue | Location | Status |
+|---|---|---|---|
+| High | `meta.xml` declares an ID element (`<id>` / `<coreid>`) and *also* emits a `<field>` for the same column. Spec-ambiguous. | `dwca.py` meta-gen | Flagged by Copilot, not yet cleanly resolved |
+| High | `individualCount` was set to `detections_count` (bounding-box count across frames, not individuals) | `dwca.py` | Fixed by hardcoding `"1"`, but `"1"` is also wrong once a future pipeline counts multiples; needs a model-level source |
+| High | PII leak: project owner email written to the EML metadata | EML gen | Fixed (a74aee98) |
+| Med | Temp files created with `delete=False`, not cleaned on exception | `format_types.py` | Fixed via `try/finally` (ad1b9109) |
+| Med | `get_filter_backends()` returned an empty list, so user filters were ignored | `format_types.py` | Fixed (c43d4069) |
+| Med | Events derived independently from the filter set, so the archive could include events with no published occurrences | `format_types.py` | Fixed by deriving event IDs from filtered occurrences |
+| Med | Progress callback fires every 500 rows, so small jobs show 0% | export loop | Partially addressed (final update call) |
+| Low | `taxonRank.lower()` on possibly `None` | `dwca.py:63` | Fixed |
+| Low | `vernacularName` lambda precedence ambiguous | `dwca.py:83` | Fixed |
+| Low | EML version mismatch (docs 2.2.0 vs impl 2.1.1); `schemaLocation` relative path | EML | Partially fixed |
+
+### A3. Issues *not* yet raised that should be raised before merge
+
+1. **No runtime validation pass before zipping.** A malformed archive only fails on the GBIF side. Add a post-write validation step (check: core ID uniqueness, non-null required terms, UTF-8, parity between each TSV and its `meta.xml` declarations).
+2. **No explicit `license` / `rights` on the Event rows.** GBIF rejects datasets without a machine-readable license. The Project model has no license field yet; surface this as a `Project.license` model addition or a per-export argument.
+3. **Taxonomic hierarchy via `parents_json` string-matching is fragile.** `_get_rank_from_parents` walks a denormalized JSON blob; if rank strings drift (`"Order"` vs `"order"`), data silently disappears. Prefer an explicit join on `Taxon.rank` through a recursive CTE, or at minimum a normalization pass.
+4. **`specificEpithet` by splitting `scientificName` on whitespace is wrong** for subspecies, hybrids, "cf." qualifiers, and authorship strings. It should come from a structured `Taxon` field or be left blank.
+5. **`apply_default_filters()` is not invoked in the export queryset.** Project-level score thresholds and taxa include/exclude lists (the standard AMI filter) are bypassed, so low-confidence ML output is exported. Either invoke `apply_default_filters` directly, or document why the export deliberately ignores it.
+6. **`identifiedBy` / `dateIdentified` unpopulated.** Flagged as TODO in the PR description, but essential for any ML-provenance claim; without it the archive claims a `MachineObservation` by an unknown identifier.
+7. **Memory shape of the export loop.** The loop currently streams rows one occurrence at a time with `.iterator()` (good), but enriching each occurrence walks `parents_json` (a JSON decode per row) and does a per-row `determination.common_name_en` lookup (triggers N+1 without `select_related`). Verify `select_related` / `prefetch_related` on the queryset.
+8. **Concurrency / partial-write failure mode.** If the export crashes mid-loop, the ZIP is never assembled but the temp files linger (pre-fix) or are removed (post-fix); however, `DataExport.status` transitions aren't bulletproof. Confirm the Celery task wrapping handles `SoftTimeLimitExceeded` and marks the export as failed.
+9. **Archive determinism.** TSV row order = queryset order. For reproducibility (diffable re-exports, GBIF versioning), prefer a stable ORDER BY on `(event.start, occurrence.id)`.
+10. **No CI check against GBIF's DwC-A validator.** A single offline invocation of `gbif-dwca-validator` against the fixture archive would catch nearly every class of bug above.
+
+### A4. Testing
+
+Good: 10 tests in `DwCAExporterTests`, using `setUpClass` for a shared export.
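The structural checks this review keeps returning to (unique core IDs, extension rows that resolve to a core row, no literal `NULL`) are small enough to express as a standalone function that a test could call on the unpacked archive. The following is a minimal sketch under stated assumptions: both files are tab-delimited with a header row and the key in column 0, and `validate_star_schema` is a hypothetical helper, not something in the PR.

```python
import csv
import io
from collections import Counter


def _data_rows(tsv_text):
    """Parse TSV text and drop the header row (assumes exactly one header line)."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    return [row for row in reader if row][1:]


def validate_star_schema(core_tsv, extension_tsv):
    """Return a list of human-readable problems; an empty list means the checks passed."""
    problems = []
    core_rows = _data_rows(core_tsv)
    ext_rows = _data_rows(extension_tsv)

    # Check 1: core IDs must be unique.
    id_counts = Counter(row[0] for row in core_rows)
    for core_id, count in id_counts.items():
        if count > 1:
            problems.append(f"duplicate core id: {core_id}")

    # Check 2: every extension row must point at an existing core row.
    # Line numbers are 1-based and include the header, hence start=2.
    for line_no, row in enumerate(ext_rows, start=2):
        if row[0] not in id_counts:
            problems.append(f"extension line {line_no}: unknown coreid {row[0]}")

    # Check 3: no cell may contain the literal string NULL.
    for name, rows in (("core", core_rows), ("extension", ext_rows)):
        for line_no, row in enumerate(rows, start=2):
            if "NULL" in row:
                problems.append(f"{name} line {line_no}: literal NULL")

    return problems
```

A check like this does not replace the GBIF validator, but it fails fast inside the test suite and needs no network access.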
+Gaps: +- No test asserts `meta.xml` parses and validates against the GBIF DwC schema. +- No test for the null-event / null-determination filter (added in `d11976e7` — confirm coverage is explicit, not implicit). +- No test that exported `occurrence.txt` round-trips through GBIF's validator or at least a `dwclib` / `pygbif` parse. +- No test for multi-taxon parent walking (genus-only taxa, subgenus, ranks beyond species). + +### A5. Operational concerns + +- **Publishing pathway is not wired.** The archive is produced but there's no IPT integration, no DOI minting, no versioning. Document this as out-of-scope or stub the upload target. +- **Re-exports overwrite or version?** Unclear from the diff. GBIF consumes archives by `dataset UUID + version`; producing identical `occurrenceID`s on every re-export is correct only if the publishing target treats new archives as new versions of the same dataset. +- **Sensitive taxa handling.** No `dwc:informationWithheld` / coordinate generalization pass. Endangered species occurrences should be generalized before publication. Currently all lat/lon export at full precision. + +--- + +## Part B — Data Mapping Review + +This section supersedes the mapping baked into `ami/exports/dwca.py` and proposes a project-specific spec grounded in: GBIF Camera Trap guide, CamtrapDP, the AMI Metadata Standards WG (July & Oct 2024), InsectAI-Metadata-Standards sheet, Dandjoo, and Slide 14 of the Montreal 2024 deck. + +### B1. Archive shape (the structural decision) + +The current PR uses **Event Core + Occurrence extension**. The alternatives on the table: + +| Option | Core | Extensions | Rationale | +|---|---|---|---| +| **1. Current PR** | Event | Occurrence | Simple; but no place to attach multimedia or ML provenance without violating star schema | +| **2. 
GBIF camera-trap canonical** | Occurrence | Multimedia (Audubon Core), MeasurementOrFact | Recommended by GBIF camera-trap guide; loses sampling-event metadata (must embed in each occurrence) | +| **3. CamtrapDP → DwC-A downscale** | Event | Occurrence, Multimedia, MeasurementOrFact | Richest; CamtrapDP has a documented downscale path. Matches AMI WG direction. | +| **4. CamtrapDP native + DwC-A sibling** | — | — | Export both; CamtrapDP for insect-specific richness (annotation + model tables), DwC-A for GBIF ingestion | + +**Recommendation: Option 3 (Event Core + Occurrence + Multimedia + MeasurementOrFact)** as the near-term target. Option 4 (native CamtrapDP) is the longer-term target once `Deployment`/`Event` models carry the required CamtrapDP fields (`captureMethod`, `timestampIssues`, `baitUse`, `attractantType`, etc.). + +The WG's Oct 2024 decision point matters: they want to add an **annotation table** and **model metadata table** as CamtrapDP extensions to capture multiple-classifications-per-detection and model provenance. Neither fits the DwC star schema. For the GBIF-facing archive, collapse to a single "preferred" classification per occurrence; for the richer AMI internal export, use CamtrapDP. + +### B2. Term-by-term mapping (proposed, revised from `dwca.py`) + +Legend: **bold** = required for GBIF acceptance; *italic* = recommended; plain = optional. + +#### B2.1 Event Core (`event.txt`) — per AMI `Event` + +| DwC term | Source | Change from PR | Notes | +|---|---|---|---| +| **eventID** | `urn:ami:event:{project.slug}:{event.id}` | keep | ✓ | +| **eventDate** | `event.start` / `event.start/event.end` ISO 8601 interval | keep | ✓ | +| eventTime, year, month, day | derived | keep | ✓ | +| **samplingProtocol** | new `Deployment.sampling_protocol` text field | replace hardcoded | Currently `"automated light trap with camera"` — hardcoded string loses deployment-specific info (UV vs actinic, trigger type, etc.) 
| +| *sampleSizeValue* | `event.captures_count` | keep | ✓ | +| *sampleSizeUnit* | `"images"` | keep | ✓ | +| *samplingEffort* | duration | keep | Consider `"N trap-nights"` for multi-night events | +| *locationID* | deployment slug/URN, not name | change | `Deployment.name` is not a stable ID; use `urn:ami:deployment:{project.slug}:{deployment.id}` | +| **decimalLatitude**, **decimalLongitude** | `deployment.latitude/longitude` | keep | ✓ | +| **coordinateUncertaintyInMeters** | new field on `Deployment`; default 30m | **add** | Required for GBIF; currently missing | +| *geodeticDatum* | `"WGS84"` | keep | ✓ | +| *countryCode* | new, from reverse geocoding of lat/lon, cached | **add** | ISO 3166-1 alpha-2 | +| stateProvince, locality | optional, from reverse geocoding | add | | +| *datasetName* | `event.project.name` | keep | ✓ | +| **license** | `project.license` (new model field) | **add** | Required for GBIF | +| *rightsHolder* | `project.rights_holder` (new) | add | | +| *institutionCode* | `project.institution_code` (new) | add | | +| dc:modified | `event.updated_at` | keep | ✓ | +| eventRemarks | `event.notes` if available | add | | + +**AMI-specific add:** `dwc:parentEventID` pointing to a deployment-level event, to preserve the `Deployment → Event` hierarchy. This addresses the WG's requirement that individuals can be tracked across snapshots within a sampling period. 
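
As a rough illustration of the identifier and date conventions in the B2.1 table, the sketch below builds the event and deployment URNs and the ISO 8601 `eventDate` value. The function names are hypothetical and are not taken from `ami/exports/dwca.py`; only the URN patterns and the start/end interval form come from the table.

```python
from datetime import datetime
from typing import Optional


def event_urn(project_slug: str, event_id: int) -> str:
    """eventID pattern from the table: urn:ami:event:{project.slug}:{event.id}."""
    return f"urn:ami:event:{project_slug}:{event_id}"


def deployment_urn(project_slug: str, deployment_id: int) -> str:
    """Proposed locationID pattern: a stable URN instead of Deployment.name."""
    return f"urn:ami:deployment:{project_slug}:{deployment_id}"


def format_event_date(start: datetime, end: Optional[datetime] = None) -> str:
    """Return a single ISO 8601 timestamp, or a start/end interval when an end is known."""
    if end is None or end == start:
        return start.isoformat()
    return f"{start.isoformat()}/{end.isoformat()}"


if __name__ == "__main__":
    print(event_urn("demo-project", 42))
    print(format_event_date(datetime(2024, 7, 1, 21, 0), datetime(2024, 7, 2, 5, 0)))
```

Keeping these helpers free of model access makes the URN format easy to unit-test independently of the queryset.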
+ +#### B2.2 Occurrence extension (`occurrence.txt`) — per AMI `Occurrence` + +| DwC term | Source | Change from PR | Notes | +|---|---|---|---| +| **eventID** | coreid, as currently | keep | ✓ (meta.xml: coreid only, do NOT double-map as a field — this is the bug flagged by Copilot) | +| **occurrenceID** | `urn:ami:occurrence:{project.slug}:{occurrence.id}` | keep | ✓ | +| **basisOfRecord** | `"MachineObservation"` if ML-only; `"HumanObservation"` if a human Identification is the determination | refine | Currently hardcoded `MachineObservation`; should flip when `occurrence.identifications.exists()` | +| *occurrenceStatus* | `"present"` | keep | ✓ | +| **scientificName** | `occurrence.determination.name` | keep | ✓ | +| *verbatimScientificName* | original classifier output (pre-backbone match) | **add** | Dandjoo convention; useful when backbone mapping is lossy | +| *taxonRank* | `occurrence.determination.rank` lowercased | keep | ✓ | +| *kingdom, phylum, class, order, family, genus* | from `parents_json` via recursive lookup | change source | Prefer recursive CTE or `Taxon.parent` walk over JSON string matching | +| specificEpithet | `Taxon.specific_epithet` field (to add) | change source | String-splitting is unreliable | +| *vernacularName* | `determination.common_name_en` | keep | ✓ | +| *taxonID* | `determination.gbif_taxon_key` | keep | ✓ | +| nameAccordingTo | `"GBIF Backbone Taxonomy {date}"` | add | | +| *individualCount* | `occurrence.individual_count` (new field, default 1) | **change** | `"1"` hardcode is wrong once group counting is added; defer to model | +| organismQuantity / organismQuantityType | optional future | skip for now | | +| **identifiedBy** | pipeline name+version for ML; user email-or-username for human | **add** | Currently blank — essential | +| **dateIdentified** | `classification.created_at` (latest) / `identification.created_at` | **add** | Currently blank | +| **identificationVerificationStatus** | `"verified"` if Identifications 
present, else `"unverified"`; `"rejected"` if all Identifications disagree with ML | keep but refine | PR uses binary; consider three-state per camera-trap guide | +| identificationRemarks | `f"pipeline={...};score={...}"` if MeasurementOrFact not used | add | | +| identificationQualifier | `"cf."` for below-threshold ML-only occurrences (optional) | add | | +| associatedMedia | pipe-separated URLs of the occurrence's detection source images | **add** | Flagged as TODO; see B2.3 for richer Multimedia extension | +| recordedBy | `deployment.recorded_by` or `project.institution_code` | add | | +| dc:modified | `occurrence.updated_at` | keep | ✓ | + +#### B2.3 Multimedia extension (`multimedia.txt`) — **NEW**, per source image linked to published occurrences + +Uses GBIF simple multimedia (`http://rs.gbif.org/terms/1.0/Multimedia`) rather than full Audubon for simplicity. + +| term | source | notes | +|---|---|---| +| coreid | `eventID` (if Event core) | Star-schema limit: multimedia attaches to the core, not to occurrence. Carry `occurrenceID` in `references` field as a workaround. | +| dc:type | `"StillImage"` | | +| dc:format | `"image/jpeg"` (or detected) | | +| *dc:identifier* | full public URL of the SourceImage | | +| *accessURI* | same | | +| *dc:created* | `source_image.timestamp` | | +| *dc:license* | `project.license` | | +| *dc:rightsHolder* | `project.rights_holder` | | +| *dc:creator* | `deployment.name` or operator | | +| *dc:description* | `f"Detection of {taxon.name}"` | | +| references | occurrence detail URL in the AMI UI | workaround for occurrence linkage | + +**Filter:** only publish media linked to at least one published occurrence. Exclude blanks and any image flagged `contains_humans` (WG requirement). 
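
The multimedia filter rule can be sketched as a plain predicate. This is a minimal illustration, assuming a stand-in `SourceImageStub` dataclass rather than the real `SourceImage` model; the `occurrence_ids` attribute is hypothetical, while `is_blank` and `contains_humans` are the flags named in this review.

```python
from dataclasses import dataclass, field


@dataclass
class SourceImageStub:
    """Hypothetical stand-in for a SourceImage; not the real Django model."""
    url: str
    is_blank: bool = False
    contains_humans: bool = False
    occurrence_ids: set[int] = field(default_factory=set)


def publishable_media(images: list[SourceImageStub], published_occurrence_ids: set[int]) -> list[SourceImageStub]:
    """Keep only images linked to at least one published occurrence,
    dropping blanks and anything flagged as containing humans."""
    return [
        img
        for img in images
        if not img.is_blank
        and not img.contains_humans
        and img.occurrence_ids & published_occurrence_ids
    ]


if __name__ == "__main__":
    images = [
        SourceImageStub("https://example.org/a.jpg", occurrence_ids={1}),
        SourceImageStub("https://example.org/b.jpg", contains_humans=True, occurrence_ids={1}),
        SourceImageStub("https://example.org/c.jpg", occurrence_ids={99}),
    ]
    print([img.url for img in publishable_media(images, {1, 2})])
```

In the exporter itself the same rule would more likely be expressed as queryset filters, so the database does the exclusion rather than Python.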
+ +#### B2.4 MeasurementOrFact extension (`measurementorfact.txt`) — **NEW**, carries ML confidence + +| measurementType | measurementValue | measurementUnit | measurementDeterminedBy | +|---|---|---|---| +| `"classificationScore"` | `classification.score` (0–1) | `"proportion"` | `f"{pipeline.name} v{pipeline.version}"` | +| `"detectionScore"` | `detection.score` | `"proportion"` | detector algorithm name | +| `"boundingBox"` | JSON `[x1,y1,x2,y2]` normalized | `"normalized"` | detector | +| `"classificationRank"` | `"top1"` or `"top2"`... | — | pipeline | + +This resolves the WG's open question about how to publish confidence: raw score in a structured, machine-readable place (not buried in free-text remarks). Top-N predictions can be emitted as multiple rows per occurrence. + +#### B2.5 EML — package metadata + +Current impl mostly works; required changes: +- Upgrade to EML 2.2.0 (align with docs). +- **Never** write user email to ``; fixed in a74aee98 but regression-test it. +- Compute `` bounding box from actual deployment coordinates (PR has this as TODO). +- Compute `` from actual min/max `event.start`. +- Populate `` with `["automated insect monitoring", "camera trap", "Lepidoptera", project taxa]`. +- Populate `` section with sampling protocol + ML pipeline names used. +- Explicit `` with the project license. + +### B3. What to *not* export (filter rules) + +- Occurrences with `determination IS NULL` ✓ (already filtered) +- Occurrences with `event IS NULL` ✓ (already filtered) +- Occurrences with `determination_score < project.default_score_threshold` — **add** (currently missing; this is what `apply_default_filters` enforces elsewhere) +- Occurrences whose taxon is in `project.default_excluded_taxa` — **add** +- SourceImages flagged `contains_humans` — **add** +- SourceImages flagged `is_blank` with no linked occurrences — skip naturally + +### B4. AMI-specific mapping decisions (from WG notes) + +1. 
**Absences as zero-count events.** When a full Event (sampling period) produced no valid occurrences, emit the event row with a MeasurementOrFact `"individualsObserved" = 0`. Do not suppress the event. (WG July 2024.) +2. **Model identity.** For now, `identifiedBy = f"{pipeline.name} v{pipeline.version}"` and `measurementDeterminedBy` mirrors. Once the WG's model-metadata registry exists, replace with a resolvable model DOI. +3. **Confidence score interpretation.** Raw 0–1 in MeasurementOrFact (machine-consumable); no coarse tier in the DwC-A export (display-layer concern). The tier belongs in the AMI UI, not in the archive. +4. **Multiple classifications per detection.** Not representable in DwC-A; preserve only the "preferred" determination per occurrence. Export the full multi-classification graph via CamtrapDP + annotation extension (future work). +5. **Attributes (flying / resting / nectaring / torn wing).** If captured on the Classification, expose as MeasurementOrFact rows with `measurementType = "behavior:flying"` etc. Skip in v1 if data isn't populated. +6. **Unprocessed / expected images.** No current DwC term; capture in EML `` as text. Flag for the WG. + +### B5. Out-of-scope for this PR (future work) + +- CamtrapDP native export (parallel to DwC-A) +- Annotation table extension (multi-classification provenance) +- Model metadata table + DOI registry integration +- Sensitive-taxa coordinate generalization +- Automated GBIF validator CI step +- IPT / DOI / dataset-versioning integration + +--- + +## Open questions to resolve with reviewers + +1. Is Event Core or Occurrence Core the right choice for the GBIF-facing archive? (B1) +2. Do we want `apply_default_filters()` on by default, or an explicit `--include-low-confidence` opt-in for re-processing pipelines? (A3 §5) +3. Which license do projects default to? Is `project.license` a required field at project creation? (A3 §2) +4. 
Is there appetite to add `Taxon.specific_epithet` and `Taxon.parent_cache` (recursive-CTE-backed) to kill the string-split and JSON-walk? (B2.2) +5. For multi-classification occurrences, what counts as the "preferred" determination in DwC-A? (B4 §4) diff --git a/docs/claude/export-framework.md b/docs/claude/export-framework.md new file mode 100644 index 000000000..c584e56a7 --- /dev/null +++ b/docs/claude/export-framework.md @@ -0,0 +1,215 @@ +# Export Framework Technical Reference + +## Architecture Overview + +The export system uses a registry pattern where format-specific exporters register themselves and are dispatched by `DataExport.run_export()`. + +### Key Files + +| File | Purpose | +|------|---------| +| `ami/exports/base.py` | `BaseExporter` ABC - all exporters inherit from this | +| `ami/exports/registry.py` | `ExportRegistry` - maps format strings to exporter classes | +| `ami/exports/format_types.py` | Concrete exporters: `JSONExporter`, `CSVExporter`, `DwCAExporter` | +| `ami/exports/models.py` | `DataExport` model - tracks export jobs, files, stats | +| `ami/exports/utils.py` | `apply_filters()`, `get_data_in_batches()`, `generate_fake_request()` | +| `ami/exports/views.py` | `ExportViewSet` - API endpoint for creating/listing exports | +| `ami/exports/serializers.py` | `DataExportSerializer` - validates format, filters | +| `ami/exports/signals.py` | Deletes exported file when `DataExport` is deleted | +| `ami/exports/dwca.py` | DwC-A field definitions, XML generators, TSV writer | + +### Flow + +``` +1. User POST /api/v2/exports/ with {format, filters, project} +2. DataExportSerializer validates format against ExportRegistry +3. DataExport created, Job created (job_type_key="data_export") +4. Celery task calls DataExport.run_export() +5. run_export() calls DataExport.get_exporter() → ExportRegistry lookup +6. Exporter.__init__() builds queryset with filters +7. Exporter.export() writes temp file, returns path +8. 
DataExport.save_export_file() uploads to default_storage (S3/MinIO) +9. file_url saved to DataExport model +``` + +## API Endpoint + +`/api/v2/exports/` — `ExportViewSet` (`ami/exports/views.py:13`) + +### Methods + +| Method | Endpoint | Description | +|--------|----------|-------------| +| POST | `/api/v2/exports/` | Create export, enqueue async Celery job | +| GET | `/api/v2/exports/` | List exports (scoped to active project via `ProjectMixin`) | +| GET | `/api/v2/exports/{id}/` | Retrieve single export (job progress, file URL, record count) | +| PUT/PATCH | `/api/v2/exports/{id}/` | Update export (admin-only) | +| DELETE | `/api/v2/exports/{id}/` | Delete export and its file from storage | + +Permissions: `ObjectPermission` (`ami/base/permissions.py`). Researcher role can create and delete. Admin can update. Basic members and non-members cannot create. + +### Creating an Export (POST) + +**Required fields:** +- `project` (int) — Project PK +- `format` (string) — One of: `"occurrences_simple_csv"`, `"occurrences_api_json"`, `"dwca"` + +**Optional fields:** +- `filters` (object) — Filter criteria applied to occurrences + - `collection_id` (int) — Restrict to occurrences whose detections link to images in this `SourceImageCollection` + +**Validation** (`views.py:30-86`): +1. Format checked against `ExportRegistry.get_supported_formats()` +2. If `collection_id` provided, validates existence and project ownership +3. Object-level permission check on unsaved instance before persisting +4. Creates `DataExportJob` and enqueues via Celery + +**Response:** 201 with serialized `DataExport` including nested `job` object. 
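
For orientation, a client could assemble the POST body along these lines. This is only a sketch: `build_export_payload` is a hypothetical helper, and the format list is copied from the required-fields section above. The server remains authoritative through `ExportRegistry.get_supported_formats()`.

```python
from typing import Any, Optional

# Copied from the documented format keys; the registry on the server is the source of truth.
SUPPORTED_FORMATS = ("occurrences_simple_csv", "occurrences_api_json", "dwca")


def build_export_payload(
    project_id: int,
    export_format: str,
    collection_id: Optional[int] = None,
) -> dict[str, Any]:
    """Build the JSON body for POST /api/v2/exports/ (hypothetical client helper)."""
    if export_format not in SUPPORTED_FORMATS:
        raise ValueError(f"Unsupported export format: {export_format!r}")
    payload: dict[str, Any] = {"project": project_id, "format": export_format}
    if collection_id is not None:
        # Optional filter: restrict to occurrences linked to this collection.
        payload["filters"] = {"collection_id": collection_id}
    return payload


if __name__ == "__main__":
    print(build_export_payload(1, "dwca", collection_id=5))
```

Failing fast on an unknown format saves a round trip, but the 400 response from the serializer should still be handled.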
+ +### Response Fields + +Defined in `DataExportSerializer` (`ami/exports/serializers.py:30`): + +``` +id, user, project, format, filters, filters_display, +job {id, name, project, progress, result}, +file_url, record_count, file_size, file_size_display, +created_at, updated_at +``` + +- `file_url` — null until export completes, then absolute URL to file +- `file_size_display` — human-readable (e.g. "2.4 MB") +- `filters_display` — auto-populated with human names (e.g. collection name) +- `job.progress` — tracks export stages with percentage + +### Polling for Completion + +Exports run asynchronously. Poll `GET /api/v2/exports/{id}/` and check: +- `job.progress` for stage updates +- `file_url` becomes non-null when export is ready for download + +## Registered Formats + +Registered in `ami/exports/registry.py:28-30`, implemented in `ami/exports/format_types.py`: + +| Key | Class | Output | Description | +|-----|-------|--------|-------------| +| `occurrences_simple_csv` | `CSVExporter` (:149) | `.csv` | Tabular occurrence data with detection fields | +| `occurrences_api_json` | `JSONExporter` (:39) | `.json` | Full API serialization of occurrences | +| `dwca` | `DwCAExporter` (:192) | `.zip` | Darwin Core Archive with event.txt + occurrence.txt + meta.xml + eml.xml | + +## Internals + +### BaseExporter (ami/exports/base.py) + +```python +class BaseExporter(ABC): + file_format = "" # e.g. 
"json", "csv", "zip" + serializer_class = None # DRF serializer for data transformation + filter_backends = [] # DRF filter backends + + def __init__(self, data_export): + # Sets self.data_export, self.job, self.project + # Builds self.queryset using get_queryset() + apply_filters() + # Sets self.total_records = queryset.count() + + @abstractmethod + def export(self) -> str: + """Must return path to temp file.""" + + @abstractmethod + def get_queryset(self): + """Must return a Django QuerySet.""" + + def get_filter_backends(self): + return [OccurrenceCollectionFilter] # default + + def update_export_stats(self, file_temp_path): + """Updates record_count and file_size on DataExport.""" + + def update_job_progress(self, records_exported): + """Updates Job progress stage.""" +``` + +### ExportRegistry (ami/exports/registry.py) + +```python +ExportRegistry.register("format_name")(ExporterClass) +ExportRegistry.get_exporter("format_name") # → ExporterClass +ExportRegistry.get_supported_formats() # → ["occurrences_api_json", "occurrences_simple_csv", "dwca"] +``` + +### DataExport Model (ami/exports/models.py) + +Key fields: +- `user` FK → User (who triggered) +- `project` FK → Project (scoped to project) +- `format` CharField (registry key) +- `filters` JSONField (e.g. `{"collection_id": 5}`) +- `filters_display` JSONField (precomputed human-readable) +- `file_url` URLField (final download URL) +- `record_count` PositiveIntegerField +- `file_size` PositiveBigIntegerField + +Key methods: +- `run_export()` - orchestrates the full export pipeline +- `save_export_file(temp_path)` - uploads to storage, returns URL +- `generate_filename()` - `{project_slug}_export-{pk}.{ext}` +- `get_exporter()` - cached exporter instance + +### Filter System + +All exporters inherit `OccurrenceCollectionFilter` from `BaseExporter.get_filter_backends()` (`base.py:42-45`). 
+ +**OccurrenceCollectionFilter** (`ami/main/api/views.py:981-998`): +- Accepts `collection_id` or `collection` query param +- Filters: `queryset.filter(detections__source_image__collections=collection_id).distinct()` +- No-op when param is absent — unfiltered exports work unchanged + +**How filters are applied in Celery context** (`ami/exports/utils.py`): +- `generate_fake_request()` creates a mock DRF Request with filter values as query params +- `apply_filters()` runs each filter backend's `filter_queryset()` against the exporter's queryset +- Called in `BaseExporter.__init__()` so `self.queryset` is already filtered before `export()` runs + +### DwC-A Specifics + +The DwC-A exporter produces two data files linked by `eventID`: + +- **event.txt** — Events derived from filtered occurrences (`get_events_queryset()` at `format_types.py:211`) +- **occurrence.txt** — Filtered occurrences with Darwin Core terms + +Events are not fetched independently — they're derived from `self.queryset.values_list("event_id").distinct()` to maintain referential integrity when filters are active. + +Field definitions: `ami/exports/dwca.py` — `EVENT_FIELDS` (:26), `OCCURRENCE_FIELDS` (:57). +See `docs/claude/dwca-format-reference.md` for Darwin Core term mappings. + +### Job Integration + +`DataExportJob` (`ami/jobs/models.py:682-716`): +1. Adds "Exporting data" progress stage +2. Calls `job.data_export.run_export()` +3. Adds "Uploading snapshot" stage with file URL +4. Finalizes job as SUCCESS + +`DataExport` has a OneToOne relation to `Job` via `job.data_export` (`models.py:841`). + +### File Lifecycle + +1. Exporter writes to temp file during `export()` +2. `DataExport.save_export_file()` uploads to `exports/` in default_storage (S3/MinIO) +3. `file_url` saved on model +4. On `DataExport` deletion: `pre_delete` signal (`ami/exports/signals.py:13`) removes file from storage + +### Adding a New Export Format + +1. Create exporter class extending `BaseExporter` +2. 
Set `file_format` (file extension) +3. Implement `get_queryset()` and `export()` +4. Register: `ExportRegistry.register("format_key")(YourExporter)` +5. The format automatically appears in the API's valid choices + +### Utilities (ami/exports/utils.py) + +- `generate_fake_request()` - creates a DRF Request for serializer context (needed because exports run in Celery, not in HTTP request context) +- `apply_filters(queryset, filters, filter_backends)` - applies DRF filter backends using fake request with filter query params +- `get_data_in_batches(queryset, serializer_class, batch_size=100)` - yields batches of serialized data using queryset.iterator() diff --git a/docs/claude/planning/2026-04-21-dwca-april-draft-design.md b/docs/claude/planning/2026-04-21-dwca-april-draft-design.md new file mode 100644 index 000000000..a1590f046 --- /dev/null +++ b/docs/claude/planning/2026-04-21-dwca-april-draft-design.md @@ -0,0 +1,282 @@ +# DwC-A April 2026 Draft — Design (in progress) + +**Status:** Brainstorming in progress. Converged on Event Core, verifying against GBIF guide before final commit. +**PR:** #1131 (`feat/dwca-export`) +**Owner:** Michael / Claude session +**Date:** 2026-04-21 + +--- + +## Decision trail so far + +### Core choice: Event Core (retained from current PR) + +Initially proposed flipping to Occurrence Core per user direction, then surfaced the +star-schema consequence: in Occurrence-Core, every extension row must have `coreid = +occurrenceID`, which forces `event.txt` to duplicate event rows per-occurrence (50 +near-identical rows for an event with 50 occurrences). Reconsidered and landed on +**Event Core** because: + +1. **Absence inference** — automated camera-trap sampling (photo every 10s all night) + enables *strong* absence inference: we can prove species X was not present during + this sampling window. This is the scientific breakthrough of automated monitoring. 
+ Event Core carries structured sampling effort natively (duration, protocol, light + type, sample size); Occurrence Core loses it except as EML free-text. Shipping a + DwC-A that erases our strongest scientific contribution is wrong. +2. **AMI's data shape** — many occurrences per event (one night = 100s of moths); Event + Core eats the smaller redundancy tax (one `occurrenceID` column on MoF/multimedia + rows) versus Occurrence Core (full row duplication in event.txt or wholesale + denormalization onto every occurrence row). +3. **CamtrapDP alignment** — CamtrapDP's `events` table maps ~1:1 onto our `event.txt`. + Event Core today makes the CamtrapDP follow-up PR easier, not harder. + +### Archive shape + +``` +project_export.zip +├── meta.xml +├── eml.xml ← EML 2.2.0 +├── event.txt ← Event Core — one row per AMI Event +├── occurrence.txt ← Occurrence extension — coreid=eventID +├── multimedia.txt ← GBIF simple Multimedia extension — coreid=eventID +│ (holds both capture images and detection crops; +│ detection-crop rows carry occurrenceID column +│ to link back to their occurrence) +└── measurementorfact.txt ← MoF extension — coreid=eventID + (carries classificationScore, detectionScore, + boundingBox; per-occurrence rows carry + occurrenceID column; per-event rows don't) +``` + +The `occurrenceID` column is a valid DwC term that extensions are legally allowed to +carry. Using it in multimedia and MoF rows is not a hack — it's how Event-Core archives +point extension rows back to specific occurrences. + +### occurrence.txt (extension) — columns + +Keep all current columns. Add: + +- `associatedMedia` — pipe-separated public URLs of source captures that produced this + occurrence's detections (distinct, ordered by detection timestamp). Redundant with + multimedia.txt but useful for quick consumers. + +### event.txt (core) — columns + +Inherits current event columns. 
Already present: `eventID, eventDate, eventTime, year, +month, day, samplingProtocol, sampleSizeValue, sampleSizeUnit, samplingEffort, +locationID, decimalLatitude, decimalLongitude, geodeticDatum, datasetName, license, +rightsHolder, modified`. Add: + +- `parentEventID` — blank for now (would link deployment-level parent once that model + exists). Documented as follow-up in PR body. +- Placeholder columns for Device fields (`deviceType`, `attractantType`, + `lightWavelength`) — **deferred to follow-up PR** with the Device model migration. + +### multimedia.txt — columns + +Columns: `coreid (=eventID), occurrenceID, type, format, identifier, references, created, +license, rightsHolder, creator, description`. + +Row shape: +- **Capture image rows:** one per SourceImage in an event linked to ≥1 published occurrence. + `occurrenceID` blank. `identifier` = capture URL. +- **Detection crop rows:** one per Detection whose occurrence is in the filter set. + `occurrenceID` populated with the detection's occurrenceID URN. `identifier` = crop URL; + `references` = source capture URL. + +Filter rules: exclude `is_blank` and `contains_humans` SourceImages (WG requirement). + +### measurementorfact.txt — columns + +Columns: `coreid (=eventID), occurrenceID, measurementID, measurementType, +measurementValue, measurementUnit, measurementDeterminedBy, measurementRemarks`. + +Row types: +- **Per-occurrence:** `classificationScore` (value, unit=proportion, determinedBy=pipeline + name+version). `occurrenceID` populated. +- **Per-detection:** `detectionScore` (proportion, detector algorithm), `boundingBox` + (JSON `[x1,y1,x2,y2]`, normalized or pixels). `occurrenceID` populated. +- **Per-event (future hook):** no rows emitted yet; slot for future lux, temperature, + moon phase. 
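
The row types above could be produced by a small builder like the following. It is a sketch under stated assumptions: the `measurementID` scheme (occurrence URN plus a suffix) and the shape of the `detections` input are invented for illustration, and only the column names and measurement types come from this design.

```python
import json

MOF_COLUMNS = [
    "coreid", "occurrenceID", "measurementID", "measurementType",
    "measurementValue", "measurementUnit", "measurementDeterminedBy", "measurementRemarks",
]


def _row(event_urn, occurrence_urn, suffix, mtype, value, unit, determined_by):
    # measurementID format is an assumption; it only needs to be unique within the archive.
    return {
        "coreid": event_urn,
        "occurrenceID": occurrence_urn,
        "measurementID": f"{occurrence_urn}:{suffix}",
        "measurementType": mtype,
        "measurementValue": value,
        "measurementUnit": unit,
        "measurementDeterminedBy": determined_by,
        "measurementRemarks": "",
    }


def build_mof_rows(event_urn, occurrence_urn, classification_score, pipeline_label, detections):
    """One classificationScore row per occurrence, then detectionScore and boundingBox rows per detection.

    `detections` is a list of dicts with hypothetical keys: score, bbox, detector.
    """
    rows = [_row(event_urn, occurrence_urn, "classificationScore", "classificationScore",
                 str(classification_score), "proportion", pipeline_label)]
    for i, det in enumerate(detections):
        rows.append(_row(event_urn, occurrence_urn, f"detectionScore:{i}", "detectionScore",
                         str(det["score"]), "proportion", det["detector"]))
        rows.append(_row(event_urn, occurrence_urn, f"boundingBox:{i}", "boundingBox",
                         json.dumps(det["bbox"]), "normalized", det["detector"]))
    return rows


if __name__ == "__main__":
    for r in build_mof_rows("urn:ami:event:demo:1", "urn:ami:occurrence:demo:9", 0.93,
                            "example-pipeline v1",
                            [{"score": 0.8, "bbox": [0.1, 0.2, 0.3, 0.4], "detector": "example-detector"}]):
        print(r["measurementID"], r["measurementType"], r["measurementValue"])
```

Per-event rows, once they exist, would use the same columns with `occurrenceID` left blank.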
+ +### EML 2.2.0 upgrade + +- Bump namespace + schemaLocation +- Compute `geographicCoverage` bbox from deployment lat/lon min/max across the filter set +- Compute `temporalCoverage` from min/max `event.start` +- `methods` section: sampling protocol text + list of pipelines used (name+version) + + `qualityControl` para noting default filters applied +- Keep current license + rights-holder behavior + +### Runtime pre-zip validation + +Reuse `ami/exports/dwca_validator.py` on temp TSVs before packaging. Extend validator to +cover: (a) the two new extension files (multimedia, MoF), (b) `occurrenceID` column +cross-references between occurrence.txt and extension rows that carry occurrenceID, (c) +multimedia.txt having at most one row per (identifier, occurrenceID) combination. Fatal +errors fail the export and mark `DataExport.status = FAILED`; warnings log. + +### UI label + +`ui/src/data-services/models/export.ts`: +- Add `'dwca'` to `SERVER_EXPORT_TYPES` +- Label: `"Darwin Core Archive (DwC-A) — April 2026 Draft"` + +### Code organization (optional, leaning split) + +`ami/exports/dwca.py` is 432 lines and will ~double. Proposal: split into +`ami/exports/dwca/` package with `fields.py` / `meta.py` / `eml.py` / `validator.py` / +`__init__.py`. Public surface unchanged. Ask user yes/no. 
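
Returning to the pre-zip validation checks (b) and (c) described above, the sketch below shows one way they might be expressed. The function names are hypothetical and are not the existing `ami/exports/dwca_validator.py` API; rows are assumed to be already parsed into dicts, for example with `csv.DictReader`.

```python
def check_occurrence_refs(occurrence_rows, extension_rows, extension_name):
    """Every non-blank occurrenceID on an extension row must exist in occurrence.txt."""
    known = {row["occurrenceID"] for row in occurrence_rows}
    errors = []
    # Data rows start at line 2 because line 1 is the header.
    for line_no, row in enumerate(extension_rows, start=2):
        occ_id = row.get("occurrenceID") or ""
        if occ_id and occ_id not in known:
            errors.append(f"{extension_name} line {line_no}: unknown occurrenceID {occ_id}")
    return errors


def check_multimedia_duplicates(multimedia_rows):
    """At most one multimedia row per (identifier, occurrenceID) combination."""
    seen = set()
    errors = []
    for line_no, row in enumerate(multimedia_rows, start=2):
        key = (row.get("identifier", ""), row.get("occurrenceID", ""))
        if key in seen:
            errors.append(f"multimedia.txt line {line_no}: duplicate row for {key}")
        seen.add(key)
    return errors


if __name__ == "__main__":
    occurrences = [{"occurrenceID": "urn:ami:occurrence:demo:1"}]
    media = [
        {"identifier": "https://example.org/a.jpg", "occurrenceID": ""},
        {"identifier": "https://example.org/crop.jpg", "occurrenceID": "urn:ami:occurrence:demo:2"},
    ]
    print(check_occurrence_refs(occurrences, media, "multimedia.txt"))
    print(check_multimedia_duplicates(media))
```

Returning error strings rather than raising lets the caller decide which findings are fatal and which are only logged as warnings.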
+ +### Tests + +- Update existing 10 DwCAExporterTests for the new extension shape (additions, not + replacements — core is unchanged) +- Add tests per new extension: header parity, row counts, coreid referential integrity, + MoF measurement-type coverage, multimedia-URL presence, multimedia-crop has + occurrenceID populated, capture rows don't +- Add test: Event with no occurrences after filtering produces 0 extension rows but no + error +- Add test: detection with no crop URL is skipped from multimedia + +### PR discussion comment + +Post a dedicated "Multimedia and bounding-box representation" comment on the PR +covering: +- Capture images vs detection crops: one-file-with-occurrenceID-column approach vs. + two separate multimedia files (not possible in DwC-A — only one table per rowType). +- bbox in MoF vs. inline on multimedia row (MoF wins because structured numeric). +- How CamtrapDP will represent the same data (richer media.csv with structural link to + detections). +- Solicit WG feedback; not blocking this PR. + +### Device model changes — deferred + +Note in PR follow-up section: `Device.device_type` and `Device.attractant_type` (and +`Device.light_wavelength`) should be added in the CamtrapDP PR. DwC has no direct term +for attractant; these will populate CamtrapDP `captureMethod` + custom columns on +event.txt in a later DwC-A iteration. + +### Explicitly deferred + +- CamtrapDP native export +- Device model additions +- Sensitive-taxa coordinate generalization +- Reverse-geocoding for country/state/locality +- `coordinateUncertaintyInMeters` +- Annotation / model-metadata extensions (CamtrapDP path) +- Online GBIF-API validator CI +- IPT publishing + DOI minting + +--- + +## GBIF guide findings (2026-04-21) + +Two GBIF guides give **opposite** recommendations for camera-trap data. Resolving the +tension in AMI's favor: + +### Camera-Trap Data Publishing Guide — Occurrence Core + AMDE + +Recommends Occurrence Core + Audubon Media Description extension. 
Reason per the guide +itself: GBIF portal's UI can't display event-level media when viewing an individual +occurrence (confirmed by GBIF portal-feedback issue #4216). The guide's own rationale +acknowledges Model 2 (Event Core + Occurrence + AMDE) is "conceptually superior" but is +not recommended because of portal display limitations. + +The guide explicitly says **"classifications of blanks, vehicles and preferably humans +should be filtered out"** — i.e. it does not support absence representation. This +directly sacrifices AMI's core scientific contribution (automated monitoring's +strongest claim is *provable absence during a known sampling window*). + +### Survey & Monitoring Publishing Guide + Humboldt Extension — Event Core + +The Humboldt Extension (ratified 2024-2025, 55 terms) is **explicitly an Event-Core +extension**: "a vocabulary extension to the Darwin Core Event Class." It is the +GBIF-official pathway for survey and monitoring data. + +Terms relevant to AMI, all on event rows: + +- `eco:isSamplingEffortReported` = true +- `eco:samplingEffortValue` + `eco:samplingEffortUnit` — e.g. value=`1440`, unit=`camera-minutes` or trap-nights +- `eco:samplingEffortProtocol` — free text: "automated camera trap, image interval 10s, continuous overnight monitoring" +- `eco:isAbsenceReported` = true +- `eco:targetTaxonomicScope` — v1: derived from `Project.default_filters_include_taxa` + (lowest common ancestor across the M2M; blank if none). v2: sourced from the `TaxaList` + curated per Site (`Site.primary_taxa_list`, falling back to `Project`'s default "all + possible species" list). The TaxaList is the enumerable scope that later unblocks + per-taxon absence occurrences. 
+- `eco:inventoryTypes` = "trap or sample" +- `eco:protocolNames` / `eco:protocolDescriptions` — document the ML pipeline as a protocol +- `eco:hasMaterialSamples` = true, `eco:materialSampleTypes` = "digital images" + +Absence pattern (canonical): per-taxon absence `Occurrence` row per Event with +`dwc:occurrenceStatus = "absent"` and eventID link. AMI can emit absence occurrences in +a follow-up PR once `project.target_taxa` is defined. For this PR, we declare +`eco:isAbsenceReported=true` on events to signal the capacity; actual absence rows come +later. + +### Decision (confirmed after research) + +**Event Core + Humboldt Extension + Occurrence + Multimedia + MeasurementOrFact.** +Matches the GBIF survey-data guide and the extension purpose-built for our data shape. +Trades GBIF portal display of media at the occurrence view (a UI limitation, not a +data-ingestion limitation) for preserving absence inference and structured sampling +effort. + +The camera-trap guide's Occurrence-Core recommendation is a pragmatic workaround for +GBIF's portal UX, not a statement of correct data modeling for monitoring data. AMI is +a survey/monitoring dataset first, a camera-trap records dataset second. + +### CamtrapDP positioning + +The camera-trap guide explicitly recommends CamtrapDP as the primary format; GBIF +doesn't yet ingest CamtrapDP in production. So CamtrapDP is still the right next-PR +target, but for AMI's community (Wildlife Insights, Agouti, EU camera-trap networks) +rather than as a GBIF ingestion route. DwC-A via Humboldt remains the GBIF path. 
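
To make the Humboldt findings concrete, the terms listed above could be attached to each event row roughly as follows. This is a sketch only: the function name is hypothetical, the `eco:`-prefixed dict keys are shorthand for illustration (in `meta.xml` each column is declared by its full term URI), and the constant values are the ones quoted in the term list.

```python
def humboldt_event_columns(effort_minutes: int, protocol: str, target_scope: str = "") -> dict[str, str]:
    """Return Humboldt eco: values for one event row (hypothetical helper).

    effort_minutes: sampling duration expressed in camera-minutes.
    target_scope:   lowest common ancestor of the project's include-taxa, blank if none.
    """
    return {
        "eco:isSamplingEffortReported": "true",
        "eco:samplingEffortValue": str(effort_minutes),
        "eco:samplingEffortUnit": "camera-minutes",
        "eco:samplingEffortProtocol": protocol,
        # Signals the capacity for absence data; per-taxon absence rows are a follow-up.
        "eco:isAbsenceReported": "true",
        "eco:targetTaxonomicScope": target_scope,
        "eco:inventoryTypes": "trap or sample",
        "eco:hasMaterialSamples": "true",
        "eco:materialSampleTypes": "digital images",
    }


if __name__ == "__main__":
    columns = humboldt_event_columns(
        1440,
        "automated camera trap, image interval 10s, continuous overnight monitoring",
        target_scope="Lepidoptera",
    )
    for term, value in columns.items():
        print(f"{term}\t{value}")
```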
+ +--- + +## Updated archive shape + +``` +project_export.zip +├── meta.xml +├── eml.xml ← EML 2.2.0 +├── event.txt ← Event Core (DwC Event terms + Humboldt eco: terms) +├── occurrence.txt ← Occurrence extension (coreid=eventID) +├── multimedia.txt ← GBIF Multimedia ext (coreid=eventID; occurrenceID column +│ links detection crops back to occurrences) +└── measurementorfact.txt ← MoF extension (coreid=eventID; occurrenceID column for + per-occurrence/per-detection measurements) +``` + +`event.txt` carries the Humboldt `eco:` terms as additional columns. They're declared in +meta.xml via their term URIs. Humboldt is technically registered as its own extension +(`http://eco.tdwg.org/xml/ecoterm.xml`), but there's precedent for flattening Humboldt +terms into the Event Core row; GBIF accepts both. We flatten for simplicity — fewer +files, same semantic content, same GBIF ingestion outcome. + +--- + +## Sources consulted + +- GBIF Camera Trap Data Publishing Guide (docs.gbif.org/camera-trap-guide/en/) — §4.3, + §4.4.1, §4.4.2, §4.4.3; recommends Occurrence Core + AMDE as portal-pragmatic. +- GBIF Survey & Monitoring Data Publishing Guide (docs.gbif.org/guide-publishing-survey-data/en/) + — recommends Event Core + Humboldt. +- GBIF Survey & Monitoring Quick-Start Guide (docs.gbif.org/survey-monitoring-quick-start/en/) + — Humboldt term-by-term usage. +- GBIF portal-feedback issue #4216 — confirms Event-level multimedia is "conceptually + superior" but GBIF portal UI doesn't display it in occurrence views. +- Humboldt Extension Implementation Experience Report (eco.tdwg.org). + +--- + +## Followups after design approval + +1. Update PR body with Event-Core-retained + Humboldt rationale. +2. Write implementation plan via `writing-plans` skill. +3. Implement: extensions + Humboldt terms on event.txt + split `dwca.py` → package → + validator extensions → EML 2.2.0 → UI label → tests → docs. +4. 
Post the multimedia/bbox discussion comment on the PR (now more concrete: GBIF + portal-display caveat is a known trade, not a surprise). diff --git a/docs/claude/planning/2026-04-21-dwca-implementation-plan.md b/docs/claude/planning/2026-04-21-dwca-implementation-plan.md new file mode 100644 index 000000000..59387ca37 --- /dev/null +++ b/docs/claude/planning/2026-04-21-dwca-implementation-plan.md @@ -0,0 +1,2017 @@ +# DwC-A April 2026 Draft — Implementation Plan + +> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking. + +**Goal:** Deliver a scientifically-defensible Darwin Core Archive export that preserves sampling-effort and absence-inference context (Event Core + Humboldt Extension), adds multimedia and measurement extensions, upgrades EML to 2.2.0, runs structural validation before packaging, and ships behind a clearly-labeled "April 2026 Draft" UI option for field testing. + +**Architecture:** Event-Core DwC-A. `event.txt` carries DwC Event terms + flattened Humboldt `eco:` terms. `occurrence.txt`, `multimedia.txt`, and `measurementorfact.txt` are coreid=eventID extensions. Extension rows that pertain to a single occurrence carry an `occurrenceID` column for back-linkage. EML 2.2.0 with computed geographic/temporal coverage + methods section. Runtime pre-zip validation via extended offline validator. Existing `ami/exports/dwca.py` splits into `ami/exports/dwca/` package for manageability as the code ~doubles. + +**Tech Stack:** Django 4.2, Python 3.10+ typing, stdlib (`csv`, `zipfile`, `xml.etree.ElementTree`, `tempfile`), `django.utils.text.slugify`, existing `ami.exports.base.BaseExporter` + `ami.exports.dwca_validator`. 
+ +**Spec:** `docs/claude/planning/2026-04-21-dwca-april-draft-design.md` +**PR:** #1131 (`feat/dwca-export`) +**Follow-up ticket:** #1262 (CamtrapDP) + +--- + +## File Structure + +**New package layout (`ami/exports/dwca/`):** + +- `__init__.py` — re-export the public API so external imports (`format_types.py`, tests) keep working unchanged +- `fields.py` — `DwCAField` dataclass + `EVENT_FIELDS`, `OCCURRENCE_FIELDS`, `MULTIMEDIA_FIELDS`, `MOF_FIELDS` catalogues +- `helpers.py` — `_format_event_date`, `_format_time`, `_format_datetime`, `_format_coord`, `_format_duration`, `_get_rank_from_parents`, `get_specific_epithet`, `_get_verification_status` +- `targetscope.py` — `derive_target_taxonomic_scope(project)` LCA helper +- `rows.py` — `iter_multimedia_rows(events_qs, occurrences_qs, project_slug)`, `iter_mof_rows(occurrences_qs, project_slug)` +- `tsv.py` — `write_tsv(filepath, fields, source, project_slug, progress_callback)` (supports queryset OR iterable of mapping-like objects via a small adapter; see Task 5) +- `meta.py` — `generate_meta_xml()` now takes a list of `(tag, row_type, filename, fields)` so the caller composes the archive shape +- `eml.py` — `generate_eml_xml(project, events_qs)` (2.2.0, computed coverage + methods) +- `zip.py` — `create_dwca_zip(files: dict[str, str], meta_xml: str, eml_xml: str)` (`files` maps archive-name → tmp-path; e.g. 
`{"event.txt": tmp1, "occurrence.txt": tmp2, ...}`) + +**Modified files:** +- `ami/exports/format_types.py` — `DwCAExporter.export()` composes the 4-file archive + runs pre-zip validation +- `ami/exports/dwca_validator.py` — add `occurrenceID` cross-reference check + multimedia uniqueness warning +- `ami/exports/tests.py` — update existing `DwCAExportTest` setup + add new extension tests +- `ui/src/data-services/models/export.ts` — add `'dwca'` entry + label +- `docs/claude/dwca-format-reference.md` — document new archive shape + +**Unchanged:** `ami/exports/base.py`, `ami/exports/registry.py`, `ami/exports/tests_dwca_validator.py` (new tests go into a new file or extend existing). + +--- + +## Test infrastructure note + +The existing `DwCAExportTest` uses `setUpClass` to run the export **once** and share the ZIP across structural assertions (see `ami/exports/tests.py:310-355`). Every new extension test should reuse `self._open_zip()` and assert against the cached export — do NOT run a second export per test. If a new test needs a different project/filter shape, follow the pattern of `test_dwca_export_with_collection_filter` (`ami/exports/tests.py:533`) which creates its own export inside the test method. + +Docker test runner (from CLAUDE.md; keepdb for speed): +```bash +docker compose run --rm django python manage.py test ami.exports.tests --keepdb -v 2 +``` + +--- + +## Task 1: Split `dwca.py` into package (mechanical, safety-net via existing tests) + +**Files:** +- Create: `ami/exports/dwca/__init__.py` +- Create: `ami/exports/dwca/fields.py` +- Create: `ami/exports/dwca/helpers.py` +- Create: `ami/exports/dwca/tsv.py` +- Create: `ami/exports/dwca/meta.py` +- Create: `ami/exports/dwca/eml.py` +- Create: `ami/exports/dwca/zip.py` +- Delete: `ami/exports/dwca.py` + +- [ ] **Step 1: Verify existing tests pass as baseline** + +Run: `docker compose run --rm django python manage.py test ami.exports.tests.DwCAExportTest --keepdb -v 2` +Expected: 10 tests pass. 
+ +- [ ] **Step 2: Create `ami/exports/dwca/helpers.py`** + +Move these verbatim from `ami/exports/dwca.py` (current lines 146-232): `_format_event_date`, `_format_time`, `_format_datetime`, `_format_coord`, `_format_duration`, `_get_rank_from_parents`, `get_specific_epithet`, `_get_verification_status`. Keep the module-level `logger = logging.getLogger(__name__)`. + +- [ ] **Step 3: Create `ami/exports/dwca/fields.py`** + +```python +"""DwC-A column catalogues. Each DwCAField ties a term URI, TSV header, and row extractor together so meta.xml cannot drift from the TSV.""" + +from collections.abc import Callable +from dataclasses import dataclass +from typing import Any + +from ami.exports.dwca.helpers import ( + _format_coord, + _format_datetime, + _format_duration, + _format_event_date, + _format_time, + _get_rank_from_parents, + _get_verification_status, + get_specific_epithet, +) + +DWC = "http://rs.tdwg.org/dwc/terms/" +DC = "http://purl.org/dc/terms/" +ECO = "http://rs.tdwg.org/eco/terms/" + + +@dataclass(frozen=True) +class DwCAField: + term: str + header: str + extract: Callable[[Any, str], str] + required: bool = False + + +# Paste current EVENT_FIELDS list verbatim from dwca.py (lines 47-78) +EVENT_FIELDS: list[DwCAField] = [ + # ... (existing content unchanged for now; extended in Task 3) +] + +# Paste current OCCURRENCE_FIELDS list verbatim from dwca.py (lines 84-143) +OCCURRENCE_FIELDS: list[DwCAField] = [ + # ... (existing content unchanged for now; extended in Task 4) +] +``` + +Replace the `...` comments with the actual field list contents. MULTIMEDIA_FIELDS and MOF_FIELDS will be added in later tasks. + +- [ ] **Step 4: Create `ami/exports/dwca/tsv.py`** + +Move `write_tsv` verbatim from `dwca.py:239-264`. 
Update its import to use the new `DwCAField` location: +```python +from ami.exports.dwca.fields import DwCAField +``` + +- [ ] **Step 5: Create `ami/exports/dwca/meta.py`** + +Move `generate_meta_xml`, `_append_table`, and the `DWC` constant usage from `dwca.py:272-339`. Import `DwCAField` from `ami.exports.dwca.fields`. **Do not change the signature yet** — Task 6 will generalize it. + +- [ ] **Step 6: Create `ami/exports/dwca/eml.py`** + +Move `generate_eml_xml` verbatim from `dwca.py:347-410`. Keep it at EML 2.1.1 for now; Task 9 upgrades it. + +- [ ] **Step 7: Create `ami/exports/dwca/zip.py`** + +Move `create_dwca_zip` verbatim from `dwca.py:418-432`. Keep its current 2-extension signature; Task 6 generalizes it. + +- [ ] **Step 8: Create `ami/exports/dwca/__init__.py`** + +```python +"""Public surface of the DwC-A export package. + +Re-exports keep external imports (format_types.py, tests) working +unchanged while internal code is organized by responsibility. +""" + +from ami.exports.dwca.eml import generate_eml_xml +from ami.exports.dwca.fields import ( + DC, + DWC, + ECO, + DwCAField, + EVENT_FIELDS, + OCCURRENCE_FIELDS, +) +from ami.exports.dwca.helpers import ( + _format_coord, + _format_datetime, + _format_duration, + _format_event_date, + _format_time, + _get_rank_from_parents, + _get_verification_status, + get_specific_epithet, +) +from ami.exports.dwca.meta import generate_meta_xml +from ami.exports.dwca.tsv import write_tsv +from ami.exports.dwca.zip import create_dwca_zip + +__all__ = [ + "DC", + "DWC", + "ECO", + "DwCAField", + "EVENT_FIELDS", + "OCCURRENCE_FIELDS", + "create_dwca_zip", + "generate_eml_xml", + "generate_meta_xml", + "get_specific_epithet", + "write_tsv", + "_format_coord", + "_format_datetime", + "_format_duration", + "_format_event_date", + "_format_time", + "_get_rank_from_parents", + "_get_verification_status", +] +``` + +- [ ] **Step 9: Delete old file** + +```bash +git rm ami/exports/dwca.py +``` + +- [ ] **Step 10: Run 
existing tests to verify no regression** + +Run: `docker compose run --rm django python manage.py test ami.exports.tests.DwCAExportTest --keepdb -v 2` +Expected: 10 tests pass, same assertions as before the split. + +- [ ] **Step 11: Commit** + +The deletion of `ami/exports/dwca.py` is already staged by `git rm` in Step 9, so only the new package needs adding (`git add` on the removed path would fail with a pathspec error). + +```bash +git add ami/exports/dwca/ +git commit -m "$(cat <<'EOF' +refactor(exports): split dwca.py into package + +Upcoming additions (Humboldt eco: terms, multimedia and MoF +extensions, EML 2.2.0) roughly double the code. Split by +responsibility now so each module has a single clear purpose. + +Public API unchanged; re-exported from ami.exports.dwca.__init__. + +Co-Authored-By: Claude +EOF +)" +``` + +--- + +## Task 2: Target taxonomic scope derivation (LCA helper) + +**Why:** `eco:targetTaxonomicScope` is derived from `Project.default_filters_include_taxa` via lowest common ancestor. Pure function, easy to test. + +**Files:** +- Create: `ami/exports/dwca/targetscope.py` +- Modify: `ami/exports/tests.py` (add test class) + +- [ ] **Step 1: Write the failing test** + +Add at the end of `ami/exports/tests.py`: + +```python +class TargetTaxonomicScopeTest(TestCase): + """Tests for eco:targetTaxonomicScope derivation from project include taxa.""" + + @classmethod + def setUpClass(cls): + super().setUpClass() + cls.project, cls.deployment = setup_test_project(reuse=False) + create_taxa(cls.project) + + def test_empty_include_taxa_returns_empty_string(self): + from ami.exports.dwca.targetscope import derive_target_taxonomic_scope + + self.project.default_filters_include_taxa.clear() + self.assertEqual(derive_target_taxonomic_scope(self.project), "") + + def test_single_taxon_returns_its_name(self): + from ami.exports.dwca.targetscope import derive_target_taxonomic_scope + from ami.main.models import Taxon + + taxon = Taxon.objects.filter(projects=self.project).first() + self.assertIsNotNone(taxon, "Expected at least one taxon on fixture project") + self.project.default_filters_include_taxa.set([taxon]) +
self.assertEqual(derive_target_taxonomic_scope(self.project), taxon.name) + + def test_multiple_taxa_returns_lca_name(self): + from ami.exports.dwca.targetscope import derive_target_taxonomic_scope + from ami.main.models import Taxon + + # Find two taxa sharing a parent in parents_json + taxa = list(Taxon.objects.filter(projects=self.project).exclude(parents_json=[])[:2]) + if len(taxa) < 2: + self.skipTest("Fixture does not have two taxa with shared ancestry") + for t in taxa: + t.save(update_calculated_fields=True) + t.refresh_from_db() + self.project.default_filters_include_taxa.set(taxa) + + result = derive_target_taxonomic_scope(self.project) + # LCA should be some ancestor name, not empty + self.assertTrue(result, "LCA should resolve to a non-empty ancestor name") + # And it should be in the ancestry of BOTH taxa + for t in taxa: + ancestor_names = [p.name for p in t.parents_json] + [t.name] + self.assertIn(result, ancestor_names, f"{result} not in ancestry of {t.name}") +``` + +- [ ] **Step 2: Run test to verify it fails** + +Run: `docker compose run --rm django python manage.py test ami.exports.tests.TargetTaxonomicScopeTest --keepdb -v 2` +Expected: FAIL — `ModuleNotFoundError: No module named 'ami.exports.dwca.targetscope'` + +- [ ] **Step 3: Implement `ami/exports/dwca/targetscope.py`** + +```python +"""Derive eco:targetTaxonomicScope from a project's include-taxa filter. + +The scope is the lowest common ancestor (LCA) across all taxa in +Project.default_filters_include_taxa. Empty include-list -> empty +string (meta.xml still declares the column; EML notes the gap). + +This is the v1 sourcing strategy. v2 will move to a per-Site TaxaList +so each deployment can declare its own expected species pool (the +groundwork for per-taxon absence occurrence rows). +""" + +from __future__ import annotations + + +def derive_target_taxonomic_scope(project) -> str: + """Return the name of the LCA of the project's include-taxa filter. 
+ + `parents_json` on each Taxon is ordered root-to-leaf (kingdom first). + The LCA is the deepest (longest) common prefix of the + `parents_json + [self]` chains across all selected taxa. + """ + taxa = list(project.default_filters_include_taxa.all()) + if not taxa: + return "" + + def ancestry(t) -> list[tuple[int, str]]: + # parents_json entries expose .id and .name; list is root -> leaf + chain: list[tuple[int, str]] = [(p.id, p.name) for p in (t.parents_json or [])] + chain.append((t.id, t.name)) + return chain + + chains = [ancestry(t) for t in taxa] + if any(not c for c in chains): + return "" + + # Walk positions in lockstep; stop at the first divergence. + lca_name = "" + for position in zip(*chains): + ids = {entry[0] for entry in position} + if len(ids) != 1: + break + lca_name = position[0][1] + return lca_name +``` + +- [ ] **Step 4: Run test to verify it passes** + +Run: `docker compose run --rm django python manage.py test ami.exports.tests.TargetTaxonomicScopeTest --keepdb -v 2` +Expected: 3 tests pass. + +- [ ] **Step 5: Commit** + +```bash +git add ami/exports/dwca/targetscope.py ami/exports/tests.py +git commit -m "$(cat <<'EOF' +feat(exports): derive targetTaxonomicScope via LCA of include taxa + +Pure helper that walks parents_json chains and returns the deepest +common ancestor name. Empty include-list -> empty string. v1 +strategy; v2 will source from a per-Site TaxaList. + +Co-Authored-By: Claude +EOF +)" +``` + +--- + +## Task 3: Add Humboldt eco: terms as columns on event.txt + +**Why:** Preserves sampling effort + declares absence-reportability — the scientific contribution of automated monitoring. Flattened onto Event Core rows (GBIF accepts this; simpler than a separate humboldt.txt extension). 
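As a rough illustration of what "flattened onto Event Core rows" means for `meta.xml`, the sketch below declares `eco:` term URIs as ordinary `<field>` entries inside the `<core>` element, next to the DwC Event terms. It is a standalone example under stated assumptions, not the project's `generate_meta_xml`: `build_event_core` is a hypothetical name and the four-column term list is shortened for readability.

```python
import xml.etree.ElementTree as ET

DWC = "http://rs.tdwg.org/dwc/terms/"
ECO = "http://rs.tdwg.org/eco/terms/"


def build_event_core(terms: list[str]) -> ET.Element:
    """Build a <core> element declaring one <field> per column, in TSV order."""
    core = ET.Element(
        "core",
        {
            "rowType": DWC + "Event",
            "encoding": "UTF-8",
            # meta.xml stores the delimiters as literal backslash escapes.
            "fieldsTerminatedBy": "\\t",
            "linesTerminatedBy": "\\n",
            "ignoreHeaderLines": "1",
        },
    )
    files = ET.SubElement(core, "files")
    ET.SubElement(files, "location").text = "event.txt"
    # Column 0 is the core ID; it is also declared as a regular field below.
    ET.SubElement(core, "id", {"index": "0"})
    for index, term in enumerate(terms):
        ET.SubElement(core, "field", {"index": str(index), "term": term})
    return core


core = build_event_core(
    [
        DWC + "eventID",
        DWC + "eventDate",
        ECO + "isAbsenceReported",
        ECO + "samplingEffortValue",
    ]
)
xml_text = ET.tostring(core, encoding="unicode")
```

Because each column's term URI and its position come from the same list, the declared `index` values cannot drift from the TSV column order, which is the same property the `DwCAField` catalogue is meant to guarantee.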
+ +**Files:** +- Modify: `ami/exports/dwca/fields.py` +- Modify: `ami/exports/tests.py` (extend `DwCAExportTest`) + +- [ ] **Step 1: Write the failing test** + +Add to `DwCAExportTest` in `ami/exports/tests.py`: + +```python + def test_event_has_humboldt_eco_columns(self): + """event.txt should carry the Humboldt eco: columns as flattened columns.""" + expected_columns = { + "isSamplingEffortReported", + "samplingEffortValue", + "samplingEffortUnit", + "samplingEffortProtocol", + "isAbsenceReported", + "targetTaxonomicScope", + "inventoryTypes", + "protocolNames", + "protocolDescriptions", + "hasMaterialSamples", + "materialSampleTypes", + } + with self._open_zip() as f: + with zipfile.ZipFile(f, "r") as zf: + event_data = zf.read("event.txt").decode("utf-8") + reader = csv.DictReader(StringIO(event_data), delimiter="\t") + self.assertTrue( + expected_columns.issubset(set(reader.fieldnames)), + f"event.txt missing Humboldt columns: {expected_columns - set(reader.fieldnames)}", + ) + rows = list(reader) + self.assertGreater(len(rows), 0) + for row in rows: + self.assertEqual(row["isSamplingEffortReported"], "true") + self.assertEqual(row["isAbsenceReported"], "true") + self.assertEqual(row["hasMaterialSamples"], "true") + self.assertEqual(row["materialSampleTypes"], "digital images") + self.assertEqual(row["inventoryTypes"], "trap or sample") + + def test_event_humboldt_terms_in_meta_xml(self): + """meta.xml core should declare eco: term URIs for Humboldt columns.""" + with self._open_zip() as f: + with zipfile.ZipFile(f, "r") as zf: + meta_xml = zf.read("meta.xml").decode("utf-8") + self.assertIn("http://rs.tdwg.org/eco/terms/isSamplingEffortReported", meta_xml) + self.assertIn("http://rs.tdwg.org/eco/terms/isAbsenceReported", meta_xml) + self.assertIn("http://rs.tdwg.org/eco/terms/targetTaxonomicScope", meta_xml) +``` + +- [ ] **Step 2: Run test to verify it fails** + +Run: `docker compose run --rm django python manage.py test 
ami.exports.tests.DwCAExportTest.test_event_has_humboldt_eco_columns ami.exports.tests.DwCAExportTest.test_event_humboldt_terms_in_meta_xml --keepdb -v 2` +Expected: FAIL — columns missing from event.txt; meta.xml lacks eco: terms. + +- [ ] **Step 3: Extend EVENT_FIELDS in `ami/exports/dwca/fields.py`** + +Append to `EVENT_FIELDS` (after the existing `DC + "modified"` entry): + +```python + # ── Humboldt Extension (eco:) terms flattened onto event.txt ── + DwCAField( + ECO + "isSamplingEffortReported", + "isSamplingEffortReported", + lambda e, slug: "true", + ), + DwCAField( + ECO + "samplingEffortValue", + "samplingEffortValue", + lambda e, slug: _humboldt_effort_value(e), + ), + DwCAField( + ECO + "samplingEffortUnit", + "samplingEffortUnit", + lambda e, slug: "images", + ), + DwCAField( + ECO + "samplingEffortProtocol", + "samplingEffortProtocol", + lambda e, slug: ( + "automated camera trap with light attractant; continuous overnight monitoring " + "with fixed image-capture interval; images processed by ML detector + classifier pipeline" + ), + ), + DwCAField( + ECO + "isAbsenceReported", + "isAbsenceReported", + lambda e, slug: "true", + ), + DwCAField( + ECO + "targetTaxonomicScope", + "targetTaxonomicScope", + lambda e, slug: getattr(e, "_target_taxonomic_scope", "") or "", + ), + DwCAField( + ECO + "inventoryTypes", + "inventoryTypes", + lambda e, slug: "trap or sample", + ), + DwCAField( + ECO + "protocolNames", + "protocolNames", + lambda e, slug: "AMI ML detector + classifier pipeline", + ), + DwCAField( + ECO + "protocolDescriptions", + "protocolDescriptions", + lambda e, slug: ( + "Images captured at a fixed interval by an automated monitoring station; each image " + "processed through a detector (bounding-box extraction) and classifier (species " + "prediction). Occurrences grouped from co-located detections; default filters applied." 
+ ), + ), + DwCAField( + ECO + "hasMaterialSamples", + "hasMaterialSamples", + lambda e, slug: "true", + ), + DwCAField( + ECO + "materialSampleTypes", + "materialSampleTypes", + lambda e, slug: "digital images", + ), +``` + +At the top of `fields.py`, add a helper: + +```python +def _humboldt_effort_value(event) -> str: + """Sampling effort value: the event's image count, or an empty string when no count is available.""" + count = getattr(event, "captures_count", None) or 0 + return str(count) if count else "" +``` + +- [ ] **Step 4: Attach target-scope value to events in the exporter** + +Modify `ami/exports/format_types.py` `DwCAExporter.export()`. Before calling `write_tsv` on events, compute the scope once and attach it to each event: + +```python +from ami.exports.dwca.targetscope import derive_target_taxonomic_scope + +target_scope = derive_target_taxonomic_scope(self.project) +events_list = list(events_qs) +for e in events_list: + e._target_taxonomic_scope = target_scope +event_count = write_tsv(event_file.name, EVENT_FIELDS, events_list, project_slug) +``` + +Note: `write_tsv` currently calls `queryset.iterator(chunk_size=500)`, which a plain list does not provide. Step 5 below updates it to accept either a queryset or a plain iterable by duck-typing: use `.iterator()` when the source has one, otherwise iterate directly. + +- [ ] **Step 5: Update `write_tsv` to accept plain iterables** + +In `ami/exports/dwca/tsv.py`: + +```python +def write_tsv(filepath, fields, source, project_slug, progress_callback=None): + """Write a tab-delimited file. 
`source` is a Django queryset OR any iterable.""" + headers = [f.header for f in fields] + records_written = 0 + iterator = source.iterator(chunk_size=500) if hasattr(source, "iterator") else iter(source) + with open(filepath, "w", encoding="utf-8", newline="") as f: + writer = csv.writer(f, delimiter="\t", quoting=csv.QUOTE_MINIMAL, lineterminator="\n") + writer.writerow(headers) + for obj in iterator: + row = [field.extract(obj, project_slug) for field in fields] + writer.writerow(row) + records_written += 1 + if progress_callback and records_written % 500 == 0: + progress_callback(records_written) + return records_written +``` + +- [ ] **Step 6: Run tests to verify they pass** + +Run: `docker compose run --rm django python manage.py test ami.exports.tests.DwCAExportTest --keepdb -v 2` +Expected: all tests pass (including the two new ones and the existing 10). + +- [ ] **Step 7: Commit** + +```bash +git add ami/exports/dwca/fields.py ami/exports/dwca/tsv.py ami/exports/format_types.py ami/exports/tests.py +git commit -m "$(cat <<'EOF' +feat(exports): add Humboldt eco: terms as event.txt columns + +Flatten 11 Humboldt Extension terms onto Event Core rows: +sampling-effort structure (reported/value/unit/protocol), absence +reportability, targetTaxonomicScope (LCA-derived), protocol +identifiers, and material-sample declaration. + +Carries the scientific contribution of automated monitoring +(provable absence during known sampling windows) into the GBIF +pipeline. GBIF accepts eco: terms on Event rows as the +pragmatic alternative to a separate humboldt.txt extension. + +Co-Authored-By: Claude +EOF +)" +``` + +--- + +## Task 4: Add `associatedMedia` column to occurrence.txt + +**Why:** Pipe-separated public URLs of source captures, per the design doc. Redundant with multimedia.txt but convenient for CSV-level consumers. 
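For the CSV-level consumers mentioned above, reading the column back is a plain split on the pipe character. The sketch below is a consumer-side example only, assuming the column layout described in this task; `media_by_occurrence` and the `example.org` URLs are made up for illustration.

```python
import csv
from io import StringIO

# Two-row stand-in for occurrence.txt; real files carry many more columns.
SAMPLE = (
    "occurrenceID\tassociatedMedia\n"
    "urn:ami:occurrence:demo:1\thttps://example.org/a.jpg|https://example.org/b.jpg\n"
    "urn:ami:occurrence:demo:2\t\n"
)


def media_by_occurrence(tsv_text: str) -> dict[str, list[str]]:
    """Map occurrenceID -> list of media URLs parsed from associatedMedia."""
    reader = csv.DictReader(StringIO(tsv_text), delimiter="\t")
    result: dict[str, list[str]] = {}
    for row in reader:
        raw = row.get("associatedMedia") or ""
        # Drop empty parts so a blank cell yields an empty list, not [""].
        result[row["occurrenceID"]] = [url for url in raw.split("|") if url]
    return result


media = media_by_occurrence(SAMPLE)
```

A pipe is a safe separator here only as long as the URLs themselves never contain an unencoded `|`; consumers that need stricter guarantees can read `multimedia.txt` instead.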
+ +**Files:** +- Modify: `ami/exports/dwca/fields.py` +- Modify: `ami/exports/format_types.py` (ensure detection + source_image are prefetched) +- Modify: `ami/exports/tests.py` + +- [ ] **Step 1: Write the failing test** + +Add to `DwCAExportTest`: + +```python + def test_occurrence_has_associated_media_column(self): + """occurrence.txt should carry associatedMedia as pipe-separated URLs.""" + with self._open_zip() as f: + with zipfile.ZipFile(f, "r") as zf: + occ_data = zf.read("occurrence.txt").decode("utf-8") + reader = csv.DictReader(StringIO(occ_data), delimiter="\t") + self.assertIn("associatedMedia", reader.fieldnames) + rows = list(reader) + # At least one row should have a non-empty associatedMedia value + non_empty = [r for r in rows if r.get("associatedMedia")] + self.assertGreater(len(non_empty), 0, "No occurrences have associatedMedia") + for r in non_empty: + # URLs separated by pipe, no trailing pipe + self.assertFalse(r["associatedMedia"].endswith("|")) + for part in r["associatedMedia"].split("|"): + self.assertTrue(part.startswith("http"), f"Not a URL: {part}") +``` + +- [ ] **Step 2: Run test to verify it fails** + +Run: `docker compose run --rm django python manage.py test ami.exports.tests.DwCAExportTest.test_occurrence_has_associated_media_column --keepdb -v 2` +Expected: FAIL — column missing. + +- [ ] **Step 3: Add field to OCCURRENCE_FIELDS** + +In `ami/exports/dwca/fields.py`, append to `OCCURRENCE_FIELDS` (after the last entry): + +```python + DwCAField( + DWC + "associatedMedia", + "associatedMedia", + lambda o, slug: _associated_media(o), + ), +``` + +And a helper near the other helpers: + +```python +def _associated_media(occurrence) -> str: + """Pipe-separated distinct public URLs of source captures for this occurrence. + + Ordered by detection timestamp. Uses prefetched detections + source_image; + the exporter ensures the prefetch chain. 
+ """ + seen: set[str] = set() + urls: list[str] = [] + # Sort on a None-safe key so a missing source image or timestamp cannot raise. + detections = sorted(occurrence.detections.all(), key=_detection_sort_key) + for det in detections: + si = det.source_image + if si is None: + continue + url = si.public_url() + if not url or url in seen: + continue + seen.add(url) + urls.append(url) + return "|".join(urls) + + +def _detection_sort_key(det): + """Order by detection timestamp, then capture timestamp; undated detections go last.""" + si = det.source_image + ts = det.timestamp or (si.timestamp if si is not None else None) + return (ts is None, ts) +``` + +- [ ] **Step 4: Ensure the queryset prefetches `detections__source_image`** + +In `ami/exports/format_types.py` `DwCAExporter.get_queryset()`, extend the existing prefetch chain: + +```python +return ( + Occurrence.objects.valid() + .filter(project=self.project, event__isnull=False, determination__isnull=False) + .apply_default_filters(self.project) + .select_related("determination", "event", "deployment") + .prefetch_related("detections__source_image") + .with_detections_count() + .with_identifications() +) +``` + +- [ ] **Step 5: Run test to verify it passes** + +Run: `docker compose run --rm django python manage.py test ami.exports.tests.DwCAExportTest.test_occurrence_has_associated_media_column --keepdb -v 2` +Expected: PASS. + +- [ ] **Step 6: Commit** + +```bash +git add ami/exports/dwca/fields.py ami/exports/format_types.py ami/exports/tests.py +git commit -m "$(cat <<'EOF' +feat(exports): add associatedMedia column to occurrence.txt + +Pipe-separated distinct source-capture URLs per occurrence, ordered +by detection timestamp. Redundant with multimedia.txt but useful +for tabular consumers. + +Co-Authored-By: Claude +EOF +)" +``` + +--- + +## Task 5: `multimedia.txt` extension (field catalogue + row generator) + +**Why:** One file holds both capture images (event-level context) and detection crops (per-occurrence evidence). `occurrenceID` column on crop rows links them back to occurrences; capture rows have it blank. 
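To make the two row types concrete, the sketch below separates capture rows from crop rows by whether `occurrenceID` is blank, then checks that every crop row points at a known occurrence. It is a standalone illustration with made-up sample rows; `split_multimedia_rows` and `orphaned_crops` are hypothetical names, not the validator's actual API.

```python
def split_multimedia_rows(rows):
    """Separate capture rows (blank occurrenceID) from detection-crop rows."""
    captures, crops = [], []
    for row in rows:
        (crops if row.get("occurrenceID") else captures).append(row)
    return captures, crops


def orphaned_crops(rows, occurrence_ids):
    """Return crop rows whose occurrenceID does not appear in occurrence.txt."""
    _, crops = split_multimedia_rows(rows)
    return [row for row in crops if row["occurrenceID"] not in occurrence_ids]


# Made-up rows: one capture, one linked crop, one crop with a dangling link.
sample_rows = [
    {"eventID": "urn:ami:event:demo:1", "occurrenceID": "", "identifier": "https://example.org/capture.jpg"},
    {"eventID": "urn:ami:event:demo:1", "occurrenceID": "urn:ami:occurrence:demo:7", "identifier": "https://example.org/crop7.jpg"},
    {"eventID": "urn:ami:event:demo:1", "occurrenceID": "urn:ami:occurrence:demo:9", "identifier": "https://example.org/crop9.jpg"},
]
captures, crops = split_multimedia_rows(sample_rows)
orphans = orphaned_crops(sample_rows, {"urn:ami:occurrence:demo:7"})
```

Keeping both row types in one file means `eventID` stays the only core link, while `occurrenceID` is an ordinary column that tools can use or ignore.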
+ +**Files:** +- Modify: `ami/exports/dwca/fields.py` (add `MULTIMEDIA_FIELDS`) +- Create: `ami/exports/dwca/rows.py` (`iter_multimedia_rows`) +- Modify: `ami/exports/tests.py` + +- [ ] **Step 1: Write the failing test** + +Add a new test class in `ami/exports/tests.py`. It will fail first because neither the constant nor the generator exists: + +```python +class MultimediaExtensionTest(TestCase): + """Unit tests for multimedia.txt row generator (in isolation from a full export).""" + + def test_field_catalogue_present(self): + from ami.exports.dwca.fields import MULTIMEDIA_FIELDS + + headers = [f.header for f in MULTIMEDIA_FIELDS] + for required in [ + "eventID", + "occurrenceID", + "type", + "format", + "identifier", + "references", + "created", + "license", + "rightsHolder", + ]: + self.assertIn(required, headers) + + def test_iter_multimedia_rows_emits_capture_and_crop_rows(self): + from ami.exports.dwca.rows import iter_multimedia_rows + + project, deployment = setup_test_project(reuse=False) + create_captures(deployment=deployment, num_nights=1, images_per_night=4, interval_minutes=1) + group_images_into_events(deployment) + create_taxa(project) + create_occurrences(num=4, deployment=deployment) + + events_qs = project.events.all() + occurrences_qs = ( + Occurrence.objects.valid() + .filter(project=project, event__isnull=False, determination__isnull=False) + ) + rows = list(iter_multimedia_rows(events_qs, occurrences_qs, "test-project")) + + # Expect at least one capture row (occurrenceID blank) and at least one crop row + capture_rows = [r for r in rows if not r["occurrenceID"]] + crop_rows = [r for r in rows if r["occurrenceID"]] + self.assertGreater(len(capture_rows), 0, "Expected capture rows with blank occurrenceID") + self.assertGreater(len(crop_rows), 0, "Expected detection-crop rows with occurrenceID") + + # Every crop row must have both identifier (crop URL) and references (source URL) + for r in crop_rows: + self.assertTrue(r["identifier"], "Crop 
row missing identifier") + self.assertTrue(r["references"], "Crop row missing references (source capture URL)") + self.assertEqual(r["type"], "StillImage") +``` + +- [ ] **Step 2: Run test to verify it fails** + +Run: `docker compose run --rm django python manage.py test ami.exports.tests.MultimediaExtensionTest --keepdb -v 2` +Expected: FAIL — `ImportError: cannot import name 'MULTIMEDIA_FIELDS'` and `No module named 'ami.exports.dwca.rows'`. + +- [ ] **Step 3: Add `MULTIMEDIA_FIELDS` to `fields.py`** + +The multimedia row generator in Task 5, Step 4 yields plain dicts. The field extractors here read from those dicts: + +```python +MULTIMEDIA_FIELDS: list[DwCAField] = [ + DwCAField(DWC + "eventID", "eventID", lambda r, slug: r["eventID"], required=True), + DwCAField(DWC + "occurrenceID", "occurrenceID", lambda r, slug: r.get("occurrenceID", "")), + DwCAField("http://purl.org/dc/terms/type", "type", lambda r, slug: r.get("type", "StillImage")), + DwCAField("http://purl.org/dc/terms/format", "format", lambda r, slug: r.get("format", "image/jpeg")), + DwCAField("http://purl.org/dc/terms/identifier", "identifier", lambda r, slug: r["identifier"], required=True), + DwCAField("http://purl.org/dc/terms/references", "references", lambda r, slug: r.get("references", "")), + DwCAField("http://purl.org/dc/terms/created", "created", lambda r, slug: r.get("created", "")), + DwCAField("http://purl.org/dc/terms/license", "license", lambda r, slug: r.get("license", "")), + DwCAField("http://purl.org/dc/terms/rightsHolder", "rightsHolder", lambda r, slug: r.get("rightsHolder", "")), + DwCAField("http://purl.org/dc/terms/creator", "creator", lambda r, slug: r.get("creator", "")), + DwCAField("http://purl.org/dc/terms/description", "description", lambda r, slug: r.get("description", "")), +] +``` + +Also export it from `__init__.py` (add to both the import block and `__all__`). 
+ +- [ ] **Step 4: Create `ami/exports/dwca/rows.py`** + +```python +"""Row generators for DwC-A extension TSVs (multimedia, measurementorfact). + +Both generators yield plain dicts so the TSV writer can treat them the +same as Django model instances via the DwCAField extract lambdas. +""" + +from __future__ import annotations + +from ami.exports.dwca.helpers import _format_datetime + + +def _event_id(event, slug: str) -> str: + return f"urn:ami:event:{slug}:{event.id}" + + +def _occurrence_id(occurrence, slug: str) -> str: + return f"urn:ami:occurrence:{slug}:{occurrence.id}" + + +def iter_multimedia_rows(events_qs, occurrences_qs, project_slug: str): + """Yield dicts for multimedia.txt rows. + + Two row types: + - Capture row: one per SourceImage linked to >=1 occurrence in filter set. + occurrenceID is blank; identifier is the capture URL. + - Crop row: one per Detection whose occurrence is in filter set + AND which has a usable crop URL. occurrenceID populated; + references = source capture URL. + """ + license = _project_license(events_qs) + rights_holder = _project_rights_holder(events_qs) + + # Build (event, [occurrences]) pairs up front so we can iterate once. + occurrences_by_event: dict[int, list] = {} + for occ in occurrences_qs.select_related("event").prefetch_related( + "detections__source_image" + ): + if occ.event_id is None: + continue + occurrences_by_event.setdefault(occ.event_id, []).append(occ) + + for event in events_qs: + eid = _event_id(event, project_slug) + occurrences_for_event = occurrences_by_event.get(event.id, []) + + # Deduplicate capture images across all occurrences in this event. 
+ seen_captures: set[int] = set() + for occ in occurrences_for_event: + for det in occ.detections.all(): + si = det.source_image + if si is None or si.id in seen_captures: + continue + seen_captures.add(si.id) + capture_url = si.public_url() + if not capture_url: + continue + yield { + "eventID": eid, + "occurrenceID": "", + "type": "StillImage", + "format": "image/jpeg", + "identifier": capture_url, + "references": "", + "created": _format_datetime(si.timestamp), + "license": license, + "rightsHolder": rights_holder, + "creator": "", + "description": "Source capture image from automated monitoring station", + } + + # Detection crop rows. + for occ in occurrences_for_event: + occ_urn = _occurrence_id(occ, project_slug) + for det in occ.detections.all(): + crop_url = det.url() if hasattr(det, "url") else None + if not crop_url: + continue + si = det.source_image + capture_url = si.public_url() if si else "" + yield { + "eventID": eid, + "occurrenceID": occ_urn, + "type": "StillImage", + "format": "image/jpeg", + "identifier": crop_url, + "references": capture_url, + "created": _format_datetime(det.timestamp or (si.timestamp if si else None)), + "license": license, + "rightsHolder": rights_holder, + "creator": "", + "description": "Cropped detection from source capture", + } + + +def _project_license(events_qs) -> str: + for e in events_qs: + if e.project and getattr(e.project, "license", ""): + return e.project.license + break + return "" + + +def _project_rights_holder(events_qs) -> str: + for e in events_qs: + if e.project and getattr(e.project, "rights_holder", ""): + return e.project.rights_holder + break + return "" +``` + +- [ ] **Step 5: Run test to verify it passes** + +Run: `docker compose run --rm django python manage.py test ami.exports.tests.MultimediaExtensionTest --keepdb -v 2` +Expected: both tests pass. 
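The two row types can be illustrated without a database. The sketch below mirrors the dedup-then-crop order of `iter_multimedia_rows` using `SimpleNamespace` objects. The attribute names (`urn`, `url`, `crop_url`) and the function `sketch_multimedia_rows` are invented for this example and are not the real model API.

```python
from types import SimpleNamespace


def sketch_multimedia_rows(event_urn, occurrences):
    """Yield capture rows (deduplicated per source image), then crop rows."""
    seen_captures = set()
    for occ in occurrences:
        for det in occ.detections:
            image = det.source_image
            if image.id in seen_captures:
                continue
            seen_captures.add(image.id)
            yield {
                "eventID": event_urn,
                "occurrenceID": "",  # capture rows are not tied to one occurrence
                "identifier": image.url,
                "references": "",
            }
    for occ in occurrences:
        for det in occ.detections:
            yield {
                "eventID": event_urn,
                "occurrenceID": occ.urn,
                "identifier": det.crop_url,
                "references": det.source_image.url,  # back-link to the capture
            }


capture = SimpleNamespace(id=1, url="https://example.org/capture.jpg")
occ_a = SimpleNamespace(
    urn="urn:ami:occurrence:demo:10",
    detections=[SimpleNamespace(source_image=capture, crop_url="https://example.org/crop-a.jpg")],
)
occ_b = SimpleNamespace(
    urn="urn:ami:occurrence:demo:11",
    detections=[SimpleNamespace(source_image=capture, crop_url="https://example.org/crop-b.jpg")],
)

rows = list(sketch_multimedia_rows("urn:ami:event:demo:1", [occ_a, occ_b]))
capture_rows = [r for r in rows if not r["occurrenceID"]]
crop_rows = [r for r in rows if r["occurrenceID"]]
```

Two occurrences detected on the same capture produce one capture row and two crop rows, which is the shape the tests in this task check for.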
+ +- [ ] **Step 6: Commit** + +```bash +git add ami/exports/dwca/fields.py ami/exports/dwca/rows.py ami/exports/dwca/__init__.py ami/exports/tests.py +git commit -m "$(cat <<'EOF' +feat(exports): add multimedia extension field catalogue and row generator + +Single multimedia.txt carries both capture-image rows (occurrenceID +blank) and detection-crop rows (occurrenceID linking back to +occurrence.txt). dc:references on crop rows points back to the +source capture URL. + +Row generator yields plain dicts so the existing write_tsv + +DwCAField pattern handles both query-backed tables and computed +row streams uniformly. + +Co-Authored-By: Claude +EOF +)" +``` + +--- + +## Task 6: Generalize meta.xml + zip packaging; wire multimedia.txt into the archive + +**Why:** Current `generate_meta_xml` hardcodes two tables; `create_dwca_zip` hardcodes two payload files. Generalize both to accept a list, then add multimedia.txt to the archive. + +**Files:** +- Modify: `ami/exports/dwca/meta.py` +- Modify: `ami/exports/dwca/zip.py` +- Modify: `ami/exports/format_types.py` +- Modify: `ami/exports/tests.py` + +- [ ] **Step 1: Write the failing test** + +Add to `DwCAExportTest`: + +```python + def test_multimedia_txt_in_archive(self): + with self._open_zip() as f: + with zipfile.ZipFile(f, "r") as zf: + self.assertIn("multimedia.txt", zf.namelist()) + data = zf.read("multimedia.txt").decode("utf-8") + reader = csv.DictReader(StringIO(data), delimiter="\t") + rows = list(reader) + self.assertGreater(len(rows), 0, "multimedia.txt has no rows") + ids = {row["eventID"] for row in rows if row["eventID"]} + # Every multimedia eventID must exist in event.txt + event_data = zf.read("event.txt").decode("utf-8") + event_ids = {r["eventID"] for r in csv.DictReader(StringIO(event_data), delimiter="\t")} + self.assertTrue(ids.issubset(event_ids), f"Orphaned multimedia eventIDs: {ids - event_ids}") + + def test_meta_xml_declares_multimedia_extension(self): + with self._open_zip() as f: + with 
zipfile.ZipFile(f, "r") as zf: + meta_xml = zf.read("meta.xml").decode("utf-8") + # Look for a second <extension> element referencing multimedia.txt + self.assertIn("multimedia.txt", meta_xml) + # GBIF Multimedia extension rowType + self.assertIn("http://rs.gbif.org/terms/1.0/Multimedia", meta_xml) +``` + +- [ ] **Step 2: Run test to verify it fails** + +Run: `docker compose run --rm django python manage.py test ami.exports.tests.DwCAExportTest.test_multimedia_txt_in_archive ami.exports.tests.DwCAExportTest.test_meta_xml_declares_multimedia_extension --keepdb -v 2` +Expected: FAIL — multimedia.txt not in archive. + +- [ ] **Step 3: Generalize `generate_meta_xml`** + +Rewrite `ami/exports/dwca/meta.py`: + +```python +"""Generate the DwC-A descriptor (meta.xml). + +meta.xml is derived from the field catalogues so TSV columns cannot +drift from declared term URIs. The core/extension list is passed in +so the caller composes the archive shape. +""" + +from __future__ import annotations + +from xml.etree import ElementTree as ET + +from ami.exports.dwca.fields import DwCAField + + +def generate_meta_xml(tables: list[dict]) -> str: + """Build meta.xml from a list of table descriptors. + + Each descriptor is a dict: + { + "role": "core" | "extension", + "row_type": <rowType URI>, + "filename": "event.txt", + "fields": list[DwCAField], + } + + The first descriptor must have role="core"; remaining are extensions. 
+ """ + if not tables or tables[0]["role"] != "core": + raise ValueError("First table must be the core (role='core')") + + archive = ET.Element("archive") + archive.set("xmlns", "http://rs.tdwg.org/dwc/text/") + archive.set("metadata", "eml.xml") + + for table in tables: + tag = table["role"] + _append_table( + archive, + tag=tag, + row_type=table["row_type"], + filename=table["filename"], + fields=table["fields"], + id_tag="id" if tag == "core" else "coreid", + ) + + ET.indent(archive, space=" ") + xml_str = ET.tostring(archive, encoding="unicode", xml_declaration=False) + return '<?xml version="1.0" encoding="UTF-8"?>\n' + xml_str + "\n" + + +def _append_table(archive, *, tag, row_type, filename, fields: list[DwCAField], id_tag: str) -> None: + table = ET.SubElement(archive, tag) + table.set("rowType", row_type) + table.set("encoding", "UTF-8") + table.set("fieldsTerminatedBy", "\\t") + table.set("linesTerminatedBy", "\\n") + table.set("fieldsEnclosedBy", '"') + table.set("ignoreHeaderLines", "1") + files = ET.SubElement(table, "files") + location = ET.SubElement(files, "location") + location.text = filename + id_elem = ET.SubElement(table, id_tag) + id_elem.set("index", "0") + for i, field in enumerate(fields): + field_elem = ET.SubElement(table, "field") + field_elem.set("index", str(i)) + field_elem.set("term", field.term) +``` + +- [ ] **Step 4: Generalize `create_dwca_zip`** + +Rewrite `ami/exports/dwca/zip.py`: + +```python +"""Package DwC-A files into a single ZIP.""" + +import tempfile +import zipfile + + +def create_dwca_zip(files: dict[str, str], meta_xml: str, eml_xml: str) -> str: + """Build the archive. + + `files` maps archive-internal-name -> source-temp-path. + Returns the path to the new ZIP. 
+ """ + temp_zip = tempfile.NamedTemporaryFile(delete=False, suffix=".zip") + temp_zip.close() + with zipfile.ZipFile(temp_zip.name, "w", zipfile.ZIP_DEFLATED) as zf: + for archive_name, source_path in files.items(): + zf.write(source_path, archive_name) + zf.writestr("meta.xml", meta_xml) + zf.writestr("eml.xml", eml_xml) + return temp_zip.name +``` + +- [ ] **Step 5: Update `DwCAExporter.export()` to produce multimedia.txt** + +Replace the body of `DwCAExporter.export()` in `ami/exports/format_types.py`: + +```python +def export(self): + """Export project data as a Darwin Core Archive ZIP.""" + from django.utils.text import slugify + + from ami.exports.dwca import ( + EVENT_FIELDS, + OCCURRENCE_FIELDS, + create_dwca_zip, + generate_eml_xml, + generate_meta_xml, + write_tsv, + ) + from ami.exports.dwca.fields import MULTIMEDIA_FIELDS + from ami.exports.dwca.rows import iter_multimedia_rows + from ami.exports.dwca.targetscope import derive_target_taxonomic_scope + + project_slug = slugify(self.project.name) + + def _tmp_txt(): + tf = tempfile.NamedTemporaryFile(delete=False, suffix=".txt", mode="w", encoding="utf-8") + tf.close() + return tf.name + + event_path = _tmp_txt() + occ_path = _tmp_txt() + multimedia_path = _tmp_txt() + + try: + events_qs = self.get_events_queryset() + events_list = list(events_qs) + target_scope = derive_target_taxonomic_scope(self.project) + for e in events_list: + e._target_taxonomic_scope = target_scope + + event_count = write_tsv(event_path, EVENT_FIELDS, events_list, project_slug) + logger.info(f"DwC-A: wrote {event_count} events") + + occ_count = write_tsv( + occ_path, + OCCURRENCE_FIELDS, + self.queryset, + project_slug, + progress_callback=self.update_job_progress, + ) + logger.info(f"DwC-A: wrote {occ_count} occurrences") + + mm_count = write_tsv( + multimedia_path, + MULTIMEDIA_FIELDS, + iter_multimedia_rows(events_list, self.queryset, project_slug), + project_slug, + ) + logger.info(f"DwC-A: wrote {mm_count} multimedia rows") 
+ + if self.total_records: + self.update_job_progress(occ_count) + + meta_xml = generate_meta_xml([ + { + "role": "core", + "row_type": "http://rs.tdwg.org/dwc/terms/Event", + "filename": "event.txt", + "fields": EVENT_FIELDS, + }, + { + "role": "extension", + "row_type": "http://rs.tdwg.org/dwc/terms/Occurrence", + "filename": "occurrence.txt", + "fields": OCCURRENCE_FIELDS, + }, + { + "role": "extension", + "row_type": "http://rs.gbif.org/terms/1.0/Multimedia", + "filename": "multimedia.txt", + "fields": MULTIMEDIA_FIELDS, + }, + ]) + eml_xml = generate_eml_xml(self.project) + + zip_path = create_dwca_zip( + { + "event.txt": event_path, + "occurrence.txt": occ_path, + "multimedia.txt": multimedia_path, + }, + meta_xml, + eml_xml, + ) + + self.update_export_stats(file_temp_path=zip_path) + return zip_path + finally: + for path in (event_path, occ_path, multimedia_path): + try: + os.unlink(path) + except OSError: + pass +``` + +- [ ] **Step 6: Run tests to verify they pass** + +Run: `docker compose run --rm django python manage.py test ami.exports.tests.DwCAExportTest --keepdb -v 2` +Expected: all tests pass, including new multimedia assertions and existing referential-integrity checks. + +- [ ] **Step 7: Commit** + +```bash +git add ami/exports/dwca/meta.py ami/exports/dwca/zip.py ami/exports/format_types.py ami/exports/tests.py +git commit -m "$(cat <<'EOF' +feat(exports): wire multimedia.txt into DwC-A archive + +Generalize meta.xml descriptor and zip packager to accept arbitrary +extension lists, then add multimedia.txt as the third table. +Row type http://rs.gbif.org/terms/1.0/Multimedia (GBIF simple +Multimedia extension). Capture-image rows carry blank +occurrenceID; crop rows carry the occurrenceID URN so consumers +can link evidence back to determinations. 
+ +Co-Authored-By: Claude +EOF +)" +``` + +--- + +## Task 7: `measurementorfact.txt` extension — catalogue, generator, wiring + +**Why:** Structured numeric provenance: `classificationScore` per occurrence, `detectionScore` + `boundingBox` per detection. Extension coreid=eventID; `occurrenceID` column populated on both row types (all MoF rows are per-occurrence or per-detection in this PR — per-event rows are deferred). + +**Files:** +- Modify: `ami/exports/dwca/fields.py` (add `MOF_FIELDS`) +- Modify: `ami/exports/dwca/rows.py` (`iter_mof_rows`) +- Modify: `ami/exports/format_types.py` +- Modify: `ami/exports/tests.py` + +- [ ] **Step 1: Write the failing tests** + +Add to `ami/exports/tests.py`: + +```python + def test_measurementorfact_txt_in_archive(self): + with self._open_zip() as f: + with zipfile.ZipFile(f, "r") as zf: + self.assertIn("measurementorfact.txt", zf.namelist()) + data = zf.read("measurementorfact.txt").decode("utf-8") + reader = csv.DictReader(StringIO(data), delimiter="\t") + rows = list(reader) + self.assertGreater(len(rows), 0) + types = {r["measurementType"] for r in rows} + self.assertIn("classificationScore", types) + # Rows must all have populated coreid (=eventID) + for r in rows: + self.assertTrue(r["eventID"], "MoF row missing eventID") + self.assertTrue(r["occurrenceID"], "MoF row missing occurrenceID in this PR") + + def test_meta_xml_declares_mof_extension(self): + with self._open_zip() as f: + with zipfile.ZipFile(f, "r") as zf: + meta_xml = zf.read("meta.xml").decode("utf-8") + self.assertIn("measurementorfact.txt", meta_xml) + self.assertIn("http://rs.gbif.org/terms/1.0/MeasurementOrFact", meta_xml) +``` + +- [ ] **Step 2: Run test to verify it fails** + +Run: `docker compose run --rm django python manage.py test ami.exports.tests.DwCAExportTest.test_measurementorfact_txt_in_archive ami.exports.tests.DwCAExportTest.test_meta_xml_declares_mof_extension --keepdb -v 2` +Expected: FAIL — file not in archive. 
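For orientation, the rows this task emits for a single detection might look like the sketch below. `sketch_mof_rows` is a hypothetical helper written for this illustration. The key names follow the test above, and the `urn:ami:detection:...` prefix for `measurementID` is an assumption modelled on the event and occurrence URNs.

```python
import json


def sketch_mof_rows(event_urn, occurrence_urn, detection_urn, score, bbox):
    """Build the per-detection MoF row dicts: one score row, one bounding-box row."""
    rows = []
    if score is not None:
        rows.append(
            {
                "eventID": event_urn,  # coreid linking the row to event.txt
                "occurrenceID": occurrence_urn,
                "measurementID": f"{detection_urn}:detectionScore",
                "measurementType": "detectionScore",
                "measurementValue": f"{score:.6f}",
                "measurementUnit": "proportion",
            }
        )
    if bbox:
        rows.append(
            {
                "eventID": event_urn,
                "occurrenceID": occurrence_urn,
                "measurementID": f"{detection_urn}:boundingBox",
                "measurementType": "boundingBox",
                "measurementValue": json.dumps(bbox),  # JSON list, e.g. [x1, y1, x2, y2]
                "measurementUnit": "pixels",
            }
        )
    return rows


rows = sketch_mof_rows(
    "urn:ami:event:demo:1",
    "urn:ami:occurrence:demo:10",
    "urn:ami:detection:demo:100",
    0.91234567,
    [10, 20, 110, 220],
)
```

Suffixing the measurement type onto the detection URN keeps `measurementID` unique even when one detection contributes several facts.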
+ +- [ ] **Step 3: Add MOF_FIELDS to `fields.py`** + +```python +MOF_FIELDS: list[DwCAField] = [ + DwCAField(DWC + "eventID", "eventID", lambda r, slug: r["eventID"], required=True), + DwCAField(DWC + "occurrenceID", "occurrenceID", lambda r, slug: r.get("occurrenceID", "")), + DwCAField(DWC + "measurementID", "measurementID", lambda r, slug: r.get("measurementID", "")), + DwCAField(DWC + "measurementType", "measurementType", lambda r, slug: r["measurementType"], required=True), + DwCAField(DWC + "measurementValue", "measurementValue", lambda r, slug: r.get("measurementValue", "")), + DwCAField(DWC + "measurementUnit", "measurementUnit", lambda r, slug: r.get("measurementUnit", "")), + DwCAField( + DWC + "measurementDeterminedBy", + "measurementDeterminedBy", + lambda r, slug: r.get("measurementDeterminedBy", ""), + ), + DwCAField( + DWC + "measurementRemarks", + "measurementRemarks", + lambda r, slug: r.get("measurementRemarks", ""), + ), +] +``` + +Export from `__init__.py` (add import and `__all__`). + +- [ ] **Step 4: Add `iter_mof_rows` to `rows.py`** + +```python +import json + + +def iter_mof_rows(occurrences_qs, project_slug: str): + """Yield dicts for measurementorfact.txt rows. 
+ + Per-occurrence: + - classificationScore (value = occurrence.determination_score, unit = proportion) + + Per-detection: + - detectionScore (value = detection.detection_score, unit = proportion) + - boundingBox (value = JSON [x1,y1,x2,y2], unit = pixels) + """ + for occ in occurrences_qs.select_related("determination", "event").prefetch_related( + "detections__detection_algorithm", + "detections__classifications__algorithm", + ): + # eventID is the required coreid, so an occurrence without an event cannot be emitted. + if occ.event_id is None: + continue + eid = _event_id(occ.event, project_slug) + occ_urn = _occurrence_id(occ, project_slug) + if occ.determination_score is not None: + yield { + "eventID": eid, + "occurrenceID": occ_urn, + "measurementID": f"{occ_urn}:classificationScore", + "measurementType": "classificationScore", + "measurementValue": f"{occ.determination_score:.6f}", + "measurementUnit": "proportion", + "measurementDeterminedBy": _classifier_name(occ), + "measurementRemarks": "ML classifier softmax score", + } + for det in occ.detections.all(): + det_urn = f"urn:ami:detection:{project_slug}:{det.id}" + if det.detection_score is not None: + yield { + "eventID": eid, + "occurrenceID": occ_urn, + "measurementID": f"{det_urn}:detectionScore", + "measurementType": "detectionScore", + "measurementValue": f"{det.detection_score:.6f}", + "measurementUnit": "proportion", + "measurementDeterminedBy": det.detection_algorithm.name if det.detection_algorithm else "", + "measurementRemarks": "ML detector confidence score", + } + if det.bbox: + yield { + "eventID": eid, + "occurrenceID": occ_urn, + "measurementID": f"{det_urn}:boundingBox", + "measurementType": "boundingBox", + "measurementValue": json.dumps(det.bbox), + "measurementUnit": "pixels", + "measurementDeterminedBy": det.detection_algorithm.name if det.detection_algorithm else "", + "measurementRemarks": "Bounding box [x1, y1, x2, y2]", + } + + +def _classifier_name(occurrence) -> str: + """Best-effort: name + version of the classifier that produced this determination.""" + best = None + for det in
occurrence.detections.all(): + for cls in det.classifications.all(): + if cls.taxon_id == occurrence.determination_id: + best = cls + break + if best: + break + if best and best.algorithm: + name = best.algorithm.name or "" + version = getattr(best.algorithm, "version", "") or "" + return f"{name} {version}".strip() + return "" +``` + +- [ ] **Step 5: Wire into `DwCAExporter.export()`** + +In `ami/exports/format_types.py`, extend the `export()` method from Task 6, Step 5: + +1. Add `from ami.exports.dwca.fields import MOF_FIELDS` and `from ami.exports.dwca.rows import iter_mof_rows`. +2. Create `mof_path = _tmp_txt()`. +3. After the multimedia write, add: + ```python + mof_count = write_tsv( + mof_path, + MOF_FIELDS, + iter_mof_rows(self.queryset, project_slug), + project_slug, + ) + logger.info(f"DwC-A: wrote {mof_count} measurementOrFact rows") + ``` +4. Add a fourth entry to the `generate_meta_xml` list: + ```python + { + "role": "extension", + "row_type": "http://rs.gbif.org/terms/1.0/MeasurementOrFact", + "filename": "measurementorfact.txt", + "fields": MOF_FIELDS, + }, + ``` +5. Add `"measurementorfact.txt": mof_path,` to the `create_dwca_zip` files dict. +6. Add `mof_path` to the cleanup tuple. + +- [ ] **Step 6: Run tests to verify they pass** + +Run: `docker compose run --rm django python manage.py test ami.exports.tests.DwCAExportTest --keepdb -v 2` +Expected: all pass, including 2 new MoF assertions. + +- [ ] **Step 7: Commit** + +```bash +git add ami/exports/dwca/fields.py ami/exports/dwca/rows.py ami/exports/dwca/__init__.py ami/exports/format_types.py ami/exports/tests.py +git commit -m "$(cat <<'EOF' +feat(exports): add measurementorfact.txt extension + +Captures ML provenance as structured numeric facts: +classificationScore per occurrence, detectionScore + boundingBox +per detection. Row type http://rs.gbif.org/terms/1.0/MeasurementOrFact, +coreid=eventID, occurrenceID column linking back to the occurrence. 
+ +Per-event MoF rows (lux, temperature, moon phase) are not emitted +in this PR; the column layout reserves space for them. + +Co-Authored-By: Claude +EOF +)" +``` + +--- + +## Task 8: Upgrade EML 2.1.1 → 2.2.0 with computed coverage and methods + +**Why:** EML 2.2.0 is the current ratified version and what GBIF expects. Compute geographic/temporal coverage from the actual event data and document the sampling protocol explicitly in `methods`. + +**Files:** +- Modify: `ami/exports/dwca/eml.py` +- Modify: `ami/exports/format_types.py` (pass events to `generate_eml_xml`) +- Modify: `ami/exports/tests.py` + +- [ ] **Step 1: Write the failing test** + +Replace the existing `test_eml_xml_valid` in `DwCAExportTest` with an expanded version: + +```python + def test_eml_xml_valid(self): + """eml.xml should be well-formed EML 2.2.0 with coverage and methods.""" + with self._open_zip() as f: + with zipfile.ZipFile(f, "r") as zf: + eml_xml = zf.read("eml.xml").decode("utf-8") + root = ET.fromstring(eml_xml) + + # EML 2.2.0 namespace on the root element + self.assertIn("eml-2.2.0", eml_xml) + self.assertEqual(root.tag, "{https://eml.ecoinformatics.org/eml-2.2.0}eml") + # Only the root <eml:eml> is namespace-qualified; its children are + # unqualified, so they are looked up without a namespace prefix. + dataset = root.find("dataset") + self.assertIsNotNone(dataset, "eml.xml missing <dataset>") + + # Title matches project name + title = dataset.find("title") + self.assertIsNotNone(title) + self.assertEqual(title.text, self.project.name) + + # Coverage: bounding box + temporal + coverage = dataset.find("coverage") + self.assertIsNotNone(coverage, "Missing <coverage>") + self.assertIsNotNone(coverage.find(".//geographicCoverage")) + self.assertIsNotNone(coverage.find(".//temporalCoverage")) + + # Methods section + methods = dataset.find("methods") + self.assertIsNotNone(methods, "Missing <methods>") + method_step = methods.find("methodStep") + self.assertIsNotNone(method_step) +``` + +- [ ] **Step 2: Run test to verify it fails** + +Run: `docker compose run --rm django python manage.py test
ami.exports.tests.DwCAExportTest.test_eml_xml_valid --keepdb -v 2` +Expected: FAIL — namespace mismatch, missing `<coverage>` and `<methods>`. + +- [ ] **Step 3: Rewrite `ami/exports/dwca/eml.py`** + +```python +"""Generate EML 2.2.0 metadata for the DwC-A.""" + +from __future__ import annotations + +from xml.etree import ElementTree as ET + +from django.utils import timezone +from django.utils.text import slugify + +EML_NS = "https://eml.ecoinformatics.org/eml-2.2.0" +XSI_NS = "http://www.w3.org/2001/XMLSchema-instance" + + +def generate_eml_xml(project, events=None) -> str: + """Return the eml.xml body. + + If `events` is provided (iterable of Event), geographic and temporal + coverage are computed from it. If absent, they're omitted. + """ + project_slug = slugify(project.name) + now = timezone.now().strftime("%Y-%m-%dT%H:%M:%S") + + eml = ET.Element("eml:eml") + eml.set("xmlns:eml", EML_NS) + eml.set("xmlns:dc", "http://purl.org/dc/terms/") + eml.set("xmlns:xsi", XSI_NS) + eml.set("xsi:schemaLocation", f"{EML_NS} https://eml.ecoinformatics.org/eml-2.2.0/eml.xsd") + eml.set("packageId", f"urn:ami:dataset:{project_slug}:{now}") + eml.set("system", "AMI") + + dataset = ET.SubElement(eml, "dataset") + _add_text(dataset, "title", project.name) + + creator = ET.SubElement(dataset, "creator") + _add_text(creator, "organizationName", "Automated Monitoring of Insects (AMI)") + if project.owner and project.owner.name: + individual = ET.SubElement(creator, "individualName") + _add_text(individual, "surName", project.owner.name) + + abstract = ET.SubElement(dataset, "abstract") + _add_text(abstract, "para", project.description or f"Biodiversity monitoring data from {project.name}.") + + _add_intellectual_rights(dataset, project) + + if events is not None: + _add_coverage(dataset, events) + + contact = ET.SubElement(dataset, "contact") + _add_text(contact, "organizationName", "Automated Monitoring of Insects (AMI)") + + # The EML dataset sequence places <contact> before <methods>. + _add_methods(dataset) + + ET.indent(eml, space=" ") + xml_str =
ET.tostring(eml, encoding="unicode", xml_declaration=False) + return '<?xml version="1.0" encoding="UTF-8"?>\n' + xml_str + "\n" + + +def _add_text(parent, tag, text): + child = ET.SubElement(parent, tag) + child.text = text or "" + return child + + +def _add_intellectual_rights(dataset, project): + # The EML dataset sequence places <additionalInfo> before <intellectualRights>. + if getattr(project, "rights_holder", ""): + additional = ET.SubElement(dataset, "additionalInfo") + _add_text(additional, "para", f"Rights holder: {project.rights_holder}") + rights = ET.SubElement(dataset, "intellectualRights") + para = ET.SubElement(rights, "para") + project_license = (getattr(project, "license", "") or "").strip() + para.text = project_license if project_license else "All rights reserved. No license specified." + + +def _add_coverage(dataset, events): + # Materialize once: `events` is read several times below and may be a one-shot iterable. + events = list(events) + lats = [e.deployment.latitude for e in events if e.deployment and e.deployment.latitude is not None] + lons = [e.deployment.longitude for e in events if e.deployment and e.deployment.longitude is not None] + starts = [e.start for e in events if e.start] + ends = [e.end for e in events if e.end] or starts + + if not (lats and lons) and not starts: + return + + coverage = ET.SubElement(dataset, "coverage") + + if lats and lons: + geo = ET.SubElement(coverage, "geographicCoverage") + _add_text(geo, "geographicDescription", "Computed from event deployment coordinates") + bounding = ET.SubElement(geo, "boundingCoordinates") + _add_text(bounding, "westBoundingCoordinate", f"{min(lons):.6f}") + _add_text(bounding, "eastBoundingCoordinate", f"{max(lons):.6f}") + _add_text(bounding, "northBoundingCoordinate", f"{max(lats):.6f}") + _add_text(bounding, "southBoundingCoordinate", f"{min(lats):.6f}") + + if starts: + temporal = ET.SubElement(coverage, "temporalCoverage") + range_of_dates = ET.SubElement(temporal, "rangeOfDates") + begin = ET.SubElement(range_of_dates, "beginDate") + _add_text(begin, "calendarDate", min(starts).date().isoformat()) + end = ET.SubElement(range_of_dates, "endDate") + _add_text(end, "calendarDate",
max(ends).date().isoformat()) + + +def _add_methods(dataset): + methods = ET.SubElement(dataset, "methods") + step = ET.SubElement(methods, "methodStep") + description = ET.SubElement(step, "description") + _add_text( + description, + "para", + "Images captured at a fixed interval by an automated camera trap with light attractant. " + "Each image is processed through an ML detector (bounding-box extraction) and an ML " + "classifier (species prediction). Individual detections are aggregated into occurrences " + "by spatiotemporal grouping and assigned a consensus determination.", + ) + sampling = ET.SubElement(methods, "sampling") + study_extent = ET.SubElement(sampling, "studyExtent") + _add_text(study_extent, "description", "See <coverage> for geographic and temporal extent.") + _add_text(sampling, "samplingDescription", "Automated overnight monitoring with continuous image capture.") + qc = ET.SubElement(methods, "qualityControl") + qc_description = ET.SubElement(qc, "description") + _add_text( + qc_description, + "para", + "Project default filters applied before export: score thresholds, include/exclude taxa " + "lists, soft-delete exclusion. Only occurrences with at least one detection are included.", + ) +``` + +- [ ] **Step 4: Pass events into `generate_eml_xml` from the exporter** + +In `DwCAExporter.export()`, change: +```python +eml_xml = generate_eml_xml(self.project) +``` +to: +```python +eml_xml = generate_eml_xml(self.project, events_list) +``` + +- [ ] **Step 5: Run tests to verify they pass** + +Run: `docker compose run --rm django python manage.py test ami.exports.tests.DwCAExportTest --keepdb -v 2` +Expected: all pass, including the rewritten EML test. + +- [ ] **Step 6: Commit** + +```bash +git add ami/exports/dwca/eml.py ami/exports/format_types.py ami/exports/tests.py +git commit -m "$(cat <<'EOF' +feat(exports): upgrade eml.xml to EML 2.2.0 with coverage and methods + +Bumps namespace and schemaLocation to eml-2.2.0.
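Adds computed +geographicCoverage (bbox from event deployment coordinates), +temporalCoverage (earliest event start to latest event end), and a methods section +documenting the automated capture + ML pipeline workflow and +the quality-control filters applied at export time. + +Co-Authored-By: Claude +EOF +)" +``` + +--- + +## Task 9: Extend validator to cover multimedia.txt and MoF; add occurrenceID cross-ref check + +**Why:** The new extensions introduce referential semantics (crop rows must point at real occurrenceIDs; MoF rows must too). The existing tests do not catch these invariants because they hold across files rather than within a single file. + +**Files:** +- Modify: `ami/exports/dwca_validator.py` +- Modify: `ami/exports/tests_dwca_validator.py` + +- [ ] **Step 1: Write the failing tests** + +Append to `ami/exports/tests_dwca_validator.py`: + +```python +def test_validator_detects_orphaned_occurrence_id_on_extension_row(tmp_path): + """A multimedia row whose occurrenceID isn't in occurrence.txt should error.""" + import zipfile + from ami.exports.dwca_validator import validate_dwca_zip + + zip_path = tmp_path / "bad.zip" + with zipfile.ZipFile(zip_path, "w") as zf: + zf.writestr("meta.xml", _MINIMAL_META_WITH_MULTIMEDIA) + zf.writestr("eml.xml", "") + zf.writestr("event.txt", "eventID\tdecimalLatitude\tdecimalLongitude\nE1\t45\t-73\n") + zf.writestr("occurrence.txt", "eventID\toccurrenceID\tbasisOfRecord\nE1\tO1\tMachineObservation\n") + zf.writestr( + "multimedia.txt", + "eventID\toccurrenceID\tidentifier\nE1\tO_MISSING\thttp://example.com/a.jpg\n", + ) + result = validate_dwca_zip(str(zip_path)) + assert not result.ok + assert any("occurrenceID" in e for e in result.errors) + + +# NOTE: descriptor reconstructed to match the TSV headers written above; attribute values are assumed. +_MINIMAL_META_WITH_MULTIMEDIA = """<?xml version="1.0" encoding="UTF-8"?> +<archive xmlns="http://rs.tdwg.org/dwc/text/" metadata="eml.xml"> + <core rowType="http://rs.tdwg.org/dwc/terms/Event" encoding="UTF-8" fieldsTerminatedBy="\\t" linesTerminatedBy="\\n" ignoreHeaderLines="1"> + <files><location>event.txt</location></files> + <id index="0"/> + <field index="0" term="http://rs.tdwg.org/dwc/terms/eventID"/> + <field index="1" term="http://rs.tdwg.org/dwc/terms/decimalLatitude"/> + <field index="2" term="http://rs.tdwg.org/dwc/terms/decimalLongitude"/> + </core> + <extension rowType="http://rs.tdwg.org/dwc/terms/Occurrence" encoding="UTF-8" fieldsTerminatedBy="\\t" linesTerminatedBy="\\n" ignoreHeaderLines="1"> + <files><location>occurrence.txt</location></files> + <coreid index="0"/> + <field index="0" term="http://rs.tdwg.org/dwc/terms/eventID"/> + <field index="1" term="http://rs.tdwg.org/dwc/terms/occurrenceID"/> + <field index="2" term="http://rs.tdwg.org/dwc/terms/basisOfRecord"/> + </extension> + <extension rowType="http://rs.gbif.org/terms/1.0/Multimedia" encoding="UTF-8" fieldsTerminatedBy="\\t" linesTerminatedBy="\\n" ignoreHeaderLines="1"> + <files><location>multimedia.txt</location></files> + <coreid index="0"/> + <field index="0" term="http://rs.tdwg.org/dwc/terms/eventID"/> + <field index="1" term="http://rs.tdwg.org/dwc/terms/occurrenceID"/> + <field index="2" term="http://purl.org/dc/terms/identifier"/> + </extension> +</archive> +""" +``` + +- [ ] **Step 2: Run test to verify it fails** + +Run: `docker compose run --rm django python -m pytest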
ami/exports/tests_dwca_validator.py::test_validator_detects_orphaned_occurrence_id_on_extension_row -v` +Expected: FAIL — validator doesn't cross-check occurrenceID yet. + +- [ ] **Step 3: Extend validator** + +In `ami/exports/dwca_validator.py`, modify `validate_dwca_zip` to additionally: +1. Collect occurrenceID values from the Occurrence extension (if present) +2. For any other extension that declares a `dwc:occurrenceID` field, verify each non-blank value exists in that set + +Add this block at the end of the existing loop, right after the existing `_validate_extension` call: + +```python + occurrence_ids = _collect_occurrence_ids(zf, tables) + for ext in tables[1:]: + if ext.filename == "occurrence.txt": + continue + _validate_occurrence_id_references(zf, ext, occurrence_ids, result) +``` + +And two helpers at module level: + +```python +_OCCURRENCE_ID_TERM = "http://rs.tdwg.org/dwc/terms/occurrenceID" + + +def _collect_occurrence_ids(zf: zipfile.ZipFile, tables: list[_TableSpec]) -> set[str]: + for t in tables: + if t.filename == "occurrence.txt": + rows = _read_tsv(zf, t.filename, ValidationResult()) + if rows is None: + return set() + occ_col = None + for idx, term in t.field_terms.items(): + if term == _OCCURRENCE_ID_TERM: + occ_col = idx + break + if occ_col is None: + return set() + return {row[occ_col].strip() for row in rows[1:] if occ_col < len(row) and row[occ_col].strip()} + return set() + + +def _validate_occurrence_id_references( + zf: zipfile.ZipFile, + ext: _TableSpec, + occurrence_ids: set[str], + result: ValidationResult, +) -> None: + occ_col = None + for idx, term in ext.field_terms.items(): + if term == _OCCURRENCE_ID_TERM: + occ_col = idx + break + if occ_col is None: + return + rows = _read_tsv(zf, ext.filename, result) + if rows is None: + return + missing: set[str] = set() + for row in rows[1:]: + if occ_col >= len(row): + continue + val = row[occ_col].strip() + if val and val not in occurrence_ids: + missing.add(val) + if missing: + 
sample = sorted(missing)[:5] + result.add_error( + f"{ext.filename}: {len(missing)} occurrenceID value(s) do not exist in occurrence.txt. " + f"First: {sample}" + ) +``` + +- [ ] **Step 4: Run test to verify it passes** + +Run: `docker compose run --rm django python -m pytest ami/exports/tests_dwca_validator.py -v` +Expected: all validator tests pass, including the new one. + +- [ ] **Step 5: Commit** + +```bash +git add ami/exports/dwca_validator.py ami/exports/tests_dwca_validator.py +git commit -m "$(cat <<'EOF' +feat(exports): validate occurrenceID cross-references between extensions + +Any extension that declares a dwc:occurrenceID column must only +contain values that exist in occurrence.txt. Multimedia crop rows +and MoF rows both carry occurrenceID as a back-link; this check +catches drift where the pipeline emits rows pointing at +filtered-out or non-existent occurrences. + +Co-Authored-By: Claude +EOF +)" +``` + +--- + +## Task 10: Run the validator on the packaged zip; fail fast on structural errors + +**Why:** The design says "Fatal errors fail the export and mark `DataExport.status = FAILED`." Validating the finished archive inside the exporter catches drift before users download a broken file. + +**Files:** +- Modify: `ami/exports/format_types.py` +- Modify: `ami/exports/tests.py` + +- [ ] **Step 1: Write the failing test** + +Add to `DwCAExportTest`: + +```python + def test_validator_runs_on_produced_zip(self): + """The exporter's own zip should pass its own validator cleanly.""" + from ami.exports.dwca_validator import validate_dwca_zip + + with self._open_zip() as f: + # Write cached zip to a tempfile the validator can reopen. 
+ import tempfile + tf = tempfile.NamedTemporaryFile(delete=False, suffix=".zip") + tf.write(f.read()) + tf.close() + result = validate_dwca_zip(tf.name) + self.assertTrue( + result.ok, + f"Self-produced DwC-A failed own validator: {result.errors}", + ) +``` + +- [ ] **Step 2: Add runtime validation step in the exporter** + +In `DwCAExporter.export()` in `ami/exports/format_types.py`, right after `zip_path = create_dwca_zip(...)`: + +```python + from ami.exports.dwca_validator import validate_dwca_zip + + validation = validate_dwca_zip(zip_path) + for warning in validation.warnings: + logger.warning(f"DwC-A validation warning: {warning}") + if not validation.ok: + for err in validation.errors: + logger.error(f"DwC-A validation error: {err}") + raise ValueError( + f"DwC-A archive failed structural validation ({len(validation.errors)} errors). " + f"First: {validation.errors[0]}" + ) +``` + +- [ ] **Step 3: Run test to verify it passes** + +Run: `docker compose run --rm django python manage.py test ami.exports.tests.DwCAExportTest --keepdb -v 2` +Expected: all pass. + +- [ ] **Step 4: Commit** + +```bash +git add ami/exports/format_types.py ami/exports/tests.py +git commit -m "$(cat <<'EOF' +feat(exports): validate DwC-A archive structure before returning + +Run the offline structural validator against the zip the exporter +just produced. Fatal errors raise, which is caught by the export +framework and flips DataExport.status to FAILED. Warnings log. + +Prevents users downloading broken archives where meta.xml, TSV +columns, or cross-references have silently drifted. + +Co-Authored-By: Claude +EOF +)" +``` + +--- + +## Task 11: UI — register `dwca` as an export type with the April-2026-Draft label + +**Why:** Users can't choose the format from the UI until it's in `SERVER_EXPORT_TYPES`. Label must clearly signal "draft" so early testers know what they have. 
+ +**Files:** +- Modify: `ui/src/data-services/models/export.ts` + +- [ ] **Step 1: Update `SERVER_EXPORT_TYPES` and the label map** + +Replace lines 6-36 of `ui/src/data-services/models/export.ts` with: + +```typescript +export const SERVER_EXPORT_TYPES = [ + 'occurrences_simple_csv', + 'occurrences_api_json', + 'dwca', +] as const + +export type ServerExportType = (typeof SERVER_EXPORT_TYPES)[number] + +export type ServerExport = any // TODO: Update this type + +export class Export extends Entity { + public readonly job?: Job + + public constructor(entity: ServerExport) { + super(entity) + + if (this._data.job) { + this.job = new JobDetails(this._data.job) + } + } + + static getExportTypeInfo(key: ServerExportType) { + const label = { + occurrences_simple_csv: 'Occurrences (simple CSV)', + occurrences_api_json: 'Occurrences (API JSON)', + dwca: 'Darwin Core Archive (DwC-A) — April 2026 Draft', + }[key] + + return { + key, + label, + } + } +``` + +- [ ] **Step 2: Build to verify no TypeScript errors** + +Run: `cd ui && yarn build` +Expected: build succeeds with no new errors referencing `export.ts`. + +- [ ] **Step 3: Commit** + +```bash +git add ui/src/data-services/models/export.ts +git commit -m "$(cat <<'EOF' +feat(ui): expose dwca as a user-selectable export format + +Labels the format 'Darwin Core Archive (DwC-A) — April 2026 Draft' +so scientists testing the export know what they're looking at +before GBIF-registration and scheme stabilization. 
+ +Co-Authored-By: Claude +EOF +)" +``` + +--- + +## Task 12: Update format reference doc + +**Files:** +- Modify: `docs/claude/dwca-format-reference.md` + +- [ ] **Step 1: Rewrite the "Archive contents" section** + +Replace the archive-contents description in `docs/claude/dwca-format-reference.md` with: + +````markdown +## Archive contents + +```text +project_export.zip +├── meta.xml DwC-A text-archive descriptor +├── eml.xml EML 2.2.0 dataset metadata +├── event.txt Core — Event row per AMI Event, with +│ Humboldt eco: columns flattened in +├── occurrence.txt Extension — coreid=eventID, one row per +│ published Occurrence. associatedMedia +│ column carries pipe-separated capture URLs. +├── multimedia.txt Extension — coreid=eventID. Two row types: +│ - capture rows (occurrenceID blank) +│ - detection-crop rows (occurrenceID populated) +└── measurementorfact.txt Extension — coreid=eventID. Per-occurrence + classificationScore; per-detection + detectionScore and boundingBox. +``` + +## Humboldt Extension columns on event.txt + +| Column | Term | Source | +|---|---|---| +| isSamplingEffortReported | eco:isSamplingEffortReported | constant `true` | +| samplingEffortValue | eco:samplingEffortValue | `Event.captures_count` | +| samplingEffortUnit | eco:samplingEffortUnit | constant `images` | +| samplingEffortProtocol | eco:samplingEffortProtocol | constant protocol description | +| isAbsenceReported | eco:isAbsenceReported | constant `true` (per-taxon rows deferred) | +| targetTaxonomicScope | eco:targetTaxonomicScope | LCA of `Project.default_filters_include_taxa` | +| inventoryTypes | eco:inventoryTypes | constant `trap or sample` | +| protocolNames | eco:protocolNames | constant `AMI ML detector + classifier pipeline` | +| protocolDescriptions | eco:protocolDescriptions | constant pipeline description | +| hasMaterialSamples | eco:hasMaterialSamples | constant `true` | +| materialSampleTypes | eco:materialSampleTypes | constant `digital images` | +```` + +- [ ] 
**Step 2: Commit** + +```bash +git add docs/claude/dwca-format-reference.md +git commit -m "$(cat <<'EOF' +docs(exports): update DwC-A format reference for April 2026 draft + +Documents the four-file archive shape, Humboldt eco: columns on +event.txt, and the source of each constant/derived value. + +Co-Authored-By: Claude +EOF +)" +``` + +--- + +## Self-Review + +After all 12 tasks are implemented, run the full test suite and confirm: + +```bash +docker compose -f docker-compose.ci.yml run --rm django python manage.py test ami.exports --keepdb -v 2 +``` + +Expected: existing 10 DwC-A tests pass, plus the new tests: +- 3 in `TargetTaxonomicScopeTest` +- 2 in `MultimediaExtensionTest` +- 7 net-new in `DwCAExportTest` (humboldt cols, humboldt meta.xml, associatedMedia, multimedia in archive, multimedia meta.xml, MoF in archive, MoF meta.xml, validator self-check — replaced/modified eml test) +- 1 in `tests_dwca_validator.py` + +Also manually inspect a sample zip produced by a real project (`docker compose exec django python manage.py shell` → create + run a DwCAExporter), unzip it, and spot-check: +- `event.txt` has Humboldt columns populated +- `multimedia.txt` has both capture and crop rows +- `measurementorfact.txt` has classificationScore rows +- `eml.xml` references `eml-2.2.0` and has non-empty `` / `` +- `meta.xml` declares four tables + +## Deferred (follow-up PRs / tracked elsewhere) + +The following items from the design doc are intentionally not in this plan and are documented as follow-ups in the PR body or tickets: + +- `is_blank` / `contains_humans` source-image filters (fields don't exist in the model yet; design doc flags this as a WG requirement but implementation depends on upstream work) +- Per-taxon absence occurrence rows (pending `Site.primary_taxa_list` design) +- Device model additions (`device_type`, `attractant_type`, `light_wavelength`) → handled by CamtrapDP PR (#1262) +- CamtrapDP native export → #1262 +- Sensitive-taxa coordinate 
generalization +- Reverse-geocoding for country / state / locality +- `coordinateUncertaintyInMeters` (needs Deployment field) +- Online GBIF-API validator CI +- IPT publishing + DOI minting +- PR-body multimedia/bbox discussion comment (write after Task 7 merges, not part of the code plan) + +--- + +## Execution choice + +Plan complete and saved to `docs/claude/planning/2026-04-21-dwca-implementation-plan.md`. Two execution options: + +**1. Subagent-Driven (recommended)** — I dispatch a fresh subagent per task, review between tasks, fast iteration. + +**2. Inline Execution** — Execute tasks in this session using executing-plans, batch execution with checkpoints. + +Which approach? diff --git a/docs/claude/planning/dwca-followup-tickets.md b/docs/claude/planning/dwca-followup-tickets.md new file mode 100644 index 000000000..379618bc3 --- /dev/null +++ b/docs/claude/planning/dwca-followup-tickets.md @@ -0,0 +1,109 @@ +# DwC-A export follow-up tickets (single issue draft) + +**Status:** draft — do NOT post yet. Intended as one GitHub issue filed after #1131 merges, grouping the items explicitly deferred from that PR. + +--- + +## Title + +`DwC-A export follow-ups (perf, data quality, scope)` + +## Context + +PR #1131 shipped the April 2026 DwC-A draft: Event Core + Humboldt eco: terms + Occurrence / Multimedia / MeasurementOrFact extensions, EML 2.2.0, and an offline structural validator gated as a pre-zip check. CodeRabbit review surfaced several items that were deliberately scoped out to keep that PR focused on the archive contract. This issue tracks them as a single bundle so they can be sequenced together. + +## Items + +### 1. Perf: streaming fan-out for very large projects (>100K occurrences) + +**Status:** Partially addressed in PR #1131. 
`DwCAExporter.export()` now materializes `self.queryset` to a `list[Occurrence]` once (with all prefetches) and fans out to the three writers; a 100K hardcap (`DwCAExporter.DWCA_MAX_OCCURRENCES`) refuses exports larger than that with a clear error. The follow-up is removing the cap. + +**Why:** The materialize-once approach scales linearly in memory (~1 GB for 100K occurrences with prefetched detections + source images). For the AMI dataset today this is fine, but a genuinely large project would still OOM a worker. + +**Follow-up work:** Implement **stream once, emit side-by-side**: sort occurrences by `event_id`, stream through once, emit rows to all three files from a single pass using the classic grouping-by-sort pattern. Memory-bounded regardless of project size. Benchmark against a synthetic 500K-occurrence project before raising or removing the 100K cap. + +**References:** CodeRabbit PR #1131 threads `PRRT_kwDOIlxGbc587spb` (iter_multimedia_rows memory) and `PRRT_kwDOIlxGbc587spj` (3× queryset scan). Partial fix in `ami/exports/format_types.py` on the feat/dwca-export branch. + +--- + +### 2. Per-taxon absence occurrences + +**Why:** The current export sets `eco:isAbsenceReported="true"` and `eco:targetTaxonomicScope=` on every event row — this declares absence-inference capacity but doesn't emit actual absence occurrences. The Humboldt-canonical pattern is one `dwc:occurrenceStatus="absent"` row per target-taxon that was not detected during a given event, making "we proved this species was not present during the sampling window" machine-consumable for GBIF consumers. + +**How to apply:** depends on an enumerable target-taxon list. Design doc calls for sourcing from `TaxaList` (per-Site `Site.primary_taxa_list`, falling back to the project's default "all possible species" list). That model wiring is a prerequisite. + +**References:** `docs/claude/planning/2026-04-21-dwca-april-draft-design.md` § "Per-taxon absence occurrences — deferred to a follow-up PR". 
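
The absence pattern described in item 2 can be sketched as a set difference per event. This is a minimal illustration under stated assumptions, not the exporter's implementation: `OccurrenceRow`, `absence_rows`, the plain-string taxon names, and the example event ID are all hypothetical, and the real target list would come from the `TaxaList` wiring that does not exist yet.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OccurrenceRow:
    """Hypothetical minimal row; attribute names mirror Darwin Core terms."""

    event_id: str
    scientific_name: str
    occurrence_status: str  # "present" or "absent"


def absence_rows(event_id, target_taxa, detected_taxa):
    """Return one 'absent' row per target taxon that was not detected in the event.

    Assumes taxa are comparable by name; a real implementation would
    compare Taxon primary keys rather than strings.
    """
    missing = sorted(set(target_taxa) - set(detected_taxa))
    return [OccurrenceRow(event_id, name, "absent") for name in missing]


if __name__ == "__main__":
    # Hypothetical event ID and taxon names, for illustration only.
    rows = absence_rows(
        "urn:ami:event:demo:1",
        target_taxa=["Actias luna", "Hyles lineata", "Danaus plexippus"],
        detected_taxa=["Hyles lineata"],
    )
    for row in rows:
        print(row.event_id, row.scientific_name, row.occurrence_status)
```

Sorting the missing taxa keeps the emitted rows deterministic between runs, which makes archive diffs easier to review. An event in which every target taxon was detected yields no absence rows.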
+ +--- + +### 3. TaxaList-driven `targetTaxonomicScope` + +**Why:** v1 derives `eco:targetTaxonomicScope` by computing the LCA of `project.default_filters_include_taxa`. That is a pragmatic proxy but can be empty or overly broad. The right source is the `TaxaList` attached to a Site (curated checklist per monitoring station), which is also what unblocks absence occurrences above. + +**How to apply:** replace the LCA computation in `ami/exports/dwca/targetscope.py` with a lookup against `Site.primary_taxa_list` (falling back to project default) once the TaxaList wiring lands. Same follow-up as item 2. + +--- + +### 4. CamtrapDP sibling export + +**Why:** The GBIF Camera Trap Guide recommends CamtrapDP as the primary format for the camera-trap community (Wildlife Insights, Agouti, EU camera-trap networks). GBIF doesn't ingest CamtrapDP in production today, so DwC-A via Humboldt remains the GBIF path, but CamtrapDP matters for non-GBIF consumers. + +**How to apply:** separate PR. Shares code generously with DwC-A (the `DwCAField` dataclass pattern, the offline validator, the row-generator shape) but emits a Frictionless Data Package zip with its own schema. Tracked separately in issue #1262; this follow-up ticket should link there rather than duplicate. + +--- + +### 5. Human-identifier opt-in for `dwc:identifiedBy` + +**Why:** PR #1131 removed `user.email` from the identifiedBy fallback chain (GDPR concern — published archives mirrored by GBIF are hard to retract). The chain is now `user.name → user.username → user:{pk}`. If the project wants a real human identifier in published archives (e.g., for attribution in peer-reviewed datasets), that should be an explicit opt-in, not an unconditional email fallback. + +**How to apply:** needs a product decision on UX. Options: per-user "publish my name in open data" toggle; project-level "publish identifier users' names" setting; per-identification opt-in at time of verification. 
Likely smallest-useful is a user-profile boolean that gates whether `user.email` (or a display name) is used when the user has no `user.name` set. + +**References:** CodeRabbit PR #1131 thread `PRRT_kwDOIlxGbc587spn`. + +--- + +### 6. BoundingBox coordinate validation + +**Why:** PR #1131 documented that `BoundingBox` coordinates are absolute source-image pixels (they're passed directly to `PIL.Image.crop()`). The docstring makes the invariant explicit, but there's no runtime enforcement. A couple of test fixtures hold normalized `[0, 1]` values that only pass because the test only checks structural validity, not image crops. + +**How to apply:** add a Pydantic validator on `BoundingBox` enforcing `x2 > x1`, `y2 > y1`, and non-negative coords. Small, low-risk change, ideally paired with cleanup of the normalized test fixtures so they stop shadowing the production contract. + +**References:** CodeRabbit PR #1131 thread `PRRT_kwDOIlxGbc587spd`. + +--- + +### 7. Validator warning path + failure visibility in UI + +**Why:** `ValidationResult` supports both `errors` (fatal) and `warnings` (non-fatal), but every finding currently goes through `add_error` — there is no warning path wired up yet. When future vocabulary / backbone / EML schema checks land, they'll need to distinguish "this blocks export" from "this is worth flagging." Related: failure messages are currently buried in Celery/Django logs. The exporter now writes `VALIDATION_ERRORS.txt` into the zip and persists the failed zip to storage so users can download it and read the report, but the DataExport model has no `status` or `error_message` field — the UI only knows "file_url populated" vs. "not populated." + +**How to apply:** +- Add `DataExport.status` (choices: PENDING / SUCCEEDED / FAILED) and `DataExport.error_message` (TextField). Populate on failure. Show prominently in the exports table. +- Introduce at least one warning-category check in the validator so the `warnings` path is exercised. 
Candidates: "meta.xml omits a recommended column," "EML schema location unreachable," "project.license is the default placeholder." + +**References:** Item 3 of the takeaway review on PR #1131. + +--- + +### 8. Run archive through GBIF DwC-A Validator + +**Why:** The offline structural validator catches drift bugs (meta.xml vs. TSV column counts, dangling coreids) but does not check DwC vocabulary compliance, EML 2.2.0 schema conformance, Humboldt extension vocabulary, or taxonomic backbone matching. The archive has never been run through the GBIF validator or GBIF's IPT. The "April 2026 Draft" label is load-bearing until this runs cleanly. + +**How to apply:** add a Makefile target (or scripts/validate_dwca_via_gbif.sh) that (1) generates a fixture export, (2) uploads it to, or pipes it through, the GBIF validator, (3) asserts 0 schema errors. Blocks removing the "draft" label from the UI selector. Does not block merging PR #1131. + +**References:** Item 6 of the takeaway review on PR #1131. + +--- + +## Sequencing suggestion + +1. **(8) GBIF validator** — blocks "remove draft label." Do first so we know what else is actually broken before investing in the others. +2. **(6) BoundingBox validation** — smallest, unblocks confidence in downstream detection-related code. +3. **(7) Status + error_message on DataExport** — gives users a real failure signal; small model + serializer change. +4. **(1) Streaming fan-out** — benchmark-driven. Do before raising the 100K hardcap. +5. **(3) TaxaList scope** + **(2) absence occurrences** — coupled. Wait until `TaxaList` / `Site.primary_taxa_list` wiring lands, then land both in one PR. +6. **(5) identifiedBy opt-in** — needs product input before coding. File a short RFC. +7. **(4) CamtrapDP** — tracked separately in #1262; cross-link from here. + +## Closing criteria for this tracking issue + +Individual items ship in their own PRs and link back here. Close this issue when (1), (2), (3), (5), (6), (7), (8) are either merged or explicitly decided-not-to-do.
(4) is tracked separately and doesn't block closure here. diff --git a/ui/src/data-services/models/export.ts b/ui/src/data-services/models/export.ts index d96ee2e39..eebbebef8 100644 --- a/ui/src/data-services/models/export.ts +++ b/ui/src/data-services/models/export.ts @@ -6,6 +6,7 @@ import { JobDetails } from './job-details' export const SERVER_EXPORT_TYPES = [ 'occurrences_simple_csv', 'occurrences_api_json', + 'dwca', ] as const export type ServerExportType = (typeof SERVER_EXPORT_TYPES)[number] @@ -27,6 +28,7 @@ export class Export extends Entity { const label = { occurrences_simple_csv: 'Occurrences (simple CSV)', occurrences_api_json: 'Occurrences (API JSON)', + dwca: 'Darwin Core Archive (DwC-A) — April 2026 Draft', }[key] return {