Skip to content

feat: add Darwin Core Archive (DwC-A) export format#1131

Open
mihow wants to merge 47 commits into main from feat/dwca-export

Conversation

@mihow

@mihow mihow commented Feb 11, 2026

Summary

Add a dwca export format that produces a GBIF-compatible Darwin Core Archive — a ZIP with event.txt, occurrence.txt, multimedia.txt, measurementorfact.txt, meta.xml, and eml.xml (EML 2.2.0). Event Core architecture: events are the core, the other three are extensions linked via eventID. Humboldt Extension (eco:) terms are flattened onto Event Core rows to carry sampling-effort structure and absence reportability — the scientific contribution of automated monitoring. A structural validator runs against the zip before it is returned to the user; fatal drift fails the export fast.

The format is surfaced in the UI under the label "Darwin Core Archive (DwC-A) — April 2026 Draft" so early testers know this is pre-registration and the column layout may still evolve.

Archive contents

```text
project_export.zip
├── meta.xml                  DwC-A text-archive descriptor
├── eml.xml                   EML 2.2.0 dataset metadata
├── event.txt                 Core — Event row per AMI Event, with
│                               Humboldt eco: columns flattened in
├── occurrence.txt            Extension — coreid=eventID, one row per
│                               published Occurrence. associatedMedia
│                               column carries pipe-separated capture URLs.
├── multimedia.txt            Extension — coreid=eventID. Two row types:
│                               - capture rows (occurrenceID blank)
│                               - detection-crop rows (occurrenceID populated,
│                                 dc:references points at the source capture)
└── measurementorfact.txt     Extension — coreid=eventID. Per-occurrence
                                classificationScore; per-detection
                                detectionScore and boundingBox.
```

Key design decisions

| Decision | Rationale |
|---|---|
| Event Core (not Occurrence Core) | Preserves sampling-effort rows: every trap-night is emitted as an Event even if it produced no occurrences. Without this, provable absence during known sampling windows is lost and the dataset reads as a collection of sightings rather than a standardized survey. |
| Humboldt eco: terms flattened onto event.txt | GBIF accepts eco: terms on Event rows; this is the pragmatic alternative to a separate humboldt.txt extension. 11 terms encode sampling effort (reported/value/unit/protocol), absence reportability, targetTaxonomicScope (LCA-derived), protocol identifiers, and material-sample declaration. |
| targetTaxonomicScope from LCA of project include-taxa | v1 sourcing strategy — walks parents_json chains for the deepest common ancestor of Project.default_filters_include_taxa (see the sketch after this table). v2 will move to a per-Site TaxaList so each deployment declares its own expected species pool (the groundwork for per-taxon absence rows). |
| basisOfRecord = "MachineObservation" | All identifications are ML pipeline outputs. |
| One multimedia.txt for both captures and crops | Capture rows carry a blank occurrenceID; crop rows carry the occurrenceID URN and dc:references pointing back to the source capture URL. Consumers can filter by row shape. |
| MoF over inlining scores | classificationScore, detectionScore, and boundingBox are carried as structured MeasurementOrFact rows rather than as occurrence columns, so each measurement has its own measurementDeterminedBy (model name + version) and remarks. |
| EML 2.2.0 | Current GBIF-accepted version. Geographic/temporal coverage is computed from the actual event list; a `<methods>` section documents the capture + ML pipeline and the QC filters applied at export time. |
| Pre-zip validation gate | The offline structural validator runs against the produced zip before it is returned. Fatal errors raise and flip DataExport.status = FAILED; warnings are logged. Catches meta.xml / TSV column drift and cross-reference orphans before users download broken archives. |
| Package split | ami/exports/dwca.py → ami/exports/dwca/ with fields.py (catalogues), helpers.py, tsv.py, meta.py, eml.py, zip.py, rows.py, targetscope.py. Public API unchanged — re-exported from `__init__.py`. |
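
The LCA derivation referenced in the table is small enough to sketch. A minimal version, assuming each include-taxon contributes a root-to-leaf chain of ancestor names extracted from parents_json; the function and variable names here are illustrative, not the exact code in targetscope.py.

```python
def deepest_common_ancestor(chains: list[list[str]]) -> str:
    """Return the deepest taxon name shared by all root-to-leaf chains,
    or "" when the include-list is empty (per the v1 strategy above)."""
    if not chains:
        return ""
    lca_name = ""
    # zip() truncates to the shortest chain, which is the only prefix
    # every chain can share; the truncation is intentional here.
    for position in zip(*chains):
        if len(set(position)) == 1:
            lca_name = position[0]
        else:
            break
    return lca_name
```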

New in this iteration

  • associatedMedia column on occurrence.txt — pipe-separated capture URLs per occurrence.
  • multimedia.txt extension (GBIF Simple Multimedia) — capture and detection-crop rows in one file.
  • measurementorfact.txt extension — classificationScore, detectionScore, boundingBox.
  • EML upgrade 2.1.1 → 2.2.0 with computed <coverage> (bbox + temporal) and <methods>.
  • Humboldt eco: columns on event.txt (11 terms) + LCA-derived targetTaxonomicScope.
  • Validator: occurrenceID cross-reference check across extensions (sketched after this list).
  • Runtime pre-zip structural validation; fail fast on drift.
  • UI label: "Darwin Core Archive (DwC-A) — April 2026 Draft".
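
The cross-reference check called out above is, in essence, a set-membership pass. A hedged sketch, assuming the validator has already collected the IDs from occurrence.txt and parsed each extension's rows into dicts (names are illustrative):

```python
def check_occurrence_references(
    known_occurrence_ids: set[str],
    extension_name: str,
    rows: list[dict],
) -> list[str]:
    """Flag extension rows whose occurrenceID points at an occurrence
    that was filtered out or never existed. A blank occurrenceID is
    legitimate (e.g. multimedia capture rows) and is skipped."""
    errors = []
    for row in rows:
        ref = row.get("occurrenceID", "")
        if ref and ref not in known_occurrence_ids:
            errors.append(f"{extension_name}: orphan occurrenceID {ref!r}")
    return errors
```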

Round 2: self-run takeaway review + HTTP e2e validation (2026-04-22)

With CodeRabbit round 1 addressed, I ran a structural "takeaway" pass against the PR (the 7-angle review in CLAUDE.md), looking for silent-bug and scope-claim issues that a diff review doesn't surface. Seven items were filed; five applied on-branch, two sent to the follow-up doc.

Applied

| Concern | What changed | Commit |
|---|---|---|
| dc:license defaulted to blank when project.license was empty — GBIF and downstream consumers treat blank as "unknown, skip." | Default to "All rights reserved" across event.txt, multimedia.txt, and eml.xml so downstream consumers see the correct legal signal. Constant lives in `ami/exports/dwca/helpers.py::DEFAULT_LICENSE`. | 73eb3dce |
| Withdrawn human identifications were still marking occurrences as identificationVerificationStatus = "verified". | `_get_verification_status` now filters `withdrawn=False`, matching the best_identification / verified_count pattern elsewhere in the codebase. | ce1e17f2 |
| The archive had no in-file indication of its "April 2026 draft" status — consumers downloading the zip couldn't tell it was pre-registration. | Filename now includes `dwca_draft-2026-04` via a new generic `BaseExporter.filename_label` attribute; eml.xml carries an `<additionalInfo>` block stating the schema is subject to change. | 9475664c |
| `DwCAExporter.export()` iterated `self.queryset` 3× — one full DB scan per extension writer — and `iter_multimedia_rows` materialized an `occurrences_by_event` dict on top. | Materialize `list[Occurrence]` once with all prefetches and pass the same list to all three writers (sketched after this table). A 100K `DWCA_MAX_OCCURRENCES` hard cap guards against OOM on very large projects until streaming fan-out ships. | 9475664c |
| Validator failure deleted the zip, leaving the user with no way to diagnose what went wrong. | On failure, append VALIDATION_ERRORS.txt to the zip (listing every error and warning), persist the zip to storage so DataExport.file_url is set, then raise so the Job is still marked failed. The user can download the failed archive and read the report. | 9475664c |
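
The materialize-once fix above reduces to evaluating the queryset a single time and fanning the resulting list out to the writers. A sketch under that assumption; the function and writer signatures are illustrative, and only the cap constant's name and value are from this PR:

```python
from typing import Callable, Iterable

DWCA_MAX_OCCURRENCES = 100_000  # hard cap until streaming fan-out ships

def export_tables(queryset, writers: Iterable[Callable[[list], None]]) -> list:
    # One DB scan; select_related/prefetch_related on the queryset ride
    # along, so the writers never touch the database again.
    occurrences = list(queryset)
    if len(occurrences) > DWCA_MAX_OCCURRENCES:
        raise ValueError(
            f"{len(occurrences)} occurrences exceeds the "
            f"{DWCA_MAX_OCCURRENCES} cap; refusing to export"
        )
    for write in writers:  # occurrence, multimedia, and MoF writers
        write(occurrences)
    return occurrences
```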

Scope-out (tracked for follow-up)

  • Validator has a warnings path that nothing ever uses today. Wiring a real warning-category check + surfacing failure state via a proper DataExport.status / error_message field is follow-up item 7.
  • Archive has never been through the GBIF DwC-A Validator. Before removing the "draft" label, a CI target must pipe one fixture through it and assert zero schema errors. Follow-up item 8.

Full list: docs/claude/planning/dwca-followup-tickets.md on this branch.

Corrections to my own review

My first pass flagged basisOfRecord = "MachineObservation" as wrong for human-verified records. That was incorrect. basisOfRecord describes how the organism was observed (camera = machine), not who assigned the taxon name. Human verification lives in identifiedBy + identificationVerificationStatus. Wildlife Insights, Agouti, and every other GBIF-accepted camera-trap dataset records human-verified observations as MachineObservation. The hardcoded value stays.

What has been tested

Unit & integration — ami/exports test suite (56 tests, all green)

  • DwCAExportTest (29 tests) — full ZIP structure, row counts, meta.xml shape, referential integrity, Humboldt columns + meta.xml terms, associatedMedia, multimedia and MoF tables in archive, EML 2.2.0 coverage/methods/license, collection filtering, self-produced zip passes its own validator, _get_verification_status withdrawn-ID filter, the 100K hard cap refuses oversized exports with a clear error, VALIDATION_ERRORS.txt written on failure.
  • DwCAExportHttpE2ETest (new, 1 test) — HTTP end-to-end via TransactionTestCase + CELERY_TASK_ALWAYS_EAGER. Exercises the full path: POST /api/v2/exports/ → permission check → serializer validation → ExportViewSet.create → Job.enqueue → transaction.on_commit → eager Celery task → DwCAExporter.export → validator → zip persisted to storage → DataExport.file_url set. Asserts: 201 response, filename contains dwca_draft-2026-04, record_count > 0, zip contains all 6 expected members, no VALIDATION_ERRORS.txt, EML contains the draft notice, basisOfRecord = MachineObservation on every occurrence row, license = "All rights reserved" on every event row.
  • TargetTaxonomicScopeTest (3 tests) — LCA derivation for empty / single / multi-taxa.
  • MultimediaExtensionTest (2 tests) — field catalogue + row generator emits capture and crop rows.
  • tests_dwca_validator.py (9 tests) — validator behavior including occurrenceID cross-reference check, defusedxml hardening, negative-index rejection, UTF-8 decode safety, zip cleanup on failure.

Reproduce:

```bash
docker compose -f docker-compose.ci.yml run --rm django \
  python manage.py test ami.exports --noinput
```

Validated on real produced archive

  • test_validator_runs_on_produced_zip and test_offline_structural_validator both re-run the offline validator against the actual zip emitted by the materialize-once exporter. Both pass.

Not yet validated

  • GBIF DwC-A Validator run against a produced archive — tracked as follow-up item 8. The "April 2026 draft" label in the UI is load-bearing until this runs clean.

Deployment notes

  • New migration: ami/main/migrations/0084_project_license.py adds Project.license and Project.rights_holder CharFields (both with default=""). Safe to roll forward on a live database; downgrade drops the columns. A plausible shape is sketched after this list.
  • No configuration changes required.
  • The "dwca" format is registered automatically and appears in /api/v2/exports/ choices.
  • Existing export permissions apply unchanged.
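
As referenced in the first note above, a plausible shape for that migration. The field names and default="" are stated in this PR; max_length, blank=True, and the dependency name are assumptions:

```python
from django.db import migrations, models

class Migration(migrations.Migration):
    dependencies = [
        ("main", "0083_previous_migration"),  # placeholder dependency name
    ]

    operations = [
        migrations.AddField(
            model_name="project",
            name="license",
            field=models.CharField(blank=True, default="", max_length=255),
        ),
        migrations.AddField(
            model_name="project",
            name="rights_holder",
            field=models.CharField(blank=True, default="", max_length=255),
        ),
    ]
```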

Deferred (tracked elsewhere / separate PRs)

  • is_blank / contains_humans source-image filters (fields don't exist in the model yet; WG requirement, depends on upstream work).
  • Per-taxon absence occurrence rows (pending Site.primary_taxa_list design).
  • Device model additions (device_type, attractant_type, light_wavelength) → CamtrapDP ticket (feat(exports): add CamtrapDP export format alongside DwC-A #1262).
  • CamtrapDP native export → feat(exports): add CamtrapDP export format alongside DwC-A #1262.
  • Sensitive-taxa coordinate generalization (dwc:informationWithheld).
  • Reverse-geocoding for country / stateProvince / locality.
  • coordinateUncertaintyInMeters (needs Deployment field).
  • Online GBIF-API validator in CI (gated on GBIF_API_VALIDATE=1).
  • IPT publishing + DOI minting + dataset versioning.

How to test

Prerequisites: an Antenna instance with at least one project containing events and ML-identified occurrences; API access with Researcher or Admin role.

```bash
# Create an export
curl -X POST https://<host>/api/v2/exports/ \
  -H "Authorization: ******" \
  -H "Content-Type: application/json" \
  -d '{"project": <project_id>, "format": "dwca"}'

# Poll until file_url is populated
curl https://<host>/api/v2/exports/<export_id>/ -H "Authorization: ******"

# Inspect
curl -o export.zip "<file_url>"
unzip -l export.zip
# Expected: event.txt, occurrence.txt, multimedia.txt, measurementorfact.txt, meta.xml, eml.xml

unzip -p export.zip meta.xml | grep -E 'rowType|<location>'
unzip -p export.zip eml.xml | head -20

# Referential integrity
unzip -p export.zip event.txt | tail -n +2 | cut -f1 | sort -u > /tmp/events.txt
unzip -p export.zip occurrence.txt | tail -n +2 | cut -f1 | sort -u > /tmp/occ_events.txt
comm -23 /tmp/occ_events.txt /tmp/events.txt
# → should produce no output
```

Upload the ZIP to https://www.gbif.org/tools/data-validator for vocabulary / taxonomic-backbone checks. The offline validator covered by the test suite catches structural drift; the GBIF validator is the source of truth for GBIF acceptance.
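
The offline validator can also be invoked directly before the GBIF upload. A sketch, assuming validate_dwca_zip accepts a path and returns a result object carrying errors and warnings lists; the utility name is from this PR, but the exact signature and return shape are assumptions:

```python
from ami.exports.dwca_validator import validate_dwca_zip

result = validate_dwca_zip("export.zip")
for message in result.errors:    # fatal: these would have failed the export
    print("ERROR:", message)
for message in result.warnings:  # logged, non-fatal
    print("WARNING:", message)
```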

Checklist

  • I have tested these changes appropriately.
  • I have added and/or modified relevant tests.
  • I updated relevant documentation.
  • I have verified that this PR follows the project's coding standards.
  • Any dependent changes have already been merged to main.

Summary by CodeRabbit

Release Notes

  • New Features

    • Added Darwin Core Archive (DwC-A) export format for structured biodiversity data publication with event, occurrence, multimedia, and measurement tables.
    • Added project license and rights holder fields for data attribution and publication compliance.
    • Added archive validation tool to verify export structure and integrity.
    • UI now supports the new DwC-A export option.
  • Documentation

    • Added comprehensive DwC-A format reference guide with field mappings and validation checklist.
    • Added export framework documentation detailing the export pipeline and architecture.
  • Tests

    • Added extensive DwC-A export validation test suite covering structure, content, referential integrity, and metadata.

mihow and others added 4 commits February 11, 2026 11:45
Add planning document for Darwin Core Archive export format, export
framework technical reference, DwC-A format reference with field
mappings, and downloaded DwC terms quick reference from TDWG.

Co-Authored-By: Claude <noreply@anthropic.com>
Add Event Core + Occurrence Extension DwC-A exporter that produces a
GBIF-compatible ZIP containing event.txt, occurrence.txt, meta.xml,
and eml.xml. Events are the core entity with occurrences linked via
eventID foreign key.

Key design decisions:
- Direct TSV writing with iterator(chunk_size=500) instead of DRF serializers
- Taxonomy hierarchy extracted from parents_json to avoid N+1 queries
- meta.xml generated from the same field definitions used for TSV columns
- basisOfRecord = "MachineObservation" for all records
- URN format IDs: urn:ami:event:{slug}:{id}, urn:ami:occurrence:{slug}:{id}

Co-Authored-By: Claude <noreply@anthropic.com>
Test ZIP structure, event/occurrence headers and row counts, meta.xml
core/extension structure, referential integrity between events and
occurrences, taxonomy hierarchy extraction from parents_json, specific
epithet parsing, and EML metadata validity.

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Claude <noreply@anthropic.com>
@netlify

netlify Bot commented Feb 11, 2026

Deploy Preview for antenna-preview canceled.

| Name | Link |
|---|---|
| 🔨 Latest commit | 77939ba |
| 🔍 Latest deploy log | https://app.netlify.com/projects/antenna-preview/deploys/69e9905fa26d2e00088243cd |

@netlify

netlify Bot commented Feb 11, 2026

Deploy Preview for antenna-ssec canceled.

| Name | Link |
|---|---|
| 🔨 Latest commit | 77939ba |
| 🔍 Latest deploy log | https://app.netlify.com/projects/antenna-ssec/deploys/69e990623969f0000806f817 |

@coderabbitai

coderabbitai Bot commented Feb 11, 2026

Warning

Rate limit exceeded

@mihow has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 18 minutes and 38 seconds before requesting another review.

Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 18 minutes and 38 seconds.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: c8bdb0db-732c-4f10-86c8-7440529b5574

📥 Commits

Reviewing files that changed from the base of the PR and between 45dec2e and 77939ba.

📒 Files selected for processing (15)
  • ami/exports/base.py
  • ami/exports/dwca/eml.py
  • ami/exports/dwca/fields.py
  • ami/exports/dwca/helpers.py
  • ami/exports/dwca/rows.py
  • ami/exports/dwca_validator.py
  • ami/exports/format_types.py
  • ami/exports/models.py
  • ami/exports/tests.py
  • ami/main/migrations/0084_project_license.py
  • ami/main/models.py
  • ami/ml/schemas.py
  • docs/claude/planning/2026-04-21-dwca-implementation-plan.md
  • docs/claude/planning/dwca-followup-tickets.md
  • ui/src/data-services/models/export.ts
📝 Walkthrough

The PR implements a comprehensive Darwin Core Archive (DwC-A) export system for AMI datasets, including a new DwCAExporter class, field mapping modules for Events/Occurrences/extensions, XML/TSV generation utilities, ZIP packaging, structural validation, extensive test coverage, database migrations for licensing metadata, and UI registration for the new export format.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Core Exporter & Registry (`ami/exports/format_types.py`, `ami/exports/registry.py`) | Adds DwCAExporter class with occurrence-scoped queryset, event derivation, TSV/XML/ZIP generation, validation, and stat updates. Registers "dwca" format in export registry. |
| DwC-A Package Core (`ami/exports/dwca/__init__.py`, `ami/exports/dwca/fields.py`, `ami/exports/dwca/helpers.py`) | Establishes dwca package with field catalogues (EVENT_FIELDS, OCCURRENCE_FIELDS, MOF_FIELDS, MULTIMEDIA_FIELDS) using DwCAField dataclass, formatting utilities for dates/coordinates/durations, taxonomy rank extraction, and verification status logic. |
| DwC-A XML & Packaging (`ami/exports/dwca/eml.py`, `ami/exports/dwca/meta.py`, `ami/exports/dwca/zip.py`) | Generates EML 2.2.0 dataset metadata (with coverage/methods), meta.xml archive descriptor, and creates ZIP archives from component files and XML strings. |
| DwC-A Row Generation (`ami/exports/dwca/rows.py`, `ami/exports/dwca/targetscope.py`, `ami/exports/dwca/tsv.py`) | Streams multimedia and measurementorfact rows with URN IDs, derives taxonomic scope via LCA computation, writes TSV files with headers from field catalogues and progress callbacks. |
| Validation Framework (`ami/exports/dwca_validator.py`, `ami/exports/management/commands/validate_dwca_archive.py`, `ami/exports/tests_dwca_validator.py`) | Provides validate_dwca_zip utility that verifies meta.xml structure, TSV consistency, ID uniqueness/referential integrity, and required-term enforcement; includes Django management command and comprehensive test suite. |
| Export & Model Tests (`ami/exports/tests.py`) | Implements DwCAExportTest (fixture generation, ZIP structure/content validation, taxonomy/multimedia/measurementorfact assertions), MultimediaExtensionTest, and TargetTaxonomicScopeTest covering registration, headers, row counts, referential integrity, and filtered exports. |
| Database & ORM (`ami/main/models.py`, `ami/main/migrations/0082_project_license.py`) | Adds license and rights_holder fields to Project model; adds get_identified_by() and get_identified_date() helper methods to Occurrence for provenance metadata. |
| Documentation (`docs/claude/dwca-format-reference.md`, `docs/claude/export-framework.md`, `docs/claude/dwca-pr-review-and-mapping-spec.md`, `docs/claude/planning/*`) | Specifies DwC-A output structure, ZIP layout, term mappings, EML contents, export framework architecture, design/implementation roadmap, review checklist, and pre-merge considerations. |
| UI Integration (`ui/src/data-services/models/export.ts`) | Registers dwca as a supported export format and provides a human-readable label for the new format type. |

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant API
    participant Celery
    participant DwCAExporter
    participant FieldExtractors
    participant XMLGenerators
    participant ZipCreator
    participant Validator
    participant Storage

    Client->>API: POST /api/v2/exports/ (format=dwca)
    API->>Celery: Enqueue DataExport.run_export()
    Celery->>DwCAExporter: Initialize with filtered queryset
    DwCAExporter->>DwCAExporter: get_queryset() → occurrences
    DwCAExporter->>DwCAExporter: get_events_queryset() → events
    DwCAExporter->>FieldExtractors: write_tsv(event.txt)
    FieldExtractors->>FieldExtractors: extract EVENT_FIELDS per row
    DwCAExporter->>FieldExtractors: write_tsv(occurrence.txt)
    FieldExtractors->>FieldExtractors: extract OCCURRENCE_FIELDS per row
    DwCAExporter->>FieldExtractors: iter_multimedia_rows()
    FieldExtractors->>FieldExtractors: build capture/crop rows
    DwCAExporter->>FieldExtractors: iter_mof_rows()
    FieldExtractors->>FieldExtractors: build measurement rows
    DwCAExporter->>XMLGenerators: generate_eml_xml()
    XMLGenerators->>XMLGenerators: derive coverage, methods, rights
    DwCAExporter->>XMLGenerators: generate_meta_xml()
    XMLGenerators->>XMLGenerators: build core/extension descriptors
    DwCAExporter->>ZipCreator: create_dwca_zip()
    ZipCreator->>ZipCreator: package .txt + .xml files
    DwCAExporter->>Validator: validate_dwca_zip()
    Validator->>Validator: verify meta.xml, IDs, referential integrity
    alt Validation passes
        DwCAExporter->>Storage: upload ZIP
        Storage-->>DwCAExporter: file_url
        DwCAExporter-->>API: export complete
        API-->>Client: export_id + file_url
    else Validation fails
        Validator-->>DwCAExporter: ValidationResult with errors
        DwCAExporter-->>API: raise ValueError
        API-->>Client: error response
    end
```

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

This PR introduces a complete, multi-layered DwC-A export system spanning 15+ new modules, 2000+ lines of implementation code, comprehensive validation logic, database migrations, extensive test coverage (559 lines), and detailed documentation. The changes touch core export infrastructure, field extraction pipelines, XML/TSV/ZIP generation, validation framework, ORM layer, and UI integration. The heterogeneity of concerns (data transformation, XML generation, validation, testing, documentation) and density of domain-specific logic (Darwin Core mappings, Humboldt extensions, referential integrity checks) demand careful, multi-faceted review across infrastructure, correctness, test coverage, and integration points.

Possibly related issues

Suggested labels

backend, data-processing

Suggested reviewers

  • annavik

Poem

🐰 A DwC-A adventure so grand,
With Events and Occurrences, hand in hand,
Through TSV rows and XML's embrace,
Into a ZIP at a zippy pace!
Validation checks every precious ID,
Nature's data, scientifically tidy! 📦✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 47.12%, which is below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title 'feat: add Darwin Core Archive (DwC-A) export format' clearly and concisely describes the main change — addition of a new export format for GBIF-compatible Darwin Core Archives. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Description check | ✅ Passed | The PR description is comprehensive and follows the template structure, including a clear summary, detailed list of changes, related deployment notes, and comprehensive testing instructions. |


@mihow mihow marked this pull request as ready for review February 11, 2026 20:34
Copilot AI review requested due to automatic review settings February 11, 2026 20:34

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 12

🤖 Fix all issues with AI agents
In `@ami/exports/dwca.py`:
- Around line 132-142: The _format_duration function must guard against negative
durations: check event.start and event.end and compute delta = event.end -
event.start, then if delta.total_seconds() <= 0 return an empty string (or
optionally log a warning) instead of proceeding; update the logic around
total_seconds / divmod so negative values are not formatted (use the check on
delta.total_seconds() before computing hours/minutes and then format only for
positive durations).
- Around line 62-63: The taxonRank mapping lambda can raise AttributeError when
o.determination.rank is None; update the DWC + "taxonRank" tuple's lambda (the
function referencing o.determination.rank.lower()) to guard by checking both
o.determination and o.determination.rank before calling .lower(), returning an
empty string (or appropriate default) when rank is falsy.
- Line 87: The mapping that emits DWC + "individualCount" currently uses
getattr(o, "detections_count", 0) which is semantically wrong; change the tuple
entry for DWC + "individualCount" so it always emits the string "1" (i.e.,
replace the lambda that references detections_count with one that returns "1")
or remove the entire tuple if you prefer omission; update the tuple containing
DWC + "individualCount" and the lambda (the anonymous function referencing
getattr(o, "detections_count", 0)) accordingly.
- Around line 75-79: The ternary expression in the tuple entry with key
"vernacularName" is ambiguous; update the lambda (lambda o, slug:
o.determination.common_name_en or "" if o.determination else "") to add
parentheses around the ternary's true branch so it reads like
(o.determination.common_name_en or "") if o.determination else "" to make the
intent explicit and easier to read.
- Around line 309-312: The EML export is inserting the owner's email into the
<surName> element (project.owner.email) which leaks PII and is semantically
wrong; update the block that creates creator -> individualName -> surName to
avoid using project.owner.email: prefer using a non-PII display name (e.g.,
project.owner.get_display_name() or project.owner.name) if available, and if no
safe display name exists omit creating the individualName/surName elements
entirely (i.e., only create creator when a valid non-email name/contact is
present). Ensure references to project.owner, creator, individualName, and
surName are updated accordingly.
- Around line 288-290: Change the naive timestamp to a timezone-aware one:
replace the use of datetime.datetime.now() when assigning now (used alongside
project_slug/slugify) with Django's timezone-aware function (timezone.now()),
and add the appropriate import from django.utils import timezone at the top of
the module so now is timezone-aware when used in exports/dwca.py.

In `@ami/exports/format_types.py`:
- Around line 196-208: The Occurrence queryset in get_queryset currently
includes occurrences lacking related event or determination, producing empty
coreid/scientificName; update get_queryset (the Occurrence.objects.valid() chain
in the get_queryset method) to filter out occurrences with null event and null
determination (e.g., .filter(event__isnull=False, determination__isnull=False))
before the select_related/.with_identifications() calls so only occurrences with
both event and determination are exported.
- Around line 237-264: The temp files created for the DwC-A export
(event_file.name, occ_file.name and the produced zip_path) are never cleaned up;
wrap the export sequence in a try/finally around the code that calls
get_events_queryset, write_tsv, generate_meta_xml, generate_eml_xml and
create_dwca_zip so that in the finally block you remove event_file.name,
occ_file.name and (on error or success) delete any intermediate temp zip if
appropriate; ensure update_export_stats and the return of zip_path still occur
on success but that file removal happens only for the intermediate files (or
conditionally for the zip if you decide not to persist it).

In `@ami/exports/tests.py`:
- Around line 477-502: The test test_taxonomy_hierarchy_extraction can pass
vacuously when taxon.parents_json is empty; update the test to guarantee
parents_json is populated before calling _get_rank_from_parents: after obtaining
taxon (from occurrence.determination) and calling
taxon.save(update_calculated_fields=True) and taxon.refresh_from_db(), add an
assertion that taxon.parents_json is truthy (or else set up explicit test data/a
fixture so parents_json contains a parent chain) so the loop over ranks actually
executes and _get_rank_from_parents is exercised.

In `@docs/claude/dwca-format-reference.md`:
- Around line 94-113: The documentation shows EML 2.2.0 namespace
("https://eml.ecoinformatics.org/eml-2.2.0") but the implementation currently
emits EML 2.1.1 (literal "eml://ecoinformatics.org/eml-2.1.1"); either make the
docs reflect 2.1.1 or update the implementation to 2.2.0 so they match. To fix:
search the codebase for the literal "eml://ecoinformatics.org/eml-2.1.1" (and
any code that constructs the eml:eml root or schemaLocation) and change the
namespace and schemaLocation to the 2.2.0 values used in the doc, or
alternatively update the doc snippet to the 2.1.1 namespace and schemaLocation;
ensure the change is applied to the EML root construction logic (the
function/method that builds the eml:eml element) and the documentation fragment
so both use the same namespace string.

In `@docs/claude/export-framework.md`:
- Line 13: Update the documentation table entry that currently lists
JSONExporter and CSVExporter to also include the new DwCAExporter: edit the row
showing `ami/exports/format_types.py` so it mentions Concrete exporters
`JSONExporter`, `CSVExporter`, and `DwCAExporter` (use the exact class name
DwCAExporter to match the implementation).
- Around line 68-71: Update the example output of
ExportRegistry.get_supported_formats() to include the newly registered "dwca"
format so the snippet accurately reflects current registrations; adjust the
example list returned by ExportRegistry.get_supported_formats() (used alongside
ExportRegistry.register and ExportRegistry.get_exporter in the snippet) to
include "dwca" among the formats (e.g., ["occurrences_api_json",
"occurrences_simple_csv", "dwca"]).
🧹 Nitpick comments (3)
docs/claude/dwca-format-reference.md (1)

9-16: Add a language identifier to the fenced code block.

The linter flags this code block for missing a language specifier (MD040). Since it's a file tree diagram, use ```text.

ami/exports/tests.py (1)

329-341: Each test re-runs the full export pipeline independently.

Every test method calls _run_export(), which creates a DataExport, runs the entire export pipeline, and writes to storage. For 7 tests that call this helper, that's significant overhead. Consider using setUpClass or caching the export result in setUp to run the export once and share the ZIP across structural validation tests.

This is fine for correctness — just a test performance observation.

ami/exports/dwca.py (1)

286-286: Unused events_queryset parameter.

events_queryset is accepted but never used in generate_eml_xml. Either remove it or use it to populate temporal/geographic coverage metadata (the latter is listed as a follow-up). For now, removing the parameter or marking it for future use avoids confusion.

Proposed fix — remove unused parameter
```diff
-def generate_eml_xml(project, events_queryset=None) -> str:
+def generate_eml_xml(project) -> str:
```

This also requires updating the call site in ami/exports/format_types.py (line 258):

```diff
-        eml_xml = generate_eml_xml(self.project, events_qs)
+        eml_xml = generate_eml_xml(self.project)
```

Comment on lines +75 to +79
```python
    (
        DWC + "vernacularName",
        "vernacularName",
        lambda o, slug: o.determination.common_name_en or "" if o.determination else "",
    ),
```

🛠️ Refactor suggestion | 🟠 Major

vernacularName lambda: add parentheses for clarity (known issue #4).

While Python's operator precedence makes this technically correct (or binds tighter than the ternary), the expression is confusing to read and a common source of bugs. Adding explicit parentheses removes ambiguity.

Proposed fix

```diff
     (
         DWC + "vernacularName",
         "vernacularName",
-        lambda o, slug: o.determination.common_name_en or "" if o.determination else "",
+        lambda o, slug: (o.determination.common_name_en or "") if o.determination else "",
     ),
```
🧰 Tools
🪛 Ruff (0.15.0)

[warning] 78-78: Unused lambda argument: slug

(ARG005)

🤖 Prompt for AI Agents
In `@ami/exports/dwca.py` around lines 75 - 79, The ternary expression in the
tuple entry with key "vernacularName" is ambiguous; update the lambda (lambda o,
slug: o.determination.common_name_en or "" if o.determination else "") to add
parentheses around the ternary's true branch so it reads like
(o.determination.common_name_en or "") if o.determination else "" to make the
intent explicit and easier to read.

Comment on lines +132 to +142
```python
def _format_duration(event) -> str:
    """Format event duration as human-readable string."""
    if not event.start or not event.end:
        return ""
    delta = event.end - event.start
    total_seconds = int(delta.total_seconds())
    hours, remainder = divmod(total_seconds, 3600)
    minutes, _ = divmod(remainder, 60)
    if hours > 0:
        return f"{hours}h {minutes}m"
    return f"{minutes}m"
```

⚠️ Potential issue | 🟡 Minor

Negative durations are not guarded against.

If event.end < event.start (data inconsistency), total_seconds will be negative, and divmod with a negative dividend produces negative results, yielding strings like "-1h -30m". Consider returning an empty string or logging a warning for negative durations.

Proposed fix

```diff
 def _format_duration(event) -> str:
     """Format event duration as human-readable string."""
     if not event.start or not event.end:
         return ""
     delta = event.end - event.start
     total_seconds = int(delta.total_seconds())
+    if total_seconds < 0:
+        return ""
     hours, remainder = divmod(total_seconds, 3600)
     minutes, _ = divmod(remainder, 60)
```
🤖 Prompt for AI Agents
In `@ami/exports/dwca.py` around lines 132 - 142, The _format_duration function
must guard against negative durations: check event.start and event.end and
compute delta = event.end - event.start, then if delta.total_seconds() <= 0
return an empty string (or optionally log a warning) instead of proceeding;
update the logic around total_seconds / divmod so negative values are not
formatted (use the check on delta.total_seconds() before computing hours/minutes
and then format only for positive durations).

Comment on lines +288 to +290

```python
    project_slug = slugify(project.name)
    now = datetime.datetime.now().strftime("%Y-%m-%dT%H:%M:%S")
```

⚠️ Potential issue | 🟡 Minor

Use timezone-aware datetime.

datetime.datetime.now() produces a naive datetime. In a Django project with USE_TZ=True, this is inconsistent and could produce incorrect timestamps. Use Django's timezone.now().

Proposed fix

```diff
+    from django.utils import timezone
+
     project_slug = slugify(project.name)
-    now = datetime.datetime.now().strftime("%Y-%m-%dT%H:%M:%S")
+    now = timezone.now().strftime("%Y-%m-%dT%H:%M:%S")
```
🤖 Prompt for AI Agents
In `@ami/exports/dwca.py` around lines 288 - 290, Change the naive timestamp to a
timezone-aware one: replace the use of datetime.datetime.now() when assigning
now (used alongside project_slug/slugify) with Django's timezone-aware function
(timezone.now()), and add the appropriate import from django.utils import
timezone at the top of the module so now is timezone-aware when used in
exports/dwca.py.

Comment on lines +94 to +113
```xml
<?xml version="1.0" encoding="UTF-8"?>
<eml:eml xmlns:eml="https://eml.ecoinformatics.org/eml-2.2.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="https://eml.ecoinformatics.org/eml-2.2.0 https://eml.ecoinformatics.org/eml-2.2.0/eml.xsd"
packageId="urn:ami:dataset:{project_id}" system="AMI">
<dataset>
<title>{project.name}</title>
<creator>
<organizationName>{project.owner or institution}</organizationName>
</creator>
<abstract>
<para>{project.description}</para>
</abstract>
<intellectualRights>
<para>License information here</para>
</intellectualRights>
</dataset>
</eml:eml>
```

⚠️ Potential issue | 🟡 Minor

EML version mismatch between documentation and implementation.

This reference doc shows EML 2.2.0 namespace (https://eml.ecoinformatics.org/eml-2.2.0), but the actual implementation in ami/exports/dwca.py (line 293) uses EML 2.1.1 (eml://ecoinformatics.org/eml-2.1.1). The planning doc acknowledges upgrading to 2.2.0 as a follow-up, but this reference should either match the current implementation or clearly note the discrepancy.

🤖 Prompt for AI Agents
In `@docs/claude/dwca-format-reference.md` around lines 94 - 113, The
documentation shows EML 2.2.0 namespace
("https://eml.ecoinformatics.org/eml-2.2.0") but the implementation currently
emits EML 2.1.1 (literal "eml://ecoinformatics.org/eml-2.1.1"); either make the
docs reflect 2.1.1 or update the implementation to 2.2.0 so they match. To fix:
search the codebase for the literal "eml://ecoinformatics.org/eml-2.1.1" (and
any code that constructs the eml:eml root or schemaLocation) and change the
namespace and schemaLocation to the 2.2.0 values used in the doc, or
alternatively update the doc snippet to the 2.1.1 namespace and schemaLocation;
ensure the change is applied to the EML root construction logic (the
function/method that builds the eml:eml element) and the documentation fragment
so both use the same namespace string.


Copilot AI left a comment


Pull request overview

Adds a new export format (dwca) to generate a GBIF-compatible Darwin Core Archive (DwC-A) ZIP using an Event core (event.txt) with an Occurrence extension (occurrence.txt), along with generated meta.xml and eml.xml.

Changes:

  • Introduces ami/exports/dwca.py with DwC field mappings, TSV writing, and meta/eml XML generation.
  • Adds DwCAExporter to the existing export framework and registers it under "dwca".
  • Adds a new DwC-A-focused test suite validating ZIP contents and basic structure.

Reviewed changes

Copilot reviewed 7 out of 8 changed files in this pull request and generated 14 comments.

Show a summary per file
File Description
ami/exports/dwca.py New DwC-A field definitions and archive/metadata generators.
ami/exports/format_types.py Adds DwCAExporter implementation to orchestrate TSV generation + ZIP packaging.
ami/exports/registry.py Registers the new "dwca" export format.
ami/exports/tests.py Adds end-to-end tests covering ZIP structure, TSV headers/rows, and XML parsing.
docs/claude/export-framework.md New internal technical reference for the export framework.
docs/claude/dwca-format-reference.md New DwC-A format reference documentation.
.agents/planning/dwca-export-plan.md Planning/design notes and known issues list for DwC-A export.


|------|---------|
| `ami/exports/base.py` | `BaseExporter` ABC - all exporters inherit from this |
| `ami/exports/registry.py` | `ExportRegistry` - maps format strings to exporter classes |
| `ami/exports/format_types.py` | Concrete exporters: `JSONExporter`, `CSVExporter` |

Copilot AI Feb 11, 2026


The “Concrete exporters” bullet lists only JSONExporter and CSVExporter, but this PR adds DwCAExporter in ami/exports/format_types.py. Update this reference so the technical doc matches the supported exporters.

Suggested change
| `ami/exports/format_types.py` | Concrete exporters: `JSONExporter`, `CSVExporter` |
| `ami/exports/format_types.py` | Concrete exporters: `JSONExporter`, `CSVExporter`, `DwCAExporter` |

Comment on lines +196 to +208
```python
    def get_queryset(self):
        """Return the occurrence queryset (used by BaseExporter for record count)."""
        return (
            Occurrence.objects.valid()  # type: ignore[union-attr]
            .filter(project=self.project)
            .select_related(
                "determination",
                "event",
                "deployment",
            )
            .with_detections_count()
            .with_identifications()
        )
```

Copilot AI Feb 11, 2026


DwC-A occurrence exports can currently include rows with event_id or determination missing, which produces empty coreid/eventID and scientificName values in occurrence.txt (GBIF validator rejects these). Filter the queryset to exclude occurrences with null event and null determination (and adjust the expected counts in tests accordingly).

```python
            progress_callback=lambda n: self.update_job_progress(n),
        )
        logger.info(f"DwC-A: wrote {occ_count} occurrences")
```

Copilot AI Feb 11, 2026


Job progress for DwC-A exports is only updated every 500 occurrences; for smaller exports progress may remain at 0% even though the job completes. Consider calling update_job_progress(occ_count) once at the end (guarding against total_records == 0 to avoid division by zero in BaseExporter.update_job_progress()).

Suggested change

```python
# Ensure final progress update, even if the callback was never invoked.
# Guard against total_records == 0 to avoid division by zero in BaseExporter.update_job_progress().
if getattr(self, "total_records", 0):
    self.update_job_progress(occ_count)
```

Comment on lines +240 to +250
```python
# Column 0 is the id (eventID)
id_elem = ET.SubElement(core, "id")
id_elem.set("index", "0")

for i, (term_uri, header, _) in enumerate(event_fields):
    if i == 0:
        continue  # Already declared as <id>
    field = ET.SubElement(core, "field")
    field.set("index", str(i))
    field.set("term", term_uri)
```

Copilot AI Feb 11, 2026


In meta.xml generation, column 0 is declared as <id index="0"/> but the corresponding <field index="0" term="...eventID"/> mapping is skipped. Many DwC-A consumers/validators expect the ID column to also be mapped to the Darwin Core term (as in the DwC Text Guide examples). Include a <field> mapping for index 0 instead of skipping it.

mihow and others added 2 commits February 11, 2026 13:33
- Guard taxonRank against None rank (AttributeError on .lower())
- Guard _get_rank_from_parents against None det_rank
- Fix vernacularName ternary precedence with explicit parentheses
- Change individualCount to emit "1" (each occurrence = 1 individual,
  not detections_count which counts bounding boxes)
- Guard _format_duration against negative durations

Co-Authored-By: Claude <noreply@anthropic.com>
- Filter out occurrences with null event or determination from DwC-A
  export queryset (GBIF rejects empty coreid/scientificName)
- Replace project.owner.email with project.owner.name in EML creator
  element to avoid leaking PII in downloadable archives
- Only emit individualName when owner has a name set

Co-Authored-By: Claude <noreply@anthropic.com>
@mihow mihow marked this pull request as draft February 11, 2026 21:51
mihow and others added 8 commits February 11, 2026 15:26
- Wrap DwC-A export in try/finally to clean up intermediate temp files
- Use timezone.now() instead of naive datetime.datetime.now()
- Use full EML schemaLocation URL for GBIF validation compatibility
- Remove unused events_queryset parameter from generate_eml_xml
- Simplify progress_callback lambda to direct method reference

Co-Authored-By: Claude <noreply@anthropic.com>
- Update occurrence row count test to match null-filtered queryset
- Assert parents_json is populated in taxonomy hierarchy test
- Use settings.MEDIA_URL instead of hardcoded "/media/" in test helper
- Add DwCAExporter to export-framework.md file table and registry example
- Fix EML version in dwca-format-reference.md to match implementation (2.1.1)

Co-Authored-By: Claude <noreply@anthropic.com>
…pdate

- Map all columns (including index 0) to DwC term URIs in meta.xml
  so GBIF validators can resolve both <id>/<coreid> and <field> entries
- Change fieldsEnclosedBy from empty string to double-quote character
  for stricter parser compatibility
- Add final progress update after TSV writing so small exports (<500
  records) report completion instead of staying at 0%

Co-Authored-By: Claude <noreply@anthropic.com>
Run the export pipeline once in setUpClass and share the ZIP across all
structural validation tests instead of re-running per test method. This
reduces test time from ~7 export runs to 1.

Co-Authored-By: Claude <noreply@anthropic.com>
…occurrences in DwC-A

Remove get_filter_backends() override that returned [], allowing DwCAExporter
to inherit BaseExporter's OccurrenceCollectionFilter. Update get_events_queryset()
to derive events from self.queryset instead of fetching all project events,
preventing orphaned events when collection_id filtering is active.

Co-Authored-By: Claude <noreply@anthropic.com>
…ssertion

Add test_dwca_export_with_collection_filter that verifies filtered exports
produce correct occurrence/event counts and referential integrity. Update
test_event_headers_and_row_count to expect events derived from occurrences
rather than all project events.

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Claude <noreply@anthropic.com>
Add API methods, request/response format, filter system, DwC-A specifics,
job integration, and file lifecycle details. Remove separate export-system.md.

Co-Authored-By: Claude <noreply@anthropic.com>
@mihow mihow added the next up! label Mar 26, 2026
@mihow

mihow commented Apr 1, 2026

@copilot add a more accessible overview of all the field mappings and decisions from Antenna -> Darwin Core & GBIF in the PR description so this can be more easily reviewed & tested. Add how to test instructions.


Copilot AI commented Apr 1, 2026

@copilot add a more accessible overview of all the field mappings and decisions from Antenna -> Darwin Core & GBIF in the PR description so this can be more easily reviewed & tested. Add how to test i...

Updated the PR description with:

  • Field mapping tables for both event.txt (16 columns) and occurrence.txt (18 columns) showing the Antenna model source, DwC term, and any relevant notes for each field
  • Design decisions table — explains why Event Core was chosen, why individualCount=1, how taxonomy is extracted, why events are derived from the filtered occurrence queryset, etc.
  • Step-by-step testing guide covering: API export creation, polling for completion, ZIP inspection, GBIF Data Validator upload, shell-based referential integrity check, collection filter test, and the automated test commands with expected test names

Copilot finished work on behalf of mihow April 1, 2026 06:10
…ect license

- DwCAExporter queryset now invokes apply_default_filters(project), gating
  low-confidence ML classifications and respecting project include/exclude
  taxa lists. Without this, unreviewed detections would publish to GBIF.

- Occurrence.get_identified_by() and get_identified_date() centralize the
  museum-collection-style identifier-of-record logic: the most recent human
  identification supersedes any ML prediction; otherwise the highest-scoring
  pipeline algorithm (with version) is the identifier. Mirrors the behavior
  already baked into update_occurrence_determination so downstream consumers
  see a consistent story.

- DwC-A occurrence.txt now populates dwc:identifiedBy and dwc:dateIdentified.
  These were previously left blank, which made the MachineObservation claim
  un-auditable.

- Project gains license (SPDX or URL) and rights_holder fields. DwC-A event.txt
  emits dc:license and dc:rightsHolder columns, and eml.xml's
  <intellectualRights> reflects the project's license rather than a hardcoded
  CC-BY. GBIF requires a machine-readable license on every dataset.

Co-Authored-By: Claude <noreply@anthropic.com>
mihow and others added 12 commits April 21, 2026 13:40
Upcoming additions (Humboldt eco: terms, multimedia and MoF
extensions, EML 2.2.0) roughly double the code. Split by
responsibility now so each module has a single clear purpose.

Public API unchanged — re-exported from ami.exports.dwca.__init__.

Co-Authored-By: Claude <noreply@anthropic.com>
Pure helper that walks parents_json chains and returns the deepest
common ancestor name. Empty include-list -> empty string. v1
strategy; v2 will source from a per-Site TaxaList.

Co-Authored-By: Claude <noreply@anthropic.com>
Flatten 11 Humboldt Extension terms onto Event Core rows:
sampling-effort structure (reported/value/unit/protocol), absence
reportability, targetTaxonomicScope (LCA-derived), protocol
identifiers, and material-sample declaration.

Carries the scientific contribution of automated monitoring
(provable absence during known sampling windows) into the GBIF
pipeline. GBIF accepts eco: terms on Event rows as the
pragmatic alternative to a separate humboldt.txt extension.

Co-Authored-By: Claude <noreply@anthropic.com>
Pipe-separated distinct source-capture URLs per occurrence, ordered
by detection timestamp. Redundant with multimedia.txt but useful
for tabular consumers.

Co-Authored-By: Claude <noreply@anthropic.com>
…ator

Single multimedia.txt carries both capture-image rows (occurrenceID
blank) and detection-crop rows (occurrenceID linking back to
occurrence.txt). dc:references on crop rows points back to the
source capture URL.

Row generator yields plain dicts so the existing write_tsv +
DwCAField pattern handles both query-backed tables and computed
row streams uniformly.

Co-Authored-By: Claude <noreply@anthropic.com>
Generalize meta.xml descriptor and zip packager to accept arbitrary
extension lists, then add multimedia.txt as the third table.
Row type http://rs.gbif.org/terms/1.0/Multimedia (GBIF simple
Multimedia extension). Capture-image rows carry blank
occurrenceID; crop rows carry the occurrenceID URN so consumers
can link evidence back to determinations.

Narrow the offline-validator test's required-terms set to exclude
occurrenceID — that URI is required inside occurrence.txt but
legitimately blank on multimedia capture rows, and the current
validator applies a flat required set across all tables. Task 9
adds a cross-reference check that covers the stronger invariant.

Co-Authored-By: Claude <noreply@anthropic.com>
Captures ML provenance as structured numeric facts:
classificationScore per occurrence, detectionScore + boundingBox
per detection. Row type http://rs.gbif.org/terms/1.0/MeasurementOrFact,
coreid=eventID, occurrenceID column linking back to the occurrence.

Per-event MoF rows (lux, temperature, moon phase) are not emitted
in this PR; the column layout reserves space for them.

Co-Authored-By: Claude <noreply@anthropic.com>
Bump namespace and schemaLocation to eml-2.2.0. Add computed
geographicCoverage (bbox from event deployment coordinates),
temporalCoverage (min/max event start), and a methods section
documenting the automated capture + ML pipeline workflow and
the quality-control filters applied at export time.

Co-Authored-By: Claude <noreply@anthropic.com>
Any extension that declares a dwc:occurrenceID column must only
contain values that exist in occurrence.txt. Multimedia crop rows
and MoF rows both carry occurrenceID as a back-link; this check
catches drift where the pipeline emits rows pointing at filtered-
out or non-existent occurrences.

Co-Authored-By: Claude <noreply@anthropic.com>
Run the offline structural validator against the zip the exporter
just produced. Fatal errors raise, which is caught by the export
framework and flips DataExport.status to FAILED. Warnings log.

Prevents users downloading broken archives where meta.xml, TSV
columns, or cross-references have silently drifted.

Co-Authored-By: Claude <noreply@anthropic.com>
Label the format 'Darwin Core Archive (DwC-A) — April 2026 Draft'
so scientists testing the export know what they're looking at
before GBIF-registration and scheme stabilization.

Co-Authored-By: Claude <noreply@anthropic.com>
Document the four-file archive shape (event + occurrence + multimedia +
measurementorfact), the flattened Humboldt eco: columns on event.txt
with their derivation, and the EML 2.2.0 namespace bump.

Co-Authored-By: Claude <noreply@anthropic.com>
@mihow mihow marked this pull request as ready for review April 22, 2026 22:24

@coderabbitai (bot) left a comment


Actionable comments posted: 11

🧹 Nitpick comments (7)
ami/exports/dwca/targetscope.py (1)

36-40: Pass strict=False to zip() to silence B905 and document intent.

Chains can legitimately differ in length (e.g., include-taxa at mixed ranks such as a genus and one of its species). The current zip(*chains) silently truncates to the shortest chain, which happens to be the correct behavior for LCA-by-prefix — but making that explicit avoids the Ruff B905 warning and makes the intent obvious to readers.

♻️ Proposed change
-    for position in zip(*chains):
+    for position in zip(*chains, strict=False):
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ami/exports/dwca/targetscope.py` around lines 36 - 40, Update the loop that
computes the LCA prefix so zip is called with explicit strict=False to document
intent and silence Ruff B905: change the iterator in for position in
zip(*chains) to use zip(*chains, strict=False) and add a brief inline comment
noting that differing chain lengths are expected and truncation is intentional
(reference: the variable chains, the loop for position in zip(...), and lca_name
assignment).
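A toy illustration of the LCA-by-prefix walk being discussed (taxon names made up; truncation to the shortest chain is the point):

    chains = [
        ["Animalia", "Arthropoda", "Insecta", "Lepidoptera", "Erebidae"],
        ["Animalia", "Arthropoda", "Insecta", "Lepidoptera"],
    ]
    lca = None
    for position in zip(*chains, strict=False):  # truncation is intentional
        if len(set(position)) == 1:
            lca = position[0]  # deepest rank where all chains still agree
        else:
            break
    # lca == "Lepidoptera"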
ami/exports/dwca/helpers.py (1)

56-74: Minor: inconsistent handling of rank types between parents and determination itself.

For parents, lines 65-67 accept either an enum-like (.name) or a string and normalize to string before .upper(). For the determination itself (line 72), the code calls det_rank.upper() directly, which would raise AttributeError if determination.rank is ever the same enum-like value. Today the codebase treats rank as a string (see fields.py line 202 using rank.lower()), so this is consistent with current usage, but if you ever want this helper to be symmetric, consider applying the same hasattr(..., "name") normalization to det_rank.

Non-blocking — flagging only because the asymmetry is easy to trip over in future refactors.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ami/exports/dwca/helpers.py` around lines 56 - 74, The handling of
determination.rank in _get_rank_from_parents is asymmetric with the parents loop
and can raise AttributeError if rank is an enum-like object; normalize det_rank
the same way you do for parents: read det_rank = occurrence.determination.rank,
convert it to a string via det_rank.name if hasattr(det_rank, "name") else
str(det_rank), then compare det_rank_str.upper() == rank and return
occurrence.determination.name (or getattr(occurrence.determination, "name", "")
if you want to be defensive); update the comparison that currently calls
det_rank.upper() to use this normalized det_rank_str.
ami/exports/dwca/fields.py (1)

248-280: MOF/Multimedia field extractors will KeyError on malformed dict rows.

MOF_FIELDS line 249 and 252, and MULTIMEDIA_FIELDS line 269 and 273, use r["eventID"] / r["measurementType"] / r["identifier"] with bracket access rather than .get(...). Today the row iterators in rows.py always set these keys, so this is safe — but it means any future row-producer that forgets a required key will fail mid-export with a bare KeyError rather than a validation-time error message.

Given these are marked required=True, consider either (a) treating missing required values as a validator-enforced invariant (fail early with context) or (b) defaulting to "" and letting the downstream validate_dwca_zip flag the empty required cell. Both surface the issue more clearly than an uncaught KeyError.

Non-blocking.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ami/exports/dwca/fields.py` around lines 248 - 280, The MOF_FIELDS and
MULTIMEDIA_FIELDS entries (DwCAField instances referencing eventID,
measurementType, and identifier) use bracket access (r["..."]) which raises
KeyError on malformed rows; change these extractors to either (a) use
r.get("eventID", "") / r.get("measurementType", "") / r.get("identifier", "") so
missing values become empty strings and are caught by validate_dwca_zip, or (b)
perform an explicit presence check inside the extractor and raise a clear
ValidationError/ValueError that includes the field name and row context so
validation fails with a descriptive message (update the lambdas or replace them
with small helper functions referenced from MOF_FIELDS and MULTIMEDIA_FIELDS and
keep required=True as-is).
ami/exports/dwca/eml.py (1)

80-106: _add_coverage iterates events multiple times — requires a sequence, not a generator.

Lines 81-84 walk events four times (lats, lons, starts, ends). Today the only caller passes events_list (a list), so this works — but the public signature of generate_eml_xml is events=None, with the docstring describing an "iterable of Event". If a caller ever passes a generator, the second comprehension will silently see an empty sequence and drop geo/temporal coverage without any warning.

Either tighten the type hint to Sequence[Event] | None, or do events = list(events) at the top of _add_coverage.

Also note: ends = [...] or starts uses starts directly; if you later compute on ends elsewhere, be aware the two lists may alias. Non-blocking.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ami/exports/dwca/eml.py` around lines 80 - 106, The _add_coverage function
iterates the events iterable multiple times which breaks if callers pass a
generator; convert the incoming events to a concrete sequence at the start of
_add_coverage (e.g., events = list(events) or similar) before computing lats,
lons, starts, ends, and ensure ends doesn't alias starts when you use the
fallback (make ends a new list/copy of starts rather than assigning starts
directly) so subsequent min/max/date calls operate on independent sequences.
ami/exports/dwca/__init__.py (1)

32-53: Nit: __all__ not sorted (Ruff RUF022).

Ruff flags this for isort-style ordering. Since the public surface already uses mixed public/underscore-prefixed names, sorting alphabetically (or grouping public then private) would silence the warning without changing behavior.

Proposed sort
 __all__ = [
     "DC",
     "DWC",
     "ECO",
-    "DwCAField",
     "EVENT_FIELDS",
     "MOF_FIELDS",
     "MULTIMEDIA_FIELDS",
     "OCCURRENCE_FIELDS",
-    "create_dwca_zip",
-    "generate_eml_xml",
-    "generate_meta_xml",
-    "get_specific_epithet",
-    "write_tsv",
     "_format_coord",
     "_format_datetime",
     "_format_duration",
     "_format_event_date",
     "_format_time",
     "_get_rank_from_parents",
     "_get_verification_status",
+    "DwCAField",
+    "create_dwca_zip",
+    "generate_eml_xml",
+    "generate_meta_xml",
+    "get_specific_epithet",
+    "write_tsv",
 ]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ami/exports/dwca/__init__.py` around lines 32 - 53, The __all__ list in this
module is not alphabetically ordered (Ruff RUF022); reorder the entries in the
__all__ list to a deterministic order (either simple alphabetical order or group
all public names first then underscore-prefixed private names) to satisfy Ruff.
Update the existing __all__ array (containing symbols like "DC", "DWC", "ECO",
"DwCAField", "EVENT_FIELDS", "MOF_FIELDS", "MULTIMEDIA_FIELDS",
"OCCURRENCE_FIELDS", "create_dwca_zip", "generate_eml_xml", "generate_meta_xml",
"get_specific_epithet", "write_tsv", and the private helpers "_format_coord",
"_format_datetime", "_format_duration", "_format_event_date", "_format_time",
"_get_rank_from_parents", "_get_verification_status") into the chosen
sorted/grouped order and commit the reordered list.
ami/exports/dwca/rows.py (1)

149-163: _classifier_name is O(detections × classifications) per occurrence and relies on prefetch.

Fine given the current prefetch chain (detections__classifications__algorithm), but it's worth a comment noting the assumption — if a caller ever invokes iter_mof_rows without that prefetch, this becomes a severe N+1 (one query per classification). Consider adding a short docstring note or a defensive hasattr(occurrence, "_prefetched_objects_cache") check that logs/warns.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ami/exports/dwca/rows.py` around lines 149 - 163, _classifier_name currently
iterates occurrence.detections.all() and det.classifications.all(), which is
O(detections×classifications) and assumes the caller prefetched
detections__classifications__algorithm (see iter_mof_rows); update
_classifier_name to document this assumption in its docstring and add a
defensive runtime check (e.g., inspect hasattr(occurrence,
"_prefetched_objects_cache") or check for prefetched key for "detections") that
emits a warning via the module logger if the related sets are not prefetched, so
callers are alerted to potential N+1 behavior instead of silently causing many
DB queries.
ami/exports/tests.py (1)

807-820: Keep detection-crop multimedia rows under test.

This test now passes with zero crop_rows, so a regression that stops emitting detection-crop rows would not fail despite the DwC-A contract requiring both capture and crop media rows. Please make the fixture deterministic enough to produce a crop URL, or patch the detection URL in this unit test, then assert at least one crop row.

🧪 Proposed assertion once the fixture is deterministic
         capture_rows = [r for r in rows if not r["occurrenceID"]]
         crop_rows = [r for r in rows if r["occurrenceID"]]
         self.assertGreater(len(capture_rows), 0, "Expected capture rows with blank occurrenceID")
-        # Detection-crop rows require det.url() to return non-empty; depends on fixture
-        # setup (the image_dimensions stub may not produce crop URLs for test fixtures).
-        # Check at least the capture-row invariants here.
+        self.assertGreater(len(crop_rows), 0, "Expected detection-crop rows with occurrenceID")
         for r in capture_rows:
             self.assertTrue(r["identifier"], "Capture row missing identifier")
             self.assertEqual(r["type"], "StillImage")
-        # Crop rows, if present, must have both identifier and references.
         for r in crop_rows:
             self.assertTrue(r["identifier"], "Crop row missing identifier")
             self.assertTrue(r["references"], "Crop row missing references (source capture URL)")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ami/exports/tests.py` around lines 807 - 820, The test currently allows zero
crop_rows which masks regressions; update the test that builds rows (references
capture_rows and crop_rows) to ensure at least one detection-crop row is
produced by making the fixture deterministic (or explicitly setting the
detection URL returned by det.url() for a sample detection) and then add an
assertion self.assertGreater(len(crop_rows), 0, "Expected at least one crop row
with references"); keep the existing per-row assertions for r["identifier"],
r["references"], and r["type"] to validate crop-row invariants.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@ami/exports/dwca_validator.py`:
- Around line 234-240: The _read_tsv function currently calls
zf.read(filename).decode("utf-8") which will raise UnicodeDecodeError for
non-UTF-8 TSVs and bypass ValidationResult; wrap the read/decode in a try/except
that catches UnicodeDecodeError (and optionally UnicodeError), call
result.add_error with a clear message including the filename (e.g., "filename is
not valid UTF-8"), and return None so the validator records the error instead of
crashing; update references in _read_tsv to use this guarded read/decode and
preserve existing behavior when decode succeeds.
- Around line 145-164: The parser currently accepts negative integer indexes
(e.g., id_index and idx from id_elems[0].get("index", "") and fld.get("index")),
which allows values like -1 to be interpreted as Python negative indices; update
the validation in the block that sets id_index and the block that parses idx_raw
(used to populate field_terms) to reject negative values: after converting to
int, check if the value is < 0 and if so call result.add_error with the same
meta.xml context (use the existing message patterns for id_tag and <field>) and
continue, preventing negative indexes from being used later when accessing
row[spec.id_index] or building field_terms. Ensure you reference id_index,
id_tag, idx_raw/idx and field_terms in the change so the checks are applied in
both places.
- Around line 18-22: Replace the stdlib XML import with the hardened parser and
handle its specific exceptions: change "from xml.etree import ElementTree as ET"
to using defusedxml's ElementTree (import defusedxml.ElementTree as ET) and
import DefusedXmlException from defusedxml.common; then update the
parsing/exception handling where ElementTree.fromstring() is used for meta.xml
and eml.xml (the code paths referenced by the meta/eml parsing logic) to catch
DefusedXmlException in addition to the existing parse errors so malicious
ZIP-provided XML (XXE/entity expansion) is rejected and logged consistently.

In `@ami/exports/dwca/rows.py`:
- Around line 22-69: iter_multimedia_rows currently materializes all occurrences
into occurrences_by_event and calls list(events_qs) again, which blows memory;
change it to stream occurrences_qs ordered by event_id (e.g.,
occurrences_qs.order_by("event_id").iterator(chunk_size=...)) and implement the
classic grouping-by-sort: track current_event_id, reset seen_captures when event
boundary changes, and emit capture rows as you encounter new SourceImage ids
(using det.source_image) so you never accumulate all occurrences in memory; also
avoid the unnecessary duplicate materialization by using the
already-materialized events_list when events_qs is a list (or only list()
events_qs if it is a queryset), rather than always calling list(events_qs).
- Around line 166-179: The loop in _project_license and _project_rights_holder
incorrectly contains an unconditional break so only the first event is ever
inspected; remove the break statements (or at least ensure they execute only
after a successful hit) so the for loop scans events until it finds and returns
the first non-empty project. Update _project_license and _project_rights_holder
to iterate through events, checking if e.project and getattr(e.project,
"license"/"rights_holder", "") and returning that value when found;
alternatively consider changing the functions to accept a project directly if
the single-project assumption should be made explicit.
- Around line 96-146: iter_mof_rows currently hardcodes measurementUnit="pixels"
for detections' boundingBox but Detection.bbox can be normalized (0–1 floats) or
pixel integers; update iter_mof_rows to inspect det.bbox and choose the unit
dynamically (e.g., if all coords are floats in [0.0,1.0] set
measurementUnit="normalized", otherwise "pixels") and include that decision in
measurementRemarks; also add/extend the BoundingBox class docstring or a
validator (e.g., BoundingBox.from_json or Detection.clean_bbox) to document
and/or enforce the canonical coordinate space so downstream code is unambiguous
(reference iter_mof_rows, Detection.bbox, and BoundingBox).

In `@ami/exports/format_types.py`:
- Around line 340-349: Validation failure currently raises in the block after
calling validate_dwca_zip without removing the created zip_path, leaking temp
zips; either delete the zip at that failure point (call os.remove(zip_path) or
Path(zip_path).unlink() just before raising in the branch where validation.ok is
False) or, if the zip is intentionally preserved for post-mortem, add a clear
comment in the validate_dwca_zip/containing function explaining that retention
is intentional and that cleanup is handled elsewhere; reference the
variables/functions zip_path and validate_dwca_zip and the error branch where
the ValueError is raised.
- Around line 259-295: self.queryset is being iterated three times causing
multiple full DB scans; fix by materializing a single prefetched occurrence list
and reusing it for all writes: call a one-time materialization (e.g.,
occurrences = list(self.queryset.select_related(...).prefetch_related(...)) or a
single generator wrapper) and pass that occurrences list to write_tsv for
occurrences, to iter_multimedia_rows (replace its internal
select_related/prefetch use) and to iter_mof_rows so write_tsv,
iter_multimedia_rows, and iter_mof_rows (symbols: write_tsv,
iter_multimedia_rows, iter_mof_rows, self.queryset) reuse the same in-memory
collection or shared iterator instead of re-executing self.queryset three times.

In `@ami/main/models.py`:
- Around line 2945-2963: get_identified_by currently falls back to user.email
for dwc:identifiedBy which can leak PII; remove user.email from the fallback
chain in the get_identified_by method and instead prefer non-PII identifiers
(user.name if present, then user.username if available, then the opaque fallback
f"user:{user.pk}"). Update the branches referencing top_identification.user to
avoid returning user.email and ensure any tests or callers expecting the email
are adjusted to use name/username/opaque id.
- Around line 2965-2972: get_identified_date currently returns created_at from
best_identification or best_prediction which reflects DB insert time; change it
to prefer the Classification.timestamp field (when available) for
best_identification and best_prediction, falling back to created_at only if
timestamp is None, so dwc:dateIdentified uses the model-produced timestamp (see
Classification.timestamp and the get_identified_date, best_identification,
best_prediction symbols).

In `@docs/claude/planning/2026-04-21-dwca-implementation-plan.md`:
- Around line 1917-1952: The outer markdown code fence around the "Archive
contents" example closes too early; update the fenced blocks so the outer fence
is longer than the inner and label the inner archive tree as text: replace the
current outer ```markdown fence with a longer fence (e.g., ````markdown) and
change the inner ``` that surrounds the project_export.zip tree to ```text, then
close with the longer outer fence (e.g., ````) so the tree renders correctly and
the later table remains outside the inner fence; apply this change to the fenced
block starting at the "Archive contents" heading and its closing fence just
before the "Humboldt Extension columns on event.txt" table.


📥 Commits

Reviewing files that changed from the base of the PR and between dd2309e and 45dec2e.

📒 Files selected for processing (24)
  • ami/exports/dwca/__init__.py
  • ami/exports/dwca/eml.py
  • ami/exports/dwca/fields.py
  • ami/exports/dwca/helpers.py
  • ami/exports/dwca/meta.py
  • ami/exports/dwca/rows.py
  • ami/exports/dwca/targetscope.py
  • ami/exports/dwca/tsv.py
  • ami/exports/dwca/zip.py
  • ami/exports/dwca_validator.py
  • ami/exports/format_types.py
  • ami/exports/management/__init__.py
  • ami/exports/management/commands/__init__.py
  • ami/exports/management/commands/validate_dwca_archive.py
  • ami/exports/tests.py
  • ami/exports/tests_dwca_validator.py
  • ami/main/migrations/0082_project_license.py
  • ami/main/models.py
  • docs/claude/dwca-format-reference.md
  • docs/claude/dwca-pr-review-and-mapping-spec.md
  • docs/claude/export-framework.md
  • docs/claude/planning/2026-04-21-dwca-april-draft-design.md
  • docs/claude/planning/2026-04-21-dwca-implementation-plan.md
  • ui/src/data-services/models/export.ts
✅ Files skipped from review due to trivial changes (2)
  • docs/claude/dwca-format-reference.md
  • docs/claude/export-framework.md

Comment on lines +22 to +69
def iter_multimedia_rows(events_qs, occurrences_qs, project_slug: str):
    """Yield dicts for multimedia.txt rows.

    Two row types:
    - Capture row: one per SourceImage linked to >=1 occurrence in filter set.
      occurrenceID is blank; identifier is the capture URL.
    - Crop row: one per Detection whose occurrence is in filter set
      AND which has a usable crop URL. occurrenceID populated;
      references = source capture URL.
    """
    events_list = list(events_qs)
    license_value = _project_license(events_list)
    rights_holder = _project_rights_holder(events_list)

    occurrences_by_event: dict[int, list] = {}
    for occ in occurrences_qs.select_related("event").prefetch_related("detections__source_image"):
        if occ.event_id is None:
            continue
        occurrences_by_event.setdefault(occ.event_id, []).append(occ)

    for event in events_list:
        eid = _event_id(event, project_slug)
        occurrences_for_event = occurrences_by_event.get(event.id, [])

        # Deduplicate capture images across all occurrences in this event.
        seen_captures: set[int] = set()
        for occ in occurrences_for_event:
            for det in occ.detections.all():
                si = det.source_image
                if si is None or si.id in seen_captures:
                    continue
                seen_captures.add(si.id)
                capture_url = si.public_url()
                if not capture_url:
                    continue
                yield {
                    "eventID": eid,
                    "occurrenceID": "",
                    "type": "StillImage",
                    "format": "image/jpeg",
                    "identifier": capture_url,
                    "references": "",
                    "created": _format_datetime(si.timestamp),
                    "license": license_value,
                    "rightsHolder": rights_holder,
                    "creator": "",
                    "description": "Source capture image from automated monitoring station",
                }
@coderabbitai (bot) commented Apr 22, 2026

⚠️ Potential issue | 🟠 Major

iter_multimedia_rows materializes all occurrences into memory.

Lines 36-40 build occurrences_by_event: dict[int, list] holding every occurrence for the export (plus their prefetched detections__source_image). That defeats the streaming write_tsv model on the downstream side and can blow memory for large projects. Captures are emitted in an inner loop that scans every occurrence's detections, so the grouping is only used to iterate in event order — that order isn't strictly required by DwC-A.

Consider iterating occurrences_qs.order_by("event_id").iterator(chunk_size=...) and streaming captures/crops per event as the event_id changes (the classic grouping-by-sort pattern). That keeps memory bounded and preserves deterministic output.

Also note: the caller passes events_list (already materialized) but line 32 does list(events_qs) again, a harmless but unnecessary copy.


@mihow (Collaborator, author) commented:

Claude says: Acknowledged, deferring to a follow-up. The observation is correct: iter_multimedia_rows currently builds occurrences_by_event to group captures by event. I want to treat the memory profile of the whole exporter as a single design decision rather than piecemeal — see also thread on format_types.py:295 where the same queryset is iterated three times. A materialize-once-and-fan-out approach (one prefetched list of occurrences shared across occurrence.txt, multimedia.txt, and measurementorfact.txt writes) would eliminate both this grouping cost and the 3× DB scan, but it needs benchmarking against a realistic large project before we pick a direction. Filing as a follow-up so this PR stays focused on the April-draft contract. Tracking issue to follow.

@coderabbitai (bot) replied:

@mihow, that's a sound rationale — patching the grouping dict here while the queryset is still scanned three times in format_types.py would only shift the bottleneck. Treating the whole exporter's memory profile as one design decision is cleaner.

Would you like me to open a GitHub tracking issue for the memory-optimization follow-up? I can include:

  • The occurrences_by_event dict materialization in iter_multimedia_rows
  • The 3× queryset iteration in format_types.py:295
  • The proposed materialize-once-and-fan-out (or sort-and-stream) direction to benchmark

Just let me know and I'll file it.



mihow and others added 12 commits April 22, 2026 15:38
Addresses CodeRabbit PR #1131 feedback:
- Parse archive-provided meta.xml/eml.xml with defusedxml to guard against
  XXE and entity-expansion payloads in uploaded zips
- Reject negative <id>/<field> indexes in meta.xml (previously int("-1")
  was accepted and treated as a Python negative index)
- Return a validation error on non-UTF-8 TSV decode instead of crashing
- Remove stray unconditional break in _project_license /
  _project_rights_holder; iterate events until a non-empty value is found
- Unlink zip_path on validation failure to avoid temp-file leaks on
  repeated failing exports

Co-Authored-By: Claude <noreply@anthropic.com>
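A sketch of that hardened parse (defusedxml's fromstring mirrors the stdlib API; the helper and result object are assumptions based on the validator's add_error pattern):

    import defusedxml.ElementTree as ET
    from defusedxml.common import DefusedXmlException
    from xml.etree.ElementTree import ParseError

    def parse_archive_xml(raw: bytes, filename: str, result):
        try:
            return ET.fromstring(raw)
        except (ParseError, DefusedXmlException):
            result.add_error(f"{filename}: malformed or unsafe XML rejected")
            return None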
…n timestamp

Addresses CodeRabbit PR #1131 feedback:
- Occurrence.get_identified_by no longer falls back to user.email for
  dwc:identifiedBy. That value is published in GBIF-bound archives and
  is very hard to retract once mirrored; prefer user.name, then username,
  then the opaque user:<pk>.
- Occurrence.get_identified_date now prefers Classification.timestamp
  (set when the model or user actually produced the result) over
  created_at (DB insert time), so dwc:dateIdentified reflects the
  identification event, not the backfill timestamp.
- BoundingBox schema: document the coordinate system as source-image
  pixel space, matching the PIL.Image.crop() calls in processing
  services. This removes the ambiguity flagged for the
  measurementUnit="pixels" value in iter_mof_rows.

Co-Authored-By: Claude <noreply@anthropic.com>
The outer markdown fence around "Archive contents" closed early,
rendering the archive tree as plain text. Use ````markdown for the
outer fence and ```text for the inner tree so both render correctly.

Co-Authored-By: Claude <noreply@anthropic.com>
…on status

_get_verification_status was emitting identificationVerificationStatus="verified"
even when every human identification on the occurrence was withdrawn. The rest of
the codebase (best_identification, verified_count annotation) consistently uses
withdrawn=False; align with that.

Co-Authored-By: Claude <noreply@anthropic.com>
GBIF and most DwC-A consumers treat a blank dc:license as "unknown — skip."
Until the project has an explicit license field populated via UI, default the
per-record license column to "All rights reserved" so downstream consumers
receive the correct legal signal (no reuse permitted) rather than a blank.

Co-Authored-By: Claude <noreply@anthropic.com>
Adds an explicit draft signal in two places so consumers don't mistake
this archive for a stable schema:

- archive filename now includes "dwca_draft-2026-04" via the new
  filename_label class attribute on BaseExporter (generic mechanism)
- eml.xml <additionalInfo> block warns that the schema is subject to
  change and should not be submitted to GBIF without confirmation

Keeps the door open for a later stable version without leaving today's
downloads indistinguishable.

Co-Authored-By: Claude <noreply@anthropic.com>
- test_dwca_export_refuses_over_hardcap: patches DWCA_MAX_OCCURRENCES down
  and asserts the exporter refuses with a clear "hard cap" message.
- test_validation_failure_writes_errors_into_zip: confirms
  _append_validation_report_to_zip appends a readable VALIDATION_ERRORS.txt
  containing all errors and warnings.

Both features landed in 9475664 (bundled into the draft-schema commit by
mistake); these are the missing tests.

Co-Authored-By: Claude <noreply@anthropic.com>
DwCAExportHttpE2ETest exercises the full path: POST /api/v2/exports/ ->
permission check -> serializer validation -> ExportViewSet.create ->
Job.enqueue -> eager Celery task -> DwCAExporter.export -> validator ->
zip on storage -> DataExport.file_url.

Uses TransactionTestCase so transaction.on_commit callbacks in enqueue()
actually fire, and CELERY_TASK_ALWAYS_EAGER so the task runs in-process.

Verifies: 201 response, filename contains draft label, record_count set,
zip has all expected TSVs + meta.xml + eml.xml, EML contains draft notice,
event.txt rows carry the "All rights reserved" default license.

Co-Authored-By: Claude <noreply@anthropic.com>
Reflects what landed on the branch after the takeaway review:
- Item 1 (perf): status updated — materialize-once-with-hardcap shipped;
  streaming fan-out remains the follow-up
- New item 7 (validator warning path + DataExport.status/error_message)
- New item 8 (run archive through GBIF DwC-A Validator before removing
  draft label)

Sequencing reordered to put the GBIF validator run first so we know what
else is actually broken before investing in the rest.

Co-Authored-By: Claude <noreply@anthropic.com>
TransactionTestCase truncates tables at teardown, which wiped the
parents_json state that DwCAExportTest.test_taxonomy_hierarchy_extraction
relied on via setUpClass — a deterministic full-suite failure that
passed when run in isolation.

Use TestCase + captureOnCommitCallbacks(execute=True) instead. It
fires on_commit callbacks when the context manager exits without
needing real commits, so Job.enqueue's scheduled Celery task runs
under CELERY_TASK_ALWAYS_EAGER, same as before, but fixture state is
preserved for other test classes.

Co-Authored-By: Claude <noreply@anthropic.com>
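The resulting test shape, roughly (class name and POST payload are placeholders, not the PR's actual test code):

    from django.test import TestCase, override_settings

    @override_settings(CELERY_TASK_ALWAYS_EAGER=True)
    class DwCAExportHttpTest(TestCase):
        def test_export_runs_enqueued_task(self):
            with self.captureOnCommitCallbacks(execute=True):
                # Job.enqueue schedules its Celery task via transaction.on_commit;
                # execute=True fires those callbacks when the block exits.
                response = self.client.post("/api/v2/exports/", {"format": "dwca"})
            self.assertEqual(response.status_code, 201)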
mihow added a commit that referenced this pull request Apr 27, 2026
- ``models.py:1499``: add ``source_image_id`` and ``pk`` tie-breakers to
  the Subquery used in the Occurrence.event_id realignment. Two detections
  with identical ``source_image__timestamp`` could otherwise pick
  different rows across runs (and so different event_ids when the tied
  source_images straddle an event boundary). Stable ordering keeps
  regrouping deterministic.
- ``tests.py:373``: switch the test Detection's ``bbox`` from normalized
  floats to absolute pixel coordinates (``[10, 10, 20, 20]``) to match
  the project convention noted in PR #1131. Doesn't affect this test's
  assertions; keeps fixtures consistent.

Co-Authored-By: Claude <noreply@anthropic.com>
mihow added a commit that referenced this pull request Apr 27, 2026
…ployments (#1268)

* fix(grouping): cap event duration at 24h for continuous-monitoring deployments

`group_images_into_events` splits images into events when there is a gap
larger than `max_time_gap` (default 2h). For continuous-monitoring
deployments the stream never has a 2h quiet period, so every image in
the deployment coalesces into a single multi-month event. Observed on
production deployment 489: one event spanning Apr 24 → Jul 23 2023
with 355k source_images.

Add a `max_event_duration` kwarg to `group_datetimes_by_gap` and thread
it through `group_images_into_events` with a default of 24 hours. When
set, a group is also split once its span would exceed the cap, even if
no gap triggers the split. `None` disables the cap (backward compatible).

Also removes the stale `# @todo this event grouping needs testing.
Still getting events over 24 hours` comment now that events over 24h
are prevented by construction.

Tests: doctest for the continuous-monitoring case in `dates.py`, plus
a Django test that creates 3 days of 10-minute-interval captures and
asserts grouping produces >=3 events each <=24h.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
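An illustrative re-implementation of the gap-plus-cap rule (the real group_datetimes_by_gap signature may differ; note the strict > keeps a capture at exactly the cap offset in the previous group):

    def group_by_gap(times, max_gap, max_event_duration=None):
        groups, current = [], []
        for t in sorted(times):
            split_on_gap = bool(current) and (t - current[-1]) > max_gap
            split_on_span = (
                max_event_duration is not None
                and bool(current)
                and (t - current[0]) > max_event_duration
            )
            if split_on_gap or split_on_span:
                groups.append(current)
                current = []
            current.append(t)
        if current:
            groups.append(current)
        return groups

    # 3 days of 10-minute captures never hit the 2h gap, so without the cap
    # this is one group; with max_event_duration=24h it yields one group/day.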

* fix(grouping): refresh cached fields on events whose captures moved

Addresses Copilot review feedback on #1268.

When regrouping a deployment that already has events, `group_images_into_events` reuses any existing event whose `group_by` date matches the first day of a new group. Before this change, a reused event was saved once early in the loop (while still holding its pre-regroup captures) and never refreshed after later iterations moved captures away to new events. Result: the reused event's cached `start` / `end` / `captures_count` / `detections_count` / `occurrences_count` stayed stale — e.g. a 90-day mega-event being re-grouped under a 24h cap would physically hold only ~24h of captures but still report the 90-day span.

Fix: track every event touched by grouping (both newly-created / reused in this pass, and any event that lost captures to the UPDATE), then call `update_calculated_fields_for_events(pks=...)` in a single pass after the loop to refresh cached fields against each event's final capture set.

This narrowly scoped fix coexists with #904, which is expected to rework the `group_by`-based reuse path more thoroughly.

Tests:
- New `test_regrouping_existing_long_event_refreshes_cached_fields`: creates 3 days of continuous captures, groups once with the cap disabled (one mega-event), then re-groups with the 24h cap and asserts (a) no event exceeds 24h and (b) `sum(captures_count)` across events equals the total image count. Avoids asserting on `group_by` specifics so it remains valid under #904's refactor.
- Existing `test_continuous_monitoring_capped_at_24_hours` refactored to share a `_populate_continuous_captures` helper.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(grouping): tighten 24h cap regression tests per review

- Change `>= 3` to `== 3` in both new tests — 432 captures over 3 days
  at 10 min intervals must yield exactly 3 daily events under the 24h cap.
  Catches over-splitting regressions that `>= 3` would miss.
- Add per-event cached-count check after regroup: compare each event's
  cached `captures_count` against its actual related SourceImage count.
  Offsetting errors can hide stale counters in a sum-only check.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(grouping): realign Occurrence.event_id and refresh deployment caches after regroup

Occurrences are bound to an event once at creation time (from
detection.source_image.event in Detection.associate_new_occurrence and
Pipeline.save_results) and were never re-derived afterward. Without this
refresh, a deployment regrouped under the 24h cap kept every occurrence
pointing at its original (pre-cap) event regardless of when its
detections actually fired, breaking every Occurrence.event-keyed query
(occur_det_proj_evt index, Event.occurrences related-name, event_ids=
filter chain).

Realign Occurrence.event_id from each occurrence's first detection's
source_image.event_id, and union the pre- and post-refresh event holders
into touched_event_pks so the existing update_calculated_fields_for_events
call covers both occurrence-losers and -gainers.

Also swap the manual events_count-only update at the end of
group_images_into_events for a full Deployment.update_calculated_fields()
call. The async regroup_events task never went through Deployment.save's
calculated-fields refresh, so the deployment list (occurrences_count,
taxa_count, captures_count, detections_count) showed pre-regroup numbers
until the next save touched the deployment.

Adds test_regrouping_realigns_occurrence_event_id covering the
detection->source_image->event chain post-regroup plus per-event
occurrences_count consistency against the live get_occurrences_count
helper that the cache itself uses.

Co-Authored-By: Claude <noreply@anthropic.com>
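A hedged ORM sketch of that realignment (model and field names follow the commit text; the exact queryset filtering is an assumption, and the tie-breakers shown here land in the follow-up commit below):

    from django.db.models import OuterRef, Subquery

    from ami.main.models import Detection, Occurrence  # app path per this repo

    earliest_detection_event = Subquery(
        Detection.objects.filter(occurrence=OuterRef("pk"))
        .order_by("source_image__timestamp", "source_image_id", "pk")  # stable order
        .values("source_image__event_id")[:1]
    )
    Occurrence.objects.filter(deployment=deployment).update(
        event_id=earliest_detection_event
    )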

* fix(grouping): address CodeRabbit nits on the realignment commit

- ``models.py:1488``: replace EN DASH (–) with ASCII hyphen so Ruff RUF003
  doesn't flag the comment in strict environments.
- ``tests.py:test_regrouping_realigns_occurrence_event_id``: pick targets
  at mid-day offsets (12h / 36h / 60h) instead of by index. The cap's
  strict ``>`` keeps the capture at exactly 24h offset in the previous
  event, so index-based selection (``captures[len // 3]``) lands two of
  three targets in the same event. Mid-day offsets sit well inside each
  daily event's window, away from any boundary, so the test now actually
  exercises 3-way distribution as it claimed.
- Tighten the realignment assertion to require exactly 3 distinct
  ``event_id`` values across the three occurrences (the corrected
  invariant), replacing the looser ``>= 1`` non-day-0 check.

Co-Authored-By: Claude <noreply@anthropic.com>

* fix(grouping): tighten Occurrence event_id realignment ordering

- ``models.py:1499``: add ``source_image_id`` and ``pk`` tie-breakers to
  the Subquery used in the Occurrence.event_id realignment. Two detections
  with identical ``source_image__timestamp`` could otherwise pick
  different rows across runs (and so different event_ids when the tied
  source_images straddle an event boundary). Stable ordering keeps
  regrouping deterministic.
- ``tests.py:373``: switch the test Detection's ``bbox`` from normalized
  floats to absolute pixel coordinates (``[10, 10, 20, 20]``) to match
  the project convention noted in PR #1131. Doesn't affect this test's
  assertions; keeps fixtures consistent.

Co-Authored-By: Claude <noreply@anthropic.com>

* chore(grouping): add seed_synthetic_occurrences mgmt command for staging verification

Lets us exercise the Occurrence.event_id realignment branch on a staging
deployment without running an ML pipeline. Creates one synthetic Detection
+ Occurrence per existing SourceImage, with ``Occurrence.event=None`` so
the regroup's Subquery realignment is the path that has to populate it.

Workflow on staging:

    python manage.py seed_synthetic_occurrences --deployment <id>
    # then in a shell:
    from ami.main.models import Deployment, group_images_into_events
    import datetime
    group_images_into_events(
        deployment=Deployment.objects.get(pk=<id>),
        max_event_duration=datetime.timedelta(hours=24),
    )
    # verify Occurrence.event_id matches earliest Detection's
    # source_image.event_id

``--clean`` removes the seeded rows (matched by bbox=[10,10,20,20] +
no algorithm/path) for re-runs. ``--dry-run`` reports counts only.

Co-Authored-By: Claude <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>