feat: add Darwin Core Archive (DwC-A) export format #1131
Conversation
Add planning document for Darwin Core Archive export format, export framework technical reference, DwC-A format reference with field mappings, and downloaded DwC terms quick reference from TDWG. Co-Authored-By: Claude <noreply@anthropic.com>
Add Event Core + Occurrence Extension DwC-A exporter that produces a
GBIF-compatible ZIP containing event.txt, occurrence.txt, meta.xml,
and eml.xml. Events are the core entity with occurrences linked via
eventID foreign key.
Key design decisions:
- Direct TSV writing with iterator(chunk_size=500) instead of DRF serializers
- Taxonomy hierarchy extracted from parents_json to avoid N+1 queries
- meta.xml generated from the same field definitions used for TSV columns
- basisOfRecord = "MachineObservation" for all records
- URN format IDs: urn:ami:event:{slug}:{id}, urn:ami:occurrence:{slug}:{id}
Co-Authored-By: Claude <noreply@anthropic.com>
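Two of the design decisions above — chunked TSV streaming and URN-format IDs — are easy to see in a minimal sketch. The helper names (`make_urn`, `write_tsv`) and signatures below are illustrative, not necessarily the PR's actual API:

```python
import csv


def make_urn(kind: str, project_slug: str, pk: int) -> str:
    """Stable record identifier, e.g. urn:ami:event:my-project:42."""
    return f"urn:ami:{kind}:{project_slug}:{pk}"


def write_tsv(path: str, headers: list[str], queryset, extract_row) -> int:
    """Stream a Django queryset to TSV without materializing it in memory."""
    count = 0
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="\t")
        writer.writerow(headers)
        # iterator(chunk_size=500) fetches rows in batches, keeping memory flat
        for obj in queryset.iterator(chunk_size=500):
            writer.writerow(extract_row(obj))
            count += 1
    return count
```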
Test ZIP structure, event/occurrence headers and row counts, meta.xml core/extension structure, referential integrity between events and occurrences, taxonomy hierarchy extraction from parents_json, specific epithet parsing, and EML metadata validity. Co-Authored-By: Claude <noreply@anthropic.com>
📝 Walkthrough

The PR implements a comprehensive Darwin Core Archive (DwC-A) export system for AMI datasets, including a new DwCAExporter class, field mapping modules for Events/Occurrences/extensions, XML/TSV generation utilities, ZIP packaging, structural validation, extensive test coverage, database migrations for licensing metadata, and UI registration for the new export format.
Sequence Diagram(s)

```mermaid
sequenceDiagram
participant Client
participant API
participant Celery
participant DwCAExporter
participant FieldExtractors
participant XMLGenerators
participant ZipCreator
participant Validator
participant Storage
Client->>API: POST /api/v2/exports/ (format=dwca)
API->>Celery: Enqueue DataExport.run_export()
Celery->>DwCAExporter: Initialize with filtered queryset
DwCAExporter->>DwCAExporter: get_queryset() → occurrences
DwCAExporter->>DwCAExporter: get_events_queryset() → events
DwCAExporter->>FieldExtractors: write_tsv(event.txt)
FieldExtractors->>FieldExtractors: extract EVENT_FIELDS per row
DwCAExporter->>FieldExtractors: write_tsv(occurrence.txt)
FieldExtractors->>FieldExtractors: extract OCCURRENCE_FIELDS per row
DwCAExporter->>FieldExtractors: iter_multimedia_rows()
FieldExtractors->>FieldExtractors: build capture/crop rows
DwCAExporter->>FieldExtractors: iter_mof_rows()
FieldExtractors->>FieldExtractors: build measurement rows
DwCAExporter->>XMLGenerators: generate_eml_xml()
XMLGenerators->>XMLGenerators: derive coverage, methods, rights
DwCAExporter->>XMLGenerators: generate_meta_xml()
XMLGenerators->>XMLGenerators: build core/extension descriptors
DwCAExporter->>ZipCreator: create_dwca_zip()
ZipCreator->>ZipCreator: package .txt + .xml files
DwCAExporter->>Validator: validate_dwca_zip()
Validator->>Validator: verify meta.xml, IDs, referential integrity
alt Validation passes
DwCAExporter->>Storage: upload ZIP
Storage-->>DwCAExporter: file_url
DwCAExporter-->>API: export complete
API-->>Client: export_id + file_url
else Validation fails
Validator-->>DwCAExporter: ValidationResult with errors
DwCAExporter-->>API: raise ValueError
API-->>Client: error response
end
```
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes

This PR introduces a complete, multi-layered DwC-A export system spanning 15+ new modules, 2000+ lines of implementation code, comprehensive validation logic, database migrations, extensive test coverage (559 lines), and detailed documentation. The changes touch core export infrastructure, field extraction pipelines, XML/TSV/ZIP generation, validation framework, ORM layer, and UI integration. The heterogeneity of concerns (data transformation, XML generation, validation, testing, documentation) and density of domain-specific logic (Darwin Core mappings, Humboldt extensions, referential integrity checks) demand careful, multi-faceted review across infrastructure, correctness, test coverage, and integration points.
Actionable comments posted: 12
🤖 Fix all issues with AI agents
In `@ami/exports/dwca.py`:
- Around line 132-142: The _format_duration function must guard against negative
durations: check event.start and event.end and compute delta = event.end -
event.start, then if delta.total_seconds() <= 0 return an empty string (or
optionally log a warning) instead of proceeding; update the logic around
total_seconds / divmod so negative values are not formatted (use the check on
delta.total_seconds() before computing hours/minutes and then format only for
positive durations).
- Around line 62-63: The taxonRank mapping lambda can raise AttributeError when
o.determination.rank is None; update the DWC + "taxonRank" tuple's lambda (the
function referencing o.determination.rank.lower()) to guard by checking both
o.determination and o.determination.rank before calling .lower(), returning an
empty string (or appropriate default) when rank is falsy.
- Line 87: The mapping that emits DWC + "individualCount" currently uses
getattr(o, "detections_count", 0) which is semantically wrong; change the tuple
entry for DWC + "individualCount" so it always emits the string "1" (i.e.,
replace the lambda that references detections_count with one that returns "1")
or remove the entire tuple if you prefer omission; update the tuple containing
DWC + "individualCount" and the lambda (the anonymous function referencing
getattr(o, "detections_count", 0)) accordingly.
- Around line 75-79: The ternary expression in the tuple entry with key
"vernacularName" is ambiguous; update the lambda (lambda o, slug:
o.determination.common_name_en or "" if o.determination else "") to add
parentheses around the ternary's true branch so it reads like
(o.determination.common_name_en or "") if o.determination else "" to make the
intent explicit and easier to read.
- Around line 309-312: The EML export is inserting the owner's email into the
<surName> element (project.owner.email) which leaks PII and is semantically
wrong; update the block that creates creator -> individualName -> surName to
avoid using project.owner.email: prefer using a non-PII display name (e.g.,
project.owner.get_display_name() or project.owner.name) if available, and if no
safe display name exists omit creating the individualName/surName elements
entirely (i.e., only create creator when a valid non-email name/contact is
present). Ensure references to project.owner, creator, individualName, and
surName are updated accordingly.
- Around line 288-290: Change the naive timestamp to a timezone-aware one:
replace the use of datetime.datetime.now() when assigning now (used alongside
project_slug/slugify) with Django's timezone-aware function (timezone.now()),
and add the appropriate import from django.utils import timezone at the top of
the module so now is timezone-aware when used in exports/dwca.py.
In `@ami/exports/format_types.py`:
- Around line 196-208: The Occurrence queryset in get_queryset currently
includes occurrences lacking related event or determination, producing empty
coreid/scientificName; update get_queryset (the Occurrence.objects.valid() chain
in the get_queryset method) to filter out occurrences with null event and null
determination (e.g., .filter(event__isnull=False, determination__isnull=False))
before the select_related/.with_identifications() calls so only occurrences with
both event and determination are exported.
- Around line 237-264: The temp files created for the DwC-A export
(event_file.name, occ_file.name and the produced zip_path) are never cleaned up;
wrap the export sequence in a try/finally around the code that calls
get_events_queryset, write_tsv, generate_meta_xml, generate_eml_xml and
create_dwca_zip so that in the finally block you remove event_file.name,
occ_file.name and (on error or success) delete any intermediate temp zip if
appropriate; ensure update_export_stats and the return of zip_path still occur
on success but that file removal happens only for the intermediate files (or
conditionally for the zip if you decide not to persist it).
In `@ami/exports/tests.py`:
- Around line 477-502: The test test_taxonomy_hierarchy_extraction can pass
vacuously when taxon.parents_json is empty; update the test to guarantee
parents_json is populated before calling _get_rank_from_parents: after obtaining
taxon (from occurrence.determination) and calling
taxon.save(update_calculated_fields=True) and taxon.refresh_from_db(), add an
assertion that taxon.parents_json is truthy (or else set up explicit test data/a
fixture so parents_json contains a parent chain) so the loop over ranks actually
executes and _get_rank_from_parents is exercised.
In `@docs/claude/dwca-format-reference.md`:
- Around line 94-113: The documentation shows EML 2.2.0 namespace
("https://eml.ecoinformatics.org/eml-2.2.0") but the implementation currently
emits EML 2.1.1 (literal "eml://ecoinformatics.org/eml-2.1.1"); either make the
docs reflect 2.1.1 or update the implementation to 2.2.0 so they match. To fix:
search the codebase for the literal "eml://ecoinformatics.org/eml-2.1.1" (and
any code that constructs the eml:eml root or schemaLocation) and change the
namespace and schemaLocation to the 2.2.0 values used in the doc, or
alternatively update the doc snippet to the 2.1.1 namespace and schemaLocation;
ensure the change is applied to the EML root construction logic (the
function/method that builds the eml:eml element) and the documentation fragment
so both use the same namespace string.
In `@docs/claude/export-framework.md`:
- Line 13: Update the documentation table entry that currently lists
JSONExporter and CSVExporter to also include the new DwCAExporter: edit the row
showing `ami/exports/format_types.py` so it mentions Concrete exporters
`JSONExporter`, `CSVExporter`, and `DwCAExporter` (use the exact class name
DwCAExporter to match the implementation).
- Around line 68-71: Update the example output of
ExportRegistry.get_supported_formats() to include the newly registered "dwca"
format so the snippet accurately reflects current registrations; adjust the
example list returned by ExportRegistry.get_supported_formats() (used alongside
ExportRegistry.register and ExportRegistry.get_exporter in the snippet) to
include "dwca" among the formats (e.g., ["occurrences_api_json",
"occurrences_simple_csv", "dwca"]).
🧹 Nitpick comments (3)
docs/claude/dwca-format-reference.md (1)
9-16: Add a language identifier to the fenced code block.

The linter flags this code block for missing a language specifier (MD040). Since it's a file tree diagram, use a `text` fence.

ami/exports/tests.py (1)
329-341: Each test re-runs the full export pipeline independently.

Every test method calls `_run_export()`, which creates a `DataExport`, runs the entire export pipeline, and writes to storage. For 7 tests that call this helper, that's significant overhead. Consider using `setUpClass` or caching the export result in `setUp` to run the export once and share the ZIP across structural validation tests, as sketched below.

This is fine for correctness — just a test performance observation.
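A sketch of the caching idea using Django's `setUpTestData` (a classmethod hook like `setUpClass`, but fixture-aware); `create_test_project` and `run_export` are hypothetical stand-ins for whatever fixture and export helpers the test module actually defines:

```python
import zipfile

from django.test import TestCase


class DwCAStructureTests(TestCase):
    @classmethod
    def setUpTestData(cls):
        # Runs once per class: build fixtures and export a single shared ZIP.
        cls.project = create_test_project()  # hypothetical fixture factory
        cls.zip_path = run_export(cls.project)  # hypothetical export helper

    def test_zip_contains_required_files(self):
        with zipfile.ZipFile(self.zip_path) as zf:
            for name in ("event.txt", "occurrence.txt", "meta.xml", "eml.xml"):
                self.assertIn(name, zf.namelist())
```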
ami/exports/dwca.py (1)
286-286: Unused `events_queryset` parameter.

`events_queryset` is accepted but never used in `generate_eml_xml`. Either remove it or use it to populate temporal/geographic coverage metadata (the latter is listed as a follow-up). For now, removing the parameter or marking it for future use avoids confusion.

Proposed fix — remove unused parameter:

```diff
-def generate_eml_xml(project, events_queryset=None) -> str:
+def generate_eml_xml(project) -> str:
```

This also requires updating the call site in `ami/exports/format_types.py` (line 258):

```diff
-    eml_xml = generate_eml_xml(self.project, events_qs)
+    eml_xml = generate_eml_xml(self.project)
```
```python
(
    DWC + "vernacularName",
    "vernacularName",
    lambda o, slug: o.determination.common_name_en or "" if o.determination else "",
),
```
🛠️ Refactor suggestion | 🟠 Major
vernacularName lambda: add parentheses for clarity (known issue #4).
While Python's operator precedence makes this technically correct (or binds tighter than the ternary), the expression is confusing to read and a common source of bugs. Adding explicit parentheses removes ambiguity.
Proposed fix:

```diff
 (
     DWC + "vernacularName",
     "vernacularName",
-    lambda o, slug: o.determination.common_name_en or "" if o.determination else "",
+    lambda o, slug: (o.determination.common_name_en or "") if o.determination else "",
 ),
```

🧰 Tools
🪛 Ruff (0.15.0)
[warning] 78-78: Unused lambda argument: slug
(ARG005)
🤖 Prompt for AI Agents
In `@ami/exports/dwca.py` around lines 75 - 79, The ternary expression in the
tuple entry with key "vernacularName" is ambiguous; update the lambda (lambda o,
slug: o.determination.common_name_en or "" if o.determination else "") to add
parentheses around the ternary's true branch so it reads like
(o.determination.common_name_en or "") if o.determination else "" to make the
intent explicit and easier to read.
```python
def _format_duration(event) -> str:
    """Format event duration as human-readable string."""
    if not event.start or not event.end:
        return ""
    delta = event.end - event.start
    total_seconds = int(delta.total_seconds())
    hours, remainder = divmod(total_seconds, 3600)
    minutes, _ = divmod(remainder, 60)
    if hours > 0:
        return f"{hours}h {minutes}m"
    return f"{minutes}m"
```
Negative durations are not guarded against.
If event.end < event.start (data inconsistency), total_seconds will be negative, and divmod with a negative dividend produces negative results, yielding strings like "-1h -30m". Consider returning an empty string or logging a warning for negative durations.
Proposed fix:

```diff
 def _format_duration(event) -> str:
     """Format event duration as human-readable string."""
     if not event.start or not event.end:
         return ""
     delta = event.end - event.start
     total_seconds = int(delta.total_seconds())
+    if total_seconds < 0:
+        return ""
     hours, remainder = divmod(total_seconds, 3600)
     minutes, _ = divmod(remainder, 60)
```
minutes, _ = divmod(remainder, 60)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| def _format_duration(event) -> str: | |
| """Format event duration as human-readable string.""" | |
| if not event.start or not event.end: | |
| return "" | |
| delta = event.end - event.start | |
| total_seconds = int(delta.total_seconds()) | |
| hours, remainder = divmod(total_seconds, 3600) | |
| minutes, _ = divmod(remainder, 60) | |
| if hours > 0: | |
| return f"{hours}h {minutes}m" | |
| return f"{minutes}m" | |
| def _format_duration(event) -> str: | |
| """Format event duration as human-readable string.""" | |
| if not event.start or not event.end: | |
| return "" | |
| delta = event.end - event.start | |
| total_seconds = int(delta.total_seconds()) | |
| if total_seconds < 0: | |
| return "" | |
| hours, remainder = divmod(total_seconds, 3600) | |
| minutes, _ = divmod(remainder, 60) | |
| if hours > 0: | |
| return f"{hours}h {minutes}m" | |
| return f"{minutes}m" |
🤖 Prompt for AI Agents
In `@ami/exports/dwca.py` around lines 132 - 142, The _format_duration function
must guard against negative durations: check event.start and event.end and
compute delta = event.end - event.start, then if delta.total_seconds() <= 0
return an empty string (or optionally log a warning) instead of proceeding;
update the logic around total_seconds / divmod so negative values are not
formatted (use the check on delta.total_seconds() before computing hours/minutes
and then format only for positive durations).
```python
project_slug = slugify(project.name)
now = datetime.datetime.now().strftime("%Y-%m-%dT%H:%M:%S")
```
Use timezone-aware datetime.
datetime.datetime.now() produces a naive datetime. In a Django project with USE_TZ=True, this is inconsistent and could produce incorrect timestamps. Use Django's timezone.now().
Proposed fix:

```diff
+from django.utils import timezone
+
 project_slug = slugify(project.name)
-now = datetime.datetime.now().strftime("%Y-%m-%dT%H:%M:%S")
+now = timezone.now().strftime("%Y-%m-%dT%H:%M:%S")
```

📝 Committable suggestion

‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```python
from django.utils import timezone

project_slug = slugify(project.name)
now = timezone.now().strftime("%Y-%m-%dT%H:%M:%S")
```
🤖 Prompt for AI Agents
In `@ami/exports/dwca.py` around lines 288 - 290, Change the naive timestamp to a
timezone-aware one: replace the use of datetime.datetime.now() when assigning
now (used alongside project_slug/slugify) with Django's timezone-aware function
(timezone.now()), and add the appropriate import from django.utils import
timezone at the top of the module so now is timezone-aware when used in
exports/dwca.py.
```xml
<?xml version="1.0" encoding="UTF-8"?>
<eml:eml xmlns:eml="https://eml.ecoinformatics.org/eml-2.2.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="https://eml.ecoinformatics.org/eml-2.2.0 https://eml.ecoinformatics.org/eml-2.2.0/eml.xsd"
         packageId="urn:ami:dataset:{project_id}" system="AMI">
  <dataset>
    <title>{project.name}</title>
    <creator>
      <organizationName>{project.owner or institution}</organizationName>
    </creator>
    <abstract>
      <para>{project.description}</para>
    </abstract>
    <intellectualRights>
      <para>License information here</para>
    </intellectualRights>
  </dataset>
</eml:eml>
```
EML version mismatch between documentation and implementation.
This reference doc shows EML 2.2.0 namespace (https://eml.ecoinformatics.org/eml-2.2.0), but the actual implementation in ami/exports/dwca.py (line 293) uses EML 2.1.1 (eml://ecoinformatics.org/eml-2.1.1). The planning doc acknowledges upgrading to 2.2.0 as a follow-up, but this reference should either match the current implementation or clearly note the discrepancy.
🤖 Prompt for AI Agents
In `@docs/claude/dwca-format-reference.md` around lines 94 - 113, The
documentation shows EML 2.2.0 namespace
("https://eml.ecoinformatics.org/eml-2.2.0") but the implementation currently
emits EML 2.1.1 (literal "eml://ecoinformatics.org/eml-2.1.1"); either make the
docs reflect 2.1.1 or update the implementation to 2.2.0 so they match. To fix:
search the codebase for the literal "eml://ecoinformatics.org/eml-2.1.1" (and
any code that constructs the eml:eml root or schemaLocation) and change the
namespace and schemaLocation to the 2.2.0 values used in the doc, or
alternatively update the doc snippet to the 2.1.1 namespace and schemaLocation;
ensure the change is applied to the EML root construction logic (the
function/method that builds the eml:eml element) and the documentation fragment
so both use the same namespace string.
Pull request overview
Adds a new export format (dwca) to generate a GBIF-compatible Darwin Core Archive (DwC-A) ZIP using an Event core (event.txt) with an Occurrence extension (occurrence.txt), along with generated meta.xml and eml.xml.
Changes:
- Introduces `ami/exports/dwca.py` with DwC field mappings, TSV writing, and meta/eml XML generation.
- Adds `DwCAExporter` to the existing export framework and registers it under `"dwca"`.
- Adds a new DwC-A-focused test suite validating ZIP contents and basic structure.
Reviewed changes
Copilot reviewed 7 out of 8 changed files in this pull request and generated 14 comments.
| File | Description |
|---|---|
| `ami/exports/dwca.py` | New DwC-A field definitions and archive/metadata generators. |
| `ami/exports/format_types.py` | Adds `DwCAExporter` implementation to orchestrate TSV generation + ZIP packaging. |
| `ami/exports/registry.py` | Registers the new `"dwca"` export format. |
| `ami/exports/tests.py` | Adds end-to-end tests covering ZIP structure, TSV headers/rows, and XML parsing. |
| `docs/claude/export-framework.md` | New internal technical reference for the export framework. |
| `docs/claude/dwca-format-reference.md` | New DwC-A format reference documentation. |
| `.agents/planning/dwca-export-plan.md` | Planning/design notes and known issues list for DwC-A export. |
```text
|------|---------|
| `ami/exports/base.py` | `BaseExporter` ABC - all exporters inherit from this |
| `ami/exports/registry.py` | `ExportRegistry` - maps format strings to exporter classes |
| `ami/exports/format_types.py` | Concrete exporters: `JSONExporter`, `CSVExporter` |
```
The “Concrete exporters” bullet lists only JSONExporter and CSVExporter, but this PR adds DwCAExporter in ami/exports/format_types.py. Update this reference so the technical doc matches the supported exporters.
```diff
-| `ami/exports/format_types.py` | Concrete exporters: `JSONExporter`, `CSVExporter` |
+| `ami/exports/format_types.py` | Concrete exporters: `JSONExporter`, `CSVExporter`, `DwCAExporter` |
```
```python
def get_queryset(self):
    """Return the occurrence queryset (used by BaseExporter for record count)."""
    return (
        Occurrence.objects.valid()  # type: ignore[union-attr]
        .filter(project=self.project)
        .select_related(
            "determination",
            "event",
            "deployment",
        )
        .with_detections_count()
        .with_identifications()
    )
```
DwC-A occurrence exports can currently include rows with event_id or determination missing, which produces empty coreid/eventID and scientificName values in occurrence.txt (GBIF validator rejects these). Filter the queryset to exclude occurrences with null event and null determination (and adjust the expected counts in tests accordingly).
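A sketch of the suggested guard applied to the queryset above; only the two `isnull` filters are new, and the rest mirrors the snippet:

```python
def get_queryset(self):
    """Return the occurrence queryset (used by BaseExporter for record count)."""
    return (
        Occurrence.objects.valid()
        .filter(
            project=self.project,
            event__isnull=False,          # guarantees a non-empty coreid/eventID
            determination__isnull=False,  # guarantees a non-empty scientificName
        )
        .select_related("determination", "event", "deployment")
        .with_detections_count()
        .with_identifications()
    )
```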
```python
    progress_callback=lambda n: self.update_job_progress(n),
)
logger.info(f"DwC-A: wrote {occ_count} occurrences")
```
Job progress for DwC-A exports is only updated every 500 occurrences; for smaller exports progress may remain at 0% even though the job completes. Consider calling update_job_progress(occ_count) once at the end (guarding against total_records == 0 to avoid division by zero in BaseExporter.update_job_progress()).
```python
# Ensure final progress update, even if the callback was never invoked.
# Guard against total_records == 0 to avoid division by zero in BaseExporter.update_job_progress().
if getattr(self, "total_records", 0):
    self.update_job_progress(occ_count)
```
```python
# Column 0 is the id (eventID)
id_elem = ET.SubElement(core, "id")
id_elem.set("index", "0")

for i, (term_uri, header, _) in enumerate(event_fields):
    if i == 0:
        continue  # Already declared as <id>
    field = ET.SubElement(core, "field")
    field.set("index", str(i))
    field.set("term", term_uri)
```
In meta.xml generation, column 0 is declared as <id index="0"/> but the corresponding <field index="0" term="...eventID"/> mapping is skipped. Many DwC-A consumers/validators expect the ID column to also be mapped to the Darwin Core term (as in the DwC Text Guide examples). Include a <field> mapping for index 0 instead of skipping it.
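What mapping every column (including index 0) might look like, reusing the `ElementTree` pattern from the snippet above; `event_fields` is assumed to be the same list of `(term_uri, header, extractor)` tuples:

```python
import xml.etree.ElementTree as ET


def add_core_columns(core: ET.Element, event_fields) -> None:
    # Declare column 0 as the core <id> ...
    id_elem = ET.SubElement(core, "id")
    id_elem.set("index", "0")
    # ... and still map every column, including index 0, to its DwC term,
    # since many consumers expect the ID column to carry a term as well.
    for i, (term_uri, _header, _extract) in enumerate(event_fields):
        field = ET.SubElement(core, "field")
        field.set("index", str(i))
        field.set("term", term_uri)
```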
- Guard taxonRank against None rank (AttributeError on .lower())
- Guard _get_rank_from_parents against None det_rank
- Fix vernacularName ternary precedence with explicit parentheses
- Change individualCount to emit "1" (each occurrence = 1 individual, not detections_count which counts bounding boxes)
- Guard _format_duration against negative durations

Co-Authored-By: Claude <noreply@anthropic.com>
- Filter out occurrences with null event or determination from DwC-A export queryset (GBIF rejects empty coreid/scientificName)
- Replace project.owner.email with project.owner.name in EML creator element to avoid leaking PII in downloadable archives
- Only emit individualName when owner has a name set

Co-Authored-By: Claude <noreply@anthropic.com>
- Wrap DwC-A export in try/finally to clean up intermediate temp files
- Use timezone.now() instead of naive datetime.datetime.now()
- Use full EML schemaLocation URL for GBIF validation compatibility
- Remove unused events_queryset parameter from generate_eml_xml
- Simplify progress_callback lambda to direct method reference

Co-Authored-By: Claude <noreply@anthropic.com>
- Update occurrence row count test to match null-filtered queryset
- Assert parents_json is populated in taxonomy hierarchy test
- Use settings.MEDIA_URL instead of hardcoded "/media/" in test helper
- Add DwCAExporter to export-framework.md file table and registry example
- Fix EML version in dwca-format-reference.md to match implementation (2.1.1)

Co-Authored-By: Claude <noreply@anthropic.com>
…pdate

- Map all columns (including index 0) to DwC term URIs in meta.xml so GBIF validators can resolve both <id>/<coreid> and <field> entries
- Change fieldsEnclosedBy from empty string to double-quote character for stricter parser compatibility
- Add final progress update after TSV writing so small exports (<500 records) report completion instead of staying at 0%

Co-Authored-By: Claude <noreply@anthropic.com>
Run the export pipeline once in setUpClass and share the ZIP across all structural validation tests instead of re-running per test method. This reduces test time from ~7 export runs to 1. Co-Authored-By: Claude <noreply@anthropic.com>
…occurrences in DwC-A

Remove get_filter_backends() override that returned [], allowing DwCAExporter to inherit BaseExporter's OccurrenceCollectionFilter. Update get_events_queryset() to derive events from self.queryset instead of fetching all project events, preventing orphaned events when collection_id filtering is active.

Co-Authored-By: Claude <noreply@anthropic.com>
…ssertion

Add test_dwca_export_with_collection_filter that verifies filtered exports produce correct occurrence/event counts and referential integrity. Update test_event_headers_and_row_count to expect events derived from occurrences rather than all project events.

Co-Authored-By: Claude <noreply@anthropic.com>
Add API methods, request/response format, filter system, DwC-A specifics, job integration, and file lifecycle details. Remove separate export-system.md. Co-Authored-By: Claude <noreply@anthropic.com>
@copilot add a more accessible overview of all the field mappings and decisions from Antenna -> Darwin Core & GBIF in the PR description so this can be more easily reviewed & tested. Add how-to-test instructions.
Updated the PR description with:
…ect license

- DwCAExporter queryset now invokes apply_default_filters(project), gating low-confidence ML classifications and respecting project include/exclude taxa lists. Without this, unreviewed detections would publish to GBIF.
- Occurrence.get_identified_by() and get_identified_date() centralize the museum-collection-style identifier-of-record logic: the most recent human identification supersedes any ML prediction; otherwise the highest-scoring pipeline algorithm (with version) is the identifier. Mirrors the behavior already baked into update_occurrence_determination so downstream consumers see a consistent story.
- DwC-A occurrence.txt now populates dwc:identifiedBy and dwc:dateIdentified. These were previously left blank, which made the MachineObservation claim un-auditable.
- Project gains license (SPDX or URL) and rights_holder fields. DwC-A event.txt emits dc:license and dc:rightsHolder columns, and eml.xml's <intellectualRights> reflects the project's license rather than a hardcoded CC-BY. GBIF requires a machine-readable license on every dataset.

Co-Authored-By: Claude <noreply@anthropic.com>
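For reviewers, the identifier-of-record precedence described above reduces to roughly the following sketch; the related-manager and field names (`identifications`, `predictions`, `score`, `algorithm`) are assumptions for illustration, not the actual model API:

```python
def identified_by(occurrence) -> str:
    """Most recent human identification wins; else best-scoring algorithm."""
    human = (
        occurrence.identifications.filter(user__isnull=False)  # assumed manager
        .order_by("-created_at")
        .first()
    )
    if human is not None:
        return human.user.name or f"user:{human.user.pk}"  # avoid emitting email (PII)
    best = occurrence.predictions.order_by("-score").first()  # assumed manager
    if best is not None and best.algorithm:
        version = f" {best.algorithm.version}" if best.algorithm.version else ""
        return f"{best.algorithm.name}{version}"
    return ""
```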
Upcoming additions (Humboldt eco: terms, multimedia and MoF extensions, EML 2.2.0) roughly double the code. Split by responsibility now so each module has a single clear purpose. Public API unchanged — re-exported from ami.exports.dwca.__init__. Co-Authored-By: Claude <noreply@anthropic.com>
Pure helper that walks parents_json chains and returns the deepest common ancestor name. Empty include-list -> empty string. v1 strategy; v2 will source from a per-Site TaxaList. Co-Authored-By: Claude <noreply@anthropic.com>
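An illustrative version of that walk, assuming `parents_json` is an ordered root-to-leaf list of `{"name": ..., "rank": ...}` entries as used elsewhere in this PR:

```python
def deepest_common_ancestor(taxa) -> str:
    """Return the name of the deepest ancestor shared by all given taxa."""
    chains = [[p["name"] for p in (t.parents_json or [])] + [t.name] for t in taxa]
    if not chains:
        return ""  # empty include-list -> empty string, per the commit above
    lca_name = ""
    # zip truncates to the shortest chain, which is exactly the shared prefix
    for position in zip(*chains):
        if len(set(position)) != 1:
            break  # chains diverge here; the previous position was the LCA
        lca_name = position[0]
    return lca_name
```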
Flatten 11 Humboldt Extension terms onto Event Core rows: sampling-effort structure (reported/value/unit/protocol), absence reportability, targetTaxonomicScope (LCA-derived), protocol identifiers, and material-sample declaration. Carries the scientific contribution of automated monitoring (provable absence during known sampling windows) into the GBIF pipeline. GBIF accepts eco: terms on Event rows as the pragmatic alternative to a separate humboldt.txt extension. Co-Authored-By: Claude <noreply@anthropic.com>
Pipe-separated distinct source-capture URLs per occurrence, ordered by detection timestamp. Redundant with multimedia.txt but useful for tabular consumers. Co-Authored-By: Claude <noreply@anthropic.com>
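The derivation described above amounts to roughly this; the attribute names (`detections`, `source_image`, `public_url`, `timestamp`) follow fields used elsewhere in this review but are assumptions here:

```python
def associated_media(occurrence) -> str:
    """Pipe-separated distinct source-capture URLs, in detection order."""
    urls: list[str] = []
    for det in occurrence.detections.order_by("timestamp"):
        url = det.source_image.public_url() if det.source_image else ""
        if url and url not in urls:  # distinct, first-seen order
            urls.append(url)
    return " | ".join(urls)
```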
…ator

Single multimedia.txt carries both capture-image rows (occurrenceID blank) and detection-crop rows (occurrenceID linking back to occurrence.txt). dc:references on crop rows points back to the source capture URL. Row generator yields plain dicts so the existing write_tsv + DwCAField pattern handles both query-backed tables and computed row streams uniformly.

Co-Authored-By: Claude <noreply@anthropic.com>
Generalize meta.xml descriptor and zip packager to accept arbitrary extension lists, then add multimedia.txt as the third table. Row type http://rs.gbif.org/terms/1.0/Multimedia (GBIF simple Multimedia extension). Capture-image rows carry blank occurrenceID; crop rows carry the occurrenceID URN so consumers can link evidence back to determinations. Narrow the offline-validator test's required-terms set to exclude occurrenceID — that URI is required inside occurrence.txt but legitimately blank on multimedia capture rows, and the current validator applies a flat required set across all tables. Task 9 adds a cross-reference check that covers the stronger invariant. Co-Authored-By: Claude <noreply@anthropic.com>
Captures ML provenance as structured numeric facts: classificationScore per occurrence, detectionScore + boundingBox per detection. Row type http://rs.gbif.org/terms/1.0/MeasurementOrFact, coreid=eventID, occurrenceID column linking back to the occurrence. Per-event MoF rows (lux, temperature, moon phase) are not emitted in this PR; the column layout reserves space for them. Co-Authored-By: Claude <noreply@anthropic.com>
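Each fact becomes one MoF row keyed by `eventID` with an `occurrenceID` back-link. A sketch of the per-occurrence classification-score row (field names follow the MOF_FIELDS discussion in this review; the values shown are illustrative):

```python
def classification_score_row(occurrence, event_id: str, occurrence_id: str) -> dict:
    """One MeasurementOrFact row recording the ML classification confidence."""
    return {
        "eventID": event_id,            # coreid linking back to event.txt
        "occurrenceID": occurrence_id,  # back-link into occurrence.txt
        "measurementType": "classificationScore",
        "measurementValue": (
            f"{occurrence.determination_score:.4f}"  # assumed field name
            if occurrence.determination_score is not None
            else ""
        ),
        "measurementUnit": "",  # dimensionless confidence in [0, 1]
        "measurementMethod": "ML classification pipeline",
    }
```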
Bump namespace and schemaLocation to eml-2.2.0. Add computed geographicCoverage (bbox from event deployment coordinates), temporalCoverage (min/max event start), and a methods section documenting the automated capture + ML pipeline workflow and the quality-control filters applied at export time. Co-Authored-By: Claude <noreply@anthropic.com>
Any extension that declares a dwc:occurrenceID column must only contain values that exist in occurrence.txt. Multimedia crop rows and MoF rows both carry occurrenceID as a back-link; this check catches drift where the pipeline emits rows pointing at filtered- out or non-existent occurrences. Co-Authored-By: Claude <noreply@anthropic.com>
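The invariant reduces to a set-membership check per extension row. A minimal stand-in for the validator's real logic:

```python
def check_occurrence_id_refs(core_ids: set[str], extension_rows, table_name: str) -> list[str]:
    """Flag extension rows whose non-blank occurrenceID is absent from occurrence.txt."""
    errors: list[str] = []
    for line_no, row in enumerate(extension_rows, start=2):  # line 1 is the header
        occ_id = row.get("occurrenceID", "")
        if occ_id and occ_id not in core_ids:
            errors.append(
                f"{table_name} line {line_no}: occurrenceID {occ_id!r} not found in occurrence.txt"
            )
    return errors
```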
Run the offline structural validator against the zip the exporter just produced. Fatal errors raise, which is caught by the export framework and flips DataExport.status to FAILED. Warnings log. Prevents users downloading broken archives where meta.xml, TSV columns, or cross-references have silently drifted. Co-Authored-By: Claude <noreply@anthropic.com>
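Given the `validate_dwca_zip` / `ValidationResult` API referenced throughout this review, the fail-fast hook plausibly reads like this at the end of the export (a sketch, not the exact call site):

```python
validation = validate_dwca_zip(zip_path)
for warning in validation.warnings:
    logger.warning("DwC-A validation: %s", warning)
if not validation.ok:
    # Raising flips DataExport.status to FAILED via the export framework.
    raise ValueError(f"DwC-A validation failed: {'; '.join(validation.errors)}")
```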
Label the format 'Darwin Core Archive (DwC-A) — April 2026 Draft' so scientists testing the export know what they're looking at before GBIF-registration and scheme stabilization. Co-Authored-By: Claude <noreply@anthropic.com>
Document the four-file archive shape (event + occurrence + multimedia + measurementorfact), the flattened Humboldt eco: columns on event.txt with their derivation, and the EML 2.2.0 namespace bump. Co-Authored-By: Claude <noreply@anthropic.com>
Actionable comments posted: 11
🧹 Nitpick comments (7)
ami/exports/dwca/targetscope.py (1)
36-40: Pass `strict=False` to `zip()` to silence B905 and document intent.

Chains can legitimately differ in length (e.g., include-taxa at mixed ranks such as a genus and one of its species). The current `zip(*chains)` silently truncates to the shortest chain, which happens to be the correct behavior for LCA-by-prefix — but making that explicit avoids the Ruff B905 warning and makes the intent obvious to readers.

♻️ Proposed change

```diff
-    for position in zip(*chains):
+    for position in zip(*chains, strict=False):
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@ami/exports/dwca/targetscope.py` around lines 36 - 40, Update the loop that computes the LCA prefix so zip is called with explicit strict=False to document intent and silence Ruff B905: change the iterator in for position in zip(*chains) to use zip(*chains, strict=False) and add a brief inline comment noting that differing chain lengths are expected and truncation is intentional (reference: the variable chains, the loop for position in zip(...), and lca_name assignment).

ami/exports/dwca/helpers.py (1)
56-74: Minor: inconsistent handling of `rank` types between parents and determination itself.

For parents, lines 65-67 accept either an enum-like (`.name`) or a string and normalize to string before `.upper()`. For the determination itself (line 72), the code calls `det_rank.upper()` directly, which would raise `AttributeError` if `determination.rank` is ever the same enum-like value. Today the codebase treats `rank` as a string (see `fields.py` line 202 using `rank.lower()`), so this is consistent with current usage, but if you ever want this helper to be symmetric, consider applying the same `hasattr(..., "name")` normalization to `det_rank`.

Non-blocking — flagging only because the asymmetry is easy to trip over in future refactors.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@ami/exports/dwca/helpers.py` around lines 56 - 74, The handling of determination.rank in _get_rank_from_parents is asymmetric with the parents loop and can raise AttributeError if rank is an enum-like object; normalize det_rank the same way you do for parents: read det_rank = occurrence.determination.rank, convert it to a string via det_rank.name if hasattr(det_rank, "name") else str(det_rank), then compare det_rank_str.upper() == rank and return occurrence.determination.name (or getattr(occurrence.determination, "name", "") if you want to be defensive); update the comparison that currently calls det_rank.upper() to use this normalized det_rank_str.

ami/exports/dwca/fields.py (1)
248-280: MOF/Multimedia field extractors will `KeyError` on malformed dict rows.

`MOF_FIELDS` lines 249 and 252, and `MULTIMEDIA_FIELDS` lines 269 and 273, use `r["eventID"]` / `r["measurementType"]` / `r["identifier"]` with bracket access rather than `.get(...)`. Today the row iterators in `rows.py` always set these keys, so this is safe — but it means any future row-producer that forgets a required key will fail mid-export with a bare `KeyError` rather than a validation-time error message.

Given these are marked `required=True`, consider either (a) treating missing required values as a validator-enforced invariant (fail early with context) or (b) defaulting to `""` and letting the downstream `validate_dwca_zip` flag the empty required cell. Both surface the issue more clearly than an uncaught `KeyError`.

Non-blocking.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@ami/exports/dwca/fields.py` around lines 248 - 280, The MOF_FIELDS and MULTIMEDIA_FIELDS entries (DwCAField instances referencing eventID, measurementType, and identifier) use bracket access (r["..."]) which raises KeyError on malformed rows; change these extractors to either (a) use r.get("eventID", "") / r.get("measurementType", "") / r.get("identifier", "") so missing values become empty strings and are caught by validate_dwca_zip, or (b) perform an explicit presence check inside the extractor and raise a clear ValidationError/ValueError that includes the field name and row context so validation fails with a descriptive message (update the lambdas or replace them with small helper functions referenced from MOF_FIELDS and MULTIMEDIA_FIELDS and keep required=True as-is).

ami/exports/dwca/eml.py (1)
80-106: `_add_coverage` iterates `events` multiple times — requires a sequence, not a generator.

Lines 81-84 walk `events` four times (lats, lons, starts, ends). Today the only caller passes `events_list` (a list), so this works — but the public signature of `generate_eml_xml` is `events=None`, with the docstring describing an "iterable of Event". If a caller ever passes a generator, the second comprehension will silently see an empty sequence and drop geo/temporal coverage without any warning.

Either tighten the type hint to `Sequence[Event] | None`, or do `events = list(events)` at the top of `_add_coverage`.

Also note: `ends = [...] or starts` uses `starts` directly; if you later compute on `ends` elsewhere, be aware the two lists may alias. Non-blocking.
Verify each finding against the current code and only fix it if needed. In `@ami/exports/dwca/eml.py` around lines 80 - 106, The _add_coverage function iterates the events iterable multiple times which breaks if callers pass a generator; convert the incoming events to a concrete sequence at the start of _add_coverage (e.g., events = list(events) or similar) before computing lats, lons, starts, ends, and ensure ends doesn't alias starts when you use the fallback (make ends a new list/copy of starts rather than assigning starts directly) so subsequent min/max/date calls operate on independent sequences.

ami/exports/dwca/__init__.py (1)
32-53: Nit: `__all__` not sorted (Ruff RUF022).

Ruff flags this for isort-style ordering. Since the public surface already uses mixed public/underscore-prefixed names, sorting alphabetically (or grouping public then private) would silence the warning without changing behavior.

Proposed sort:

```diff
 __all__ = [
     "DC",
     "DWC",
     "ECO",
-    "DwCAField",
     "EVENT_FIELDS",
     "MOF_FIELDS",
     "MULTIMEDIA_FIELDS",
     "OCCURRENCE_FIELDS",
-    "create_dwca_zip",
-    "generate_eml_xml",
-    "generate_meta_xml",
-    "get_specific_epithet",
-    "write_tsv",
     "_format_coord",
     "_format_datetime",
     "_format_duration",
     "_format_event_date",
     "_format_time",
     "_get_rank_from_parents",
     "_get_verification_status",
+    "DwCAField",
+    "create_dwca_zip",
+    "generate_eml_xml",
+    "generate_meta_xml",
+    "get_specific_epithet",
+    "write_tsv",
 ]
```
Verify each finding against the current code and only fix it if needed. In `@ami/exports/dwca/__init__.py` around lines 32 - 53, The __all__ list in this module is not alphabetically ordered (Ruff RUF022); reorder the entries in the __all__ list to a deterministic order (either simple alphabetical order or group all public names first then underscore-prefixed private names) to satisfy Ruff. Update the existing __all__ array (containing symbols like "DC", "DWC", "ECO", "DwCAField", "EVENT_FIELDS", "MOF_FIELDS", "MULTIMEDIA_FIELDS", "OCCURRENCE_FIELDS", "create_dwca_zip", "generate_eml_xml", "generate_meta_xml", "get_specific_epithet", "write_tsv", and the private helpers "_format_coord", "_format_datetime", "_format_duration", "_format_event_date", "_format_time", "_get_rank_from_parents", "_get_verification_status") into the chosen sorted/grouped order and commit the reordered list.

ami/exports/dwca/rows.py (1)
149-163: `_classifier_name` is O(detections × classifications) per occurrence and relies on prefetch.

Fine given the current prefetch chain (`detections__classifications__algorithm`), but it's worth a comment noting the assumption — if a caller ever invokes `iter_mof_rows` without that prefetch, this becomes a severe N+1 (one query per classification). Consider adding a short docstring note or a defensive `hasattr(occurrence, "_prefetched_objects_cache")` check that logs/warns.
Verify each finding against the current code and only fix it if needed. In `@ami/exports/dwca/rows.py` around lines 149 - 163, _classifier_name currently iterates occurrence.detections.all() and det.classifications.all(), which is O(detections×classifications) and assumes the caller prefetched detections__classifications__algorithm (see iter_mof_rows); update _classifier_name to document this assumption in its docstring and add a defensive runtime check (e.g., inspect hasattr(occurrence, "_prefetched_objects_cache") or check for prefetched key for "detections") that emits a warning via the module logger if the related sets are not prefetched, so callers are alerted to potential N+1 behavior instead of silently causing many DB queries.

ami/exports/tests.py (1)
807-820: Keep detection-crop multimedia rows under test.

This test now passes with zero `crop_rows`, so a regression that stops emitting detection-crop rows would not fail despite the DwC-A contract requiring both capture and crop media rows. Please make the fixture deterministic enough to produce a crop URL, or patch the detection URL in this unit test, then assert at least one crop row.

🧪 Proposed assertion once the fixture is deterministic:

```diff
 capture_rows = [r for r in rows if not r["occurrenceID"]]
 crop_rows = [r for r in rows if r["occurrenceID"]]
 self.assertGreater(len(capture_rows), 0, "Expected capture rows with blank occurrenceID")
-# Detection-crop rows require det.url() to return non-empty; depends on fixture
-# setup (the image_dimensions stub may not produce crop URLs for test fixtures).
-# Check at least the capture-row invariants here.
+self.assertGreater(len(crop_rows), 0, "Expected detection-crop rows with occurrenceID")
 for r in capture_rows:
     self.assertTrue(r["identifier"], "Capture row missing identifier")
     self.assertEqual(r["type"], "StillImage")
-# Crop rows, if present, must have both identifier and references.
 for r in crop_rows:
     self.assertTrue(r["identifier"], "Crop row missing identifier")
     self.assertTrue(r["references"], "Crop row missing references (source capture URL)")
```
Verify each finding against the current code and only fix it if needed. In `@ami/exports/tests.py` around lines 807 - 820, The test currently allows zero crop_rows which masks regressions; update the test that builds rows (references capture_rows and crop_rows) to ensure at least one detection-crop row is produced by making the fixture deterministic (or explicitly setting the detection URL returned by det.url() for a sample detection) and then add an assertion self.assertGreater(len(crop_rows), 0, "Expected at least one crop row with references"); keep the existing per-row assertions for r["identifier"], r["references"], and r["type"] to validate crop-row invariants.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@ami/exports/dwca_validator.py`:
- Around line 234-240: The _read_tsv function currently calls
zf.read(filename).decode("utf-8") which will raise UnicodeDecodeError for
non-UTF-8 TSVs and bypass ValidationResult; wrap the read/decode in a try/except
that catches UnicodeDecodeError (and optionally UnicodeError), call
result.add_error with a clear message including the filename (e.g., "filename is
not valid UTF-8"), and return None so the validator records the error instead of
crashing; update references in _read_tsv to use this guarded read/decode and
preserve existing behavior when decode succeeds.
- Around line 145-164: The parser currently accepts negative integer indexes
(e.g., id_index and idx from id_elems[0].get("index", "") and fld.get("index")),
which allows values like -1 to be interpreted as Python negative indices; update
the validation in the block that sets id_index and the block that parses idx_raw
(used to populate field_terms) to reject negative values: after converting to
int, check if the value is < 0 and if so call result.add_error with the same
meta.xml context (use the existing message patterns for id_tag and <field>) and
continue, preventing negative indexes from being used later when accessing
row[spec.id_index] or building field_terms. Ensure you reference id_index,
id_tag, idx_raw/idx and field_terms in the change so the checks are applied in
both places.
- Around line 18-22: Replace the stdlib XML import with the hardened parser and
handle its specific exceptions: change "from xml.etree import ElementTree as ET"
to using defusedxml's ElementTree (import defusedxml.ElementTree as ET) and
import DefusedXmlException from defusedxml.common; then update the
parsing/exception handling where ElementTree.fromstring() is used for meta.xml
and eml.xml (the code paths referenced by the meta/eml parsing logic) to catch
DefusedXmlException in addition to the existing parse errors so malicious
ZIP-provided XML (XXE/entity expansion) is rejected and logged consistently.
In `@ami/exports/dwca/rows.py`:
- Around line 22-69: iter_multimedia_rows currently materializes all occurrences
into occurrences_by_event and calls list(events_qs) again, which blows memory;
change it to stream occurrences_qs ordered by event_id (e.g.,
occurrences_qs.order_by("event_id").iterator(chunk_size=...)) and implement the
classic grouping-by-sort: track current_event_id, reset seen_captures when event
boundary changes, and emit capture rows as you encounter new SourceImage ids
(using det.source_image) so you never accumulate all occurrences in memory; also
avoid the unnecessary duplicate materialization by using the
already-materialized events_list when events_qs is a list (or only list()
events_qs if it is a queryset), rather than always calling list(events_qs).
- Around line 166-179: The loop in _project_license and _project_rights_holder
incorrectly contains an unconditional break so only the first event is ever
inspected; remove the break statements (or at least ensure they execute only
after a successful hit) so the for loop scans events until it finds and returns
the first non-empty project. Update _project_license and _project_rights_holder
to iterate through events, checking if e.project and getattr(e.project,
"license"/"rights_holder", "") and returning that value when found;
alternatively consider changing the functions to accept a project directly if
the single-project assumption should be made explicit.
- Around line 96-146: iter_mof_rows currently hardcodes measurementUnit="pixels"
for detections' boundingBox but Detection.bbox can be normalized (0–1 floats) or
pixel integers; update iter_mof_rows to inspect det.bbox and choose the unit
dynamically (e.g., if all coords are floats in [0.0,1.0] set
measurementUnit="normalized", otherwise "pixels") and include that decision in
measurementRemarks; also add/extend the BoundingBox class docstring or a
validator (e.g., BoundingBox.from_json or Detection.clean_bbox) to document
and/or enforce the canonical coordinate space so downstream code is unambiguous
(reference iter_mof_rows, Detection.bbox, and BoundingBox).
In `@ami/exports/format_types.py`:
- Around line 340-349: Validation failure currently raises in the block after
calling validate_dwca_zip without removing the created zip_path, leaking temp
zips; either delete the zip at that failure point (call os.remove(zip_path) or
Path(zip_path).unlink() just before raising in the branch where validation.ok is
False) or, if the zip is intentionally preserved for post-mortem, add a clear
comment in the validate_dwca_zip/containing function explaining that retention
is intentional and that cleanup is handled elsewhere; reference the
variables/functions zip_path and validate_dwca_zip and the error branch where
the ValueError is raised.
- Around line 259-295: self.queryset is being iterated three times causing
multiple full DB scans; fix by materializing a single prefetched occurrence list
and reusing it for all writes: call a one-time materialization (e.g.,
occurrences = list(self.queryset.select_related(...).prefetch_related(...)) or a
single generator wrapper) and pass that occurrences list to write_tsv for
occurrences, to iter_multimedia_rows (replace its internal
select_related/prefetch use) and to iter_mof_rows so write_tsv,
iter_multimedia_rows, and iter_mof_rows (symbols: write_tsv,
iter_multimedia_rows, iter_mof_rows, self.queryset) reuse the same in-memory
collection or shared iterator instead of re-executing self.queryset three times.
In `@ami/main/models.py`:
- Around line 2945-2963: get_identified_by currently falls back to user.email
for dwc:identifiedBy which can leak PII; remove user.email from the fallback
chain in the get_identified_by method and instead prefer non-PII identifiers
(user.name if present, then user.username if available, then the opaque fallback
f"user:{user.pk}"). Update the branches referencing top_identification.user to
avoid returning user.email and ensure any tests or callers expecting the email
are adjusted to use name/username/opaque id.
- Around line 2965-2972: get_identified_date currently returns created_at from
best_identification or best_prediction which reflects DB insert time; change it
to prefer the Classification.timestamp field (when available) for
best_identification and best_prediction, falling back to created_at only if
timestamp is None, so dwc:dateIdentified uses the model-produced timestamp (see
Classification.timestamp and the get_identified_date, best_identification,
best_prediction symbols).
In `@docs/claude/planning/2026-04-21-dwca-implementation-plan.md`:
- Around line 1917-1952: The outer markdown code fence around the "Archive
contents" example closes too early; update the fenced blocks so the outer fence
is longer than the inner and label the inner archive tree as text: replace the
current outer ```markdown fence with a longer fence (e.g., ````markdown) and
change the inner ``` that surrounds the project_export.zip tree to ```text, then
close with the longer outer fence (e.g., ````) so the tree renders correctly and
the later table remains outside the inner fence; apply this change to the fenced
block starting at the "Archive contents" heading and its closing fence just
before the "Humboldt Extension columns on event.txt" table.
---
Nitpick comments:
In `@ami/exports/dwca/__init__.py`:
- Around line 32-53: The __all__ list in this module is not alphabetically
ordered (Ruff RUF022); reorder the entries in the __all__ list to a
deterministic order (either simple alphabetical order or group all public names
first then underscore-prefixed private names) to satisfy Ruff. Update the
existing __all__ array (containing symbols like "DC", "DWC", "ECO", "DwCAField",
"EVENT_FIELDS", "MOF_FIELDS", "MULTIMEDIA_FIELDS", "OCCURRENCE_FIELDS",
"create_dwca_zip", "generate_eml_xml", "generate_meta_xml",
"get_specific_epithet", "write_tsv", and the private helpers "_format_coord",
"_format_datetime", "_format_duration", "_format_event_date", "_format_time",
"_get_rank_from_parents", "_get_verification_status") into the chosen
sorted/grouped order and commit the reordered list.
In `@ami/exports/dwca/eml.py`:
- Around line 80-106: The _add_coverage function iterates the events iterable
multiple times which breaks if callers pass a generator; convert the incoming
events to a concrete sequence at the start of _add_coverage (e.g., events =
list(events) or similar) before computing lats, lons, starts, ends, and ensure
ends doesn't alias starts when you use the fallback (make ends a new list/copy
of starts rather than assigning starts directly) so subsequent min/max/date
calls operate on independent sequences.
In `@ami/exports/dwca/fields.py`:
- Around line 248-280: The MOF_FIELDS and MULTIMEDIA_FIELDS entries (DwCAField
instances referencing eventID, measurementType, and identifier) use bracket
access (r["..."]) which raises KeyError on malformed rows; change these
extractors to either (a) use r.get("eventID", "") / r.get("measurementType", "")
/ r.get("identifier", "") so missing values become empty strings and are caught
by validate_dwca_zip, or (b) perform an explicit presence check inside the
extractor and raise a clear ValidationError/ValueError that includes the field
name and row context so validation fails with a descriptive message (update the
lambdas or replace them with small helper functions referenced from MOF_FIELDS
and MULTIMEDIA_FIELDS and keep required=True as-is).
In `@ami/exports/dwca/helpers.py`:
- Around line 56-74: The handling of determination.rank in
_get_rank_from_parents is asymmetric with the parents loop and can raise
AttributeError if rank is an enum-like object; normalize det_rank the same way
you do for parents: read det_rank = occurrence.determination.rank, convert it to
a string via det_rank.name if hasattr(det_rank, "name") else str(det_rank), then
compare det_rank_str.upper() == rank and return occurrence.determination.name
(or getattr(occurrence.determination, "name", "") if you want to be defensive);
update the comparison that currently calls det_rank.upper() to use this
normalized det_rank_str.
In `@ami/exports/dwca/rows.py`:
- Around line 149-163: _classifier_name currently iterates
occurrence.detections.all() and det.classifications.all(), which is
O(detections×classifications) and assumes the caller prefetched
detections__classifications__algorithm (see iter_mof_rows); update
_classifier_name to document this assumption in its docstring and add a
defensive runtime check (e.g., inspect hasattr(occurrence,
"_prefetched_objects_cache") or check for prefetched key for "detections") that
emits a warning via the module logger if the related sets are not prefetched, so
callers are alerted to potential N+1 behavior instead of silently causing many
DB queries.
In `@ami/exports/dwca/targetscope.py`:
- Around line 36-40: Update the loop that computes the LCA prefix so zip is
called with explicit strict=False to document intent and silence Ruff B905:
change the iterator in for position in zip(*chains) to use zip(*chains,
strict=False) and add a brief inline comment noting that differing chain lengths
are expected and truncation is intentional (reference: the variable chains, the
loop for position in zip(...), and lca_name assignment).
In `@ami/exports/tests.py`:
- Around line 807-820: The test currently allows zero crop_rows which masks
regressions; update the test that builds rows (references capture_rows and
crop_rows) to ensure at least one detection-crop row is produced by making the
fixture deterministic (or explicitly setting the detection URL returned by
det.url() for a sample detection) and then add an assertion
self.assertGreater(len(crop_rows), 0, "Expected at least one crop row with
references"); keep the existing per-row assertions for r["identifier"],
r["references"], and r["type"] to validate crop-row invariants.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 76a52bf7-caae-4e7d-b717-8baa40216efd
📒 Files selected for processing (24)
- ami/exports/dwca/__init__.py
- ami/exports/dwca/eml.py
- ami/exports/dwca/fields.py
- ami/exports/dwca/helpers.py
- ami/exports/dwca/meta.py
- ami/exports/dwca/rows.py
- ami/exports/dwca/targetscope.py
- ami/exports/dwca/tsv.py
- ami/exports/dwca/zip.py
- ami/exports/dwca_validator.py
- ami/exports/format_types.py
- ami/exports/management/__init__.py
- ami/exports/management/commands/__init__.py
- ami/exports/management/commands/validate_dwca_archive.py
- ami/exports/tests.py
- ami/exports/tests_dwca_validator.py
- ami/main/migrations/0082_project_license.py
- ami/main/models.py
- docs/claude/dwca-format-reference.md
- docs/claude/dwca-pr-review-and-mapping-spec.md
- docs/claude/export-framework.md
- docs/claude/planning/2026-04-21-dwca-april-draft-design.md
- docs/claude/planning/2026-04-21-dwca-implementation-plan.md
- ui/src/data-services/models/export.ts
✅ Files skipped from review due to trivial changes (2)
- docs/claude/dwca-format-reference.md
- docs/claude/export-framework.md
```python
def iter_multimedia_rows(events_qs, occurrences_qs, project_slug: str):
    """Yield dicts for multimedia.txt rows.

    Two row types:
    - Capture row: one per SourceImage linked to >=1 occurrence in filter set.
      occurrenceID is blank; identifier is the capture URL.
    - Crop row: one per Detection whose occurrence is in filter set
      AND which has a usable crop URL. occurrenceID populated;
      references = source capture URL.
    """
    events_list = list(events_qs)
    license_value = _project_license(events_list)
    rights_holder = _project_rights_holder(events_list)

    occurrences_by_event: dict[int, list] = {}
    for occ in occurrences_qs.select_related("event").prefetch_related("detections__source_image"):
        if occ.event_id is None:
            continue
        occurrences_by_event.setdefault(occ.event_id, []).append(occ)

    for event in events_list:
        eid = _event_id(event, project_slug)
        occurrences_for_event = occurrences_by_event.get(event.id, [])

        # Deduplicate capture images across all occurrences in this event.
        seen_captures: set[int] = set()
        for occ in occurrences_for_event:
            for det in occ.detections.all():
                si = det.source_image
                if si is None or si.id in seen_captures:
                    continue
                seen_captures.add(si.id)
                capture_url = si.public_url()
                if not capture_url:
                    continue
                yield {
                    "eventID": eid,
                    "occurrenceID": "",
                    "type": "StillImage",
                    "format": "image/jpeg",
                    "identifier": capture_url,
                    "references": "",
                    "created": _format_datetime(si.timestamp),
                    "license": license_value,
                    "rightsHolder": rights_holder,
                    "creator": "",
                    "description": "Source capture image from automated monitoring station",
                }
```
iter_multimedia_rows materializes all occurrences into memory.
Lines 36-40 build occurrences_by_event: dict[int, list] holding every occurrence for the export (plus their prefetched detections__source_image). That defeats the streaming write_tsv model on the downstream side and can blow memory for large projects. Captures are emitted in an inner loop that scans every occurrence's detections, so the grouping is only used to iterate in event order — that order isn't strictly required by DwC-A.
Consider iterating occurrences_qs.order_by("event_id").iterator(chunk_size=...) and streaming captures/crops per event as the event_id changes (the classic grouping-by-sort pattern). That keeps memory bounded and preserves deterministic output.
Also note: the caller passes events_list (already materialized) but line 32 does list(events_qs) again, a harmless but unnecessary copy.
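A sketch of that sort-and-stream shape (helper names and the yielded keys are placeholders, not the module's API):

```python
from itertools import groupby

def iter_capture_rows_streaming(occurrences_qs, chunk_size: int = 500):
    """Group-by-sort: memory bounded by the largest event, not the export."""
    qs = (
        occurrences_qs.exclude(event_id=None)
        .prefetch_related("detections__source_image")
        .order_by("event_id")
        .iterator(chunk_size=chunk_size)  # Django >= 4.1 prefetches per chunk
    )
    for event_id, group in groupby(qs, key=lambda o: o.event_id):
        seen: set[int] = set()  # dedupe captures within one event only
        for occ in group:
            for det in occ.detections.all():
                si = det.source_image
                if si is None or si.id in seen:
                    continue
                seen.add(si.id)
                yield {"event_id": event_id, "identifier": si.public_url()}
```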
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@ami/exports/dwca/rows.py` around lines 22 - 69, iter_multimedia_rows
currently materializes all occurrences into occurrences_by_event and calls
list(events_qs) again, which blows memory; change it to stream occurrences_qs
ordered by event_id (e.g.,
occurrences_qs.order_by("event_id").iterator(chunk_size=...)) and implement the
classic grouping-by-sort: track current_event_id, reset seen_captures when event
boundary changes, and emit capture rows as you encounter new SourceImage ids
(using det.source_image) so you never accumulate all occurrences in memory; also
avoid the unnecessary duplicate materialization by using the
already-materialized events_list when events_qs is a list (or only list()
events_qs if it is a queryset), rather than always calling list(events_qs).
Claude says: Acknowledged, deferring to a follow-up. The observation is correct: iter_multimedia_rows currently builds occurrences_by_event to group captures by event. I want to treat the memory profile of the whole exporter as a single design decision rather than piecemeal — see also thread on format_types.py:295 where the same queryset is iterated three times. A materialize-once-and-fan-out approach (one prefetched list of occurrences shared across occurrence.txt, multimedia.txt, and measurementorfact.txt writes) would eliminate both this grouping cost and the 3× DB scan, but it needs benchmarking against a realistic large project before we pick a direction. Filing as a follow-up so this PR stays focused on the April-draft contract. Tracking issue to follow.
@mihow, that's a sound rationale — patching the grouping dict here while the queryset is still scanned three times in format_types.py would only shift the bottleneck. Treating the whole exporter's memory profile as one design decision is cleaner.
Would you like me to open a GitHub tracking issue for the memory-optimization follow-up? I can include:
- The `occurrences_by_event` dict materialization in `iter_multimedia_rows`
- The 3× queryset iteration in `format_types.py:295`
- The proposed materialize-once-and-fan-out (or sort-and-stream) direction to benchmark
Just let me know and I'll file it.
🧠 Learnings used
Learnt from: mihow
Repo: RolnickLab/antenna PR: 1130
File: ami/ml/orchestration/nats_queue.py:217-217
Timestamp: 2026-02-13T23:11:30.276Z
Learning: In the RolnickLab/antenna project (Pydantic v1), use .dict() for serialization of Pydantic models instead of .model_dump(). Update code paths in Python files where model_dump() is used to ensure compatibility with Pydantic v1 behavior. If any files are later migrated to Pydantic v2, consider switching to .model_dump() or using model_dump(..., mode='json') as appropriate.
Learnt from: mihow
Repo: RolnickLab/antenna PR: 1222
File: ami/ml/orchestration/nats_queue.py:126-132
Timestamp: 2026-04-13T22:08:32.779Z
Learning: In code that uses the `nats-py` JetStream client, do not divide `ConsumerInfo.config.ack_wait` (or the `ack_wait` returned by `consumer_info()` / `add_consumer()`) by `1e9`. `nats-py` already deserializes the server’s raw nanosecond value by dividing by `_NANOSECOND` (1e9), so application-level code should treat `ack_wait` as already human-readable seconds (e.g., `30.0` means 30 seconds). Only apply nanosecond-to-second conversion if you are working with the raw server nanosecond field directly (not the `ack_wait` property exposed by `ConsumerInfo`/`ConsumerConfig`).
Learnt from: mihow
Repo: RolnickLab/antenna PR: 1261
File: ami/jobs/tasks.py:162-171
Timestamp: 2026-04-21T07:40:49.264Z
Learning: In this repo, when using `cachalot_disabled` imported from `cachalot.api`, treat it as supported in both forms: `with cachalot_disabled():` and `cachalot_disabled()`. It is implemented as a `contextmanager` that also behaves as a `ContextDecorator` (Python 3.2+), so the decorator form is not a no-op and creates a fresh generator instance per invocation. During code review, do not raise false positives that claim the decorator usage is unsupported or ineffective.
Addresses CodeRabbit PR #1131 feedback:

- Parse archive-provided meta.xml/eml.xml with defusedxml to guard against XXE and entity-expansion payloads in uploaded zips
- Reject negative `<id>`/`<field>` indexes in meta.xml (previously `int("-1")` was accepted and treated as a Python negative index)
- Return a validation error on non-UTF-8 TSV decode instead of crashing
- Remove stray unconditional break in `_project_license` / `_project_rights_holder`; iterate events until a non-empty value is found
- Unlink `zip_path` on validation failure to avoid temp-file leaks on repeated failing exports

Co-Authored-By: Claude <noreply@anthropic.com>
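The two XML guards in sketch form (function names and error plumbing are illustrative, not the validator's actual API):

```python
import defusedxml.ElementTree as SafeET
from defusedxml import DefusedXmlException

def parse_archive_xml(raw: bytes):
    """Reject XXE / entity-expansion payloads instead of parsing them."""
    try:
        return SafeET.fromstring(raw)
    except DefusedXmlException as exc:
        raise ValueError(f"unsafe XML construct in archive metadata: {exc}")

def parse_column_index(raw_index: str) -> int:
    index = int(raw_index)
    if index < 0:  # int("-1") must not become a Python negative index
        raise ValueError(f"meta.xml index must be >= 0, got {index}")
    return index
```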
…n timestamp

Addresses CodeRabbit PR #1131 feedback:

- Occurrence.get_identified_by no longer falls back to user.email for dwc:identifiedBy. That value is published in GBIF-bound archives and is very hard to retract once mirrored; prefer user.name, then username, then the opaque user:<pk>.
- Occurrence.get_identified_date now prefers Classification.timestamp (set when the model or user actually produced the result) over created_at (DB insert time), so dwc:dateIdentified reflects the identification event, not the backfill timestamp.
- BoundingBox schema: document the coordinate system as source-image pixel space, matching the PIL.Image.crop() calls in processing services. This removes the ambiguity flagged for the measurementUnit="pixels" value in iter_mof_rows.

Co-Authored-By: Claude <noreply@anthropic.com>
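The fallback chains in sketch form (attribute names follow the commit message; treat the bodies as illustrative):

```python
def get_identified_by(user) -> str:
    # Never fall back to email: GBIF-bound archives are hard to retract.
    return user.name or user.username or f"user:{user.pk}"

def get_identified_date(classification):
    # Prefer when the identification was produced over DB insert time.
    return classification.timestamp or classification.created_at
```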
The outer markdown fence around "Archive contents" closed early, rendering the archive tree as plain text. Use ````markdown for the outer fence and ```text for the inner tree so both render correctly. Co-Authored-By: Claude <noreply@anthropic.com>
# Conflicts: # ami/exports/tests.py
…on status _get_verification_status was emitting identificationVerificationStatus="verified" even when every human identification on the occurrence was withdrawn. The rest of the codebase (best_identification, verified_count annotation) consistently uses withdrawn=False; align with that. Co-Authored-By: Claude <noreply@anthropic.com>
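A sketch of the aligned check, assuming an `identifications` related manager carrying the `withdrawn` flag the commit references:

```python
def _get_verification_status(occurrence) -> str:
    # Withdrawn human identifications no longer count as verification.
    has_active_human_id = occurrence.identifications.filter(
        withdrawn=False
    ).exists()
    return "verified" if has_active_human_id else ""
```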
GBIF and most DwC-A consumers treat a blank dc:license as "unknown — skip." Until the project has an explicit license field populated via UI, default the per-record license column to "All rights reserved" so downstream consumers receive the correct legal signal (no reuse permitted) rather than a blank. Co-Authored-By: Claude <noreply@anthropic.com>
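The fallback in sketch form (the constant name matches the commit; the lookup path from event to project is assumed):

```python
DEFAULT_LICENSE = "All rights reserved"

def _project_license(events) -> str:
    # Iterate until a non-empty explicit license is found; otherwise
    # emit the conservative default rather than a blank.
    for event in events:
        value = getattr(event.project, "license", "")
        if value:
            return value
    return DEFAULT_LICENSE
```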
Adds an explicit draft signal in two places so consumers don't mistake this archive for a stable schema:

- archive filename now includes "dwca_draft-2026-04" via the new filename_label class attribute on BaseExporter (generic mechanism)
- eml.xml <additionalInfo> block warns that the schema is subject to change and should not be submitted to GBIF without confirmation

Keeps the door open for a later stable version without leaving today's downloads indistinguishable.

Co-Authored-By: Claude <noreply@anthropic.com>
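The generic mechanism in miniature (the `get_filename` method is assumed for illustration):

```python
class BaseExporter:
    filename_label: str | None = None  # subclasses opt in to a label

    def get_filename(self, base: str, ext: str) -> str:
        label = f"_{self.filename_label}" if self.filename_label else ""
        return f"{base}{label}.{ext}"

class DwCAExporter(BaseExporter):
    filename_label = "dwca_draft-2026-04"
```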
- test_dwca_export_refuses_over_hardcap: patches DWCA_MAX_OCCURRENCES down and asserts the exporter refuses with a clear "hard cap" message.
- test_validation_failure_writes_errors_into_zip: confirms _append_validation_report_to_zip appends a readable VALIDATION_ERRORS.txt containing all errors and warnings.

Both features landed in 9475664 (bundled into the draft-schema commit by mistake); these are the missing tests.

Co-Authored-By: Claude <noreply@anthropic.com>
DwCAExportHttpE2ETest exercises the full path: POST /api/v2/exports/ -> permission check -> serializer validation -> ExportViewSet.create -> Job.enqueue -> eager Celery task -> DwCAExporter.export -> validator -> zip on storage -> DataExport.file_url. Uses TransactionTestCase so transaction.on_commit callbacks in enqueue() actually fire, and CELERY_TASK_ALWAYS_EAGER so the task runs in-process. Verifies: 201 response, filename contains draft label, record_count set, zip has all expected TSVs + meta.xml + eml.xml, EML contains draft notice, event.txt rows carry the "All rights reserved" default license. Co-Authored-By: Claude <noreply@anthropic.com>
Reflects what landed on the branch after the takeaway review:

- Item 1 (perf): status updated — materialize-once-with-hardcap shipped; streaming fan-out remains the follow-up
- New item 7 (validator warning path + DataExport.status/error_message)
- New item 8 (run archive through GBIF DwC-A Validator before removing draft label)

Sequencing reordered to put the GBIF validator run first so we know what else is actually broken before investing in the rest.

Co-Authored-By: Claude <noreply@anthropic.com>
TransactionTestCase truncates tables at teardown, which wiped parents_json state that DwCAExportTest.test_taxonomy_hierarchy_extraction relied on via setUpClass — deterministic full-suite failure, passed in isolation. Use TestCase + captureOnCommitCallbacks(execute=True) instead. It fires on_commit callbacks when the context manager exits without needing real commits, so Job.enqueue's scheduled Celery task runs under CELERY_TASK_ALWAYS_EAGER, same as before, but fixture state is preserved for other test classes. Co-Authored-By: Claude <noreply@anthropic.com>
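The pattern in miniature (URL and payload are placeholders):

```python
from django.test import TestCase, override_settings

@override_settings(CELERY_TASK_ALWAYS_EAGER=True)
class ExportE2ETest(TestCase):
    def test_export_runs_on_commit(self):
        # Fires transaction.on_commit callbacks at context exit, without
        # TransactionTestCase's table truncation at teardown.
        with self.captureOnCommitCallbacks(execute=True):
            response = self.client.post(
                "/api/v2/exports/", {"format": "dwca", "project": 1}
            )
        self.assertEqual(response.status_code, 201)
```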
- ``models.py:1499``: add ``source_image_id`` and ``pk`` tie-breakers to the Subquery used in the Occurrence.event_id realignment. Two detections with identical ``source_image__timestamp`` could otherwise pick different rows across runs (and so different event_ids when the tied source_images straddle an event boundary). Stable ordering keeps regrouping deterministic.
- ``tests.py:373``: switch the test Detection's ``bbox`` from normalized floats to absolute pixel coordinates (``[10, 10, 20, 20]``) to match the project convention noted in PR #1131. Doesn't affect this test's assertions; keeps fixtures consistent.

Co-Authored-By: Claude <noreply@anthropic.com>
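The deterministic ordering in sketch form (model names from this repo; the exact queryset the realignment uses may differ):

```python
from django.db.models import OuterRef, Subquery
from ami.main.models import Detection, Occurrence

first_detection_event = Subquery(
    Detection.objects.filter(occurrence_id=OuterRef("pk"))
    # Tie-breakers make ties on timestamp deterministic across runs.
    .order_by("source_image__timestamp", "source_image_id", "pk")
    .values("source_image__event_id")[:1]
)
Occurrence.objects.update(event_id=first_detection_event)
```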
…ployments (#1268)

* fix(grouping): cap event duration at 24h for continuous-monitoring deployments

  `group_images_into_events` splits images into events when there is a gap larger than `max_time_gap` (default 2h). For continuous-monitoring deployments the stream never has a 2h quiet period, so every image in the deployment coalesces into a single multi-month event. Observed on production deployment 489: one event spanning Apr 24 → Jul 23 2023 with 355k source_images.

  Add a `max_event_duration` kwarg to `group_datetimes_by_gap` and thread it through `group_images_into_events` with a default of 24 hours. When set, a group is also split once its span would exceed the cap, even if no gap triggers the split. `None` disables the cap (backward compatible). Also removes the stale `# @todo this event grouping needs testing. Still getting events over 24 hours` comment now that events over 24h are prevented by construction.

  Tests: doctest for the continuous-monitoring case in `dates.py`, plus a Django test that creates 3 days of 10-minute-interval captures and asserts grouping produces >=3 events each <=24h.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(grouping): refresh cached fields on events whose captures moved

  Addresses Copilot review feedback on #1268. When regrouping a deployment that already has events, `group_images_into_events` reuses any existing event whose `group_by` date matches the first day of a new group. Before this change, a reused event was saved once early in the loop (while still holding its pre-regroup captures) and never refreshed after later iterations moved captures away to new events. Result: the reused event's cached `start` / `end` / `captures_count` / `detections_count` / `occurrences_count` stayed stale — e.g. a 90-day mega-event being re-grouped under a 24h cap would physically hold only ~24h of captures but still report the 90-day span.

  Fix: track every event touched by grouping (both newly-created / reused in this pass, and any event that lost captures to the UPDATE), then call `update_calculated_fields_for_events(pks=...)` in a single pass after the loop to refresh cached fields against each event's final capture set. This narrowly scoped fix coexists with #904, which is expected to rework the `group_by`-based reuse path more thoroughly.

  Tests:
  - New `test_regrouping_existing_long_event_refreshes_cached_fields`: creates 3 days of continuous captures, groups once with the cap disabled (one mega-event), then re-groups with the 24h cap and asserts (a) no event exceeds 24h and (b) `sum(captures_count)` across events equals the total image count. Avoids asserting on `group_by` specifics so it remains valid under #904's refactor.
  - Existing `test_continuous_monitoring_capped_at_24_hours` refactored to share a `_populate_continuous_captures` helper.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(grouping): tighten 24h cap regression tests per review

  - Change `>= 3` to `== 3` in both new tests — 432 captures over 3 days at 10 min intervals must yield exactly 3 daily events under the 24h cap. Catches over-splitting regressions that `>= 3` would miss.
  - Add per-event cached-count check after regroup: compare each event's cached `captures_count` against its actual related SourceImage count. Offsetting errors can hide stale counters in a sum-only check.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(grouping): realign Occurrence.event_id and refresh deployment caches after regroup

  Occurrences are bound to an event once at creation time (from `detection.source_image.event` in `Detection.associate_new_occurrence` and `Pipeline.save_results`) and were never re-derived afterward. Without this refresh, a deployment regrouped under the 24h cap kept every occurrence pointing at its original (pre-cap) event regardless of when its detections actually fired, breaking every `Occurrence.event`-keyed query (`occur_det_proj_evt` index, `Event.occurrences` related-name, `event_ids=` filter chain).

  Realign `Occurrence.event_id` from each occurrence's first detection's `source_image.event_id`, and union the pre- and post-refresh event holders into `touched_event_pks` so the existing `update_calculated_fields_for_events` call covers both occurrence-losers and -gainers.

  Also swap the manual `events_count`-only update at the end of `group_images_into_events` for a full `Deployment.update_calculated_fields()` call. The async `regroup_events` task never went through `Deployment.save`'s calculated-fields refresh, so the deployment list (`occurrences_count`, `taxa_count`, `captures_count`, `detections_count`) showed pre-regroup numbers until the next save touched the deployment.

  Adds `test_regrouping_realigns_occurrence_event_id` covering the detection→source_image→event chain post-regroup plus per-event `occurrences_count` consistency against the live `get_occurrences_count` helper that the cache itself uses.

  Co-Authored-By: Claude <noreply@anthropic.com>

* fix(grouping): address CodeRabbit nits on the realignment commit

  - ``models.py:1488``: replace EN DASH (–) with ASCII hyphen so Ruff RUF003 doesn't flag the comment in strict environments.
  - ``tests.py:test_regrouping_realigns_occurrence_event_id``: pick targets at mid-day offsets (12h / 36h / 60h) instead of by index. The cap's strict ``>`` keeps the capture at exactly 24h offset in the previous event, so index-based selection (``captures[len // 3]``) lands two of three targets in the same event. Mid-day offsets sit well inside each daily event's window, away from any boundary, so the test now actually exercises 3-way distribution as it claimed.
  - Tighten the realignment assertion to require exactly 3 distinct ``event_id`` values across the three occurrences (the corrected invariant), replacing the looser ``>= 1`` non-day-0 check.

  Co-Authored-By: Claude <noreply@anthropic.com>

* fix(grouping): tighten Occurrence event_id realignment ordering

  - ``models.py:1499``: add ``source_image_id`` and ``pk`` tie-breakers to the Subquery used in the Occurrence.event_id realignment. Two detections with identical ``source_image__timestamp`` could otherwise pick different rows across runs (and so different event_ids when the tied source_images straddle an event boundary). Stable ordering keeps regrouping deterministic.
  - ``tests.py:373``: switch the test Detection's ``bbox`` from normalized floats to absolute pixel coordinates (``[10, 10, 20, 20]``) to match the project convention noted in PR #1131. Doesn't affect this test's assertions; keeps fixtures consistent.

  Co-Authored-By: Claude <noreply@anthropic.com>

* chore(grouping): add seed_synthetic_occurrences mgmt command for staging verification

  Lets us exercise the Occurrence.event_id realignment branch on a staging deployment without running an ML pipeline. Creates one synthetic Detection + Occurrence per existing SourceImage, with ``Occurrence.event=None`` so the regroup's Subquery realignment is the path that has to populate it.

  Workflow on staging:

  ```
  python manage.py seed_synthetic_occurrences --deployment <id>
  # then in a shell:
  from ami.main.models import Deployment, group_images_into_events
  import datetime

  group_images_into_events(
      deployment=Deployment.objects.get(pk=<id>),
      max_event_duration=datetime.timedelta(hours=24),
  )
  # verify Occurrence.event_id matches earliest Detection's
  # source_image.event_id
  ```

  ``--clean`` removes the seeded rows (matched by bbox=[10,10,20,20] + no algorithm/path) for re-runs. ``--dry-run`` reports counts only.

  Co-Authored-By: Claude <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
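The capped grouping from the first commit above can be sketched standalone (the real implementation is `group_datetimes_by_gap` in `dates.py`; this version only mirrors the described semantics, including the strict `>` that keeps a capture at exactly the 24h offset in the previous event):

```python
import datetime

def group_by_gap_and_span(
    timestamps: list[datetime.datetime],
    max_time_gap: datetime.timedelta = datetime.timedelta(hours=2),
    max_event_duration: datetime.timedelta | None = datetime.timedelta(hours=24),
) -> list[list[datetime.datetime]]:
    groups: list[list[datetime.datetime]] = []
    current: list[datetime.datetime] = []
    for ts in sorted(timestamps):
        if current:
            gap_split = ts - current[-1] > max_time_gap
            span_split = (
                max_event_duration is not None
                and ts - current[0] > max_event_duration
            )
            if gap_split or span_split:
                # Close the current group on a quiet gap OR a span overflow.
                groups.append(current)
                current = []
        current.append(ts)
    if current:
        groups.append(current)
    return groups
```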
Summary
Add a `dwca` export format that produces a GBIF-compatible Darwin Core Archive — a ZIP with `event.txt`, `occurrence.txt`, `multimedia.txt`, `measurementorfact.txt`, `meta.xml`, and `eml.xml` (EML 2.2.0). Event Core architecture: events are the core, the other three are extensions linked via `eventID`. Humboldt Extension (`eco:`) terms are flattened onto Event Core rows to carry sampling-effort structure and absence reportability — the scientific contribution of automated monitoring. A structural validator runs against the zip before it is returned to the user; fatal drift fails the export fast.

The format is surfaced in the UI under the label "Darwin Core Archive (DwC-A) — April 2026 Draft" so early testers know this is pre-registration and the column layout may still evolve.
Archive contents
Key design decisions
- Humboldt (`eco:`) terms flattened onto Event Core rows instead of a separate `humboldt.txt` extension. 11 terms encode sampling effort (reported/value/unit/protocol), absence reportability, `targetTaxonomicScope` (LCA-derived), protocol identifiers, and material-sample declaration.
- `targetTaxonomicScope` from LCA of project include-taxa: `parents_json` chains give the deepest common ancestor of `Project.default_filters_include_taxa`. v2 will move to a per-Site TaxaList so each deployment declares its own expected species pool (the groundwork for per-taxon absence rows).
- `basisOfRecord = "MachineObservation"` for all records.
- One `multimedia.txt` for both captures and crops. Capture rows have a blank `occurrenceID`; crop rows carry the occurrenceID URN and `dc:references` pointing back to the source capture URL. Consumers can filter by row shape.
- `classificationScore`, `detectionScore`, and `boundingBox` are carried as structured MeasurementOrFact rows rather than as occurrence columns, so each measurement has its own `measurementDeterminedBy` (model name + version) and remarks.
- The EML `<methods>` section documents the capture + ML pipeline and the QC filters applied at export time.
- Structural validation before delivery: fatal errors set `DataExport.status = FAILED`; warnings log. Catches meta.xml / TSV column drift and cross-reference orphans before users download broken archives.
- Module split: `ami/exports/dwca.py` → `ami/exports/dwca/` with `fields.py` (catalogues), `helpers.py`, `tsv.py`, `meta.py`, `eml.py`, `zip.py`, `rows.py`, `targetscope.py`. Public API unchanged — re-exported from `__init__.py`.

New in this iteration

- `associatedMedia` column on `occurrence.txt` — pipe-separated capture URLs per occurrence.
- `multimedia.txt` extension (GBIF Simple Multimedia) — capture and detection-crop rows in one file.
- `measurementorfact.txt` extension — classificationScore, detectionScore, boundingBox.
- EML `<coverage>` (bbox + temporal) and `<methods>`.
- `eco:` columns on `event.txt` (11 terms) + LCA-derived `targetTaxonomicScope`.

Round 2: self-run takeaway review + HTTP e2e validation (2026-04-22)
After CodeRabbit round 1 addressed, I ran a structural "takeaway" pass against the PR (the 7-angle review in CLAUDE.md) looking for silent-bug and scope-claim issues a diff review doesn't surface. Seven items filed; five applied on-branch, two sent to the follow-up doc.
Applied
- `dc:license` defaulted to blank when `project.license` empty — GBIF / downstream consumers treat blank as "unknown, skip." Now defaults to `"All rights reserved"` across event.txt, multimedia.txt, and eml.xml so downstream consumers see the correct legal signal. Constant lives in `ami/exports/dwca/helpers.py::DEFAULT_LICENSE`. (73eb3dce)
- Withdrawn identifications still counted toward `identificationVerificationStatus = "verified"`. `_get_verification_status` now filters `withdrawn=False`, matching the `best_identification` / `verified_count` pattern elsewhere in the codebase. (ce1e17f2)
- Draft labeling: archive filename now includes `dwca_draft-2026-04` via a new generic `BaseExporter.filename_label` attribute; `eml.xml` carries an `<additionalInfo>` block stating the schema is subject to change. (9475664c)
- `DwCAExporter.export()` iterated `self.queryset` 3× — one full DB scan per extension writer — and `iter_multimedia_rows` materialized an `occurrences_by_event` dict on top. Now materializes a `list[Occurrence]` once with all prefetches and passes the same list to all three writers (sketched below). A 100K `DWCA_MAX_OCCURRENCES` hardcap guards against OOM on very large projects until streaming fan-out ships. (9475664c)
- On validation failure, append `VALIDATION_ERRORS.txt` to the zip (listing every error and warning), persist the zip to storage so `DataExport.file_url` is set, then raise so the Job is still marked failed. User can download the failed archive and read the report. (9475664c)
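The materialize-once shape in sketch form (only the constant name comes from the PR text; the function and writer plumbing are placeholders):

```python
DWCA_MAX_OCCURRENCES = 100_000

def export_archive(queryset, writers):
    """Run each extension writer over one shared, prefetched list."""
    count = queryset.count()
    if count > DWCA_MAX_OCCURRENCES:
        raise ValueError(
            f"refusing export: {count} occurrences exceeds the "
            f"hard cap of {DWCA_MAX_OCCURRENCES}"
        )
    occurrences = list(queryset)  # one DB scan instead of three
    for write in writers:  # e.g. occurrence.txt, multimedia.txt, MoF writers
        write(occurrences)
```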
Scope-out (tracked for follow-up)

- The validator exposes a `warnings` path that nothing ever uses today. Wiring a real warning-category check + surfacing failure state via a proper `DataExport.status` / `error_message` field is follow-up item 7.

Full list:
`docs/claude/planning/dwca-followup-tickets.md` on this branch.

Corrections to my own review
My first pass flagged `basisOfRecord = "MachineObservation"` as wrong for human-verified records. That was incorrect. `basisOfRecord` describes how the organism was observed (camera = machine), not who assigned the taxon name. Human verification lives in `identifiedBy` + `identificationVerificationStatus`. Wildlife Insights, Agouti, and every other GBIF-accepted camera-trap dataset records human-verified observations as `MachineObservation`. The hardcoded value stays.

What has been tested
Unit & integration — `ami/exports` test suite (56 tests, all green):

- `DwCAExportTest` (29 tests) — full ZIP structure, row counts, meta.xml shape, referential integrity, Humboldt columns + meta.xml terms, `associatedMedia`, multimedia and MoF tables in archive, EML 2.2.0 coverage/methods/license, collection filtering, self-produced zip passes own validator, `_get_verification_status` withdrawn-ID filter, 100K hardcap refuses clearly, `VALIDATION_ERRORS.txt` written on failure.
- `DwCAExportHttpE2ETest` (new, 1 test) — HTTP end-to-end via `TransactionTestCase` + `CELERY_TASK_ALWAYS_EAGER`. Exercises the full path: `POST /api/v2/exports/` → permission check → serializer validation → `ExportViewSet.create` → `Job.enqueue` → `transaction.on_commit` → eager Celery task → `DwCAExporter.export` → validator → zip persisted to storage → `DataExport.file_url` set. Asserts: 201 response, filename contains `dwca_draft-2026-04`, `record_count > 0`, zip contains all 6 expected members, no `VALIDATION_ERRORS.txt`, EML contains the draft notice, `basisOfRecord = MachineObservation` on every occurrence row, `license = "All rights reserved"` on every event row.
- `TargetTaxonomicScopeTest` (3 tests) — LCA derivation for empty / single / multi-taxa.
- `MultimediaExtensionTest` (2 tests) — field catalogue + row generator emits capture and crop rows.
- `tests_dwca_validator.py` (9 tests) — validator behavior including occurrenceID cross-reference check, defusedxml hardening, negative-index rejection, UTF-8 decode safety, zip cleanup on failure.

Reproduce:
```
docker compose -f docker-compose.ci.yml run --rm django \
    python manage.py test ami.exports --noinput
```

Validated on real produced archive
`test_validator_runs_on_produced_zip` and `test_offline_structural_validator` both re-run the offline validator against the actual zip emitted by the materialize-once exporter. Both pass.

Not yet validated
Deployment notes
- `ami/main/migrations/0084_project_license.py` adds `Project.license` and `Project.rights_holder` CharFields (both with `default=""`). Safe to roll forward on a live database; downgrade drops the columns.
- The `"dwca"` format is registered automatically and appears in `/api/v2/exports/` choices.

Deferred (tracked elsewhere / separate PRs)
- `is_blank` / `contains_humans` source-image filters (fields don't exist in the model yet; WG requirement, depends on upstream work).
- Per-Site expected species pools (the `Site.primary_taxa_list` design).
- Device metadata (`device_type`, `attractant_type`, `light_wavelength`) → CamtrapDP ticket (feat(exports): add CamtrapDP export format alongside DwC-A #1262).
- Sensitive-record handling (`dwc:informationWithheld`).
- Derived `country` / `stateProvince` / `locality`.
- `coordinateUncertaintyInMeters` (needs Deployment field).
- Online GBIF API validation (`GBIF_API_VALIDATE=1`).

How to test
Prerequisites: an Antenna instance with at least one project containing events and ML-identified occurrences; API access with Researcher or Admin role.
Upload the ZIP to https://www.gbif.org/tools/data-validator for vocabulary / taxonomic-backbone checks. The offline validator covered by the test suite catches structural drift; the GBIF validator is the source of truth for GBIF acceptance.
Checklist
Summary by CodeRabbit
Release Notes
New Features
Documentation
Tests