Skip to content

[DCP -Ingestion]#525

Merged
gmechali merged 55 commits into
datacommonsorg:masterfrom
gmechali:perImp
Jun 9, 2026
Merged

[DCP -Ingestion]#525
gmechali merged 55 commits into
datacommonsorg:masterfrom
gmechali:perImp

Conversation

@gmechali

@gmechali gmechali commented Jun 5, 2026

Copy link
Copy Markdown
Contributor

This PR primarily handles per import ingestion, and along the way we fixed some related things.

Mainly now when we ingest all the data, we must keep the key of which import the data is associated with.

We are also strictly going to use the Provenance as the import name. Instead of defining the provenance in the config.json inline, we now require you top ass it in the inputFiles like this: UNdata or Test

We also did some changes to properly parse all the fields on the Provenance And Source, and add validation that the provenance and source nodes are properly defined in the .mcf file.

we added custom namespace support, and some minor changes to how we batch GCS upload to process each import sequentially, but parallelize within theimport.

Added and modified tests. Note that this PR gets rid of usages of provenance node creation from config.json and we will follow up later with code deletion to simplify this codebase.

@codacy-production

codacy-production Bot commented Jun 5, 2026

Copy link
Copy Markdown

Not up to standards ⛔

🔴 Issues 8 high · 28 medium · 59 minor

Alerts:
⚠ 95 issues (≤ 0 issues of at least minor severity)

Results:
95 new issues

Category Results
Compatibility 1 medium
UnusedCode 2 medium
1 minor
BestPractice 16 medium
Documentation 11 minor
ErrorProne 8 high
CodeStyle 47 minor
Complexity 5 medium
Performance 4 medium

View in Codacy

🟢 Metrics 157 complexity · 4 duplication

Metric Results
Complexity 157
Duplication 4

View in Codacy

NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer
TIP This summary will be updated as you push new changes.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a new isBaseDc configuration flag across the ingestion pipeline to support custom Data Commons setups, adds a Terraform deployment configuration, and implements Cloud Workflows and Cloud Run helper services to automate imports, Spanner ingestion, and BigQuery aggregations. It also introduces JsonLdStreamDb to stream JSON-LD shards directly to GCS or disk, alongside a new metadata validator. The review feedback highlights several critical security and reliability improvements, including replacing custom SQL escaping with parameterized queries to prevent SQL injection, restricting broad IAM roles in Terraform, adding timeouts to HTTP requests and polling loops, globally initializing Spanner and Storage clients to reduce latency, fixing a potential null-value bug in the JSON-LD stream database, and refining file-matching logic and test assertions.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread simple/stats/runner.py Outdated
Comment thread pipeline/workflow/ingestion-helper/aggregation_utils.py Outdated
Comment thread pipeline/workflow/ingestion-helper/embedding_utils_test.py
Comment thread pipeline/workflow/import-helper/import_helper.py
Comment thread pipeline/workflow/ingestion-helper/main.py
Comment thread simple/stats/jsonld_stream_db.py
Comment thread pipeline/workflow/cloud_workflow.py
Comment thread pipeline/terraform/main.tf
@gmechali

gmechali commented Jun 8, 2026

Copy link
Copy Markdown
Contributor Author

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces strict metadata validation via a new MetadataValidator class, updates JsonLdStreamDb to process imports sequentially to optimize memory usage, and enhances configuration parsing to support both modern list and legacy dictionary formats. Feedback on these changes focuses on preserving the original order of observations and triples when chunking, explicitly specifying UTF-8 encoding for local file operations, and removing dead code in nodes.py. Additionally, the reviewer suggests handling empty directory names gracefully in runner.py to prevent prefix issues, raising a ValueError for invalid configuration entries to maintain consistency, addressing a Liskov Substitution Principle violation in the validator, and reusing the existing strip_namespace utility to avoid duplicate logic.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

I am having trouble creating individual review comments. Click here to see my feedback.

simple/stats/jsonld_stream_db.py (334-340)

medium

Popping from the end of the list (records.pop()) progressively cleans memory but reverses the order of observations in the generated chunks. To preserve the original order while maintaining the O(1) pop overhead, we can reverse the list once at the beginning of the generator.

    records = self._obs_records.get(import_name, [])
    records.reverse()
    shard_index = 0
    while records:
      chunk = []
      # Pop from the end to avoid O(N) list shifting overhead
      for _ in range(min(_CHUNK_SIZE, len(records))):
        chunk.append(records.pop())

simple/stats/jsonld_stream_db.py (349-355)

medium

Popping from the end of the list (triples.pop()) progressively cleans memory but reverses the order of triples in the generated chunks. To preserve the original order while maintaining the O(1) pop overhead, we can reverse the list once at the beginning of the generator.

    triples = self._triples.get(import_name, [])
    triples.reverse()
    shard_index = 0
    while triples:
      chunk = []
      # Pop from the end to avoid O(N) list shifting overhead
      for _ in range(min(_CHUNK_SIZE, len(triples))):
        chunk.append(triples.pop())

simple/stats/jsonld_stream_db.py (421-425)

medium

When opening local JSON-LD files, it is best practice to explicitly specify encoding="utf-8". This prevents potential UnicodeDecodeError on platforms where the default system encoding is not UTF-8.

    def _copy_single(rel_path: str):
      local_file_path = os.path.join(temp_local_dir, rel_path)
      with open(local_file_path, "r", encoding="utf-8") as f:
        content = f.read()
      target_store.open_file(rel_path).write(content)

simple/stats/nodes.py (184-189)

medium

The check if not prov_name: is dead code because self.config.provenance_name(input_file) always returns a non-empty string (falling back to input_file.path if the provenance property is missing). If the intent is to strictly require a provenance property in config.json for all modes, you should check the configuration directly rather than relying on provenance_name.

simple/stats/runner.py (135)

medium

If a config.json is located at the root directory, fspath.dirname(c.path) will return "" (empty string), resulting in a prefix of "/". This will cause file.path.startswith("/") to fail for all root files. We should handle the empty directory name gracefully.

        self.active_import_prefixes = set(f"{fspath.dirname(c.path)}/" if fspath.dirname(c.path) else "" for c in configs)

simple/stats/runner.py (303-304)

medium

In config.py, a ValueError is raised if an entry in inputFiles is not a dictionary. However, _merge_configs in runner.py silently filters out non-dictionary entries. We should raise a ValueError here as well to maintain consistent validation behavior across single and bulk imports.

      if isinstance(input_files, list):
        for entry in input_files:
          if not isinstance(entry, dict):
            raise ValueError(
                f"Invalid entry in '{_INPUT_FILES_FIELD}': must be a JSON object. Got: {entry}"
            )
          entries.append(entry)

simple/stats/validation.py (89-92)

medium

MetadataValidator type-hints db as Db (the base class), but directly accesses _triples (a private attribute specific to JsonLdStreamDb). This violates the Liskov Substitution Principle (LSP) and tight-couples the validator to a specific subclass. Consider defining a public method or property on Db (or JsonLdStreamDb) to retrieve triples, or restrict the type hint to JsonLdStreamDb.

simple/stats/validation.py (144-152)

medium

The _clean_dcid method duplicates namespace stripping logic that is already implemented in strip_namespace (imported from stats.data). We can simplify this method and leverage the existing utility function.

  def _clean_dcid(self, val: str) -> str:
    """Normalizes a DCID value by ensuring it starts with 'dcid:' and has no prefix namespaces."""
    if val.startswith(("http://", "https://")):
      return val
    return f"dcid:{strip_namespace(val)}"

@gmechali gmechali changed the title Per imp [DCP -Ingestion] Jun 8, 2026
@dwnoble dwnoble self-requested a review June 8, 2026 23:59
Comment thread simple/stats/runner.py Outdated
@gmechali gmechali requested a review from dwnoble June 9, 2026 11:54
@gmechali

gmechali commented Jun 9, 2026

Copy link
Copy Markdown
Contributor Author

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces strict metadata validation for ingestion runs, ensuring referenced provenances and sources are properly defined in MCF files. It also updates the configuration parser to support both modern list-of-objects and legacy dictionary formats, refactors the JSON-LD streaming database to process multiple imports sequentially-parallelly, and updates the ingestion workflow trigger. The review feedback highlights several critical improvements: stripping namespaces in the metadata validator to prevent false failures, supporting both source and sourceLink predicates in the MCF importer, ensuring updated provenances and sources are correctly mapped in registries, preserving global triples when no file-specific imports are processed, adding type checks to is_uri_or_namespace, reusing the multiprocessing.Pool to reduce overhead, reversing lists before popping to preserve order, and truncating/hashing overly long sanitized import names.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread simple/stats/validation.py
Comment thread simple/stats/mcf_importer.py
Comment thread simple/stats/nodes.py
Comment thread simple/stats/nodes.py
Comment thread simple/stats/jsonld_stream_db.py
Comment thread simple/stats/util.py
Comment thread simple/stats/jsonld_stream_db.py
Comment thread simple/stats/jsonld_stream_db.py
Comment thread simple/stats/jsonld_stream_db.py
Comment thread simple/stats/trigger_ingestion_workflow.py

@dwnoble dwnoble left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@gmechali gmechali merged commit 1348c8b into datacommonsorg:master Jun 9, 2026
9 of 10 checks passed
@gmechali gmechali deleted the perImp branch June 9, 2026 18:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants