[DCP -Ingestion] by gmechali · Pull Request #525 · datacommonsorg/import

gmechali · 2026-06-05T18:50:58Z

This PR primarily handles per import ingestion, and along the way we fixed some related things.

Mainly now when we ingest all the data, we must keep the key of which import the data is associated with.

We are also strictly going to use the Provenance as the import name. Instead of defining the provenance in the config.json inline, we now require you top ass it in the inputFiles like this: UNdata or Test

We also did some changes to properly parse all the fields on the Provenance And Source, and add validation that the provenance and source nodes are properly defined in the .mcf file.

we added custom namespace support, and some minor changes to how we batch GCS upload to process each import sequentially, but parallelize within theimport.

Added and modified tests. Note that this PR gets rid of usages of provenance node creation from config.json and we will follow up later with code deletion to simplify this codebase.

codacy-production · 2026-06-05T18:52:01Z

Not up to standards ⛔

🔴 Issues 8 high · 28 medium · 59 minor

Alerts:
⚠ 95 issues (≤ 0 issues of at least minor severity)

Results:
95 new issues

Category Results

Compatibility 1 medium

UnusedCode 2 medium
1 minor

BestPractice 16 medium

Documentation 11 minor

ErrorProne 8 high

CodeStyle 47 minor

Complexity 5 medium

Performance 4 medium

View in Codacy

🟢 Metrics 157 complexity · 4 duplication

Metric Results

Complexity 157

Duplication 4

View in Codacy

_{NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer}
_{TIP This summary will be updated as you push new changes.}

gemini-code-assist

Code Review

This pull request introduces a new isBaseDc configuration flag across the ingestion pipeline to support custom Data Commons setups, adds a Terraform deployment configuration, and implements Cloud Workflows and Cloud Run helper services to automate imports, Spanner ingestion, and BigQuery aggregations. It also introduces JsonLdStreamDb to stream JSON-LD shards directly to GCS or disk, alongside a new metadata validator. The review feedback highlights several critical security and reliability improvements, including replacing custom SQL escaping with parameterized queries to prevent SQL injection, restricting broad IAM roles in Terraform, adding timeouts to HTTP requests and polling loops, globally initializing Spanner and Storage clients to reduce latency, fixing a potential null-value bug in the JSON-LD stream database, and refining file-matching logic and test assertions.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

… fallback for legacy runs.

gmechali · 2026-06-08T21:14:54Z

/gemini review

gemini-code-assist

Code Review

This pull request introduces strict metadata validation via a new MetadataValidator class, updates JsonLdStreamDb to process imports sequentially to optimize memory usage, and enhances configuration parsing to support both modern list and legacy dictionary formats. Feedback on these changes focuses on preserving the original order of observations and triples when chunking, explicitly specifying UTF-8 encoding for local file operations, and removing dead code in nodes.py. Additionally, the reviewer suggests handling empty directory names gracefully in runner.py to prevent prefix issues, raising a ValueError for invalid configuration entries to maintain consistency, addressing a Liskov Substitution Principle violation in the validator, and reusing the existing strip_namespace utility to avoid duplicate logic.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

I am having trouble creating individual review comments. Click here to see my feedback.

simple/stats/jsonld_stream_db.py (334-340)

Popping from the end of the list (records.pop()) progressively cleans memory but reverses the order of observations in the generated chunks. To preserve the original order while maintaining the O(1) pop overhead, we can reverse the list once at the beginning of the generator.

    records = self._obs_records.get(import_name, [])
    records.reverse()
    shard_index = 0
    while records:
      chunk = []
      # Pop from the end to avoid O(N) list shifting overhead
      for _ in range(min(_CHUNK_SIZE, len(records))):
        chunk.append(records.pop())

simple/stats/jsonld_stream_db.py (349-355)

Popping from the end of the list (triples.pop()) progressively cleans memory but reverses the order of triples in the generated chunks. To preserve the original order while maintaining the O(1) pop overhead, we can reverse the list once at the beginning of the generator.

    triples = self._triples.get(import_name, [])
    triples.reverse()
    shard_index = 0
    while triples:
      chunk = []
      # Pop from the end to avoid O(N) list shifting overhead
      for _ in range(min(_CHUNK_SIZE, len(triples))):
        chunk.append(triples.pop())

simple/stats/jsonld_stream_db.py (421-425)

When opening local JSON-LD files, it is best practice to explicitly specify encoding="utf-8". This prevents potential UnicodeDecodeError on platforms where the default system encoding is not UTF-8.

    def _copy_single(rel_path: str):
      local_file_path = os.path.join(temp_local_dir, rel_path)
      with open(local_file_path, "r", encoding="utf-8") as f:
        content = f.read()
      target_store.open_file(rel_path).write(content)

simple/stats/nodes.py (184-189)

The check if not prov_name: is dead code because self.config.provenance_name(input_file) always returns a non-empty string (falling back to input_file.path if the provenance property is missing). If the intent is to strictly require a provenance property in config.json for all modes, you should check the configuration directly rather than relying on provenance_name.

simple/stats/runner.py (135)

If a config.json is located at the root directory, fspath.dirname(c.path) will return "" (empty string), resulting in a prefix of "/". This will cause file.path.startswith("/") to fail for all root files. We should handle the empty directory name gracefully.

        self.active_import_prefixes = set(f"{fspath.dirname(c.path)}/" if fspath.dirname(c.path) else "" for c in configs)

simple/stats/runner.py (303-304)

In config.py, a ValueError is raised if an entry in inputFiles is not a dictionary. However, _merge_configs in runner.py silently filters out non-dictionary entries. We should raise a ValueError here as well to maintain consistent validation behavior across single and bulk imports.

      if isinstance(input_files, list):
        for entry in input_files:
          if not isinstance(entry, dict):
            raise ValueError(
                f"Invalid entry in '{_INPUT_FILES_FIELD}': must be a JSON object. Got: {entry}"
            )
          entries.append(entry)

simple/stats/validation.py (89-92)

MetadataValidator type-hints db as Db (the base class), but directly accesses _triples (a private attribute specific to JsonLdStreamDb). This violates the Liskov Substitution Principle (LSP) and tight-couples the validator to a specific subclass. Consider defining a public method or property on Db (or JsonLdStreamDb) to retrieve triples, or restrict the type hint to JsonLdStreamDb.

simple/stats/validation.py (144-152)

The _clean_dcid method duplicates namespace stripping logic that is already implemented in strip_namespace (imported from stats.data). We can simplify this method and leverage the existing utility function.

  def _clean_dcid(self, val: str) -> str:
    """Normalizes a DCID value by ensuring it starts with 'dcid:' and has no prefix namespaces."""
    if val.startswith(("http://", "https://")):
      return val
    return f"dcid:{strip_namespace(val)}"

gmechali · 2026-06-09T11:55:28Z

/gemini review

gemini-code-assist

Code Review

This pull request introduces strict metadata validation for ingestion runs, ensuring referenced provenances and sources are properly defined in MCF files. It also updates the configuration parser to support both modern list-of-objects and legacy dictionary formats, refactors the JSON-LD streaming database to process multiple imports sequentially-parallelly, and updates the ingestion workflow trigger. The review feedback highlights several critical improvements: stripping namespaces in the metadata validator to prevent false failures, supporting both source and sourceLink predicates in the MCF importer, ensuring updated provenances and sources are correctly mapped in registries, preserving global triples when no file-specific imports are processed, adding type checks to is_uri_or_namespace, reusing the multiprocessing.Pool to reduce overhead, reversing lists before popping to preserve order, and truncating/hashing overly long sanitized import names.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

dwnoble

LGTM

…ecation targets

gemini-code-assist Bot reviewed Jun 5, 2026

View reviewed changes

gmechali and others added 28 commits June 8, 2026 09:28

First try at sqlite-less

3c73983

Multi threaded

d9168a5

More speedu p

2a7ffa6

Logging buffering

132cf2c

Threadit

87cbcb6

more parallelization

4c357cd

Export faster

2a7f6ee

Adds a global counter

e5de7d1

Adds gardbage collection to help performance, and prevent ooms

aadefdc

Bypass rdflib for speed.

fdad734

Remove gccollect

dbd4c03

Use native GCS

c4d4659

More performance fix

0d99f46

Minor cleanup

766d000

Lint

68ef5d7

Major cleanup on the code

83bb3c4

Minor cleanup

3f99685

Lint

5c7c433

More cleanups and comments

2283706

Per-import split

0bce6cf

Cloud build fix (datacommonsorg#520)

3e6a494

Per import support, adds validation and more

3589ebf

Fix the import paths

3d827c7

Path fix

7775828

fixing importname

9f22040

clean up import name, clean up lots of other things.

e20a993

Clean up the default nodes and sources

abfaeb8

not sure anymore

85b0931

gmechali added 8 commits June 8, 2026 14:55

Allow custom namespacing

f2eb02f

Custom namesapce fix

fd80f89

More cleanup to reuse the namesapce helper

a9d5fda

Fix the expected tests to stop using the generated provenance and use…

a1b40c0

… fallback for legacy runs.

Supporting a custom namespace for DCID format

4f2d756

Regex fix

a045ff8

Modify provenance and source schema.

7c24a99

Some minor refactoring

6012652

gmechali commented Jun 8, 2026

View reviewed changes

Comment thread simple/tests/stats/test_data/runner/expected/namespace_prefixes/datacommons.sql

Lint

b7e58bb

gemini-code-assist Bot reviewed Jun 8, 2026

View reviewed changes

gmechali changed the title ~~Per imp~~ [DCP -Ingestion] Jun 8, 2026

dwnoble self-requested a review June 8, 2026 23:59

dwnoble reviewed Jun 9, 2026

View reviewed changes

Comment thread simple/stats/runner.py Outdated

gmechali added 3 commits June 8, 2026 22:08

lint

5a55d9c

tweaks

e43f572

One more lint

6c19a70

gmechali requested a review from dwnoble June 9, 2026 11:54

Merge branch 'master' into perImp

a302123

gemini-code-assist Bot reviewed Jun 9, 2026

View reviewed changes

dwnoble approved these changes Jun 9, 2026

View reviewed changes

gmechali added 5 commits June 9, 2026 11:03

Review gemini comments

9ed348d

Test fix for nl artifacts now gone

56547af

Lint

c9dbd89

Remove old tests that we rendered backwards incompatible. Future depr…

1943d3a

…ecation targets

lint

b720072

gmechali merged commit 1348c8b into datacommonsorg:master Jun 9, 2026
9 of 10 checks passed

gmechali deleted the perImp branch June 9, 2026 18:47

Conversation

gmechali commented Jun 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codacy-production Bot commented Jun 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Not up to standards ⛔

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

gmechali commented Jun 8, 2026

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

simple/stats/jsonld_stream_db.py (334-340)

simple/stats/jsonld_stream_db.py (349-355)

simple/stats/jsonld_stream_db.py (421-425)

simple/stats/nodes.py (184-189)

simple/stats/runner.py (135)

simple/stats/runner.py (303-304)

simple/stats/validation.py (89-92)

simple/stats/validation.py (144-152)

Uh oh!

Uh oh!

gmechali commented Jun 9, 2026

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

dwnoble left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

gmechali commented Jun 5, 2026 •

edited

Loading

codacy-production Bot commented Jun 5, 2026 •

edited

Loading