[DCP -Ingestion]#525
Conversation
Not up to standards ⛔🔴 Issues
|
| Category | Results |
|---|---|
| Compatibility | 1 medium |
| UnusedCode | 2 medium 1 minor |
| BestPractice | 16 medium |
| Documentation | 11 minor |
| ErrorProne | 8 high |
| CodeStyle | 47 minor |
| Complexity | 5 medium |
| Performance | 4 medium |
🟢 Metrics 157 complexity · 4 duplication
Metric Results Complexity 157 Duplication 4
NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer
TIP This summary will be updated as you push new changes.
There was a problem hiding this comment.
Code Review
This pull request introduces a new isBaseDc configuration flag across the ingestion pipeline to support custom Data Commons setups, adds a Terraform deployment configuration, and implements Cloud Workflows and Cloud Run helper services to automate imports, Spanner ingestion, and BigQuery aggregations. It also introduces JsonLdStreamDb to stream JSON-LD shards directly to GCS or disk, alongside a new metadata validator. The review feedback highlights several critical security and reliability improvements, including replacing custom SQL escaping with parameterized queries to prevent SQL injection, restricting broad IAM roles in Terraform, adding timeouts to HTTP requests and polling loops, globally initializing Spanner and Storage clients to reduce latency, fixing a potential null-value bug in the JSON-LD stream database, and refining file-matching logic and test assertions.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
… fallback for legacy runs.
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces strict metadata validation via a new MetadataValidator class, updates JsonLdStreamDb to process imports sequentially to optimize memory usage, and enhances configuration parsing to support both modern list and legacy dictionary formats. Feedback on these changes focuses on preserving the original order of observations and triples when chunking, explicitly specifying UTF-8 encoding for local file operations, and removing dead code in nodes.py. Additionally, the reviewer suggests handling empty directory names gracefully in runner.py to prevent prefix issues, raising a ValueError for invalid configuration entries to maintain consistency, addressing a Liskov Substitution Principle violation in the validator, and reusing the existing strip_namespace utility to avoid duplicate logic.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
I am having trouble creating individual review comments. Click here to see my feedback.
simple/stats/jsonld_stream_db.py (334-340)
Popping from the end of the list (records.pop()) progressively cleans memory but reverses the order of observations in the generated chunks. To preserve the original order while maintaining the O(1) pop overhead, we can reverse the list once at the beginning of the generator.
records = self._obs_records.get(import_name, [])
records.reverse()
shard_index = 0
while records:
chunk = []
# Pop from the end to avoid O(N) list shifting overhead
for _ in range(min(_CHUNK_SIZE, len(records))):
chunk.append(records.pop())simple/stats/jsonld_stream_db.py (349-355)
Popping from the end of the list (triples.pop()) progressively cleans memory but reverses the order of triples in the generated chunks. To preserve the original order while maintaining the O(1) pop overhead, we can reverse the list once at the beginning of the generator.
triples = self._triples.get(import_name, [])
triples.reverse()
shard_index = 0
while triples:
chunk = []
# Pop from the end to avoid O(N) list shifting overhead
for _ in range(min(_CHUNK_SIZE, len(triples))):
chunk.append(triples.pop())
simple/stats/jsonld_stream_db.py (421-425)
When opening local JSON-LD files, it is best practice to explicitly specify encoding="utf-8". This prevents potential UnicodeDecodeError on platforms where the default system encoding is not UTF-8.
def _copy_single(rel_path: str):
local_file_path = os.path.join(temp_local_dir, rel_path)
with open(local_file_path, "r", encoding="utf-8") as f:
content = f.read()
target_store.open_file(rel_path).write(content)
simple/stats/nodes.py (184-189)
The check if not prov_name: is dead code because self.config.provenance_name(input_file) always returns a non-empty string (falling back to input_file.path if the provenance property is missing). If the intent is to strictly require a provenance property in config.json for all modes, you should check the configuration directly rather than relying on provenance_name.
simple/stats/runner.py (135)
If a config.json is located at the root directory, fspath.dirname(c.path) will return "" (empty string), resulting in a prefix of "/". This will cause file.path.startswith("/") to fail for all root files. We should handle the empty directory name gracefully.
self.active_import_prefixes = set(f"{fspath.dirname(c.path)}/" if fspath.dirname(c.path) else "" for c in configs)
simple/stats/runner.py (303-304)
In config.py, a ValueError is raised if an entry in inputFiles is not a dictionary. However, _merge_configs in runner.py silently filters out non-dictionary entries. We should raise a ValueError here as well to maintain consistent validation behavior across single and bulk imports.
if isinstance(input_files, list):
for entry in input_files:
if not isinstance(entry, dict):
raise ValueError(
f"Invalid entry in '{_INPUT_FILES_FIELD}': must be a JSON object. Got: {entry}"
)
entries.append(entry)
simple/stats/validation.py (89-92)
MetadataValidator type-hints db as Db (the base class), but directly accesses _triples (a private attribute specific to JsonLdStreamDb). This violates the Liskov Substitution Principle (LSP) and tight-couples the validator to a specific subclass. Consider defining a public method or property on Db (or JsonLdStreamDb) to retrieve triples, or restrict the type hint to JsonLdStreamDb.
simple/stats/validation.py (144-152)
The _clean_dcid method duplicates namespace stripping logic that is already implemented in strip_namespace (imported from stats.data). We can simplify this method and leverage the existing utility function.
def _clean_dcid(self, val: str) -> str:
"""Normalizes a DCID value by ensuring it starts with 'dcid:' and has no prefix namespaces."""
if val.startswith(("http://", "https://")):
return val
return f"dcid:{strip_namespace(val)}"|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces strict metadata validation for ingestion runs, ensuring referenced provenances and sources are properly defined in MCF files. It also updates the configuration parser to support both modern list-of-objects and legacy dictionary formats, refactors the JSON-LD streaming database to process multiple imports sequentially-parallelly, and updates the ingestion workflow trigger. The review feedback highlights several critical improvements: stripping namespaces in the metadata validator to prevent false failures, supporting both source and sourceLink predicates in the MCF importer, ensuring updated provenances and sources are correctly mapped in registries, preserving global triples when no file-specific imports are processed, adding type checks to is_uri_or_namespace, reusing the multiprocessing.Pool to reduce overhead, reversing lists before popping to preserve order, and truncating/hashing overly long sanitized import names.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
This PR primarily handles per import ingestion, and along the way we fixed some related things.
Mainly now when we ingest all the data, we must keep the key of which import the data is associated with.
We are also strictly going to use the Provenance as the import name. Instead of defining the provenance in the config.json inline, we now require you top ass it in the inputFiles like this: UNdata or Test
We also did some changes to properly parse all the fields on the Provenance And Source, and add validation that the provenance and source nodes are properly defined in the .mcf file.
we added custom namespace support, and some minor changes to how we batch GCS upload to process each import sequentially, but parallelize within theimport.
Added and modified tests. Note that this PR gets rid of usages of provenance node creation from config.json and we will follow up later with code deletion to simplify this codebase.