feat: [anvil dx] add anvil datasets to google datasets catalog (#4807) #4831
Open
frano-m wants to merge 2 commits into
Pull request overview
Adds Schema.org Dataset JSON-LD to AnVIL CMG dataset detail pages so Google Dataset Search can index them, mirroring the HCA pattern from #4829. The PR also generalises a description helper and consolidates per-consumer JSON-LD render wrappers into a single generic helper to make the upcoming LungMAP integration (#4808) a one-liner.
Changes:
- New `buildAnvilDatasetJsonLd` builder (+ 11 unit tests) covering required Dataset fields, dbGaP `sameAs` URLs, and aggregated keyword union.
- Promoted `buildDescription` from `hcaProjectDataset.ts` into the shared `schemaOrg/utils.ts` with a caller-owned `fallbackSuffix`.
- Replaced per-consumer `renderHcaProjectJsonLd` with a generic `renderJsonLd<T>(props, entityListType, build)` and mounted it for both `isAnVIL`/`datasets` and `isHcaDcp`/`projects`.
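
The generic helper described above can be pictured roughly as follows. This is a minimal sketch, not the PR's code: the real helper presumably returns a React element via the shared `JsonLd` component, whereas this version returns the serialized JSON string (or `null`) so that it runs without React. The `PageProps` shape, the `JsonLdBuilder` type, and the `BROWSER_URL` constant are assumptions made for illustration.

```typescript
// Hypothetical page-props shape; the real props type in the PR may differ.
interface PageProps {
  data?: unknown;
  entityListType: string;
}

// A builder turns one entity response into a JSON-LD object, or undefined to skip.
type JsonLdBuilder<T> = (
  data: T,
  browserURL: string
) => Record<string, unknown> | undefined;

// Assumed site origin, for illustration only.
const BROWSER_URL = "https://example.org";

function renderJsonLd<T>(
  props: PageProps,
  entityListType: string,
  build: JsonLdBuilder<T>
): string | null {
  // Gate on route: only the matching entity list emits JSON-LD.
  if (props.entityListType !== entityListType) return null;
  // Gate on data: pages without entity data emit nothing.
  if (props.data === undefined || props.data === null) return null;
  const jsonLd = build(props.data as T, BROWSER_URL);
  return jsonLd ? JSON.stringify(jsonLd) : null;
}

// Usage with a toy builder (not one of the PR's builders).
interface ToyEntity {
  title: string;
}
const buildToy: JsonLdBuilder<ToyEntity> = (d) => ({
  "@type": "Dataset",
  name: d.title,
});

const emitted = renderJsonLd(
  { data: { title: "Demo" }, entityListType: "datasets" },
  "datasets",
  buildToy
);
const skipped = renderJsonLd(
  { data: { title: "Demo" }, entityListType: "projects" },
  "datasets",
  buildToy
);
```

Passing the builder as an argument is what lets one helper serve several consumers: adding another site only requires a new builder and a new mount line.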
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| `app/utils/schemaOrg/anvilDataset.ts` | New AnVIL Dataset JSON-LD builder, including keyword/sameAs helpers. |
| `app/utils/schemaOrg/utils.ts` | Adds shared `buildDescription` previously local to HCA. |
| `app/utils/schemaOrg/hcaProjectDataset.ts` | Migrates to shared `buildDescription`; net code reduction. |
| `pages/[entityListType]/[...params].tsx` | Generic `renderJsonLd<T>` helper; mounts AnVIL + HCA JSON-LD conditionally. |
| `__tests__/utils/schemaOrg/anvilDataset.test.ts` | New 11-case unit test suite for the AnVIL builder. |
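
To make the keyword/`sameAs` helpers and the shared `buildDescription` more concrete, here is a minimal sketch under stated assumptions. Only the dbGaP study URL pattern comes from the PR description; the `ToyDataset` shape and its field names, the 50-character minimum, the fallback phrase, and all helper bodies are hypothetical and are not the PR's implementation.

```typescript
// Hypothetical, simplified stand-in for one AnVIL dataset entry.
interface ToyDataset {
  title: string;
  description: string;
  phsIds: string[]; // dbGaP study accessions (placeholder values below)
  diseases: string[];
  dataTypes: string[];
}

// URL pattern quoted in the PR description.
const DBGAP_STUDY_URL =
  "https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=";

// Assumed minimum description length; the PR's actual threshold is not shown.
const MIN_DESCRIPTION_LENGTH = 50;

// Trim, drop empty strings, and de-duplicate while keeping first-seen order.
function uniqueNonEmpty(values: string[]): string[] {
  const seen = new Set<string>();
  for (const value of values) {
    const trimmed = value.trim();
    if (trimmed) seen.add(trimmed);
  }
  return Array.from(seen);
}

// Pad short descriptions with a caller-owned suffix (assumed behaviour).
function buildDescription(text: string, fallbackSuffix: string): string {
  const trimmed = text.trim();
  if (trimmed.length >= MIN_DESCRIPTION_LENGTH) return trimmed;
  return trimmed ? `${trimmed} ${fallbackSuffix}` : fallbackSuffix;
}

function buildToyDatasetJsonLd(d: ToyDataset, browserURL: string) {
  return {
    "@context": "https://schema.org",
    "@type": "Dataset",
    name: d.title,
    description: buildDescription(
      d.description,
      "Dataset hosted in the AnVIL Data Explorer." // hypothetical phrasing
    ),
    url: browserURL,
    // One dbGaP study link per distinct accession.
    sameAs: uniqueNonEmpty(d.phsIds).map(
      (phs) => `${DBGAP_STUDY_URL}${encodeURIComponent(phs)}`
    ),
    // Union of keyword sources, de-duplicated.
    keywords: uniqueNonEmpty(d.diseases.concat(d.dataTypes)),
    isAccessibleForFree: true,
  };
}

const example = buildToyDatasetJsonLd(
  {
    title: "Toy study",
    description: "Short text.",
    phsIds: ["phs000000", "phs000000", ""],
    diseases: ["asthma"],
    dataTypes: ["WGS", "asthma"],
  },
  "https://example.org/datasets/toy"
);
```

In this sketch the duplicate and empty accessions collapse to a single `sameAs` link, and the short description is padded with the caller's suffix rather than a phrase hard-coded in the shared helper.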
Summary
Adds Schema.org Dataset JSON-LD to AnVIL CMG dataset detail pages so Google Dataset Search can index them. Mirrors the HCA pattern from #4806: same shared `JsonLd` component, same shared `buildDescription` / `escapeJsonForHtml` / `stripHtmlTags` / `truncateDescription` / `uniqueNonEmpty` helpers, same `<JsonLd>` mount path.

New files:
- `app/utils/schemaOrg/anvilDataset.ts`: `buildAnvilDatasetJsonLd(data, browserURL)` for AnVIL `DatasetsResponse`.
- `__tests__/utils/schemaOrg/anvilDataset.test.ts`: 11 unit tests.

Modified:
- `app/utils/schemaOrg/utils.ts`: generalised `buildDescription` (was local to `hcaProjectDataset.ts`) so HCA + AnVIL + future LungMAP can all reuse it. Takes a caller-owned `fallbackSuffix` so each consumer controls its own padding phrasing.
- `app/utils/schemaOrg/hcaProjectDataset.ts`: uses the shared `buildDescription` instead of declaring locally. Net `-31/+5` lines.
- `pages/[entityListType]/[...params].tsx`: unified `renderJsonLd<T>(props, entityListType, build)` generic helper replaces the per-consumer `renderHcaProjectJsonLd` / `renderAnvilDatasetJsonLd`. Mount path is now `{isAnVIL && renderJsonLd(props, "datasets", buildAnvilDatasetJsonLd)}` and `{isHcaDcp && renderJsonLd(props, "projects", buildHcaProjectJsonLd)}`. #4808 (LungMAP) drops in as a one-liner on top.

Closes #4807. Stacked on #4829 (the HCA PR). Once #4829 merges, rebase this PR's base to `main`.

Ticket scope audit (MVP)
- `name`, `description` (required)
- `identifier`, `url`, `sameAs`, `includedInDataCatalog`, `isAccessibleForFree`, `keywords`
- `creator`: `consortium` field on `DatasetEntity`; deferred
- `funder`, `license`, `distribution`, `variableMeasured`
- `measurementTechnique`

`sameAs` URLs go to dbGaP (`https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=<phs>`), AnVIL's only external reference type; no identifiers.org mapping needed.

`isAccessibleForFree: true` for all AnVIL datasets per Google's spec (the flag is the inverse of "paid", not "unrestricted access"; dbGaP gating doesn't make a dataset "paid").

Test plan
- `npx tsc --noEmit` passes
- `npm run lint`, `npm run check-format` pass
- `npx jest __tests__/utils/schemaOrg`: 25/25 tests pass (14 HCA + 11 AnVIL)
- `npm run build:anvil-cmg` succeeds; 375/422 dataset detail pages emit JSON-LD (47 omissions are sub-tab/export routes where `processEntityProps` short-circuits, the same gating as HCA's project pages)
- `npm run build-ma-dev:hca-dcp`: HCA still emits JSON-LD (110/116 project pages)
- `npm run build-dev:lungmap`, `npm run build-dev:anvil-catalog`: clean builds, no JSON-LD (correctly gated)

🤖 Generated with Claude Code