feat: [hca dcp] add hca projects to google datasets catalog (#4806) #4829
Merged
Pull request overview
This PR adds Schema.org Dataset JSON-LD to HCA DCP project detail pages so Google Dataset Search can index them, using a shared schema-org utilities module and a shared <JsonLd /> head renderer to support future dataset-catalog consumers.
Changes:
- Adds shared schema.org types + helpers and a reusable `<JsonLd />` renderer for emitting `application/ld+json` in `<head>`.
- Implements an HCA-specific JSON-LD builder (`buildHcaProjectJsonLd`) and mounts it on HCA project detail routes.
- Exports `transformAccessionURL` for reuse and adds unit tests for the HCA builder.
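To illustrate why JSON-LD placed in a `<script>` tag needs escaping, here is a minimal sketch of an HTML-safe serializer. The PR's actual `escapeJsonForHtml` is not shown on this page, so the body and signature below are assumptions and may differ from the merged code.

```typescript
// Hypothetical sketch: not the PR's implementation.
// Characters that could terminate a <script> block or break JS parsers are
// replaced with JSON unicode escapes, which JSON.parse decodes back.
const HTML_UNSAFE_CHARS: Record<string, string> = {
  "&": "\\u0026",
  "<": "\\u003c",
  ">": "\\u003e",
  "\u2028": "\\u2028",
  "\u2029": "\\u2029",
};

function escapeJsonForHtml(value: object): string {
  return JSON.stringify(value).replace(
    /[<>&\u2028\u2029]/g,
    (ch) => HTML_UNSAFE_CHARS[ch] ?? ch
  );
}
```

Because the replacements are valid JSON string escapes, a consumer that parses the script body recovers the original values unchanged.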
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| `pages/[entityListType]/[...params].tsx` | Passes `browserURL` into props and conditionally renders JSON-LD on HCA project detail pages. |
| `app/viewModelBuilders/azul/hca-dcp/common/accessionMapper/accessionMapper.ts` | Exports `transformAccessionURL` for schema-org builders to reuse. |
| `app/utils/schemaOrg/hcaProjectDataset.ts` | Builds Schema.org Dataset JSON-LD for HCA projects (name/description/identifier/sameAs/keywords/creator/citation). |
| `app/utils/schemaOrg/common.ts` | Introduces shared schema.org type definitions and helper utilities (HTML stripping, truncation, JSON escaping, dedupe). |
| `app/components/Detail/components/JsonLd/jsonLd.tsx` | Adds a shared Next.js `<Head>` JSON-LD script renderer using HTML-escaped JSON. |
| `__tests__/utils/schemaOrg/hcaProjectDataset.test.ts` | Adds unit tests covering required fields, truncation, and conditional field omission. |
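The helper utilities named in the table (HTML stripping, truncation, dedupe) could look roughly like the sketch below. The function names come from the PR description, but the bodies, the parameter lists, and the 5000-character default (chosen here to match Google's documented upper bound for Dataset descriptions) are assumptions rather than the merged code.

```typescript
// Hypothetical sketches: names match the PR description, bodies are assumed.

// Removes tags with a simple regex and collapses the leftover whitespace.
// Good enough for short descriptions; not a general-purpose HTML sanitizer.
function stripHtmlTags(html: string): string {
  return html.replace(/<[^>]*>/g, " ").replace(/\s+/g, " ").trim();
}

// Cuts text to at most maxLength characters, ending with "..." when cut.
function truncateDescription(text: string, maxLength = 5000): string {
  if (text.length <= maxLength) return text;
  return text.slice(0, maxLength - 3).trimEnd() + "...";
}

// Trims each value, drops empty or missing ones, and keeps first occurrences.
function uniqueNonEmpty(
  values: ReadonlyArray<string | null | undefined>
): string[] {
  const seen = new Set<string>();
  for (const value of values) {
    const trimmed = value?.trim();
    if (trimmed) seen.add(trimmed);
  }
  return Array.from(seen);
}
```

Keeping these as pure string functions makes them straightforward to unit test without rendering a page.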
This was referenced May 13, 2026
NoopDog approved these changes May 14, 2026
Summary
Adds Schema.org Dataset JSON-LD to HCA DCP project detail pages so Google Dataset Search can index them. Mirrors the NCPI Dataset Catalog pattern with a multi-consumer structure so the follow-up issues (#4807 AnVIL, #4808 LungMAP) can plug in a per-consumer builder without further infrastructure changes.
New files:
- `app/utils/schemaOrg/types.ts`: shared types (`SchemaDataset`, `SchemaPerson`, etc.).
- `app/utils/schemaOrg/utils.ts`: helpers (`stripHtmlTags`, `truncateDescription`, `escapeJsonForHtml`, `uniqueNonEmpty`).
- `app/utils/schemaOrg/constants.ts`: shared constants.
- `app/utils/schemaOrg/hcaProjectDataset.ts`: `buildHcaProjectJsonLd(data, browserURL)`.
- `app/views/EntityDetailView/components/JsonLd/jsonLd.tsx`: shared `<script type="application/ld+json">` renderer used by all consumers.
- `__tests__/utils/schemaOrg/hcaProjectDataset.test.ts`: 11 unit tests.

Wired in:
- `pages/[entityListType]/[...params].tsx`: added `browserURL` to props and mounted `<JsonLd>` conditionally for HCA DCP project routes via `isHcaDcp` (mirrors the existing `isAnVIL` guard pattern).
- `app/viewModelBuilders/azul/hca-dcp/common/accessionMapper/accessionMapper.ts`: exported the existing `transformAccessionURL` so the schema-org builder reuses it.

Closes #4806. Follow-ups: #4807 (AnVIL datasets) and #4808 (LungMAP projects). The shared `types.ts` / `utils.ts` / `constants.ts` + `JsonLd` component are designed so those follow-ups can drop in a parallel `build*JsonLd` and one extra conditional guard.

Ticket scope audit
- `name`, `description` (required)
- `identifier`, `url`, `sameAs`, `includedInDataCatalog`, `isAccessibleForFree`, `keywords`, `creator`, `citation`
- `funder`, `license`, `distribution`
- `measurementTechnique`: `libraryConstructionApproach` currently surfaces via `keywords`; can split out in a follow-up
- `variableMeasured`
- `version`

Implementation steps 1–4 from the issue are done. Steps 5 (Google Rich Results Test) and 6 (Search Console reindex) are manual post-merge actions.
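As a rough illustration of the builder shape described above, the sketch below assembles a Schema.org Dataset object and omits optional fields when they are empty. The input type `ProjectSketch`, the output type `DatasetSketch`, the function name, and the catalog name are all hypothetical stand-ins; the real `buildHcaProjectJsonLd` takes the project response from the data layer and also emits `citation`, which is left out here.

```typescript
// Hypothetical input shape: the real builder reads the HCA project response.
interface ProjectSketch {
  accessionUrls: string[];
  contributors: string[];
  description: string;
  keywords: string[];
  projectId: string;
  title: string;
}

// Subset of Schema.org Dataset fields listed in the scope audit.
interface DatasetSketch {
  "@context": "https://schema.org";
  "@type": "Dataset";
  creator?: { "@type": "Person"; name: string }[];
  description: string;
  identifier: string;
  includedInDataCatalog: { "@type": "DataCatalog"; name: string; url: string };
  isAccessibleForFree: boolean;
  keywords?: string[];
  name: string;
  sameAs?: string[];
  url: string;
}

function buildProjectJsonLdSketch(
  project: ProjectSketch,
  browserURL: string
): DatasetSketch {
  const dataset: DatasetSketch = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    description: project.description,
    identifier: project.projectId,
    // Catalog name is a placeholder, not taken from the PR.
    includedInDataCatalog: {
      "@type": "DataCatalog",
      name: "Example data catalog",
      url: browserURL,
    },
    isAccessibleForFree: true, // HCA data is open, per the PR description.
    name: project.title,
    url: `${browserURL}/projects/${project.projectId}`,
  };
  // Optional fields are added only when they carry data.
  if (project.accessionUrls.length > 0) dataset.sameAs = project.accessionUrls;
  if (project.keywords.length > 0) dataset.keywords = project.keywords;
  if (project.contributors.length > 0) {
    dataset.creator = project.contributors.map((contributor) => ({
      "@type": "Person" as const,
      name: contributor,
    }));
  }
  return dataset;
}
```

Omitting empty arrays keeps the emitted payload free of fields that validators might flag as present but blank.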
NCPI mirror: parity check
Structural parity with NCPI's `app/utils/schemaOrg.ts` + `studyJsonLd.tsx`:
- `<script type="application/ld+json">` via `next/head`
- `build*JsonLd(entity, browserURL): SchemaDataset`
- `[...params].tsx`

Deliberate deviations (multi-consumer support):
- `types.ts` / `utils.ts` / `constants.ts` (shared) + `hcaProjectDataset.ts` (HCA-specific); shared `JsonLd` renderer instead of a per-consumer component.
- `isAccessibleForFree: true` vs NCPI's `false`: HCA is open, dbGaP is gated.
- `sameAs` as an array vs NCPI's single string: HCA projects can have multiple accessions across namespaces.

Known gaps vs NCPI worth flagging (not blocking; can address in follow-up):
- NCPI caps citations (`MAX_CITATIONS`) and sorts by citation count; we include all publications. HCA's `PublicationResponse` doesn't expose citation count, but we should still consider capping for prolific projects to keep payload size predictable.
- NCPI derives `measurementTechnique` from `dataType`. We currently route `libraryConstructionApproach` into `keywords`; we could split it into its own `measurementTechnique` field.
- NCPI's `parseAuthors` builds `ScholarlyArticle.author` arrays from comma-separated author strings. HCA's `PublicationResponse` has no author field, so this is N/A here.

Test plan
- `npx tsc --noEmit` passes
- `npm run lint`, `npm run check-format` pass
- `npx jest __tests__/utils/schemaOrg`: 11/11 tests pass
- `npm run build-ma-dev:hca-dcp` succeeds; spot-checked a real project page (`/projects/<uuid>`): `<script type="application/ld+json">` present with full payload (description, identifier, sameAs, keywords, creator, etc.)
- `npm run build:anvil-cmg`, `npm run build-dev:anvil-catalog`, `npm run build-dev:lungmap`: all build clean; verified lungmap project page emits zero JSON-LD (correctly gated by `isHcaDcp`)

🤖 Generated with Claude Code