feat: [hca dcp] add hca projects to google datasets catalog (#4806)#4829

Merged: NoopDog merged 3 commits into main from fran/4806-hca-dcp-google-datasets-jsonld on May 14, 2026

@frano-m (Contributor) commented May 13, 2026

Summary

Adds Schema.org Dataset JSON-LD to HCA DCP project detail pages so Google Dataset Search can index them. Mirrors the NCPI Dataset Catalog pattern with a multi-consumer structure so the follow-up issues (#4807 AnVIL, #4808 LungMAP) can plug in a per-consumer builder without further infrastructure changes.

New files:

  • app/utils/schemaOrg/types.ts — shared types (SchemaDataset, SchemaPerson, etc.).
  • app/utils/schemaOrg/utils.ts — helpers (stripHtmlTags, truncateDescription, escapeJsonForHtml, uniqueNonEmpty).
  • app/utils/schemaOrg/constants.ts — shared constants.
  • app/utils/schemaOrg/hcaProjectDataset.ts — buildHcaProjectJsonLd(data, browserURL).
  • app/views/EntityDetailView/components/JsonLd/jsonLd.tsx — shared <script type="application/ld+json"> renderer used by all consumers.
  • __tests__/utils/schemaOrg/hcaProjectDataset.test.ts — 11 unit tests.
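
For readers unfamiliar with the helpers listed above, here is a minimal sketch of what three of them might do. The function names come from this PR, but the bodies are assumptions rather than the code in app/utils/schemaOrg/utils.ts, and the 5000-character limit is taken from the parity notes further down.

```typescript
// Hypothetical sketch of three helpers named in this PR; the real
// implementations in app/utils/schemaOrg/utils.ts may differ.

// Assumed limit, based on the "truncate-to-5000" note in the parity check.
const MAX_DESCRIPTION_LENGTH = 5000;

// Replaces anything that looks like an HTML tag with a space, then
// collapses runs of whitespace and trims the ends.
function stripHtmlTags(html: string): string {
  return html.replace(/<[^>]*>/g, " ").replace(/\s+/g, " ").trim();
}

// Cuts text to at most `max` characters; a cut string ends with an ellipsis.
function truncateDescription(
  text: string,
  max: number = MAX_DESCRIPTION_LENGTH
): string {
  if (text.length <= max) return text;
  return text.slice(0, max - 1).trimEnd() + "…";
}

// Drops empty or whitespace-only values and removes duplicates,
// keeping the order in which values were first seen.
function uniqueNonEmpty(values: Array<string | null | undefined>): string[] {
  const seen = new Set<string>();
  for (const value of values) {
    const trimmed = value?.trim();
    if (trimmed) seen.add(trimmed);
  }
  return Array.from(seen);
}
```

Stripping before truncating matters in this sketch: cutting first could leave a half-open tag that the strip step would then fail to match.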

Wired in:

  • pages/[entityListType]/[...params].tsx — added browserURL to props, mount <JsonLd> conditionally for HCA DCP project routes via isHcaDcp (mirrors the existing isAnVIL guard pattern).
  • app/viewModelBuilders/azul/hca-dcp/common/accessionMapper/accessionMapper.ts — exported existing transformAccessionURL so the schema-org builder reuses it.

Closes #4806. Follow-ups: #4807 (AnVIL datasets) and #4808 (LungMAP projects). The shared types.ts/utils.ts/constants.ts + JsonLd component are designed for those to drop in a parallel build*JsonLd and one extra conditional guard.
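
The "one extra conditional guard" idea could look roughly like the sketch below. Everything in it is illustrative: PageContext, selectJsonLd, and the stub builder are hypothetical names, the treatment of browserURL as the page URL is an assumption, and the real page derives isHcaDcp from its own site configuration.

```typescript
// Hypothetical sketch of the per-consumer guard; not the PR's actual code.
type JsonLdPayload = Record<string, unknown>;

interface PageContext {
  isHcaDcp: boolean; // assumed to come from site config, as in the PR
  entityListType: string; // e.g. "projects"
  isDetailPage: boolean;
  entity: { projectTitle: string };
  browserURL: string; // assumed here to be the URL of the page being rendered
}

// Stand-in for buildHcaProjectJsonLd; only the (entity, browserURL)
// signature shape is taken from the PR description.
function buildHcaStub(
  entity: { projectTitle: string },
  browserURL: string
): JsonLdPayload {
  return {
    "@context": "https://schema.org",
    "@type": "Dataset",
    name: entity.projectTitle,
    url: browserURL,
  };
}

// Returns a payload only for HCA DCP project detail routes. Other sites
// get undefined, so the page renders no JSON-LD script for them.
function selectJsonLd(ctx: PageContext): JsonLdPayload | undefined {
  if (ctx.isHcaDcp && ctx.isDetailPage && ctx.entityListType === "projects") {
    return buildHcaStub(ctx.entity, ctx.browserURL);
  }
  // Follow-up consumers (AnVIL, LungMAP) would add their own guard here.
  return undefined;
}
```

Returning undefined for non-matching sites keeps the caller simple: it mounts the shared renderer only when a payload exists.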

Ticket scope audit

Field | Status
name, description (required) | ✅ implemented
identifier, url, sameAs, includedInDataCatalog, isAccessibleForFree, keywords, creator, citation | ✅ implemented
funder, license, distribution | ⏸ deferred — issue notes these as "TBD / confirm with team"
measurementTechnique | ❌ not in this MVP — libraryConstructionApproach currently surfaces via keywords; can split out in a follow-up
variableMeasured | ❌ not in this MVP — listed as optional in the issue
version | ❌ N/A — HCA projects aren't versioned the way dbGaP studies are

Implementation steps 1–4 from the issue are done. Steps 5 (Google Rich Results Test) and 6 (Search Console reindex) are manual post-merge actions.
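
To make the audit concrete, the sketch below shows how a builder might emit the required fields unconditionally and leave optional fields out when there is no data. ProjectLike, DatasetJsonLd, and buildProjectJsonLd are simplified, hypothetical stand-ins for the PR's real types and builder, and the sketch assumes browserURL is a base URL without a trailing slash.

```typescript
// Hypothetical, simplified shapes; the PR's SchemaDataset and project
// response types are richer than this.
interface ProjectLike {
  projectId: string;
  projectTitle: string;
  projectDescription?: string;
  accessionUrls?: string[];
  contributors?: string[];
}

interface DatasetJsonLd {
  "@context": "https://schema.org";
  "@type": "Dataset";
  name: string;
  description: string;
  identifier: string;
  url: string;
  isAccessibleForFree: boolean;
  sameAs?: string[];
  creator?: Array<{ "@type": "Person"; name: string }>;
}

function buildProjectJsonLd(
  project: ProjectLike,
  browserURL: string // assumed: base URL without a trailing slash
): DatasetJsonLd {
  const dataset: DatasetJsonLd = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    name: project.projectTitle,
    // description is required, so fall back to the title when it is missing.
    description: project.projectDescription?.trim() || project.projectTitle,
    identifier: project.projectId,
    url: `${browserURL}/projects/${project.projectId}`,
    isAccessibleForFree: true,
  };
  // Optional fields are added only when non-empty, so they are omitted
  // from the output rather than emitted as empty arrays.
  const sameAs = project.accessionUrls ?? [];
  if (sameAs.length > 0) dataset.sameAs = sameAs;
  const creators = project.contributors ?? [];
  if (creators.length > 0) {
    dataset.creator = creators.map((name) => ({
      "@type": "Person" as const,
      name,
    }));
  }
  return dataset;
}
```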

NCPI mirror — parity check

Structural parity with NCPI's app/utils/schemaOrg.ts + studyJsonLd.tsx:

  • ✅ HTML-strip + truncate-to-5000 + ellipsis description handling
  • ✅ HTML-escaped JSON inside <script type="application/ld+json"> via next/head
  • ✅ Builder signature build*JsonLd(entity, browserURL): SchemaDataset
  • ✅ Page integration via a guarded conditional render in [...params].tsx
  • ✅ Test coverage shape (~11 cases: required fields, fallbacks, truncation, conditional omissions)
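
The HTML-escaping point above is the one most likely to cause trouble if missed, so here is a minimal sketch of the idea. The name escapeJsonForHtml comes from this PR, but the set of escaped characters and the renderJsonLdScript helper are assumptions; the real component renders through next/head rather than building a string.

```typescript
// Hypothetical sketch; the PR's escapeJsonForHtml may escape a different
// set of characters.
const HTML_UNSAFE = /[<>&\u2028\u2029]/g;

// Serializes an object to JSON that is safe to embed inside a <script>
// element. "<", ">", "&" and the line separators U+2028/U+2029 become
// \uXXXX escapes, which JSON.parse turns back into the original characters.
function escapeJsonForHtml(value: object): string {
  return JSON.stringify(value).replace(HTML_UNSAFE, (char) => {
    const hex = char.charCodeAt(0).toString(16);
    return "\\u" + ("0000" + hex).slice(-4);
  });
}

// Illustrative string renderer for the script tag.
function renderJsonLdScript(value: object): string {
  return `<script type="application/ld+json">${escapeJsonForHtml(value)}</script>`;
}
```

Without the escape, a project description containing the text `</script>` would close the tag early; with it, the payload still parses back to the same value.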

Deliberate deviations (multi-consumer support):

  • Single NCPI file → split into types.ts/utils.ts/constants.ts (shared) + hcaProjectDataset.ts (HCA-specific); shared JsonLd renderer instead of per-consumer component.
  • isAccessibleForFree: true vs NCPI's false — HCA is open, dbGaP is gated.
  • sameAs as array vs NCPI's single string — HCA projects can have multiple accessions across namespaces.

Known gaps vs NCPI worth flagging (not blocking; can address in follow-up):

  1. NCPI caps citations at 5 (MAX_CITATIONS) and sorts by citation count; we include all publications. HCA's PublicationResponse doesn't expose citation count, but we should still consider capping for prolific projects to keep payload size predictable.
  2. NCPI emits measurementTechnique from dataType. We currently route libraryConstructionApproach into keywords; could split it into its own measurementTechnique field.
  3. NCPI's parseAuthors builds ScholarlyArticle.author arrays from comma-separated author strings. HCA's PublicationResponse has no author field, so this is N/A here.
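
If gap 1 is picked up, a cap could be sketched along these lines. MAX_CITATIONS = 5 mirrors the NCPI value mentioned above; PublicationLike, its optional citationCount, and buildCitations are hypothetical, since the PR notes that HCA's PublicationResponse does not expose a citation count.

```typescript
// Mirrors the NCPI cap described above; applying it to HCA is only a suggestion.
const MAX_CITATIONS = 5;

// Hypothetical input shape. citationCount is optional because the PR
// notes that HCA's PublicationResponse does not provide it.
interface PublicationLike {
  title: string;
  url?: string;
  citationCount?: number;
}

interface ScholarlyArticle {
  "@type": "ScholarlyArticle";
  name: string;
  url?: string;
}

function buildCitations(
  publications: PublicationLike[],
  max: number = MAX_CITATIONS
): ScholarlyArticle[] {
  // Sort a copy by citation count, highest first. Array.prototype.sort is
  // stable, so when counts are missing (all treated as 0) the source order
  // is preserved and the cap simply keeps the first `max` entries.
  const ranked = [...publications].sort(
    (a, b) => (b.citationCount ?? 0) - (a.citationCount ?? 0)
  );
  return ranked.slice(0, max).map(({ title, url }) => ({
    "@type": "ScholarlyArticle" as const,
    name: title,
    ...(url ? { url } : {}),
  }));
}
```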

Test plan

  • npx tsc --noEmit passes
  • npm run lint, npm run check-format pass
  • npx jest __tests__/utils/schemaOrg — 11/11 tests pass
  • npm run build-ma-dev:hca-dcp succeeds; spot-checked a real project page (/projects/<uuid>) — <script type="application/ld+json"> present with full payload (description, identifier, sameAs, keywords, creator, etc.)
  • npm run build:anvil-cmg, npm run build-dev:anvil-catalog, npm run build-dev:lungmap — all build clean; verified lungmap project page emits zero JSON-LD (correctly gated by isHcaDcp)
  • Validate output against Google's Rich Results Test and Schema Markup Validator for representative projects (single-cell, spatial, multi-organ) after deploy
  • Request indexing via Google Search Console post-merge
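
The manual spot-checks in this plan could also be scripted against built HTML. The helper below is a hypothetical sketch, not part of this PR: it pulls every JSON-LD script out of an HTML string with a regular expression (good enough for a smoke test, not a general HTML parser) and reports which of the required fields from the scope audit are missing.

```typescript
// Hypothetical spot-check helpers; not part of this PR.

// Returns the parsed contents of every <script type="application/ld+json">
// element found in the given HTML string.
function extractJsonLd(html: string): unknown[] {
  const pattern =
    /<script[^>]*type="application\/ld\+json"[^>]*>([\s\S]*?)<\/script>/gi;
  const results: unknown[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(html)) !== null) {
    results.push(JSON.parse(match[1]));
  }
  return results;
}

// name and description are the required fields per the scope audit.
function missingRequiredFields(dataset: Record<string, unknown>): string[] {
  return ["name", "description"].filter((key) => {
    const value = dataset[key];
    return typeof value !== "string" || value.trim() === "";
  });
}
```

A page that is correctly gated, such as the LungMAP project page mentioned above, would yield an empty array from extractJsonLd.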

🤖 Generated with Claude Code

@frano-m requested a review from Copilot on May 13, 2026 07:29
@frano-m marked this pull request as ready for review on May 13, 2026 07:29
Copilot AI (Contributor) left a comment

Pull request overview

This PR adds Schema.org Dataset JSON-LD to HCA DCP project detail pages so Google Dataset Search can index them, using a shared schema-org utilities module and a shared <JsonLd /> head renderer to support future dataset-catalog consumers.

Changes:

  • Adds shared schema.org types + helpers and a reusable <JsonLd /> renderer for emitting application/ld+json in <head>.
  • Implements an HCA-specific JSON-LD builder (buildHcaProjectJsonLd) and mounts it on HCA project detail routes.
  • Exports transformAccessionURL for reuse and adds unit tests for the HCA builder.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

File | Description
pages/[entityListType]/[...params].tsx | Passes browserURL into props and conditionally renders JSON-LD on HCA project detail pages.
app/viewModelBuilders/azul/hca-dcp/common/accessionMapper/accessionMapper.ts | Exports transformAccessionURL for schema-org builders to reuse.
app/utils/schemaOrg/hcaProjectDataset.ts | Builds Schema.org Dataset JSON-LD for HCA projects (name/description/identifier/sameAs/keywords/creator/citation).
app/utils/schemaOrg/common.ts | Introduces shared schema.org type definitions and helper utilities (HTML stripping, truncation, JSON escaping, dedupe).
app/components/Detail/components/JsonLd/jsonLd.tsx | Adds a shared Next.js <Head> JSON-LD script renderer using HTML-escaped JSON.
__tests__/utils/schemaOrg/hcaProjectDataset.test.ts | Adds unit tests covering required fields, truncation, and conditional field omission.

Comment thread app/utils/schemaOrg/hcaProjectDataset.ts
Comment thread app/utils/schemaOrg/hcaProjectDataset.ts Outdated
Comment thread pages/[entityListType]/[...params].tsx Outdated
@frano-m force-pushed the fran/4806-hca-dcp-google-datasets-jsonld branch from 0abd7b3 to 36b316a on May 14, 2026 04:56
Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.

Comment thread pages/[entityListType]/[...params].tsx
@NoopDog merged commit d6d2b4c into main on May 14, 2026
3 checks passed
@frano-m deleted the fran/4806-hca-dcp-google-datasets-jsonld branch on May 14, 2026 06:10
Development

Successfully merging this pull request may close these issues.

[HCA DCP] Add HCA projects to Google Datasets catalog

3 participants