Skip to content

fix(rdf): scope glossary graph by membership and surface glossary labels#28199

Open
harshach wants to merge 4 commits into
mainfrom
fix/rdf-glossary-graph-scope-and-labels
Open

fix(rdf): scope glossary graph by membership and surface glossary labels#28199
harshach wants to merge 4 commits into
mainfrom
fix/rdf-glossary-graph-scope-and-labels

Conversation

@harshach
Copy link
Copy Markdown
Collaborator

Summary

  • /rdf/glossary/graph?glossaryId=<id> silently ignored its filter and returned every term across every glossary, because the SPARQL bound ?glossary via OPTIONAL { ?term1 om:belongsTo ?glossary } but the JSON-LD context maps GlossaryTerm.glossary to om:belongsToGlossary. The downstream FILTER(?glossary = <…>) was a no-op.
  • The UI hierarchy view rendered each leaked glossary's terms inside a group container whose label was the raw glossary UUID (no glossaryId/group on the response, so the UI fell through glossaryNames[id] ?? id).
  • A handful of cascading projection / read-side bugs surfaced once the predicate was fixed: term mutations dropped rdf:type/rdfs:label/om:belongsToGlossary; empty skos:prefLabel = "" won over rdfs:label and rendered as blank node labels; one-sided RDF deletion left a reverse triple behind; the DB-fallback was scoped via a ListFilter param the table doesn't recognize.

This PR fixes all of the above and adds nine integration tests that lock in the user-visible behaviors end-to-end through both the RDF path and the DB-backed term API.

Backend (openmetadata-service)

  • RdfRepository.buildGlossaryTermGraphQuery — use om:belongsToGlossary, make membership a required triple when scoped, COALESCE glossary label from skos:prefLabel/rdfs:label.
  • RdfRepository.parseGlossaryTermGraphResults — emit group + glossaryId per node, backfill membership when a node is upgraded from term2→term1, blank-resilient label extraction, DB-resolved glossary-name fallback.
  • RdfRepository.addRelationship — additive INSERT DATA instead of storeEntity (which swept identity predicates off the source entity).
  • RdfRepository.removeGlossaryTermRelation — delete both directions so the reverse triple written by EntityRepository.addRelationship doesn't linger.
  • RdfRepository.getGlossaryTermGraphFromDatabase — scope by glossary.id in-memory; ListFilter has no glossary predicate for the table, so the previous addQueryParam was a silent no-op.
  • RdfUpdater.addRelationship / removeRelationship — short-circuit glossaryTerm ⇔ glossaryTerm RELATED_TO; the typed addGlossaryTermRelation path owns those edges with the precise predicate (skos:exactMatch, skos:broader, …) so type changes don't leave stale om:relatedTo triples.
  • RdfPropertyMapper.addTypedProperty — skip blank xsd:string literals (one canonical source of truth on the write side).

UI (openmetadata-ui)

  • convertRdfGraphToOntologyGraph prefers the explicit node.glossaryId from the response over the FQN-prefix heuristic.
  • hierarchyGraphBuilder combo label falls back to a node's group when the glossary id isn't in the caller's visible glossary list — so the container never renders as a raw UUID.
  • GraphNode typed with the new optional glossaryId.

Tests (openmetadata-integration-tests)

RdfGlossaryGraphIT (new, 9 tests):

  1. glossaryIdFilterScopesGraphToRequestedGlossary — only requested glossary's terms returned.
  2. scopedResponseCarriesGlossaryNameAndIdPerNodegroup + glossaryId populated per term.
  3. termNodeLabelFallsBackToNameWhenDisplayNameIsAbsent — label = term name when no displayName, never UUID.
  4. labelTracksDisplayNameLifecycle — displayName: unset → set → renamed → cleared → still resolves correctly through both the RDF graph and the DB term API.
  5. glossaryMembershipSurvivesAddAndDeleteRelationsbelongsToGlossary / label preserved through add+delete cycles; cross-checked via DB.
  6. sameTermPairCanHoldMultipleRelationTypesIndependently — two relation types between the same pair coexist; deleting one leaves the other intact.
  7. changingRelationTypeReplacesOldEdgeWithNewType — delete-then-add to a new type leaves no orphan of the old type.
  8. crossGlossaryRelationDoesNotLeakIntoOtherGlossaryScope — cross-glossary relation does NOT make the source term a member of the target glossary.
  9. glossaryIdFilterReturnsEmptyForGlossaryWithNoTerms — DB-fallback path also scopes correctly.
Tests run: 9, Failures: 0, Errors: 0, Skipped: 0

Test plan

  • mvn -pl openmetadata-integration-tests test -Dtest=RdfGlossaryGraphIT — 9/9 green
  • mvn spotless:apply — clean
  • yarn organize-imports + eslint --fix + prettier on changed UI files — clean
  • Manual UI verification on a local Docker stack with Fuseki enabled (Finance + Marketing glossaries with hierarchy, intra- and cross-glossary relations): scoped graphs render only the requested glossary's terms with proper group labels.

🤖 Generated with Claude Code

The /rdf/glossary/graph endpoint silently ignored its `glossaryId` filter
and returned every term in every glossary, because the SPARQL query bound
`?glossary` via `OPTIONAL { ?term1 om:belongsTo ?glossary }` while the
JSON-LD context (governance.jsonld) maps GlossaryTerm.glossary to
`om:belongsToGlossary`. The downstream `FILTER(?glossary = <…>)` was a
no-op, and the UI hierarchy view rendered every glossary's terms grouped
by their raw UUID — most visibly for cross-glossary scenarios.

Backend changes (openmetadata-service):
  - RdfRepository: use the correct `om:belongsToGlossary` predicate and
    make membership required (not OPTIONAL + FILTER) when scoped, so the
    join drops non-matching terms instead of relying on filter semantics.
  - Emit `group` (glossary name via COALESCE on skos:prefLabel /
    rdfs:label) and `glossaryId` on every term node; fall back to a DB
    lookup so a scoped response always carries the glossary label.
  - Read term names from `rdfs:label` (the actual predicate base.jsonld
    uses for `name`) instead of `om:name`, which is never written.
  - Backfill membership when a node is upgraded from a term2 (edge target)
    on one row to a term1 (primary) on a later row.
  - removeGlossaryTermRelation deletes both directions so the reverse
    triple written by EntityRepository.addRelationship doesn't linger.
  - addRelationship (generic) uses additive INSERT DATA instead of
    storeEntity, which destructively swept rdf:type, rdfs:label and
    om:belongsToGlossary off the source entity when called with a
    relationship-only model.
  - Database fallback in getGlossaryTermGraphFromDatabase now scopes
    in-memory; ListFilter has no first-class `glossary` predicate so
    addQueryParam was a silent no-op that leaked every term.
  - RdfUpdater short-circuits glossaryTerm⇔glossaryTerm RELATED_TO; the
    typed addGlossaryTermRelation / removeGlossaryTermRelation path owns
    those edges with the precise predicate.
  - RdfPropertyMapper skips blank `xsd:string` literals so empty
    displayNames no longer materialize as `skos:prefLabel ""` and win
    over `rdfs:label` at read time.

UI changes (openmetadata-ui):
  - convertRdfGraphToOntologyGraph prefers the explicit `node.glossaryId`
    from the response over the FQN-prefix heuristic.
  - hierarchyGraphBuilder falls back to a term node's `group` when the
    glossary id can't be resolved against the caller's glossary list, so
    the combo container label is never a raw UUID.
  - rdfAPI.interface declares the new `glossaryId` field.

Tests (openmetadata-integration-tests):
  - RdfGlossaryGraphIT covers nine scenarios end-to-end, including
    membership/labels through add+delete relation cycles, displayName
    set/clear lifecycle, two relation types between the same term pair,
    relation-type change (delete + add new), cross-glossary relations,
    and the DB-fallback path for an empty glossary.
Copilot AI review requested due to automatic review settings May 17, 2026 03:09
@harshach harshach requested a review from a team as a code owner May 17, 2026 03:09
@github-actions github-actions Bot added backend safe to test Add this label to run secure Github workflows on PRs labels May 17, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

…scan

Address PR #28199 review: getGlossaryTermGraphFromDatabase previously
loaded every glossary term in the deployment with listAll() and then
filtered by glossary.id in a loop. That defeats the point of scoping
and creates memory / GC pressure on every fallback call.

Now drive the scoping at the DB level by calling
glossaryTermDAO.getNestedTerms(glossaryFqn), which executes
`SELECT json FROM glossary_term_entity WHERE fqnHash LIKE :prefix` — an
indexed prefix scan that only returns the requested glossary's terms.
Parse the result JSON for ids, then bulk-hydrate relatedTerms / parent /
children via EntityRepository.get(uri, ids, fields, include) in one
batched call.

The unscoped branch (glossaryId == null) still uses listAll because the
endpoint is documented to support that case; the regression was strictly
about silently doing a full scan when the caller DID supply a scope.

All 9 RdfGlossaryGraphIT tests remain green.
@harshach
Copy link
Copy Markdown
Collaborator Author

Addressed in e5b8b51.

The DB fallback now drives scoping at the database level via glossaryTermDAO.getNestedTerms(glossaryFqn):

SELECT json FROM glossary_term_entity WHERE fqnHash LIKE :prefix

which is an indexed fqnHash-prefix scan that only returns the requested glossary's terms. We then parse the result JSON for ids and bulk-hydrate relatedTerms / parent / children via EntityRepository.get(uri, ids, fields, include) in a single batched call — no more listAll + in-memory filter.

The unscoped branch (glossaryId == null) is unchanged because that's the documented "all glossaries" mode; the regression was strictly about silently doing a full scan when the caller DID supply a scope.

All 9 RdfGlossaryGraphIT tests still green:

Tests run: 9, Failures: 0, Errors: 0, Skipped: 0

…orts

Per review on PR #28199:

1. Drive the DB fallback through the SAME code path as
   `GET /v1/glossaryTerms?glossary=<id>` instead of a parallel
   getNestedTerms + manual hydration sequence. That endpoint sets
   `filter.addQueryParam("parent", glossaryFqn)`, and
   `ListFilter.getParentCondition` turns it into a fqnHash LIKE
   prefix predicate via `getFqnPrefixCondition` — the same indexed
   prefix scan, but routed through `listAll` so cursor pagination,
   field hydration, and bulk relationship fetching all come for free.
   Net result: one DB-scoped read path used by both the public
   listing endpoint and the RDF graph DB fallback.

2. Replace fully-qualified type names introduced in the earlier
   commit (`org.openmetadata.schema.entity.data.GlossaryTerm`,
   `…Glossary`, `…Include.NON_DELETED`,
   `org.openmetadata.service.jdbi3.GlossaryTermRepository`,
   `…ListFilter`) with proper imports. Also collapsed the pre-existing
   `Relationship` FQN at line 348 since the new import made it
   redundant.

All 9 RdfGlossaryGraphIT tests remain green.
Copilot AI review requested due to automatic review settings May 17, 2026 03:40
@harshach
Copy link
Copy Markdown
Collaborator Author

Good catch — they weren't sharing the path before. e181f94 makes the RDF DB fallback drive through the same code as GET /v1/glossaryTerms?glossary=<id>:

var listFilter = new ListFilter(null);
if (glossaryId != null) {
  var glossary = (Glossary) glossaryRepo.get(null, glossaryId, …);
  listFilter.addQueryParam("parent", glossary.getFullyQualifiedName());
}
var fetched =
    glossaryTermRepository.listAll(
        glossaryTermRepository.getFields("relatedTerms,parent,children"),
        listFilter);

That's exactly what GlossaryTermResource.list(...) does — parent flows through ListFilter.getParentConditiongetFqnPrefixCondition → an indexed fqnHash LIKE '<prefix>.%' predicate on glossary_term_entity. Single read path now serves both the public listing and the RDF graph DB fallback; no parallel getNestedTerms + manual id hydration.

Same commit also collapses the freshly-introduced FQN class names (org.openmetadata.schema.entity.data.GlossaryTerm, etc.) into proper imports per the convention.

All 9 RdfGlossaryGraphIT tests still green.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 17, 2026

Jest test Coverage

UI tests summary

Lines Statements Branches Functions
Coverage: 62%
62.69% (65385/104296) 43.52% (35792/82227) 46.27% (10528/22753)

- RdfPropertyMapperTest.AddTypedPropertyBlankString — direct unit test
  for the blank xsd:string skip in addTypedProperty. Covers blank, all-
  whitespace, populated, and the negative case (skip is intentionally
  scoped to xsd:string so "0" / xsd:integer literals still get emitted).
- RdfUpdaterTest (new) — verifies the glossaryTerm⇔glossaryTerm
  RELATED_TO short-circuit on both addRelationship and removeRelationship:
  short-circuit shape never reaches the repository, while cross-entity
  RELATED_TO (table→glossaryTerm) and other relationship types
  (CONTAINS between glossary terms) still flow through. Uses reflection
  to inject a Mockito mock + Awaitility to bridge the async submit path.
- graphBuilders.test.ts (new) — convertRdfGraphToOntologyGraph prefers
  the explicit node.glossaryId from the RDF response over the
  FQN-prefix heuristic; falls back to node.group then FQN; preserves
  group passthrough so the combo can fall back to it; replaces a
  UUID-shaped label with the last FQN segment.
- hierarchyGraphBuilder.test.ts (new) — combo label resolution order:
  glossaryNames[id] > non-blank node.group > raw glossaryId. Covers
  the missing-from-glossaryNames case that previously rendered as a
  raw UUID in the UI hierarchy view.

All 41 new + existing tests across both modules pass.
@gitar-bot
Copy link
Copy Markdown

gitar-bot Bot commented May 17, 2026

Code Review ✅ Approved 1 resolved / 1 findings

Scopes glossary graph filtering to specific memberships and corrects label extraction to prevent raw UUIDs in the UI. Nine new integration tests confirm full functional coverage and resolved DB fallback inefficiencies.

✅ 1 resolved
Performance: DB fallback loads ALL glossary terms into memory

📄 openmetadata-service/src/main/java/org/openmetadata/service/rdf/RdfRepository.java:1622-1636
The getGlossaryTermGraphFromDatabase method calls glossaryTermRepository.listAll(...) which internally uses dao.listAfter(filter, Integer.MAX_VALUE, "", ""). This fetches every glossary term in the system into memory, then filters by glossaryId in a loop. For deployments with thousands of terms this creates unnecessary memory pressure and GC load on every invocation of the DB-fallback path.

The original code attempted to use listFilter.addQueryParam("glossary", ...) which (as the PR correctly notes) is a silent no-op. However, a better fix would be to use a JDBI query that supports glossary filtering at the DB level rather than loading everything and filtering in-memory.

If a DB-level filter is not feasible short-term, at minimum add a comment documenting the scalability limitation and consider streaming/pagination.

Options

Display: compact → Showing less information.

Comment with these commands to change:

Compact
gitar display:verbose         

Was this helpful? React with 👍 / 👎 | Gitar

@sonarqubecloud
Copy link
Copy Markdown

@sonarqubecloud
Copy link
Copy Markdown

@github-actions
Copy link
Copy Markdown
Contributor

🔴 Playwright Results — 1 failure(s), 8 flaky

✅ 4112 passed · ❌ 1 failed · 🟡 8 flaky · ⏭️ 86 skipped

Shard Passed Failed Flaky Skipped
✅ Shard 1 299 0 0 4
🔴 Shard 2 771 1 2 8
🟡 Shard 3 780 0 4 7
✅ Shard 4 816 0 0 18
🟡 Shard 5 708 0 1 41
🟡 Shard 6 738 0 1 8

Genuine Failures (failed on all attempts)

Features/Glossary/GlossaryWorkflow.spec.ts › should display correct status badge color and icon (shard 2)
Error: �[2mexpect(�[22m�[31mlocator�[39m�[2m).�[22mtoHaveText�[2m(�[22m�[32mexpected�[39m�[2m)�[22m failed

Locator:  locator('[data-row-key*="StatusBadgeTerm1779000283633"]').locator('.status-badge')
Expected: �[32m"Draft"�[39m
Received: �[31m"In Review"�[39m
Timeout:  15000ms

Call log:
�[2m  - Expect "toHaveText" with timeout 15000ms�[22m
�[2m  - waiting for locator('[data-row-key*="StatusBadgeTerm1779000283633"]').locator('.status-badge')�[22m
�[2m    18 × locator resolved to <div class="status-badge inReview" data-testid=""PW%'a6881030.Calme8bca324".StatusBadgeTerm1779000283633-status">…</div>�[22m
�[2m       - unexpected value "In Review"�[22m

🟡 8 flaky test(s) (passed on retry)
  • Features/BulkEditEntity.spec.ts › Glossary (shard 2, 1 retry)
  • Features/KnowledgeCenter.spec.ts › Article mentions in description should working for Knowledge Center (shard 2, 1 retry)
  • Features/RTL.spec.ts › Verify Following widget functionality (shard 3, 1 retry)
  • Features/Table.spec.ts › Table pagination with sorting should works (shard 3, 1 retry)
  • Flow/ObservabilityAlerts.spec.ts › Table alert (shard 3, 1 retry)
  • Flow/PersonaFlow.spec.ts › Set default persona for team should work properly (shard 3, 1 retry)
  • Pages/ExplorePageRightPanel_KnowledgeCenter.spec.ts › Should remove user owner for knowledgeCenter (shard 5, 2 retries)
  • Pages/Lineage/LineageFilters.spec.ts › Verify lineage schema filter selection (shard 6, 1 retry)

📦 Download artifacts

How to debug locally
# Download playwright-test-results-<shard> artifact and unzip
npx playwright show-trace path/to/trace.zip    # view trace

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backend safe to test Add this label to run secure Github workflows on PRs

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants