Implement W3C XQuery and XPath Full Text 3.0 (contains text expression) #6133

Closed
joewiz wants to merge 8 commits into eXist-db:develop from joewiz:feature/xqft-phase2

joewiz commented Mar 14, 2026

Summary

Implements the W3C XQuery and XPath Full Text 3.0 specification, adding native support for the contains text expression, boolean full-text operators, positional filters, match options, and score variables.

This is a sequential (non-indexed) evaluator that tokenizes document text at query time. It is not backed by Lucene or any persistent index; all matching is performed in memory during query evaluation. This keeps the implementation close to the W3C formal semantics (see the XQFTTS results below), but it is not suitable for large-scale full-text search on stored collections. For indexed full-text search, eXist-db's existing Lucene-based ft:query() remains the recommended approach. A future phase could integrate this W3C syntax with Lucene for indexed evaluation.
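To make "tokenizes document text at query time" concrete, here is a minimal Java sketch of sequential matching: split the text into tokens by Unicode code point, then scan the token list for a query word. The names SketchTokenizer, tokenize, and firstMatch are hypothetical and are not taken from FTEvaluator.java; the sketch also ignores sentence and paragraph boundaries and all match options.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; names are illustrative and not taken from FTEvaluator.java.
final class SketchTokenizer {

    /** Splits text into runs of letters and digits; a token's position is its list index. */
    static List<String> tokenize(final String text) {
        final List<String> tokens = new ArrayList<>();
        final StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            final int cp = text.codePointAt(i);
            if (Character.isLetterOrDigit(cp)) {
                current.appendCodePoint(cp);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
            i += Character.charCount(cp); // step over surrogate pairs as one code point
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    /** Returns the position of the first token equal to word (ignoring case), or -1. */
    static int firstMatch(final String text, final String word) {
        final List<String> tokens = tokenize(text);
        for (int pos = 0; pos < tokens.size(); pos++) {
            if (tokens.get(pos).equalsIgnoreCase(word)) {
                return pos;
            }
        }
        return -1;
    }
}
```

Every call re-tokenizes its input, which illustrates why the cost of this style of evaluation grows with the amount of text scanned.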

Related: #4584 (the sequential evaluator correctly handles matches spanning multiple text nodes).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

What Changed

New package: org.exist.xquery.ft

| File | Purpose |
| --- | --- |
| FTContainsExpr.java | The contains text expression: evaluates FTSelection against context nodes, manages match option inheritance |
| FTEvaluator.java | Core sequential evaluator (~1,850 lines): Unicode-aware tokenizer, AllMatches model (§5.1), boolean operators, positional filters, wildcard matching, ignore option |
| FTAbstractExpr.java | Base class for all FT expression AST nodes |
| FTAnd.java | Boolean conjunction (ftand) |
| FTOr.java | Boolean disjunction (ftor) |
| FTMildNot.java | Mild negation (not in): excludes matches without full negation |
| FTUnaryNot.java | Full negation (ftnot) |
| FTWords.java | Terminal match expression with any/all/phrase/any word/all words modes |
| FTPrimaryWithOptions.java | Wraps a primary FT expression with match options |
| FTSelection.java | Wraps a selection with positional filter chain |
| FTContent.java | Positional filter: at start, at end, entire content |
| FTDistance.java | Positional filter: token/sentence/paragraph distance constraints |
| FTWindow.java | Positional filter: all matches within N tokens/sentences/paragraphs |
| FTScope.java | Positional filter: same sentence/same paragraph/different sentence/different paragraph |
| FTOrder.java | Positional filter: match tokens in query order |
| FTRange.java | Numeric range for distance/window/times constraints |
| FTTimes.java | Occurrence constraint: occurs exactly/at least/at most/from N to M times |
| FTUnit.java | Unit enum: words, sentences, paragraphs |
| FTMatchOptions.java | Match option container: case, diacritics, stemming, stop words, wildcards, language, thesaurus |
| FTThesaurus.java | Thesaurus implementation: WordNet-format synonym lookup by relationship type and depth |
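As an illustration of what a positional filter such as the window filter has to decide, the following Java sketch checks whether one match position per query term can be chosen so that all of them fall inside a window of N tokens. It is a simplified stand-in, not the FTWindow or FTEvaluator code: it assumes single-token terms joined by ftand, counts only in words, and uses the hypothetical names WindowSketch and withinWindow.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a token-based window check; not the PR's FTWindow logic.
final class WindowSketch {

    /**
     * positionsPerTerm.get(i) holds the token positions where query term i matched.
     * Returns true if one position per term can be chosen so that all chosen
     * positions fit inside a window of at most windowSize tokens.
     */
    static boolean withinWindow(final List<int[]> positionsPerTerm, final int windowSize) {
        if (windowSize < 1) {
            return false;
        }
        final int terms = positionsPerTerm.size();
        final List<int[]> hits = new ArrayList<>(); // each entry: {position, termIndex}
        for (int t = 0; t < terms; t++) {
            for (final int pos : positionsPerTerm.get(t)) {
                hits.add(new int[] {pos, t});
            }
        }
        hits.sort((a, b) -> Integer.compare(a[0], b[0]));

        final int[] countInWindow = new int[terms];
        int covered = 0; // number of distinct terms inside the current window
        int left = 0;
        for (int right = 0; right < hits.size(); right++) {
            if (countInWindow[hits.get(right)[1]]++ == 0) {
                covered++;
            }
            // Shrink from the left while the span exceeds the window size.
            while (hits.get(right)[0] - hits.get(left)[0] + 1 > windowSize) {
                if (--countInWindow[hits.get(left)[1]] == 0) {
                    covered--;
                }
                left++;
            }
            if (covered == terms) {
                return true;
            }
        }
        return false;
    }
}
```

The sliding window over the merged, sorted positions avoids enumerating every combination of positions, which matters when a term matches many times in a long text.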

Grammar changes

| File | Changes |
| --- | --- |
| XQuery.g | +577 lines: ftContainsExpr production, full FT selection/option/filter syntax, FT prolog declarations, 27 new grammar rules, reserved keyword registration |
| XQueryTree.g | +622 lines: Tree walker rules mapping parsed FT constructs to expression class instances |

Modified existing files

| File | Changes |
| --- | --- |
| ForExpr.java | Support for the `for $x score $s in ...` binding (§3.1) |
| LetExpr.java | Support for the `let score $s := ...` binding (§3.1) |
| XQueryContext.java | FT option declaration support in query prolog; default match options storage |
| XQuery.java | Extract error code from StaticXQueryException for proper FTST error propagation |
| StaticXQueryException.java | Preserve error code in static analysis exceptions |
| ErrorCodes.java | Added FTST0008/0009/0013/0018/0019, FTDY0016/0017/0020 |

Tests

| File | Test coverage |
| --- | --- |
| FTParserTest.java | Grammar parsing verification for all FT expression forms |
| FTEvaluatorTest.java | Unit tests for AllMatches evaluator, token position tracking, wildcard matching |
| FTContainsTest.java | 30+ integration tests: boolean operators, positional filters, match options, wildcards, thesaurus |
| FTConformanceTest.java | W3C XQFTTS-aligned conformance tests: case modes, diacritics, stop words, stemming, sentence/paragraph boundaries, score variables |

XQTS Results

W3C XQFTTS 1.0.4 (667 non-skipped tests):

| Metric | Score |
| --- | --- |
| Total tests | 667 (excluding skipped) |
| Passed | 661 (99.1%) |
| Failed | 6 |

Remaining 6 failures — triage

| # | Test(s) | Category | Root cause |
| --- | --- | --- | --- |
| 1–2 | thesaurus-q6, q6b | Thesaurus | FTST0018: thesaurus lookup fails when combined with using case uppercase (case option interaction with thesaurus). |
| 3–4 | examples-362-4, unconstrained variant | ftnot semantics | Spec ambiguity: boolean vs. positional negation with window filter. Fixing it regresses 5 other ftnot tests (net -3). |
| 5 | FTContent-complex5 | entire content | Spec ambiguity: our result (true) is correct per W3C formal semantics (first-token-at-start + last-token-at-end). The test expects stricter contiguity. |
| 6 | FTWindow-paragraphs3 | Paragraph boundaries | Implementation-defined: eXist uses element boundaries as paragraph breaks. |
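The disagreement in row 5 can be shown with a small Java sketch that contrasts the two readings of entire content described there: matched tokens at the first and last positions versus matched tokens covering every position. The class and method names are hypothetical, and the input is assumed to be a sorted array of distinct matched token positions; this is not the PR's FTContent logic.

```java
// Hypothetical sketch contrasting two readings of "entire content".
final class EntireContentSketch {

    /** Lenient reading: first matched token at the start, last matched token at the end. */
    static boolean startAndEnd(final int[] matched, final int tokenCount) {
        return matched.length > 0
                && matched[0] == 0
                && matched[matched.length - 1] == tokenCount - 1;
    }

    /** Stricter reading: additionally, matched positions must be contiguous. */
    static boolean contiguous(final int[] matched, final int tokenCount) {
        if (!startAndEnd(matched, tokenCount)) {
            return false;
        }
        for (int i = 1; i < matched.length; i++) {
            if (matched[i] != matched[i - 1] + 1) {
                return false; // gap between matched tokens
            }
        }
        return true;
    }
}
```

For a four-token text where only positions 0 and 3 matched, the lenient check returns true while the contiguous check returns false.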

Test suite results

| Module | Tests | Failures | Errors | Skipped |
| --- | --- | --- | --- | --- |
| exist-core | 162 | 4 (pre-existing) | 0 | 0 |

Spec Reference

W3C XQuery and XPath Full Text 3.0: https://www.w3.org/TR/xpath-full-text-30/

CI Status

All tests green (unit, integration on all platforms, XQTS, container image, license check). Codacy reports pre-existing style issues — no new findings from this PR.

Limitations and Future Work

No index backing (in-memory only)

This implementation is a sequential evaluator — it tokenizes and matches document text at query time without using any persistent index. This means:

  • Correctness: 661 of 667 non-skipped XQFTTS 1.0.4 tests pass (99.1%). Matching semantics, positional filters, and match options follow the W3C spec, apart from the six failures triaged above.
  • Performance: Suitable for in-memory documents and small stored documents. Not suitable for full-text search over large collections — queries will scan every document.
  • No Lucene integration: eXist-db's existing Lucene full-text index (ft:query()) is a separate, proprietary API. This implementation does not use it. A future phase could route contains text expressions to the Lucene index when a suitable index configuration exists, falling back to sequential evaluation otherwise.

Features with limited support

  • Stemming: The using stemming option is parsed and recognized but uses a basic suffix-stripping heuristic. Integration with a proper stemming library (e.g., Snowball via Lucene) would improve recall.
  • Thesaurus: Basic WordNet-format thesaurus support is implemented. The using thesaurus default option (implementation-defined default thesaurus) is not supported.
  • Stop words: A built-in English stop word list is provided. The using stop words default option uses this list. Language-specific stop word lists are not bundled.
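To show what a basic suffix-stripping heuristic looks like and why a real stemmer would improve recall, here is a minimal Java sketch. The suffix list, the minimum stem length, and the names StemSketch, stem, and sameStem are assumptions made for illustration; they are not the rules used in this PR.

```java
import java.util.Locale;

// Hypothetical suffix stripper; the suffix list is illustrative, not the PR's.
final class StemSketch {

    private static final String[] SUFFIXES = {"ing", "ed", "es", "s"};

    /** Lowercases the token and strips the first matching suffix, keeping at least 3 characters. */
    static String stem(final String token) {
        final String lower = token.toLowerCase(Locale.ROOT);
        for (final String suffix : SUFFIXES) {
            if (lower.length() > suffix.length() + 2 && lower.endsWith(suffix)) {
                return lower.substring(0, lower.length() - suffix.length());
            }
        }
        return lower;
    }

    /** Two tokens match under stemming if their stems are equal. */
    static boolean sameStem(final String a, final String b) {
        return stem(a).equals(stem(b));
    }
}
```

With these rules "walked" and "walks" both reduce to "walk", but "running" reduces to "runn" and so does not match "run". Gaps of this kind are what a dedicated stemming library is designed to close.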

Relationship to eXist-db's existing full-text support

eXist-db has two existing full-text mechanisms:

  1. ft:query() — Lucene-backed, proprietary XQuery function. Indexed, fast, supports Lucene query syntax. Recommended for production full-text search.
  2. Legacy near() — Deprecated proprietary function. Minimal functionality.

This PR adds a third mechanism: W3C standard contains text syntax. It is complementary to ft:query(), not a replacement. Applications that need indexed full-text search should continue using ft:query(). Applications that need W3C-standard syntax (portability, XQFTTS compliance, or precise positional matching) can now use contains text.

joewiz and others added 5 commits March 15, 2026 14:56
Add full XQuery Full Text 3.0 grammar support to the ANTLR parser and
tree walker, along with the complete set of AST expression classes.

Grammar (XQuery.g): FTContainsExpr, FTSelection, boolean operators,
positional filter syntax, FTMatchOptions, FTTimes, FT option
declaration, reserved keyword registration.

Tree walker (XQueryTree.g): Maps parsed FT constructs to expression
class instances with full match option handling.

AST classes (org.exist.xquery.ft): FTAbstractExpr, FTAnd, FTOr,
FTMildNot, FTUnaryNot, FTWords, FTPrimaryWithOptions, FTSelection,
FTContent, FTDistance, FTOrder, FTRange, FTScope, FTTimes, FTUnit,
FTWindow.

Error codes: FTST0008, FTST0009, FTST0013, FTST0018/0019,
FTDY0016/0017, FTDY0020.

W3C spec: https://www.w3.org/TR/xpath-full-text-30/

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Implement the W3C XQFT 3.0 formal semantics evaluator using the
AllMatches model. Sequential (non-indexed) evaluator that tokenizes
document text at query time.

FTEvaluator: tokenizer, AllMatches model, boolean operators,
FTWords evaluation, positional filters (window, distance, content,
scope, order), sentence/paragraph boundary detection, wildcard
matching, ignore option, case/diacritics/stemming modes.

FTContainsExpr: the 'contains text' expression integrating with
the XQuery expression evaluation framework.

Case mode semantics (XQFTTS interpretation): LOWERCASE/UPPERCASE
act as filters — only source tokens already in the specified case
are eligible for matching.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
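The case-mode rule described in the commit message above (LOWERCASE/UPPERCASE act as filters on source tokens) can be sketched in Java as follows. The enum and method names are hypothetical, and comparing the surviving token to the query token case-insensitively is an assumption of this sketch rather than something the commit message states.

```java
import java.util.Locale;

// Hypothetical sketch of the "case mode as filter" reading; not the PR's code.
final class CaseModeSketch {

    enum CaseMode { INSENSITIVE, SENSITIVE, LOWERCASE, UPPERCASE }

    static boolean tokenMatches(final String source, final String query, final CaseMode mode) {
        switch (mode) {
            case SENSITIVE:
                return source.equals(query);
            case LOWERCASE:
                // Only source tokens that are already lowercase are eligible.
                return source.equals(source.toLowerCase(Locale.ROOT))
                        && source.equalsIgnoreCase(query);
            case UPPERCASE:
                // Only source tokens that are already uppercase are eligible.
                return source.equals(source.toUpperCase(Locale.ROOT))
                        && source.equalsIgnoreCase(query);
            case INSENSITIVE:
            default:
                return source.equalsIgnoreCase(query);
        }
    }
}
```

Under this reading, a source token "Usability" is rejected in LOWERCASE mode even though the query word would match it case-insensitively.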
FTMatchOptions: case, diacritics, stemming, stop words, wildcards,
thesaurus, language options with inheritance.

FTThesaurus: WordNet-format thesaurus loading and synonym lookup.

XQueryContext: ft-option declaration support in query prolog.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ForExpr/LetExpr: score variable binding support.
StaticXQueryException/XQuery: preserve FTST* error codes in static
analysis exceptions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
162 tests: parser, evaluator unit tests, integration tests, and
W3C-aligned conformance tests covering boolean operators, positional
filters, match options, wildcards, thesaurus, and edge cases.

XQFTTS 1.0.4: 675/685 (98.5%)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@joewiz joewiz force-pushed the feature/xqft-phase2 branch from 2428a96 to 02b4669 Compare March 15, 2026 18:58
joewiz and others added 3 commits March 20, 2026 03:20
Add default cases to switches, fix parameter reassignment in
FTContainsExpr.eval(), collapse nested if in FTEvaluator, move field
declarations before inner classes, replace FQNs with imports in
XQueryContext, and suppress NPathComplexity on FTEvaluator class.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Cache the ft.stopWordURIMap and ft.thesaurusURIMap context attributes
during analyze() rather than reading them during eval(). This fixes a
race condition where XQueryContext.reset() (called between test
executions in the XQTS runner or during query pool reuse) clears all
context attributes, causing subsequent FT queries to fail with
FTST0018 (thesaurus not available).

Root cause: The XQTS runner sets thesaurus/stop-word URI maps as
context attributes via context.setAttribute(). When tests run
concurrently or contexts are reused from the XQuery pool,
context.reset() clears all attributes (line 1668 of XQueryContext.java).
If a FTContainsExpr's eval() runs after reset but before attributes
are re-set, it gets null maps and fails to resolve thesaurus URIs.

The fix caches the maps during analyze() (which runs once at compile
time when attributes are guaranteed to be set) and falls back to
context attributes if the cache is null (for non-XQTS scenarios).

This fixes 2 FTTS failures: UseCase-THESAURUS/q6 and q6b.
FTContainsTest: 60/60 pass, FTConformanceTest: 67/67 pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
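The fix described in the last commit follows a common pattern: capture a context attribute when the expression is analyzed, and fall back to a live lookup only if nothing was captured. The Java sketch below shows that pattern with stand-in classes. Context and Expr are hypothetical and far simpler than XQueryContext and FTContainsExpr; only the attribute name ft.thesaurusURIMap and the setAttribute/reset behaviour are taken from the commit message.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of "cache at analyze(), fall back at eval()".
final class AnalyzeCacheSketch {

    /** Stand-in for a query context whose attributes are cleared by reset(). */
    static final class Context {
        private final Map<String, Object> attributes = new HashMap<>();

        void setAttribute(final String key, final Object value) { attributes.put(key, value); }
        Object getAttribute(final String key) { return attributes.get(key); }
        void reset() { attributes.clear(); }
    }

    /** Stand-in for an expression that needs the thesaurus URI map at evaluation time. */
    static final class Expr {
        private final Context context;
        private Map<String, String> cachedThesaurusMap; // captured during analyze()

        Expr(final Context context) { this.context = context; }

        void analyze() {
            cachedThesaurusMap = lookup();
        }

        /** Prefers the cached map; falls back to the context if nothing was cached. */
        Map<String, String> thesaurusMapForEval() {
            return cachedThesaurusMap != null ? cachedThesaurusMap : lookup();
        }

        @SuppressWarnings("unchecked")
        private Map<String, String> lookup() {
            return (Map<String, String>) context.getAttribute("ft.thesaurusURIMap");
        }
    }
}
```

In this sketch, an Expr that was analyzed before Context.reset() still sees its map afterwards, whereas one that relies on the live lookup gets null until the attribute is set again.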

joewiz commented Apr 6, 2026

[This comment was co-authored with Claude Code. -Joe]

Closing — superseded by #6215 (v2/xqft-phase2).

This work has been consolidated into a clean v2/ branch as part of the eXist-db 7.0 PR reorganization. The new PR includes all commits from this PR plus additional related work, with reviewer feedback incorporated where applicable. See the reviewer guide for the full context.

@joewiz joewiz closed this Apr 6, 2026