Implement W3C XQuery and XPath Full Text 3.0 (contains text expression)#6133
Closed
joewiz wants to merge 8 commits intoeXist-db:developfrom
Closed
Implement W3C XQuery and XPath Full Text 3.0 (contains text expression)#6133joewiz wants to merge 8 commits intoeXist-db:developfrom
joewiz wants to merge 8 commits intoeXist-db:developfrom
Conversation
5 tasks
Add full XQuery Full Text 3.0 grammar support to the ANTLR parser and tree walker, along with the complete set of AST expression classes. Grammar (XQuery.g): FTContainsExpr, FTSelection, boolean operators, positional filter syntax, FTMatchOptions, FTTimes, FT option declaration, reserved keyword registration. Tree walker (XQueryTree.g): Maps parsed FT constructs to expression class instances with full match option handling. AST classes (org.exist.xquery.ft): FTAbstractExpr, FTAnd, FTOr, FTMildNot, FTUnaryNot, FTWords, FTPrimaryWithOptions, FTSelection, FTContent, FTDistance, FTOrder, FTRange, FTScope, FTTimes, FTUnit, FTWindow. Error codes: FTST0008, FTST0009, FTST0013, FTST0018/0019, FTDY0016/0017, FTDY0020. W3C spec: https://www.w3.org/TR/xpath-full-text-30/ Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Implement the W3C XQFT 3.0 formal semantics evaluator using the AllMatches model. Sequential (non-indexed) evaluator that tokenizes document text at query time. FTEvaluator: tokenizer, AllMatches model, boolean operators, FTWords evaluation, positional filters (window, distance, content, scope, order), sentence/paragraph boundary detection, wildcard matching, ignore option, case/diacritics/stemming modes. FTContainsExpr: the 'contains text' expression integrating with the XQuery expression evaluation framework. Case mode semantics (XQFTTS interpretation): LOWERCASE/UPPERCASE act as filters — only source tokens already in the specified case are eligible for matching. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
FTMatchOptions: case, diacritics, stemming, stop words, wildcards, thesaurus, language options with inheritance. FTThesaurus: WordNet-format thesaurus loading and synonym lookup. XQueryContext: ft-option declaration support in query prolog. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ForExpr/LetExpr: score variable binding support. StaticXQueryException/XQuery: preserve FTST* error codes in static analysis exceptions. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
162 tests: parser, evaluator unit tests, integration tests, and W3C-aligned conformance tests covering boolean operators, positional filters, match options, wildcards, thesaurus, and edge cases. XQFTTS 1.0.4: 675/685 (98.5%) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2428a96 to
02b4669
Compare
Add default cases to switches, fix parameter reassignment in FTContainsExpr.eval(), collapse nested if in FTEvaluator, move field declarations before inner classes, replace FQNs with imports in XQueryContext, and suppress NPathComplexity on FTEvaluator class. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Cache the ft.stopWordURIMap and ft.thesaurusURIMap context attributes during analyze() rather than reading them during eval(). This fixes a race condition where XQueryContext.reset() (called between test executions in the XQTS runner or during query pool reuse) clears all context attributes, causing subsequent FT queries to fail with FTST0018 (thesaurus not available). Root cause: The XQTS runner sets thesaurus/stop-word URI maps as context attributes via context.setAttribute(). When tests run concurrently or contexts are reused from the XQuery pool, context.reset() clears all attributes (line 1668 of XQueryContext.java). If a FTContainsExpr's eval() runs after reset but before attributes are re-set, it gets null maps and fails to resolve thesaurus URIs. The fix caches the maps during analyze() (which runs once at compile time when attributes are guaranteed to be set) and falls back to context attributes if the cache is null (for non-XQTS scenarios). This fixes 2 FTTS failures: UseCase-THESAURUS/q6 and q6b. FTContainsTest: 60/60 pass, FTConformanceTest: 67/67 pass. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 tasks
Member
Author
|
[This comment was co-authored with Claude Code. -Joe] Closing — superseded by #6215 (v2/xqft-phase2). This work has been consolidated into a clean v2/ branch as part of the eXist-db 7.0 PR reorganization. The new PR includes all commits from this PR plus additional related work, with reviewer feedback incorporated where applicable. See the reviewer guide for the full context. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Implements the W3C XQuery and XPath Full Text 3.0 specification, adding native support for the
contains textexpression, boolean full-text operators, positional filters, match options, and score variables.This is a sequential (non-indexed) evaluator that tokenizes document text at query time. It is not backed by Lucene or any persistent index — all matching is performed in-memory during query evaluation. This makes it correct and complete for the W3C spec, but it is not suitable for large-scale full-text search on stored collections. For indexed full-text search, eXist-db's existing Lucene-based
ft:query()remains the recommended approach. A future phase could integrate this W3C syntax with Lucene for indexed evaluation.Related: #4584 (the sequential evaluator correctly handles matches spanning multiple text nodes).
Co-Authored-By: Claude Opus 4.6 noreply@anthropic.com
What Changed
New package:
org.exist.xquery.ftFTContainsExpr.javacontains textexpression — evaluates FTSelection against context nodes, manages match option inheritanceFTEvaluator.javaFTAbstractExpr.javaFTAnd.javaftand)FTOr.javaftor)FTMildNot.javanot in) — excludes matches without full negationFTUnaryNot.javaftnot)FTWords.javaany/all/phrase/any word/all wordsmodesFTPrimaryWithOptions.javaFTSelection.javaFTContent.javaat start,at end,entire contentFTDistance.javaFTWindow.javaFTScope.javasame sentence/same paragraph/different sentence/different paragraphFTOrder.javaFTRange.javaFTTimes.javaoccurs exactly/at least/at most/from N to M timesFTUnit.javawords,sentences,paragraphsFTMatchOptions.javaFTThesaurus.javaGrammar changes
XQuery.gftContainsExprproduction, full FT selection/option/filter syntax, FT prolog declarations, 27 new grammar rules, reserved keyword registrationXQueryTree.gModified existing files
ForExpr.javafor $x score $s in ...binding (§3.1)LetExpr.javalet score $s := ...binding (§3.1)XQueryContext.javaXQuery.javaStaticXQueryExceptionfor proper FTST error propagationStaticXQueryException.javaErrorCodes.javaTests
FTParserTest.javaFTEvaluatorTest.javaFTContainsTest.javaFTConformanceTest.javaXQTS Results
W3C XQFTTS 1.0.4 (667 non-skipped tests):
Remaining 6 failures — triage
FTST0018— thesaurus lookup fails when combined withusing case uppercase. Case option interaction with thesaurus.trueis correct per W3C formal semantics (first-token-at-start + last-token-at-end). Test expects stricter contiguity.Test suite results
Spec Reference
CI Status
All tests green (unit, integration on all platforms, XQTS, container image, license check). Codacy reports pre-existing style issues — no new findings from this PR.
Limitations and Future Work
No index backing (in-memory only)
This implementation is a sequential evaluator — it tokenizes and matches document text at query time without using any persistent index. This means:
ft:query()) is a separate, proprietary API. This implementation does not use it. A future phase could routecontains textexpressions to the Lucene index when a suitable index configuration exists, falling back to sequential evaluation otherwise.Features not implemented
using stemmingoption is parsed and recognized but uses a basic suffix-stripping heuristic. Integration with a proper stemming library (e.g., Snowball via Lucene) would improve recall.using thesaurus defaultoption (implementation-defined default thesaurus) is not supported.using stop words defaultoption uses this list. Language-specific stop word lists are not bundled.Relationship to eXist-db's existing full-text support
eXist-db has two existing full-text mechanisms:
ft:query()— Lucene-backed, proprietary XQuery function. Indexed, fast, supports Lucene query syntax. Recommended for production full-text search.near()— Deprecated proprietary function. Minimal functionality.This PR adds a third mechanism: W3C standard
contains textsyntax. It is complementary toft:query(), not a replacement. Applications that need indexed full-text search should continue usingft:query(). Applications that need W3C-standard syntax (portability, XQFTTS compliance, or precise positional matching) can now usecontains text.