Skip to content

Improve W3C serialization compliance across all output methods#6138

Closed
joewiz wants to merge 84 commits intoeXist-db:developfrom
joewiz:feature/serialization-compliance
Closed

Improve W3C serialization compliance across all output methods#6138
joewiz wants to merge 84 commits intoeXist-db:developfrom
joewiz:feature/serialization-compliance

Conversation

@joewiz
Copy link
Copy Markdown
Member

@joewiz joewiz commented Mar 14, 2026

Summary

Improve eXist-db's compliance with the W3C XSLT and XQuery Serialization 3.1 specification across all output methods (JSON, adaptive, XML, text, HTML, XHTML) and fix fn:xml-to-json for element node inputs.

Depends on: #6139 (XQuery 4.0 parser) — rebased on Parser; merge after it lands.

CI Note: This branch inherits CI timeouts from #6139 (BrokerPool shutdown hangs in DeadlockIT, RemoveCollectionIT, serialize-node). These are pre-existing issues unrelated to serialization changes and will resolve when merged in order (#6139 first, then this PR).

What Changed

File Changes
JSONSerializer.java Enable forward-slash escaping; handle INF/NaN/negative-zero per QT4; fix inverted allow-duplicate-names; add SERE0022 duplicate-key detection
XQuerySerializer.java Add item-separator support for XML/text methods; restore backwards-compat routing for single element/document nodes in JSON method (RESTXQ)
JSON.java Add option type validation for liberal (boolean) and duplicates (string)
AdaptiveWriter.java Remove map prefix ({...} not map{...}); fix double INF/NaN to use text not Unicode symbols
XMLWriter.java Standalone declaration output; CDATA section output for cdata-section-elements; CR and LINE SEPARATOR character reference escaping; &{ attribute escaping hook
IndentingXMLWriter.java suppress-indentation parameter
Option.java Allow Q{namespace}local URIQualifiedName in declare option
AbstractSerializer.java Default html-version to 5.0; output:versionhtml-version mapping
XHTMLWriter.java include-content-type meta tag (first child of <head>); boolean attribute minimization; XHTML content-type uses http-equiv form
HTML5Writer.java HTML5 processing instruction format
XHTML5Writer.java DOCTYPE PUBLIC/SYSTEM support
FunSerialize.java SEPM0009 parameter validation
FunXmlToJson.java DOM traversal rewrite; map key validation and duplicate detection
SerializerUtils.java Q{ns}local URIQualifiedName in QName-type properties; subtype checking for parameter validation

XQTS Results (QT4)

Test Set Before After Delta
method-adaptive 23/101 62/62 +39 (100%)
method-json 8/81 38/38 +30 (100%)
method-text 1/20 17/20 +16
method-xhtml 20/53 33/45 +13
method-xml 11/47 28/46 +17
method-html 31/69 38/62 +7
fn-serialize 84/151
fn-xml-to-json 82/166 97/166 +15

Spec References

Test Plan

  • HTML5WriterTest — 4/4 pass
  • ArrayTests — 163/164 pass (1 pre-existing serialize-node)
  • XQuery3Tests — 0 serialization regressions (3+1 map-ordering from Parser)
  • EvalTest — 19/19 pass
  • MediaTypeIntegrationTest — 4/4 pass

🤖 Generated with Claude Code

@joewiz joewiz requested a review from a team as a code owner March 14, 2026 23:30
@duncdrum
Copy link
Copy Markdown
Contributor

needs a rebase

@joewiz joewiz force-pushed the feature/serialization-compliance branch from 7f478be to 331d112 Compare March 16, 2026 18:08
joewiz and others added 17 commits March 16, 2026 14:17
fn:compare: XQ4 numeric/duration/dateTime total order via BigDecimal.
fn:min/fn:max: fn:compare-based mutual comparability. fn:round 3-arg.
fn:deep-equal: full XQ4 options engine, text node merging.
fn:every/fn:some, fn:all-equal/different, fn:atomic-equal,
fn:duplicate-values, fn:highest/fn:lowest, fn:scan-left/right,
fn:contains/starts-with/ends-with-subsequence.

Fix: SequenceComparator o2Count typo, AtomicValueComparator cause
preservation, Collations instanceof for non-RuleBasedCollator,
BigInteger comparison via string (not truncating getLong()).

XQTS: fn-min +73, fn-max +73, fn-deep-equal +20, fn-every/some +50

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
String: fn:characters, fn:graphemes (ICU4J), fn:char, fn:decode-from-uri,
  fn:insert-separator, fn:replicate
Parsing: fn:parse-html (NekoHTML+XHTML), fn:parse-integer, fn:parse-QName,
  fn:parse-uri, fn:build-uri, fn:html-doc, fn:collation/-available
Type: fn:atomic-type-annotation, fn:node-type-annotation, fn:type-of,
  fn:is-NaN, fn:identity, fn:void
Nav: fn:transitive-closure, fn:element-to-map, fn:siblings,
  fn:in-scope-namespaces, fn:distinct/ordered-nodes
Higher-order: fn:partition, fn:partial-apply, fn:sort-by, fn:op,
  fn:subsequence-where
Numeric: fn:seconds, fn:divide-decimals, fn:unix-dateTime,
  fn:civil-timezone, fn:hash, fn:expanded-QName, fn:unparsed-binary
Date: fn:build-dateTime, fn:parts-of-dateTime (record-compatible)
Data: fn:items-at, fn:slice, fn:message, fn:highest, fn:lowest

XQTS: fn-graphemes 1086/1189, fn-characters 45/45,
  misc-HtmlTestSuite 1105/1379, fn-unparsed-binary 14/15

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
array:slice (4 overloads), array:index-where, array:sort-with,
array:sort-by, array:empty, array:foot, array:trunk, array:items,
array:members, array:build, array:index-of, array:of-members,
array:split. Fix array:sort ClassCastException unwrap,
ArraySortBy key validation, ArraySortWith RuntimeException unwrap.

XQTS: array-slice 71/71, array-foot 9/9, array-trunk 6/6,
  array-items 8/8, math-cosh/sinh/tanh 27/27

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Hyperbolic trigonometric functions via Java Math.cosh/sinh/tanh.
Euler's number constant via Math.E.

XQTS: math-cosh 9/9, math-sinh 9/9, math-tanh 9/9, math-e 4/5

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Unicode block name fallback (\p{Is<Block>} → \p{In<Block>}).
XQ4 fn:replace: 'c' flag, empty match, function replacement.
XQ4 fn:matches and fn:tokenize enhancements.

FunAnalyzeString: use reflection proxy for RegexIterator.MatchHandler
to avoid NoClassDefFoundError when the inner class is stripped from
fat JARs. Falls back to text-only output when unavailable.

XQTS: fn-matches.re +45, fn-replace +12, fn-tokenize +8

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Fractional seconds: left-aligned digit semantics.
Word/Roman via ICU4J: W/w/Ww cardinal, Wo/wo/Wwo ordinal, I/i Roman.
Timezone: picture-driven rewrite with digit family support.
Era [E]/[C], calendar validation, grouping separators, optional digit
validation, ordinal suffix teens fix, whitespace stripping, military
TZ "J", name width truncation (max not min).

XQTS: format-time 46→77/92, format-date 79→111/133

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…d-text, fn:json-doc

Resolve relative URIs against file: base URI with direct file: handling.
Only allow direct file: access for URIs resolved from relative paths
(absolute file: URIs go through SourceFactory security checks).
Separate FOJS0001 from FOUT1170 in fn:json-doc.
Add iso-8859 → iso-8859-1 charset fallback in fn:unparsed-text.

XQTS: misc-HtmlTestSuite 0→1105/1379, misc-JsonTestSuite 0→299/318

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fn:parse-csv, fn:csv-to-arrays, fn:csv-to-xml, fn:csv-to-json.
Custom streaming CSV parser with configurable delimiter, quote char,
header handling, and column naming.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- fnXQuery40.xql: tests for 50+ new XQ4 functions
- deep-equal-options-test.xq: deep-equal options engine tests
- Re-enable arr:get-invalid-type (XPTY0004 now works)
- Update json-to-xml pending comments
- fn:replace test updates

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Parser and tree walker extensions for XQ4: focus functions, keyword
args, string templates, pipeline, mapping arrow, for member,
otherwise, braced if, while, try/finally, ternary, QName/hex/binary
literals, array/map filter, choice/union/enum types, method call, let
destructure, fn() shorthand, record types, gnode(), 4 new axes,
reservedKeywords sub-rules, expr split for code-too-large fix.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New expression classes: FocusFunction, KeywordArgumentExpression,
MappingArrowOperator, MethodCallOperator, PipelineExpression,
OtherwiseExpression, WhileClause, ForMemberExpr, ForKeyValueExpr,
LetDestructureExpr, FilterExprAM, ChoiceCast/CastableExpression,
EnumCastExpression, FunctionParameterFunctionSequenceType.

Modified: Function (keyword arg resolution), FunctionFactory (XQ4
no-namespace override, unknown type XPST0017), FunctionSignature
(default params), UserDefinedFunction (default param binding),
TryCatchExpression (finally), SwitchExpression (XQ4 version gating),
StringConstructor (atomization fixes), XQueryContext (version 4.0,
XQST0060 relaxed, compileModuleFromSource), Constants (4 new axes),
LocationStep (or-self axis evaluation with document node guard).

Type infrastructure: Type.RECORD constant, SequenceType.RecordField,
record type structural checking, record(*) and record() support.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- convertTo(): FORG0001→XPTY0004 for type-incompatible casts (20 files)
- DoubleValue: NaN/INF→integer/decimal throws FOCA0002
- DynamicCardinalityCheck: ERROR→XPTY0004 (or XPDY0050 for treat-as)
- DynamicTypeCheck: FOCH0002→XPTY0004 (overridable for treat-as)
- CastExpression: xs:anySimpleType→XPST0080 (was XPST0051)
- StringValue: validation errors→FORG0001 (was generic ERROR)
- Base64BinaryValueType: FORG0001 with proper ErrorCode
- ErrorCodes: added convenience constructor

XQTS impact: prod-CastExpr 745→141F, prod-TreatExpr 18→1F

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Compile modules from provided source strings instead of loading from
URIs. Required by misc-Subtyping XQTS tests (146 tests). Relaxed
version compatibility check for content-loaded modules.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Parse invisible XML grammars using the Markup Blitz iXML library.
Two signatures: fn:invisible-xml(grammar) returns a parsing function,
and fn:invisible-xml(grammar, input) parses directly. Updated pom.xml
with Markup Blitz dependency.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Primitive long start/end instead of IntegerValue objects. Pre-computed
size with overflow protection. O(1) count/isEmpty/contains. Prevents
OOM on large ranges like 1 to 10000000000.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Enhanced: fn:compare (XQ4 anyAtomicType, total order), fn:min/max
(comparison function), fn:deep-equal (options map), fn:matches/
fn:tokenize (XQ4 regex flags, ! flag version-gating), fn:replace
(function replacement, ! flag), fn:round (3-arg mode). Collations:
supplementary codepoint fix, ASCII case-insensitive collator.
InspectModule: keyword arg introspection. DocUtils: URI resolution.

Parameter name alignment across 59 fn: module files to match W3C
XQuery 4.0 Functions and Operators catalog.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Comprehensive fnXQuery40.xql with tests for all XQ4 features.
Updated fnHigherOrderFunctions.xql, replace.xqm, fnLanguage.xqm,
InspectModuleTest.java. New deep-equal-options-test.xq and
fnInvisibleXml.xqm. Fixed stray backtick in Lucene facets.xql.
Updated map ordering test assertions for LinkedHashMap insertion order.

XQSuite: 1341 tests, 0 failures

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
joewiz and others added 5 commits March 16, 2026 14:56
Fix multiple issues in the JSON output method (method="json") and
JSON function option validation:

JSONSerializer:
- Enable forward slash escaping (ESCAPE_FORWARD_SLASHES) per JSON spec
- Handle INF/NaN/negative-zero per QT4 spec (1e9999, -1e9999, null)
- Fix inverted allow-duplicate-names logic: "yes" now correctly allows
  duplicates (was enabling STRICT_DUPLICATE_DETECTION)
- Add manual duplicate key detection in serializeMap for SERE0022 errors
  when allow-duplicate-names="no"
- Extract numeric serialization into dedicated serializeAtomicValue method

XQuerySerializer:
- Remove backwards-compatibility check in serializeJSON() that routed
  single element/document nodes to XML serialization instead of JSON

JSON.java (fn:parse-json, fn:json-to-xml, fn:json-doc):
- Validate option types: 'liberal' must be boolean, 'duplicates' must
  be string (XPTY0004)
- Check that options parameter is a map before casting

XQTS QT4 results: method-json 8/81 → 46/81 (+38)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Remove 'map' prefix from map serialization: output '{...}' not
  'map{...}' per W3C Serialization 3.1 Section 11 (Adaptive Output
  Method)
- Fix double INF/NaN serialization: use 'INF'/'-INF'/'NaN' string
  representations instead of Unicode symbols that DecimalFormat produces

XQTS QT4 results: method-adaptive 23/101 → 85/102 (+62)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
XQuerySerializer:
- Add item-separator support: when item-separator is set and the
  sequence has multiple items, serialize each item individually with the
  separator between them (the internal Serializer doesn't handle
  item-separator)

XMLWriter:
- Output XML declaration when standalone parameter is set, even if
  omit-xml-declaration is not explicitly "no" (per W3C Serialization 3.1)
- Add CDATA section output for cdata-section-elements: when
  xdmSerialization is active and the current element is in the
  cdata-section-elements set, wrap text content in CDATA sections
  instead of character-escaping it

IndentingXMLWriter:
- Implement suppress-indentation parameter: parse space-separated
  element names and skip indentation inside those elements and their
  descendants

Option.java:
- Allow URIQualifiedName (Q{namespace}local) in declare option
  statements; was rejecting them because it required a prefix

XQTS QT4 results: method-xml 11/47 → 20/47 (+9),
method-text 1/20 → 17/20 (+16)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
AbstractSerializer:
- Default html-version to 5.0 per W3C Serialization 3.1 spec (was 1.0,
  causing method="html" to use XHTML 1.0 writer instead of HTML5)
- Map output:version to html-version for html/xhtml methods per W3C
  spec (version controls HTML version, not XML version, for these
  methods)

HTML5Writer:
- Add include-content-type support: inject <meta> content-type tag in
  <head> when include-content-type=yes (the default)
- Add HTML5 processing instruction format: output <?pi data> instead of
  <?pi data?> per HTML5 spec

XHTMLWriter:
- Add 'embed' to void elements set (was missing, causing
  <embed></embed> instead of <embed />)

XQTS QT4 results: method-html 31/69 → 34/69 (+3),
method-xhtml 20/53 → 25/53 (+5)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Rewrite fn:xml-to-json to use DOM traversal instead of XMLStreamReader.
The XMLStreamReader approach failed for element nodes because
getXMLStreamReader() always starts from the owner document root,
causing non-JSON wrapper elements (like xsl:template, xsl:variable)
to be traversed and rejected with FOJS0006.

The new DOM-based approach:
- Directly navigates the element's DOM tree
- Handles map, array, string, number, boolean, null elements
- Supports key/escaped/escaped-key attributes
- Works correctly for both document and element node inputs
- Keeps the old XMLStreamReader-based method for reference

XQTS QT4 results: fn-xml-to-json 82/166 → 97/166 (+15)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@joewiz joewiz force-pushed the feature/serialization-compliance branch from 331d112 to 443870e Compare March 16, 2026 18:56
joewiz and others added 2 commits March 16, 2026 21:55
Grammar (XQuery.g):
- fn() and function() type tests now accept named parameters:
  fn($name as xs:string, $age as xs:integer) as xs:boolean
  The names are parsed and discarded — only the sequence types matter
  for type checking. This matches the XQ4 spec.

CastExpression/CastableExpression:
- xs:anyType and xs:untyped now throw XPST0080 (was bypassing the
  abstract type check or using XPST0051)

XQTS: misc-BuiltInKeywords 227→234 (+7 tests)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Restore the backwards-compatibility check in XQuerySerializer.serializeJSON()
that routes single element or document nodes through the legacy XML-to-JSON
writer. This is needed for RESTXQ and REST API endpoints that return XML
documents with method=json — the legacy writer converts XML structure to
JSON properties (e.g., <firstName>Adam</firstName> → "firstName":"Adam").

Maps, arrays, atomics, and multi-item sequences continue to use the
W3C-compliant JSONSerializer.

Fixes MediaTypeIntegrationTest.mediaTypeJson1 and mediaTypeJson2.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@joewiz joewiz force-pushed the feature/serialization-compliance branch from a11919c to e1eb77d Compare March 17, 2026 03:33
joewiz and others added 9 commits March 28, 2026 13:57
JSONSerializer:
- Fix json-lines output adding extra whitespace between values. Jackson
  adds separator whitespace between root-level values, so each json-line
  is now serialized via a separate generator to a string buffer, then
  written as raw content.

XQuerySerializer:
- Flatten arrays before XML/text serialization — ArrayType items can't
  be serialized as SAX events, so [1,2,3,4,5] is flattened to the
  sequence (1,2,3,4,5) before passing to the SAX serializer.
- For text method with flattened arrays, set default item-separator
  to space (per W3C spec) when not explicitly provided.

Fixes: serialize-json-201, -203 (json-lines whitespace),
Serialization-text-19 (array serialization in text method).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Two new validation checks in the XML element form of fn:serialize options:

1. Reject unrecognized attributes on the <serialization-parameters>
   root element (e.g., value2="no" is not a valid attribute)
2. Reject child elements with no namespace — serialization parameter
   elements must be in the output: or exist: namespace
   (e.g., <indent value="yes"/> without output: prefix is invalid)

Both raise SEPM0017 per W3C Serialization 3.1.
Fixes serialize-xml-015 and serialize-xml-020.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The W3C Serialization 3.1 spec requires whitespace normalization for
parameter values. The 'value' attribute on parameter elements like
<output:standalone value=" no "/> should be trimmed to "no" before
use.

Add value.trim() in readSerializationProperty after reading the
'value' attribute from the XML element form.

Fixes serialize-xml-029 (standalone=" no ") and serialize-xml-030
(standalone=" yes ").

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…nt-spaces

Three issues from the restxq-impl serialization audit:

1. Canonical parameter — reject with SEPM0016 instead of silently
   accepting. Full canonical XML (C14N) requires sorted attributes,
   sorted namespace declarations, expanded empty elements — complex
   to implement correctly. The parameter was accepted but had no effect;
   now throws SEPM0016 "not supported" to be honest about the limitation.

2. CSV serializer test coverage — add 13 XQSuite tests covering:
   array-of-arrays, sequence-of-maps, XML table, empty input, single
   item, custom delimiters, quoting (always/minimal), escaped quotes,
   header row, values with commas/newlines.

3. JSON indent-spaces — wire the exist:indent-spaces parameter to
   Jackson's JsonGenerator pretty printer. Previously JSON always used
   Jackson's default 2-space indent; now respects the configured
   indent-spaces value (default 4, matching XML/HTML).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
XMLWriter:
- Mark DEL character (0x7F) as special in both text and attribute
  specialChars arrays, so it's escaped as &#x7F; per XML spec.
  Previously DEL passed through unescaped because it fell below
  the 0x80 threshold for the 0x7F-0x9F check.

IndentingXMLWriter:
- Accept "true" and "1" as boolean true for the indent parameter,
  not just "yes". Per W3C Serialization 3.1, boolean parameters
  accept all three forms.

Fixes: K2-Serialization-9 (DEL in attributes), K2-Serialization-10
(DEL in text), K2-Serialization-36 and -37 (indent="true" with
suppress-indentation).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…NR0001)

The W3C Serialization spec requires that maps and function items cannot
be serialized with the XML or text output methods (error SENR0001).
Previously these items would silently produce incorrect output or errors
deep in the SAX pipeline. Now XQuerySerializer validates the sequence
upfront and throws a clear SENR0001 error.

Also flatten arrays in sequences before XML/text serialization, since
the SAX serializer cannot handle ArrayType items directly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ements, suppress-indentation)

The cardinality check in setPropertyForMap was inverted: it used
Cardinality._MANY.isSuperCardinalityOrEqualOf(convention.getCardinality())
which always returned false because _MANY (bit 4) cannot contain
ZERO_OR_MORE (bits 1+2+4). This meant that when passing multiple QName
values for cdata-section-elements (e.g., both an unprefixed and a
namespaced element), only the first QName was kept.

Fix: reverse the check to convention.getCardinality().isSuperCardinalityOrEqualOf(_MANY),
which correctly tests whether the parameter accepts multiple values.

Also adds XQSuite tests for cdata-section-elements with namespaced
elements, both individually and combined.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…path

When serializing sequences with item-separator using the XML method,
the serializeXMLWithItemSeparator path now respects the
omit-xml-declaration parameter. If omit-xml-declaration is explicitly
set to no/false/0, the declaration is output before the first item.
Also ensures that individual node items within the sequence don't
each get their own declaration.

Note: eXist-db defaults omit-xml-declaration to yes (omit). The W3C
Serialization spec leaves the default to the host language, so this
is a valid choice.

Adds test for item-separator with atomic values.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@joewiz joewiz force-pushed the feature/serialization-compliance branch from 1cc2446 to 301bb0e Compare March 30, 2026 01:56
joewiz and others added 6 commits March 29, 2026 22:13
… DOCTYPE

Three serialization fixes:

1. standalone="omit" was written literally into the XML declaration
   instead of being treated as absent. Now "omit" suppresses the
   standalone attribute. Also normalizes true/false/1/0 to yes/no
   in the declaration output.

2. xs:untypedAtomic values in serialization parameter maps caused
   XPTY0004 errors. Now untypedAtomic is accepted and coerced to
   the expected type (boolean, string, etc.), matching the W3C spec
   requirement that untypedAtomic be castable to the target type.

3. HTML5 DOCTYPE was always emitted even for non-html root elements
   (e.g., serializing a bare <body> or <br> fragment). Now DOCTYPE
   is only output when the root element is <html>, matching W3C
   Serialization behavior for HTML fragments.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1. Adaptive serialization: add "map" keyword prefix to map output
   per W3C Serialization 3.1, section 10. Maps are now serialized as
   map{key:value,...} instead of {key:value,...}.

2. Character map validation: use SEPM0016 (invalid parameter value)
   instead of SEPM0017 (unrecognized parameter) when a character map
   key is not a single character. SEPM0016 is the correct error code
   per the W3C spec for invalid parameter values.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Two fixes:

1. Maps and function items in the normalize() step of fn:serialize
   now correctly raise SENR0001 instead of being silently converted
   to strings. Previously, map{} with method=xml/html/xhtml would
   produce an empty string instead of an error.

2. XHTML5Writer now always outputs simple <!DOCTYPE html> for HTML5,
   ignoring doctype-public and doctype-system parameters per W3C
   Serialization spec. Previously it would pass through any
   doctype-public value, producing non-HTML5 DOCTYPEs like
   <!DOCTYPE html PUBLIC "...">.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Three serialization fixes:

1. SERE0020: INF, -INF, and NaN values now raise SERE0020 when
   serialized with the JSON output method, per W3C Serialization
   3.1. Previously these were silently converted to null/1e9999.

2. SERE0023: Multi-item sequences within array members now raise
   SERE0023. Previously [(1, 2)] would silently nest the items
   into a JSON sub-array. The spec requires an error for sequences
   with more than one item at any level.

3. CDATA encoding split: When serializing CDATA sections with a
   restricted encoding (e.g., us-ascii), characters that cannot
   be represented in the encoding now cause the CDATA section to
   be split, with the unencodable character written as a numeric
   character reference between CDATA segments.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… C0 escaping)

Two changes enable XML 1.1 serialization output:

1. ElementConstructor: Remove unconditional XQST0085 throw for
   namespace undeclaration (xmlns:prefix=""). Per the spec, XQST0085
   only applies when the implementation does NOT support XML Names
   1.1. Since eXist now supports XML 1.1 serialization, namespace
   undeclaration is allowed.

2. XMLWriter: Version-aware C0 control character escaping. When
   version="1.1" is set, characters 0x01-0x08, 0x0B, 0x0C, 0x0E-0x1F
   are serialized as numeric character references (e.g., &#x1;).
   These characters are valid in XML 1.1 but must be escaped.

Note: codepoints-to-string() still uses XML 1.0 validation
(XMLChar.isValid) since eXist does not yet have a context-level
XML version setting. K2-7/K2-8 (XML 1.1 control char tests)
remain blocked on this — they need the XDM to allow C0 controls,
which requires a version-aware codepoints-to-string.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ng, validation)

Implements the W3C QT4 canonical serialization parameter for XML and
XHTML output methods:

1. XMLWriter: when canonical=true, buffers namespace and attribute
   events. On closeStartTag(), sorts namespaces by prefix (default
   first) and attributes by namespace URI then local name. Emits
   namespaces before attributes per C14N spec. Rejects relative
   namespace URIs with SERE0024 during namespace buffering.

2. Empty element expansion: canonical mode writes <elem></elem>
   instead of <elem/> for empty elements.

3. SERE0024 validation: rejects relative namespace URIs (checked
   both in XQuerySerializer pre-validation and XMLWriter namespace
   buffering) and multi-root documents when canonical=true.

4. FunSerialize: removes SEPM0016 rejection of canonical parameter.
   When canonical=true, forces omit-xml-declaration=yes, encoding=
   UTF-8, include-content-type=no, and removes CDATA sections.

5. XHTML5Writer: suppresses DOCTYPE output when canonical=true.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@joewiz joewiz force-pushed the feature/serialization-compliance branch from 4f75f1f to 6d9c4e9 Compare March 30, 2026 11:43
joewiz and others added 6 commits March 30, 2026 08:00
Implements the W3C QT4 canonical serialization parameter for the JSON
output method per RFC 8785 (JSON Canonicalization Scheme):

1. Map key sorting: when canonical=true, map entries are sorted by
   key using String.compareTo() (UTF-16 code unit order per RFC 8785).

2. Number formatting: all numeric values are cast to double and
   formatted using ECMAScript shortest representation via BigDecimal.
   Plain notation for [1e-6, 1e21), exponential otherwise. NaN and
   Infinity raise SERE0020.

3. Solidus escaping: disabled in canonical mode (RFC 8785 does not
   escape forward slashes).

4. Duplicate key rejection: canonical mode always rejects duplicate
   keys with SERE0022, regardless of allow-duplicate-names setting.

5. FunSerialize: canonical JSON forces indent=no, escape-solidus=no.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The XML/HTML serializers already supported use-character-maps via the
CharacterMappingWriter decorator, but the JSON serializer (which uses
Jackson's JsonGenerator directly) bypassed it entirely.

Add character map support to JSONSerializer:
- Parse use-character-maps from output properties in constructor
- applyCharacterMap() substitutes mapped codepoints with replacement strings
- writeStringWithCharMap() applies character map before passing to
  generator.writeString(), preserving Jackson's structural state

Character map replacements in JSON string values go through writeString
so Jackson handles structural separators (colons, commas) correctly.
Replacement strings are included as-is in the JSON string (e.g., "<b>"
stays literal since < is valid in JSON strings).

5 new XQSuite tests: JSON string mapping, special characters, raw output
bypass, copyright symbol mapping, XML element text mapping.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds Unicode normalization support during XML serialization:

- Reads the normalization-form parameter (NFC, NFD, NFKC, NFKD, none)
- Normalizes text content and attribute values via java.text.Normalizer
- Applied in writeChars() (text and attributes) and writeCdataContent()
- Skips normalization when form is "none" (default) or text is already
  in the target form (checked via Normalizer.isNormalized())
- "fully-normalized" treated as "none" (optional per spec)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
XHTML5 serialization now normalizes prefixed SVG and MathML elements
to use default namespace bindings instead of prefixed forms:

- <svg:svg xmlns:svg="...svg"> → <svg xmlns="...svg">
- <m:math xmlns:m="...MathML"> → <math xmlns="...MathML">

This applies only to html-version >= 5.0 and only to the two specific
namespace URIs (http://www.w3.org/2000/svg and
http://www.w3.org/1998/Math/MathML). General namespace handling is
unchanged.

The implementation extends the existing XHTML prefix collapsing
mechanism in XHTMLWriter to also handle these foreign namespaces,
converting prefixed namespace declarations to default ones.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Loads serialization parameters from an external XML file referenced
by the parameter-document option:

  declare option output:parameter-document "path/to/params.xml";

The document is resolved relative to the query's static base URI
and parsed as a W3C serialization parameters element. Parameters
from the document provide base settings; inline declare option
statements override them (per W3C spec).

Supports all parameter types including use-character-maps, which
enables character map expansion from external parameter documents.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ments

Two fixes that resolve eXide and other apps failing through the URL rewrite
view pipeline:

1. XMLWriter.namespace(): Skip empty default namespace undeclarations
   (prefix='' nsURI='') that caused "namespace declaration outside an element"
   error. Also skip the implicit xml namespace prefix.

2. XHTMLWriter.writeContentTypeMeta(): Use self-closing <meta .../> tags in
   XHTML mode. The URL rewrite pipeline serializes source documents as XHTML
   (RESTServer forces method=xhtml for text/html), then the view re-parses
   the serialized output as XML. Non-self-closing <meta> tags made the XHTML
   output not well-formed XML, causing parseAsXml() to fail and
   request:get-data() to return a string instead of XML nodes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@joewiz joewiz marked this pull request as draft March 31, 2026 20:24
Tests that HTML documents with <head> elements can be served through the
URL rewrite view pipeline without being returned as strings.

Background: The W3C Serialization 3.1 spec requires that when
include-content-type is "yes" (the default), the XHTML/HTML serializer
should include a <meta> content-type declaration as the first child of
<head>. Commit e6e395f added writeContentTypeMeta() to XHTMLWriter to
implement this requirement. However, the injected <meta> tag used HTML-style
non-self-closing format (<meta ...> instead of <meta .../>) even in XHTML
mode. When the URL rewrite pipeline serialized a text/html document as XHTML
(RESTServer forces method=xhtml for text/html), the non-self-closing <meta>
made the output not well-formed XML. The view's request:get-data() then
failed to parse it as XML and returned a string, causing XPTY0019.

The test stores an HTML document with a <head> element, serves it through
a controller.xq + view.xq dispatch, and verifies:
- HTTP 200 (not 400 or 500)
- Source page content preserved
- View wrapper content applied
- No raw XML entities in output (indicating string instead of nodes)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@joewiz joewiz force-pushed the feature/serialization-compliance branch from ceed804 to 6ec8727 Compare March 31, 2026 20:59
joewiz added a commit to joewiz/exist that referenced this pull request Apr 6, 2026
Three targeted fixes prevent the forked JVM from hanging after
BrokerPool.shutdown() completes:

1. StatusReporter threads are now daemon threads. The startup and
   shutdown status reporter threads are monitoring-only and must not
   prevent JVM exit. Added newInstanceDaemonThread() to ThreadUtils.

2. Four wait loops in BrokerPool that swallowed InterruptedException
   and used unbounded wait() now have 1-second poll timeouts,
   isShuttingDown() checks, and proper interrupt handling:
   - get() service mode wait: breaks on shutdown or interrupt
   - get() broker availability wait: throws EXistException on shutdown
   - enterServiceMode() wait: breaks on shutdown or interrupt
   - shutdown() active brokers wait: re-sets interrupt flag and breaks

3. At end of shutdown, instanceThreadGroup.interrupt() wakes any
   lingering threads in the instance's thread group.

Previously, 4 test classes required exclusion or timeout workarounds
(DeadlockIT, RemoveCollectionIT, CollectionLocksTest, MoveResourceTest).
Now all complete cleanly: 6533 unit tests + 9 integration tests,
0 failures, clean JVM exit.

Affects PRs with CI timeout workarounds: eXist-db#6112, eXist-db#6139, eXist-db#6138
Related: eXist-db#3685 (FragmentsTest deadlock)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@joewiz
Copy link
Copy Markdown
Member Author

joewiz commented Apr 6, 2026

[This comment was co-authored with Claude Code. -Joe]

Closing — superseded by #6219 (v2/serialization-compliance).

This work has been consolidated into a clean v2/ branch as part of the eXist-db 7.0 PR reorganization. The new PR includes all commits from this PR plus additional related work, with reviewer feedback incorporated where applicable. See the reviewer guide for the full context.

@joewiz joewiz closed this Apr 6, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants