feat: add client.parse() for the Data Extraction API (/extraction/parse) #12
Merged
The Data Extraction API (`POST /extraction/parse`) ships on a separate OpenAPI document from the existing DWS Processor API. Vendor the public spec so the new typed client surface is anchored to a checked-in source of truth. The Processor API spec stays at `dws-api-spec.yml`; the Data Extraction spec lives alongside it at `dws-data-extraction-spec.yml`.
Introduce hand-written types mirroring the public Data Extraction OpenAPI 3.1 contract (version 2026-05-25):
- ParseMode (text | structure | understand | agentic)
- ParseOutputFormat (spatial | markdown), ParseOutputOptions
- ParseInstructions and ParseOptions request shapes
- ParseResponseSpatial / ParseResponseMarkdown discriminated by output payload
- Per-element types: ParagraphElement, FormulaElement, PictureElement, TableElement (with ParseTableCell), KeyValueRegionElement (with KeyValuePair / KeyValueEntity), HandwritingElement, and shared ParseElementBase / ParseBounds / ParsePageRef / ParseWord
- ParseErrorResponse with structured failingPaths
- ParseMetrics, ParseUsage (carrying data_extraction_credits), ParseConfiguration

The Data Extraction API bills against a separate extraction-credits bucket from the processor API; type JSDoc makes the distinction explicit so client code does not conflate the two billing buckets.

Wires the new endpoint into RequestTypeMap / ResponseTypeMap so the existing HTTP layer stays type-safe end-to-end.
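A minimal sketch of how a discriminated element union like the one listed above can be consumed. Only the `type` discriminator is taken from this PR; the `ParagraphLike`, `TableLike`, and `FormulaLike` stand-ins and their `text`, `cells`, and `latex` fields are assumptions made for illustration, not the real schema.

```typescript
// Simplified stand-ins for the real element types. Only the `type`
// discriminator comes from the PR text; every other field is hypothetical.
interface ParagraphLike {
  type: 'paragraph';
  text: string;
}
interface TableLike {
  type: 'table';
  cells: { row: number; column: number; text: string }[];
}
interface FormulaLike {
  type: 'formula';
  latex: string;
}

type ElementLike = ParagraphLike | TableLike | FormulaLike;

// Switching on `type` narrows `element` to one variant per branch.
function describeElement(element: ElementLike): string {
  switch (element.type) {
    case 'paragraph':
      return `paragraph: ${element.text}`;
    case 'table':
      return `table with ${element.cells.length} cell(s)`;
    case 'formula':
      return `formula: ${element.latex}`;
    default: {
      // Exhaustiveness check: stops compiling if a new variant is not handled.
      const unreachable: never = element;
      return unreachable;
    }
  }
}
```

The `never` assignment in the default branch is a common way to make the compiler flag an unhandled variant when the union grows.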
…wrappers

Adds first-class client methods for the Data Extraction API:
- parse(input, options?): full-fidelity call against POST /extraction/parse, supporting local files, buffers, streams, and URL inputs. Handles multipart upload for binary inputs and JSON body for URL-only requests.
- parseToMarkdown(input, mode?): convenience wrapper returning the whole-document Markdown string directly. Defaults to mode='text' (cheapest).
- parseElements(input, mode?, includeWords?): convenience wrapper returning the typed spatial-elements array. Defaults to mode='structure'.

Threads x-nutrient-api-version through the HTTP layer when the caller pins a specific API version.

JSDoc on every new method makes the billing distinction explicit: the Data Extraction API bills against extraction credits, a separate bucket from the processor API credits used by the rest of NutrientClient.

The full set of new types is re-exported from the package root.
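The request-shape decision described above (multipart for binary inputs, JSON for URL-only inputs, plus the optional version header) might look roughly like the sketch below. `planParseRequest`, `ParseInputLike`, and `PlannedParseRequest` are hypothetical names, and `Uint8Array` stands in for whatever binary inputs the real client accepts; only the `{ type: 'url', url }` input shape and the `x-nutrient-api-version` header name come from this PR.

```typescript
// Hypothetical input shapes: raw bytes, or a URL reference object.
type ParseInputLike = Uint8Array | { type: 'url'; url: string };

interface PlannedParseRequest {
  bodyKind: 'multipart' | 'json';
  headers: Record<string, string>;
  instructions: Record<string, unknown>;
}

function planParseRequest(
  input: ParseInputLike,
  instructions: Record<string, unknown> = {}, // empty object when no options are given
  apiVersion?: string,
): PlannedParseRequest {
  const headers: Record<string, string> = {};
  // Forward the version header only when the caller pins a version.
  if (apiVersion !== undefined) {
    headers['x-nutrient-api-version'] = apiVersion;
  }
  // Binary inputs need a multipart upload; URL-only inputs fit in a JSON body.
  const bodyKind = input instanceof Uint8Array ? 'multipart' : 'json';
  return { bodyKind, headers, instructions };
}
```

For example, `planParseRequest({ type: 'url', url: 'https://example.com/a.pdf' }, { mode: 'text' }, '2026-05-25')` plans a JSON request that carries the pinned version header.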
Adds 19 unit tests around the new /extraction/parse surface:
- Request shape: multipart vs JSON, apiVersion header forwarding, option serialisation (language, output, includeWords), default behaviour.
- Mode coverage: all four modes (text, structure, understand, agentic) round-trip through the instructions payload.
- Output coverage: spatial elements and whole-document Markdown variants validated end-to-end, including extraction-credit accounting on the response (data_extraction_credits, not processor credits).
- Error paths: HTTP-layer ValidationError propagation, file-input preflight failures surfaced before the request leaves the process.
- Convenience wrappers: parseToMarkdown and parseElements default modes and includeWords forwarding, plus defensive output-mismatch errors.

Adds examples/src/parse_smoke.ts, a live operator-runnable smoke test that prints a parsed summary plus extraction-credit usage. Documents the build/pack/install/run recipe in the file header.
- README: new "Data Extraction (/extraction/parse)" section with mode/credit table, request examples for spatial + Markdown outputs, URL input, convenience wrappers, and a pointer to the smoke example.
- docs/METHODS.md: new entries for parse, parseToMarkdown, parseElements inserted alongside the existing extract* convenience methods.
- LLM_DOC.md: inject the same three method signatures so coding agents steered by this rule file know about parse and the extraction-credits bucket.
- CHANGELOG.md: Unreleased entry covering the new client surface, the newly-exported public types, the live smoke script, and an explicit call-out that /extraction/parse bills against extraction credits (separate from processor API credits).

Every doc surface that mentions cost says "extraction credits" explicitly so downstream readers cannot conflate the two billing buckets.

- CHANGELOG: correct path to live smoke script
- METHODS.md: fix dangling sentence on parseElements compile-time guard
Factor the inline extraction-credit billing shape out of ParseUsage into a standalone ExtractionCredits interface in src/types/extraction_credits.ts, mirroring the Python client's type-factoring approach. ParseUsage.data_extraction_credits now references ExtractionCredits instead of an anonymous inline type, making the billing object reusable if future endpoints surface the same shape. ExtractionCredits is re-exported from the package root alongside the other parse types.
Lead with the "Designed for" preamble naming the three canonical workflows (RAG/search indexing, form/invoice extraction, layout-aware understanding) before describing modes and output formats. Broaden the @param input description to explicitly mention non-PDF inputs (Office documents, images), matching the actual endpoint capability rather than implying PDF-only like sign(). Update the @example block to show a form/invoice extraction recipe alongside the RAG recipe, and replace the generic paragraph-walk with a keyValueRegion traversal that a form-extraction caller can copy directly.
Restructure the README's /extraction/parse section to lead with use cases (RAG ingestion, form/invoice extraction, layout-aware understanding) before the mode table and code, matching the Python client's documentation approach.

Add:
- "Choosing an output format" table (markdown vs spatial, with shape and best-for columns).
- "Modes — when to use which" table with credit costs and decision guidance.
- Two worked recipes: RAG ingestion (PDF → Markdown → embed) and form/invoice extraction (PDF → spatial elements → structured object), each with the convenience-wrapper alternative shown alongside.
- Explicit note that the endpoint accepts PDFs, Office documents, and images, not PDFs only.
- Mention of the new ExtractionCredits type in the exported-types list.

Update METHODS.md parse/parseToMarkdown/parseElements entries to match: lead with use-case positioning, add a parameters table, align examples with the recipe pattern from the README.
DWS Extract is a separate product from DWS Processor with its own API key and credit pool. Calling /extraction/parse with the Processor key returns 403. Add an optional `extractApiKey` constructor option (string or async getter) that parse() prefers over apiKey when set; every non-parse method keeps using apiKey. Falls back to apiKey when extractApiKey is omitted, so tenants with a single global DWS key still work. The routing happens via a per-call options copy that swaps apiKey to the extract key — leaves this.options untouched and covers both the multipart file-input path and the JSON url-input path. Drop the bundled parse smoke script — its dual-key dance and pack/install recipe were superseded by the unit-test coverage of the request shape, response handling, and routing. Live verification against a real account belongs to ad-hoc developer sessions, not committed scaffolding. Mirrors PR #47 on the Python sibling client.
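The per-call routing described above can be sketched as a pure function over the client options. `ClientOptionsLike` and `optionsForParse` are hypothetical names used only for illustration; the sketch also assumes the constructor has already rejected invalid keys, so it only has to choose between the two.

```typescript
// A key is either a literal string or an async getter, as described above.
type ApiKey = string | (() => Promise<string>);

// Hypothetical, trimmed-down view of the client options.
interface ClientOptionsLike {
  apiKey: ApiKey;
  extractApiKey?: ApiKey;
}

// Build the options used for a single parse() call. The spread creates a new
// object, so the options stored on the client are never mutated.
function optionsForParse(options: ClientOptionsLike): ClientOptionsLike {
  return {
    ...options,
    // Prefer the extract key; fall back to the global key when it is omitted.
    apiKey: options.extractApiKey ?? options.apiKey,
  };
}
```

Because the swap happens on a copy, non-parse methods that read the stored options keep seeing the Processor key.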
Add `npm run generate:types:extract` that runs openapi-typescript against the vendored dws-data-extraction-spec.yml into src/generated/extract-types.ts, peer to the existing `generate:types` flow for the Processor spec.

Rewrite src/types/parse.ts so the schema primitives derive from the generated `components['schemas']` rather than being hand-rolled:
- ParseMode, ParseOutputFormat
- ParseElement and the six element subtypes (ParagraphElement, FormulaElement, PictureElement, TableElement, KeyValueRegionElement, HandwritingElement)
- ParseElementBase, ParseBounds, ParsePageRef, ParseWord
- ParseTableCell, KeyValuePair, KeyValueEntity
- ParseMetrics, ParseUsage, ParseConfiguration
- ParseErrorResponse, ParseErrorDetails, ParseErrorFailingPath
- ParagraphRole (now `NonNullable<ParagraphElement['role']>`)

Keep five types hand-composed where they add something the spec doesn't express:
- ParseOutputOptions / ParseInstructions: the spec marks `OutputOptions.includeWords` as required, but the server has a default and clients shouldn't be forced to pass it.
- ParseResponseSpatial / ParseResponseMarkdown: cross-field discriminated narrowing (`elements?: undefined` / `markdown?: undefined`) the spec's ParseOutput doesn't model, letting callers write `if (output.markdown !== undefined)` without per-call `?.` access.
- ParseOptions: adds the client-only `apiVersion` header concern that isn't a body field in the spec.

Net: ~210 lines of hand-rolled type definitions deleted, replaced with one-line aliases that re-route through the generated schema. The public surface (every exported name) is unchanged.
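The cross-field narrowing mentioned above can be illustrated with a small sketch. `SpatialOutputLike`, `MarkdownOutputLike`, and `summarizeOutput` are hypothetical stand-ins; only the `elements?: undefined` / `markdown?: undefined` pattern and the `output.markdown !== undefined` check come from the commit message.

```typescript
// Each variant declares the other variant's payload as `?: undefined`, so a
// single undefined-check on one field tells the compiler which variant it has.
interface SpatialOutputLike {
  elements: { type: string }[];
  markdown?: undefined;
}
interface MarkdownOutputLike {
  markdown: string;
  elements?: undefined;
}
type OutputLike = SpatialOutputLike | MarkdownOutputLike;

function summarizeOutput(output: OutputLike): string {
  if (output.markdown !== undefined) {
    // Narrowed to MarkdownOutputLike: `markdown` is a plain string here.
    return `markdown (${output.markdown.length} chars)`;
  }
  // Narrowed to SpatialOutputLike: `elements` needs no optional chaining.
  return `spatial (${output.elements.length} elements)`;
}
```

Without the `?: undefined` members, the compiler would reject `output.markdown` on the union because the property would not exist on every variant.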
…spec re-export

Most APIs in this client (sign, ocr, watermark, redact, etc.) don't have a dedicated `src/types/<api>.ts` file; they reach types via `components['schemas']['X']` from `src/generated/api-types.ts`. The `src/types/parse.ts` and `src/types/extraction_credits.ts` files added on this branch were an outlier: most of their content was thin one-line aliases over the generated extract spec.

Collapse to the rest-of-codebase pattern:
- Delete `src/types/parse.ts` (was 254 lines, mostly aliases).
- Delete `src/types/extraction_credits.ts` (single hand-rolled interface that duplicated the generated `Usage.data_extraction_credits` shape).
- Move the 5 hand-composed types into `src/types/http.ts` (it already imports `ParseInstructions` / `ParseResponse` to type the endpoint maps): `ParseOutputOptions`, `ParseInstructions`, `ParseOptions`, `ParseResponseSpatial`, `ParseResponseMarkdown`, plus the derived `ExtractionCredits` alias. Each carries the JSDoc explaining why it's hand-composed instead of derived.
- Drop the 23 cosmetic spec-alias exports from the package root. Consumers who need element-subtype types reach them via the new `extractComponents['schemas']['ParagraphElement']` namespace re-export, mirroring how Processor types are exposed via the existing `components` namespace.

The package's public surface still exports the 7 hand-composed types (`ParseOutputOptions`, `ParseInstructions`, `ParseOptions`, `ParseResponse`, `ParseResponseSpatial`, `ParseResponseMarkdown`, `ExtractionCredits`) by name. Internal consumers (`src/client.ts`, the parse unit tests) shift to `extractComponents['schemas']['X']` for spec-derived types.

Net: -290 lines on the type-definition surface, no behaviour change.
Five findings from review:

1. Empty-string `extractApiKey` bypassed constructor validation. `apiKey` uses `!options.apiKey` (falsy, catches `''`); the new `extractApiKey` validator only checked `!== undefined` plus the type guard, so `extractApiKey: ''` passed, propagated into the per-call options as `apiKey: ''`, and produced `Authorization: Bearer ` with no token, surfacing as a confusing server-side 401 instead of a constructor-time `ValidationError`. Add an explicit empty-string check.
2. `extractErrorMessage` in `src/http.ts` checked snake_case (`error_message`, `error_description`) and generic message fields but not `errorMessage` (camelCase), the field DWS Extract returns on every 4xx/5xx. Result: the server's specific message (e.g. `"invalid mode: 'vlm'"`) was silently replaced by the generic `HTTP <status>: <statusText>` string. Add `errorMessage` to the priority list.
3. `parse()` accepted `mode='text' + output.format='spatial'` and let the server reject with 400. The Python sibling client adds a client-side `ValidationError` for this case (after reviewer feedback). The TS `parseElements()` wrapper blocked it at the type level via `Exclude`, but the low-level `parse()` did not. Add a pre-flight runtime guard.
4. `RequestTypeMap` JSDoc on `/extraction/parse` claimed `instructions` was optional for multipart upload, but the type definition marks it required and the implementation always passes it (an empty object when no options are supplied). Update the comment to match the type.
5. `parse()` `@param options.language` JSDoc described the field as "string or array of ISO 639-2 codes". The underlying spec also accepts lowercase language names (`'english'`, `'german'`) and `+`-joined multilingual strings (`'eng+spa'`). Document all four accepted forms.

Adds three unit tests (empty-string `extractApiKey`, `errorMessage` extraction, text+spatial pre-flight rejection). 292 tests pass.
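Findings 2 and 3 can be sketched as two small helpers. This is an illustration under stated assumptions, not the code in `src/http.ts` or `src/client.ts`: the names `extractErrorMessageSketch`, `assertModeAllowsFormat`, and `ValidationErrorLike` are hypothetical, and the exact order of the field priority list is assumed (the PR only says `errorMessage` was added to it).

```typescript
// Finding 2: look for a server-provided message before falling back to the
// generic "HTTP <status>: <statusText>" string. Field order is an assumption.
function extractErrorMessageSketch(body: unknown, status: number, statusText: string): string {
  if (typeof body === 'object' && body !== null) {
    const record = body as Record<string, unknown>;
    for (const field of ['errorMessage', 'error_message', 'error_description', 'message']) {
      const value = record[field];
      if (typeof value === 'string' && value.length > 0) {
        return value;
      }
    }
  }
  return `HTTP ${status}: ${statusText}`;
}

// Finding 3: reject the unsupported combination before any request is sent.
class ValidationErrorLike extends Error {}

type ModeLike = 'text' | 'structure' | 'understand' | 'agentic';
type FormatLike = 'spatial' | 'markdown';

function assertModeAllowsFormat(mode: ModeLike, format: FormatLike): void {
  if (mode === 'text' && format === 'spatial') {
    throw new ValidationErrorLike("mode 'text' does not support output format 'spatial'");
  }
}
```

Failing in the client keeps the error local and avoids a round trip that would end in a 400 anyway.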
HungKNguyen
approved these changes
May 29, 2026
Summary
Adds first-class TypeScript client support for the Data Extraction API (`/extraction/parse`), which is now generally available. Mirrors the Python sibling PR.
Changes Made
Public surface
- `client.parse()` covering all four processing modes (`text`, `structure`, `understand`, `agentic`) and both output formats (`spatial` element list, whole-document `markdown`).
- `client.parseToMarkdown()` (markdown-only return) and `client.parseElements()` (spatial-only return, with `mode='text'` excluded at the type level since the API rejects that combination).
- `ParseResponse` envelope with a discriminated union of element variants (paragraph, table, formula, picture, keyValueRegion, handwriting); `if (element.type === 'table') { ... }` narrows correctly via the `type` discriminator.
- `NutrientClient` accepts a new optional `extractApiKey` constructor option (string or async getter) for the Data Extraction product key, which is separate from the Processor key. `parse()` prefers `extractApiKey` over `apiKey` when set; every non-parse method keeps using `apiKey`. Falls back to `apiKey` when `extractApiKey` is omitted so tenants with a single global DWS key still work. Calling `/extraction/parse` with a Processor-only key returns `403`.
- `ExtractionCredits` type module to surface the extraction-credit billing bucket separately from the processor-credit bucket. README, CHANGELOG, and JSDoc all make the distinction explicit.
- Types re-exported from the package root (`ParseResponse`, `ParseResponseSpatial`, `ParseResponseMarkdown`, `ParseInstructions`, `ParseOptions`, `ParseMode`, `ParseOutputFormat`, the discriminated `ParseElement` union, all variant types, `ExtractionCredits`).
Types & codegen
- Vendored `dws-data-extraction-spec.yml` (sibling to the existing `dws-api-spec.yml`).
- `npm run generate:types:extract` script (peer to the existing `generate:types`) that runs `openapi-typescript` against the vendored spec into `src/generated/extract-types.ts`.
- `src/types/parse.ts` derives its schema primitives (`ParseMode`, `ParseElement` and all six element subtypes, `Bounds`, `PageRef`, `Word`, `TableCell`, `KeyValuePair`, `KeyValueEntity`, `Metrics`, `Usage`, `Configuration`, `ParseErrorResponse`, `ParagraphRole`) from the generated `components['schemas']` rather than being hand-rolled, so spec drift now flows through automatically. Five types stay hand-composed where they add something the spec doesn't express: `ParseOutputOptions` / `ParseInstructions` (spec marks `includeWords` as required but it has a server-side default), `ParseResponseSpatial` / `ParseResponseMarkdown` (cross-field discriminated narrowing so `if (output.markdown !== undefined)` works without per-call `?.` access), and `ParseOptions` (adds the client-only `apiVersion` header).
Docs
- `docs/METHODS.md` and `LLM_DOC.md` updated to document the new surface and the dual-key requirement.
Live verification (against prod)
Ran a full param sweep against the prod API using `examples/assets/sample.pdf` (6 pages), covering every documented `(mode, output_format)` combination, the spec-rejected case, every `ParseOptions` param, all four input shapes, and a client-side error path:
- `text` + `markdown`
- `text` + `spatial`: `ValidationError` (per spec)
- `structure` + `markdown`
- `structure` + `spatial`
- `understand` + `markdown`
- `understand` + `spatial`
- `agentic` + `markdown`
- `agentic` + `spatial`
- `structure` + `spatial`, `includeWords=true`
- `structure` + `md`, `language=['eng','deu']`
- `text` + `md`, `apiVersion=2026-05-25` header
- Input shape: `Buffer`
- Input shape: `{ type: 'url', url }` object
- Client-side error path: `ValidationError` (no network)

Separate dual-key smoke also covered routing end-to-end:
- `{ apiKey: processor, extractApiKey: extract }` → `getAccountInfo()` / `extractText()` / `parse()` / `parseToMarkdown()` all succeed.
- `{ apiKey: processor }` (no `extractApiKey`) → `getAccountInfo()` still works, `parseToMarkdown()` returns `AuthenticationError` HTTP 403 from the Extract product, exactly the failure mode the new option exists to prevent.