DWS: Add document-extraction-api skill for /extraction/parse by nickwinder · Pull Request #13 · PSPDFKit-labs/nutrient-skills

nickwinder · 2026-05-27T08:41:13Z

Why

The Data Extraction API (/extraction/parse) went GA. It's a single document-understanding primitive that returns either the full structural document model (typed elements with bounding boxes and reading order) or a whole-document Markdown string — the natural primitive for RAG indexing, form/invoice extraction, and layout-aware understanding.

DWS Extract is a separate product from DWS Processor, with its own API key and credit pool. This PR adds a dedicated skill — document-extraction-api — alongside the existing document-processor-api, rather than conflating both products under one skill. The processor skill is left untouched.

Summary

New skill: plugins/nutrient-dws/skills/document-extraction-api/

SKILL.md — explains the product split and the dual credit pool. Defines a mode-selection principle: the agent decides from the request alone (no clarifying questions); walks request, filename, and intent cues in order; and picks the cheapest mode that satisfies every floor. Sets a 200-credit cost-confirmation threshold above which the agent surfaces an estimate to the operator before invoking.
scripts/parse.py — single primitive accepting a local file plus mode and output_format. Calls client.parse(), writes the result, and prints extraction-credit usage. Modes: text (1 cr/pg), structure (1.5 cr/pg, default), understand (9 cr/pg), agentic (18 cr/pg). Output shapes: spatial elements or whole-document markdown. Pinned to nutrient-dws>=3.1.0.
scripts/lib/common.py — create_client() factory reading NUTRIENT_EXTRACT_API_KEY (falls back to NUTRIENT_API_KEY for tenants on global keys); constructs NutrientClient(api_key=..., extract_api_key=...).
references/parse-output-filtering.md — points at the canonical upstream docs for the /parse response schema (extract-document-elements, extract-markdown, processing-modes, coordinate-spaces) and lists the tools we suggest reaching for to reshape a response: jq for filtering / projection, the stdlib json module for programmatic walks, pandas for table-to-dataframe projection, any LaTeX renderer for formula elements, and a standard Markdown parser for chunking output.markdown.
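The per-page rates and the 200-credit confirmation threshold described in this summary can be sketched as a small estimator. This is a minimal illustration rather than the skill's actual code: `CREDITS_PER_PAGE`, `estimate_credits`, and `needs_confirmation` are hypothetical names, and the sketch assumes that cost scales linearly with page count and that "above" means strictly greater than 200.

```python
# Per-page rates as listed for scripts/parse.py in this PR.
CREDITS_PER_PAGE = {
    "text": 1.0,
    "structure": 1.5,  # default mode
    "understand": 9.0,
    "agentic": 18.0,
}

# Threshold from SKILL.md: above this, surface an estimate to the operator first.
CONFIRMATION_THRESHOLD = 200


def estimate_credits(pages: int, mode: str = "structure") -> float:
    """Estimate extraction credits, assuming cost is linear in page count."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    if mode not in CREDITS_PER_PAGE:
        raise ValueError(f"unknown mode: {mode!r}")
    return pages * CREDITS_PER_PAGE[mode]


def needs_confirmation(pages: int, mode: str = "structure") -> bool:
    """True when the estimate exceeds the threshold (assumed strict '>')."""
    return estimate_credits(pages, mode) > CONFIRMATION_THRESHOLD
```

Under these assumptions a 6-page document in `structure` mode is estimated at 9 credits, which matches the live smoke test reported in the commit log, while a 12-page document in `agentic` mode (216 credits) would call for confirmation.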

Top-level

AGENTS.md advertises the new skill alongside the existing two.

Mode-decision capability sweep

To validate that the skill makes the right (mode, output_format) call across the documented decision surface, I built a 9-eval matrix where each eval pairs an intent prompt with the (mode, output_format) the docs prescribe for that intent:

| # | Probe | Expected |
|---|-------|----------|
| 0 | RAG ingestion of a born-digital PDF (cost matters) | `text + markdown` |
| 1 | RAG ingestion of a scanned PDF (OCR required) | `understand + markdown` ¹ |
| 2 | Ambiguous "parse this for me", no cues | `structure + spatial` (default) |
| 3 | Loan application key-value extraction for a reviewer UI | `understand + spatial` |
| 4 | Document outline tool: Title / SectionHeader / Body classification | `understand + spatial` |
| 5 | Accessibility audit: alt-text on figures / charts / diagrams | `agentic + spatial` |
| 6 | Chart descriptions for a brief | `agentic + (spatial \| markdown)` |
| 7 | Search index for born-digital Confluence-exported PDFs | `text + markdown` |
| 8 | Layout overlay UI: bounding boxes only, no semantic roles | `structure + spatial` |

¹ The docs prescribe structure + markdown (1.5 cr/pg) for this case, but the server currently returns HTTP 500 on image-only PDFs for any structure-mode call regardless of output format. Tracked separately. understand + markdown (9 cr/pg) is the cheapest working combination for the scanned-RAG intent today.
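The "cheapest mode that satisfies every floor" rule behind this matrix can be sketched as a lookup over per-cue minimums. This is an illustration only: the cue names, `CUE_FLOOR`, and `pick_mode` are hypothetical, the cue-to-floor mapping is read off the eval table rather than taken from SKILL.md, and the sketch assumes each more expensive mode can do everything the cheaper ones can.

```python
# Per-page rates from this PR; sorting by cost gives the mode ranking.
MODE_COST = {"text": 1.0, "structure": 1.5, "understand": 9.0, "agentic": 18.0}
MODES_BY_COST = sorted(MODE_COST, key=MODE_COST.get)

# Hypothetical cue names mapped to the minimum mode ("floor") each one needs.
CUE_FLOOR = {
    "born_digital_text": "text",
    "bounding_boxes": "structure",
    "scanned_input": "understand",  # workaround floor; the docs prescribe structure
    "key_value_fields": "understand",
    "semantic_roles": "understand",
    "figure_descriptions": "agentic",
}

DEFAULT_MODE = "structure"  # used when the request carries no cues


def pick_mode(cues: list[str]) -> str:
    """Return the cheapest mode whose rank meets every cue's floor."""
    if not cues:
        return DEFAULT_MODE
    floors = [MODES_BY_COST.index(CUE_FLOOR[cue]) for cue in cues]
    return MODES_BY_COST[max(floors)]
```

Taking the highest floor among the cues is what keeps the choice minimal: a request that only needs born-digital text stays on `text`, and a single cue such as figure descriptions is enough to raise the whole request to `agentic`.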

Each eval was run twice — once with the skill present, once with a baseline subagent that has the same API access but no skill guidance. Sample input: the 6-page born-digital sample.pdf from nutrient-dws-client-python/tests/data/, plus a rasterized-at-150-DPI variant with no text layer for the scanned-input eval.

Round-1 results (18 parallel subagent runs):

9/9 mode-decision evals passed in both configurations under bug-aware grading. Both the skill and the baseline picked the right (mode, output_format) on every probe.
The skill prevents one overshoot the baseline makes on its own: on eval #7 ("born-digital, search index, markdown, cost matters") the baseline reached for structure + markdown (1.5 cr/pg), while the skill correctly walked down to text + markdown (1 cr/pg), a 33% lower per-page cost on that workflow.
Tokens: ~27% lower with the skill (avg 25.0k vs 34.4k per task).
Wall time: ~60% lower with the skill (avg ~74s vs ~186s per task).

The skill's measurable value-add is avoiding the "go look up the API, then enumerate the modes, then decide what's cheapest" detour that the baseline walks through on every invocation.
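The percentages quoted in the round-1 results follow directly from the reported averages. A short check, where `percent_lower` is a hypothetical helper introduced only for this illustration:

```python
def percent_lower(new: float, baseline: float) -> float:
    """Relative reduction of `new` versus `baseline`, in percent."""
    return (baseline - new) / baseline * 100


cost_saving = percent_lower(1.0, 1.5)     # text vs structure, credits per page
token_saving = percent_lower(25.0, 34.4)  # average thousand tokens per task
time_saving = percent_lower(74, 186)      # average seconds per task

print(f"{cost_saving:.0f}% {token_saving:.0f}% {time_saving:.0f}%")
# prints: 33% 27% 60%
```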

Verification

claude plugin validate . passes.
uv run scripts/parse.py --help renders correctly.
Live end-to-end against the production /extraction/parse endpoint: 18 parallel subagent runs across 9 mode-decision evals × 2 configurations (with / without the skill). All runs returned well-formed outputs with the (mode, output_format) the docs prescribe; credit-usage summary prints correctly from the Python client.

Teach the DWS skill how to call the now-GA /extraction/parse endpoint:

- scripts/parse.py: single primitive that accepts a local file plus mode and output_format, calls client.parse(), and writes the result. Modes: text (1 cr/pg), structure (1.5 cr/pg, default), understand (9 cr/pg), agentic (18 cr/pg). Output shapes: spatial elements or whole-document Markdown. Billed against extraction credits (separate from processor API credits). Prints usage summary after each run.
- references/parse-output-filtering.md: new reference doc showing downstream consumption patterns after a single /parse call: reading-order plain text, table-to-grid projection, key-value dict, formula LaTeX, picture alt descriptions. Includes Python snippets and jq one-liners for each pattern.
- references/script-catalog.md: adds parse.py entry under a new "Data Extraction" section with mode, cost, and output-shape summary.
- SKILL.md: adds a Data Extraction section covering what /parse is (document-understanding primitive, not per-element-type calls), mode selection table keyed to user intent, default of structure+spatial for ambiguous requests, invocation examples, downstream-consumption quick-ref, and pointer to parse-output-filtering.md. Also updates skill description and task-scripts list.

Python client dependency: path-install of the local branch that adds client.parse() support (file:// URL in the uv inline script header).

…on-api skill

DWS Extract is a separate product from DWS Processor: different API key, different credit pool, different billing. Splitting the parse primitive into its own skill removes the conflation and lets agents pick the right product upfront.

- New skill: plugins/nutrient-dws/skills/document-extraction-api
- parse.py + references/parse-output-filtering.md moved over via git mv
- SKILL.md focused on the Data Extraction product, mode/output table, downstream consumption patterns, and the separate NUTRIENT_EXTRACT_API_KEY
- Local lib/common.py with create_client() that reads NUTRIENT_EXTRACT_API_KEY (falls back to NUTRIENT_API_KEY for tenants on global keys) and constructs NutrientClient(api_key=..., extract_api_key=...)
- Pinned to nutrient-dws>=3.1.0 in the script's PEP 723 metadata
- document-processor-api: removed the Data Extraction section, the parse.py entry, and the parse-output-filtering reference map row. Cross-link to the sibling skill in the frontmatter description and "When to use" section.
- AGENTS.md: advertise the new skill alongside the existing two.
- Fix latent bug in parse.py: was reading usage.dataExtractionCredits (camelCase) but the API returns data_extraction_credits (snake_case), so the credit-usage summary was silently skipped on every call. Confirmed end-to-end via live smoke (6-page PDF, structure/spatial mode, 9 credits, ~46KB JSON, usage summary now prints correctly).
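The camelCase/snake_case mismatch fails silently because a tolerant dictionary lookup returns `None` instead of raising. The sketch below reproduces that failure mode under stated assumptions: only the key names `dataExtractionCredits` and `data_extraction_credits` come from the commit message, while the response shape, `credits_used`, and `usage_summary` are hypothetical.

```python
# Hypothetical response fragment: only the key name `data_extraction_credits`
# is taken from the commit message; the surrounding shape is assumed.
response = {"usage": {"data_extraction_credits": 9}}


def credits_used(resp: dict, key: str):
    """Look up a usage counter; returns None when the key is absent."""
    return resp.get("usage", {}).get(key)


def usage_summary(resp: dict, key: str) -> str:
    """Build the summary line, or an empty string when nothing was found."""
    credits = credits_used(resp, key)
    return "" if credits is None else f"Extraction credits used: {credits}"


before_fix = usage_summary(response, "dataExtractionCredits")    # camelCase
after_fix = usage_summary(response, "data_extraction_credits")   # snake_case
```

With the camelCase key the summary comes back empty and nothing is printed, which is consistent with the "silently skipped" behaviour described above; the snake_case key yields the expected line.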

The split into document-extraction-api is purely additive — the processor skill doesn't need cross-links or trimming. Leave it untouched.

- references/parse-output-filtering.md: snake_case `data_extraction_credits` to match the actual response shape (it was camelCase in three places: the schema diagram, the Python snippet, and the prose note). Anyone following the reference's Python snippet would silently get nothing back. Verified against the live API.
- scripts/lib/common.py: use `is None` instead of truthiness for the env var checks, so `export NUTRIENT_EXTRACT_API_KEY=` (explicit empty) is treated as a misconfiguration to surface, not as "fall back to the Processor key". Also drop helpers carried over from the sibling skill's common.py that this skill never uses (write_json_output, parse_csv, read_json_file, fix_negative_args).
- scripts/parse.py: call assert_local_file() on `--input` so URL inputs produce a clear error message instead of leaking through to a misleading FileNotFoundError.
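The difference between an `is None` check and a truthiness check on the environment variables can be sketched as follows. This is an assumption-laden illustration, not the contents of scripts/lib/common.py: `resolve_extract_key` is a hypothetical name, the error messages are invented, and the environment is passed in as a mapping so the logic can be exercised without touching `os.environ`.

```python
from typing import Mapping


def resolve_extract_key(env: Mapping[str, str]) -> str:
    """Pick the extraction API key, falling back only when the variable is unset."""
    key = env.get("NUTRIENT_EXTRACT_API_KEY")
    if key is None:
        # Unset: fall back to the global key, as the PR description states.
        key = env.get("NUTRIENT_API_KEY")
        if key is None:
            raise RuntimeError("Set NUTRIENT_EXTRACT_API_KEY or NUTRIENT_API_KEY")
    if key.strip() == "":
        # Explicitly empty: surface as a misconfiguration instead of falling back.
        raise RuntimeError("API key environment variable is set but empty")
    return key
```

A truthiness check (`if not key`) would treat the empty string the same as an unset variable and quietly fall back to the global key; checking `is None` keeps the two cases apart. In real use the function would be called with `os.environ`.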

Add an inference principle that walks the request, filename, and intent to pick the cheapest mode that satisfies every floor, with explicitly no clarifying questions to the user. Replace the vague "ask before large documents" prose with a concrete 200-credit confirmation threshold.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The eval workspace it covered is local-only and doesn't need to be masked from a repo-level rule. Anything regenerated by future skill-creator runs lands untracked, the same as any other ephemeral local artefact.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Drop the duplicated schema walkthrough from the references doc and link to the canonical pages on nutrient.io instead. The reference now lists which tools we suggest for reshaping a `/parse` response (jq, json, pandas, a LaTeX renderer, a markdown parser) rather than re-stating field shapes that are already authoritative upstream.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The rule was overly prescriptive: there's no architectural reason the skill must stay single-script forever, and the sibling-skill boundary between data extraction and /build workflows is already implicit in the skill's purpose. Other rules in the section remain.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

nickwinder self-assigned this May 27, 2026

nickwinder changed the title ~~DWS: Add /extraction/parse support to document-processor-api skill~~ DWS: Add document-extraction-api skill for /extraction/parse May 27, 2026

nickwinder and others added 5 commits May 28, 2026 06:02

Revert SKILL.md changes in document-processor-api

7ed4470

chore: gitignore skill-creator eval workspace for the new skill

5176c7b

nickwinder marked this pull request as ready for review May 28, 2026 00:03

nickwinder marked this pull request as draft May 28, 2026 00:03

nickwinder and others added 2 commits May 28, 2026 12:07

nickwinder marked this pull request as ready for review May 28, 2026 00:15

nickwinder requested a review from HungKNguyen May 28, 2026 00:15

HungKNguyen approved these changes May 29, 2026

nickwinder merged commit 6ec3a1a into main May 29, 2026

nickwinder deleted the feat/task-136-parse-ga branch May 29, 2026 01:38

nickwinder merged 9 commits into main from feat/task-136-parse-ga
