DWS: Add document-extraction-api skill for /extraction/parse #13

Merged
nickwinder merged 9 commits into main from feat/task-136-parse-ga on May 29, 2026

Conversation

@nickwinder nickwinder commented May 27, 2026

Why

The Data Extraction API (/extraction/parse) went GA. It's a single document-understanding primitive that returns either the full structural document model (typed elements with bounding boxes and reading order) or a whole-document Markdown string — the natural primitive for RAG indexing, form/invoice extraction, and layout-aware understanding.

DWS Extract is a separate product from DWS Processor, with its own API key and credit pool. This PR adds a dedicated skill — document-extraction-api — alongside the existing document-processor-api, rather than conflating both products under one skill. The processor skill is left untouched.

Summary

New skill: plugins/nutrient-dws/skills/document-extraction-api/

  • SKILL.md — explains the product split and the dual credit pool. Defines a mode-selection principle: the agent decides from the request alone (no clarifying questions); walks request, filename, and intent cues in order; and picks the cheapest mode that satisfies every floor. Sets a 200-credit cost-confirmation threshold above which the agent surfaces an estimate to the operator before invoking.
  • scripts/parse.py — single primitive accepting a local file plus mode and output_format. Calls client.parse(), writes the result, and prints extraction-credit usage. Modes: text (1 cr/pg), structure (1.5 cr/pg, default), understand (9 cr/pg), agentic (18 cr/pg). Output shapes: spatial elements or whole-document markdown. Pinned to nutrient-dws>=3.1.0.
  • scripts/lib/common.py: create_client() factory reading NUTRIENT_EXTRACT_API_KEY (falls back to NUTRIENT_API_KEY for tenants on global keys); constructs NutrientClient(api_key=..., extract_api_key=...).
  • references/parse-output-filtering.md — points at the canonical upstream docs for the /parse response schema (extract-document-elements, extract-markdown, processing-modes, coordinate-spaces) and lists the tools we suggest reaching for to reshape a response: jq for filtering / projection, the stdlib json module for programmatic walks, pandas for table-to-dataframe projection, any LaTeX renderer for formula elements, and a standard Markdown parser for chunking output.markdown.
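
The per-page rates and the 200-credit threshold described above can be sketched as a small pre-flight check. This is a minimal illustration, not the code in SKILL.md or parse.py: the function names are hypothetical, and it assumes cost is simply rate times page count with no rounding, which the PR does not confirm.

```python
# Per-page rates quoted in the PR description (credits per page).
CREDITS_PER_PAGE = {
    "text": 1.0,
    "structure": 1.5,   # default mode
    "understand": 9.0,
    "agentic": 18.0,
}

# Threshold from the PR description: above this, surface an estimate first.
CONFIRM_THRESHOLD = 200


def estimate_credits(mode: str, pages: int) -> float:
    """Estimate extraction credits (assumption: rate * pages, no rounding)."""
    if mode not in CREDITS_PER_PAGE:
        raise ValueError(f"unknown mode: {mode!r}")
    if pages < 0:
        raise ValueError("pages must be non-negative")
    return CREDITS_PER_PAGE[mode] * pages


def needs_confirmation(mode: str, pages: int) -> bool:
    """True when the estimate is strictly above the confirmation threshold."""
    return estimate_credits(mode, pages) > CONFIRM_THRESHOLD
```

For example, a 12-page document in agentic mode estimates to 216 credits and would trigger the confirmation step, while the same document in structure mode estimates to 18 credits and would not.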

Top-level

  • AGENTS.md advertises the new skill alongside the existing two.

Mode-decision capability sweep

To validate that the skill makes the right (mode, output_format) call across the documented decision surface, I built a 9-eval matrix where each eval pairs an intent prompt with the (mode, output_format) the docs prescribe for that intent:

| # | Probe | Expected |
|---|-------|----------|
| 0 | RAG ingestion of a born-digital PDF (cost matters) | text + markdown |
| 1 | RAG ingestion of a scanned PDF (OCR required) | understand + markdown ¹ |
| 2 | Ambiguous "parse this for me" (no cues) | structure + spatial (default) |
| 3 | Loan application key-value extraction for a reviewer UI | understand + spatial |
| 4 | Document outline tool: Title / SectionHeader / Body classification | understand + spatial |
| 5 | Accessibility audit: alt-text on figures / charts / diagrams | agentic + spatial |
| 6 | Chart descriptions for a brief | agentic + (spatial or markdown) |
| 7 | Search index for born-digital Confluence-exported PDFs | text + markdown |
| 8 | Layout overlay UI: bounding boxes only, no semantic roles | structure + spatial |

¹ The docs prescribe structure + markdown (1.5 cr/pg) for this case, but the server currently returns HTTP 500 on image-only PDFs for any structure-mode call regardless of output format. Tracked separately. understand + markdown (9 cr/pg) is the cheapest working combination for the scanned-RAG intent today.
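
The matrix above can be expressed as data with a small grader. This is a hypothetical sketch: the actual eval harness is not shown in this PR, and treating footnote ¹'s workaround as the accepted answer for probe 1 is an assumption about what "bug-aware grading" means.

```python
# Acceptable (mode, output_format) pairs per probe, transcribed from the table.
EXPECTED = {
    0: {("text", "markdown")},
    1: {("understand", "markdown")},  # footnote 1: cheapest working combo today
    2: {("structure", "spatial")},    # default for requests with no cues
    3: {("understand", "spatial")},
    4: {("understand", "spatial")},
    5: {("agentic", "spatial")},
    6: {("agentic", "spatial"), ("agentic", "markdown")},  # either shape accepted
    7: {("text", "markdown")},
    8: {("structure", "spatial")},
}


def grade(probe: int, mode: str, output_format: str) -> bool:
    """Return True when the chosen pair is acceptable for the given probe."""
    return (mode, output_format) in EXPECTED[probe]


def pass_count(choices: dict) -> int:
    """Count passing probes; `choices` maps probe number -> (mode, output_format)."""
    return sum(grade(probe, *pair) for probe, pair in choices.items())
```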

Each eval was run twice — once with the skill present, once with a baseline subagent that has the same API access but no skill guidance. Sample input: the 6-page born-digital sample.pdf from nutrient-dws-client-python/tests/data/, plus a rasterized-at-150-DPI variant with no text layer for the scanned-input eval.

Round-1 results (18 parallel subagent runs):

  • 9/9 mode-decision evals passed in both configurations under bug-aware grading. Both the skill and the baseline picked the right (mode, output_format) on every probe.
  • The skill prevents one overshoot the baseline makes on its own. On eval #7 ("born-digital, search index, markdown, cost matters"), the baseline reached for structure + markdown (1.5 cr/pg, overshoot), while the skill correctly walked down to text + markdown (1 cr/pg). That is a 33% lower per-page cost on that workflow.
  • Tokens: ~27% lower with the skill (avg 25.0k vs 34.4k per task).
  • Wall time: ~60% lower with the skill (avg ~74s vs ~186s per task).

The skill's measurable value-add is avoiding the "go look up the API, then enumerate the modes, then decide what's cheapest" detour that the baseline walks through on every invocation.
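
The three percentages quoted in the results follow from the raw figures given there. A short check, using only numbers from the bullets above (the helper name is illustrative):

```python
def percent_lower(new: float, old: float) -> float:
    """Relative reduction of `new` versus `old`, as a percentage."""
    return (old - new) / old * 100


# Per-page cost on eval #7: text (1 cr/pg) vs structure (1.5 cr/pg).
print(round(percent_lower(1.0, 1.5)))    # 33
# Average tokens per task: 25.0k with the skill vs 34.4k without.
print(round(percent_lower(25.0, 34.4)))  # 27
# Average wall time per task: ~74s with the skill vs ~186s without.
print(round(percent_lower(74, 186)))     # 60
```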

Verification

  • claude plugin validate . passes.
  • uv run scripts/parse.py --help renders correctly.
  • Live end-to-end against the production /extraction/parse endpoint: 18 parallel subagent runs across 9 mode-decision evals × 2 configurations (with / without the skill). All runs returned well-formed outputs with the (mode, output_format) the docs prescribe; credit-usage summary prints correctly from the Python client.

Teach the DWS skill how to call the now-GA /extraction/parse endpoint:

- scripts/parse.py — single primitive that accepts a local file plus
  mode and output_format, calls client.parse(), and writes the result.
  Modes: text (1 cr/pg), structure (1.5 cr/pg, default), understand
  (9 cr/pg), agentic (18 cr/pg). Output shapes: spatial elements or
  whole-document Markdown. Billed against extraction credits (separate
  from processor API credits). Prints usage summary after each run.

- references/parse-output-filtering.md — new reference doc showing
  downstream consumption patterns after a single /parse call: reading-
  order plain text, table-to-grid projection, key-value dict, formula
  LaTeX, picture alt descriptions. Includes Python snippets and jq
  one-liners for each pattern.

- references/script-catalog.md — adds parse.py entry under a new
  "Data Extraction" section with mode, cost, and output-shape summary.

- SKILL.md — adds a Data Extraction section covering: what /parse is
  (document-understanding primitive, not per-element-type calls), mode
  selection table keyed to user intent, default of structure+spatial
  for ambiguous requests, invocation examples, downstream-consumption
  quick-ref, and pointer to parse-output-filtering.md. Also updates
  skill description and task-scripts list.

Python client dependency: path-install of the local branch that adds
client.parse() support (file:// URL in the uv inline script header).
@nickwinder nickwinder self-assigned this May 27, 2026
…on-api skill

DWS Extract is a separate product from DWS Processor — different API key,
different credit pool, different billing. Splitting the parse primitive
into its own skill removes the conflation and lets agents pick the right
product upfront.

- New skill: plugins/nutrient-dws/skills/document-extraction-api
  - parse.py + references/parse-output-filtering.md moved over via git mv
  - SKILL.md focused on the Data Extraction product, mode/output table,
    downstream consumption patterns, and the separate NUTRIENT_EXTRACT_API_KEY
  - Local lib/common.py with create_client() that reads
    NUTRIENT_EXTRACT_API_KEY (falls back to NUTRIENT_API_KEY for tenants on
    global keys) and constructs NutrientClient(api_key=..., extract_api_key=...)
  - Pinned to nutrient-dws>=3.1.0 in the script's PEP 723 metadata

- document-processor-api: removed the Data Extraction section, the parse.py
  entry, and the parse-output-filtering reference map row. Cross-link to the
  sibling skill in the frontmatter description and "When to use" section.

- AGENTS.md: advertise the new skill alongside the existing two.

- Fix latent bug in parse.py: was reading usage.dataExtractionCredits
  (camelCase) but the API returns data_extraction_credits (snake_case), so
  the credit-usage summary was silently skipped on every call. Confirmed
  end-to-end via live smoke (6-page PDF, structure/spatial mode, 9 credits,
  ~46KB JSON, usage summary now prints correctly).
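
The camelCase bug described above was silent because a missing key simply skipped the summary. A minimal sketch of a stricter reader follows. It assumes the parsed response is a plain dict with a `usage` object, which is an assumption about the client's return type; only the `data_extraction_credits` field name comes from this PR, and the function name is hypothetical.

```python
def extraction_credits(response: dict):
    """Return usage.data_extraction_credits, failing loudly if it is absent."""
    usage = response.get("usage")
    if not isinstance(usage, dict):
        raise KeyError("response has no 'usage' object")
    if "data_extraction_credits" not in usage:
        # A camelCase lookup would land here instead of being skipped silently.
        raise KeyError(f"'data_extraction_credits' missing; keys: {sorted(usage)}")
    return usage["data_extraction_credits"]


# Shape taken from the smoke test above: 6 pages in structure mode, 9 credits.
print(extraction_credits({"usage": {"data_extraction_credits": 9}}))  # 9
```

Raising instead of skipping means a future rename of the field surfaces as an error on the first call rather than as a missing summary line.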
@nickwinder nickwinder changed the title from "DWS: Add /extraction/parse support to document-processor-api skill" to "DWS: Add document-extraction-api skill for /extraction/parse" May 27, 2026
nickwinder and others added 5 commits May 28, 2026 06:02
The split into document-extraction-api is purely additive — the processor
skill doesn't need cross-links or trimming. Leave it untouched.
- references/parse-output-filtering.md: snake_case `data_extraction_credits`
  to match the actual response shape (it was camelCase in three places —
  the schema diagram, the Python snippet, and the prose note). Anyone
  following the reference's Python snippet would silently get nothing
  back. Verified against the live API.
- scripts/lib/common.py: use `is None` instead of truthiness for the env
  var checks, so `export NUTRIENT_EXTRACT_API_KEY=` (explicit empty) is
  treated as a misconfiguration to surface, not as "fall back to the
  Processor key". Also drop helpers carried over from the sibling skill's
  common.py that this skill never uses (write_json_output, parse_csv,
  read_json_file, fix_negative_args).
- scripts/parse.py: call assert_local_file() on `--input` so URL inputs
  produce a clear error message instead of leaking through to a
  misleading FileNotFoundError.
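
The two input-validation fixes above can be sketched as follows. These are stand-ins, not the contents of common.py or parse.py: `resolve_extract_key` and `require_local_file` are hypothetical names, and the error types and messages are assumptions. Only the environment variable names and the `is None` behaviour come from the commit message.

```python
import os
from pathlib import Path
from typing import Mapping


def resolve_extract_key(env: Mapping[str, str] = os.environ) -> str:
    """Pick the extraction key, falling back only when the variable is unset."""
    key = env.get("NUTRIENT_EXTRACT_API_KEY")
    if key is None:  # unset: fall back to the global key
        key = env.get("NUTRIENT_API_KEY")
    if key is None:
        raise RuntimeError("set NUTRIENT_EXTRACT_API_KEY or NUTRIENT_API_KEY")
    if key.strip() == "":
        # An explicit empty value is surfaced as a misconfiguration, not a fallback.
        raise RuntimeError("API key variable is set but empty")
    return key


def require_local_file(value: str) -> Path:
    """Reject URL inputs with a clear message before touching the filesystem."""
    if "://" in value:
        raise ValueError(f"--input must be a local file, not a URL: {value}")
    path = Path(value)
    if not path.is_file():
        raise FileNotFoundError(f"no such file: {value}")
    return path
```

With a truthiness check, `export NUTRIENT_EXTRACT_API_KEY=` would quietly select the Processor key; checking `is None` keeps that case distinct from an unset variable.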
Add an inference principle that walks the request, filename, and intent
to pick the cheapest mode that satisfies every floor — explicitly no
clarifying questions to the user. Replace the vague "ask before large
documents" prose with a concrete 200-credit confirmation threshold.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
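
One way to read "cheapest mode that satisfies every floor" is as a maximum over a cost-ordered ladder. The sketch below is an interpretation, not the text of SKILL.md. It assumes each cue (request, filename, intent) yields a minimum mode and that every more expensive mode also satisfies the cheaper floors. The function name is hypothetical.

```python
# Modes ordered from cheapest to most expensive, per the rates in the PR.
MODES = ["text", "structure", "understand", "agentic"]

# Default for requests with no cues, as stated in the eval matrix.
DEFAULT_MODE = "structure"


def pick_mode(floors: list[str]) -> str:
    """Return the cheapest mode at or above every floor (assumed strict ladder)."""
    if not floors:
        return DEFAULT_MODE
    for floor in floors:
        if floor not in MODES:
            raise ValueError(f"unknown mode: {floor!r}")
    # The highest floor is the cheapest mode that still satisfies all of them.
    return max(floors, key=MODES.index)


print(pick_mode([]))                      # structure
print(pick_mode(["text"]))                # text
print(pick_mode(["text", "understand"]))  # understand
```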
The eval workspace it covered is local-only and doesn't need to be
masked from a repo-level rule. Anything regenerated by future
skill-creator runs lands untracked, the same as any other ephemeral
local artefact.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@nickwinder nickwinder marked this pull request as ready for review May 28, 2026 00:03
@nickwinder nickwinder marked this pull request as draft May 28, 2026 00:03
nickwinder and others added 2 commits May 28, 2026 12:07
Drop the duplicated schema walkthrough from the references doc and link
to the canonical pages on nutrient.io instead. The reference now lists
which tools we suggest for reshaping a `/parse` response (jq, json,
pandas, a LaTeX renderer, a markdown parser) — rather than re-stating
field shapes that are already authoritative upstream.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The rule was overly prescriptive — there's no architectural reason the
skill must stay single-script forever, and the sibling-skill boundary
between data extraction and /build workflows is already implicit in the
skill's purpose. Other rules in the section remain.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@nickwinder nickwinder marked this pull request as ready for review May 28, 2026 00:15
@nickwinder nickwinder requested a review from HungKNguyen May 28, 2026 00:15
@nickwinder nickwinder merged commit 6ec3a1a into main May 29, 2026
@nickwinder nickwinder deleted the feat/task-136-parse-ga branch May 29, 2026 01:38