PSPDFKit-labs · nickwinder · May 29, 2026 · May 27, 2026
diff --git a/AGENTS.md b/AGENTS.md
@@ -6,5 +6,6 @@ Each skill lives under `plugins/<plugin-name>/skills/<skill-name>/SKILL.md`. Rea
 
 ## Available Skills
 
-- **nutrient-dws / document-processor-api** — Convert, extract, transform, and secure documents via the Nutrient Document Web Services API (Python scripts via `uv`).
+- **nutrient-dws / document-processor-api** — Convert, transform, redact, sign, watermark, OCR, and secure documents via the Nutrient DWS Processor API (Python scripts via `uv`).
+- **nutrient-dws / document-extraction-api** — Parse documents into a structural model (typed elements with bounds) or whole-document Markdown via the Nutrient DWS Data Extraction API (`/extraction/parse`). Use for RAG ingestion, layout analysis, and form/invoice extraction.
 - **pdf-to-markdown / pdf-to-markdown** — Extract text from PDFs as structured, semantic Markdown. Use when converting a PDF to Markdown, extracting text from a PDF, or processing one or more PDFs into Markdown output.
diff --git a/plugins/nutrient-dws/skills/document-extraction-api/.gitignore b/plugins/nutrient-dws/skills/document-extraction-api/.gitignore
@@ -0,0 +1,2 @@
+__pycache__/
+*.pyc
diff --git a/plugins/nutrient-dws/skills/document-extraction-api/SKILL.md b/plugins/nutrient-dws/skills/document-extraction-api/SKILL.md
@@ -0,0 +1,151 @@
+---
+name: document-extraction-api
+description: >-
+  Parse documents into a structural model or whole-document Markdown via the Nutrient Data
+  Extraction API (`/extraction/parse`). Use when the user wants to extract layout, tables,
+  key-value pairs, formulas, or images with bounding boxes; build a RAG ingestion pipeline;
+  produce Markdown for search indexing or content migration; or run layout-aware document
+  understanding. Triggers include parse this document, extract layout, RAG pipeline, document
+  understanding, form/invoice extraction, layout analysis, or whole-document Markdown.
+license: MIT
+metadata:
+  author: nutrient-sdk
+  version: "1.0"
+  homepage: "https://www.nutrient.io/api/"
+  repository: "https://github.com/PSPDFKit-labs/nutrient-skills"
+  compatibility: "Requires Python 3.10+, uv, and internet. Works with Claude Code, Codex CLI, Gemini CLI, OpenCode, Cursor, Windsurf, GitHub Copilot, Amp, or any Agent Skills-compatible product."
+  short-description: "Parse documents into a structural model or Markdown via Nutrient Data Extraction"
+---
+
+# Nutrient Data Extraction
+
+Use Nutrient DWS Extract for document-understanding workflows where you need typed
+elements (paragraphs, tables, formulas, pictures, key-value regions, handwriting) with
+bounding boxes — or a clean Markdown representation of the whole document.
+
+## When to use
+
+- Build a RAG ingestion pipeline: PDF -> Markdown -> chunks -> embeddings.
+- Index content for search or migrate documents into a new CMS.
+- Extract structured fields from forms and invoices (key/value pairs, tables, semantic regions).
+- Reconstruct page layout for downstream rendering or comparison.
+- Run layout-aware document understanding (semantic paragraph roles, table cell spans,
+  formulas in LaTeX, picture classification and alt descriptions).
+
+This skill is **only** for `/extraction/parse`. For PDF generation, conversion, OCR,
+redaction, signing, watermarking, or any `/build`-based workflow, use the sibling
+`document-processor-api` skill.
+
+## Setup
+
+DWS Extract is a separate product from DWS Processor and has its own API key.
+
+- Get a Nutrient DWS Extract API key at <https://dashboard.nutrient.io/>.
+- Export it as `NUTRIENT_EXTRACT_API_KEY`:
+  ```bash
+  export NUTRIENT_EXTRACT_API_KEY="pdf_live_..."
+  ```
+- Scripts live in `scripts/` relative to this SKILL.md. Use the directory containing this
+  SKILL.md as the working directory:
+  ```bash
+  cd <directory containing this SKILL.md> && uv run scripts/<script>.py --help
+  ```
+
+Calling `/extraction/parse` with a DWS Processor key returns `403`. If your tenant has been
+migrated to global DWS API keys, a single key set as either `NUTRIENT_EXTRACT_API_KEY` or
+`NUTRIENT_API_KEY` will work for both products.
+
+## `/extraction/parse` — one primitive, two output shapes
+
+One call returns either the full structural document model (typed elements with bounding
+boxes, confidence scores, and reading order) or a whole-document Markdown string. Every
+element type the chosen mode can produce comes back in that single call.
+
+### Picking a mode
+
+Choose based on the user's intent and acceptable credit cost. All costs are
+**extraction credits per page** — a separate billing bucket from the processor API
+credits consumed by `/build`, `/sign`, OCR, and other DWS Processor endpoints.
+
+**Principle — decide from the request alone; do not ask the user clarifying questions.**
+Walk the checks below in order. Each rule that fires sets a minimum mode; the final
+pick is the highest minimum across all rules that fired, ranked `text` < `structure` <
+`understand` < `agentic`. If none fired, use the default (rule 5).
+
+1. **Explicit features named in the request** are non-negotiable.
+   - Key-value pairs, form fields, semantic role classification (Title / SectionHeader /
+     etc.), formulas, or handwriting → at minimum `understand` (9 cr/pg).
+   - Alt text on pictures, charts, or diagrams → `agentic` (18 cr/pg).
+2. **Document type implied by the request or filename.**
+   - `form`, `invoice`, `receipt`, `application`, `claim` → likely contains key-value
+     pairs → `understand`.
+   - `chart`, `infographic`, or diagram-heavy doc + the user wants descriptions →
+     `agentic`.
+3. **OCR signal from filename or request** (`scanned`, `image-based`, `photographed`,
+   `handwritten`, `screenshot`) → `structure` minimum; `text` mode silently fails on
+   image-only input.
+4. **Output format from intent.** RAG, search indexing, embeddings, or content migration
+   → `markdown`. Layout overlay, per-element processing, or bounded extraction →
+   `spatial`.
+5. **No cues match anything above** → documented default `structure` + `spatial`
+   (1.5 cr/pg). Handles both born-digital and scanned, gives bounded typed elements
+   with table cells, never silently drops content.
+
+| User intent | Mode | Output format | Cost | Notes |
+|-------------|------|---------------|------|-------|
+| RAG / search indexing / content migration — born-digital PDF | `text` | `markdown` | 1 cr/pg | Cheapest path; no OCR or AI needed |
+| RAG / search indexing — scanned or image-based PDF | `structure` | `markdown` | 1.5 cr/pg | OCR required before Markdown assembly |
+| Form / invoice extraction | `understand` | `spatial` | 9 cr/pg | AI classification for reliable key-value and table detection |
+| Layout-aware document understanding | `understand` | `spatial` | 9 cr/pg | Semantic paragraph roles (Title, SectionHeader, etc.) |
+| Deep visual understanding (charts, diagrams, alt text) | `agentic` | `spatial` | 18 cr/pg | VLM adds alt descriptions on every picture element |
+| **Default / ambiguous intent** | **`structure`** | **`spatial`** | **1.5 cr/pg** | Good balance: OCR + spatial elements, low cost |
+
+**Confirm before running when the estimated cost exceeds 200 extraction credits** —
+roughly 11 pages of `agentic`, 22 of `understand`, 133 of `structure`, or 200 of `text`.
+Surface the estimate (`pages × cost_per_page`) and ask the operator to confirm before
+invoking. Under that threshold, just run.
+
+`mode='text'` is incompatible with `output_format='spatial'`; the client rejects the
+combination before the network call.
+
+### Invocation
+
+```bash
+# Default: structure mode, spatial output
+uv run scripts/parse.py --input doc.pdf --out out.json
+
+# Markdown for RAG (text mode — cheapest)
+uv run scripts/parse.py --input doc.pdf --out out.md --output-format markdown --mode text
+
+# Form extraction (understand mode)
+uv run scripts/parse.py --input doc.pdf --out out.json --mode understand
+
+# Agentic (VLM alt text on pictures)
+uv run scripts/parse.py --input doc.pdf --out out.json --mode agentic
+```
+
+The script prints extraction-credit usage after each run so you can verify the cost.
+
+### Downstream consumption
+
+After a single `/parse` call, slice the response for common needs:
+
+- **Reading-order plain text**: walk `output.elements` sorted by `(page.pageIndex, readingOrder)`, join `paragraph` and `handwriting` `text` fields
+- **Tables**: project `cells[]` on each `table` element into rows/columns using `cell.row` and `cell.column`
+- **Key-value pairs**: read `pairs[]` on each `keyValueRegion` element — each pair has `.key.value` and `.value.value`
+- **Formulas**: read `latex` on each `formula` element
+- **Pictures**: read `classification` and `altDescription` (populated by `agentic` mode) on each `picture` element
+- **Markdown output**: call with `--output-format markdown`; the script writes the Markdown string directly
+
+For the canonical response schema and per-mode field availability, see the official docs linked from `references/parse-output-filtering.md`; that file also lists the tools we suggest for filtering and reshaping the response.
+
+### Input constraint
+
+`parse.py` only accepts **local file paths** — the underlying API endpoint is
+multipart-only. For remote inputs, download the file first.
+
+## Rules
+
+- Always preserve the printed credit-usage summary in script output so the operator can
+  observe per-call cost.
+- Do not add a URL-fetch shortcut; the endpoint is multipart-only.
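Illustrative sketch, not part of the patch: the "Picking a mode" rules in SKILL.md above could be approximated in code as follows. The cue lists, `normalize`, `has_cue`, and `pick_mode` are hypothetical names introduced here; the keyword matching is a rough stand-in for the judgment the skill expects from the agent, and it assumes the ranking `text` < `structure` < `understand` < `agentic`.

```python
import re

# Modes ranked from cheapest to most capable, as listed in SKILL.md.
MODE_ORDER = ["text", "structure", "understand", "agentic"]

# Hypothetical cue lists distilled from rules 1-4; illustrative, not exhaustive.
UNDERSTAND_CUES = ["key value", "form", "invoice", "receipt", "application",
                   "claim", "semantic role", "formula", "handwriting"]
ALT_TEXT_CUES = ["alt text", "alt description"]
VISUAL_CUES = ["chart", "infographic", "diagram"]
DESCRIBE_CUES = ["describe", "description"]
OCR_CUES = ["scanned", "image based", "photographed", "handwritten", "screenshot"]
MARKDOWN_CUES = ["rag", "search index", "search indexing", "embedding",
                 "content migration", "markdown"]


def normalize(text: str) -> str:
    """Lowercase and turn punctuation (including '_' and '-') into spaces."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def has_cue(request: str, cues: list[str]) -> bool:
    """True if any cue appears as a whole word or phrase (optional plural 's')."""
    text = normalize(request)
    return any(re.search(rf"\b{re.escape(cue)}s?\b", text) for cue in cues)


def pick_mode(request: str) -> tuple[str, str]:
    """Return (mode, output_format) for a request string or filename."""
    minimums = []
    # Rules 1 and 2: explicit features or an implied document type.
    if has_cue(request, UNDERSTAND_CUES):
        minimums.append("understand")
    wants_descriptions = has_cue(request, VISUAL_CUES) and has_cue(request, DESCRIBE_CUES)
    if has_cue(request, ALT_TEXT_CUES) or wants_descriptions:
        minimums.append("agentic")
    # Rule 3: OCR signals need at least structure mode.
    if has_cue(request, OCR_CUES):
        minimums.append("structure")
    # Rule 4: output format follows intent.
    output_format = "markdown" if has_cue(request, MARKDOWN_CUES) else "spatial"
    # Rule 5: default to structure when no rule set a minimum.
    mode = max(minimums, key=MODE_ORDER.index) if minimums else "structure"
    return mode, output_format
```

With these cue lists, `pick_mode("Extract the fields from scanned_invoice.pdf")` returns `("understand", "spatial")` and `pick_mode("Build a RAG pipeline from report.pdf")` returns `("structure", "markdown")`. The sketch never selects `text`, because the numbered rules do not say how to detect born-digital input; that also sidesteps the `text` + `spatial` combination the client rejects.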
diff --git a/...utrient-dws/skills/document-extraction-api/references/parse-output-filtering.md b/...utrient-dws/skills/document-extraction-api/references/parse-output-filtering.md
@@ -0,0 +1,32 @@
+# Parse Output — Filtering and Downstream Patterns
+
+The response shape of `/extraction/parse` — element types, field-by-field
+schemas, coordinate spaces, per-mode field availability — is documented
+upstream. Use those pages as the source of truth; this reference only
+suggests which tools to reach for when slicing and reshaping the response.
+
+## Official documentation
+
+- [Document element extraction (spatial output)](https://www.nutrient.io/guides/dws-data-extraction/parsing/extract-document-elements/) —
+  schema for `output.elements`, element types, bounding-box conventions.
+- [Markdown extraction](https://www.nutrient.io/guides/dws-data-extraction/parsing/extract-markdown/) —
+  shape of `output.markdown`.
+- [Processing modes](https://www.nutrient.io/guides/dws-data-extraction/parsing/processing-modes/) —
+  which fields each mode populates (e.g. `altDescription` only with
+  `agentic`; `keyValueRegion` and `formula` only with `understand` or
+  higher).
+- [Coordinate spaces](https://www.nutrient.io/guides/dws-data-extraction/parsing/coordinate-spaces/) —
+  how `bounds` relate to `page.width` / `page.height`.
+
+## Suggested tools
+
+| Task | Tool | Why |
+|---|---|---|
+| Filter or project the spatial JSON response | `jq` | Discriminate on `type` (`paragraph`, `table`, `picture`, …), select by `page.pageIndex` / `readingOrder`, or pull nested fields without writing code. |
+| Walk the response programmatically | Python's standard `json` module | The response is plain JSON; a recursive walk over `output.elements` is enough for type filtering, reading-order sort, and bounds extraction. |
+| Project tables into rows / columns | `pandas` | Tables come as a flat `cells[]` list with `row` / `column` indices; `pd.DataFrame` reshapes them cleanly. |
+| Render formulas | any LaTeX renderer (MathJax, KaTeX, matplotlib) | `formula` elements carry `latex` strings ready to feed a renderer. |
+| Post-process markdown output (chunk on headings, strip tables, etc.) | `markdown-it-py`, `mistune`, or a regex on `#` lines | `output.markdown` uses standard heading hierarchy. |
+
+For the per-call extraction-credit cost, read `usage.data_extraction_credits.cost`
+directly from the response — no tool needed.
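Illustrative sketch, not part of the patch: two of the downstream patterns named above (reading-order text and table projection) using only plain Python on the parsed JSON. The field names `output.elements`, `type`, `page.pageIndex`, `readingOrder`, `cells`, `row`, and `column` come from SKILL.md; the cell field `text`, zero-based `row` / `column` indices, and the `SAMPLE` payload are assumptions made for illustration, so check the official schema before relying on them.

```python
def reading_order_text(response: dict) -> str:
    """Join paragraph and handwriting text sorted by (pageIndex, readingOrder)."""
    elements = response.get("output", {}).get("elements", [])
    textual = [e for e in elements if e.get("type") in ("paragraph", "handwriting")]
    textual.sort(key=lambda e: (e["page"]["pageIndex"], e["readingOrder"]))
    return "\n".join(e.get("text", "") for e in textual)


def table_rows(table: dict) -> list[list[str]]:
    """Project a flat cells[] list into a row-major grid of strings."""
    cells = table.get("cells", [])
    if not cells:
        return []
    # Assumption: row and column indices are zero-based.
    n_rows = max(c["row"] for c in cells) + 1
    n_cols = max(c["column"] for c in cells) + 1
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for c in cells:
        # Assumption: the cell's text lives in a field named "text".
        grid[c["row"]][c["column"]] = c.get("text", "")
    return grid


# Invented sample payload, shaped after the fields named in SKILL.md.
SAMPLE = {
    "output": {
        "elements": [
            {"type": "paragraph", "text": "Second", "page": {"pageIndex": 0}, "readingOrder": 1},
            {
                "type": "table",
                "page": {"pageIndex": 0},
                "readingOrder": 2,
                "cells": [
                    {"row": 0, "column": 0, "text": "Item"},
                    {"row": 0, "column": 1, "text": "Qty"},
                    {"row": 1, "column": 0, "text": "Widget"},
                    {"row": 1, "column": 1, "text": "3"},
                ],
            },
            {"type": "paragraph", "text": "First", "page": {"pageIndex": 0}, "readingOrder": 0},
        ]
    }
}
```

On the sample, `reading_order_text(SAMPLE)` yields `"First\nSecond"`, and `table_rows` applied to the table element yields `[["Item", "Qty"], ["Widget", "3"]]`. Cells missing from a sparse table are left as empty strings.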
diff --git a/plugins/nutrient-dws/skills/document-extraction-api/scripts/lib/common.py b/plugins/nutrient-dws/skills/document-extraction-api/scripts/lib/common.py
@@ -0,0 +1,48 @@
+import os
+import sys
+from typing import NoReturn
+
+
+def create_client():
+    """Create and return a NutrientClient configured for DWS Extract.
+
+    DWS Extract is a separate product from DWS Processor and has its own
+    API key. Prefers NUTRIENT_EXTRACT_API_KEY; falls back to
+    NUTRIENT_API_KEY if the former is unset, so a single global key works
+    once DWS rolls those out.
+
+    Uses `is None` rather than truthiness so an explicitly empty
+    NUTRIENT_EXTRACT_API_KEY (`export NUTRIENT_EXTRACT_API_KEY=`) is treated
+    as a misconfiguration to surface, not as "fall back to the other key".
+    """
+    extract_api_key = os.environ.get("NUTRIENT_EXTRACT_API_KEY")
+    fallback_key = os.environ.get("NUTRIENT_API_KEY")
+    if extract_api_key is None and fallback_key is None:
+        raise RuntimeError(
+            "NUTRIENT_EXTRACT_API_KEY is not set. DWS Extract requires its own "
+            "API key (separate from the DWS Processor key). Export it before "
+            "running this skill's scripts."
+        )
+    try:
+        from nutrient_dws import NutrientClient
+    except ImportError as e:
+        raise RuntimeError(
+            "Unable to import nutrient_dws. Install with: uv add 'nutrient-dws>=3.1.0'\n"
+            f"Original error: {e}"
+        ) from e
+    primary = extract_api_key if extract_api_key is not None else fallback_key
+    return NutrientClient(api_key=primary, extract_api_key=extract_api_key)
+
+
+def assert_local_file(value: str, arg: str) -> str:
+    """Raise if value looks like a URL; otherwise return the path."""
+    v = str(value).strip()
+    if v.startswith("http://") or v.startswith("https://"):
+        raise ValueError(f"--{arg} must be a local file path for this operation.")
+    return v
+
+
+def handle_error(e: Exception) -> NoReturn:
+    """Print the error message and exit with code 1."""
+    print(str(e), file=sys.stderr)
+    sys.exit(1)
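Illustrative sketch, not part of the patch: the key-resolution logic in `create_client` above, isolated from the environment and the SDK so the `is None` behavior is easy to see. `resolve_keys` is a hypothetical helper; it mirrors the checks in `common.py` but takes a plain mapping instead of reading `os.environ`, and it returns the two values that would be passed as `api_key` and `extract_api_key`.

```python
from typing import Mapping, Optional


def resolve_keys(env: Mapping[str, str]) -> tuple[Optional[str], Optional[str]]:
    """Return (primary, extract) following the same rules as create_client."""
    extract = env.get("NUTRIENT_EXTRACT_API_KEY")
    fallback = env.get("NUTRIENT_API_KEY")
    if extract is None and fallback is None:
        raise RuntimeError("no DWS Extract key configured")
    # `is not None` keeps an explicitly empty extract key instead of
    # quietly swapping in the fallback.
    primary = extract if extract is not None else fallback
    return primary, extract
```

With only `NUTRIENT_API_KEY` set to `"global"`, this returns `("global", None)`. With `NUTRIENT_EXTRACT_API_KEY` set to an empty string and `NUTRIENT_API_KEY` set to `"global"`, it returns `("", "")`, so the empty key is not masked by the fallback.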
diff --git a/plugins/nutrient-dws/skills/document-extraction-api/scripts/parse.py b/plugins/nutrient-dws/skills/document-extraction-api/scripts/parse.py
@@ -0,0 +1,152 @@
+#!/usr/bin/env python3
+# /// script
+# requires-python = ">=3.10"
+# dependencies = ["nutrient-dws>=3.1.0"]
+# ///
+"""Parse a document using the Nutrient Data Extraction API (/extraction/parse).
+
+This script is the single primitive for document understanding via /extraction/parse.
+One call returns the full structural document model — typed elements with bounding boxes,
+confidence scores, and reading order — or a whole-document Markdown string.
+
+DWS Extract is a separate product from DWS Processor. It uses its own API key, supplied
+via the NUTRIENT_EXTRACT_API_KEY environment variable. Calls to /extraction/parse with a
+DWS Processor key return 403.
+
+Billing note: /extraction/parse is billed against **extraction credits**, which are a
+separate billing bucket from the processor API credits consumed by /build, /sign, OCR,
+and other Processor API endpoints.
+
+Per-page extraction-credit costs by mode:
+  text:       1 extraction credit  — fast Markdown from born-digital documents (no OCR/AI)
+  structure:  1.5 extraction credits — OCR + spatial elements with bounding boxes
+  understand: 9 extraction credits  — AI layout analysis, table detection, semantic classification
+  agentic:    18 extraction credits — VLM-augmented; deepest visual understanding
+
+Output shapes:
+  spatial  (default): response.output.elements — typed elements list
+  markdown:           response.output.markdown — whole-document Markdown string
+
+Usage examples:
+  # Spatial elements (structure mode) — lowest-cost spatial extraction
+  uv run scripts/parse.py --input doc.pdf --out out.json
+
+  # Markdown for RAG / search indexing (text mode — cheapest)
+  uv run scripts/parse.py --input doc.pdf --out out.md --output-format markdown --mode text
+
+  # Form / invoice extraction (understand mode — typed elements with confidence)
+  uv run scripts/parse.py --input doc.pdf --out out.json --mode understand
+
+  # Deep visual understanding (agentic mode — VLM descriptions on pictures)
+  uv run scripts/parse.py --input doc.pdf --out out.json --mode agentic --output-format spatial
+"""
+
+import argparse
+import asyncio
+import json
+import sys
+from pathlib import Path
+
+sys.path.insert(0, str(Path(__file__).parent))
+from lib.common import assert_local_file, create_client, handle_error
+
+
+async def main() -> None:
+    parser = argparse.ArgumentParser(
+        description=(
+            "Parse a document with the Nutrient Data Extraction API and write the result. "
+            "Billed against extraction credits (separate from processor API credits)."
+        ),
+        formatter_class=argparse.RawDescriptionHelpFormatter,
+        epilog="""
+Extraction credit costs per page:
+  text:       1 extraction credit  (born-digital Markdown, no OCR)
+  structure:  1.5 extraction credits (OCR + spatial elements)  [default]
+  understand: 9 extraction credits  (AI layout + table detection)
+  agentic:    18 extraction credits (VLM-augmented)
+
+Output shapes:
+  spatial  (default): typed element list at output.elements
+  markdown:           whole-document Markdown at output.markdown
+""",
+    )
+    parser.add_argument(
+        "--input",
+        required=True,
+        help="Path to the local input document (PDF, image, or Office file).",
+    )
+    parser.add_argument(
+        "--out",
+        required=True,
+        help="Output file path. Receives the full JSON response for spatial output, "
+        "or the Markdown string for markdown output.",
+    )
+    parser.add_argument(
+        "--mode",
+        choices=["text", "structure", "understand", "agentic"],
+        default="structure",
+        help=(
+            "Processing mode controlling cost and quality. "
+            "text=1cr, structure=1.5cr (default), understand=9cr, agentic=18cr — "
+            "all costs are extraction credits per page."
+        ),
+    )
+    parser.add_argument(
+        "--output-format",
+        dest="output_format",
+        choices=["spatial", "markdown"],
+        default="spatial",
+        help=(
+            "Shape of the output. "
+            "spatial: typed elements with bounds (default). "
+            "markdown: whole-document Markdown string."
+        ),
+    )
+    args = parser.parse_args()
+
+    # Validate input is a local file (the /extraction/parse endpoint is multipart-only).
+    # Use the stripped value that assert_local_file returns, and reject directories.
+    input_path = Path(assert_local_file(args.input, "input"))
+    if not input_path.is_file():
+        print(f"Error: input file not found: {args.input}", file=sys.stderr)
+        sys.exit(1)
+
+    client = create_client()
+    response = await client.parse(
+        input_path,
+        mode=args.mode,
+        output_format=args.output_format,
+    )
+
+    out_path = Path(args.out)
+    out_path.parent.mkdir(parents=True, exist_ok=True)
+
+    # Markdown output is written as plain text; spatial output as the full JSON response.
+    if args.output_format == "markdown":
+        markdown = response.get("output", {}).get("markdown", "")
+        out_path.write_text(markdown, encoding="utf-8")
+    else:
+        with open(out_path, "w", encoding="utf-8") as f:
+            json.dump(response, f, indent=2)
+    print(f"Wrote {args.out}")
+
+    # Print usage summary so callers can see credit cost without opening the output file
+    usage = response.get("usage", {})
+    credits_info = usage.get("data_extraction_credits", {})
+    cost = credits_info.get("cost")
+    remaining = credits_info.get("remainingCredits")
+    metrics = response.get("metrics", {})
+    pages = metrics.get("pagesProcessed", "?")
+    if cost is not None:
+        remaining_str = f", remaining: {remaining}" if remaining is not None else ""
+        print(
+            f"Usage: {cost} extraction credits ({pages} page(s) at {args.mode} mode"
+            f"{remaining_str})"
+        )
+
+
+if __name__ == "__main__":
+    try:
+        asyncio.run(main())
+    except Exception as e:
+        handle_error(e)
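Illustrative sketch, not part of the patch: `parse.py` reports the actual cost only after the call, so the pre-run check that SKILL.md asks for (confirm when the estimate exceeds 200 extraction credits) has to happen separately. The helper names below are hypothetical, and the page count is assumed to be known to the caller ahead of time; the per-page costs are the ones listed in the script's docstring.

```python
# Per-page extraction-credit costs by mode, as documented in parse.py.
COST_PER_PAGE = {"text": 1.0, "structure": 1.5, "understand": 9.0, "agentic": 18.0}

# Threshold from SKILL.md: confirm with the operator above this estimate.
CONFIRM_THRESHOLD = 200.0


def estimate_cost(pages: int, mode: str) -> float:
    """Estimated extraction credits: pages x cost_per_page."""
    return pages * COST_PER_PAGE[mode]


def needs_confirmation(pages: int, mode: str) -> bool:
    """True when the estimate strictly exceeds the confirmation threshold."""
    return estimate_cost(pages, mode) > CONFIRM_THRESHOLD
```

For example, 11 pages of `agentic` estimate to 198 credits and run without a prompt, while 12 pages estimate to 216 and need confirmation. The same boundary falls between 22 and 23 pages for `understand`, 133 and 134 for `structure`, and 200 and 201 for `text`.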