Merged
3 changes: 2 additions & 1 deletion AGENTS.md
Original file line number Diff line number Diff line change
@@ -6,5 +6,6 @@ Each skill lives under `plugins/<plugin-name>/skills/<skill-name>/SKILL.md`. Rea

## Available Skills

- **nutrient-dws / document-processor-api** — Convert, extract, transform, and secure documents via the Nutrient Document Web Services API (Python scripts via `uv`).
- **nutrient-dws / document-processor-api** — Convert, transform, redact, sign, watermark, OCR, and secure documents via the Nutrient DWS Processor API (Python scripts via `uv`).
- **nutrient-dws / document-extraction-api** — Parse documents into a structural model (typed elements with bounds) or whole-document Markdown via the Nutrient DWS Data Extraction API (`/extraction/parse`). Use for RAG ingestion, layout analysis, and form/invoice extraction.
- **pdf-to-markdown / pdf-to-markdown** — Extract text from PDFs as structured, semantic Markdown. Use when converting a PDF to Markdown, extracting text from a PDF, or processing one or more PDFs into Markdown output.
@@ -0,0 +1,2 @@
__pycache__/
*.pyc
151 changes: 151 additions & 0 deletions plugins/nutrient-dws/skills/document-extraction-api/SKILL.md
@@ -0,0 +1,151 @@
---
name: document-extraction-api
description: >-
  Parse documents into a structural model or whole-document Markdown via the Nutrient Data
  Extraction API (`/extraction/parse`). Use when the user wants to extract layout, tables,
  key-value pairs, formulas, or images with bounding boxes; build a RAG ingestion pipeline;
  produce Markdown for search indexing or content migration; or run layout-aware document
  understanding. Triggers include parse this document, extract layout, RAG pipeline, document
  understanding, form/invoice extraction, layout analysis, or whole-document Markdown.
license: MIT
metadata:
author: nutrient-sdk
version: "1.0"
homepage: "https://www.nutrient.io/api/"
repository: "https://github.com/PSPDFKit-labs/nutrient-skills"
compatibility: "Requires Python 3.10+, uv, and internet. Works with Claude Code, Codex CLI, Gemini CLI, OpenCode, Cursor, Windsurf, GitHub Copilot, Amp, or any Agent Skills-compatible product."
short-description: "Parse documents into a structural model or Markdown via Nutrient Data Extraction"
---

# Nutrient Data Extraction

Use Nutrient DWS Extract for document-understanding workflows where you need typed
elements (paragraphs, tables, formulas, pictures, key-value regions, handwriting) with
bounding boxes — or a clean Markdown representation of the whole document.

## When to use

- Build a RAG ingestion pipeline: PDF -> Markdown -> chunks -> embeddings.
- Index content for search or migrate documents into a new CMS.
- Extract structured fields from forms and invoices (key/value pairs, tables, semantic regions).
- Reconstruct page layout for downstream rendering or comparison.
- Run layout-aware document understanding (semantic paragraph roles, table cell spans,
formulas in LaTeX, picture classification and alt descriptions).

This skill is **only** for `/extraction/parse`. For PDF generation, conversion, OCR,
redaction, signing, watermarking, or any `/build`-based workflow, use the sibling
`document-processor-api` skill.

## Setup

DWS Extract is a separate product from DWS Processor and has its own API key.

- Get a Nutrient DWS Extract API key at <https://dashboard.nutrient.io/>.
- Export it as `NUTRIENT_EXTRACT_API_KEY`:
  ```bash
  export NUTRIENT_EXTRACT_API_KEY="pdf_live_..."
  ```
- Scripts live in `scripts/` relative to this SKILL.md. Use the directory containing this
  SKILL.md as the working directory:
  ```bash
  cd <directory containing this SKILL.md> && uv run scripts/<script>.py --help
  ```

Calling `/extraction/parse` with a DWS Processor key returns `403`. If your tenant has been
migrated to global DWS API keys, a single key set as either `NUTRIENT_EXTRACT_API_KEY` or
`NUTRIENT_API_KEY` will work for both products.

## `/extraction/parse` — one primitive, two output shapes

One call returns the full structural document model — typed elements with bounding boxes,
confidence scores, and reading order — or a whole-document Markdown string. You always
receive all element types in a single call.

### Picking a mode

Choose based on the user's intent and acceptable credit cost. All costs are
**extraction credits per page** — a separate billing bucket from the processor API
credits consumed by `/build`, `/sign`, OCR, and other DWS Processor endpoints.

**Principle — decide from the request alone; do not ask the user clarifying questions.**
Walk the checks below in order. Each rule that fires sets a minimum mode — the final
pick is the highest minimum across all rules that fired. If none fired, use the default
(rule 5).

1. **Explicit features named in the request** are non-negotiable.
   - Key-value pairs, form fields, semantic role classification (Title / SectionHeader /
     etc.), formulas, or handwriting → at minimum `understand` (9 cr/pg).
   - Alt text on pictures, charts, or diagrams → `agentic` (18 cr/pg).
2. **Document type implied by the request or filename.**
   - `form`, `invoice`, `receipt`, `application`, `claim` → likely contains key-value
     pairs → `understand`.
   - `chart`, `infographic`, or diagram-heavy doc + the user wants descriptions →
     `agentic`.
3. **OCR signal from filename or request** (`scanned`, `image-based`, `photographed`,
   `handwritten`, `screenshot`) → `structure` minimum; `text` mode silently fails on
   image-only input.
4. **Output format from intent.** RAG, search indexing, embeddings, or content migration
   → `markdown`. Layout overlay, per-element processing, or bounded extraction →
   `spatial`.
5. **No cues match anything above** → documented default `structure` + `spatial`
   (1.5 cr/pg). Handles both born-digital and scanned, gives bounded typed elements
   with table cells, never silently drops content.
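
The "highest minimum wins" walk above can be sketched in a few lines of Python. This is a minimal illustration, not part of the skill's scripts: `pick_mode`, `MODE_RANK`, and the four boolean flags are hypothetical names, and the sketch assumes the caller has already decided which cues fired. When only the Markdown cue fires it falls back to `text`, following the born-digital row of the intent table in this section.

```python
# Hypothetical ranking so that rule minimums can be compared (cheapest first).
MODE_RANK = {"text": 0, "structure": 1, "understand": 2, "agentic": 3}


def pick_mode(*, needs_understand=False, needs_agentic=False,
              looks_scanned=False, wants_markdown=False):
    """Return (mode, output_format) for the cues that fired.

    Each flag is an assumption standing in for one rule:
      needs_understand -> rules 1/2 (key-value pairs, forms, formulas, ...)
      needs_agentic    -> rules 1/2 (alt text, chart descriptions)
      looks_scanned    -> rule 3 (OCR signal)
      wants_markdown   -> rule 4 (RAG, search indexing, migration)
    """
    minimums = []
    if needs_understand:
        minimums.append("understand")
    if needs_agentic:
        minimums.append("agentic")
    if looks_scanned:
        minimums.append("structure")

    output_format = "markdown" if wants_markdown else "spatial"

    if minimums:
        # The final pick is the highest minimum across all rules that fired.
        mode = max(minimums, key=MODE_RANK.__getitem__)
    elif wants_markdown:
        # Only the output-format rule fired: cheapest Markdown path.
        mode = "text"
    else:
        # Rule 5: nothing fired, so use the documented default.
        mode = "structure"
    return mode, output_format


print(pick_mode(needs_understand=True, looks_scanned=True))
```

Because the no-cue branch returns `structure`, the sketch never produces the rejected `text` + `spatial` combination.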

| User intent | Mode | Output format | Cost | Notes |
|-------------|------|---------------|------|-------|
| RAG / search indexing / content migration — born-digital PDF | `text` | `markdown` | 1 cr/pg | Cheapest path; no OCR or AI needed |
| RAG / search indexing — scanned or image-based PDF | `structure` | `markdown` | 1.5 cr/pg | OCR required before Markdown assembly |
| Form / invoice extraction | `understand` | `spatial` | 9 cr/pg | AI classification for reliable key-value and table detection |
| Layout-aware document understanding | `understand` | `spatial` | 9 cr/pg | Semantic paragraph roles (Title, SectionHeader, etc.) |
| Deep visual understanding (charts, diagrams, alt text) | `agentic` | `spatial` | 18 cr/pg | VLM adds alt descriptions on every picture element |
| **Default / ambiguous intent** | **`structure`** | **`spatial`** | **1.5 cr/pg** | Good balance: OCR + spatial elements, low cost |

**Confirm before running when the estimated cost exceeds 200 extraction credits**, which
happens above roughly 11 pages of `agentic`, 22 of `understand`, 133 of `structure`, or
200 of `text`. Surface the estimate (`pages × cost_per_page`) and ask the operator to
confirm before invoking. At or under that threshold, just run.
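
The threshold check is simple arithmetic; a minimal sketch follows. The per-page costs and the 200-credit threshold come from this section, while the function names and the idea that the caller already knows the page count are assumptions made for illustration.

```python
# Per-page extraction-credit costs as listed in this document.
COST_PER_PAGE = {"text": 1.0, "structure": 1.5, "understand": 9.0, "agentic": 18.0}
CONFIRM_THRESHOLD = 200  # extraction credits


def estimate_credits(pages: int, mode: str) -> float:
    """Estimated cost of one parse call: pages x cost_per_page."""
    return pages * COST_PER_PAGE[mode]


def needs_confirmation(pages: int, mode: str) -> bool:
    """True only when the estimate strictly exceeds the threshold."""
    return estimate_credits(pages, mode) > CONFIRM_THRESHOLD


for pages, mode in [(11, "agentic"), (12, "agentic"), (134, "structure")]:
    print(pages, mode, estimate_credits(pages, mode), needs_confirmation(pages, mode))
```

For example, 11 pages of `agentic` is estimated at 198 credits and runs without a prompt, while 12 pages (216 credits) should be confirmed first.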

`mode='text'` is incompatible with `output_format='spatial'`; the client rejects the
combination before the network call.

### Invocation

```bash
# Default: structure mode, spatial output
uv run scripts/parse.py --input doc.pdf --out out.json

# Markdown for RAG (text mode — cheapest)
uv run scripts/parse.py --input doc.pdf --out out.md --output-format markdown --mode text

# Form extraction (understand mode)
uv run scripts/parse.py --input doc.pdf --out out.json --mode understand

# Agentic (VLM alt text on pictures)
uv run scripts/parse.py --input doc.pdf --out out.json --mode agentic
```

The script prints extraction-credit usage after each run so you can verify the cost.

### Downstream consumption

After a single `/extraction/parse` call, slice the response for common needs:

- **Reading-order plain text**: walk `output.elements` sorted by `(page.pageIndex, readingOrder)`, join `paragraph` and `handwriting` `text` fields
- **Tables**: project `cells[]` on each `table` element into rows/columns using `cell.row` and `cell.column`
- **Key-value pairs**: read `pairs[]` on each `keyValueRegion` element — each pair has `.key.value` and `.value.value`
- **Formulas**: read `latex` on each `formula` element
- **Pictures**: read `classification` and `altDescription` (populated by `agentic` mode) on each `picture` element
- **Markdown output**: call with `--output-format markdown`; the script writes the Markdown string directly

For the canonical response schema and per-mode field availability, see the official docs linked from `references/parse-output-filtering.md`; that file also lists the tools we suggest for filtering and reshaping the response.
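
As a rough illustration of the first and third bullets, the sketch below walks a hand-written sample dictionary. The field names (`output.elements`, `type`, `page.pageIndex`, `readingOrder`, `text`, `pairs`, `key.value`, `value.value`) are taken from the list above; the exact nesting, the sample data, and the helper names are assumptions, so check them against the official schema before relying on them.

```python
def reading_order_text(response: dict) -> str:
    """Join paragraph and handwriting text in (pageIndex, readingOrder) order."""
    elements = response.get("output", {}).get("elements", [])
    textual = [e for e in elements if e.get("type") in ("paragraph", "handwriting")]
    # Assumes each element nests its page index under "page".
    textual.sort(key=lambda e: (e["page"]["pageIndex"], e["readingOrder"]))
    return "\n".join(e.get("text", "") for e in textual)


def key_value_pairs(response: dict) -> dict:
    """Collect key/value strings from every keyValueRegion element."""
    found = {}
    for element in response.get("output", {}).get("elements", []):
        if element.get("type") != "keyValueRegion":
            continue
        for pair in element.get("pairs", []):
            found[pair["key"]["value"]] = pair["value"]["value"]
    return found


# Hypothetical response fragment, shaped only by the field names listed above.
sample = {"output": {"elements": [
    {"type": "paragraph", "page": {"pageIndex": 1}, "readingOrder": 0,
     "text": "Second page."},
    {"type": "keyValueRegion", "page": {"pageIndex": 0}, "readingOrder": 1,
     "pairs": [{"key": {"value": "Invoice No"}, "value": {"value": "INV-001"}}]},
    {"type": "paragraph", "page": {"pageIndex": 0}, "readingOrder": 0,
     "text": "First page."},
]}}

print(reading_order_text(sample))
print(key_value_pairs(sample))
```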

### Input constraint

`parse.py` only accepts **local file paths** — the underlying API endpoint is
multipart-only. For remote inputs, download the file first.
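
One way to handle a remote input is to copy it to a local file as a separate step and then pass that path to `parse.py`. The sketch below uses only the standard library; `download_to_local` is a hypothetical helper that lives outside `parse.py`, so the script itself stays local-path-only.

```python
import shutil
import tempfile
import urllib.request
from pathlib import Path
from typing import Optional
from urllib.parse import urlparse


def download_to_local(url: str, dest_dir: Optional[str] = None) -> Path:
    """Copy the document at `url` into `dest_dir` and return the local path.

    The file name is taken from the URL path; "download.bin" is a
    placeholder used when the URL has no file name component.
    """
    name = Path(urlparse(url).path).name or "download.bin"
    target_dir = Path(dest_dir) if dest_dir else Path(tempfile.mkdtemp())
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / name
    # Stream the body to disk instead of holding it in memory.
    with urllib.request.urlopen(url) as response, open(target, "wb") as out:
        shutil.copyfileobj(response, out)
    return target
```

The returned path can then be used as the `--input` argument, for example `uv run scripts/parse.py --input <downloaded path> --out out.json`.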

## Rules

- Always preserve the printed credit-usage summary in script output so the operator can
observe per-call cost.
- Do not add a URL-fetch shortcut; the endpoint is multipart-only.
@@ -0,0 +1,32 @@
# Parse Output — Filtering and Downstream Patterns

The response shape of `/extraction/parse` — element types, field-by-field
schemas, coordinate spaces, per-mode field availability — is documented
upstream. Use those pages as the source of truth; this reference only
suggests which tools to reach for when slicing and reshaping the response.

## Official documentation

- [Document element extraction (spatial output)](https://www.nutrient.io/guides/dws-data-extraction/parsing/extract-document-elements/) —
schema for `output.elements`, element types, bounding-box conventions.
- [Markdown extraction](https://www.nutrient.io/guides/dws-data-extraction/parsing/extract-markdown/) —
shape of `output.markdown`.
- [Processing modes](https://www.nutrient.io/guides/dws-data-extraction/parsing/processing-modes/) —
which fields each mode populates (e.g. `altDescription` only with
`agentic`; `keyValueRegion` and `formula` only with `understand` or
higher).
- [Coordinate spaces](https://www.nutrient.io/guides/dws-data-extraction/parsing/coordinate-spaces/) —
how `bounds` relate to `page.width` / `page.height`.

## Suggested tools

| Task | Tool | Why |
|---|---|---|
| Filter or project the spatial JSON response | `jq` | Discriminate on `type` (`paragraph`, `table`, `picture`, …), select by `page.pageIndex` / `readingOrder`, or pull nested fields without writing code. |
| Walk the response programmatically | the standard `json` module | The response is plain JSON; a recursive walk over `output.elements` is enough for type filtering, reading-order sort, and bounds extraction. |
| Project tables into rows / columns | `pandas` | Tables come as a flat `cells[]` list with `row` / `column` indices; `pd.DataFrame` reshapes them cleanly. |
| Render formulas | any LaTeX renderer (MathJax, KaTeX, matplotlib) | `formula` elements carry `latex` strings ready to feed a renderer. |
| Post-process markdown output (chunk on headings, strip tables, etc.) | `markdown-it-py`, `mistune`, or a regex on `#` lines | `output.markdown` uses standard heading hierarchy. |

For the per-call extraction-credit cost, read `usage.data_extraction_credits.cost`
directly from the response — no tool needed.
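
For cases where `pandas` is not available, the table projection and the cost lookup can be sketched with the standard library alone. The `cells[]`, `row`, `column`, and `usage.data_extraction_credits.cost` names come from this document; zero-based indices and a `text` field on each cell are assumptions, and the helper names are hypothetical.

```python
def table_to_rows(table: dict) -> list:
    """Project a flat cells[] list into a row-major grid of strings.

    Assumes zero-based `row` / `column` indices and a `text` field per
    cell; positions with no cell are left as empty strings.
    """
    cells = table.get("cells", [])
    if not cells:
        return []
    n_rows = max(cell["row"] for cell in cells) + 1
    n_cols = max(cell["column"] for cell in cells) + 1
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for cell in cells:
        grid[cell["row"]][cell["column"]] = cell.get("text", "")
    return grid


def extraction_cost(response: dict):
    """Return usage.data_extraction_credits.cost, or None if it is absent."""
    return response.get("usage", {}).get("data_extraction_credits", {}).get("cost")


# Hypothetical table element and response fragment for illustration.
sample_table = {"type": "table", "cells": [
    {"row": 0, "column": 0, "text": "Item"},
    {"row": 0, "column": 1, "text": "Price"},
    {"row": 1, "column": 0, "text": "Widget"},
    {"row": 1, "column": 1, "text": "9.99"},
]}
sample_response = {"usage": {"data_extraction_credits": {"cost": 1.5}}}

print(table_to_rows(sample_table))
print(extraction_cost(sample_response))
```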
@@ -0,0 +1,48 @@
import os
import sys
from typing import NoReturn


def create_client():
    """Create and return a NutrientClient configured for DWS Extract.

    DWS Extract is a separate product from DWS Processor and has its own
    API key. Reads NUTRIENT_EXTRACT_API_KEY first and falls back to
    NUTRIENT_API_KEY only when the former is unset, so a single global key
    works once DWS rolls those out. At least one of the two must be set.

    Uses `is None` rather than truthiness so an explicitly empty
    NUTRIENT_EXTRACT_API_KEY (`export NUTRIENT_EXTRACT_API_KEY=`) is treated
    as a misconfiguration to surface, not as "fall back to the other key".
    """
    extract_api_key = os.environ.get("NUTRIENT_EXTRACT_API_KEY")
    fallback_key = os.environ.get("NUTRIENT_API_KEY")
    if extract_api_key is None and fallback_key is None:
        raise RuntimeError(
            "NUTRIENT_EXTRACT_API_KEY is not set (and no NUTRIENT_API_KEY "
            "fallback was found). DWS Extract requires its own "
            "API key (separate from the DWS Processor key). Export it before "
            "running this skill's scripts."
        )
    try:
        # Imported lazily so a missing package produces an actionable message.
        from nutrient_dws import NutrientClient
    except ImportError as e:
        raise RuntimeError(
            "Unable to import nutrient_dws. Install with: uv add 'nutrient-dws>=3.1.0'\n"
            f"Original error: {e}"
        ) from e
    primary = extract_api_key if extract_api_key is not None else fallback_key
    return NutrientClient(api_key=primary, extract_api_key=extract_api_key)


def assert_local_file(value: str, arg: str) -> str:
    """Raise if value looks like a URL; otherwise return the path."""
    v = str(value).strip()
    if v.startswith("http://") or v.startswith("https://"):
        raise ValueError(f"--{arg} must be a local file path for this operation.")
    return v


def handle_error(e: Exception) -> NoReturn:
    """Print the error message and exit with code 1."""
    print(str(e), file=sys.stderr)
    sys.exit(1)
152 changes: 152 additions & 0 deletions plugins/nutrient-dws/skills/document-extraction-api/scripts/parse.py
@@ -0,0 +1,152 @@
#!/usr/bin/env python3
# /// script
# requires-python = ">=3.10"
# dependencies = ["nutrient-dws>=3.1.0"]
# ///
"""Parse a document using the Nutrient Data Extraction API (/extraction/parse).

This script is the single primitive for document understanding via /extraction/parse.
One call returns the full structural document model — typed elements with bounding boxes,
confidence scores, and reading order — or a whole-document Markdown string.

DWS Extract is a separate product from DWS Processor. It uses its own API key, supplied
via the NUTRIENT_EXTRACT_API_KEY environment variable. Calls to /extraction/parse with a
DWS Processor key return 403.

Billing note: /extraction/parse is billed against **extraction credits**, which are a
separate billing bucket from the processor API credits consumed by /build, /sign, OCR,
and other Processor API endpoints.

Per-page extraction-credit costs by mode:
text: 1 extraction credit — fast Markdown from born-digital documents (no OCR/AI)
structure: 1.5 extraction credits — OCR + spatial elements with bounding boxes
understand: 9 extraction credits — AI layout analysis, table detection, semantic classification
agentic: 18 extraction credits — VLM-augmented; deepest visual understanding

Output shapes:
spatial (default): response.output.elements — typed elements list
markdown: response.output.markdown — whole-document Markdown string

Usage examples:
# Spatial elements (structure mode) — lowest-cost spatial extraction
uv run scripts/parse.py --input doc.pdf --out out.json

# Markdown for RAG / search indexing (text mode — cheapest)
uv run scripts/parse.py --input doc.pdf --out out.md --output-format markdown --mode text

# Form / invoice extraction (understand mode — typed elements with confidence)
uv run scripts/parse.py --input doc.pdf --out out.json --mode understand

# Deep visual understanding (agentic mode — VLM descriptions on pictures)
uv run scripts/parse.py --input doc.pdf --out out.json --mode agentic --output-format spatial
"""

import argparse
import asyncio
import json
import sys
from pathlib import Path

sys.path.insert(0, str(Path(__file__).parent))
from lib.common import assert_local_file, create_client, handle_error


async def main() -> None:
    parser = argparse.ArgumentParser(
        description=(
            "Parse a document with the Nutrient Data Extraction API and write the result. "
            "Billed against extraction credits (separate from processor API credits)."
        ),
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Extraction credit costs per page:
text: 1 extraction credit (born-digital Markdown, no OCR)
structure: 1.5 extraction credits (OCR + spatial elements) [default]
understand: 9 extraction credits (AI layout + table detection)
agentic: 18 extraction credits (VLM-augmented)

Output shapes:
spatial (default): typed element list at output.elements
markdown: whole-document Markdown at output.markdown
""",
    )
    parser.add_argument(
        "--input",
        required=True,
        help="Path to the local input document (PDF, image, or Office file).",
    )
    parser.add_argument(
        "--out",
        required=True,
        help="Output file path. Receives the full JSON response for spatial output, "
        "or a .md file for markdown output.",
    )
    parser.add_argument(
        "--mode",
        choices=["text", "structure", "understand", "agentic"],
        default="structure",
        help=(
            "Processing mode controlling cost and quality. "
            "text=1cr, structure=1.5cr (default), understand=9cr, agentic=18cr — "
            "all costs are extraction credits per page."
        ),
    )
    parser.add_argument(
        "--output-format",
        dest="output_format",
        choices=["spatial", "markdown"],
        default="spatial",
        help=(
            "Shape of the output. "
            "spatial: typed elements with bounds (default). "
            "markdown: whole-document Markdown string."
        ),
    )
    args = parser.parse_args()

    # Validate input is a local file (the /extraction/parse endpoint is multipart-only).
    assert_local_file(args.input, "input")
    input_path = Path(args.input)
    if not input_path.exists():
        print(f"Error: input file not found: {args.input}", file=sys.stderr)
        sys.exit(1)

    client = create_client()
    response = await client.parse(
        input_path,
        mode=args.mode,
        output_format=args.output_format,
    )

    out_path = Path(args.out)
    out_path.parent.mkdir(parents=True, exist_ok=True)

    if args.output_format == "markdown":
        markdown = response.get("output", {}).get("markdown", "")
        out_path.write_text(markdown, encoding="utf-8")
        print(f"Wrote {args.out}")
    else:
        with open(out_path, "w", encoding="utf-8") as f:
            json.dump(response, f, indent=2)
        print(f"Wrote {args.out}")

    # Print usage summary so callers can see credit cost without opening the output file
    usage = response.get("usage", {})
    credits_info = usage.get("data_extraction_credits", {})
    cost = credits_info.get("cost")
    remaining = credits_info.get("remainingCredits")
    metrics = response.get("metrics", {})
    pages = metrics.get("pagesProcessed", "?")
    if cost is not None:
        remaining_str = f", remaining: {remaining}" if remaining is not None else ""
        print(
            f"Usage: {cost} extraction credits ({pages} page(s) at {args.mode} mode"
            f"{remaining_str})"
        )


if __name__ == "__main__":
    try:
        asyncio.run(main())
    except Exception as e:
        handle_error(e)