Merged
Changes from 4 commits
17 changes: 17 additions & 0 deletions CHANGELOG.md
@@ -7,6 +7,23 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

## [Unreleased]

### Added

- `client.parse()` — first-class support for the Data Extraction API
(`/extraction/parse`). Supports all four processing modes (`text`,
`structure`, `understand`, `agentic`) and both output shapes (spatial
elements and whole-document Markdown). Typed response model with
discriminated element variants (paragraph, table, formula, picture,
keyValueRegion, handwriting). Billed against **extraction credits**, a
separate billing bucket from the **processor API credits** used by the
other endpoints.
- New types exported from `nutrient_dws`: `ParseResponse`,
`ParseInstructions`, `ParseMode`, `ParseOutputFormat`, `ParseElement`,
`ParseOutputBody`, `ParseOutputElements`, `ParseOutputMarkdown`,
`ParagraphElement`, `TableElement`, `TableCell`, `FormulaElement`,
`PictureElement`, `KeyValueRegionElement`, `KeyValuePair`,
`HandwritingElement`.

## [3.0.0] - 2026-01-30

### Security
111 changes: 111 additions & 0 deletions README.md
@@ -88,6 +88,117 @@ asyncio.run(main())

For a complete list of available methods with examples, see the [Methods Documentation](docs/METHODS.md).

## Data Extraction (`/extraction/parse`)

`client.parse()` exposes Nutrient's Data Extraction API. It's designed for
**content-extraction workflows** where you need to feed document content into a
downstream pipeline rather than render or transform the document itself:

- **RAG (retrieval-augmented generation) pipelines** — pull a clean Markdown
representation of a document for chunking, embedding, and indexing in a
vector store.
- **Search indexing and content migration** — convert documents into Markdown
for full-text search or for migration into a new content management system.
- **Form and invoice extraction** — pull structured fields (key/value pairs,
tables, semantic regions) out of business documents with bounding boxes and
confidence scores attached to every element.
- **Layout-aware document understanding** — get a typed, page-anchored element
list (paragraphs with semantic roles, tables with cell spans, formulas in
LaTeX, pictures, handwriting) suitable for building document-comprehension
tooling, including agentic workflows.

### Choosing an output format

| Format | Best for | Shape |
|-------------------|----------------------------------------------------------------------------|----------------------------------------------------------------------|
| `markdown` | RAG, search indexing, content migration — anywhere structured text beats spatial data | One whole-document Markdown string at `response['output']['markdown']` |
| `spatial` (default) | Form/invoice extraction, layout reconstruction, flows that need per-element confidence | Flat list of typed elements at `response['output']['elements']` |
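
Code that receives a response without knowing which format was requested can branch on the keys under `response['output']`. The sketch below is one way to do that; the `describe_output` helper and the hand-written sample responses are illustrative and not part of the client, and it assumes that a Markdown response carries no `elements` key.

```python
from typing import Any, Mapping


def describe_output(response: Mapping[str, Any]) -> str:
    """Summarize a parse response without assuming which format was requested."""
    output = response["output"]
    if "markdown" in output:
        # Markdown output: one string for the whole document.
        return f"markdown ({len(output['markdown'])} characters)"
    # Spatial output: a flat list of typed elements.
    return f"spatial ({len(output['elements'])} elements)"


# Hand-written stand-ins for real responses, for illustration only.
markdown_response = {"output": {"markdown": "# Title\n\nBody text."}}
spatial_response = {"output": {"elements": [{"type": "paragraph"}, {"type": "table"}]}}

print(describe_output(markdown_response))
print(describe_output(spatial_response))
```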

### Quick start

```python
import asyncio
from nutrient_dws import NutrientClient

async def main():
    client = NutrientClient(api_key='your_api_key')

    # Spatial elements (default): paragraphs, tables, formulas, pictures, etc.
    response = await client.parse('contract.pdf', mode='understand')
    for element in response['output']['elements']:
        if element['type'] == 'table':
            print(element['rowCount'], element['columnCount'])

    # Whole-document Markdown from a born-digital PDF
    response = await client.parse(
        'report.pdf', mode='text', output_format='markdown',
    )
    print(response['output']['markdown'])

asyncio.run(main())
```

### Modes — when to use which

| Mode | Credits / page | When to use |
|--------------|----------------|----------------------------------------------------------------------------------------------|
| `text` | 1 | Born-digital documents only. No OCR, no AI. Fastest and cheapest path to Markdown. |
| `structure` | 1.5 | OCR-based segmentation with bounding boxes. Handles scanned documents, images, and any input requiring OCR. |
| `understand` | 9 | Full pipeline with AI augmentation on top of OCR. Most accurate for documents with tables, multi-column layouts, formulas, and form fields. |
| `agentic` | 18 | Builds on `understand` and adds a vision-language model. Best for image descriptions, complex visual layouts, and deeper semantic understanding. |
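
To compare modes before committing to one, the per-page rates in the table can be multiplied out. This is a rough sketch only: `estimate_extraction_credits` is a hypothetical helper, and it assumes billing is a plain rate-times-pages product with no rounding or minimums, which the table does not confirm.

```python
# Per-page rates copied from the modes table above.
CREDITS_PER_PAGE = {
    "text": 1.0,
    "structure": 1.5,
    "understand": 9.0,
    "agentic": 18.0,
}


def estimate_extraction_credits(mode: str, page_count: int) -> float:
    """Estimate extraction credits for a document (assumes rate * pages)."""
    if mode not in CREDITS_PER_PAGE:
        raise ValueError(f"unknown mode: {mode!r}")
    if page_count < 0:
        raise ValueError("page_count must be non-negative")
    return CREDITS_PER_PAGE[mode] * page_count


# Compare all four modes for a 40-page document.
for mode in CREDITS_PER_PAGE:
    print(mode, estimate_extraction_credits(mode, 40))
```

The authoritative figure is the `cost` reported in the response's usage block, so treat the estimate as a planning aid.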

### Recipes

**RAG ingestion** — PDF → Markdown → chunks → embeddings → vector store:

```python
response = await client.parse('whitepaper.pdf', mode='text', output_format='markdown')
markdown = response['output']['markdown']
# Then: chunk on headings, embed, push to your vector store of choice.
```

For born-digital PDFs, `mode='text'` is the cheapest path (1 credit/page).
For scanned PDFs or images, switch to `mode='structure'` so OCR runs.
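
The "chunk on headings" step might look like the sketch below. It uses only the standard library; `chunk_on_headings` is an illustrative helper, not part of the client, and it assumes the Markdown uses ATX headings (`#` through `######`). It does not special-case `#` lines inside fenced code blocks, so a real pipeline may want a proper Markdown parser.

```python
import re

# An ATX heading: one to six '#' characters followed by a space, at line start.
HEADING = re.compile(r"^#{1,6} ", re.MULTILINE)


def chunk_on_headings(markdown: str) -> list[str]:
    """Split Markdown into chunks that each begin at a heading."""
    starts = [match.start() for match in HEADING.finditer(markdown)]
    if not starts or starts[0] != 0:
        starts.insert(0, 0)  # keep any preamble before the first heading
    bounds = starts + [len(markdown)]
    chunks = [markdown[a:b].strip() for a, b in zip(bounds, bounds[1:])]
    return [chunk for chunk in chunks if chunk]


sample = "Intro line.\n\n# One\n\nAlpha.\n\n## Two\n\nBeta.\n"
for chunk in chunk_on_headings(sample):
    print(repr(chunk))
```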

**Form/invoice extraction** — PDF → spatial elements → structured dict:

```python
response = await client.parse('invoice.pdf', mode='understand')
elements = response['output']['elements']

# Pull key/value pairs from form regions
fields = {}
for element in elements:
    if element['type'] == 'keyValueRegion':
        for pair in element['pairs']:
            fields[pair['key']['value']] = pair['value']['value']

# Walk tables — each cell carries row/col indices and span counts
for element in elements:
    if element['type'] == 'table':
        print(f"Table: {element['rowCount']}×{element['columnCount']}")
        for cell in element['cells']:
            print(f"  [{cell['row']}][{cell['column']}] {cell['text']}")
```

For complex layouts that mix dense images with text, step up to
`mode='agentic'` so the VLM can produce image descriptions and semantic
classifications (18 credits/page).
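
For downstream code it is often easier to work with a table as a row-major grid than as a flat cell list. The sketch below builds one from the `rowCount`, `columnCount`, and per-cell `row`, `column`, and `text` fields used in the recipe above. `table_to_grid` is a hypothetical helper; it assumes the indices are zero-based and ignores span counts, both of which should be checked against a real response.

```python
from typing import Any, Mapping


def table_to_grid(table: Mapping[str, Any]) -> list[list[str]]:
    """Place each cell's text into a rowCount x columnCount grid."""
    grid = [
        ["" for _ in range(table["columnCount"])]
        for _ in range(table["rowCount"])
    ]
    for cell in table["cells"]:
        # Assumption: row and column are zero-based; spans are not expanded.
        grid[cell["row"]][cell["column"]] = cell["text"]
    return grid


# Hand-written stand-in for a table element, for illustration only.
sample_table = {
    "type": "table",
    "rowCount": 2,
    "columnCount": 2,
    "cells": [
        {"row": 0, "column": 0, "text": "Item"},
        {"row": 0, "column": 1, "text": "Total"},
        {"row": 1, "column": 0, "text": "Widget"},
        {"row": 1, "column": 1, "text": "9.99"},
    ],
}

for row in table_to_grid(sample_table):
    print(row)
```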

### Billing — extraction credits vs processor credits

The Data Extraction API is billed against **extraction credits**, which are a
separate billing bucket from the **processor API credits** consumed by
`/build`, `/sign`, OCR, and the other Processor API endpoints used by this
client (`convert`, `watermark_text`, `merge`, etc.). The response surfaces the
extraction-credit accounting under `response['usage']['data_extraction_credits']`:

```python
usage = response['usage']['data_extraction_credits']
print(f"Cost: {usage['cost']} extraction credits, "
      f"remaining: {usage['remainingCredits']}")
```
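
When parsing a batch of documents, the per-response usage blocks can be folded into a running total. The sketch below assumes `cost` and `remainingCredits` are numeric and that responses are supplied in call order; `summarize_usage` and its `low_balance_threshold` parameter are illustrative names, not part of the client.

```python
from typing import Any, Iterable, Mapping, Optional


def summarize_usage(
    responses: Iterable[Mapping[str, Any]],
    low_balance_threshold: float = 100.0,
) -> dict:
    """Total the extraction-credit cost of several parse responses."""
    total_cost = 0.0
    remaining: Optional[float] = None
    for response in responses:
        usage = response["usage"]["data_extraction_credits"]
        total_cost += usage["cost"]
        remaining = usage["remainingCredits"]  # the latest response wins
    return {
        "total_cost": total_cost,
        "remaining": remaining,
        "low_balance": remaining is not None and remaining < low_balance_threshold,
    }


# Hand-written stand-ins for two parse responses, for illustration only.
sample_responses = [
    {"usage": {"data_extraction_credits": {"cost": 15.0, "remainingCredits": 85.0}}},
    {"usage": {"data_extraction_credits": {"cost": 9.0, "remainingCredits": 76.0}}},
]
print(summarize_usage(sample_responses))
```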

## Workflow System

The client also provides a fluent builder pattern with staged interfaces to create document processing workflows:
46 changes: 46 additions & 0 deletions docs/METHODS.md
@@ -449,6 +449,52 @@ if kvps and len(kvps) > 0:
    print(f'Total Amount: {dictionary.get("Total")}')
```

##### parse(file, mode?, output_format?)
Calls the Data Extraction API (`/extraction/parse`) to extract structured
content from a document. Designed for **RAG ingestion**, **search indexing**,
**content migration**, and **form/invoice extraction** workflows where the
goal is to feed document content into a downstream pipeline rather than
render or transform the document itself.

Billed against **extraction credits** — a separate billing bucket from the
processor API credits consumed by every other method on this client. See the
[README's Data Extraction section](../README.md#data-extraction-extractionparse)
for the full positioning, the per-mode comparison, and worked recipes.

**Parameters**:
- `file: LocalFileInput` - The document to parse. The endpoint accepts PDFs,
Office documents, and images. Only local inputs (paths, bytes, file-like
objects) are supported — URLs are not, because the underlying API surface is
multipart-only.
- `mode: ParseMode` - `"text"` (1 credit/page, born-digital only, no OCR/AI),
`"structure"` (1.5 credits/page, OCR + spatial layout — default),
`"understand"` (9 credits/page, AI-augmented), or `"agentic"` (18 credits/page,
adds a vision-language model).
- `output_format: ParseOutputFormat` - `"spatial"` (default — typed elements
with bounds and confidence at `response['output']['elements']`) or
`"markdown"` (whole-document Markdown string at `response['output']['markdown']`).

**Returns**: `ParseResponse` - The full response envelope, including `output`,
`metrics`, `configuration`, and `usage['data_extraction_credits']` (cost and
remaining balance in the extraction-credits bucket).

```python
# RAG ingestion — born-digital PDF to Markdown, cheap and fast.
response = await client.parse('whitepaper.pdf', mode='text', output_format='markdown')
markdown = response['output']['markdown']

# Form extraction — typed spatial elements with bounds and confidence.
response = await client.parse('invoice.pdf', mode='understand')
for element in response['output']['elements']:
    if element['type'] == 'keyValueRegion':
        for pair in element['pairs']:
            print(pair['key']['value'], '→', pair['value']['value'])

# Inspect billing — cost is in extraction credits, not processor credits.
usage = response['usage']['data_extraction_credits']
print(f"Cost: {usage['cost']} extraction credits, remaining: {usage['remainingCredits']}")
```
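
Because `parse()` accepts only local inputs, a remote document has to be downloaded and validated by the caller first. The sketch below shows one possible shape for that step using the standard library. The `ALLOWED_HOSTS` allowlist, the `docs.example.com` host, and both helper functions are hypothetical; note also that `urlopen` follows redirects and blocks the event loop, so production code may need stricter handling.

```python
from urllib.parse import urlsplit
from urllib.request import urlopen

# Hypothetical allowlist; replace with the hosts you actually trust.
ALLOWED_HOSTS = {"docs.example.com"}


def validate_document_url(url: str) -> str:
    """Return the URL unchanged if it is https and on the allowlist."""
    parts = urlsplit(url)
    if parts.scheme != "https":
        raise ValueError("only https URLs are allowed")
    if parts.hostname not in ALLOWED_HOSTS:
        raise ValueError(f"host not allowed: {parts.hostname!r}")
    return url


def fetch_document_bytes(url: str, timeout: float = 30.0) -> bytes:
    """Download a validated URL and return its body as bytes."""
    with urlopen(validate_document_url(url), timeout=timeout) as resp:
        return resp.read()


# Usage, inside an async function with an existing NutrientClient `client`:
#     data = fetch_document_bytes("https://docs.example.com/invoice.pdf")
#     response = await client.parse(data, mode="structure")
```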

##### flatten(file, annotation_ids?)
Flattens annotations in a PDF document.

36 changes: 36 additions & 0 deletions src/nutrient_dws/__init__.py
@@ -19,16 +19,52 @@
    process_file_input,
    validate_file_input,
)
from nutrient_dws.types.extraction_credits import ExtractionCredits
from nutrient_dws.types.parse import (
    FormulaElement,
    HandwritingElement,
    KeyValuePair,
    KeyValueRegionElement,
    ParagraphElement,
    ParseElement,
    ParseInstructions,
    ParseMode,
    ParseOutputBody,
    ParseOutputElements,
    ParseOutputFormat,
    ParseOutputMarkdown,
    ParseResponse,
    PictureElement,
    TableCell,
    TableElement,
)
from nutrient_dws.utils import get_library_version, get_user_agent

__all__ = [
    "APIError",
    "AuthenticationError",
    "ExtractionCredits",
    "FileInput",
    "FormulaElement",
    "HandwritingElement",
    "KeyValuePair",
    "KeyValueRegionElement",
    "LocalFileInput",
    "NetworkError",
    "NutrientClient",
    "NutrientError",
    "ParagraphElement",
    "ParseElement",
    "ParseInstructions",
    "ParseMode",
    "ParseOutputBody",
    "ParseOutputElements",
    "ParseOutputFormat",
    "ParseOutputMarkdown",
    "ParseResponse",
    "PictureElement",
    "TableCell",
    "TableElement",
    "UrlFileInput",
    "ValidationError",
    "get_library_version",
119 changes: 119 additions & 0 deletions src/nutrient_dws/client.py
@@ -18,6 +18,7 @@
from nutrient_dws.errors import NutrientError, ValidationError
from nutrient_dws.http import (
    NutrientClientOptions,
    ParseRequestData,
    RedactRequestData,
    RequestConfig,
    SignRequestData,
@@ -54,6 +55,12 @@
    CreateAuthTokenResponse,
)
from nutrient_dws.types.misc import OcrLanguage, PageRange, Pages
from nutrient_dws.types.parse import (
    ParseInstructions,
    ParseMode,
    ParseOutputFormat,
    ParseResponse,
)
from nutrient_dws.types.redact_data import RedactOptions
from nutrient_dws.types.sign_request import CreateDigitalSignature

@@ -753,6 +760,118 @@ async def extract_key_value_pairs(

        return cast("JsonContentOutput", self._process_typed_workflow_result(result))

    async def parse(
        self,
        file: LocalFileInput,
        mode: ParseMode = "structure",
        output_format: ParseOutputFormat = "spatial",
    ) -> ParseResponse:
        """Parse a document using the Data Extraction API (`/extraction/parse`).

        Designed for content-extraction workflows where document content feeds
        a downstream pipeline rather than being rendered or transformed:

        - **RAG / search indexing / content migration**: use
          `output_format="markdown"` for a whole-document Markdown string
          suitable for chunking, embedding, and indexing.
        - **Form / invoice extraction**: use `output_format="spatial"`
          (default) for a typed element list (paragraphs, tables,
          keyValueRegions, etc.) with bounds and confidence per element.
        - **Layout-aware document understanding**: combine `mode="understand"`
          or `mode="agentic"` with spatial output for layout reconstruction
          and semantic classification.

        See the README's Data Extraction section for worked recipes (RAG
        ingestion, form extraction) and per-mode positioning.

        The Data Extraction API is billed against **extraction credits**, which
        are a separate billing bucket from the **processor API credits**
        consumed by `/build`, `/sign`, OCR, and other Processor API endpoints.

        Per-page extraction-credit costs by mode:

        - `text`: 1 extraction credit / page. Fast Markdown extraction from
          born-digital documents (no OCR or AI).
        - `structure`: 1.5 extraction credits / page. OCR-based spatial
          extraction with bounding boxes (default).
        - `understand`: 9 extraction credits / page. AI-augmented layout
          analysis, table detection, and semantic classification.
        - `agentic`: 18 extraction credits / page. VLM-augmented extraction
          building on `understand` mode.

        Output format selects the shape under `response['output']`:

        - `spatial` (default): `response['output']['elements']`, a list of
          typed elements (paragraph, table, formula, picture, keyValueRegion,
          handwriting) with bounds, confidence, and reading order.
        - `markdown`: `response['output']['markdown']`, a whole-document
          Markdown string, well suited for RAG / search indexing pipelines.

        **Security note**: this method only accepts local files (paths, bytes,
        file objects) because the underlying API surface for this endpoint is
        multipart-only. For remote inputs, fetch them client-side with
        appropriate URL validation first.

        Args:
            file: The document to parse (local files only: paths, bytes, or
                file-like objects). The endpoint accepts a range of document
                formats (PDF, Office documents, images); see the public
                guide for the authoritative list. Unlike `sign()`, parsing
                is not restricted to PDFs.
            mode: Processing mode. See per-mode credit costs above. Defaults
                to `"structure"`.
            output_format: Output shape, either `"spatial"` for typed elements
                or `"markdown"` for a Markdown document. Defaults to
                `"spatial"`.

        Returns:
            The full parse response envelope, including `output`, `metrics`,
            `usage` (the extraction-credit accounting), and `configuration`.

        Example:
            ```python
            # Spatial elements with full layout analysis (9 extraction credits / page)
            response = await client.parse('contract.pdf', mode='understand')
            for element in response['output']['elements']:
                if element['type'] == 'table':
                    print(element['rowCount'], element['columnCount'])

            # Whole-document Markdown from a born-digital PDF (1 extraction credit / page)
            response = await client.parse(
                'report.pdf', mode='text', output_format='markdown'
            )
            print(response['output']['markdown'])

            # Inspect billing
            usage = response['usage']['data_extraction_credits']
            print(f"Cost: {usage['cost']} extraction credits "
                  f"(remaining: {usage['remainingCredits']})")
            ```
        """
        # Multipart-only endpoint; only local file inputs are supported.
        normalized_file = await process_file_input(file)

        instructions: ParseInstructions = {
            "mode": mode,
            "output": {"format": output_format},
        }

        request_data: ParseRequestData = {
            "file": normalized_file,
            "instructions": instructions,
        }

        response: Any = await send_request(
            {
                "method": "POST",
                "endpoint": "/extraction/parse",
                "data": request_data,
                "headers": None,
            },
            self.options,
        )
        return cast("ParseResponse", response["data"])

    async def set_page_labels(
        self,
        pdf: FileInput,