Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
17 changes: 17 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -7,6 +7,23 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

## [Unreleased]

### Added

- `client.parse()` — first-class support for the Data Extraction API
(`/extraction/parse`). Supports all four processing modes (`text`,
`structure`, `understand`, `agentic`) and both output shapes (spatial
elements and whole-document Markdown). Typed response model with
discriminated element variants (paragraph, table, formula, picture,
keyValueRegion, handwriting). Billed against **extraction credits**, a
separate billing bucket from the **processor API credits** used by the
other endpoints.
- New types exported from `nutrient_dws`: `ParseResponse`,
`ParseInstructions`, `ParseMode`, `ParseOutputFormat`, `ParseElement`,
`ParseOutputBody`, `ParseOutputElements`, `ParseOutputMarkdown`,
`ParagraphElement`, `TableElement`, `TableCell`, `FormulaElement`,
`PictureElement`, `KeyValueRegionElement`, `KeyValuePair`,
`HandwritingElement`.

## [3.0.0] - 2026-01-30

### Security
Expand Down
127 changes: 127 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -88,6 +88,133 @@ asyncio.run(main())

For a complete list of available methods with examples, see the [Methods Documentation](docs/METHODS.md).

## Data Extraction (`/extraction/parse`)

`client.parse()` exposes Nutrient's Data Extraction API. It's designed for
**content-extraction workflows** where you need to feed document content into a
downstream pipeline rather than render or transform the document itself:

> **Heads up — separate API key.** DWS Extract is a different product from
> DWS Processor and has its own API key. Pass it as
> `NutrientClient(api_key=..., extract_api_key=...)`; the Extract key is
> used only for `parse()`, while every other method continues to use the
> Processor key. Using the Processor key against `/extraction/parse`
> returns `403`. If `extract_api_key` is omitted, `parse()` falls back to
> the main `api_key` — that path works once your tenant moves to global
> DWS API keys.

- **RAG (retrieval-augmented generation) pipelines** — pull a clean Markdown
representation of a document for chunking, embedding, and indexing in a
vector store.
- **Search indexing and content migration** — convert documents into Markdown
for full-text search or for migration into a new content management system.
- **Form and invoice extraction** — pull structured fields (key/value pairs,
tables, semantic regions) out of business documents with bounding boxes and
confidence scores attached to every element.
- **Layout-aware document understanding** — get a typed, page-anchored element
list (paragraphs with semantic roles, tables with cell spans, formulas in
LaTeX, pictures, handwriting) suitable for building document-comprehension
tooling, including agentic workflows.

### Choosing an output format

| Format | Best for | Shape |
|-------------------|----------------------------------------------------------------------------|----------------------------------------------------------------------|
| `markdown` | RAG, search indexing, content migration — anywhere structured text beats spatial data | One whole-document Markdown string at `response['output']['markdown']` |
| `spatial` (default) | Form/invoice extraction, layout reconstruction, flows that need per-element confidence | Flat list of typed elements at `response['output']['elements']` |

Spatial output requires an OCR-capable mode (`structure`, `understand`, or
`agentic`); `mode='text'` is markdown-only and the client rejects the
`text` + `spatial` combination before the request goes out.

### Quick start

```python
import asyncio
from nutrient_dws import NutrientClient

async def main():
client = NutrientClient(
api_key='your_processor_key',
extract_api_key='your_extract_key',
)

# Spatial elements (default) — paragraphs, tables, formulas, pictures, etc.
response = await client.parse('contract.pdf', mode='understand')
for element in response['output']['elements']:
if element['type'] == 'table':
print(element['rowCount'], element['columnCount'])

# Whole-document Markdown from a born-digital PDF
response = await client.parse(
'report.pdf', mode='text', output_format='markdown',
)
print(response['output']['markdown'])

asyncio.run(main())
```

### Modes — when to use which

| Mode | Credits / page | When to use |
|--------------|----------------|----------------------------------------------------------------------------------------------|
| `text` | 1 | Born-digital documents only. No OCR, no AI. Fastest and cheapest path to Markdown. |
| `structure` | 1.5 | OCR-based segmentation with bounding boxes. Handles scanned documents, images, and any input requiring OCR. |
| `understand` | 9 | Full pipeline with AI augmentation on top of OCR. Most accurate for documents with tables, multi-column layouts, formulas, and form fields. |
| `agentic` | 18 | Builds on `understand` and adds a vision-language model. Best for image descriptions, complex visual layouts, and deeper semantic understanding. |

### Recipes

**RAG ingestion** — PDF → Markdown → chunks → embeddings → vector store:

```python
response = await client.parse('whitepaper.pdf', mode='text', output_format='markdown')
markdown = response['output']['markdown']
# Then: chunk on headings, embed, push to your vector store of choice.
```

For born-digital PDFs, `mode='text'` is the cheapest path (1 credit/page).
For scanned PDFs or images, switch to `mode='structure'` so OCR runs.

**Form/invoice extraction** — PDF → spatial elements → structured dict:

```python
response = await client.parse('invoice.pdf', mode='understand')
elements = response['output']['elements']

# Pull key/value pairs from form regions
fields = {}
for element in elements:
if element['type'] == 'keyValueRegion':
for pair in element['pairs']:
fields[pair['key']['value']] = pair['value']['value']

# Walk tables — each cell carries row/col indices and span counts
for element in elements:
if element['type'] == 'table':
print(f"Table: {element['rowCount']}×{element['columnCount']}")
for cell in element['cells']:
print(f" [{cell['row']}][{cell['column']}] {cell['text']}")
```

For complex layouts that mix dense images with text, step up to
`mode='agentic'` so the VLM can produce image descriptions and semantic
classifications (18 credits/page).

### Billing — extraction credits vs processor credits

The Data Extraction API is billed against **extraction credits**, which are a
separate billing bucket from the **processor API credits** consumed by
`/build`, `/sign`, OCR, and the other Processor API endpoints used by this
client (`convert`, `watermark_text`, `merge`, etc.). The response surfaces the
extraction-credit accounting under `response['usage']['data_extraction_credits']`:

```python
usage = response['usage']['data_extraction_credits']
print(f"Cost: {usage['cost']} extraction credits, "
f"remaining: {usage['remainingCredits']}")
```

## Workflow System

The client also provides a fluent builder pattern with staged interfaces to create document processing workflows:
Expand Down
46 changes: 46 additions & 0 deletions docs/METHODS.md
Original file line number Diff line number Diff line change
Expand Up @@ -449,6 +449,52 @@ if kvps and len(kvps) > 0:
print(f'Total Amount: {dictionary.get("Total")}')
```

##### parse(file, mode?, output_format?)
Calls the Data Extraction API (`/extraction/parse`) to extract structured
content from a document. Designed for **RAG ingestion**, **search indexing**,
**content migration**, and **form/invoice extraction** workflows where the
goal is to feed document content into a downstream pipeline rather than
render or transform the document itself.

Billed against **extraction credits** — a separate billing bucket from the
processor API credits consumed by every other method on this client. See the
[README's Data Extraction section](../README.md#data-extraction-extractionparse)
for the full positioning, the per-mode comparison, and worked recipes.

**Parameters**:
- `file: LocalFileInput` - The document to parse. The endpoint accepts PDFs,
Office documents, and images. Only local inputs (paths, bytes, file-like
objects) are supported — URLs are not, because the underlying API surface is
multipart-only.
- `mode: ParseMode` - `"text"` (1 credit/page, born-digital only, no OCR/AI),
`"structure"` (1.5 credits/page, OCR + spatial layout — default),
`"understand"` (9 credits/page, AI-augmented), or `"agentic"` (18 credits/page,
adds a vision-language model).
- `output_format: ParseOutputFormat` - `"spatial"` (default — typed elements
with bounds and confidence at `response['output']['elements']`) or
`"markdown"` (whole-document Markdown string at `response['output']['markdown']`).

**Returns**: `ParseResponse` - The full response envelope, including `output`,
`metrics`, `configuration`, and `usage['data_extraction_credits']` (cost and
remaining balance in the extraction-credits bucket).

```python
# RAG ingestion — born-digital PDF to Markdown, cheap and fast.
response = await client.parse('whitepaper.pdf', mode='text', output_format='markdown')
markdown = response['output']['markdown']

# Form extraction — typed spatial elements with bounds and confidence.
response = await client.parse('invoice.pdf', mode='understand')
for element in response['output']['elements']:
if element['type'] == 'keyValueRegion':
for pair in element['pairs']:
print(pair['key']['value'], '→', pair['value']['value'])

# Inspect billing — cost is in extraction credits, not processor credits.
usage = response['usage']['data_extraction_credits']
print(f"Cost: {usage['cost']} extraction credits, remaining: {usage['remainingCredits']}")
```

##### flatten(file, annotation_ids?)
Flattens annotations in a PDF document.

Expand Down
36 changes: 36 additions & 0 deletions src/nutrient_dws/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -19,16 +19,52 @@
process_file_input,
validate_file_input,
)
from nutrient_dws.types.extraction_credits import ExtractionCredits
from nutrient_dws.types.parse import (
FormulaElement,
HandwritingElement,
KeyValuePair,
KeyValueRegionElement,
ParagraphElement,
ParseElement,
ParseInstructions,
ParseMode,
ParseOutputBody,
ParseOutputElements,
ParseOutputFormat,
ParseOutputMarkdown,
ParseResponse,
PictureElement,
TableCell,
TableElement,
)
from nutrient_dws.utils import get_library_version, get_user_agent

__all__ = [
"APIError",
"AuthenticationError",
"ExtractionCredits",
"FileInput",
"FormulaElement",
"HandwritingElement",
"KeyValuePair",
"KeyValueRegionElement",
"LocalFileInput",
"NetworkError",
"NutrientClient",
"NutrientError",
"ParagraphElement",
"ParseElement",
"ParseInstructions",
"ParseMode",
"ParseOutputBody",
"ParseOutputElements",
"ParseOutputFormat",
"ParseOutputMarkdown",
"ParseResponse",
"PictureElement",
"TableCell",
"TableElement",
"UrlFileInput",
"ValidationError",
"get_library_version",
Expand Down
Loading
Loading