PSPDFKit · nickwinder · May 27, 2026
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -7,6 +7,23 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ## [Unreleased]
 
+### Added
+
+- `client.parse()` — first-class support for the Data Extraction API
+  (`/extraction/parse`). Supports all four processing modes (`text`,
+  `structure`, `understand`, `agentic`) and both output shapes (spatial
+  elements and whole-document Markdown). Typed response model with
+  discriminated element variants (paragraph, table, formula, picture,
+  keyValueRegion, handwriting). Billed against **extraction credits**, a
+  separate billing bucket from the **processor API credits** used by the
+  other endpoints.
+- New types exported from `nutrient_dws`: `ParseResponse`,
+  `ParseInstructions`, `ParseMode`, `ParseOutputFormat`, `ParseElement`,
+  `ParseOutputBody`, `ParseOutputElements`, `ParseOutputMarkdown`,
+  `ParagraphElement`, `TableElement`, `TableCell`, `FormulaElement`,
+  `PictureElement`, `KeyValueRegionElement`, `KeyValuePair`,
+  `HandwritingElement`, `ExtractionCredits`.
+
 ## [3.0.0] - 2026-01-30
 
 ### Security
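
Note on the extraction-credit billing introduced in this changelog entry: the README changes in this PR list per-page rates of 1, 1.5, 9, and 18 extraction credits for `text`, `structure`, `understand`, and `agentic`. A minimal sketch of a pre-flight estimate follows. `estimate_extraction_credits` and `CREDITS_PER_PAGE` are hypothetical names, not part of `nutrient_dws`, and the estimate assumes cost is simply rate times page count with no rounding or minimums. The value under `response['usage']['data_extraction_credits']` remains the authoritative figure.

```python
from decimal import Decimal

# Per-page rates copied from the README mode table in this PR.
# Decimal avoids float drift when summing many 1.5-credit pages.
CREDITS_PER_PAGE = {
    "text": Decimal("1"),
    "structure": Decimal("1.5"),
    "understand": Decimal("9"),
    "agentic": Decimal("18"),
}


def estimate_extraction_credits(mode: str, page_count: int) -> Decimal:
    """Hypothetical helper: estimate credits for one parse() call.

    Assumes cost = rate * pages; actual billing may differ.
    """
    if mode not in CREDITS_PER_PAGE:
        raise ValueError(f"unknown mode: {mode!r}")
    if page_count < 0:
        raise ValueError("page_count must be non-negative")
    return CREDITS_PER_PAGE[mode] * page_count
```

Comparing the estimate for `structure` and `understand` on the same page count makes the roughly sixfold price difference between the two modes visible before a batch job is submitted.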

diff --git a/README.md b/README.md
@@ -88,6 +88,117 @@ asyncio.run(main())
 
 For a complete list of available methods with examples, see the [Methods Documentation](docs/METHODS.md).
 
+## Data Extraction (`/extraction/parse`)
+
+`client.parse()` exposes Nutrient's Data Extraction API. It's designed for
+**content-extraction workflows** where you need to feed document content into a
+downstream pipeline rather than render or transform the document itself:
+
+- **RAG (retrieval-augmented generation) pipelines** — pull a clean Markdown
+  representation of a document for chunking, embedding, and indexing in a
+  vector store.
+- **Search indexing and content migration** — convert documents into Markdown
+  for full-text search or for migration into a new content management system.
+- **Form and invoice extraction** — pull structured fields (key/value pairs,
+  tables, semantic regions) out of business documents with bounding boxes and
+  confidence scores attached to every element.
+- **Layout-aware document understanding** — get a typed, page-anchored element
+  list (paragraphs with semantic roles, tables with cell spans, formulas in
+  LaTeX, pictures, handwriting) suitable for building document-comprehension
+  tooling, including agentic workflows.
+
+### Choosing an output format
+
+| Format            | Best for                                                                   | Shape                                                                |
+|-------------------|----------------------------------------------------------------------------|----------------------------------------------------------------------|
+| `markdown`        | RAG, search indexing, content migration — anywhere structured text beats spatial data | One whole-document Markdown string at `response['output']['markdown']` |
+| `spatial` (default) | Form/invoice extraction, layout reconstruction, flows that need per-element confidence | Flat list of typed elements at `response['output']['elements']`        |
+
+### Quick start
+
+```python
+import asyncio
+from nutrient_dws import NutrientClient
+
+async def main():
+    client = NutrientClient(api_key='your_api_key')
+
+    # Spatial elements (default) — paragraphs, tables, formulas, pictures, etc.
+    response = await client.parse('contract.pdf', mode='understand')
+    for element in response['output']['elements']:
+        if element['type'] == 'table':
+            print(element['rowCount'], element['columnCount'])
+
+    # Whole-document Markdown from a born-digital PDF
+    response = await client.parse(
+        'report.pdf', mode='text', output_format='markdown',
+    )
+    print(response['output']['markdown'])
+
+asyncio.run(main())
+```
+
+### Modes — when to use which
+
+| Mode         | Credits / page | When to use                                                                                  |
+|--------------|----------------|----------------------------------------------------------------------------------------------|
+| `text`       | 1              | Born-digital documents only. No OCR, no AI. Fastest and cheapest path to Markdown.           |
+| `structure` (default) | 1.5            | OCR-based segmentation with bounding boxes. Handles scanned documents, images, and any input requiring OCR. |
+| `understand` | 9              | Full pipeline with AI augmentation on top of OCR. Most accurate for documents with tables, multi-column layouts, formulas, and form fields. |
+| `agentic`    | 18             | Builds on `understand` and adds a vision-language model. Best for image descriptions, complex visual layouts, and deeper semantic understanding. |
+
+### Recipes
+
+**RAG ingestion** — PDF → Markdown → chunks → embeddings → vector store:
+
+```python
+response = await client.parse('whitepaper.pdf', mode='text', output_format='markdown')
+markdown = response['output']['markdown']
+# Then: chunk on headings, embed, push to your vector store of choice.
+```
+
+For born-digital PDFs, `mode='text'` is the cheapest path (1 credit/page).
+For scanned PDFs or images, switch to `mode='structure'` so OCR runs.
+
+**Form/invoice extraction** — PDF → spatial elements → structured dict:
+
+```python
+response = await client.parse('invoice.pdf', mode='understand')
+elements = response['output']['elements']
+
+# Pull key/value pairs from form regions
+fields = {}
+for element in elements:
+    if element['type'] == 'keyValueRegion':
+        for pair in element['pairs']:
+            fields[pair['key']['value']] = pair['value']['value']
+
+# Walk tables — each cell carries row/col indices and span counts
+for element in elements:
+    if element['type'] == 'table':
+        print(f"Table: {element['rowCount']}×{element['columnCount']}")
+        for cell in element['cells']:
+            print(f"  [{cell['row']}][{cell['column']}] {cell['text']}")
+```
+
+For complex layouts that mix dense images with text, step up to
+`mode='agentic'` so the vision-language model (VLM) can produce image
+descriptions and semantic classifications (18 credits/page).
+
+### Billing — extraction credits vs processor credits
+
+The Data Extraction API is billed against **extraction credits**, which are a
+separate billing bucket from the **processor API credits** consumed by
+`/build`, `/sign`, OCR, and the other Processor API endpoints used by this
+client (`convert`, `watermark_text`, `merge`, etc.). The response surfaces the
+extraction-credit accounting under `response['usage']['data_extraction_credits']`:
+
+```python
+usage = response['usage']['data_extraction_credits']
+print(f"Cost: {usage['cost']} extraction credits, "
+      f"remaining: {usage['remainingCredits']}")
+```
+
 ## Workflow System
 
 The client also provides a fluent builder pattern with staged interfaces to create document processing workflows:
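
Note on the RAG ingestion recipe in the README hunk above: the recipe stops at "chunk on headings". The sketch below shows one way that step could look, assuming the Markdown returned at `response['output']['markdown']` uses ATX headings (`#` through `######`). `chunk_markdown_by_headings` is a hypothetical helper, not part of `nutrient_dws`, and the heading style of real API output has not been verified here.

```python
import re

# ATX heading: one to six '#' characters, whitespace, then the title.
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")
# Built at runtime so this example can itself live inside a fenced block.
FENCE = "`" * 3


def chunk_markdown_by_headings(markdown: str) -> list[dict]:
    """Split Markdown into one chunk per heading (hypothetical helper).

    Text before the first heading becomes a chunk whose heading is None.
    Lines inside fenced code blocks are never treated as headings.
    """
    chunks = []
    heading, lines = None, []
    in_fence = False

    def flush():
        text = "\n".join(lines).strip()
        if heading is not None or text:
            chunks.append({"heading": heading, "text": text})

    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()
            heading, lines = match.group(2).strip(), []
        else:
            lines.append(line)
    flush()
    return chunks
```

Each returned dict can then be embedded on its own, with the heading kept as metadata for retrieval. Very long sections would still need a secondary size-based split, which this sketch leaves out.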

diff --git a/docs/METHODS.md b/docs/METHODS.md
@@ -449,6 +449,52 @@ if kvps and len(kvps) > 0:
     print(f'Total Amount: {dictionary.get("Total")}')
 ```
 
+##### parse(file, mode?, output_format?)
+Calls the Data Extraction API (`/extraction/parse`) to extract structured
+content from a document. Designed for **RAG ingestion**, **search indexing**,
+**content migration**, and **form/invoice extraction** workflows where the
+goal is to feed document content into a downstream pipeline rather than
+render or transform the document itself.
+
+Billed against **extraction credits** — a separate billing bucket from the
+processor API credits consumed by every other method on this client. See the
+[README's Data Extraction section](../README.md#data-extraction-extractionparse)
+for the full positioning, the per-mode comparison, and worked recipes.
+
+**Parameters**:
+- `file: LocalFileInput` - The document to parse. The endpoint accepts PDFs,
+  Office documents, and images. Only local inputs (paths, bytes, file-like
+  objects) are supported — URLs are not, because the underlying API surface is
+  multipart-only.
+- `mode: ParseMode` - `"text"` (1 credit/page, born-digital only, no OCR/AI),
+  `"structure"` (1.5 credits/page, OCR + spatial layout — default),
+  `"understand"` (9 credits/page, AI-augmented), or `"agentic"` (18 credits/page,
+  adds a vision-language model).
+- `output_format: ParseOutputFormat` - `"spatial"` (default — typed elements
+  with bounds and confidence at `response['output']['elements']`) or
+  `"markdown"` (whole-document Markdown string at `response['output']['markdown']`).
+
+**Returns**: `ParseResponse` - The full response envelope, including `output`,
+`metrics`, `configuration`, and `usage['data_extraction_credits']` (cost and
+remaining balance in the extraction-credits bucket).
+
+```python
+# RAG ingestion — born-digital PDF to Markdown, cheap and fast.
+response = await client.parse('whitepaper.pdf', mode='text', output_format='markdown')
+markdown = response['output']['markdown']
+
+# Form extraction — typed spatial elements with bounds and confidence.
+response = await client.parse('invoice.pdf', mode='understand')
+for element in response['output']['elements']:
+    if element['type'] == 'keyValueRegion':
+        for pair in element['pairs']:
+            print(pair['key']['value'], '→', pair['value']['value'])
+
+# Inspect billing — cost is in extraction credits, not processor credits.
+usage = response['usage']['data_extraction_credits']
+print(f"Cost: {usage['cost']} extraction credits, remaining: {usage['remainingCredits']}")
+```
+
 ##### flatten(file, annotation_ids?)
 Flattens annotations in a PDF document.
 

diff --git a/src/nutrient_dws/__init__.py b/src/nutrient_dws/__init__.py
@@ -19,16 +19,52 @@
     process_file_input,
     validate_file_input,
 )
+from nutrient_dws.types.extraction_credits import ExtractionCredits
+from nutrient_dws.types.parse import (
+    FormulaElement,
+    HandwritingElement,
+    KeyValuePair,
+    KeyValueRegionElement,
+    ParagraphElement,
+    ParseElement,
+    ParseInstructions,
+    ParseMode,
+    ParseOutputBody,
+    ParseOutputElements,
+    ParseOutputFormat,
+    ParseOutputMarkdown,
+    ParseResponse,
+    PictureElement,
+    TableCell,
+    TableElement,
+)
 from nutrient_dws.utils import get_library_version, get_user_agent
 
 __all__ = [
     "APIError",
     "AuthenticationError",
+    "ExtractionCredits",
     "FileInput",
+    "FormulaElement",
+    "HandwritingElement",
+    "KeyValuePair",
+    "KeyValueRegionElement",
     "LocalFileInput",
     "NetworkError",
     "NutrientClient",
     "NutrientError",
+    "ParagraphElement",
+    "ParseElement",
+    "ParseInstructions",
+    "ParseMode",
+    "ParseOutputBody",
+    "ParseOutputElements",
+    "ParseOutputFormat",
+    "ParseOutputMarkdown",
+    "ParseResponse",
+    "PictureElement",
+    "TableCell",
+    "TableElement",
     "UrlFileInput",
     "ValidationError",
     "get_library_version",
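
Note on the element types exported above: the README recipe in this PR reads `keyValueRegion` and `table` elements as plain dicts. The sketch below wraps that access pattern in two small functions that run against hand-written sample data. `collect_fields` and `table_to_rows` are hypothetical helpers, the sample values are invented for illustration and are not API output, and the code assumes `row` and `column` are zero-based and ignores cell spans, none of which is confirmed by this diff.

```python
def collect_fields(elements: list[dict]) -> dict[str, str]:
    """Flatten every keyValueRegion element into one key -> value dict."""
    fields = {}
    for element in elements:
        if element.get("type") != "keyValueRegion":
            continue
        for pair in element.get("pairs", []):
            fields[pair["key"]["value"]] = pair["value"]["value"]
    return fields


def table_to_rows(table: dict) -> list[list[str]]:
    """Place each cell's text into a rowCount x columnCount grid.

    Assumption: row/column indices are zero-based. Spans are ignored,
    so a merged cell only fills its top-left position.
    """
    grid = [[""] * table["columnCount"] for _ in range(table["rowCount"])]
    for cell in table["cells"]:
        grid[cell["row"]][cell["column"]] = cell["text"]
    return grid


# Hand-written sample shaped like the README recipe's field access.
sample_elements = [
    {
        "type": "keyValueRegion",
        "pairs": [{"key": {"value": "Total"}, "value": {"value": "42.00"}}],
    },
    {
        "type": "table",
        "rowCount": 2,
        "columnCount": 2,
        "cells": [
            {"row": 0, "column": 0, "text": "Item"},
            {"row": 0, "column": 1, "text": "Price"},
            {"row": 1, "column": 0, "text": "Widget"},
            {"row": 1, "column": 1, "text": "42.00"},
        ],
    },
]

fields = collect_fields(sample_elements)
rows = table_to_rows(sample_elements[1])
```

Keeping the helpers on plain dicts means they work with the `cast`-based return value of `parse()` without importing the typed element classes.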

diff --git a/src/nutrient_dws/client.py b/src/nutrient_dws/client.py
@@ -18,6 +18,7 @@
 from nutrient_dws.errors import NutrientError, ValidationError
 from nutrient_dws.http import (
     NutrientClientOptions,
+    ParseRequestData,
     RedactRequestData,
     RequestConfig,
     SignRequestData,
@@ -54,6 +55,12 @@
     CreateAuthTokenResponse,
 )
 from nutrient_dws.types.misc import OcrLanguage, PageRange, Pages
+from nutrient_dws.types.parse import (
+    ParseInstructions,
+    ParseMode,
+    ParseOutputFormat,
+    ParseResponse,
+)
 from nutrient_dws.types.redact_data import RedactOptions
 from nutrient_dws.types.sign_request import CreateDigitalSignature
 
@@ -753,6 +760,118 @@ async def extract_key_value_pairs(
 
         return cast("JsonContentOutput", self._process_typed_workflow_result(result))
 
+    async def parse(
+        self,
+        file: LocalFileInput,
+        mode: ParseMode = "structure",
+        output_format: ParseOutputFormat = "spatial",
+    ) -> ParseResponse:
+        """Parse a document using the Data Extraction API (`/extraction/parse`).
+
+        Designed for content-extraction workflows where document content feeds
+        a downstream pipeline rather than being rendered or transformed:
+
+        - **RAG / search indexing / content migration** — use
+          `output_format="markdown"` for a whole-document Markdown string
+          suitable for chunking, embedding, and indexing.
+        - **Form / invoice extraction** — use `output_format="spatial"`
+          (default) for a typed element list (paragraphs, tables,
+          keyValueRegions, etc.) with bounds and confidence per element.
+        - **Layout-aware document understanding** — combine `mode="understand"`
+          or `mode="agentic"` with spatial output for layout reconstruction
+          and semantic classification.
+
+        See the README's Data Extraction section for worked recipes (RAG
+        ingestion, form extraction) and per-mode positioning.
+
+        The Data Extraction API is billed against **extraction credits**, which
+        are a separate billing bucket from the **processor API credits**
+        consumed by `/build`, `/sign`, OCR, and other Processor API endpoints.
+
+        Per-page extraction-credit costs by mode:
+
+        - `text`: 1 extraction credit / page — fast Markdown extraction from
+          born-digital documents (no OCR or AI).
+        - `structure`: 1.5 extraction credits / page — OCR-based spatial
+          extraction with bounding boxes (default).
+        - `understand`: 9 extraction credits / page — AI-augmented layout
+          analysis, table detection, and semantic classification.
+        - `agentic`: 18 extraction credits / page — VLM-augmented extraction
+          building on `understand` mode.
+
+        Output format selects the shape under `response['output']`:
+
+        - `spatial` (default): `output.elements` — typed elements (paragraph,
+          table, formula, picture, keyValueRegion, handwriting) with bounds,
+          confidence, and reading order.
+        - `markdown`: `output.markdown` — a whole-document Markdown string,
+          well suited for RAG / search indexing pipelines.
+
+        **Security note**: this method only accepts local files (paths, bytes,
+        file objects) because the underlying API surface for this endpoint is
+        multipart-only. For remote inputs, fetch them client-side with
+        appropriate URL validation first.
+
+        Args:
+            file: The document to parse (local files only — paths, bytes, or
+                file-like objects). The endpoint accepts a range of document
+                formats (PDF, Office documents, images); see the public
+                guide for the authoritative list. Unlike `sign()`, parsing
+                is not restricted to PDFs.
+            mode: Processing mode. See per-mode credit costs above. Defaults
+                to `"structure"`.
+            output_format: Output shape — `"spatial"` for typed elements or
+                `"markdown"` for a Markdown document. Defaults to
+                `"spatial"`.
+
+        Returns:
+            The full parse response envelope, including `output`, `metrics`,
+            `usage` (the extraction-credit accounting), and `configuration`.
+
+        Example:
+            ```python
+            # Spatial elements with full layout analysis (9 extraction credits / page)
+            response = await client.parse('contract.pdf', mode='understand')
+            for element in response['output']['elements']:
+                if element['type'] == 'table':
+                    print(element['rowCount'], element['columnCount'])
+
+            # Whole-document Markdown from a born-digital PDF (1 extraction credit / page)
+            response = await client.parse(
+                'report.pdf', mode='text', output_format='markdown'
+            )
+            print(response['output']['markdown'])
+
+            # Inspect billing
+            usage = response['usage']['data_extraction_credits']
+            print(f"Cost: {usage['cost']} extraction credits "
+                  f"(remaining: {usage['remainingCredits']})")
+            ```
+        """
+        # Multipart-only endpoint; only local file inputs are supported.
+        normalized_file = await process_file_input(file)
+
+        instructions: ParseInstructions = {
+            "mode": mode,
+            "output": {"format": output_format},
+        }
+
+        request_data: ParseRequestData = {
+            "file": normalized_file,
+            "instructions": instructions,
+        }
+
+        response: Any = await send_request(
+            {
+                "method": "POST",
+                "endpoint": "/extraction/parse",
+                "data": request_data,
+                "headers": None,
+            },
+            self.options,
+        )
+        return cast("ParseResponse", response["data"])
+
     async def set_page_labels(
         self,
         pdf: FileInput,