feat(file-based): Migrate unstructured document parser to MarkItDown#1062

Draft
Aaron ("AJ") Steers (aaronsteers) wants to merge 3 commits into main from devin/1782939132-markitdown-migration

@aaronsteers Aaron ("AJ") Steers (aaronsteers) commented Jul 1, 2026

Summary

Replaces the unstructured library with Microsoft's markitdown for document-to-markdown conversion in the file-based CDK parser. Simplifies the dependency tree (removes unstructured, nltk, pdf2image, pdfminer.six, pytesseract) and moves to fully local processing.

Dependency changes (pyproject.toml):

- unstructured[docx,pptx] == 0.10.27
- nltk, pdf2image, pdfminer.six, pytesseract
+ markitdown[pdf,docx,pptx] >= 0.1.4, < 0.2.0
+ onnxruntime >= 1.17.0, < 1.24  # 1.24+ dropped Python 3.10 wheels

UnstructuredParser rewrite — core routing logic:

def _read_file(self, file_handle, remote_file, ...):
    extension = _resolve_extension(remote_file.uri, remote_file.mime_type)
    if extension in _PLAINTEXT_EXTENSIONS:   # .md, .txt → return as-is
        return file_handle.read().decode()
    # .pdf, .docx, .pptx, .xlsx, .html, .xls → MarkItDown
    return self._convert_with_markitdown(file_handle, remote_file, extension)
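
The routing above depends on `_resolve_extension`, which the summary does not show. A minimal sketch of how such a helper might behave, assuming it prefers the URI suffix and falls back to a MIME-type lookup. The names `resolve_extension`, `_MIME_TO_EXTENSION`, and `route` are hypothetical and not taken from the PR:

```python
import posixpath
from typing import Optional
from urllib.parse import urlparse

# Hypothetical lookup table; the real parser may use a different mapping.
_MIME_TO_EXTENSION = {
    "application/pdf": ".pdf",
    "text/plain": ".txt",
    "text/markdown": ".md",
    "text/html": ".html",
}

# Extensions returned as-is, per the PR summary.
_PLAINTEXT_EXTENSIONS = {".md", ".txt"}


def resolve_extension(uri: str, mime_type: Optional[str] = None) -> Optional[str]:
    """Return a lowercased extension from the URI path, else from the MIME type."""
    path = urlparse(uri).path
    _, suffix = posixpath.splitext(path)
    if suffix:
        return suffix.lower()
    if mime_type:
        return _MIME_TO_EXTENSION.get(mime_type.lower())
    return None


def route(uri: str, mime_type: Optional[str] = None) -> str:
    """Name the branch a file would take: 'passthrough' or 'markitdown'."""
    extension = resolve_extension(uri, mime_type)
    return "passthrough" if extension in _PLAINTEXT_EXTENSIONS else "markitdown"
```

In this sketch a file with no resolvable extension falls through to the MarkItDown branch, which matches the commit note about auto-detecting file types when no extension is present.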

MarkItDown requires file paths (not streams), so _convert_with_markitdown writes to a NamedTemporaryFile and calls MarkItDown().convert(path).
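
The temp-file hand-off described above can be sketched as follows. This is an illustration under stated assumptions, not the PR's actual `_convert_with_markitdown`: the function name `convert_stream_via_temp_file` and the injected `convert_path` callable are hypothetical, chosen so the sketch runs without `markitdown` installed.

```python
import os
import tempfile
from typing import BinaryIO, Callable


def convert_stream_via_temp_file(
    stream: BinaryIO,
    extension: str,
    convert_path: Callable[[str], str],
) -> str:
    """Copy a binary stream to a temp file, convert it by path, then clean up."""
    # delete=False so the file can be reopened by path on every platform;
    # the suffix preserves the extension for converters that inspect it.
    with tempfile.NamedTemporaryFile(suffix=extension, delete=False) as tmp:
        tmp.write(stream.read())
        tmp_path = tmp.name
    try:
        return convert_path(tmp_path)
    finally:
        # Always remove the temp file, even if conversion raised.
        os.unlink(tmp_path)
```

With `markitdown` available, the converter could plausibly be `lambda path: MarkItDown().convert(path).text_content`, which follows the usage shown in the MarkItDown README. Whether the PR caches the `MarkItDown` instance or builds one per file is not stated in the summary.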

Config model (UnstructuredFormat): Unchanged from main. API mode rejected at runtime in check_config(). Backward-compatible fields retained.
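
A rough sketch of what rejecting API mode at check time might look like. `ProcessingConfig` is a hypothetical stand-in for the processing options on `UnstructuredFormat`, and the error text is an assumption apart from the phrase "no longer supported", which the E2E results on this PR mention:

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class ProcessingConfig:
    """Hypothetical stand-in for the processing options on the format model."""

    mode: str = "local"


def check_config(processing: ProcessingConfig) -> Tuple[bool, Optional[str]]:
    """Return (is_ok, error); API mode is refused, local mode is accepted."""
    if processing.mode == "api":
        # The field stays in the model for backward compatibility,
        # but the mode itself is refused when the config is checked.
        return False, "API processing mode is no longer supported; use local processing."
    return True, None
```

Keeping the field while refusing the value means existing configs still validate against the schema and fail with a readable message instead of a schema error.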

Output format changes:

  • PDF "Hello World" → "Hello World\n\n" (was "# Hello World")
  • DOCX → "Content" (was "# Content")
  • PPTX → "<!-- Slide number: 1 -->\n# Title" (was "# Title")
  • Corrupted PDFs: MarkItDown gracefully returns raw text instead of raising
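
Consumers that compare against the previous output may want to normalize the new PPTX annotations. A small sketch, assuming the annotation always has the `<!-- Slide number: N -->` shape shown above; the helper name `strip_slide_annotations` is hypothetical and not part of the PR:

```python
import re

# Matches the slide marker shown in the PPTX example, plus one trailing newline.
_SLIDE_COMMENT = re.compile(r"<!--\s*Slide number:\s*\d+\s*-->\n?")


def strip_slide_annotations(markdown: str) -> str:
    """Remove slide-number HTML comments from converted PPTX markdown."""
    return _SLIDE_COMMENT.sub("", markdown)
```

This only handles the slide markers; the dropped `#` heading prefixes for PDF and DOCX cannot be recovered after conversion.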

Link to Devin session: https://app.devin.ai/sessions/8b80180ef6244f39b5db16f548829a69
Requested by: Aaron ("AJ") Steers (@aaronsteers)

Replace the unstructured library with Microsoft's markitdown for
document-to-markdown conversion. This simplifies the dependency tree
and provides local-only document parsing.

Key changes:
- Replace unstructured with markitdown[pdf,docx,pptx] dependency
- Pin onnxruntime <1.24 (1.24+ dropped Python 3.10 wheels)
- Rewrite UnstructuredParser to use MarkItDown for PDF, DOCX, PPTX,
  XLSX, HTML, XLS conversion
- Plain text and markdown files returned as-is
- Auto-detect file types via MarkItDown when no extension present
- Mark API processing mode as deprecated (local-only now)
- Update unit tests and scenario tests for new output format
- Remove nltk, pdf2image, pdfminer.six, pytesseract dependencies

Co-Authored-By: AJ Steers <aj@airbyte.io>
@devin-ai-integration

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:

  • Disable automatic comment, CI, and merge conflict monitoring

@github-actions

github-actions Bot commented Jul 1, 2026

👋 Greetings, Airbyte Team Member!

Here are some helpful tips and reminders for your convenience.

💡 Tips and Tricks

Testing This CDK Version

You can test this version of the CDK using the following:

# Run the CLI from this branch:
uvx 'git+https://github.com/airbytehq/airbyte-python-cdk.git@devin/1782939132-markitdown-migration#egg=airbyte-python-cdk[dev]' --help

# Update a connector to use the CDK from this branch ref:
cd airbyte-integrations/connectors/source-example
poe use-cdk-branch devin/1782939132-markitdown-migration

PR Slash Commands

Airbyte Maintainers can execute the following slash commands on your PR:

  • /autofix - Fixes most formatting and linting issues
  • /poetry-lock - Updates poetry.lock file
  • /test - Runs connector tests with the updated CDK
  • /prerelease - Triggers a prerelease publish with default arguments
  • /poe build - Regenerate git-committed build artifacts, such as the pydantic models which are generated from the manifest JSON schema in YAML.
  • /poe <command> - Runs any poe command in the CDK environment

@github-actions

github-actions Bot commented Jul 1, 2026

PyTest Results (Fast)

38 tests (-4 085): 37 ✅ (-4 074), 0 💤 (-12), 1 ❌ (+1)
1 suites (±0), 1 files (±0)
⏱️ 20635d 21h 27m 43s (-20635d 21h 35m 14s)

For more details on these failures, see this check.

Results for commit 484572b. ± Comparison against base commit cdd7014.

This pull request removes 4085 tests.
unit_tests.cli.airbyte_cdk.test_secret_masks ‑ test_print_ci_secrets_masks_for_config[config0-expected_calls0]
unit_tests.cli.airbyte_cdk.test_secret_masks ‑ test_print_ci_secrets_masks_for_config[config1-expected_calls1]
unit_tests.cli.airbyte_cdk.test_secret_masks ‑ test_print_ci_secrets_masks_for_config[config2-expected_calls2]
unit_tests.cli.airbyte_cdk.test_secret_masks ‑ test_print_ci_secrets_masks_for_config[config3-expected_calls3]
unit_tests.cli.airbyte_cdk.test_secret_masks ‑ test_print_ci_secrets_masks_for_config[config4-expected_calls4]
unit_tests.cli.airbyte_cdk.test_secret_masks ‑ test_print_ci_secrets_masks_for_config[config5-expected_calls5]
unit_tests.cli.airbyte_cdk.test_secret_masks ‑ test_print_ci_secrets_masks_for_config[config6-expected_calls6]
unit_tests.cli.airbyte_cdk.test_secret_masks ‑ test_print_ci_secrets_masks_for_config[config7-expected_calls7]
unit_tests.cli.airbyte_cdk.test_secret_masks ‑ test_print_ci_secrets_masks_for_config[config8-expected_calls8]
unit_tests.cli.airbyte_cdk.test_secrets.TestFetch ‑ test_fetch_with_all_failed_secrets
…

♻️ This comment has been updated with latest results.

Co-Authored-By: AJ Steers <aj@airbyte.io>
  class UnstructuredFormat(BaseModel):
      class Config(OneOfOptionConfig):
-         title = "Unstructured Document Format"
+         title = "Document Format"

Member Author

It is still "unstructured" (lower case, to be specific); we are just no longer using the "Unstructured" brand.

Revert the name so it still says "Unstructured". Feel free to propose alternatives, but don't change how we describe it functionally until we've discussed.

Contributor

☑️ Resolved in 484572b. Reverted unstructured_format.py to the original descriptions — title is back to "Unstructured Document Format" and all field descriptions are unchanged from main.

Happy to discuss naming alternatives when you're ready. Some options to consider:

  • Keep "Unstructured Document Format" as-is (current)
  • "Unstructured Document Format (powered by MarkItDown)" — adds clarity on the engine
  • "Document Format (unstructured)" — lowercase as you noted

No further description changes until we've discussed.


- Revert unstructured_format.py to original descriptions per reviewer feedback
- Pin requests_cache >=1.2.1,<1.3 to fix pytest version check failure

Co-Authored-By: AJ Steers <aj@airbyte.io>
@devin-ai-integration

E2E Test Results

Tested the MarkItDown migration end-to-end by creating real documents (PDF, DOCX, PPTX) and running them through UnstructuredParser.parse_records. All shell-based — no UI.

E2E Tests (17/17 passed)

  • PDF → contains 'Hello World'
  • PDF → document_key matches URI
  • PDF → no parse error
  • DOCX → contains 'Test Document Content'
  • DOCX → no parse error
  • PPTX → has <!-- Slide number: 1 --> annotation
  • PPTX → contains 'Presentation Title'
  • PPTX → no parse error
  • TXT → exact passthrough match
  • MD → exact passthrough match
  • CSV (skip=False) → raises RecordParseError
  • CSV (skip=True) → error record with null content
  • CSV (skip=True) → _ab_source_file_parse_error set
  • API mode → rejected (is_ok=False)
  • API mode → error mentions "no longer supported"
  • Corrupted PDF → no crash (graceful)
  • Corrupted PDF → yields error record

Unit Tests (14/14 passed) + Scenario Tests (10/10 passed)

All test_unstructured_parser.py tests and all unstructured scenario tests in test_file_based_scenarios.py pass locally.
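
The CSV cases above (raise versus error record) can be illustrated with a small stand-alone sketch. It is an assumption-laden model of the behavior, not the parser's code: `build_record`, the `_SUPPORTED` set, and the error wording are hypothetical, and `RecordParseError` is redefined locally so the example runs without the CDK. The field names `content`, `document_key`, and `_ab_source_file_parse_error` come from the test list.

```python
from typing import Any, Dict, Optional


class RecordParseError(Exception):
    """Local stand-in for the CDK's RecordParseError."""


# Assumed set, based on the extensions named in the PR summary.
_SUPPORTED = {".md", ".txt", ".pdf", ".docx", ".pptx", ".xlsx", ".xls", ".html"}


def build_record(
    uri: str,
    extension: str,
    content: Optional[str],
    skip_unprocessable: bool,
) -> Dict[str, Any]:
    """Return a content record, or an error record when skipping is enabled."""
    if extension not in _SUPPORTED:
        message = f"File type {extension} is not supported"
        if not skip_unprocessable:
            raise RecordParseError(message)
        # Skipping: emit a record with null content and the error attached.
        return {
            "content": None,
            "document_key": uri,
            "_ab_source_file_parse_error": message,
        }
    return {
        "content": content,
        "document_key": uri,
        "_ab_source_file_parse_error": None,
    }
```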

CI Notes
  • MyPy failure is pre-existing (verified: same 10 errors on main in unrelated files)
  • test_run_check_with_exception failure is flaky (passes locally on both main and this branch)

@github-actions

github-actions Bot commented Jul 1, 2026

PyTest Results (Full)

4 113 tests (-13): 3 997 ✅ (-117), 12 💤 (±0), 104 ❌ (+104)
1 suites (±0), 1 files (±0)
⏱️ 20635d 21h 27m 42s (-20635d 21h 40m 13s)

For more details on these failures, see this check.

Results for commit 484572b. ± Comparison against base commit cdd7014.

This pull request removes 20 and adds 7 tests. Note that renamed tests count towards both.
unit_tests.sources.file_based.file_types.test_unstructured_parser ‑ test_check_config[api_error]
unit_tests.sources.file_based.file_types.test_unstructured_parser ‑ test_check_config[api_ok]
unit_tests.sources.file_based.file_types.test_unstructured_parser ‑ test_check_config[local]
unit_tests.sources.file_based.file_types.test_unstructured_parser ‑ test_check_config[local_ok_strategy]
unit_tests.sources.file_based.file_types.test_unstructured_parser ‑ test_check_config[local_unsupported_strategy]
unit_tests.sources.file_based.file_types.test_unstructured_parser ‑ test_check_config[unexpected_handling_error]
unit_tests.sources.file_based.file_types.test_unstructured_parser ‑ test_parse_records[docx_file]
unit_tests.sources.file_based.file_types.test_unstructured_parser ‑ test_parse_records[exception_during_parsing]
unit_tests.sources.file_based.file_types.test_unstructured_parser ‑ test_parse_records[multi_level_headings]
unit_tests.sources.file_based.file_types.test_unstructured_parser ‑ test_parse_records[pdf_file]
…
unit_tests.sources.file_based.file_types.test_unstructured_parser ‑ test_check_config[api_mode_rejected]
unit_tests.sources.file_based.file_types.test_unstructured_parser ‑ test_check_config[local_default]
unit_tests.sources.file_based.file_types.test_unstructured_parser ‑ test_check_config[local_strategy_ignored]
unit_tests.sources.file_based.file_types.test_unstructured_parser ‑ test_infer_schema[txt_file]
unit_tests.sources.file_based.file_types.test_unstructured_parser ‑ test_parse_records[txt_file]
unit_tests.sources.file_based.file_types.test_unstructured_parser ‑ test_parse_records[unsupported_file_type_raises]
unit_tests.sources.file_based.file_types.test_unstructured_parser ‑ test_parse_records[unsupported_file_type_skipped]
