feat(file-based): Migrate unstructured document parser to MarkItDown #1062
Aaron ("AJ") Steers (aaronsteers) wants to merge 3 commits into
Replace the unstructured library with Microsoft's markitdown for document-to-markdown conversion. This simplifies the dependency tree and provides local-only document parsing.

Key changes:
- Replace unstructured with markitdown[pdf,docx,pptx] dependency
- Pin onnxruntime <1.24 (1.24+ dropped Python 3.10 wheels)
- Rewrite UnstructuredParser to use MarkItDown for PDF, DOCX, PPTX, XLSX, HTML, XLS conversion
- Plain text and markdown files returned as-is
- Auto-detect file types via MarkItDown when no extension present
- Mark API processing mode as deprecated (local-only now)
- Update unit tests and scenario tests for new output format
- Remove nltk, pdf2image, pdfminer.six, pytesseract dependencies

Co-Authored-By: AJ Steers <aj@airbyte.io>
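The routing described in the key changes above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the function name `route_document`, the two extension sets, and the injected `convert` callable (standing in for the MarkItDown-backed converter) are all assumptions made for the example.

```python
from pathlib import PurePosixPath
from typing import Callable

# Hypothetical extension sets for illustration; the PR's actual lists may differ.
PASSTHROUGH_EXTENSIONS = {".txt", ".md"}
MARKITDOWN_EXTENSIONS = {".pdf", ".docx", ".pptx", ".xlsx", ".xls", ".html"}


def route_document(
    filename: str,
    data: bytes,
    convert: Callable[[bytes, str], str],
) -> str:
    """Return markdown text for a file, choosing a strategy by extension.

    `convert` is a stand-in for the MarkItDown-backed converter. It receives
    the raw bytes plus the lowercased extension ("" when there is none).
    """
    suffix = PurePosixPath(filename).suffix.lower()
    if suffix in PASSTHROUGH_EXTENSIONS:
        # Plain text and markdown are returned as-is, only decoded.
        return data.decode("utf-8")
    if suffix in MARKITDOWN_EXTENSIONS or suffix == "":
        # Known document types, or no extension at all: defer to the
        # converter, which is expected to detect the type itself.
        return convert(data, suffix)
    raise ValueError(f"Unsupported file type: {suffix!r}")
```

Passing the converter in as a parameter keeps the routing decision testable without installing any document-parsing dependency.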
👋 Greetings, Airbyte Team Member!
Here are some helpful tips and reminders for your convenience.

Testing This CDK Version
You can test this version of the CDK using the following:

# Run the CLI from this branch:
uvx 'git+https://github.com/airbytehq/airbyte-python-cdk.git@devin/1782939132-markitdown-migration#egg=airbyte-python-cdk[dev]' --help
# Update a connector to use the CDK from this branch ref:
cd airbyte-integrations/connectors/source-example
poe use-cdk-branch devin/1782939132-markitdown-migration

PR Slash Commands
Airbyte Maintainers can execute the following slash commands on your PR:
PyTest Results (Fast)
38 tests (-4 085), 37 ✅ (-4 074), 20635d 21h 27m 43s ⏱️ (-20635d 21h 35m 14s)
For more details on these failures, see this check.
Results for commit 484572b. ± Comparison against base commit cdd7014.
This pull request removes 4085 tests.
♻️ This comment has been updated with latest results.
Co-Authored-By: AJ Steers <aj@airbyte.io>
  class UnstructuredFormat(BaseModel):
      class Config(OneOfOptionConfig):
-         title = "Unstructured Document Format"
+         title = "Document Format"
The format is still "unstructured" (lowercase, to be specific); we are just no longer using the "Unstructured" brand.
Revert the name so it still says "Unstructured". Propose alternatives, but don't change how we describe it functionally until we've discussed.
☑️ Resolved in 484572b. Reverted unstructured_format.py to the original descriptions — title is back to "Unstructured Document Format" and all field descriptions are unchanged from main.
Happy to discuss naming alternatives when you're ready. Some options to consider:
- Keep "Unstructured Document Format" as-is (current)
- "Unstructured Document Format (powered by MarkItDown)" — adds clarity on the engine
- "Document Format (unstructured)" — lowercase as you noted
No further description changes until we've discussed.
- Revert unstructured_format.py to original descriptions per reviewer feedback
- Pin requests_cache >=1.2.1,<1.3 to fix pytest version check failure

Co-Authored-By: AJ Steers <aj@airbyte.io>
E2E Test Results
Tested the MarkItDown migration end-to-end by creating real documents (PDF, DOCX, PPTX) and running them through
E2E Tests (17/17 passed)
Unit Tests (14/14 passed) + Scenario Tests (10/10 passed)
All CI
Notes
PyTest Results (Full)
4 113 tests (-13), 3 997 ✅ (-117), 20635d 21h 27m 42s ⏱️ (-20635d 21h 40m 13s)
For more details on these failures, see this check.
Results for commit 484572b. ± Comparison against base commit cdd7014.
This pull request removes 20 and adds 7 tests. Note that renamed tests count towards both.
Summary
Replaces the unstructured library with Microsoft's markitdown for document-to-markdown conversion in the file-based CDK parser. Simplifies the dependency tree (removes unstructured, nltk, pdf2image, pdfminer.six, pytesseract) and moves to fully local processing.

Dependency changes (pyproject.toml):

UnstructuredParser rewrite (core routing logic):

MarkItDown requires file paths (not streams), so _convert_with_markitdown writes to a NamedTemporaryFile and calls MarkItDown().convert(path).

Config model (UnstructuredFormat): Unchanged from main. API mode rejected at runtime in check_config(). Backward-compatible fields retained.

Output format changes:
- "Hello World" → "Hello World\n\n" (was "# Hello World")
- "Content" (was "# Content")
- "<!-- Slide number: 1 -->\n# Title" (was "# Title")

Link to Devin session: https://app.devin.ai/sessions/8b80180ef6244f39b5db16f548829a69
Requested by: Aaron ("AJ") Steers (@aaronsteers)
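
The temporary-file approach mentioned in the summary (write the bytes to a NamedTemporaryFile, then convert by path) might look something like the sketch below. The helper name `convert_via_temp_file` and the injected `convert_path` callable are hypothetical and chosen for illustration; the PR's `_convert_with_markitdown` may be structured differently.

```python
import os
import tempfile
from typing import Callable


def convert_via_temp_file(
    data: bytes,
    suffix: str,
    convert_path: Callable[[str], str],
) -> str:
    """Write bytes to a temporary file, convert it by path, then clean up."""
    # delete=False lets the file be closed and then reopened by path, which
    # matters on Windows where an open NamedTemporaryFile cannot be reopened.
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
        tmp.write(data)
        path = tmp.name
    try:
        # The suffix is preserved so a path-based converter can use the
        # extension as a hint about the file type.
        return convert_path(path)
    finally:
        # Always remove the temporary file, even if conversion fails.
        os.unlink(path)
```

With markitdown installed, `convert_path` could plausibly be `lambda p: MarkItDown().convert(p).text_content`, assuming the result object exposes `text_content` as in the library's README examples.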