feat(file-based): bump unstructured to 0.18.x with custom markdown preserved#1063
feat(file-based): bump unstructured to 0.18.x with custom markdown preserved#1063Aaron ("AJ") Steers (aaronsteers) wants to merge 15 commits into
Conversation
… export - Upgrade unstructured from 0.10.27 to >=0.18.27,<0.19 - Replace custom _render_markdown/_convert_to_markdown with native elements_to_md() - Migrate FileType API from dict lookups to enum methods (from_extension, from_mime_type) - Update detect_filetype call from filename= to file_path= - Handle partition module imports individually for graceful degradation - Remove dpath dependency for element traversal - Update test expectations for new markdown output format Co-Authored-By: AJ Steers <aj@airbyte.io>
🤖 Devin AI EngineerI'll be helping with this pull request! Here's what you should know: ✅ I will automatically:
Note: I can only respond to comments from users who have write access to this repository. ⚙️ Control Options:
|
👋 Greetings, Airbyte Team Member!Here are some helpful tips and reminders for your convenience. 💡 Show Tips and TricksTesting This CDK VersionYou can test this version of the CDK using the following: # Run the CLI from this branch:
uvx 'git+https://github.com/airbytehq/airbyte-python-cdk.git@devin/1782941877-bump-unstructured#egg=airbyte-python-cdk[dev]' --help
# Update a connector to use the CDK from this branch ref:
cd airbyte-integrations/connectors/source-example
poe use-cdk-branch devin/1782941877-bump-unstructuredPR Slash CommandsAirbyte Maintainers can execute the following slash commands on your PR:
|
Co-Authored-By: AJ Steers <aj@airbyte.io>
Plain text content like 'Just a humble text file' gets detected as FileType.TXT by libmagic (installed in CI), which is a supported type. Use actual CSV content so the file is consistently detected as CSV regardless of whether libmagic is available. Co-Authored-By: AJ Steers <aj@airbyte.io>
With libmagic installed (CI), plain text content gets detected as FileType.TXT (a supported type), bypassing the intended error path. Use PDF magic bytes so libmagic correctly identifies the file as PDF, triggering the 'PDF partition dependencies not installed' error. Co-Authored-By: AJ Steers <aj@airbyte.io>
DOCX and PPTX are declared extras so their imports should always succeed. Only PDF needs a guarded import since it requires heavy optional deps (torch, unstructured-inference) not shipped with the CDK. Co-Authored-By: AJ Steers <aj@airbyte.io>
📝 WalkthroughWalkthroughThe unstructured file parser now loads DOCX, PPTX, and PDF partition callables dynamically, uses runtime assertions for local parsing, and switches filetype detection and upload MIME handling to filetype utilities. Dependency constraints and related parser tests and scenarios were updated to match. ChangesOptional partition dependency handling and markdown delegation
Estimated code review effort: 3 (Moderate) | ~25 minutes Suggested labels: dependencies, tests, refactor Suggested reviewers: A reviewer familiar with the unstructured parser module and Poetry dependency changes wdyt? 🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (1)
pyproject.toml (1)
71-72: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winUnbounded ranges for deps tied to
unstructured's pinned line — intentional, or worth an upper bound too?
unstructuredis capped at<0.19(Line 80), butpdf2image,pdfminer.six, andnltk— all noted as used indirectly byunstructured— are left open-ended (>=...with no ceiling). Since pdfminer.six in particular has a track record of breaking downstream consumers on new releases, a future bump could silently pull in a version untested againstunstructured==0.18.x. Would it make sense to add a matching upper bound (or at least document why it's intentionally left open) for these three, wdyt?Example: adding upper bounds to align with the `unstructured` cap
-pdf2image = { version = ">=1.16.3", optional = true } -"pdfminer.six" = { version = ">=20221105", optional = true } # Used indirectly by unstructured library +pdf2image = { version = ">=1.16.3,<2.0", optional = true } +"pdfminer.six" = { version = ">=20221105,<20250000", optional = true } # Used indirectly by unstructured libraryAlso applies to: 79-79
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pyproject.toml` around lines 71 - 72, The dependency specs for pdf2image, pdfminer.six, and nltk are unbounded while unstructured is pinned below 0.19, so update the dependency declarations in pyproject.toml to either add matching upper bounds for these indirect unstructured-related packages or add an explicit note documenting why they are intentionally left open-ended; use the existing package entries for pdf2image, pdfminer.six, and nltk as the place to make the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@airbyte_cdk/sources/file_based/file_types/unstructured_parser.py`:
- Around line 423-440: The content detection path in unstructured_parser.file
type inference is dropping the URI context, which can cause remote files to be
misclassified before the extension fallback. Update the detection flow around
the file_type lookup and detect_filetype call in the file type resolution method
to pass remote_file.uri through as metadata_file_path (or equivalent URI hint)
so unsupported text-like formats can be identified correctly; keep the existing
extension fallback as the final step.
In `@unit_tests/sources/file_based/file_types/test_unstructured_parser.py`:
- Around line 243-248: The test is assigning mocks directly to the
unstructured_parser_module globals, which can leak into later tests and bypass
the missing-dependency behavior. Update the test to use monkeypatch.setattr on
unstructured_partition_pdf, unstructured_partition_docx, and
unstructured_partition_pptx so the original None sentinels are automatically
restored after the test. Keep the fix scoped to the test setup that currently
uses MagicMock and the unstructured_parser_module symbols.
---
Nitpick comments:
In `@pyproject.toml`:
- Around line 71-72: The dependency specs for pdf2image, pdfminer.six, and nltk
are unbounded while unstructured is pinned below 0.19, so update the dependency
declarations in pyproject.toml to either add matching upper bounds for these
indirect unstructured-related packages or add an explicit note documenting why
they are intentionally left open-ended; use the existing package entries for
pdf2image, pdfminer.six, and nltk as the place to make the change.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: a2a1a415-2efb-4519-82ae-c8f88ce55a88
⛔ Files ignored due to path filters (1)
poetry.lockis excluded by!**/*.lock
📒 Files selected for processing (4)
airbyte_cdk/sources/file_based/file_types/unstructured_parser.pypyproject.tomlunit_tests/sources/file_based/file_types/test_unstructured_parser.pyunit_tests/sources/file_based/scenarios/unstructured_scenarios.py
…h in tests - Pass remote_file.uri as metadata_file_path to detect_filetype for content-based detection, preserving the extension hint for remote files. - Use monkeypatch.setattr for partition globals in tests to ensure automatic cleanup and prevent mock leakage between tests. Co-Authored-By: AJ Steers <aj@airbyte.io>
unstructured library to 0.18.x (latest) and new native markdown export
|
Devin, fyi, take note I retitled the PR and don't override it please. Do let me know if my "latest" annotation on this version ref is not accurate. (Ideally it is the latest.) |
|
Noted — I won't override the title. The "latest" annotation is accurate: latest on PyPI is 0.18.32, and our constraint |
- Restore PDF files and expected records in simple_unstructured_scenario and no_file_extension_unstructured_scenario - Remove all try/except guards: imports fail loudly per maintainer direction - Move _import_unstructured() call to _read_file_locally() only (not needed for API path) - Add 'pdf' extra to unstructured dependency in pyproject.toml - Update corrupted file error message to match real pdfminer error - Update poetry.lock Co-Authored-By: AJ Steers <aj@airbyte.io>
- Restore 'pdf' extra in unstructured dependency - Add onnxruntime pin (>=1.23,<1.24): the 1.24.x series claims >=3.10 but ships no cp310 wheels; >=1.25 requires Python >=3.11 - Add onnxruntime to file-based extras group - Regenerate poetry.lock with onnxruntime 1.23.2 Co-Authored-By: AJ Steers <aj@airbyte.io>
onnxruntime is a transitive dependency pin (not imported directly), added to cap below 1.24 for Python 3.10 compatibility. Co-Authored-By: AJ Steers <aj@airbyte.io>
Upstream `elements_to_md()` does not yet support heading levels, list-item bullets, or formula code-blocks. Keep our own `_render_markdown`/`_convert_to_markdown` with a comment to revisit when the native mapper reaches parity. Co-Authored-By: AJ Steers <aj@airbyte.io>
Add Optional[Callable[..., Any]] type hints for partition function globals and assert not-None before use. Eliminates 3 mypy errors that were present on main (7 errors) — this branch now has 0. Co-Authored-By: AJ Steers <aj@airbyte.io>
Restore poetry.lock from main and re-lock with --no-update to pick up only the pyproject.toml changes (unstructured bump, onnxruntime pin) without pulling in unrelated transitive dependency updates like langchain-classic 1.0.0 -> 1.0.8 which broke test_run_check_with_exception. Co-Authored-By: AJ Steers <aj@airbyte.io>
unstructured 0.18.x imports pdfminer.psexceptions which was added in pdfminer.six 20231228+. The previous lockfile pinned 20221105 which predates that module. Co-Authored-By: AJ Steers <aj@airbyte.io>
There was a problem hiding this comment.
🧹 Nitpick comments (1)
airbyte_cdk/sources/file_based/file_types/unstructured_parser.py (1)
340-369: 🩺 Stability & Availability | 🔵 Trivial | ⚡ Quick winMissing
elseleaveselementsunbound iffiletypeisn't PDF/DOCX/PPTX.The
tryonly wraps theif/elifchain; if none of the branches match,elementsis never assigned, and line 369 raises an uncaughtNameErrorinstead of aRecordParseError. Today this can't happen since_supported_file_types()filters MD/TXT upstream, but it's a latent trap if that set is ever extended without updating this chain. wdyt about adding an explicitelsethat raisesself._create_parse_error(...)for future-proofing?🛡️ Proposed fix
try: if filetype == FileType.PDF: file_handle.seek(0) with BytesIO(file_handle.read()) as file: file_handle.seek(0) elements = unstructured_partition_pdf(file=file, strategy=strategy) elif filetype == FileType.DOCX: elements = unstructured_partition_docx(file=file) elif filetype == FileType.PPTX: elements = unstructured_partition_pptx(file=file) + else: + raise ValueError(f"Unsupported file type for local parsing: {filetype}") except Exception as e: raise self._create_parse_error(remote_file, str(e))🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@airbyte_cdk/sources/file_based/file_types/unstructured_parser.py` around lines 340 - 369, The _read_file_locally method in UnstructuredParser can leave elements unbound when filetype is not PDF, DOCX, or PPTX, causing a NameError instead of a parse error. Add an explicit else branch in the if/elif chain inside _read_file_locally to raise self._create_parse_error(remote_file, ...) for unsupported filetypes, so the method always fails with the intended RecordParseError path.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@airbyte_cdk/sources/file_based/file_types/unstructured_parser.py`:
- Around line 340-369: The _read_file_locally method in UnstructuredParser can
leave elements unbound when filetype is not PDF, DOCX, or PPTX, causing a
NameError instead of a parse error. Add an explicit else branch in the if/elif
chain inside _read_file_locally to raise self._create_parse_error(remote_file,
...) for unsupported filetypes, so the method always fails with the intended
RecordParseError path.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: d9c3e7a5-1ee6-4913-a635-94a1c0e96ed8
⛔ Files ignored due to path filters (1)
poetry.lockis excluded by!**/*.lock
📒 Files selected for processing (4)
airbyte_cdk/sources/file_based/file_types/unstructured_parser.pypyproject.tomlunit_tests/sources/file_based/file_types/test_unstructured_parser.pyunit_tests/sources/file_based/scenarios/unstructured_scenarios.py
Co-Authored-By: AJ Steers <aj@airbyte.io>
|
☑️ Resolved in 2b58f6b. Added explicit (Responding to CodeRabbit nitpick on |
Co-Authored-By: AJ Steers <aj@airbyte.io>
unstructured library to 0.18.x (latest) and new native markdown export
Summary
Bumps
unstructuredfrom0.10.27to>=0.18.27,<0.19(resolves 0.18.32). Custom markdown rendering (_render_markdown/_convert_to_markdown) is preserved — upstream'selements_to_md()doesn't handle heading depths, list-item bullets, or formula code-blocks.Migrates
FileTypeAPI from removed dict lookups to enum methods:Pins
onnxruntime>=1.23,<1.24— transitive dep ofunstructured[pdf]; 1.24.x ships no cp310 wheels, >=1.25 requires Python >=3.11.Partition functions typed as
Optional[Callable[..., Any]]withassertguards before call sites (fixes mypy). Defensiveelsein_read_file_locallyraisesValueErrorfor unsupported file types.Scenario test expectations updated: corrupted-PDF error message changed from
"No /Root object!"to"Unable to get page count"in 0.18.x, and PDF content classified asNarrativeText(notTitle) so no#prefix.Link to Devin session: https://app.devin.ai/sessions/b701e6538df6470a856666567993e205
Requested by: Aaron ("AJ") Steers (@aaronsteers)