feat(source/dest huggingface): Improve the HF Datasets source, add HF Buckets source, add destinations #81357

Draft
Quentin Lhoest (lhoestq) wants to merge 2 commits into airbytehq:master from lhoestq:hf

@lhoestq Quentin Lhoest (lhoestq) commented Jul 1, 2026

What

Continuation of #48734 by Michel Tricot (@michel-tricot), which was a first implementation of source-huggingface-datasets. The new implementation uses the datasets library, which is more efficient than using the dataset viewer's API.

In addition to this improvement, I added a new source, source-huggingface-buckets, that points to Hugging Face Buckets (they are simply S3-like buckets).

Finally, I added the corresponding destinations, destination-huggingface-datasets and destination-huggingface-buckets, to close the loop.

How

For datasets, I used the datasets library, which is based on Arrow/Parquet; for buckets, I used the huggingface_hub library.
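To illustrate the read path, here is a minimal sketch of how rows streamed with the datasets library could be wrapped into Airbyte RECORD messages. It is not the code from this PR: the helper names `stream_dataset_rows` and `to_airbyte_records` are hypothetical, and the sketch assumes the dataset can be loaded with `load_dataset(..., streaming=True)`.

```python
import time
from typing import Any, Dict, Iterable, Iterator


def stream_dataset_rows(repo_id: str, split: str = "train") -> Iterable[Dict[str, Any]]:
    """Lazily iterate over the rows of a Hub dataset (hypothetical helper)."""
    # Imported here so the rest of the sketch works without `datasets` installed.
    from datasets import load_dataset

    # streaming=True returns an iterable dataset that yields one dict per row
    # instead of downloading the whole dataset first.
    return load_dataset(repo_id, split=split, streaming=True)


def to_airbyte_records(
    rows: Iterable[Dict[str, Any]], stream_name: str
) -> Iterator[Dict[str, Any]]:
    """Wrap each row in an Airbyte RECORD message (hypothetical helper)."""
    for row in rows:
        yield {
            "type": "RECORD",
            "record": {
                "stream": stream_name,
                "data": dict(row),
                # Airbyte expects the emission time in epoch milliseconds.
                "emitted_at": int(time.time() * 1000),
            },
        }
```

Passing `stream_dataset_rows("<repo_id>")` to `to_airbyte_records` would yield one message per dataset row; any iterable of dicts works as input, which keeps the conversion step testable without network access.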

User Impact

This will let users read from and write to HF datasets and buckets.

Can this PR be safely reverted and rolled back?

  • YES 💚
  • NO ❌

Disclaimer

The spec and metadata files are AI-generated, and they are currently causing the CI to fail. Do you have a pointer to some docs I could use to review them? I wrote the main code myself.

CLAassistant commented Jul 1, 2026

CLA assistant check
All committers have signed the CLA.

octavia-bot Bot commented Jul 1, 2026

Contributor

Note

📝 PR Converted to Draft

Thank you for creating this PR. As a policy to protect our engineers' time, Airbyte requires all PRs to be created first in draft status. Your PR has been automatically converted to draft status in respect for this policy.

As soon as your PR is ready for formal review, you can proceed to convert the PR to "ready for review" status by clicking the "Ready for review" button at the bottom of the PR page.

To skip draft status in future PRs, please include [ready] in your PR title or add the skip-draft-status label when creating your PR.

github-actions Bot commented Jul 1, 2026

Contributor

👋 Welcome to Airbyte!

Thank you for your contribution from lhoestq/airbyte! We're excited to have you in the Airbyte community.

If you have any questions, feel free to ask in the PR comments or join our Slack community.

PR Slash Commands

As needed or by request, Airbyte Maintainers can execute the following slash commands on your PR:

  • /format-fix - Fixes most formatting issues.
  • /bump-version - Bumps connector versions.
  • /run-connector-tests - Runs connector tests.
  • /run-cat-tests - Runs CAT tests.
  • /run-regression-tests - Runs regression tests for the modified connector(s).
  • /build-connector-images - Builds and publishes a pre-release docker image for the modified connector(s).
  • /publish-connectors-prerelease - Publishes pre-release connector builds (tagged as {version}-preview.{git-sha}) for all modified connectors in the PR.
  • /ai-review - AI-powered PR review for connector safety and quality gates.
  • /ai-docs-review - AI-powered documentation review for PRs with connector changes.
  • /ai-create-docs-pr - Creates a documentation PR for connector changes.
  • /force-merge reason="<A_GOOD_REASON>" - Force merges the PR using admin privileges, bypassing CI checks. Requires a reason.

Tips for Working with CI

  1. Pre-Release Checks. Please pay attention to these, as they contain standard checks on the metadata.yaml file, docs requirements, etc. If you need help resolving a pre-release check, please ask a maintainer.
    • Note: If you are creating a new connector, please be sure to replace the default logo.svg file with a suitable icon.
  2. Connector CI Tests. Some failures here may be expected if your tests require credentials. Please review these results to ensure (1) unit tests are passing, if applicable, and (2) integration tests pass to the degree possible and expected.
  3. (Optional.) BYO Connector Credentials for tests in your fork. You can optionally set up your fork with BYO credentials for your connector. This can significantly speed up your review, ensuring your changes are fully tested before the maintainers begin their review.