Skip to content

feat(source-s3): migrate to FileBasedConcurrentCursor + bump CDK to prerelease with state throttle#78325

Draft
Anatolii Yatsuk (tolik0) wants to merge 8 commits into
masterfrom
tolik0/source-s3/bump-cdk-state-throttle
Draft

feat(source-s3): migrate to FileBasedConcurrentCursor + bump CDK to prerelease with state throttle#78325
Anatolii Yatsuk (tolik0) wants to merge 8 commits into
masterfrom
tolik0/source-s3/bump-cdk-state-throttle

Conversation

@tolik0

@tolik0 Anatolii Yatsuk (tolik0) commented May 21, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Migrate source-s3 onto FileBasedConcurrentCursor (concurrent file-based read path).
  • Bump airbyte-cdk to prerelease 7.19.2.post3.dev26244645194 (from fix(file-based): throttle state-message emission to prevent platform OOM on large file streams airbyte-python-cdk#1032) which adds:
    • state-emission throttle (max one state per 600 s during a sync, with a force-emitted final state at end-of-stream);
    • dispatch fix in FileBasedSource.streams() so concurrent cursors don't crash on partial-catalog / check / discover paths.
  • Delete the now-unused legacy source_s3.v4.cursor.Cursor and its tests. That subclass only existed for v3→v4 state migration (2023), which all active customers completed long ago. A hypothetical stale-v3 connection would trigger one full re-sync, after which it would be on v4 state.
  • Bump source-s3 patch version 4.15.4 → 4.15.5.

Why

source_s3.v4.cursor.Cursor extends DefaultFileBasedCursor, so source-s3 took the legacy non-concurrent file-based read path. That path emits a state message per processed file, with the full file-history dict embedded in each one. On a sync of N files the platform/orchestrator buffers N states (each growing in size with N) waiting for the destination to ACK them. Slow destinations cause the orchestrator pod to OOM; the source pod is torn down before emitting terminal stream status; the destination fails with:

TransientErrorException: Input was fully read, but some streams did not receive a terminal stream status message

Mirror of the precedent in oncall #7856 (Jira / declarative concurrent cursor); fixes oncall #12663.

Local verification on a ~9 k-file stream

Before (legacy cursor) After (concurrent cursor + throttle)
State messages 9 200 2
STREAM_STATUS COMPLETE 1 1
Runtime 52:46 4:24
stdout volume 4.1 GB 922 MB
Peak RSS 254 MB 874 MB

Test plan

  • unit_tests/v4 — 75 passed locally.
  • End-to-end read against a customer-shaped bucket (~9 k files) — state count 2, COMPLETE emitted, sync exits cleanly.
  • airbyte-ci connectors --name=source-s3 test --unit
  • airbyte-ci connectors --name=source-s3 test --acceptance
  • Validate on the affected customer connection (oncall Source Amazon Seller Partner: Add GET_XML_BROWSE_TREE_DATA report #12663).

Notes

…7349277

Picks up the file-based concurrent cursor state-emission throttle from
airbytehq/airbyte-python-cdk#1032, which prevents the replication
orchestrator from OOMing on file streams with thousands of files. The
prior cursor emitted a state message per processed file containing the
full file-history dict (O(N^2) state-message bytes over a sync); under
back-pressure the platform's state buffer grew until the replication
pod was killed, leaving the sync to fail with TransientErrorException
on the destination side.

Mirrors the fix pattern previously applied to ConcurrentPerPartitionCursor
for oncall #7856; this bump pulls it in for source-s3 (oncall #12663).
@octavia-bot octavia-bot Bot marked this pull request as draft May 21, 2026 16:29
@octavia-bot

octavia-bot Bot commented May 21, 2026

Copy link
Copy Markdown
Contributor

Note

📝 PR Converted to Draft

More info...

Thank you for creating this PR. As a policy to protect our engineers' time, Airbyte requires all PRs to be created first in draft status. Your PR has been automatically converted to draft status in respect for this policy.

As soon as your PR is ready for formal review, you can proceed to convert the PR to "ready for review" status by clicking the "Ready for review" button at the bottom of the PR page.

To skip draft status in future PRs, please include [ready] in your PR title or add the skip-draft-status label when creating your PR.

@github-actions

Copy link
Copy Markdown
Contributor

👋 Greetings, Airbyte Team Member!

Here are some helpful tips and reminders for your convenience.

💡 Show Tips and Tricks

PR Slash Commands

Airbyte Maintainers (that's you!) can execute the following slash commands on your PR:

  • 🛠️ Quick Fixes
    • /format-fix - Fixes most formatting issues.
    • /bump-version - Bumps connector versions, scraping changelog description from the PR title.
      • Bump types: patch (default), minor, major, major_rc, rc, promote.
      • The rc type is a smart default: applies minor_rc if stable, or bumps the RC number if already RC.
      • The promote type strips the RC suffix to finalize a release.
      • Example: /bump-version type=rc or /bump-version type=minor
    • /bump-progressive-rollout-version - Alias for /bump-version type=rc. Bumps with an RC suffix and enables progressive rollout.
  • ❇️ AI Testing and Review (internal link: AI-SDLC Docs):
    • /ai-prove-fix - Runs prerelease readiness checks, including testing against customer connections.
    • /ai-canary-prerelease - Rolls out prerelease to 5-10 connections for canary testing.
    • /ai-review - AI-powered PR review for connector safety and quality gates.
  • 📝 AI Documentation:
    • /ai-docs-review - AI-powered documentation review for PRs with connector changes.
    • /ai-create-docs-pr - Creates a documentation PR for connector changes, stacked on the current PR.
  • 🚀 Connector Releases:
    • /publish-connectors-prerelease - Publishes pre-release connector builds (tagged as {version}-preview.{git-sha}) for all modified connectors in the PR.
  • ☕️ JVM connectors:
    • /update-connector-cdk-version connector=<CONNECTOR_NAME> - Updates the specified connector to the latest CDK version.
      Example: /update-connector-cdk-version connector=destination-bigquery
  • 🐍 Python connectors:
    • /poe connector source-example lock - Run the Poe lock task on the source-example connector, committing the results back to the branch.
    • /poe source example lock - Alias for /poe connector source-example lock.
    • /poe source example use-cdk-branch my/branch - Pin the source-example CDK reference to the branch name specified.
    • /poe source example use-cdk-latest - Update the source-example CDK dependency to the latest available version.
  • ⚙️ Admin commands:
    • /force-merge reason="<REASON>" - Force merges the PR using admin privileges, bypassing CI checks. Requires a reason.
      Example: /force-merge reason="CI is flaky, tests pass locally"
📚 Show Repo Guidance

Helpful Resources

📝 Edit this welcome message.

@@ -356,6 +356,7 @@ This connector utilizes the open source [Unstructured](https://unstructured-io.g

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[markdownlint-fix] reported by reviewdog 🐶

Suggested change

@tolik0

Anatolii Yatsuk (tolik0) commented May 21, 2026

Copy link
Copy Markdown
Contributor Author

/publish-connectors-prerelease

Pre-release Connector Publish Started

Publishing pre-release build for connector source-s3.
PR: #78325

Pre-release versions will be tagged as {version}-preview.15416b7
and are available for version pinning via the scoped_configuration API.

View workflow run
Pre-release Publish FAILED for source-s3.

@github-actions

github-actions Bot commented May 21, 2026

Copy link
Copy Markdown
Contributor

Deploy preview for airbyte-docs ready!

Project:airbyte-docs
Status: ✅  Deploy successful!
Preview URL:https://airbyte-docs-huy9f7au1-airbyte-growth.vercel.app
Latest Commit:46778c9

Deployed with vercel-action

@github-actions

github-actions Bot commented May 21, 2026

Copy link
Copy Markdown
Contributor

source-s3 Connector Test Results

183 tests   160 ✅  6m 18s ⏱️
  3 suites   23 💤
  3 files      0 ❌

Results for commit 46778c9.

♻️ This comment has been updated with latest results.

@tolik0

Anatolii Yatsuk (tolik0) commented May 21, 2026

Copy link
Copy Markdown
Contributor Author

/publish-connectors-prerelease

Pre-release Connector Publish Started

Publishing pre-release build for connector source-s3.
PR: #78325

Pre-release versions will be tagged as {version}-preview.15416b7
and are available for version pinning via the scoped_configuration API.

View workflow run
Pre-release Publish FAILED for source-s3.

source-s3 previously used the legacy DefaultFileBasedCursor via the
source-s3-specific Cursor subclass. That cursor goes through the legacy
non-concurrent file-based read path, which emits a state message per
processed file. With ~10 k files in a stream, the platform/orchestrator
buffers ~10 k state messages (each carrying the growing history dict)
waiting for the destination to ACK them; on slow destinations the
orchestrator pod OOMs, the source pod is torn down before emitting
terminal stream status, and the destination fails with
TransientErrorException("Input was fully read, but some streams did not
receive a terminal stream status message").

Move source-s3 onto FileBasedConcurrentCursor (CDK) which:
  - emits state at most once per 600s during a sync (state throttle
    added in airbytehq/airbyte-python-cdk#1032);
  - always force-emits a final state via ensure_at_least_one_state_emitted;
  - runs the read on the concurrent file-based path with up to
    DEFAULT_CONCURRENCY (100) workers.

Verified locally against a customer connection with ~9 k files: state
messages dropped from 9 200 to 2, runtime from 52:46 to 4:24, stdout
volume from 4.1 GB to 922 MB, sync completes successfully with a
STREAM_STATUS COMPLETE.

The legacy source-s3 Cursor class only existed to host the v3-to-v4
state migration shipped in 2023 (#29028). 2.5 years on, all active
customers have completed the migration on their first v4 sync, so the
subclass and its tests are removed. A hypothetical customer with stale
v3 state would trigger one full re-sync, after which they'd be on v4.

Pinning to airbyte-cdk 7.19.2.post3.dev26244645194, a prerelease that
includes both the state throttle and a fallback-branch dispatch fix
for the file-based source so connectors using FileBasedConcurrentCursor
do not crash on partial-catalog reads / check / discover.
@tolik0 Anatolii Yatsuk (tolik0) changed the title chore(source-s3): bump airbyte-cdk to prerelease with file-based state-emission throttle feat(source-s3): migrate to FileBasedConcurrentCursor + bump CDK to prerelease with state throttle May 21, 2026
@tolik0

Anatolii Yatsuk (tolik0) commented May 21, 2026

Copy link
Copy Markdown
Contributor Author

/publish-connectors-prerelease

Pre-release Connector Publish Started

Publishing pre-release build for connector source-s3.
PR: #78325

Pre-release versions will be tagged as {version}-preview.8de6aea
and are available for version pinning via the scoped_configuration API.

View workflow run
Pre-release Publish: SUCCESS

Docker image (pre-release):
airbyte/source-s3:4.15.5-preview.8de6aea

Docker Hub: https://hub.docker.com/layers/airbyte/source-s3/4.15.5-preview.8de6aea

Registry JSON:

… concurrency

After the FileBasedConcurrentCursor migration, source-s3 runs with up to
DEFAULT_CONCURRENCY (100) worker threads sharing a single boto3 S3 client.
botocore's default urllib3 pool size is 10, so 90+ requests at a time hit
the "pool is full" branch — connections are continuously created and torn
down. Customer reports the sync now hangs while emitting floods of:

  WARN Connection pool is full, discarding connection: s3.amazonaws.com

Pass `max_pool_connections=DEFAULT_CONCURRENCY` to the botocore Config so
the pool matches the concurrent reader's worker count. The TODO at
stream_reader.py:270 anticipated exactly this.

Reported on oncall #12663 after rolling out 4.15.5-preview.8de6aea.
@tolik0

Anatolii Yatsuk (tolik0) commented May 22, 2026

Copy link
Copy Markdown
Contributor Author

/publish-connectors-prerelease

Pre-release Connector Publish Started

Publishing pre-release build for connector source-s3.
PR: #78325

Pre-release versions will be tagged as {version}-preview.ee33abd
and are available for version pinning via the scoped_configuration API.

View workflow run
Pre-release Publish: SUCCESS

Docker image (pre-release):
airbyte/source-s3:4.15.5-preview.ee33abd

Docker Hub: https://hub.docker.com/layers/airbyte/source-s3/4.15.5-preview.ee33abd

Registry JSON:

The file-based concurrent CDK defaults to 100 workers. Streams with many
small records push the source pod past its 2 Gi memory cap because the
unbounded message-repository deque buffers worker output faster than
stdout can drain.

Override `_concurrency_level` to 20. Still ~20x the legacy single-
threaded throughput; keeps peak RSS comfortably under 2 Gi locally even
on the largest stream we tested. The boto3 pool stays at the CDK
default (100), giving 5x headroom over the worker count and silencing
the residual "Connection pool is full" warnings.

A bounded message-repository deque in the CDK would let us return to
100-way concurrency later; tracked as a separate follow-up.
@tolik0

Anatolii Yatsuk (tolik0) commented May 22, 2026

Copy link
Copy Markdown
Contributor Author

/publish-connectors-prerelease

Pre-release Connector Publish Started

Publishing pre-release build for connector source-s3.
PR: #78325

Pre-release versions will be tagged as {version}-preview.69bab22
and are available for version pinning via the scoped_configuration API.

View workflow run
Pre-release Publish: SUCCESS

Docker image (pre-release):
airbyte/source-s3:4.15.5-preview.69bab22

Docker Hub: https://hub.docker.com/layers/airbyte/source-s3/4.15.5-preview.69bab22

Registry JSON:

Memray profiling on a controlled comparison run found that the
concurrency=20 cap did not reduce peak memory vs concurrency=100; in
fact c=100 peaked at 377 MB while c=20 peaked at 1.22 GB on the same
stream. Memory churn is dominated by the per-record serialization
path (orjson dumps + decode + PrintBuffer flush), which is independent
of worker count. Lowering concurrency only costs throughput.

Restore _concurrency_level = DEFAULT_CONCURRENCY. The boto3 connection
pool size stays at the CDK default concurrency so the pool has
headroom over the worker count.
@tolik0

Anatolii Yatsuk (tolik0) commented May 23, 2026

Copy link
Copy Markdown
Contributor Author

/publish-connectors-prerelease

Pre-release Connector Publish Started

Publishing pre-release build for connector source-s3.
PR: #78325

Pre-release versions will be tagged as {version}-preview.4d61235
and are available for version pinning via the scoped_configuration API.

View workflow run
Pre-release Publish: SUCCESS

Docker image (pre-release):
airbyte/source-s3:4.15.5-preview.4d61235

Docker Hub: https://hub.docker.com/layers/airbyte/source-s3/4.15.5-preview.4d61235

Registry JSON:

…spec

Two improvements to the concurrent file-based read path's memory footprint
under the customer pod's 2 Gi limit:

1. mallopt(M_ARENA_MAX, 2) at process start (source_s3/run.py)

   With the default `8 x N_CPUs` arenas, the 100-worker concurrent file read
   creates many thread-local glibc malloc arenas. Each can grow to ~64 MB
   and is never returned to the OS for the process lifetime — that pinned
   overhead alone accounts for ~1 GB on the affected dictionary streams.

   The mallopt call is the runtime equivalent of MALLOC_ARENA_MAX=2 but
   scoped to this connector only; no Dockerfile or container-env change.
   No-op on macOS / musl-libc images (catches OSError on libc.so.6 load).

   Measured on the affected dictionary stream (~113 k records):
     baseline c=100              : 2.78 GB peak RSS
     + mallopt(M_ARENA_MAX, 2)   : 1.79 GB peak RSS  (-990 MB)
     + bounded inter-worker queue: 0.85 GB peak RSS  (-1.93 GB, stacks)

2. concurrency_level config field (advanced group)

   Operational knob for tuning peak memory vs throughput on a per-connection
   basis. Leaves the CDK default (100) when unset. Useful as an escape hatch
   on connections that still hit memory limits despite the other fixes.

Both leave wall clock and pool-warning counts unchanged.
# MALLOC_ARENA_MAX environment variable, but applied at process start
# so it is scoped to this connector only.
libc.mallopt(-8, 2)
except (OSError, AttributeError):
… prerelease

Override `_concurrent_record_queue_maxsize = 1_000` (new CDK hook) so the
inter-worker queue is capped at 1k items for source-s3 specifically; the
CDK global default stays at 10_000. Each queue item is a parsed record;
on dictionary-style streams the records can be tens of KB each, so a
10_000-deep queue pinned hundreds of MB of liveset at peak.

Bumps airbyte-cdk to 7.19.2.post4.dev26418171295 which contains:
- state-emission throttle in FileBasedConcurrentCursor
- cursor-cls dispatch fix in FileBasedSource fallback branch
- `_concurrent_record_queue_maxsize` override hook this commit consumes

Measured combined effect on the affected dictionary stream:
  c=100 baseline                          : 2.78 GB peak RSS
  + mallopt(M_ARENA_MAX, 2)               : 1.79 GB
  + bounded record queue (this commit)    : 0.85 GB
Wall clock and STREAM_STATUS COMPLETE behaviour unchanged.
@tolik0

Anatolii Yatsuk (tolik0) commented May 25, 2026

Copy link
Copy Markdown
Contributor Author

/publish-connectors-prerelease

Pre-release Connector Publish Started

Publishing pre-release build for connector source-s3.
PR: #78325

Pre-release versions will be tagged as {version}-preview.46778c9
and are available for version pinning via the scoped_configuration API.

View workflow run
Pre-release Publish: SUCCESS

Docker image (pre-release):
airbyte/source-s3:4.15.5-preview.46778c9

Docker Hub: https://hub.docker.com/layers/airbyte/source-s3/4.15.5-preview.46778c9

Registry JSON:

@devin-ai-integration

Copy link
Copy Markdown
Contributor

↪️ Triggering /ai-prove-fix per Hands-Free AI Triage Project triage next step.

Reason: Draft source-s3 migration PR now has a successful prerelease publish and connector tests; remaining Format Check failure can be handled during validation, so prove-fix is the next Hydra stage.

Devin session

@devin-ai-integration

Copy link
Copy Markdown
Contributor

↪️ Intended to trigger /ai-prove-fix per Hands-Free AI Triage Project triage next step, but workflow dispatch is currently failing for AI command workflows.

Reason: Draft source-s3 migration PR now has a successful prerelease publish and connector tests; remaining Format Check failure can be handled during validation, so prove-fix is the next Hydra stage.

Operational note: GitHub workflow_dispatch returned HTTP 500/403 with the configured trigger token; repository_dispatch does not start this workflow because the workflow does not subscribe to repository_dispatch. This has been reported as an Ops MCP/workflow blocker.

Devin session

@devin-ai-integration

Copy link
Copy Markdown
Contributor

↪️ Triggered /ai-prove-fix per Hands-Free AI Triage Project PR pipeline sweep.

Reason: Earlier sweep intended prove-fix for source-s3 concurrent cursor migration but dispatch failed; retrying via workflow_dispatch now.


Devin session

@octavia-bot

octavia-bot Bot commented May 27, 2026

Copy link
Copy Markdown
Contributor

🔍 AI Prove Fix session starting... Running readiness checks and testing against customer connections. View playbook

Devin AI session created successfully!

@devin-ai-integration

devin-ai-integration Bot commented May 27, 2026

Copy link
Copy Markdown
Contributor

Outcome: Fix proven successfully

I validated this PR with the mandatory source regression workflow and live Cloud evidence from the affected large-file workload.

Evidence summary

  • Regression workflow completed successfully: https://github.com/airbytehq/airbyte-ops-mcp/actions/runs/26509715831
  • SPEC, CHECK, DISCOVER, and READ all passed against control 4.15.4 and target dev; no regression was reported.
  • READ comparison matched control exactly for protocol message counts and records: 10 records, 1 state message, 4 trace messages.
  • Existing live Cloud runs on PR pre-release images completed the affected single-stream workload at roughly 9.2k records / 921-922 MB.
  • Those live runs emitted only 2 source state messages and 2 destination state messages, which is the expected state-throttled behavior instead of per-file state-message pressure.
  • The pre-release runs completed successfully with all records committed and no failures; the earlier control-image evidence showed the reported missing terminal stream-status failure mode.
Next steps
  • No additional live pinning or new customer sync was needed: the connection was already pinned to PR pre-release builds and had representative successful runs.
  • I posted detailed private evidence, including workspace/connection/job identifiers and log excerpts, to the private oncall issue before updating this public report.
Connector details
Evidence plan

Proving criteria

  • Source connector reads the large file-based S3 workload successfully on PR pre-release images.
  • The run reaches terminal stream completion instead of failing with a missing terminal stream-status error.
  • State emission is throttled instead of emitting one source state per file.
  • Regression tests show no source-s3 behavior regressions versus the control image.

Disproving criteria

  • Regression tests show unexpected schema, record, state, or protocol-message differences.
  • A representative live run still fails with missing terminal stream status, OOM, stalled S3 reads, or connection-pool churn.
  • The pre-release introduces auth/config/spec incompatibilities or other unrelated failures.
Pre-flight checks
  • Viability: pass. The PR moves incremental S3 file reads to FileBasedConcurrentCursor, picks up the CDK state-emission throttle, constrains the connector record queue, caps glibc malloc arenas, sizes the S3 connection pool, and exposes concurrency_level as an operational knob.
  • Safety: pass. Diff is scoped to source-s3 connector code, dependency pin, metadata, and docs; no suspicious network exfiltration, credential handling changes, obfuscation, or unrelated code paths found.
  • Breaking-change check: pass / non-breaking. Patch version bump only (4.15.4 to 4.15.5), no removed streams, no schema/PK/cursor-field/spec required-field change, and no state format migration required.
  • Reversibility: pass. The change is a patch connector image update; pin/rollback to the previous connector image remains available if needed.
Detailed evidence log
  • Found existing successful pre-release publish for the PR head commit.
  • Confirmed current-head Docker image exists on Docker Hub.
  • Triggered mandatory source connector regression workflow in comparison mode.
  • Downloaded and inspected regression artifacts:
    • SPEC: both versions succeeded; no regression.
    • CHECK: both versions succeeded; no regression.
    • DISCOVER: both versions succeeded; no regression.
    • READ: both versions succeeded; no regression; message and record counts matched.
  • Inspected representative live sync logs from the affected large-file workload:
    • Successful PR pre-release runs completed with roughly 9.2k records and 921-922 MB synced.
    • Successful PR pre-release runs reported sourceStateMessagesEmitted = 2, destinationStateMessagesEmitted = 2, and empty failures.
    • Earlier control-image evidence showed the reported missing-terminal-stream-status failure mode.

Devin session

@devin-ai-integration devin-ai-integration Bot added the hyd-prove Hydra: ai-prove-fix stage has run label May 27, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

connectors/source/s3 hyd-prove Hydra: ai-prove-fix stage has run

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants