feat(source-mixpanel): add optional stream filtering to speed up schema discovery #81343

Draft
Ryan Waskewich (rwask) wants to merge 3 commits into master from devin/1782848783-source-mixpanel-stream-filtering

@rwask Ryan Waskewich (rwask) commented Jun 30, 2026

What

When a Mixpanel service account has broad access, schema discovery can time out because the connector discovers schemas for all 6 streams, including engage and export, which use dynamic schema loaders that make live API calls (engage/properties and events/properties/top). These calls are rate-limited and, for broad-access accounts, can return very large responses.

This adds an optional streams config parameter so users can restrict which streams are discovered and synced, avoiding the expensive dynamic schema discovery for streams they don't need.
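
For illustration, an optional array-of-enum field like this could be declared roughly as follows. This is a sketch only: the stream names come from this PR description, but the `title` and `description` strings and the `unknown_stream_names` helper are assumptions, not the contents of the actual spec.json.

```python
import json

# Hypothetical shape of the new spec property, written as a Python dict.
# Only the field name, the array type, and the enum values are taken from the PR text.
STREAMS_PROPERTY = {
    "streams": {
        "type": "array",
        "title": "Streams",  # assumed wording
        "description": "Streams to discover and sync. Leave empty to include all streams.",  # assumed wording
        "items": {
            "type": "string",
            "enum": ["cohorts", "engage", "annotations", "cohort_members", "funnels", "export"],
        },
    }
}


def unknown_stream_names(config):
    """Return configured stream names that are not part of the enum (hypothetical helper)."""
    allowed = set(STREAMS_PROPERTY["streams"]["items"]["enum"])
    return [name for name in (config.get("streams") or []) if name not in allowed]


print(json.dumps(STREAMS_PROPERTY, indent=2))
```

Because the field is not listed as required, existing configs without a `streams` key would still validate under this shape.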

Tracked in https://github.com/airbytehq/oncall/issues/13020. Requested by Ryan Waskewich (@rwask).

How

  1. spec.json — Added optional streams field (array of enum stream names: cohorts, engage, annotations, cohort_members, funnels, export)
  2. source.py — Filter returned streams in streams() method when streams config is populated:
    selected_streams = config.get("streams")
    # ... build all_streams ...
    if selected_streams:
        all_streams = [s for s in all_streams if s.name in selected_streams]
  3. Version bump — 4.0.0 → 4.1.0 (minor, non-breaking new feature)
  4. Docs — Added setup step and changelog entry
  5. Tests — Added parametrized test_streams_filtering covering: no filter, empty list, subset, single stream, and all-streams-explicit cases
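
The filtering step in item 2 can be tried in isolation. The sketch below is not the connector's code: `FakeStream` and `filter_streams` are hypothetical stand-ins, and only the `name` attribute and the truthiness check on the `streams` config value mirror the snippet above.

```python
from dataclasses import dataclass
from typing import Any, List, Mapping


@dataclass
class FakeStream:
    """Hypothetical stand-in for a connector stream; only `name` matters here."""
    name: str


# Stream names taken from the PR description.
ALL_STREAM_NAMES = ["cohorts", "engage", "annotations", "cohort_members", "funnels", "export"]


def filter_streams(config: Mapping[str, Any], all_streams: List[FakeStream]) -> List[FakeStream]:
    """Return only the streams named in config["streams"], or every stream if it is unset or empty."""
    selected_streams = config.get("streams")
    if selected_streams:  # None and [] are both falsy, so both keep every stream
        selected = set(selected_streams)
        return [s for s in all_streams if s.name in selected]
    return list(all_streams)


all_streams = [FakeStream(n) for n in ALL_STREAM_NAMES]
subset = filter_streams({"streams": ["cohorts", "funnels"]}, all_streams)
print([s.name for s in subset])  # ['cohorts', 'funnels']
```

Relying on truthiness means a missing key, `None`, and an empty list all behave the same way, which is what keeps the change backward compatible.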

Review guide

  1. source_mixpanel/spec.json — new optional streams config field
  2. source_mixpanel/source.py — stream filtering logic in streams() method
  3. unit_tests/test_source.py — parametrized tests for stream filtering
  4. metadata.yaml / pyproject.toml — version bump
  5. docs/integrations/sources/mixpanel.md — setup instructions and changelog

User Impact

Users with Mixpanel service accounts that have broad access can now optionally select specific streams to discover, preventing schema discovery timeouts. When the field is left empty (default), all streams are discovered as before — fully backward compatible.

Can this PR be safely reverted and rolled back?

  • YES 💚

Link to Devin session: https://app.devin.ai/sessions/c078325c99ee46068dd1d015435c66c7

…ma discovery

Co-Authored-By: Ryan Waskewich <ryan.waskewich@airbyte.io>

Co-Authored-By: Ryan Waskewich <ryan.waskewich@airbyte.io>
@github-actions

github-actions Bot commented Jun 30, 2026

source-mixpanel Connector Test Results

57 tests: 53 passed ✅, 4 skipped 💤, 0 failed ❌
2 suites, 2 files, 30s ⏱️

Results for commit 0e25744.

♻️ This comment has been updated with latest results.

@devin-ai-integration devin-ai-integration Bot left a comment

Devin Review found 2 potential issues.

Comment on lines +158 to +159
if selected_streams:
    all_streams = [s for s in all_streams if s.name in selected_streams]

🚩 Check stream interaction with the new streams filter is CDK-dependent

The manifest defines check.stream_names: [cohorts] at source_mixpanel/manifest.yaml:800-801. If a user sets the streams config to a subset that excludes cohorts, the behavior of check_connection depends on whether the CDK resolves check streams from the manifest definitions directly or from the output of streams(). If the CDK uses streams(), the check would fail because cohorts would be filtered out. In practice, declarative source check resolution typically resolves streams independently from streams(), so this is likely fine, but worth verifying with CDK documentation if this feature is used widely.

🚫 Not fixing. The CDK's CheckStream resolves check streams from the manifest definitions directly, not from the output of streams(). The check_connection call in test_source.py hits the cohorts endpoint (/api/query/cohorts/list) regardless of what streams() returns. So filtering via the streams config won't affect the check — cohorts will always be checked via the manifest-defined check stream.

Worth noting in docs if this feature sees wide adoption, but not a code-level concern.
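
The distinction discussed in this thread can be shown with a toy model. This is not the CDK's CheckStream implementation. Every function name below is hypothetical, and the sketch only contrasts the two resolution strategies named in the review comment.

```python
from typing import List, Optional

# Names defined in the manifest (from the PR description) and the manifest's check.stream_names.
MANIFEST_STREAM_NAMES = ["cohorts", "engage", "annotations", "cohort_members", "funnels", "export"]
CHECK_STREAM_NAMES = ["cohorts"]


def filtered_stream_names(selected: Optional[List[str]]) -> List[str]:
    """Model of streams(): apply the optional filter to the manifest-defined names."""
    if selected:
        return [n for n in MANIFEST_STREAM_NAMES if n in selected]
    return list(MANIFEST_STREAM_NAMES)


def check_resolves_from_manifest() -> bool:
    """Strategy A: look up check streams in the manifest definitions, ignoring the filter."""
    return all(n in MANIFEST_STREAM_NAMES for n in CHECK_STREAM_NAMES)


def check_resolves_from_filtered(selected: Optional[List[str]]) -> bool:
    """Strategy B: look up check streams in the filtered streams() output."""
    available = filtered_stream_names(selected)
    return all(n in available for n in CHECK_STREAM_NAMES)


selected = ["engage", "export"]  # a filter that excludes cohorts
print(check_resolves_from_manifest())          # True
print(check_resolves_from_filtered(selected))  # False
```

Under strategy A the filter cannot hide the check stream, which is the behavior the reply above describes. Strategy B is the failure mode the review comment was concerned about.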


Comment on lines +148 to +159
  selected_streams = config.get("streams")

  all_streams = super().streams(config=config)

  config_transformed = copy.deepcopy(config)
  config_transformed = self._validate_and_transform(config_transformed)
  auth = self.get_authenticator(config)

- streams.append(Export(authenticator=auth, **config_transformed))
+ all_streams.append(Export(authenticator=auth, **config_transformed))

  if selected_streams:
      all_streams = [s for s in all_streams if s.name in selected_streams]

🚩 No tests added for the new streams filtering feature

The PR adds a new config parameter and filtering logic but does not include any unit tests for the behavior. The existing test_streams at unit_tests/test_source.py:49 asserts len(streams) == 6 without exercising the filter. Tests for (1) filtering with a subset, (2) empty list returning all streams, and (3) None returning all streams would increase confidence in this feature.

☑️ Resolved in 0e25744. Added a parametrized test_streams_filtering test covering: no filter (None), empty list, subset, single stream, and all streams explicit.
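
The five cases listed above can be approximated without the connector or pytest. This is a dependency-free sketch and not the PR's `test_streams_filtering`: `filter_streams`, `CASES`, and `run_cases` are hypothetical names, and the streams are plain `SimpleNamespace` objects.

```python
from types import SimpleNamespace

# Stream names taken from the PR description.
ALL = ["cohorts", "engage", "annotations", "cohort_members", "funnels", "export"]


def filter_streams(config, all_streams):
    """Same filtering rule as in the PR snippet, applied to stand-in stream objects."""
    selected = config.get("streams")
    if selected:
        return [s for s in all_streams if s.name in selected]
    return all_streams


# (label, config, expected stream names), one row per case named in the reply above.
CASES = [
    ("no filter", {}, ALL),
    ("empty list", {"streams": []}, ALL),
    ("subset", {"streams": ["cohorts", "export"]}, ["cohorts", "export"]),
    ("single stream", {"streams": ["funnels"]}, ["funnels"]),
    ("all streams explicit", {"streams": list(ALL)}, ALL),
]


def run_cases():
    """Run every case and report, per label, whether the result matched the expectation."""
    streams = [SimpleNamespace(name=n) for n in ALL]
    results = {}
    for label, config, expected in CASES:
        got = [s.name for s in filter_streams(config, streams)]
        results[label] = got == expected
    return results


print(run_cases())
```

In a real test module, each row of `CASES` would map naturally onto one parametrized test case.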


Co-Authored-By: Ryan Waskewich <ryan.waskewich@airbyte.io>
@github-actions

github-actions Bot commented Jul 1, 2026

Deploy preview for airbyte-docs ready!

Project: airbyte-docs
Status: ✅ Deploy successful!
Preview URL: https://airbyte-docs-oxhta8wuk-airbyte-growth.vercel.app
Latest Commit: 0e25744

Deployed with vercel-action
