Skip to content

fix(aiokafka): collect kafka cluster_id for DSM#18272

Open
piochelepiotr wants to merge 3 commits into
mainfrom
fix/aiokafka-cluster-id
Open

fix(aiokafka): collect kafka cluster_id for DSM#18272
piochelepiotr wants to merge 3 commits into
mainfrom
fix/aiokafka-cluster-id

Conversation

@piochelepiotr
Copy link
Copy Markdown
Contributor

Summary

  • The aiokafka integration was missing kafka_cluster_id collection while confluent_kafka already collects it. As a result, DSM edges and produce/commit tracking from aiokafka producers/consumers were missing the kafka_cluster_id tag, breaking cluster-level aggregation in Data Streams Monitoring.
  • Added an async _get_cluster_id helper that sends a MetadataRequest_v5 to a connected broker and caches the result on the AIOKafkaClient (success + 5 min failure cache, mirroring the confluent_kafka helper).
  • Set the id as the kafka.cluster_id span tag and propagate it to the DSM hooks via core.set_item("kafka_cluster_id", ...), so consume/produce edge tags and track_kafka_produce / track_kafka_commit calls include it.

Test plan

  • Locally verified end-to-end with the DSM kafka-demo docker-compose, swapping the producer/consumer for aiokafka apps against confluentinc/cp-kafka:7.5.0. _get_cluster_id(p.client, "demo-orders") returned the broker's CLUSTER_ID (Mka3OEVBNTcwNTJENDM2Qg), and the Datadog agent successfully forwarded traces and data_streams_messages to datad0g.com.
  • CI: existing aiokafka + DSM test suites should still pass.

🤖 Generated with Claude Code

The aiokafka integration was not collecting the Kafka cluster_id, while the
confluent_kafka integration was. This caused Data Streams Monitoring edges
and offset commits from aiokafka producers/consumers to be missing the
kafka_cluster_id tag, preventing cluster-level aggregation.

Fetch cluster_id by sending a MetadataRequest_v5 to a broker the first time
a producer/consumer is traced, cache the result on the AIOKafkaClient, and
use a 5 minute failure cache to avoid repeated slow lookups (matching the
confluent_kafka integration). The id is exposed both as the kafka.cluster_id
span tag and via core.set_item("kafka_cluster_id", ...) so the DSM hooks
include it in checkpoint edge tags and produce/commit tracking.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@piochelepiotr piochelepiotr requested review from a team as code owners May 26, 2026 20:56
@cit-pr-commenter-54b7da
Copy link
Copy Markdown

cit-pr-commenter-54b7da Bot commented May 26, 2026

Codeowners resolved as

ddtrace/contrib/internal/aiokafka/patch.py                              @DataDog/apm-core-python @DataDog/apm-idm-python

@datadog-prod-us1-5
Copy link
Copy Markdown
Contributor

datadog-prod-us1-5 Bot commented May 26, 2026

Pipelines  Tests

Fix all issues with BitsAI

⚠️ Warnings

🚦 8 Pipeline jobs failed

DataDog/apm-reliability/dd-trace-py | build linux: [amd64, cp315-cp315, v113741238-d2b8243-manylinux2014_x86_64]   View in Datadog   GitLab

🔧 Fix in code (Fix with Cursor). NotImplementedError: This version of CPython is not supported yet.

DataDog/apm-reliability/dd-trace-py | build linux: [arm64, cp315-cp315, v113741589-d2b8243-musllinux_1_2_aarch64]   View in Datadog   GitLab

🔧 Fix in code (Fix with Cursor). NotImplementedError: This version of CPython is not supported yet during ddtrace import.

DataDog/apm-reliability/dd-trace-py | build linux serverless: [amd64, cp315-cp315, v113741238-d2b8243-manylinux2014_x86_64, 1]   View in Datadog   GitLab

🛟 This job is unlikely to succeed on retry. Please review your pipeline configuration. NotImplementedError: This version of CPython is not supported yet

View all 8 failed jobs.

ℹ️ Info

No other issues found (see more)

🧪 All tests passed
❄️ No new flaky tests detected

Useful? React with 👍 / 👎

This comment will be updated automatically if new data arrives.
🔗 Commit SHA: 9239ad4 | Docs | Datadog PR Page | Give us feedback!

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 446e7971d0

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".


async def traced_commit(func, instance, args, kwargs):
result = await func(*args, **kwargs)
cluster_id = await _get_cluster_id(getattr(instance, "_client", None), None)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Avoid all-topics metadata lookup in commit path

traced_commit always calls _get_cluster_id(..., None), and _get_cluster_id builds MetadataRequest_v5([] , False) when topic is None. In Kafka, an empty topic list requests metadata for the entire cluster, so the first commit on each consumer can trigger a large broker round-trip and add avoidable latency in a hot path (especially with many topics). This should use a concrete topic from commit offsets (as the confluent integration does) or skip cluster-id lookup when no topic is available.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 9239ad4traced_commit now reads only the cached _dd_cluster_id and never issues a metadata request from the commit path.

For reference: an empty topics=[] in MetadataRequest_v5 actually means "no topic metadata, brokers/cluster only" — verified by encoding:

  • []\x00\x00\x00\x00 (array length 0, no topics)
  • None\xff\xff\xff\xff (null, all topics)

So the produce/getone path doesn't fan out across all topics either. But the commit hot-path concern is fair — switched it to cached-only.

piochelepiotr and others added 2 commits May 27, 2026 14:45
Tracking calls and DSM edge tags now include kafka_cluster_id, so the
existing offset-lookup keys (with cluster_id="") and pathway-hash
expectations no longer matched. Read the cluster id from the
AIOKafkaClient and use it in PartitionKey / pathway tag expectations.
Snapshots regenerated to include the new kafka.cluster_id span tag.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per PR review: the commit handler should not issue a metadata round-trip.
Use only the cached cluster id on the commit path — it is normally
populated by a prior produce/consume; if not yet cached, skip the tag for
that commit rather than block on a broker request.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant