-
Notifications
You must be signed in to change notification settings - Fork 421
feat(kafka): add MessageBroker/Kafka/Cluster/{id}/Topic/{topic}/Produce|Consume metrics #4044
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
shashank-reddy-nr
wants to merge
19
commits into
newrelic:main
Choose a base branch
from
shashank-reddy-nr:feat/kafka-cluster-id
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
+200
−6
Open
Changes from 14 commits
Commits
Show all changes
19 commits
Select commit
Hold shift + click to select a range
51f04bf
feat(kafka): inject producerServiceName header and read on consumer side
shashank-reddy-nr 8929d42
fix(kafka): fall back to NEW_RELIC_APP_NAME env var if config returns…
shashank-reddy-nr 7c23c5e
feat(kafka): add MessageBroker/Kafka/Cluster/{id}/Topic/{topic} metrics
shashank-reddy-nr ce8bb6f
fix(kafka): fix CI test failures
shashank-reddy-nr 051f987
fix(kafka): fix bugs found in final review
shashank-reddy-nr 8e31e5a
fix(kafka): remove kafka.cluster.id custom span attribute
shashank-reddy-nr b8101aa
fix(kafka): scope cleanup — metrics only, fix DT regression, remove s…
shashank-reddy-nr 1e721b9
fix(kafka): normalize broker cache key to sorted order
shashank-reddy-nr 4e48d09
fix(test): make segment truncation timer test deterministic
shashank-reddy-nr 2387cf8
fix(kafka): record cluster metric for every topic in sendBatch
shashank-reddy-nr b081071
fix(kafka): fix JSDoc and reduce cognitive complexity in #wrapProduce…
shashank-reddy-nr 88116eb
fix(kafka): use argumentless catch to satisfy sonarjs/no-ignored-exce…
shashank-reddy-nr cdb2795
fix(kafka): handle function-valued brokers, fix cache key, guard clus…
shashank-reddy-nr ad6a8ec
feat(kafka): always emit cluster metrics without opt-in flag
shashank-reddy-nr 4e030ed
fix(kafka): scope cleanup — remove groupId/rawNr scope creep, revert …
shashank-reddy-nr 8b53a69
fix(kafka): add missing cluster-id-cache.js file
shashank-reddy-nr 4a0ef43
fix(kafka): remove verbose comments, revert injectHeaders refactor
shashank-reddy-nr fedfdff
fix(kafka): rename promise param _ to _resolve for promise/param-name…
shashank-reddy-nr 2f977a5
feat(kafka): enable kafkajs_instrumentation feature flag by default
shashank-reddy-nr File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -15,6 +15,116 @@ const recordDataMetrics = require('./utils/record-data-metrics.js') | |
| const recordLinkingMetrics = require('./utils/record-linking-metrics.js') | ||
| const recordMethodMetric = require('./utils/record-method-metric.js') | ||
|
|
||
| // Cache: brokers-key → resolved clusterId string. | ||
| // Kafka cluster IDs are stable for the lifetime of a cluster, so this cache persists | ||
| // for the process lifetime. | ||
| const _clusterIdCache = new Map() | ||
|
|
||
| /** | ||
| * Look up the cluster ID for a set of brokers from the module-level cache. | ||
| * Used by record-data-metrics.js as a fallback when kafkaCtx.clusterId is | ||
| * absent because the consumer spread-copied the context before the async | ||
| * describeCluster() resolved. | ||
| * | ||
| * @param {string[]} brokers List of broker addresses used as the cache key. | ||
| * @returns {string|undefined} Cached cluster UUID, or undefined if not yet resolved. | ||
| */ | ||
| function getClusterIdFromCache(brokers) { | ||
| if (!brokers || !brokers.length) return undefined | ||
| const key = brokers.slice().sort().join(',') | ||
| return _clusterIdCache.get(key) | ||
| } | ||
|
|
||
| // In-flight cache: brokers-key → Promise<string|null>. | ||
| // Coalesces concurrent describeCluster() calls so that N Kafka clients created with | ||
| // the same brokers share a single admin connection rather than each opening their own. | ||
| const _clusterIdInFlight = new Map() | ||
|
|
||
| /** | ||
| * Fetches the Kafka cluster UUID via the kafkajs admin API and caches it. | ||
| * | ||
| * Multiple callers with the same broker set share a single in-flight Promise — the | ||
| * second and subsequent callers await the first caller's fetch instead of opening | ||
| * duplicate admin connections. Once resolved the result is cached permanently. | ||
| * | ||
| * Note: fire-and-forget from the caller's perspective. Sends that occur before the | ||
| * promise resolves will miss the cluster-level metric — accepted trade-off since | ||
| * blocking the producer path is not an option. | ||
| */ | ||
| function _fetchAndCacheClusterId(client, brokers) { | ||
| const key = Array.isArray(brokers) ? brokers.slice().sort().join(',') : String(brokers ?? 'none') | ||
|
|
||
| // Return already-resolved value immediately. | ||
| if (_clusterIdCache.has(key)) return Promise.resolve(_clusterIdCache.get(key)) | ||
|
|
||
| // Coalesce: return the existing in-flight promise if one exists. | ||
| if (_clusterIdInFlight.has(key)) return _clusterIdInFlight.get(key) | ||
|
|
||
| let admin | ||
| const promise = (async () => { | ||
| try { | ||
| admin = client.admin() | ||
| await admin.connect() | ||
| const { clusterId } = await admin.describeCluster() | ||
| await admin.disconnect() | ||
| admin = null | ||
| if (clusterId) _clusterIdCache.set(key, clusterId) | ||
| return clusterId ?? null | ||
| } catch { | ||
| // Cluster ID fetch failure is non-fatal — metrics simply won't include the | ||
| // cluster ID for this session. Best-effort cleanup of the admin connection. | ||
| try { await admin?.disconnect() } catch { /* ignore disconnect error */ } | ||
| return null | ||
| } finally { | ||
| _clusterIdInFlight.delete(key) | ||
| } | ||
| })() | ||
|
|
||
| _clusterIdInFlight.set(key, promise) | ||
| return promise | ||
| } | ||
|
|
||
| /** | ||
| * Injects DT headers into a single message object. | ||
| * Extracted to reduce cognitive complexity of #wrapProducerMethod. | ||
| * | ||
| * @param {object} params Named parameters. | ||
| * @param {object} params.self The ConstructorSubscriber instance. | ||
| * @param {object} params.ctx The current tracer context. | ||
| * @param {object} params.msg The kafkajs message object (mutated in place). | ||
| */ | ||
| function injectHeaders({ self, ctx, msg }) { | ||
| const headers = msg.headers ?? {} | ||
| self.insertDTHeaders({ ctx, headers, useMqNames: true }) | ||
| msg.headers = headers | ||
| } | ||
|
|
||
| /** | ||
| * Records MessageBroker/Kafka/Cluster cluster-level produce metrics. | ||
| * For send() records one metric for the single topic. | ||
| * For sendBatch() records one metric per distinct topic in the batch. | ||
| * | ||
| * @param {object} metrics The agent metrics aggregator. | ||
| * @param {string} clusterId Kafka cluster UUID. | ||
| * @param {boolean} batch Whether this is a sendBatch call. | ||
| * @param {object} data The send/sendBatch arguments object. | ||
| */ | ||
| function recordClusterProduceMetrics(metrics, clusterId, batch, data) { | ||
| if (batch === false) { | ||
| metrics | ||
| .getOrCreateMetric(`MessageBroker/Kafka/Cluster/${clusterId}/Topic/${data.topic}/Produce`) | ||
| .incrementCallCount() | ||
| } else { | ||
| for (const topicMessage of data.topicMessages) { | ||
| metrics | ||
| .getOrCreateMetric( | ||
| `MessageBroker/Kafka/Cluster/${clusterId}/Topic/${topicMessage.topic}/Produce` | ||
| ) | ||
| .incrementCallCount() | ||
| } | ||
| } | ||
| } | ||
|
|
||
| const CONSUMER_METHODS = [ | ||
| 'commitOffsets', | ||
| 'connect', | ||
|
|
@@ -78,13 +188,34 @@ module.exports = class ConstructorSubscriber extends Subscriber { | |
| const { arguments: args, self: client } = data | ||
| client[kafkaCtx] = { brokers: args[0].brokers ?? ['none'] } | ||
|
|
||
| _fetchAndCacheClusterId(client, client[kafkaCtx].brokers).then((id) => { | ||
| if (id) { | ||
| // Update the shared client context. Producers share this object by reference so they | ||
| // see the update immediately. Consumers spread-copy the context at creation time, so | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Where is this "spread-copy" happening? Consumers should not have access to our internal symbol and context object. |
||
| // we also store the id in the module-level cache; record-data-metrics reads from the | ||
| // header as a fallback when kafkaCtx.clusterId is absent on the consumer side. | ||
| client[kafkaCtx].clusterId = id | ||
| } | ||
| }).catch(() => { | ||
| // Cluster ID fetch failure is non-fatal — metrics simply omit the cluster ID. | ||
| }) | ||
|
|
||
| const origConsumer = client.consumer | ||
| client.consumer = function nrConsumer(...args) { | ||
| const consumer = origConsumer.apply(client, args) | ||
| consumer[kafkaCtx] = client[kafkaCtx] | ||
| // Capture groupId and clientId at consumer creation so they are available | ||
| // on every transaction without waiting for the async REQUEST event. | ||
| consumer[kafkaCtx] = { | ||
| ...client[kafkaCtx], | ||
| groupId: args[0]?.groupId ?? null, | ||
| clientId: args[0]?.clientId ?? client[kafkaCtx]?.clientId ?? null | ||
| } | ||
|
|
||
| consumer.on(consumer.events.REQUEST, function nrListener(data) { | ||
| consumer[kafkaCtx].clientId = data?.payload?.clientId | ||
| // REQUEST event may refine clientId from the live connection payload | ||
| if (data?.payload?.clientId) { | ||
| consumer[kafkaCtx].clientId = data.payload.clientId | ||
| } | ||
| }) | ||
| for (const method of CONSUMER_METHODS) { | ||
| self.#wrapConsumerMethod(consumer, method) | ||
|
|
@@ -199,20 +330,20 @@ module.exports = class ConstructorSubscriber extends Subscriber { | |
|
|
||
| if (batch === false) { | ||
| for (const msg of data.messages) { | ||
| const headers = msg.headers ?? {} | ||
| self.insertDTHeaders({ ctx, headers, useMqNames: true }) | ||
| msg.headers = headers | ||
| injectHeaders({ self, ctx, msg }) | ||
| } | ||
| } else { | ||
| for (const topicMessage of data.topicMessages) { | ||
| for (const msg of topicMessage.messages) { | ||
| const headers = msg.headers ?? {} | ||
| self.insertDTHeaders({ ctx, headers, useMqNames: true }) | ||
| msg.headers = headers | ||
| injectHeaders({ self, ctx, msg }) | ||
| } | ||
| } | ||
| } | ||
|
|
||
| if (instance[kafkaCtx].clusterId) { | ||
| recordClusterProduceMetrics(self.agent.metrics, instance[kafkaCtx].clusterId, batch, data) | ||
| } | ||
|
|
||
| return self.agent.tracer.runInContext({ handler: orig, context: ctx, full: true, thisArg: instance, args }) | ||
| } | ||
| } | ||
|
|
@@ -347,3 +478,5 @@ module.exports = class ConstructorSubscriber extends Subscriber { | |
| } | ||
| } | ||
| } | ||
|
|
||
| module.exports.getClusterIdFromCache = getClusterIdFromCache | ||
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are the changes in this file not part of the exported class? Are the two caches really necessary? What is the expected size of these caches in a typical, or extreme, deployment scenario?
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @jsumners-nr, I'm new to this repo and leaned heavily on AI assistance to implement this, so I apologize upfront for anything that doesn't follow conventions here.
The two caches serve different lifecycles:
cluster-id-cache.jsis a module-level map (brokers string → cluster UUID). It's intentionally process-global because the cluster UUID is the same for every producer/consumer connecting to the same Kafka cluster - it makes no sense to refetch it per client instance._clusterIdByInstancedoesn't exist here - the only per-instance storage is via the existing kafkaCtx symbol already set on each client/consumer.In a typical deployment you'd have 1–3 distinct Kafka clusters, so the cache holds 1–3 entries. In an extreme multi-tenant scenario (many distinct broker strings) it could grow unbounded - that's a fair concern. Happy to add a size cap or TTL if you'd prefer.
Open to moving the fetch logic inside the class if that's the convention - I'll follow your guidance. Can you please guide me?