
[elastic_agent] Add TSDB dimensions and fix metric_type declarations#18508

Draft
AndersonQ wants to merge 1 commit into elastic:main from AndersonQ:17349-elastic-agent-tsdb-check

Conversation

@AndersonQ
Member

Proposed commit message

[elastic_agent] Add TSDB dimensions and fix metric_type declarations

Add `dimension: true` to fields across the 12 elastic_agent metrics data streams. The added dimensions are redundant with the existing dimensions (1-to-1 with `agent.id` or `component.id`) and therefore do not increase time-series cardinality.

Fix metric_type on:
- beat.stats.libbeat.pipeline.events.active: counter -> gauge
- beat.stats.libbeat.output.events.active: counter -> gauge
- filebeat_input.*.histogram.count: gauge -> counter

Add metric_type on numeric fields that were missing it: system stats, cpu ticks, memstats, handles, runtime.goroutines, uptime, cgroup stats, libbeat pipeline/config/output metrics, write-latency histogram, system.process.cgroup.{memory.mem.failures, cpuacct.percpu}, and filebeat_input.{cel_executions, system_packet_drops}.

Assisted by Claude Code
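In fields.yml terms, the changes described above look roughly like this (an illustrative sketch only; the real files nest these fields under their parent groups and carry full descriptions, and `bytes_read` is a hypothetical stand-in for the wildcarded `filebeat_input.*.histogram.count` names):

```yaml
- name: agent.id
  type: keyword
  dimension: true      # added; 1-to-1 with the existing dimensions, so cardinality is unchanged

- name: beat.stats.libbeat.pipeline.events.active
  type: long
  metric_type: gauge   # fixed: was counter; "active" is a point-in-time level, not cumulative

- name: filebeat_input.bytes_read.histogram.count   # hypothetical name for filebeat_input.*.histogram.count
  type: long
  metric_type: counter # fixed: was gauge; histogram observation counts only increase
```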

Checklist

  • I have reviewed tips for building integrations and this pull request is aligned with them.
  • I have verified that all data streams collect metrics or logs.
  • I have added an entry to my package's changelog.yml file.
  • I have verified that Kibana version constraints are current according to guidelines.
  • I have verified that any added dashboard complies with Kibana's Dashboard good practices.

Author's Checklist


How to test this PR locally

Related issues

Screenshots

@AndersonQ AndersonQ self-assigned this Apr 17, 2026
@AndersonQ AndersonQ added Integration:elastic_agent Elastic Agent Team:Elastic-Agent-Data-Plane Agent Data Plane team [elastic/elastic-agent-data-plane] labels Apr 17, 2026
@AndersonQ AndersonQ force-pushed the 17349-elastic-agent-tsdb-check branch from 9c202fc to d2cf4f4 on April 17, 2026 at 16:16
@@ -195,6 +197,7 @@
description: CPU time consumed by tasks in user (kernel) mode.
- name: percpu
type: long
🟡 Medium packages/elastic_agent/data_stream/apm_server_metrics/fields/fields.yml:199

system.process.cgroup.cpuacct.percpu is marked metric_type: gauge, but it represents cumulative CPU time consumed on each CPU — a monotonically increasing value. TSDB will compute incorrect rates for this field, producing wrong aggregation results. Change to metric_type: counter to match the sibling cpuacct.*.ns fields.
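The one-line fix, sketched against the quoted declaration:

```yaml
- name: percpu
  type: long
  metric_type: counter   # was gauge; cumulative per-CPU time is monotonically increasing
```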

Also found in 8 other location(s)

packages/elastic_agent/data_stream/cloudbeat_metrics/fields/fields.yml:195

system.process.cgroup.cpuacct.percpu at line 195 is given metric_type: gauge, but the field represents "CPU time (in nanoseconds) consumed on each CPU by all tasks in this cgroup," which is a monotonically increasing cumulative value and should be metric_type: counter. Other cpuacct fields in the same file (e.g., cpuacct.total.ns at line 179, cpuacct.stats.user.ns at line 185, cpuacct.stats.system.ns at line 190) are all correctly declared as counter. Using gauge here will cause TSDB to treat cumulative nanosecond values as point-in-time measurements, leading to incorrect rate calculations and misleading metric aggregations.

packages/elastic_agent/data_stream/elastic_agent_metrics/fields/fields.yml:199

cpuacct.percpu at line 199 is assigned metric_type: gauge but its description states "CPU time (in nanoseconds) consumed on each CPU by all tasks in this cgroup." This is cumulative CPU time from the Linux cgroup cpuacct.usage_percpu file — a monotonically increasing value that should be metric_type: counter. Labeling it as gauge causes TSDB to skip counter-specific handling (e.g., rate calculations, counter reset detection), leading to incorrect visualizations and aggregations in Kibana. This directly contradicts the PR's goal of fixing metric_type declarations.

packages/elastic_agent/data_stream/filebeat_metrics/fields/fields.yml:200

cpuacct.percpu at line 200 is assigned metric_type: gauge, but this field represents "CPU time (in nanoseconds) consumed on each CPU by all tasks in this cgroup" — a monotonically increasing cumulative value that should be metric_type: counter. All other cpuacct time fields in this same group (cpuacct.total.ns, cpuacct.stats.user.ns, cpuacct.stats.system.ns) are correctly typed as counter. Using gauge means TSDB will not apply rate/delta calculations correctly for this field, producing incorrect visualizations and aggregations.

packages/elastic_agent/data_stream/fleet_server_metrics/fields/fields.yml:200

metric_type: gauge added to system.process.cgroup.cpuacct.percpu at line 200 is incorrect. This field represents "CPU time (in nanoseconds) consumed on each CPU by all tasks in this cgroup," which is a cumulative, monotonically increasing value and should be metric_type: counter. All sibling cpuacct fields measuring CPU time in nanoseconds (cpuacct.total.ns at line 184, cpuacct.stats.user.ns at line 190, cpuacct.stats.system.ns at line 196) correctly use metric_type: counter. Using gauge will cause TSDB to treat this cumulative counter as a point-in-time value, breaking rate calculations and counter-based aggregations.

packages/elastic_agent/data_stream/heartbeat_metrics/fields/fields.yml:200

system.process.cgroup.cpuacct.percpu at line 200 is given metric_type: gauge, but its description says "CPU time (in nanoseconds) consumed on each CPU by all tasks in this cgroup." CPU time consumed is a monotonically increasing cumulative value and should be metric_type: counter, consistent with the sibling fields cpuacct.total.ns (line 184), cpuacct.stats.user.ns (line 189), and cpuacct.stats.system.ns (line 195), which are all counter. Using gauge causes TSDB to treat this as a point-in-time measurement rather than a cumulative counter, leading to incorrect rate calculations and potentially wrong dashboard visualizations.

packages/elastic_agent/data_stream/metricbeat_metrics/fields/fields.yml:200

system.process.cgroup.cpuacct.percpu at line 200 has metric_type: gauge added, but this field represents "CPU time (in nanoseconds) consumed on each CPU by all tasks in this cgroup." Cumulative CPU time consumed is a monotonically increasing value sourced from the cgroup cpuacct.usage_percpu file, which makes it a counter, not a gauge. Using gauge will cause TSDB to store and handle this metric incorrectly — for example, rate calculations won't be applied automatically, and rollups/downsampling will use last-value semantics instead of cumulative semantics, leading to incorrect metric values in dashboards and alerts.

packages/elastic_agent/data_stream/osquerybeat_metrics/fields/fields.yml:200

system.process.cgroup.cpuacct.percpu at line 200 is assigned metric_type: gauge, but its description states "CPU time (in nanoseconds) consumed on each CPU by all tasks in this cgroup." This is a monotonically increasing cumulative value and should be metric_type: counter, consistent with the sibling cpuacct fields (total.ns at line 184, stats.user.ns at line 190, stats.system.ns at line 196) which are all declared as counter. Using gauge causes TSDB to apply incorrect downsampling (last-value instead of counter-appropriate aggregation), producing incorrect results over time.

packages/elastic_agent/data_stream/packetbeat_metrics/fields/fields.yml:200

The newly added metric_type: gauge for system.process.cgroup.cpuacct.percpu at line 200 is incorrect. This field represents "CPU time (in nanoseconds) consumed on each CPU by all tasks in this cgroup," which is a monotonically increasing cumulative value and should be metric_type: counter, not gauge. All sibling fields under cpuacct that also represent cumulative CPU time (total.ns, stats.user.ns, stats.system.ns) are correctly declared as counter. Using gauge will cause TSDB to apply incorrect aggregation (e.g., last-value instead of rate), leading to misleading metric visualizations and broken rate/delta calculations.


Comment on lines 134 to +136
- name: memory.total
type: long
metric_type: gauge
🟡 Medium packages/elastic_agent/data_stream/apm_server_metrics/fields/beat-stats-fields.yml:134

beat.stats.memstats.memory.total is labeled metric_type: gauge, but this field maps to Go's runtime.MemStats.TotalAlloc which tracks cumulative bytes allocated for heap objects — a monotonically increasing counter. With metric_type: gauge, TSDB will not apply counter-specific handling like rate calculations, producing incorrect aggregations and misleading dashboard visualizations. Change to metric_type: counter.

         - name: memory.total
           type: long
+          metric_type: counter
Also found in 2 other location(s)

packages/elastic_agent/data_stream/auditbeat_metrics/fields/beat-stats-fields.yml:136

memstats.memory.total at line 136 is declared as metric_type: gauge, but this field represents Go's runtime.MemStats.TotalAlloc — cumulative bytes allocated for heap objects. The Beats documentation explicitly describes it as "Cumulative bytes allocated for heap objects" (see docs/reference/filebeat/understand-filebeat-logs.md in the beats repo), and sample log data confirms the value increases monotonically (e.g., 48348409672 → 48352988904 → 48353325376). Since it's a monotonically increasing cumulative value, it should be metric_type: counter, not gauge. Using gauge means TSDB won't apply counter-specific handling (e.g., rate calculations, rollups), leading to incorrect metric interpretation by dashboards and alerting.

packages/elastic_agent/data_stream/filebeat_input_metrics/fields/beat-stats-fields.yml:136

memstats.memory.total at line 136 is declared as metric_type: gauge but it corresponds to Go's runtime.MemStats.TotalAlloc, which is a cumulative counter of bytes allocated for heap objects. The Beats documentation explicitly describes it as "Cumulative bytes allocated for heap objects" and sample log data from the Beats repo confirms it is monotonically increasing. It should be metric_type: counter. Incorrect metric_type causes TSDB to apply wrong downsampling/aggregation logic — for a counter, rate calculations are appropriate, but labeling it as gauge means TSDB may take last-value samples instead of computing rates, leading to incorrect metric aggregations over time.


@elastic-vault-github-plugin-prod

🚀 Benchmarks report

To see the full report comment with /test benchmark fullreport

@elasticmachine

💚 Build Succeeded

cc @AndersonQ
