[elastic_agent] Add TSDB dimensions and fix metric_type declarations #18508
AndersonQ wants to merge 1 commit into elastic:main
Conversation
Add `dimension: true` to fields across the 12 elastic_agent metrics
data streams. The added dimensions are redundant with the existing
dimensions (1-to-1 with `agent.id` or `component.id`) and therefore
do not increase time-series cardinality.
Fix metric_type on:
- beat.stats.libbeat.pipeline.events.active: counter -> gauge
- beat.stats.libbeat.output.events.active: counter -> gauge
- filebeat_input.*.histogram.count: gauge -> counter
Add metric_type on numeric fields that were missing it: system stats,
cpu ticks, memstats, handles, runtime.goroutines, uptime, cgroup
stats, libbeat pipeline/config/output metrics, write-latency histogram,
system.process.cgroup.{memory.mem.failures, cpuacct.percpu}, and
filebeat_input.{cel_executions, system_packet_drops}.
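As a sketch, the kinds of changes described above look roughly like this in a data stream's `fields.yml` (field names are illustrative excerpts, not the full files):

```yaml
# Hypothetical excerpt; the actual fields.yml files list many more fields.
- name: agent.id
  type: keyword
  dimension: true          # added; 1-to-1 with the existing dimensions,
                           # so time-series cardinality is unchanged
- name: beat.stats.libbeat.pipeline.events.active
  type: long
  metric_type: gauge       # was counter; "active" is a point-in-time level
- name: filebeat_input.histogram.count
  type: long
  metric_type: counter     # was gauge; a histogram count only increases
```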
Assisted by Claude Code
Force-pushed from 9c202fc to d2cf4f4
@@ -195,6 +197,7 @@
      description: CPU time consumed by tasks in user (kernel) mode.
  - name: percpu
    type: long
🟡 Medium fields/fields.yml:199
system.process.cgroup.cpuacct.percpu is marked metric_type: gauge, but it represents cumulative CPU time consumed on each CPU — a monotonically increasing value. TSDB will compute incorrect rates for this field, producing wrong aggregation results. Change to metric_type: counter to match the sibling cpuacct.*.ns fields.
Also found in 8 other location(s)
packages/elastic_agent/data_stream/cloudbeat_metrics/fields/fields.yml:195
`system.process.cgroup.cpuacct.percpu` at line 195 is given `metric_type: gauge`, but the field represents "CPU time (in nanoseconds) consumed on each CPU by all tasks in this cgroup," which is a monotonically increasing cumulative value and should be `metric_type: counter`. Other cpuacct fields in the same file (e.g., `cpuacct.total.ns` at line 179, `cpuacct.stats.user.ns` at line 185, `cpuacct.stats.system.ns` at line 190) are all correctly declared as `counter`. Using `gauge` here will cause TSDB to treat cumulative nanosecond values as point-in-time measurements, leading to incorrect rate calculations and misleading metric aggregations.
packages/elastic_agent/data_stream/elastic_agent_metrics/fields/fields.yml:199
`cpuacct.percpu` at line 199 is assigned `metric_type: gauge` but its description states "CPU time (in nanoseconds) consumed on each CPU by all tasks in this cgroup." This is cumulative CPU time from the Linux cgroup `cpuacct.usage_percpu` file — a monotonically increasing value that should be `metric_type: counter`. Labeling it as `gauge` causes TSDB to skip counter-specific handling (e.g., rate calculations, counter reset detection), leading to incorrect visualizations and aggregations in Kibana. This directly contradicts the PR's goal of fixing `metric_type` declarations.
packages/elastic_agent/data_stream/filebeat_metrics/fields/fields.yml:200
`cpuacct.percpu` at line 200 is assigned `metric_type: gauge`, but this field represents "CPU time (in nanoseconds) consumed on each CPU by all tasks in this cgroup" — a monotonically increasing cumulative value that should be `metric_type: counter`. All other `cpuacct` time fields in this same group (`cpuacct.total.ns`, `cpuacct.stats.user.ns`, `cpuacct.stats.system.ns`) are correctly typed as `counter`. Using `gauge` means TSDB will not apply rate/delta calculations correctly for this field, producing incorrect visualizations and aggregations.
packages/elastic_agent/data_stream/fleet_server_metrics/fields/fields.yml:200
`metric_type: gauge` added to `system.process.cgroup.cpuacct.percpu` at line 200 is incorrect. This field represents "CPU time (in nanoseconds) consumed on each CPU by all tasks in this cgroup," which is a cumulative, monotonically increasing value and should be `metric_type: counter`. All sibling `cpuacct` fields measuring CPU time in nanoseconds (`cpuacct.total.ns` at line 184, `cpuacct.stats.user.ns` at line 190, `cpuacct.stats.system.ns` at line 196) correctly use `metric_type: counter`. Using `gauge` will cause TSDB to treat this cumulative counter as a point-in-time value, breaking rate calculations and counter-based aggregations.
packages/elastic_agent/data_stream/heartbeat_metrics/fields/fields.yml:200
`system.process.cgroup.cpuacct.percpu` at line 200 is given `metric_type: gauge`, but its description says "CPU time (in nanoseconds) consumed on each CPU by all tasks in this cgroup." CPU time consumed is a monotonically increasing cumulative value and should be `metric_type: counter`, consistent with the sibling fields `cpuacct.total.ns` (line 184), `cpuacct.stats.user.ns` (line 189), and `cpuacct.stats.system.ns` (line 195), which are all `counter`. Using `gauge` causes TSDB to treat this as a point-in-time measurement rather than a cumulative counter, leading to incorrect rate calculations and potentially wrong dashboard visualizations.
packages/elastic_agent/data_stream/metricbeat_metrics/fields/fields.yml:200
`system.process.cgroup.cpuacct.percpu` at line 200 has `metric_type: gauge` added, but this field represents "CPU time (in nanoseconds) consumed on each CPU by all tasks in this cgroup." Cumulative CPU time consumed is a monotonically increasing value sourced from the cgroup `cpuacct.usage_percpu` file, which makes it a `counter`, not a `gauge`. Using `gauge` will cause TSDB to store and handle this metric incorrectly — for example, rate calculations won't be applied automatically, and rollups/downsampling will use last-value semantics instead of cumulative semantics, leading to incorrect metric values in dashboards and alerts.
packages/elastic_agent/data_stream/osquerybeat_metrics/fields/fields.yml:200
`system.process.cgroup.cpuacct.percpu` at line 200 is assigned `metric_type: gauge`, but its description states "CPU time (in nanoseconds) consumed on each CPU by all tasks in this cgroup." This is a monotonically increasing cumulative value and should be `metric_type: counter`, consistent with the sibling `cpuacct` fields (`total.ns` at line 184, `stats.user.ns` at line 190, `stats.system.ns` at line 196) which are all declared as `counter`. Using `gauge` causes TSDB to apply incorrect downsampling (last-value instead of counter-appropriate aggregation), producing incorrect results over time.
packages/elastic_agent/data_stream/packetbeat_metrics/fields/fields.yml:200
The newly added `metric_type: gauge` for `system.process.cgroup.cpuacct.percpu` at line 200 is incorrect. This field represents "CPU time (in nanoseconds) consumed on each CPU by all tasks in this cgroup," which is a monotonically increasing cumulative value and should be `metric_type: counter`, not `gauge`. All sibling fields under `cpuacct` that also represent cumulative CPU time (`total.ns`, `stats.user.ns`, `stats.system.ns`) are correctly declared as `counter`. Using `gauge` will cause TSDB to apply incorrect aggregation (e.g., last-value instead of rate), leading to misleading metric visualizations and broken rate/delta calculations.
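A minimal sketch of the change this comment asks for, assuming the surrounding structure of the `fields.yml` files (exact line positions vary per data stream):

```yaml
# cgroup cpuacct fields: cumulative CPU-time values should all be counters.
- name: percpu
  type: long
  metric_type: counter   # was gauge; cumulative nanoseconds consumed per CPU,
                         # matching the sibling cpuacct.*.ns counter fields
  description: CPU time (in nanoseconds) consumed on each CPU by all tasks in this cgroup.
```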
- name: memory.total
  type: long
  metric_type: gauge
🟡 Medium fields/beat-stats-fields.yml:134
beat.stats.memstats.memory.total is labeled metric_type: gauge, but this field maps to Go's runtime.MemStats.TotalAlloc which tracks cumulative bytes allocated for heap objects — a monotonically increasing counter. With metric_type: gauge, TSDB will not apply counter-specific handling like rate calculations, producing incorrect aggregations and misleading dashboard visualizations. Change to metric_type: counter.
  - name: memory.total
    type: long
+   metric_type: counter
Also found in 2 other location(s)
packages/elastic_agent/data_stream/auditbeat_metrics/fields/beat-stats-fields.yml:136
`memstats.memory.total` at line 136 is declared as `metric_type: gauge`, but this field represents Go's `runtime.MemStats.TotalAlloc` — cumulative bytes allocated for heap objects. The Beats documentation explicitly describes it as "Cumulative bytes allocated for heap objects" (see `docs/reference/filebeat/understand-filebeat-logs.md` in the beats repo), and sample log data confirms the value increases monotonically (e.g., 48348409672 → 48352988904 → 48353325376). Since it's a monotonically increasing cumulative value, it should be `metric_type: counter`, not `gauge`. Using `gauge` means TSDB won't apply counter-specific handling (e.g., rate calculations, rollups), leading to incorrect metric interpretation by dashboards and alerting.
packages/elastic_agent/data_stream/filebeat_input_metrics/fields/beat-stats-fields.yml:136
`memstats.memory.total` at line 136 is declared as `metric_type: gauge` but it corresponds to Go's `runtime.MemStats.TotalAlloc`, which is a cumulative counter of bytes allocated for heap objects. The Beats documentation explicitly describes it as "Cumulative bytes allocated for heap objects" and sample log data from the Beats repo confirms it is monotonically increasing. It should be `metric_type: counter`. Incorrect `metric_type` causes TSDB to apply wrong downsampling/aggregation logic — for a counter, rate calculations are appropriate, but labeling it as gauge means TSDB may take last-value samples instead of computing rates, leading to incorrect metric aggregations over time.
🚀 Benchmarks report
💚 Build Succeeded
cc @AndersonQ
Proposed commit message
Checklist
- I have added an entry to the `changelog.yml` file.
Author's Checklist
How to test this PR locally
Related issues
Screenshots