Draft
5 changes: 5 additions & 0 deletions packages/elastic_agent/changelog.yml
@@ -1,4 +1,9 @@
# newer versions go on top
- version: "2.8.0"
changes:
- description: Add TSDB dimensions and complete/fix metric_type declarations across elastic_agent metrics data streams.
type: enhancement
link: "https://github.com/elastic/integrations/pull/18508"
- version: "2.7.6"
changes:
- description: Add auto-expand-replicas setting
@@ -38,8 +38,10 @@
external: ecs
- name: domain
external: ecs
dimension: true
- name: hostname
external: ecs
dimension: true
- name: id
external: ecs
- name: ip
@@ -48,6 +50,7 @@
external: ecs
- name: name
external: ecs
dimension: true
- name: os.family
external: ecs
- name: os.kernel
@@ -1,3 +1,4 @@
- name: beat.type
description: Beat type.
type: keyword
dimension: true
@@ -6,61 +6,80 @@
fields:
- name: name
type: keyword
dimension: true
- name: host
type: keyword
dimension: true
- name: type
type: keyword
dimension: true
- name: uuid
type: keyword
dimension: true
- name: version
type: keyword
- name: system
type: group
fields:
- name: cpu.cores
type: long
metric_type: gauge
- name: load
type: group
fields:
- name: "1"
type: double
metric_type: gauge
- name: "15"
type: double
metric_type: gauge
- name: "5"
type: double
metric_type: gauge
- name: norm
type: group
fields:
- name: "1"
type: double
metric_type: gauge
- name: "15"
type: double
metric_type: gauge
- name: "5"
type: double
metric_type: gauge
- name: cpu
type: group
fields:
- name: system.ticks
type: long
metric_type: counter
- name: system.time.ms
type: long
metric_type: counter
- name: total.value
type: long
metric_type: counter
- name: total.ticks
type: long
metric_type: counter
- name: total.time.ms
type: long
metric_type: counter
- name: user.ticks
type: long
metric_type: counter
- name: user.time.ms
type: long
metric_type: counter
- name: info
type: group
fields:
- name: ephemeral_id
type: keyword
- name: uptime.ms
type: long
metric_type: counter
- name: cgroup
type: group
fields:
@@ -69,58 +88,75 @@
fields:
- name: cfs.period.us
type: long
metric_type: gauge
- name: cfs.quota.us
type: long
metric_type: gauge
- name: id
type: keyword
- name: stats
type: group
fields:
- name: periods
type: long
metric_type: counter
- name: throttled.periods
type: long
metric_type: counter
- name: throttled.ns
type: long
metric_type: counter
- name: cpuacct.id
type: keyword
- name: cpuacct.total.ns
type: long
metric_type: counter
- name: memory
type: group
fields:
- name: id
type: keyword
- name: mem.limit.bytes
type: long
metric_type: gauge
- name: mem.usage.bytes
type: long
metric_type: gauge
- name: memstats
type: group
fields:
- name: gc_next
type: long
metric_type: gauge
- name: memory.alloc
type: long
metric_type: gauge
- name: memory.total
type: long
metric_type: gauge
Comment on lines 134 to +136

🟡 Medium fields/beat-stats-fields.yml:134

beat.stats.memstats.memory.total is labeled metric_type: gauge, but this field maps to Go's runtime.MemStats.TotalAlloc which tracks cumulative bytes allocated for heap objects — a monotonically increasing counter. With metric_type: gauge, TSDB will not apply counter-specific handling like rate calculations, producing incorrect aggregations and misleading dashboard visualizations. Change to metric_type: counter.

         - name: memory.total
           type: long
+          metric_type: counter
Also found in 2 other location(s)

packages/elastic_agent/data_stream/auditbeat_metrics/fields/beat-stats-fields.yml:136

memstats.memory.total at line 136 is declared as metric_type: gauge, but this field represents Go's runtime.MemStats.TotalAlloc — cumulative bytes allocated for heap objects. The Beats documentation explicitly describes it as "Cumulative bytes allocated for heap objects" (see docs/reference/filebeat/understand-filebeat-logs.md in the beats repo), and sample log data confirms the value increases monotonically (e.g., 48348409672 → 48352988904 → 48353325376). Since it's a monotonically increasing cumulative value, it should be metric_type: counter, not gauge. Using gauge means TSDB won't apply counter-specific handling (e.g., rate calculations, rollups), leading to incorrect metric interpretation by dashboards and alerting.

packages/elastic_agent/data_stream/filebeat_input_metrics/fields/beat-stats-fields.yml:136

memstats.memory.total at line 136 is declared as metric_type: gauge but it corresponds to Go's runtime.MemStats.TotalAlloc, which is a cumulative counter of bytes allocated for heap objects. The Beats documentation explicitly describes it as "Cumulative bytes allocated for heap objects" and sample log data from the Beats repo confirms it is monotonically increasing. It should be metric_type: counter. Incorrect metric_type causes TSDB to apply wrong downsampling/aggregation logic — for a counter, rate calculations are appropriate, but labeling it as gauge means TSDB may take last-value samples instead of computing rates, leading to incorrect metric aggregations over time.

Primary location: packages/elastic_agent/data_stream/apm_server_metrics/fields/beat-stats-fields.yml, around lines 134-136.

- name: rss
type: long
metric_type: gauge
- name: handles
type: group
fields:
- name: open
type: long
metric_type: gauge
- name: limit.hard
type: long
metric_type: gauge
- name: limit.soft
type: long
metric_type: gauge
- name: uptime.ms
type: long
metric_type: counter
description: >
Beat uptime
- name: runtime.goroutines
type: long
metric_type: gauge
description: >
Number of goroutines running in Beat
- name: libbeat
@@ -133,6 +169,7 @@
fields:
- name: clients
type: long
metric_type: gauge
- name: queue
type: group
fields:
@@ -196,7 +233,7 @@
fields:
- name: active
type: long
-  metric_type: counter
+  metric_type: gauge
- name: dropped
type: long
metric_type: counter
@@ -220,12 +257,16 @@
fields:
- name: reloads
type: long
metric_type: counter
- name: running
type: long
metric_type: gauge
- name: starts
type: long
metric_type: counter
- name: stops
type: long
metric_type: counter
- name: output
type: group
fields:
@@ -243,7 +284,7 @@
Number of events acknowledged
- name: active
type: long
-  metric_type: counter
+  metric_type: gauge
description: >
Number of active events
- name: batches
@@ -299,12 +340,16 @@
fields:
- name: count
type: long
metric_type: counter
- name: max
type: float
metric_type: gauge
- name: median
type: long
metric_type: gauge
- name: p99
type: float
metric_type: gauge
- name: read
type: group
description: >
@@ -9,8 +9,10 @@
dimension: true
- name: agent.name
external: ecs
dimension: true
- name: agent.type
external: ecs
dimension: true
- name: agent.version
external: ecs
- name: log.level
@@ -5,12 +5,14 @@
type: keyword
ignore_above: 1024
description: Elastic Agent id.
dimension: true
- name: process
level: extended
type: keyword
ignore_above: 1024
description: Process run by the Elastic Agent.
example: metricbeat
dimension: true
- name: snapshot
level: extended
type: boolean
@@ -195,6 +197,7 @@
description: CPU time consumed by tasks in user (kernel) mode.
- name: percpu
type: long

🟡 Medium fields/fields.yml:199

system.process.cgroup.cpuacct.percpu is marked metric_type: gauge, but it represents cumulative CPU time consumed on each CPU, which is a monotonically increasing value. Declared as a gauge, the field is treated as a point-in-time measurement, so counter-specific handling such as rate calculations is not applied and aggregations over time come out wrong. Change to metric_type: counter to match the sibling cpuacct.*.ns fields.

Also found in 8 other location(s)

packages/elastic_agent/data_stream/cloudbeat_metrics/fields/fields.yml:195

system.process.cgroup.cpuacct.percpu at line 195 is given metric_type: gauge, but the field represents "CPU time (in nanoseconds) consumed on each CPU by all tasks in this cgroup," which is a monotonically increasing cumulative value and should be metric_type: counter. Other cpuacct fields in the same file (e.g., cpuacct.total.ns at line 179, cpuacct.stats.user.ns at line 185, cpuacct.stats.system.ns at line 190) are all correctly declared as counter. Using gauge here will cause TSDB to treat cumulative nanosecond values as point-in-time measurements, leading to incorrect rate calculations and misleading metric aggregations.

packages/elastic_agent/data_stream/elastic_agent_metrics/fields/fields.yml:199

cpuacct.percpu at line 199 is assigned metric_type: gauge but its description states "CPU time (in nanoseconds) consumed on each CPU by all tasks in this cgroup." This is cumulative CPU time from the Linux cgroup cpuacct.usage_percpu file — a monotonically increasing value that should be metric_type: counter. Labeling it as gauge causes TSDB to skip counter-specific handling (e.g., rate calculations, counter reset detection), leading to incorrect visualizations and aggregations in Kibana. This directly contradicts the PR's goal of fixing metric_type declarations.

packages/elastic_agent/data_stream/filebeat_metrics/fields/fields.yml:200

cpuacct.percpu at line 200 is assigned metric_type: gauge, but this field represents "CPU time (in nanoseconds) consumed on each CPU by all tasks in this cgroup" — a monotonically increasing cumulative value that should be metric_type: counter. All other cpuacct time fields in this same group (cpuacct.total.ns, cpuacct.stats.user.ns, cpuacct.stats.system.ns) are correctly typed as counter. Using gauge means TSDB will not apply rate/delta calculations correctly for this field, producing incorrect visualizations and aggregations.

packages/elastic_agent/data_stream/fleet_server_metrics/fields/fields.yml:200

metric_type: gauge added to system.process.cgroup.cpuacct.percpu at line 200 is incorrect. This field represents "CPU time (in nanoseconds) consumed on each CPU by all tasks in this cgroup," which is a cumulative, monotonically increasing value and should be metric_type: counter. All sibling cpuacct fields measuring CPU time in nanoseconds (cpuacct.total.ns at line 184, cpuacct.stats.user.ns at line 190, cpuacct.stats.system.ns at line 196) correctly use metric_type: counter. Using gauge will cause TSDB to treat this cumulative counter as a point-in-time value, breaking rate calculations and counter-based aggregations.

packages/elastic_agent/data_stream/heartbeat_metrics/fields/fields.yml:200

system.process.cgroup.cpuacct.percpu at line 200 is given metric_type: gauge, but its description says "CPU time (in nanoseconds) consumed on each CPU by all tasks in this cgroup." CPU time consumed is a monotonically increasing cumulative value and should be metric_type: counter, consistent with the sibling fields cpuacct.total.ns (line 184), cpuacct.stats.user.ns (line 189), and cpuacct.stats.system.ns (line 195), which are all counter. Using gauge causes TSDB to treat this as a point-in-time measurement rather than a cumulative counter, leading to incorrect rate calculations and potentially wrong dashboard visualizations.

packages/elastic_agent/data_stream/metricbeat_metrics/fields/fields.yml:200

system.process.cgroup.cpuacct.percpu at line 200 has metric_type: gauge added, but this field represents "CPU time (in nanoseconds) consumed on each CPU by all tasks in this cgroup." Cumulative CPU time consumed is a monotonically increasing value sourced from the cgroup cpuacct.usage_percpu file, which makes it a counter, not a gauge. Using gauge will cause TSDB to store and handle this metric incorrectly — for example, rate calculations won't be applied automatically, and rollups/downsampling will use last-value semantics instead of cumulative semantics, leading to incorrect metric values in dashboards and alerts.

packages/elastic_agent/data_stream/osquerybeat_metrics/fields/fields.yml:200

system.process.cgroup.cpuacct.percpu at line 200 is assigned metric_type: gauge, but its description states "CPU time (in nanoseconds) consumed on each CPU by all tasks in this cgroup." This is a monotonically increasing cumulative value and should be metric_type: counter, consistent with the sibling cpuacct fields (total.ns at line 184, stats.user.ns at line 190, stats.system.ns at line 196) which are all declared as counter. Using gauge causes TSDB to apply incorrect downsampling (last-value instead of counter-appropriate aggregation), producing incorrect results over time.

packages/elastic_agent/data_stream/packetbeat_metrics/fields/fields.yml:200

The newly added metric_type: gauge for system.process.cgroup.cpuacct.percpu at line 200 is incorrect. This field represents "CPU time (in nanoseconds) consumed on each CPU by all tasks in this cgroup," which is a monotonically increasing cumulative value and should be metric_type: counter, not gauge. All sibling fields under cpuacct that also represent cumulative CPU time (total.ns, stats.user.ns, stats.system.ns) are correctly declared as counter. Using gauge will cause TSDB to apply incorrect aggregation (e.g., last-value instead of rate), leading to misleading metric visualizations and broken rate/delta calculations.

Primary location: packages/elastic_agent/data_stream/apm_server_metrics/fields/fields.yml, around line 199.
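
For reference, the change this comment asks for might look like the fragment below. This is only a sketch: the diff view dropped the original indentation and the parent `cpuacct` group, so the nesting shown here is assumed, and the same edit would have to be repeated in each of the listed `fields.yml` files.

```yaml
# Hypothetical layout: indentation and position inside the cpuacct group are assumed.
- name: percpu
  type: long
  # Reviewer's suggestion; the PR as submitted declares `gauge` here.
  metric_type: counter
  description: |
    CPU time (in nanoseconds) consumed on each CPU by all tasks in this cgroup.
```

Declaring the field as a counter keeps it consistent with the sibling `cpuacct.total.ns`, `cpuacct.stats.user.ns`, and `cpuacct.stats.system.ns` fields that the comment cites.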

metric_type: gauge
description: |
CPU time (in nanoseconds) consumed on each CPU by all tasks in this cgroup.
- name: memory
@@ -230,6 +233,7 @@
The maximum amount of user memory in bytes (including file cache) that tasks in the cgroup are allowed to use.
- name: mem.failures
type: long
metric_type: counter
description: |
The number of times that the memory limit (mem.limit.bytes) was reached.
- name: memsw.usage.bytes
@@ -38,8 +38,10 @@
external: ecs
- name: domain
external: ecs
dimension: true
- name: hostname
external: ecs
dimension: true
- name: id
external: ecs
- name: ip
@@ -48,6 +50,7 @@
external: ecs
- name: name
external: ecs
dimension: true
- name: os.family
external: ecs
- name: os.kernel
@@ -1,3 +1,4 @@
- name: beat.type
description: Beat type.
type: keyword
dimension: true