feat(BA-5806): force capacity sentinel for unbounded kernel live_stat metrics #11535
Merged
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Contributor
Pull request overview
This PR restores non-zero capacity values for a subset of kernel live_stat metrics after the Valkey → Prometheus migration by synthesizing a sentinel CAPACITY sample when Prometheus doesn’t provide a capacity series, keeping downstream consumers compatible.
Changes:
- Added `KernelLiveStatValues.with_capacity_sentinels()` plus `CAPACITY_SENTINEL`/whitelist constants to synthesize missing capacity samples for specific live metrics.
- Updated `PrometheusClient.fetch_container_live_stats()` to return live stats with sentinel capacity injection applied.
- Added unit tests covering sentinel injection/preservation/isolation/non-mutation, and a Towncrier fragment documenting the change.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| tests/unit/common/metrics/test_capacity.py | Adds unit tests validating capacity-sentinel synthesis behavior and invariants. |
| src/ai/backend/manager/repositories/metric/repository.py | Adapts repository code to the updated live-stats return shape. |
| src/ai/backend/common/metrics/types.py | Introduces sentinel constants and KernelLiveStatValues wrapper with injection logic. |
| src/ai/backend/common/clients/prometheus/client.py | Wires sentinel synthesis into live-stat fetching. |
| changes/11535.enhance.md | Adds changelog entry for capacity-sentinel injection. |
jopemachine
reviewed
May 11, 2026
Whitelisted live-stat metrics (cpu_used, net_rx, net_tx, io_read, io_write) are cumulative counters or rates with no meaningful capacity. Any CAPACITY sample carried in the Prometheus response for these is a stale current-as-fallback artifact rather than a real upper bound, so `with_capacity_sentinels` now overwrites it with `CAPACITY_SENTINEL` instead of preserving it. Capacities for non-whitelisted metrics (e.g. mem) are unaffected.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
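The overwrite behavior described in this commit message can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code merged in this PR: the `Sample` dataclass, its `value` field, the `ValueType` enum values, and the plain `dict[str, list[Sample]]` shape are hypothetical stand-ins, and the function is written as a free function rather than the `KernelLiveStatValues` classmethod. Only the names `CAPACITY_SENTINEL`, `CAPACITY_SENTINEL_METRICS`, `ValueType.CURRENT`, `metric_name`, and `value_type` come from the PR itself.

```python
from dataclasses import dataclass
from enum import Enum

CAPACITY_SENTINEL = 2**63 - 1
CAPACITY_SENTINEL_METRICS = frozenset(
    {"cpu_used", "net_rx", "net_tx", "io_read", "io_write"}
)


class ValueType(Enum):
    # Enum values are assumptions; only the member names appear in the PR.
    CURRENT = "current"
    CAPACITY = "capacity"


@dataclass(frozen=True)
class Sample:
    # Hypothetical stand-in for the real live-stat sample type.
    metric_name: str
    value_type: ValueType
    value: int


def with_capacity_sentinels(
    per_kernel: dict[str, list[Sample]],
) -> dict[str, list[Sample]]:
    """Return a new mapping with forced sentinel capacities; input is not mutated."""
    result: dict[str, list[Sample]] = {}
    for kernel_id, vs in per_kernel.items():
        reported_currents = {
            v.metric_name for v in vs if v.value_type is ValueType.CURRENT
        }
        sentinel_targets = reported_currents & CAPACITY_SENTINEL_METRICS
        # Drop any existing CAPACITY sample for a target metric, keep the rest.
        kept = [
            v
            for v in vs
            if not (
                v.value_type is ValueType.CAPACITY
                and v.metric_name in sentinel_targets
            )
        ]
        # Append one forced CAPACITY sample per target metric.
        forced = [
            Sample(name, ValueType.CAPACITY, CAPACITY_SENTINEL)
            for name in sorted(sentinel_targets)
        ]
        result[kernel_id] = kept + forced
    return result
```

In this sketch a whitelisted metric without a `CURRENT` sample is never a target, so nothing is synthesized for it, and each kernel's samples are processed independently of the others.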
Comment on lines +51 to +56
    reported_currents: set[str] = {
        v.metric_name for v in vs if v.value_type is ValueType.CURRENT
    }
    sentinel_targets = reported_currents & CAPACITY_SENTINEL_METRICS
    if not sentinel_targets:
        continue
Collaborator
The implementation feels a bit forced.
HyeockJinKim
approved these changes
May 12, 2026
Summary
- Introduces `KernelLiveStatValues` in `common/metrics/types.py`. Its `with_capacity_sentinels` classmethod forces the `CAPACITY` sample to `CAPACITY_SENTINEL` (`2**63 - 1`) for whitelisted metrics that have a `CURRENT` sample: `cpu_used`, `net_rx`, `net_tx`, `io_read`, `io_write`. These are cumulative counters or rates with no meaningful upper bound, so any capacity present in the Prometheus response is a stale current-as-fallback artifact and is overwritten rather than respected.
- Applies the normalization inside `PrometheusClient.fetch_container_live_stats`, so callers (`MetricRepository`) receive an already-normalized `KernelLiveStatValues` and stay unaware of the sentinel.
- Restores the `capacity` field that legacy live_stat consumers depend on, addressing the "❌ capacity for cpu_used / net_* / io_*" gap from refactor(BA-5744): migrate kernel live_stat from Valkey to Prometheus #11330. Real Prometheus capacities for non-whitelisted metrics (e.g. `mem`) are untouched.

Test plan
- `pants fmt fix lint check test tests/unit/common/metrics/test_capacity.py` passes (parametrized: forced sentinel for all 5 metrics with or without an existing capacity, no synthesis without `CURRENT`, non-whitelisted metrics untouched, per-kernel isolation, input non-mutation).
- `pants test tests/unit/manager/services/utilization_metric::` regression-clean.
- `PrometheusClient.fetch_container_live_stats` returns `capacity = 9223372036854775807` for each of `cpu_used` / `net_rx` / `net_tx` / `io_read` / `io_write`, while `mem.capacity` keeps its real allocated value.
- `git merge --no-commit pr-11360` to verify the `client.py` overlap is resolvable.

Resolves BA-5806
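As a small arithmetic check on the value quoted in the test plan: `2**63 - 1` equals `9223372036854775807`, which is also the largest integer that fits in a signed 64-bit field. The snippet below only verifies that arithmetic. The helper `fits_int64` is a hypothetical name introduced for illustration, and the PR does not state why this particular value was chosen.

```python
CAPACITY_SENTINEL = 2**63 - 1


def fits_int64(value: int) -> bool:
    """Return True if value can be encoded as a signed 64-bit integer."""
    try:
        value.to_bytes(8, "little", signed=True)
    except OverflowError:
        return False
    return True


# The decimal form matches the capacity quoted in the test plan.
assert CAPACITY_SENTINEL == 9223372036854775807
# The sentinel fits in a signed 64-bit field; one more does not.
assert fits_int64(CAPACITY_SENTINEL)
assert not fits_int64(CAPACITY_SENTINEL + 1)
```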
🤖 Generated with Claude Code