feat(BA-5806): force capacity sentinel for unbounded kernel live_stat metrics #11535

Merged
HyeockJinKim merged 5 commits into main from BA-5806
May 12, 2026

Conversation

@seedspirit
Contributor

@seedspirit seedspirit commented May 10, 2026

Summary

  • Adds KernelLiveStatValues in common/metrics/types.py. Its with_capacity_sentinels classmethod forces the CAPACITY sample to CAPACITY_SENTINEL (2**63 - 1) for whitelisted metrics that have a CURRENT sample — cpu_used, net_rx, net_tx, io_read, io_write. These are cumulative counters / rates with no meaningful upper bound, so any capacity present in the Prometheus response is a stale current-as-fallback artifact and is overwritten rather than respected.
  • Wires the injection inside PrometheusClient.fetch_container_live_stats, so callers (MetricRepository) receive an already-normalized KernelLiveStatValues and stay unaware of the sentinel.
  • Restores the non-zero capacity field that legacy live_stat consumers depend on, addressing the "❌ capacity for cpu_used / net_* / io_*" gap from refactor(BA-5744): migrate kernel live_stat from Valkey to Prometheus #11330. Real Prometheus capacities for non-whitelisted metrics (e.g. mem) are untouched.
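
The overwrite rule in the first bullet can be sketched roughly as follows. This is a minimal illustration, not the merged code: the `Sample` dataclass, the `value` field, the `ValueType.CAPACITY` member name, and the plain `dict[str, list[Sample]]` shape are assumptions made for the example, whereas `CAPACITY_SENTINEL`, `CAPACITY_SENTINEL_METRICS`, `ValueType.CURRENT`, `metric_name`, and `value_type` follow names that appear in this PR.

```python
from dataclasses import dataclass
from enum import Enum

CAPACITY_SENTINEL = 2**63 - 1
CAPACITY_SENTINEL_METRICS = frozenset(
    {"cpu_used", "net_rx", "net_tx", "io_read", "io_write"}
)


class ValueType(Enum):
    CURRENT = "current"
    CAPACITY = "capacity"  # member name assumed for this sketch


@dataclass(frozen=True)
class Sample:  # hypothetical stand-in for the real sample type
    metric_name: str
    value_type: ValueType
    value: int


def with_capacity_sentinels(
    by_kernel: dict[str, list[Sample]],
) -> dict[str, list[Sample]]:
    """Return a new mapping; the input mapping and its lists are not mutated."""
    result: dict[str, list[Sample]] = {}
    for kernel_id, vs in by_kernel.items():
        reported_currents = {
            v.metric_name for v in vs if v.value_type is ValueType.CURRENT
        }
        sentinel_targets = reported_currents & CAPACITY_SENTINEL_METRICS
        if not sentinel_targets:
            # No whitelisted metric has a CURRENT sample: nothing is synthesized.
            result[kernel_id] = list(vs)
            continue
        # Drop any reported capacity for the targets, then force the sentinel.
        kept = [
            v
            for v in vs
            if not (
                v.value_type is ValueType.CAPACITY
                and v.metric_name in sentinel_targets
            )
        ]
        forced = [
            Sample(name, ValueType.CAPACITY, CAPACITY_SENTINEL)
            for name in sorted(sentinel_targets)
        ]
        result[kernel_id] = kept + forced
    return result


stats = {
    "k1": [
        Sample("cpu_used", ValueType.CURRENT, 10),
        Sample("cpu_used", ValueType.CAPACITY, 10),  # stale fallback value
        Sample("mem", ValueType.CURRENT, 5),
        Sample("mem", ValueType.CAPACITY, 100),  # real capacity, left alone
    ],
    "k2": [Sample("net_rx", ValueType.CAPACITY, 7)],  # no CURRENT sample
}
normalized = with_capacity_sentinels(stats)
```

In this sketch the `cpu_used` capacity for `k1` becomes `CAPACITY_SENTINEL`, the `mem` capacity stays at 100, and `k2` is returned unchanged because `net_rx` has no CURRENT sample.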

Test plan

  • pants fmt fix lint check test tests/unit/common/metrics/test_capacity.py passes (parametrized: forced sentinel for all 5 metrics with or without an existing capacity, no synthesis without CURRENT, non-whitelisted metrics untouched, per-kernel isolation, input non-mutation).
  • pants test tests/unit/manager/services/utilization_metric:: regression-clean.
  • Live: against a running kernel on local halfstack, PrometheusClient.fetch_container_live_stats returns capacity = 9223372036854775807 for each of cpu_used / net_rx / net_tx / io_read / io_write, while mem.capacity keeps its real allocated value.
  • Conflict dry-run: git merge --no-commit pr-11360 to verify the client.py overlap is resolvable.
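
As a quick arithmetic check on the value quoted in the live test, `2**63 - 1` is the largest signed 64-bit integer, which matches the `9223372036854775807` reported above. The helper `fits_int64` below is a hypothetical name used only for this illustration, and the PR does not state why this particular value was chosen.

```python
import struct

CAPACITY_SENTINEL = 2**63 - 1


def fits_int64(value: int) -> bool:
    """Report whether value can be packed as a signed 64-bit integer."""
    try:
        struct.pack("<q", value)
    except struct.error:
        return False
    return True


# The sentinel equals the decimal value seen in the live check.
assert CAPACITY_SENTINEL == 9223372036854775807
# It round-trips through a signed 64-bit field, while one more does not fit.
packed = struct.pack("<q", CAPACITY_SENTINEL)
roundtrip = struct.unpack("<q", packed)[0]
```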

Resolves BA-5806

🤖 Generated with Claude Code

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@github-actions github-actions Bot added size:L 100~500 LoC comp:manager Related to Manager component comp:common Related to Common component labels May 10, 2026
@seedspirit seedspirit marked this pull request as ready for review May 10, 2026 11:37
@seedspirit seedspirit requested review from a team and Copilot May 10, 2026 11:37
Contributor

Copilot AI left a comment

Pull request overview

This PR restores non-zero capacity values for a subset of kernel live_stat metrics after the Valkey → Prometheus migration by synthesizing a sentinel CAPACITY sample when Prometheus doesn’t provide a capacity series, keeping downstream consumers compatible.

Changes:

  • Added KernelLiveStatValues.with_capacity_sentinels() plus CAPACITY_SENTINEL/whitelist constants to synthesize missing capacity samples for specific live metrics.
  • Updated PrometheusClient.fetch_container_live_stats() to return live stats with sentinel capacity injection applied.
  • Added unit tests covering sentinel injection/preservation/isolation/non-mutation, and a Towncrier fragment documenting the change.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| tests/unit/common/metrics/test_capacity.py | Adds unit tests validating capacity-sentinel synthesis behavior and invariants. |
| src/ai/backend/manager/repositories/metric/repository.py | Adapts repository code to the updated live-stats return shape. |
| src/ai/backend/common/metrics/types.py | Introduces sentinel constants and KernelLiveStatValues wrapper with injection logic. |
| src/ai/backend/common/clients/prometheus/client.py | Wires sentinel synthesis into live-stat fetching. |
| changes/11535.enhance.md | Adds changelog entry for capacity-sentinel injection. |

Comment thread changes/11535.feature.md
@seedspirit seedspirit requested review from a team and jopemachine May 11, 2026 01:42
Whitelisted live-stat metrics (cpu_used, net_rx, net_tx, io_read, io_write)
are cumulative counters or rates with no meaningful capacity. Any CAPACITY
sample carried in the Prometheus response for these is a stale
current-as-fallback artifact rather than a real upper bound, so
`with_capacity_sentinels` now overwrites it with `CAPACITY_SENTINEL` instead
of preserving it. Capacities for non-whitelisted metrics (e.g. mem) are
unaffected.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@seedspirit seedspirit changed the title feat(BA-5806): synthesize capacity sentinel for kernel live_stat feat(BA-5806): force capacity sentinel for unbounded kernel live_stat metrics May 11, 2026
Comment on lines +51 to +56
    reported_currents: set[str] = {
        v.metric_name for v in vs if v.value_type is ValueType.CURRENT
    }
    sentinel_targets = reported_currents & CAPACITY_SENTINEL_METRICS
    if not sentinel_targets:
        continue
Collaborator

The implementation feels a bit forced.

@HyeockJinKim HyeockJinKim merged commit de3439b into main May 12, 2026
33 checks passed
@HyeockJinKim HyeockJinKim deleted the BA-5806 branch May 12, 2026 01:17
4 participants