Emit metric tracking empty responses from prometheus#7671
Emit metric tracking empty responses from prometheus#7671aliaqel-stripe wants to merge 12 commits intokedacore:mainfrom
Conversation
Signed-off-by: Daniele Rolando <drolando@stripe.com>
Signed-off-by: Daniele Rolando <drolando@stripe.com>
Signed-off-by: Daniele Rolando <drolando@stripe.com>
Co-authored-by: Jorge Turrado Ferrero <Jorge_turrado@hotmail.es> Signed-off-by: drolando-stripe <102543345+drolando-stripe@users.noreply.github.com>
Co-authored-by: Jorge Turrado Ferrero <Jorge_turrado@hotmail.es> Signed-off-by: drolando-stripe <102543345+drolando-stripe@users.noreply.github.com>
|
Thank you for your contribution! 🙏 Please understand that we will do our best to review your PR and give you feedback as soon as possible, but please bear with us if it takes a little longer as expected. While you are waiting, make sure to:
Once the initial tests are successful, a KEDA member will ensure that the e2e tests are run. Once the e2e tests have been successfully completed, the PR may be merged at a later date. Please be patient. Learn more about our contribution guide. |
✅ Snyk checks have passed. No issues have been found so far.
💻 Catch issues earlier using the plugins for VS Code, JetBrains IDEs, Visual Studio, and Eclipse. |
…nse metric Add labels to keda_scaler_empty_upstream_responses_total so operators can identify which scaler is producing empty upstream responses. Also add e2e tests for both Prometheus and OpenTelemetry metric backends. Signed-off-by: Ali Aqel <aliaqel@stripe.com>
Signed-off-by: Ali Aqel <aliaqel@stripe.com>
5840abf to
2edc55e
Compare
wozniakjan
left a comment
There was a problem hiding this comment.
minor nits below for your consideration
| logger: logger, | ||
| scalableObjectName: config.ScalableObjectName, | ||
| scalableObjectNS: config.ScalableObjectNamespace, | ||
| triggerName: config.TriggerName, |
There was a problem hiding this comment.
triggerName can be empty string, would it make sense to add triggerIndex too? that should be readily available from the same config struct
There was a problem hiding this comment.
or metric_name?
There was a problem hiding this comment.
adding metric_name
There was a problem hiding this comment.
trigger index is kind of useless because it's just a number and when you have 2000+ scaled objects in a cluster, it doesn't give any userful info
I wonder if we should explore removing it from other metrics?
| if s.metadata.IgnoreNullValues { | ||
| return 0, nil | ||
| } | ||
| metricscollector.RecordEmptyUpstreamResponse(s.scalableObjectNS, s.scalableObjectName, s.triggerName) |
There was a problem hiding this comment.
is it desired to record the metric even when IgnoreNullValues is set to true? The value of IgnoreNullValues can be added as yet another label so users can filter which one they care about. This could surface broken prometheus queries that have been masked by IgnoreNullValues
There was a problem hiding this comment.
adding ignorenullvalues as a label
|
/run-e2e prometheus |
There was a problem hiding this comment.
Pull request overview
Adds a dedicated metric to track Prometheus-scaler empty query responses so operators can distinguish this failure mode from generic scaler errors, along with sequential e2e coverage validating labels/attributes for both Prometheus and OpenTelemetry metric pipelines.
Changes:
- Add
keda_scaler_empty_upstream_responses_total(Prometheus) andkeda.scaler.empty.upstream.responses(OpenTelemetry) counters withnamespace/scaledObject/triggerNamelabeling. - Record the counter from the Prometheus scaler when queries return empty results (when
ignoreNullValues=false). - Extend sequential e2e tests to deploy a Prometheus instance returning empty results and assert the metric is emitted with expected labels.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
pkg/scalers/prometheus_scaler.go |
Records the new “empty upstream response” metric when Prometheus query results/value arrays are empty. |
pkg/metricscollector/prommetrics.go |
Defines/registers the new Prometheus CounterVec and exposes a recorder method. |
pkg/metricscollector/opentelemetry.go |
Defines/registers the new OTel counter and records it with attributes. |
pkg/metricscollector/metricscollectors.go |
Extends the collector interface and adds a dispatcher function for the new metric. |
tests/sequential/prometheus_metrics/prometheus_metrics_test.go |
Adds sequential test validating the Prometheus-exported metric and labels. |
tests/sequential/opentelemetry_metrics/opentelemetry_metrics_test.go |
Adds sequential test validating the Prometheus-exported view of OTel metric and labels. |
CHANGELOG.md |
Documents the new Prometheus scaler metric in the Unreleased Improvements section. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
- Rename scaledObject label -> scaledResource, add metricName, resourceType, and ignoreNullValues labels to keda_scaler_empty_upstream_responses_total - Record metric unconditionally (before IgnoreNullValues guard) so masked empty responses are also visible, with ignoreNullValues label for filtering - Fix CHANGELOG: reference issue kedacore#7062 instead of PR kedacore#7060, capitalize Prometheus - Update e2e tests to assert new labels Signed-off-by: Ali Aqel <aliaqel@stripe.com>
|
@rickbrouwer ready for 2nd review. |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.
Comments suppressed due to low confidence (1)
pkg/scalers/prometheus_scaler.go:273
ExecutePromQuerytreats any response withlen(result.Data.Result)==0as an "empty upstream response" and now records the empty-upstream metric. However the Prometheus HTTP API can return HTTP 200 with JSONstatus: "error"(e.g., invalid query), wheredata.resultwill also be empty in this struct. That would incorrectly increment the empty-upstream counter for query errors. Consider checkingresult.Status(and/or modeling theerrorType/errorfields) and returning an error before recording the empty-upstream metric unlessstatus == "success".
var v float64 = -1
// allow for zero element or single element result sets
if len(result.Data.Result) == 0 {
metricscollector.RecordEmptyUpstreamResponse(s.scalableObjectNS, s.scalableObjectName, s.triggerName, s.metricName, s.resourceType, s.metadata.IgnoreNullValues)
if s.metadata.IgnoreNullValues {
return 0, nil
}
return -1, fmt.Errorf("prometheus metrics 'prometheus' target may be lost, the result is empty")
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| time.Sleep(15 * time.Second) | ||
|
|
||
| family := fetchAndParsePrometheusMetrics(t, fmt.Sprintf("curl --insecure %s", kedaOperatorCollectorPrometheusExportURL)) | ||
| val, ok := family["keda_scaler_empty_upstream_responses_total"] | ||
| assert.True(t, ok, "keda_scaler_empty_upstream_responses_total not available") | ||
| if ok { |
There was a problem hiding this comment.
This test relies on a fixed time.Sleep(15 * time.Second) before scraping metrics. On slower clusters/CI runs the metric may not be exported yet, causing intermittent failures. Consider polling until keda_scaler_empty_upstream_responses_total is present with the expected labels (with an overall timeout), similar to how the Prometheus-metrics e2e test waits for metrics to appear.
|
Could you add another PR to docs adding the otel metric? I see that prometeus one is already merged, but otel is pending 😅 |
|
Signed-off-by: Ali Aqel <aliaqel@stripe.com>
Signed-off-by: aliaqel-stripe <120822631+aliaqel-stripe@users.noreply.github.com>
|
/run-e2e prometheus |
We'd like to have a way to monitor the number of Keda errors due to empty responses from prometheus after enabling the
ignoreNullValuesflag for most of our prometheus triggers.Right now this error gets logged but the error metric that Keda emits is generic and doesn't differentiate by error type.
The metric
keda_scaler_empty_upstream_responses_totalis labeled withnamespace,scaledObject, andtriggerNameso operators can identify which scaler is producing empty upstream responses.Tests
E2e tests have been added to
tests/sequential/prometheus_metrics/andtests/sequential/opentelemetry_metrics/that deploy a real Prometheus instance and verify the metric is emitted with the correct labels when a query returns an empty result.Checklist
When introducing a new scaler, I agree with the scaling governance policyN/AA PR is opened to update our Helm chart (repo) (if applicable, ie. when deployment manifests are modified)N/AFixes #7062