Add caller-side Nexus operation metrics#10071
Conversation
| var NexusOperationStartToCloseLatency = metrics.NewTimerDef( | ||
| "nexus_operation_start_to_close_latency", | ||
| metrics.WithDescription("Duration from Nexus Operation scheduled time to completed time. Only emitted for async operations."), | ||
| ) |
There was a problem hiding this comment.
@bergundy I think I've addressed everything mentioned in the previous version of this PR, except I haven't done anything with the names here.
I agree, I don't really care for the names here, but I think it is good to be consistent with activities.
I double checked, and and activity uses:
activity_success
activity_fail
activity_cancel
activity_terminate
activity_timeout
and I have matched those.
| return ctx.MetricsHandler().WithTags(tags...), nil | ||
| } | ||
|
|
||
| func (o *Operation) emitOnSucceededMetrics(handler metrics.Handler, closeTime time.Time) { |
There was a problem hiding this comment.
The 5 methods seem almost identical, how about sth like
func (o *Operation) emitTerminalMetrics(handler metrics.Handler, closeTime time.Time, status nexusoperationpb.OperationStatus, counter metrics.CounterDef, extraTags ...metrics.Tag)There was a problem hiding this comment.
And I'd probably merge that with emitLatencyMetrics then.
There was a problem hiding this comment.
The 5 methods seem almost identical, how about sth like
func (o *Operation) emitTerminalMetrics(handler metrics.Handler, closeTime time.Time, status nexusoperationpb.OperationStatus, counter metrics.CounterDef, extraTags ...metrics.Tag)
Noting that metrics counterDefinition is unexported. We could do something like passing the .With function as func(metrics.Handler) metrics.CounterIface.
| package nexusoperation | ||
|
|
||
| import ( | ||
| "context" |
There was a problem hiding this comment.
I think we should verify at least one of the states emits the right metrics now. You can use metricstest.NewCaptureHandler for that.
| o.OperationToken = event.OperationToken | ||
|
|
||
| // Emit schedule-to-start latency on the transition to started. | ||
| metricsHandler, err := o.enrichMetricsHandler(ctx) |
There was a problem hiding this comment.
I just noticed that if there's a problem committing the transition, we would be emitting the metric multiple times during the retries. IDK if that's solvable right now in CHASM, but wanted to point that out as a potential issue.
| if err != nil { | ||
| return err | ||
| } | ||
| NexusOperationScheduleToStartLatency.With(metricsHandler).Record(startTime.Sub(o.GetScheduledTime().AsTime())) |
There was a problem hiding this comment.
I'm trying to wrap my head around the fact that emitLatencyMetrics also emits NexusOperationScheduleToStartLatency. Wouldn't be be double-counting?
There was a problem hiding this comment.
emitLatencyMetrics only emits NexusOperationScheduleToStartLatency if startedTime is nil, which implies we never transitioned to started. As I Understand, that happens for sync operations (it could also happen I think w/ completion before start, I think)
What changed?
NOTE: Recreation of #10026 since that PR cannot be re-opened or have its target branch updated.
New metrics:
Counters
nexus_operation_success_countnexus_operation_failed_countnexus_operation_cancel_countnexus_operation_terminate_countnexus_operation_timeout_countHistograms
nexus_operation_schedule_to_close_latencynexus_operation_schedule_to_start_latencynexus_operation_start_to_close_latencyLabels
All metrics use the following labels:
namespacenexus_endpointnexus_servicenexus_operationworkflowTypeLatency metrics also include:
outcomeTimeout Count also includes:
timeout_typeHow did you test it?
Potential risks
Metrics emitting isn't nil safe, and enrich metrics could panic if misconfigured.