Add caller-side Nexus operation metrics #10071

Open
S15 wants to merge 4 commits into temporalio:main from S15:nexus-metrics

Conversation

Contributor

@S15 S15 commented Apr 24, 2026

What changed?

NOTE: Recreation of #10026 since that PR cannot be re-opened or have its target branch updated.

New metrics:

Counters

| Metric | Description |
| --- | --- |
| nexus_operation_success_count | Successfully completed operations |
| nexus_operation_failed_count | Failed operations |
| nexus_operation_cancel_count | Cancelled operations |
| nexus_operation_terminate_count | Operations terminated before completion |
| nexus_operation_timeout_count | Operations timed out before completion |

Histograms

| Metric | Description |
| --- | --- |
| nexus_operation_schedule_to_close_latency | Time between schedule and close for sync and async operations |
| nexus_operation_schedule_to_start_latency | Time between schedule and start for sync and async operations |
| nexus_operation_start_to_close_latency | Time between start and close for async operations only |

Labels

All metrics use the following labels:

| Label | Notes |
| --- | --- |
| namespace | |
| nexus_endpoint | |
| nexus_service | Controlled by dynamic config |
| nexus_operation | Controlled by dynamic config |
| workflowType | |

Latency metrics also include:

| Label | Values |
| --- | --- |
| outcome | succeeded, failed, canceled, terminated, timedout |

Timeout Count also includes:

| Label | Values |
| --- | --- |
| timeout_type | StartToClose, ScheduleToStart, ScheduleToClose |
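
To make the label rules above concrete, here is a minimal Go sketch of which label keys each metric would carry. The `labelKeys` helper and the `baseLabels` variable are hypothetical names invented for illustration; they are not part of this PR, and the sketch only restates the tables above.

```go
package main

import (
	"fmt"
	"strings"
)

// baseLabels lists the labels the description says every metric carries.
var baseLabels = []string{
	"namespace", "nexus_endpoint", "nexus_service", "nexus_operation", "workflowType",
}

// labelKeys is a hypothetical helper (not part of the PR) that returns the
// label keys a given metric is expected to carry.
func labelKeys(metric string) []string {
	keys := append([]string{}, baseLabels...)
	switch {
	case strings.HasSuffix(metric, "_latency"):
		// Values: succeeded, failed, canceled, terminated, timedout.
		keys = append(keys, "outcome")
	case metric == "nexus_operation_timeout_count":
		// Values: StartToClose, ScheduleToStart, ScheduleToClose.
		keys = append(keys, "timeout_type")
	}
	return keys
}

func main() {
	for _, m := range []string{
		"nexus_operation_success_count",
		"nexus_operation_timeout_count",
		"nexus_operation_schedule_to_close_latency",
	} {
		fmt.Println(m, labelKeys(m))
	}
}
```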

How did you test it?

  • built
  • run locally and tested manually
  • covered by existing tests
  • added new unit test(s)
  • added new functional test(s)

Potential risks

Metric emission isn't nil-safe, and enrichMetricsHandler could panic if misconfigured.

@S15 S15 marked this pull request as ready for review April 24, 2026 23:54
@S15 S15 requested review from a team as code owners April 24, 2026 23:54
var NexusOperationStartToCloseLatency = metrics.NewTimerDef(
	"nexus_operation_start_to_close_latency",
	metrics.WithDescription("Duration from Nexus Operation started time to completed time. Only emitted for async operations."),
)
Contributor Author

@S15 S15 Apr 24, 2026

@bergundy I think I've addressed everything mentioned in the previous version of this PR, except I haven't done anything with the names here.

#10026 (comment)

I agree, I don't really care for the names here, but I think it is good to be consistent with activities.

I double-checked, and activity uses:

activity_success
activity_fail
activity_cancel
activity_terminate
activity_timeout

and I have matched those.

@S15 S15 requested review from bergundy and stephanos April 24, 2026 23:55
Comment thread chasm/lib/nexusoperation/metrics.go
Comment thread chasm/lib/nexusoperation/operation.go
return ctx.MetricsHandler().WithTags(tags...), nil
}

func (o *Operation) emitOnSucceededMetrics(handler metrics.Handler, closeTime time.Time) {
Contributor

The 5 methods seem almost identical, how about something like

func (o *Operation) emitTerminalMetrics(handler metrics.Handler, closeTime time.Time, status nexusoperationpb.OperationStatus, counter metrics.CounterDef, extraTags ...metrics.Tag)

Contributor

And I'd probably merge that with emitLatencyMetrics then.

Contributor Author

> The 5 methods seem almost identical, how about something like
>
> func (o *Operation) emitTerminalMetrics(handler metrics.Handler, closeTime time.Time, status nexusoperationpb.OperationStatus, counter metrics.CounterDef, extraTags ...metrics.Tag)

Note that the metrics package's counterDefinition type is unexported. We could instead pass the definition's .With function as a func(metrics.Handler) metrics.CounterIface.
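
A rough, self-contained sketch of that idea follows: one terminal-metrics helper that takes the counter as a function value and also records the schedule-to-close latency. Everything here is a simplified stand-in and an assumption: `Tag`, `Counter`, `Timer`, `Handler`, `memHandler`, `timeoutCounter` and `demo` are hypothetical, are not the real `metrics` package API, and the helper's signature is only one possible shape.

```go
package main

import (
	"fmt"
	"time"
)

// Tag, Counter, Timer and Handler are hypothetical, simplified stand-ins for
// the server's metrics types; they are not the real package API.
type Tag struct{ Key, Value string }

type Counter interface{ Record(n int64) }

type Timer interface{ Record(d time.Duration) }

type Handler interface {
	WithTags(tags ...Tag) Handler
	Counter(name string) Counter
	Timer(name string) Timer
}

// memHandler keeps every emission in shared maps so the sketch can be inspected.
type memHandler struct {
	tags   []Tag
	counts map[string]int64
	timers map[string][]time.Duration
}

func newMemHandler() *memHandler {
	return &memHandler{counts: map[string]int64{}, timers: map[string][]time.Duration{}}
}

// key renders a metric name plus its tags, e.g. "name{outcome=failed}".
func (h *memHandler) key(name string) string {
	s := name + "{"
	for i, t := range h.tags {
		if i > 0 {
			s += ","
		}
		s += t.Key + "=" + t.Value
	}
	return s + "}"
}

func (h *memHandler) WithTags(tags ...Tag) Handler {
	merged := append(append([]Tag{}, h.tags...), tags...)
	return &memHandler{tags: merged, counts: h.counts, timers: h.timers}
}

func (h *memHandler) Counter(name string) Counter { return counterRec{h: h, name: name} }
func (h *memHandler) Timer(name string) Timer     { return timerRec{h: h, name: name} }

type counterRec struct {
	h    *memHandler
	name string
}

func (c counterRec) Record(n int64) { c.h.counts[c.h.key(c.name)] += n }

type timerRec struct {
	h    *memHandler
	name string
}

func (t timerRec) Record(d time.Duration) {
	k := t.h.key(t.name)
	t.h.timers[k] = append(t.h.timers[k], d)
}

// emitTerminalMetrics is the hypothetical consolidated helper: it increments the
// outcome-specific counter (passed as a function, since the definition type is
// unexported) and records schedule-to-close latency tagged with the outcome.
func emitTerminalMetrics(h Handler, scheduled, closed time.Time, outcome string,
	counter func(Handler) Counter, extraTags ...Tag) {
	counter(h.WithTags(extraTags...)).Record(1)
	h.WithTags(Tag{Key: "outcome", Value: outcome}).
		Timer("nexus_operation_schedule_to_close_latency").
		Record(closed.Sub(scheduled))
}

// timeoutCounter plays the role of a counter definition's With function.
func timeoutCounter(h Handler) Counter { return h.Counter("nexus_operation_timeout_count") }

// demo emits one timed-out operation and returns the handler for inspection.
func demo() *memHandler {
	h := newMemHandler()
	scheduled := time.Unix(0, 0)
	emitTerminalMetrics(h, scheduled, scheduled.Add(3*time.Second), "timedout",
		timeoutCounter, Tag{Key: "timeout_type", Value: "ScheduleToClose"})
	return h
}

func main() {
	h := demo()
	fmt.Println(h.counts)
	fmt.Println(h.timers)
}
```

Passing a function value keeps the helper independent of the unexported definition type, at the cost of a slightly less self-describing call site.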

Comment thread chasm/lib/nexusoperation/library.go
Comment thread chasm/lib/nexusoperation/metrics.go
package nexusoperation

import (
"context"
Contributor

I think we should verify at least one of the states emits the right metrics now. You can use metricstest.NewCaptureHandler for that.

Comment thread chasm/lib/nexusoperation/operation_statemachine.go
o.OperationToken = event.OperationToken

// Emit schedule-to-start latency on the transition to started.
metricsHandler, err := o.enrichMetricsHandler(ctx)
Contributor

I just noticed that if there's a problem committing the transition, we would emit the metric multiple times during the retries. I don't know if that's solvable right now in CHASM, but wanted to point it out as a potential issue.

if err != nil {
return err
}
NexusOperationScheduleToStartLatency.With(metricsHandler).Record(startTime.Sub(o.GetScheduledTime().AsTime()))
Contributor

I'm trying to wrap my head around the fact that emitLatencyMetrics also emits NexusOperationScheduleToStartLatency. Wouldn't we be double-counting?

Contributor Author

emitLatencyMetrics only emits NexusOperationScheduleToStartLatency if startedTime is nil, which implies we never transitioned to started. As I understand it, that happens for sync operations (I think it could also happen when completion arrives before start).
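
To illustrate why the two emission sites should not overlap, here is a small self-contained model of the behaviour described in this reply. It is a sketch under assumptions, not the PR's code: the `operation` and `sample` types and the `onStarted`, `onClosed`, `runSync` and `runAsync` names are hypothetical, and using the close time as the end point of schedule-to-start for a never-started operation is a guess the thread does not confirm.

```go
package main

import (
	"fmt"
	"time"
)

// sample is one recorded latency value (hypothetical stand-in for a timer emit).
type sample struct {
	metric string
	value  time.Duration
}

// operation is a hypothetical, trimmed-down model of the operation state.
type operation struct {
	scheduled time.Time
	started   *time.Time // nil when the operation never transitioned to started
	samples   []sample
}

func (o *operation) record(metric string, d time.Duration) {
	o.samples = append(o.samples, sample{metric: metric, value: d})
}

// onStarted models the transition to started, where schedule-to-start is emitted.
func (o *operation) onStarted(at time.Time) {
	o.started = &at
	o.record("schedule_to_start", at.Sub(o.scheduled))
}

// onClosed models emitLatencyMetrics as described in the reply above.
func (o *operation) onClosed(at time.Time) {
	o.record("schedule_to_close", at.Sub(o.scheduled))
	if o.started == nil {
		// Never started (for example a sync operation): emit schedule-to-start
		// here. Using the close time as the end point is an assumption.
		o.record("schedule_to_start", at.Sub(o.scheduled))
		return
	}
	// Started earlier, so schedule-to-start was already emitted; only the
	// async-only start-to-close latency is added.
	o.record("start_to_close", at.Sub(*o.started))
}

// count returns how many samples were recorded for a metric.
func (o *operation) count(metric string) int {
	n := 0
	for _, s := range o.samples {
		if s.metric == metric {
			n++
		}
	}
	return n
}

// runSync closes an operation that never reached the started state.
func runSync() *operation {
	t0 := time.Unix(0, 0)
	o := &operation{scheduled: t0}
	o.onClosed(t0.Add(2 * time.Second))
	return o
}

// runAsync starts an operation and closes it later.
func runAsync() *operation {
	t0 := time.Unix(0, 0)
	o := &operation{scheduled: t0}
	o.onStarted(t0.Add(1 * time.Second))
	o.onClosed(t0.Add(5 * time.Second))
	return o
}

func main() {
	fmt.Println("sync:", runSync().samples)
	fmt.Println("async:", runAsync().samples)
}
```

In this model each operation contributes exactly one schedule-to-start sample on either path, which is the property the nil check is meant to guarantee.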

Comment thread chasm/lib/nexusoperation/operation_statemachine.go
2 participants