feat(CLDSRV-884): Add OpenTelemetry tracing instrumentation #6140
delthas wants to merge 4 commits
Codecov Report
Coverage diff against development/9.4 (1 file with indirect coverage changes):

| | development/9.4 | #6140 | +/- |
|---|---|---|---|
| Coverage | 85.21% | 85.25% | +0.03% |
| Files | 207 | 211 | +4 |
| Lines | 13832 | 13976 | +144 |
| Hits | 11787 | 11915 | +128 |
| Misses | 2045 | 2061 | +16 |
francoisferrand left a comment:
should target latest branch (9.4) instead

      "@opentelemetry/sdk-trace-base": "^2.7.0",
      "@smithy/node-http-handler": "^3.0.0",
    - "arsenal": "git+https://github.com/scality/Arsenal#8.4.2",
    + "arsenal": "git+https://github.com/scality/Arsenal#improvement/ARSN-572/trace-context",

Arsenal is pinned to the branch `improvement/ARSN-572/trace-context` instead of a release tag. Git-based deps must pin to a tag, not a branch: branches are mutable and can break builds without warning. Additionally, this downgrades arsenal from 8.4.2 to 8.3.11, which may introduce regressions.
— Claude Code

    describe('instrumentApiMethod', () => {
        describe('OTEL on', () => {
            afterEach(() => exporter.reset());

Test names using `it()` should start with `should` per project convention (e.g. `should wrap a callback handler and end span on success`).
— Claude Code

    const { isHealthPath } = require('../../../../lib/tracing/healthPaths');

    describe('tracing.isHealthPath', () => {
        it('matches the canonical probe and scrape paths', () => {

Test names using `it()` should start with `should` per project convention.
— Claude Code

    if (saved === undefined) {
        delete process.env.ENABLE_OTEL;
    } else {
        process.env.ENABLE_OTEL = saved;

Test names using `it()` should start with `should` per project convention.
— Claude Code

    const { loadTrustedHosts, extractHost, makeRequestHook } = require('../../../../lib/tracing/trustedHosts');

    describe('tracing.extractHost', () => {

Test names using `it()` should start with `should` per project convention.
— Claude Code

      "@opentelemetry/sdk-trace-base": "^2.7.0",
      "@smithy/node-http-handler": "^3.0.0",
    - "arsenal": "git+https://github.com/scality/Arsenal#8.4.2",
    + "arsenal": "git+https://github.com/scality/Arsenal#improvement/ARSN-572/trace-context",

Arsenal is pinned to a branch (`improvement/ARSN-572/trace-context`) instead of a release tag. The PR description acknowledges this is temporary, but branch pins are a deployment risk: the resolved commit can change at any time, breaking reproducible builds. Consider pinning to a specific commit SHA until the ARSN-572 tag is cut.
— Claude Code

    describe('OTEL on', () => {
        afterEach(() => exporter.reset());

        it('wraps a callback handler and ends span on success', done => {

Test names using `it()` should start with `should` per project convention. For example: `it('should wrap a callback handler and end span on success', ...)`.
— Claude Code

    const { isHealthPath } = require('../../../../lib/tracing/healthPaths');

    describe('tracing.isHealthPath', () => {
        it('matches the canonical probe and scrape paths', () => {

Test names using `it()` should start with `should` per project convention.
— Claude Code

        }
    });

    it('returns false when ENABLE_OTEL is unset', () => {

Test names using `it()` should start with `should` per project convention.
— Claude Code

    const { loadTrustedHosts, extractHost, makeRequestHook } = require('../../../../lib/tracing/trustedHosts');

    describe('tracing.extractHost', () => {
        it('extracts hostname from a plain host string', () => {

Test names using `it()` should start with `should` per project convention.
— Claude Code
Review by Claude Code
Summary
Add OpenTelemetry tracing instrumentation to cloudserver, gated behind `ENABLE_OTEL=true`. When the flag is unset, no `@opentelemetry/*` package is loaded, so there is zero overhead off the OTEL path.

The four atomic commits:

1. feat: add OpenTelemetry tracing with trust boundaries and probe filtering
   - `ParentBasedSampler({ root: TraceIdRatioBasedSampler(ratio) })` so inbound `traceparent` `sampled=1` from NGINX/Beyla is honored end-to-end (was being re-sampled by a plain ratio sampler before)
   - `auto-instrumentations-node` bundle (which pulled ~36 unused instrumentations: pg, mysql, kafkajs, cassandra, oracle, etc.):
     - `instrumentation-http` with `ignoreIncomingRequestHook` (drops k8s probes + OPTIONS) and `requestHook` (strips `traceparent`/`tracestate` on outbound requests to hosts not in the trusted allowlist; client span is preserved and tagged `scality.trace.suppressed=true`)
     - `instrumentation-mongodb` with low-cardinality settings (no collection names in span names, no captured query payloads)
     - `instrumentation-ioredis` with `requireParentSpan: true` (no orphan spans from background stats/rate-limit jobs)
   - `buildTrustedHosts(config)` derives the allowlist from cloudserver's own Config (vaultd, dataClient, metadataClient, pfsClient, cdmi, bucketd, KMIP, KMS, scuba, utapi, mongodb replica set, backbeat, managementAgent + `hdclient`/`sproxyd` bootstrap entries from `locationConstraints` for direct Scality storage connectors), plus `PUSH_ENDPOINT`/`MANAGEMENT_ENDPOINT` env vars and loopback aliases. Unit test asserts every config host shape is included so the derivation stays honest.
   - `lib/tracing/healthPaths.js`: `/live`, `/ready`, `/_/healthcheck`, `/_/healthcheck/deep`, `/metrics` produce no spans
   - `diag.setLogger(new DiagConsoleLogger(), DiagLogLevel.WARN)` so SDK errors (export failures, malformed `traceparent`, etc.) surface instead of disappearing
   - `attributeValueLengthLimit: 4096`, `attributeCountLimit: 128`, `eventCountLimit: 128`, `linkCountLimit: 128` to bound memory under pathological cases
   - `logRecordProcessors: []`, `metricReaders: []`: explicit traces-only; without these, NodeSDK silently spins up OTLP log + metric exporters
2. feat: flush OTEL on shutdown
   - `shutdownOtel()` wired into `S3Server.cleanUp()`'s shutdown chain (between `server.close()` and `process.exit(0)`), capped at 5 s via `Promise.race` so an unreachable collector can't block past Kubernetes' default 30 s `terminationGracePeriodSeconds`
   - `caughtExceptionShutdown()` also attempts a best-effort flush before `process.exit(1)` on the non-cluster path
   - `.finally()` guard so any unexpected promise rejection in the shutdown chain still exits the process
   - `@opentelemetry/instrumentation-http` already extracts `traceparent` and creates the server span as a child of the remote parent. A manual extract on top would replace the active context with the (non-recording) remote parent and demote downstream `api.<handler>` spans to siblings instead of children of the HTTP server span. Removed from the original draft.
3. feat: instrument all S3 API handlers with OTEL spans
   - `lib/instrumentation/simple.js` exports a single helper `instrumentApiMethod(handler, methodName)` that wraps an S3 handler in an `api.<methodName>` span owning the entire handler execution (auth, body parsing, metadata I/O, data path, finalizers). Auto-instrumentation spans (HTTP, MongoDB, ioredis) nest beneath. ~70 distinct span names total, well within trace-backend limits.
   - `lib/api/api.js` applies the wrapper via an `Object.entries(api)` loop with a `NON_HANDLER_KEYS` skip set (`callApiMethod`, `checkAuthResults`, `handleAuthorizationResults`). New S3 handlers added to the literal are auto-instrumented; no per-handler boilerplate.
   - `cloudserver.error_code` on the error path
4. chore: bump arsenal to ARSN-572 PoC branch for e2e trace context testing
   - `traceContext` plumbing to MongoDB metadata writes: object oplog entries carry the trace context that caused them, so async consumers (Backbeat, Sorbet, lifecycle) can continue the trace across the oplog/queue boundary
   - `8.3.x` release tag once ARSN-572 ships

Origin

Extracted and cleaned up from William Lardier's `user/test/wlardier/servicemesh-2` branch (based on `development/9.0`, July–Sep 2025), which mixed OTEL with performance optimizations and debug artifacts. This PR contains only the OTEL instrumentation, rebased onto `development/9.3`. Dropped:

- `lib/otelContextPropagation.js` (dead code; never imported)
- `MOCK_DOAUTH` / `lib/api/apiUtils/mock/backendMocks.js` (production-dangerous auth-result caching, unrelated to tracing)
- `console.log()` debug artifacts, commented-out span code
- `monitorLatency`, `releaseRequestContexts`, arsenal perf pin: out of scope; will land separately if needed

Tests

29 unit tests (22 in `tests/unit/lib/otel.spec.js`, 7 in `tests/unit/lib/instrumentationSimple.spec.js`) covering `extractHost`, `buildTrustedHosts` (full config snapshot + `locationConstraints` connector enumeration), `makeRequestHook` (outbound trust check, inbound `IncomingMessage` skip), `isHealthPath`, and the full lifecycle of `instrumentApiMethod` (callback success/error, double-end guard, async resolve/reject, sync throw, OTEL-disabled bypass).

Context

- `@opentelemetry/instrumentation-http`'s automatic outbound `traceparent` propagation

Issue: CLDSRV-884
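
To make the shutdown-flush item in the summary concrete, here is a minimal sketch of capping a flush with `Promise.race`. It is not the PR's `shutdownOtel()` implementation: `flushWithTimeout`, its string return values, and the `flush` callback are hypothetical names chosen for illustration. Only the 5 s cap and the use of `Promise.race` come from the description.

```javascript
'use strict';

// Hypothetical helper (not from the PR): runs `flush()` but never waits
// longer than `timeoutMs`. Resolves with 'flushed', 'flush-failed', or
// 'timeout'; it never rejects, so a caller's exit path cannot be skipped.
function flushWithTimeout(flush, timeoutMs = 5000) {
    let timer;
    const timeout = new Promise(resolve => {
        timer = setTimeout(() => resolve('timeout'), timeoutMs);
    });
    const flushed = Promise.resolve()
        // Wrapping in .then() also catches a synchronous throw from flush().
        .then(() => flush())
        .then(() => 'flushed', () => 'flush-failed');
    // Whichever settles first wins; the timer is cleared either way so it
    // cannot keep the event loop alive after a fast flush.
    return Promise.race([flushed, timeout]).finally(() => clearTimeout(timer));
}

// Illustrative use in a shutdown chain (assumes an `sdk` object exposing a
// promise-returning shutdown(), as the OpenTelemetry NodeSDK does):
//   flushWithTimeout(() => sdk.shutdown()).finally(() => process.exit(0));
```

Because the helper always resolves, a trailing `.finally(() => process.exit(0))` plays the same role as the `.finally()` guard mentioned in the summary: an unreachable collector or a failing exporter delays the exit by at most the cap, well under the 30 s grace period.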