feat(CLDSRV-884): Add OpenTelemetry tracing instrumentation #6140

Open
delthas wants to merge 4 commits into development/9.4 from improvement/CLDSRV-884/otel-instrumentation

Conversation

@delthas
Contributor

@delthas delthas commented Apr 2, 2026

Summary

Add OpenTelemetry tracing instrumentation to cloudserver, gated behind ENABLE_OTEL=true. When the flag is unset, no @opentelemetry/* package is loaded, so the non-OTEL path carries no tracing overhead.

The four atomic commits:

  1. feat: add OpenTelemetry tracing with trust boundaries and probe filtering

    • NodeSDK with OTLP/HTTP trace exporter, default 1% sampling
    • ParentBasedSampler({ root: TraceIdRatioBasedSampler(ratio) }) so inbound traceparent sampled=1 from NGINX/Beyla is honored end-to-end (was being re-sampled by a plain ratio sampler before)
    • Three explicit instrumentation packages, declared as direct deps — no auto-instrumentations-node bundle (which pulled ~36 unused instrumentations: pg, mysql, kafkajs, cassandra, oracle, etc.):
      • instrumentation-http with ignoreIncomingRequestHook (drops k8s probes + OPTIONS) and requestHook (strips traceparent/tracestate on outbound requests to hosts not in the trusted allowlist; client span is preserved and tagged scality.trace.suppressed=true)
      • instrumentation-mongodb with low-cardinality settings (no collection names in span names, no captured query payloads)
      • instrumentation-ioredis with requireParentSpan: true (no orphan spans from background stats/rate-limit jobs)
    • buildTrustedHosts(config) derives the allowlist from cloudserver's own Config (vaultd, dataClient, metadataClient, pfsClient, cdmi, bucketd, KMIP, KMS, scuba, utapi, mongodb replica set, backbeat, managementAgent + hdclient/sproxyd bootstrap entries from locationConstraints for direct Scality storage connectors), plus PUSH_ENDPOINT/MANAGEMENT_ENDPOINT env vars and loopback aliases. A unit test asserts that every config host shape is included, so the allowlist cannot silently drift from Config.
    • Probe/scrape filter (lib/tracing/healthPaths.js): /live, /ready, /_/healthcheck, /_/healthcheck/deep, /metrics produce no spans
    • diag.setLogger(new DiagConsoleLogger(), DiagLogLevel.WARN) so SDK errors (export failures, malformed traceparent, etc.) surface instead of disappearing
    • Span limits: attributeValueLengthLimit: 4096, attributeCountLimit: 128, eventCountLimit: 128, linkCountLimit: 128 to bound memory under pathological cases
    • logRecordProcessors: [], metricReaders: [] — explicit traces-only; without these, NodeSDK silently spins up OTLP log + metric exporters
  2. feat: flush OTEL on shutdown

    • shutdownOtel() wired into S3Server.cleanUp()'s shutdown chain (between server.close() and process.exit(0)), capped at 5 s via Promise.race so an unreachable collector can't block past Kubernetes' default 30 s terminationGracePeriodSeconds
    • caughtExceptionShutdown() also attempts a best-effort flush before process.exit(1) on the non-cluster path
    • .finally() guard so any unexpected promise rejection in the shutdown chain still exits the process
    • Inbound trace-context extraction is intentionally NOT done manually here: @opentelemetry/instrumentation-http already extracts traceparent and creates the server span as a child of the remote parent. A manual extract on top would replace the active context with the (non-recording) remote parent and demote downstream api.<handler> spans to siblings instead of children of the HTTP server span. That manual extract was removed from the original draft.
  3. feat: instrument all S3 API handlers with OTEL spans

    • lib/instrumentation/simple.js exports a single helper instrumentApiMethod(handler, methodName) that wraps an S3 handler in an api.<methodName> span owning the entire handler execution (auth, body parsing, metadata I/O, data path, finalizers). Auto-instrumentation spans (HTTP, MongoDB, ioredis) nest beneath. ~70 distinct span names total — well within trace-backend limits.
    • lib/api/api.js applies the wrapper via Object.entries(api) loop with a NON_HANDLER_KEYS skip set (callApiMethod, checkAuthResults, handleAuthorizationResults). New S3 handlers added to the literal are auto-instrumented; no per-handler boilerplate.
    • The wrapper handles callback, Promise, and sync return shapes, guards against ending a span more than once, and sets cloudserver.error_code on the error path
  4. chore: bump arsenal to ARSN-572 PoC branch for e2e trace context testing

    • Temporary: pins arsenal at the ARSN-572 PoC branch (scality/Arsenal#2611), which adds traceContext plumbing to MongoDB metadata writes (object oplog entries carry the trace context that caused them, so async consumers — Backbeat, Sorbet, lifecycle — can continue the trace across the oplog/queue boundary)
    • To be reverted to a real 8.3.x release tag once ARSN-572 ships
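
The bounded flush described in commit 2 can be pictured with a small sketch. This is not the PR's code: flushWithDeadline is a hypothetical helper, and the commented usage line assumes an sdk object whose shutdown() returns a Promise. It only illustrates capping a flush with Promise.race so that an unreachable collector cannot delay process exit.

```javascript
'use strict';

// Hypothetical helper (not from the PR): settles with 'flushed', 'failed',
// or 'timeout', whichever comes first. It never rejects, so the caller can
// always continue to process.exit().
function flushWithDeadline(flush, ms) {
    let timer;
    const deadline = new Promise(resolve => {
        timer = setTimeout(() => resolve('timeout'), ms);
    });
    // Run flush inside a promise chain so a sync throw and a rejection
    // are both converted into the 'failed' value.
    const attempt = Promise.resolve()
        .then(flush)
        .then(() => 'flushed', () => 'failed');
    return Promise.race([attempt, deadline])
        .finally(() => clearTimeout(timer)); // do not keep the event loop alive
}

// Assumed usage in a shutdown chain (sdk is an assumption, not defined here):
// flushWithDeadline(() => sdk.shutdown(), 5000).finally(() => process.exit(0));
```

Clearing the timer in finally() matters in the success case: otherwise a pending 5 s timer would itself keep the process alive after a fast flush.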

Origin

Extracted and cleaned up from William Lardier's user/test/wlardier/servicemesh-2 branch (based on development/9.0, July–Sep 2025), which mixed OTEL with performance optimizations and debug artifacts. This PR contains only the OTEL instrumentation, rebased onto development/9.3. Dropped:

  • lib/otelContextPropagation.js (dead code; never imported)
  • MOCK_DOAUTH / lib/api/apiUtils/mock/backendMocks.js (production-dangerous auth-result caching, unrelated to tracing)
  • ~15 console.log() debug artifacts, commented-out span code
  • All performance optimizations (manual GC, monitorLatency, releaseRequestContexts, arsenal perf pin) — out of scope; will land separately if needed

Tests

29 unit tests (22 in tests/unit/lib/otel.spec.js, 7 in tests/unit/lib/instrumentationSimple.spec.js) covering extractHost, buildTrustedHosts (full config snapshot + locationConstraints connector enumeration), makeRequestHook (outbound trust check, inbound IncomingMessage skip), isHealthPath, and the full lifecycle of instrumentApiMethod (callback success/error, double-end guard, async resolve/reject, sync throw, OTEL-disabled bypass).
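
To make the instrumentApiMethod cases listed above concrete, here is a minimal sketch of a wrapper with the same lifecycle. It is an illustration under assumptions, not the code in lib/instrumentation/simple.js: wrapHandler and its injected startSpan factory are hypothetical (injection keeps the sketch runnable without any @opentelemetry package), and the fallback used for the error code value is a guess.

```javascript
'use strict';

// Hypothetical stand-in for instrumentApiMethod. startSpan(name) must return
// an object exposing setAttribute(key, value) and end().
function wrapHandler(startSpan, handler, methodName) {
    return function wrapped(...args) {
        const span = startSpan(`api.${methodName}`);
        let ended = false;
        const finish = err => {
            if (ended) {
                return; // end-once guard: later calls are ignored
            }
            ended = true;
            if (err) {
                // Assumption: the real attribute value may be derived differently.
                span.setAttribute('cloudserver.error_code',
                    err.code || err.name || 'InternalError');
            }
            span.end();
        };

        // Callback shape: wrap the last function argument, if there is one.
        const cbIndex = args.findLastIndex(arg => typeof arg === 'function');
        if (cbIndex !== -1) {
            const callback = args[cbIndex];
            args[cbIndex] = (err, ...rest) => {
                finish(err);
                return callback(err, ...rest);
            };
        }

        let result;
        try {
            result = handler.apply(this, args);
        } catch (err) {
            finish(err); // sync throw
            throw err;
        }
        if (result && typeof result.then === 'function') {
            // Promise shape: observe settlement; the caller still gets `result`.
            result.then(() => finish(), finish);
        } else if (cbIndex === -1) {
            finish(); // plain sync return with no callback to wait for
        }
        return result;
    };
}
```

Because finish() is shared by all three paths, a handler that both invokes its callback and returns a Promise still ends its span exactly once.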

Context

  • Parent investigation: OS-1072
  • Companion arsenal change: scality/Arsenal#2611 (ARSN-572) — to be merged before this PR can lose its temporary branch pin
  • Scality ADR mandates OpenTelemetry across all products. The storage layer (hdcontroller 1.12+ / hyperiod) is already OTEL-instrumented; this PR makes cloudserver complete the chain via @opentelemetry/instrumentation-http's automatic outbound traceparent propagation
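
Since outbound traceparent propagation is automatic, the trust check from commit 1 decides which hosts actually receive the header. The sketch below shows one way such a hook could look. It reuses the names extractHost and makeRequestHook from the test list, but the bodies are assumptions written for illustration: the real implementations in lib/tracing/trustedHosts.js may differ, and detecting inbound requests by the absence of removeHeader() is a simplification.

```javascript
'use strict';

// Assumed behavior: reduce "host", "host:port", or a full URL to a
// lowercase hostname; return null when the value cannot be parsed.
function extractHost(value) {
    if (typeof value !== 'string' || value === '') {
        return null;
    }
    try {
        const url = new URL(value.includes('://') ? value : `http://${value}`);
        return url.hostname.toLowerCase();
    } catch (err) {
        return null;
    }
}

// trustedHosts is a Set of hostnames. The returned function has the
// (span, request) shape of an HTTP instrumentation request hook.
function makeRequestHook(trustedHosts) {
    return function requestHook(span, request) {
        // Inbound messages expose no removeHeader(); leave them untouched.
        if (!request || typeof request.removeHeader !== 'function') {
            return;
        }
        const host = extractHost(request.getHeader('host'));
        if (host !== null && trustedHosts.has(host)) {
            return; // trusted peer: keep the propagated context
        }
        // Untrusted or unknown peer: drop the context but keep the client span.
        request.removeHeader('traceparent');
        request.removeHeader('tracestate');
        span.setAttribute('scality.trace.suppressed', true);
    };
}
```

Treating an unparseable host as untrusted is the conservative default here: a header is stripped unless the destination is positively matched against the allowlist.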

Issue: CLDSRV-884

@bert-e
Contributor

bert-e commented Apr 2, 2026

Hello delthas,

My role is to assist you with the merge of this
pull request. Please type @bert-e help to get information
on this process, or consult the user documentation.

Available options
| name | description | privileged | authored |
| --- | --- | --- | --- |
| /after_pull_request | Wait for the given pull request id to be merged before continuing with the current one. | | |
| /bypass_author_approval | Bypass the pull request author's approval | | |
| /bypass_build_status | Bypass the build and test status | | |
| /bypass_commit_size | Bypass the check on the size of the changeset TBA | | |
| /bypass_incompatible_branch | Bypass the check on the source branch prefix | | |
| /bypass_jira_check | Bypass the Jira issue check | | |
| /bypass_peer_approval | Bypass the pull request peers' approval | | |
| /bypass_leader_approval | Bypass the pull request leaders' approval | | |
| /approve | Instruct Bert-E that the author has approved the pull request. | | ✍️ |
| /create_pull_requests | Allow the creation of integration pull requests. | | |
| /create_integration_branches | Allow the creation of integration branches. | | |
| /no_octopus | Prevent Wall-E from doing any octopus merge and use multiple consecutive merge instead | | |
| /unanimity | Change review acceptance criteria from one reviewer at least to all reviewers | | |
| /wait | Instruct Bert-E not to run until further notice. | | |

Available commands
| name | description | privileged |
| --- | --- | --- |
| /help | Print Bert-E's manual in the pull request. | |
| /status | Print Bert-E's current status in the pull request TBA | |
| /clear | Remove all comments from Bert-E from the history TBA | |
| /retry | Re-start a fresh build TBA | |
| /build | Re-start a fresh build TBA | |
| /force_reset | Delete integration branches & pull requests, and restart merge process from the beginning. | |
| /reset | Try to remove integration branches unless there are commits on them which do not appear on the source branch. | |

Status report is not available.

@bert-e
Contributor

bert-e commented Apr 2, 2026

Request integration branches

Waiting for integration branch creation to be requested by the user.

To request integration branches, please comment on this pull request with the following command:

/create_integration_branches

Alternatively, the /approve and /create_pull_requests commands will automatically
create the integration branches.

@codecov

codecov Bot commented Apr 2, 2026

Codecov Report

❌ Patch coverage is 88.18898% with 30 lines in your changes missing coverage. Please review.
✅ Project coverage is 85.25%. Comparing base (3f7ce59) to head (ec27b72).
✅ All tests successful. No failed tests found.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| lib/server.js | 57.14% | 12 Missing ⚠️ |
| lib/api/api.js | 89.47% | 10 Missing ⚠️ |
| lib/tracing/index.js | 86.36% | 6 Missing ⚠️ |
| lib/instrumentation/simple.js | 96.29% | 2 Missing ⚠️ |

Additional details and impacted files

| Files with missing lines | Coverage Δ |
| --- | --- |
| lib/tracing/healthPaths.js | 100.00% <100.00%> (ø) |
| lib/tracing/trustedHosts.js | 100.00% <100.00%> (ø) |
| lib/instrumentation/simple.js | 96.29% <96.29%> (ø) |
| lib/tracing/index.js | 86.36% <86.36%> (ø) |
| lib/api/api.js | 90.87% <89.47%> (+0.19%) ⬆️ |
| lib/server.js | 76.99% <57.14%> (-2.62%) ⬇️ |

... and 1 file with indirect coverage changes

@@                 Coverage Diff                 @@
##           development/9.4    #6140      +/-   ##
===================================================
+ Coverage            85.21%   85.25%   +0.03%     
===================================================
  Files                  207      211       +4     
  Lines                13832    13976     +144     
===================================================
+ Hits                 11787    11915     +128     
- Misses                2045     2061      +16     
| Flag | Coverage Δ |
| --- | --- |
| file-ft-tests | 68.66% <46.85%> (-0.59%) ⬇️ |
| kmip-ft-tests | 28.15% <40.15%> (-0.12%) ⬇️ |
| mongo-v0-ft-tests | 69.80% <46.85%> (-0.62%) ⬇️ |
| mongo-v1-ft-tests | 69.86% <46.85%> (-0.62%) ⬇️ |
| multiple-backend | 36.59% <41.73%> (-0.21%) ⬇️ |
| sur-tests | 35.45% <41.73%> (-0.19%) ⬇️ |
| sur-tests-inflights | 37.36% <41.73%> (-0.21%) ⬇️ |
| unit | 72.02% <83.07%> (+0.17%) ⬆️ |
| utapi-v2-tests | 34.58% <42.12%> (-0.18%) ⬇️ |

Comment thread lib/otel.js Outdated
Comment thread lib/instrumentation/simple.js Outdated
Comment thread lib/instrumentation/simple.js Outdated
Comment thread lib/instrumentation/simple.js Outdated
@delthas delthas force-pushed the improvement/CLDSRV-884/otel-instrumentation branch from 9c93a1d to 97e63fc Compare April 2, 2026 16:32
Comment thread lib/instrumentation/simple.js Outdated
Comment thread package.json Outdated
Comment thread lib/otel.js Outdated
Comment thread lib/otel.js Outdated
@delthas delthas force-pushed the improvement/CLDSRV-884/otel-instrumentation branch from 97e63fc to f978280 Compare April 2, 2026 16:35
Comment thread lib/instrumentation/simple.js Outdated
Comment thread lib/otel.js Outdated
Comment thread lib/instrumentation/simple.js Outdated
Comment thread lib/instrumentation/simple.js Outdated
@delthas delthas force-pushed the improvement/CLDSRV-884/otel-instrumentation branch from f978280 to 06eea4e Compare April 2, 2026 16:49
Comment thread lib/instrumentation/simple.js Outdated
Comment thread lib/instrumentation/simple.js Outdated
Comment thread lib/instrumentation/simple.js Outdated
Comment thread package.json Outdated
@delthas delthas force-pushed the improvement/CLDSRV-884/otel-instrumentation branch from 06eea4e to 75c3afb Compare April 2, 2026 17:02
Comment thread lib/instrumentation/simple.js Outdated
Comment thread lib/instrumentation/simple.js Outdated
Comment thread package.json Outdated
Comment thread lib/otel.js Outdated
Comment thread lib/instrumentation/simple.js Outdated
@delthas delthas force-pushed the improvement/CLDSRV-884/otel-instrumentation branch from 75c3afb to 207b785 Compare April 13, 2026 13:05
Comment thread lib/instrumentation/simple.js Outdated
Comment thread lib/server.js
Comment thread lib/instrumentation/simple.js
Comment thread index.js
Comment thread package.json
@claude

claude Bot commented May 20, 2026

  • Arsenal is pinned to a branch, not a tag (package.json:41). Must be converted to a release tag before merging to a release branch.

    Otherwise solid work: OTEL gating is clean (zero overhead when disabled), shutdown is race-safe with a 5s deadline, span lifecycle handles callback/async/hybrid/sync-throw shapes correctly with a double-end guard, trusted-host derivation covers all config paths with a test fixture to keep it honest, and probe filtering avoids span pollution. Well-tested with 29 unit tests covering the critical paths.

    Review by Claude Code

Contributor

@francoisferrand francoisferrand left a comment

should target latest branch (9.4) instead

Comment thread lib/api/api.js
Comment thread lib/api/api.js
Comment thread lib/instrumentation/simple.js Outdated
Comment thread lib/instrumentation/simple.js
Comment thread lib/instrumentation/simple.js Outdated
Comment thread lib/tracing/index.js
Comment thread lib/tracing/index.js Outdated
Comment thread lib/tracing/trustedHosts.js Outdated
Comment thread lib/tracing/trustedHosts.js Outdated
Comment thread lib/tracing/trustedHosts.js Outdated
Comment thread package.json
@claude

claude Bot commented May 20, 2026

  • Arsenal is pinned to a branch (improvement/ARSN-572/trace-context) instead of a release tag, making builds non-reproducible. Pin to a tag or commit SHA before merging.

    The OTEL instrumentation itself is well-structured: clean feature gate via ENABLE_OTEL, proper trust boundary enforcement on outbound trace propagation, robust shutdown with a 5s deadline, race-safe concurrent close, and good test coverage (29 unit tests). The instrumentApiMethod wrapper correctly handles callback, async, and sync return shapes with proper span lifecycle management.

    Review by Claude Code

Comment thread package.json
@claude

claude Bot commented May 20, 2026

  • arsenal is pinned to a branch (improvement/ARSN-572/trace-context), not a tag — this must be reverted to a tag before merging to a release branch
    - Track the arsenal tag release (ARSN-572) as a merge blocker, or add a merge-block label so this PR does not accidentally ship with a floating branch pin

    The OTEL instrumentation itself is well-structured: proper feature-gating behind ENABLE_OTEL, race-safe shutdown with a 5s deadline, correct ParentBasedSampler for end-to-end trace propagation, trusted-host trace header stripping to prevent trace ID leakage, and solid test coverage (init lifecycle, callback/async/sync handler wrapping, trusted host derivation, health path filtering). The instrumentApiMethod callback detection via findLastIndex correctly handles all existing handler signatures, and the denylist-based auto-instrumentation loop ensures new handlers are traced without boilerplate.

    Review by Claude Code

@bert-e
Contributor

bert-e commented May 20, 2026

Incorrect fix version

The Fix Version/s in issue CLDSRV-884 contains:

  • 9.3.8

  • 9.4.0

Considering where you are trying to merge, I ignored possible hotfix versions and I expected to find:

  • 9.4.0

Please check the Fix Version/s of CLDSRV-884, or the target
branch of this pull request.

Comment thread lib/api/api.js
Comment thread package.json
@claude

claude Bot commented May 20, 2026

  • Arsenal is pinned to a branch (improvement/ARSN-572/trace-context), not a release tag. Git-based deps must pin to a tag or at minimum a specific commit SHA to prevent silent breakage from branch updates. This PR should not merge until the Arsenal companion PR (ARSN-572) ships a release tag.

    Review by Claude Code

Comment thread package.json
  "@opentelemetry/sdk-trace-base": "^2.7.0",
  "@smithy/node-http-handler": "^3.0.0",
- "arsenal": "git+https://github.com/scality/Arsenal#8.4.2",
+ "arsenal": "git+https://github.com/scality/Arsenal#improvement/ARSN-572/trace-context",

Arsenal is pinned to branch improvement/ARSN-572/trace-context instead of a release tag. Git-based deps must pin to a tag, not a branch — branches are mutable and can break builds without warning. Additionally, this downgrades arsenal from 8.4.2 to 8.3.11 which may introduce regressions.

— Claude Code

describe('instrumentApiMethod', () => {
    describe('OTEL on', () => {
        afterEach(() => exporter.reset());

Test names using it() should start with should per project convention (e.g. should wrap a callback handler and end span on success).

— Claude Code

const { isHealthPath } = require('../../../../lib/tracing/healthPaths');

describe('tracing.isHealthPath', () => {
    it('matches the canonical probe and scrape paths', () => {

Test names using it() should start with should per project convention.

— Claude Code

if (saved === undefined) {
    delete process.env.ENABLE_OTEL;
} else {
    process.env.ENABLE_OTEL = saved;

Test names using it() should start with should per project convention.

— Claude Code


const { loadTrustedHosts, extractHost, makeRequestHook } = require('../../../../lib/tracing/trustedHosts');

describe('tracing.extractHost', () => {

Test names using it() should start with should per project convention.

— Claude Code

@claude

claude Bot commented May 20, 2026

  • Arsenal pinned to branch improvement/ARSN-572/trace-context instead of a release tag. Git-based deps must pin to a tag; this also downgrades from 8.4.2 to 8.3.11.
    • Pin to the corresponding arsenal release tag once ARSN-572 ships, or keep the current 8.4.2 tag if the trace-context changes are not yet needed.
  • Test names in all 4 new test files do not start with should per project convention.
    • Prefix each it() description with should (e.g. should wrap a callback handler and end span on success).

    Review by Claude Code

Comment thread package.json
  "@opentelemetry/sdk-trace-base": "^2.7.0",
  "@smithy/node-http-handler": "^3.0.0",
- "arsenal": "git+https://github.com/scality/Arsenal#8.4.2",
+ "arsenal": "git+https://github.com/scality/Arsenal#improvement/ARSN-572/trace-context",

Arsenal is pinned to a branch (improvement/ARSN-572/trace-context) instead of a release tag. The PR description acknowledges this is temporary, but branch pins are a deployment risk — the resolved commit can change at any time, breaking reproducible builds. Consider pinning to a specific commit SHA until the ARSN-572 tag is cut.

— Claude Code

describe('OTEL on', () => {
    afterEach(() => exporter.reset());

    it('wraps a callback handler and ends span on success', done => {

Test names using it() should start with should per project convention. For example: it('should wrap a callback handler and end span on success', ...).

— Claude Code

const { isHealthPath } = require('../../../../lib/tracing/healthPaths');

describe('tracing.isHealthPath', () => {
    it('matches the canonical probe and scrape paths', () => {

Test names using it() should start with should per project convention.

— Claude Code

}
});

it('returns false when ENABLE_OTEL is unset', () => {

Test names using it() should start with should per project convention.

— Claude Code

const { loadTrustedHosts, extractHost, makeRequestHook } = require('../../../../lib/tracing/trustedHosts');

describe('tracing.extractHost', () => {
    it('extracts hostname from a plain host string', () => {

Test names using it() should start with should per project convention.

— Claude Code

@claude

claude Bot commented May 20, 2026

  • Arsenal is pinned to a branch (improvement/ARSN-572/trace-context) instead of a release tag, which breaks reproducible builds.
    • Pin to the specific commit SHA until the ARSN-572 tag is released.
  • Test names in all four new test files do not start with should per project convention.
    • Prefix each it() description with should.

Review by Claude Code

5 participants