Add OTLP job log export#3979
Conversation
|
Review the following changes in direct dependencies. Learn more about Socket for GitHub.
|
5833518 to
60aa8fe
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 60aa8fe29d
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
60aa8fe to
2566837
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 2566837108
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| semconv.DeploymentEnvironmentKey.String("ci"), | ||
| ) | ||
| provider := sdklog.NewLoggerProvider( | ||
| sdklog.WithProcessor(sdklog.NewBatchProcessor(exporter)), |
There was a problem hiding this comment.
Avoid dropping OTLP records under bursty output
When a job emits more log lines than sdklog.NewBatchProcessor can queue before the collector accepts them, such as a verbose test suite or a slow/unreachable OTLP endpoint, the batch processor drops excess records instead of back-pressuring the writer. Since this is the only OTLP export path here and Emit cannot report the loss back to Write, the exported job log can be silently incomplete for normal CI bursts; use a blocking/synchronous processor or explicitly configure the queue/export behavior so job-log lines are not lost by default.
Useful? React with 👍 / 👎.
25d9724 to
a4d79b0
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a4d79b0c8d
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
|
||
| line := append(w.buf, data[:i]...) | ||
| line = bytes.TrimSuffix(line, []byte{'\r'}) | ||
| w.emit(string(line)) |
There was a problem hiding this comment.
Redact before emitting OTLP logs
When --job-logs-otlp is used with tracing-backend=opentelemetry, this writer is installed inside Executor.Run after setupRedactors, so w.out.Write(data) sends the bytes through the normal redactor but w.emit(string(line)) exports the original pre-redaction bytes to OTLP. Any secret masked from the visible Buildkite log via RedactedVars, secrets, or Job API redaction will still be sent in clear text to the OTLP collector; emit from the redacted stream or put the OTLP writer behind the same redactor.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
@catkins this popped up with another review i was doing, any issue with it?
| "buildkite.phase": "hook", | ||
| "buildkite.hook.name": "command", | ||
| "buildkite.hook.scope": "default", |
There was a problem hiding this comment.
Mark command output as command phase
For OpenTelemetry-backed job-log export, normal command output goes through this attributes map, but it is tagged as buildkite.phase=hook with buildkite.hook.name=command. Fresh evidence beyond the earlier scope-parser comment is this separate hard-coded internal exporter path, which makes ordinary command logs look like hook logs whenever BUILDKITE_JOB_LOGS_OTLP and the OpenTelemetry tracing backend are enabled; use the command phase attributes here instead.
Useful? React with 👍 / 👎.
852196b to
8851f51
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 8851f514f1
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| var flushers []interface{ Flush() } | ||
| if c.shell.outputInterceptor != nil { | ||
| if cfg.captureStdout == nil && stdout != nil && stdout != io.Discard { | ||
| stdout = c.shell.outputInterceptor(ctx, stdout, cfg.outputAttrs) |
There was a problem hiding this comment.
Include shell-generated log lines in OTLP
When BUILDKITE_JOB_LOGS_OTLP runs with the OpenTelemetry tracing backend, the parent exporter is disabled in agent/job_runner.go, so this OutputInterceptor becomes the only OTLP log path; however it is installed only around Command.Run stdout/stderr here. Lines emitted directly by the bootstrap shell logger (Headerf, Printf, Warningf, prompts, checkout comments, etc.) still write to the normal shell writer and never pass through the interceptor, so visible job log lines generated by the agent are missing from OTLP in that configuration. Wrap the shell logger/stdout as well, or keep a full job-log exporter without duplicating command output.
Useful? React with 👍 / 👎.
8851f51 to
061c4c5
Compare
Amp-Thread-ID: https://ampcode.com/threads/T-019ebe62-7ccf-774b-ba96-19d000353bdb Co-authored-by: Amp <amp@ampcode.com>
Amp-Thread-ID: https://ampcode.com/threads/T-019ebe62-7ccf-774b-ba96-19d000353bdb Co-authored-by: Amp <amp@ampcode.com>
Previously OTLP job log export was split across the process boundary: the agent process emitted records when tracing was not OpenTelemetry, and the bootstrap process emitted them when it was. That produced two near-identical implementations, a duplicated buildkite.* attribute schema, and two different sources of phase/hook truth (brittle log-text regex parsing in the agent vs structured HookConfig in the bootstrap). The same --job-logs-otlp flag also yielded materially different output depending on the unrelated tracing backend. Make the bootstrap process the single home for OTLP job log export: - Remove the agent-side emitter (agent/otlp_job_logger.go) and its wiring in job_runner.go / run_job.go. - Activate the bootstrap OTLP logger whenever --job-logs-otlp is set, regardless of tracing backend. Records are trace-correlated when OpenTelemetry tracing is enabled and uncorrelated otherwise. The agent still propagates BUILDKITE_JOB_LOGS_OTLP to the bootstrap. Amp-Thread-ID: https://ampcode.com/threads/T-019ebf08-2dbd-710f-ad6d-5bc1c842bb0a Co-authored-by: Amp <amp@ampcode.com>
The bootstrap OTLP job log writer wrapped the shell's stdout writer, which is itself the secret redactor. Because the OTLP writer sat upstream of the redactor, it emitted raw pre-redaction bytes to the OTLP backend while the customer-facing job log was correctly redacted. Confirmed live against a local ClickStack: a redacted env var appeared verbatim in OTLP records. Route the OTLP copy through its own replacer.Replacer seeded with the live needle set from the job's redactor Mux, so OTLP records carry the same [REDACTED] markers as the job log. Streaming redaction also handles secrets split across multiple writes. - Add replacer.Mux.Needles() to expose the current needle set. - Split otlpJobLogWriter into a tee that feeds a redactor whose downstream is a line-buffering OTLP emitter. - Add regression tests for inline and split-across-writes secrets. Amp-Thread-ID: https://ampcode.com/threads/T-019ebf08-2dbd-710f-ad6d-5bc1c842bb0a Co-authored-by: Amp <amp@ampcode.com>
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7d6cfdf486
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| // API-derived needles added during earlier phases). | ||
| return &otlpJobLogWriter{ | ||
| out: out, | ||
| redactor: replacer.New(emitter, l.redactors.Needles(), redact.Redacted), |
There was a problem hiding this comment.
Keep OTLP redaction synchronized during commands
When a running command adds a redaction dynamically, for example via the Job API /redactions endpoint used by buildkite-agent redactor add or secret helpers, the normal job-log redactors are updated through e.redactors.Add, but this OTLP writer was seeded from a one-time Needles() snapshot when the command started. Any later output from the same process containing that newly redacted value is still emitted to the OTLP backend unredacted, diverging from the customer-facing log and leaking secrets whenever OTLP job logs are enabled.
Useful? React with 👍 / 👎.
Tee the redacted shell logger output (section headers, prompts, comments, warnings) into the OTLP exporter so exported records match the downloadable Buildkite job log and the UI stream. Previously only child-process output was mirrored, so OTLP destinations were missing the control lines customers see. Amp-Thread-ID: https://ampcode.com/threads/T-019ebf08-2dbd-710f-ad6d-5bc1c842bb0a Co-authored-by: Amp <amp@ampcode.com>
|
@codex review Please focus on the new control-line parity change (latest commit |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 2d124cc3e2
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| w.mu.Lock() | ||
| secondary := w.secondary | ||
| w.mu.Unlock() |
There was a problem hiding this comment.
Serialize tee writes before shutting down OTLP
When cancellation/control output races with executor shutdown, Write snapshots secondary and releases the mutex before writing to that captured control writer. The deferred close path can then detach the secondary and flush/shutdown the OTLP provider while this in-flight write still has the old writer, so the cancellation line can be emitted after the final flush or dropped after shutdown, breaking OTLP parity in the cancellation path. Keep the tee lock held through the secondary write, or otherwise wait for in-flight secondary writes before jobLogger.Close() flushes/shuts down.
Useful? React with 👍 / 👎.
Address adversarial review findings: - Sync the per-command OTLP redactor with the live redactor Mux on each write so secrets added mid-command (e.g. via the Job API) are redacted in OTLP output, not just secrets known when the command writer was created. - Hold the teeWriter lock across the whole write so detaching the OTLP control sink cannot race with an in-flight write, preventing control output from landing on the emitter after it has been flushed and the provider shut down. Amp-Thread-ID: https://ampcode.com/threads/T-019ebf08-2dbd-710f-ad6d-5bc1c842bb0a Co-authored-by: Amp <amp@ampcode.com>
|
@codex review Updated since the last review with commit
Validated end-to-end against a real build on |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b7e9eb9e4e
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if r.conf.AgentConfiguration.JobLogsOTLP { | ||
| setEnv("BUILDKITE_JOB_LOGS_OTLP", "true") | ||
| } |
There was a problem hiding this comment.
Prevent jobs from enabling OTLP export
When the agent option is false, this branch does nothing, so any BUILDKITE_JOB_LOGS_OTLP=true supplied in the job environment survives into the bootstrap. The hidden bootstrap flag reads that same env var, and the exporter also honors job-supplied OTEL_EXPORTER_OTLP_*, so a pipeline can enable and redirect job-log export even when the agent operator did not opt in; force this env to the agent-configured boolean or remove/protect it before launching bootstrap.
Useful? React with 👍 / 👎.
| } | ||
| } | ||
| if stderr != nil && stderr != io.Discard { | ||
| stderr = c.shell.outputInterceptor(ctx, stderr, cfg.outputAttrs) |
There was a problem hiding this comment.
Share OTLP redaction across stdout/stderr
When both stdout and stderr are visible they start as the same job-log writer, but this wraps stderr in a separate OTLP redactor/emitter from stdout. If a process writes the first bytes of a redacted value to stdout and the rest to stderr, the single downstream Buildkite redactor can still redact the combined stream while each OTLP redactor only sees a fragment and emits those fragments unredacted; share one redaction state for both wrappers when they target the same downstream stream.
Useful? React with 👍 / 👎.
|
Hey @catkins just wondering if this is ready for review or still needed? |
Description
Adds an opt-in job log OTLP sink. When
--job-logs-otlp/BUILDKITE_JOB_LOGS_OTLPis enabled, the agent emits job output as OpenTelemetry log records using the existing OTLP exporter environment configuration. This is an additional sink: the normal Buildkite job log is still streamed to the control plane unchanged. The difference from that stream is that OTLP records carry native OTLP timestamps rather than timestamps encoded into the ANSI/OSC job-log body.The OTLP endpoint and transport are intentionally inherited from the OpenTelemetry exporter configuration rather than adding Buildkite-specific endpoint flags. For logs, set
OTEL_EXPORTER_OTLP_LOGS_ENDPOINTfor a log-specific endpoint orOTEL_EXPORTER_OTLP_ENDPOINTfor the generic endpoint; protocol selection followsOTEL_EXPORTER_OTLP_LOGS_PROTOCOL/OTEL_EXPORTER_OTLP_PROTOCOL.The log records carry native OTLP timestamps plus a small set of Buildkite attributes for correlation: organization, pipeline, branch, queue, agent, build, job, current phase, and current hook scope/plugin where known. Trace correlation uses the native OTLP
LogRecordtrace context fields rather than duplicatingtrace_id/span_idas log attributes.Parity with the Buildkite job log
The OTLP sink mirrors the same visible content customers see in the Buildkite UI / downloadable job log, so an OTLP destination is not confusingly different from Buildkite:
~~~), prompts ($), comments (#) and warnings — is mirrored by teeing the redacted shell logger output into the exporter.Both paths emit the same bytes (including ANSI colour codes) that land in the downloadable Buildkite log. The only intended difference is the timestamp transport: OTLP records use the native
LogRecordtimestamp instead of the\x1b_bk;t=…OSC markers Buildkite encodes into the raw log body. The human-visible content is identical in both destinations.Control output is bootstrap narration rather than the output of a specific traced hook/command, so those records carry the base
buildkite.*attributes but no per-hook span context; child-process records remain trace-correlated to their hook/command span.Architecture: bootstrap-only
OTLP job log export lives entirely in the bootstrap process, which is the single home for this feature. This is a deliberate choice: the bootstrap is the process that actually runs the hooks and command, so it has structured phase/hook metadata and the active hook/command span directly, with no need to reconstruct them from the job-log text downstream. Keeping emission here also means a single emitter, a single
buildkite.*attribute schema, and behaviour that does not vary with the (otherwise unrelated) tracing backend.The bootstrap emits records whenever
--job-logs-otlpis enabled, regardless of tracing backend:TraceId/SpanIdas the corresponding exported hook/command span.--tracing-propagate-traceparentonly controls accepting the Buildkite control-plane traceparent; it is not required for local agent trace/log correlation.flowchart TD subgraph agent["Agent process"] AS["buildkite-agent start<br/>--job-logs-otlp"] end AS -->|"BUILDKITE_JOB_LOGS_OTLP=true"| RUN subgraph bootstrap["Bootstrap process — single home for OTLP export"] RUN["Executor.Run<br/>JobLogsOTLP enabled?"] -->|"yes (any tracing backend)"| WIRE HC["hook / command<br/>stdout + stderr"] --> SH["shell OutputInterceptor<br/>(per hook/command, structured attrs)"] CTRL["bootstrap control output<br/>headers / prompts / comments / warnings"] --> SL["shell logger"] WIRE(("wire sinks")) WIRE --> SH WIRE --> SL SH --> TEE{{"tee raw output"}} TEE --> JR["job-log redactor<br/>replacer.Mux"] TEE --> OR["OTLP redactor<br/>replacer seeded with live needles"] JR --> JLOG["Buildkite job log stream"] OR --> EM["line emitter<br/>+ buildkite.* attributes"] SL --> LR["logger redactor"] LR --> LTEE{{"tee redacted control output"}} LTEE --> JLOG LTEE --> CEM["control line emitter<br/>+ base buildkite.* attributes"] EM --> EXP["OTLP log exporter<br/>(batch)"] CEM --> EXP end SPAN["active hook/command span<br/>only when OTel tracing enabled"] -. "native TraceId / SpanId" .-> EM EXP -->|"OTLP/gRPC or HTTP"| BE[("OTLP backend")] classDef redact fill:#fde2e2,stroke:#d33; class JR,OR,LR redact;Both sinks share the same redaction (
replacer, highlighted), so secrets are[REDACTED]in the OTLP records exactly as in the job log. Control output is teed after the logger redactor, so it is never re-redacted and never leaks pre-redaction bytes. Trace context is attached to child-process OTLP records only when OpenTelemetry tracing is enabled (dotted edge); otherwise records are emitted uncorrelated.Redaction
OTLP records are redacted with the same secret needles as the customer-facing job log. The child-process OTLP copy is routed through its own
replacer.Replacer, seeded from the job's live redactorMux, so secret values are replaced with[REDACTED]before they reach the OTLP backend (including secrets split across multiple writes). Control output is mirrored from the post-redaction side of the shell logger redactor, so it inherits the same redaction without a second pass.Relevant OpenTelemetry references:
Context
Slack context: https://buildkite-corp.slack.com/archives/C05R3MTRK38/p1780282420246339
This is a spike for moving high-rate job-log consumption toward agent-side OTLP log export, using the existing OpenTelemetry SDK plumbing in the agent.
Changes
--job-logs-otlpandBUILDKITE_JOB_LOGS_OTLPtobuildkite-agent start, propagated to the bootstrap subprocess.LogRecords with native timestamps and structured Buildkite hook/command attributes.LogRecordfields, not duplicated log attributes.--tracing-propagate-traceparentopt-in.CLI help excerpt:
Testing
go test ./...). Buildkite employees may check this if the pipeline has run automatically.go tool gofumpt -extra -w .)Additional verification:
go build ./...,golangci-lint run ./internal/job/go test -race ./internal/job/ ./internal/replacer/ ./internal/shell/(one pre-existing, unrelated git failure —TestVerifyCommit/fails_when_commit_is_not_on_branch— reproduces on a clean tree).TestOTLPJobLoggerRedactsSecrets,TestOTLPJobLoggerRedactsSecretsSplitAcrossWrites,TestOTLPJobLoggerControlWriter,TestOTLPJobLoggerControlWriterReused.Validated end-to-end with a real build on a local Buildkite (
buildkite.localhost) exporting to a local ClickStack (ClickHouse + OpenTelemetry collector). A locally-built agent ran a job with a secret env var, then the downloadable Buildkite log (via the v2 REST API) was compared againstdefault.otel_logs/default.otel_tracesfor the same job id:~~~ Preparing working directory,$ git clone …,# Creating …,~~~ Running commands) all appear verbatim — including ANSI colour codes — in the OTLP records. The only difference is the OTLP records use native timestamps instead of the\x1b_bk;t=…OSC markers in the raw Buildkite log body.leak=[REDACTED]in both the downloadable Buildkite log and the OTLP record bodies.checkoutanddefault command hookspans via native OTLP trace context; control-narration records were emitted with base attributes and no span context, as intended.Affiliation (optional, external contributors)
Buildkite.
Disclosures / Credits
Codex and Amp assisted with implementation, adversarial review, local verification, and drafting this PR description.