feat(llmobs): support agent-based LLMObs export via APM trace meta_struct #18254
mabdinur wants to merge 6 commits into …
When LLM Observability runs in APM_AGENT_PROXY or APM_AGENTLESS mode, the LLMObs payload rides the APM span via meta_struct["_llmobs"] so a single trace carries both kinds of telemetry. That path silently loses the LLMObs event whenever the SDK's local sampler decides the trace should be dropped (root sampling_priority <= 0): the Agent's client-side stats / libdatadog short-circuits the trace before it reaches intake.

This change:
- Forces the APM trace writer to v0.4 whenever LLMObs is enabled and warns on explicit v0.5, since v0.5 does not carry meta_struct. Mirrors the AppSec recreate() pattern.
- Stops scrubbing meta_struct["_llmobs"] in _on_span_finish for APM_AGENT_PROXY / APM_AGENTLESS and stashes the prepared event on the span via a context key for later rescue.
- Adds LLMObsSamplingFallbackProcessor in SpanAggregator's hardcoded chain (after TraceSamplingProcessor, before TraceTagsProcessor): on predicted drop it re-ships the cached event via LLMObsSpanWriter, stamps _dd.llmobs.submitted=1 for idempotency, and scrubs meta_struct so APM-side extract and writer-side intake never double-count.
- Preserves the LLMOBS_DIRECT immediate-ship behavior when DD_APM_TRACING_ENABLED=false or DD_TRACE_ENABLED=0 (no trace flush ever runs, so the rescue chain wouldn't fire).
- Never mutates sampling_priority or adds sampling rules; LLM Observability has zero impact on APM sampling decisions or billing.

The removed _DD_LLMOBS_TEST_KEEP_META_STRUCT escape hatch is no longer needed because the default APM_AGENT_PROXY behavior now keeps meta_struct on the span; the llmobs test fixture installs a test-only always-enqueue variant of the fallback processor so the mocked writer is still exercised on every LLM span finish.

Co-authored-by: Cursor <cursoragent@cursor.com>
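The rescue step described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the PR's code: FakeSpan, ListWriter, and SamplingFallbackSketch are hypothetical stand-ins for the tracer's span, LLMObsSpanWriter, and LLMObsSamplingFallbackProcessor. Treating the first span as the local root and scrubbing meta_struct when no cached event exists are both assumptions made for the sketch.

```python
from dataclasses import dataclass, field
from typing import List, Optional

SUBMITTED_TAG = "_dd.llmobs.submitted"  # tag name as given in the PR text
LLMOBS_KEY = "_llmobs"                  # meta_struct key as given in the PR text


@dataclass
class FakeSpan:
    """Hypothetical stand-in for a tracer span (not the ddtrace Span class)."""
    sampling_priority: int
    tags: dict = field(default_factory=dict)
    meta_struct: dict = field(default_factory=dict)
    cached_event: Optional[dict] = None


class ListWriter:
    """Hypothetical writer that records enqueued events in memory."""

    def __init__(self) -> None:
        self.events: List[dict] = []

    def enqueue(self, event: dict) -> None:
        self.events.append(event)


class SamplingFallbackSketch:
    """Re-ships cached events only when the trace is predicted to be dropped."""

    def __init__(self, writer: ListWriter) -> None:
        self._writer = writer

    def process_trace(self, trace: List[FakeSpan]) -> List[FakeSpan]:
        if not trace:
            return trace
        # Assumption: the first span is the local root of the chunk.
        if trace[0].sampling_priority > 0:
            # Kept trace: the payload rides meta_struct, so only release the cache.
            for span in trace:
                span.cached_event = None
            return trace
        for span in trace:
            if span.tags.get(SUBMITTED_TAG) == "1":
                continue  # already shipped; keeps a re-flush idempotent
            event, span.cached_event = span.cached_event, None
            if event is None:
                # Assumption: with nothing to rescue, scrub so no partial payload ships.
                span.meta_struct.pop(LLMOBS_KEY, None)
                continue
            # Tag and scrub before enqueue so a writer failure cannot leave the
            # payload on the APM trace without the dedup tag.
            span.tags[SUBMITTED_TAG] = "1"
            span.meta_struct.pop(LLMOBS_KEY, None)
            self._writer.enqueue(event)
        return trace  # sampling_priority is never touched
```

Calling process_trace twice on the same dropped trace enqueues the event once, because the second pass sees the submitted tag and skips the span.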
…queue
- LLMObs._child_after_fork now reinstalls LLMObsSamplingFallbackProcessor with the recreated post-fork LLMObsSpanWriter. The processor instance captured the pre-fork writer at enable() time and its background worker did not survive fork(), causing silent buffering in child processes.
- Match the rescue path and LLMOBS_DIRECT immediate-ship path on the same set_tag -> scrub -> enqueue order so a writer failure cannot leave a partial state where meta_struct["_llmobs"] still rides the APM trace without the de-dup tag.
- Release note: add upgrade entry for the agentless APM sampling behavior change in 709084d (DD_TRACE_SAMPLE_RATE/SAMPLING_RULES/RATE_LIMIT are now honored in agentless mode).
- Tests: add chain-order positional assertion, no-cached-event rescue branch, processor returns trace unchanged, and tracer-disabled + APM_AGENT_PROXY immediate-ship path.

Co-authored-by: Cursor <cursoragent@cursor.com>
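The post-fork rebind in the first bullet can be illustrated with a small sketch. The class names here are hypothetical, and the standard-library os.register_at_fork hook is used only for illustration; the PR text does not say which fork mechanism dd-trace-py uses.

```python
import os


class WriterSketch:
    """Hypothetical writer that remembers which process created it."""

    def __init__(self) -> None:
        self.owner_pid = os.getpid()
        self.events = []

    def enqueue(self, event: dict) -> None:
        self.events.append(event)


class ProcessorSketch:
    """Hypothetical processor that captures a writer at construction time."""

    def __init__(self, writer: WriterSketch) -> None:
        self.writer = writer


class ServiceSketch:
    def __init__(self) -> None:
        self.writer = WriterSketch()
        self.processor = ProcessorSketch(self.writer)
        # os.register_at_fork is not available on every platform (e.g. Windows).
        if hasattr(os, "register_at_fork"):
            os.register_at_fork(after_in_child=self._child_after_fork)

    def _child_after_fork(self) -> None:
        # Recreate the writer, then rebuild the processor around it. Replacing
        # only the writer would leave the processor holding the pre-fork one.
        self.writer = WriterSketch()
        self.processor = ProcessorSketch(self.writer)
```

The point of the sketch is the second assignment in _child_after_fork: any object that captured the old writer by reference has to be rebuilt or rebound as well.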
Consolidate the comments introduced in this PR: drop narration of the code
on the line below ("set the tag", "log the warning") and keep only edge cases
or workarounds that cannot be inferred from the surrounding lines (writer
rebind after fork, tag+scrub-before-enqueue atomicity, local-root priority,
no-cached-event branch, chain-position constraint, v0.5 meta_struct strip).
Co-authored-by: Cursor <cursoragent@cursor.com>
Benchmarks
Benchmark execution time: 2026-05-25 03:52:58
Comparing candidate commit 0829a33 in PR branch
Found 0 performance improvements and 4 performance regressions! Performance is the same for 617 metrics, 10 unstable metrics.
- scenario:iastaspects-stringio_aspect
- scenario:iastaspectsospath-ospathbasename_aspect
- scenario:span-start
- scenario:telemetryaddmetric-1-count-metric-1-times
Rebind the fallback processor after mock writer swap in bedrock/MCP fixtures so meta_struct is not scrubbed under USER_REJECT. Set DD_APM_TRACING_ENABLED for the MCP distributed-tracing subprocess test. Add DD_API_KEY to LLMOBS_DIRECT subprocess tests and read tags before span finish scrubs meta_struct. Co-authored-by: Cursor <cursoragent@cursor.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 0829a33d95
    # the span (APM-side extract would duplicate without the dedup tag).
    span.set_tag(LLMOBS_SUBMITTED_TAG_KEY, "1")
    span._remove_struct_tag(LLMOBS_STRUCT.KEY)
    self._llmobs_span_writer.enqueue(event)
Regenerate agentless fallback events before enqueueing
When _export_mode == APM_AGENTLESS and the root sampling priority is <= 0, this enqueues the cached event that was rendered for APM meta_struct extraction rather than for LLMObsSpanWriter. In that agentless path _prepare_llmobs_span_data has already applied APM-intake-only transformations such as replacing dotted tag keys, and _llmobs_tags/normalization omit error details that the direct span writer normally includes; rejected errored spans or spans with dotted user tags will therefore be rescued with mutated tags or missing error information. Cache a writer-shaped event separately, or rebuild the event before handing it to the span writer.
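One way to read the "cache a writer-shaped event separately" suggestion is sketched below. Everything in it is hypothetical: the event layout, the "tags" and "error" field names, and the choice of "_" as the replacement for dots are assumptions made for illustration, not the shapes used by dd-trace-py.

```python
import copy


def to_apm_shape(event: dict) -> dict:
    """Return an APM-intake-shaped copy; the input event is left untouched."""
    shaped = copy.deepcopy(event)
    # Assumption: dotted tag keys are rewritten with underscores for APM intake.
    shaped["tags"] = {k.replace(".", "_"): v for k, v in event.get("tags", {}).items()}
    # Assumption: error details are omitted from the APM-side rendering.
    shaped.pop("error", None)
    return shaped


def prepare_both_shapes(event: dict) -> dict:
    """Keep the writer-shaped event alongside the APM-shaped one."""
    return {
        "writer_event": copy.deepcopy(event),  # what a rescue would enqueue
        "apm_event": to_apm_shape(event),      # what would ride meta_struct
    }
```

With two copies, a rescued span keeps its dotted tags and error details, at the cost of holding a second event in memory until the trace is flushed.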
    self._trace_rate_limit = -1
    self._trace_compute_stats = False
    setattr(self, "_trace_sampling_rules", "")
Why have those things been removed?
Description

Goal: when LLM Observability rides APM traces via meta_struct["_llmobs"] (APM_AGENT_PROXY / APM_AGENTLESS), stop losing LLMObs events on predicted-drop traces (root sampling_priority <= 0) where the Agent's local sampler / libdatadog short-circuits the chunk before the trace-edge.

Design constraints:
- […] (sampling_priority never mutated, no sampling rules added).
- […] dd-trace-py <-> libdatadog <-> Agent protocol changes; intake already extracts meta_struct["_llmobs"] at /v1/input and /api/v0.2/traces.
- […] (_dd.llmobs.submitted=1 tag + scrub meta_struct) -- intake can OR-dedup as a belt-and-suspenders fallback.
- […] DD_API_KEY / DD_SITE requirement for APM_AGENT_PROXY mode.

Design:
- LLMObs.enable() forces the APM trace writer to v0.4 via SpanAggregator.reset(llmobs_enabled=True). v0.5 does not carry meta_struct. Mirrors the AppSec recreate hook.
- _on_span_finish stashes the prepared LLMObsSpanEvent on the span (CACHED_LLMOBS_EVENT_CTX_KEY) instead of scrubbing + enqueuing. meta_struct rides the APM trace.
- LLMObsSamplingFallbackProcessor (slot in SpanAggregator's hardcoded chain between TraceSamplingProcessor and TraceTagsProcessor) re-ships the cached event via LLMObsSpanWriter on predicted drop, stamps _dd.llmobs.submitted=1, and scrubs meta_struct["_llmobs"].

Edge cases handled:
- LLMOBS_DIRECT mode (DD_APM_TRACING_ENABLED=false) and tracer.enabled == False (DD_TRACE_ENABLED=0) keep immediate-ship at _on_span_finish because the trace never reaches the processor chain.
- […] span._local_root.context.sampling_priority so an upstream USER_REJECT is honored on the child service.
- […] _dd.llmobs.submitted=1 is already set, so a re-flush or LLMOBS_DIRECT hook cannot cause duplicates.
- […] meta_struct to avoid shipping a partial event to APM-side extract.
- […] (span_type != SpanTypes.LLM).
- DD_TRACE_API_VERSION=v0.5 + LLMObs enabled: silently downgraded to v0.4 with a log.warning.

Testing
- tests/llmobs/test_sampling_fallback_processor.py covers wire-format forcing, every rescue trigger condition, idempotency, no-sampling-side-effect, processor chain wiring, and sampling-priority round-trip.
- tests/llmobs/conftest.py installs an always-enqueue variant of the fallback processor so the mocked writer is still exercised on every LLM span finish.

Risks
- […] DD_TRACE_API_VERSION=v0.5 and enable LLMObs are silently downgraded to v0.4 (with a warning); without this, the entire meta_struct payload would be lost on the wire. Same approach AppSec already takes.
- […] LLMObsSpanEvent is cached on the span until trace-flush -- only for span_type == SpanTypes.LLM, released as soon as process_trace returns.

Additional Notes
- […] _DD_LLMOBS_TEST_KEEP_META_STRUCT escape hatch (default APM_AGENT_PROXY now keeps meta_struct on the span).
- DummyWriter.recreate, CIVisibilityWriter.recreate, LogWriter.recreate, AgentlessTraceWriter.recreate, and NativeWriter.recreate all accept llmobs_enabled: Optional[bool] to keep the TraceWriter interface in sync.
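The v0.5 to v0.4 downgrade listed under the edge cases and risks can be sketched as a small pure function. resolve_api_version is a hypothetical helper written for illustration, not a function in dd-trace-py, and the warning text is invented.

```python
import logging

log = logging.getLogger(__name__)


def resolve_api_version(requested: str, llmobs_enabled: bool) -> str:
    """Pick the trace API version, forcing v0.4 when LLMObs is enabled.

    Hypothetical helper: per the PR text, v0.5 does not carry meta_struct,
    so an explicit v0.5 request is downgraded with a warning.
    """
    if not llmobs_enabled:
        return requested
    if requested == "v0.5":
        log.warning(
            "LLM Observability requires meta_struct, which v0.5 does not carry; using v0.4"
        )
    return "v0.4"
```

For example, resolve_api_version("v0.5", llmobs_enabled=True) returns "v0.4" and logs the warning, while the same request with LLMObs disabled is passed through unchanged.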