feat(llmobs): support agent-based LLMObs export via APM trace meta_struct #18254

Open
mabdinur wants to merge 6 commits into main from munir/agentbased-llmo
Contributor

@mabdinur mabdinur commented May 23, 2026

Description

Goal: when LLM Observability rides APM traces via meta_struct["_llmobs"] (APM_AGENT_PROXY / APM_AGENTLESS), stop losing LLMObs events on predicted-drop traces (root sampling_priority <= 0) where the Agent's local sampler / libdatadog short-circuits the chunk before the trace-edge.

Design constraints:

  • Zero impact on APM sampling decisions and billing (sampling_priority never mutated, no sampling rules added).
  • No dd-trace-py <-> libdatadog <-> Agent protocol changes; intake already extracts meta_struct["_llmobs"] at /v1/input and /api/v0.2/traces.
  • Exactly-once intake delivery via in-SDK de-dup (_dd.llmobs.submitted=1 tag + scrub meta_struct) -- intake can OR-dedup as a belt-and-suspenders fallback.
  • No new DD_API_KEY / DD_SITE requirement for APM_AGENT_PROXY mode.

Design:

  • LLMObs.enable() forces the APM trace writer to v0.4 via SpanAggregator.reset(llmobs_enabled=True). v0.5 does not carry meta_struct. Mirrors the AppSec recreate hook.
  • _on_span_finish stashes the prepared LLMObsSpanEvent on the span (CACHED_LLMOBS_EVENT_CTX_KEY) instead of scrubbing + enqueuing. meta_struct rides the APM trace.
  • New LLMObsSamplingFallbackProcessor (slotted into SpanAggregator's hardcoded chain between TraceSamplingProcessor and TraceTagsProcessor) re-ships the cached event via LLMObsSpanWriter on predicted drop, stamps _dd.llmobs.submitted=1, and scrubs meta_struct["_llmobs"].
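
The rescue step in the design above could be sketched roughly as follows. This is a minimal illustration, not the PR's code: `FakeSpan`, `FakeWriter`, and `rescue_on_predicted_drop` are hypothetical stand-ins, each span carries its own `sampling_priority` (the PR reads it from the local root), and an unset priority is assumed to mean "not a predicted drop".

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional

SUBMITTED_TAG = "_dd.llmobs.submitted"  # de-dup tag named in the PR description
LLMOBS_KEY = "_llmobs"                  # meta_struct key named in the PR description


@dataclass
class FakeSpan:
    """Hypothetical stand-in for a tracer span."""
    span_type: str
    sampling_priority: Optional[int]
    tags: Dict[str, str] = field(default_factory=dict)
    meta_struct: Dict[str, Any] = field(default_factory=dict)
    cached_event: Optional[dict] = None


class FakeWriter:
    """Hypothetical stand-in for LLMObsSpanWriter; records enqueued events."""
    def __init__(self) -> None:
        self.events: List[dict] = []

    def enqueue(self, event: dict) -> None:
        self.events.append(event)


def rescue_on_predicted_drop(trace: List[FakeSpan], writer: FakeWriter) -> List[FakeSpan]:
    for span in trace:
        if span.span_type != "llm":
            continue  # non-LLM spans are skipped
        if span.tags.get(SUBMITTED_TAG) == "1":
            continue  # already shipped: idempotent on re-flush
        if span.sampling_priority is None or span.sampling_priority > 0:
            continue  # assumed kept: meta_struct rides the APM trace
        event, span.cached_event = span.cached_event, None
        if event is None:
            # No cached event: scrub so no partial payload reaches APM-side extract.
            span.meta_struct.pop(LLMOBS_KEY, None)
            continue
        span.tags[SUBMITTED_TAG] = "1"
        span.meta_struct.pop(LLMOBS_KEY, None)
        writer.enqueue(event)
    return trace  # the trace itself is returned unchanged; sampling_priority is never touched
```

With one dropped and one kept LLM span, only the dropped span's event reaches the writer, and the kept span still carries `meta_struct["_llmobs"]`.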

Edge cases handled:

  • LLMOBS_DIRECT mode (DD_APM_TRACING_ENABLED=false) and tracer.enabled == False (DD_TRACE_ENABLED=0) keep immediate-ship at _on_span_finish because the trace never reaches the processor chain.
  • Distributed traces: rescue reads span._local_root.context.sampling_priority so an upstream USER_REJECT is honored on the child service.
  • Idempotency: rescue early-returns if _dd.llmobs.submitted=1 is already set, so a re-flush or LLMOBS_DIRECT hook cannot cause duplicates.
  • Cached event missing (user processor dropped it or build failed): rescue scrubs meta_struct to avoid shipping a partial event to APM-side extract.
  • Non-LLM spans on the same trace are skipped (span_type != SpanTypes.LLM).
  • Explicit DD_TRACE_API_VERSION=v0.5 + LLMObs enabled: downgraded to v0.4 with a log.warning.
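
The last edge case, the v0.5 downgrade, might look something like the sketch below. The function name `resolve_trace_api_version` and the logger name are hypothetical; the only behavior taken from the description is that LLMObs forces v0.4 and warns when v0.5 was requested explicitly.

```python
import logging
from typing import Optional

log = logging.getLogger("llmobs_sketch")  # hypothetical logger name

META_STRUCT_CAPABLE_VERSION = "v0.4"  # per the PR, v0.5 does not carry meta_struct


def resolve_trace_api_version(requested: Optional[str], llmobs_enabled: bool) -> Optional[str]:
    """Return the trace API version the writer should use (illustrative only)."""
    if not llmobs_enabled:
        return requested  # leave the user's choice, or the writer default, alone
    if requested == "v0.5":
        log.warning(
            "DD_TRACE_API_VERSION=v0.5 does not carry meta_struct; using %s because LLMObs is enabled",
            META_STRUCT_CAPABLE_VERSION,
        )
    return META_STRUCT_CAPABLE_VERSION
```

Returning `requested` untouched when LLMObs is off keeps the sketch from guessing at the writer's own default version.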

Testing

  • tests/llmobs/test_sampling_fallback_processor.py covers wire-format forcing, every rescue trigger condition, idempotency, no-sampling-side-effect, processor chain wiring, and sampling-priority round-trip.
  • tests/llmobs/conftest.py installs an always-enqueue variant of the fallback processor so the mocked writer is still exercised on every LLM span finish.

Risks

  • Users who explicitly set DD_TRACE_API_VERSION=v0.5 and enable LLMObs are downgraded to v0.4 with a warning; without this, the entire meta_struct payload would be lost on the wire. Same approach AppSec already takes.
  • LLMObsSpanEvent is cached on the span until trace-flush -- only for span_type == SpanTypes.LLM, released as soon as process_trace returns.
  • All new code paths swallow exceptions and log -- APM trace flow continues even if rescue raises.

Additional Notes

  • Removed _DD_LLMOBS_TEST_KEEP_META_STRUCT escape hatch (default APM_AGENT_PROXY now keeps meta_struct on the span).
  • DummyWriter.recreate, CIVisibilityWriter.recreate, LogWriter.recreate, AgentlessTraceWriter.recreate, and NativeWriter.recreate all accept llmobs_enabled: Optional[bool] to keep the TraceWriter interface in sync.

mabdinur and others added 2 commits May 23, 2026 02:26
When LLM Observability runs in APM_AGENT_PROXY or APM_AGENTLESS mode,
the LLMObs payload rides the APM span via meta_struct["_llmobs"] so a
single trace carries both telemetry. That path silently loses the
LLMObs event whenever the SDK's local sampler decides the trace should
be dropped (root sampling_priority <= 0): the Agent's client-side
stats / libdatadog short-circuits the trace before it reaches intake.

This change:

- Forces the APM trace writer to v0.4 whenever LLMObs is enabled and
  warns on explicit v0.5, since v0.5 does not carry meta_struct.
  Mirrors the AppSec recreate() pattern.
- Stops scrubbing meta_struct["_llmobs"] in _on_span_finish for
  APM_AGENT_PROXY / APM_AGENTLESS and stashes the prepared event on
  the span via a context key for later rescue.
- Adds LLMObsSamplingFallbackProcessor in SpanAggregator's hardcoded
  chain (after TraceSamplingProcessor, before TraceTagsProcessor): on
  predicted drop it re-ships the cached event via LLMObsSpanWriter,
  stamps _dd.llmobs.submitted=1 for idempotency, and scrubs
  meta_struct so APM-side extract and writer-side intake never
  double-count.
- Preserves the LLMOBS_DIRECT immediate-ship behavior when
  DD_APM_TRACING_ENABLED=false or DD_TRACE_ENABLED=0 (no trace flush
  ever runs so the rescue chain wouldn't fire).
- Never mutates sampling_priority or adds sampling rules; LLM
  Observability has zero impact on APM sampling decisions or billing.

The removed _DD_LLMOBS_TEST_KEEP_META_STRUCT escape hatch is no
longer needed because the default APM_AGENT_PROXY behavior now keeps
meta_struct on the span; the llmobs test fixture installs a
test-only always-enqueue variant of the fallback processor so the
mocked writer is still exercised on every LLM span finish.

Co-authored-by: Cursor <cursoragent@cursor.com>

datadog-prod-us1-6 Bot commented May 23, 2026


⚠️ Warnings

🚦 17 Pipeline jobs failed

System Tests | tracer-release / End-to-end #1 / anthropic-py@0.75.0 1 (GitHub Actions)

🔄 Retry job. This looks flaky and may succeed on retry. 13 failed tests. Error: Number (1) of traces not available from test agent, got 0.

🧪 13 Tests failed

tests.integration_frameworks.llm.anthropic.test_anthropic_llmobs.TestAnthropicLlmObsMessages.test_create_content_block[False, anthropic-py@0.75.0] from system_tests_suite
ValueError: Number (1) of traces not available from test agent, got 0:
[]

self = <tests.integration_frameworks.llm.anthropic.test_anthropic_llmobs.TestAnthropicLlmObsMessages object at 0x7fa668c64e90>
test_agent = <utils.docker_fixtures._test_agent.TestAgentAPI object at 0x7fa669d2e180>
test_client = <utils.docker_fixtures._test_clients._test_client_framework_integrations.FrameworkTestClientApi object at 0x7fa6584d8b00>

    @pytest.mark.parametrize("stream", [True, False])
    def test_create_content_block(self, test_agent: TestAgentAPI, test_client: FrameworkTestClientApi, *, stream: bool):
        with test_agent.vcr_context(stream=stream):
...
tests.integration_frameworks.llm.anthropic.test_anthropic_llmobs.TestAnthropicLlmObsMessages.test_create_content_block[True, anthropic-py@0.75.0] from system_tests_suite
ValueError: Number (1) of traces not available from test agent, got 0:
[]

self = <tests.integration_frameworks.llm.anthropic.test_anthropic_llmobs.TestAnthropicLlmObsMessages object at 0x7fa668c650d0>
test_agent = <utils.docker_fixtures._test_agent.TestAgentAPI object at 0x7fa669cebb30>
test_client = <utils.docker_fixtures._test_clients._test_client_framework_integrations.FrameworkTestClientApi object at 0x7fa657bcaea0>

    @pytest.mark.parametrize("stream", [True, False])
    def test_create_content_block(self, test_agent: TestAgentAPI, test_client: FrameworkTestClientApi, *, stream: bool):
        with test_agent.vcr_context(stream=stream):
...

System Tests | tracer-release / End-to-end #1 / google_genai-py@1.55.0 1 (GitHub Actions)

🔄 Retry job. This looks flaky and may succeed on retry. 19 failed tests due to insufficient traces available from test agent after requests.

🧪 7 Tests failed

tests.integration_frameworks.llm.google_genai.test_google_genai_llmobs.TestGoogleGenAiEmbedContent.test_embed_content_content_block_input[google_genai-py@1.55.0] from system_tests_suite
ValueError: Number (1) of traces not available from test agent, got 0:
[]

self = <tests.integration_frameworks.llm.google_genai.test_google_genai_llmobs.TestGoogleGenAiEmbedContent object at 0x7fb5d8b201a0>
test_agent = <utils.docker_fixtures._test_agent.TestAgentAPI object at 0x7fb5c724bc80>
test_client = <utils.docker_fixtures._test_clients._test_client_framework_integrations.FrameworkTestClientApi object at 0x7fb5c723d970>

    def test_embed_content_content_block_input(self, test_agent: TestAgentAPI, test_client: FrameworkTestClientApi):
        with test_agent.vcr_context():
            test_client.request(
...
tests.integration_frameworks.llm.google_genai.test_google_genai_llmobs.TestGoogleGenAiEmbedContent.test_embed_content[google_genai-py@1.55.0] from system_tests_suite
ValueError: Number (1) of traces not available from test agent, got 0:
[]

self = <tests.integration_frameworks.llm.google_genai.test_google_genai_llmobs.TestGoogleGenAiEmbedContent object at 0x7fb5d8b19880>
test_agent = <utils.docker_fixtures._test_agent.TestAgentAPI object at 0x7fb5c723d130>
test_client = <utils.docker_fixtures._test_clients._test_client_framework_integrations.FrameworkTestClientApi object at 0x7fb5c7228a40>

    def test_embed_content(self, test_agent: TestAgentAPI, test_client: FrameworkTestClientApi):
        with test_agent.vcr_context():
            test_client.request(
...

DataDog/apm-reliability/dd-trace-py | build linux: [amd64, cp315-cp315, v113741238-d2b8243-manylinux2014_x86_64] (GitLab)

🔄 Retry job. This looks flaky and may succeed on retry. Failed to create pod sandbox due to network allocation failure. No IPs currently available on the node.

View all 17 failed jobs.

ℹ️ Info

No other issues found (see more)

❄️ No new flaky tests detected

🔗 Commit SHA: 0829a33


cit-pr-commenter-54b7da Bot commented May 23, 2026

Codeowners resolved as

tests/contrib/botocore/test_bedrock_agents.py                           @DataDog/ml-observability
tests/llmobs/test_llmobs_service.py                                     @DataDog/ml-observability
tests/snapshots/tests.contrib.botocore.test_bedrock_agents.test_agent_invoke_with_step_spans.json  @DataDog/ml-observability

@mabdinur mabdinur changed the title fix(llmobs): rescue LLMObs events when APM trace is predicted dropped fix(llmobs): eliminate LLMObs data loss on unsampled APM traces May 23, 2026
@mabdinur mabdinur changed the title fix(llmobs): eliminate LLMObs data loss on unsampled APM traces feat(llmobs): support agent-based LLMObs export via APM trace meta_struct May 23, 2026
mabdinur and others added 2 commits May 23, 2026 04:03
…queue

- LLMObs._child_after_fork now reinstalls LLMObsSamplingFallbackProcessor
  with the recreated post-fork LLMObsSpanWriter. The processor instance
  captured the pre-fork writer at enable() time and its background worker
  did not survive fork(), causing silent buffering in child processes.
- Match the rescue path and LLMOBS_DIRECT immediate-ship path on the same
  set_tag -> scrub -> enqueue order so a writer failure cannot leave a
  partial state where meta_struct["_llmobs"] still rides the APM trace
  without the de-dup tag.
- Release note: add upgrade entry for the agentless APM sampling behavior
  change in 709084d (DD_TRACE_SAMPLE_RATE/SAMPLING_RULES/RATE_LIMIT
  are now honored in agentless mode).
- Tests: add chain-order positional assertion, no-cached-event rescue
  branch, processor returns trace unchanged, and tracer-disabled +
  APM_AGENT_PROXY immediate-ship path.

Co-authored-by: Cursor <cursoragent@cursor.com>
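
The set_tag -> scrub -> enqueue ordering described in the commit above can be illustrated with a small sketch. All names here (`ship_event`, `FailingWriter`, the logger) are hypothetical; the point is only that tagging and scrubbing happen before the writer call, so a writer failure cannot leave `meta_struct["_llmobs"]` on the span without the de-dup tag.

```python
import logging
from typing import Any, Dict

log = logging.getLogger("llmobs_sketch")  # hypothetical logger name


class FailingWriter:
    """Hypothetical writer whose enqueue always fails."""
    def enqueue(self, event: dict) -> None:
        raise RuntimeError("writer unavailable")


def ship_event(tags: Dict[str, str], meta_struct: Dict[str, Any], event: dict, writer: Any) -> bool:
    """Tag and scrub first, then enqueue; swallow and log writer errors."""
    tags["_dd.llmobs.submitted"] = "1"   # 1. de-dup tag
    meta_struct.pop("_llmobs", None)     # 2. scrub the APM-side payload
    try:
        writer.enqueue(event)            # 3. only now hand off to the writer
    except Exception:
        log.exception("failed to enqueue LLMObs event; APM trace flow continues")
        return False
    return True
```

In this sketch a failed enqueue loses the event rather than risking a duplicate, which is the trade-off the ordering implies.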
Consolidate the comments introduced in this PR: drop narration of the code
on the line below ("set the tag", "log the warning") and keep only edge cases
or workarounds that cannot be inferred from the surrounding lines (writer
rebind after fork, tag+scrub-before-enqueue atomicity, local-root priority,
no-cached-event branch, chain-position constraint, v0.5 meta_struct strip).

Co-authored-by: Cursor <cursoragent@cursor.com>

pr-commenter Bot commented May 23, 2026

Benchmarks

Benchmark execution time: 2026-05-25 03:52:58

Comparing candidate commit 0829a33 in PR branch munir/agentbased-llmo with baseline commit fd67a37 in branch main.

Found 0 performance improvements and 4 performance regressions! Performance is the same for 617 metrics, 10 unstable metrics.

scenario:iastaspects-stringio_aspect

  • 🟥 execution_time [+667.144µs; +718.728µs] or [+17.338%; +18.678%]

scenario:iastaspectsospath-ospathbasename_aspect

  • 🟥 execution_time [+102.717µs; +110.564µs] or [+24.084%; +25.924%]

scenario:span-start

  • 🟥 execution_time [+1.463ms; +1.628ms] or [+9.365%; +10.419%]

scenario:telemetryaddmetric-1-count-metric-1-times

  • 🟥 execution_time [+284.226ns; +319.210ns] or [+13.394%; +15.043%]

mabdinur and others added 2 commits May 23, 2026 14:03
Rebind the fallback processor after mock writer swap in bedrock/MCP fixtures
so meta_struct is not scrubbed under USER_REJECT. Set DD_APM_TRACING_ENABLED
for the MCP distributed-tracing subprocess test. Add DD_API_KEY to LLMOBS_DIRECT
subprocess tests and read tags before span finish scrubs meta_struct.

Co-authored-by: Cursor <cursoragent@cursor.com>
@mabdinur mabdinur marked this pull request as ready for review May 27, 2026 21:58
@mabdinur mabdinur requested review from a team as code owners May 27, 2026 21:58

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0829a33d95


# the span (APM-side extract would duplicate without the dedup tag).
span.set_tag(LLMOBS_SUBMITTED_TAG_KEY, "1")
span._remove_struct_tag(LLMOBS_STRUCT.KEY)
self._llmobs_span_writer.enqueue(event)

P2: Regenerate agentless fallback events before enqueueing

When _export_mode == APM_AGENTLESS and the root sampling priority is <= 0, this enqueues the cached event that was rendered for APM meta_struct extraction rather than for LLMObsSpanWriter. In that agentless path _prepare_llmobs_span_data has already applied APM-intake-only transformations such as replacing dotted tag keys, and _llmobs_tags/normalization omit error details that the direct span writer normally includes; rejected errored spans or spans with dotted user tags will therefore be rescued with mutated tags or missing error information. Cache a writer-shaped event separately, or rebuild the event before handing it to the span writer.



Comment on lines -740 to -742
self._trace_rate_limit = -1
self._trace_compute_stats = False
setattr(self, "_trace_sampling_rules", "")
Contributor


Why have those things been removed?
