
OpenTelemetry tracing support#786

Open
tewbo wants to merge 38 commits into ydb-platform:main from tewbo:otel-tracing-support

Conversation

@tewbo

@tewbo tewbo commented Mar 21, 2026

Pull request type

  • Bugfix
  • Feature
  • Code style update (formatting, renaming)
  • Refactoring (no functional changes, no api changes)
  • Build related changes
  • Documentation content changes
  • Other (please describe):

What is the current behavior?

The YDB Python SDK does not provide built-in OpenTelemetry tracing support. There is a legacy integration with the OpenTracing API, which is a deprecated standard.

Issue Number: N/A

What is the new behavior?

Adds OpenTelemetry tracing support to the YDB Python SDK. When enabled via enable_tracing(), the SDK automatically creates spans for key operations:

  • ydb.CreateSession — session creation
  • ydb.ExecuteQuery — query execution (session and transaction level, both sync and async)
  • ydb.Commit / ydb.Rollback — transaction commit and rollback
  • ydb.Driver.Initialize — driver initialization

Each span includes standard attributes: db.system.name, db.namespace, server.address, server.port, ydb.session.id, ydb.node.id, ydb.tx.id.

W3C Trace Context (traceparent) is automatically propagated in gRPC metadata, enabling end-to-end distributed tracing between client and YDB server. Execute spans cover the full operation lifecycle, including streaming result iteration — not just the initial gRPC call. Errors are recorded on spans with error.type, db.response.status_code, and exception events.
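
For reference, a `traceparent` value follows the W3C Trace Context layout `version-traceid-parentid-flags`. The sketch below builds one with the standard library and appends it to gRPC-style metadata (a list of key/value tuples). The helper names `make_traceparent` and `inject` and the sample metadata entry are hypothetical and illustrate the wire format only; they are not the functions this PR adds.

```python
import secrets

_HEX = set("0123456789abcdef")  # the W3C format uses lowercase hex


def make_traceparent(trace_id: str, span_id: str, sampled: bool = True) -> str:
    """Format a version-00 W3C traceparent value."""
    if len(trace_id) != 32 or not set(trace_id) <= _HEX:
        raise ValueError("trace_id must be 32 lowercase hex characters")
    if len(span_id) != 16 or not set(span_id) <= _HEX:
        raise ValueError("span_id must be 16 lowercase hex characters")
    flags = "01" if sampled else "00"  # bit 0 is the 'sampled' flag
    return f"00-{trace_id}-{span_id}-{flags}"


def inject(metadata, traceparent):
    """Return a new metadata list with the traceparent entry appended."""
    return list(metadata) + [("traceparent", traceparent)]


trace_id = secrets.token_hex(16)  # 16 random bytes -> 32 hex chars
span_id = secrets.token_hex(8)    # 8 random bytes -> 16 hex chars
# The first metadata entry is an illustrative placeholder.
metadata = inject([("x-example-header", "value")], make_traceparent(trace_id, span_id))
```

A server that understands W3C Trace Context can read this entry and attach its own spans to the same trace.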

Tracing is opt-in (pip install ydb[tracing] + enable_tracing()). Without it, the SDK behavior is unchanged — all tracing code paths are no-op.
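
One common way to get "no-op unless enabled" behavior is a module-level registry that hands out a do-nothing span until a real span factory is registered. The sketch below illustrates that pattern with the standard library only. `_Registry`, `NoopSpan`, and `RecordingSpan` are hypothetical names, and the attribute value `"ydb"` is an assumption; none of this is the PR's actual code.

```python
from contextlib import contextmanager


class NoopSpan:
    """Span stand-in used while tracing is disabled; every call does nothing."""

    def set_attribute(self, key, value):
        pass

    def end(self):
        pass


class RecordingSpan(NoopSpan):
    """Toy 'real' span that stores attributes so the effect is observable."""

    def __init__(self, name):
        self.name = name
        self.attributes = {}
        self.ended = False

    def set_attribute(self, key, value):
        self.attributes[key] = value

    def end(self):
        self.ended = True


class _Registry:
    def __init__(self):
        self._factory = None  # None means tracing is disabled

    def enable(self, factory):
        self._factory = factory

    def disable(self):
        self._factory = None

    @contextmanager
    def span(self, name):
        span = self._factory(name) if self._factory else NoopSpan()
        try:
            yield span
        finally:
            span.end()


registry = _Registry()

with registry.span("ydb.ExecuteQuery") as s:  # disabled: a NoopSpan is used
    s.set_attribute("db.system.name", "ydb")  # assumed value, for illustration
disabled_type = type(s).__name__

registry.enable(RecordingSpan)
with registry.span("ydb.ExecuteQuery") as s:  # enabled: attributes are kept
    s.set_attribute("db.system.name", "ydb")
```

Call sites stay identical in both modes, which is what keeps the disabled path cheap and free of conditionals.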

Other information

  • Includes unit tests for sync, async, error handling, parent-child relationships, context propagation, noop mode, and concurrent span isolation

@tewbo tewbo marked this pull request as draft March 23, 2026 10:36
@KirillKurdyukov KirillKurdyukov self-requested a review March 23, 2026 13:47
@tewbo tewbo marked this pull request as ready for review March 24, 2026 07:02
Comment thread ydb/opentelemetry/_plugin.py Outdated
Comment thread setup.py Outdated
Comment thread examples/opentelemetry/example.py Outdated
Comment thread ydb/opentelemetry/_plugin.py Outdated
@tewbo tewbo marked this pull request as draft April 9, 2026 17:56
@tewbo tewbo marked this pull request as ready for review April 9, 2026 18:15
KirillKurdyukov and others added 3 commits April 20, 2026 14:38
Query-service retries now emit an umbrella INTERNAL ydb.RunWithRetry span
and a ydb.Try INTERNAL span per attempt. Each ydb.Try carries the
ydb.retry.backoff_ms attribute (the sleep preceding the attempt — 0 for
the first one, i.e. the next-attempt timeline includes the backoff).
Retriable exceptions are recorded on the owning ydb.Try span, and an
exception that escapes the whole retry loop (including an
asyncio.CancelledError hitting a backoff sleep) is recorded on the outer
ydb.RunWithRetry span.
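
The span layout described in this commit message can be pictured with a toy retry loop. This is a stdlib-only sketch under stated assumptions: `Span`, `RetriableError`, and `run_with_retry` are hypothetical stand-ins, the doubling backoff is illustrative, and the real logic lives in ydb/retries.py and may differ.

```python
import time


class Span:
    """Minimal in-memory span record (hypothetical, for illustration)."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.attributes = {}
        self.exceptions = []


class RetriableError(Exception):
    pass


def run_with_retry(op, spans, max_attempts=3, base_backoff_ms=10, sleep=time.sleep):
    outer = Span("ydb.RunWithRetry")
    spans.append(outer)
    backoff_ms = 0  # the first attempt is not preceded by a sleep
    try:
        for attempt in range(max_attempts):
            attempt_span = Span("ydb.Try", parent=outer)
            attempt_span.attributes["ydb.retry.backoff_ms"] = backoff_ms
            spans.append(attempt_span)
            if backoff_ms > 0:
                sleep(backoff_ms / 1000)
            try:
                return op()
            except RetriableError as exc:
                attempt_span.exceptions.append(exc)  # recorded on the attempt
                if attempt == max_attempts - 1:
                    raise
                backoff_ms = base_backoff_ms * 2 ** attempt
    except BaseException as exc:
        outer.exceptions.append(exc)  # anything escaping the loop
        raise


calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RetriableError("try again")
    return "ok"


spans = []
result = run_with_retry(flaky, spans, sleep=lambda seconds: None)
```

Catching `BaseException` on the outer level is what lets a cancellation raised during the backoff sleep land on the umbrella span.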

CLIENT spans (ydb.CreateSession, ydb.ExecuteQuery, ydb.Commit,
ydb.Rollback) now also emit network.peer.address / network.peer.port
for the concrete node serving the session, while server.address /
server.port keep meaning the host from the connection string.
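
A small sketch of how the two address pairs differ on a CLIENT span. The helper `build_client_attributes`, the host names, and the ports are hypothetical; only the attribute keys come from this commit message.

```python
def build_client_attributes(connection_host, connection_port, peer_host=None, peer_port=None):
    """Build span attributes; peer fields are added only when known."""
    attrs = {
        # Host and port taken from the connection string.
        "server.address": connection_host,
        "server.port": connection_port,
    }
    # Concrete node serving the session, if it has been resolved.
    if peer_host is not None:
        attrs["network.peer.address"] = peer_host
    if peer_port is not None:
        attrs["network.peer.port"] = peer_port
    return attrs


attrs = build_client_attributes("ydb.example.net", 2135, "node-7.example.net", 2136)
unresolved = build_client_attributes("ydb.example.net", 2135)
```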

Also fixes a "Пр" typo in docs/opentelemetry.rst and corrects span names
(ydb.CommitTransaction -> ydb.Commit, ydb.RollbackTransaction -> ydb.Rollback).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Move ydb.RunWithRetry / ydb.Try span emission directly into
retry_operation_sync / retry_operation_async in ydb/retries.py, and drop
the short-lived ydb.query._retries shim. Tracing is still no-op by
default, so there is no cost for the table-service callers that share
the same retry loop; we just stop duplicating the retry logic to add
spans.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…p session.id/tx.id

RPC (CLIENT-kind) spans now carry the peer metadata from the discovery
endpoint map, not from the grpc-target string of the request:

  * network.peer.address = EndpointInfo.address (the node host)
  * network.peer.port    = EndpointInfo.port
  * ydb.node.dc          = EndpointInfo.location

To do that, EndpointOptions and Connection now also carry address/port/
location populated by resolver.endpoints_with_options(); sessions
resolve their peer tuple via driver._store.connections_by_node_id after
CreateSession returns, which is the right place to ask which node owns
this session.

Dropped the noisy ydb.session.id and ydb.tx.id attributes - they pollute
every span and are recoverable from trace context if really needed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Member

@vgvoleg vgvoleg left a comment

Please address the review comments and rebase onto the latest main.

Comment thread docs/index.rst Outdated
Comment thread ydb/opentelemetry/_plugin.py Outdated
Comment thread ydb/query/base.py Outdated
Comment thread ydb/retries.py Outdated
Comment thread ydb/opentelemetry/tracing.py
Comment thread ydb/opentelemetry/tracing.py Outdated
Comment thread ydb/opentelemetry/plugin.py

@robot-vibe-db robot-vibe-db Bot left a comment

AI Review Summary

Verdict: ✅ No critical issues found

Critical issues

No critical issues found.

Other findings

  • Major | High: Async Connection class is missing peer_address, peer_port, peer_location slots/init — network.peer.* and ydb.node.dc span attributes are silently never populated in the async path — ydb/aio/connection.py:143-165

This review was generated automatically. Critical issues require attention; other findings are advisory.
If this comment was useful, please give it a 👍 — it helps us improve the review bot.

Comment thread ydb/aio/connection.py

Copilot AI left a comment

Pull request overview

Adds first-class OpenTelemetry (OTel) instrumentation to the YDB Python SDK, enabling opt-in span creation and W3C Trace Context propagation over gRPC for query/session/transaction operations (sync + async), plus docs, examples, and unit tests.

Changes:

  • Introduce an internal OTel registry + plugin bridge, with enable_tracing()/disable_tracing() and gRPC metadata injection (traceparent).
  • Instrument key SDK operations (Driver wait/init, session create, query execute streams, tx begin/commit/rollback) and add retry-attempt spans (ydb.RunWithRetry / ydb.Try).
  • Add tracing-focused unit tests, documentation, and a runnable Docker-based example stack (collector + Tempo + Grafana).
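
Ending a span when a result stream is consumed, rather than when the initial call returns, can be sketched as an iterator wrapper. `SpanEndingIterator` and its `on_finish` callback are hypothetical names used for illustration; the PR's iterators in ydb/query/base.py and ydb/aio/query/base.py may be structured differently.

```python
class SpanEndingIterator:
    """Wrap an iterator and call on_finish exactly once when it is exhausted or fails."""

    def __init__(self, inner, on_finish):
        self._inner = iter(inner)
        self._on_finish = on_finish
        self._finished = False

    def _finish(self, error=None):
        if not self._finished:
            self._finished = True
            self._on_finish(error)

    def __iter__(self):
        return self

    def __next__(self):
        try:
            return next(self._inner)
        except StopIteration:
            self._finish()  # normal end of stream
            raise
        except Exception as exc:
            self._finish(exc)  # stream failed mid-iteration
            raise


events = []
stream = SpanEndingIterator([1, 2, 3], lambda err: events.append(("end", err)))
first = next(stream)
ended_early = bool(events)  # still False: the stream is not consumed yet
rest = list(stream)
```

A wrapper like this never fires if the caller abandons the stream halfway, so a real implementation would presumably also need a close or cancel path.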

Reviewed changes

Copilot reviewed 41 out of 43 changed files in this pull request and generated 6 comments.

Show a summary per file
| File | Description |
| --- | --- |
| ydb/table_test.py | Updates retry tests to expect 0.0 “marker” sleep opts for skip-backoff errors. |
| ydb/retries.py | Adds ydb.RunWithRetry / ydb.Try spans and emits 0.0 sleep markers for fast-retry errors. |
| ydb/resolver.py | Plumbs endpoint address/port/location into EndpointOptions for peer attribution. |
| ydb/query/transaction.py | Wraps tx begin/commit/rollback + execute streams in OTel spans. |
| ydb/query/session.py | Adds session peer resolution and wraps create/execute in OTel spans. |
| ydb/query/base.py | Extends sync result iterator to end the associated span when the stream is consumed. |
| ydb/pool.py | Wraps Driver.wait() with an internal ydb.Driver.Initialize span. |
| ydb/opentelemetry/tracing.py | Adds no-op span + registry + helpers to build standard YDB span attributes. |
| ydb/opentelemetry/_plugin.py | Implements OTel bridge: span wrapper, error recording, and W3C injection hook. |
| ydb/opentelemetry/__init__.py | Public API: enable_tracing() / disable_tracing() with optional dependency handling. |
| ydb/connection.py | Injects OTel trace metadata into sync gRPC call metadata; extends endpoint options/connection peer fields. |
| ydb/aio/query/transaction.py | Async tx begin/commit/rollback + execute stream spans. |
| ydb/aio/query/session.py | Async session create/execute spans + peer attribute support. |
| ydb/aio/query/base.py | Async result iterator ends span on stream consumption. |
| ydb/aio/pool.py | Async Driver.wait() wrapped in ydb.Driver.Initialize span. |
| ydb/aio/driver.py | Passes **kwargs through config creation for parity/extensibility. |
| ydb/aio/connection.py | Injects OTel trace metadata into async gRPC call metadata; adds peer fields. |
| tests/tracing/test_tracing_sync.py | New sync tracing tests (spans/attrs/errors/parenting/retry nesting). |
| tests/tracing/test_tracing_async.py | New async tracing tests (spans/attrs/errors/parenting/retry nesting/concurrency). |
| tests/tracing/conftest.py | New OTel in-memory exporter fixture for tracing tests. |
| tests/tracing/__init__.py | Marks tracing tests package. |
| test-requirements.txt | Adds OTel deps for CI tests (api/sdk/exporter). |
| setup.py | Adds opentelemetry extra (opentelemetry-api). |
| pyproject.toml | Mypy config: ignore missing imports for opentelemetry.*. |
| examples/opentelemetry/ydb_config/ydb-config-with-tracing.yaml | Example YDB server config enabling OTel tracing. |
| examples/opentelemetry/ydb_config/README.md | Instructions for enabling server-side tracing via custom YDB config. |
| examples/opentelemetry/ydb_config/otel-tracing-snippet.yaml | Minimal tracing snippet for YDB config. |
| examples/opentelemetry/tempo.yaml | Tempo config for local demo stack. |
| examples/opentelemetry/requirements.txt | Example dependency pins (sdk + OTLP exporter). |
| examples/opentelemetry/README.md | End-to-end instructions for running the OTel demo (host + Docker). |
| examples/opentelemetry/prometheus.yaml | Prometheus scrape config for the demo stack. |
| examples/opentelemetry/otel-collector-config.yaml | OTel collector config (OTLP receive → Tempo + debug). |
| examples/opentelemetry/otel_example.py | Runnable async demo script using enable_tracing() and concurrent workload. |
| examples/opentelemetry/grafana/provisioning/datasources/datasources.yaml | Grafana provisioning for Prometheus + Tempo datasources. |
| examples/opentelemetry/grafana/provisioning/dashboards/dashboards.yaml | Grafana dashboard provisioning config. |
| examples/opentelemetry/grafana/dashboards/README.md | Placeholder readme for dashboards directory. |
| examples/opentelemetry/Dockerfile | Build image to run the demo script inside Docker Compose. |
| examples/opentelemetry/compose-e2e.yaml | Full demo Compose stack (YDB + collector + Tempo + Grafana + demo runner). |
| docs/opentelemetry.rst | New documentation page describing instrumentation, attributes, and usage. |
| docs/index.rst | Adds Observability section linking to the new OTel docs page. |
| CHANGELOG.md | Adds an Unreleased entry describing OTel and retry instrumentation changes. |
| .gitignore | Normalizes .idea/ and .venv/ ignore patterns. |
| .dockerignore | Whitelists the OTel example files for Docker build context. |

Comment thread ydb/retries.py
Comment thread ydb/query/transaction.py
Comment thread ydb/aio/query/transaction.py
Comment thread tests/tracing/conftest.py
Comment thread docs/opentelemetry.rst
Comment thread setup.py
@github-actions

github-actions Bot commented May 7, 2026

🌋 SLO Test Results

Status: 🟢 2 workloads tested • All passed

| Workload | Metrics | Regressions | Improvements | Links |
| --- | --- | --- | --- | --- |
| 🟢 ydb-python-sync-table | 8 | 1 | 0 | Report, Check |
| 🟢 ydb-python-sync-query | 8 | 1 | 0 | Report, Check |

Generated by ydb-slo-action


Copilot AI left a comment

Pull request overview

Copilot reviewed 41 out of 42 changed files in this pull request and generated 4 comments.

Comment thread ydb/retries.py
Comment thread CHANGELOG.md
Comment thread setup.py
Comment thread ydb/opentelemetry/tracing.py

@robot-vibe-db robot-vibe-db Bot left a comment

AI Review Summary

Verdict: ❌ 1 critical issue(s) found

Critical issues

  • Critical | Medium: retry_operation_sync unconditionally calls time.sleep(0.0) for skip-yield errors, while the async path guards with if timeout > 0. For high-frequency fast retries (Aborted, BadSession), this introduces an unnecessary syscall on every retry attempt in the sync path — ydb/retries.py:177
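
The guard this finding asks for mirrors the `if timeout > 0` check it quotes from the async path. A minimal sketch with a hypothetical helper, `backoff_sleep`, which is not the code in ydb/retries.py:

```python
import time


def backoff_sleep(delay_s, sleep=time.sleep):
    """Sleep only for a positive delay; return True if a sleep call was made."""
    if delay_s > 0:
        sleep(delay_s)
        return True
    return False


calls = []  # records the delays that actually reached the sleep function
slept_zero = backoff_sleep(0.0, sleep=calls.append)
slept_positive = backoff_sleep(0.05, sleep=calls.append)
```

Passing the sleep function as a parameter keeps the guard testable without waiting on a real clock.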

Other findings

  • Major | Medium: _split_endpoint returns empty server.address for endpoints without a port after scheme stripping (e.g. bare hostname) — ydb/opentelemetry/tracing.py:104
  • Minor | Medium: create_span() returns a context manager while create_ydb_span() returns a bare span; the inconsistent return types are an API hazard for future contributors — ydb/opentelemetry/tracing.py:129-139
  • Minor | High: node_id or 0 is redundant; the preceding if node_id is not None: guard already ensures node_id is truthy-safe — ydb/opentelemetry/tracing.py:125
  • Minor | Medium: **kwargs addition in aio/driver.py is unrelated to OpenTelemetry and silently swallows unknown keyword arguments, potentially masking user typos — ydb/aio/driver.py:72
  • Nit | Low: enable_tracing() silently ignores a different tracer on subsequent calls. While documented, a warning log would help users who reconfigure their OTel setup — ydb/opentelemetry/plugin.py:118
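
For the `_split_endpoint` finding, one possible shape of a splitter that keeps the host when no port is present. This is a hedged sketch: `split_endpoint` and its `default_port` parameter are hypothetical, and the actual function in ydb/opentelemetry/tracing.py is not shown in this conversation.

```python
def split_endpoint(endpoint, default_port=None):
    """Split '[scheme://]host[:port][/path]' into (host, port)."""
    # Drop an optional scheme such as grpc:// or grpcs://
    _, sep, rest = endpoint.partition("://")
    hostport = rest if sep else endpoint
    # Drop any path or query suffix.
    hostport = hostport.split("/", 1)[0]
    if hostport.startswith("["):
        # Bracketed IPv6 literal, e.g. [::1]:2135
        host, _, tail = hostport[1:].partition("]")
        port_text = tail[1:] if tail.startswith(":") else ""
    elif ":" in hostport:
        host, _, port_text = hostport.rpartition(":")
    else:
        # Bare hostname: keep it as the host instead of returning "".
        host, port_text = hostport, ""
    port = int(port_text) if port_text.isdigit() else default_port
    return host, port


with_port = split_endpoint("grpc://localhost:2136")
bare = split_endpoint("localhost")
scheme_only = split_endpoint("grpcs://ydb.example.net")
```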

This review was generated automatically. Critical issues require attention; other findings are advisory.
If this comment was useful, please give it a 👍 — it helps us improve the review bot.

Comment thread ydb/retries.py Outdated
Comment thread ydb/opentelemetry/tracing.py Outdated
Comment thread ydb/opentelemetry/tracing.py Outdated
Comment thread ydb/opentelemetry/plugin.py
Comment thread ydb/aio/driver.py
Comment thread ydb/opentelemetry/tracing.py
@vgvoleg vgvoleg added the SLO label May 7, 2026

4 participants