OpenTelemetry tracing support #786
Query-service retries now emit an umbrella INTERNAL ydb.RunWithRetry span and a ydb.Try INTERNAL span per attempt. Each ydb.Try carries the ydb.retry.backoff_ms attribute (the sleep preceding the attempt — 0 for the first one, i.e. the next-attempt timeline includes the backoff). Retriable exceptions are recorded on the owning ydb.Try span, and an exception that escapes the whole retry loop (including an asyncio.CancelledError hitting a backoff sleep) is recorded on the outer ydb.RunWithRetry span. CLIENT spans (ydb.CreateSession, ydb.ExecuteQuery, ydb.Commit, ydb.Rollback) now also emit network.peer.address / network.peer.port for the concrete node serving the session, while server.address / server.port keep meaning the host from the connection string. Also fixes a "Пр" typo in docs/opentelemetry.rst and corrects span names (ydb.CommitTransaction -> ydb.Commit, ydb.RollbackTransaction -> ydb.Rollback). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
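To make the span layout described in this commit concrete, here is a minimal stdlib-only sketch. It is not the SDK's retry loop: `span`, `RECORDED`, `RetriableError` and `run_with_retry` are hypothetical names invented for illustration, and the in-memory recorder stands in for a real tracer. It assumes, as the commit message suggests, that the backoff sleep happens inside the `ydb.Try` span of the attempt it precedes.

```python
import time
from contextlib import contextmanager

# Hypothetical in-memory recorder standing in for a real tracer.
RECORDED = []


@contextmanager
def span(name, **attributes):
    """Record a span and the type of any exception that escapes it."""
    record = {"name": name, "attributes": dict(attributes), "exception": None}
    RECORDED.append(record)
    try:
        yield record
    except BaseException as exc:  # BaseException also covers cancellation
        record["exception"] = type(exc).__name__
        raise


class RetriableError(Exception):
    """Stand-in for an error the retry loop is allowed to retry."""


def run_with_retry(operation, max_attempts=3, backoff_ms=(0, 10, 20), sleep=time.sleep):
    """One umbrella span for the loop, one child span per attempt."""
    with span("ydb.RunWithRetry"):
        for attempt in range(max_attempts):
            delay_ms = backoff_ms[min(attempt, len(backoff_ms) - 1)]
            is_last = attempt == max_attempts - 1
            try:
                # The attempt span starts before the backoff sleep, so its
                # duration includes the sleep that precedes the attempt.
                with span("ydb.Try", **{"ydb.retry.backoff_ms": delay_ms}):
                    if delay_ms > 0:
                        sleep(delay_ms / 1000.0)
                    return operation()
            except RetriableError:
                # Already recorded on the ydb.Try span; re-raising on the last
                # attempt also records it on the outer ydb.RunWithRetry span.
                if is_last:
                    raise
```

With an operation that fails twice and then succeeds, the recorder ends up with one `ydb.RunWithRetry` entry followed by three `ydb.Try` entries whose backoff attributes are 0, 10 and 20.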
Move ydb.RunWithRetry / ydb.Try span emission directly into retry_operation_sync / retry_operation_async in ydb/retries.py, and drop the short-lived ydb.query._retries shim. Tracing is still no-op by default, so there is no cost for the table-service callers that share the same retry loop; we just stop duplicating the retry logic to add spans. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…p session.id/tx.id

RPC (CLIENT-kind) spans now carry the peer metadata from the discovery endpoint map, not from the grpc-target string of the request:
* network.peer.address = EndpointInfo.address (the node host)
* network.peer.port = EndpointInfo.port
* ydb.node.dc = EndpointInfo.location

To do that, EndpointOptions and Connection now also carry address/port/location populated by resolver.endpoints_with_options(); sessions resolve their peer tuple via driver._store.connections_by_node_id after CreateSession returns, which is the right place to ask which node owns this session.

Dropped the noisy ydb.session.id and ydb.tx.id attributes - they pollute every span and are recoverable from trace context if really needed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
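The attribute mapping in this commit can be sketched as follows. This is an illustration only: `EndpointInfoSketch`, `peer_attributes` and `session_peer_attributes` are hypothetical names, and the plain dict keyed by node id is an assumed stand-in for the driver's connection store, not its real structure.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class EndpointInfoSketch:
    """Hypothetical stand-in for the discovery endpoint record."""
    address: str
    port: int
    location: Optional[str] = None


def peer_attributes(endpoint: Optional[EndpointInfoSketch]) -> Dict[str, object]:
    """Build peer span attributes from discovery data, not the gRPC target."""
    if endpoint is None:
        return {}  # unknown node: emit no peer attributes rather than guess
    attrs: Dict[str, object] = {
        "network.peer.address": endpoint.address,
        "network.peer.port": endpoint.port,
    }
    if endpoint.location:
        attrs["ydb.node.dc"] = endpoint.location
    return attrs


def session_peer_attributes(connections_by_node_id, node_id):
    """Resolve the node that owns a session once its node id is known."""
    return peer_attributes(connections_by_node_id.get(node_id))
```

Resolving the peer by node id after session creation means the attributes describe the node that actually owns the session, even when the request was addressed to a different host in the connection string.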
vgvoleg left a comment
Please address the review comments and rebase onto the latest main.
AI Review Summary
Verdict: ✅ No critical issues found
Critical issues
No critical issues found.
Other findings
- Major | High: Async `Connection` class is missing `peer_address`, `peer_port`, `peer_location` slots/init, so the `network.peer.*` and `ydb.node.dc` span attributes are silently never populated in the async path (`ydb/aio/connection.py:143-165`)
This review was generated automatically. Critical issues require attention; other findings are advisory.
If this comment was useful, please give it a 👍 — it helps us improve the review bot.
Pull request overview
Adds first-class OpenTelemetry (OTel) instrumentation to the YDB Python SDK, enabling opt-in span creation and W3C Trace Context propagation over gRPC for query/session/transaction operations (sync + async), plus docs, examples, and unit tests.
Changes:
- Introduce an internal OTel registry + plugin bridge, with `enable_tracing()`/`disable_tracing()` and gRPC metadata injection (`traceparent`).
- Instrument key SDK operations (Driver wait/init, session create, query execute streams, tx begin/commit/rollback) and add retry-attempt spans (`ydb.RunWithRetry`/`ydb.Try`).
- Add tracing-focused unit tests, documentation, and a runnable Docker-based example stack (collector + Tempo + Grafana).
Reviewed changes
Copilot reviewed 41 out of 43 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| ydb/table_test.py | Updates retry tests to expect 0.0 “marker” sleep opts for skip-backoff errors. |
| ydb/retries.py | Adds ydb.RunWithRetry / ydb.Try spans and emits 0.0 sleep markers for fast-retry errors. |
| ydb/resolver.py | Plumbs endpoint address/port/location into EndpointOptions for peer attribution. |
| ydb/query/transaction.py | Wraps tx begin/commit/rollback + execute streams in OTel spans. |
| ydb/query/session.py | Adds session peer resolution and wraps create/execute in OTel spans. |
| ydb/query/base.py | Extends sync result iterator to end the associated span when the stream is consumed. |
| ydb/pool.py | Wraps Driver.wait() with an internal ydb.Driver.Initialize span. |
| ydb/opentelemetry/tracing.py | Adds no-op span + registry + helpers to build standard YDB span attributes. |
| ydb/opentelemetry/_plugin.py | Implements OTel bridge: span wrapper, error recording, and W3C injection hook. |
| `ydb/opentelemetry/__init__.py` | Public API: enable_tracing() / disable_tracing() with optional dependency handling. |
| ydb/connection.py | Injects OTel trace metadata into sync gRPC call metadata; extends endpoint options/connection peer fields. |
| ydb/aio/query/transaction.py | Async tx begin/commit/rollback + execute stream spans. |
| ydb/aio/query/session.py | Async session create/execute spans + peer attribute support. |
| ydb/aio/query/base.py | Async result iterator ends span on stream consumption. |
| ydb/aio/pool.py | Async Driver.wait() wrapped in ydb.Driver.Initialize span. |
| ydb/aio/driver.py | Passes **kwargs through config creation for parity/extensibility. |
| ydb/aio/connection.py | Injects OTel trace metadata into async gRPC call metadata; adds peer fields. |
| tests/tracing/test_tracing_sync.py | New sync tracing tests (spans/attrs/errors/parenting/retry nesting). |
| tests/tracing/test_tracing_async.py | New async tracing tests (spans/attrs/errors/parenting/retry nesting/concurrency). |
| tests/tracing/conftest.py | New OTel in-memory exporter fixture for tracing tests. |
| `tests/tracing/__init__.py` | Marks tracing tests package. |
| test-requirements.txt | Adds OTel deps for CI tests (api/sdk/exporter). |
| setup.py | Adds opentelemetry extra (opentelemetry-api). |
| pyproject.toml | Mypy config: ignore missing imports for opentelemetry.*. |
| examples/opentelemetry/ydb_config/ydb-config-with-tracing.yaml | Example YDB server config enabling OTel tracing. |
| examples/opentelemetry/ydb_config/README.md | Instructions for enabling server-side tracing via custom YDB config. |
| examples/opentelemetry/ydb_config/otel-tracing-snippet.yaml | Minimal tracing snippet for YDB config. |
| examples/opentelemetry/tempo.yaml | Tempo config for local demo stack. |
| examples/opentelemetry/requirements.txt | Example dependency pins (sdk + OTLP exporter). |
| examples/opentelemetry/README.md | End-to-end instructions for running the OTel demo (host + Docker). |
| examples/opentelemetry/prometheus.yaml | Prometheus scrape config for the demo stack. |
| examples/opentelemetry/otel-collector-config.yaml | OTel collector config (OTLP receive → Tempo + debug). |
| examples/opentelemetry/otel_example.py | Runnable async demo script using enable_tracing() and concurrent workload. |
| examples/opentelemetry/grafana/provisioning/datasources/datasources.yaml | Grafana provisioning for Prometheus + Tempo datasources. |
| examples/opentelemetry/grafana/provisioning/dashboards/dashboards.yaml | Grafana dashboard provisioning config. |
| examples/opentelemetry/grafana/dashboards/README.md | Placeholder readme for dashboards directory. |
| examples/opentelemetry/Dockerfile | Build image to run the demo script inside Docker Compose. |
| examples/opentelemetry/compose-e2e.yaml | Full demo Compose stack (YDB + collector + Tempo + Grafana + demo runner). |
| docs/opentelemetry.rst | New documentation page describing instrumentation, attributes, and usage. |
| docs/index.rst | Adds Observability section linking to the new OTel docs page. |
| CHANGELOG.md | Adds an Unreleased entry describing OTel and retry instrumentation changes. |
| .gitignore | Normalizes .idea/ and .venv/ ignore patterns. |
| .dockerignore | Whitelists the OTel example files for Docker build context. |
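The `ydb/query/base.py` and `ydb/aio/query/base.py` rows describe result iterators that end the span once the stream is consumed. A minimal sketch of that idea for the sync case, assuming a span object with `end()` and a hypothetical `record_error()` method (both `SpanEndingIterator` and `RecordingSpan` are illustrative names, not SDK classes):

```python
class RecordingSpan:
    """Hypothetical span double that only remembers what happened to it."""

    def __init__(self):
        self.ended = False
        self.errors = []

    def record_error(self, exc):
        self.errors.append(exc)

    def end(self):
        self.ended = True


class SpanEndingIterator:
    """Wrap a result stream so the span ends when iteration finishes."""

    def __init__(self, iterator, span):
        self._iterator = iter(iterator)
        self._span = span
        self._finished = False

    def _finish(self, error=None):
        if self._finished:  # end the span at most once
            return
        self._finished = True
        if error is not None:
            self._span.record_error(error)
        self._span.end()

    def __iter__(self):
        return self

    def __next__(self):
        try:
            return next(self._iterator)
        except StopIteration:
            self._finish()  # normal exhaustion closes the span
            raise
        except Exception as exc:
            self._finish(exc)  # a mid-stream failure is recorded, then closes it
            raise
```

One caveat of this shape is that a caller who abandons the iterator early never ends the span, so a real implementation would likely also need a close or cancel path.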
🌋 SLO Test Results
Status: 🟢 2 workloads tested • All passed
Generated by ydb-slo-action
AI Review Summary
Verdict: ❌ 1 critical issue(s) found
Critical issues
- Critical | Medium: `retry_operation_sync` unconditionally calls `time.sleep(0.0)` for skip-yield errors, while the async path guards with `if timeout > 0`. For high-frequency fast retries (Aborted, BadSession), this introduces an unnecessary syscall on every retry attempt in the sync path (`ydb/retries.py:177`)
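The guard the finding asks for can be sketched in a few lines. `backoff_sleep` is a hypothetical helper, not a function from `ydb/retries.py`; the injectable `sleep` parameter exists only to make the behaviour observable in a test.

```python
import time


def backoff_sleep(delay_seconds, sleep=time.sleep):
    """Sleep only for a positive delay; report whether a sleep happened."""
    if delay_seconds > 0:
        sleep(delay_seconds)
        return True
    # A 0.0 delay is treated as a pure "retry immediately" marker,
    # so no sleep call is made at all.
    return False
```

This mirrors the `if timeout > 0` check the review attributes to the async path, so both paths would skip the call for fast-retry errors.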
Other findings
- Major | Medium: `_split_endpoint` returns empty `server.address` for endpoints without a port after scheme stripping (e.g. bare hostname) (`ydb/opentelemetry/tracing.py:104`)
- Minor | Medium: `create_span()` returns a context manager while `create_ydb_span()` returns a bare span; the inconsistent return types are an API hazard for future contributors (`ydb/opentelemetry/tracing.py:129-139`)
- Minor | High: `node_id or 0` is redundant; the preceding `if node_id is not None:` guard already ensures `node_id` is truthy-safe (`ydb/opentelemetry/tracing.py:125`)
- Minor | Medium: `**kwargs` addition in `aio/driver.py` is unrelated to OpenTelemetry and silently swallows unknown keyword arguments, potentially masking user typos (`ydb/aio/driver.py:72`)
- Nit | Low: `enable_tracing()` silently ignores a different tracer on subsequent calls. While documented, a warning log would help users who reconfigure their OTel setup (`ydb/opentelemetry/plugin.py:118`)
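For the `_split_endpoint` finding, one way to handle endpoints without a port is sketched below. `split_endpoint` is a hypothetical function written for this comment, not the code under review, and it assumes IPv6 literals are written in bracketed form (a bare `::1` is not handled).

```python
def split_endpoint(endpoint, default_port=None):
    """Split 'scheme://host:port/path' into (host, port).

    Falls back to (host, default_port) when no numeric port is present,
    so a bare hostname never yields an empty address.
    """
    if "://" in endpoint:
        endpoint = endpoint.split("://", 1)[1]  # strip the scheme
    endpoint = endpoint.split("/", 1)[0]  # drop any path or database suffix
    host, sep, port = endpoint.rpartition(":")
    if sep and port.isdigit():
        return host, int(port)
    return endpoint, default_port
```

Returning `None` for a missing port lets the caller decide whether to omit `server.port` or substitute a default, instead of emitting an empty `server.address`.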
Pull request type
What is the current behavior?
The YDB Python SDK does not provide built-in OpenTelemetry tracing support. There is a legacy integration with the OpenTracing API, which is a deprecated standard.
Issue Number: N/A
What is the new behavior?
Adds OpenTelemetry tracing support to the YDB Python SDK. When enabled via enable_tracing(), the SDK automatically creates spans for key operations:
Each span includes standard attributes: db.system.name, db.namespace, server.address, server.port, ydb.session.id, ydb.node.id, ydb.tx.id.
W3C Trace Context (traceparent) is automatically propagated in gRPC metadata, enabling end-to-end distributed tracing between client and YDB server. Execute spans cover the full operation lifecycle, including streaming result iteration — not just the initial gRPC call. Errors are recorded on spans with error.type, db.response.status_code, and exception events.
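For reference, the `traceparent` value has the form `version-trace_id-parent_id-flags` in lowercase hex. The sketch below builds such a value and appends it to a list of gRPC-style metadata pairs. It is a stdlib-only illustration with hypothetical helper names; a real integration would normally rely on the OpenTelemetry propagator API rather than format the header by hand.

```python
def format_traceparent(trace_id, span_id, sampled=True):
    """Format a W3C Trace Context 'traceparent' header value (version 00)."""
    if not 0 < trace_id < (1 << 128):
        raise ValueError("trace_id must be a non-zero 128-bit integer")
    if not 0 < span_id < (1 << 64):
        raise ValueError("span_id must be a non-zero 64-bit integer")
    flags = 0x01 if sampled else 0x00  # bit 0 is the 'sampled' flag
    return "00-{:032x}-{:016x}-{:02x}".format(trace_id, span_id, flags)


def inject_traceparent(metadata, trace_id, span_id, sampled=True):
    """Return a new metadata list with the traceparent pair appended."""
    header = format_traceparent(trace_id, span_id, sampled)
    return list(metadata) + [("traceparent", header)]
```

Because the server reads the same header, spans it creates can attach to the client span identified by `span_id`, which is what makes the trace end to end.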
Tracing is opt-in (pip install ydb[tracing] + enable_tracing()). Without it, the SDK behavior is unchanged — all tracing code paths are no-op.
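The opt-in behaviour can be pictured as a registry that hands out no-op spans until a span factory is installed. This is a sketch of the pattern only: `NoopSpan` and `TracingRegistry` are hypothetical names, and the first-call-wins rule in `enable` is an assumption taken from the review note about repeated `enable_tracing()` calls.

```python
class NoopSpan:
    """Span that accepts the usual calls and does nothing."""

    def set_attribute(self, key, value):
        pass

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        return False  # never swallow exceptions


class TracingRegistry:
    """Hand out real spans only after tracing has been enabled."""

    def __init__(self):
        self._factory = None

    def enable(self, factory):
        # Assumed behaviour: the first factory wins, later calls are ignored.
        if self._factory is None:
            self._factory = factory

    def disable(self):
        self._factory = None

    def create_span(self, name):
        if self._factory is None:
            return NoopSpan()
        return self._factory(name)
```

Call sites can then always write `with registry.create_span(...)` without checking whether tracing is on, which keeps the disabled path free of tracing-specific branches.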
Other information