Add E2E test verifying OTEL tracing across Zenko services#2378
Conversation
Hello delthas,My role is to assist you with the merge of this Available options
Available commands
Status report is not available. |
Waiting for approvalThe following approvals are needed before I can proceed with the merge:
|
c362986 to
9a3a251
Compare
9a3a251 to
f32a16b
Compare
f32a16b to
a7b196c
Compare
Waiting for approvalThe following approvals are needed before I can proceed with the merge:
|
a7b196c to
c9ae52e
Compare
- Deploy Jaeger all-in-one (memory-only, OTLP-enabled) in the kind CI cluster alongside existing dependencies - Patch the Zenko CR with `spec.otel` (enabled, sampling ratio 1.0) so every request is traced during CI — also acts as a smoke test that OTEL doesn't break existing @premerge tests - Add a new @premerge CTST scenario that puts an S3 object and then polls the Jaeger query API to assert a trace exists with spans from both cloudserver and vault Issue: ZENKO-5258
c9ae52e to
79d656d
Compare
francoisferrand
left a comment
There was a problem hiding this comment.
overall not sure where to stand here: there is a fundamental compromise: do we want to test the "production" case first and foremost (i.e. without tracing), or is it safe-enough to enable Otel all the time (i.e. not testing production anymore, which could hide crashes or introduce subtle delays...)
| # The CR is patched later, after file-backend SSE tests have run. | ||
| bash "$(dirname "$0")/../mocks/setup-kmip.sh" | ||
|
|
||
| # Enable OTEL tracing on the Zenko CR (always-on in CI, acts as a smoke test) |
There was a problem hiding this comment.
how long does it add to CI (deployment time) ?
is there a performance impact ? (esp. if performance degradation is preventing from hitting race conditions, not sure it is better :-/ )
| kubectl patch zenko ${ZENKO_NAME} -n ${NAMESPACE} --type merge -p '{ | ||
| "spec": {"tracing": { | ||
| "enabled": true, | ||
| "samplingRatio": "1.0", |
|
|
||
| # Enable OTEL tracing on the Zenko CR (always-on in CI, acts as a smoke test) | ||
| NAMESPACE="${NAMESPACE:-default}" | ||
| kubectl patch zenko ${ZENKO_NAME} -n ${NAMESPACE} --type merge -p '{ |
There was a problem hiding this comment.
instead of patching Zenko (i.e re-reconcile), should be added to the Zenko CR before it is installed on the cluster.
| kubectl rollout status sts/keycloak --timeout=10m | ||
|
|
||
| # jaeger all-in-one (OTLP collector + query UI, memory-only) | ||
| kubectl apply -f - <<'EOF' |
There was a problem hiding this comment.
best to use a separate CR, and keep the bash script simple?
| @2.15.0 | ||
| @PreMerge | ||
| Feature: OpenTelemetry Tracing | ||
| Traces should propagate across Zenko services |
There was a problem hiding this comment.
instead of tracing every request, could we "trust" localhost : and let the test initiate a request with an Otel trace Id ?
This may limit impact of the change....
| Feature: OpenTelemetry Tracing | ||
| Traces should propagate across Zenko services | ||
|
|
||
| Scenario: S3 PutObject produces a trace spanning cloudserver and vault |
There was a problem hiding this comment.
enabling trace all the time means we are not testing the "production" case anymore...
should we instead enable/disable tracing at the beginning and end of this test? (it would reconcile and possibly trigger instability -if we are not resilient-, but would ensure we test both cases...)
| interface JaegerProcess { | ||
| serviceName: string; | ||
| tags: { key: string; value: string }[]; | ||
| } | ||
|
|
||
| interface JaegerSpan { | ||
| traceID: string; | ||
| spanID: string; | ||
| operationName: string; | ||
| processID: string; | ||
| tags: { key: string; type: string; value: unknown }[]; | ||
| } | ||
|
|
||
| interface JaegerTrace { | ||
| traceID: string; | ||
| spans: JaegerSpan[]; | ||
| processes: Record<string, JaegerProcess>; | ||
| } | ||
|
|
||
| interface JaegerSearchResponse { | ||
| data: JaegerTrace[]; | ||
| } |
There was a problem hiding this comment.
- are these really Jeager specific? or really standard OpenTelemetry?
- can't we use types from some sdk, instead of re-defining these?
| "UtilizationServicePort":"${UTILIZATION_SERVICE_PORT}", | ||
| "KubeconfigPath":"${KUBECONFIG:-${HOME}/.kube/config}" | ||
| "KubeconfigPath":"${KUBECONFIG:-${HOME}/.kube/config}", | ||
| "JaegerQueryEndpoint":"${JAEGER_QUERY_ENDPOINT}" |
There was a problem hiding this comment.
can we add trailing comma already?
so we can add new lines without modifying the previous/last one?
Summary
1.76.0)in the kind CI cluster alongside existing dependencies (Keycloak,
Prometheus, etc.)
spec.tracing(enabled, sampling ratio1.0)so every request is traced during CI — this also acts as a smoke test
that OTEL doesn't break existing
@PreMergetests@PreMergeCTST scenario that puts an S3 object and thenpolls the Jaeger query API to assert a trace exists with spans from
both connector-cloudserver and connector-vault (matching the
OTEL_SERVICE_NAMEvalues emitted by the operator)What changed
install-kind-dependencies.shjaegertracing/all-in-one:1.76.0)configure-e2e-ctst.shkubectl patch zenkowithspec.tracing, wait for cloudserver + vault rolloutsetup-e2e-env.shJaegerQueryEndpointin world paramsworld/Zenko.tsJaegerQueryEndpointadded toZenkoWorldParametersfeatures/otel-tracing.feature@PreMergescenariosteps/otel-tracing.tsWhy
Parent ticket OS-1072
tracks adding OpenTelemetry tracing across the Zenko stack. This PR
adds CI coverage to verify that traces actually propagate end-to-end
(cloudserver → vault) once tracing is enabled on the CR.
Dependencies
Requires zenko-operator PR #607 (ZKOP-539)
to be merged and the bundled operator version bumped in
solution/deps.yaml— it adds thespec.tracingCRD field andpropagates
ENABLE_OTEL/OTEL_*env vars to the manageddeployments. Without it, the
kubectl patchhere will be rejectedas an unknown field.
Issue: ZENKO-5258