
Add helper to install ceph cluster #13

Open
viktor-karpochev wants to merge 16 commits into main from vkarpochev/csi-ceph-testkit

Conversation


@viktor-karpochev (Contributor) commented Apr 23, 2026

Summary

  • Move reusable Rook/Ceph provisioning helpers into pkg/testkit so downstream module repositories can create a working Ceph-backed StorageClass without duplicating setup code.
  • Add Kubernetes helpers for Rook CephCluster / CephBlockPool, rook config overrides, Ceph credentials, csi-ceph connection/auth resources, CephStorageClass, and VolumeSnapshotClass support.
  • Keep storage-e2e focused on shared Ceph testkit utilities and update docs after the main documentation restructure.

Test Plan

  • go test ./pkg/...

Add Kubernetes helpers for CephCluster, CephBlockPool, rook-config-override,
Ceph credentials, CephClusterConnection/Authentication, CephStorageClass,
VolumeSnapshotClass, and OSD backing StorageClass resolution.

Add testkit.EnsureCephStorageClass, which orchestrates everything from module
enablement through to a working csi-ceph StorageClass, plus a csi-ceph e2e
test package.
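
As a rough usage sketch (hedged: the call shape and config fields below are
assumptions drawn from this description, not the actual API):

    // Hypothetical call shape; the real config fields and signature may differ.
    var cfg testkit.CephStorageClassConfig
    if err := testkit.EnsureCephStorageClass(ctx, cfg); err != nil {
        t.Fatalf("provisioning csi-ceph StorageClass: %v", err)
    }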

Signed-off-by: Viktor Karpochev <viktor.karpochev@flant.com>
Made-with: Cursor
@viktor-karpochev force-pushed the vkarpochev/csi-ceph-testkit branch from 4431dc7 to e4d5466 on April 23, 2026 at 09:51
viktor-karpochev and others added 14 commits on April 28, 2026 at 17:03
Move reusable Rook/Ceph provisioning and CRC toggling into storage-e2e so csi-ceph e2e can consume the shared testkit instead of carrying duplicated setup code.

Signed-off-by: Viktor Karpochev <viktor.karpochev@flant.com>
Made-with: Cursor
Keep storage-e2e focused on reusable Ceph testkit helpers while the csi-ceph repository owns its module-specific e2e suite.

Signed-off-by: Viktor Karpochev <viktor.karpochev@flant.com>
Made-with: Cursor
Keep the public Ceph helper comments aligned with the 10Gi OSD default and avoid referring to the old full 2x2 CRC matrix.

Signed-off-by: Viktor Karpochev <viktor.karpochev@flant.com>
Made-with: Cursor
Resolve the documentation restructure conflict while keeping the Ceph testkit helper docs aligned with the current tree.

Made-with: Cursor
Extend storage-e2e so callers can provision a CephFS-backed
CephStorageClass alongside the existing RBD path.

* New pkg/kubernetes/cephfilesystem.go with idempotent
  CreateCephFilesystem / WaitForCephFilesystemReady /
  DeleteCephFilesystem helpers (single replicated metadata pool +
  one replicated data pool, configurable failure domain and MDS
  active count). WaitForCephFilesystemReady accepts both
  status.phase=Ready and status.conditions[Ready]=True so it works
  across Rook revisions. Adds CephFSDataPoolFullName helper that
  encodes Rook's <fsName>-<dataPoolName> pool naming convention so
  callers can feed the right value into CephStorageClass.spec.cephFS.pool.

* pkg/testkit/ceph.go: CephStorageClassConfig grows a Type field
  ("RBD" default / "CephFS") plus CephFSName, CephFSDataPoolName,
  CephFS{Metadata,Data}Replicas, CephFSActiveMDSCount and
  CephFilesystemReadyTimeout knobs. EnsureCephStorageClass step 5
  now branches on Type to create the matching pool primitive, and
  step 8 wires the resulting CephStorageClass with rbd.pool or
  cephFS.{fsName,pool} accordingly. TeardownCephStorageClass deletes
  the right Rook primitive based on Type.

* New SkipClusterTeardown flag on CephStorageClassConfig: when
  several StorageClasses share one CephCluster, every teardown
  except the last one sets it to true so only the owning call
  removes the underlying CephCluster and rook-config-override.

* Re-export CephStorageClassTypeRBD / CephStorageClassTypeCephFS
  from the testkit package so suites don't have to import
  pkg/kubernetes just to set cfg.Type.

* docs/FUNCTIONS_GLOSSARY.md: documents the new CephFilesystem
  helpers, the CephFS branch of EnsureCephStorageClass, and the
  TeardownCephStorageClass + SkipClusterTeardown semantics.
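
A minimal sketch of the CephFS path, using only names introduced above
(values and the call shape are illustrative, not the real defaults):

    // Rook names the data pool "<fsName>-<dataPoolName>", which
    // CephFSDataPoolFullName encodes for CephStorageClass.spec.cephFS.pool.
    cfg := testkit.CephStorageClassConfig{
        Type:                 testkit.CephStorageClassTypeCephFS,
        CephFSName:           "e2e-fs", // illustrative
        CephFSDataPoolName:   "data0",  // pool becomes "e2e-fs-data0"
        CephFSActiveMDSCount: 1,
        SkipClusterTeardown:  true, // another StorageClass owns the CephCluster
    }
    if err := testkit.EnsureCephStorageClass(ctx, cfg); err != nil {
        t.Fatal(err)
    }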

Signed-off-by: Aleksandr Zimin <alexandr.zimin@flant.com>
…mplating

This bundles four related fixes that surfaced during csi-ceph e2e diagnosis,
all aimed at the same failure mode: a flapping Wi-Fi or unreliable bootstrap
network silently breaking a 50-minute test run.

1. modulePullOverride env templating
   - internal/config/overrides.go (+_test.go): ExpandEnvInModulePullOverride
     resolves ${VAR} placeholders in cluster_config.yml at config load time.
     CI sets one MODULE_IMAGE_TAG (e.g. "pr131" / "mr131") and points multiple
     modules at it without per-run YAML edits. Missing env fails fast with
     an explicit message so the wrong-image-pull confusion is gone.
   - Hooks in internal/cluster/cluster.go::LoadClusterConfig and
     pkg/cluster/cluster.go::loadClusterConfigFromPath after yaml.Unmarshal.
   - README.md documents the new ${VAR} form.
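
The expansion itself can be a thin wrapper over os.Expand; a sketch of the
fail-fast behaviour (the real ExpandEnvInModulePullOverride may match only
the ${VAR} form, which plain os.Expand does not enforce):

    func expandEnv(raw string) (string, error) {
        var missing []string
        out := os.Expand(raw, func(key string) string {
            v, ok := os.LookupEnv(key)
            if !ok {
                missing = append(missing, key)
            }
            return v
        })
        if len(missing) > 0 {
            // Fail fast with an explicit message instead of pulling the wrong image.
            return "", fmt.Errorf("modulePullOverride references unset env: %s",
                strings.Join(missing, ", "))
        }
        return out, nil
    }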

2. Bootstrap robustness on developer laptops
   - pkg/cluster/setup.go: pass FORCE_NO_PRIVATE_KEYS=true and
     USE_AGENT_WITH_NO_PRIVATE_KEYS=true into the dhctl install:main
     container so lib-connection stops trying to open /root/.ssh/id_rsa and
     authenticates only via the mounted ssh-agent socket. Fixes
     "extract config: Failed to read private keys from flags" with a
     passphrase-protected key.
   - pkg/cluster/vms.go: cloud-init now pins apt at mirror.yandex.ru and
     forces IPv4 so package_update + Docker install stop stalling on egress
     paths where archive.ubuntu.com is partially unreachable.
   - internal/config/env.go: extracted ApplyDefaults() out of
     ValidateEnvironment so suites that skip validation still get defaults
     for SSH_VM_USER / SSH_PRIVATE_KEY / etc.
   - pkg/cluster/cluster.go::CreateTestCluster now calls ApplyDefaults() and
     falls back to YAMLConfigFilenameDefaultValue on empty arg.
   - internal/cluster/cluster.go::GetKubeconfig falls back to clientcmd
     default loading rules (KUBECONFIG / ~/.kube/config, minified to the
     current context) when SSH retrieval fails and KUBE_CONFIG_PATH is
     unset.

3. SSH tunnel auto-reconnect
   - internal/infrastructure/ssh/client.go: both (*client).StartTunnel and
     (*jumpHostClient).StartTunnel now share runTunnelLoop driven by a
     tunnelDialer struct. When the underlying SSH session dies, dial fails
     with EOF; the loop emits a WARN, calls the existing reconnect() (which
     already has retry + exponential backoff), and retries the dial once
     with the rebuilt session. Without this a Wi-Fi flap killed the tunnel
     and every client-go GET silently returned EOF until the parent
     readiness timeout fired.
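
In sketch form (tunnelDialer's real fields and the forwarding plumbing are
elided; names follow the description above):

    func runTunnelLoop(d *tunnelDialer) error {
        for {
            conn, err := d.dial()
            if errors.Is(err, io.EOF) { // underlying SSH session died
                log.Printf("WARN: tunnel dial failed (%v); reconnecting", err)
                if rErr := d.reconnect(); rErr != nil { // existing retry + backoff
                    return rErr
                }
                conn, err = d.dial() // exactly one retry on the rebuilt session
            }
            if err != nil {
                return err
            }
            go d.forward(conn)
        }
    }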

4. Per-call deadline + visible WARN in Ceph readiness pollers
   - pkg/kubernetes/poll.go (new): pollResourceUntilReady centralizes our
     Wait*Ready loops. Each Get is bounded by PollGetTimeout (30s) so a hung
     TCP connect surfaces in seconds, and consecutive Get failures escalate
     to WARN once they cross 3 so the user sees the cluster connection is
     dying instead of waiting for the readyTimeout.
   - pkg/kubernetes/{cephcluster,cephblockpool,cephfilesystem}.go:
     WaitForCephClusterReady / WaitForCephBlockPoolReady /
     WaitForCephFilesystemReady migrated. Public signatures unchanged.
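
The shared poller reduces to roughly this (interval and logging are
simplified; the real helper's signature may differ):

    func pollResourceUntilReady(ctx context.Context, readyTimeout time.Duration,
        check func(ctx context.Context) (bool, error)) error {
        const pollGetTimeout = 30 * time.Second
        deadline := time.After(readyTimeout)
        failures := 0
        for {
            getCtx, cancel := context.WithTimeout(ctx, pollGetTimeout)
            ready, err := check(getCtx) // a hung TCP connect now fails in seconds
            cancel()
            switch {
            case err != nil:
                failures++
                if failures > 3 {
                    log.Printf("WARN: %d consecutive Get failures: %v", failures, err)
                }
            case ready:
                return nil
            default:
                failures = 0
            }
            select {
            case <-deadline:
                return fmt.Errorf("resource not ready after %s", readyTimeout)
            case <-time.After(5 * time.Second):
            case <-ctx.Done():
                return ctx.Err()
            }
        }
    }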

Docs:
- docs/WORKLOG.md: 2026-05-05 entries.
- docs/FUNCTIONS_GLOSSARY.md: updated descriptions for the three Wait*Ready
  helpers.
- docs/ARCHITECTURE.md: poll.go and cephfilesystem.go added to the package
  tree (Sections 1.1 and 3.6); overrides.go in Section 3.1.

Signed-off-by: Aleksandr Zimin <alexandr.zimin@flant.com>
GetKubeconfig used to log a single info-level line when SSH retrieval of
admin.conf failed and we silently dropped to the developer's local
kubeconfig. In practice that hid a class of nasty bugs where tests were
acquiring stale locks on unrelated SAN clusters or installing modules
against the wrong stand because $KUBECONFIG happened to point elsewhere.

Make the fallback obvious:

* Tag every kubeconfig source path with a short label
  (SSH(...), KUBE_CONFIG_PATH=..., LOCAL_FALLBACK(...)).
* Promote the fallback message to logger.Warn, include the resolved
  current-context and cluster server URL, and tell the user how to
  fail-fast (unset KUBECONFIG, drop ~/.kube/config) if that behaviour
  is undesirable.
* Always print a final "Loaded kubeconfig (source=..., current-context=...,
  server=...)" line so the actual cluster is visible in test logs
  regardless of which resolution path fired.

The new kubeconfigContextSummary helper parses the serialized kubeconfig
through clientcmd.Load and degrades to "<unknown>" on any error so the
surrounding log line stays safe to print.
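
With clientcmd.Load from client-go, the helper is roughly (the body is an
assumption consistent with the description; only the name comes from this
commit):

    func kubeconfigContextSummary(raw []byte) (contextName, server string) {
        contextName, server = "<unknown>", "<unknown>"
        cfg, err := clientcmd.Load(raw) // k8s.io/client-go/tools/clientcmd
        if err != nil {
            return // degrade instead of breaking the surrounding log line
        }
        if cfg.CurrentContext != "" {
            contextName = cfg.CurrentContext
        }
        if kctx, ok := cfg.Contexts[cfg.CurrentContext]; ok {
            if cluster, ok := cfg.Clusters[kctx.Cluster]; ok {
                server = cluster.Server
            }
        }
        return
    }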

Signed-off-by: Aleksandr Zimin <alexandr.zimin@flant.com>
TeardownCephStorageClass now waits for each CR to be GC'd before
deleting its parent. Without that synchronization the parent
CephCluster could be deleted while a child CephBlockPool /
CephFilesystem is still alive, leaving Rook stuck with
DeletionIsBlocked / ObjectHasDependents and the cluster in
phase=Deleting indefinitely.

Adds:
- pollResourceUntilGone helper with periodic deletionTimestamp /
  finalizers progress logging, so a stuck finalizer surfaces
  immediately instead of after a silent timeout.
- WaitFor*Gone helpers for CephCluster, CephBlockPool,
  CephFilesystem, CephClusterAuthentication, CephClusterConnection,
  CephStorageClass with sensible per-CR default budgets.
- errIfTerminating guard in every Create* helper so an Ensure*
  call finds a Terminating CR and fails fast instead of issuing a
  silent no-op Update and trapping WaitFor*Ready for 15-20m.
- pollResourceUntilReady fail-fast on deletionTimestamp != nil for
  the same reason.

Fail-fast policy on Wait*Gone timeouts: errors are aggregated and
returned, no auto-strip of finalizers — that would mask real Rook
bugs. Operator must investigate the cluster manually before
re-running.
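
The gone-poller is the mirror image of pollResourceUntilReady; a sketch over
a generic getter (the real signature may differ):

    func pollResourceUntilGone(ctx context.Context, budget time.Duration,
        get func(ctx context.Context) (metav1.Object, error)) error {
        deadline := time.After(budget)
        for {
            obj, err := get(ctx)
            if apierrors.IsNotFound(err) {
                return nil // GC finished; safe to delete the parent
            }
            if err == nil {
                // Progress log: a stuck finalizer surfaces immediately.
                log.Printf("still terminating: deletionTimestamp=%v finalizers=%v",
                    obj.GetDeletionTimestamp(), obj.GetFinalizers())
            }
            select {
            case <-deadline:
                return fmt.Errorf("not gone after %s (finalizers are never auto-stripped)", budget)
            case <-time.After(5 * time.Second):
            case <-ctx.Done():
                return ctx.Err()
            }
        }
    }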

Signed-off-by: Aleksandr Zimin <alexandr.zimin@flant.com>
…l containers)

storage-e2e had no pod-exec helpers at all (pkg/kubernetes/pod.go
only covers WaitFor*Ready). Each downstream test suite was forced to
roll its own — see csi-ceph/e2e/tests/e2e_shared_test.go::execInPod
which wraps remotecommand.NewSPDYExecutor and only works on
containers that have cat (i.e. test probe pods, not the actual
distroless csi-controllers).

This commit lifts pod exec into the shared testkit so any module's
e2e suite can reuse it.

New file: pkg/kubernetes/pod_exec.go

- ExecInPod(ctx, kubeconfig, ns, pod, container, cmd) (stdout,
  stderr string, error). General SPDY exec on /pods/<name>/exec.
  Returns stdout/stderr SEPARATELY (the csi-ceph copy concatenates
  them and loses signal).
- ReadFileFromPod(...) — ExecInPod + cat <path>. For containers
  that ship a real userland.
- ReadFileFromDistrolessPod(..., opts ReadFileOptions) — adds a
  short-lived ephemeral container with TargetContainerName set,
  polls until it goes Running, then cat /proc/1/root<path>.
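
The exec core is the standard client-go SPDY pattern; a sketch taking a
clientset and rest.Config directly (the real helper builds both from its
kubeconfig argument):

    func execInPod(ctx context.Context, cs kubernetes.Interface, cfg *rest.Config,
        ns, pod, container string, cmd []string) (string, string, error) {
        req := cs.CoreV1().RESTClient().Post().
            Resource("pods").Namespace(ns).Name(pod).SubResource("exec").
            VersionedParams(&corev1.PodExecOptions{
                Container: container,
                Command:   cmd,
                Stdout:    true,
                Stderr:    true,
            }, scheme.ParameterCodec)
        exec, err := remotecommand.NewSPDYExecutor(cfg, "POST", req.URL())
        if err != nil {
            return "", "", err
        }
        var stdout, stderr bytes.Buffer // separate buffers: no concatenation
        err = exec.StreamWithContext(ctx, remotecommand.StreamOptions{
            Stdout: &stdout,
            Stderr: &stderr,
        })
        return stdout.String(), stderr.String(), err
    }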

The distroless path leans on Kubernetes Ephemeral Containers (GA
since 1.25). They're added through the dedicated
/pods/<name>/ephemeralcontainers subresource — NOT via the regular
pod PUT/PATCH path, which is why the apiserver explicitly allows
this mutation on a running pod and existing containers do NOT
restart. metadata.generation, spec.containers, pod sandbox UID
and ReplicaSet/DaemonSet observation all stay intact, so e2e
suites that subsequently assert on checksum/... annotations or
rollout state see a clean signal — the FS read does not
contaminate it.
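
The subresource call via client-go looks like this (image and container name
are illustrative, not the helper's actual defaults; cs, ns, podName and
targetContainer are assumed in scope):

    pod, err := cs.CoreV1().Pods(ns).Get(ctx, podName, metav1.GetOptions{})
    if err != nil {
        return err
    }
    pod.Spec.EphemeralContainers = append(pod.Spec.EphemeralContainers,
        corev1.EphemeralContainer{
            EphemeralContainerCommon: corev1.EphemeralContainerCommon{
                Name:    "e2e-reader",   // illustrative
                Image:   "busybox:1.36", // any image that ships cat
                Command: []string{"sleep", "60"},
            },
            // Join the target's PID namespace so /proc/1/root is its root FS.
            TargetContainerName: targetContainer,
        })
    // Dedicated subresource; a plain pod Update adding this would be rejected.
    _, err = cs.CoreV1().Pods(ns).UpdateEphemeralContainers(ctx, podName, pod,
        metav1.UpdateOptions{})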

Caveat documented in the doc-comment: ephemeral containers cannot
be removed once added; sleep 60 lets the cat process exit on its
own. For long-running test suites the entry just stays as
Terminated in pod.status.ephemeralContainerStatuses until the next
rollout recycles the pod.

docs/FUNCTIONS_GLOSSARY.md gets a new entry under the Pod
subsection listing the three primitives with selection guidance
(which to pick for distroless vs. shell-bearing containers).

Signed-off-by: Aleksandr Zimin <alexandr.zimin@flant.com>
ReadFileFromDistrolessPod was designed as one-shot: every call injects
a fresh ephemeral container, waits for the kubelet to launch it, runs
cat once, and exits. That's fine for diagnostics, but makes
Eventually-style polling loops painfully slow — each iteration pays
the full ephemeral-container cold-start cost (~10-20 s for kubelet to
launch a new container in the existing pod sandbox), so a "predicate
matches in 30 s" case can spend 2+ minutes inside the loop. A real
trace from the msCrcData matrix shows ~127 s for an rbd FS-poll that
should have settled in well under a minute.

This commit splits the helper into a session API:

- OpenDistrolessReader(...) injects ONE ephemeral container with a
  long sleep (default 30 minutes via opts.SessionTTL), waits for it
  to go Running, and returns a DistrolessReader bound to that
  ephemeral container.
- DistrolessReader.ReadFile(ctx, path) is just a pods/exec round-trip
  into the already-running ephemeral container — sub-second.
- ReadFileFromDistrolessPod is now a thin wrapper (open + read) for
  one-shot callers. Behaviour is unchanged from their perspective,
  but ReadFileOptions grows a SessionTTL field used by the session
  path.

Reader API is what callers running poll loops should use; the
single-shot helper stays for the one-shot diagnostics case.
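
Poll-loop usage then looks roughly like this (argument and return shapes are
assumptions from the description above):

    // Open once, pay the ephemeral-container cold start once.
    reader, err := kubernetes.OpenDistrolessReader(ctx, kubeconfig, ns, podName,
        targetContainer, kubernetes.ReadFileOptions{SessionTTL: 30 * time.Minute})
    if err != nil {
        t.Fatal(err)
    }
    // Each iteration is now a sub-second pods/exec round-trip.
    deadline := time.Now().Add(30 * time.Second)
    for time.Now().Before(deadline) {
        out, err := reader.ReadFile(ctx, "/etc/ceph/ceph.conf")
        if err == nil && strings.Contains(out, "ms_crc_data = true") {
            return
        }
        time.Sleep(time.Second)
    }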

The reader cannot outlive its target pod — there's no Close() because
Kubernetes does not allow removing an ephemeral container, and a pod
recycle (rollout) drops the entry along with the rest of the pod
status. Callers that need fresh sessions across pod identities should
re-open against the new pod (DistrolessReader.PodName helps detect
this).

Signed-off-by: Aleksandr Zimin <alexandr.zimin@flant.com>
…system.Ready

RestartCephDaemons used to rolling-restart only mon/mgr/osd, which
left two classes of state stuck on the pre-flip ms_crc_data:

  1. rook-ceph-mds: a CephFS daemon that talks to mons over the same
     messenger that ms_crc_data toggles CRC for. With mons on the new
     value and MDS still on the old one, the MDS↔mon channel silently
     desynchronises, CephFS goes degraded, and any csi-cephfs PVC
     hangs in Pending until somebody bounces MDS by hand. Reproduced
     reliably in the msCrcData matrix on cell `protocol=cephfs
     server=off client=off -> Bound`: PVC stuck for ~2 minutes,
     unstuck only after kubectl rollout restart of the d8-sds-elastic
     namespace.

  2. The rook-operator pod: itself a Ceph admin client that uses an
     in-pod ceph.conf rendered at startup. Without a pod restart it
     keeps using the stale ms_crc_data and can't talk to the freshly-
     bounced mons, surfacing as cephcluster CR phase=Ready /
     state=Error / `failed to get status. . timed out` until the next
     reconcile after operator pod recycle.

Fix:

  * Extend RestartCephDaemons selector to mon/mgr/osd/mds/rgw. rgw is
    pre-included for forward-compat with future S3 tests; absence is
    not an error.
  * Add RestartRookOperator helper that bounces the rook-operator
    Deployment and waits for Ready. Operator-Deployment name is
    derived from the namespace by stripping the leading `d8-` prefix
    (`d8-sds-elastic` → `sds-elastic`), matching how Deckhouse
    packages the operator binary as a per-module Helm release.
    Vanilla Rook (`rook-ceph-operator` in `rook-ceph` namespace) is
    not supported — storage-e2e targets the Deckhouse flavor
    exclusively. Returns a descriptive error if the namespace doesn't
    have the expected prefix or the derived Deployment isn't there.
  * Wire RestartRookOperator into SetMsCrcDataOnServer (after the
    daemon restart so the operator boots against fresh-config mons).
  * Gate the whole flip on every CephFilesystem in the namespace
    reaching Ready before returning. Catches the MDS-stuck-on-old-CRC
    class of bug at the source instead of letting it surface as a PVC
    timeout downstream. RBD-only clusters are a no-op (no
    CephFilesystem CRs to wait for).
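
The name derivation itself is small enough to pin down in a sketch (function
name and error text are illustrative):

    // "d8-sds-elastic" -> "sds-elastic"; vanilla Rook namespaces fail fast.
    func operatorDeploymentName(namespace string) (string, error) {
        name, ok := strings.CutPrefix(namespace, "d8-")
        if !ok {
            return "", fmt.Errorf("namespace %q has no d8- prefix: only the Deckhouse Rook flavor is supported", namespace)
        }
        return name, nil
    }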

Net cost: ~30s extra per flip (mds + operator restart). In return:
no manual kubectl rollout restart between matrix cells, no spurious
HEALTH_ERR on cephcluster CR, and CephFS PVCs stop hanging in Pending
when CRC flips back to a matched state.

Signed-off-by: Aleksandr Zimin <alexandr.zimin@flant.com>
When SSH retrieval of /etc/kubernetes/{super-admin,admin}.conf from the
master fails and KUBE_CONFIG_PATH is not set, GetKubeconfig now fails fast
again instead of silently loading the developer's $KUBECONFIG /
~/.kube/config via clientcmd.NewDefaultClientConfigLoadingRules.

The fallback (added in e3d4e8d) was convenient on dev laptops but too risky
in CI and on machines whose `kubectl` already targets an unrelated cluster:
tests would silently deploy modules to / acquire cluster locks on the wrong
stand. Reverting preserves the original fail-fast contract that downstream
suites already relied on.

- internal/cluster/cluster.go: replace the default switch branch with a
  descriptive error pointing at KUBE_CONFIG_PATH and embedding the SSH
  error; drop loadDefaultKubeconfig and the now-unused clientcmdapi import.
- docs/WORKLOG.md: rewrite the 2026-05-05 GetKubeconfig bullet to reflect
  the final fail-fast behavior.

Signed-off-by: Aleksandr Zimin <alexandr.zimin@flant.com>
Backfill the documentation that earlier commits in this branch should
have updated as they landed. No code changes.

FUNCTIONS_GLOSSARY.md:
- Pod section: documented OpenDistrolessReader and the three
  *DistrolessReader methods (PodName, EphemeralName, ReadFile) added
  alongside the single-shot ReadFileFromDistrolessPod helper.
- New sections: "Ceph CRC (Testkit)" (EnableServerCRC / DisableServerCRC
  / ResetServerCRCToDefault / SetMsCrcDataOnServer / RestartCephDaemons
  / RestartRookOperator) and "VolumeSnapshotClass"
  (CreateVolumeSnapshotClass / WaitForVolumeSnapshotClass).
- StorageClass section: documented CreateStorageClass (in
  pkg/kubernetes/storageclass_manage.go).
- Rook Config Override section: documented RenderCephGlobalConfig.
- Table of Contents: added missing entries for "Ceph Cluster (Testkit)
  - no csi-ceph wiring", "VolumeSnapshotClass", and "Ceph CRC (Testkit)".

ARCHITECTURE.md:
- Section 1.1 (Package Structure): added internal/config/overrides.go
  (was only listed in 3.1) and pkg/kubernetes/pod_exec.go.
- Section 3.6 (Public API): added pkg/kubernetes/pod_exec.go.
- Section 7 (Environment Variables): documented the new fail-fast
  KUBE_CONFIG_PATH semantics and the generic ${VAR} expansion in
  modulePullOverride (e.g. MODULE_IMAGE_TAG).

WORKLOG.md:
- 2026-05-05: backfilled entries for pod_exec.go, DistrolessReader,
  the WaitFor*Gone family + Create-time deletionTimestamp guards,
  TeardownCephStorageClass rewrite, RestartCephDaemons selector
  extension (mds/rgw), RestartRookOperator, SetMsCrcDataOnServer
  rework. The GetKubeconfig revert (which actually landed today) was
  hoisted out of 2026-05-05 into a new 2026-05-06 heading.

Signed-off-by: Aleksandr Zimin <alexandr.zimin@flant.com>
When the SSH-side kubeconfig fetch fails and KUBE_CONFIG_PATH is unset,
the default switch branch in GetKubeconfig used to return a single
generic "command failed: Process exited with status 1" wrapped into a
vague suggestion to "fix SSH credentials so passwordless sudo works on
the master". That left the operator guessing.

The default branch now runs two cheap probe commands against the master
to classify the failure:

  1) test -f /etc/kubernetes/{super-admin,admin}.conf
     -> at least one kubeconfig file exists on the host
  2) sudo -n -l /bin/cat <path-from-1>
     -> a NOPASSWD rule that matches the cat command actually applies

and returns a multi-line, actionable error tailored to the detected
cause. The "sudo password required" branch embeds a ready-to-paste
/etc/sudoers.d/e2e-kubeconfig snippet (with the actual SSH user baked
in), the "kubeconfig missing" branch points at SSH_HOST/SSH_JUMP_HOST
misconfig, and the unknown branch lists all three remedies.
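
For illustration, the embedded snippet plausibly looks like this ("ubuntu" is
a placeholder; the helper bakes in the actual SSH user):

    # /etc/sudoers.d/e2e-kubeconfig
    ubuntu ALL=(root) NOPASSWD: /bin/cat /etc/kubernetes/super-admin.conf, /bin/cat /etc/kubernetes/admin.conf

Both files are spelled out because sudoers does not perform shell brace
expansion.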

While here, fix a self-inflicted source of the same failure: the SSH
command used to read the kubeconfig was

    sudo -n sh -c 'if [ -f .../super-admin.conf ]; then cat ...; ...'

so the privileged binary as far as sudoers was concerned was /bin/sh,
NOT /bin/cat. The fine-grained NOPASSWD rule the new error message
recommends ("NOPASSWD: /bin/cat /etc/kubernetes/{super-admin,admin}.conf")
therefore did not match and sudo asked for a password — exactly the
situation the error message tells the user to fix. The command is now

    sudo -n /bin/cat /etc/kubernetes/super-admin.conf 2>/dev/null \
      || sudo -A -n /bin/cat /etc/kubernetes/admin.conf

which works with the recommended minimal rule. The classifier probe was
moved off "sudo -n true" for the same reason: on hosts that grant
"NOPASSWD: ALL" the probe returned 0 even when the per-file rule was
absent, which would mask the real cause. "sudo -n -l /bin/cat <path>"
asks sudo whether THAT specific command is allowed without a password.

Contract preserved: still fail-fast (no silent ~/.kube/config fallback),
still wraps the original ssh exit error via %w so callers' errors.Is /
errors.As keep working. Probes are best-effort -- any error from a probe
is treated as "unknown" rather than masking the original sshErr.

Out of scope: actual SUDO_PASSWORD plumbing via 'sudo -S' (requires
extending SSHClient to forward stdin and adding secret-redaction in
command logs). Documented as a follow-up.

Signed-off-by: Aleksandr Zimin <alexandr.zimin@flant.com>
@AleksZimin force-pushed the vkarpochev/csi-ceph-testkit branch from 75f4eae to 1d636a5 on May 7, 2026 at 07:50
Conflicts:
- go.mod: dropped mxk/go-flowrate and openshift/api indirect deps;
  go mod tidy confirms they are unused.
- internal/config/env.go: kept main's EffectiveVirtualMachineClassName
  and our ApplyDefaults() extraction; ValidateEnvironment now calls
  ApplyDefaults first.
- pkg/cluster/setup.go: took main's --connection-config approach for
  passphrase-protected dhctl bootstrap; main's solution subsumes our
  earlier FORCE_NO_PRIVATE_KEYS / USE_AGENT_WITH_NO_PRIVATE_KEYS hack.

Signed-off-by: Aleksandr Zimin <alexandr.zimin@flant.com>