test: e2e test for large snapshot restore by jjaferson · Pull Request #3912 · loft-sh/vcluster

jjaferson · 2026-05-05T12:02:37Z

What issue type does this pull request address? (keep at least one, remove the others)
kind test

What does this pull request do? Which issues does it resolve? (use resolves #<issue_number> if possible)
resolves ENGCP-503

Please provide a short message that should be published in the vcluster release notes
Addes e2e test for large snapshot and restore

What else do we need to know?

E2E Tests

Default Test Execution

The mandatory PR suite runs automatically. Only specify additional test suites below if needed.

Adding New Test Suites

When adding a new ginkgo test suite:

Add labels to the test suite
Update label-filter section below to execute the new test suite
Verify test suite runs in CI/CD pipeline

Additional test suites

Additional test suite(s) that will be executed before the mandatory PR suite:

snapshot-large-restore

pascalbreuninger · 2026-05-05T12:17:17Z

+			By("Creating 1,000 configmaps in each of 100 namespaces concurrently", func() {
+				// The default client-go rate limiter (5 QPS, burst 10) would make 100k creates
+				// take ~20,000 seconds. Use a high-QPS client so the goroutines actually run in
+				// parallel and we hit the server's throughput ceiling instead.


Do you roughly know how this performs on GH runners? I'm a little concerned about it taking forever there or just straight up failing

I've tested it locally and to run the requests in parallel without being blocked by the rate limiter in client-go, so using the default, the first 10 requests go through immediately then after that the next one goes every ~200 ms

pascalbreuninger · 2026-05-05T12:18:28Z

+			})
+		})
+
+		AfterAll(func(ctx context.Context) {


Could this go into a DeferCleanup block instead or does AfterAll also fire appropriately (I don't know)?

Good question, I am using AfterAll as it's used in other parts of the e2e snapshot test suite, @adriankabala do you know which one should we use here?

AfterAll: splits creation and cleanup across two functions.
DeferCleanup: runs inside BeforeAll

In this case DeferCleanup and AfterAll should work in the same way, both should get executed when the Ordered container completes (after all It specs).

using AfterAll as it's used in other parts of the e2e snapshot test suite

➕ here, for the sake of consistency, it's probably better to use AfterAll in this PR.

The benefit of having DeferCleanup in BeforeAll is that it would place the cleanup code next to the setup code, which would be easier and more reliable to maintain in tests, e.g. when you setup something new in the test, the cleanup code is right next to the setup to remind you (or the agent) that your new resource should be also cleaned up.

All this being said, we could refactor cleanup code in tests and use DeferCleanup instead of AfterAll where it brings some value, but that's probably and idea for the @loft-sh/eng-qa team :)

I think the whole AfterAll is not needed at all. Why would we want to delete all these namespaces right before the tenant cluster is dropped anyway?

mfranczy

This cannot be part of the standard e2e pipeline.

It extends the test runtime to over an hour without introducing any specific constraint or signal that would help us identify functionality regressions or performance degradation.

Please put this on hold and let's discuss it during the tech sync meeting.

roehrijn

Only rather small explicit findings, but in general I fully agree with Marcin. We should discuss this as a team (likely also with QA team involved). This is nothing to just add to a regular e2e test suite. Maybe to the suite as such, but should not be executed on a daily base. For quite a while I have the idea of weekly executions in my mind. Maybe this is a viable option to label-filter it accordingly and run it once per week? Definitely we have to come to a common agreement what the goal is. This test is highly synthetic. 100k rather small config maps. But what about the object size limit in etcd? What if the 100k Configmaps are close to 4MB each? And is it enough if the test succeeds although it takes ages? We need to define a goal first and then adapt metric thresholds around that at least.

roehrijn · 2026-05-11T14:38:41Z

          ginkgo-label: "${{ needs.parse_label-filter.outputs.label-filter }} || pr"
          additional-args: "--vcluster-image=${{ env.REPOSITORY_NAME }}:${{ env.TAG_NAME }} --teardown=false"
          additional-ginkgo-flags: "-v --show-node-events"
+          timeout: 90m


The matching timeout: 90m was not mirrored in .github/workflows/e2e-ginkgo-nightly.yaml:117, which still passes timeout: "60m" to run-ginkgo. If the snapshot-large-restore label is ever included in a nightly run (e.g. via workflow_dispatch or once added to a default label set), Ginkgo will abort after 60 minutes — before the three 90-minute Eventually/waitForRequestToFinish calls in test_snapshot.go (lines 894, 916, 938) can complete. Consider bumping the nightly workflow's timeout too, or documenting why it intentionally stays at 60m.

roehrijn · 2026-05-11T14:38:44Z

 	PollingTimeout          = time.Second * 60
 	PollingTimeoutLong      = time.Second * 120
 	PollingTimeoutVeryLong  = time.Second * 300
+	PollingTimeoutExtraLong = time.Minute * 90


PollingTimeoutExtraLong (90m) is only used by the new large-restore spec, yet it lives in the shared constants package and jumps 18x from PollingTimeoutVeryLong (5m). Once a generic 90m option exists, future flaky tests are tempted to reach for it rather than diagnose. Consider either keeping it scoped (e.g. a package-local const largeRestoreTimeout = 90 * time.Minute inside e2e-next/test_storage/snapshot/), or renaming to convey scope so reuse outside this scenario is discouraged:

Suggested change

PollingTimeoutExtraLong = time.Minute * 90

SnapshotLargeRestoreTimeout = time.Minute * 90

roehrijn · 2026-05-11T14:39:04Z

+	const (
+		objectCount   = 100_000
+		nsCount       = 100
+		objectsPerNS  = objectCount / nsCount // 1_000 per namespace


objectsPerNS = objectCount / nsCount carries an implicit invariant that objectCount is divisible by nsCount. If a future maintainer tweaks objectCount to e.g. 100_001, the channel-fed loop below still publishes objectCount indices and i/objectsPerNS will exceed nsCount-1, producing creates into a non-existent namespace. Either assert the invariant (Expect(objectCount % nsCount).To(Equal(0))) or iterate for ns := range nsCount { for j := range objectsPerNS { ... } } so the loop bound matches the namespace topology.

roehrijn · 2026-05-11T14:39:10Z

+
+		It("Takes a snapshot of the tenant cluster", func(ctx context.Context) {
+			createSnapshot(s.vClusterName, s.vClusterNS, true, snapshotPath, false)
+			waitForRequestToFinish(ctx, s.hostClient, s.vClusterNS,


This open-codes a waitForRequestToFinish(...) call with a custom timeout instead of using waitForSnapshotToBeCreated, leaking helper-internal knowledge (label key + unmarshal fn) into the spec. Consider extending the helper to accept a timeout — e.g. waitForSnapshotToBeCreated(ctx, host, ns, timeout) with existing callers passing PollingTimeoutVeryLong and this one passing the extra-long value — so all snapshot specs share the same abstraction.

roehrijn · 2026-05-11T14:39:15Z

+						g.Expect(kerrors.IsNotFound(err)).To(BeTrue(),
+							"namespace %s should be deleted but still exists", ns)
+					}
+				}).WithContext(ctx).WithPolling(constants.PollingInterval).WithTimeout(constants.PollingTimeoutExtraLong).Should(Succeed())


Using a 90-minute polling timeout for namespace deletion conflates two concerns: the overall test budget and the expected operation latency. A real 90m wait on namespace termination usually masks a stuck finalizer rather than legitimately taking that long. Consider scoping each Eventually to its expected upper bound (a few minutes for ns deletion of 1000 cms) so failures fail fast with the right signal, and reserving the extra-long timeout only for the snapshot/restore RPC waits where 90m is plausible.

roehrijn · 2026-05-11T14:39:22Z

+
+		It("Restores snapshot and verifies the deleted namespaces are back", func(ctx context.Context) {
+			By("Restoring the tenant cluster from snapshot", func() {
+				restoreVCluster(ctx, s.hostClient, s.vClusterName, s.vClusterNS, snapshotPath, true, false)


The large-restore suite uses PollingTimeoutExtraLong (90m) elsewhere, but restoreVCluster(...) still goes through a helper that hardcodes PollingTimeoutVeryLong (5m) for the post-restore pod-ready wait (helpers.go:74). With a 100k-object restore the pods may take longer to become ready, and the suite will fail here long before the later 90m assertions run. Consider threading the timeout into restoreVCluster.

roehrijn · 2026-05-11T14:39:27Z

+		It("Restores snapshot and verifies the deleted namespaces are back", func(ctx context.Context) {
+			By("Restoring the tenant cluster from snapshot", func() {
+				restoreVCluster(ctx, s.hostClient, s.vClusterName, s.vClusterNS, snapshotPath, true, false)
+				s.refreshClient(ctx)


s.refreshClient(ctx) hardcodes a PollingTimeoutLong (2m) wait for the restored API server to become usable (test_snapshot.go:112). After restoring 100k objects the pod may be Running while the tenant API is still replaying state, so this reconnect can time out even though the restore would eventually succeed. Worth letting the large-restore suite use a longer reconnect timeout.

roehrijn · 2026-05-11T14:39:34Z

+    k8s:
+      apiServer:
+        extraArgs:
+          - "--etcd-compaction-interval=1h"


Adding --etcd-compaction-interval=1h here changes the API server config for all snapshot suites that embed this YAML (both suite_snapshot_test.go and the new suite_snapshot_large_restore_test.go). The existing snapshot suite did not need this tuning. To avoid coupling unrelated suites to a perf knob added for one of them, consider either (a) moving this to a dedicated vcluster-snapshot-large-restore.yaml embedded only from the new suite, or (b) parameterizing via lazyvcluster.WithExtraTemplateVars so the existing suite keeps the default.

roehrijn · 2026-05-11T14:51:18Z

+			})
+		})
+
+		AfterAll(func(ctx context.Context) {


I think the whole AfterAll is not needed at all. Why would we want to delete all these namespaces right before the tenant cluster is dropped anyway?

jjaferson · 2026-05-18T13:09:50Z

I'll convert this to a draft PR as we might change it to the performance test framework

jjaferson requested a review from a team as a code owner May 5, 2026 12:02

pascalbreuninger reviewed May 5, 2026

View reviewed changes

adriankabala reviewed May 5, 2026

View reviewed changes

Comment thread e2e-next/test_storage/snapshot/test_snapshot.go

adriankabala reviewed May 5, 2026

View reviewed changes

Comment thread e2e-next/test_storage/snapshot/test_snapshot.go Outdated

adriankabala reviewed May 5, 2026

View reviewed changes

Comment thread e2e-next/test_storage/snapshot/test_snapshot.go Outdated

This comment was marked as resolved.

Sign in to view

jjaferson force-pushed the ENGCP-503 branch 6 times, most recently from d23b46c to 0315458 Compare May 7, 2026 12:33

jjaferson requested a review from a team as a code owner May 7, 2026 12:33

test: e2e test for large snapshot restore

bd54bdb

jjaferson force-pushed the ENGCP-503 branch from 0315458 to bd54bdb Compare May 7, 2026 14:56

jjaferson requested a review from a team as a code owner May 7, 2026 14:56

mfranczy requested changes May 8, 2026

View reviewed changes

roehrijn requested changes May 11, 2026

View reviewed changes

jjaferson marked this pull request as draft May 18, 2026 13:09

	PollingTimeoutExtraLong = time.Minute * 90
	SnapshotLargeRestoreTimeout = time.Minute * 90

Conversation

jjaferson commented May 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

E2E Tests

Default Test Execution

Adding New Test Suites

Additional test suites

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

This comment was marked as resolved.

Uh oh!

mfranczy left a comment

Choose a reason for hiding this comment

Uh oh!

roehrijn left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jjaferson commented May 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

jjaferson commented May 5, 2026 •

edited

Loading

roehrijn left a comment •

edited

Loading