Skip to content

test: e2e test for large snapshot restore#3912

Draft
jjaferson wants to merge 1 commit into
loft-sh:mainfrom
jjaferson:ENGCP-503
Draft

test: e2e test for large snapshot restore#3912
jjaferson wants to merge 1 commit into
loft-sh:mainfrom
jjaferson:ENGCP-503

Conversation

@jjaferson
Copy link
Copy Markdown
Contributor

@jjaferson jjaferson commented May 5, 2026

What issue type does this pull request address? (keep at least one, remove the others)
kind test

What does this pull request do? Which issues does it resolve? (use resolves #<issue_number> if possible)
resolves ENGCP-503

Please provide a short message that should be published in the vcluster release notes
Addes e2e test for large snapshot and restore

What else do we need to know?

E2E Tests

Default Test Execution

The mandatory PR suite runs automatically. Only specify additional test suites below if needed.

Adding New Test Suites

When adding a new ginkgo test suite:

  • Add labels to the test suite
  • Update label-filter section below to execute the new test suite
  • Verify test suite runs in CI/CD pipeline

Additional test suites

Additional test suite(s) that will be executed before the mandatory PR suite:

snapshot-large-restore

@jjaferson jjaferson requested a review from a team as a code owner May 5, 2026 12:02
Comment thread e2e-next/test_storage/snapshot/test_snapshot.go
Comment thread e2e-next/test_storage/snapshot/test_snapshot.go
By("Creating 1,000 configmaps in each of 100 namespaces concurrently", func() {
// The default client-go rate limiter (5 QPS, burst 10) would make 100k creates
// take ~20,000 seconds. Use a high-QPS client so the goroutines actually run in
// parallel and we hit the server's throughput ceiling instead.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you roughly know how this performs on GH runners? I'm a little concerned about it taking forever there or just straight up failing

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've tested it locally and to run the requests in parallel without being blocked by the rate limiter in client-go, so using the default, the first 10 requests go through immediately then after that the next one goes every ~200 ms

})
})

AfterAll(func(ctx context.Context) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this go into a DeferCleanup block instead or does AfterAll also fire appropriately (I don't know)?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question, I am using AfterAll as it's used in other parts of the e2e snapshot test suite, @adriankabala do you know which one should we use here?

AfterAll: splits creation and cleanup across two functions.
DeferCleanup: runs inside BeforeAll

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case DeferCleanup and AfterAll should work in the same way, both should get executed when the Ordered container completes (after all It specs).

using AfterAll as it's used in other parts of the e2e snapshot test suite

➕ here, for the sake of consistency, it's probably better to use AfterAll in this PR.

The benefit of having DeferCleanup in BeforeAll is that it would place the cleanup code next to the setup code, which would be easier and more reliable to maintain in tests, e.g. when you setup something new in the test, the cleanup code is right next to the setup to remind you (or the agent) that your new resource should be also cleaned up.

All this being said, we could refactor cleanup code in tests and use DeferCleanup instead of AfterAll where it brings some value, but that's probably and idea for the @loft-sh/eng-qa team :)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the whole AfterAll is not needed at all. Why would we want to delete all these namespaces right before the tenant cluster is dropped anyway?

Comment thread e2e-next/test_storage/snapshot/test_snapshot.go
Comment thread e2e-next/test_storage/snapshot/test_snapshot.go Outdated
Comment thread e2e-next/test_storage/snapshot/test_snapshot.go Outdated
adriankabala

This comment was marked as resolved.

@jjaferson jjaferson force-pushed the ENGCP-503 branch 6 times, most recently from d23b46c to 0315458 Compare May 7, 2026 12:33
@jjaferson jjaferson requested a review from a team as a code owner May 7, 2026 12:33
@jjaferson jjaferson requested a review from a team as a code owner May 7, 2026 14:56
Copy link
Copy Markdown
Contributor

@mfranczy mfranczy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This cannot be part of the standard e2e pipeline.

It extends the test runtime to over an hour without introducing any specific constraint or signal that would help us identify functionality regressions or performance degradation.

Please put this on hold and let's discuss it during the tech sync meeting.

Copy link
Copy Markdown
Contributor

@roehrijn roehrijn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only rather small explicit findings, but in general I fully agree with Marcin. We should discuss this as a team (likely also with QA team involved). This is nothing to just add to a regular e2e test suite. Maybe to the suite as such, but should not be executed on a daily base. For quite a while I have the idea of weekly executions in my mind. Maybe this is a viable option to label-filter it accordingly and run it once per week? Definitely we have to come to a common agreement what the goal is. This test is highly synthetic. 100k rather small config maps. But what about the object size limit in etcd? What if the 100k Configmaps are close to 4MB each? And is it enough if the test succeeds although it takes ages? We need to define a goal first and then adapt metric thresholds around that at least.

ginkgo-label: "${{ needs.parse_label-filter.outputs.label-filter }} || pr"
additional-args: "--vcluster-image=${{ env.REPOSITORY_NAME }}:${{ env.TAG_NAME }} --teardown=false"
additional-ginkgo-flags: "-v --show-node-events"
timeout: 90m
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The matching timeout: 90m was not mirrored in .github/workflows/e2e-ginkgo-nightly.yaml:117, which still passes timeout: "60m" to run-ginkgo. If the snapshot-large-restore label is ever included in a nightly run (e.g. via workflow_dispatch or once added to a default label set), Ginkgo will abort after 60 minutes — before the three 90-minute Eventually/waitForRequestToFinish calls in test_snapshot.go (lines 894, 916, 938) can complete. Consider bumping the nightly workflow's timeout too, or documenting why it intentionally stays at 60m.

PollingTimeout = time.Second * 60
PollingTimeoutLong = time.Second * 120
PollingTimeoutVeryLong = time.Second * 300
PollingTimeoutExtraLong = time.Minute * 90
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PollingTimeoutExtraLong (90m) is only used by the new large-restore spec, yet it lives in the shared constants package and jumps 18x from PollingTimeoutVeryLong (5m). Once a generic 90m option exists, future flaky tests are tempted to reach for it rather than diagnose. Consider either keeping it scoped (e.g. a package-local const largeRestoreTimeout = 90 * time.Minute inside e2e-next/test_storage/snapshot/), or renaming to convey scope so reuse outside this scenario is discouraged:

Suggested change
PollingTimeoutExtraLong = time.Minute * 90
SnapshotLargeRestoreTimeout = time.Minute * 90

const (
objectCount = 100_000
nsCount = 100
objectsPerNS = objectCount / nsCount // 1_000 per namespace
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

objectsPerNS = objectCount / nsCount carries an implicit invariant that objectCount is divisible by nsCount. If a future maintainer tweaks objectCount to e.g. 100_001, the channel-fed loop below still publishes objectCount indices and i/objectsPerNS will exceed nsCount-1, producing creates into a non-existent namespace. Either assert the invariant (Expect(objectCount % nsCount).To(Equal(0))) or iterate for ns := range nsCount { for j := range objectsPerNS { ... } } so the loop bound matches the namespace topology.


It("Takes a snapshot of the tenant cluster", func(ctx context.Context) {
createSnapshot(s.vClusterName, s.vClusterNS, true, snapshotPath, false)
waitForRequestToFinish(ctx, s.hostClient, s.vClusterNS,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This open-codes a waitForRequestToFinish(...) call with a custom timeout instead of using waitForSnapshotToBeCreated, leaking helper-internal knowledge (label key + unmarshal fn) into the spec. Consider extending the helper to accept a timeout — e.g. waitForSnapshotToBeCreated(ctx, host, ns, timeout) with existing callers passing PollingTimeoutVeryLong and this one passing the extra-long value — so all snapshot specs share the same abstraction.

g.Expect(kerrors.IsNotFound(err)).To(BeTrue(),
"namespace %s should be deleted but still exists", ns)
}
}).WithContext(ctx).WithPolling(constants.PollingInterval).WithTimeout(constants.PollingTimeoutExtraLong).Should(Succeed())
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using a 90-minute polling timeout for namespace deletion conflates two concerns: the overall test budget and the expected operation latency. A real 90m wait on namespace termination usually masks a stuck finalizer rather than legitimately taking that long. Consider scoping each Eventually to its expected upper bound (a few minutes for ns deletion of 1000 cms) so failures fail fast with the right signal, and reserving the extra-long timeout only for the snapshot/restore RPC waits where 90m is plausible.


It("Restores snapshot and verifies the deleted namespaces are back", func(ctx context.Context) {
By("Restoring the tenant cluster from snapshot", func() {
restoreVCluster(ctx, s.hostClient, s.vClusterName, s.vClusterNS, snapshotPath, true, false)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The large-restore suite uses PollingTimeoutExtraLong (90m) elsewhere, but restoreVCluster(...) still goes through a helper that hardcodes PollingTimeoutVeryLong (5m) for the post-restore pod-ready wait (helpers.go:74). With a 100k-object restore the pods may take longer to become ready, and the suite will fail here long before the later 90m assertions run. Consider threading the timeout into restoreVCluster.

It("Restores snapshot and verifies the deleted namespaces are back", func(ctx context.Context) {
By("Restoring the tenant cluster from snapshot", func() {
restoreVCluster(ctx, s.hostClient, s.vClusterName, s.vClusterNS, snapshotPath, true, false)
s.refreshClient(ctx)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s.refreshClient(ctx) hardcodes a PollingTimeoutLong (2m) wait for the restored API server to become usable (test_snapshot.go:112). After restoring 100k objects the pod may be Running while the tenant API is still replaying state, so this reconnect can time out even though the restore would eventually succeed. Worth letting the large-restore suite use a longer reconnect timeout.

k8s:
apiServer:
extraArgs:
- "--etcd-compaction-interval=1h"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding --etcd-compaction-interval=1h here changes the API server config for all snapshot suites that embed this YAML (both suite_snapshot_test.go and the new suite_snapshot_large_restore_test.go). The existing snapshot suite did not need this tuning. To avoid coupling unrelated suites to a perf knob added for one of them, consider either (a) moving this to a dedicated vcluster-snapshot-large-restore.yaml embedded only from the new suite, or (b) parameterizing via lazyvcluster.WithExtraTemplateVars so the existing suite keeps the default.

})
})

AfterAll(func(ctx context.Context) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the whole AfterAll is not needed at all. Why would we want to delete all these namespaces right before the tenant cluster is dropped anyway?

@jjaferson
Copy link
Copy Markdown
Contributor Author

I'll convert this to a draft PR as we might change it to the performance test framework

@jjaferson jjaferson marked this pull request as draft May 18, 2026 13:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants