
feat(fc): drain virtio-balloon free-page-hinting before pause#2552

Merged
ValentaTomas merged 5 commits into main from feat/sandbox-pause-fph
May 15, 2026

Conversation

@ValentaTomas
Member

@ValentaTomas ValentaTomas commented May 4, 2026

Drains virtio-balloon free-page-hinting before pause so snapshots don't capture pages the guest already considers free.

Balloon install is gated by free-page-hinting-install (bool LD flag); kernel-side eligibility is targeted via the LD context (kernel/FC version). On pause we call start_balloon_hinting(acknowledge_on_stop=true) and poll describe_balloon_hinting until host_cmd == DONE, gated by free-page-hinting-timeout-ms (int LD flag, ms; 0 = disabled). Reclaimed pages emit UFFD_EVENT_REMOVE, already tracked by the parent FPR work.

Hot path is kept minimal: post-drain and post-pause we trigger an FC metrics flush but don't wait for the reader, trading per-pause counter precision for pause latency. System-level FPH activity is observable via the periodic 5 s metrics flush.

Includes cmd/resume-build -fph-bench and scripts/bench-fph.sh for offline FPR vs FPR+FPH comparison.

Operators must wait for the kernel FPH race fix to roll out before enabling free-page-hinting-timeout-ms in prod.

@cursor

cursor Bot commented May 4, 2026

PR Summary

High Risk
Touches the sandbox pause/snapshot hot path and Firecracker API interactions; mis-gating or drain/poll issues could increase pause latency or break snapshot creation on some FC/kernel combinations.

Overview
There’s a compile-time bug in cmd/resume-build: `for i := range opts.iterations` ranges over an int, which only compiles on Go 1.22+.

create-build now forces free-page-hinting-config.enabled=true in local runs, which may unintentionally change behavior for callers expecting the LD default.

The pause path now conditionally runs balloon hinting drain + extra metrics flushes; incorrect LaunchDarkly targeting/version gating or Firecracker API behavior differences (notably the 204 “success” workaround) could still cause unexpected pause latency or no-op behavior across environments.

Reviewed by Cursor Bugbot for commit dcc9dcd. Bugbot is set up for automated code reviews on this repo. Configure here.

Comment thread packages/orchestrator/pkg/sandbox/fc/process.go
Comment thread packages/orchestrator/pkg/sandbox/fc/process.go
@ValentaTomas ValentaTomas force-pushed the feat/sandbox-pause-fph branch 4 times, most recently from f4e3ab0 to 7619cc9 Compare May 4, 2026 00:55
@ValentaTomas ValentaTomas force-pushed the feat/uffd-fc-free-page-reporting-integration branch 2 times, most recently from 920e8ec to 7f22709 Compare May 5, 2026 08:19
@cla-bot cla-bot Bot added the cla-signed label May 6, 2026
@linear-code

linear-code Bot commented May 6, 2026

@codecov

codecov Bot commented May 6, 2026

❌ 14 Tests Failed:

Tests completed: 2619 · Failed: 14 · Passed: 2605 · Skipped: 7
View the top 1 failed test(s) by shortest run time
github.com/e2b-dev/infra/packages/shared/pkg/storage::TestMultipartUploader_HighConcurrency_StressTest
Stack Traces | 0.69s run time
=== RUN   TestMultipartUploader_HighConcurrency_StressTest
=== PAUSE TestMultipartUploader_HighConcurrency_StressTest
=== CONT  TestMultipartUploader_HighConcurrency_StressTest
    gcp_multipart_test.go:369: 
        	Error Trace:	.../pkg/storage/gcp_multipart_test.go:369
        	Error:      	"1" is not greater than "1"
        	Test:       	TestMultipartUploader_HighConcurrency_StressTest
        	Messages:   	Should have concurrent uploads
--- FAIL: TestMultipartUploader_HighConcurrency_StressTest (0.69s)
View the full list of 13 ❄️ flaky test(s)
github.com/e2b-dev/infra/tests/integration/internal/tests/api/metrics::TestTeamMetrics

Flake rate in main: 70.64% (Passed 192 times, Failed 462 times)

Stack Traces | 0.3s run time
=== RUN   TestTeamMetrics
=== PAUSE TestTeamMetrics
=== CONT  TestTeamMetrics
    team_metrics_test.go:61: 
        	Error Trace:	.../api/metrics/team_metrics_test.go:61
        	Error:      	Should be true
        	Test:       	TestTeamMetrics
        	Messages:   	MaxConcurrentSandboxes should be >= 0
--- FAIL: TestTeamMetrics (0.30s)
github.com/e2b-dev/infra/tests/integration/internal/tests/api/sandboxes::TestEgressFirewallWithInternetAccessFalse

Flake rate in main: 54.69% (Passed 193 times, Failed 233 times)

Stack Traces | 6.18s run time
=== RUN   TestEgressFirewallWithInternetAccessFalse
=== PAUSE TestEgressFirewallWithInternetAccessFalse
=== CONT  TestEgressFirewallWithInternetAccessFalse
    sandbox_network_out_test.go:400: Command [curl] output: event:{start:{pid:1311}}
    sandbox_network_out_test.go:400: Command [curl] output: event:{end:{exit_code:28 exited:true status:"exit status 28" error:"exit status 28"}}
    sandbox_network_out_test.go:400: 
        	Error Trace:	.../api/sandboxes/sandbox_network_out_test.go:67
        	            				.../api/sandboxes/sandbox_network_out_test.go:400
        	Error:      	Received unexpected error:
        	            	command curl in sandbox im1efj3ayn3kozoifnxjx failed with exit code 28
        	Test:       	TestEgressFirewallWithInternetAccessFalse
        	Messages:   	Expected curl to succeed for allowed IP even with allow_internet_access=false (network config takes precedence)
--- FAIL: TestEgressFirewallWithInternetAccessFalse (6.18s)
github.com/e2b-dev/infra/tests/integration/internal/tests/api/sandboxes::TestUpdateNetworkConfig

Flake rate in main: 76.66% (Passed 200 times, Failed 657 times)

Stack Traces | 35.2s run time
=== RUN   TestUpdateNetworkConfig
=== PAUSE TestUpdateNetworkConfig
=== CONT  TestUpdateNetworkConfig
Executing command curl in sandbox i1tm429ujag7ja6rx7mn0
--- FAIL: TestUpdateNetworkConfig (35.25s)
github.com/e2b-dev/infra/tests/integration/internal/tests/api/sandboxes::TestUpdateNetworkConfig/pause_resume_preserves_allow_internet_access_false

Flake rate in main: 77.13% (Passed 193 times, Failed 651 times)

Stack Traces | 6.62s run time
=== RUN   TestUpdateNetworkConfig/pause_resume_preserves_allow_internet_access_false
Executing command curl in sandbox ifxyithxg5wx9c290xfg1
    sandbox_network_update_test.go:372: Command [curl] output: event:{start:{pid:1354}}
    sandbox_network_update_test.go:372: Command [curl] output: event:{end:{exit_code:35 exited:true status:"exit status 35" error:"exit status 35"}}
    sandbox_network_update_test.go:372: Command [curl] output: event:{start:{pid:1355}}
    sandbox_network_update_test.go:372: Command [curl] output: event:{end:{exit_code:35 exited:true status:"exit status 35" error:"exit status 35"}}
Executing command curl in sandbox im1efj3ayn3kozoifnxjx
    sandbox_network_update_test.go:391: Command [curl] output: event:{start:{pid:1356}}
    sandbox_network_update_test.go:391: Command [curl] output: event:{data:{stdout:"HTTP/2 302 \r\nx-content-type-options: nosniff\r\nlocation: https://dns.google/\r\ndate: Fri, 15 May 2026 00:02:06 GMT\r\ncontent-type: text/html; charset=UTF-8\r\nserver: HTTP server (unknown)\r\ncontent-length: 216\r\nx-xss-protection: 0\r\nx-frame-options: SAMEORIGIN\r\nalt-svc: h3=\":443\"; ma=2592000,h3-29=\":443\"; ma=2592000\r\n\r\n"}}
    sandbox_network_update_test.go:391: Command [curl] output: event:{end:{exited:true status:"exit status 0"}}
    sandbox_network_update_test.go:391: Command [curl] completed successfully in sandbox ifxyithxg5wx9c290xfg1
    sandbox_network_update_test.go:391: 
        	Error Trace:	.../api/sandboxes/sandbox_network_out_test.go:74
        	            				.../api/sandboxes/sandbox_network_update_test.go:60
        	            				.../api/sandboxes/sandbox_network_update_test.go:391
        	Error:      	An error is expected but got nil.
        	Test:       	TestUpdateNetworkConfig/pause_resume_preserves_allow_internet_access_false
        	Messages:   	https://8.8.8.8 should be blocked
--- FAIL: TestUpdateNetworkConfig/pause_resume_preserves_allow_internet_access_false (6.62s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestBindLocalhost

Flake rate in main: 56.36% (Passed 343 times, Failed 443 times)

Stack Traces | 0s run time
=== RUN   TestBindLocalhost
=== PAUSE TestBindLocalhost
=== CONT  TestBindLocalhost
--- FAIL: TestBindLocalhost (0.00s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestBindLocalhost/bind_::1

Flake rate in main: 64.35% (Passed 190 times, Failed 343 times)

Stack Traces | 8.25s run time
=== RUN   TestBindLocalhost/bind_::1
=== PAUSE TestBindLocalhost/bind_::1
=== CONT  TestBindLocalhost/bind_::1
Executing command python in sandbox ilgucl3ggygdm4l08bihm
    localhost_bind_test.go:69: Command [python] output: event:{start:{pid:1261}}
    localhost_bind_test.go:90: 
        	Error Trace:	.../tests/envd/localhost_bind_test.go:90
        	Error:      	Not equal: 
        	            	expected: 200
        	            	actual  : 502
        	Test:       	TestBindLocalhost/bind_::1
        	Messages:   	Unexpected status code 502 for bind address ::1
--- FAIL: TestBindLocalhost/bind_::1 (8.25s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestBindLocalhost/bind_localhost

Flake rate in main: 64.35% (Passed 190 times, Failed 343 times)

Stack Traces | 7.43s run time
=== RUN   TestBindLocalhost/bind_localhost
=== PAUSE TestBindLocalhost/bind_localhost
=== CONT  TestBindLocalhost/bind_localhost
Executing command python in sandbox ihcuqz9sqrvinisdeyc18
    localhost_bind_test.go:69: Command [python] output: event:{start:{pid:1261}}
    localhost_bind_test.go:90: 
        	Error Trace:	.../tests/envd/localhost_bind_test.go:90
        	Error:      	Not equal: 
        	            	expected: 200
        	            	actual  : 502
        	Test:       	TestBindLocalhost/bind_localhost
        	Messages:   	Unexpected status code 502 for bind address localhost
--- FAIL: TestBindLocalhost/bind_localhost (7.43s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestListDir

Flake rate in main: 53.88% (Passed 238 times, Failed 278 times)

Stack Traces | 1.81s run time
=== RUN   TestListDir
=== PAUSE TestListDir
=== CONT  TestListDir
Executing command /bin/bash in sandbox i7q057jj52x6c27bujbmp
--- FAIL: TestListDir (1.81s)
Executing command python in sandbox i50m1wdy7wrnevct8ii2k
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestListDir/depth_0_lists_only_root_directory

Flake rate in main: 57.49% (Passed 190 times, Failed 257 times)

Stack Traces | 0.02s run time
=== RUN   TestListDir/depth_0_lists_only_root_directory
=== PAUSE TestListDir/depth_0_lists_only_root_directory
=== CONT  TestListDir/depth_0_lists_only_root_directory
    filesystem_test.go:97: 
        	Error Trace:	.../tests/envd/filesystem_test.go:97
        	Error:      	Received unexpected error:
        	            	unavailable: 502 Bad Gateway
        	Test:       	TestListDir/depth_0_lists_only_root_directory
--- FAIL: TestListDir/depth_0_lists_only_root_directory (0.02s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestListDir/depth_1_lists_root_directory

Flake rate in main: 57.49% (Passed 190 times, Failed 257 times)

Stack Traces | 0.01s run time
=== RUN   TestListDir/depth_1_lists_root_directory
=== PAUSE TestListDir/depth_1_lists_root_directory
=== CONT  TestListDir/depth_1_lists_root_directory
    filesystem_test.go:97: 
        	Error Trace:	.../tests/envd/filesystem_test.go:97
        	Error:      	Received unexpected error:
        	            	unavailable: 502 Bad Gateway
        	Test:       	TestListDir/depth_1_lists_root_directory
--- FAIL: TestListDir/depth_1_lists_root_directory (0.01s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestListDir/depth_2_lists_first_level_of_subdirectories_(in_this_case_the_root_directory)

Flake rate in main: 57.49% (Passed 190 times, Failed 257 times)

Stack Traces | 0.01s run time
=== RUN   TestListDir/depth_2_lists_first_level_of_subdirectories_(in_this_case_the_root_directory)
=== PAUSE TestListDir/depth_2_lists_first_level_of_subdirectories_(in_this_case_the_root_directory)
=== CONT  TestListDir/depth_2_lists_first_level_of_subdirectories_(in_this_case_the_root_directory)
    filesystem_test.go:97: 
        	Error Trace:	.../tests/envd/filesystem_test.go:97
        	Error:      	Received unexpected error:
        	            	unavailable: 502 Bad Gateway
        	Test:       	TestListDir/depth_2_lists_first_level_of_subdirectories_(in_this_case_the_root_directory)
--- FAIL: TestListDir/depth_2_lists_first_level_of_subdirectories_(in_this_case_the_root_directory) (0.01s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestListDir/depth_3_lists_all_directories_and_files

Flake rate in main: 57.49% (Passed 190 times, Failed 257 times)

Stack Traces | 0.01s run time
=== RUN   TestListDir/depth_3_lists_all_directories_and_files
=== PAUSE TestListDir/depth_3_lists_all_directories_and_files
=== CONT  TestListDir/depth_3_lists_all_directories_and_files
    filesystem_test.go:97: 
        	Error Trace:	.../tests/envd/filesystem_test.go:97
        	Error:      	Received unexpected error:
        	            	unavailable: 502 Bad Gateway
        	Test:       	TestListDir/depth_3_lists_all_directories_and_files
--- FAIL: TestListDir/depth_3_lists_all_directories_and_files (0.01s)
github.com/e2b-dev/infra/tests/integration/internal/tests/proxies::TestSandboxAutoResumeViaProxy

Flake rate in main: 54.88% (Passed 194 times, Failed 236 times)

Stack Traces | 32.7s run time
=== RUN   TestSandboxAutoResumeViaProxy
=== PAUSE TestSandboxAutoResumeViaProxy
=== CONT  TestSandboxAutoResumeViaProxy
    auto_resume_test.go:97: [Status code: 502] Response body: {"sandboxId":"i9ltymovqbhusrgswrzl5","message":"The sandbox is running but port is not open","port":8000,"code":502}
    auto_resume_test.go:97: [Status code: 502] Response body: {"sandboxId":"i9ltymovqbhusrgswrzl5","message":"The sandbox is running but port is not open","port":8000,"code":502}
Executing command ls in sandbox iiqjgf4jo97j2aseiskuv
    auto_resume_test.go:116: 
        	Error Trace:	.../tests/proxies/auto_resume_test.go:116
        	Error:      	Received unexpected error:
        	            	Get "http://localhost:3002": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
        	Test:       	TestSandboxAutoResumeViaProxy
--- FAIL: TestSandboxAutoResumeViaProxy (32.69s)



@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.

Autofix Details

Bugbot Autofix prepared a fix for the issue found in the latest run.

  • ✅ Fixed: FPR conflicts with hugepages
    • Added !hugePages condition to FPR auto-enable logic, matching the server build path's conflict prevention.


Or push these changes by commenting:

@cursor push 7c518d0d3e
Preview (7c518d0d3e)
diff --git a/packages/orchestrator/cmd/create-build/main.go b/packages/orchestrator/cmd/create-build/main.go
--- a/packages/orchestrator/cmd/create-build/main.go
+++ b/packages/orchestrator/cmd/create-build/main.go
@@ -358,7 +358,8 @@
 		})
 	}
 
-	// Default FPR on for FC v1.14+; explicit --free-page-reporting overrides.
+	// Default FPR on for FC v1.14+ unless hugepages is enabled.
+	// Firecracker rejects balloon (free-page-reporting) together with hugepages.
 	var fprEnabled bool
 	if freePageReporting != nil {
 		fprEnabled = *freePageReporting
@@ -366,7 +367,7 @@
 		versionOnly, _, _ := strings.Cut(fcVersion, "_")
 		supported, err := utils.IsGTEVersion(versionOnly, "v1.14.0")
 		if err == nil {
-			fprEnabled = supported
+			fprEnabled = !hugePages && supported
 		}
 	}


Comment thread packages/orchestrator/cmd/create-build/main.go Outdated
@ValentaTomas
Member Author

Waiting for the merge of #2541, but otherwise should be ready.

@ValentaTomas ValentaTomas marked this pull request as ready for review May 7, 2026 06:28
@ValentaTomas
Member Author

Before enabling in prod we need to deploy the kernel fix though.

@qodo-code-review

qodo-code-review Bot commented May 7, 2026

Code Review by Qodo

🐞 Bugs (1) 📘 Rule violations (0) 📎 Requirement gaps (0)



Action required

1. FPH kernel gate disables ✓ Resolved 🐞
Description
MinFreePageHintingKernelVersion is set to 999.0.0, so kernelSupportsFreePageHinting() will never
enable FreePageHinting for normal guest kernels and installBalloon() will always configure the
balloon with hinting disabled. With hinting disabled, DrainBalloon() will consistently no-op as “not
configured”, so enabling free-page-hinting-timeout-ms won’t actually drain anything before pause.
Code

packages/orchestrator/pkg/sandbox/fc/fph_gates.go[R10-18]

+// MinFreePageHintingKernelVersion is the minimum guest kernel version that
+// contains the FPH/MADV_DONTNEED race fix. Bump once the fixed kernel ships.
+const MinFreePageHintingKernelVersion = "999.0.0"
+
+func kernelSupportsFreePageHinting(kernelVersion string) bool {
+	v := strings.TrimPrefix(kernelVersion, "vmlinux-")
+	ok, _ := utils.IsGTEVersion(v, MinFreePageHintingKernelVersion)
+
+	return ok
Evidence
The kernel gate compares the guest kernel version against 999.0.0, which will fail for real kernel
versions (e.g. the repo default vmlinux-6.1.158), causing freePageHinting to be false when
configuring the balloon. Firecracker’s API reports 400 when hinting wasn’t enabled at device
configuration time; DrainBalloon treats that specific 400 as “not configured” and returns nil,
making the pre-pause drain ineffective.

packages/orchestrator/pkg/sandbox/fc/fph_gates.go[10-18]
packages/orchestrator/pkg/sandbox/fc/process.go[446-454]
packages/shared/pkg/featureflags/flags.go[244-247]
packages/shared/pkg/fc/client/operations/start_balloon_hinting_responses.go[110-114]
packages/orchestrator/pkg/sandbox/fc/process.go[734-740]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
Free-page-hinting is effectively impossible to enable because `MinFreePageHintingKernelVersion` is hardcoded to `999.0.0`, making `kernelSupportsFreePageHinting()` always return false for real kernel versions; this causes the balloon to be configured without hinting and makes `DrainBalloon()` a no-op.

### Issue Context
The pre-pause drain is guarded by a timeout feature flag, but the balloon hinting capability is separately gated by the kernel version check; with the current constant, the drain cannot ever perform useful work.

### Fix Focus Areas
- packages/orchestrator/pkg/sandbox/fc/fph_gates.go[10-18]
- packages/orchestrator/pkg/sandbox/fc/process.go[446-454]




Remediation recommended

2. FPH override no-op online 🐞
Description
resume-build’s -fph-timeout-ms calls featureflags.NewIntFlag(), which only updates the offline test
datasource, not a live LaunchDarkly environment. When LAUNCH_DARKLY_API_KEY is set,
NewClientWithLogLevel uses a real LaunchDarkly client and the override is ignored, so the CLI flag
does not do what its help text claims.
Code

packages/orchestrator/cmd/resume-build/main.go[R76-82]

+	fphTimeoutMs := flag.Int("fph-timeout-ms", 0, "override free-page-hinting-timeout-ms LD flag (0 = use LD default)")
+
	flag.Parse()

+	if *fphTimeoutMs > 0 {
+		featureflags.NewIntFlag("free-page-hinting-timeout-ms", *fphTimeoutMs)
+	}
Evidence
The CLI override is implemented by calling NewIntFlag(), which mutates the in-process ldtestdata
(offline) store. The featureflags client switches to a real LaunchDarkly client whenever
LAUNCH_DARKLY_API_KEY is set, so changes to the offline store won’t affect evaluation in that mode.

packages/orchestrator/cmd/resume-build/main.go[76-82]
packages/shared/pkg/featureflags/flags.go[147-152]
packages/shared/pkg/featureflags/client.go[19-23]
packages/shared/pkg/featureflags/client.go[71-86]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`-fph-timeout-ms` currently only affects the offline LaunchDarkly test datasource; when a real LaunchDarkly client is in use, the override is ignored.

### Issue Context
The flag help text says it “overrides free-page-hinting-timeout-ms LD flag”, so it should deterministically control the drain timeout in resume-build regardless of whether LaunchDarkly is configured.

### Fix Focus Areas
- packages/orchestrator/cmd/resume-build/main.go[76-82]
- packages/shared/pkg/featureflags/flags.go[147-152]
- packages/shared/pkg/featureflags/client.go[19-23]
- packages/shared/pkg/featureflags/client.go[71-86]





Comment thread packages/orchestrator/pkg/sandbox/fc/fph_gates.go Outdated
Comment thread packages/orchestrator/pkg/sandbox/fc/process.go
ValentaTomas added a commit that referenced this pull request May 8, 2026
Adds an opt-in pre-pause step that runs `sync`, `drop_caches`,
`compact_memory`, and `fstrim -av` on the live VM via envd's Process
service to shrink the memfile/rootfs diff. Each step is wrapped in
`timeout -s KILL` with its own cap, so a stuck step (most realistically
a slow `sync` on a large dirty backlog) cannot starve the rest — and a
killed step does not abort the chain (`;`-separated, not `&&`).

Pausing FC is unaffected by an in-flight guest `sync` we time out: FC
only drains in-flight virtio I/O before completing the pause; any
unflushed dirty pages stay in the memfile snapshot and converge on
resume. Per-step timeouts trade reclaim payoff, never correctness —
`drop_caches` is documented non-destructive, `fstrim` consults FS
allocation metadata not pagecache, and a partial `compact_memory` is
just less-compacted.

Disabled by default — the LD flag's null default leaves every step at 0
(skipped). Missing keys, zero, negative, and wrong-type values all
collapse to "skip". The orchestrator skips the envd call entirely when
the chain is empty. The outer `Connect-Timeout-Ms` is the sum of
per-step caps plus a small slack.

Single LD flag, one rule per cohort:

- `guest-pause-reclaim` (JSON) — per-step caps in milliseconds keyed by
step name, evaluated against sandbox / team / template LD contexts so
targeting is configured in LaunchDarkly.

Example value:

```json
{"sync":500,"drop_caches":200,"compact_memory":1000,"fstrim":500}
```

`resume-build` exposes `-reclaim` to inject the example values into the
offline LD store for local testing.

Pairs cleanly with #2553 (disable proactive compaction in the guest base
image), but is independent of it and of FPH (#2552). Split out from
#2550.
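The per-step timeout semantics above (a killed step does not abort the chain) can be demonstrated with a tiny sketch. The command and cap here are illustrative stand-ins, not the PR's actual chain: `sleep 10` plays the role of a stuck `sync`.

```shell
# 'timeout -s KILL' caps a stuck step; ';' instead of '&&' lets the
# next step run even after the previous one was killed. GNU timeout
# exits 137 (128+SIGKILL) when it has to kill the command.
timeout -s KILL 1 sleep 10 ; echo "chain continued, killed step exited $?"
```

Run standalone, this prints the echo line after roughly one second, showing that the chain survives a killed step.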
@ValentaTomas ValentaTomas removed the request for review from dobrac May 8, 2026 08:48
@ValentaTomas ValentaTomas requested review from bchalios and kalyazin May 8, 2026 08:49
Base automatically changed from feat/uffd-fc-free-page-reporting-integration to main May 8, 2026 23:42
@ValentaTomas ValentaTomas enabled auto-merge (squash) May 9, 2026 22:19
Comment thread packages/orchestrator/pkg/sandbox/fc/process.go Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 55d213b1bb

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread packages/orchestrator/pkg/sandbox/fc/process.go Outdated

@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.

Autofix Details

Bugbot Autofix prepared a fix for the issue found in the latest run.

  • ✅ Fixed: Drain poll can miss fast cycle when hostBefore equals freePageHintDone
    • Initialized sawBump to true when hostBefore equals freePageHintDone so fast-completing cycles are correctly detected as successful instead of timing out.


Or push these changes by commenting:

@cursor push ed7f7d6038
Preview (ed7f7d6038)
diff --git a/packages/orchestrator/pkg/sandbox/fc/process.go b/packages/orchestrator/pkg/sandbox/fc/process.go
--- a/packages/orchestrator/pkg/sandbox/fc/process.go
+++ b/packages/orchestrator/pkg/sandbox/fc/process.go
@@ -772,7 +772,12 @@
 	}
 
 	backoff := 5 * time.Millisecond
-	sawBump := false
+	// If hostBefore is already freePageHintDone, we're starting from a
+	// previously completed cycle. In this case, if the new cycle completes
+	// before the first poll, host will remain at freePageHintDone and we'd
+	// miss the bump. Initialize sawBump=true so any observation of
+	// host==freePageHintDone signals completion.
+	sawBump := hostBefore == freePageHintDone
 	for {
 		select {
 		case <-ctx.Done():


Comment thread packages/orchestrator/pkg/sandbox/fc/process.go
@ValentaTomas
Member Author

@cla-bot check

@ValentaTomas ValentaTomas disabled auto-merge May 11, 2026 22:42
Comment thread packages/orchestrator/pkg/sandbox/fc/process.go
Comment thread packages/orchestrator/pkg/sandbox/fc/process.go Outdated
@kalyazin kalyazin dismissed their stale review May 12, 2026 10:19

the issue in the drain logic has been addressed

@ValentaTomas ValentaTomas enabled auto-merge (squash) May 13, 2026 23:09
Comment thread iac/modules/job-otel-collector/configs/otel-collector.yaml

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c2bfac48d5


Comment thread packages/orchestrator/pkg/sandbox/fc/drain_balloon_test.go
@ValentaTomas ValentaTomas force-pushed the feat/sandbox-pause-fph branch from 2eedd3f to 86be69e Compare May 13, 2026 23:50

@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


Bugbot Autofix prepared a fix for the issue found in the latest run.

  • ✅ Fixed: Drain balloon metrics read after paused VM may fail
    • Moved FlushAndReadBalloonMetrics call before Pause to avoid timeout when reading metrics from paused VM.


Or push these changes by commenting:

@cursor push 25f80a5ef2
Preview (25f80a5ef2)
diff --git a/packages/orchestrator/cmd/resume-build/fph_bench.go b/packages/orchestrator/cmd/resume-build/fph_bench.go
--- a/packages/orchestrator/cmd/resume-build/fph_bench.go
+++ b/packages/orchestrator/cmd/resume-build/fph_bench.go
@@ -139,6 +139,8 @@
 	newMeta := origMeta
 	newMeta.Template.BuildID = buildID
 
+	balloon, _ := sbx.FlushAndReadBalloonMetrics(ctx)
+
 	pauseStart := time.Now()
 	snapshot, err := sbx.Pause(ctx, newMeta, sandbox.SnapshotUseCasePause)
 	pauseDur := time.Since(pauseStart)
@@ -147,8 +149,6 @@
 	}
 	defer snapshot.Close(context.WithoutCancel(ctx))
 
-	balloon, _ := sbx.FlushAndReadBalloonMetrics(ctx)
-
 	upload, err := sandbox.NewUpload(ctx, nil, snapshot, r.storage, storage.CompressConfig{}, nil, "", nil)
 	if err != nil {
 		return fphBenchSample{pause: pauseDur, err: fmt.Errorf("upload prepare: %w", err)}


Reviewed by Cursor Bugbot for commit fab97ff. Configure here.

Comment thread packages/orchestrator/cmd/resume-build/fph_bench.go
@ValentaTomas ValentaTomas requested a review from kalyazin May 14, 2026 00:20
AdaAibaby pushed a commit to AdaAibaby/infra that referenced this pull request May 14, 2026
@kalyazin
Contributor

kalyazin commented May 14, 2026

Just wanted to note that with the unconditional enabling of FPR/FPH in https://github.com/e2b-dev/infra/pull/2552/changes#diff-264fea9254b55eb530decfded9ded8f53bc7623f8c207656650746796b64f178, the tests would fail for FC < 1.14. We may accept that if we believe we won't need to stick with 1.12 in the future.

Or we could harden it a bit:

diff --git a/packages/orchestrator/cmd/create-build/main.go b/packages/orchestrator/cmd/create-build/main.go
index 8fe6a832b..fa600eba9 100644
--- a/packages/orchestrator/cmd/create-build/main.go
+++ b/packages/orchestrator/cmd/create-build/main.go
@@ -39,6 +39,7 @@ import (
 	"github.com/e2b-dev/infra/packages/orchestrator/pkg/template/build/metrics"
 	artifactsregistry "github.com/e2b-dev/infra/packages/shared/pkg/artifacts-registry"
 	"github.com/e2b-dev/infra/packages/shared/pkg/dockerhub"
+	"github.com/e2b-dev/infra/packages/shared/pkg/fcversion"
 	"github.com/e2b-dev/infra/packages/shared/pkg/featureflags"
 	templatemanager "github.com/e2b-dev/infra/packages/shared/pkg/grpc/template-manager"
 	"github.com/e2b-dev/infra/packages/shared/pkg/logger"
@@ -354,6 +355,13 @@ func doBuild(
 		})
 	}
 
+	// Mirror prod's gating (pkg/template/server/create_template.go:70): only
+	// enable FPR if the chosen Firecracker version actually supports it.
+	// Hardcoding true breaks builds on FC <1.14 with a balloon 400.
+	fcInfo, err := fcversion.New(fc)
+	if err != nil {
+		return fmt.Errorf("invalid firecracker version %q: %w", fc, err)
+	}
 	tmpl := config.TemplateConfig{
 		Version:            templates.TemplateV2LatestVersion,
 		TemplateID:         templateID,
@@ -366,7 +374,7 @@ func doBuild(
 		ReadyCmd:           readyCmd,
 		KernelVersion:      kernel,
 		FirecrackerVersion: fc,
-		FreePageReporting:  true,
+		FreePageReporting:  fcInfo.HasFreePageReporting(),
 		TeamID:             "local",
 		Steps:              steps,
 	}

@ValentaTomas
Member Author

> Just wanted to note that with unconditional enabling of FPR/FPH in #2552 (changes), the tests would be failing for FC < 1.14. We may accept it if we believe we won't need to stick with 1.12 in the future.
>
> Or we could harden it a bit: *(diff quoted in full in the comment above)*

We are gating the enabling here https://github.com/e2b-dev/infra/blob/feat/sandbox-pause-fph/packages/orchestrator/pkg/template/server/create_template.go#L70, and for now we will be using the feature-flag context (which should carry the kernel and FC versions) to check. At worst I think this should just fail and continue, as the balloon device is not there.
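For illustration, the version gate the diff relies on reduces to a semver floor check. `hasFreePageReporting` below is a hypothetical stand-in for the `fcversion` helper, assuming (per the comment above) that balloon free-page reporting needs FC >= 1.14; it is not the real `pkg/shared/fcversion` implementation.

```go
package main

import (
	"fmt"
	"strings"
)

// hasFreePageReporting is a minimal stand-in: parse "major.minor" and apply
// the assumed 1.14 floor. Unparsable versions fail closed (no balloon install).
func hasFreePageReporting(v string) bool {
	var major, minor int
	if _, err := fmt.Sscanf(strings.TrimPrefix(v, "v"), "%d.%d", &major, &minor); err != nil {
		return false
	}
	return major > 1 || (major == 1 && minor >= 14)
}

func main() {
	fmt.Println(hasFreePageReporting("v1.12.4"), hasFreePageReporting("1.14.0"))
}
```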

Drains virtio-balloon free-page-hinting before pause so snapshots don't
capture pages the guest already considers free. Balloon install is gated by
`free-page-hinting-install` (bool LD flag); kernel-side eligibility is
targeted via the LD context (kernel/FC version). On pause we call
`start_balloon_hinting(acknowledge_on_stop=true)` and poll
`describe_balloon_hinting` until `host_cmd == DONE`, gated by
`free-page-hinting-timeout-ms` (int LD flag, ms; 0 = disabled).

Hot path: post-pause we trigger an FC metrics flush but don't wait for
the reader, trading per-pause counter precision for pause latency.

Includes cmd/resume-build -fph-bench and scripts/bench-fph.sh for
offline FPR vs FPR+FPH comparison.
Replaces `free-page-hinting-install` (bool) and the prior `free-page-hinting-timeout-ms` (int) with a single `free-page-hinting-config` JSON flag keyed by `enabled`, `pause`, `build` (matching SnapshotUseCase). Lets operators install FPH on the balloon but disable the pre-pause drain per use case — e.g. keep it on for normal pause and off for template build, where it was observed to grow the memfile.
Mirror prod's gating from `pkg/template/server/create_template.go`: balloon installation fails on FC <1.14, so the local CLI must not hardcode `FreePageReporting: true`.
@ValentaTomas ValentaTomas force-pushed the feat/sandbox-pause-fph branch from 291933b to dcc9dcd Compare May 14, 2026 23:48
@ValentaTomas ValentaTomas merged commit 9481380 into main May 15, 2026
52 of 53 checks passed
@ValentaTomas ValentaTomas deleted the feat/sandbox-pause-fph branch May 15, 2026 00:11