
feat(fc): drain virtio-balloon free-page-hinting before pause#2552

Merged
ValentaTomas merged 5 commits into main from feat/sandbox-pause-fph
May 15, 2026

Conversation

@ValentaTomas
Member

@ValentaTomas ValentaTomas commented May 4, 2026

Drains virtio-balloon free-page-hinting before pause so snapshots don't capture pages the guest already considers free.

Balloon install is gated by free-page-hinting-install (bool LD flag); kernel-side eligibility is targeted via the LD context (kernel/FC version). On pause we call start_balloon_hinting(acknowledge_on_stop=true) and poll describe_balloon_hinting until host_cmd == DONE, gated by free-page-hinting-timeout-ms (int LD flag, ms; 0 = disabled). Reclaimed pages emit UFFD_EVENT_REMOVE, already tracked by the parent FPR work.

Hot path is kept minimal: post-drain and post-pause we trigger an FC metrics flush but don't wait for the reader, trading per-pause counter precision for pause latency. System-level FPH activity is observable via the periodic 5 s metrics flush.

Includes cmd/resume-build -fph-bench and scripts/bench-fph.sh for offline FPR vs FPR+FPH comparison.

Operators must wait for the kernel FPH race fix to roll out before enabling free-page-hinting-timeout-ms in prod.

@cursor

cursor Bot commented May 4, 2026

PR Summary

High Risk
Touches the sandbox pause/snapshot hot path and Firecracker API interactions; mis-gating or drain/poll issues could increase pause latency or break snapshot creation on some FC/kernel combinations.

Overview
There’s a compile-time bug in cmd/resume-build: `for i := range opts.iterations` ranges over an int, which only compiles on Go 1.22+.

create-build now forces free-page-hinting-config.enabled=true in local runs, which may unintentionally change behavior for callers expecting the LD default.

The pause path now conditionally runs balloon hinting drain + extra metrics flushes; incorrect LaunchDarkly targeting/version gating or Firecracker API behavior differences (notably the 204 “success” workaround) could still cause unexpected pause latency or no-op behavior across environments.

Reviewed by Cursor Bugbot for commit dcc9dcd. Bugbot is set up for automated code reviews on this repo. Configure here.

Comment thread packages/orchestrator/pkg/sandbox/fc/process.go
Comment thread packages/orchestrator/pkg/sandbox/fc/process.go
@ValentaTomas ValentaTomas force-pushed the feat/sandbox-pause-fph branch 4 times, most recently from f4e3ab0 to 7619cc9 Compare May 4, 2026 00:55
@ValentaTomas ValentaTomas force-pushed the feat/uffd-fc-free-page-reporting-integration branch 2 times, most recently from 920e8ec to 7f22709 Compare May 5, 2026 08:19
@cla-bot cla-bot Bot added the cla-signed label May 6, 2026
@linear-code

linear-code Bot commented May 6, 2026

@codecov

codecov Bot commented May 6, 2026

❌ 14 Tests Failed:

Tests completed: 2619 · Failed: 14 · Passed: 2605 · Skipped: 7
View the top 1 failed test(s) by shortest run time
github.com/e2b-dev/infra/packages/shared/pkg/storage::TestMultipartUploader_HighConcurrency_StressTest
Stack Traces | 0.69s run time
=== RUN   TestMultipartUploader_HighConcurrency_StressTest
=== PAUSE TestMultipartUploader_HighConcurrency_StressTest
=== CONT  TestMultipartUploader_HighConcurrency_StressTest
    gcp_multipart_test.go:369: 
        	Error Trace:	.../pkg/storage/gcp_multipart_test.go:369
        	Error:      	"1" is not greater than "1"
        	Test:       	TestMultipartUploader_HighConcurrency_StressTest
        	Messages:   	Should have concurrent uploads
--- FAIL: TestMultipartUploader_HighConcurrency_StressTest (0.69s)
View the full list of 13 ❄️ flaky test(s)
github.com/e2b-dev/infra/tests/integration/internal/tests/api/metrics::TestTeamMetrics

Flake rate in main: 70.64% (Passed 192 times, Failed 462 times)

Stack Traces | 0.3s run time
=== RUN   TestTeamMetrics
=== PAUSE TestTeamMetrics
=== CONT  TestTeamMetrics
    team_metrics_test.go:61: 
        	Error Trace:	.../api/metrics/team_metrics_test.go:61
        	Error:      	Should be true
        	Test:       	TestTeamMetrics
        	Messages:   	MaxConcurrentSandboxes should be >= 0
--- FAIL: TestTeamMetrics (0.30s)
github.com/e2b-dev/infra/tests/integration/internal/tests/api/sandboxes::TestEgressFirewallWithInternetAccessFalse

Flake rate in main: 54.69% (Passed 193 times, Failed 233 times)

Stack Traces | 6.18s run time
=== RUN   TestEgressFirewallWithInternetAccessFalse
=== PAUSE TestEgressFirewallWithInternetAccessFalse
=== CONT  TestEgressFirewallWithInternetAccessFalse
    sandbox_network_out_test.go:400: Command [curl] output: event:{start:{pid:1311}}
    sandbox_network_out_test.go:400: Command [curl] output: event:{end:{exit_code:28 exited:true status:"exit status 28" error:"exit status 28"}}
    sandbox_network_out_test.go:400: 
        	Error Trace:	.../api/sandboxes/sandbox_network_out_test.go:67
        	            				.../api/sandboxes/sandbox_network_out_test.go:400
        	Error:      	Received unexpected error:
        	            	command curl in sandbox im1efj3ayn3kozoifnxjx failed with exit code 28
        	Test:       	TestEgressFirewallWithInternetAccessFalse
        	Messages:   	Expected curl to succeed for allowed IP even with allow_internet_access=false (network config takes precedence)
--- FAIL: TestEgressFirewallWithInternetAccessFalse (6.18s)
github.com/e2b-dev/infra/tests/integration/internal/tests/api/sandboxes::TestUpdateNetworkConfig

Flake rate in main: 76.66% (Passed 200 times, Failed 657 times)

Stack Traces | 35.2s run time
=== RUN   TestUpdateNetworkConfig
=== PAUSE TestUpdateNetworkConfig
=== CONT  TestUpdateNetworkConfig
Executing command curl in sandbox i1tm429ujag7ja6rx7mn0
--- FAIL: TestUpdateNetworkConfig (35.25s)
github.com/e2b-dev/infra/tests/integration/internal/tests/api/sandboxes::TestUpdateNetworkConfig/pause_resume_preserves_allow_internet_access_false

Flake rate in main: 77.13% (Passed 193 times, Failed 651 times)

Stack Traces | 6.62s run time
=== RUN   TestUpdateNetworkConfig/pause_resume_preserves_allow_internet_access_false
Executing command curl in sandbox ifxyithxg5wx9c290xfg1
    sandbox_network_update_test.go:372: Command [curl] output: event:{start:{pid:1354}}
    sandbox_network_update_test.go:372: Command [curl] output: event:{end:{exit_code:35 exited:true status:"exit status 35" error:"exit status 35"}}
    sandbox_network_update_test.go:372: Command [curl] output: event:{start:{pid:1355}}
    sandbox_network_update_test.go:372: Command [curl] output: event:{end:{exit_code:35 exited:true status:"exit status 35" error:"exit status 35"}}
Executing command curl in sandbox im1efj3ayn3kozoifnxjx
    sandbox_network_update_test.go:391: Command [curl] output: event:{start:{pid:1356}}
    sandbox_network_update_test.go:391: Command [curl] output: event:{data:{stdout:"HTTP/2 302 \r\nx-content-type-options: nosniff\r\nlocation: https://dns.google/\r\ndate: Fri, 15 May 2026 00:02:06 GMT\r\ncontent-type: text/html; charset=UTF-8\r\nserver: HTTP server (unknown)\r\ncontent-length: 216\r\nx-xss-protection: 0\r\nx-frame-options: SAMEORIGIN\r\nalt-svc: h3=\":443\"; ma=2592000,h3-29=\":443\"; ma=2592000\r\n\r\n"}}
    sandbox_network_update_test.go:391: Command [curl] output: event:{end:{exited:true status:"exit status 0"}}
    sandbox_network_update_test.go:391: Command [curl] completed successfully in sandbox ifxyithxg5wx9c290xfg1
    sandbox_network_update_test.go:391: 
        	Error Trace:	.../api/sandboxes/sandbox_network_out_test.go:74
        	            				.../api/sandboxes/sandbox_network_update_test.go:60
        	            				.../api/sandboxes/sandbox_network_update_test.go:391
        	Error:      	An error is expected but got nil.
        	Test:       	TestUpdateNetworkConfig/pause_resume_preserves_allow_internet_access_false
        	Messages:   	https://8.8.8.8 should be blocked
--- FAIL: TestUpdateNetworkConfig/pause_resume_preserves_allow_internet_access_false (6.62s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestBindLocalhost

Flake rate in main: 56.36% (Passed 343 times, Failed 443 times)

Stack Traces | 0s run time
=== RUN   TestBindLocalhost
=== PAUSE TestBindLocalhost
=== CONT  TestBindLocalhost
--- FAIL: TestBindLocalhost (0.00s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestBindLocalhost/bind_::1

Flake rate in main: 64.35% (Passed 190 times, Failed 343 times)

Stack Traces | 8.25s run time
=== RUN   TestBindLocalhost/bind_::1
=== PAUSE TestBindLocalhost/bind_::1
=== CONT  TestBindLocalhost/bind_::1
Executing command python in sandbox ilgucl3ggygdm4l08bihm
    localhost_bind_test.go:69: Command [python] output: event:{start:{pid:1261}}
    localhost_bind_test.go:90: 
        	Error Trace:	.../tests/envd/localhost_bind_test.go:90
        	Error:      	Not equal: 
        	            	expected: 200
        	            	actual  : 502
        	Test:       	TestBindLocalhost/bind_::1
        	Messages:   	Unexpected status code 502 for bind address ::1
--- FAIL: TestBindLocalhost/bind_::1 (8.25s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestBindLocalhost/bind_localhost

Flake rate in main: 64.35% (Passed 190 times, Failed 343 times)

Stack Traces | 7.43s run time
=== RUN   TestBindLocalhost/bind_localhost
=== PAUSE TestBindLocalhost/bind_localhost
=== CONT  TestBindLocalhost/bind_localhost
Executing command python in sandbox ihcuqz9sqrvinisdeyc18
    localhost_bind_test.go:69: Command [python] output: event:{start:{pid:1261}}
    localhost_bind_test.go:90: 
        	Error Trace:	.../tests/envd/localhost_bind_test.go:90
        	Error:      	Not equal: 
        	            	expected: 200
        	            	actual  : 502
        	Test:       	TestBindLocalhost/bind_localhost
        	Messages:   	Unexpected status code 502 for bind address localhost
--- FAIL: TestBindLocalhost/bind_localhost (7.43s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestListDir

Flake rate in main: 53.88% (Passed 238 times, Failed 278 times)

Stack Traces | 1.81s run time
=== RUN   TestListDir
=== PAUSE TestListDir
=== CONT  TestListDir
Executing command /bin/bash in sandbox i7q057jj52x6c27bujbmp
--- FAIL: TestListDir (1.81s)
Executing command python in sandbox i50m1wdy7wrnevct8ii2k
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestListDir/depth_0_lists_only_root_directory

Flake rate in main: 57.49% (Passed 190 times, Failed 257 times)

Stack Traces | 0.02s run time
=== RUN   TestListDir/depth_0_lists_only_root_directory
=== PAUSE TestListDir/depth_0_lists_only_root_directory
=== CONT  TestListDir/depth_0_lists_only_root_directory
    filesystem_test.go:97: 
        	Error Trace:	.../tests/envd/filesystem_test.go:97
        	Error:      	Received unexpected error:
        	            	unavailable: 502 Bad Gateway
        	Test:       	TestListDir/depth_0_lists_only_root_directory
--- FAIL: TestListDir/depth_0_lists_only_root_directory (0.02s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestListDir/depth_1_lists_root_directory

Flake rate in main: 57.49% (Passed 190 times, Failed 257 times)

Stack Traces | 0.01s run time
=== RUN   TestListDir/depth_1_lists_root_directory
=== PAUSE TestListDir/depth_1_lists_root_directory
=== CONT  TestListDir/depth_1_lists_root_directory
    filesystem_test.go:97: 
        	Error Trace:	.../tests/envd/filesystem_test.go:97
        	Error:      	Received unexpected error:
        	            	unavailable: 502 Bad Gateway
        	Test:       	TestListDir/depth_1_lists_root_directory
--- FAIL: TestListDir/depth_1_lists_root_directory (0.01s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestListDir/depth_2_lists_first_level_of_subdirectories_(in_this_case_the_root_directory)

Flake rate in main: 57.49% (Passed 190 times, Failed 257 times)

Stack Traces | 0.01s run time
=== RUN   TestListDir/depth_2_lists_first_level_of_subdirectories_(in_this_case_the_root_directory)
=== PAUSE TestListDir/depth_2_lists_first_level_of_subdirectories_(in_this_case_the_root_directory)
=== CONT  TestListDir/depth_2_lists_first_level_of_subdirectories_(in_this_case_the_root_directory)
    filesystem_test.go:97: 
        	Error Trace:	.../tests/envd/filesystem_test.go:97
        	Error:      	Received unexpected error:
        	            	unavailable: 502 Bad Gateway
        	Test:       	TestListDir/depth_2_lists_first_level_of_subdirectories_(in_this_case_the_root_directory)
--- FAIL: TestListDir/depth_2_lists_first_level_of_subdirectories_(in_this_case_the_root_directory) (0.01s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestListDir/depth_3_lists_all_directories_and_files

Flake rate in main: 57.49% (Passed 190 times, Failed 257 times)

Stack Traces | 0.01s run time
=== RUN   TestListDir/depth_3_lists_all_directories_and_files
=== PAUSE TestListDir/depth_3_lists_all_directories_and_files
=== CONT  TestListDir/depth_3_lists_all_directories_and_files
    filesystem_test.go:97: 
        	Error Trace:	.../tests/envd/filesystem_test.go:97
        	Error:      	Received unexpected error:
        	            	unavailable: 502 Bad Gateway
        	Test:       	TestListDir/depth_3_lists_all_directories_and_files
--- FAIL: TestListDir/depth_3_lists_all_directories_and_files (0.01s)
github.com/e2b-dev/infra/tests/integration/internal/tests/proxies::TestSandboxAutoResumeViaProxy

Flake rate in main: 54.88% (Passed 194 times, Failed 236 times)

Stack Traces | 32.7s run time
=== RUN   TestSandboxAutoResumeViaProxy
=== PAUSE TestSandboxAutoResumeViaProxy
=== CONT  TestSandboxAutoResumeViaProxy
    auto_resume_test.go:97: [Status code: 502] Response body: {"sandboxId":"i9ltymovqbhusrgswrzl5","message":"The sandbox is running but port is not open","port":8000,"code":502}
    auto_resume_test.go:97: [Status code: 502] Response body: {"sandboxId":"i9ltymovqbhusrgswrzl5","message":"The sandbox is running but port is not open","port":8000,"code":502}
Executing command ls in sandbox iiqjgf4jo97j2aseiskuv
    auto_resume_test.go:116: 
        	Error Trace:	.../tests/proxies/auto_resume_test.go:116
        	Error:      	Received unexpected error:
        	            	Get "http://localhost:3002": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
        	Test:       	TestSandboxAutoResumeViaProxy
--- FAIL: TestSandboxAutoResumeViaProxy (32.69s)



@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.

Autofix Details

Bugbot Autofix prepared a fix for the issue found in the latest run.

  • ✅ Fixed: FPR conflicts with hugepages
    • Added !hugePages condition to FPR auto-enable logic, matching the server build path's conflict prevention.


Or push these changes by commenting:

@cursor push 7c518d0d3e
Preview (7c518d0d3e)
diff --git a/packages/orchestrator/cmd/create-build/main.go b/packages/orchestrator/cmd/create-build/main.go
--- a/packages/orchestrator/cmd/create-build/main.go
+++ b/packages/orchestrator/cmd/create-build/main.go
@@ -358,7 +358,8 @@
 		})
 	}
 
-	// Default FPR on for FC v1.14+; explicit --free-page-reporting overrides.
+	// Default FPR on for FC v1.14+ unless hugepages is enabled.
+	// Firecracker rejects balloon (free-page-reporting) together with hugepages.
 	var fprEnabled bool
 	if freePageReporting != nil {
 		fprEnabled = *freePageReporting
@@ -366,7 +367,7 @@
 		versionOnly, _, _ := strings.Cut(fcVersion, "_")
 		supported, err := utils.IsGTEVersion(versionOnly, "v1.14.0")
 		if err == nil {
-			fprEnabled = supported
+			fprEnabled = !hugePages && supported
 		}
 	}


Comment thread packages/orchestrator/cmd/create-build/main.go Outdated
@ValentaTomas
Member Author

Waiting for the merge of #2541, but otherwise should be ready.

@ValentaTomas ValentaTomas marked this pull request as ready for review May 7, 2026 06:28
@ValentaTomas
Member Author

Before enabling in prod we need to deploy the kernel fix though.

@qodo-code-review

qodo-code-review Bot commented May 7, 2026

Code Review by Qodo

🐞 Bugs (1) 📘 Rule violations (0) 📎 Requirement gaps (0)



Action required

1. FPH kernel gate disables ✓ Resolved 🐞
Description
MinFreePageHintingKernelVersion is set to 999.0.0, so kernelSupportsFreePageHinting() will never
enable FreePageHinting for normal guest kernels and installBalloon() will always configure the
balloon with hinting disabled. With hinting disabled, DrainBalloon() will consistently no-op as “not
configured”, so enabling free-page-hinting-timeout-ms won’t actually drain anything before pause.
Code

packages/orchestrator/pkg/sandbox/fc/fph_gates.go[R10-18]

+// MinFreePageHintingKernelVersion is the minimum guest kernel version that
+// contains the FPH/MADV_DONTNEED race fix. Bump once the fixed kernel ships.
+const MinFreePageHintingKernelVersion = "999.0.0"
+
+func kernelSupportsFreePageHinting(kernelVersion string) bool {
+	v := strings.TrimPrefix(kernelVersion, "vmlinux-")
+	ok, _ := utils.IsGTEVersion(v, MinFreePageHintingKernelVersion)
+
+	return ok
Evidence
The kernel gate compares the guest kernel version against 999.0.0, which will fail for real kernel
versions (e.g. the repo default vmlinux-6.1.158), causing freePageHinting to be false when
configuring the balloon. Firecracker’s API reports 400 when hinting wasn’t enabled at device
configuration time; DrainBalloon treats that specific 400 as “not configured” and returns nil,
making the pre-pause drain ineffective.

packages/orchestrator/pkg/sandbox/fc/fph_gates.go[10-18]
packages/orchestrator/pkg/sandbox/fc/process.go[446-454]
packages/shared/pkg/featureflags/flags.go[244-247]
packages/shared/pkg/fc/client/operations/start_balloon_hinting_responses.go[110-114]
packages/orchestrator/pkg/sandbox/fc/process.go[734-740]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
Free-page-hinting is effectively impossible to enable because `MinFreePageHintingKernelVersion` is hardcoded to `999.0.0`, making `kernelSupportsFreePageHinting()` always return false for real kernel versions; this causes the balloon to be configured without hinting and makes `DrainBalloon()` a no-op.

### Issue Context
The pre-pause drain is guarded by a timeout feature flag, but the balloon hinting capability is separately gated by the kernel version check; with the current constant, the drain cannot ever perform useful work.

### Fix Focus Areas
- packages/orchestrator/pkg/sandbox/fc/fph_gates.go[10-18]
- packages/orchestrator/pkg/sandbox/fc/process.go[446-454]




Remediation recommended

2. FPH override no-op online 🐞
Description
resume-build’s -fph-timeout-ms calls featureflags.NewIntFlag(), which only updates the offline test
datasource, not a live LaunchDarkly environment. When LAUNCH_DARKLY_API_KEY is set,
NewClientWithLogLevel uses a real LaunchDarkly client and the override is ignored, so the CLI flag
does not do what its help text claims.
Code

packages/orchestrator/cmd/resume-build/main.go[R76-82]

+	fphTimeoutMs := flag.Int("fph-timeout-ms", 0, "override free-page-hinting-timeout-ms LD flag (0 = use LD default)")
+
	flag.Parse()

+	if *fphTimeoutMs > 0 {
+		featureflags.NewIntFlag("free-page-hinting-timeout-ms", *fphTimeoutMs)
+	}
Evidence
The CLI override is implemented by calling NewIntFlag(), which mutates the in-process ldtestdata
(offline) store. The featureflags client switches to a real LaunchDarkly client whenever
LAUNCH_DARKLY_API_KEY is set, so changes to the offline store won’t affect evaluation in that mode.

packages/orchestrator/cmd/resume-build/main.go[76-82]
packages/shared/pkg/featureflags/flags.go[147-152]
packages/shared/pkg/featureflags/client.go[19-23]
packages/shared/pkg/featureflags/client.go[71-86]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`-fph-timeout-ms` currently only affects the offline LaunchDarkly test datasource; when a real LaunchDarkly client is in use, the override is ignored.

### Issue Context
The flag help text says it “overrides free-page-hinting-timeout-ms LD flag”, so it should deterministically control the drain timeout in resume-build regardless of whether LaunchDarkly is configured.

### Fix Focus Areas
- packages/orchestrator/cmd/resume-build/main.go[76-82]
- packages/shared/pkg/featureflags/flags.go[147-152]
- packages/shared/pkg/featureflags/client.go[19-23]
- packages/shared/pkg/featureflags/client.go[71-86]





Comment thread packages/orchestrator/pkg/sandbox/fc/fph_gates.go Outdated
Comment thread packages/orchestrator/pkg/sandbox/fc/process.go
ValentaTomas added a commit that referenced this pull request May 8, 2026
Adds an opt-in pre-pause step that runs `sync`, `drop_caches`,
`compact_memory`, and `fstrim -av` on the live VM via envd's Process
service to shrink the memfile/rootfs diff. Each step is wrapped in
`timeout -s KILL` with its own cap, so a stuck step (most realistically
a slow `sync` on a large dirty backlog) cannot starve the rest — and a
killed step does not abort the chain (`;`-separated, not `&&`).

Pausing FC is unaffected by an in-flight guest `sync` we time out: FC
only drains in-flight virtio I/O before completing the pause; any
unflushed dirty pages stay in the memfile snapshot and converge on
resume. Per-step timeouts trade reclaim payoff, never correctness —
`drop_caches` is documented non-destructive, `fstrim` consults FS
allocation metadata not pagecache, and a partial `compact_memory` is
just less-compacted.

Disabled by default — the LD flag's null default leaves every step at 0
(skipped). Missing keys, zero, negative, and wrong-type values all
collapse to "skip". The orchestrator skips the envd call entirely when
the chain is empty. The outer `Connect-Timeout-Ms` is the sum of
per-step caps plus a small slack.

Single LD flag, one rule per cohort:

- `guest-pause-reclaim` (JSON) — per-step caps in milliseconds keyed by
step name, evaluated against sandbox / team / template LD contexts so
targeting is configured in LaunchDarkly.

Example value:

```json
{"sync":500,"drop_caches":200,"compact_memory":1000,"fstrim":500}
```

`resume-build` exposes `-reclaim` to inject the example values into the
offline LD store for local testing.

Pairs cleanly with #2553 (disable proactive compaction in the guest base
image), but is independent of it and of FPH (#2552). Split out from
#2550.
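The per-step timeout semantics above (a killed step does not abort the chain) can be demonstrated with a tiny sketch. The command and cap here are illustrative stand-ins, not the PR's actual chain: `sleep 10` plays the role of a stuck `sync`.

```shell
# 'timeout -s KILL' caps a stuck step; ';' instead of '&&' lets the
# next step run even after the previous one was killed. GNU timeout
# exits 137 (128+SIGKILL) when it has to kill the command.
timeout -s KILL 1 sleep 10 ; echo "chain continued, killed step exited $?"
```

Run standalone, this prints the echo line after roughly one second, showing that the chain survives a killed step.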
@ValentaTomas ValentaTomas removed the request for review from dobrac May 8, 2026 08:48
@ValentaTomas ValentaTomas requested review from bchalios and kalyazin May 8, 2026 08:49
Base automatically changed from feat/uffd-fc-free-page-reporting-integration to main May 8, 2026 23:42
@ValentaTomas ValentaTomas enabled auto-merge (squash) May 9, 2026 22:19
Comment thread packages/orchestrator/pkg/sandbox/fc/process.go Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 55d213b1bb

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread packages/orchestrator/pkg/sandbox/fc/process.go Outdated

@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.

Autofix Details

Bugbot Autofix prepared a fix for the issue found in the latest run.

  • ✅ Fixed: Drain poll can miss fast cycle when hostBefore equals freePageHintDone
    • Initialized sawBump to true when hostBefore equals freePageHintDone so fast-completing cycles are correctly detected as successful instead of timing out.


Or push these changes by commenting:

@cursor push ed7f7d6038
Preview (ed7f7d6038)
diff --git a/packages/orchestrator/pkg/sandbox/fc/process.go b/packages/orchestrator/pkg/sandbox/fc/process.go
--- a/packages/orchestrator/pkg/sandbox/fc/process.go
+++ b/packages/orchestrator/pkg/sandbox/fc/process.go
@@ -772,7 +772,12 @@
 	}
 
 	backoff := 5 * time.Millisecond
-	sawBump := false
+	// If hostBefore is already freePageHintDone, we're starting from a
+	// previously completed cycle. In this case, if the new cycle completes
+	// before the first poll, host will remain at freePageHintDone and we'd
+	// miss the bump. Initialize sawBump=true so any observation of
+	// host==freePageHintDone signals completion.
+	sawBump := hostBefore == freePageHintDone
 	for {
 		select {
 		case <-ctx.Done():


Comment thread packages/orchestrator/pkg/sandbox/fc/process.go
@ValentaTomas
Member Author

@cla-bot check

@ValentaTomas ValentaTomas disabled auto-merge May 11, 2026 22:42
Comment thread packages/orchestrator/pkg/sandbox/fc/process.go
Comment thread packages/orchestrator/pkg/sandbox/fc/process.go Outdated
@kalyazin kalyazin dismissed their stale review May 12, 2026 10:19

the issue in the drain logic has been addressed

@ValentaTomas ValentaTomas enabled auto-merge (squash) May 13, 2026 23:09
Comment thread iac/modules/job-otel-collector/configs/otel-collector.yaml

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c2bfac48d5


Comment thread packages/orchestrator/pkg/sandbox/fc/drain_balloon_test.go
@ValentaTomas ValentaTomas force-pushed the feat/sandbox-pause-fph branch from 2eedd3f to 86be69e Compare May 13, 2026 23:50

@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


Bugbot Autofix prepared a fix for the issue found in the latest run.

  • ✅ Fixed: Drain balloon metrics read after paused VM may fail
    • Moved FlushAndReadBalloonMetrics call before Pause to avoid timeout when reading metrics from paused VM.


Or push these changes by commenting:

@cursor push 25f80a5ef2
Preview (25f80a5ef2)
diff --git a/packages/orchestrator/cmd/resume-build/fph_bench.go b/packages/orchestrator/cmd/resume-build/fph_bench.go
--- a/packages/orchestrator/cmd/resume-build/fph_bench.go
+++ b/packages/orchestrator/cmd/resume-build/fph_bench.go
@@ -139,6 +139,8 @@
 	newMeta := origMeta
 	newMeta.Template.BuildID = buildID
 
+	balloon, _ := sbx.FlushAndReadBalloonMetrics(ctx)
+
 	pauseStart := time.Now()
 	snapshot, err := sbx.Pause(ctx, newMeta, sandbox.SnapshotUseCasePause)
 	pauseDur := time.Since(pauseStart)
@@ -147,8 +149,6 @@
 	}
 	defer snapshot.Close(context.WithoutCancel(ctx))
 
-	balloon, _ := sbx.FlushAndReadBalloonMetrics(ctx)
-
 	upload, err := sandbox.NewUpload(ctx, nil, snapshot, r.storage, storage.CompressConfig{}, nil, "", nil)
 	if err != nil {
 		return fphBenchSample{pause: pauseDur, err: fmt.Errorf("upload prepare: %w", err)}


Reviewed by Cursor Bugbot for commit fab97ff. Configure here.

Comment thread packages/orchestrator/cmd/resume-build/fph_bench.go
@ValentaTomas ValentaTomas requested a review from kalyazin May 14, 2026 00:20
AdaAibaby pushed a commit to AdaAibaby/infra that referenced this pull request May 14, 2026
@kalyazin
Contributor

kalyazin commented May 14, 2026

Just wanted to note that with the unconditional enabling of FPR/FPH in https://github.com/e2b-dev/infra/pull/2552/changes#diff-264fea9254b55eb530decfded9ded8f53bc7623f8c207656650746796b64f178, the tests would fail for FC < 1.14. We may accept that if we believe we won't need to stick with 1.12 in the future.

Or we could harden it a bit:

diff --git a/packages/orchestrator/cmd/create-build/main.go b/packages/orchestrator/cmd/create-build/main.go
index 8fe6a832b..fa600eba9 100644
--- a/packages/orchestrator/cmd/create-build/main.go
+++ b/packages/orchestrator/cmd/create-build/main.go
@@ -39,6 +39,7 @@ import (
 	"github.com/e2b-dev/infra/packages/orchestrator/pkg/template/build/metrics"
 	artifactsregistry "github.com/e2b-dev/infra/packages/shared/pkg/artifacts-registry"
 	"github.com/e2b-dev/infra/packages/shared/pkg/dockerhub"
+	"github.com/e2b-dev/infra/packages/shared/pkg/fcversion"
 	"github.com/e2b-dev/infra/packages/shared/pkg/featureflags"
 	templatemanager "github.com/e2b-dev/infra/packages/shared/pkg/grpc/template-manager"
 	"github.com/e2b-dev/infra/packages/shared/pkg/logger"
@@ -354,6 +355,13 @@ func doBuild(
 		})
 	}
 
+	// Mirror prod's gating (pkg/template/server/create_template.go:70): only
+	// enable FPR if the chosen Firecracker version actually supports it.
+	// Hardcoding true breaks builds on FC <1.14 with a balloon 400.
+	fcInfo, err := fcversion.New(fc)
+	if err != nil {
+		return fmt.Errorf("invalid firecracker version %q: %w", fc, err)
+	}
 	tmpl := config.TemplateConfig{
 		Version:            templates.TemplateV2LatestVersion,
 		TemplateID:         templateID,
@@ -366,7 +374,7 @@ func doBuild(
 		ReadyCmd:           readyCmd,
 		KernelVersion:      kernel,
 		FirecrackerVersion: fc,
-		FreePageReporting:  true,
+		FreePageReporting:  fcInfo.HasFreePageReporting(),
 		TeamID:             "local",
 		Steps:              steps,
 	}

@ValentaTomas
Member Author

> Just wanted to note that with unconditional enabling of FPR/FPH in #2552 (changes), the tests would be failing for FC < 1.14. We may accept it if we believe we won't need to stick with 1.12 in the future.
>
> Or we could harden it a bit: *(diff quoted in full in the comment above)*

We are gating the enabling here https://github.com/e2b-dev/infra/blob/feat/sandbox-pause-fph/packages/orchestrator/pkg/template/server/create_template.go#L70, and for now we will be using the feature-flag context (which should carry the kernel and FC versions) to check. At worst I think this should just fail and continue, as the balloon device is not there.
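For illustration, the version gate the diff relies on reduces to a semver floor check. `hasFreePageReporting` below is a hypothetical stand-in for the `fcversion` helper, assuming (per the comment above) that balloon free-page reporting needs FC >= 1.14; it is not the real `pkg/shared/fcversion` implementation.

```go
package main

import (
	"fmt"
	"strings"
)

// hasFreePageReporting is a minimal stand-in: parse "major.minor" and apply
// the assumed 1.14 floor. Unparsable versions fail closed (no balloon install).
func hasFreePageReporting(v string) bool {
	var major, minor int
	if _, err := fmt.Sscanf(strings.TrimPrefix(v, "v"), "%d.%d", &major, &minor); err != nil {
		return false
	}
	return major > 1 || (major == 1 && minor >= 14)
}

func main() {
	fmt.Println(hasFreePageReporting("v1.12.4"), hasFreePageReporting("1.14.0"))
}
```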

Drains virtio-balloon free-page-hinting before pause so snapshots don't
capture pages the guest already considers free. Balloon install is gated by
`free-page-hinting-install` (bool LD flag); kernel-side eligibility is
targeted via the LD context (kernel/FC version). On pause we call
`start_balloon_hinting(acknowledge_on_stop=true)` and poll
`describe_balloon_hinting` until `host_cmd == DONE`, gated by
`free-page-hinting-timeout-ms` (int LD flag, ms; 0 = disabled).

Hot path: post-pause we trigger an FC metrics flush but don't wait for
the reader, trading per-pause counter precision for pause latency.

Includes cmd/resume-build -fph-bench and scripts/bench-fph.sh for
offline FPR vs FPR+FPH comparison.
Replaces `free-page-hinting-install` (bool) and the prior `free-page-hinting-timeout-ms` (int) with a single `free-page-hinting-config` JSON flag keyed by `enabled`, `pause`, `build` (matching SnapshotUseCase). Lets operators install FPH on the balloon but disable the pre-pause drain per use case — e.g. keep it on for normal pause and off for template build, where it was observed to grow the memfile.
Mirror prod's gating from `pkg/template/server/create_template.go`: balloon installation fails on FC <1.14, so the local CLI must not hardcode `FreePageReporting: true`.
@ValentaTomas ValentaTomas force-pushed the feat/sandbox-pause-fph branch from 291933b to dcc9dcd Compare May 14, 2026 23:48
@ValentaTomas ValentaTomas merged commit 9481380 into main May 15, 2026
52 of 53 checks passed
@ValentaTomas ValentaTomas deleted the feat/sandbox-pause-fph branch May 15, 2026 00:11