feat(fc): drain virtio-balloon free-page-hinting before pause#2552
Conversation
PR SummaryHigh Risk Overview
The pause path now conditionally runs balloon hinting drain + extra metrics flushes; incorrect LaunchDarkly targeting/version gating or Firecracker API behavior differences (notably the 204 “success” workaround) could still cause unexpected pause latency or no-op behavior across environments. Reviewed by Cursor Bugbot for commit dcc9dcd. Bugbot is set up for automated code reviews on this repo. Configure here. |
e8bd708 to
bf00edc
Compare
f4e3ab0 to
7619cc9
Compare
920e8ec to
7f22709
Compare
❌ 14 Tests Failed:
View the top 1 failed test(s) by shortest run time
View the full list of 13 ❄️ flaky test(s)
To view more test analytics, go to the Test Analytics Dashboard |
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Autofix Details
Bugbot Autofix prepared a fix for the issue found in the latest run.
- ✅ Fixed: FPR conflicts with hugepages
- Added
!hugePagescondition to FPR auto-enable logic, matching the server build path's conflict prevention.
- Added
Or push these changes by commenting:
@cursor push 7c518d0d3e
Preview (7c518d0d3e)
diff --git a/packages/orchestrator/cmd/create-build/main.go b/packages/orchestrator/cmd/create-build/main.go
--- a/packages/orchestrator/cmd/create-build/main.go
+++ b/packages/orchestrator/cmd/create-build/main.go
@@ -358,7 +358,8 @@
})
}
- // Default FPR on for FC v1.14+; explicit --free-page-reporting overrides.
+ // Default FPR on for FC v1.14+ unless hugepages is enabled.
+ // Firecracker rejects balloon (free-page-reporting) together with hugepages.
var fprEnabled bool
if freePageReporting != nil {
fprEnabled = *freePageReporting
@@ -366,7 +367,7 @@
versionOnly, _, _ := strings.Cut(fcVersion, "_")
supported, err := utils.IsGTEVersion(versionOnly, "v1.14.0")
if err == nil {
- fprEnabled = supported
+ fprEnabled = !hugePages && supported
}
}You can send follow-ups to the cloud agent here.
|
Waiting for the merge of #2541, but otherwise should be ready. |
|
Before enabling in prod we need to deploy the kernel fix though. |
Code Review by Qodo
1.
|
Adds an opt-in pre-pause step that runs `sync`, `drop_caches`,
`compact_memory`, and `fstrim -av` on the live VM via envd's Process
service to shrink the memfile/rootfs diff. Each step is wrapped in
`timeout -s KILL` with its own cap, so a stuck step (most realistically
a slow `sync` on a large dirty backlog) cannot starve the rest — and a
killed step does not abort the chain (`;`-separated, not `&&`).
Pausing FC is unaffected by an in-flight guest `sync` we time out: FC
only drains in-flight virtio I/O before completing the pause; any
unflushed dirty pages stay in the memfile snapshot and converge on
resume. Per-step timeouts trade reclaim payoff, never correctness —
`drop_caches` is documented non-destructive, `fstrim` consults FS
allocation metadata not pagecache, and a partial `compact_memory` is
just less-compacted.
Disabled by default — the LD flag's null default leaves every step at 0
(skipped). Missing keys, zero, negative, and wrong-type values all
collapse to "skip". The orchestrator skips the envd call entirely when
the chain is empty. The outer `Connect-Timeout-Ms` is the sum of
per-step caps plus a small slack.
Single LD flag, one rule per cohort:
- `guest-pause-reclaim` (JSON) — per-step caps in milliseconds keyed by
step name, evaluated against sandbox / team / template LD contexts so
targeting is configured in LaunchDarkly.
Example value:
```json
{"sync":500,"drop_caches":200,"compact_memory":1000,"fstrim":500}
```
`resume-build` exposes `-reclaim` to inject the example values into the
offline LD store for local testing.
Pairs cleanly with #2553 (disable proactive compaction in the guest base
image), but is independent of it and of FPH (#2552). Split out from
#2550.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 55d213b1bb
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Autofix Details
Bugbot Autofix prepared a fix for the issue found in the latest run.
- ✅ Fixed: Drain poll can miss fast cycle when hostBefore equals freePageHintDone
- Initialized sawBump to true when hostBefore equals freePageHintDone so fast-completing cycles are correctly detected as successful instead of timing out.
Or push these changes by commenting:
@cursor push ed7f7d6038
Preview (ed7f7d6038)
diff --git a/packages/orchestrator/pkg/sandbox/fc/process.go b/packages/orchestrator/pkg/sandbox/fc/process.go
--- a/packages/orchestrator/pkg/sandbox/fc/process.go
+++ b/packages/orchestrator/pkg/sandbox/fc/process.go
@@ -772,7 +772,12 @@
}
backoff := 5 * time.Millisecond
- sawBump := false
+ // If hostBefore is already freePageHintDone, we're starting from a
+ // previously completed cycle. In this case, if the new cycle completes
+ // before the first poll, host will remain at freePageHintDone and we'd
+ // miss the bump. Initialize sawBump=true so any observation of
+ // host==freePageHintDone signals completion.
+ sawBump := hostBefore == freePageHintDone
for {
select {
case <-ctx.Done():You can send follow-ups to the cloud agent here.
|
@cla-bot check |
the issue in the drain logic has been addressed
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c2bfac48d5
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
2eedd3f to
86be69e
Compare
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Bugbot Autofix prepared a fix for the issue found in the latest run.
- ✅ Fixed: Drain balloon metrics read after paused VM may fail
- Moved FlushAndReadBalloonMetrics call before Pause to avoid timeout when reading metrics from paused VM.
Or push these changes by commenting:
@cursor push 25f80a5ef2
Preview (25f80a5ef2)
diff --git a/packages/orchestrator/cmd/resume-build/fph_bench.go b/packages/orchestrator/cmd/resume-build/fph_bench.go
--- a/packages/orchestrator/cmd/resume-build/fph_bench.go
+++ b/packages/orchestrator/cmd/resume-build/fph_bench.go
@@ -139,6 +139,8 @@
newMeta := origMeta
newMeta.Template.BuildID = buildID
+ balloon, _ := sbx.FlushAndReadBalloonMetrics(ctx)
+
pauseStart := time.Now()
snapshot, err := sbx.Pause(ctx, newMeta, sandbox.SnapshotUseCasePause)
pauseDur := time.Since(pauseStart)
@@ -147,8 +149,6 @@
}
defer snapshot.Close(context.WithoutCancel(ctx))
- balloon, _ := sbx.FlushAndReadBalloonMetrics(ctx)
-
upload, err := sandbox.NewUpload(ctx, nil, snapshot, r.storage, storage.CompressConfig{}, nil, "", nil)
if err != nil {
return fphBenchSample{pause: pauseDur, err: fmt.Errorf("upload prepare: %w", err)}You can send follow-ups to the cloud agent here.
Reviewed by Cursor Bugbot for commit fab97ff. Configure here.
Adds an opt-in pre-pause step that runs `sync`, `drop_caches`,
`compact_memory`, and `fstrim -av` on the live VM via envd's Process
service to shrink the memfile/rootfs diff. Each step is wrapped in
`timeout -s KILL` with its own cap, so a stuck step (most realistically
a slow `sync` on a large dirty backlog) cannot starve the rest — and a
killed step does not abort the chain (`;`-separated, not `&&`).
Pausing FC is unaffected by an in-flight guest `sync` we time out: FC
only drains in-flight virtio I/O before completing the pause; any
unflushed dirty pages stay in the memfile snapshot and converge on
resume. Per-step timeouts trade reclaim payoff, never correctness —
`drop_caches` is documented non-destructive, `fstrim` consults FS
allocation metadata not pagecache, and a partial `compact_memory` is
just less-compacted.
Disabled by default — the LD flag's null default leaves every step at 0
(skipped). Missing keys, zero, negative, and wrong-type values all
collapse to "skip". The orchestrator skips the envd call entirely when
the chain is empty. The outer `Connect-Timeout-Ms` is the sum of
per-step caps plus a small slack.
Single LD flag, one rule per cohort:
- `guest-pause-reclaim` (JSON) — per-step caps in milliseconds keyed by
step name, evaluated against sandbox / team / template LD contexts so
targeting is configured in LaunchDarkly.
Example value:
```json
{"sync":500,"drop_caches":200,"compact_memory":1000,"fstrim":500}
```
`resume-build` exposes `-reclaim` to inject the example values into the
offline LD store for local testing.
Pairs cleanly with e2b-dev#2553 (disable proactive compaction in the guest base
image), but is independent of it and of FPH (e2b-dev#2552). Split out from
e2b-dev#2550.
|
Just wanted to note that with unconditional enabling of FPR/FPH in https://github.com/e2b-dev/infra/pull/2552/changes#diff-264fea9254b55eb530decfded9ded8f53bc7623f8c207656650746796b64f178 , the tests would be failing for FC < 1.14. We may accept it if we believe we won't need to stick with 1.12 in the future. Or we could harden it a bit: diff --git a/packages/orchestrator/cmd/create-build/main.go b/packages/orchestrator/cmd/create-build/main.go
index 8fe6a832b..fa600eba9 100644
--- a/packages/orchestrator/cmd/create-build/main.go
+++ b/packages/orchestrator/cmd/create-build/main.go
@@ -39,6 +39,7 @@ import (
"github.com/e2b-dev/infra/packages/orchestrator/pkg/template/build/metrics"
artifactsregistry "github.com/e2b-dev/infra/packages/shared/pkg/artifacts-registry"
"github.com/e2b-dev/infra/packages/shared/pkg/dockerhub"
+ "github.com/e2b-dev/infra/packages/shared/pkg/fcversion"
"github.com/e2b-dev/infra/packages/shared/pkg/featureflags"
templatemanager "github.com/e2b-dev/infra/packages/shared/pkg/grpc/template-manager"
"github.com/e2b-dev/infra/packages/shared/pkg/logger"
@@ -354,6 +355,13 @@ func doBuild(
})
}
+ // Mirror prod's gating (pkg/template/server/create_template.go:70): only
+ // enable FPR if the chosen Firecracker version actually supports it.
+ // Hardcoding true breaks builds on FC <1.14 with a balloon 400.
+ fcInfo, err := fcversion.New(fc)
+ if err != nil {
+ return fmt.Errorf("invalid firecracker version %q: %w", fc, err)
+ }
tmpl := config.TemplateConfig{
Version: templates.TemplateV2LatestVersion,
TemplateID: templateID,
@@ -366,7 +374,7 @@ func doBuild(
ReadyCmd: readyCmd,
KernelVersion: kernel,
FirecrackerVersion: fc,
- FreePageReporting: true,
+ FreePageReporting: fcInfo.HasFreePageReporting(),
TeamID: "local",
Steps: steps,
} |
We are gating the enabling here https://github.com/e2b-dev/infra/blob/feat/sandbox-pause-fph/packages/orchestrator/pkg/template/server/create_template.go#L70 and for now we will be using the feature flags context (which should carry the kernel and fc version) to check. At worst I think this should just fail and continue as the baloon device is not there. |
Drains virtio-balloon free-page-hinting before pause so snapshots don't capture pages the guest already considers free. Balloon install gated by free-page-hinting-install (bool LD flag); kernel-side eligibility targeted via the LD context (kernel/FC version). On pause we call start_balloon_hinting(acknowledge_on_stop=true) and poll describe_balloon_hinting until host_cmd == DONE, gated by free-page-hinting-timeout-ms (int LD flag, ms; 0 = disabled). Hot path: post-pause we trigger an FC metrics flush but don't wait for the reader, trading per-pause counter precision for pause latency. Includes cmd/resume-build -fph-bench and scripts/bench-fph.sh for offline FPR vs FPR+FPH comparison.
Replaces `free-page-hinting-install` (bool) and the prior `free-page-hinting-timeout-ms` (int) with a single `free-page-hinting-config` JSON flag keyed by `enabled`, `pause`, `build` (matching SnapshotUseCase). Lets operators install FPH on the balloon but disable the pre-pause drain per use case — e.g. keep it on for normal pause and off for template build, where it was observed to grow the memfile.
Mirror prod's gating from `pkg/template/server/create_template.go`: balloon installation fails on FC <1.14, so the local CLI must not hardcode `FreePageReporting: true`.
291933b to
dcc9dcd
Compare


Drains virtio-balloon free-page-hinting before pause so snapshots don't capture pages the guest already considers free.
Balloon install is gated by
free-page-hinting-install(bool LD flag); kernel-side eligibility is targeted via the LD context (kernel/FC version). On pause we callstart_balloon_hinting(acknowledge_on_stop=true)and polldescribe_balloon_hintinguntilhost_cmd == DONE, gated byfree-page-hinting-timeout-ms(int LD flag, ms; 0 = disabled). Reclaimed pages emitUFFD_EVENT_REMOVE, already tracked by the parent FPR work.Hot path is kept minimal: post-drain and post-pause we trigger an FC metrics flush but don't wait for the reader, trading per-pause counter precision for pause latency. System-level FPH activity is observable via the periodic 5 s metrics flush.
Includes
cmd/resume-build -fph-benchandscripts/bench-fph.shfor offline FPR vs FPR+FPH comparison.Operator must wait for the kernel FPH race fix to roll out before enabling
free-page-hinting-timeout-msin prod.