ENGINE-1383 winrm fixes for reboot and panic resolve.#324
ENGINE-1383 winrm fixes for reboot and panic resolve.#324james-nesbitt wants to merge 2 commits intok0sproject:release-0.xfrom
Conversation
WinRM sessions on AWS EC2 (and similar environments) run under a filtered Administrator token that lacks SeShutdownPrivilege. Calling 'shutdown /r /t 5' directly in the WinRM session succeeds (exit 0) but is silently ignored by the OS. Fix: create a SYSTEM-context scheduled task (which always holds SeShutdownPrivilege) that runs 'shutdown /r /f /t 5', trigger it immediately, then delete it within the 5-second countdown window before the reboot fires. This prevents the ONSTART task from re-executing on subsequent boots. /sc onstart is used instead of /sc once to avoid schtasks writing a stderr warning when the scheduled time is in the past, which rig treats as an error.
The stdout and stderr goroutines in Exec() were explicitly closing command.Stdout and command.Stderr after draining them, while 'defer command.Close()' was also set to close them when the function returned. This created a double-close race on the underlying channel, causing a 'close of closed channel' panic. The panics were being caught by recover() workarounds added in commits 31b7511 and c46cde3, but the recovery left goroutines in a degraded state, causing the process to eventually terminate silently when running many concurrent WinRM connections (reproduced with 6 simultaneous WinRM sessions during waitForHost polling). Fix: remove the explicit Close() calls from both goroutines. The masterzen/winrm library closes stdout/stderr pipes when the command completes, so the scanner loops terminate via natural EOF. The 'defer command.Close()' remains as the sole cleanup path. Also remove the now-unnecessary panic recovery workaround from Command.Wait().
|
The double-close race in Exec() The close of closed channel panic originates from two independent Close() calls racing against defer command.Close() in winrm.go's Exec() First close — inside the stdout goroutine (~line 341 pre-fix): if execOpts.Writer == nil {
outputScanner := bufio.NewScanner(command.Stdout)
for outputScanner.Scan() { ... }
// ...
command.Stdout.Close() // ← first close of Stdout
}Second close — inside the stderr goroutine (~line 373 pre-fix): outputScanner := bufio.NewScanner(command.Stderr)
for outputScanner.Scan() { ... }
// ...
command.Stderr.Close() // ← first close of StderrThird close — deferred on the function itself (~line 308): command, err := shell.ExecuteWithContext(context.Background(), cmd)
// ...
defer command.Close() // ← closes Stdout and Stderr again on function returnThe race sequence:
With a single WinRM connection the timing rarely overlapped. With 6 concurrent WinRM connections all finishing waitForHost polling The fix removes the explicit command.Stdout.Close() and command.Stderr.Close() calls from both goroutines. The masterzen/winrm library |
There was a problem hiding this comment.
Pull request overview
This PR addresses Windows/WinRM reliability issues in rig by changing how Windows reboots are triggered under WinRM and by adjusting WinRM command stream handling to avoid race-related panics during close.
Changes:
- Reworked Windows
Reboot()to create/run a SYSTEM scheduled task that triggersshutdown /r /f /t 5, then attempts immediate cleanup. - Removed panic-recovery logic and adjusted WinRM
Exec()stream readers to avoid closing stdout/stderr in multiple places. - Simplified
Command.Wait()error returns.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
winrm.go |
Removes panic-recovery and adjusts close/return flow for WinRM command execution. |
os/windows.go |
Implements reboot via schtasks running as SYSTEM and adds documentation around the approach. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| const taskName = "RigReboot" | ||
| // Create a SYSTEM-context ONSTART task that runs 'shutdown /r /f /t 5'. | ||
| // The 5-second delay gives us time to delete the task before the OS | ||
| // actually executes the reboot, preventing it from firing again on the | ||
| // next startup. | ||
| create := fmt.Sprintf(`schtasks /create /tn "%s" /tr "shutdown /r /f /t 5" /sc onstart /f /ru SYSTEM`, taskName) | ||
| if err := h.Exec(create); err != nil { | ||
| return fmt.Errorf("failed to create reboot task: %w", err) |
There was a problem hiding this comment.
Using a fixed scheduled task name ("RigReboot") with /f can overwrite an existing task with the same name and may conflict with concurrent reboot attempts. Consider generating a unique task name per invocation (e.g., include a timestamp/UUID) and/or checking for an existing task before overwriting/deleting.
| // Delete the task immediately while the 5-second shutdown timer is still | ||
| // counting down. This prevents it from re-firing on subsequent startups. | ||
| del := fmt.Sprintf(`schtasks /delete /tn "%s" /f`, taskName) | ||
| // Best-effort: ignore delete errors — if the task fires before we can | ||
| // delete it, the caller is expected to delete it after reconnecting. | ||
| _ = h.Exec(del) |
There was a problem hiding this comment.
Delete failures are fully ignored, but with /sc onstart a leftover task will reboot the machine on every subsequent startup (persistent reboot loop). This is a high-risk failure mode. Prefer making the task self-cleaning (e.g., include a delete/disable step in the task action) and/or use a schedule that cannot retrigger on startup (e.g., /sc once with stderr tolerated via exec.AllowWinStderr()), and only ignore delete errors that are clearly caused by the ongoing reboot/connection drop.
| // Allow Windows time to complete shutdown before waitForHost begins polling. | ||
| time.Sleep(15 * time.Second) | ||
| return nil |
There was a problem hiding this comment.
time.Sleep(15 * time.Second) is a hard-coded delay that is decoupled from the configured shutdown timer (/t 5). Consider deriving this from the shutdown delay (plus a small buffer) or documenting why 15s is required, to avoid unnecessary slowdowns and make future changes safer.
| } | ||
| }() | ||
|
|
||
| // Wait blocks until the command finishes |
There was a problem hiding this comment.
The doc comment for Command.Wait is duplicated ("Wait blocks until the command finishes" appears twice). Please remove the extra line to keep godoc clean.
| // Wait blocks until the command finishes |
| // Reboot triggers an immediate forced restart by scheduling a SYSTEM-context | ||
| // one-shot task that runs 'shutdown /r /f /t 5', then immediately triggering | ||
| // and deleting it within the 5-second countdown window. | ||
| // | ||
| // Running via a scheduled task bypasses the filtered Administrator token used | ||
| // by WinRM sessions (e.g. AWS EC2) which lacks SeShutdownPrivilege. Issuing | ||
| // 'shutdown /r' directly in the WinRM session is silently ignored in that | ||
| // context. | ||
| // | ||
| // /sc onstart is used instead of /sc once to avoid schtasks writing a | ||
| // stderr warning about the start time being in the past, which rig treats | ||
| // as an error. The task is deleted immediately after triggering (while the | ||
| // 5-second timer counts down) so it does not re-fire on subsequent startups. |
There was a problem hiding this comment.
This comment calls the scheduled task "one-shot", but the implementation uses schtasks /sc onstart, which will run on every startup unless the task is successfully deleted. Either adjust the wording to reflect that it's an ONSTART task deleted as part of the reboot flow, or switch to a truly one-shot schedule (e.g., /sc once) if feasible.
Hey K0s team. I have been battling some windows/winrm issues in Launchpad, and with some Claude help have discovered:
In this PR are two fixes:
The new reboot switches a flat "shutdown" Exec() to using "schtasks" to run a shutdown. The effort get's more complicated after CoPilot pointed out that you have to be careful with the schtask removal, as you may end up in a persistant reboot. I will take another stab at simplifying it (I think that there is a
--onceflag for schtasks.)Drops the panic recover() and focuses on race resolution in closing the command Stdout/Stderr in two different places.