Skip to content

ENGINE-1383 winrm fixes for reboot and panic resolve.#324

Draft
james-nesbitt wants to merge 2 commits intok0sproject:release-0.xfrom
james-nesbitt:ENGINE-1383-winrm-fixes
Draft

ENGINE-1383 winrm fixes for reboot and panic resolve.#324
james-nesbitt wants to merge 2 commits intok0sproject:release-0.xfrom
james-nesbitt:ENGINE-1383-winrm-fixes

Conversation

@james-nesbitt
Copy link
Copy Markdown
Contributor

Hey K0s team. I have been battling some windows/winrm issues in Launchpad, and with some Claude help have discovered:

  1. windows 2019 (in particular) doesn't like the windows Reboot() function, which doesn't always work.
  2. the winrm close panic still needs a different approach to resolve a race condition.

In this PR are two fixes:

  1. The new reboot switches a flat "shutdown" Exec() to using "schtasks" to run a shutdown. The effort get's more complicated after CoPilot pointed out that you have to be careful with the schtask removal, as you may end up in a persistant reboot. I will take another stab at simplifying it (I think that there is a --once flag for schtasks.)

  2. Drops the panic recover() and focuses on race resolution in closing the command Stdout/Stderr in two different places.

WinRM sessions on AWS EC2 (and similar environments) run under a
filtered Administrator token that lacks SeShutdownPrivilege. Calling
'shutdown /r /t 5' directly in the WinRM session succeeds (exit 0) but
is silently ignored by the OS.

Fix: create a SYSTEM-context scheduled task (which always holds
SeShutdownPrivilege) that runs 'shutdown /r /f /t 5', trigger it
immediately, then delete it within the 5-second countdown window before
the reboot fires. This prevents the ONSTART task from re-executing on
subsequent boots.

/sc onstart is used instead of /sc once to avoid schtasks writing a
stderr warning when the scheduled time is in the past, which rig treats
as an error.
The stdout and stderr goroutines in Exec() were explicitly closing
command.Stdout and command.Stderr after draining them, while
'defer command.Close()' was also set to close them when the function
returned. This created a double-close race on the underlying channel,
causing a 'close of closed channel' panic.

The panics were being caught by recover() workarounds added in
commits 31b7511 and c46cde3, but the recovery left goroutines in a
degraded state, causing the process to eventually terminate silently
when running many concurrent WinRM connections (reproduced with 6
simultaneous WinRM sessions during waitForHost polling).

Fix: remove the explicit Close() calls from both goroutines. The
masterzen/winrm library closes stdout/stderr pipes when the command
completes, so the scanner loops terminate via natural EOF. The
'defer command.Close()' remains as the sole cleanup path.

Also remove the now-unnecessary panic recovery workaround from
Command.Wait().
@james-nesbitt
Copy link
Copy Markdown
Contributor Author

The double-close race in Exec()

The close of closed channel panic originates from two independent Close() calls racing against defer command.Close() in winrm.go's Exec()
function.

First close — inside the stdout goroutine (~line 341 pre-fix):

  if execOpts.Writer == nil {
      outputScanner := bufio.NewScanner(command.Stdout)
      for outputScanner.Scan() { ... }
      // ...
      command.Stdout.Close()   // ← first close of Stdout
  }

Second close — inside the stderr goroutine (~line 373 pre-fix):

  outputScanner := bufio.NewScanner(command.Stderr)
  for outputScanner.Scan() { ... }
  // ...
  command.Stderr.Close()   // ← first close of Stderr

Third close — deferred on the function itself (~line 308):

  command, err := shell.ExecuteWithContext(context.Background(), cmd)
  // ...
  defer command.Close()   // ← closes Stdout and Stderr again on function return

The race sequence:

  1. Stdout goroutine finishes scanning → calls command.Stdout.Close()
  2. Stderr goroutine finishes scanning → calls command.Stderr.Close()
  3. wg.Wait() returns (both goroutines done)
  4. command.Wait() is called
  5. Function returns → defer command.Close() fires, closing Stdout and Stderr a second time → close of closed channel panic

With a single WinRM connection the timing rarely overlapped. With 6 concurrent WinRM connections all finishing waitForHost polling
simultaneously (reproduced by running launchpad against a cluster with 2× each of Windows Server 2019, 2022, 2025), the race fired
consistently and caused the process to terminate silently after the recovered panics left goroutine state degraded.

The fix removes the explicit command.Stdout.Close() and command.Stderr.Close() calls from both goroutines. The masterzen/winrm library
closes the pipes when the WinRM protocol signals command completion, so the scanner loops reach EOF and exit cleanly without any explicit
close. defer command.Close() remains as the sole cleanup path. The recover() workarounds in both goroutines and in Command.Wait() are
then no longer needed and are removed.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses Windows/WinRM reliability issues in rig by changing how Windows reboots are triggered under WinRM and by adjusting WinRM command stream handling to avoid race-related panics during close.

Changes:

  • Reworked Windows Reboot() to create/run a SYSTEM scheduled task that triggers shutdown /r /f /t 5, then attempts immediate cleanup.
  • Removed panic-recovery logic and adjusted WinRM Exec() stream readers to avoid closing stdout/stderr in multiple places.
  • Simplified Command.Wait() error returns.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

File Description
winrm.go Removes panic-recovery and adjusts close/return flow for WinRM command execution.
os/windows.go Implements reboot via schtasks running as SYSTEM and adds documentation around the approach.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread os/windows.go
Comment on lines +194 to +201
const taskName = "RigReboot"
// Create a SYSTEM-context ONSTART task that runs 'shutdown /r /f /t 5'.
// The 5-second delay gives us time to delete the task before the OS
// actually executes the reboot, preventing it from firing again on the
// next startup.
create := fmt.Sprintf(`schtasks /create /tn "%s" /tr "shutdown /r /f /t 5" /sc onstart /f /ru SYSTEM`, taskName)
if err := h.Exec(create); err != nil {
return fmt.Errorf("failed to create reboot task: %w", err)
Copy link

Copilot AI Apr 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using a fixed scheduled task name ("RigReboot") with /f can overwrite an existing task with the same name and may conflict with concurrent reboot attempts. Consider generating a unique task name per invocation (e.g., include a timestamp/UUID) and/or checking for an existing task before overwriting/deleting.

Copilot uses AI. Check for mistakes.
Comment thread os/windows.go
Comment on lines +212 to +217
// Delete the task immediately while the 5-second shutdown timer is still
// counting down. This prevents it from re-firing on subsequent startups.
del := fmt.Sprintf(`schtasks /delete /tn "%s" /f`, taskName)
// Best-effort: ignore delete errors — if the task fires before we can
// delete it, the caller is expected to delete it after reconnecting.
_ = h.Exec(del)
Copy link

Copilot AI Apr 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Delete failures are fully ignored, but with /sc onstart a leftover task will reboot the machine on every subsequent startup (persistent reboot loop). This is a high-risk failure mode. Prefer making the task self-cleaning (e.g., include a delete/disable step in the task action) and/or use a schedule that cannot retrigger on startup (e.g., /sc once with stderr tolerated via exec.AllowWinStderr()), and only ignore delete errors that are clearly caused by the ongoing reboot/connection drop.

Copilot uses AI. Check for mistakes.
Comment thread os/windows.go
Comment on lines +218 to 220
// Allow Windows time to complete shutdown before waitForHost begins polling.
time.Sleep(15 * time.Second)
return nil
Copy link

Copilot AI Apr 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

time.Sleep(15 * time.Second) is a hard-coded delay that is decoupled from the configured shutdown timer (/t 5). Consider deriving this from the shutdown delay (plus a small buffer) or documenting why 15s is required, to avoid unnecessary slowdowns and make future changes safer.

Copilot uses AI. Check for mistakes.
Comment thread winrm.go
}
}()

// Wait blocks until the command finishes
Copy link

Copilot AI Apr 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The doc comment for Command.Wait is duplicated ("Wait blocks until the command finishes" appears twice). Please remove the extra line to keep godoc clean.

Suggested change
// Wait blocks until the command finishes

Copilot uses AI. Check for mistakes.
Comment thread os/windows.go
Comment on lines +180 to +192
// Reboot triggers an immediate forced restart by scheduling a SYSTEM-context
// one-shot task that runs 'shutdown /r /f /t 5', then immediately triggering
// and deleting it within the 5-second countdown window.
//
// Running via a scheduled task bypasses the filtered Administrator token used
// by WinRM sessions (e.g. AWS EC2) which lacks SeShutdownPrivilege. Issuing
// 'shutdown /r' directly in the WinRM session is silently ignored in that
// context.
//
// /sc onstart is used instead of /sc once to avoid schtasks writing a
// stderr warning about the start time being in the past, which rig treats
// as an error. The task is deleted immediately after triggering (while the
// 5-second timer counts down) so it does not re-fire on subsequent startups.
Copy link

Copilot AI Apr 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment calls the scheduled task "one-shot", but the implementation uses schtasks /sc onstart, which will run on every startup unless the task is successfully deleted. Either adjust the wording to reflect that it's an ONSTART task deleted as part of the reboot flow, or switch to a truly one-shot schedule (e.g., /sc once) if feasible.

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants