Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -991,15 +991,43 @@ func TestContainerPersistenceAfterColdReboot(t *testing.T) {

t.Run("VerifyPersistence", func(t *testing.T) {
t.Log("Waiting for DUT to reboot and reconnect...")

// Wait for reboot.
time.Sleep(8 * time.Minute)
maxRebootTime := 8 * time.Minute
ticker := time.NewTicker(30 * time.Second)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

A 30-second ticker interval might be too long for detecting a reboot, especially on virtual devices or fast-rebooting hardware. If the device reboots and comes back up within the 30-second window, the loop might miss the 'down' state and eventually timeout. Consider reducing the interval to 5 or 10 seconds for better reliability.

Suggested change
ticker := time.NewTicker(30 * time.Second)
ticker := time.NewTicker(10 * time.Second)

defer ticker.Stop()
timeout := time.After(maxRebootTime)
var deviceWentDown bool

rebootLoop:
for {
select {
case <-timeout:
t.Fatalf("Timeout exceeded: DUT did not reboot within %v seconds.", maxRebootTime)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The error message 'DUT did not reboot' is slightly misleading if the device actually went down but failed to become reachable again. Additionally, using %v seconds with a time.Duration results in a redundant unit (e.g., '8m0s seconds'). Using t.Fatalf is preferred here as the failure makes subsequent test steps meaningless.

Suggested change
t.Fatalf("Timeout exceeded: DUT did not reboot within %v seconds.", maxRebootTime)
t.Fatalf("Timeout exceeded: DUT did not become reachable after reboot within %v.", maxRebootTime)
References
  1. In tests, t.Fatalf is preferred over t.Errorf when a failure makes subsequent test steps meaningless, as this fails fast and reduces overall test execution time.

case <-ticker.C:
// use GNOI to refresh the stale cached connection post reboot.
sysClient := dut.RawAPIs().GNOI(t).System()
_, err := sysClient.Time(ctx, &gspb.TimeRequest{})
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The Time RPC call uses the test context ctx, which has a long timeout (8 minutes). If the device is in a state where it accepts connections but hangs on RPCs, this call could block the polling loop for a long time. It's safer to use a shorter timeout for each individual reachability check.

tctx, cancel := context.WithTimeout(ctx, 10*time.Second)
_, err := sysClient.Time(tctx, &gspb.TimeRequest{})
cancel()

if err != nil {
if !deviceWentDown {
t.Logf("Device is now unreachable. Waiting for it to come back up.")
deviceWentDown = true
}
} else {
if deviceWentDown {
t.Logf("Device rebooted successfully.")
break rebootLoop
}
t.Logf("Device is still reachable; reboot hasn't started yet.")
}
}
}

// Poll for container state.
cli = containerztest.Client(t, dut)

// Use a generous timeout for the device to come back up and the container to start.
timeout := 5 * time.Minute
if err := containerztest.WaitForRunning(ctx, t, cli, instanceName, timeout); err != nil {
if err := containerztest.WaitForRunning(ctx, t, cli, instanceName, 5*time.Minute); err != nil {
t.Errorf("Container persistence failed: %v", err)
}

Expand Down
34 changes: 34 additions & 0 deletions topologies/binding/binding.go
Original file line number Diff line number Diff line change
Expand Up @@ -175,6 +175,40 @@ func (d *staticDUT) reset(ctx context.Context) error {
return resetGRIBI(ctx, d)
}

func (d *staticDUT) PushConfig(ctx context.Context, config string, reset bool) error {
if reset {
if err := resetGNMI(ctx, d); err != nil {
return err
}
}
if config == "" {
return nil
}

setRequest := &gpb.SetRequest{Update: []*gpb.Update{
{
Path: &gpb.Path{
Origin: "cli",
},
Val: &gpb.TypedValue{
Value: &gpb.TypedValue_AsciiVal{
AsciiVal: config,
},
},
},
}}

gnmiClient, err := d.DialGNMI(ctx)
if err != nil {
return err
}
if _, err := gnmiClient.Set(ctx, setRequest); err != nil {
return err
}
Comment on lines +201 to +207
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The call to d.DialGNMI(ctx) creates a new gRPC connection that is never closed, leading to a resource leak. Since PushConfig is an internal method of staticDUT, you should use dialConn directly to obtain the connection and ensure it is closed after the operation.

Suggested change
gnmiClient, err := d.DialGNMI(ctx)
if err != nil {
return err
}
if _, err := gnmiClient.Set(ctx, setRequest); err != nil {
return err
}
conn, err := dialConn(ctx, d, introspect.GNMI, nil)
if err != nil {
return err
}
defer conn.Close()
if _, err := gpb.NewGNMIClient(conn).Set(ctx, setRequest); err != nil {
return err
}


return nil
}

func (d *staticDUT) DialGNMI(ctx context.Context, opts ...grpc.DialOption) (gpb.GNMIClient, error) {
conn, err := dialConn(ctx, d, introspect.GNMI, opts)
if err != nil {
Expand Down
Loading