
multi_stage: default VPN entrypoint wait timeout to 5m #5157

Closed
ibm-adarsh wants to merge 1 commit into openshift:main from ibm-adarsh:vpn-default-wait-timeout

@ibm-adarsh ibm-adarsh commented May 6, 2026

Problem
IBM Z CI jobs that run behind the cluster-profile VPN (for example libvirt / libvirt-s390x-vpn–style workflows) were failing early in the test step. The wrapped command logs showed the entrypoint-wrapper giving up while waiting for the VPN readiness file, e.g. a warning like timeout after waiting for file /tmp/vpn/up. The main container then started (or behaved as if the wait had ended) before the VPN client had finished bringing the tunnel up, which breaks anything that needs intranet/DNS reachability from the first second.

What this change does
Whenever VPN is enabled for a multi-stage step, we always pass --wait-for-file /tmp/vpn/up and a --wait-timeout. If vpn.yaml does not define wait_timeout, we default to 5 minutes so IBM Z (and similar) jobs have time for the VPN sidecar to create /tmp/vpn/up. An explicit wait_timeout in the cluster profile still overrides this default.
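
The default-and-override rule described here could look roughly like the following Go sketch. The `vpnConf` type and the `vpnWaitArgs` helper are hypothetical stand-ins, not the code in pkg/steps/multi_stage/gen.go; the sketch assumes `WaitTimeout` is an optional string that is nil when vpn.yaml omits wait_timeout, and that no flags are added when VPN is not enabled.

```go
package main

import "fmt"

// defaultVPNWaitTimeout mirrors the 5m0s default described in this PR.
const defaultVPNWaitTimeout = "5m0s"

// vpnConf is a hypothetical stand-in for the parsed vpn.yaml contents.
// WaitTimeout is nil when vpn.yaml does not define wait_timeout.
type vpnConf struct {
	WaitTimeout *string
}

// vpnWaitArgs is a hypothetical helper returning the wrapper flags.
// It returns nil when VPN is not enabled (conf == nil), which is an
// assumption of this sketch.
func vpnWaitArgs(conf *vpnConf) []string {
	if conf == nil {
		return nil
	}
	waitTimeout := defaultVPNWaitTimeout
	if conf.WaitTimeout != nil {
		waitTimeout = *conf.WaitTimeout // explicit value overrides the default
	}
	return []string{
		"--wait-for-file", "/tmp/vpn/up",
		"--wait-timeout", waitTimeout,
	}
}

func main() {
	explicit := "10m0s"
	fmt.Println(vpnWaitArgs(nil))                              // VPN not enabled
	fmt.Println(vpnWaitArgs(&vpnConf{}))                       // default applies
	fmt.Println(vpnWaitArgs(&vpnConf{WaitTimeout: &explicit})) // override applies
}
```

Using a pointer for the optional field keeps "not set" distinguishable from an empty string, so the default is applied only when the profile truly omits the value.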

https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-multiarch-main-nightly-4.22-ocp-fips-ovn-remote-libvirt-multi-z-z/2051658376382779392

Summary by CodeRabbit

  • Bug Fixes
    • Improved VPN timeout handling to ensure default and custom wait timeouts are properly applied to container operations when VPN configuration is provided.

When VPN is enabled, always pass --wait-for-file /tmp/vpn/up and
--wait-timeout to the entrypoint wrapper. If vpn.yaml omits wait_timeout,
use 5m0s instead of omitting the flags (which led to no meaningful wait).
Explicit wait_timeout in the cluster profile still overrides.
@openshift-merge-bot
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To manually trigger all second-stage jobs, use the /pipeline required command.

This repository is configured in: automatic mode

@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 6, 2026
openshift-ci Bot commented May 6, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: ibm-adarsh
Once this PR has been reviewed and has the lgtm label, please assign prucek for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

coderabbitai Bot commented May 6, 2026

Walkthrough

The PR introduces a default VPN wait timeout constant of 5m0s and modifies the secret wrapper logic to always apply wait-timeout configuration when VPN is enabled, using the default unless explicitly overridden.

Changes

VPN Wait Timeout Default

| Layer / File(s) | Summary |
| --- | --- |
| Constant Definition: pkg/steps/multi_stage/gen.go (lines 28–29) | New defaultVPNWaitTimeout constant set to "5m0s" with explanatory comment. |
| Core Logic: pkg/steps/multi_stage/gen.go (lines 310–317) | addSecretWrapper now computes waitTimeout using the default and overrides it only if vpnConf.WaitTimeout is explicitly set, then appends both --wait-for-file and --wait-timeout arguments unconditionally when VPN config is present. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

🚥 Pre-merge checks | ✅ 12 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Test Coverage For New Features | ❓ Inconclusive | No result was produced after verification. Marking as INCONCLUSIVE. | Re-run the check or adjust instructions to produce a final result. |

✅ Passed checks (12 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately reflects the main change: introducing a default 5-minute VPN entrypoint wait timeout for multi-stage steps. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Go Error Handling | ✅ Passed | Code follows Go error handling best practices: proper nil checks before pointer dereferencing (lines 310-314), no ignored errors, and no inappropriate panic calls. |
| Stable And Deterministic Test Names | ✅ Passed | Check not applicable. PR modifies production code (VPN timeout config) with no changes to any Ginkgo test names or test definitions. |
| Test Structure And Quality | ✅ Passed | PR modifies production code only (pkg/steps/multi_stage/gen.go). Repository uses standard Go tests, not Ginkgo. Check is not applicable to this PR. |
| Microshift Test Compatibility | ✅ Passed | PR modifies CI infrastructure code only. No Ginkgo e2e tests added. Custom check applies only to new e2e tests, not applicable here. |
| Single Node Openshift (Sno) Test Compatibility | ✅ Passed | No Ginkgo e2e tests added. PR modifies CI infrastructure code (pkg/steps/multi_stage/gen.go), not test code. SNO check only applies to new e2e tests. |
| Topology-Aware Scheduling Compatibility | ✅ Passed | This PR modifies CI tooling (pkg/steps/multi_stage/gen.go) to add default VPN timeout handling. It does not modify deployment manifests, operator code, or controllers. The check is not applicable. |
| Ote Binary Stdout Contract | ✅ Passed | The modified file is a library package that generates Kubernetes Pod configurations. It contains no process-level code or stdout writes. The OTE check does not apply here. |
| Ipv6 And Disconnected Network Test Compatibility | ✅ Passed | This PR modifies infrastructure code (pkg/steps/multi_stage/gen.go) for VPN timeout handling, not Ginkgo e2e tests. The custom check targets Ginkgo test additions and is not applicable here. |


@openshift-ci openshift-ci Bot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label May 6, 2026
openshift-ci Bot commented May 6, 2026

Hi @ibm-adarsh. Thanks for your PR.

I'm waiting for an openshift member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Tip

We noticed you've done this a few times! Consider joining the org to skip this step and gain /lgtm and other bot rights. We recommend asking approvers on your previous PRs to sponsor you.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.


@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (1)
pkg/steps/multi_stage/gen.go (1)

310-317: ⚡ Quick win

Add test coverage for the new default-timeout path.

The behavioral change — always passing --wait-timeout with a default of 5m0s when VPN is enabled — isn't covered by any test in this PR. The previous conditional path (WaitTimeout != nil) presumably had some coverage. Verify both the default and the override cases are exercised in addSecretWrapper tests.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/steps/multi_stage/gen.go` around lines 310 - 317, Add tests in the
addSecretWrapper test suite to cover the new default-timeout path when VPN is
enabled: create one test case where vpnConf is set but vpnConf.WaitTimeout ==
nil and assert that container.Args contains "--wait-timeout" with the value of
defaultVPNWaitTimeout (5m0s), and another test case where vpnConf.WaitTimeout is
non-nil to assert the override value is passed; locate the behavior around the
container.Args construction in gen.go (referencing vpnConf, WaitTimeout,
defaultVPNWaitTimeout, and addSecretWrapper) and update or add assertions to
verify both default and explicit timeout values are present.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 2fe1442d-21f3-4c76-a082-1678ad89d0c7

📥 Commits

Reviewing files that changed from the base of the PR and between 4468929 and bb53441.

📒 Files selected for processing (1)
  • pkg/steps/multi_stage/gen.go

@ibm-adarsh
Author

Closing for now. The issue was mainly caused by virtproxyd-tcp.socket, which is a systemd socket unit used by the libvirt virtualization stack to expose libvirt services over a plain TCP connection.

@ibm-adarsh ibm-adarsh closed this May 7, 2026