docs: implementation plan for hypershift CI WIF migration #39

Open
cristianoveiga wants to merge 6 commits into openshift-online:main from cristianoveiga:plan/hypershift-ci-wif-migration

Contributor

@cristianoveiga cristianoveiga commented Apr 7, 2026

Summary

  • Implementation plan to migrate hypershift-ci SA authentication from a static JSON key to Workload Identity Federation (WIF)
  • Card 1: Create WIF infrastructure in gcp-hcp-infra (pool, OIDC providers, IAM bindings)
  • Card 2: Migrate CI scripts in openshift/release and revoke the static key
  • Jobs are pinned to AWS build clusters via the arm64 capability (all of these clusters have public OIDC issuers)
  • Federation is scoped to specific test SA names via a CEL attribute condition
  • References the design decision from PR #38 (docs(studies): add ROSA-to-GCP WIF study)

Test plan

  • Review implementation plan for completeness and accuracy
  • Validate OIDC issuer URIs for all arm64-capable build clusters
  • Confirm CEL attribute condition covers all current e2e test SA names

🤖 Generated with Claude Code

Covers the migration of hypershift-ci service account authentication
from a static JSON key to Workload Identity Federation using OpenShift
CI build cluster OIDC issuers.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress label (indicates that a PR should not merge because it is a work in progress) Apr 7, 2026
openshift-ci Bot commented Apr 7, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cristianoveiga

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the approved label (indicates a PR has been approved by an approver from all required OWNERS files) Apr 7, 2026
coderabbitai Bot commented Apr 7, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Summary by CodeRabbit

  • Documentation
    • Added a detailed migration plan documenting replacement of static service-account keys with Workload Identity Federation (WIF) for CI authentication.
    • Updated CI authentication docs and decision notes to mark static key usage as superseded and link to the new migration plan.

Walkthrough

Adds a new implementation plan to migrate hypershift-ci CI auth from static JSON keys to Workload Identity Federation (WIF) and updates the e2e-tests implementation doc to mark static credentials as superseded and link the new plan.

Changes

hypershift CI WIF migration

| Layer / File(s) | Summary |
| --- | --- |
| Design / Overview: implementation-plans/hypershift-ci-wif-migration.md | Adds migration overview, security scoping (CEL attribute_condition limiting namespaces ci-op-* and SA names e2e-gke, e2e-v2-gke), runtime flow, acceptance criteria, and concerns/mitigations. |
| Terraform / Infra Plan: implementation-plans/hypershift-ci-wif-migration.md | Card 1: enable sts.googleapis.com; grant Atlantis roles/iam.workloadIdentityPoolAdmin; add variables wif_pool_id, wif_providers, wif_attribute_condition; create WIF pool openshift-ci; per-cluster OIDC providers (wif_providers map) with allowed_audiences, attribute mappings, CEL restriction; add roles/iam.workloadIdentityUser bindings scoped by principalSet; add outputs for CI provider lookup; remove static-key generation references. |
| CI Step Changes / Runbook: implementation-plans/hypershift-ci-wif-migration.md, ci-operator/step-registry/hypershift/gcp/* | Card 2: update CI step scripts to detect/map projected-token issuers, use projected K8s SA token and STS exchange (no static keys), authenticate via gcloud auth login --cred-file flow, fail fast on unrecognized issuers, ensure tooling (e.g., jq), run make update, then revoke/remove user-managed hypershift-ci key and monitor e2e-gke. |
| Cross-Reference / Decision Update: implementation-plans/hypershift-repo-gcp-hcp-e2e-tests-implementation.md | Marks “Service Account with static credentials” as superseded by WIF; updates Story 11 and the Decision table to reference the migration plan and note static key rotation is superseded. |
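
For readers unfamiliar with the `gcloud auth login --cred-file` flow mentioned in Card 2, the following is a minimal sketch of the `external_account` credential configuration that such a file usually contains. It is illustrative only: the project number and service account email are hypothetical placeholders, the helper name `build_cred_config` is made up for this example, and the provider ID `build01` assumes providers are keyed by build cluster name as in the `wif_providers` map.

```python
import json


def build_cred_config(project_number, pool_id, provider_id, sa_email, token_path):
    """Build an external_account credential config as a plain dict.

    Assumption: the CI step points credential_source at the pod's
    projected service account token file.
    """
    audience = (
        f"//iam.googleapis.com/projects/{project_number}/locations/global/"
        f"workloadIdentityPools/{pool_id}/providers/{provider_id}"
    )
    return {
        "type": "external_account",
        "audience": audience,
        "subject_token_type": "urn:ietf:params:oauth:token-type:jwt",
        "token_url": "https://sts.googleapis.com/v1/token",
        "service_account_impersonation_url": (
            "https://iamcredentials.googleapis.com/v1/projects/-/"
            f"serviceAccounts/{sa_email}:generateAccessToken"
        ),
        "credential_source": {"file": token_path},
    }


config = build_cred_config(
    project_number="123456789012",  # hypothetical project number
    pool_id="openshift-ci",  # pool ID named in the plan
    provider_id="build01",  # assumes one provider per build cluster
    sa_email="hypershift-ci@example-project.iam.gserviceaccount.com",  # hypothetical
    token_path="/var/run/secrets/kubernetes.io/serviceaccount/token",
)
print(json.dumps(config, indent=2))
```

Writing this dict to a file and passing the path to `gcloud auth login --cred-file` is one way the step scripts could authenticate without a static key; the plan itself remains the reference for the exact flow.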

Sequence Diagram(s)

sequenceDiagram
participant CI as CI Job (build cluster)
participant K8s as Kubernetes API
participant STS as GCP STS (sts.googleapis.com)
participant IAM as GCP IAM / hypershift-ci SA
participant Vault as Vault / Secret Store

CI->>K8s: Read projected SA token (aud = issuer URL)
K8s-->>CI: Projected token (JWT)
CI->>STS: Exchange JWT for GCP token (STS) with provider attribute_condition
STS-->>CI: Short-lived GCP access token
CI->>IAM: Impersonate `hypershift-ci` using access token (gcloud)
IAM-->>CI: Impersonated credentials for GCP APIs
CI->>Vault: Delete static service account key (post-validation)
Vault-->>CI: Confirm removal
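
The "Exchange JWT for GCP token" step in the diagram corresponds to an OAuth 2.0 token exchange against the STS endpoint. The sketch below only builds the form body for that request and does not send it; the provider resource name and the token value are hypothetical placeholders, and `build_sts_exchange_body` is a name invented for this example.

```python
from urllib.parse import parse_qs, urlencode

STS_TOKEN_URL = "https://sts.googleapis.com/v1/token"


def build_sts_exchange_body(provider_resource, subject_token):
    """Return the URL-encoded form body for a token exchange request."""
    return urlencode(
        {
            "grant_type": "urn:ietf:params:oauth:grant-type:token-exchange",
            "audience": provider_resource,
            "scope": "https://www.googleapis.com/auth/cloud-platform",
            "requested_token_type": "urn:ietf:params:oauth:token-type:access_token",
            "subject_token_type": "urn:ietf:params:oauth:token-type:jwt",
            "subject_token": subject_token,
        }
    )


# Hypothetical values for illustration only.
provider = (
    "//iam.googleapis.com/projects/123456789012/locations/global/"
    "workloadIdentityPools/openshift-ci/providers/build01"
)
body = build_sts_exchange_body(provider, "header.payload.signature")
fields = parse_qs(body)
print(sorted(fields))
```

In practice gcloud performs this exchange itself when given a credential configuration file, so a CI step would rarely need to build the request by hand.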

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly summarizes the main change: adding an implementation plan for migrating hypershift CI authentication to Workload Identity Federation. |
| Description check | ✅ Passed | The description provides relevant context about the migration plan, including the two main work cards, federation scoping strategy, and a test plan aligned with the changeset. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@cristianoveiga
Contributor Author

/cc: @patjlm

@patjlm
Contributor

patjlm commented Apr 8, 2026

LGTM

- Restructure plan as two Jira cards (WIF infrastructure + CI script migration)
- Add security scoping with namespace + SA name attribute condition
- Add pool/provider architecture decision (single pool, multiple providers)
- Resolve build cluster OIDC availability via arm64 capability pinning
- Mark Story 11 and CI Authentication decision as superseded in e2e tests plan

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@cristianoveiga cristianoveiga marked this pull request as ready for review April 8, 2026 14:22
@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress label (indicates that a PR should not merge because it is a work in progress) Apr 8, 2026
@openshift-ci openshift-ci Bot requested review from ckandag and jimdaga April 8, 2026 14:22
@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@implementation-plans/hypershift-ci-wif-migration.md`:
- Line 95: The CEL condition currently hardcodes service account names in the
default expression
("assertion['kubernetes.io']['namespace'].startsWith('ci-op-') &&
assertion['kubernetes.io']['serviceaccount']['name'] in ['e2e-gke',
'e2e-v2-gke']"), which rejects periodic/current job SAs like e2e-gke-periodic;
update the CEL to broaden the allowlist by matching SA name patterns instead of
exact names (e.g., use a startsWith/regex match against
assertion['kubernetes.io']['serviceaccount']['name'] to accept variants such as
e2e-gke*, e2e-v2-gke*, or include the specific periodic SA names referenced
elsewhere) so periodic jobs will pass WIF auth.
- Line 105: The principalSet URI in the example for
google_service_account_iam_member.wif_workload_identity_user uses the wrong pool
ID format; update the principalSet://... portion to reference the actual pool ID
"openshift-ci" (with a hyphen) instead of "openshift_ci" (underscore) so the IAM
binding points to the defined pool (pool ID openshift-ci) and will succeed.
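
To make the first finding concrete, here is a small Python model of the two matching strategies. It is not CEL and is not evaluated by GCP; it only mirrors the logic of the attribute condition against a claims dict. The function names and the sample namespace are invented for illustration, and `e2e-gke-periodic` is the SA name cited in the finding.

```python
ALLOWED_NAMES = ("e2e-gke", "e2e-v2-gke")


def exact_match(claims, allowed=ALLOWED_NAMES):
    """Mirror of the current condition: namespace prefix plus exact SA name."""
    k8s = claims["kubernetes.io"]
    return (
        k8s["namespace"].startswith("ci-op-")
        and k8s["serviceaccount"]["name"] in allowed
    )


def prefix_match(claims, prefixes=ALLOWED_NAMES):
    """Broader variant: namespace prefix plus SA name prefix."""
    k8s = claims["kubernetes.io"]
    return (
        k8s["namespace"].startswith("ci-op-")
        and k8s["serviceaccount"]["name"].startswith(prefixes)
    )


periodic = {
    "kubernetes.io": {
        "namespace": "ci-op-abc123",  # hypothetical namespace
        "serviceaccount": {"name": "e2e-gke-periodic"},
    }
}
print(exact_match(periodic), prefix_match(periodic))
```

The prefix variant also accepts any other SA whose name starts with `e2e-gke`, so listing the periodic SA names explicitly keeps the allowlist tighter if the full set of names is known.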
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 44dd09ba-4306-4e6e-b1ea-2fd0a3e6c9cb

📥 Commits

Reviewing files that changed from the base of the PR and between e0f9d5d and 1d1d18e.

📒 Files selected for processing (2)
  • implementation-plans/hypershift-ci-wif-migration.md
  • implementation-plans/hypershift-repo-gcp-hcp-e2e-tests-implementation.md

Comment thread implementation-plans/hypershift-ci-wif-migration.md
Comment thread implementation-plans/hypershift-ci-wif-migration.md Outdated
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Comment thread implementation-plans/hypershift-ci-wif-migration.md Outdated
Comment on lines +81 to +92
```hcl
variable "wif_providers" {
  description = "Map of WIF OIDC providers, keyed by provider ID"
  type = map(object({
    issuer_uri = string
  }))
  # Populate with all build clusters that have the arm64 capability
  default = {
    build01 = { issuer_uri = "https://build01-oidc.s3.us-east-1.amazonaws.com" }
    # ... add all arm64-capable clusters
  }
}
```
Contributor

I'd rather read the issuer URL from a YAML file, e.g. terraform/metadata/prow-build-clusters.yaml. For example:

clusters:
- name: build01
  issuer_uri: https://build01-oidc.s3.us-east-1.amazonaws.com
  api_url: https://api.build01.ci.devcluster.openshift.com:6443
- ...
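
As a rough illustration of this suggestion, the sketch below converts cluster metadata of that shape into the provider map used by `wif_providers`. It assumes the YAML has already been parsed into Python dicts (parsing is left out to avoid a third-party dependency), and the helper name `to_wif_providers` and its validation rules are assumptions, not part of the plan.

```python
def to_wif_providers(clusters):
    """Map cluster metadata entries to {provider_id: {"issuer_uri": ...}}."""
    providers = {}
    for cluster in clusters:
        name = cluster["name"]
        issuer = cluster["issuer_uri"]
        if not issuer.startswith("https://"):
            raise ValueError(f"{name}: issuer_uri must be an https URL")
        if name in providers:
            raise ValueError(f"duplicate cluster name: {name}")
        providers[name] = {"issuer_uri": issuer}
    return providers


# Mirrors the single entry shown in the comment above.
clusters = [
    {
        "name": "build01",
        "issuer_uri": "https://build01-oidc.s3.us-east-1.amazonaws.com",
        "api_url": "https://api.build01.ci.devcluster.openshift.com:6443",
    }
]
providers = to_wif_providers(clusters)
print(providers)
```

In Terraform the same mapping would more likely be done with `yamldecode` and a `for` expression; the Python version is only meant to show the shape of the data.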

Comment thread implementation-plans/hypershift-ci-wif-migration.md Outdated
Comment thread implementation-plans/hypershift-ci-wif-migration.md
Instead of principalSet://.../openshift-ci/* (all pool identities),
use principalSet://.../attribute.service_account/<sa-name> to restrict
impersonation to specific CI test identities (e2e-gke, e2e-v2-gke).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
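
A minimal sketch of the narrower bindings this commit describes, assuming the provider maps the Kubernetes SA name to `attribute.service_account`. The project number is a hypothetical placeholder and `principal_set` is a helper name made up for this example.

```python
def principal_set(project_number, pool_id, sa_name):
    """Build a principalSet member restricted to one mapped SA name."""
    return (
        f"principalSet://iam.googleapis.com/projects/{project_number}"
        f"/locations/global/workloadIdentityPools/{pool_id}"
        f"/attribute.service_account/{sa_name}"
    )


members = [
    principal_set("123456789012", "openshift-ci", name)  # hypothetical project number
    for name in ("e2e-gke", "e2e-v2-gke")
]
for member in members:
    print(member)
```

Each resulting string would be the `member` of one `roles/iam.workloadIdentityUser` binding, so only tokens whose mapped SA name matches can impersonate hypershift-ci.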
@patjlm
Contributor

patjlm commented Apr 9, 2026

A question about the custom audience approach in the implementation plan.

The plan mentions using kubectl create token --audience=... to request a token with a custom WIF audience. This requires create permission on serviceaccounts/token — which I'm not sure the test SA has.

Looking at ci-operator's RBAC setup (ci-tools/pkg/steps/multi_stage/init.go), the role granted to the test SA allows:

  • create on rolebindings, roles
  • get/update on specific named secrets
  • get on imagestreams/layers
  • view ClusterRole (read-only)

I don't see serviceaccounts/token in there, but this should be verified on an actual build cluster. If the SA can't call the TokenRequest API, the options are:

  1. Use the default projected token (mounted at /var/run/secrets/kubernetes.io/serviceaccount/token) — but it has the OIDC issuer URL as audience, not a custom one. The WIF provider's allowed_audiences would need to match the issuer URL (which is what we validated on app.ci). This works but drops the custom audience security layer.

  2. Add a projected volume with custom audience to the pod spec — but ci-operator doesn't support this today. There's no volumes or serviceAccountTokenProjection field in LiteralTestStep, and the pod generation code (ci-tools/pkg/steps/multi_stage/gen.go) only handles Secrets, ConfigMaps, EmptyDirs, and CSI. This would require a feature contribution to ci-tools.

  3. Request RBAC for serviceaccounts/token — upstream change to ci-operator's role setup. But this also means any step in the test could request tokens with arbitrary audiences, which widens the attack surface.

If option 1 is acceptable (default audience, relying on CEL condition + SA name matching as the security boundary), it's the simplest path. The CI folder isolation is the ultimate backstop either way. But worth validating the RBAC assumption before deciding.
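
One way to check which audience a mounted token actually carries is to decode its payload without verifying the signature. The sketch below does that with the standard library only; the token it decodes is a fabricated, unsigned example built in the same script, not a real cluster token, and the subject and issuer values are assumptions for illustration.

```python
import base64
import json


def jwt_claims(token):
    """Decode the payload segment of a JWT without verifying it."""
    payload = token.split(".")[1]
    payload += "=" * (-len(payload) % 4)  # restore stripped base64 padding
    return json.loads(base64.urlsafe_b64decode(payload))


def _segment(obj):
    """Encode a dict as an unpadded base64url JWT segment (test helper)."""
    raw = json.dumps(obj).encode()
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()


issuer = "https://build01-oidc.s3.us-east-1.amazonaws.com"
fake_token = ".".join(
    [
        _segment({"alg": "none"}),
        _segment({"iss": issuer, "aud": [issuer], "sub": "system:serviceaccount:ci-op-abc123:e2e-gke"}),
        "",  # no signature: this token is for illustration only
    ]
)
claims = jwt_claims(fake_token)
print(claims["iss"], claims["aud"])
```

On a build cluster, the same function could be pointed at the contents of /var/run/secrets/kubernetes.io/serviceaccount/token to confirm whether `aud` matches the issuer URL.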

@patjlm
Contributor

patjlm commented Apr 9, 2026

Update: Verified on build01 that CI step pods cannot call kubectl create token --audience=... (no RBAC on serviceaccounts/token). The auto-mounted token at /var/run/secrets/kubernetes.io/serviceaccount/token uses the OIDC issuer URL as audience — not a custom one.

However, Kubernetes supports projected volumes with custom audiences directly in the pod spec (spec.volumes[].projected.sources[].serviceAccountToken.audience). The kubelet handles the token request — no extra RBAC needed. ci-operator just doesn't expose this field today.

Opened openshift/ci-tools#5093 to add a service_account_tokens field to multi-stage test steps:

service_account_tokens:
- audience: gcp-hcp-ci-wif
  mount_path: /var/run/secrets/wif
  expiration_seconds: 3600

If that lands, the custom audience approach in this implementation plan would work as designed.
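
For context, this is roughly the pod spec fragment that such a field would need to produce, expressed as Python dicts. The Kubernetes field names (`projected`, `serviceAccountToken`, `audience`, `expirationSeconds`, `path`) are the standard ones; the volume name, the token file name, and the mapping from `service_account_tokens` entries are assumptions, since the ci-tools change is not described here in detail.

```python
def projected_token_volume(name, audience, expiration_seconds, path="token"):
    """Build a projected volume that asks the kubelet for an audience-bound token."""
    return {
        "name": name,
        "projected": {
            "sources": [
                {
                    "serviceAccountToken": {
                        "audience": audience,
                        "expirationSeconds": expiration_seconds,
                        "path": path,
                    }
                }
            ]
        },
    }


def volume_mount(name, mount_path):
    """Build the matching read-only container volume mount."""
    return {"name": name, "mountPath": mount_path, "readOnly": True}


# Values taken from the proposed service_account_tokens entry above;
# the volume name "wif-token" is hypothetical.
volume = projected_token_volume("wif-token", "gcp-hcp-ci-wif", 3600)
mount = volume_mount("wif-token", "/var/run/secrets/wif")
print(volume, mount)
```

Because the kubelet requests the token on the pod's behalf, the test SA needs no `create` permission on serviceaccounts/token for this to work.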

@cristianoveiga
Contributor Author

/hold

@openshift-ci openshift-ci Bot added the do-not-merge/hold label (indicates that a PR should not merge because someone has issued a /hold command) Apr 16, 2026
Custom audience requires ci-tools#5093 which has no DPTP review timeline.
The core security model (CEL condition + principalSet scoping + CI folder
isolation) is sufficient without it. Use default audience (issuer URL)
for initial implementation; custom audience can be added later as a
defense-in-depth enhancement without infrastructure changes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@implementation-plans/hypershift-ci-wif-migration.md`:
- Around line 102-103: The plan incorrectly claims `allowed_audiences` defaults
to the issuer URL; update the text around `allowed_audiences` to state the
actual default is the provider's canonical resource name
(https://iam.googleapis.com/.../workloadIdentityPools/.../providers/...) and
either instruct to explicitly set `allowed_audiences` to the issuer URL or call
out that the STS token `aud` must match the provider canonical resource
name—adjust the bullet about `allowed_audiences` and the guidance so it
accurately describes the default and the required configuration to avoid token
exchange failures.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: 629f310c-ee4a-4d1e-87a5-ec491b202d94

📥 Commits

Reviewing files that changed from the base of the PR and between 75b814a and 306fe75.

📒 Files selected for processing (1)
  • implementation-plans/hypershift-ci-wif-migration.md

Comment thread implementation-plans/hypershift-ci-wif-migration.md Outdated
GCP's default allowed_audiences is the provider's canonical resource
name, not the issuer URL. The provider must explicitly set
allowed_audiences to the issuer_uri so the pod's default projected
SA token is accepted.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
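
The effect of this fix can be modeled in a few lines. The sketch below is a simplified stand-in for the provider's audience check, not GCP's implementation: with no `allowed_audiences`, only the provider's canonical resource name (with or without the `https:` prefix) is accepted; with an explicit list, only those values are. The project number and function names are hypothetical.

```python
def canonical_provider_audience(project_number, pool_id, provider_id):
    """Canonical resource name used as the default audience."""
    return (
        f"https://iam.googleapis.com/projects/{project_number}/locations/global/"
        f"workloadIdentityPools/{pool_id}/providers/{provider_id}"
    )


def audience_accepted(token_aud, allowed_audiences, canonical):
    """Simplified model of the provider-side audience check."""
    if isinstance(token_aud, str):
        token_aud = [token_aud]
    if allowed_audiences:
        accepted = set(allowed_audiences)
    else:
        accepted = {canonical, canonical[len("https:"):]}
    return any(aud in accepted for aud in token_aud)


issuer = "https://build01-oidc.s3.us-east-1.amazonaws.com"
canonical = canonical_provider_audience("123456789012", "openshift-ci", "build01")

# The default projected token carries the issuer URL as its audience.
print(audience_accepted([issuer], [], canonical))
print(audience_accepted([issuer], [issuer], canonical))
```

The first call models the original plan, where the default token would be rejected; the second models the corrected provider with `allowed_audiences` set to the issuer URI.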
@coderabbitai coderabbitai Bot left a comment

♻️ Duplicate comments (1)
implementation-plans/hypershift-ci-wif-migration.md (1)

102-102: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Tighten wording for clarity at Line 102.

The second sentence is a fragment (“Can be tightened…”). Make it a complete sentence to avoid ambiguity in implementation docs.

✍️ Suggested doc tweak
-- `allowed_audiences` explicitly set to the provider's `issuer_uri` so that the pod's default projected SA token (whose `aud` = issuer URL) is accepted. GCP's default audience is the provider's canonical resource name, not the issuer URL, so this must be set explicitly. Can be tightened to a custom audience string when [ci-tools#5093](https://github.com/openshift/ci-tools/pull/5093) lands
+- `allowed_audiences` explicitly set to the provider's `issuer_uri` so that the pod's default projected SA token (whose `aud` = issuer URL) is accepted. GCP's default audience is the provider's canonical resource name, not the issuer URL, so this must be set explicitly. This can be tightened to a custom audience string when [ci-tools#5093](https://github.com/openshift/ci-tools/pull/5093) lands.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@implementation-plans/hypershift-ci-wif-migration.md` at line 102, The second
sentence in the bullet about `allowed_audiences` is a sentence fragment; change
it to a complete sentence by prefixing with a subject and verb (for example:
"This can be tightened to a custom audience string when ci-tools#5093 lands."),
so the bullet reads clearly: explain why `allowed_audiences` must be set and
then state the tightening option as a full sentence referencing the PR number
`ci-tools#5093`.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Duplicate comments:
In `@implementation-plans/hypershift-ci-wif-migration.md`:
- Line 102: The second sentence in the bullet about `allowed_audiences` is a
sentence fragment; change it to a complete sentence by prefixing with a subject
and verb (for example: "This can be tightened to a custom audience string when
ci-tools#5093 lands."), so the bullet reads clearly: explain why
`allowed_audiences` must be set and then state the tightening option as a full
sentence referencing the PR number `ci-tools#5093`.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: 7baaf01b-d94d-4063-a72e-98acaa4eb7fd

📥 Commits

Reviewing files that changed from the base of the PR and between 306fe75 and 0dc2b97.

📒 Files selected for processing (1)
  • implementation-plans/hypershift-ci-wif-migration.md

Labels

approved: Indicates a PR has been approved by an approver from all required OWNERS files.
do-not-merge/hold: Indicates that a PR should not merge because someone has issued a /hold command.

2 participants