diff --git a/implementation-plans/hypershift-ci-wif-migration.md b/implementation-plans/hypershift-ci-wif-migration.md new file mode 100644 index 0000000..2577c5c --- /dev/null +++ b/implementation-plans/hypershift-ci-wif-migration.md @@ -0,0 +1,195 @@ +# Hypershift CI Workload Identity Federation Migration + +## Overview + +Migrate the `hypershift-ci` service account authentication from a static JSON key to Workload Identity Federation (WIF), using the OpenShift CI build cluster OIDC issuers. CI job pods will exchange their projected Kubernetes SA token for a short-lived GCP federated access token via STS, then impersonate the existing `hypershift-ci` service account. + +This eliminates long-lived credentials per [Architectural Invariant #3](../design-decisions/workload-identity-implementation.md) and follows the same WIF pattern used for hosted cluster operators in [gcp-wif-integration](./gcp-wif-integration.md), adapted for external (non-GKE) OIDC issuers. + +**Design decision**: [prow-ci-workload-identity-federation](../design-decisions/prow-ci-workload-identity-federation.md) + +**Study**: [rosa-to-gcp-wif](../studies/rosa-to-gcp-wif.md) — validated the WIF token exchange flow on the `app.ci` ROSA cluster. This implementation targets the **build farm clusters** where hypershift GCP e2e jobs run. + +**Security scoping:** + +Federation is restricted to specific CI test jobs via a CEL attribute condition on the WIF provider that matches on both the ci-operator namespace prefix and the K8s ServiceAccount name: + +```cel +assertion['kubernetes.io']['namespace'].startsWith('ci-op-') && +assertion['kubernetes.io']['serviceaccount']['name'] in ['e2e-gke', 'e2e-v2-gke'] +``` + +ci-operator creates one K8s ServiceAccount per multi-stage test, named after the test's `as` field. All steps (pre, test, post) within that test share this SA. The SA names are stable — defined in `ci-operator/config/openshift/hypershift/openshift-hypershift-main.yaml`. If new test names are added, the attribute condition must be updated. See `openshift/ci-tools` `pkg/steps/multi_stage/init.go` (SA creation) and `pkg/steps/multi_stage/gen.go` (SA assignment to step pods). + +Additional mitigation layers: +- **CI folder isolation** (hard boundary): The SA's permissions are scoped exclusively to the CI folder. No path to production, integration, or staging resources regardless of authentication method. +- **Custom audience** (optional hardening, deferred): When [ci-tools#5093](https://github.com/openshift/ci-tools/pull/5093) lands, the WIF provider can be configured to require a specific audience, rejecting the default projected token. This adds defense-in-depth against SA name collisions from other repos but is not required for the core security model — the CEL condition, principalSet scoping, and folder isolation already provide strong boundaries. Until then, the default audience (issuer URL) is used. + +**Pool and provider architecture:** + +A single WIF pool (`openshift-ci`) with one OIDC provider per build cluster. IAM bindings use `principalSet://.../attribute.service_account/` to scope impersonation to specific CI test identities (e.g., `e2e-gke`, `e2e-v2-gke`), rather than granting access to all identities in the pool. The attribute condition on the provider adds namespace-level filtering (`ci-op-*`). Adding or removing a build cluster is a single Terraform change (add/remove a provider) with no IAM modifications. If trust needs to be revoked for a specific cluster, its provider can be deleted or disabled within the pool without affecting others. The Terraform module uses `for_each` on a `wif_providers` variable map of `{ provider_id => issuer_uri }`. + +**Authentication flow:** + +```mermaid +sequenceDiagram + participant Pod as CI Job Pod (build cluster) + participant K8s as Build Cluster API Server + participant STS as GCP Security Token Service + participant IAM as GCP IAM Credentials + participant GCP as GCP APIs + + Pod->>K8s: Use default projected SA token + K8s-->>Pod: SA token (JWT, audience = issuer URL) + Pod->>STS: Exchange JWT for STS token + Note over STS: Validates JWT against WIF pool/provider
Checks attribute_condition (namespace + SA name) + STS-->>Pod: Federated access token + Pod->>IAM: Impersonate hypershift-ci SA + IAM-->>Pod: Short-lived access token + Pod->>GCP: Create/delete projects, manage DNS +``` + +--- + +## Card 1: Create WIF Infrastructure + +### User Story + +**As a** GCP HCP platform engineer, +**I want** a WIF pool and OIDC providers configured in the `gcp-hcp-hypershift-ci` project, +**so that** CI build clusters can federate identities to the `hypershift-ci` service account without static keys. + +### Context / Background + +The `hypershift-ci` service account authenticates CI jobs to GCP for creating/deleting ephemeral test projects. Currently it uses a static JSON key stored in Vault. This card provisions the GCP-side WIF infrastructure that will allow CI pods to authenticate via token exchange instead. + +### Technical Approach + +Single PR in `gcp-hcp-infra` containing all Terraform changes. The first `atlantis apply` may fail due to IAM propagation delay (up to 60 seconds) between the new `workloadIdentityPoolAdmin` role and the WIF pool creation — a re-run of `atlantis apply` will succeed. + +1. **Enable STS API** (modify: `terraform/modules/hypershift-ci/variables.tf`): + - Add `sts.googleapis.com` to the `activated_apis` default list + +2. **Grant Atlantis WIF pool admin** (modify: `terraform/modules/hypershift-ci/atlantis.tf`): + - Add `google_project_iam_member.global_atlantis_wif_pool_admin` with `roles/iam.workloadIdentityPoolAdmin` + - Add it to `null_resource.atlantis_iam_ready` depends_on list + +3. **Add WIF variables** (modify: `terraform/modules/hypershift-ci/variables.tf`): + - `wif_pool_id` (default: `"openshift-ci"`) - Pool ID for OpenShift CI federation + - `wif_providers` - Map of provider configurations: + ```hcl + variable "wif_providers" { + description = "Map of WIF OIDC providers, keyed by provider ID" + type = map(object({ + issuer_uri = string + })) + # Populate with all build clusters that have the arm64 capability + default = { + build01 = { issuer_uri = "https://build01-oidc.s3.us-east-1.amazonaws.com" } + # ... add all arm64-capable clusters + } + } + ``` + - `wif_attribute_condition` - CEL condition restricting federation to specific CI test service accounts: + ```hcl + default = "assertion['kubernetes.io']['namespace'].startsWith('ci-op-') && assertion['kubernetes.io']['serviceaccount']['name'] in ['e2e-gke', 'e2e-v2-gke']" + ``` + - Add input validation for all variables + +4. **Create WIF resources** (new file: `terraform/modules/hypershift-ci/workload-identity-federation.tf`): + - `google_iam_workload_identity_pool.openshift_ci` - Single WIF pool for all build clusters + - `google_iam_workload_identity_pool_provider.build_clusters` - One OIDC provider per entry in `wif_providers` (using `for_each`), each with: + - `allowed_audiences` explicitly set to the provider's `issuer_uri` so that the pod's default projected SA token (whose `aud` = issuer URL) is accepted. GCP's default audience is the provider's canonical resource name, not the issuer URL, so this must be set explicitly. Can be tightened to a custom audience string when [ci-tools#5093](https://github.com/openshift/ci-tools/pull/5093) lands + - Attribute mapping: `google.subject` = `assertion.sub`, `attribute.namespace` = `assertion['kubernetes.io']['namespace']`, `attribute.service_account` = `assertion['kubernetes.io']['serviceaccount']['name']` + - Attribute condition from variable (restricts to `ci-op-*` namespaces AND specific test SA names) + - `google_service_account_iam_member.wif_workload_identity_user` - One IAM binding per test SA name (using `for_each`), scoped via `principalSet://.../attribute.service_account/` to restrict impersonation to specific CI test identities only + - Pool and providers depend on `null_resource.atlantis_iam_ready` + +5. **Add WIF outputs** (modify: `terraform/modules/hypershift-ci/outputs.tf` and `terraform/config/hypershift-ci/main.tf`): + - `wif_pool_name` - Full resource name of the WIF pool + - `wif_provider_names` - Map of provider ID to full resource name (for CI script lookup) + - `project_number` - GCP project number + +6. **Clean up static key references**: + - Remove SA key generation instructions from `terraform/modules/hypershift-ci/service-account.tf` header comment + - Remove SA key generation instructions from `terraform/config/hypershift-ci/main.tf` header comment + - Replace with WIF authentication documentation pointing to `workload-identity-federation.tf` + +### Acceptance Criteria + +- [ ] `sts.googleapis.com` is in the activated APIs list +- [ ] Atlantis SA has `roles/iam.workloadIdentityPoolAdmin` on the `gcp-hcp-hypershift-ci` project +- [ ] WIF pool `openshift-ci` exists in `gcp-hcp-hypershift-ci` project with one OIDC provider per WIF-compatible build cluster +- [ ] Attribute condition restricts federation to `ci-op-*` namespaces AND specific test SA names (`e2e-gke`, `e2e-v2-gke`) +- [ ] `hypershift-ci` SA has `roles/iam.workloadIdentityUser` binding for the WIF pool principal set +- [ ] `terraform output -json wif_provider_names` returns provider names for all configured clusters +- [ ] No static key generation comments remain in Terraform files +- [ ] `make terraform-check` passes + +--- + +## Card 2: Migrate CI Scripts and Revoke Static Key + +### User Story + +**As a** GCP HCP platform engineer, +**I want** CI step scripts to authenticate via WIF token exchange instead of static JSON keys, +**so that** the static key can be revoked and removed from Vault. + +### Context / Background + +Card 1 provisions the WIF infrastructure in GCP. This card updates the CI step scripts in `openshift/release` to use that infrastructure, validates the migration, and revokes the static key. + +### Technical Approach + +#### Update CI step scripts to use WIF (`openshift/release`) + +**Prerequisites**: Card 1 complete (WIF pool and providers exist and are applied). + +1. **Update GCP authentication in CI steps** (modify: `ci-operator/step-registry/hypershift/gcp/` scripts): + - Detect the build cluster's OIDC issuer from the pod's default projected SA token and map it to the corresponding WIF provider + - Write a GCP credential configuration file pointing to the token and authenticate with `gcloud auth login --cred-file` + - Remove references to cluster profile JSON key file + - Ensure `jq` is available in the CI step image + - Fail fast with a clear error if the OIDC issuer is not recognized + +2. **Ensure jobs only land on WIF-compatible clusters**: + - Jobs are pinned to AWS build clusters via the existing `arm64` capability, which is only available on clusters with public OIDC issuers + +3. **Run `make update`** to regenerate Prow job configs + +#### Post-validation: Revoke static key + +**Prerequisites**: CI script PR merged, at least one full CI run passing with WIF. + +1. Revoke the user-managed key: `gcloud iam service-accounts keys delete KEY_ID --iam-account=hypershift-ci@gcp-hcp-hypershift-ci.iam.gserviceaccount.com` +2. Remove the key from Vault (selfservice.vault.ci.openshift.org) +3. Monitor `e2e-gke` jobs for 24-48 hours after key revocation + +### Dependencies + +- **Card 1**: WIF pool and OIDC providers must be applied before this card starts. + +### Acceptance Criteria + +- [ ] CI step scripts authenticate via default projected SA token and `gcloud auth login --cred-file` (no static key) +- [ ] `e2e-gke` rehearsal job passes end-to-end with WIF authentication +- [ ] Jobs fail fast with a clear error if they land on a cluster without a public OIDC issuer +- [ ] No user-managed keys exist on the `hypershift-ci` service account +- [ ] Key is removed from Vault +- [ ] (Optional, deferred) When [ci-tools#5093](https://github.com/openshift/ci-tools/pull/5093) lands: switch to custom audience and validate that tokens without the correct audience are rejected + +--- + +## Concerns and Mitigations + +| Concern | Mitigation | +|---------|-----------| +| **GCP build clusters lack public OIDC** | Jobs are pinned to AWS build clusters via the `arm64` capability. | +| **Federation scope**: Attribute condition must be updated when new test names are added | Condition matches on both namespace (`ci-op-*`) and SA name (`e2e-gke`, `e2e-v2-gke`). Adding a new test requires updating `wif_attribute_condition`. | +| **OIDC issuer stability**: If a build cluster is rebuilt, the issuer URL or signing keys change | One WIF provider per cluster. Broken provider only affects that cluster's jobs. Variable-driven config allows quick updates. | +| **Token lifetime**: The static key never expires; WIF tokens do. CI steps can run for 2-3 hours. | Validate token expiration behavior during implementation. The GCP SDK may re-read the token file and re-exchange automatically, or the script may need to refresh the token periodically. | +| **Rollback path**: If WIF breaks CI | Keep the static key in Vault until post-validation. Only revoke after confirmed stability. | +| **New build clusters added**: Future clusters may or may not have public OIDC | Add new WIF providers to Terraform as clusters are added. CI scripts fail fast on unrecognized issuers. | +| **Custom audience deferred**: Default audience (issuer URL) is less restrictive than a custom audience | Low risk — requires SA name collision (`e2e-gke`) from another repo landing on the same build cluster. CEL + principalSet + folder isolation remain enforced. Custom audience can be added later via [ci-tools#5093](https://github.com/openshift/ci-tools/pull/5093) without infrastructure changes. | + diff --git a/implementation-plans/hypershift-repo-gcp-hcp-e2e-tests-implementation.md b/implementation-plans/hypershift-repo-gcp-hcp-e2e-tests-implementation.md index 9dc5166..58d9d16 100644 --- a/implementation-plans/hypershift-repo-gcp-hcp-e2e-tests-implementation.md +++ b/implementation-plans/hypershift-repo-gcp-hcp-e2e-tests-implementation.md @@ -137,7 +137,7 @@ Before implementing the CI infrastructure, we need team decisions on several arc - [x] Required GCP IAM roles and permissions → See Appendix G for complete IAM roles table. - [x] Quota requirements for running E2E tests → Standard quotas sufficient. GCP-HCP team will set appropriate limits during project setup. - [x] GCP Project Creation → Dynamic projects per-test. CI infrastructure (folder, project, SA) via Terraform in `gcp-hcp-infra`. See Story 2 and Appendix G. -- [x] CI Authentication Method → **Service Account with static credentials**. Matches existing OpenShift CI patterns for AWS/Azure. +- [x] CI Authentication Method → **Service Account with static credentials**. Matches existing OpenShift CI patterns for AWS/Azure. **Superseded**: being migrated to WIF — see [hypershift-ci-wif-migration](./hypershift-ci-wif-migration.md). - [x] Cluster Profile → **Dedicated `hypershift-gcp` profile** with own quota slice. GCP tests won't use resources from AWS-based hypershift-ci cluster. **Reference**: @@ -1169,6 +1169,8 @@ Configure budget alerts and monitoring to control CI spending and detect cost an ### Story 11: Investigate CI Service Account Key Rotation Strategy +> **Superseded** by the WIF migration — see [hypershift-ci-wif-migration](./hypershift-ci-wif-migration.md). Static key rotation is no longer needed as the key will be eliminated entirely. + **Summary**: Investigate options for rotating the CI service account key used for GCP E2E tests **Description**: @@ -1520,7 +1522,7 @@ This mirrors real-world deployments where: | CI Rollout Strategy | Start with `always_run: false`, `skip_report: true`. Promotion to blocking TBD (likely after internal preview). | | GCP Project Source | **HCM org account**. Dynamic projects created per-test under GCP HCP org CI folder. | | Project Lifecycle | Dynamic projects per-test under CI folder. Projects deleted on cleanup. See Appendix G. | -| CI Authentication | **Service Account with static credentials** stored in Vault. Matches existing OpenShift CI patterns for AWS/Azure (HyperShift team owns the cloud accounts). | +| CI Authentication | **Service Account with static credentials** stored in Vault (superseded — being migrated to WIF, see [hypershift-ci-wif-migration](./hypershift-ci-wif-migration.md)). | | CI Isolation | Direct folder permissions on CI folder only. CI SA cannot affect managed service projects in sibling folders. | | CI Folder Location | Under GCP HCP folder. New `gcp-hcp-ci` folder for test isolation. | | Cluster Profile | **Dedicated `hypershift-gcp` profile** with own quota slice. GCP tests won't use resources from the AWS-based hypershift-ci cluster, so having a separate profile avoids artificially limiting AWS cluster capacity. See Appendix D. |