Include --ci.runId in test ProjectKey to prevent cross-suite collisions #3555
Open
naveenku-jfrog wants to merge 13 commits into
Intermittent "Project not found" failures occur when lifecycle and artifactory test suites run concurrently against the same JFrog Platform instance. createTestProject() calls deleteProjectIfExists(tests.ProjectKey) unconditionally before creating the project, so if two suites resolve to the same ProjectKey, one will silently delete the other's project (and every release bundle inside it).
utils/tests/utils.go:
- Add SanitizedCiRunId() helper that converts --ci.runId to a valid project-key string (lowercase, non-alphanumeric chars replaced with hyphens, leading/trailing hyphens trimmed).
- Splice the sanitized runId into the test ProjectKey so concurrent suites get distinct keys (e.g. "prjlinux-lifecycle-1781784930" vs "prjlinux-artifactory-1781784935"). Previously both suites resolved to the same 7-digit-suffixed key.
- Use the full 10-digit Unix timestamp instead of the last 7 digits, giving 1000x more distinct values and making same-second collisions between independent CI runs effectively impossible.
Co-authored-by: Cursor <cursoragent@cursor.com>
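The sanitization and key-splicing rules described in this commit can be sketched roughly as follows. This is not the code from utils/tests/utils.go: the function names sanitizeRunID and buildProjectKey, the "prj" prefix handling, and the exact key layout are assumptions inferred from the example keys in the commit message.

```go
package main

import (
	"fmt"
	"strings"
)

// sanitizeRunID is a hypothetical stand-in for SanitizedCiRunId().
// It lowercases the run ID, replaces every character outside [a-z0-9]
// with a hyphen, and trims leading and trailing hyphens.
func sanitizeRunID(runID string) string {
	var b strings.Builder
	for _, r := range strings.ToLower(runID) {
		if (r >= 'a' && r <= 'z') || (r >= '0' && r <= '9') {
			b.WriteRune(r)
		} else {
			b.WriteByte('-')
		}
	}
	return strings.Trim(b.String(), "-")
}

// buildProjectKey (hypothetical) splices the sanitized run ID between the
// "prj" prefix and the full Unix timestamp. With an empty run ID the key
// is just the prefix plus the timestamp.
func buildProjectKey(runID string, unixSeconds int64) string {
	key := "prj"
	if s := sanitizeRunID(runID); s != "" {
		key += s + "-"
	}
	return key + fmt.Sprintf("%d", unixSeconds)
}

func main() {
	fmt.Println(buildProjectKey("Linux_Lifecycle", 1781784930))
	fmt.Println(buildProjectKey("linux-artifactory", 1781784935))
	fmt.Println(buildProjectKey("", 1781784930))
}
```

The sketch does not enforce the 32-character limit on project keys; a long run ID would need truncating before use.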
…e using it
After createTestProject() succeeds, each Artifactory node behind the load balancer may still have a stale in-memory cache of Access project data. This causes intermittent 400 "Project not found" on build-publish and release-bundle creation when a request is routed to a node whose cache has not yet refreshed.
Add waitForProjectInArtifactory(), which polls GET /api/repositories/<key>-build-info (the repo Artifactory auto-creates per project) and requires 5 consecutive 200 responses before proceeding. Requiring consecutive successes (not just one) ensures that every node in a round-robin pool has warmed its cache, eliminating the intermittent failures seen in TestReleaseBundlesSearchVersions, TestReleaseBundleCreationFromMultiBundlesUsingCommandFlagWithProject and TestCreateBundleWithoutSpecAndWithProject.
Co-authored-by: Cursor <cursoragent@cursor.com>
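A minimal sketch of the "N consecutive successes" polling idea, assuming the HTTP call is abstracted behind a probe function. The names waitForConsecutiveOK and scriptedProbe are hypothetical, and the probe here replays canned status codes rather than calling Artifactory.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// waitForConsecutiveOK calls probe until it has returned 200 `required`
// times in a row. Any other status or an error resets the streak.
// It gives up after maxAttempts probes.
func waitForConsecutiveOK(probe func() (int, error), required, maxAttempts int, interval time.Duration) error {
	streak := 0
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		status, err := probe()
		if err == nil && status == http.StatusOK {
			streak++
			if streak >= required {
				return nil
			}
		} else {
			streak = 0 // a stale node answered; start counting again
		}
		if attempt < maxAttempts {
			time.Sleep(interval)
		}
	}
	return fmt.Errorf("no run of %d consecutive 200 responses within %d attempts", required, maxAttempts)
}

// scriptedProbe replays the given statuses (repeating the last one) and
// counts how many times it was called. It stands in for a real HTTP GET.
func scriptedProbe(statuses []int, calls *int) func() (int, error) {
	return func() (int, error) {
		i := *calls
		if i >= len(statuses) {
			i = len(statuses) - 1
		}
		*calls++
		return statuses[i], nil
	}
}

func main() {
	calls := 0
	probe := scriptedProbe([]int{200, 200, 404, 200, 200, 200, 200, 200}, &calls)
	err := waitForConsecutiveOK(probe, 5, 20, time.Millisecond)
	fmt.Printf("err=%v probes=%d\n", err, calls) // prints "err=<nil> probes=8"
}
```

Resetting the streak on any miss is what distinguishes this from a plain "wait until first 200" loop: the 404 on the third probe discards the two earlier successes.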
…y sync
The correct API to check project existence is GET /access/api/v1/projects/<key>. Poll until Access returns 200, then sleep 5s for Artifactory's internal project cache to sync before build-publish or release-bundle calls that scope to the project.
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
…ests
Instead of scattering waitForProjectInArtifactory at each call site, call it inside createTestProject so every caller gets the 30s sync wait for free.
Also add the wait to the two artifactory inline tests (TestArtifactoryDownloadByBuildUsingSimpleDownloadWithProject and TestArtifactoryDownloadWithEnvProject), which inline their own project creation and had the same race against Artifactory cache sync.
Co-authored-by: Cursor <cursoragent@cursor.com>
Replace unreliable fixed sleep with targeted retry logic.
After a project is created via the Access API, Artifactory and Lifecycle services take 35-40s (sometimes longer on HA nodes) to sync their internal project cache.
retryOnProjectNotFound() wraps any project-scoped operation and retries up to 12 times with 5s intervals (max 60s) when the response contains 'not found' or 'project key', the exact errors from both Artifactory's build-publish API and Lifecycle's release-bundle creation API.
Applied to:
- uploadBuildWithArtifactsAndProject: retries jf rt build-publish
- uploadBuildWithDepsAndProject: retries jf rt build-publish
- createRbWithFlags: retries jf rbc
waitForProjectInArtifactory is kept but simplified to only confirm project creation in Access (no more fixed sleep).
Co-authored-by: Cursor <cursoragent@cursor.com>
Replace the unreliable sleep-based wait with clean retry logic:
- retryOnProjectNotFound: 5 attempts, 30s between each (2.5min max)
- Removed waitForProjectInArtifactory and all call sites
- Artifactory build-publish retry now also applied directly in the two artifactory project tests
Co-authored-by: Cursor <cursoragent@cursor.com>
… poisoning
On gotestsum re-runs, the ProjectKey is the same as the initial run.
createTestProject was calling deleteProjectIfExists before creating, which:
1. Deleted the project left by the failed previous attempt
2. Immediately recreated it with the SAME key
This delete+recreate cycle poisons Artifactory's internal project cache
with a 'not found' entry for that key. Some HA nodes take 160+ seconds
(entire retry budget) to invalidate this negative cache entry, causing
all project-scoped ops to fail indefinitely.
Fix: remove the upfront deleteProjectIfExists. Project keys are unique
per suite run (full Unix timestamp), so deletion is only needed at
cleanup. If the project already exists on a re-run, reuse it silently
('already exists' is treated as success).
Co-authored-by: Cursor <cursoragent@cursor.com>
… creation
The Lifecycle service has its own project cache separate from Artifactory's and can take 5+ minutes to warm on HA nodes running draft Artifactory builds. Running the full 'jfrog rbc' command as retries exhausted the 2.5-min retry budget long before LC was ready.
waitForLifecycleProjectVisibility() polls a cheap GET endpoint:
lifecycle/api/v2/release_bundle/records/non-existing-rb?project=<key>
- 400 Bad Request = LC doesn't know the project yet (keep waiting)
- 404 Not Found = LC knows the project (proceed)
Polls every 15s with a 15-minute timeout. Called inside uploadBuildsWithProject so all callers (3 project tests) wait for LC readiness before any rbc attempt.
Co-authored-by: Cursor <cursoragent@cursor.com>
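The 400-versus-404 readiness probe might be structured like this sketch. The function names, the probe callback, and the platform base URL are assumptions; treating any status other than 400 or 404 as a hard error is also an assumption, since the commit message only defines those two cases.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
	"strings"
	"time"
)

// lifecycleProbeURL builds the probe endpoint quoted in the commit message.
// platformURL is a hypothetical base URL for the JFrog Platform instance.
func lifecycleProbeURL(platformURL, projectKey string) string {
	return strings.TrimRight(platformURL, "/") +
		"/lifecycle/api/v2/release_bundle/records/non-existing-rb?project=" +
		url.QueryEscape(projectKey)
}

// waitForProjectVisibility polls probe until it returns 404 (the project is
// known, only the bundle is missing). A 400 means the project is not yet
// visible, so polling continues until the timeout would be exceeded.
func waitForProjectVisibility(probe func() (int, error), interval, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for {
		status, err := probe()
		if err != nil {
			return err
		}
		switch status {
		case http.StatusNotFound:
			return nil
		case http.StatusBadRequest:
			// project not visible yet: keep waiting
		default:
			return fmt.Errorf("unexpected status %d from readiness probe", status)
		}
		if time.Now().Add(interval).After(deadline) {
			return fmt.Errorf("project not visible after %s", timeout)
		}
		time.Sleep(interval)
	}
}

func main() {
	fmt.Println(lifecycleProbeURL("https://example.test/", "prjlinux-lifecycle-1781784930"))

	// Scripted probe standing in for the HTTP GET: two 400s, then 404.
	statuses := []int{400, 400, 404}
	calls := 0
	probe := func() (int, error) {
		s := statuses[calls]
		calls++
		return s, nil
	}
	// The commit uses 15s / 15min; short values keep the demo fast.
	err := waitForProjectVisibility(probe, time.Millisecond, time.Second)
	fmt.Printf("err=%v probes=%d\n", err, calls)
}
```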
Observed Artifactory 7.158.0 draft cache propagation taking 120-150s on some instances. 5 retries x 30s = 2.5 min was not enough. 10 retries x 30s = 5 min covers the observed worst-case propagation delay.
Co-authored-by: Cursor <cursoragent@cursor.com>
…oject
Same Artifactory project cache propagation delay that affected lifecycle and artifactory tests also hits pnpm's TestPnpmInstallAndPublishWithProject:
- build-publish (bp) with --project fails with 400 Project not found
- panic at line 984 follows because publishedBuildInfo is nil
Changes:
- Wrap 'jfrog rt bp --project' with retryOnProjectNotFound (10x 30s)
- Drop DeleteProject before CreateProject to avoid cache poisoning on re-runs (same fix applied to transfer_test.go earlier)
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
The test-only ProjectKey was suffixed with only the last 7 digits of the unix timestamp, so two test runs starting within the same calendar second on a shared JPD produced byte-identical keys (e.g. "prj1777802"). Because createTestProject() calls deleteProjectIfExists(tests.ProjectKey) unconditionally before creating the project, one suite would silently nuke another concurrent suite's project (and every release bundle inside it), showing up as flaky "not found" failures in lifecycle and artifactory project tests on shared JFrog instances.
Splice the sanitized --ci.runId into the key so concurrent runs get isolated keys (e.g. "prjlinux-lifecycle-1777802"), while still respecting the 2-32 lowercase-alphanumeric-hyphen project-key format and the "starts with a letter" rule. A new SanitizedCiRunId() helper exposes the runId in a charset-safe form for any future callers that need the same.
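The format constraints mentioned above (2-32 characters, lowercase alphanumerics and hyphens, starting with a letter) can be expressed as a single pattern. The sketch below is illustrative only; isValidProjectKey is a hypothetical helper, not part of the PR.

```go
package main

import (
	"fmt"
	"regexp"
)

// projectKeyPattern encodes the rules quoted in the description: one
// leading lowercase letter followed by 1 to 31 lowercase letters, digits,
// or hyphens, for a total length of 2 to 32.
var projectKeyPattern = regexp.MustCompile(`^[a-z][a-z0-9-]{1,31}$`)

// isValidProjectKey (hypothetical) reports whether key satisfies the pattern.
func isValidProjectKey(key string) bool {
	return projectKeyPattern.MatchString(key)
}

func main() {
	for _, key := range []string{"prjlinux-lifecycle-1777802", "1prj", "p", "prj_linux"} {
		fmt.Printf("%q valid=%t\n", key, isValidProjectKey(key))
	}
}
```

A check like this would flag a spliced key that a long run ID has pushed past 32 characters before it ever reaches the server.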
No behavior change when --ci.runId is unset (local single-suite runs and upstream workflows that bootstrap a fresh JPD per job).
- master branch
- go vet ./...
- go fmt ./...