Fix concurrent publish operations causing missing package files #1511
agustinhenze wants to merge 5 commits into master from agustinhenze:fix-concurrent-publish-race-conditions
Conversation
force-pushed from 76a7b8f to 8a47888
Sounds like my new test is hitting the timeout (the new test takes longer than 2 minutes, depending on the load of the machine). @iofq or @neolynx, would you mind taking a look? I can try to fix the timeout later, but an early review would be really appreciated. I found these multiple bugs after some stress testing I did because of production bugs we were hitting randomly.
force-pushed from a9591c7 to 47b362c
  @mkdir -p /tmp/aptly-etcd-data; system/t13_etcd/start-etcd.sh > /tmp/aptly-etcd-data/etcd.log 2>&1 &
  @echo "\e[33m\e[1mRunning go test ...\e[0m"
- faketime "$(TEST_FAKETIME)" go test -v ./... -gocheck.v=true -check.f "$(TEST)" -coverprofile=unit.out; echo $$? > .unit-test.ret
+ faketime "$(TEST_FAKETIME)" go test -timeout 20m -v ./... -gocheck.v=true -check.f "$(TEST)" -coverprofile=unit.out; echo $$? > .unit-test.ret
Tests should be run with -race to detect race conditions
@agustinhenze I cannot get the unit tests to work... looks like there is a deadlock now, the evil twin of race conditions... I wonder whether such long-running tests would not be better implemented as system tests, testing the API and command line. Did I get this right: the tests should run things concurrently for a while and not lose files?
It hangs in TestIdenticalPackageRace:430, I believe. Here are the logs: and here is the backtrace after the timeout:
I think this assumption is wrong. The task list is unlocked while the task is still running, so multiple publish tasks acquire the database and run concurrently, creating chaos and deadlocks. A bit more logging reveals: apiReposPackageFromDir should not modify the database in parallel tasks... what do you think?
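For readers following along, here is a minimal sketch of the kind of serialization being argued for. The resourceLocks type, the acquireAll helper, and the resource-name strings are illustrative assumptions, not aptly's actual task package API: a task that modifies the database declares the resources it touches, and it only starts once it holds a lock on every one of them, so two tasks touching the same repositories cannot run concurrently.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// resourceLocks serializes tasks that declare overlapping resource sets.
// Illustrative sketch only, not aptly's actual task queue implementation.
type resourceLocks struct {
	mu    sync.Mutex
	locks map[string]*sync.Mutex
}

func newResourceLocks() *resourceLocks {
	return &resourceLocks{locks: map[string]*sync.Mutex{}}
}

// acquireAll locks every named resource in a stable order (so two tasks with
// overlapping sets cannot deadlock each other) and returns a release function.
func (r *resourceLocks) acquireAll(names []string) func() {
	sorted := append([]string(nil), names...)
	sort.Strings(sorted)
	var held []*sync.Mutex
	for _, n := range sorted {
		r.mu.Lock()
		m, ok := r.locks[n]
		if !ok {
			m = &sync.Mutex{}
			r.locks[n] = m
		}
		r.mu.Unlock()
		m.Lock()
		held = append(held, m)
	}
	return func() {
		for i := len(held) - 1; i >= 0; i-- {
			held[i].Unlock()
		}
	}
}

func main() {
	locks := newResourceLocks()
	var wg sync.WaitGroup
	for i := 0; i < 3; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			// Each "publish" declares the local repo and the published repo that
			// references it (hypothetical names), so the goroutines run one at a time.
			release := locks.acquireAll([]string{"L:myrepo", "P:filesystem:fs1:stable/bookworm"})
			defer release()
			fmt.Println("task", i, "holds all declared resources")
		}(i)
	}
	wg.Wait()
}
```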
I fixed some deadlocks in: I think one test fails sporadically, because the repo is published before the package is added. I combined the adding and publishing, not sure if this still reproduces your race condition... what do you think? If you could give me edit access (checkbox in PR settings), I can rebase and push the fixes to this PR to continue here..
  _ = s.snapshotCollection.Add(snap3)

  // Ensure that adding a second publish point with matching files doesn't give duplicate results.
  // When a second publish point references the same files, they should be listed for each repo
this change of behavior could cause trouble elsewhere...
Mm right, I agree. You probably know better how to handle this case: the same package uploaded to two different repositories and published "at the same time".
force-pushed from 47b362c to 8eb7119
Codecov Report
❌ Patch coverage is
Additional details and impacted files

@@            Coverage Diff             @@
##           master    #1511      +/-   ##
==========================================
- Coverage   77.22%   76.97%   -0.26%
==========================================
  Files         161      161
  Lines       15080    15098      +18
==========================================
- Hits        11646    11622      -24
- Misses       2291     2338      +47
+ Partials     1143     1138       -5

☔ View full report in Codecov by Sentry.
It's already set :/
Yes, exactly. It's weird that you are getting a timeout while the tests are passing in CI and also on my machine 🤔
force-pushed from 8eb7119 to 8cd758b
Hello, I just tested with the freshly built version 1.6.2+20260124095043.55446d84 and I no longer get the “Unable to update” errors I had before when making several asynchronous publications on the same prefix; the code seems fine to me! However, I'm seeing a lock on task 58 and I'm wondering whether this lock is legitimate?
We hit this bug at least once a week; what is this blocked on?
  // Ensure that adding a second publish point with matching files doesn't give duplicate results.
  // When a second publish point references the same files, they should be listed for each repo
  // to ensure cleanup doesn't delete files still referenced by other distributions.
@agustinhenze this changes the behavior of aptly... and could cause problems elsewhere..
(previous comment got lost somehow)
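For reference, the behavior the new comment describes amounts to this: a pool file counts as an orphan only if no publish point sharing that pool directory references it. The sketch below is a simplified illustration with made-up types and example paths (publishPoint, orphanedFiles, the filenames), not aptly's real cleanup code.

```go
package main

import "fmt"

// publishPoint is a hypothetical stand-in for a published repository
// sharing a pool directory (same storage and prefix).
type publishPoint struct {
	name  string
	files []string // pool-relative paths referenced by this distribution
}

// orphanedFiles returns the pool files not referenced by ANY publish point.
// Taking the union across all siblings is what keeps cleanup from deleting
// a file that another distribution still serves.
func orphanedFiles(pool []string, points []publishPoint) []string {
	referenced := map[string]bool{}
	for _, p := range points {
		for _, f := range p.files {
			referenced[f] = true
		}
	}
	var orphans []string
	for _, f := range pool {
		if !referenced[f] {
			orphans = append(orphans, f)
		}
	}
	return orphans
}

func main() {
	pool := []string{
		"pool/main/h/hello/hello_1.0_amd64.deb",
		"pool/main/o/old/old_0.1_amd64.deb",
	}
	points := []publishPoint{
		{name: "bookworm", files: []string{"pool/main/h/hello/hello_1.0_amd64.deb"}},
		{name: "trixie", files: []string{"pool/main/h/hello/hello_1.0_amd64.deb"}},
	}
	// hello_1.0 is referenced by both distributions, so only old_0.1 is orphaned.
	fmt.Println(orphanedFiles(pool, points))
}
```

Whether the real code should list a shared file once per referencing repo or deduplicate it is exactly the behavior change being debated in this thread.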
force-pushed from 8cd758b to 3e522a1
@russelltg I am not really sure what this PR fixes and how. I had to modify the tests quite a bit, but there are duplicate packages showing up now. Something does not look right, and I would like to understand it fully before merging.. While analyzing this, I fixed several bugs with locking and the queue (#1529). I think we might be seeing several bugs here. What is the bug you are seeing? Missing packages, hanging aptly? Could you share what API calls each of your concurrent pipeline steps is doing? It might be that the fixes on master already address this (because repos are properly locked now), or maybe only parts of this PR would be needed. Would you be able to run the latest CI build (aptly_1.6.2+20260426215605.a20eb686_amd64.deb) and see if the behavior improves?
@cchazalet I could not really figure out where your build is from. Was this a CI build from master? If task 58 never started, this might be from before the deadlock fix. Could you also try the latest CI build (aptly_1.6.2+20260426215605.a20eb686_amd64.deb)?
Hi @agustinhenze, I tried to reach out by mail a while ago, but it probably got lost. As I do not fully understand the changes here, I wanted to connect..
I think the repos get locked when publishing, so adding all published repositories might not be needed, see
If the locking works properly now, publish operations on the same pool directory should not run concurrently anymore, so there should be no deletion race?
Also here, the locking should prevent this? Could you share the steps of the concurrent operations in your setup? Let me know if you would be available to connect!
The build 1.6.2+20260124095043.55446d84 was from the agustinhenze:fix-concurrent-publish-race-conditions branch.
No, I don't think the 1.6.2+20260124095043.55446d84 version included the deadlock fix.
Yes, I tested this version and it seems good.
Hi @neolynx, I have just conducted a more thorough test of the latest version of aptly from the master branch (1.6.2+20260426215605.a20eb686) and have finally identified an error that occurs during a concurrent release to the same pool. Attached is the error: After restarting the publication without concurrency, it works fine. So I think the fix proposed by @agustinhenze is indeed warranted.
Hi @cchazalet, thanks a lot for testing! I would like to understand the race condition better...
Hi @neolynx, you can see my workflow below. I start multiple publication tasks with the same prefix (filesystem:fs1:stable) in asynchronous mode, at the same time. After some seconds: you can see task 116 failed with
We can clearly see an error related to a missing or nonexistent package, because another concurrent task is performing operations on that binary in the pool. If I retry the publishing task that failed, without any other publishing tasks currently running, it works. When I use the code from this branch with locks, under the same use case, the publishing tasks run one by one.
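For what it's worth, a rough Go sketch of that kind of concurrent workflow is below. The aptly URL, distribution names, and request body are placeholders, and the PUT /api/publish/:prefix/:distribution endpoint with the _async=true query parameter is my reading of the aptly API docs, so treat those as assumptions and adjust to your setup.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
	"sync"
)

func main() {
	// Hypothetical values: adjust the base URL and distribution names.
	base := "http://localhost:8080"
	prefix := "filesystem:fs1:stable"
	distributions := []string{"dist-a", "dist-b", "dist-c"}

	client := &http.Client{}
	var wg sync.WaitGroup
	for _, dist := range distributions {
		wg.Add(1)
		go func(dist string) {
			defer wg.Done()
			// Re-publish (update) each distribution concurrently in async mode.
			url := fmt.Sprintf("%s/api/publish/%s/%s?_async=true", base, prefix, dist)
			req, err := http.NewRequest(http.MethodPut, url, strings.NewReader(`{"ForceOverwrite": false}`))
			if err != nil {
				fmt.Println("request error:", err)
				return
			}
			req.Header.Set("Content-Type", "application/json")
			resp, err := client.Do(req)
			if err != nil {
				fmt.Println("publish error:", err)
				return
			}
			defer resp.Body.Close()
			fmt.Println(dist, "->", resp.Status)
		}(dist)
	}
	wg.Wait()
}
```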
@cchazalet thanks! I am interested in what commands you are running to publish. I see it is a snapshot you are publishing; how is it created? How are files added to the repo? And how much of this is running concurrently?
We can set up a call so I can give you a demo. Are you available later today?
When multiple repository operations execute concurrently on shared pool directories, race conditions could cause .deb files to be deleted despite appearing in repository metadata, resulting in apt 404 errors.
Three distinct but related race conditions were identified and fixed:
1. Package addition vs publish race: When packages are added to a local repository that is already published, the publish operation could read stale package references before the add transaction commits. Fixed by locking all published repositories that reference the local repo during package addition.
2. Pool file deletion race: When multiple published repositories share the same pool directory (same storage + prefix) and publish concurrently, cleanup operations could delete each other's newly created files. The cleanup in thread B would:
   - Query the database for referenced files (not seeing thread A's uncommitted files)
   - Scan the pool directory (seeing thread A's files)
   - Delete thread A's files as "orphaned"
   Fixed by implementing pool-sibling locking: acquire locks on ALL published repositories sharing the same storage and prefix before publish/cleanup (see the sketch after this list).
3. Concurrent cleanup on same prefix: Multiple distributions publishing to the same prefix concurrently could have cleanup operations delete shared files. Fixed by:
   - Adding prefix-level locking to serialize cleanup operations
   - Removing ref subtraction that incorrectly marked shared files as orphaned
   - Forcing database reload before cleanup to see recent commits
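To make point 2 concrete, here is a hypothetical sketch of how the lock set for a publish/cleanup task could be extended with its pool siblings. The published struct, lockNamesForPublish, and the name scheme are illustrative, not the actual implementation in this PR.

```go
package main

import "fmt"

// published is a hypothetical view of a published repository.
type published struct {
	Storage, Prefix, Distribution string
}

// lockNamesForPublish returns the resource names a publish/cleanup task should
// acquire: the target itself, a prefix-level cleanup lock, plus every sibling
// sharing the same pool directory (storage + prefix).
func lockNamesForPublish(target published, all []published) []string {
	names := []string{
		resourceName(target),
		"C" + target.Storage + ":" + target.Prefix, // prefix-level cleanup lock
	}
	for _, p := range all {
		if p.Storage == target.Storage && p.Prefix == target.Prefix && p != target {
			names = append(names, resourceName(p))
		}
	}
	return names
}

func resourceName(p published) string {
	return "P" + p.Storage + ":" + p.Prefix + "/" + p.Distribution
}

func main() {
	all := []published{
		{"filesystem:fs1", "stable", "bookworm"},
		{"filesystem:fs1", "stable", "trixie"},
		{"filesystem:fs1", "unstable", "sid"},
	}
	// Publishing bookworm also locks trixie (same pool directory) but not sid.
	fmt.Println(lockNamesForPublish(all[0], all))
}
```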
The existing task system serializes operations based on resource locks, preventing these race conditions when proper lock sets are acquired.
Test coverage includes concurrent publish scenarios that reliably reproduced all three bugs before the fixes.
Checklist
AUTHORS