Fix concurrent publish operations causing missing package files #1511
agustinhenze wants to merge 5 commits into master from agustinhenze:fix-concurrent-publish-race-conditions
Conversation
force-pushed from 76a7b8f to 8a47888
Sounds like my new test is hitting the timeout (the new test takes longer than 2 minutes, depending on the load of the machine). @iofq or @neolynx, would you mind taking a look? I can try to fix the timeout later, but an early review would be really appreciated. I found these multiple bugs after some stress testing I did because of production bugs we were hitting randomly.
force-pushed from a9591c7 to 47b362c
  @mkdir -p /tmp/aptly-etcd-data; system/t13_etcd/start-etcd.sh > /tmp/aptly-etcd-data/etcd.log 2>&1 &
  @echo "\e[33m\e[1mRunning go test ...\e[0m"
- faketime "$(TEST_FAKETIME)" go test -v ./... -gocheck.v=true -check.f "$(TEST)" -coverprofile=unit.out; echo $$? > .unit-test.ret
+ faketime "$(TEST_FAKETIME)" go test -timeout 20m -v ./... -gocheck.v=true -check.f "$(TEST)" -coverprofile=unit.out; echo $$? > .unit-test.ret
Tests should be run with -race to detect race conditions
@agustinhenze I cannot get the unit tests to work... looks like there is a deadlock now, the evil twin of race conditions... I wonder whether such long-running tests would not be better implemented as system tests, testing the API and command line. Did I get this right: the tests should run things concurrently for a while and not lose files?
It hangs in TestIdenticalPackageRace:430, I believe. Here are the logs: and here is the backtrace after the timeout:
I think this assumption is wrong. The task list is unlocked while the task is still running, so multiple publish tasks acquire the database and run concurrently, creating chaos and deadlocks. A bit more logging reveals: apiReposPackageFromDir should not modify the database in parallel tasks... what do you think?
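For readers following along, here is a minimal sketch of the kind of serialization being argued for. The resourceLocks type, the acquireAll helper, and the resource-name strings are illustrative assumptions, not aptly's actual task package API: a task that modifies the database declares the resources it touches, and it only starts once it holds a lock on every one of them, so two tasks touching the same repositories cannot run concurrently.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// resourceLocks serializes tasks that declare overlapping resource sets.
// Illustrative sketch only, not aptly's actual task queue implementation.
type resourceLocks struct {
	mu    sync.Mutex
	locks map[string]*sync.Mutex
}

func newResourceLocks() *resourceLocks {
	return &resourceLocks{locks: map[string]*sync.Mutex{}}
}

// acquireAll locks every named resource in a stable order (so two tasks with
// overlapping sets cannot deadlock each other) and returns a release function.
func (r *resourceLocks) acquireAll(names []string) func() {
	sorted := append([]string(nil), names...)
	sort.Strings(sorted)
	var held []*sync.Mutex
	for _, n := range sorted {
		r.mu.Lock()
		m, ok := r.locks[n]
		if !ok {
			m = &sync.Mutex{}
			r.locks[n] = m
		}
		r.mu.Unlock()
		m.Lock()
		held = append(held, m)
	}
	return func() {
		for i := len(held) - 1; i >= 0; i-- {
			held[i].Unlock()
		}
	}
}

func main() {
	locks := newResourceLocks()
	var wg sync.WaitGroup
	for i := 0; i < 3; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			// Each "publish" declares the local repo and the published repo that
			// references it (hypothetical names), so the goroutines run one at a time.
			release := locks.acquireAll([]string{"L:myrepo", "P:filesystem:fs1:stable/bookworm"})
			defer release()
			fmt.Println("task", i, "holds all declared resources")
		}(i)
	}
	wg.Wait()
}
```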
I fixed some deadlocks in: I think one test fails sporadically, because the repo is published before the package is added. I combined the adding and publishing, not sure if this still reproduces your race condition... what do you think? If you could give me edit access (checkbox in PR settings), I can rebase and push the fixes to this PR to continue here..
  _ = s.snapshotCollection.Add(snap3)

  // Ensure that adding a second publish point with matching files doesn't give duplicate results.
  // When a second publish point references the same files, they should be listed for each repo
this change of behavior could cause trouble elsewhere...
Mm right, I agree. You probably know better how to handle this case: the same package uploaded to two different repositories and published "at the same time".
force-pushed from 47b362c to 8eb7119
Codecov Report
❌ Patch coverage is
Additional details and impacted files

@@            Coverage Diff             @@
##           master    #1511      +/-   ##
==========================================
- Coverage   77.22%   76.97%   -0.26%
==========================================
  Files         161      161
  Lines       15080    15098      +18
==========================================
- Hits        11646    11622      -24
- Misses       2291     2338      +47
+ Partials     1143     1138       -5

☔ View full report in Codecov by Sentry.
It's already set :/
Yes, exactly. It's weird that you are getting a timeout while the tests are passing in CI and also on my machine 🤔
force-pushed from 8eb7119 to 8cd758b
Hello, I just tested with the freshly built version 1.6.2+20260124095043.55446d84 and I no longer get the “Unable to update” errors I had before when making several asynchronous publications on the same prefix; the code seems fine to me! However, I'm seeing a lock on task 58 and I'm wondering whether this lock is legitimate?
We hit this bug at least once a week; what is this blocked on?
  // Ensure that adding a second publish point with matching files doesn't give duplicate results.
  // When a second publish point references the same files, they should be listed for each repo
  // to ensure cleanup doesn't delete files still referenced by other distributions.
@agustinhenze this changes the behavior of aptly... and could cause problems elsewhere..
(previous comment got lost somehow)
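For reference, the behavior the new comment describes amounts to this: a pool file counts as an orphan only if no publish point sharing that pool directory references it. The sketch below is a simplified illustration with made-up types and example paths (publishPoint, orphanedFiles, the filenames), not aptly's real cleanup code.

```go
package main

import "fmt"

// publishPoint is a hypothetical stand-in for a published repository
// sharing a pool directory (same storage and prefix).
type publishPoint struct {
	name  string
	files []string // pool-relative paths referenced by this distribution
}

// orphanedFiles returns the pool files not referenced by ANY publish point.
// Taking the union across all siblings is what keeps cleanup from deleting
// a file that another distribution still serves.
func orphanedFiles(pool []string, points []publishPoint) []string {
	referenced := map[string]bool{}
	for _, p := range points {
		for _, f := range p.files {
			referenced[f] = true
		}
	}
	var orphans []string
	for _, f := range pool {
		if !referenced[f] {
			orphans = append(orphans, f)
		}
	}
	return orphans
}

func main() {
	pool := []string{
		"pool/main/h/hello/hello_1.0_amd64.deb",
		"pool/main/o/old/old_0.1_amd64.deb",
	}
	points := []publishPoint{
		{name: "bookworm", files: []string{"pool/main/h/hello/hello_1.0_amd64.deb"}},
		{name: "trixie", files: []string{"pool/main/h/hello/hello_1.0_amd64.deb"}},
	}
	// hello_1.0 is referenced by both distributions, so only old_0.1 is orphaned.
	fmt.Println(orphanedFiles(pool, points))
}
```

Whether the real code should list a shared file once per referencing repo or deduplicate it is exactly the behavior change being debated in this thread.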
force-pushed from 8cd758b to 3e522a1
@russelltg I am not really sure what this PR fixes and how. I had to modify the tests quite a bit, but there are duplicate packages showing up now. Something does not look right, and I would like to understand it fully before merging.. While analyzing this, I fixed several bugs with locking and the queue (#1529). I think we might be seeing several bugs here. What is the bug you are seeing? Missing packages, hanging aptly? Could you share what API calls each of your concurrent pipeline steps is doing? It might be that the fixes on master already address this (because repos are properly locked now), or maybe only parts of this PR would be needed. Would you be able to run the latest CI build (aptly_1.6.2+20260426215605.a20eb686_amd64.deb) and see if the behavior improves?
@cchazalet I could not really figure out where your build is from. Was this a CI build from master? If task 58 never started, this might be from before the deadlock fix. Could you also try the latest CI build (aptly_1.6.2+20260426215605.a20eb686_amd64.deb)?
Hi @agustinhenze, I tried to reach out by mail a while ago, but it probably got lost. As I do not fully understand the changes here, I wanted to connect..
I think the repos get locked when publishing, so adding all published repositories might not be needed, see
If the locking works properly now, publish operations on the same pool directory should not run concurrently anymore, so there should be no deletion race?
Also here, the locking should prevent this? Could you share the steps of the concurrent operations in your setup? Let me know if you would be available to connect!
The build 1.6.2+20260124095043.55446d84 was from the agustinhenze:fix-concurrent-publish-race-conditions branch.
No, I don't think the 1.6.2+20260124095043.55446d84 version included the deadlock fix.
Yes, I tested this version and it seems good.
Hi @neolynx, I have just conducted a more thorough test of the latest version of aptly from the master branch (1.6.2+20260426215605.a20eb686) and have finally identified an error that occurs during a concurrent release to the same pool. Attached is the error: After restarting the publication without concurrency, it works fine. So I think the fix proposed by @agustinhenze is indeed warranted.
Hi @cchazalet, thanks a lot for testing! I would like to understand the race condition better...
Hi @neolynx, you can see my workflow below. I start multiple publication tasks with the same prefix (filesystem:fs1:stable) in asynchronous mode, at the same time. After some seconds: you can see task 116 failed with
We can clearly see an error related to a missing or nonexistent package, because another concurrent task is performing operations on that binary in the pool. If I retry the publishing task that failed, without any other publishing tasks currently running, it works. When I use the code from this branch with locks, under the same use case, the publishing tasks run one by one.
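For what it's worth, a rough Go sketch of that kind of concurrent workflow is below. The aptly URL, distribution names, and request body are placeholders, and the PUT /api/publish/:prefix/:distribution endpoint with the _async=true query parameter is my reading of the aptly API docs, so treat those as assumptions and adjust to your setup.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
	"sync"
)

func main() {
	// Hypothetical values: adjust the base URL and distribution names.
	base := "http://localhost:8080"
	prefix := "filesystem:fs1:stable"
	distributions := []string{"dist-a", "dist-b", "dist-c"}

	client := &http.Client{}
	var wg sync.WaitGroup
	for _, dist := range distributions {
		wg.Add(1)
		go func(dist string) {
			defer wg.Done()
			// Re-publish (update) each distribution concurrently in async mode.
			url := fmt.Sprintf("%s/api/publish/%s/%s?_async=true", base, prefix, dist)
			req, err := http.NewRequest(http.MethodPut, url, strings.NewReader(`{"ForceOverwrite": false}`))
			if err != nil {
				fmt.Println("request error:", err)
				return
			}
			req.Header.Set("Content-Type", "application/json")
			resp, err := client.Do(req)
			if err != nil {
				fmt.Println("publish error:", err)
				return
			}
			defer resp.Body.Close()
			fmt.Println(dist, "->", resp.Status)
		}(dist)
	}
	wg.Wait()
}
```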
@cchazalet thanks! I am interested in what commands you are running to publish. I see it is a snapshot you are publishing; how is it created? How are files added to the repo? And how much of this is running concurrently?
We can set up a call so I can give you a demo. Are you available later today?
When multiple repository operations execute concurrently on shared pool directories, race conditions could cause .deb files to be deleted despite appearing in repository metadata, resulting in apt 404 errors.
Three distinct but related race conditions were identified and fixed:
1. Package addition vs publish race: When packages are added to a local repository that is already published, the publish operation could read stale package references before the add transaction commits. Fixed by locking all published repositories that reference the local repo during package addition.
2. Pool file deletion race: When multiple published repositories share the same pool directory (same storage + prefix) and publish concurrently, cleanup operations could delete each other's newly created files. The cleanup in thread B would:
   - Query the database for referenced files (not seeing thread A's uncommitted files)
   - Scan the pool directory (seeing thread A's files)
   - Delete thread A's files as "orphaned"
   Fixed by implementing pool-sibling locking: acquire locks on ALL published repositories sharing the same storage and prefix before publish/cleanup (see the sketch after this list).
3. Concurrent cleanup on same prefix: Multiple distributions publishing to the same prefix concurrently could have cleanup operations delete shared files. Fixed by:
   - Adding prefix-level locking to serialize cleanup operations
   - Removing ref subtraction that incorrectly marked shared files as orphaned
   - Forcing database reload before cleanup to see recent commits
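To make point 2 concrete, here is a hypothetical sketch of how the lock set for a publish/cleanup task could be extended with its pool siblings. The published struct, lockNamesForPublish, and the name scheme are illustrative, not the actual implementation in this PR.

```go
package main

import "fmt"

// published is a hypothetical view of a published repository.
type published struct {
	Storage, Prefix, Distribution string
}

// lockNamesForPublish returns the resource names a publish/cleanup task should
// acquire: the target itself, a prefix-level cleanup lock, plus every sibling
// sharing the same pool directory (storage + prefix).
func lockNamesForPublish(target published, all []published) []string {
	names := []string{
		resourceName(target),
		"C" + target.Storage + ":" + target.Prefix, // prefix-level cleanup lock
	}
	for _, p := range all {
		if p.Storage == target.Storage && p.Prefix == target.Prefix && p != target {
			names = append(names, resourceName(p))
		}
	}
	return names
}

func resourceName(p published) string {
	return "P" + p.Storage + ":" + p.Prefix + "/" + p.Distribution
}

func main() {
	all := []published{
		{"filesystem:fs1", "stable", "bookworm"},
		{"filesystem:fs1", "stable", "trixie"},
		{"filesystem:fs1", "unstable", "sid"},
	}
	// Publishing bookworm also locks trixie (same pool directory) but not sid.
	fmt.Println(lockNamesForPublish(all[0], all))
}
```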
The existing task system serializes operations based on resource locks, preventing these race conditions when proper lock sets are acquired.
Test coverage includes concurrent publish scenarios that reliably reproduced all three bugs before the fixes.
Checklist
AUTHORS