Description
The operator's SBOMWatcher correlates incoming SBOMSyft events with their owning workload by looking up the image hash in an in-memory map (ImageToContainerData) that is populated by Pod events from a separate informer channel. The two channels are merged in a select:
// operator/watcher/sbomwatcher.go:81-104 (excerpt)
for {
	select {
	// FIXME select processes the events randomly, so we might see the SBOM event before the pod event
	case event := <-wh.eventQueue.ResultChan: // pod events fill ImageToContainerData
		...
		for _, containerStatus := range containerStatuses {
			hash := hashFromImageID(containerStatus.ImageID)
			wh.ImageToContainerData.Set(hash, utils.ContainerData{
				ContainerName: containerStatus.Name,
				Wlid:          wlid,
			})
		}
	case sbomEvent, ok := <-sbomEvents:
		if ok { eventQueue.Enqueue(sbomEvent) } else { ... }
		...
	}
}
Two problems compose:
- Initial list races with pod cache fill. Before the select loop starts, the watcher pages the full existing SBOM list and enqueues each item as watch.Added (sbomwatcher.go:65-77). The race is sharper than the FIXME at line 82 suggests, because HandleSBOMEvents runs as its own goroutine (started at sbomwatcher.go:54, before the SBOM list call) and consumes from a separate cooldown queue. The ImageToContainerData map it reads from is only populated by the main select loop draining pod events from wh.eventQueue (lines 83-101). So even though listPods is invoked synchronously at line 45 when ServiceDiscovery.Enabled == true, it only enqueues pod events; it does not populate the map directly. The map is filled lazily by the select loop, which has not yet started when the SBOM goroutine begins servicing events.
  Two concrete failure paths:
  - With ServiceDiscovery.Enabled == true: race between the SBOM-handling goroutine and the select loop draining queued pod events. Whichever wins decides whether Wlid is empty.
  - With ServiceDiscovery.Enabled == false (line 43): listPods never runs at all, so ImageToContainerData is never populated and every SBOM dispatch goes out with Wlid="".
- validateContainerData does not check Wlid. When HandleSBOMEvents processes an SBOM whose image hash isn't in ImageToContainerData yet, imageContainerData is the zero-value struct and containerData.Wlid == "". The validation step only rejects empty ImageID/ImageTag:
// sbomwatcher.go:221-229
func validateContainerData(containerData *utils.ContainerData) error {
	if containerData.ImageID == "" { return ErrMissingImageID }
	if containerData.ImageTag == "" { return ErrMissingImageTag }
	return nil
}
So validation passes, and a TypeScanImages command is dispatched with Wlid = "":
// sbomwatcher.go:181-191
cmd := &apis.Command{
	Wlid:        containerData.Wlid, // = ""
	CommandName: apis.TypeScanImages,
	Args: map[string]interface{}{
		utils.ArgsContainerData: containerData,
	},
}
logger.L().Info("scanning SBOM", helpers.String("wlid", cmd.Wlid), ...)
producedCommands <- cmd
The downstream kubevuln HTTP scan endpoint (controllers/http.go:140-175) accepts an empty Wlid and scans the image. ScanService.ValidateScanCVE (kubevuln/core/services/scan.go:857-881) only checks ImageHash and ImageSlug, so the empty-Wlid command passes validation and the scan executes. Then, at kubevuln/core/services/scan.go:499-507, the platform CVE submission is gated on if workload.Wlid != "", so the manifest is computed but silently never submitted to the backend when Wlid is empty.
End-to-end effect: the operator dispatches a malformed scan, kubevuln spends CPU and network to scan the image, and the result is silently dropped from the platform submission path. There is no retry path: once the SBOM event has been drained from the channel, it is gone, and no log line above info level is emitted anywhere along the way.
The presence of the if workload.Wlid != "" guard in kubevuln suggests the empty-Wlid case was already a known-bad input on the consumer side; it was patched defensively downstream rather than rejected at the source.
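To make the zero-value path concrete, here is a minimal, self-contained sketch. It is a model under stated assumptions, not the operator's code: ContainerData is reduced to the four fields named in this issue, a plain Go map stands in for ImageToContainerData, and buildContainerData is a hypothetical helper that imitates the lookup done in HandleSBOMEvents.

```go
package main

import "errors"

// ContainerData is a reduced stand-in for utils.ContainerData; only the
// fields discussed in this issue are modeled (assumption).
type ContainerData struct {
	ImageID       string
	ImageTag      string
	ContainerName string
	Wlid          string
}

var (
	ErrMissingImageID  = errors.New("missing image ID")
	ErrMissingImageTag = errors.New("missing image tag")
)

// validateContainerData mirrors the current check: Wlid is not inspected.
func validateContainerData(cd *ContainerData) error {
	if cd.ImageID == "" {
		return ErrMissingImageID
	}
	if cd.ImageTag == "" {
		return ErrMissingImageTag
	}
	return nil
}

// buildContainerData is a hypothetical helper modeling the lookup: ImageID
// and ImageTag come from the SBOM, while ContainerName and Wlid come from the
// pod-populated cache. A cache miss yields the zero-value struct.
func buildContainerData(podCache map[string]ContainerData, hash, imageID, imageTag string) ContainerData {
	fromPod := podCache[hash] // zero value when no pod event was processed yet
	return ContainerData{
		ImageID:       imageID,
		ImageTag:      imageTag,
		ContainerName: fromPod.ContainerName,
		Wlid:          fromPod.Wlid,
	}
}
```

Calling buildContainerData with an empty cache and then validateContainerData on the result returns nil while Wlid is still the empty string, which is the combination that lets the malformed command through.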
Environment
- Repo: kubescape/operator at main HEAD
- Go: 1.25.0 (per go.mod)
- Tests: go test ./... in operator/ passes (the race is not covered by an existing test).
Steps To Reproduce
This is hard to script in a unit test today because HandleSBOMEvents and the populating goroutine share state through a private map. Conceptually:
- Apply an SBOMSyft object to storage with valid ImageID/ImageTag annotations before the corresponding Pod is observed by the operator. For example, delete the operator pod, leave SBOMs and pods alive, then restart only the operator: the initial SBOM list runs immediately while the pod informer is still warming up.
- Observe kubescape scanning SBOM log lines with wlid="".
- Inspect the vuln report: the affected entries have no workload attribution.
A minimal reproducible unit test could be added by exposing a test seam (e.g. injecting the initial list and pod informer as separate channels in a constructor).
Expected behavior
Either:
- Wait for the Wlid. If the image hash is not yet in ImageToContainerData, the SBOM event should be parked (re-enqueued with backoff, or buffered until the corresponding pod event arrives) rather than dispatched with an empty Wlid. The map can also be pre-populated by listing pods before the SBOM list call.
- Reject explicitly. If parking is too complex, validateContainerData should treat empty Wlid as an error, at least surfacing the problem as ErrMissingWlid instead of dispatching a malformed command. This trades a silent attribution loss for an explicit, retryable failure.
In either case the FIXME at line 82 should be resolved, not just annotated.
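The "buffered until the corresponding pod event arrives" variant of the first option could look roughly like the sketch below. Everything in it is hypothetical: parkingLot, OnSBOM and OnPod are illustrative names, SBOMs are represented only by a name string, and the real watcher would need to wire this into its existing queues.

```go
package main

import "sync"

// parkingLot is a hypothetical buffer that holds SBOM names until a pod
// event supplies the Wlid for their image hash.
type parkingLot struct {
	mu     sync.Mutex
	wlids  map[string]string   // image hash -> Wlid, filled by pod events
	parked map[string][]string // image hash -> SBOM names waiting for a Wlid
}

func newParkingLot() *parkingLot {
	return &parkingLot{
		wlids:  make(map[string]string),
		parked: make(map[string][]string),
	}
}

// OnSBOM returns the Wlid and true when the hash is already known.
// Otherwise it parks the SBOM and returns false, so the caller dispatches nothing.
func (p *parkingLot) OnSBOM(hash, sbomName string) (string, bool) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if wlid, ok := p.wlids[hash]; ok {
		return wlid, true
	}
	p.parked[hash] = append(p.parked[hash], sbomName)
	return "", false
}

// OnPod records the Wlid for hash and returns the SBOM names that were
// parked for it, so the caller can dispatch them with a proper Wlid.
func (p *parkingLot) OnPod(hash, wlid string) []string {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.wlids[hash] = wlid
	released := p.parked[hash]
	delete(p.parked, hash)
	return released
}
```

A production version would also want a bound or TTL on parked entries, since an SBOM whose image never appears in a pod would otherwise stay buffered indefinitely.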
Actual Behavior
- On operator startup, the full existing SBOM list is enqueued before the pod cache has filled. Affected entries are dispatched with Wlid="".
- When ServiceDiscovery.Enabled == false, listPods is never called and every SBOM dispatch goes out with Wlid="".
- validateContainerData accepts the empty-Wlid case as valid.
- A vuln scan command is sent over the worker pool with the malformed identifier.
- kubevuln's ValidateScanCVE does not check Wlid either, so the scan executes.
- At kubevuln/core/services/scan.go:500 the platform CVE submission is gated on if workload.Wlid != "" and silently no-ops for empty Wlid: the scan result is computed but never submitted to the backend.
- No log line above info is emitted that would let an operator know correlation was lost or that the submission was skipped.
Suggested fix
Minimum, low-risk change in operator/watcher/sbomwatcher.go:
// sbomwatcher.go:221
// ErrMissingWlid is a new sentinel error, to be declared next to
// ErrMissingImageID and ErrMissingImageTag.
func validateContainerData(containerData *utils.ContainerData) error {
	if containerData.ImageID == "" {
		return ErrMissingImageID
	}
	if containerData.ImageTag == "" {
		return ErrMissingImageTag
	}
	if containerData.Wlid == "" {
		return ErrMissingWlid
	}
	return nil
}
This should be combined with re-queueing the SBOM event on ErrMissingWlid (with a small TTL/backoff) so that a slightly-too-early arrival doesn't drop the scan permanently. Pseudocode in HandleSBOMEvents (deferredSBOMs is a proposed field that does not exist yet):
if err := validateContainerData(containerData); err != nil {
	if errors.Is(err, ErrMissingWlid) {
		// park the event and retry after a short delay instead of dropping it
		wh.deferredSBOMs.Enqueue(e, time.Now().Add(5*time.Second))
		continue
	}
	errorCh <- err
	continue
}
A small worker drains deferredSBOMs on a timer and re-enqueues into the main queue, with a max-retries cap.
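One possible shape for that worker is sketched below. It is only an illustration: deferredQueue, deferredItem and their methods are hypothetical names, SBOM events are represented by a name string, and the caller is assumed to carry the attempt count from one retry to the next.

```go
package main

import (
	"sync"
	"time"
)

// deferredItem is a hypothetical record for an SBOM waiting for a retry.
type deferredItem struct {
	name     string    // stand-in for the real SBOM event
	attempts int       // how many times this SBOM has been deferred so far
	readyAt  time.Time // earliest time the item may be re-enqueued
}

type deferredQueue struct {
	mu          sync.Mutex
	items       []deferredItem
	delay       time.Duration
	maxAttempts int
}

func newDeferredQueue(delay time.Duration, maxAttempts int) *deferredQueue {
	return &deferredQueue{delay: delay, maxAttempts: maxAttempts}
}

// Defer schedules name for another attempt. It returns false when the retry
// budget is exhausted, so the caller can report ErrMissingWlid instead.
func (q *deferredQueue) Defer(name string, attempts int, now time.Time) bool {
	if attempts >= q.maxAttempts {
		return false
	}
	q.mu.Lock()
	defer q.mu.Unlock()
	q.items = append(q.items, deferredItem{name: name, attempts: attempts + 1, readyAt: now.Add(q.delay)})
	return true
}

// Due removes and returns the items whose delay has elapsed at time now.
func (q *deferredQueue) Due(now time.Time) []deferredItem {
	q.mu.Lock()
	defer q.mu.Unlock()
	var due, rest []deferredItem
	for _, it := range q.items {
		if it.readyAt.After(now) {
			rest = append(rest, it)
		} else {
			due = append(due, it)
		}
	}
	q.items = rest
	return due
}

// Run hands due items to requeue on every tick until stop is closed.
func (q *deferredQueue) Run(stop <-chan struct{}, tick time.Duration, requeue func(deferredItem)) {
	ticker := time.NewTicker(tick)
	defer ticker.Stop()
	for {
		select {
		case <-stop:
			return
		case now := <-ticker.C:
			for _, it := range q.Due(now) {
				requeue(it)
			}
		}
	}
}
```

Passing the clock value into Defer and Due keeps the retry logic testable without sleeping; only Run depends on real time.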
A better long-term fix: order the startup explicitly. List pods first, populate ImageToContainerData synchronously, then call the initial SBOM EachListItem. The current code does the SBOM list (sbomwatcher.go:65) without waiting on any pod-informer sync signal.
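The ordering itself is simple to express. The sketch below assumes two injected list functions and reduced, hypothetical types (podContainer, sbomRef, startOrdered); it does not use the operator's real informers or the EachListItem pager.

```go
package main

// podContainer and sbomRef are reduced, hypothetical stand-ins for the data
// the watcher extracts from pod statuses and SBOM objects.
type podContainer struct {
	ImageHash string
	Wlid      string
}

type sbomRef struct {
	Name      string
	ImageHash string
}

// startOrdered lists pods first and fills the cache synchronously, and only
// then lists SBOMs. It returns the Wlid resolved for each SBOM name, plus the
// names that still have no owner (for example, images with no running pod).
func startOrdered(listPods func() []podContainer, listSBOMs func() []sbomRef) (map[string]string, []string) {
	cache := make(map[string]string)
	for _, p := range listPods() {
		cache[p.ImageHash] = p.Wlid
	}

	resolved := make(map[string]string)
	var unresolved []string
	for _, s := range listSBOMs() {
		if wlid, ok := cache[s.ImageHash]; ok && wlid != "" {
			resolved[s.Name] = wlid
		} else {
			unresolved = append(unresolved, s.Name)
		}
	}
	return resolved, unresolved
}
```

Explicit ordering removes the startup race but not the steady-state one: an SBOM that arrives later, ahead of its pod event, still ends up in the unresolved set, so the validation change remains useful alongside it.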
Regression test: inject a fake informer pair into SBOMWatcher whose pod channel is blocked at start. Push an SBOM event. Assert that either (a) no command is dispatched with Wlid="", or (b) the SBOM is re-enqueued and eventually dispatched with the correct Wlid after the pod event is unblocked.
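The scenario can be prototyped against a toy model before the real test seam exists. In the sketch below, correlate is a hypothetical stand-in for the watcher loop (it parks SBOMs until the pod event arrives), and runBlockedPodScenario plays the role of the test body; in the repository this would be a _test.go file using the testing package and the real SBOMWatcher. The sample Wlid string is illustrative only.

```go
package main

type podEvent struct{ Hash, Wlid string }
type sbomEvent struct{ Hash string }
type command struct{ Hash, Wlid string }

const sampleWlid = "wlid://cluster-demo/namespace-default/deployment-app"

// correlate is a toy model of the watcher: it never emits a command until it
// knows the Wlid for the SBOM's image hash.
func correlate(pods <-chan podEvent, sboms <-chan sbomEvent, out chan<- command) {
	wlids := make(map[string]string)
	waiting := make(map[string]int) // image hash -> number of parked SBOMs
	for pods != nil || sboms != nil {
		select {
		case p, ok := <-pods:
			if !ok {
				pods = nil // closed: stop selecting on this channel
				continue
			}
			wlids[p.Hash] = p.Wlid
			for n := waiting[p.Hash]; n > 0; n-- {
				out <- command{Hash: p.Hash, Wlid: p.Wlid}
			}
			delete(waiting, p.Hash)
		case s, ok := <-sboms:
			if !ok {
				sboms = nil
				continue
			}
			if wlid, found := wlids[s.Hash]; found {
				out <- command{Hash: s.Hash, Wlid: wlid}
			} else {
				waiting[s.Hash]++
			}
		}
	}
	close(out)
}

// runBlockedPodScenario sends the SBOM while no pod event is available, then
// releases the pod event and collects every command that was dispatched.
func runBlockedPodScenario() []command {
	pods := make(chan podEvent)
	sboms := make(chan sbomEvent)
	out := make(chan command, 4)
	go correlate(pods, sboms, out)

	sboms <- sbomEvent{Hash: "h1"}                 // SBOM arrives first
	pods <- podEvent{Hash: "h1", Wlid: sampleWlid} // pod event is unblocked afterwards
	close(sboms)
	close(pods)

	var got []command
	for c := range out {
		got = append(got, c)
	}
	return got
}

// hasEmptyWlid reports whether any dispatched command lacks a Wlid.
func hasEmptyWlid(cmds []command) bool {
	for _, c := range cmds {
		if c.Wlid == "" {
			return true
		}
	}
	return false
}
```

Collecting every command and checking the whole slice avoids timing-based assertions: a premature dispatch would show up as an extra entry with an empty Wlid.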
Source
operator/watcher/sbomwatcher.go:82 — the FIXME marking the race.
operator/watcher/sbomwatcher.go:43-51 — listPods only runs when ServiceDiscovery.Enabled; even then it only enqueues pod events rather than populating the map directly.
operator/watcher/sbomwatcher.go:54 — HandleSBOMEvents goroutine started before the SBOM list paging at lines 65-77.
operator/watcher/sbomwatcher.go:65-77 — initial SBOM list, enqueued before the pod informer warms up.
operator/watcher/sbomwatcher.go:137-191 — HandleSBOMEvents reads the racing map (line 163) and dispatches the scan command (lines 181-190).
operator/watcher/sbomwatcher.go:221-229 — validateContainerData (missing the Wlid check).
kubevuln/controllers/http.go:140-175 — HTTP handler that forwards the command to the scan service.
kubevuln/core/services/scan.go:857-881 — ValidateScanCVE only validates ImageHash and ImageSlug, not Wlid.
kubevuln/core/services/scan.go:499-507 — if workload.Wlid != "" guard that silently skips platform CVE submission when Wlid is empty.