Operator SBOM watcher dispatches TypeScanImages commands with empty Wlid when SBOM event beats Pod event, vuln results lose workload attribution #378

@yugal07

Description

The operator's SBOMWatcher correlates incoming SBOMSyft events with their owning workload by looking up the image hash in an in-memory map (ImageToContainerData) that is populated by Pod events from a separate informer channel. The two channels are merged in a select:

// operator/watcher/sbomwatcher.go:81-104 (excerpt)
for {
    select {
    // FIXME select processes the events randomly, so we might see the SBOM event before the pod event
    case event := <-wh.eventQueue.ResultChan: // pod events fill ImageToContainerData
        ...
        for _, containerStatus := range containerStatuses {
            hash := hashFromImageID(containerStatus.ImageID)
            wh.ImageToContainerData.Set(hash, utils.ContainerData{
                ContainerName: containerStatus.Name,
                Wlid:          wlid,
            })
        }
    case sbomEvent, ok := <-sbomEvents:
        if ok { eventQueue.Enqueue(sbomEvent) } else { ... }
    ...
    }
}
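
The FIXME's claim can be checked in isolation. When more than one case of a select is ready, Go chooses among them pseudo-randomly, so neither channel gets priority. The following is a minimal sketch, not operator code: the two buffered channels are hypothetical stand-ins for the pod and SBOM event channels.

```go
package main

import "fmt"

// countSelections makes both channels ready in every round and records
// which case select picks. The channels are hypothetical stand-ins for
// the pod and SBOM event channels.
func countSelections(rounds int) (podFirst, sbomFirst int) {
	for i := 0; i < rounds; i++ {
		pods := make(chan string, 1)
		sboms := make(chan string, 1)
		pods <- "pod-event"
		sboms <- "sbom-event"

		// Both cases are ready, so the runtime picks one at random.
		select {
		case <-pods:
			podFirst++
		case <-sboms:
			sbomFirst++
		}
	}
	return podFirst, sbomFirst
}

func main() {
	podFirst, sbomFirst := countSelections(1000)
	fmt.Printf("pod first: %d, sbom first: %d\n", podFirst, sbomFirst)
}
```

Over many rounds both counters end up non-zero, which is the situation the FIXME describes: the SBOM event can be taken before the pod event even when both are already queued.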

Two problems compose:

  1. Initial list races with pod cache fill. Before the select loop starts, the watcher pages the full existing SBOM list and enqueues each item as a watch.Added event (sbomwatcher.go:65-77). The race is sharper than the FIXME at line 82 suggests, because HandleSBOMEvents runs as its own goroutine (started at sbomwatcher.go:54, before the SBOM list call) and consumes from a separate cooldown queue. The ImageToContainerData map it reads from is only populated by the main select loop draining pod events from wh.eventQueue (lines 83-101). So even though listPods is invoked synchronously at line 45 when ServiceDiscovery.Enabled == true, it only enqueues pod events; it does not directly populate the map. The map is filled lazily by the select loop, which has not yet started when the SBOM goroutine begins servicing events.

    Two concrete failure paths:

    • With ServiceDiscovery.Enabled == true: race between the SBOM-handling goroutine and the select loop draining queued pod events. Whichever wins decides whether Wlid is empty.
    • With ServiceDiscovery.Enabled == false (line 43): listPods never runs at all, so ImageToContainerData is never populated and every SBOM dispatch goes out with Wlid="".
  2. validateContainerData does not check Wlid. When HandleSBOMEvents processes an SBOM whose image hash isn't in ImageToContainerData yet, imageContainerData is the zero-value struct and containerData.Wlid == "". The validation step only rejects empty ImageID/ImageTag:

    // sbomwatcher.go:221-229
    func validateContainerData(containerData *utils.ContainerData) error {
        if containerData.ImageID == "" { return ErrMissingImageID }
        if containerData.ImageTag == "" { return ErrMissingImageTag }
        return nil
    }

    So validation passes, and a TypeScanImages command is dispatched with Wlid = "":

    // sbomwatcher.go:181-191
    cmd := &apis.Command{
        Wlid:        containerData.Wlid,          // = ""
        CommandName: apis.TypeScanImages,
        Args: map[string]interface{}{
            utils.ArgsContainerData: containerData,
        },
    }
    logger.L().Info("scanning SBOM", helpers.String("wlid", cmd.Wlid), ...)
    producedCommands <- cmd

    The downstream kubevuln HTTP scan endpoint (controllers/http.go:140-175) accepts an empty Wlid and scans the image. ScanService.ValidateScanCVE (kubevuln/core/services/scan.go:857-881) only checks ImageHash and ImageSlug, so the empty-Wlid command passes validation and the scan executes. Then at kubevuln/core/services/scan.go:499-507, the platform CVE submission is gated on if workload.Wlid != "" — meaning the manifest is computed but silently never submitted to the backend when Wlid is empty. End-to-end effect: the operator dispatches a malformed scan, kubevuln burns the CPU/network to scan the image, and the result is silently dropped from the platform submission path. There is no retry path: once the SBOM event has been drained from the channel, it is gone, and no error log above info is emitted anywhere along the way.

    The presence of the if workload.Wlid != "" guard in kubevuln suggests the empty-Wlid case was already a known-bad input on the consumer side; it was patched defensively downstream rather than rejected at the source.
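
The second problem can be reproduced without the operator. This is a minimal sketch under stated assumptions: ContainerData is a simplified stand-in for utils.ContainerData, a plain Go map replaces ImageToContainerData, and buildFromSBOM is a hypothetical helper that assumes the image fields come from the SBOM's annotations while the workload fields come from the pod-populated map.

```go
package main

import (
	"errors"
	"fmt"
)

// ContainerData is a simplified stand-in for utils.ContainerData.
type ContainerData struct {
	ImageID       string
	ImageTag      string
	ContainerName string
	Wlid          string
}

var (
	ErrMissingImageID  = errors.New("missing image ID")
	ErrMissingImageTag = errors.New("missing image tag")
)

// validateCurrent mirrors the checks quoted above: Wlid is never inspected.
func validateCurrent(c *ContainerData) error {
	if c.ImageID == "" {
		return ErrMissingImageID
	}
	if c.ImageTag == "" {
		return ErrMissingImageTag
	}
	return nil
}

// buildFromSBOM is a hypothetical model of the lookup. A missing map key
// yields the zero-value struct, so Wlid and ContainerName come back empty.
func buildFromSBOM(imageID, imageTag, hash string, podData map[string]ContainerData) ContainerData {
	fromPod := podData[hash] // zero value when no pod event was processed
	return ContainerData{
		ImageID:       imageID,
		ImageTag:      imageTag,
		ContainerName: fromPod.ContainerName,
		Wlid:          fromPod.Wlid,
	}
}

func main() {
	podData := map[string]ContainerData{} // no pod event processed yet
	cd := buildFromSBOM("sha256:abc", "example/app:1.0", "abc", podData)
	fmt.Printf("validation error: %v, wlid: %q\n", validateCurrent(&cd), cd.Wlid)
}
```

The lookup miss is indistinguishable from a hit once the struct has been copied, which is why the check has to happen either at lookup time or in validation.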

Environment

  • Repo: kubescape/operator at main HEAD
  • Go: 1.25.0 (per go.mod)
  • Tests: go test ./... in operator/ — passes (the race is not covered by an existing test).

Steps To Reproduce

This is hard to script in a unit test today because HandleSBOMEvents and the populating goroutine share state through a private map. Conceptually:

  1. Apply an SBOMSyft object to storage with valid ImageID/ImageTag annotations before the corresponding Pod is observed by the operator. For example, delete the operator pod, leave SBOMs and pods alive, then restart only the operator: the initial SBOM list runs immediately while the pod informer is still warming up.
  2. Observe kubescape "scanning SBOM" log lines with wlid="".
  3. Inspect the vuln report: the affected entries have no workload attribution.

A minimal reproducible unit test could be added by exposing a test seam (e.g. injecting the initial list and pod informer as separate channels in a constructor).

Expected behavior

Either:

  • Wait for the Wlid. If the image hash is not yet in ImageToContainerData, the SBOM event should be parked (re-enqueued with backoff, or buffered until the corresponding pod event arrives) rather than dispatched with an empty Wlid. The map can also be pre-populated by listing pods before the SBOM list call.

  • Reject explicitly. If parking is too complex, validateContainerData should treat empty Wlid as an error — at least surfacing the problem as ErrMissingWlid instead of dispatching a malformed command. This trades a silent attribution loss for an explicit, retryable failure.

In either case the FIXME at line 82 should be resolved, not just annotated.
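
The "buffered until the corresponding pod event arrives" variant of the first option could look roughly like this. It is a sketch only: parkingLot, onSBOM and onPod are hypothetical names that do not exist in the operator, SBOMs are reduced to a name string, and "example-wlid" is a placeholder value.

```go
package main

import (
	"fmt"
	"sync"
)

// parkingLot is a hypothetical helper, not part of the operator. It holds
// SBOMs whose image hash has no known workload yet.
type parkingLot struct {
	mu     sync.Mutex
	wlids  map[string]string   // image hash -> Wlid, filled by pod events
	parked map[string][]string // image hash -> SBOM names waiting for a Wlid
}

func newParkingLot() *parkingLot {
	return &parkingLot{
		wlids:  make(map[string]string),
		parked: make(map[string][]string),
	}
}

// onSBOM returns the Wlid and true when the SBOM can be dispatched right
// away. Otherwise it parks the SBOM and returns false.
func (p *parkingLot) onSBOM(hash, sbomName string) (string, bool) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if wlid, ok := p.wlids[hash]; ok {
		return wlid, true
	}
	p.parked[hash] = append(p.parked[hash], sbomName)
	return "", false
}

// onPod records the hash -> Wlid mapping and returns the SBOMs that were
// waiting for it, so the caller can dispatch them with the correct Wlid.
func (p *parkingLot) onPod(hash, wlid string) []string {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.wlids[hash] = wlid
	released := p.parked[hash]
	delete(p.parked, hash)
	return released
}

func main() {
	lot := newParkingLot()
	_, ready := lot.onSBOM("abc", "sbom-1")
	fmt.Println("dispatch immediately:", ready)
	fmt.Println("released by pod event:", lot.onPod("abc", "example-wlid"))
}
```

A real implementation would also need an eviction policy for SBOMs whose pod never shows up, which this sketch leaves out.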

Actual Behavior

  • On operator startup, the full existing SBOM list is enqueued before the pod cache has filled. Affected entries are dispatched with Wlid="".
  • When ServiceDiscovery.Enabled == false, listPods is never called and every SBOM dispatch goes out with Wlid="".
  • validateContainerData accepts the empty-Wlid case as valid.
  • A vuln scan command is sent over the worker pool with the malformed identifier.
  • kubevuln's ValidateScanCVE does not check Wlid either, so the scan executes.
  • At kubevuln/core/services/scan.go:500 the platform CVE submission is gated on if workload.Wlid != "" and silently no-ops for empty Wlid — the scan result is computed but never submitted to the backend.
  • No log line above info is emitted that would let an operator know correlation was lost or that the submission was skipped.

Suggested fix

Minimum, low-risk change in operator/watcher/sbomwatcher.go:

// sbomwatcher.go:221
func validateContainerData(containerData *utils.ContainerData) error {
    if containerData.ImageID == "" {
        return ErrMissingImageID
    }
    if containerData.ImageTag == "" {
        return ErrMissingImageTag
    }
    if containerData.Wlid == "" {
        return ErrMissingWlid
    }
    return nil
}

This should be combined with re-queueing the SBOM event on ErrMissingWlid (with a small TTL/backoff), so that a slightly-too-early arrival doesn't drop the scan permanently. Pseudocode in HandleSBOMEvents:

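if err := validateContainerData(containerData); err != nil {
    if errors.Is(err, ErrMissingWlid) {
        // Pod event not processed yet: park the SBOM event and retry later.
        // deferredSBOMs is a proposed delay queue; it does not exist today.
        wh.deferredSBOMs.Enqueue(e, time.Now().Add(5*time.Second))
        continue
    }
    // Any other validation failure is reported as before.
    errorCh <- err
    continue
}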

A small worker drains deferredSBOMs on a timer and re-enqueues into the main queue, with a max-retries cap.
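
One possible shape for that deferred queue, as a sketch rather than a proposal for the final API: deferredQueue, deferredItem and the at helper are hypothetical, the Enqueue signature differs from the pseudocode above in that it takes the retry count explicitly, and the current time is passed in so the behaviour is deterministic.

```go
package main

import (
	"fmt"
	"time"
)

// deferredItem is a hypothetical record for one parked SBOM event.
type deferredItem struct {
	name    string    // SBOM identifier
	readyAt time.Time // earliest time the event may be retried
	retries int       // attempts made so far, including this one
}

// deferredQueue is a hypothetical delay queue with a retry cap.
// It is not safe for concurrent use without external locking.
type deferredQueue struct {
	maxRetries int
	items      []deferredItem
}

// Enqueue parks an item until readyAt. It returns false and drops the
// item once the retry budget is used up.
func (q *deferredQueue) Enqueue(name string, retries int, readyAt time.Time) bool {
	if retries >= q.maxRetries {
		return false
	}
	q.items = append(q.items, deferredItem{name: name, readyAt: readyAt, retries: retries + 1})
	return true
}

// Drain removes and returns every item whose readyAt is not after now.
func (q *deferredQueue) Drain(now time.Time) []deferredItem {
	var ready, waiting []deferredItem
	for _, it := range q.items {
		if it.readyAt.After(now) {
			waiting = append(waiting, it)
		} else {
			ready = append(ready, it)
		}
	}
	q.items = waiting
	return ready
}

// at builds a fixed timestamp so the demo does not depend on the clock.
func at(sec int64) time.Time { return time.Unix(sec, 0) }

func main() {
	q := &deferredQueue{maxRetries: 3}
	q.Enqueue("sbom-1", 0, at(5))
	fmt.Println("ready at t=0s:", len(q.Drain(at(0))))
	fmt.Println("ready at t=5s:", len(q.Drain(at(5))))
}
```

The worker would call Drain on a ticker and push the returned items back into the main queue; an item that Enqueue rejects should be logged at warning level so the lost attribution is visible.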

A better long-term fix: order the startup explicitly. List pods first, populate ImageToContainerData synchronously, then call the initial SBOM EachListItem. The current code does the SBOM list (sbomwatcher.go:65) without waiting on any pod-informer sync signal.
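
The ordering can be illustrated with a toy model. Everything here is an assumption for illustration: the pod, sbom and command types are simplified stand-ins, startup and populateFromPods are hypothetical names, and a plain map replaces ImageToContainerData.

```go
package main

import "fmt"

// Simplified, hypothetical stand-ins for the real objects.
type pod struct{ imageHash, wlid string }
type sbom struct{ name, imageHash string }
type command struct{ sbomName, wlid string }

// populateFromPods fills the hash -> Wlid map synchronously from a pod list.
func populateFromPods(pods []pod) map[string]string {
	m := make(map[string]string, len(pods))
	for _, p := range pods {
		m[p.imageHash] = p.wlid
	}
	return m
}

// startup runs the two phases in a fixed order: pods first, then SBOMs.
// SBOMs whose image has no known workload are returned as skipped instead
// of being dispatched with an empty Wlid.
func startup(pods []pod, sboms []sbom) (cmds []command, skipped []sbom) {
	wlids := populateFromPods(pods) // phase 1 finishes before phase 2 begins
	for _, s := range sboms {
		wlid, ok := wlids[s.imageHash]
		if !ok {
			skipped = append(skipped, s)
			continue
		}
		cmds = append(cmds, command{sbomName: s.name, wlid: wlid})
	}
	return cmds, skipped
}

func main() {
	pods := []pod{{imageHash: "abc", wlid: "example-wlid"}}
	sboms := []sbom{{name: "sbom-1", imageHash: "abc"}, {name: "sbom-2", imageHash: "zzz"}}
	cmds, skipped := startup(pods, sboms)
	fmt.Println("dispatched:", cmds)
	fmt.Println("skipped:", skipped)
}
```

Ordering alone does not cover pods that appear after startup, so the skipped SBOMs still need the park-or-reject handling described above.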

Regression test: inject a fake informer pair into SBOMWatcher whose pod channel is blocked at start. Push an SBOM event. Assert that either (a) no command is dispatched with Wlid="", or (b) the SBOM is re-enqueued and eventually dispatched with the correct Wlid after the pod event is unblocked.

Source

  • operator/watcher/sbomwatcher.go:82: the FIXME marking the race.
  • operator/watcher/sbomwatcher.go:43-51: listPods only runs when ServiceDiscovery.Enabled; even then it only enqueues pod events rather than populating the map directly.
  • operator/watcher/sbomwatcher.go:54: HandleSBOMEvents goroutine started before the SBOM list paging at lines 65-77.
  • operator/watcher/sbomwatcher.go:65-77: initial SBOM list, enqueued before the pod informer warms up.
  • operator/watcher/sbomwatcher.go:137-191: HandleSBOMEvents reads the racing map (line 163) and dispatches the scan command (lines 181-190).
  • operator/watcher/sbomwatcher.go:221-229: validateContainerData (missing the Wlid check).
  • kubevuln/controllers/http.go:140-175: HTTP handler that forwards the command to the scan service.
  • kubevuln/core/services/scan.go:857-881: ValidateScanCVE only validates ImageHash and ImageSlug, not Wlid.
  • kubevuln/core/services/scan.go:499-507: the if workload.Wlid != "" guard that silently skips platform CVE submission when Wlid is empty.

Metadata

  • Labels: bug
  • Status: High Priority