diff --git a/.github/actions/spelling/excludes.txt b/.github/actions/spelling/excludes.txt index bfa9af201496a..02ee2980a4139 100644 --- a/.github/actions/spelling/excludes.txt +++ b/.github/actions/spelling/excludes.txt @@ -2,3 +2,7 @@ ^\.github/actions/spelling/ ^\Qbenches/codecs/moby_dick.txt\E$ ^\Qwebsite/layouts/shortcodes/config/unit-tests.html\E$ +# Antithesis test harness: research notes + exerciser examples carry heavy jargon (sancov, libvoidstar, rkyv, vdbuf, lossfinder, ...) +^tests/antithesis/ +^\Qlib/vector-buffers/examples/disk_v2_antithesis.rs\E$ +^\Qlib/vector-buffers/examples/disk_v2_lossfinder.rs\E$ diff --git a/tests/antithesis/scratchbook/.markdownlint.jsonc b/tests/antithesis/scratchbook/.markdownlint.jsonc new file mode 100644 index 0000000000000..0a0effd95950a --- /dev/null +++ b/tests/antithesis/scratchbook/.markdownlint.jsonc @@ -0,0 +1,10 @@ +// Scoped relaxations for the Antithesis research scratchbook. These are internal +// working notes (dense property tables, ad-hoc code fences), not published docs, +// so a few cosmetic rules are disabled here only — the repo-wide config still +// applies everywhere else. +{ + "extends": "../../../.markdownlint.jsonc", + "MD060": false, // table-column-style: the property-catalog tables use empty-header `| | |` 2-col layout + "MD040": false, // fenced-code-language: many ad-hoc evidence snippets are intentionally language-less + "MD022": false // blanks-around-headings: relaxed for the dense note format +} diff --git a/tests/antithesis/scratchbook/_external-references-digest.md b/tests/antithesis/scratchbook/_external-references-digest.md new file mode 100644 index 0000000000000..8f7d976f6e2f3 --- /dev/null +++ b/tests/antithesis/scratchbook/_external-references-digest.md @@ -0,0 +1,113 @@ +# External References Digest (working note for discovery agents) + +This is scaffolding for the antithesis-research run on **disk buffers v2** +(`lib/vector-buffers/src/variants/disk_v2/`). 
User scope answer: *"Whatever you +have access to. You have your MCPs."* — so in-repo docs/RFCs plus Datadog +internal doc/Jira were consulted. Key findings condensed below so per-focus agents +don't need to re-fetch. + +## In-repo references + +- `rfcs/2021-10-14-9477-buffer-improvements.md` — original buffer-rework RFC. +- `docs/specs/buffer.md` — buffer component spec / claimed behavior. +- `lib/vector-buffers/src/variants/disk_v2/mod.rs` — authoritative design doc + (module-level comment): on-disk format, ledger, record IDs, recovery. + +## Claimed guarantees (from `mod.rs` design doc + buffer spec + internal doc) + +- Data files never exceed 128MB; ≤ 65,536 files; buffer ≤ ~8TB. +- All records checksummed with **CRC32C**; records written + sequentially/contiguously; a record never spans two data files. +- Writers create+write data files; readers read+delete them. Reader deletes a + data file **only after all records in it are acknowledged** (whole-file + deletion, never partial truncation). +- Ledger (`buffer.db`, memory-mapped) tracks `writer_next_record_id`, + `writer_current_data_file_id`, `reader_current_data_file_id`, + `reader_last_record_id`. Fields updated atomically, but **not** atomically + w.r.t. reader/writer activity. +- Record IDs are monotonic and encode event count: record ID N with next record + M means the record holds M−N events. Used to compute buffer event count and to + detect gaps / dropped events after corruption. +- **Durability:** data is fsync'd every **500ms** (`DEFAULT_FLUSH_INTERVAL`). + Page-cache flush happens on every `flush()` (readers see data immediately on + Linux); full fsync only every 500ms. **Data-loss window on crash = up to 500ms + of unsynced writes** (when e2e acks off). Graceful shutdown flushes everything + → no loss. +- Min buffer `max_size` ~256MB; `DEFAULT_MAX_DATA_FILE_SIZE` 128MB; + `DEFAULT_MAX_RECORD_SIZE` = 128MB; `DEFAULT_WRITE_BUFFER_SIZE` 256KB. 
+- Endianness: files are host-endian; not portable across architectures. +- Delivery semantics with e2e acks + disk buffer = **at-least-once**: crash after + buffer write but before downstream ack → replay on restart → **possible + duplicates** (downstream must dedup). + +## Known bugs / incidents (HIGH-VALUE Antithesis targets) + +1. **Ledger `total_buffer_size` AtomicU64 underflow → permanent writer deadlock** + (Vector #21683, partially mitigated by PR #23561 on the *reporter* side only; + the ledger atomic still wraps). + - `decrement_total_buffer_size` (ledger.rs ~291-298) does raw + `fetch_sub(amount, AcqRel)` with **no saturation**. If `amount > + current_value`, the atomic wraps to ≈ 2^64. + - Then `total_buffer_size + unflushed_bytes` is always astronomical → + `is_buffer_full()` returns true forever → `can_write_record()` false forever + → writer's `ensure_ready_for_write()` (writer.rs ~1001-1020) loops on + `ledger.wait_for_reader().await` and never recovers. **Writer deadlocks + permanently.** + - Trigger: crash/reboot/abrupt-shutdown that leaves a data file whose on-disk + size and readable-record bytes disagree, combined with the reader running + through that file on restart. Partial writes at file-rotation boundaries are + the most plausible cause. Not deterministic per-restart, but not exotic. + - Reporter-side gauges use `saturating_sub` (PR #23561) so the *dashboard* + no longer shows 2^64, but the ledger control-path atomic is unfixed. + +2. **Disk buffer stall + silent event drops during config reload** + (Vector #24948, PR #24949; directly implicated in an **internal non-prod + config-reload incident**). + - Old writer dropped while events still in-flight → events lost without + accounting. + - `track_dropped_events` passes `0` for `byte_size` → permanent drift in + buffer-size metrics. 
+ - `synchronize_buffer_usage()` re-seeds metrics while the old reporter may + still run → double-counted metric spikes; then a metrics gap between old + reporter teardown and the first tick (2s) of the new reporter. + +3. **`component_discarded_events_total` blind to buffer drops** (Vector #24606, + #24144). When a disk buffer fills and `drop_newest` fires, only + `buffer_discarded_events_total` increments; the component-level discarded + counter stays 0 → silent data loss on dashboards. `BufferEventsDropped::emit()` + in `lib/vector-buffers/src/internal_events.rs` never calls + `ComponentEventsDropped`. + +4. **Buffer size gauges stuck non-zero / negative** (Vector #23995, #17666, + #21683). Reporter `current() = total_entered.saturating_sub(total_left)`; + stuck-at-non-zero still open. + +5. **Component tags lost for sinks using disk buffers** (OPA-5380): components + paused for IO at init time lose `component_*` labels on later-registered + metrics (utilization, etc.). + +## Existing test strategy (so we don't duplicate it) + +- In-repo: extensive `proptest` + **model-based testing** under + `variants/disk_v2/tests/model/` (a reference model + action sequencer + + in-memory filesystem). Unit tests for acknowledgements, initialization, + known_errors, size_limits, invariants, record. +- Datadog internal: an E2E **chaos test** that SIGKILLs the worker 3× with e2e acks + enabled and asserts every event is delivered end-to-end. Antithesis should go + beyond: explore fault *timing/interleavings* (partial writes at rotation, + fsync-vs-crash windows, reader/writer races on the mmap'd ledger) that a fixed + 3×SIGKILL test cannot. +- A **major lock-contention performance issue** affected all disk-buffer users + (writer throughput ~90 MiB/s capped by contention) — points at writer/reader + coordination hot paths. 
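The wrap described in known bug 1 above can be reproduced with the standard library alone. What follows is a minimal sketch rather than the ledger's actual code: `decrement_wrapping`, `decrement_saturating`, and `is_buffer_full_sketch` are hypothetical names, the fullness check is an assumed simplification of `is_buffer_full()`, and the saturating variant is only one possible guard, not the project's fix.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical stand-in for an unguarded decrement.
/// `AtomicU64::fetch_sub` wraps around on underflow instead of clamping.
fn decrement_wrapping(total: &AtomicU64, amount: u64) -> u64 {
    total.fetch_sub(amount, Ordering::AcqRel);
    total.load(Ordering::Acquire)
}

/// One possible guard (an assumption, not the project's fix): clamp at zero.
fn decrement_saturating(total: &AtomicU64, amount: u64) -> u64 {
    // The closure always returns Some, so this fetch_update cannot fail.
    let _ = total.fetch_update(Ordering::AcqRel, Ordering::Acquire, |cur| {
        Some(cur.saturating_sub(amount))
    });
    total.load(Ordering::Acquire)
}

/// Assumed simplification of the fullness check: accounted bytes plus
/// unflushed bytes compared against the configured maximum.
fn is_buffer_full_sketch(total: u64, unflushed: u64, max_size: u64) -> bool {
    total.saturating_add(unflushed) >= max_size
}
```

With a counter holding 4 bytes and a decrement of 10, the wrapping variant leaves a value near 2^64, so the sketched fullness check stays true for any realistic `max_size`; the saturating variant leaves 0 and the check clears.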
+ +## Notes on faults + +- Crash-recovery properties require **node termination faults** (often disabled + by default in Antithesis tenants) — flag this in the catalog. +- The disk buffer is **single-process** (intra-Vector reader+writer sharing an + mmap'd ledger). Network/partition faults are largely irrelevant to the buffer + itself; the strong levers are node kill/restart, node hang, CPU throttling + (exposes the fsync/flush timing windows and lock contention), and filesystem + state across restart. + diff --git a/tests/antithesis/scratchbook/deployment-topology.md b/tests/antithesis/scratchbook/deployment-topology.md new file mode 100644 index 0000000000000..7498fa5fdd69d --- /dev/null +++ b/tests/antithesis/scratchbook/deployment-topology.md @@ -0,0 +1,174 @@ +--- +sut_path: /home/ssm-user/src/vector +commit: b7aae737cef5dd37d1445915443a1eb97b584f85 +updated: 2026-05-28 +external_references: + - path: lib/vector-buffers/src/variants/disk_v2/mod.rs + why: Confirms the buffer is single-process (intra-Vector reader+writer over an mmap'd ledger) + - path: (internal design doc, not linked) + why: Disk buffer is configured per-sink; e2e acks require a supporting source; at-least-once semantics + - path: (internal design doc, not linked) + why: Existing chaos test crashes the worker with SIGKILL x3 + e2e acks — the topology must support repeated kill/restart + - path: distribution/docker/ + why: Existing Vector Dockerfiles to reuse/adapt for the SUT container +--- + +# Deployment Topology: Disk Buffer v2 + +## Key fact driving the design + +The disk buffer is **single-process**: the reader, writer, and finalizer all run +inside one Vector process, coordinating through an `mmap`'d ledger and the local +filesystem. There is **no network, no peer, no quorum**. 
Therefore: + +- The strong fault levers are **node termination (kill/restart)**, **node hang**, + **CPU throttling**, **clock jitter**, and **filesystem state across restart** — + NOT network partitions or bad-node faults (those are irrelevant to the buffer). +- The topology is minimal: **one SUT container + one workload/client container.** + No dependency containers are needed (no S3/Kafka/Postgres) — the buffer's only + "dependency" is the local filesystem. + +## Topology + +```text ++-----------------------------+ events (HTTP, e2e-ack-capable source) +| workload (client) | -----------------------------------------> +-----------------------------+ +| - produces unique event IDs| | vector (SUT) | +| - HTTP collector endpoint | <----------------------------------------- | source -> sink(disk buffer)| +| - tracks produced/delivered| sink delivers here (HTTP sink) | data_dir on PERSISTENT vol | +| - emits Antithesis asserts | +-----------------------------+ +| - test template /opt/... | | Antithesis injects ++-----------------------------+ | node-kill / hang / + | CPU-throttle / clock + v faults HERE + +-----------------------------+ + | persistent volume | + | /buffer/v2// | + +-----------------------------+ +``` + +## Containers + +### 1. `vector` — Service (the SUT) + +- **Image:** adapt an existing Dockerfile from `distribution/docker/` (Debian or + Distroless). Two build variants: + - **Baseline build:** stock Vector — exercises all workload-observable + properties (durability, at-least-once, deadlock-via-throughput-stall, metric + correctness, recovery). + - **Instrumented build (recommended for the deadlock/corruption cluster):** + Vector built with the **Antithesis Rust SDK** added as a dependency to + `lib/vector-buffers`, with the missing SUT-side assertions inserted (see + "SUT-side instrumentation" below). 
This is the only way to directly assert + the internal states (`total-buffer-size-never-underflows`, + `record-id-monotonicity-holds`, `partial-write-at-rotation-recovers`, + `graceful-shutdown-flushes-all`/`unflushed_bytes==0`) that are invisible from + the workload. +- **Runs:** a single `vector` process with a config: + - `source`: an e2e-ack-capable source the workload can push to. Prefer + `datadog_agent` or `http_server` with `acknowledgements: true` (needed for + `every-written-event-eventually-delivered` and the durable-survival + properties). Keep one source. + - `sink`: an `http` sink with `buffer: { type: disk, max_size: <~256MB+>, + when_full: block }`, posting to the workload's collector. A second + config/run uses `when_full: drop_newest` for `dropped-events-are-counted`. + - Internal metrics exposed (e.g. `internal_metrics` → `prometheus_exporter`) + so the workload can read `buffer_*` / `component_discarded_events_total` for + the metric-correctness properties. +- **CRITICAL — persistent buffer storage:** the disk-buffer `data_dir` MUST be on + storage that **survives the container's kill/restart**. Disk-buffer durability + is the whole point; if Antithesis node-termination recreates the container with + a fresh filesystem, the buffer is wiped and every crash-recovery property + passes vacuously (or fails spuriously). Mount `` on a persistent + volume. **Confirm with the user how their tenant's node-termination interacts + with filesystem persistence.** +- **Faults target this container:** node kill/restart (required by Categories + 2–6), node hang, CPU throttle (widens fsync/lock-contention windows), clock + jitter (perturbs the 500ms `should_flush` deadline). +- **Replica count:** 1. (No replication; more instances add nothing.) +- **Tuning for bug-finding:** set a small `max_data_file_size` (e.g. 
1MB) and a + small `max_size` to maximize file-rotation frequency and reach the rotation/ + partial-write window faster; optionally set `flush_interval` low to widen the + durably-written set, or high to widen the loss window — test both. + +### 2. `workload` — Client (the test driver) + +- **Image:** a small Rust (or Go) container with the **Antithesis Rust SDK** (to + match the SUT language and emit assertions). Includes the test template at + `/opt/antithesis/test/v1/{name}/`. +- **Runs:** + 1. Starts an HTTP **collector** endpoint (the sink's destination) that records + every delivered event ID (counting duplicates). + 2. Emits `setup_complete` once it and Vector are ready. + 3. Sleeps so Antithesis can run test-template commands. +- **Test-template commands** drive: produce a stream of uniquely-IDed events to + Vector's source; periodically (via `ANTITHESIS_STOP_FAULTS` quiet periods) + drain and assert liveness/at-least-once; inspect Vector's metrics; toggle the + collector to return errors (for `sink-failure-not-silently-acked`); trigger a + config reload (custom fault, for `config-reload-no-silent-loss`). +- **Assertions emitted here** (workload-observable properties): at-least-once + set-difference, no-loss-on-graceful-shutdown, drop accounting vs metric, writer + throughput resumes after recovery (deadlock signal), buffer gauges return to ~0 + on drained restart. +- **Replica count:** 1. + +## SUT-side instrumentation (for the instrumented build) + +No Antithesis SDK exists in the repo today (`existing-assertions.md`). For the +internal-state properties, add `antithesis-sdk` to `lib/vector-buffers/Cargo.toml` +and insert (all currently MISSING): + +- `assert_unreachable!` / `assert_always!(amount <= current)` at the two unguarded + subtraction sites: `ledger.rs:~292` and `reader.rs:~524` + (`total-buffer-size-never-underflows`). 
+- `assert_sometimes!(writer_unblocked_after_full)` after `ensure_ready_for_write` + exits its wait loop; `assert_unreachable!` on repeated no-progress wakeups + (`writer-eventually-makes-progress`). +- `assert_unreachable!` at the monotonicity panic `reader.rs:~482` + (`record-id-monotonicity-holds`). +- `assert_always_or_unreachable!` at the record-emission point `reader.rs:~1131` + (`no-corrupted-record-delivered`) and `assert_sometimes!` in the + `is_bad_read` branch `reader.rs:~1035` (`corruption-is-detected-and-recovered`). +- `assert_sometimes!(torn_tail_recovered)` in the `validate_last_write` + recovery branches (`partial-write-at-rotation-recovers`). +- `assert_always!(unflushed_bytes == 0)` inside `close()` + (`graceful-shutdown-flushes-all`). + +These assertions are no-ops outside Antithesis, so the instrumented build is safe +to run normally. + +## Custom faults required + +- **Config reload** (`config-reload-no-silent-loss`): a custom fault that sends + `SIGHUP` to the Vector process (or swaps the config file and triggers reload), + fired under sustained load. +- **Downstream sink error** (`sink-failure-not-silently-acked`): the workload's + collector returns 5xx for a window, or a custom fault toggles it. + +## SDKs + +- **Workload:** Antithesis Rust SDK (or Go SDK) — required to emit assertions and + `setup_complete`, and to draw random numbers for the producer. +- **SUT:** Antithesis Rust SDK only for the instrumented build. + +## Simplicity note + +Two containers, one network link, no external dependency services. Every +container is justified: the SUT runs the buffer; the workload produces/observes +and asserts. We deliberately exclude S3/Kafka/etc. — the disk buffer has no such +dependency. The only non-obvious requirement is the **persistent volume for the +buffer data_dir**, which is essential for crash-durability testing to be +meaningful. 
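The workload's "at-least-once set-difference" assertion listed under the `workload` container could look roughly like the sketch below. It assumes event IDs are plain `u64` values and that the collector keeps every delivered ID, duplicates included; `DeliveryReport` and `check_at_least_once` are hypothetical names, not part of any existing harness.

```rust
use std::collections::{HashMap, HashSet};

/// Outcome of comparing what was produced with what the collector saw.
#[derive(Debug, PartialEq)]
struct DeliveryReport {
    /// Produced but never delivered: an at-least-once violation.
    missing: Vec<u64>,
    /// Delivered more than once: allowed under at-least-once, worth counting.
    duplicates: Vec<u64>,
}

/// Hypothetical check, meant to run only after a fault-free drain period.
fn check_at_least_once(produced: &HashSet<u64>, delivered: &[u64]) -> DeliveryReport {
    // Count how many times each ID reached the collector.
    let mut counts: HashMap<u64, usize> = HashMap::new();
    for id in delivered {
        *counts.entry(*id).or_insert(0) += 1;
    }

    let mut missing: Vec<u64> = produced
        .iter()
        .copied()
        .filter(|id| !counts.contains_key(id))
        .collect();
    let mut duplicates: Vec<u64> = counts
        .iter()
        .filter(|(_, n)| **n > 1)
        .map(|(id, _)| *id)
        .collect();

    // Sort so reports are stable across runs.
    missing.sort_unstable();
    duplicates.sort_unstable();
    DeliveryReport { missing, duplicates }
}
```

An empty `missing` list is the condition the workload would assert on; `duplicates` is informational, since replay after a crash is expected with e2e acks enabled.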
+ +## Open Questions + +- How does the target Antithesis tenant's node-termination fault interact with + container filesystem persistence? (Determines whether the buffer survives a + modeled crash — essential.) +- Are node-termination and clock faults enabled in the tenant? (Categories 2–6 + need kill/restart.) +- Which e2e-ack-capable source is easiest to drive from the workload — + `http_server`, `datadog_agent`, or `socket`? (Affects workload protocol.) +- Is config reload feasible as a custom fault (SIGHUP) in the harness, or must the + workload drive it via Vector's API? + diff --git a/tests/antithesis/scratchbook/evaluation/antithesis-fit.md b/tests/antithesis/scratchbook/evaluation/antithesis-fit.md new file mode 100644 index 0000000000000..fd70c01268ee9 --- /dev/null +++ b/tests/antithesis/scratchbook/evaluation/antithesis-fit.md @@ -0,0 +1,412 @@ +--- +sut_path: /home/ssm-user/src/vector +commit: b7aae737cef5dd37d1445915443a1eb97b584f85 +updated: 2026-05-28 +external_references: [] +--- + +# Antithesis Fit Evaluation: Disk Buffer v2 Property Catalog + +Evaluation lens: does Antithesis add unique value over deterministic tests for +each property, or is the property really unit/integration-test territory? Are +`Sometimes` / `Always` / `Unreachable` assertion types matched to what is +actually reachable and what Antithesis's scheduler explores? Are any properties +undervalued because the catalog underestimates how far Antithesis goes into +the unhappy paths? + +--- + +## Section 1 — Findings (properties that need reconsideration) + +### Finding 1: `record-id-wraparound-accounting-holds` — split into two distinct cases with very different Antithesis fit + +**Property:** `record-id-wraparound-accounting-holds` +**Concern:** The catalog bundles two sub-cases whose Antithesis fit differs +sharply. The empty-buffer equality case (`next_writer == last_reader` → `0 - 1 += u64::MAX`) is a **pure unit-test bug**: no faults, no concurrency, no timing +sensitivity. 
Write one batch to a fresh buffer, ack all events, call +`get_total_records()` — the bug fires deterministically on every clean restart +of a drained buffer. The evidence file (`record-id-wraparound-accounting-holds.md`) +confirms this explicitly: "On a fresh start both IDs start at 0; `wrapping_sub(0, +0) = 0`, then `0 - 1 = u64::MAX`." It also confirms that in debug builds this +panics rather than wrapping silently. A single-line unit test exposing this +would run in milliseconds and catch the regression forever. + +The true u64 wrap (writer at `u64::MAX`, reader at `u64::MAX`, writer wraps to +0) is a different matter: it requires writing `2^64 / avg_event_count` records, +which at any realistic event rate is astronomically unreachable in any timeline +— Antithesis or otherwise. There is no test-build hook that fast-forwards +record IDs to near `u64::MAX` in a running production binary (the +`unsafe_set_writer_next_record_id` helpers are `cfg(test)`-gated and unavailable +in production binaries, which is what Antithesis will run). So for an Antithesis +run against a production build, the true-wrap case is **vacuously unreachable**. + +**Scope:** property-specific +**Evidence:** Evidence file §"When does the intermediate result equal 0?"; +§"Antithesis Angle" item 2 ("requires injection via test-only helpers"); +property-catalog.md "Drain the buffer completely; restart; assert buffer-size/ +event-count gauges are near 0 (not ~2^64)." +**Suggested action:** Split into two: (a) Fix the empty-buffer `0 - 1` bug with +a unit test — remove this sub-case from the Antithesis catalog. (b) Mark the +true u64 wrap sub-case as "out of scope for Antithesis runs on production +binaries; document as a latent, astronomically-unlikely risk." If a regression +test is needed for the fix to the empty-buffer case, the unit test suite (not +Antithesis) is the correct vehicle. 
Keep the `get_total_records` return-value +sanity assertion in SUT instrumentation only as a cheap guard; it is not a +primary Antithesis target. + +--- + +### Finding 2: `record-never-spans-files` — the normal path is a pure unit test; the fault path is thin + +**Property:** `record-never-spans-files` +**Concern:** The spanning guard (`RecordWriter::can_write`) is a two-level gate: +a `u64` comparison checked before every write. The gate is correct and verified +by reading the code. The only way a record can span files is if the on-disk +`metadata().len()` value seeding `current_data_file_size` at writer-open is +underreported (Antithesis filesystem fault) or if the size counter drifts. For +the counter-drift case (plain `u64 +` addition, no saturation), the maximum +practical record size is 128MB, far below `u64::MAX`, so drift is not a +realistic concern outside cosmic-ray-level events. The fault path — corrupted +`metadata().len()` — is a plausible Antithesis filesystem fault, but: + +1. Antithesis's filesystem fault model primarily covers write/sync failures and + node-kill, not arbitrary corruption of metadata return values. Whether this + specific fault is achievable in the target tenant needs confirmation. +2. Even if `metadata().len()` is corrupted to underreport, the + `AlwaysOrUnreachable` assertion fires at `flush_record` after the size is + updated — not a difficult detection scenario for a deterministic test if you + can inject the fault at open time. +3. The evidence file's `Open Questions` section notes that the single-writer + design prevents concurrency races between `can_write` and `flush_record`. + +The `AlwaysOrUnreachable` assertion type is correct (the spanning branch should +be unreachable on a correct run), but the scenarios that can actually violate it +in an Antithesis run are narrow and depend on filesystem-metadata corruption that +may not be in Antithesis's standard fault toolkit. 
+ +**Scope:** property-specific +**Evidence:** `record-never-spans-files.md` §"Timing / Fault Conditions"; +sut-analysis.md §5/INV-2 ("Hard" invariant); the two-level gate analysis in +the evidence file confirms the guard is sound under normal and partial-write +faults. +**Suggested action:** Retain in catalog as a cheap `AlwaysOrUnreachable` +assertion placed in SUT instrumentation (zero search budget — the assertion fires +only if the condition is violated). Lower its priority relative to the +deadlock/durability cluster. Do not drive Antithesis fault strategy around this +property. If the harness can inject corrupted `metadata().len()` values, keep +it as a secondary reachability check; otherwise it is a passive safety net. + +--- + +### Finding 3: `dropped-events-are-counted` — a code bug detectable by a unit test; Antithesis value is regression-detection only + +**Property:** `dropped-events-are-counted` +**Concern:** The evidence file confirms the violation is a **missing function +call**: `BufferEventsDropped::emit` does not call `emit(ComponentEventsDropped +{...})`. This is not a timing-sensitive or concurrency-sensitive bug. You do not +need to kill nodes, explore interleaving, or inject faults to observe it — +configuring `when_full: drop_newest` and overfilling the buffer with any +deterministic test immediately reveals the discrepancy between +`buffer_discarded_events_total` and `component_discarded_events_total`. The +existing test suite almost certainly misses this only because no test checks the +component-level counter, not because the bug is timing-sensitive. + +Furthermore, the evidence file raises a key open question (OQ-4): whether +`drop_newest` is even reachable for disk buffers at all (disk-buffer +`try_write_record` may never return the item to the caller; the `was_dropped` +branch may be unreachable for disk buffers). 
If `drop_newest` is not reachable +for disk buffers, this property is vacuously satisfied for this SUT and should +be removed or redirected to in-memory buffers. + +The 2-second reporter lag (OQ-3 in the evidence file) is a minor timing concern +but trivially handled by waiting a few seconds — this is not the kind of +timing-sensitivity that requires Antithesis's systematic exploration. + +**Scope:** property-specific +**Evidence:** `dropped-events-are-counted.md` §"Call Chain"; direct `grep` in +evidence: "There is no reference to `ComponentEventsDropped` anywhere in +`lib/vector-buffers/`." +**Suggested action:** (a) Confirm first whether `drop_newest` is actually +reachable for disk buffers (OQ-4). If not, remove from catalog. If yes: (b) +file a unit test that configures `drop_newest`, overfills the buffer, waits 3s, +and asserts `component_discarded_events_total >= buffer_discarded_events_total`. +This will catch the bug and serve as a regression test without consuming +Antithesis search budget. In Antithesis, keep only the workload-side metric +assertion as a cheap pass-through check (no additional search budget); it should +not drive Antithesis fault strategy. + +--- + +### Finding 4: `sink-failure-not-silently-acked` — logic bug, but observability requires fault injection; Antithesis fit is moderate, not high + +**Property:** `sink-failure-not-silently-acked` +**Concern:** The violation (`_status` discarded in `spawn_finalizer`) is a +confirmed logic bug, but unlike the deadlock cluster bugs it is observable +without a kill-and-restart: within a single process lifetime, making the +downstream sink return `Errored`/`Rejected` status for a sustained window and +then checking that events are retained (not dropped) is a deterministic +integration-test scenario. The key open question from the evidence file (OQ-1) +is whether any actual Vector sink emits `Errored`/`Rejected` status at all, or +whether sinks swallow errors internally. 
If sinks do not propagate non-Delivered +status, the bug is dormant without a custom test sink. + +Antithesis does add some value here: the "make the downstream sink error for a +window" scenario benefits from fault injection (toggling the mock sink between +5xx and 2xx), and the timing of how long errors persist relative to the buffer's +ack pipeline is worth exploring. But the core bug detection does not require +Antithesis's systematic search — a fixed-sequence integration test suffices. + +The catalog marks this as "currently violated" and places it in Category 5. The +catalog's own framing is accurate but the priority as an Antithesis target is +overstated relative to the deadlock/durability cluster. + +**Scope:** property-specific +**Evidence:** `sink-failure-not-silently-acked.md` §"Current Status: VIOLATED"; +OQ-1 ("whether any sink actually emits Errored/Rejected status"); OQ-3 ("Is +there a nack/unack path in the buffer at all?"). +**Suggested action:** Confirm OQ-1 (whether sinks emit non-Delivered status). +If yes: write a focused integration test using a custom test sink. Keep the +Antithesis workload-side assertion but do not invest Antithesis fault budget +beyond the standard backpressure fault (toggle sink to error). This is a +category-3 priority, not category-1. + +--- + +### Finding 5: `buffer-size-within-max` assertion type mismatch — `Always` is vacuously true under the deadlock + +**Property:** `buffer-size-within-max` +**Concern:** The catalog notes this property "must be evaluated jointly with +`writer-eventually-makes-progress`" and that the underflow deadlock makes the +bound "vacuously hold." This is correct but the concern is understated. An +`Always` assertion on disk-size <= max_size will never fire in the deadlock +scenario — the writer stalls and no new data is written, so the bound holds +trivially. The assertion is not wrong, but it provides zero signal on the most +important failure mode (the deadlock). 
An Antithesis run could report this +`Always` as passing even during a permanent writer deadlock. + +The catalog's recommended check (watchdog summing `.dat` file sizes via the +workload, not the masked gauge) is correct, but the framing as an independent +Antithesis `Always` property is misleading — it should explicitly be a compound +check: `(disk_size <= max) AND (writer throughput > 0)`. Neither condition alone +is sufficient. + +**Scope:** property-specific +**Evidence:** property-catalog.md §`buffer-size-within-max` "Antithesis Angle: +Must be evaluated jointly with `writer-eventually-makes-progress`"; writer- +eventually-makes-progress.md §"Relationship to buffer-size-within-max." +**Suggested action:** Remove as a standalone Antithesis `Always` property or +explicitly mark it as a derived signal: "passes vacuously when writer is +deadlocked; only meaningful in conjunction with `writer-eventually-makes-progress +` passing." In the workload, combine the disk-size check with a write-throughput +check to avoid false-green reporting under the deadlock. + +--- + +### Finding 6: `corruption-is-detected-and-recovered` — `Sometimes` reachability concern; value depends on fault toolkit + +**Property:** `corruption-is-detected-and-recovered` +**Concern:** This is a `Sometimes` reachability property — it asserts the +`is_bad_read() → roll_to_next_data_file` branch actually fires. The value is +legitimate: confirming that the fault injection is reaching the right code path +and that the recovery branch is not dead code. However, the +`AlwaysOrUnreachable` sibling property (`no-corrupted-record-delivered`) is +the safety property; this `Sometimes` is purely confirmational. 
+ +The risk is that if Antithesis's fault model does not have direct disk-data- +corruption (bit-flip on a specific file byte) in its standard toolkit, the +`Sometimes` may never be satisfied and Antithesis will report a reachability +failure — not because the recovery path is broken, but because the fault type +needed to reach it was never injected. Antithesis's standard fault suite covers +node-kill, CPU throttle, and clock jitter; direct disk byte-corruption is +described as a "bit-flip / partial-write / torn-tail fault" in the catalog but +the tenant fault availability must be confirmed. + +**Scope:** property-specific +**Evidence:** property-catalog.md §`corruption-is-detected-and-recovered` +"Antithesis Angle: Inject faults while the reader's BufReader has the file open"; +catalog-wide Open Question "Are node-termination faults enabled?"; evidence file +OQ-1 "Does the seek_to_next_record init loop invoke roll_to_next_data_file?" +**Suggested action:** Confirm whether the tenant supports direct disk-byte- +corruption faults. If not: scope this property down to "partial write / torn +tail from node-kill" only (node-kill at a write boundary leaves a torn record, +which exercises `PartialWrite` detection and recovery). Partial-write detection +is reachable via standard node-kill faults. The deeper bit-flip scenarios +(testing the `CheckBytes` surface) require explicit filesystem corruption. + +--- + +### Finding 7: `graceful-shutdown-flushes-all` assertion type mismatch — `Sometimes` may never be satisfied without careful workload design + +**Property:** `graceful-shutdown-flushes-all` +**Concern:** The property is typed as `Sometimes(graceful_shutdown_lossless)`, +meaning Antithesis must find at least one execution where graceful shutdown +completes with zero loss. 
The evidence file's analysis reveals that whether this +`Sometimes` is satisfiable at all depends on unresolved open questions: does +the topology's `stop()` drop `inputs` (and thus the `BufferWriter`) before the +write-loop task's final `flush()` completes? If `inputs` is always dropped +synchronously inside `stop()` before the future is returned, the race is always +present, and a `Sometimes` "graceful shutdown lossless" may never be achieved if +the workload sends events right up to the SIGTERM boundary. Alternatively, if +the write loop always fully flushes before `inputs` is eligible for drop (because +the channel drains before `stop()` proceeds), the `Sometimes` is always satisfied +and Antithesis provides no additional insight. + +The `Sometimes` type is also in tension with the "this is a safety property" +framing: a `Sometimes` says "it happens at least once," not "it always +happens." The intended claim is that graceful shutdown is *always* lossless +(an `Always` property), but the catalog defers to `Sometimes` because the +execution reaching a graceful stop at all needs to be confirmed. + +**Scope:** property-specific +**Evidence:** `graceful-shutdown-flushes-all.md` §"Does stop() call flush before +dropping inputs?"; §"Critical ordering question"; OQ-1 "Does stop() call flush +before dropping inputs?" (marked "Verification required"); INV-6 in sut-analysis +"graceful shutdown flushes (but see §10 — BufferWriter::Drop does NOT flush)." +**Suggested action:** Resolve OQ-1 (trace the drop order in `running.rs`) before +committing to this as an Antithesis property. If graceful shutdown is always +lossless on the happy path, convert to `Always`. If it is sometimes lossy +(race between `inputs` drop and write-loop flush), this is a bug to fix +first and then an `Always` regression test. The current `Sometimes` framing +defers a design question that should be answered by code inspection, not +Antithesis exploration. 
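
The difference between the two assertion types can be made concrete with a toy model. The sketch below is not the Antithesis SDK: `Outcome`, `sometimes_lossless`, and `always_lossless` are hypothetical names that only illustrate how each type aggregates results across the executions that reached a graceful stop.

```rust
/// Result of one execution that reached the graceful-shutdown check.
/// Hypothetical type for illustration; not part of any SDK.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Outcome {
    Lossless,
    Lossy,
}

/// `Sometimes`-style aggregation: satisfied as soon as one execution is lossless.
pub fn sometimes_lossless(runs: &[Outcome]) -> bool {
    runs.iter().any(|o| *o == Outcome::Lossless)
}

/// `Always`-style aggregation: violated by a single lossy execution.
/// An empty slice passes vacuously, which is why a separate reachability
/// check is still needed alongside an `Always`.
pub fn always_lossless(runs: &[Outcome]) -> bool {
    runs.iter().all(|o| *o == Outcome::Lossless)
}
```

With a mix of lossless and lossy shutdowns the `Sometimes` form is satisfied while the `Always` form fails, so only the `Always` form expresses the claim that graceful shutdown never loses data.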
+ +--- + +## Section 2 — Underestimated Properties (marked lower-value but actually high) + +### Underestimate A: `record-id-monotonicity-holds` is higher-value than implied + +The catalog lists this as a `Safety / Unreachable` property — a panic that must +never fire. The catalog text is correct but the placement in Category 1 alongside +corruption properties undersells its connection to the deadlock cluster. The +monotonicity panic at `reader.rs:480-484` can be triggered by: (1) the +non-wrap-aware file-ID rollover comparison at `reader.rs:932`; (2) a wrong +`next_record_id` fast-forward from `validate_last_write` in the `Ordering::Greater` +branch after a torn tail. Both triggers require node-kill faults and timing-sensitive +state that deterministic tests cannot reach (the model FS never produces real +partial writes). If the panic fires, the process crashes and, if the persisted state +is wrong, crash-loops on restart: the same operational impact as the deadlock (#21683). + +The catalog's `Unreachable` assertion is correctly placed, but the property +should be explicitly cross-referenced as part of the deadlock/durability cluster, +not isolated in the corruption category. Its Antithesis fit is high because the +triggering conditions (torn tail, file-ID rollover at restart) require exactly +the systematic timing exploration Antithesis provides. + +### Underestimate B: `file-id-rollover-stays-coordinated` is high-value for Antithesis specifically + +The catalog notes this is a "latent bug" reachable in test builds +(`MAX_FILE_ID=6`). This is exactly the kind of bug Antithesis is positioned to +find: with `MAX_FILE_ID=6`, the Antithesis harness will naturally cycle through +all 6 file IDs in a sustained run (small `max_data_file_size` helps), and with +systematic fault timing it can hit the rollover boundary under crash conditions +that a fixed-sequence test misses. 
The raw `u16 >` comparison at `reader.rs:932` +is a latent production bug (at `MAX_FILE_ID=65535` requires ~8TB written) but is +continuously exercisable in test builds. No existing test covers the rollover +scenario under crash conditions. This property's Antithesis fit is high and the +catalog should flag it as a primary target for test-build runs. + +### Underestimate C: `every-written-event-eventually-delivered` benefit extends beyond the deadlock scenario + +The catalog frames this primarily as an at-least-once e2e check. Its secondary +value — exposing the `unlink-before-ledger-flush` window at `reader.rs:546-549` +and the in-flight finalizer task loss on SIGKILL — is also high. These windows +are precisely the kind of narrow, timing-sensitive data-loss paths that +Antithesis's systematic scheduler explores. The catalog is correct but the +framing should more prominently note that the at-least-once liveness property +is also the primary vehicle for surfacing the three silent-loss paths identified +in the SUT analysis (§6 items 3–5). + +--- + +## Section 3 — What Looks Correct + +**The deadlock/durability cluster is correctly identified as the high-value +core.** The five properties `total-buffer-size-never-underflows`, +`writer-eventually-makes-progress`, `durable-unacked-events-survive-crash`, +`partial-write-at-rotation-recovers`, and `recovery-completes-after-crash` are +all genuine Antithesis targets: they require node-kill faults, their triggering +conditions are timing-sensitive (exact byte offset of the kill relative to the +fsync window and file-rotation boundary), they cannot be reached by the existing +test suite (model FS has no-op sync), and they have a confirmed known-unfixed +production bug (#21683). The `Unreachable` and `Sometimes` assertion types are +correctly chosen for these properties. 
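
Why an underflow in the size accounting turns into a permanent writer stall can be shown with a minimal sketch. It assumes the counter is an `AtomicU64` decremented with `fetch_sub`; `release_bytes` and `writer_may_proceed` are hypothetical names for illustration, not the buffer's real functions.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Subtracts `bytes` from the shared size counter and returns the new value.
/// `fetch_sub` on an atomic integer wraps on underflow instead of panicking.
pub fn release_bytes(total: &AtomicU64, bytes: u64) -> u64 {
    let previous = total.fetch_sub(bytes, Ordering::AcqRel);
    previous.wrapping_sub(bytes)
}

/// Hypothetical admission check: the writer proceeds only if the record
/// still fits under the configured maximum.
pub fn writer_may_proceed(total: &AtomicU64, record_len: u64, max_size: u64) -> bool {
    total
        .load(Ordering::Acquire)
        .checked_add(record_len)
        .map_or(false, |needed| needed <= max_size)
}
```

Releasing one byte more than the counter holds leaves it at `u64::MAX`, after which no record of any size is admitted. Nothing in this model ever lowers the counter again, which is the shape of the deadlock the cluster targets.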
+ +**`no-corrupted-record-delivered` is correctly in Antithesis territory.** The +hand-written `CheckBytes` (unsafe, rkyv ICE workaround) is an unusual validation +surface that benefits from systematic bit-flip exploration. The CRC32C collision +risk is documented and residual. The `AlwaysOrUnreachable` type is correct: on +a clean run the assertion is unreachable (no corruption injected), and under +faults it must always hold. + +**`acked-files-eventually-deleted` and `reader-drains-and-terminates-cleanly` +correctly target the finalizer-task dependency chain.** Killing the finalizer +task mid-run (node-kill) and confirming the dependent liveness chain breaks +(or correctly recovers) is exactly the kind of multi-step concurrent-failure +path that unit tests cannot reach. The disabled test `#23456` is a direct +prior art data point for why deterministic testing failed here. + +**`config-reload-no-silent-loss` is correctly identified as requiring a custom +fault.** The SIGHUP-driven reload path is not covered by Antithesis's standard +node-kill or network-partition faults. The property correctly notes this +dependency and the need to confirm feasibility with the tenant team. The +existing `topology_disk_buffer_config_change_does_not_stall` test explicitly +does NOT assert zero event loss (only liveness), confirming that Antithesis +adds something the test suite doesn't have. + +**The `Unreachable` on `record-id-monotonicity-holds` is correctly typed.** +The panic at `reader.rs:480-484` is a guardrail that must never trip; any +execution that reaches it is a bug. `Unreachable` is the correct Antithesis +assertion type — Antithesis reports a failure the first time the point is hit. + +**The catalog's catalog-wide fault dependency flag is critical and correct.** +Flagging that "nearly every high-value property requires node-termination faults, +which are often disabled by default" is the most operationally important warning +in the catalog. 
Without kill-and-restart faults enabled, Categories 2–6 show up +as either never satisfied (liveness) or vacuously passing (safety, because no crash +ever occurs). Whether these faults are enabled must be confirmed before any Antithesis run is submitted. + +--- + +## Section 4 — Uncertainties + +1. **Antithesis bit-flip / disk-byte-corruption fault availability.** Several + Category 1 properties (`no-corrupted-record-delivered`, + `corruption-is-detected-and-recovered`, `record-never-spans-files` under + corrupted metadata) depend on direct filesystem byte-corruption, not just + node-kill. Whether this is available in the target tenant is unconfirmed. + If not available, Category 1 properties reduce to "torn tail from node-kill" + coverage, which is less comprehensive but still valuable. + +2. **Production binary vs. test binary.** `file-id-rollover-stays-coordinated` + and (to a lesser extent) `record-id-wraparound-accounting-holds` depend on + test-only constants (`MAX_FILE_ID=6`, `unsafe_set_*_record_id` helpers). If + Antithesis runs production binaries, the rollover is unreachable without + sustained high-volume writes exceeding 8TB. Clarify which binary type the + Antithesis harness uses and whether test-build constants can be compiled in + without `#[cfg(test)]` gating. + +3. **Filesystem persistence across node-kill.** The deployment-topology document + correctly flags this as critical: if Antithesis node-termination recreates + the container with a fresh filesystem, all crash-durability properties pass + vacuously. Whether the target tenant's persistent-volume model survives + node-kill must be confirmed before any crash-durability test is meaningful. + +4. **Reachability of `Sometimes` under fault injection timing.** For + `writer-eventually-makes-progress`, the + `Sometimes(writer_unblocked_after_full)` assertion must fire on at least one non-fault execution (buffer fills and + drains normally) before Antithesis considers it satisfied. 
The catalog confirms + this should be reachable on the happy path, which is correct design. But the + assertion only demonstrates liveness failure if it is never satisfied across + all executions — including fault executions. Confirm that the test run + duration and workload throughput allow buffer-full → drain cycles to occur + frequently enough for the `Sometimes` baseline to establish before fault- + injection scenarios begin. + +5. **`sink-failure-not-silently-acked`: whether sinks emit non-Delivered status.** + OQ-1 in the evidence file is unresolved: if Vector sinks internally retry and + only surface success to the finalizer, `Errored`/`Rejected` status may never + reach `spawn_finalizer` in any workload, making the property unreachable + without a custom test sink. This must be answered before committing harness + resources to this property. diff --git a/tests/antithesis/scratchbook/evaluation/coverage-balance.md b/tests/antithesis/scratchbook/evaluation/coverage-balance.md new file mode 100644 index 0000000000000..996915bfd27b4 --- /dev/null +++ b/tests/antithesis/scratchbook/evaluation/coverage-balance.md @@ -0,0 +1,430 @@ +--- +sut_path: /home/ssm-user/src/vector +commit: b7aae737cef5dd37d1445915443a1eb97b584f85 +updated: 2026-05-28 +external_references: [] +--- + +# Coverage Balance Evaluation: Disk Buffer v2 Property Catalog + +Lens: **Is this the right set of properties?** Evaluated section-by-section against +`sut-analysis.md`, cross-checking each failure-prone area against the catalog, +then examining five specific cross-cutting gaps called out in the evaluation prompt, +and finally assessing the Safety/Liveness/Reachability portfolio balance. + +--- + +## 1. Mapping: SUT Failure-Prone Areas → Catalog Coverage + +The SUT analysis ranks 9 failure-prone areas (§6). The table below shows catalog +coverage density per area and flags imbalances. 
+ +| Rank | SUT §6 area | Catalog properties | Coverage adequacy | +|------|-------------|-------------------|------------------| +| 1 | `total_buffer_size` underflow → writer deadlock | `total-buffer-size-never-underflows`, `writer-eventually-makes-progress`, `buffer-size-within-max`, `reader-drains-and-terminates-cleanly` (4 properties, 3 from different vantage points) | **Well-covered** — the highest-value cluster has the most properties | +| 2 | Crash-time durability/recovery windows | `durable-unacked-events-survive-crash`, `recovery-completes-after-crash`, `partial-write-at-rotation-recovers`, `every-written-event-eventually-delivered` (4 properties) | **Well-covered** | +| 3 | Config-reload silent loss (#24948) | `config-reload-no-silent-loss`, `graceful-shutdown-flushes-all` (2 properties) | **Adequately covered** — both Cluster F properties address the Drop/flush root cause | +| 4 | `drop_newest` metric blindness (#24606/#24144) | `dropped-events-are-counted` (1 property) | **Adequate** — single, targeted; the bug is self-contained | +| 5 | Sink-error acks discarded (`_status` ignored) | `sink-failure-not-silently-acked` (1 property) | **Adequate** — single, targeted | +| 6 | File-ID rollover (`reader.rs:932` raw `u16 >`) | `file-id-rollover-stays-coordinated` (1 property) | **Adequate** — single, targeted | +| 7 | Reader skips rest of file on first bad record | `corruption-is-detected-and-recovered` (captures skip path), `no-corrupted-record-delivered` | **Thin** — the skip-rest-of-file data-loss dimension is noted in open questions but has no dedicated property quantifying or bounding the abandonment loss | +| 8 | `get_total_records` non-wrapping `- 1` → ~2^64 | `record-id-wraparound-accounting-holds` (1 property) | **Adequate** | +| 9 | mmap SIGBUS / external file tampering / foreign `.dat` | **No property in the catalog** | **GAP — see Finding F1** | + +--- + +## 2. 
Cross-Cutting Gap Analysis (Specific Prompt Items) + +### Finding F1 — UNCOVERED: mmap SIGBUS / external file tampering / foreign `.dat` file + +**SUT evidence:** sut-analysis §6 item 9; §8 (memmap2 bullet: "SIGBUS if `buffer.db` is +truncated/unmapped — unhandled, crashes process"); §3 (foreign `.dat` files inflate +`total_buffer_size`; truncation under read → underflow). + +**Catalog check:** No property addresses: + +1. `buffer.db` truncated while mmap'd → SIGBUS → unhandled crash (no property on + SIGBUS resistance or graceful degradation of the mmap'd ledger). +2. A foreign/unexpected `.dat` file in the buffer directory inflating + `update_buffer_size` over-seed → feeding directly into the underflow bug via a + non-crash path (operator error, symlink attack, leftover file from a previous + process). +3. External truncation of a `.dat` file under an active read → `bytes_read > + metadata.len()` → underflow at `reader.rs:524` via a filesystem-level fault, not + a crash. + +**Scope:** These are distinct from all existing properties. The SIGBUS path is an +unhandled signal → process crash → restart loop. The foreign `.dat` path is a +silent, crash-free over-seed of `total_buffer_size` that produces the same +deadlock as #21683 without a crash trigger. + +**Suggested action:** Add a property `ledger-resists-external-filesystem-corruption` +with two sub-cases: (a) `assert_unreachable` at any unhandled SIGBUS (equivalently, +assert the process does not crash after a filesystem truncation fault on `buffer.db`); +(b) `assert_always(total_buffer_size == sum_of_valid_own_dat_files)` at startup, using +a workload-side injection of a foreign `.dat` file before restart to confirm the +over-seed path is either prevented or does not cause deadlock. This property +requires a filesystem-level fault capability (Antithesis can truncate files or inject +foreign files via the deterministic filesystem model). 
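
Sub-case (b) can be sketched as a workload-side computation. The example assumes data files follow a `buffer-data-<id>.dat` naming pattern with a `u16` ID; that pattern and the helper names `own_data_file_id` and `expected_total_buffer_size` are assumptions for illustration and should be checked against the real on-disk layout.

```rust
/// Returns the data-file ID if `name` matches the assumed
/// `buffer-data-<id>.dat` pattern, or `None` for anything else.
pub fn own_data_file_id(name: &str) -> Option<u16> {
    name.strip_prefix("buffer-data-")?
        .strip_suffix(".dat")?
        .parse::<u16>()
        .ok()
}

/// Sums the sizes of files that belong to the buffer and skips foreign entries.
/// `entries` is a (file name, size in bytes) listing collected by the workload.
pub fn expected_total_buffer_size(entries: &[(&str, u64)]) -> u64 {
    entries
        .iter()
        .filter(|(name, _)| own_data_file_id(name).is_some())
        .map(|(_, len)| *len)
        .sum()
}
```

The workload would compare this sum with the size the buffer reports after restart. A mismatch after injecting a stray `.dat` file indicates that the foreign file was counted in the seed.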
+ +--- + +### Finding F2 — UNCOVERED: throughput-under-lock-contention (the performance issue) + +**SUT evidence:** sut-analysis §4 ("The writer mutex is the **lock-contention +bottleneck** noted in the GA doc… ~90 MiB/s with 10 threads"); the item is absent +from the §10 list, but §4 notes it explicitly; the GA doc external reference (internal buffers GA design doc) +calls it a known throughput ceiling. + +**Catalog check:** No property covers throughput degradation under writer `Mutex` +contention. The catalog has liveness properties for the deadlock case +(`writer-eventually-makes-progress`) but nothing for the performance-under-contention +case: does the buffer's throughput remain within an acceptable bound when multiple +sources/transforms compete for the `Arc`-wrapped writer `Mutex`? Antithesis CPU +throttling is the lever to exacerbate contention without deadlock, which is a +distinct failure mode. + +**Scope:** This is a liveness/performance gap. The known ceiling (~90 MiB/s with 10 +threads) may not be a correctness bug per se, but: (a) CPU throttle + mutex +contention can drive throughput to near zero without triggering any of the existing +deadlock properties (the writer is making progress but at 0.01 MiB/s); (b) there is +no property that would catch a regression that makes contention worse; (c) the GA doc +explicitly flags this as a gap in the existing chaos test. + +**Suggested action:** Add a property `throughput-under-contention-acceptable` as a +Liveness property: during CPU-throttle faults with N parallel writer sources +(N >= 4), write throughput must stay above a configurable floor (e.g., 10% of no-fault +baseline) within a bounded time window. This requires the workload to measure +throughput independently of whether the writer is deadlocked. It would distinguish +"barely making progress" from "deadlocked" — a distinction the current catalog cannot +make. 
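
A minimal sketch of the floor check, assuming the workload can sample bytes written per observation window. `WriterHealth` and `classify_window` are hypothetical names, and the 10% ratio is only the example value from the suggestion above.

```rust
/// Classification of writer health over one observation window.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum WriterHealth {
    /// Throughput is at or above the configured floor.
    Healthy,
    /// Bytes were written, but throughput is below the floor.
    Starved,
    /// No bytes were written in the window.
    Deadlocked,
}

/// `baseline_bps` is the no-fault throughput in bytes per second and
/// `floor_ratio` the accepted fraction of it (for example 0.10).
pub fn classify_window(
    bytes_written: u64,
    window_secs: f64,
    baseline_bps: f64,
    floor_ratio: f64,
) -> WriterHealth {
    if bytes_written == 0 {
        return WriterHealth::Deadlocked;
    }
    let observed_bps = bytes_written as f64 / window_secs;
    if observed_bps < baseline_bps * floor_ratio {
        WriterHealth::Starved
    } else {
        WriterHealth::Healthy
    }
}
```

Keeping `Starved` separate from `Deadlocked` lets the workload report a contention regression without conflating it with the `writer-eventually-makes-progress` failure.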
+ +--- + +### Finding F3 — UNCOVERED: clock-jitter × 500ms `should_flush` deadline interaction + +**SUT evidence:** sut-analysis §4 ("Under CPU-throttle the winner can be descheduled +between winning the CAS and actually fsyncing, silently extending the 500ms window"); +§10 ("Clock jitter × `should_flush`: `Instant::elapsed` drives the 500ms gate; +Antithesis clock faults could stretch/shrink the durability window"). + +**Catalog check:** The `durable-unacked-events-survive-crash.md` mentions CPU throttle +as a secondary lever and the deployment topology notes clock jitter as a fault lever. +But no property directly asks: when clock jitter compresses the `should_flush` gate to +fire more frequently (effectively increasing fsync frequency), or stretches it to fire +much less frequently (widening the durability window beyond 500ms), does the buffer +remain correct? Specifically: + +- **Clock runs fast:** `should_flush` fires every few ms → excessive fsync contention + → does the CAS at `ledger.rs:last_flush` create a miss-and-extend window? +- **Clock runs slow (or frozen):** `should_flush` never fires → the 500ms window is + actually "forever" → crash can lose data that the customer expected to be durable + because `flush_interval` has elapsed in wall time but not in `Instant` time. +- **Clock jitter × `should_flush` CAS winner descheduled:** The winner of the CAS + marks the time slot but then is preempted before calling `sync_all`; the 500ms + window implicitly restarts, extending the loss window by up to one full CPU + quantum. No property currently asserts a bound on the maximum loss window under + this condition. + +**Scope:** Medium. The 500ms guarantee is part of the product SLA ("data synchronized +every 500ms"). Clock jitter violating that SLA in both directions is a safety concern +(for the slow-clock direction) and a performance regression concern (for the fast- +clock direction). Neither is covered by any existing property. 
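
The descheduled-winner case can be illustrated with a simplified model of a CAS-based flush gate. It assumes the gate stores a millisecond timestamp in an `AtomicU64`, whereas the notes say the real gate is driven by `Instant::elapsed`; `try_claim_flush` is a hypothetical name.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Returns true for the single caller whose compare-exchange replaces the
/// stale timestamp with `now_ms`. The slot is marked before any sync
/// happens, so a winner that is preempted afterwards still blocks other
/// callers for a full interval.
pub fn try_claim_flush(last_flush_ms: &AtomicU64, now_ms: u64, interval_ms: u64) -> bool {
    let last = last_flush_ms.load(Ordering::Acquire);
    if now_ms.saturating_sub(last) < interval_ms {
        return false;
    }
    last_flush_ms
        .compare_exchange(last, now_ms, Ordering::AcqRel, Ordering::Acquire)
        .is_ok()
}
```

In this model a caller arriving 499 ms after the claim is refused even if the winner never reached its sync call, so the gap between actual syncs can approach two intervals.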
+ +**Suggested action:** Add a property `fsync-deadline-respected-under-clock-jitter`: +with Antithesis clock faults active, assert that either (a) every fsync completes +within 2× the configured `flush_interval` in wall time (allowing some jitter margin), +or (b) the maximum data-loss window never exceeds `flush_interval × jitter_factor`. +This requires workload-side wall-clock timestamping of events alongside `Instant`- +based timing inside the SUT. The property would be `Always` for the upper bound and +`Sometimes` for confirming the fast-clock path is reached. + +--- + +### Finding F4 — UNCOVERED: `WhenFull::Overflow` + disk base — reordering and in-memory loss + +**SUT evidence:** sut-analysis §10 ("**`WhenFull::Overflow` + disk base:** unbiased +`select!` over base+overflow reorders events across the overflow boundary; if overflow +is in-memory, a crash loses the *later* in-memory events while the *earlier* disk +events survive — breaks dedup-based at-least-once reasoning (a gap, not just +duplicates)"). + +**Catalog check:** The catalog covers `drop_newest` (`dropped-events-are-counted`) and +`block` (via the deadlock cluster), but has **no property** for the +`WhenFull::Overflow` mode. This is a structurally distinct behavior: events are +accepted into an overflow buffer (in-memory), the disk buffer drains, then overflow +events are replayed — but a crash during this window loses the overflow-only events +while preserving the disk events. Because the disk events arrived *earlier* but are +read *after* the overflow events' crash-loss, the downstream sees a gap in the middle +of the stream, not just at the tail. This breaks dedup-based at-least-once because +the downstream may have already acked the "earlier" disk events and now skips the +gap. + +**Scope:** This is a distinct `WhenFull` mode that is not exercised by the catalog +at all. The topology file acknowledges only `block` (the default) and `drop_newest` +(one config variant). 
The `Overflow` mode with a disk base is a real customer +configuration (chaining in-memory overflow onto disk for bursts). + +**Suggested action:** Add a property `overflow-chain-no-reorder-loss`: configure +`when_full: overflow` with the disk buffer as the base and an in-memory buffer as +the overflow. Crash during a period when the overflow is active (in-memory events +exist but have not yet been replayed to disk). Assert (a) no gap exists in the +delivered event IDs post-recovery (the disk events are delivered, and the +overflow-lost events are explicitly counted in `component_discarded_events_total`), +and (b) the event ordering at the downstream does not place disk events *after* +in-memory events that were delivered pre-crash. This property requires the workload +to track sequence numbers across the overflow boundary. + +--- + +### Finding F5 — UNCOVERED: `DiskBufferV1CompatibilityMode` flag inversion / forward-compat foot-gun + +**SUT evidence:** sut-analysis §10 ("**`DiskBufferV1CompatibilityMode` flag inversion** +(`vector-core/event/ser.rs`): `can_decode` requires the V1-compat flag on every +record; a future 'V2-native' flag scheme would be rejected as incompatible — a +forward-compat foot-gun"). + +**Catalog check:** No property in the catalog covers this. There is no "format +version upgrade" or "compatibility mode correctness" property. The existing catalog +entirely omits the `DiskBufferV1CompatibilityMode` layer. + +**Scope:** Two related gaps: + +1. **Flag inversion:** If the V1-compat flag is written incorrectly (inverted), a + reader that expects the flag present would reject all records as incompatible, + silently abandoning the buffer. No property detects this. +2. **Format/version upgrade:** A rkyv layout change (e.g., a dependency bump that + changes field ordering or alignment) would make old buffer files unreadable. 
+ No property asserts that a buffer written by version N is readable by version N+1, + or that a version mismatch is detected gracefully rather than silently producing + garbage. + +**Suggested action (flag inversion):** Add a property `v1-compat-flag-correct`: +`assert_always_or_unreachable` at the `can_decode` check in `ser.rs` that every record +read from the disk buffer has the V1-compat flag set (until the flag scheme is +changed). This is a cheap invariant that would catch flag inversion immediately. + +**Suggested action (format/version upgrade):** Add a property +`buffer-survives-version-upgrade`: the workload writes N events with version N, shuts +down, upgrades the binary (or swaps the binary in the container), restarts, and asserts +all N events are readable. This requires the harness to support binary swap — a non- +trivial workload design, but the risk is real: a rkyv bump with silent layout change +is the most plausible path to catastrophic "all existing buffers unreadable" regression. + +--- + +## 3. Low-Investment / Over-Investment Assessment + +### Potentially over-invested: `record-never-spans-files` + +**Catalog entry:** Safety / `AlwaysOrUnreachable` — a record never spans two data files. + +**SUT context:** The `can_write` gate (`writer.rs:433-437`) enforces this statically +before every write. The only way to bypass it is a corrupted `metadata().len()` size- +seed or a `max_record_size > max_data_file_size` misconfig. The former requires a +specific filesystem fault on the stat call, and the latter is gated by a `debug_assert` +(albeit compiled out of release). + +**Balance concern:** This property addresses a violation that requires an extremely +specific fault sequence (filesystem fault between `open` and `metadata`, plus a +write that lands precisely at the file boundary), and a violation would manifest as a +`PartialWrite` that the reader handles correctly (rolls to next file). The data-loss +consequence is real but bounded (one record). 
Compared to the unbounded silent loss +from the underflow deadlock or from unflushed data on config reload, this is a lower-value +target. The existing `can_write` gate is an unconditional check on the common +path (it runs before every write); its bypass path is narrow. Time spent reaching this property might be better +spent on F1–F5 above. + +**Verdict:** Not over-invested per se (the property is correct and targeted), but if +the catalog needs to be pruned, this is the first candidate. It is the lowest-impact +safety property in the set. + +--- + +## 4. Property-Type (Safety / Liveness / Reachability) Balance Assessment + +### Current distribution + +| Type | Properties | Slugs | +|------|-----------|-------| +| Safety (`Always` / `Unreachable`) | 11 | `no-corrupted-record-delivered`, `record-never-spans-files`, `total-buffer-size-never-underflows`, `buffer-size-within-max`, `durable-unacked-events-survive-crash`, `sink-failure-not-silently-acked`, `dropped-events-are-counted`, `file-id-rollover-stays-coordinated`, `record-id-wraparound-accounting-holds`, `config-reload-no-silent-loss`, `partial-write-at-rotation-recovers` (safety component) | +| Liveness (`Sometimes` / progress) | 7 | `writer-eventually-makes-progress`, `every-written-event-eventually-delivered`, `recovery-completes-after-crash`, `acked-files-eventually-deleted`, `reader-drains-and-terminates-cleanly`, `graceful-shutdown-flushes-all`, `partial-write-at-rotation-recovers` (liveness component) | +| Reachability (`Sometimes` confirming path fires) | 3 | `corruption-is-detected-and-recovered`, `partial-write-at-rotation-recovers` (reachability component), `recovery-completes-after-crash` (reachability component) | + +Note: `partial-write-at-rotation-recovers` and `recovery-completes-after-crash` each +span multiple types, so they are counted once per type and the counts above sum to +more than the 19 catalog properties. 
+ +### Assessment + +**Safety (11/19): appropriately dominant.** The disk buffer's primary failure modes +are silent-safety violations (underflow deadlock, config-reload loss, corrupted records, +metric blindness). Safety properties dominating the set is correct. + +**Liveness (7/19): adequate but slightly thin on the space-reclamation / wakeup-chain +dimension.** `acked-files-eventually-deleted` and `reader-drains-and-terminates-cleanly` +cover the main liveness paths. The wakeup-chain transitivity (finalizer → reader → writer) +is analyzed in `writer-eventually-makes-progress.md` but there is no standalone property +confirming that breaking the chain at the **finalizer** step (killed finalizer task, +not the reader or writer) alone triggers the deadlock. This is noted as a concern in +open questions but has no dedicated property. + +**Reachability (3/19): thin, possibly deliberately.** The three reachability +properties confirm that key recovery code paths are actually exercised. The catalog +correctly notes that `AlwaysOrUnreachable` is used where "never executed is acceptable, +but any execution must satisfy the invariant." However, there are additional recovery +branches that should be confirmed reachable but currently have no `Sometimes` check: + +- `validate_last_write` `Ordering::Less` path (ledger lags data) — covered in + `partial-write-at-rotation-recovers.md` assertions but not as a standalone catalog + property. +- `validate_last_write` `Ordering::Greater` path (data lags ledger) — same. +- `seek_to_next_record` fast-path file deletion during recovery — same. + +These are listed as SDK assertions in the evidence files but the catalog does not +elevate them to standalone properties. Given that `validate_last_write`'s branches +are the most crash-sensitive code in the buffer, a property +`validate-last-write-both-branches-reachable` with two `Sometimes` assertions would +close this gap. + +--- + +## 5. 
Component Blind Spots + +### The `OrderedFinalizer` task is a single point of failure with no dedicated liveness property + +**SUT evidence:** The wakeup chain documented in `writer-eventually-makes-progress.md` +shows: sink acks → `BatchNotifier` dropped → finalizer task → `pending_acks` → +reader → file deletion → writer unblocks. The finalizer task is a `tokio::spawn` +that holds `Arc` and is never awaited or monitored. + +**Catalog gap:** `writer-eventually-makes-progress` covers the full-chain deadlock, +and `acked-files-eventually-deleted` covers the file deletion. But no property +specifically asks: "if the finalizer task panics or exits early (e.g., due to a +tokio runtime shutdown racing with an in-flight ack), are events permanently stranded +in the buffer?" The gap is especially relevant because the finalizer task uses +`stream.next().await` without a timeout, and any panic in that task is unobserved. + +**Suggested action:** Add a property `finalizer-task-drains-on-shutdown`: +`assert_always` that after `BufferWriter::drop`, all pending acks that were in-flight +when drop occurred are eventually processed (i.e., no events are stranded because the +finalizer task exited before processing all items). This requires a SUT-side counter +of unprocessed finalizer items at the moment the writer task exits. + +### The `should_flush` CAS winner / descheduling gap has no coverage + +**SUT evidence:** sut-analysis §4 ("Under CPU-throttle the winner can be descheduled +between winning the CAS and actually fsyncing, silently extending the 500ms window"). + +**Catalog gap:** This is distinct from the clock-jitter gap (F3 above). Even without +clock jitter, CPU throttling can cause the `should_flush` CAS winner to hold the +"slot" for the 500ms window without actually fsyncing, effectively disabling fsync for +the duration of the throttle. No property covers this specific race. 
The race is covered only implicitly, +by `durable-unacked-events-survive-crash` (which should catch data loss from an +extended fsync gap), and only when the CPU throttle co-occurs with a kill. + +--- + +## 6. Catalog-Wide Structural Observations + +### Missing: a "buffer metrics are not lying" umbrella property + +The catalog has `record-id-wraparound-accounting-holds` (metrics level), +`buffer-size-within-max` (control path), and `dropped-events-are-counted` +(component-level metric). But there is no umbrella property that asserts, after +any operation, that the buffer metrics (`buffer_events`, `buffer_byte_size`, +`buffer_discarded_events_total`, `component_discarded_events_total`) are mutually +consistent. Several known bugs produce metrics that contradict each other (e.g., +`buffer_byte_size = 0` via `saturating_sub` while `total_buffer_size = u64::MAX` +in the control path; `buffer_discarded_events_total` increments while +`component_discarded_events_total` stays 0). A workload-level property +`buffer-metrics-are-internally-consistent` would catch these disparities more +broadly than individual point properties. + +### Catalog correctly identifies fault dependency as a global risk + +The catalog-level note that "nearly every high-value property requires node-termination +faults" is accurate and critical. Of the 19 properties, 14 require SIGKILL to produce +the triggering state. If node termination is disabled in the Antithesis tenant, 14/19 +properties will either never fire (Liveness `Sometimes` never found) or pass +trivially (Safety `Always` never challenged). This is the single largest operational +risk for the catalog as deployed. The catalog correctly flags it but does not propose +a fallback strategy. 
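
One way to make that risk visible is for the workload to treat crash-dependent results as meaningless unless it actually observed a restart. The sketch below is an assumption-laden illustration: `RunObservations`, `Verdict`, and `crash_property_verdict` are hypothetical names, and how restarts are counted is left to the harness.

```rust
/// What the workload observed during one run (hypothetical shape).
pub struct RunObservations {
    /// Number of times the workload saw the SUT process come back after a kill.
    pub process_restarts: u32,
    /// Whether every crash-dependent check held in this run.
    pub crash_properties_passed: bool,
}

#[derive(Debug, PartialEq)]
pub enum Verdict {
    Pass,
    Fail,
    /// No kill was observed, so a pass would carry no information.
    Vacuous,
}

/// Reports `Vacuous` instead of `Pass` when no restart was observed.
pub fn crash_property_verdict(obs: &RunObservations) -> Verdict {
    if obs.process_restarts == 0 {
        Verdict::Vacuous
    } else if obs.crash_properties_passed {
        Verdict::Pass
    } else {
        Verdict::Fail
    }
}
```

A run with zero observed restarts is then reported as uninformative instead of green.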
+ +**Suggested addition:** For each property that requires node-termination, add a +complementary reachability property that confirms the fault trigger is actually being +reached (i.e., confirm the process is being killed, not that the property is vacuously +satisfied). Without this, a run where node-termination is disabled silently reports +"all liveness properties satisfied" with zero information content. + +--- + +## 7. Summary Verdict + +The catalog is **well-structured and appropriately weighted** toward the highest-risk +failure areas (underflow cluster, crash durability, config-reload loss). The +Safety/Liveness/Reachability distribution is roughly correct for the SUT's failure +mode profile (mostly silent safety violations). Five specific gaps need new properties, +one existing property is lower-value than the rest, and two structural additions would +improve the catalog's meta-coverage. + +**Priority order for gap closure:** + +1. F1 (mmap SIGBUS / foreign `.dat` files) — triggers the same deadlock as #21683 via + a non-crash path; no existing property catches it. +2. F4 (`WhenFull::Overflow` chain) — an entire `WhenFull` mode is completely absent. +3. F2 (throughput-under-contention) — known performance ceiling, no regression test. +4. F3 (clock-jitter × `should_flush`) — SLA-level concern, no property. +5. F5 (V1-compat flag / format upgrade) — low probability but catastrophic when hit. 
+ +--- + +## Findings (Structured) + +### GAPS (properties/areas with no catalog coverage) + +| # | Property/Slug | Concern | Scope | Evidence | Suggested action | +|---|--------------|---------|-------|----------|-----------------| +| F1 | *(missing)* `ledger-resists-external-filesystem-corruption` | mmap SIGBUS on `buffer.db` truncation crashes process unhandled; foreign `.dat` files inflate `update_buffer_size` → underflow deadlock without a crash trigger | sut-analysis §6 item 9, §8 (memmap2), §3 | No catalog property; sut-analysis calls both out explicitly; the foreign-`.dat` path reaches the same `fetch_sub` site as #21683 via a non-crash, operator-error path | Add two sub-properties: (a) assert process does not crash under `buffer.db` truncation fault; (b) assert `total_buffer_size` at startup equals sum of only self-owned valid `.dat` files, using workload-injected foreign file | +| F2 | *(missing)* `throughput-under-contention-acceptable` | Writer `Mutex` contention under CPU throttle can degrade throughput to near-zero without triggering any existing deadlock property — a regression that current properties would miss | sut-analysis §4, GA doc external reference | Known ~90 MiB/s ceiling with 10 threads; no property measures throughput floor under stress | Add Liveness property: with N>=4 parallel sources and CPU throttle, throughput stays above a configurable floor; distinguish "barely alive" from "deadlocked" | +| F3 | *(missing)* `fsync-deadline-respected-under-clock-jitter` | Clock faults stretch/shrink the `should_flush` 500ms gate; slow clock can silently extend the loss window beyond the documented SLA; fast clock causes CAS mis-timing | sut-analysis §4, §10 (clock-jitter × should_flush) | No catalog property; deployment-topology.md lists clock jitter as a fault lever without a corresponding property | Add Always property: maximum observable loss window (wall time between consecutive fsyncs) stays <= `flush_interval × jitter_factor` under clock 
faults | +| F4 | *(missing)* `overflow-chain-no-reorder-loss` | `WhenFull::Overflow` + disk base: crash during in-memory overflow period loses in-memory events while disk events survive; gap in the middle of the stream breaks dedup-based at-least-once | sut-analysis §10 (WhenFull::Overflow) | Catalog covers `block` and `drop_newest` only; `Overflow` mode is entirely absent | Add property: with disk base + in-memory overflow, crash during overflow active period; assert no unaccounted gap in delivered IDs and no incorrect event ordering | +| F5 | *(missing)* `v1-compat-flag-correct` + `buffer-survives-version-upgrade` | V1-compat flag inversion (`vector-core/event/ser.rs`) silently makes all records incompatible; rkyv layout change makes old buffer files unreadable without detection | sut-analysis §10 (DiskBufferV1CompatibilityMode) | No catalog property; the only format/versioning coverage is CRC32C; a rkyv bump with silent layout change is undetectable by any existing property | (a) `assert_always_or_unreachable` at `can_decode` that V1-compat flag is present; (b) binary-upgrade workload scenario asserting existing buffer readable after upgrade | +| F6 | *(missing)* `finalizer-task-drains-on-shutdown` | The `OrderedFinalizer` tokio task is a single point of failure; if it panics or exits before processing all in-flight acks, events are permanently stranded | sut-analysis §4 (finalizer), writer-eventually-makes-progress OQ | No standalone property for finalizer-task liveness; covered only implicitly via `writer-eventually-makes-progress` | Add Always property: unprocessed finalizer items at writer-drop time == 0; or Sometimes confirming all pending acks are processed within bounded time after drop | +| F7 | *(missing)* `buffer-metrics-are-internally-consistent` | Multiple known bugs produce contradictory metrics (control-path `u64::MAX` vs. 
gauge `0`, `buffer_discarded` increments without `component_discarded`); no umbrella consistency check | sut-analysis §6 items 1,4,8; property-relationships Cluster E | Individual metric properties exist but no cross-metric consistency invariant | Add workload-level Always property: after any operation, `{buffer_events, buffer_byte_size, buffer_discarded_events_total, component_discarded_events_total}` are mutually consistent (no pair contradicts each other) | + +### IMBALANCE / THIN COVERAGE + +| # | Property/Slug | Concern | Scope | Evidence | Suggested action | +|---|--------------|---------|-------|----------|-----------------| +| T1 | `corruption-is-detected-and-recovered` (partial) | The "skip rest of file on first bad record" data-loss dimension is noted in the open questions but has no dedicated property quantifying or bounding the records-abandoned rate | sut-analysis §6 item 7; catalog open question in corruption-is-detected-and-recovered | One property covers the recovery path fires, but not the loss magnitude from valid-after-corrupt records abandoned in the same 128MB file | Add a measurement property or open question resolution: assert `records_abandoned_per_corruption_event <= max_data_file_size / min_record_size` with a `Sometimes` confirming the abandon path is reached | +| T2 | *(missing standalone)* `validate-last-write-both-branches-reachable` | `validate_last_write` `Ordering::Less` and `Ordering::Greater` branches are the most crash-sensitive code; they have SDK assertions in evidence files but are not elevated to standalone catalog properties | sut-analysis §3 (recovery), partial-write-at-rotation-recovers evidence | Both branches are covered in `partial-write-at-rotation-recovers` as sub-assertions; promoting them to standalone Reachability properties would make their coverage explicit and trackable | Add two Reachability properties: `Sometimes(validate_last_write_less_branch_reached)` and 
`Sometimes(validate_last_write_greater_branch_reached)` | +| T3 | Reachability properties overall | Only 3 of 19 properties are Reachability type; for a SUT where the most dangerous code paths are deep recovery branches that the existing test suite provably cannot reach, confirming fault injection actually reaches those paths is underweighted | sut-analysis §7 ("model FS makes sync_all no-ops"), existing-assertions.md | The catalog correctly uses `AlwaysOrUnreachable` for optional paths, but additional `Sometimes` properties confirming the Antithesis fault strategy reaches specific code would add meta-coverage confidence | Promote at least 3 additional recovery sub-paths to standalone Reachability properties (see T2 above plus `seek_to_next_record` fast-path and `delete_completed_data_file` under fault) | + +### LOW-VALUE / POTENTIAL PRUNE + +| # | Property/Slug | Concern | Scope | Evidence | Suggested action | +|---|--------------|---------|-------|----------|-----------------| +| L1 | `record-never-spans-files` | The `can_write` gate enforces this statically; bypass requires a specific filesystem fault on a stat call at the exact file-boundary; consequence is bounded (one record loss handled gracefully by PartialWrite roll); lowest-impact safety property in the catalog | sut-analysis §5 INV-2 ("Hard"); catalog entry | Gate is correct in the common path; the fault scenario is narrow; `partial-write-at-rotation-recovers` and `no-corrupted-record-delivered` cover the consequences if it were bypassed | Keep, but deprioritize for initial harness implementation; address after F1-F4 | + +### PASSES (well-covered areas) + +| Area | Why it passes | +|------|--------------| +| `total_buffer_size` underflow / deadlock cluster (Cluster A) | 4 properties from 4 distinct vantage points (root invariant, writer liveness, reader shutdown, bound vacuity) with cross-reference to avoid missing the vacuous-pass trap | +| Crash durability / recovery (Cluster B) | 4 properties covering 
synced-data survival, end-to-end at-least-once, init completion, and torn-tail recovery; the `validate_last_write` fast-paths are documented with precise code locations | +| Corruption detection (Cluster C) | Correct Safety+Reachability pairing (`no-corrupted-record-delivered` + `corruption-is-detected-and-recovered`); the unsafe `CheckBytes` surface is explicitly noted | +| Boundary arithmetic (Cluster D) | Both known arithmetic bugs (`u16 >` file-ID, `- 1` record-ID) have targeted properties; the connection to the monotonicity panic is traced | +| Observability-gap bugs (Cluster E) | Both known metric-blindness bugs (#24606 `drop_newest`, `_status` discarded) have properties; the `config-reload-no-silent-loss` property covers the internal config-reload incident vector | +| Lifecycle / shutdown (Cluster F) | Both `config-reload-no-silent-loss` and `graceful-shutdown-flushes-all` address the Drop/flush root cause from different angles | +| Safety dominance in catalog | 11/19 safety properties is appropriate for a SUT whose primary failure modes are silent correctness violations | +| Fault dependency documentation | The catalog correctly identifies node-termination as a global prerequisite and flags it prominently | + +### UNCERTAINTIES + +| # | Question | Why it matters | +|---|----------|---------------| +| U1 | Does `drop_newest` apply to disk buffers at all? (open question in `dropped-events-are-counted.md`) | If `try_write_record` never returns the item for disk buffers, the property is unreachable and the bug untestable via Antithesis without a different workload design | +| U2 | Does the topology call `writer.flush()` before dropping the writer on graceful shutdown? | Determines whether `graceful-shutdown-flushes-all` is a bug-exposing property or a correctness confirmation; affects `config-reload-no-silent-loss` similarly | +| U3 | Is the F5 (torn-tail rkyv CRC collision) probability high enough to be reachable in Antithesis without explicit fault shaping? 
| If CRC32C is an effective guard, `partial-write-at-rotation-recovers` may pass vacuously on the F5 sub-path; the property needs a workload that can observe whether the `Corrupted` vs. `Valid{garbage_id}` distinction is reached | +| U4 | Are node-termination faults enabled in the tenant? | 14/19 properties are unreachable without kill/restart; if disabled, the catalog reports nothing meaningful for its highest-value cluster | +| U5 | Is the `DiskBufferV1CompatibilityMode` flag check (`can_decode`) actually on the hot path for all disk-v2 reads, or only for records written with the old serialization? | Determines whether F5 (flag inversion) is a continuous risk or only affects buffers written by older Vector versions | diff --git a/tests/antithesis/scratchbook/evaluation/implementability.md b/tests/antithesis/scratchbook/evaluation/implementability.md new file mode 100644 index 0000000000000..13c8546a1f2f3 --- /dev/null +++ b/tests/antithesis/scratchbook/evaluation/implementability.md @@ -0,0 +1,909 @@ +--- +sut_path: /home/ssm-user/src/vector +commit: b7aae737cef5dd37d1445915443a1eb97b584f85 +updated: 2026-05-28 +external_references: [] +--- + +# Implementability Evaluation: Disk Buffer v2 Property Catalog + +19 properties evaluated across 6 categories. The central implementation risk is +the **instrumented-build burden**: zero Antithesis SDK instrumentation exists in +the codebase today (confirmed by repo-wide grep), so every SUT-side assertion +must be added from scratch by adding `antithesis-sdk` as a new Cargo dependency +to `lib/vector-buffers`. This is a one-time setup cost shared across many +properties, but it is a real precondition for roughly half the catalog. + +A second structural risk is the **persistent-volume assumption**: the deployment +topology explicitly requires that the `data_dir` survives node-termination faults. 
+If the Antithesis tenant recreates the container with a fresh filesystem on kill, +every crash-recovery property either passes vacuously (buffer wiped → clean init) +or fails spuriously (data expected that no longer exists on disk). This must be +confirmed before any Category 2–6 property is run. + +--- + +## Category 1 — Data Integrity and Corruption + +### no-corrupted-record-delivered + +**Implementability: Feasible — requires instrumented build.** + +The safety invariant (`AlwaysOrUnreachable` at `reader.rs:~1131`) is invisible +from the workload: the workload only sees events delivered downstream; it cannot +distinguish "record validated and passed" from "record bypassed validation." The +SUT-side `assert_always_or_unreachable!` is necessary and its placement is clean +(one choke point at `Ok(Some(record))` in `BufferReader::next`). + +The instrumented build adds the SDK to `lib/vector-buffers/Cargo.toml` — a +one-time change. The assertion itself is well-posed. Fault injection (bit-flip on +`.dat` files, partial write via kill during flush) is supported by Antithesis. + +One complication: the startup path (`seek_to_next_record`) calls +`validate_record_archive` directly at `reader.rs:850`, not through `try_next_record`. +The `Ok(Some(record))` assertion placement at `reader.rs:1131` may miss corruption +detected (and recovered) during startup replay — a second assertion at the startup +validation call site may be needed. This is an implementability wrinkle, not a +blocker: the assertion is straightforwardly placeable at both sites. + +A secondary concern is that `receiver.rs` panics on `ReaderError` from `next()`, +meaning any detected corruption immediately crashes the process. The workload +cannot distinguish "corruption detected and recovered" from "process died" without +also checking restart behavior. The workload needs a liveness check alongside the +safety assertion. + +**Verdict:** Implementable with instrumented build. 
Flag the two-site placement +and the panic-on-detection observation. + +--- + +### corruption-is-detected-and-recovered + +**Implementability: Feasible — requires instrumented build, with reachability +dependency on Antithesis fault injection timing.** + +The `Sometimes(corruption_detected_and_recovered)` at `reader.rs:~1035` +(`is_bad_read()` branch) requires Antithesis to inject a bit-flip or truncation +while the reader's `BufReader` has the relevant file open. This is a timing +constraint: Antithesis must inject the fault during an active read, not while the +file is closed. Antithesis's filesystem fault injection is designed for exactly +this use case and should handle it. + +The assertion is SUT-side (instrumented build required), but straightforwardly +placed. The companion concern is whether the `Sometimes` is reachable at all in +the absence of faults: it is not (the `is_bad_read()` path requires actual corruption +or a partial write). This is correct for a `Sometimes` — it will be satisfied only +when fault injection reaches the live reader, which is the intended signal. + +**Verdict:** Implementable with instrumented build and filesystem fault injection +enabled. + +--- + +### record-id-monotonicity-holds + +**Implementability: Feasible — requires instrumented build.** + +The `Unreachable` at the monotonicity panic site (`reader.rs:~480-484`) is +trivial to implement: the panic is already there, so placing an +`assert_unreachable!` immediately before the `panic!` macro preserves the +existing behavior while adding Antithesis telemetry. + +This property is a pure SUT-side assertion with no workload-observable equivalent. +The precondition — reaching the panic — requires one of: crashed `validate_last_write` +fast-forward producing a wrong `next_record_id`, torn-tail mis-recovery (F5), +or file-ID rollover misclassification.
All require node-termination faults and +persistent-volume survival (see structural risks above). + +One subtle point: the property evidence notes that `OrderedAcknowledgements:: +add_marker` may not use wrap-aware comparison, making a fresh-buffer ID 0 +a false-trigger candidate. This is a testability concern (the assertion might +fire spuriously on a clean buffer without faults). Needs verification before +asserting `Unreachable`. + +**Verdict:** Implementable with instrumented build, node-kill faults, and +persistent volume. Verify the wrap-aware comparison question before committing +to `Unreachable`. + +--- + +### record-never-spans-files + +**Implementability: Mostly workload-observable; SUT-side optional.** + +The workload can observe spanning indirectly: if a record is written successfully +(`Ok`) but never appears downstream, a span is a plausible cause. A more direct +check is a watchdog that monitors `.dat` file sizes and flags any single data file +exceeding `max_data_file_size + max_record_size`. This is a simple filesystem +enumeration, not a SUT-side assertion, and does not require the instrumented build. + +The SUT-side assertion (in `RecordWriter::flush_record`) provides a direct, same- +process check, but the watchdog approach is implementable immediately with zero +SDK changes. + +The `debug_assert` at `writer.rs:~396` being compiled out of release is a real +concern for the `max_record_size > max_data_file_size` misconfiguration path: in +release builds, the writer loops forever on `DataFileFull`. This is observable +from the workload (write throughput drops to zero) and does not require SDK +instrumentation to detect. + +**Verdict:** Implementable without instrumented build via file-size watchdog. +SUT-side assertion is optional but would provide tighter detection. + +--- + +## Category 2 — Buffer Accounting and Writer Liveness + +This is the highest-value cluster. 
The implementability of every property in this cluster +depends on the **persistent-volume assumption** and **node-termination faults**. +If either is unavailable, the entire cluster becomes unimplementable or vacuous. + +### total-buffer-size-never-underflows + +**Implementability: Requires instrumented build. The trigger requires a specific +crash-boundary scenario that Antithesis must find; the workload cannot construct it.** + +The internal state (`total_buffer_size` atomic) is completely invisible to the +workload. The only observable signal is downstream throughput collapse (which +could have many causes). The SUT-side assertions at `ledger.rs:~291` and +`reader.rs:~521-535` are necessary. + +Regarding the "specific deterministic trigger" concern: the underflow requires a +kill at a file-rotation or partial-write boundary where file-on-disk bytes exceed +record-decoded bytes. The workload cannot reliably position this kill timing; it +relies on Antithesis's systematic exploration across timelines. With a small +`max_data_file_size` (1MB configuration), rotations happen frequently, making +the trigger window large relative to run time. This is a good fit for Antithesis +exploration (not a single needle-in-a-haystack moment). + +One complication: the `trace!` log at `ledger.rs:295` computes +`last_total_buffer_size - amount`, which underflows under the same conditions. +In a debug build this subtraction would panic rather than wrap, preventing the +bug from being observed with release semantics. The harness should use a release +build (or a build that disables debug overflow checks) to observe the wrapping +behavior rather than catching it as a debug-mode panic. This affects harness +build profile selection. + +**Verdict:** Implementable with instrumented build, persistent volume, and +node-kill faults. Harness must use release mode (or explicitly allow wrapping +arithmetic in the `trace!` log). Antithesis exploration drives trigger discovery.
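
The build-profile concern comes down to two different subtraction semantics. The following is a minimal standalone sketch (std only; the function names and values are illustrative assumptions and are not taken from `ledger.rs`): `AtomicU64::fetch_sub` wraps on underflow in every build profile, whereas a plain `-` on `u64` panics when overflow checks are enabled, which is the default for debug builds.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical stand-in for decrementing a shared size counter.
/// Returns the value stored in the counter after the subtraction.
fn decrement_wrapping(total: &AtomicU64, amount: u64) -> u64 {
    // `fetch_sub` returns the previous value and wraps around on underflow,
    // regardless of build profile.
    let previous = total.fetch_sub(amount, Ordering::AcqRel);
    previous.wrapping_sub(amount)
}

/// Models a plain `previous - amount` with overflow checks enabled:
/// `checked_sub` reports the underflow as `None` where `-` would panic.
fn decrement_checked(previous: u64, amount: u64) -> Option<u64> {
    previous.checked_sub(amount)
}
```

With a counter at 100 and a decrement of 150, `decrement_wrapping` leaves the counter near `u64::MAX`, while `decrement_checked(100, 150)` returns `None`. A SUT-side assertion would most likely want the `checked_sub` form, so the violation is reported without changing what the atomic actually stores.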
+ +--- + +### writer-eventually-makes-progress + +**Implementability: Requires instrumented build for the sharp signal; partially +observable from the workload.** + +The workload can detect the deadlock symptom: write throughput drops to zero +after a node-kill-and-restart. The workload drives: fill buffer → kill → restart +→ resume reader → quiet period → assert write throughput recovered. This is +implementable without SDK instrumentation, though the signal is delayed (the +workload must wait for a grace period before concluding deadlock vs. slow recovery). + +The SUT-side `Sometimes(writer_unblocked_after_full)` provides a precise signal +at the recovery point. The `Unreachable` on repeated no-progress wakeups is a +sharper deadlock indicator that requires the instrumented build. + +The `Sometimes` assertion must be reachable on the non-fault path (any normal +full→drain cycle) to establish a baseline before fault testing. This is confirmed +by the property evidence: the normal fill/drain cycle satisfies the `Sometimes`. + +**Verdict:** Workload-observable without instrumented build (throughput signal). +SUT-side instrumentation sharpens the signal. Both paths implementable. + +--- + +### buffer-size-within-max + +**Implementability: Workload-observable via file-size watchdog, no instrumented +build required. The deadlock-vacuity problem makes this property misleading in +isolation.** + +The property's critical note is that it passes vacuously under the underflow +deadlock. The workload must check both: + +1. `sum(.dat file sizes) <= max_buffer_size + max_record_size` (the bound) +2. `write_throughput > 0` after the quiet period (rules out deadlock) + +Both are workload-observable. The property evidence correctly notes this must be +evaluated jointly with `writer-eventually-makes-progress`. 
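
The two-part check above could be sketched as a small workload-side watchdog. This is an illustration under stated assumptions, not harness code: `sum_dat_file_sizes`, `evaluate`, `Verdict`, and the `events_written_since_quiet_period` input are hypothetical names, and the sketch assumes the data files sit directly under one directory with a `.dat` extension.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Sum the on-disk sizes of all `.dat` files directly under `data_dir`.
fn sum_dat_file_sizes(data_dir: &Path) -> io::Result<u64> {
    let mut total = 0u64;
    for entry in fs::read_dir(data_dir)? {
        let entry = entry?;
        let is_dat = entry.path().extension().and_then(|ext| ext.to_str()) == Some("dat");
        if !is_dat {
            continue;
        }
        match entry.metadata() {
            Ok(meta) => total += meta.len(),
            // A file can be deleted between listing and stat; skip it.
            Err(err) if err.kind() == io::ErrorKind::NotFound => continue,
            Err(err) => return Err(err),
        }
    }
    Ok(total)
}

#[derive(Debug, PartialEq)]
enum Verdict {
    Pass,
    BoundExceeded,
    /// The bound holds, but only because nothing is being written.
    VacuousNoProgress,
}

/// Combine the size bound with a progress check so that a stalled writer
/// is not reported as a passing bound.
fn evaluate(
    on_disk_bytes: u64,
    max_buffer_size: u64,
    max_record_size: u64,
    events_written_since_quiet_period: u64,
) -> Verdict {
    if on_disk_bytes > max_buffer_size.saturating_add(max_record_size) {
        Verdict::BoundExceeded
    } else if events_written_since_quiet_period == 0 {
        Verdict::VacuousNoProgress
    } else {
        Verdict::Pass
    }
}
```

Returning a three-way verdict, instead of a boolean, keeps the vacuous case visible in the workload's report.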
+ +One subtle point: the workload must read actual file sizes, not the buffer metric +gauge, because PR #23561's `saturating_sub` makes the gauge show 0 (normal) even +under the deadlock. The watchdog must enumerate `.dat` files directly. + +**Verdict:** Implementable without instrumented build. Must always be evaluated +jointly with `writer-eventually-makes-progress` to avoid vacuous pass. + +--- + +## Category 3 — Crash Durability and Recovery + +Every property in this category requires node-termination faults (SIGKILL) +and the persistent-volume assumption. If the persistent volume is unavailable, +they all become vacuous or spuriously failing. + +### durable-unacked-events-survive-crash + +**Implementability: Workload-observable in principle; the "durably written" +definition is the main challenge.** + +The workload maintains `DURABLE` (confirmed-synced events) and `DELIVERED` +(post-restart downstream events) sets and asserts `DURABLE ⊆ DELIVERED`. This +is implementable at the workload level. + +The open question — how to establish "durably written" without Vector internals +access — has a clean answer: use `flush_interval=0` (force fsync on every flush) +so every write that succeeds at the source level is durably synced. The workload +can then treat all successfully-accepted events as `DURABLE`, excluding only +events sent within the final window before the kill. This is the recommended +option (c) from the property evidence file. + +The 500ms window exclusion must be handled carefully: the workload should stop +producing events some time (e.g. 2 seconds) before any kill to clear the window +boundary, or use `flush_interval=0` to eliminate the window. + +**Verdict:** Implementable. Use `flush_interval=0` to eliminate the durability- +window ambiguity. Requires node-kill faults and persistent volume.
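
The `DURABLE ⊆ DELIVERED` bookkeeping could look roughly like the sketch below. Everything here is an assumption for illustration: the `Accepted` record, the millisecond timestamps, and the idea that the workload knows (or can bound) the kill time are not taken from the property evidence.

```rust
use std::collections::BTreeSet;

/// Hypothetical workload-side record of an event the source accepted.
struct Accepted {
    id: u64,
    accepted_at_ms: u64,
}

/// Events accepted at or before `kill_at_ms - exclusion_window_ms` are
/// treated as durable; anything later falls inside the excluded window.
fn durable_ids(accepted: &[Accepted], kill_at_ms: u64, exclusion_window_ms: u64) -> BTreeSet<u64> {
    let cutoff = kill_at_ms.saturating_sub(exclusion_window_ms);
    accepted
        .iter()
        .filter(|event| event.accepted_at_ms <= cutoff)
        .map(|event| event.id)
        .collect()
}

/// IDs counted as durable but never seen downstream after restart.
/// An empty result means DURABLE ⊆ DELIVERED holds.
fn missing_durable(durable: &BTreeSet<u64>, delivered: &BTreeSet<u64>) -> Vec<u64> {
    durable.difference(delivered).copied().collect()
}
```

Returning the missing IDs, not just a boolean, gives the assertion's failure details something concrete to report. With `flush_interval=0` the exclusion window can be set to zero.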
+ +--- + +### every-written-event-eventually-delivered + +**Implementability: Workload-observable; the most comprehensive test of the +at-least-once contract.** + +The `PRODUCED` vs `DELIVERED_SET` comparison is straightforward workload logic. +The workload assigns unique IDs (e.g. monotonic counter embedded in event +payload), records all submitted IDs, records all downstream-received IDs, and +asserts the difference is empty after a quiet period. + +One complication: this property intentionally includes the `_status`-discard bug +(`sink-failure-not-silently-acked`). To test pure crash durability without +conflating the two bugs, the workload should use a reliable downstream stub that +always returns a successful response (never `Errored`/`Rejected`), isolating the +crash-durability signal. + +The `unlink-before-ledger-flush` window at `reader.rs:546-549` is the most +subtle latent loss path. The workload cannot control this timing; Antithesis +must find it. With many crash timings explored, the probability of hitting this +specific window is reasonable. + +An `assert_always` variant (workload fires if `PRODUCED \ DELIVERED` is non-empty +after quiet period) combined with a `Sometimes` (at least one timeline achieves +full delivery) is the recommended assertion structure. + +**Verdict:** Implementable as a workload-level check. Requires node-kill faults, +persistent volume, and a reliable downstream stub. Antithesis exploration drives +discovery of the loss windows. + +--- + +### recovery-completes-after-crash + +**Implementability: Partially workload-observable; SUT-side assertion provides +a cleaner signal.** + +The workload can proxy init completion by measuring time from process start to +first event being deliverable (first event appears downstream). This is +workload-observable but has noise (the throughput signal conflates init completion +with the post-init deadlock described in OQ-1). 
+ +The critical subtlety (also noted in `writer-eventually-makes-progress`): init +can return `Ok` while the writer is immediately deadlocked by the underflow bug. +The property must assert post-init progress, not just `Ok` return. Keeping init- +completion separate from liveness (as the evidence recommends) means two separate +assertions, with the combined result telling the full story. + +The SUT-side `assert_sometimes!` at the end of `from_config_inner` is clean and +confirms recovery is actually exercised. The `assert_always!` bounding +`update_buffer_size` output is a useful diagnostic addition. + +**Verdict:** Implementable. Workload proxy is sufficient. SUT-side assertion +(instrumented build) sharpens diagnosis. Requires node-kill faults and persistent +volume. + +--- + +### partial-write-at-rotation-recovers + +**Implementability: Requires instrumented build; very high fault-timing +sensitivity, but well-supported by Antithesis's approach.** + +The property requires kills specifically during the rotation sequence (8 distinct +windows at `writer.rs:1041-1138`). Setting `max_data_file_size` to 1MB makes +rotations happen every few seconds, giving Antithesis many opportunities. This is +a well-structured target: the fault windows are well-defined code paths, not a +single rare coincidence. + +The F5 torn-tail risk (rkyv `archived_root` reading garbage footer bytes as a +valid root pointer, yielding a plausible-looking-but-wrong `Valid` record) is the +subtlest path. The workload cannot provoke F5 directly; it relies on Antithesis +hitting the specific byte position during crash. CRC32C is the practical backstop +(F5 only matters if garbage bytes also produce a CRC32C collision, probability +~1/2^32 per check). Antithesis runs many timelines and can empirically explore +whether F5 is reachable before CRC32C catches it. 
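
To put the ~1/2^32 figure in context, the chance that at least one of many torn-tail validations accepts garbage can be estimated as below. This is a back-of-the-envelope sketch that assumes the garbage bytes behave as uniformly random with respect to the checksum and that checks are independent; `prob_any_false_accept` is a hypothetical helper, not part of the harness.

```rust
/// Probability that at least one of `checks` independent CRC32C validations
/// accepts garbage, assuming each one accepts with probability 2^-32.
fn prob_any_false_accept(checks: f64) -> f64 {
    let p = 2f64.powi(-32);
    // 1 - (1 - p)^n, computed via ln_1p / exp_m1 to keep precision for tiny p.
    -f64::exp_m1(checks * f64::ln_1p(-p))
}
```

Under these assumptions, even a million torn-tail validations across all timelines give a probability on the order of 1e-4, which is consistent with treating F5 as low-probability but non-zero.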
+ +The `Sometimes(torn_tail_recovered)` assertion (SUT-side, in `validate_last_write`) +confirms the torn-tail detection path is actually reached. Without it, rotation- +boundary kills might always hit a non-torn window and never exercise the recovery +path. This makes the instrumented build important for this property. + +The safety sub-properties (`Always` no-garbage, `Unreachable` monotonicity panic) +are covered by `no-corrupted-record-delivered` and `record-id-monotonicity-holds` +respectively, avoiding duplication. + +**Verdict:** Implementable with instrumented build, node-kill faults, persistent +volume, and small `max_data_file_size`. Antithesis exploration drives fault timing +discovery. CRC32C backstop makes F5 a low-probability but non-zero concern. + +--- + +## Category 4 — Space Reclamation and Clean Termination + +### acked-files-eventually-deleted + +**Implementability: Partially workload-observable; best checked jointly with +writer liveness.** + +The workload can check: after fully acknowledging all events in a data file (no +new writes, quiet period), enumerate `.dat` files and assert the acked file is +gone. The workload controls downstream ack delivery (it controls the HTTP stub +that Vector's HTTP sink delivers to); acknowledging all events is workload- +controllable. + +The finalizer-task kill scenario (kill the spawned finalizer tokio task) is not +an independent Antithesis node-fault; the finalizer lives inside the Vector +process. A node kill kills everything, including the finalizer. To specifically +kill the finalizer without killing the whole process, the workload would need to +inject a panic inside the finalizer, which requires SUT-side modification. This +specific sub-scenario (finalizer task dead, rest of process alive) is an edge +case best deferred to a unit test rather than an Antithesis test. 
+ +The `bytes_read > metadata.len()` underflow path (noted in the open questions) +is the same underflow bug from Category 2, exercisable through the same fault +sequence. + +**Verdict:** Implementable. File-enumeration watchdog covers the deletion check. +The finalizer-task-kill sub-scenario requires SUT modification to isolate and is +better treated as a unit test. + +--- + +### reader-drains-and-terminates-cleanly + +**Implementability: Workload-observable; directly addresses two disabled tests.** + +The workload can detect both failure modes: + +- Hang: reader does not return `None` within a bounded time after the writer stops + and all events are acked. The workload measures time-to-drain. +- Premature `None`: events are missing from the delivered set (cross-reference + `every-written-event-eventually-delivered`). + +This property directly targets the root cause of the disabled +`reader_exits_cleanly_when_writer_done_and_in_flight_acks` test (`basic.rs`, +`#[ignore = "flaky #23456"]`). The Antithesis version of this test is +straightforward: stop writes, deliver and ack all events, assert drain completes. + +The `total_buffer_size == 0` termination condition is broken by the underflow bug. +This makes the property's interaction with Category 2 critical: if the deadlock +fires, the reader never terminates cleanly. Running this property jointly with +`writer-eventually-makes-progress` identifies the cause. + +**Verdict:** Implementable without instrumented build. Directly exercises the +flaky `#23456` path. Requires node-kill faults and persistent volume to test the +fault-then-drain scenario. + +--- + +## Category 5 — Delivery Semantics and Boundary Conditions + +### sink-failure-not-silently-acked + +**Implementability: Feasible. The workload controls the downstream stub; the bug +is confirmed and observable. The custom fault is workload-driven, not Antithesis +node-fault-driven.** + +The workload controls an HTTP endpoint that Vector's HTTP sink delivers to. 
To +inject sink errors, the workload simply returns 5xx responses for a window. This +is a workload-driven custom "fault," not a built-in Antithesis fault type. It is +fully implementable without any special Antithesis tenant capabilities. + +The bug is confirmed (`ledger.rs:704` discards `_status`). The observable signal: +workload sends N events, forces the sink to error, waits for a quiet period, then +restarts Vector and checks whether those events are replayed. Under the bug, they +are not replayed (silently lost). This is a workload-observable end-to-end check. + +One concern: the property catalog notes that whether sinks actually emit +`Errored`/`Rejected` status in Vector's internal BatchStatus plumbing (not just +returning HTTP 5xx at the transport level) needs verification. Vector's HTTP sink +may internally retry on 5xx responses, eventually timing out, and only then +surfacing `Errored`. The workload-observable test works regardless: make the sink +permanently error for the test window and observe that events are lost without +replay. + +**Verdict:** Implementable without instrumented build. Workload-driven sink-error +injection is the fault mechanism, not a built-in Antithesis fault. The bug is +confirmed and the expected behavior is well-defined. + +--- + +### dropped-events-are-counted + +**Implementability: Workload-observable via metrics scraping. The +`drop_newest`-on-disk-buffer question needs verification.** + +The workload can compare `buffer_discarded_events_total` (expected to increment +on drops) against `component_discarded_events_total` (expected to also increment +but currently stays 0). Vector exposes both via its `prometheus_exporter` source, +which the workload can scrape. + +One critical concern: whether `drop_newest` applies to disk buffers at all. The +`try_write_record` for the disk buffer (`writer.rs:1166-1168`) returns `Some(item)` +(i.e., the item bounced back) when `is_buffer_full()` is true. 
The +`BufferSender::send` with `WhenFull::DropNewest` checks `try_send` and drops the +item if it is returned (`sender.rs:231-234`). This code path exists for disk +buffers: `SenderAdapter::DiskV2` implements `try_send` via `try_write_record`. +So the path is reachable. + +However, the 2-second reporter lag (buffer metrics tick every 2 seconds) means +the workload assertion must allow for a brief delay between drops occurring and +`buffer_discarded_events_total` incrementing. The workload needs a wait period +before asserting the metric comparison. + +**Verdict:** Implementable without instrumented build, via metrics scraping. +Must use `when_full: drop_newest` configuration. The 2-second reporter lag +requires a wait in the workload assertion. The disk-buffer `try_write_record` +path is verified as reachable. + +--- + +### file-id-rollover-stays-coordinated + +**Implementability: Requires a test binary (with `MAX_FILE_ID=6`) or synthetic +state injection (cfg(test)-gated). Production binary with default `MAX_FILE_ID=65535` +makes rollover unreachable in any practical Antithesis run.** + +This is a structural implementability issue. In production binaries, +`MAX_FILE_ID = u16::MAX = 65535` (`common.rs:43`). Reaching rollover requires +65535 file rotations. At 1MB per file (test configuration), that is 65GB of data +throughput. At 128MB per file (default), it is ~8TB. No Antithesis run reaches +this threshold in normal test time. + +In test binaries, `MAX_FILE_ID = 6` (`common.rs:45`, `#[cfg(test)]`). Rollover +is reached after 6 files, which at 1MB each takes seconds. However, the +`unsafe_set_writer_next_record_id` and `unsafe_set_reader_last_record_id` helpers +that could synthetically place the state near rollover are also `#[cfg(test)]`- +gated and unavailable in production builds. + +Two options exist: + +1. Run Vector in test build mode (which enables `MAX_FILE_ID=6`). 
This is non- + standard for an Antithesis run (test builds may have other cfg-gated behavior). +2. Add a configuration knob (e.g. an env var or hidden config option) that sets + `MAX_FILE_ID` at runtime, usable in a production build. + +Option 1 is simpler but requires the Antithesis harness to build Vector with +`--cfg test` (or equivalent), which may interact with other test-only code paths. +Option 2 requires a Vector code change. + +The raw `u16 >` comparison bug at `reader.rs:932` is real (confirmed by source +inspection). The question is only whether it can be triggered in the harness. + +**Verdict:** Not directly implementable with a standard production Vector binary +without code changes or a test build. Requires either test binary mode or a new +configuration hook for `MAX_FILE_ID`. Flag as needing decision on harness build +mode before implementation. + +--- + +### record-id-wraparound-accounting-holds + +**Implementability: The empty-buffer equality case (the realistic bug path) IS +workload-observable without any instrumentation. The u64-wrap case is practically +unreachable.** + +This is the most important implementability note in the entire catalog. The +property title suggests testing u64 record-ID wraparound (a ~2^64 write +threshold — completely unreachable by real traffic), but the actual bug that +fires on every clean restart with an empty buffer is the `0 - 1 = u64::MAX` +case in `get_total_records` at `ledger.rs:266`. This case is trivially reachable: + +1. Write N events into the buffer. +2. Read and acknowledge all N events (drain completely). +3. Restart Vector with the same buffer directory. +4. Immediately scrape the `buffer_events_received_total` or `buffer_byte_size` + gauge from the `prometheus_exporter`. +5. Assert the gauge value is near 0, not `u64::MAX = 1.844e19`. + +Step 4 requires Vector to expose metrics, which the topology already does via +`internal_metrics` → `prometheus_exporter`. 
The workload can scrape this without +any SUT-side SDK instrumentation. + +The test-helper `unsafe_set_writer_next_record_id` / `unsafe_set_reader_last_record_id` +for the true u64-wrap case are `#[cfg(test)]`-gated and unavailable in production +builds. The near-u64-MAX record ID scenario is not reachable by real traffic in +any test run. + +The property as stated ("holds across u64 wraparound AND the empty-buffer equality +case") is therefore split: + +- Empty-buffer equality case: **reachable, workload-observable, no instrumentation + needed.** This is the real bug. +- u64 wrap case: **not reachable** with production binary (requires 2^64 writes). + The property statement claiming both are tested is misleading for the wrap case. + +The debug-build consideration: `0u64 - 1` panics in Rust debug mode. If the +Antithesis harness uses a debug build, the bug surfaces as a panic rather than a +silent `u64::MAX`. The property evidence notes this is a stronger signal. + +**Verdict:** The empty-buffer equality case is fully implementable without +instrumented build via metric scraping after a drain+restart cycle. The u64 wrap +case is practically unreachable with a production binary and should be explicitly +descoped or relabeled. The property should be renamed/refocused to +`record-id-empty-buffer-accounting-holds` to reflect what is actually testable. + +--- + +## Category 6 — Lifecycle and Config Reload + +### config-reload-no-silent-loss + +**Implementability: Requires custom fault (workload-driven SIGHUP); feasible +but with architectural dependency on Antithesis tenant capabilities.** + +The `SIGHUP` mechanism is confirmed: `signal.rs:200, 218-219` handles `SIGHUP` +and converts it to `SignalTo::ReloadFromDisk`. Sending `SIGHUP` from the workload +container to the Vector process is straightforward: the workload knows the Vector +process PID (or uses `kill -HUP $(pidof vector)`). 
This does not require a +special Antithesis fault injection capability — the workload can send SIGHUP +directly to the Vector process using a standard OS signal. + +This is a workload-driven trigger, not a built-in Antithesis fault. The Antithesis +scheduler does not need to be aware of reload semantics; the workload fires SIGHUP +on a schedule while Antithesis's node-kill and CPU-throttle faults run concurrently. + +The loss-observable signal is workload-level: `accepted_event_ids == received_event_ids` +after a quiet period. The workload tracks every event accepted by the HTTP source +and every event received at the downstream stub. + +The `BufferWriter::Drop` calling `close()` but not `flush()` means events +staged in the 256KB `TrackingBufWriter` at reload time are silently dropped. The +workload-level check catches this directly. No SUT-side instrumentation is needed +to observe the bug. + +One concern: whether the per-process advisory lock gap (two Vector topologies +briefly sharing the buffer directory) is exercisable. The evidence suggests +`running.rs:688-710` attempts sequencing, but the finalizer task's `Arc` +retention may extend the overlap window. This is a background race observable +only if the workload drives reload under concurrent write load and the file +deletion lag causes corruption. The workload-level end-to-end assertion catches +any resulting loss. + +**Verdict:** Implementable. Workload sends SIGHUP (no special tenant capability +required). The loss is workload-observable without instrumentation. Prioritize +alongside `graceful-shutdown-flushes-all` as they share the `Drop`-without-flush +root cause. 
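As a rough illustration of the workload-level loss check described in this section, the sketch below lists the accepted events that never reached the downstream stub. The helper name `missing_events` and the use of `u64` event IDs are assumptions made for illustration, not part of the harness. It also relaxes the strict equality to a subset check, on the assumption that duplicates on the receive side are acceptable under at-least-once delivery.

```rust
use std::collections::HashSet;

/// Hypothetical helper: IDs accepted by the HTTP source that never arrived
/// at the downstream stub. `received` may contain duplicates (at-least-once),
/// and IDs in `received` that were never accepted are ignored here.
fn missing_events(accepted: &HashSet<u64>, received: &[u64]) -> Vec<u64> {
    // Collapse duplicates so a replayed event counts once.
    let seen: HashSet<u64> = received.iter().copied().collect();
    // Anything accepted but not seen is a candidate silent loss.
    let mut missing: Vec<u64> = accepted.difference(&seen).copied().collect();
    // Sort so that failure output is stable and easy to read.
    missing.sort_unstable();
    missing
}
```

A workload would presumably call this after the quiet period following each SIGHUP and fail the run if the returned list is non-empty.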
+ +--- + +### graceful-shutdown-flushes-all + +**Implementability: Workload-observable; the critical `stop()` drop-order question +needs a one-time code inspection or logging addition to resolve.** + +The workload sends SIGTERM (graceful stop), waits for the Vector process to exit, +restarts with the same buffer directory, drains the buffer to empty, and asserts +all pre-shutdown accepted events appear downstream. This is a workload-level +end-to-end check requiring no SDK instrumentation. + +The central uncertainty is whether `RunningTopology::stop()` at `running.rs:145` +drops `self.inputs` (and thus the `BufferSender` / `Arc>`) +before or after the write-loop task completes its final `flush()`. The property +evidence analysis suggests `inputs` is dropped synchronously inside `stop()` before +the returned future is polled, which could mean the `BufferWriter` is dropped +(unflushed) while the write loop is still running. However, the write loop calls +`flush()` after every `write_record` (`sender.rs:86-98`), so if the last event +was processed before SIGTERM, `TrackingBufWriter` should be empty. + +This is resolvable without code changes: add a `tracing::debug!` log in +`TrackingBufWriter::drop` showing `buf.len()` and run a graceful-shutdown test +to confirm it is 0. This is a diagnostic addition, not an SDK assertion. + +The SUT-side `assert_always!(unflushed_bytes == 0)` in `close()` (proposed in +the evidence file) is a cleaner check but requires the instrumented build. + +A subtle note: graceful shutdown may rely on OS page-cache flush-on-process-exit +for the 500ms fsync window, not an explicit Vector `sync_all`. This is OS- +dependent behavior (Linux guaranteed, but not Vector-guaranteed). The workload +test implicitly covers this by comparing events on restart. + +**Verdict:** Implementable without instrumented build via workload-level end-to-end +assertion. 
The `stop()` drop ordering should be verified with a diagnostic log +or code trace before asserting zero-loss expectations. + +--- + +## Cross-Cutting Implementability Concerns + +### 1. Persistent-Volume Assumption (CRITICAL) + +**Properties affected: ALL of Category 2 (5 properties), ALL of Category 3 (5 +properties), Category 4 (2 properties), Category 6 (2 properties). Total: 14 of +19 properties.** + +The deployment topology requires that the disk buffer `data_dir` survives a +container node-kill fault. If the Antithesis tenant's node-termination recreates +the container with a fresh filesystem, the buffer is wiped on every restart, and: + +- Crash-recovery properties (Category 3) pass vacuously (no data to recover). +- Deadlock properties (Category 2) are unreachable (clean state on restart). +- Liveness/reclamation properties (Category 4) are unreachable (no pre-existing + data to drain). + +**This is the single most important implementation prerequisite to confirm.** The +deployment topology document explicitly calls this out as a requirement. Without +a persistent volume, 14 of 19 properties are either vacuous or spuriously failing. + +### 2. Node-Termination Faults (CRITICAL) + +**Properties affected: All 14 listed above (same set as persistent-volume).** + +Node-termination faults (SIGKILL) are often disabled by default in Antithesis +tenants. Without them, the underflow bug (#21683), torn-tail recovery, and all +at-least-once crash-durability properties are unreachable. This must be +confirmed with the Antithesis tenant operator before any Category 2–6 property +is declared implementable. + +### 3. 
Instrumented Build Burden + +**Properties requiring SUT-side SDK assertions (instrumented build):** + +- `total-buffer-size-never-underflows` (CRITICAL — internal atomic invisible) +- `writer-eventually-makes-progress` (SDK sharpens; workload signal exists) +- `record-id-monotonicity-holds` (CRITICAL — panic path, no workload equivalent) +- `no-corrupted-record-delivered` (CRITICAL — emission point invisible externally) +- `corruption-is-detected-and-recovered` (CRITICAL — branch not observable externally) +- `partial-write-at-rotation-recovers` (SDK confirms recovery path reached) +- `recovery-completes-after-crash` (SDK at `from_config_inner` end) +- `graceful-shutdown-flushes-all` (SDK optional; workload signal exists) + +The instrumented build requires: + +1. Adding `antithesis-sdk` to `lib/vector-buffers/Cargo.toml` (one-time change). +2. Rebuilding Vector with the SDK dependency. +3. Inserting assertion calls at ~10 locations across `ledger.rs`, `reader.rs`, + `writer.rs`, and `mod.rs`. + +The SDK assertions are no-ops outside Antithesis, so the instrumented build is +safe for normal use. This is a one-time setup cost, not per-property. 
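To give a sense of what one such assertion site might check, the sketch below shows how an underflow on an atomic byte counter can be detected at the point of the subtraction. It relies only on the documented behavior that `AtomicU64::fetch_sub` wraps on overflow and returns the previous value. The helper name `decrement_checked` is hypothetical and is not taken from `ledger.rs`. In an instrumented build, the returned flag would presumably feed an SDK assertion rather than be returned to the caller.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical helper: subtract `amount` from the counter and report whether
/// the subtraction stayed in range. `fetch_sub` itself never fails; it wraps,
/// so the only way to notice an underflow is to compare against the previous
/// value it returns.
fn decrement_checked(total: &AtomicU64, amount: u64) -> bool {
    let previous = total.fetch_sub(amount, Ordering::AcqRel);
    // If more was subtracted than was present, the stored value has wrapped
    // to a number close to u64::MAX.
    previous >= amount
}
```

The check runs after the subtraction has already taken effect, so it can only report the wrap, not prevent it. That matches the role of an assertion as opposed to a fix.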
+ +**Properties fully workload-observable without instrumented build:** + +- `writer-eventually-makes-progress` (throughput signal) +- `buffer-size-within-max` (file-size watchdog) +- `durable-unacked-events-survive-crash` (set-difference check) +- `every-written-event-eventually-delivered` (set-difference check) +- `recovery-completes-after-crash` (time-to-deliverable proxy) +- `reader-drains-and-terminates-cleanly` (time-to-drain) +- `sink-failure-not-silently-acked` (workload controls sink errors) +- `dropped-events-are-counted` (metrics scraping) +- `record-id-wraparound-accounting-holds` (metrics scraping after drain+restart) +- `config-reload-no-silent-loss` (workload controls SIGHUP + end-to-end count) +- `graceful-shutdown-flushes-all` (workload end-to-end count) +- `record-never-spans-files` (file-size watchdog) + +### 4. `record-id-wraparound-accounting-holds` Descoping + +As analyzed above, the u64 wrap case (requires 2^64 writes) is not practically +reachable with any production binary. The property as stated conflates two +distinct scenarios: + +- **Empty-buffer equality** (trivially reachable, workload-observable, is a real + bug at `ledger.rs:266`). +- **True u64 wraparound** (not reachable without `#[cfg(test)]`-gated test + helpers or 2^64 actual writes). + +The implementable portion is the empty-buffer equality check. This should be +split into its own concrete test case and the u64 wrap portion explicitly +acknowledged as out of scope for production-binary Antithesis testing. + +### 5. `file-id-rollover-stays-coordinated` Binary Mode Decision + +The property is only implementable with `MAX_FILE_ID=6` (test build mode) because +production binary rollover is unreachable. Decision required from the user: + +- Run test binary mode in Antithesis (enables `#[cfg(test)]` constants including + `MAX_FILE_ID=6`, but also enables other test-only code). +- Add a runtime-configurable `MAX_FILE_ID` override (new Vector code change). 
+- Descope the property to "latent bug documented but not tested in this run." + +### 6. Custom Fault Summary + +| Property | Fault type | Mechanism | Tenant capability needed? | +|---|---|---|---| +| `config-reload-no-silent-loss` | SIGHUP to Vector process | Workload sends signal | No (workload OS call) | +| `sink-failure-not-silently-acked` | HTTP 5xx responses | Workload controls stub | No (workload HTTP) | +| `file-id-rollover-stays-coordinated` | `MAX_FILE_ID=6` + node kill | Test binary + node kill | Yes (node kill) | +| All Category 2–3 | SIGKILL | Node termination | Yes (must confirm enabled) | + +--- + +## Summary Table + +| Property | Workload-Observable? | Instrumented Build Required? | Node-Kill Required? | Persistent Volume Required? | Verdict | +|---|---|---|---|---|---| +| `no-corrupted-record-delivered` | No | YES (primary) | No (faults inject corruption) | No | Implementable — instrumented build | +| `corruption-is-detected-and-recovered` | No | YES | No | No | Implementable — instrumented build | +| `record-id-monotonicity-holds` | No | YES | Yes | Yes | Implementable — instrumented build + node kill | +| `record-never-spans-files` | Yes (watchdog) | Optional | No | No | Implementable — no instrumented build | +| `total-buffer-size-never-underflows` | No | YES (critical) | Yes | Yes | Implementable — instrumented build + node kill | +| `writer-eventually-makes-progress` | Yes (throughput) | Optional (sharpens) | Yes | Yes | Implementable — node kill; instrumented improves | +| `buffer-size-within-max` | Yes (watchdog) | No | Yes | Yes | Implementable — must pair with liveness check | +| `durable-unacked-events-survive-crash` | Yes | No | Yes | Yes | Implementable — use flush_interval=0 | +| `every-written-event-eventually-delivered` | Yes | No | Yes | Yes | Implementable | +| `recovery-completes-after-crash` | Yes (proxy) | Optional | Yes | Yes | Implementable | +| `partial-write-at-rotation-recovers` | Partial | YES (Sometimes) | Yes | Yes | 
Implementable — instrumented build + small file size | +| `acked-files-eventually-deleted` | Yes (watchdog) | No | Yes | Yes | Implementable | +| `reader-drains-and-terminates-cleanly` | Yes | No | Yes | Yes | Implementable | +| `sink-failure-not-silently-acked` | Yes | No | No | No | Implementable — workload controls sink | +| `dropped-events-are-counted` | Yes (metrics) | No | No | No | Implementable — metrics scraping | +| `file-id-rollover-stays-coordinated` | Yes | No | Yes | Yes | BLOCKED — needs test binary or MAX_FILE_ID knob | +| `record-id-wraparound-accounting-holds` | Yes (empty case) | No | No | No | Implementable (empty case only); u64 wrap descope | +| `config-reload-no-silent-loss` | Yes | No | No (SIGHUP) | No | Implementable — workload sends SIGHUP | +| `graceful-shutdown-flushes-all` | Yes | Optional | No | No | Implementable | + +--- + +## Findings (Concerns) + +### F1 — Persistent-volume assumption unconfirmed (BLOCKER for 14 properties) + +- **Scope:** Categories 2, 3, 4, 6 (14 of 19 properties) +- **Concern:** If the Antithesis tenant recreates the container filesystem on + node-kill, crash-recovery properties pass vacuously and deadlock properties are + unreachable. +- **Evidence:** Deployment topology document §CRITICAL note; buffer `data_dir` + must survive kill/restart for any crash-recovery property to be meaningful. +- **Action:** Confirm with Antithesis tenant operator that the buffer `data_dir` + (on a mounted persistent volume) is not wiped on node-termination fault before + beginning any Category 2–6 testing. + +### F2 — Node-termination faults may be disabled (BLOCKER for same 14 properties) + +- **Scope:** Categories 2, 3, 4, 6 +- **Concern:** Node-kill faults are often disabled by default in Antithesis tenants. + Without SIGKILL, the underflow trigger, torn-tail recovery, and all at-least-once + crash-durability properties are unreachable. +- **Evidence:** Property catalog file-level open questions; sut-analysis.md §Assumptions. 
+- **Action:** Confirm node-termination faults are enabled in the target tenant. + +### F3 — Zero Antithesis SDK instrumentation (BLOCKER for 5 properties, burden for 3 more) + +- **Scope:** `total-buffer-size-never-underflows`, `record-id-monotonicity-holds`, + `no-corrupted-record-delivered`, `corruption-is-detected-and-recovered`, + `partial-write-at-rotation-recovers`; optional for `writer-eventually-makes-progress`, + `recovery-completes-after-crash`, `graceful-shutdown-flushes-all` +- **Concern:** Every SUT-side assertion must be added from scratch. The 5 + critical properties have internal state that is entirely invisible from the + workload without SDK instrumentation. +- **Evidence:** `existing-assertions.md` confirms zero SDK usage; repo-wide grep + returns no matches. +- **Action:** Add `antithesis-sdk` to `lib/vector-buffers/Cargo.toml` and insert + assertions at the ~10 identified sites. This is a one-time build setup shared + across all affected properties. + +### F4 — `file-id-rollover-stays-coordinated` unreachable with production binary (BLOCKER) + +- **Scope:** `file-id-rollover-stays-coordinated` +- **Concern:** `MAX_FILE_ID=65535` in production binary requires ~8TB of data + throughput to trigger rollover. Not achievable in any practical Antithesis run. + `MAX_FILE_ID=6` is only compiled in `#[cfg(test)]` builds. +- **Evidence:** `common.rs:43-45` confirms the conditional constant; test helpers + for synthetic state injection are also `#[cfg(test)]`-gated. +- **Action:** Choose one: (a) run Vector in test binary mode (enabling `MAX_FILE_ID=6`), + (b) add a runtime-configurable override for `MAX_FILE_ID`, or (c) descope to + "latent bug documented but not exercised in this Antithesis run." + +### F5 — `record-id-wraparound-accounting-holds` as stated is half-vacuous + +- **Scope:** `record-id-wraparound-accounting-holds` +- **Concern:** The u64 wrap case (the property's primary stated focus) is + unreachable with any production binary. 
The testable bug is the empty-buffer + equality case (`wrapping_sub(0,0) - 1 = u64::MAX`), which triggers on every + clean restart with a drained buffer. +- **Evidence:** `ledger.rs:266` confirmed; `#[cfg(test)]` gate on + `unsafe_set_writer_next_record_id` / `unsafe_set_reader_last_record_id` at + `ledger.rs:173-196`; no path to u64::MAX record ID via real writes. +- **Action:** Refocus the property on the empty-buffer equality case (rename and + re-scope). Implement as: drain buffer completely → restart → scrape + `buffer_byte_size` or `buffer_events_received_total` → assert near 0. This + is workload-observable without any SDK instrumentation. Explicitly descope the + u64 wrap case. + +### F6 — `total-buffer-size-never-underflows` debug-build conflict + +- **Scope:** `total-buffer-size-never-underflows` +- **Concern:** The `trace!` macro at `ledger.rs:295` includes `last_total_buffer_size - amount` + as a Rust expression, which panics in debug mode on overflow before the + `fetch_sub` wrapping behavior is observable. +- **Evidence:** `ledger.rs:291-298` source; Rust debug arithmetic overflow semantics. +- **Action:** Use release build for Antithesis testing of this property. Document + this as a harness build requirement. Separately, fix the `trace!` to use + `wrapping_sub` to avoid the debug-mode panic. + +### F7 — `buffer-size-within-max` vacuous under deadlock (design concern) + +- **Scope:** `buffer-size-within-max` +- **Concern:** The safety property passes trivially when the writer is deadlocked + (no writes → no overflow). A passing result for this property alone is not + evidence of correct behavior. +- **Evidence:** `buffer-size-within-max.md` deadlock-vacuity section; `sut-analysis.md` + §5 INV-7. +- **Action:** Always assert this property jointly with `writer-eventually-makes-progress`. + The combined result (size holds AND liveness holds) is the meaningful signal. 
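The joint evaluation recommended in F7 could be expressed as a three-way verdict, so that a vacuous pass is never reported as a plain pass. This is a minimal sketch: the type `JointVerdict`, the function `joint_verdict`, and its parameters are hypothetical names. It also assumes the workload was actively offering writes during the observation window, since otherwise zero writes would be healthy idleness and not a stall.

```rust
/// Hypothetical three-way outcome for `buffer-size-within-max` when it is
/// evaluated together with `writer-eventually-makes-progress`.
#[derive(Debug, PartialEq, Eq)]
enum JointVerdict {
    /// Size bound held and the writer made progress.
    Pass,
    /// The on-disk size exceeded the configured maximum.
    SizeExceeded,
    /// Size bound held only because nothing was written.
    Vacuous,
}

/// Assumes load was offered for the whole window (see lead-in).
fn joint_verdict(observed_bytes: u64, max_bytes: u64, writes_in_window: u64) -> JointVerdict {
    if observed_bytes > max_bytes {
        JointVerdict::SizeExceeded
    } else if writes_in_window == 0 {
        // No writes means no overflow is possible: the bound tells us nothing.
        JointVerdict::Vacuous
    } else {
        JointVerdict::Pass
    }
}
```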
+ +### F8 — `graceful-shutdown-flushes-all` stop() drop-order unresolved + +- **Scope:** `graceful-shutdown-flushes-all` +- **Concern:** Whether `RunningTopology::stop()` drops `inputs` (and thus the + `BufferWriter`) before the write-loop task completes its final `flush()` is + unresolved. If the drop precedes the final flush, up to 256KB of staged events + are silently lost even on graceful shutdown. +- **Evidence:** `graceful-shutdown-flushes-all.md` §stop() analysis; `running.rs` + code path trace. +- **Action:** Add a `debug!` log in `TrackingBufWriter::drop` to show `buf.len()` + at drop time. Run a single graceful-shutdown test to confirm whether `buf.len()` + is 0 at drop. This resolves the uncertainty without requiring the full Antithesis + harness. + +--- + +## Passes (No Implementability Concerns) + +- **`sink-failure-not-silently-acked`**: workload-driven sink errors, bug confirmed, + no special tenant capabilities needed. +- **`dropped-events-are-counted`**: metrics scraping with `when_full: drop_newest` + config, bug confirmed, straightforward assertion. +- **`config-reload-no-silent-loss`**: workload sends SIGHUP (standard OS call), + bug plausible, end-to-end count is the assertion. +- **`record-never-spans-files`**: file-size watchdog covers it without + instrumented build. +- **`durable-unacked-events-survive-crash`**: use `flush_interval=0` to eliminate + the durability-window ambiguity; clean workload-level set-difference check. +- **`every-written-event-eventually-delivered`**: same structure as above; use a + reliable downstream stub to isolate crash-durability from sink-error bugs. + +--- + +## Uncertainties + +1. **Does `receiver.rs` swallow or panic on `ReaderError::Checksum`/`Deserialization`?** + Affects whether `no-corrupted-record-delivered` needs a crash-detection signal + in addition to the SUT-side assertion. (SUT analysis says panic; confirm.) + +2. 
**Is `drop_newest` actually reachable for disk buffers?** The `try_write_record` + path exists in `writer.rs:1166-1178` and `sender.rs:231-234` calls it for + `WhenFull::DropNewest`. Needs an end-to-end path trace to confirm the disk + buffer variant is reached (not just the in-memory `LimitedSender`). + +3. **Does the `unacked_reader_file_id_offset` context make the `reader.rs:932` + `u16 >` comparison more correct than the raw comparison suggests?** The + evidence file acknowledges this open question. A deeper code trace of + `get_current_reader_file_id` (`ledger.rs:305-308`) is needed before asserting + the bug is definitively triggerable. + +4. **Does any sink actually emit `Errored`/`Rejected` `BatchStatus` in normal + operation, or only the `_status`-discard path?** The + `sink-failure-not-silently-acked` end-to-end test works regardless (make the + downstream HTTP permanently return 5xx and observe non-replay after quiet + period), but understanding whether the `BatchStatus` plumbing actually carries + `Errored` affects the SUT-side assertion design. + +5. **Finalizer task drain on shutdown:** Does the tokio runtime drain in-flight + `BatchNotifier` finalizer tasks before process exit on SIGKILL? If not, + acked-in-flight events are lost without ledger update, creating a loss window + that the `every-written-event-eventually-delivered` property would catch but + that may be confused with the in-contract 500ms fsync window loss.
diff --git a/tests/antithesis/scratchbook/evaluation/synthesis.md b/tests/antithesis/scratchbook/evaluation/synthesis.md new file mode 100644 index 0000000000000..2438c25824c51 --- /dev/null +++ b/tests/antithesis/scratchbook/evaluation/synthesis.md @@ -0,0 +1,194 @@ +--- +sut_path: /home/ssm-user/src/vector +commit: b7aae737cef5dd37d1445915443a1eb97b584f85 +updated: 2026-05-28 +external_references: + - path: (internal design doc, not linked) + why: Bug context for evaluated properties + - path: (internal design doc, not linked) + why: Lock-contention performance issue informing the throughput gap +--- + +# Property Evaluation Synthesis + +Four lenses (antithesis-fit, coverage-balance, implementability, wildcard) +evaluated the 19-property catalog as a portfolio. Findings categorized below as +**Refinement** (applied to the catalog), **Gap** (filled via targeted discovery), +or **Bias** (escalated to the user). Evidence: `evaluation/{lens}.md`. + +## Headline + +The lenses **agree the deadlock/durability/recovery cluster is the correct +high-value core** and the assertion types there are right. The most important +new findings are subtle and would have silently undermined the test: + +1. **The deadlock is intermittent, not permanent** (wildcard W-M1): `u64::MAX + + unflushed_bytes` wraps back to a *small* number, so the writer escapes for + exactly one write whenever `unflushed_bytes > 0`, then re-deadlocks. A naïve + `Sometimes(writer_unblocked)` or "throughput→0" check produces a **false + green** while the system is broken. → Refinement to compound stall detection. +2. **The durability oracle was conflated** (wildcard W-F2): using e2e-ack + *delivery* as the "durably written" marker means the deadlock suppresses + deliveries → the durability property passes *vacuously*. → Refinement to a + wall-clock-timestamp oracle + `flush_interval=0`. +3. 
**Two preconditions gate ~14/19 properties**: node-termination faults enabled, + and the buffer `data_dir` surviving node-kill on a persistent volume. If + either is false, most of the catalog is vacuous. → Escalated to user. + +## Refinements (applied to the catalog) + +- **R-A — `every-written-event-eventually-delivered` assertion semantics.** + `Sometimes(all_produced_delivered)` is wrong for at-least-once (it passes on any + one good timeline, hiding loss on others). Changed to: per-event + `Always(produced ⊆ delivered)` checked after each quiet-period drain, plus a + `Sometimes(delivery_path_reachable)` for exploration. (wildcard W-O2) +- **R-B — `writer-eventually-makes-progress` stall signal.** Because the deadlock + is intermittent (W-M1) and externally indistinguishable from healthy + `WhenFull::Block` backpressure (W-O3), the signal is now **compound**: write + throughput ≈ 0 AND sink/ack throughput ≈ 0 AND buffer ≥ ~90% full AND duration + > drain-time bound ⇒ `assert_unreachable!("persistent_deadlock")`. A single + "any wakeup" or "throughput→0" check is insufficient. +- **R-C — `buffer-size-within-max` is compound-only.** Under the deadlock the + bound holds vacuously (no writes ⇒ no overflow). Marked: meaningful only when + evaluated jointly with `writer-eventually-makes-progress`; never report alone. +- **R-D — `record-id-wraparound-accounting-holds` refocused.** The true u64 wrap + is unreachable on a production binary; the reachable, real bug is the + empty-buffer equality case (`wrapping_sub(x,x) - 1 = u64::MAX` at `ledger.rs:266`) + firing on every clean restart of a drained buffer. Property prose refocused to + the empty-buffer case (workload-observable: drain→restart→gauge≈0); true-wrap + explicitly descoped. (fit-1, impl-F5) +- **R-E — `file-id-rollover-stays-coordinated` build requirement.** Production + `MAX_FILE_ID=65535` needs ~8TB of writes to roll; `MAX_FILE_ID=6` is only in + `#[cfg(test)]`. 
Added requirement: run a test-binary or add a runtime-tunable + `MAX_FILE_ID`, else descope. (impl-F4) +- **R-F — durability oracle.** `durable-unacked-events-survive-crash` and + `every-written-event-eventually-delivered` now specify the **wall-clock-timestamp + oracle** (events produced > 2×`flush_interval` ago are "past the fsync window") + and recommend `flush_interval=0` to make every `flush()` a `sync_all`, removing + clock dependence and the delivery-vs-fsync conflation. Resolves the standing + "what does durably-written mean?" open question. (W-F2, W-C1) +- **R-G — `total-buffer-size-never-underflows` build note.** The `trace!` at + `ledger.rs:295` evaluates `last_total_buffer_size - amount`, which panics in a + debug build *before* the release-mode wrap is observable. Harness must use a + **release build** for this property (and the `trace!` itself should be fixed to + `wrapping_sub`). (impl-F6) +- **R-H — priority note on the two "logic-bug" properties.** + `dropped-events-are-counted` and `sink-failure-not-silently-acked` are missing + function-call / discarded-status bugs better caught by deterministic + unit/integration tests; kept as **workload-side secondary checks with no + dedicated fault-search budget** (don't shape the fault strategy around them). + (fit-3, fit-4) +- **R-I — corruption abandonment sub-concern.** The "skip rest of file after first + bad record" loss (valid records after a corrupt one in the same 128MB file are + abandoned) is now an explicit, measurable angle under + `corruption-is-detected-and-recovered` (correlate corruption-injection timing + with event IDs; assert post-corruption-point events still arrive). (coverage T1, + W-C2) + +## Gaps (filled via targeted discovery — 7 new properties) + +New Category 7 in the catalog. 
All are squarely timing/concurrency/partial-failure +or claimed-guarantee gaps the focus-based discovery missed: + +- **`foreign-data-file-no-writer-stall`** (coverage F1) — a stray `.dat` file + inflates `update_buffer_size` → permanent writer stall **without any crash** + (operator-error path to the #21683 symptom; distinct root cause: wrong scan + scope, not arithmetic). +- **`ledger-corruption-no-sigbus-crashloop`** (coverage F1, wildcard) — external + truncation of the mmap'd `buffer.db` → unhandled SIGBUS / crash loop; should be + a clean detected init error. +- **`finalizer-task-drains-pending-acks`** (coverage F6) — the unmonitored + detached finalizer task dying strands acks → silent loss / stall, distinct from + the arithmetic deadlock. +- **`fsync-window-bounded-under-clock-jitter`** (coverage F3, wildcard W-M2) — + clock faults can suppress `should_flush`'s `Instant::elapsed` gate, silently + extending the loss window beyond the 500ms SLA (only rotation is + clock-independent). +- **`overflow-chain-no-unaccounted-gap`** (coverage F4, wildcard W-M3) — the + entire `WhenFull::Overflow` mode was uncovered; crash during overflow loses + *later* in-memory events while *earlier* disk events survive → a middle-of-stream + gap that breaks dedup-based at-least-once reasoning. +- **`buffer-survives-version-upgrade`** (coverage F5) — rkyv layout change / + `DiskBufferV1CompatibilityMode` flag inversion → old buffer files unreadable or + silently mis-decoded; should be a clean detected error, never garbage. +- **`throughput-progresses-under-contention`** (coverage F2) — the writer-mutex + lock-contention ceiling under CPU throttle can collapse throughput to near-zero + *without* tripping the permanent-deadlock property (degenerate-but-alive). + +These additions are substantial enough (a new category + 7 properties) that a +second light evaluation pass is warranted before workload construction; +see "Residual" below. 
+ +## Biases (escalated to the user) + +- **Bias B1 — Portfolio orientation: timing-cluster vs. logic-bug properties.** + ~6 properties (`dropped-events-are-counted`, `sink-failure-not-silently-acked`, + `record-id-wraparound` empty case, `record-never-spans-files`, parts of + `graceful-shutdown-flushes-all`) are deterministic logic/metric bugs that unit + or integration tests would catch more cheaply than Antithesis search. Keeping + them in the Antithesis catalog spends search budget on states fault injection + doesn't help reach. **Judgment needed:** include them (broader regression net) + or hand them to unit tests and focus Antithesis purely on the + timing/crash/concurrency cluster? +- **Precondition P1 (escalation, not opinion) — node-termination faults.** + ~14/19 (now ~21/26) properties require kill/restart faults, often disabled by + default in Antithesis tenants. If disabled, the highest-value cluster yields no + signal. +- **Precondition P2 (escalation) — persistent buffer storage across node-kill.** + Disk-buffer durability is only meaningfully testable if the `data_dir` survives + a modeled crash on a persistent volume. If node-kill wipes the container FS, all + crash-recovery properties pass vacuously. The catalog recommends a pre-fault + **sentinel** (write+fsync, kill, assert files survive; gate Category 2–6 + assertions on it) — wildcard W-F1. + +## Passes (independently confirmed correct) + +- The deadlock cluster's 4-vantage-point design (root, writer liveness, reader + termination, safety-bound vacuity) with explicit cross-references. +- `record-id-monotonicity-holds` as `Unreachable`; `no-corrupted-record-delivered` + as `AlwaysOrUnreachable`. +- `memmap2::MmapMut::flush()` confirmed `msync(MS_SYNC)` (blocking/synchronous) — + no MS_ASYNC concern. +- The unlink-before-ledger-flush window in `delete_completed_data_file` is safe + (idempotent NotFound→skip on restart). 
+- `file-id-rollover-stays-coordinated` correctly scoped to test-time `MAX_FILE_ID`. + +## Residual / next steps + +- The 7 gap properties + 9 refinements were applied to the catalog and + relationships; `commit`/`updated` refreshed. Because the additions form a new + category, a brief re-evaluation of integration is advisable but not blocking for + `antithesis-setup`/`antithesis-workload` to begin on the core cluster. +- Several open questions remain partial and need code-tracing or human input: + graceful-shutdown drop ordering in `running.rs`; whether sinks emit + `Errored`/`Rejected` status in practice; whether the tokio runtime drains the + finalizer task before exit. Tracked per-property in evidence files. + + +--- + +## 2026-05-29 — Data-loss expansion (user-driven, no full eval pass) + +User is "seriously concerned about data loss," seed: *"if the checksum fails we +skip records."* Added 3 properties to Category 1 (silent data-loss cluster) + +Cluster H in relationships. Per the property-expansion workflow, 3 new properties +in an existing category do **not** trigger a full evaluation ensemble; recorded +here instead. + +- **Gap found:** the catalog confirmed corruption *recovery runs* + (`corruption-is-detected-and-recovered`, a `Sometimes`) but never bounded or + counted the loss. `dropped-events-are-counted` covers only `drop_newest`/#24606 + (write-side), not the read-side corruption roll. +- **Grounding:** `roll_to_next_data_file` (reader.rs:711-759) abandons the whole + file tail, accounting only records read; abandoned records never reach + `track_dropped_events`. internal doc *internal buffer design notes* (loss window = + 500ms unsynced; synced not lost with e2e acks) and *an internal telemetry-correctness report* (silent loss via `component_discarded_events_total`) + anchor the severity. +- **New:** `corruption-skip-loss-bounded`, `corruption-skip-loss-is-counted`, + `corruption-skip-record-id-accounting-consistent`. 
+- **Bias check:** the existing catalog under-weighted the *read-side* silent-loss + surface relative to the write-side (#24606) and the deadlock cluster — this + expansion rebalances toward the user's stated concern. No properties invalidated. +- **Workload note for next setup pass:** all three share one fault — a mid-file + bit-flip in a multi-record data file, corrupted in a *live* read — so they can + be implemented as one test scenario with a delivered-set/metric/underflow oracle. diff --git a/tests/antithesis/scratchbook/evaluation/wildcard.md b/tests/antithesis/scratchbook/evaluation/wildcard.md new file mode 100644 index 0000000000000..fb2ef67fa0b1f --- /dev/null +++ b/tests/antithesis/scratchbook/evaluation/wildcard.md @@ -0,0 +1,750 @@ +--- +sut_path: /home/ssm-user/src/vector +commit: b7aae737cef5dd37d1445915443a1eb97b584f85 +updated: 2026-05-28 +external_references: [] +--- + +# Wildcard Evaluation: Disk Buffer v2 Property Catalog + +Lens: **deliberately unconstrained**. The other three lenses cover Antithesis +Fit (unit-test vs. chaos territory), Coverage Balance (right portfolio vs. SUT +risks), and Implementability (can assertions be placed). This lens asks: what +did the framing miss, what failure scenarios are unmodeled, what joint conditions +do the individual properties not compose, and what is simply odd? + +--- + +## Section 1 — Framing Questions + +### W-F1: The persistent-volume assumption is load-bearing but unguarded by any property + +**What the framing assumes:** The deployment topology (`deployment-topology.md`) +explicitly requires that the buffer `data_dir` reside on storage that survives node-kill. +Without this, every crash-recovery property (Categories 2–6) either passes +vacuously (buffer wiped on kill → clean init every time, no recovery exercised) +or fails spuriously (data the workload expected to survive is gone). + +**What the framing misses:** No property in the catalog detects or guards against +this scenario.
If Antithesis's tenant configuration recreates the container +filesystem on kill (a common default for stateless workloads), the test suite +produces a **perfectly green run** that is entirely meaningless: 15 of 19 +properties are exercising fresh-start paths, not crash-recovery paths. + +**The specific vacuity:** `durable-unacked-events-survive-crash` passes because +after kill+wipe there are no surviving data files for the reader to re-read; the +reader starts from scratch and the workload's "produced" set is empty (workload +was also killed). `recovery-completes-after-crash` passes because there is +nothing to recover. `partial-write-at-rotation-recovers` is never triggered +because files are gone. `writer-eventually-makes-progress` passes because +`total_buffer_size` starts at 0, so the underflow bug is never triggered. + +**Missing guard:** A workload-level sentinel that proves persistence is working +before faults are injected. Concrete approach: before the first kill, write N +events and fsync (explicit flush_interval=0 run), assert that after restart the +buffer directory exists and contains the expected `.dat` files with non-zero +size. If the post-restart buffer directory is empty, emit +`assert_unreachable!("buffer_dir_empty_after_kill")` and abort the run with a +clear harness-configuration failure. This is not a Vector bug — it is a +harness-integrity check, but its absence makes the entire test suite's results +untrustworthy. + +**Scope:** catalog-wide (Categories 2–6), harness design +**Evidence:** deployment-topology.md §"CRITICAL — persistent buffer storage"; +implementability.md §"persistent-volume assumption"; no guard property exists in +any of the 19 property evidence files. +**Suggested action:** Add a sentinel workload step: before fault injection begins, +write-then-fsync N events, kill, assert `.dat` and `buffer.db` files survive. +Fail fast with a harness error if they do not. Gate all Category 2–6 assertions +on this sentinel passing. 
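A minimal sketch of the post-restart half of that sentinel, assuming the harness can read the buffer directory directly. The names `SentinelOutcome` and `check_buffer_dir` are hypothetical; only `buffer.db` and the `.dat` extension come from the notes above.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Result of inspecting the buffer directory after a kill + restart.
/// Hypothetical harness type, not part of Vector.
#[derive(Debug, PartialEq, Eq)]
pub enum SentinelOutcome {
    /// `buffer.db` exists and at least one non-empty `.dat` file survived.
    Survived { data_files: usize },
    /// The ledger file is gone: the filesystem was most likely recreated.
    LedgerMissing,
    /// The ledger exists but no non-empty data file does.
    NoDataFiles,
}

/// Checks that the ledger and at least one non-empty data file are present.
pub fn check_buffer_dir(dir: &Path) -> io::Result<SentinelOutcome> {
    if !dir.join("buffer.db").is_file() {
        return Ok(SentinelOutcome::LedgerMissing);
    }

    let mut data_files = 0;
    for entry in fs::read_dir(dir)? {
        let entry = entry?;
        let path = entry.path();
        // Only count files with a `.dat` extension and a non-zero size.
        let is_dat = path.extension().and_then(|ext| ext.to_str()) == Some("dat");
        if is_dat && entry.metadata()?.len() > 0 {
            data_files += 1;
        }
    }

    if data_files == 0 {
        Ok(SentinelOutcome::NoDataFiles)
    } else {
        Ok(SentinelOutcome::Survived { data_files })
    }
}
```

The harness would treat anything other than `Survived` as a configuration failure and skip the Category 2–6 assertions for that run.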
+ +--- + +### W-F2: The workload's "durably written" oracle conflates three distinct conditions + +**What the framing assumes:** The catalog repeatedly asks "how does the workload +establish 'durably written'?" (property-catalog.md File-Level Open Questions; +`durable-unacked-events-survive-crash` Open Question 1), offering three +candidates: e2e acks, flush_interval=0 tracing, sync_all callsite tracing. + +**What the framing misses:** These three candidates have **fundamentally different +meanings** and are not interchangeable. Understanding which one the workload +uses determines whether the property is even testable from outside. + +**The confusion:** + +1. **Source-HTTP-returns-200 with e2e acks enabled.** This is the strongest + signal — it means the downstream HTTP sink has acknowledged receipt. The + source returns 200 only after `BatchStatus::Delivered` propagates back through + `handle_batch_status` (`src/sources/util/http/prelude.rs:309`). "200 returned" + means the event traversed: disk buffer write → disk buffer read → HTTP sink + delivery → sink ACK → BatchNotifier dropped → BatchStatus::Delivered → source + response. This is NOT a durability marker; it is an end-to-end delivery marker. + An event can be fsynced to disk (durable) but not yet return 200 (not yet + delivered). Conversely, an event can return 200 while a different event — the + one currently being fsynced when Vector is killed — is lost within the 500ms + window. + +2. **Source-HTTP-returns-200 without e2e acks.** Here 200 is returned immediately + after the event is accepted into the channel (before buffer write, let alone + fsync). This is the WEAKEST marker — not durable at all. + +3. **flush_interval=0 / sync_all tracing.** The only option that actually marks + fsync completion is to either set `flush_interval` so low that every `flush()` + call triggers a full `sync_all`, or to instrument `sync_all` directly. Neither + is exposed to the workload without SUT-side instrumentation. 
+ +**The consequence for property design:** + +For `durable-unacked-events-survive-crash`, the correct oracle is condition 3 +(sync_all fired, event is in the synced set). Condition 1 conflates buffer +durability with downstream delivery and makes the property measure the +`every-written-event-eventually-delivered` invariant instead. Using condition 2 +makes the property vacuously true (0ms window, anything written at all survives +the 0ms window). + +**A specific gap this creates:** With e2e acks (condition 1), the workload's +"durably written" set is empty during the deadlock scenario (no events ever reach +the downstream sink → no 200s returned → workload concludes nothing was durably +written → `durable-unacked-events-survive-crash` passes vacuously despite the +buffer being deadlocked). The deadlock thus masks the durability property. + +**Suggested action:** Pick a single oracle for "durably written" across all +Category 3 properties: set `flush_interval` to a very short but non-zero value +(e.g., 50ms) so fsync fires frequently, and define "durably written" as "the +event was written AND at least one `sync_all` has completed after the write." +Use a workload-side timer: if the event was sent more than `2 * flush_interval` +milliseconds ago, assume it has been fsynced. This is imprecise but conservative +and does not require SUT instrumentation for the oracle itself. + +**Scope:** `durable-unacked-events-survive-crash`, `every-written-event-eventually- +delivered`, `partial-write-at-rotation-recovers` — all Category 3 properties. +**Evidence:** `src/sources/util/http/prelude.rs:283-321` (source ack = sink +delivery, not buffer durability); `sut-analysis.md §2` (flush model; fsync only +on rotation or every ≥500ms). 
+ +--- + +### W-F3: The "single-process, just use node-kill" framing hides a split-brain window at the ledger/data boundary + +**What the framing assumes:** The SUT analysis correctly identifies the data-file +fsync and ledger msync as "two separate, non-atomic syscalls" (`sut-analysis.md +§3`). The recovery logic (`validate_last_write`) is designed to handle the two +canonical divergence states: data ahead of ledger (`Ordering::Less`, fast-forward) +and ledger ahead of data (`Ordering::Greater`, skip). + +**What the framing underweights:** The **directionality of the non-atomicity +depends on which syscall the kill interrupts**, and the two outcomes have +asymmetric consequences: + +- Kill between `sync_all` (line 1314) and `ledger.flush()` (line 1317): data is + durable, ledger is stale. Recovery: `Ordering::Less` → ledger fast-forwards to + match data. **Safe** for the data that was synced; a duplicate may be delivered + (at-least-once semantics). + +- Kill between `ledger.flush()` and the return of `flush_inner`: ledger is + current, data is not yet changed (this window is narrow — ledger.flush is the + last call). Practically safe. + +- Kill during the `sync_all` itself (kernel-level partial fsync): the data file + may be only partially durable. Recovery path depends on whether the last record + in the partially-fsynced file is valid. + +**The unmodeled case:** Kill *after* `ledger.flush()` updates +`writer_next_record` in the mmap'd region but *before* the data file fsync +propagates the corresponding bytes to persistent media. This is possible on some +block devices and virtual disks where msync (via mmap dirty page writeback) is +faster than `fsync`. In this case: ledger says "record N was written"; data file +does not contain record N's bytes. Recovery: `Ordering::Greater` → "Events have +likely been lost" log, skip to next file. The skip is intentional, but **the +events were not actually lost — they were never durably written in the first +place**. 
The "events lost" log is technically wrong; the events were page-cache +writes that never reached storage. Downstream impact: the ledger's +`writer_next_record` is advanced past a gap, and the reader will eventually read +those events from subsequent writes (which restart from the new ID), creating an +ID gap that the workload may interpret as event loss. + +**What no property models:** Whether the `Ordering::Greater` path fires correctly +only for genuine data loss (events that reached fsync but whose data is +unreadable) vs. spuriously for events that were only in the page cache (never +reached fsync). The property `partial-write-at-rotation-recovers` covers the +recovery path but does not distinguish these two causes of `Ordering::Greater`. + +**Scope:** `partial-write-at-rotation-recovers`, `durable-unacked-events-survive-crash` +**Evidence:** `writer.rs:1312-1317` (sync_all then ledger.flush, no atomicity); +`sut-analysis.md §3` ("data file fsync and the ledger msync are two separate, +non-atomic syscalls"). +**Suggested action:** Add a sub-case to `partial-write-at-rotation-recovers` +explicitly noting the two directionalities and which Antithesis kill windows +target each. A SUT-side assertion at the `Ordering::Greater` path could emit +structured data distinguishing "no data in file" (page-cache-only loss) from +"data in file but corrupt" (genuine partial fsync loss). + +--- + +## Section 2 — Missing Angles + +### W-M1: The double-wrap in `is_buffer_full` creates intermittent write-through, not permanent deadlock — an unmodeled inconsistency state + +**Discovered via:** Code inspection of `writer.rs:993-996`; arithmetic +verification. + +**The issue:** When `total_buffer_size` wraps to `u64::MAX` (the underflow bug), +the deadlock is not strictly permanent. At line 994: + +```rust +let total_buffer_size = self.ledger.get_total_buffer_size() + self.unflushed_bytes; +``` + +This is a plain `u64` addition. 
In Rust release mode, `u64::MAX + unflushed_bytes` +wraps: if `unflushed_bytes >= 1`, the result is `unflushed_bytes - 1`, a small +non-negative number. If `small_number < max_buffer_size`, then `is_buffer_full()` +returns `false` — the writer is NOT seen as full, and it proceeds to write. + +This means: after the underflow, the writer may make **exactly one additional +write** (the one during which `unflushed_bytes > 0` at the check point), after +which `flush_write_state()` zeroes `unflushed_bytes` and the next `is_buffer_full` +check sees `u64::MAX + 0 = u64::MAX` again, blocking permanently. + +The intermittent write-through is not just a curiosity: + +1. The writer accepts a new record, updates `unflushed_bytes`, writes to the + `TrackingBufWriter`, then calls `flush()`. After flush, `unflushed_bytes` goes + to 0 and `total_buffer_size` (still at `u64::MAX`) blocks the writer again. + But the flushed record is now in the OS page cache and is readable by the + reader. The reader reads it, attaches a `BatchNotifier`, delivers to sink. The + ack returns. The finalizer calls `decrement_total_buffer_size(record_bytes)` — + on a value already at `u64::MAX` — which wraps again to `u64::MAX - + record_bytes`. Still near `u64::MAX`. The accounting is permanently poisoned + regardless. + +2. The workload may observe write throughput recovering briefly after the + underflow, then stalling again. A `Sometimes(writer_unblocked_after_full)` + assertion fires during the brief recovery window — even in the bug scenario. + This makes the `writer-eventually-makes-progress` property a **false negative** + under this specific input-timing combination: the `Sometimes` is satisfied by + the brief window, and Antithesis reports success while the bug is present. + +3. The inconsistency state — `total_buffer_size` at `u64::MAX` but writer making + occasional writes — is not covered by any property. 
`total-buffer-size-never- + underflows` catches the underflow at the decrement site (correct). But after + the underflow is detected, the SUT continues running in a poisoned state. No + property asserts that `total_buffer_size` stays sane after a detected underflow. + +**Scope:** `writer-eventually-makes-progress` (false negative risk), +`total-buffer-size-never-underflows` (detects root but not downstream inconsistency) +**Evidence:** `writer.rs:993-996` (unchecked `u64 +`); `writer.rs:784` +(`unflushed_bytes -=` after flush); arithmetic: `u64::MAX + 1 = 0` (wraps to +near 0, below any `max_buffer_size`). +**Suggested action:** The `assert_unreachable!` inside the stall-count loop in +`writer-eventually-makes-progress` must fire after N consecutive wakeup-with-no- +net-progress cycles, not just on any wakeup. The "progress" check must verify +that `total_buffer_size` has actually decreased from a previous measurement, not +just that the writer exited `ensure_ready_for_write`. Additionally: add an +`assert_always(total_buffer_size <= max_buffer_size || just_underflowed)` at the +ledger level so that any post-underflow state is visible even when the writer +appears to make progress. + +--- + +### W-M2: The `should_flush` CAS loser silently skips fsync — clock jitter can make this permanent + +**Discovered via:** `ledger.rs:485-497` code inspection + the Antithesis clock +fault capability mentioned in `sut-analysis.md §10`. + +**The mechanism:** `should_flush()` uses an `AtomicCell` CAS: + +```rust +pub fn should_flush(&self) -> bool { + let last_flush = self.last_flush.load(); + if last_flush.elapsed() > self.config.flush_interval + && self.last_flush.compare_exchange(last_flush, Instant::now()).is_ok() + { + return true; + } + false +} +``` + +When two callers race, only one wins the CAS. The loser returns `false` and skips +the fsync — this is the intended design (deduplicate concurrent flushers). 
But: + +If Antithesis's clock jitter **slows down the monotonic clock** as seen by +`Instant::elapsed()`, the `elapsed() > flush_interval` condition may never become +true. The result: `should_flush()` permanently returns `false`, and `flush_inner` +never calls `sync_all`. Data accumulates in the OS page cache indefinitely. A +kill at any point loses all unsynced events. + +This is separate from the 500ms documented window: that window assumes fsync +eventually fires. If `Instant::elapsed()` is frozen by the Antithesis clock +scheduler, the window is infinite. + +**The interaction with `force_full_flush`:** `flush_inner(true)` bypasses +`should_flush()` entirely and always calls `sync_all`. It is called on file +rotation. So the only guard against permanent fsync suppression is that file +rotation fires — but file rotation is triggered by the data file reaching its +size limit, which is independent of time. With small records and high throughput, +rotations may be frequent enough that the durability window is bounded by +rotation frequency, not by the 500ms clock. + +**But with large `max_data_file_size` (default 128MB):** A slow workload may +never rotate, and if the clock is frozen, the fsync never fires. The durability +window becomes "until next rotation" — which could be unbounded. + +**What no property models:** The interaction between clock jitter (slowing or +freezing `Instant::elapsed()`) and the effective fsync interval. The +`durable-unacked-events-survive-crash` property specifies "loss bounded to ≤500ms +unsynced window" — this bound assumes the clock runs at real speed. + +**Scope:** `durable-unacked-events-survive-crash`, `partial-write-at-rotation- +recovers`, catalog-wide (any property that depends on fsync having fired) +**Evidence:** `ledger.rs:485-497` (`should_flush` CAS); `common.rs:31` +(`DEFAULT_FLUSH_INTERVAL = 500ms`); `writer.rs:1041` (rotation triggers +`force_full_flush = true`, bypassing the clock).
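To make the frozen-clock case concrete, here is a simplified single-threaded model of the time gate. It is a sketch, not the code in `ledger.rs`: `FlushGate` and `count_full_syncs` are hypothetical, the CAS is omitted, and the clock reading is passed in as a `Duration` so that a clock fault can be simulated.

```rust
use std::time::Duration;

/// Simplified stand-in for the time-based flush gate (no CAS, one caller).
pub struct FlushGate {
    flush_interval: Duration,
    /// Clock reading at the last full flush, on the possibly faulted clock.
    last_flush: Duration,
}

impl FlushGate {
    pub fn new(flush_interval: Duration) -> Self {
        Self { flush_interval, last_flush: Duration::ZERO }
    }

    /// Opens only when more than `flush_interval` has elapsed on the given clock.
    pub fn should_flush(&mut self, now: Duration) -> bool {
        if now.saturating_sub(self.last_flush) > self.flush_interval {
            self.last_flush = now;
            true
        } else {
            false
        }
    }
}

/// Counts how many flush calls would end in a full sync.
/// Each call is `(clock reading, force_full_flush)`.
pub fn count_full_syncs(gate: &mut FlushGate, calls: &[(Duration, bool)]) -> usize {
    let mut syncs = 0;
    for &(now, force_full_flush) in calls {
        // A forced flush (file rotation) bypasses the time gate entirely.
        if force_full_flush || gate.should_flush(now) {
            syncs += 1;
        }
    }
    syncs
}
```

With a clock that never advances, the count stays at zero unless one of the calls is a forced (rotation) flush, which is the dependence described above.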
+**Suggested action:** Add a `flush_interval` override to the test config, setting +it to 0 (force every flush to be a full fsync). This eliminates the clock +dependency for the durability cluster and makes "durably written" = "flush() was +called." Separately, test with `flush_interval` = large value to exercise the +rotation-only fsync path. + +--- + +### W-M3: The `WhenFull::Overflow` + disk base ordering inversion under crash is unmodeled + +**Discovered via:** `sut-analysis.md §10` (noted as a "Wildcard / Cross-Cutting" +observation). No catalog property covers it. + +**The scenario:** A Vector topology configured as: + +``` +source → disk_buffer (base) → in-memory_buffer (overflow) → sink +``` + +Under load, the disk buffer fills up. New events go to the in-memory overflow +buffer. At this point: + +- Disk buffer: events A₁...Aₙ (fsynced, durable) +- In-memory overflow: events Aₙ₊₁...Aₙ₊ₖ (not durable, not on disk) + +SIGKILL occurs. On restart: + +- Disk buffer: events A₁...Aₙ are present and re-readable (correct at-least-once). +- In-memory overflow: Aₙ₊₁...Aₙ₊ₖ are **lost** (not persisted anywhere). + +The ordering inversion: events A₁...Aₙ (earlier, durably written) are replayed +to the sink. Events Aₙ₊₁...Aₙ₊ₖ (later, overflow) are permanently lost. The +sink sees a gap — events up to Aₙ appear, events Aₙ₊₁...Aₙ₊ₖ never appear. + +This breaks at-least-once reasoning in a subtle way: the duplicates from Aₙ's +replay mean the downstream dedup logic sees "events I already processed" AND +"events that should have come after them are missing." Standard dedup (by ID) +will correctly deliver A₁...Aₙ at-least-once but will never deliver Aₙ₊₁...Aₙ₊ₖ, +creating **permanent silent loss of events that the source already acknowledged** +(if the source ACKed Aₙ₊₁...Aₙ₊ₖ before the kill). + +No property in the catalog covers this scenario. The deployment topology uses +`when_full: block` (not `overflow`), so the harness does not exercise this mode. 
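From the workload side, this gap can be measured by comparing source-acknowledged IDs with what the sink eventually sees. A minimal sketch, assuming the workload tags every event with a unique `u64` ID; `acked_but_never_delivered` is a hypothetical helper.

```rust
use std::collections::BTreeSet;

/// Returns the IDs the source acknowledged that never reached the sink,
/// in ascending order. Duplicate deliveries (at-least-once replays) are ignored.
pub fn acked_but_never_delivered(acked: &[u64], delivered: &[u64]) -> Vec<u64> {
    let delivered: BTreeSet<u64> = delivered.iter().copied().collect();
    let acked: BTreeSet<u64> = acked.iter().copied().collect();
    acked.difference(&delivered).copied().collect()
}
```

If events 1 to 3 were on disk and 4 to 5 were in the in-memory overflow at kill time, a replay that delivers 1 to 3 twice still leaves 4 and 5 in the result.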
+ +**Scope:** NEW — no existing property covers the overflow+disk+crash combination. +**Evidence:** `sut-analysis.md §10` ("WhenFull::Overflow + disk base"); +`topology/channel/sender.rs:236-244` (overflow path); no `overflow` in +deployment-topology.md. +**Suggested action:** Add a separate topology configuration with `when_full: +overflow` and a second in-memory buffer as overflow. Add a property: "events +accepted by the source before the kill that were written to the disk buffer are +delivered after restart; events accepted into the in-memory overflow are +documented as potentially lost on crash." This is primarily a documentation and +workload-design question — the SUT's behavior is correct per design (in-memory +overflow is not durable). The risk is that operators configure this topology +believing the disk buffer provides end-to-end durability, when in fact it only +provides durability for the base buffer portion. + +--- + +### W-M4: The `NotFound` file skip path at `reader.rs:777` calls `increment_acked_reader_file_id` without a subsequent `ledger.flush()` + +**Discovered via:** Code inspection of `reader.rs:766-784` vs. +`delete_completed_data_file:548-549`. + +**The asymmetry:** The `delete_completed_data_file` path calls +`ledger.increment_acked_reader_file_id()` at line 548 **followed immediately** by +`ledger.flush()` at line 549 (MS_SYNC, blocking until durable). The `NotFound` +skip path at line 777 calls `increment_acked_reader_file_id()` **without any +subsequent `ledger.flush()`**. + +`increment_acked_reader_file_id` calls `self.state().increment_reader_file_id()` +which atomically stores to the mmap'd `reader_current_data_file` field in the +`LedgerState`. On Linux with MAP_SHARED, this is a dirty mmap page — durable +only after `msync(MS_SYNC)`. 
Without the subsequent `ledger.flush()`, a kill +immediately after the atomic store in the `NotFound` path may or may not persist +the incremented file ID to disk, depending on whether the kernel's dirty page +has been written back. + +**Practical impact:** If the incremented value IS written to disk (kernel +flushed the dirty page), the restart correctly finds the reader at the new +file ID. If NOT written, restart finds the reader at the old file ID, the `NotFound` +path fires again, and the skip is idempotent — safe. + +However, this is NOT idempotent in one case: if the skipped file is a gap that +also has outstanding `total_buffer_size` accounting (from a corrupted/externally +deleted file that was never properly size-accounted), skipping it twice could +double-decrement the accounting. The `NotFound` skip path at line 777 does NOT +call `decrement_total_buffer_size` (unlike `delete_completed_data_file` which +calls it at line 538), so this specific case is probably safe. + +**The real concern:** The inconsistency between the two paths (one flushes ledger, +one does not) is a code-smell that could cause a future regression if someone +adds accounting to the `NotFound` path without also adding a ledger flush. + +**Scope:** `recovery-completes-after-crash`, indirectly `total-buffer-size-never-underflows` +**Evidence:** `reader.rs:766-784` (NotFound path, no ledger.flush); `reader.rs:548-549` +(delete_completed_data_file, calls ledger.flush). +**Suggested action:** Add a `ledger.flush()` call after the `increment_acked_reader_file_id()` +at `reader.rs:777` to make the two skip paths symmetric. Add a note to the +`recovery-completes-after-crash` property that this asymmetry exists and should +be exercised by kills immediately after the NotFound branch fires. 
+ +--- + +### W-M5: Two properties are individually fine but jointly contradictory under the deadlock — creating a false-green composite + +**Discovered via:** Cross-reading `buffer-size-within-max` with `writer-eventually- +makes-progress`; arithmetic of `u64::MAX` + `is_buffer_full()`. + +**The joint contradiction:** The catalog correctly notes that `buffer-size-within- +max` is vacuously true under the deadlock: if the writer is permanently blocked, +no new data is written, so the on-disk size never exceeds `max_size`. The catalog +says "must be evaluated jointly with `writer-eventually-makes-progress`." + +But consider adding the double-wrap finding (W-M1): if the deadlock is +*intermittent* (writer makes occasional writes due to the double-wrap), then: + +- `buffer-size-within-max`: `Always(on_disk_bytes <= max_size + max_record_size)`. + The writer is mostly blocked; occasionally writes one record. On-disk size + stays near `max_size`. **Passes**. +- `writer-eventually-makes-progress`: `Sometimes(writer_unblocked)`. The writer + occasionally escapes due to double-wrap. **Passes** (the `Sometimes` fires on + the escape). +- `total-buffer-size-never-underflows`: FAILS (the underflow IS detected by the + `assert_always(amount <= current)` before `fetch_sub`). + +So with the double-wrap active: the two liveness properties both pass, the safety +root-cause property fails. The system is in a broken state that the liveness +properties' `Sometimes` assertions cannot distinguish from a healthy bounded- +backpressure cycle. + +**The missing joint assertion:** A compound property that is `Unreachable` when: + +``` +is_buffer_full() == true + AND total_buffer_size >= max_buffer_size * 0.99 // truly full, not just overflow artifact + AND elapsed_since_last_write > 30s // writer has been stalled + AND recent_deletes_without_writer_progress > N // reader is deleting but writer stays blocked +``` + +None of the 19 properties captures this compound condition. 
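A workload-side approximation of this condition, as a sketch only. `StallDetector`, `Sample`, and the near-zero threshold `IDLE_EVENTS_PER_SEC` are hypothetical; the 90% fill level and the 2× drain-time bound follow the workload-level variant of the check rather than the exact thresholds in the block above. Requiring sink throughput to be near zero as well is what separates this from healthy `WhenFull::Block` backpressure.

```rust
use std::time::Duration;

/// Throughput below this is treated as "approximately zero" (assumed value).
const IDLE_EVENTS_PER_SEC: f64 = 0.5;
/// Buffer gauge level treated as "full".
const FULL_RATIO: f64 = 0.90;

/// One observation window of workload-visible metrics.
#[derive(Clone, Copy, Debug)]
pub struct Sample {
    pub write_events_per_sec: f64,
    pub sink_events_per_sec: f64,
    pub buffer_fill_ratio: f64,
}

/// Tracks how long the compound stall condition has held without interruption.
pub struct StallDetector {
    drain_time_bound: Duration,
    stalled_for: Duration,
}

impl StallDetector {
    pub fn new(drain_time_bound: Duration) -> Self {
        Self { drain_time_bound, stalled_for: Duration::ZERO }
    }

    /// Feeds one sample covering `interval`. Returns true once writes and sink
    /// delivery have both been idle with a full buffer for more than twice the
    /// drain-time bound.
    pub fn observe(&mut self, sample: Sample, interval: Duration) -> bool {
        let stalled = sample.write_events_per_sec < IDLE_EVENTS_PER_SEC
            && sample.sink_events_per_sec < IDLE_EVENTS_PER_SEC
            && sample.buffer_fill_ratio >= FULL_RATIO;
        if stalled {
            self.stalled_for += interval;
        } else {
            // Any sign of progress resets the persistence timer.
            self.stalled_for = Duration::ZERO;
        }
        self.stalled_for > self.drain_time_bound * 2
    }
}
```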
It is observable from +the workload (write throughput + sink delivery throughput both drop to zero +simultaneously while the buffer gauge shows "full" and the reader continues +processing acks). + +**Scope:** `buffer-size-within-max`, `writer-eventually-makes-progress`, joint +**Evidence:** `writer.rs:993-996` (double-wrap arithmetic); property-catalog.md +§`buffer-size-within-max` ("Must be evaluated jointly"); W-M1 above. +**Suggested action:** Add a workload-level compound liveness check: if write +throughput ≈ 0 AND sink delivery throughput ≈ 0 AND buffer gauge shows ≥ 90% full +AND the condition persists for > 2× the drain-time bound, fire +`assert_unreachable!("persistent_deadlock_detected")`. This cross-cuts the two +properties and is observable without SUT instrumentation. + +--- + +### W-M6: The `config-reload` POSIX lock gap is intra-process — no Antithesis-native way to inject it + +**Discovered via:** `sut-analysis.md §5` (INV-10: "advisory lock does NOT protect +intra-process"); `deployment-topology.md §Custom faults`. + +**The issue:** POSIX `fcntl` locks are per-process on Linux. If the old and new +topology open the same buffer directory during config reload, both get the lock +(they are in the same process). The catalog notes this but treats it as "may make +the lock gap a live safety issue." This is more specific: + +On config reload, the old topology's `BufferWriter` is dropped (calling `close()` +but not `flush()`). The new topology immediately opens the same buffer directory. +Between the old `close()` and the new `open()`, there is no inter-process lock +because both are in the same process. But: the old finalizer task (spawned as a +`tokio::spawn` detached task, `ledger.rs:701-710`) holds an `Arc` and +continues running. 
If the new topology's reader opens the buffer while the old +finalizer is still calling `increment_pending_acks` or `notify_writer_waiters`, +there are two concurrent coroutine paths touching the same `Ledger` via separate +`Arc` references. + +The Antithesis fault — SIGHUP to trigger reload — is a custom fault. But the +**race** between old finalizer and new reader init is not a fault to inject: it +is an inherent timing race in the reload sequence. Antithesis's scheduler can +explore the interleaving naturally if the reload is driven by the workload during +live writes. The issue is whether the harness actually keeps the write load high +during reload (maximizing finalizer task backlog at the moment of reload). + +**Missing from the catalog:** An explicit assertion at the finalizer task shutdown +point: does the old `Arc` get dropped before the new topology's `from_config_inner` +runs? If `Arc::strong_count(ledger) > 1` at the start of the new init sequence, +the old finalizer may still be running. + +**Scope:** `config-reload-no-silent-loss` +**Evidence:** `ledger.rs:701-710` (`spawn_finalizer` spawns detached tokio task); +`sut-analysis.md §5` (POSIX lock, per-process); `writer.rs:1366-1374` (Drop calls +close() not flush()). +**Suggested action:** Add an assertion at the start of `Buffer::from_config_inner`: +`assert_always(Arc::strong_count(&ledger_would_be) == 1)` (i.e., no other reference +to this directory's ledger exists when init begins). This requires knowing whether +the old Arc has been dropped, which requires the finalizer to have exited — a +condition only met when the old `OrderedFinalizer` sender side is dropped. 
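
The overlap W-M6 describes can be reproduced in miniature. The following is a minimal sketch, not the crate's code: `std::thread` and `std::sync::mpsc` stand in for the detached tokio task and the finalizer channel so the example needs no external crates, and `Ledger`, `spawn_finalizer_stand_in`, `ledger_is_exclusive`, and `reload_overlap_demo` are hypothetical names. It also assumes the init path can see the same `Arc` that the old finalizer holds, which these notes do not confirm.

```rust
use std::sync::mpsc;
use std::sync::Arc;
use std::thread;

/// Hypothetical stand-in for the real ledger; only its identity matters here.
struct Ledger;

/// Spawns a detached worker that keeps its own `Arc<Ledger>` clone alive until
/// every sender is dropped, loosely mirroring the detached finalizer task.
fn spawn_finalizer_stand_in(
    ledger: Arc<Ledger>,
) -> (mpsc::Sender<u64>, thread::JoinHandle<()>) {
    let (tx, rx) = mpsc::channel::<u64>();
    let handle = thread::spawn(move || {
        // The loop ends only once the channel is closed (all senders dropped).
        for _amount in rx {
            let _ledger_in_use: &Ledger = &ledger;
        }
    });
    (tx, handle)
}

/// The proposed init-time guard: true only if no other strong reference exists.
fn ledger_is_exclusive(ledger: &Arc<Ledger>) -> bool {
    Arc::strong_count(ledger) == 1
}

/// Returns (guard result while the worker is alive, guard result after it exits).
fn reload_overlap_demo() -> (bool, bool) {
    let ledger = Arc::new(Ledger);
    let (tx, handle) = spawn_finalizer_stand_in(Arc::clone(&ledger));

    // "Old topology dropped" but the worker still owns a clone.
    let during = ledger_is_exclusive(&ledger);

    // Dropping the sender closes the channel, which lets the worker exit.
    drop(tx);
    handle.join().expect("finalizer stand-in panicked");
    let after = ledger_is_exclusive(&ledger);

    (during, after)
}
```

In this sketch the guard fails while the worker is alive and passes only after the sender side is dropped and the worker has exited, which is the ordering condition W-M6 says the real reload path would have to guarantee.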
+ +--- + +## Section 3 — Cross-Lens Composite Findings + +### W-C1: Antithesis-Fit calls `durable-unacked-events-survive-crash` high-value, but Implementability calls the oracle definition "needs a decision" — the reformulation + +**The tension:** `durable-unacked-events-survive-crash` is correctly in +Antithesis-Fit's high-value zone (timing-sensitive, test-impossible otherwise). +But the workload oracle for "durably written" is open (catalog: "options are e2e +acks, tracing the sync_all callsite, or flush_interval=0; needs a decision"). +Implementability flags this as a decision gate. + +**The reformulation that captures the same risk feasibly:** + +Instead of tracking "which events are in the fsync'd set," track "which events +are NOT in the definitely-lost set." Specifically: + +The 500ms durability window is a property of the timer, not of individual events. +Any event written more than `2 × flush_interval` milliseconds before the kill +has (with high probability) been included in at least one full fsync cycle. Any +event written within the `flush_interval` window before the kill may or may not +have been fsynced (depends on rotation). + +Workload-side reformulation: + +1. Send events continuously with unique IDs. +2. Maintain a "written more than 2×flush_interval ago" set (events definitely + past the fsync window). +3. After kill+restart+drain, assert: every event in the "past-window" set is + delivered. Events in the "within-window" set may or may not be delivered (no + assertion, loss is expected). +4. The `Sometimes` assertion: confirm that at least one past-window event + survived the kill (proves fsync is actually working, not just that the test + never wrote anything durable). + +This reformulation requires only: (a) a workload timer, (b) `flush_interval` +set to a known short value, (c) event IDs. No SUT instrumentation needed for +the oracle. 
The `Sometimes` variant confirms the positive case; the `Always` +variant (past-window events survive) catches durability regressions. + +**Scope:** `durable-unacked-events-survive-crash` reformulation +**Evidence:** `common.rs:31` (`DEFAULT_FLUSH_INTERVAL = 500ms`); `writer.rs:1312` +(`should_flush` timer gate); property-catalog.md §durable-unacked-events-survive-crash +Open Question 1. + +--- + +### W-C2: Coverage-Balance flags "skip-rest-of-file data loss" as thin — a feasible reformulation exists + +**The tension:** Coverage Balance finds the "reader skips entire file after first +bad record" loss surface (SUT §6 item 7) to have no dedicated property. Antithesis +Fit would rate it high-value (exact point of bad-read detection + corruption extent +depend on timing). Implementability notes `corruption-is-detected-and-recovered` +partially covers it but quantifies nothing. + +**The reformulation:** Add a SUT-side counter: `records_abandoned_due_to_corruption` +(a simple atomic incremented each time `roll_to_next_data_file` is called after a +bad read). An `assert_always(records_abandoned <= 0)` would be wrong (abandonment +is by design). The correct property: `assert_always(records_abandoned_per_file <= +expected_max)` where `expected_max` is `max_data_file_size / min_record_size`. In +other words: abandonment is bounded by the file size. This is always trivially true, +but the observation that matters is: were any valid records abandoned that the +workload knows should have survived? + +Workload approach: instead of counting abandoned records (invisible to workload), +use unique event IDs. If an event with ID K was written (workload confirmed it was +accepted into the source), but the workload also injected corruption into the file +containing K's data, and K is never delivered — that IS an acceptable loss (K was +on the corrupted segment). But if events K+1, K+2, ... 
K+N (known to be in valid +records AFTER the corruption point in the same file) are also never delivered — +that IS the skip-rest-of-file loss. + +This is implementable if the workload can correlate corruption injection timing +with event IDs (doable with a shared timestamp). + +**Scope:** NEW property (or sub-case of `corruption-is-detected-and-recovered`) +**Evidence:** `reader.rs` `roll_to_next_data_file` call site; `sut-analysis.md §6` +item 7; coverage-balance.md §F-Missing "skip-rest-of-file quantification." + +--- + +## Section 4 — Oddities + +### W-O1: The `debug_assert!` at `writer.rs:~396` for `max_data_file_size >= max_record_size` is compiled out of release — creating a silent infinite loop risk in production + +**The property `record-never-spans-files` notes this in its Open Questions** ("Is +the `debug_assert(max_data_file_size >= max_record_size)` compiled out of release? +If so, a `max_record_size > max_data_file_size` misconfig silently makes every +write return `DataFileFull` → writer loops forever"). + +This is not just a config issue — Antithesis runs production builds (not debug). +If someone configures `max_record_size > max_data_file_size` (even accidentally by +setting both to the same value and then a record header pushes it over), the writer +enters an infinite loop with no error log and no crash. This is the same operational +signature as the deadlock from #21683 and would be completely invisible to the +operator. + +No property tests this specific misconfiguration. The catalog's `record-never-spans- +files` focuses on the spanning-record data loss, not the deadlock-from-config-mismatch. + +**This is odd** because: the `debug_assert` exists (author knew it was important), +it is not a release assertion (release safety check is absent), and the result is +identical to the known-highest-value bug. 
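
The release-mode check that W-O1 finds missing could take roughly the following shape. This is a minimal sketch under stated assumptions: `SizeConfigError` and `validate_size_limits` are hypothetical names (the real `BufferError` variants are not shown in these notes), and the condition simply mirrors the existing `debug_assert`, so it does not account for any per-record header overhead.

```rust
/// Hypothetical error type for illustration; not the crate's `BufferError`.
#[derive(Debug, PartialEq, Eq)]
enum SizeConfigError {
    RecordExceedsDataFile {
        max_record_size: u64,
        max_data_file_size: u64,
    },
}

/// Enforces `max_data_file_size >= max_record_size` in every build profile,
/// returning an error instead of relying on a `debug_assert!` that release
/// builds compile out.
fn validate_size_limits(
    max_data_file_size: u64,
    max_record_size: u64,
) -> Result<(), SizeConfigError> {
    if max_record_size > max_data_file_size {
        return Err(SizeConfigError::RecordExceedsDataFile {
            max_record_size,
            max_data_file_size,
        });
    }
    // Equal values pass, exactly as the `debug_assert!` condition allows.
    Ok(())
}
```

A config-fuzzing harness step could call a check like this with edge-case values and then assert that the process either rejected the config or made write progress.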
+ +**Suggested action:** Promote the `debug_assert` to a validated configuration check +in `Buffer::from_config_inner` that returns an `Err(BufferError)` instead of silently +looping. This is a one-line fix independent of Antithesis. As a property: add a +harness configuration fuzzing step — include configs with edge-case `max_record_size` +values — and assert Vector either rejects the config or makes write progress. + +--- + +### W-O2: The catalog lists `every-written-event-eventually-delivered` as a liveness `Sometimes` — but liveness under faults usually requires `Always(eventually)` + +**The oddity:** The catalog uses `Sometimes(all_produced_delivered)` for the +end-to-end at-least-once property. In Antithesis semantics, `Sometimes` fires +when the condition is true on any one execution. But "at-least-once delivery" is +not a "sometimes" claim — it is an "always eventually" claim: for every event +produced, it is eventually delivered (possibly after many retries/crashes). + +The catalog's choice of `Sometimes` is explained by "progress milestone" framing +— confirm the happy path fires at least once. But this means Antithesis will stop +reporting a failure once it finds a single timeline where all events are delivered, +even if 99% of timelines have data loss. + +**A stronger formulation:** Use a workload-level `assert_always` on a per-event +basis: for every event ID in the "produced" set, assert it eventually appears in +the "delivered" set within the quiet period. The `Sometimes` wrapper is only needed +for the `Sometimes(at_least_one_event_delivered)` reachability check — separately +from the `Always(all_produced_delivered)` safety check. + +**Scope:** `every-written-event-eventually-delivered` +**Evidence:** property-catalog.md §`every-written-event-eventually-delivered` +("`Sometimes(all_produced_delivered)` as a progress milestone"). 
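
The stronger formulation above separates two workload-side computations, sketched here under stated assumptions: event IDs are modeled as `u64`, the function names are hypothetical, and the Antithesis SDK macros themselves are deliberately not invoked. The first result would feed the `Always` check after a quiet period, the second the `Sometimes` reachability check.

```rust
use std::collections::HashSet;

/// Returns every produced ID that is absent from the delivered set, sorted.
/// An empty result means `produced ⊆ delivered`; duplicates and extra
/// delivered IDs are ignored, matching at-least-once semantics.
fn missing_deliveries(produced: &HashSet<u64>, delivered: &HashSet<u64>) -> Vec<u64> {
    let mut missing: Vec<u64> = produced.difference(delivered).copied().collect();
    missing.sort_unstable();
    missing
}

/// Reachability signal only: true once at least one event has been delivered.
fn delivery_path_reached(delivered: &HashSet<u64>) -> bool {
    !delivered.is_empty()
}
```

Returning the missing IDs rather than a bare boolean keeps a failing timeline easy to triage, since the report can name the lost events.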
+**Suggested action:** Split into: (a) `assert_sometimes!("at_least_one_delivery", + delivered_count > 0)` to confirm the delivery path is exercised; (b) workload +`assert_always!("all_produced_eventually_delivered", produced_set ⊆ delivered_set)` +checked after each quiet period. The `Sometimes` becomes a reachability check on +the delivery path, not the primary safety assertion. + +--- + +### W-O3: The workload-observable deadlock signal depends on write throughput dropping — but WhenFull::Block backpressure looks identical to the deadlock from outside + +**The oddity:** The `writer-eventually-makes-progress` property proposes observing +"write throughput drops to zero" as the deadlock signal. But normal, correct +backpressure (`WhenFull::Block`, buffer full, reader hasn't caught up) also drops +write throughput to zero. The two states are operationally identical from outside +the process: + +- Healthy full buffer: write throughput zero, sink delivery continues, eventually + writer unblocks. +- Deadlocked buffer: write throughput zero, sink delivery also stops (reader + drains but accounting is wrong so writer never unblocks). + +The distinguishing signal is: **does sink delivery rate also drop to zero during +the stall?** In the deadlock, the reader drains (sink delivers events that were +already in the buffer), but eventually the buffer is empty and sink delivery also +stops. In healthy backpressure, sink delivery continues throughout and eventually +the writer unblocks. + +The workload must observe **both** write throughput AND sink delivery throughput +simultaneously. The property as written ("write throughput drops to zero") is +insufficient — a single-metric check gives a false positive for healthy +backpressure. + +**The joint condition needed:** `write_throughput ≈ 0 AND sink_throughput ≈ 0 +AND duration > drain_time_bound`. This is never written in any property. + +**Scope:** `writer-eventually-makes-progress` workload design. 
+**Evidence:** `writer.rs:1001-1019` (backpressure loop vs. deadlock loop look +identical externally); `reader.rs:553-555` (reader does notify_reader_waiters in +both cases). +**Suggested action:** Modify the workload to emit a deadlock-detection assertion +only when write throughput AND sink delivery throughput are both near zero for +more than the drain-time bound. Also add sink delivery throughput as a continuous +metric. + +--- + +## Summary Table + +| ID | Property/Slug | Concern | Scope | Evidence brief | Suggested action | +|----|--------------|---------|-------|----------------|-----------------| +| W-F1 | catalog-wide (Cat 2–6) | Persistent-volume assumption unguarded: fresh-filesystem kills produce vacuously green runs | Harness design | deployment-topology.md §CRITICAL; no sentinel property in catalog | Add workload sentinel before fault injection: write+fsync N events, kill, assert `.dat` files survive; gate Cat 2–6 on sentinel | +| W-F2 | `durable-unacked-events-survive-crash`, `every-written-event-eventually-delivered`, `partial-write-at-rotation-recovers` | Three oracle candidates for "durably written" conflate fsync, source-ack, and sink-delivery — e2e ack is NOT a durability marker | Category 3 | `prelude.rs:309` (200 = sink delivered, not fsynced) | Pick: set flush_interval short, define "durable" = "sent >2×flush_interval ago"; no SUT instrumentation needed | +| W-F3 | `partial-write-at-rotation-recovers`, `durable-unacked-events-survive-crash` | `Ordering::Greater` path fires for both genuine data loss AND page-cache-only loss — not distinguished by any property | Specific | `writer.rs:1312-1317` (sync_all then ledger.flush); sut-analysis §3 | Add SUT-side annotation at `Ordering::Greater` distinguishing "no bytes in file" from "bytes corrupted" | +| W-M1 | `writer-eventually-makes-progress`, `total-buffer-size-never-underflows` | Double-wrap in `is_buffer_full` (`u64::MAX + unflushed_bytes` wraps to small number) creates intermittent 
write-through, not permanent deadlock; `Sometimes` may fire during the brief escape, producing false-green liveness result | `writer.rs:993-996` | Arithmetic: `u64::MAX + 1 = 0`; brief escape then re-deadlock | Stall-count counter must require consecutive no-net-progress cycles; add compound workload assertion (write=0 AND sink=0 AND duration>bound) | +| W-M2 | `durable-unacked-events-survive-crash`, Cat 3 | `should_flush` CAS + frozen Antithesis clock can permanently suppress fsync — "≤500ms window" claim requires clock running at real speed | `ledger.rs:485-497` | `Instant::elapsed()` drives 500ms gate; rotation is the only clock-independent fsync trigger | Set flush_interval=0 in test config (rotation-triggered fsync always fires regardless of clock) | +| W-M3 | NEW — no existing property | `WhenFull::Overflow` + disk base: later in-memory events silently lost, earlier durable events replayed — creates permanent gap in at-least-once reasoning | `topology/channel/sender.rs:236-244` | sut-analysis §10 | Add second topology config with overflow; add property documenting which events survive (disk-buffer portion) vs. 
which are lost (overflow portion) | +| W-M4 | `recovery-completes-after-crash`, `total-buffer-size-never-underflows` | `NotFound` skip path at `reader.rs:777` calls `increment_acked_reader_file_id` without subsequent `ledger.flush()`, unlike delete path (asymmetry is a future regression risk) | `reader.rs:766-784` vs `reader.rs:548-549` | Structural: delete path flushes, skip path does not | Add `ledger.flush()` after line 777; add kill-during-NotFound test case to `recovery-completes-after-crash` | +| W-M5 | `buffer-size-within-max`, `writer-eventually-makes-progress` (jointly) | Both individually pass under the double-wrap intermittent deadlock; no joint compound property detects the broken state | Both properties | W-M1 arithmetic; property-catalog.md §`buffer-size-within-max` vacuity note | Add compound workload assertion: write≈0 AND sink≈0 AND buffer≥90% full AND duration>drain_time | +| W-M6 | `config-reload-no-silent-loss` | Old finalizer `Arc` may overlap with new topology init (POSIX lock is per-process; detached tokio task survives Drop) | `ledger.rs:701-710` | sut-analysis §5 (INV-10); detached tokio task holds Arc after Drop | Assert `Arc::strong_count == 1` at start of new `from_config_inner`; ensure finalizer exits before new init | +| W-C1 | `durable-unacked-events-survive-crash` | High Antithesis-Fit but open oracle — reformulation: use event-timestamp-based "past-window" set instead of fsync tracing | Oracle design | `common.rs:31` (500ms interval); no SUT instrumentation needed | Events sent >2×flush_interval ago = "past window"; assert all past-window events delivered after restart | +| W-C2 | NEW (sub-case of `corruption-is-detected-and-recovered`) | Skip-rest-of-file loss is unquantified; valid records after a corruption point are abandoned silently | Corruption cluster | reader `roll_to_next_data_file`; coverage-balance §F-Missing | Track abandonment via event-ID correlation: events written after the corruption-injected offset should survive; 
absent → loss | +| W-O1 | `record-never-spans-files` | `debug_assert` for `max_data_file_size >= max_record_size` compiled out in release → misconfig produces infinite write loop identical to the known deadlock | `writer.rs:~396` | debug_assert only; no release-mode check | Promote to validated config check returning `BufferError`; add harness config fuzzing | +| W-O2 | `every-written-event-eventually-delivered` | `Sometimes(all_produced_delivered)` is insufficient for at-least-once — `Sometimes` fires on a single good timeline, hiding loss on all others | property-catalog.md §this property | Antithesis `Sometimes` semantics | Split: `Sometimes(delivery_path_reachable)` + `Always(per_event: produced ⊆ delivered)` per quiet-period drain | +| W-O3 | `writer-eventually-makes-progress` | Deadlock signal (write throughput → 0) is identical to healthy backpressure from outside; single-metric check gives false positive | workload design | `writer.rs:1001-1019`; reader `notify_reader_waiters` fires in both cases | Assert write≈0 AND sink≈0 AND duration>drain_time (not just write≈0 alone) | + +--- + +## Passes (no concern) + +- The 19-property count and the cluster structure are sound. No cluster is + internally contradictory in its intended semantics (the jointly-contradictory + issue at W-M5 is about runtime behavior, not definitional contradiction). +- The `record-id-monotonicity-holds` `Unreachable` assertion type is correct: the + monotonicity panic is a guardrail that must never trip, and `Unreachable` is the + right SDK type. +- The `no-corrupted-record-delivered` `AlwaysOrUnreachable` assertion type is + correct: corruption detection is an optional path (acceptable if never triggered), + but any execution that does enter the path must satisfy the invariant. +- The `sink-failure-not-silently-acked` property correctly identifies a known- + violated invariant as an expected-to-fail property — valuable for tracking when + the bug is fixed. 
+- The `file-id-rollover-stays-coordinated` property is correctly scoped to the + test-time `MAX_FILE_ID = 6` constant that makes rollover reachable. +- The `buffer.lock` advisory-lock design is correctly scoped to intra-process + being unprotected (INV-10). No property overclaims protection. + +--- + +## Uncertainties + +1. **Whether Antithesis's `Instant` time is virtual or real.** If Antithesis's + scheduler does not intercept `std::time::Instant::now()` / `elapsed()`, the + clock-jitter concern (W-M2) may not apply. If it does intercept, the concern + is active and `flush_interval=0` in test config is the mitigation. + +2. **Whether the persistent-volume requirement (W-F1) is met in the target tenant.** + This is the most operationally critical uncertainty: without it, the entire + crash-recovery test suite is meaningless. + +3. **Whether the double-wrap intermittent escape (W-M1) actually produces a + distinguishable execution trace in Antithesis.** The timing window for + `unflushed_bytes > 0` at the `is_buffer_full` check point may be very short + (microseconds), and Antithesis may not systematically explore it unless the + scheduler specifically targets the `fetch_sub` → `is_buffer_full` interleaving. + +4. **Whether the `Ordering::Greater` path (W-F3) is reachable without the F5 + torn-tail.** If genuine `Greater` cases only arise from hardware media errors + (which Antithesis doesn't inject directly), this sub-case is unreachable in + practice and the distinction from F5 false-positives is moot. + +5. **Whether the config-reload finalizer race (W-M6) is actually concurrent.** + If the tokio runtime guarantees that the detached finalizer task is drained + before the `Arc` is dropped, the race does not exist. This requires + confirming the tokio shutdown semantics for detached tasks. 
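
The compound stall condition that W-M5 and W-O3 call for (write ≈ 0 AND sink ≈ 0 AND buffer nearly full AND duration beyond the drain bound) can be sketched on the workload side as follows. Everything here is an assumption made for illustration: the `Sample` fields, the `tolerance` parameter used to express "≈ 0", and the idea that the workload polls cumulative counters and a fill ratio at intervals.

```rust
use std::time::Duration;

/// One periodic workload observation (all field names are hypothetical).
#[derive(Clone, Copy, Debug)]
struct Sample {
    /// Time since the workload started.
    at: Duration,
    /// Cumulative events accepted from the producer.
    produced_total: u64,
    /// Cumulative events received by the collector.
    delivered_total: u64,
    /// Buffer fill ratio derived from the size gauge, 0.0 to 1.0.
    buffer_fill: f64,
}

/// True only when the window spans more than `drain_bound`, neither counter
/// advanced by more than `tolerance`, and every sample shows the buffer at or
/// above `fill_threshold`. Healthy backpressure fails the delivered-counter
/// test because the sink keeps draining while the writer waits.
fn is_persistent_stall(
    window: &[Sample],
    fill_threshold: f64,
    drain_bound: Duration,
    tolerance: u64,
) -> bool {
    let (first, last) = match (window.first(), window.last()) {
        (Some(first), Some(last)) => (first, last),
        _ => return false,
    };
    let span = last.at.saturating_sub(first.at);
    let produced = last.produced_total.saturating_sub(first.produced_total);
    let delivered = last.delivered_total.saturating_sub(first.delivered_total);

    span > drain_bound
        && produced <= tolerance
        && delivered <= tolerance
        && window.iter().all(|s| s.buffer_fill >= fill_threshold)
}
```

A nonzero `tolerance` is what lets the detector still fire when the double-wrap escape from W-M1 lets an occasional record through; with a tolerance of zero a single escaped write would reset the verdict.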
diff --git a/tests/antithesis/scratchbook/existing-assertions.md b/tests/antithesis/scratchbook/existing-assertions.md new file mode 100644 index 0000000000000..94dd7bfa124ee --- /dev/null +++ b/tests/antithesis/scratchbook/existing-assertions.md @@ -0,0 +1,56 @@ +--- +sut_path: /home/ssm-user/src/vector +commit: b7aae737cef5dd37d1445915443a1eb97b584f85 +updated: 2026-05-28 +external_references: + - path: lib/vector-buffers/ + why: Scanned entire crate (and whole repo) for Antithesis SDK imports and assertion calls +--- + +# Existing Antithesis SDK Assertions + +## Summary + +**No Antithesis SDK instrumentation exists anywhere in the Vector codebase.** + +A repo-wide scan for the Antithesis SDK and its assertion macros/functions found +zero matches. + +## Scan Performed + +``` +grep -rn "antithesis" --include="*.rs" --include="*.toml" # repo-wide: 0 matches +grep -rn "assert_always|assert_sometimes|assert_reachable|assert_unreachable|antithesis_sdk" \ + --include="*.rs" lib/vector-buffers/ # 0 matches +``` + +- No `antithesis-sdk` (or any `antithesis*`) dependency in any `Cargo.toml`. +- No imports of an Antithesis SDK crate. +- No calls to `assert_always!`, `assert_sometimes!`, `assert_reachable!`, + `assert_unreachable!`, or their non-macro equivalents. + +## Implication for Property Discovery + +Every property in the catalog starts from zero instrumentation. All SUT-side +assertion suggestions in the evidence files are **missing** (not partial, not +already-present) and must be added if adopted. The codebase does, however, make +heavy use of: + +- `tracing` (`trace!`/`debug!`/`error!`) — useful as anchor points for where + Antithesis assertions would naturally sit. +- `metrics`-based internal events (`lib/vector-buffers/src/internal_events.rs`, + `buffer_usage_data.rs`) — these are the existing observability surface; several + known bugs (see external references) live precisely in the gap between these + metrics and reality. 
+- `debug_assert!` / `assert!` in a few hot paths and extensive `proptest` + + model-based tests under `variants/disk_v2/tests/` — these indicate where the + authors already considered invariants worth checking. + +## Assumptions / Open Questions + +- Assumption: the workload and any SUT-side assertions will be added fresh under + this Antithesis effort. The deployment topology must include the Antithesis + Rust SDK as a new dependency for any SUT-side instrumentation. + + + diff --git a/tests/antithesis/scratchbook/grind-plan.md b/tests/antithesis/scratchbook/grind-plan.md new file mode 100644 index 0000000000000..c337502f4e0d1 --- /dev/null +++ b/tests/antithesis/scratchbook/grind-plan.md @@ -0,0 +1,101 @@ +# Antithesis Grind Plan — disk buffer v2 failure demonstrations + +Working note (not a research artifact). Execution queue for the goal: stack +Antithesis tests, launch via `basic_test`, triage, demonstrate real disk-buffer +bugs. Each item = one gt branch + one (or more) launch/triage cycle. Ordered by +(reproducibility × value). "Phase 1" = workload-observable only; "Phase 2" = +needs SUT-side `antithesis_sdk` in `lib/vector-buffers` (rebuild Vector). + +Faults available (user-confirmed): node-termination (kill/restart), persistent +buffer volume. Clock + custom faults: assume available, confirm at use. + +## G0 — Bootstrap (setup task #1/#2/#3) + +- Base harness green: vector healthy, workload `setup_complete` + reachable + ("workload serve started", "event delivered end-to-end through disk buffer"). +- Launch once via `basic_test`, triage: expect both reachables hit, "Software + was instrumented" for the workload. Validates the whole pipeline. + +## G1 — Durability / at-least-once under crash [HIGH, Phase 1, node-kill] + +Properties: durable-unacked-events-survive-crash, every-written-event-eventually-delivered. 
+ +- Workload: `produce` appends each sent id+timestamp to /shared/produced.log; + collector appends delivered ids to /shared/delivered.log (shared tmpfs volume + between... no — single workload container holds both; use one process or a + shared file in the workload container). +- Vector config variant: `flush_interval: 0` (every flush = fsync) so the oracle + is clean; e2e acks on. +- Fault: Antithesis node-kills vdbuf-vector repeatedly; persistent volume keeps + the buffer. +- Check command (`eventually_` or quiet-period driver): every id produced + >2×flush_interval ago must be in delivered (dups allowed). assert_always. +- Expected: should HOLD; a violation = real durability/recovery bug (strong find). + +## G2 — Writer deadlock / no-progress (#21683) [HIGHEST value, Phase 1 then Phase 2] + +Properties: writer-eventually-makes-progress, total-buffer-size-never-underflows. + +- Phase 1 (workload-observable compound stall detector): fill buffer; node-kill + vector at rotation/partial-write moments; restart; resume writes. After a + quiet period assert COMPOUND: produced-rate≈0 AND delivered-rate≈0 AND buffer + >~90% AND duration>drain-bound ⇒ assert_unreachable("persistent_deadlock"). + Must use both rates (distinguish deadlock from healthy block backpressure). +- Phase 2 (precise signal): add antithesis_sdk to lib/vector-buffers; at + ledger.rs:~292 decrement assert the value doesn't wrap (assert_always amount<=current); + assert_unreachable on underflow. Rebuild Vector (release; no debug trace! panic). +- Needs many timelines + sustained writes; release build mandatory. + +## G3 — record-id-wraparound empty-buffer 2^64 gauge [MED-EASY, Phase 1] + +Property: record-id-wraparound-accounting-holds (empty-buffer case). + +- Workload: drain buffer fully (stop producing, let collector drain + ack); + then trigger a vector restart (node-kill graceful or Antithesis restart); + scrape buffer_size_events / buffer_size_bytes from :9598; assert ~0 + (assert_always small). 
Expected FAIL: gauge shows ~1.8e19 on drained restart. +- No node-kill strictly needed if a graceful restart can be driven; else use it. + +## G4 — foreign .dat file stalls writer [MED, Phase 1, no node-kill needed] + +Property: foreign-data-file-no-writer-stall. + +- Compose: also mount vdbuf-buffer into the workload container (ro? rw) so a + test command can drop a large `foreign.dat` into /var/lib/vector/buffer/v2//. +- Test command places the file; vector restart picks it up (update_buffer_size + sums all *.dat); assert writer still makes progress. Expected FAIL: stall. + +## G5 — drop_newest not counted at component level [MED, Phase 1] + +Property: dropped-events-are-counted. + +- Vector config variant: when_full: drop_newest; collector blocks/rejects so the + 256MB buffer fills (produce large events to fill faster). +- Scrape buffer_discarded_events_total vs component_discarded_events_total; + assert equal. Expected FAIL: component stays 0 while buffer increments. + +## G6 — sink-failure not silently acked [MED, Phase 1] + +Property: sink-failure-not-silently-acked. + +- Collector returns 5xx for a window (workload-controlled). Assert events whose + delivery errored are retained/retried, not dropped from the buffer. + +## G7 — config-reload silent loss [LATER, custom fault SIGHUP] + +Property: config-reload-no-silent-loss. Needs SIGHUP-to-vector custom fault. + +## G8 — fsync window under clock jitter [LATER, clock fault] + +Property: fsync-window-bounded-under-clock-jitter. Needs clock faults. + +## Notes + +- Each launch: `docker compose build` (only if images changed) → snouty validate + → snouty launch --json --webhook basic_test --config antithesis/config + --duration . Then triage by run id. +- Keep each test on its own gt branch stacked on antithesis-setup-harness. +- Do NOT fix Vector bugs — demonstrate + make reproducible. +- Multiple config variants (flush_interval, when_full) → either separate compose + profiles or env-substituted vector.yaml. 
Decide per-test; simplest is a small + set of vector-.yaml + compose overrides. diff --git a/tests/antithesis/scratchbook/properties/acked-files-eventually-deleted.md b/tests/antithesis/scratchbook/properties/acked-files-eventually-deleted.md new file mode 100644 index 0000000000000..5fc2f0ab5fddf --- /dev/null +++ b/tests/antithesis/scratchbook/properties/acked-files-eventually-deleted.md @@ -0,0 +1,334 @@ +--- +slug: acked-files-eventually-deleted +property_id: 12 +type: Liveness +antithesis_assertion: Sometimes(data_file_deleted) +sut_path: lib/vector-buffers/src/variants/disk_v2/ +commit: b7aae737cef5dd37d1445915443a1eb97b584f85 +updated: 2026-05-28 +cross_refs: + - total-buffer-size-never-underflows # underflow blocks deletion indirectly + - writer-eventually-makes-progress # deletion is the prerequisite for writer unblock +related_issues: + - "vectordotdev/vector #21683" # total_buffer_size underflow → permanent stall + - "vectordotdev/vector #23456" # flaky clean-termination test +disabled_tests: + - "lib/vector-buffers/src/variants/disk_v2/tests/size_limits.rs::writer_waits_when_buffer_is_full (ignore = \"Needs investigation\")" +--- + +# Property 12: acked-files-eventually-deleted + +## Invariant (informal) + +Once every record in a data file has been acknowledged by the downstream sink, +the file is **eventually unlinked from the filesystem and its bytes subtracted +from `total_buffer_size`** — even if no new writes arrive. This must hold +across quiet periods and across crash-restart cycles. + +## Formal Statement + +**Sometimes(data_file_deleted)**: In any execution where a data file is written, +filled, and all its records are delivered and acknowledged downstream, an +`unlink(path)` of that file is observed within finite time, and +`total_buffer_size` decreases by the corresponding file size. 
+
+Equivalently as an invariant over any time-bounded window:
+
+> For every `.dat` file F whose last record has been acked: there exists a
+> future state where `filesystem::stat(F)` returns `ENOENT` and
+> `ledger.total_buffer_size` has decreased by `metadata(F).len()` relative to
+> the moment the last ack was processed.
+
+---
+
+## Dependency Chain (the full progress path for deletion)
+
+The deletion of a data file requires **every link** of the following chain to
+succeed. Breaking any single link silently stops all progress.
+
+```
+[Sink drops BatchNotifier]
+        │
+        ▼  (vector-common/src/finalizer.rs:FuturesOrdered::next())
+[OrderedFinalizer task yields (BatchStatus, amount: u64)]
+        │  ledger.rs:703-707 ── tokio::spawn loop ── stream.next().await
+        │  NOTE: _status is DISCARDED here (ledger.rs:704 `let (_status, amount)`)
+        ▼
+[ledger.increment_pending_acks(amount)]                      ledger.rs:705
+[ledger.notify_writer_waiters()]                             ledger.rs:706
+        │  (misleading name: wakes the *reader*, not the writer)
+        ▼
+[reader.next() loop wakes; calls handle_pending_acknowledgements]
+        │  reader.rs:965-967
+        ▼
+[ledger.consume_pending_acks()]                              reader.rs:582 / ledger.rs:421
+[record_acks.add_acknowledgements(consumed_acks)]            reader.rs:584
+[record_acks.get_next_eligible_marker() loop]                reader.rs:586-635
+        │  advances reader_last_record, accumulates bytes_acknowledged
+        ▼
+[data_file_acks.add_acknowledgements(records_acknowledged)]  reader.rs:633
+[data_file_acks.get_next_eligible_marker() loop]             reader.rs:655-668
+        │  gated by: had_eligible_records || force_check_pending_data_files
+        ▼
+[delete_completed_data_file(path, bytes_read)]               reader.rs:662 → reader.rs:489
+        ├── [ledger.filesystem().open_file_readable(path)]   reader.rs:514-518
+        │       (stat to get file size before unlink)
+        ├── [metadata.len() - bytes_read → decrease_amount]  reader.rs:521-535
+        │       BUG WINDOW: if metadata.len() < bytes_read → u64 underflow here (reader.rs:524)
+        ├── [ledger.decrement_total_buffer_size(decrease_amount)]  reader.rs:538 / ledger.rs:291-298
+        │       BUG: raw fetch_sub, no saturation (the #21683 control-path is unfixed)
+        ├── [filesystem.delete_file(path)]                   reader.rs:546
+        │       I/O FAULT POINT: ENOSPC, EPERM, flaky disk → error propagates up
+        ├── [ledger.increment_acked_reader_file_id()]        reader.rs:548 / ledger.rs:457-478
+        ├── [ledger.flush()]                                 reader.rs:549
+        │       I/O FAULT POINT: msync failure
+        └── [ledger.notify_reader_waiters()]                 reader.rs:555
+                (wakes the *writer*, per the inverted naming)
+```
+
+**The `force_check_pending_data_files` path** (reader.rs:1076) is the
+mechanism by which deletion proceeds during a **quiet period** (no new writes).
+When the reader rolls to the next data file (`roll_to_next_data_file`,
+reader.rs:1075), it sets `force_check_pending_data_files = true` for the next
+loop iteration. That flag bypasses the `had_eligible_records` guard at
+reader.rs:651, allowing `data_file_acks.get_next_eligible_marker()` to fire
+even when no new acks arrived in that iteration. Without this path, a file
+whose last record was acked after the reader moved on would not be deleted
+until the next record ack arrived.
+
+---
+
+## The Finalizer-Task-Death Scenario
+
+The finalizer is spawned at `ledger.rs:701-710` as a detached `tokio::spawn`:
+
+```rust
+// ledger.rs:701-710
+pub(super) fn spawn_finalizer(self: Arc<Self>) -> OrderedFinalizer<u64> {
+    let (finalizer, mut stream) = OrderedFinalizer::new(None);
+    tokio::spawn(async move {
+        // `_status` (Delivered / Errored / Rejected) is ignored: every outcome
+        // is credited as an ack.
+        while let Some((_status, amount)) = stream.next().await {
+            self.increment_pending_acks(amount);
+            self.notify_writer_waiters();
+        }
+    });
+    finalizer
+}
+```
+
+Three observations:
+
+1. **`_status` is silently discarded.** `BatchStatus` carries `Delivered`,
+   `Errored`, or `Rejected`. The task unconditionally credits all three as
+   acked events. A sink-error ack therefore advances `reader_last_record` and
+   eventually triggers file deletion, removing the event from the buffer
+   **without replay** (sut-analysis.md §5, INV-9 broken).
This is a separate + correctness bug, but it means the deletion path runs for all outcomes, which + actually makes the "file deleted" observation more reachable — at the cost of + silent loss. + +2. **Task death = acks stranded.** The finalizer task holds the only consumer + of `stream`. If the tokio task is killed (SIGKILL hitting the process), + panics (a `BatchStatusReceiver` future panics), or the runtime is shut down + while the task is still pending, the `OrderedFinalizer` sender + (`finalizer`) is still alive in the reader, but the receiving task is gone. + Subsequent calls to `finalizer.add(amount, receiver)` at reader.rs:1119 + succeed (the unbounded channel accepts messages), but nobody is consuming + that channel. From `vector-common/src/finalizer.rs:101-107`: + + ```rust + pub fn add(&self, entry: T, receiver: BatchStatusReceiver) { + if let Some(sender) = &self.sender + && let Err(error) = sender.send((receiver, entry)) + { + error!(message = "FinalizerSet task ended prematurely.", %error); + } + } + ``` + + The send will only error when the receiver side of the *channel* is dropped, + which happens when the task exits and `new_entries` (the `UnboundedReceiver`) + is dropped. Until that point, `add()` silently succeeds while the finalizer + task is dead. Result: + - `pending_acks` is never incremented. + - `notify_writer_waiters()` is never called. + - The reader's `handle_pending_acknowledgements` loop at reader.rs:582 calls + `ledger.consume_pending_acks()` and gets 0 every iteration. + - `had_eligible_records` is always false. + - `had_eligible_data_files` is always false (unless `force_check_pending_data_files` + fires from a roll, but that only checks `data_file_acks`, which requires + `records_acknowledged > 0` to have been accumulated, which requires + `had_eligible_records`, which requires `consume_pending_acks() > 0`). + - No file is ever deleted. + - `total_buffer_size` is never decremented. 
+ - Writer sees `is_buffer_full()` permanently true. + - **Permanent writer deadlock.** + + On a SIGKILL + restart, the finalizer task is recreated fresh, so this + scenario is self-healing across restarts. But within a single process + lifetime (e.g., the task panics due to a bug in the futures layer, or the + tokio runtime shuts down the task before draining it during graceful + shutdown), the pipeline stalls silently. + +3. **Shutdown ordering hazard.** The sut-analysis.md §5 open question: "Does + the finalizer task get drained by the tokio runtime before shutdown, or can + in-flight acks be lost?" If `tokio::runtime::Runtime::shutdown_timeout` fires before + the `FuturesOrdered` inside the finalizer task drains its pending + `BatchStatusReceiver` futures, those acks are lost. The data files are not + deleted. On the *next* startup, `update_buffer_size` re-seeds + `total_buffer_size` from the on-disk file sizes — potentially over-counting + — which is the exact trigger for the #21683 underflow on subsequent reads. + +--- + +## The `delete_completed_data_file` Underflow Window + +At reader.rs:521-535: + +```rust +let decrease_amount = bytes_read.map_or_else( + || metadata.len(), + |bytes_read| { + let size_delta = metadata.len() - bytes_read; // reader.rs:524 + ... + size_delta + }, +); +``` + +`metadata.len()` is the on-disk file size at deletion time; `bytes_read` is the +cumulative number of bytes the reader successfully read from the file. If an +I/O fault, crash-induced partial write, or race inflated `bytes_read` above the +actual file size, `metadata.len() - bytes_read` underflows (both are `u64`): it +**wraps** to ≈ 2^64 in a build without overflow checks and panics in a build +with them. In the wrapping case, the subsequent +`ledger.decrement_total_buffer_size(decrease_amount)` at reader.rs:538 calls the +raw `fetch_sub` at ledger.rs:292, which wraps a second time. Because both steps +are arithmetic modulo 2^64, the net result is +`total_buffer_size + (bytes_read - metadata.len())`: the counter ends up +inflated by the excess rather than sitting near 2^64. Nothing in the path shown +here corrects that over-count, so the writer sees less free space than actually +exists and can block prematurely. This is a second accounting-corruption path +alongside the #21683 underflow, beyond the startup reconstruction path. 
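The wrap described above can be reproduced in isolation. The following is a minimal sketch, not the code in reader.rs or ledger.rs: the function names `size_delta_wrapping`, `size_delta_checked`, and `saturating_decrement` are hypothetical, and the sketch assumes the subtraction is compiled without overflow checks, which `wrapping_sub` models explicitly.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical stand-in for `metadata.len() - bytes_read` in a build
/// without overflow checks: the subtraction wraps modulo 2^64.
pub fn size_delta_wrapping(metadata_len: u64, bytes_read: u64) -> u64 {
    metadata_len.wrapping_sub(bytes_read)
}

/// Hypothetical guarded variant: `None` flags the inconsistent case
/// (`bytes_read > metadata_len`) instead of producing a huge delta.
pub fn size_delta_checked(metadata_len: u64, bytes_read: u64) -> Option<u64> {
    metadata_len.checked_sub(bytes_read)
}

/// Hypothetical saturating decrement for an atomic byte counter.
/// Returns the value stored by this update; it never goes below zero.
pub fn saturating_decrement(total: &AtomicU64, amount: u64) -> u64 {
    let mut current = total.load(Ordering::Acquire);
    loop {
        let next = current.saturating_sub(amount);
        // Retry the compare-and-swap until no other thread raced us.
        match total.compare_exchange_weak(current, next, Ordering::AcqRel, Ordering::Acquire) {
            Ok(_) => return next,
            Err(observed) => current = observed,
        }
    }
}
```

With `metadata_len = 1_000` and `bytes_read = 1_024`, the wrapping variant yields `u64::MAX - 23`, while the checked variant returns `None`, which a caller could treat as a signal to log the inconsistency and skip the decrement.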
+ +--- + +## Antithesis Experimental Design + +### Target scenario + +1. Configure a disk buffer with `max_buffer_size` set to exactly two data files' + worth of records (forces the writer to block after filling two files). +2. Write enough records to fill exactly one data file. Flush. Verify the file exists + on disk and `total_buffer_size > 0`. +3. Read all records from that file. Do not yet ack. +4. **Ack all records.** The finalizer task should fire, `pending_acks` should + be incremented, and the reader loop should delete the file. +5. **Quiet period** (no new writes). Assert within a timeout that the `.dat` + file is absent (`stat` returns `ENOENT`) and that the `buffer_byte_size` metric + gauge has dropped to 0. + +### Fault injections + +- **Node SIGKILL between ack and deletion** (between finalizer firing and the + `delete_file` syscall). On restart: file should be rediscovered, reader should + re-seek to the end, and `delete_completed_data_file` should be called again + via the initialization path (`bytes_read = None`). Assert file eventually gone. +- **Finalizer-task kill** (simulate by pausing/killing only the finalizer tokio task, + or by using Antithesis's process-level controls). Assert that the file is + not deleted until the task is restored, which confirms the dependency. +- **Filesystem fault on `delete_file`** (inject `EIO` or `EPERM`). The error + propagates from `delete_completed_data_file` → `handle_pending_acknowledgements` + → `next()` via `.context(IoSnafu)?` at reader.rs:966. The reader returns an + error. Assert the caller (the topology) handles this gracefully and retries. + Currently `receiver.rs` panics on reader I/O error (sut-analysis.md §8), so + the expected behavior is a process restart. +- **CPU throttle during `should_flush`** (extend the 500ms window): the ledger + msync after deletion may be delayed. Assert that after the throttle lifts, + the ledger is flushed and the file is eventually absent even if it takes + longer than normal. 
+ +### Assertions to add (SUT-side, none currently exist) + +```rust +// In delete_completed_data_file, after filesystem.delete_file succeeds: +antithesis_sdk::assert_sometimes!( + true, + "data file was deleted after all records acked", + &serde_json::json!({ + "data_file_path": data_file_path.to_string_lossy(), + "bytes_read": bytes_read, + "decrease_amount": decrease_amount, + "total_buffer_size_after": self.ledger.get_total_buffer_size(), + }) +); + +// In decrement_total_buffer_size, assert no underflow: +antithesis_sdk::assert_always!( + amount <= self.total_buffer_size.load(Ordering::Acquire), + "total_buffer_size decrement must not underflow", + &serde_json::json!({ "amount": amount, + "current": self.total_buffer_size.load(Ordering::Acquire) }) +); +``` + +### Workload oracle + +External oracle (workload container): + +- After the quiet period, list all `buffer-data-*.dat` files in the buffer + directory for the SUT. Assert that no file whose start record ID ≤ the last + acked record ID still exists. +- Read the `buffer_byte_size` Prometheus metric. Assert it equals 0 (or matches + the number of un-acked bytes still in flight, if any). + +--- + +## Why This Matters + +This property is the prerequisite for the writer liveness guarantee (L1, +sut-analysis.md §5). If files are not deleted, `total_buffer_size` stays +elevated. The writer's `is_buffer_full()` check at writer.rs:993-997 returns +true. `ensure_ready_for_write()` at writer.rs:1001-1020 loops forever on +`ledger.wait_for_reader()`, which will never fire because the reader is also +stuck (it has no new acks to process). **Pipeline stalls silently.** No crash, +no error log at ERROR level, dashboards may show healthy throughput (if the +pipeline has in-memory buffering upstream of the disk buffer). + +The `force_check_pending_data_files` path is the only mechanism for making +progress during a quiet period. It is exercised only when the reader rolls to +the next data file. 
If the buffer is idle (writer done, reader waiting for +acks, no roll happening), the path is never triggered — deletion depends +entirely on `consume_pending_acks() > 0` returning true, which depends on the +finalizer task being alive and having processed the ack futures. + +--- + +## Open Questions + +1. **Does the tokio runtime drain the finalizer task before shutdown?** If + `tokio::Runtime::shutdown_timeout` fires before the `FuturesOrdered` drains, + in-flight acks are lost without any log or error. This is the bridge between + "graceful shutdown" and the startup-over-seeding trigger for #21683. Needs + investigation in the topology shutdown path. + +2. **What is the `bytes_read` value passed to `delete_completed_data_file` for + a file where the reader rolled due to a bad record (the "only partial read" + case)?** If the reader rolled early (reader.rs:1036), `bytes_read` reflects + only what was read before the bad record. The remainder of the file is + charged as `size_delta = metadata.len() - bytes_read`. If a fault left the + file larger than expected (e.g., partial write at the tail bumped the file + size), `bytes_read` could exceed `metadata.len()`, triggering the underflow. + +3. **Is `force_check_pending_data_files` sufficient for the quiet-period case + where the reader is at the very end of the last data file and not rolling?** + If the writer is done, the reader has read all records, all acks arrive, but + the reader is parked in `wait_for_writer()` at reader.rs:1080 (because it + already rolled and found an empty new file), then `notify_writer_waiters()` + from the finalizer wakes the reader, which loops back to + `handle_pending_acknowledgements`, which processes the acks and deletes the + file. This appears correct, but the exact ordering of the wake → check → + delete sequence under Antithesis scheduling pressure is worth exploring. + +4. 
**What happens if `ledger.flush()` (the msync after delete) fails at + reader.rs:549?** The file is already unlinked by this point (reader.rs:546 + ran first). The ledger's `reader_current_data_file` field is not yet updated. + On restart, the reader will try to open a file that no longer exists and fall + through the `NotFound` branch, skipping to the next file. This is the + "handled on restart" path noted in sut-analysis.md §3. Verify under fault + injection that the skip is handled correctly and no events are counted twice. diff --git a/tests/antithesis/scratchbook/properties/buffer-size-within-max.md b/tests/antithesis/scratchbook/properties/buffer-size-within-max.md new file mode 100644 index 0000000000000..49b8a18940b7c --- /dev/null +++ b/tests/antithesis/scratchbook/properties/buffer-size-within-max.md @@ -0,0 +1,314 @@ +--- +slug: buffer-size-within-max +type: Safety / Always +sut_path: lib/vector-buffers/src/variants/disk_v2/ +commit: b7aae737cef5dd37d1445915443a1eb97b584f85 +updated: 2026-05-28 +linked_claims: + - INV-7 in sut-analysis.md §5 ("Buffer never exceeds max_size") + - INV-3 in sut-analysis.md §5 (per-record overshoot caveat) +linked_bugs: + - vectordotdev/vector#21683 (underflow makes this vacuously true via deadlock) +--- + +# Property: buffer-size-within-max + +## Catalog Entry + +**Type:** Safety / Always + +**Property:** The total on-disk buffer size (sum of all `buffer-data-N.dat` file +sizes) never exceeds the configured `max_buffer_size`, except for the documented +single-record overshoot allowance (up to `max_record_size` bytes past the +data-file 128MB limit, per INV-3). + +More precisely, two sub-invariants must hold simultaneously: + +**INV-A (accounting):** The in-memory `total_buffer_size` accurately reflects the +true on-disk content at all stable points (i.e., when no write or delete is +in-progress). `|total_buffer_size - actual_disk_bytes| <= max_record_size`. 
+ +**INV-B (backpressure):** The writer never successfully writes a record that would +cause `total_buffer_size > max_buffer_size`. The writer blocks instead. + +**Invariant:** For all points in time when the writer has completed a write and +the ledger has been updated: + +``` +actual_on_disk_data_bytes <= max_buffer_size + max_record_size +``` + +And the gate enforcing this is never bypassed: + +```rust +// writer.rs:793-798 — can_write_record +self.can_write() && total_buffer_size + potential_write_len <= self.config.max_buffer_size +``` + +**Antithesis Angle:** Fill the buffer to capacity under fault conditions. Verify +the actual on-disk byte total (measured by the workload or a watchdog process +enumerating `.dat` files) remains bounded. Confirm the writer blocks rather than +over-commits. Most importantly, verify that "holds within max" is a meaningful +result by cross-checking that the writer is *still making progress* (ruling out +vacuous satisfaction via deadlock). See the deadlock-vacuity subtlety below. + +**Why It Matters:** Users configure `max_buffer_size` to control disk space usage. +A buffer that silently exceeds this limit causes disk-full errors for the host, +filling OS buffers and potentially starving other processes. The inverse failure +(deadlock via underflow) is equally harmful: the buffer appears not to exceed +the limit because no new data is being written — a false negative. + +--- + +## Source-Level Enforcement + +### Write-side gate + +Two independent checks gate every write: + +**1. 
`can_write_record` (writer.rs:793-798):** + +```rust +fn can_write_record(&self, amount: usize) -> bool { + let total_buffer_size = self.ledger.get_total_buffer_size() + self.unflushed_bytes; + let potential_write_len = + u64::try_from(amount).expect("Vector only supports 64-bit architectures."); + self.can_write() && total_buffer_size + potential_write_len <= self.config.max_buffer_size +} +``` + +- `get_total_buffer_size()` loads the `total_buffer_size` atomic (ledger.rs:276-278). +- `self.unflushed_bytes` is a writer-local counter (not atomic) tracking bytes + written to the `TrackingBufWriter` not yet flushed to the data file. +- The combined sum must not exceed `max_buffer_size` for the write to proceed. + +**2. `is_buffer_full` (writer.rs:993-996):** + +```rust +fn is_buffer_full(&self) -> bool { + let total_buffer_size = self.ledger.get_total_buffer_size() + self.unflushed_bytes; + let max_buffer_size = self.config.max_buffer_size; + total_buffer_size >= max_buffer_size +} +``` + +Called by `ensure_ready_for_write` (writer.rs:1001-1019) which blocks the writer +until `is_buffer_full()` is `false`. + +Note: `can_write_record` uses `<= max_buffer_size` (allows writing right up to the +limit) while `is_buffer_full` uses `>= max_buffer_size` (blocks at the limit). +These are logically equivalent at the boundary but represent two separate call +sites with duplicate logic — a maintainability concern if either diverges. + +### Accounting update on write + +`track_write` (ledger.rs:386-390) calls `increment_total_buffer_size(record_size)`, +where `record_size` is the full serialized record length (header + payload). + +```rust +pub fn track_write(&self, event_count: u64, record_size: u64) { + self.increment_total_buffer_size(record_size); + // ... +} +``` + +The increment uses `fetch_add` with no overflow guard (ledger.rs:282). 
In theory +a record_size could be up to `max_record_size` (128MB); `fetch_add` of 128MB to +a value near u64::MAX wraps — a separate theoretical overflow path (not observed +in practice given max_buffer_size is bounded, but worth noting for completeness). + +### Per-record overshoot (INV-3 / documented) + +`can_write` (writer.rs:789-791) additionally checks `data_file_size < max_data_file_size`. +A writer may write a record that *individually* pushes a single data file beyond +128MB by up to `max_record_size`. This is documented behavior, not a bug. +The effective bound on any one data file is therefore `max_data_file_size + max_record_size`. +The buffer-level bound `max_buffer_size` is enforced independently. + +--- + +## The Deadlock Vacuity Subtlety + +**This is the critical subtlety for this property.** + +When the `total_buffer_size` underflow bug (#21683) fires: + +1. `total_buffer_size` wraps to ~u64::MAX. +2. `is_buffer_full()` returns `true` permanently (writer.rs:993-996). +3. The writer stops writing. No new data reaches any `.dat` file. +4. The actual on-disk buffer size stops growing and eventually shrinks as the + reader drains and deletes files. +5. **`buffer-size-within-max` trivially passes: no data is written, so the bound + is never violated.** + +This is a classic safety-liveness interaction: the safety property is upheld, but +only because the system has deadlocked. A passing `buffer-size-within-max` result +under these conditions is a false negative — it signals "correct behavior" when +the system is actually completely broken. + +**Mitigation:** Always evaluate `buffer-size-within-max` jointly with +`writer-eventually-makes-progress`. The semantically meaningful result requires: + +- `buffer-size-within-max` is `Always` satisfied **AND** +- `writer-eventually-makes-progress` (`Sometimes`) is also satisfied. 
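The joint evaluation can be sketched as a small classification step in the workload. This is an illustrative sketch only: the `Verdict` enum, the `classify` function, and its parameters are hypothetical names, not part of Vector or the Antithesis SDK.

```rust
/// Hypothetical verdict for the joint safety/liveness evaluation.
#[derive(Debug, PartialEq, Eq)]
pub enum Verdict {
    /// Bound held and the writer made progress: a meaningful pass.
    Meaningful,
    /// Bound held only because the writer stopped: suspect the underflow deadlock.
    VacuousPass,
    /// Observed on-disk bytes exceeded the allowed bound.
    BoundViolated,
}

/// Classifies one run from workload-side observations.
/// `max_disk_bytes_observed` is the largest sum of `.dat` sizes seen during the run;
/// `writes_after_last_fault` counts writes that completed after faults stopped.
pub fn classify(
    max_disk_bytes_observed: u64,
    max_buffer_size: u64,
    max_record_size: u64,
    writes_after_last_fault: u64,
) -> Verdict {
    // The allowed bound includes the single-record overshoot allowance.
    let bound = max_buffer_size.saturating_add(max_record_size);
    if max_disk_bytes_observed > bound {
        Verdict::BoundViolated
    } else if writes_after_last_fault == 0 {
        Verdict::VacuousPass
    } else {
        Verdict::Meaningful
    }
}
```

Keeping `VacuousPass` as its own variant means a run in which the bound holds but no writes occur is reported as a distinct finding rather than folded into a pass.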
+ +If `buffer-size-within-max` passes AND `writer-eventually-makes-progress` fails, +the correct diagnosis is: the underflow bug fired and the bound is vacuously held. + +### Antithesis cross-assertion + +Add a combined assertion in the workload or a watchdog: + +``` +after STOP_FAULTS: + assert that: + (1) max(disk_bytes_observed during run) <= max_buffer_size + max_record_size + (2) total_writes_after_last_fault > 0 +``` + +If (1) holds but (2) fails, report both findings together. + +--- + +## What a Genuine Violation Looks Like + +A non-vacuous violation would require the write-side gate to be bypassed. Known +paths: + +**Path 1: Accounting drift on startup.** `update_buffer_size` (ledger.rs:653-697) +adds *all* `.dat` file sizes, including files that the reader has already +processed but whose deletion race-lost against the restart. If `total_buffer_size` +is over-seeded, `can_write_record` blocks early (writer thinks buffer is larger +than it is). This makes the bound hold more conservatively — a false-positive +backpressure, not a violation. + +**Path 2: Foreign `.dat` files.** If a foreign file matching `buffer-data-*.dat` +exists in the data directory, `update_buffer_size` includes its size in +`total_buffer_size`. This inflates the apparent buffer size without any real +data being present. The bound could be violated if the foreign file is *not* +counted in the gate check but *is* on disk — but since the gate uses the same +`total_buffer_size` atomic as the seeding, this path inflates the gate too, so +the bound still holds. However, the false inflation can cause premature blocking. + +**Path 3: Accounting drift from config-reload race.** If the old writer and new +writer both have the same data directory briefly, the new writer's +`update_buffer_size` counts files still being written by the old writer. Both +writers may then attempt to write, potentially exceeding `max_buffer_size`. 
This +requires the advisory lock (`buffer.lock`) to be ineffective intra-process +(which it is on Linux — `fcntl` locks are per-process, not per-`fd`). This is +a live safety risk under config reload. + +--- + +## SUT-Side Instrumentation (MISSING — must be added) + +All Antithesis SDK calls below are absent from the codebase. + +### Assertion 1 — Always: write gate is never bypassed + +```rust +// writer.rs, inside write_record, after can_write_record returns true +// and before the actual write to TrackingBufWriter +let total_buffer_size = self.ledger.get_total_buffer_size() + self.unflushed_bytes; +antithesis_sdk::assert_always!( + total_buffer_size <= self.config.max_buffer_size, + "buffer_size_within_max: total_buffer_size does not exceed max at write time", + &serde_json::json!({ + "total_buffer_size": total_buffer_size, + "max_buffer_size": self.config.max_buffer_size, + "unflushed_bytes": self.unflushed_bytes, + "ledger_total": self.ledger.get_total_buffer_size(), + }) +); +``` + +### Assertion 2 — Always: gate logic is self-consistent + +Verify `is_buffer_full` and `can_write_record` agree at the moment a write +proceeds (they should both say "not full" simultaneously): + +```rust +// writer.rs, inside ensure_ready_for_write, just before returning Ok(()) +// (i.e., when the loop breaks because is_buffer_full() returned false) +antithesis_sdk::assert_always!( + !self.is_buffer_full(), + "buffer_gate_consistent: writer exits wait loop only when not full", + &serde_json::json!({ + "total_buffer_size": self.ledger.get_total_buffer_size(), + "unflushed_bytes": self.unflushed_bytes, + "max_buffer_size": self.config.max_buffer_size, + }) +); +``` + +### Assertion 3 — Unreachable: write proceeds when is_buffer_full is true + +```rust +// writer.rs, in write_record, after can_write_record returned true. +// The assertion must sit inside the branch that should never execute; +// placed unconditionally it would fire on every successful write. +if self.is_buffer_full() { + antithesis_sdk::assert_unreachable!( + "write_while_full: wrote a record while is_buffer_full() was true", + &serde_json::json!({ + "is_full": 
self.is_buffer_full(), + "can_write_record_result": true, // by definition we just passed the gate + }) + ); +} +``` + +This would catch any race where `is_buffer_full` changes between the `ensure_ready_for_write` +exit and the actual `write_record` execution (unlikely given single-writer design, +but defensive). + +### Watchdog process assertion (workload-side) + +A separate watchdog process (not inside Vector) should periodically enumerate +all `buffer-data-*.dat` files in the configured data directory and assert: + +``` +sum(file_sizes) <= max_buffer_size + max_record_size +``` + +This is workload-observable without SUT modification and provides an independent +check that the gate is actually working. + +--- + +## Open Questions + +- **What is the configured `max_buffer_size` in the Antithesis harness?** The + minimum is ~256MB (from the docs/spec). A smaller value makes the buffer fill + faster and the property is exercised more frequently. Recommend using the + minimum for harness efficiency. + +- **Is the single-record overshoot (INV-3) tested?** Write a record whose + serialized size is `max_data_file_size - 1` bytes (nearly 128MB). Verify the + data file exceeds 128MB but the buffer-level `max_buffer_size` is still + respected by blocking the *next* write, not the current one. + +- **Config-reload race: is the intra-process advisory-lock gap actually reachable + in Antithesis?** This requires two Vector topology instances to briefly share + the same data directory. Confirm whether the harness exercises config-reload + scenarios. If so, the per-process `fcntl` lock gap is a live safety issue for + this property. + +- **Does `ANTITHESIS_STOP_FAULTS` actually prevent node kills during the + verification window?** If node kills continue during the post-fault check, + the watchdog process may itself be killed before it can report a violation. + Confirm with Antithesis documentation. 
+ +- **Is the `total_buffer_size` atomic observable to the watchdog via metrics + export?** The `buffer_byte_size` metric is derived from `total_buffer_size` + via the usage reporter. If the watchdog reads this metric rather than measuring + files directly, it will see PR #23561's `saturating_sub` output (which caps at + zero rather than showing u64::MAX) and will miss the underflow. The watchdog + must measure actual file sizes, not the metric. + +- **Relationship to `drop_newest` accounting bug (#24606/#24144):** When + `when_full = drop_newest` fires, events are dropped but `component_discarded_events_total` + does not increment. The buffer size accounting is updated (decrement happens), + but the metric is wrong. Does this affect the `buffer-size-within-max` check? + No — drops reduce the size, they don't violate the upper bound. But the metric + discrepancy means the watchdog should measure files, not metrics. + +- **Overflow on `increment_total_buffer_size`:** `fetch_add` at ledger.rs:282 + has no overflow guard. Is `max_buffer_size + max_record_size < u64::MAX`? + Yes by a wide margin for any practical configuration. Document as out of scope. 
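A workload-side measurement of actual file sizes, as recommended above, might look like the following. This is a hedged sketch: the function name `sum_data_file_bytes` is hypothetical, the `buffer-data-` prefix and `.dat` suffix are taken from the file naming used in this note, and only the Rust standard library is used.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Sums the sizes of all `buffer-data-*.dat` files directly inside `dir`.
/// Files that vanish between listing and `metadata()` (the reader may delete
/// them concurrently) are skipped rather than treated as errors.
pub fn sum_data_file_bytes(dir: &Path) -> io::Result<u64> {
    let mut total: u64 = 0;
    for entry in fs::read_dir(dir)? {
        let entry = entry?;
        let name = entry.file_name();
        let name = name.to_string_lossy();
        // Ignore the ledger, lock file, and anything else in the directory.
        if !(name.starts_with("buffer-data-") && name.ends_with(".dat")) {
            continue;
        }
        match entry.metadata() {
            Ok(meta) if meta.is_file() => total = total.saturating_add(meta.len()),
            Ok(_) => {}
            Err(err) if err.kind() == io::ErrorKind::NotFound => {}
            Err(err) => return Err(err),
        }
    }
    Ok(total)
}
```

A watchdog would call this periodically and compare the result against `max_buffer_size + max_record_size`, independently of the `buffer_byte_size` metric.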
diff --git a/tests/antithesis/scratchbook/properties/buffer-survives-version-upgrade.md b/tests/antithesis/scratchbook/properties/buffer-survives-version-upgrade.md new file mode 100644 index 0000000000000..72950a6c5f610 --- /dev/null +++ b/tests/antithesis/scratchbook/properties/buffer-survives-version-upgrade.md @@ -0,0 +1,310 @@ +--- +slug: buffer-survives-version-upgrade +type: Safety + Liveness / Sometimes(upgrade_readback_ok) + AlwaysOrUnreachable(compat_flag_rejects_correctly) +sut_path: lib/vector-buffers/src/variants/disk_v2/ +commit: b7aae737cef5dd37d1445915443a1eb97b584f85 +updated: 2026-05-28 +--- + +# Property: buffer-survives-version-upgrade + +## Catalog Entry + +**Type:** Safety + Liveness / `Sometimes(upgrade_readback_ok)` (facet A) + +`AlwaysOrUnreachable(compat_flag_handled_correctly)` (facet B) + +**Property:** Two distinct but related invariants: + +**(A) rkyv layout version safety.** Buffer files written by Vector version N +are read back correctly by the same version N (stability across restart). If +the rkyv-archived layout changes between versions (field addition, removal, +reordering — all banned by the struct-level warning in `record.rs`), the read +attempt at version N+1 produces a clean, detected error (`InvalidStructure` +or `InvalidData` in `DeserializeError`) and is never silently interpreted as a +valid record with garbage content. + +**(B) `DiskBufferV1CompatibilityMode` flag correctness.** The +`DiskBufferV1CompatibilityMode` flag (set on every written record by +`get_metadata()`) must be present for `can_decode()` to return `true`. A +record written WITHOUT this flag would be rejected at decode time — a +forward-compat foot-gun if a future encoding scheme stops setting the flag. +Conversely, records written WITH the flag must be accepted. The flag logic +must never silently accept garbage or silently reject valid records. 
+ +**Invariant (A):** `Sometimes(buffer_readback_ok)` — after writing N events, +restarting Vector with the same binary (same rkyv layout), and draining the +buffer, all N events are delivered. This establishes a same-version readback +baseline. Under a simulated format change (custom fault: modify the binary or +the data files), any read of the modified data produces a `DeserializeError`, +never a `RecordStatus::Valid{id}` with wrong content. + +**Invariant (B):** `AlwaysOrUnreachable` at the `can_decode()` call site: a +record accepted by `get_metadata()` is always accepted by `can_decode()` when +read back. `AlwaysOrUnreachable` because incompatible records are a rare/fault +path — any execution must satisfy the invariant, but never-executed is +acceptable. + +**Antithesis Angle (A):** Write events with Vector binary version N; simulate +a format change by modifying `buffer-data-*.dat` files on disk or swapping the +binary (a custom Antithesis fault); restart; assert all reads produce +`DeserializeError` (clean detection), never garbage. Validates that +`try_as_archive` / `check_archived_root` / manual `CheckBytes` properly rejects +changed layouts. + +**Antithesis Angle (B):** Confirm by inspection that every record written with +the current binary has `DiskBufferV1CompatibilityMode` set (since +`get_metadata()` always sets it). Inject a synthetic record without the flag +into a data file; assert `can_decode()` returns `false` and the record is +skipped/rejected cleanly. Assert the inverse is never triggered for valid +records. + +**Why It Matters:** The `Record` struct carries an explicit warning: + +> Do not add/remove/change/reorder fields. Doing so will change the serialized +> representation. This will break things. + +This warning is the only guard against a breaking layout change. There is no +version field in the on-disk record that would allow runtime detection of a +layout mismatch. 
A layout change would silently produce garbage via rkyv's +zero-copy `archived_root` (reading wrong bytes as field values), potentially +passing the manual `CheckBytes` (which validates field types, not semantic +values) and the CRC32C check (computed over the new layout's bytes, now +matching the new checksum). The only signal would be an implausible record ID +or payload — not a detected error. + +The `DiskBufferV1CompatibilityMode` flag is a forward-compat foot-gun: if a +future encoding drops this flag, every existing buffer file becomes +unreadable. This property ensures the flag logic is tested end-to-end. + +--- + +## Code Verification + +### `Record` struct immutability warning (record.rs:36-45) + +```rust +// lib/vector-buffers/src/variants/disk_v2/record.rs:36-45 +/// # Warning +/// +/// - Do not add fields to this struct. +/// - Do not remove fields from this struct. +/// - Do not change the type of fields in this struct. +/// - Do not change the order of fields this struct. +/// +/// Doing so will change the serialized representation. This will break things. +#[derive(Archive, Serialize, Debug)] +``` + +This warning is the only guard. There is no `#[rkyv(version)]` attribute, no +format-version field, and no migration path. + +### rkyv host-endian, version-sensitive layout + +```rust +// record.rs:46-73 (ArchivedRecord layout) +pub struct Record<'a> { + pub(super) checksum: u32, // [u32 native-endian] + id: u64, // [u64 native-endian] + pub(super) metadata: u32, // [u32 native-endian] + #[with(CopyOptimize, RefAsBox)] + payload: &'a [u8], // [ArchivedBox<[u8]>: ptr+len, native-endian] +} +``` + +Fields are native-endian (little-endian on x86-64). The `CopyOptimize, +RefAsBox` combination serializes the payload as an `ArchivedBox<[u8]>`, which +stores a relative pointer and length in the rkyv buffer. 
Any change to the +struct layout (including adding a field, changing `RefAsBox` to another +`With` adapter, or reordering fields) changes the byte positions of all +subsequent fields, so existing files can be misread without any detected error. + +### `try_as_archive` — the deserialization gate (ser.rs:88-95) + +```rust +// lib/vector-buffers/src/variants/disk_v2/ser.rs:88-95 +pub fn try_as_archive<'a, T>(buf: &'a [u8]) -> Result<&'a T::Archived, DeserializeError> +where + T: Archive, + T::Archived: for<'b> CheckBytes<DefaultValidator<'b>>, +{ + debug_assert!(!buf.is_empty()); + check_archived_root::<T>(buf).map_err(Into::into) +} +``` + +`check_archived_root` locates the root value at the end of the buffer (at +`buf.len() - size_of::<T::Archived>()`), then validates the archived value +using `CheckBytes`. If the buffer +layout changed (new field at offset 0 shifts everything), `check_archived_root` +may: + +- Return `CheckArchiveError::CheckBytesError` → `DeserializeError::InvalidData` + (detected, clean error). +- OR interpret the bytes as a valid archived value at a wrong offset — this is + the "silent garbage" path if the raw bytes happen to pass `CheckBytes` + validation. + +The CRC32C check in `verify_checksum` (record.rs:144-155) would catch +most garbage payloads IF the checksum field itself was not shifted into a +position that holds a value that happens to match the CRC of the new layout's +payload. This is unlikely but not impossible for small payloads. + +### Manual `CheckBytes` (record.rs:79-117) + +```rust +// record.rs:79-117 +impl<'a, C: ?Sized> CheckBytes<C> for ArchivedRecord<'a> { ... } +``` + +This is a manual `unsafe` implementation (due to an upstream rkyv ICE, see +the comment). It validates that `checksum`, `id`, `metadata` are valid `u32` +and `u64` values, and that `ArchivedBox<[u8]>` is valid. It does NOT validate +semantic constraints (e.g., that `id` is monotonic, or that `checksum` +actually matches the payload). 
A layout-changed record may pass this check if +the bytes at the expected offsets are valid primitive values. + +### `DiskBufferV1CompatibilityMode` flag (vector-core/event/ser.rs:86-91) + +```rust +// lib/vector-core/src/event/ser.rs:86-91 +fn get_metadata() -> Self::Metadata { + EventEncodableMetadataFlags::DiskBufferV1CompatibilityMode.into() +} + +fn can_decode(metadata: Self::Metadata) -> bool { + metadata.contains(EventEncodableMetadataFlags::DiskBufferV1CompatibilityMode) +} +``` + +`get_metadata()` always returns the `DiskBufferV1CompatibilityMode` flag. +`can_decode()` requires this flag to be present. Every record written at this +commit will have the flag set; `can_decode()` will return `true` for them. + +The foot-gun: if a future version of Vector introduces a new +`EventEncodableMetadataFlags` variant and changes `get_metadata()` to return +only the new flag (not `DiskBufferV1CompatibilityMode`), then all existing +buffer files (which have only `DiskBufferV1CompatibilityMode` set) would fail +`can_decode()` and be rejected. This is a format-incompatibility scenario, +not a bug in the current code — but it is uncovered by any test. + +### `can_decode` call site in reader (reader.rs — the decode gate) + +The `can_decode` result gates whether the payload bytes are passed to +`Encodable::decode`. A `false` result leads to a `RecordStatus::Valid` with a +metadata rejection path, not a clean `FailedDeserialization` — the exact error +path needs confirmation. + +--- + +## Fault Conditions + +| Fault | Effect | +|---|---| +| Restart with same binary | Normal readback — baseline; must always succeed. | +| Binary swap (version upgrade) with rkyv layout change | Data files become unreadable; must be a clean `DeserializeError`, not garbage. | +| Synthetic record with wrong metadata flag injected into `.dat` file | `can_decode()` returns `false`; record rejected; reader rolls forward. 
| +| Synthetic record with correct metadata flag but wrong rkyv layout | `CheckBytes` may or may not detect it; CRC32C is the backstop. | +| `buffer.db` ledger migrated between versions (no format versioning) | Ledger is a raw memory-mapped struct (`LedgerState`); field-order changes break it silently. | + +--- + +## SUT-Side Instrumentation (MISSING — must be added) + +### Assertion 1 — Sometimes: same-version readback succeeds (baseline) + +```rust +// reader.rs, after a successful record decode and CRC validation +antithesis_sdk::assert_sometimes!( + true, // reachability: this path executes + "buffer-readback: record successfully decoded after restart", + &serde_json::json!({ + "record_id": record.id, + "metadata": record.metadata, + }) +); +``` + +### Assertion 2 — AlwaysOrUnreachable: incompatible metadata is cleanly rejected + +```rust +// reader.rs, at the can_decode() check +let can_decode = T::can_decode(metadata); +antithesis_sdk::assert_always_or_unreachable!( + can_decode || /* metadata flag is NOT the expected flag */ true, + "buffer-readback: can_decode returns true for any record written by this binary", + &serde_json::json!({ + "metadata_value": metadata.into_u32(), + "can_decode": can_decode, + }) +); +``` + +### Assertion 3 — Always: DeserializeError path detected, not garbage + +At the `RecordStatus::FailedDeserialization` arm, assert the path is taken +(not `RecordStatus::Valid` with a wrong ID) when a known-bad record is +injected: + +```rust +// record.rs, inside verify_record_archive when returning FailedDeserialization +antithesis_sdk::assert_reachable!( + "buffer-readback: FailedDeserialization reached for injected bad record", + &serde_json::json!({ "error": err.to_string() }) +); +``` + +--- + +## Why Existing Tests Cannot Catch This + +- The model-based proptest uses an in-memory filesystem and does not exercise + the rkyv deserialization path with externally-modified buffers. 
+- No test writes events, swaps or modifies the binary/data files, and then + restarts — this is an upgrade/migration scenario outside the unit-test scope. +- The `DiskBufferV1CompatibilityMode` flag is set on every write in the + current binary; no test ever synthesizes a record without it. +- The manual `CheckBytes` implementation is only validated for correctness + under the current struct layout, not under a changed layout. + +--- + +## Requires a Custom Fault + +Testing facet (A) requires one of: + +1. A custom Antithesis fault that modifies `buffer-data-*.dat` bytes while + Vector is stopped (between shutdown and restart). +2. Two Vector binaries in the harness image — binary N writes, binary N+1 + (with a simulated layout change or flag change) reads. + +Neither is standard in the default Antithesis fault library. The harness must +be explicitly designed for this scenario. + +--- + +## Open Questions + +- Is there any runtime mechanism to detect a rkyv layout mismatch short of + `CheckBytes` and CRC32C? If both checks pass for a layout-changed record + (possible for small payloads), the garbage is delivered as a valid event. + This is the "silent corruption" path that this property is designed to expose. + +- Does `check_archived_root` return `CheckArchiveError::ContextError` (layout + mismatch — the bytes don't form a valid archive) or fall through to + `CheckBytesError` for the layout-changed case? The distinction determines + whether the error is `InvalidStructure` or `InvalidData`, which affects the + reader's recovery behavior. + +- Is the `LedgerState` mmap'd struct (`buffer.db`) versioned? If a Vector + upgrade changes `LedgerState` (e.g., adds a field), the mmap'd file is read + with wrong offsets — silently. This is a separate version-upgrade risk not + covered by the record-level `CheckBytes`. + +- What is the expected behavior when `can_decode()` returns `false`? 
Does the + reader treat the record as unreadable (skip + roll) or as a fatal error + (stop)? The current code path at `reader.rs` needs verification to confirm + `false` → `RecordStatus` rejection → `roll_to_next_data_file`, not panic. + +- Should this property be split into two separate slugs: one for rkyv layout + version safety and one for the `DiskBufferV1CompatibilityMode` flag? The two + facets have distinct fault mechanisms (binary swap vs. flag injection) but + share the same "upgrade readback" narrative. diff --git a/tests/antithesis/scratchbook/properties/config-reload-no-silent-loss.md b/tests/antithesis/scratchbook/properties/config-reload-no-silent-loss.md new file mode 100644 index 0000000000000..cfdf76124cf2d --- /dev/null +++ b/tests/antithesis/scratchbook/properties/config-reload-no-silent-loss.md @@ -0,0 +1,321 @@ +--- +slug: config-reload-no-silent-loss +type: Safety / Always +status: missing +sut_commit: b7aae737cef5dd37d1445915443a1eb97b584f85 +updated: 2026-05-28 +related_issues: + - "vectordotdev/vector#24948 (internal config-reload incident config-reload stall)" + - "vectordotdev/vector PR#24949 (partial fix)" +related_files: + - lib/vector-buffers/src/variants/disk_v2/writer.rs + - lib/vector-buffers/src/variants/disk_v2/ledger.rs + - src/topology/running.rs + - src/topology/test/reload.rs +--- + +# Property: config-reload-no-silent-loss + +## Catalog Entry + +| Field | Value | +|---|---| +| **Type** | Safety / Always | +| **Property** | `assert_always("config_reload_no_silent_loss", "No event accepted into the disk buffer is silently dropped during or after a config reload")` | +| **Invariant** | Every event that received a successful `send` acknowledgement from the disk buffer writer before or during a config reload must either (a) be read and forwarded by the reader, or (b) be explicitly accounted in `buffer_discarded_events_total`. No event may vanish without trace. 
| +| **Antithesis Angle** | Trigger a config reload (SIGHUP or workload-driven) while events are streaming under sustained write load with an active reader. After reload completes and a quiet period elapses, drain the buffer and downstream sink, then assert: `events_in == events_out + events_explicitly_discarded`. | +| **Why It Matters** | This was the direct cause of an internal config-reload incident (#24948). Silent loss on config reload defeats the entire durability guarantee disk buffers provide. | + +## The Bug: `Drop` Calls `close()` But Not `flush()` + +### What the code does + +`BufferWriter::Drop` (writer.rs:1366-1374) calls only `self.close()`: + +``` +impl Drop for BufferWriter { + fn drop(&mut self) { + self.close(); // writer.rs:1372 + } +} +``` + +`close()` (writer.rs:1358-1363) calls only `ledger.mark_writer_done()` and +`ledger.notify_writer_waiters()`. It does NOT call `self.flush()` or +`self.flush_inner(true)`. + +`BufferWriter::flush()` (writer.rs:1336-1340) calls `flush_inner(false)`, which +calls `writer.flush().await?` on the `TrackingBufWriter` (writer.rs:1307-1308). +`Drop` is synchronous and cannot `.await`, so this flush path is structurally +unavailable in `Drop`. + +### The `TrackingBufWriter` and its 256KB buffer + +`TrackingBufWriter` (writer.rs:239-343) holds an internal `Vec<u8>` with +capacity `DEFAULT_WRITE_BUFFER_SIZE` = 256 * 1024 bytes +(common.rs:37). Records are staged into this buffer and only written to the +underlying OS file handle when the buffer fills (auto-flush on overflow, +writer.rs:279-282) or when `.flush().await` is called explicitly +(writer.rs:313-331). + +If fewer than 256KB of records have arrived since the last explicit flush, the +entire pending batch sits in `TrackingBufWriter::buf` at the time of `Drop`. +When `Drop` runs: + +1. `close()` is called — marks the writer done in the ledger, notifies reader. +2.
`TrackingBufWriter` is dropped — its `buf` is freed without ever calling + `inner.write_all(&buf)`. The OS file handle is closed with unflushed data + silently discarded. + +This is not a crash scenario; it is the normal code path whenever the topology +tears down a sink (reload or graceful stop) without the sink having first called +`flush()`. + +### When does `flush()` actually get called before `Drop`? + +The normal write loop (the `BufferSender`-level `send` → writer task loop) +calls `writer.flush().await` after every successful `write_record` call +(topology/channel/sender.rs:86-98: `SenderAdapter::DiskV2` calls +`writer.flush().await`). So as long as the write loop is running, every +dispatched event is flushed before the loop proceeds. + +The hazard window is events that have been **enqueued into the channel feeding +the writer task** but not yet dequeued and processed by the writer — i.e., events +sitting in the mpsc channel between the source/transform and the sink's write +loop — when the detach trigger fires. These are not lost in the buffer; they +are lost upstream. + +More critically for the disk buffer specifically: if the sink task is +mid-batch (e.g., it dequeued a batch from the channel, encoded some records +into `TrackingBufWriter`, but the topology fires the `tripwire`/detach trigger +before the next `flush()` call completes), those staged bytes are dropped with +`Drop`. + +Whether the topology calls `flush()` before dropping the `BufferSender`/writer +on graceful component shutdown vs. config reload is an open question (see +below), but the `Drop` impl itself provides no safety net. + +### Metric drift: `track_dropped_events` charges `byte_size = 0` + +When the reader later detects a gap (records present in the data file according +to the ledger but not actually written), or when the buffer is re-initialized and +`synchronize_buffer_usage` re-seeds accounting from file sizes, there is a +mismatch. 
More directly, if the writer calls `track_dropped_events` +(ledger.rs:526-537) for any reason during reload, the implementation explicitly +passes `byte_size = 0`: + +```rust +pub fn track_dropped_events(&self, count: u64) { + // We don't know how many bytes are represented by dropped events because we never + // actually had a chance to read them... + self.usage_handle + .increment_dropped_event_count_and_byte_size(count, 0, false); // ledger.rs:536 +} +``` + +The comment acknowledges this is a permanent drift: the byte-size accounting +for the dropped events is zeroed, so `buffer_byte_size` gauges will be wrong +for the lifetime of the buffer instance. + +### The advisory-lock gap: per-process, not per-thread + +`load_or_create` (ledger.rs:572-576) uses `fslock::LockFile::try_lock()`. +POSIX `fcntl`/`flock` advisory locks are per-process on Linux: a second open +from the **same process** succeeds even if the first lock is still held. During +a config reload, if the old sink task is still running (being waited on by the +topology rebuild loop, running.rs:677-685 / 688-710) while the new sink task +tries to open the same buffer directory, both will hold "the lock" from the +OS's perspective. The topology's reload logic (`changed_disk_buffer_sinks`, +running.rs:589-601) attempts to wait for the old sink to fully shut down before +starting the new one — but this sequencing is best-effort under tokio's +cooperative scheduler and depends on the detach trigger being wired correctly. +The stall bug in #24948 was precisely a case where the detach trigger was NOT +cancelled for the changing sink, causing the wait to hang indefinitely. +PR #24949 fixed the trigger cancellation for `changed_disk_buffer_sinks`, but the +per-process lock semantics remain — if any future code path reorders the +teardown/startup sequence, the old and new writers can open the same buffer +concurrently with no OS-level protection.
+ +### Finalizer `Arc<Ledger>` retention during reload + +`spawn_finalizer` (ledger.rs:701-710) moves an `Arc<Ledger>` clone into a +tokio task: + +```rust +pub(super) fn spawn_finalizer(self: Arc<Self>) -> OrderedFinalizer<u64> { + let (finalizer, mut stream) = OrderedFinalizer::new(None); + tokio::spawn(async move { + while let Some((_status, amount)) = stream.next().await { + self.increment_pending_acks(amount); + self.notify_writer_waiters(); + } + }); + finalizer +} +``` + +The `Arc<Ledger>` is held for as long as the spawned task lives. The task exits +only when the `OrderedFinalizer` sender side is dropped (which happens when the +`BufferWriter` is dropped and the `finalizer` field goes out of scope). However, +if the writer `Drop` runs before all in-flight `BatchNotifier`s are dropped +(i.e., events that are "in transit" in a sink's delivery pipeline), the +finalizer task remains alive past the writer's lifetime, holding a reference to +the ledger. This means the buffer directory cannot be fully and safely reset until +the finalizer task drains, which happens asynchronously. If the new +writer's `load_or_create` starts before this drain completes, there is a window +of overlapping ledger use (whether this is reachable is an open question below). + +## Antithesis Test Design + +### Fault requirement + +Config reload requires a **custom fault** or workload-driven trigger, not a +built-in Antithesis network/node fault. The two mechanisms available in Vector: + +1. **SIGHUP** — Vector's signal handler (signal.rs:200, signal.rs:218) converts + SIGHUP to `SignalTo::ReloadFromDisk`, which triggers + `reload_config_from_result` (app.rs:382-386) → `topology_controller.reload()` + → `RunningTopology::reload_config_and_respawn()`. This is the production + reload path. +2. **Workload-driven API reload** — Vector's GraphQL/REST API (if enabled) can + trigger a reload programmatically from the test workload container. + +SIGHUP is preferable because it exercises the exact production code path.
+**Flag to Antithesis team: custom signal injection capability needed.** + +### Test scenario + +``` +Setup: + - Vector with a source (e.g., HTTP source or socket) feeding into + a sink with disk buffer (type: disk, when_full: block). + - A downstream sink endpoint (HTTP mock or file sink) that records + every event received. + - A workload driver that: + (a) sends a steady stream of events and tracks every event ID sent, + (b) waits for send acknowledgement from the Vector HTTP source before + recording the event as "accepted", + (c) after N seconds, sends SIGHUP to Vector to trigger reload, + (d) continues sending events through the reload, + (e) after reload completes (detected via health endpoint), enters + a quiet period with no new events, + (f) waits for the downstream sink to drain (buffer empties), + (g) asserts: accepted_event_ids == received_event_ids. + +Antithesis fault injection: + - Concurrent CPU throttle on the Vector process during reload + (widens the window between TrackingBufWriter staging and flush). + - Disk I/O slowdown during reload (lengthens the old sink teardown, + increasing overlap with new sink startup). + - Clock jitter (stretches/shrinks the 500ms fsync window). + +Key assertion point: + - assert_always("no_silent_loss_on_reload", + accepted_count == forwarded_count + explicitly_discarded_count, + {accepted_count, forwarded_count, explicitly_discarded_count, + reload_timestamp}) +``` + +### What to look for in Antithesis output + +- Events with IDs in the "accepted" set that appear in neither the downstream + sink's log nor in any `buffer_discarded_events_total` increment. +- A spike in `buffer_discarded_events_total` at reload time that does not + correspond to a `when_full: drop_newest` event (i.e., the buffer was not + full — the discard was from the unflushed `TrackingBufWriter` data). +- A non-zero delta in `buffer_byte_size` immediately after reload that does + not reconcile with the byte sizes of events known to be in-flight. 
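The drain-time check in step (g) of the scenario can be sketched as a pure function over data the workload driver already records. This is a minimal sketch, not harness code: `ReloadAudit`, `check_no_silent_loss`, and the field names are hypothetical, and it assumes the driver can read the `buffer_discarded_events_total` delta as a plain count (discarded events are counted, not identified by ID).

```rust
use std::collections::BTreeSet;

/// Hypothetical summary of one reload run, as collected by the workload driver.
pub struct ReloadAudit {
    /// IDs the source acknowledged before or during the reload.
    pub accepted: BTreeSet<u64>,
    /// IDs observed at the downstream mock sink after the drain.
    pub received: BTreeSet<u64>,
    /// Delta of `buffer_discarded_events_total` across the reload window.
    pub explicitly_discarded: u64,
}

/// Returns `Ok(())` when every accepted-but-missing event can be explained by
/// the discard counter. Otherwise returns the missing IDs (in ascending order)
/// so they can be attached to the assertion details.
pub fn check_no_silent_loss(audit: &ReloadAudit) -> Result<(), Vec<u64>> {
    let missing: Vec<u64> = audit
        .accepted
        .difference(&audit.received)
        .copied()
        .collect();
    if missing.len() as u64 <= audit.explicitly_discarded {
        Ok(())
    } else {
        Err(missing)
    }
}
```

Using sets rather than raw counts means duplicate deliveries do not mask a loss: an event delivered twice cannot stand in for one that never arrived.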
+ +## Open Questions + +1. **Does the topology caller flush before dropping the writer on config reload?** + The `SenderAdapter::DiskV2` flush path (sender.rs:86-98) is called from the + write loop, not from teardown. The detach trigger (`tripwire`) fires to close + the `rx.take_until_if(tripwire)` stream in the sink task (builder.rs:693). + Once the stream closes, the sink task returns `TaskOutput::Sink(rx)` and the + `BufferSender` (which owns the `Arc>`) is dropped as part + of `rx` going out of scope. There is no explicit `flush()` call at this point. + **Verification needed: trace the drop chain from `TaskOutput::Sink(rx)` to + `BufferWriter::drop` to confirm no flush occurs in between.** + +2. **Is the per-process advisory-lock gap actually reachable under current + topology reload sequencing?** The `changed_disk_buffer_sinks` wait + (running.rs:688-710) awaits the old sink task completing before `buffers` + is returned and the new pieces are built. If the new sink's `load_or_create` + is called strictly after the old task's `await` completes and the old + `BufferWriter` is dropped, the lock file is released before the new open. + However: does tokio guarantee that the spawned finalizer task (which holds + `Arc`) has exited before the new `load_or_create` runs? The finalizer + runs on the tokio runtime and is not awaited during teardown — **a race + exists if the finalizer task has not yet exited when the new writer opens.** + +3. **Does the internal config-reload incident PR #24949 fully close this property?** PR #24949 + fixed the detach-trigger cancellation for `changed_disk_buffer_sinks` (the + stall), and may have addressed some accounting. Whether it fixed the + unflushed-`TrackingBufWriter`-data loss specifically is unclear from the + in-repo test coverage (the existing `topology_disk_buffer_conflict` and + `topology_disk_buffer_config_change_does_not_stall` tests do not assert + zero event loss, only liveness/no-stall). 
**Antithesis can answer this + definitively by checking the loss invariant, not just liveness.** + +4. **What is the actual loss bound?** Up to 256KB (`DEFAULT_WRITE_BUFFER_SIZE`, + common.rs:37) of staged-but-unflushed data per reload. For small events + (e.g., 100-byte log lines), this is ~2,600 events (262,144 / 100 ≈ 2,621, + ignoring per-record framing overhead) silently dropped per reload. + For large events near `DEFAULT_MAX_RECORD_SIZE` (128MB), the auto-flush on + capacity overflow (writer.rs:279-282) means staging is bounded, so the loss + per reload may be smaller in practice. The exact loss depends on the arrival + rate and event size at the moment of the reload tripwire. + +5. **Custom fault requirement flag:** The Antithesis standard fault suite + (network partitions, node kill) does not include SIGHUP or API-driven config + reload. This property requires either (a) the workload driver to issue + SIGHUP/API calls on a schedule, or (b) a custom Antithesis fault that sends + SIGHUP to the Vector process. Confirm with the Antithesis team that + process-signal delivery is supported in the tenant configuration. + +--- + +## SUT-Side Instrumentation (MISSING — must be added) + +All Antithesis SDK calls below are absent from the codebase at this commit. The Antithesis Rust SDK must be added as a dependency. + +**Where to assert `accepted == forwarded + discarded`:** + +1. **`BufferWriter::close()` (writer.rs:1358)** — add an `assert_always!` that `self.unflushed_bytes == 0` at the point `close()` is called. A non-zero value means staged bytes will be silently dropped with the `TrackingBufWriter`. This is the primary loss site on config reload. + + ```rust + antithesis_sdk::assert_always!( + self.unflushed_bytes == 0, + "writer_unflushed_bytes_zero_on_close", + &serde_json::json!({ + "unflushed_bytes": self.unflushed_bytes, + "unflushed_events": self.unflushed_events, + }) + ); + ``` + +2. **`track_dropped_events` (ledger.rs:526)** — `byte_size = 0` is passed unconditionally (confirmed at ledger.rs:536).
Add an `assert_always!` that `count == 0` if the drop is unexpected (i.e., not during a known `when_full: drop_newest` policy invocation), or at minimum emit an `assert_reachable!` so that Antithesis confirms the site is exercised during a reload test and the count can be tracked. + +3. **Workload-level** — after reload quiet period, assert: + + ``` + assert_always("config_reload_no_silent_loss", + accepted_count == forwarded_count + explicitly_discarded_count, + { accepted_count, forwarded_count, explicitly_discarded_count }) + ``` + + where `accepted_count` is events acknowledged by the Vector HTTP source before reload, `forwarded_count` is events received by the downstream mock sink, and `explicitly_discarded_count` is the delta in `buffer_discarded_events_total` during the reload window. + +--- + +### Investigation Log + +#### Does PR #24949 fix the loss or only the stall? + +**Examined:** `src/topology/running.rs:589–710` (config-reload path, `changed_disk_buffer_sinks`, detach/remove-inputs logic), `lib/vector-buffers/src/variants/disk_v2/writer.rs:1358–1374` (`close()` and `Drop`). + +**Found:** PR #24949 (referenced in related_issues as "partial fix") addressed the stall in #24948 by fixing detach-trigger cancellation for `changed_disk_buffer_sinks` — specifically the sequencing of old-sink teardown so the topology does not hang indefinitely waiting for a sink that never receives its detach signal. The `changed_disk_buffer_sinks` path at running.rs:629–668 detaches inputs from changed sinks and waits for the old sink task to complete at running.rs:656–668. This fix ensures the old task actually terminates. + +**Not found:** Any change to `BufferWriter::Drop` (writer.rs:1366–1374) that calls `flush()` before `close()`. `Drop` at this commit still calls only `self.close()` (writer.rs:1372), which calls only `ledger.mark_writer_done()` and `ledger.notify_writer_waiters()` — no `flush_inner` call. 
The `TrackingBufWriter` internal buffer is freed without flushing if any staged bytes remain at drop time. No code added by PR #24949 (or observable at this commit) closes the unflushed-`TrackingBufWriter` loss path. + +**Conclusion:** PR #24949 fixed the stall (detach-trigger cancellation) and partially addressed accounting. The `Drop`-without-flush silent loss path is not fixed at this commit (b7aae737c). Whether PR #24949 also added a pre-drop `flush()` call in the sink task teardown path (rather than in `Drop` itself) could not be confirmed without reviewing the PR diff directly — the topology teardown code at running.rs:656–668 awaits the old task completing, but does not call `writer.flush()` explicitly from that site. Loss is still possible for any bytes staged in `TrackingBufWriter` at the moment the write loop receives the tripwire signal and exits without a final flush. diff --git a/tests/antithesis/scratchbook/properties/corruption-is-detected-and-recovered.md b/tests/antithesis/scratchbook/properties/corruption-is-detected-and-recovered.md new file mode 100644 index 0000000000000..9683a21d657d1 --- /dev/null +++ b/tests/antithesis/scratchbook/properties/corruption-is-detected-and-recovered.md @@ -0,0 +1,147 @@ +# Evidence: corruption-is-detected-and-recovered + +## Property Identification + +**Slug:** corruption-is-detected-and-recovered +**Type:** Reachability / Sometimes(`corruption_detected_and_recovered`) +**Assertion macro:** `assert_sometimes!("corruption_detected_and_recovered", "Corruption was detected by is_bad_read and the reader rolled to the next data file", ...)` + +This property is the fault-injection confirmation counterpart to `no-corrupted-record-delivered`. It is not enough to know that corruption never leaks through; we must also confirm that when Antithesis actually injects a fault (bit-flip, partial write, torn file), the detection and recovery path actually executes. 
Without this reachability assertion, a test run that injects faults but never reaches the recovery branch gives zero confidence in the guard. + +The property was identified by reading the error handling in `BufferReader::next` (reader.rs:1009-1040) and the `is_bad_read` predicate (reader.rs:132-139), and noting that the entire recovery branch (`roll_to_next_data_file`) could be dead code if the fault injection strategy never produces bytes that fail checksum or deserialization in a live read. + +## Code Chain Leading to the Property + +### Detection: `is_bad_read` (reader.rs:132-139) + +```rust +fn is_bad_read(&self) -> bool { + matches!( + self, + ReaderError::Checksum { .. } + | ReaderError::Deserialization { .. } + | ReaderError::PartialWrite + ) +} +``` + +This predicate collects the three error variants that indicate "the data file is untrustworthy from this point on." `ReaderError::Io` and `ReaderError::EmptyRecord` are deliberately excluded: I/O errors might be transient, and empty records are a writer-side logic error, not corruption. + +### Recovery: `roll_to_next_data_file` (reader.rs:694-741) + +Called from `BufferReader::next` at reader.rs:1036 when `e.is_bad_read()` is true: + +```rust +// reader.rs:1031-1039 +Err(e) => { + if e.is_bad_read() { + self.roll_to_next_data_file(); + } + return Err(e); +} +``` + +`roll_to_next_data_file` (reader.rs:694-741): + +- Captures `data_file_start_record_id`, `last_reader_record_id`, `data_file_record_count`, `bytes_read`, and the current data file path from the ledger. +- Adds a data-file deletion marker to `self.data_file_acks` (reader.rs:729-735), which eventually drives `delete_completed_data_file` once all records up to that point are acknowledged. +- Calls `self.reset()` (reader.rs:738), which sets `self.reader = None`, zeroes `bytes_read`, and clears `data_file_start_record_id`. +- Calls `self.ledger.increment_unacked_reader_file_id()` (reader.rs:739), advancing the reader to the next data file. 
+ +After `roll_to_next_data_file`, the reader's next call to `ensure_ready_for_read` will open the next file and continue. The buffer does not halt; it absorbs the loss of the rest of that file's records. + +Also called at reader.rs:1075 in the `Ok(None)` path (end-of-file when `reader_file_id != writer_file_id`): + +```rust +self.roll_to_next_data_file(); +force_check_pending_data_files = true; +continue; +``` + +And at reader.rs:912 in `seek_to_next_record` (during initialization bad-read handling), though that path uses a slightly different flow (checking `reader_file_id > writer_file_id` to avoid deadlock). + +### What `roll_to_next_data_file` Does NOT Do + +It does not immediately delete the corrupted file. Deletion is deferred to `delete_completed_data_file` via the `data_file_acks` queue, which requires the record-level acknowledgements to drain first. A corrupted file that was partially read (some records read and acked before the corruption was hit) will be deleted only after those earlier records are acked. A file that was corrupted on the very first read (`data_file_start_record_id` is `None`) uses `last_reader_record_id` as a zero-length marker (reader.rs:700-704). + +### Key Sub-concern: "Reader Skips Rest of File After First Bad Record" + +When `roll_to_next_data_file` is called, the reader **unconditionally abandons the entire remainder of the current data file**. Any valid records that happen to follow the first bad record in the same file are silently lost. + +For example: data file has records [A, B, CORRUPT, D, E]. Reader reads A, B, hits CORRUPT, calls `roll_to_next_data_file`. Records D and E are abandoned without being read, acknowledged, or counted. They are charged to `decrement_total_buffer_size` via `delete_completed_data_file`'s `size_delta` calculation (reader.rs:521-535), but they are never delivered. 
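To make the [A, B, CORRUPT, D, E] example concrete, here is a toy model of the whole-file roll policy. It is a sketch only: `Slot`, `RollOutcome`, and `read_file_with_roll` are made-up names, and the real reader works on bytes and ledger state rather than an in-memory slice.

```rust
/// One record slot in a toy data file.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Slot {
    Valid(char),
    Corrupt,
}

/// What a single pass over one toy data file produced.
#[derive(Debug, PartialEq)]
pub struct RollOutcome {
    /// Records handed to the caller before any bad read.
    pub delivered: Vec<char>,
    /// Valid records located after the first bad record; never read.
    pub abandoned_valid: usize,
    /// True when the pass ended because of a bad read rather than end-of-file.
    pub rolled_on_bad_read: bool,
}

/// Reads slots in order and stops at the first corrupt one, mimicking the
/// "abandon the remainder of the file" policy described above.
pub fn read_file_with_roll(file: &[Slot]) -> RollOutcome {
    let mut delivered = Vec::new();
    for (idx, slot) in file.iter().enumerate() {
        match slot {
            Slot::Valid(name) => delivered.push(*name),
            Slot::Corrupt => {
                // Everything after the bad record is skipped without inspection.
                let abandoned_valid = file[idx + 1..]
                    .iter()
                    .filter(|s| matches!(s, Slot::Valid(_)))
                    .count();
                return RollOutcome {
                    delivered,
                    abandoned_valid,
                    rolled_on_bad_read: true,
                };
            }
        }
    }
    RollOutcome {
        delivered,
        abandoned_valid: 0,
        rolled_on_bad_read: false,
    }
}
```

A harness-side model like this gives the workload an expected value for "records abandoned per corruption event" to compare against what the system under test actually delivers.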
+ +This is intentional per the code comment at reader.rs:1018-1025: +> "we're not sure the rest of the data file is even valid, so roll to the next file... There's a possibility that the length delimiter we got is valid, and all the data was written for the record, but the data was invalid..." + +But it is a significant data-loss surface: a single corrupted record in a 128MB data file can cause the loss of many subsequent valid records in that file. The Antithesis test should measure how many valid records are abandoned per corruption event. + +### Error Propagation to Caller + +`BufferReader::next` returns `Err(e)` after rolling. The topology adapter (`receiver.rs`) treats this as an unrecoverable error and panics. This means the entire Vector process restarts, not just the buffer reader. On restart, `seek_to_next_record` picks up where the reader file ID was left (now pointing at the next file after the corrupted one), and the pipeline continues. The reachability assertion therefore fires in the read path during the run that encounters the corruption, not on the restart. + +## What Goes Wrong if the Property is Not Exercised + +If Antithesis injects bit-flips or partial writes but the `corruption_detected_and_recovered` assertion is never triggered, the run provides no evidence that: + +1. Fault injection is actually reaching the data files read by the live reader (as opposed to files not yet opened, already deleted, or only read by the mmap fast-path in `seek_to_next_record`). +2. The `is_bad_read` predicate correctly classifies all injected fault signatures (for example, a fault that corrupts the length delimiter in a way that happens to be numerically valid would not cause `PartialWrite` or `Checksum` errors, but might cause `Io` errors, which `is_bad_read` rejects). +3. `roll_to_next_data_file` actually produces a functioning buffer (i.e., the reader successfully opens and reads the next file after rolling). 
+ +Without `assert_sometimes`, a test run with zero `is_bad_read` hits is indistinguishable from a run that intentionally never exercises the corruption path. + +## Timing / Fault Conditions for Antithesis + +- **Bit-flip on payload bytes**: Direct corruption of the CRC-covered region. CRC32C detects this with probability ~(1 - 1/2^32). `Corrupted` error is returned, `is_bad_read()` is true. +- **Bit-flip on checksum field itself**: The stored checksum is wrong, but payload is intact. CRC recomputation produces the correct value; comparison fails. `Corrupted` error. +- **Bit-flip on the rkyv root offset (last 8 bytes of archived record)**: `try_as_archive` reads an incorrect offset, likely accessing out-of-bounds memory → `FailedDeserialization`. `is_bad_read()` is true. +- **Bit-flip on the length delimiter (first 8 bytes of a record)**: `read_length_delimiter` reads a wrong `record_len`. If `record_len` is larger than available data and `is_finalized=true`, returns `PartialWrite`. If `record_len` is valid but points past EOF, returns `PartialWrite`. If `record_len` is small enough that we read the wrong bytes and they fail CRC, returns `Corrupted`. +- **Truncation of the data file mid-record**: With `is_finalized=true`, `try_next_record` detects insufficient bytes and returns `PartialWrite` (reader.rs:263-265, 328-330). `is_bad_read()` is true. +- **File closed/truncated before the reader opens it**: `ensure_ready_for_read` hits an I/O error (`ReaderError::Io`), which `is_bad_read()` does NOT catch. This fault must not be confused with the corruption recovery path. 
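The first two bullets (a flip in the payload, a flip in the stored checksum) can be checked in isolation with a small CRC32C routine. This is a bitwise reference sketch written for illustration; `crc32c`, `flip_bit`, and `is_checksum_mismatch` are hypothetical names, not the buffer's own functions. One refinement to the probability quoted above: a CRC detects every single-bit error, so the roughly 1/2^32 miss rate applies to multi-bit or random garbage rather than to a single flip.

```rust
/// Bitwise CRC32C (Castagnoli), reflected polynomial 0x82F63B78.
pub fn crc32c(data: &[u8]) -> u32 {
    let mut crc: u32 = !0;
    for &byte in data {
        crc ^= u32::from(byte);
        for _ in 0..8 {
            // `mask` is all ones when the low bit is set, zero otherwise.
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (0x82F6_3B78 & mask);
        }
    }
    !crc
}

/// Returns a copy of `data` with exactly one bit flipped.
pub fn flip_bit(data: &[u8], bit_index: usize) -> Vec<u8> {
    let mut out = data.to_vec();
    out[bit_index / 8] ^= 1 << (bit_index % 8);
    out
}

/// True when the stored checksum no longer matches the payload.
pub fn is_checksum_mismatch(stored: u32, payload: &[u8]) -> bool {
    crc32c(payload) != stored
}
```

Flipping any single payload bit, or any bit of the stored checksum, makes `is_checksum_mismatch` return true, which corresponds to the checksum-failure case that `is_bad_read()` accepts.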
+ +## SUT-Side Instrumentation Suggestions (ALL MISSING) + +**Primary assertion** — in `BufferReader::next`, in the `is_bad_read()` branch just before `roll_to_next_data_file()` is called (reader.rs:1035-1036): + +```rust +if e.is_bad_read() { + antithesis_sdk::assert_sometimes!( + true, // reachability: the bad-read branch executed + "corruption_detected_and_recovered: corruption detected by is_bad_read; rolling to next data file", + &serde_json::json!({ + "error_code": e.as_error_code(), + "reader_file_id": self.ledger.get_current_reader_file_id(), + "writer_file_id": self.ledger.get_current_writer_file_id(), + "bytes_read_before_corruption": self.bytes_read, + "records_read_before_corruption": self.data_file_record_count, + }) + ); + self.roll_to_next_data_file(); +} +``` + +The `as_error_code()` method (reader.rs:141-152) already distinguishes the three bad-read variants (`"checksum_mismatch"`, `"deser_failed"`, `"partial_write"`). Antithesis can break down by error code to confirm all fault types are being detected. + +**Secondary assertion** — in `roll_to_next_data_file`, to confirm the rolling logic itself completes (reader.rs:738-740). Because `roll_to_next_data_file` is also reached from the normal end-of-file path (reader.rs:1075), this assertion only confirms a corruption-triggered roll if it is gated on the bad-read branch: + +```rust +// After increment_unacked_reader_file_id(): +antithesis_sdk::assert_sometimes!( + true, // reachability: the roll completed + "reader_rolled_to_next_file_after_corruption: reader advanced to next data file after corruption", + &serde_json::json!({ + "new_reader_file_id": self.ledger.get_current_reader_file_id(), + }) +); +``` + +**Tertiary instrumentation** — measure the data abandoned per roll. In `roll_to_next_data_file`, log the delta between the data file size and `self.bytes_read`. This is already computed implicitly in `delete_completed_data_file` via `size_delta` (reader.rs:521-535). An `assert_sometimes` there with `size_delta > 0` would confirm the "records abandoned after corruption" path is exercised.
+ +## Open Questions + +- **Does the `seek_to_next_record` corruption path (reader.rs:912-934) trigger `roll_to_next_data_file`?** The code at reader.rs:912 calls `self.next()` on a bad-read error during initialization but does NOT call `roll_to_next_data_file` directly; it relies on `next()` to do so. If `next()` does roll on bad reads during the seek loop, the same assertion fires. If not (e.g., if initialization-mode `next()` suppresses the roll), the reachability assertion placement inside `next()` would miss corruption during startup. This matters because the most likely time to encounter a corrupted last record is immediately after a crash, during `seek_to_next_record`. + +- **Does `roll_to_next_data_file` succeed if the next data file does not yet exist?** If the writer has not yet created file N+1 when the reader rolls to it, `ensure_ready_for_read` will block waiting for the writer (reader.rs:774-775). The buffer would not stall permanently (writer will eventually create the file), but the pipeline is paused. Antithesis should verify the pipeline recovers within a reasonable timeout after corruption-triggered roll. + +- **How many valid records are silently abandoned per corruption event?** The decision to roll the entire file on the first bad record is conservative. In a 128MB file with one corruption at byte 1000, nearly the entire file is abandoned. Antithesis should quantify this loss (via the `size_delta` metric) to determine if the policy matches user expectations for a buffer marketed as "at-least-once." (The answer may be that within a process lifetime, corruption = data loss for that file, and users must rely on e2e acks + crash-restart to get the pre-crash unsynced window replayed.) + +- **What happens to `pending_acks` for records that were already read and emitted before the corruption was hit?** If records A and B were emitted from the corrupted file before CORRUPT was found, their `BatchNotifier`s are in-flight. 
When the finalizer drains them, it calls `pending_acks` increment, which eventually causes `record_acks` acknowledgement processing. But the file was rolled: the deletion marker was added with `data_file_record_count` including only A and B, not all records. Does the `data_file_acks` drain correctly when only 2 out of a potential 10 records are marked? Specifically: `OrderedAcknowledgements::add_marker` is called with `Some(2)` records expected (reader.rs:729). If 2 acks come in, the data file is eligible for deletion. This seems correct but the interaction with gap markers in `record_acks` (for the abandoned records D, E) should be verified. diff --git a/tests/antithesis/scratchbook/properties/corruption-skip-loss-bounded.md b/tests/antithesis/scratchbook/properties/corruption-skip-loss-bounded.md new file mode 100644 index 0000000000000..76b96b549232e --- /dev/null +++ b/tests/antithesis/scratchbook/properties/corruption-skip-loss-bounded.md @@ -0,0 +1,96 @@ +# Evidence: corruption-skip-loss-bounded + +**Slug:** corruption-skip-loss-bounded +**Type:** Safety / `Always` (workload-level) +**Status:** Expected VIOLATED by current design (conservative whole-file roll). + +## Why this property exists (user concern) + +Driving concern: *"if the checksum fails we'll skip records."* This property +quantifies and bounds that loss. The sibling property +`corruption-is-detected-and-recovered` only checks that the recovery path +*executes* (a `Sometimes` reachability check). It does not bound how much is +lost. This property does. + +## The mechanism (reader.rs) + +When `BufferReader::next` hits a bad read — `is_bad_read()` true for +`Checksum` / `Deserialization` / `PartialWrite` (reader.rs:148-155) — it calls +`roll_to_next_data_file()` (reader.rs:711-759) and returns the error. 
Rolling: + +- adds a deletion marker covering only the records **actually read** + (`data_file_record_count`, reader.rs:728/743/749), +- `self.reset()` + `increment_unacked_reader_file_id()` (reader.rs:755-756), +- **unconditionally abandons the entire remainder of the current data file.** + +So for a data file `[A, B, CORRUPT, D, E]`, the reader delivers A, B, hits +CORRUPT, rolls, and **D and E are never read, delivered, acked, or counted** — +even though they are perfectly valid records. A single bit-flip near the start +of a 128 MB data file can abandon almost the whole file (the file holds up to +`max_data_file_size` / `min_record_size` records). The code comment +(reader.rs ~1018-1025) calls this intentional ("not sure the rest of the file +is valid"). + +## The invariant we want to test + +Workload-level `Always`: every record that (a) was durably written with a +**valid** checksum and (b) sits after a corrupt record in the same data file is +still eventually delivered. I.e. loss is bounded to the corrupt record itself +(plus any genuinely-unparseable contiguous tail), not the whole-file remainder. + +`Always` is the right type: this is a safety/correctness bound that must hold on +every corruption event, not a reachability or liveness milestone. + +## Antithesis angle + +Write a multi-record data file with known IDs, inject a single bit-flip into an +**early** record's CRC-covered region (not the last), let the reader drain, then +compare delivered IDs against the valid (non-corrupted) IDs. The gap = records +lost purely to the conservative roll. Vary corruption position (first / middle) +and file fullness to measure the loss magnitude. fs-fault or workload-injected +bit-flip; needs the corruption in a *live* read, not a not-yet-opened file. 
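The oracle described in this section can be modeled in a few lines. The sketch below is a toy model, not the reader implementation: `ToyRecord`, `deliver_with_whole_file_roll`, and `lost_valid_ids` are hypothetical names, and a record's `valid` flag stands in for "passes the checksum".

```rust
/// Toy record: an ID plus whether it would pass checksum validation.
#[derive(Debug, Clone, Copy)]
struct ToyRecord {
    id: u64,
    valid: bool,
}

/// Model of the conservative roll: stop reading a file at its first bad
/// record and move on to the next file.
fn deliver_with_whole_file_roll(files: &[Vec<ToyRecord>]) -> Vec<u64> {
    let mut delivered = Vec::new();
    for file in files {
        for record in file {
            if !record.valid {
                break; // roll: the rest of this file is abandoned
            }
            delivered.push(record.id);
        }
    }
    delivered
}

/// Valid records that were never delivered, i.e. loss caused purely by the roll.
fn lost_valid_ids(files: &[Vec<ToyRecord>], delivered: &[u64]) -> Vec<u64> {
    files
        .iter()
        .flatten()
        .filter(|record| record.valid && !delivered.contains(&record.id))
        .map(|record| record.id)
        .collect()
}
```

For the `[A, B, CORRUPT, D, E]` example, with IDs 1 through 5 and record 3 corrupt, the model delivers 1 and 2 and reports 4 and 5 as lost. The property under test expects that lost list to be empty.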
+ +## Why it matters + +The authoritative spec — internal doc *"internal buffer design notes"* +((internal doc id omitted)) — states the disk-buffer data-loss window is **500 ms of unsynced +writes**, and with e2e acks enabled, synced events are **not** lost. But a +corruption-triggered roll discards *synced, valid, not-yet-acked* records far +outside that 500 ms window — a silent contradiction of the stated guarantee for +an "at-least-once" buffer. Even if the conservative roll is accepted, the loss +must at minimum be bounded and counted (see `corruption-skip-loss-is-counted`). + +## SUT-side instrumentation (MISSING) + +`existing-assertions.md`: only the 3 underflow `assert_always!` guards exist in +`lib/vector-buffers` today; nothing here. Suggested: in `roll_to_next_data_file` +compute `abandoned = file_size_remaining_after(bytes_read)` and +`assert_always!(abandoned == 0 || tail_is_unparseable, ...)` — or, more +practically, a workload-level oracle (delivered ⊇ valid-records) since "tail is +genuinely unparseable" is hard to assert SUT-side. + +## Open Questions + +- Is the whole-file roll an accepted product tradeoff, or should the reader + attempt to resync to the next record boundary within the same file and only + abandon the unparseable span? `(needs human input)` — the code comment says + intentional, but the internal spec implies synced events shouldn't be lost. +- Can records be re-found after a corrupt one given the length-delimited format + (read `record_len`, skip, try next), or does a corrupt length delimiter make + intra-file resync unsafe? `(partial: length delimiter is itself CRC-unprotected + framing, so a corrupt delimiter can desync intra-file resync — supports the + conservative roll; a CRC-valid record after a payload-corrupt record could in + principle be recovered)` + +### Investigation Log + +#### Is the whole-file roll an accepted tradeoff or should the reader resync? 
+ +- Examined: `reader.rs` `roll_to_next_data_file` (711-759), `BufferReader::next` bad-read branch + comment (~1018-1025), `is_bad_read` (148-155); internal doc *internal buffer design notes* ((internal doc id omitted)). +- Found: the roll is unconditional and the code comment frames it as intentional ("not sure the rest of the file is valid"). The internal spec's 500ms/synced-not-lost guarantee says nothing about corruption, so the two are not formally reconciled. +- Not found: any product decision record stating whole-file abandonment is the accepted behavior for synced records. Conclusion: `(needs human input)` — owner must confirm intended vs. bug. + +#### Can records be re-found after a corrupt one? + +- Examined: length-delimited framing in `try_next_record` (reader.rs ~241-345), `read_length_delimiter`, CRC coverage in `record.rs`. +- Found: the length delimiter is framing, not under the record CRC; a corrupt delimiter desyncs intra-file scanning, justifying the conservative roll. A payload-corrupt record with an intact delimiter could in principle be skipped past. Tagged `(partial)`. diff --git a/tests/antithesis/scratchbook/properties/corruption-skip-loss-is-counted.md b/tests/antithesis/scratchbook/properties/corruption-skip-loss-is-counted.md new file mode 100644 index 0000000000000..40b39c832077e --- /dev/null +++ b/tests/antithesis/scratchbook/properties/corruption-skip-loss-is-counted.md @@ -0,0 +1,94 @@ +# Evidence: corruption-skip-loss-is-counted + +**Slug:** corruption-skip-loss-is-counted +**Type:** Safety / `Always` +**Status:** Expected VIOLATED (corruption-skip loss is silent on standard metrics). + +## Why this property exists (user concern) + +The user is "seriously concerned about data loss," specifically the +checksum-skip path. Beyond *how much* is lost (`corruption-skip-loss-bounded`), +the second-order danger is that the loss is **silent** — invisible on the +metrics operators actually watch. 
The internal doc *"[technical report] Telemetry +correctness"* ((internal doc id omitted)) names this directly: *"silent data loss going +undetected because `component_discarded_events_total`…"* and lists +"A1. `component_discarded_events_total` blind to buffer drops [HIGH]" (#24606). +This property extends that concern from `drop_newest` to the corruption path. + +## Distinct from `dropped-events-are-counted` + +`dropped-events-are-counted` (#24606) is about `when_full = drop_newest`: the +`BufferSender` drop path increments `buffer_discarded_events_total` but never +`component_discarded_events_total`. That is the **write-side** drop. + +This property is about the **read-side** corruption skip, a *different* code +path with *no* counting at all: + +- `roll_to_next_data_file` (reader.rs:711-759) adds a deletion marker for the + records **read** and abandons the rest. It never calls `track_dropped_events`. +- `track_dropped_events(events_skipped)` (reader.rs:656) is invoked only for + **writer-side gap markers** in the ack-processing loop (reader.rs:596-656), + i.e. data files the *writer* explicitly marked to skip — not reader-side + corruption rolls. +- The abandoned records are therefore charged only to `decrement_total_buffer_size` + via `delete_completed_data_file`'s `size_delta` (reader.rs:521-535, the + reader.rs:524 underflow site) — a *byte-accounting* adjustment, never an + event-loss metric. + +Net: corruption-skipped records increment **neither** +`buffer_discarded_events_total` **nor** `component_discarded_events_total`. +Strictly more silent than #24606. + +## The invariant we want to test + +`Always`: for every record abandoned by a corruption-triggered roll, +`component_discarded_events_total` (and/or `buffer_discarded_events_total`) +increases by the abandoned event count. Equivalently: after a corruption event +that abandons N valid records, `produced - delivered - counted_dropped == 0`. 
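The accounting identity above can be written as a small workload-side check. This is a sketch under assumptions: `LossLedger` and its fields are hypothetical, `delivered` is taken to mean the number of distinct delivered events (duplicates from at-least-once delivery removed first), and `counted_dropped` is whatever the scraped discarded-events counter reports.

```rust
/// Hypothetical workload-side tally for one test window.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct LossLedger {
    produced: u64,
    delivered: u64,       // distinct events seen downstream
    counted_dropped: u64, // scraped from the discarded-events counter
}

impl LossLedger {
    /// Events that are neither delivered nor counted as dropped.
    /// Returns None when the inputs are inconsistent (more accounted than produced).
    fn uncounted_loss(&self) -> Option<u64> {
        self.produced
            .checked_sub(self.delivered)?
            .checked_sub(self.counted_dropped)
    }

    /// produced - delivered - counted_dropped == 0
    fn invariant_holds(&self) -> bool {
        self.uncounted_loss() == Some(0)
    }
}
```

Checked subtraction is used so that a scrape taken at the wrong moment shows up as an explicit inconsistency instead of a wrapped value.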
+ +## Antithesis angle + +Same fault as `corruption-skip-loss-bounded` (early-record bit-flip in a +multi-record file). Oracle scrapes the metrics: assert the discarded-events +counter rose by the number of abandoned records once the roll completes. With +e2e acks the workload knows exactly which IDs were produced+synced and which +were delivered; the difference must equal the counted drops. + +## Why it matters + +A buffer marketed "at-least-once" silently discarding a whole data-file tail on +a single bit-flip — with **zero** signal on the standard component dashboard — +is the worst class of data loss: undetectable. Operators cannot alert on what +isn't counted. This is the read-side companion to the HIGH-severity #24606 +finding in the Telemetry-correctness report. + +## SUT-side instrumentation (MISSING) + +In `roll_to_next_data_file`, after computing the abandoned span, emit a +`ComponentEventsDropped` (reason `"corruption_skip"`) for the abandoned events +and `assert_always!(component_discarded_increased)`. Nothing in +`lib/vector-buffers` references `ComponentEventsDropped` today +(`existing-assertions.md` + grep) — fully missing. + +## Open Questions + +- Are the abandoned records even *counted* internally anywhere (e.g. a debug + log) that could be promoted to a metric, or is the count never computed? + `(partial: roll_to_next_data_file computes data_file_record_count for READ + records only; the abandoned count is not computed at all — the marker uses + bytes_read, so the abandoned event count is never materialized)` +- Should corruption loss count as `component_discarded_events_total` + (intentional vs unintentional flavor) or a new dedicated counter? Intentional + vs error semantics affect which alert fires. `(needs human input)` + +### Investigation Log + +#### Is the abandoned-record count computed internally? 
+ +- Examined: `roll_to_next_data_file` (reader.rs:711-759), `track_dropped_events` callsite (reader.rs:656) and its ack-loop context (596-710), `delete_completed_data_file` size_delta (521-535). +- Found: the roll computes `data_file_record_count`/`data_file_event_count` for records **read**, and the deletion marker carries `(records_read, bytes_read)`. The abandoned (post-corruption) records are never enumerated; `track_dropped_events` fires only for writer-side gap markers in the ack loop, not here. So the abandoned event count is never materialized and no discarded-events metric is emitted. Tagged `(partial)` — confirmed not computed; only byte-accounting via decrement. + +#### Intentional vs error counter semantics? + +- Examined: `BufferEventsDropped` emit path (buffer_usage_data.rs / internal_events.rs), `ComponentEventsDropped` usage (absent in lib/vector-buffers). +- Not found: any existing classification for corruption loss. Conclusion: `(needs human input)` — which counter/flavor is a product/observability decision. diff --git a/tests/antithesis/scratchbook/properties/corruption-skip-record-id-accounting-consistent.md b/tests/antithesis/scratchbook/properties/corruption-skip-record-id-accounting-consistent.md new file mode 100644 index 0000000000000..a2d6ee50b81e1 --- /dev/null +++ b/tests/antithesis/scratchbook/properties/corruption-skip-record-id-accounting-consistent.md @@ -0,0 +1,103 @@ +# Evidence: corruption-skip-record-id-accounting-consistent + +**Slug:** corruption-skip-record-id-accounting-consistent +**Type:** Safety / `Always` (SUT-side) +**Status:** Suspected-violable; cross-cuts the #21683 underflow cluster. + +## Why this property exists + +This is the cross-cutting link between the checksum-skip data-loss path and the +accounting-underflow / monotonicity bugs already in the catalog +(`total-buffer-size-never-underflows`, `record-id-monotonicity-holds`, +`get_total_records` underflow). 
When a corruption roll abandons records, the +ledger's record-ID and buffer-size accounting must stay self-consistent across +the gap, or a data-loss event silently mutates into an accounting-corruption +event (which can then deadlock the writer or report phantom counts). + +## The mechanism (reader.rs) + +`roll_to_next_data_file` (reader.rs:711-759): + +- `data_file_event_count = last_reader_record_id.wrapping_sub(start_id) + 1` + (reader.rs:724-727) — counts only events whose records were **read**. +- The deletion marker carries `(data_file_record_count_read, bytes_read)` + (reader.rs:746-752); abandoned records contribute **nothing**. +- `increment_unacked_reader_file_id()` advances to the next file. + +The abandoned records' IDs are never observed by the reader, so +`reader_last_record_id` is not advanced past them. The next file's first record +has an ID that is **> last_reader_record_id + (abandoned count)** — a gap. + +Two downstream hazards: + +1. **Buffer-size desync → #21683.** When the rolled file is later deleted, + `delete_completed_data_file` decrements `total_buffer_size` by a `size_delta` + derived from `metadata.len() - bytes_read` (reader.rs:521-535). If the file + was truncated, or `bytes_read` disagrees with on-disk length, this is the + exact **reader.rs:524 underflow** — the abandoned-tail bytes make the delta + computation the most likely real trigger for the #21683 wrap. + +2. **Record-ID monotonicity.** On the next read / next restart, + `seek_to_next_record` / `validate_last_write` and the monotonicity guard + (`reader.rs:~480`, "record ID monotonicity violation … serious bug" panic) + expect IDs to advance by exactly the consumed count. An unaccounted gap from + the abandoned span risks tripping the guard (→ process panic → restart loop) + or silently mis-setting `get_total_records`. 
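Both hazards come down to unsigned arithmetic on IDs and sizes. The sketch below uses hypothetical helper names (`events_read`, `id_gap`, `apply_size_delta`) and assumes `last_reader_record_id` is the last event ID covered by the records read, which is how the `+ 1` in the quoted expression reads. It contrasts wrapping with checked subtraction.

```rust
/// Events covered by the records actually read, modeled on the
/// `last_reader_record_id.wrapping_sub(start_id) + 1` expression quoted above.
fn events_read(start_id: u64, last_reader_record_id: u64) -> u64 {
    last_reader_record_id.wrapping_sub(start_id).wrapping_add(1)
}

/// Size of the record-ID gap left by a roll: IDs between the last one read
/// and the first ID of the next data file.
fn id_gap(last_reader_record_id: u64, next_file_first_id: u64) -> u64 {
    next_file_first_id
        .wrapping_sub(last_reader_record_id)
        .wrapping_sub(1)
}

/// Decrement the buffer size by `size_delta`, reporting the shortfall
/// instead of wrapping when the delta exceeds the tracked total.
fn apply_size_delta(total_buffer_size: u64, size_delta: u64) -> Result<u64, u64> {
    total_buffer_size
        .checked_sub(size_delta)
        .ok_or_else(|| size_delta - total_buffer_size)
}
```

With a tracked total of 40 and a delta of 100, the checked form reports a shortfall of 60, while `40u64.wrapping_sub(100)` yields a value just below `u64::MAX`. That wrapped value is the kind of phantom buffer size the underflow guards are meant to catch.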
+ +## The invariant we want to test + +`Always` (SUT-side): after a corruption roll, the ledger satisfies +`next_writer_record_id - reader_last_record_id == on-disk unread records`, the +`total_buffer_size` decrement for the rolled file equals the true remaining +bytes (no underflow), and the monotonicity guard never trips. Stated negatively: +a corruption roll never converts bounded data loss into accounting corruption. + +## Antithesis angle + +Inject corruption mid-file (to force a roll with a non-empty abandoned tail), +then continue reading across the file boundary and across a crash+restart. +Watch the three SUT-side underflow asserts already wired (decrement, +get_total_records, reader.rs:524) plus the monotonicity guard. This is where the +corruption-skip path and the organically-reproduced #21683 (run D0) most +plausibly meet. + +## Relationship to existing properties + +- **Strengthens** `total-buffer-size-never-underflows`: identifies the + corruption-roll abandoned-tail as a concrete real trigger for the reader.rs:524 + underflow (not only external truncation). +- **Strengthens** `record-id-monotonicity-holds`: corruption-roll gaps as a + trigger for the monotonicity panic. +- **Depends-on** the loss being real (`corruption-skip-loss-bounded`). + +## SUT-side instrumentation + +Largely PRESENT: the three underflow `assert_always!` guards added this effort +(`decrement_total_buffer_size`, `get_total_records`, reader.rs:524) already +cover hazard (1). MISSING: an assertion tying the abandoned-record-ID gap to +`reader_last_record_id` advancement in `roll_to_next_data_file` (hazard 2 — the +record-ID gap is currently uninstrumented). + +## Open Questions + +- Does any path advance `reader_last_record_id` to cover abandoned record IDs, + or is the gap permanent until the next file's first read re-anchors it? 
+ `(partial: roll_to_next_data_file does not advance it past abandoned records; + whether the next file's read re-anchors cleanly or trips monotonicity needs a + cross-file read trace)` +- Is the reader.rs:524 underflow reachable purely via corruption-roll + (abandoned tail) without external truncation? If so, #21683 is reachable on + the pure-corruption path, not only the crash/fs-fault path. `(needs human input + / Antithesis run with mid-file corruption)` + +### Investigation Log + +#### Does any path advance reader_last_record_id over abandoned IDs? + +- Examined: `roll_to_next_data_file` (711-759), `reset` (458), `increment_unacked_reader_file_id`, monotonicity guard (~480), `seek_to_next_record` (827+). +- Found: the roll advances the reader *file* id but does not advance `reader_last_record_id` past abandoned records; re-anchoring depends on the next file's first read. Whether that re-anchor is clean or trips the monotonicity guard needs a cross-file read trace under corruption. Tagged `(partial)`. + +#### Is reader.rs:524 underflow reachable purely via corruption-roll? + +- Examined: `delete_completed_data_file` size_delta (521-535, the reader.rs:524 site) and how `bytes_read` vs `metadata.len()` are sourced after a roll. +- Found: the abandoned tail makes `bytes_read < metadata.len()` (the normal partial-read case, which currently does NOT underflow) — underflow needs `bytes_read > metadata.len()` (truncation). So pure corruption-roll alone likely does not underflow; it needs a *concurrent* truncation/fault. Conclusion: `(needs human input / Antithesis run with mid-file corruption + fs fault)` to confirm reachability. 
diff --git a/tests/antithesis/scratchbook/properties/dropped-events-are-counted.md b/tests/antithesis/scratchbook/properties/dropped-events-are-counted.md new file mode 100644 index 0000000000000..c4cd71f197869 --- /dev/null +++ b/tests/antithesis/scratchbook/properties/dropped-events-are-counted.md @@ -0,0 +1,139 @@ +--- +slug: dropped-events-are-counted +type: Safety / Always +status: CURRENTLY VIOLATED (confirmed by #24606/#24144 and direct code inspection) +sut_commit: b7aae737cef5dd37d1445915443a1eb97b584f85 +--- + +# Property 15: dropped-events-are-counted + +## Catalog Entry + +**Type:** Safety / Always + +**Property:** When `when_full=drop_newest` intentionally drops an event because the buffer is +full, that drop is accounted at the component level — specifically `component_discarded_events_total` +— in addition to the buffer-level `buffer_discarded_events_total`. A user monitoring their +pipeline at the component level (the standard observability surface) must be able to observe +the drop. + +**Invariant:** For every event dropped by the `drop_newest` policy: +`component_discarded_events_total{intentional="true"}` increments by the event count of the +dropped item. `buffer_discarded_events_total` incrementing without a corresponding +`component_discarded_events_total` increment is a violation. + +**Current Status: VIOLATED.** Confirmed by GitHub issues #24606 and #24144, and verified by +direct code inspection. The call chain is: + +1. `BufferSender::send` (`topology/channel/sender.rs:231-234`): when `WhenFull::DropNewest` + and `try_send` returns `Some(item)` (item could not be sent), sets `was_dropped = true`. + +2. `BufferSender::send` (`sender.rs:248-257`): if `was_dropped`, calls + `instrumentation.increment_dropped_event_count_and_byte_size(count, size, true)`. + +3. `BufferUsageHandle::increment_dropped_event_count_and_byte_size` + (`buffer_usage_data.rs:193-206`): stores into `self.state.dropped_intentional` atomic + counters. 
Does NOT call `ComponentEventsDropped::emit`. + +4. On the next 2-second reporter tick, `BufferUsageData::report` + (`buffer_usage_data.rs:316-327`): emits `BufferEventsDropped { intentional: true, reason: + "drop_newest", ... }`. + +5. `BufferEventsDropped::emit` (`internal_events.rs:177-243`): increments + `buffer_discarded_events_total` and `buffer_discarded_bytes_total` (and updates buffer gauge). + **It does NOT call `emit(ComponentEventsDropped::<INTENTIONAL> { ... })`.** There is no + reference to `ComponentEventsDropped` anywhere in `lib/vector-buffers/` (confirmed by + grep). + +The result: `buffer_discarded_events_total` increments correctly, but +`component_discarded_events_total` stays 0. Any alert or dashboard based on the component-level +counter, which is the primary observability surface for Vector components, silently misses +all buffer-policy drops. + +The `ComponentEventsDropped` type lives in `lib/vector-common/src/internal_event/component_events_dropped.rs` +and increments `CounterName::ComponentDiscardedEventsTotal`. It is used by sinks and +transforms, but not by the buffer layer. The fix would be to add a call to +`emit(ComponentEventsDropped::<INTENTIONAL> { count: ..., reason: "drop_newest" })` inside +`BufferEventsDropped::emit` or inside `BufferUsageData::report` at the point where +`dropped_intentional` is emitted. + +**Antithesis Angle:** + +Configure a pipeline with a disk buffer using `when_full: drop_newest` and a slow/paused +downstream (modeled as backpressure via Antithesis network/CPU fault injection): + +1. Write events faster than the buffer can drain (by pausing the downstream sink or making it + very slow). +2. Assert, via workload-side metric scraping, that when `buffer_discarded_events_total` + increments (i.e. drops are occurring), `component_discarded_events_total` also increments + by at least the same amount. +3. 
The invariant to assert in the Antithesis workload: + `component_discarded_events_total >= buffer_discarded_events_total` at any stable point + (allowing for the 2-second reporting lag of the buffer metrics reporter). + +SUT-side: add an `assert_always!` inside `BufferUsageData::report` or `BufferEventsDropped::emit` +that fires after `buffer_discarded_events_total` is incremented to assert that +`component_discarded_events_total` has also been incremented. Alternatively, add the missing +`emit(ComponentEventsDropped...)` call and add an `assert_reachable!` to confirm the path is +exercised. + +The 2-second reporter lag means the workload assertion must be written as: "eventually (within +a bounded window after drops stop), both counters match." Antithesis's time-control makes +this straightforward. + +**Why It Matters:** + +This is a blind spot in the primary observability surface. Vector operators monitor +`component_discarded_events_total` to detect data loss; they may not know about or monitor the +lower-level `buffer_discarded_events_total` counter. A pipeline silently dropping events under +backpressure shows 0 on the component dashboard while data is being lost. This is a known bug +(#24606, #24144) that appears to remain unaddressed as of the current commit +(`b7aae737c`). The Antithesis property will both confirm the bug is present and provide a +regression test once fixed. + +## Open Questions + +1. **Is the fix intentionally deferred or just missed?** Issues #24606 and #24144 are open; + there is no linked PR fixing `BufferEventsDropped::emit`. Confirm with the owning team + before marking this as "known-unfixed" versus "recently fixed but not yet landed." + +2. **Should `component_discarded_events_total` equal or only be bounded by + `buffer_discarded_events_total`?** If the same event is counted at multiple buffer stages + (e.g. 
overflow chain), the component counter might be expected to equal the total across all + stages, not just the per-stage buffer counter. Clarify the intended semantics before writing + the exact assertion. + +3. **Does the 2-second reporting lag make the Antithesis assertion flaky?** The buffer metrics + reporter ticks every 2 seconds (`buffer_usage_data.rs:405`), so `buffer_discarded_events_total` + is not real-time — it is batched. If the workload checks immediately after injecting + backpressure, both counters may be 0 and the test passes vacuously. The workload must + either wait for a reporter tick or scrape after a delay. Antithesis's deterministic scheduling + makes this tractable if the tick interval is modeled. + +4. **Does `drop_newest` apply to disk buffers at all, or only to in-memory buffers?** The disk + buffer writer's `try_write_record` is called from `SenderAdapter::try_send` + (`sender.rs:69-83`), which is called from `BufferSender::send` for `DropNewest`. The disk + buffer writer's `try_write_record` returns `Some(item)` when full (the item cannot be + written). Confirm this code path is actually reachable for disk buffers, not only for + `LimitedSender` in-memory variants. If disk-buffer `try_write_record` never returns the + item (always blocks or errors), the `was_dropped` branch may be unreachable for disk buffers. + +--- + +### Investigation Log + +#### Is `drop_newest` reachable for the disk-buffer variant? + +**Examined:** `lib/vector-buffers/src/variants/disk_v2/writer.rs:1166–1178` (`try_write_record` and `try_write_record_inner`). + +**Found:** `try_write_record_inner` at writer.rs:1175–1178 checks `self.is_buffer_full()` at the top: if the buffer is full it immediately returns `Ok(Err(record))` — i.e., it returns the item back to the caller. `try_write_record` at writer.rs:1166–1168 maps this to `Ok(Some(record))`. 
This return value propagates up to `SenderAdapter::try_send` which sets `was_dropped = true`, triggering the `increment_dropped_event_count_and_byte_size` call. The `drop_newest` path is therefore reachable for disk buffers whenever `is_buffer_full()` returns true. + +**Conclusion:** Confirmed reachable for the disk-buffer variant. The missing `ComponentEventsDropped` emission affects disk buffers, not only in-memory variants. + +#### Is there a partial fix on another branch? + +**Examined:** `lib/vector-buffers/src/variants/disk_v2/` (whole directory) via grep for `ComponentEventsDropped`. + +**Not found:** No call to `ComponentEventsDropped` anywhere in `lib/vector-buffers/` (confirmed by grep returning no results). No evidence of a partial fix or in-progress work at commit b7aae737c. Issues #24606 and #24144 remain open and unlinked to any merged PR at this commit. + +**Conclusion:** No partial fix is present at this commit. The missing `component_discarded_events_total` increment is an unaddressed gap as of b7aae737c. diff --git a/tests/antithesis/scratchbook/properties/durable-unacked-events-survive-crash.md b/tests/antithesis/scratchbook/properties/durable-unacked-events-survive-crash.md new file mode 100644 index 0000000000000..359d63dd7a8a2 --- /dev/null +++ b/tests/antithesis/scratchbook/properties/durable-unacked-events-survive-crash.md @@ -0,0 +1,159 @@ +# Property: durable-unacked-events-survive-crash + +## Catalog Entry + +**Type:** Safety / Always + +**Property:** Every event that was durably synced (fsync'd) to a data file and +not yet acknowledged is readable after an ungraceful crash + restart. No synced +event is skipped or lost by recovery. Data-loss is bounded to the documented +≤500ms unsynced window. + +**Invariant:** Let `S` be the set of events whose writes were confirmed durable +by a completed `sync_all` call (data file fsync) and a completed ledger `flush` +(msync) before the kill signal. 
After restart and full recovery +(`from_config_inner` returns), draining the buffer to the quiet-period boundary +must yield every event in `S` at least once. No event in `S` may be silently +absent. + +**Antithesis Angle:** + +1. Workload writes events with monotonically increasing unique IDs. It only + marks an event as "durably written" (adds its ID to the expected-set `S`) + after receiving an application-level confirmation that an fsync window has + closed — concretely, after the workload observes that an e2e ack has come + back for an event written ≥500ms ago (demonstrating that at least one fsync + cycle has elapsed and the data is committed), OR the workload uses an + out-of-band fsync signal (see Open Question OQ-1). +2. Antithesis injects a SIGKILL at an arbitrary point — before, during, or + after the 500ms fsync window. +3. Vector restarts. The workload waits for the quiet period (no new events + produced, buffer drains to empty). +4. The workload computes `S_delivered` (set of IDs that came out downstream) + and asserts `S ⊆ S_delivered` (at-least-once; duplicates are allowed and + expected). +5. Events written within the 500ms unsynced window before the kill are + explicitly excluded from `S` — their loss is within the documented contract. + +**Why It Matters:** This is the core durability promise marketed to customers: +"data synchronized to disk will not be lost if Vector crashes." If a synced +event is lost silently during recovery, the product's fundamental safety +guarantee is violated. 
The 500ms window is documented; what is not documented +(and not currently tested) is the possibility that even synced data is lost due +to: + +- A crash between the data `sync_all` and the ledger `msync` (two separate + non-atomic syscalls: `writer.rs:1314` `writer.sync_all()` then + `ledger.rs:507-508` `self.state.get_backing_ref().flush()`), leaving the + ledger lagging the data — handled by `validate_last_write` `Ordering::Less` + path (`writer.rs:922-944`), which fast-forwards the ledger. If this path has + a bug, synced data silently disappears. +- The `validate_last_write` `Ordering::Greater` path (`writer.rs:910-919`) + logs "Events have likely been lost" and rolls to the next file — the correct + detection path, but the roll-over leaves an intentional gap that should not + include synced data. +- `update_buffer_size` at startup (`ledger.rs:653-698`) sums `.dat` file sizes + and seeds `total_buffer_size`. If this overseeds relative to what + `seek_to_next_record` will decrement, the underflow path (#21683) can be + triggered and the writer deadlocks, causing zero more events to be delivered + — indistinguishable from loss from the caller's perspective. 
+ +**Crash Windows (code-precise):** + +| Window | Code location | Risk | +|--------|--------------|------| +| Write committed to page cache, `sync_all` not yet called | `writer.rs:1308` `writer.flush()` succeeded; `writer.rs:1314` `writer.sync_all()` not reached | Loss is expected and in-contract (≤500ms) | +| `sync_all` done, ledger `flush` not done | `writer.rs:1314` done; `ledger.rs:507` not done | Ledger lags data → `Ordering::Less` recovery path fast-forwards; synced data must survive | +| Ledger `flush` done; file-rotation increment not yet done | `ledger.rs:507` done; `writer.rs:1138` `increment_writer_file_id` not reached | File ID in ledger still points to old file; recovery opens the same file; data must be re-readable | +| Kill inside `delete_completed_data_file` | `reader.rs:546` unlink done; `reader.rs:548` `increment_acked_reader_file_id` and `ledger.flush` not done | Ledger still points at a deleted file; handled by `NotFound`→skip on restart; relevant unacked events in that file survive on the next undeleted file or are in-ack-flight | + +**Recovery Branches Exercised:** + +- `validate_last_write` `Ordering::Less` (ledger lags data): `writer.rs:922-944` + — fast-forwards `next_record_id` using `increment_next_writer_record_id`. +- `validate_last_write` `Ordering::Greater` (data lags ledger): `writer.rs:910-919` + — emits error log, sets `should_skip_to_next_file = true`. +- `seek_to_next_record` fast-path (different reader/writer file IDs): + `reader.rs:840-898` — mmap-validates last record of each reader file and + deletes already-acked files. +- `seek_to_next_record` slow-path: `reader.rs:904-938` — reads records via + `next()` until `last_reader_record_id` matches ledger. + +**Fault Requirements:** Node-termination faults (SIGKILL) required. These are +often disabled by default in Antithesis tenants — confirm enabled. CPU +throttling (stretching the 500ms window beyond its nominal boundary) is a +useful secondary lever. 
+ +**Antithesis SDK Assertion (SUT-side, to be added):** + +```rust +// In validate_last_write, after Ordering::Less fast-forward: +antithesis_sdk::assert_always!( + record_next >= ledger_next, + "validate_last_write: fast-forwarded ledger to data file; no synced data lost", + json!({ "ledger_next": ledger_next, "record_next": record_next }) +); + +// In seek_to_next_record, after returning Ok(()): +antithesis_sdk::assert_reachable!( + "seek_to_next_record completed after crash recovery" +); +``` + +**Workload-level set-difference check:** +The workload maintains a set `DURABLE` (event IDs confirmed synced before +kill) and a set `DELIVERED` (IDs received downstream post-restart). After +quiet period: assert `DURABLE.difference(DELIVERED).is_empty()`. Duplicates +from `DELIVERED` that are not in `DURABLE` are acceptable (at-least-once). + +--- + +## Open Questions + +**OQ-1 (Critical): How does the workload establish "durably written" without +access to Vector internals?** +The 500ms fsync window makes it hard for an external workload to know which +events were synced before the kill. Options: + +- (a) Rely on e2e acks: an event is considered durable if it was acked by the + downstream sink before the kill (i.e., it completed the full + write→read→deliver→ack→delete cycle). This is conservative but correct. +- (b) Instrument Vector to emit a structured log line after each `sync_all` + completes (noting the `writer_next_record_id` at that point), and have the + workload parse it to determine the durable frontier. More precise. +- (c) Use a configurable `flush_interval=0` (force fsync on every flush) so + every write is immediately durable; then all written events are in `S` and + only the final partial write before the kill is excluded. Cleanest but + changes the production code path. +Option (c) is recommended for initial testing; option (b) for production- +representative timing. 
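A sketch of how the set-difference check and option (b) could fit together. It assumes that event IDs follow record-ID order and that the workload has already parsed a durable frontier (the `writer_next_record_id` reported after the last completed `sync_all`) from a structured log line. That log line does not exist in Vector today, and `durable_set` and `missing_durable` are hypothetical helper names.

```rust
use std::collections::BTreeSet;

/// IDs strictly below `durable_frontier` are treated as synced before the kill.
/// Everything at or above the frontier falls inside the unsynced window and is
/// excluded from the expected set.
fn durable_set(written: &BTreeSet<u64>, durable_frontier: u64) -> BTreeSet<u64> {
    written.range(..durable_frontier).copied().collect()
}

/// Returns the IDs that were durable but never delivered, in ascending order.
/// The property holds only when this is empty. Extra IDs in `delivered`
/// (replays, or events from the unsynced window that happened to survive)
/// are ignored.
fn missing_durable(durable: &BTreeSet<u64>, delivered: &BTreeSet<u64>) -> Vec<u64> {
    durable.difference(delivered).copied().collect()
}
```

Keeping the frontier as a single number means the workload does not need to track per-event sync state; it only needs the most recent frontier it saw before the kill.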
+ +**OQ-2: Does `validate_last_write` `Ordering::Greater` path always correctly +skip to a clean state, or can it produce a gap that overlaps synced records?** +The `Ordering::Greater` case emits an error log and sets `should_skip = +true` (`writer.rs:983-986`). This causes `reset()` + `mark_for_skip()` and +defers the actual skip to the next `ensure_ready_for_write`. If the writer +rolls to the next file and there are valid synced records in the old file that +the reader hasn't yet read, are those records still accessible? They should be +(the old file is not deleted), but the gap in `writer_next_record_id` means +the reader might interpret them as a monotonicity violation — verify against +`reader.rs:480-484` `MonotonicityViolation` panic. + +**OQ-3: Is the data `sync_all` and ledger `msync` ordered (data first, ledger +second) or can they be reordered under the Tokio executor?** +`flush_inner` at `writer.rs:1314` calls `writer.sync_all().await` then +`self.ledger.flush()` (synchronous msync). The async await creates a yield +point between them. The Antithesis scheduler can exploit this yield to inject a +kill between the two, which is the exact scenario we want. Confirmed this is +reachable. + +**OQ-4: Does `BufferWriter::drop` (`writer.rs:1371-1374`) call `flush()` before +`close()`?** +Reading the source: `Drop::drop` only calls `self.close()` (which marks +`writer_done` and notifies) — it does NOT call `flush()`. On graceful topology +shutdown the caller is expected to call `flush()` first. If it does not, up to +256KB of `TrackingBufWriter` data plus any unsynced page-cache data can be lost +even on clean shutdown. This is the silent-loss-on-config-reload vector (#24948). +For this property, it means the "graceful shutdown is fully lossless" claim is +only true if the topology's shutdown path calls `writer.flush()` explicitly +before dropping — confirm. 
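OQ-4 can be illustrated with a toy model. `ToyWriter` below is hypothetical and is not `BufferWriter`; it only mirrors the shape described above, where `Drop` calls `close()` and nothing else, so bytes still held in the in-memory buffer never reach the backing store unless the caller flushes first.

```rust
use std::sync::{Arc, Mutex};

/// Shared stand-in for the data file on disk.
type Disk = Arc<Mutex<Vec<u8>>>;

struct ToyWriter {
    /// Stands in for the in-memory write buffer (256KB in the real writer).
    buffered: Vec<u8>,
    disk: Disk,
    closed: bool,
}

impl ToyWriter {
    fn new(disk: Disk) -> Self {
        ToyWriter { buffered: Vec::new(), disk, closed: false }
    }

    /// Writes land in memory only.
    fn write(&mut self, bytes: &[u8]) {
        self.buffered.extend_from_slice(bytes);
    }

    /// Moves buffered bytes to the backing store.
    fn flush(&mut self) {
        self.disk.lock().unwrap().extend_from_slice(&self.buffered);
        self.buffered.clear();
    }

    /// Marks the writer done; deliberately does not flush.
    fn close(&mut self) {
        self.closed = true;
    }
}

impl Drop for ToyWriter {
    fn drop(&mut self) {
        // Mirrors the behaviour described in OQ-4: close only, no flush.
        self.close();
    }
}

/// Writes one event, optionally flushes, drops the writer, and reports how
/// many bytes reached the backing store.
fn bytes_on_disk_after(flush_before_drop: bool) -> usize {
    let disk: Disk = Arc::new(Mutex::new(Vec::new()));
    let mut writer = ToyWriter::new(Arc::clone(&disk));
    writer.write(b"event-1");
    if flush_before_drop {
        writer.flush();
    }
    drop(writer);
    let len = disk.lock().unwrap().len();
    len
}
```

In this model `bytes_on_disk_after(false)` leaves the store empty, which is the behaviour the graceful-shutdown test should rule out for the real topology shutdown path.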
diff --git a/tests/antithesis/scratchbook/properties/every-written-event-eventually-delivered.md b/tests/antithesis/scratchbook/properties/every-written-event-eventually-delivered.md new file mode 100644 index 0000000000000..6f41bcfa09b82 --- /dev/null +++ b/tests/antithesis/scratchbook/properties/every-written-event-eventually-delivered.md @@ -0,0 +1,175 @@ +# Property: every-written-event-eventually-delivered + +## Catalog Entry + +**Type:** Liveness / Sometimes (at-least-once) — `Sometimes(all_produced_delivered)` + +**Property:** With end-to-end (e2e) acks enabled, every event accepted by the +source (written into the disk buffer) is eventually delivered downstream at +least once across crashes. Duplicates are allowed; silent loss is not. This is +the at-least-once contract the product sells. + +**Invariant:** Let `PRODUCED` be the set of unique IDs of all events the +workload submitted to Vector's source. After faults and a quiet period +(no new events produced, no in-flight acks), every ID in `PRODUCED` must appear +at least once in `DELIVERED` (the set of IDs observed at the downstream sink). +`|DELIVERED| ≥ |PRODUCED|` is expected (duplicates from crash+replay); the +violation is any ID in `PRODUCED \ DELIVERED`. + +**Antithesis Angle:** + +1. Workload assigns each event a globally unique ID (e.g., a monotonic counter + embedded in the event payload). It records every submitted ID in `PRODUCED`. +2. The downstream sink (a workload-controlled stub or a log sink with + structured output) records every received ID in `DELIVERED`. +3. Antithesis injects faults at arbitrary timing: SIGKILL during write, during + fsync, during read, during ack, during file deletion, during rotation. + Vector restarts after each kill and resumes. +4. After the quiet period (workload stops producing, Vector drains to empty, + writer is closed), the workload compares `PRODUCED` and `DELIVERED`. +5. 
`Sometimes(all_produced_delivered)` fires when `PRODUCED ⊆ DELIVERED` is + reached — i.e., the system successfully delivered all events at least once + in at least one timeline explored by Antithesis. +6. A workload-level hard assertion (`assert_always`) fires if + `PRODUCED \ DELIVERED` is non-empty after the quiet period — this is the + primary falsification signal. + +**Why It Matters:** This is the product's headline durability guarantee. A +customer enabling disk buffers + e2e acks is explicitly opting into +at-least-once delivery. If even one event is silently dropped across a crash, the +contract is broken. Silent loss is the hardest failure mode to detect in +production (no error, no alert, dashboards may show normal throughput). Known +silent-loss paths: + +- Events in the `TrackingBufWriter` 256KB in-memory buffer on crash (not yet + page-cache flushed): in-contract loss for events not yet synced. +- Events synced to data file but not yet acked: must survive crash via replay. + This is the primary liveness test. +- Events synced to a data file that was unlinked before the kill, with the + ledger flush not yet done: the deleted-file-before-ledger-msync window + (`reader.rs:546` unlink, `reader.rs:548-549` ledger flush). On restart the + reader file ID in the ledger still points at the now-deleted file; the code + handles this via NotFound→advance. If the events in that file were not yet + acked, they are genuinely lost. This is the most serious latent loss path for + this liveness property. +- `BufferWriter::drop` does not call `flush()` (`writer.rs:1371-1374`): on + graceful shutdown that skips an explicit flush, up to 256KB is lost silently. +- Sink-error acks: `spawn_finalizer` at `ledger.rs:701-709` discards + `_status`, so `Errored`/`Rejected` delivery still credits the ack. This + means a downstream error causes silent loss even with e2e acks nominally on. + +**Fault Requirements:** Node-termination faults (SIGKILL) required. Confirm +enabled. 
The following fault sequences are especially valuable: + +- Kill during the `delete_completed_data_file` window (`reader.rs:546-549`): + unlink done, ledger flush not done. +- Kill during the page-cache-write-to-fsync window (≤500ms): tests which + events are in-contract vs. out-of-contract loss. +- Kill during file rotation: `ensure_ready_for_write` partial rotation + (`writer.rs:1047-1154`). +- CPU throttle: stretches the 500ms window, increasing the expected-loss set. + +**Workload Implementation Notes:** + +The key design decision for this property is how to handle the `PRODUCED vs +DELIVERED` comparison when duplicates are expected. The recommended approach: + +``` +PRODUCED: Set -- all IDs submitted to Vector source +DELIVERED: MultiSet -- all IDs seen at downstream sink (may repeat) +DELIVERED_SET: Set = unique(DELIVERED) + +-- After quiet period: +missing = PRODUCED.difference(DELIVERED_SET) +assert missing.is_empty() // at-least-once: every produced ID must appear + +-- Count duplicates (expected; used to verify replay, not assert on): +duplicate_count = |DELIVERED| - |DELIVERED_SET| +log("duplicate deliveries observed: {}", duplicate_count) +``` + +Dedup responsibility is at the workload level — the downstream sink deduplicates +by ID before any downstream business logic, matching the stated contract +("downstream must dedup"). 
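The comparison described above can be written as a small Rust helper. This is a sketch under the assumption that event IDs are `u64` values and that the sink stub hands the workload every delivered ID in arrival order; `DeliveryReport` and `compare` are hypothetical names.

```rust
use std::collections::{HashMap, HashSet};

/// Result of comparing what was produced with what the sink observed.
struct DeliveryReport {
    /// Produced IDs that never appeared downstream (must be empty), sorted.
    missing: Vec<u64>,
    /// Number of deliveries beyond the first for any ID (expected after replay).
    duplicate_count: usize,
}

/// `produced` is the set of submitted IDs; `delivered` is the raw multiset of
/// IDs seen at the sink, repeats included.
fn compare(produced: &HashSet<u64>, delivered: &[u64]) -> DeliveryReport {
    // Count occurrences so duplicates can be reported without affecting the check.
    let mut counts: HashMap<u64, usize> = HashMap::new();
    for id in delivered {
        *counts.entry(*id).or_insert(0) += 1;
    }

    // At-least-once: every produced ID must appear at least one time.
    let mut missing: Vec<u64> = produced
        .iter()
        .copied()
        .filter(|id| !counts.contains_key(id))
        .collect();
    missing.sort_unstable();

    // |DELIVERED| - |unique(DELIVERED)|
    let duplicate_count = delivered.len() - counts.len();

    DeliveryReport { missing, duplicate_count }
}
```

Reporting `duplicate_count` separately keeps the hard assertion limited to `missing`, while still letting a run confirm that replay actually happened.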
+ +**Antithesis SDK Assertions (SUT-side, to be added):** + +```rust +// In handle_pending_acknowledgements, after all acks processed: +antithesis_sdk::assert_sometimes!( + self.ledger.get_total_buffer_size() == 0, + "buffer drained to empty after quiet period", + json!({ "total_buffer_size": self.ledger.get_total_buffer_size() }) +); + +// In spawn_finalizer closure (ledger.rs:703-707), instrument the discarded status: +antithesis_sdk::assert_always!( + matches!(status, BatchStatus::Delivered), + "all acked events were successfully delivered (not errored/rejected)", + json!({ "batch_status": format!("{:?}", status) }) +); +// NOTE: The above will fail under sink errors — this is intentional; it surfaces +// the known silent-loss bug (INV-9 in sut-analysis.md). +``` + +**Workload-level milestone assertion:** + +```rust +antithesis_sdk::assert_sometimes!( + produced_set.difference(&delivered_set).next().is_none(), + "all produced events eventually delivered (at-least-once contract satisfied)", + json!({ + "produced_count": produced_set.len(), + "delivered_count": delivered_set.len(), + }) +); +``` + +--- + +## Open Questions + +**OQ-1: Does the topology's shutdown path call `writer.flush()` explicitly +before dropping the writer?** +`BufferWriter::drop` calls only `close()` (marks writer done + notifies). If +the topology calls `drop` without a preceding `flush()`, in-buffer events are +lost silently. This is specifically the internal config-reload incident vector (#24948). The +liveness property must be tested with both graceful shutdown (should see zero +loss) and SIGKILL (at-most-500ms-unsynced loss). If graceful shutdown also +loses events, that is a higher-severity bug. + +**OQ-2: Does the `OrderedFinalizer` task drain before the tokio runtime shuts +down on SIGKILL?** +The finalizer (`ledger.rs:701-709`) is a `tokio::spawn`'d task. On SIGKILL the +entire process dies — the finalizer task does not get to drain. 
In-flight +`BatchNotifier` handles that have been dropped by the sink but whose IDs have +not yet propagated to `pending_acks` are lost. This means ack-in-flight events +must be replayed on restart (they were not ledger-decremented). If the reader's +`reader_last_record` was already persisted past those events (lazy ledger +flush), the events cannot be replayed — they are lost. This interaction between +the finalizer task lifecycle, the lazy ledger flush of `reader_last_record`, and +SIGKILL timing is the most subtle loss path for this property. + +**OQ-3: What is the maximum number of in-flight acks at a given moment?** +The workload should size `max_size` and batch sizes to keep many events in +various ack-flight states simultaneously. A small buffer drains too quickly for +the fault injection to hit interesting timing windows. + +**OQ-4: Sink-error ack discarding — is this in scope for this property?** +The `_status` discard at `ledger.rs:704` means this property as stated will +not catch sink-error loss (since the buffer always credits the ack). A separate +property specifically testing `Errored`/`Rejected` ack propagation is +recommended. For this property, use a reliable downstream sink stub that always +returns `Delivered` to avoid conflating the two bugs. + +**OQ-5: Does `delete_completed_data_file` → unlink-before-ledger-flush create +a genuine loss window?** +`reader.rs:546`: `delete_file` called (unlink). `reader.rs:548`: +`increment_acked_reader_file_id`. `reader.rs:549`: `ledger.flush()` (msync). +A kill between unlink and ledger flush: on restart, `ledger.state() +.get_current_reader_file_id()` still points to the deleted file. Code path on +restart: `seek_to_next_record` fast-path (`reader.rs:840-898`) tries to +`open_mmap_readable` the file → `NotFound` → falls through to slow path. The +slow path reads from the ledger's `reader_current_data_file_id` which is still +the deleted file. 
Needs careful trace to confirm no events in the deleted file +are silently abandoned if they had not been fully acked. diff --git a/tests/antithesis/scratchbook/properties/file-id-rollover-stays-coordinated.md b/tests/antithesis/scratchbook/properties/file-id-rollover-stays-coordinated.md new file mode 100644 index 0000000000000..7929d68608a43 --- /dev/null +++ b/tests/antithesis/scratchbook/properties/file-id-rollover-stays-coordinated.md @@ -0,0 +1,175 @@ +--- +slug: file-id-rollover-stays-coordinated +type: Safety / Always +status: LATENT BUG — reachable in tests (MAX_FILE_ID=6), latent in production (MAX_FILE_ID=65535) +sut_commit: b7aae737cef5dd37d1445915443a1eb97b584f85 +--- + +# Property 16: file-id-rollover-stays-coordinated + +## Catalog Entry + +**Type:** Safety / Always + +**Property:** Across a u16 file-ID rollover (writer wraps from `MAX_FILE_ID - 1` back to 0), +the reader and writer remain correctly coordinated: the reader can still determine its position +relative to the writer, and `seek_to_next_record` does not misclassify the synchronization +state, deadlock, or skip events. + +**Invariant:** At all times, including across file-ID rollover, the condition used by +`seek_to_next_record` to determine "reader is now synchronized with writer" (`reader_file_id > +writer_file_id`) is semantically correct. After rollover, this raw `u16 >` comparison produces +false answers for any configuration where the reader's file ID has wrapped past 0 while the +writer's has not (or vice versa), causing either premature "synchronized" claims or failure to +terminate the seek loop. + +**The Bug — `reader.rs:930-934`:** + +```rust +// reader.rs:930-934 — raw u16 comparison, not wrap-aware +let (reader_file_id, writer_file_id) = + self.ledger.get_current_reader_writer_file_id(); +if reader_file_id > writer_file_id { + break; +} +``` + +This comparison is the synchronization gate inside `seek_to_next_record`'s bad-read handling +loop. 
It is intended to detect the case where the reader has advanced to the file the writer +hasn't yet created (meaning they are synchronized). The logic is correct in the non-rollover +case: reader on file 4, writer on file 3 → reader_file_id (4) > writer_file_id (3) → break. + +After rollover, the semantics invert. Example with `MAX_FILE_ID = 6`: + +- Writer has written to files 0, 1, 2, 3, 4, 5, 0 (wrapped), 1. Writer is now on file ID 1. +- Reader has advanced to file ID 2 (wrapped past 5→0→1→2), ahead of the writer. +- `reader_file_id` = 2, `writer_file_id` = 1. Condition: `2 > 1` → `true` → `break`. + This is correct in this case (reader is ahead). + +But consider the inverse: reader on file 5 (about to wrap), writer just wrapped to file 0: + +- `reader_file_id` = 5, `writer_file_id` = 0. Condition: `5 > 0` → `true` → `break`. + The reader incorrectly claims it is synchronized (ahead of the writer) when in fact it is + behind (the writer has lapped the reader). + +The wrap-safe comparison requires tracking how many times each side has wrapped, or using a +distance function that is rollover-aware (e.g. treating file IDs as a modular ring). The writer +and reader file IDs both live in the mmap'd ledger as `AtomicU16`; there is no wrap counter. +The arithmetic used to compute "next" file IDs (`(current + 1) % MAX_FILE_ID`, +`ledger.rs:129,151`) is wrap-correct, but the comparisons used for ordering are not. + +**Additional file-ID arithmetic context:** + +- `get_next_writer_file_id` (`ledger.rs:129`): `(writer + 1) % MAX_FILE_ID` — wrap-correct. +- `get_next_reader_file_id` (`ledger.rs:151`): `(reader + 1) % MAX_FILE_ID` — wrap-correct. +- `get_offset_reader_file_id` (`ledger.rs:155`): `reader.wrapping_add(offset) % MAX_FILE_ID` + — wrap-correct for the addition. +- `reader_file_id > writer_file_id` (`reader.rs:932`): raw `u16` comparison — NOT wrap-aware. + +**Reachability:** + +In test builds, `MAX_FILE_ID = 6` (`common.rs:45`). 
With a small enough buffer configuration, +a test can cycle through all 6 file IDs in a short run. The in-repo proptest model suite uses +this constant. In production, `MAX_FILE_ID = u16::MAX = 65535` (`common.rs:43`), requiring +65535 data-file rotations of up to 128MB each (~8TB of data in total) to hit rollover — not reachable +in production without months of sustained high-volume writes, but reachable in Antithesis with +`MAX_FILE_ID = 6` (the test build constant). + +**Antithesis Angle:** + +1. Run Vector with a test build (`MAX_FILE_ID = 6`, which is the default under `#[cfg(test)]`) + or with a small configured `max_data_file_size` and high write volume to force rapid rotation. +2. Write enough data to cycle through all 6 file IDs multiple times (triggering rollover). +3. Inject node-kill faults at the rollover boundary (when `writer_file_id` is near 5 and + `reader_file_id` is near 0, or vice versa) to force `seek_to_next_record` to run across the + rollover point. +4. After restart, assert: + - Vector does not deadlock (the seek loop terminates). + - No events are skipped (event count before crash = events delivered after restart). + - `buffer_discarded_events_total` does not increment unexpectedly. + - Vector's buffer metrics settle to a consistent state. + +The "deadlock / no-progress" variant is most likely to manifest: if the comparison misfires and +the seek loop never breaks, `seek_to_next_record` hangs forever (no timeout), and Vector's +read path never marks itself `ready_to_read`. The pipeline stalls silently, similar to the +`total_buffer_size` underflow deadlock (INV-7 / L1 / L8). + +SUT-side `assert_always!` candidates: + +- Inside the `seek_to_next_record` bad-read loop: assert that the loop terminates within a + bounded number of iterations (e.g. `MAX_FILE_ID` iterations). +- After `seek_to_next_record` completes: assert `self.ready_to_read == true` within a + bounded time after startup. 
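The difference between the raw gate and a ring-aware one can be checked with a few lines of Rust. This is a sketch, not a proposed patch: it assumes "synchronized" means the reader sits exactly one file ahead of the writer on the ring (as in the non-rollover example of reader 4, writer 3), and `raw_gate`, `forward_distance`, and `ring_gate` are hypothetical names.

```rust
/// Test-build value cited in this note; production uses `u16::MAX`.
const MAX_FILE_ID: u16 = 6;

/// The comparison as it appears at `reader.rs:932`.
fn raw_gate(reader: u16, writer: u16) -> bool {
    reader > writer
}

/// Number of forward steps on the ring needed to get from `from` to `to`.
/// Both inputs must be below `MAX_FILE_ID`. Arithmetic is widened to `u32`
/// so it cannot overflow when `MAX_FILE_ID` is `u16::MAX`.
fn forward_distance(from: u16, to: u16) -> u16 {
    let max = u32::from(MAX_FILE_ID);
    ((u32::from(to) + max - u32::from(from)) % max) as u16
}

/// Assumed ring-aware gate: the reader is on the file the writer has not
/// created yet, i.e. exactly one step ahead of the writer.
fn ring_gate(reader: u16, writer: u16) -> bool {
    forward_distance(writer, reader) == 1
}
```

With these definitions the two gates agree for reader 4, writer 3. They disagree for reader 5, writer 0 (raw says synchronized, ring says not) and for reader 0, writer 5 (raw says not synchronized, ring says it is), which correspond to the premature-break and never-break outcomes named in the invariant.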
+ +**Why It Matters:** + +A file-ID rollover bug causes a silent pipeline stall on restart after the buffer has cycled +through its full file namespace. In production this is an extreme edge case (requires ~8TB +written through a single buffer), but it is exactly reachable in test environments with +`MAX_FILE_ID = 6`. Antithesis, running the test binary, will hit rollover routinely. The bug +provides a concrete, Antithesis-reachable test of whether the recovery path is robust to +rollover — and whether the two disabled tests +(`reader_exits_cleanly_when_writer_done_and_in_flight_acks`, `writer_waits_when_buffer_is_full`) +are related to this class of issue. + +## Open Questions + +1. **Is the `reader_file_id > writer_file_id` comparison actually the problematic gate, or is + there additional context (e.g. the `unacked_reader_file_id_offset` accounting) that makes + it correct in the rollover case?** `get_current_reader_file_id` (`ledger.rs:305-308`) adds + the `unacked_reader_file_id_offset` to the persisted reader file ID. This offset represents + how many files the reader has consumed but not yet acked. If this offset is bounded and + resets correctly at rollover, the comparison may be more correct than the raw IDs suggest. + Needs deeper analysis. + +2. **Does the `unacked_reader_file_id_offset` also suffer from rollover? The `get_offset_reader_file_id` + function uses `wrapping_add(offset) % MAX_FILE_ID` (ledger.rs:155). If `offset` grows large + enough (multiple unacked files), could the offset itself wrap and produce a spurious file ID? + The offset is bounded by the number of concurrently unacked files, which is bounded by the + buffer size / data file size, so it should be small in practice. But this should be verified. + +3. 
**Does Antithesis run test binaries (with `MAX_FILE_ID=6`) or production binaries (with + `MAX_FILE_ID=65535`)?** If production binaries are used, the rollover scenario requires + injecting synthetic file-ID state (using the `unsafe_set_writer_next_record_id` test helpers, + which are cfg(test)-gated and unavailable in production binaries). A production-binary + Antithesis run would need to use a very small `max_data_file_size` and very high write + throughput to approach rollover, which may not be achievable in the test window. + +4. **What is the `seek_to_next_record` loop bound?** The loop has no explicit iteration limit. + If the rollover bug causes the loop to not terminate (e.g. the condition never fires), it + will spin indefinitely using CPU but blocking the read path. Whether this manifests as a CPU + spike (spinning loop) or a true deadlock (blocked await) depends on whether any `await` + points are hit inside the loop — which they are (`self.next().await`). So the actual + behavior is likely a livelock: calling `next()` repeatedly, hitting bad reads, never breaking, + consuming resources without progress. Antithesis CPU throttle faults could help expose this. + +--- + +### Investigation Log + +#### Build requirement: `MAX_FILE_ID` is 65535 in production, 6 only in `cfg(test)` + +**Examined:** `common.rs:42–45`. + +**Found:** The cfg-gated constants are confirmed at common.rs:42–45: + +```rust +#[cfg(not(test))] +pub const MAX_FILE_ID: u16 = u16::MAX; // 65535 +#[cfg(test)] +pub const MAX_FILE_ID: u16 = 6; +``` + +The test-only value of 6 means rollover is exercisable in a handful of file rotations under test; the production value of 65535 requires ~8TB of data through a single buffer. 
An Antithesis run using the production binary would not trigger rollover organically; triggering it would require either a test binary or injected file-ID state via `unsafe_set_writer_next_record_id` / `unsafe_set_reader_last_record_id`, which are themselves `#[cfg(test)]`-gated (ledger.rs:173, 189). + +**Conclusion:** Confirmed cfg-gated at common.rs:43–45. Rollover testing in Antithesis requires the test binary (MAX_FILE_ID=6) or a specially instrumented production build. + +#### Does the `unacked_reader_file_id_offset` indirection make the raw `>` comparison at `reader.rs:932` more correct than it looks? + +**Examined:** `ledger.rs:229` (`unacked_reader_file_id_offset: AtomicU16`), `ledger.rs:305–308` (`get_current_reader_file_id`), `ledger.rs:327–332` (`get_current_reader_writer_file_id`), `reader.rs:930–934`. + +**Found:** `get_current_reader_file_id` at ledger.rs:305–308 returns `self.state().get_offset_reader_file_id(unacked_offset)` where `unacked_offset = self.unacked_reader_file_id_offset.load(Ordering::Acquire)`. The `get_offset_reader_file_id` helper at ledger.rs:154–156 computes `get_current_reader_file_id().wrapping_add(offset) % MAX_FILE_ID` — the `%` is wrap-correct for the arithmetic. `get_current_reader_writer_file_id` at ledger.rs:327–332 calls both accessors and returns the pair. The offset means `reader_file_id` at reader.rs:932 is the *adjusted* reader file ID (accounting for files consumed but not yet acked), not the raw persisted value. This makes the raw `u16 >` comparison slightly more correct than a comparison of bare persisted IDs, because the adjusted ID represents where the reader is actually reading, not just where it last checkpointed. + +**Not found:** No wrap-aware modular distance function is used for the comparison — it is still `reader_file_id > writer_file_id` with raw `u16` values. The `unacked_reader_file_id_offset` does not introduce a generation counter or wrap-epoch tracking. 
If both the reader and writer have wrapped through 0 at least once, the adjusted reader ID can still be numerically less than the writer ID even though the reader is ahead in the modular ring, making the `>` comparison produce a false negative (reader incorrectly appears to be behind). The offset indirection reduces — but does not eliminate — the incorrectness of the raw comparison. + +**Conclusion:** The `unacked_reader_file_id_offset` makes the comparison marginally more correct (it reflects actual read position rather than acked position) but does not fix the rollover-incorrectness of the raw `u16 >` gate. The bug is real and the comparison at reader.rs:932 is not wrap-safe. diff --git a/tests/antithesis/scratchbook/properties/finalizer-task-drains-pending-acks.md b/tests/antithesis/scratchbook/properties/finalizer-task-drains-pending-acks.md new file mode 100644 index 0000000000000..a7f6e97091d5b --- /dev/null +++ b/tests/antithesis/scratchbook/properties/finalizer-task-drains-pending-acks.md @@ -0,0 +1,311 @@ +--- +slug: finalizer-task-drains-pending-acks +catalog_category: 4 — Space Reclamation & Clean Termination +type: Liveness / Sometimes(all_acks_drained) +status: cataloged (Category 7) +related: + - acked-files-eventually-deleted + - reader-drains-and-terminates-cleanly + - every-written-event-eventually-delivered + - writer-eventually-makes-progress + - total-buffer-size-never-underflows +commit: b7aae737cef5dd37d1445915443a1eb97b584f85 +--- + +### finalizer-task-drains-pending-acks — Finalizer Task Drains All In-Flight Acks Before Exit + +| | | +|---|---| +| **Type** | Liveness | +| **Property** | All in-flight `BatchNotifier` acknowledgements are eventually processed by the finalizer task and reflected in `pending_acks` — both (a) during steady-state operation after a quiet period, and (b) on graceful shutdown before the process exits. 
Acks are never permanently stranded in the `FuturesOrdered` queue of the finalizer task without being consumed by the `Arc` ack machinery. | +| **Invariant** | `Sometimes(all_acks_drained)`: after a write-then-ack-then-quiet-period sequence, `pending_acks` reflects all delivered acks and `total_buffer_size` has been decremented correctly; no acks are left un-processed in the finalizer stream. Pair with an `Unreachable` assertion on "finalizer task died while acks were still pending" for a sharper fault-injection signal. | +| **Antithesis Angle** | (1) **SIGKILL with acks in flight**: write records, begin ack flow, SIGKILL the process mid-drain; assert on restart that events are replayed (ack progress was not persisted) rather than silently stranded. (2) **Finalizer task panic**: inject a panic into the `tokio::spawn` body (`ledger.rs:703`); assert that the reader detects the dead finalizer and takes a recovery action rather than hanging indefinitely. (3) **Graceful shutdown ordering**: SIGTERM with acks in flight; assert `pending_acks == 0` and `total_buffer_size == 0` (after full drain) before the process exits; replay check after restart. | +| **Why It Matters** | The finalizer task is an **unmonitored, detached** `tokio::spawn` that holds the only reference capable of calling `increment_pending_acks` and `notify_writer_waiters`. If it panics, is killed, or is not drained before shutdown, the entire write-read-ack-delete chain breaks: `pending_acks` never advances → `handle_pending_acknowledgements` never fires → no file deletion → `total_buffer_size` never decremented → eventual writer deadlock. Additionally, events already delivered to the sink but not yet acked in the buffer are silently lost (no replay on restart) — a distinct silent-loss path from the arithmetic underflow. 
| + +--- + +## What Led to This Property + +The `spawn_finalizer` function (`ledger.rs:701–710`) spawns a detached tokio +task that acts as the bridge between sink delivery and buffer accounting: + +```rust +pub(super) fn spawn_finalizer(self: Arc<Self>) -> OrderedFinalizer<u64> { + let (finalizer, mut stream) = OrderedFinalizer::new(None); // ledger.rs:702 + tokio::spawn(async move { // ledger.rs:703 + while let Some((_status, amount)) = stream.next().await { // ledger.rs:704 + self.increment_pending_acks(amount); // ledger.rs:705 + self.notify_writer_waiters(); // ledger.rs:706 + } + }); + finalizer +} +``` + +Key observations: + +1. **The task is detached**: `tokio::spawn` returns a `JoinHandle` that is + immediately discarded (no `let _ = tokio::spawn(...)` or any join/abort + handle stored). There is no supervision: if the task panics or the runtime + drops it without draining, the caller has no way to detect this. + +2. **The `_status` discard**: the matched `BatchStatus` is explicitly ignored + (`_status` at `ledger.rs:704`). Every ack — regardless of whether it is + `Delivered`, `Errored`, or `Rejected` — increments `pending_acks` by + `amount`. This is the sink-failure silent-loss bug documented in + `sink-failure-not-silently-acked`. For this property, what matters is that + acks that **do** flow through the task are counted; the discard is a + correctness issue, not a liveness issue. + +3. **The `stream` and its drain semantics**: `OrderedFinalizer::new(None)` + creates a `FinalizerSet` plus a `BoxStream<'static, (BatchStatus, u64)>`. + The stream is driven by the `finalizer_stream` function (`finalizer.rs:114–167`). + Its `None`-from-`new_entries` branch (`finalizer.rs:150–151`) breaks the + main loop and falls into the drain-loop at `finalizer.rs:159–161`: + + ```rust + while let Some((status, entry)) = status_receivers.next().await { + yield (status, entry); + } + ``` + + The drain loop runs only if the tokio task is still alive and being polled. 
+ The `sender` half of the `FinalizerSet`'s internal channel is held by the + `OrderedFinalizer` handle (`finalizer.rs:49`), which is stored in + `BufferReader` (`reader.rs:411`). When `BufferReader` is dropped, the + `sender` drops, the channel closes, the task sees `None`, the drain loop + runs, and the stream terminates — **in theory**. + +4. **SIGKILL cuts the drain**: on SIGKILL, the entire tokio runtime is + terminated without running any Drop or drain logic. In-flight + `BatchNotifier`s that have been detached from their events (event delivered + downstream, notifier dropped by the sink) but whose `Future` in the + `FuturesOrdered` has not yet been polled to completion are lost. On restart, + those events are not in `pending_acks`, so `handle_pending_acknowledgements` + never advances the reader position past them, and the file containing them + is not deleted. Because `total_buffer_size` is re-seeded from file sizes at + startup (not from persisted ack state), the file's bytes are counted again + in the seed, and those records **will be re-read and re-delivered** — which + is the correct at-least-once behavior. The stranded-ack scenario therefore + results in **duplicate delivery, not silent loss**, as long as the reader + re-reads those records correctly. + + However: there is a **silent-loss path** if a record was read and a + `BatchNotifier` was attached (`reader.rs:1117–1119`), the event was + delivered to the sink, the `BatchNotifier` was dropped, the `BatchStatusReceiver` + future completed in the `FuturesOrdered` — but the task loop had not yet + called `stream.next()` to yield it. At SIGKILL, this completed future is + in the `FuturesOrdered` but unpolled. On restart, `total_buffer_size` is + re-seeded from file sizes, so the file appears to still be present. But + `reader_last_record` (persisted in the ledger) reflects the reader's acked + position from the last `ledger.flush()` call, which may lag behind the + in-memory ack state. 
If `reader_last_record` was flushed at a position + before those records, the reader will re-read them (safe replay). If + `reader_last_record` was flushed at or beyond those records (unlikely given + the lazy flush interval), they are skipped. The exact outcome depends on + the flush timing; this is the interleaving Antithesis is designed to explore. + +5. **Graceful shutdown ordering**: `FinalizerSet` has no `Drop` impl + (`finalizer.rs`: no `impl Drop for FinalizerSet` found). The drain of + in-flight acks on graceful shutdown depends on: + (a) Vector's topology shutdown calling `writer.close()` (which marks the + writer done via `ledger.mark_writer_done()`, `writer.rs:1359`), followed + by the read loop draining, followed by `BufferReader` drop, which drops + the `OrderedFinalizer`, which closes the internal channel, which triggers + the `finalizer_stream` drain loop, which the tokio task must be polled + to complete. + (b) The tokio runtime not shutting down before the finalizer task completes + its drain. Dropping a `tokio::runtime::Runtime` does not wait for spawned + async tasks to run to completion: they run only until their next yield + point and are then dropped, so a detached spawn may be abandoned if the + runtime shuts down before the task is polled through the drain. + This is the "does the runtime drain finalizer tasks before shutdown" open + question from `sut-analysis.md §Open Questions`.
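The shutdown-ordering dependency in observation 5 can be illustrated with a std-only analogue. This is a hedged sketch, not Vector's code: it assumes an OS thread in place of the tokio task and `std::sync::mpsc` in place of the `FinalizerSet` channel, and the names `spawn_finalizer_joined` and `drain_on_shutdown` are hypothetical. It shows the shape of a design where the spawn handle is retained and joined, so the drain finishes before shutdown proceeds.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::mpsc::{self, Sender};
use std::sync::Arc;
use std::thread::{self, JoinHandle};

/// Hypothetical stand-in for `spawn_finalizer`: returns the sender half
/// (analogous to the `OrderedFinalizer` handle) AND the join handle,
/// instead of discarding the handle.
pub fn spawn_finalizer_joined(pending_acks: Arc<AtomicU64>) -> (Sender<u64>, JoinHandle<()>) {
    let (tx, rx) = mpsc::channel::<u64>();
    let handle = thread::spawn(move || {
        // `recv` returns Err only once every sender is dropped AND the queue
        // is empty, so this single loop also acts as the drain loop.
        while let Ok(amount) = rx.recv() {
            pending_acks.fetch_add(amount, Ordering::AcqRel);
        }
    });
    (tx, handle)
}

/// Hypothetical shutdown step: close the channel, then block until the
/// finalizer has counted every in-flight ack.
pub fn drain_on_shutdown(tx: Sender<u64>, handle: JoinHandle<()>) {
    drop(tx); // analogous to dropping `BufferReader` / `OrderedFinalizer`
    handle.join().expect("finalizer thread panicked");
}
```

Because `join` returns an error when the thread panicked, the same retained handle would also surface a dead finalizer at shutdown rather than leaving it undetected. Whether an equivalent `JoinHandle` await fits Vector's topology shutdown is an open design question, not something this sketch settles.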
+ +--- + +## Code References + +| Location | Relevance | +|---|---| +| `lib/vector-buffers/src/variants/disk_v2/ledger.rs:701–710` | `spawn_finalizer`: detached `tokio::spawn`, `_status` discard, no join handle retained | +| `lib/vector-buffers/src/variants/disk_v2/ledger.rs:704` | `_status` discard — `BatchStatus` ignored; every ack counted as `Delivered` | +| `lib/vector-buffers/src/variants/disk_v2/ledger.rs:415–416` | `increment_pending_acks`: `fetch_add` on `pending_acks` — the only path that unblocks the reader ack machinery | +| `lib/vector-buffers/src/variants/disk_v2/ledger.rs:381` | `notify_writer_waiters`: wakes the reader, which then advances acks and frees files | +| `lib/vector-buffers/src/variants/disk_v2/mod.rs:262–264` | `spawn_finalizer` call site: `Arc` cloned, `finalizer` passed to `BufferReader` | +| `lib/vector-buffers/src/variants/disk_v2/reader.rs:411` | `finalizer: OrderedFinalizer` — stored in `BufferReader`; drop triggers sender close | +| `lib/vector-buffers/src/variants/disk_v2/reader.rs:1117–1119` | `BatchNotifier::new_with_receiver` + `finalizer.add(record_events.get(), receiver)` — ack registration per record | +| `lib/vector-common/src/finalizer.rs:70–82` | `OrderedFinalizer::new(None)`: creates sender+stream; `shutdown=None` means drain-only-on-channel-close | +| `lib/vector-common/src/finalizer.rs:114–167` | `finalizer_stream`: main loop (`None` branch breaks at L151) + drain loop (L159–161) | +| `lib/vector-common/src/finalizer.rs:48–52` | `FinalizerSet`: no `Drop` impl — drain is implicit via sender-drop, not explicit | +| `lib/vector-buffers/src/variants/disk_v2/writer.rs:1358–1363` | `close()`: marks writer done + wakes reader; called from `Drop` (`writer.rs:1371–1373`) | + +--- + +## What Breaks + +**Path A — SIGKILL with acks in `FuturesOrdered` but unpolled:** +The tokio runtime is killed. 
In-flight `BatchStatusReceiver` futures that +completed (event delivered, notifier dropped) but whose completion has not yet +been polled by `finalizer_stream` are lost. `pending_acks` on the next startup +starts at 0. `total_buffer_size` is re-seeded from file sizes. The reader seeks +to `reader_last_record` (persisted). Records after the acked position are +re-read. **Result: duplicate delivery** (at-least-once holds, but duplicates +are generated from records that were already delivered downstream). This is the +expected and correct behavior for SIGKILL. + +**Path B — Finalizer task panic:** +If the `tokio::spawn` body panics (e.g., due to a bug in the ack counting +arithmetic or in `notify_writer_waiters`), the task terminates. The +`OrderedFinalizer` handle in `BufferReader` is still alive; `add` calls +(`reader.rs:1119`) attempt to send on the internal channel. With the task gone, +the receiver is dropped, and `sender.send(...)` returns `Err`. The `add` method +logs `error!(message = "FinalizerSet task ended prematurely.", %error)` +(`finalizer.rs:105`) but does not return an error or signal the reader. The +reader continues reading and attaching notifiers; those notifiers complete when +events are delivered, but `increment_pending_acks` is never called. +`handle_pending_acknowledgements` in the reader receives 0 acks. No file is +ever deleted. `total_buffer_size` never decrements. **Result: no file deletion, +eventual writer deadlock** — same user-visible manifestation as #21683, but +with a different trigger path. The reader continues making forward progress +until the buffer fills; then the writer stalls. + +**Path C — Graceful shutdown with acks in flight:** +Topology shuts down; `BufferWriter` is dropped, `close()` is called. The +reader's drain loop runs. At some point `BufferReader` is dropped, dropping +the `OrderedFinalizer`, closing the internal channel. 
The `finalizer_stream` +in the tokio task sees `None`, breaks its main loop, and enters the drain loop +(`finalizer.rs:159–161`). If the tokio runtime shuts down (e.g., `Runtime::drop`) +before the drain loop completes, in-flight acks are abandoned. +**Result: records that were delivered downstream but not yet counted in +`pending_acks` are not reflected in `reader_last_record` at shutdown. On +restart, they are replayed. Duplicates, not silent loss.** But if +`reader_last_record` was flushed past those records (because the reader's ack +machinery had already advanced the position before the finalizer drained), those +records would be skipped on restart — silent loss. + +**Path D — Steady-state liveness:** +Under heavy load, if the finalizer task falls behind (e.g., CPU throttling, +the `FuturesOrdered` queue grows faster than it drains), `pending_acks` is +not incremented promptly. The reader's `handle_pending_acknowledgements` loop +does not fire. File deletions are delayed. The writer, if blocked full, waits +longer for the reader to free a file. **Result: head-of-line blocking, +throughput degradation, eventual writer stall under sustained CPU pressure.** +This is not a permanent failure but a performance degradation that becomes +permanent if the CPU throttle is extreme enough. + +--- + +## Fault Conditions + +1. **SIGKILL with acks in flight** — requires node-termination fault. Flag if + disabled in the Antithesis tenant. + +2. **Finalizer task panic** — can be injected via a custom Antithesis fault that + sends a panic to the spawned task, or by adding a fault-injection point in + the task body. Does not require node-kill. + +3. **CPU throttling** — available as an Antithesis resource fault. Slows the + finalizer task relative to the reader, growing the `FuturesOrdered` queue + and extending ack latency. + +4. **Shutdown ordering race** — requires Antithesis to explore the interleaving + between the topology shutdown path and the finalizer drain. 
No special fault + primitive needed; Antithesis will explore these orderings automatically. + +--- + +## OrderedFinalizer Drop/Drain Semantics Summary + +Based on `finalizer.rs`: + +- `FinalizerSet::new(shutdown: Option<ShutdownSignal>)` with `None` (as used + at `ledger.rs:702`): the stream terminates only when the `sender` is dropped. +- **The drain loop at `finalizer.rs:159–161` runs only if the tokio task is + scheduled** after the sender closes. There is no blocking synchronous drain; + it is an async drain that depends on the runtime continuing to poll the + task. +- **`FinalizerSet` has no `Drop` impl**: it performs no synchronous drain on + drop. Dropping the `OrderedFinalizer` only closes the channel; the actual + drain is async and may not complete before the runtime exits. +- The `flush` method (`finalizer.rs:109–111`) calls `flush.notify_one()` which + drops all pending `status_receivers` (`finalizer.rs:131–133`). This is a + discard path, not a drain path. It would abandon in-flight acks, which is relevant if + called during shutdown. + +--- + +## Missing SUT Instrumentation + +No Antithesis SDK assertions exist. Needed: + +1. **Finalizer task health check**: an `Unreachable` assertion inside the + `add` method's `Err` branch (`finalizer.rs:104–106`) would fire when the + finalizer task terminates prematurely. This is already signaled by the + `error!` log, but an SDK assertion makes it an explicit test failure. + +2. **`Sometimes` drain completion assertion**: at the point where the + `finalizer_stream` drain loop exits (`finalizer.rs:161`), an SDK + `Sometimes(drain_completed)` assertion confirms the drain ran to completion + at least once per test run. + +3. **`pending_acks` ground-truth assertion**: at `ledger.rs:416` (after + `increment_pending_acks`), an `Always(pending_acks > 0)` assertion would + confirm that the ack is actually registered before `notify_writer_waiters` + wakes the reader.
This catches the race where `notify_writer_waiters` fires + but `pending_acks` is 0 due to the ack being dropped. + +4. **Workload-level replay detection**: after SIGKILL with known-delivered + events, the workload asserts that on restart those events are re-delivered + exactly once (duplicate, not silent loss). No SUT modification needed for + this path. + +--- + +## Open Questions + +- Does the tokio runtime's shutdown sequence (`Runtime::drop` or + `Runtime::shutdown_background`) drain spawned tasks before exiting, or does + it abandon them? The answer determines whether graceful-shutdown acks are + reliably processed. (Partial: the ~2-second grace period cited earlier is not a + tokio guarantee, see the Investigation Log; if the drain takes longer, e.g. many + unresolved `BatchStatusReceiver` futures, some acks may be abandoned.) + +- Is there any mechanism in Vector's topology shutdown that explicitly joins or + awaits the finalizer task? If `spawn_finalizer` stored the `JoinHandle` and + the topology awaited it during shutdown, graceful-shutdown acks would be + reliable. Currently the handle is discarded. + +- If the finalizer task panics (path B above), the `error!` at `finalizer.rs:105` + fires on every subsequent `add` call. Does the Vector metrics/alerting + infrastructure surface this log as an actionable signal, or is it buried? An + operator would not naturally look for this log as the signal for a deadlock-in-progress. + +- Can the finalizer task grow its `FuturesOrdered` queue without bound under + a stalled sink that never delivers/acks? If so, the queue is a memory leak + vector under backpressure, distinct from the write-path `max_buffer_size` + guard. + +- Is `OrderedFinalizer::new(None)` the correct call for the disk buffer's + finalizer, given that `None` means "no shutdown signal"? If a `ShutdownSignal` + were passed, the stream would terminate immediately on shutdown (discarding + in-flight acks per `finalizer.rs:65–68`), which is worse.
But with `None`, + the drain depends on the sender being dropped, which is implicit and + ordering-sensitive. Neither option provides a synchronous, guaranteed drain; + this is a structural gap. + +--- + +### Investigation Log + +#### Does the tokio runtime drain spawned tasks before exit? + +**Examined:** `lib/vector-buffers/src/variants/disk_v2/ledger.rs:701–710` (`spawn_finalizer`), `lib/vector-common/src/finalizer.rs:114–167` (`finalizer_stream`, drain loop at lines 159–161). + +**Found:** `spawn_finalizer` at ledger.rs:703 calls `tokio::spawn(async move { ... })` and discards the returned `JoinHandle`: no handle is stored and no join/abort is called anywhere in the codebase. The finalizer task is fully detached. The drain loop in `finalizer_stream` at finalizer.rs:159–161 (`while let Some((status, entry)) = status_receivers.next().await { yield (status, entry); }`) runs only after the `new_entries` channel closes (when the `OrderedFinalizer` sender, held in `BufferReader`, is dropped). This drain is async and depends on the tokio runtime continuing to poll the task. There is no explicit join on the finalizer task handle, neither in `spawn_finalizer` nor in any topology shutdown path. + +**Found (~2s window):** Tokio's `Runtime::drop` does not have a documented fixed timeout for completing spawned tasks. The evidence file's claim of "~2s window" was sourced from prior analysis of tokio internals and the Vector topology graceful-shutdown deadline configuration; it is not a hard tokio guarantee. Per tokio's documented shutdown behavior, tasks created with `tokio::spawn` keep running only until their next yield point and are then dropped; `Runtime::drop` and `Runtime::shutdown_timeout` wait on `spawn_blocking` work, not on async tasks running to completion. A finalizer task parked on `stream.next().await` can therefore be dropped mid-drain. How long the task gets in practice depends on how Vector's main binary sequences topology shutdown before the runtime is torn down.
+ +**Not found:** No explicit `join`, `await`, or `abort` on the finalizer task's `JoinHandle` at any site in `lib/vector-buffers/`. No `impl Drop for FinalizerSet` at finalizer.rs. No `ShutdownSignal` passed to `OrderedFinalizer::new(None)` at ledger.rs:702 that would give the stream an explicit termination signal. + +**Conclusion:** The tokio runtime does not provide a guaranteed synchronous drain of the finalizer task before exit. The ~2s window is a best-effort estimate based on topology shutdown configuration, not a hard guarantee from tokio. In-flight acks at the time of process exit or runtime drop may be abandoned, resulting in duplicate delivery on restart (correct at-least-once) rather than silent loss — provided `reader_last_record` has not been flushed past those records. diff --git a/tests/antithesis/scratchbook/properties/foreign-data-file-no-writer-stall.md b/tests/antithesis/scratchbook/properties/foreign-data-file-no-writer-stall.md new file mode 100644 index 0000000000000..bc9b7bfc73776 --- /dev/null +++ b/tests/antithesis/scratchbook/properties/foreign-data-file-no-writer-stall.md @@ -0,0 +1,209 @@ +--- +slug: foreign-data-file-no-writer-stall +catalog_category: 2 — Buffer Accounting & Writer Liveness +type: Safety / Always +status: cataloged (Category 7) +related: + - total-buffer-size-never-underflows + - writer-eventually-makes-progress + - buffer-size-within-max +commit: b7aae737cef5dd37d1445915443a1eb97b584f85 +--- + +### foreign-data-file-no-writer-stall — Foreign `.dat` File Does Not Permanently Stall the Writer + +| | | +|---|---| +| **Type** | Safety | +| **Property** | A stray `.dat` file placed in the buffer data directory before startup (by an operator, a prior process, a symlink, etc.) inflates `total_buffer_size` at init but is never read by the reader and therefore never decremented; the writer must still eventually make progress and must not be permanently deadlocked. 
| +| **Invariant** | `Always(writer_makes_progress_after_drain)`: even with a foreign `.dat` file present, after the reader has processed and acked its legitimate data files, the writer is not permanently stalled. Equivalently: the over-seeded `total_buffer_size` does not hold `is_buffer_full()` permanently true when the buffer's actual content is below `max_buffer_size`. | +| **Antithesis Angle** | Custom fault / workload: (1) fill and partially drain the buffer to establish a baseline; (2) inject a stray `foreign.dat` file into the buffer data directory; (3) restart Vector; (4) `ANTITHESIS_STOP_FAULTS` quiet period; (5) assert that the writer resumes accepting writes within a bounded time. The foreign file must be large enough to push the over-seeded `total_buffer_size` above `max_buffer_size`, simulating the deadlock condition. No node-kill needed — this is a non-crash, operator-error path. | +| **Why It Matters** | This is a distinct, non-crash path to the #21683 permanent writer stall. It requires only an operator mistake (or a leftover `.dat` from a prior cleanup) and a restart — no crash, no race, no timing luck. The condition is silent: the writer hangs indefinitely, `is_buffer_full()` is forever true, dashboards may appear healthy (post-PR-#23561 the gauge reads 0 due to `saturating_sub` masking), and no error is emitted. | + +--- + +## What Led to This Property + +The `update_buffer_size` function (`ledger.rs:653–698`) is called once during +`Ledger::load_or_create` (`ledger.rs:648`) to seed `total_buffer_size`. Its +implementation (`ledger.rs:671–695`) reads the buffer data directory and sums +the size of **every file whose name ends in `.dat`** (`ledger.rs:681`): + +```rust +if file_name.ends_with(".dat") { + let metadata = dir_entry.metadata().await.context(IoSnafu)?; + total_buffer_size += metadata.len(); +``` + +The predicate is `ends_with(".dat")` — a suffix check with no further +validation against the expected `buffer-data-{id}.dat` pattern. 
The comment at +`ledger.rs:676–680` explicitly acknowledges this: the author wanted only +lowercase `.dat` files but made no attempt to filter by name prefix, accepting +any compliant extension. + +The accumulated sum is applied unconditionally: + +```rust +self.increment_total_buffer_size(total_buffer_size); // ledger.rs:695 +``` + +This uses `fetch_add` with no saturation (`ledger.rs:282`). The resulting +`total_buffer_size` then feeds directly into `is_buffer_full()` in `writer.rs`: + +```rust +fn is_buffer_full(&self) -> bool { // writer.rs:993 + let total_buffer_size = + self.ledger.get_total_buffer_size() + self.unflushed_bytes; + let max_buffer_size = self.config.max_buffer_size; + total_buffer_size >= max_buffer_size +} +``` + +and into `can_write_record()` (`writer.rs:793–798`). If the foreign `.dat` file +is large enough to push `total_buffer_size >= max_buffer_size` at startup, the +writer enters the `ensure_ready_for_write` wait loop (`writer.rs:1001–1020`) +and never exits, because there is nothing to ever decrement the contribution +from the foreign file. + +The reader decrements `total_buffer_size` only by **record bytes it has actually +read** (via `track_read`, called from `reader.rs:1115`) and by the +**file-size minus bytes-read** delta when deleting completed data files +(`reader.rs:521–538`, calling `decrement_total_buffer_size`). Neither path +reaches the foreign file, because the foreign file is not a valid `buffer-data-{id}.dat` +file (it won't match the file-ID sequence the reader follows) and will never be +opened, read, or deleted by the reader. The inflated seed never gets +decremented. + +The writer's wakeup chain (`notify_writer_waiters` → `wait_for_reader` → +`notify_reader_waiters`) is sound, but it is conditioned on the reader +delivering acks that flow through the finalizer task. 
If the inflation from the +foreign file is larger than `max_buffer_size - actual_content`, the writer +remains blocked forever even after the buffer is completely drained of +legitimate data. + +This was flagged in the SUT analysis (§6 item 9) as part of the +"mmap SIGBUS / external file tampering" cluster, but it is a pure +non-crash, operator-error path that deserves its own property: it requires no +node kill, no timing luck, and no concurrent fault — only a stray `.dat` file +and a restart. + +--- + +## Code References + +| Location | Relevance | +|---|---| +| `lib/vector-buffers/src/variants/disk_v2/ledger.rs:671–695` | `update_buffer_size` — scans `data_dir`, sums all `*.dat` files without name-prefix filtering | +| `lib/vector-buffers/src/variants/disk_v2/ledger.rs:681` | Exact predicate: `file_name.ends_with(".dat")` — no `buffer-data-` prefix check | +| `lib/vector-buffers/src/variants/disk_v2/ledger.rs:695` | `increment_total_buffer_size(total_buffer_size)` — unconditional seed | +| `lib/vector-buffers/src/variants/disk_v2/ledger.rs:282` | `fetch_add` with no saturation on `total_buffer_size` | +| `lib/vector-buffers/src/variants/disk_v2/ledger.rs:291–297` | `decrement_total_buffer_size` — raw `fetch_sub`, also no saturation | +| `lib/vector-buffers/src/variants/disk_v2/writer.rs:993–997` | `is_buffer_full` — reads `total_buffer_size` directly | +| `lib/vector-buffers/src/variants/disk_v2/writer.rs:1001–1020` | `ensure_ready_for_write` — permanent wait loop if `is_buffer_full` | +| `lib/vector-buffers/src/variants/disk_v2/reader.rs:521–538` | `delete_completed_data_file` — decrements by `metadata.len() - bytes_read`; only runs for self-owned files | +| `lib/vector-buffers/src/variants/disk_v2/reader.rs:1115` | `track_read` — decrements by record bytes; only runs for records actually read | + +--- + +## What Breaks + +**Failure mode:** the writer hangs permanently in +`ensure_ready_for_write` → `ledger.wait_for_reader().await` (`writer.rs:1019`). 
+No error is logged (only a `trace!` at `writer.rs:1013`). No crash. The pipeline +stalls silently. + +**Severity:** same user-visible impact as #21683 — silent pipeline stall with +healthy-looking dashboards (the buffer gauge is masked by `saturating_sub` since +PR #23561). Durability promise destroyed without any observable signal. + +**Threshold calculation:** for a `max_buffer_size = 256MB` buffer with no +legitimate data, placing a `foreign.dat` file of ≥ 256MB (or several smaller +ones totaling ≥ 256MB) is sufficient to trigger the stall. In a typical +production buffer with `max_buffer_size` in the GB range, the threshold is +higher, but the vector is still operator-accessible: a single misplaced large +`.dat` file (e.g., a leftover from manual inspection or a prior failed migration) +can trigger it. + +**Difference from the #21683 underflow path:** the #21683 path wraps +`total_buffer_size` toward `u64::MAX` via an underflow, making the stall +essentially unrecoverable until a fresh buffer is created. The foreign-file path +over-seeds `total_buffer_size` to a large-but-finite value. If the foreign file +is subsequently removed and Vector is restarted, the stall resolves — so there +is a recovery path, but it requires operator intervention (remove the file, +restart). This makes it a Safety violation (a foreign file deadlocks the writer) +rather than a permanent corruption. + +--- + +## Fault Conditions + +1. **No node-kill needed.** The stall is triggered on a normal startup with a + stray file present. A SIGKILL + restart sequence is the most realistic + delivery mechanism (crash leaves a temporary file in the buffer dir), but the + property holds even without crash faults. + +2. **Operator-accessible.** The buffer data dir path is user-configurable and + often writable. An operator `cp`ing a file into the wrong directory, a + symlink, or a `.dat` file left by a prior Vector version or migration script + can trigger this. + +3. 
**Filesystem fault delivery in Antithesis.** Antithesis can place a file in + the buffer directory at any point via workload logic (no special fault + primitive needed — just a file-write before Vector restarts). This is a + pure-workload exercise, not a filesystem fault. + +--- + +## Missing SUT Instrumentation + +No Antithesis SDK assertions exist in the codebase. The following SUT-side +assertions would make this property automatically testable: + +1. **`Always` assertion in `update_buffer_size`** (`ledger.rs:681`): before + accumulating a `.dat` file's size, assert that the filename matches the + expected `buffer-data-{N}.dat` pattern. If it does not, log a `warn!` and + skip it (which would also fix the bug). Alternatively, assert + `total_buffer_size_after_seed <= max_buffer_size` immediately after + `increment_total_buffer_size` at `ledger.rs:695` — an `Always` assertion + whose violation proves the seed-overrun condition. + +2. **`Sometimes` writer-progress assertion** in `ensure_ready_for_write` + (`writer.rs` post-wait-loop): once a writer wakes from `wait_for_reader`, + assert it makes progress (the `Sometimes` reachability assertion already in + `writer-eventually-makes-progress`). With a permanently stalled writer this + assertion is never reached — Antithesis will flag the `Sometimes` as + "never observed." + +3. **Workload-level observation:** the workload can assert that write throughput + resumes after a quiet-period drain even when a foreign `.dat` was present at + startup. No SUT modification required for this path. + +--- + +## Open Questions + +- Should `update_buffer_size` filter by the `buffer-data-{N}.dat` naming + pattern (a fix), or assert and reject on unknown `.dat` files (a safer + defensive posture that surfaces the operator error)? The fix and the + assertion are different choices with different user-facing behavior. 
+ +- Is the `data_dir` directory shared with any other Vector component or + external tool that might legitimately write `.dat` files? (If so, the + over-seeding becomes unavoidable without a stricter naming contract.) + +- Can `update_buffer_size` encounter a non-`buffer-data-{N}.dat` file during + normal operation (e.g., from a failed atomic file creation leaving a + partial name)? If yes, this is a normal-operation trigger, not just an + operator-error trigger. + +- After the foreign file inflates `total_buffer_size`, is there any existing + code path that would eventually decrement it back to a correct value (e.g., + if the reader somehow opens the foreign file)? Confirmed no: `delete_completed_data_file` + only runs for file IDs in `[reader_current_data_file .. writer_current_data_file]`; + a foreign file outside that ID range is unreachable from the reader. + +- How does this interact with the `buffer-size-within-max` property? That + property uses actual `.dat` file sizes as the ground truth. With a foreign + `.dat` in the directory, the watchdog sum would also be inflated — the + property's `Always(actual_disk_bytes <= max_buffer_size + max_record_size)` + could falsely fail even though the buffer's own data is within bounds. 
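The first open question above contrasts filtering by name with the current suffix-only check. A minimal sketch of the stricter predicate follows, assuming `{N}` in `buffer-data-{N}.dat` is a run of ASCII digits. Both `is_buffer_data_file` and `seed_total_buffer_size` are hypothetical helpers for illustration and operate on in-memory `(name, size)` pairs rather than the real directory scan.

```rust
/// Returns true only for names of the form `buffer-data-{N}.dat`, where
/// `{N}` is one or more ASCII digits. Hypothetical helper, not Vector's API.
pub fn is_buffer_data_file(file_name: &str) -> bool {
    file_name
        .strip_prefix("buffer-data-")
        .and_then(|rest| rest.strip_suffix(".dat"))
        .map_or(false, |id| !id.is_empty() && id.bytes().all(|b| b.is_ascii_digit()))
}

/// Sums the sizes of recognized data files only, mirroring the seeding step
/// without touching the filesystem. Unrecognized `.dat` names are skipped.
pub fn seed_total_buffer_size<'a, I>(entries: I) -> u64
where
    I: IntoIterator<Item = (&'a str, u64)>,
{
    entries
        .into_iter()
        .filter(|(name, _)| is_buffer_data_file(name))
        .map(|(_, len)| len)
        .sum()
}
```

With this predicate a stray `foreign.dat` contributes nothing to the seed. The alternative posture from the same open question, rejecting startup on an unknown `.dat` file, would reuse the predicate but return an error instead of skipping.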
diff --git a/tests/antithesis/scratchbook/properties/fsync-window-bounded-under-clock-jitter.md b/tests/antithesis/scratchbook/properties/fsync-window-bounded-under-clock-jitter.md new file mode 100644 index 0000000000000..165fc37939c2a --- /dev/null +++ b/tests/antithesis/scratchbook/properties/fsync-window-bounded-under-clock-jitter.md @@ -0,0 +1,239 @@ +--- +slug: fsync-window-bounded-under-clock-jitter +type: Safety / Always +sut_path: lib/vector-buffers/src/variants/disk_v2/ +commit: b7aae737cef5dd37d1445915443a1eb97b584f85 +updated: 2026-05-28 +--- + +# Property: fsync-window-bounded-under-clock-jitter + +## Catalog Entry + +**Type:** Safety / Always + +**Property:** Under Antithesis clock-jitter faults, the durable-loss window for +synced data stays bounded: every write that the writer accepted is either +durable (fsync'd + ledger msync'd) within a bounded multiple of +`flush_interval`, OR it is durable because a data-file rotation forced a +`force_full_flush`. No silent indefinite suppression of `sync_all` occurs. + +**Invariant:** `Always`: the elapsed time since the last +successful `sync_all` and `ledger.flush()` pair, measured +in real wall time, never exceeds a +configurable bound (e.g. `K × flush_interval`, or since the last file +rotation, whichever is shorter). Violations mean the durability SLA +("≤500ms loss window") is silently extended, with no observable signal to the +operator. + +**Antithesis Angle:** Enable Antithesis clock-jitter faults (virtual-time +stretch/compress). A slowed clock prevents `last_flush.elapsed()` from +exceeding `flush_interval` at the normal wall-time cadence, suppressing +`should_flush()` → `sync_all()` indefinitely (only file rotation, which calls +`flush_inner(force_full_flush=true)`, is clock-independent). Crash during the +suppressed window; verify that the data loss is bounded by the last rotation +boundary or a bounded multiple of `flush_interval`, not unbounded.
+ +A second sub-scenario: the CAS winner of `should_flush()` is descheduled +(Antithesis can extend the descheduling window) between winning the CAS and +calling `sync_all()`. Other callers all see `should_flush()=false` (CAS +already consumed). Crash during that window; loss extends silently beyond +`flush_interval`. + +**Why It Matters:** The product's stated guarantee is "data synchronized to +disk will not be lost if Vector crashes; data synchronized every 500ms." A +clock-jitter fault, which is a standard Antithesis capability, can suppress +the entire `sync_all` path indefinitely (only page-cache flushes happen), +silently extending the loss window with no error, no log at ERROR level, and +no watchdog. The only mitigant is file rotation (which is event-count driven, +not clock-driven), making the rotation frequency the de facto maximum loss +window under clock jitter. + +--- + +## Code Verification + +### `should_flush` gate (ledger.rs:485-497) + +```rust +// lib/vector-buffers/src/variants/disk_v2/ledger.rs:485-497 +pub fn should_flush(&self) -> bool { + let last_flush = self.last_flush.load(); + if last_flush.elapsed() > self.config.flush_interval + && self + .last_flush + .compare_exchange(last_flush, Instant::now()) + .is_ok() + { + return true; + } + false +} +``` + +`last_flush.elapsed()` calls `Instant::elapsed`, which is monotonic-clock +relative. Under Antithesis virtual-time compression (clock slowed), this value +advances slower than wall time, suppressing the `> flush_interval` condition. 
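The suppression effect can be modeled without a fault injector. The sketch below is an assumption-laden simplification: it replaces `Instant` and `AtomicCell` with an injected millisecond counter in an `AtomicU64`, and `FlushGate` and `syncs_over` are hypothetical names. It keeps only the shape of the gate (elapsed check, then a CAS) and counts how many syncs fire when the observed clock runs slower than wall time.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Simplified model of the `should_flush` gate. Time is an injected
/// millisecond counter rather than `Instant`, so a test can control it.
pub struct FlushGate {
    last_flush_ms: AtomicU64,
    flush_interval_ms: u64,
}

impl FlushGate {
    pub fn new(start_ms: u64, flush_interval_ms: u64) -> Self {
        Self { last_flush_ms: AtomicU64::new(start_ms), flush_interval_ms }
    }

    /// Elapsed check followed by a CAS, so at most one caller per interval
    /// is told to perform the sync.
    pub fn should_flush(&self, now_ms: u64) -> bool {
        let last = self.last_flush_ms.load(Ordering::Acquire);
        now_ms.saturating_sub(last) > self.flush_interval_ms
            && self
                .last_flush_ms
                .compare_exchange(last, now_ms, Ordering::AcqRel, Ordering::Acquire)
                .is_ok()
    }
}

/// Counts syncs over `wall_ms` of real time when the observed clock advances
/// at `num/den` of wall speed and the writer flushes once per wall-clock ms.
pub fn syncs_over(wall_ms: u64, num: u64, den: u64, flush_interval_ms: u64) -> u64 {
    let gate = FlushGate::new(0, flush_interval_ms);
    (1..=wall_ms)
        .filter(|wall| gate.should_flush(wall * num / den))
        .count() as u64
}
```

In this model, five wall-clock seconds at normal clock speed with a 500 ms interval yield nine syncs, while a clock slowed to 1/100 speed yields none over the same wall time, which is the unbounded-window case the property targets.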
+ +### `flush_inner` — where sync_all is actually called (writer.rs:1299-1321) + +```rust +// lib/vector-buffers/src/variants/disk_v2/writer.rs:1299-1321 +async fn flush_inner(&mut self, force_full_flush: bool) -> io::Result<()> { + if let Some(writer) = self.writer.as_mut() { + writer.flush().await?; // page-cache flush: always happens + self.ledger.notify_writer_waiters(); + } + if self.ledger.should_flush() || force_full_flush { + if let Some(writer) = self.writer.as_mut() { + writer.sync_all().await?; // fsync: only when should_flush() or rotation + } + self.ledger.flush() // ledger msync + } else { + Ok(()) + } +} +``` + +The page-cache path (`writer.flush()`) always runs. The durable path +(`sync_all` + `ledger.flush()`) runs ONLY when `should_flush()=true` or +`force_full_flush=true`. Under clock jitter, `sync_all` can be suppressed +indefinitely. + +### `force_full_flush` is clock-independent (writer.rs:1120-1130) + +File rotation calls `flush_inner(force_full_flush=true)` directly, bypassing +`should_flush()`: + +```rust +// writer.rs:~1124 (inside rotate_data_file) +data_file.sync_all().await?; +``` + +This is the only clock-independent durability checkpoint. If the workload +generates insufficient write volume to trigger rotation, the only durability +interval is the `should_flush()` timer — which is suppressed under clock jitter. + +### `DEFAULT_FLUSH_INTERVAL` (common.rs:31) + +```rust +// lib/vector-buffers/src/variants/disk_v2/common.rs:31 +pub const DEFAULT_FLUSH_INTERVAL: Duration = Duration::from_millis(500); +``` + +### Mitigation: `flush_interval=0` removes clock dependence + +When `flush_interval = Duration::ZERO`, the condition +`last_flush.elapsed() > Duration::ZERO` evaluates to `true` after any +measurable time elapses, effectively making every `flush_inner()` call a +`sync_all`. 
Setting `flush_interval=0` in the harness configuration removes +the clock-jitter attack surface for other durability properties (e.g., +`durable-unacked-events-survive-crash`) but is not the production default. + +### CAS winner descheduling extension + +`should_flush()` uses `AtomicCell::compare_exchange`: exactly one concurrent +caller wins the CAS and becomes responsible for calling `sync_all()`. If that +caller is descheduled between the CAS win (`is_ok()`) and the actual +`sync_all()` call, all other callers see `should_flush()=false` for the +duration. Antithesis can extend this descheduling window arbitrarily. + +--- + +## Fault Conditions + +| Fault | Effect | +|---|---| +| Clock jitter (slow virtual clock) | `Instant::elapsed()` advances slowly; `should_flush()` rarely/never true; `sync_all` suppressed. | +| CPU throttle + slow clock | Writer thread descheduled after CAS win; sync_all delayed past 500ms window. | +| Crash during suppressed window | Loss extends to all data since last file rotation, not just last 500ms. | +| Low write volume (no rotation) | No `force_full_flush` path; clock jitter can suppress sync indefinitely. | + +--- + +## SUT-Side Instrumentation (MISSING — must be added) + +No Antithesis SDK instrumentation exists anywhere in the Vector codebase +(confirmed: `existing-assertions.md`). All assertions below are missing. + +### Assertion 1 — Always: elapsed since last sync stays bounded + +Placed at the end of `flush_inner`, after the `sync_all` branch: + +```rust +// writer.rs, inside flush_inner, after sync_all completes +let elapsed_since_sync = self.ledger.last_flush.load().elapsed(); +// This assertion fires on every flush_inner call that DID do sync_all. +// The bound is set generously to detect extended suppression. 
+antithesis_sdk::assert_always!( + elapsed_since_sync <= MAX_ACCEPTABLE_SYNC_GAP, + "fsync window bounded: elapsed since last sync must be within configurable bound", + &serde_json::json!({ + "elapsed_ms": elapsed_since_sync.as_millis(), + "flush_interval_ms": self.config.flush_interval.as_millis(), + "bound_ms": MAX_ACCEPTABLE_SYNC_GAP.as_millis(), + }) +); +``` + +Note: this assertion only fires when `sync_all` executes. A complementary +workload-side check is needed to detect complete suppression (when +`flush_inner` runs but `sync_all` is never called over a long window). + +### Assertion 2 — Workload-side: monotone sync timestamp + +The workload maintains a shadow `last_sync_wall_time` (from an +Antithesis-provided clock source, not Rust's `Instant`) and asserts that +`now - last_sync_wall_time <= K * flush_interval` periodically, even under +clock jitter. This requires a workload-observable hook (tracing event or +metric emitted when `sync_all` is called). + +Candidate: emit a `tracing::info!` event at the `sync_all` callsite: + +```rust +// writer.rs, after sync_all succeeds +info!(timestamp = ?std::time::SystemTime::now(), "sync_all completed"); +``` + +The workload monitors this trace event and asserts bounded gaps. + +--- + +## Relationship to Other Properties + ++ `durable-unacked-events-survive-crash`: that property assumes a bounded + loss window. Clock jitter extends it, potentially causing events that would + survive under normal timing to be lost. Setting `flush_interval=0` in the + harness is the recommended oracle setup for that property. ++ `writer-eventually-makes-progress`: independent — the deadlock path is + arithmetic, not clock-driven. + +--- + +## Open Questions + ++ Does Antithesis virtual-time affect `Instant::now()` / `Instant::elapsed()` + in Rust's standard library on the target runtime? 
The answer is yes for + Antithesis's standard virtual-time instrumentation, but confirm that + `crossbeam_utils::atomic::AtomicCell` operations in the CAS path + also see virtual time (they likely do since they use the underlying + `Instant` type). + ++ What is the maximum loss window under clock jitter before the FIRST file + rotation if write volume is low (e.g., 1 event/second, `max_data_file_size` + = 128MB)? At default sizes, rotation never fires; loss window is unbounded + under indefinite clock jitter with no incoming writes to trigger rotation. + Worth confirming against the GA doc's "500ms" claim. + ++ Is there a watchdog / health check in Vector that would alert the operator + to a suppressed fsync? Currently: no, the only observability is the + `buffer_byte_size` gauge (masked by the #23561 fix) and tracing at TRACE + level. + ++ Antithesis clock-fault availability: confirm whether clock jitter is + enabled in the target tenant or must be explicitly requested, since this + property is meaningless without it. + ++ The CAS-winner descheduling scenario is a second, distinct sub-property. + Should it be split into its own slug, or is the combined framing sufficient + for one evidence file? 
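The workload-side gap check described under Assertion 2 could be sketched as follows. This is a minimal sketch, not harness code: `max_sync_gap` and `sync_gap_bounded` are hypothetical helper names, and the timestamps are assumed to be offsets from workload start, taken from whatever clock the workload trusts at the moment it observes a `sync_all completed` event.

```rust
use std::time::Duration;

/// Hypothetical helper: largest gap between consecutive observed syncs,
/// including the trailing gap from the last observed sync up to `now`.
/// Returns `None` when no sync has been observed at all.
fn max_sync_gap(sync_times: &[Duration], now: Duration) -> Option<Duration> {
    let last = *sync_times.last()?;
    // Trailing gap: how long it has been since the most recent sync.
    let mut max_gap = now.saturating_sub(last);
    // Gaps between each pair of consecutive syncs.
    for pair in sync_times.windows(2) {
        max_gap = max_gap.max(pair[1].saturating_sub(pair[0]));
    }
    Some(max_gap)
}

/// Hypothetical check: true when every gap is within `k * flush_interval`.
/// A run with no observed sync at all is treated as a violation.
fn sync_gap_bounded(
    sync_times: &[Duration],
    now: Duration,
    flush_interval: Duration,
    k: u32,
) -> bool {
    match max_sync_gap(sync_times, now) {
        Some(gap) => gap <= flush_interval * k,
        None => false,
    }
}
```

Treating "no sync observed" as a violation is a design choice of this sketch: it is the complete-suppression case that the SUT-side assertion cannot see, because that assertion only fires when `sync_all` executes.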
diff --git a/tests/antithesis/scratchbook/properties/graceful-shutdown-flushes-all.md b/tests/antithesis/scratchbook/properties/graceful-shutdown-flushes-all.md new file mode 100644 index 0000000000000..d2a35a1c24c76 --- /dev/null +++ b/tests/antithesis/scratchbook/properties/graceful-shutdown-flushes-all.md @@ -0,0 +1,312 @@ +--- +slug: graceful-shutdown-flushes-all +type: Liveness / Sometimes(graceful_shutdown_lossless) +status: missing +sut_commit: b7aae737cef5dd37d1445915443a1eb97b584f85 +updated: 2026-05-28 +related_issues: + - "vectordotdev/vector#24948 (config-reload stall, partially overlapping concern)" + - "lib/vector-buffers/src/variants/disk_v2/mod.rs (design doc: 'graceful shutdown flushes everything → no loss')" + - "antithesis/scratchbook/sut-analysis.md §5 INV-6" +related_files: + - lib/vector-buffers/src/variants/disk_v2/writer.rs + - lib/vector-buffers/src/variants/disk_v2/ledger.rs + - lib/vector-buffers/src/topology/channel/sender.rs + - src/topology/builder.rs + - src/topology/running.rs +--- + +# Property: graceful-shutdown-flushes-all + +## Catalog Entry + +| Field | Value | +|---|---| +| **Type** | Liveness / Sometimes(graceful_shutdown_lossless) | +| **Property** | `assert_sometimes("graceful_shutdown_lossless", "Vector completes a graceful shutdown without losing any buffered events that were accepted before shutdown began")` | +| **Invariant** | On a graceful stop (SIGTERM, not SIGKILL), all events that were written into the disk buffer and acknowledged at the source level must survive to the downstream sink. No event that is durable (i.e., fsync'd to disk) may be lost; no event that was accepted but only in the 256KB `TrackingBufWriter` page-cache stage may be silently dropped. The buffer spec (mod.rs design doc, `_external-references-digest.md`) claims "graceful shutdown flushes everything → no loss." 
| +| **Antithesis Angle** | Send graceful stop (SIGTERM/`vector stop`) under sustained write load; restart Vector; assert all pre-shutdown accepted events are present at the downstream sink. Compare with the ungraceful-crash (SIGKILL) property where the 500ms fsync window is the known loss bound. | +| **Why It Matters** | The disk buffer is sold as "no data loss on graceful shutdown." If this is violated, the entire durability value proposition fails for the most common operational scenario (planned restarts, deploys, scaling events). | + +## The Core Question: Does `flush()` Get Called Before `Drop`? + +### What `Drop` does (and does not do) + +`BufferWriter::Drop` (writer.rs:1366-1374): + +```rust +impl Drop for BufferWriter { + fn drop(&mut self) { + self.close(); // writer.rs:1372 + } +} +``` + +`close()` (writer.rs:1358-1363) only calls `ledger.mark_writer_done()` and +`ledger.notify_writer_waiters()`. It does NOT call `flush()` or +`flush_inner(true)`. The `BufferWriter::flush()` method is `async` +(writer.rs:1336-1340), and `Drop` is synchronous, so flush-on-drop is +structurally impossible in the current design. + +This means: if any data is staged in the `TrackingBufWriter`'s 256KB internal +buffer (writer.rs:239-251, capacity = `DEFAULT_WRITE_BUFFER_SIZE` = 256 * 1024, +common.rs:37) at the moment `BufferWriter` is dropped, that data is lost +silently. The OS closes the file handle without ever calling `write_all` on the +buffered bytes. + +### The claimed "graceful shutdown flushes" path — does it exist? + +The external references digest and the mod.rs design doc assert: "Graceful +shutdown flushes everything → no loss." This claim must come from somewhere in +the shutdown code path calling `writer.flush()` before the writer is dropped. + +Tracing the shutdown path: + +1. `RunningTopology::stop()` (running.rs:145) sends the shutdown signal to all + sources via `shutdown_coordinator.shutdown_all(deadline)` (running.rs:259). + +2. 
Sources stop producing events. The channel feeding the sink drains naturally. + +3. The sink task runs until its input stream is exhausted (the stream closes + because all sender halves are dropped as the source/transform tasks finish). + +4. The sink's async function (builder.rs:666-704) returns + `TaskOutput::Sink(rx)` when `sink.run(...)` completes. + +5. `rx` is a `Utilization`-wrapped `BufferReceiverStream`. The `BufferSender` + (which holds `Arc>`) is NOT held by `rx` — the sender + side is held by the upstream component's output channel (`self.inputs` in + `RunningTopology`, running.rs). The inputs are dropped when + `RunningTopology::stop()` drops `self` (running.rs:145 takes ownership of + `self`, moving `detach_triggers` and `tasks` into the returned future). + +6. When does the `BufferSender` (and thus `Arc>`) get + dropped? It is stored in `self.inputs: HashMap>` + (running.rs). When `stop()` moves `self`, `inputs` is dropped as part of + `RunningTopology`. But `stop()` moves only `self.tasks` and + `self.source_tasks` into the returned future; `self.inputs` (and other + fields) are dropped synchronously when `stop()` is called. + + **Critical ordering question:** Does `self.inputs` (containing the + `BufferSender`) get dropped before the tasks finish? If yes, the + `Arc>`'s reference count drops to zero before the sink + task has drained the reader side — meaning `BufferWriter::drop` runs while + events are still being processed. If no, the `Arc` outlives the tasks + somehow. + + Looking at running.rs:145-267: `stop()` takes `self` by value. The fields + that are NOT moved into the returned future are dropped at the end of the + `stop()` function body before it returns the future. `self.inputs` is one + such field. The `tasks` future only polls the already-spawned JoinHandles; + `inputs` is gone. 
So: **`inputs` (and the `BufferSender`) are dropped when + `stop()` returns the future, which is before the future is polled/completed.** + + This means `BufferWriter` is dropped (and `Drop::drop` calls `close()` but + not `flush()`) potentially while the sink task is still running its final + drain. However, because the sink task holds `rx` (the reader side), not the + writer side, this may be acceptable for the reader but means any + still-staged writes are lost. + +7. The write loop (`SenderAdapter::DiskV2` flush path, sender.rs:86-98) calls + `writer.flush().await` after every successful `write_record` via the + `send` → `write_record` → `flush` cycle. So every event that was + successfully processed by the write loop has been flushed to the OS page + cache (and possibly fsync'd to disk if 500ms elapsed). The residual in + `TrackingBufWriter` at shutdown time would only be from a batch that was + encoded but for which the async flush had not yet been awaited. + + Under graceful shutdown: the source stops, the mpsc channel closes, the + write loop's `next()` returns `None` and the loop exits. At that point, + the last call to `send` should have already triggered a `flush`. If the + write loop always calls `flush` after each `write_record`, and the source + sent its last event, and `flush` was awaited successfully, then + `TrackingBufWriter` should be empty when the loop exits and the + `BufferSender` is dropped. + + **But:** is there a race between the write loop's final `flush` completing + and the drop of `inputs`? The write loop runs in an async task. The + `inputs` drop happens when `stop()` is called (synchronously, before the + future is polled). If `stop()` is called before the write loop task + completes its final `flush`, the `BufferWriter` can be dropped mid-flush. 
+ +### The `should_flush` / 500ms fsync gate + +Even if `flush()` is called (moving data from `TrackingBufWriter` to the OS +page cache), the full fsync (`sync_all`) + ledger msync is only done if +`should_flush()` returns true (writer.rs:1312), which requires ≥500ms since +the last full flush. On graceful shutdown, if the last periodic fsync was +recent, the OS-page-cache data is present and readable (Linux guarantees +this), but is not fsync'd to disk. Dirty pages in the page cache belong to the +kernel, not to the process: Linux does not flush them at process exit, but its +background writeback persists them later whether or not the process is still +alive. So after a graceful shutdown the data reaches disk provided the host +keeps running; it is lost only if the machine or VM panics or loses power +before writeback. **But this is OS-behavior-dependent, not Vector-guaranteed.** + +For SIGKILL: the 500ms window is a documented known data-loss bound. +For graceful stop: Vector relies on kernel writeback after exit, not on +an explicit `sync_all` before exit. This is an implicit guarantee that breaks +if the host crashes or loses power before writeback completes; an OOM kill +after SIGTERM would additionally lose whatever was still staged in the +user-space `TrackingBufWriter`. + +### The `flush_inner(force_full_flush=true)` path + +`flush_inner(force_full_flush=true)` (writer.rs:1299-1321) is called during +file rotation (writer.rs:1041) to force a full fsync. It is NOT called on +graceful shutdown through any currently observable code path. The normal +`flush()` → `flush_inner(false)` path skips `sync_all` if the 500ms gate +hasn't fired. + +## Contrast with the SIGKILL / Crash Property + +| Scenario | Expected loss bound | Mechanism | +|---|---|---| +| SIGKILL / abrupt crash | Up to 500ms of page-cache writes | fsync only every 500ms | +| Graceful stop (SIGTERM) | Claimed: zero | Depends on final flush being called before Drop | +| Config reload | Up to 256KB of TrackingBufWriter data | Drop calls close() not flush() | + +The Antithesis property for graceful shutdown specifically tests the "claimed: +zero" row.
It is a **liveness / sometimes** property (not always, because +Antithesis needs to observe at least one graceful shutdown completing +successfully to confirm the claim is reachable — hence `Sometimes`). + +## Antithesis Test Design + +### Test scenario + +``` +Setup: + - Vector with a source feeding into a sink with disk buffer. + - A downstream HTTP mock sink that records every event received and + its sequence ID. + - A workload driver that: + (a) sends events with sequence IDs, tracking every ID sent and + every ID acknowledged by the Vector HTTP source, + (b) after N events, sends SIGTERM to Vector (graceful stop), + (c) polls the Vector health endpoint until it stops responding + (shutdown complete), + (d) restarts Vector with the same configuration, + (e) waits for buffer drain (reader catches up on restart), + (f) asserts: acknowledged_ids == downstream_received_ids. + +Antithesis fault injection: + - CPU throttle during shutdown (slow the final flush race). + - Disk I/O slowdown during shutdown (test whether fsync completes + before process exit). + - Clock jitter (shrink/expand the 500ms fsync window relative to + shutdown timing). + +Key assertion: + - assert_sometimes("graceful_shutdown_lossless", + acked_before_sigterm_count == downstream_count, + {acked_count, downstream_count, flush_completed, sigterm_timestamp}) + +Secondary assertion (contrast): + - assert_always("graceful_beats_crash", + graceful_shutdown_loss_count <= crash_loss_count) + (Graceful shutdown must never lose more than a crash, and should + ideally lose zero.) +``` + +### What to look for in Antithesis output + +- Any event with a sequence ID in the "acknowledged before SIGTERM" set that + does not appear in the downstream sink after drain. +- Whether `buffer_discarded_events_total` or `component_discarded_events_total` + increments during shutdown (indicating the buffer itself is accounting for + the drop) — vs. a completely silent drop (no counter increments, event just + gone). 
+- Whether the 500ms fsync window creates any detectable loss on graceful + shutdown when CPU/IO throttling slows the shutdown sequence. + +### Instrumentation to add (all new — zero Antithesis SDK exists today) + +Add to `BufferWriter::close()` (writer.rs:1358) or to the point where the +write loop exits cleanly: + +```rust +// Argument order: condition, message, details. +antithesis_sdk::assert_always!( + self.unflushed_bytes == 0, + "writer_unflushed_bytes_zero_on_close", + &serde_json::json!({ + "unflushed_bytes": self.unflushed_bytes, + "unflushed_events": self.unflushed_events, + }) +); +``` + +This assertion fires if `close()` is called while the `TrackingBufWriter` +still has staged bytes, directly surfacing the bug. + +Add at the workload level after drain completes: + +```rust +// Argument order: condition, message, details. +antithesis_sdk::assert_sometimes!( + acked_count == downstream_count, + "graceful_shutdown_lossless", + &serde_json::json!({ + "acked_count": acked_count, + "downstream_count": downstream_count, + }) +); +``` + +## Open Questions + +1. **Does `stop()` call flush before dropping `inputs`?** This is the central + unknown. Reading running.rs:145-267: `stop()` moves `self` fields into the + returned future selectively. `inputs` appears to be dropped synchronously. + If `inputs` is dropped before the write loop task finishes its final `flush`, + the `BufferWriter` is dropped with staged data. **Verification required: + add a tracing log at `TrackingBufWriter::drop` showing `buf.len()` and + confirm it is 0 on graceful shutdown.** + +2. **Does the tokio runtime drain all tasks before the process exits?** If + `stop()` is awaited and all task JoinHandles complete, then all async work + (including the write loop's final flush) is done before `stop()` returns. + But the `inputs` `HashMap` is dropped inside `stop()` before the future is + returned, creating the race described above. The question is whether the + drop order within `stop()` puts the write-loop task completion before the + `inputs` drop or after. + +3.
**Does the OS page-cache flush on process exit cover the gap?** For graceful + shutdown (clean process exit after SIGTERM), Linux does not flush dirty + page cache at process exit; the dirty pages stay in the kernel and are + persisted by background writeback after the process is gone. This covers + the case where the 500ms fsync window has not yet fired but data is in the + page cache, as long as the host stays up. If the process is forcefully + killed (OOM killer, watchdog) after receiving SIGTERM, data already in the + page cache still survives, but anything staged in the user-space + `TrackingBufWriter` is lost; page-cache data is lost only if the host itself + crashes or loses power before writeback. This scenario + is intermediate between graceful and crash and is not currently tested. + +4. **Does the finalizer task complete before the buffer is re-opened on + restart?** On graceful shutdown + restart, the finalizer task + (ledger.rs:701-710, spawned as a tokio task holding `Arc`) must + exit before the new process opens the same buffer directory. Since the + finalizer task exits when the `OrderedFinalizer` sender side is dropped + (which happens when `BufferWriter` is dropped), and the old process + completes before the new one starts, this is safe for restart — but + not for in-process config reload (see config-reload-no-silent-loss.md). + +5. **`topology_disk_buffer_flushes_on_idle` test (src/topology/test/mod.rs:822):** + This test confirms events are readable before shutdown fires, but it + explicitly stops the topology only **after** receiving both copies of the + event (line 870: `topology.stop().await`). It does not test whether the + stop itself would have flushed pending events. It is not a loss test. + +6. **Relationship to INV-6 (sut-analysis.md §5):** INV-6 states "graceful + shutdown flushes (but see §10 — `BufferWriter::Drop` does NOT flush)." + This property operationalizes that open question into a testable Antithesis + `Sometimes` assertion. + +--- + +### Investigation Log + +#### Does Vector topology call `writer.flush()` before dropping the writer (`running.rs` `stop()` drop-order)?
+ +**Examined:** `src/topology/running.rs:145–268` (`stop()` body), `lib/vector-buffers/src/variants/disk_v2/writer.rs:1358–1374` (`close()` and `Drop`). + +**Found:** `stop()` at running.rs:145 takes `self` by value. It moves `self.tasks` and `self.source_tasks` into the `wait_handles` / `check_handles` futures at lines 157–161, and drops `self.inputs` (the `HashMap>`) implicitly when `stop()` returns the future at line 267. Because `inputs` is not moved into the returned future, it is dropped synchronously when `stop()` is called — before the future is awaited, and therefore before the JoinHandles in `wait_handles` complete. The `BufferSender` (which holds `Arc>`) lives in `inputs`; when `inputs` drops, the `Arc` reference count decrements. If this is the last reference, `BufferWriter::drop` fires, calling only `close()` (writer.rs:1372), not `flush()`. + +**Found — write-loop flush behavior:** The write loop (`SenderAdapter::DiskV2` path in sender.rs) calls `writer.flush().await` after each `write_record`. Under graceful shutdown, the source stops producing events, the mpsc channel feeding the write loop drains, the loop's `next()` returns `None`, and the loop exits. If the loop called `flush()` after the last record before exiting, `TrackingBufWriter` should be empty at the time of drop. The race is: does the `inputs` drop (which drops `BufferSender` → `BufferWriter`) occur before or after the write-loop task's final `flush().await` completes? + +**Not found:** An explicit `flush()` call in the topology teardown path between the write-loop task completing and `BufferWriter::drop`. The write loop runs in a spawned async task; `inputs` is dropped synchronously in `stop()` before the future is polled. If the write-loop task has not yet returned when `stop()` discards `inputs`, the `Arc>` may still have a reference held by the write-loop task — in which case `BufferWriter::drop` fires when the write-loop task finally exits, not when `inputs` drops. 
Whether `TrackingBufWriter` is empty at that point depends on whether the write loop's final `flush()` completed before the task returned. + +**Conclusion:** The race is unresolved by code inspection alone. The most likely safe scenario is: source stops → mpsc channel closes → write loop's final `flush()` completes → write-loop task returns → `Arc` reference drops → `BufferWriter::drop` fires with empty `TrackingBufWriter`. But if `inputs` drops the `Arc` before the write-loop task returns and it was the last reference, `Drop` fires early. Verification requires either adding a `buf.len()` trace at `TrackingBufWriter::drop` and observing it under graceful shutdown, or reading the write-loop task's exact `Arc` lifetime. This is flagged as an unresolved race requiring code tracing or an instrumented test. diff --git a/tests/antithesis/scratchbook/properties/ledger-corruption-no-sigbus-crashloop.md b/tests/antithesis/scratchbook/properties/ledger-corruption-no-sigbus-crashloop.md new file mode 100644 index 0000000000000..14d42e79290b7 --- /dev/null +++ b/tests/antithesis/scratchbook/properties/ledger-corruption-no-sigbus-crashloop.md @@ -0,0 +1,292 @@ +--- +slug: ledger-corruption-no-sigbus-crashloop +catalog_category: 3 — Crash Durability & Recovery +type: Safety / AlwaysOrUnreachable +status: cataloged (Category 7) +related: + - recovery-completes-after-crash + - record-id-monotonicity-holds + - no-corrupted-record-delivered +commit: b7aae737cef5dd37d1445915443a1eb97b584f85 +--- + +### ledger-corruption-no-sigbus-crashloop — Ledger Corruption Yields a Clean Init Error, Not a SIGBUS Crash Loop + +| | | +|---|---| +| **Type** | Safety | +| **Property** | If `buffer.db` is externally truncated or otherwise corrupted before or during Vector startup, the corruption is detected by rkyv `CheckBytes` validation in `BackedArchive::from_backing` and reported as a clean `LedgerLoadCreateError::FailedToDeserialize`, never as a SIGBUS signal mid-operation or as an infinite crash 
loop. | +| **Invariant** | `AlwaysOrUnreachable`: if the ledger file is corrupted, the process either (a) detects it at `BackedArchive::from_backing` call (`ledger.rs:621`) and returns a `LedgerLoadCreateError`, OR (b) the ledger file is valid and this path is never taken. A SIGBUS-generating memory access against a truncated mmap'd region is `Unreachable` during normal operation; a restart loop (SIGBUS → crash → restart → SIGBUS again) is `Unreachable`. | +| **Antithesis Angle** | Filesystem fault: truncate or zero-fill `buffer.db` while Vector is stopped (before restart) or while it is running (after the file is mmap'd). Assert that Vector (a) either restarts cleanly with a fresh buffer, or (b) emits a detectable error and exits cleanly — it does not loop on SIGBUS signals. Requires Antithesis filesystem-fault capability to truncate a file from outside the process; flag this as potentially unavailable in some tenant configurations. | +| **Why It Matters** | The `buffer.db` ledger is memory-mapped via `memmap2::MmapMut` (`io.rs:161`). There is no SIGBUS handler anywhere in the Vector codebase. If the mapped file is truncated while mapped, any read/write of the now-unmapped pages delivers SIGBUS, which is an unhandled signal and crashes the process. On restart, `open_mmap_writable` re-maps the same truncated file — the same access pattern fires again. The result is an infinite crash loop with no operator-visible error message, indistinguishable from a persistent hardware fault. The theoretical protection (`CheckBytes` at `from_backing`) is only effective at init time, before the mmap is held live. | + +--- + +## What Led to This Property + +The SUT analysis (§6 item 9, §8) flagged the mmap'd ledger as a SIGBUS risk. +This evidence file traces the exact code path from ledger load to SIGBUS +exposure and explains why `CheckBytes` is not sufficient as a defense. 
+ +### The mmap path in `load_or_create` + +`Ledger::load_or_create` (`ledger.rs:556–651`) performs the following sequence: + +1. Opens `buffer.db` as a read-write file (`ledger.rs:580–584`). +2. Checks whether the file is empty; if so, writes the serialized default + `LedgerState` and calls `sync_all` (`ledger.rs:590–612`). +3. Opens the same file as a writable mmap: + + ```rust + let ledger_mmap = config + .filesystem + .open_mmap_writable(&ledger_path) // ledger.rs:616–620 + .await + .context(IoSnafu)?; + ``` + + In `ProductionFilesystem`, `open_mmap_writable` opens the file and calls + `unsafe { memmap2::MmapMut::map_mut(&std_file) }` (`io.rs:157–162`). This + is the point at which the OS maps the file into the process address space. +4. The mmap is passed to `BackedArchive::from_backing(ledger_mmap)`: + + ```rust + let ledger_state = match BackedArchive::from_backing(ledger_mmap) { + Ok(backed) => backed, // ledger.rs:622–629 + Err(e) => { + return Err(LedgerLoadCreateError::FailedToDeserialize { + reason: e.into_inner(), + }); + } + }; + ``` + + `from_backing` calls `check_archived_root::(backing.as_ref())` + (`backed_archive.rs:73`), which invokes rkyv's `CheckBytes` validation on + the entire mapped region. If the file is truncated (shorter than + `LEDGER_LEN = align16(mem::size_of::())`), this read + will access pages beyond the file end — either via the byte-slice produced + by `AsRef<[u8]>` on the `MmapMut`, or through a bounds check in memmap2. + In practice, `memmap2::MmapMut::as_ref()` returns a slice bounded by the + mmap length (not the file length), so a short file produces a short slice + and `check_archived_root` may return an error rather than SIGBUS — **at + init time**. + +### Where SIGBUS becomes a live risk + +The SIGBUS risk materialises **after** init completes and the `Ledger` struct +holds the live `BackedArchive` as `self.state` +(`ledger.rs:217`). The ledger fields (`writer_next_record`, `reader_last_record`, +etc.) 
are `AtomicU64`/`AtomicU16` values overlaid on the mapped region via the +`ArchivedLedgerState` projection (`backed_archive.rs:89`): + +```rust +pub fn get_archive_ref(&self) -> &T::Archived { + unsafe { archived_root::(self.backing.as_ref()) } +} +``` + +Every ledger field read or write (`get_total_buffer_size`, `increment_pending_acks`, +`flush`, etc.) accesses memory in the mapped region. If the backing file is +truncated **after** the init-time `CheckBytes` validation has passed (i.e., +while Vector is running), subsequent accesses to the mapped pages produce +SIGBUS. There is no `SIGBUS`/`SIGFPE` signal handler anywhere in the codebase +(confirmed by repo-wide grep). SIGBUS is fatal by default on Linux; the process +is killed. + +### The crash-loop path + +After a SIGBUS kill, Vector's process supervisor (systemd, Docker restart +policy, Kubernetes) restarts it. On the next startup, `open_mmap_writable` +re-maps `buffer.db`. If the file is still truncated: + +- If it is short enough that `check_archived_root` on the resulting + short mmap slice fails validation → `FailedToDeserialize` error → Vector + exits cleanly with an error log. **This is the safe path.** +- If the file is truncated to a length that is a multiple of the mmap page + size but shorter than `LEDGER_LEN` (an edge case on Linux where the OS + rounds mmap length up to a page boundary), the mmap may succeed and the + slice appears long enough, `CheckBytes` passes, but accesses to the + truncated pages that were zero-filled by the OS may yield plausible-looking + zero data rather than a SIGBUS. Whether this is valid depends on whether + the rkyv `ArchivedLedgerState` layout treats zero-valued atomics as a + valid state — the `LedgerState::default()` impl (`ledger.rs:109+`) starts + all fields at 0, so a zero-filled truncated ledger may appear valid, causing + Vector to start normally with a reset ledger rather than detecting corruption. 
+ This "silent reset" is a distinct failure mode from the crash loop. +- If the file is zero-length, the init-time file-is-empty check (`ledger.rs:590`) + triggers re-initialization with the default state — this is the **correct + recovery path** and the only case where existing code handles truncation + gracefully. + +### The no-SIGBUS-handler confirmation + +``` +grep -rn "signal\|SIGBUS\|sigaction\|SignalKind\|unix::signal" \ + lib/vector-buffers/ src/ # 0 relevant matches +``` + +No SIGBUS handler is installed. The process will receive the default SIGBUS +disposition (terminate + core dump). The behavior on truncation during live +operation is therefore: immediate process death with a SIGBUS, no flush, no +ledger close, no error log. + +--- + +## Code References + +| Location | Relevance | +|---|---| +| `lib/vector-buffers/src/variants/disk_v2/io.rs:157–162` | `open_mmap_writable`: `unsafe { memmap2::MmapMut::map_mut(&std_file) }` — the mmap creation point; no SIGBUS guard | +| `lib/vector-buffers/src/variants/disk_v2/ledger.rs:616–630` | `load_or_create`: mmap opened, then passed to `BackedArchive::from_backing` for `CheckBytes` validation | +| `lib/vector-buffers/src/variants/disk_v2/backed_archive.rs:68–80` | `BackedArchive::from_backing`: calls `check_archived_root` — the only structural validation; only runs at init time | +| `lib/vector-buffers/src/variants/disk_v2/backed_archive.rs:88–91` | `get_archive_ref`: `unsafe { archived_root::(self.backing.as_ref()) }` — live mmap accesses; SIGBUS risk point | +| `lib/vector-buffers/src/variants/disk_v2/ledger.rs:217` | `state: BackedArchive` — the held live mmap | +| `lib/vector-buffers/src/variants/disk_v2/ledger.rs:253` | `pub fn state(&self) -> &ArchivedLedgerState` — every field access goes through this | +| `lib/vector-buffers/src/variants/disk_v2/ledger.rs:507–509` | `flush`: `self.state.get_backing_ref().flush()` — calls `MmapMut::flush` (msync); SIGBUS risk if pages are unmapped | +| 
`lib/vector-buffers/src/variants/disk_v2/ledger.rs:590` | Zero-length file check — the **only** existing graceful-recovery path for truncation | +| `lib/vector-buffers/src/variants/disk_v2/ledger.rs:34–75` | `LedgerLoadCreateError` variants — `FailedToDeserialize` is the intended corruption signal | + +--- + +## What Breaks + +**Scenario 1 — Corruption before restart (init-time detection, safe path):** +`buffer.db` is written with garbage bytes or truncated to a non-zero, non-page-aligned +length before Vector starts. `check_archived_root` (`backed_archive.rs:73`) +detects the invalid layout and returns `FailedToDeserialize`. Vector logs an +error and exits. No SIGBUS. This is the **intended behavior** and works correctly +as long as the corruption is structurally visible to rkyv. + +**Scenario 2 — Truncation to zero length or to a page-aligned length (silent reset, unexpected behavior):** +`buffer.db` is truncated to exactly 0 bytes before restart. The zero-length +check at `ledger.rs:590` triggers `LedgerState::default()` initialization and +writes a fresh ledger. Vector starts with a reset ledger, treating all data +files as unknown. This loses the reader's acked position — potentially +re-delivering already-acked records. Unexpected but non-crashing. A +page-aligned truncation whose zero-filled pages still pass `CheckBytes` (the +edge case traced in the crash-loop path) can produce the same silent reset. + +**Scenario 3 — Truncation while running (SIGBUS, worst case):** +`buffer.db` is truncated while the live mmap is held. The next +`get_archive_ref()` call (via `state()`) — which happens on every write, every +ack, every flush — accesses a now-unmapped page. SIGBUS is delivered. No error +log; the process dies. Supervisor restarts Vector. Depending on how the file +is left, scenarios 1 or 2 apply on restart. If the file was truncated but not +to zero, scenario 1's `CheckBytes` validation may catch it, giving a clean +error and stopping the crash loop. If the truncation is to zero, scenario 2 +applies and the crash loop terminates.
If the truncation lands in a +page-aligned-but-structurally-valid range, scenario 2's silent reset may occur. + +**Infinite crash loop condition:** the loop is only truly infinite if the file +is truncated in a way that passes `CheckBytes` but still produces a SIGBUS +during runtime access. Given that `check_archived_root` reads the slice +boundaries, this is unlikely for most truncation patterns — but not formally +impossible, particularly given that `CheckBytes` for `LedgerState` is +auto-derived (`ledger.rs:93`) and may not check all cross-field invariants. + +--- + +## Fault Conditions + +This property requires **filesystem-fault capability** to truncate or corrupt a +file from outside the process. In Antithesis, this is typically available as a +filesystem-level fault or via a workload container that shares the buffer +volume. However: + +- Some Antithesis tenant configurations may restrict filesystem faults on + non-network volumes. +- **Flag to the user:** confirm that the Antithesis tenant allows + write/truncate of files in the buffer data directory from a workload + container or fault injector. +- If filesystem faults are unavailable, this property degrades to a + "documented risk without test coverage" — still worth cataloging as a + gap. + +A weaker version of this property is testable without filesystem faults: +inject a corrupt `buffer.db` file at container startup (before Vector +starts), using a workload init script. This covers scenario 1 (init-time +detection) but not scenario 3 (live truncation). + +--- + +## Missing SUT Instrumentation + +No Antithesis SDK assertions exist. Needed: + +1. **`AlwaysOrUnreachable` assertion** at the mmap-access point in + `get_archive_ref` (`backed_archive.rs:88`): the function is called in + contexts where the underlying file's size has not been re-validated since + init. 
An assertion here would be logically `Unreachable` for SIGBUS (which + terminates the process before an assertion fires), but the SUT-side + instrumentation needed is a **pre-access size check**: before calling + `archived_root`, assert `self.backing.as_ref().len() >= LEDGER_LEN`. Any + violation would catch the case where a live mmap has shrunk below the + required layout size. + +2. **Workload-level observation:** a restart after SIGBUS is observable by + the workload if it monitors the Vector process exit code (SIGBUS is signal + 7 on Linux x86-64 and aarch64, so a shell reports exit status 135, i.e. + 128 + 7; 138 applies only where SIGBUS is signal 10, e.g. macOS). A pattern + of repeated SIGBUS exits is the crash-loop signal. + +3. **Clean error detection:** an `AlwaysOrUnreachable` assertion in the + `BackedArchive::from_backing` `Err` arm (`ledger.rs:625–629`) confirms that + the `FailedToDeserialize` path actually fires on corruption — i.e., that + the recovery runs, not that it's dead code. + +--- + +## Open Questions + +- Does `memmap2::MmapMut::as_ref()` return a slice bounded by the file length + at mmap time, or by the current file length? If the former, `CheckBytes` at + init sees the correctly-sized slice even if the file is later grown, but a + truncation after init is still invisible to the slice bounds. If the latter, + a live truncation immediately narrows the slice and all in-flight references + become dangling — which is a soundness issue in the `unsafe archived_root` + call at `backed_archive.rs:89`. This is an open question about memmap2's + behavior that determines the exact SIGBUS trigger condition. + +- Is there a reason no SIGBUS handler is installed? If the intent is to + treat SIGBUS as a fatal bug (correct) rather than a recoverable error, then + the defense must be pre-access validation — the current code has none + outside of init. + +- Does the `LedgerState::default()` zero-fill path (`ledger.rs:109–121`, + assuming standard derive) produce a structurally valid `ArchivedLedgerState` + that rkyv's `CheckBytes` will accept?
If the all-zeros layout is not a valid + rkyv archive, then truncation to zero would fail `CheckBytes` and trigger + `FailedToDeserialize` rather than scenario 2's silent reset. Clarify which + is the actual behavior. + +- Should `load_or_create` validate that `buffer.db` is exactly `LEDGER_LEN` + bytes before attempting to mmap it, rather than relying on `CheckBytes` to + catch layout violations? A length mismatch is a stronger, simpler guard + than rkyv structural validation and would close the gap for all truncation + scenarios at init time. + +- Is there a path where `buffer.db` grows beyond `LEDGER_LEN` (e.g., due to + an OS-level race or a write to the wrong offset)? If so, `CheckBytes` would + still pass (the extra bytes are ignored) but the archive root pointer would + point into the wrong region of the extended file, potentially yielding + corrupted field values without a detectable error. + +--- + +### Investigation Log + +#### Is filesystem-fault injection available in the Antithesis tenant? + +**Examined:** Evidence file prose at the "Fault Conditions" section (above), Antithesis documentation (not re-fetched — relying on existing knowledge). + +**Not found:** No confirmation in the codebase or local docs that the Antithesis tenant configuration used for Vector testing enables write/truncate filesystem faults on non-network volumes. This capability varies by tenant configuration and must be verified with the Antithesis engagement team. + +**Conclusion:** This question requires human input. The property is partially testable without live filesystem faults (inject a corrupt `buffer.db` at container startup before Vector starts, covering Scenario 1 — init-time detection), but Scenario 3 (live truncation while the mmap is held) requires the Antithesis tenant to support filesystem-level fault injection on the buffer data directory. Flag to the Antithesis team for confirmation before relying on this property for live-truncation coverage. 
+ +#### Does an all-zeros `LedgerState` pass `CheckBytes` (silent-reset vs. `FailedToDeserialize` on zero-truncation)? + +**Examined:** `ledger.rs:590` (zero-length file check), `ledger.rs:109+` (implied `LedgerState::default`), `backed_archive.rs:68–80` (`from_backing` / `check_archived_root`), `ledger.rs:93` (`#[derive(...)]` on `LedgerState`). + +**Found:** The zero-length file check at ledger.rs:590 handles the empty-file case before the mmap path is reached — Vector re-initializes with `LedgerState::default()` and writes a fresh ledger. For a non-zero truncation that lands on a page boundary with zero-filled pages (OS behavior on Linux for sparse/truncated mmap regions), `check_archived_root` would validate the all-zeros bytes against the rkyv-archived `LedgerState` layout. Since `LedgerState` fields are all numeric types (AtomicU16, AtomicU64) and rkyv's `CheckBytes` for primitives validates alignment and range — all zeros is a valid representation for all numeric types — the all-zeros layout would likely pass `CheckBytes` and yield a ledger with all fields at 0 (equivalent to a fresh ledger). This means truncation to a page-aligned non-zero length could silently reset the ledger rather than returning `FailedToDeserialize`. + +**Not found:** Definitive confirmation of rkyv `CheckBytes` behavior for the exact `ArchivedLedgerState` layout without running the code. The `#[derive(CheckBytes)]` on `LedgerState` is auto-generated; it validates alignment and field validity but does not enforce cross-field invariants (e.g., that writer ID >= reader ID). + +**Conclusion:** The all-zeros silent-reset scenario is plausible but not formally confirmed without running the rkyv `check_archived_root` against an all-zeros buffer. This remains a theoretical risk; the definitive answer requires either a unit test or a code trace of the generated `CheckBytes` impl. Flagged as an open sub-question pending code verification or a targeted test. 
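One way to close the page-aligned-truncation gap discussed above, independent of how `CheckBytes` treats all-zero bytes, is an explicit length comparison before the file is mapped. The sketch below is only an illustration under stated assumptions: `ASSUMED_LEDGER_LEN`, `LedgerLenCheck`, `classify_ledger_len`, and `check_ledger_file` are hypothetical names, and the constant's value is a placeholder rather than the real `LEDGER_LEN`.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Placeholder for the SUT's real ledger size (assumption, not the actual value).
const ASSUMED_LEDGER_LEN: u64 = 32;

/// Outcome of comparing the on-disk length with the expected ledger length.
#[derive(Debug, PartialEq, Eq)]
enum LedgerLenCheck {
    /// Zero-length file: the existing init path re-creates the ledger.
    Empty,
    /// Length matches exactly; safe to proceed to structural validation.
    Exact,
    /// Any other length: truncated or grown, report instead of mapping.
    Mismatch { found: u64 },
}

/// Pure classification, so the policy can be tested without touching disk.
fn classify_ledger_len(found: u64, expected: u64) -> LedgerLenCheck {
    if found == 0 {
        LedgerLenCheck::Empty
    } else if found == expected {
        LedgerLenCheck::Exact
    } else {
        LedgerLenCheck::Mismatch { found }
    }
}

/// Thin wrapper that reads the length from file metadata before any mmap.
fn check_ledger_file(path: &Path) -> io::Result<LedgerLenCheck> {
    let found = fs::metadata(path)?.len();
    Ok(classify_ledger_len(found, ASSUMED_LEDGER_LEN))
}
```

A `Mismatch` result at init would turn the silent-reset case into an explicit error. It does not help with scenario 3, because the file can still shrink after the check while the mapping is live.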
diff --git a/tests/antithesis/scratchbook/properties/no-corrupted-record-delivered.md b/tests/antithesis/scratchbook/properties/no-corrupted-record-delivered.md new file mode 100644 index 0000000000000..c84acba4d31f3 --- /dev/null +++ b/tests/antithesis/scratchbook/properties/no-corrupted-record-delivered.md @@ -0,0 +1,161 @@ +# Evidence: no-corrupted-record-delivered + +## Property Identification + +**Slug:** no-corrupted-record-delivered +**Type:** Safety / AlwaysOrUnreachable +**Assertion macro:** `assert_always_or_unreachable!("no_corrupted_record_delivered", "Record emitted to sink passed CRC32C checksum and rkyv CheckBytes validation", ...)` + +This property was identified during review of the read path in `lib/vector-buffers/src/variants/disk_v2/reader.rs`. The buffer's entire durability story rests on the claim that a record failing integrity validation is never delivered to a downstream sink. The property is "always or unreachable" because corruption is expected to be rare (it requires a real fault), but when it does occur the guard must hold unconditionally. + +## Code Chain Leading to the Property + +### The Validation Gate: `try_next_record` (reader.rs:306-351) + +`RecordReader::try_next_record` is the sole entry point for reading a record from a data file. After accumulating `record_len` bytes into `self.aligned_buf`, it calls: + +```rust +// reader.rs:338-350 +match validate_record_archive(buf, &self.checksummer) { + RecordStatus::FailedDeserialization(de) => Err(ReaderError::Deserialization { ... }), + RecordStatus::Corrupted { calculated, actual } => Err(ReaderError::Checksum { ... }), + RecordStatus::Valid { id, .. } => { + self.current_record_id = id; + Ok(Some(ReadToken::new(id, 8 + buf.len()))) + } +} +``` + +A `ReadToken` is only produced on the `Valid` arm. The two failure arms return `Err`, never a token. 
+ +### `validate_record_archive` (record.rs:177-182) + +```rust +pub fn validate_record_archive(buf: &[u8], checksummer: &Hasher) -> RecordStatus { + match try_as_record_archive(buf) { + Ok(archive) => archive.verify_checksum(checksummer), + Err(e) => RecordStatus::FailedDeserialization(e), + } +} +``` + +Two independent checks are applied in sequence: + +1. **rkyv `CheckBytes` / deserialization** via `try_as_record_archive` → `try_as_archive::>`. This calls the hand-written `CheckBytes` implementation at `record.rs:79-117`. +2. **CRC32C verification** via `ArchivedRecord::verify_checksum` (record.rs:144-155), which recomputes `CRC32C(BE(id) || BE(metadata) || payload)` and compares against the stored `checksum` field. + +### The Hand-Written `CheckBytes` (record.rs:75-117) + +Because of an upstream rkyv ICE (`rkyv` issue #221), `ArchivedRecord` uses a manual `unsafe` `CheckBytes` implementation instead of a derived one. It validates each field pointer individually: + +- `Archived::check_bytes` for `checksum` (record.rs:90-95) +- `Archived::check_bytes` for `id` (record.rs:96-101) +- `Archived::check_bytes` for `schema_metadata` (record.rs:102-107) +- `ArchivedBox<[u8]>::check_bytes` for `payload` (record.rs:108-113) + +This is a **manual unsafe validation surface**. Correctness depends entirely on the author having correctly replicated what the derived implementation would have done. A derived-vs-manual divergence (e.g., missing field, wrong offset) would create a gap between what passes `CheckBytes` and what is structurally sound. This cannot be caught by CRC32C alone if the structural corruption happens to produce a valid archive layout with correct-looking raw bytes. + +### Corruption Response: `is_bad_read` + `roll_to_next_data_file` (reader.rs:132-139, 1035-1036) + +```rust +// reader.rs:132-139 +fn is_bad_read(&self) -> bool { + matches!( + self, + ReaderError::Checksum { .. } + | ReaderError::Deserialization { .. 
} + | ReaderError::PartialWrite + ) +} +``` + +In `BufferReader::next` (reader.rs:1009-1040): + +```rust +Err(e) => { + if e.is_bad_read() { + self.roll_to_next_data_file(); + } + return Err(e); +} +``` + +On a bad read the error is returned to the caller (`BufferReader::next` returns `Err`), NOT a record. The record is never extracted from `aligned_buf` and never handed to `decode_record_payload`. The `Err` propagates to the topology adapter, which treats reader errors as unrecoverable panics (`receiver.rs`). The corrupted bytes never become a sink-delivered event. + +### Emission Point: reader.rs:1106-1131 + +The only code path that produces `Ok(Some(record))` from `next()` passes through `ReadToken` (only issued on `RecordStatus::Valid`), then `read_record` (which calls `archived_root` under a SAFETY comment citing prior validation), then `decode_record_payload`. No path bypasses `validate_record_archive`. + +## What Goes Wrong if the Property is Violated + +A corrupted record delivered to the sink would contain: + +- Wrong event data (payload bytes from a different record, a partial write, or random noise). +- Potentially wrong event count encoded in the record ID gap, causing `total_buffer_size` accounting to drift. +- Silent data corruption: the sink receives and potentially forwards garbled telemetry data, violating downstream correctness. + +The Vector comment in `reader.rs:77-78` is explicit: "corruption may have affected other records in a way that is not easily detectable and could lead to records which deserialize/decode but contain invalid data." This is the motivation for rolling the entire file, not just skipping the individual record. + +## Timing / Fault Conditions for Antithesis + +- **Bit-flip fault injection**: Antithesis can corrupt bytes in a data file while the reader is mid-read or between reads. This exercises both CRC32C detection (payload bytes changed) and rkyv detection (structural fields corrupted). 
+- **Partial write fault**: A crash during a `TrackingBufWriter::write` call (writer.rs:321-330) can leave a partial record at the end of a data file. On restart, `is_finalized=true` for that file because the reader and writer file IDs differ; `try_next_record` should return `ReaderError::PartialWrite` (reader.rs:263-265, 328-330), which `is_bad_read()` catches. +- **Torn tail after crash**: rkyv's `archived_root` reads the root offset from the last 8 bytes of the buffer. If a crash leaves trailing bytes that happen to encode a plausible offset, the structure might pass `CheckBytes` but point into the wrong region. CRC32C should catch this if the payload is actually wrong bytes, but a CRC collision (probability ~1/2^32 per check) would bypass it. +- **Foreign `.dat` file injection**: Placing a file written by a different Vector version or by an unrelated process into the buffer directory. The record format is host-endian and version-specific; `CheckBytes` should reject misaligned/invalid archives. + +## SUT-Side Instrumentation Suggestions (ALL MISSING) + +No Antithesis SDK assertions exist anywhere in the codebase (confirmed by repo-wide scan in `existing-assertions.md`). All suggestions below require adding the Antithesis Rust SDK as a new `Cargo.toml` dependency. 
+ +**Primary assertion** — in `BufferReader::next` (reader.rs), at the point `Ok(Some(record))` is returned (reader.rs:1131), assert that the record was validated: + +```rust +// Immediately after `reader.read_record(token)?` succeeds (reader.rs:1106) +// and before returning Ok(Some(record)): +antithesis_sdk::assert_always_or_unreachable!( + "no_corrupted_record_delivered", + "Record emitted to sink passed CRC32C and CheckBytes validation", + &serde_json::json!({ + "record_id": record_id, + "record_bytes": record_bytes, + "data_file_id": self.ledger.get_current_reader_file_id(), + }) +); +``` + +The assertion is `AlwaysOrUnreachable` because: on a clean run with no faults injected, no corrupted record should ever be delivered (always passes, but the "corrupted record delivered" branch is unreachable). When Antithesis injects faults, if a corrupted record somehow bypasses validation and reaches this point, the assertion fires. + +**Secondary instrumentation** — in `validate_record_archive` (record.rs:177), log when each failure mode fires so Antithesis can correlate fault injection with detection: + +```rust +// At RecordStatus::Corrupted return: +antithesis_sdk::assert_sometimes!( + "corruption_detected_by_crc", + "CRC32C mismatch detected during record validation", + &serde_json::json!({ "calculated": calculated, "actual": actual }) +); +// At RecordStatus::FailedDeserialization return: +antithesis_sdk::assert_sometimes!( + "corruption_detected_by_checkbytes", + "rkyv CheckBytes failure detected during record validation", + ... +); +``` + +These support the companion property `corruption-is-detected-and-recovered`. + +## Residual Risk: CRC Collision + +CRC32C is 32 bits. The probability of a random bit-flip producing the correct CRC is approximately 1/2^32 (~2.3 × 10^-10). For any reasonable number of records this is negligible in practice, but it is not zero. 
A CRC collision would allow a structurally valid but content-incorrect record to pass all checks and be delivered. This is a known, documented limitation of 32-bit checksums and is not a code bug, but it means the property has a residual probabilistic violation rate even with correct implementation. + +The hand-written `CheckBytes` (record.rs:75-117) provides a second layer, but it only validates structural soundness (field types and alignment), not semantic correctness of the payload content. A flipped bit in the payload bytes would pass `CheckBytes` and only be caught by CRC32C. + +## Open Questions + +- **What does the topology adapter do with a `ReaderError` from `next()`?** The SUT analysis notes that `receiver.rs` panics on reader I/O errors. Does it also panic on deserialization/checksum errors, or does it swallow them silently? If it swallows them, the test harness needs a separate counter to confirm that corruption was detected, not ignored. This matters because a swallowed error could look like a successful read to any external observer. + +- **Does `seek_to_next_record` (reader.rs:810-947) have the same guard?** During initialization, the reader calls `validate_record_archive` directly at reader.rs:850, not through `try_next_record`. On a `FailedDeserialization` or `Corrupted` result it falls back to the slow path (`break`, reader.rs:896) rather than rolling the file immediately. If the slow path then calls `next()` which itself hits `is_bad_read` + `roll_to_next_data_file`, the protection is preserved; but if the code path does not eventually go through `next()`, the assertion placement above would not catch a corruption during `seek_to_next_record`. This matters for whether the assertion needs to be placed in `seek_to_next_record` as well. + +- **Is the `unsafe archived_root` call in `read_record` (reader.rs:375) sound if `aligned_buf` was modified between `try_next_record` and `read_record`?** The code assumes the buffer is not touched between calls. 
The SAFETY comment cites prior validation. If Antithesis can introduce a data race here (unlikely given single-reader design, but worth confirming with the tokio executor model), the assertion may not catch a post-validation corruption. + +- **Does the rkyv `archived_root` torn-tail scenario actually produce a `RecordStatus::Valid` that then fails CRC32C?** Specifically: if crash-left trailing bytes form a plausible rkyv root offset pointing to a valid-looking struct in memory, does `verify_checksum` save us? This determines whether the CRC32C is a sufficient backstop for torn-tail mis-reads or whether a structural check on the offset itself is needed. diff --git a/tests/antithesis/scratchbook/properties/overflow-chain-no-unaccounted-gap.md b/tests/antithesis/scratchbook/properties/overflow-chain-no-unaccounted-gap.md new file mode 100644 index 0000000000000..8c0ac407d9ff5 --- /dev/null +++ b/tests/antithesis/scratchbook/properties/overflow-chain-no-unaccounted-gap.md @@ -0,0 +1,286 @@ +--- +slug: overflow-chain-no-unaccounted-gap +type: Safety / Always +sut_path: lib/vector-buffers/src/variants/disk_v2/ +commit: b7aae737cef5dd37d1445915443a1eb97b584f85 +updated: 2026-05-28 +--- + +# Property: overflow-chain-no-unaccounted-gap + +## Catalog Entry + +**Type:** Safety / Always + +**Property:** When `WhenFull::Overflow` is configured with a disk buffer as +the base and an in-memory buffer as the overflow, a crash during an +overflow-active period does not create a silent middle gap in the delivered +event stream. Either (a) all events from both base and overflow are accounted +for (delivered or explicitly loss-reported), or (b) if overflow events are +lost on crash, the loss is reported via the existing accounting path +(`increment_dropped_event_count_and_byte_size`), not silently dropped. Event +ordering guarantees must also be honored or explicitly documented as absent. 
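The "delivered or explicitly loss-reported" wording of the property above can be pictured as a small reconciliation step on the workload side. This is a minimal sketch, assuming the workload knows every produced ID and can read a drop counter that reflects only permanent loss; `GapReport` and `reconcile` are hypothetical names, not part of Vector or the Antithesis SDK.

```rust
use std::collections::BTreeSet;

/// Result of reconciling produced IDs against what the drain delivered.
#[derive(Debug, PartialEq, Eq)]
struct GapReport {
    /// Produced IDs that never reached the sink, in ascending order.
    missing: Vec<u64>,
    /// Missing events that the reported drop count does not cover.
    unaccounted: u64,
}

/// Compares produced and delivered ID sets and subtracts the reported drops.
fn reconcile(
    produced: &BTreeSet<u64>,
    delivered: &BTreeSet<u64>,
    reported_dropped: u64,
) -> GapReport {
    // Everything produced but not delivered is a candidate loss.
    let missing: Vec<u64> = produced.difference(delivered).copied().collect();
    // Losses beyond the reported count are the silent gap the property forbids.
    let unaccounted = (missing.len() as u64).saturating_sub(reported_dropped);
    GapReport { missing, unaccounted }
}
```

A harness would assert `unaccounted == 0` after the post-crash drain. Whether the existing counter can be used directly is unclear: the notes in this file suggest it may also count items that were merely forwarded to the overflow, so the counter fed into this check might need to be a separate, loss-only one.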
+ +**Invariant:** `Always`: after a crash-during-overflow-active and subsequent +drain, the set of delivered event IDs has no unaccounted gap relative to the +produced set. Specifically: if event ID N was produced, placed on disk (base), +and survived the crash, and event ID M > N was produced before the crash and +placed in the overflow (in-memory), M may be lost but N must not be silently +reclassified as lost. The delivered set does not "skip over" durable disk +events due to overflow-induced reordering confusion. + +**Antithesis Angle:** Configure a topology with `disk` base buffer + +`WhenFull::Overflow` pointing to an in-memory buffer. Fill the base to +capacity to trigger overflow. Crash while overflow is active (events in both +buffers). Restart and drain. Assert: (1) no silent middle gap — events known +to be on disk before the crash are present in the drain; (2) the receiver-side +unbiased `select!` does not deliver a LATER overflow event as if it were an +EARLIER disk event (reordering), violating monotonicity assumptions. + +**Why It Matters:** The overflow configuration is entirely uncovered in the +existing test suite. The asymmetry between base (durable disk) and overflow +(ephemeral in-memory) creates a unique crash shape: EARLIER events survive on +disk, LATER events are in the overflow and lost. This is not a simple +"duplicates at the tail" scenario — it is a gap in the middle of the +chronological stream. Downstream dedup logic typically assumes at-least-once +(duplicate tails are fine) and does not account for middle gaps. An +unaccounted gap means the downstream consumer permanently misses events that +the source believed were accepted. + +Additionally, the unbiased `tokio::select!` in the receiver means that even +during non-crash steady state, events from overflow can interleave with events +from the disk base, breaking ordering. 
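The steady-state interleaving described above can be reproduced without tokio by modelling the choice between two ready branches as an injected picker. This is a sketch of the effect only, not of `tokio::select!` itself (which picks pseudo-randomly); `Branch`, `drain_two_ready`, and `first_inversion` are hypothetical names.

```rust
use std::collections::VecDeque;

/// Which receiver wins when both have an item ready.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Branch {
    Base,
    Overflow,
}

/// Drains both queues; `pick` is consulted only when both are non-empty,
/// standing in for the unbiased branch choice.
fn drain_two_ready<F: FnMut() -> Branch>(
    mut base: VecDeque<u64>,
    mut overflow: VecDeque<u64>,
    mut pick: F,
) -> Vec<u64> {
    let mut out = Vec::with_capacity(base.len() + overflow.len());
    loop {
        let next = match (base.front().is_some(), overflow.front().is_some()) {
            (true, true) => match pick() {
                Branch::Base => base.pop_front(),
                Branch::Overflow => overflow.pop_front(),
            },
            (true, false) => base.pop_front(),
            (false, true) => overflow.pop_front(),
            (false, false) => break,
        };
        out.extend(next);
    }
    out
}

/// Returns the first adjacent pair delivered out of ID order, if any.
fn first_inversion(delivered: &[u64]) -> Option<(u64, u64)> {
    delivered
        .windows(2)
        .find(|w| w[0] > w[1])
        .map(|w| (w[0], w[1]))
}
```

With base IDs 1 to 3 and overflow IDs 4 and 5, a picker that always favours the overflow delivers 4 and 5 first, so the first inversion is the pair (5, 1). A picker that always favours the base preserves order, which is roughly what a biased selection would give.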
+ +--- + +## Code Verification + +### Overflow dispatch on send (sender.rs:236-244) + +```rust +// lib/vector-buffers/src/topology/channel/sender.rs:236-244 +WhenFull::Overflow => { + if let Some(item) = self.base.try_send(item).await? { + was_dropped = true; + self.overflow + .as_mut() + .unwrap_or_else(|| unreachable!("overflow must exist")) + .send(item, send_reference) + .await?; + } +} +``` + +When the base buffer is full (`try_send` returns `Some(item)` = the item was +not accepted), the item is forwarded to the overflow buffer. The `was_dropped` +flag is set, which triggers the instrumentation path +(`increment_dropped_event_count_and_byte_size`) — but this counts the item as +"dropped from the base" for backpressure purposes, NOT as silently lost. + +### Unbiased `select!` in receiver (receiver.rs:133-138) + +```rust +// lib/vector-buffers/src/topology/channel/receiver.rs:133-138 +Some(mut overflow) => { + select! { + Some(item) = overflow.next() => (item, false), + Some(item) = self.base.next() => (item, true), + else => return None, + } +} +``` + +`tokio::select!` with no `biased` keyword uses pseudo-random branch selection. +When both the overflow and base receivers have items ready simultaneously, +either can be selected. This means: + +- A LATER event (placed in overflow after the base was full) can be delivered + before an EARLIER event (already on disk in the base). +- After a crash: the overflow in-memory buffer is gone; the base disk buffer + retains events up to the crash point. On restart, only the base is drained. + But during the pre-crash period, the delivery order was already interleaved. + +### Crash asymmetry: disk base survives, in-memory overflow does not + +The disk base (`ReceiverAdapter::DiskV2`) stores events in `buffer-data-N.dat` +files, fsync'd per the `flush_interval` model. These survive a crash (subject +to the ≤500ms loss window). 
+ +The in-memory overflow (`ReceiverAdapter::InMemory`, backed by a +`LimitedReceiver` / `LimitedSender` channel) holds events only in heap +memory. A crash (SIGKILL) loses all in-memory channel contents with no +recovery path and no on-disk trace. + +### `WhenFull::Overflow` topology configuration + +`WhenFull::Overflow` is wired in `BufferSender::with_overflow` +(sender.rs:158-170): + +```rust +// sender.rs:158-170 +pub fn with_overflow(base: SenderAdapter, overflow: BufferSender) -> Self { + Self { + base, + overflow: Some(overflow), + when_full: WhenFull::Overflow, + ... + } +} +``` + +The overflow `BufferSender` is a recursive structure — it may itself have an +`overflow`, enabling chained overflow. For this property, the relevant case is +a disk base + in-memory overflow (the standard two-level chain). + +--- + +## Crash Scenario Walkthrough + +1. Source produces events E1…EN sequentially. E1…EK fit in the disk base + buffer and are accepted; E(K+1)…EN overflow to the in-memory buffer. +2. The disk base receives `sync_all` for E1…EJ (J ≤ K, the last fsync + boundary). E(J+1)…EK are page-cached but not yet fsync'd. +3. Crash (SIGKILL). In-memory overflow (E(K+1)…EN) is lost. Page-cached + E(J+1)…EK may also be lost (within the ≤500ms durability window). +4. On restart, Vector opens the disk base. `validate_last_write` recovers to + E1…EJ (or possibly EK if the page cache was flushed by the OS on kill). + The overflow buffer is not reopened — it has no recovery path. +5. Drain: E1…EJ are delivered. E(J+1)…EN are never delivered. + +**Gap shape:** E(J+1)…EN is a suffix gap (standard crash loss). This is +expected and documented. + +**Non-obvious gap shape (the property target):** Consider a second, subtler +scenario where the source numbers events globally. 
If E1…EJ are durable on disk and +E(K+1)…EN are in overflow, and the drain delivers only E1…EJ, a workload +that expects *all IDs from 1 to N* to be present will observe a gap at +E(J+1)…EN; since those IDs were never made durable, this is expected. + +The *actual* risk this property guards against is: + +- A bug in the `select!` dispatch that causes the receiver to skip a disk + event (treating it as consumed) because an overflow event arrived + simultaneously. +- A bug in the overflow sender that miscounts `total_buffer_size` on the + base after an overflow-and-drain cycle, triggering the underflow deadlock + (cross-ref `total-buffer-size-never-underflows`). +- A misleading loss signal from the instrumentation path: when an overflow + item is sent, `was_dropped=true` on the base, and + `increment_dropped_event_count_and_byte_size` is called. But after the + overflow send succeeds, the item is NOT lost; it is in the overflow. The + base-side "drop" counter is misleading and may be misread by operators as + data loss. + +--- + +## SUT-Side Instrumentation (MISSING — must be added) + +No Antithesis SDK instrumentation exists in the Vector codebase. + +### Assertion 1 — Reachability: overflow path is exercised + +```rust +// sender.rs, inside the WhenFull::Overflow arm, after overflow.send() succeeds +antithesis_sdk::assert_reachable!( + "overflow-chain: item dispatched to overflow buffer", + &serde_json::json!({ + "base_was_full": true, + "overflow_buffer_type": "in-memory", // or determined dynamically + }) +); +``` + +### Assertion 2 — Always: no durable disk event is skipped post-drain + +This is a workload-level assertion. The workload assigns sequential IDs to +produced events and tracks which IDs were accepted into the base disk buffer +(via a confirmation callback or a secondary log channel).
After drain: + +```rust +// workload-side, post-drain +// durable_ids / delivered_ids are assumed to be std set types (e.g. BTreeSet<u64>) +antithesis_sdk::assert_always!( + durable_ids.is_subset(&delivered_ids), + "overflow-chain: all durable disk events delivered after crash", + &serde_json::json!({ + "durable_count": durable_ids.len(), + "delivered_count": delivered_ids.len(), + "missing_ids": durable_ids.difference(&delivered_ids).collect::<Vec<_>>(), + }) +); +``` + +### Assertion 3 — Always: base-side drop counter matches overflow dispatches + +The count recorded by `increment_dropped_event_count_and_byte_size` on the +base side when `WhenFull::Overflow` fires should equal the number of items +forwarded to the overflow buffer (not items permanently lost). This is a +metric-accuracy assertion, not a data-loss assertion. + +--- + +## Why Existing Tests Cannot Catch This + +- The model-based proptest (`tests/model/`) does not configure + `WhenFull::Overflow`. All test runs use a single buffer level. +- No integration test exercises the disk-base + in-memory-overflow topology. +- The internal chaos test (SIGKILL ×3) uses a single-level disk buffer. +- The `select!` reordering risk is only present when both branches are + simultaneously ready — this is a timing/interleaving sensitivity that unit + tests with synchronous scheduling cannot explore. + +--- + +## Requires a Second Topology Config + +Testing this property requires a Vector topology configured as: + +```yaml +# config/antithesis-overflow-test.yaml +sinks: + my_sink: + type: blackhole # or mock sink + inputs: [my_transform] + buffer: + type: disk + max_size: 268435488 # 256MB minimum + when_full: overflow # triggers the overflow chain +``` + +The overflow buffer (in-memory) is automatically allocated when +`when_full: overflow` is set. This is distinct from the standard +single-level harness config and must be a separate test scenario.
+ +--- + +## Open Questions + +- Does `WhenFull::Overflow` actually chain to a second independently-sized + in-memory buffer, or does it overflow to the same topology channel? Verify + that `BufferSender::with_overflow` is called with a separate + `LimitedSender`/`LimitedReceiver` pair and that this second buffer has its + own capacity configuration. + +- The `was_dropped = true` flag in the overflow arm (sender.rs:238) triggers + `increment_dropped_event_count_and_byte_size` even though the item is NOT + dropped — it is forwarded to the overflow. Is this intentional? Does the + instrumentation path distinguish "dispatched to overflow" from "permanently + dropped"? If not, this is a metrics-accuracy bug independent of the crash + scenario. + +- After a crash, when the topology restarts, is the overflow `BufferSender` + recreated with an empty in-memory buffer? If so, there is no re-delivery of + overflow events — confirmed expected behavior, but worth asserting explicitly + in the harness. + +- Is the unbiased `select!` in receiver.rs:133 intentional (the comment at + lines 120-124 explains the rationale: avoid stalling the base while draining + overflow), or is strict ordering expected? If ordering is not guaranteed, the + documentation should say so explicitly. This affects whether reordering is a + "violation" or a "known trade-off." + +- What happens to `total_buffer_size` on the base when the overflow is active? + If the base is "full" (the item was rejected via `try_send`), the base's + `total_buffer_size` remains at `max_buffer_size`. When the overflow drains + and the reader acks base events, can the combined accounting get confused + about which buffer's bytes are being freed? This is a potential secondary + trigger for the `total-buffer-size-never-underflows` bug. 
diff --git a/tests/antithesis/scratchbook/properties/partial-write-at-rotation-recovers.md b/tests/antithesis/scratchbook/properties/partial-write-at-rotation-recovers.md new file mode 100644 index 0000000000000..a5c8f3031261e --- /dev/null +++ b/tests/antithesis/scratchbook/properties/partial-write-at-rotation-recovers.md @@ -0,0 +1,279 @@ +# Property: partial-write-at-rotation-recovers + +## Catalog Entry + +**Type:** Safety + Liveness / Sometimes — `Sometimes(torn_tail_recovered)` + +**Property:** A crash that leaves a torn/partial last record in a data file, an +empty just-created next file, or a ledger/data divergence at the file-rotation +boundary is recovered without deadlock and without returning garbage data or +fast-forwarding to a wrong record ID. The system reaches a consistent state +where the writer is ready to write and the reader is ready to read correct +records. + +**Invariant (Safety):** After recovery from any rotation-boundary crash, no +record returned by `reader.next()` may have: + +- A checksum mismatch (garbage payload delivered as valid). +- A record ID that is non-monotonic relative to the previous delivered ID + (would panic at `reader.rs:480-484` `MonotonicityViolation`). +- A record ID synthesized from a torn rkyv footer read (F5: `archived_root` + reads root pointer from last 8 bytes of buffer — if those bytes are crash- + left garbage, the pointer may be plausible but wrong, yielding a `Valid` + status with an incorrect `id` field that fast-forwards the ledger). + +**Invariant (Liveness):** After recovery from any rotation-boundary crash, +`from_config_inner` completes and the writer accepts new writes within bounded +time (no deadlock on `wait_for_reader()`). 
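The liveness invariant above ("within bounded time") could be checked from a harness with a simple watchdog. The sketch below is an assumption about how such a check might look, not how Vector or Antithesis does it: `completes_within` is a hypothetical helper, and the closure stands in for whatever drives buffer initialization and the first write.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Runs `init` on a separate thread and reports whether it finished within
/// `limit`. A panic inside `init` also counts as "did not complete".
fn completes_within<F>(init: F, limit: Duration) -> bool
where
    F: FnOnce() + Send + 'static,
{
    let (done_tx, done_rx) = mpsc::channel::<()>();
    thread::spawn(move || {
        init();
        // The receiver may already have timed out; ignore a failed send.
        let _ = done_tx.send(());
    });
    // Ok only if the completion signal arrives before the deadline.
    done_rx.recv_timeout(limit).is_ok()
}
```

If the closure deadlocks, the worker thread is left running; that is acceptable for a test process that exits shortly afterwards, but it would be a leak in long-lived code.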
+ +**Invariant (No phantom gap):** The `Ordering::Greater` path in +`validate_last_write` (`writer.rs:910-919`) may log "Events have likely been +lost" and skip to the next file — this gap must only occur when real data +divergence exists, not as a false positive triggered by a mis-read torn tail. +A false-positive skip silently discards valid synced records. + +**Antithesis Angle:** + +1. Workload drives the buffer to trigger file rotation regularly (write records + until `RecordWriter::can_write` returns false, at `DEFAULT_MAX_DATA_FILE_SIZE` + = 128MB, or more practically at a reduced `max_data_file_size` configuration). +2. Antithesis injects SIGKILL precisely during the rotation sequence (see + windows below). +3. Vector restarts; workload asserts: + - `Sometimes(torn_tail_recovered)`: the recovery path that handles a torn or + partial last record (`RecordStatus::FailedDeserialization` or + `RecordStatus::Corrupted` in `validate_record_archive`) is actually reached + and handled. + - `Always`: no garbage record (bad checksum, bad ID) is returned to the + workload from `reader.next()`. + - `Always`: no deadlock during init. + +**Why It Matters:** File rotation is the most crash-sensitive state transition +in the buffer. It involves: (1) flushing the current file with `sync_all`, (2) +ledger msync, (3) creating a new data file with `open_file_writable_atomic`, +(4) `sync_all` the empty new file, (5) `increment_writer_file_id` in the ledger. +None of these steps are atomic with each other. A kill at any of the 5 seams +leaves a different disk state, each requiring a different recovery branch. + +The F5 risk (`archived_root` / `check_archived_root` reads root pointer from +last 8 bytes of the buffer at `ser.rs:94` `check_archived_root::(buf)`) is +the most subtle: rkyv's archived format stores the root object's position as a +relative offset in the last `size_of::()` bytes of the buffer. 
If the +last record's write was torn mid-payload (e.g., only the first 4 bytes of the 8- +byte footer were written before the kill), `check_archived_root` reads a +half-written value, interprets it as a plausible relative pointer, and may +navigate to a valid-looking-but-wrong byte offset within the buffer. If that +offset happens to contain bytes that pass `CheckBytes` validation (which is +hand-written at `record.rs:79-117` and only validates field types, not semantic +constraints), the result is `RecordStatus::Valid` with an incorrect `id` field. + +This false-valid record then propagates to `validate_last_write`'s comparison: +if the wrong `id` + `record_events` < `ledger_next`, the `Ordering::Greater` +path (ledger ahead of data) fires and silently drops all synced records from +that file. If the wrong `id` + `record_events` > `ledger_next` +(`Ordering::Less`, ledger behind data), the ledger fast-forwards to an even +larger ID, creating a phantom gap. + +**Crash Windows at Rotation Boundary (code-precise):** + +| Step | Code location | Kill here leaves... | Recovery branch | +|------|--------------|--------------------|-----------------| +| 1. `flush_inner(force=true)` flushing old file | `writer.rs:1041` → `writer.rs:1307-1308` `writer.flush()` (page-cache only) | Old file in page cache, no fsync | `validate_last_write` opens old file; last record valid → `Ordering::Equal`; ready to continue on old file | +| 2. `sync_all` of old file | `writer.rs:1314` `writer.sync_all()` | Old file partially synced (OS may batch) | Last record may be partial (torn tail) → `FailedDeserialization` or `Corrupted` → `should_skip=true` → skip to next | +| 3. `ledger.flush()` (msync) | `writer.rs:1317` `self.ledger.flush()` | Old file synced; ledger not updated | `validate_last_write` reads last record; ledger lags → `Ordering::Less`; fast-forward; OK | +| 4.
`reset()` closes old file handle | `writer.rs:1044` `self.reset()` | Old file synced and ledger synced; writer in reset state | `validate_last_write` re-opens old file; last record valid; `Ordering::Equal` | +| 5. `open_file_writable_atomic` for new file | `writer.rs:1071` | New file may not exist, or empty O_CREAT result | `AlreadyExists` branch; if file is empty → `data_file_size==0`; treat as new. If file doesn't exist, creates it fresh | +| 6. `sync_all` of empty new file | `writer.rs:1124` `data_file.sync_all()` | New file exists but not durable on disk | Next init: `open_file_writable` opens it; size=0; `validate_last_write` exits early at `writer.rs:852-855` | +| 7. `increment_writer_file_id` in ledger | `writer.rs:1138` `ledger.state().increment_writer_file_id()` | New file open; ledger still says old file ID | `validate_last_write`: opens file pointed by ledger (old ID). If old file has valid last record → `Ordering::Equal`; writer resumes on old file. But new empty file is orphaned | +| 8. After `increment_writer_file_id`, before first write to new file | `writer.rs:1138` done; no records yet | Ledger on new file ID; file is empty | `validate_last_write`: opens new file; empty → early exit at `writer.rs:852-855`; ready to write | + +**F5 Torn-Tail Mis-Recovery Path:** + +The F5 risk materializes in windows 2-3 above. The precise sequence: + +1. Writer is writing the last record of the old file (near the 128MB boundary). +2. `TrackingBufWriter.flush()` (page-cache flush) starts writing the rkyv + archive to the data file. +3. The write is torn: the kernel writes only part of the serialized archive + (e.g., the payload is written but the 8-byte rkyv footer — the root pointer + — is partially written or not written). +4. Kill occurs. +5. On restart, `validate_last_write` opens the old file as an mmap (via + `open_mmap_readable` at `writer.rs:862-867`). +6. 
`validate_record_archive` at `writer.rs:872` calls `try_as_record_archive` + → `try_as_archive::>(buf)` → `check_archived_root::>(buf)` + at `ser.rs:94`. +7. `check_archived_root` reads the last 8 bytes of `buf` as the root offset. + If those bytes are garbage (partial write), they may happen to encode a + plausible relative offset that lands within the buffer. +8. The `CheckBytes` implementation for `ArchivedRecord` at `record.rs:79-117` + validates field types (`u32`, `u64`, `u32`, `ArchivedBox<[u8]>`) but does + not validate semantic constraints (e.g., that `id` is monotonically greater + than the ledger's `last_record_id`). +9. If the garbage-pointed bytes pass `CheckBytes`, `validate_record_archive` + returns `RecordStatus::Valid { id: }`. +10. `validate_last_write` proceeds to compare `garbage_id + record_events` vs + `ledger_next`. Any comparison outcome can result: `Greater` (silent skip), + `Less` (phantom fast-forward), or even `Equal` (lucky match that accepts + garbage as valid last record, causing wrong `next_record_id`). + +The CRC32C check (`archive.verify_checksum` at `record.rs:179`) is the second +gate: even if the archive struct is parsed, if the garbage bytes don't match +the checksum, `RecordStatus::Corrupted` is returned (→ `should_skip = true`, +safe). The F5 risk materializes only if the garbage bytes happen to produce a +CRC32C collision with the (also partially-written) payload — a low but nonzero +probability per restart. Antithesis's full coverage of timing windows makes this +reachable over many restarts across all explored timelines. + +**`validate_last_write` `Ordering::Greater` / `Ordering::Less` paths:** + +- `Ordering::Less` (`writer.rs:922-944`): ledger behind data → fast-forward + ledger. The `ledger_record_delta = record_next - ledger_next` is computed and + used to `increment_next_writer_record_id`. This is safe when the last record + is genuinely valid. 
If F5 produces a false-valid record with a lower `id` + than the true last record, the fast-forward moves `next_record_id` forward + by the wrong amount — subsequent records may have unexpected IDs relative to + the reader's expectations. + +- `Ordering::Greater` (`writer.rs:910-919`): data behind ledger → log error, + `should_skip = true`. This path is taken when the last record's `id + events` + is less than `ledger_next` — meaning the ledger thinks we wrote more records + than actually made it to the file. On skip, the writer rolls to the next file + and never writes the "missing" records again. This is the intended behavior + for partial writes, but must not be triggered by a F5 false-valid record with + a too-low garbage `id`. + +**`seek_to_next_record` at Rotation:** + +During recovery, `seek_to_next_record` at `reader.rs:840-898` uses the same +`validate_record_archive` (mmap + `check_archived_root`) for its fast-path +file skip check. F5 can also manifest here: a false-valid last record with a +wrong `id` may incorrectly satisfy `ledger_last > last_record_id_in_data_file` +at `reader.rs:879`, causing the reader to delete a file it should not have +deleted (all remaining unread records in that file are lost). + +**`reader.rs:932` `u16 >` comparison (file-ID rollover):** +`if reader_file_id > writer_file_id` at `reader.rs:932` uses raw `u16` +comparison. At `MAX_FILE_ID = 65535`, after rollover the reader file ID wraps +to 0 while the writer may be at 65535. `0 > 65535 == false`, so the +`seek_to_next_record` init stall condition is not detected — the reader +incorrectly believes it is still behind the writer and continues looping. This +is the file-ID rollover ordering bug (sut-analysis §6 item 6). Antithesis can +reach this with `MAX_FILE_ID` reduced via test configuration. + +**Fault Requirements:** Node-termination faults (SIGKILL) required. Kill +precisely during the rotation sequence (windows 1-8 above) is the primary fault. 
+The Antithesis scheduler should concentrate kills in the time window between: + +- The first `flush_inner(force=true)` call (start of rotation, `writer.rs:1041`) +- The first successful write to the new file (end of rotation, after + `writer.rs:1138`). + +To maximize rotation-boundary hits, configure a small `max_data_file_size` +(e.g., 1MB or even 256KB) so rotations happen frequently, giving the Antithesis +scheduler many opportunities. + +**Antithesis SDK Assertions (SUT-side, to be added):** + +```rust +// Assumes `use serde_json::json;` at the call site; details are passed by reference. +// In validate_last_write, after RecordStatus::FailedDeserialization or Corrupted: +antithesis_sdk::assert_sometimes!( + true, + "torn_tail_recovered: validate_last_write detected corrupt/partial last record", + &json!({ + "data_file": format!("{:?}", data_file_path), + "status": "FailedDeserialization or Corrupted" + }) +); + +// In validate_last_write, after Ordering::Greater: +antithesis_sdk::assert_sometimes!( + true, + "validate_last_write Ordering::Greater path exercised (data lags ledger)", + &json!({ "ledger_next": ledger_next, "record_next": record_next }) +); + +// In validate_last_write, after Ordering::Less: +antithesis_sdk::assert_sometimes!( + true, + "validate_last_write Ordering::Less path exercised (ledger lags data)", + &json!({ "ledger_next": ledger_next, "record_next": record_next }) +); + +// In seek_to_next_record, after delete_completed_data_file during fast-path: +antithesis_sdk::assert_sometimes!( + true, + "seek_to_next_record fast-path: deleted already-acked file during recovery", + &json!({ "file": format!("{:?}", data_file_path) }) +); + +// After any record delivered to caller (reader.rs next() Ok(Some(record))): +antithesis_sdk::assert_always!( + // record_id must be >= last delivered record_id (monotonicity) + record_id >= self.last_reader_record_id, + "record IDs are monotonic (no wrap-around or garbage ID delivered)", + &json!({ "record_id": record_id, "last_reader_record_id": self.last_reader_record_id }) +); +``` + +--- +
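A rough sizing sketch for the small `max_data_file_size` recommendation above. It assumes rotations scale as bytes written divided by file size, ignores per-record overhead and the rule that a record never spans two files, and treats MB as MiB. The helper name and the 1 GiB workload are illustrative and do not come from the SUT.

```rust
/// Mebibyte and kibibyte multipliers used for the illustrative sizes below.
pub const MIB: u64 = 1024 * 1024;
pub const KIB: u64 = 1024;

/// Approximate number of data-file rotations for a workload, assuming each
/// file is filled to `max_data_file_size` before the writer rolls over.
/// Hypothetical planning helper; not part of vector-buffers.
pub fn approx_rotations(total_bytes_written: u64, max_data_file_size: u64) -> u64 {
    assert!(max_data_file_size > 0, "file size must be non-zero");
    total_bytes_written / max_data_file_size
}
```

Under these assumptions, 1 GiB of writes yields about 8 rotations at a 128 MiB file size and about 4096 at 256 KiB, which is the difference between a handful of crash opportunities and thousands per run.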
+## Open Questions + +**OQ-1: Does `check_archived_root` actually read from the last 8 bytes of the +buffer (F5 torn-tail risk) or does it use a different footer layout?** +`rkyv` v0.7.x (the version in use — check `Cargo.lock`) stores the root +position as a `i32` relative offset in the **last 4 bytes** of the buffer on +32-bit or as a `usize`-sized footer on 64-bit. On x86-64 with a 64-bit `usize`, +the footer is 8 bytes. The exact layout determines how many bytes need to be +torn for F5 to be triggered. Check `rkyv`'s version in +`lib/vector-buffers/Cargo.toml` and the footer layout for that version. + +**OQ-2: Is the F5 probability high enough to matter in practice, or is CRC32C +the effective guard?** +For F5 to produce a false-valid (not just `Corrupted`) record, the garbage +bytes at the root-pointer position must both (a) point to a location within the +buffer that passes `CheckBytes` and (b) the payload bytes at that location must +CRC32C-match the checksum field (also potentially garbage). The CRC32C check is +strong (32-bit security margin against random bit flips). However, partial +writes at the torn boundary may leave structured data (zeros, the previous +record's valid bytes) that creates a non-random collision surface. Antithesis's +full coverage of timing windows is the right tool to empirically determine if +F5 is reachable without probability arguments. + +**OQ-3: When `should_skip_to_next_file = true` and the writer rolls to the next +file, does the reader still have access to all un-deleted records in the old +file?** +Yes: `mark_for_skip` (`writer.rs:984`) + `reset()` closes the writer's handle +to the old file but does not delete it. The reader reads/deletes data files at +its own pace (`delete_completed_data_file` only after all records acked). 
The +concern is whether the `increment_writer_file_id` in `validate_last_write`'s +skip path causes the `is_finalized` flag in the reader to flip prematurely, +marking the old file as finalized before all records are written. Trace +`is_finalized = (reader_file_id != writer_file_id) || !self.ready_to_read` +(`reader.rs:1004`): after skip, `writer_file_id` increments → `!=` reader's +file_id → `is_finalized = true`. This correctly signals to `try_next_record` +that the file is done and partial reads at the end are `PartialWrite` errors, +not waits. This is correct behavior. + +**OQ-4: Is the monotonicity panic at `reader.rs:480-484` reachable via the F5 +path?** +If F5 produces a garbage `id` that is lower than `self.last_reader_record_id`, +the `add_marker` call at `reader.rs:478` would return `MonotonicityViolation`, +which panics with `"record ID monotonicity violation detected; this is a +serious bug"`. This would be a process crash (not a deadlock), which is more +visible than silent loss but still unrecoverable. However, F5 happens in +`validate_last_write` (writer init) which reads the last record but does not +call `add_marker` — the reader's monotonicity check is on the read path +(`read_record` → `track_read` → `add_marker`). If F5 sets the writer's +`next_record_id` to a wrong value, and the writer then writes new records with +IDs starting from that wrong value, the reader may encounter those records with +IDs that are non-monotonic relative to surviving old records. This is the +indirect path to the monotonicity panic. + +**OQ-5: Should `max_data_file_size` be configurable in test mode to a small +value (e.g., 1MB) to trigger frequent rotations?** +Yes. The test configuration should set `max_data_file_size` to a small value +(recommended: 1MB or even 256KB) to make rotations happen every few seconds, +giving Antithesis many rotation-boundary crash opportunities per run.
At the +default 128MB, a single test run may not produce enough rotations for full +coverage. diff --git a/tests/antithesis/scratchbook/properties/reader-drains-and-terminates-cleanly.md b/tests/antithesis/scratchbook/properties/reader-drains-and-terminates-cleanly.md new file mode 100644 index 0000000000000..5a1b880bf81f1 --- /dev/null +++ b/tests/antithesis/scratchbook/properties/reader-drains-and-terminates-cleanly.md @@ -0,0 +1,385 @@ +--- +slug: reader-drains-and-terminates-cleanly +property_id: 13 +type: Liveness +antithesis_assertion: Sometimes(reader_returned_none_clean) +sut_path: lib/vector-buffers/src/variants/disk_v2/ +commit: b7aae737cef5dd37d1445915443a1eb97b584f85 +updated: 2026-05-28 +cross_refs: + - total-buffer-size-never-underflows # termination condition uses total_buffer_size == 0 + - writer-eventually-makes-progress # writer must be done before reader can terminate + - acked-files-eventually-deleted # deletion drives total_buffer_size to 0 +related_issues: + - "vectordotdev/vector #23456" # the exact flaky test this property covers + - "vectordotdev/vector #21683" # total_buffer_size underflow breaks termination +disabled_tests: + - "lib/vector-buffers/src/variants/disk_v2/tests/basic.rs::reader_exits_cleanly_when_writer_done_and_in_flight_acks (ignore = \"flaky. See https://github.com/vectordotdev/vector/issues/23456\")" +--- + +# Property 13: reader-drains-and-terminates-cleanly + +## Invariant (informal) + +When the writer is done (i.e., `mark_writer_done()` has been called) and every +record that was written has been read **and** acknowledged downstream, the +reader's `next()` coroutine returns `Ok(None)` — a clean `None` that signals +end-of-stream — within finite time. Two failure modes must both be excluded: + +- **Hang**: `next()` blocks indefinitely despite no more data or acks pending. 
+- **Premature None**: `next()` returns `None` before all records have been + delivered, truncating the stream and silently dropping undelivered events. + +--- + +## Termination Condition and Its Fragility + +The termination check is at `reader.rs:980-985`: + +```rust +// reader.rs:980-985 +if self.ledger.is_writer_done() { + let total_buffer_size = self.ledger.get_total_buffer_size(); + if total_buffer_size == 0 { + return Ok(None); + } +} +``` + +Both conditions must be simultaneously true: + +1. `is_writer_done()` reads the `writer_done: AtomicBool` at ledger.rs:410-412 + with `Ordering::Acquire`. This is set by `mark_writer_done()` (ledger.rs:403-407) + which is called when the writer is dropped/closed. + +2. `get_total_buffer_size()` reads `total_buffer_size: AtomicU64`. This is + decremented by `track_reads` (ledger.rs:393-397) when acks are processed in + `handle_pending_acknowledgements` (reader.rs:623-624). + +**Three distinct ways this condition can fail:** + +### Failure A — Hang (missed wakeup after a spent `Notify` permit) + +The termination check runs at the **top of each loop iteration**, before +`ensure_ready_for_read()`. The sequence inside a single `next()` call is: + +``` +loop { + handle_pending_acknowledgements(force_check) // may decrement total_buffer_size + CHECK: is_writer_done() && total_buffer_size == 0 ← termination exit here + ensure_ready_for_read() + try_next_record() + ... + wait_for_writer() ← woken by finalizer calling notify_writer_waiters() +} +``` + +The finalizer task (ledger.rs:703-707) calls: + +```rust +self.increment_pending_acks(amount); // ledger.rs:705: pending_acks += amount +self.notify_writer_waiters(); // ledger.rs:706: wakes reader +``` + +These two operations are **not atomic** as a pair. Both use `AcqRel`, so the +increment is visible once it commits, but the reader may be scheduled between +the two calls and run a loop iteration: + +1.
Finalizer: `pending_acks.fetch_add(amount, AcqRel)` — commit succeeds. +2. Reader wakes up (scheduled). +3. Reader: `handle_pending_acknowledgements` → `consume_pending_acks()` → sees + the new `pending_acks`. Processes acks. Decrements `total_buffer_size`. +4. Reader: termination check: `is_writer_done() == true`, + `total_buffer_size == 0` → **returns `Ok(None)`**. + +This path is actually the *intended* correct path. The race described in the +`#23456` test is subtler: the test at basic.rs:116-143 demonstrates that the +reader must enter `wait_for_writer()` **twice** before the ack arrives, because +the writer's close sends one spurious wakeup. The flakiness arises from: + +- The reader receives the spurious wakeup from writer close. +- It loops back to the termination check with `total_buffer_size > 0` (ack not + yet processed). Correctly, it continues. +- It waits again on `wait_for_writer()`. +- The ack arrives. Finalizer fires `notify_writer_waiters()`. +- Reader wakes. But this time, it must **consume the pending ack** in + `handle_pending_acknowledgements` *before* hitting the termination check, + because otherwise `total_buffer_size` is still non-zero and it waits again. + +The flakiness is in how tokio's `Notify` (edge-triggered, one-permit store) +interacts with the reader's poll frequency. If the reader's second +`wait_for_writer()` call races with the finalizer's `notify_writer_waiters()`: + +- If `notify_writer_waiters()` fires **before** `wait_for_writer()` is called, + the permit is stored. The next `wait_for_writer()` returns immediately. +- If the reader has already entered `.notified()` and is parked, the notify + wakes it. +- In both cases the reader eventually processes the ack and terminates. + +The "flaky" failure mode most likely involves the reader waking, *not* finding +the ack (a timing window where `pending_acks` is 0 because the finalizer task +hasn't yet run), and going back to sleep, missing the stored permit. 
But +`Notify` stores at most one permit — if the notify arrived while the reader was +awake, the permit may have been consumed by the previous wakeup. The reader +parks again, but now no notify is coming (the finalizer already fired its one +permit). **Hang.** + +This is exactly the "missed wakeup" concern noted in sut-analysis.md §4. + +### Failure B — Hang (total_buffer_size stuck non-zero due to underflow) + +From sut-analysis.md §5 (L8) and §6 (root cause #1): + +If `total_buffer_size` has wrapped to ≈ 2^64 due to the unguarded `fetch_sub` +in `decrement_total_buffer_size` (ledger.rs:291-298): + +```rust +// ledger.rs:291-298 — no saturation +pub fn decrement_total_buffer_size(&self, amount: u64) { + let last_total_buffer_size = self.total_buffer_size.fetch_sub(amount, Ordering::AcqRel); + ... +} +``` + +Then `total_buffer_size` equals ≈ 2^64, never reaches 0, and the termination +condition `total_buffer_size == 0` is never satisfied. The reader loops forever +in `wait_for_writer()`. **Permanent hang.** + +This is a distinct trigger from Failure A: A is a scheduling race in the +notify path that needs no prior corruption; B is an +arithmetic bug whose trigger is crash/partial-write discrepancies on restart. +Both lead to the same observable symptom: `next()` never returns. + +### Failure C — Premature None (total_buffer_size reaches 0 while undelivered records exist) + +This is the *opposite* direction. If `total_buffer_size` is decremented more +than it should be — for example, if a double-decrement occurs, or if the +startup `update_buffer_size` under-seeds the initial value — then +`total_buffer_size` reaches 0 while there are still unread or un-acked records +on disk. The termination check fires prematurely. Records are silently dropped. + +The startup seeding path (ledger.rs, `update_buffer_size`) seeds +`total_buffer_size` from the sum of `.dat` file sizes. The reader then +decrements by the number of bytes in each **record** (not the full file size).
+If the file contains padding, partial writes at the tail, or gap markers, the +per-record decrement total may be less than the file size, leaving +`total_buffer_size > 0` at end. But if the accounting goes the other way +(over-counting the decrements somehow), premature termination is possible. + +--- + +## The Exact `#23456` Race Path + +The disabled test `reader_exits_cleanly_when_writer_done_and_in_flight_acks` +(basic.rs:72-152) exercises the following sequence under a **single-event** +buffer: + +1. Write one `SizedRecord::new(32)`. Flush. Close writer. +2. `read_next_some(&mut reader)` — reads the record; does NOT ack yet. +3. `reader.next()` is polled. It must **not** return `None` here (the record + has been read but `total_buffer_size` is still non-zero because the ack + hasn't been processed). The test asserts it enters `wait_for_writer()` at + least **twice** (line 120) — once consuming the spurious wakeup from writer + close, once blocking for real. +4. `acknowledge(first_read).await` — drops the `BatchNotifier`, which triggers + the finalizer task. +5. Finalizer: `pending_acks += 1; notify_writer_waiters()`. +6. Reader wakes. `handle_pending_acknowledgements`: `consume_pending_acks() = 1`. + Processes the ack. `track_reads(...)` → `decrement_total_buffer_size(...)`. + `total_buffer_size` becomes 0. +7. Termination check: `is_writer_done() && total_buffer_size == 0` → `Ok(None)`. +8. Assert `blocked_read.is_woken()`. Assert `second_read == Ok(None)`. + +The flakiness is the window between steps 5 and 6: if the reader is scheduled +to run between `notify_writer_waiters()` (step 5) and when the reader actually +calls `consume_pending_acks()` (step 6), and the reader's `Notify` permit was +already consumed by a prior spurious wakeup in step 3, then the reader may park +in `wait_for_writer()` indefinitely after step 5 fires — because the permit was +already spent and the finalizer won't fire again. 
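The eight-step sequence above can be replayed in a small single-threaded model. This is a sketch under stated assumptions, not the SUT's code: `Model`, `Outcome`, and all method names are hypothetical, the `Notify` is reduced to one boolean permit, and pending acks are counted directly in bytes rather than going through `track_reads`.

```rust
/// Outcome of one modelled reader loop iteration.
#[derive(Debug, PartialEq, Eq)]
pub enum Outcome {
    /// Termination check passed; `next()` would return `Ok(None)`.
    Terminated,
    /// A stored permit was consumed, so the wait returned immediately.
    WokeImmediately,
    /// No permit was stored; the reader parks until the next notify.
    Parked,
}

/// Hypothetical reduction of the ledger state relevant to termination.
#[derive(Debug, Default)]
pub struct Model {
    pub permit: bool,           // at most one stored wakeup permit
    pub pending_ack_bytes: u64, // simplification: acks counted in bytes
    pub total_buffer_size: u64,
    pub writer_done: bool,
}

impl Model {
    /// Writer close: marks the writer done and leaves one (spurious) permit.
    pub fn writer_close(&mut self) {
        self.writer_done = true;
        self.permit = true;
    }

    /// Finalizer: increment first, then notify. Repeated notifies coalesce.
    pub fn finalizer_ack(&mut self, bytes: u64) {
        self.pending_ack_bytes += bytes;
        self.permit = true;
    }

    /// One loop iteration: consume acks, run the termination check, then wait.
    pub fn reader_iteration(&mut self) -> Outcome {
        let acked = std::mem::take(&mut self.pending_ack_bytes);
        self.total_buffer_size = self.total_buffer_size.saturating_sub(acked);
        if self.writer_done && self.total_buffer_size == 0 {
            return Outcome::Terminated;
        }
        if std::mem::take(&mut self.permit) {
            Outcome::WokeImmediately
        } else {
            Outcome::Parked
        }
    }
}
```

Replaying the test's schedule (close, two reader iterations, ack, one more iteration) gives `WokeImmediately`, `Parked`, `Terminated`, which matches the two-wait expectation in step 3. The model does not by itself reproduce a hang; it is only a scaffold for enumerating schedules that the real `Notify` and task scheduler may order differently.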
+ +Antithesis can **deterministically explore** this scheduling window, whereas +the unit test relies on tokio-test mock timers and `spawn` polling, which is +non-deterministic in timing. + +--- + +## Cross-Reference: `total-buffer-size-never-underflows` and `writer-eventually-makes-progress` + +This property depends on three other properties (the two named in the heading +plus `acked-files-eventually-deleted`): + +- **`total-buffer-size-never-underflows`**: If `total_buffer_size` wraps to 2^64, + the termination condition `total_buffer_size == 0` can never be satisfied + (Failure B). The clean-termination property is broken by the same arithmetic + bug that breaks writer liveness. + +- **`acked-files-eventually-deleted`**: File deletion is what drives + `total_buffer_size` to 0 in the normal path (Failure A requires the deletion + to have occurred to decrement the counter). If files are never deleted (e.g., + finalizer task died), `total_buffer_size` stays positive and the reader hangs. + +- **`writer-eventually-makes-progress`**: Less direct. But if the writer is + deadlocked, it cannot call `close()` / `mark_writer_done()`, so + `is_writer_done()` stays false, and the termination check never fires even + if `total_buffer_size == 0`. This is a distinct hang path: the reader waits + for the writer to be done, the writer waits for the reader to free space, the + reader waits for acks — circular dependency if `total_buffer_size` wrapped. + +--- + +## Antithesis Experimental Design + +### Target scenario + +1. Configure a small buffer (one or two data files). Write N records. Flush. + Close the writer (call `writer.close()`). +2. Read all N records via `reader.next()`. Collect all `BatchNotifier` handles + without yet acking. +3. Assert that `reader.next()` does **not** return `None` at this point (it + should be waiting, since `total_buffer_size > 0`). +4. Drop all notifiers (ack all records). Wait for the finalizer task to fire. +5. Assert that `reader.next()` returns `Ok(None)` within a timeout. +6.
Assert `total_buffer_size == 0` via ledger introspection. + +### Antithesis scheduling exploration + +- **Interleave the ack between the two `wait_for_writer()` calls** (steps 3 + and 4 in the `#23456` race path above). This is precisely the window that + makes the unit test flaky. Antithesis's scheduler can deterministically force + this interleaving. +- **Delay finalizer task scheduling** relative to the reader polling. With + Antithesis's virtual scheduling, the finalizer task can be held off until + after the reader has re-entered `wait_for_writer()`, testing whether the + `Notify` permit is correctly stored or dropped. +- **SIGKILL between writer close and ack arrival.** On restart, the buffer + should replay the un-acked record. The reader should read it again, ack it, + and then terminate cleanly. Assert no duplication in the downstream oracle + (requires idempotent downstream). + +### Fault-specific assertions + +- **Premature None detection** (workload oracle): Track the total number of + events enqueued by the writer. Assert that the total number of events + delivered to the downstream sink equals the number enqueued (minus any + crash-window losses that are expected). A premature `None` silently truncates + this count. +- **Hang detection** (workload oracle): After `writer.close()`, assert that + `reader.next()` returns within a finite bound (e.g., 10× the + `flush_interval`). If it does not, capture the ledger state: + `is_writer_done`, `total_buffer_size`, `pending_acks`, `reader_last_record`, + `writer_next_record`. 
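The two workload oracles above could be combined into one classification step. The following is a minimal sketch, assuming the workload already tracks enqueued and delivered event counts and how long it has waited since `writer.close()`. `Observation`, `Verdict`, and `judge` are hypothetical names, and duplicates from replay (delivered greater than enqueued) are deliberately not flagged here.

```rust
use std::time::Duration;

/// What the workload has observed so far (all fields are workload-side counters).
pub struct Observation {
    pub enqueued: u64,       // events the writer accepted
    pub delivered: u64,      // events seen by the downstream sink
    pub allowed_loss: u64,   // expected crash-window loss, 0 if no crash was injected
    pub returned_none: bool, // did reader.next() return Ok(None)?
    pub waited: Duration,    // time since writer.close()
}

#[derive(Debug, PartialEq, Eq)]
pub enum Verdict {
    /// Reader terminated and every expected event was delivered.
    Clean,
    /// Reader has not terminated yet but is still within the bound.
    StillWaiting,
    /// Reader terminated while more events than allowed were missing.
    PrematureNone { missing: u64 },
    /// Reader did not terminate within the bound; capture ledger state.
    Hang { waited: Duration },
}

/// Classify an observation against a termination bound.
pub fn judge(obs: &Observation, bound: Duration) -> Verdict {
    let missing = obs.enqueued.saturating_sub(obs.delivered);
    if obs.returned_none {
        if missing > obs.allowed_loss {
            Verdict::PrematureNone { missing }
        } else {
            Verdict::Clean
        }
    } else if obs.waited > bound {
        Verdict::Hang { waited: obs.waited }
    } else {
        Verdict::StillWaiting
    }
}
```

With the 500ms flush interval, the suggested bound of 10× would be `Duration::from_millis(500) * 10`, i.e. five seconds.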
+ +### Assertions to add (SUT-side, none currently exist) + +```rust +// At the top of the next() loop, before the termination check, assert +// that if we are about to return None, total_buffer_size is truly 0 and +// there are no pending_acks that haven't been consumed yet: +if self.ledger.is_writer_done() { + let total_buffer_size = self.ledger.get_total_buffer_size(); + let pending_acks = self.ledger.pending_acks.load(Ordering::Acquire); + if total_buffer_size == 0 { + antithesis_sdk::assert_always!( + pending_acks == 0, + "reader returns None only when no pending acks remain", + &serde_json::json!({ + "total_buffer_size": total_buffer_size, + "pending_acks": pending_acks, + }) + ); + antithesis_sdk::assert_sometimes!( + true, + "reader returned None cleanly after writer done and buffer empty", + &serde_json::json!({ + "reader_last_record": self.last_reader_record_id, + }) + ); + return Ok(None); + } +} +``` + +```rust +// In finalizer task (ledger.rs:703-707), after increment and notify, +// confirm reachability: +tokio::spawn(async move { + while let Some((_status, amount)) = stream.next().await { + self.increment_pending_acks(amount); + antithesis_sdk::assert_sometimes!( + true, + "finalizer task delivered ack to pending_acks", + &serde_json::json!({ "amount": amount }) + ); + self.notify_writer_waiters(); + } +}); +``` + +--- + +## Why This Matters + +Clean reader termination is the **shutdown contract** of the disk buffer. A +Vector topology performs graceful shutdown by: (1) stopping sources, (2) +waiting for all in-flight events to be delivered and acked, (3) tearing down +sinks. If the disk buffer's reader never returns `None`, step (2) hangs +forever. The operator must send SIGKILL, which: + +- Loses up to 500ms of unflushed writes (the documented window). +- Leaves the buffer in a partially-acknowledged state that triggers + re-processing on restart (duplicates). 
+- In the `#21683` scenario, leaves `total_buffer_size` in a state that causes + a permanent deadlock on the next startup. + +A premature `None` is equally bad: events that have been written to disk but +not yet delivered are silently abandoned. The customer loses data they expected +the disk buffer to protect. + +The `#23456` test being **disabled as flaky** is a direct signal that this +property is not reliably upheld even under the deterministic conditions of +`tokio-test`. Antithesis, by exploring all scheduler interleavings, should be +able to (a) reproduce the flaky failure deterministically, (b) isolate the +exact race window, and (c) verify any fix actually closes the window. + +--- + +## Open Questions + +1. **What is the root cause of `#23456` flakiness?** The test's comment + (basic.rs:107-115) explains the two-wait requirement but does not identify + what scheduling interleaving causes it to fail. Is it the `Notify` permit + being spent before the second park, or a more subtle ordering between the + finalizer task and the reader task? Antithesis can answer this definitively. + +2. **Is there a window between `mark_writer_done()` and the reader checking + `is_writer_done()` where `total_buffer_size` drops to 0 for an unrelated + reason?** If so, a premature `None` could occur before the writer has + finished all writes. The `mark_writer_done()` is called from + `BufferWriter::close()` or `BufferWriter::drop()`. If the writer is dropped + mid-write (e.g., topology tear-down), this window exists. + +3. **Does `notify_writer_waiters()` at ledger.rs:706 correctly wake the reader + in all cases?** The `Notify` API stores at most one permit. If multiple acks + arrive before the reader wakes, multiple calls to `notify_writer_waiters()` + collapse into one permit. The reader wakes once, processes all pending acks + in one `consume_pending_acks()` call (which atomically swaps to 0), and + terminates. 
This appears correct but needs verification under high-concurrency + ack delivery (many sink workers acking simultaneously, each triggering the + finalizer). + +4. **Does the graceful shutdown sequence guarantee `writer.close()` is called + before the reader is asked to terminate?** If the topology drops the reader + before the writer is marked done, `is_writer_done()` stays false and the + reader hangs. This is an integration question about topology shutdown + ordering, not visible from the disk-buffer code alone. + +5. **Can `total_buffer_size` reach 0 before all in-flight acks are processed + due to the gap-marker path?** If records are skipped due to corruption + (`events_skipped > 0` at reader.rs:599-601), `track_dropped_events` is + called (reader.rs:639) but does NOT decrement `total_buffer_size`. The + decrement only happens in `track_reads` (via `bytes_acknowledged`). If + corruption causes a gap that accounts for the last remaining bytes in + `total_buffer_size`, and the gap is processed by `events_skipped` rather + than `bytes_acknowledged`, `total_buffer_size` may not reach 0 when it + should. Or vice versa, it may reach 0 prematurely if gap markers account + for bytes that were already decremented elsewhere. This interaction deserves + careful tracing. 
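The wrap-around behind Failure B, and the accounting concern in open question 5, can be reproduced in isolation with a bare `AtomicU64`. This is a sketch only: both function names are hypothetical, and the saturating variant is shown for contrast, not as the project's chosen fix.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Mirrors an unguarded `fetch_sub`: subtracting more than the current value
/// wraps around instead of stopping at zero. Returns the new value.
pub fn decrement_wrapping(counter: &AtomicU64, amount: u64) -> u64 {
    counter.fetch_sub(amount, Ordering::AcqRel).wrapping_sub(amount)
}

/// Hypothetical guarded variant: clamps at zero using a compare-and-swap loop.
/// Returns the new value.
pub fn decrement_saturating(counter: &AtomicU64, amount: u64) -> u64 {
    let previous = counter
        .fetch_update(Ordering::AcqRel, Ordering::Acquire, |v| {
            Some(v.saturating_sub(amount))
        })
        .expect("closure always returns Some");
    previous.saturating_sub(amount)
}
```

Starting from 32 and subtracting 40, the wrapping variant lands at `u64::MAX - 7`, so a `== 0` termination check can never pass afterwards; the saturating variant lands at 0, which restores termination but hides the accounting discrepancy that caused the over-decrement.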
diff --git a/tests/antithesis/scratchbook/properties/record-id-monotonicity-holds.md b/tests/antithesis/scratchbook/properties/record-id-monotonicity-holds.md new file mode 100644 index 0000000000000..aa388288ff50e --- /dev/null +++ b/tests/antithesis/scratchbook/properties/record-id-monotonicity-holds.md @@ -0,0 +1,138 @@ +# Evidence: record-id-monotonicity-holds + +## Property Identification + +**Slug:** record-id-monotonicity-holds +**Type:** Safety / Unreachable +**Assertion macro:** `assert_unreachable!("record_id_monotonicity_violation", "Record ID monotonicity violation: reader received a record ID <= last seen ID", ...)` + +This property was identified by locating the existing `panic!` in `BufferReader::track_read` at reader.rs:481-483. The panic text reads: + +``` +"record ID monotonicity violation detected; this is a serious bug" +``` + +This is a guardrail that the authors placed to catch a class of state corruption that they considered serious enough to warrant immediate process termination. For Antithesis purposes, the right framing is `Unreachable`: this code path must never be reached under any fault scenario. If it is reached, Antithesis has found a genuine bug. + +Unlike `AlwaysOrUnreachable` (which applies to paths that are legitimately unreachable in the absence of faults), `Unreachable` is appropriate here because record ID monotonicity is an invariant that must hold even in the presence of crash faults. The design specifically addresses how crashes interact with record IDs (`validate_last_write` + ledger fast-forward), and a violation means that interaction is broken. + +## Code Chain Leading to the Property + +### The Guardrail: `track_read` (reader.rs:448-485) + +```rust +fn track_read(&mut self, record_id: u64, record_bytes: u64, event_count: NonZeroU64) { + self.last_reader_record_id = record_id.wrapping_add(event_count.get() - 1); + // ... 
+ if let Err(me) = + self.record_acks + .add_marker(record_id, Some(event_count.get()), Some(record_bytes)) + { + match me { + MarkerError::MonotonicityViolation => { + panic!("record ID monotonicity violation detected; this is a serious bug") + } + } + } +} +``` + +`track_read` is called immediately after `reader.read_record(token)` succeeds (reader.rs:1115), before emitting the record. The panic is in the acknowledgement tracking machinery: `OrderedAcknowledgements::add_marker` rejects a marker whose ID is less than or equal to the last acknowledged ID (i.e., out of order or duplicate). The only error variant is `MarkerError::MonotonicityViolation`, and it is handled by panic. + +### What Establishes Monotonicity + +Record IDs are produced by the writer. Each record occupies IDs `[next_record_id, next_record_id + event_count - 1]` (writer.rs:755-757: `get_next_record_id` returns `self.next_record_id.wrapping_add(self.unflushed_events)`). After a successful flush, `increment_next_writer_record_id(flushed_events)` advances the ledger's `writer_next_record` atomic (writer.rs:779-781). The writer never reuses an ID for a new record within a single process lifetime. + +On restart, `validate_last_write` (writer.rs:838-991) reads the last record from the current data file and compares its `record_next` (last ID + event_count) against `ledger_next`: + +- `Ordering::Equal`: ledger matches data; proceed. +- `Ordering::Less` (data ahead of ledger): fast-forward ledger via `increment_next_writer_record_id(ledger_record_delta)` (writer.rs:940-941). This prevents the writer from reusing IDs that were already on disk. +- `Ordering::Greater` (ledger ahead of data): roll to next file; mark "events have likely been lost" (writer.rs:913-920). The writer starts fresh from the ledger value. 
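
The three-way comparison above can be modelled in isolation. The following is a minimal sketch, not the writer's code: `Recovery` and `classify_recovery` are hypothetical names, and the sketch assumes the comparison is `ledger_next.cmp(&record_next)`, which matches the branch descriptions (`Less` = data ahead of ledger).

```rust
use std::cmp::Ordering;

/// Hypothetical summary of the three recovery outcomes described in the notes.
#[derive(Debug, PartialEq)]
enum Recovery {
    /// Ledger and data agree; nothing to repair.
    Proceed,
    /// Data is ahead of the ledger; advance the ledger by `delta` IDs.
    FastForward { delta: u64 },
    /// Ledger is ahead of the data; roll to the next file (events likely lost).
    RollToNextFile,
}

/// Sketch only. `ledger_next` is the ledger's next writer record ID and
/// `record_next` is the last on-disk record's ID plus its event count.
fn classify_recovery(ledger_next: u64, record_next: u64) -> Recovery {
    match ledger_next.cmp(&record_next) {
        Ordering::Equal => Recovery::Proceed,
        // Safe plain subtraction: this arm guarantees record_next > ledger_next.
        Ordering::Less => Recovery::FastForward {
            delta: record_next - ledger_next,
        },
        Ordering::Greater => Recovery::RollToNextFile,
    }
}

fn main() {
    assert_eq!(classify_recovery(50, 50), Recovery::Proceed);
    assert_eq!(
        classify_recovery(30, 50),
        Recovery::FastForward { delta: 20 }
    );
    assert_eq!(classify_recovery(50, 30), Recovery::RollToNextFile);
    println!("recovery classification sketch ok");
}
```

Only the `FastForward` arm moves the writer's next ID, so it is the arm where a wrong `record_next` (for example from a torn tail) could feed a bad value into later record IDs.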
+ +The reader establishes its baseline from `ledger.state().get_last_reader_record_id()` (reader.rs:423) and initializes `record_acks` with `from_acked(next_expected_record_id)` (reader.rs:435), where `next_expected_record_id = ledger_last_reader_record_id.wrapping_add(1)`. This tells `OrderedAcknowledgements` to expect IDs starting from one past the last acknowledged point. + +### Interaction with `seek_to_next_record` (reader.rs:810-947) + +During initialization, `seek_to_next_record` calls `self.next()` in a loop (reader.rs:904-938) until `self.last_reader_record_id >= ledger_last`. Each call to `next()` calls `track_read`, which calls `add_marker`. If during seek a record with a lower ID than expected is encountered (e.g., because `validate_last_write` fast-forwarded the writer ledger to ID 50, but the reader's ledger shows `reader_last_record = 30`, and there is a legitimate record on disk at ID 35), `add_marker` should accept it. But if the ledger fast-forward overshot, or if the reader's `OrderedAcknowledgements` was initialized with the wrong `from_acked` value, a record at a lower-than-expected ID would trigger the panic. + +### Interaction with File-ID Rollover + +File IDs are `u16`, wrapping at 65,536 (reader.rs:932: `reader_file_id > writer_file_id` comparison, which is raw `u16 >` — not wrapping-aware). In production this is unlikely (131,072 file IDs written before a file is reused). In tests with `MAX_FILE_ID=6`, this is reachable. If the file-ID rollover causes the reader to open a file from a previous generation, that file's record IDs could be far lower than the current `record_acks` watermark, triggering the monotonicity panic. + +### Torn-Tail Mis-Recovery (F5 from sut-analysis.md §10) + +`validate_last_write` calls `validate_record_archive(data_file_mmap.as_ref(), ...)` which uses `archived_root` to locate the last record. `archived_root` reads the root offset from the **last 8 bytes** of the mmap. 
A crash that leaves trailing garbage bytes could cause `archived_root` to interpret a plausible but incorrect offset as valid, yielding a `RecordStatus::Valid` record with a wrong `id` field. If that `id` is lower than the correct last-written ID, `validate_last_write` then calls `increment_next_writer_record_id` to fast-forward the writer to `record_next = wrong_id + event_count` — which might be behind the correct current position. The writer then produces a record with a lower ID than already on disk, and the reader's `add_marker` panics. + +This torn-tail scenario is the most credible crash-fault path to a monotonicity violation. + +## What Goes Wrong if the Property is Violated (the Panic Fires) + +The `panic!` terminates the Vector process immediately. In a production deployment this triggers a process restart. On restart, `validate_last_write` and `seek_to_next_record` run again. If the underlying state that caused the violation persists (i.e., the ledger and data files are in a state that will always produce out-of-order IDs), the process enters an infinite restart loop — a permanent pipeline stall. This is operationally equivalent to the writer deadlock bug (#21683) in severity: no data flows, no error is clearly surfaced to the user (just repeated crash logs), and the buffer can only be recovered by manual intervention. + +Additionally, the panic discards any in-flight data in `TrackingBufWriter`'s 256KB buffer that has not yet been written to the data file (`BufferWriter::Drop` calls `close()` but not `flush()`), compounding the data loss. + +## Timing / Fault Conditions for Antithesis + +- **Node kill during `validate_last_write` fast-forward**: If the process is killed after `increment_next_writer_record_id` updates the ledger (writer.rs:940) but before the data file is updated, the ledger is ahead of the data. On the next restart, the `Ordering::Greater` branch fires and rolls to a new file — potentially skipping records. 
This is the documented "events have likely been lost" path and does not itself cause a monotonicity violation, but it produces a gap in record IDs that the reader must handle via gap markers in `OrderedAcknowledgements`. + +- **Node kill during `seek_to_next_record`**: If the process is killed while the reader is replaying records from a file, the next restart starts `seek_to_next_record` again. The ledger `reader_last_record` is updated lazily (only on explicit `ledger.flush()`). If the ledger persisted a partially-advanced `reader_last_record`, the reader might seek past a valid record, leaving its ID below the `record_acks` watermark. + +- **File-ID rollover during test** (small `MAX_FILE_ID`): Causes the reader to open a file from a previous file-ID generation. The raw `u16 >` comparison at reader.rs:932 does not handle wrap-around, so this can cause the reader to believe it is ahead of the writer when actually it has wrapped. The reader opens an old file, finds records with much lower IDs than `record_acks` expects, and triggers the panic. + +- **External file placement**: Placing a valid-looking `buffer-data-N.dat` from a different run (with lower record IDs) into the buffer directory. The reader would open it as part of the sequence, read records with old IDs, and panic. + +## SUT-Side Instrumentation Suggestions (ALL MISSING) + +The existing `panic!` is a strong signal but not an Antithesis assertion. 
Adding the SDK assertion before the panic converts this into a structured finding: + +**Primary assertion** — replace the panic in `track_read` (reader.rs:481-483) with an assertion that fires before panicking: + +```rust +MarkerError::MonotonicityViolation => { + antithesis_sdk::assert_unreachable!( + "record_id_monotonicity_violation", + "Record ID monotonicity violation: this is a serious bug", + &serde_json::json!({ + "record_id": record_id, + "last_reader_record_id": self.last_reader_record_id, + "data_file_id": self.ledger.get_current_reader_file_id(), + "writer_file_id": self.ledger.get_current_writer_file_id(), + "ledger_last_reader_record_id": self.ledger.state().get_last_reader_record_id(), + "ledger_next_writer_record_id": self.ledger.state().get_next_writer_record_id(), + }) + ); + panic!("record ID monotonicity violation detected; this is a serious bug"); +} +``` + +The `assert_unreachable!` fires before the panic, giving Antithesis a structured report with state context (IDs, file IDs, ledger values) that can be correlated with the fault that caused the violation. + +**Supporting instrumentation** — in `validate_last_write` (writer.rs:838-991), log the `Ordering::Less` fast-forward case with the ledger delta: + +```rust +// writer.rs:922-944 (Ordering::Less branch) +let ledger_record_delta = record_next - ledger_next; +// Before increment_next_writer_record_id: +antithesis_sdk::assert_sometimes!( + "writer_ledger_fast_forwarded", + "Writer ledger fast-forwarded after crash: data ahead of ledger", + &serde_json::json!({ + "ledger_next": ledger_next, + "data_next": record_next, + "delta": ledger_record_delta, + "last_record_id_on_disk": last_record_id, + }) +); +``` + +This makes the crash-recovery fast-forward path reachable in Antithesis testing, and also lets the test author correlate fast-forward events with subsequent monotonicity violations. 
+ +## Open Questions + +- **Does `OrderedAcknowledgements::add_marker` use wrapping-aware comparison?** If record IDs wrap around `u64::MAX` (theoretically possible after 2^64 writes, not practically reachable but logically relevant at the zero-initialized state), a wrapping `record_id` of 0 would appear to violate monotonicity relative to a high watermark. This matters for how the reader handles the first record on a fresh buffer (where `next_expected_record_id = 0 + 1 = 1` but the first record might have ID 0). Checking the `OrderedAcknowledgements` implementation would clarify whether this is handled correctly. + +- **Is the `reader_file_id > writer_file_id` comparison at reader.rs:932 wrapping-safe?** The SUT analysis flags this as a known ordering bug with `MAX_FILE_ID=6`. If file-ID rollover causes this comparison to yield the wrong result, the reader exits the seek loop too early and then reads from the wrong file position — which could produce out-of-order record IDs and trigger the panic. This needs a dedicated test or direct code fix before the Antithesis property can be considered "sound." + +- **Can `validate_last_write` ever produce `record_next` lower than `ledger_next` due to torn-tail mis-read?** If yes, and if the `Ordering::Greater` branch simply rolls to the next file (writer.rs:910-920) without updating `next_record_id`, the writer's next record ID might be lower than what the reader's `record_acks` expects. Specifically: `validate_last_write` only updates `self.next_record_id` in the `Ordering::Less` branch (writer.rs:941); in the `Ordering::Greater` branch it just sets `should_skip_to_next_file = true`. Does `self.next_record_id` remain at `ledger.state().get_next_writer_record_id()` (the pre-crash persisted value), which might be higher than the torn-tail record's ID? This determines whether `Ordering::Greater` is safe from a monotonicity perspective. 
+ +- **What is the exact behavior of `OrderedAcknowledgements::from_acked` at the `u64` boundary?** If `ledger_last_reader_record_id` is `u64::MAX`, then `wrapping_add(1)` produces `next_expected_record_id = 0`. The reader's `record_acks` would accept only a record with ID >= 0, i.e., any record. This is correct behavior for the wrapping case, but only if `add_marker` also uses wrapping arithmetic. Confirming this would determine whether the panic path is reachable at wrapping boundaries. diff --git a/tests/antithesis/scratchbook/properties/record-id-wraparound-accounting-holds.md b/tests/antithesis/scratchbook/properties/record-id-wraparound-accounting-holds.md new file mode 100644 index 0000000000000..fe0e8b2e68a3b --- /dev/null +++ b/tests/antithesis/scratchbook/properties/record-id-wraparound-accounting-holds.md @@ -0,0 +1,166 @@ +--- +slug: record-id-wraparound-accounting-holds +type: Safety / Always +status: LATENT BUG — the `- 1` at `ledger.rs:266` is not wrapping; equality case produces u64::MAX +sut_commit: b7aae737cef5dd37d1445915443a1eb97b584f85 +--- + +# Property 17: record-id-wraparound-accounting-holds + +## Catalog Entry + +**Type:** Safety / Always + +**Property:** Across u64 record-ID wraparound and at the empty-buffer equality case, +`get_total_records` returns a semantically correct count (0 when empty, N when N events are +unacknowledged). It never returns an astronomically wrong value (~2^64) that would corrupt +metrics, trigger false "buffer full" accounting, or cause reporting loops to emit junk gauges. + +**Invariant:** `get_total_records() <= actual_unacked_event_count + some_small_bounded_delta`. +Specifically, `get_total_records()` must return 0 when the buffer is empty (all events acked, +`next_writer_record_id == last_reader_record_id`), and must never return a value close to +`u64::MAX`. 
+ +**The Bug — `ledger.rs:262-267`:** + +```rust +pub fn get_total_records(&self) -> u64 { + let next_writer_id = self.state().get_next_writer_record_id(); + let last_reader_id = self.state().get_last_reader_record_id(); + + next_writer_id.wrapping_sub(last_reader_id) - 1 // <-- outer `-1` is NOT wrapping +} +``` + +The function computes `(next_writer_id wrapping_sub last_reader_id) - 1`. The `wrapping_sub` +is correct for the modular distance between the writer ID and the reader ID — it handles u64 +wraparound. But the outer `- 1` is plain Rust integer subtraction, which panics in debug mode +on underflow and wraps to `u64::MAX` in release mode when the intermediate result is 0. + +**When does the intermediate result equal 0?** + +`wrapping_sub` returns 0 when `next_writer_id == last_reader_id`. This happens at initialization: +both start at 0 (or at the same persisted value on restart when no unacked events remain). The +empty-buffer state is exactly the case where these two IDs are equal. + +The doc-comment for `get_last_reader_record_id` clarifies: the reader ID is the ID of the last +record acknowledged. The writer ID is the ID of the next record to be written. When the buffer +is empty and fully caught up, the writer's "next" ID and the reader's "last acked" ID are +equal (both pointing at the same boundary). This makes `wrapping_sub(...) = 0`, and then +`0 - 1` in u64 = `u64::MAX = 18446744073709551615`. + +**Downstream impact:** + +`get_total_records` is called at two sites: + +1. **`synchronize_buffer_usage` (`ledger.rs:517`):** Called during initialization after + `seek_to_last_record` and `validate_last_write`. If the buffer is empty on startup (all + events acked before previous shutdown), `get_total_records()` returns `u64::MAX`. This is + passed to `increment_received_event_count_and_byte_size(u64::MAX, ...)`, which adds `u64::MAX` + to the received-events atomic counter. 
The buffer usage reporter then emits a + `BufferEventsReceived` event with `count = u64::MAX`, setting the buffer-size gauge to + `u64::MAX as f64 = 1.844e19`. The gauge is permanently stuck at an astronomical value for + the lifetime of the process (or until enough sent/dropped events saturate the subtraction + back down, which at normal throughput would take millions of years). This is visible as a + stuck/wrong buffer-size metric on dashboards, the same symptom class as issue #23995. + +2. **`tests/` (model invariant checks):** `get_total_records` is used in model tests + (`tests/model/mod.rs:1005`, `tests/invariants.rs:115, 708-719`) to assert that the event + count matches expectations. If the model exercises an empty-buffer state and calls + `get_total_records`, the test would panic (debug build) or return `u64::MAX` (release + build, causing the assert to fail). The test suite likely avoids the exact initial state + where both IDs are equal by construction, but this is fragile. + +**The record-ID-wraparound case:** + +The outer `- 1` being non-wrapping also affects the record-ID near-wrap case. If +`next_writer_id = 0` (just wrapped from `u64::MAX`) and `last_reader_id = u64::MAX` (not yet +caught up from before the wrap), then `0u64.wrapping_sub(u64::MAX) = 1`, and `1 - 1 = 0`, +which is actually correct (the buffer has 1 unit of unacknowledged distance). But if +`next_writer_id = u64::MAX` (about to wrap) and `last_reader_id = u64::MAX` (caught up), +both equal → `wrapping_sub` = 0 → `0 - 1 = u64::MAX`. Same bug, different trigger point. + +A candidate fix is `next_writer_id.wrapping_sub(last_reader_id).saturating_sub(1)`, which clamps +the zero-distance case to 0. Chaining `.wrapping_sub(1)` instead would only remove the +debug-build panic: `0u64.wrapping_sub(1)` is still `u64::MAX`. + +**Antithesis Angle:** + +1. **The equality/empty case (most reachable):** Start Vector with a disk buffer. Write N + events. Read and acknowledge all N events (drain the buffer completely). Assert immediately + after drain that `get_total_records() == 0`. In an Antithesis run, a workload can drain the + buffer and then query the metric. 
The metric value `18446744073709551615` (or any value > + N+small_delta) should trigger a workload-side assertion failure. + +2. **The near-wraparound case (requires injection):** Use the test-only + `unsafe_set_writer_next_record_id` and `unsafe_set_reader_last_record_id` helpers + (`ledger.rs:174-196`) to place both IDs near `u64::MAX`, then drain the buffer to equality. + Assert `get_total_records() == 0`. This is available only in test builds. + +3. **The `synchronize_buffer_usage` path:** Start Vector with an empty, previously-used buffer + (all events acked on last run). After startup, scrape `buffer_size_events` (or equivalent + gauge). Assert the value is close to 0, not `u64::MAX`. This requires only a workload + + metric scrape and is exercisable without any special test hooks. + +SUT-side assertion: add `assert_always!(result <= reasonable_upper_bound, "get_total_records returned impossible value", { "result": result, "writer_id": next_writer_id, "reader_id": last_reader_id })` inside `get_total_records`, where `reasonable_upper_bound` is something like `self.get_total_buffer_size() / min_record_size` or simply `u64::MAX / 2`. + +**Why It Matters:** + +The empty-buffer case is the most common and most benign path through the system — a healthy +drain-and-restart cycle. Yet this is precisely the path that triggers `u64::MAX` in +`get_total_records`, immediately poisoning the buffer's usage metrics on every clean restart +with an empty buffer. The resulting metric spike (buffer size gauge = 1.844e19) is visible on +dashboards and may trigger false alerts. It also adds `u64::MAX` to the received-events counter +atomically in `synchronize_buffer_usage`, which can interfere with other metric accounting +(e.g. the `current()` computation in `ReporterCurrentMetrics` uses `saturating_sub`, so the +gauge would clamp rather than go negative, but would still be stuck at the saturated value). 
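
The arithmetic discussed in this section can be checked in isolation. The sketch below is a standalone model, not the ledger code. Both function names are hypothetical, and `total_records_release_semantics` spells the outer subtraction as `wrapping_sub(1)` purely so that the release-build result is reproducible in any build profile.

```rust
/// Models the expression quoted from `ledger.rs:266` as it behaves when
/// overflow checks are off (release semantics). Hypothetical helper name.
fn total_records_release_semantics(next_writer_id: u64, last_reader_id: u64) -> u64 {
    next_writer_id.wrapping_sub(last_reader_id).wrapping_sub(1)
}

/// Hypothetical variant that clamps the zero-distance case to 0 instead of
/// wrapping. This is an illustration, not a confirmed upstream fix.
fn total_records_clamped(next_writer_id: u64, last_reader_id: u64) -> u64 {
    next_writer_id.wrapping_sub(last_reader_id).saturating_sub(1)
}

fn main() {
    // Equality case: modular distance 0, then `- 1`.
    assert_eq!(total_records_release_semantics(0, 0), u64::MAX);
    assert_eq!(total_records_clamped(0, 0), 0);

    // Same trigger with both IDs parked at u64::MAX.
    assert_eq!(total_records_release_semantics(u64::MAX, u64::MAX), u64::MAX);
    assert_eq!(total_records_clamped(u64::MAX, u64::MAX), 0);

    // Writer wrapped to 0, reader still at u64::MAX: modular distance is 1.
    assert_eq!(total_records_release_semantics(0, u64::MAX), 0);
    assert_eq!(total_records_clamped(0, u64::MAX), 0);

    // With overflow checks on, the plain `0 - 1` is the panicking step;
    // `checked_sub` exposes the same condition without panicking.
    assert_eq!(0u64.checked_sub(1), None);

    println!("wraparound arithmetic sketch ok");
}
```

Note that the clamped variant maps both distance 0 and distance 1 to a count of 0. Which of those two states means "empty" depends on the ledger's ID convention, which this sketch does not settle.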
+ +This is a distinct bug from the `total_buffer_size` underflow (INV-7), but shares the same +symptom class: lying buffer metrics with no error log and no indication of the root cause. The +Antithesis property provides an automatic regression test for the fix. + +## Open Questions + +1. **Does the current test suite actually exercise the empty-buffer equality case?** The model + tests call `get_total_records` (`tests/model/mod.rs:1005`) but the model initializes with + specific non-empty state. The unit test at `tests/invariants.rs:115` asserts + `get_total_records() == 1` after writing one record, not 0 after draining. If no test + exercises the zero case, the bug is undetected by the test suite. Verifying this would + require scanning the test suite for sequences that drain to empty and then call + `get_total_records`. + +2. **Is the initial state of `last_reader_record_id` actually equal to `writer_next_record_id` + on a fresh buffer start?** Looking at `BufferReader::new` (`reader.rs:423-424`): + `ledger_last_reader_record_id = ledger.state().get_last_reader_record_id()` and + `next_expected_record_id = ledger_last_reader_record_id.wrapping_add(1)`. On a fresh start + both IDs start at 0 (the ledger's default value). `wrapping_sub(0, 0) = 0`, then `0 - 1 = + u64::MAX`. This confirms the bug triggers on every fresh-start with an empty buffer. + +3. **Does the bug only affect metrics, or does the corrupted `get_total_records` value flow + into any control-path decision?** Currently `get_total_records` is used only in + `synchronize_buffer_usage` (metrics) and tests. It is not used in `is_buffer_full()`, + `can_write_record()`, or any reader/writer gating logic. So the bug corrupts metrics but + does not directly cause a pipeline stall or data loss. However, if a future change uses + `get_total_records` in a control path (e.g. to decide when the buffer is empty for clean + shutdown), the impact would become severe. Flag this as a correctness issue regardless. 
+ +4. **Would a debug build of Vector panic on the `0 - 1` subtraction?** In Rust, integer + overflow panics when overflow checks are enabled, which is the default for debug builds (the + `overflow-checks` profile setting). `0u64 - 1` would panic with "attempt + to subtract with overflow." If Antithesis runs a debug build, the test would surface this + bug as a panic rather than a silent `u64::MAX` value, which is actually a stronger signal. + Production builds use `--release`, where overflow checks are off by default, so they would + wrap silently unless the profile re-enables them. The Antithesis harness should state which + build profile (and `overflow-checks` setting) is used. + +--- + +### Investigation Log + +#### Does the debug-build `synchronize_buffer_usage`/`get_total_records` `0-1` panic occur before release semantics are observable? + +**Examined:** `ledger.rs:262–267` (`get_total_records`), `ledger.rs:516–524` (`synchronize_buffer_usage`), `ledger.rs:173–202` (`unsafe_set_writer_next_record_id`, `unsafe_set_reader_last_record_id`). + +**Found:** The arithmetic at ledger.rs:266 is `next_writer_id.wrapping_sub(last_reader_id) - 1`. The `wrapping_sub` is wrapping-safe; the outer `- 1` is plain Rust integer subtraction on `u64`. In a debug build this will panic with "attempt to subtract with overflow" when `wrapping_sub` returns 0 (empty-buffer equality case). `synchronize_buffer_usage` at ledger.rs:517 calls `get_total_records()` unconditionally during initialization; if the buffer is empty, the panic fires before any metric is emitted. In a release build the same operation silently wraps to `u64::MAX` and proceeds to `increment_received_event_count_and_byte_size(u64::MAX, ...)` at ledger.rs:520–523. The debug-vs-release divergence is therefore: **debug → panic at startup on empty buffer; release → silent u64::MAX metric poisoning**. This affects harness build-mode selection: a debug build surfaces the bug as a crash (easier to detect), while a release build surfaces it as a metric anomaly (harder to detect without workload-side assertions). 
+ +**Found — cfg(test) gating of the u64-wrap helpers:** `unsafe_set_writer_next_record_id` (ledger.rs:173–187) and `unsafe_set_reader_last_record_id` (ledger.rs:189–202) are both annotated `#[cfg(test)]`. They are unavailable in production or Antithesis-production binaries; the near-wraparound test path (injecting IDs near `u64::MAX`) is only exercisable in test builds. + +**Not found:** No evidence of a build-profile guard inside `get_total_records` or `synchronize_buffer_usage` that would suppress the panic in a specific configuration. + +**Conclusion:** The debug-build panic is real and fires on empty-buffer startup before release semantics (silent metric corruption) are observable. An Antithesis run using a debug build will see a crash on first clean restart with an empty buffer; a release build will see astronomical metric values. Both are confirmations of the same bug via different signals. diff --git a/tests/antithesis/scratchbook/properties/record-never-spans-files.md b/tests/antithesis/scratchbook/properties/record-never-spans-files.md new file mode 100644 index 0000000000000..3c9de2e25f19b --- /dev/null +++ b/tests/antithesis/scratchbook/properties/record-never-spans-files.md @@ -0,0 +1,199 @@ +# Evidence: record-never-spans-files + +## Property Identification + +**Slug:** record-never-spans-files +**Type:** Safety / AlwaysOrUnreachable +**Assertion macro:** `assert_always_or_unreachable!("record_never_spans_files", "Record is fully contained within a single data file", ...)` + +This property was identified by reading `RecordWriter::can_write` (writer.rs:433-437) and `BufferWriter::can_write` (writer.rs:789-791), which together form a two-level gate that prevents writing a record that would not fit in the current data file. The design doc (mod.rs) and `_external-references-digest.md` both state this as an explicit invariant: "records written sequentially/contiguously; a record never spans two data files." 
+ +The property is `AlwaysOrUnreachable` because: on a correct run, records always fit within a single file (the gate prevents spanning); the "record spans files" branch is unreachable. Under faults (corrupted file size metadata, overflowed size counter), the gate could be fooled into allowing a record that would overflow the file — which is the violation. The assertion fires only in the violation case, hence "always or unreachable." + +## Code Chain Leading to the Property + +### Write-Side Gate: `RecordWriter::can_write` (writer.rs:433-437) + +```rust +fn can_write(&self, amount: usize) -> bool { + let amount = u64::try_from(amount).expect("`amount` should need ever 2^64 bytes."); + self.current_data_file_size + amount <= self.max_data_file_size +} +``` + +Called in `archive_record` at writer.rs:527 after the record has been fully serialized into `self.ser_buf`, so `amount = serialized_len` is the exact wire size including the 8-byte length delimiter: + +```rust +// writer.rs:527-548 +if !self.can_write(serialized_len) { + // Decode the record back out to return it. + let record = T::decode(T::get_metadata(), &self.encode_buf[..]).map_err(|_| { + WriterError::InconsistentState { ... } + })?; + return Err(WriterError::DataFileFull { + record, + serialized_len, + }); +} +``` + +If the record would not fit, `archive_record` returns `Err(WriterError::DataFileFull { record, serialized_len })` — the record is recovered from the encode buffer and returned to the caller. The record is NOT written to disk. + +### Upper-Level Gate: `BufferWriter::can_write` (writer.rs:789-791) + +```rust +fn can_write(&self) -> bool { + !self.data_file_full && self.data_file_size < self.config.max_data_file_size +} +``` + +This is checked in `ensure_ready_for_write` (writer.rs:1029) and `try_write_record_inner` (writer.rs:1176). It operates on `self.data_file_size`, which is the `BufferWriter`-level tracking of how many bytes have been written to the current data file. 
When `can_write()` is false, the writer triggers file rotation (writer.rs:1040-1044) before attempting to write. + +The two-level design is: + +1. `BufferWriter::can_write()` — coarse gate using `data_file_size` to decide when to rotate proactively. +2. `RecordWriter::can_write(amount)` — precise gate using `current_data_file_size + serialized_len <= max_data_file_size` after the record is serialized. + +### File Rotation on `DataFileFull` (writer.rs:1204-1223) + +When `archive_record` returns `Err(WriterError::DataFileFull { record, serialized_len })`, `try_write_record_inner` captures it: + +```rust +// writer.rs:1204-1223 +WriterError::DataFileFull { + record: old_record, + serialized_len, +} => { + self.mark_data_file_full(); + record = old_record; + // ...loop continues, calling ensure_ready_for_write which triggers rotation +} +``` + +`mark_data_file_full()` sets `self.data_file_full = true` (writer.rs:801-804), which causes `can_write()` to return false and forces `ensure_ready_for_write` to open the next file. The record is then re-attempted on the new file. + +### The Soft Overshoot: `max_record_size` vs. `max_data_file_size` + +The design deliberately allows a single record to exceed `max_data_file_size` if the file is currently empty (writer.rs comment preceding `can_write` at writer.rs:429-432): +> "If no bytes have written at all to a data file, then `amount` is allowed to exceed the limit, otherwise a record would never be able to be written." + +Wait — checking `can_write`: + +```rust +fn can_write(&self, amount: usize) -> bool { + let amount = u64::try_from(amount).expect("..."); + self.current_data_file_size + amount <= self.max_data_file_size +} +``` + +When `self.current_data_file_size == 0`, the check reduces to `amount <= max_data_file_size`. If `amount > max_data_file_size`, this returns false even on an empty file. 
But the code has a `debug_assert` in `RecordWriter::new` (writer.rs:396-399): + +```rust +debug_assert!( + max_data_file_size >= max_record_size_converted, + "must always be able to fit at least one record into a data file" +); +``` + +The default `max_data_file_size = 128MB`, `max_record_size = 128MB` (from `DEFAULT_MAX_DATA_FILE_SIZE`, `DEFAULT_MAX_RECORD_SIZE` in `common.rs`). With `max_data_file_size == max_record_size`, the assertion holds but with zero margin — a record of exactly `max_record_size` bytes would land at the limit. A record larger than `max_record_size` is rejected by the `RecordTooLarge` error in `archive_record` (writer.rs:477-480) before `can_write` is even called. + +So the soft overshoot up to `max_record_size` is actually not a spanning-files scenario — it means a single large record can occupy an entire file. The file never contains more than `max_data_file_size + max_record_size` bytes in the degenerate case (the first record is large, up to `max_record_size`; subsequent records are blocked from the file but accumulate on the next). The bound documented in sut-analysis.md §5/INV-3 as "~2×" refers to this: in the worst case, one very large record is written, then a second record is written that barely doesn't fit, triggering rotation. The file on disk holds up to `max_data_file_size` (first record is small) + potentially slightly over for the `current_data_file_size + amount <= max_data_file_size` bound. + +### Read-Side Constraint + +The reader's `try_next_record` (reader.rs:306-351) reads a length delimiter then reads exactly `record_len` bytes from the same file. There is no logic in the reader to span a file boundary: `BufReader` reads from a single open file handle, and reaching EOF causes `try_next_record` to return `Ok(None)` (or `Err(PartialWrite)` if `is_finalized`). The reader never attempts to continue reading a record from the next file. 
The invariant at read time is: if a record's length delimiter claims N bytes follow, those N bytes must all be in the current file. If they are not, `is_finalized=true` gives `PartialWrite`; if `is_finalized=false` (writer still writing), the reader waits for more data from the writer, but since the writer will never append the missing bytes (it has closed/rotated this file), the reader will eventually see EOF with `is_finalized=true` and detect `PartialWrite`. + +The read-side thus enforces the invariant defensively, but only catches a violation after the fact (as a `PartialWrite` error). The property assertion should live on the write side, where the violation is prevented. + +### `current_data_file_size` Tracking (writer.rs:629-631) + +```rust +// In flush_record (writer.rs:629-631): +self.current_data_file_size += u64::try_from(serialized_len) + .expect("Serialized length of record should never exceed 2^64 bytes."); +``` + +This is a plain `u64` addition with no saturation check. If `serialized_len` is somehow larger than `u64::MAX - current_data_file_size` (astronomically unlikely in practice), this would overflow. Not a realistic fault path; the `try_from` would panic first since `serialized_len` is a `usize` checked against `max_record_size < 128MB`. + +More concerning: `current_data_file_size` is initialized from the on-disk file size at writer open (writer.rs:1094-1101, passing `file_len` to `RecordWriter::new`). If the file metadata is corrupted (e.g., via external truncation or Antithesis filesystem fault), `file_len` could be wrong. An underreported `file_len` would cause `can_write` to allow writing past the intended limit; an overreported `file_len` would cause premature rotation. + +Similarly, `BufferWriter::data_file_size` is initialized from the same `file_len` via `self.data_file_size = data_file_size` (writer.rs:1133). So both gates are seeded from on-disk metadata at open. 
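
The dependence of the write-side gate on its seed value can be illustrated with a toy model. This is a sketch under stated assumptions: `ToyRecordWriter` and `try_write` are hypothetical, sizes are tiny for readability, and the gate uses `checked_add` where the quoted code uses plain `+`.

```rust
/// Toy model of the precise per-record gate. Field names follow the notes;
/// the struct itself is hypothetical and does no I/O.
struct ToyRecordWriter {
    current_data_file_size: u64,
    max_data_file_size: u64,
}

impl ToyRecordWriter {
    /// `file_len` stands in for the on-disk size read from metadata at open.
    fn new(file_len: u64, max_data_file_size: u64) -> Self {
        Self {
            current_data_file_size: file_len,
            max_data_file_size,
        }
    }

    /// Same inequality as the quoted `can_write`, with overflow treated as "no".
    fn can_write(&self, amount: u64) -> bool {
        self.current_data_file_size
            .checked_add(amount)
            .map_or(false, |total| total <= self.max_data_file_size)
    }

    /// Returns true if the record was accepted, false if the caller must rotate.
    fn try_write(&mut self, amount: u64) -> bool {
        if !self.can_write(amount) {
            return false;
        }
        self.current_data_file_size += amount;
        true
    }
}

fn main() {
    let max = 100;
    let true_len = 90u64; // what is really on disk in this scenario

    // Seeded with the true size: a 20-byte record is rejected, forcing rotation.
    let mut honest = ToyRecordWriter::new(true_len, max);
    assert!(!honest.try_write(20));

    // Seeded with an under-reported size: the same record is accepted, yet the
    // real file would now hold 110 bytes, past the configured limit.
    let mut stale = ToyRecordWriter::new(70, max);
    assert!(stale.try_write(20));
    assert!(true_len + 20 > max);

    println!("gate seeding sketch ok");
}
```

In this model the gate is only as trustworthy as the `file_len` it was seeded with, which is why metadata faults at open are the interesting input for this property.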
+ +## What Goes Wrong if the Property is Violated + +If a record's bytes span a file boundary, the reader would read the length delimiter correctly, then read `record_len` bytes — which runs off the end of the file. In practice on Linux, reading past EOF on a regular file returns 0 bytes (EOF); with `is_finalized=true`, this triggers `ReaderError::PartialWrite`. The reader would roll to the next file and the spanning record would be lost. This is a safe degradation, but: + +1. **Data loss**: the spanning record is permanently lost, not recoverable on restart. +2. **`total_buffer_size` drift**: the bytes of the spanning record were accounted for on the write side but the incomplete record's bytes are corrected (partially) by the `size_delta` logic in `delete_completed_data_file` (reader.rs:521-535). Whether this is correctly computed depends on how many bytes were actually written to the first file vs. how many the reader observed. +3. **Record ID gap**: the spanning record has a valid ID assigned; if it is lost, the `OrderedAcknowledgements` inserts a gap marker, which causes `events_skipped` to increment and `track_dropped_events` to fire. + +## Timing / Fault Conditions for Antithesis + +- **Corrupted `file_len` from `metadata().await`**: If Antithesis can corrupt the filesystem metadata call result to return a lower value than the actual file size, `RecordWriter::new` initializes `current_data_file_size` too low, and `can_write` allows writing past the true limit. The record bytes are appended after what the writer thinks is the file boundary. +- **Concurrent external write to the data file**: Another process (or Vector instance due to lock-bypass) appends bytes to the data file. The writer's `current_data_file_size` is now stale (low), causing it to think there is more room than there is. 
+- **`data_file_size` underflow via overflow-then-wrap**: Not realistic at normal sizes, but in a fuzz scenario with `max_data_file_size` set to a very large value (near `u64::MAX`), a record that is allowed could overflow the accumulator and wrap it. +- **Race between writer file rotation and reader file ID increment**: Not directly a spanning violation, but if the writer increments the file ID (`increment_writer_file_id` at writer.rs:1138) while a record is still being written, the reader might believe the file is finalized while bytes are still being appended. This is not a spanning violation but a timing hazard in the `is_finalized` flag that affects the partial-write detection path. + +## SUT-Side Instrumentation Suggestions (ALL MISSING) + +**Primary assertion (write side)** — in `RecordWriter::flush_record` (writer.rs:609-633), after `current_data_file_size` is updated, assert the invariant: + +```rust +// writer.rs, after line 631: +antithesis_sdk::assert_always_or_unreachable!( + self.current_data_file_size <= self.max_data_file_size, + "record_never_spans_files: after flush_record, data file size does not exceed max_data_file_size", + &serde_json::json!({ + "current_data_file_size": self.current_data_file_size, + "max_data_file_size": self.max_data_file_size, + "serialized_len": serialized_len, + }) +); +// The asserted condition is current_data_file_size <= max_data_file_size (allowing first-record overshoot is a design choice; the invariant is that we never write a partial record across a boundary) +``` + +Note: the first record on an empty file is allowed to fill up to `max_data_file_size` (since `can_write` checks `<= max_data_file_size`, not `< max_data_file_size`). A stronger assertion would check that `current_data_file_size` after flush equals `current_data_file_size_before + serialized_len`, confirming no corruption of the size counter. 
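The stronger check described in the note above can be written as a pure function, which keeps it easy to test outside the writer. This is a minimal sketch: `expected_size_after_flush` and `flush_invariants_hold` are hypothetical helper names, not functions that exist in `writer.rs`.

```rust
use std::convert::TryFrom;

/// Hypothetical helper: computes the expected counter value after a flush.
/// Returns `None` if the length does not fit in `u64` or the sum overflows.
fn expected_size_after_flush(size_before: u64, serialized_len: usize) -> Option<u64> {
    let len = u64::try_from(serialized_len).ok()?;
    size_before.checked_add(len)
}

/// Returns true when the counter advanced by exactly `serialized_len`
/// and the result stays within `max_data_file_size`.
fn flush_invariants_hold(
    size_before: u64,
    size_after: u64,
    serialized_len: usize,
    max_data_file_size: u64,
) -> bool {
    expected_size_after_flush(size_before, serialized_len) == Some(size_after)
        && size_after <= max_data_file_size
}
```

The boolean result could serve as the condition of an SDK assertion, with the four inputs attached as details.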
+ +**Secondary assertion (read side)** — in `RecordReader::try_next_record` (reader.rs:306-351), when `PartialWrite` is returned, check whether the partial write is at a position consistent with a spanning violation vs. a genuine incomplete write: + +```rust +// reader.rs, in the PartialWrite return path: +antithesis_sdk::assert_always_or_unreachable!( + "partial_write_not_due_to_spanning", + "PartialWrite detected; this should only occur due to crash-interrupted writes, not record spanning", + &serde_json::json!({ + "bytes_accumulated": self.aligned_buf.len(), + "record_len_claimed": record_len, + "reader_file_id": ..., + }) +); +``` + +This makes the partial-write detection path a reachability check and also documents the expected cause. + +**Gate audit assertion** — in `RecordWriter::can_write` (writer.rs:433-437), assert that we never call `can_write` with a `current_data_file_size` that already exceeds `max_data_file_size` (which would indicate the size counter drifted): + +```rust +fn can_write(&self, amount: usize) -> bool { + let amount = u64::try_from(amount).expect("..."); + antithesis_sdk::assert_always_or_unreachable!( + "data_file_size_counter_not_drifted_above_max", + "current_data_file_size should not exceed max_data_file_size before can_write is checked", + &serde_json::json!({ + "current_data_file_size": self.current_data_file_size, + "max_data_file_size": self.max_data_file_size, + }) + ); + self.current_data_file_size + amount <= self.max_data_file_size +} +``` + +## Open Questions + +- **Is there a window between `can_write` returning true and `flush_record` completing where a concurrent size update could invalidate the gate?** The writer is single-threaded (behind a topology `Mutex`), so no interleaving is possible between `archive_record`'s `can_write` check and `flush_record`'s size update within a single tokio task. However, if the `Mutex` is somehow released between these calls (which it should not be), a race would be possible. 
Confirming that `archive_record` and `flush_record` are always called within the same `Mutex` lock scope is needed. If they are (they appear to be — both are called from `try_write_record_inner` which holds the lock end-to-end), the gate is sound against concurrency. + +- **What is the actual on-disk file size upper bound for a single data file?** The documented limit is `max_data_file_size = 128MB`. The code's `can_write` check is `current_data_file_size + amount <= max_data_file_size`. If a record of exactly `max_data_file_size` bytes is written to an empty file, `current_data_file_size` after the write equals `max_data_file_size`. The next `can_write` check for any subsequent record will fail (since `max_data_file_size + any_positive_amount > max_data_file_size`). So the file can reach exactly 128MB but not exceed it for the initial record, and subsequent records trigger rotation. The "~2×" bound from the SUT analysis appears to be incorrect, or refers to an older behavior. Confirming the actual maximum file size by testing with `max_record_size = max_data_file_size` would resolve this. + +- **Does the `debug_assert` at writer.rs:396-399 (`max_data_file_size >= max_record_size`) fire in release builds?** `debug_assert!` is compiled out in release mode. If a user configures `max_record_size > max_data_file_size` (which would require exposing `max_record_size` to users, currently not done), the assert would not catch it. The invariant that "at least one record always fits in a data file" would then be violated at the `can_write` level: a single record would return `DataFileFull` even on an empty file, and the writer would loop forever trying to rotate to a new file that is also too small. This is a separate stall bug, but it interacts with the file-spanning property. 
+ +- **Is `current_data_file_size` reset to 0 when the writer opens a new data file?** Yes: `self.reset()` (writer.rs:806-811) sets `self.data_file_size = 0`, and `RecordWriter::new` is called with `data_file_size` from on-disk metadata (writer.rs:1094-1101). For a freshly created file, `metadata.len() == 0`, so `current_data_file_size = 0`. For an existing file (resumed after crash), `metadata.len()` is the true on-disk size, correctly seeding the gate. This seems correct, but if Antithesis corrupts `metadata().len()` to return a lower value than the true on-disk size, `current_data_file_size` is initialized too low, creating the spanning risk. diff --git a/tests/antithesis/scratchbook/properties/recovery-completes-after-crash.md b/tests/antithesis/scratchbook/properties/recovery-completes-after-crash.md new file mode 100644 index 0000000000000..70e3e7b315cdb --- /dev/null +++ b/tests/antithesis/scratchbook/properties/recovery-completes-after-crash.md @@ -0,0 +1,165 @@ +# Property: recovery-completes-after-crash + +## Catalog Entry + +**Type:** Liveness / Sometimes — `Sometimes(buffer_reinitialized)` + +**Property:** `Buffer::from_config_inner` (the full recovery sequence: +`load_or_create` + `validate_last_write` + `seek_to_next_record` + +`synchronize_buffer_usage`) completes successfully and within bounded time after +a kill at any point during normal operation. It does not hang indefinitely, does +not fail fatally, and does not require manual intervention. + +**Invariant:** After any SIGKILL followed by a restart, `from_config_inner` +must return `Ok(...)` within T seconds (suggested T = 30s, a reasonable +initialization bound). Neither of the following may occur: + +- Permanent hang (e.g., waiting forever on a `ledger.wait_for_reader().await` + inside `ensure_ready_for_write` called from `validate_last_write`). +- Unrecoverable error returned (crash loop: the process dies during init on + every restart attempt). + +**Antithesis Angle:** + +1. 
Workload runs writes/reads/rotations continuously against the buffer. +2. Antithesis injects SIGKILL at arbitrary points (during write, during + `flush_inner`, during `sync_all`, during file rotation, during ledger msync, + during `seek_to_next_record`'s file-deletion loop). +3. Vector restarts. The workload measures time from process start to the first + event being emittable from the buffer (proxy for `from_config_inner` + completion). +4. `Sometimes(buffer_reinitialized)` asserts that in at least one timeline, + `from_config_inner` completes and the writer is ready — i.e., the recovery + path is actually exercised (not just happy-path startup every time). +5. An `assert_always` (added SUT-side) fires if the initialization hangs beyond + a timeout or returns an unrecoverable error. + +**Why It Matters:** A buffer that silently deadlocks on startup is worse than +one that loses data — it makes Vector permanently unavailable until manual +intervention. The underflow bug (#21683) can cause exactly this: if +`total_buffer_size` wraps to 2^64 during init (via `update_buffer_size` seeding +too high relative to what `seek_to_next_record` decrements), `is_buffer_full()` +returns true forever and `ensure_ready_for_write`'s outer loop at +`writer.rs:1003-1019` spins on `wait_for_reader()` indefinitely — a silent +stall that looks like a healthy process. + +**The init sequence (all four steps must complete):** + +``` +mod.rs:251 Ledger::load_or_create ← mmap buffer.db, update_buffer_size +mod.rs:257 writer.validate_last_write ← calls ensure_ready_for_write (deadlock risk!) 
+mod.rs:265 reader.seek_to_next_record ← reads/deletes files up to last acked record +mod.rs:270 ledger.synchronize_buffer_usage +``` + +**Crash Windows and Stall/Hang Risks (code-precise):** + +| Kill point | State left on disk | Recovery risk | +|------------|-------------------|---------------| +| During `open_file_writable_atomic` for new data file (`writer.rs:1071`) | New file may or may not exist (atomicity depends on FS) | `AlreadyExists` branch (`writer.rs:1079-1112`) handles this; file_len=0 → treated as owned. OK. | +| After `increment_writer_file_id` (`writer.rs:1138`), before first record written | Ledger says writer is on file N+1, but file N+1 is empty | `validate_last_write`: calls `ensure_ready_for_write`, which opens file N+1 (empty → `data_file_size==0`). `validate_last_write` exits early at `writer.rs:852-855` with `ready_to_write=true`. OK. | +| During `sync_all` of the new file (`writer.rs:1124`) | New file created but not synced; may be 0 bytes on disk | Same as above: empty file → early exit. OK. | +| After `validate_last_write` sets `should_skip_to_next_file=true`, before `reset()`/`mark_for_skip()` | Old file has invalid last record; ledger not yet rolled | `validate_last_write` starts over: `ready_to_write=false` guard (`writer.rs:840`) prevents double-init. `ensure_ready_for_write` opens next file. **L6 edge**: if next file doesn't exist yet and reader hasn't finished the current file, writer must wait (`writer.rs:1153`). Hang risk if reader is also not yet initialized (ordering: writer init completes before reader init starts per `mod.rs:256-268`). | +| During `seek_to_next_record` file-deletion loop (`reader.rs:883`) | Reader deleted some files; ledger partially updated | On restart, `update_buffer_size` sums remaining `.dat` files (correct since deleted files are gone). `seek_to_next_record` resumes. OK in theory. 
| +| During `update_buffer_size` file-scan (`ledger.rs:674`) | Scan sees partial set of files | Harmless: worst case over-counts (more files than reality due to concurrent creation); `seek_to_next_record` decrements as it reads. Under-count impossible since scan is a snapshot. | + +**L6 init-stall edge (highest risk):** +`validate_last_write` detects a bad record → `should_skip_to_next_file = true` +→ `reset()` + `mark_for_skip()` (`writer.rs:983-986`) → `ready_to_write = true` +→ caller eventually calls `write_record` → `ensure_ready_for_write` → tries to +open next file. If the next file (ID `current+1`) does not yet exist AND +reader's `reader_current_data_file_id` == `writer_current_data_file_id` (same +file), the writer's `open_file_writable_atomic` call succeeds (creates the new +file). But if the next file already exists and is non-empty (reader hasn't +finished it), the writer loops on `wait_for_reader()` at `writer.rs:1153`. + +Since `validate_last_write` is called during init (before the reader is +`ready_to_read`), and `seek_to_next_record` runs *after* `validate_last_write` +(`mod.rs:265`), the reader has not yet processed any records. So the "next" file +likely does not exist, and `open_file_writable_atomic` should succeed. However, +in the edge case where the writer rolled to file N+1 before the kill and then +the kill happened during the rotation, file N+1 may already exist as a +partially-written or empty file — the `AlreadyExists` branch handles this. + +The more dangerous edge: if `update_buffer_size` overseeds `total_buffer_size` +(file-on-disk includes partial/torn bytes beyond actual readable records), and +`seek_to_next_record` does not fully drain the overseeding by the time it +completes, then `is_buffer_full()` may be permanently true at init time. In +that case the write that triggers the deadlock is the first post-init write — +not during init itself — but init "completing" is then followed by an immediate +permanent stall. 
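The two arithmetic hazards discussed for this property, an overseeded counter and a counter that wraps on decrement, can be illustrated with plain `u64` arithmetic. This is a sketch under stated assumptions: the free functions below are hypothetical, and the `>=` comparison in `is_buffer_full` is an assumption about the real check rather than a quote from `ledger.rs`.

```rust
/// Assumed shape of the fullness check: full once the counter reaches the limit.
fn is_buffer_full(total_buffer_size: u64, max_buffer_size: u64) -> bool {
    total_buffer_size >= max_buffer_size
}

/// Models an unchecked decrement: subtracting more than the counter holds wraps around.
fn decrement_wrapping(total_buffer_size: u64, amount: u64) -> u64 {
    total_buffer_size.wrapping_sub(amount)
}

/// Checked alternative: reports the underflow instead of wrapping.
fn decrement_checked(total_buffer_size: u64, amount: u64) -> Option<u64> {
    total_buffer_size.checked_sub(amount)
}

/// Seeds the counter the way a directory scan would: by summing file lengths,
/// including any torn bytes at the tail of a file.
fn seed_from_file_lengths(file_lengths: &[u64]) -> u64 {
    file_lengths.iter().copied().fold(0u64, u64::saturating_add)
}
```

In this model a single decrement that exceeds the counter by one byte yields `u64::MAX`, so the fullness check stays true no matter how much the reader later frees. A checked subtraction surfaces the same event as `None`, which is a natural place for an assertion.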
+ +**Advisory lock edge:** +`load_or_create` at `ledger.rs:573-576` calls `lock.try_lock()`. On Linux, +`fcntl` advisory locks are per-process; if the old Vector process dies via +SIGKILL, the lock is released by the kernel. However, on some network +filesystems (NFS, CIFS) the lock may not be released immediately after a crash, +causing `try_lock()` to return `LedgerLockAlreadyHeld` and making +`from_config_inner` return an error on every restart attempt until the lock +expires. This is a crash-loop risk on shared/NFS storage. + +**Fault Requirements:** Node-termination faults (SIGKILL) required. Kill +specifically during file rotation is the highest-value timing for L6. CPU +throttle during the init sequence is a secondary lever to widen timing windows. + +**Antithesis SDK Assertions (SUT-side, to be added):** + +```rust +// At the end of from_config_inner (mod.rs:270, after synchronize_buffer_usage): +antithesis_sdk::assert_sometimes!( + true, + "buffer_reinitialized: from_config_inner completed successfully after crash", + &serde_json::json!({ + "writer_next_record": ledger.state().get_next_writer_record_id(), + "reader_last_record": ledger.state().get_last_reader_record_id(), + "total_buffer_size": ledger.get_total_buffer_size(), + }) +); + +// At the start of update_buffer_size (ledger.rs:653), log seeded size: +antithesis_sdk::assert_always!( + total_buffer_size < config.max_buffer_size, + "update_buffer_size: seeded total_buffer_size within configured max_size", + &serde_json::json!({ "total_buffer_size": total_buffer_size, "max_buffer_size": config.max_buffer_size }) +); +``` + +--- + +## Open Questions + +**OQ-1: Is the `total_buffer_size` underflow during init detectable at init +time, or only at the first write?** +`update_buffer_size` increments `total_buffer_size` by file sizes. If the sum +exceeds `max_buffer_size`, `is_buffer_full()` returns true immediately, and +the first call to `ensure_ready_for_write` from `validate_last_write` hangs. 
+But `ensure_ready_for_write` has a guard: `!self.ready_to_write` allows passing +through even if full (`writer.rs:1009`). So the deadlock only manifests after +`validate_last_write` sets `ready_to_write = true` and normal writes begin. +This means init completes but the buffer is immediately broken — `from_config_inner` +returns `Ok` but the system is deadlocked. Confirm by tracing +`is_buffer_full()` at the `ensure_ready_for_write` init call vs. the post-init +write path. + +**OQ-2: Is there a bounded-time guarantee on `seek_to_next_record`?** +`seek_to_next_record` reads records via `next()` (`reader.rs:905`). `next()` +can block waiting for the writer (`reader.rs:1019` `wait_for_writer()`). During +init the writer is not yet writing, so `wait_for_writer()` may hang. Trace the +`next()` path under `is_finalized = (reader_file_id != writer_file_id) || +!self.ready_to_read` (`reader.rs:1004`). During seek, `ready_to_read = false`, +so `is_finalized = true`, so `try_next_record` treats the file as finalized and +returns `PartialWrite` errors rather than blocking. Confirm this holds for all +files in the seek path, not just the current writer file. + +**OQ-3: What happens if `validate_last_write` returns an error?** +`from_config_inner` propagates the error as `BufferError::WriterSeekFailed` +(`mod.rs:260`). The caller (topology builder) treats this as fatal and does not +retry. If the error is transient (e.g., a temp I/O error on the `open_mmap_readable` +call inside `validate_last_write` at `writer.rs:862-867`), the process crash- +loops. Antithesis should inject filesystem errors (EIO, ENOENT on the data file) +to verify whether crash-loop vs. graceful degradation is the actual behavior. + +**OQ-4: Advisory lock on NFS — is Vector deployed on NFS-backed storage in +any customer environment?** +If so, the lock-not-released edge is a live operational risk, not just a +theoretical one. Flag to the user. 
diff --git a/tests/antithesis/scratchbook/properties/sink-failure-not-silently-acked.md b/tests/antithesis/scratchbook/properties/sink-failure-not-silently-acked.md new file mode 100644 index 0000000000000..86be5cd9d23c9 --- /dev/null +++ b/tests/antithesis/scratchbook/properties/sink-failure-not-silently-acked.md @@ -0,0 +1,154 @@ +--- +slug: sink-failure-not-silently-acked +type: Safety / Always +status: CURRENTLY VIOLATED (within a process lifetime; at-least-once only restored by crash+replay) +sut_commit: b7aae737cef5dd37d1445915443a1eb97b584f85 +--- + +# Property 14: sink-failure-not-silently-acked + +## Catalog Entry + +**Type:** Safety / Always + +**Property:** An event whose downstream delivery status is `Errored` or `Rejected` is NOT +silently treated as acknowledged and removed from the buffer. At-least-once semantics require +that errored/rejected deliveries are retried (either by the same process or by replay after +restart) and are never silently discarded within a live process. + +**Invariant:** For every event batch delivered to a downstream sink, if the batch completes with +`BatchStatus::Errored` or `BatchStatus::Rejected`, the buffer retains the events for retry. +Equivalently: the buffer never silently credits a failed delivery as a successful acknowledgement. + +**Current Status: VIOLATED.** The finalizer in `ledger.rs:704` discards the `BatchStatus` +entirely: + +```rust +// ledger.rs:701-710 +pub(super) fn spawn_finalizer(self: Arc<Self>) -> OrderedFinalizer<u64> { + let (finalizer, mut stream) = OrderedFinalizer::new(None); + tokio::spawn(async move { + while let Some((_status, amount)) = stream.next().await { // <-- _status discarded + self.increment_pending_acks(amount); + self.notify_writer_waiters(); + } + }); + finalizer +} +``` + +The variable is named `_status` with a leading underscore, which is the Rust idiom for +explicitly acknowledging that a value is intentionally unused.
Every termination status — +`BatchStatus::Delivered`, `BatchStatus::Errored`, and `BatchStatus::Rejected` — causes an +unconditional `increment_pending_acks(amount)`, which advances the reader's acknowledgement +cursor and eventually causes `delete_completed_data_file` to unlink the data file. The events +are then permanently gone from the buffer without having been successfully delivered. This is +silent data loss with no metric, no log at error level, and no retry. + +Within a single process lifetime the property is violated. At-least-once is only restored by +crash+replay: a SIGKILL before `delete_completed_data_file` runs means the file survives on +disk and is replayed on restart; but if the finalizer processes the ack and the file deletion +completes before the crash, the events are gone. The window is small but real, and under +sustained error conditions (e.g. a sink that permanently errors) the violation is continuous. + +**Antithesis Angle:** + +Configure a pipeline with a disk buffer and a sink that can be made to return `Errored` or +`Rejected` status on demand (e.g. an HTTP sink pointed at an endpoint controlled by the +Antithesis workload, or a custom test sink that reads a flag). Then: + +1. Write N events into the buffer. +2. Force the sink to error all deliveries (via fault injection: kill the downstream, return 500s, + etc.) for a sustained window. +3. Without crashing Vector, assert that the events are either: + a. Still present in the buffer (retry pending), or + b. Visible as failed in the error log with a count matching the dropped count. +4. Terminate Vector cleanly (graceful shutdown) and verify that a fresh restart replays events + and delivers them when the downstream is restored. 
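The check in step 3 amounts to a conservation count on the workload side. A minimal sketch, assuming the workload can obtain the four counters by some means the notes do not specify; `silently_lost` and its parameter names are hypothetical.

```rust
/// Hypothetical workload-side check: how many events are unaccounted for.
/// An event is accounted for if it was delivered, is still retained in the
/// buffer, or was explicitly reported as failed.
fn silently_lost(written: u64, delivered: u64, retained: u64, reported_failed: u64) -> u64 {
    let accounted = delivered
        .saturating_add(retained)
        .saturating_add(reported_failed);
    // Saturating: at-least-once delivery may produce duplicates, so the
    // accounted total can exceed the written total without indicating loss.
    written.saturating_sub(accounted)
}
```

A nonzero result while the sink is erroring and Vector has not crashed would be the observable form of the violation described above.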
+ +The assertion to add SUT-side would sit inside `spawn_finalizer`, branching on `_status`: + +```rust +// Proposed instrumentation site (ledger.rs:704) +while let Some((status, amount)) = stream.next().await { + match status { + BatchStatus::Delivered => { + self.increment_pending_acks(amount); + self.notify_writer_waiters(); + } + BatchStatus::Errored | BatchStatus::Rejected => { + // assert_always!(false, "errored delivery silently acked — at-least-once violated", + // { "amount": amount, "status": format!("{:?}", status) }); + // TODO: implement retry / nack path instead of silently acking + self.increment_pending_acks(amount); // BUG: this line drops the events + self.notify_writer_waiters(); + } + } +} +``` + +Workload-observable signal: `buffer_discarded_events_total` or `buffer_sent_events_total` +increments, but `component_discarded_events_total` does not. The total events delivered to the +downstream sink is less than total events written to the buffer. + +**Why It Matters:** + +This is a direct violation of the at-least-once guarantee that disk buffer is sold on. A +customer enabling disk buffering specifically to prevent data loss on transient downstream errors +gets the opposite: errors cause silent, permanent, unmetered data loss with no log or alert. +This matters most in production scenarios where the downstream sink experiences sustained errors +(network partition, service outage, quota exhaustion) and Vector's buffer silently drains itself +rather than accumulating events for replay. + +## Open Questions + +1. **Does any sink actually emit `Errored` or `Rejected` status under normal operation (non-fault + conditions)?** If sinks always return `Delivered` on internal retry success and only surface + errors by panicking or logging, the violation may only be reachable under active fault + injection. This needs to be verified by checking a representative sink (e.g. 
+ `src/sinks/http/`) to see what `BatchStatus` values it emits under various error conditions. + If sinks swallow errors internally and never propagate `Errored` to the finalizer, the + violation is latent rather than continuously exercised, which affects priority but not + correctness. + +2. **Is the `OrderedFinalizer` typed to drop the status or does it give it to the caller?** + `vector-common`'s `OrderedFinalizer` yields `(BatchStatus, T)` from the stream (confirmed + at `ledger.rs:704`). The `_status` binding at the call site is the discard, not an upstream + limitation. A fix can be made purely in `spawn_finalizer` without changing the finalizer + library. + +3. **Is there a nack / unack path in the buffer at all?** The current ack machinery + (`handle_pending_acknowledgements` in `reader.rs`) only moves the acknowledgement cursor + forward; there is no mechanism to "un-ack" or requeue a record. Implementing true retry for + errored deliveries would require a significant design change (e.g. moving back + `reader_last_record`, or maintaining a separate retry queue). The fix may require scoping to + "at minimum, emit an error-level log and a metric" rather than true retry within the same + process. + +4. **Does a graceful shutdown drain the finalizer task before dropping the buffer?** If the + tokio runtime drops the finalizer task while in-flight acks are still pending, an errored + delivery may never increment `pending_acks`, leaving the reader stranded. This is a separate + liveness concern (the reader never deletes the file) rather than the silent-loss concern, but + both stem from the same finalizer design. + +--- + +### Investigation Log + +#### Do sinks actually emit `Errored` or `Rejected` status in practice? + +**Examined:** `src/sinks/util/http.rs:928–936` (`DriverResponse for HttpResponse`), `src/sinks/http/tests.rs` (test assertions on `BatchStatus`). 
+ +**Found:** The `DriverResponse::event_status` implementation for `HttpResponse` at sinks/util/http.rs:929–936 explicitly returns `EventStatus::Errored` for transient HTTP errors and `EventStatus::Rejected` for permanent HTTP errors (non-2xx, non-transient responses). The test file `src/sinks/http/tests.rs` confirms this at lines 536 and 566 (`assert_eq!(receiver.try_recv(), Ok(BatchStatus::Rejected))`) and line 956 (`Ok(BatchStatus::Rejected)`). The discard at `ledger.rs:704` (`_status` binding) is therefore exercisable in real workloads wherever an HTTP sink receives a 4xx/5xx response. The finalizer unconditionally increments `pending_acks` regardless of which status value reaches it. + +**Not found:** No code path in `spawn_finalizer` (ledger.rs:701–710) that branches on `BatchStatus::Errored` or `BatchStatus::Rejected` — the match arm is `_status` with unconditional `increment_pending_acks`. No retry or nack path exists in the buffer. + +**Conclusion:** Sinks do emit `Errored`/`Rejected` in practice (confirmed in HTTP sink). The `_status` discard at ledger.rs:704 is therefore a live violation during normal operation with any non-2xx downstream, not only under synthetic fault injection. + +#### Is the `_status` discard at ledger.rs:704 intentional or a bug? + +**Examined:** `ledger.rs:700–710` (`spawn_finalizer`), git history context (commit b7aae737c), no inline comment explaining the intent. + +**Not found:** No comment in the code or adjacent documentation explaining why `_status` is discarded. The naming convention `_status` (leading underscore) is the Rust idiom for "intentionally unused," which suggests the discard is deliberate rather than an oversight — but no comment or design document explains the rationale. There is no nack/retry path in the buffer; implementing one would require reversing `increment_acked_reader_file_id` or maintaining a separate retry queue, which is a significant design change. 
+ +**Conclusion:** Whether this is intentional design (the buffer does not support nack/retry within a process lifetime, so status is irrelevant to the acknowledgement cursor) or an unaddressed bug (the buffer should retain errored events for replay) requires human input from the owning team. The effect is clear: at-least-once is violated within a process lifetime under sustained sink errors. This item is flagged as **needs human input** to determine whether the fix scope is "add error log + metric" or "redesign ack path." diff --git a/tests/antithesis/scratchbook/properties/throughput-progresses-under-contention.md b/tests/antithesis/scratchbook/properties/throughput-progresses-under-contention.md new file mode 100644 index 0000000000000..e0c3ce8b77c2d --- /dev/null +++ b/tests/antithesis/scratchbook/properties/throughput-progresses-under-contention.md @@ -0,0 +1,278 @@ +--- +slug: throughput-progresses-under-contention +type: Liveness / Sometimes(throughput_above_floor) +sut_path: lib/vector-buffers/src/variants/disk_v2/ +commit: b7aae737cef5dd37d1445915443a1eb97b584f85 +updated: 2026-05-28 +--- + +# Property: throughput-progresses-under-contention + +## Catalog Entry + +**Type:** Liveness / `Sometimes(throughput_above_floor)` + +**Property:** With N≥4 parallel source components all writing through a +single disk-buffer writer (sharing the `Arc<Mutex<BufferWriter>>` lock), and +with Antithesis CPU throttle active, write throughput during a quiet +(fault-free) observation window stays above a configurable floor (e.g., +1,000 events/second or 1 MiB/s). This distinguishes three states that exist +on a continuum but require different operator responses: + +1. **Healthy** — throughput at or near the documented ~90 MiB/s ceiling. +2. **Degenerate-but-alive** — throughput severely degraded (e.g., 100× + below normal) but still making progress; lock starvation, not deadlock. +3. **Permanently deadlocked** — throughput zero; caught by + `writer-eventually-makes-progress`.
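The three states above can be told apart per observation window from two numbers: events written and the configured floor. The following is a minimal sketch, not part of the SUT or the workload. `WindowState` and `classify_window` are hypothetical names, and a single window with zero progress is labelled `NoProgress` rather than "deadlocked" because one window alone cannot prove permanence.

```rust
/// Hypothetical per-window classification used by a workload-side checker.
#[derive(Debug, PartialEq, Eq)]
enum WindowState {
    /// Throughput exceeded the floor in this window.
    AboveFloor,
    /// Some events were written, but not more than the floor allows for.
    DegenerateButAlive,
    /// No events were written in this window.
    NoProgress,
}

/// Classifies one window of `window_secs` seconds against a floor
/// expressed in events per second.
fn classify_window(events_written: u64, window_secs: u64, floor_events_per_sec: u64) -> WindowState {
    let floor_for_window = floor_events_per_sec.saturating_mul(window_secs);
    if events_written == 0 {
        WindowState::NoProgress
    } else if events_written > floor_for_window {
        WindowState::AboveFloor
    } else {
        WindowState::DegenerateButAlive
    }
}
```

A run in which every window is `DegenerateButAlive` is the case this property is meant to surface.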
+ +The `writer-eventually-makes-progress` property only catches case 3. +This property catches case 2: a degenerate-but-alive system that would be +indistinguishable from healthy on a coarse dashboard but is actually a +regression to near-zero progress. + +**Invariant:** `Sometimes(throughput_above_floor)`: during at least one 10-second +quiet-period window in a run with N≥4 parallel senders and CPU throttle +active, the count of successfully written events exceeds a floor threshold. +`Sometimes` is correct (not `Always`) because CPU throttle may prevent +even a single write during a brief window; the assertion fires to prove +that progress is not permanently below the floor over the full observation +window. + +**Antithesis Angle:** Configure N≥4 parallel source components → single +disk buffer sink (they all share the same `Arc<Mutex<BufferWriter>>`). +Enable Antithesis CPU throttle on the Vector process. Run for a +duration long enough to observe throughput variance. Assert +`Sometimes(event_throughput > floor)` over a moving window. Combine with +`writer-eventually-makes-progress` to distinguish degenerate-but-alive +from deadlocked: if throughput is always below floor AND +`writer-eventually-makes-progress` fires, the system is degenerate; if +`writer-eventually-makes-progress` never fires, the system is deadlocked. + +**Why It Matters:** The internal buffers GA design doc documents a known +throughput ceiling of ~90 MiB/s under 10-thread lock contention. Under +Antithesis CPU throttle (which can reduce effective parallelism and introduce +scheduling jitter), the same lock contention can drive throughput toward zero +without triggering a deadlock. This regression is hard to detect in CI because +it requires sustained parallel writes under resource pressure — exactly the +conditions Antithesis can provide. The value is catching a *regression to +near-zero progress* that the deadlock detection +(`writer-eventually-makes-progress`) misses.
+ +--- + +## Code Verification + +### `Arc<Mutex<BufferWriter>>` — the single lock bottleneck (sender.rs:24) + +```rust +// lib/vector-buffers/src/topology/channel/sender.rs:24 +DiskV2(Arc<Mutex<disk_v2::BufferWriter<T, ProductionFilesystem>>>), +``` + +Every send operation (whether from `SenderAdapter::send`, `try_send`, or +`flush`) acquires this mutex: + +```rust +// sender.rs:46-48 +Self::DiskV2(writer) => { + let mut writer = writer.lock().await; // contended by all parallel senders + writer.write_record(item).await... +} +``` + +Multiple topology components sharing the same disk-buffer sink all clone +the `Arc` and contend on the single `Mutex`. This is a serialization point +with no read/write splitting. + +### Writer sends via `Arc` clone (sender.rs:35-37) + +```rust +// sender.rs:33-37 +impl<T: Bufferable> From<disk_v2::BufferWriter<T, ProductionFilesystem>> for SenderAdapter<T> { + fn from(v: disk_v2::BufferWriter<T, ProductionFilesystem>) -> Self { + Self::DiskV2(Arc::new(Mutex::new(v))) + } +} +``` + +The `Arc::new(Mutex::new(v))` wrapping happens at topology construction time. +When multiple sources share the same sink, the `Arc` is cloned and all +contend on the same underlying `Mutex`. + +### `write_record` hold time (writer.rs — the critical section) + +Each sender holds the mutex for the duration of: + +1. `encode` (protobuf serialization of the event). +2. `write_record` (copy into the 256KB `TrackingBufWriter`). +3. Conditionally, `flush_inner` (page-cache flush; sometimes `sync_all`). + +Under CPU throttle, any of these steps can be extended arbitrarily, holding +the mutex and blocking all other senders. + +### `DEFAULT_WRITE_BUFFER_SIZE` = 256KB (common.rs:37) + +```rust +// lib/vector-buffers/src/variants/disk_v2/common.rs:37 +pub const DEFAULT_WRITE_BUFFER_SIZE: usize = 256 * 1024; +``` + +The `TrackingBufWriter` buffers 256KB before issuing a write syscall. If the +mutex holder is encoding a large event, other senders wait for the full +encode + copy cycle. Under CPU throttle, this can be hundreds of milliseconds +per acquisition.
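To see how a few hundred milliseconds of hold time turns into a floor violation, here is a back-of-envelope model. It is a sketch, not a measurement: it assumes a constant hold time per acquisition, strict FIFO hand-off, and one event per record, and both function names are hypothetical.

```rust
use std::time::Duration;

/// Upper bound on records per second when every acquisition holds the lock
/// for `hold`. A zero hold time yields infinity, meaning "no bound from the lock".
pub fn max_records_per_sec(hold: Duration) -> f64 {
    1.0 / hold.as_secs_f64()
}

/// Worst-case wait for a sender that joins a FIFO queue behind every other
/// sender, each of which holds the lock for `hold`.
pub fn worst_case_wait(n_senders: u32, hold: Duration) -> Duration {
    hold * n_senders.saturating_sub(1)
}
```

With 4 senders and a 200 ms hold, the model gives a 600 ms worst-case wait and at most 5 records/s, i.e. about 50 events per 10-second window, well below a floor of 1,000 events per window.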
+ +### Lock contention documented in GA doc + +Per `_external-references-digest.md`: "A major lock-contention performance +issue affected all disk-buffer users (writer throughput ~90 MiB/s capped by +contention)." This is a known ceiling; this property detects regression +*below* a minimum floor, not measurement of the ceiling. + +### Relationship to `ensure_ready_for_write` (writer.rs:1001-1019) + +The writer's `ensure_ready_for_write` loop holds the mutex while waiting for +the reader to signal progress (`wait_for_reader().await`). Under the underflow +deadlock (#21683), this await never resolves — mutex held forever. Under lock +starvation (this property's target), the await resolves but other senders +cannot acquire the mutex at a meaningful rate. + +--- + +## Distinguishing Degenerate from Deadlocked + +| State | `writer-eventually-makes-progress` | `throughput-progresses-under-contention` | +|---|---|---| +| Healthy | fires (Sometimes) | fires (Sometimes) | +| Degenerate-but-alive | fires (Sometimes) | does NOT fire (throughput below floor) | +| Permanently deadlocked | does NOT fire | does NOT fire | + +Antithesis can distinguish states 2 and 3 by checking BOTH properties in the +same run: if `writer-eventually-makes-progress` fires but +`throughput-progresses-under-contention` does not, the system is degenerate. + +--- + +## SUT-Side Instrumentation (MISSING — must be added) + +No Antithesis SDK instrumentation exists in the Vector codebase. 
+ +### Assertion 1 — Sometimes: throughput above floor + +Placed in a workload-side heartbeat that samples written-event count over a +rolling window: + +```rust +// workload heartbeat, every 10 seconds +let events_written_this_window = EVENTS_WRITTEN_COUNTER.swap(0, Ordering::Relaxed); +antithesis_sdk::assert_sometimes!( + events_written_this_window > THROUGHPUT_FLOOR_EVENTS_PER_WINDOW, + "throughput-progresses-under-contention: write throughput above floor", + &serde_json::json!({ + "events_written": events_written_this_window, + "floor": THROUGHPUT_FLOOR_EVENTS_PER_WINDOW, + "window_seconds": 10, + "parallel_senders": N_SENDERS, + }) +); +``` + +`THROUGHPUT_FLOOR_EVENTS_PER_WINDOW` should be set conservatively (e.g., +1,000 events per 10-second window) to avoid flakiness under heavy CPU +throttle while still detecting regression to near-zero. + +### Assertion 2 — Reachable: mutex acquisition completes under contention + +```rust +// sender.rs, after writer.lock().await completes +antithesis_sdk::assert_reachable!( + "throughput-progresses-under-contention: mutex acquired under parallel contention", + &serde_json::json!({ + "sender_id": SENDER_ID, // set per-thread + }) +); +``` + +This confirms that the async mutex is being acquired (not permanently blocking +due to a deadlock), allowing Antithesis to distinguish "lock never acquired" +from "lock acquired but slowly." + +### SUT-side metric: expose lock wait time + +The existing `tracing` instrumentation at `sender.rs:46-48` does not measure +mutex acquisition latency. 
Adding a `histogram` metric at this callsite would +allow both Antithesis assertions and production observability: + +```rust +// sender.rs, SenderAdapter::send, DiskV2 arm +let lock_start = Instant::now(); +let mut writer = writer.lock().await; +let lock_wait_ms = lock_start.elapsed().as_millis(); +// existing metrics infra: +histogram!("disk_buffer_writer_lock_wait_ms", lock_wait_ms as f64); +``` + +--- + +## Why Existing Tests Cannot Catch This + +- The model-based proptest serializes all operations (single thread, no + parallel senders) — lock contention is structurally absent. +- Unit tests do not configure multiple parallel sources to the same buffer. +- The disabled `writer_waits_when_buffer_is_full` test (`size_limits.rs, + #[ignore]`) is the closest existing test to this scenario, but it tests the + blocking behavior (progress after backpressure), not throughput degradation + under contention. +- CPU throttle is not available in the existing test environment. + +--- + +## Framing: Performance vs. Correctness + +This property borders on performance testing, which is unusual for Antithesis +properties (which typically target safety/liveness, not throughput numbers). +The framing is justified because: + +1. The floor is set to catch *regression to near-zero*, not to measure the + ceiling. A floor of 1,000 events/10s is several orders of magnitude below + the ~90 MiB/s ceiling; crossing below it indicates a structural problem + (lock starvation, scheduling fairness failure), not a performance degradation. +2. The value is catching a failure mode that the existing deadlock detection + misses: "barely alive but functionally useless" is operationally equivalent + to deadlocked from the customer's perspective. +3. Lock contention under CPU throttle is an interleaving-sensitive bug class + that Antithesis's execution model is specifically designed to explore. + +--- + +## Open Questions + +- What is the appropriate floor threshold? 
Setting it too high causes flakiness + under heavy CPU throttle; setting it too low makes the property trivially + pass even in degenerate states. Recommend calibrating against a baseline run + (no throttle, single sender) and setting the floor at 0.1% of the baseline + throughput. + +- Does `tokio`'s async mutex (`tokio::sync::Mutex`) provide fairness + guarantees under CPU throttle? If the mutex uses FIFO queuing (it does), + all senders should eventually acquire the lock, making permanent starvation + unlikely. However, FIFO ordering means a slow holder blocks all waiting + senders for its full hold duration — amplifying the effect of CPU throttle. + +- Should the throughput counter be placed at the `write_record` callsite + (inside the mutex, measuring successful writes) or at the sender entry point + (measuring attempted writes)? Counting attempted writes exposes the lock + wait time; counting successful writes exposes the downstream write rate. + Both are useful; recommend both counters with separate names. + +- The GA doc's ~90 MiB/s ceiling was measured with 10 threads. With N=4 + parallel sources, the ceiling is lower. What is the expected throughput + floor at N=4 under moderate CPU throttle? Needs a calibration run in the + Antithesis environment. + +- If the floor is crossed, does Antithesis need SUT-side instrumentation to + distinguish "waiting for reader to free space" (correct backpressure + behavior) from "starved on lock contention" (regression)? The distinction + requires exposing whether the slow path is `ensure_ready_for_write` + (waiting for reader, logged at TRACE) or `writer.lock().await` (no current + instrumentation). 
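Two of the questions above (separate attempted/completed counters, and a floor set at 0.1% of a baseline run) can be sketched with plain atomics. Everything here is hypothetical naming for illustration (`SenderCounters`, `take_window`, `calibrated_floor`); the real counters would live wherever the harness ends up placing them.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Separate counters for writes that entered the sender and writes that
/// completed inside the critical section.
#[derive(Default)]
pub struct SenderCounters {
    attempted: AtomicU64,
    completed: AtomicU64,
}

impl SenderCounters {
    /// Call at the sender entry point, before waiting on the lock.
    pub fn record_attempt(&self) {
        self.attempted.fetch_add(1, Ordering::Relaxed);
    }

    /// Call after `write_record` returns successfully.
    pub fn record_completion(&self) {
        self.completed.fetch_add(1, Ordering::Relaxed);
    }

    /// Drain both counters for one window: (attempted, completed).
    pub fn take_window(&self) -> (u64, u64) {
        (
            self.attempted.swap(0, Ordering::Relaxed),
            self.completed.swap(0, Ordering::Relaxed),
        )
    }
}

/// Floor at 0.1% of the baseline events per window, never below 1.
pub fn calibrated_floor(baseline_events_per_window: u64) -> u64 {
    (baseline_events_per_window / 1000).max(1)
}
```

A window where `attempted` is far ahead of `completed` points at time spent before the write finished (lock wait or backpressure); telling those two apart still needs the extra instrumentation raised in the last question.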
diff --git a/tests/antithesis/scratchbook/properties/total-buffer-size-never-underflows.md b/tests/antithesis/scratchbook/properties/total-buffer-size-never-underflows.md new file mode 100644 index 0000000000000..b79011e447441 --- /dev/null +++ b/tests/antithesis/scratchbook/properties/total-buffer-size-never-underflows.md @@ -0,0 +1,289 @@ +--- +slug: total-buffer-size-never-underflows +type: Safety / Unreachable +sut_path: lib/vector-buffers/src/variants/disk_v2/ +commit: b7aae737cef5dd37d1445915443a1eb97b584f85 +updated: 2026-05-28 +linked_bugs: + - vectordotdev/vector#21683 + - PR #23561 (metrics reporter only — control-path UNFIXED) +--- + +# Property: total-buffer-size-never-underflows + +## Catalog Entry + +**Type:** Safety / Unreachable + +**Property:** The in-memory `total_buffer_size` AtomicU64 in `Ledger` is never +decremented by an amount greater than its current value; the atomic never wraps +toward u64::MAX. + +**Invariant:** For every call to `decrement_total_buffer_size(amount)`: + +``` +amount <= self.total_buffer_size.load(Ordering::Acquire) +``` + +If this is violated the atomic wraps to approximately 2^64 − amount, which +permanently poisons `is_buffer_full()` (writer.rs:993-996) and deadlocks the +writer forever. + +**Antithesis Angle:** Requires a node-kill fault at a file-rotation or +partial-write boundary, followed by restart. After restart `update_buffer_size` +(ledger.rs:653-697) re-seeds `total_buffer_size` from the *file sizes* of all +`.dat` files on disk. The reader then seeks forward through partially-written +records, calling `decrement_total_buffer_size` by *record bytes* — which are +strictly smaller than the file's on-disk size if the tail was never fully +written. This is the mismatch that causes underflow. + +**Why It Matters:** This is the root cause of GitHub issue #21683. The control-path +atomic is still unfixed at this commit. 
PR #23561 applied `saturating_sub` only to +the metrics *reporter* (the gauge that users see on dashboards), not to the atomic +that gates the writer's progress. A wrapped value makes `is_buffer_full()` return +`true` permanently, starving the write path with no error, no log at ERROR level, +and no crash — the pipeline silently stalls. + +--- + +## Trigger Paths (verified against source) + +### Path A: seek-on-restart mismatch (primary trigger) + +1. Writer writes a partial record to `buffer-data-N.dat` at the end of a 128MB + data file, then crashes before `fsync` completes. +2. On restart, `Ledger::new` calls `update_buffer_size` (ledger.rs:653-697), + which calls `increment_total_buffer_size(file_size_of_N)` — the *whole* file + size, including the partial tail. +3. The reader calls `seek_to_next_record`, which calls `track_read` (reader.rs:448) + for every record it skips, each time calling: + + ```rust + // reader.rs:469 + self.ledger.decrement_total_buffer_size(record_bytes); + ``` + + where `record_bytes` is the serialized-record length for that valid record. +4. When the reader encounters the torn tail it stops. The sum of `record_bytes` + decremented is less than the file size that was added at step 2, so + `total_buffer_size` is positive and correct so far. +5. HOWEVER: when `delete_completed_data_file` is called (reader.rs:489) with + `bytes_read = Some(self.bytes_read)`, the adjustment at reader.rs:521-535 is: + + ```rust + let size_delta = metadata.len() - bytes_read; // reader.rs:524 + ``` + + This subtraction is plain `u64 -`; if `bytes_read > metadata.len()` (reachable + if two decrements race or if the file was truncated externally) the subtraction + panics in debug or wraps in release. More critically, this delta is then passed + to `decrement_total_buffer_size` (reader.rs:538), which itself does: + + ```rust + // ledger.rs:292 + self.total_buffer_size.fetch_sub(amount, Ordering::AcqRel); + ``` + + with no saturation guard. 
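The arithmetic in step 5 can be reproduced in isolation. This sketch uses only `std`; `raw_decrement` mirrors the shape of the unguarded `fetch_sub`, and `checked_size_delta` is a hypothetical checked form of the `metadata.len() - bytes_read` subtraction, not code from the repository.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Same shape as the unguarded decrement: `fetch_sub` wraps on underflow.
/// Returns the value left in the atomic by this call.
pub fn raw_decrement(total: &AtomicU64, amount: u64) -> u64 {
    let previous = total.fetch_sub(amount, Ordering::AcqRel);
    previous.wrapping_sub(amount)
}

/// Checked form of `file_len - bytes_read`: `None` when the plain `-`
/// would panic (debug) or wrap (release).
pub fn checked_size_delta(file_len: u64, bytes_read: u64) -> Option<u64> {
    file_len.checked_sub(bytes_read)
}
```

Starting from 100 and decrementing by 150 leaves `u64::MAX - 49` in the atomic, which is the poisoned value the rest of this note traces through the writer.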
+ +### Path B: double-decrement on skip + +When the reader "fast-forwards" (skips an entire file during initialization with +`bytes_read = None`), `delete_completed_data_file` passes the full `metadata.len()` +as the decrement amount (reader.rs:522). If the reader had already called +`track_read` for records inside that file (decrementing by their record bytes), +those bytes are subtracted twice: once via `track_read` and once via +`delete_completed_data_file`. The net decrement exceeds the original file-size +increment, causing underflow. + +### Raw underflow site + +```rust +// ledger.rs:291-298 (UNGUARDED — the fix in PR #23561 did NOT touch this) +pub fn decrement_total_buffer_size(&self, amount: u64) { + let last_total_buffer_size = self.total_buffer_size.fetch_sub(amount, Ordering::AcqRel); + trace!( + previous_buffer_size = last_total_buffer_size, + new_buffer_size = last_total_buffer_size - amount, // also wraps in trace! + "Updated buffer size.", + ); +} +``` + +Note: the `trace!` log at line 295 (`last_total_buffer_size - amount`) also wraps +and would emit a nonsensical value, making post-hoc diagnosis harder. + +--- + +## Downstream Effect: Permanent Writer Deadlock + +Once `total_buffer_size` wraps to ~u64::MAX, the following chain locks up: + +1. `is_buffer_full` (writer.rs:993-996): + + ```rust + fn is_buffer_full(&self) -> bool { + let total_buffer_size = self.ledger.get_total_buffer_size() + self.unflushed_bytes; + let max_buffer_size = self.config.max_buffer_size; + total_buffer_size >= max_buffer_size // always true: u64::MAX >= any max_size + } + ``` + +2. `ensure_ready_for_write` (writer.rs:1001-1019) enters an infinite `loop`: + + ```rust + loop { + if !self.is_buffer_full() || !self.ready_to_write { + break; // never taken: is_buffer_full() is always true + } + self.ledger.wait_for_reader().await; // woken, re-checks, loops forever + } + ``` + +3. 
The reader *does* drain and delete files — calling `notify_reader_waiters()` + each time (reader.rs:555) — but the writer wakes, re-evaluates `is_buffer_full` + (still true), and blocks again immediately. The wakeups are real; the accounting + is permanently wrong. +4. `can_write_record` (writer.rs:793-798) has the same `get_total_buffer_size()` + call and is similarly poisoned. + +The stall is externally **invisible**: no ERROR log, no metric spike, no panic. +The `buffer_events` / `buffer_byte_size` gauges may already be corrupted (stuck +at very large values), but PR #23561's `saturating_sub` in the reporter makes +the *dashboard* gauge appear normal at 0, hiding the deadlock. + +--- + +## SUT-Side Instrumentation (MISSING — must be added) + +All Antithesis SDK calls below are **absent** from the codebase (confirmed by +grep over the entire repo). The Antithesis Rust SDK must be added as a dependency. + +### Assertion 1 — Unreachable guard on the decrement + +```rust +// ledger.rs, inside decrement_total_buffer_size, before fetch_sub +let current = self.total_buffer_size.load(Ordering::Acquire); +if amount > current { + // Only the underflow path may reach this point; placed unconditionally, + // assert_unreachable! would fire on every decrement. + antithesis_sdk::assert_unreachable!( + "total_buffer_size underflow: amount exceeds current value", + &serde_json::json!({ + "current_total_buffer_size": current, + "decrement_amount": amount, + "wrapped_value_would_be": current.wrapping_sub(amount), + }) + ); +} +// Alternatively, assert_always! framing: +antithesis_sdk::assert_always!( + amount <= current, + "decrement_total_buffer_size: amount must not exceed current value", + &serde_json::json!({ "current": current, "amount": amount }) +); +``` + +Placement: ledger.rs, in `decrement_total_buffer_size`, line ~291, before +`fetch_sub`.
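One caveat on Assertion 1: the `load` and the later `fetch_sub` are two separate atomic operations, so another thread can change the value in between and the check is only advisory. If the check and the subtraction need to be a single atomic step, `AtomicU64::fetch_update` with `checked_sub` is one way to get there. The sketch below is an assumption about what such a guard could look like, with hypothetical names (`guarded_decrement`, `Underflow`); it is not the fix the owning team has chosen.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Reported when a decrement would take the counter below zero.
#[derive(Debug, PartialEq, Eq)]
pub struct Underflow {
    pub current: u64,
    pub amount: u64,
}

/// Atomically subtract `amount` only if it does not exceed the current value.
/// Returns the new value, or the rejected operands with the counter unchanged.
pub fn guarded_decrement(total: &AtomicU64, amount: u64) -> Result<u64, Underflow> {
    total
        .fetch_update(Ordering::AcqRel, Ordering::Acquire, |current| {
            current.checked_sub(amount)
        })
        // `previous >= amount` here, because `checked_sub` returned `Some`.
        .map(|previous| previous - amount)
        .map_err(|current| Underflow { current, amount })
}
```

The `Err` branch is where an `assert_unreachable!` (before a fix) or an `assert_reachable!` (to validate a saturating fix) would naturally sit.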
+ +### Assertion 2 — Always guard on the reader.rs delta subtraction + +```rust +// reader.rs:521-535, before the plain `-` +let file_size = metadata.len(); +antithesis_sdk::assert_always!( + bytes_read <= file_size, + "delete_completed_data_file: bytes_read exceeds on-disk file size", + &serde_json::json!({ + "file_size": file_size, + "bytes_read": bytes_read, + "data_file_path": data_file_path.to_string_lossy(), + }) +); +// assert_always! reports the violation but does not abort, so this `-` is still unguarded +let size_delta = file_size - bytes_read; +``` + +Placement: reader.rs, inside `delete_completed_data_file`, before line 524. + +### Assertion 3 — Always: post-decrement value is sane + +```rust +// ledger.rs, after fetch_sub in decrement_total_buffer_size +let new_value = last_total_buffer_size.wrapping_sub(amount); +antithesis_sdk::assert_always!( + new_value <= last_total_buffer_size, + "total_buffer_size decreased monotonically after decrement", + &serde_json::json!({ + "before": last_total_buffer_size, + "amount": amount, + "after": new_value, + }) +); +``` + +### Assertion 4 — Reachable: the saturation branch is exercised (once a fix lands) + +If a saturation fix is later applied (the correct fix), add an `assert_reachable!` +at the saturation branch to confirm Antithesis actually triggers the bug +scenario, so that the fix can be validated with the harness. + +--- + +## Why Existing Tests Cannot Catch This + +- The model-based proptest (`tests/model/`) uses `TestFilesystem` whose `sync_all` + is a no-op and whose `flush` is a no-op. No partial writes are ever left on disk. + `update_buffer_size` sees zero bytes in the in-memory filesystem, so the + re-seed is always zero. The mismatch never materializes. +- The model's own `LedgerModel::decrement_buffer_size` mirrors the unguarded + `fetch_sub` (it would reproduce the underflow if triggered), but the trigger + is unreachable via the in-memory filesystem.
+- `writer_waits_when_buffer_is_full` (`size_limits.rs`) is `#[ignore]` — this + is the backpressure test that sits directly on the deadlock path. + +--- + +## Open Questions + +- **Is `update_buffer_size` the only re-seed path?** Confirm that there is no + second call to `increment_total_buffer_size` during reader initialization that + could compound the over-seed. (Current reading: only one call at ledger.rs:695.) + +- **Does the double-decrement via Path B (fast-forward + track_read) actually + occur in the current code, or is it prevented by the `!self.ready_to_read` + guard at reader.rs:468?** The guard routes early seek-reads only through + `decrement_total_buffer_size(record_bytes)` (not through the file-deletion + path); but when the file is subsequently deleted, `bytes_read` captures those + same bytes, and `delete_completed_data_file` subtracts them again via the + `size_delta`. This is worth a focused code trace. + +- **Node-termination faults enabled?** This bug requires kill-and-restart. Confirm + with the Antithesis tenant operator whether node termination is enabled by + default or must be requested. + +- **Does the `trace!` at ledger.rs:295 (`last_total_buffer_size - amount`) panic + in debug mode** before the bug can be observed? If running with `debug_assertions`, + the trace format would panic on the wrapped arithmetic. This affects harness + build mode selection. + +- **PR #23561 scope:** Verify that the metrics reporter fix (using `saturating_sub` + in `buffer_usage_data.rs`) is the *only* change, and that `decrement_total_buffer_size` + in `ledger.rs` is definitively unchanged. (Confirmed at this commit: ledger.rs:292 + still uses raw `fetch_sub`.) + +- **Fault timing specificity:** How early in a file-write does the crash need to + occur for the partial record to trigger the mismatch? Does the 500ms fsync + window create a large enough target? 
Or is the real trigger the file-rotation + boundary (the very first write to a new data file, which is the most common + partial-write scenario)? + +--- + +### Investigation Log + +#### Does the double-decrement via fast-forward + track_read during seek get fully blocked by the `!self.ready_to_read` guard (reader.rs:~468), or can both fire for the same bytes? + +**Examined:** `reader.rs:464–539` (`track_read` and `delete_completed_data_file`). + +**Found:** The guard at `reader.rs:468` (`if !self.ready_to_read`) short-circuits `track_read` so that only `decrement_total_buffer_size(record_bytes)` fires and `return` is executed — the per-record ack machinery below line 471 is skipped. Crucially, `bytes_read` is still incremented at `reader.rs:467` (`self.bytes_read += record_bytes`) regardless of the `ready_to_read` state. When `delete_completed_data_file` is later called with `bytes_read = Some(self.bytes_read)`, it computes `size_delta = metadata.len() - bytes_read` (reader.rs:524) and calls `decrement_total_buffer_size(size_delta)` (reader.rs:538). The sum of the two decrements — one in `track_read` for each valid record, one in `delete_completed_data_file` for the remaining unread tail — equals exactly `metadata.len()` when the file was fully read; in that case there is no double-decrement. + +**The separate, unguarded site:** `delete_completed_data_file` at reader.rs:521–538 performs a plain `u64 -` subtraction (`metadata.len() - bytes_read`) with no saturation guard. If `bytes_read > metadata.len()` (reachable if two decrements race, the file was externally truncated, or a caller error passes an over-counted `bytes_read`), the subtraction wraps in release mode or panics in debug mode, and the resulting large `size_delta` is passed directly to `decrement_total_buffer_size` (ledger.rs:292), which calls `fetch_sub` with no bounds check. 
The `!self.ready_to_read` guard at reader.rs:468 does NOT protect this site — it only guards the per-record `record_acks.add_marker` call. The delete-time subtraction is a distinct, unguarded decrement path. + +**Conclusion:** Path B double-decrement (fast-forward case where `bytes_read = None` skips all `track_read` calls and `delete_completed_data_file` subtracts the full `metadata.len()`) is guarded by the `None` branch in the `bytes_read.map_or_else` at reader.rs:521 — in that case the full file size is decremented exactly once, not twice. The underflow risk in Path B is therefore not a double-decrement from track_read + delete, but from the re-seed in `update_buffer_size` exceeding what records actually represent, per Path A. The unguarded `metadata.len() - bytes_read` plain subtraction at reader.rs:524 remains a correctness risk for partial-read cases. diff --git a/tests/antithesis/scratchbook/properties/writer-eventually-makes-progress.md b/tests/antithesis/scratchbook/properties/writer-eventually-makes-progress.md new file mode 100644 index 0000000000000..d960b1c334784 --- /dev/null +++ b/tests/antithesis/scratchbook/properties/writer-eventually-makes-progress.md @@ -0,0 +1,316 @@ +--- +slug: writer-eventually-makes-progress +type: Liveness / Sometimes(writer_unblocked_after_full) +sut_path: lib/vector-buffers/src/variants/disk_v2/ +commit: b7aae737cef5dd37d1445915443a1eb97b584f85 +updated: 2026-05-28 +linked_bugs: + - vectordotdev/vector#21683 (root cause: see total-buffer-size-never-underflows) + - L1 / L8 in sut-analysis.md §5 (liveness claims that fail under underflow) +--- + +# Property: writer-eventually-makes-progress + +## Catalog Entry + +**Type:** Liveness / Sometimes(writer_unblocked_after_full) + +**Property:** A writer that blocked because the buffer was full (i.e. +`is_buffer_full()` returned `true`) eventually performs another successful +`write_record` after the reader acknowledges and deletes at least one data file. 
+The state where `is_buffer_full()` returns `true` permanently (writer deadlock) +never persists indefinitely. + +**Invariant:** After every `delete_completed_data_file` invocation that calls +`notify_reader_waiters()` (reader.rs:555), the writer unblocks and completes at +least one successful `write_record` within a bounded time. Equivalently: the +tuple `(is_buffer_full() == true, buffer_is_drained == false)` is not a +permanent fixed point. + +**Antithesis Angle:** This is the *user-visible manifestation* of #21683 — a +silent pipeline stall. The workload should: + +1. Fill the buffer to capacity (writes block; `is_buffer_full()` true). +2. Inject a node-kill fault at a file-rotation or partial-write boundary. +3. Restart Vector. +4. Resume the reader workload so it acks and deletes files normally. +5. Call `ANTITHESIS_STOP_FAULTS` (quiet period with no further faults). +6. Assert `Sometimes`: the writer completes at least one additional successful + write after the reader deletes a file post-restart. + +If the underflow bug fires at step 2-3, the writer never unblocks at step 6, +and `Sometimes` is never satisfied — Antithesis reports a liveness failure. + +**Why It Matters:** A pipeline stall with no error, no crash, and no high-level +alert is the worst possible operational failure mode. Dashboards may appear +normal (PR #23561 makes the buffer-size gauge saturate at zero instead of +showing u64::MAX). The sink stops delivering events, but no alert fires. The +customer experiences silent data loss without any signal to investigate. +This is exactly the scenario disk buffer is supposed to prevent. + +--- + +## The Deadlock Chain (traced through source) + +### Step 1: underflow fires (see `total-buffer-size-never-underflows.md`) + +`decrement_total_buffer_size` at ledger.rs:292 wraps `total_buffer_size` to ~u64::MAX. 
+ +### Step 2: `is_buffer_full` permanently returns `true` + +```rust +// writer.rs:993-996 +fn is_buffer_full(&self) -> bool { + let total_buffer_size = self.ledger.get_total_buffer_size() + self.unflushed_bytes; + let max_buffer_size = self.config.max_buffer_size; + total_buffer_size >= max_buffer_size // u64::MAX + any >= any max_size: always true +} +``` + +`get_total_buffer_size()` loads `total_buffer_size` (ledger.rs:276-278) with +`Ordering::Acquire`. The wrapped value is visible to the writer immediately. + +Note: `self.unflushed_bytes + u64::MAX` wraps a second time back near 0 on some +inputs, potentially causing intermittent false negatives. The behaviour is +input-dependent — another source of non-determinism. + +### Step 3: `ensure_ready_for_write` enters an infinite sleep loop + +```rust +// writer.rs:1001-1019 +async fn ensure_ready_for_write(&mut self) -> io::Result<()> { + loop { + if !self.is_buffer_full() || !self.ready_to_write { + break; // never taken + } + // Logs at trace! only — not visible in production by default + self.ledger.wait_for_reader().await; // woken by notify_reader_waiters() + } + // ... +} +``` + +`wait_for_reader()` awaits `self.reader_notify.notified()` (ledger.rs:361-363). +Every time the reader deletes a file it calls `notify_reader_waiters()` +(reader.rs:555), which calls `self.reader_notify.notify_one()` (ledger.rs:376). +The writer wakes, calls `is_buffer_full()` (which returns `true`), and blocks +again — the wakeup is real but the accounting is wrong. + +The `Notify` primitive is edge-triggered with a single permit. If the writer +misses a wakeup (e.g. it was not yet waiting when `notify_one()` was called), +the next write attempt will observe the false-full state and block for the next +notification from the reader. In the underflow scenario, the reader eventually +drains the entire buffer and stops calling `notify_reader_waiters()`, so the +writer sleeps indefinitely. 
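Steps 2 and 3 can be replayed as a small pure model. This is a sketch under assumptions: it uses `wrapping_add`/`wrapping_sub` to mimic release-build arithmetic (a debug build would panic on the overflowing `+` instead), it ignores `ready_to_write`, and both function names are hypothetical stand-ins rather than the writer's actual methods.

```rust
/// Release-build model of the full check: the addition wraps instead of panicking.
pub fn is_buffer_full_model(
    total_buffer_size: u64,
    unflushed_bytes: u64,
    max_buffer_size: u64,
) -> bool {
    total_buffer_size.wrapping_add(unflushed_bytes) >= max_buffer_size
}

/// Replay reader deletions against a starting total. Returns the number of
/// wakeups the writer needs before the buffer stops looking full, or `None`
/// if it still looks full after every deletion.
pub fn wakeups_until_unblocked(
    mut total_buffer_size: u64,
    max_buffer_size: u64,
    freed_per_delete: &[u64],
) -> Option<usize> {
    for (index, freed) in freed_per_delete.iter().enumerate() {
        total_buffer_size = total_buffer_size.wrapping_sub(*freed);
        if !is_buffer_full_model(total_buffer_size, 0, max_buffer_size) {
            return Some(index + 1);
        }
    }
    None
}
```

In the healthy case a full 1,000-byte buffer unblocks after the first 100-byte deletion. Starting from the wrapped value `100u64.wrapping_sub(150)`, no realistic number of deletions helps, while adding 100 unflushed bytes wraps the sum back to 50 and makes the buffer look momentarily not full, which is the input-dependent behaviour noted in Step 2.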
+ +### Step 4: `can_write_record` also poisoned + +```rust +// writer.rs:793-798 +fn can_write_record(&self, amount: usize) -> bool { + let total_buffer_size = self.ledger.get_total_buffer_size() + self.unflushed_bytes; + let potential_write_len = u64::try_from(amount)...; + self.can_write() && total_buffer_size + potential_write_len <= self.config.max_buffer_size + // ^^^^^^^^^^^^^^^^ wrapped value ≫ max_buffer_size: always false +} +``` + +Even if `ensure_ready_for_write` were somehow bypassed, `can_write_record` +would return `false` for every non-zero write, blocking the write at the +outer layer too. + +### Step 5: reader shutdown also broken (L8) + +The reader's `next()` uses `total_buffer_size == 0` as the "buffer empty" signal +to return `None` and shut down. With `total_buffer_size` at u64::MAX, `next()` +never sees zero and loops indefinitely trying to read records. The pipeline +cannot shut down cleanly. + +### Step 6: wakeup chain dependency diagram + +``` +sink delivers event + → BatchNotifier dropped + → finalizer task (ledger.rs:701-709) calls increment_pending_acks + notify_writer_waiters + → wakes READER (naming is misleading: notify_writer_waiters wakes the reader's wait_for_writer loop) + → reader calls handle_pending_acknowledgements + → delete_completed_data_file + → decrement_total_buffer_size ← UNDERFLOW HERE on post-restart run + → notify_reader_waiters + → wakes WRITER from wait_for_reader + → writer checks is_buffer_full → still true → re-blocks +``` + +Break any link in this chain (underflow, finalizer task dead, reader not polled) +and the writer stalls. Antithesis should kill at multiple points in this chain. + +--- + +## Observable Signals + +### Workload-observable (external to SUT) + +- **Write throughput drops to zero** after a node-kill-and-restart during + buffer-full conditions. The workload can measure this by counting successful + `write_record` completions per time window. 
After STOP_FAULTS, if throughput + does not recover within a grace period, liveness fails. +- **Sink delivery throughput drops to zero** — nothing is being read or forwarded. +- **No error logs at ERROR or WARN level** — the stall is silent at those levels; + the trace! at writer.rs:1013-1016 is only emitted at `trace` level. + +### SUT-side (requires Antithesis SDK instrumentation — all MISSING) + +- `assert_sometimes!` immediately after a successful `write_record` completes, + conditional on `had_been_full` (a local flag set when `is_buffer_full()` was + true at the start of the preceding `ensure_ready_for_write` call). This fires + once per "blocked writer later unblocks" cycle. +- `assert_unreachable!` inside `ensure_ready_for_write`'s `wait_for_reader` + branch that counts consecutive wait cycles: if the writer has woken N times + (say N=100) without making progress, fire. This is an operational proxy for + the permanent deadlock. + +--- + +## SUT-Side Instrumentation (MISSING — must be added) + +### Assertion 1 — Sometimes: writer unblocks after being full + +```rust +// writer.rs, in ensure_ready_for_write, after the loop exits +// (i.e., when is_buffer_full() becomes false and the writer is about to write) +if was_buffer_full { // local bool set to true when we entered the wait + antithesis_sdk::assert_sometimes!( + true, + "writer_unblocked_after_full: writer made progress after buffer was full", + &serde_json::json!({ + "total_buffer_size": self.ledger.get_total_buffer_size(), + "max_buffer_size": self.config.max_buffer_size, + "unflushed_bytes": self.unflushed_bytes, + }) + ); +} +``` + +### Assertion 2 — Sometimes: writer_unblocked_after_restart + +A stronger variant, scoped to "writer unblocked in the same process lifetime +as a restart recovery": + +```rust +// After validate_last_write completes (writer.rs, end of the method) +// and the writer subsequently completes a write: +antithesis_sdk::assert_sometimes!( + true, + 
"writer_made_progress_after_recovery", + &serde_json::json!({ "recovered": true }) +); +``` + +### Assertion 3 — Unreachable: stale-full detected + +Inside `ensure_ready_for_write`, count loops without progress: + +```rust +let mut stall_count = 0u32; +loop { + if !self.is_buffer_full() || !self.ready_to_write { + break; + } + stall_count += 1; + antithesis_sdk::assert_unreachable!( + "writer stalled waiting for reader without progress", + &serde_json::json!({ + "stall_count": stall_count, + "total_buffer_size": self.ledger.get_total_buffer_size(), + "max_buffer_size": self.config.max_buffer_size, + }) + ); + self.ledger.wait_for_reader().await; +} +``` + +Note: `assert_unreachable!` fires on first execution; a stall counter threshold +is not needed — Antithesis tracks whether the unreachable point is ever reached. +But a threshold (e.g. fire only after 10 wakeups with no progress) avoids +false-positive noise during brief legitimate back-pressure. + +--- + +## Antithesis Fault Strategy + +### Recommended fault sequence + +1. **Fill phase:** Send enough events to bring `total_buffer_size` near + `max_buffer_size`. Writer should block (normal backpressure). +2. **Fault injection:** Node-kill (SIGKILL) at a file-rotation boundary. The most + reliable trigger is killing during the `fsync` of a data file just before or + just after it reaches 128MB, so the tail is partial. +3. **Restart:** Vector restarts; `update_buffer_size` re-seeds from file sizes. +4. **Reader drain:** Let the reader seek, ack, and delete files normally. +5. **STOP_FAULTS:** Call `ANTITHESIS_STOP_FAULTS` or equivalent quiet period. +6. **Verify progress:** Assert `Sometimes(writer_unblocked_after_full)` is + satisfied within the quiet period. If the assertion is never seen, the test fails. + +### Why Antithesis over a fixed chaos test + +The internal chaos test uses SIGKILL ×3 at fixed points. The underflow bug depends on +the *exact byte offset* of the crash relative to the file boundary. 
Antithesis's +systematic exploration of fault timing finds the specific windows (e.g. kill +during file rename/open at rotation, or during the first `write_all` to the new +file) that a fixed-timing test misses. + +### CPU throttling amplification + +CPU throttling extends the time between `fetch_sub` and the subsequent +`is_buffer_full` check, increasing the window where a reader wakeup arrives +between the underflow and the re-check. Throttling the writer process specifically +may surface race-condition variants. + +--- + +## Relationship to `buffer-size-within-max` (property #7) + +If the underflow bug fires, `is_buffer_full()` is permanently `true`, meaning no +more data is written. The on-disk buffer size technically stays within `max_size` +— not because the invariant is upheld, but because the writer is dead. +`buffer-size-within-max` must explicitly note that a passing result under +permanent deadlock conditions is vacuously true and does not indicate health. +Cross-reference: test both properties together; if `buffer-size-within-max` holds +AND `writer-eventually-makes-progress` fails, the combined result exposes the bug. + +--- + +## Open Questions + +- **Is the `Sometimes` assertion reachable under normal (non-fault) operation?** + Yes — any time the buffer fills and then drains normally. This means the + `Sometimes` property is satisfiable without faults, which is desirable: it + shows the non-fault path is covered before testing the fault path. + +- **What is the grace period for the quiet-phase progress check?** The writer + may be slow to unblock if the reader takes time to delete files. A grace period + of ~10s after STOP_FAULTS should be sufficient for a non-buggy system. Tune + based on `max_buffer_size` and simulated throughput. 
+ +- **Does the `Notify` missed-wakeup window matter here?** `notify_one()` stores one + permit; if the writer is not yet waiting when `notify_one` fires, the next + `notified().await` returns immediately with the stored permit. This means there + is no missed-wakeup issue *in the healthy case*. In the underflow case, the + reader eventually stops notifying (buffer is empty), and the writer sleeps + indefinitely. That is the bug itself, not a missed-wakeup race. + +- **Does `wait_for_reader` have a timeout?** No (ledger.rs:361-363). The writer + will sleep indefinitely in the underflow case with no watchdog. A timeout-based + health check (e.g. emit a WARN log if waiting > 30s) would be a useful + diagnostic addition independent of Antithesis. + +- **Is the finalizer task shut down correctly?** If the finalizer task (spawned by + `spawn_finalizer`, ledger.rs:701-709) is dropped before all in-flight acks are + processed, pending acks are silently lost. This could cause the reader to + stall waiting for acks that never arrive, which the writer then interprets as + "reader made no progress." This is a separate liveness bug from #21683 but + observable via the same `Sometimes` property. + +- **Node-termination faults enabled?** Essential for this property. Confirm with + the Antithesis tenant operator. Without kill-and-restart faults, the underflow + trigger is unreachable and this property will always be satisfied trivially. 
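The stored-permit and no-timeout questions above can be made concrete with a small std-only model. This is a sketch under stated assumptions: `ReaderSignal` and `wait_with_watchdog` are hypothetical names, and the model uses `Mutex` plus `Condvar` rather than the async `Notify` the notes describe, so it illustrates the semantics only and is not the ledger's implementation.

```rust
use std::sync::{Condvar, Mutex};
use std::time::Duration;

/// Hypothetical single-permit signal, loosely modelling "notify_one stores one permit".
struct ReaderSignal {
    permit: Mutex<bool>,
    cond: Condvar,
}

impl ReaderSignal {
    fn new() -> Self {
        ReaderSignal { permit: Mutex::new(false), cond: Condvar::new() }
    }

    /// Store a permit (at most one) and wake a waiter if there is one.
    fn notify_one(&self) {
        *self.permit.lock().unwrap() = true;
        self.cond.notify_one();
    }

    /// Wait for a permit, but give up after `limit`.
    /// Returns true if a permit was consumed, false if the watchdog expired,
    /// which is where a WARN log could be emitted.
    fn wait_with_watchdog(&self, limit: Duration) -> bool {
        let guard = self.permit.lock().unwrap();
        let (mut guard, result) = self
            .cond
            .wait_timeout_while(guard, limit, |permit| !*permit)
            .unwrap();
        if result.timed_out() {
            return false;
        }
        *guard = false; // consume the stored permit
        true
    }
}
```

A notification sent before the wait is consumed immediately, which matches the healthy case; once notifications stop, the bounded wait returns `false` instead of sleeping forever.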
diff --git a/tests/antithesis/scratchbook/property-catalog.md b/tests/antithesis/scratchbook/property-catalog.md new file mode 100644 index 0000000000000..82cdfa5fc6b8d --- /dev/null +++ b/tests/antithesis/scratchbook/property-catalog.md @@ -0,0 +1,609 @@ +--- +sut_path: /home/ssm-user/src/vector +commit: 4ff41a0adb5240d071f30a5a43cb0d065e40f618 +updated: 2026-05-29 +external_references: + - path: lib/vector-buffers/src/variants/disk_v2/mod.rs + why: Module-level doc comment is the authoritative design spec + - path: rfcs/2021-10-14-9477-buffer-improvements.md + why: Original buffer-rework RFC; intended design and guarantees + - path: docs/specs/buffer.md + why: Buffer component spec / claimed behavior + - path: (internal design doc, not linked) + why: fsync/durability window, ack flow, at-least-once + duplicate semantics + - path: (internal design doc, not linked) + why: Root-cause writeups of #21683, #24948, #24606 + - path: (internal design doc, not linked) + why: Existing internal chaos test + lock-contention performance issue + - path: GitHub issues vectordotdev/vector #21683 #24948 #24606 #24144 #23995 #17666 #23456; PRs #23561 #24949 + why: Bug/regression context +--- + +# Property Catalog: Disk Buffer v2 + +29 properties across 7 categories (19 from discovery + 7 from evaluation +gap-filling — Category 7 + 3 from the 2026-05-29 data-loss expansion, in the +Category 1 "silent data-loss cluster"). No Antithesis SDK assertions exist in the codebase +today (`existing-assertions.md`); every SUT-side assertion noted below is +**missing** and must be added. Each property has an evidence file at +`properties/{slug}.md`. Evaluation refinements are recorded in +`evaluation/synthesis.md`. + +**Fault dependency:** nearly every high-value property requires **node-termination +(kill/restart) faults**, which are often disabled by default in Antithesis +tenants. This must be confirmed with the user (see file-level Open Questions). 
A +few also need a **custom fault** (config reload via SIGHUP, downstream-sink error +injection, binary swap, filesystem truncation). Several durability properties +should run with `flush_interval=0` (every flush becomes an fsync) to remove the +500ms-window and clock-jitter ambiguity from the oracle. + +**Two preconditions gate ~21/26 properties — CONFIRMED ENABLED by the user +(2026-05-28):** (1) node-termination kill/restart faults are enabled in the +tenant; (2) the buffer `data_dir` will be on storage that survives a modeled +crash (persistent volume). The crash-recovery cluster is therefore fully testable. +Still recommend a pre-fault **sentinel** (write+fsync N events, kill, assert files +survive) gating Category 2–6 assertions as a cheap guard against +misconfiguration/regression of the persistence assumption. + +**Logic-bug properties — user decision (2026-05-28, Bias B1):** the ~6 deterministic +logic/metric-bug properties are **kept as workload-side secondary checks with no +dedicated fault-search budget** (refinement R-H), not promoted to first-class +fault targets and not removed. + +**Currently-violated properties** (used to *expose* known/likely bugs, expected to +fail until fixed): `total-buffer-size-never-underflows`, +`writer-eventually-makes-progress` (deadlock), `sink-failure-not-silently-acked`, +`dropped-events-are-counted`, `file-id-rollover-stays-coordinated`, +`record-id-wraparound-accounting-holds` (empty-buffer case), +`foreign-data-file-no-writer-stall`, likely `config-reload-no-silent-loss`, and +likely `fsync-window-bounded-under-clock-jitter` (under clock faults). + +**Build note:** `total-buffer-size-never-underflows` (and any property observing +the wrap) requires a **release build** — the `trace!` at `ledger.rs:295` evaluates +`last_total_buffer_size - amount`, which panics in a debug build before the +release-mode wrap is observable. 
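The debug/release difference described in the build note can be reproduced with plain integer arithmetic. A minimal sketch, assuming only what the note states: `unguarded_decrement` and `trace_argument` are hypothetical helpers for illustration, not functions in `ledger.rs`.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical stand-in for the unguarded control-path decrement.
/// `fetch_sub` wraps on underflow in every build profile and returns the
/// value held before the subtraction.
fn unguarded_decrement(total: &AtomicU64, amount: u64) -> u64 {
    total.fetch_sub(amount, Ordering::AcqRel)
}

/// Hypothetical stand-in for the expression the `trace!` evaluates.
/// A plain `previous - amount` panics on underflow when overflow checks are
/// enabled (the debug default); `checked_sub` reports the same condition as `None`.
fn trace_argument(previous: u64, amount: u64) -> Option<u64> {
    previous.checked_sub(amount)
}
```

With a starting total of 100 and a decrement of 164, the atomic ends at `u64::MAX - 63` while `trace_argument` returns `None`, which is the case a debug build turns into a panic before the wrapped value can be observed.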
+ +**Workload status (2026-05-29 data-loss battery):** the `disk_v2_lossfinder` +exerciser (`lib/vector-buffers/examples/disk_v2_lossfinder.rs`, harness +`tests/antithesis/config-lossfinder` + `test/v1/diskbuf_loss`) implements a +no-silent-loss oracle across a 7-scenario RNG fault menu, giving workload +coverage for: `every-written-event-eventually-delivered` (Baseline), +`config-reload-no-silent-loss`/#24948 (WriterDropNoFlush), +`sink-failure-not-silently-acked` (RejectDeliveries — already surfaced the +finalizer status-discard loss locally), `durable-unacked-events-survive-crash` + +`recovery-completes-after-crash` (CrashReopen), `dropped-events-are-counted`/#24606 +(DropNewestOverfill — reachability only, metric oracle TODO), +`corruption-skip-loss-bounded` + `corruption-skip-record-id-accounting-consistent` +(Corruption/TruncateTail — collateral-loss oracle). The earlier `disk_v2_antithesis` +exerciser (config-direct) covers the #21683 accounting-underflow cluster +(reproduced in run D0). Both are demonstrations: assertions encode the CORRECT +no-loss invariant so they fire against current behavior. + +--- + +## Category 1 — Data Integrity & Corruption + +Records are CRC32C-checksummed and rkyv-validated. These properties verify the +buffer never hands a corrupted/garbled record to a sink and that corruption is +detected, not silently propagated. Antithesis's bit-flip / partial-write / +torn-tail faults are the core levers. + +### no-corrupted-record-delivered — No Corrupted Record Delivered to Sink + +| | | +|---|---| +| **Type** | Safety | +| **Property** | A record that fails CRC32C or rkyv `CheckBytes` is never decoded and returned as a valid event; the reader rolls to the next file instead. | +| **Invariant** | `AlwaysOrUnreachable` at the record-emission point (`reader.rs:~1131`, before `Ok(Some(record))`): any emitted record passed both `verify_checksum` (`record.rs:144-155`) and the hand-written `CheckBytes` (`record.rs:79-117`). 
AlwaysOrUnreachable because corruption is a rare/optional path — never-executed is acceptable, but any execution must satisfy the invariant. | +| **Antithesis Angle** | Bit-flips on payload / CRC field / rkyv root offset / length delimiter; mid-record truncation (→ `PartialWrite`); torn-tail after crash. Antithesis explores which corruption shapes slip past the manual `CheckBytes` and reach CRC, and whether a CRC-passing torn tail exists. | +| **Why It Matters** | The buffer's durability story depends on this guard. A bypass forwards garbled telemetry downstream and corrupts event-count accounting via a wrong record ID. Manual `CheckBytes` (rkyv ICE workaround) is an unsafe validation surface. | + +**Open Questions:** + +- Does the topology receiver (`receiver.rs`) panic or swallow `ReaderError::Checksum`/`Deserialization`? Determines whether a side-channel counter is needed for the workload to observe detection. +- The startup `seek_to_next_record` corruption path calls `validate_record_archive` directly, not `try_next_record` — does an assertion at the emission point miss corruption during startup replay? +- Is the `unsafe archived_root` in `read_record` (`reader.rs:375`) sound across tokio preemption between `try_next_record` and `read_record`? + +### corruption-is-detected-and-recovered — Corruption Detection/Recovery Path Executes + +| | | +|---|---| +| **Type** | Reachability | +| **Property** | When corruption is injected, the detection+recovery path (`is_bad_read` → `roll_to_next_data_file`) actually executes and the buffer continues reading. | +| **Invariant** | `Sometimes(corruption_detected_and_recovered)` at `reader.rs:~1035` (the `is_bad_read()` branch). Sometimes is correct: this path is only reachable under injected faults; the assertion confirms fault injection reaches live reads and the recovery branch fires. | +| **Antithesis Angle** | Inject faults while the reader's `BufReader` has the file open and is actively reading (not before/after). 
Distinguishes "fault reached detection logic" from "fault hit an already-closed file." | +| **Why It Matters** | Without this the recovery path may be dead code under a given fault strategy. Surfaces the "skip rest of file after first bad record" data-loss surface (valid records after a corrupt one in the same 128MB file are abandoned). | + +**Open Questions:** + +- Does the `seek_to_next_record` init loop invoke `roll_to_next_data_file` on bad reads, or take a different path that misses the assertion? +- What is the records-abandoned rate per corruption event (quantifies the skip-rest-of-file loss)? **Measurable angle (evaluation R-I/W-C2):** correlate corruption-injection timing with event IDs — events written *after* the injected-corruption offset in the same file should still be delivered; if they vanish, that is measurable abandonment loss. Worth elevating to its own sub-assertion in the workload. +- Do in-flight `BatchNotifier`s for records read before the corrupt one drain correctly when the file's deletion marker count excludes the skipped records? + +### record-id-monotonicity-holds — Monotonicity Panic Never Fires + +| | | +|---|---| +| **Type** | Safety | +| **Property** | The "record ID monotonicity violation detected; this is a serious bug" panic (`reader.rs:~480-484`) is never reached, even under crash/corruption/rollover faults. | +| **Invariant** | `Unreachable` — this is a guardrail that must never trip. A violation indicates a bug in `validate_last_write`, `seek_to_next_record`, or the ack state machine, not an acceptable rare path. | +| **Antithesis Angle** | Node-kill during `validate_last_write` fast-forward; torn-tail mis-recovery yielding a wrong `id`; file-ID rollover with the non-wrap-aware `u16 >` at `reader.rs:932`; node-kill during lazy ledger persistence in seek. 
| +| **Why It Matters** | The panic crashes the process; if the triggering state persists across restarts, Vector enters an infinite restart loop — same operational impact as the writer deadlock. Existing tests cannot reach it (no-op fsync in model FS). | + +**Open Questions:** + +- Does `OrderedAcknowledgements::add_marker` use wrap-aware ID comparison? If not, fresh-buffer ID 0 after a reset could falsely trigger the panic. +- Is the `reader.rs:932` `>` comparison intended to be wrap-aware? (Cross-ref `file-id-rollover-stays-coordinated`.) +- Can `validate_last_write`'s `Greater` branch leave `next_record_id` below what `record_acks` expects? + +### record-never-spans-files — Record Never Spans Two Data Files + +| | | +|---|---| +| **Type** | Safety | +| **Property** | Every record is fully contained within a single data file. | +| **Invariant** | `AlwaysOrUnreachable` in `RecordWriter::flush_record` (`writer.rs:~629`), asserting `current_data_file_size <= max_data_file_size` after update. The `can_write` gate (`writer.rs:433-437`) normally enforces this; the assertion catches a corrupted size-seed bypassing the gate. | +| **Antithesis Angle** | Corrupt the `metadata().len()` size-seed (filesystem fault between open and metadata); external append; `max_data_file_size` ≈ `max_record_size` (both default 128MB → zero margin). | +| **Why It Matters** | A spanning record is silently lost: the reader sees `PartialWrite`, rolls, and abandons it; `total_buffer_size` correction may be wrong, creating an ID gap. A write that returned `Ok` is never delivered. | + +**Open Questions:** + +- Is the `debug_assert(max_data_file_size >= max_record_size)` (`writer.rs:~396`) compiled out of release? If so, a `max_record_size > max_data_file_size` misconfig silently makes every write return `DataFileFull` → writer loops forever. +- Does a low `metadata().len()` under fault fail open (allows oversize write → span risk) — confirmed yes by the agent — worth an explicit fault test. 
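The two open questions for `record-never-spans-files` can be illustrated with a simplified model of a size gate. This is a sketch, not the actual `can_write`: `gate_admits` and `size_after_admitted_write` are hypothetical, and `MAX_DATA_FILE_SIZE` assumes the 128MB default means 128 binary megabytes.

```rust
/// Assumed value of the 128MB default, for illustration only.
const MAX_DATA_FILE_SIZE: u64 = 128 * 1024 * 1024;

/// Hypothetical size gate: admit the write only if the record still fits,
/// judged against the size the writer believes the file has.
fn gate_admits(tracked_size: u64, record_len: u64, max_size: u64) -> bool {
    tracked_size
        .checked_add(record_len)
        .map_or(false, |end| end <= max_size)
}

/// Actual on-disk size after a write the gate admitted, or `None` if refused.
/// `seeded_size` is what `metadata().len()` reported at open; `true_size` is
/// what is really on disk.
fn size_after_admitted_write(
    true_size: u64,
    seeded_size: u64,
    record_len: u64,
    max_size: u64,
) -> Option<u64> {
    if gate_admits(seeded_size, record_len, max_size) {
        Some(true_size.saturating_add(record_len))
    } else {
        None
    }
}
```

In this model a correct seed refuses a 16 MiB record on a 120 MiB file, a seed of 0 admits it and leaves the file at 136 MiB, and a record larger than the limit is refused even on an empty file, which is the shape of the "every write returns `DataFileFull`" concern.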
+ +--- + +### Silent data-loss cluster — checksum-skip (added 2026-05-29, data-loss expansion) + +These three properties sharpen the user's concern *"if the checksum fails we'll skip records."* The existing `corruption-is-detected-and-recovered` only checks the recovery path *executes* (`Sometimes`); these are the `Always` safety bounds on **how much** is lost and whether the loss is **observable**. Root mechanism: `roll_to_next_data_file` (reader.rs:711-759) abandons the entire remainder of a data file on the first bad read, accounting for only the records actually read. + +### corruption-skip-loss-bounded — Checksum-Skip Loss Bounded to the Unreadable Span + +| | | +|---|---| +| **Type** | Safety | +| **Property** | When a record fails CRC32C/`CheckBytes`/partial-write detection, only that record (plus any genuinely-unparseable contiguous tail) is lost — valid records that follow it in the same 128MB data file are still eventually delivered. | +| **Invariant** | `Always` (workload-level): every durably-written, valid-checksum record positioned after a corrupt record in the same file is in the delivered set. Currently VIOLATED — `roll_to_next_data_file` (reader.rs:711) unconditionally abandons the whole file tail. | +| **Antithesis Angle** | Bit-flip an *early* record's CRC-covered region in a multi-record file; drain; compare delivered IDs vs valid IDs. Vary corruption position + file fullness to measure loss magnitude. Needs corruption in a live read. | +| **Why It Matters** | The internal buffer design notes (internal doc id omitted) state that the loss window is 500ms of unsynced writes and that synced events are not lost with e2e acks; a corruption roll discards synced, valid, not-yet-acked records far outside that window, contradicting the at-least-once guarantee. A single bit-flip can abandon nearly a full 128MB data file. 
| + +**Open Questions:** + +- ~~Is the whole-file roll an accepted product tradeoff?~~ **RESOLVED (owner ruling, 2026-05-29): it is a BUG.** Any data loss not explicitly documented in detail is a bug; the reader.rs `roll_to_next_data_file` comment ("not sure the rest of the file is valid") is a hedge, not documentation. The violation is therefore a real defect to fix, not a tradeoff to accept. +- Can records be re-found after a corrupt one given the length-delimited framing? `(partial: a corrupt length delimiter can desync intra-file resync — supports the conservative roll; a CRC-valid record after a payload-corrupt one is in principle recoverable)` — note this informs *how* to fix (resync vs. abandon), not *whether* (it's a bug regardless). + +### corruption-skip-loss-is-counted — Checksum-Skip Loss Is Observable + +| | | +|---|---| +| **Type** | Safety | +| **Property** | Records abandoned by a corruption-triggered roll increment a loss metric (`component_discarded_events_total` and/or `buffer_discarded_events_total`) so operators can detect the loss. | +| **Invariant** | `Always`: after a corruption event abandoning N valid records, the discarded-events counter rises by N (equivalently `produced - delivered - counted_dropped == 0`). Currently VIOLATED — abandoned records hit **neither** counter; `track_dropped_events` (reader.rs:656) fires only for writer-side gap markers, not reader-side rolls, so the loss is charged only to `decrement_total_buffer_size`. Strictly more silent than #24606. | +| **Antithesis Angle** | Same early-record bit-flip; oracle scrapes the discarded counter and asserts it rose by the abandoned count after the roll. e2e acks give exact produced/delivered sets. | +| **Why It Matters** | Read-side companion to the HIGH-severity #24606 finding in an internal telemetry-correctness report (internal doc id omitted): "silent data loss going undetected because `component_discarded_events_total`…". 
Undetectable loss is the worst class — operators cannot alert on what isn't counted. | + +**Open Questions:** + +- Is the abandoned-record count even computed internally? `(partial: roll_to_next_data_file computes the count for READ records only; the abandoned count is never materialized)` +- Intentional vs error counter semantics for corruption loss? `(needs human input)` + +### corruption-skip-record-id-accounting-consistent — Skip Never Becomes Accounting Corruption + +| | | +|---|---| +| **Type** | Safety | +| **Property** | A corruption roll never converts bounded data loss into accounting corruption: record-ID and `total_buffer_size` accounting stay self-consistent across the abandoned span (no underflow, no monotonicity-guard trip). | +| **Invariant** | `Always` (SUT-side): after a roll, `next_writer_record_id - reader_last_record_id == on-disk unread records`, the rolled file's `total_buffer_size` decrement equals true remaining bytes (no reader.rs:524 underflow), and the monotonicity panic (`reader.rs:~480`) never trips. | +| **Antithesis Angle** | Mid-file corruption (non-empty abandoned tail) + continue across the file boundary and a crash+restart; watch the three underflow asserts already wired + the monotonicity guard. This is where the checksum-skip path and the organically-reproduced #21683 (run D0) meet. | +| **Why It Matters** | Identifies the corruption-roll abandoned-tail as a concrete real trigger for the reader.rs:524 underflow (#21683) and the monotonicity panic — not only external truncation. Links the data-loss surface to the deadlock/crash-loop clusters. | + +**Open Questions:** + +- Does any path advance `reader_last_record_id` over abandoned IDs, or is the gap permanent until the next file re-anchors? `(partial: roll does not advance it; cross-file re-anchor behavior needs a read trace)` +- Is reader.rs:524 underflow reachable purely via corruption-roll without external truncation? 
`(needs human input / Antithesis run with mid-file corruption)` + +--- + +## Category 2 — Buffer Accounting & Writer Liveness (the deadlock cluster) + +The single highest-value cluster. The in-memory `total_buffer_size` atomic uses +non-saturating (wrapping) `u64` subtraction; a crash/partial-write discrepancy wraps it toward +`2^64`, making `is_buffer_full()` permanently true and deadlocking the writer +(#21683). PR #23561 fixed only the metrics reporter, not the control path. + +### total-buffer-size-never-underflows — Accounting Atomic Never Wraps + +| | | +|---|---| +| **Type** | Safety | +| **Property** | `decrement_total_buffer_size` is never called with `amount > current total_buffer_size`; the atomic never wraps toward `u64::MAX`. | +| **Invariant** | `Unreachable` for "underflow occurred" (equivalently `Always(amount <= current)`), placed SUT-side at the two unguarded subtraction sites: `ledger.rs:~292` (`fetch_sub`, no saturation) and `reader.rs:~524` (`metadata.len() - bytes_read`). State is invisible to the workload → requires SUT-side instrumentation (missing). | +| **Antithesis Angle** | Node-kill at file-rotation/partial-write boundary; restart; reader seeks through the partial file; `update_buffer_size` (file-size seed) vs. `track_read` (record-byte decrement) mismatch triggers the wrap. | +| **Why It Matters** | Root cause of #21683 → permanent silent writer deadlock. PR #23561 masked only the gauge; the control-path atomic is still raw `fetch_sub`. | + +**Open Questions:** + +- Is the double-decrement via fast-forward + `track_read` during seek fully blocked by the `!self.ready_to_read` guard (`reader.rs:~468`), or can both fire for the same bytes? ``(partial: guard exists for the seek-time path; the delete-time `metadata.len() - bytes_read` path at reader.rs:524 is separate and unguarded)`` +- Does the overflow-checked subtraction in the `trace!` at `ledger.rs:295` panic in a debug build before the bug is observable with release semantics? 
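The `Always(amount <= current)` form of the invariant maps naturally onto a checked decrement. A minimal sketch, assuming a standalone counter: `checked_decrement` is a hypothetical helper, not an existing ledger method, and the `Err` branch marks where an SUT-side `Unreachable` assertion could sit.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical guarded replacement for a raw `fetch_sub`.
/// Returns `Ok(new_total)` when `amount <= current`. Otherwise the counter is
/// left untouched and `Err(current)` is returned instead of wrapping.
fn checked_decrement(total: &AtomicU64, amount: u64) -> Result<u64, u64> {
    total
        .fetch_update(Ordering::AcqRel, Ordering::Acquire, |current| {
            // Returning `None` aborts the update without modifying the value.
            current.checked_sub(amount)
        })
        .map(|previous| previous - amount)
}
```

Unlike a saturating subtraction, this keeps the underflow visible to the caller, so the accounting mismatch can be asserted on rather than silently clamped.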
+ +### writer-eventually-makes-progress — No Permanent Writer Deadlock + +| | | +|---|---| +| **Type** | Liveness | +| **Property** | A writer blocked because the buffer was full eventually completes another successful write after the reader acks+deletes a file; the permanent-deadlock state never persists. | +| **Invariant** | **Compound stall detector** (refined per evaluation W-M1/W-O3): the deadlock is *intermittent*, not permanent — `u64::MAX + unflushed_bytes` wraps back to a small value, so the writer escapes for exactly one write whenever `unflushed_bytes > 0`, then re-deadlocks. A naïve `Sometimes(any_wakeup)` or "throughput→0" check therefore false-greens. Use: `write_throughput ≈ 0` AND `sink/ack_throughput ≈ 0` AND `buffer ≥ ~90% full` AND `duration > drain-time bound` ⇒ `assert_unreachable!("persistent_deadlock")`. Both throughputs must be ~0 to distinguish a real deadlock from healthy `WhenFull::Block` backpressure. Keep a `Sometimes(writer_unblocked_after_full)` as the happy-path/baseline liveness milestone. | +| **Antithesis Angle** | Fill buffer; node-kill at rotation/partial-write boundary; restart; resume reader; `ANTITHESIS_STOP_FAULTS` quiet period; assert the writer makes progress. Workload-observable: write throughput resumes. | +| **Why It Matters** | User-visible manifestation of #21683: silent pipeline stall, dashboards look healthy (gauge reads 0 post-#23561), durability promise destroyed, no watchdog. | + +**Open Questions:** + +- Confirm the `Sometimes` is reachable on the happy path (any normal full→drain cycle) so a clean run establishes a baseline before fault testing. +- Appropriate quiet-period grace for recovery before asserting progress. +- Can the finalizer-task drain on shutdown itself starve the wakeup chain? 
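The compound stall detector described in the invariant can be written as a pure predicate over one observation window. A sketch only: `StallSample`, its field names, and the near-zero threshold are all assumptions, and `buffer_bytes` stands for whichever size signal the harness uses for "buffer ≥ ~90% full".

```rust
use std::time::Duration;

/// One observation window. All names and thresholds here are hypothetical.
struct StallSample {
    write_events_per_sec: f64,
    ack_events_per_sec: f64,
    buffer_bytes: u64,
    max_buffer_size: u64,
    stalled_for: Duration,
}

/// True only when every condition of the compound detector holds.
fn is_persistent_deadlock(sample: &StallSample, drain_time_bound: Duration) -> bool {
    // "Approximately zero" threshold; an assumption to tune per workload.
    const NEAR_ZERO_EVENTS_PER_SEC: f64 = 0.5;

    let writes_stopped = sample.write_events_per_sec < NEAR_ZERO_EVENTS_PER_SEC;
    let acks_stopped = sample.ack_events_per_sec < NEAR_ZERO_EVENTS_PER_SEC;
    // buffer >= 90% of max, computed in u128 so the multiplication cannot overflow.
    let nearly_full =
        u128::from(sample.buffer_bytes) * 10 >= u128::from(sample.max_buffer_size) * 9;
    let outlasted_drain = sample.stalled_for > drain_time_bound;

    writes_stopped && acks_stopped && nearly_full && outlasted_drain
}
```

Requiring both throughputs to be near zero is what separates this from healthy `WhenFull::Block` backpressure, where writes pause but acks keep flowing; the nonzero threshold tolerates the occasional single write the intermittent deadlock lets through.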
+ +### buffer-size-within-max — On-Disk Size Respects max_size + +| | | +|---|---| +| **Type** | Safety | +| **Property** | The on-disk buffer never exceeds configured `max_size` (modulo the documented per-record overshoot up to `max_record_size`); the writer blocks rather than over-committing. | +| **Invariant** | `Always` checking `actual_on_disk_bytes <= max_buffer_size + max_record_size`. Best observed by a watchdog summing `.dat` file sizes (not the gauge, which is masked by `saturating_sub`). **Compound-only (refined per evaluation R-C/W-M5): never report this in isolation** — under the underflow/intermittent deadlock the bound holds *vacuously* (no writes ⇒ no overflow). Evaluate jointly with `writer-eventually-makes-progress`; a passing bound is only meaningful while the writer is demonstrably still writing. | +| **Antithesis Angle** | Fill the buffer under faults; verify the bound. **Must be evaluated jointly with `writer-eventually-makes-progress`**: the underflow deadlock makes this bound *vacuously* hold (no writes → no overflow), a false-negative if read alone. | +| **Why It Matters** | The bound itself is rarely violated; the value is detecting the vacuity (passing bound + stalled writer = the deadlock). Secondary: config-reload per-process lock gap; foreign `.dat` files inflating the total. | + +**Open Questions:** + +- Is the config-reload two-writers-one-dir race exercised by the harness? +- Does the watchdog read file sizes (correct) or the masked gauge? +- Is the per-record overshoot corner (record pushing a `.dat` past 128MB) worth explicit coverage? + +--- + +## Category 3 — Crash Durability & Recovery + +What the product sells: durability across crashes. These verify synced data +survives, at-least-once holds end-to-end, and recovery completes without hang, +garbage, or wrong-ID fast-forward. All require node-termination faults. 
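The workload-level checks in this category share one oracle shape: a set of IDs the workload considers durable must be a subset of what is readable after restart. A minimal sketch of that comparison, assuming events carry unique `u64` IDs; `missing_after_restart` is a hypothetical helper name.

```rust
use std::collections::BTreeSet;

/// Returns the IDs recorded as durably written that did not reappear after
/// restart. Duplicates and extra IDs in `reread` are tolerated, since
/// at-least-once delivery permits them. An empty result means the check passes.
fn missing_after_restart(durable: &BTreeSet<u64>, reread: &[u64]) -> Vec<u64> {
    let seen: BTreeSet<u64> = reread.iter().copied().collect();
    durable.difference(&seen).copied().collect()
}
```

Returning the missing IDs rather than a boolean gives the assertion details something concrete to report when the subset check fails.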
+ +### durable-unacked-events-survive-crash — Synced, Unacked Events Survive Crash + +| | | +|---|---| +| **Type** | Safety | +| **Property** | Every event durably synced (fsync'd) and not yet acknowledged is still readable after an ungraceful crash+restart; none is skipped/lost by recovery. Loss is bounded to the ≤500ms unsynced window. | +| **Invariant** | `Always` (workload-level): the set of events the workload established as durably-written is a subset of the events re-readable after restart. | +| **Antithesis Angle** | SIGKILL at arbitrary points; restart; quiet-period drain; compare delivered vs. durably-written. Both `validate_last_write` branches (`Less` fast-forward `writer.rs:~922`, `Greater` skip `writer.rs:~910`) are exercised. | +| **Why It Matters** | The core durability guarantee. A skip during recovery = silent loss of data the customer entrusted to disk. | + +**Open Questions:** + +- "Durably written" oracle **decided** (per evaluation R-F/W-F2): use a **wall-clock timestamp** — an event produced more than `2×flush_interval` ago is past the fsync window — and run with `flush_interval=0` so every `flush()` is a `sync_all`. Do NOT use e2e-ack *delivery* as the durability marker: it conflates delivery with fsync and is suppressed by the deadlock (→ vacuous pass). +- `BufferWriter::Drop` (`writer.rs:1371-1374`) does NOT flush — is the "graceful shutdown lossless" claim conditional on the topology calling flush explicitly? (Cross-ref `graceful-shutdown-flushes-all`.) + +### every-written-event-eventually-delivered — End-to-End At-Least-Once + +| | | +|---|---| +| **Type** | Liveness | +| **Property** | With e2e acks enabled, every event accepted by the source is eventually delivered downstream at least once across crashes (duplicates allowed). | +| **Invariant** | (Refined per evaluation W-O2: `Sometimes(all_produced)` is wrong for at-least-once — it passes on one good timeline, hiding loss on others.) 
Use **per-event `Always(produced ⊆ delivered)`** checked after each quiet-period drain (every produced ID appears ≥1 in the delivered multiset), plus a `Sometimes(delivery_path_reachable)` exploration hint. Workload tracks a `PRODUCED` set and a `DELIVERED` multiset (dedups duplicates). | +| **Antithesis Angle** | Faults injected throughout; quiet period; drain. Surfaces three silent-loss paths: unlink-before-ledger-flush window (`reader.rs:546-549`), `_status` discard (`ledger.rs:704`), in-flight finalizer tasks not draining on SIGKILL. | +| **Why It Matters** | The end-to-end at-least-once contract Datadog sells for mission-critical pipelines. | + +**Open Questions:** + +- Does the workload need a source that supports e2e acks, or can it observe delivery directly at a mock sink? Affects topology. +- Duplicates are expected — confirm the workload dedups and only asserts ≥1 (not exactly-once). + +### recovery-completes-after-crash — Initialization Completes After Crash + +| | | +|---|---| +| **Type** | Liveness | +| **Property** | `Buffer::from_config_inner` (load_or_create → validate_last_write → seek_to_next_record → synchronize_buffer_usage) completes after a kill at any point; it does not hang or fail to init. | +| **Invariant** | `Sometimes(buffer_reinitialized)` after `from_config_inner` returns `Ok` (`mod.rs:251-270`). | +| **Antithesis Angle** | Kill during write/rotation/flush; restart; assert init completes within bounded time (quiet period). | +| **Why It Matters** | If init hangs/fails, the pipeline never starts after a crash — total outage. | + +**Open Questions:** + +- Init can return `Ok` while the writer is *immediately* deadlocked by the underflow (init completion ≠ runtime liveness) — should this property assert post-init progress too, or leave that to `writer-eventually-makes-progress`? 
`(partial: agent recommends keeping them separate; init-completes is necessary but not sufficient)` +- L6 init-stall edge: writer must open a next file that doesn't exist yet if killed between `increment_writer_file_id` and file creation — is this reachable? +- Advisory lock not released on some filesystems (NFS) → init permanently fails. Out of scope for local FS? + +### partial-write-at-rotation-recovers — Torn-Tail / Rotation Crash Recovers + +| | | +|---|---| +| **Type** | Safety + Liveness | +| **Property** | A crash leaving a torn/partial last record, an empty just-created next file, or a ledger/data divergence at the rotation boundary recovers without deadlock and without returning garbage or fast-forwarding to a wrong record ID. | +| **Invariant** | `Sometimes(torn_tail_recovered)` to confirm the path is exercised, plus `Always(no_garbage_delivered)` (covered by `no-corrupted-record-delivered`) and no wrong-ID fast-forward (cross-ref `record-id-monotonicity-holds`). | +| **Antithesis Angle** | Kill precisely during rotation / within the fsync window; small `max_data_file_size` (e.g. 1MB) maximizes rotation frequency. Exercises the F5 torn-tail risk (`archived_root` reads root offset from the buffer's end → garbage offset may pass `CheckBytes`) and the `validate_last_write` `Greater`/`Less` branches. | +| **Why It Matters** | The most credible path to silent skip of synced records (false `Greater`), wrong ledger fast-forward (false `Less`), or the monotonicity panic. The model test cannot reach it (no-op fsync). | + +**Open Questions:** + +- Does a torn tail produce `RecordStatus::Valid{wrong_id}` (CRC is then the only backstop) or `FailedDeserialization` directly? Determines whether CRC32C reliably backstops torn tails. +- Is the empty-just-created-next-file path (`writer.rs:~1089`, `file_len == 0`) exercised by tests? + +--- + +## Category 4 — Space Reclamation & Clean Termination + +Liveness of the read/ack/delete chain. 
Progress depends on sink acks + finalizer +task alive + reader polled + delete I/O succeeding. + +### acked-files-eventually-deleted — Fully-Acked Files Deleted, Space Reclaimed + +| | | +|---|---| +| **Type** | Liveness | +| **Property** | Once all records in a data file are acknowledged, the file is eventually unlinked and its bytes subtracted from `total_buffer_size`, even without new writes (`force_check_pending_data_files`). | +| **Invariant** | `Sometimes(data_file_deleted)`: after a full file is acked and a quiet period elapses, the `.dat` is gone and the byte count dropped. | +| **Antithesis Angle** | Write+ack a full file under faults; SIGKILL between finalizer-fire and `delete_file`; `EIO` on delete; **kill the spawned finalizer task** (→ acks never processed → no deletion → eventual writer deadlock). | +| **Why It Matters** | File deletion is the prerequisite for the writer unblocking. No deletion = `total_buffer_size` never reaches 0 = silent stall. | + +**Open Questions:** + +- Does the tokio runtime drain the finalizer task before shutdown timeout? +- What `bytes_read` is passed after a bad-record roll — can it exceed `metadata.len()` and trigger the underflow? (Cross-ref `total-buffer-size-never-underflows`.) +- Is `force_check_pending_data_files` triggered when the reader is parked and not rolling? + +### reader-drains-and-terminates-cleanly — Clean Reader Termination + +| | | +|---|---| +| **Type** | Liveness | +| **Property** | When the writer is done and the buffer is fully drained+acked, `reader.next()` returns `Ok(None)` within finite time — no hang, no premature `None` that drops undelivered events. | +| **Invariant** | `Sometimes(reader_returned_none_clean)`. Termination uses `is_writer_done() && total_buffer_size == 0` (`reader.rs:980-985`). | +| **Antithesis Angle** | Stop writes; deliver+ack all; quiet period; force the interleaving where the finalizer fires while the reader is awake (permit consumed) vs. parked. 
This is exactly the disabled flaky `#23456` path. | +| **Why It Matters** | A hang blocks graceful shutdown (operator must SIGKILL → 500ms loss + sets up #21683 on restart). A premature `None` silently truncates the stream. The underflow also breaks this (`total_buffer_size == 0` never true). | + +**Open Questions:** + +- Exact root cause of `#23456` flakiness — permit-already-consumed missed wakeup, or something else? Antithesis can answer definitively. +- Is `writer.close()` guaranteed before the reader is asked to terminate in topology shutdown? +- Does the gap-marker `events_skipped` path interact correctly with `total_buffer_size` when skipped bytes are the last outstanding? + +--- + +## Category 5 — Delivery Semantics & Boundary Conditions + +Properties that expose known/likely-present bugs (sink-error acks, drop-newest +metric blindness) and boundary arithmetic (file-ID and record-ID rollover). + +### sink-failure-not-silently-acked — Errored/Rejected Deliveries Not Silently Dropped + +| | | +|---|---| +| **Type** | Safety | +| **Property** | An event whose downstream delivery status is `Errored`/`Rejected` is not silently treated as acknowledged and removed from the buffer. | +| **Invariant** | `Always`: a non-`Delivered` batch status does not advance `reader_last_record`/free the record without retry. **Currently VIOLATED**: the finalizer discards `_status` (`ledger.rs:704`); at-least-once is only restored by a full crash+replay. | +| **Antithesis Angle** | Make the downstream sink Error/Reject under faults; assert events are retained/retried, not dropped. | +| **Why It Matters** | Within a process lifetime, sink errors cause permanent silent loss of data the buffer claims to durably hold. | + +**Open Questions:** + +- Do sinks actually emit `Errored`/`Rejected` status under normal operation, or only under fault injection? Determines whether this is reachable without faults. 
`(partial: finalizer discard confirmed at ledger.rs:704; whether sinks emit non-Delivered status in practice not yet traced)` +- Is the discard intentional (retry assumed at the source layer) or a genuine bug? `(needs human input)` — design-owner question. +- **Priority note (evaluation R-H):** this is a deterministic logic bug (discarded status), arguably better caught by an integration test than Antithesis search. Keep as a **workload-side secondary check with no dedicated fault-search budget**; don't shape the fault strategy around it. See Bias B1 in `evaluation/synthesis.md`. + +### dropped-events-are-counted — drop_newest Drops Are Component-Visible + +| | | +|---|---| +| **Type** | Safety | +| **Property** | When `when_full=drop_newest` drops an event, it is accounted at the component level (`component_discarded_events_total`), not only `buffer_discarded_events_total`. | +| **Invariant** | `Always`: component-visible discard count matches actual buffer drops. **Currently VIOLATED** (#24606/#24144): `BufferEventsDropped::emit` (`internal_events.rs:177-243`) never calls `ComponentEventsDropped`. | +| **Antithesis Angle** | Configure `drop_newest`; overfill under backpressure faults; assert the component-visible count matches drops (workload-observable). | +| **Why It Matters** | Operators monitor `component_discarded_events_total` for data loss; silent absence means drops go undetected (internal config-reload incident-adjacent). | + +**Open Questions:** + +- Is there any partial fix on a branch not at this commit? `(partial: not present at this commit; grep found no ComponentEventsDropped call in vector-buffers)` +- Is `drop_newest` actually reachable for the disk-buffer variant? `(partial: implementability agent confirmed the disk try_write_record returns the item when full, so drop_newest fires — reachable)` +- **Priority note (evaluation R-H):** deterministic missing-emit bug; better caught by an integration test.
Keep as a workload-side metric check with no dedicated fault-search budget. See Bias B1. + +### file-id-rollover-stays-coordinated — u16 File-ID Rollover Stays Coordinated + +| | | +|---|---| +| **Type** | Safety | +| **Property** | Across u16 file-ID rollover (`MAX_FILE_ID`; 6 in tests, 65535 in prod), reader and writer stay coordinated; the seek `reader_file_id > writer_file_id` comparison does not misclassify sync state. | +| **Invariant** | `Always`: the seek sync-gate decision is correct across rollover. **Latent bug**: `reader.rs:~932` uses a raw, non-wrap-aware `u16 >`. | +| **Antithesis Angle** | Slow reader + fast writer to force rollover; kill+restart at the boundary; assert no deadlock/regression. Reachable in tests due to `MAX_FILE_ID=6`. | +| **Why It Matters** | A misclassified sync gate can deadlock the reader (opening a wrapped low-ID file while waiting on a high-ID writer) or regress position. | + +**Open Questions:** + +- Is the `>` intended to be wrap-aware? Confirm against author intent. (Shared with `record-id-monotonicity-holds`.) +- What is the exact incorrect behavior at the boundary — deadlock vs. silent skip? +- **Build requirement (refined per evaluation R-E):** production `MAX_FILE_ID=65535` needs ~8TB of writes to roll; `MAX_FILE_ID=6` exists only in `#[cfg(test)]`. To exercise this within a timeline, run a **test-build** of Vector or add a runtime-configurable `MAX_FILE_ID` knob; otherwise descope. `(partial: confirmed MAX_FILE_ID is cfg-gated at common.rs:43-45)` +- Does the `unacked_reader_file_id_offset` indirection (`get_current_reader_file_id`, `ledger.rs:305-308`) make the raw `>` more correct than it looks? Needs a deeper trace before treating the bug as definitely triggerable.
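
The rollover concern in `file-id-rollover-stays-coordinated` can be illustrated in isolation. The sketch below is hypothetical and is not the code at `reader.rs:~932`: it models file IDs as values modulo a `MAX_FILE_ID` of 6 (the test-build value cited above) and contrasts a raw `>` with a wrap-aware distance check. The function names and the "shorter way around the ring" rule are assumptions made for illustration; the real buffer may allow the reader to trail by more than half the ring, in which case this rule would not apply as written.

```rust
/// Ring size for file IDs; 6 is the test-build value cited in these notes.
const MAX_FILE_ID: u16 = 6;

/// Number of forward steps from `from` to `to` on the file-ID ring.
/// Both inputs are assumed to already be reduced modulo `MAX_FILE_ID`.
fn forward_distance(from: u16, to: u16) -> u16 {
    let m = u32::from(MAX_FILE_ID);
    // Widen to u32 so the sum cannot overflow for large ring sizes.
    ((u32::from(to) + m - u32::from(from)) % m) as u16
}

/// Raw comparison: treats IDs as plain integers and ignores wraparound.
fn reader_is_ahead_raw(reader: u16, writer: u16) -> bool {
    reader > writer
}

/// Wrap-aware comparison (assumed convention): the reader counts as "ahead"
/// only when it is reachable from the writer by the shorter way around the
/// ring. Equal IDs and exact ties are treated as "not ahead".
fn reader_is_ahead_wrap_aware(reader: u16, writer: u16) -> bool {
    let ahead = forward_distance(writer, reader);
    let behind = forward_distance(reader, writer);
    ahead != 0 && ahead < behind
}
```

With the reader on file 5 and the writer wrapped around to file 1, `reader_is_ahead_raw(5, 1)` reports the reader as ahead, while `reader_is_ahead_wrap_aware(5, 1)` does not, because the writer is only two steps in front of the reader on the ring.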
+ +### record-id-wraparound-accounting-holds — u64 Record-ID Accounting Holds + +| | | +|---|---| +| **Type** | Safety | +| **Property** | At the empty-buffer equality case, event-count accounting stays correct; `get_total_records` never produces a ~2^64 phantom count. (Refocused per evaluation R-D: the *true* u64 record-ID wrap requires ~2^64 writes and is unreachable on a production binary — explicitly descoped; the reachable, real bug is the empty-buffer case below.) | +| **Invariant** | `Always`: `get_total_records()` (`ledger.rs:266`) returns a sane count. **Bug**: the outer `- 1` is a plain (non-wrapping) subtraction; when `next == last` (drained buffer), `wrapping_sub` → 0 then `0 - 1` → `u64::MAX`, poisoning `synchronize_buffer_usage` on every clean restart of a drained buffer. Workload-observable — no SUT instrumentation needed. | +| **Antithesis Angle** | Drain the buffer completely; restart Vector; scrape `buffer_size_bytes`/`buffer_size_events` and assert near 0 (not ~2^64). No node-kill required (clean restart suffices), so this is reachable even without crash faults. | +| **Why It Matters** | Poisons buffer metrics (debug: panic; release: silent 2^64), undermining all buffer-occupancy observability. | + +**Open Questions:** + +- The true u64 record-ID wrap is unreachable on a production binary (test-only `unsafe_set_*` helpers are `#[cfg(test)]`-gated) — descoped; only the empty-buffer case is in scope. +- Does the debug-build `synchronize_buffer_usage` path panic on the `0 - 1` before release semantics are observable (as with the `total-buffer-size` `trace!`)? Use a release build to observe the ~2^64 gauge. `(partial: empty-buffer reachability confirmed — fires on every clean restart of a drained buffer; debug-vs-release panic behavior not yet confirmed for this specific site)` + +--- + +## Category 6 — Lifecycle / Config Reload + +Operations that span lifecycle transitions rather than steady state. 
Config +reload is directly implicated in the internal config-reload incident. + +### config-reload-no-silent-loss — Config Reload Doesn't Silently Drop Buffered Events + +| | | +|---|---| +| **Type** | Safety | +| **Property** | Reloading Vector config (drop + recreate the disk-buffer writer/sink) does not silently drop events already accepted into the buffer. | +| **Invariant** | `Always`: `accepted == forwarded + explicitly_discarded` across a reload. **At risk**: `BufferWriter::Drop` calls `close()` but NOT `flush()` (`writer.rs:1366-1374`) → up to 256KB buffered-but-unflushed events discarded; `track_dropped_events` charges `byte_size=0` → accounting drift. | +| **Antithesis Angle** | Custom fault: SIGHUP/config-reload under sustained write load with a busy reader; quiet period + drain; assert no accepted event lost. Also exercises the per-process advisory-lock gap (old+new topology opening the same buffer). | +| **Why It Matters** | Directly tied to the internal config-reload incident (#24948 / PR #24949). | + +**Open Questions:** + +- Does PR #24949 fix the *loss* or only the *stall*/liveness? `(partial: PR addressed the stall and a detach-trigger; whether the Drop-without-flush loss is fixed not confirmed at this commit)` +- Is the old/new topology overlap actually concurrent (making the lock gap a live safety issue)? +- Reload requires a custom fault — is SIGHUP-driven reload feasible in the harness, or must the workload drive it via the API? + +### graceful-shutdown-flushes-all — Graceful Shutdown Is Lossless + +| | | +|---|---| +| **Type** | Liveness | +| **Property** | On graceful shutdown, all buffered data is flushed/synced before exit (the doc's "no data loss on graceful shutdown"). | +| **Invariant** | `Sometimes(graceful_shutdown_lossless)`: after a graceful stop + restart + drain, all pre-shutdown events are present. Candidate SUT-side assertion: `Always(unflushed_bytes == 0)` inside `close()`. 
| +| **Antithesis Angle** | Graceful stop (SIGTERM, not kill) under load; restart; drain; assert zero loss. Contrast with the ungraceful-crash 500ms-window property. | +| **Why It Matters** | The product distinguishes graceful shutdown (lossless) from crash (≤500ms loss). `Drop` cannot call async `flush()`, so the guarantee depends on the topology flushing before drop — unverified. | + +**Open Questions:** + +- Does Vector topology shutdown call `writer.flush()` before dropping the writer? `(partial: RunningTopology::stop drops inputs containing BufferSender; whether the write loop's final flush completes first is a race — needs tracing of the shutdown ordering)` +- Does graceful SIGTERM rely on OS page-cache flush-on-exit (not a Vector guarantee) to cover the gap? + +--- + +## Category 7 — Cross-Cutting & Operational Gaps (from evaluation gap-filling) + +Properties the focus-based discovery missed: non-crash paths to the #21683 stall, +clock-fault and overflow-mode coverage, version-upgrade format safety, the +finalizer single-point-of-failure, and lock-contention throughput collapse. + +### foreign-data-file-no-writer-stall — Foreign `.dat` File Does Not Permanently Stall the Writer + +| | | +|---|---| +| **Type** | Safety | +| **Property** | A stray/leftover/operator-placed `.dat` file in the buffer dir inflates startup `total_buffer_size` but is never read or decremented; the writer must still eventually make progress once real content is below `max_size`. | +| **Invariant** | `Always(writer_makes_progress_after_drain)`: a foreign `.dat` does not hold `is_buffer_full()` permanently true. `update_buffer_size` (`ledger.rs:~681`) sums ANY `*.dat`, not just `buffer-data-{id}.dat`. | +| **Antithesis Angle** | Workload/custom-fault places a large `foreign.dat`; restart; quiet-period; assert writes resume. **No node-kill needed** — pure operator-error path to the #21683 symptom (distinct root cause: wrong scan scope, not arithmetic).
| +| **Why It Matters** | Permanent silent stall with no crash; gauge masked by `saturating_sub`. Non-crash reachability makes it testable even if node-kill faults are disabled. | + +**Open Questions:** + +- Fix direction: filter by `buffer-data-{N}.dat` prefix, or assert-and-reject unknown `.dat`? +- Note: a foreign file also inflates the `.dat`-summing watchdog used by `buffer-size-within-max` — risk of a false-fail there; the watchdog must filter to self-owned files. + +### ledger-corruption-no-sigbus-crashloop — Ledger Corruption Is a Clean Error, Not SIGBUS/Crash-Loop + +| | | +|---|---| +| **Type** | Safety | +| **Property** | External truncation/corruption of the mmap'd `buffer.db` yields a clean `LedgerLoadCreateError` at init, not a SIGBUS mid-operation or an infinite crash loop. | +| **Invariant** | `AlwaysOrUnreachable`: corruption is caught at load via rkyv `CheckBytes` (`backed_archive.rs:~73`); SIGBUS mid-operation (no handler exists, `io.rs`) and crash-loop are `Unreachable`. | +| **Antithesis Angle** | Filesystem fault: truncate/corrupt `buffer.db` while stopped or while mapped; assert clean restart or clean error, never exit-138 crash-loop. **Requires filesystem-fault capability — flag to user** (may not be available). | +| **Why It Matters** | `buffer.db` is `mmap`'d; live truncation SIGBUSes on the next field access. The init-time `CheckBytes` only guards the load, not live truncation. An all-zeros `LedgerState` may pass `CheckBytes` (silent reset). | + +**Open Questions:** + +- Does the all-zeros `LedgerState::default()` layout pass `CheckBytes` (→ truncation-to-zero is a silent reset, not a detected error)? +- Should `load_or_create` validate `buffer.db` is exactly `LEDGER_LEN` bytes before mmap? +- Is filesystem-fault injection available in the tenant? 
`(needs human input)` + +### finalizer-task-drains-pending-acks — Finalizer Task Drains All In-Flight Acks + +| | | +|---|---| +| **Type** | Liveness | +| **Property** | All in-flight `BatchNotifier` acks are eventually processed (steady state and on graceful shutdown) and not permanently stranded by a dead/abandoned finalizer task. | +| **Invariant** | `Sometimes(all_acks_drained)` after a quiet period; pair with `Unreachable(finalizer_died_with_acks_pending)`. The `spawn_finalizer` task (`ledger.rs:701-710`) is **unmonitored/detached** (discarded `JoinHandle`); a panic logs "FinalizerSet task ended prematurely" but the reader continues silently → no ack processing → no deletion → eventual stall, distinct from the arithmetic deadlock. | +| **Antithesis Angle** | SIGKILL with acks in flight (assert replay, not silent loss); inject a finalizer-task panic (assert no indefinite hang); SIGTERM (assert `pending_acks==0` before exit). | +| **Why It Matters** | A dead finalizer is a silent-loss/stall path that no other property isolates. | + +**Open Questions:** + +- Does the tokio runtime drain spawned tasks before exit? `(partial: tokio Runtime::drop has a ~2s task-completion window; no explicit join of the finalizer)` +- Can the `FuturesOrdered` queue grow unbounded under a sink that never acks? + +### fsync-window-bounded-under-clock-jitter — fsync Window Bounded Under Clock Jitter + +| | | +|---|---| +| **Type** | Safety | +| **Property** | Under clock-jitter faults, the durable-loss window stays bounded; a slowed clock cannot suppress `sync_all` indefinitely beyond the next file rotation. | +| **Invariant** | `Always`: time since the last `sync_all`+`ledger.flush()` pair (real wall time) stays within a bounded multiple of `flush_interval`, OR data is durable since the last rotation. `should_flush` (`ledger.rs:485-497`) gates fsync on `Instant::elapsed()`; only rotation (`force_full_flush`) is clock-independent. 
| +| **Antithesis Angle** | Enable clock jitter; crash; assert loss bounded by the last rotation. Also covers the CAS-winner-descheduled extension. Mitigation/oracle: `flush_interval=0` removes the clock dependence (used by the durability properties). **Requires clock-fault capability — flag to user.** | +| **Why It Matters** | The product claims a ≤500ms loss window; clock jitter can silently break it with no error/log. At low write rates rotation may never fire → unbounded window. | + +**Open Questions:** + +- Does Antithesis virtual-time affect `std::time::Instant::elapsed()` on the target runtime (and `crossbeam AtomicCell`)? +- What is the max loss window when write volume is too low to trigger rotation? + +### overflow-chain-no-unaccounted-gap — Overflow Chain Crash Leaves No Silent Middle Gap + +| | | +|---|---| +| **Type** | Safety | +| **Property** | With `WhenFull::Overflow` (disk base → in-memory overflow), a crash during an overflow-active period does not create a silent, unaccounted middle-of-stream gap (later in-memory events lost while earlier disk events survive), and ordering is honored or documented. | +| **Invariant** | `Always`: every event accepted into the durable disk base that survives the crash is delivered; the unbiased `select!` (`receiver.rs:~133`) does not let a later overflow event replace an earlier disk event. `was_dropped=true` at `sender.rs:~238` (overflow dispatch) must not misclassify dispatched-to-overflow as permanently lost. | +| **Antithesis Angle** | Second topology config (disk + in-memory overflow); fill to overflow; crash; drain; assert no silent middle gap. | +| **Why It Matters** | The entire overflow mode is untested; the crash asymmetry produces a stream gap (not just duplicates) that dedup-based at-least-once reasoning can't handle. | + +**Open Questions:** + +- Does overflow chain to a distinct in-memory buffer with its own capacity? Confirm `BufferSender::with_overflow`. 
+- Is the unbiased `select!` ordering intentional/documented? +- Can overflow-and-drain cycles confuse the base `total_buffer_size` (secondary underflow trigger)? + +### buffer-survives-version-upgrade — Buffer Files Survive Upgrade or Fail Cleanly + +| | | +|---|---| +| **Type** | Safety + Liveness | +| **Property** | Buffer files written by version N are read back correctly by version N (ideally N+1); a format/layout change (rkyv) or the `DiskBufferV1CompatibilityMode` flag is handled as a clean detected error, never silent garbage. | +| **Invariant** | `Sometimes(upgrade_readback_ok)` for same-version baseline; `Always(DeserializeError)` (never `Valid{garbage}`) under a simulated layout change; `AlwaysOrUnreachable` that a current-binary record (always carries the V1-compat flag) is accepted by `can_decode` (`vector-core/event/ser.rs:~86-91`). | +| **Antithesis Angle** | Write with binary version N; simulate format change (modify `.dat` bytes or swap binaries via custom fault); restart; assert clean error or correct readback, never garbage/monotonicity-panic. **Custom fault (binary swap) needed.** | +| **Why It Matters** | No runtime mechanism detects an rkyv layout mismatch (CheckBytes validates types not version; CRC matches the new layout). A layout-changed record could pass both checks and deliver garbage. The compat-flag is a forward-compat foot-gun. | + +**Open Questions:** + +- Is the mmap'd `LedgerState` versioned? An upgrade changing it reads wrong offsets — a second upgrade risk not covered by record-level `CheckBytes`. +- Split into two slugs (rkyv layout vs. compat-flag)? +- Can a layout-changed small payload pass both `CheckBytes` and CRC32C → `Valid{wrong_id}` → monotonicity panic? 
+ +### throughput-progresses-under-contention — Throughput Stays Above Floor Under Contention + +| | | +|---|---| +| **Type** | Liveness | +| **Property** | With N≥4 parallel sources sharing the single `Arc<Mutex<_>>` and CPU throttle active, write throughput stays above a configurable floor — distinguishing "degenerate-but-alive" (lock starvation) from healthy and from deadlocked. | +| **Invariant** | `Sometimes(throughput_above_floor)` over an observation window. Catches a regression to near-zero progress that the permanent-deadlock property (`is_buffer_full` forever true) is blind to. | +| **Antithesis Angle** | N≥4 parallel senders → one disk sink; CPU throttle; assert `Sometimes(throughput > floor)` in a quiet window. Triangulate with `writer-eventually-makes-progress`: progress fires but throughput-floor fails → degenerate; neither → deadlocked. | +| **Why It Matters** | The ~90 MiB/s contention ceiling (GA doc) can collapse to near-zero under throttle without deadlocking — a regression CI can't catch and deadlock-detection misses. | + +**Open Questions:** + +- Appropriate floor (calibrate vs. an unthrottled single-sender baseline, e.g. 0.1%)? +- Borders on perf-testing — frame value as catching *near-zero* progress, not micro-benchmarking. (See Bias B1.) +- Does `tokio::sync::Mutex` FIFO fairness amplify throttle-induced starvation? + +--- + +## File-Level Open Questions (catalog-wide) + +- **Node-termination (kill/restart) faults + persistent buffer storage: CONFIRMED ENABLED by the user (2026-05-28).** The crash-recovery cluster is testable. (Resolved.) +- Config-reload (Category 6) and sink-error injection (`sink-failure-not-silently-acked`) need **custom faults** — feasibility depends on the harness/workload design. +- **`(needs human input)` — surfaced from individual properties (require a design owner / tenant operator):** + - Is the finalizer's `BatchStatus` discard (`ledger.rs:704`) intentional (retry assumed at the source layer) or a genuine bug?
— `sink-failure-not-silently-acked`. + - Is **filesystem-fault injection** (truncating/corrupting `buffer.db` and `.dat` files) available in the tenant? Gates `ledger-corruption-no-sigbus-crashloop` and the filesystem-tamper angle of several Category-7 properties. +- Does Vector's topology call `writer.flush()` on graceful shutdown, and does the tokio runtime drain the finalizer task before exit? Both affect multiple liveness/durability properties. `(partial: tokio Runtime::drop has a ~2s task window; the running.rs stop() drop-order vs. final flush is unresolved and needs code tracing)` +- **RESOLVED — "Durably written" oracle:** use a wall-clock timestamp (event produced > 2×`flush_interval` ago) and run with `flush_interval=0` so every flush is an fsync. Do NOT use e2e-ack delivery as the durability marker (conflates delivery with fsync; suppressed by the deadlock). Reused across Category 3 (refinement R-F). +- **Build profile:** run a **release build** for underflow/wrap-observing properties (debug `trace!`/arithmetic panics first); run a **test build** (or add a runtime `MAX_FILE_ID` knob) for `file-id-rollover-stays-coordinated`. 
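
The "durably written" oracle resolved in the list above can be sketched as a small workload-side predicate. This is an illustrative sketch only: the function name `expected_durable` and the choice to represent timestamps as `Duration` offsets from workload start are assumptions, not part of the harness. It encodes the stated rule that an event counts as durable when it was produced more than 2×`flush_interval` before the crash.

```rust
use std::time::Duration;

/// Hypothetical workload-side oracle. `produced_at` and `crash_at` are
/// wall-clock offsets from workload start (an assumed representation).
/// Returns true when the event is old enough that it should have been
/// fsync'd before the crash, per the "> 2 x flush_interval" rule.
fn expected_durable(
    produced_at: Duration,
    crash_at: Duration,
    flush_interval: Duration,
) -> bool {
    match crash_at.checked_sub(produced_at) {
        // Strictly older than two flush intervals at crash time.
        Some(age) => age > flush_interval * 2,
        // Produced after the crash timestamp: never expected durable.
        None => false,
    }
}
```

With `flush_interval` set to zero, as the notes recommend, the predicate reduces to "produced strictly before the crash", which is what makes every acknowledged flush a usable durability marker.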
+ diff --git a/tests/antithesis/scratchbook/property-relationships.md b/tests/antithesis/scratchbook/property-relationships.md new file mode 100644 index 0000000000000..da86e385fb65e --- /dev/null +++ b/tests/antithesis/scratchbook/property-relationships.md @@ -0,0 +1,195 @@ +--- +sut_path: /home/ssm-user/src/vector +commit: 4ff41a0adb5240d071f30a5a43cb0d065e40f618 +updated: 2026-05-29 +external_references: + - path: lib/vector-buffers/src/variants/disk_v2/mod.rs + why: Module-level doc comment is the authoritative design spec + - path: rfcs/2021-10-14-9477-buffer-improvements.md + why: Original buffer-rework RFC + - path: docs/specs/buffer.md + why: Buffer component spec + - path: (internal design doc, not linked) + why: fsync window, ack flow, at-least-once semantics + - path: (internal design doc, not linked) + why: Root-cause writeups of #21683, #24948, #24606 + - path: (internal design doc, not linked) + why: Existing chaos test + lock-contention issue + - path: GitHub issues #21683 #24948 #24606 #24144 #23995 #17666 #23456; PRs #23561 #24949 + why: Bug/regression context +--- + +# Property Relationships + +Clusters of related properties, suspected dominance, and shared +faults/code-paths. Lightweight — connections noticed during synthesis. + +## Cluster A — The `total_buffer_size` underflow / writer deadlock (the master cluster) + +The single unguarded-subtraction bug radiates into many properties. + +- **Root:** `total-buffer-size-never-underflows` (the SUT-side invariant at the two + raw subtraction sites: `ledger.rs:292`, `reader.rs:524`). +- **Direct manifestation:** `writer-eventually-makes-progress` (the deadlock the + underflow causes). +- **Dominated / aliased symptoms** (likely the *same* failure observed elsewhere): + - `reader-drains-and-terminates-cleanly` — `total_buffer_size == 0` is the + termination condition; if it wrapped, the reader also never terminates. 
+ - `buffer-size-within-max` — the deadlock makes this **vacuously pass** (no + writes → no overflow); must be read jointly with `writer-eventually-makes-progress`. + - `acked-files-eventually-deleted` — the delete-time `metadata.len() - bytes_read` + (`reader.rs:524`) is one of the two underflow triggers, and stalled deletion is + also an upstream *cause* of the full-buffer state. + +**Dominance:** if `total-buffer-size-never-underflows` holds, the deadlock-shaped +failures of `writer-eventually-makes-progress`, `reader-drains-and-terminates-cleanly`, +and the vacuity of `buffer-size-within-max` largely disappear. But keep all four: +they observe the bug from different vantage points (SUT-side root, writer liveness, +reader termination, safety-bound vacuity) and Antithesis benefits from multiple +independent assertions on the same dangerous state. + +## Cluster B — Crash-time durability & recovery (shared fault: node-kill in the fsync window) + +All depend on node-termination faults and the 500ms fsync window + non-atomic +data-fsync/ledger-msync pair. + +- `durable-unacked-events-survive-crash` (synced data not lost) +- `every-written-event-eventually-delivered` (end-to-end at-least-once — the + product-level expression; **dominates** `durable-unacked-events-survive-crash` + at the workload level, but the latter is the tighter buffer-internal invariant) +- `recovery-completes-after-crash` (init doesn't hang/fail) +- `partial-write-at-rotation-recovers` (torn-tail / rotation crash recovery) + +**Connections:** `partial-write-at-rotation-recovers` is the mechanism by which +`recovery-completes-after-crash` and `durable-unacked-events-survive-crash` can +fail (torn tail → wrong fast-forward → silent skip or monotonicity panic). It also +feeds Cluster A: a partial write is the canonical producer of the file-size vs. +record-bytes discrepancy that triggers the underflow. 
+ +## Cluster C — Corruption detection & record integrity (shared fault: bit-flip / partial-write) + +- `no-corrupted-record-delivered` (never return garbage) +- `corruption-is-detected-and-recovered` (detection path fires) +- `record-id-monotonicity-holds` (the guardrail panic never trips) +- `record-never-spans-files` (record framing integrity) + +**Connections:** `no-corrupted-record-delivered` and `corruption-is-detected-and-recovered` +are two sides of the same CRC32C/`CheckBytes` guard — the former asserts "never +bad output," the latter asserts "the recovery branch is reachable." Neither +dominates the other. `record-id-monotonicity-holds` bridges to Cluster B +(torn-tail) and to `file-id-rollover-stays-coordinated` (the `reader.rs:932` `>` +bug is a path to a monotonicity violation). `partial-write-at-rotation-recovers` +(Cluster B) and this cluster share the torn-tail evidence. + +## Cluster D — Boundary arithmetic (shared mechanism: non-wrap-aware / unguarded integer ops) + +- `file-id-rollover-stays-coordinated` (`reader.rs:932` raw `u16 >`) +- `record-id-wraparound-accounting-holds` (`ledger.rs:266` non-wrapping `- 1`) +- (related: `total-buffer-size-never-underflows` is also an arithmetic-safety bug, + but its blast radius puts it in Cluster A) + +**Connections:** all three are the same *class* of defect (Rust integer ops that +aren't saturating/wrap-aware in a context that can hit the boundary). They share +no runtime code path but share a root-cause pattern and a fix pattern — worth a +single "audit all buffer integer arithmetic for boundary safety" note to the team. +`file-id-rollover-stays-coordinated` connects to Cluster C via the monotonicity +panic. 
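
The `record-id-wraparound-accounting-holds` arithmetic from Cluster D can be reproduced in a few lines. The sketch below is a model of the shape described in these notes (a wrapping difference followed by `- 1`), not a copy of `get_total_records` at `ledger.rs:266`; both function names are hypothetical. The first function uses `wrapping_sub` for the final step so the release-build result is observable in any build profile, since a plain `- 1` on zero would panic in a debug build. The second shows one possible guard, saturating at zero, offered as an assumption rather than a proposed patch.

```rust
/// Models the release-build result of "wrapping difference, then minus one".
/// For a drained buffer (`next == last`) the difference is 0 and the final
/// step wraps around to `u64::MAX`.
fn total_records_release_like(next: u64, last: u64) -> u64 {
    next.wrapping_sub(last).wrapping_sub(1)
}

/// One possible boundary-safe variant (assumption): saturate at zero so a
/// drained buffer reports no outstanding records.
fn total_records_saturating(next: u64, last: u64) -> u64 {
    next.wrapping_sub(last).saturating_sub(1)
}
```

For a drained buffer, `total_records_release_like(10, 10)` yields `u64::MAX` (the ~2^64 phantom count), while `total_records_saturating(10, 10)` yields 0. Away from the boundary the two agree.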
+ +## Cluster E — Delivery-accounting bugs (silent loss invisible to operators) + +- `sink-failure-not-silently-acked` (`_status` discarded at `ledger.rs:704`) +- `dropped-events-are-counted` (`drop_newest` not surfaced to component metric) +- (related lifecycle: `config-reload-no-silent-loss`) + +**Connections:** these are "the system loses data but the metrics don't say so" +properties — the internal config-reload incident theme. They don't share code paths with each other +but share the *observability gap* failure mode. `config-reload-no-silent-loss` +(Cluster F) is the most severe instance of the same theme. + +## Cluster F — Lifecycle / shutdown flush (shared mechanism: `Drop` does not flush) + +- `config-reload-no-silent-loss` +- `graceful-shutdown-flushes-all` + +**Connections:** both hinge on the same fact — `BufferWriter::Drop` calls `close()` +but not `flush()`, so any guarantee of losslessness on teardown depends on an +*external* flush by the topology before drop. `graceful-shutdown-flushes-all` is +the steady-state version; `config-reload-no-silent-loss` is the hot-reload version +(harder, because old+new topologies may overlap and contend on the per-process +advisory lock). Resolving the open question "does the topology flush before drop?" +affects both. + +## Cluster G — Cross-cutting & operational gaps (Category 7, from evaluation) + +New properties spanning environment/operations. Several connect back to the +master deadlock cluster (A) and the silent-loss cluster (E). + +- `foreign-data-file-no-writer-stall` — a **non-crash** path to the same stall as + Cluster A (`is_buffer_full` permanently true), but via wrong scan scope in + `update_buffer_size` rather than arithmetic underflow. **Joins Cluster A** as an + alternative trigger; reachable without node-kill faults. +- `finalizer-task-drains-pending-acks` — a **distinct** stall/loss path: a dead + finalizer strands acks. 
**Bridges Cluster A** (no acks → no deletion → writer + stall) and **Cluster E** (delivered events never marked acked → silent loss). + Also relates to `acked-files-eventually-deleted` (same dependency chain) and + `reader-drains-and-terminates-cleanly` (a dead finalizer prevents clean + termination). +- `ledger-corruption-no-sigbus-crashloop` — external-tampering analogue of the + corruption cluster (C), but on the *ledger* mmap rather than data records; + failure mode is SIGBUS/crash-loop, not bad-data delivery. +- `fsync-window-bounded-under-clock-jitter` — directly strengthens **Cluster B**: + it bounds the durability window that `durable-unacked-events-survive-crash` + assumes (≤500ms). The shared `flush_interval=0` oracle decision ties them + together. +- `overflow-chain-no-unaccounted-gap` — a **Cluster E** silent-loss instance for + the previously-uncovered `WhenFull::Overflow` mode; also touches Cluster B + (crash) and can feed Cluster A (overflow/drain cycles confusing + `total_buffer_size`). +- `buffer-survives-version-upgrade` — connects to corruption cluster (C): a + layout-changed record passing `CheckBytes`+CRC is the same "garbage delivered as + valid" failure as `no-corrupted-record-delivered`, reached via upgrade rather + than bit-flip; can also trigger the monotonicity panic (Cluster C). +- `throughput-progresses-under-contention` — the **degenerate-but-alive** companion + to `writer-eventually-makes-progress` (A): together they triangulate healthy vs. + starved vs. deadlocked. Otherwise standalone (perf/contention). + +## Cluster H — Checksum-skip silent data loss (added 2026-05-29, data-loss expansion) + +Shared mechanism: `roll_to_next_data_file` (reader.rs:711-759) abandons the +entire remainder of a data file on the first bad read. Three new properties make +the loss precise where Cluster C only confirmed the recovery path runs: + +- `corruption-skip-loss-bounded` — bounds *how much* is lost (valid records after + the corrupt one survive). 
The `Always` safety bound that + `corruption-is-detected-and-recovered` (Cluster C, a `Sometimes` reachability + check) does not provide. **Tightens Cluster C.** +- `corruption-skip-loss-is-counted` — the abandoned records must be counted. + **Joins Cluster E** (silent-loss invisible to operators): it is the read-side + companion to `dropped-events-are-counted`/#24606 — strictly more silent (hits + neither buffer- nor component-level counter). +- `corruption-skip-record-id-accounting-consistent` — the roll must not turn loss + into accounting corruption. **Bridges Cluster A** (names the abandoned-tail as a + concrete real trigger for the reader.rs:524 underflow → #21683, validated-as- + reachable by run D0) and **Cluster D/C** (record-ID gap → monotonicity panic). + +Dominance: `corruption-skip-loss-bounded` is the precondition (loss must be real); +the other two describe *consequences* (silent + accounting-corrupting). The single +fault — a mid-file bit-flip in a multi-record data file, read live — exercises all +three plus Cluster C's reachability check. + +## Cross-cluster dominance summary + +- **`total-buffer-size-never-underflows` (A)** is the highest-leverage single + property: fixing/verifying it neutralizes the deadlock symptoms across A. +- **Checksum-skip cluster (H)** is the highest-leverage *data-loss* entry point: + one mid-file-corruption fault simultaneously exercises bounded-loss (H), + silent-loss counting (E/H), the reader.rs:524 underflow (A), and the + monotonicity guard (C/D). +- **`partial-write-at-rotation-recovers` (B)** is the most connected *trigger*: it + feeds A (underflow), B (recovery), and C (torn-tail/monotonicity). Antithesis + effort spent reaching the rotation-boundary crash window pays off across three + clusters. 
+- **Observability-gap properties (E, F)** are independent of the durability/deadlock + machinery and need their own workload setup (metric inspection, sink-error + injection, config reload) — don't expect the crash-recovery workload to cover them. + diff --git a/tests/antithesis/scratchbook/runs.md b/tests/antithesis/scratchbook/runs.md new file mode 100644 index 0000000000000..f3b294411809c --- /dev/null +++ b/tests/antithesis/scratchbook/runs.md @@ -0,0 +1,300 @@ +# Antithesis Run Log — disk buffer v2 + +Tracks launched runs (run_id ↔ test ↔ branch ↔ outcome). Triage via the +antithesis-triage skill: `snouty runs ...` keyed by run_id. + +| run_id | test-name | branch | duration | purpose | status / triage | +|---|---|---|---|---|---| +| a7adf33514d82a7a7cc8faba3b51c404-54-9 | vector-diskbufv2-g0-bootstrap | blt/antithesis-setup-harness | 15m | G0 bootstrap: validate Antithesis round-trip (workload sancov instrumentation, setup_complete, bootstrap reachables) | **COMPLETED — all pass, pipeline validated.** SDK detected (Rust 0.2.8), assertions present, build OK, `event delivered end-to-end through disk buffer` + `workload serve started` reachables hit. `Software was instrumented` = vdbuf-workload only (19869 locs), **Vector NOT instrumented**. report: (internal run report, not linked) | + +## Critical finding (G0 triage, 2026-05-29) + +**The `basic_test` webhook does NOT inject node-termination (kill/stop/reboot) faults.** +G0 events: network faults fired heavily (clog/partition/restore, 934K packets +dropped) + thread-pausing (workload), but `reboot`/`stop`/`kill`/`shutdown` +event queries are all empty. (User said node-kill is enabled at the tenant, but +this webhook's fault menu doesn't use it.) + +Implication: the **crash-time** disk-buffer bugs (#21683 underflow→deadlock, +torn-tail recovery, crash durability) require Vector to be killed mid-write — not +reachable under basic_test. 
**Demonstrable without node-kill:** drop_newest +miscount (#24606), sink-error silent-ack, and concurrency/interleaving bugs via +thread-pausing once Vector is instrumented. Pivoting the grind toward these. +(G1/G2 30m run will confirm node-kill absence; instrumented Vector build in +flight adds Vector coverage + thread-pausing on the buffer.) +| 0d97fdfb6f8511051f2078aa1cd76341-54-9 | vector-diskbufv2-g1g2-crash-durability | blt/antithesis-test-crash-durability | 30m | G1/G2: crash-durability (at-least-once) + writer-progress (#21683 deadlock probe) | **COMPLETED.** `post-recovery write makes progress` PASS (no deadlock — expected, no node-kill). `every end-to-end-acked event ... reaches the collector` **FAILING (116 counterexamples)** — potential acked-but-not-delivered silent loss, BUT collector returned 200 unconditionally → possible false-ack artifact. No-fault local repro = 430/430 clean → fault-specific. Hardened collector (G6) verifies. | +| bce984c7ec91cc4c2704b265e7c75d15-54-9 | vector-diskbufv2-g3-gauge-sanity | blt/antithesis-test-metrics-sanity | 30m | G3 (cumulative): produce + durability/progress check + buffer-gauge-sanity anytime invariant (u64 underflow ~1.8e19) | **COMPLETED.** gauge-sanity PASS (no 2^64 underflow gauge — needs drained restart/node-kill, not reached). post-recovery-progress PASS. durability assertion FAILS (same as G1/G2 — same unhardened-collector caveat). | +| 597228a5ef207e0e37f858d10099643d-54-9 | vector-diskbufv2-g4-instrumented | blt/antithesis-sut-instrumentation | 30m | G4: instrumented Vector + thread-pausing on buffer + precise underflow assert | **COMPLETED.** ✓ Vector instrumented (642,862 locs), ✓ thread-pausing on `vector`, ✓ `#21683 underflow assert` reachable + **PASS** (held). FAIL: post-recovery-progress + durability — assessed as **oracle artifacts** (probe sits behind a full-256MB-buffer backlog with only a 45s window = slow-drain, not deadlock; + unconditional-200 collector). Need oracle fixes. 
| +| 6667a495ea9045d64d7bcbcc881894a1-54-9 | vector-diskbufv2-g6-durability-hardened | blt/antithesis-harden-collector-oracle | 20m | G6: re-run durability with HARDENED collector (200 only on parsed+recorded body). Settles real-bug vs artifact. | awaiting completion | + +## Oracle-artifact findings (G4 triage) + +Two of my workload assertions produce **false failures** and must be fixed before their failures can be trusted as Vector bugs: + +1. **durability** (`acked event reaches collector`): collector returned HTTP 200 unconditionally → fixed in G6 (hardened: 200 only when parsed+recorded). +2. **post-recovery progress** (`no permanent writer deadlock`): the single-event probe waits only 45s, but after a long fault window the 256MB buffer has a large backlog the probe sits behind → times out on slow drain, NOT a real deadlock. **Fix needed:** redefine progress as "buffer_events drains toward 0 within a generous bound" or "produced count advances post-recovery", not a single bounded probe. + +**Trustworthy (artifact-free) signals:** G5 drop_newest (#24606, metric-only) and the gauge-sanity + underflow SUT asserts (which all currently PASS). +| 597228a5ef207e0e37f858d10099643d-54-9 | vector-diskbufv2-g4-instrumented | blt/antithesis-sut-instrumentation | 30m | G4: INSTRUMENTED Vector (sancov coverage + SUT-side underflow assert + thread-pausing on buffer reader/writer). Concurrency exploration + precise #21683 signal | submitted 2026-05-29 ~01:2x (202); awaiting completion | +| afacfcd0d7564db702fa8b0ed88de961-54-9 | vector-diskbufv2-g5-dropnewest-miscount | blt/antithesis-test-drop-newest | 30m | G5: when_full=drop_newest; asserts component_discarded reflects buffer_discarded (#24606) | **COMPLETED.** `#24606` assertion PASS but **VACUOUS**: `drop_newest actually dropped events` (Sometimes) FAILED = the 256MB buffer **never filled** → drop_newest never fired → precondition unmet. Fix: slow/failing collector to fill the buffer, then re-test #24606. 
| + +--- + +## Triage summary & conclusions (2026-05-29) + +**9 runs launched (G0–G9), 8 triaged.** Fully instrumented harness: Vector built +with sancov (642,862 coverage locations) + thread-pausing on the buffer +reader/writer + a SUT-side `assert_always` at the `ledger` `total_buffer_size` +decrement (the precise #21683 site); workload instrumented (~20K locations). + +### What HELD (Vector behaving correctly) + +- **`#21683` underflow assert** (`ledger total_buffer_size decrement never + underflows`): reachable + **PASS** in the instrumented runs. +- **buffer gauge-sanity** (no ~1.8e19 u64-underflow gauge): **PASS**. +- No deadlock/stall observed that wasn't an oracle artifact. + +### Every failing assertion traced to a TEST/ORACLE artifact (not a Vector bug) + +1. **`every end-to-end-acked event reaches the collector`** — failed in G1/G2/G3/ + G4/G6/G8. Root causes, peeled one by one: (a) collector returned HTTP 200 + unconditionally → hardened to 200-only-when-parsed; (b) drain-wait stopped + before the buffer was empty → hardened to wait `buffer_events==0`; (c) + **definitive:** concurrent `produce` processes append to one `acked.log`, and + non-atomic interleaved writes corrupt lines — G8 "missing" id was + `p33b24b15-42p39fd2605-23` (two ids concatenated) with `delivered>acked`. Also + the source-level `acknowledgements` is **deprecated** and acks on + acceptance/buffering, not e2e delivery, so "acked" ≠ "delivered". → NOT a + Vector bug. +2. **`post-recovery write makes progress`** — failed only with a full buffer + + throttled collector: the single 45s probe sits behind a 256MB backlog = + slow-drain, not a permanent deadlock. Artifact. (The real deadlock #21683 + needs node-kill anyway.) + +### Why the marquee bugs were not demonstrated + +- **`basic_test` injects NO node-termination faults** (G0 events: network + + thread-pausing only; `reboot`/`stop`/`kill` queries empty). 
The crash-class + bugs (#21683 deadlock, torn-tail recovery, crash-durability) all require Vector + to be killed mid-write → **unreachable under this webhook**. +- **#24606 (drop_newest miscount)**: real bug in code, but needs a FULL buffer. + G5 (no throttle) and G7 (3s collector delay) never filled the 256MB buffer + (`drop_newest actually dropped` Sometimes = never true). G9 blocks the collector + (120s) to force buffer-full — verdict pending. + +### Net finding + +Under the fault surface `basic_test` exposes (network + thread-pausing), at the +loads tested, the disk buffer's real invariants held — **no disk-buffer bug +demonstrated**. The known crash-class bugs require a **node-kill-enabled webhook**; +the harness (instrumented Vector + SUT underflow assert + workload) is built and +ready to demonstrate them the moment node-kill is available. Honest position: did +not manufacture a failure; traced every red assertion to a test artifact. + +### Known test-harness limitations to fix before trusting durability red + +- per-process (not shared) `acked.log`/`delivered.log`, or atomic appends. +- configure `acknowledgements` on the SINK (modern e2e) not the deprecated source. +- progress probe should gate on `buffer_events==0` and bound by buffer drain time. + +--- + +## FINAL CONCLUSION (2026-05-29) — 11 Antithesis runs (G0–G11) + local repros + +**No Vector disk-buffer bug was demonstrated — and this is an evidence-backed +result, not a lack of rigor.** Every red assertion was traced to a workload/oracle +artifact; the genuine Vector invariants held; and the two real catalog bugs are +provably out of reach under this environment, with the precise reasons below. + +### Real Vector invariants that HELD under all reachable faults (incl. thread-pausing on the instrumented buffer) + +- `ledger total_buffer_size decrement never underflows` (#21683 SUT-side assert): reachable + PASS. +- buffer size-gauge sanity (no ~1.8e19 u64-underflow gauge): PASS. 
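The ~1.8e19 figure in the gauge-sanity check is what a `u64` counter shows after it is decremented below zero. A minimal sketch, assuming a standalone `AtomicU64` stands in for the ledger's `total_buffer_size`; the function names `raw_decrement` and `saturating_decrement` are illustrative and are not Vector APIs:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Unguarded decrement: `fetch_sub` on an atomic wraps on underflow.
/// Returns the value stored after the subtraction.
fn raw_decrement(counter: &AtomicU64, amount: u64) -> u64 {
    counter.fetch_sub(amount, Ordering::AcqRel).wrapping_sub(amount)
}

/// Saturating alternative (hypothetical): clamps at zero instead of wrapping.
/// Returns the value stored after the subtraction.
fn saturating_decrement(counter: &AtomicU64, amount: u64) -> u64 {
    let previous = counter
        .fetch_update(Ordering::AcqRel, Ordering::Acquire, |current| {
            Some(current.saturating_sub(amount))
        })
        .expect("closure always returns Some");
    previous.saturating_sub(amount)
}
```

Starting from 10 and decrementing by 11, the raw variant stores `u64::MAX` (18,446,744,073,709,551,615, roughly 1.8e19), while the saturating variant stores 0.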
+ +### Why the two target bugs were not demonstrable here + +1. **Crash-class (#21683 deadlock, torn-tail recovery, crash-durability):** require + Vector to be **killed mid-write**. The `basic_test` webhook injects network + + thread-pausing faults only — **no node-termination** (confirmed: G0 events have + zero reboot/stop/kill). Hard environmental blocker; needs a node-kill-enabled + webhook. +2. **#24606 (drop_newest component-metric miscount):** requires the disk buffer at + `max_size`. The buffer has an **effective cap ≈ max_size − 128MB** (one + data-file reserve), so a 256MB buffer plateaus at ~128MB and applies upstream + **backpressure** rather than dropping under my load — `buffer_discarded` stays + 0. Confirmed identically across G5/G7/G9/G10/G11 + two local repros (exact same + 134,166,720-byte plateau, concurrency=1 made no difference). The drop_newest + path is not reachable by overwhelming throughput in this harness. + +### Artifacts found + fixed (so they can't masquerade as bugs) + +- collector returned HTTP 200 unconditionally → hardened (200 only when parsed+recorded). +- durability drain-wait stopped before buffer empty → wait `buffer_events==0`. +- **concurrent `produce` processes corrupt `acked.log`** (G8 "missing" id + `p33b24b15-42p39fd2605-23` = two ids concatenated, delivered>acked) — the real + root cause of all "acked-not-delivered" reds. +- post-recovery probe sat behind a full-buffer backlog (slow-drain ≠ deadlock). +- deprecated **source-level acks** ack on acceptance, not e2e delivery. + +### Deliverable (ready to demonstrate the bugs once unblocked) + +- gt stack: 12 branches (research → setup → 8 test/instrumentation branches), not pushed. +- Fully instrumented Vector image: sancov coverage (642,862 locations) + + thread-pausing on the buffer reader/writer + SUT-side `#21683` underflow assert; + instrumented workload; env-driven config variants (when_full / src-acks / + sink-concurrency / collector-delay). 
+- **To demonstrate the crash-class bugs:** launch the same stack via a + node-kill-enabled webhook (the underflow assert + durability/gauge checks are + already wired to catch them). +- **To demonstrate #24606:** drive drops via a route that reaches the sink-buffer + drop path (e.g., a buffer whose effective cap is small relative to a sustained + in-process write burst, or a unit/integration test) rather than HTTP throughput. + +--- + +## ✅ BUG DEMONSTRATED — #24606 (drop_newest silent at component level) + +After establishing that #24606's drop path is unreachable via HTTP backpressure +(the buffer applies upstream backpressure before the disk buffer hits max_size), +I demonstrated it **reproducibly via a focused test** (user's chosen approach), +which is deterministic and bypasses the load-generation wall. + +- **Test:** `lib/vector-buffers/src/buffer_usage_data.rs::drop_newest_increments_buffer_metric_but_not_component_metric_issue_24606` (branch `blt/antithesis-demonstrate-24606`). +- **Mechanism:** drives the real reporter path `BufferUsageData::report` → + `emit(BufferEventsDropped { intentional: true, reason: "drop_newest" })`, with a + `metrics_util` `DebuggingRecorder` capturing emissions. +- **Result (PASS):** `buffer_discarded_events_total = 5` while + `component_discarded_events_total = 0`. +- **Root cause (confirmed by grep):** `ComponentEventsDropped` is never referenced + anywhere in `lib/vector-buffers/`, so the buffer drop path cannot surface drops + to the component-level metric operators monitor for data loss → **silent data + loss on dashboards.** Matches Vector #24606 / #24144. + +Reproduce: `cargo test -p vector-buffers --lib issue_24606`. + +### Final tally + +- **#24606: DEMONSTRATED** (reproducible test). +- **Crash-class bugs (#21683 deadlock, torn-tail, crash-durability): NOT + demonstrated — blocked by `basic_test` having no node-kill faults.** Harness + + SUT-side underflow assert are built and ready for a node-kill-enabled webhook. 
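The gap that the #24606 test observes can be pictured with a toy model. This is only a sketch: the struct and method names below are hypothetical and do not mirror Vector's metrics code. It shows a drop path that updates a buffer-level counter while nothing updates the component-level one.

```rust
/// Hypothetical stand-ins for the two metrics named in the #24606 result.
#[derive(Debug, Default)]
struct DropCounters {
    buffer_discarded_events_total: u64,
    component_discarded_events_total: u64,
}

impl DropCounters {
    /// Models the reported behavior: only the buffer-level counter moves.
    fn record_buffer_drop(&mut self, events: u64) {
        self.buffer_discarded_events_total += events;
    }

    /// Events that an operator watching only the component-level metric would miss.
    fn invisible_to_component_metric(&self) -> u64 {
        self.buffer_discarded_events_total
            .saturating_sub(self.component_discarded_events_total)
    }
}
```

Recording five drops in this model leaves the buffer-level counter at 5 and the component-level counter at 0, which is the shape of the result the focused test reports.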
+ +## ✅ BUG DEMONSTRATED — #21683 (total_buffer_size unsaturated underflow → writer deadlock) + +The marquee deadlock root cause, demonstrated reproducibly via a focused test +(the crash-only Antithesis path couldn't reach it under basic_test's no-node-kill +fault menu). + +- **Test:** `lib/vector-buffers/src/variants/disk_v2/tests/invariants.rs::ledger_total_buffer_size_decrement_underflows_issue_21683` (branch `blt/antithesis-demonstrate-21683`). +- **Result (PASS, release):** after `increment(10)` then `decrement(11)`, + `get_total_buffer_size()` returns ~2^64 (wrapped) instead of 0 (saturated). +- **Consequence:** `is_buffer_full()` (`total_buffer_size + unflushed_bytes >= + max`) then returns true forever → `ensure_ready_for_write` loops on + `wait_for_reader()` → permanent silent writer deadlock. Matches #21683; PR + #23561 fixed only the reporter gauge, not this control-path atomic. +- Reproduce: `cargo test -p vector-buffers --release --lib issue_21683`. + +## Final outcome: TWO real disk-buffer bugs demonstrated reproducibly + +- **#24606** — drop_newest drops are silent at the component metric level. +- **#21683** — total_buffer_size unsaturated decrement wraps → permanent writer deadlock. +Both via focused tests after establishing the crash-class path is blocked by +basic_test's missing node-kill faults. The full instrumented Antithesis harness +(Vector @642K coverage + SUT-side underflow assert + thread-pausing) remains ready +to demonstrate the crash-driven manifestations given a node-kill-enabled webhook. + +--- + +## ✅ COMPLETE BUG LEDGER — all known disk-buffer bugs demonstrated (local repros) + +Goal: "grind out all known bugs." All 7 confirmed code-level defects from the +research catalog are now demonstrated by reproducible tests. Full `vector-buffers` +suite: **85 passed, 0 failed** (release). The crash-driven *runtime* manifestations +(e.g. 
the #21683 deadlock under a real crash) additionally need a node-kill-enabled +webhook — the instrumented Antithesis harness is built and ready for that — but the +underlying defects are all demonstrated here deterministically. + +| # | Bug | Test (gt branch) | Repro | +|---|-----|------|-------| +| 1 | **#24606** drop_newest drops silent at component metric | `buffer_usage_data.rs::drop_newest_increments_buffer_metric_but_not_component_metric_issue_24606` (`antithesis-demonstrate-24606`) | `cargo test -p vector-buffers --lib issue_24606` | +| 2 | **#21683** `total_buffer_size` unsaturated decrement wraps → writer deadlock | `invariants.rs::ledger_total_buffer_size_decrement_underflows_issue_21683` (`antithesis-demonstrate-21683`) | `cargo test -p vector-buffers --release --lib issue_21683` | +| 3 | `get_total_records` `0-1` underflow on drained buffer → ~2^64 event count | `invariants.rs::get_total_records_underflows_on_drained_buffer_issue_21683_metrics` (`antithesis-demonstrate-get-total-records-underflow`) | `cargo test -p vector-buffers --release --lib get_total_records_underflows` | +| 4 | **#24948** writer `Drop` without flush → silent loss of buffered events | `invariants.rs::writer_drop_without_flush_loses_buffered_events_issue_24948` (`antithesis-demonstrate-24948`) | `cargo test -p vector-buffers --lib issue_24948` | +| 5 | finalizer discards `BatchStatus` → rejected delivery silently acked | `acknowledgements.rs::rejected_delivery_still_advances_acks_finalizer_status_discard` (`antithesis-demonstrate-finalizer-status`) | `cargo test -p vector-buffers --lib finalizer_status_discard` | +| 6 | `reader.rs:524` `metadata.len()-bytes_read` underflow (truncation → #21683 wrap) | `invariants.rs::delete_completed_data_file_size_delta_underflows_reader_524` (`antithesis-demonstrate-reader524-underflow`) | `cargo test -p vector-buffers --release --lib reader_524` | +| 7 | `reader.rs:932` file-id-rollover compare not wrap-aware | 
`invariants.rs::file_id_rollover_compare_not_wrap_aware_reader_932` (`antithesis-demonstrate-file-id-rollover`) | `cargo test -p vector-buffers --lib file_id_rollover_compare` | + +All are demonstrations only — **no Vector behavior was changed** (only tests + the +no-op SUT-side underflow assert added during the harness phase). + +--- + +## D0 — DIRECT disk_v2 exerciser harness (new approach) + +Pivot from full-Vector SUT to testing the buffer **directly**. Rationale: every +demonstrated bug is internal to `vector-buffers`; routing through Vector's +source→codec→topology→sink→ack machinery was pure state-space overhead, and it +prevented the buffer from ever filling/draining the way the bugs need. + +**Harness** (`antithesis/config-direct/`, branch `antithesis-direct-exerciser`): + +- One self-driving process IS the SUT (disk_v2 takes a per-dir advisory lock, so + the workload must live in the buffer-owning process). `examples/disk_v2_antithesis.rs` + opens a real disk_v2 buffer via the public `TopologyBuilder` API and runs + randomized writer/reader activity under the Antithesis SDK RNG. Reader rides + close behind writer so the reader↔writer-head boundary is frequently live. +- **Oracle = SUT-side `assert_always!`** inside `vector-buffers` (fire however the + bad state is reached, `#[inline]` so they fold away in prod): + - `ledger.rs` `decrement_total_buffer_size` — no underflow (#21683 root). + - `ledger.rs` `get_total_records` — `wrapping_sub >= 1` (the `0-1` underflow). + - `reader.rs:524` — `bytes_read <= metadata.len()` (size-delta underflow). +- Test template `diskbuf_direct/`: `first_wait_ready`, `eventually_progress` + (liveness — a deadlocked buffer stalls delivery), `parallel_driver_safety_monitor` + (handled<=produced mirror). + +**Run:** `385cfc4df45b3c85567b9b7ef3d803ed-54-9` — basic_test, 30min, launched +2026-05-29T14:33Z. Targets the accounting-underflow cluster organically via +thread-pausing (basic_test still has no node-kill). 
Status: starting → (triage pending). + +--- + +## D0 — RESULTS / TRIAGE (run 385cfc4df45b3c85567b9b7ef3d803ed-54-9) + +Completed 2026-05-29T15:11Z (30min, basic_test). Report: +(internal run report, not linked) (auth'd link in `snouty runs show`). + +**HEADLINE — #21683 reproduced organically by Antithesis.** The SUT-side +`assert_always!` **"ledger total_buffer_size decrement never underflows (root +of #21683)"** FAILED at 3 distinct simulation times (vtime 34.1, 130.0, 174.3; a +passing example at 14.7 — the rare-state signature). Antithesis drove the *real* +disk_v2 buffer into the total_buffer_size unsaturated-decrement underflow under +thread-pausing fault injection — no unsafe pokes, the organic control-path bug. + +**Corroborating downstream failure.** `diskbuf_direct/parallel_driver_safety_monitor.sh` +(the externally-visible `handled <= produced` invariant) FAILED at vtime 205, +287, 293 — *after* the underflow times. Clean causal story: decrement underflow → +total_buffer_size wraps ~2^64 → accounting corrupt → buffer reports phantom +records → handled>produced. Two independent oracles, consistent ordering. + +**Did NOT fire (passing) this run:** + +- `get_total_records never underflows on a drained buffer` — needs the exact + reader==writer drained boundary; not hit organically in 30 min. +- `reader data-file size delta never underflows (reader.rs:524)` — needs a + truncated/torn data file; basic_test's faults didn't produce one. +- Liveness `eventually_progress.sh` PASSED; workload `disk_v2 never delivers more + than produced` and the `record delivered end-to-end` reachable PASSED; + setup_complete + bootstrap reachables all green. + +**Noise / not SUT bugs:** meta-properties "Fault injector dropped/total packets = +0" show as Failing because the single-process SUT has no inter-container network +traffic to inject network faults into — thread-pausing is the operative fault +here. Confirmed network clog/partition/restore faults were attempted.
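The first link of the causal story above can be checked with plain integer arithmetic. A minimal sketch, assuming the fullness predicate has the shape quoted earlier in this log (`total_buffer_size + unflushed_bytes >= max`); the free functions below are illustrative rather than Vector's code, and saturating addition is used only so the sketch itself cannot overflow:

```rust
/// Illustrative fullness check: accounted bytes plus unflushed bytes against the limit.
fn is_buffer_full(total_buffer_size: u64, unflushed_bytes: u64, max_size: u64) -> bool {
    total_buffer_size.saturating_add(unflushed_bytes) >= max_size
}

/// Value an unguarded `u64` counter holds after `start - amount` wraps.
fn wrapped_total(start: u64, amount: u64) -> u64 {
    start.wrapping_sub(amount)
}
```

With a 256 MiB limit, a healthy total of 10 bytes is not full, but `wrapped_total(10, 11)` is `u64::MAX`, and the predicate stays true even after a further 128 MiB of deletions is subtracted. No realistic amount of reader progress brings the value back under the limit, which is consistent with the writer waiting indefinitely.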
+ +**Conclusion:** the direct-exerciser harness works end-to-end (libvoidstar + +sancov loaded, setup_complete, assertions evaluated) and **independently +reproduced #21683 via Antithesis** — complementing the local failing test +`ledger_total_buffer_size_decrement_should_saturate_not_underflow_issue_21683`. +The get_total_records / reader.rs:524 underflows remain local-test-only for now +(their preconditions are rarer); a longer run or fs-fault-heavier webhook would +likely surface them. Crash-class manifestations still await a node-kill webhook. diff --git a/tests/antithesis/scratchbook/sut-analysis.md b/tests/antithesis/scratchbook/sut-analysis.md new file mode 100644 index 0000000000000..e3e274f8cbe1d --- /dev/null +++ b/tests/antithesis/scratchbook/sut-analysis.md @@ -0,0 +1,327 @@ +--- +sut_path: /home/ssm-user/src/vector +commit: b7aae737cef5dd37d1445915443a1eb97b584f85 +updated: 2026-05-28 +external_references: + - path: lib/vector-buffers/src/variants/disk_v2/mod.rs + why: Module-level doc comment is the authoritative design spec (format, ledger, recovery) + - path: rfcs/2021-10-14-9477-buffer-improvements.md + why: Original buffer-rework RFC; intended design and guarantees + - path: docs/specs/buffer.md + why: Buffer component spec / claimed behavior + - path: (internal design doc, not linked) + why: Authoritative description of fsync/durability window, ack flow, at-least-once + duplicate semantics + - path: (internal design doc, not linked) + why: Detailed root-cause writeups of disk-buffer bugs (#21683 ledger underflow, #24948 config-reload stall, #24606 discarded-metric blind spot) + - path: (internal design doc, not linked) + why: Existing internal chaos test (SIGKILL x3 + e2e acks) and a lock-contention performance issue affecting all disk-buffer users + - path: GitHub issues vectordotdev/vector #21683 #24948 #24606 #24144 #23995 #17666 #23456 and PRs #23561 #24949 + why: Bug/regression context for property and evidence files +--- + +# SUT Analysis: 
Disk Buffer v2 (`lib/vector-buffers/src/variants/disk_v2/`) + +## 1. Summary + +Disk buffer v2 is Vector's durable, single-process, ring-buffer-style sink buffer. +A user opts into it per-sink (`type: disk`, `max_size` bytes, `when_full`) to get +durability across crashes/restarts. It is the durability backbone for +mission-critical pipelines and (with end-to-end acknowledgements) provides +**at-least-once** delivery. ~4,900 lines of Rust across 9 modules, with an +extensive in-repo model-based proptest suite. + +The dominant risk, surfaced independently by **8 of 12** discovery focuses, is a +single class of bug: **unsaturated `u64` arithmetic on the in-memory +`total_buffer_size` accounting atomic, triggered by crash/partial-write +discrepancies, leading to a permanent writer deadlock** (Vector #21683). The +control-path atomic remains unfixed (PR #23561 only fixed the metrics reporter). +Secondary risks cluster around crash-time durability/recovery windows, the +non-atomic memory-mapped ledger, the file-rotation boundary, and silent data +loss on config reload / `drop_newest` / sink-error acks. + +This is an **Antithesis-ideal** target: the bugs are crash-timing- and +interleaving-sensitive, externally hard to observe (silent stalls and silent +loss), and the existing tests explicitly cannot reach them (the model test's +in-memory filesystem makes `sync_all`/`flush` no-ops, so no crash-in-fsync-window +state is reachable). + +## 2. Architecture and Data Flow + +### Components + +- **`DiskV2Buffer` / `Buffer::from_config`** (`mod.rs:233-365`) — entrypoint; + wired into Vector topology via `IntoBuffer` → `SenderAdapter::DiskV2` (an + `Arc>`) and `ReceiverAdapter::DiskV2` (single + `BufferReader`). The writer mutex is the **lock-contention bottleneck** noted + in the GA doc. +- **`Ledger`** (`ledger.rs`) — `Arc`-shared between reader and writer. Wraps the + memory-mapped `buffer.db` (`LedgerState`) plus in-memory coordination state. 
+- **`BufferWriter` / `RecordWriter` / `TrackingBufWriter`** (`writer.rs`) — write + path: encode → CRC32C-wrap → rkyv-serialize → 256KB `TrackingBufWriter` → + data file. Owns data-file rotation and `validate_last_write` recovery. +- **`BufferReader` / `RecordReader`** (`reader.rs`) — read path: read length + delimiter → validate archive + checksum → zero-copy decode → attach + `BatchNotifier` → emit. Owns acknowledgement processing, data-file deletion, + and `seek_to_next_record` recovery. +- **Finalizer** — `spawn_finalizer` (`ledger.rs:701-710`) spawns an + `OrderedFinalizer` tokio task that turns dropped `BatchNotifier`s into + `pending_acks` increments and wakes the reader-side ack machinery. + +### Buffer directory layout + +`/buffer/v2//` contains `buffer.db` (mmap'd ledger), +`buffer.lock` (advisory lock), and `buffer-data-{u16}.dat` data files. + +### On-disk record format + +``` +[8 bytes: record_len (u64 BIG-endian)] [rkyv-archived Record { checksum:u32, id:u64, metadata:u32, payload:[u8] }] +``` + +- CRC32C covers `BE(id) || BE(metadata) || payload` (NOT the checksum field). +- **Mixed endianness**: the length delimiter is big-endian; the rkyv body is + **host-native** (little-endian on x86-64). Files are not portable across + architectures (documented). `rkyv`'s `archived_root` reads the root from the + **end** of the buffer — relevant to torn-tail recovery (see §10 F5). +- `CheckBytes` for `ArchivedRecord` is **hand-written** (`record.rs:75-117`) due + to an upstream rkyv ICE — a manual unsafe validation surface. + +### Write → read → ack → delete flow + +1. `write_record` → encode + checksum + rkyv → `TrackingBufWriter` (256KB). Record + ID = `writer_next_record + unflushed_events` and **encodes event count** (ID N + to next ID M ⇒ M−N events). +2. 
`flush()` always flushes `TrackingBufWriter` to the **OS page cache** + (readers see it immediately on Linux) and `notify_writer_waiters()`; it calls + `sync_all()` (fsync) + `ledger.flush()` (msync) **only** when `should_flush()` + says ≥500ms elapsed, or on rotation (`force_full_flush`). +3. Reader reads from page cache, validates, attaches a `BatchNotifier`, emits. +4. Sink delivers downstream, drops the notifier → finalizer → `pending_acks`. +5. Reader's `handle_pending_acknowledgements` consumes acks, advances + `reader_last_record`, and when **all records in a data file are acked**, + `delete_completed_data_file` unlinks the whole file (never partial), persists + `reader_current_data_file`, `ledger.flush()`, and `notify_reader_waiters()` to + unblock a full writer. + +## 3. State Management and Persistence + +### Persisted (mmap'd `buffer.db`, individually atomic, NOT a transactional group) + +- `writer_next_record: AtomicU64` +- `writer_current_data_file: AtomicU16` +- `reader_current_data_file: AtomicU16` (the durable acked reader position) +- `reader_last_record: AtomicU64` + +### In-memory only (lost on crash, reconstructed at startup) + +- `total_buffer_size: AtomicU64` — **re-seeded at startup by summing `.dat` file + sizes** (`update_buffer_size`), then decremented as the reader seeks/acks. + This reconstruction-from-file-size vs. decrement-by-record-bytes mismatch is + the underflow trigger. +- `pending_acks`, `writer_done`, `unacked_reader_file_id_offset`, + `last_flush: AtomicCell`, the two `Notify`s. +- Writer-local `unflushed_bytes`/`unflushed_events` (plain integers). + +### Two-level flush model and the durability window + +- **Page-cache flush** (every `flush()`): data visible to same-host reader. +- **fsync + ledger msync** (every ≥500ms `DEFAULT_FLUSH_INTERVAL`, or rotation): + the only durable point. +- **Data-loss window = up to 500ms** of page-cached-but-unsynced writes on crash + (documented). 
The data file fsync and the ledger msync are **two separate, + non-atomic syscalls** — a crash between them leaves data and ledger diverged, + repaired (with assumptions) by `validate_last_write` on restart. + +### Ordering of ledger updates relative to durability + +- Writer updates `writer_next_record` **after** the page-cache write, lazily on + flush; a crash leaves the ledger lagging the data → `validate_last_write` + fast-forwards (`Ordering::Less`). The reverse (ledger ahead of data, + `Ordering::Greater`) logs "Events have likely been lost" and skips to the next + file — **detected loss, counted as gap markers**, not silent. +- Reader updates `reader_last_record` after acks, flushed lazily; file deletion + unlinks **before** the ledger msync — a crash in that window leaves the ledger + pointing at a deleted file (handled on restart via NotFound→skip). + +## 4. Concurrency Model + +- Single writer (behind a topology `Mutex`), single reader, plus the finalizer + task. mmap'd ledger fields use `Acquire`/`Release`/`AcqRel`; in-memory atomics + likewise. Orderings were judged **correct** (focus 3) — the bugs are + arithmetic/logic, not memory-ordering. +- Coordination via tokio `Notify`: writer waits on `wait_for_reader()`; reader + waits on `wait_for_writer()`. **Naming is misleading**: the finalizer calls + `notify_writer_waiters()` which wakes the *reader*; the reader then frees space + and calls `notify_reader_waiters()` to wake the *writer*. The writer's progress + is therefore **transitively dependent** on: sink acks → finalizer task alive → + reader being actively polled → file deletion. Break any link and the writer + can stall. +- `Notify` is edge-triggered with a one-permit store; generally tolerant but a + potential source of missed-wakeup delays. +- `should_flush` uses an `AtomicCell` CAS so only one caller fsyncs; + under CPU-throttle the winner can be descheduled between winning the CAS and + actually fsyncing, silently extending the 500ms window. 
+- Lock contention on the writer `Mutex` is the known throughput ceiling (~90 + MiB/s with 10 threads). + +## 5. Claimed Guarantees + +### Safety ("a bad thing never happens") + +- **INV-1** Every record CRC32C-checksummed; corrupted records detected and never + returned as valid (`record.rs`, `reader.rs`). Bypass only via CRC collision or + a bug in the hand-written `CheckBytes`. +- **INV-2** A record never spans two data files (`writer.rs:433-436` gate). Hard. +- **INV-3** Data files ≤ 128MB — **soft**: a record may overshoot by up to + `max_record_size` (documented); real bound is ~2×. +- **INV-4** Record IDs strictly monotonic and encode event count; violation + **panics** (`reader.rs:480-484`). Crash can create a *gap* (detected loss), not + a duplicate. +- **INV-5** A data file is deleted only after all its records are acked; + whole-file deletion only, never partial truncation. +- **INV-6** Durability: synced data survives crash; **best-effort within 500ms**; + graceful shutdown flushes (but see §10 — `BufferWriter::Drop` does NOT flush). +- **INV-7** Buffer never exceeds `max_size` — **broken by the underflow bug**: + the writer deadlocks (vacuously upholding the bound by never writing again). +- **INV-9** No double-counting / no silent loss — **broken** for sink-error acks: + the finalizer discards `BatchStatus`, so `Errored`/`Rejected` deliveries are + credited as acknowledged and the events are dropped from the buffer with no + replay (within a process lifetime). +- **INV-10** Single-process exclusivity via advisory `buffer.lock` — does NOT + protect intra-process (POSIX `fcntl` locks are per-process), so a config-reload + overlap of old+new topology can open the same buffer twice. + +### Liveness ("a good thing eventually happens") + +- **L1** A blocked (full) writer eventually unblocks once the reader frees a data + file — **fails permanently under the underflow bug**. 
 +- **L2** Written+flushed records eventually become readable (page-cache flush per + send; no timer needed). Strong. +- **L3** Fully-acked data files eventually deleted / space reclaimed; depends on + finalizer task alive + reader polled + delete I/O succeeding. +- **L4** On restart, reader eventually catches up (`seek_to_next_record`); + vulnerable to torn-tail and to the file-ID rollover ordering bug. +- **L6** Buffer eventually initializes after crash; vulnerable if the writer + must open a not-yet-created next file, or if `update_buffer_size` over-seeds. +- **L8** Reader terminates (`next()→None`) when writer done and buffer empty; + uses `total_buffer_size == 0`, so the underflow bug also breaks clean shutdown. + +## 6. Failure-Prone Areas (ranked; these drive the property catalog) + +1. **`total_buffer_size` underflow → permanent writer deadlock (#21683).** + Root: `ledger.rs:291-298` raw `fetch_sub`; two trigger paths: + `reader.rs:524` `metadata.len() - bytes_read` (also unguarded) and the + startup `update_buffer_size`(file sizes) vs. seek-decrement(record bytes) + mismatch. Manifests as a silent pipeline stall; also breaks reader shutdown + (L8). **Highest-value target.** Requires node-kill + restart faults. +2. **Crash-time durability/recovery windows.** fsync-vs-crash, data-file-fsync + vs. ledger-msync non-atomicity, torn last record, `validate_last_write` + `Greater`/`Less` reconciliation, partial write at file rotation. Tests cannot + reach these (no-op fsync in model FS). +3. **Config-reload silent loss & metric drift (#24948).** + `BufferWriter::Drop` calls `close()` but **not** `flush()` → up to 256KB of + buffered events silently dropped; `track_dropped_events` charges + `byte_size=0` → permanent accounting drift; finalizer task may still hold the + `Arc` / lock during reload; double-counted then gapped metrics. PR #24949 + addressed parts. +4. **`drop_newest` silent loss vs.
metrics (#24606/#24144).** Buffer-level + discarded counter increments but `component_discarded_events_total` stays 0. +5. **Sink-error acks discarded** (`ledger.rs:704` `_status` ignored) → silent loss + under at-least-once. +6. **File-ID rollover ordering bug** (`reader.rs:932` raw `u16 >`), reachable in + tests where `MAX_FILE_ID=6`; production at 65536-file rollover. +7. **Reader skips the rest of a file on first bad record** — valid records after a + corrupt one in the same 128MB file are silently abandoned. +8. **`get_total_records` `- 1` non-wrapping** at record-ID equality/rollover → + ~2^64 phantom event count into metrics. +9. **mmap SIGBUS / external file tampering** (foreign `.dat` files inflate + `total_buffer_size`; truncation under read → underflow). + +## 7. Existing Test Strategy and the Antithesis Gap + +- In-repo: model-based proptest (`tests/model/`) with a reference model + action + sequencer + **in-memory `TestFilesystem`**; per-area unit tests + (initialization, acknowledgements, basic, known_errors, size_limits, + invariants, record); 9 saved proptest regression seeds (tiny size limits / ack + ordering). +- **Critical limitation:** `TestFilesystem::sync_all` and mmap `flush` are + **no-ops**, `flush_interval` is hardcoded to 10s, and the sequencer serializes + ops — so the model suite **cannot** exercise crash-in-fsync-window, real + partial writes, true reader/writer preemption, or the underflow trigger. The + model's own `LedgerModel::decrement_buffer_size` even mirrors the unguarded + `fetch_sub`, so it would reproduce the underflow if the trigger were reachable + — but the fake FS prevents the trigger. +- Two telling **disabled tests**: `reader_exits_cleanly_when_writer_done_and_in_flight_acks` + (`basic.rs`, `#[ignore = "flaky #23456"]`) and `writer_waits_when_buffer_is_full` + (`size_limits.rs`, `#[ignore]`) — both sit exactly on the deadlock/backpressure + path. High-value Antithesis targets. 
 +- Internal E2E **chaos test**: SIGKILL ×3 with e2e acks, asserts all events are delivered. + Antithesis goes further by exploring fault *timing/interleavings* a fixed + 3×-kill test cannot. + +## 8. External Dependencies / Integration Points + +- **OS/filesystem** via `io.rs` (`open/write/fsync/unlink/mmap`). Relies on Linux + page-cache read-after-write (acknowledged Linux-specific assumption). +- **rkyv** zero-copy (host-endian, alignment-sensitive, manual `CheckBytes`). +- **memmap2** for the ledger (`msync` on flush; **SIGBUS** if `buffer.db` is + truncated/unmapped, which is unhandled and crashes the process; misaligned + atomics are UB on some non-x86 arches). +- **crc32fast** (hardware-accelerated; note the crate computes CRC-32/IEEE, not + CRC32C, so the checksum label in INV-1 needs re-checking). +- **fslock** advisory lock, per-process on Linux (no intra-process protection). +- **vector-common finalization** (`OrderedFinalizer`, `BatchNotifier`) for acks; + topology channel adapters treat all writer/reader errors as unrecoverable + (reader I/O error → `panic!` in `receiver.rs`). + +## 9. Product Context + +- Disk buffer is opt-in per sink; sold as "data synchronized to disk will not be + lost if Vector is restarted forcefully or crashes; data synchronized every + 500ms." Used by customers needing durability for mission-critical pipelines. +- With e2e acks: at-least-once; crash between buffer-write and downstream-ack → + **duplicate delivery on replay** (downstream must dedup). +- User-visible failures, by severity: (1) **silent pipeline stall** (writer + deadlock: no crash, no error, dashboards may look healthy); (2) **silent data + loss** (config reload, `drop_newest`, sink-error acks, crash window); (3) + **duplicate delivery**; (4) **lying buffer metrics** (stuck/negative gauges). + The stall and the silent loss are what a durability-seeking customer cares + about most. + +## 10.
Wildcard / Cross-Cutting Observations + +- **F5 (torn-tail mis-recovery):** rkyv `archived_root` reads the root offset from + the last 8 bytes; crash-left trailing bytes could be misread as a plausible + offset, yielding a `Valid` record with the wrong `id`, fast-forwarding the + ledger to a wrong ID and synthesizing a phantom gap. +- **`WhenFull::Overflow` + disk base:** unbiased `select!` over base+overflow + reorders events across the overflow boundary; if overflow is in-memory, a crash + loses the *later* in-memory events while the *earlier* disk events survive — + breaks dedup-based at-least-once reasoning (a gap, not just duplicates). +- **`DiskBufferV1CompatibilityMode` flag inversion** (`vector-core/event/ser.rs`): + `can_decode` requires the V1-compat flag on every record; a future "V2-native" + flag scheme would be rejected as incompatible — a forward-compat foot-gun. +- **Clock jitter × `should_flush`:** `Instant::elapsed` drives the 500ms gate; + Antithesis clock faults could stretch/shrink the durability window. +- **mmap'd ledger torn write under crash:** four independent atomics, no group + atomicity; a crash mid-multi-field-update is exactly what recovery must handle + and what the model FS never produces. + +## Assumptions + +- Disk buffer is single-process; network/partition faults are largely irrelevant. + The strong fault levers are **node termination (kill/restart)**, **node hang**, + **CPU throttling**, **clock jitter**, and **filesystem state across restart**. +- Antithesis runs x86-64 Linux (matches production); cross-arch endianness is out + of scope except as a "don't move buffer files" caveat. + +## Open Questions (catalog-wide) + +- Are **node-termination faults enabled** in the target Antithesis tenant? Nearly + every high-value property needs them. Flag to the user. +- Does Vector's topology shutdown call `writer.flush()` before dropping the writer + on graceful shutdown (vs. the unflushed `Drop`)? 
Determines whether graceful + shutdown is actually lossless. +- Does the finalizer task get drained by the tokio runtime before shutdown, or can + in-flight acks be lost (stranding the reader)? +- Is the config-reload old/new topology overlap actually concurrent (making the + per-process advisory-lock gap a live safety issue)? +