vectordotdev · blt · May 29, 2026
@@ -2,3 +2,7 @@
 ^\.github/actions/spelling/
 ^\Qbenches/codecs/moby_dick.txt\E$
 ^\Qwebsite/layouts/shortcodes/config/unit-tests.html\E$
+# Antithesis test harness: research notes + exerciser examples carry heavy jargon (sancov, libvoidstar, rkyv, vdbuf, lossfinder, ...)
+^tests/antithesis/
+^\Qlib/vector-buffers/examples/disk_v2_antithesis.rs\E$
+^\Qlib/vector-buffers/examples/disk_v2_lossfinder.rs\E$
@@ -0,0 +1,10 @@
+// Scoped relaxations for the Antithesis research scratchbook. These are internal
+// working notes (dense property tables, ad-hoc code fences), not published docs,
+// so a few cosmetic rules are disabled here only — the repo-wide config still
+// applies everywhere else.
+{
+  "extends": "../../../.markdownlint.jsonc",
+  "MD060": false, // table-column-style: the property-catalog tables use empty-header `| | |` 2-col layout
+  "MD040": false, // fenced-code-language: many ad-hoc evidence snippets are intentionally language-less
+  "MD022": false  // blanks-around-headings: relaxed for the dense note format
+}
@@ -0,0 +1,113 @@
+# External References Digest (working note for discovery agents)
+
+This is scaffolding for the antithesis-research run on **disk buffers v2**
+(`lib/vector-buffers/src/variants/disk_v2/`). User scope answer: *"Whatever you
+have access to. You have your MCPs."* — so in-repo docs/RFCs plus Datadog
+internal doc/Jira were consulted. Key findings condensed below so per-focus agents
+don't need to re-fetch.
+
+## In-repo references
+
+- `rfcs/2021-10-14-9477-buffer-improvements.md` — original buffer-rework RFC.
+- `docs/specs/buffer.md` — buffer component spec / claimed behavior.
+- `lib/vector-buffers/src/variants/disk_v2/mod.rs` — authoritative design doc
+  (module-level comment): on-disk format, ledger, record IDs, recovery.
+
+## Claimed guarantees (from `mod.rs` design doc + buffer spec + internal doc)
+
+- Data files never exceed 128MB; ≤ 65,536 files; buffer ≤ ~8TB.
+- All records checksummed with **CRC32C**; records written
+  sequentially/contiguously; a record never spans two data files.
+- Writers create+write data files; readers read+delete them. Reader deletes a
+  data file **only after all records in it are acknowledged** (whole-file
+  deletion, never partial truncation).
+- Ledger (`buffer.db`, memory-mapped) tracks `writer_next_record_id`,
+  `writer_current_data_file_id`, `reader_current_data_file_id`,
+  `reader_last_record_id`. Fields updated atomically, but **not** atomically
+  w.r.t. reader/writer activity.
+- Record IDs are monotonic and encode event count: record ID N with next record
+  M means the record holds M−N events. Used to compute buffer event count and to
+  detect gaps / dropped events after corruption.
+- **Durability:** data is fsync'd every **500ms** (`DEFAULT_FLUSH_INTERVAL`).
+  Page-cache flush happens on every `flush()` (readers see data immediately on
+  Linux); full fsync only every 500ms. **Data-loss window on crash = up to 500ms
+  of unsynced writes** (when e2e acks off). Graceful shutdown flushes everything
+  → no loss.
+- Min buffer `max_size` ~256MB; `DEFAULT_MAX_DATA_FILE_SIZE` 128MB;
+  `DEFAULT_MAX_RECORD_SIZE` = 128MB; `DEFAULT_WRITE_BUFFER_SIZE` 256KB.
+- Endianness: files are host-endian; not portable across architectures.
+- Delivery semantics with e2e acks + disk buffer = **at-least-once**: crash after
+  buffer write but before downstream ack → replay on restart → **possible
+  duplicates** (downstream must dedup).
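
The record-ID rule in the list above (record ID N followed by record ID M means the first record holds M−N events) can be illustrated with a small sketch. This is not the `disk_v2` reader's code: `events_in_record`, `find_gaps`, and `Gap` are hypothetical names, and records are modeled as plain `(record_id, event_count)` pairs purely for illustration.

```rust
/// Hypothetical result type: events that went missing after a given record.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Gap {
    after_record_id: u64,
    missing_events: u64,
}

/// Events held by a record, derived from its ID and the next record's ID.
/// Returns `None` if the IDs are not monotonic.
fn events_in_record(record_id: u64, next_record_id: u64) -> Option<u64> {
    next_record_id.checked_sub(record_id)
}

/// Walk consecutive `(record_id, event_count)` pairs and report every place
/// where the next record ID is larger than `record_id + event_count`.
fn find_gaps(records: &[(u64, u64)]) -> Vec<Gap> {
    let mut gaps = Vec::new();
    for pair in records.windows(2) {
        let (id, count) = pair[0];
        let (next_id, _) = pair[1];
        let expected_next = id + count;
        if next_id > expected_next {
            gaps.push(Gap {
                after_record_id: id,
                missing_events: next_id - expected_next,
            });
        }
    }
    gaps
}

fn main() {
    // Record 1 holds 3 events, record 4 holds 2, so the next ID should be 6.
    // Seeing record 10 instead implies 4 events were skipped.
    let records: [(u64, u64); 3] = [(1, 3), (4, 2), (10, 1)];
    println!("events in first record: {:?}", events_in_record(1, 4));
    println!("gaps: {:?}", find_gaps(&records));
}
```

A workload-side checker could apply the same arithmetic to IDs it observes, but only if it can see record IDs at all, which the notes do not establish.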
+
+## Known bugs / incidents (HIGH-VALUE Antithesis targets)
+
+1. **Ledger `total_buffer_size` AtomicU64 underflow → permanent writer deadlock**
+   (Vector #21683, partially mitigated by PR #23561 on the *reporter* side only;
+   the ledger atomic still wraps).
+   - `decrement_total_buffer_size` (ledger.rs ~291-298) does raw
+     `fetch_sub(amount, AcqRel)` with **no saturation**. If `amount >
+     current_value`, the atomic wraps to ≈ 2^64.
+   - Then `total_buffer_size + unflushed_bytes` is always astronomical →
+     `is_buffer_full()` returns true forever → `can_write_record()` false forever
+     → writer's `ensure_ready_for_write()` (writer.rs ~1001-1020) loops on
+     `ledger.wait_for_reader().await` and never recovers. **Writer deadlocks
+     permanently.**
+   - Trigger: crash/reboot/abrupt-shutdown that leaves a data file whose on-disk
+     size and readable-record bytes disagree, combined with the reader running
+     through that file on restart. Partial writes at file-rotation boundaries are
+     the most plausible cause. Not deterministic per-restart, but not exotic.
+   - Reporter-side gauges use `saturating_sub` (PR #23561) so the *dashboard*
+     no longer shows 2^64, but the ledger control-path atomic is unfixed.
+
+2. **Disk buffer stall + silent event drops during config reload**
+   (Vector #24948, PR #24949; directly implicated in an **internal non-prod
+   config-reload incident**).
+   - Old writer dropped while events still in-flight → events lost without
+     accounting.
+   - `track_dropped_events` passes `0` for `byte_size` → permanent drift in
+     buffer-size metrics.
+   - `synchronize_buffer_usage()` re-seeds metrics while the old reporter may
+     still run → double-counted metric spikes; then a metrics gap between old
+     reporter teardown and the first tick (2s) of the new reporter.
+
+3. **`component_discarded_events_total` blind to buffer drops** (Vector #24606,
+   #24144). When a disk buffer fills and `drop_newest` fires, only
+   `buffer_discarded_events_total` increments; the component-level discarded
+   counter stays 0 → silent data loss on dashboards. `BufferEventsDropped::emit()`
+   in `lib/vector-buffers/src/internal_events.rs` never calls
+   `ComponentEventsDropped`.
+
+4. **Buffer size gauges stuck non-zero / negative** (Vector #23995, #17666,
+   #21683). Reporter `current() = total_entered.saturating_sub(total_left)`;
+   stuck-at-non-zero still open.
+
+5. **Component tags lost for sinks using disk buffers** (OPA-5380): components
+   paused for IO at init time lose `component_*` labels on later-registered
+   metrics (utilization, etc.).
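
The wraparound described in item 1 can be reproduced with the standard library alone. The sketch below is a standalone illustration, not the code in `ledger.rs`: `decrement_wrapping`, `decrement_saturating`, and this simplified `is_buffer_full` are hypothetical stand-ins, and the saturating variant is one possible guard rather than a fix the project has adopted.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Unguarded decrement: `fetch_sub` on an atomic wraps on underflow.
/// Returns the value stored after the subtraction.
fn decrement_wrapping(total: &AtomicU64, amount: u64) -> u64 {
    let previous = total.fetch_sub(amount, Ordering::AcqRel);
    previous.wrapping_sub(amount)
}

/// Guarded decrement: clamp at zero using a compare-and-swap loop.
/// Returns the value stored after the subtraction.
fn decrement_saturating(total: &AtomicU64, amount: u64) -> u64 {
    let previous = total
        .fetch_update(Ordering::AcqRel, Ordering::Acquire, |current| {
            Some(current.saturating_sub(amount))
        })
        .expect("closure always returns Some");
    previous.saturating_sub(amount)
}

/// Simplified fullness check (assumption: the real check has this shape).
fn is_buffer_full(total_buffer_size: u64, unflushed_bytes: u64, max_size: u64) -> bool {
    total_buffer_size.saturating_add(unflushed_bytes) >= max_size
}

fn main() {
    let max_size = 256 * 1024 * 1024;

    // Ledger believes 100 bytes are buffered, reader reports 150 bytes consumed.
    let wrapped = decrement_wrapping(&AtomicU64::new(100), 150);
    println!("wrapping:   {wrapped} full={}", is_buffer_full(wrapped, 0, max_size));

    let clamped = decrement_saturating(&AtomicU64::new(100), 150);
    println!("saturating: {clamped} full={}", is_buffer_full(clamped, 0, max_size));
}
```

With the wrapping variant the stored value lands just below 2^64, so any fullness check of this shape stays true no matter how much the reader drains afterwards. Clamping hides the accounting mismatch rather than explaining it, which is why an assertion at the subtraction site is still useful.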
+
+## Existing test strategy (so we don't duplicate it)
+
+- In-repo: extensive `proptest` + **model-based testing** under
+  `variants/disk_v2/tests/model/` (a reference model + action sequencer +
+  in-memory filesystem). Unit tests for acknowledgements, initialization,
+  known_errors, size_limits, invariants, record.
+- Datadog internal: an E2E **chaos test** that SIGKILLs the worker 3× with e2e acks
+  enabled and asserts every event is delivered end-to-end. Antithesis should go
+  beyond: explore fault *timing/interleavings* (partial writes at rotation,
+  fsync-vs-crash windows, reader/writer races on the mmap'd ledger) that a fixed
+  3×SIGKILL test cannot.
+- A **major lock-contention performance issue** affected all disk-buffer users
+  (writer throughput ~90 MiB/s capped by contention) — points at writer/reader
+  coordination hot paths.
+
+## Notes on faults
+
+- Crash-recovery properties require **node termination faults** (often disabled
+  by default in Antithesis tenants) — flag this in the catalog.
+- The disk buffer is **single-process** (intra-Vector reader+writer sharing an
+  mmap'd ledger). Network/partition faults are largely irrelevant to the buffer
+  itself; the strong levers are node kill/restart, node hang, CPU throttling
+  (exposes the fsync/flush timing windows and lock contention), and filesystem
+  state across restart.
@@ -0,0 +1,174 @@
+---
+sut_path: /home/ssm-user/src/vector
+commit: b7aae737cef5dd37d1445915443a1eb97b584f85
+updated: 2026-05-28
+external_references:
+  - path: lib/vector-buffers/src/variants/disk_v2/mod.rs
+    why: Confirms the buffer is single-process (intra-Vector reader+writer over an mmap'd ledger)
+  - path: (internal design doc, not linked)
+    why: Disk buffer is configured per-sink; e2e acks require a supporting source; at-least-once semantics
+  - path: (internal design doc, not linked)
+    why: Existing chaos test crashes the worker with SIGKILL x3 + e2e acks — the topology must support repeated kill/restart
+  - path: distribution/docker/
+    why: Existing Vector Dockerfiles to reuse/adapt for the SUT container
+---
+
+# Deployment Topology: Disk Buffer v2
+
+## Key fact driving the design
+
+The disk buffer is **single-process**: the reader, writer, and finalizer all run
+inside one Vector process, coordinating through an `mmap`'d ledger and the local
+filesystem. There is **no network, no peer, no quorum**. Therefore:
+
+- The strong fault levers are **node termination (kill/restart)**, **node hang**,
+  **CPU throttling**, **clock jitter**, and **filesystem state across restart** —
+  NOT network partitions or bad-node faults (those are irrelevant to the buffer).
+- The topology is minimal: **one SUT container + one workload/client container.**
+  No dependency containers are needed (no S3/Kafka/Postgres) — the buffer's only
+  "dependency" is the local filesystem.
+
+## Topology
+
+```text
++-----------------------------+         events (HTTP, e2e-ack-capable source)
+|  workload (client)          |  ----------------------------------------->  +-----------------------------+
+|  - produces unique event IDs|                                              |  vector (SUT)               |
+|  - HTTP collector endpoint  |  <-----------------------------------------  |  source -> sink(disk buffer)|
+|  - tracks produced/delivered|         sink delivers here (HTTP sink)        |  data_dir on PERSISTENT vol |
+|  - emits Antithesis asserts |                                              +-----------------------------+
+|  - test template /opt/...   |                                                      |  Antithesis injects
++-----------------------------+                                                      |  node-kill / hang /
+                                                                                     |  CPU-throttle / clock
+                                                                                     v  faults HERE
+                                                                            +-----------------------------+
+                                                                            | persistent volume           |
+                                                                            | <data_dir>/buffer/v2/<id>/  |
+                                                                            +-----------------------------+
+```
+
+## Containers
+
+### 1. `vector` — Service (the SUT)
+
+- **Image:** adapt an existing Dockerfile from `distribution/docker/` (Debian or
+  Distroless). Two build variants:
+  - **Baseline build:** stock Vector — exercises all workload-observable
+    properties (durability, at-least-once, deadlock-via-throughput-stall, metric
+    correctness, recovery).
+  - **Instrumented build (recommended for the deadlock/corruption cluster):**
+    Vector built with the **Antithesis Rust SDK** added as a dependency to
+    `lib/vector-buffers`, with the missing SUT-side assertions inserted (see
+    "SUT-side instrumentation" below). This is the only way to directly assert
+    the internal states (`total-buffer-size-never-underflows`,
+    `record-id-monotonicity-holds`, `partial-write-at-rotation-recovers`,
+    `graceful-shutdown-flushes-all`/`unflushed_bytes==0`) that are invisible from
+    the workload.
+- **Runs:** a single `vector` process with a config:
+  - `source`: an e2e-ack-capable source the workload can push to. Prefer
+    `datadog_agent` or `http_server` with `acknowledgements: true` (needed for
+    `every-written-event-eventually-delivered` and the durable-survival
+    properties). Keep one source.
+  - `sink`: an `http` sink with `buffer: { type: disk, max_size: <~256MB+>,
+    when_full: block }`, posting to the workload's collector. A second
+    config/run uses `when_full: drop_newest` for `dropped-events-are-counted`.
+  - Internal metrics exposed (e.g. `internal_metrics` → `prometheus_exporter`)
+    so the workload can read `buffer_*` / `component_discarded_events_total` for
+    the metric-correctness properties.
+- **CRITICAL — persistent buffer storage:** the disk-buffer `data_dir` MUST be on
+  storage that **survives the container's kill/restart**. Disk-buffer durability
+  is the whole point; if Antithesis node-termination recreates the container with
+  a fresh filesystem, the buffer is wiped and every crash-recovery property
+  passes vacuously (or fails spuriously). Mount `<data_dir>` on a persistent
+  volume. **Confirm with the user how their tenant's node-termination interacts
+  with filesystem persistence.**
+- **Faults target this container:** node kill/restart (required by Categories
+  2–6), node hang, CPU throttle (widens fsync/lock-contention windows), clock
+  jitter (perturbs the 500ms `should_flush` deadline).
+- **Replica count:** 1. (No replication; more instances add nothing.)
+- **Tuning for bug-finding:** set a small `max_data_file_size` (e.g. 1MB) and a
+  small `max_size` to maximize file-rotation frequency and reach the rotation/
+  partial-write window faster; optionally set `flush_interval` low to widen the
+  durably-written set, or high to widen the loss window — test both.
+
+### 2. `workload` — Client (the test driver)
+
+- **Image:** a small Rust (or Go) container with the matching **Antithesis SDK**
+  (Rust preferred, to match the SUT language) for emitting assertions. Includes
+  the test template at `/opt/antithesis/test/v1/{name}/`.
+- **Runs:**
+  1. Starts an HTTP **collector** endpoint (the sink's destination) that records
+     every delivered event ID (counting duplicates).
+  2. Emits `setup_complete` once it and Vector are ready.
+  3. Sleeps so Antithesis can run test-template commands.
+- **Test-template commands** drive: produce a stream of uniquely-IDed events to
+  Vector's source; periodically (via `ANTITHESIS_STOP_FAULTS` quiet periods)
+  drain and assert liveness/at-least-once; inspect Vector's metrics; toggle the
+  collector to return errors (for `sink-failure-not-silently-acked`); trigger a
+  config reload (custom fault, for `config-reload-no-silent-loss`).
+- **Assertions emitted here** (workload-observable properties): at-least-once
+  set-difference, no-loss-on-graceful-shutdown, drop accounting vs metric, writer
+  throughput resumes after recovery (deadlock signal), buffer gauges return to ~0
+  on drained restart.
+- **Replica count:** 1.
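
The at-least-once set-difference named in the assertions bullet above could be computed roughly as follows. This is a minimal sketch under stated assumptions: event IDs are modeled as `u64`, `check_delivery` and `DeliveryReport` are hypothetical names, and `produced` is assumed to contain only events the source actually acknowledged. Wiring the result into Antithesis SDK assertions is left out.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical summary of one drain-and-compare pass.
#[derive(Debug, PartialEq, Eq)]
struct DeliveryReport {
    /// Produced (and acknowledged) IDs the collector never saw.
    missing: Vec<u64>,
    /// Extra deliveries beyond the first for any ID.
    duplicates: usize,
    /// Delivered IDs that were never produced.
    unexpected: Vec<u64>,
}

/// Compare the set of produced IDs against the collector's delivery log.
fn check_delivery(produced: &HashSet<u64>, delivered: &[u64]) -> DeliveryReport {
    // Count how many times each ID reached the collector.
    let mut counts: HashMap<u64, usize> = HashMap::new();
    for id in delivered {
        *counts.entry(*id).or_insert(0) += 1;
    }

    let mut missing: Vec<u64> = produced
        .iter()
        .copied()
        .filter(|id| !counts.contains_key(id))
        .collect();
    missing.sort_unstable();

    let duplicates: usize = counts.values().map(|n| n - 1).sum();

    let mut unexpected: Vec<u64> = counts
        .keys()
        .copied()
        .filter(|id| !produced.contains(id))
        .collect();
    unexpected.sort_unstable();

    DeliveryReport { missing, duplicates, unexpected }
}

fn main() {
    let produced: HashSet<u64> = [1, 2, 3, 4].into_iter().collect();
    // ID 2 was replayed after a restart; ID 3 never arrived.
    let delivered = [1, 2, 2, 4];
    let report = check_delivery(&produced, &delivered);
    println!("{report:?}");
    println!("at-least-once holds: {}", report.missing.is_empty());
}
```

At-least-once holds when `missing` is empty after a quiet-period drain. A non-zero `duplicates` is expected after kill/restart and should not fail the property.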
+
+## SUT-side instrumentation (for the instrumented build)
+
+The repo does not use the Antithesis SDK today (see `existing-assertions.md`).
+For the internal-state properties, add `antithesis-sdk` to
+`lib/vector-buffers/Cargo.toml` and insert the following assertions (all
+currently MISSING):
+
+- `assert_unreachable!` / `assert_always!(amount <= current)` at the two unguarded
+  subtraction sites: `ledger.rs:~292` and `reader.rs:~524`
+  (`total-buffer-size-never-underflows`).
+- `assert_sometimes!(writer_unblocked_after_full)` after `ensure_ready_for_write`
+  exits its wait loop; `assert_unreachable!` on repeated no-progress wakeups
+  (`writer-eventually-makes-progress`).
+- `assert_unreachable!` at the monotonicity panic `reader.rs:~482`
+  (`record-id-monotonicity-holds`).
+- `assert_always_or_unreachable!` at the record-emission point `reader.rs:~1131`
+  (`no-corrupted-record-delivered`) and `assert_sometimes!` in the
+  `is_bad_read` branch `reader.rs:~1035` (`corruption-is-detected-and-recovered`).
+- `assert_sometimes!(torn_tail_recovered)` in the `validate_last_write`
+  recovery branches (`partial-write-at-rotation-recovers`).
+- `assert_always!(unflushed_bytes == 0)` inside `close()`
+  (`graceful-shutdown-flushes-all`).
+
+These assertions are no-ops outside Antithesis, so the instrumented build is safe
+to run normally.
+
+## Custom faults required
+
+- **Config reload** (`config-reload-no-silent-loss`): a custom fault that sends
+  `SIGHUP` to the Vector process (or swaps the config file and triggers reload),
+  fired under sustained load.
+- **Downstream sink error** (`sink-failure-not-silently-acked`): the workload's
+  collector returns 5xx for a window, or a custom fault toggles it.
+
+## SDKs
+
+- **Workload:** Antithesis Rust SDK (or Go SDK) — required to emit assertions and
+  `setup_complete`, and to draw random numbers for the producer.
+- **SUT:** Antithesis Rust SDK only for the instrumented build.
+
+## Simplicity note
+
+Two containers, one network link, no external dependency services. Every
+container is justified: the SUT runs the buffer; the workload produces/observes
+and asserts. We deliberately exclude S3/Kafka/etc. — the disk buffer has no such
+dependency. The only non-obvious requirement is the **persistent volume for the
+buffer data_dir**, which is essential for crash-durability testing to be
+meaningful.
+
+## Open Questions
+
+- How does the target Antithesis tenant's node-termination fault interact with
+  container filesystem persistence? (Determines whether the buffer survives a
+  modeled crash — essential.)
+- Are node-termination and clock faults enabled in the tenant? (Categories 2–6
+  need kill/restart.)
+- Which e2e-ack-capable source is easiest to drive from the workload —
+  `http_server`, `datadog_agent`, or `socket`? (Affects workload protocol.)
+- Is config reload feasible as a custom fault (SIGHUP) in the harness, or must the
+  workload drive it via Vector's API?