4 changes: 4 additions & 0 deletions .github/actions/spelling/excludes.txt
@@ -2,3 +2,7 @@
^\.github/actions/spelling/
^\Qbenches/codecs/moby_dick.txt\E$
^\Qwebsite/layouts/shortcodes/config/unit-tests.html\E$
# Antithesis test harness: research notes + exerciser examples carry heavy jargon (sancov, libvoidstar, rkyv, vdbuf, lossfinder, ...)
^tests/antithesis/
^\Qlib/vector-buffers/examples/disk_v2_antithesis.rs\E$
^\Qlib/vector-buffers/examples/disk_v2_lossfinder.rs\E$
10 changes: 10 additions & 0 deletions tests/antithesis/scratchbook/.markdownlint.jsonc
@@ -0,0 +1,10 @@
// Scoped relaxations for the Antithesis research scratchbook. These are internal
// working notes (dense property tables, ad-hoc code fences), not published docs,
// so a few cosmetic rules are disabled here only — the repo-wide config still
// applies everywhere else.
{
"extends": "../../../.markdownlint.jsonc",
"MD060": false, // table-column-style: the property-catalog tables use empty-header `| | |` 2-col layout
"MD040": false, // fenced-code-language: many ad-hoc evidence snippets are intentionally language-less
"MD022": false // blanks-around-headings: relaxed for the dense note format
}
113 changes: 113 additions & 0 deletions tests/antithesis/scratchbook/_external-references-digest.md
@@ -0,0 +1,113 @@
# External References Digest (working note for discovery agents)

This is scaffolding for the antithesis-research run on **disk buffers v2**
(`lib/vector-buffers/src/variants/disk_v2/`). The user's scope answer was
*"Whatever you have access to. You have your MCPs."*, so in-repo docs/RFCs plus
Datadog internal docs and Jira were consulted. The key findings are condensed
below so per-focus agents don't need to re-fetch them.

## In-repo references

- `rfcs/2021-10-14-9477-buffer-improvements.md` — original buffer-rework RFC.
- `docs/specs/buffer.md` — buffer component spec / claimed behavior.
- `lib/vector-buffers/src/variants/disk_v2/mod.rs` — authoritative design doc
(module-level comment): on-disk format, ledger, record IDs, recovery.

## Claimed guarantees (from `mod.rs` design doc + buffer spec + internal doc)

- Data files never exceed 128MB; ≤ 65,536 files; buffer ≤ ~8TB.
- All records checksummed with **CRC32C**; records written
sequentially/contiguously; a record never spans two data files.
- Writers create+write data files; readers read+delete them. Reader deletes a
data file **only after all records in it are acknowledged** (whole-file
deletion, never partial truncation).
- Ledger (`buffer.db`, memory-mapped) tracks `writer_next_record_id`,
`writer_current_data_file_id`, `reader_current_data_file_id`,
`reader_last_record_id`. Fields updated atomically, but **not** atomically
w.r.t. reader/writer activity.
- Record IDs are monotonic and encode event count: record ID N with next record
M means the record holds M−N events. Used to compute buffer event count and to
detect gaps / dropped events after corruption.
- **Durability:** data is fsync'd every **500ms** (`DEFAULT_FLUSH_INTERVAL`).
Page-cache flush happens on every `flush()` (readers see data immediately on
Linux); full fsync only every 500ms. **Data-loss window on crash = up to 500ms
of unsynced writes** (when e2e acks off). Graceful shutdown flushes everything
→ no loss.
- Min buffer `max_size` ~256MB; `DEFAULT_MAX_DATA_FILE_SIZE` 128MB;
`DEFAULT_MAX_RECORD_SIZE` = 128MB; `DEFAULT_WRITE_BUFFER_SIZE` 256KB.
- Endianness: files are host-endian; not portable across architectures.
- Delivery semantics with e2e acks + disk buffer = **at-least-once**: crash after
buffer write but before downstream ack → replay on restart → **possible
duplicates** (downstream must dedup).
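
The record-ID arithmetic claimed above (record ID N followed by record ID M holds M−N events) can be sketched as follows. This is a minimal illustration, not code from `disk_v2`: the helper names `events_in_record` and `missing_events` are hypothetical, and the sketch uses checked arithmetic, which may differ from how the real reader handles ID wraparound.

```rust
/// Hypothetical helper: number of events held by the record with ID `id`,
/// given that the next record's ID is `next_id`.
/// Returns `None` if `next_id < id` (IDs are expected to be monotonic).
fn events_in_record(id: u64, next_id: u64) -> Option<u64> {
    next_id.checked_sub(id)
}

/// Hypothetical helper: number of events skipped between the last record
/// read (`last_id`, holding `last_event_count` events) and the next record
/// actually observed on disk (`observed_id`).
/// `Some(0)` means no gap; `None` means the observed ID went backwards
/// (or the expected ID overflowed).
fn missing_events(last_id: u64, last_event_count: u64, observed_id: u64) -> Option<u64> {
    let expected_next = last_id.checked_add(last_event_count)?;
    observed_id.checked_sub(expected_next)
}

fn main() {
    // Record 10 followed by record 14: record 10 holds 4 events.
    assert_eq!(events_in_record(10, 14), Some(4));

    // After record 10 (4 events) the next ID should be 14.
    assert_eq!(missing_events(10, 4, 14), Some(0));

    // Seeing 20 instead means 6 events were skipped, e.g. after corruption.
    assert_eq!(missing_events(10, 4, 20), Some(6));

    // An ID that moves backwards is a monotonicity violation.
    assert_eq!(missing_events(10, 4, 12), None);
}
```

The same subtraction is what lets a workload or an SUT-side assertion turn "a record was skipped" into a concrete count of dropped events.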

## Known bugs / incidents (HIGH-VALUE Antithesis targets)

1. **Ledger `total_buffer_size` AtomicU64 underflow → permanent writer deadlock**
(Vector #21683, partially mitigated by PR #23561 on the *reporter* side only;
the ledger atomic still wraps).
- `decrement_total_buffer_size` (ledger.rs ~291-298) does raw
`fetch_sub(amount, AcqRel)` with **no saturation**. If `amount >
current_value`, the atomic wraps to ≈ 2^64.
- Then `total_buffer_size + unflushed_bytes` is always astronomical →
`is_buffer_full()` returns true forever → `can_write_record()` false forever
→ writer's `ensure_ready_for_write()` (writer.rs ~1001-1020) loops on
`ledger.wait_for_reader().await` and never recovers. **Writer deadlocks
permanently.**
- Trigger: crash/reboot/abrupt-shutdown that leaves a data file whose on-disk
size and readable-record bytes disagree, combined with the reader running
through that file on restart. Partial writes at file-rotation boundaries are
the most plausible cause. Not deterministic per-restart, but not exotic.
- Reporter-side gauges use `saturating_sub` (PR #23561) so the *dashboard*
no longer shows 2^64, but the ledger control-path atomic is unfixed.

2. **Disk buffer stall + silent event drops during config reload**
(Vector #24948, PR #24949; directly implicated in an **internal non-prod
config-reload incident**).
- Old writer dropped while events still in-flight → events lost without
accounting.
- `track_dropped_events` passes `0` for `byte_size` → permanent drift in
buffer-size metrics.
- `synchronize_buffer_usage()` re-seeds metrics while the old reporter may
still run → double-counted metric spikes; then a metrics gap between old
reporter teardown and the first tick (2s) of the new reporter.

3. **`component_discarded_events_total` blind to buffer drops** (Vector #24606,
#24144). When a disk buffer fills and `drop_newest` fires, only
`buffer_discarded_events_total` increments; the component-level discarded
counter stays 0 → silent data loss on dashboards. `BufferEventsDropped::emit()`
in `lib/vector-buffers/src/internal_events.rs` never calls
`ComponentEventsDropped`.

4. **Buffer size gauges stuck non-zero / negative** (Vector #23995, #17666,
#21683). Reporter `current() = total_entered.saturating_sub(total_left)`;
stuck-at-non-zero still open.

5. **Component tags lost for sinks using disk buffers** (OPA-5380): components
paused for IO at init time lose `component_*` labels on later-registered
metrics (utilization, etc.).
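
The wrap described in item 1 can be reproduced in isolation. The sketch below is a stand-in, not Vector's ledger: `SizeCounter`, `decrement_raw`, `decrement_saturating`, and `is_full` are hypothetical names, and `is_full` uses `saturating_add` only so the demo does not panic on overflow in debug builds. It contrasts a raw `fetch_sub` with a saturating decrement built on `AtomicU64::fetch_update`.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical stand-in for the ledger's total-size counter.
struct SizeCounter {
    total_buffer_size: AtomicU64,
}

impl SizeCounter {
    fn new(initial: u64) -> Self {
        Self { total_buffer_size: AtomicU64::new(initial) }
    }

    /// Raw decrement: `fetch_sub` wraps around on underflow.
    /// Returns the value stored after the decrement.
    fn decrement_raw(&self, amount: u64) -> u64 {
        let previous = self.total_buffer_size.fetch_sub(amount, Ordering::AcqRel);
        previous.wrapping_sub(amount)
    }

    /// Saturating decrement: clamps at zero instead of wrapping.
    /// Returns the value stored after the decrement.
    fn decrement_saturating(&self, amount: u64) -> u64 {
        let previous = self
            .total_buffer_size
            .fetch_update(Ordering::AcqRel, Ordering::Acquire, |current| {
                Some(current.saturating_sub(amount))
            })
            .expect("closure always returns Some");
        previous.saturating_sub(amount)
    }

    /// Simplified fullness check: size on disk plus unflushed bytes vs. the limit.
    fn is_full(&self, unflushed_bytes: u64, max_size: u64) -> bool {
        let total = self.total_buffer_size.load(Ordering::Acquire);
        total.saturating_add(unflushed_bytes) >= max_size
    }
}

fn main() {
    let max_size = 256 * 1024 * 1024;

    // Decrementing by more than the current value wraps to 2^64 - 1.
    let raw = SizeCounter::new(100);
    assert_eq!(raw.decrement_raw(101), u64::MAX);
    // From here on the buffer looks full no matter how much the reader frees.
    assert!(raw.is_full(0, max_size));

    // The saturating variant clamps at zero, so the writer can proceed.
    let saturating = SizeCounter::new(100);
    assert_eq!(saturating.decrement_saturating(101), 0);
    assert!(!saturating.is_full(0, max_size));
}
```

Saturation only removes the deadlock symptom; the size accounting is still wrong whenever the clamp fires, which is why detecting `amount > current` directly is the more useful signal for a test run.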

## Existing test strategy (so we don't duplicate it)

- In-repo: extensive `proptest` + **model-based testing** under
`variants/disk_v2/tests/model/` (a reference model + action sequencer +
in-memory filesystem). Unit tests for acknowledgements, initialization,
known_errors, size_limits, invariants, record.
- Datadog internal: an E2E **chaos test** that SIGKILLs the worker 3× with e2e acks
enabled and asserts every event is delivered end-to-end. Antithesis should go
beyond: explore fault *timing/interleavings* (partial writes at rotation,
fsync-vs-crash windows, reader/writer races on the mmap'd ledger) that a fixed
3×SIGKILL test cannot.
- A **major lock-contention performance issue** affected all disk-buffer users
(writer throughput ~90 MiB/s capped by contention) — points at writer/reader
coordination hot paths.

## Notes on faults

- Crash-recovery properties require **node termination faults** (often disabled
by default in Antithesis tenants) — flag this in the catalog.
- The disk buffer is **single-process** (intra-Vector reader+writer sharing an
mmap'd ledger). Network/partition faults are largely irrelevant to the buffer
itself; the strong levers are node kill/restart, node hang, CPU throttling
(exposes the fsync/flush timing windows and lock contention), and filesystem
state across restart.
174 changes: 174 additions & 0 deletions tests/antithesis/scratchbook/deployment-topology.md
@@ -0,0 +1,174 @@
---
sut_path: /home/ssm-user/src/vector
commit: b7aae737cef5dd37d1445915443a1eb97b584f85
updated: 2026-05-28
external_references:
- path: lib/vector-buffers/src/variants/disk_v2/mod.rs
why: Confirms the buffer is single-process (intra-Vector reader+writer over an mmap'd ledger)
- path: (internal design doc, not linked)
why: Disk buffer is configured per-sink; e2e acks require a supporting source; at-least-once semantics
- path: (internal design doc, not linked)
why: Existing chaos test crashes the worker with SIGKILL x3 + e2e acks — the topology must support repeated kill/restart
- path: distribution/docker/
why: Existing Vector Dockerfiles to reuse/adapt for the SUT container
---

# Deployment Topology: Disk Buffer v2

## Key fact driving the design

The disk buffer is **single-process**: the reader, writer, and finalizer all run
inside one Vector process, coordinating through an `mmap`'d ledger and the local
filesystem. There is **no network, no peer, no quorum**. Therefore:

- The strong fault levers are **node termination (kill/restart)**, **node hang**,
**CPU throttling**, **clock jitter**, and **filesystem state across restart** —
NOT network partitions or bad-node faults (those are irrelevant to the buffer).
- The topology is minimal: **one SUT container + one workload/client container.**
No dependency containers are needed (no S3/Kafka/Postgres) — the buffer's only
"dependency" is the local filesystem.

## Topology

```text
+-----------------------------+ events (HTTP, e2e-ack-capable source)
| workload (client) | -----------------------------------------> +-----------------------------+
| - produces unique event IDs| | vector (SUT) |
| - HTTP collector endpoint | <----------------------------------------- | source -> sink(disk buffer)|
| - tracks produced/delivered| sink delivers here (HTTP sink) | data_dir on PERSISTENT vol |
| - emits Antithesis asserts | +-----------------------------+
| - test template /opt/... | | Antithesis injects
+-----------------------------+ | node-kill / hang /
| CPU-throttle / clock
v faults HERE
+-----------------------------+
| persistent volume |
| <data_dir>/buffer/v2/<id>/ |
+-----------------------------+
```

## Containers

### 1. `vector` — Service (the SUT)

- **Image:** adapt an existing Dockerfile from `distribution/docker/` (Debian or
Distroless). Two build variants:
- **Baseline build:** stock Vector — exercises all workload-observable
properties (durability, at-least-once, deadlock-via-throughput-stall, metric
correctness, recovery).
- **Instrumented build (recommended for the deadlock/corruption cluster):**
Vector built with the **Antithesis Rust SDK** added as a dependency to
`lib/vector-buffers`, with the missing SUT-side assertions inserted (see
"SUT-side instrumentation" below). This is the only way to directly assert
the internal states (`total-buffer-size-never-underflows`,
`record-id-monotonicity-holds`, `partial-write-at-rotation-recovers`,
`graceful-shutdown-flushes-all`/`unflushed_bytes==0`) that are invisible from
the workload.
- **Runs:** a single `vector` process with a config:
- `source`: an e2e-ack-capable source the workload can push to. Prefer
`datadog_agent` or `http_server` with `acknowledgements: true` (needed for
`every-written-event-eventually-delivered` and the durable-survival
properties). Keep one source.
- `sink`: an `http` sink with `buffer: { type: disk, max_size: <~256MB+>,
when_full: block }`, posting to the workload's collector. A second
config/run uses `when_full: drop_newest` for `dropped-events-are-counted`.
- Internal metrics exposed (e.g. `internal_metrics` → `prometheus_exporter`)
so the workload can read `buffer_*` / `component_discarded_events_total` for
the metric-correctness properties.
- **CRITICAL — persistent buffer storage:** the disk-buffer `data_dir` MUST be on
storage that **survives the container's kill/restart**. Disk-buffer durability
is the whole point; if Antithesis node-termination recreates the container with
a fresh filesystem, the buffer is wiped and every crash-recovery property
passes vacuously (or fails spuriously). Mount `<data_dir>` on a persistent
volume. **Confirm with the user how their tenant's node-termination interacts
with filesystem persistence.**
- **Faults target this container:** node kill/restart (required by Categories
2–6), node hang, CPU throttle (widens fsync/lock-contention windows), clock
jitter (perturbs the 500ms `should_flush` deadline).
- **Replica count:** 1. (No replication; more instances add nothing.)
- **Tuning for bug-finding:** set a small `max_data_file_size` (e.g. 1MB) and a
small `max_size` to maximize file-rotation frequency and reach the rotation/
partial-write window faster; optionally set `flush_interval` low to widen the
durably-written set, or high to widen the loss window — test both.

### 2. `workload` — Client (the test driver)

- **Image:** a small Rust container with the **Antithesis Rust SDK** (to match
  the SUT language and emit assertions), or a Go container with the Go SDK.
  Includes the test template at `/opt/antithesis/test/v1/{name}/`.
- **Runs:**
1. Starts an HTTP **collector** endpoint (the sink's destination) that records
every delivered event ID (counting duplicates).
2. Emits `setup_complete` once it and Vector are ready.
3. Sleeps so Antithesis can run test-template commands.
- **Test-template commands** drive: produce a stream of uniquely-IDed events to
Vector's source; periodically (via `ANTITHESIS_STOP_FAULTS` quiet periods)
drain and assert liveness/at-least-once; inspect Vector's metrics; toggle the
collector to return errors (for `sink-failure-not-silently-acked`); trigger a
config reload (custom fault, for `config-reload-no-silent-loss`).
- **Assertions emitted here** (workload-observable properties): at-least-once
set-difference, no-loss-on-graceful-shutdown, drop accounting vs metric, writer
throughput resumes after recovery (deadlock signal), buffer gauges return to ~0
on drained restart.
- **Replica count:** 1.
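
The at-least-once set-difference check that the workload runs during a quiet period could look roughly like the sketch below. It is illustrative only: `DeliveryReport` and `check_at_least_once` are hypothetical names, event IDs are assumed to be `u64`, and `produced` is assumed to contain only events whose send Vector acknowledged.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical summary of one drain-and-compare pass.
#[derive(Debug, PartialEq)]
struct DeliveryReport {
    /// Produced IDs that never reached the collector (at-least-once violated).
    missing: Vec<u64>,
    /// Extra deliveries beyond the first for any ID (allowed under at-least-once).
    duplicates: u64,
    /// Delivered IDs that were never produced (indicates corruption or a harness bug).
    unexpected: Vec<u64>,
}

/// Compare the set of produced event IDs against every delivery the
/// collector recorded, duplicates included.
fn check_at_least_once(produced: &HashSet<u64>, delivered: &[u64]) -> DeliveryReport {
    // Count how many times each ID was delivered.
    let mut counts: HashMap<u64, u64> = HashMap::new();
    for id in delivered {
        *counts.entry(*id).or_insert(0) += 1;
    }

    let mut missing: Vec<u64> = produced
        .iter()
        .copied()
        .filter(|id| !counts.contains_key(id))
        .collect();
    missing.sort_unstable();

    let mut unexpected: Vec<u64> = counts
        .keys()
        .copied()
        .filter(|id| !produced.contains(id))
        .collect();
    unexpected.sort_unstable();

    // Every delivery after the first one for an ID counts as a duplicate.
    let duplicates: u64 = counts.values().map(|count| count - 1).sum();

    DeliveryReport { missing, duplicates, unexpected }
}

fn main() {
    let produced: HashSet<u64> = [1, 2, 3, 4].into_iter().collect();
    // ID 2 was replayed after a restart, ID 3 was lost, ID 9 was never produced.
    let delivered = [1, 2, 2, 4, 9];

    let report = check_at_least_once(&produced, &delivered);
    assert_eq!(report.missing, vec![3]);
    assert_eq!(report.duplicates, 1);
    assert_eq!(report.unexpected, vec![9]);
}
```

In the workload, an empty `missing` list would back the always-style delivery assertion, while a non-zero `duplicates` count is a useful sometimes-style signal that a crash-and-replay path was actually exercised.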

## SUT-side instrumentation (for the instrumented build)

The repo has no Antithesis SDK dependency today (see `existing-assertions.md`).
For the internal-state properties, add `antithesis-sdk` to
`lib/vector-buffers/Cargo.toml` and insert the following assertions (all
currently MISSING):

- `assert_unreachable!` / `assert_always!(amount <= current)` at the two unguarded
subtraction sites: `ledger.rs:~292` and `reader.rs:~524`
(`total-buffer-size-never-underflows`).
- `assert_sometimes!(writer_unblocked_after_full)` after `ensure_ready_for_write`
exits its wait loop; `assert_unreachable!` on repeated no-progress wakeups
(`writer-eventually-makes-progress`).
- `assert_unreachable!` at the monotonicity panic `reader.rs:~482`
(`record-id-monotonicity-holds`).
- `assert_always_or_unreachable!` at the record-emission point `reader.rs:~1131`
(`no-corrupted-record-delivered`) and `assert_sometimes!` in the
`is_bad_read` branch `reader.rs:~1035` (`corruption-is-detected-and-recovered`).
- `assert_sometimes!(torn_tail_recovered)` in the `validate_last_write`
recovery branches (`partial-write-at-rotation-recovers`).
- `assert_always!(unflushed_bytes == 0)` inside `close()`
(`graceful-shutdown-flushes-all`).

These assertions are no-ops outside Antithesis, so the instrumented build is safe
to run normally.

## Custom faults required

- **Config reload** (`config-reload-no-silent-loss`): a custom fault that sends
`SIGHUP` to the Vector process (or swaps the config file and triggers reload),
fired under sustained load.
- **Downstream sink error** (`sink-failure-not-silently-acked`): the workload's
collector returns 5xx for a window, or a custom fault toggles it.

## SDKs

- **Workload:** Antithesis Rust SDK (or Go SDK) — required to emit assertions and
`setup_complete`, and to draw random numbers for the producer.
- **SUT:** Antithesis Rust SDK only for the instrumented build.

## Simplicity note

Two containers, one network link, no external dependency services. Every
container is justified: the SUT runs the buffer; the workload produces/observes
and asserts. We deliberately exclude S3/Kafka/etc. — the disk buffer has no such
dependency. The only non-obvious requirement is the **persistent volume for the
buffer data_dir**, which is essential for crash-durability testing to be
meaningful.

## Open Questions

- How does the target Antithesis tenant's node-termination fault interact with
container filesystem persistence? (Determines whether the buffer survives a
modeled crash — essential.)
- Are node-termination and clock faults enabled in the tenant? (Categories 2–6
need kill/restart.)
- Which e2e-ack-capable source is easiest to drive from the workload —
`http_server`, `datadog_agent`, or `socket`? (Affects workload protocol.)
- Is config reload feasible as a custom fault (SIGHUP) in the harness, or must the
workload drive it via Vector's API?