
[WIP] Add DB compression#2578

Draft
tylerkaraszewski wants to merge 16 commits into main from tyler-compress-db

Conversation

Contributor

@tylerkaraszewski commented Apr 14, 2026

Implement release 1 of https://docs.google.com/document/d/1b05o9uNkH-_INm6ihBRHj_B7Khc9orvMa7aMMQJPRe8/edit?tab=t.0

Details

Notes

How journal snippets were selected:

sudo readdb.sh -j -csv "SELECT id, query FROM journal ORDER BY id DESC LIMIT 100" > recent_journal.csv

Sample data for zstd training created with:

#!/usr/bin/env python3
# Split the exported journal CSV into one file per row, as zstd --train
# expects one sample per file.
import csv
import os
import sys

# Journal queries can be very large; lift the default CSV field size limit.
csv.field_size_limit(sys.maxsize)

os.makedirs("sample_data", exist_ok=True)

with open("recent_journal.csv", "r", encoding="utf-8", errors="replace") as f:
    reader = csv.reader(f)
    next(reader)  # skip header
    for row in reader:
        row_id, query = row[0], row[1]
        with open(f"sample_data/{row_id}.txt", "w", encoding="utf-8") as out:
            out.write(query)

Create the dictionary with:

find sample_data -name '*.txt' | xargs zstd --train -o journal.dict --dictID=1
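To confirm the trained dictionary actually carries --dictID=1, the first eight bytes of journal.dict can be checked directly: per the zstd format specification (RFC 8878), a formatted dictionary starts with the magic number 0xEC30A437 followed by a 4-byte little-endian dictionary ID. A minimal stdlib-only sketch (the helper name and the journal.dict path are illustrative):

```python
import struct

ZSTD_DICT_MAGIC = 0xEC30A437  # magic number of a formatted zstd dictionary (RFC 8878)

def zstd_dict_id(raw: bytes) -> int:
    """Return the dictionary ID embedded in a formatted zstd dictionary."""
    if len(raw) < 8:
        raise ValueError("too short to be a formatted zstd dictionary")
    magic, dict_id = struct.unpack_from("<II", raw, 0)  # both fields little-endian
    if magic != ZSTD_DICT_MAGIC:
        raise ValueError("missing zstd dictionary magic number")
    return dict_id

# Usage (path is illustrative):
# with open("journal.dict", "rb") as f:
#     assert zstd_dict_id(f.read()) == 1
```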

Currently synchronization sends compressed data to nodes that are catching up, but live replication is not compressed.

Fixed Issues

https://github.com/Expensify/Expensify/issues/601964

Tests


Internal Testing Reminder: when changing bedrock, please compile auth against your new changes

Add -lzstd to the Makefile LIBRARIES list to make the zstd
compression library available for upcoming journal compression work.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@tylerkaraszewski tylerkaraszewski self-assigned this Apr 14, 2026
@tylerkaraszewski tylerkaraszewski marked this pull request as draft April 14, 2026 19:38

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 35dac5fea3

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread: Makefile
tylerkaraszewski and others added 15 commits April 14, 2026 12:47
Add libzstd-dev to the GitHub Actions workflow (build and both test
jobs) and Docker build stage. Add libzstd1 to the Docker runtime
stage. This ensures the zstd library linked in the Makefile is
available in all build and runtime environments.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add BedrockPlugin_Zstd with zstdDictionaries table, dictionary storage
maps, and loading infrastructure. Fix ci_style.sh and style.sh third
std:: detection clause to require a leading space, preventing false
matches on identifiers like Zstd::.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add compress(data, dictID) and decompress(data) as SQLite application-
defined functions for zstd dictionary compression. compress() is a
no-op when dictID is 0 (the default). decompress() transparently
passes through uncompressed data via ZSTD_isFrame() check.

Also adds a non-SQL decompress() for use in the synchronization path,
and renames Zstd.h/cpp to Compression.h/cpp to avoid a case-
insensitive filename collision with the system <zstd.h> header.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
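The transparent passthrough described above hinges on ZSTD_isFrame(): data that does not begin with a zstd frame is returned unchanged. A stdlib-only Python sketch of the equivalent check (magic numbers per RFC 8878; function names are illustrative, not the PR's C++ API):

```python
ZSTD_FRAME_MAGIC = 0xFD2FB528            # standard zstd frame magic
ZSTD_SKIPPABLE_MAGIC_BASE = 0x184D2A50   # skippable frames use 0x184D2A50..0x184D2A5F

def is_zstd_frame(data: bytes) -> bool:
    """Mirror ZSTD_isFrame(): true for standard and skippable zstd frames."""
    if len(data) < 4:
        return False
    magic = int.from_bytes(data[:4], "little")
    return magic == ZSTD_FRAME_MAGIC or (magic & 0xFFFFFFF0) == ZSTD_SKIPPABLE_MAGIC_BASE

def decompress_or_passthrough(data: bytes, decompress):
    # Uncompressed journal rows (plain SQL text) fail the magic check and
    # pass through untouched, so pre-existing uncompressed commits keep working.
    return decompress(data) if is_zstd_frame(data) else data
```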
Add decompress() calls to getCommit() and getCommits() in SQLite.cpp
so journal queries transparently handle compressed data. On
uncompressed data this is a no-op (ZSTD_isFrame byte check only).

Also define ZSTD_STATIC_LINKING_ONLY to access ZSTD_isFrame from
the zstd static API.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@tylerkaraszewski
Contributor Author

@codex review


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c6ca10652d


         STHROW("failed to begin transaction");
     }
-    if (!_db.writeUnmodified(commit.content)) {
+    if (!_db.writeUnmodified(BedrockPlugin_Zstd::decompress(commit.content))) {

P1: Reload zstd dictionaries during sync replay

Ensure the dictionary cache is refreshed while replaying SYNCHRONIZE_RESPONSE commits. commit.content is decompressed using the in-memory map loaded once at startup, so a node that missed the zstdDictionaries insert commit can still receive that insert, and later compressed commits, in the same catch-up stream; the later decompress() lookup misses, returns the compressed bytes unchanged, and writeUnmodified(...) fails, preventing the node from catching up.
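One way to close the gap Codex describes (a sketch only; the class and callable names are hypothetical, and the real fix would live in the C++ sync path): re-read the zstdDictionaries table when a dictionary ID misses the in-memory map, so an insert replayed earlier in the same catch-up stream becomes visible.

```python
class DictCache:
    """Dictionary map with reload-on-miss, instead of load-once-at-startup."""

    def __init__(self, load_from_db):
        self._load = load_from_db      # callable returning {dict_id: dict_bytes}
        self._dicts = load_from_db()   # initial load at startup

    def get(self, dict_id):
        if dict_id not in self._dicts:
            # The zstdDictionaries insert may have been replayed earlier in
            # this catch-up stream; refresh from the table before giving up.
            self._dicts = self._load()
        return self._dicts.get(dict_id)

# Usage: simulate a table that gains dictionary 1 after startup.
state = {"rows": {}}
cache = DictCache(lambda: dict(state["rows"]))
state["rows"][1] = b"dict-bytes"   # replayed zstdDictionaries insert
cache.get(1)                       # refresh makes the new dictionary visible
```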


Contributor Author


This won't happen in practice; these dictionary inserts will be done offline.

