
Streaming N:1 compaction#659

Open
dian-lun-lin wants to merge 7 commits into main from compaction-pr

Conversation

@dian-lun-lin

This PR addresses #580 by adding OnDiskGraphIndexCompactor, a streaming N:1 compaction algorithm that merges multiple on-disk HNSW graph indexes into a single compacted index.

source[0].index  ─┐
source[1].index  ─┤──► OnDiskGraphIndexCompactor ──► compacted.index
source[N].index  ─┘

For a full description of the algorithm and benchmarking instructions, see docs/compaction.md and benchmarks-jmh/src/main/java/io/github/jbellis/jvector/bench/CompactorBenchmark.md.

Support

  • Streaming, low-memory: no full in-memory graph construction; runs under -Xmx5g even for a 10M-vector, 2560-dim dataset
  • Deletion support: a per-source live-node FixedBitSet excludes deleted nodes from the output
  • Ordinal remapping: a user-provided OrdinalMapper maps each source's local ordinals into a contiguous global ordinal space; the included OffsetMapper handles the common sequential case
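
As a minimal sketch of the live-node bookkeeping described above (using java.util.BitSet as a stdlib stand-in for jvector's FixedBitSet; the set/clear/cardinality semantics are analogous):

```java
import java.util.BitSet;

public class LiveNodesSketch {
    public static void main(String[] args) {
        int sourceSize = 10;
        // Every ordinal in [0, sourceSize) starts live...
        BitSet live = new BitSet(sourceSize);
        live.set(0, sourceSize);
        // ...then cleared bits mark deletions to be excluded from the output.
        live.clear(3);
        live.clear(7);
        System.out.println(live.cardinality()); // live nodes remaining
    }
}
```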

Usage

List<OnDiskGraphIndex> sources = List.of(index0, index1, index2);

// Mark all nodes live (no deletions)
List<FixedBitSet> liveNodes = sources.stream()
    .map(s -> { var bs = new FixedBitSet(s.size()); bs.set(0, s.size()); return bs; })
    .collect(toList());

// Sequential ordinal remapping: source[s] node i → global offset[s] + i
int offset = 0;
List<OrdinalMapper> remappers = new ArrayList<>();
for (var src : sources) {
    remappers.add(new OrdinalMapper.OffsetMapper(offset, src.size()));
    offset += src.size();
}

var compactor = new OnDiskGraphIndexCompactor(
    sources, liveNodes, remappers,
    VectorSimilarityFunction.COSINE,
    /* executor= */ null                  // null = create internal ForkJoinPool
);

compactor.compact(Path.of("compacted.index"));

Key Changes

  • OnDiskGraphIndexCompactor — core compaction algorithm with parallel ForkJoinPool execution and backpressure windowing
  • PQRetrainer — balanced proportional sampling + sequential sorted reads for efficient codebook retraining
  • Minor API visibility changes to GraphSearcher, GraphIndexBuilder, and PQVectors required by the compactor
  • CompactorBenchmark — JMH benchmark with PARTITION_AND_COMPACT, PARTITION_ONLY, COMPACT_ONLY, and BUILD_FROM_SCRATCH modes
  • Unit tests covering basic compaction, deletions, ordinal remapping, and FusedPQ scenarios
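
The PR describes PQRetrainer's "balanced proportional sampling" only at a high level; a hypothetical sketch of splitting a fixed sample budget across sources in proportion to their live counts follows (the `allocate` helper and its rounding policy are assumptions for illustration, not the actual implementation):

```java
public class SamplingSketch {
    // Hypothetical: split sampleBudget across sources proportionally to live counts,
    // handing truncation leftovers to the largest source.
    static int[] allocate(int[] liveCounts, int sampleBudget) {
        long total = 0;
        for (int c : liveCounts) total += c;
        int[] samples = new int[liveCounts.length];
        int assigned = 0;
        for (int s = 0; s < liveCounts.length; s++) {
            samples[s] = (int) ((long) sampleBudget * liveCounts[s] / total);
            assigned += samples[s];
        }
        int largest = 0;
        for (int s = 1; s < liveCounts.length; s++)
            if (liveCounts[s] > liveCounts[largest]) largest = s;
        samples[largest] += sampleBudget - assigned;
        return samples;
    }

    public static void main(String[] args) {
        int[] samples = allocate(new int[]{1000, 3000, 6000}, 100);
        System.out.println(samples[0] + " " + samples[1] + " " + samples[2]);
    }
}
```

The proportional split keeps the retrained codebook representative of all sources regardless of their relative sizes, which matters when partitions are deliberately unbalanced (as in the Fibonacci benchmark setup below).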

Recall

Comparison against build-from-scratch (results averaged over three runs).

  • Build from scratch: build with PQ, search using FusedPQ with FP reranking.
  • Compaction: build source partitions with PQ, compact using FusedPQ with FP rescoring, search using FusedPQ with FP reranking. Source partitions are based on a Fibonacci distribution with 4 partitions.
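
One plausible reading of "Fibonacci distribution with 4 partitions" is partition sizes proportional to consecutive Fibonacci numbers; the sketch below is an assumption about the partitioning scheme for illustration only, not taken from the benchmark code:

```java
public class FibPartitionSketch {
    // Hypothetical: split `total` vectors into partitions sized proportionally
    // to consecutive Fibonacci numbers (1, 2, 3, 5 for 4 partitions).
    static int[] partitionSizes(int total, int partitions) {
        int[] fib = new int[partitions];
        fib[0] = 1;
        if (partitions > 1) fib[1] = 2;
        for (int i = 2; i < partitions; i++) fib[i] = fib[i - 1] + fib[i - 2];
        int weightSum = 0;
        for (int f : fib) weightSum += f;
        int[] sizes = new int[partitions];
        int assigned = 0;
        for (int i = 0; i < partitions - 1; i++) {
            sizes[i] = (int) ((long) total * fib[i] / weightSum);
            assigned += sizes[i];
        }
        sizes[partitions - 1] = total - assigned; // remainder to the last partition
        return sizes;
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (int s : partitionSizes(1_000_000, 4)) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(s);
        }
        System.out.println(sb);
    }
}
```
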
| Dataset               | Dim  | Build from Scratch | Compaction | Delta  |
|-----------------------|-----:|-------------------:|-----------:|-------:|
| cap-6M                |  768 | 0.626              | 0.619      | -0.008 |
| cap-1M                |  768 | 0.656              | 0.656      | 0.000  |
| gecko-100k            |  768 | 0.690              | 0.701      | +0.011 |
| e5-small-v2-100k      |  384 | 0.572              | 0.586      | +0.014 |
| ada002-1M             | 1536 | 0.687              | 0.703      | +0.016 |
| e5-base-v2-100k       |  768 | 0.676              | 0.692      | +0.016 |
| cohere-english-v3-10M | 1024 | 0.544              | 0.561      | +0.017 |
| e5-large-v2-100k      | 1024 | 0.686              | 0.703      | +0.017 |
| ada002-100k           | 1536 | 0.751              | 0.769      | +0.018 |
| cohere-english-v3-1M  | 1024 | 0.593              | 0.612      | +0.019 |

Recall is generally comparable to build-from-scratch and often better, though some datasets show small drops. All datasets compact successfully under -Xmx5g; compaction has also been validated on a 2560-dim 10M-vector dataset under the same constraint.
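
For reference, recall in this kind of benchmark is typically measured as the overlap between the retrieved top-k and the ground-truth top-k; a minimal sketch of that computation (recallAtK is a hypothetical helper for illustration, not a jvector API):

```java
import java.util.Set;

public class RecallSketch {
    // recall@k: fraction of the ground-truth top-k found in the retrieved top-k.
    static double recallAtK(int[] retrieved, Set<Integer> groundTruth) {
        int hits = 0;
        for (int id : retrieved) if (groundTruth.contains(id)) hits++;
        return (double) hits / groundTruth.size();
    }

    public static void main(String[] args) {
        int[] retrieved = {1, 2, 3, 4, 5};
        Set<Integer> truth = Set.of(1, 2, 3, 9, 10);
        System.out.println(recallAtK(retrieved, truth)); // 3 of 5 ground-truth hits
    }
}
```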

Introduce OnDiskGraphIndexCompactor and PQRetrainer for streaming N:1
merging of on-disk HNSW indexes without full in-memory materialization.
Supports deletion filtering via live-node bitsets, custom ordinal
mapping, and PQ codebook retraining.

Tests for OnDiskGraphIndexCompactor covering basic compaction, deletions,
ordinal remapping, multi-source merging, and FusedPQ compaction scenarios.

Add JFR recording, system stats collection, JSONL logging, git info
capture, thread allocation tracking, dataset partitioning, and cloud
storage layout utilities used by CompactorBenchmark. Switch
jvector-examples logging from logback to log4j2 for consistency with
benchmarks-jmh and to avoid duplicate SLF4J bindings in the fat jar.

JMH-based benchmark with configurable workload modes (PARTITION_AND_COMPACT,
PARTITION_ONLY, COMPACT_ONLY, BUILD_FROM_SCRATCH), recall measurement, JFR
recording, and JSONL result logging. Includes BenchmarkParamCounter for
progress tracking, EventLogAnalyzer for post-run analysis, GHA workflow,
and exec-maven-plugin integration. Add forced vectorization provider
property to VectorizationProvider for benchmark reproducibility.

Add result file patterns to .gitignore, update rat-excludes for the new
compaction workflow and catalog cache files.
@github-actions
Contributor

github-actions bot commented Apr 14, 2026

Before you submit for review:

  • Does your PR follow guidelines from CONTRIBUTIONS.md?
  • Did you summarize what this PR does clearly and concisely?
  • Did you include performance data for changes which may be performance impacting?
  • Did you include useful docs for any user-facing changes or features?
  • Did you include useful javadocs for developer oriented changes, explaining new concepts or key changes?
  • Did you trigger and review regression testing results against the base branch via Run Bench Main?
  • Did you adhere to the code formatting guidelines (TBD)?
  • Did you group your changes for easy review, providing meaningful descriptions for each commit?
  • Did you ensure that all files contain the correct copyright header?

If you did not complete any of these, then please explain below.

The benchmarks-jmh-*.jar glob matched the -javadoc jar first, which has
no Main-Class. Select the shaded JMH jar explicitly by excluding
-javadoc and -sources jars.

Use -cp with CompactorBenchmark.main() instead of -jar with JMH Main
to avoid BenchmarkList discovery issues in CI's shaded jar.
Comment thread on docs/compaction.md:
- Compaction: build source partitions with PQ; compact using FusedPQ with FP rescoring; search using FusedPQ with FP reranking.

| Dataset | Dim | Build from Scratch | Compaction | Delta |
|----------------------|-----:|-------------------:|-----------:|-------:|
Contributor

Do we need to modify this section for brevity?

/**
 * Handles writing the compacted graph index to disk, managing header, node records,
 * upper layers, and footer in the on-disk format.
 */
private static final class CompactWriter implements AutoCloseable {
Contributor

Maybe break this out. The parent file is already very large.

public final class SystemStatsCollector {
private static final Logger log = LoggerFactory.getLogger(SystemStatsCollector.class);

private static final String SCRIPT = String.join("\n",
Contributor

Perhaps this logic shouldn't be broken out as it is. Instead of invoking a shell wrapper, it should probably be direct reads and pattern matching in Java.

