Skip to content

Hemant/rebuild updated#172

Open
hemant-endee wants to merge 11 commits intomasterfrom
hemant/rebuild_updated
Open

Hemant/rebuild updated#172
hemant-endee wants to merge 11 commits intomasterfrom
hemant/rebuild_updated

Conversation

@hemant-endee
Copy link
Copy Markdown
Collaborator

No description provided.

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Apr 9, 2026

VectorDB Benchmark — Dense — Passed

Triggered by @omnish-endee · Commit 4a3982f

Step Status
Provision Servers Up
Deploy Endee Server Done
Run Benchmark Passed
Results See below
Teardown Done ✓

No result file found on benchmark server.

@hemant-endee
Copy link
Copy Markdown
Collaborator Author

/correctness_benchmarking dense

3 similar comments
@omnish-endee
Copy link
Copy Markdown
Contributor

/correctness_benchmarking dense

@omnish-endee
Copy link
Copy Markdown
Contributor

/correctness_benchmarking dense

@omnish-endee
Copy link
Copy Markdown
Contributor

/correctness_benchmarking dense

Comment thread src/main.cpp
Comment thread src/main.cpp Outdated
Comment thread src/main.cpp Outdated
Comment thread src/core/ndd.hpp Outdated
Comment thread src/core/ndd.hpp
Comment thread src/core/ndd.hpp Outdated
Comment thread src/core/rebuild.hpp Outdated
Comment thread src/core/rebuild.hpp
Comment thread src/core/ndd.hpp Outdated
Comment thread src/core/ndd.hpp Outdated
@github-actions
Copy link
Copy Markdown

github-actions Bot commented Apr 21, 2026

VectorDB Benchmark - Ready To Run

CI Passed ([lint + unit tests] (https://github.com/endee-io/endee/actions/runs/24984978794)) - benchmark options unlocked.

Post one of the command below. Only members with write access can trigger runs.


Available Modes

Mode Command What runs
Dense /correctness_benchmarking dense HNSW insert throughput · query P50/P95/P99 · recall@10 · concurrent QPS
Hybrid /correctness_benchmarking hybrid Dense + sparse BM25 fusion · same suite + fusion latency overhead

Infrastructure

Server Role Instance
Endee Server Endee VectorDB — code from this branch t2.large
Benchmark Server Benchmark runner t3a.large

Both servers start on demand and are always terminated after the run — pass or fail.


How Correctness Benchmarking Works

1. Post /correctness_benchmarking <mode>
2. Endee Server Create  →  this branch's code deployed  →  Endee starts in chosen mode
3. Benchmark Server Create  →  benchmark suite transferred
4. Benchmark Server runs correctness benchmarking against Endee Server
5. Results posted back here  →  pass/fail + full metrics table
6. Both servers terminated   →  always, even on failure

After a new push, CI must pass again before this menu reappears.

@hemant-endee
Copy link
Copy Markdown
Collaborator Author

/correctness_benchmarking dense

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Apr 21, 2026

VectorDB Benchmark — Dense — Failed

Triggered by @hemant-endee · Commit 117a81f

Step Status
Provision Servers Up
Deploy Endee Server Done
Run Benchmark Failed
Results See reason below
Teardown Done ✓

Benchmark job status: skipped

Comment thread src/core/rebuild.hpp Outdated
Comment thread src/main.cpp Outdated
Comment thread src/core/rebuild.hpp Outdated
Comment thread src/core/rebuild.hpp Outdated
Comment thread src/core/rebuild.hpp Outdated
Comment thread src/core/rebuild.hpp Outdated
@hemant-endee hemant-endee force-pushed the hemant/rebuild_updated branch from 117a81f to 6f4083e Compare April 23, 2026 16:09
…y + IndexManager to executeJob instead of unpacking fields
Comment thread src/core/rebuild.cpp
Comment on lines +29 to +50
void Rebuild::cleanupTempFiles(const std::string& data_dir) {
if (!std::filesystem::exists(data_dir)) {
return;
}
try {
std::string temp_filename = std::string(settings::DEFAULT_SUBINDEX) + ".idx.temp";
std::string ts_prefix = std::string(settings::DEFAULT_SUBINDEX) + ".idx.";
for (const auto& entry : std::filesystem::recursive_directory_iterator(data_dir)) {
if (!entry.is_regular_file()) continue;
const std::string fname = entry.path().filename().string();
bool is_temp = (fname == temp_filename);
bool is_ts = fname.size() > ts_prefix.size()
&& fname.substr(0, ts_prefix.size()) == ts_prefix
&& std::all_of(fname.begin() + ts_prefix.size(), fname.end(), ::isdigit);
if (is_temp || is_ts) {
std::filesystem::remove(entry.path());
}
}
} catch (const std::exception& e) {
LOG_WARN(1803, "rebuild", "Failed to cleanup temp files on startup: " << e.what());
}
}
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should not throw error if there are not cleanup files left in the directory.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Narrowed catch to std::filesystem::filesystem_error so it only triggers on real filesystem errors, not unrelated exceptions.

Comment thread src/core/rebuild.cpp Outdated
Comment on lines +236 to +237
std::filesystem::copy_file(p.timestamped_path, p.index_path,
std::filesystem::copy_options::overwrite_existing);
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is renaming the index file not enough here ?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Replaced copy_file + remove with std::filesystem::rename

Comment thread src/core/rebuild.cpp Outdated
std::filesystem::copy_file(p.timestamped_path, p.index_path,
std::filesystem::copy_options::overwrite_existing);

// Cannot call reloadIndex() here — we hold operation_mutex and reloadIndex acquires
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

st.stop_requested() is not checked in phase 3. makes the shutdown time unbounded for large indexes.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Added stop_requested() check after Phase 2 loop, before the Phase 3 saveIndex call.

Comment thread src/core/rebuild.cpp Outdated
Comment on lines +235 to +243
new_alg->saveIndex(p.timestamped_path);
std::filesystem::copy_file(p.timestamped_path, p.index_path,
std::filesystem::copy_options::overwrite_existing);

// Cannot call reloadIndex() here — we hold operation_mutex and reloadIndex acquires
// indices_mutex_, while deleteIndex holds indices_mutex_ then acquires operation_mutex.
// Calling reloadIndex here would deadlock with a concurrent delete on the same index.
auto fresh_alg = std::make_unique<hnswlib::HierarchicalNSW<float>>(p.index_path, 0);
IndexManager::wireVectorFetchers(fresh_alg.get(), entry->vector_storage);
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

new_alg already has all the data and was just persisted.

can use entry->alg = std::move(new_alg) directly.

doesnt seem useful to re load again

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Removed disk reload. new_alg is fully built and fetchers are already wired in Phase 2 — moved directly into entry->alg.

Comment thread docs/rebuild.md
@@ -0,0 +1,124 @@
# Index Rebuild
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Things that are missing in the doc. make it more comprehensive

Disk space: rebuild needs 2× the index size on disk (timestamped + canonical co-exist briefly).

Memory: 2× the graph in RAM during Phase 2 (old + new).

Expected duration: rough order of magnitude per million vectors helps capacity planning.

What "manually restart" means after a crash — operationally how does an operator know a rebuild was incomplete? (There's no persisted state.)

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Added Capacity and Timing section covering disk space (2×), memory (2×), and expected duration.

Comment thread src/core/rebuild.cpp Outdated
Comment on lines +160 to +162
std::string Rebuild::formatTime(std::chrono::system_clock::time_point tp) {
return timeToISO8601(tp);
}
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no one is using this function

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Removed formatTime. Call sites in getProgress now call timeToISO8601 directly.

Comment thread src/core/ndd.hpp Outdated
rebuild_.setActiveRebuild(username, index_id, current_count);

// Spawn thread — lambda calls rebuild_.executeJob directly (execution lives in Rebuild)
std::jthread t([this, params = std::move(params)](std::stop_token st) mutable {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here mutable is not required. executeJob takes const RebuildJobParams&, so the lambda doesn't actually mutate

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Removed mutable — executeJob takes const RebuildJobParams& so the lambda never mutates.

Comment thread src/core/rebuild.cpp Outdated
case RebuildStatus::IN_PROGRESS: return "in_progress";
case RebuildStatus::COMPLETED: return "completed";
case RebuildStatus::FAILED: return "failed";
default: return "unknown";
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can never reach this line since RebuildStatus has only three defined values

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Removed default case. Added __builtin_unreachable() so compiler warns if a new enum value is added without updating the switch.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants