Hemant/rebuild updated#172
Conversation
VectorDB Benchmark — Dense — PassedTriggered by @omnish-endee · Commit
|
|
/correctness_benchmarking dense |
3 similar comments
|
/correctness_benchmarking dense |
|
/correctness_benchmarking dense |
|
/correctness_benchmarking dense |
VectorDB Benchmark - Ready To Run
Post one of the command below. Only members with write access can trigger runs. Available Modes
Infrastructure
Both servers start on demand and are always terminated after the run — pass or fail. How Correctness Benchmarking Works
|
|
/correctness_benchmarking dense |
VectorDB Benchmark — Dense — FailedTriggered by @hemant-endee · Commit
|
* Rebuild index with new config * fix 1 * index name in get stattu api correction * docs changes * Rebuild Status Persistence
…me as addVectors)
117a81f to
6f4083e
Compare
…y + IndexManager to executeJob instead of unpacking fields
| void Rebuild::cleanupTempFiles(const std::string& data_dir) { | ||
| if (!std::filesystem::exists(data_dir)) { | ||
| return; | ||
| } | ||
| try { | ||
| std::string temp_filename = std::string(settings::DEFAULT_SUBINDEX) + ".idx.temp"; | ||
| std::string ts_prefix = std::string(settings::DEFAULT_SUBINDEX) + ".idx."; | ||
| for (const auto& entry : std::filesystem::recursive_directory_iterator(data_dir)) { | ||
| if (!entry.is_regular_file()) continue; | ||
| const std::string fname = entry.path().filename().string(); | ||
| bool is_temp = (fname == temp_filename); | ||
| bool is_ts = fname.size() > ts_prefix.size() | ||
| && fname.substr(0, ts_prefix.size()) == ts_prefix | ||
| && std::all_of(fname.begin() + ts_prefix.size(), fname.end(), ::isdigit); | ||
| if (is_temp || is_ts) { | ||
| std::filesystem::remove(entry.path()); | ||
| } | ||
| } | ||
| } catch (const std::exception& e) { | ||
| LOG_WARN(1803, "rebuild", "Failed to cleanup temp files on startup: " << e.what()); | ||
| } | ||
| } |
There was a problem hiding this comment.
should not throw error if there are not cleanup files left in the directory.
There was a problem hiding this comment.
Fixed. Narrowed catch to std::filesystem::filesystem_error so it only triggers on real filesystem errors, not unrelated exceptions.
| std::filesystem::copy_file(p.timestamped_path, p.index_path, | ||
| std::filesystem::copy_options::overwrite_existing); |
There was a problem hiding this comment.
why is renaming the index file not enough here ?
There was a problem hiding this comment.
Fixed. Replaced copy_file + remove with std::filesystem::rename
| std::filesystem::copy_file(p.timestamped_path, p.index_path, | ||
| std::filesystem::copy_options::overwrite_existing); | ||
|
|
||
| // Cannot call reloadIndex() here — we hold operation_mutex and reloadIndex acquires |
There was a problem hiding this comment.
st.stop_requested() is not checked in phase 3. makes the shutdown time unbounded for large indexes.
There was a problem hiding this comment.
Fixed. Added stop_requested() check after Phase 2 loop, before the Phase 3 saveIndex call.
| new_alg->saveIndex(p.timestamped_path); | ||
| std::filesystem::copy_file(p.timestamped_path, p.index_path, | ||
| std::filesystem::copy_options::overwrite_existing); | ||
|
|
||
| // Cannot call reloadIndex() here — we hold operation_mutex and reloadIndex acquires | ||
| // indices_mutex_, while deleteIndex holds indices_mutex_ then acquires operation_mutex. | ||
| // Calling reloadIndex here would deadlock with a concurrent delete on the same index. | ||
| auto fresh_alg = std::make_unique<hnswlib::HierarchicalNSW<float>>(p.index_path, 0); | ||
| IndexManager::wireVectorFetchers(fresh_alg.get(), entry->vector_storage); |
There was a problem hiding this comment.
new_alg already has all the data and was just persisted.
can use entry->alg = std::move(new_alg) directly.
doesnt seem useful to re load again
There was a problem hiding this comment.
Fixed. Removed disk reload. new_alg is fully built and fetchers are already wired in Phase 2 — moved directly into entry->alg.
| @@ -0,0 +1,124 @@ | |||
| # Index Rebuild | |||
There was a problem hiding this comment.
Things that are missing in the doc. make it more comprehensive
Disk space: rebuild needs 2× the index size on disk (timestamped + canonical co-exist briefly).
Memory: 2× the graph in RAM during Phase 2 (old + new).
Expected duration: rough order of magnitude per million vectors helps capacity planning.
What "manually restart" means after a crash — operationally how does an operator know a rebuild was incomplete? (There's no persisted state.)
There was a problem hiding this comment.
Fixed. Added Capacity and Timing section covering disk space (2×), memory (2×), and expected duration.
| std::string Rebuild::formatTime(std::chrono::system_clock::time_point tp) { | ||
| return timeToISO8601(tp); | ||
| } |
There was a problem hiding this comment.
no one is using this function
There was a problem hiding this comment.
Fixed. Removed formatTime. Call sites in getProgress now call timeToISO8601 directly.
| rebuild_.setActiveRebuild(username, index_id, current_count); | ||
|
|
||
| // Spawn thread — lambda calls rebuild_.executeJob directly (execution lives in Rebuild) | ||
| std::jthread t([this, params = std::move(params)](std::stop_token st) mutable { |
There was a problem hiding this comment.
here mutable is not required. executeJob takes const RebuildJobParams&, so the lambda doesn't actually mutate
There was a problem hiding this comment.
Fixed. Removed mutable — executeJob takes const RebuildJobParams& so the lambda never mutates.
| case RebuildStatus::IN_PROGRESS: return "in_progress"; | ||
| case RebuildStatus::COMPLETED: return "completed"; | ||
| case RebuildStatus::FAILED: return "failed"; | ||
| default: return "unknown"; |
There was a problem hiding this comment.
can never reach this line since RebuildStatus has only three defined values
There was a problem hiding this comment.
Fixed. Removed default case. Added __builtin_unreachable() so compiler warns if a new enum value is added without updating the switch.
… doc review comments
No description provided.