[codex] Generate embeddings for semantic retrieval#44
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: ASSERTIVE Plan: Pro Run ID: 📒 Files selected for processing (1)
📝 WalkthroughWalkthroughAdds a semantic-search engine plus synchronous runtime wrappers; CLI, MCP, and renderer are refactored to call the runtime for vector and hybrid retrieval. Renderer gains a SemanticSearchProvider/options and collectors support semantic-search sections; embedding caching is made provider-aware. (49 words) ChangesSemantic search core + CLI/MCP + render wiring
Sequence DiagramsequenceDiagram
participant CLI as CLI Command
participant Runtime as SemanticSearchRuntime
participant Engine as SemanticSearchEngine
participant Provider as EmbeddingProvider
participant Store as EmbeddingStore/Index
CLI->>Runtime: new(ctx)
Runtime->>Engine: new(index, store, provider)
CLI->>Runtime: vector_search / hybrid_retrieve(graph, query, ...)
Runtime->>Engine: block_on(vector_search/hybrid_retrieve(...))
Engine->>Engine: ensure_embeddings(graph)
Engine->>Provider: embed_documents(graph_nodes)
Provider-->>Engine: document_vectors
Engine->>Store: persist/index document vectors
Engine->>Provider: embed_query(query)
Provider-->>Engine: query_vector
Engine->>Store: vector_search / hybrid scoring
Store-->>Engine: Vector/Hybrid results (with vector_score)
Engine-->>Runtime: results
Runtime-->>CLI: results
Estimated Code Review Effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
Review rate limit: 1/5 review remaining, refill in 41 minutes and 21 seconds. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@crates/tempyr-index/src/semantic.rs`:
- Line 14: The field embeddings_ready currently short-circuits embedding
population (referencing embeddings_ready) so once true future graph changes
won’t trigger re-embedding; update the logic in the embedding-population routine
(the function that checks embeddings_ready / populates embeddings) to either
remove the one-time short-circuit or make embeddings_ready depend on graph
mutation state (e.g., compare a graph version/timestamp or reset
embeddings_ready when nodes are added/updated/deleted), and ensure any code
paths that modify the graph (node add/update/delete methods) reset or advance
that version so the populate routine will re-run and produce fresh embeddings.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 90448b3e-638d-49dd-823f-30b51eece276
📒 Files selected for processing (16)
crates/tempyr-cli/src/commands/ask.rscrates/tempyr-cli/src/commands/context.rscrates/tempyr-cli/src/commands/mod.rscrates/tempyr-cli/src/commands/render_cmd.rscrates/tempyr-cli/src/commands/semantic.rscrates/tempyr-cli/src/commands/vsearch.rscrates/tempyr-index/src/health.rscrates/tempyr-index/src/hybrid.rscrates/tempyr-index/src/lib.rscrates/tempyr-index/src/semantic.rscrates/tempyr-mcp/src/handler.rscrates/tempyr-render/src/collector.rscrates/tempyr-render/src/lib.rscrates/tempyr-render/src/template.rsdocs/graphspec.mdtemplates/tdd.toml
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@crates/tempyr-index/src/semantic.rs`:
- Around line 70-72: The code currently takes the first element of
query_embeddings silently; change this to enforce exactly one embedding: in the
embed_query flow, collect query_embeddings into a Vec (or check its length) and
if vec.len() != 1 return Err(IndexError::General("Embedding provider returned
wrong number of vectors for the query; expected exactly 1".to_string())),
otherwise use vec.into_iter().next().unwrap(); reference symbols:
query_embeddings and IndexError::General (and the embed_query function) when
making this replacement.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 7c0e7f81-e867-4629-a411-91973306e6f9
📒 Files selected for processing (1)
crates/tempyr-index/src/semantic.rs
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@crates/tempyr-index/src/semantic.rs`:
- Around line 25-30: ensure_embeddings calls embeddings::embed_graph which keys
cached embeddings by content hash only, allowing different embedding
providers/models to reuse incompatible vectors; update the logic to include the
provider fingerprint (e.g. provider id/name, model/version/tokenizer dims) when
storing and looking up embeddings in the cache so embeddings are scoped to
provider identity and vector dimensionality, or persist a provider fingerprint
alongside cached entries and, in ensure_embeddings (or in
embeddings::embed_graph), detect mismatches and invalidate or re-embed affected
documents via self.provider before returning; reference ensure_embeddings,
embeddings::embed_graph, self.provider, and the cache/store layer to implement
this change.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 40610a3a-9376-4f52-87d8-a370164686ed
📒 Files selected for processing (1)
crates/tempyr-index/src/semantic.rs
| pub async fn ensure_embeddings(&mut self, graph: &Graph) -> Result<()> { | ||
| // embed_graph is content-hash aware and skips cached entries, so keep | ||
| // checking the current graph instead of assuming a long-lived engine has | ||
| // already seen every future graph mutation. | ||
| embeddings::embed_graph(&self.store, graph, self.provider.as_ref()).await?; | ||
| Ok(()) |
There was a problem hiding this comment.
Provider/model drift is not invalidating cached document embeddings.
At Line 29, embedding reuse is keyed by content hash only (via embed_graph), so changing embedding provider/model can silently mix vector spaces and degrade ranking correctness. Scope cache entries by provider identity/version (and dimensions), or persist/validate provider fingerprint and force re-embed on mismatch.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@crates/tempyr-index/src/semantic.rs` around lines 25 - 30, ensure_embeddings
calls embeddings::embed_graph which keys cached embeddings by content hash only,
allowing different embedding providers/models to reuse incompatible vectors;
update the logic to include the provider fingerprint (e.g. provider id/name,
model/version/tokenizer dims) when storing and looking up embeddings in the
cache so embeddings are scoped to provider identity and vector dimensionality,
or persist a provider fingerprint alongside cached entries and, in
ensure_embeddings (or in embeddings::embed_graph), detect mismatches and
invalidate or re-embed affected documents via self.provider before returning;
reference ensure_embeddings, embeddings::embed_graph, self.provider, and the
cache/store layer to implement this change.
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@crates/tempyr-index/src/embeddings.rs`:
- Around line 721-735: The current ensure_provider_fingerprint implementation
clears all embeddings when meta is missing, causing unnecessary invalidation;
modify the None branch so that if meta_value(Self::PROVIDER_FINGERPRINT_KEY)
returns None you do NOT call clear_embeddings() by default — instead directly
set_meta_value(Self::PROVIDER_FINGERPRINT_KEY, &fingerprint) to seed legacy
stores with the current fingerprint; reserve calling clear_embeddings() for an
explicit incompatible-schema path (e.g., a new function or an explicit check) so
only ensure_provider_fingerprint, PROVIDER_FINGERPRINT_KEY, clear_embeddings,
set_meta_value, count and meta_value are touched to implement this
non-destructive migration.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 7d1a9104-e26d-480f-9aa1-289014466ab9
📒 Files selected for processing (1)
crates/tempyr-index/src/embeddings.rs
Summary
vsearch,context,ask,render tdd, and MCPgraph_context/graph_renderValidation
cargo fmt --checkcargo clippy --workspace --all-targets -- -D warningscargo test --workspace --locked -- --format tersegit diff --checkSummary by CodeRabbit
New Features
Documentation