[codex] Generate embeddings for semantic retrieval by cleak · Pull Request #44 · cleak/tempyr

cleak · 2026-05-02T04:23:26Z

Summary

- Add a shared semantic search engine that populates missing graph embeddings before vector or hybrid retrieval.
- Wire on-demand embedding generation into CLI vsearch, context, ask, render tdd, and MCP graph_context / graph_render.
- Replace the render semantic placeholder with provider-backed semantic sections that fail clearly when no provider is configured.
- Update docs, TDD template output, and health messaging for on-demand embedding population.

Validation

cargo fmt --check
cargo clippy --workspace --all-targets -- -D warnings
cargo test --workspace --locked -- --format terse
git diff --check

Summary by CodeRabbit

New Features
- CLI search/context/vsearch now use integrated semantic search and surface vector similarity (vec/vector_score) in JSON and console outputs.
- Renderer supports semantic-search-driven template sections (can include full node bodies) and accepts a semantic-search provider for rendering.
- Embedding cache now detects provider changes and refreshes embeddings automatically.
Documentation
- CLI spec and template examples updated to note embeddings are populated before vector retrieval and show body inclusion.

coderabbitai · 2026-05-02T04:23:35Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 6ae77267-b540-428e-9dfa-3380785580c4

📥 Commits

Reviewing files that changed from the base of the PR and between 95c1ad5 and bd95ece.

📒 Files selected for processing (1)

crates/tempyr-index/src/embeddings.rs

📝 Walkthrough

Adds a semantic-search engine plus synchronous runtime wrappers; CLI, MCP, and renderer are refactored to call the runtime for vector and hybrid retrieval. Renderer gains a SemanticSearchProvider/options and collectors support semantic-search sections; embedding caching is made provider-aware.

Changes

Semantic search core + CLI/MCP + render wiring

Layer / File(s)	Summary
Core Engine `crates/tempyr-index/src/semantic.rs`	Adds `SemanticSearchEngine`: ensures graph embeddings, embeds queries, supports async `vector_search` and `hybrid_retrieve`, and includes unit tests verifying embedding and vector-score behavior.
Embedding caching `crates/tempyr-index/src/embeddings.rs`	Adds `EmbeddingProvider::fingerprint()`, `embedding_store_meta` metadata, `EmbeddingStore::ensure_provider_fingerprint()`, and tests that cached embeddings are cleared/replaced when provider fingerprint changes.
Public export `crates/tempyr-index/src/lib.rs`	Exports new `semantic` module (`pub mod semantic;`).
Runtime wrapper (CLI) `crates/tempyr-cli/src/commands/semantic.rs`	Adds `SemanticSearchRuntime` holding a `SemanticSearchEngine` and a Tokio `Runtime`; provides synchronous `vector_search` and `hybrid_retrieve` via `runtime.block_on(...)`.
CLI command wiring `crates/tempyr-cli/src/commands/{ask.rs,context.rs,vsearch.rs}`	Replace in-file index/embedding wiring with `SemanticSearchRuntime`; `ask`/`context` call `hybrid_retrieve`, `vsearch` calls `vector_search`; `context` output adds `vector_score`; legacy embedding-selection helper/tests removed.
CLI module registry `crates/tempyr-cli/src/commands/mod.rs`	Registers new `semantic` submodule (`pub mod semantic;`).
Render API & types `crates/tempyr-render/src/lib.rs`	Adds `SemanticSearchRequest`, `SemanticSearchHit`, `SemanticSearchProvider` trait, and `RenderOptions<'a>`; exposes `render_with_options` / `render_from_str_with_options` and `collect_sections` branching on presence of a provider.
Collector refactor `crates/tempyr-render/src/collector.rs`	`collect_section` now accepts `semantic_search: Option<&mut dyn SemanticSearchProvider>` and returns `Result<SectionData>`; adds `collect_section_with_semantic_search`, `collect_semantic_section`, `semantic_query`, and `matches_status_filter`; implements request construction, hit filtering, optional body inclusion, and tests (including error when provider missing).
Render command wiring `crates/tempyr-cli/src/commands/render_cmd.rs`	Adds `RenderSemanticSearch` provider that lazily constructs `SemanticSearchRuntime`, implements `SemanticSearchProvider`, and passes it via `RenderOptions` into rendering calls (`render_with_options`/`render_from_str_with_options`).
MCP integration `crates/tempyr-mcp/src/handler.rs`	Adds `McpSemanticSearch` / `McpSemanticSearchRuntime` wrappers for MCP mode; updates `graph_context` and `graph_render` to use the runtime and options-based rendering.
Templates, docs & minor tweaks `templates/tdd.toml`, `docs/graphspec.md`, `crates/tempyr-index/src/hybrid.rs`, `crates/tempyr-index/src/health.rs`, `crates/tempyr-render/src/template.rs`	Template now sets `include_body = true` for the sample section; docs note embeddings are populated before retrieval; small doc/warning formatting tweaks and a test assertion added.
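
The collector row above describes an optional provider that is passed down to section collection, with an error when a semantic section is requested and no provider is configured. A minimal sketch of that shape follows. The type and trait names come from the table, but the fields, the `search` method, the `String` error type, and `FixedProvider` are assumptions made for illustration and are not the signatures in `tempyr-render`.

```rust
/// Assumed shape of a hit; the real `SemanticSearchHit` may carry more fields.
#[derive(Clone, Debug, PartialEq)]
pub struct SemanticSearchHit {
    pub node_id: String,
    pub vector_score: f32,
}

/// Assumed single-method trait; the real trait takes a `SemanticSearchRequest`.
pub trait SemanticSearchProvider {
    fn search(&mut self, query: &str, limit: usize) -> Result<Vec<SemanticSearchHit>, String>;
}

/// Collects hits for a semantic section, failing clearly when no provider is given.
pub fn collect_semantic_section(
    query: &str,
    limit: usize,
    provider: Option<&mut dyn SemanticSearchProvider>,
) -> Result<Vec<SemanticSearchHit>, String> {
    let provider = provider.ok_or_else(|| {
        format!("semantic section for query '{query}' requires a configured provider")
    })?;
    provider.search(query, limit)
}

/// Hypothetical in-memory provider used only to exercise the sketch.
pub struct FixedProvider {
    pub hits: Vec<SemanticSearchHit>,
}

impl SemanticSearchProvider for FixedProvider {
    fn search(&mut self, _query: &str, limit: usize) -> Result<Vec<SemanticSearchHit>, String> {
        // Return at most `limit` of the canned hits, in their stored order.
        Ok(self.hits.iter().take(limit).cloned().collect())
    }
}
```

Passing `None` yields an `Err` rather than an empty section, which matches the "fail clearly" behavior named in the PR summary.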

Sequence Diagram

sequenceDiagram
    participant CLI as CLI Command
    participant Runtime as SemanticSearchRuntime
    participant Engine as SemanticSearchEngine
    participant Provider as EmbeddingProvider
    participant Store as EmbeddingStore/Index

    CLI->>Runtime: new(ctx)
    Runtime->>Engine: new(index, store, provider)

    CLI->>Runtime: vector_search / hybrid_retrieve(graph, query, ...)
    Runtime->>Engine: block_on(vector_search/hybrid_retrieve(...))

    Engine->>Engine: ensure_embeddings(graph)
    Engine->>Provider: embed_documents(graph_nodes)
    Provider-->>Engine: document_vectors
    Engine->>Store: persist/index document vectors

    Engine->>Provider: embed_query(query)
    Provider-->>Engine: query_vector
    Engine->>Store: vector_search / hybrid scoring
    Store-->>Engine: Vector/Hybrid results (with vector_score)

    Engine-->>Runtime: results
    Runtime-->>CLI: results

Estimated Code Review Effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Add worktree-aware graph cache layout #1 — Related work on shared embedding-store and hybrid retrieval integration used by these changes.
[codex] Prepare repository for public release #40 — Overlaps on render_cmd.rs wiring and template/render behavior.

Poem

🐰 I hop the index, sniff the vector trail,
I spin a runtime bridge so queries never fail,
Providers hum their secret scores in tune,
Templates stitch the insights by the moon,
A rabbit cheers, "Semantic search — hooray, let's code!"

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 43.16% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly summarizes the main change: adding embedding generation for semantic retrieval, which is reflected across CLI commands, MCP endpoints, and render integration in the changeset.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Comment @coderabbitai help to get the list of available commands and usage tips.

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/tempyr-index/src/semantic.rs`:
- Line 14: The field embeddings_ready currently short-circuits embedding
population (referencing embeddings_ready) so once true future graph changes
won’t trigger re-embedding; update the logic in the embedding-population routine
(the function that checks embeddings_ready / populates embeddings) to either
remove the one-time short-circuit or make embeddings_ready depend on graph
mutation state (e.g., compare a graph version/timestamp or reset
embeddings_ready when nodes are added/updated/deleted), and ensure any code
paths that modify the graph (node add/update/delete methods) reset or advance
that version so the populate routine will re-run and produce fresh embeddings.
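
One of the two options in the comment above, tying readiness to graph mutation state, can be sketched as follows. `Graph`, `Engine`, the `version` counter, and `embed_passes` are hypothetical stand-ins rather than types from `tempyr-index`. The follow-up commit "Recheck graph embeddings for semantic search" appears to take the other option: it drops the short-circuit and relies on the content-hash-aware `embed_graph`.

```rust
use std::collections::HashMap;

/// Hypothetical graph; `version` is an assumed counter advanced by every mutation.
#[derive(Default)]
pub struct Graph {
    nodes: HashMap<String, String>,
    version: u64,
}

impl Graph {
    pub fn upsert_node(&mut self, id: &str, body: &str) {
        self.nodes.insert(id.to_string(), body.to_string());
        self.version += 1;
    }

    pub fn remove_node(&mut self, id: &str) {
        // Only a removal that actually changes the graph advances the version.
        if self.nodes.remove(id).is_some() {
            self.version += 1;
        }
    }
}

/// Hypothetical engine that remembers which graph version it last embedded.
#[derive(Default)]
pub struct Engine {
    embedded_version: Option<u64>,
    pub embed_passes: usize,
}

impl Engine {
    /// Runs an embedding pass unless this exact graph version was already embedded.
    pub fn ensure_embeddings(&mut self, graph: &Graph) {
        if self.embedded_version == Some(graph.version) {
            return; // nothing changed since the last pass
        }
        // A real implementation would embed new or changed nodes here.
        self.embed_passes += 1;
        self.embedded_version = Some(graph.version);
    }
}
```

Unlike a one-time boolean, the stored version goes stale as soon as a node is added, updated, or removed, so a long-lived engine cannot miss later mutations.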

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 90448b3e-638d-49dd-823f-30b51eece276

📥 Commits

Reviewing files that changed from the base of the PR and between 6dd93b5 and 15ff8af.

📒 Files selected for processing (16)

crates/tempyr-cli/src/commands/ask.rs
crates/tempyr-cli/src/commands/context.rs
crates/tempyr-cli/src/commands/mod.rs
crates/tempyr-cli/src/commands/render_cmd.rs
crates/tempyr-cli/src/commands/semantic.rs
crates/tempyr-cli/src/commands/vsearch.rs
crates/tempyr-index/src/health.rs
crates/tempyr-index/src/hybrid.rs
crates/tempyr-index/src/lib.rs
crates/tempyr-index/src/semantic.rs
crates/tempyr-mcp/src/handler.rs
crates/tempyr-render/src/collector.rs
crates/tempyr-render/src/lib.rs
crates/tempyr-render/src/template.rs
docs/graphspec.md
templates/tdd.toml

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/tempyr-index/src/semantic.rs`:
- Around line 70-72: The code currently takes the first element of
query_embeddings silently; change this to enforce exactly one embedding: in the
embed_query flow, collect query_embeddings into a Vec (or check its length) and
if vec.len() != 1 return Err(IndexError::General("Embedding provider returned
wrong number of vectors for the query; expected exactly 1".to_string())),
otherwise use vec.into_iter().next().unwrap(); reference symbols:
query_embeddings and IndexError::General (and the embed_query function) when
making this replacement.
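
The exactly-one check requested above might look like the sketch below. `IndexError` here is a local stand-in with a single `General(String)` variant, and `single_query_embedding` is a hypothetical helper name. The real `embed_query` flow in `semantic.rs` is not shown in this thread, so treat this as an illustration of the check only.

```rust
/// Local stand-in for the crate's error type; only the variant named in the review.
#[derive(Debug, PartialEq)]
pub enum IndexError {
    General(String),
}

/// Returns the query embedding only when the provider produced exactly one vector.
pub fn single_query_embedding(
    query_embeddings: Vec<Vec<f32>>,
) -> Result<Vec<f32>, IndexError> {
    let count = query_embeddings.len();
    let mut iter = query_embeddings.into_iter();
    // Pull two items: the first must exist and the second must not.
    match (iter.next(), iter.next()) {
        (Some(embedding), None) => Ok(embedding),
        _ => Err(IndexError::General(format!(
            "Embedding provider returned {count} vectors for the query; expected exactly 1"
        ))),
    }
}
```

Matching on the first two items avoids the `unwrap()` from the suggested fix while still rejecting both empty and oversized responses.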

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 7c0e7f81-e867-4629-a411-91973306e6f9

📥 Commits

Reviewing files that changed from the base of the PR and between 15ff8af and a4ab88f.

📒 Files selected for processing (1)

crates/tempyr-index/src/semantic.rs

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/tempyr-index/src/semantic.rs`:
- Around line 25-30: ensure_embeddings calls embeddings::embed_graph which keys
cached embeddings by content hash only, allowing different embedding
providers/models to reuse incompatible vectors; update the logic to include the
provider fingerprint (e.g. provider id/name, model/version/tokenizer dims) when
storing and looking up embeddings in the cache so embeddings are scoped to
provider identity and vector dimensionality, or persist a provider fingerprint
alongside cached entries and, in ensure_embeddings (or in
embeddings::embed_graph), detect mismatches and invalidate or re-embed affected
documents via self.provider before returning; reference ensure_embeddings,
embeddings::embed_graph, self.provider, and the cache/store layer to implement
this change.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 40610a3a-9376-4f52-87d8-a370164686ed

📥 Commits

Reviewing files that changed from the base of the PR and between a4ab88f and d37b13e.

📒 Files selected for processing (1)

crates/tempyr-index/src/semantic.rs

coderabbitai · 2026-05-02T05:37:51Z

+    pub async fn ensure_embeddings(&mut self, graph: &Graph) -> Result<()> {
+        // embed_graph is content-hash aware and skips cached entries, so keep
+        // checking the current graph instead of assuming a long-lived engine has
+        // already seen every future graph mutation.
+        embeddings::embed_graph(&self.store, graph, self.provider.as_ref()).await?;
+        Ok(())


⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Provider/model drift is not invalidating cached document embeddings.

At Line 29, embedding reuse is keyed by content hash only (via embed_graph), so changing embedding provider/model can silently mix vector spaces and degrade ranking correctness. Scope cache entries by provider identity/version (and dimensions), or persist/validate provider fingerprint and force re-embed on mismatch.
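
The first option in this comment, scoping cache entries by provider identity, can be sketched as a composite cache key. `ProviderInfo`, its fields, the fingerprint string format, and `EmbeddingCache` are assumptions for illustration. The merged change instead persists one fingerprint in store metadata and clears the cache on mismatch.

```rust
use std::collections::HashMap;

/// Hypothetical provider description; the real provider trait may expose other details.
pub struct ProviderInfo {
    pub name: String,
    pub model: String,
    pub dimensions: usize,
}

impl ProviderInfo {
    /// Assumed fingerprint format: provider name, model, and vector dimensionality.
    pub fn fingerprint(&self) -> String {
        format!("{}:{}:{}", self.name, self.model, self.dimensions)
    }
}

/// In-memory cache keyed by (provider fingerprint, content hash).
#[derive(Default)]
pub struct EmbeddingCache {
    entries: HashMap<(String, u64), Vec<f32>>,
}

impl EmbeddingCache {
    pub fn insert(&mut self, provider: &ProviderInfo, content_hash: u64, vector: Vec<f32>) {
        self.entries.insert((provider.fingerprint(), content_hash), vector);
    }

    /// A hit requires both the same content and the same provider fingerprint.
    pub fn get(&self, provider: &ProviderInfo, content_hash: u64) -> Option<&Vec<f32>> {
        self.entries.get(&(provider.fingerprint(), content_hash))
    }
}
```

With this key, unchanged content embedded by a different model is a cache miss, so vectors from two embedding spaces are never ranked against each other.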

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/tempyr-index/src/embeddings.rs`:
- Around line 721-735: The current ensure_provider_fingerprint implementation
clears all embeddings when meta is missing, causing unnecessary invalidation;
modify the None branch so that if meta_value(Self::PROVIDER_FINGERPRINT_KEY)
returns None you do NOT call clear_embeddings() by default — instead directly
set_meta_value(Self::PROVIDER_FINGERPRINT_KEY, &fingerprint) to seed legacy
stores with the current fingerprint; reserve calling clear_embeddings() for an
explicit incompatible-schema path (e.g., a new function or an explicit check) so
only ensure_provider_fingerprint, PROVIDER_FINGERPRINT_KEY, clear_embeddings,
set_meta_value, count and meta_value are touched to implement this
non-destructive migration.
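
The non-destructive migration described above is sketched below against an in-memory stand-in. The method and constant names follow the comment, but their signatures, the key string, the `bool` return value, and the `HashMap` storage are assumptions. The real `EmbeddingStore` is a persistent store whose API is not shown in this thread.

```rust
use std::collections::HashMap;

/// In-memory stand-in for the persistent embedding store.
#[derive(Default)]
pub struct EmbeddingStore {
    meta: HashMap<String, String>,
    embeddings: HashMap<String, Vec<f32>>,
}

impl EmbeddingStore {
    /// Assumed key name; the real constant's value is not shown in the review.
    const PROVIDER_FINGERPRINT_KEY: &'static str = "provider_fingerprint";

    pub fn insert_embedding(&mut self, node_id: &str, vector: Vec<f32>) {
        self.embeddings.insert(node_id.to_string(), vector);
    }

    pub fn count(&self) -> usize {
        self.embeddings.len()
    }

    fn meta_value(&self, key: &str) -> Option<String> {
        self.meta.get(key).cloned()
    }

    fn set_meta_value(&mut self, key: &str, value: &str) {
        self.meta.insert(key.to_string(), value.to_string());
    }

    fn clear_embeddings(&mut self) {
        self.embeddings.clear();
    }

    /// Returns true when cached embeddings were cleared because the provider changed.
    pub fn ensure_provider_fingerprint(&mut self, fingerprint: &str) -> bool {
        let stored = self.meta_value(Self::PROVIDER_FINGERPRINT_KEY);
        match stored.as_deref() {
            // Same provider as before: keep everything.
            Some(existing) if existing == fingerprint => false,
            // Different provider: cached vectors are incompatible, so drop them.
            Some(_) => {
                self.clear_embeddings();
                self.set_meta_value(Self::PROVIDER_FINGERPRINT_KEY, fingerprint);
                true
            }
            // Legacy store with no fingerprint: seed it without clearing.
            None => {
                self.set_meta_value(Self::PROVIDER_FINGERPRINT_KEY, fingerprint);
                false
            }
        }
    }
}
```

Seeding in the `None` arm assumes the legacy embeddings came from the currently configured provider. That trade-off is what lets existing caches survive the upgrade.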

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 7d1a9104-e26d-480f-9aa1-289014466ab9

📥 Commits

Reviewing files that changed from the base of the PR and between d37b13e and 95c1ad5.

📒 Files selected for processing (1)

crates/tempyr-index/src/embeddings.rs

Generate embeddings for semantic retrieval

15ff8af

coderabbitai Bot reviewed May 2, 2026

Comment thread crates/tempyr-index/src/semantic.rs Outdated

Recheck graph embeddings for semantic search

a4ab88f

coderabbitai Bot reviewed May 2, 2026

Comment thread crates/tempyr-index/src/semantic.rs Outdated

Validate query embedding count

d37b13e

coderabbitai Bot reviewed May 2, 2026

Scope embedding cache by provider

95c1ad5

coderabbitai Bot reviewed May 2, 2026

Comment thread crates/tempyr-index/src/embeddings.rs

Seed legacy embedding fingerprints

bd95ece

cleak merged commit 65c62e0 into master May 2, 2026
5 checks passed

cleak deleted the codex/semantic-render-embeddings branch May 2, 2026 06:07

cleak merged 5 commits into master from codex/semantic-render-embeddings