[codex] Improve SQL recall and semantic retrieval by davidbuniat · Pull Request #65 · activeloopai/hivemind

davidbuniat · 2026-04-21T06:03:24Z

Summary

This PR brings the fstosql branch onto optimizations as a focused recall and retrieval upgrade. The branch layers five related changes together:

- align session retrieval with local transcript files
- add a SQL transcript surface for Deeplake recall
- store sessions physically per message so SQL recall can ground answers on transcript rows
- add embedding / graph retrieval experiments, including semantic and hybrid grep retrieval modes
- switch fact extraction and LOCOMO fact backfill from summary-driven prompts to transcript-row prompts
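
To make "store sessions physically per message" concrete, here is a minimal sketch of flattening one transcript into one row per message. The `TranscriptMessage` and `SessionRow` shapes and the `toSessionRows` helper are hypothetical names used only for illustration; the actual schema in this branch is not shown in this description.

```typescript
// Hypothetical shapes: the real column names in the sessions table may differ.
interface TranscriptMessage {
  role: "user" | "assistant";
  content: string;
}

interface SessionRow {
  sessionId: string;
  messageIndex: number; // position of the message within its session
  role: string;
  content: string;
}

// Flatten a transcript so that each message becomes its own row.
// A SQL query can then select and cite individual messages instead of
// reading a whole-session blob or a summary document.
function toSessionRows(
  sessionId: string,
  messages: TranscriptMessage[],
): SessionRow[] {
  return messages.map((message, index) => ({
    sessionId,
    messageIndex: index,
    role: message.role,
    content: message.content,
  }));
}

const rows = toSessionRows("session-1", [
  { role: "user", content: "Where did we leave off yesterday?" },
  { role: "assistant", content: "We were debugging the grep fallback." },
]);
console.log(rows.length);
```

With rows shaped like this, a recall query can filter on the session identifier and order by message position to recover the exact wording of a conversation.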

What changed

- adds transcript-oriented SQL recall plumbing across the bash command compiler, session-start guidance, virtual table access, and session storage
- introduces configurable retrieval modes for grep-backed search (`classic`, `embedding`, and `hybrid`), using Harrier embeddings and Deeplake hybrid scoring when enabled
- adds a stricter facts-and-sessions-only `psql` mode that exposes only `sessions`, `memory_facts`, `memory_entities`, and `fact_entity_links`
- fixes intercepted SQL table normalization so related physical tables resolve correctly when deployments use suffixed table names such as `_actual`
- moves memory fact extraction onto transcript rows rather than summary markdown, and updates the Claude/Codex wiki workers plus the LOCOMO fact backfill path to match
- adds Harrier embedding/backfill tooling, related Python dependencies, and bundle externalization for transformers / ONNX runtime
- regenerates the checked-in Claude/Codex bundles and expands source-level test coverage around grep, SQL interception, hook guidance, summaries, sessions, and memory facts
- ignores Python cache/test artifacts and removes tracked `scripts/__pycache__/backfill_harrier_embeddings.*.pyc` files
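
The table-normalization fix and the facts-and-sessions-only mode can be illustrated together. The sketch below is not the code in this branch: `toLogicalTable`, `toPhysicalTable`, and `isAllowedInFactsSessionsMode` are hypothetical helpers, and the assumption is that a deployment appends one fixed suffix (such as `_actual`) to every physical table name.

```typescript
// The four tables the stricter psql mode exposes, as listed above.
const FACTS_AND_SESSIONS_TABLES = new Set([
  "sessions",
  "memory_facts",
  "memory_entities",
  "fact_entity_links",
]);

// Strip surrounding double quotes and an optional deployment suffix so that
// "memory_facts_actual" and memory_facts resolve to the same logical table.
// Lowercasing quoted identifiers is a simplification made for this sketch.
function toLogicalTable(name: string, suffix = ""): string {
  const bare = name.replace(/^"|"$/g, "").toLowerCase();
  return suffix !== "" && bare.endsWith(suffix)
    ? bare.slice(0, -suffix.length)
    : bare;
}

// Map a logical table name to its physical name without double-suffixing.
function toPhysicalTable(logical: string, suffix = ""): string {
  return toLogicalTable(logical, suffix) + suffix;
}

// In the restricted mode, only the allowlisted logical tables are visible.
function isAllowedInFactsSessionsMode(name: string, suffix = ""): boolean {
  return FACTS_AND_SESSIONS_TABLES.has(toLogicalTable(name, suffix));
}

console.log(toPhysicalTable("memory_facts", "_actual"));
console.log(isAllowedInFactsSessionsMode("summaries_actual", "_actual"));
```

Normalizing to a logical name first means the allowlist check and the physical-name lookup agree with each other whether or not the incoming query already used the suffixed name.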

Why

The main goal is to make recall more faithful to real session data while improving retrieval quality:

- answers can be grounded on transcript rows instead of only summary documents
- retrieval can use semantic or hybrid ranking instead of relying purely on lexical summary matching
- fact extraction preserves exact phrasing, identities, locations, and relative-time statements better when it works from transcript rows
- SQL-only recall flows can run in a narrower facts-and-sessions mode without exposing the broader summary/graph surface
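
As a rough illustration of how the three retrieval modes differ in ranking, the sketch below blends a lexical score with an embedding cosine similarity. This is a generic weighted blend written for this description, not Deeplake's hybrid scoring or the code in this branch; `scoreCandidate` and the `alpha` weight are assumptions, and both input scores are assumed to be pre-normalized to a comparable range.

```typescript
type RetrievalMode = "classic" | "embedding" | "hybrid";

// Cosine similarity between two equal-length vectors; returns 0 when
// either vector has zero magnitude.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("vectors must have the same dimension");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / Math.sqrt(normA * normB);
}

// Pick the ranking score for one candidate according to the mode.
// alpha (hypothetical) is the weight given to the semantic score in hybrid mode.
function scoreCandidate(
  mode: RetrievalMode,
  lexicalScore: number,
  semanticScore: number,
  alpha = 0.5,
): number {
  if (mode === "classic") return lexicalScore;
  if (mode === "embedding") return semanticScore;
  return alpha * semanticScore + (1 - alpha) * lexicalScore;
}

const semantic = cosineSimilarity([1, 0, 1], [1, 1, 0]);
console.log(scoreCandidate("hybrid", 0.2, semantic));
```

A candidate with a weak lexical match but a close embedding can still rank well in hybrid mode, which is the behavior the second bullet above is after.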

Validation

Validated on this branch before the PR flow:

- `npm run typecheck`
- `npm test -- claude-code/tests/bash-command-compiler.test.ts claude-code/tests/grep-core.test.ts claude-code/tests/memory-facts.test.ts claude-code/tests/hooks-source.test.ts`
- `npm run build`

davidbuniat added 8 commits April 20, 2026 13:54

Align session retrieval with local transcript files

dd8a0e7

Add SQL transcript surface for Deeplake recall

01bd4a1

Store sessions physically per message for SQL recall

61e207b

embedding and graph experiments

5f45b04

Add semantic retrieval and transcript-backed facts

13e8b28

recent changes

356cbd8

Improve regex parity and hybrid retrieval plumbing

a1d0585

Merge origin/main into fstosql

b7814d4

davidbuniat changed the base branch from optimizations to main April 21, 2026 19:42

davidbuniat wants to merge 8 commits into main from fstosql
