[codex] Improve SQL recall and semantic retrieval#65

Draft
davidbuniat wants to merge 8 commits into main from fstosql

@davidbuniat
Member

Summary

This PR brings the fstosql branch onto optimizations as a focused recall and retrieval upgrade. The branch layers five related changes together:

  • align session retrieval with local transcript files
  • add a SQL transcript surface for Deeplake recall
  • store sessions as one physical row per message so SQL recall can ground answers on transcript rows
  • add embedding / graph retrieval experiments, including semantic and hybrid grep retrieval modes
  • switch fact extraction and LOCOMO fact backfill from summary-driven prompts to transcript-row prompts
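
As a rough illustration of what per-message session storage could look like, the sketch below flattens one session transcript into one row per message. The `TranscriptMessage` and `SessionRow` shapes and the `toSessionRows` helper are hypothetical names chosen for this example; the PR description does not show the actual schema.

```typescript
// Hypothetical shapes: the real session schema is not shown in this PR description.
interface TranscriptMessage {
  role: "user" | "assistant";
  content: string;
}

interface SessionRow {
  sessionId: string;
  messageIndex: number; // position of the message within its session
  role: string;
  content: string;
}

// Flatten a session into one row per message so a SQL query can
// select and cite individual transcript rows.
function toSessionRows(sessionId: string, messages: TranscriptMessage[]): SessionRow[] {
  return messages.map((message, messageIndex) => ({
    sessionId,
    messageIndex,
    role: message.role,
    content: message.content,
  }));
}

const exampleRows = toSessionRows("session-1", [
  { role: "user", content: "Where did we deploy last week?" },
  { role: "assistant", content: "To the staging cluster." },
]);
console.log(exampleRows.length);
```

Keeping `messageIndex` on each row preserves the original order, so a recall query can return neighbouring rows for context.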

What changed

  • adds transcript-oriented SQL recall plumbing across the bash command compiler, session-start guidance, virtual table access, and session storage
  • introduces configurable retrieval modes for grep-backed search: classic, embedding, and hybrid, using Harrier embeddings and Deeplake hybrid scoring when enabled
  • adds a stricter facts-and-sessions-only psql mode that exposes only sessions, memory_facts, memory_entities, and fact_entity_links
  • fixes intercepted SQL table normalization so related physical tables resolve correctly when deployments use suffixed table names such as _actual
  • moves memory fact extraction onto transcript rows rather than summary markdown, and updates the Claude/Codex wiki workers plus the LOCOMO fact backfill path to match
  • adds Harrier embedding/backfill tooling, related Python dependencies, and bundle externalization for transformers / ONNX runtime
  • regenerates the checked-in Claude/Codex bundles and expands source-level test coverage around grep, SQL interception, hook guidance, summaries, sessions, and memory facts
  • ignores Python cache/test artifacts and removes tracked scripts/__pycache__/backfill_harrier_embeddings.*.pyc files
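
The table allowlist and the suffix handling described above can be sketched together. This is a minimal illustration, not the code in this branch: the function names and the idea of passing the suffix as a parameter are assumptions; only the four table names and the `_actual` suffix come from the description.

```typescript
// The four tables the description lists for the facts-and-sessions-only mode.
const ALLOWED_TABLES = ["sessions", "memory_facts", "memory_entities", "fact_entity_links"] as const;
type LogicalTable = (typeof ALLOWED_TABLES)[number];

// Hypothetical helper: map a table name from an intercepted query back to its
// logical name, accepting both the bare and the suffixed form.
function toLogicalTable(name: string, suffix: string): LogicalTable | null {
  const lowered = name.trim().toLowerCase();
  const base =
    suffix !== "" && lowered.endsWith(suffix)
      ? lowered.slice(0, -suffix.length)
      : lowered;
  return (ALLOWED_TABLES as readonly string[]).includes(base) ? (base as LogicalTable) : null;
}

// Hypothetical helper: resolve to the physical table, rejecting anything
// outside the allowlist.
function toPhysicalTable(name: string, suffix: string): string {
  const logical = toLogicalTable(name, suffix);
  if (logical === null) {
    throw new Error(`table not exposed in facts-and-sessions mode: ${name}`);
  }
  return logical + suffix;
}

console.log(toPhysicalTable("memory_facts", "_actual"));
```

Normalizing to the logical name first means the allowlist check and the suffix rule are applied the same way whether a query names `sessions` or `sessions_actual`.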

Why

The main goal is to make recall more faithful to real session data while improving retrieval quality:

  1. answers can be grounded on transcript rows instead of only summary documents
  2. retrieval can use semantic or hybrid ranking instead of relying purely on lexical summary matching
  3. fact extraction preserves exact phrasing, identities, locations, and relative-time statements better when it works from transcript rows
  4. SQL-only recall flows can run in a narrower facts-and-sessions mode without exposing the broader summary/graph surface
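
To make the difference between the three retrieval modes concrete, here is a generic sketch of classic, embedding, and hybrid scoring for a single row. It is an assumption-laden illustration: the weighted blend, the term-overlap lexical score, and every name below are hypothetical, and it does not reproduce Deeplake's hybrid scoring or the Harrier embedding pipeline.

```typescript
type RetrievalMode = "classic" | "embedding" | "hybrid";

// Cosine similarity between two equal-length vectors; 0 if either is all zeros.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("vector length mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  a.forEach((x, i) => {
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  });
  if (normA === 0 || normB === 0) return 0;
  return dot / Math.sqrt(normA * normB);
}

// Fraction of query terms that appear in the text (a stand-in for grep-style matching).
function lexicalScore(query: string, text: string): number {
  const terms = query.toLowerCase().split(/\s+/).filter((t) => t.length > 0);
  if (terms.length === 0) return 0;
  const haystack = text.toLowerCase();
  const hits = terms.filter((t) => haystack.includes(t)).length;
  return hits / terms.length;
}

// Hypothetical scorer: lexicalWeight is an assumed knob, not a documented setting.
function scoreRow(
  mode: RetrievalMode,
  query: string,
  queryVector: number[],
  rowText: string,
  rowVector: number[],
  lexicalWeight = 0.5,
): number {
  const lexical = lexicalScore(query, rowText);
  if (mode === "classic") return lexical;
  const semantic = cosineSimilarity(queryVector, rowVector);
  if (mode === "embedding") return semantic;
  return lexicalWeight * lexical + (1 - lexicalWeight) * semantic;
}

console.log(scoreRow("hybrid", "red car", [1, 0], "a red bike", [0, 1]));
```

In this toy setup a row that shares words with the query but points in an unrelated embedding direction scores well in classic mode, zero in embedding mode, and in between in hybrid mode.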

Validation

The following commands were run on this branch before opening the PR:

  • npm run typecheck
  • npm test -- claude-code/tests/bash-command-compiler.test.ts claude-code/tests/grep-core.test.ts claude-code/tests/memory-facts.test.ts claude-code/tests/hooks-source.test.ts
  • npm run build

davidbuniat changed the base branch from optimizations to main on April 21, 2026 19:42