fix(benchmarks): add publication verification bundle #166
Merged
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adds a reproducibility-focused “publication verification” bundle for the May 2026 arXiv effort, updates published benchmark claims across docs, and fixes FalkorDB’s /memories/<id>/related fallback query by inlining a sanitized variable-length depth (with a regression test).
Changes:
- Updated README/docs benchmark numbers to the latest verified LoCoMo/LongMemEval results.
- Added benchmarks/publication/2026-05-arxiv/ bundle (commands, notes, machine-readable manifest).
- Fixed FalkorDB fallback related-memories query to inline sanitized depth; added regression test.
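The depth fix can be illustrated with a minimal sketch. It assumes the fallback query is built as a Cypher string in Python and that the range bound cannot be passed as a query parameter; the function names, the depth limits, and the query text below are hypothetical and are not taken from automem/api/recall.py.

```python
def sanitize_depth(raw, default=1, lo=1, hi=3):
    """Coerce raw input to an int clamped to [lo, hi] (limits are assumed)."""
    try:
        depth = int(raw)
    except (TypeError, ValueError, OverflowError):
        # Non-numeric or non-finite input falls back to the default depth.
        return default
    return max(lo, min(hi, depth))


def build_related_query(raw_depth):
    """Build a hypothetical fallback query with the depth inlined."""
    depth = sanitize_depth(raw_depth)
    # depth is a clamped int at this point, so interpolating it cannot
    # introduce arbitrary Cypher. Other values stay as $parameters.
    return (
        f"MATCH (m:Memory {{id: $id}})-[*1..{depth}]-(related:Memory) "
        "WHERE related.id <> $id "
        "RETURN DISTINCT related LIMIT $limit"
    )
```

For example, `build_related_query("2")` yields a pattern containing `[*1..2]`, and no `$max_depth` placeholder appears in the relationship range, which is the property the regression test is described as checking.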
Reviewed changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| tests/test_api_endpoints.py | Adds regression test ensuring fallback query inlines depth and avoids $max_depth in relationship range. |
| automem/api/recall.py | Inlines sanitized max_depth into variable-length range for FalkorDB compatibility. |
| pyproject.toml | Introduces Black/isort configuration and generated-directory excludes. |
| Makefile | Updates bench-health to run using venv Python. |
| .flake8 | Extends excludes for generated/worktree/benchmark directories. |
| README.md | Updates headline benchmark claims and adds publication bundle link. |
| docs/TESTING.md | Updates LoCoMo/LongMemEval baseline tables and example output to new results. |
| docs/COMPARISON.md | Refreshes text to reflect current canonical benchmark posture. |
| benchmarks/EXPERIMENT_LOG.md | Updates headline results; adds publication-bundle pointer and verification entry. |
| benchmarks/publication/2026-05-arxiv/README.md | Adds claim posture + bundle index for the arXiv publication effort. |
| benchmarks/publication/2026-05-arxiv/benchmark-summary.md | Adds paper-ready benchmark summary and limitations. |
| benchmarks/publication/2026-05-arxiv/commands.md | Documents repo + supplemental eval commands for verification. |
| benchmarks/publication/2026-05-arxiv/fresh-verification.md | Captures verification run notes and artifact hashes. |
| benchmarks/publication/2026-05-arxiv/artifact-manifest.json | Adds machine-readable manifest for claims/artifacts/hashes. |
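As a rough illustration of what a machine-readable artifact manifest with hashes might involve, the sketch below walks a directory and records a SHA-256 digest per file. The manifest layout (`artifacts`, `path`, `sha256`) and the function names are assumptions for illustration; the actual schema of artifact-manifest.json is not shown in this PR overview.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict:
    """Collect every file under root into a hypothetical manifest dict."""
    files = sorted(p for p in root.rglob("*") if p.is_file())
    return {
        "artifacts": [
            {"path": p.relative_to(root).as_posix(), "sha256": sha256_of(p)}
            for p in files
        ]
    }


def write_manifest(root: Path, out: Path) -> None:
    """Serialize the manifest as stable, sorted JSON."""
    out.write_text(json.dumps(build_manifest(root), indent=2, sort_keys=True))
```

Sorting the file list and the JSON keys keeps the output byte-stable across runs, so a reviewer can diff a regenerated manifest against the committed one.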
Summary
Breaking Changes
None. No runtime API shape changed.
Related Issues
Publication/reproducibility prep for the AutoMem arXiv paper.
Test Plan
Notes
The console output from the fresh full LongMemEval run included transient gpt-5-mini empty-answer warnings and one local recall read timeout, but the harness completed with memory_ingest_failures=0, judge_errors=0, and publishable=true. The actual arXiv/Hugging Face submission still needs author metadata and an arXiv ID.
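A reviewer re-running the harness could check the reported counters with a small gate like the one below. This is only a sketch: it assumes the run summary is available as a dict keyed by the field names quoted above, and the function name and the rule that both counters must be zero are assumptions, not the harness's documented logic.

```python
def check_run_summary(summary: dict) -> list:
    """Return a list of problems found in a hypothetical run summary."""
    problems = []
    for counter in ("memory_ingest_failures", "judge_errors"):
        value = summary.get(counter)
        if value != 0:
            # A missing counter is treated the same as a non-zero one.
            problems.append(f"{counter}={value!r}, expected 0")
    if summary.get("publishable") is not True:
        problems.append("publishable flag is not true")
    return problems


# Values as reported in the notes above.
reported = {"memory_ingest_failures": 0, "judge_errors": 0, "publishable": True}
assert check_run_summary(reported) == []
```

Warnings and timeouts in the console output are deliberately outside this check, since the notes describe them as transient and the run still completed.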