fix(benchmarks): add publication verification bundle#166

Merged
jack-arturo merged 2 commits into main from feat/automem-arxiv-publication
May 22, 2026
@jack-arturo
Member

Summary

  • update README/docs benchmark claims to the fresh publication verification results: LongMemEval full 87.00% / recall@5 97.00%, LoCoMo full 84.74%
  • add benchmarks/publication/2026-05-arxiv with claim posture, commands, artifact hashes, fresh verification notes, and a machine-readable manifest
  • fix the /memories/<id>/related fallback query by inlining the sanitized FalkorDB variable-length depth, with a regression test
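
A minimal sketch of the depth-inlining fix described in the last bullet, assuming the fallback previously passed the depth as a `$max_depth` query parameter inside the relationship range. The function name `build_related_query`, the `Memory` label, the `$id`/`$limit` parameters, and the cap of 5 are hypothetical and are not taken from `automem/api/recall.py`.

```python
def build_related_query(max_depth, hard_cap=5):
    """Return a Cypher string with a sanitized depth inlined into the range.

    Hypothetical helper: names, labels, and the cap are illustrative only.
    """
    try:
        depth = int(max_depth)  # anything that is not a plain integer is rejected
    except (TypeError, ValueError):
        depth = 1  # fall back to direct neighbours
    depth = max(1, min(depth, hard_cap))  # clamp to [1, hard_cap]

    # The depth is now a bounded int, so string concatenation cannot inject Cypher.
    # Other values stay as ordinary query parameters.
    return (
        "MATCH (m:Memory {id: $id})-[*1.." + str(depth) + "]-(related:Memory) "
        "WHERE related.id <> $id "
        "RETURN DISTINCT related LIMIT $limit"
    )
```

A regression test in the spirit of the one this PR adds could assert that the generated string contains the literal range (for example `[*1..3]`) and never contains `$max_depth`.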

Breaking Changes

None. No runtime API shape changed.

Related Issues

Publication/reproducibility prep for the AutoMem arXiv paper.

Test Plan

  • make test: 238 passed, 1 skipped, 25 deselected
  • .venv/bin/black --check .: 117 files unchanged
  • .venv/bin/isort --check-only .: skipped 10 files
  • make lint: passed
  • make test-integration: 11 passed, 253 deselected
  • make bench-health: HEALTHY, p50=416ms p95=441ms mean=421ms
  • LoCoMo mini pinned judge: 85.20% (259/304), sha256 ba2b98b0055f92ca17de9bc36207d7f39cf90b6270c2c3d903d69b8044aa7015
  • LoCoMo full pinned judge: 84.74% (1683/1986), sha256 a75816e9a6d3302c22b34852b75ac19a9d9f5cb27d1a109e0af7e49359330716
  • LongMemEval mini: 70.00% (21/30), recall@5 96.67%, sha256 7ea922b77e312a17c313bbf8c0e81f0268b48d1082080cae1db3c38e906577b8
  • LongMemEval full: 87.00% (435/500), recall@5 97.00%, sha256 ed6f7cf69b7be6fa0050536ec2b0f947f5510afd8c2a374b3fafb9cde009da75
  • automem-evals: runner unit tests 95 passed, scripts unit tests 10 passed, Writ npm test 72 passed, Writ npm run build passed
  • paper static check: inputs and BibTeX cite keys resolve; no local LaTeX compiler available
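
The reported accuracies can be cross-checked against the correct/total counts listed above. The snippet below is only a sanity check on the arithmetic; the dictionary layout and the `accuracy_pct` helper are illustrative and are not part of the benchmark harness.

```python
# (correct, total, reported percentage) as listed in the test plan
reported = {
    "LoCoMo mini": (259, 304, 85.20),
    "LoCoMo full": (1683, 1986, 84.74),
    "LongMemEval mini": (21, 30, 70.00),
    "LongMemEval full": (435, 500, 87.00),
}


def accuracy_pct(correct, total):
    """Accuracy as a percentage rounded to two decimals."""
    return round(100.0 * correct / total, 2)


def inconsistent_rows(rows, tolerance=0.005):
    """Return the names whose reported percentage disagrees with the counts."""
    return [
        name
        for name, (correct, total, pct) in rows.items()
        if abs(accuracy_pct(correct, total) - pct) > tolerance
    ]


print(inconsistent_rows(reported))
```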

Notes

The fresh LongMemEval full console output included transient gpt-5-mini empty-answer warnings and one local recall read timeout, but the harness completed with memory_ingest_failures=0, judge_errors=0, and publishable=true. Actual arXiv/Hugging Face submission still needs author metadata and an arXiv ID.
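
One way to read the note above is that transient warnings do not block publication, while ingest failures and judge errors do. The sketch below encodes that reading only; the actual criteria the harness uses to set publishable=true are not shown in this PR, and `is_publishable` and the warning keys are hypothetical.

```python
def is_publishable(summary):
    """Hypothetical gate: publishable only when no hard failures were recorded.

    A missing counter is treated as a failure so an incomplete summary
    can never pass by accident.
    """
    hard_failure_keys = ("memory_ingest_failures", "judge_errors")
    return all(summary.get(key, 1) == 0 for key in hard_failure_keys)


# Illustrative summary; the warning count is a placeholder, not a value from the run.
run_summary = {
    "memory_ingest_failures": 0,
    "judge_errors": 0,
    "empty_answer_warnings": 2,   # placeholder value, soft signal only
    "recall_read_timeouts": 1,    # soft signal only
}

print(is_publishable(run_summary))
```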

Copilot AI review requested due to automatic review settings May 18, 2026 06:25
Contributor

Copilot AI left a comment

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds a reproducibility-focused “publication verification” bundle for the May 2026 arXiv effort, updates published benchmark claims across docs, and fixes FalkorDB’s /memories/<id>/related fallback query by inlining a sanitized variable-length depth (with a regression test).

Changes:

  • Updated README/docs benchmark numbers to the latest verified LoCoMo/LongMemEval results.
  • Added benchmarks/publication/2026-05-arxiv/ bundle (commands, notes, machine-readable manifest).
  • Fixed FalkorDB fallback related-memories query to inline sanitized depth; added regression test.

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| tests/test_api_endpoints.py | Adds regression test ensuring fallback query inlines depth and avoids $max_depth in relationship range. |
| automem/api/recall.py | Inlines sanitized max_depth into variable-length range for FalkorDB compatibility. |
| pyproject.toml | Introduces Black/isort configuration and generated-directory excludes. |
| Makefile | Updates bench-health to run using venv Python. |
| .flake8 | Extends excludes for generated/worktree/benchmark directories. |
| README.md | Updates headline benchmark claims and adds publication bundle link. |
| docs/TESTING.md | Updates LoCoMo/LongMemEval baseline tables and example output to new results. |
| docs/COMPARISON.md | Refreshes text to reflect current canonical benchmark posture. |
| benchmarks/EXPERIMENT_LOG.md | Updates headline results; adds publication-bundle pointer and verification entry. |
| benchmarks/publication/2026-05-arxiv/README.md | Adds claim posture + bundle index for the arXiv publication effort. |
| benchmarks/publication/2026-05-arxiv/benchmark-summary.md | Adds paper-ready benchmark summary and limitations. |
| benchmarks/publication/2026-05-arxiv/commands.md | Documents repo + supplemental eval commands for verification. |
| benchmarks/publication/2026-05-arxiv/fresh-verification.md | Captures verification run notes and artifact hashes. |
| benchmarks/publication/2026-05-arxiv/artifact-manifest.json | Adds machine-readable manifest for claims/artifacts/hashes. |
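
A machine-readable manifest of artifact hashes makes it possible to re-verify result files mechanically. The sketch below shows one way that could work using only the standard library. The manifest schema (`artifacts`, `path`, `sha256`) is an assumption for illustration; the real layout of artifact-manifest.json is not shown in this PR.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=65536):
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(manifest, root):
    """Return the paths whose current hash differs from the manifest entry.

    Assumed (hypothetical) schema:
        {"artifacts": [{"path": "<relative path>", "sha256": "<hex digest>"}]}
    """
    mismatches = []
    for entry in manifest["artifacts"]:
        actual = sha256_of(Path(root) / entry["path"])
        if actual != entry["sha256"]:
            mismatches.append(entry["path"])
    return mismatches
```

An empty list from `verify_manifest` would mean every listed artifact still matches its recorded digest.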

Comment thread Makefile Outdated
Comment thread pyproject.toml
Comment thread benchmarks/publication/2026-05-arxiv/commands.md Outdated
Comment thread benchmarks/publication/2026-05-arxiv/fresh-verification.md Outdated
Comment thread tests/test_api_endpoints.py Outdated
Comment thread docs/TESTING.md Outdated
@jack-arturo jack-arturo merged commit 420d721 into main May 22, 2026
7 checks passed
@jack-arturo jack-arturo deleted the feat/automem-arxiv-publication branch May 22, 2026 07:39
2 participants