fix(benchmarks): add publication verification bundle by jack-arturo · Pull Request #166 · verygoodplugins/automem

jack-arturo · 2026-05-18T06:25:37Z

Summary

update README/docs benchmark claims to the fresh publication verification results: LongMemEval full 87.00% / recall@5 97.00%, LoCoMo full 84.74%
add benchmarks/publication/2026-05-arxiv with claim posture, commands, artifact hashes, fresh verification notes, and a machine-readable manifest
fix the /memories/<id>/related fallback query by inlining the sanitized FalkorDB variable-length depth, with a regression test
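For readers unfamiliar with the third item, here is a minimal sketch of what "inlining the sanitized depth" can look like. It assumes, as the PR description implies, that FalkorDB does not accept a query parameter inside a variable-length relationship range. The function names, the depth bounds, the `Memory` label, and the query shape are all hypothetical; this is not the code in `automem/api/recall.py`.

```python
def sanitize_depth(raw, default=1, max_allowed=3):
    """Coerce untrusted input to a bounded positive int.

    The default and upper bound are illustrative assumptions.
    """
    try:
        depth = int(raw)
    except (TypeError, ValueError):
        return default
    return max(1, min(depth, max_allowed))


def build_related_query(memory_id, raw_depth, limit=10):
    """Return (query, params) with the depth written into the query text.

    Hypothetical query shape; labels and property names are made up.
    """
    depth = sanitize_depth(raw_depth)
    # depth is a small int at this point, so formatting it into the query
    # cannot inject Cypher. Everything else stays a bound parameter.
    query = (
        "MATCH (m:Memory {id: $id})-[*1.." + str(depth) + "]-(related:Memory) "
        "WHERE related.id <> $id "
        "RETURN DISTINCT related LIMIT $limit"
    )
    return query, {"id": memory_id, "limit": limit}
```

The point of the design is that only a value already reduced to a bounded integer is ever interpolated; user-supplied strings such as the memory id still travel as parameters.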

Breaking Changes

None. No runtime API shape changed.

Related Issues

Publication/reproducibility prep for the AutoMem arXiv paper.

Test Plan

make test: 238 passed, 1 skipped, 25 deselected
.venv/bin/black --check .: 117 files unchanged
.venv/bin/isort --check-only .: skipped 10 files
make lint: passed
make test-integration: 11 passed, 253 deselected
make bench-health: HEALTHY, p50=416ms p95=441ms mean=421ms
LoCoMo mini pinned judge: 85.20% (259/304), sha256 ba2b98b0055f92ca17de9bc36207d7f39cf90b6270c2c3d903d69b8044aa7015
LoCoMo full pinned judge: 84.74% (1683/1986), sha256 a75816e9a6d3302c22b34852b75ac19a9d9f5cb27d1a109e0af7e49359330716
LongMemEval mini: 70.00% (21/30), recall@5 96.67%, sha256 7ea922b77e312a17c313bbf8c0e81f0268b48d1082080cae1db3c38e906577b8
LongMemEval full: 87.00% (435/500), recall@5 97.00%, sha256 ed6f7cf69b7be6fa0050536ec2b0f947f5510afd8c2a374b3fafb9cde009da75
automem-evals: runner unit tests 95 passed, scripts unit tests 10 passed, Writ npm test 72 passed, Writ npm run build passed
paper static check: inputs and BibTeX cite keys resolve; no local LaTeX compiler available
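The percentages and hashes above can be cross-checked mechanically. The sketch below recomputes each reported score from its correct/total counts and shows one way to hash an artifact file for comparison against the listed sha256 values. The helper names are hypothetical, and which file each hash covers is not stated in this description, so no artifact path is assumed here.

```python
import hashlib


def accuracy_pct(correct, total):
    """Accuracy as a percentage rounded to two decimals."""
    return round(100.0 * correct / total, 2)


def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# (correct, total, reported percentage) copied from the Test Plan above.
REPORTED = {
    "LoCoMo mini": (259, 304, 85.20),
    "LoCoMo full": (1683, 1986, 84.74),
    "LongMemEval mini": (21, 30, 70.00),
    "LongMemEval full": (435, 500, 87.00),
}

for name, (correct, total, reported) in REPORTED.items():
    assert accuracy_pct(correct, total) == reported, name
```

To check an artifact, compare `sha256_of(path)` for the relevant result file with the corresponding digest from the Test Plan.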

Notes

The fresh LongMemEval full console output included transient gpt-5-mini empty-answer warnings and one local recall read timeout, but the harness completed with memory_ingest_failures=0, judge_errors=0, and publishable=true. Actual arXiv/Hugging Face submission still needs author metadata and an arXiv ID.

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds a reproducibility-focused “publication verification” bundle for the May 2026 arXiv effort, updates published benchmark claims across docs, and fixes FalkorDB’s /memories/<id>/related fallback query by inlining a sanitized variable-length depth (with a regression test).

Changes:

Updated README/docs benchmark numbers to the latest verified LoCoMo/LongMemEval results.
Added benchmarks/publication/2026-05-arxiv/ bundle (commands, notes, machine-readable manifest).
Fixed FalkorDB fallback related-memories query to inline sanitized depth; added regression test.
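A regression test for the last change could assert on the generated query text. The following is a sketch of that kind of check, not the test added in `tests/test_api_endpoints.py`: the helper name and the sample queries are hypothetical, and the regex only covers simple `*low..high` ranges.

```python
import re

# Matches the range part of a variable-length relationship,
# for example "*1..3" or "*1..$max_depth".
_RANGE = re.compile(r"\*\s*(\$?\w+)?\s*\.\.\s*(\$?\w+)?")


def parameterized_ranges(query):
    """Return range bounds that are query parameters rather than literals."""
    found = []
    for low, high in _RANGE.findall(query):
        found.extend(bound for bound in (low, high) if bound.startswith("$"))
    return found


# Hypothetical "before" and "after" query strings.
before = "MATCH (m {id: $id})-[*1..$max_depth]-(r) RETURN r"
after = "MATCH (m {id: $id})-[*1..2]-(r) RETURN r"

assert parameterized_ranges(before) == ["$max_depth"]
assert parameterized_ranges(after) == []
```

Parameters elsewhere in the query, such as `$id`, are ignored on purpose; only the relationship range is inspected.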

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 6 comments.

Show a summary per file

File	Description
tests/test_api_endpoints.py	Adds regression test ensuring fallback query inlines depth and avoids `$max_depth` in relationship range.
automem/api/recall.py	Inlines sanitized `max_depth` into variable-length range for FalkorDB compatibility.
pyproject.toml	Introduces Black/isort configuration and generated-directory excludes.
Makefile	Updates `bench-health` to run using venv Python.
.flake8	Extends excludes for generated/worktree/benchmark directories.
README.md	Updates headline benchmark claims and adds publication bundle link.
docs/TESTING.md	Updates LoCoMo/LongMemEval baseline tables and example output to new results.
docs/COMPARISON.md	Refreshes text to reflect current canonical benchmark posture.
benchmarks/EXPERIMENT_LOG.md	Updates headline results; adds publication-bundle pointer and verification entry.
benchmarks/publication/2026-05-arxiv/README.md	Adds claim posture + bundle index for the arXiv publication effort.
benchmarks/publication/2026-05-arxiv/benchmark-summary.md	Adds paper-ready benchmark summary and limitations.
benchmarks/publication/2026-05-arxiv/commands.md	Documents repo + supplemental eval commands for verification.
benchmarks/publication/2026-05-arxiv/fresh-verification.md	Captures verification run notes and artifact hashes.
benchmarks/publication/2026-05-arxiv/artifact-manifest.json	Adds machine-readable manifest for claims/artifacts/hashes.

fix(benchmarks): add publication verification bundle

2e6ebd9

Copilot AI review requested due to automatic review settings May 18, 2026 06:25

Copilot AI reviewed May 18, 2026


fix(publication): address copilot review on PR #166

f13fe9e

jack-arturo merged commit 420d721 into main May 22, 2026
7 checks passed

jack-arturo deleted the feat/automem-arxiv-publication branch May 22, 2026 07:39

jack-arturo mentioned this pull request May 14, 2026

chore(main): release 0.16.0 #154

Open

jack-arturo merged 2 commits into main from feat/automem-arxiv-publication