fix(benchmarks): add publication verification bundle #166
Merged
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,59 @@ | ||
| # AutoMem arXiv Publication Bundle | ||
|
|
||
| This bundle collects the repository-side material for the May 2026 AutoMem | ||
| arXiv preprint effort. It is intentionally conservative: canonical claims come | ||
| from this repository's official benchmark log, while exploratory and external | ||
| numbers are labeled separately. | ||
|
|
||
| ## Claim Posture | ||
|
|
||
| AutoMem should be described as an open-source, inspectable, MCP-first | ||
| graph-vector memory service for AI agents with transparent benchmark harnesses | ||
| and strong canonical LoCoMo / LongMemEval results. | ||
|
|
||
| Do not claim "best memory system", "SOTA", or "beats Mem0" from this bundle. | ||
| Those claims require apples-to-apples reruns against the current external | ||
| systems, judge policies, dataset versions, and scale settings. | ||
|
|
||
| ## Canonical Results | ||
|
|
||
| | Status | Benchmark | Scope | Score | Retrieval | Source | | ||
| |---|---:|---:|---:|---:|---| | ||
| | canonical | LongMemEval full | 500 questions | 87.00% (435/500) | recall@5 97.00% (485/500) | Fresh publication verification run; see `fresh-verification.md` | | ||
| | representative canary | LongMemEval mini | 30 stratified questions | 70.00% (21/30) | recall@5 96.67% (29/30) | Fresh publication verification run; see `fresh-verification.md` | | ||
| | canonical | LoCoMo full | 10 conversations, 1,986 questions | 84.74% (1683/1986) | not reported | Fresh publication verification run with pinned `gpt-5.4-mini-2026-03-17` judge | | ||
|
|
||
| Canonical LongMemEval model policy: | ||
|
|
||
| - Answerer: `gpt-5-mini` | ||
| - Judge: `gpt-5.4-mini-2026-03-17` | ||
| - Judge errors: `0` | ||
| - Memory ingest failures: `0` | ||
| - Harness publishable flag: `true` | ||
|
|
||
| ## Supplemental Signals | ||
|
|
||
| | Status | Benchmark / Evidence | Scope | Result | Caveat | | ||
| |---|---|---:|---:|---| | ||
| | exploratory | BEAM 100K V1 raw-dialogue shim | 20 conversations, 400 questions | 76.25% (305/400), avg 0.677 | Not comparable to published BEAM 1M/10M claims. | | ||
| | exploratory | BEAM 100K V2 fact-extraction shim | 20 conversations, 400 questions | 73.75% (295/400), avg 0.653 | Diagnostic failure-mode signal only. | | ||
| | exploratory | Writ drift integration | 5 drift scenarios | 100% recall accuracy, 20% update fidelity, 0% drift rate | Lives in `automem-evals`; must remain labeled supplemental until promoted. | | ||
| | exploratory | Claude Code hook replay | fixture suite | metrics harness only | Lives in `automem-evals`; workflow-continuity signal, not a memory benchmark. | | ||
| | external reported | Mem0 managed platform | LoCoMo / LongMemEval / BEAM | see cited Mem0 docs | Proprietary managed-platform optimizations; not directly comparable. | | ||
| | not yet run | BEAM official 1M/10M | official BEAM scale | -- | Required before any BEAM-competitive claim. | | ||
| | not yet run | LongMemEval-V2 | web-agent memory | -- | Required before "experienced colleague" claims. | | ||
| | not yet run | Memora / FAMA | invalidated-memory reuse | -- | Natural fit for `INVALIDATED_BY`/`CONTRADICTS`, but not run yet. | | ||
|
|
||
| ## Bundle Files | ||
|
|
||
| - `benchmark-summary.md` - paper-ready benchmark and limitation summary. | ||
| - `artifact-manifest.json` - machine-readable manifest for claims, generated-artifact paths, and commands. | ||
| - `commands.md` - verification and reproduction command inventory. | ||
| - `fresh-verification.md` - latest local verification notes and generated artifact hashes. | ||
|
|
||
| ## Promotion Rule | ||
|
|
||
| Results from `../automem-evals` may inform the paper only as supplemental | ||
| evidence until a result is reproduced or explicitly summarized in this | ||
| repository. Official benchmark claims remain owned by | ||
| `benchmarks/EXPERIMENT_LOG.md`. |
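The Claim Posture rule in the README above could be checked mechanically against a paper draft. The following is a minimal sketch under stated assumptions: the phrase list is copied from the README text, while the function name `find_forbidden_claims` and the case-insensitive whole-word matching are hypothetical and not part of this PR.

```python
import re

# Phrases the README's Claim Posture section says not to claim.
# The list is taken from the README; the matching rules below are an assumption.
FORBIDDEN_CLAIMS = ("best memory system", "SOTA", "beats Mem0")


def find_forbidden_claims(text: str) -> list:
    """Return (line_number, phrase) pairs for each forbidden phrase in text."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for phrase in FORBIDDEN_CLAIMS:
            # Case-insensitive whole-word match, so "SOTA" embedded in a
            # longer token is not flagged.
            pattern = rf"\b{re.escape(phrase)}\b"
            if re.search(pattern, line, flags=re.IGNORECASE):
                hits.append((lineno, phrase))
    return hits


if __name__ == "__main__":
    draft = "AutoMem is an open-source memory service.\nIt beats Mem0 and is SOTA."
    for lineno, phrase in find_forbidden_claims(draft):
        print(f"line {lineno}: forbidden claim {phrase!r}")
```

Run against the README itself, this would flag the "Do not claim" sentence, since that sentence quotes the phrases. It is meant for paper drafts and summaries rather than for the bundle's own policy text.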
145 changes: 145 additions & 0 deletions
benchmarks/publication/2026-05-arxiv/artifact-manifest.json
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,145 @@ | ||
| { | ||
| "bundle": "automem-arxiv-2026-05", | ||
| "generated_at_utc": "2026-05-18T06:20:58Z", | ||
| "automem_git_sha_at_creation": "a742602f5d6ad2dea5a4d3c387d5b49d610afe2c", | ||
| "git_sha_note": "This records the base HEAD before publication-bundle edits; the PR commit SHA is supplied by GitHub after commit creation.", | ||
| "official_claim_source": "benchmarks/EXPERIMENT_LOG.md", | ||
| "judge_policy": "docs/BENCHMARK_JUDGE_POLICY.md", | ||
| "claims": [ | ||
| { | ||
| "status": "canonical", | ||
| "benchmark": "LongMemEval", | ||
| "scope": "full", | ||
| "questions": 500, | ||
| "score": "87.00% (435/500)", | ||
| "retrieval": "recall@5 97.00% (485/500)", | ||
| "answer_model": "gpt-5-mini", | ||
| "judge_model": "gpt-5.4-mini-2026-03-17", | ||
| "source": "benchmarks/EXPERIMENT_LOG.md; fresh publication verification run 2026-05-18 UTC", | ||
| "generated_artifact": { | ||
| "path": "benchmarks/results/longmemeval-full-publication-20260518.json", | ||
| "gitignored": true, | ||
| "sha256": "ed6f7cf69b7be6fa0050536ec2b0f947f5510afd8c2a374b3fafb9cde009da75" | ||
| }, | ||
| "hypotheses_artifact": { | ||
| "path": "benchmarks/results/longmemeval-full-publication-20260518.jsonl", | ||
| "gitignored": true, | ||
| "sha256": "69cd9c8171d5caec8661b1d8c2b27579decc40e2f134ffde74fcc93e70e2e7ce" | ||
| }, | ||
| "memory_ingest_failures": 0, | ||
| "judge_errors": 0, | ||
| "publishable": true, | ||
| "elapsed_seconds": 10021.779591798782, | ||
| "category_breakdown": { | ||
| "knowledge_update": "88.46% (69/78)", | ||
| "multi_session": "84.21% (112/133)", | ||
| "single_session_assistant": "98.21% (55/56)", | ||
| "single_session_preference": "56.67% (17/30)", | ||
| "single_session_user": "92.86% (65/70)", | ||
| "temporal_reasoning": "87.97% (117/133)" | ||
| }, | ||
| "failure_split": "65 wrong total; 54 wrong had answer session retrieved at recall@5; 11 wrong were retrieval misses; 4 correct answers were retrieval misses." | ||
| }, | ||
| { | ||
| "status": "representative_canary", | ||
| "benchmark": "LongMemEval", | ||
| "scope": "mini stratified", | ||
| "questions": 30, | ||
| "score": "70.00% (21/30)", | ||
| "retrieval": "recall@5 96.67% (29/30)", | ||
| "answer_model": "gpt-5-mini", | ||
| "judge_model": "script LLM eval", | ||
| "source": "local run 2026-05-17; see fresh-verification.md", | ||
| "generated_artifact": { | ||
| "path": "benchmarks/results/longmemeval-mini-publication-20260517.json", | ||
| "gitignored": true, | ||
| "sha256": "7ea922b77e312a17c313bbf8c0e81f0268b48d1082080cae1db3c38e906577b8" | ||
| } | ||
| }, | ||
| { | ||
| "status": "canonical", | ||
| "benchmark": "LoCoMo", | ||
| "scope": "full", | ||
| "questions": 1986, | ||
| "score": "84.74% (1683/1986)", | ||
| "retrieval": null, | ||
| "answer_model": null, | ||
| "judge_model": "gpt-5.4-mini-2026-03-17", | ||
| "source": "benchmarks/EXPERIMENT_LOG.md; fresh publication verification run 2026-05-17", | ||
| "generated_artifact": { | ||
| "path": "benchmarks/results/locomo_baseline_20260517_193934.json", | ||
| "gitignored": true, | ||
| "sha256": "a75816e9a6d3302c22b34852b75ac19a9d9f5cb27d1a109e0af7e49359330716" | ||
| }, | ||
| "judge_calls": 444, | ||
| "judge_errors": 0, | ||
| "judge_skips": 0, | ||
| "estimated_cost_usd": 0.790877, | ||
| "category_breakdown": { | ||
| "single_hop": "52.13% (147/282)", | ||
| "temporal": "86.60% (278/321)", | ||
| "multi_hop": "46.88% (45/96)", | ||
| "open_domain": "93.58% (787/841)", | ||
| "complex": "95.52% (426/446)" | ||
| } | ||
| }, | ||
| { | ||
| "status": "fresh_verification", | ||
| "benchmark": "LoCoMo", | ||
| "scope": "mini", | ||
| "questions": 304, | ||
| "score": "85.20% (259/304)", | ||
| "retrieval": null, | ||
| "answer_model": null, | ||
| "judge_model": "gpt-5.4-mini-2026-03-17", | ||
| "source": "local run 2026-05-17; see fresh-verification.md", | ||
| "generated_artifact": { | ||
| "path": "benchmarks/results/locomo-mini_baseline_20260517_182318.json", | ||
| "gitignored": true, | ||
| "sha256": "ba2b98b0055f92ca17de9bc36207d7f39cf90b6270c2c3d903d69b8044aa7015" | ||
| } | ||
| }, | ||
| { | ||
| "status": "exploratory", | ||
| "benchmark": "BEAM", | ||
| "scope": "100K V1 raw-dialogue shim", | ||
| "questions": 400, | ||
| "score": "76.25% (305/400), avg 0.677", | ||
| "retrieval": "top-k 200", | ||
| "source": "benchmarks/EXPERIMENT_LOG.md" | ||
| }, | ||
| { | ||
| "status": "exploratory", | ||
| "benchmark": "BEAM", | ||
| "scope": "100K V2 fact-extraction shim", | ||
| "questions": 400, | ||
| "score": "73.75% (295/400), avg 0.653", | ||
| "retrieval": "top-k 200", | ||
| "source": "benchmarks/EXPERIMENT_LOG.md" | ||
| }, | ||
| { | ||
| "status": "exploratory", | ||
| "benchmark": "Writ", | ||
| "scope": "drift category, 5 scenarios", | ||
| "questions": 5, | ||
| "score": "100.0% recall_accuracy; 20.0% update_fidelity; 0.0% drift_rate", | ||
| "retrieval": null, | ||
| "source": "../automem-evals/docs/writ_integration.md" | ||
| }, | ||
| { | ||
| "status": "exploratory", | ||
| "benchmark": "Claude Code hook replay", | ||
| "scope": "fixture and metrics harness", | ||
| "questions": null, | ||
| "score": "harness tests only; no publication score", | ||
| "retrieval": null, | ||
| "source": "../automem-evals/docs/session_2026-04-28_hook_replay.md" | ||
| } | ||
| ], | ||
| "not_yet_run": [ | ||
| "BEAM official 1M/10M", | ||
| "LongMemEval-V2", | ||
| "Memora/FAMA", | ||
| "Mem0 managed-platform apples-to-apples comparison" | ||
| ] | ||
| } |
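Because the manifest stores scores as strings such as `87.00% (435/500)` and records a `sha256` for each generated artifact, both can be re-checked by a reviewer. The sketch below assumes the field names shown in the manifest above (`claims`, `score`, `questions`, `generated_artifact`, `hypotheses_artifact`, `path`, `sha256`). The helper names `score_is_consistent`, `sha256_of`, and `check_manifest` are hypothetical and do not exist in the repository. Artifacts marked `gitignored` are assumed to be absent in a fresh clone, so missing files are skipped rather than reported.

```python
import hashlib
import json
import re
from pathlib import Path
from typing import Optional

# Matches the leading "P% (correct/total)" part of a score string.
SCORE_RE = re.compile(r"(\d+(?:\.\d+)?)% \((\d+)/(\d+)\)")


def score_is_consistent(score: str, questions: Optional[int] = None) -> bool:
    """True if P equals 100*correct/total to two decimals and total matches."""
    match = SCORE_RE.match(score)
    if match is None:
        return False
    pct = float(match.group(1))
    correct, total = int(match.group(2)), int(match.group(3))
    if total == 0 or (questions is not None and total != questions):
        return False
    # Allow half a unit in the second decimal place for rounding.
    return abs(pct - 100 * correct / total) <= 0.005 + 1e-9


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large result files are not loaded at once."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_manifest(manifest: dict, repo_root: Path) -> list:
    """Return human-readable problems; an empty list means nothing was flagged."""
    problems = []
    for claim in manifest["claims"]:
        label = f'{claim["benchmark"]} {claim["scope"]}'
        score = claim["score"]
        # Free-text scores (no "P% (c/n)" prefix) are skipped, not flagged.
        if SCORE_RE.match(score) and not score_is_consistent(
            score, claim.get("questions")
        ):
            problems.append(f"{label}: score {score!r} is inconsistent")
        for key in ("generated_artifact", "hypotheses_artifact"):
            artifact = claim.get(key)
            if artifact is None:
                continue
            path = repo_root / artifact["path"]
            if not path.exists():
                continue  # assumed gitignored and not present locally
            if sha256_of(path) != artifact["sha256"]:
                problems.append(f"{label}: sha256 mismatch for {artifact['path']}")
    return problems


if __name__ == "__main__":
    manifest_path = Path("benchmarks/publication/2026-05-arxiv/artifact-manifest.json")
    for problem in check_manifest(json.loads(manifest_path.read_text()), Path(".")):
        print(problem)
```

An empty result only shows that the recorded strings and hashes agree with each other and with any local files. It does not reproduce the benchmark runs themselves.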