fix: escape Lucene boolean keywords as whole words, not individual characters (fixes #1302) #1371

Open
voidborne-d wants to merge 1 commit into getzep:main from voidborne-d:fix/lucene-sanitize-uppercase-corruption

@voidborne-d

Problem

lucene_sanitize() in graphiti_core/helpers.py used str.maketrans to escape individual uppercase letters O, R, N, T, A, D — intending to block Lucene boolean operators (AND, OR, NOT). Because str.maketrans maps every occurrence of each character, this corrupted virtually every real-world query:

| Query | Before fix | After fix |
|---|---|---|
| Donald Trump | \Donald \Trump | Donald Trump |
| ORACLE | \O\R\ACLE | ORACLE |
| NASA | \N\AS\A | NASA |
| Toronto | \Toronto | Toronto |
| cats AND dogs | cats \A\N\D dogs | cats \AND dogs |

This broke BM25 fulltext search for entity names, facts, and episodes whenever they contained any of those six uppercase letters, which covers most capitalized real-world text.
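The corruption can be reproduced in isolation. The snippet below is a minimal sketch, not the original code from graphiti_core/helpers.py: the names OLD_LETTER_MAP and old_escape_letters are hypothetical, and the map contains only the six letters (the real map also covered Lucene special characters).

```python
# Hypothetical reconstruction of the old letter-level escaping.
# Only the six uppercase letters are mapped here; the real escape map
# in helpers.py also handled Lucene special characters.
OLD_LETTER_MAP = str.maketrans({c: '\\' + c for c in 'ORNTAD'})


def old_escape_letters(query: str) -> str:
    # str.translate rewrites every occurrence of each mapped character,
    # regardless of the word it appears in.
    return query.translate(OLD_LETTER_MAP)


for q in ['Donald Trump', 'ORACLE', 'NASA', 'Toronto', 'cats and dogs']:
    print(f'{q!r} -> {old_escape_letters(q)!r}')
```

Lowercase text such as cats and dogs passes through untouched, which is why the bug only showed up for capitalized names and acronyms.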

Fixes #1302.

Fix

  • Remove O, R, N, T, A, D from the character-level str.maketrans escape map
  • Add _LUCENE_KEYWORD_RE = re.compile(r'\b(AND|OR|NOT)\b') that matches the keywords only as standalone uppercase whole words
  • Apply keyword escaping as a second pass: lambda m: '\\' + m.group(1)
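The two passes described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the code in the PR: sanitize_sketch, _SPECIAL_CHARS, and _ESCAPE_MAP are hypothetical names, and the set of special characters is an assumed subset that may differ from the map in helpers.py. Only _LUCENE_KEYWORD_RE and the lambda come from the description.

```python
import re

# Assumed set of Lucene special characters (may differ from helpers.py).
_SPECIAL_CHARS = '+-&|!(){}[]^"~*?:\\/'
_ESCAPE_MAP = str.maketrans({c: '\\' + c for c in _SPECIAL_CHARS})

# Pattern taken from the PR description: AND, OR, NOT as whole words only.
_LUCENE_KEYWORD_RE = re.compile(r'\b(AND|OR|NOT)\b')


def sanitize_sketch(query: str) -> str:
    # Pass 1: escape special characters one by one.
    escaped = query.translate(_ESCAPE_MAP)
    # Pass 2: prefix standalone boolean keywords with a backslash.
    return _LUCENE_KEYWORD_RE.sub(lambda m: '\\' + m.group(1), escaped)


print(sanitize_sketch('cats AND dogs'))  # cats \AND dogs
print(sanitize_sketch('Donald Trump'))   # Donald Trump
print(sanitize_sketch('(a OR b)'))       # \(a \OR b\)
```

In this sketch the order of the passes matters: because the assumed character map includes the backslash itself, running the keyword pass first would let the character pass double the backslash it had just inserted.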

Result:

  • ANDROID, TORNADO, ORPHAN, NORMANDY — pass through unmodified
  • Standalone AND, OR, NOT — correctly escaped to \AND, \OR, \NOT
  • All Lucene special characters still escaped as before
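The word-boundary behavior behind these results can be checked with the regex on its own. The helper name escape_keywords is hypothetical; the pattern is the one quoted in the Fix section. Two side effects of this pattern are worth noting: lowercase and, or, not are left alone because the match is case-sensitive, and an underscore counts as a word character for \b, so a token like AND_GATE is not escaped.

```python
import re

_LUCENE_KEYWORD_RE = re.compile(r'\b(AND|OR|NOT)\b')


def escape_keywords(text: str) -> str:
    # Hypothetical helper: applies only the keyword pass.
    return _LUCENE_KEYWORD_RE.sub(lambda m: '\\' + m.group(1), text)


samples = [
    'ANDROID', 'TORNADO', 'ORPHAN', 'NORMANDY',  # keyword inside a word
    'AND', 'a OR b', 'NOT x',                    # standalone keywords
    'and or not',                                # lowercase: no match
    'AND_GATE',                                  # underscore is a word char
]
for s in samples:
    print(f'{s!r} -> {escape_keywords(s)!r}')
```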

Changes

  • graphiti_core/helpers.py — replace letter-level escaping of O, R, N, T, A, D with a whole-word regex substitution (regex + lambda); special-character escaping is unchanged
  • tests/test_lucene_sanitize.py — 58 new parametrized tests covering all cases
  • tests/helpers_test.py — update existing assertion that expected T in This to be escaped

Test results

58 passed, 0 failed

…aracters (fixes getzep#1302)

lucene_sanitize() used str.maketrans to escape individual uppercase
letters O, R, N, T, A, D — intending to block Lucene boolean operators
(AND, OR, NOT). Because str.maketrans maps every occurrence of each
character, this corrupted virtually every query containing those common
letters:

  'Donald Trump' → '\Donald \Trump'
  'ORACLE'       → '\O\R\ACLE'
  'NASA'         → '\N\AS\A'
  'Toronto'      → '\Toronto'

This broke BM25 fulltext search for entity names, facts, and episodes
whenever they contained any of those six letters — which is nearly all
real-world text.

Fix:
- Remove O, R, N, T, A, D from the character escape map
- Add a regex (_LUCENE_KEYWORD_RE) that matches AND, OR, NOT only as
  whole uppercase words (\b boundary)
- Apply keyword escaping as a second pass after character escaping
- Words like ANDROID, TORNADO, ORPHAN pass through correctly

Also updates the existing test_lucene_sanitize assertion which expected
the T in 'This' to be escaped.

58 new tests covering:
- Uppercase letters preserved in normal words (17 cases)
- Boolean keywords properly escaped (7 cases)
- Keywords inside words NOT escaped (10 cases)
- Special characters still escaped (18 cases)
- Combined scenarios (6 cases)
@danielchalef
Member

danielchalef commented Apr 2, 2026

All contributors have signed the CLA ✍️ ✅
Posted by the CLA Assistant Lite bot.

@voidborne-d
Author

I have read the CLA Document and I hereby sign the CLA

Development

Successfully merging this pull request may close these issues.

[BUG] lucene_sanitize() escapes individual uppercase letters, breaking BM25 for most queries
