Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 06e7340c49
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| const matches = [...str.matchAll(regex)]; | ||
| const n = matches.length; | ||
| // Total of 2^n possible combinations | ||
| const total = 1 << n; |
There was a problem hiding this comment.
Avoid bitshift overflow in replacement count
Computing total with 1 << n breaks for inputs with 31+ matches (common when users search long Arabic phrases), because JS bitwise ops are 32-bit signed: at n = 31 it becomes negative and the loop yields no variants, and at higher n it wraps to incorrect counts. In this case the processor can return an empty array, which causes variant generation to collapse and lookups to miss entirely for those inputs.
Useful? React with 👍 / 👎.
| for (let mask = 0; mask < total; mask++) { | ||
| let i = 0; | ||
|
|
||
| const result = str.replaceAll(regex, (match) => { |
There was a problem hiding this comment.
Limit variant generation before building all combinations
This loop materializes every 2^n combination before any truncation happens, so even moderate inputs (e.g., 20 occurrences of ا) allocate over a million strings per call. Because Arabic now chains three such preprocessors, large selections can trigger severe CPU/memory spikes before downstream MAX_PROCESS_VARIANTS limits are applied.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
I did a scan on the Living Arabic yomitan dictionary (currently the most extensive Arabic yomitan dict that I'm aware of), the highest number of occurrences of ا (Alif) in a given word was 4, which should be the extreme case
| export const addHamzaTop = { | ||
| name: 'Add Hamza to top of Alif', | ||
| description: 'اكبر → أكبر', | ||
| process: (text) => [text, text.replace('ا', 'أ')], | ||
| process: (text) => generateReplacementCombinations(text, 'ا', 'أ'), | ||
| }; | ||
|
|
||
| /** @type {import('language').TextProcessor} */ | ||
| export const addHamzaBottom = { | ||
| name: 'Add Hamza to bottom of Alif', | ||
| description: 'اسلام → إسلام', | ||
| process: (text) => [text, text.replace('ا', 'إ')], | ||
| process: (text) => generateReplacementCombinations(text, 'ا', 'إ'), | ||
| }; | ||
|
|
||
| /** @type {import('language').TextProcessor} */ | ||
| export const addMadd = { | ||
| name: 'Add Madd to Alif', | ||
| description: 'الان → الآن', | ||
| process: (text) => generateReplacementCombinations(text, 'ا', 'آ'), | ||
| }; |
There was a problem hiding this comment.
Since these are all replacing the same letter, they could be combined into a single preprocessor that handles all the replacements, rather than split into 3 individual preprocessors.
The current approach results in double the number of results. Will convert to a draft PR until updated
|
Sorry forgot about this - have now made the change mentioned above so reopening |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b72a9fae07
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| const total = m ** n; // m^n combinations | ||
|
|
||
| const results = []; | ||
|
|
||
| for (let combination = 0; combination < total; combination++) { |
There was a problem hiding this comment.
Cap replacement combinations before materializing variants
generateReplacementCombinations always allocates every m^n permutation (total + full for loop) before any safeguard runs, while truncation to MAX_PROCESS_VARIANTS only happens later in translator._getProcessedTexts. With substituteAlif (m=4), inputs containing several plain alifs can explode to tens of thousands+ strings per call and stall lookups; fresh evidence versus the earlier thread is that lookup input is not bounded to dictionary-word shapes (scanner/search paths accept multi-character user text), so this can be triggered by real queries, not just synthetic words.
Useful? React with 👍 / 👎.
In light of the improvements to text preprocessors made in #2312, this PR modifies some of the Arabic preprocessors to correctly allow substituting letters in any position of a word, instead of only substituting the first occurrence of a letter.