Improve Arabic preprocessors#2321

Open
J1mBim wants to merge 5 commits into yomidevs:master from J1mBim:master

J1mBim commented Mar 1, 2026

In light of the improvements to text preprocessors made in #2312, this PR modifies some of the Arabic preprocessors to correctly allow substituting letters in any position of a word, instead of only substituting the first occurrence of a letter.
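
As a small illustration of the behavior described above: `String.prototype.replace` with a string pattern substitutes only the first match, while `replaceAll` substitutes every match. The sample word is borrowed from the preprocessor descriptions in this PR; the variable names are illustrative only.

```javascript
// Sample word containing two plain alifs (U+0627).
const word = 'اسلام';

// With a string pattern, replace() only touches the first occurrence.
const firstOnly = word.replace('\u0627', '\u0625'); // ا → إ

// replaceAll() substitutes every occurrence.
const everywhere = word.replaceAll('\u0627', '\u0625');

console.log(firstOnly);
console.log(everywhere);
```

Neither call alone covers the case where only some of the alifs should change, which appears to be why the PR generates combinations of replacements instead.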

J1mBim requested a review from a team as a code owner March 1, 2026 20:51
Kuuuube added the kind/enhancement and area/linguistics labels Mar 1, 2026

chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 06e7340c49

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

const matches = [...str.matchAll(regex)];
const n = matches.length;
// Total of 2^n possible combinations
const total = 1 << n;

P1: Avoid bitshift overflow in replacement count

Computing total with 1 << n breaks for inputs with 31+ matches (common when users search long Arabic phrases), because JS bitwise ops are 32-bit signed: at n = 31 it becomes negative and the loop yields no variants, and at higher n it wraps to incorrect counts. In this case the processor can return an empty array, which causes variant generation to collapse and lookups to miss entirely for those inputs.
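
The overflow described above can be reproduced directly. The snippet below is a minimal sketch; `countCombinations` and its `limit` parameter are hypothetical names used for illustration and are not part of the PR.

```javascript
// JavaScript bitwise operators work on 32-bit signed integers,
// and the shift count is taken modulo 32.
const shifted31 = 1 << 31; // -2147483648
const shifted32 = 1 << 32; // 1

// Exponentiation uses double-precision arithmetic and does not wrap.
const pow31 = 2 ** 31; // 2147483648

// Hypothetical helper: count the 2^n combinations, saturating at `limit`.
function countCombinations(n, limit) {
    const total = 2 ** n;
    return Math.min(total, limit);
}

console.log(shifted31, shifted32, pow31);
console.log(countCombinations(40, 1024)); // 1024
```

Swapping the shift for `2 ** n` removes the wraparound but not the exponential growth, so a limit of some kind is still needed.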

Comment on lines +42 to +45
for (let mask = 0; mask < total; mask++) {
let i = 0;

const result = str.replaceAll(regex, (match) => {

P2: Limit variant generation before building all combinations

This loop materializes every 2^n combination before any truncation happens, so even moderate inputs (e.g., 20 occurrences of ا) allocate over a million strings per call. Because Arabic now chains three such preprocessors, large selections can trigger severe CPU/memory spikes before downstream MAX_PROCESS_VARIANTS limits are applied.
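
One way to avoid building every combination up front is to generate variants lazily and stop at a limit. The sketch below assumes the pattern is a literal string rather than a regex; `generateVariants` and `maxVariants` are hypothetical names, and the value 1024 in the usage note is arbitrary rather than the project's actual MAX_PROCESS_VARIANTS.

```javascript
/**
 * Hypothetical lazy variant generator: yields strings in which each match of
 * `from` is either kept or replaced by `to`, stopping after `maxVariants`.
 */
function* generateVariants(str, from, to, maxVariants) {
    const parts = str.split(from); // n matches produce n + 1 parts
    const n = parts.length - 1;
    // choices[i] === true means the i-th match is replaced.
    const choices = new Array(n).fill(false);
    let produced = 0;
    while (produced < maxVariants) {
        let result = parts[0];
        for (let i = 0; i < n; i++) {
            result += (choices[i] ? to : from) + parts[i + 1];
        }
        yield result;
        produced++;
        // Advance the choices like a binary counter; stop once it overflows.
        let i = 0;
        while (i < n && choices[i]) {
            choices[i] = false;
            i++;
        }
        if (i === n) {
            return;
        }
        choices[i] = true;
    }
}
```

For example, `[...generateVariants('\u0627'.repeat(30), '\u0627', '\u0623', 1024)]` yields 1024 strings for an input with 30 alifs instead of attempting 2^30.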

Author

I scanned the Living Arabic Yomitan dictionary (currently the most extensive Arabic Yomitan dictionary that I'm aware of). The highest number of occurrences of ا (Alif) in a single word was 4, which should be the extreme case.

Comment on lines 100 to 118
 export const addHamzaTop = {
     name: 'Add Hamza to top of Alif',
     description: 'اكبر → أكبر',
-    process: (text) => [text, text.replace('ا', 'أ')],
+    process: (text) => generateReplacementCombinations(text, 'ا', 'أ'),
 };

 /** @type {import('language').TextProcessor} */
 export const addHamzaBottom = {
     name: 'Add Hamza to bottom of Alif',
     description: 'اسلام → إسلام',
-    process: (text) => [text, text.replace('ا', 'إ')],
+    process: (text) => generateReplacementCombinations(text, 'ا', 'إ'),
 };

 /** @type {import('language').TextProcessor} */
 export const addMadd = {
     name: 'Add Madd to Alif',
     description: 'الان → الآن',
     process: (text) => generateReplacementCombinations(text, 'ا', 'آ'),
 };
Author

Since these all replace the same letter, they could be combined into a single preprocessor that handles all the replacements, rather than being split into 3 individual preprocessors.

The current approach results in double the number of results. I will convert this to a draft PR until it is updated.
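
A combined preprocessor along these lines might look like the sketch below, in which every plain alif independently takes one of four forms. The name `substituteAlifSketch` and the base-m enumeration are assumptions made for illustration; this is not the code from the PR.

```javascript
// The plain alif plus the three forms the separate preprocessors produce.
const ALIF_FORMS = [
    '\u0627', // ا plain alif
    '\u0623', // أ alif with hamza above
    '\u0625', // إ alif with hamza below
    '\u0622', // آ alif with madda
];

/**
 * Hypothetical combined processor: each plain alif independently takes one
 * of the four forms, giving 4^n variants for n plain alifs.
 */
function substituteAlifSketch(text) {
    const parts = text.split(ALIF_FORMS[0]);
    const n = parts.length - 1;
    const m = ALIF_FORMS.length;
    const total = m ** n;
    const results = [];
    for (let combination = 0; combination < total; combination++) {
        let remaining = combination;
        let result = parts[0];
        for (let i = 0; i < n; i++) {
            // Read the next base-m digit to pick the form for this alif.
            result += ALIF_FORMS[remaining % m] + parts[i + 1];
            remaining = Math.floor(remaining / m);
        }
        results.push(result);
    }
    return results;
}
```

Because combination 0 selects the plain form everywhere, the original text appears exactly once in the output. The sketch is uncapped, so it is only reasonable for inputs with a handful of alifs.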

J1mBim marked this pull request as draft March 1, 2026 21:24
Author

J1mBim commented Apr 11, 2026

Sorry, I forgot about this. I have now made the change mentioned above, so I'm reopening.

J1mBim marked this pull request as ready for review April 11, 2026 16:04

chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b72a9fae07

Comment on lines +39 to +43
const total = m ** n; // m^n combinations

const results = [];

for (let combination = 0; combination < total; combination++) {

P1: Cap replacement combinations before materializing variants

generateReplacementCombinations always allocates every m^n combination (computing total, then running the full for loop) before any safeguard runs; truncation to MAX_PROCESS_VARIANTS only happens later, in translator._getProcessedTexts. With substituteAlif (m = 4), inputs containing several plain alifs can explode to tens of thousands of strings or more per call and stall lookups. Fresh evidence versus the earlier thread: lookup input is not bounded to dictionary-word shapes (scanner/search paths accept multi-character user text), so this can be triggered by real queries, not just synthetic words.
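
To make the growth concrete, and to sketch one possible policy for bounding it before any strings are built, the snippet below computes how many matches can be allowed to vary while m^k stays within a limit. `maxVariableMatches` and the limit of 1024 are hypothetical; the PR and the review do not prescribe this approach.

```javascript
// With m = 4 forms per alif, eight plain alifs already give 65536 variants.
const variantsForEightAlifs = 4 ** 8; // 65536

/**
 * Hypothetical guard: return the largest k <= n such that m^k <= maxVariants,
 * i.e. how many matches may vary; the remaining matches stay unchanged.
 */
function maxVariableMatches(m, n, maxVariants) {
    let k = 0;
    let total = 1;
    while (k < n && total * m <= maxVariants) {
        total *= m;
        k++;
    }
    return k;
}

console.log(variantsForEightAlifs);
console.log(maxVariableMatches(4, 8, 1024)); // 5, since 4^5 = 1024
console.log(maxVariableMatches(4, 3, 1024)); // 3, all matches may vary
```

A generator could then vary only the first k matches, or fall back to returning the unmodified text when k is smaller than n. Which trade-off is preferable is a design decision for the maintainers.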
