Improve Arabic preprocessors by J1mBim · Pull Request #2321 · yomidevs/yomitan

J1mBim · 2026-03-01T20:51:27Z

In light of the improvements to text preprocessors made in #2312, this PR modifies some of the Arabic preprocessors to correctly allow substituting letters in any position of a word, instead of only substituting the first occurrence of a letter.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 06e7340c49

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-03-01T20:53:58Z

ext/js/language/ar/arabic-text-preprocessors.js

+    const matches = [...str.matchAll(regex)];
+    const n = matches.length;
+    // Total of 2^n possible combinations
+    const total = 1 << n;


Avoid bitshift overflow in replacement count

Computing total with 1 << n breaks for inputs with 31+ matches (common when users search long Arabic phrases), because JS bitwise ops are 32-bit signed: at n = 31 it becomes negative and the loop yields no variants, and at higher n it wraps to incorrect counts. In this case the processor can return an empty array, which causes variant generation to collapse and lookups to miss entirely for those inputs.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-03-01T20:53:58Z

ext/js/language/ar/arabic-text-preprocessors.js

+    for (let mask = 0; mask < total; mask++) {
+        let i = 0;
+
+        const result = str.replaceAll(regex, (match) => {


Limit variant generation before building all combinations

This loop materializes every 2^n combination before any truncation happens, so even moderate inputs (e.g., 20 occurrences of ا) allocate over a million strings per call. Because Arabic now chains three such preprocessors, large selections can trigger severe CPU/memory spikes before downstream MAX_PROCESS_VARIANTS limits are applied.

Useful? React with 👍 / 👎.

I did a scan on the Living Arabic yomitan dictionary (currently the most extensive Arabic yomitan dict that I'm aware of), the highest number of occurrences of ا (Alif) in a given word was 4, which should be the extreme case

J1mBim · 2026-03-01T21:23:54Z

ext/js/language/ar/arabic-text-preprocessors.js

 export const addHamzaTop = {
    name: 'Add Hamza to top of Alif',
    description: 'اكبر → أكبر',
-    process: (text) => [text, text.replace('ا', 'أ')],
+    process: (text) => generateReplacementCombinations(text, 'ا', 'أ'),
 };

 /** @type {import('language').TextProcessor} */
 export const addHamzaBottom = {
    name: 'Add Hamza to bottom of Alif',
    description: 'اسلام → إسلام',
-    process: (text) => [text, text.replace('ا', 'إ')],
+    process: (text) => generateReplacementCombinations(text, 'ا', 'إ'),
+};
+
+/** @type {import('language').TextProcessor} */
+export const addMadd = {
+    name: 'Add Madd to Alif',
+    description: 'الان → الآن',
+    process: (text) => generateReplacementCombinations(text, 'ا', 'آ'),
 };


Since these are all replacing the same letter, they could be combined into a single preprocessor that handles all the replacements, rather than split into 3 individual preprocessors.

The current approach results in double the number of results. Will convert to a draft PR until updated

J1mBim · 2026-04-11T16:04:26Z

Sorry forgot about this - have now made the change mentioned above so reopening

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b72a9fae07

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-04-11T16:08:34Z

ext/js/language/ar/arabic-text-preprocessors.js

+    const total = m ** n; // m^n combinations
+
+    const results = [];
+
+    for (let combination = 0; combination < total; combination++) {


Cap replacement combinations before materializing variants

generateReplacementCombinations always allocates every m^n permutation (total + full for loop) before any safeguard runs, while truncation to MAX_PROCESS_VARIANTS only happens later in translator._getProcessedTexts. With substituteAlif (m=4), inputs containing several plain alifs can explode to tens of thousands+ strings per call and stall lookups; fresh evidence versus the earlier thread is that lookup input is not bounded to dictionary-word shapes (scanner/search paths accept multi-character user text), so this can be triggered by real queries, not just synthetic words.

Useful? React with 👍 / 👎.

Improve Arabic preprocessors

5d1391f

J1mBim requested a review from a team as a code owner March 1, 2026 20:51

Kuuuube added kind/enhancement The issue or PR is a new feature or request area/linguistics The issue or PR is related to linguistics labels Mar 1, 2026

Merge branch 'master' into master

06e7340

chatgpt-codex-connector bot reviewed Mar 1, 2026

View reviewed changes

J1mBim commented Mar 1, 2026

View reviewed changes

J1mBim marked this pull request as draft March 1, 2026 21:24

J1mBim added 3 commits April 11, 2026 16:33

Combine alif substitutions into single preprocessor

3df9bec

Merge branch 'yomidevs:master' into master

aacf86f

Update comments

b72a9fa

J1mBim marked this pull request as ready for review April 11, 2026 16:04

chatgpt-codex-connector bot reviewed Apr 11, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve Arabic preprocessors#2321

Improve Arabic preprocessors#2321
J1mBim wants to merge 5 commits intoyomidevs:masterfrom
J1mBim:master

J1mBim commented Mar 1, 2026

Uh oh!

chatgpt-codex-connector bot left a comment

Uh oh!

chatgpt-codex-connector bot Mar 1, 2026

Uh oh!

chatgpt-codex-connector bot Mar 1, 2026

Uh oh!

J1mBim Mar 1, 2026

Uh oh!

J1mBim Mar 1, 2026

Uh oh!

J1mBim commented Apr 11, 2026

Uh oh!

chatgpt-codex-connector bot left a comment

Uh oh!

chatgpt-codex-connector bot Apr 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

J1mBim commented Mar 1, 2026

Uh oh!

chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector bot Mar 1, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector bot Mar 1, 2026

Choose a reason for hiding this comment

Uh oh!

J1mBim Mar 1, 2026

Choose a reason for hiding this comment

Uh oh!

J1mBim Mar 1, 2026

Choose a reason for hiding this comment

Uh oh!

J1mBim commented Apr 11, 2026

Uh oh!

chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector bot Apr 11, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants