
Optimize Dictionary Import Memory Use #2319

Open
Casheeew wants to merge 11 commits into master from perf-optimize-import-memory

Conversation

@Casheeew (Member) commented Feb 25, 2026

Loading term files entirely into memory causes massive memory usage, which may slow Chrome down and sometimes trigger issues such as #1420.

This PR instead achieves roughly constant memory usage by switching to a streaming pattern:
it pipes zip.js' getData() stream into a custom JSON parser, which emits onBatch events whenever maxTransactionLength entries have accumulated. This effectively halves memory usage even when the term files are small (1000 terms per term bank), and caps memory usage at about 40 MB.
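The batching side of that streaming pattern might look roughly like the sketch below. This is an illustration, not the actual PR code: `maxTransactionLength` and `onBatch` are the names from the description above, while `BatchEmitter` and its methods are hypothetical.

```javascript
// Sketch of the batching layer: entries parsed from the streamed term
// bank are buffered, and onBatch fires every time maxTransactionLength
// entries have accumulated, so only one batch is ever held in memory.
class BatchEmitter {
    constructor(maxTransactionLength, onBatch) {
        this._max = maxTransactionLength;
        this._onBatch = onBatch;
        this._buffer = [];
    }

    // Called once per entry emitted by the streaming JSON parser.
    push(entry) {
        this._buffer.push(entry);
        if (this._buffer.length >= this._max) {
            this.flush();
        }
    }

    // Emit whatever is buffered (also called once at end of stream).
    flush() {
        if (this._buffer.length === 0) { return; }
        this._onBatch(this._buffer);
        this._buffer = [];
    }
}
```

In the real flow, the zip entry's decompressed stream would feed an incremental JSON parser, with each completed term entry pushed here, so at most maxTransactionLength entries sit in memory at once instead of the whole term bank.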

Works flawlessly even when combining Pixiv full into one humongous term bank.

Based on my comprehensive testing, there is a negligible difference in import time; this branch is slightly faster due to less GC.

Maybe a partial fix to the many "Unknown error" issues... #536

@Casheeew Casheeew requested a review from a team as a code owner February 25, 2026 13:26
@Casheeew (Member, Author) commented:

@codex review

@chatgpt-codex-connector (bot) left a comment:

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b1694dafcd

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

@Casheeew (Member, Author) commented:

@codex review

@chatgpt-codex-connector (bot) left a comment:

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 00ca6997b5

@chatgpt-codex-connector (bot) left a comment:

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1981762c53

Casheeew and others added 4 commits February 25, 2026 23:50
The test directories were gitignored by the `dictionaries/` rule
and needed to be force-added, matching how existing fixtures are tracked.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@Casheeew (Member, Author) commented:

Should be good to go. We don't have to cover every edge case, since dictionaries are for the most part generated by code, so there is little reason to check every malformed JSON shape. The main check is the AJV validation layer; if a dictionary passes it, malformed data should not break anything for the user.

@Casheeew (Member, Author) commented:

@codex review

@chatgpt-codex-connector (bot) left a comment:

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: fdf7851b2b

@Kuuuube (Member) commented Feb 26, 2026

Based on my comprehensive testing, there is negligible difference in import time. this is slightly faster due to less GC

Can you give a few timed tests of imports and the peak memory usage you're seeing before and after?

@Casheeew (Member, Author) commented:

Okay, I tested again. For dictionaries whose term files have 1000 entries each, performance is comparable (both versions load 1000 entries into memory at a time; I probably got the impression it was halved from looking at the wrong tab).
Import time has quite high variance (a few seconds for EN-KRDICT, >10 s for Pixiv).

PixivFull before (210.54 s): validation-step and import-step memory screenshots attached.

PixivFull after (203.00 s): validation-step and import-step memory screenshots attached.

For EN-KRDICT, where each term file has a different number of entries, memory use is effectively halved:

Before (12.41 s): memory screenshot attached.
After (13.20 s): memory screenshot attached.

Bonus: Pixiv with a single term bank (700k+ entries)

Before: ∞ seconds (import error)
After: 205.44 s (validation-step and import-step memory screenshots attached)

Basically equivalent to when the dictionary is split into 1000-entry term banks.

@Kuuuube (Member) left a comment:

Gave this a test. The progress bar does not update when importing a single bank. This is a problem for dictionaries with large banks, since it appears as if something is broken/frozen.

You can test this with JPDB from the recommended dictionaries.

I suspect this is because you gutted the code around `const bulkAdd = async (objectStoreName, entries) => {`. I added that specifically for this purpose.
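For reference, the kind of chunked write with interleaved progress updates that Kuuuube describes can be sketched as below. This is a hypothetical illustration of the idea, not the actual Yomitan `bulkAdd`: the `addAll` parameter (standing in for one transaction's worth of IndexedDB writes) and `onProgress` are assumed names.

```javascript
// Sketch: write entries in fixed-size chunks so a progress callback
// can fire between chunks, even when one term bank holds many entries.
// addAll(chunk) stands in for a per-chunk IndexedDB write transaction.
async function bulkAddChunked(addAll, entries, chunkSize, onProgress) {
    for (let i = 0; i < entries.length; i += chunkSize) {
        const chunk = entries.slice(i, i + chunkSize);
        await addAll(chunk); // one transaction's worth of writes
        onProgress(Math.min(i + chunkSize, entries.length), entries.length);
    }
}
```

Because onProgress fires after every chunk rather than once per file, the progress bar keeps moving even for a single 700k-entry bank.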

@tlonic commented Mar 12, 2026

Can confirm that this branch fixes the unknown error I was receiving when importing a large dictionary.

@Kuuuube added the `kind/enhancement` and `area/performance` labels on Mar 30, 2026