LibJS: Handle Unicode ID_Start characters excluded from XID_Start by officialasishkumar · Pull Request #8896 · LadybirdBrowser/ladybird

officialasishkumar · 2026-04-13T19:30:11Z

The ECMAScript specification defines UnicodeIDStart as any Unicode code point
with the ID_Start derived property. The unicode_ident crate that the Rust
lexer uses implements XID_Start, which is a proper subset of ID_Start.

The two sets differ for characters whose NFKC normalization form does not begin
with an ID_Start character. These characters are valid JavaScript identifier
starts per spec but were rejected by the lexer with an Invalid token.

The existing code already worked around this for U+309B and U+309C (Katakana
voiced/semi-voiced sound marks). This commit extends that workaround to cover
all 21 code points in ID_Start \ XID_Start under Unicode 16:

| Code point | Name |
| --- | --- |
| U+037A | GREEK YPOGEGRAMMENI |
| U+0E33 | THAI CHARACTER SARA AM |
| U+0EB3 | LAO VOWEL SIGN AM |
| U+309B | KATAKANA-HIRAGANA VOICED SOUND MARK (already handled) |
| U+309C | KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK (already handled) |
| U+FC5E–U+FC63 | Arabic shadda ligature isolated forms |
| U+FE70, U+FE72, U+FE74, U+FE76, U+FE78, U+FE7A, U+FE7C, U+FE7E | Arabic vowel isolated forms |
| U+FF9E | HALFWIDTH KATAKANA VOICED SOUND MARK |
| U+FF9F | HALFWIDTH KATAKANA SEMI-VOICED SOUND MARK |
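For illustration, the workaround described above could be sketched roughly as follows. This is not the code from the diff: the function names `is_id_start_not_xid_start` and `unicode_id_start_with_fallback` are hypothetical, and the XID_Start check is passed in as a closure so the sketch does not depend on any particular crate.

```rust
/// Returns true for the 21 code points listed in the table above, i.e. the
/// ones described as being in ID_Start but not in XID_Start.
/// NOTE: hypothetical helper name, not taken from the PR.
fn is_id_start_not_xid_start(cp: u32) -> bool {
    matches!(
        cp,
        0x037A                  // GREEK YPOGEGRAMMENI
            | 0x0E33            // THAI CHARACTER SARA AM
            | 0x0EB3            // LAO VOWEL SIGN AM
            | 0x309B..=0x309C   // KATAKANA-HIRAGANA (semi-)voiced sound marks
            | 0xFC5E..=0xFC63   // Arabic shadda ligature isolated forms
            | 0xFE70
            | 0xFE72
            | 0xFE74
            | 0xFE76
            | 0xFE78
            | 0xFE7A
            | 0xFE7C
            | 0xFE7E            // Arabic vowel isolated forms (even code points only)
            | 0xFF9E..=0xFF9F   // HALFWIDTH KATAKANA (semi-)voiced sound marks
    )
}

/// Accepts a code point if the supplied XID_Start predicate accepts it, or if
/// it is one of the exceptions above. `xid_start` stands in for whatever
/// XID_Start lookup the lexer uses (an assumption of this sketch).
fn unicode_id_start_with_fallback(cp: u32, xid_start: impl Fn(char) -> bool) -> bool {
    char::from_u32(cp).map_or(false, |c| xid_start(c)) || is_id_start_not_xid_start(cp)
}
```

With a predicate that rejects everything, `unicode_id_start_with_fallback(0x037A, |_| false)` still returns `true`, which is the behavior the description asks for.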

The unicode_id_continue function is similarly updated for those characters
whose NFKC form also contains a non-ID_Continue character (a space), so they
remain valid as identifier-continuation characters. U+0E33, U+0EB3, U+FF9E,
and U+FF9F are omitted from that list because their NFKC forms consist solely
of XID_Continue characters and are therefore already handled by the crate.
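Read literally, the paragraph above implies a continuation list of 17 code points: the 21 from the table minus U+0E33, U+0EB3, U+FF9E, and U+FF9F. A minimal sketch of that list, assuming this reading is right (the function name `needs_id_continue_fallback` is hypothetical and does not come from the diff):

```rust
/// Returns true for the code points that, per the description, need an
/// explicit identifier-continue exception because their NFKC form contains
/// a space. U+0E33, U+0EB3, U+FF9E and U+FF9F are deliberately absent.
/// NOTE: hypothetical helper name, derived from the PR text, not its code.
fn needs_id_continue_fallback(cp: u32) -> bool {
    matches!(
        cp,
        0x037A                  // GREEK YPOGEGRAMMENI
            | 0x309B..=0x309C   // KATAKANA-HIRAGANA (semi-)voiced sound marks
            | 0xFC5E..=0xFC63   // Arabic shadda ligature isolated forms
            | 0xFE70
            | 0xFE72
            | 0xFE74
            | 0xFE76
            | 0xFE78
            | 0xFE7A
            | 0xFE7C
            | 0xFE7E            // Arabic vowel isolated forms (even code points only)
    )
}
```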

Fixes #8870.

ladybird-bot · 2026-04-13T19:30:42Z

Hello!

One or more of the commit messages in this PR do not match the Ladybird code submission policy, please check the lint_commits CI job for more details on which commits were flagged and why.
Please do not close this PR and open another, instead modify your commit message(s) with git commit --amend and force push those changes to update this PR.

trflynn89 · 2026-04-13T19:46:01Z

Libraries/LibJS/Rust/src/lexer.rs

 }

 fn unicode_id_start(cp: u32) -> bool {
-    // NB: The ECMAScript spec requires ID_Start, not XID_Start.


oof, a couple things here (see #8788 (comment) for some info)

Let's not use unicode_ident at all. As pointed out in the link above, we should avoid duplicating large Unicode tables.

That PR added FFI to just invoke Unicode::code_point_has_identifier_start_property which should already be doing the right thing. Let's do the same here. That should let us remove the unicode_ident package from Cargo.toml.

The comments added here are exceptionally verbose.

officialasishkumar · Apr 13, 2026

Will update accordingly.


trflynn89 reviewed Apr 13, 2026


officialasishkumar force-pushed the unicode-id-start-fix branch 2 times, most recently from 4991765 to 1eb140c Compare April 13, 2026 23:07

LibJS: Use LibUnicode identifier property FFI

39fff2d

Replace the Rust unicode-ident lookup in the lexer with the existing C++ Unicode identifier property helpers and drop the direct dependency.

officialasishkumar force-pushed the unicode-id-start-fix branch from 1eb140c to db7e959 Compare April 14, 2026 04:59

officialasishkumar requested a review from alimpfard as a code owner April 14, 2026 04:59

LibRegex: Export Unicode identifier FFI helpers

287e674

officialasishkumar force-pushed the unicode-id-start-fix branch from db7e959 to 287e674 Compare April 14, 2026 05:59

officialasishkumar wants to merge 2 commits into LadybirdBrowser:master from officialasishkumar:unicode-id-start-fix
