LibJS: Handle Unicode ID_Start characters excluded from XID_Start #8896

Open
officialasishkumar wants to merge 2 commits into LadybirdBrowser:master from officialasishkumar:unicode-id-start-fix

@officialasishkumar

The ECMAScript specification defines UnicodeIDStart as any Unicode code point
with the ID_Start derived property. The unicode_ident crate that the Rust
lexer uses implements XID_Start, which is a proper subset of ID_Start.

The two sets differ for characters whose NFKC normalization form does not begin
with an ID_Start character. These characters are valid JavaScript identifier
starts per spec but were rejected by the lexer with an Invalid token.

The existing code already worked around this for U+309B and U+309C (Katakana
voiced/semi-voiced sound marks). This commit extends that workaround to cover
all 21 code points in ID_Start \ XID_Start under Unicode 16:

| Code point | Name |
| --- | --- |
| U+037A | GREEK YPOGEGRAMMENI |
| U+0E33 | THAI CHARACTER SARA AM |
| U+0EB3 | LAO VOWEL SIGN AM |
| U+309B | KATAKANA-HIRAGANA VOICED SOUND MARK (already handled) |
| U+309C | KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK (already handled) |
| U+FC5E–U+FC63 | Arabic shadda ligature isolated forms |
| U+FE70, U+FE72, U+FE74, U+FE76, U+FE78, U+FE7A, U+FE7C, U+FE7E | Arabic vowel isolated forms |
| U+FF9E | HALFWIDTH KATAKANA VOICED SOUND MARK |
| U+FF9F | HALFWIDTH KATAKANA SEMI-VOICED SOUND MARK |
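
For illustration, the workaround described above could be sketched roughly as follows. This is not the code from the PR: `is_id_start_only` is a hypothetical helper name, and the `xid_start` closure is a stand-in for whatever XID_Start lookup the lexer uses (for example `unicode_ident::is_xid_start`), passed in so the sketch compiles without the crate.

```rust
/// Hypothetical helper: true for the 21 code points listed in the table,
/// i.e. those with ID_Start but not XID_Start under Unicode 16.
fn is_id_start_only(cp: u32) -> bool {
    // U+FE70..U+FE7E, even code points only (the isolated vowel forms).
    let arabic_vowel_isolated = (0xFE70..=0xFE7E).contains(&cp) && cp % 2 == 0;
    matches!(
        cp,
        0x037A
            | 0x0E33
            | 0x0EB3
            | 0x309B..=0x309C
            | 0xFC5E..=0xFC63
            | 0xFF9E..=0xFF9F
    ) || arabic_vowel_isolated
}

/// Sketch of an ID_Start check built on top of an XID_Start lookup.
/// `xid_start` is an assumption standing in for the crate's function.
fn unicode_id_start(cp: u32, xid_start: impl Fn(char) -> bool) -> bool {
    is_id_start_only(cp) || char::from_u32(cp).map_or(false, xid_start)
}
```

Keeping the exceptions in one `matches!` makes it easy to check the list against the table when the Unicode version changes.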

The unicode_id_continue function is similarly updated for those characters
whose NFKC form also contains a non-ID_Continue character (a space), so they
remain valid as identifier-continuation characters. U+0E33, U+0EB3, U+FF9E,
and U+FF9F are omitted from that list because their NFKC forms consist solely
of XID_Continue characters and are therefore already handled by the crate.
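
A minimal sketch of the continuation-side exception list, assuming it is exactly the ID_Start table minus the four code points named above. The helper name `is_id_continue_extra` and the `xid_continue` closure (a stand-in for the crate's XID_Continue lookup) are hypothetical and not taken from the PR.

```rust
/// Hypothetical helper: code points whose NFKC form contains a space,
/// so an XID_Continue lookup rejects them even though they are ID_Continue.
/// U+0E33, U+0EB3, U+FF9E, and U+FF9F are deliberately absent.
fn is_id_continue_extra(cp: u32) -> bool {
    // U+FE70..U+FE7E, even code points only (the isolated vowel forms).
    let arabic_vowel_isolated = (0xFE70..=0xFE7E).contains(&cp) && cp % 2 == 0;
    matches!(cp, 0x037A | 0x309B..=0x309C | 0xFC5E..=0xFC63) || arabic_vowel_isolated
}

/// Sketch of an ID_Continue check built on top of an XID_Continue lookup.
/// `xid_continue` is an assumption standing in for the crate's function.
fn unicode_id_continue(cp: u32, xid_continue: impl Fn(char) -> bool) -> bool {
    is_id_continue_extra(cp) || char::from_u32(cp).map_or(false, xid_continue)
}
```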

Fixes #8870.

@ladybird-bot
Collaborator

Hello!

One or more of the commit messages in this PR do not match the Ladybird code submission policy. Please check the lint_commits CI job for more details on which commits were flagged and why.
Please do not close this PR and open another; instead, modify your commit message(s) with git commit --amend and force push those changes to update this PR.

}

fn unicode_id_start(cp: u32) -> bool {
// NB: The ECMAScript spec requires ID_Start, not XID_Start.
Contributor

oof, a couple things here (see #8788 (comment) for some info)

  1. Let's not use unicode_ident at all. As pointed out in the link above, we should avoid duplicating large Unicode tables.
  2. That PR added FFI to just invoke Unicode::code_point_has_identifier_start_property which should already be doing the right thing. Let's do the same here. That should let us remove the unicode_ident package from Cargo.toml.
  3. The comments added here are exceptionally verbose.

Author

Will update accordingly.

@officialasishkumar force-pushed the unicode-id-start-fix branch 2 times, most recently from 4991765 to 1eb140c (April 13, 2026 23:07)

Replace the Rust unicode-ident lookup in the lexer with the
existing C++ Unicode identifier property helpers and drop the
direct dependency.
Development

Successfully merging this pull request may close these issues.

LibJS: SyntaxError when using U+0E33 as an object property key
