LibJS: Handle Unicode ID_Start characters excluded from XID_Start#8896
Open
officialasishkumar wants to merge 2 commits intoLadybirdBrowser:masterfrom
Open
LibJS: Handle Unicode ID_Start characters excluded from XID_Start#8896officialasishkumar wants to merge 2 commits intoLadybirdBrowser:masterfrom
officialasishkumar wants to merge 2 commits intoLadybirdBrowser:masterfrom
Conversation
Collaborator
|
Hello! One or more of the commit messages in this PR do not match the Ladybird code submission policy, please check the |
trflynn89
reviewed
Apr 13, 2026
| } | ||
|
|
||
| fn unicode_id_start(cp: u32) -> bool { | ||
| // NB: The ECMAScript spec requires ID_Start, not XID_Start. |
Contributor
There was a problem hiding this comment.
oof, a couple things here (see #8788 (comment) for some info)
- Let's not use
unicode_identat all. As pointed out in the link above, we should avoid duplicating large Unicode tables. - That PR added FFI to just invoke
Unicode::code_point_has_identifier_start_propertywhich should already be doing the right thing. Let's do the same here. That should let us remove theunicode_identpackage from Cargo.toml. - The comments added here are exceptionally verbose.
Author
There was a problem hiding this comment.
Will update accordingly.
4991765 to
1eb140c
Compare
Replace the Rust unicode-ident lookup in the lexer with the existing C++ Unicode identifier property helpers and drop the direct dependency.
1eb140c to
db7e959
Compare
db7e959 to
287e674
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
The ECMAScript specification defines
UnicodeIDStartas any Unicode code pointwith the
ID_Startderived property. Theunicode_identcrate that the Rustlexer uses implements
XID_Start, which is a proper subset ofID_Start.The two sets differ for characters whose NFKC normalization form does not begin
with an
ID_Startcharacter. These characters are valid JavaScript identifierstarts per spec but were rejected by the lexer with an
Invalidtoken.The existing code already worked around this for U+309B and U+309C (Katakana
voiced/semi-voiced sound marks). This commit extends that workaround to cover
all 21 code points in
ID_Start \ XID_Startunder Unicode 16:The
unicode_id_continuefunction is similarly updated for those characterswhose NFKC form also contains a non-
ID_Continuecharacter (a space), so theyremain valid as identifier-continuation characters. U+0E33, U+0EB3, U+FF9E,
and U+FF9F are omitted from that list because their NFKC forms consist solely
of
XID_Continuecharacters and are therefore already handled by the crate.Fixes #8870.