Java/JS regex support for escapes within charclasses, JS regex null escape, identity escapes & legacy octal escape #1501
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -86,40 +86,46 @@ bracketQuantifierRange | |
| atom | ||
| : patternCharacter+ | ||
| | DOT | ||
| | AtomEscape | ||
| | atomEscape | ||
| | characterClass | ||
| | PAREN_open disjunction PAREN_close | ||
| //TODO | ||
| // | '(' '?' ':' disjunction ')' | ||
| ; | ||
|
|
||
|
|
||
|
|
||
| //TODO | ||
| fragment CharacterEscape | ||
| : ControlEscape | ||
| | 'c' ControlLetter | ||
| | HexEscapeSequence | ||
| | UnicodeEscapeSequence | ||
| //| IdentityEscape | ||
| CharacterEscape | ||
| : SLASH ControlEscape | ||
| | SLASH HexEscapeSequence | ||
| | SLASH UnicodeEscapeSequence | ||
| | SLASH OctalEscapeSequence // legacy octal escapes are deprecated, but this also works for null escape (\u0000) | ||
| | SLASH IdentityEscape | ||
| ; | ||
|
|
||
| //TODO backreferences | ||
|
|
||
| ControlLetterExtendedEscape | ||
| // This handles both control letter escapes (\ca, \cZ, etc.) and literal interpretations of \c. | ||
| // As in JS: "\c" + [^a-zA-Z]? is taken literally as "\c" + [^a-zA-Z]? outside charclasses | ||
| // while "\c" + [^a-zA-Z0-9_]? is taken literally as "\c" + [^a-zA-Z0-9_]? within charclasses. | ||
| // Therefore, as all characters following "\c" (or none) are permitted we accept "\c" + .? here | ||
| // and handle each case in visitor. | ||
| : SLASH 'c' .? // matches \c, \c<anything> | ||
| ; | ||
|
|
||
| fragment ControlEscape | ||
| //one of f n r t v | ||
| : [fnrtv] | ||
| ; | ||
|
|
||
| fragment ControlLetter | ||
| : [a-zA-Z] | ||
| fragment IdentityEscape | ||
| // In JS escape sequences that are not one of the above (excluding backreferences) become identity escapes: | ||
| // they represent the character that follows the backslash. (e.g.: "\a" becomes "a") | ||
| // see: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Regular_expressions/Character_escape#:~:text=identity%20escapes | ||
| : ~[dDsSwWfnrtvxuc0-9] | ||
| ; | ||
|
|
||
|
|
||
| //TODO | ||
| //fragment IdentityEscape :: | ||
| //SourceCharacter but not IdentifierPart | ||
| //<ZWJ> | ||
| //<ZWNJ> | ||
|
|
||
| //TODO | ||
| //DecimalEscape | ||
| // //[lookahead ∉ DecimalDigit] | ||
|
|
@@ -170,26 +176,29 @@ classAtomNoDash | |
| //SourceCharacter but not one of \ or ] or - | ||
| //TODO | ||
| //: ~[-\]\\] | ||
| // | '\\' ClassEscape | ||
| : BaseChar | ||
| : classEscape | ||
| | BaseChar | ||
| | DecimalDigit | ||
| | COMMA | CARET | DOLLAR | SLASH | DOT | STAR | PLUS | QUESTION | ||
| | PAREN_open | PAREN_close | BRACKET_open | BRACE_open | BRACE_close | OR; | ||
|
|
||
|
|
||
| //TODO | ||
| //ClassEscape | ||
| // : CharacterClassEscape | ||
| //// | DecimalEscape | ||
| //// | 'b' | ||
| // //| CharacterEscape | ||
| // ; | ||
| // TODO | ||
|
Collaborator
What is this TODO?
Collaborator, Author
This was for \b within a charclass. In JS, some escapes behave differently within charclasses: there, \b is interpreted as backspace. That is implemented now and the TODO is removed. The \b and \B escapes outside a charclass (word boundary assertions) are not implemented yet.
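For reference, the context-dependent behaviour discussed here can be checked against a JavaScript engine directly. The following is a minimal sketch, not part of the PR: it assumes standard non-unicode-mode JS regex semantics, and all variable names are illustrative. Patterns are built from strings with `new RegExp` so that the escapes reach the regex engine unchanged.

```typescript
// Inside a character class, \b is the backspace character U+0008.
const backspaceInClass: boolean = new RegExp("[\\b]").test("\u0008");

// Outside a character class, \b is a zero-width word boundary assertion.
const wordBoundary: boolean = new RegExp("\\bcat\\b").test("a cat sat");
const noBoundary: boolean = new RegExp("\\bcat\\b").test("concatenate");

// \cJ is a control-letter escape: Ctrl-J, i.e. line feed U+000A.
const controlLetter: boolean = new RegExp("^\\cJ$").test("\n");

// Outside a class, \c followed by a non-letter is taken literally:
// the pattern \c1 matches the three characters backslash, 'c', '1'.
const literalOutside: boolean = new RegExp("^\\c1$").test("\\c1");

// Inside a class, digits and '_' also count as control letters:
// [\c1] matches U+0011 ('1' is code point 49, and 49 mod 32 = 17).
const digitInClass: boolean = new RegExp("^[\\c1]$").test("\u0011");

console.log({
  backspaceInClass,
  wordBoundary,
  noBoundary,
  controlLetter,
  literalOutside,
  digitInClass,
});
```

Only `noBoundary` should come out false; the other flags should be true. The two `\c1` cases are the ones that motivate accepting `\c` followed by any character in the lexer and deciding the meaning later, depending on whether the escape sits inside a charclass.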
||
| classEscape | ||
| : controlLetterExtendedEscape // this needs to be first so that we can accept things like \c and \c0 within charclasses | ||
| | atomEscape | ||
| // | SLASH 'b' | ||
| ; | ||
|
|
||
| decimalDigits | ||
| : DecimalDigit+ | ||
| ; | ||
|
|
||
|
|
||
| controlLetterExtendedEscape | ||
| // we need this as a parser rule because differentiating between being inside a charclass or outside is important | ||
| // as behavior changes in each case | ||
| : ControlLetterExtendedEscape | ||
| ; | ||
|
|
||
| //------ LEXER ------------------------------ | ||
| // Lexer rules have first letter in upper-case | ||
|
|
@@ -199,16 +208,17 @@ DecimalDigit | |
| ; | ||
|
|
||
|
|
||
| AtomEscape | ||
| : '\\' CharacterClassEscape | ||
| atomEscape | ||
| : CharacterClassEscape | ||
| //TODO | ||
| // | '\\' DecimalEscape | ||
| | '\\' CharacterEscape | ||
| | CharacterEscape | ||
| | controlLetterExtendedEscape | ||
| ; | ||
|
|
||
| fragment CharacterClassEscape | ||
| CharacterClassEscape | ||
| //one of d D s S w W | ||
| : [dDsSwW] | ||
| : SLASH [dDsSwW] | ||
| ; | ||
|
|
||
|
|
||
|
|
@@ -236,6 +246,12 @@ BaseChar | |
| : ~[0-9,^$\\.*+?()[\]{}|-] | ||
| ; | ||
|
|
||
| fragment OctalEscapeSequence | ||
| : OctalDigit | ||
| | OctalDigit OctalDigit | ||
| | [0-3] OctalDigit OctalDigit | ||
| ; | ||
|
|
||
| fragment UnicodeEscapeSequence | ||
| : 'u' HexDigit HexDigit HexDigit HexDigit | ||
| ; | ||
|
|
@@ -248,6 +264,10 @@ fragment HexDigit: | |
| [a-fA-F0-9] | ||
| ; | ||
|
|
||
| fragment OctalDigit: | ||
| [0-7] | ||
| ; | ||
|
|
||
| //TODO | ||
| //DecimalIntegerLiteral | ||
| // : '0' | ||
|
|
||
Explain in this comment a bit more what backreferences are.
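As background for this request: a backreference `\N` matches the same text that capture group N matched earlier in the pattern, and in non-unicode mode JS it shares its syntax with the legacy octal escapes this PR adds. The sketch below is illustrative only and assumes standard JS regex semantics; the variable names are hypothetical and none of this is code from the PR.

```typescript
// A backreference \1 must repeat whatever capture group 1 matched.
const repeatedWord: boolean = new RegExp("^(\\w+) \\1$").test("hello hello");
const differentWord: boolean = new RegExp("^(\\w+) \\1$").test("hello world");

// With no capture group in the pattern, \101 cannot be a backreference,
// so it is read as a legacy octal escape: octal 101 = decimal 65 = 'A'.
const legacyOctal: boolean = new RegExp("^\\101$").test("A");

// \0 not followed by another digit is the null escape, U+0000.
const nullEscape: boolean = new RegExp("^\\0$").test("\u0000");

console.log({ repeatedWord, differentWord, legacyOctal, nullEscape });
```

Only `differentWord` should be false. The same digits after a backslash can therefore mean either "repeat group N" or "character with octal code N", depending on how many capture groups the pattern contains. That is presumably why identity escapes in the grammar are described as excluding backreferences.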