Open
Changes from 24 commits
31 commits
8899623
Moved predefined character classes definitions from Randomness.kt int…
lmasroca Dec 16, 2025
3a6fed3
Moved Java AtomEscape from lexer to parser
lmasroca Dec 16, 2025
b220e40
Regex support for predefined character classes inside character class…
lmasroca Dec 16, 2025
36927ad
Removed exception assertion from testInd1Issue in RegexHandlerTest.kt
lmasroca Dec 16, 2025
52f459b
Merge remote-tracking branch 'origin/master' into regex-support-exten…
lmasroca Dec 16, 2025
7e9f8a0
increasing iterations for an e2e test to account for changes in rando…
lmasroca Dec 19, 2025
8c0ed79
Merge remote-tracking branch 'origin/master' into regex-support-exten…
lmasroca Dec 19, 2025
964aa54
Merge remote-tracking branch 'origin/master' into regex-support-exten…
lmasroca Mar 12, 2026
2750b78
Merge remote-tracking branch 'origin/master' into regex-support-exten…
lmasroca Mar 17, 2026
94d8b8f
Add required field to spec
Pgarrett Mar 14, 2026
5ac1817
Increasing iterations for e2e test
lmasroca Mar 17, 2026
0c197be
Merge remote-tracking branch 'origin/master' into regex-support-exten…
lmasroca Apr 8, 2026
e00c9b7
quick fix
lmasroca Apr 8, 2026
9dc49fd
merge cleanup
lmasroca Apr 8, 2026
79cc4ee
more merge cleanup
lmasroca Apr 9, 2026
00ac8d3
Merge remote-tracking branch 'origin/master' into regex-support-exten…
lmasroca Apr 9, 2026
5ebca66
added some comments
lmasroca Apr 9, 2026
cbda7f8
Moved JS AtomEscape from lexer to parser
lmasroca Apr 9, 2026
01aca5f
JS Regex support for predefined character classes inside character cl…
lmasroca Apr 9, 2026
d89c728
JS legacy octal escapes and null escape
lmasroca Apr 10, 2026
983e4ad
JS fixed and completed control letter escapes
lmasroca Apr 10, 2026
76afa5f
JS identity escapes support
lmasroca Apr 10, 2026
e877db2
JS octal escape fix
lmasroca Apr 12, 2026
286b3e8
Merge remote-tracking branch 'origin/master' into regex-support-exten…
lmasroca Apr 13, 2026
c48f31c
Removed SLASH from classAtomNoDash as it is unused and should not be …
lmasroca Apr 17, 2026
cc2185f
Added some comments regarding backreferences
lmasroca Apr 17, 2026
ea07dd0
JS regex implemented \b within charclass (interpreted as backspace)
lmasroca Apr 17, 2026
79d22a9
moved atomEscape (parser rule) from lexer rule section to parser rule…
lmasroca Apr 17, 2026
1801403
Merge remote-tracking branch 'origin/master' into regex-support-exten…
lmasroca Apr 17, 2026
eae8b76
increased iterations for e2e test
lmasroca Apr 17, 2026
10b23f6
Merge remote-tracking branch 'origin/master' into regex-support-exten…
lmasroca Apr 17, 2026
84 changes: 52 additions & 32 deletions core/src/main/antlr4/org/evomaster/core/parser/RegexEcma262.g4
@@ -86,40 +86,46 @@ bracketQuantifierRange
atom
: patternCharacter+
| DOT
| AtomEscape
| atomEscape
| characterClass
| PAREN_open disjunction PAREN_close
//TODO
// | '(' '?' ':' disjunction ')'
;



//TODO
fragment CharacterEscape
: ControlEscape
| 'c' ControlLetter
| HexEscapeSequence
| UnicodeEscapeSequence
//| IdentityEscape
CharacterEscape
: SLASH ControlEscape
| SLASH HexEscapeSequence
| SLASH UnicodeEscapeSequence
| SLASH OctalEscapeSequence // legacy octal escapes are deprecated, but this also works for null escape (\u0000)
| SLASH IdentityEscape
;

//TODO backreferences
Collaborator

Explain in this comment in a bit more detail what backreferences are.
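
For illustration only (this is not part of the PR): a minimal TypeScript sketch of what a backreference does in a JS regex. The names `repeatedLetter` and `backrefResults` are hypothetical. `\1` matches the exact text captured by group 1, not the group's pattern, which is presumably why backreferences need separate handling when sampling strings from a regex.

```typescript
// \1 refers back to whatever text capture group 1 actually matched.
const repeatedLetter: RegExp = /^([ab])\1$/;

const backrefResults: boolean[] = [
  repeatedLetter.test("aa"), // group 1 captured "a", so \1 must also be "a"
  repeatedLetter.test("bb"), // group 1 captured "b", so \1 must also be "b"
  repeatedLetter.test("ab"), // fails: \1 must repeat "a", but the input has "b"
];

console.log(backrefResults);
```

Under this reading, a generator that samples group 1 and `\1` independently could produce "ab", which the regex rejects.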


ControlLetterExtendedEscape
// This handles both control letter escapes (\ca, \cZ, etc.) and literal interpretations of \c.
// As in JS: "\c" + [^a-zA-Z]? is taken literally as "\c" + [^a-zA-Z]? outside charclasses
// while "\c" + [^a-zA-Z0-9_]? is taken literally as "\c" + [^a-zA-Z0-9_]? within charclasses.
// Therefore, as all characters following "\c" (or none) are permitted we accept "\c" + .? here
// and handle each case in visitor.
: SLASH 'c' .? // matches \c, \c<anything>
;

fragment ControlEscape
//one of f n r t v
: [fnrtv]
;

fragment ControlLetter
: [a-zA-Z]
fragment IdentityEscape
// In JS escape sequences that are not one of the above (excluding backreferences) become identity escapes:
// they represent the character that follows the backslash. (e.g.: "\a" becomes "a")
// see: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Regular_expressions/Character_escape#:~:text=identity%20escapes
: ~[dDsSwWfnrtvxuc0-9]
;


//TODO
//fragment IdentityEscape ::
//SourceCharacter but not IdentifierPart
//<ZWJ>
//<ZWNJ>

//TODO
//DecimalEscape
// //[lookahead ∉ DecimalDigit]
@@ -170,26 +176,29 @@ classAtomNoDash
//SourceCharacter but not one of \ or ] or -
//TODO
//: ~[-\]\\]
// | '\\' ClassEscape
: BaseChar
: classEscape
| BaseChar
| DecimalDigit
| COMMA | CARET | DOLLAR | SLASH | DOT | STAR | PLUS | QUESTION
| PAREN_open | PAREN_close | BRACKET_open | BRACE_open | BRACE_close | OR;


//TODO
//ClassEscape
// : CharacterClassEscape
//// | DecimalEscape
//// | 'b'
// //| CharacterEscape
// ;
// TODO
Collaborator

What is this TODO?

Collaborator Author

This was for \b within a character class. In JS, some escapes behave differently inside character classes; there, \b is interpreted as backspace. That is now implemented and the TODO has been removed. The \b and \B escapes outside character classes (word boundary assertions) are not implemented yet.
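
As a quick illustration of the difference described above (a sketch, not code from this PR; all names are hypothetical), the following TypeScript snippet contrasts `\b` inside and outside a character class:

```typescript
const backspace: string = "\u0008";

// Inside [...], \b denotes the backspace character (U+0008).
const insideClass: RegExp = /^[\b]$/;
// Outside [...], \b is a zero-width word boundary assertion.
const wordBoundary: RegExp = /^a\b$/;

const classResults = {
  matchesBackspace: insideClass.test(backspace), // the class contains U+0008
  matchesLetterB: insideClass.test("b"),         // the class does not contain "b"
  boundaryAfterA: wordBoundary.test("a"),        // boundary between "a" and end of input
};

console.log(classResults);
```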

classEscape
: controlLetterExtendedEscape // this needs to be first so that we can accept things like \c and \c0 within charclasses
| atomEscape
// | SLASH 'b'
;

decimalDigits
: DecimalDigit+
;


controlLetterExtendedEscape
// we need this as a parser rule because differentiating between being inside a charclass or outside is important
// as behavior changes in each case
: ControlLetterExtendedEscape
;

//------ LEXER ------------------------------
// Lexer rules have first letter in upper-case
@@ -199,16 +208,17 @@ DecimalDigit
;


AtomEscape
: '\\' CharacterClassEscape
atomEscape
: CharacterClassEscape
//TODO
// | '\\' DecimalEscape
| '\\' CharacterEscape
| CharacterEscape
| controlLetterExtendedEscape
;

fragment CharacterClassEscape
CharacterClassEscape
//one of d D s S w W
: [dDsSwW]
: SLASH [dDsSwW]
;


@@ -236,6 +246,12 @@ BaseChar
: ~[0-9,^$\\.*+?()[\]{}|-]
;

fragment OctalEscapeSequence
: OctalDigit
| OctalDigit OctalDigit
| [0-3] OctalDigit OctalDigit
;

fragment UnicodeEscapeSequence
: 'u' HexDigit HexDigit HexDigit HexDigit
;
@@ -248,6 +264,10 @@ fragment HexDigit:
[a-fA-F0-9]
;

fragment OctalDigit:
[0-7]
;

//TODO
//DecimalIntegerLiteral
// : '0'
74 changes: 42 additions & 32 deletions core/src/main/antlr4/org/evomaster/core/parser/RegexJava.g4
@@ -88,7 +88,7 @@ atom
: quote
| patternCharacter+
| DOT
| AtomEscape
| atomEscape
| characterClass
| PAREN_open disjunction PAREN_close
// These two rules are added to handle the . and + symbols in emails
@@ -119,17 +119,34 @@ quoteChar
;

//TODO
fragment CharacterEscape
: ControlEscape
| 'c' ControlLetter
| HexEscapeSequence
| UnicodeEscapeSequence
| OctalEscapeSequence
| 'p' BRACE_open PosixCharacterClassLabel BRACE_close // this is only implemented in Java at the moment as on JS this
// is allowed only while certain flags are enabled
CharacterEscape
: SLASH ControlEscape
| SLASH 'c' ControlLetter
| SLASH HexEscapeSequence
| SLASH UnicodeEscapeSequence
| SLASH OctalEscapeSequence
| SLASH ('p' | 'P') BRACE_open PCharacterClassEscapeLabel BRACE_close // this is only implemented in Java at the moment
// as on JS this is allowed only while certain flags are enabled

//| IdentityEscape
;

// TODO missing \p escapes
fragment PCharacterClassEscapeLabel
: PosixCharacterClassLabel
| UnicodeCategoriesLabel
// | UnicodeScriptsLabel // https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html#usc
// | UnicodeBlocksLabel // https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html#ubc
// | UnicodeBinaryProperiesLabel // https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html#ubpc
// | javalangCharacterClassesLabel // https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html#jcc
;

// TODO missing Unicode categories labels and implementations
// https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html#ucc
fragment UnicodeCategoriesLabel
: 'Pe'
;

// basic US-ASCII only predefined POSIX character classes
// https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html#:~:text=character%3A%20%5B%5E%5Cw%5D-,POSIX,-character%20classes%20(US
fragment PosixCharacterClassLabel
@@ -215,27 +232,28 @@ classAtomNoDash
//SourceCharacter but not one of \ or ] or -
//TODO
//: ~[-\]\\]
// | '\\' ClassEscape
: BaseChar
: classEscape
| BaseChar
| DecimalDigit
| COMMA | CARET | DOLLAR | SLASH | DOT | STAR | PLUS | QUESTION
| PAREN_open | PAREN_close | BRACKET_open | BRACE_open | BRACE_close | OR | E | Q
| ESCAPED_DOT | ESCAPED_PLUS;


//TODO
//ClassEscape
// : CharacterClassEscape
//// | DecimalEscape
//// | 'b'
// //| CharacterEscape
// ;

decimalDigits
: DecimalDigit+
;

classEscape
: atomEscape
// | SLASH 'b'
;

atomEscape
: CharacterClassEscape
| CharacterEscape
// TODO
// | '\\' DecimalEscape
;

//------ LEXER ------------------------------
// Lexer rules have first letter in upper-case
@@ -244,19 +262,11 @@ DecimalDigit
: [0-9]
;


AtomEscape
: '\\' CharacterClassEscape
//TODO
// | '\\' DecimalEscape
| '\\' CharacterEscape
;

fragment CharacterClassEscape
CharacterClassEscape
//one of d D s S w W v V h H
// v, V, h and H are java8 exclusive, they represent vertical spaces and horizontal spaces respectively
// see https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html for more information
: [dDsSwWvVhH]
// v, V, h and H are java8 exclusive, they represent vertical spaces and horizontal spaces respectively
// see https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html for more information
: SLASH [dDsSwWvVhH]
;

