
Add script to extract required phrases from rule files #5105

Open
Kaushik-Kumar-CEG wants to merge 1 commit into aboutcode-org:develop from Kaushik-Kumar-CEG:gsoc/dataset-extraction-script


Kaushik-Kumar-CEG commented Jun 3, 2026

References #5077

Adds a script that extracts required phrase annotations ({{ }} markers) from .RULE files and writes them out as JSONL for NER model training.

The script handles:

  • marker extraction with correct character positions
  • phrase normalization (HTML entities, XML tags, backticks, whitespace)
  • Unicode and line-ending normalization
  • position validation
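
For reviewers, the steps listed above can be sketched roughly as follows. This is a minimal illustration, not the script in this PR: the function names, the JSONL field names (`text`, `spans`, `start`, `end`, `raw`, `phrase`), and the choice to report offsets into the text with the {{ }} markers removed are all assumptions made for the example.

```python
import html
import json
import re
import unicodedata

# Non-greedy match of anything between {{ and }}, including newlines.
MARKER_RE = re.compile(r"\{\{(.*?)\}\}", re.DOTALL)


def normalize_text(text):
    """Apply Unicode NFC normalization and convert CRLF/CR line endings to LF."""
    text = unicodedata.normalize("NFC", text)
    return text.replace("\r\n", "\n").replace("\r", "\n")


def normalize_phrase(phrase):
    """Unescape HTML entities, drop XML-like tags and backticks, collapse whitespace."""
    phrase = html.unescape(phrase)
    phrase = re.sub(r"<[^>]+>", " ", phrase)
    phrase = phrase.replace("`", "")
    return " ".join(phrase.split())


def extract_required_phrases(rule_text):
    """Return (plain_text, spans).

    Assumption: span offsets index into plain_text, i.e. the rule text
    with the {{ }} markers stripped out.
    """
    rule_text = normalize_text(rule_text)
    parts, spans = [], []
    cursor = 0      # current position in the marked-up text
    plain_len = 0   # length of the marker-free text built so far
    for match in MARKER_RE.finditer(rule_text):
        before = rule_text[cursor:match.start()]
        parts.append(before)
        plain_len += len(before)
        raw = match.group(1)
        spans.append({
            "start": plain_len,
            "end": plain_len + len(raw),
            "raw": raw,
            "phrase": normalize_phrase(raw),
        })
        parts.append(raw)
        plain_len += len(raw)
        cursor = match.end()
    parts.append(rule_text[cursor:])
    plain_text = "".join(parts)

    # Position validation: every span must slice back to its raw phrase.
    for span in spans:
        if plain_text[span["start"]:span["end"]] != span["raw"]:
            raise ValueError(f"span offsets do not match text: {span!r}")
    return plain_text, spans


sample = "Licensed under the {{Apache  License,\r\n Version 2.0}} or {{&lt;b&gt;`MIT`&lt;/b&gt; License}}"
text, spans = extract_required_phrases(sample)
# One JSON object per rule, one rule per line (JSONL).
print(json.dumps({"text": text, "spans": spans}))
```

Keeping the raw phrase next to the normalized one lets the offsets be checked against the text while the cleaned-up phrase is still available as a label.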

More features (BIOES labels, train/val/test split, plain text field) are coming in follow-up commits.

Disclosure: I used Claude to help review the script and fix a few bugs in it.

Tasks

  • Reviewed contribution guidelines
  • PR is descriptively titled and links the original issue above
  • Tests pass
  • Commits are in a uniquely named feature branch and have no merge conflicts
  • Updated documentation pages (if applicable)
  • Updated CHANGELOG.rst (if applicable)

Parses .RULE files for {{ }} markers and outputs JSONL with character positions and normalized phrases for NER training.

Signed-off-by: Kaushik <kaushikrjpm10@gmail.com>

Kaushik-Kumar-CEG (Author) commented

@AyanSinhaMahapatra ready for review when you get a chance. I'll make the follow-up changes to the script as well.
