[WIP] Merge Chinese Word Segmentation work#19166
Conversation
- Add `cppjieba` as a Git submodule under `third_party/cppjieba/` to provide robust Chinese word segmentation capabilities.
- Update `.gitmodules` to point to the official `cppjieba` repository and configure it to track the `master` branch.
- Update `sconscript` to include the paths of `cppjieba` and its dependency `limonp`.
- Modify `copying.txt` to include the `cppjieba` license (MIT) alongside the project's existing license, ensuring proper attribution and compliance.
- Update documentation.
Co-authored-by: Sean Budd <seanbudd123@gmail.com>
- Introduce a `JiebaSingleton` class in `cppjieba.hpp`/`cppjieba.cpp`, with a def file, under `nvdaHelper/cppjieba/`:
  - Inherits from `cppjieba::Jieba` and exposes a thread-safe `getOffsets()` method.
  - Implements Meyers' singleton via `getInstance()` with a private constructor.
  - Deletes the copy constructor, copy assignment, move constructor, and move assignment to enforce a single instance.
- Add a C-style API in the same module:
  - `int initJieba()` to force singleton initialization.
  - `int segmentOffsets(const char* text, int** charOffsets, int* outLen)` to perform segmentation and return character offsets.
  - `void freeOffsets(int* ptr)` to release the allocated offset buffer.
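For orientation, here is a minimal ctypes sketch of how the exported C-style API above could be consumed from Python. The library name (`cppjieba.dll`), the text encoding, the error-code convention and the `getOffsets` helper are illustrative assumptions; only the three exported signatures come from the list above.

```python
import ctypes

# Hypothetical library name; the actual helper DLL built under nvdaHelper may differ.
_lib = ctypes.cdll.LoadLibrary("cppjieba.dll")

_lib.initJieba.restype = ctypes.c_int
_lib.segmentOffsets.restype = ctypes.c_int
_lib.segmentOffsets.argtypes = [
    ctypes.c_char_p,  # text to segment (UTF-8 assumed)
    ctypes.POINTER(ctypes.POINTER(ctypes.c_int)),  # receives the allocated offset buffer
    ctypes.POINTER(ctypes.c_int),  # receives the buffer length
]
_lib.freeOffsets.argtypes = [ctypes.POINTER(ctypes.c_int)]

_lib.initJieba()  # force singleton initialization (return-value convention assumed)


def getOffsets(text: str) -> list[int]:
    """Return character offsets for text, always freeing the C-allocated buffer."""
    charPtr = ctypes.POINTER(ctypes.c_int)()
    outLen = ctypes.c_int(0)
    res = _lib.segmentOffsets(text.encode("utf-8"), ctypes.byref(charPtr), ctypes.byref(outLen))
    if res != 0:  # assumption: a non-zero return signals failure
        raise RuntimeError("segmentOffsets failed")
    try:
        return [charPtr[i] for i in range(outLen.value)]
    finally:
        _lib.freeOffsets(charPtr)
```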
- Change `submodules` in `jobs - buildNVDA - Build NVDA - Checkout NVDA` from `true` to `recursive`, to ensure cppjieba's own submodule is fetched.
- This also causes sonic's submodule to be fetched, which currently appears to be unused.
Co-authored-by: Sean Budd <seanbudd123@gmail.com>
Merge for the Python update
- Create `WordSegmentationStrategy` as an abstract base class, following the Strategy pattern, so that a segmentation strategy can be selected based on text content.
- Implement `ChineseWordSegmentationStrategy` (for Chinese text).
- Implement `UniscribeWordSegmentationStrategy` (the default strategy for other languages).
Add a `WordSegmenter` class for word segmentation, integrating the segmentation strategies (see the sketch below).
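A rough sketch of how these classes could fit together. The class names and `_chooseStrategy` come from the commit messages and quoted diffs; the `getWordOffsets` signature and the CJK-range heuristic are illustrative assumptions, not the PR's actual logic.

```python
from abc import ABC, abstractmethod


class WordSegmentationStrategy(ABC):
    """Abstract base class for word segmentation strategies."""

    @abstractmethod
    def getWordOffsets(self, text: str, offset: int) -> tuple[int, int]:
        """Return the (start, end) offsets of the word containing offset."""
        raise NotImplementedError


class ChineseWordSegmentationStrategy(WordSegmentationStrategy):
    def getWordOffsets(self, text: str, offset: int) -> tuple[int, int]:
        # Would delegate to the cppjieba-backed segmenter.
        raise NotImplementedError


class UniscribeWordSegmentationStrategy(WordSegmentationStrategy):
    def getWordOffsets(self, text: str, offset: int) -> tuple[int, int]:
        # Would delegate to Uniscribe, the previous default behaviour.
        raise NotImplementedError


class WordSegmenter:
    """Selects a strategy based on the text content and exposes word offsets through it."""

    def __init__(self, text: str):
        self.text = text
        self.strategy = self._chooseStrategy()

    def _chooseStrategy(self) -> WordSegmentationStrategy:
        # Illustrative heuristic only: prefer the Chinese strategy if any CJK character is present.
        if any("\u4e00" <= ch <= "\u9fff" for ch in self.text):
            return ChineseWordSegmentationStrategy()
        return UniscribeWordSegmentationStrategy()
```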
- Replace `useUniscribe` with `useUniscribeForCharOffset` and `useWordSegmenterForWordOffset` for the segmentation extensions.
- Integrate `WordSegmenter` for calculating word offsets.
Align `DisplayModelTextInfo`'s flag with its superclass.
- Redesign two properties of `OffsetTextInfo` as enums to make them more configurable, inspired by @LeonarddeR.
- Override them in some subclasses to simulate specific behaviours.
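For context, a minimal sketch of what such enums might look like. `CharSegFlag.UNISCRIBE` and `WordSegFlag.CHINESE` appear in the diffs quoted later in this thread; `WordSegFlag.UNISCRIBE` and the plain `Enum` base are assumptions.

```python
from enum import Enum, auto


class CharSegFlag(Enum):
    """How an OffsetTextInfo resolves character offsets (additional members omitted)."""

    UNISCRIBE = auto()


class WordSegFlag(Enum):
    """How an OffsetTextInfo resolves word offsets (additional members omitted)."""

    UNISCRIBE = auto()
    CHINESE = auto()
```

Subclasses can then override class-level defaults such as `charSegFlag: CharSegFlag = CharSegFlag.UNISCRIBE`, as in the diff quoted further down.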
What are our main tasks in this PR? Is it just cleaning up the code? Is it possible to include bug fixes?
Yes. I have finished fixing the remaining issues in #18865 locally and will open a PR soon, so that this functionality can be integrated more properly.
@cary-rowen @CrazySteve0605 - yes, PRs to this branch that improve the code quality and fix issues would be appreciated, particularly resolving the merge conflicts.
As we discussed offline, @CrazySteve0605: according to feedback from the community, there are still some cases that need improvement. I think that once the known to-do items for this PR are resolved and merged into master, we can provide a follow-up polish PR. For now, Copilot has left some comments worth addressing.
Summary of the issue: Some punctuation marks have extra separators (spaces) before or after them.
Description of user facing changes: Braille output will be more accurate.
Thanks @CrazySteve0605 - can you merge master and resolve conflicts now?
|
Hi @seanbudd,
Yes, but I wonder how I should do this. Should I make a new PR to merge master into this branch now? I had misunderstood and thought the merge did not need to be done by me.
Yes, create a branch from |
Hi @cary-rowen @CrazySteve0605 - there are still significant merge conflicts.
	charSegFlag: CharSegFlag = CharSegFlag.UNISCRIBE

	@property
	def wordSegFlag(self) -> WordSegFlag | None:
Hi @seanbudd |
### Link to issue number:

### Summary of the issue:

### Description of user facing changes:

### Description of developer facing changes:

### Description of development approach:

### Testing strategy:

### Known issues with pull request:

### Code Review Checklist:

- [ ] Documentation:
  - Change log entry
  - User Documentation
  - Developer / Technical Documentation
  - Context sensitive help for GUI changes
- [ ] Testing:
  - Unit tests
  - System (end to end) tests
  - Manual testing
- [ ] UX of all users considered:
  - Speech
  - Braille
  - Low Vision
  - Different web browsers
  - Localization in other languages / culture than English
- [ ] API is compatible with existing add-ons.
- [ ] Security precautions taken.
---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: GitHub Actions <github-actions@github.com>
Co-authored-by: Sascha Cowley <16543535+SaschaCowley@users.noreply.github.com>
Co-authored-by: Peng-An Chen <andy72039@gmail.com>
Co-authored-by: Sean Budd <sean@nvaccess.org>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Michael Curran <mick@nvaccess.org>
Co-authored-by: GitHub Actions <actions@github.com>
Co-authored-by: Bill Dengler <codeofdusk@gmail.com>
Co-authored-by: Cyrille Bougot <cyrille.bougot2@laposte.net>
Co-authored-by: Boumtchack <147637328+Boumtchack@users.noreply.github.com>
Co-authored-by: Kefas Lungu <84945041+kefaslungu@users.noreply.github.com>
Co-authored-by: Jani Kinnunen <janikinnunen340@gmail.com>
Co-authored-by: Akash Kakkar <6381747+akash07k@users.noreply.github.com>
Co-authored-by: makhlwf <78276231+makhlwf@users.noreply.github.com>
Co-authored-by: Quin Gillespie <trypsynth@gmail.com>
Co-authored-by: Bram Duvigneau <bram@bramd.nl>
Co-authored-by: Luke Davis <8139760+XLTechie@users.noreply.github.com>
Co-authored-by: audioses <67341000+audioses@users.noreply.github.com>
Co-authored-by: gexgd0419 <55008943+gexgd0419@users.noreply.github.com>
Co-authored-by: Cyrille Bougot <cyrille.bougot@laposte.net>
Co-authored-by: Joseph Lee <joseph.lee22590@gmail.com>
Co-authored-by: Christopher Proß <cpross@mailbox.org>
Co-authored-by: Ryan McCleary <remccleary@gmail.com>
Co-authored-by: wencong <manchen_0528@outlook.com>
Co-authored-by: Emil-18 <135248352+Emil-18@users.noreply.github.com>
Co-authored-by: Danil <50794055+Danstiv@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Noelia Ruiz Martínez <nrm1977@gmail.com>
Co-authored-by: Sean Budd <seanbudd123@gmail.com>
Co-authored-by: James Teh <jamie@jantrid.net>
Co-authored-by: Thiago Seus <thiago.seus@yahoo.com.br>
Co-authored-by: WangFeng Huang <1398969445@qq.com>
Co-authored-by: ethanl-11 <124083447+ethanl-11@users.noreply.github.com>
		self.wordSegFlag: WordSegFlag = wordSegFlag
		self.strategy: wordSegStrategy.WordSegmentationStrategy = self._chooseStrategy()

	def _chooseStrategy(self) -> wordSegStrategy.WordSegmentationStrategy:  # TODO: optimize
		try:
			if bool(charPtr):
				self._lib.freeOffsets(charPtr)
		except Exception:
Pull request overview
Copilot reviewed 30 out of 31 changed files in this pull request and generated 8 comments.
Comments suppressed due to low confidence (1)
source/braille.py:644
`converter` is overwritten when both Chinese word-segmentation spacing and Unicode normalization are active. In that case, `brailleToRawPos`/`rawToBraillePos` are only converted using the last converter, so offsets will be wrong (and cursor routing/selection mapping can break). Consider composing converters (apply both mappings), or apply normalization first and then segmentation while keeping a combined mapping back to the original raw text.
		converter: OffsetConverter | None = None
		textToTranslate = self.rawText
		textToTranslateTypeforms = self.rawTextTypeforms
		cursorPos = self.cursorPos
		if (
			config.conf["braille"]["translationTable"].startswith("zh")
			or config.conf["braille"]["translationTable"] == "auto"
			and brailleTables.getDefaultTableForCurLang(brailleTables.TableType.OUTPUT).startswith("zh")
		):
			from textUtils.wordSeg.wordSegUtils import WordSegWithSeparatorOffsetConverter  # noqa: F401

			converter = WordSegWithSeparatorOffsetConverter(textToTranslate)
			textToTranslate = converter.encoded
			if cursorPos is not None:
				cursorPos = converter.strToEncodedOffsets(cursorPos)
		if config.conf["braille"]["unicodeNormalization"] and not isUnicodeNormalized(textToTranslate):
			converter = UnicodeNormalizationOffsetConverter(textToTranslate)
			textToTranslate = converter.encoded
			if textToTranslateTypeforms is not None:
				# Typeforms must be adapted to represent normalized characters.
				textToTranslateTypeforms = [
					textToTranslateTypeforms[strOffset] for strOffset in converter.computedEncodedToStrOffsets
				]
			if cursorPos is not None:
				# Convert the cursor position to a normalized offset.
				cursorPos = converter.strToEncodedOffsets(cursorPos)
		self.brailleCells, brailleToRawPos, rawToBraillePos, self.brailleCursorPos = louisHelper.translate(
			[handler.table.fileName, "braille-patterns.cti"],
			textToTranslate,
			typeform=textToTranslateTypeforms,
			mode=mode,
			cursorPos=cursorPos,
		)
		if converter:
			# The received brailleToRawPos contains braille to normalized positions.
			# Process them to represent real raw positions by converting them from normalized ones.
			brailleToRawPos = [converter.encodedToStrOffsets(i) for i in brailleToRawPos]
			# The received rawToBraillePos contains normalized to braille positions.
			# Create a new list based on real raw positions.
			rawToBraillePos = [rawToBraillePos[i] for i in converter.computedStrToEncodedOffsets]
		self.brailleToRawPos = brailleToRawPos
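One possible shape for the composition suggested in the Copilot comment above: a wrapper that chains the segmentation converter and the normalization converter so both mappings are applied. This is a sketch only, not the PR's actual fix; it assumes both converters expose `strToEncodedOffsets`/`encodedToStrOffsets` for single offsets, as in the quoted code.

```python
class ComposedOffsetConverter:
    """Chain two offset converters: `first` maps the raw text to its encoded form,
    and `second` is built on first.encoded and maps that to the final encoded form."""

    def __init__(self, first, second):
        self.first = first  # e.g. WordSegWithSeparatorOffsetConverter(rawText)
        self.second = second  # e.g. UnicodeNormalizationOffsetConverter(first.encoded)
        self.encoded = second.encoded

    def strToEncodedOffsets(self, strOffset: int) -> int:
        # raw -> segmented -> normalized
        return self.second.strToEncodedOffsets(self.first.strToEncodedOffsets(strOffset))

    def encodedToStrOffsets(self, encodedOffset: int) -> int:
        # normalized -> segmented -> raw
        return self.first.encodedToStrOffsets(self.second.encodedToStrOffsets(encodedOffset))
```

The `computedStrToEncodedOffsets`/`computedEncodedToStrOffsets` lists used in the quoted code would need to be composed in the same way.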
JiebaSingleton& JiebaSingleton::getInstance(const char* dictDir) {
	// convert the incoming C string to std::string (fall back to an empty string when dictDir is nullptr)
	std::string dir = dictDir ? dictDir : "";

	// ensure the singleton is constructed exactly once
	std::call_once(JiebaSingleton::initFlag, [&]() {
		// allocate on the heap, so we avoid copy/move and control the lifetime
		JiebaSingleton::instance = new JiebaSingleton(dir.c_str());
		// optional: register a deleter at exit
		std::atexit([]() {
			delete JiebaSingleton::instance;
			JiebaSingleton::instance = nullptr;
		});
	});

	// after call_once, instance must be non-null
	return *JiebaSingleton::instance;
}
bool insertUserWord(const char* word, int freq, const char* tag = cppjieba::UNKNOWN_TAG) {
	return JiebaSingleton::getInstance().InsertUserWord(string(word), freq, string(tag));
}

bool deleteUserWord(const char* word, const char* tag = cppjieba::UNKNOWN_TAG) {
	return JiebaSingleton::getInstance().DeleteUserWord(string(word), string(tag));
}

bool find(const char* word) {
	return JiebaSingleton::getInstance().Find(string(word));
}
	Compute a list of offsets so that:
	encodedIndex = strIndex + relevantStrToEncodedOffsets[strIndex]

	We build an explicit mapping from original string indices to encoded indices
	by marking separator positions in the encoded string and then assigning
	each non-separator encoded slot to the next original-character index.
	The returned list contains the delta (encodedIndex - strIndex) for each
	original index.
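A standalone sketch of the mapping this docstring describes, assuming the positions of inserted separators in the encoded string are available as a set of indices (how the PR actually tracks them may differ):

```python
def computeStrToEncodedDeltas(encoded: str, sepIndices: set[int]) -> list[int]:
    """For each original-character index, return the delta to add to reach its encoded index."""
    deltas: list[int] = []
    strIndex = 0
    for encodedIndex in range(len(encoded)):
        if encodedIndex in sepIndices:
            # Inserted separator slots do not correspond to any original character.
            continue
        deltas.append(encodedIndex - strIndex)
        strIndex += 1
    return deltas


# Example: "你好世界" segmented as "你好 世界" has one separator at encoded index 2,
# so the deltas are [0, 0, 1, 1] and encodedIndex == strIndex + deltas[strIndex].
assert computeStrToEncodedDeltas("你好 世界", {2}) == [0, 0, 1, 1]
```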
			Thread(target=callable_to_call, args=args, kwargs=kwargs, daemon=True).start()
		except Exception as e:
			log.debug("Initializer %s.%s failed: %s", module_name, qualname, e)
			return
			case config.featureFlagEnums.WordNavigationUnitFlag.CHINESE:
				return WordSegFlag.CHINESE
			case _:
				log.error(f"Unknown word segmentation standard, {self.__wordSegConf.calculated()!r}")
	def segmentedText(self, sep: str = " ", newSepIndex: list[int] | None = None) -> str:
		"""Segments the text using the word end indices."""
		if len(self.wordEnds) <= 1:
			return self.text

		result = ""
		for sepIndex in range(len(self.wordEnds) - 1):
			preIndex = 0 if sepIndex == 0 else self.wordEnds[sepIndex - 1]
			curIndex = self.wordEnds[sepIndex]
			postIndex = self.wordEnds[sepIndex + 1]

			# append the token before the potential separator position
			result += self.text[preIndex:curIndex]

			# quick checks: avoid adding duplicate separator if already present
			if result.endswith(sep) or self.text[curIndex:postIndex].startswith(sep):
				# separator already present at either side -> skip adding
				continue

			# Unicode categories for punctuation
			PUNCTUATION_CATEGORIES: str = "pP"
			# Determine whether any punctuation forbids a separator
			noSep = (
				unicodedata.category(self.text[curIndex - 1])[0] in PUNCTUATION_CATEGORIES
				or unicodedata.category(self.text[curIndex])[0] in PUNCTUATION_CATEGORIES
			)

			if not noSep:
				# If neither side forbids the separator, add it
				result += sep
				if newSepIndex is not None:
					newSepIndex.append(len(result) - len(sep))
		else:
			# append the final trailing token after the loop
			result += self.text[curIndex:postIndex]

		return result
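To illustrate the intended behaviour of `segmentedText` (the inputs below assume `wordEnds` holds cumulative word-end character offsets, as the constructor further down suggests):

```python
# text = "你好世界",  wordEnds = [2, 4]    ->  "你好 世界"   (a separator is inserted between the two words)
# text = "你好，世界", wordEnds = [2, 3, 5] ->  "你好，世界"  (no separator is added around the punctuation mark)
```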
	def getSegmentForOffset(self, offset: int) -> tuple[int, int] | None:
		return self.getWordOffsetRange(offset)

	def __init__(self, text, encoding=None):
		super().__init__(text, encoding)
		self.wordEnds = self._callCPPJieba()
	def encodedToStrOffsets(
		self,
		encodedStart: int,
		encodedEnd: int | None = None,
		raiseOnError: bool = False,
	) -> int | tuple[int, int]:
		super().encodedToStrOffsets(encodedStart, encodedEnd, raiseOnError)
		if encodedStart == 0:
			resultStart = 0
		else:
			resultStart = self.computedEncodedToStrOffsets[encodedStart]
		if encodedEnd is None:
			return resultStart
		elif encodedStart == encodedEnd:
			return (resultStart, resultStart)
		else:
			resultEnd = self.computedEncodedToStrOffsets[encodedEnd]
			return (resultStart, resultEnd)
	reportClickable = boolean(default=true)

[documentNavigation]
	initWordSegForUnusedLang = boolean(default=false)
@CrazySteve0605 - could you open a PR to address the Copilot feedback?
Merges: