
[WIP] Merge Chinese Word Segmentation work#19166

Draft
seanbudd wants to merge 133 commits into master from try-chineseWordSegmentation-staging

Conversation

CrazySteve0605 and others added 30 commits July 23, 2025 17:30
- Add `cppjieba` as a Git submodule under `third_party/cppjieba/` to provide
  robust Chinese word segmentation capabilities.
- Update `.gitmodules` to point to the official `cppjieba` repository and
  configure it to track the `master` branch.
- Update `sconscript` to include the paths of `cppjieba` and its dependency `limonp`
- Modify `copying.txt` to include the `cppjieba` license (MIT) alongside the
  project's existing license, ensuring proper attribution and compliance.
- Update documentation
Co-authored-by: Sean Budd <seanbudd123@gmail.com>
Co-authored-by: Sean Budd <seanbudd123@gmail.com>
- Introduce `JiebaSingleton` class in `cppjieba.hpp`/`cppjieba.cpp`, with a def file under `nvdaHelper/cppjieba/`
  - Inherits from `cppjieba::Jieba` and exposes a thread-safe `getOffsets()` method
  - Implements Meyers' singleton via `getInstance()` with a private constructor
  - Deletes the copy constructor, copy assignment, move constructor, and move assignment to enforce a single instance
- Add a C-style API in the same module:
  - `int initJieba()` to force singleton initialization
  - `int segmentOffsets(const char* text, int** charOffsets, int* outLen)` to perform segmentation and return character offsets
  - `void freeOffsets(int* ptr)` to release the allocated offset buffer
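The `segmentOffsets` contract above (word-end character offsets for a given text) can be illustrated with a pure-Python toy. This is a hedged sketch only: `TOY_DICT` and the greedy longest-match rule are stand-ins for cppjieba's real dictionary-based segmentation, not the PR's implementation.

```python
# Toy model of the `segmentOffsets` contract: return the character offset at
# which each segmented word ends. TOY_DICT and the greedy longest-match rule
# are illustrative stand-ins for cppjieba's dictionary-based segmenter.
TOY_DICT = {"中文", "分词"}

def segment_offsets(text: str) -> list[int]:
	offsets: list[int] = []
	pos = 0
	while pos < len(text):
		match_len = 1  # default: a single character is its own word
		# try multi-character dictionary words first; the longest match wins
		for length in range(min(4, len(text) - pos), 1, -1):
			if text[pos:pos + length] in TOY_DICT:
				match_len = length
				break
		pos += match_len
		offsets.append(pos)
	return offsets
```

For `"中文分词"` this yields `[2, 4]`: the two words end after the second and fourth characters, which is the shape of data the `charOffsets`/`outLen` output parameters describe.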
- Change `submodules` in the 'Checkout NVDA' step of the buildNVDA job from `true` to `recursive` to ensure cppjieba's submodule is fetched.
- This also fetches sonic's submodule, which appears to be currently unused.
Co-authored-by: Sean Budd <seanbudd123@gmail.com>
- create `WordSegmentationStrategy` as an abstract base class to select a segmentation strategy based on text content, following the Strategy Pattern
- implement `ChineseWordSegmentationStrategy` (for Chinese text)
- implement `UniscribeWordSegmentationStrategy` (for other languages, as the default strategy)
- add `WordSegmenter` class for word segmentation, integrating the segmentation strategies
- replace `useUniscribe` with `useUniscribeForCharOffset` and `useWordSegmenterForWordOffset` for segmentation extensions
- integrate `WordSegmenter` for calculating word offsets
- make `DisplayModelTextInfo`'s flag aligned with its superclass
- redesign 2 properties of `OffsetTextInfo` as enums to make them more configurable, inspired by @LeonarddeR
- override them in some subclasses to simulate specific behaviors
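The strategy hierarchy described above can be sketched in Python. The class names mirror the commit message, but the method name and the heuristics here (whitespace splitting as a stand-in for Uniscribe, one word per character as a stand-in for jieba, a simple CJK-range check for dispatch) are simplified assumptions, not the PR's actual code.

```python
from abc import ABC, abstractmethod

class WordSegmentationStrategy(ABC):
	"""Abstract base for word segmentation strategies (Strategy Pattern)."""

	@abstractmethod
	def getWordOffsets(self, text: str) -> list[int]:
		"""Return the end offset of each word in text. (Name is illustrative.)"""

class UniscribeWordSegmentationStrategy(WordSegmentationStrategy):
	# Default strategy: whitespace splitting stands in for Uniscribe word breaking.
	def getWordOffsets(self, text: str) -> list[int]:
		offsets, pos = [], 0
		for word in text.split():
			pos = text.index(word, pos) + len(word)
			offsets.append(pos)
		return offsets

class ChineseWordSegmentationStrategy(WordSegmentationStrategy):
	# Stand-in for the cppjieba-backed segmenter: one word per character here.
	def getWordOffsets(self, text: str) -> list[int]:
		return list(range(1, len(text) + 1))

def chooseStrategy(text: str) -> WordSegmentationStrategy:
	# Dispatch on content: any character in the CJK Unified Ideographs block
	# selects the Chinese strategy; everything else falls back to the default.
	if any("\u4e00" <= ch <= "\u9fff" for ch in text):
		return ChineseWordSegmentationStrategy()
	return UniscribeWordSegmentationStrategy()
```

The point of the pattern is that callers such as a `WordSegmenter` depend only on the abstract interface, so new per-language segmenters can be added without touching the call sites.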
@cary-rowen
Contributor

What are our main tasks in this PR? Is it just cleaning up the code? Is it possible to include bug fixes?
cc @CrazySteve0605
According to OSPP guidelines, is this still considered part of your job responsibilities?

@CrazySteve0605
Contributor

CrazySteve0605 commented Nov 20, 2025

cc @CrazySteve0605
According to OSPP guidelines, is this still considered part of your job responsibilities?

Yes. I have locally finished fixing the remaining issues in #18865 and will make a PR soon, so this functionality can be integrated more properly.

@seanbudd
Member Author

seanbudd commented Nov 21, 2025

@cary-rowen @CrazySteve0605 - yes, PRs to this branch that improve the code quality and fix issues would be appreciated. Particularly resolving conflicts

@cary-rowen
Contributor

As we discussed offline @CrazySteve0605

According to feedback from the community, there are still some cases that need to be improved.

I think that once the known to-do items for this PR are resolved and merged into master, we can provide follow-up polish PRs.

Also, Copilot has provided some comments here that are worth addressing.

@seanbudd seanbudd added the merge-early Merge Early in a developer cycle label Dec 8, 2025
Summary of the issue:

Some punctuation marks have extra separators (spaces) before or after them.
Description of user facing changes:

Braille output will be more accurate.
@seanbudd
Member Author

seanbudd commented Mar 2, 2026

Thanks @CrazySteve0605 - can you merge master and resolve conflicts now?

Comment thread source/textUtils/__init__.py Fixed
Comment thread source/textInfos/offsets.py Fixed
Comment thread source/textUtils/wordSeg/wordSegStrategy.py Fixed
@CrazySteve0605
Contributor

Hi @seanbudd ,

can you merge master and resolve conflicts now?

Yes, but I wonder how I can do this. Should I make a new PR to merge master into this branch now? I had misunderstood and thought the merge didn't need to be done by me.

@seanbudd
Member Author

seanbudd commented Mar 2, 2026

Yes, create a branch from nvaccess/nvda:try-chineseWordSegmentation-staging and merge nvaccess/nvda:master into it, then open a PR to merge it into nvaccess/nvda:try-chineseWordSegmentation-staging

@seanbudd
Member Author

Hi @cary-rowen @CrazySteve0605 - there are still significant merge conflicts

Comment thread source/addonHandler/__init__.py Fixed
Comment thread tests/unit/test_addonHandler/test_addonsState.py Fixed
Comment thread tests/unit/test_addonHandler/test_addonsState.py Fixed
Comment thread source/addonHandler/_pickleToJsonMigration.py Fixed
Comment thread source/addonHandler/__init__.py Fixed
@cary-rowen
Contributor

Hi @seanbudd
@CrazySteve0605 will handle the merge conflict here, but I can do it if he's willing to let me help.

cary-rowen and others added 2 commits May 1, 2026 22:21
<!-- Please read and fill in the following template, for an explanation
of the sections see:

https://github.com/nvaccess/nvda/blob/master/projectDocs/dev/githubPullRequestTemplateExplanationAndExamples.md
Please also note that the NVDA project has a Citizen and Contributor
Code of Conduct which can be found at
https://github.com/nvaccess/nvda/blob/master/CODE_OF_CONDUCT.md. NV
Access expects that all contributors and other community members read
and abide by the rules set out in this document while participating or
contributing to this project. This includes creating or commenting on
issues and pull requests.

Please initially open PRs as a draft.
When you would like a review, mark the PR as "ready for review".
See
https://github.com/nvaccess/nvda/blob/master/.github/CONTRIBUTING.md.
-->

### Link to issue number:
<!-- Use Closes/Fixes/Resolves #xxx to link this PR to the issue it is
responding to. -->

### Summary of the issue:

### Description of user facing changes:

### Description of developer facing changes:

### Description of development approach:

### Testing strategy:

### Known issues with pull request:

### Code Review Checklist:

<!--
This checklist is a reminder of things commonly forgotten in a new PR.
Authors, please do a self-review of this pull-request.
Check items to confirm you have thought about the relevance of the item.
Where items are missing (eg unit / system tests), please explain in the
PR.
To check an item `- [ ]` becomes `- [x]`, note spacing.
You can also check the checkboxes after the PR is created.
A detailed explanation of this checklist is available here:

https://github.com/nvaccess/nvda/blob/master/projectDocs/dev/githubPullRequestTemplateExplanationAndExamples.md#code-review-checklist
-->

- [ ] Documentation:
  - Change log entry
  - User Documentation
  - Developer / Technical Documentation
  - Context sensitive help for GUI changes
- [ ] Testing:
  - Unit tests
  - System (end to end) tests
  - Manual testing
- [ ] UX of all users considered:
  - Speech
  - Braille
  - Low Vision
  - Different web browsers
  - Localization in other languages / culture than English
- [ ] API is compatible with existing add-ons.
- [ ] Security precautions taken.

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: GitHub Actions <github-actions@github.com>
Co-authored-by: Sascha Cowley <16543535+SaschaCowley@users.noreply.github.com>
Co-authored-by: Peng-An Chen <andy72039@gmail.com>
Co-authored-by: Sean Budd <sean@nvaccess.org>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Michael Curran <mick@nvaccess.org>
Co-authored-by: GitHub Actions <actions@github.com>
Co-authored-by: Bill Dengler <codeofdusk@gmail.com>
Co-authored-by: Cyrille Bougot <cyrille.bougot2@laposte.net>
Co-authored-by: Boumtchack <147637328+Boumtchack@users.noreply.github.com>
Co-authored-by: Kefas Lungu <84945041+kefaslungu@users.noreply.github.com>
Co-authored-by: Jani Kinnunen <janikinnunen340@gmail.com>
Co-authored-by: Akash Kakkar <6381747+akash07k@users.noreply.github.com>
Co-authored-by: makhlwf <78276231+makhlwf@users.noreply.github.com>
Co-authored-by: Quin Gillespie <trypsynth@gmail.com>
Co-authored-by: Bram Duvigneau <bram@bramd.nl>
Co-authored-by: Luke Davis <8139760+XLTechie@users.noreply.github.com>
Co-authored-by: audioses <67341000+audioses@users.noreply.github.com>
Co-authored-by: gexgd0419 <55008943+gexgd0419@users.noreply.github.com>
Co-authored-by: Cyrille Bougot <cyrille.bougot@laposte.net>
Co-authored-by: Joseph Lee <joseph.lee22590@gmail.com>
Co-authored-by: Christopher Proß <cpross@mailbox.org>
Co-authored-by: Ryan McCleary <remccleary@gmail.com>
Co-authored-by: wencong <manchen_0528@outlook.com>
Co-authored-by: Emil-18 <135248352+Emil-18@users.noreply.github.com>
Co-authored-by: Danil <50794055+Danstiv@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Noelia Ruiz Martínez <nrm1977@gmail.com>
Co-authored-by: Sean Budd <seanbudd123@gmail.com>
Co-authored-by: James Teh <jamie@jantrid.net>
Co-authored-by: Thiago Seus <thiago.seus@yahoo.com.br>
Co-authored-by: WangFeng Huang <1398969445@qq.com>
Co-authored-by: ethanl-11 <124083447+ethanl-11@users.noreply.github.com>
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 30 out of 31 changed files in this pull request and generated 8 comments.

Comments suppressed due to low confidence (1)

source/braille.py:644

  • converter is overwritten when both Chinese word-segmentation spacing and Unicode normalization are active. In that case, brailleToRawPos / rawToBraillePos are only converted using the last converter, so offsets will be wrong (and cursor routing/selection mapping can break). Consider composing converters (apply both mappings), or apply normalization first and then segmentation while keeping a combined mapping back to the original raw text.
		converter: OffsetConverter | None = None
		textToTranslate = self.rawText
		textToTranslateTypeforms = self.rawTextTypeforms
		cursorPos = self.cursorPos
		if (
			config.conf["braille"]["translationTable"].startswith("zh")
			or config.conf["braille"]["translationTable"] == "auto"
			and brailleTables.getDefaultTableForCurLang(brailleTables.TableType.OUTPUT).startswith("zh")
		):
			from textUtils.wordSeg.wordSegUtils import WordSegWithSeparatorOffsetConverter  # noqa: F401

			converter = WordSegWithSeparatorOffsetConverter(textToTranslate)
			textToTranslate = converter.encoded
			if cursorPos is not None:
				cursorPos = converter.strToEncodedOffsets(cursorPos)
		if config.conf["braille"]["unicodeNormalization"] and not isUnicodeNormalized(textToTranslate):
			converter = UnicodeNormalizationOffsetConverter(textToTranslate)
			textToTranslate = converter.encoded
			if textToTranslateTypeforms is not None:
				# Typeforms must be adapted to represent normalized characters.
				textToTranslateTypeforms = [
					textToTranslateTypeforms[strOffset] for strOffset in converter.computedEncodedToStrOffsets
				]
			if cursorPos is not None:
				# Convert the cursor position to a normalized offset.
				cursorPos = converter.strToEncodedOffsets(cursorPos)
		self.brailleCells, brailleToRawPos, rawToBraillePos, self.brailleCursorPos = louisHelper.translate(
			[handler.table.fileName, "braille-patterns.cti"],
			textToTranslate,
			typeform=textToTranslateTypeforms,
			mode=mode,
			cursorPos=cursorPos,
		)

		if converter:
			# The received brailleToRawPos contains braille to normalized positions.
			# Process them to represent real raw positions by converting them from normalized ones.
			brailleToRawPos = [converter.encodedToStrOffsets(i) for i in brailleToRawPos]
			# The received rawToBraillePos contains normalized to braille positions.
			# Create a new list based on real raw positions.
			rawToBraillePos = [rawToBraillePos[i] for i in converter.computedStrToEncodedOffsets]
		self.brailleToRawPos = brailleToRawPos

Comment on lines +17 to +34
JiebaSingleton& JiebaSingleton::getInstance(const char* dictDir) {
	// convert the incoming C string to std::string, guarding against dictDir == nullptr
	std::string dir = dictDir ? dictDir : "";

	// ensure the singleton is constructed exactly once
	std::call_once(JiebaSingleton::initFlag, [&]() {
		// allocate on the heap, so we avoid copy/move and control lifetime
		JiebaSingleton::instance = new JiebaSingleton(dir.c_str());
		// optional: register deleter at exit
		std::atexit([]() {
			delete JiebaSingleton::instance;
			JiebaSingleton::instance = nullptr;
		});
	});

	// after call_once, instance must be non-null
	return *JiebaSingleton::instance;
}
Comment on lines +108 to +118
bool insertUserWord(const char* word, int freq, const char* tag = cppjieba::UNKNOWN_TAG) {
	return JiebaSingleton::getInstance().InsertUserWord(string(word), freq, string(tag));
}

bool deleteUserWord(const char* word, const char* tag = cppjieba::UNKNOWN_TAG) {
	return JiebaSingleton::getInstance().DeleteUserWord(string(word), string(tag));
}

bool find(const char* word) {
	return JiebaSingleton::getInstance().Find(string(word));
}
Comment on lines +32 to +39
Compute a list of offsets so that:
encodedIndex = strIndex + relevantStrToEncodedOffsets[strIndex]

We build an explicit mapping from original string indices to encoded indices
by marking separator positions in the encoded string and then assigning
each non-separator encoded slot to the next original-character index.
The returned list contains the delta (encodedIndex - strIndex) for each
original index.
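A hedged worked example of the delta mapping that docstring describes (the function name is illustrative): given the original string and an encoded string produced by inserting separators, each original index satisfies `encodedIndex = strIndex + delta[strIndex]`.

```python
# Illustrative only: recover delta[strIndex] = encodedIndex - strIndex by
# walking both strings in parallel. Assumes `encoded` was produced from
# `text` purely by inserting `sep` characters at word boundaries.
def compute_str_to_encoded_offsets(text: str, encoded: str, sep: str = " ") -> list[int]:
	deltas: list[int] = []
	encodedIndex = 0
	for strIndex in range(len(text)):
		# skip separator slots that were inserted, not copied from the original
		while encoded[encodedIndex] == sep and text[strIndex] != sep:
			encodedIndex += 1
		deltas.append(encodedIndex - strIndex)
		encodedIndex += 1
	return deltas
```

For `text = "中文分词"` and `encoded = "中文 分词"` this gives `[0, 0, 1, 1]`: the first two characters keep their positions, while the two after the inserted space are shifted right by one.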
Comment on lines +287 to +331
def segmentedText(self, sep: str = " ", newSepIndex: list[int] | None = None) -> str:
	"""Segments the text using the word end indices."""

	if len(self.wordEnds) <= 1:
		return self.text

	result = ""
	for sepIndex in range(len(self.wordEnds) - 1):
		preIndex = 0 if sepIndex == 0 else self.wordEnds[sepIndex - 1]
		curIndex = self.wordEnds[sepIndex]
		postIndex = self.wordEnds[sepIndex + 1]

		# append the token before the potential separator position
		result += self.text[preIndex:curIndex]

		# quick checks: avoid adding a duplicate separator if one is already present
		if result.endswith(sep) or self.text[curIndex:postIndex].startswith(sep):
			# separator already present on either side -> skip adding
			continue

		# Unicode categories for punctuation
		PUNCTUATION_CATEGORIES: str = "pP"
		# Determine whether any adjacent punctuation forbids a separator
		noSep = (
			unicodedata.category(self.text[curIndex - 1])[0] in PUNCTUATION_CATEGORIES
			or unicodedata.category(self.text[curIndex])[0] in PUNCTUATION_CATEGORIES
		)

		if not noSep:
			# If neither side forbids the separator, add it
			result += sep
			if newSepIndex is not None:
				newSepIndex.append(len(result) - len(sep))
	else:
		# append the final trailing token after the loop
		result += self.text[curIndex:postIndex]

	return result

def getSegmentForOffset(self, offset: int) -> tuple[int, int] | None:
	return self.getWordOffsetRange(offset)

def __init__(self, text, encoding=None):
	super().__init__(text, encoding)
	self.wordEnds = self._callCPPJieba()
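The boundary rule in `segmentedText` (skip the separator when punctuation touches the boundary on either side) can be restated as a stand-alone toy function. `insert_separators` below is a hypothetical simplification for illustration, not the class method itself; it omits the duplicate-separator check and the `newSepIndex` bookkeeping.

```python
import unicodedata

def insert_separators(text: str, wordEnds: list[int], sep: str = " ") -> str:
	"""Insert sep at each word boundary unless punctuation sits on either side."""
	result = ""
	prev = 0
	for end in wordEnds:
		result += text[prev:end]
		prev = end
		if end >= len(text):
			break
		# General category starting with "P" means a punctuation character.
		if (
			unicodedata.category(text[end - 1])[0] == "P"
			or unicodedata.category(text[end])[0] == "P"
		):
			continue  # no separator next to punctuation
		result += sep
	return result
```

So `"中文分词"` with word ends `[2, 4]` becomes `"中文 分词"`, while `"中文，分词"` with word ends `[2, 3, 5]` stays unchanged, which is exactly the braille-output behaviour this PR's summary describes.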
Comment on lines +102 to +119
def encodedToStrOffsets(
	self,
	encodedStart: int,
	encodedEnd: int | None = None,
	raiseOnError: bool = False,
) -> int | tuple[int]:
	super().encodedToStrOffsets(encodedStart, encodedEnd, raiseOnError)
	if encodedStart == 0:
		resultStart = 0
	else:
		resultStart = self.computedEncodedToStrOffsets[encodedStart]
	if encodedEnd is None:
		return resultStart
	elif encodedStart == encodedEnd:
		return (resultStart, resultStart)
	else:
		resultEnd = self.computedEncodedToStrOffsets[encodedEnd]
		return (resultStart, resultEnd)
reportClickable = boolean(default=true)

[documentNavigation]
initWordSegForUnusedLang = boolean(default=false)
@seanbudd seanbudd removed the request for review from SaschaCowley May 4, 2026 02:23
@seanbudd
Member Author

seanbudd commented May 4, 2026

@CrazySteve0605 - could you open a PR to address the copilot feedback?


Labels

conceptApproved Similar 'triaged' for issues, PR accepted in theory, implementation needs review. merge-early Merge Early in a developer cycle


5 participants