[WIP] Merge Chinese Word Segmentation work by seanbudd · Pull Request #19166 · nvaccess/nvda

seanbudd · 2025-11-05T01:53:21Z

Merges:

- Add `cppjieba` as a Git submodule under `third_party/cppjieba/` to provide robust Chinese word segmentation capabilities. - Update `.gitmodules` to point to the official `cppjieba` repository and configure it to track the `master` branch. - Update 'sconscript' to include the paths of 'cppjieba' and its dependency 'limonp' - Modify `copying.txt` to include the `cppjieba` license (MIT) alongside the project’s existing license, ensuring proper attribution and compliance. - Update documents

Co-authored-by: Sean Budd <seanbudd123@gmail.com>

- Introduce `JiebaSingleton` class in `cppjieba.hpp`/`cppjieba.cpp` with def file under nvdaHelper/cppjieba/' - Inherits from `cppjieba::Jieba` and exposes a thread-safe `getOffsets()` method - Implements Meyers’ singleton via `getInstance()` with a private constructor - Deletes copy constructor, copy assignment, move constructor, and move assignment to enforce single instance - Add C-style API in the same module: - `int initJieba()` to force singleton initialization - `int segmentOffsets(const char* text, int** charOffsets, int* outLen)` to perform segmentation and return character offsets - `void freeOffsets(int* ptr)` to release allocated offset buffer

- Change 'submodules' in 'jobs - buildNVDA - Build NVDA - Checkout NVDA' from 'true' to 'recursive' to ensure cppjieba's submodule is fetched. - This will cause the submodule of sonic to be fetched as well, which seems currently unused.

Co-authored-by: Sean Budd <seanbudd123@gmail.com>

merge for Python updated

- create `WordSegmentationStrategy' as an abstract base class to select segmentation strategy based on text content, following Strategy Pattern - implement `ChineseWordSegmentationStrategy` (for Chinese text) - implement `UniscribeWordSegmentationStrategy` (for other languages as default strategy)

add `WordSegmenter` class for word segmentation, integrating segmentation strategies

- replace `useUniscribe` with `useUniscribeForCharOffset` & `useWordSegmenterForWordOffset` for segmentation extensions - integrate `WordSegmenter` for calculating word offsets

make `DisplayModelTextInfo`'s flag aligned with its superclass

@LeonarddeR

- redesign 2 properties of 'OffsetTextInfo' as enums to make them more configurable, inspired by @LeonarddeR - override them in some subclasses to simulate specific behaviors

cary-rowen · 2025-11-20T11:57:09Z

What are our main tasks in this PR? Is it just cleaning up the code? Is it possible to include bug fixes?
cc @CrazySteve0605
According to OSPP guidelines, is this still considered part of your job responsibilities?

CrazySteve0605 · 2025-11-20T12:08:00Z

cc @CrazySteve0605
According to OSPP guidelines, is this still considered part of your job responsibilities?

Yes. I locally finished fixing remaining issues in #18865, and will make a PR soon, for this functionality to be integrated more properly.

seanbudd · 2025-11-21T06:29:02Z

@cary-rowen @CrazySteve0605 - yes, PRs to this branch that improve the code quality and fix issues would be appreciated. Particularly resolving conflicts

cary-rowen · 2025-11-30T00:14:15Z

As we discussed offline @CrazySteve0605

According to feedback from the community, there are still some cases that need to be improved.

I think that when the known to-do items for this PR are solved and merged into the Master, we can provide follow-up polish PR.

Here, just co-pilot provides some comments worth addressing.

Summary of the issue: Some punctuations have extra separators (spaces) before or after them. Description of user facing changes: Braille output will be more accurate.

seanbudd · 2026-03-02T04:03:01Z

Thanks @CrazySteve0605 - can you merge master and resolve conflicts now?

+			return self.text
+
+		result = ""
+		for sepIndex in range(len(self.wordEnds) - 1):


+# For full terms and any additional permissions, see the NVDA license file: https://github.com/nvaccess/nvda/blob/master/copying.txt
+
+import os
+import ctypes


+			try:
+				if bool(charPtr):
+					self._lib.freeOffsets(charPtr)
+			except Exception:


CrazySteve0605 · 2026-03-02T05:07:17Z

Hi @seanbudd ,

can you merge master and resolve conflicts now?

Yes, but I wonder how I can do this. Should I make a new PR to merge master into this branch now? I had misunderstood and thought the merge operation didn't need to be done by myself.

seanbudd · 2026-03-02T05:18:28Z

Yes, create a branch from nvaccesss/nvda;try-chineseWordSegmentation-staging and merge nvaccess/nvda:master into it, then open a PR to merge it to nvaccesss/nvda;try-chineseWordSegmentation-staging

seanbudd · 2026-04-20T01:03:14Z

Hi @cary-rowen @CrazySteve0605 - there are still significant merge conflicts

+	charSegFlag: CharSegFlag = CharSegFlag.UNISCRIBE
+
+	@property
+	def wordSegFlag(self) -> WordSegFlag | None:


cary-rowen · 2026-04-20T15:13:45Z

Hi @seanbudd
@CrazySteve0605 will handle the merge conflict here. But I can do it if it's willing to let me help.

### Link to issue number:  ### Summary of the issue: ### Description of user facing changes: ### Description of developer facing changes: ### Description of development approach: ### Testing strategy: ### Known issues with pull request: ### Code Review Checklist:  - [ ] Documentation: - Change log entry - User Documentation - Developer / Technical Documentation - Context sensitive help for GUI changes - [ ] Testing: - Unit tests - System (end to end) tests - Manual testing - [ ] UX of all users considered: - Speech - Braille - Low Vision - Different web browsers - Localization in other languages / culture than English - [ ] API is compatible with existing add-ons. - [ ] Security precautions taken. --------- Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: GitHub Actions <github-actions@github.com> Co-authored-by: Sascha Cowley <16543535+SaschaCowley@users.noreply.github.com> Co-authored-by: Peng-An Chen <andy72039@gmail.com> Co-authored-by: Sean Budd <sean@nvaccess.org> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Michael Curran <mick@nvaccess.org> Co-authored-by: GitHub Actions <actions@github.com> Co-authored-by: Bill Dengler <codeofdusk@gmail.com> Co-authored-by: Cyrille Bougot <cyrille.bougot2@laposte.net> Co-authored-by: Boumtchack <147637328+Boumtchack@users.noreply.github.com> Co-authored-by: Kefas Lungu <84945041+kefaslungu@users.noreply.github.com> Co-authored-by: Jani Kinnunen <janikinnunen340@gmail.com> Co-authored-by: Akash Kakkar <6381747+akash07k@users.noreply.github.com> Co-authored-by: makhlwf <78276231+makhlwf@users.noreply.github.com> Co-authored-by: Quin Gillespie <trypsynth@gmail.com> Co-authored-by: Bram Duvigneau <bram@bramd.nl> Co-authored-by: Luke Davis <8139760+XLTechie@users.noreply.github.com> Co-authored-by: audioses <67341000+audioses@users.noreply.github.com> Co-authored-by: gexgd0419 <55008943+gexgd0419@users.noreply.github.com> Co-authored-by: Cyrille Bougot <cyrille.bougot@laposte.net> Co-authored-by: Joseph Lee <joseph.lee22590@gmail.com> Co-authored-by: Christopher Proß <cpross@mailbox.org> Co-authored-by: Ryan McCleary <remccleary@gmail.com> Co-authored-by: wencong <manchen_0528@outlook.com> Co-authored-by: Emil-18 <135248352+Emil-18@users.noreply.github.com> Co-authored-by: Danil <50794055+Danstiv@users.noreply.github.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Noelia Ruiz Martínez <nrm1977@gmail.com> Co-authored-by: Sean Budd <seanbudd123@gmail.com> Co-authored-by: James Teh <jamie@jantrid.net> Co-authored-by: Thiago Seus <thiago.seus@yahoo.com.br> Co-authored-by: WangFeng Huang <1398969445@qq.com> Co-authored-by: ethanl-11 <124083447+ethanl-11@users.noreply.github.com>

+		self.wordSegFlag: WordSegFlag = wordSegFlag
+		self.strategy: wordSegStrategy.WordSegmentationStrategy = self._chooseStrategy()
+
+	def _chooseStrategy(self) -> wordSegStrategy.WordSegmentationStrategy:  # TODO: optimize


+				try:
+					if bool(charPtr):
+						self._lib.freeOffsets(charPtr)
+				except Exception:


Copilot

Pull request overview

Copilot reviewed 30 out of 31 changed files in this pull request and generated 8 comments.

Comments suppressed due to low confidence (1)

source/braille.py:644

converter is overwritten when both Chinese word-segmentation spacing and Unicode normalization are active. In that case, brailleToRawPos / rawToBraillePos are only converted using the last converter, so offsets will be wrong (and cursor routing/selection mapping can break). Consider composing converters (apply both mappings), or apply normalization first and then segmentation while keeping a combined mapping back to the original raw text.

		converter: OffsetConverter | None = None
		textToTranslate = self.rawText
		textToTranslateTypeforms = self.rawTextTypeforms
		cursorPos = self.cursorPos
		if (
			config.conf["braille"]["translationTable"].startswith("zh")
			or config.conf["braille"]["translationTable"] == "auto"
			and brailleTables.getDefaultTableForCurLang(brailleTables.TableType.OUTPUT).startswith("zh")
		):
			from textUtils.wordSeg.wordSegUtils import WordSegWithSeparatorOffsetConverter  # noqa: F401

			converter = WordSegWithSeparatorOffsetConverter(textToTranslate)
			textToTranslate = converter.encoded
			if cursorPos is not None:
				cursorPos = converter.strToEncodedOffsets(cursorPos)
		if config.conf["braille"]["unicodeNormalization"] and not isUnicodeNormalized(textToTranslate):
			converter = UnicodeNormalizationOffsetConverter(textToTranslate)
			textToTranslate = converter.encoded
			if textToTranslateTypeforms is not None:
				# Typeforms must be adapted to represent normalized characters.
				textToTranslateTypeforms = [
					textToTranslateTypeforms[strOffset] for strOffset in converter.computedEncodedToStrOffsets
				]
			if cursorPos is not None:
				# Convert the cursor position to a normalized offset.
				cursorPos = converter.strToEncodedOffsets(cursorPos)
		self.brailleCells, brailleToRawPos, rawToBraillePos, self.brailleCursorPos = louisHelper.translate(
			[handler.table.fileName, "braille-patterns.cti"],
			textToTranslate,
			typeform=textToTranslateTypeforms,
			mode=mode,
			cursorPos=cursorPos,
		)

		if converter:
			# The received brailleToRawPos contains braille to normalized positions.
			# Process them to represent real raw positions by converting them from normalized ones.
			brailleToRawPos = [converter.encodedToStrOffsets(i) for i in brailleToRawPos]
			# The received rawToBraillePos contains normalized to braille positions.
			# Create a new list based on real raw positions.
			rawToBraillePos = [rawToBraillePos[i] for i in converter.computedStrToEncodedOffsets]
		self.brailleToRawPos = brailleToRawPos

+JiebaSingleton& JiebaSingleton::getInstance(const char* dictDir) {
+    // convert incoming C-string+length to std::string (handles dictDir == nullptr)
+    std::string dir = dictDir;
+
+    // ensure singleton is constructed exactly once
+    std::call_once(JiebaSingleton::initFlag, [&]() {
+        // allocate on heap, so we avoid copy/move and control lifetime
+        JiebaSingleton::instance = new JiebaSingleton(dir.c_str());
+        // optional: register deleter at exit
+        std::atexit([]() {
+            delete JiebaSingleton::instance;
+            JiebaSingleton::instance = nullptr;
+        });
+    });
+
+    // after call_once, instance must be non-null
+    return *JiebaSingleton::instance;
+}


+bool insertUserWord(const char* word, int freq, const char* tag = cppjieba::UNKNOWN_TAG) {
+	return JiebaSingleton::getInstance().InsertUserWord(string(word), freq, string(tag));
+}
+
+bool deleteUserWord(const char* word, const char* tag = cppjieba::UNKNOWN_TAG) {
+	return JiebaSingleton::getInstance().DeleteUserWord(string(word), string(tag));
+}
+
+bool find(const char* word) {
+	return JiebaSingleton::getInstance().Find(string(word));
+}


+		Compute a list of offsets so that:
+			encodedIndex = strIndex + relevantStrToEncodedOffsets[strIndex]
+
+		We build an explicit mapping from original string indices to encoded indices
+		by marking separator positions in the encoded string and then assigning
+		each non-separator encoded slot to the next original-character index.
+		The returned list contains the delta (encodedIndex - strIndex) for each
+		original index.


+			Thread(target=callable_to_call, args=args, kwargs=kwargs, daemon=True).start()
+		except Exception as e:
+			log.debug("Initializer %s.%s failed: %s", module_name, qualname, e)
+		return


+			case config.featureFlagEnums.WordNavigationUnitFlag.CHINESE:
+				return WordSegFlag.CHINESE
+			case _:
+				log.error(f"Unknown word segmentation standard, {self.__wordSegConf.calculated()!r}")


+	def segmentedText(self, sep: str = " ", newSepIndex: list[int] | None = None) -> str:
+		"""Segments the text using the word end indices."""
+
+		if len(self.wordEnds) <= 1:
+			return self.text
+
+		result = ""
+		for sepIndex in range(len(self.wordEnds) - 1):
+			preIndex = 0 if sepIndex == 0 else self.wordEnds[sepIndex - 1]
+			curIndex = self.wordEnds[sepIndex]
+			postIndex = self.wordEnds[sepIndex + 1]
+
+			# append the token before the potential separator position
+			result += self.text[preIndex:curIndex]
+
+			# quick checks: avoid adding duplicate separator if already present
+			if result.endswith(sep) or self.text[curIndex:postIndex].startswith(sep):
+				# separator already present at either side -> skip adding
+				continue
+
+			# Unicode categories for punctuation
+			PUNCTUATION_CATEGORIES: str = "pP"
+			# Determine whether any punctuation forbids a separator
+			noSep = (
+				unicodedata.category(self.text[curIndex - 1])[0] in PUNCTUATION_CATEGORIES
+				or unicodedata.category(self.text[curIndex])[0] in PUNCTUATION_CATEGORIES
+			)
+
+			if not noSep:
+				# If neither side forbids the separator, add it
+				result += sep
+				if newSepIndex is not None:
+					newSepIndex.append(len(result) - len(sep))
+		else:
+			# append the final trailing token after the loop
+			result += self.text[curIndex:postIndex]
+
+		return result
+
+	def getSegmentForOffset(self, offset: int) -> tuple[int, int] | None:
+		return self.getWordOffsetRange(offset)
+
+	def __init__(self, text, encoding=None):
+		super().__init__(text, encoding)
+		self.wordEnds = self._callCPPJieba()


+	def encodedToStrOffsets(
+		self,
+		encodedStart: int,
+		encodedEnd: int | None = None,
+		raiseOnError: bool = False,
+	) -> int | tuple[int]:
+		super().encodedToStrOffsets(encodedStart, encodedEnd, raiseOnError)
+		if encodedStart == 0:
+			resultStart = 0
+		else:
+			resultStart = self.computedEncodedToStrOffsets[encodedStart]
+		if encodedEnd is None:
+			return resultStart
+		elif encodedStart == encodedEnd:
+			return (resultStart, resultStart)
+		else:
+			resultEnd = self.computedEncodedToStrOffsets[encodedEnd]
+			return (resultStart, resultEnd)


 	reportClickable = boolean(default=true)

 [documentNavigation]
+	initWordSegForUnusedLang = boolean(default=false)


seanbudd · 2026-05-04T02:24:02Z

@CrazySteve0605 - could you open a PR to address the copilot feedback?

CrazySteve0605 and others added 30 commits July 23, 2025 17:30

Update what's new

fb4efef

Add comments for building script of cppjieba and its dependency

ae58e9b

Update projectDocs/dev/createDevEnvironment.md

06070c1

Co-authored-by: Sean Budd <seanbudd123@gmail.com>

Update include/readme.md

2273a60

Co-authored-by: Sean Budd <seanbudd123@gmail.com>

Remove changes in sconscript for localLIb

3d4d9f1

add building script for cppjieba

1fbf05f

Merge branch 'master' into integrateCPPJieba

f4cab8a

Merge branch 'master' into integrateCPPJieba

d4c3a92

Update .gitignore for cppjieba

da662be

Update building and setup script for cppjieba's dicts installation

38a12dc

update copyright headers based on @seanbudd's suggestions

c60c2da

Update include/readme.md

c853b64

Co-authored-by: Sean Budd <seanbudd123@gmail.com>

Merge branch 'master' into integrateCPPJieba

53dd3bb

Merge branch 'master' into integrateCPPJieba

eba63ab

merge for Python updated

update textUtils/__init__.py

9f62f04

add `WordSegmenter` class for word segmentation, integrating segmentation strategies

update textInfos/offsets.py

81f2040

- replace `useUniscribe` with `useUniscribeForCharOffset` & `useWordSegmenterForWordOffset` for segmentation extensions - integrate `WordSegmenter` for calculating word offsets

update displayModel.py

da64cd8

make `DisplayModelTextInfo`'s flag aligned with its superclass

Pre-commit auto-fix

557f404

update type annotations

f72d348

Merge branch 'master' into integrateCPPJieba

19cad8a

add wrapper for word manager

adc22fb

update the word segmentation structure

4adac07

- redesign 2 properties of 'OffsetTextInfo' as enums to make them more configurable, inspired by @LeonarddeR - override them in some subclasses to simulate specific behaviors

Merge branch 'integrateCPPJieba' into wordNavigationForChineseText

407d4b2

Pre-commit auto-fix

0d40f0a

add copyright header

676fc42

add type annotations

ddd48e8

seanbudd mentioned this pull request Nov 25, 2025

Feature Request: Add word navigation to NVDA #16237

Open

cary-rowen mentioned this pull request Nov 29, 2025

Add Chinese words segmentation support. #4075

Open

seanbudd added the merge-early Merge Early in a developer cycle label Dec 8, 2025

CrazySteve0605 mentioned this pull request Mar 1, 2026

Fixup for Chinese Word Segmentation and Braille Output #19324

Merged

5 tasks

Fixup for Chinese Word Segmentation and Braille Output (#19324)

096e985

Summary of the issue: Some punctuations have extra separators (spaces) before or after them. Description of user facing changes: Braille output will be more accurate.

github-code-quality Bot found potential problems Mar 2, 2026

View reviewed changes

CrazySteve0605 and others added 4 commits March 24, 2026 23:41

Merge branch 'master'

006277b

Fix word expansion without flowsTo

072b405

Fix textInfo word expansion test expectation

4c9d616

Merge master into try-chineseWordSegmentation-staging (#19747)

9edb45f

github-code-quality Bot found potential problems Apr 20, 2026

View reviewed changes

Merge latest master into try-chineseWordSegmentation-staging

07d9b13

cary-rowen and others added 2 commits May 1, 2026 22:21

Merge latest master into chinese word segmentation branch

d42d90c

seanbudd requested a review from Copilot May 3, 2026 23:26

Copilot started reviewing on behalf of seanbudd May 3, 2026 23:27 View session

github-code-quality Bot found potential problems May 3, 2026

View reviewed changes

Copilot AI reviewed May 3, 2026

View reviewed changes

seanbudd removed the request for review from SaschaCowley May 4, 2026 02:23

cary-rowen mentioned this pull request May 5, 2026

Fixup merge master into Chinese Word Segmentation #20055

Open

5 tasks

Uh oh!

Conversation

seanbudd commented Nov 5, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

cary-rowen commented Nov 20, 2025

Uh oh!

CrazySteve0605 commented Nov 20, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

seanbudd commented Nov 21, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

cary-rowen commented Nov 30, 2025

Uh oh!

seanbudd commented Mar 2, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

CrazySteve0605 commented Mar 2, 2026

Uh oh!

seanbudd commented Mar 2, 2026

Uh oh!

seanbudd commented Apr 20, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

cary-rowen commented Apr 20, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

seanbudd commented May 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

seanbudd commented Nov 5, 2025 •

edited

Loading

CrazySteve0605 commented Nov 20, 2025 •

edited

Loading

seanbudd commented Nov 21, 2025 •

edited

Loading