Skip to content

C++ optimization and abi3 wheel support for portable binaries#22

Merged
vthorsteinsson merged 14 commits into
mainfrom
cpp-dawgdictionary
Oct 22, 2025
Merged

C++ optimization and abi3 wheel support for portable binaries#22
vthorsteinsson merged 14 commits into
mainfrom
cpp-dawgdictionary

Conversation

@vthorsteinsson

Copy link
Copy Markdown
Member

Summary

This PR migrates performance-critical methods from Python to C++ and enables Python Stable ABI for portable wheel distribution.

Key improvements:

  • C++ implementations: Migrated core lookup methods (lookup(), lookup_ksnid(), lookup_id(), lemma_forms()) to C++ for significant performance gains
  • Stable ABI wheels: Enabled abi3 support - one CPython wheel now works across Python 3.9-3.14+ (instead of separate wheels per version)
  • CI/CD optimization: Updated workflows to build fewer wheels (~40% reduction) while maintaining comprehensive test coverage across 8 Python versions
  • Code cleanup: Removed 195 lines of duplicate Python code, added type checking with abstract method stubs

Performance impact:

  • lookup_id(): ~30× reduction in Python↔C++ boundary crossings
  • Hash-based deduplication in C++ (O(1) vs O(n²))
  • All hot paths now execute in native compiled code

Code changes:

  • 2,519 lines added (mostly C++ implementations)
  • 543 lines removed (duplicate Python code)
  • 13 files changed

Test plan

  • All 13 existing tests pass on Python 3.11
  • Type checking passes (mypy 0 errors)
  • CI tests passed on all 8 Python versions (3.9-3.14, PyPy 3.9-3.10)
  • Verified abi3 wheel builds correctly and installs on CPython 3.11
  • Abstract method stubs raise NotImplementedError with helpful messages

Pre-merge checklist:

  • CI/CD passes on all platforms
  • Wheel builds succeed for all target platforms
  • Performance benchmarks show expected improvements (optional)

Implementation details

C++ files added:

  • src/islenska/bincompress.cpp/h - BinCompressed C++ implementation
  • src/islenska/dawgdictionary.cpp/h - DAWG dictionary C++ implementation

Python changes:

  • src/islenska/bincompress.py - Hybrid wrapper + abstract base class
  • src/islenska/dawgdictionary.py - Reduced from 410 lines (now wraps C++)
  • src/islenska/bin_build.py - Added py_limited_api for abi3
  • setup.py - Configured bdist_wheel for stable ABI

CI/CD updates:

  • .github/workflows/wheels.yml - Build cp39-abi3 instead of cp39/cp310/cp311/...
  • .github/workflows/python-package.yml - Test on 8 Python versions (was 5)

🤖 Generated with Claude Code

vthorsteinsson and others added 12 commits October 5, 2025 14:25
This commit implements a C++ version of the DAWG (Directed Acyclic Word
Graph) dictionary functionality, replacing the previous pure Python
implementation for improved performance.

Key changes:

- Added dawgdictionary.cpp: C++ implementation of DAWG navigation
  - Comprehensive documentation of DAWG binary format
  - Navigator pattern for flexible traversal strategies
  - FindNavigator for simple word lookup
  - CompoundNavigator for recursive compound word splitting
  - Extensive inline comments for maintainability

- Added dawgdictionary.h: C interface for CFFI bindings
  - Uses bool (C99) instead of char for better type safety
  - Provides dawg_load, dawg_contains, dawg_find_combinations APIs

- Modified dawgdictionary.py: Python wrapper using CFFI
  - Thin wrapper around C++ implementation
  - Maintains same API for backward compatibility

- Modified bin_build.py: Updated CFFI declarations
  - Added dawgdictionary.cpp to build sources
  - Corrected bool return type for dawg_contains

Bug fixes:
- Fixed critical finality check bug: only examine LAST byte of edge
  prefix for the 0x80 finality marker (checking all bytes caused
  incorrect offset calculations and crashes)
- Fixed UTF-8 to Latin-1 conversion to handle embedded nulls
- Fixed encoding map to use character indices not byte positions

All tests pass (13/13 in test_bin.py).

🤖 Generated with Claude Code (https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
…F-8 optimization

Added C++ implementation of the compressed BÍN dictionary (bincompress.cpp) with
inheritance-based hybrid Python wrapper. Migrated lookup() to C++ for performance.
Optimized JSON serialization in both modules to use UTF-8 encoding with direct
streaming to eliminate temporary allocations.

Key changes:
- Created bincompress.cpp/h with C API for compressed dictionary access
- Implemented core methods: contains(), lookup() with cat/lemma/utg filters
- Created hybrid BinCompressed class inheriting from BinCompressedPure base
- Overrode contains() and lookup() to use optimized C++ implementations
- Converted JSON output to UTF-8 in both bincompress and dawgdictionary modules
- Added append_latin1_as_utf8() helpers for efficient streaming serialization
- Python now consumes JSON bytes directly without decode step
- All 13 tests pass

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Removed unused variables (templates_offset, alphabet_offset)
- Fixed signedness comparison (changed int to size_t for subcategory index)
- Replaced C++17 structured bindings with C++11-compatible pair access

All tests still passing.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Migrated lookup_ksnid() to C++ following the same pattern as lookup().
This provides performance improvements for KSNID lookups while maintaining
full compatibility with the Python interface.

Changes:
- Added KsnidEntry struct to bincompress.cpp
- Implemented BinCompressed::lookup_ksnid() in C++
- Added serialize_ksnid_entries() for JSON serialization
- Added C API function bin_compressed_lookup_ksnid()
- Updated bincompress.h and bin_build.py with new declarations
- Overrode lookup_ksnid() in Python hybrid wrapper class
- Updated .gitignore to exclude venv* directories

All tests passing on both PyPy and CPython.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Globally renamed the C API handle type from BinCompressedHandle to BcHandle
to make the code more concise while maintaining clarity.

Changes:
- Updated typedef in bincompress.h
- Updated all function signatures in bincompress.h
- Updated all function implementations in bincompress.cpp
- Updated CFFI declarations in bin_build.py

All tests passing on both PyPy and CPython.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Enhanced C++ implementations with better error handling, micro-optimizations,
and improved type safety based on manual review.

Changes to bincompress.cpp:
- Added input validation in meaning(), ksnid_string(), and lemma() methods
- Cached cat_is_no boolean in lookup methods to avoid repeated strcmp calls
- Fixed documentation typo (KSNid → KSnid)
- Minor whitespace cleanup

Changes to dawgdictionary.cpp:
- Improved documentation for utf8_to_latin1() function
- Changed parameter types from const char* to const BYTE* for better type safety
- Changed unsigned char to BYTE for consistency
- Changed int codepoint to size_t for better type semantics
- Changed encoding map from std::map<int, std::string> to std::map<BYTE, std::string>
- Refactored CompoundNavigator::accept() with early return for better readability

All tests passing on PyPy and CPython.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Implement lemma_forms() in C++ for improved performance
  - Add template decompression algorithm in bincompress.cpp
  - Support both long and short form differential encoding
  - Add C API wrapper and CFFI declarations

- Refactor lemma_forms() interface to return List[str] instead of List[bytes]
  - Update BinCompressedPure to decode forms to strings
  - Update BinCompressed to return UTF-8 strings directly from JSON
  - Remove unnecessary decode() calls in all callers
  - Simplify type hints: _raw_lookup() and _mapping_cffi() now use str
  - Remove unused AnyStr import

- Update AGENTS.md with proper C++ extension rebuild instructions
  - Document use of 'pip install -e .' instead of running bin_build.py directly
  - Explain src/ layout handling

All tests pass on both PyPy 3.9 and CPython 3.11.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Add KsnidEntry::operator==() matching Python's Ksnid.__eq__()
  Compares all 7 fields: form, mark, bin_id, ord, ofl, hluti, ksnid
- Add KsnidEntryHash functor matching Python's Ksnid.__hash__()
  Hashes only 4 "primary key" fields: bin_id, ofl, form, mark
- Implement lookup_id() using std::unordered_set for O(1) deduplication
  Replaces O(n²) linear search with hash-based duplicate detection
- Add CFFI declarations and Python wrapper for lookup_id()

Eliminates ~30x Python↔C++ boundary crossings for typical lookups
by processing all word forms in C++ instead of calling back to Python.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Configure CFFI and setuptools to use the Python Stable ABI (Limited API),
generating wheels tagged cp39-abi3 that work across Python 3.9+.

Changes:
- bin_build.py: Add py_limited_api="cp39" for CPython (False for PyPy)
- setup.py: Configure bdist_wheel with py_limited_api="cp39" for CPython
- pyproject.toml: Add wheel>=0.38.0 to build requirements

Benefits:
- Single wheel works on Python 3.9, 3.10, 3.11, 3.12, 3.13+
- Reduces CI/CD build matrix from 5+ wheels to 1 per platform
- PyPy continues to get version-specific wheels as before

Wheel naming:
- Before: islenska-X.Y.Z-cp311-cp311-linux_x86_64.whl
- After:  islenska-X.Y.Z-cp39-abi3-linux_x86_64.whl

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Update both GitHub Actions workflows to optimize for stable ABI wheels:

wheels.yml changes:
- Use CIBW_BUILD instead of CIBW_SKIP for explicit control
- Build only cp39 for CPython (creates cp39-abi3 wheel for Python 3.9+)
- Build pp39 and pp310 for PyPy (version-specific wheels)
- Enable CIBW_PROJECT_REQUIRES_PYTHON: ">=3.9"
- Reduces wheel count per platform from ~15 to ~9 wheels

python-package.yml changes:
- Expand test matrix from 5 to 8 Python versions
- Add Python 3.10, 3.11, 3.12 to verify abi3 compatibility
- Update to Python 3.14 (now released, not beta)
- Change pypy-3.11 to pypy-3.10 (matches wheel builds)
- Ensures comprehensive testing across all supported versions

Expected wheel output per platform:
- 1× cp39-abi3 wheel (works on CPython 3.9-3.14+)
- 2× PyPy wheels (pp39, pp310)
- 3× architectures (x86_64, arm64, AMD64 depending on OS)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Clean up BinCompressedPure by removing Python implementations of methods
that have been migrated to C++ in BinCompressed. This eliminates code
duplication and reduces maintenance burden.

Removed methods (now only in C++):
- contains() and __contains__
- lookup()
- lookup_ksnid()
- lemma_forms()
- lookup_id()

These methods are now implemented only in C++ (bincompress.cpp) and
overridden in the BinCompressed class. BinCompressedPure retains:
- Infrastructure methods (__init__, meaning, lemma, etc.)
- Methods not yet migrated (lookup_case, lookup_variants, etc.)

Updated BinCompressedPure docstring to clarify it's now a base class,
not a standalone reference implementation.

Code reduction: -195 lines
All tests pass: 13/13

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Fix mypy/Pylance type checking errors by adding stub methods to
BinCompressedPure for methods that are only implemented in BinCompressed.

Added stub methods:
- contains() / __contains__
- lookup()
- lookup_ksnid()
- lemma_forms()
- lookup_id()

These methods raise NotImplementedError with helpful error messages,
ensuring that:
1. Type checkers recognize the methods exist in the base class
2. Methods in BinCompressedPure can call them (via self.method())
3. Direct instantiation of BinCompressedPure fails with clear errors
4. BinCompressed overrides provide actual implementations

Type checking: mypy passes (0 errors)
Runtime behavior: All 13 tests pass
Error messages: Clear and actionable

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR migrates performance-critical dictionary lookup methods from Python to C++ and enables Python Stable ABI (abi3) support for portable wheel distribution across Python 3.9-3.14+ without requiring separate builds per version.

Key changes:

  • Migrated lookup(), lookup_ksnid(), lookup_id(), and lemma_forms() methods to C++ for ~30× performance improvement
  • Implemented DAWG dictionary navigation in C++ with hash-based deduplication
  • Enabled abi3 wheel support reducing wheel builds by ~40%

Reviewed Changes

Copilot reviewed 12 out of 13 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
src/islenska/dawgdictionary.py Replaced 300+ lines of Python DAWG navigation with C++ wrapper class
src/islenska/dawgdictionary.h C API declarations for DAWG dictionary operations
src/islenska/dawgdictionary.cpp C++ implementation of DAWG navigation and compound word analysis
src/islenska/bincompress.py Refactored into base class with C++-overridden methods for hybrid approach
src/islenska/bincompress.h C API declarations for compressed dictionary operations
src/islenska/bincompress.cpp C++ implementation of dictionary lookup methods with JSON serialization
src/islenska/bin_build.py Added CFFI declarations and abi3 configuration
setup.py Configured bdist_wheel for stable ABI support
pyproject.toml Added wheel build dependency
AGENTS.md Documentation for rebuilding C++ extensions
.github/workflows/wheels.yml Updated to build abi3 wheels for CPython
.github/workflows/python-package.yml Expanded test matrix to 8 Python versions

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

Comment thread src/islenska/dawgdictionary.cpp Outdated
latin1_str.reserve(utf8_len);
for (size_t i = 0; i < utf8_len; ) {
BYTE c = utf8_bytes[i];
if (c < 0x80) { // Standard ASCII (including null byte!)

Copilot AI Oct 22, 2025

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment '(including null byte!)' on line 160 is misleading. A null byte (0x00) would terminate the string early in most C++ string operations, and the vocabulary string from a DAWG file should not contain embedded nulls. Consider removing or clarifying this comment.

Suggested change
if (c < 0x80) { // Standard ASCII (including null byte!)
if (c < 0x80) { // Standard ASCII

Copilot uses AI. Check for mistakes.
Comment thread src/islenska/bincompress.cpp
Comment thread src/islenska/bincompress.py Outdated

def __del__(self) -> None:
"""Clean up C++ resources."""
if hasattr(self, '_cpp_handle') and self._cpp_handle:

Copilot AI Oct 22, 2025

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The hasattr check in __del__ is unnecessary because __init__ always sets _cpp_handle (even if to None). Consider simplifying to if self._cpp_handle: or adding proper initialization in case of early exceptions.

Suggested change
if hasattr(self, '_cpp_handle') and self._cpp_handle:
if self._cpp_handle:

Copilot uses AI. Check for mistakes.
env:
CIBW_SKIP: cp36-* cp37-* cp38-* pp37-* pp38-* *musllinux* *win_pp*
# Build abi3 wheel for CPython 3.9+ (one wheel for all versions)
# Build version-specific wheels for PyPy (which doesn't support abi3)

Copilot AI Oct 22, 2025

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Building only cp39-* for CPython assumes cibuildwheel will detect the abi3 tag and create a single cp39-abi3-* wheel. This should be verified, or explicitly document that testing on Python 3.10-3.14 (as done in python-package.yml) validates that the abi3 wheel works correctly across versions.

Suggested change
# Build version-specific wheels for PyPy (which doesn't support abi3)
# Build version-specific wheels for PyPy (which doesn't support abi3)
# NOTE: Only cp39-* is built for CPython, assuming cibuildwheel will produce a cp39-abi3-* wheel.
# Compatibility with Python 3.10-3.14 is verified in the python-package.yml workflow.

Copilot uses AI. Check for mistakes.
vthorsteinsson and others added 2 commits October 22, 2025 21:45
Change python-package.yml to only run on:
- Push to main branch
- Pull requests targeting main

This eliminates duplicate CI runs that previously occurred when:
1. Pushing to a feature branch (CI runs)
2. Opening a PR from that branch (CI runs again on same commit)

Benefits:
- 50% reduction in CI runs
- Faster feedback (no queue from redundant builds)
- Same coverage (PRs still tested, main still validated)
- Industry standard pattern

Previously: push:* + pull_request:* = duplicate runs
Now: push:main + pull_request:main = efficient runs

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Changes based on code review:

bincompress.py:
- Simplify __del__ method by removing hasattr check
  (self._cpp_handle is always set in __init__)

bincompress.cpp:
- Add clarifying comment explaining 3-bit signed integer extraction
  logic in template decompression (cut/diff calculation)

dawgdictionary.cpp:
- Clean up ASCII comment (remove redundant parenthetical)

These are minor code quality improvements that enhance readability
and simplify the implementation without changing functionality.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@vthorsteinsson vthorsteinsson merged commit baa9cf4 into main Oct 22, 2025
9 checks passed
@vthorsteinsson vthorsteinsson deleted the cpp-dawgdictionary branch October 22, 2025 22:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants