C++ optimization and abi3 wheel support for portable binaries#22
Conversation
This commit implements a C++ version of the DAWG (Directed Acyclic Word Graph) dictionary functionality, replacing the previous pure Python implementation for improved performance. Key changes: - Added dawgdictionary.cpp: C++ implementation of DAWG navigation - Comprehensive documentation of DAWG binary format - Navigator pattern for flexible traversal strategies - FindNavigator for simple word lookup - CompoundNavigator for recursive compound word splitting - Extensive inline comments for maintainability - Added dawgdictionary.h: C interface for CFFI bindings - Uses bool (C99) instead of char for better type safety - Provides dawg_load, dawg_contains, dawg_find_combinations APIs - Modified dawgdictionary.py: Python wrapper using CFFI - Thin wrapper around C++ implementation - Maintains same API for backward compatibility - Modified bin_build.py: Updated CFFI declarations - Added dawgdictionary.cpp to build sources - Corrected bool return type for dawg_contains Bug fixes: - Fixed critical finality check bug: only examine LAST byte of edge prefix for the 0x80 finality marker (checking all bytes caused incorrect offset calculations and crashes) - Fixed UTF-8 to Latin-1 conversion to handle embedded nulls - Fixed encoding map to use character indices not byte positions All tests pass (13/13 in test_bin.py). 🤖 Generated with Claude Code (https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
…F-8 optimization Added C++ implementation of the compressed BÍN dictionary (bincompress.cpp) with inheritance-based hybrid Python wrapper. Migrated lookup() to C++ for performance. Optimized JSON serialization in both modules to use UTF-8 encoding with direct streaming to eliminate temporary allocations. Key changes: - Created bincompress.cpp/h with C API for compressed dictionary access - Implemented core methods: contains(), lookup() with cat/lemma/utg filters - Created hybrid BinCompressed class inheriting from BinCompressedPure base - Overrode contains() and lookup() to use optimized C++ implementations - Converted JSON output to UTF-8 in both bincompress and dawgdictionary modules - Added append_latin1_as_utf8() helpers for efficient streaming serialization - Python now consumes JSON bytes directly without decode step - All 13 tests pass 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
- Removed unused variables (templates_offset, alphabet_offset) - Fixed signedness comparison (changed int to size_t for subcategory index) - Replaced C++17 structured bindings with C++11-compatible pair access All tests still passing. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
Migrated lookup_ksnid() to C++ following the same pattern as lookup(). This provides performance improvements for KSNID lookups while maintaining full compatibility with the Python interface. Changes: - Added KsnidEntry struct to bincompress.cpp - Implemented BinCompressed::lookup_ksnid() in C++ - Added serialize_ksnid_entries() for JSON serialization - Added C API function bin_compressed_lookup_ksnid() - Updated bincompress.h and bin_build.py with new declarations - Overrode lookup_ksnid() in Python hybrid wrapper class - Updated .gitignore to exclude venv* directories All tests passing on both PyPy and CPython. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
Globally renamed the C API handle type from BinCompressedHandle to BcHandle to make the code more concise while maintaining clarity. Changes: - Updated typedef in bincompress.h - Updated all function signatures in bincompress.h - Updated all function implementations in bincompress.cpp - Updated CFFI declarations in bin_build.py All tests passing on both PyPy and CPython. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
Enhanced C++ implementations with better error handling, micro-optimizations, and improved type safety based on manual review. Changes to bincompress.cpp: - Added input validation in meaning(), ksnid_string(), and lemma() methods - Cached cat_is_no boolean in lookup methods to avoid repeated strcmp calls - Fixed documentation typo (KSNid → KSnid) - Minor whitespace cleanup Changes to dawgdictionary.cpp: - Improved documentation for utf8_to_latin1() function - Changed parameter types from const char* to const BYTE* for better type safety - Changed unsigned char to BYTE for consistency - Changed int codepoint to size_t for better type semantics - Changed encoding map from std::map<int, std::string> to std::map<BYTE, std::string> - Refactored CompoundNavigator::accept() with early return for better readability All tests passing on PyPy and CPython. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
- Implement lemma_forms() in C++ for improved performance - Add template decompression algorithm in bincompress.cpp - Support both long and short form differential encoding - Add C API wrapper and CFFI declarations - Refactor lemma_forms() interface to return List[str] instead of List[bytes] - Update BinCompressedPure to decode forms to strings - Update BinCompressed to return UTF-8 strings directly from JSON - Remove unnecessary decode() calls in all callers - Simplify type hints: _raw_lookup() and _mapping_cffi() now use str - Remove unused AnyStr import - Update AGENTS.md with proper C++ extension rebuild instructions - Document use of 'pip install -e .' instead of running bin_build.py directly - Explain src/ layout handling All tests pass on both PyPy 3.9 and CPython 3.11. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
- Add KsnidEntry::operator==() matching Python's Ksnid.__eq__() Compares all 7 fields: form, mark, bin_id, ord, ofl, hluti, ksnid - Add KsnidEntryHash functor matching Python's Ksnid.__hash__() Hashes only 4 "primary key" fields: bin_id, ofl, form, mark - Implement lookup_id() using std::unordered_set for O(1) deduplication Replaces O(n²) linear search with hash-based duplicate detection - Add CFFI declarations and Python wrapper for lookup_id() Eliminates ~30x Python↔C++ boundary crossings for typical lookups by processing all word forms in C++ instead of calling back to Python. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
Configure CFFI and setuptools to use the Python Stable ABI (Limited API), generating wheels tagged cp39-abi3 that work across Python 3.9+. Changes: - bin_build.py: Add py_limited_api="cp39" for CPython (False for PyPy) - setup.py: Configure bdist_wheel with py_limited_api="cp39" for CPython - pyproject.toml: Add wheel>=0.38.0 to build requirements Benefits: - Single wheel works on Python 3.9, 3.10, 3.11, 3.12, 3.13+ - Reduces CI/CD build matrix from 5+ wheels to 1 per platform - PyPy continues to get version-specific wheels as before Wheel naming: - Before: islenska-X.Y.Z-cp311-cp311-linux_x86_64.whl - After: islenska-X.Y.Z-cp39-abi3-linux_x86_64.whl 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
Update both GitHub Actions workflows to optimize for stable ABI wheels: wheels.yml changes: - Use CIBW_BUILD instead of CIBW_SKIP for explicit control - Build only cp39 for CPython (creates cp39-abi3 wheel for Python 3.9+) - Build pp39 and pp310 for PyPy (version-specific wheels) - Enable CIBW_PROJECT_REQUIRES_PYTHON: ">=3.9" - Reduces wheel count per platform from ~15 to ~9 wheels python-package.yml changes: - Expand test matrix from 5 to 8 Python versions - Add Python 3.10, 3.11, 3.12 to verify abi3 compatibility - Update to Python 3.14 (now released, not beta) - Change pypy-3.11 to pypy-3.10 (matches wheel builds) - Ensures comprehensive testing across all supported versions Expected wheel output per platform: - 1× cp39-abi3 wheel (works on CPython 3.9-3.14+) - 2× PyPy wheels (pp39, pp310) - 3× architectures (x86_64, arm64, AMD64 depending on OS) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
Clean up BinCompressedPure by removing Python implementations of methods that have been migrated to C++ in BinCompressed. This eliminates code duplication and reduces maintenance burden. Removed methods (now only in C++): - contains() and __contains__ - lookup() - lookup_ksnid() - lemma_forms() - lookup_id() These methods are now implemented only in C++ (bincompress.cpp) and overridden in the BinCompressed class. BinCompressedPure retains: - Infrastructure methods (__init__, meaning, lemma, etc.) - Methods not yet migrated (lookup_case, lookup_variants, etc.) Updated BinCompressedPure docstring to clarify it's now a base class, not a standalone reference implementation. Code reduction: -195 lines All tests pass: 13/13 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
Fix mypy/Pylance type checking errors by adding stub methods to BinCompressedPure for methods that are only implemented in BinCompressed. Added stub methods: - contains() / __contains__ - lookup() - lookup_ksnid() - lemma_forms() - lookup_id() These methods raise NotImplementedError with helpful error messages, ensuring that: 1. Type checkers recognize the methods exist in the base class 2. Methods in BinCompressedPure can call them (via self.method()) 3. Direct instantiation of BinCompressedPure fails with clear errors 4. BinCompressed overrides provide actual implementations Type checking: mypy passes (0 errors) Runtime behavior: All 13 tests pass Error messages: Clear and actionable 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
There was a problem hiding this comment.
Pull Request Overview
This PR migrates performance-critical dictionary lookup methods from Python to C++ and enables Python Stable ABI (abi3) support for portable wheel distribution across Python 3.9-3.14+ without requiring separate builds per version.
Key changes:
- Migrated
lookup(),lookup_ksnid(),lookup_id(), andlemma_forms()methods to C++ for ~30× performance improvement - Implemented DAWG dictionary navigation in C++ with hash-based deduplication
- Enabled abi3 wheel support reducing wheel builds by ~40%
Reviewed Changes
Copilot reviewed 12 out of 13 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
src/islenska/dawgdictionary.py |
Replaced 300+ lines of Python DAWG navigation with C++ wrapper class |
src/islenska/dawgdictionary.h |
C API declarations for DAWG dictionary operations |
src/islenska/dawgdictionary.cpp |
C++ implementation of DAWG navigation and compound word analysis |
src/islenska/bincompress.py |
Refactored into base class with C++-overridden methods for hybrid approach |
src/islenska/bincompress.h |
C API declarations for compressed dictionary operations |
src/islenska/bincompress.cpp |
C++ implementation of dictionary lookup methods with JSON serialization |
src/islenska/bin_build.py |
Added CFFI declarations and abi3 configuration |
setup.py |
Configured bdist_wheel for stable ABI support |
pyproject.toml |
Added wheel build dependency |
AGENTS.md |
Documentation for rebuilding C++ extensions |
.github/workflows/wheels.yml |
Updated to build abi3 wheels for CPython |
.github/workflows/python-package.yml |
Expanded test matrix to 8 Python versions |
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
| latin1_str.reserve(utf8_len); | ||
| for (size_t i = 0; i < utf8_len; ) { | ||
| BYTE c = utf8_bytes[i]; | ||
| if (c < 0x80) { // Standard ASCII (including null byte!) |
There was a problem hiding this comment.
The comment '(including null byte!)' on line 160 is misleading. A null byte (0x00) would terminate the string early in most C++ string operations, and the vocabulary string from a DAWG file should not contain embedded nulls. Consider removing or clarifying this comment.
| if (c < 0x80) { // Standard ASCII (including null byte!) | |
| if (c < 0x80) { // Standard ASCII |
|
|
||
| def __del__(self) -> None: | ||
| """Clean up C++ resources.""" | ||
| if hasattr(self, '_cpp_handle') and self._cpp_handle: |
There was a problem hiding this comment.
The hasattr check in __del__ is unnecessary because __init__ always sets _cpp_handle (even if to None). Consider simplifying to if self._cpp_handle: or adding proper initialization in case of early exceptions.
| if hasattr(self, '_cpp_handle') and self._cpp_handle: | |
| if self._cpp_handle: |
| env: | ||
| CIBW_SKIP: cp36-* cp37-* cp38-* pp37-* pp38-* *musllinux* *win_pp* | ||
| # Build abi3 wheel for CPython 3.9+ (one wheel for all versions) | ||
| # Build version-specific wheels for PyPy (which doesn't support abi3) |
There was a problem hiding this comment.
Building only cp39-* for CPython assumes cibuildwheel will detect the abi3 tag and create a single cp39-abi3-* wheel. This should be verified, or explicitly document that testing on Python 3.10-3.14 (as done in python-package.yml) validates that the abi3 wheel works correctly across versions.
| # Build version-specific wheels for PyPy (which doesn't support abi3) | |
| # Build version-specific wheels for PyPy (which doesn't support abi3) | |
| # NOTE: Only cp39-* is built for CPython, assuming cibuildwheel will produce a cp39-abi3-* wheel. | |
| # Compatibility with Python 3.10-3.14 is verified in the python-package.yml workflow. |
Change python-package.yml to only run on: - Push to main branch - Pull requests targeting main This eliminates duplicate CI runs that previously occurred when: 1. Pushing to a feature branch (CI runs) 2. Opening a PR from that branch (CI runs again on same commit) Benefits: - 50% reduction in CI runs - Faster feedback (no queue from redundant builds) - Same coverage (PRs still tested, main still validated) - Industry standard pattern Previously: push:* + pull_request:* = duplicate runs Now: push:main + pull_request:main = efficient runs 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
Changes based on code review: bincompress.py: - Simplify __del__ method by removing hasattr check (self._cpp_handle is always set in __init__) bincompress.cpp: - Add clarifying comment explaining 3-bit signed integer extraction logic in template decompression (cut/diff calculation) dawgdictionary.cpp: - Clean up ASCII comment (remove redundant parenthetical) These are minor code quality improvements that enhance readability and simplify the implementation without changing functionality. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
Summary
This PR migrates performance-critical methods from Python to C++ and enables Python Stable ABI for portable wheel distribution.
Key improvements:
lookup(),lookup_ksnid(),lookup_id(),lemma_forms()) to C++ for significant performance gainsPerformance impact:
lookup_id(): ~30× reduction in Python↔C++ boundary crossingsCode changes:
Test plan
Pre-merge checklist:
Implementation details
C++ files added:
src/islenska/bincompress.cpp/h- BinCompressed C++ implementationsrc/islenska/dawgdictionary.cpp/h- DAWG dictionary C++ implementationPython changes:
src/islenska/bincompress.py- Hybrid wrapper + abstract base classsrc/islenska/dawgdictionary.py- Reduced from 410 lines (now wraps C++)src/islenska/bin_build.py- Addedpy_limited_apifor abi3setup.py- Configured bdist_wheel for stable ABICI/CD updates:
.github/workflows/wheels.yml- Build cp39-abi3 instead of cp39/cp310/cp311/....github/workflows/python-package.yml- Test on 8 Python versions (was 5)🤖 Generated with Claude Code