A byte-level BPE tokenizer library in pure Kotlin.
Tessera — from Latin, a piece of mosaic. Each token is a tessera; together they form the mosaic of language.
See ARCHITECTURE.md for internals and BENCHMARKS.md for test results.
Tessera is a Kotlin library that implements a byte-level Byte-Pair Encoding (BPE) tokenizer in the style of GPT-4's cl100k_base. Built from scratch in pure Kotlin, with no ML framework dependencies, it is designed for developers who want to understand how modern tokenizers work and for Kotlin/JVM projects that need a lean, readable tokenization library.
- Library, not application — designed to be consumed by other Kotlin projects
- Pure Kotlin — no DJL, no KInference, no ML frameworks
- Standard library only for tokenization logic
- Byte-level — base vocabulary of 256 bytes, supports any UTF-8 input
- GPT-4 compatible approach — pre-tokenization with the cl100k_base regex
- Minimal public API — only what is necessary, explicitly marked
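To make the pre-tokenization feature concrete, here is a minimal sketch that uses only the Kotlin standard library. The pattern is the one commonly cited for cl100k_base; whether Tessera uses it verbatim is an assumption, and `CL100K_PATTERN` and `preTokenize` are hypothetical names, not part of Tessera's public API.

```kotlin
// Pattern commonly cited for cl100k_base (assumed here, not copied from Tessera's source).
val CL100K_PATTERN = Regex(
    """(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+"""
)

// Hypothetical helper: split text into chunks that BPE later processes independently.
fun preTokenize(text: String): List<String> =
    CL100K_PATTERN.findAll(text).map { it.value }.toList()

fun main() {
    // Chunks are joined with '|' so that leading spaces stay visible.
    println(preTokenize("Hello, world!").joinToString("|")) // Hello|,| world|!
    println(preTokenize("I'm 12345").joinToString("|"))     // I|'m| |123|45
}
```

Because merges never cross chunk boundaries, a word and the punctuation that follows it cannot end up in the same token. Numbers are capped at three digits per chunk.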
// settings.gradle.kts
dependencyResolutionManagement {
    repositories {
        maven { url = uri("https://jitpack.io") }
    }
}

// build.gradle.kts
dependencies {
    implementation("com.github.HectorIFC:tessera:v0.0.7")
}

import dev.tessera.BpeTokenizer
import dev.tessera.Trainer
import dev.tessera.TrainingConfig
fun main() {
    // 1. Train a tokenizer from a corpus
    val tokenizer = Trainer(TrainingConfig(numMerges = 5000))
        .trainFromFile("corpus/text.txt")

    // 2. Save for later reuse
    tokenizer.save("tessera.json")

    // 3. Load and use
    val loaded = BpeTokenizer.load("tessera.json")
    val ids = loaded.encode("Hello, world!")
    val text = loaded.decode(ids)
    println("$ids → $text")
}

More examples in the tessera-samples module.
This is a Gradle multi-module project:
tessera/
├── tessera-core/ ← The library (published artifact)
├── tessera-cli/ ← CLI application consuming the library
└── tessera-samples/ ← Usage examples
- tessera-core: the consumable JAR. Minimal public API, no runtime dependencies beyond Kotlin stdlib and kotlinx-serialization.
- tessera-cli: runnable application (./gradlew :tessera-cli:run) demonstrating the library in use.
- tessera-samples: small Kotlin programs with main() showing usage patterns.
# Build everything
./gradlew build
# Run tests
./gradlew test
# Run the full quality pipeline
./gradlew test koverVerify ktlintCheck detekt
# Install the library in Maven Local for testing in other projects
./gradlew publishToMavenLocal
# Run the CLI
./gradlew :tessera-cli:run --args="train --corpus corpus/text.txt --merges 5000 --output tessera.json"
# Run a sample
./gradlew :tessera-samples:run -PmainClass=dev.tessera.samples.QuickStartSampleKt

At a high level:
- Pre-tokenization: text is split into logical chunks (words, contractions, numbers, punctuation) by a GPT-4-style regex.
- Byte conversion: each chunk becomes a sequence of UTF-8 bytes (0–255).
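- BPE training: starting from single bytes, the trainer repeatedly finds the most frequent adjacent pair of tokens in the corpus and merges it into a new composite token, recording the order in which merges were learned (their rank).
- Greedy encode: at inference time, repeatedly apply the applicable merge with the lowest rank (the one learned earliest) until no learned merge applies, reproducing GPT behaviour.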
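The byte conversion, training, and encoding steps above can be sketched in a few dozen lines of standard-library Kotlin. This is a toy model under stated assumptions, not Tessera's implementation: every function name is hypothetical, new token ids are assumed to start at 256 and double as merge ranks, and ties between equally frequent pairs are broken by first occurrence.

```kotlin
// Hypothetical names throughout; a toy model, not Tessera's code.

// Byte conversion: a chunk becomes a list of byte values in 0..255.
fun toByteIds(chunk: String): List<Int> =
    chunk.toByteArray(Charsets.UTF_8).map { it.toInt() and 0xFF }

// Replace every non-overlapping occurrence of `pair`, left to right, with `newId`.
fun mergePair(ids: List<Int>, pair: Pair<Int, Int>, newId: Int): List<Int> {
    val out = ArrayList<Int>(ids.size)
    var i = 0
    while (i < ids.size) {
        if (i + 1 < ids.size && ids[i] == pair.first && ids[i + 1] == pair.second) {
            out.add(newId)
            i += 2
        } else {
            out.add(ids[i])
            i += 1
        }
    }
    return out
}

// Training: learn up to `numMerges` merges. The map value is the new token id,
// which also serves as the merge rank (lower id = learned earlier).
fun train(chunks: List<String>, numMerges: Int): Map<Pair<Int, Int>, Int> {
    var sequences = chunks.map(::toByteIds)
    val merges = LinkedHashMap<Pair<Int, Int>, Int>()
    repeat(numMerges) {
        // Insertion-ordered counts make tie-breaking deterministic (first seen wins).
        val counts = LinkedHashMap<Pair<Int, Int>, Int>()
        for (seq in sequences) {
            for (pair in seq.zipWithNext()) counts[pair] = (counts[pair] ?: 0) + 1
        }
        val best = counts.maxByOrNull { it.value }?.key ?: return merges
        val newId = 256 + merges.size
        merges[best] = newId
        sequences = sequences.map { mergePair(it, best, newId) }
    }
    return merges
}

// Encoding: repeatedly apply the applicable merge with the lowest rank.
fun encode(chunk: String, merges: Map<Pair<Int, Int>, Int>): List<Int> {
    var ids = toByteIds(chunk)
    while (ids.size >= 2) {
        val pair = ids.zipWithNext()
            .filter { it in merges }
            .minByOrNull { merges.getValue(it) } ?: break
        ids = mergePair(ids, pair, merges.getValue(pair))
    }
    return ids
}

// Decoding: expand each id back to its bytes, then read the bytes as UTF-8.
fun decode(ids: List<Int>, merges: Map<Pair<Int, Int>, Int>): String {
    val vocab = HashMap<Int, ByteArray>()
    for (b in 0..255) vocab[b] = byteArrayOf(b.toByte())
    // Merges are iterated in learned order, so both halves already exist.
    for ((pair, id) in merges) {
        vocab[id] = vocab.getValue(pair.first) + vocab.getValue(pair.second)
    }
    return ids.flatMap { vocab.getValue(it).asList() }.toByteArray().toString(Charsets.UTF_8)
}

fun main() {
    val merges = train(listOf("abab", "abab"), numMerges = 2)
    println(encode("abab", merges))                          // [257]
    println(decode(encode("héllo abab", merges), merges))    // héllo abab
}
```

In the example, the first merge joins the bytes for `a` and `b` into token 256, and the second joins two copies of 256 into token 257, so `abab` encodes to a single id. Input that matches no learned merge simply stays as raw bytes, which is why any UTF-8 text can be encoded.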
See ARCHITECTURE.md (created in Phase 5) for technical details.
- Define scope and architecture
- Phase 0: Gradle multi-module setup
- Phase 1: Core library with round-trip guarantee and stable public API
- Phase 2: Sample apps consuming the library
- Phase 3: CLI consuming the library
- Phase 4: Validation against tiktoken, fuzz tests, coverage ≥ 80%
- Phase 5: JitPack publication, ARCHITECTURE.md, full KDoc
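The round-trip guarantee from Phase 1 and the fuzz tests from Phase 4 can be illustrated with a small property check: for random valid Unicode input, `decode(encode(text))` should equal `text`. The sketch below is an assumption about how such a test might look, not the project's actual test suite. `randomText` and `findRoundTripFailure` are hypothetical helpers, and the codec in `main` is a stand-in that uses raw UTF-8 bytes as ids.

```kotlin
import kotlin.random.Random

// Build a random string of valid Unicode code points. Surrogates are skipped
// because a lone surrogate cannot survive a UTF-8 round trip.
fun randomText(rng: Random, maxLen: Int = 32): String {
    val sb = StringBuilder()
    repeat(rng.nextInt(maxLen + 1)) {
        var cp = rng.nextInt(0x110000)
        while (cp in 0xD800..0xDFFF) cp = rng.nextInt(0x110000)
        sb.appendCodePoint(cp)
    }
    return sb.toString()
}

// Returns the first input that fails to round-trip, or null if all inputs pass.
fun findRoundTripFailure(
    encode: (String) -> List<Int>,
    decode: (List<Int>) -> String,
    iterations: Int = 1_000,
    seed: Int = 42,
): String? {
    val rng = Random(seed)
    repeat(iterations) {
        val text = randomText(rng)
        if (decode(encode(text)) != text) return text
    }
    return null
}

fun main() {
    // Stand-in codec: raw UTF-8 bytes as ids. Swap in a real tokenizer's encode/decode.
    val encode = { s: String -> s.toByteArray(Charsets.UTF_8).map { it.toInt() and 0xFF } }
    val decode = { ids: List<Int> -> ids.map { it.toByte() }.toByteArray().toString(Charsets.UTF_8) }
    println(findRoundTripFailure(encode, decode) ?: "all inputs round-tripped")
}
```

To run the same check against a trained tokenizer, pass its `encode` and `decode` functions instead of the stand-ins. The return types of Tessera's methods are not stated here, so a thin adapter may be needed. The fixed seed keeps any failure reproducible.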
Once Tessera is complete, the next step is a separate codebase for embeddings that will consume tessera-core as a Gradle dependency.
MIT — see LICENSE.
