Tessera

A byte-level BPE tokenizer library in pure Kotlin.
Tessera — from Latin, a piece of mosaic. Each token is a tessera; together they form the mosaic of language.

Status

See ARCHITECTURE.md for internals and BENCHMARKS.md for test results.

About

Tessera is a Kotlin library that implements a byte-level Byte-Pair Encoding (BPE) tokenizer in the style of GPT-4's cl100k_base. Built from scratch in pure Kotlin, with no ML framework dependencies, it is designed for developers who want to understand how modern tokenizers work and for Kotlin/JVM projects that need a lean, readable tokenization library.

Principles

Library, not application — designed to be consumed by other Kotlin projects
Pure Kotlin — no DJL, no KInference, no ML frameworks
Standard library only for tokenization logic
Byte-level — base vocabulary of 256 bytes, supports any UTF-8 input
GPT-4 compatible approach — pre-tokenization with cl100k_base regex
Minimal public API — only what is necessary, explicitly marked
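
To make the byte-level and pre-tokenization principles concrete, here is a minimal sketch using only the Kotlin standard library. It is not Tessera's API: the function names are hypothetical, and the pattern is the simpler GPT-2-style split, used here as a stand-in for the cl100k_base regex.

```kotlin
// Illustrative sketch only: hypothetical names, not Tessera's API.
// GPT-2-style split pattern, a simplified stand-in for the cl100k_base regex.
val splitPattern = Regex("""'(?:s|t|re|ve|m|ll|d)| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""")

// Split text into chunks (words, numbers, punctuation, whitespace runs).
fun preTokenize(text: String): List<String> =
    splitPattern.findAll(text).map { it.value }.toList()

// Map a chunk to base token ids: one id in 0..255 per UTF-8 byte.
fun toByteIds(chunk: String): List<Int> =
    chunk.encodeToByteArray().map { it.toInt() and 0xFF }

// Inverse of toByteIds: reassemble the bytes and decode them as UTF-8.
fun fromByteIds(ids: List<Int>): String =
    ids.map { it.toByte() }.toByteArray().decodeToString()

fun main() {
    val chunks = preTokenize("Hello, world!")
    println(chunks.joinToString("|"))                          // Hello|,| world|!
    println(toByteIds("é"))                                    // [195, 169]
    println(fromByteIds(chunks.flatMap { toByteIds(it) }))     // Hello, world!
}
```

Because every possible byte already has an id, no input can fall outside the vocabulary; learned merges only shorten the sequence.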

Installation (after v0.0.1 release)

Gradle (Kotlin DSL)

// settings.gradle.kts
dependencyResolutionManagement {
    repositories {
        maven { url = uri("https://jitpack.io") }
    }
}

// build.gradle.kts
dependencies {
    implementation("com.github.HectorIFC:tessera:v0.0.7")
}

Quick start

import dev.tessera.BpeTokenizer
import dev.tessera.Trainer
import dev.tessera.TrainingConfig

fun main() {
    // 1. Train a tokenizer from a corpus
    val tokenizer = Trainer(TrainingConfig(numMerges = 5000))
        .trainFromFile("corpus/text.txt")

    // 2. Save for later reuse
    tokenizer.save("tessera.json")

    // 3. Load and use
    val loaded = BpeTokenizer.load("tessera.json")
    val ids = loaded.encode("Hello, world!")
    val text = loaded.decode(ids)
    println("$ids → $text")
}

More examples are available in the tessera-samples module.

Project structure

This is a Gradle multi-module project:

tessera/
├── tessera-core/      ← The library (published artifact)
├── tessera-cli/       ← CLI application consuming the library
└── tessera-samples/   ← Usage examples

tessera-core: the consumable JAR. Minimal public API, no runtime dependencies beyond Kotlin stdlib and kotlinx-serialization.
tessera-cli: runnable application (./gradlew :tessera-cli:run) demonstrating the library in use.
tessera-samples: small Kotlin programs with main() showing usage patterns.
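
As a rough sketch of how such a layout is typically wired in the Gradle Kotlin DSL (an assumption based on the module names above; the repository's actual build files may differ):

```kotlin
// settings.gradle.kts (sketch; assumed, not copied from the repository)
rootProject.name = "tessera"
include(":tessera-core", ":tessera-cli", ":tessera-samples")

// tessera-cli/build.gradle.kts (sketch): the CLI consumes the library module
dependencies {
    implementation(project(":tessera-core"))
}
```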

Running locally

# Build everything
./gradlew build

# Run tests
./gradlew test

# Run the full quality pipeline
./gradlew test koverVerify ktlintCheck detekt

# Install the library in Maven Local for testing in other projects
./gradlew publishToMavenLocal

# Run the CLI
./gradlew :tessera-cli:run --args="train --corpus corpus/text.txt --merges 5000 --output tessera.json"

# Run a sample
./gradlew :tessera-samples:run -PmainClass=dev.tessera.samples.QuickStartSampleKt

Architecture

At a high level:

Pre-tokenization: text is split into logical chunks (words, contractions, numbers, punctuation) by a GPT-4-style regex.
Byte conversion: each chunk becomes a sequence of UTF-8 bytes (0–255).
BPE training: the algorithm repeatedly merges the most frequent adjacent pair of tokens (initially raw bytes), building composite tokens.
Greedy encode: at inference time, repeatedly apply the applicable merge with the lowest rank (the one learned earliest), reproducing GPT behaviour.
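
The training and encoding steps above can be sketched on a toy corpus as follows. This is a minimal illustration under stated assumptions, not Tessera's implementation: all function names are hypothetical, new token ids are assumed to start at 256 so that an id doubles as the merge rank, and ties between equally frequent pairs go to the pair seen first.

```kotlin
// Illustrative sketch only: hypothetical names, not Tessera's API.

// Count adjacent id pairs across all chunks (insertion-ordered for stable ties).
fun countPairs(chunks: List<List<Int>>): Map<Pair<Int, Int>, Int> {
    val counts = LinkedHashMap<Pair<Int, Int>, Int>()
    for (chunk in chunks) {
        for (pair in chunk.zipWithNext()) counts[pair] = (counts[pair] ?: 0) + 1
    }
    return counts
}

// Replace each non-overlapping occurrence of `pair`, left to right, with `newId`.
fun mergePair(ids: List<Int>, pair: Pair<Int, Int>, newId: Int): List<Int> {
    val out = ArrayList<Int>(ids.size)
    var i = 0
    while (i < ids.size) {
        if (i + 1 < ids.size && ids[i] == pair.first && ids[i + 1] == pair.second) {
            out.add(newId); i += 2
        } else {
            out.add(ids[i]); i += 1
        }
    }
    return out
}

// Training: learn up to numMerges merges. Ids start at 256, so id order = rank order.
fun train(corpus: List<List<Int>>, numMerges: Int): Map<Pair<Int, Int>, Int> {
    val merges = LinkedHashMap<Pair<Int, Int>, Int>()
    var chunks = corpus
    repeat(numMerges) {
        val best = countPairs(chunks).maxByOrNull { it.value } ?: return merges
        val newId = 256 + merges.size
        merges[best.key] = newId
        chunks = chunks.map { mergePair(it, best.key, newId) }
    }
    return merges
}

// Greedy encode: keep applying the lowest-ranked merge present in the sequence.
fun encode(ids: List<Int>, merges: Map<Pair<Int, Int>, Int>): List<Int> {
    var current = ids
    while (current.size >= 2) {
        val pair = current.zipWithNext()
            .filter { it in merges }
            .minByOrNull { merges.getValue(it) } ?: break
        current = mergePair(current, pair, merges.getValue(pair))
    }
    return current
}

// Decode: expand merged ids back to bytes, then read the bytes as UTF-8.
fun decode(ids: List<Int>, merges: Map<Pair<Int, Int>, Int>): String {
    val byId = merges.entries.associate { (pair, id) -> id to pair }
    fun expand(id: Int): List<Int> {
        val (left, right) = byId[id] ?: return listOf(id)
        return expand(left) + expand(right)
    }
    return ids.flatMap { expand(it) }.map { it.toByte() }.toByteArray().decodeToString()
}

fun main() {
    val bytes = "aaabdaaabac".encodeToByteArray().map { it.toInt() and 0xFF }
    val merges = train(listOf(bytes), numMerges = 3)
    val ids = encode(bytes, merges)
    println(ids)                  // [258, 100, 258, 97, 99]
    println(decode(ids, merges))  // aaabdaaabac
}
```

In this toy run the three learned merges are (97, 97) → 256, (256, 97) → 257 and (257, 98) → 258, so "aaab" collapses to the single token 258. A production encoder would normally run per pre-tokenized chunk and avoid rescanning the whole sequence on every merge.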

See ARCHITECTURE.md (created in Phase 5) for technical details.

Roadmap

Define scope and architecture
Phase 0: Gradle multi-module setup
Phase 1: Core library with round-trip guarantee and stable public API
Phase 2: Sample apps consuming the library
Phase 3: CLI consuming the library
Phase 4: Validation against tiktoken, fuzz tests, coverage ≥ 80%
Phase 5: JitPack publication, ARCHITECTURE.md, full KDoc
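
The round-trip guarantee from Phase 1 and the fuzz tests from Phase 4 amount to checking that decode(encode(x)) == x for arbitrary valid text. A minimal sketch of such a check follows; it assumes the JVM (for StringBuilder.appendCodePoint), uses hypothetical function names, and takes the tokenizer as two plain functions rather than relying on Tessera's actual API.

```kotlin
import kotlin.random.Random

// Build a random string from valid Unicode code points (no lone surrogates).
fun randomText(rng: Random, maxLength: Int = 32): String {
    val sb = StringBuilder()
    repeat(rng.nextInt(maxLength + 1)) {
        val codePoint = when (rng.nextInt(3)) {
            0 -> rng.nextInt(0x20, 0x7F)          // printable ASCII
            1 -> rng.nextInt(0xA0, 0x3000)        // assorted BMP code points below the surrogates
            else -> rng.nextInt(0x1F600, 0x1F650) // emoji, outside the BMP
        }
        sb.appendCodePoint(codePoint)
    }
    return sb.toString()
}

// Return the first input for which decode(encode(x)) != x, or null if none is found.
fun findRoundTripFailure(
    encode: (String) -> List<Int>,
    decode: (List<Int>) -> String,
    iterations: Int = 1_000,
    seed: Int = 42,
): String? {
    val rng = Random(seed)
    repeat(iterations) {
        val text = randomText(rng)
        if (decode(encode(text)) != text) return text
    }
    return null
}

fun main() {
    // Stand-in tokenizer for the demo: raw UTF-8 bytes as ids.
    val encode = { s: String -> s.encodeToByteArray().map { it.toInt() and 0xFF } }
    val decode = { ids: List<Int> -> ids.map { it.toByte() }.toByteArray().decodeToString() }
    println(findRoundTripFailure(encode, decode)) // null
}
```

A fixed seed keeps any failure reproducible; passing a real tokenizer's encode and decode functions in place of the stand-ins would exercise the same property.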

Sister project

Once Tessera is complete, the next step is a separate codebase for embeddings that will consume tessera-core as a Gradle dependency.

License

MIT — see LICENSE.
