Fully client-side text autocomplete for Brazilian Portuguese: custom BPE tokenizer, small causal Transformer, PyTorch training, INT8 quantization, and in-browser inference via WebAssembly (Rust). It works well as a portfolio project because it combines NLP, model compression, front-end engineering, and low-level systems work.
Demo: run locally (steps below) or deploy to S3 + CloudFront to serve it statically.
- From scratch: your own BPE tokenizer, minimal Transformer, quantization, and WASM runtime.
- Browser-only: no server is involved at inference time, so each request costs $0.
- PT-BR first: vocabulary and corpus tailored for Brazilian Portuguese.
- Real stack: Python (training), Rust→WASM (ops), Angular (UI).
data/ # raw and cleaned data
model/ # tokenizer/model training, quantization and export
wasm/ # Rust → WebAssembly core (matmul/softmax)
web/angular/ # Angular UI + TypeScript runtime
Flow: data → tokenizer → training → quantization → export .npz → WASM → UI
- Python 3.10+ (3.11 recommended)
- PyTorch (CUDA optional for faster training)
- Rust + `wasm-pack` (`cargo install wasm-pack` or `curl https://rustwasm.github.io/wasm-pack/installer/init.sh -sSf | sh`)
- Node.js 18+ (20 recommended) and Angular CLI (`npm i -g @angular/cli`)
- Clone the repository and `cd` into it.
- Prepare data (add `.txt` files to `data/raw/`):
    python model/clean_texts.py
    python model/train_bpe.py
- Train the model (tweak steps/batch in `model/train.py` for a quick run):
    python model/train.py
- Quantize and export weights for the browser:
    python model/quantize.py
  This generates `web/angular/public/weights.npz` (INT8 + scales). `vocab.json` and `merges.txt` also live in `web/angular/public/`.
- Build the WASM core:
    cd wasm
    wasm-pack build --target web --release
    mkdir -p ../web/angular/public/wasm
    cp -r pkg/* ../web/angular/public/wasm/
- Run the Angular UI:
    cd ../web/angular
    npm install
    ng serve
  Open http://localhost:4200 and try the playground.
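The quantization step above exports INT8 weights plus scales. As a rough illustration of what that can mean, here is a minimal sketch of symmetric per-tensor INT8 quantization in NumPy. The function names and the `w_q` / `w_scale` keys are hypothetical and are not taken from `model/quantize.py`, which may use a different scheme or layout.

```python
import io

import numpy as np


def quantize_int8(w: np.ndarray) -> tuple[np.ndarray, np.float32]:
    """Symmetric per-tensor quantization: w ~= q * scale, with q in [-127, 127]."""
    max_abs = float(np.abs(w).max())
    # Guard against an all-zero tensor, where any scale works.
    scale = np.float32(max_abs / 127.0) if max_abs > 0 else np.float32(1.0)
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale


def dequantize_int8(q: np.ndarray, scale) -> np.ndarray:
    """Recover an approximate float32 tensor from INT8 values and a scale."""
    return q.astype(np.float32) * scale


# Stand-in weight matrix (random, not a trained weight).
rng = np.random.default_rng(0)
w = rng.standard_normal((384, 384)).astype(np.float32)
q, scale = quantize_int8(w)

# Round-trip through an in-memory .npz; the key names are assumptions.
buf = io.BytesIO()
np.savez(buf, w_q=q, w_scale=scale)
buf.seek(0)
loaded = np.load(buf)

w_hat = dequantize_int8(loaded["w_q"], loaded["w_scale"])
max_err = float(np.abs(w - w_hat).max())
print(f"scale={float(scale):.6f} max_abs_error={max_err:.6f}")
```

With rounding to the nearest integer, the per-element error stays within about half a scale step, which is a handy sanity check before shipping weights to the browser.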
- vocab_size: 12,000
- n_layers: 6, n_heads: 6, d_model: 384, d_ff: 1536
- seq_len: 256 for the initial training
- Approx size: 20–30M parameters
You can shrink dimensions for older phones or increase them for better quality on desktops.
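To see how the dimensions trade off against size, a back-of-the-envelope parameter count helps. The sketch below assumes a standard GPT-style layout (learned positional embeddings, biases on every linear layer, two LayerNorms per block plus a final one). That layout is an assumption, not something read from `model/transformer.py`, so the real count may differ. Under these assumptions the default configuration comes out at roughly 15.4M parameters with a tied output head and roughly 20.0M with an untied one.

```python
from dataclasses import dataclass


@dataclass
class Config:
    vocab_size: int = 12_000
    n_layers: int = 6
    n_heads: int = 6
    d_model: int = 384
    d_ff: int = 1536
    seq_len: int = 256


def count_params(cfg: Config, tied_head: bool = True) -> int:
    """Estimate parameters for an assumed GPT-style decoder (see lead-in)."""
    d = cfg.d_model
    embeddings = cfg.vocab_size * d + cfg.seq_len * d    # token + position
    attn = 4 * (d * d + d)                               # Q, K, V, output proj
    ffn = (d * cfg.d_ff + cfg.d_ff) + (cfg.d_ff * d + d)  # up + down proj
    norms = 2 * (2 * d)                                  # two LayerNorms per block
    per_layer = attn + ffn + norms
    final_norm = 2 * d
    head = 0 if tied_head else cfg.vocab_size * d        # untied output matrix
    return embeddings + cfg.n_layers * per_layer + final_norm + head


cfg = Config()
tied = count_params(cfg, tied_head=True)
untied = count_params(cfg, tied_head=False)
print(f"head_dim={cfg.d_model // cfg.n_heads}")
print(f"tied head:   {tied:,} params (~{tied / 1e6:.1f} MB at 1 byte/weight)")
print(f"untied head: {untied:,} params (~{untied / 1e6:.1f} MB at 1 byte/weight)")

# A smaller, hypothetical configuration for low-end devices.
small = Config(n_layers=4, d_model=256, d_ff=1024)
print(f"small tied:  {count_params(small):,} params")
```

The embedding table scales with `vocab_size * d_model`, so on small models trimming the vocabulary can save as much as dropping a layer.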
micro-transformer-ptbr/
  data/
    raw/                     # put your .txt files here
    clean/                   # produced by the cleaning step
    tokenizer/
      vocab.json             # produced by BPE training
      merges.txt             # produced by BPE training
  model/
    clean_texts.py           # basic cleaning/normalization
    tokenizer_bpe.py         # custom BPE (train/encode/decode)
    train_bpe.py             # BPE training script
    transformer.py           # minimal TinyGPT (PyTorch)
    train.py                 # training loop
    quantize.py              # INT8 + export .npz for the browser
  wasm/
    Cargo.toml
    src/lib.rs               # matmul and softmax via wasm-bindgen
    pkg/                     # generated by wasm-pack (copy to web/angular/public/wasm)
  web/
    angular/
      src/app/services/
        tokenizer.ts         # TypeScript BPE compatible with the Python one
        model-runner.service.ts
      src/app/components/playground/
        playground.component.*
      public/
        weights.npz          # quantized weights
        vocab.json
        merges.txt
        wasm/                # generated wasm artifacts
      package.json
To ensure numerical parity:
- Run a forward pass with batch=1/short seq in Python and save activations to `.npz` (e.g., `emb_out`, `attn_scores`, `ffn_out`, `logits`).
- Replicate the same input in the browser and compare `|a−b| < 1e-3`.
- Common pitfalls: matrix order (row/col-major), head reshapes, quantization scales.
Tip: validate just 1 block first, then stack all blocks.
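The comparison described above could be scripted along these lines. This is a sketch that assumes the browser-side activations have been dumped and converted into an `.npz` with the same keys as the Python reference. That export step and the helper name `compare_activations` are hypothetical; the demo uses in-memory archives with made-up tensors so it runs on its own.

```python
import io

import numpy as np

TOL = 1e-3  # tolerance from the parity checklist


def compare_activations(ref, test) -> dict[str, float]:
    """Return {tensor name: max |a - b|} for every tensor in the reference archive."""
    report = {}
    for name in ref.files:
        if name not in test.files:
            raise KeyError(f"missing tensor in test archive: {name}")
        a, b = ref[name], test[name]
        if a.shape != b.shape:
            # Shape mismatches usually point at head reshapes or transposes.
            raise ValueError(f"{name}: shape {a.shape} != {b.shape}")
        diff = np.abs(a.astype(np.float64) - b.astype(np.float64))
        report[name] = float(diff.max())
    return report


def as_npz(arrays: dict[str, np.ndarray]):
    """Build an in-memory .npz so the demo needs no files on disk."""
    buf = io.BytesIO()
    np.savez(buf, **arrays)
    buf.seek(0)
    return np.load(buf)


# Made-up tensors standing in for real activations.
rng = np.random.default_rng(0)
python_side = {
    "emb_out": rng.standard_normal((4, 384)).astype(np.float32),
    "logits": rng.standard_normal((4, 12_000)).astype(np.float32),
}
browser_side = {
    k: v + rng.uniform(-1e-4, 1e-4, v.shape).astype(np.float32)
    for k, v in python_side.items()
}

report = compare_activations(as_npz(python_side), as_npz(browser_side))
failing = [name for name, err in report.items() if err >= TOL]
for name, err in report.items():
    print(f"{name:10s} max|a-b| = {err:.2e}")
print("failing:", failing)
```

Reporting the error per tensor shows which stage first diverges, which tends to narrow a mismatch down to one operation.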
- Perplexity on a PT-BR validation set.
- Latency (ms/token) on desktop vs. mobile.
- Artifact sizes: `.wasm`, `weights.npz`, `vocab.json`.
- Throughput (tokens/s) across devices.
Add a small table in your fork’s README with real results.
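For reference, perplexity is the exponential of the mean negative log-likelihood per token, and throughput is the inverse of per-token latency. A minimal sketch, assuming natural-log token probabilities have already been collected from the model; the inputs below are illustrative placeholders, not measurements.

```python
import math


def perplexity(token_log_probs: list[float]) -> float:
    """exp(mean negative log-likelihood); expects natural-log probabilities."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


def tokens_per_second(ms_per_token: float) -> float:
    """Convert per-token latency in milliseconds to throughput."""
    return 1000.0 / ms_per_token


# Sanity check: a model that is uniform over the vocabulary has
# perplexity equal to the vocabulary size.
vocab_size = 12_000
uniform_log_probs = [math.log(1.0 / vocab_size)] * 50
ppl_uniform = perplexity(uniform_log_probs)
print(f"uniform baseline perplexity: {ppl_uniform:.1f}")

# Placeholder latency, only to show the conversion.
print(f"25 ms/token -> {tokens_per_second(25.0):.0f} tokens/s")
```

The uniform baseline is a useful upper reference: a trained model should land well below the vocabulary size on the validation set.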
- Build the UI: `ng build --configuration production`
- Upload `dist/` and `web/angular/public/*` to a static S3 bucket.
- Publish via CloudFront with OAC and short TTL for `weights.npz`.
- Serve `.wasm` with `Content-Type: application/wasm` and enable `gzip`/`brotli` on assets.
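If the upload is scripted, the per-file metadata from the steps above can be centralized in a small helper. This sketch only computes the headers. The five-minute TTL, the helper name `headers_for`, and the example object keys are assumptions. The returned keys follow the `ContentType` / `CacheControl` naming that S3 upload tools such as boto3 accept as extra arguments.

```python
import mimetypes
from pathlib import PurePosixPath

# Explicit types for the artifacts this project ships; anything else
# falls back to the standard library's guess.
EXPLICIT_TYPES = {
    ".wasm": "application/wasm",
    ".npz": "application/octet-stream",
    ".json": "application/json",
    ".txt": "text/plain; charset=utf-8",
}

# Assumption: "short TTL" is taken to mean five minutes.
SHORT_TTL = "public, max-age=300"


def headers_for(key: str) -> dict[str, str]:
    """Return upload metadata for one object key."""
    path = PurePosixPath(key)
    content_type = (
        EXPLICIT_TYPES.get(path.suffix.lower())
        or mimetypes.guess_type(key)[0]
        or "application/octet-stream"
    )
    headers = {"ContentType": content_type}
    if path.name == "weights.npz":
        # Weights change with every retrain, so keep them fresh at the edge.
        headers["CacheControl"] = SHORT_TTL
    return headers


# Example object keys; the .wasm file name is hypothetical.
for key in ["wasm/core_bg.wasm", "weights.npz", "vocab.json", "merges.txt"]:
    print(key, headers_for(key))
```

Setting the type explicitly for `.wasm` avoids depending on the uploader's defaults, since browsers refuse streaming compilation when the MIME type is wrong.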
- WASM won’t load → check `Content-Type: application/wasm` and the `/public/wasm/*` path.
- “Stuck-together” text → run an `encode → decode` roundtrip test in BPE before training; adjust `</w>` handling.
- Slow on mobile → reduce `n_layers`/`d_model`/`vocab_size`, use smaller `topK` (e.g., 20), and limit `maxNewTokens`.
- Logit mismatch → verify INT8 scales and Q/K/V reshapes across heads.
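The "stuck-together" symptom usually comes from losing the end-of-word marker somewhere between encode and decode. The toy sketch below isolates just that marker handling. It applies no BPE merges and is not the code in `model/tokenizer_bpe.py`; all function names are illustrative.

```python
END = "</w>"  # end-of-word marker


def to_symbols(text: str) -> list[str]:
    """Split into characters and tag the last character of each word with </w>."""
    symbols = []
    for word in text.split():
        chars = list(word)
        chars[-1] += END
        symbols.extend(chars)
    return symbols


def decode(symbols: list[str]) -> str:
    """Correct decode: each </w> becomes a space between words."""
    return "".join(symbols).replace(END, " ").strip()


def decode_dropping_marker(symbols: list[str]) -> str:
    """Buggy decode: strips </w> without restoring the space."""
    return "".join(s.replace(END, "") for s in symbols)


sample = "não sei o que dizer"
symbols = to_symbols(sample)

roundtrip = decode(symbols)
stuck = decode_dropping_marker(symbols)
print(repr(roundtrip))
print(repr(stuck))
```

Note that the round trip only holds for whitespace-normalized input, since `split()` collapses runs of spaces. Running the same check against both the Python and TypeScript tokenizers on a few accented sentences is a cheap test to add before training.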
- Implement full parity in `model-runner.service.ts` (LN, attn, FFN, head)
- KV cache for token-by-token generation
- Per-channel quantization and SIMD
- WebGPU port and benchmark vs WASM
- Larger model with pruning/knowledge distillation
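To make the KV cache item concrete, here is a minimal single-head sketch in NumPy. It is an illustration of the general technique, not the planned implementation. It checks that attending one token at a time with cached keys and values reproduces full causal attention, and all names and sizes are illustrative.

```python
import numpy as np


def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


def full_causal_attention(x, wq, wk, wv):
    """Reference: recompute attention over the whole sequence with a causal mask."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = (q @ k.T) / np.sqrt(q.shape[-1])
    future = np.triu(np.ones(scores.shape, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)  # hide future positions
    return softmax(scores) @ v


class KVCache:
    """Stores the key and value vectors of every token seen so far."""

    def __init__(self):
        self.keys = []
        self.values = []

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)


def attend_one_token(x_t, cache, wq, wk, wv):
    """Project only the new token, then attend over the cached keys and values."""
    q = x_t @ wq
    cache.append(x_t @ wk, x_t @ wv)
    k = np.stack(cache.keys)    # (t + 1, d)
    v = np.stack(cache.values)  # (t + 1, d)
    scores = (k @ q) / np.sqrt(q.shape[-1])
    return softmax(scores) @ v


rng = np.random.default_rng(0)
seq_len, d = 8, 16
x = rng.standard_normal((seq_len, d))
wq, wk, wv = (rng.standard_normal((d, d)) for _ in range(3))

reference = full_causal_attention(x, wq, wk, wv)
cache = KVCache()
incremental = np.stack(
    [attend_one_token(x[t], cache, wq, wk, wv) for t in range(seq_len)]
)
print("max difference:", float(np.abs(reference - incremental).max()))
```

Each generation step then costs one row of projections plus a dot product against the cache, instead of recomputing the full sequence, at the price of memory that grows with the context length.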
Implementation and engineering by Joseph Alexanndry. Conceptual inspirations: the original Transformer paper, client-side quantization work, and the Rust/WASM community.