This is a from-scratch implementation of an LLM, written to gain experience with current optimization tricks and hyperparameter tuning. Unfortunately, I only own 2x RTX 3090 GPUs (NVLink-coupled) and not an entire datacenter, but I might actually spend money on deploying my models on rented hardware once I hit the most recent state-of-the-art wall. I'm using Stanford's CS336 (Language Modeling from Scratch) lectures and unit-test infrastructure as a guideline, but I will also add a diffusion transformer (not covered by this lecture) later, either here or in a separate repository.
The code has been written entirely by hand, using the course material (which is very sparse when it comes to implementing the features), pencil and paper to work out matrix shapes and reshuffling, and papers, read at least partially (e.g. RoPE and AdamW), whenever something was unclear in the lecture.
I did use Claude Opus to ask about best practices in logging, to debug my config generation code, and, once I had finished the full training loop, to search for non-obvious bugs (yes, they existed).
Current features:
- vanilla dense all-to-all transformer
- RoPE positional embeddings
- Multiprocessing BPE tokenizer (10k vocab trained on TinyStoriesV2-GPT4)
- Numerically stable softmax + cross-entropy loss function
- AdamW Optimizer + Weight Decay
- SGD Optimizer
- SwiGLU Linear Layer
- Cosine learning rate schedule
- Gradient Clipping
- Logging / TensorBoard
- Checkpointing
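
Two of the features listed above, the numerically stable softmax + cross-entropy loss and the cosine learning rate schedule, can be sketched in a few lines of plain Python. This is only a minimal illustration and not the code from this repository: the function names `cross_entropy` and `cosine_lr`, their parameters, and the linear warmup phase are assumptions made for the example.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy of one example, computed directly from raw logits.

    Hypothetical helper: uses the log-sum-exp trick instead of computing
    softmax probabilities first and taking their log afterwards.
    """
    m = max(logits)
    # Subtracting the max keeps every exp() argument <= 0, so exp() cannot overflow.
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    # -log(softmax(logits)[target]) == log_sum_exp - logits[target]
    return log_sum_exp - logits[target]


def cosine_lr(step, max_lr, min_lr, warmup_steps, cosine_steps):
    """Learning rate at `step` (assumed shape: linear warmup, then cosine decay)."""
    if step < warmup_steps:
        # Linear warmup from 0 up to max_lr.
        return max_lr * step / warmup_steps
    if step < cosine_steps:
        # Cosine decay from max_lr down to min_lr.
        progress = (step - warmup_steps) / (cosine_steps - warmup_steps)
        return min_lr + 0.5 * (1.0 + math.cos(math.pi * progress)) * (max_lr - min_lr)
    # After the cosine cycle the learning rate stays at min_lr.
    return min_lr


if __name__ == "__main__":
    # A naive softmax would overflow on exp(1000.0); the shifted version does not.
    print(cross_entropy([1000.0, 0.0], target=0))
    print([round(cosine_lr(s, 1.0, 0.1, 10, 100), 3) for s in (0, 5, 10, 55, 100)])
```

The loss works on logits rather than probabilities so that very large or very small logits never pass through an unshifted `exp()`, which is where the naive formulation breaks.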
Needs refinement:
- Logging / TensorBoard
Not implemented yet:
- Sharding (well, 'applying' it, not really implementing it from scratch)
- (Gated) Linear attention (=> Mamba2)
- Mixture of Experts
- KV Cache for inference
- Fine-tuning training
- Train tasks via reinforcement learning (some RL algorithms have already been implemented in my other repos, so I might copy them over, let's see)
Might land in another repo:
- Diffusion Transformer