uahic/LLMFromScratch

General

This is a from-scratch implementation of an LLM, intended as a way to gain experience with current optimization tricks and hyperparameter tweaking. Unfortunately, I only own 2x RTX 3090 GPUs (coupled via NVLink), not an entire datacenter, but I might actually spend money deploying my models on rented hardware once I hit the most recent state-of-the-art wall. I'm using Stanford's CS336 (Language Modeling from Scratch) lectures and unit-test infrastructure as a guideline, but I will also add a diffusion transformer (not covered by the course) later, either to this or to a separate repository.

Transparency: AI Usage in this repository

The code has been written fully by hand, using the course material (which is very sparse when it comes to implementing the features), pencil and paper to work out matrix shapes and reshuffling, and at least partial readings of papers (e.g. RoPE and AdamW) whenever something was unclear in the lecture.

I did use Claude Opus to ask for best practices in logging, to debug my config generation code, and, once I had finished the full training loop, to search for non-obvious bugs (yes, they existed).

Features

Current features:

  • vanilla dense all-to-all transformer
  • RoPE positional embeddings
  • Multiprocessing BPE tokenizer (10k vocab, trained on TinyStoriesV2 (GPT4))
  • Numerically stable softmax + cross-entropy loss function
  • AdamW Optimizer + Weight Decay
  • SGD Optimizer
  • SwiGLU Linear Layer
  • Cosine learning rate schedule
  • Gradient Clipping
  • Logging / Tensorboard
  • Checkpointing
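
Two of the items above, the numerically stable softmax + cross-entropy loss and the cosine learning rate schedule, are small enough to sketch in plain Python. This is only an illustration of the general techniques, not the code in this repository: the function names `cross_entropy` and `cosine_lr`, their parameters, and the linear warmup phase are all assumptions made for the example.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy of a single example, computed from raw logits.

    Hypothetical helper: uses the log-sum-exp trick instead of
    computing softmax probabilities and taking their log.
    """
    m = max(logits)
    # Subtracting the max keeps every exponent <= 0, so exp() cannot overflow.
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    # -log(softmax(logits)[target]) == logsumexp(logits) - logits[target]
    return log_sum_exp - logits[target]


def cosine_lr(step, max_lr, min_lr, warmup_steps, total_steps):
    """Cosine decay from max_lr to min_lr.

    The linear warmup is an assumption of this sketch; the feature list
    only mentions a cosine schedule.
    """
    if step < warmup_steps:
        # Linear ramp from 0 up to max_lr.
        return max_lr * step / warmup_steps
    if step >= total_steps:
        # After the schedule ends, stay at the floor.
        return min_lr
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (1.0 + math.cos(math.pi * progress)) * (max_lr - min_lr)
```

A naive softmax would call `math.exp(1000.0)` for the logits `[1000.0, 0.0]` and raise an `OverflowError`, whereas `cross_entropy([1000.0, 0.0], 0)` only ever exponentiates non-positive numbers. In a real training loop the same idea is applied per row of a batched logits tensor.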

Needs refinement:

  • Logging / Tensorboard

Not implemented yet:

  • Sharding (well, 'applying' it, not really implementing it from scratch)
  • (Gated) Linear attention (=> Mamba2)
  • Mixture of Experts
  • KV Cache for inference
  • Fine-tuning training
  • Training on tasks via reinforcement learning (some RL algorithms have already been implemented in my other repos and I might copy them over, let's see)

Might land in another repo:

  • Diffusion Transformer

About

From Scratch Implementation of various LLM techniques
