A complete Gymnasium environment, Python physics engine, and Stable Baselines 3 training pipeline designed to learn how to play Slither.io from scratch using visual CNN inputs and Recurrent PPO (LSTM) memory.
If the embedded player does not render in your GitHub view, open the video directly: videos/eval_t000291552_ep0000000.mp4
- Recurrent PPO (LSTM): Core architecture uses an LSTM hidden state (2048-step BPTT) to learn multi-step hunting strategies, coiling, and momentum.
- Reward Shaping with Loot Bonus: Reward balances food growth, boost mass cost, kill reward, and a temporary post-kill food multiplier to encourage collecting spoils after fights.
- High-Resolution Vision: 168×168 ego-centric 5-channel CNN observation map with `VIEW_RADIUS=500`, ensuring even single food pellets are visible to the agent (≥1px per orb).
- 4-Layer CNN Extractor: Conv2d pipeline (32→64→64→128 channels) with stride-2 downsampling, purpose-built for the 168×168 input resolution.
- Smart Scripted Bots: 9 distinct algorithmic bot personalities (Bullies, Hunters, Foragers, Interceptors, Scavengers, etc.) that use spatial hashing to dodge bodies and fight intelligently.
- Debug Vision Mode: Press keys 1–6 during gameplay to cycle through the agent's raw CNN observation channels fullscreen for visual debugging.
- Headless Bot Tournament: The `--simulate` flag runs a 60-second fast-forward match with 2 of each bot type and prints a ranked leaderboard.
- Debug Training HUD: Live PyGame overlay showing observation channel previews, steering magnitude, boost status, and episode stats.
- Automatic Video Recording: Periodically records a deterministic evaluation episode and pushes the video directly to TensorBoard.
- Optional Local Video Dumps: `--save-videos-local` writes eval videos to `logs/eval_videos/` for headless server debugging.
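
The reward shaping listed above (food growth, boost mass cost, kill reward, and a temporary post-kill food multiplier) could be combined roughly as follows. This is a minimal sketch, not the repository's reward function: `RewardConfig`, `step_reward`, and every coefficient are hypothetical placeholders chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class RewardConfig:
    # All values are illustrative assumptions, not the project's tuned settings.
    food_reward: float = 1.0       # reward per unit of mass gained from food
    boost_cost: float = 0.5        # penalty per unit of mass burned by boosting
    kill_reward: float = 10.0      # flat bonus per kill
    loot_multiplier: float = 2.0   # food multiplier while the loot window is open
    loot_window_steps: int = 120   # how long the post-kill bonus lasts


def step_reward(cfg, food_mass, boost_mass, kills, loot_steps_left):
    """Return (reward, updated loot_steps_left) for one environment step."""
    if kills > 0:
        # A kill (re)opens the loot window so the spoils are worth collecting.
        loot_steps_left = cfg.loot_window_steps
    multiplier = cfg.loot_multiplier if loot_steps_left > 0 else 1.0
    reward = (
        cfg.food_reward * multiplier * food_mass
        - cfg.boost_cost * boost_mass
        + cfg.kill_reward * kills
    )
    return reward, max(loot_steps_left - 1, 0)
```

Keeping the loot window as an explicit countdown makes the bonus easy to log separately, which would fit a metric such as `loot_bonus_reward`.
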
pip install -r requirements.txt

python3 train.py

| Flag | Default | Description |
|---|---|---|
| `--stage <1\|2\|3>` | `3` | Curriculum stage (see below) |
| `--resume <path>` | none | Resume from "latest" or a specific checkpoint path |
| `--timesteps <N>` | `50,000,000` | Total training steps before auto-stop |
| `--num-envs <N>` | `4` | Parallel environment subprocesses |
| `--render` | off | Opens a live Pygame training HUD window (single env) |
| `--record-every <N>` | `100,000` | Record an eval episode to TensorBoard every N timesteps (0=disabled) |
| `--save-videos-local` | off | Also save eval videos to `logs/eval_videos/` |
| `--no-lstm` | off | Fall back to standard feedforward PPO instead of RecurrentPPO |

| Stage | Command | Arena | Goal |
|---|---|---|---|
| 1 | `python3 train.py --stage 1` | Agent + Food only | Learn movement, eating, wall avoidance |
| 2 | `python3 train.py --stage 2` | Agent + 10 scripted bots | Learn combat, maneuvering, hunting |
| 3 | `python3 train.py --stage 3` | Agent + 10 scripted bots + 6 self-play clones | Develop robust, generalized strategies |

python3 train.py --resume latest --stage 2

`--resume latest` picks `policy_final.zip` first (saved on Ctrl+C), then falls back to the highest-numbered checkpoint. It auto-detects whether the checkpoint is LSTM or standard PPO. Incompatible checkpoints (e.g. trained with old 84×84 observations) are rejected with an informative error.
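
The lookup order for `latest` can be sketched as below. This is an assumption-laden illustration rather than the code in `train.py`: the function name `resolve_latest` and the numbered-checkpoint naming pattern (a step count at the end of the file stem, optionally followed by `_steps`) are hypothetical.

```python
import re
from pathlib import Path


def resolve_latest(checkpoint_dir):
    """Pick the checkpoint that `--resume latest` would plausibly load."""
    root = Path(checkpoint_dir)
    final = root / "policy_final.zip"
    if final.exists():
        return final  # saved on Ctrl+C, so it takes priority

    # Hypothetical naming scheme: any .zip whose stem ends in a step count,
    # e.g. policy_500000_steps.zip or policy_500000.zip.
    numbered = []
    for path in root.glob("*.zip"):
        match = re.search(r"(\d+)(?:_steps)?$", path.stem)
        if match:
            numbered.append((int(match.group(1)), path))
    if not numbered:
        raise FileNotFoundError(f"no checkpoints found in {root}")
    # Compare step counts numerically so 20000 beats 9000.
    return max(numbered, key=lambda item: item[0])[1]
```

Parsing the step count as an integer matters here: a plain string sort would rank `policy_9000` above `policy_20000`.
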
Use this when you want to initialize RL from a scripted bot policy.
- Collect expert data from a bot:
python3 collect_expert_data.py \
--bot-type interceptor \
--frames 100000 \
--output-dir datasets/expert_interceptor \
--overwrite
- Supervised pretraining (MSE on continuous actions):
python3 pretrain_bc.py \
--dataset datasets/expert_interceptor \
--epochs 10 \
--batch-size 256 \
--output checkpoints/policy_interceptor_bc_init
- Resume RL from the BC checkpoint:
python3 train.py --resume checkpoints/policy_interceptor_bc_init.zip --stage 2

Notes:
- Collection and training use the same `168x168` map observation.
- The collector writes chunked compressed `.npz` files + `metadata.json` to handle large datasets.
- You can switch the expert style with `--bot-type` (`interceptor`, `hunter`, `scavenger`, etc.).
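
To illustrate the chunked `.npz` plus `metadata.json` layout and the MSE objective mentioned above, here is a small NumPy sketch. It is not the format used by `collect_expert_data.py`: the array keys (`obs`, `actions`), the chunk file names, and the metadata fields are all hypothetical.

```python
import json
from pathlib import Path

import numpy as np


def write_chunks(output_dir, obs, actions, chunk_size):
    """Split (obs, actions) into compressed .npz chunks plus a metadata.json index."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    chunk_files = []
    for start in range(0, len(obs), chunk_size):
        name = f"chunk_{len(chunk_files):05d}.npz"  # hypothetical naming
        np.savez_compressed(
            str(out / name),
            obs=obs[start:start + chunk_size],
            actions=actions[start:start + chunk_size],
        )
        chunk_files.append(name)
    metadata = {
        "num_frames": int(len(obs)),
        "obs_shape": list(obs.shape[1:]),
        "chunks": chunk_files,
    }
    (out / "metadata.json").write_text(json.dumps(metadata, indent=2))
    return metadata


def mse_loss(predicted, target):
    """Mean squared error over continuous (steering, boost) actions."""
    return float(np.mean((predicted - target) ** 2))
```

Writing fixed-size chunks keeps peak memory bounded during collection, and the metadata index lets a loader stream one chunk at a time.
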
tensorboard --logdir logs/

- Custom Metrics (`slither/`): `reward_mean`, `episode_length`, `peak_mass_mean`, `kills_mean`, `food_eaten_mean`, `mass_per_frame`, `death_wall_pct`, `death_collision_pct`, `survival_rate`, `boost_pct`, `loot_bonus_reward`.
- Gameplay Videos: Check the Images tab in TensorBoard to watch the periodic `--record-every` eval episodes.
Play against your trained models with your mouse. The script auto-detects if your checkpoints are LSTM or feedforward and manages hidden states automatically.
python3 test_rl.py

| Flag | Default | Description |
|---|---|---|
| `--rl <N>` | `6` | Number of golden RL agents from checkpoints |
| `--bots <N>` | `9` | Number of algorithmic scripted bots |
| `--bot-type <name>` | mixed | Force all bots to a specific personality (e.g. `scavenger`, `hunter`) |
| `--simulate` | off | Run a 60s headless bot tournament (no UI, no player) |
| `--debug-vision` | off | Enable CNN channel debug overlay |

Play against 20 scavengers:
python3 test_rl.py --bots 20 --rl 0 --bot-type scavenger

Pure RL arena (no scripted bots):
python3 test_rl.py --rl 5 --bots 0

Headless bot tournament:
python3 test_rl.py --simulate --rl 0

Spawns 2 of each bot type, simulates 3600 frames (60 seconds at 60 FPS), and prints a ranked leaderboard by peak mass and kills.
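
A leaderboard of this kind could be produced along the following lines. This is only a sketch: `BotResult`, `rank`, and `format_leaderboard` are hypothetical names, and using kills as the tie-breaker after peak mass is an assumption about the ordering.

```python
from dataclasses import dataclass


@dataclass
class BotResult:
    name: str
    peak_mass: float
    kills: int


def rank(results):
    """Sort by peak mass, then kills, both descending (assumed ordering)."""
    return sorted(results, key=lambda r: (r.peak_mass, r.kills), reverse=True)


def format_leaderboard(results):
    """Render the ranked results as a fixed-width text table."""
    lines = [f"{'#':>2}  {'Bot':<12} {'Peak mass':>10} {'Kills':>6}"]
    for place, r in enumerate(rank(results), start=1):
        lines.append(f"{place:>2}  {r.name:<12} {r.peak_mass:>10.1f} {r.kills:>6d}")
    return "\n".join(lines)
```
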
| Key | Action |
|---|---|
| Mouse | Steer |
| Left Click | Boost |
| 1 | Normal view (default) |
| 2–6 | Fullscreen CNN channels: Self, Enemy, Food, Boundary, Velocity |
| Escape | Quit |
The agent perceives the world through a 168×168 ego-centric rotated mini-map with 5 channels:

| Channel | Color (debug) | Contents |
|---|---|---|
| 0: Self | Cyan | Agent's own head and body segments (fading toward tail) |
| 1: Enemies | Red | All enemy snake bodies (≥30% brightness) with bright heads (100%) |
| 2: Food | Green | Food pellets (small orbs ≥1px, death drops 2–3px) |
| 3: Boundary | Yellow | World edge proximity gradient (fades in within 300 units) |
| 4: Velocity | Purple | Enemy heading streaks (lines showing movement direction and speed) |

Plus an 8-float proprioception vector: mass, turn rate, speed, boost state, can-boost, wall distance, body length, kill count.
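
The geometry behind the ego-centric map can be sketched as follows. With `VIEW_RADIUS=500` and a 168-pixel side, one world unit covers 0.168 pixels. Everything else here is an assumption for illustration, not the environment's renderer: the function names, the y-up coordinate convention, forward pointing toward the top of the image, and the linear shape of the boundary ramp.

```python
import math

MAP_SIZE = 168       # pixels per side, from the README
VIEW_RADIUS = 500.0  # world units from the agent to the map edge, from the README
PIXELS_PER_UNIT = MAP_SIZE / (2 * VIEW_RADIUS)  # 0.168 px per world unit


def world_to_pixel(agent_xy, agent_heading, point_xy):
    """Map a world point to (row, col) in the agent's rotated mini-map.

    Assumes heading is in radians with 0 along +x in a y-up world, and that
    the agent's forward direction points toward the top of the image.
    Returns None if the point falls outside the map.
    """
    dx = point_xy[0] - agent_xy[0]
    dy = point_xy[1] - agent_xy[1]
    # Rotate into the agent frame: forward is along the heading,
    # right is 90 degrees clockwise from it.
    forward = dx * math.cos(agent_heading) + dy * math.sin(agent_heading)
    right = dx * math.sin(agent_heading) - dy * math.cos(agent_heading)
    col = math.floor(MAP_SIZE / 2 + right * PIXELS_PER_UNIT)
    row = math.floor(MAP_SIZE / 2 - forward * PIXELS_PER_UNIT)
    if 0 <= row < MAP_SIZE and 0 <= col < MAP_SIZE:
        return row, col
    return None


def boundary_intensity(distance_to_wall, fade_distance=300.0):
    """Linear ramp (assumed shape): 0 at 300+ units from the wall, 1 at the wall."""
    return max(0.0, min(1.0, 1.0 - distance_to_wall / fade_distance))
```

Because the map rotates with the agent, "ahead" is always the same image direction, so the CNN does not have to learn a separate response for every heading.
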
Continuous Box[-1,1] × Box[0,1]:
- Steering `[-1, 1]`: Relative turn (left/right) from current heading
- Boost `[0, 1]`: >0.5 activates boost
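
One way such an action could be decoded inside a physics step is shown below. This is a sketch under stated assumptions: `MAX_TURN_RATE`, the sign convention for steering, and gating boost on a `can_boost` flag are hypothetical and not taken from the environment code.

```python
import math

MAX_TURN_RATE = 0.1  # radians per step; hypothetical value


def apply_action(heading, can_boost, action):
    """Turn a (steering, boost) action into a new heading and a boost flag."""
    steering = max(-1.0, min(1.0, action[0]))     # clip to Box[-1, 1]
    boost_signal = max(0.0, min(1.0, action[1]))  # clip to Box[0, 1]
    # Steering is relative: it nudges the current heading instead of setting it.
    new_heading = (heading + steering * MAX_TURN_RATE) % (2 * math.pi)
    # Boost is a thresholded continuous output, gated by available mass.
    boosting = can_boost and boost_signal > 0.5
    return new_heading, boosting
```
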
python3 export_onnx.py # produces slither_policy.onnx
python3 -m http.server 8080 # host it locally

Note: The export script may need updating to support ONNX LSTM dynamic axes, depending on the implementation.
- Install `slither_ai.user.js` into Tampermonkey.
- Open slither.io and spawn in.
- Press Q to toggle AI control on/off.
- Observation CNN: 5-channel 168×168 ego-centric mini-map (Self, Enemies, Food, Boundary, Velocity Streaks) → processed by four Conv2d layers (32→64→64→128 channels, 3×3 kernels, stride 2).
- Proprioception MLP: 8-float vector (mass, turn rate, speed, boost state, wall distance, etc.).
- Memory (LSTM): 256-dim hidden state, `n_steps=2048`, tracking short-term maneuvers and persistent threats.
- Policy: RecurrentPPO (sb3-contrib) → 2 continuous outputs (steering [-1, 1], boost [0, 1]).
- Hardware: Auto-detects Apple Silicon (MPS), NVIDIA (CUDA), or CPU.
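
The spatial size of the CNN features follows from the standard convolution output formula. The sketch below applies it to four 3×3, stride-2 layers on a 168×168 input; the padding value is an assumption, since the README does not state it, and the helper names are hypothetical.

```python
def conv_out(size, kernel=3, stride=2, padding=1):
    """Spatial size after one Conv2d layer (standard floor formula)."""
    return (size + 2 * padding - kernel) // stride + 1


def feature_shape(input_size=168, channels=(32, 64, 64, 128), padding=1):
    """Return (channels, height, width) after the whole conv stack."""
    size = input_size
    for _ in channels:
        size = conv_out(size, padding=padding)
    return channels[-1], size, size
```

With `padding=1` the side length goes 168, 84, 42, 21, 11, giving a 128×11×11 feature map; with `padding=0` it ends at 128×9×9. Either way, the flattened result would be what feeds the layers ahead of the LSTM.
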
| Bot | Strategy | Color |
|---|---|---|
| Random | Wanders, grabs nearby food | Gray |
| Forager | Efficient food collector, flees threats | Green |
| Bully | Cuts off paths with lead targeting | Red |
| Scavenger | Hunts death drops, gravitates toward fights | Orange |
| Patrol | Sweeps world edges, eats along the way | Blue |
| Parasite | Shadows biggest snake's tail | Magenta |
| Trapper | Circles smaller snakes when big enough | Dark Red |
| Interceptor | Adaptive lookahead, aborts if body-blocked | Cyan |
| Hunter | Pursues snakes ≤½ its size with heading prediction | Yellow |
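
The scripted bots are described as using spatial hashing to dodge bodies. A minimal uniform-grid version of that idea is sketched here; the class name, the cell size, and the stored `owner` tag are hypothetical and not taken from the project's engine.

```python
from collections import defaultdict


class SpatialHash:
    """Uniform grid that buckets points by cell for fast neighbor queries."""

    def __init__(self, cell_size):
        self.cell_size = cell_size
        self.cells = defaultdict(list)

    def _key(self, x, y):
        # Floor division keeps negative coordinates in the correct cell.
        return int(x // self.cell_size), int(y // self.cell_size)

    def insert(self, x, y, owner):
        self.cells[self._key(x, y)].append((x, y, owner))

    def query(self, x, y, radius):
        """Return owners of all points within `radius` of (x, y)."""
        min_cx, min_cy = self._key(x - radius, y - radius)
        max_cx, max_cy = self._key(x + radius, y + radius)
        hits = []
        for cx in range(min_cx, max_cx + 1):
            for cy in range(min_cy, max_cy + 1):
                # .get avoids creating empty buckets for cells never inserted into.
                for px, py, owner in self.cells.get((cx, cy), ()):
                    if (px - x) ** 2 + (py - y) ** 2 <= radius ** 2:
                        hits.append(owner)
        return hits
```

A bot could insert every body segment once per frame and then query only the cells around its projected head position, instead of testing against every segment in the arena.
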