Learn GPU programming from scratch by building neural networks. Two learning paths: Triton (Python, recommended) and CUDA C++.
- ML engineers who want to understand what happens under the hood
- Anyone curious about GPU programming
- People who want to write custom kernels for performance
No prior GPU experience required!
Python-based GPU programming. Modern, used in production (Flash Attention, vLLM).
triton/
├── 01-basics/ # First kernels, vector ops
├── 02-matrix-ops/ # Matrix multiplication (the core of ML)
├── 03-nn-components/ # Activations, linear layers, softmax
└── 04-mnist-classifier/ # Full training example
The traditional approach. Lower-level, maximum control.
cuda/
├── 01-basics/ # Thread hierarchy, memory management
├── 02-matrix-ops/ # Tiled matmul with shared memory
├── 03-nn-components/ # NN building blocks in CUDA
└── 04-mnist-classifier/ # Complete classifier
Start here if you're new to GPU programming!
00-prerequisites/
└── 00_gpu_fundamentals.ipynb # CPU vs GPU, memory hierarchy, why GPUs for ML
# Clone the repo
git clone <repo-url>
cd gpu-ml-learning
# Install dependencies
uv sync
# Start with prerequisites (if new to GPUs)
jupyter notebook 00-prerequisites/00_gpu_fundamentals.ipynb
# Then start Triton path
jupyter notebook triton/01-basics/01_basics.ipynb
# Verify CUDA installation
nvcc --version
nvidia-smi
# Start with prerequisites
jupyter notebook 00-prerequisites/00_gpu_fundamentals.ipynb
# Then start CUDA path
jupyter notebook cuda/01-basics/01_basics.ipynb
- CPU vs GPU architecture
- Why GPUs for machine learning
- Memory hierarchy (the real bottleneck!)
- What is a "kernel"?
- Basic PyTorch GPU operations
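A quick taste of the PyTorch side covered in the prerequisites (a minimal sketch; the notebook's own examples may differ):

```python
import torch

x = torch.randn(1024, 1024)   # lives in CPU memory
x_gpu = x.to("cuda")          # copy across the PCIe bus into GPU global memory (slow!)
y_gpu = x_gpu @ x_gpu         # the matmul runs on the GPU
torch.cuda.synchronize()      # GPU kernels launch asynchronously; wait before timing
y = y_gpu.cpu()               # copy the result back to CPU memory
```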
- Your first GPU kernel
- Vector addition (hello world of parallel computing)
- Thread/block organization
- Memory management
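For orientation, here is roughly what the first kernel looks like, following the standard Triton vector-add pattern (a sketch; names like `add_kernel` are illustrative, not necessarily the notebook's):

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)                       # each program handles one block of elements
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                       # guard against out-of-bounds accesses
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x, y):
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)                    # one program per 1024-element block
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out

x = torch.rand(10_000, device="cuda")
y = torch.rand(10_000, device="cuda")
assert torch.allclose(add(x, y), x + y)
```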
- Why matmul is everything in ML
- Naive vs tiled matrix multiplication
- Shared memory optimization
- Benchmarking against cuBLAS
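Benchmarking the cuBLAS baseline (reached via `torch.matmul`) can be done with `triton.testing.do_bench`; a rough sketch, with the matrix size and dtype chosen arbitrarily:

```python
import torch
import triton

a = torch.randn(2048, 2048, device="cuda", dtype=torch.float16)
b = torch.randn(2048, 2048, device="cuda", dtype=torch.float16)

ms = triton.testing.do_bench(lambda: torch.matmul(a, b))   # runtime in milliseconds
tflops = 2 * 2048**3 / (ms * 1e-3) / 1e12                  # 2*M*N*K FLOPs per matmul
print(f"torch.matmul (cuBLAS): {ms:.3f} ms, {tflops:.1f} TFLOP/s")
```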
- Activation functions (ReLU, GELU, Softmax)
- Linear layers (forward pass)
- Fused operations (why custom kernels matter!)
- Cross-entropy loss
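As a flavour of this section, a row-wise softmax in the spirit of the official Triton fused-softmax tutorial (a sketch, assuming each row fits in a single block):

```python
import torch
import triton
import triton.language as tl

@triton.jit
def softmax_kernel(x_ptr, out_ptr, stride, n_cols, BLOCK_SIZE: tl.constexpr):
    row = tl.program_id(0)                            # one program per row
    cols = tl.arange(0, BLOCK_SIZE)
    mask = cols < n_cols
    x = tl.load(x_ptr + row * stride + cols, mask=mask, other=float("-inf"))
    x = x - tl.max(x, axis=0)                         # subtract row max for numerical stability
    num = tl.exp(x)
    out = num / tl.sum(num, axis=0)
    tl.store(out_ptr + row * stride + cols, out, mask=mask)

def softmax(x):
    n_rows, n_cols = x.shape
    out = torch.empty_like(x)
    BLOCK_SIZE = triton.next_power_of_2(n_cols)
    softmax_kernel[(n_rows,)](x, out, x.stride(0), n_cols, BLOCK_SIZE=BLOCK_SIZE)
    return out

x = torch.randn(128, 512, device="cuda")
assert torch.allclose(softmax(x), torch.softmax(x, dim=1), atol=1e-6)
```

Because the whole row is read once, reduced, and written once inside one kernel, this is also a first example of fusion.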
- Complete 2-layer MLP
- Training loop
- ~97% accuracy
- Comparison with PyTorch
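For reference, the plain-PyTorch equivalent of the model being built (a sketch; the hidden size of 128 and the Adam settings are illustrative, not necessarily what the notebook uses):

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 128),   # hidden size is an assumption
    nn.ReLU(),
    nn.Linear(128, 10),
).cuda()

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def train_step(images, labels):
    optimizer.zero_grad()
    loss = loss_fn(model(images.cuda()), labels.cuda())
    loss.backward()
    optimizer.step()
    return loss.item()
```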
Registers (~1 cycle) Fastest, but tiny
↓
Shared Memory (~5 cycles) Fast, shared within block
↓
Global Memory (~400 cycles) Slow, but large (your VRAM)
↓
CPU Memory (~1000+ cycles) Very slow to access from GPU
Most GPU code is memory-bound, not compute-bound!
# Separate ops (3 memory round-trips)
temp = x + y # Read x,y → Write temp
result = temp * 2 # Read temp → Write result
# Fused (1 memory round-trip)
result = (x + y) * 2 # Read x,y → Write result
This is why custom kernels can beat PyTorch - fewer memory accesses!
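The same fusion written as a single Triton kernel (a sketch): the intermediate `x + y` stays in registers, so global memory is touched only for the two reads and one write.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def fused_add_mul_kernel(x_ptr, y_ptr, out_ptr, n, BLOCK_SIZE: tl.constexpr):
    offs = tl.program_id(0) * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask)                # read x
    y = tl.load(y_ptr + offs, mask=mask)                # read y
    tl.store(out_ptr + offs, (x + y) * 2, mask=mask)    # (x + y) never hits global memory

x = torch.rand(1 << 20, device="cuda")
y = torch.rand(1 << 20, device="cuda")
out = torch.empty_like(x)
fused_add_mul_kernel[(triton.cdiv(x.numel(), 1024),)](x, y, out, x.numel(), BLOCK_SIZE=1024)
assert torch.allclose(out, (x + y) * 2)
```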
Grid (your entire computation)
├── Block 0
│ ├── Thread 0
│ ├── Thread 1
│ └── ... (up to 1024)
├── Block 1
└── Block N
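How a flat element index maps onto that hierarchy, in plain Python (illustrative only; BLOCK_SIZE is whatever you launch with, up to the 1024-threads-per-block limit):

```python
BLOCK_SIZE = 256
n = 10_000

num_blocks = (n + BLOCK_SIZE - 1) // BLOCK_SIZE    # ceil(n / BLOCK_SIZE) blocks in the grid
for i in (0, 255, 256, 9_999):
    block_id, thread_id = divmod(i, BLOCK_SIZE)    # blockIdx.x and threadIdx.x in CUDA terms
    print(f"element {i}: block {block_id}, thread {thread_id}")
```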
DO use custom kernels for:
- Fusing multiple operations
- Operations PyTorch doesn't optimize
- Novel architectures
DON'T use custom kernels for:
- Standard operations (PyTorch is already optimized)
- Premature optimization
- Simple prototyping
- Official Triton Tutorials
- Flash Attention - Production Triton code
- GPU Gems - Classic techniques
- What Every Programmer Should Know About Memory
gpu-ml-learning/
├── 00-prerequisites/ # Start here!
│ └── 00_gpu_fundamentals.ipynb
├── triton/ # Python GPU programming
│ ├── 01-basics/
│ ├── 02-matrix-ops/
│ ├── 03-nn-components/
│ └── 04-mnist-classifier/
├── cuda/ # C++ GPU programming
│ ├── 01-basics/
│ ├── 02-matrix-ops/
│ ├── 03-nn-components/
│ └── 04-mnist-classifier/
├── pyproject.toml
└── README.md
- Python 3.10+
- PyTorch 2.0+
- NVIDIA GPU with CUDA support
- For Triton: `triton` package
- For CUDA: CUDA Toolkit (`nvcc` compiler)
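A quick way to check the requirements from Python (the Triton import is only needed for the Triton path):

```python
import torch
print(torch.__version__)              # want 2.0+
print(torch.cuda.is_available())      # True means an NVIDIA GPU and a working driver
print(torch.version.cuda)             # CUDA version PyTorch was built against

import triton
print(triton.__version__)             # Triton path only
```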
MIT