Learn GPU programming from scratch by building neural networks. Two learning paths: Triton (Python, recommended) and CUDA C++.
- ML engineers who want to understand what happens under the hood
- Anyone curious about GPU programming
- People who want to write custom kernels for performance
No prior GPU experience required!
Python-based GPU programming. Modern, used in production (Flash Attention, vLLM).
triton/
├── 01-basics/ # First kernels, vector ops
├── 02-matrix-ops/ # Matrix multiplication (the core of ML)
├── 03-nn-components/ # Activations, linear layers, softmax
└── 04-mnist-classifier/ # Full training example
The traditional approach. Lower-level, maximum control.
cuda/
├── 01-basics/ # Thread hierarchy, memory management
├── 02-matrix-ops/ # Tiled matmul with shared memory
├── 03-nn-components/ # NN building blocks in CUDA
└── 04-mnist-classifier/ # Complete classifier
Start here if you're new to GPU programming!
00-prerequisites/
└── 00_gpu_fundamentals.ipynb # CPU vs GPU, memory hierarchy, why GPUs for ML
# Clone the repo
git clone <repo-url>
cd gpu-ml-learning
# Install dependencies
uv sync
# Start with prerequisites (if new to GPUs)
jupyter notebook 00-prerequisites/00_gpu_fundamentals.ipynb
# Then start Triton path
jupyter notebook triton/01-basics/01_basics.ipynb
# Verify CUDA installation
nvcc --version
nvidia-smi
# Start with prerequisites
jupyter notebook 00-prerequisites/00_gpu_fundamentals.ipynb
# Then start CUDA path
jupyter notebook cuda/01-basics/01_basics.ipynb
- CPU vs GPU architecture
- Why GPUs for machine learning
- Memory hierarchy (the real bottleneck!)
- What is a "kernel"?
- Basic PyTorch GPU operations
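A quick taste of the PyTorch side covered in the prerequisites (a minimal sketch; the notebook's own examples may differ):

```python
import torch

x = torch.randn(1024, 1024)   # lives in CPU memory
x_gpu = x.to("cuda")          # copy across the PCIe bus into GPU global memory (slow!)
y_gpu = x_gpu @ x_gpu         # the matmul runs on the GPU
torch.cuda.synchronize()      # GPU kernels launch asynchronously; wait before timing
y = y_gpu.cpu()               # copy the result back to CPU memory
```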
- Your first GPU kernel
- Vector addition (hello world of parallel computing)
- Thread/block organization
- Memory management
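For orientation, here is roughly what the first kernel looks like, following the standard Triton vector-add pattern (a sketch; names like `add_kernel` are illustrative, not necessarily the notebook's):

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)                       # each program handles one block of elements
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                       # guard against out-of-bounds accesses
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x, y):
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)                    # one program per 1024-element block
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out

x = torch.rand(10_000, device="cuda")
y = torch.rand(10_000, device="cuda")
assert torch.allclose(add(x, y), x + y)
```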
- Why matmul is everything in ML
- Naive vs tiled matrix multiplication
- Shared memory optimization
- Benchmarking against cuBLAS
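Benchmarking the cuBLAS baseline (reached via `torch.matmul`) can be done with `triton.testing.do_bench`; a rough sketch, with the matrix size and dtype chosen arbitrarily:

```python
import torch
import triton

a = torch.randn(2048, 2048, device="cuda", dtype=torch.float16)
b = torch.randn(2048, 2048, device="cuda", dtype=torch.float16)

ms = triton.testing.do_bench(lambda: torch.matmul(a, b))   # runtime in milliseconds
tflops = 2 * 2048**3 / (ms * 1e-3) / 1e12                  # 2*M*N*K FLOPs per matmul
print(f"torch.matmul (cuBLAS): {ms:.3f} ms, {tflops:.1f} TFLOP/s")
```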
- Activation functions (ReLU, GELU, Softmax)
- Linear layers (forward pass)
- Fused operations (why custom kernels matter!)
- Cross-entropy loss
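As a flavour of this section, a row-wise softmax in the spirit of the official Triton fused-softmax tutorial (a sketch, assuming each row fits in a single block):

```python
import torch
import triton
import triton.language as tl

@triton.jit
def softmax_kernel(x_ptr, out_ptr, stride, n_cols, BLOCK_SIZE: tl.constexpr):
    row = tl.program_id(0)                            # one program per row
    cols = tl.arange(0, BLOCK_SIZE)
    mask = cols < n_cols
    x = tl.load(x_ptr + row * stride + cols, mask=mask, other=float("-inf"))
    x = x - tl.max(x, axis=0)                         # subtract row max for numerical stability
    num = tl.exp(x)
    out = num / tl.sum(num, axis=0)
    tl.store(out_ptr + row * stride + cols, out, mask=mask)

def softmax(x):
    n_rows, n_cols = x.shape
    out = torch.empty_like(x)
    BLOCK_SIZE = triton.next_power_of_2(n_cols)
    softmax_kernel[(n_rows,)](x, out, x.stride(0), n_cols, BLOCK_SIZE=BLOCK_SIZE)
    return out

x = torch.randn(128, 512, device="cuda")
assert torch.allclose(softmax(x), torch.softmax(x, dim=1), atol=1e-6)
```

Because the whole row is read once, reduced, and written once inside one kernel, this is also a first example of fusion.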
- Complete 2-layer MLP
- Training loop
- ~97% accuracy
- Comparison with PyTorch
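For reference, the plain-PyTorch equivalent of the model being built (a sketch; the hidden size of 128 and the Adam settings are illustrative, not necessarily what the notebook uses):

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 128),   # hidden size is an assumption
    nn.ReLU(),
    nn.Linear(128, 10),
).cuda()

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def train_step(images, labels):
    optimizer.zero_grad()
    loss = loss_fn(model(images.cuda()), labels.cuda())
    loss.backward()
    optimizer.step()
    return loss.item()
```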
Registers (~1 cycle) Fastest, but tiny
↓
Shared Memory (~5 cycles) Fast, shared within block
↓
Global Memory (~400 cycles) Slow, but large (your VRAM)
↓
CPU Memory (~1000+ cycles) Very slow to access from GPU
Most GPU code is memory-bound, not compute-bound!
# Separate ops (3 memory round-trips)
temp = x + y # Read x,y → Write temp
result = temp * 2 # Read temp → Write result
# Fused (1 memory round-trip)
result = (x + y) * 2 # Read x,y → Write result
This is why custom kernels can beat PyTorch - fewer memory accesses!
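The same fusion written as a single Triton kernel (a sketch): the intermediate `x + y` stays in registers, so global memory is touched only for the two reads and one write.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def fused_add_mul_kernel(x_ptr, y_ptr, out_ptr, n, BLOCK_SIZE: tl.constexpr):
    offs = tl.program_id(0) * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask)                # read x
    y = tl.load(y_ptr + offs, mask=mask)                # read y
    tl.store(out_ptr + offs, (x + y) * 2, mask=mask)    # (x + y) never hits global memory

x = torch.rand(1 << 20, device="cuda")
y = torch.rand(1 << 20, device="cuda")
out = torch.empty_like(x)
fused_add_mul_kernel[(triton.cdiv(x.numel(), 1024),)](x, y, out, x.numel(), BLOCK_SIZE=1024)
assert torch.allclose(out, (x + y) * 2)
```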
Grid (your entire computation)
├── Block 0
│ ├── Thread 0
│ ├── Thread 1
│ └── ... (up to 1024)
├── Block 1
└── Block N
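How a flat element index maps onto that hierarchy, in plain Python (illustrative only; BLOCK_SIZE is whatever you launch with, up to the 1024-threads-per-block limit):

```python
BLOCK_SIZE = 256
n = 10_000

num_blocks = (n + BLOCK_SIZE - 1) // BLOCK_SIZE    # ceil(n / BLOCK_SIZE) blocks in the grid
for i in (0, 255, 256, 9_999):
    block_id, thread_id = divmod(i, BLOCK_SIZE)    # blockIdx.x and threadIdx.x in CUDA terms
    print(f"element {i}: block {block_id}, thread {thread_id}")
```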
DO use custom kernels for:
- Fusing multiple operations
- Operations PyTorch doesn't optimize
- Novel architectures
DON'T use custom kernels for:
- Standard operations (PyTorch is already optimized)
- Premature optimization
- Simple prototyping
- Official Triton Tutorials
- Flash Attention - Production Triton code
- GPU Gems - Classic techniques
- What Every Programmer Should Know About Memory
gpu-ml-learning/
├── 00-prerequisites/ # Start here!
│ └── 00_gpu_fundamentals.ipynb
├── triton/ # Python GPU programming
│ ├── 01-basics/
│ ├── 02-matrix-ops/
│ ├── 03-nn-components/
│ └── 04-mnist-classifier/
├── cuda/ # C++ GPU programming
│ ├── 01-basics/
│ ├── 02-matrix-ops/
│ ├── 03-nn-components/
│ └── 04-mnist-classifier/
├── pyproject.toml
└── README.md
- Python 3.10+
- PyTorch 2.0+
- NVIDIA GPU with CUDA support
- For Triton: `triton` package
- For CUDA: CUDA Toolkit (`nvcc` compiler)
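A quick way to check the requirements from Python (the Triton import is only needed for the Triton path):

```python
import torch
print(torch.__version__)              # want 2.0+
print(torch.cuda.is_available())      # True means an NVIDIA GPU and a working driver
print(torch.version.cuda)             # CUDA version PyTorch was built against

import triton
print(triton.__version__)             # Triton path only
```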
MIT