
GPU Programming for Machine Learning

Learn GPU programming from scratch by building neural networks. Two learning paths: Triton (Python, recommended) and CUDA C++.

Who Is This For?

  • ML engineers who want to understand what happens under the hood
  • Anyone curious about GPU programming
  • People who want to write custom kernels for performance

No prior GPU experience required!

Learning Paths

Path A: Triton (Recommended)

Python-based GPU programming. Modern, used in production (Flash Attention, vLLM).

triton/
├── 01-basics/         # First kernels, vector ops
├── 02-matrix-ops/     # Matrix multiplication (the core of ML)
├── 03-nn-components/  # Activations, linear layers, softmax
└── 04-mnist-classifier/  # Full training example

Path B: CUDA C++

The traditional approach. Lower-level, maximum control.

cuda/
├── 01-basics/         # Thread hierarchy, memory management
├── 02-matrix-ops/     # Tiled matmul with shared memory
├── 03-nn-components/  # NN building blocks in CUDA
└── 04-mnist-classifier/  # Complete classifier

Prerequisites Module

Start here if you're new to GPU programming!

00-prerequisites/
└── 00_gpu_fundamentals.ipynb  # CPU vs GPU, memory hierarchy, why GPUs for ML

Quick Start

For Triton (Python)

# Clone the repo
git clone <repo-url>
cd gpu-ml-learning

# Install dependencies
uv sync

# Start with prerequisites (if new to GPUs)
jupyter notebook 00-prerequisites/00_gpu_fundamentals.ipynb

# Then start Triton path
jupyter notebook triton/01-basics/01_basics.ipynb

For CUDA C++

# Verify CUDA installation
nvcc --version
nvidia-smi

# Start with prerequisites
jupyter notebook 00-prerequisites/00_gpu_fundamentals.ipynb

# Then start CUDA path
jupyter notebook cuda/01-basics/01_basics.ipynb

What You'll Learn

Module 0: Prerequisites

  • CPU vs GPU architecture
  • Why GPUs for machine learning
  • Memory hierarchy (the real bottleneck!)
  • What is a "kernel"?
  • Basic PyTorch GPU operations

Module 1: Basics

  • Your first GPU kernel
  • Vector addition (hello world of parallel computing)
  • Thread/block organization
  • Memory management

Module 2: Matrix Operations

  • Why matmul is everything in ML
  • Naive vs tiled matrix multiplication
  • Shared memory optimization
  • Benchmarking against cuBLAS
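To make the tiling idea concrete before you see it in shared memory, here is a pure-Python sketch (not the repo's actual kernel code, which runs on the GPU): the matrix product is computed tile by tile, and each tile-sized slab of A and B is exactly what a real kernel would stage in shared memory.

```python
def tiled_matmul(A, B, n, tile=2):
    """Multiply two n x n matrices (lists of lists) tile by tile.

    Each (i0, j0) tile of C accumulates contributions from tile-sized
    slabs of A and B -- on a GPU, those slabs are what a block loads
    into shared memory once and then reuses many times.
    """
    C = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, n, tile):
            for k0 in range(0, n, tile):  # walk the k dimension slab by slab
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, n)):
                        for k in range(k0, min(k0 + tile, n)):
                            C[i][j] += A[i][k] * B[k][j]
    return C
```

The result is identical to the naive triple loop; only the *order* of the work changes, which is what makes the shared-memory reuse possible.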

Module 3: Neural Network Components

  • Activation functions (ReLU, GELU, Softmax)
  • Linear layers (forward pass)
  • Fused operations (why custom kernels matter!)
  • Cross-entropy loss
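Softmax is the one component here with a classic numerical pitfall: exponentiating large logits overflows. A quick pure-Python sketch of the stable version (the GPU kernels in this module implement the same trick):

```python
import math

def softmax(logits):
    """Numerically stable softmax: subtract the max before exponentiating,
    which leaves the result unchanged but keeps exp() in a safe range."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    """Cross-entropy loss for one example: -log of the target's probability."""
    return -math.log(softmax(logits)[target])
```

For uniform logits over 3 classes, the loss is exactly log(3), which is a handy sanity check for any implementation.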

Module 4: MNIST Classifier

  • Complete 2-layer MLP
  • Training loop
  • ~97% accuracy
  • Comparison with PyTorch

Key Concepts

The Memory Hierarchy

Registers    (~1 cycle)     Fastest, but tiny
    ↓
Shared Memory (~5 cycles)   Fast, shared within block
    ↓
Global Memory (~400 cycles) Slow, but large (your VRAM)
    ↓
CPU Memory    (~1000+ cycles) Very slow to access from GPU

Most GPU code is memory-bound, not compute-bound!
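A back-of-envelope check of that claim, using illustrative round numbers (not any specific GPU): a float32 vector add does 1 FLOP per element but moves 12 bytes (two reads, one write), so bandwidth, not compute, sets the ceiling.

```python
def achievable_gflops(bytes_per_flop, bandwidth_gbs, peak_gflops):
    """GFLOP/s you can actually sustain: the memory bus can only feed
    bandwidth/bytes_per_flop, and you can never exceed peak compute."""
    memory_limit = bandwidth_gbs / bytes_per_flop
    return min(memory_limit, peak_gflops)

# Hypothetical GPU: 1000 GB/s bandwidth, 30000 GFLOP/s peak.
# Vector add: 2 reads + 1 write of float32 = 12 bytes per FLOP.
sustained = achievable_gflops(bytes_per_flop=12, bandwidth_gbs=1000, peak_gflops=30000)
```

That works out to roughly 83 GFLOP/s sustained: under 1% of the hypothetical peak, which is why reducing memory traffic matters far more than reducing arithmetic.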

Why Fusion Matters

# Separate ops (3 memory round-trips)
temp = x + y      # Read x,y → Write temp
result = temp * 2 # Read temp → Write result

# Fused (1 memory round-trip)
result = (x + y) * 2  # Read x,y → Write result

This is why custom kernels can beat PyTorch: fewer memory accesses!
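Counting every individual read and write of a float32 makes the saving precise. A small bookkeeping sketch (illustrative, not part of the repo's code):

```python
FLOAT32 = 4  # bytes

def traffic_unfused(n):
    """Bytes moved by: temp = x + y; result = temp * 2 (two kernels)."""
    add = 2 * n * FLOAT32 + n * FLOAT32  # read x, y; write temp
    mul = n * FLOAT32 + n * FLOAT32      # read temp; write result
    return add + mul                     # 5 floats moved per element

def traffic_fused(n):
    """Bytes moved by: result = (x + y) * 2 (one kernel)."""
    return 2 * n * FLOAT32 + n * FLOAT32  # read x, y; write result
```

The fused version moves 3 floats per element instead of 5, and since the code is memory-bound, that ratio translates almost directly into speedup.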

The Thread Model

Grid (your entire computation)
├── Block 0
│   ├── Thread 0
│   ├── Thread 1
│   └── ... (up to 1024)
├── Block 1
└── Block N
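Every thread in that hierarchy computes its own global index from its block and thread coordinates, then masks itself out if it falls past the end of the data. A pure-Python emulation of a 1-D launch (the same logic as `idx = blockIdx.x * blockDim.x + threadIdx.x; if (idx < n)` in CUDA):

```python
def launch(n, block_dim=4):
    """Emulate a 1-D grid of blocks covering n elements.

    Returns the global indices that pass the bounds check, in the
    order the (block, thread) pairs would compute them.
    """
    grid_dim = (n + block_dim - 1) // block_dim  # ceil-divide: enough blocks
    indices = []
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            idx = block_idx * block_dim + thread_idx
            if idx < n:  # threads past the end do nothing
                indices.append(idx)
    return indices
```

The ceil-divide plus bounds check is the standard pattern: the last block is usually partially full, and the guard keeps its extra threads from reading out of bounds.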

When to Use Custom Kernels

DO use custom kernels for:

  • Fusing multiple operations
  • Operations PyTorch doesn't optimize
  • Novel architectures

DON'T use custom kernels for:

  • Standard operations (PyTorch is already optimized)
  • Premature optimization
  • Simple prototyping

Resources

Triton

CUDA

General GPU Programming

Project Structure

gpu-ml-learning/
├── 00-prerequisites/           # Start here!
│   └── 00_gpu_fundamentals.ipynb
├── triton/                     # Python GPU programming
│   ├── 01-basics/
│   ├── 02-matrix-ops/
│   ├── 03-nn-components/
│   └── 04-mnist-classifier/
├── cuda/                       # C++ GPU programming
│   ├── 01-basics/
│   ├── 02-matrix-ops/
│   ├── 03-nn-components/
│   └── 04-mnist-classifier/
├── pyproject.toml
└── README.md

Requirements

  • Python 3.10+
  • PyTorch 2.0+
  • NVIDIA GPU with CUDA support
  • For Triton: triton package
  • For CUDA: CUDA Toolkit (nvcc compiler)

License

MIT
