This project recreates the Hogwarts Sorting Hat using machine learning! We've implemented a logistic regression classifier from scratch that can accurately assign students to their appropriate Hogwarts houses based on their academic performance.
- ✨ Pure Magic (No High-Level Libraries) - Everything is implemented from scratch with only basic Python, NumPy, and Pandas
- 🎯 98%+ Accuracy - Our model meets the 98% accuracy threshold required by McGonagall
- 🔄 Multiple Learning Algorithms - Choose from three different gradient descent approaches
- 📊 Data Visualization Suite - Tools to understand Hogwarts student data patterns
- 🧰 Custom Statistical Functions - Hand-coded functions for all statistical operations
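To give a feel for what hand-coded statistics involve, here is a minimal sketch of a per-column summary written in plain Python. It is not the code from this repository: the function name `describe_column` and its return format are assumptions made for illustration, and it assumes missing values arrive as `None` or NaN.

```python
import math


def describe_column(values):
    """Summary statistics for one numeric column, skipping missing values."""
    data = sorted(v for v in values if v is not None and not math.isnan(v))
    n = len(data)
    if n == 0:
        raise ValueError("column has no numeric values")

    mean = sum(data) / n
    # Sample variance (n - 1 denominator), matching what pandas' describe() reports.
    var = sum((x - mean) ** 2 for x in data) / (n - 1) if n > 1 else float("nan")

    def percentile(p):
        # Linear interpolation between the two closest ranks.
        pos = (n - 1) * p
        lo, hi = math.floor(pos), math.ceil(pos)
        return data[lo] + (data[hi] - data[lo]) * (pos - lo)

    return {
        "count": n,
        "mean": mean,
        "std": math.sqrt(var),
        "min": data[0],
        "25%": percentile(0.25),
        "50%": percentile(0.50),
        "75%": percentile(0.75),
        "max": data[-1],
    }
```

Calling `describe_column` on each numeric course column in turn would produce a table comparable to the one Pandas prints.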
Our describe.py script provides statistical insights about the dataset, similar to Pandas' describe() function but built entirely from scratch:
python src/data/describe.py -d data/raw/dataset_train.csv
Add the -b flag for bonus statistics:
python src/data/describe.py -d data/raw/dataset_train.csv -b
Explore the student data with various plots:
Histogram: Find courses with homogeneous distributions across houses
python src/visualization/histogram.py -c "Astronomy"
Scatter Plot: Discover correlated features
python src/visualization/scatter_plot.py -c "Astronomy" "Defense Against the Dark Arts"
Pair Plot: Comprehensive view of feature relationships
python src/visualization/pair_plot.py
Train the sorting algorithm with different optimization methods:
# Default: Stochastic Gradient Descent (fastest)
python src/models/train.py -d data/processed/dataset_train.csv
# Batch Gradient Descent (most stable)
python src/models/train.py -d data/processed/dataset_train.csv -a gradient_descent
# Mini-Batch Gradient Descent (balanced approach)
python src/models/train.py -d data/processed/dataset_train.csv -a mini_batch_gradient_descent
Sort new students with the trained algorithm:
python src/models/predict.py -d data/processed/dataset_test.csv -m weights.pkl
This project implements a one-vs-all logistic regression classifier with three gradient descent optimization techniques:
- Batch Gradient Descent: Updates weights using the entire dataset for each iteration
- Stochastic Gradient Descent: Updates weights using one random example at a time
- Mini-Batch Gradient Descent: Updates weights using small random batches of examples
Each approach offers different trade-offs between training speed and convergence stability.
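The three variants differ only in how many examples feed each weight update, which a single batch-size parameter can express. The sketch below illustrates that idea for one-vs-all logistic regression. It is not the code in `src/models/train.py`: the function names, the learning rate, and the epoch count are assumptions chosen for the example, and it expects features that are already numeric and scaled.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def train_one_vs_all(X, y, classes, batch_size, lr=0.1, epochs=200, seed=0):
    """Fit one binary logistic regression per class.

    batch_size == len(X)  -> batch gradient descent
    batch_size == 1       -> stochastic gradient descent
    anything in between   -> mini-batch gradient descent
    """
    rng = np.random.default_rng(seed)
    Xb = np.hstack([np.ones((len(X), 1)), X])  # prepend a bias column
    weights = np.zeros((len(classes), Xb.shape[1]))
    for k, cls in enumerate(classes):
        target = (y == cls).astype(float)  # 1 for this class, 0 for the rest
        for _ in range(epochs):
            order = rng.permutation(len(Xb))  # reshuffle every epoch
            for start in range(0, len(Xb), batch_size):
                idx = order[start:start + batch_size]
                error = sigmoid(Xb[idx] @ weights[k]) - target[idx]
                # Gradient of the mean log loss over the current batch.
                weights[k] -= lr * Xb[idx].T @ error / len(idx)
    return weights


def predict(X, weights, classes):
    """Pick the class whose binary classifier gives the highest probability."""
    Xb = np.hstack([np.ones((len(X), 1)), X])
    return np.asarray(classes)[np.argmax(sigmoid(Xb @ weights.T), axis=1)]
```

With this layout, a smaller batch means more frequent but noisier updates per pass over the data, while the full batch gives one smooth update per pass. That is the speed versus stability trade-off described above.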
- Python 3.8+
# Clone the repository
git clone https://github.com/LuckyIntegral/dslr.git
cd dslr
# Install dependencies
pip install -r requirements.txt
# Prepare the dataset (removes unnecessary features and handles missing values)
python src/data/prepare_dataset.py -i data/raw/dataset_train.csv -o data/processed/dataset_train.csv
python src/data/prepare_dataset.py -i data/raw/dataset_test.csv -o data/processed/dataset_test.csv
dslr/
├── data/ # Data directory
│ ├── raw/ # Raw datasets
│ └── processed/ # Cleaned datasets
├── images/ # Generated visualizations
├── notebooks/ # Jupyter notebooks for demonstrations
├── src/ # Source code
│ ├── data/ # Data processing modules
│ │ ├── describe.py # Statistical analysis tool
│ │ └── prepare_dataset.py # Data cleaning & preparation
│ ├── models/ # Machine learning models
│ │ ├── train.py # Training algorithms
│ │ └── predict.py # Prediction functions
│ └── visualization/ # Visualization tools
│ ├── histogram.py # Histogram generator
│ ├── scatter_plot.py # Scatter plot generator
│ └── pair_plot.py # Pair plot generator
└── requirements.txt # Project dependencies
- Add regularization to prevent overfitting
- Implement cross-validation for better model evaluation
- Create an interactive web application for real-time sorting
- Extend the algorithm to use neural networks for comparison
Vitalii Frants 📍 42 Vienna 👉 GitHub