This repository contains the course project for Math5470, a mathematics course exploring fundamental techniques in machine learning.
As part of the course requirements, students work individually or in small teams to complete a hands-on machine learning project. The project involves data analysis, model development, and result interpretation, with an emphasis on scientific reasoning rather than raw performance.
Our team chose the Kaggle M5 Forecasting – Accuracy competition.
Objective:
To predict daily unit sales of Walmart retail products over a 28-day horizon using hierarchical time series data from multiple stores across California, Texas, and Wisconsin.
Dataset Features:
- Item-level, department, and store-level data
- Explanatory variables such as prices, promotions, calendar events, and special days
Goal:
Improve forecast accuracy by combining traditional time-series approaches with modern machine learning techniques.
Competition link:
Kaggle M5 Forecasting – Accuracy
Clone this repository and create the Conda environment using the provided `.yml` file:

```bash
git clone <your-repo-link>
cd <your-repo-folder>
conda env create -f environment.yml
conda activate math
```

Download the M5 Forecasting dataset from Kaggle and place it in the following directory structure:
```text
Math5470/
|-- calendar.csv
|-- sales_train_validation.csv
|-- sell_prices.csv
|-- sample_submission.csv
```
Dataset download link: [🔗 Link]
Make sure the files are unzipped and located in the `Math5470/` folder before running any training or analysis scripts.
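Before running anything, it can help to verify that all four CSVs are in place. A minimal sketch (the folder name and file list follow the layout above; adjust `data_dir` if your path differs):

```python
from pathlib import Path

# Expected M5 input files, per the directory layout above.
EXPECTED_FILES = [
    "calendar.csv",
    "sales_train_validation.csv",
    "sell_prices.csv",
    "sample_submission.csv",
]

def missing_files(data_dir="Math5470"):
    """Return the expected files that are not present in data_dir."""
    root = Path(data_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]

if __name__ == "__main__":
    missing = missing_files()
    if missing:
        print("Missing data files:", ", ".join(missing))
    else:
        print("All M5 data files found.")
```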
Download the pretrained model or checkpoints (if available) and place them in the following directory:
```text
Math5470/
|-- model.lgb
|-- model_meta.json
```
Model download link: [🔗 Link]
Ensure that the model file name and path match the configuration in your training or inference scripts.
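The exact schema of `model_meta.json` depends on the training script; as an illustration only (the field names `features` and `best_iteration` below are assumptions, and the LightGBM booster itself would be loaded separately, e.g. via `lgb.Booster(model_file="Math5470/model.lgb")`), the metadata can be read like this:

```python
import json

def load_model_meta(path="Math5470/model_meta.json"):
    """Read model metadata from JSON.

    The 'features' and 'best_iteration' keys are illustrative assumptions,
    not necessarily the repo's actual schema.
    """
    with open(path) as f:
        meta = json.load(f)
    return meta.get("features", []), meta.get("best_iteration")
```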
Explore the dataset patterns and insights using our interactive EDA notebook:

```bash
jupyter notebook EDA.ipynb
```

The analysis includes sales trends, seasonal patterns, and feature correlations to guide model development.
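One quick check for the weekly seasonality the notebook examines is a trailing 7-day moving average of unit sales. A pure-Python sketch on synthetic data (the notebook itself works on the real series):

```python
def rolling_mean(series, window=7):
    """Trailing moving average; positions with fewer than `window` points get None."""
    out = []
    for i in range(len(series)):
        if i + 1 < window:
            out.append(None)
        else:
            out.append(sum(series[i + 1 - window:i + 1]) / window)
    return out

# Synthetic daily sales with a weekend-like spike every 7th day.
sales = [10, 10, 10, 10, 10, 20, 30] * 4
smoothed = rolling_mean(sales)
```

On a series with period 7, every full window covers one whole cycle, so the smoothed values are constant; on real data, deviations from flatness reveal trend and event effects.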
Run the training script to start model training:

```bash
python train.py
```

Make sure the dataset and environment are properly set up before running this command. Training logs and checkpoints will be saved automatically in the designated output directory.
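M5-style gradient-boosting pipelines usually derive lag features from the sales history. A hedged sketch of a simple lag transform (the actual features built in `train.py` may differ):

```python
def lag_feature(series, lag):
    """Shift a series by `lag` steps so position t sees the value at t - lag.

    The first `lag` positions have no history and are filled with None.
    """
    n = len(series)
    return [None] * min(lag, n) + list(series[: max(n - lag, 0)])

# Example: a 28-day lag aligns each day with the same weekday four weeks earlier.
history = list(range(1, 8))      # 7 days of toy sales: 1..7
lag_2 = lag_feature(history, 2)  # [None, None, 1, 2, 3, 4, 5]
```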
After training, use the inference script to generate predictions:
```bash
python infer.py
```

Evaluate the model performance using the provided evaluation script:

```bash
python eval.py
```

We also provide scripts for training and evaluating an XGBoost model. You can follow them and write your own method. Feel free to try it!
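For context on scoring: the M5 Accuracy competition uses WRMSSE, a weighted average of per-series RMSSE values. A simplified per-series sketch (the official metric adds hierarchical aggregation and weights, which `eval.py` may or may not replicate):

```python
import math

def rmsse(train, actual, forecast):
    """Root mean squared scaled error for one series: forecast MSE over the
    horizon, scaled by the in-sample one-step naive-forecast MSE."""
    h = len(actual)
    num = sum((a - f) ** 2 for a, f in zip(actual, forecast)) / h
    den = sum((train[t] - train[t - 1]) ** 2 for t in range(1, len(train))) / (len(train) - 1)
    return math.sqrt(num / den)
```

A score of 1.0 means the forecast errs as much as a naive "repeat yesterday" forecast did in-sample, and the XGBoost scripts below can be compared on the same footing.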
```bash
python train_xgboost.py
python infer_xgboost.py
```

| Name | Contribution |
|---|---|
| Weizhen Bian | Performed initial data cleaning and feature extraction; implemented the main model, including training, inference, and evaluation; and contributed to writing and editing the final report. |
| Yiming Li | Conducted exploratory data analysis to identify sales patterns with key visualizations; implemented other models for comparison; aided in the ablation study; and contributed to writing and editing the final report. |
| Pengyu Chen | Conducted exploratory data analysis to identify sales patterns with key visualizations; supported data preprocessing, contributed to modeling via feature engineering, and aided in drafting the EDA section. |
| Jiahao Pan | Conducted exploratory data analysis to identify sales patterns with key visualizations; contributed to writing and editing the final report. |
| Boyi Kang | |