A starter attempt at using machine learning to predict what a hypothetical movie's IMDb rating would be, based on various factors taken from the public IMDb TSV dumps
- Download the public dumps
uv run python main.py download
- Build a processed CSV dataset
uv run python main.py build
TODO: Write section on build flags for main.py, in case you want different options in the data (change at own risk)
The output CSV is written to data/processed/movies_dataset.csv.
The dataset includes:
tconst - the ID of the film
primaryTitle - the name of the film in English
startYear - the year it was released ("start" is a leftover from the TV data; I kept it in in case I ever feel like expanding in future)
runtimeMinutes - the length of the film
genres - the genres of the film
directorNames - the director(s)
castNames - the top-billed cast of the film (there will probably be some noise here and there, but it's the easiest and most reliable option to find the main stars)
avgRating - the average rating of the film
numVotes - how many people rated the film
directorAvgRating - the average rating of the director's filmography
directorMovieCount - how many films the director has made
castAvgRating - the average rating of the cast's filmography (total, not individual)
castMovieCount - the number of films the cast have starred in (this seems to need fixing)
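To make the two director columns a bit more concrete, here is a minimal sketch of how directorAvgRating and directorMovieCount could be derived from per-film rows. This is not taken from main.py: the helper name director_stats and the sample rows are made up for illustration, and it assumes one director per film and that the film itself counts toward its director's average.

```python
from collections import defaultdict
from statistics import mean

# Made-up placeholder rows, NOT real IMDb data.
films = [
    {"tconst": "example-1", "directorNames": "Director A", "avgRating": 7.0},
    {"tconst": "example-2", "directorNames": "Director A", "avgRating": 8.0},
    {"tconst": "example-3", "directorNames": "Director B", "avgRating": 5.0},
]


def director_stats(rows):
    """Hypothetical helper: group film ratings by director, then summarise."""
    ratings = defaultdict(list)
    for row in rows:
        ratings[row["directorNames"]].append(row["avgRating"])
    return {
        name: {
            "directorAvgRating": mean(values),   # average over the filmography
            "directorMovieCount": len(values),   # number of films directed
        }
        for name, values in ratings.items()
    }


stats = director_stats(films)
print(stats["Director A"])  # {'directorAvgRating': 7.5, 'directorMovieCount': 2}
```

The cast columns would presumably follow the same pattern, just pooled over every cast member instead of a single director.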
This is where things get fun! We want to train a random forest on the dataset. You can do this by running uv run python train.py. So what does this do? Well, imagine we have a flowchart that asks yes/no questions to make a decision. Sorta like this:
is cast_avg > arbitrary_val?
  YES: is it Sci-Fi?
    YES: predict other_val
    NO: predict arbitrary_val
  NO: is director_count > 5?
    YES: predict 6
    NO: predict 5
Each path through this tree leads to a prediction, but a single tree can overfit (memorize the training data). So we are going to build 100 trees (the default amount for the module), where each is trained on a random subset of the data and a random subset of the features. To predict, all 100 trees cast a vote, and the final prediction is the average of all votes. (We specifically use a regressor here, which predicts continuous numbers, in this case ratings from 1 to 10.)
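If it helps to see the "many trees, average the votes" idea in code, here is a toy version in plain Python. It is only a sketch and not what train.py does: each "tree" is cut down to a single yes/no question, the function names are invented, and the rows are made-up numbers rather than real IMDb data.

```python
import random
from statistics import mean, median


def fit_stump(rows, feature):
    """A one-question 'tree': split at the feature's median, predict each side's mean rating."""
    threshold = median(r[feature] for r in rows)
    low = [r["rating"] for r in rows if r[feature] <= threshold]
    high = [r["rating"] for r in rows if r[feature] > threshold]
    overall = mean(r["rating"] for r in rows)
    # If every row lands on one side, fall back to the overall mean for the empty side.
    return feature, threshold, mean(low) if low else overall, mean(high) if high else overall


def predict_stump(stump, row):
    feature, threshold, low_value, high_value = stump
    return low_value if row[feature] <= threshold else high_value


def fit_forest(rows, features, n_trees=100, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        sample = rng.choices(rows, k=len(rows))  # random subset of data (with replacement)
        feature = rng.choice(features)           # random feature for this tree
        forest.append(fit_stump(sample, feature))
    return forest


def predict_forest(forest, row):
    # Every tree "votes" with a number; the forest's answer is the average.
    return mean(predict_stump(stump, row) for stump in forest)


# Made-up training rows: rating rises with cast_avg, director_count is mostly noise.
rows = [
    {"cast_avg": 5.0 + 0.1 * i, "director_count": i % 7, "rating": 4.0 + 0.1 * i}
    for i in range(30)
]
forest = fit_forest(rows, ["cast_avg", "director_count"])
pred_low = predict_forest(forest, {"cast_avg": 5.0, "director_count": 3})
pred_high = predict_forest(forest, {"cast_avg": 7.9, "director_count": 3})
print(pred_low, pred_high)
```

A real random forest grows much deeper trees and picks the best question at each step, but the sampling and averaging work the same way.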
Here are the build flags for train.py if you wish to play with the way we train the data a bit:
-d --dataset path to the processed dataset (default is data/processed/movies_dataset.csv)
-t --test-size controls the fraction of the dataset used for the test split when a random split is used (so the default of 0.2 means 20% test and 80% train). This is ignored if --time-split is provided, since that path uses a year cutoff instead
-S --time-split use a time-based split with a year cutoff instead (train on rows < given year, and test on rows >= given year)
-D --no-director excludes director features from training
-C --no-cast excludes cast features from training
Now we are ready to do predictions! Run uv run python predict.py and pass in the fields you want to analyze. The director, cast, and genres are required here, but the other flags are optional (I would highly recommend providing year and runtime, however, so you do not have to fall back to the defaults from train.py (the medians)). Here is a list for help:
-d --director name of director
-c --cast name(s) of one or more cast members (space-separated)
-g --genres movie genre(s) (space-separated)
-y --year year the film was released
-r --runtime the runtime of the film in minutes
-D --dataset the path to processed dataset (has a default of data/processed/movies_dataset.csv)
-m --model path to trained model for higher accuracy (default is data/models/random_forest.joblib)
-M --metadata path to model metadata (default is data/models/model_metadata.joblib)
So for example, to use it in full (assuming you use default paths), you would run
uv run python predict.py -d "David Lynch" -c "Kyle MacLachlan" "Sheryl Lee" "Ray Wise" -g "Drama" "Horror" "Mystery" -y 1992 -r 134
Yes I am a big fan of Lynch.
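For anyone wondering how the space-separated cast and genre values get picked up, here is a rough sketch of the predict.py flags using argparse. It is an approximation written from the flag list above, not the actual parser: the function name build_parser and the int types for year and runtime are assumptions.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the documented predict.py flags."""
    parser = argparse.ArgumentParser(description="Sketch of the predict.py flags")
    parser.add_argument("-d", "--director", required=True, help="name of director")
    # nargs="+" collects one or more space-separated values into a list.
    parser.add_argument("-c", "--cast", nargs="+", required=True, help="cast member name(s)")
    parser.add_argument("-g", "--genres", nargs="+", required=True, help="movie genre(s)")
    # Assumed to be integers; left as None when omitted so defaults can be applied later.
    parser.add_argument("-y", "--year", type=int, help="release year")
    parser.add_argument("-r", "--runtime", type=int, help="runtime in minutes")
    parser.add_argument("-D", "--dataset", default="data/processed/movies_dataset.csv")
    parser.add_argument("-m", "--model", default="data/models/random_forest.joblib")
    parser.add_argument("-M", "--metadata", default="data/models/model_metadata.joblib")
    return parser


# Same arguments as the example command, passed as a list instead of via the shell.
args = build_parser().parse_args([
    "-d", "David Lynch",
    "-c", "Kyle MacLachlan", "Sheryl Lee", "Ray Wise",
    "-g", "Drama", "Horror", "Mystery",
    "-y", "1992", "-r", "134",
])
print(args.cast)  # ['Kyle MacLachlan', 'Sheryl Lee', 'Ray Wise']
```

This is also why names with spaces need quotes on the command line: each quoted string becomes one entry in the list.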
This is mostly just an amateur attempt at getting into more data sciencey stuff and machine learning, as it's been a long time since I touched anything like that and I wanted to explore it more. It'll be refined further in future, as it is a little bit simplistic (I would have done something with dairy farming if Earth Sciences NZ actually got back to me on my request for access to data records, but oh well).
Oh and I'll paypal you $10 if you 1) know what I'm talking about, 2) why it's here specifically, and 3) how I managed to connect the dots between a Sopranos reference and a Star Wars reference :P
- This will take a while to build as there is A LOT of data here.