Predict a movie’s IMDb rating (on a ten-point scale) from metadata like genre, director, cast, runtime, etc. using machine learning

mdi48/quasimodo


quasimodo

He predicted all of this.

A starter attempt at using machine learning to predict the rating a hypothetical movie would get on IMDb, based on features derived from the public IMDb TSV dumps.

Quick start

  1. Download the public dumps
uv run python main.py download
  2. Build a processed CSV dataset
uv run python main.py build

TODO: Write section on build flags for main.py, in case you want different options in the data (change at own risk)

The output CSV is written to data/processed/movies_dataset.csv.

What it builds

The dataset includes:

  • tconst - the id of the film
  • primaryTitle - the name of the film in English
  • startYear - the year it was released ("start" is a leftover from the TV data; I kept it in case I ever feel like expanding in the future)
  • runtimeMinutes - the length of the film
  • genres - the genres of the film
  • directorNames - the director(s)
  • castNames - the top-billed cast of the film (there will probably be some noise here and there but it's the easiest and most reliable option to find the main stars)
  • avgRating - the average rating of the film
  • numVotes - how many people rated the film
  • directorAvgRating - the average rating of the director's filmography
  • directorMovieCount - how many films the director has made
  • castAvgRating - the average rating of the cast's filmography (total, not individual)
  • castMovieCount - the number of films the cast have starred in (this seems to need fixing)
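
As a rough illustration of how the director aggregates (directorAvgRating and directorMovieCount) could be derived, here is a small sketch. It is not the code in main.py: the director_stats helper and the sample rows are made up for the example, and it assumes directorNames has already been split into a list of names per film.

```python
from collections import defaultdict


def director_stats(rows):
    """Return {director: (average rating, film count)} from film records."""
    totals = defaultdict(lambda: [0.0, 0])  # director -> [rating sum, film count]
    for row in rows:
        # A film with several directors counts towards each of them.
        for name in row["directorNames"]:
            totals[name][0] += row["avgRating"]
            totals[name][1] += 1
    return {name: (total / count, count) for name, (total, count) in totals.items()}


# Hypothetical sample rows, not real IMDb values.
rows = [
    {"tconst": "tt0000001", "directorNames": ["Director A"], "avgRating": 8.0},
    {"tconst": "tt0000002", "directorNames": ["Director A", "Director B"], "avgRating": 6.0},
    {"tconst": "tt0000003", "directorNames": ["Director B"], "avgRating": 7.0},
]
stats = director_stats(rows)
print(stats["Director A"])  # (average rating, film count) for Director A
```

The cast columns could be built the same way by looping over castNames instead.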

What Next?

This is where things get fun! We want to train a random forest on the dataset. You can do this by running uv run python train.py. So what does this do? Well, imagine we have a flowchart that asks yes/no questions to make a decision. Sorta like this:

is cast_avg > arbitrary_val?
    YES: is it Sci-Fi?
        YES: predict other_val
        NO:  predict arbitrary_val
    NO: is director_count > 5?
        YES: predict 6
        NO:  predict 5

Each path through this tree leads to a prediction, but a single tree can overfit (memorize the training data). So we are going to build 100 trees (the default amount for the module), where each is trained on a random subset of the data and a random subset of the features. To predict, all 100 trees cast a vote, and the final prediction is the average of all votes. (We specifically use a regressor here, which predicts a continuous number between 1 and 10.)
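
To make the "many trees, then average" idea concrete, here is a toy version in plain Python. It is only a sketch and not what train.py does: each "tree" asks a single yes/no question about one randomly chosen feature, whereas a real random forest grows much deeper trees and picks candidate features at every split. The function names and the sample rows are made up for the example.

```python
import random
from statistics import mean


def fit_stump(rows, feature):
    """Fit a one-question tree: is `feature` above its mean in this sample?"""
    threshold = mean(r[feature] for r in rows)
    low = [r["rating"] for r in rows if r[feature] <= threshold]
    high = [r["rating"] for r in rows if r[feature] > threshold]
    overall = mean(r["rating"] for r in rows)
    # If every value is identical, one side is empty; use the overall mean there.
    return feature, threshold, mean(low) if low else overall, mean(high) if high else overall


def predict_stump(stump, row):
    feature, threshold, low, high = stump
    return low if row[feature] <= threshold else high


def fit_forest(rows, features, n_trees=100, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        sample = [rng.choice(rows) for _ in rows]  # bootstrap: draw rows with replacement
        feature = rng.choice(features)             # random feature for this tree
        forest.append(fit_stump(sample, feature))
    return forest


def predict_forest(forest, row):
    # Every tree "votes" with its own prediction; the forest returns the average.
    return mean(predict_stump(stump, row) for stump in forest)


# Hypothetical training rows where rating rises with both features.
rows = [
    {"cast_avg": 5.0 + 0.5 * i, "director_count": 1 + i, "rating": 4.0 + 0.5 * i}
    for i in range(8)
]
forest = fit_forest(rows, ["cast_avg", "director_count"])
weak = predict_forest(forest, {"cast_avg": 5.0, "director_count": 1})
strong = predict_forest(forest, {"cast_avg": 8.5, "director_count": 8})
print(weak, strong)
```

Because every tree sees a slightly different sample, no single tree's quirks dominate the averaged prediction.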

Here are the build flags for train.py if you wish to play with the way we train the data a bit:

-d     --dataset       path to the processed dataset (default is data/processed/movies_dataset.csv)
-t     --test-size     controls the fraction of the dataset used for the test split when a random split is used (so the default of 0.2 means 20% test and 80% train). This will be ignored if --time-split is provided (since that path uses a year cutoff instead)
-S     --time-split    use a time-based split with a year cutoff instead (train on rows < given year, and test on rows >= given year)
-D     --no-director   excludes director features from training
-C     --no-cast       excludes cast features from training  
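
The difference between --test-size and --time-split can be sketched like this. It is an illustration of the two behaviours described above, not the code in train.py: the split_rows helper, the seed parameter, and the sample rows are all made up for the example.

```python
import random


def split_rows(rows, test_size=0.2, time_split=None, seed=42):
    """Return (train, test): a year cutoff if time_split is given, else a random split."""
    if time_split is not None:
        # Time-based split: test_size is ignored on this path.
        train = [r for r in rows if r["startYear"] < time_split]
        test = [r for r in rows if r["startYear"] >= time_split]
        return train, test
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical rows: ten films released from 2000 to 2009.
rows = [{"tconst": f"tt{i:07d}", "startYear": 2000 + i} for i in range(10)]

random_train, random_test = split_rows(rows, test_size=0.2)
time_train, time_test = split_rows(rows, test_size=0.2, time_split=2007)
print(len(random_train), len(random_test), len(time_train), len(time_test))
```

A time-based split is the stricter test, since the model is only evaluated on films released after everything it was trained on.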



Now we are ready to make predictions! Run uv run python predict.py and pass in the fields you want to analyze. The director, cast, and genres are required here, but the other flags are optional. I would highly recommend providing the year and runtime, however, so you do not have to fall back to the defaults from train.py (the medians). Here is a list for help:

-d      --director      name of director
-c      --cast          name(s) of one or more cast members (space-separated)
-g      --genres        movie genre(s) (space-separated)
-y      --year          year the film was released
-r      --runtime       the runtime of the film in minutes
-D      --dataset       the path to processed dataset (has a default of data/processed/movies_dataset.csv)
-m      --model         path to trained model for higher accuracy (default is data/models/random_forest.joblib)
-M      --metadata      path to model metadata (default is data/models/model_metadata.joblib)

So for example, to use it in full (assuming you use default paths), you would run

uv run python predict.py -d "David Lynch" -c "Kyle MacLachlan" "Sheryl Lee" "Ray Wise" -g "Drama" "Horror" "Mystery" -y 1992 -r 134
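
If you are wondering how the space-separated names end up as separate values, here is a sketch of how flags like these can be parsed with argparse. This is an assumption about the approach, not a copy of predict.py: the path flags are left out, and the MEDIANS values are placeholders standing in for whatever medians train.py actually stores.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="predict.py")
    parser.add_argument("-d", "--director", required=True)
    # nargs="+" collects one or more quoted values into a list.
    parser.add_argument("-c", "--cast", nargs="+", required=True)
    parser.add_argument("-g", "--genres", nargs="+", required=True)
    parser.add_argument("-y", "--year", type=int, default=None)
    parser.add_argument("-r", "--runtime", type=int, default=None)
    return parser


# Placeholder medians for illustration only; the real ones come from training.
MEDIANS = {"year": 2000, "runtime": 90}

args = build_parser().parse_args([
    "-d", "David Lynch",
    "-c", "Kyle MacLachlan", "Sheryl Lee", "Ray Wise",
    "-g", "Drama", "Horror", "Mystery",
    "-y", "1992", "-r", "134",
])

# Without -y and -r, the optional fields fall back to the stored medians.
minimal = build_parser().parse_args(["-d", "David Lynch", "-c", "Ray Wise", "-g", "Drama"])
year = minimal.year if minimal.year is not None else MEDIANS["year"]
runtime = minimal.runtime if minimal.runtime is not None else MEDIANS["runtime"]
print(args.cast, args.genres, year, runtime)
```

The quotes matter: without them, "Kyle MacLachlan" would be read as two separate cast members.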

Yes I am a big fan of Lynch.

This is mostly just an amateur attempt at getting into more data sciencey stuff and machine learning, as it's been a long time since I touched anything like that and wanted to explore it more. It'll be refined further in future as it is a little bit simplistic (I would have done something with dairy farming if Earth Sciences NZ actually got back to me on my request for access to data records but oh well).

Oh and I'll paypal you $10 if you 1) know what I'm talking about, 2) why it's here specifically, and 3) how I managed to connect the dots between a Sopranos reference and a Star Wars reference :P

Notes

  • This will take a while to build as there is A LOT of data here.
