quasimodo

A starter attempt at using machine learning to predict the rating a hypothetical IMDb movie would receive, based on features drawn from the public IMDb TSV dumps.

Quick start

Download the public dumps

uv run python main.py download

Build a processed CSV dataset

uv run python main.py build

TODO: Write section on build flags for main.py, in case you want different options in the data (change at own risk)

The output CSV is written to data/processed/movies_dataset.csv.

What it builds

The dataset includes:

tconst - the id of the film
primaryTitle - the name of the film in English
startYear - the year it was released ("start" is a leftover from the TV data; I kept it in case I ever feel like expanding in future)
runtimeMinutes - the length of the film
genres - the genres of the film
directorNames - the director(s)
castNames - the top-billed cast of the film (there will probably be some noise here and there but it's the easiest and most reliable option to find the main stars)
avgRating - the average rating of the film
numVotes - how many people rated the film
directorAvgRating - the average rating of the director's filmography
directorMovieCount - how many films the director has made
castAvgRating - the average rating of the cast's filmography (total, not individual)
castMovieCount - the number of films the cast have starred in (this seems to need fixing)
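As a rough illustration of the aggregate columns, here is a minimal sketch of how directorAvgRating and directorMovieCount could be derived from the per-film rows. It is not the code in main.py: the director_stats helper and the toy rows are made up, and it assumes one director per film for simplicity.

```python
from collections import defaultdict
from statistics import mean

# Toy rows using column names from the list above; the values are made up.
rows = [
    {"tconst": "tt0000001", "directorNames": "Director A", "avgRating": 7.0},
    {"tconst": "tt0000002", "directorNames": "Director A", "avgRating": 8.0},
    {"tconst": "tt0000003", "directorNames": "Director B", "avgRating": 5.5},
]

def director_stats(rows):
    """Hypothetical helper: map each director to (average rating, film count)."""
    ratings = defaultdict(list)
    for row in rows:
        ratings[row["directorNames"]].append(row["avgRating"])
    return {name: (mean(vals), len(vals)) for name, vals in ratings.items()}

# Attach the aggregates back onto every film by that director.
stats = director_stats(rows)
for row in rows:
    avg, count = stats[row["directorNames"]]
    row["directorAvgRating"] = avg
    row["directorMovieCount"] = count
```

The cast columns would follow the same pattern, pooling ratings over every listed cast member instead of a single director.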

What Next?

This is where things get fun! We want to train a random forest on the dataset. You can do this by running uv run python train.py. So what does this do? Well, imagine we have a flowchart that asks yes/no questions to make a decision. Sorta like this:

is cast_avg > arbitrary_val?
    YES: is it Sci-Fi?
        YES: predict other_val
        NO:  predict arbitrary_val
    NO: is director_count > 5?
        YES: predict 6
        NO:  predict 5

Each path through this tree leads to a prediction, but a single tree can overfit (memorize the training data). So we build 100 trees (the default amount for the module), each trained on a random subset of the data and a random subset of the features. To predict, all 100 trees cast a vote, and the final prediction is the average of all votes. (We specifically use a regressor here, which predicts a continuous number, in this case a rating from 1 to 10.)
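To make the "many trees, averaged" idea concrete, here is a toy sketch in plain Python. It is a stand-in, not what train.py does: each "tree" is a single yes/no question on one made-up feature (a cast average), the data points are invented, and the random subset of features is left out because there is only one feature. The names fit_stump, fit_forest and predict are hypothetical.

```python
import random
from statistics import mean

def fit_stump(samples):
    """Fit a one-question tree: choose the threshold on x with the lowest squared error."""
    best = None
    for threshold in sorted({x for x, _ in samples}):
        left = [y for x, y in samples if x <= threshold]
        right = [y for x, y in samples if x > threshold]
        if not right:  # nothing on the NO side, so this is not a real split
            continue
        left_mean, right_mean = mean(left), mean(right)
        error = sum((y - left_mean) ** 2 for y in left)
        error += sum((y - right_mean) ** 2 for y in right)
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    if best is None:  # every sampled x was identical: predict the plain mean
        overall = mean(y for _, y in samples)
        return lambda x: overall
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def fit_forest(samples, n_trees=100, seed=0):
    """Train each tree on a bootstrap sample (rows drawn with replacement)."""
    rng = random.Random(seed)
    trees = []
    for _ in range(n_trees):
        bootstrap = [rng.choice(samples) for _ in samples]
        trees.append(fit_stump(bootstrap))
    return trees

def predict(trees, x):
    """Every tree votes; the forest's answer is the average vote."""
    return mean(tree(x) for tree in trees)

# Made-up (cast average, film rating) pairs.
data = [(5.0, 4.8), (5.5, 5.2), (6.0, 6.1), (6.5, 6.4),
        (7.0, 7.2), (7.5, 7.4), (8.0, 8.1)]
forest = fit_forest(data)
print(predict(forest, 5.2), predict(forest, 7.8))
```

Any single stump can only answer with one of two values, but because each one saw a different sample and picked a different threshold, the average over 100 of them changes much more gradually with the input.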

Here are the flags for train.py if you wish to play with the way we train the model a bit:

-d     --dataset       path to the processed dataset (default is data/processed/movies_dataset.csv)
-t     --test-size     controls the fraction of the dataset used for the test split when a random split is used (so the default of 0.2 means 20% test and 80% train). This is ignored if --time-split is provided (since that path uses a year cutoff instead)
-S     --time-split    use a time-based split with a year cutoff instead (train on rows < given year, and test on rows >= given year)
-D     --no-director   excludes director features from training
-C     --no-cast       excludes cast features from training

Now we are ready to do predictions! Run uv run python predict.py and pass in the fields you want to analyze. The director, cast, and genres are required here, but the other flags are optional. I would highly recommend providing the year and runtime, however, so you do not have to fall back to the defaults from train.py (the medians). Here is a list for help:

-d      --director      name of director
-c      --cast          name(s) of one or more cast members (space-separated)
-g      --genres        movie genre(s) (space-separated)
-y      --year          year the film was released
-r      --runtime       the runtime of the film in minutes
-D      --dataset       the path to processed dataset (has a default of data/processed/movies_dataset.csv)
-m      --model         path to trained model for higher accuracy (default is data/models/random_forest.joblib)
-M      --metadata      path to model metadata (default is data/models/model_metadata.joblib)

So for example, to use it in full (assuming you use default paths), you would run

uv run python predict.py -d "David Lynch" -c "Kyle MacLachlan" "Sheryl Lee" "Ray Wise" -g "Drama" "Horror" "Mystery" -y 1992 -r 134

Yes I am a big fan of Lynch.
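For a feel of what falling back to the medians means when -y or -r is left out, here is a small sketch. It is only an illustration: fill_defaults is a hypothetical helper, the training rows are made up, and how predict.py actually stores and looks up the medians is not shown here.

```python
from statistics import median

def fill_defaults(query, training_rows):
    """Replace a missing year or runtime with the median seen in training."""
    filled = dict(query)
    for field in ("startYear", "runtimeMinutes"):
        if filled.get(field) is None:
            filled[field] = median(row[field] for row in training_rows)
    return filled

# Made-up training rows, just enough to compute medians from.
training_rows = [
    {"startYear": 1990, "runtimeMinutes": 100},
    {"startYear": 2000, "runtimeMinutes": 120},
    {"startYear": 2010, "runtimeMinutes": 95},
]

# Year supplied, runtime omitted: only the runtime gets the median.
query = fill_defaults({"startYear": 1992, "runtimeMinutes": None}, training_rows)
```

A median is a sensible neutral guess, but it describes a typical film rather than your film, which is why supplying the real year and runtime is the better option.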

This is mostly just an amateur attempt at getting into more data sciencey stuff and machine learning, as it's been a long time since I touched anything like that and wanted to explore it more. It'll be refined further in future as it is a little bit simplistic (I would have done something with dairy farming if Earth Sciences NZ actually got back to me on my request for access to data records but oh well).

Oh and I'll paypal you $10 if you 1) know what I'm talking about, 2) why it's here specifically, and 3) how I managed to connect the dots between a Sopranos reference and a Star Wars reference :P

Notes

This will take a while to build as there is A LOT of data here.
