Machine Learning to Predict the Success of Movies
CIS519: Applied Machine Learning, Penn Engineering - Spring 2019
Python
Overview
The goal of our project is to predict a movie's success from a set of features available before the movie is released. The outcome depends on a multitude of design choices, so the project lends itself to experimentation across data sets. One invariant we set was that we train only on information available prior to a film's release. We intend to draw on the multiple movie data sets available to collect a set of meaningful, informative features. In particular, we aim to combine the IMDb database with other data sets that include an IMDb ID as a feature. Using PCA and standard data cleaning techniques, we explore several ways of combining redundant features, and we convert categorical data into numeric features that algorithms can parse. Lastly, we hope to explore text classification and topic modeling algorithms such as Naive Bayes, multinomial logistic regression, and LDA to see whether text data is indicative of a film's success, and to use that text information to generate novel features for our model.
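As a rough illustration of the joining and encoding steps described above, the sketch below merges two toy tables on a shared IMDb ID and one-hot encodes a categorical column with pandas. The table contents and the column names (`imdb_id`, `runtime`, `language`, `budget`) are assumptions made for the example, not the project's actual schema.

```python
import pandas as pd

# Hypothetical toy tables; column names and values are illustrative only.
imdb = pd.DataFrame({
    "imdb_id": ["tt0000001", "tt0000002", "tt0000003"],
    "runtime": [95, 120, 101],
    "language": ["en", "fr", "en"],
})
other = pd.DataFrame({
    "imdb_id": ["tt0000001", "tt0000002", "tt0000004"],
    "budget": [1_000_000, 5_000_000, 250_000],
})

# An inner join keeps only the films that appear in both sources.
merged = imdb.merge(other, on="imdb_id", how="inner")

# One-hot encode the categorical column so a model receives numeric inputs.
features = pd.get_dummies(merged, columns=["language"], dtype=int)
print(features)
```

An inner join is used here so that every remaining row has values from both tables; a left join would keep all IMDb rows but introduce missing values that then need cleaning.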
Future Steps
LDA to generate good topics for movies
Image classification on poster data
Investigating TF-IDF versus Naive Bayes for text classification
Model specifics
Defined success as the average rating given to the movie by viewers.
Labels were initially binary but were later changed to numeric values from 0 to 10.
Expanded features that were originally a series of values (e.g. genres) into three separate features with a single value each.
Utilized unique IMDb IDs to combine data sets.
Generated numeric text features utilizing the movie overviews.
Our final model was a random forest with 500 estimators.
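The steps above can be sketched end to end on toy data. This is a minimal illustration under several assumptions, not the project's actual code: the data frame and its column names are invented, genre lists are truncated or padded with a placeholder `"none"` to fill three slots, TF-IDF stands in for whichever text featurization was used, and a regressor is chosen because the labels are numeric ratings from 0 to 10. Only the 500 estimators come from the description above.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented toy data; column names are assumptions for this example.
movies = pd.DataFrame({
    "genres": [
        ["Drama", "Romance"],
        ["Action"],
        ["Comedy", "Family", "Fantasy", "Adventure"],
        ["Horror", "Thriller"],
    ],
    "overview": [
        "Two strangers fall in love during a summer in Paris.",
        "A soldier fights to rescue hostages from a hijacked train.",
        "A family discovers a magical world hidden in their attic.",
        "A group of friends is hunted in an abandoned hospital.",
    ],
    "rating": [7.1, 5.4, 6.3, 4.8],
})


def first_three(genres):
    """Keep the first three genres, padding short lists with 'none'."""
    padded = list(genres[:3]) + ["none"] * 3
    return padded[:3]


# Expand the list-valued column into three single-valued columns.
genre_cols = pd.DataFrame(
    movies["genres"].apply(first_three).tolist(),
    columns=["genre_1", "genre_2", "genre_3"],
)
genre_features = pd.get_dummies(genre_cols, dtype=float)

# Turn each overview into a numeric TF-IDF vector.
tfidf = TfidfVectorizer(stop_words="english")
text_features = tfidf.fit_transform(movies["overview"]).toarray()

# Stack genre and text features side by side (dense is fine at toy scale).
X = np.hstack([genre_features.to_numpy(), text_features])
y = movies["rating"].to_numpy()

# Random forest with 500 trees, predicting the average rating.
model = RandomForestRegressor(n_estimators=500, random_state=0)
model.fit(X, y)
predictions = model.predict(X)
print(predictions)
```

On a real data set the model would be evaluated on held-out films rather than on the training rows, and the dense `toarray()` call would likely be replaced by a sparse matrix to keep memory use manageable.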