Machine Learning to Predict the Success of Movies

CIS 519: Applied Machine Learning, Penn Engineering - Spring 2019


Project Report

Our video presentation explains our process and findings in constructing an algorithm designed to predict the success of a movie.

Overview

The goal of our project is to predict a movie's success from a set of features available before the movie is released. Because the outcome depends on many design choices, the project lends itself to experimentation across data sets. One invariant we set was that we train only on information available prior to a film's release. We leverage the multiple movie data sets available to collect meaningful, telling features, combining the IMDb database with other data sets that include an IMDb ID. Using PCA and standard data-cleaning techniques, we explore several ways of merging redundant features and converting categorical data into numeric features that learning algorithms can parse. Lastly, we explore text classification techniques such as Naive Bayes, multinomial logistic regression, and LDA to test whether text data is predictive of a film's success, and we use the resulting classifications to generate novel features for our model.
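The dataset combination described above can be sketched with a pandas join on the shared IMDb ID. The column names and rows below are hypothetical stand-ins, not the actual datasets we used:

```python
import pandas as pd

# Hypothetical rows from two sources; real column names will differ.
imdb = pd.DataFrame({
    "imdb_id": ["tt0111161", "tt0068646"],
    "avg_rating": [9.3, 9.2],
})
metadata = pd.DataFrame({
    "imdb_id": ["tt0111161", "tt0068646"],
    "budget": [25_000_000, 6_000_000],
})

# Inner join on the shared IMDb ID keeps only movies present in both sets.
combined = imdb.merge(metadata, on="imdb_id", how="inner")
```

An inner join discards movies missing from either source, which keeps the feature matrix free of the NaN columns a left join would introduce.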

Future Steps

  • LDA to generate good topics for movies

  • Image classification on poster data

  • Investigating TF-IDF versus Naive Bayes for text classification
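The TF-IDF-versus-Naive-Bayes comparison above could be set up as follows. This is a minimal sketch with invented toy overviews and labels; the real pipeline would use the actual movie overview text:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy movie overviews with invented binary success labels.
overviews = [
    "a gripping heist thriller with a twist",
    "heartwarming family comedy full of laughs",
    "a dull sequel with a tired heist plot",
    "charming comedy about an unlikely family",
]
labels = [1, 1, 0, 1]

# Raw term counts: the representation Multinomial Naive Bayes classically expects.
counts = CountVectorizer().fit_transform(overviews)
clf_counts = MultinomialNB().fit(counts, labels)

# TF-IDF weighting down-weights words that appear in most overviews,
# so distinctive terms dominate the feature vectors.
tfidf_features = TfidfVectorizer().fit_transform(overviews)
clf_tfidf = MultinomialNB().fit(tfidf_features, labels)
```

On real data, the two representations would be compared with held-out accuracy rather than training fit.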

Model specifics

  • Defined success as the average rating given to the movie by viewers.

  • Labels were initially binary but were later changed to numeric values from 0 to 10.

  • Expanded features that were originally a list of values (e.g. genres) into three separate features, each holding a single value.

  • Utilized unique IMDb IDs to combine data sets.

  • Generated numeric text features utilizing the movie overviews.
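The genre-expansion step in the list above can be sketched with pandas string splitting. The raw column format here is an assumption (comma-separated genre strings); the actual source format may differ:

```python
import pandas as pd

# Hypothetical raw column: up to three genres packed into one string.
df = pd.DataFrame({"genres": ["Action,Adventure,Sci-Fi", "Comedy", "Drama,Romance"]})

# Split into three single-valued columns; missing slots become None/NaN.
genre_cols = df["genres"].str.split(",", expand=True).reindex(columns=range(3))
genre_cols.columns = ["genre_1", "genre_2", "genre_3"]
df = pd.concat([df.drop(columns="genres"), genre_cols], axis=1)
```

The resulting single-valued categorical columns can then be label- or one-hot-encoded for the learner.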

Our final model was a random forest with 500 estimators.
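A minimal sketch of that final model, using synthetic stand-in data (the feature count and random values are placeholders, not our real feature set):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
# Synthetic stand-ins: 200 movies, 6 pre-release features, ratings in [0, 10).
X = rng.random((200, 6))
y = 10 * rng.random(200)

# 500 trees, matching the final model's estimator count.
model = RandomForestRegressor(n_estimators=500, random_state=0).fit(X, y)
preds = model.predict(X)
```

A regressor (rather than a classifier) fits the numeric 0-10 labels described above; each prediction is an average over the 500 trees' leaf values.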