Prediction of Movie Box Office Performance

by Qian Wang, Tianrun Sun, and Shuo Liu


This figure shows the distribution, by country of production, of movies released in the US from 2005 to 2017, based on data collected from TMDb.

Our site looks at quantifiable, easily obtainable, and widely recognized factors that affect a movie’s box office success. The goal of this analysis is to test the relationships among multiple variables: budget, revenue, genre, cast, crew, runtime, language, etc. We applied traditional statistical analysis methods and machine learning models to find patterns for predicting the box office performance of movies, using data collected from the TMDb movie database and YouTube. ROI, i.e. return on investment, is used to evaluate box office performance in this project. We address the task as both a regression problem and a classification problem, and compare the performance of different methods and techniques on these two problems. We also identify the most significant features in our dataset for the regression and classification tasks.

Keywords: Data mining; Statistical analysis; Machine learning; YouTube; TMDb


Movies are an integral part of modern culture, generating billions of dollars each year and delivering rich, intricately crafted stories to a worldwide audience. The film industry not only serves as a deeply expressive artistic medium but is also a growing business, with global box office revenue projected to rise from approximately 38 billion USD in 2016 to about 50 billion USD in 2020 (Statista, 2016). Moreover, the US ranks among the countries that produce the most movies per year. It is therefore interesting to ask which types of movies are more likely to succeed at the box office. Understanding which features of a movie affect moviegoers’ behavior puts movie marketers in a better position to align and adjust their strategies, capture the interest of potential moviegoers, and improve the box office performance of their films.

This topic is of considerable interest in many industries, and research has been done on models for predicting movie revenue. In 2013, Google published a research paper on quantifying movie magic with Google search. It analyzed Google search volume, the number of paid clicks, and movie trailer engagement, and concluded that the number of trailer searches on both Google and YouTube is a leading indicator of box office success.


This figure shows the number of movies released in the US from 2005 to 2017, by release date.

In our research, we focus on analyzing how different factors affect a movie’s box office success, including budget, revenue, runtime, genre, cast, crew, etc. In total, we use 13 attributes derived from the TMDb movie database to represent each movie. On the social media side, some factors are either hard to quantify or difficult to obtain reliably; for example, the Google search volume for words related to a movie does not necessarily reflect moviegoers’ actual interest in it. Instead, we use factors that can be obtained consistently, such as YouTube trailer reviews, since trailers are one of the most important and effective ways of advertising a movie. All of these attribute values are used to assign movies to classes: Hit, Neutral, or Flop. We use a multiple linear regression model to analyze all the features and determine which are most related to box office performance. We also apply predictive models such as decision trees, kNN, SVM, and other methods, comparing their accuracy to find a good prediction model.
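To make the target variable concrete, here is a minimal sketch of how ROI could be computed from budget and revenue and then mapped to the Hit, Neutral, and Flop classes. It assumes the usual definition ROI = (revenue - budget) / budget. The thresholds, the function names `roi` and `label`, and the sample records are hypothetical and chosen only for illustration; they are not the cut-offs or data used in our experiments.

```python
from typing import Optional

# Hypothetical cut-offs, for illustration only.
HIT_THRESHOLD = 1.0   # ROI >= 1.0: revenue is at least twice the budget
FLOP_THRESHOLD = 0.0  # ROI < 0.0: revenue is below the budget


def roi(budget: float, revenue: float) -> Optional[float]:
    """Return (revenue - budget) / budget, or None if the budget is missing."""
    if budget <= 0:
        # A zero budget is treated here as "unknown"; such rows
        # cannot be scored and would be dropped during cleaning.
        return None
    return (revenue - budget) / budget


def label(roi_value: float) -> str:
    """Map an ROI value to one of the three classes."""
    if roi_value >= HIT_THRESHOLD:
        return "Hit"
    if roi_value < FLOP_THRESHOLD:
        return "Flop"
    return "Neutral"


# Made-up records: (title, budget in USD, revenue in USD).
movies = [
    ("Movie A", 100_000_000, 250_000_000),
    ("Movie B", 50_000_000, 60_000_000),
    ("Movie C", 80_000_000, 40_000_000),
    ("Movie D", 0, 5_000_000),  # missing budget
]

for title, budget, revenue in movies:
    value = roi(budget, revenue)
    if value is None:
        print(f"{title}: skipped (no budget)")
    else:
        print(f"{title}: ROI = {value:.2f}, class = {label(value)}")
```

The same ROI value can serve as the continuous target for the regression task, while the class label serves as the target for the classification task.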

In the following sections, we describe the movie data collected from 2005 to 2017 and the view counts of the related YouTube trailers. We explain in detail how the data were collected and cleaned. After that, we present our experiments. Results and discussion are given in the last section.