Conclusion, limitations and future work
In this work we looked for correlations between individual features and ROI, and for a model that predicts box office performance well. Our experiments used data collected from TMDb and YouTube. The correlation between each feature and ROI is not obvious, as the fitted regression lines have very small slopes. We then treated ROI prediction as both a regression problem and a classification problem. Among the regression methods, the R-squared scores show that random forests perform better than the others, but none of the regression methods is a good predictor of ROI. Among the classification methods, we conclude that CART is the best prediction model for this dataset. Moreover, the confusion matrices show that all classifiers tend to underestimate the box office performance of movies, possibly because poorly performing movies share more common characteristics and because our training set is unbalanced across the three ROI classes. We also find that some features, such as youtubeView, which represents a movie's popularity on social media, contribute the most to the classification.
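To make the confusion matrix observation above concrete, the following minimal sketch shows one way to count how often a classifier predicts a lower or a higher ROI class than the true one. It is only an illustration, not the code used in our experiments: the class names ("low", "medium", "high"), the helper functions and the labels are all hypothetical, and the labels are made up rather than taken from our results.

```python
from collections import Counter

# Hypothetical ordered ROI classes, lowest to highest (names are assumptions).
CLASSES = ["low", "medium", "high"]


def confusion_matrix(y_true, y_pred, classes=CLASSES):
    """Return matrix[i][j] = number of samples with true class i predicted as class j."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


def under_over_counts(matrix):
    """Count predictions that fall below (under) or above (over) the true class."""
    n = len(matrix)
    under = sum(matrix[i][j] for i in range(n) for j in range(i))
    over = sum(matrix[i][j] for i in range(n) for j in range(i + 1, n))
    return under, over


# Made-up labels for illustration only; these are not results from the paper.
y_true = ["low", "low", "low", "low", "medium", "medium", "high", "high"]
y_pred = ["low", "low", "low", "medium", "low", "medium", "low", "medium"]

matrix = confusion_matrix(y_true, y_pred)
under, over = under_over_counts(matrix)
class_counts = Counter(y_true)  # shows how unbalanced the true classes are

print("confusion matrix:", matrix)
print("underestimated:", under, "overestimated:", over)
print("class counts:", dict(class_counts))
```

In a matrix laid out this way, entries below the diagonal are underestimates and entries above it are overestimates, so a classifier that underestimates will have most of its off-diagonal mass below the diagonal. Printing the class counts alongside makes it easy to see whether such a bias coincides with an unbalanced label distribution.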
Our research has some limitations. The dataset is noisy, and a large portion of it was discarded during cleaning because of poor quality. In addition, our classifiers tend to underestimate the box office performance of movies. Nevertheless, we identified several important and potentially important features, and there is reason to believe that our methods would give better results on a larger dataset.
References
Anders, C. (2011). How much money does a movie need to make to be profitable?
Asur, S., & Huberman, B. A. (2010, August). Predicting the future with social media. In Web Intelligence and Intelligent Agent Technology (WI-IAT), 2010 IEEE/WIC/ACM International Conference on (Vol. 1, pp. 492-499). IEEE.
Panaligan, R., & Chen, A. (2013). Quantifying movie magic with google search. Google Whitepaper—Industry Perspectives+ User Insights.
TMDb: The Movie Database. (n.d.). Retrieved September and October 2017, from https://www.themoviedb.org/