Regression Method

Correlations

In this subsection, we use correlation analysis to check whether any features are correlated with ROI. We chose correlation because it is a simple way to get a rough picture of the relationship between each feature and ROI. The data set has over 16 features; some of them are useful for the later analysis, while others have no effect on it. Correlations may show whether or not a given feature is useful to the project, so this analysis informs feature selection in the following parts. The results are shown in the following figure.

These 8 plots show the correlations between 8 features and ROI.

As we can see, no feature shows a significant correlation with ROI. This suggests that linear regression methods may also perform poorly. It also shows that any relationship between the features and ROI, if one exists, is not easy to detect.
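The correlation screening described above can be sketched as follows. This is a minimal illustration, not the code used for the figure: it assumes Pearson correlation (the text does not say which coefficient was used), and the data and the feature names `feature_a`, `feature_b`, and `feature_c` are synthetic placeholders rather than columns of the movie data set.

```python
import numpy as np

def correlation_with_target(X, y, names):
    """Return {feature name: Pearson r between that feature column and y}."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    result = {}
    for j, name in enumerate(names):
        # np.corrcoef returns a 2x2 correlation matrix; entry [0, 1] is r(x_j, y)
        result[name] = float(np.corrcoef(X[:, j], y)[0, 1])
    return result

# Synthetic stand-in data (NOT the movie data set): 500 rows, 3 placeholder features
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
# The placeholder "ROI" depends linearly on the first feature only, plus noise
roi = 0.8 * X[:, 0] + rng.normal(scale=0.5, size=500)

corr = correlation_with_target(X, roi, ["feature_a", "feature_b", "feature_c"])

# List features from strongest to weakest absolute correlation
for name, r in sorted(corr.items(), key=lambda kv: -abs(kv[1])):
    print(f"{name:>10s}  r = {r:+.3f}")
```

In a screening like this, features whose coefficient is close to zero show no linear association with the target. They may still be related to it in a nonlinear way, which a correlation coefficient of this kind would not reveal.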

Regression

In this part we try to predict the ROI score of movies with regression methods. We apply several regressors from the sklearn package to our data set: k-nearest neighbors (KNN), a decision tree regressor (CART), a linear support vector machine (LinearSVR), three linear regressors (without penalization (LR), with l1 penalization (Lasso), and with l2 penalization (Ridge)), two randomized tree-based regressors (random forest (RF) and extra trees (ExtraTree)), and two boosting methods (AdaBoost and gradient boosting (GB)). To evaluate the average performance of the methods, we ran cross-validation over random splits of our data set.
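A comparison of this kind might be set up as in the sketch below. Several details are assumptions, since the text does not specify them: `ShuffleSplit` with 5 splits and a 20% test fraction stands in for the random splits, all hyperparameters are library defaults (apart from fewer trees and a higher iteration cap to keep the run short), "ExtraTree" is taken to mean `ExtraTreesRegressor`, and the data is synthetic rather than the movie data set.

```python
import numpy as np
from sklearn.ensemble import (AdaBoostRegressor, ExtraTreesRegressor,
                              GradientBoostingRegressor, RandomForestRegressor)
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import LinearSVR
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (NOT the movie data set): 300 rows, 16 features,
# with a target that depends linearly on the first three features plus noise
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 16))
w = np.zeros(16)
w[:3] = [3.0, -2.0, 1.0]
y = X @ w + rng.normal(scale=1.0, size=300)

# One entry per method named in the text; settings are assumptions
models = {
    "KNN": KNeighborsRegressor(),
    "CART": DecisionTreeRegressor(random_state=0),
    "LinearSVR": LinearSVR(max_iter=10000, random_state=0),
    "LR": LinearRegression(),
    "Lasso": Lasso(),
    "Ridge": Ridge(),
    "RF": RandomForestRegressor(n_estimators=50, random_state=0),
    "ExtraTree": ExtraTreesRegressor(n_estimators=50, random_state=0),
    "AdaBoost": AdaBoostRegressor(random_state=0),
    "GB": GradientBoostingRegressor(random_state=0),
}

# Repeated random train/test splits (assumed: 5 splits, 20% held out)
cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)

scores = {}
for name, model in models.items():
    # scoring="r2" returns one r-squared value per split; average them
    fold_scores = cross_val_score(model, X, y, cv=cv, scoring="r2")
    scores[name] = float(fold_scores.mean())

# Print methods from best to worst mean r-squared
for name, s in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name:>10s}  mean r2 = {s:+.3f}")
```

Reusing the same `cv` object with a fixed `random_state` means every model is scored on identical splits, so differences in the mean score come from the models rather than from the partitioning.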

We use the r-squared score to evaluate the performance of these methods. Results are shown in the following table. Random forest performs better than the other methods, but all methods perform very poorly, which is consistent with the expectation stated above about the performance of regression methods.

Table: r-squared scores of the different methods
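For reference when reading the scores, the sketch below computes r-squared from its standard definition, 1 minus the ratio of the residual sum of squares to the total sum of squares. The helper name `r_squared` and the toy values are illustrative only and do not come from the movie data set.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return float(1.0 - ss_res / ss_tot)

y = np.array([1.0, 2.0, 3.0, 4.0])

perfect = r_squared(y, y)                            # exact predictions
baseline = r_squared(y, np.full_like(y, y.mean()))   # always predict the mean
worse = r_squared(y, y[::-1])                        # predictions in reversed order

print(perfect, baseline, worse)
```

A score of 1 means perfect prediction and a score of 0 means the model does no better than always predicting the mean of the target. The score can also be negative when a model does worse than that constant baseline, so values near or below zero indicate that a method has learned little about the target.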