Predicting the popularity of various news topics (namely Obama, Microsoft, Economy and Palestine) on different social media platforms (Facebook, LinkedIn and GooglePlus).
News popularity on social media depends on multiple features such as the topic, the source of publication, the time span since publication and the sentiment score. We are provided with a dataset that contains news items and their respective social feedback on three platforms: Facebook, GooglePlus and LinkedIn. By analysing the historical trend and fitting machine learning models, this project aims to understand what drives news popularity on social media and to obtain the best regression model.
A large dataset of news items and their respective social feedback on multiple platforms: Facebook, Google+ and LinkedIn. The collected data covers a period of about 8 months, between November 2015 and July 2016, and accounts for roughly 100,000 news items on four different topics: Economy, Microsoft, Obama and Palestine.
We have 13 datasets, categorised into 2 types: the News_Final dataset and the Time Span datasets.
The News_Final dataset contains news items related to 4 topics (Microsoft, Obama, Economy and Palestine) collected from different sources, along with their popularity on 3 social media platforms, namely Facebook, GooglePlus and LinkedIn.
Feature Name | Type | Description |
---|---|---|
IDLink | (numeric) | Unique identifier of news items |
Title | (string) | Title of the news item according to the official media sources |
Headline | (string) | Headline of the news item according to the official media sources |
Source | (string) | Original news outlet that published the news item |
Topic | (string) | Query topic used to obtain the items in the official media sources |
PublishDate | (timestamp) | Date and time of the news items' publication |
SentimentTitle | (numeric) | Sentiment score of the text in the news items' title |
SentimentHeadline | (numeric) | Sentiment score of the text in the news items' headline |
Facebook | (numeric) | Final value of the news items' popularity according to the social media source Facebook |
GooglePlus | (numeric) | Final value of the news items' popularity according to the social media source Google+ |
LinkedIn | (numeric) | Final value of the news items' popularity according to the social media source LinkedIn |
The 12 time-span datasets are the permutations of the 4 topics and 3 platforms stated above. Each dataset tracks a news item's popularity over the 2 days following publication, divided into 20-minute intervals, i.e. 144 time-span features.
Feature Name | Type | Description |
---|---|---|
IDLink | (numeric) | Unique identifier of news items |
TS1 | (numeric) | Level of popularity in time slice 1 (0-20 minutes upon publication) |
TS2 | (numeric) | Level of popularity in time slice 2 (20-40 minutes upon publication) |
TS... | (numeric) | Level of popularity in time slice ... |
... | ... | ... |
... | ... | ... |
TS144 | (numeric) | Final level of popularity after 2 days (2860-2880 minutes upon publication) |
After loading the datasets, we observed that the Source column in the news dataframe contained 279 null values, so we replaced them with the source that had published the maximum number of news items. We then dropped news items whose publish date was before November 2015, as they fall outside the collection period. We also dropped duplicate rows and news items with null headlines. Finally, we treated outliers by capping values at the 90th percentile.
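Below is a minimal sketch of the cleaning steps described above, assuming the News_Final schema; `news_df`, the file name and the exact thresholds are illustrative rather than the project's actual code.

```python
import pandas as pd

# Load the News_Final dataset (file name assumed)
news_df = pd.read_csv("News_Final.csv", parse_dates=["PublishDate"])

# Fill the missing Source values with the most frequent source
news_df["Source"] = news_df["Source"].fillna(news_df["Source"].mode()[0])

# Drop items published before the collection period (November 2015)
news_df = news_df[news_df["PublishDate"] >= "2015-11-01"]

# Drop duplicate rows and items with missing headlines
news_df = news_df.drop_duplicates().dropna(subset=["Headline"])

# Cap popularity outliers at the 90th percentile
for col in ["Facebook", "GooglePlus", "LinkedIn"]:
    cap = news_df[col].quantile(0.90)
    news_df[col] = news_df[col].clip(upper=cap)
```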
We observed that the values of our dependent variables were quite large compared to the values of the independent variables, so we applied StandardScaler to standardize the data (a minimal sketch follows the comparison below).
Data before standardization vs. data after standardization (plots available in the notebook).
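As a rough illustration of the transformation (which columns were actually scaled in the notebook is an assumption here):

```python
from sklearn.preprocessing import StandardScaler

# Standardize numeric columns to zero mean and unit variance
cols_to_scale = ["SentimentTitle", "SentimentHeadline", "Facebook", "GooglePlus", "LinkedIn"]
scaler = StandardScaler()
news_df[cols_to_scale] = scaler.fit_transform(news_df[cols_to_scale])
```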
In the exploratory data analysis, we categorized SentimentTitle and SentimentHeadline into positive, neutral and negative sentiment categories. Since the number of sources publishing news items was quite large, we grouped the sources into 4 types based on the number of news items each source published. We then compared the popularity of news items across the three social media platforms and examined how popularity varies by source and topic. Using the time-span datasets, we also tracked how popularity changes within the two days after publication.
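A minimal sketch of the sentiment and source categorization, assuming a zero threshold for neutral sentiment and a quartile split for source types (both are assumptions; the notebook may use different cut-offs and column names):

```python
import pandas as pd

def sentiment_category(score):
    """Bucket a sentiment score into positive / neutral / negative."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

news_df["TitleSentimentCat"] = news_df["SentimentTitle"].apply(sentiment_category)
news_df["HeadlineSentimentCat"] = news_df["SentimentHeadline"].apply(sentiment_category)

# Group sources into 4 tiers by the number of items they published
source_counts = news_df["Source"].value_counts()
tiers = pd.qcut(source_counts.rank(method="first"), 4, labels=["low", "mid", "high", "top"])
news_df["SourceType"] = news_df["Source"].map(tiers)
```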
To handle the textual data in the news titles and headlines, we used TfidfVectorizer and CountVectorizer.
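For example (vectorizer settings such as `max_features` are assumptions, not the project's actual configuration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

tfidf = TfidfVectorizer(stop_words="english", max_features=1000)
title_tfidf = tfidf.fit_transform(news_df["Title"].fillna(""))

count_vec = CountVectorizer(stop_words="english", max_features=1000)
headline_counts = count_vec.fit_transform(news_df["Headline"].fillna(""))
```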
We used one-hot encoding to convert categorical columns such as source type, topic and sentiment category into numerical values so that our models can extract useful information from these columns.
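A minimal sketch using pandas `get_dummies` (the project may have used sklearn's OneHotEncoder instead; the column names follow the earlier illustrative steps):

```python
import pandas as pd

categorical_cols = ["Topic", "SourceType", "TitleSentimentCat", "HeadlineSentimentCat"]
news_df = pd.get_dummies(news_df, columns=categorical_cols, drop_first=True)
```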
For feature selection, we used ExtraTreesRegressor, which ranks features by their impurity-based importance and helps identify the features that matter most for our model. We then computed the correlation between the independent and dependent features to understand their relationship.
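A rough sketch of the feature ranking, assuming `X` holds the engineered features and `y` one of the popularity targets (e.g. Facebook):

```python
import pandas as pd
from sklearn.ensemble import ExtraTreesRegressor

# Rank features by impurity-based importance
model = ExtraTreesRegressor(n_estimators=100, random_state=42)
model.fit(X, y)
importances = pd.Series(model.feature_importances_, index=X.columns).sort_values(ascending=False)
print(importances.head(10))

# Correlation of each independent feature with the target
print(X.corrwith(y).sort_values(ascending=False).head(10))
```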
For modeling, we tried various regression algorithms (a minimal fitting sketch follows this list):
- Decision Tree Regression
- CatBoost
- LightGBM
- GradientBoosting
- KNN
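A minimal fitting sketch for one target, using two of the algorithms above (`X` and `y` are the feature matrix and target from the previous steps; hyperparameters are left at their defaults here):

```python
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from lightgbm import LGBMRegressor

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    "DecisionTree": DecisionTreeRegressor(random_state=42),
    "LightGBM": LGBMRegressor(random_state=42),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, "test R2:", model.score(X_test, y_test))
```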
Tuning of hyperparameters is necessary to obtain better accuracy and to avoid overfitting. In our project, we used (a tuning sketch follows this list):
- GridSearchCV,
- RandomizedSearchCV and
- HalvingRandomSearchCV.
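A minimal tuning sketch with HalvingRandomSearchCV and CatBoost (the parameter grid and CV settings are illustrative, not the grid used in the project):

```python
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.model_selection import HalvingRandomSearchCV
from catboost import CatBoostRegressor

param_distributions = {
    "depth": [4, 6, 8, 10],
    "learning_rate": [0.01, 0.05, 0.1],
    "iterations": [200, 500, 1000],
}
search = HalvingRandomSearchCV(
    CatBoostRegressor(verbose=0, random_state=42),
    param_distributions,
    scoring="r2",
    cv=3,
    random_state=42,
)
search.fit(X_train, y_train)
print("Best params:", search.best_params_)
print("Best CV R2:", search.best_score_)
```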
Performance of the different hyperparameter tuning techniques (GridSearchCV, RandomizedSearchCV, HalvingRandomSearchCV) on Google Colaboratory is compared in the notebook.
We used plotly to visualize the parameter scores obtained during hyperparameter tuning for the different algorithms. The plot in the notebook shows the hyperparameter tuning of CatBoost using HalvingRandomSearchCV; a sketch of how such a plot can be produced follows.
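A hypothetical way to build such a plot from the fitted search object (`param_learning_rate` and `param_depth` are assumed to be among the tuned parameters):

```python
import pandas as pd
import plotly.express as px

# Each row of cv_results_ is one candidate parameter combination
results = pd.DataFrame(search.cv_results_)
fig = px.scatter(
    results,
    x="param_learning_rate",
    y="mean_test_score",
    color="param_depth",
    title="CatBoost hyperparameter search - mean CV score per candidate",
)
fig.show()
```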
We used evaluation metrics such as MSE, RMSE, R2 score, adjusted R2 and RMSLE to measure the accuracy and error rate of our models before and after hyperparameter tuning.
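A minimal sketch of computing these metrics for one model (`y_test`, `X_test` and `search` come from the steps above; clipping negatives before the RMSLE is an implementation choice here, not necessarily the project's):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_squared_log_error, r2_score

y_pred = search.best_estimator_.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

# Adjusted R2 penalizes the number of features k relative to the sample size n
n, k = X_test.shape
adjusted_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)

# RMSLE requires non-negative inputs, so clip negatives to zero
rmsle = np.sqrt(mean_squared_log_error(np.clip(y_test, 0, None), np.clip(y_pred, 0, None)))

print(f"MSE={mse:.3f} RMSE={rmse:.3f} R2={r2:.3f} AdjR2={adjusted_r2:.3f} RMSLE={rmsle:.3f}")
```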
We used SHAP value plots to measure the importance of the independent features in the model predictions and to show whether each variable has a positive or negative relationship with the target (see the sketch below).
Force plot, summary plot and summary plot (bar type) figures are shown in the notebook.
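A minimal SHAP sketch, assuming a fitted tree-based model such as the best estimator from the search above (the force plot renders interactively in a notebook):

```python
import shap

explainer = shap.TreeExplainer(search.best_estimator_)
shap_values = explainer.shap_values(X_test)

# Summary plots: dot variant and bar variant
shap.summary_plot(shap_values, X_test)
shap.summary_plot(shap_values, X_test, plot_type="bar")

# Force plot for a single prediction
shap.initjs()
shap.force_plot(explainer.expected_value, shap_values[0, :], X_test.iloc[0, :])
```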
- Starting from loading the datasets, we covered data wrangling, EDA, feature selection and modeling.
- The R2 score obtained across the models ranged from about 85% to 92% for each of the three dependent variables.
- We then carried out hyperparameter tuning and obtained the best score and best parameters for each model, but the R2 score did not improve much.
- The best model therefore achieves an R2 score of 92%. We can also conclude that the TS columns from the time-span datasets have the strongest influence on our dependent variables.
Here is a glimpse of a few of the graphs we plotted; there are many more in the notebook, so please have a look.
Meet The Team:
Bhwesh Gaur: https://github.com/bhweshgaur
S Sravya Sri : https://github.com/SSravyaSri
Priyanka Dodeja : https://github.com/PriyankaDodeja
Ankit Bansal : https://github.com/ankit986