A recent survey has shown that people spend a considerable amount of time reading online, and screen time is still rising. The number of views a piece of content receives is therefore of interest, as more views translate into more revenue. A system that predicts whether a news article will be popular can help editors identify how to improve their content and how to generate significant financial returns.
In this project we create two models to solve this classification problem: one using `Automated ML` and one customized model with hyperparameters tuned using `HyperDrive`. We then compare the performance of both models and deploy the best-performing one. Finally, the resulting endpoint is used to request predictions.
Before we get started with this project, it is important to set up our local development environment to match the Azure AutoML development environment.
Below are the steps:
- Download and install `anaconda`
- Open `anaconda CMD` in the new folder
- Clone this repo
- `cd` into the local directory
- Run this `conda env create --file hyperdrive_dependencies.yml`
- Run `jupyter notebook`
The dataset used in this project is the Online News Popularity Data Set, made available on the UCI Machine Learning Repository.
It summarizes a heterogeneous set of features about the articles published by Mashable between 2013 and 2015.
- Number of Instances: 39797
- Number of Attributes: 61
  - 58 predictive attributes
  - 2 non-predictive (`url` and `timedelta`)
  - 1 target column
Attribute Information:
Column Names | Details |
---|---|
`url` | URL of the article |
`timedelta` | Days between the article publication and the dataset acquisition |
`n_tokens_title` | Number of words in the title |
`n_tokens_content` | Number of words in the content |
`n_unique_tokens` | Rate of unique words in the content |
`n_non_stop_words` | Rate of non-stop words in the content |
`n_non_stop_unique_tokens` | Rate of unique non-stop words in the content |
`num_hrefs` | Number of links |
`num_self_hrefs` | Number of links to other articles published by Mashable |
`num_imgs` | Number of images |
`num_videos` | Number of videos |
`average_token_length` | Average length of the words in the content |
`num_keywords` | Number of keywords in the metadata |
`data_channel_is_lifestyle` | Is data channel 'Lifestyle'? |
`data_channel_is_entertainment` | Is data channel 'Entertainment'? |
`data_channel_is_bus` | Is data channel 'Business'? |
`data_channel_is_socmed` | Is data channel 'Social Media'? |
`data_channel_is_tech` | Is data channel 'Tech'? |
`data_channel_is_world` | Is data channel 'World'? |
`kw_min_min` | Worst keyword (min. shares) |
`kw_max_min` | Worst keyword (max. shares) |
`kw_avg_min` | Worst keyword (avg. shares) |
`kw_min_max` | Best keyword (min. shares) |
`kw_max_max` | Best keyword (max. shares) |
`kw_avg_max` | Best keyword (avg. shares) |
`kw_min_avg` | Avg. keyword (min. shares) |
`kw_max_avg` | Avg. keyword (max. shares) |
`kw_avg_avg` | Avg. keyword (avg. shares) |
`self_reference_min_shares` | Min. shares of referenced articles in Mashable |
`self_reference_max_shares` | Max. shares of referenced articles in Mashable |
`self_reference_avg_sharess` | Avg. shares of referenced articles in Mashable |
`weekday_is_monday` | Was the article published on a Monday? |
`weekday_is_tuesday` | Was the article published on a Tuesday? |
`weekday_is_wednesday` | Was the article published on a Wednesday? |
`weekday_is_thursday` | Was the article published on a Thursday? |
`weekday_is_friday` | Was the article published on a Friday? |
`weekday_is_saturday` | Was the article published on a Saturday? |
`weekday_is_sunday` | Was the article published on a Sunday? |
`is_weekend` | Was the article published on the weekend? |
`LDA_00` | Closeness to LDA topic 0 |
`LDA_01` | Closeness to LDA topic 1 |
`LDA_02` | Closeness to LDA topic 2 |
`LDA_03` | Closeness to LDA topic 3 |
`LDA_04` | Closeness to LDA topic 4 |
`global_subjectivity` | Text subjectivity |
`global_sentiment_polarity` | Text sentiment polarity |
`global_rate_positive_words` | Rate of positive words in the content |
`global_rate_negative_words` | Rate of negative words in the content |
`rate_positive_words` | Rate of positive words among non-neutral tokens |
`rate_negative_words` | Rate of negative words among non-neutral tokens |
`avg_positive_polarity` | Avg. polarity of positive words |
`min_positive_polarity` | Min. polarity of positive words |
`max_positive_polarity` | Max. polarity of positive words |
`avg_negative_polarity` | Avg. polarity of negative words |
`min_negative_polarity` | Min. polarity of negative words |
`max_negative_polarity` | Max. polarity of negative words |
`title_subjectivity` | Title subjectivity |
`title_sentiment_polarity` | Title polarity |
`abs_title_subjectivity` | Absolute subjectivity level |
`abs_title_sentiment_polarity` | Absolute polarity level |
`shares` | Number of shares (target) |
Class Distribution: the class value (shares) is continuously valued. We transformed the task into a binary task using a decision threshold of 1400.
Shares Value Range: {'<1400':18490, '>=1400':21154}
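As a minimal sketch of this binarization, assuming the raw `shares` counts are available as a plain list, the threshold can be applied as shown below. The function name `binarize_shares` and the sample values are illustrative and not taken from the project code. The last lines use the class counts reported above to compute the accuracy that a majority-class baseline would reach.

```python
from collections import Counter

POPULARITY_THRESHOLD = 1400  # decision threshold stated above


def binarize_shares(shares, threshold=POPULARITY_THRESHOLD):
    """Map each share count to 1 (popular, >= threshold) or 0 (not popular)."""
    return [1 if s >= threshold else 0 for s in shares]


# Illustrative share counts (not real rows from the dataset)
sample_shares = [593, 1400, 1399, 5200, 711]
labels = binarize_shares(sample_shares)
print(labels)
print(Counter(labels))

# Majority-class baseline computed from the class counts reported above
class_counts = {"<1400": 18490, ">=1400": 21154}
baseline_accuracy = max(class_counts.values()) / sum(class_counts.values())
print(f"Majority-class baseline accuracy: {baseline_accuracy:.4f}")
```

The baseline is a useful reference point when reading the accuracy figures reported for the trained models.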
We want to know:
- How to predict which news articles will be popular
- What features about news articles make them more popular
This is important to:
- Help news sites become more profitable: Generate a model and feature insights that will give a company an advantage over other platforms vying for customer consumption.
- Raise awareness of important issues: Understanding what makes news popular can help policy writers gain a following around their policy issue.
Trade-offs: Efficiency (popularity prediction) and fairness (even distribution of article post days and themes)
The original dataset was downloaded from the UCI Machine Learning Repository. A subfolder `/data` was created to save this data.
The data is loaded to a remote datastore on AzureML; from there we can apply both the `AutoML` and `HyperDrive` approaches for modeling.
from azureml.core import Dataset
from azureml.data.dataset_factory import TabularDatasetFactory

# `ws`, `key` and `description_text` are assumed to be defined earlier in the notebook
# Create AML Dataset and register it into Workspace
example_data = 'https://github.com/franckess/AzureML_Capstone/blob/main/data/OnlineNewsPopularity.csv'
dataset = Dataset.Tabular.from_delimited_files(example_data)
# Create TabularDataset using TabularDatasetFactory
dataset = TabularDatasetFactory.from_delimited_files(path=example_data)
# Register Dataset in Workspace
dataset = dataset.register(workspace=ws, name=key, description=description_text)
Azure Automated Machine Learning (`AutoML`) provides capabilities to automate the iterative tasks of machine learning model development for a given dataset, here to predict whether an article will be popular or not based on what it learns from its training data. In this approach, Azure Machine Learning takes user inputs such as the `Dataset` and `Target Metric`, trains multiple models (e.g. `Logistic Regression`, `Decision Tree`, `XGBoost`, etc.) and returns the best-performing model (or an ensemble) with the highest training score achieved. We train and tune a model using the `Accuracy` primary metric for this project.
The `AutoMLConfig` class from the Azure ML Python SDK represents the configuration used to submit an automated ML experiment in Azure ML. The configuration parameters used for this project include:
Configuration | Details | Value |
---|---|---|
`compute_target` | Azure ML compute target to run the AutoML experiment on | compute_target |
`task` | The type of task to run, set as classification | classification |
`training_data` | The training data to be used within the experiment; contains the training features and a label column | Tabular Dataset |
`label_column_name` | The name of the label column | 'label' |
`path` | The full path to the Azure ML project folder | './capstone-project' |
`enable_early_stopping` | Enable AutoML to stop jobs that are not performing well after a minimum number of iterations | True |
`featurization` | Config indicator for whether the featurization step should be done automatically or not | auto |
`debug_log` | The log file to write debug information to | 'automl_errors.log' |
`verbosity` | Reporting of run activity | logging.INFO |
Also AutoML settings were as follows:
Configuration | Details | Value |
---|---|---|
`experiment_timeout_minutes` | Maximum amount of time in minutes that all iterations combined can take before the experiment terminates | 60 |
`max_concurrent_iterations` | Maximum number of iterations executed in parallel | 9 |
`primary_metric` | The metric that AutoML will optimize for model selection | accuracy |
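The two tables above could be combined into a single configuration roughly as follows. This is a sketch rather than the notebook's actual code: the helper name `make_automl_config` is hypothetical, `compute_target` and `training_data` are placeholders for objects created elsewhere, and the import is deferred into the function so that the settings dictionary can be inspected without the Azure ML SDK installed.

```python
import logging

# Values taken from the AutoML settings table above
automl_settings = {
    "experiment_timeout_minutes": 60,
    "max_concurrent_iterations": 9,
    "primary_metric": "accuracy",
}


def make_automl_config(compute_target, training_data):
    """Assemble an AutoMLConfig from the parameters listed in both tables.

    compute_target and training_data are assumed to be an existing Azure ML
    compute target and a registered tabular dataset, respectively.
    """
    # Deferred import: requires the azureml-train-automl package
    from azureml.train.automl import AutoMLConfig

    return AutoMLConfig(
        compute_target=compute_target,
        task="classification",
        training_data=training_data,
        label_column_name="label",
        path="./capstone-project",
        enable_early_stopping=True,
        featurization="auto",
        debug_log="automl_errors.log",
        verbosity=logging.INFO,
        **automl_settings,
    )


print(sorted(automl_settings))
```

Keeping the tunable settings in their own dictionary makes it easy to change, say, the timeout without touching the rest of the configuration.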
- Among all the models trained by AutoML, `Voting Ensemble` outperformed all the other models with `67.55% Accuracy`.
  - Ensemble models in Automated ML are combinations of multiple iterations, which may provide better predictions than a single iteration, and appear as the final iterations of a run.
  - There are two types of ensemble methods for combining models: Voting and Stacking.
  - A voting ensemble model predicts based on the weighted average of predicted class probabilities.
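To make the weighted-average idea concrete, here is a small sketch in plain Python. It is not the AutoML implementation: the function `soft_vote`, the per-model probabilities and the weights are all made up for illustration.

```python
def soft_vote(probabilities, weights):
    """Weighted average of per-model class probabilities.

    probabilities: one list per model, holding P(class) for every class.
    weights: one non-negative weight per model.
    Returns (predicted_class_index, averaged_probabilities).
    """
    if len(probabilities) != len(weights):
        raise ValueError("need exactly one weight per model")
    total = sum(weights)
    n_classes = len(probabilities[0])
    averaged = [
        sum(w * p[c] for w, p in zip(weights, probabilities)) / total
        for c in range(n_classes)
    ]
    # The predicted class is the one with the highest averaged probability
    predicted = max(range(n_classes), key=averaged.__getitem__)
    return predicted, averaged


# Three hypothetical models scoring one article: [P(not popular), P(popular)]
model_probs = [[0.40, 0.60], [0.70, 0.30], [0.45, 0.55]]
model_weights = [0.5, 0.2, 0.3]

label, avg = soft_vote(model_probs, model_weights)
print(label, [round(p, 3) for p in avg])
```

Note how the second model votes for "not popular", yet its low weight is not enough to flip the ensemble decision.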
Figure 1. Python SDK Notebook - AutoML Run Details widget
Figure 2. Python SDK Notebook - Accuracy plot using AutoML Run Details widget
Figure 3. Python SDK Notebook - Best performing run details (incl. Name, ID & Type)
Figure 4. Azure ML Studio - AutoML experiment completed
Figure 5. Azure ML Studio - AutoML best performing model summary
Figure 6. Azure ML Studio - Performance Metrics of best performing model trained by AutoML
Figure 7. Azure ML Studio - Models trained in multiple iterations using AutoML
Details of the best AutoML model (pipeline steps and their parameters):
datatransformer {'enable_dnn': None, 'enable_feature_sweeping': None, 'feature_sweeping_config': None, 'feature_sweeping_timeout': None, 'featurization_config': None, 'force_text_dnn': None, 'is_cross_validation': None, 'is_onnx_compatible': None, 'logger': None, 'observer': None, 'task': None, 'working_dir': None}
prefittedsoftvotingclassifier {'estimators': ['7', '0', '13', '14', '11', '30', '8', '5', '6', '19', '20'], 'weights': [0.2, 0.06666666666666667, 0.13333333333333333, 0.06666666666666667, 0.06666666666666667, 0.06666666666666667, 0.13333333333333333, 0.06666666666666667, 0.06666666666666667, 0.06666666666666667, 0.06666666666666667]}
7 - maxabsscaler {'copy': True}
7 - lightgbmclassifier {'boosting_type': 'gbdt', 'class_weight': None, 'colsample_bytree': 0.4955555555555555, 'importance_type': 'split', 'learning_rate': 0.1, 'max_bin': 20, 'max_depth': 7, 'min_child_samples': 438, 'min_child_weight': 6, 'min_split_gain': 0.3157894736842105, 'n_estimators': 600, 'n_jobs': -1, 'num_leaves': 224, 'objective': None, 'random_state': None, 'reg_alpha': 0.2631578947368421, 'reg_lambda': 0.42105263157894735, 'silent': True, 'subsample': 0.7426315789473684, 'subsample_for_bin': 200000, 'subsample_freq': 0, 'verbose': -10}
0 - maxabsscaler {'copy': True}
0 - lightgbmclassifier {'boosting_type': 'gbdt', 'class_weight': None, 'colsample_bytree': 1.0, 'importance_type': 'split', 'learning_rate': 0.1, 'max_depth': -1, 'min_child_samples': 20, 'min_child_weight': 0.001, 'min_split_gain': 0.0, 'n_estimators': 100, 'n_jobs': -1, 'num_leaves': 31, 'objective': None, 'random_state': None, 'reg_alpha': 0.0, 'reg_lambda': 0.0, 'silent': True, 'subsample': 1.0, 'subsample_for_bin': 200000, 'subsample_freq': 0, 'verbose': -10}
13 - sparsenormalizer {'copy': True, 'norm': 'max'}
13 - xgboostclassifier {'base_score': 0.5, 'booster': 'gbtree', 'colsample_bylevel': 1, 'colsample_bynode': 1, 'colsample_bytree': 0.7, 'eta': 0.001, 'gamma': 0, 'learning_rate': 0.1, 'max_delta_step': 0, 'max_depth': 4, 'max_leaves': 7, 'min_child_weight': 1, 'missing': nan, 'n_estimators': 100, 'n_jobs': -1, 'nthread': None, 'objective': 'reg:logistic', 'random_state': 0, 'reg_alpha': 0.3125, 'reg_lambda': 1.875, 'scale_pos_weight': 1, 'seed': None, 'silent': None, 'subsample': 1, 'tree_method': 'auto', 'verbose': -10, 'verbosity': 0}
14 - sparsenormalizer {'copy': True, 'norm': 'l2'}
14 - xgboostclassifier {'base_score': 0.5, 'booster': 'gbtree', 'colsample_bylevel': 1, 'colsample_bynode': 1, 'colsample_bytree': 0.6, 'eta': 0.4, 'gamma': 0, 'learning_rate': 0.1, 'max_delta_step': 0, 'max_depth': 4, 'max_leaves': 3, 'min_child_weight': 1, 'missing': nan, 'n_estimators': 100, 'n_jobs': -1, 'nthread': None, 'objective': 'reg:logistic', 'random_state': 0, 'reg_alpha': 0, 'reg_lambda': 2.0833333333333335, 'scale_pos_weight': 1, 'seed': None, 'silent': None, 'subsample': 0.7, 'tree_method': 'auto', 'verbose': -10, 'verbosity': 0}
11 - sparsenormalizer {'copy': True, 'norm': 'l2'}
11 - xgboostclassifier {'base_score': 0.5, 'booster': 'gbtree', 'colsample_bylevel': 1, 'colsample_bynode': 1, 'colsample_bytree': 0.5, 'eta': 0.1, 'gamma': 0, 'learning_rate': 0.1, 'max_delta_step': 0, 'max_depth': 6, 'max_leaves': 15, 'min_child_weight': 1, 'missing': nan, 'n_estimators': 100, 'n_jobs': -1, 'nthread': None, 'objective': 'reg:logistic', 'random_state': 0, 'reg_alpha': 0, 'reg_lambda': 2.0833333333333335, 'scale_pos_weight': 1, 'seed': None, 'silent': None, 'subsample': 1, 'tree_method': 'auto', 'verbose': -10, 'verbosity': 0}
30 - sparsenormalizer {'copy': True, 'norm': 'l1'}
30 - lightgbmclassifier {'boosting_type': 'goss', 'class_weight': None, 'colsample_bytree': 0.7922222222222222, 'importance_type': 'split', 'learning_rate': 0.07368684210526316, 'max_bin': 300, 'max_depth': 4, 'min_child_samples': 766, 'min_child_weight': 6, 'min_split_gain': 0.631578947368421, 'n_estimators': 10, 'n_jobs': -1, 'num_leaves': 74, 'objective': None, 'random_state': None, 'reg_alpha': 0, 'reg_lambda': 0.5263157894736842, 'silent': True, 'subsample': 1, 'subsample_for_bin': 200000, 'subsample_freq': 0, 'verbose': -10}
8 - standardscalerwrapper {'class_name': 'StandardScaler', 'copy': True, 'module_name': 'sklearn.preprocessing._data', 'with_mean': False, 'with_std': False}
8 - xgboostclassifier {'base_score': 0.5, 'booster': 'gbtree', 'colsample_bylevel': 1, 'colsample_bynode': 1, 'colsample_bytree': 0.7, 'eta': 0.4, 'gamma': 0, 'learning_rate': 0.1, 'max_delta_step': 0, 'max_depth': 9, 'max_leaves': 511, 'min_child_weight': 1, 'missing': nan, 'n_estimators': 200, 'n_jobs': -1, 'nthread': None, 'objective': 'reg:logistic', 'random_state': 0, 'reg_alpha': 1.7708333333333335, 'reg_lambda': 0.3125, 'scale_pos_weight': 1, 'seed': None, 'silent': None, 'subsample': 0.6, 'tree_method': 'auto', 'verbose': -10, 'verbosity': 0}
5 - sparsenormalizer {'copy': True, 'norm': 'l1'}
5 - xgboostclassifier {'base_score': 0.5, 'booster': 'gbtree', 'colsample_bylevel': 1, 'colsample_bynode': 1, 'colsample_bytree': 0.5, 'eta': 0.1, 'gamma': 0, 'grow_policy': 'lossguide', 'learning_rate': 0.1, 'max_bin': 1023, 'max_delta_step': 0, 'max_depth': 10, 'max_leaves': 0, 'min_child_weight': 1, 'missing': nan, 'n_estimators': 100, 'n_jobs': -1, 'nthread': None, 'objective': 'reg:logistic', 'random_state': 0, 'reg_alpha': 1.875, 'reg_lambda': 2.291666666666667, 'scale_pos_weight': 1, 'seed': None, 'silent': None, 'subsample': 0.5, 'tree_method': 'hist', 'verbose': -10, 'verbosity': 0}
6 - standardscalerwrapper {'class_name': 'StandardScaler', 'copy': True, 'module_name': 'sklearn.preprocessing._data', 'with_mean': False, 'with_std': False}
6 - xgboostclassifier {'base_score': 0.5, 'booster': 'gbtree', 'colsample_bylevel': 1, 'colsample_bynode': 1, 'colsample_bytree': 0.9, 'eta': 0.2, 'gamma': 0.1, 'learning_rate': 0.1, 'max_delta_step': 0, 'max_depth': 6, 'max_leaves': 0, 'min_child_weight': 1, 'missing': nan, 'n_estimators': 10, 'n_jobs': -1, 'nthread': None, 'objective': 'reg:logistic', 'random_state': 0, 'reg_alpha': 0, 'reg_lambda': 1.3541666666666667, 'scale_pos_weight': 1, 'seed': None, 'silent': None, 'subsample': 0.6, 'tree_method': 'auto', 'verbose': -10, 'verbosity': 0}
19 - standardscalerwrapper {'class_name': 'StandardScaler', 'copy': True, 'module_name': 'sklearn.preprocessing._data', 'with_mean': False, 'with_std': False}
19 - lightgbmclassifier {'boosting_type': 'gbdt', 'class_weight': None, 'colsample_bytree': 0.4955555555555555, 'importance_type': 'split', 'learning_rate': 0.0842121052631579, 'max_bin': 260, 'max_depth': 7, 'min_child_samples': 2735, 'min_child_weight': 1, 'min_split_gain': 0.7368421052631579, 'n_estimators': 25, 'n_jobs': -1, 'num_leaves': 140, 'objective': None, 'random_state': None, 'reg_alpha': 0.9473684210526315, 'reg_lambda': 0.2631578947368421, 'silent': True, 'subsample': 0.7921052631578948, 'subsample_for_bin': 200000, 'subsample_freq': 0, 'verbose': -10}
20 - standardscalerwrapper {'class_name': 'StandardScaler', 'copy': True, 'module_name': 'sklearn.preprocessing._data', 'with_mean': False, 'with_std': False}
20 - xgboostclassifier {'base_score': 0.5, 'booster': 'gbtree', 'colsample_bylevel': 1, 'colsample_bynode': 1, 'colsample_bytree': 0.7, 'eta': 0.3, 'gamma': 0, 'grow_policy': 'lossguide', 'learning_rate': 0.1, 'max_bin': 1023, 'max_delta_step': 0, 'max_depth': 2, 'max_leaves': 0, 'min_child_weight': 1, 'missing': nan, 'n_estimators': 10, 'n_jobs': -1, 'nthread': None, 'objective': 'reg:logistic', 'random_state': 0, 'reg_alpha': 0.9375, 'reg_lambda': 1.0416666666666667, 'scale_pos_weight': 1, 'seed': None, 'silent': None, 'subsample': 1, 'tree_method': 'hist', 'verbose': -10, 'verbosity': 0}
You can find the best AutoML model in the compressed file automl_best_model.zip
For more details about AutoML implementation check: AutoML notebook
Classical models used for classification tasks are statistical models such as `Logistic Regression`. In this experiment I wanted to try a Machine Learning algorithm. I chose `Light GBM (LGBM)` for its strong performance on different kinds of tasks; it is, for instance, one of the most used algorithms in Kaggle competitions.
The parameter ranges for the LGBM were chosen following the parameter tuning guides for different scenarios provided here.
The `Bayesian sampling` method was chosen because it tries to intelligently pick the next sample of hyperparameters, based on how the previous samples performed, such that the new sample improves the reported primary metric. This sampling method does not support a `termination_policy`. Therefore, `policy=None`.
Steps required to tune hyperparameters using Azure ML's `HyperDrive package`:
- Define the parameter search space using `Bayesian Parameter Sampling`
- Specify `Accuracy` as the primary metric to optimize
- Allocate `aml compute` resources
- Launch an experiment with the defined configuration using `HyperDriveConfig`
- Visualize the training runs with the `RunDetails` Notebook widget
- Select the best configuration for your model with `hyperdrive_run.get_best_run_by_primary_metric()`
To compare the performance of HyperDrive with that of AutoML, we chose `Accuracy` as the objective metric for LGBM. For more information check this link.
- `--num-leaves`: number of leaves of the tree
- `--min-data-in-leaf`: minimum # of samples in each leaf
- `--learning-rate`: learning rate
- `--feature-fraction`: ratio of features used in each iteration
- `--bagging-fraction`: ratio of samples used in each iteration
- `--bagging-freq`: bagging frequency
- `--max-depth`: to limit the tree depth explicitly
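HyperDrive passes each sampled value to the training script as a command-line argument. The sketch below shows one way such a script could parse the flags listed above and turn them into a LightGBM parameter dictionary. The function names, the default values and the `binary` objective are assumptions for illustration and are not taken from the project's training script. The integer casts are there because quantized samples can arrive as floating-point values such as `27.0`.

```python
import argparse


def parse_args(argv=None):
    """Parse the hyperparameter flags described above (defaults are illustrative)."""
    parser = argparse.ArgumentParser(description="LightGBM training arguments")
    # Integer-valued parameters are read as float first, then cast in to_lgbm_params
    parser.add_argument("--num-leaves", type=float, default=31)
    parser.add_argument("--min-data-in-leaf", type=float, default=20)
    parser.add_argument("--learning-rate", type=float, default=0.1)
    parser.add_argument("--feature-fraction", type=float, default=1.0)
    parser.add_argument("--bagging-fraction", type=float, default=1.0)
    parser.add_argument("--bagging-freq", type=float, default=0)
    parser.add_argument("--max-depth", type=float, default=-1)
    return parser.parse_args(argv)


def to_lgbm_params(args):
    """Build a LightGBM-style parameter dict, casting integer parameters."""
    return {
        "objective": "binary",  # assumed: binary popularity classification
        "num_leaves": int(args.num_leaves),
        "min_data_in_leaf": int(args.min_data_in_leaf),
        "learning_rate": args.learning_rate,
        "feature_fraction": args.feature_fraction,
        "bagging_fraction": args.bagging_fraction,
        "bagging_freq": int(args.bagging_freq),
        "max_depth": int(args.max_depth),
    }


# Example invocation with hypothetical sampled values
example_argv = [
    "--num-leaves", "27.0",
    "--min-data-in-leaf", "450.0",
    "--learning-rate", "0.05",
    "--feature-fraction", "0.764",
    "--bagging-fraction", "0.789",
    "--bagging-freq", "6.0",
    "--max-depth", "25.0",
]
params = to_lgbm_params(parse_args(example_argv))
print(params)
```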
from azureml.train.hyperdrive import BayesianParameterSampling, choice, quniform, uniform

# Search space for the LightGBM hyperparameters
param_sampling = BayesianParameterSampling(
    {
        "--num-leaves": quniform(8, 128, 1),
        "--min-data-in-leaf": quniform(20, 500, 10),
        "--learning-rate": choice(
            1e-4, 1e-3, 5e-3, 1e-2, 1.5e-2, 2e-2, 3e-2, 5e-2, 1e-1
        ),
        "--feature-fraction": uniform(0.1, 1),
        "--bagging-fraction": uniform(0.1, 1),
        "--bagging-freq": quniform(1, 30, 1),
        "--max-depth": quniform(5, 50, 5)
    }
)
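For intuition about what these expressions describe, the sketch below draws random configurations from the same ranges using only the Python standard library. It is not Bayesian sampling and does not use the Azure ML SDK. The helper names are hypothetical, and `quniform(low, high, q)` is modelled as `round(uniform(low, high) / q) * q`, which is how the Azure ML documentation describes it.

```python
import random

LEARNING_RATES = (1e-4, 1e-3, 5e-3, 1e-2, 1.5e-2, 2e-2, 3e-2, 5e-2, 1e-1)


def sample_quniform(rng, low, high, q):
    """Quantized uniform draw: round(uniform(low, high) / q) * q."""
    return round(rng.uniform(low, high) / q) * q


def sample_configuration(rng):
    """Draw one configuration from the ranges used in the search space above."""
    return {
        "--num-leaves": sample_quniform(rng, 8, 128, 1),
        "--min-data-in-leaf": sample_quniform(rng, 20, 500, 10),
        "--learning-rate": rng.choice(LEARNING_RATES),
        "--feature-fraction": rng.uniform(0.1, 1),
        "--bagging-fraction": rng.uniform(0.1, 1),
        "--bagging-freq": sample_quniform(rng, 1, 30, 1),
        "--max-depth": sample_quniform(rng, 5, 50, 5),
    }


rng = random.Random(0)  # fixed seed so the draws are repeatable
configurations = [sample_configuration(rng) for _ in range(3)]
for config in configurations:
    print(config)
```

Random draws like these explore the space blindly, whereas Bayesian sampling uses the results of earlier runs to decide where to sample next.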
Figure 9. Azure ML Studio Experiment submitted with HyperDrive from notebook
Figure 10. Python SDK Notebook: Output from run using Run Details widget
Figure 11. Azure ML Studio: Child runs output
Figure 12. Python SDK Notebook: Best performing model from hyperparameter tuning using HyperDrive
Figure 13. Python SDK Notebook: HyperDrive Run Primary Metric Plot - Accuracy
Hyperparameters | Best Value |
---|---|
`--num-leaves` | 27 |
`--min-data-in-leaf` | 450 |
`--learning-rate` | 0.05 |
`--feature-fraction` | 0.764 |
`--bagging-fraction` | 0.789 |
`--bagging-freq` | 6 |
`--max-depth` | 25 |
Model Accuracy: 68.2936%, which is better than the accuracy of the model generated by `AutoML` (67.55%).
You can find the best HyperDrive model in the file lgb_model.pkl
For more details about the HyperDrive implementation check: HyperDrive notebook
`Deployment` is about delivering a trained model into production so that it can be consumed by others. Deploying a model exposes it as an HTTP API service, so that clients can interact with the model by sending data over POST requests, for example.
Comparing the results of AutoML and HyperDrive we saw that HyperDrive gave us the best model (higher Accuracy). Therefore, this is the model to be deployed.
Details of the deployment of the model can be seen in section `Model Deployment` of the HyperDrive notebook.
The configuration object created for deploying an `AciWebservice` for this project is as follows:
from azureml.core.model import InferenceConfig, Model
from azureml.core.webservice import AciWebservice

# `ws` (workspace) and `model` (registered model) are assumed to be defined earlier in the notebook
script_file_name = './score.py'
inference_config = InferenceConfig(entry_script=script_file_name)
aciconfig = AciWebservice.deploy_configuration(
    cpu_cores=2,
    memory_gb=4,
    tags={'Company': "Mashable", 'Type': "Hyperdrive", "Version": "1"},
    description='sample service for Capstone Project Hyperdrive Classifier for Online News popularity')
aci_service_name = 'hyperdrive-deployment'
print(aci_service_name)
# Deploy the model as an ACI web service and wait until it is ready
aci_service = Model.deploy(ws, aci_service_name, [model], inference_config, aciconfig)
aci_service.wait_for_deployment(True)
print(f'\nservice state: {aci_service.state}\n')
print(f'scoring URI: \n{aci_service.scoring_uri}\n')
print(f'swagger URI: \n{aci_service.swagger_uri}\n')
Figure 15. Python SDK Notebook: Deployment Completed
Figure 16. Python SDK Notebook: Successful Deployment
Once the model is deployed, we can use the `scoring_uri` in the `endpoint.py` script to interact with the trained model.
Figure 17. Endpoint consumption
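A client script of this kind typically sends a JSON payload to the scoring URI with an HTTP POST request. The sketch below uses only the standard library and is not the project's `endpoint.py`: the URI is a placeholder, the function names are hypothetical, and the `{"data": [...]}` payload shape and the two sample features are assumptions, since the real format is defined by the `score.py` entry script.

```python
import json
import urllib.request


def build_scoring_request(scoring_uri, rows):
    """Create a POST request carrying the rows to score as JSON."""
    body = json.dumps({"data": rows}).encode("utf-8")
    return urllib.request.Request(
        scoring_uri,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def score(scoring_uri, rows, timeout=30):
    """Send the request and return the decoded JSON response."""
    request = build_scoring_request(scoring_uri, rows)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Placeholder URI: replace with the value printed by aci_service.scoring_uri
SCORING_URI = "http://example.invalid/score"
# Two illustrative feature values only; a real request needs every model feature
sample_rows = [{"n_tokens_title": 12, "num_imgs": 1}]

request = build_scoring_request(SCORING_URI, sample_rows)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
# score(SCORING_URI, sample_rows)  # uncomment once a real endpoint is available
```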
Click here to see a short demo of the project in action showing:
- A working model
- Demo of the deployed model
- Demo of a sample request sent to the endpoint and its response
- Downsizing Number of Features:
  I used `Boruta` and `Correlation Analysis` as a way to select a smaller set of features for this project. Another technique that could have been explored/implemented is `PCA` (principal component analysis).
- Change parameters of AutoML:
  This could include, as mentioned earlier:
  - Increase `experiment_timeout_minutes` to give AutoML more time to try other models.
  - Make use of `FeaturizationConfig`, for example, to use another form of imputation of NaN values than the one chosen by AutoML.
- Increase `max_total_runs` in HyperDrive:
  Try HyperDrive with a higher value of `max_total_runs` to see if the performance increases.
With this project, we have deployed the best-performing model to an HTTP web service endpoint in `Azure Container Instance (ACI)`. To collect the additional endpoint data listed below, we enable Azure Application Insights, an extensible Application Performance Management (APM) service.
- Output Data
- Responses
- Request rates, response times, and failure rates
- Dependency rates, response times, and failure rates
- Exceptions for failed requests
To perform this step, we used the `logs.py` script uploaded with this repository. It dynamically authenticates to Azure, enables Application Insights and displays the logs of the deployed model.
from azureml.core import Workspace
from azureml.core.webservice import Webservice
# Requires the config to be downloaded first to the current working directory
ws = Workspace.from_config()
# Set with the deployment name
name = "hyperdrive-deployment"
# load existing web service
service = Webservice(name=name, workspace=ws)
print(service)
# Enable Application Insight
service.update(enable_app_insights=True)
logs = service.get_logs()
for line in logs.split('\n'):
    print(line)
Figure 18. Result - Enable Application Insight using logs.py
Figure 19. Result - Enable Application Insight using logs.py