Based on Bannerclick's version of the OpenWPM framework, OpenOBA provides a flexible and easy-to-set-up environment where highly configurable experiments involving web crawlers and ad capture can be created and run.
After each experiment's successful run, its configuration parameters, browser profile, and browsing data are saved. This allows the user to load any created experiment to keep feeding the browser with the specified behavior, or to analyze the data collected until that point to measure its OBA occurrence.
Figures explaining its usage can be seen in this folder.
What does it mean to measure OBA?
Measuring OBA means quantifying a user's exposure to online advertisements targeted specifically at them, based on their past web browsing behavior as a result of *web tracking*. For a user to be shown targeted ads, their activity and interests must have been profiled and narrowed down to specific categories on the browsers they have used.
To quantify this phenomenon, we need all of the ads that were shown to the user, together with information about their content/category, so that we can count how many of them were related to the user's profile category.
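As a rough illustration of the quantity described above, the following minimal sketch computes the share of captured ads whose category matches the profile category. The function name `oba_share` and the example data are hypothetical and are not part of OpenOBA; the framework's own metrics live in the ExperimentMetrics class.

```python
from collections import Counter


def oba_share(ad_categories, profile_category):
    """Return the fraction of ads whose category matches the profile category."""
    if not ad_categories:
        return 0.0
    counts = Counter(ad_categories)
    return counts[profile_category] / len(ad_categories)


# Hypothetical categories of ads captured during control visits.
ads = ["Style & Fashion", "Travel", "Style & Fashion", "Automotive"]
print(oba_share(ads, "Style & Fashion"))  # 2 of the 4 ads match -> 0.5
```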
OpenOBA is built on top of Bannerclick's OpenWPM framework, version 0.21.0. It uses the following versions of its parts as reference:
The first prerequisite is mamba, which will be used to install the openwpm conda environment. As stated in the mamba installation guide, we can use miniforge. To install it, run:
curl -L -O "https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh"
bash Miniforge3-$(uname)-$(uname -m).sh
git clone https://github.com/fukuda-lab/OpenOBA.git
cd OpenOBA
We will use the same install script as OpenWPM; the only change is that we specifically want Firefox version 108.0.2. If it doesn't work, try deleting the --force flag in the ./install.sh file.
./install.sh
If the last step was successful, we can now install the missing dependencies. To do this, we have to activate the conda openwpm environment by running:
conda activate openwpm
If everything is working correctly, we should be able to run the demo files from the demos folder (with the openwpm env activated).
In summary, these demo files show a very basic use of the main classes of the framework: OBAMeasurementExperiment, DataProcesser, and ExperimentMetrics (merged with a class previously called OBAAnalysis).
The demos are meant to be run in order: 1, 2, 3, and 4.
- On macOS, remember to set OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES before any python command:
OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES python -m demos.1_create_experiment_demo
See their code to follow/change the directories for data, results, and plots.
Demo on how to create and run a fresh experiment instance using the OBAMeasurementExperiment class, selecting the experiment instance name and cookie banner action, and setting its training pages. Note that control_visits_rate, in the start() method, is a percentage (from 0 to 100) that dictates the proportion of control visits.
python -m demos.1_create_experiment_demo
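To make the meaning of a 0 to 100 percentage concrete, here is a minimal sketch of how such a rate could be turned into a sequence of control and training visits. This is only an assumption about the semantics for illustration; `plan_visits` is a hypothetical helper and does not reflect how OBAMeasurementExperiment schedules visits internally.

```python
import random


def plan_visits(n_visits, control_visits_rate, seed=0):
    """Label each visit 'control' or 'training' given a 0-100 percentage."""
    if not 0 <= control_visits_rate <= 100:
        raise ValueError("control_visits_rate must be between 0 and 100")
    rng = random.Random(seed)  # seeded so the plan is reproducible
    return [
        # rng.random() * 100 lies in [0, 100), so a rate of 0 never selects
        # a control visit and a rate of 100 always does.
        "control" if rng.random() * 100 < control_visits_rate else "training"
        for _ in range(n_visits)
    ]


plan = plan_visits(1000, control_visits_rate=20)
print(plan.count("control") / len(plan))  # close to 0.2 for a large plan
```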
Demo on how to load an experiment instance previously created with the OBAMeasurementExperiment class, loading its saved browser profile to resume the same experiment instance.
python -m demos.2_load_experiment_demo
Demo on how to filter, process, and categorize all the ad URLs captured during the control visits of an experiment instance using the DataProcesser class.
python -m demos.3_data_processer_demo
Demo on what an ad analysis script could look like, using the ExperimentMetrics class, which includes some example methods to query, tabulate, and plot an experiment's ad data.
python -m demos.4_ads_analysis_demo
To crawl and extract ads from any website without going through the whole process of an OpenOBA experiment, you can call the ExtractAdsCommand like any other OpenWPM command. We have published a demo on how to use this command, similar to OpenWPM's demo.
python -m openoba_adscraper_demo
Ads from YouTube videos can be extracted if the browser's autoplay is enabled. Otherwise, the browser display mode has to be "native" for the crawl, so that the user can manually play each video before the ExtractAdsCommand is executed. This does not work flawlessly, so expect to encounter difficulties.
python -m demos.youtube_crawler_demo
OpenWPM does not output all errors directly. For more insight when encountering errors, see the geckodriver.log file, created in the root directory of the repository after running OpenWPM, and openwpm.log, created in the experiment directory (inside the data_dir folder) after creating an experiment.
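When debugging, it can help to print the last lines of both log files in one go. The sketch below does that with the standard library only; `tail`, `show_logs`, and the example experiment path are hypothetical and should be adjusted to your own data_dir layout.

```python
from collections import deque
from pathlib import Path


def tail(path, n=20):
    """Return the last n lines of a text file, or [] if it does not exist."""
    path = Path(path)
    if not path.is_file():
        return []
    with path.open(encoding="utf-8", errors="replace") as handle:
        # deque with maxlen keeps only the final n lines while streaming.
        return [line.rstrip("\n") for line in deque(handle, maxlen=n)]


def show_logs(repo_root, experiment_dir, n=20):
    """Print the tail of geckodriver.log and openwpm.log."""
    logs = (
        Path(repo_root) / "geckodriver.log",
        Path(experiment_dir) / "openwpm.log",
    )
    for log in logs:
        print(f"==> {log}")
        for line in tail(log, n):
            print(line)


# Hypothetical paths: repository root and one experiment folder in data_dir.
show_logs(".", "datadir/my_experiment")
```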
For the input files used in our paper, see the oba/input_run_files folder.
This shows how to set up all three OBA Run instances performed in the experiment, following the same 0, 1, 2, 3 and 4 steps. Read the paper to better understand the idea of separating the experiment into instances.
Other scripts can be found within oba_analysis/data_analysis and oba/third_party_analysis folders, with much more code showing data processing for ads, cookies and http requests in CSVs, plots, and markdowns.
This code can be untidy and hard to understand, because it mixes code referring to the OBA Runs and the Random Runs (also called Control Runs throughout the code). It was specific to the experiments performed in the paper, but it could serve as guidance on how to use the methods in the ExperimentMetrics class to analyze OBA, cookies, and HTTP requests.
- Categorizer (private class, internal use only): Given valid credentials, it uses the WebShrinker API to categorize URLs with the IAB taxonomy or WebShrinker's own taxonomy. Used by TrainingPagesHandler and DataProcesser.
- TrainingPagesHandler (public class if a user wants to access its functionalities, but intended to be used only by the OBACrawler class):
This class has several functionalities, but in summary:
- fetching training pages from Tranco and saving them in a file
- loading them from previously fetched dates
- categorizing any given set of training pages with the Categorizer (either loaded from Tranco or from a custom training pages list provided by the user) and saving the categorized training pages in a SQLite database
- given already categorized pages in a SQLite database, returning a list of all the training pages that belong to an input category
- more methods related to the presence of cookie banners on training pages
- OBAMeasurementExperiment (public class, directly used by the users): This is the entry point for the framework to run the crawls over the pages.
This class handles the setup of the environment according to the argument values, including the calls to the TrainingPagesHandler. Functions include:
- init, the setup (initializer), where it can either create a new experiment or load an old one, load pages either from the Tranco top list or from a custom list, and optionally categorize those lists, performing the corresponding validations.
- Filter and set the training pages by category, in case they were categorized beforehand.
- Run the actual crawling for the experiment, saving the ad URLs found in the control_sites and adding all the necessary data about the site visits so that they can be analyzed later. It also handles saving the browser profiles, which are loaded when resuming a previously started experiment.
- DataProcesser (public to the user): This is the other entry point for the framework. It should only receive the experiment_name. With that name, it can connect to the SQLite database with all the crawling data (site visits, browser ids, etc.) and to the {experiment_name}_config.json. It is in charge of resolving the ad URLs after they have been extracted during an experiment run.
- ExperimentMetrics (public): Used to get several insights about the experiment after its ads have been processed. Several other scripts in third_party_analysis and oba_analysis use it to generate tables and analyses for the resources of an experiment.
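The category lookup described for TrainingPagesHandler above (returning all training pages that belong to an input category from a SQLite database) can be sketched as follows. The table name `training_pages`, its columns, and the function `pages_in_category` are assumptions made for this example; the actual schema used by OpenOBA may differ.

```python
import sqlite3


def pages_in_category(conn, category):
    """Return URLs of training pages stored under the given category."""
    rows = conn.execute(
        "SELECT page_url FROM training_pages WHERE category = ? ORDER BY page_url",
        (category,),  # parameterized query, so the category is safely escaped
    )
    return [url for (url,) in rows]


# In-memory database with a hypothetical schema and example rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE training_pages (page_url TEXT, category TEXT)")
conn.executemany(
    "INSERT INTO training_pages VALUES (?, ?)",
    [
        ("https://shop.example", "Style & Fashion"),
        ("https://news.example", "News"),
        ("https://boutique.example", "Style & Fashion"),
    ],
)
print(pages_in_category(conn, "Style & Fashion"))
# ['https://boutique.example', 'https://shop.example']
```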
This is an explanation of some of the parameters for an experiment run using OpenOBA. Adjust the imports/runs according to where the scripts are being run/called.
We want to measure the impact of users' cookie banner choices on the exposure to OBA they will receive. For this, we would need to run three different experiment instances, with the same parameters, but with a different cookie_banner_action each.
In this tutorial, we will show how to run one of those experiment instances to show some of the OpenOBA features.
- Create a new experiment
First, create a dictionary with the corresponding arguments. The experiment_name and fresh_experiment parameters are required; the rest depend on the experiment. In this case:
- cookie_banner_action of 1: accept all cookies when asked while training
- tranco_pages_params: training pages will be retrieved from an updated list of Tranco most popular sites, of a size of 100000
- We need valid webshrinker_credentials because we will need to categorize the pages. These must be provided by the user.
oba_cookie_banner_experiment_with_categorization = {
    "experiment_name": "example_clothing_accept_cookie_banner_experiment",
    "fresh_experiment": True,
    "cookie_banner_action": 1,
    "tranco_pages_params": {
        "updated": True,
        "size": 100000,
    },
    # Real values should be provided by the user
    "webshrinker_credentials": {"api_key": API_KEY, "secret_key": SECRET_KEY},
}
Create the experiment (this will take some time because the pages need to be categorized)
from oba_crawler import OBAMeasurementExperiment
experiment = OBAMeasurementExperiment(**oba_cookie_banner_experiment_with_categorization)
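The API_KEY and SECRET_KEY names in the dictionary above are left for the user to define. One option, sketched here as a suggestion rather than something OpenOBA requires, is to read them from environment variables so they stay out of the source code. The variable names WEBSHRINKER_API_KEY and WEBSHRINKER_SECRET_KEY and the helper function are hypothetical.

```python
import os


def webshrinker_credentials_from_env(environ=os.environ):
    """Build the webshrinker_credentials dict from environment variables."""
    try:
        return {
            "api_key": environ["WEBSHRINKER_API_KEY"],
            "secret_key": environ["WEBSHRINKER_SECRET_KEY"],
        }
    except KeyError as missing:
        # Fail early with a readable message instead of a bare KeyError.
        raise RuntimeError(f"Missing environment variable: {missing}") from None


# Demonstration with a stand-in mapping instead of the real environment.
fake_env = {
    "WEBSHRINKER_API_KEY": "my-key",
    "WEBSHRINKER_SECRET_KEY": "my-secret",
}
print(webshrinker_credentials_from_env(fake_env))
```

The returned dictionary has the same shape as the "webshrinker_credentials" entry used when creating the experiment.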
- Set the training pages for the experiment
**Loading an experiment** If we have already created the experiment and now want to load it in another script or run, we can just do:
experiment = OBAMeasurementExperiment(experiment_name="example_clothing_accept_cookie_banner_experiment", fresh_experiment=False)
Now, since we are using the Tranco pages, we need a category to set our training pages. We will pick Clothing, since we know it has cookie banners (we have to do this every time we want to run an experiment):
experiment.set_training_pages_by_category(category="Style & Fashion")
After that, we can start the crawling.
With an experiment created and loaded into an instance of OBAMeasurementExperiment, and its categories set (when using Tranco), we can start an instance of the crawling for as long as we want.
experiment.start(hours=3, minutes=30, browser_mode="headless")
This will always first run clean visits to the control pages (in fresh browsers), so that we gather ads that we know are not due to OBA, and then the training + control process will start.
Now we are ready to use the DataProcesser to get the landing pages for the ads as shown in the demo files 3 and 4.
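For intuition about what getting a landing page from an ad URL can involve, the sketch below handles one common pattern: an ad click URL that carries its destination in a query parameter. This is not how DataProcesser is implemented; the parameter names in REDIRECT_PARAMS, the function name, and the example URL are all assumptions made for illustration.

```python
from urllib.parse import parse_qs, urlsplit

# Hypothetical query parameter names that may carry the destination URL.
REDIRECT_PARAMS = ("adurl", "url", "redirect")


def embedded_landing_url(ad_url):
    """Return a landing URL embedded in the ad URL's query string, if any."""
    query = parse_qs(urlsplit(ad_url).query)  # parse_qs also percent-decodes
    for name in REDIRECT_PARAMS:
        for value in query.get(name, []):
            if value.startswith(("http://", "https://")):
                return value
    return None


ad_url = "https://ads.example/click?id=42&adurl=https%3A%2F%2Fshop.example%2Fshoes"
print(embedded_landing_url(ad_url))  # https://shop.example/shoes
```

Ad URLs that do not embed their destination would need to be resolved by other means, such as following redirects.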