This repository contains code relating to the paper titled "Correlation Robust Influence Maximization" and its expanded version published in Management Science. The code is provided primarily for reproducibility, and certain code is left as-is to reproduce results from when it was written. Ideas from this code may be used to create a library.
We use Python 3 for general scripting and C++ to speed up certain computations. More details on specific packages are given below. We run our code on existing data.
At a high level, there are three steps to reproduce results:

- Acquire data by downloading from the sources, and place it in a `data` directory at the same level as the scripts here.
- Ensure dependencies are in place. This includes compiling C++ programs whose source code is also located in this repository.
- Run the main Python script `paper_management_science.py`.

The scripts should create any necessary subdirectories and place intermediate outputs in them.
If one wishes to reproduce the results in the Management Science paper, assuming all dependencies and data are in place, a user should run

```
poetry run python paper_management_science.py
```

The script will gather data, including the data gathered for the NeurIPS paper. In addition, we

- gather data across the CVaR parameter and the convex combination parameter,
- analyze a very small (|V| = 16) randomly generated graph, obtaining optimal solutions through brute force,
- analyze a subset of `amazon` DVDs and the resulting greedy solutions, and
- summarize much of the above into a summary folder.
To reproduce the results published in the NeurIPS paper, one may run

```
poetry run python paper_neurips.py
```

The script will begin calculating the correlation-robust expected influence as well as the expected influence under independent cascade and place the resulting data files into a folder called `out/`. After this data is gathered, the script will run analyses which should produce the graphs and tables as shown in the paper.
If a user wishes to run a specific experiment, they may also use `experiment.py`. For example, to run the correlation-robust greedy experiment on `polblogs` with k up to 20 and homogeneous edge weights of 0.37, a user can run

```
poetry run python experiment.py polblogs 20 correlation_robust 0 -p 0.37
```

Because `polblogs` and `correlation_robust` are actually `IntEnum`s (see `config.py`), the same experiment can be expressed as

```
poetry run python experiment.py 0 20 0 0 -p 0.37
```

Users are encouraged to run `python experiment.py --help` to look at the options.
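For orientation, here is a minimal sketch of what such `IntEnum` definitions could look like. The class names and all values other than `polblogs == 0` and `correlation_robust == 0` (implied by the two equivalent commands above) are assumptions; the authoritative definitions live in `config.py`.

```python
from enum import IntEnum

# Hypothetical sketch only; see config.py for the real definitions.
class Dataset(IntEnum):
    polblogs = 0   # implied by the equivalent commands above
    wikivote = 1   # ordering of the remaining members is assumed
    amazon = 2

class InfluenceModel(IntEnum):
    correlation_robust = 0    # implied by the equivalent commands above
    independence_cascade = 1  # assumed value
```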
We use real datasets:

- `polblogs` - it seems that access to Mark Newman's page is not guaranteed, but mirrors of this dataset should be readily available. We provide two SHA256 hashes of files which should have the same content for this dataset; which hash you get depends on CRLF line terminators:
  - 5391B1CABC2C171426EAE9BBC4196468FF93B731AE98C96B098F342332817BD6
  - BD83E3DD2397A57D2252CC212524C21FEE6708663136898D4ACE523224BDC632
- `wikivote`, with SHA256 hash (hash of the text file, not the `.gz`):
  - D2AFBEDF262126F820C6B3DD9F39A6D68E6F5EA839C0508297032CA77578B28A
- `amazon`, with SHA256 hash:
  - 600135116A05B7CE2DCB7E842E892D663C6190A0567D00373E0C5C4F3C908F02
It should be possible to download the datasets directly from the linked sources. Please place the downloaded files `polblogs.gml`, `Wiki-Vote.txt` and `amazon-meta.txt` in a folder called `data` after download. Note that we modify all input graphs to simple graphs with `Graph.simplify()`. In this case, `polblogs` is changed from a multidigraph with self-loops to a simple graph.
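As a quick sanity check (this snippet is not part of the repository), the downloads can be verified against the hashes above:

```python
import hashlib
from pathlib import Path

# Expected SHA256 hashes from this README; polblogs has two acceptable
# hashes depending on CRLF line terminators.
EXPECTED = {
    "polblogs.gml": {
        "5391B1CABC2C171426EAE9BBC4196468FF93B731AE98C96B098F342332817BD6",
        "BD83E3DD2397A57D2252CC212524C21FEE6708663136898D4ACE523224BDC632",
    },
    "Wiki-Vote.txt": {
        "D2AFBEDF262126F820C6B3DD9F39A6D68E6F5EA839C0508297032CA77578B28A",
    },
    "amazon-meta.txt": {
        "600135116A05B7CE2DCB7E842E892D663C6190A0567D00373E0C5C4F3C908F02",
    },
}

for name, hashes in EXPECTED.items():
    digest = hashlib.sha256((Path("data") / name).read_bytes()).hexdigest().upper()
    print(name, "OK" if digest in hashes else "MISMATCH")
```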
To use your own dataset, please modify the files `graph_functions.py` and `config.py`. `graph_functions.py` is used both in loading graphs and in assigning edge weights; `config.py` influences the graphs loaded in `experiment.py`.
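As a generic illustration of the kind of loading involved (the file name `my_graph.tsv` is hypothetical; the actual hooks for registering a dataset live in `graph_functions.py` and `config.py`):

```python
import igraph as ig

# Read a whitespace/tab-separated edge list with weights into igraph.
g = ig.Graph.Read_Ncol("my_graph.tsv", weights=True, directed=True)
# The scripts work on simple graphs (cf. Graph.simplify() above).
g.simplify(combine_edges="max")
print(g.summary())
```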
The following code functions map to the Management Science paper tables and figures:

- `common.get_dataset_graph_summaries` -> Table 2
- `paper_management_science.get_heterogenous_seed_summaries` -> Table 3
- `paper_management_science.tabulate_conv_data` -> Tables 4 and 5
- `paper_management_science.tabulate_alpha_greedy_compare` -> Tables 6 and 7
- `paper_management_science.plot_small_example` -> Figures 8 and 9 (converted to TikZ; we have changed how it looks)
- `common.plot_compute_times` -> Figure 10
- `common.plot_hists` -> Figure 11
- `common.plot_expected_influence_graphs` -> Figure 12
- `paper_management_science.get_greedy_genres` -> Figures 13 and 14
- `paper_management_science.graph_degree_distribution_cdf` -> Figure 15
Note that outputs are only produced after running our scripts.
Our (raw) outputs are stored as `.csv` files but have different formats depending on what we need. Files produced in `out` (but not its subdirectories) have four columns and no headers. The four columns are

- Seed Index
- Objective Function Value under Correlation Robust Influence
- Objective Function Value under Independent Cascade Influence
- Computational Time
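For instance, one of these files can be loaded with pandas (the file name is hypothetical):

```python
import pandas as pd

# The four headerless columns match the description above.
df = pd.read_csv(
    "out/example.csv",
    header=None,
    names=["seed_index", "corr_robust_influence",
           "indep_cascade_influence", "compute_time"],
)
print(df.head())
```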
However, in the subdirectory `out/conv_cvar`, files contain our computations at different alpha and lambda parameters. Two types of files are produced in this subdirectory. The typical file contains

- Seed Index
- Objective Function Value
- Computational Time

The other type of file, produced only for `amazon` (see the `get_amazon_est_data` function in `paper_management_science.py`), measures marginal objective function values and computational times.
We also produce other `.pickle` and `.tsv` files. The former are serialized data used to skip repeated computation, and the latter are graph data stored in the form `FROM_INDEX[TAB_CHARACTER]TO_INDEX[TAB_CHARACTER]EDGE_WEIGHT`.
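As a sketch, assuming an `igraph` graph with a `weight` edge attribute (the attribute name here is an assumption), such a file could be produced as follows:

```python
import igraph as ig

# Toy weighted digraph; in the repository, graph handling lives in
# graph_functions.py.
g = ig.Graph(edges=[(0, 1), (1, 2)], directed=True)
g.es["weight"] = [0.5, 0.37]

with open("graph.tsv", "w") as f:
    for e in g.es:
        # FROM_INDEX[TAB_CHARACTER]TO_INDEX[TAB_CHARACTER]EDGE_WEIGHT
        f.write(f"{e.source}\t{e.target}\t{e['weight']}\n")
```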
All other outputs are higher-level outputs, such as summary CSVs or PNGs. We use and present these in our paper.
To run and reproduce the results as shown, users are encouraged to run `poetry install` and run the scripts within the generated virtual environment. In summary, the following packages are required:

To represent graphs, we mainly use igraph. NetworkX was used in synthetic graph generation and visualisation. We use Pyomo as one of the methods to find the seed with the highest marginal gain, but even with the fast linear program solver Gurobi, it is not nearly as fast as graph-based methods. We use Matplotlib to plot graphs and Pillow for image manipulation.
- Pruned Monte Carlo Simulator - because it is used for comparison, we also use a pruned Monte Carlo simulator, though we have modified it as in our fork to add some features, mostly relating to data collection. Note that binaries are not provided, but the code should compile relatively quickly - it takes an i7-7500U processor less than a minute. The resulting `pmc_greed.exe` and `pmc_est.exe` need to be in PATH for the `paper_*.py` scripts or `experiment.py` to run properly.
  - We have included a small helper `build_conv.sh` that creates the executables, assuming one has compatible C++ compilers. This also uses `meson`. The resulting executables should not take long to compile.
- Linear Program Solver - if, for example, a user wants to use a linear program for influence maximization, then a solver is required. The files as provided assume that `gurobipy` is installed. Any solver that Pyomo can interface with can be used, but among free solvers only CBC has been tested. To use an alternative solver, please modify line 22 of `linear_program.py`; a sketch of such a swap follows this list.
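What that swap looks like in Pyomo, as a minimal illustrative sketch (this is not the exact contents of `linear_program.py`):

```python
import pyomo.environ as pyo

# Choose the solver Pyomo will call; "gurobi" is what the files assume.
solver = pyo.SolverFactory("gurobi")
# solver = pyo.SolverFactory("cbc")  # the only free solver we have tested

# A model built elsewhere would then be solved with, e.g.:
# results = solver.solve(model)
```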
We note that users who only wish to use the algorithm may modify the code and simply use `igraph`.
Code files:

- `.gitignore` - for managing the git repository with ignores. We ignore many intermediate outputs and raw data; outputs should be produced by running the code, and raw data should be downloaded directly from sources.
- `LICENSE` - license file. Out of an abundance of caution we use GPL-3.0, as we are using a GPL library (specifically `igraph`).
- `poetry.lock` - for `poetry install`. It is used for reproducible installs so that results are reproducible.
- `pyproject.toml` - for `poetry`. This is how we specify, at a high level, the environment in which to run our Python scripts. It specifies dependencies.
Python scripts:

- `paper_management_science.py` - this file is the main script that generates and then analyses all data as shown in the expanded paper version. Running `paper_management_science.py`, or specifically its data-gathering step, is expected to take about a day or so on modern hardware, as with `paper_neurips.py` below. After the objective values and computational times are stored in the folder `out/`, re-running the entire script should not take longer than 5 minutes.
- `paper_neurips.py` - this file is the main file that generates and then analyses all data as shown in the NeurIPS proceedings. Running `paper_neurips.py`, or specifically its `get_data()` function, is expected to take around 20 hours on an i7-7500U processor. After the objective values and computational times are stored in the folder `out/`, re-running the entire script should not take longer than 5 minutes.
- `config.py` - this file stores defaults and constants. The defaults are used for `paper_management_science.py` and `paper_neurips.py`, and serve to standardize things across different scripts.
- `common.py` - this file stores common functions used in both `paper_management_science.py` and `paper_neurips.py`.
- `experiment.py` - this file represents a single experiment where the influence diffusion process proceeds to completion on a graph. We collect data while performing the experiments, mostly relating to objective values and computational times. We consider only the computational time from the start of the diffusion process to its end, and do not include the setup computations. We use this as an easy interface for both `paper_neurips.py` and `paper_management_science.py`.
- `greed.py` - this file only contains `accelgreedy`, the accelerated greedy algorithm which seeks the seed with the highest marginal gain. The algorithm is also alternatively called 'Lazy Greedy'; a generic sketch of the idea follows this list.
- `dro.py` - this file stores the functions used for calculating expected influence under the correlation-robust model.
- `independence_cascade.py` - similar to the above, but gives results under independent cascade.
- `linear_program.py` - this file stores the linear program for the correlation-robust influence calculation problem. That is, given a seed set, it calculates the expected influence under adversarial correlations, given that the marginal probabilities for nodes to activate other nodes are fixed.
- `graph_functions.py` - this file stores all functions relating to the graphs themselves. Specifically, we standardize graph processing, handle TSVs, convert between different libraries, and process specific data graphs here. We heavily use features of `igraph` here.
- `cvar.py` - this file stores functions for computing Conditional Value at Risk, and is only used in `paper_management_science.py` as an extension of prior work.
- `pmc.py` - this file stores all functions for interfacing with the Pruned Monte Carlo simulation program. This requires `pmc_greed.exe` and `pmc_est.exe` to be in PATH.
- `data.py` - this file stores basic `amazon` graph data processing. Specifically, it creates a `genre.csv` file, used in some analysis in our extended paper.
- `influence.py` - this file stores some other influence functions not stored elsewhere (i.e. not DRO or IC). The 'deterministic' influence and comonotone influence were used as part of our research and comparison but do not feature heavily in the paper(s).
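For intuition, here is a generic sketch of lazy greedy (this is not the repository's `accelgreedy` implementation). Submodularity makes stale marginal gains valid upper bounds, so when a freshly re-evaluated element still tops the max-heap, it must be the best next seed:

```python
import heapq

def lazy_greedy(ground_set, influence, k):
    """Greedily pick k seeds, re-evaluating marginal gains lazily.

    `influence` must be a monotone submodular set function.
    """
    seeds = []
    value = influence(set())
    # Max-heap via negated gains; entries are (-gain, node, round_stamp).
    heap = [(-(influence({v}) - value), v, 0) for v in ground_set]
    heapq.heapify(heap)
    for round_no in range(1, k + 1):
        while True:
            neg_gain, v, stamp = heapq.heappop(heap)
            if stamp == round_no:
                # Gain is fresh for the current seed set and tops the
                # heap, so v is the true best next seed.
                seeds.append(v)
                value -= neg_gain
                break
            # Stale entry: recompute against the current seeds, re-push.
            gain = influence(set(seeds) | {v}) - value
            heapq.heappush(heap, (-gain, v, round_no))
    return seeds

# Toy usage with a coverage function (a classic submodular objective).
cover = {0: {0, 1, 2}, 1: {2, 3}, 2: {4}}
f = lambda S: len(set().union(*(cover[v] for v in S))) if S else 0
print(lazy_greedy(cover.keys(), f, 2))  # -> [0, 1]
```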
Subdirectories:

- `conv_greed_dir` - this directory stores our C++ fork of the pruned Monte Carlo simulation, along with the changes necessary to compute convex/CVaR objectives.
- `docs` - we have a basic documentation skeleton here. One may opt to run `sphinx-apidoc` and build the docs from code comments.
Note that within `accelgreedy(...)` there is a part which builds a distance matrix from the graph, which speeds up the expected influence calculation. Its memory requirements scale quadratically with the number of nodes, so it is not expected to scale well to millions of nodes; for instance, a dense 8-byte float matrix on 10^5 nodes already needs roughly 80 GB. While the script can be changed to not require this, doing so will slow down the calculations.
To generate basic documentation, we use `sphinx-apidoc` to auto-generate documentation from code comments.
We provide below the BibTeX entries as provided by Management Science and NeurIPS 2020:
```bibtex
@article{doi:10.1287/mnsc.2021.03445,
  author = {Chen, Louis L. and Lim, Chee Chin and Padmanabhan, Divya and Natarajan, Karthik},
  title = {Robustness to Dependency in Influence Maximization},
  journal = {Management Science},
  volume = {0},
  number = {0},
  pages = {null},
  year = {0},
  doi = {10.1287/mnsc.2021.03445},
  url = {https://doi.org/10.1287/mnsc.2021.03445},
  eprint = {https://doi.org/10.1287/mnsc.2021.03445},
  abstract = {In this paper, we pursue a correlation-robust study of the influence maximization problem. Departing from the classic independent cascade model, we study a diffusion process adversarially adapted to the choice of seed set. More precisely, rather than the independent coupling of known individual edge probabilities, we now evaluate a seed set’s expected influence under all possible correlations, specifically, the one that presents the worst case. We find that the worst case expected influence can be efficiently computed, its NP-hard optimization done approximately (1−1/e) with greedy construction, and we provide complete, efficient characterizations of the adversarial coupling, the random graph, and the random number of influenced nodes. But, most importantly, upon mixing the independent cascade with the worst case, we attain a tunable and more comprehensive model better suited for real-world diffusion phenomena than the independent cascade alone and without increased computational complexity. Extensions to the correlation-robust study of risk follow along with numerical experiments on network data sets with demonstration of how our models can be tuned. This paper was accepted by George Shanthikumar, data science. Funding: This work was supported by the Air Force Office of Scientific Research (Mathematical Optimization Program) under the grant: “Optimal Decision Making under Tight Performance Requirements in Adversarial and Uncertain Environments: Insight from Rockafellian Functions”. Supplemental Material: The online appendix and data files are available at https://doi.org/10.1287/mnsc.2021.03445.}
}
```
```bibtex
@inproceedings{NEURIPS2020_4ee78d41,
  author = {Chen, Louis and Padmanabhan, Divya and Lim, Chee Chin and Natarajan, Karthik},
  booktitle = {Advances in Neural Information Processing Systems},
  editor = {H. Larochelle and M. Ranzato and R. Hadsell and M. F. Balcan and H. Lin},
  pages = {7078--7089},
  publisher = {Curran Associates, Inc.},
  title = {Correlation Robust Influence Maximization},
  url = {https://proceedings.neurips.cc/paper/2020/file/4ee78d4122ef8503fe01cdad3e9ea4ee-Paper.pdf},
  volume = {33},
  year = {2020}
}
```