This source code forms the basis for our CIKM 2020 paper CauseNet: Towards a Causality Graph Extracted from the Web. The code is divided into two components: one component for analyzing the graph and another component for extracting the graph from the web. The final graph can be downloaded from causenet.org. When using the code, please make sure to refer to it as follows:
@inproceedings{heindorf2020causenet,
author = {Stefan Heindorf and
Yan Scholten and
Henning Wachsmuth and
Axel-Cyrille Ngonga Ngomo and
Martin Potthast},
title = {CauseNet: Towards a Causality Graph Extracted from the Web},
booktitle = {{CIKM}},
publisher = {{ACM}},
year = {2020}
}
We assume the following project structure:
CIKM-20/
├── java
│ ├── bootstrapping
│ └── extraction
├── notebooks
│ ├── 01-concept-spotting
│ │ ├── 01-texts-training.ipynb
│ │ ├── 02-texts-spotting-wikipedia.ipynb
│ │ ├── 03-texts-spotting-clueweb.ipynb
│ │ ├── 04-infoboxes-training.ipynb
│ │ ├── 05-infoboxes-spotting.ipynb
│ │ ├── 06-lists-training.ipynb
│ │ └── 07-lists-spotting.ipynb
│ ├── 02-graph-construction
│ │ └── 01-graph-construction.ipynb
│ ├── 03-graph-analysis
│ │ ├── 01-knowledge-bases-overview.ipynb
│ │ └── 02-graph-statistics.ipynb
│ └── 04-graph-evaluation
│ ├── 01-graph-evaluation-precision.ipynb
│ ├── 02-qa-corpus-construction.ipynb
│ └── 03-graph-evaluation-recall.ipynb
└── data/
├── bootstrapping
│ ├── 0-instances
│ ├── 0-patterns
│ ├── 1-instances
│ ├── 1-patterns
│ ├── 2-instances
│ ├── 2-patterns
│ └── seeds.csv
├── question-answering/
├── causality-graphs/
│ ├── extraction
│ │ ├── clueweb
│ │ └── wikipedia
│ ├── spotting
│ │ ├── clueweb
│ │ └── wikipedia
│ ├── integration
│ ├── causenet-full.jsonl.bz2
│ ├── causenet-precision.jsonl.bz2
│ └── causenet-sample.json
├── categorization
├── random
├── concept-spotting
│ ├── infoboxes
│ ├── lists
│ └── texts
├── flair-models
│ ├── infoboxes
│ ├── lists/
│ └── texts/
├── lucene-index/
└── external
├── extraction-sources
│ ├── clueweb12
│ └── wikipedia
├── knowledge-bases
│ ├── conceptnet-assertions-5.6.0.csv
│ ├── freebase-rdf-latest.gz
│ └── wikidata-20181001-all.json.bz2
├── msmarco
├── nltk
├── stop-word-lists
├── spacy
└── stanfordnlp
We recommend Miniconda for easy installation on many platforms.
- Create new environment:
conda env create -f environment.yml
- Activate environment:
conda activate cikm20-causenet
- Install Kernel:
python -m ipykernel install --user --name cikm20-causenet --display-name cikm20-causenet
- Start Jupyter:
jupyter notebook
The code was tested with Python 3.7.3, under Linux 4.9.0-8-amd64 with 16 cores and 256 GB RAM.
Overview of causal relations in knowledge bases as provided by Table 1.
- CauseNet-Full (output of the extraction component)
data/causality-graphs/causenet-full.jsonl.bz2
- Freebase
data/external/knowledge-bases/freebase-rdf-latest.gz
- ConceptNet (version 5.6.0)
data/external/knowledge-bases/conceptnet-assertions-5.6.0.csv
- Wikidata
data/external/knowledge-bases/wikidata-20181001-all.json.bz2
Execute the following notebook:
notebooks/03-graph-analysis/
└── 01-knowledge-bases-overview.ipynb
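CauseNet-Full is stored as a bzip2-compressed JSON Lines file, so it can be streamed record by record instead of being loaded at once. The following is a minimal sketch, assuming one JSON object per line (as the `.jsonl` extension suggests). The helper names `iter_jsonl_bz2` and `count_records` are illustrative and not part of this repository, and nothing is assumed about the fields inside each record.

```python
import bz2
import json
from typing import Dict, Iterator


def iter_jsonl_bz2(path: str) -> Iterator[Dict]:
    """Yield one parsed JSON object per non-empty line of a .jsonl.bz2 file."""
    # Text mode decompresses on the fly, so the file is never fully in memory.
    with bz2.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def count_records(path: str) -> int:
    """Count the records in the file by streaming through it once."""
    return sum(1 for _ in iter_jsonl_bz2(path))
```

Calling `count_records` on the downloaded graph file gives a quick sanity check that the download is complete and readable before running the notebook.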
- CauseNet-Full (output of the extraction component)
data/causality-graphs/integration/causenet-full.jsonl.bz2
- Manual categorization
/data/categorization/manual_categorization.csv
- Wikipedia extraction (Output of Wikipedia extraction)
data/causality-graphs/extraction/wikipedia/wikipedia-extraction.tsv
Execute the following notebook:
notebooks/03-graph-analysis/
└── 02-graph-statistics.ipynb
- DBpedia Spotlight
- Installation Instructions: https://github.com/dbpedia-spotlight/dbpedia-spotlight-model
- Required files:
https://sourceforge.net/projects/dbpedia-spotlight/files/spotlight/dbpedia-spotlight-1.0.0.jar
https://sourceforge.net/projects/dbpedia-spotlight/files/2016-10/en/model/en.tar.gz
- CauseNet-Full (output of the extraction component)
data/causality-graphs/integration/causenet-full.jsonl.bz2
- Random numbers for reproducibility:
data/random/generated_random_numbers.bz2
- MSMARCO (version: 2.1):
data/external/msmarco/train_v2.1.json
data/external/msmarco/dev_v2.1.json
- ConceptNet (version 5.6.0)
data/external/knowledge-bases/conceptnet-assertions-5.6.0.csv
- Wikidata
data/external/knowledge-bases/wikidata-20181001-all.json.bz2
Execute the following notebooks:
notebooks/04-graph-evaluation/
├── 01-graph-evaluation-precision.ipynb
├── 02-qa-corpus-construction.ipynb
└── 03-graph-evaluation-recall.ipynb
The notebook 02-qa-corpus-construction.ipynb will extract simple causal questions from MSMARCO:
data/question-answering/
├── causality-qa-training.json
└── causality-qa-validation.json
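The evaluation notebooks depend on several large input files, and a missing file is easier to catch before starting Jupyter than halfway through a run. The following is a minimal sketch that reports which of the inputs listed above are absent. The names `EVALUATION_INPUTS` and `missing_inputs` are hypothetical helpers, not part of this repository.

```python
from pathlib import Path
from typing import Iterable, List

# Input files named in this README for the graph evaluation notebooks.
EVALUATION_INPUTS = [
    "data/causality-graphs/integration/causenet-full.jsonl.bz2",
    "data/random/generated_random_numbers.bz2",
    "data/external/msmarco/train_v2.1.json",
    "data/external/msmarco/dev_v2.1.json",
    "data/external/knowledge-bases/conceptnet-assertions-5.6.0.csv",
    "data/external/knowledge-bases/wikidata-20181001-all.json.bz2",
]


def missing_inputs(project_root: str, relative_paths: Iterable[str]) -> List[str]:
    """Return the relative paths that do not exist below project_root."""
    root = Path(project_root)
    return [p for p in relative_paths if not (root / p).exists()]
```

Running `missing_inputs(".", EVALUATION_INPUTS)` from the project root returns an empty list when everything is in place.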
The graph extraction is structured as follows:
- Bootstrapping Component (Java):
- generates linguistic patterns from Wikipedia sentences using a bootstrapping approach
- Extraction Component (Java):
- uses linguistic patterns to extract causal relations from Wikipedia (texts, infoboxes, and lists) and ClueWeb12 (webpage texts)
- Causal Concept Spotting (Python):
- training sequence taggers for sentences, infoboxes and lists
- spotting causal concepts in extractions of previous step
- Graph construction (Python):
- final construction and reconciliation steps
The code was tested with Java 8 and Python 3.7.3, under Linux 4.9.0-8-amd64 with 16 cores and 256 GB RAM.
- Bootstrapping seeds:
data/bootstrapping/seeds.csv
- Lucene index with preprocessed Wikipedia sentences:
data/lucene-index/
- Compile:
mvn package -f ./java/bootstrapping/pom.xml
- Execute:
./scripts/bootstrapping.sh
The bootstrapping component will compute the following files:
data/bootstrapping/
├── 0-instances
├── 0-patterns
├── 1-instances
├── 1-patterns
├── 2-instances
└── 2-patterns
The following components will use the patterns after the second iteration: data/bootstrapping/2-patterns.
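The bootstrapping output follows an `<iteration>-patterns` naming scheme. The following is a minimal sketch for locating the patterns of the highest iteration programmatically instead of hard-coding the name. The function `latest_patterns_path` is a hypothetical helper, not part of this repository, and it relies only on the entry names shown above.

```python
import re
from pathlib import Path
from typing import Optional

# Matches entry names such as "0-patterns" or "2-patterns".
_PATTERNS_NAME = re.compile(r"^(\d+)-patterns$")


def latest_patterns_path(bootstrapping_dir: str) -> Optional[Path]:
    """Return the '<n>-patterns' entry with the largest n, or None if absent."""
    best = None
    for entry in Path(bootstrapping_dir).iterdir():
        match = _PATTERNS_NAME.match(entry.name)
        if match:
            iteration = int(match.group(1))
            # Compare numerically so that "10-patterns" ranks above "2-patterns".
            if best is None or iteration > best[0]:
                best = (iteration, entry)
    return None if best is None else best[1]
```

With the directory layout listed above, `latest_patterns_path("data/bootstrapping")` resolves to the `2-patterns` entry.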
- Wikipedia XML dump:
data/external/extraction-sources/wikipedia/enwiki-20181001-pages-articles.xml
- Patterns of the second bootstrapping iteration:
data/bootstrapping/2-patterns
- Compile:
mvn package -f ./java/extraction/pom.xml
- Execute:
./scripts/extraction-wikipedia.sh
- Causal relations extracted from texts, infoboxes and lists:
data/causality-graphs/extraction/
└── wikipedia
    └── wikipedia-extraction.tsv
We provide code to parse one ClueWeb12 file. To parse the entire ClueWeb12 corpus, you can integrate this code into your cluster software.
- ClueWeb12 file in WARC format:
data/external/extraction-sources/clueweb12/0013wb-88.warc.gz
- Patterns of the second bootstrapping iteration:
data/bootstrapping/2-patterns
- Stop word list for parsing webpages:
data/external/stop-word-lists/enStopWordList.txt
- Compile:
mvn package -f ./java/extraction/pom.xml
- Execute:
./scripts/extraction-clueweb12.sh
- Causal relations extracted from webpage texts:
data/causality-graphs/extraction/
└── clueweb12
    └── clueweb12-extraction.tsv
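The extraction components write their results as `.tsv` files. The following is a minimal sketch for streaming such a file row by row, assuming UTF-8 text with tab-separated fields. The column layout is not documented here, so the sketch returns raw field lists without interpreting them. The function `iter_tsv_rows` is a hypothetical helper, not part of this repository.

```python
import csv
from typing import Iterator, List


def iter_tsv_rows(path: str) -> Iterator[List[str]]:
    """Yield each non-empty row of a tab-separated file as a list of strings."""
    with open(path, newline="", encoding="utf-8") as handle:
        # QUOTE_NONE keeps quotation marks inside sentences as literal text.
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            if row:
                yield row
```

Iterating over the first few rows is a cheap way to inspect the output before passing it to the concept spotting notebooks.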
Models were trained on an NVIDIA GeForce GTX 1080 Ti (11 GB). To reproduce the results, we recommend using a similar GPU architecture. If you do not want to retrain the models, you can use our models: /data/flair-models/
No manual steps required. The correct versions will be automatically installed if you use the provided environment.yml.
For completeness:
- Flair (version: 0.4.2)
- Stanford Parser (version: 0.2.0) (The following bug should be fixed: stanfordnlp/stanza#135)
- Spacy (version: 2.1.8)
- Model version: 2.1.0
pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.1.0/en_core_web_sm-2.1.0.tar.gz
- Concept Spotting datasets:
/data/concept-spotting/
This folder contains the manually annotated training and evaluation data for the concept spotting.
- Output data of the extraction components:
data/causality-graphs/extraction/
├── clueweb12
│   └── clueweb12-extraction.tsv
└── wikipedia
    └── wikipedia-extraction.tsv
Execute the following notebooks:
notebooks/01-concept-spotting/
├── 01-texts-training.ipynb
├── 02-texts-spotting-wikipedia.ipynb
├── 03-texts-spotting-clueweb.ipynb
├── 04-infoboxes-training.ipynb
├── 05-infoboxes-spotting.ipynb
├── 06-lists-training.ipynb
└── 07-lists-spotting.ipynb
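The notebooks are numbered in the order in which they should be run. As an alternative to opening each one in Jupyter, the following minimal sketch builds `jupyter nbconvert` commands that execute every notebook of a directory in filename order. The function `nbconvert_commands` is a hypothetical helper, not part of this repository, and it only builds the commands without running them.

```python
from pathlib import Path
from typing import List


def nbconvert_commands(notebook_dir: str) -> List[List[str]]:
    """Build one 'jupyter nbconvert' command per notebook, sorted by filename."""
    notebooks = sorted(Path(notebook_dir).glob("*.ipynb"))
    return [
        [
            "jupyter", "nbconvert",
            "--to", "notebook",
            "--execute",
            "--inplace",
            # Training cells can run for a long time, so disable the cell timeout.
            "--ExecutePreprocessor.timeout=-1",
            str(notebook),
        ]
        for notebook in notebooks
    ]
```

Each command can then be passed to `subprocess.run(command, check=True)`, which stops at the first failing notebook so that later steps do not run on incomplete outputs.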
- Flair models for sequence labeling:
/data/flair-models/
- Separate causality graphs:
data/causality-graphs/spotting/
├── clueweb12
│ └── clueweb-graph.json
└── wikipedia
├── infobox-graph.json
├── list-graph.json
└── text-graph.json
Execute the following notebook:
notebooks/02-graph-construction/
└── 01-graph-construction.ipynb
data/causality-graphs/integration/
└── causenet-full.jsonl.bz2
For questions and feedback please contact:
Stefan Heindorf, Paderborn University
Yan Scholten, Technical University of Munich
Henning Wachsmuth, Paderborn University
Axel-Cyrille Ngonga Ngomo, Paderborn University
Martin Potthast, Leipzig University
The code by Stefan Heindorf, Yan Scholten, Henning Wachsmuth, Axel-Cyrille Ngonga Ngomo, and Martin Potthast is licensed under an MIT license.