HPI information integration project SoSe 2022

This repository provides a code base for the information integration course in the summer semester of 2022. Below you can find the documentation for setting up the project.

Prerequisites

Install Poetry
Install Docker and docker-compose
Install the Protobuf compiler (protoc). If you are using Windows, you can use this guide
Install jq

Architecture

German Transparency Register (Lobbyregister) Website

The German Transparency Register website provides information about interest representatives who influence the political decision-making process. Its entries cover, for example, companies and private persons, the money they spend on representing their interests, how many people are involved, which associations they belong to, and the areas of interest and projects they work on.

The German Transparency Register provides some standard searches, for example to list all active interest representatives. It is also possible to download all the detail pages for such a list. If you already know which company you are interested in, you can search for the company name directly (Quick Search).

LR Crawler

The German Transparency Register crawler (lr_crawler) sends a GET request to the URL https://www.lobbyregister.bundestag.de/sucheDetailJson?sort=REGISTRATION_DESC and extracts the information needed for the following steps from the response. Because the dataset is only around 30 MB in size, it is possible to crawl the whole dataset.
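
As a rough illustration of this step, the following Python sketch downloads the JSON document and decodes it. It is not the project's lr_crawler: the function names are hypothetical, only the standard library is used, and nothing is assumed about the fields inside the response.

```python
import json
from urllib.request import urlopen

# URL taken from the description above.
LR_URL = "https://www.lobbyregister.bundestag.de/sucheDetailJson?sort=REGISTRATION_DESC"


def decode_register(raw: bytes):
    """Decode the raw HTTP body into Python objects (assumes UTF-8 encoded JSON)."""
    return json.loads(raw.decode("utf-8"))


def fetch_register(url: str = LR_URL, timeout: float = 60.0):
    """Download the register with a single GET request and decode it."""
    with urlopen(url, timeout=timeout) as response:
        return decode_register(response.read())

# Usage (performs a network request):
#   data = fetch_register()
```

Keeping the decoding separate from the download makes it possible to test the parsing logic on a saved response without contacting the website.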

lobbyism-events topic

The lobbyism-events topic holds all the events produced by the lr_crawler. Each message in a Kafka topic consists of a key and a value.

The key type of this topic is String. The key, lobbyist_id, is extracted by the lr_crawler from the JSON file.

The value of the message contains more information like lobbyist_name, organization_client_names, fields_of_interests, donators and more. Therefore, the value type is complex and needs a schema definition.

RB Website

The Registerbekanntmachung website contains announcements concerning entries made into the companies, cooperatives, and partnerships registers within the electronic information and communication system. You can search for the announcements. Each announcement can be requested through the link below. You only need to pass the query parameters rb_id and land_abk. For instance, we chose the state Rheinland-Pfalz rp with an announcement id of 56267, the new entry of the company BioNTech.

export STATE="rp"
export RB_ID="56267"
curl -X GET "https://www.handelsregisterbekanntmachungen.de/skripte/hrb.php?rb_id=$RB_ID&land_abk=$STATE"
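
The same URL can be assembled in Python. This is a minimal sketch using only the standard library; the function name build_rb_url is hypothetical and not taken from the project code.

```python
from urllib.parse import urlencode

# Base URL taken from the curl example above.
RB_BASE_URL = "https://www.handelsregisterbekanntmachungen.de/skripte/hrb.php"


def build_rb_url(rb_id: int, state: str) -> str:
    """Return the announcement URL for one rb_id / land_abk pair."""
    query = urlencode({"rb_id": rb_id, "land_abk": state})
    return f"{RB_BASE_URL}?{query}"


print(build_rb_url(56267, "rp"))
# prints https://www.handelsregisterbekanntmachungen.de/skripte/hrb.php?rb_id=56267&land_abk=rp
```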

RB Crawler

The Registerbekanntmachung crawler (rb_crawler) sends a GET request to the link above with the parameters rb_id and land_abk and extracts the information from the response.

We use Protocol buffers to define our schema.

The crawler uses the model class (i.e., the Corporate class) generated from the protobuf schema. The Setup section explains how to generate this class using the protobuf compiler. The compiler creates a Corporate class with the fields defined in the schema. The crawler fills the object's fields with the data extracted from the website. It then serializes the Corporate object to bytes so that Kafka can read it and produces it to the corporate-events topic. After that, it increments the rb_id value and sends another GET request. This process continues until the end of the announcements is reached, at which point the crawler stops automatically.
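
The loop described above can be sketched as follows. This is an illustration under assumptions, not the rb_crawler's actual code: fetch and produce are hypothetical callables standing in for the HTTP request and the Kafka producer, fetch returning None is an assumed signal for the end of the announcements, and encoding the page as UTF-8 bytes stands in for serializing a Corporate protobuf object. The message key follows the land_abk and rb_id combination used for the corporate-events topic.

```python
from typing import Callable, Optional


def crawl(
    start_id: int,
    state: str,
    fetch: Callable[[int, str], Optional[str]],
    produce: Callable[[str, bytes], None],
) -> int:
    """Crawl announcements from start_id until fetch returns None.

    Returns the number of messages handed to produce.
    """
    rb_id = start_id
    produced = 0
    while True:
        page = fetch(rb_id, state)
        if page is None:  # assumed end-of-announcements signal
            break
        key = f"{state}_{rb_id}"       # e.g. "rp_56267"
        value = page.encode("utf-8")   # stand-in for protobuf serialization
        produce(key, value)
        produced += 1
        rb_id += 1                     # move on to the next announcement
    return produced
```

Passing the fetch and produce steps in as functions keeps the loop testable without a network connection or a running Kafka broker.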

corporate-events topic

The corporate-events topic holds all the events (announcements) produced by the rb_crawler. Each message in a Kafka topic consists of a key and a value.

The key type of this topic is String. The key is generated by the rb_crawler. The key is a combination of the land_abk and the rb_id. If we consider the rb_id and land_abk from the example above, the key will look like this: rp_56267.

The value of the message contains more information like event_name, event_date, and more. Therefore, the value type is complex and needs a schema definition.

Kafka Connect

Kafka Connect is a tool to move large data sets into (source) and out of (sink) Kafka. Here we only use a sink connector, which reads data from a Kafka topic and writes it into a secondary index such as Elasticsearch.

We use the Elasticsearch Sink Connector to move the data from the corporate-events topic into Elasticsearch.

Setup

This project uses Poetry as a build tool. To install all the dependencies, just run poetry install.

This project uses Protobuf for serializing and deserializing objects. We provided a simple protobuf schema. Furthermore, you need to generate the Python code for the model class from the proto file. To do so run the generate-proto.sh script. This script uses the Protobuf compiler (protoc) to generate the model class under the build/gen/bakdata/corporate/v1 folder with the name corporate_pb2.py.

Run

Infrastructure

Use docker-compose up -d to start all the services: Zookeeper, Kafka, Schema Registry, Kafka REST Proxy, Kowl, Kafka Connect, and Elasticsearch. Depending on your system, it takes a couple of minutes before the services are up and running. You can use a tool like lazydocker to check the status of the services.

Kafka Connect

After all the services are up and running, you need to configure Kafka Connect to use the Elasticsearch sink connector. The config file is a JSON-formatted file. We provided a basic configuration file. You can find more information about the configuration properties on the official documentation page.

To start the connector, you need to push the JSON config file to Kafka Connect. You can either use the UI dashboard in Kowl or use the bash script provided. You can remove a connector by deleting it through Kowl's UI dashboard or by calling the deletion API in the bash script provided.
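
For orientation, the sketch below builds the kind of request that registers a connector through the Kafka Connect REST API (POST to /connectors). It is not the provided bash script or configuration file. The connector name, the Kafka Connect address http://localhost:8083, and the Elasticsearch address are assumptions, and the configuration is reduced to a few properties of the Confluent Elasticsearch sink connector.

```python
import json
from urllib.request import Request


def build_connector_request(connect_url: str, name: str, topic: str, es_url: str) -> Request:
    """Build (but do not send) the POST request that creates a sink connector."""
    payload = {
        "name": name,  # hypothetical connector name
        "config": {
            "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
            "topics": topic,           # Kafka topic to read from
            "connection.url": es_url,  # where Elasticsearch is reachable
            "key.ignore": "false",     # use the Kafka message key as document id
        },
    }
    return Request(
        f"{connect_url}/connectors",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Usage (assumed addresses; sends the request to a running Kafka Connect):
#   from urllib.request import urlopen
#   req = build_connector_request("http://localhost:8083", "elasticsearch-sink",
#                                 "corporate-events", "http://elasticsearch:9200")
#   urlopen(req)
```

A connector created this way can be removed again with a DELETE request to /connectors/ followed by the connector name.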

LR Crawler

You can start the crawler with the command below:

poetry run python lr_crawler/main.py

This command downloads the entire data set.

RB Crawler

You can start the crawler with the command below:

poetry run python rb_crawler/main.py --id $RB_ID --state $STATE

The --id option is an integer, which determines the initial event in the handelsregisterbekanntmachungen to be crawled.

The --state option takes a string (only one of the state codes shown in the usage output below). This string defines the state from which the crawler should start.

You can use the --help option to see the usage:

Usage: main.py [OPTIONS]

Options:
  -i, --id INTEGER                The rb_id to initialize the crawl from
  -s, --state [bw|by|be|br|hb|hh|he|mv|ni|nw|rp|sl|sn|st|sh|th]
                                  The state ISO code
  --help                          Show this message and exit.
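
To illustrate how these two options constrain the input, here is a small stand-in parser written with argparse from the standard library. It is not the crawler's own command-line code, and the function name parse_args is hypothetical; the option names, help texts, and state codes are copied from the usage output above.

```python
import argparse

# State codes copied from the usage output above.
STATES = ["bw", "by", "be", "br", "hb", "hh", "he", "mv",
          "ni", "nw", "rp", "sl", "sn", "st", "sh", "th"]


def parse_args(argv=None) -> argparse.Namespace:
    """Parse --id as an integer and restrict --state to the known codes."""
    parser = argparse.ArgumentParser(prog="main.py")
    parser.add_argument("-i", "--id", type=int,
                        help="The rb_id to initialize the crawl from")
    parser.add_argument("-s", "--state", choices=STATES,
                        help="The state ISO code")
    return parser.parse_args(argv)


args = parse_args(["--id", "56267", "--state", "rp"])
print(args.id, args.state)  # prints: 56267 rp
```

A state code outside the list, or a non-integer id, makes the parser exit with a usage error instead of starting a crawl.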

Query data

Kowl

Kowl is a web application that helps you manage and debug your Kafka workloads effortlessly. You can create, update, and delete Kafka resources like Topics and Kafka Connect configs. You can see Kowl's dashboard in your browser under http://localhost:8080.

Elasticsearch

To query the data in Elasticsearch, you can use the Elasticsearch Query DSL. For example:

curl -X GET "localhost:9200/_search?pretty" -H 'Content-Type: application/json' -d'
{
    "query": {
        "match": {
            <field>
        }
    }
}
'

<field> is the field you wish to search. For example:

"reference_id":"HRB 41865"
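
The same match query can be prepared from Python. This is a minimal sketch with a hypothetical function name, assuming Elasticsearch is reachable at localhost:9200 as in the curl example. It sends the query with POST, which the search API accepts in addition to GET.

```python
import json
from urllib.request import Request


def build_match_request(field: str, value: str,
                        host: str = "http://localhost:9200") -> Request:
    """Build (but do not send) a match query against all indices."""
    body = {"query": {"match": {field: value}}}
    return Request(
        f"{host}/_search?pretty",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Usage (requires a running Elasticsearch):
#   from urllib.request import urlopen
#   with urlopen(build_match_request("reference_id", "HRB 41865")) as resp:
#       print(resp.read().decode("utf-8"))
```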

Teardown

You can stop and remove all the resources by running:

docker-compose down

