digitalepidemiologylab/crowdbreaks-welcome


Hi! You're probably reading this either because you are working on the Crowdbreaks project or you are trying to understand more about what Crowdbreaks is.

The goal of this README is to provide an overview of the project's components and of where the project is headed.

Goal

For many health-related issues, human behavior is of central importance for Public Health when designing appropriate policies. Health behaviors are partially influenced by people's opinions, which have traditionally been assessed in surveys. Social media can be used to complement traditional surveys and serve as a low-cost, global, and real-time addition to the toolset of Public Health surveillance.

Crowdbreaks focuses on public social media data (currently from Twitter) to track such health behaviors. A common issue when building Machine Learning classifiers on social media data is model drift. Crowdbreaks is specifically built to overcome this issue by re-annotating newly collected data and re-training algorithms automatically. For a more comprehensive explanation of this idea, you may want to read the Crowdbreaks paper.
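To make the idea of model drift more concrete, here is a minimal sketch of how drift could be detected by comparing a classifier's predictions with freshly collected annotations. This is not the Crowdbreaks implementation: the function name `needs_retraining` and the threshold `min_accuracy` are hypothetical and only serve as an illustration.

```python
from typing import List, Tuple


def needs_retraining(
    predictions: List[str],
    fresh_annotations: List[str],
    min_accuracy: float = 0.8,  # hypothetical threshold, chosen for illustration
) -> Tuple[bool, float]:
    """Compare model predictions with newly collected human labels.

    Returns a flag indicating whether accuracy fell below the threshold,
    together with the measured accuracy.
    """
    if len(predictions) != len(fresh_annotations):
        raise ValueError("predictions and annotations must align one-to-one")
    if not predictions:
        raise ValueError("need at least one annotated example")

    correct = sum(p == a for p, a in zip(predictions, fresh_annotations))
    accuracy = correct / len(predictions)
    return accuracy < min_accuracy, accuracy


# Half of the predictions disagree with the new annotations, so retraining is flagged.
flag, accuracy = needs_retraining(
    ["positive", "negative", "positive", "negative"],
    ["positive", "negative", "negative", "positive"],
)
```

In this picture, re-annotation supplies the fresh labels and a flagged result would trigger re-training on the updated data.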

Platform overview

The Crowdbreaks platform consists of multiple parts, each of which has its own repository on GitHub. It should be seen as a toolbox to overcome many distinct technical challenges in analyzing large amounts of data in real-time. In general, all code related to Crowdbreaks is open source under the MIT license.

Website

The Crowdbreaks website is accessible at https://www.crowdbreaks.org/.

The website has three main purposes:

  1. Collect and store annotation data. Annotation data (also called "labelled data") is raw data that a human has tagged with a class (or label). Annotation data is central to all supervised Machine Learning and is key to most projects running on Crowdbreaks. There are three modes of annotation, called public, mturk, and local. In public mode, any user of the website (registered or not) can annotate (this process is also called "crowdsourcing"). In mturk mode, annotation data is collected through Amazon Mechanical Turk, which is a paid service. In local mode, annotation is done by users (e.g. experts) who have registered on the Crowdbreaks website.
  2. Visualizations: Part of the idea of Crowdbreaks is to provide results of the analysis back to the public. One way of doing this is via visualizations of the tracked trends. In the future the platform might also provide educational content or expose public APIs which would allow real-time trends from Crowdbreaks to be integrated into other Public Health tools.
  3. Project management: The website also serves as a project management interface that makes it possible to 1) create and manage projects, 2) control the data collection (see the section on data streaming), 3) easily use MTurk, and 4) monitor the status of the system.

Tech: The website is built with Ruby/Rails with a React.js front-end (visualizations are in d3.js). Annotation data is stored in a Postgres database. The website is hosted on Heroku.

Data streaming

Crowdbreaks leverages filtered streaming endpoints within the Twitter Developer API. This means that the data from Twitter is collected in real-time for multiple projects simultaneously. Machine Learning algorithms predict labels for the newly collected tweets (e.g. sentiment, relevance, etc.).

Tech: The technologies used for the Crowdbreaks streamer are closely interlinked with the AWS (Amazon Web Services) ecosystem. The streamer is a Python application that runs on an AWS Fargate cluster and uses a POST statuses/filter (API v1.1) request to connect to a filtered stream of tweets from Twitter. The platform employs tools such as AWS Kinesis delivery streams, AWS Lambda, AWS SageMaker, as well as Elasticsearch/Kibana. These tools allow the streamer to be stable and scalable.
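Since a single filtered stream serves multiple projects at once, each incoming tweet has to be assigned to the project(s) it belongs to. The following is a minimal sketch of that routing step, assuming simple case-insensitive substring matching on the tweet text. The project names, keywords, and the function `match_projects` are hypothetical and do not reflect the actual streamer code or Twitter's own matching rules.

```python
from typing import Dict, List

# Hypothetical project configuration: project name -> keywords to track.
PROJECTS: Dict[str, List[str]] = {
    "vaccine_sentiment": ["vaccine", "vaccination"],
    "crispr": ["crispr", "gene editing"],
}


def match_projects(tweet_text: str, projects: Dict[str, List[str]]) -> List[str]:
    """Return the sorted names of all projects whose keywords occur in the tweet."""
    text = tweet_text.lower()
    return sorted(
        name
        for name, keywords in projects.items()
        if any(keyword.lower() in text for keyword in keywords)
    )


# A tweet can belong to more than one project.
matched = match_projects("New CRISPR vaccine trial announced", PROJECTS)
```

A tweet that matches no keywords yields an empty list and could simply be discarded.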

Analysis tools

Although Crowdbreaks is a real-time analysis tool (which is run in the Cloud, or more specifically on AWS), considerable effort has been put into building tools to analyze Crowdbreaks data locally (e.g. on a cluster).

Preprocess

Preprocess is a CLI (command line interface) tool, which provides the following commands:

  • init: Initialize for a specific project
  • sync: Download (synchronize) raw data (tweets) and annotation data for that project
  • parse: Parse the raw data into a Python-usable format (Pandas DataFrames)
  • sample: Select data for annotation on the Crowdbreaks website (usually around 10,000 tweets)
  • batch: Create a batch of the sample to be uploaded to the Crowdbreaks website for annotation (batches generally contain fewer than 4,000 tweets)
  • clean_labels: A single tweet is usually labelled by multiple users. In order to reach a consensus between multiple users, the annotation data is both cleaned of outliers and merged. The result is a consensus label for every tweet, which serves as training data for a Machine Learning algorithm
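As an illustration of the clean_labels step, the sketch below merges multiple annotations per tweet into one consensus label by majority vote. It is a simplified stand-in and not the actual implementation: the function name `consensus_labels` and the parameters `min_votes` and `min_agreement` are hypothetical, and it drops tweets with too little agreement rather than removing outlier annotators.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple


def consensus_labels(
    annotations: Iterable[Tuple[str, str]],
    min_votes: int = 2,          # hypothetical: minimum annotations per tweet
    min_agreement: float = 0.6,  # hypothetical: required share of the top label
) -> Dict[str, str]:
    """Merge (tweet_id, label) pairs into one consensus label per tweet."""
    votes: Dict[str, List[str]] = defaultdict(list)
    for tweet_id, label in annotations:
        votes[tweet_id].append(label)

    consensus: Dict[str, str] = {}
    for tweet_id, labels in votes.items():
        if len(labels) < min_votes:
            continue  # not enough annotations to speak of a consensus
        label, count = Counter(labels).most_common(1)[0]
        if count / len(labels) >= min_agreement:
            consensus[tweet_id] = label
    return consensus


annotations = [
    ("1", "positive"), ("1", "positive"), ("1", "negative"),  # 2 of 3 agree
    ("2", "neutral"),                                         # only one vote
    ("3", "positive"), ("3", "negative"),                     # tie, no consensus
]
training_labels = consensus_labels(annotations)
```

Only tweet "1" passes both thresholds here; the other two are left out of the training data.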

Additionally, the library provides efficient helper functions (e.g. to load the data). Usually this library is therefore integrated as a git submodule within a project-specific GitHub repository.

Tech: The tool is written in Python. For easier data cleaning we use the Pandas library. Raw data is either in CSV format (annotation data) or in gzipped JSON-lines format (raw Twitter data). The parsed data (DataFrame) is stored in a binary format called Parquet.
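For readers unfamiliar with the gzipped JSON-lines format, the following sketch shows how such a file can be read with the Python standard library alone. The function name `read_jsonl_gz` and the file name in the usage example are hypothetical; the actual parsing code in preprocess may look different.

```python
import gzip
import json
from typing import Dict, Iterator


def read_jsonl_gz(path: str) -> Iterator[Dict]:
    """Yield one parsed JSON object per non-empty line of a gzipped file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)


# Usage (hypothetical file name):
# for tweet in read_jsonl_gz("tweets.jsonl.gz"):
#     print(tweet["id"])
```

With Pandas installed, the resulting records could then be turned into a DataFrame and written out with `DataFrame.to_parquet`, which requires a Parquet engine such as pyarrow.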

txtcls

txtcls is a CLI tool that automates text classification model training, testing, and deployment for speed and reproducibility. Models are trained locally and can then be deployed directly to Crowdbreaks from the command line.

Tech: The tool is built in Python and PyTorch.

local-geocode

Local geocode is a library to perform geocoding. It is specifically tuned to work on the user location field of tweets. The library parses a string such as "New York City" and returns the geo coordinates (longitude and latitude) if there is a match. The library is used in preprocess (and soon in the streamer as well) to enrich the geo-information of tweets.

Tech: The library is written in Python and uses data from geonames.org.
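The basic idea of looking up a free-text location locally can be sketched as follows. This is only an illustration and not the local-geocode API: the tiny `GAZETTEER` dictionary, its approximate coordinates, and the functions `normalize` and `geocode` are hypothetical, whereas the real library works with data from geonames.org.

```python
from typing import Dict, Optional, Tuple

# Hypothetical miniature gazetteer: normalized place name -> (longitude, latitude).
# Coordinates are approximate and only meant for illustration.
GAZETTEER: Dict[str, Tuple[float, float]] = {
    "new york city": (-74.006, 40.7128),
    "zurich": (8.55, 47.3667),
}


def normalize(location: str) -> str:
    """Lowercase the string and collapse runs of whitespace."""
    return " ".join(location.lower().split())


def geocode(
    location: str, gazetteer: Dict[str, Tuple[float, float]]
) -> Optional[Tuple[float, float]]:
    """Return (longitude, latitude) for a known place name, else None."""
    return gazetteer.get(normalize(location))


coordinates = geocode("  New  York City ", GAZETTEER)
unknown = geocode("somewhere over the rainbow", GAZETTEER)
```

User location fields are free text, so a lookup that returns no match (as for the second call) is the common case rather than an error.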

COVID-Twitter-BERT

COVID-Twitter-BERT is a library for performing what is called domain-specific pretraining (DSP) of transformer models. DSP is a process in which a base model (such as BERT) is trained on domain-specific data (such as tweets) in order to increase its performance over the base model in downstream tasks (such as classification). The process was specifically used on a large dataset of tweets about COVID-19 and is described in this paper.

Tech: The library is written in Python and uses the TensorFlow 2 library. It is optimized for training on TPUs and requires a Google Cloud bucket.

Past & ongoing Crowdbreaks projects

Vaccine sentiment tracking

Vaccine sentiment tracking is an ongoing project and serves as a classic use case for Crowdbreaks. The lab has previously worked on this topic and collected annotation data outside of Crowdbreaks (e.g. in this project). The data was also used in a project to understand correlations between vaccine sentiment and vaccine uptake in England. The study can be found here.

Vaccination sentiment in Brazil

This is a collaboration with PAHO (Pan American Health Organization), which is specifically interested in the influence of social media on vaccine uptake in Brazil.

Assessing public opinion on CRISPR/Cas9

In this collaboration with ETH Zurich we used Twitter data related to the gene editing technology CRISPR/Cas9. Feel free to read more here.

COVID-19 disease outbreak

The COVID-19 outbreak led to a massive flood of data, which gave rise to multiple projects:

  • Trending topics and tweets: Early on in the pandemic we used Crowdbreaks to filter incoming data and generated a visualization of real-time tweets and trending topics. This was later discontinued for cost reasons.
  • Attention to experts: A critical question is the role of scientific experts on Twitter. A study related to this can be found here.
  • Several other projects based on this data are work in progress.

About

Welcome doc for the Crowdbreaks project
