A course on algorithms for doing journalism.
This is a course on algorithmic data analysis in journalism. We will cover basic methods for working with large(ish) data sets, and a variety of techniques used in story production, from regression to simulation to machine learning.
There are two basic ways algorithms are combined with journalism: we can use algorithms to analyze data to produce stories, and we can do stories about algorithms that affect people's lives. We will do both.
- Instructor: Jonathan Stray, jms2361@columbia.edu
- Dates: Mondays and Wednesdays, 7/18-8/29
- Class: 10am-1pm
- Location: World Room
- Lab: 2pm-5pm
- Slack channel: #algorithms
This is a rough outline, and subject to change, but your homework assignments will always be up to date!
Every Monday, you must bring in an algorithmic story to share with the class.
Homework is due before the following class.
Algorithms for doing journalism, journalism about algorithms. The purpose of mathematical formalism.
Homework:
- Assignment notebook. Show that an average of averages is not the same as the overall average. Work out when the overall average and an average of averages are equal. Show that this really works, by computing the values in Jupyter.
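As a starting point for the notebook, the comparison can be sketched in a few lines of plain Python. The group values below are made up for illustration and are not part of the assignment; the helper `mean` and the variable names are also assumptions. The sketch only checks two cases numerically (unequal and equal group sizes) and leaves the general argument to you.

```python
# Hypothetical data: two groups of unequal size (illustrative numbers only).
groups = [[1, 2, 3], [10, 20]]

def mean(xs):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(xs) / len(xs)

# Overall average: pool every value, then average once.
overall = mean([x for g in groups for x in g])

# Average of averages: average each group, then average those results.
avg_of_avgs = mean([mean(g) for g in groups])

# Weighting each group average by its group size recovers the overall average.
weighted = sum(len(g) * mean(g) for g in groups) / sum(len(g) for g in groups)

# Second hypothetical case: groups of equal size.
equal_groups = [[1, 2, 3], [10, 20, 30]]
equal_overall = mean([x for g in equal_groups for x in g])
equal_avg_of_avgs = mean([mean(g) for g in equal_groups])

print(overall, avg_of_avgs, weighted)
print(equal_overall, equal_avg_of_avgs)
```

With unequal group sizes the two numbers differ, while the size-weighted version matches the overall average; with equal group sizes the plain average of averages already agrees.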
In this class we will develop the ubiquitous vector space document model, with TF-IDF weighting. You will learn to algorithmically summarize documents by extracting keywords, how to compare documents for similarity, and how a search engine and Google News work.
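The ideas in this session can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the course's reference implementation: the three toy documents are invented, tokenization is a simple whitespace split, term frequency is count divided by document length, and inverse document frequency is `log(N / df)`. Other TF-IDF variants (smoothing, different normalization) are common, and the function names `tfidf_vectors`, `cosine`, and `top_keywords` are hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy corpus (illustrative text only).
docs = {
    "a": "the senate passed the budget bill",
    "b": "the house debated the budget",
    "c": "the team won the championship game",
}

def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()

def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict of word -> weight) per document."""
    tokens = {name: tokenize(text) for name, text in docs.items()}
    n_docs = len(tokens)
    # Document frequency: number of documents containing each word.
    df = Counter()
    for words in tokens.values():
        df.update(set(words))
    vectors = {}
    for name, words in tokens.items():
        counts = Counter(words)
        vectors[name] = {
            w: (c / len(words)) * math.log(n_docs / df[w])
            for w, c in counts.items()
        }
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(weight * v.get(w, 0.0) for w, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def top_keywords(vec, k=3):
    """The k highest-weighted words: a crude keyword summary of a document."""
    return sorted(vec, key=vec.get, reverse=True)[:k]

vectors = tfidf_vectors(docs)
print(top_keywords(vectors["a"]))
print(cosine(vectors["a"], vectors["b"]))
print(cosine(vectors["a"], vectors["c"]))
```

Because "the" appears in every document, its weight is zero under this weighting, so it never shows up as a keyword and contributes nothing to similarity. Ranking documents by cosine similarity to a query vector is the basic idea behind the search engine discussed in class.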
References:
- An article which describes TF-IDF in more detail: TF-IDF is about what matters
- A real life example of TF-IDF and cosine similarity used in journalism: How ProPublica's Message Machine Reverse Engineers Political Microtargeting
- The Overview document mining platform, a powerful tool you can use to explore document sets, or OCR and convert them. See also this visualization of the TF-IDF vectors of a document set.
Homework:
- Assignment notebook. Analyze the State of the Union speeches in the 20th century to see how topics changed by decade.