This repository holds multiply segmented corpora from the papers below. The data formats are as specified in the Segmentation Representation Specification Version 1.1 [PDF], and are of two types:
- JSON (JavaScript Object Notation) or
- TSV (Tab Separated Values)
To evaluate these corpora and compute segmentation metrics, use the SegEval software package.
/kubla_khan_fournier_2013/ and kubla_khan_fournier_2013.json
- Segmentations of the poem Kubla Khan by Samuel Taylor Coleridge (1816), codings collected by Fournier (2013); and
/stargazer_hearst_1997/ and stargazer_hearst_1997.json
- Segmentations of Stargazers look for life by Baker (1990), codings collected by Hearst (1997).
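
As a rough illustration of working with the JSON files listed above, the sketch below reads a corpus file with only the Python standard library and summarizes each coding. It is not the SegEval API. It assumes, without confirmation from this README, that each file has a top-level "items" object mapping item name to coder name to a list of segment masses (segment sizes in units); check the Segmentation Representation Specification before relying on that layout. The function names are hypothetical.

```python
import json


def load_segmentations(path):
    """Return {item: {coder: [segment masses]}} from a corpus JSON file.

    ASSUMPTION: codings live under a top-level "items" key; verify this
    against the Segmentation Representation Specification.
    """
    with open(path, encoding="utf-8") as handle:
        data = json.load(handle)
    return data["items"]


def masses_to_boundaries(masses):
    """Convert segment masses to boundary positions.

    A boundary falls after each segment except the last, at the running
    total of the masses seen so far.
    """
    positions = []
    total = 0
    for mass in masses[:-1]:
        total += mass
        positions.append(total)
    return positions


def summarize(items):
    """List (item, coder, number of segments, total mass) for each coding."""
    rows = []
    for item, codings in sorted(items.items()):
        for coder, masses in sorted(codings.items()):
            rows.append((item, coder, len(masses), sum(masses)))
    return rows
```

Calling summarize(load_segmentations("kubla_khan_fournier_2013.json")) would give one row per coder. Under the assumed layout, every coding of the same item should report the same total mass, which makes a quick sanity check before evaluation.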