Westac Project, 2020--2024 | Swerik Project, 2023--2025
The Swedish Parliament Corpus is actually a collection of corpora storing various document types and metadata related to proceedings of the Swedish Riksdag. Documents are structured and annotated in accordance with established practice in parliamentary corpus construction -- TEI and ParlaClarin. Metadata about actors and organizations involved in parliamentary business is stored in Normal Forms and linked across Swerik data sets and beyond...
The Swedish parliamentary corpora are developed and released iteratively: quality improvements and new data are added and evaluated continuously. Semantic versioning is used for the whole corpus, following the established major-minor-patch practices as they apply to data. For each release, a battery of unit tests is run, and a statistical sample is drawn, annotated and quantitatively evaluated to ensure the integrity and quality of the updated data. Errors are fixed as they are detected, in order of priority. Moreover, the edit history is kept as a traceable git repository.
While the contents of these corpora will change due to qualitative improvements, curation and expansion, we aim to keep the deliverable API, i.e. the data/ folders, as stable as possible. This means we avoid changing existing folder structures, file names, formats, and columns or column names in metadata files, as well as any other changes that might break downstream scripts.
Please consult Yrjänäinen et al. (2024) for an in-depth description of this project's design principles.
The full data set consists of multiple parts, which are version controlled independently from each other. For convenience, the most up-to-date versions of these data sets are zipped and made available as a package on the release page roughly once a month. These components are:
This is a collection of structured, annotated records from meetings of the Swedish Riksdag. The records are encoded in ParlaClarin-compliant XML. The release contains:
- records.zip -- the parliamentary records
- quality.zip -- various calculations and estimates relating to the quality of the released data
- Source code -- a zipped / tarballed snapshot of the repository
View the Riksdagen Records Repository here or the complete most recent release here.
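Because the records are XML built on TEI, they can also be read with generic XML tooling. The following is a minimal sketch using only the Python standard library. It assumes that speeches are encoded as TEI `<u>` (utterance) elements carrying a `who` attribute, as is conventional in TEI-based parliamentary corpora. The sample fragment, its `<seg>` layout and all identifiers in it are hypothetical and were written for illustration; they are not taken from the corpus.

```python
import xml.etree.ElementTree as ET

# The TEI namespace; element names used below are assumptions about the encoding.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical, hand-written fragment for illustration only.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text>
    <body>
      <div>
        <u who="person-1" xml:id="u-1">
          <seg xml:id="seg-1">First segment of the speech.</seg>
          <seg xml:id="seg-2">Second segment.</seg>
        </u>
        <u who="person-2" xml:id="u-2">
          <seg xml:id="seg-3">A reply.</seg>
        </u>
      </div>
    </body>
  </text>
</TEI>"""


def extract_utterances(xml_text):
    """Return a list of (speaker_id, text) pairs, one per <u> element."""
    root = ET.fromstring(xml_text)
    utterances = []
    for u in root.iterfind(".//tei:u", TEI_NS):
        # Join all text inside the utterance and normalize whitespace.
        text = " ".join("".join(u.itertext()).split())
        utterances.append((u.get("who"), text))
    return utterances


utterances = extract_utterances(SAMPLE)
for who, text in utterances:
    print(who, "->", text)
```

For real work on the corpus, the project's own tooling is likely the safer route, since it is developed alongside the data.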
This is a comprehensive collection of members of parliament, ministers and governments during this period, along with associated metadata (mandate periods, party info, etc.). The release contains:
- persons.zip -- metadata tables
- dumps.zip -- various files containing merged / filtered / wrangled (meta)data
- quality.zip -- various calculations and estimates relating to the quality of the released data
- Source code -- a zipped / tarballed snapshot of the repository
View the Riksdagen Persons Repository here or the complete most recent release here.
This is a collection of structured, annotated motions submitted to the Swedish Riksdag with linked metadata. The motions are encoded in TEI-compliant XML. The release contains:
- motions.zip -- the motions
- quality.zip -- various calculations and estimates relating to the quality of the released data
- Source code -- a zipped / tarballed snapshot of the repository
View the Riksdagen Motions Repository here or the complete most recent release here.
From the 1994/95 parliament year, written interpellation questions submitted to members of the government are stored as a separate class of documents. This is a structured, annotated catalog of these interpellation questions, encoded in TEI-compliant XML. Earlier interpellation questions, as well as interpellation debates in general, can be found in the Riksdagen Records. The release contains:
- interpellation-questions.zip -- the interpellations
- quality.zip -- various calculations and estimates relating to the quality of the released data
- Source code -- a zipped / tarballed snapshot of the repository
View the Riksdagen Interpellations Repository here or the complete most recent release here.
We offer some Python and R-based tools for working with the data. While users are free to work with the data in any way they see fit, we recommend utilizing tried and tested functions, particularly the Pyriksdagen Python module.
Pyriksdagen is a Python module developed in parallel with the corpus and designed specifically for working with it. It can be installed from PyPI in the ordinary way:
(venv) ~$ pip install pyriksdagen
A simple workflow is demonstrated in this Google Colab notebook.
Each release of Pyriksdagen is published immediately on PyPI. Each release is also zipped/tarballed on the releases page for manual installation.
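To confirm which release ended up in the active environment, the standard library can report the installed version. This is a small sketch rather than part of Pyriksdagen itself; the helper name `installed_version` is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


found = installed_version("pyriksdagen")
print("pyriksdagen:", found if found is not None else "not installed")
```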
View the Pyriksdagen Repository here or the complete most recent release here.
The scripts repository contains (primarily) Python scripts that we use for curation and maintenance of the data sets. Although we have begun releasing versions of this repository, we make no promise of backwards compatibility. Rather, we offer this code as a set of examples that users may find helpful.
View the scripts repository here.
There is also an R package under development; to install, run:
library(remotes)
remotes::install_github('swerik-project/rcr')
As a first step, we point to the directory where the corpus files are stored.
set_riksdag_corpora_path("[THE PATH TO THE CORPORA HERE]")
To extract speeches, we use extract_speeches_from_records(). Below is an example that assumes that the corpora path has been set and extracts the speeches from three different records.
fps <-
c("protocols/1896/prot-1896--ak--042.xml",
"protocols/1951/prot-1951--fk--029.xml",
"protocols/1975/prot-1975--036.xml")
sp <- extract_speeches_from_records(fps)
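The record paths above appear to follow a regular naming scheme. As an aside in Python, the sketch below splits such a path into its parts. The pattern is inferred only from the three example paths shown here (an optional chamber code such as ak or fk between the year and the record number), so it is an assumption and may not cover every file in the corpus; `parse_record_path` is a hypothetical helper, not part of rcr or Pyriksdagen.

```python
import re
from pathlib import PurePosixPath

# Assumed pattern: prot-<year>--[<chamber>--]<number>.xml
RECORD_NAME = re.compile(
    r"^prot-(?P<year>\d+)--(?:(?P<chamber>[a-z]+)--)?(?P<number>\d+)\.xml$"
)


def parse_record_path(path):
    """Split a record path into year, optional chamber code and record number."""
    name = PurePosixPath(path).name
    match = RECORD_NAME.match(name)
    if match is None:
        raise ValueError(f"unrecognized record file name: {name}")
    return {
        "year": match["year"],
        "chamber": match["chamber"],  # None when no chamber code is present
        "number": int(match["number"]),
    }


paths = [
    "protocols/1896/prot-1896--ak--042.xml",
    "protocols/1951/prot-1951--fk--029.xml",
    "protocols/1975/prot-1975--036.xml",
]
for p in paths:
    print(parse_record_path(p))
```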
View the rcr repository here or the complete most recent release here.
From 2025, we aim to make new releases of all repositories around the middle of each month (assuming there is new work to release). In theory, these "dated" releases of various repositories should be compatible with others released around the same time. The table below is a record of semantically versioned repositories at the time of scheduled releases:
Dated Release | Repository Versions |
---|---|
v2025.01.15 | pyriksdagen: v1.7.1, riksdagen-persons: v1.1.1, riksdagen-records: v1.3.0, riksdagen-motions: v0.2.1, riksdagen-interpellations: v0.2.0, scripts: v0.0.1, rcr-version: v0.3.0 |
v2024.09.13 | pyriksdagen: v1.4.0, riksdagen-persons: v1.1.0, riksdagen-records: v1.2.0 |
v2024.06.19 | pyriksdagen: v1.2.0, riksdagen-persons: v1.1.0, riksdagen-records: v1.1.0 |
v2024.04.26 | pyriksdagen: v1.2.0, riksdagen-persons: v1.0.0, riksdagen-records: v1.0.0 |
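One way to use this table is to pin the Pyriksdagen version that belongs to the dated release of the data you are working with. The sketch below copies a subset of the table into a dictionary and derives a pip requirement string from it. It assumes that PyPI version numbers equal the repository tags without the leading "v"; the helper names are hypothetical.

```python
# Subset of the table above, copied by hand.
DATED_RELEASES = {
    "v2025.01.15": {
        "pyriksdagen": "v1.7.1",
        "riksdagen-persons": "v1.1.1",
        "riksdagen-records": "v1.3.0",
    },
    "v2024.09.13": {
        "pyriksdagen": "v1.4.0",
        "riksdagen-persons": "v1.1.0",
        "riksdagen-records": "v1.2.0",
    },
}


def parse_semver(tag):
    """Turn a tag such as 'v1.3.0' into a comparable (major, minor, patch) tuple."""
    major, minor, patch = tag.lstrip("v").split(".")
    return int(major), int(minor), int(patch)


def pip_requirement(dated_release):
    """Build a pinned requirement for the Pyriksdagen version of a dated release."""
    tag = DATED_RELEASES[dated_release]["pyriksdagen"]
    return "pyriksdagen==" + ".".join(str(part) for part in parse_semver(tag))


print(pip_requirement("v2025.01.15"))
# Tuples compare element-wise, so later versions sort after earlier ones.
print(parse_semver("v1.4.0") < parse_semver("v1.7.1"))
```

The resulting string can be passed to pip install or placed in a requirements file.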
For repositories that include documentation, including the data repositories described above, the documentation can be read at swerik-project.github.io/\<repo-name>; e.g. https://swerik-project.github.io/pyriksdagen for the Pyriksdagen module or https://swerik-project.github.io/riksdagen-records for the Riksdagen Records repository.
We are developing and implementing an extensive battery of quality assessments and data integrity tests for each corpus. Some example results are presented in plot form below, but full results of these evaluations can be found in the respective repository's quality/ and test/ directories. We are working continuously to present these results in a more accessible way.
We check how many speakers in the parliamentary records our algorithms identify in each release. From the riksdagen-records repository v1.3.0.
We check the number of MPs with a mandate on a given day against the baseline number of MPs that we know should be sitting in parliament. From the riksdagen-persons repository v1.1.1.
This plot illustrates the mean daily number of MPs in the metadata compared to the baseline.
For more granularity, the plot below shows a box plot of the distribution of the daily number of MPs in each year against the baseline. The boxes are mostly not visible because they sit tightly beneath the mean line (red). Colored dots represent outlier days.
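The idea behind this check can be sketched in a few lines of Python. This is not the project's actual test code: the mandate periods, the baseline value and the function names below are all hypothetical, and the real metadata tables may be organized differently. The sketch counts how many people hold a mandate on each day and reports the days that deviate from the baseline.

```python
from datetime import date, timedelta

# Hypothetical mandate periods: (person id, first day, last day inclusive).
MANDATES = [
    ("person-a", date(2000, 1, 1), date(2000, 1, 5)),
    ("person-b", date(2000, 1, 1), date(2000, 1, 3)),
    ("person-c", date(2000, 1, 4), date(2000, 1, 5)),
    ("person-d", date(2000, 1, 2), date(2000, 1, 5)),
]
BASELINE = 3  # hypothetical expected number of sitting MPs


def daily_counts(mandates, first, last):
    """Count active mandates for every day from `first` to `last` inclusive."""
    counts = {}
    day = first
    while day <= last:
        counts[day] = sum(1 for _, start, end in mandates if start <= day <= end)
        day += timedelta(days=1)
    return counts


def deviations(counts, baseline):
    """Return {day: count - baseline} for days that differ from the baseline."""
    return {day: n - baseline for day, n in counts.items() if n != baseline}


counts = daily_counts(MANDATES, date(2000, 1, 1), date(2000, 1, 5))
mean_count = sum(counts.values()) / len(counts)
print("mean daily count:", mean_count)
print("deviating days:", deviations(counts, BASELINE))
```

In this toy data only the first day falls short of the baseline, which is the kind of outlier day the box plot marks with a colored dot.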
If you would like to participate in the curation or quality control of the data contained in the Swedish Parliament Corpus, please get in touch!
- Westac funding: Vetenskapsrådet 2018-0606
- Swerik funding: Riksbankens Jubileumsfond IN22-0003
Last update: 2025-01-21, 13:56:16