We have built a dataset of nearly all Indian electors. The data include each elector's first and last name, gender, polling station (constituency, district, and state), and father's or husband's name, among other details. We assembled the data by scraping and parsing the electoral rolls.
This repository includes scripts for downloading the PDF electoral rolls from the various state election commission sites. Parse PDF Rolls has scripts for parsing the electoral rolls, scripts for translating native language rolls to English, and information about the resulting CSVs.
To ameliorate concerns about eligible voters not being on the rolls (and ineligible electors being on the rolls), the Election Commission of India mandates that state election commissions publish electoral rolls. As a result, the 36 different election commissions---29 states and 7 union territories---post electoral rolls for each polling station on their websites. The websites vary enormously in design, in the metadata they provide about the polling stations, and the language in which they provide the electoral rolls. For instance, some commissions provide electoral rolls in English, some in the main native language(s) of the state, and some in both the main native language(s) of the state and English. The only thing that is constant is that these electoral rolls are provided in dense pdfs. So we wrote separate scrapers for downloading the pdfs. In many cases, we also downloaded the metadata for each of the polling stations (pdfs) that was on the website. (A separate repository uses a different source of data to collate metadata on polling stations.) For scripts, information about the source of the electoral rolls, and such, see the table below.
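As a rough illustration of the download step described above, here is a minimal sketch of a generic downloader that saves one PDF per polling station and records its metadata in a CSV. It is not the code in this repository: the actual scrapers are site-specific, and the function names, the dictionary keys (`state`, `constituency`, `station`, `url`), and the output layout are all assumptions made for the example.

```python
import csv
from pathlib import Path
from urllib.request import urlopen


def fetch_bytes(url, timeout=60):
    """Download one file and return its raw bytes."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.read()


def download_rolls(stations, out_dir, fetch=fetch_bytes):
    """Save one PDF per polling station and log its metadata to a CSV.

    `stations` is an iterable of dicts with hypothetical keys:
    state, constituency, station, url.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    rows = []
    for st in stations:
        name = f'{st["constituency"]}_{st["station"]}.pdf'
        pdf_path = out_dir / st["state"] / name
        pdf_path.parent.mkdir(parents=True, exist_ok=True)
        if not pdf_path.exists():  # skip files fetched on an earlier run
            pdf_path.write_bytes(fetch(st["url"]))
        rows.append({**st, "path": str(pdf_path)})

    # Write the polling-station metadata next to the downloaded PDFs.
    fields = ["state", "constituency", "station", "url", "path"]
    with open(out_dir / "metadata.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fields, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(rows)
    return rows
```

Passing `fetch` as a parameter keeps the file-naming and bookkeeping logic separate from the site-specific part (sessions, CAPTCHAs, form posts), which is where the state websites differ most.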
Given privacy concerns, we are releasing the data only for research purposes. To access the pdfs, you must agree to take all precautions to maintain the privacy of Indian electors. (There is a difference between data being available in pdfs, split across different sites, sometimes behind CAPTCHA, and a common data dump.) If you would like access to the electoral rolls, please fill out the following form.
You will also need to get IRB approval from your university or institution. The IRB-approved proposal should include:
- Case for why the data are necessary
- Acknowledgment that the data will be kept in a secure environment
- All the people who will have access to the data
- That the data will only be used on projects with IRB approval
- That the data won't be shared with anyone who is not named in the list of people with access (the third item above)
- That publications and presentations will not reveal identifying individual information: only statistical summaries will be presented.
The data are available on Harvard Dataverse and via Google Cloud Storage (Coldline storage class). The GCS buckets are set up as requester pays, so you need to create a Google Cloud project that will be billed for access.
To access data from GCS, pass your billing project to gsutil with the `-u` flag. For example, listing the bucket shows the available archives:
gsutil -u projectname_for_billing ls gs://in-electoral-rolls/
gs://in-electoral-rolls/andaman.tar.gz
gs://in-electoral-rolls/andhra_pdfs.tar.gz
gs://in-electoral-rolls/arunachal.tar.gz
gs://in-electoral-rolls/assam.tar.gz
gs://in-electoral-rolls/bihar.tar.gz
gs://in-electoral-rolls/chandigarh_pdfs.tar.gz
gs://in-electoral-rolls/dadra_pdfs.tar.gz
gs://in-electoral-rolls/daman_2015.tar.gz
gs://in-electoral-rolls/daman_2016.tar.gz
...
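The same `-u` flag applies when copying an archive. The sketch below wraps `gsutil cp` from Python and unpacks the downloaded `.tar.gz` with the standard library. The helper names are hypothetical, `my-billing-project` is a placeholder, and the internal layout of the archives is an assumption (the code simply collects every `.pdf` it finds after extraction).

```python
import subprocess
import tarfile
from pathlib import Path

BUCKET = "gs://in-electoral-rolls"


def gsutil_cp_command(billing_project, archive, dest_dir="."):
    """Build the gsutil command that copies one archive, billed to `billing_project`."""
    return ["gsutil", "-u", billing_project, "cp", f"{BUCKET}/{archive}", str(dest_dir)]


def extract_archive(archive_path, dest_dir):
    """Unpack a .tar.gz archive and return the extracted PDF paths, sorted."""
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(dest_dir)
    return sorted(p for p in dest_dir.rglob("*") if p.suffix.lower() == ".pdf")


def download_and_extract(billing_project, archive, dest_dir="."):
    """Copy one archive from the bucket, then unpack it into a sibling folder."""
    subprocess.run(gsutil_cp_command(billing_project, archive, dest_dir), check=True)
    target = Path(dest_dir) / archive.replace(".tar.gz", "")
    return extract_archive(Path(dest_dir) / archive, target)
```

For example, `download_and_extract("my-billing-project", "assam.tar.gz", "data")` would copy the Assam archive into `data/` and unpack it into `data/assam/`, assuming gsutil is installed and authenticated and the `data/` directory already exists.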
If you would like access to CSVs from parsing the electoral roll pdfs, check out https://github.com/in-rolls/parse_elex_rolls. The data are posted on the Harvard Dataverse at http://dx.doi.org/10.7910/DVN/MUEGDT.
Gaurav Sood and Atul Dhingra. 2018. Indian Electoral Rolls PDF Corpus. https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/OG47IV
State | Year(s) | Language(s) |
---|---|---|
Andaman & Nicobar Islands | 2017 | English |
Andhra Pradesh | 2017 | Telugu, English |
Arunachal Pradesh | 2017 | English |
Assam | 2018 | Bengali |
Bihar* | 2017 | Hindi |
Chhattisgarh (not reachable) | -- | -- |
Chandigarh | 2018 | Hindi |
Dadra & Nagar Haveli | 2017 | Gujarati, English |
Daman & Diu | 2017 | Gujarati, English |
Goa | 2018 | English |
Gujarat | 2017 | Gujarati |
Haryana | 2018 | Hindi |
Himachal Pradesh | 2017 | Hindi |
Jammu & Kashmir | 2018 | Hindi, English, Urdu |
Jharkhand | 2018 | Hindi |
Lakshadweep | 2017 | Malayalam |
Karnataka | 2018 | Kannada |
Kerala | 2018 | Malayalam, English |
Madhya Pradesh | 2017 | Hindi |
Maharashtra | 2018 | Marathi |
Manipur | 2018 | Manipuri, English |
Meghalaya | 2018 | English |
Mizoram | 2018 | English |
Nagaland | 2018 | English |
NCT of Delhi | 2018 | Hindi, English |
Odisha | 2018 | Odia |
Punjab | 2018 | Punjabi |
Puducherry | 2018 | Tamil, English |
Rajasthan | 2014 | Hindi |
Sikkim | 2018 | English |
Tamil Nadu | 2018 | Tamil |
Telangana | 2017 | Telugu |
Tripura | 2018 | Bengali |
Uttar Pradesh | 2018 | Hindi |
Uttarakhand | 2017 | Hindi |
West Bengal | 2018 | Bengali |
State | Year(s) | Language(s) |
---|---|---|
Bihar (see acknowledgments) | 2015 | Hindi |
Bihar | 2020 | Hindi |
Daman | 2015--2016 | English, Gujarati |
Karnataka | 2015--2017 | Kannada |
Kerala | 2011--2016 | Malayalam |
Uttarakhand | 2007--2016 | Hindi |
- The Bihar 2015 electoral rolls were contributed by Aaditya Dar. Aaditya also pointed us to the right way to set up a data access procedure in which researchers need to get IRB approval.
- The specifics of IRB are 'inspired' by http://adfdell.pstc.brown.edu/arisreds_data/readme.txt
- Elian Carsenat helped us craft better directions for how to access data on Google Cloud Storage.
The scripts are provided under the MIT license.