This README covers the TMC Data Engineering Assessment: conducting a voter_id match between a raw, user-input file and a county voter file using various Python packages.
- Description
- Table of Contents
- Usage
- Dependencies
- Installation
- Process
- Technologies Employed
- Future Development
- Contributing
- Tests
- Questions
This process is used throughout the progressive political space for numerous reasons:
- Improving the current process of voter outreach through analysis and reporting
- Identifying limitations in geographical approach and improving allocation of resources
- Audit tracking of state KPI methodology to maintain standardization of data
- Understanding voter trends and behavior based on various demographics
- Assuring the composition of internal voter file data
In order to implement the scripts in this assessment, you will need the following Python packages and one VS Code extension:
- pandas: Library of data manipulation and analysis tools
- os: Standard-library module that provides a portable way of using operating system dependent functionality.
- glob: Standard-library module that finds all pathnames matching a specified pattern according to the rules used by the Unix shell.
- thefuzz: Utilizes Levenshtein Distance to calculate the differences between sequences across various compilations of data.
- Jupyter Notebook: VS Code extension providing a web-based interactive environment for coding.
The third-party packages listed under Dependencies can be installed within the built-in terminal of VS Code by writing:
pip install pandas thefuzz
Note that os and glob ship with Python's standard library and do not need to be installed separately.
Raw data (known in the script, in chronological order, as input_df / matched_input / matching_df) was provided by TMC to perform the voter_id match, with instructions to locate county voter file data online. The initial steps were as follows:
- Locate county voter file data from state SOS website.
- It is important to note that since the criteria of the assessment stated to download the first county files, I downloaded the counties 'ADAMS', 'ALLEN', 'ASHLAND', and 'ASHTABULA' from Ohio's SOS website.
- Import the Python package pandas to open the .csv files in Jupyter Notebook using pd.read_csv("[filename].csv").
- Understand discrepancies in the data.
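The steps above can be sketched with glob and pandas. This is a minimal sketch, assuming the four county CSVs share a filename pattern; the pattern below is hypothetical, not the assessment's actual paths:

```python
import glob
import pandas as pd

def merge_county_files(pattern):
    """Read every CSV matching the glob pattern and stack them into one DataFrame."""
    paths = sorted(glob.glob(pattern))
    frames = [pd.read_csv(p) for p in paths]
    # Guard against an empty match so the call never raises on a bad pattern.
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()

# Hypothetical usage; the real files were named after each Ohio county.
# oh_vf = merge_county_files("data/*_voter_file.csv")
```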
Prerequisites to successfully completing a voter_id match:
- Clean, structured data
- Unique Identifier
- Name columns in each file + state voter_id must be present in at least 1 file.
Cleaning the data also served to introduce a unique identifier, since both files lacked a shared unique_id. The name column in the raw file was filled with incomplete names, while the county file had four separate columns dedicated to FIRST_NAME, LAST_NAME, MIDDLE_NAME, and SUFFIX.
Due to this, a new column was added to the county_file that concatenated the name columns into one, looking as similar as possible to the name column in the raw_file.
Additionally, several columns were stripped away after combining the 4 county voter files into 1 DataFrame so that only columns necessary to perform the match were present, cutting down significantly on processing time.
Custom null values were set for both DataFrames.
- Prior to removing columns from the county_file, I took deliberate steps to export the county voter files as soon as they were merged. This means the compiled data from the state is untouched and can be later implemented throughout analysis and reporting procedures.
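The name-concatenation step can be sketched as below. This is a sketch under stated assumptions: the output column name and the ordering of the name parts are guesses at the raw file's format, not confirmed by the assessment:

```python
import pandas as pd

def add_full_name(df):
    """Concatenate the county file's split name columns into a single 'Name'
    column, formatted to resemble the free-text name in the raw file."""
    parts = df[["FIRST_NAME", "MIDDLE_NAME", "LAST_NAME", "SUFFIX"]].fillna("")
    df["Name"] = (
        parts.apply(" ".join, axis=1)                # join the pieces with spaces
             .str.replace(r"\s+", " ", regex=True)   # collapse doubled spaces from empty parts
             .str.strip()
    )
    return df
```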
The first step taken to perform the match was to import process from thefuzz, which offers fuzzy string matching. Then, the name columns from each DataFrame were converted into lists:
input_list = input_df['name'].tolist()
oh_vf2_list = oh_vf2['Name'].tolist()
Next, creating a function that defines the parameters of the match in conjunction with thefuzz is crucial here:
from thefuzz import process
def match_vf(query, choices, limit=2):
    return process.extract(query, choices, limit=limit)
input_df['matches'] = [match_vf(name, oh_vf2_list) for name in input_list]
print(input_df)
This brought me to the initial matched DataFrame, where the matched items look as such:
(ADAM METALWALA , 95)
This newly matched output was saved to a new DataFrame, matched_input, and the entirety of the next script focused on cleaning and structuring this column, while also splitting it into matched_name and matched_score, respectively. The metamorphosis of this column into two separate ones went as follows:
1. `(ADAM METALWALA , 95)`
2. matched_name: ( ADAM METALWALA
matched_score: , 95 )
3. matched_name: ADAM METALWALA
matched_score: 95
This was successful and simple due to the implementation of Regular Expressions.
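The split above can be sketched with pandas string methods and a regular expression. This is a minimal sketch, assuming the matched column is a string of the form `(NAME , SCORE)`; the column name "matches" is a hypothetical stand-in:

```python
import pandas as pd

def split_match_column(df, col="matches"):
    """Split a '(NAME , SCORE)' string into matched_name and matched_score
    columns using named regex capture groups."""
    extracted = df[col].str.extract(
        r"\(\s*(?P<matched_name>.*?)\s*,\s*(?P<matched_score>\d+)\s*\)"
    )
    df["matched_name"] = extracted["matched_name"]
    df["matched_score"] = extracted["matched_score"].astype(int)
    return df
```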
With the data cleaned twice and matched once, merging on matched_name and address allowed for a fruitful match of voter_id to matching_df, the DataFrame formerly known as input_df.
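That final join can be sketched with pandas' merge. This is a sketch under stated assumptions: the frame and column names follow the README, but the exact keys are simplified to the name column only (the real match also used address):

```python
import pandas as pd

def attach_voter_id(matching_df, county_df):
    """Merge the fuzzy-matched names back to the county file to pull in SOS_VOTERID.
    A left join keeps every raw-input row, matched or not."""
    return matching_df.merge(
        county_df[["Name", "SOS_VOTERID"]],
        left_on="matched_name",
        right_on="Name",
        how="left",
    )
```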
Lastly, I made sure columns not necessary to the desired final product were dropped, and SOS_VOTERID was renamed to Matched_voterid as required in the prompt.
The file was then exported as a .csv using the to_csv call in pandas and saved as tmc_final_matched_voter.csv
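The cleanup-and-export step can be sketched as below; which helper columns get dropped is an assumption, since the README does not list them:

```python
import pandas as pd

def export_final(df, path="tmc_final_matched_voter.csv"):
    """Rename SOS_VOTERID per the prompt, drop helper columns, and export."""
    out = df.rename(columns={"SOS_VOTERID": "Matched_voterid"})
    out = out.drop(columns=["Name"], errors="ignore")  # hypothetical helper column
    out.to_csv(path, index=False)
    return out
```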
Python packages used in order: pandas, os, glob, and thefuzz (formerly known as fuzzywuzzy). The assessment in its entirety was completed using Jupyter Notebook, solely to utilize the run-by-cell functionality and assure output was accurate throughout the steps taken to produce the final product.
However, the script was revamped utilizing VS Code and its integrated terminal to improve proficiency and processing speed while ensuring accuracy in the final results.
It would make sense to perform this procedure with the stack provided in a given workplace. Throughout my time conducting the assessment, several ideas came to mind as to how this process can and should be improved.
For starters, after completing the first match, where thefuzz gave a dedicated score and name, uploading the files as a .csv to a data warehouse (e.g., Google BigQuery, Civis Analytics Platform) after cleaning the matches would have been a viable solution and a great time-saver. Additionally, utilizing the pd.read_sql call via pandas could serve as an alternative. However, with the Python packages listed under Dependencies, it was quite simple and efficient to remain in Python and carry out the process.
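The read_sql alternative can be sketched as below. This is a minimal sketch using an in-memory SQLite database as a stand-in for a real warehouse connection; the table and column names are hypothetical:

```python
import sqlite3
import pandas as pd

# In-memory SQLite stands in for a warehouse connection in this sketch.
conn = sqlite3.connect(":memory:")
pd.DataFrame({"name": ["ADAM METALWALA"], "SOS_VOTERID": ["OH123"]}).to_sql(
    "county_vf", conn, index=False
)

# Pull only the columns needed for the match, straight into a DataFrame.
county_df = pd.read_sql("SELECT name, SOS_VOTERID FROM county_vf", conn)
```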
Understandably, this process is on par with data engineering, and I enjoy performing tasks of this nature. However, in my experience in progressive data, applications like TargetSmart offer a tool, 'SmartMatching', which cuts out data-management tasks like voter file matching and serves as a great aid when performing day-to-day analysis on voter file data. In my present-visualization-queries repository, the folder VR_Matched_&_Counts showcases the efficiency of utilizing tools like SmartMatch, where raw data entry from voter outreach canvassers is matched back to a state or national voter_id.
Lastly, I'd like to recreate this process utilizing Splink as a record-linkage Python package. thefuzz was implemented solely due to a lack of exposure on my end as to which Python packages can and should be used for matching these two sources of data.
Now that the assessment is long over, I look to turn this into a personal project by testing different Python packages, trying different ideas, and continuing to learn as much as I can. At this point, it's not so much about finding the correct solution as it is about learning more and further developing my understanding of Python, data engineering, and everything else this process entails.
Contributions can be made at your nearest foodbank
At every iteration of the DataFrame where a pivotal change occurred, I made sure to export the contents to a .csv file. For example, I exported the initial merge of the county voter file data, because having a compiled, structured file of that scope will continue to serve reporting and analytic needs.
Find me on GitHub
Email me directly
© 2022 TMC Data Engineering Assessment: VF Matching by ametalwala, Confidential and Proprietary. All Rights Reserved.