ARCS is a Python package which includes a handful of utilities for Assessing Relevance of our Catalog Search system. Specifically, it is intended to help us collect relevance judgments from crowdsourcing workers, and from those judgments, to compute relevance metrics such as normalized discounted cumulative gain (NDCG) and mean average precision (MAP).
Our aim is to make this package general enough to work well with any crowdsourcing platform. That said, we have been using CrowdFlower for this particular crowdsourcing task, and that is reflected in the initial version of this software.
First, create a new virtual environment for Arcs, activate it, and then:
pip install -e .
Arcs requires a PostgreSQL database for persisting the relevance judgments and associated task data. If you're on a Mac, Homebrew is the recommended package manager. With Homebrew, you can install PostgreSQL like so:
brew install postgresql
If you're on Linux, try the following:
sudo apt-get install postgresql
Once you have postgres installed and running, create the arcs database and required tables like so:
createdb <dbname>
psql -U <username> -d <dbname> -f arcs/sql/create_arcs_tables.sql
From now on, any references to a "DB connection string" refer to a libpq connection URI, which should look something like this:
postgresql://username:@hostname:port/db_name
The username, hostname, port, and db_name parameters should be replaced with the appropriate values.
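If you ever need to pull those components apart programmatically, Python's standard library can parse the URI. The credentials below are placeholders, not real values:

```python
from urllib.parse import urlparse

# Placeholder connection string; substitute your real credentials.
conn_str = "postgresql://alice:@db.example.com:5432/arcs"

parts = urlparse(conn_str)
username = parts.username          # "alice"
hostname = parts.hostname          # "db.example.com"
port = parts.port                  # 5432
db_name = parts.path.lstrip("/")   # "arcs"
```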
From the Arcs virtual environment created above, do the following:
pip install pytest
py.test
Use the following command to parse server logs for Catalog queries and to output them as JSON. Run this from the arcs directory. Set the dirname variable to the path to a directory of gzipped server logs.
dirname=~/Data/query_analysis/2015-08-10.logs
for f in "$dirname"/*; do
    gzcat "$f" | python arcs/logparser.py
done > 2015-08-10.logs.concat.json
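Note that on most Linux distributions the command is zcat rather than gzcat. The decompression half of the loop can also be done portably with Python's standard gzip module; this sketch only yields raw lines and leaves the parsing to arcs/logparser.py:

```python
import glob
import gzip
import os

def iter_log_lines(dirname):
    """Yield decompressed lines from every .gz file in dirname,
    mirroring the gzcat half of the shell loop above."""
    for path in sorted(glob.glob(os.path.join(dirname, "*.gz"))):
        with gzip.open(path, "rt") as f:
            yield from f
```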
You may be interested in sampling domains and queries, whether for eyeballing results, for error analysis, or to serve as the basis for a new crowdsourcing catalog relevance task. You'll need a parsed query log JSON file (like the one generated in the previous step). Then you can do the following:
python arcs/collect_domain_query_data.py ~/Data/query_analysis/2015-08-10.logs.concat.json
This will write 3-column tab-delimited lines containing (domain, query, count) triples to STDOUT.
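That output is easy to consume downstream. For example (using made-up sample rows, not real data), the triples can be read back with the csv module:

```python
import csv
import io

# Hypothetical sample of the script's 3-column tab-delimited output.
sample = (
    "data.cityofchicago.org\tcrime\t42\n"
    "data.sfgov.org\tbudget 2015\t17\n"
)

# Parse each line into a (domain, query, count) triple.
triples = [
    (domain, query, int(count))
    for domain, query, count in csv.reader(io.StringIO(sample), delimiter="\t")
]
```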
Not surprisingly, there is noise in the query logs. To ensure that we don't send garbage queries to crowdsourcing workers for annotation, the collect_domain_query_data script uses both a hand-curated blacklist and filtering regex patterns to eliminate garbage queries. You may find you want to add additional patterns or blacklist entries, which you can do easily enough; the query blacklist is in the data directory. Additionally, it may be useful to supply custom filters for particular tasks. For example, if you want to launch a crowdsourcing task to collect judgments limited to only multi-term queries, you can supply a custom filter like so:
python arcs/collect_domain_query_data.py ~/Data/query_analysis/20150924.logs.concat.json -D 'postgresql://username:@hostname:5432/db_name' -d 10 -q 5 -B data/query_blacklist.txt --query_filter='lambda s: " " in s.strip()' > ~/Data/arcs/20151006.slop/queries.tsv
Here we specify an additional filter which restricts our queries to those containing an internal (non-leading, non-trailing) space, i.e. multi-term queries.
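As a quick sanity check (plain Python, nothing arcs-specific), the lambda behaves like this: leading and trailing whitespace is stripped before checking for a space, so padded single-term queries are still rejected.

```python
query_filter = lambda s: " " in s.strip()

assert query_filter("crime data")        # multi-term: kept
assert not query_filter("budget")        # single term: dropped
assert not query_filter("  budget  ")    # padding alone doesn't count
```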
The CrowdFlower UI is pretty self-explanatory. Creating new jobs can be done from the UI by clicking on an existing job, and electing to copy that job with gold units only. As a rule of thumb, the number of gold units should probably be greater than or equal to 10% of the total number of rows in a job. Additionally, it's a good idea to add to this set regularly to ensure that workers are not being exposed to the same questions over and over again.
Any programmatic interaction with the CrowdFlower API requires that a CrowdFlower API token be present in your shell environment. You can obtain such a token by logging in to CrowdFlower and visiting your account's API settings. Set the environment variable like so:
export CROWDFLOWER_API_KEY=123456789abcdefghijk
Add this to your environment resource or profile file to ensure that it is set on login. Note that the token included above is a stub, meant to be replaced with an actual token.
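Inside a Python script, a defensive lookup of the token might look like the following sketch (the actual arcs scripts may handle this differently):

```python
import os

def get_crowdflower_key():
    # Fail fast with a helpful message if the token isn't set.
    key = os.environ.get("CROWDFLOWER_API_KEY")
    if not key:
        raise RuntimeError("CROWDFLOWER_API_KEY is not set in the environment")
    return key
```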
To simplify job creation and data bookkeeping, we've added a script (launch_job) that will do the following:
- collect results for all queries in an input query file from our catalog search system (Cetera)
- store the raw results data as a CSV for posterity / inspection
- extract the relevant result fields from each query-result pair to create a CrowdFlower task
- launch the CrowdFlower task, copying existing test units from an existing job
- persist job data in a Postgres DB
The script can be run like so:
python arcs/launch_job.py -i ~/Data/arcs/20151006.slop/queries.tsv -g '{"name": "baseline", "description": "Current production system as of 10/6/2015", "params": {}}' -g '{"name": "Enabling slop=5", "description": "Testing the effect of slop=5 on multi-term queries", "params": {"slop": 5}}' -r 10 -c localhost -p 5704 -D 'postgresql://username:@hostname:5432/db_name' -F ~/Data/arcs/20151006.slop/full.csv -C ~/Data/arcs/20151006.slop/crowdflower.csv
We specify the required input file of queries with the -i flag, the parameters of each group of results with the -g flag, the number of results with the -r flag, the Cetera host and port with the -c and -p flags, our database connection string with the -D flag, and finally, optional paths to which the full and CrowdFlower CSVs should be written (-F and -C). If no groups are specified, the default behavior is to create a group named "baseline" with an empty parameters dict (which is used for each query to Cetera).
You may optionally specify a --job_to_copy (-j) parameter, which indicates the CrowdFlower job that should be used as the basis for the task. Some documentation of the various command-line parameters is available by passing the help option (-h|--help).
Once a job has been completed -- and you should receive an email notification to this effect from CrowdFlower -- you can download the judged data like so:
python arcs/fetch_job_results.py <job_id> -D 'postgresql://username:@hostname:5432/db_name'
The external (CrowdFlower) job ID must be specified as the first argument. As with the launch script above, a DB connection string must also be supplied (-D/--db_conn_str).
Once a job has completed and you've downloaded the data, you can report various statistics (including our core relevance metric, NDCG) by running the summarize_results script.
python arcs/summarize_results.py 14 27 -D 'postgresql://username:@hostname:5432/db_name'
This will report NDCG for each group as well as the delta between groups. The output should look something like this:
{
    "num_unique_qrps": 603,
    "num_total_diffs": 563,
    "baseline 51": {
        "avg_ndcg_at_5": 0.6521285003925053,
        "num_zero_result_queries": 76,
        "num_queries": 236,
        "num_irrelevant": 241,
        "avg_ndcg_at_10": 0.6763264747737386,
        "precision": 0.7897923875432526,
        "unjudged_qrps": 2,
        "ndcg_error": 0.34787149960749475
    },
    "ndcg_delta": 0.03033746690757866,
    "adjusted boost clause 52": {
        "avg_ndcg_at_5": 0.6824659673000839,
        "num_zero_result_queries": 76,
        "num_queries": 236,
        "num_irrelevant": 234,
        "avg_ndcg_at_10": 0.6967757671474697,
        "precision": 0.796875,
        "unjudged_qrps": 0,
        "ndcg_error": 0.3175340326999161
    }
}
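For reference, NDCG@k can be computed roughly as follows. This is the standard formulation and is only a sketch; arcs' actual implementation may differ in its gain function or tie handling.

```python
import math

def dcg_at_k(relevances, k):
    # Discounted cumulative gain: later ranks contribute less.
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    # Normalize by the DCG of the ideal (descending) ordering.
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0
```

A perfectly ordered result list scores 1.0. Judging from the sample output above, ndcg_error appears to be simply 1 minus avg_ndcg_at_5.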
It's useful to know how much agreement there is between our workers as it gives us some signal about the difficulty, interpretability, and subjectivity of our task. You can calculate inter-annotator agreement by first downloading non-aggregated data from CrowdFlower (Results > Settings > "All answers" in the dropdown before downloading the aggregated result) like so:
python arcs/calc_iaa.py -c file_from_crowdflower.csv --top_n
This will report Krippendorff's alpha, a chance-corrected statistical measure of agreement among an arbitrary number of annotators.
After getting judged data back from our chosen crowdsourcing platform, it's a good idea to inspect the rows where the results were obviously bad, or where something went wrong and prevented the workers from assigning a judgment score. You can do this with the following:
python arcs/error_analysis.py 28 -D 'postgresql://username:@hostname:5432/db_name' -o 20151110.errors.csv
This script will output a list of all the QRPs that were assigned at least 2 "irrelevant" judgments. The results will be sorted based on aggregated judgment. The first argument is the ID of the group to use as the basis for analysis and is required. As is the case with other utilities, a DB connection string is required and is specified with the -D flag.