ARCS is a Python package which includes a handful of utilities for Assessing Relevance of our Catalog Search system. Specifically, it is intended to help us collect relevance judgments from crowdsourcing workers, and from those judgments, to compute relevance metrics such as normalized discounted cumulative gain (NDCG) and mean average precision (MAP).
Our aim is to make this package general enough to work well with any crowdsourcing platform. That said, we have been using CrowdFlower for this particular crowdsourcing task, and that is reflected in the initial version of this software.
First, create a new virtual environment for Arcs, activate it, and then:
pip install -e .
Arcs requires a PostgreSQL database for persisting the relevance judgments and associated task data. If you're on a Mac, Homebrew is the recommended package manager. With Homebrew, you can install PostgreSQL like so:
brew install postgresql
If you're on Linux, try the following:
sudo apt-get install postgresql
Once you have postgres installed and running, create the arcs database and required tables like so:
createdb <dbname>
psql -U <username> -d <dbname> -f arcs/sql/create_arcs_tables.sql
From now on, any references to a "DB connection string" refer to a libpq connection URI, which should look something like this:
postgresql://username:@hostname:port/db_name
The username, hostname, port, and db_name parameters should be replaced with the appropriate values.
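If you ever need to pull those components apart programmatically, Python's standard library can parse the URI. The credentials below are placeholders, not real values:

```python
from urllib.parse import urlparse

# Placeholder connection string; substitute your real credentials.
conn_str = "postgresql://alice:@db.example.com:5432/arcs"

parts = urlparse(conn_str)
username = parts.username          # "alice"
hostname = parts.hostname          # "db.example.com"
port = parts.port                  # 5432
db_name = parts.path.lstrip("/")   # "arcs"
```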
From the Arcs virtual environment created above, do the following:
pip install pytest
py.test
Use the following command to parse server logs for Catalog queries and to output them as JSON. Run this from the arcs directory. Set the dirname variable to the path to a directory of gzipped server logs.
dirname=~/Data/query_analysis/2015-08-10.logs
for f in "$dirname"/*; do
    gzcat "$f" | python arcs/logparser.py
done > 2015-08-10.logs.concat.json
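Note that on most Linux distributions the command is zcat rather than gzcat. The decompression half of the loop can also be done portably with Python's standard gzip module; this sketch only yields raw lines and leaves the parsing to arcs/logparser.py:

```python
import glob
import gzip
import os

def iter_log_lines(dirname):
    """Yield decompressed lines from every .gz file in dirname,
    mirroring the gzcat half of the shell loop above."""
    for path in sorted(glob.glob(os.path.join(dirname, "*.gz"))):
        with gzip.open(path, "rt") as f:
            yield from f
```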
You may be interested in sampling domains and queries, whether for eyeballing results, for error analysis, or to serve as the basis for a new crowdsourcing catalog relevance task. You'll need a parsed query log JSON file (like the one generated in the previous step). Then you can do the following:
python arcs/collect_domain_query_data.py ~/Data/query_analysis/2015-08-10.logs.concat.json
This will write 3-column tab-delimited lines containing (domain, query, count) triples to STDOUT.
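That output is easy to consume downstream. For example (using made-up sample rows, not real data), the triples can be read back with the csv module:

```python
import csv
import io

# Hypothetical sample of the script's 3-column tab-delimited output.
sample = (
    "data.cityofchicago.org\tcrime\t42\n"
    "data.sfgov.org\tbudget 2015\t17\n"
)

# Parse each line into a (domain, query, count) triple.
triples = [
    (domain, query, int(count))
    for domain, query, count in csv.reader(io.StringIO(sample), delimiter="\t")
]
```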
Not surprisingly, there is noise in the query logs. To ensure that we don't send garbage queries to crowdsourcing workers for annotation, the collect_domain_query_data script uses both a hand-curated blacklist and filtering regex patterns to eliminate garbage queries. You may find you want to add additional patterns or blacklist entries, which you can do easily enough; the query blacklist is in the data directory. Additionally, it may be useful to supply custom filters for particular tasks. For example, if you want to launch a crowdsourcing task to collect judgments limited to only multi-term queries, you can supply a custom filter like so:
python arcs/collect_domain_query_data.py ~/Data/query_analysis/20150924.logs.concat.json -D 'postgresql://username:@hostname:5432/db_name' -d 10 -q 5 -B data/query_blacklist.txt --query_filter='lambda s: " " in s.strip()' > ~/Data/arcs/20151006.slop/queries.tsv
Here we specify an additional filter which restricts our queries to those containing an internal (non-leading, non-trailing) space, i.e. multi-term queries.
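As a quick sanity check (plain Python, nothing arcs-specific), the lambda behaves like this: leading and trailing whitespace is stripped before checking for a space, so padded single-term queries are still rejected.

```python
query_filter = lambda s: " " in s.strip()

assert query_filter("crime data")        # multi-term: kept
assert not query_filter("budget")        # single term: dropped
assert not query_filter("  budget  ")    # padding alone doesn't count
```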
The CrowdFlower UI is pretty self-explanatory. Creating new jobs can be done from the UI by clicking on an existing job, and electing to copy that job with gold units only. As a rule of thumb, the number of gold units should probably be greater than or equal to 10% of the total number of rows in a job. Additionally, it's a good idea to add to this set regularly to ensure that workers are not being exposed to the same questions over and over again.
Any programmatic interaction with the CrowdFlower API requires that a CrowdFlower API token be present in your shell environment. You can obtain such a token by logging in to CrowdFlower and visiting your account's API settings. Set the environment variable like so:
export CROWDFLOWER_API_KEY=123456789abcdefghijk
Add this to your environment resource or profile file to ensure that it is set on login. Note that the token included above is a stub, meant to be replaced with an actual token.
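Inside a Python script, a defensive lookup of the token might look like the following sketch (the actual arcs scripts may handle this differently):

```python
import os

def get_crowdflower_key():
    # Fail fast with a helpful message if the token isn't set.
    key = os.environ.get("CROWDFLOWER_API_KEY")
    if not key:
        raise RuntimeError("CROWDFLOWER_API_KEY is not set in the environment")
    return key
```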
To simplify job creation and data bookkeeping, we've added a script (launch_job) that will do the following:
- collect results for all queries in an input query file from our catalog search system (Cetera)
- store the raw results data as a CSV for posterity / inspection
- extract the relevant result fields from each query-result pair to create a CrowdFlower task
- launch the CrowdFlower task, copying existing test units from an existing job
- persist job data in a Postgres DB
The script can be run like so:
python arcs/launch_job.py -i ~/Data/arcs/20151006.slop/queries.tsv -g '{"name": "baseline", "description": "Current production system as of 10/6/2015", "params": {}}' -g '{"name": "Enabling slop=5", "description": "Testing the effect of slop=5 on multi-term queries", "params": {"slop": 5}}' -r 10 -c localhost -p 5704 -D 'postgresql://username:@hostname:5432/db_name' -F ~/Data/arcs/20151006.slop/full.csv -C ~/Data/arcs/20151006.slop/crowdflower.csv
We specify the required input file of queries with the -i flag, the parameters of each group of results with the -g flag, the number of results with the -r flag, the Cetera host and port with the -c and -p flags, our database connection string with the -D flag, and finally, optional paths to which the full and CrowdFlower CSVs should be written (-F and -C). If no groups are specified, the default behavior is to create a group named "baseline" with an empty parameters dict (which is used for each query to Cetera).
You may optionally specify a --job_to_copy (-j) parameter, which indicates the CrowdFlower job that should be used as the basis for the task. Some documentation of the various command-line parameters is available by passing the help option (-h|--help).
Once a job has been completed -- and you should receive an email notification to this effect from CrowdFlower -- you can download the judged data like so:
python arcs/fetch_job_results.py <job_id> -D 'postgresql://username:@hostname:5432/db_name'
The external (CrowdFlower) job ID must be specified as the first argument. As with the launch script above, a DB connection string must also be supplied (-D/--db_conn_str).
Once a job has completed and you've downloaded the data, you can report various statistics (including our core relevance metric, NDCG) by running the summarize_results script.
python arcs/summarize_results.py 14 27 -D 'postgresql://username:@hostname:5432/db_name'
This will report NDCG for each group as well as the delta between groups. The output should look something like this:
{
    "num_unique_qrps": 603,
    "num_total_diffs": 563,
    "baseline 51": {
        "avg_ndcg_at_5": 0.6521285003925053,
        "num_zero_result_queries": 76,
        "num_queries": 236,
        "num_irrelevant": 241,
        "avg_ndcg_at_10": 0.6763264747737386,
        "precision": 0.7897923875432526,
        "unjudged_qrps": 2,
        "ndcg_error": 0.34787149960749475
    },
    "ndcg_delta": 0.03033746690757866,
    "adjusted boost clause 52": {
        "avg_ndcg_at_5": 0.6824659673000839,
        "num_zero_result_queries": 76,
        "num_queries": 236,
        "num_irrelevant": 234,
        "avg_ndcg_at_10": 0.6967757671474697,
        "precision": 0.796875,
        "unjudged_qrps": 0,
        "ndcg_error": 0.3175340326999161
    }
}
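For reference, NDCG@k can be computed roughly as follows. This is the standard formulation and is only a sketch; arcs' actual implementation may differ in its gain function or tie handling.

```python
import math

def dcg_at_k(relevances, k):
    # Discounted cumulative gain: later ranks contribute less.
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    # Normalize by the DCG of the ideal (descending) ordering.
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0
```

A perfectly ordered result list scores 1.0. Judging from the sample output above, ndcg_error appears to be simply 1 minus avg_ndcg_at_5.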
It's useful to know how much agreement there is between our workers as it gives us some signal about the difficulty, interpretability, and subjectivity of our task. You can calculate inter-annotator agreement by first downloading non-aggregated data from CrowdFlower (Results > Settings > "All answers" in the dropdown before downloading the aggregated result) like so:
python arcs/calc_iaa.py -c file_from_crowdflower.csv --top_n
This will report Krippendorff's alpha, a chance-corrected statistical measure of agreement among an arbitrary number of annotators.
After getting judged data back from our chosen crowdsourcing platform, it's a good idea to inspect the rows where the results were obviously bad, or where something went wrong and prevented the workers from assigning a judgment score. You can do this with the following:
python arcs/error_analysis.py 28 -D 'postgresql://username:@hostname:5432/db_name' -o 20151110.errors.csv
This script will output a list of all the QRPs that were assigned at least 2 "irrelevant" judgments. The results will be sorted based on aggregated judgment. The first argument is the ID of the group to use as the basis for analysis and is required. As is the case with other utilities, a DB connection string is required and is specified with the -D flag.