This library processes crowdsourcing results from Amazon Mechanical Turk and CrowdFlower following the CrowdTruth methodology. For more information see http://crowdtruth.org.
Download the library and install it using python setup.py crowdtruth
Create a folder anywhere on your machine and fill it with raw result files from Amazon Mechanical Turk or CrowdFlower. These files should be the unaltered csv files generated by either platform, with one collected judgment per row. A folder may contain files from both platforms, but they must all belong to the same task. All files in the same folder are aggregated together, so if there are multiple tasks, put the results for each task in a separate sub-folder. An example of this can be seen in the /examples folder.
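To make the expected layout concrete, here is a minimal sketch that builds a throwaway folder tree and groups the csv files by the folder they sit in. It is not the library's own discovery code. The helper group_result_files and the folder and file names (task_a, amt_batch.csv, and so on) are hypothetical and used only for illustration.

```python
from pathlib import Path
import tempfile


def group_result_files(root):
    """Map each folder under root to the csv files directly inside it."""
    root = Path(root)
    groups = {}
    for csv_file in sorted(root.rglob("*.csv")):
        # Folder path relative to root; files placed directly in root map to "."
        folder = csv_file.parent.relative_to(root).as_posix()
        groups.setdefault(folder, []).append(csv_file.name)
    return groups


# Build a temporary layout: one sub-folder per task, files from both
# platforms allowed side by side within the same task folder.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for rel in [
        "task_a/amt_batch.csv",
        "task_a/crowdflower_job.csv",
        "task_b/amt_batch.csv",
    ]:
        path = root / rel
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text("placeholder\n")
    layout = group_result_files(root)

print(layout)
```

Each key in the printed dictionary corresponds to one folder whose files would be aggregated together.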
Once the files are in place, the code can be called from the command line with crowdtruth. The code detects sub-folders, so you can run it from the main folder to compute the results for every sub-folder, or run it from within a single sub-folder to process only that one. The results for each folder are saved in results.xlsx, which contains separate tabs for all crowdsourcing jobs, all units, all workers, all judgments and all annotations.
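One possible way to inspect the output afterwards is to load every tab of results.xlsx with pandas. This is only a sketch: the helpers load_results and summarize_tabs are hypothetical, the exact tab names are not assumed, and reading .xlsx files with pandas requires an Excel engine such as openpyxl to be installed.

```python
from pathlib import Path

import pandas as pd


def load_results(path="results.xlsx"):
    """Read every tab of the workbook into a dict of {tab name: DataFrame}."""
    # sheet_name=None tells pandas to return all sheets instead of only the first
    return pd.read_excel(path, sheet_name=None)


def summarize_tabs(tabs):
    """Return {tab name: (rows, columns)} for a dict of DataFrames."""
    return {name: frame.shape for name, frame in tabs.items()}


# Only runs when a results file is present in the current folder.
if Path("results.xlsx").exists():
    for name, shape in summarize_tabs(load_results()).items():
        print(name, shape)
```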
Custom configuration can be added using a config.py file. Currently the following configuration options are available:
name: a label to identify the type of configuration.

inputColumns: a list of columns that contain original input data. Setting this option allows you to filter out columns you are not interested in. If empty, the columns will be identified automatically.

outputColumns: a list of columns that contain judgments. Setting this option allows you to filter out columns you are not interested in. If empty, the columns will be identified automatically.

units: a list of units to use. If empty, all units are used.

workers: a list of workers to use. If empty, all workers are used.

jobs: a list of jobs to use. If empty, all jobs are used.

processJudgments(self, judgments): a function to alter the judgments before they are processed in CrowdTruth. The judgments variable is a Pandas dataframe with all judgments of one input file. This function should always return the same dataframe, with only the input or output columns altered. The identified input columns are stored in self.input.keys() and the output columns in self.output.keys().

processResults(self, results): a function to alter the results after they are processed in CrowdTruth. This allows custom metrics to be run, or additional visualizations to be generated. The results variable is a dictionary with a Pandas dataframe for the jobs, units, workers, judgments and annotations. Each of these dataframes may be altered and new dataframes may be added to the dictionary. Each of the dataframes is saved as a tab in the results.xlsx file. Additionally, plots can be generated, which will be saved into the folder that is being processed.
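As a rough illustration of how these options could fit together, the sketch below shows one possible shape for a config.py. Several things here are assumptions rather than documented behaviour: the class name Config, the column names sentence and relation, the dictionary key "judgments", the extra "judgment_counts" entry, and the fact that processResults returns the dictionary. The self.input and self.output attributes are filled with stand-in values so the example can run on its own. In real use they hold the columns identified by the library.

```python
import pandas as pd


class Config:  # hypothetical class name; not specified in this README
    name = "example"
    inputColumns = ["sentence"]    # hypothetical input column
    outputColumns = ["relation"]   # hypothetical judgment column
    units = []     # empty: use all units
    workers = []   # empty: use all workers
    jobs = []      # empty: use all jobs

    def __init__(self):
        # Stand-ins for the identified input and output columns
        self.input = {col: col for col in self.inputColumns}
        self.output = {col: col for col in self.outputColumns}

    def processJudgments(self, judgments):
        # Normalize the output columns: trim whitespace and lowercase
        for col in self.output.keys():
            judgments[col] = judgments[col].astype(str).str.strip().str.lower()
        # Return the same dataframe, with only output columns altered
        return judgments

    def processResults(self, results):
        # Add a new dataframe with the number of judgments per answer
        counts = results["judgments"].groupby("relation").size()
        results["judgment_counts"] = counts.reset_index(name="count")
        return results


# Small stand-alone demonstration with made-up judgments
judgments = pd.DataFrame(
    {"sentence": ["a", "b", "c"], "relation": [" Treats", "treats ", "CAUSES"]}
)
config = Config()
cleaned = config.processJudgments(judgments)
results = config.processResults({"judgments": cleaned})
print(results["judgment_counts"])
```

Because processResults adds a new entry to the dictionary, that dataframe would be written out as an additional tab alongside the standard ones.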