The `retrieve` tool is used to perform batch query processing experiments. To use it:
./target/bin/retrieve
This script is configured entirely through Java properties, as listed in the `MatchingConfiguration` class.
Property name | Description | Default |
---|---|---|
`micro.namespace` | default namespace to prepend to class names that do not include a namespace | it.cnr.isti.hpclab.matching. |
`micro.index.path` | the name of the directory in which the index data structures to process are stored | . |
`micro.index.prefix` | filename prefix for the index data structures to process | data |
`micro.manager` | class name of the manager to use at runtime, created by the Querying class | micro.namespace + Manager |
`micro.model` | class name of the weighting model to use at runtime, created by the Manager class | micro.namespace + structures.BM25 |
`micro.matching` | class name of the matching algorithm to use at runtime, created by the Manager class | micro.namespace + RankedOr |
`micro.queries` | comma-separated list of text files from which the QuerySource class reads queries | |
`micro.queries.id` | boolean value specifying if query lines contain an initial query id | false |
`micro.queries.tokenise` | boolean value specifying if queries must be tokenised or treated as single terms | true |
`micro.queries.lowercase` | boolean value specifying if queries must be lowercased | true |
`micro.queries.num` | number of queries to process | all |
`micro.queries.threshold` | file containing the thresholds for priming. If queries have ids, thresholds must have ids too | "" |
`micro.termpipelines` | term processors to be applied, in order, to single terms | Stopwords,PorterStemmer |
`micro.ignore.low.idf` | boolean value specifying if terms with low IDF must be ignored | true |
`micro.topk` | the number of top documents to return, if necessary | 1000 |
The values can be changed by using Java command-line system property values:
java -Dmicro.queries=/tmp/somefile.txt <...>
or by including a `micro.properties` file in the classpath.
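
The two override mechanisms can be illustrated with a small, self-contained Java sketch. It is not the code of `MatchingConfiguration`: the class `MicroProps` and its methods are hypothetical, and the precedence shown (a `-D` system property wins over a `micro.properties` entry, which wins over the built-in default) is an assumption made for the example.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper, not part of terrier-micro: resolves a setting from
// -D flags, a classpath micro.properties file, or a hard-coded default.
class MicroProps {
    // Loads micro.properties from the classpath; returns an empty set if absent.
    static Properties loadFile() {
        Properties props = new Properties();
        try (InputStream in = MicroProps.class.getClassLoader()
                .getResourceAsStream("micro.properties")) {
            if (in != null) {
                props.load(in);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    // Assumed precedence: system property, then file entry, then default.
    static String get(Properties file, String key, String defaultValue) {
        String fromCommandLine = System.getProperty(key);
        if (fromCommandLine != null) {
            return fromCommandLine;
        }
        return file.getProperty(key, defaultValue);
    }

    public static void main(String[] args) {
        Properties file = loadFile();
        System.out.println("micro.topk = " + get(file, "micro.topk", "1000"));
        System.out.println("micro.index.prefix = " + get(file, "micro.index.prefix", "data"));
    }
}
```

Running this with `-Dmicro.topk=30` on the command line would print `30` for `micro.topk` under the assumed precedence, whatever the properties file contains.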
Queries are read by the `QuerySource` class, one per line, verbatim, from the file(s) specified by the `micro.queries` property. Empty lines and lines starting with `#` are ignored.
By default, queries are tokenised by this class, and are passed verbatim to the query parser. Tokenisation can be turned off by the property `micro.queries.tokenise`.
Moreover, the first token on each line can be the query id. This can be controlled by the property `micro.queries.id` (default: `false`).
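
The reading rules above can be sketched in a few lines of Java. This is not the actual `QuerySource` implementation: the class `QueryLines`, its `Query` type, and the `parse` method are hypothetical, and splitting on whitespace, treating whitespace-only lines as empty, and treating an untokenised line as one single term are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch of the query-reading rules; not the real QuerySource.
class QueryLines {
    // A parsed query: an optional id (null when absent) and its terms.
    static final class Query {
        final String id;
        final List<String> terms;

        Query(String id, List<String> terms) {
            this.id = id;
            this.terms = terms;
        }
    }

    static List<Query> parse(List<String> lines, boolean hasId,
                             boolean tokenise, boolean lowercase) {
        List<Query> queries = new ArrayList<>();
        for (String raw : lines) {
            String line = raw.trim();
            // Skip empty lines and comment lines starting with '#'.
            if (line.isEmpty() || line.startsWith("#")) {
                continue;
            }
            String id = null;
            if (hasId) {
                // The first token is the query id, the rest is the query text.
                String[] parts = line.split("\\s+", 2);
                id = parts[0];
                line = parts.length > 1 ? parts[1] : "";
            }
            if (lowercase) {
                line = line.toLowerCase(Locale.ROOT);
            }
            List<String> terms;
            if (line.isEmpty()) {
                terms = Collections.emptyList();
            } else if (tokenise) {
                terms = Arrays.asList(line.split("\\s+")); // assumed whitespace tokeniser
            } else {
                terms = Collections.singletonList(line);   // whole line as one term
            }
            queries.add(new Query(id, terms));
        }
        return queries;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("# a comment", "", "q1 Hello World");
        for (Query q : parse(lines, true, true, true)) {
            System.out.println(q.id + " -> " + q.terms);
        }
    }
}
```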
A matching algorithm must implement the `MatchingAlgorithm` interface. The matching algorithms currently implemented are the following.
Class | Description |
---|---|
`And` | Boolean AND, scores are not computed, no results are actually returned. |
`Or` | Boolean OR, scores are not computed, no results are actually returned. |
`RankedAnd` | Ranked AND processing, `micro.topk` documents returned with their scores. |
`RankedOr` | Ranked OR processing, `micro.topk` documents returned with their scores, implemented as DAAT. |
`MaxScore` | Ranked OR processing, `micro.topk` documents returned with their scores, implemented as Turtle & Flood's MaxScore. |
`Wand` | Ranked OR processing, `micro.topk` documents returned with their scores, implemented as Carmel & Broder's WAND. |
`BlockMaxWand` | Ranked OR processing, `micro.topk` documents returned with their scores, implemented as Suel's BlockMaxWand (postings aligned). |
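
To make the idea of DAAT (document-at-a-time) ranked OR processing concrete, here is a minimal Java sketch. It is not the `RankedOr` class and does not use the `MatchingAlgorithm` interface: the names `DaatRankedOrSketch`, `Postings`, and `Result` are hypothetical, posting lists are plain in-memory arrays with precomputed per-posting scores, and docids are assumed to be non-negative and sorted in ascending order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch of DAAT ranked OR; not the terrier-micro RankedOr class.
class DaatRankedOrSketch {
    // One posting list: ascending docids, each with its score contribution.
    static final class Postings {
        final int[] docids;
        final double[] scores;
        int pos = 0;

        Postings(int[] docids, double[] scores) {
            this.docids = docids;
            this.scores = scores;
        }

        boolean done() { return pos >= docids.length; }
        int current() { return docids[pos]; }
    }

    static final class Result {
        final int docid;
        final double score;

        Result(int docid, double score) {
            this.docid = docid;
            this.score = score;
        }
    }

    // Returns the k highest-scoring documents, best first.
    static List<Result> topK(List<Postings> lists, int k) {
        Comparator<Result> byScore = Comparator.comparingDouble((Result r) -> r.score);
        PriorityQueue<Result> heap = new PriorityQueue<>(byScore); // min-heap on score
        while (true) {
            // Pick the smallest docid among the lists that are not exhausted.
            int next = -1;
            for (Postings p : lists) {
                if (!p.done() && (next < 0 || p.current() < next)) {
                    next = p.current();
                }
            }
            if (next < 0) {
                break; // every list is exhausted
            }
            // Sum the contributions of all lists positioned on that docid.
            double score = 0.0;
            for (Postings p : lists) {
                if (!p.done() && p.current() == next) {
                    score += p.scores[p.pos];
                    p.pos++;
                }
            }
            heap.add(new Result(next, score));
            if (heap.size() > k) {
                heap.poll(); // drop the lowest-scoring candidate
            }
        }
        List<Result> out = new ArrayList<>(heap);
        out.sort(byScore.reversed());
        return out;
    }

    public static void main(String[] args) {
        List<Postings> lists = Arrays.asList(
                new Postings(new int[] {1, 3, 5}, new double[] {1.0, 2.0, 0.5}),
                new Postings(new int[] {3, 4}, new double[] {1.5, 0.2}));
        for (Result r : topK(lists, 2)) {
            System.out.println(r.docid + "\t" + r.score);
        }
    }
}
```

In this sketch every posting is scored. Dynamic pruning strategies such as MaxScore, WAND, and BlockMaxWand aim to return the same top documents while skipping postings that cannot enter the result set.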
Use the provided Java program `Retrieve` as follows.
java it.cnr.isti.hpclab.Retrieve [y]
The program uses the Java properties to configure its runtime. The only parameter flag (`y`), if present, will print to stderr all configured properties with their values, waiting for confirmation to proceed. Otherwise, the query processing will begin, outputting results to stdout in a simple JSON format.
For example:
java -Xmx32G -server \
-cp terrier-micro-1.4.0.jar \
-Dmicro.index.path=/data1/khast/index-java \
-Dmicro.index.prefix=cw09b.sux \
-Dmicro.queries=./query_log/msn.1k.txt \
-Dmicro.topk=30 \
-Dmicro.matching=it.cnr.isti.hpclab.matching.RankedOr \
-Dstopwords.filename=/home/khast/stuff/stopword-list.txt \
it.cnr.isti.hpclab.Retrieve
During query processing, a JSON file is produced, one line per query. Every line contains information about the terms and the processing time of the query. It can be easily parsed with the jq tool.
An additional output can be generated if the actual docids returned for every query are needed (typically for debugging purposes or for effectiveness measures). This output is controlled by the following properties.
Property name | Description | Default |
---|---|---|
`micro.results.output.type` | the type of output for results generation. Possible values are `null`, `docid`, `score`, `docno` and `trec` | null |
`micro.results.filename` | the absolute filename of the gzipped file where the output results will be stored | results.gz |
The content of the output file depends on the output type, as described below.
- `null`: no output is actually generated. This is the default behavior, since writing the results on file might negatively impact the efficiency of query processing.
- `docid`: each line of the output file has the format `<qid>TAB<docid>`.
- `score`: each line of the output file has the format `<qid>TAB<docid>TAB<score>`. The `score` value is ceiled to 5 decimal digits.
- `docno`: each line of the output file has the format `<qid>TAB<docno>TAB<score>`. The `score` value is ceiled to 5 decimal digits. The `docno` is retrieved from the metaindex.
- `trec`: each line of the output file has the format `<qid> Q0 <docno> <pos> <score> ISTI` (note the spaces instead of tabs and the 'immutable' strings `Q0` and `ISTI`). The `score` value is ceiled to 5 decimal digits. The `docno` is retrieved from the metaindex. The `pos` value is the position in the returned results array, starting from 0 (i.e., the highest scoring document is in position 0, etc.).
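
The line formats above can be sketched in Java as follows. The class `ResultLines` and its methods are hypothetical and are not taken from terrier-micro. The sketch assumes that "ceiled" means rounding toward positive infinity and that scores are always printed with exactly 5 decimal digits; the tool's actual number formatting may differ.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

// Hypothetical formatter for the result line formats; not terrier-micro code.
class ResultLines {
    // Rounds toward positive infinity at 5 decimal digits (assumed meaning of "ceiled").
    static String ceil5(double score) {
        return BigDecimal.valueOf(score)
                .setScale(5, RoundingMode.CEILING)
                .toPlainString();
    }

    // docid type: <qid>TAB<docid>
    static String docid(String qid, int docid) {
        return qid + "\t" + docid;
    }

    // score type: <qid>TAB<docid>TAB<score>
    static String score(String qid, int docid, double score) {
        return qid + "\t" + docid + "\t" + ceil5(score);
    }

    // docno type: <qid>TAB<docno>TAB<score>
    static String docno(String qid, String docno, double score) {
        return qid + "\t" + docno + "\t" + ceil5(score);
    }

    // trec type: space-separated, with the fixed strings Q0 and ISTI; pos starts at 0.
    static String trec(String qid, String docno, int pos, double score) {
        return qid + " Q0 " + docno + " " + pos + " " + ceil5(score) + " ISTI";
    }

    public static void main(String[] args) {
        // The qid, docid, and docno values below are made up for illustration.
        System.out.println(docid("q1", 42));
        System.out.println(score("q1", 42, 12.3456701));
        System.out.println(trec("q1", "doc-0001", 0, 1.5));
    }
}
```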