Add regressions for TREC 7/8 on Disks 4+5 (#1692)
lintool authored Dec 9, 2021
1 parent fbdd008 commit 12149f8
Showing 11 changed files with 217,132 additions and 47,521 deletions.
2 changes: 1 addition & 1 deletion README.md
@@ -44,7 +44,7 @@ Anserini is designed to support experiments on various standard IR test collecti
The following experiments are backed by [rigorous end-to-end regression tests](docs/regressions.md) with [`run_regression.py`](src/main/python/run_regression.py) and [the Anserini reproducibility promise](docs/regressions.md).
For the most part, these runs are based on [_default_ parameter settings](https://github.com/castorini/Anserini/blob/master/src/main/java/io/anserini/search/SearchArgs.java).

-+ Regressions for [Disks 1 & 2](docs/regressions-disk12.md), [Disks 4 & 5 (Robust04)](docs/regressions-robust04.md), [AQUAINT (Robust05)](docs/regressions-robust05.md)
++ Regressions for [Disks 1 & 2 (TREC 1-3)](docs/regressions-disk12.md), [Disks 4 & 5 (TREC 7-8)](docs/regressions-disk45.md), [Robust04](docs/regressions-robust04.md), [AQUAINT (Robust05)](docs/regressions-robust05.md)
+ Regressions for [the New York Times Corpus (Core17)](docs/regressions-core17.md), [the Washington Post Corpus (Core18)](docs/regressions-core18.md)
+ Regressions for [Wt10g](docs/regressions-wt10g.md), [Gov2](docs/regressions-gov2.md)
+ Regressions for [ClueWeb09 (Category B)](docs/regressions-cw09b.md), [ClueWeb12-B13](docs/regressions-cw12b13.md), [ClueWeb12](docs/regressions-cw12.md)
14 changes: 7 additions & 7 deletions docs/regressions-disk12.md
@@ -1,6 +1,6 @@
# Anserini: Regressions for [TIPSTER Disks 1 & 2](https://catalog.ldc.upenn.edu/LDC93T3A)

-This page describes regressions for ad hoc topics from the early TRECs, which use [TIPSTER Disks 1 & 2](https://catalog.ldc.upenn.edu/LDC93T3A).
+This page describes regressions for ad hoc topics from TREC 1-3, which use [TIPSTER Disks 1 & 2](https://catalog.ldc.upenn.edu/LDC93T3A).
The exact configurations for these regressions are stored in [this YAML file](../src/main/resources/regression/disk12.yaml).
Note that this page is automatically generated from [this template](../src/main/resources/docgen/templates/disk12.template) as part of Anserini's regression pipeline, so do not modify this page directly; modify the template instead.

@@ -25,12 +25,12 @@ For additional details, see explanation of [common indexing options](common-inde

Topics and qrels are stored in [`src/main/resources/topics-and-qrels/`](../src/main/resources/topics-and-qrels/), downloaded from NIST:

-+ [`topics.adhoc.51-100.txt`](../src/main/resources/topics-and-qrels/topics.adhoc.51-100.txt): [TREC-1 Ad Hoc Topics 51-100](http://trec.nist.gov/data/topics_eng/topics.51-100.gz)
-+ [`topics.adhoc.101-150.txt`](../src/main/resources/topics-and-qrels/topics.adhoc.101-150.txt): [TREC-2 Ad Hoc Topics 101-150](http://trec.nist.gov/data/topics_eng/topics.101-150.gz)
-+ [`topics.adhoc.151-200.txt`](../src/main/resources/topics-and-qrels/topics.adhoc.151-200.txt): [TREC-3 Ad Hoc Topics 151-200](http://trec.nist.gov/data/topics_eng/topics.151-200.gz)
-+ [`qrels.adhoc.51-100.txt`](../src/main/resources/topics-and-qrels/qrels.adhoc.51-100.txt): [qrels for TREC-1 Ad Hoc Topics 51-100](http://trec.nist.gov/data/qrels_eng/qrels.51-100.disk1.disk2.parts1-5.tar.gz)
-+ [`qrels.adhoc.101-150.txt`](../src/main/resources/topics-and-qrels/qrels.adhoc.101-150.txt): [qrels for TREC-2 Ad Hoc Topics 101-150](http://trec.nist.gov/data/qrels_eng/qrels.101-150.disk1.disk2.parts1-5.tar.gz)
-+ [`qrels.adhoc.151-200.txt`](../src/main/resources/topics-and-qrels/qrels.adhoc.151-200.txt): [qrels for TREC-3 Ad Hoc Topics 151-200](http://trec.nist.gov/data/qrels_eng/qrels.151-200.201-250.disks1-3.all.tar.gz)
++ [`topics.adhoc.51-100.txt`](../src/main/resources/topics-and-qrels/topics.adhoc.51-100.txt): [TREC-1 Ad Hoc Topics 51-100](http://trec.nist.gov/data/topics_eng/)
++ [`topics.adhoc.101-150.txt`](../src/main/resources/topics-and-qrels/topics.adhoc.101-150.txt): [TREC-2 Ad Hoc Topics 101-150](http://trec.nist.gov/data/topics_eng/)
++ [`topics.adhoc.151-200.txt`](../src/main/resources/topics-and-qrels/topics.adhoc.151-200.txt): [TREC-3 Ad Hoc Topics 151-200](http://trec.nist.gov/data/topics_eng/)
++ [`qrels.adhoc.51-100.txt`](../src/main/resources/topics-and-qrels/qrels.adhoc.51-100.txt): [qrels for TREC-1 Ad Hoc Topics 51-100](http://trec.nist.gov/data/qrels_eng/)
++ [`qrels.adhoc.101-150.txt`](../src/main/resources/topics-and-qrels/qrels.adhoc.101-150.txt): [qrels for TREC-2 Ad Hoc Topics 101-150](http://trec.nist.gov/data/qrels_eng/)
++ [`qrels.adhoc.151-200.txt`](../src/main/resources/topics-and-qrels/qrels.adhoc.151-200.txt): [qrels for TREC-3 Ad Hoc Topics 151-200](http://trec.nist.gov/data/qrels_eng/)

After indexing has completed, you should be able to perform retrieval as follows:

127 changes: 127 additions & 0 deletions docs/regressions-disk45.md
@@ -0,0 +1,127 @@
# Anserini: Regressions for [TREC Disks 4 & 5](https://trec.nist.gov/data/cd45/index.html)

This page describes regressions for ad hoc topics from TREC 7-8, which use [TREC Disks 4 & 5](https://trec.nist.gov/data/cd45/index.html).
The exact configurations for these regressions are stored in [this YAML file](../src/main/resources/regression/disk45.yaml).
Note that this page is automatically generated from [this template](../src/main/resources/docgen/templates/disk45.template) as part of Anserini's regression pipeline, so do not modify this page directly; modify the template instead.

## Indexing

Typical indexing command:

```
nohup sh target/appassembler/bin/IndexCollection -collection TrecCollection \
-input /path/to/disk45 \
-index indexes/lucene-index.disk45.pos+docvectors+raw \
-generator DefaultLuceneDocumentGenerator \
-threads 16 -storePositions -storeDocvectors -storeRaw \
>& logs/log.disk45 &
```

The directory `/path/to/disk45/` should be the root directory of [TREC Disks 4 & 5](https://trec.nist.gov/data/cd45/index.html); inside it there should be subdirectories such as `ft` and `fr94`.
Note that Anserini ignores the `cr` folder when indexing, which is the standard configuration.
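
Before starting a long indexing job, it can help to confirm that the input path looks like the collection root. The following is a minimal sketch, not part of Anserini: the function name `check_disk45_root` is hypothetical, and the list of expected subdirectories contains only the two examples named above, so the real collection will contain more.

```python
from pathlib import Path

# Only the subdirectories named on this page; the real collection has more.
EXPECTED_SUBDIRS = ("ft", "fr94")
# Folder that Anserini skips during indexing, per the note above.
IGNORED_SUBDIRS = ("cr",)


def check_disk45_root(root):
    """Return (missing, ignored_present) for a candidate collection root.

    missing: expected subdirectories that were not found under root.
    ignored_present: subdirectories that exist but will not be indexed.
    """
    root = Path(root)
    missing = [d for d in EXPECTED_SUBDIRS if not (root / d).is_dir()]
    ignored_present = [d for d in IGNORED_SUBDIRS if (root / d).is_dir()]
    return missing, ignored_present
```

An empty `missing` list suggests the path points at the right directory; a non-empty `ignored_present` list is only informational.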

For additional details, see the explanation of [common indexing options](common-indexing-options.md).

## Retrieval

Topics and qrels are stored in [`src/main/resources/topics-and-qrels/`](../src/main/resources/topics-and-qrels/), downloaded from NIST:

+ [`topics.adhoc.351-400.txt`](../src/main/resources/topics-and-qrels/topics.adhoc.351-400.txt): [TREC-7 Ad Hoc Topics 351-400](http://trec.nist.gov/data/topics_eng/)
+ [`topics.adhoc.401-450.txt`](../src/main/resources/topics-and-qrels/topics.adhoc.401-450.txt): [TREC-8 Ad Hoc Topics 401-450](http://trec.nist.gov/data/topics_eng/)
+ [`qrels.adhoc.351-400.txt`](../src/main/resources/topics-and-qrels/qrels.adhoc.351-400.txt): [qrels for TREC-7 Ad Hoc Topics 351-400](http://trec.nist.gov/data/qrels_eng/)
+ [`qrels.adhoc.401-450.txt`](../src/main/resources/topics-and-qrels/qrels.adhoc.401-450.txt): [qrels for TREC-8 Ad Hoc Topics 401-450](http://trec.nist.gov/data/qrels_eng/)

After indexing has completed, you should be able to perform retrieval as follows:

```
nohup target/appassembler/bin/SearchCollection -index indexes/lucene-index.disk45.pos+docvectors+raw \
-topicreader Trec -topics src/main/resources/topics-and-qrels/topics.adhoc.351-400.txt \
-output runs/run.disk45.bm25.topics.adhoc.351-400.txt \
-bm25 &
nohup target/appassembler/bin/SearchCollection -index indexes/lucene-index.disk45.pos+docvectors+raw \
-topicreader Trec -topics src/main/resources/topics-and-qrels/topics.adhoc.401-450.txt \
-output runs/run.disk45.bm25.topics.adhoc.401-450.txt \
-bm25 &
nohup target/appassembler/bin/SearchCollection -index indexes/lucene-index.disk45.pos+docvectors+raw \
-topicreader Trec -topics src/main/resources/topics-and-qrels/topics.adhoc.351-400.txt \
-output runs/run.disk45.bm25+rm3.topics.adhoc.351-400.txt \
-bm25 -rm3 &
nohup target/appassembler/bin/SearchCollection -index indexes/lucene-index.disk45.pos+docvectors+raw \
-topicreader Trec -topics src/main/resources/topics-and-qrels/topics.adhoc.401-450.txt \
-output runs/run.disk45.bm25+rm3.topics.adhoc.401-450.txt \
-bm25 -rm3 &
nohup target/appassembler/bin/SearchCollection -index indexes/lucene-index.disk45.pos+docvectors+raw \
-topicreader Trec -topics src/main/resources/topics-and-qrels/topics.adhoc.351-400.txt \
-output runs/run.disk45.bm25+ax.topics.adhoc.351-400.txt \
-bm25 -axiom -axiom.deterministic -rerankCutoff 20 &
nohup target/appassembler/bin/SearchCollection -index indexes/lucene-index.disk45.pos+docvectors+raw \
-topicreader Trec -topics src/main/resources/topics-and-qrels/topics.adhoc.401-450.txt \
-output runs/run.disk45.bm25+ax.topics.adhoc.401-450.txt \
-bm25 -axiom -axiom.deterministic -rerankCutoff 20 &
nohup target/appassembler/bin/SearchCollection -index indexes/lucene-index.disk45.pos+docvectors+raw \
-topicreader Trec -topics src/main/resources/topics-and-qrels/topics.adhoc.351-400.txt \
-output runs/run.disk45.ql.topics.adhoc.351-400.txt \
-qld &
nohup target/appassembler/bin/SearchCollection -index indexes/lucene-index.disk45.pos+docvectors+raw \
-topicreader Trec -topics src/main/resources/topics-and-qrels/topics.adhoc.401-450.txt \
-output runs/run.disk45.ql.topics.adhoc.401-450.txt \
-qld &
nohup target/appassembler/bin/SearchCollection -index indexes/lucene-index.disk45.pos+docvectors+raw \
-topicreader Trec -topics src/main/resources/topics-and-qrels/topics.adhoc.351-400.txt \
-output runs/run.disk45.ql+rm3.topics.adhoc.351-400.txt \
-qld -rm3 &
nohup target/appassembler/bin/SearchCollection -index indexes/lucene-index.disk45.pos+docvectors+raw \
-topicreader Trec -topics src/main/resources/topics-and-qrels/topics.adhoc.401-450.txt \
-output runs/run.disk45.ql+rm3.topics.adhoc.401-450.txt \
-qld -rm3 &
nohup target/appassembler/bin/SearchCollection -index indexes/lucene-index.disk45.pos+docvectors+raw \
-topicreader Trec -topics src/main/resources/topics-and-qrels/topics.adhoc.351-400.txt \
-output runs/run.disk45.ql+ax.topics.adhoc.351-400.txt \
-qld -axiom -axiom.deterministic -rerankCutoff 20 &
nohup target/appassembler/bin/SearchCollection -index indexes/lucene-index.disk45.pos+docvectors+raw \
-topicreader Trec -topics src/main/resources/topics-and-qrels/topics.adhoc.401-450.txt \
-output runs/run.disk45.ql+ax.topics.adhoc.401-450.txt \
-qld -axiom -axiom.deterministic -rerankCutoff 20 &
```
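
The twelve commands above follow a regular pattern: six model configurations crossed with two topic sets. The sketch below reproduces that grid to make the naming scheme explicit. It is an illustration only, not how Anserini generates these commands; the names `MODELS`, `TOPIC_SETS`, and `search_commands` are made up here, and the `nohup` prefix and trailing `&` are omitted.

```python
INDEX = "indexes/lucene-index.disk45.pos+docvectors+raw"
TOPICS_DIR = "src/main/resources/topics-and-qrels"
TOPIC_SETS = ["351-400", "401-450"]  # TREC-7 and TREC-8 ad hoc topics

# (run name used in the output file, flags passed to SearchCollection)
MODELS = [
    ("bm25", "-bm25"),
    ("bm25+rm3", "-bm25 -rm3"),
    ("bm25+ax", "-bm25 -axiom -axiom.deterministic -rerankCutoff 20"),
    ("ql", "-qld"),
    ("ql+rm3", "-qld -rm3"),
    ("ql+ax", "-qld -axiom -axiom.deterministic -rerankCutoff 20"),
]


def search_commands():
    """Build one SearchCollection command line per (model, topic set) pair."""
    cmds = []
    for name, flags in MODELS:
        for topics in TOPIC_SETS:
            cmds.append(
                f"target/appassembler/bin/SearchCollection -index {INDEX} "
                f"-topicreader Trec -topics {TOPICS_DIR}/topics.adhoc.{topics}.txt "
                f"-output runs/run.disk45.{name}.topics.adhoc.{topics}.txt "
                f"{flags}"
            )
    return cmds
```

Each run file is named `run.disk45.<model>.topics.adhoc.<range>.txt`, which is the name the evaluation commands expect.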

Evaluation can be performed using `trec_eval`:

```
tools/eval/trec_eval.9.0.4/trec_eval -m map -m P.30 src/main/resources/topics-and-qrels/qrels.adhoc.351-400.txt runs/run.disk45.bm25.topics.adhoc.351-400.txt
tools/eval/trec_eval.9.0.4/trec_eval -m map -m P.30 src/main/resources/topics-and-qrels/qrels.adhoc.401-450.txt runs/run.disk45.bm25.topics.adhoc.401-450.txt
tools/eval/trec_eval.9.0.4/trec_eval -m map -m P.30 src/main/resources/topics-and-qrels/qrels.adhoc.351-400.txt runs/run.disk45.bm25+rm3.topics.adhoc.351-400.txt
tools/eval/trec_eval.9.0.4/trec_eval -m map -m P.30 src/main/resources/topics-and-qrels/qrels.adhoc.401-450.txt runs/run.disk45.bm25+rm3.topics.adhoc.401-450.txt
tools/eval/trec_eval.9.0.4/trec_eval -m map -m P.30 src/main/resources/topics-and-qrels/qrels.adhoc.351-400.txt runs/run.disk45.bm25+ax.topics.adhoc.351-400.txt
tools/eval/trec_eval.9.0.4/trec_eval -m map -m P.30 src/main/resources/topics-and-qrels/qrels.adhoc.401-450.txt runs/run.disk45.bm25+ax.topics.adhoc.401-450.txt
tools/eval/trec_eval.9.0.4/trec_eval -m map -m P.30 src/main/resources/topics-and-qrels/qrels.adhoc.351-400.txt runs/run.disk45.ql.topics.adhoc.351-400.txt
tools/eval/trec_eval.9.0.4/trec_eval -m map -m P.30 src/main/resources/topics-and-qrels/qrels.adhoc.401-450.txt runs/run.disk45.ql.topics.adhoc.401-450.txt
tools/eval/trec_eval.9.0.4/trec_eval -m map -m P.30 src/main/resources/topics-and-qrels/qrels.adhoc.351-400.txt runs/run.disk45.ql+rm3.topics.adhoc.351-400.txt
tools/eval/trec_eval.9.0.4/trec_eval -m map -m P.30 src/main/resources/topics-and-qrels/qrels.adhoc.401-450.txt runs/run.disk45.ql+rm3.topics.adhoc.401-450.txt
tools/eval/trec_eval.9.0.4/trec_eval -m map -m P.30 src/main/resources/topics-and-qrels/qrels.adhoc.351-400.txt runs/run.disk45.ql+ax.topics.adhoc.351-400.txt
tools/eval/trec_eval.9.0.4/trec_eval -m map -m P.30 src/main/resources/topics-and-qrels/qrels.adhoc.401-450.txt runs/run.disk45.ql+ax.topics.adhoc.401-450.txt
```
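
To collect the scores programmatically, the summary lines printed by `trec_eval` can be parsed. The sketch below assumes the usual three-column output (measure name, query id, value, separated by whitespace) and that `-m P.30` is reported under the name `P_30`; both are assumptions about `trec_eval` output, not something stated on this page, and `parse_trec_eval` is a hypothetical helper.

```python
def parse_trec_eval(output):
    """Map measure name -> score for the summary ("all") rows of trec_eval output."""
    scores = {}
    for line in output.splitlines():
        parts = line.split()
        # Summary rows are assumed to look like: <measure> all <value>
        if len(parts) == 3 and parts[1] == "all":
            scores[parts[0]] = float(parts[2])
    return scores


# Illustrative input only, using the BM25 TREC-7 scores reported on this page.
sample = "map                   \tall\t0.1862\nP_30                  \tall\t0.3093\n"
scores = parse_trec_eval(sample)
```

In practice the `output` string would come from capturing the standard output of one of the commands above.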

## Effectiveness

With the above commands, you should be able to reproduce the following results:

MAP | BM25 | BM25+RM3 | BM25+Ax | QL | QL+RM3 | QL+Ax |
:---------------------------------------|-----------|-----------|-----------|-----------|-----------|-----------|
[TREC-7 Ad Hoc Topics](../src/main/resources/topics-and-qrels/topics.adhoc.351-400.txt)| 0.1862 | 0.2354 | 0.2431 | 0.1843 | 0.2168 | 0.2298 |
[TREC-8 Ad Hoc Topics](../src/main/resources/topics-and-qrels/topics.adhoc.401-450.txt)| 0.2515 | 0.2750 | 0.2812 | 0.2460 | 0.2702 | 0.2647 |


P30 | BM25 | BM25+RM3 | BM25+Ax | QL | QL+RM3 | QL+Ax |
:---------------------------------------|-----------|-----------|-----------|-----------|-----------|-----------|
[TREC-7 Ad Hoc Topics](../src/main/resources/topics-and-qrels/topics.adhoc.351-400.txt)| 0.3093 | 0.3447 | 0.3287 | 0.3073 | 0.3307 | 0.3193 |
[TREC-8 Ad Hoc Topics](../src/main/resources/topics-and-qrels/topics.adhoc.401-450.txt)| 0.3560 | 0.3760 | 0.3753 | 0.3480 | 0.3680 | 0.3500 |
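
A reproduced score can be checked against the tables with a small tolerance. This is a minimal sketch under stated assumptions: the MAP values are copied from the table above, while the helper name `matches_expected` and the tolerance of half the last printed digit are choices made here, not the comparison rule used by Anserini's regression script.

```python
# Expected MAP from the table above, keyed by (run name, topic range).
EXPECTED_MAP = {
    ("bm25", "351-400"): 0.1862,
    ("bm25+rm3", "351-400"): 0.2354,
    ("bm25+ax", "351-400"): 0.2431,
    ("ql", "351-400"): 0.1843,
    ("ql+rm3", "351-400"): 0.2168,
    ("ql+ax", "351-400"): 0.2298,
    ("bm25", "401-450"): 0.2515,
    ("bm25+rm3", "401-450"): 0.2750,
    ("bm25+ax", "401-450"): 0.2812,
    ("ql", "401-450"): 0.2460,
    ("ql+rm3", "401-450"): 0.2702,
    ("ql+ax", "401-450"): 0.2647,
}


def matches_expected(run, topics, actual, tolerance=5e-5):
    """True if the measured MAP is within the tolerance of the expected value."""
    expected = EXPECTED_MAP[(run, topics)]
    return abs(actual - expected) <= tolerance
```

The same pattern applies to the P30 table with a second dictionary.
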
14 changes: 7 additions & 7 deletions src/main/resources/docgen/templates/disk12.template
@@ -1,6 +1,6 @@
# Anserini: Regressions for [TIPSTER Disks 1 & 2](https://catalog.ldc.upenn.edu/LDC93T3A)

-This page describes regressions for ad hoc topics from the early TRECs, which use [TIPSTER Disks 1 & 2](https://catalog.ldc.upenn.edu/LDC93T3A).
+This page describes regressions for ad hoc topics from TREC 1-3, which use [TIPSTER Disks 1 & 2](https://catalog.ldc.upenn.edu/LDC93T3A).
The exact configurations for these regressions are stored in [this YAML file](../src/main/resources/regression/disk12.yaml).
Note that this page is automatically generated from [this template](../src/main/resources/docgen/templates/disk12.template) as part of Anserini's regression pipeline, so do not modify this page directly; modify the template instead.

@@ -20,12 +20,12 @@ For additional details, see explanation of [common indexing options](common-inde

Topics and qrels are stored in [`src/main/resources/topics-and-qrels/`](../src/main/resources/topics-and-qrels/), downloaded from NIST:

-+ [`topics.adhoc.51-100.txt`](../src/main/resources/topics-and-qrels/topics.adhoc.51-100.txt): [TREC-1 Ad Hoc Topics 51-100](http://trec.nist.gov/data/topics_eng/topics.51-100.gz)
-+ [`topics.adhoc.101-150.txt`](../src/main/resources/topics-and-qrels/topics.adhoc.101-150.txt): [TREC-2 Ad Hoc Topics 101-150](http://trec.nist.gov/data/topics_eng/topics.101-150.gz)
-+ [`topics.adhoc.151-200.txt`](../src/main/resources/topics-and-qrels/topics.adhoc.151-200.txt): [TREC-3 Ad Hoc Topics 151-200](http://trec.nist.gov/data/topics_eng/topics.151-200.gz)
-+ [`qrels.adhoc.51-100.txt`](../src/main/resources/topics-and-qrels/qrels.adhoc.51-100.txt): [qrels for TREC-1 Ad Hoc Topics 51-100](http://trec.nist.gov/data/qrels_eng/qrels.51-100.disk1.disk2.parts1-5.tar.gz)
-+ [`qrels.adhoc.101-150.txt`](../src/main/resources/topics-and-qrels/qrels.adhoc.101-150.txt): [qrels for TREC-2 Ad Hoc Topics 101-150](http://trec.nist.gov/data/qrels_eng/qrels.101-150.disk1.disk2.parts1-5.tar.gz)
-+ [`qrels.adhoc.151-200.txt`](../src/main/resources/topics-and-qrels/qrels.adhoc.151-200.txt): [qrels for TREC-3 Ad Hoc Topics 151-200](http://trec.nist.gov/data/qrels_eng/qrels.151-200.201-250.disks1-3.all.tar.gz)
++ [`topics.adhoc.51-100.txt`](../src/main/resources/topics-and-qrels/topics.adhoc.51-100.txt): [TREC-1 Ad Hoc Topics 51-100](http://trec.nist.gov/data/topics_eng/)
++ [`topics.adhoc.101-150.txt`](../src/main/resources/topics-and-qrels/topics.adhoc.101-150.txt): [TREC-2 Ad Hoc Topics 101-150](http://trec.nist.gov/data/topics_eng/)
++ [`topics.adhoc.151-200.txt`](../src/main/resources/topics-and-qrels/topics.adhoc.151-200.txt): [TREC-3 Ad Hoc Topics 151-200](http://trec.nist.gov/data/topics_eng/)
++ [`qrels.adhoc.51-100.txt`](../src/main/resources/topics-and-qrels/qrels.adhoc.51-100.txt): [qrels for TREC-1 Ad Hoc Topics 51-100](http://trec.nist.gov/data/qrels_eng/)
++ [`qrels.adhoc.101-150.txt`](../src/main/resources/topics-and-qrels/qrels.adhoc.101-150.txt): [qrels for TREC-2 Ad Hoc Topics 101-150](http://trec.nist.gov/data/qrels_eng/)
++ [`qrels.adhoc.151-200.txt`](../src/main/resources/topics-and-qrels/qrels.adhoc.151-200.txt): [qrels for TREC-3 Ad Hoc Topics 151-200](http://trec.nist.gov/data/qrels_eng/)

After indexing has completed, you should be able to perform retrieval as follows:

45 changes: 45 additions & 0 deletions src/main/resources/docgen/templates/disk45.template
@@ -0,0 +1,45 @@
# Anserini: Regressions for [TREC Disks 4 & 5](https://trec.nist.gov/data/cd45/index.html)

This page describes regressions for ad hoc topics from TREC 7-8, which use [TREC Disks 4 & 5](https://trec.nist.gov/data/cd45/index.html).
The exact configurations for these regressions are stored in [this YAML file](../src/main/resources/regression/disk45.yaml).
Note that this page is automatically generated from [this template](../src/main/resources/docgen/templates/disk45.template) as part of Anserini's regression pipeline, so do not modify this page directly; modify the template instead.

## Indexing

Typical indexing command:

```
${index_cmds}
```

The directory `/path/to/disk45/` should be the root directory of [TREC Disks 4 & 5](https://trec.nist.gov/data/cd45/index.html); inside it there should be subdirectories such as `ft` and `fr94`.
Note that Anserini ignores the `cr` folder when indexing, which is the standard configuration.

For additional details, see the explanation of [common indexing options](common-indexing-options.md).

## Retrieval

Topics and qrels are stored in [`src/main/resources/topics-and-qrels/`](../src/main/resources/topics-and-qrels/), downloaded from NIST:

+ [`topics.adhoc.351-400.txt`](../src/main/resources/topics-and-qrels/topics.adhoc.351-400.txt): [TREC-7 Ad Hoc Topics 351-400](http://trec.nist.gov/data/topics_eng/)
+ [`topics.adhoc.401-450.txt`](../src/main/resources/topics-and-qrels/topics.adhoc.401-450.txt): [TREC-8 Ad Hoc Topics 401-450](http://trec.nist.gov/data/topics_eng/)
+ [`qrels.adhoc.351-400.txt`](../src/main/resources/topics-and-qrels/qrels.adhoc.351-400.txt): [qrels for TREC-7 Ad Hoc Topics 351-400](http://trec.nist.gov/data/qrels_eng/)
+ [`qrels.adhoc.401-450.txt`](../src/main/resources/topics-and-qrels/qrels.adhoc.401-450.txt): [qrels for TREC-8 Ad Hoc Topics 401-450](http://trec.nist.gov/data/qrels_eng/)

After indexing has completed, you should be able to perform retrieval as follows:

```
${ranking_cmds}
```

Evaluation can be performed using `trec_eval`:

```
${eval_cmds}
```

## Effectiveness

With the above commands, you should be able to reproduce the following results:

${effectiveness}
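
The template leaves four placeholders (`${index_cmds}`, `${ranking_cmds}`, `${eval_cmds}`, `${effectiveness}`) to be filled in by the regression pipeline. The sketch below illustrates that kind of substitution with Python's `string.Template`, which happens to use the same `${name}` syntax. It is not Anserini's actual document generator, and the function name `render` is hypothetical.

```python
from string import Template


def render(template_text, values):
    """Fill ${name} placeholders; unknown placeholders are left untouched."""
    return Template(template_text).safe_substitute(values)


snippet = "Typical indexing command:\n\n${index_cmds}\n\n${eval_cmds}\n"
page = render(snippet, {"index_cmds": "sh target/appassembler/bin/IndexCollection ..."})
```

Using `safe_substitute` rather than `substitute` means a placeholder with no supplied value stays visible in the output instead of raising an error, which makes partial renders easier to inspect.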