Add regression for HC4 monolingual retrieval BM25
* index neuclir

* refactor

* hc4 regression

* updates

* updates

* generate md

* refactor hc4 regressions

* update readme
ToluClassics authored Jun 16, 2022
1 parent 94bbb44 commit d2fbe67
Showing 19 changed files with 1,753 additions and 0 deletions.
1 change: 1 addition & 0 deletions README.md
@@ -167,6 +167,7 @@ See individual pages for details!
+ Regressions for [TREC 2002 Monolingual Arabic](docs/regressions-trec02-ar.md)
+ Regressions for FIRE 2012: [Monolingual Bengali](docs/regressions-fire12-bn.md), [Monolingual Hindi](docs/regressions-fire12-hi.md), [Monolingual English](docs/regressions-fire12-en.md)
+ Regressions for Mr. TyDi (v1.1) baselines : [ar](docs/regressions-mrtydi-v1.1-ar.md), [bn](docs/regressions-mrtydi-v1.1-bn.md), [en](docs/regressions-mrtydi-v1.1-en.md), [fi](docs/regressions-mrtydi-v1.1-fi.md), [id](docs/regressions-mrtydi-v1.1-id.md), [ja](docs/regressions-mrtydi-v1.1-ja.md), [ko](docs/regressions-mrtydi-v1.1-ko.md), [ru](docs/regressions-mrtydi-v1.1-ru.md), [sw](docs/regressions-mrtydi-v1.1-sw.md), [te](docs/regressions-mrtydi-v1.1-te.md), [th](docs/regressions-mrtydi-v1.1-th.md)
+ Regressions for HC4 (v1.0) baselines : [Russian](docs/regressions-hc4-v1.0-ru.md), [Persian](docs/regressions-hc4-v1.0-fa.md), [Chinese](docs/regressions-hc4-v1.0-zh.md)

### Available Corpora

64 changes: 64 additions & 0 deletions docs/regressions-hc4-v1.0-fa.md
@@ -0,0 +1,64 @@
# Anserini Regressions: HC4 (v1.0) — Persian

This page documents BM25 regression experiments for [HC4 (v1.0) — Persian](https://github.com/hltcoe/HC4).

The exact configurations for these regressions are stored in [this YAML file](../src/main/resources/regression/hc4-v1.0-fa.yaml).
Note that this page is automatically generated from [this template](../src/main/resources/docgen/templates/hc4-v1.0-fa.template) as part of Anserini's regression pipeline, so do not modify this page directly; modify the template instead.

From one of our Waterloo servers (e.g., `orca`), the following command will perform the complete regression, end to end:

```
python src/main/python/run_regression.py --index --verify --search --regression hc4-v1.0-fa
```

## Indexing

Typical indexing command:

```
target/appassembler/bin/IndexCollection \
-collection NeuClirCollection \
-input /path/to/hc4-v1.0-fas \
-index indexes/lucene-index.hc4-v1.0-persian/ \
-generator DefaultLuceneDocumentGenerator \
-threads 1 -storePositions -storeDocvectors -storeRaw -language fa \
>& logs/log.hc4-v1.0-fas &
```

See [this page](https://github.com/hltcoe/HC4) for more details about the HC4 corpus.
For additional details, see explanation of [common indexing options](common-indexing-options.md).

## Retrieval

After indexing has completed, you should be able to perform retrieval as follows:

```
target/appassembler/bin/SearchCollection \
-index indexes/lucene-index.hc4-v1.0-persian/ \
-topics src/main/resources/topics-and-qrels/topics.hc4-v1.0-fa.dev.title.tsv.gz \
-topicreader TsvInt \
-output runs/run.hc4-v1.0-fas.bm25.topics.hc4-v1.0-fa.dev.title.txt \
-bm25 -hits 100 -language fa &
target/appassembler/bin/SearchCollection \
-index indexes/lucene-index.hc4-v1.0-persian/ \
-topics src/main/resources/topics-and-qrels/topics.hc4-v1.0-fa.dev.desc.tsv.gz \
-topicreader TsvInt \
-output runs/run.hc4-v1.0-fas.bm25.topics.hc4-v1.0-fa.dev.desc.txt \
-bm25 -hits 100 -language fa &
```

Evaluation can be performed using `trec_eval`:

```
tools/eval/trec_eval.9.0.4/trec_eval -c -M 100 -m map src/main/resources/topics-and-qrels/qrels.hc4-v1.0-fa.dev.txt runs/run.hc4-v1.0-fas.bm25.topics.hc4-v1.0-fa.dev.title.txt
tools/eval/trec_eval.9.0.4/trec_eval -c -M 100 -m map src/main/resources/topics-and-qrels/qrels.hc4-v1.0-fa.dev.txt runs/run.hc4-v1.0-fas.bm25.topics.hc4-v1.0-fa.dev.desc.txt
```
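As a rough illustration of what the `trec_eval` flags above measure, the following Python sketch computes mean average precision on toy data. It assumes `-M 100` caps each ranking at 100 documents and `-c` averages over every judged topic, so a topic missing from the run contributes 0. The function names and document ids are hypothetical and are not part of Anserini or `trec_eval`.

```python
def average_precision(ranked_docids, relevant, max_docs=100):
    """AP for one topic, using only the first max_docs results (cf. -M 100)."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, docid in enumerate(ranked_docids[:max_docs], start=1):
        if docid in relevant:
            hits += 1
            precision_sum += hits / rank
    # Normalize by the number of relevant documents in the judgments.
    return precision_sum / len(relevant)


def mean_average_precision(run, qrels, max_docs=100):
    """MAP over all judged topics; topics absent from the run score 0 (cf. -c)."""
    topics = [t for t, rel in qrels.items() if rel]
    total = sum(average_precision(run.get(t, []), qrels[t], max_docs) for t in topics)
    return total / len(topics)


# Toy data with hypothetical document ids (not taken from HC4).
qrels = {"1": {"d1", "d3"}, "2": {"d9"}}
run = {"1": ["d1", "d2", "d3"]}  # topic 2 returned nothing
map_score = mean_average_precision(run, qrels)
print(f"{map_score:.4f}")
```

Printing with four decimals mirrors the precision used in the effectiveness table.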

## Effectiveness

With the above commands, you should be able to reproduce the following results:

| MAP | BM25 |
|:-------------------------------------------------------------------------------------------------------------|-----------|
| [HC4 (Persian): dev-topic title](https://github.com/hltcoe/HC4) | 0.2919 |
| [HC4 (Persian): dev-topic description](https://github.com/hltcoe/HC4) | 0.3188 |
64 changes: 64 additions & 0 deletions docs/regressions-hc4-v1.0-ru.md
@@ -0,0 +1,64 @@
# Anserini Regressions: HC4 (v1.0) — Russian

This page documents BM25 regression experiments for [HC4 (v1.0) — Russian](https://github.com/hltcoe/HC4).

The exact configurations for these regressions are stored in [this YAML file](../src/main/resources/regression/hc4-v1.0-ru.yaml).
Note that this page is automatically generated from [this template](../src/main/resources/docgen/templates/hc4-v1.0-ru.template) as part of Anserini's regression pipeline, so do not modify this page directly; modify the template instead.

From one of our Waterloo servers (e.g., `orca`), the following command will perform the complete regression, end to end:

```
python src/main/python/run_regression.py --index --verify --search --regression hc4-v1.0-ru
```

## Indexing

Typical indexing command:

```
target/appassembler/bin/IndexCollection \
-collection NeuClirCollection \
-input /path/to/hc4-v1.0-rus \
-index indexes/lucene-index.hc4-v1.0-russian/ \
-generator DefaultLuceneDocumentGenerator \
-threads 1 -storePositions -storeDocvectors -storeRaw -language ru \
>& logs/log.hc4-v1.0-rus &
```

See [this page](https://github.com/hltcoe/HC4) for more details about the HC4 corpus.
For additional details, see explanation of [common indexing options](common-indexing-options.md).

## Retrieval

After indexing has completed, you should be able to perform retrieval as follows:

```
target/appassembler/bin/SearchCollection \
-index indexes/lucene-index.hc4-v1.0-russian/ \
-topics src/main/resources/topics-and-qrels/topics.hc4-v1.0-ru.dev.title.tsv.gz \
-topicreader TsvInt \
-output runs/run.hc4-v1.0-rus.bm25.topics.hc4-v1.0-ru.dev.title.txt \
-bm25 -hits 100 -language ru &
target/appassembler/bin/SearchCollection \
-index indexes/lucene-index.hc4-v1.0-russian/ \
-topics src/main/resources/topics-and-qrels/topics.hc4-v1.0-ru.dev.desc.tsv.gz \
-topicreader TsvInt \
-output runs/run.hc4-v1.0-rus.bm25.topics.hc4-v1.0-ru.dev.desc.txt \
-bm25 -hits 100 -language ru &
```

Evaluation can be performed using `trec_eval`:

```
tools/eval/trec_eval.9.0.4/trec_eval -c -M 100 -m map src/main/resources/topics-and-qrels/qrels.hc4-v1.0-ru.dev.txt runs/run.hc4-v1.0-rus.bm25.topics.hc4-v1.0-ru.dev.title.txt
tools/eval/trec_eval.9.0.4/trec_eval -c -M 100 -m map src/main/resources/topics-and-qrels/qrels.hc4-v1.0-ru.dev.txt runs/run.hc4-v1.0-rus.bm25.topics.hc4-v1.0-ru.dev.desc.txt
```
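The Persian YAML file in this commit sets `separator: "\t"` and `parse_index: 2` for the MAP metric, which suggests the score is read from the third tab-separated field of the `trec_eval` output. A minimal sketch of that parsing step follows, assuming the Russian configuration uses the same settings. `parse_metric` is a hypothetical helper, not a function from the regression script, and the sample line simply reuses the title-topic score reported for Russian.

```python
def parse_metric(output, metric, separator="\t", parse_index=2):
    """Return the value reported for `metric` in trec_eval-style output."""
    for line in output.splitlines():
        fields = line.split(separator)
        # The metric name is padded with spaces, so strip before comparing.
        if fields and fields[0].strip() == metric:
            return float(fields[parse_index])
    raise ValueError(f"metric {metric!r} not found")


# Sample line laid out as: metric name, topic id ("all"), value.
sample_output = "map                   \tall\t0.2767\n"
value = parse_metric(sample_output, "map")
print(value)
```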

## Effectiveness

With the above commands, you should be able to reproduce the following results:

| MAP | BM25 |
|:-------------------------------------------------------------------------------------------------------------|-----------|
| [HC4 (Russian): dev-topic title](https://github.com/hltcoe/HC4) | 0.2767 |
| [HC4 (Russian): dev-topic description](https://github.com/hltcoe/HC4) | 0.2321 |
64 changes: 64 additions & 0 deletions docs/regressions-hc4-v1.0-zh.md
@@ -0,0 +1,64 @@
# Anserini Regressions: HC4 (v1.0) — Chinese

This page documents BM25 regression experiments for [HC4 (v1.0) — Chinese](https://github.com/hltcoe/HC4).

The exact configurations for these regressions are stored in [this YAML file](../src/main/resources/regression/hc4-v1.0-zh.yaml).
Note that this page is automatically generated from [this template](../src/main/resources/docgen/templates/hc4-v1.0-zh.template) as part of Anserini's regression pipeline, so do not modify this page directly; modify the template instead.

From one of our Waterloo servers (e.g., `orca`), the following command will perform the complete regression, end to end:

```
python src/main/python/run_regression.py --index --verify --search --regression hc4-v1.0-zh
```

## Indexing

Typical indexing command:

```
target/appassembler/bin/IndexCollection \
-collection NeuClirCollection \
-input /path/to/hc4-v1.0-zho \
-index indexes/lucene-index.hc4-v1.0-chinese/ \
-generator DefaultLuceneDocumentGenerator \
-threads 1 -storePositions -storeDocvectors -storeRaw -language zh \
>& logs/log.hc4-v1.0-zho &
```

See [this page](https://github.com/hltcoe/HC4) for more details about the HC4 corpus.
For additional details, see explanation of [common indexing options](common-indexing-options.md).

## Retrieval

After indexing has completed, you should be able to perform retrieval as follows:

```
target/appassembler/bin/SearchCollection \
-index indexes/lucene-index.hc4-v1.0-chinese/ \
-topics src/main/resources/topics-and-qrels/topics.hc4-v1.0-zh.dev.title.tsv.gz \
-topicreader TsvInt \
-output runs/run.hc4-v1.0-zho.bm25.topics.hc4-v1.0-zh.dev.title.txt \
-bm25 -hits 100 -language zh &
target/appassembler/bin/SearchCollection \
-index indexes/lucene-index.hc4-v1.0-chinese/ \
-topics src/main/resources/topics-and-qrels/topics.hc4-v1.0-zh.dev.desc.tsv.gz \
-topicreader TsvInt \
-output runs/run.hc4-v1.0-zho.bm25.topics.hc4-v1.0-zh.dev.desc.txt \
-bm25 -hits 100 -language zh &
```

Evaluation can be performed using `trec_eval`:

```
tools/eval/trec_eval.9.0.4/trec_eval -c -M 100 -m map src/main/resources/topics-and-qrels/qrels.hc4-v1.0-zh.dev.txt runs/run.hc4-v1.0-zho.bm25.topics.hc4-v1.0-zh.dev.title.txt
tools/eval/trec_eval.9.0.4/trec_eval -c -M 100 -m map src/main/resources/topics-and-qrels/qrels.hc4-v1.0-zh.dev.txt runs/run.hc4-v1.0-zho.bm25.topics.hc4-v1.0-zh.dev.desc.txt
```

## Effectiveness

With the above commands, you should be able to reproduce the following results:

| MAP | BM25 |
|:-------------------------------------------------------------------------------------------------------------|-----------|
| [HC4 (Chinese): dev-topic title](https://github.com/hltcoe/HC4) | 0.2914 |
| [HC4 (Chinese): dev-topic description](https://github.com/hltcoe/HC4) | 0.1983 |
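One way to check a fresh run against the table above is to compare scores after formatting both to four decimals, in line with the `metric_precision: 4` setting in the Persian YAML file. This is a sketch under that assumption. How the `--verify` step actually compares values is not shown in this commit, and the measured scores below are made up for illustration.

```python
def matches_expected(measured, expected, precision=4):
    """True when both values print identically at `precision` decimals."""
    return f"{measured:.{precision}f}" == f"{expected:.{precision}f}"


# Expected values from the table; measured values are hypothetical.
expected = {"dev-topic title": 0.2914, "dev-topic description": 0.1983}
measured = {"dev-topic title": 0.29141, "dev-topic description": 0.1990}

report = {name: matches_expected(measured[name], expected[name]) for name in expected}
for name, ok in report.items():
    print(f"{name}: {'OK' if ok else 'MISMATCH'}")
```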
43 changes: 43 additions & 0 deletions src/main/resources/docgen/templates/hc4-v1.0-fa.template
@@ -0,0 +1,43 @@
# Anserini Regressions: HC4 (v1.0) — Persian

This page documents BM25 regression experiments for [HC4 (v1.0) — Persian](https://github.com/hltcoe/HC4).

The exact configurations for these regressions are stored in [this YAML file](${yaml}).
Note that this page is automatically generated from [this template](${template}) as part of Anserini's regression pipeline, so do not modify this page directly; modify the template instead.

From one of our Waterloo servers (e.g., `orca`), the following command will perform the complete regression, end to end:

```
python src/main/python/run_regression.py --index --verify --search --regression ${test_name}
```

## Indexing

Typical indexing command:

```
${index_cmds}
```

See [this page](https://github.com/hltcoe/HC4) for more details about the HC4 corpus.
For additional details, see explanation of [common indexing options](common-indexing-options.md).

## Retrieval

After indexing has completed, you should be able to perform retrieval as follows:

```
${ranking_cmds}
```

Evaluation can be performed using `trec_eval`:

```
${eval_cmds}
```

## Effectiveness

With the above commands, you should be able to reproduce the following results:

${effectiveness}
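The template marks the generated parts with `${name}` placeholders (`${yaml}`, `${template}`, `${test_name}`, and so on), and the generated page `docs/regressions-hc4-v1.0-fa.md` shows them filled in. The sketch below illustrates the substitution with Python's `string.Template`, which happens to use the same placeholder syntax. It is an illustration only and is not the project's documentation generator.

```python
from string import Template

# Two lines excerpted from the template above.
template_text = (
    "The exact configurations for these regressions are stored in "
    "[this YAML file](${yaml}).\n"
    "python src/main/python/run_regression.py --index --verify --search "
    "--regression ${test_name}\n"
)

# Values as they appear in the generated Persian page.
values = {
    "yaml": "../src/main/resources/regression/hc4-v1.0-fa.yaml",
    "test_name": "hc4-v1.0-fa",
}

# substitute() raises KeyError if a placeholder has no value.
rendered = Template(template_text).substitute(values)
print(rendered)
```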
43 changes: 43 additions & 0 deletions src/main/resources/docgen/templates/hc4-v1.0-ru.template
@@ -0,0 +1,43 @@
# Anserini Regressions: HC4 (v1.0) — Russian

This page documents BM25 regression experiments for [HC4 (v1.0) — Russian](https://github.com/hltcoe/HC4).

The exact configurations for these regressions are stored in [this YAML file](${yaml}).
Note that this page is automatically generated from [this template](${template}) as part of Anserini's regression pipeline, so do not modify this page directly; modify the template instead.

From one of our Waterloo servers (e.g., `orca`), the following command will perform the complete regression, end to end:

```
python src/main/python/run_regression.py --index --verify --search --regression ${test_name}
```

## Indexing

Typical indexing command:

```
${index_cmds}
```

See [this page](https://github.com/hltcoe/HC4) for more details about the HC4 corpus.
For additional details, see explanation of [common indexing options](common-indexing-options.md).

## Retrieval

After indexing has completed, you should be able to perform retrieval as follows:

```
${ranking_cmds}
```

Evaluation can be performed using `trec_eval`:

```
${eval_cmds}
```

## Effectiveness

With the above commands, you should be able to reproduce the following results:

${effectiveness}
43 changes: 43 additions & 0 deletions src/main/resources/docgen/templates/hc4-v1.0-zh.template
@@ -0,0 +1,43 @@
# Anserini Regressions: HC4 (v1.0) — Chinese

This page documents BM25 regression experiments for [HC4 (v1.0) — Chinese](https://github.com/hltcoe/HC4).

The exact configurations for these regressions are stored in [this YAML file](${yaml}).
Note that this page is automatically generated from [this template](${template}) as part of Anserini's regression pipeline, so do not modify this page directly; modify the template instead.

From one of our Waterloo servers (e.g., `orca`), the following command will perform the complete regression, end to end:

```
python src/main/python/run_regression.py --index --verify --search --regression ${test_name}
```

## Indexing

Typical indexing command:

```
${index_cmds}
```

See [this page](https://github.com/hltcoe/HC4) for more details about the HC4 corpus.
For additional details, see explanation of [common indexing options](common-indexing-options.md).

## Retrieval

After indexing has completed, you should be able to perform retrieval as follows:

```
${ranking_cmds}
```

Evaluation can be performed using `trec_eval`:

```
${eval_cmds}
```

## Effectiveness

With the above commands, you should be able to reproduce the following results:

${effectiveness}
45 changes: 45 additions & 0 deletions src/main/resources/regression/hc4-v1.0-fa.yaml
@@ -0,0 +1,45 @@
---
corpus: hc4-v1.0-fa
corpus_path: collections/multilingual/hc4-v1.0-fa/

index_path: indexes/lucene-index.hc4-v1.0-persian/
collection_class: NeuClirCollection
generator_class: DefaultLuceneDocumentGenerator
index_threads: 1
index_options: -storePositions -storeDocvectors -storeRaw -language fa
index_stats:
  documents: 486486
  documents (non-empty): 486486

metrics:
  - metric: MAP
    command: tools/eval/trec_eval.9.0.4/trec_eval
    params: -c -M 100 -m map
    separator: "\t"
    parse_index: 2
    metric_precision: 4
    can_combine: true

topic_reader: TsvInt
topic_root: src/main/resources/topics-and-qrels/
qrels_root: src/main/resources/topics-and-qrels/
topics:
  - name: "[HC4 (Persian): dev-topic title](https://github.com/hltcoe/HC4)"
    id: dev_title
    path: topics.hc4-v1.0-fa.dev.title.tsv.gz
    qrel: qrels.hc4-v1.0-fa.dev.txt
  - name: "[HC4 (Persian): dev-topic description](https://github.com/hltcoe/HC4)"
    id: dev_description
    path: topics.hc4-v1.0-fa.dev.desc.tsv.gz
    qrel: qrels.hc4-v1.0-fa.dev.txt


models:
  - name: bm25
    display: BM25
    params: -bm25 -hits 100 -language fa
    results:
      MAP:
        - 0.2919
        - 0.3188
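To make the relationship between this configuration and the retrieval commands in the generated page more concrete, here is a small sketch that assembles one `SearchCollection` command line per model and topic set. The dictionary copies a subset of the YAML fields by hand. The `search_commands` helper and the output-file naming pattern are assumptions inferred from the commands shown in the docs, not code from `run_regression.py`.

```python
# A hand-copied subset of the YAML fields above.
config = {
    "corpus": "hc4-v1.0-fa",
    "index_path": "indexes/lucene-index.hc4-v1.0-persian/",
    "topic_reader": "TsvInt",
    "topic_root": "src/main/resources/topics-and-qrels/",
    "topics": [
        {"id": "dev_title", "path": "topics.hc4-v1.0-fa.dev.title.tsv.gz"},
        {"id": "dev_description", "path": "topics.hc4-v1.0-fa.dev.desc.tsv.gz"},
    ],
    "models": [{"name": "bm25", "params": "-bm25 -hits 100 -language fa"}],
}


def search_commands(cfg):
    """Build one SearchCollection command line per (model, topic set) pair."""
    commands = []
    for model in cfg["models"]:
        for topic in cfg["topics"]:
            # Assumed naming: run.<corpus>.<model>.<topic file minus .tsv.gz>.txt
            stem = topic["path"]
            if stem.endswith(".tsv.gz"):
                stem = stem[: -len(".tsv.gz")]
            output = f"runs/run.{cfg['corpus']}.{model['name']}.{stem}.txt"
            commands.append(" ".join([
                "target/appassembler/bin/SearchCollection",
                "-index", cfg["index_path"],
                "-topics", cfg["topic_root"] + topic["path"],
                "-topicreader", cfg["topic_reader"],
                "-output", output,
                model["params"],
            ]))
    return commands


commands = search_commands(config)
for command in commands:
    print(command)
```

Each entry under `models` pairs with every entry under `topics`, which is why the two `MAP` results line up with the title and description topic sets in order.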
