Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ensure that multi-line TREC topic titles are fully parsed. Update the… #1482

Merged
merged 3 commits into from
Feb 24, 2021

Conversation

eelstretching
Copy link
Contributor

Incorporated the change from @axlliu for ensuring that multi-line TREC topic titles are processed correctly. Added a test for title parsing to the set of tests.

Note that one of the existing tests (testNewswireTopics) happened to hit on a multi-line title (topic 200), and that test needed to be fixed to include the second line of the title.

As these changes change the queries, the numbers in the regressions for disk 1 and 2 need to change. I re-ran the index and the evaluations using the new code, which resulted in quite a few changes for the regression numbers (some better, some worse.)

… tests to test this, and update the regressions with numbers that take this into account
@codecov
Copy link

codecov bot commented Feb 19, 2021

Codecov Report

Merging #1482 (6b1d907) into master (32923e7) will increase coverage by 0.00%.
The diff coverage is 75.00%.

Impacted file tree graph

@@            Coverage Diff            @@
##             master    #1482   +/-   ##
=========================================
  Coverage     46.32%   46.32%           
- Complexity      830      831    +1     
=========================================
  Files           197      197           
  Lines         10248    10242    -6     
  Branches       1349     1348    -1     
=========================================
- Hits           4747     4745    -2     
+ Misses         5109     5107    -2     
+ Partials        392      390    -2     
Impacted Files Coverage Δ Complexity Δ
...o/anserini/search/topicreader/TrecTopicReader.java 95.16% <75.00%> (+5.45%) 16.00 <0.00> (+1.00)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c750159...a75b691. Read the comment docs.

@lintool
Copy link
Member

lintool commented Feb 20, 2021

Great! Can you actually change the regression numbers here? https://github.com/castorini/anserini/blob/master/src/main/resources/regression/disk12.yaml

The documentation is auto-gen'ed based on yaml above.

…arkdown files, and the world goes round and round!
@eelstretching
Copy link
Contributor Author

The YAML file's connected to the MD file, the MD file's connected to the...

Updated the YAML and it re-generated the MD file, so I committed both.

@lintool
Copy link
Member

lintool commented Feb 22, 2021

@eelstretching great, thanks! I'll queue up a run of all regressions on my end to make sure it doesn't break anything else, and then I'll merge.

@lintool lintool merged commit 90d3aa0 into castorini:master Feb 24, 2021
lintool added a commit that referenced this pull request Feb 24, 2021
Documentation of regression change - fixes bug in reading of multi-line topics, issue #1482
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants