Completed replication for msmarco-doc and msmarco-passage #1220

Merged
merged 2 commits into from
May 25, 2020

Conversation

shaneding
Contributor

Environment
OS: macOS 10.15 (Catalina)
Java: 14.0.1
Python: 3.7.6

Issues:
BM25 Baseline on MS MARCO Passage Retrieval:
Error: return self.object.search(JString(q.encode('utf8')), k)
AttributeError: 'bytes' object has no attribute 'encode'

  • Encountered when using Pyserini to retrieve the smaller set of queries; I had to use the Java implementation instead.
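
In Python 3, `bytes` has a `decode` method but no `encode`, so the message above means the query reaching that line was already a bytes object. A minimal sketch that reproduces the message and shows one defensive normalization follows; `to_text` is a hypothetical helper (not part of Pyserini) and the query string is made up for illustration.

```python
def to_text(q):
    """Return q as str, decoding UTF-8 bytes if needed (hypothetical helper)."""
    if isinstance(q, bytes):
        return q.decode("utf-8")
    return q


# Made-up query that has already been encoded once.
query = "example query text".encode("utf-8")

# Calling .encode() on bytes raises the same AttributeError as in the report.
try:
    query.encode("utf8")
    message = ""
except AttributeError as err:
    message = str(err)

print(message)

# Normalizing first makes a later .encode() call safe for str or bytes input.
safe = to_text(query).encode("utf8")
```

Whether the fix belongs in the caller or in the search wrapper depends on the Pyserini version in use; this sketch only illustrates the type mismatch.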

BM25 Baselines on MS MARCO Doc Retrieval Task

  • Within replication doc (for comparison to Microsoft baseline), the command given was: eval/trec_eval.9.0.4/trec_eval -c -mmap -M 100 collections/msmarco-doc/queries-and-qrels/msmarco-docdev-qrels.tsv collections/msmarco-doc/msmarco-docdev-top100.
  • I had to modify this to: eval/trec_eval.9.0.4/trec_eval -c -mmap -M 100 collections/msmarco-doc/queries-and-qrels/msmarco-docdev-qrels.tsv collections/msmarco-doc/queries-and-qrels/msmarco-docdev-top100
  • The key difference is that msmarco-docdev-top100 was stored in queries-and-qrels rather than directly in msmarco-doc.
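
One way to guard against this kind of path drift is to check both locations before invoking trec_eval. The sketch below assumes the two layouts mentioned above; `find_run_file` is a hypothetical helper, and the demo builds a throwaway directory rather than touching a real collection.

```python
import tempfile
from pathlib import Path


def find_run_file(base, name="msmarco-docdev-top100"):
    """Return the first existing candidate path for the run file, or None."""
    base = Path(base)
    # Candidate locations: directly under the collection directory,
    # or inside its queries-and-qrels subdirectory.
    candidates = [base / name, base / "queries-and-qrels" / name]
    for path in candidates:
        if path.is_file():
            return path
    return None


# Demo on a temporary directory that mimics the layout described in this PR.
with tempfile.TemporaryDirectory() as tmp:
    sub = Path(tmp) / "queries-and-qrels"
    sub.mkdir()
    (sub / "msmarco-docdev-top100").write_text("")
    found = find_run_file(tmp)
    found_rel = found.relative_to(tmp).as_posix()
    missing = find_run_file(sub / "nowhere")

print(found_rel)
```

In practice `base` would be `collections/msmarco-doc`, and the returned path would be passed as the last argument to trec_eval.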

@codecov

codecov bot commented May 23, 2020

Codecov Report

Merging #1220 into master will increase coverage by 0.01%.
The diff coverage is n/a.

@@             Coverage Diff              @@
##             master    #1220      +/-   ##
============================================
+ Coverage     48.34%   48.35%   +0.01%     
- Complexity      738      739       +1     
============================================
  Files           147      147              
  Lines          8547     8547              
  Branches       1212     1212              
============================================
+ Hits           4132     4133       +1     
+ Misses         4075     4074       -1     
  Partials        340      340              
| Impacted Files | Coverage Δ | Complexity Δ |
| --- | --- | --- |
| ...anserini/ltr/feature/base/PMIFeatureExtractor.java | 84.61% <0.00%> (-1.93%) | 12.00% <0.00%> (-1.00%) |
| ...java/io/anserini/ltr/feature/CountBigramPairs.java | 89.61% <0.00%> (+2.59%) | 33.00% <0.00%> (+2.00%) |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update b6e0367...19524a3.

@lintool
Member

lintool commented May 24, 2020

Can you update the MS MARCO doc to directly fix the issue you encountered? It's probably a leftover from a recent refactoring...

@kelvin-jiang
Member

Can confirm that I'm running into the same AttributeError: 'bytes' object has no attribute 'encode' when retrieving with Python. Should I file an issue in the pyserini repo?

@lintool
Member

lintool commented May 25, 2020

@kelvin-jiang Yes, please. The issue should be filed in this repo, though.

I think it might be because PyPI is behind the latest HEAD on master; let's record it in an issue regardless.
