Make sure query_from_API(debug=True), if you don't want call s2search module which requires a larger complex env. And you can Annotated from s2search.rank import S2Ranker in query_module.py.
Click to show
The Semantic Scholar Search Reranker
The code in this repo is for when you have a plain-text query and some academic documents, and your goal is to search within the documents and obtain a score for how good of a match each document is for the query. The standard pipeline involves a first-stage ranker (like ElasticSearch) and a reranker. The model included with this repository is for the reranking stage only, but you may have few-enough documents that a first-stage ranker is not necessary. The model and featurization are both fast.
To install this package, run the following:
git clone https://github.com/allenai/s2search.git
cd s2search
conda create -y --name s2search python==3.7
conda activate s2search
python setup.py develop
pip install https://github.com/kpu/kenlm/archive/master.zip
To obtain the necessary data, run this command after the package is installed:
aws s3 cp --no-sign-request s3://ai2-s2-research-public/s2search_data.zip .
Then unzip the file. Iniside the zip is folder named s2search/
that will contain all of the artifacts you'll need to get predictions.
Warning: this zip file is 10G compressed and 17G uncompressed.
Warning: you will need more than 17G of ram because of the large kenlm
models that need to be loaded into memory.
An example of how to use this repo:
from s2search.rank import S2Ranker
# point to the artifacts downloaded from s3
data_dir = 's2search/'
# the data is a list of dictionaries
papers = [
{
'title': 'Neural Networks are Great',
'abstract': 'Neural networks are known to be really great models. You should use them.',
'venue': 'Deep Learning Notions',
'authors': ['Sergey Feldman', 'Gottfried W. Leibniz'],
'year': 2019,
'n_citations': 100,
'n_key_citations': 10
},
{
'title': 'Neural Networks are Terrible',
'abstract': 'Neural networks have only barely worked and we should stop working on them.',
'venue': 'JMLR',
'authors': ['Isaac Newton', 'Sergey Feldman'],
'year': 2009,
'n_citations': 5000 # we don't have n_key_citations here and that's OK
}
]
# only do this once because we have to load the giant language models into memory
s2ranker = S2Ranker(data_dir)
# higher scores are better
print(s2ranker.score('neural networks', papers))
print(s2ranker.score('feldman newton', papers))
print(s2ranker.score('jmlr', papers))
print(s2ranker.score('missing', papers))
Note that n_key_citations
is a Semantic Scholar feature. If you don't have it, just leave that key out of the data dictionary. The other paper fields are required.
- Create exp folder;
- Put data into exp folder;
- Create a config file for this experiment;
- Run a command to use s2search to get the npy data;
- Run a command to generate the jupyter notebook for data plotting of this experiment;
-
Create a folder under
pipelining
, folder name would be the experiment name . Saypipelining/pdp-exp1
. -
Put all of your paper data under the experiment folder. Say
pipelining/exp4/cslg.data
and so on. Caution: the name of the data file should not contain any underscore. -
Create an experiment config file
conf.yml
under the experiment folder. Saypipelining/exp4/conf.yml
.description: " \nExperiment No.4, Plot all data" samples: cslg: - query: "Machine Learning" masking_option_keys: ["t", "abs", "v", "au", "y", "c"] - query: "convolutional neural network" masking_option_keys: ["t", "abs", "v", "au", "y", "c"] - query: "Machine Learning" masking_option_keys: ["tabsv"] using_origin_from: "t1" cscv: - query: "Computer Vision and Pattern Recognition" masking_option_keys: ["t", "abs", "v", "au", "y", "c"] csai: - query: "Artificial Intelligence" masking_option_keys: ["t", "abs", "v", "au", "y", "c"] csms: - query: "Mathematical Software" masking_option_keys: ["t", "abs", "v", "au", "y", "c"] sample_from_other_exp: cslg: ["exp1", "cslg.data"]
The configuration should contains those key-values.
- description: describe what this experiment is up to;
- samples: a key-value map of sample data: the key is the sample data file name, the value should contains a list of task, every task should have the following configs:
- query: the query which input to the s2search;
- masking_option_keys: a list of keys for masking option, this can be refer to here;
- **using_origin_from:**if you are defining a second task of a sample data, and the query of this new task are the same as some other task before, say
t3
is usingMachine Learning
query ast1
does, you can use this config to clarify thatt3
can use the same origin files oft1
. This will save you the time for duplicated computation.
- **sample_from_other_exp: **this will let you reuse the data files that already exist in other folders. No need to copy a big data file again for the new experiments.
python s2search_score_pipelining.py [experiment1_name] [experiment2_name] ...
E.g
python s2search_score_pipelining.py exp4
Then all npy data files will be created at pipelining/exp1/scores
representing all the masking_option_keys
that you config.
For instance, as for exp4 there should be four .data
files under this experiment folder named as cslg.data
, cscv.daat
and so on. For cslg.data
, the program will perform two tasks with different input queries then outputs the score files under exp4/scores
.
You can also run multiple experiments:
python s2search_score_pipelining.py exp1 exp2 exp3
Install nbformat frist:
pip install nbformat
Then you can:
python plotting_nb_gen.py pdp-exp1
You should always specify the experiment name. The command will generate all the uncreated notebook for every configured data sample under the experiment folder.
Same as we do at normal score computation task:
- Create exp folder;
- Put data into exp folder;
- Create a config file for this experiment;
- Run a command to use s2search to get the npy data for PDP;
- Run a command to generate the jupyter notebook for data plotting of this experiment;
-
Create a folder under
pipelining
, folder name would be the experiment name . The folder name should starts withpdp-
. Saypipelining/pdp-exp1
. -
Put all of your paper data under the experiment folder. Say
pipelining/pdp-exp1/cslg.data
and so on. Caution: the name of the data file should not contain any underscore. -
Create an experiment config file
conf.yml
under the experiment folder. Saypipelining/pdp-exp1/conf.yml
.description: " Produce PDP for a randomly picked data from cslg. " samples: cslg-rand-100: - query: "Machine Learning" cslg-rand-200: - query: "Machine Learning" cslg-rand-500: - query: "Machine Learning"
The configuration is same as we do in the normal tasks except there is no masking_option_keys
and use using_origin_from
.
- description: ;
- samples:
- query: ;
- **sample_from_other_exp: **;
python s2search_score_pdp.py [pdp_experiment_name]
E.g
python s2search_score_pdp.py pdp-exp1
Install nbformat frist:
pip install nbformat
Then you can:
python plotting_nb_gen.py pdp-exp1
It will know when to generate the notebook for normal task and when is for pdp task.
Instance system require: centos8:
sudo yum -y wget git gcc gcc-c++ unzip
Conda env:
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh -O ~/miniconda.sh
bash ~/miniconda.sh -b -p $HOME/miniconda
conda init
source ~/miniconda3/bin/activate
conda create -y --name s2search397 python==3.9.7
Clone the project
git clone ...
Python packages:
pip install -r requirements.txt
pip install awscli
aws s3 cp --no-sign-request s3://ai2-s2-research-public/s2search_data.zip .
Then unzip the 16G model files to where is should be.