
No module named 'cdqa.reader.reader_sklearn' #237

Closed
catch-n-release opened this issue Aug 16, 2019 · 8 comments

@catch-n-release
Contributor

When trying to use an XLNet model in place of BERT, I get the following error: No module named 'cdqa.reader.reader_sklearn'.
Could someone please help me with this?

cdqa_pipeline = QAPipeline(reader='./models/xlnet_cased_vCPU.joblib', max_df=1.0)

@fmikaelian
Collaborator

Hi @SuyashSrivastavaDel

You are facing this error because the XLNet implementation for cdQA is not ready yet. It is being developed in the sync-huggingface branch. You can follow our progress on this PR.

You can still use cdQA with BERT in the meantime.

@catch-n-release
Contributor Author

Hey, thanks for replying @fmikaelian. Could you help me with a few other questions?

1. How many Q&A pairs per paragraph (custom dataset) would I need to train on top of the pre-trained BERT models?
2. Would the hyperparameters change when retraining the mentioned model?
3. Also, if I want to train on the HotpotQA dataset, would you recommend training the model from scratch or training on the given pre-trained BERT model?

@andrelmfarias
Collaborator

andrelmfarias commented Aug 19, 2019

Hi @SuyashSrivastavaDel,

I will give my answers based on my own knowledge and opinion. @fmikaelian, feel free to correct me or add anything.

  1. We don't have a definitive number for this, as in our experiments we only trained the model on SQuAD 1.1. SQuAD 1.1 has 100k QA pairs and we were able to train the model on it, so I would say you will need at least somewhere around 1k - 10k pairs in total. If your dataset is as small as 1k, I would recommend training on SQuAD first (as per our tutorials) and then doing a second round of training on your data.
    Regarding the quantity per paragraph, I would say 4+ questions if the paragraph is long and 2 or 3 if the paragraph is short.

  2. It depends on the hyperparameter. You can fine-tune some training hyperparameters, such as the learning rate, number of epochs, and batch size, as well as some conditions on the data, such as the maximum sequence length for the paragraph, the question, and the answer (see the sketch after this list). The overall structure of the model won't change, though; for example, you cannot tune the number of layers.

  3. I would recommend using the pre-trained model.
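
For illustration, here is a minimal sketch of point 2, assuming cdqa's BertQA exposes the usual run_squad.py-style training arguments (the exact parameter names may differ in your installed version, so check before relying on them):

# Minimal sketch for point 2 above. Parameter names are assumed to
# mirror the run_squad.py-style arguments exposed by cdqa's BertQA.
from cdqa.reader import BertQA

reader = BertQA(
    learning_rate=3e-5,     # optimizer learning rate
    num_train_epochs=2,     # passes over the fine-tuning data
    train_batch_size=12,    # examples per training step
    max_seq_length=384,     # max tokens for question + paragraph
    max_query_length=64,    # max tokens allowed for the question
    max_answer_length=30,   # max tokens for a predicted answer span
)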

@catch-n-release
Contributor Author

Hey @andrelmfarias, thanks for replying. Could you help me a little further with this?

  1. How big a document corpus can the TF-IDF retriever handle? A rough estimate would do.
  2. Is there a way to get the top 5 or, say, the top 10 closest answers along with the predicted answer?
    If not, what do you suggest I do to get them?

@andrelmfarias
Collaborator

andrelmfarias commented Aug 20, 2019

  1. We did not try to test the limits of the TF-IDF retriever, but I suppose it can handle very large corpora. We are using the tf-idf vectorizer from sklearn, and in my experience it can handle large corpora. If you ever push its limits, could you let us know the results?

  2. Currently, it's not possible to do this in cdqa directly... You would have to modify the function write_predictions:

    def write_predictions(all_examples, all_features, all_results, n_best_size,

by returning final_predictions_sorted:

final_predictions_sorted = collections.OrderedDict(sorted(final_predictions.items(),

You also would have to modify the method .predict() of BertQA:

scores_diff_json, best_logit = write_predictions(

If you are ever interested in implementing it as an option when using BertQA for predictions, feel free to open a PR 😃
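
To make the idea concrete, here is a rough, simplified sketch of the suggested change. It keeps only the sorting/return step under discussion; the real write_predictions in cdqa takes many more arguments and builds final_predictions internally:

import collections

def write_predictions_n_best(final_predictions, n_best_size):
    # Hypothetical simplification: `final_predictions` maps answer text
    # to a score, as built inside cdqa's real write_predictions.
    final_predictions_sorted = collections.OrderedDict(
        sorted(final_predictions.items(), key=lambda kv: kv[1], reverse=True)
    )
    # Return the n best answers instead of only the single best one.
    return collections.OrderedDict(
        list(final_predictions_sorted.items())[:n_best_size]
    )

# Example: keep the two highest-scoring candidate answers.
best = write_predictions_n_best({"Paris": 8.1, "Lyon": 3.2, "in France": 5.7}, 2)
print(list(best))  # ['Paris', 'in France']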

@catch-n-release
Contributor Author

@andrelmfarias Thanks for replying, I am working on it. I'll come back to you with more questions ASAP.

@JimAva

JimAva commented Aug 31, 2019

Hi @SuyashSrivastavaDel and @andrelmfarias - any update on the multi-result feature?

Thank you.

@catch-n-release
Copy link
Contributor Author

Hey @JimAva, my fork of the repo has a branch named feature/n_best_predictions.
On that branch, the final cdqa_pipeline.predict(X=question) call returns the final prediction together with an ordered dictionary of the top n predictions:
prediction, n_best_predictions = cdqa_pipeline.predict(X=question)
where n_best_predictions = {answer1: [title1, paragraph1], answer2: [title2, paragraph2], ...},
ranked from the most relevant answer to the least,
with n equal to the n_best_size parameter that is passed in.
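
For reference, a minimal usage sketch assuming the return format of that branch as described above (mainline cdqa's predict() returns a single prediction instead):

# Usage sketch for the feature/n_best_predictions branch described above;
# assumes cdqa_pipeline and question are set up as earlier in this thread.
prediction, n_best_predictions = cdqa_pipeline.predict(X=question)

print("Best answer:", prediction)
# Candidates are ordered from most to least relevant.
for answer, (title, paragraph) in n_best_predictions.items():
    print(answer, "-", title)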

Hey @andrelmfarias, I am raising a PR for this as you suggested. :):)
Please check and let me know if any corrections are needed.
