
No module named 'cdqa.reader.reader_sklearn' #237

Closed
catch-n-release opened this issue Aug 16, 2019 · 8 comments

@catch-n-release
Contributor

When trying to use an XLNet model in place of BERT, I get the following error: No module named 'cdqa.reader.reader_sklearn'.
Could someone please help me with this?

cdqa_pipeline = QAPipeline(reader='./models/xlnet_cased_vCPU.joblib', max_df=1.0)

@fmikaelian
Collaborator

Hi @SuyashSrivastavaDel

You are facing this error because the XLNet implementation for cdQA is not ready yet. It is being developed in the sync-huggingface branch. You can follow our progress on this PR.

You can still use cdQA with BERT in the meantime.

@catch-n-release
Contributor Author

Hey, thanks for replying @fmikaelian. Could you help me with a few other questions?

1. How many Q&A pairs per paragraph (custom dataset) would I need to train on top of the pre-trained BERT models?
2. Would the hyperparameters change when retraining the mentioned model?
3. Also, if I want to train on the HotpotQA dataset, would you recommend training the model from scratch or training on the given pre-trained BERT model?

@andrelmfarias
Collaborator

andrelmfarias commented Aug 19, 2019

Hi @SuyashSrivastavaDel,

I will give my answers based on my own knowledge and opinion. @fmikaelian, feel free to correct me or add anything.

  1. We don't have a definitive number for this, as in our experiments we only trained the model on SQuAD 1.1. SQuAD 1.1 has 100k QA pairs and we were able to train the model on it, so I would say you will need at least somewhere around 1k - 10k pairs in total. If your dataset is as small as 1k, I would recommend training on SQuAD first (as per our tutorials) and then doing a second round of training on your data.
    Regarding the quantity per paragraph, I would say 4+ questions if the paragraph is long and 2 or 3 if the paragraph is short.

  2. It depends on the hyperparameter. You can fine-tune some training hyperparameters, such as the learning rate, number of epochs, and batch size, as well as some conditions on the data, such as the maximum sequence length for the paragraph, the question, and the answer (see the sketch after this list). The overall structure of the model won't change, though; for example, you cannot tune the number of layers.

  3. I would recommend using the pre-trained model.
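
For illustration, here is a minimal sketch of point 2, assuming cdqa's BertQA exposes the usual run_squad.py-style training arguments (the exact parameter names may differ in your installed version, so check before relying on them):

# Minimal sketch for point 2 above. Parameter names are assumed to
# mirror the run_squad.py-style arguments exposed by cdqa's BertQA.
from cdqa.reader import BertQA

reader = BertQA(
    learning_rate=3e-5,     # optimizer learning rate
    num_train_epochs=2,     # passes over the fine-tuning data
    train_batch_size=12,    # examples per training step
    max_seq_length=384,     # max tokens for question + paragraph
    max_query_length=64,    # max tokens allowed for the question
    max_answer_length=30,   # max tokens for a predicted answer span
)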

@catch-n-release
Contributor Author

Hey @andrelmfarias, thanks for replying. Could you help me a little further with this?

  1. How big a document corpus can the TF-IDF retriever handle? A rough estimate would do.
  2. Is there a way to get the top 5 or, say, the top 10 closest answers along with the predicted answer?
    If not, what do you suggest I do to get them?

@andrelmfarias
Collaborator

andrelmfarias commented Aug 20, 2019

  1. We did not try to test the limits of the TF-IDF retriever, but I suppose it can handle very large corpora. We are using the tf-idf vectorizer from sklearn, and in my experience it can handle large corpora. If you ever push its limits, could you let us know the results?

  2. Currently, it's not possible to do this in cdqa directly... You would have to modify the function write_predictions:

    def write_predictions(all_examples, all_features, all_results, n_best_size,

by returning final_predictions_sorted:

final_predictions_sorted = collections.OrderedDict(sorted(final_predictions.items(),

You also would have to modify the method .predict() of BertQA:

scores_diff_json, best_logit = write_predictions(

If you are ever interested in implementing it as an option when using BertQA for predictions, feel free to open a PR 😃
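
To make the idea concrete, here is a rough, simplified sketch of the suggested change. It keeps only the sorting/return step under discussion; the real write_predictions in cdqa takes many more arguments and builds final_predictions internally:

import collections

def write_predictions_n_best(final_predictions, n_best_size):
    # Hypothetical simplification: `final_predictions` maps answer text
    # to a score, as built inside cdqa's real write_predictions.
    final_predictions_sorted = collections.OrderedDict(
        sorted(final_predictions.items(), key=lambda kv: kv[1], reverse=True)
    )
    # Return the n best answers instead of only the single best one.
    return collections.OrderedDict(
        list(final_predictions_sorted.items())[:n_best_size]
    )

# Example: keep the two highest-scoring candidate answers.
best = write_predictions_n_best({"Paris": 8.1, "Lyon": 3.2, "in France": 5.7}, 2)
print(list(best))  # ['Paris', 'in France']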

@catch-n-release
Contributor Author

@andrelmfarias Thanks for replying, I am working on it. I'll come back to you with more questions ASAP.

@JimAva

JimAva commented Aug 31, 2019

Hi @SuyashSrivastavaDel and @andrelmfarias - any update on the multi-result feature?

Thank you.

@catch-n-release
Copy link
Contributor Author

Hey @JimAva, my fork of the repo has a branch named feature/n_best_predictions.
On that branch, the final cdqa_pipeline.predict(X=question) call returns the final prediction together with an ordered dictionary of the top n predictions:
prediction, n_best_predictions = cdqa_pipeline.predict(X=question)
where n_best_predictions = {answer1: [title1, paragraph1], answer2: [title2, paragraph2], ...},
ranked from the most relevant answer to the least,
with n equal to the n_best_size parameter that is passed in.
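
For reference, a minimal usage sketch assuming the return format of that branch as described above (mainline cdqa's predict() returns a single prediction instead):

# Usage sketch for the feature/n_best_predictions branch described above;
# assumes cdqa_pipeline and question are set up as earlier in this thread.
prediction, n_best_predictions = cdqa_pipeline.predict(X=question)

print("Best answer:", prediction)
# Candidates are ordered from most to least relevant.
for answer, (title, paragraph) in n_best_predictions.items():
    print(answer, "-", title)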

Hey @andrelmfarias, I am raising a PR for this as you suggested. :):)
Please check and let me know if any corrections are needed.
