-
Notifications
You must be signed in to change notification settings - Fork 25k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ML] adds new question_answering NLP task for extracting answers to questions from a document #85958
[ML] adds new question_answering NLP task for extracting answers to questions from a document #85958
Conversation
…uestions from a document
Pinging @elastic/ml-core (Team:ML) |
Hi @benwtrent, I've created a changelog YAML for you. |
…om:benwtrent/elasticsearch into feature/ml-add-question-answering-nlp-task
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nits, nothing more.
LGTM I added the cloud-deploy
label so I can test it out
...ava/org/elasticsearch/xpack/core/ml/inference/results/QuestionAnsweringInferenceResults.java
Outdated
Show resolved
Hide resolved
...ain/java/org/elasticsearch/xpack/core/ml/inference/trainedmodel/QuestionAnsweringConfig.java
Outdated
Show resolved
Hide resolved
...ain/java/org/elasticsearch/xpack/core/ml/inference/trainedmodel/QuestionAnsweringConfig.java
Outdated
Show resolved
Hide resolved
...ain/java/org/elasticsearch/xpack/core/ml/inference/trainedmodel/QuestionAnsweringConfig.java
Outdated
Show resolved
Hide resolved
...ain/java/org/elasticsearch/xpack/core/ml/inference/trainedmodel/QuestionAnsweringConfig.java
Outdated
Show resolved
Hide resolved
...in/ml/src/main/java/org/elasticsearch/xpack/ml/inference/nlp/QuestionAnsweringProcessor.java
Outdated
Show resolved
Hide resolved
@@ -98,7 +98,8 @@ public List<TokenizationResult.Tokens> tokenize(String seq, Tokenization.Truncat | |||
); | |||
// Make sure we do not end on a word | |||
if (splitEndPos != tokenIds.size()) { | |||
while (Objects.equals(tokenPositionMap.get(splitEndPos), tokenPositionMap.get(splitEndPos - 1))) { | |||
while (splitEndPos > splitStartPos + 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure what is happening here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to verify that the end is actually after the start. If we don't we could get into a death spiral where the splitting never moves forward (think adversarial case of very small sequences with overlaps that are almost the same as the sequence size).
for (int j = 0; j < maxLength; j++) { | ||
builder.value(0); | ||
// Just a single sequence within this tokenization | ||
if (inputTokens.seqPairOffset <= 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good fix
...lugin/ml/src/main/java/org/elasticsearch/xpack/ml/inference/nlp/tokenizers/NlpTokenizer.java
Outdated
Show resolved
Hide resolved
* @param sequenceId Unique sequence id for this tokenization | ||
* @return tokenization result for the sequence pair | ||
*/ | ||
public List<TokenizationResult.Tokens> tokenize(String seq1, String seq2, Tokenization.Truncate truncate, int span, int sequenceId) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks very similar to
tokenize( String seq1, InnerTokenization innerResultSeq1, String seq2, Tokenization.Truncate truncate, int sequenceId )
but with the spanning logic added. Is there an opportunity to refactor here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe, but the previous one shouldn't do spanning at all. Mainly because its an optimization to allow a single tokenization to be used multiple times. We just have no way of representing a spanned batched encoding (4d tensor?).
I think some refactoring could be done as a whole, but it would take more churn on the rest of the tokenization code.
Co-authored-by: David Kyle <david.kyle@elastic.co>
@elasticmachine update branch |
@elasticmachine update branch |
@elasticmachine update branch |
Adds support for `question_answering` NLP models within the pytorch model uploader. Related: elastic/elasticsearch#85958
Adds support for `question_answering` NLP models within the pytorch model uploader. Related: elastic/elasticsearch#85958 Signed-off-by: Dhrubo Saha <dhrubo@amazon.com>
Adds support for `question_answering` NLP models within the pytorch model uploader. Related: elastic/elasticsearch#85958 Signed-off-by: Dhrubo Saha <dhrubo@amazon.com>
* [ML] Improve NLP model import by using nicely defined types (#459) This adds some more definite types for our NLP tasks and tokenization configurations. This is the first step in allowing users to more easily import their own transformer models via something other than hugging face. Signed-off-by: Dhrubo Saha <dhrubo@amazon.com> * [ML] add support for question_answering NLP tasks (#457) Adds support for `question_answering` NLP models within the pytorch model uploader. Related: elastic/elasticsearch#85958 Signed-off-by: Dhrubo Saha <dhrubo@amazon.com> * [ML] improve general pytorch model import and add tests (#463) This improves the user consumed functions and classes for PyTorch NLP model upload to Elasticsearch. Previously it was difficult to wrap your own module for uploading to Elasticsearch. This commit splits some classes out, adds new ones, and adds tests showing how to wrap some simple modules. Signed-off-by: Dhrubo Saha <dhrubo@amazon.com> * Release 8.2.0 Signed-off-by: Dhrubo Saha <dhrubo@amazon.com> * [ML] fixes decision tree classifier upload to account for probabilities (#465) This switches our sklearn.DecisionTreeClassifier serialization logic to account for multi-valued leaves in the tree. The key difference between our inference and DecisionTreeClassifier, is that we run a softMax over the leaf where sklearn simply normalizes the results. This means that our "probabilities" returned will be different than sklearn. Signed-off-by: Dhrubo Saha <dhrubo@amazon.com> * Add authentication methods for import model script (#466) Signed-off-by: Dhrubo Saha <dhrubo@amazon.com> * Ignore type checking for `agg_value` Signed-off-by: Dhrubo Saha <dhrubo@amazon.com> * [DOCS] Adds question_answering task type for eland_import_hub_model Signed-off-by: Dhrubo Saha <dhrubo@amazon.com> * Stop explicitly pulling master Signed-off-by: Dhrubo Saha <dhrubo@amazon.com> * Remove 'numpydoc' to stop reformatting Signed-off-by: Dhrubo Saha <dhrubo@amazon.com> * Also pin traitlets Signed-off-by: Dhrubo Saha <dhrubo@amazon.com> * [DOCS] Include missing attributes (#468) Co-authored-by: Seth Michael Larson <seth.larson@elastic.co> Signed-off-by: Dhrubo Saha <dhrubo@amazon.com> * [ML] ensure quantization is applied (#472) Signed-off-by: Dhrubo Saha <dhrubo@amazon.com> * Freeze the traced PyTorch model Signed-off-by: Dhrubo Saha <dhrubo@amazon.com> * Bump minimum PyTorch version to 1.11 Signed-off-by: Dhrubo Saha <dhrubo@amazon.com> * [ML] adds new auto task type that attempts to automatically determine NLP task type from model config (#475) For many model types, we don't need to require the task requested. We can infer the task type based on the model configuration and architecture. This commit makes the `task-type` parameter optional for the model up load script and adds logic for auto-detecting the task type based on the 🤗 model. Signed-off-by: Dhrubo Saha <dhrubo@amazon.com> * added opensearch as dependency Signed-off-by: Dhrubo Saha <dhrubo@amazon.com> * replaced core mentions of elasticsearch client w opensearch Signed-off-by: Dhrubo Saha <dhrubo@amazon.com> * changed index names for testing Signed-off-by: Dhrubo Saha <dhrubo@amazon.com> * modified test dataframes to accommodate opensearch indexing Signed-off-by: Dhrubo Saha <dhrubo@amazon.com> * fixed aggregatable field name tests Signed-off-by: Dhrubo Saha <dhrubo@amazon.com> * fixing pytests that mention indices of ed/pd dataframes Signed-off-by: Dhrubo Saha <dhrubo@amazon.com> * fixed equality boolean filter to accommodate terminology difference in elastic vs open search Signed-off-by: Dhrubo Saha <dhrubo@amazon.com> * fixed pytests with indexing issues, geolocation field renaming issues Signed-off-by: Dhrubo Saha <dhrubo@amazon.com> * modified test setup code to work for opensearch Signed-off-by: Dhrubo Saha <dhrubo@amazon.com> * reverted many erroneous "fixes" to tests Signed-off-by: Dhrubo Saha <dhrubo@amazon.com> * fixed opensearch integration so remaining non-ml tests run Signed-off-by: Dhrubo Saha <dhrubo@amazon.com> * added initial connection to predicting with sagemaker Signed-off-by: Dhrubo Saha <dhrubo@amazon.com> * added sagemaker predict api Signed-off-by: Dhrubo Saha <dhrubo@amazon.com> * added band-aid to fix iterating over rows Signed-off-by: Dhrubo Saha <dhrubo@amazon.com> * debugging indexing issue Signed-off-by: Dhrubo Saha <dhrubo@amazon.com> * reverted indexing change for sagemaker predict Signed-off-by: Dhrubo Saha <dhrubo@amazon.com> * added deprecation warnings to ml module Signed-off-by: Dhrubo Saha <dhrubo@amazon.com> * refactoring elasticsearch names to opensearch Signed-off-by: Dhrubo Saha <dhrubo@amazon.com> * continued renaming opensearch variables Signed-off-by: Dhrubo Saha <dhrubo@amazon.com> * more renaming changes Signed-off-by: Dhrubo Saha <dhrubo@amazon.com> * first commit for ml common integration Signed-off-by: Dhrubo Saha <dhrubo@amazon.com> * PoC for model upload Signed-off-by: Dhrubo Saha <dhrubo@amazon.com> * renamed model chunk uploading path Signed-off-by: Dhrubo Saha <dhrubo@amazon.com> * added total chunks to model upload Signed-off-by: Dhrubo Saha <dhrubo@amazon.com> * fixed docstring typo Signed-off-by: Dhrubo Saha <dhrubo@amazon.com> * added first iteration of custom model load supprot Signed-off-by: Dhrubo Saha <dhrubo@amazon.com> * removed unsupported features Signed-off-by: Dhrubo Saha <dhrubo@amazon.com> * renaming all instances of elastic in code Signed-off-by: Dhrubo Saha <dhrubo@amazon.com> * created new dev requirements file Signed-off-by: Dhrubo Saha <dhrubo@amazon.com> * typo fix Signed-off-by: Dhrubo Saha <dhrubo@amazon.com> * PR feedback Signed-off-by: Dhrubo Saha <dhrubo@amazon.com> * implement PR feedback Signed-off-by: Dhrubo Saha <dhrubo@amazon.com> * PR feedback Signed-off-by: Dhrubo Saha <dhrubo@amazon.com> * implement pr feedback Signed-off-by: Dhrubo Saha <dhrubo@amazon.com> * Update README.md Signed-off-by: Dhrubo Saha <dhrubo@amazon.com> * added demo materials Signed-off-by: Dhrubo Saha <dhrubo@amazon.com> * refactoring Signed-off-by: Dhrubo Saha <dhrubo@amazon.com> * refactoring code and changed code to address some of the deprection warning Signed-off-by: Dhrubo Saha <dhrubo@amazon.com> * adding header license info Signed-off-by: Dhrubo Saha <dhrubo@amazon.com> * formatted code with black, isort, mypy Signed-off-by: Dhrubo Saha <dhrubo@amazon.com> * updating git ci workflow Signed-off-by: Dhrubo Saha <dhrubo@amazon.com> * refactoring code + adding pytest in the ci workflow Signed-off-by: Dhrubo Saha <dhrubo@amazon.com> * removing test from ci workflow Signed-off-by: Dhrubo Saha <dhrubo@amazon.com> * setup CI for integration test Signed-off-by: Dhrubo Saha <dhrubo@amazon.com> * adding files required for CI Signed-off-by: Dhrubo Saha <dhrubo@amazon.com> * adding files which got deleted during merge Signed-off-by: Dhrubo Saha <dhrubo@amazon.com> * adding deleted files by git merge Signed-off-by: Dhrubo Saha <dhrubo@amazon.com> Signed-off-by: Dhrubo Saha <dhrubo@amazon.com> Co-authored-by: Benjamin Trent <ben.w.trent@gmail.com> Co-authored-by: Seth Michael Larson <seth.larson@elastic.co> Co-authored-by: Lisa Cawley <lcawley@elastic.co> Co-authored-by: Nigel Small <nasmall@pm.me> Co-authored-by: David Kyle <david.kyle@elastic.co> Co-authored-by: Thomas Ma <thomayinma@gmail.com> Co-authored-by: Thomas Ma <31194803+LEFTA98@users.noreply.github.com>
This commit adds a new
question_answering
task.The
question_answering
task allows supplying aquestion
in the inference config update.When storing the model config for inference:
Then when calling
_infer
or running with in a pipeline, add thequestion
you want answered given the context provided by the document textThe response then looks like:
Some models tested: