Deletion optimisation #436
Conversation
Ran tensor search unit tests: passed (apart from one randomly failing test)
…o pandu/deletion_optimisation
This is more than just changing how we perform delete operations in OpenSearch; it's also an architectural change.
No need for me to approve, but I have left some feedback based on issues I would consider addressing.
```python
# -- Marqo-OS-specific deletion implementation: --


def delete_documents_marqo_os(config: Config, deletion_instruction: MqDeleteDocsRequest) -> MqDeleteDocsResponse:
```
Be careful with unbounded size on `deletion_instruction.document_ids`. There are limits to the size of HTTP requests. This is usually not a problem, but if you are sending millions of operations this can add up quickly.
Some other feedback:
- Consider chunking requests into, say, 10,000 operations and then sending these requests in parallel
- You should (see the sketch after this list):
  - potentially attempt retries, assuming the request is retryable based on the response HTTP code (e.g. 429)
  - manage back-offs
  - send multiple requests in parallel for higher throughput
- See here (look up the "BulkAllObservable helper" heading) for inspiration
- Inspect the bulk response(s) to determine whether they have been successful
- Consider disabling/changing the refresh interval around your bulk request (ensure it is correctly set back!)
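A minimal sketch of the chunk-retry-parallelise pattern described above, assuming a `requests`-style client; names such as `send_bulk_delete`, `delete_in_chunks`, and the endpoint URL are illustrative, not part of the Marqo codebase:

```python
import json
import time
from concurrent.futures import ThreadPoolExecutor
from typing import List

import requests

CHUNK_SIZE = 10_000                   # operations per bulk request
RETRYABLE_STATUSES = {429, 502, 503}  # codes worth retrying
MAX_RETRIES = 3


def send_bulk_delete(endpoint: str, index_name: str, doc_ids: List[str]) -> dict:
    """Send one chunk of delete actions to the _bulk endpoint, backing off on retryable codes."""
    # _bulk expects newline-delimited JSON: one action line per operation
    body = "".join(
        json.dumps({"delete": {"_index": index_name, "_id": doc_id}}) + "\n"
        for doc_id in doc_ids
    )
    for attempt in range(MAX_RETRIES + 1):
        response = requests.post(
            f"{endpoint}/_bulk",
            data=body,
            headers={"Content-Type": "application/x-ndjson"},
        )
        if response.status_code not in RETRYABLE_STATUSES:
            response.raise_for_status()
            return response.json()
        time.sleep(2 ** attempt)  # back-off: 1s, 2s, 4s, ...
    raise RuntimeError(f"Bulk delete still failing after {MAX_RETRIES} retries")


def delete_in_chunks(endpoint: str, index_name: str, doc_ids: List[str]) -> List[dict]:
    """Split doc_ids into bounded chunks and send them in parallel."""
    chunks = [doc_ids[i:i + CHUNK_SIZE] for i in range(0, len(doc_ids), CHUNK_SIZE)]
    with ThreadPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(
            lambda chunk: send_bulk_delete(endpoint, index_name, chunk), chunks
        ))
    # callers should still inspect each response's "errors"/"items" fields
    return results
```

Each bulk response should then be checked for partial failures, and the refresh interval can be relaxed around the whole operation and restored afterwards.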
# Conflicts:
#	src/marqo/tensor_search/configs.py
#	src/marqo/tensor_search/enums.py
#	src/marqo/tensor_search/tensor_search.py
#	tests/tensor_search/test_validation.py
unit tests (post merging mainline): https://github.com/marqo-ai/marqo/actions/runs/4739643546
```python
def delete_documents(config: Config, index_name: str, doc_ids: List[str], auto_refresh):
    """Delete documents from the Marqo index with the given doc_ids """
    return delete_docs.delete_documents(
```
When we go about an entire reformatting of tensor_search/, it may be worth having format_delete_docs_response in this function instead. That way, operation-specific files deal in Pydantic objects, and the tensor_search.py operations are responsible for mapping args -> RequestObject, then ResponseObject -> Dict.
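A rough sketch of that layering (imports omitted; `Config`, `MqDeleteDocsRequest`, and `MqDeleteDocsResponse` are the types from this PR, and the exact signatures are illustrative):

```python
# tensor_search.py: plain args in, plain dict out
def delete_documents(config: Config, index_name: str, doc_ids: List[str], auto_refresh: bool) -> dict:
    request = MqDeleteDocsRequest(
        index_name=index_name, document_ids=doc_ids, auto_refresh=auto_refresh
    )
    response: MqDeleteDocsResponse = delete_docs.delete_documents(
        config=config, deletion_instruction=request
    )
    # ResponseObject -> Dict conversion lives at this boundary
    return format_delete_docs_response(response)


# delete_docs.py: operation-specific module deals only in Pydantic objects
def delete_documents(config: Config, deletion_instruction: MqDeleteDocsRequest) -> MqDeleteDocsResponse:
    ...
```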
Or have the decoding within the pydantic model (i.e. controlled via the `.json` method).
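A sketch of that alternative, assuming MqDeleteDocsResponse becomes a Pydantic BaseModel; the fields shown are illustrative, not the exact ones in this PR:

```python
from pydantic import BaseModel


class MqDeleteDocsResponse(BaseModel):
    index_name: str
    status_string: str
    deleted_documents_count: int

    def json(self, **kwargs) -> str:
        # Pydantic's built-in .json() already serialises the model; aliases or
        # custom encoders can be configured here so callers never need a
        # separate format_delete_docs_response step.
        return super().json(by_alias=True, **kwargs)
```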
```diff
@@ -598,3 +599,39 @@ def validate_score_modifiers_object(score_modifiers: List[dict]):
             f"Please revise your score_modifiers based on the provided error."
             f"\n Check `https://docs.marqo.ai/0.0.17/API-Reference/search/#score-modifiers` for more info."
         )
+
+
+def validate_delete_docs_request(delete_request: MqDeleteDocsRequest, max_delete_docs_count: int):
```
I only now realise that MqDeleteDocsRequest is a NamedTuple, not a Pydantic BaseModel. I think we'd find that if we use a BaseModel, most of this is automatically validated at init time.
And I think you can do both custom and extended validation for something like `if (len(delete_request.document_ids) > max_delete_docs_count) and max_delete_docs_count is not None:`
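A sketch of what that could look like as a Pydantic validator (field names follow the PR; the limit value is a hypothetical stand-in for the real config). Note that the `None` check must run before the length comparison, otherwise comparing against `None` raises a `TypeError`:

```python
from typing import List, Optional

from pydantic import BaseModel, validator

MAX_DELETE_DOCS_COUNT: Optional[int] = 10_000  # hypothetical limit source


class MqDeleteDocsRequest(BaseModel):
    index_name: str
    document_ids: List[str]
    auto_refresh: bool = False

    @validator("document_ids")
    def check_batch_size(cls, document_ids):
        # init-time validation: runs automatically when the model is constructed
        if MAX_DELETE_DOCS_COUNT is not None and len(document_ids) > MAX_DELETE_DOCS_COUNT:
            raise ValueError(
                f"Can't delete more than {MAX_DELETE_DOCS_COUNT} documents per request"
            )
        return document_ids
```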
```python
        assert run()

    def test_read_env_vars_and_defaults_ints_invalid_values(self):
```
(A comment for future improvements.) There is a common pattern of using a mocked object across a range of parameters. It's something like a `@pytest.mark.parametrize` combined with a `@mock.patch("foo", return_value="bar")`. It could be worth figuring out how to do so. I think it'd make a lot of our tests (a) more readable and (b) make it easier to see which values fail which tests.
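A sketch of that pattern; `read_env_var` and `parse_int_env_var` are hypothetical stand-ins for the real functions under test:

```python
from unittest import mock

import pytest


def read_env_var(name: str) -> str:
    raise NotImplementedError  # patched out in the test


def parse_int_env_var(name: str) -> int:
    # hypothetical function under test: reads an env var and parses it as an int
    return int(read_env_var(name))


@pytest.mark.parametrize("patched_value", ["not-an-int", "", "1.5"])
@mock.patch(f"{__name__}.read_env_var")
def test_parse_int_env_var_rejects_bad_values(mock_read, patched_value):
    # the mock is injected as the first argument; parametrize binds by name
    mock_read.return_value = patched_value
    with pytest.raises(ValueError):
        parse_int_env_var("MARQO_MAX_DELETE_DOCS")
```

Each parameter case then shows up as its own entry in the test report, so it is immediately clear which value failed.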
# Conflicts:
#	src/marqo/tensor_search/enums.py
Merged mainline back in.
What kind of change does this PR introduce? (Bug fix, feature, docs update, ...)
Optimisation
What is the current behavior? (You can also link to an open issue here)
The delete documents by ID endpoint uses Marqo-os' delete-by-query. After many calls, this can result in 5xx responses being returned for deletion requests.
What is the new behavior (if this is a feature change)?
The delete documents by ID endpoint now sends Marqo-os bulk delete instructions.
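For context, a sketch of the difference in request shape (the index name and document IDs are illustrative, and the exact query the old path used is assumed to be an IDs query):

```python
import json

doc_ids = ["doc1", "doc2"]  # illustrative IDs

# Old behaviour: a single delete-by-query body (POST /<index>/_delete_by_query)
delete_by_query_body = {"query": {"ids": {"values": doc_ids}}}

# New behaviour: newline-delimited bulk delete actions (POST /_bulk)
bulk_body = "".join(
    json.dumps({"delete": {"_index": "my-index", "_id": doc_id}}) + "\n"
    for doc_id in doc_ids
)
```

Bulk deletes are acknowledged per document and skip the search phase that delete-by-query runs on every call.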
Does this PR introduce a breaking change? (What changes might users need to make in their application due to this PR?)
No
Have unit tests been run against this PR? (Has there also been any additional testing?)
In progress: https://github.com/marqo-ai/marqo/actions/runs/4727168886
Other information:
To dos: