Deletion optimisation #436
Conversation
Ran tensor search unit tests: passed (apart from one randomly failing test)
…o pandu/deletion_optimisation
This is more than just changing how we perform delete operations in OpenSearch; it's also an architectural change.
No need for me to approve, but I have left some feedback based on issues I would consider addressing.
```python
# -- Marqo-OS-specific deletion implementation: --


def delete_documents_marqo_os(config: Config, deletion_instruction: MqDeleteDocsRequest) -> MqDeleteDocsResponse:
```
Be careful with unbounded size on `deletion_instruction.document_ids`. There are limits to the size of HTTP requests. This is usually not a problem, but if you are sending millions of operations this can add up quickly.
Some other feedback:
- Consider chunking requests into, say, 10,000 operations and then sending these requests in parallel
- You should (see the sketch after this list):
  - potentially attempt retries, assuming the request is retryable based on the response HTTP code (e.g. 429)
  - manage back-offs
  - send multiple requests in parallel for higher throughput
- See here (look up the "BulkAllObservable helper" heading) for inspiration
- Inspect the bulk response(s) to determine whether they have been successful
- Consider disabling/changing the refresh interval around your bulk request (ensure it is correctly set back!)
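A minimal sketch of the chunk-retry-parallelise pattern described above, assuming a `requests`-style client; names such as `send_bulk_delete`, `delete_in_chunks`, and the endpoint URL are illustrative, not part of the Marqo codebase:

```python
import json
import time
from concurrent.futures import ThreadPoolExecutor
from typing import List

import requests

CHUNK_SIZE = 10_000                   # operations per bulk request
RETRYABLE_STATUSES = {429, 502, 503}  # codes worth retrying
MAX_RETRIES = 3


def send_bulk_delete(endpoint: str, index_name: str, doc_ids: List[str]) -> dict:
    """Send one chunk of delete actions to the _bulk endpoint, backing off on retryable codes."""
    # _bulk expects newline-delimited JSON: one action line per operation
    body = "".join(
        json.dumps({"delete": {"_index": index_name, "_id": doc_id}}) + "\n"
        for doc_id in doc_ids
    )
    for attempt in range(MAX_RETRIES + 1):
        response = requests.post(
            f"{endpoint}/_bulk",
            data=body,
            headers={"Content-Type": "application/x-ndjson"},
        )
        if response.status_code not in RETRYABLE_STATUSES:
            response.raise_for_status()
            return response.json()
        time.sleep(2 ** attempt)  # back-off: 1s, 2s, 4s, ...
    raise RuntimeError(f"Bulk delete still failing after {MAX_RETRIES} retries")


def delete_in_chunks(endpoint: str, index_name: str, doc_ids: List[str]) -> List[dict]:
    """Split doc_ids into bounded chunks and send them in parallel."""
    chunks = [doc_ids[i:i + CHUNK_SIZE] for i in range(0, len(doc_ids), CHUNK_SIZE)]
    with ThreadPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(
            lambda chunk: send_bulk_delete(endpoint, index_name, chunk), chunks
        ))
    # callers should still inspect each response's "errors"/"items" fields
    return results
```

Each bulk response should then be checked for partial failures, and the refresh interval can be relaxed around the whole operation and restored afterwards.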
# Conflicts:
#	src/marqo/tensor_search/configs.py
#	src/marqo/tensor_search/enums.py
#	src/marqo/tensor_search/tensor_search.py
#	tests/tensor_search/test_validation.py
unit tests (post merging mainline): https://github.com/marqo-ai/marqo/actions/runs/4739643546
```python
def delete_documents(config: Config, index_name: str, doc_ids: List[str], auto_refresh):
    """Delete documents from the Marqo index with the given doc_ids """
    return delete_docs.delete_documents(
```
When we go about an entire reformatting of tensor_search/, it may be worth having format_delete_docs_response in this function instead. That way, operation-specific files deal in Pydantic objects, and the tensor_search.py operations are responsible for mapping args -> RequestObject, then ResponseObject -> Dict.
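A rough sketch of that layering (imports omitted; `Config`, `MqDeleteDocsRequest`, and `MqDeleteDocsResponse` are the types from this PR, and the exact signatures are illustrative):

```python
# tensor_search.py: plain args in, plain dict out
def delete_documents(config: Config, index_name: str, doc_ids: List[str], auto_refresh: bool) -> dict:
    request = MqDeleteDocsRequest(
        index_name=index_name, document_ids=doc_ids, auto_refresh=auto_refresh
    )
    response: MqDeleteDocsResponse = delete_docs.delete_documents(
        config=config, deletion_instruction=request
    )
    # ResponseObject -> Dict conversion lives at this boundary
    return format_delete_docs_response(response)


# delete_docs.py: operation-specific module deals only in Pydantic objects
def delete_documents(config: Config, deletion_instruction: MqDeleteDocsRequest) -> MqDeleteDocsResponse:
    ...
```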
Or have the decoding within the pydantic model (i.e. controlled via the `.json` method).
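A sketch of that alternative, assuming MqDeleteDocsResponse becomes a Pydantic BaseModel; the fields shown are illustrative, not the exact ones in this PR:

```python
from pydantic import BaseModel


class MqDeleteDocsResponse(BaseModel):
    index_name: str
    status_string: str
    deleted_documents_count: int

    def json(self, **kwargs) -> str:
        # Pydantic's built-in .json() already serialises the model; aliases or
        # custom encoders can be configured here so callers never need a
        # separate format_delete_docs_response step.
        return super().json(by_alias=True, **kwargs)
```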
```diff
@@ -598,3 +599,39 @@ def validate_score_modifiers_object(score_modifiers: List[dict]):
             f"Please revise your score_modifiers based on the provided error."
             f"\n Check `https://docs.marqo.ai/0.0.17/API-Reference/search/#score-modifiers` for more info."
         )
+
+
+def validate_delete_docs_request(delete_request: MqDeleteDocsRequest, max_delete_docs_count: int):
```
I only now realise that MqDeleteDocsRequest is a NamedTuple, not a Pydantic BaseModel. I think we'd find that if we use a BaseModel, most of this is automatically validated at init time.
And I think you can do both custom and extended validation for something like `if (len(delete_request.document_ids) > max_delete_docs_count) and max_delete_docs_count is not None:`
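A sketch of what that could look like as a Pydantic validator (field names follow the PR; the limit value is a hypothetical stand-in for the real config). Note that the `None` check must run before the length comparison, otherwise comparing against `None` raises a `TypeError`:

```python
from typing import List, Optional

from pydantic import BaseModel, validator

MAX_DELETE_DOCS_COUNT: Optional[int] = 10_000  # hypothetical limit source


class MqDeleteDocsRequest(BaseModel):
    index_name: str
    document_ids: List[str]
    auto_refresh: bool = False

    @validator("document_ids")
    def check_batch_size(cls, document_ids):
        # init-time validation: runs automatically when the model is constructed
        if MAX_DELETE_DOCS_COUNT is not None and len(document_ids) > MAX_DELETE_DOCS_COUNT:
            raise ValueError(
                f"Can't delete more than {MAX_DELETE_DOCS_COUNT} documents per request"
            )
        return document_ids
```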
```python
        assert run()

    def test_read_env_vars_and_defaults_ints_invalid_values(self):
```
(A comment for future improvements.) There is a common pattern of using a mocked object across a range of parameters. It's something like a `@pytest.mark.parametrize` combined with a `@mock.patch("foo", return_value="bar")`. It could be worth figuring out how to do so. I think it'd make a lot of our tests (a) more readable and (b) make it easier to see which values fail which tests.
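A sketch of that pattern; `read_env_var` and `parse_int_env_var` are hypothetical stand-ins for the real functions under test:

```python
from unittest import mock

import pytest


def read_env_var(name: str) -> str:
    raise NotImplementedError  # patched out in the test


def parse_int_env_var(name: str) -> int:
    # hypothetical function under test: reads an env var and parses it as an int
    return int(read_env_var(name))


@pytest.mark.parametrize("patched_value", ["not-an-int", "", "1.5"])
@mock.patch(f"{__name__}.read_env_var")
def test_parse_int_env_var_rejects_bad_values(mock_read, patched_value):
    # the mock is injected as the first argument; parametrize binds by name
    mock_read.return_value = patched_value
    with pytest.raises(ValueError):
        parse_int_env_var("MARQO_MAX_DELETE_DOCS")
```

Each parameter case then shows up as its own entry in the test report, so it is immediately clear which value failed.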
# Conflicts:
#	src/marqo/tensor_search/enums.py
Merged mainline back in.
What kind of change does this PR introduce? (Bug fix, feature, docs update, ...)
Optimisation
What is the current behavior? (You can also link to an open issue here)
The delete documents by ID endpoint uses Marqo-os' delete-by-query. After many calls, this can result in 5xx responses being returned for deletion requests.
What is the new behavior (if this is a feature change)?
The delete documents by ID endpoint now sends Marqo-os bulk delete instructions.
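For context, a sketch of the difference in request shape (the index name and document IDs are illustrative, and the exact query the old path used is assumed to be an IDs query):

```python
import json

doc_ids = ["doc1", "doc2"]  # illustrative IDs

# Old behaviour: a single delete-by-query body (POST /<index>/_delete_by_query)
delete_by_query_body = {"query": {"ids": {"values": doc_ids}}}

# New behaviour: newline-delimited bulk delete actions (POST /_bulk)
bulk_body = "".join(
    json.dumps({"delete": {"_index": "my-index", "_id": doc_id}}) + "\n"
    for doc_id in doc_ids
)
```

Bulk deletes are acknowledged per document and skip the search phase that delete-by-query runs on every call.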
Does this PR introduce a breaking change? (What changes might users need to make in their application due to this PR?)
No
Have unit tests been run against this PR? (Has there also been any additional testing?)
In progress: https://github.com/marqo-ai/marqo/actions/runs/4727168886
Other information:
To dos: