Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Deletion optimisation #436

Merged
merged 19 commits into from
Apr 20, 2023
Merged

Deletion optimisation #436

merged 19 commits into from
Apr 20, 2023

Conversation

pandu-k
Copy link
Collaborator

@pandu-k pandu-k commented Apr 14, 2023

  • What kind of change does this PR introduce? (Bug fix, feature, docs update, ...)
    Optimisation

  • What is the current behavior? (You can also link to an open issue here)
    The delete documents by ID endpoint uses Marqo-os' delete by query. This can result in 5xxs being returned for deletion calls, after many calls.

  • What is the new behavior (if this is a feature change)?
    The delete documents by ID endpoint now gives Marqo-os' endpoint bulk delete instructions

  • Does this PR introduce a breaking change? (What changes might users need to make in their application due to this PR?)
    No

  • Have unit tests been run against this PR? (Has there also been any additional testing?)
    In progress: https://github.com/marqo-ai/marqo/actions/runs/4727168886

  • Other information:
    To dos:

  • Large scale test
  • Please check if the PR fulfills these requirements
  • The commit message follows our guidelines
  • Tests for the changes have been added (for bug fixes/features)
  • Docs have been added / updated (for bug fixes / features)

@pandu-k pandu-k changed the title Pandu/deletion optimisation Deletion optimisation Apr 17, 2023
@pandu-k pandu-k marked this pull request as ready for review April 17, 2023 05:58
@pandu-k pandu-k requested a review from codebrain April 17, 2023 11:00
@pandu-k pandu-k temporarily deployed to marqo-test-suite April 18, 2023 00:23 — with GitHub Actions Inactive
@Jeadie
Copy link
Contributor

Jeadie commented Apr 18, 2023

This is more than just changing how, in Opensearch, we perform delete operations. It's also an architectural change.

Copy link

@codebrain codebrain left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need for me to approve, but I have left some feedback based on issues I would consider addressing.

# -- Marqo-OS-specific deletion implementation: --


def delete_documents_marqo_os(config: Config, deletion_instruction: MqDeleteDocsRequest) -> MqDeleteDocsResponse:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Be careful with unbounded size on deletion_instruction.document_ids. There are limits to the size of HTTP requests. This is usually not a problem, but if you are sending millions of operations this can add up quickly.

Some other feedback.

  • Consider chunking requests into, say, 10,000 operations and then sending these requests in parallel
  • You should:
    • potentially attempt retries, assuming request is retryable based on response HTTP code (e.g. 429)
    • manage back-offs
    • send multiple requests in parallel for higher throughput
  • See here lookup "BulkAllObservable helper" heading for inspiration
  • Inspect the bulk response(s) to determine they have been successful
  • Consider disabling/changing the refresh interval around your bulk request (ensure it is correctly set back!)

@pandu-k
Copy link
Collaborator Author

pandu-k commented Apr 19, 2023

# Conflicts:
#	src/marqo/tensor_search/configs.py
#	src/marqo/tensor_search/enums.py
#	src/marqo/tensor_search/tensor_search.py
#	tests/tensor_search/test_validation.py
@pandu-k
Copy link
Collaborator Author

pandu-k commented Apr 19, 2023

unit tests (post merging mainline): https://github.com/marqo-ai/marqo/actions/runs/4739643546

@pandu-k pandu-k temporarily deployed to marqo-test-suite April 19, 2023 04:51 — with GitHub Actions Inactive

def delete_documents(config: Config, index_name: str, doc_ids: List[str], auto_refresh):
"""Delete documents from the Marqo index with the given doc_ids """
return delete_docs.delete_documents(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When we go about an entire reformatting of tensor_search/, it may be worth having format_delete_docs_response in this function instead. Therefore operation specific files deal in Pydantic objects, and the tensor_search.py operations are responsibly for mapping args -> RequestObject, then ResponseObject -> Dict

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or have the decoding within the pydantic model (i.e. controlled via the .json method).

@@ -598,3 +599,39 @@ def validate_score_modifiers_object(score_modifiers: List[dict]):
f"Please revise your score_modifiers based on the provided error."
f"\n Check `https://docs.marqo.ai/0.0.17/API-Reference/search/#score-modifiers` for more info."
)


def validate_delete_docs_request(delete_request: MqDeleteDocsRequest, max_delete_docs_count: int):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I only now realise that MqDeleteDocsRequest is a NamedTuple, not a Pydantic BaseModel. I think we'd find that if we use a BaseModel, most of this is automatically validated at init time.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And I think you can do both custom, and extended validation for something like if (len(delete_request.document_ids) > max_delete_docs_count) and max_delete_docs_count is not None:


assert run()

def test_read_env_vars_and_defaults_ints_invalid_values(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(A comment for future improvements). There is a common pattern of using a mocked object across a range of parameters. It's something like a @pytest.mark.parameterized with a @mock.patch("foo", return_value="bar). It could be worth figuring out how to do so. I think it'd make alot of our tests, a) more readable, b) easier to see for which values, what tests fail.

@pandu-k
Copy link
Collaborator Author

pandu-k commented Apr 20, 2023

@pandu-k pandu-k temporarily deployed to marqo-test-suite April 20, 2023 01:38 — with GitHub Actions Inactive
@pandu-k pandu-k merged commit 7d571c2 into mainline Apr 20, 2023
@pandu-k pandu-k deleted the pandu/deletion_optimisation branch April 20, 2023 03:07
@pandu-k pandu-k temporarily deployed to marqo-test-suite April 20, 2023 03:08 — with GitHub Actions Inactive
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants