[ENHANCEMENT] Configurable index settings for hnsw #206

Closed
jn2clark opened this issue Dec 1, 2022 · 3 comments

@jn2clark
Contributor

jn2clark commented Dec 1, 2022

Is your feature request related to a problem? Please describe.
make the hnsw settings configurable per index
https://github.com/marqo-ai/marqo/blob/mainline/src/marqo/tensor_search/backend.py#L83-L106

Describe the solution you'd like
have (optional) settings in the index_defaults to set m, ef_construction and metric for the index hnsw settings

Describe alternatives you've considered

Additional context
Add any other context or screenshots about the feature request here.

@jn2clark jn2clark added the enhancement New feature or request label Dec 1, 2022
@Jeadie Jeadie self-assigned this Dec 7, 2022
@Jeadie
Contributor

Jeadie commented Dec 8, 2022

Overview

Marqo uses Hierarchical Navigable Small World (HNSW) graphs to perform approximate nearest neighbour search (ANNS). HNSW has two key benefits compared to other ANNS methods, namely high recall and low search latency. HNSW requires several hyperparameters to be selected for both indexing and search. Recall and latency are often in tension: hyperparameters that decrease latency tend to decrease recall as well. Hyperparameter selection is therefore an engineering design choice that can be tailored to the use case.

Currently, these hyperparameters are fixed within Marqo (see code reference, here). This enhancement is to allow HNSW parameters to be configured on an index level.

Proposed Solution

The proposed solution is to extend an index's index_defaults, specified at index creation time, to include ANNS parameters. The current index_defaults, as specified in the documentation, are:

{
    "index_defaults": {
        "treat_urls_and_pointers_as_images": false,
        "model": "hf/all_datasets_v4_MiniLM-L6",
        "normalize_embeddings": true,
        "text_preprocessing": {
            "split_length": 2,
            "split_overlap": 0,
            "split_method": "sentence"
        },
        "image_preprocessing": {
            "patch_method": null
        }
    },
    "number_of_shards": 5
}

This can be augmented as follows (with defaults as specified):

{
    "index_defaults": {
        "treat_urls_and_pointers_as_images": false,
        "model": "hf/all_datasets_v4_MiniLM-L6",
        "normalize_embeddings": true,
        "text_preprocessing": {
            "split_length": 2,
            "split_overlap": 0,
            "split_method": "sentence"
        },
        "image_preprocessing": {
            "patch_method": null
        },
        "ann_parameters" : {
            "method": "hnsw",
            "space_type": "cosinesimil",
            "method_parameters": {
                "ef_construction": 128,
                "m": 24
            }
        }
    },
    "number_of_shards": 5
}
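A minimal sketch of how user-supplied ann_parameters could be layered over the proposed defaults at index creation time. The names `DEFAULT_ANN_PARAMETERS`, `deep_merge` and `resolve_ann_parameters` are hypothetical and are not taken from the Marqo codebase; the default values simply mirror the proposal above.

```python
import copy

# Hypothetical defaults mirroring the proposed "ann_parameters" block above.
DEFAULT_ANN_PARAMETERS = {
    "method": "hnsw",
    "space_type": "cosinesimil",
    "method_parameters": {"ef_construction": 128, "m": 24},
}


def deep_merge(defaults: dict, overrides: dict) -> dict:
    """Return a copy of `defaults` with `overrides` applied recursively."""
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are dicts: merge key by key instead of replacing.
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def resolve_ann_parameters(index_settings: dict) -> dict:
    """Fill in any ANN parameters the user did not specify."""
    user_params = index_settings.get("index_defaults", {}).get("ann_parameters", {})
    return deep_merge(DEFAULT_ANN_PARAMETERS, user_params)


# A user overrides only `m`; everything else falls back to the defaults.
settings = {"index_defaults": {"ann_parameters": {"method_parameters": {"m": 16}}}}
resolved = resolve_ann_parameters(settings)
```

Merging recursively means a user can set a single nested value such as `m` without having to restate `ef_construction` or `space_type`.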

Implementation

Two parts:

  1. Store parameter defaults when index is created
    1. Index settings are set in tensor_search.create_vector_index. ANN parameters must be included here.
  2. Use as default when updating or creating field/s in _mappings
    1. HNSW parameters are set in backend.py:add_customer_field_properties. These should be defaulted to index-defaults, not hardcoded.
    2. If index defaults do not exist (backward compatibility issue), set and use Marqo-defaults.
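The second part could be sketched roughly as below: a field's knn_vector mapping is built from the stored index defaults, falling back to Marqo-wide defaults for indexes created before this change. The function name `knn_field_mapping`, the constant `MARQO_DEFAULT_ANN_PARAMETERS`, the fallback values and the `engine` value are assumptions for illustration; only the general shape of the OpenSearch k-NN `method` mapping is relied on.

```python
from typing import Optional

# Assumed Marqo-wide fallback for indexes that have no stored ANN defaults.
MARQO_DEFAULT_ANN_PARAMETERS = {
    "method": "hnsw",
    "space_type": "cosinesimil",
    "method_parameters": {"ef_construction": 128, "m": 24},
}


def knn_field_mapping(
    dimension: int,
    index_defaults: Optional[dict] = None,
    engine: str = "lucene",  # assumption: engine choice is not part of the proposal
) -> dict:
    """Build an OpenSearch knn_vector field mapping from index-level defaults."""
    # Older indexes have no "ann_parameters" key, so fall back to Marqo defaults.
    ann = (index_defaults or {}).get("ann_parameters") or MARQO_DEFAULT_ANN_PARAMETERS
    return {
        "type": "knn_vector",
        "dimension": dimension,
        "method": {
            "name": ann["method"],
            "space_type": ann["space_type"],
            "engine": engine,
            # Copy so callers cannot mutate the stored defaults.
            "parameters": dict(ann["method_parameters"]),
        },
    }


# An index created before this change: no ANN settings were stored.
legacy_mapping = knn_field_mapping(384, {"normalize_embeddings": True})

# An index created with custom HNSW settings.
custom_mapping = knn_field_mapping(
    384,
    {
        "ann_parameters": {
            "method": "hnsw",
            "space_type": "l2",
            "method_parameters": {"ef_construction": 256, "m": 16},
        }
    },
)
```

Keeping the fallback inside the mapping builder means existing indexes keep working without a migration step, which addresses the backward compatibility point in 2.2.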

Backwards & Forwards Compatibility

  • Backwards compatibility (i.e. can previous versions of code/data still work when run against this version of Marqo):

    • No new parameters are being added to _mappings.
    • Since defaults will not be hardcoded, when an existing index adds a new field, how will it get the default ANNS parameters?
  • Forwards compatibility (i.e. what constraints/difficulties will this impose on future changes):

    • Index settings robust to:
      • Use of different ANNS algorithms
      • Field level settings

@Jeadie
Contributor

Jeadie commented Dec 8, 2022

Of note, the opensearch-KNN default settings are specified here. m=16 is the default choice across libraries. From internal Marqo testing, m=16 shows local latency improvements (compared to, for example, m=12). This could be because 16 provides better/cleaner memory load sizes.

@Jeadie
Contributor

Jeadie commented Dec 8, 2022

We should improve readability for parameters in "method_parameters".
