
Elasticsearch datastore support (#343)
* added es datastore

* updated readme and toml

* updated readme

* clean up code + implemented a few more methods

* add tests + fix issues found

* update documentation

* update notebook

* clean up notebook

---------

Co-authored-by: Sebastian Montero <smonteroparis@icloud.com>
joemcelroy and sebastian-montero authored Jul 28, 2023
1 parent 9a12c11 commit e8fda70
Showing 10 changed files with 1,376 additions and 9 deletions.
32 changes: 26 additions & 6 deletions README.md
@@ -37,6 +37,7 @@ This README provides detailed information on how to set up, develop, and deploy
- [General Environment Variables](#general-environment-variables)
- [Choosing a Vector Database](#choosing-a-vector-database)
- [Pinecone](#pinecone)
- [Elasticsearch](#elasticsearch)
- [Weaviate](#weaviate)
- [Zilliz](#zilliz)
- [Milvus](#milvus)
@@ -166,6 +167,18 @@ Follow these steps to quickly set up and run the ChatGPT Retrieval Plugin:
export PG_USER=<postgres_user>
export PG_PASSWORD=<postgres_password>
export PG_DATABASE=<postgres_database>
# Elasticsearch
export ELASTICSEARCH_URL=<elasticsearch_host_and_port>  # specify either the URL or the cloud ID
export ELASTICSEARCH_CLOUD_ID=<elasticsearch_cloud_id>
export ELASTICSEARCH_USERNAME=<elasticsearch_username>
export ELASTICSEARCH_PASSWORD=<elasticsearch_password>
export ELASTICSEARCH_API_KEY=<elasticsearch_api_key>
export ELASTICSEARCH_INDEX=<elasticsearch_index_name>
export ELASTICSEARCH_REPLICAS=<elasticsearch_replicas>
export ELASTICSEARCH_SHARDS=<elasticsearch_shards>
```

10. Run the API locally: `poetry run start`
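
As a rough illustration (not part of the plugin itself), the Elasticsearch variables above could be turned into connection arguments for the elasticsearch-py 8.x client; the helper name below is hypothetical:

```python
import os


def es_connection_kwargs(env: dict) -> dict:
    """Build elasticsearch-py 8.x client kwargs from the variables above.

    Hypothetical helper: either ELASTICSEARCH_URL or ELASTICSEARCH_CLOUD_ID
    must be set, and auth is either an API key or a username/password pair.
    """
    kwargs = {}
    if env.get("ELASTICSEARCH_CLOUD_ID"):
        kwargs["cloud_id"] = env["ELASTICSEARCH_CLOUD_ID"]
    elif env.get("ELASTICSEARCH_URL"):
        kwargs["hosts"] = [env["ELASTICSEARCH_URL"]]
    else:
        raise ValueError("Set either ELASTICSEARCH_URL or ELASTICSEARCH_CLOUD_ID")

    if env.get("ELASTICSEARCH_API_KEY"):
        kwargs["api_key"] = env["ELASTICSEARCH_API_KEY"]
    elif env.get("ELASTICSEARCH_USERNAME"):
        kwargs["basic_auth"] = (
            env["ELASTICSEARCH_USERNAME"],
            env["ELASTICSEARCH_PASSWORD"],
        )
    return kwargs


# es = Elasticsearch(**es_connection_kwargs(dict(os.environ)))
```

Either the URL or the cloud ID selects the transport, and an API key takes precedence over basic auth, mirroring the either/or noted next to `ELASTICSEARCH_URL` above.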
@@ -277,11 +290,11 @@ poetry install

The API requires the following environment variables to work:

| Name | Required | Description |
| ---------------- | -------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `DATASTORE` | Yes | This specifies the vector database provider you want to use to store and query embeddings. You can choose from `elasticsearch`, `chroma`, `pinecone`, `weaviate`, `zilliz`, `milvus`, `qdrant`, `redis`, `azuresearch`, `supabase`, `postgres`, `analyticdb`. |
| `BEARER_TOKEN` | Yes | This is a secret token that you need to authenticate your requests to the API. You can generate one using any tool or method you prefer, such as [jwt.io](https://jwt.io/). |
| `OPENAI_API_KEY` | Yes | This is your OpenAI API key that you need to generate embeddings using the `text-embedding-ada-002` model. You can get an API key by creating an account on [OpenAI](https://openai.com/). |

### Using the plugin with Azure OpenAI

@@ -352,6 +365,10 @@ For detailed setup instructions, refer to [`/docs/providers/llama/setup.md`](/do

[AnalyticDB](https://www.alibabacloud.com/help/en/analyticdb-for-postgresql/latest/product-introduction-overview) is a distributed cloud-native vector database designed for storing documents and vector embeddings. It is fully compatible with PostgreSQL syntax and managed by Alibaba Cloud. AnalyticDB offers a powerful vector compute engine, processing billions of data vectors and providing features such as indexing algorithms, structured and unstructured data capabilities, real-time updates, distance metrics, scalar filtering, and time travel searches. For detailed setup instructions, refer to [`/docs/providers/analyticdb/setup.md`](/docs/providers/analyticdb/setup.md).

#### Elasticsearch

[Elasticsearch](https://www.elastic.co/guide/en/elasticsearch/reference/current/index.html) currently supports storing vectors through the `dense_vector` field type and uses them to calculate document scores. Elasticsearch 8.0 builds on this functionality to support fast, approximate nearest neighbor search (ANN). This represents a much more scalable approach, allowing vector search to run efficiently on large datasets. For detailed setup instructions, refer to [`/docs/providers/elasticsearch/setup.md`](/docs/providers/elasticsearch/setup.md).
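
As a hedged sketch of what this looks like in practice (the field names and parameter choices are illustrative, not the plugin's actual schema), an index mapping with a `dense_vector` field and an approximate kNN search body might be built like so, with `dims` matching `text-embedding-ada-002`'s 1536-dimensional output:

```python
def build_index_mapping(dims: int = 1536) -> dict:
    """Sketch of an index mapping using the dense_vector field type."""
    return {
        "mappings": {
            "properties": {
                "text": {"type": "text"},
                "embedding": {
                    "type": "dense_vector",
                    "dims": dims,
                    "index": True,  # enables approximate kNN search (ES 8.x)
                    "similarity": "cosine",
                },
            }
        }
    }


def build_knn_query(vector: list, k: int = 10) -> dict:
    """Sketch of an approximate kNN search body for the _search endpoint."""
    return {
        "knn": {
            "field": "embedding",
            "query_vector": vector,
            "k": k,
            # search more candidates per shard than k for better recall
            "num_candidates": max(10 * k, 100),
        }
    }
```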

### Running the API locally

To run the API locally, you first need to set the requisite environment variables with the `export` command:
@@ -489,6 +506,7 @@ The scripts are:
- [`process_zip`](scripts/process_zip/): This script processes a file dump of documents in a zip file and stores them in the vector database with some metadata. The format of the zip file should be a flat zip file folder of docx, pdf, txt, md, pptx or csv files. You can provide custom metadata as a JSON string and flags to screen for PII and extract metadata.

## Pull Request (PR) Checklist

If you'd like to contribute, please follow the checklist below when submitting a PR. This will help us review and merge your changes faster! Thank you for contributing!

1. **Type of PR**: Indicate the type of PR by adding a label in square brackets at the beginning of the title, such as `[Bugfix]`, `[Feature]`, `[Enhancement]`, `[Refactor]`, or `[Documentation]`.
@@ -533,7 +551,7 @@ feature/advanced-chunking-strategy-123

While the ChatGPT Retrieval Plugin is designed to provide a flexible solution for semantic search and retrieval, it does have some limitations:

- **Keyword search limitations**: The embeddings generated by the `text-embedding-ada-002` model may not always be effective at capturing exact keyword matches. As a result, the plugin might not return the most relevant results for queries that rely heavily on specific keywords. Some vector databases, like Elasticsearch, Pinecone, Weaviate and Azure Cognitive Search, use hybrid search and might perform better for keyword searches.
- **Sensitive data handling**: The plugin does not automatically detect or filter sensitive data. It is the responsibility of the developers to ensure that they have the necessary authorization to include content in the Retrieval Plugin and that the content complies with data privacy requirements.
- **Scalability**: The performance of the plugin may vary depending on the chosen vector database provider and the size of the dataset. Some providers may offer better scalability and performance than others.
- **Language support**: The plugin currently uses OpenAI's `text-embedding-ada-002` model, which is optimized for use in English. However, it is still robust enough to generate good results for a variety of languages.
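
To illustrate the hybrid-search point above: one common way to combine a keyword (BM25) ranking with a vector ranking is reciprocal rank fusion. This standalone sketch is illustrative only and not part of the plugin:

```python
def reciprocal_rank_fusion(rankings: list, k: int = 60) -> list:
    """Merge several ranked lists of doc ids.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    so items ranked highly by either the keyword or the vector retriever
    surface near the top of the fused list.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```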
@@ -585,3 +603,5 @@ We would like to extend our gratitude to the following contributors for their co
- [Postgres](https://www.postgresql.org/)
- [egor-romanov](https://github.com/egor-romanov)
- [mmmaia](https://github.com/mmmaia)
- [Elasticsearch](https://www.elastic.co/)
- [joemcelroy](https://github.com/joemcelroy)
10 changes: 8 additions & 2 deletions datastore/factory.py
@@ -56,8 +56,14 @@ async def get_datastore() -> DataStore:
            from datastore.providers.analyticdb_datastore import AnalyticDBDataStore

            return AnalyticDBDataStore()
        case "elasticsearch":
            from datastore.providers.elasticsearch_datastore import (
                ElasticsearchDataStore,
            )

            return ElasticsearchDataStore()
        case _:
            raise ValueError(
                f"Unsupported vector database: {datastore}. "
                f"Try one of the following: llama, elasticsearch, pinecone, weaviate, milvus, zilliz, redis, or qdrant"
            )
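
The factory defers each provider's import until that provider is actually selected, so unused datastore dependencies are never loaded. A minimal standalone sketch of that lazy-lookup pattern (registry contents hypothetical):

```python
def get_datastore(name: str, registry: dict):
    """Look up a provider lazily; registry maps name -> zero-arg factory."""
    try:
        factory = registry[name]
    except KeyError:
        raise ValueError(
            f"Unsupported vector database: {name}. "
            f"Try one of the following: {', '.join(sorted(registry))}"
        )
    # The callable performs the (possibly heavy) import and construction
    # only when its provider is chosen.
    return factory()
```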
