Support Elasticsearch/OpenSearch for user search #14608
Labels
A-User-Directory
O-Occasional: Affects or can be seen by some users regularly or most users rarely.
S-Minor: Blocks non-critical functionality, workarounds exist.
T-Enhancement: New features, changes in functionality, improvements in performance, or user-facing enhancements.
Preamble
Synapse's user search feature has a few long-standing known shortcomings when searching for display names, in particular around how it handles case, diacritics and partial matches.
Addressing these issues is non-trivial with PostgreSQL's full text search capabilities. In this writeup I am exploring integrating Elasticsearch, which is a full text search engine backed by Apache Lucene, into Synapse.
Note that mentions of Elasticsearch in this writeup also include OpenSearch, which is AWS's fork of Elasticsearch and, as far as I can tell, exposes APIs compatible with Elasticsearch's (at least for the features that are relevant to us).
Also note that I am focusing specifically on user search, and am not including message search to avoid scope creep.
Indexing
Elasticsearch's equivalent of an SQL table is called an index. An index contains a number of documents, which are freeform JSON blobs. In the context of user search, this is where we would store user profiles. This is an example of a document in an Elasticsearch index:
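A sketch of what such a document could look like when retrieved (the index name `user_directory` and the profile values are purely illustrative):

```json
{
  "_index": "user_directory",
  "_id": "@alice:example.com",
  "_version": 1,
  "found": true,
  "_source": {
    "user_id": "@alice:example.com",
    "display_name": "Alice Liddell",
    "avatar_url": "mxc://example.com/abc123"
  }
}
```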
Here, we're mostly interested in two properties:

- `_id`: the document's identifier. A document can be created without an identifier, in which case Elasticsearch automatically generates one for it.
- `_source`: the document's data.

In this example, I've used the user's MXID as the document's ID (so that we can easily update it in the future), and the structure of profiles as they are returned by the `/user_directory/search` endpoint. In practice we'll probably want to remove the `user_id` from the document's source, in order to avoid duplicating data.

Analysis
We want our greatly improved user search to be, at a minimum, case-insensitive, diacritics-insensitive, and able to match on partial names.
For this, we need to create the Elasticsearch index and configure it with an analyzer. The analyzer is in charge of looking at every new piece of data and tokenising it (while also retaining the document's source) so that it's easily searchable later on.
We create our index with a custom analyzer that is both case- and diacritics-insensitive:
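A sketch of what the index creation request could look like; the analyzer name `folding_analyzer` and the extra field mappings are assumptions on my part:

```json
PUT /user_directory
{
  "settings": {
    "analysis": {
      "analyzer": {
        "folding_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["asciifolding", "lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "user_id": { "type": "keyword" },
      "display_name": {
        "type": "text",
        "analyzer": "folding_analyzer"
      },
      "avatar_url": { "type": "keyword" }
    }
  }
}
```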
There are a few things happening here:

- In `settings.analysis.analyzer`, we define a custom analyzer on the index. This analyzer includes two token filters: `asciifolding`, which folds non-ASCII characters in tokens to an equivalent ASCII character (thus eliminating accents), and `lowercase`, which forces tokens into lower case.
- In `mappings.properties`, we assign our custom analyzer to the `display_name` field, since this is where we might have diacritics and case variations.

Document insertion and update
When the index is created, we can start adding documents to it:
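For instance (again with illustrative values), indexing a profile under the user's MXID could look like this:

```json
PUT /user_directory/_doc/@alice:example.com
{
  "user_id": "@alice:example.com",
  "display_name": "Alice Liddell",
  "avatar_url": "mxc://example.com/abc123"
}
```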
Note that the request for updating an existing document is identical to the one above. When a document is updated, its `_version` property is automatically incremented.
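As an illustration, re-issuing the request above once the document already exists would return a response along these lines (the values are made up):

```json
{
  "_index": "user_directory",
  "_id": "@alice:example.com",
  "_version": 2,
  "result": "updated",
  "_shards": { "total": 2, "successful": 1, "failed": 0 },
  "_seq_no": 1,
  "_primary_term": 1
}
```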
Search
Now we can search for users:
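A sketch of the kind of search request this could be, assuming the index and field names from the earlier examples:

```json
GET /user_directory/_search
{
  "query": {
    "match_phrase_prefix": {
      "display_name": "alice lid"
    }
  }
}
```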
(Yes, this is a GET request with a body.)

We use a `match_phrase_prefix` query to ensure we start matching at the start of a sequence of tokens, instead of in the middle of a token.

Results are then provided in the following format:
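(The response below is a sketch with made-up values; it is the overall shape that matters.)

```json
{
  "took": 3,
  "timed_out": false,
  "_shards": { "total": 1, "successful": 1, "skipped": 0, "failed": 0 },
  "hits": {
    "total": { "value": 1, "relation": "eq" },
    "max_score": 1.54,
    "hits": [
      {
        "_index": "user_directory",
        "_id": "@alice:example.com",
        "_score": 1.54,
        "_source": {
          "user_id": "@alice:example.com",
          "display_name": "Alice Liddell",
          "avatar_url": "mxc://example.com/abc123"
        }
      }
    ]
  }
}
```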
On each hit, a score is provided in the `_score` property to help sort results.

With the query above, the same score will be attributed to every result. We will probably want to use a more elaborate query, such as a boolean compound query, which would allow attributing a higher score to exact matches (see the sketch below). We will also probably want to tweak the query so that it also matches on MXIDs.
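A rough sketch of what such a compound query could look like, with made-up boost values, and assuming `user_id` is indexed as a keyword as in the mapping sketch above:

```json
GET /user_directory/_search
{
  "query": {
    "bool": {
      "should": [
        { "match_phrase": { "display_name": { "query": "alice liddell", "boost": 2.0 } } },
        { "match_phrase_prefix": { "display_name": "alice liddell" } },
        { "wildcard": { "user_id": { "value": "*alice*" } } }
      ]
    }
  }
}
```

Exact phrase matches would then contribute a higher score than plain prefix matches, and the `wildcard` clause gives a (fairly naive) way of also matching on MXIDs.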
Integration
First off, we will probably want to make this integration optional. While there are valid use cases for requiring better user search results than the ones provided by PostgreSQL's full text search support, PostgreSQL is also good enough for most servers catering to a community that mostly uses Latin-based languages. Requiring those servers to run Elasticsearch for user search would be an unnecessary burden, both in terms of resources and, if the Elasticsearch cluster is self-hosted, maintenance.
Technically, integrating Elasticsearch into Synapse would mean writing our own interaction layer. Elastic does provide an official Python module, which even has async support; however, that async support uses aiohttp for transport. I don't know whether aiohttp is even compatible with Twisted, and I assume we will probably want to use Twisted agents to perform requests. There have been efforts in the past to write an Elasticsearch module for Twisted-based applications (txes2), but it looks long out of date and unmaintained (and is still incompatible with Python 3).
Writing our own integration for Elasticsearch should not be very complex, however, since as demonstrated above we would only need to perform a couple of types of HTTP requests (creating/updating documents, and searching).
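To give an idea of the scale of such an interaction layer, here is a minimal, hypothetical sketch of the search call using Twisted's HTTP Agent; the `search_users` helper, the index name and the way the Agent is wired are all assumptions, and error handling is omitted:

```python
import json
from io import BytesIO

from twisted.internet import defer
from twisted.web.client import Agent, FileBodyProducer, readBody
from twisted.web.http_headers import Headers


@defer.inlineCallbacks
def search_users(agent, base_url, term):
    """Hypothetical helper: run a match_phrase_prefix search against
    Elasticsearch and return the raw hits."""
    body = json.dumps(
        {"query": {"match_phrase_prefix": {"display_name": term}}}
    ).encode("utf-8")
    # Elasticsearch expects a JSON body even on GET search requests.
    response = yield agent.request(
        b"GET",
        base_url + b"/user_directory/_search",
        Headers({b"Content-Type": [b"application/json"]}),
        FileBodyProducer(BytesIO(body)),
    )
    raw = yield readBody(response)
    defer.returnValue(json.loads(raw)["hits"]["hits"])
```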
Migration
How to migrate user search data from PostgreSQL to Elasticsearch and back is an area which needs further research. It will likely require a script run manually by the server admin, similar to the existing SQLite -> PostgreSQL migration script. There might be a way to make this migration incremental to ease the pain on servers with a very large number of users, but I'm not entirely sure.
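If we go the incremental route, existing profiles could for instance be read out of PostgreSQL in batches and pushed with Elasticsearch's bulk API, e.g. (illustrative values again):

```json
POST /user_directory/_bulk
{ "index": { "_id": "@alice:example.com" } }
{ "display_name": "Alice Liddell", "avatar_url": "mxc://example.com/abc123" }
{ "index": { "_id": "@bob:example.com" } }
{ "display_name": "Bob", "avatar_url": "mxc://example.com/def456" }
```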
Conclusion
Supporting Elasticsearch in Synapse looks like a pretty big amount of work, but I think it's also work that is worth putting in to enable more communities around the world to adopt Matrix. In my manual testing, most issues with user search that are caused by PostgreSQL's full text search engine seem to be resolved with Elasticsearch, apart from one edge case which I believe to be acceptable.
It is also worth considering that, once in place, we might also want to use Elasticsearch to handle message search, which has similar issues to user search.
To be clear, I am not claiming this work should be the team's highest priority by opening this issue - I mostly wanted to compile and share the findings from spending a limited amount of time researching options to improve user search.