[RFC] Indexing and Search Separation #14596
This RFC refreshes some of the concepts discussed in an older RFC, #7258, with a more targeted focus on role separation. We'll review for overlaps and come up with an execution plan.
I will share an initial draft on how we can do a PoC with currently available mechanisms to simulate reader/writer separation, along with experiments to show where this could benefit.
A really quick way to get node separation for experiments (without any standby writers):
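As a rough sketch of one way to do this with existing mechanisms, assuming nodes are tagged with a custom attribute (e.g. `node.attr.node_group: searcher` in opensearch.yml, a name made up for this sketch) and segment replication is enabled on the index, search traffic can be steered to the "searcher" nodes with the `_only_nodes` preference:

```python
import requests

HOST = "http://localhost:9200"  # assumed local cluster endpoint

# Create a test index with one replica; with segment replication the replica
# only serves reads. Setting names are assumptions to verify for your version.
requests.put(
    f"{HOST}/poc-index",
    json={
        "settings": {
            "index.number_of_shards": 1,
            "index.number_of_replicas": 1,
            "index.replication.type": "SEGMENT",
        }
    },
)

# Route searches only to nodes matching the custom "searcher" attribute so the
# nodes holding primaries see no query load. (Node ids/names also work in the
# node spec; the attribute form is an assumption here.)
resp = requests.post(
    f"{HOST}/poc-index/_search",
    params={"preference": "_only_nodes:node_group:searcher"},
    json={"query": {"match_all": {}}},
)
print(resp.json())
```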
Played with this with a few test indices.
Just to build a bit on what @sohami mentioned, here are the existing mechanisms I believe are relevant here:
@sohami What else am I missing? However, even with these mechanisms we're still missing some pieces that would be needed for true index/search separation:
@andrross and @mch2 Yes, I am thinking along similar lines, and to @andrross's point about separating indexers and searchers, we can use an existing index setting for the PoC.
For now I was thinking we can keep this as is, because with remote store the messages between indexer and searcher will be lightweight (hence also calling out in the assumptions below that we plan to mainly support remote-store-based indices). The majority of the work will be in downloading the data from the remote store, which the searcher will need to do in a separated setup anyway. This is what I think could be a good starting point for the PoC, let me know your thoughts.

Benefits of Reader/Writer Separation:

Assumptions: True reader and writer separation will make sense with segment replication and remote store, so we will not consider doc-rep-based indices in the mix, or segrep indices with local store.

Suggested PoC: To achieve the reader/writer separation we can use the following setup.
PoC Tests:

Scenario 1: Search traffic on any shard (primary or replica) on a node can affect the indexing traffic on the same node, to either the same or a different index shard (a simple way to drive this contention is sketched below, after Scenario 2).

Scenario 2: Different instance types for indexing and search. In a shared setup, if we need to scale to a memory-optimized instance type, then the nodes hosting both the primary (writer) and replica (reader) shards need to be scaled up, whereas with separation only the node hosting the replica (reader) shard is scaled up, showing the cost benefit.
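To reproduce the contention in Scenario 1, a simple driver that pushes bulk writes and heavy queries at the cluster concurrently is enough; a minimal sketch (index name, document shape, and aggregation are placeholders):

```python
import json
import threading
import requests

HOST = "http://localhost:9200"  # assumed endpoint

def bulk_writer(n_batches=100, batch_size=500):
    """Continuously send _bulk requests to generate indexing load."""
    for _ in range(n_batches):
        lines = []
        for i in range(batch_size):
            lines.append(json.dumps({"index": {"_index": "poc-index"}}))
            lines.append(json.dumps({"field": "value", "n": i}))
        requests.post(
            f"{HOST}/_bulk",
            data="\n".join(lines) + "\n",
            headers={"Content-Type": "application/x-ndjson"},
        )

def searcher(n_queries=200):
    """Fire CPU/memory-heavy aggregations to compete with indexing."""
    query = {"size": 0,
             "aggs": {"by_n": {"terms": {"field": "n", "size": 10000}}}}
    for _ in range(n_queries):
        requests.post(f"{HOST}/poc-index/_search", json=query)

threads = [threading.Thread(target=bulk_writer), threading.Thread(target=searcher)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```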
@sohami Can you show the failure isolation aspect with no-op replication in place? If searches brown-out a replica, I think that will impact the indexer because it will still be waiting on acks from the replica.
I think it would be interesting to compare the performance of two clusters: one with specialized nodes of different types, and one with all nodes of the same type and roles, but in both cases the total cost of all the nodes is the same. Then it would be interesting to see in what scenarios the specialized nodes can give better performance.
Good point. My thought was that with the remote translog the no-op replication probably would not affect indexing, but it seems like it does (thanks to @mch2 for confirming). Then we will need to bypass that for the PoC and make it a "dummy" replication that always succeeds.
Need to think more on this. Let me know if you have any suggestions on how to achieve this. At first thought, I think we probably need some workload where indexing and search are competing on one of the resources (CPU/memory), which can be solved by adding a new instance type for indexing/search (this is what the suggested experiment Scenario 2 is showing). But doing that with similar cost between the two setups could be challenging, especially when there is no significant difference between instance types. So I think it will require a bigger setup, probably with a large instance count. Creating a specific targeted workload at such a big setup will then become a challenge.
Curious why the indexer would wait on acks from the replica (or why a browned-out replica can impact the indexer). As per my understanding, for segrep with remote storage it would ack once the doc is flushed to the translog in remote storage.
Curious, do we need to upload the translog to the remote storage, or just the committed segments? I think the data nodes for indexing still need node-to-node replication for redundancy, but the search nodes only need to download the committed segments for serving?
@yupeng9 With the current remote store design there is no node-to-node replication as the remote store provides durability for both translog and committed segments. There is no concept of "search nodes" with the remote store (yet) but the replica shards do not keep a copy of the translog as they are never sent the original documents and instead only download the translog if they are promoted to primary and need to take over indexing duties.
I see. If there's no node-to-node replication, then how do we ensure the durability of the local changes not yet flushed into the translog file, or of the changes in the translog that are not yet uploaded to remote?
@sohami I am running through your PoC tests and will share results ASAP.
@yupeng9 The translog sync is on the _bulk write path, which by default provides request-level durability. So in the remote store case the request won't be ack'd until the upload has completed.
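For illustration, the relevant knob is the `index.translog.durability` setting; a minimal sketch of relaxing it from the default request-level durability (how this interacts with remote translog upload is worth verifying, this is only an illustration):

```python
import requests

HOST = "http://localhost:9200"  # assumed endpoint

# By default index.translog.durability is "request": the _bulk response is not
# returned until the translog (remote, in the remote-store case) is synced.
# "async" trades durability for throughput by syncing on an interval instead.
requests.put(
    f"{HOST}/poc-index/_settings",
    json={"index.translog.durability": "async"},
)
```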
I see. Do we have a benchmark report to share on its implications for throughput? Also, will this block non-bulk write requests too?
Rockset DB claims to be 4x faster for streaming data ingestion, and they have a benchmark repo (Apache 2 license) for it. Compute-storage separation is one of the aspects they claim contributed to this performance improvement. So should we also run this benchmark against the setup above, as a follow-up exercise, to see where we stand on streaming data ingestion performance?
@yupeng9 You can find some benchmarks here. The upshot is that if you properly load the system then the benefits of segment replication and not doing node-to-node copy outweigh the additional time spent waiting on remote uploads and result in throughput gains.
Non-bulk requests will wait on remote upload as well. However, independent indexing requests do not generally block one another.
@sohami @andrross I think with the addition of segment-based replication, we have a gap to address with respect to setting expectations on replica freshness. What we may need is to send along with the search request the seqNo/primary term it should reflect. Does it make sense, folks? Thanks!
Thanks for sharing the benchmark. It's interesting to see that the latency of segment replication is lower than document replication; I guess it might be due to the consensus protocol for doc replication adding overhead. I also agree with the other benefits mentioned for segment replication. Interestingly enough, if we can leverage pull-based ingestion, we could save the translog upload, as we have the streams to replay the history upon node recovery. This would improve latency and throughput because the request does not have to wait for the change to be flushed from memory to disk and then uploaded to remote storage. In fact, this is what we do at Uber to ingest from Kafka: we only upload committed segments to remote, together with the offset of the last committed message. Upon recovery, we download the latest segments, rewind the offset, and re-ingest. So we can have this optimization opportunity once we embrace pull-based ingestion.
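A rough sketch of that recovery flow, with the segment/offset helpers as hypothetical stand-ins for OpenSearch internals and kafka-python on the consumer side:

```python
from kafka import KafkaConsumer, TopicPartition  # pip install kafka-python

# Hypothetical helpers standing in for shard internals in this sketch.
def download_latest_segments_and_offset(shard_id: int) -> int:
    """Restore the last committed segments from remote storage and return the
    Kafka offset that was persisted alongside them."""
    ...
    return 0

def apply_to_local_shard(shard_id: int, doc: bytes) -> None:
    """Re-index one message into the recovering shard."""
    ...

def recover_shard(shard_id: int) -> None:
    committed_offset = download_latest_segments_and_offset(shard_id)
    consumer = KafkaConsumer(bootstrap_servers="kafka:9092",
                             enable_auto_commit=False)
    tp = TopicPartition("ingest-topic", shard_id)
    consumer.assign([tp])
    # Rewind to the first message not covered by the committed segments and
    # replay; the stream itself is the history, so no translog upload is needed.
    consumer.seek(tp, committed_offset + 1)
    for message in consumer:
        apply_to_local_shard(shard_id, message.value)
```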
I think providing seqNo/primaryTerm via the search request could be difficult for users, as they would now need to keep track of it. If the goal is to provide some sort of SLA around the delay on a replica with segrep, then failing such replicas on the service side, so that they are not considered in search routing, could be a way to go. We have an existing mechanism which fails a replica if it lags by, say, 4 checkpoints. Ref here
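For reference, tuning that mechanism is a matter of dynamic cluster settings; a sketch follows, with the exact `segrep.pressure.*` setting names recalled from the segment replication backpressure feature and to be verified against the linked reference:

```python
import requests

HOST = "http://localhost:9200"  # assumed endpoint

# Enable segment replication backpressure and bound how far a replica may lag
# in checkpoints before it is penalized/failed. Setting names are assumptions.
requests.put(
    f"{HOST}/_cluster/settings",
    json={
        "persistent": {
            "segrep.pressure.enabled": True,
            "segrep.pressure.checkpoint.limit": 4,
        }
    },
)
```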
Shard failure is an option but would be quite disruptive with tighter freshness requirements. I think a streaming index API gives the answer for a better experience here, where we notify that a particular seqNo has been replicated across all or a portion of the searchers. Another search-time alternative that we discussed in this week's search meetup is accounting for replica state at the coordinator and routing accordingly, though this would be noisy for cluster state during steady-state indexing. I know this is more of a discussion on replica freshness vs strong/eventual consistency, but for some more context -
Hey everyone, I wanted to share the changes I'm using for these PoC benchmarks. Code: https://github.com/mch2/OpenSearch/tree/rwsplit-benchmarks - namely this commit.
Workloads:
cluster setup: https://github.com/mch2/opensearch-cluster-cdk/tree/benchmarking-update
The distributionUrl I included there will hold the artifact for the latest changes I'm working off of.
Here are the results of the PoC tests outlined by @sohami. Workload, cluster setup, and PoC changes are above. Summary of results:
Cluster Configuration: 3 cluster manager nodes

Benchmark Configuration: All benchmark runs were executed with the workload linked above. Note - to save time I restored 50% of the load from snapshot before running the concurrent step, so the graphs below will not show this. Normal indexing throughput reached is between 35-40k docs/s.

Results:

Scenario 1: Failure/workload isolation and scaling.
_bulk Throughput:
Scenario 2: Scaling up search nodes.
I've been running some experiments with this in mind using the indexing-querying-all test procedure. Note our specialized cluster does not have any failover mechanism in these benchmarks, so solutions like standby writers would add to its cost.

Comparison 1:

Comparison 2:

Next I tried a run with a more appropriate sharding strategy, closer to 1.5 vCPU per node. I also tweaked the refresh interval from 1s to 10s and bumped the search threadpool size from the default to 50 on the search nodes, given they aren't indexing.
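For reference, a sketch of those two tweaks (the refresh interval is a dynamic index setting; the search threadpool size is a static node setting that has to go in opensearch.yml, shown only as a comment):

```python
import requests

HOST = "http://localhost:9200"  # assumed endpoint

# Lengthen the refresh interval from the default 1s to 10s on the test index.
requests.put(
    f"{HOST}/poc-index/_settings",
    json={"index": {"refresh_interval": "10s"}},
)

# The search threadpool size is static, so it is set per node in opensearch.yml
# rather than over the API, e.g. on the search-only nodes:
#   thread_pool.search.size: 50
```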
Thanks a lot @mch2
Quick question: did we enable concurrent search (concurrent segment search) for these runs?
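For reference, it can be toggled with a dynamic cluster setting; a sketch, with the `search.concurrent_segment_search.enabled` key assumed and to be checked for the version under test:

```python
import requests

HOST = "http://localhost:9200"  # assumed endpoint

# Enable concurrent segment search cluster-wide; the setting name is an
# assumption from the concurrent search feature, verify it for your version.
requests.put(
    f"{HOST}/_cluster/settings",
    json={"persistent": {"search.concurrent_segment_search.enabled": True}},
)
```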
This is only with the defaults. I will kick off some runs with concurrent search for both clusters and share. As a next step I think we can start drafting a lower level design where we can answer the following:
@reta Thanks for the callout on concurrent search. With this enabled, results are much more favorable for the specialized cluster in all but 2 sort queries. The contender here is the cluster with rwsplit.
Max CPU on both clusters hits 100%, while the min fluctuates between 90 and 100%.
Good to see this discussion resurrected and some items getting prioritised. Thanks @andrross @sohami @mch2. Another set of things we should consider is to re-evaluate push vs pull as the replication mechanism for replicas. With decoupling as a core idea, I would prefer pull for replicas, making it consistent with remote cluster replicas so that we have pull-based replication everywhere.
Is your feature request related to a problem? Please describe
Background
Currently, a data node performs both indexing and searching, leading to workload interference between these tasks. An expensive query can monopolize memory and CPU resources, causing indexing requests to fail, or vice versa. Additionally, scaling read traffic typically involves adding more replicas, which can slow down indexing and reduce throughput. Therefore, supporting the separation of indexing and search will enhance both read and write performance. Separation of indexing and search also allows each to scale independently; for example, additional resources can be allocated to indexing during data ingestion, while search can be scaled separately to handle query load.
Describe the solution you'd like
At a high level, there are two approaches to achieving indexing and search separation: node/role level and instance/cluster level.
Node/Role level separation
In order to achieve indexing and search separation, we would build on a new node role, "search", which is separate from the existing data role that would then focus on indexing only. The "search" node role would act as dedicated search nodes.
With remote storage, committed data is kept as segments and uncommitted data is added to the translog. To maintain consistency, the same semantics are applied when storing data in the remote store: data from the local translog is backed up to the remote translog store with each indexing operation, and whenever new segments are created during refresh, flush, or merge, they are uploaded to the remote segment store.
The control plane runs as active-standby for redundancy and would have a built-in auto-failover mechanism.
The search node directly downloads indexed data from remote storage and executes search operations, including aggregations. It operates in active-active mode to ensure availability during failures. The refresh interval should be configurable according to system limits.
Requirements:

1. Today, any shard copy that is unassigned triggers a cluster status change to red (primary) or yellow. With separation, we would have higher granularity for indexing and search status: e.g. when the primary fails to serve write traffic, there would be an indicator such as Indexing Unhealthy/Unavailable. Similarly for search, if any or all replicas fail, it should indicate a search failure.

2. Today, OpenSearch follows a set of resiliency policies and allocation preferences based on the primary/replica architecture. With separation, shard allocation would be based on role (search/data). For awareness, the primary shards would need to apply primary active/standby placement (avoid allocating the primary active and standby in one zone/rack); for search, try to distribute across different zones/racks.

3. When the active primary fails, automatically fail over to the primary standby for durability of indexing.

4. Ensure consistency guarantees comparable to today's standards by having a search replica shard monitor its synchronization with the data node. If the replica is out of date, it can redirect or fall back the request to the data node. Options could include requiring strict consistency, allowing a maximum lag of X, and so on. (A rough sketch of this fallback decision is given below.)
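A minimal sketch of that fallback decision, with all names (lag threshold, shard objects, methods) hypothetical rather than actual OpenSearch APIs:

```python
MAX_ALLOWED_LAG = 100  # maximum tolerated replication lag, in operations (assumed)

def route_search(request, replica, primary):
    """Serve from the search replica when it is fresh enough, otherwise fall
    back to the writer copy. All objects here are hypothetical stand-ins."""
    lag = primary.max_seq_no - replica.last_replicated_seq_no
    if request.consistency == "strict" and lag > 0:
        return primary.search(request)   # strict read-after-write
    if lag > MAX_ALLOWED_LAG:
        return primary.search(request)   # replica too stale, redirect
    return replica.search(request)       # normal case: serve from the searcher
```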
Related component
Other
Describe alternatives you've considered
Cluster / domain separation
Alternatively, a similar indexing and search separation can be achieved through Cross-Cluster Replication (CCR) with segment replication. With CCR (segrep), the leader cluster mostly handles the indexing/writes while the follower cluster keeps in sync via segment replication. All indexing requests would route to the leader cluster and search requests route to the follower cluster.
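A minimal sketch of how the follower (search) cluster could be wired up with the cross-cluster-replication plugin; the endpoints, cluster alias, and index names are placeholders, and the `_plugins/_replication/_start` API should be verified for the target version:

```python
import requests

FOLLOWER = "http://follower-cluster:9200"  # placeholder follower endpoint
LEADER_ALIAS = "leader-cluster"            # remote cluster alias configured on the follower

# Start replicating the leader's index onto the follower cluster so that all
# writes go to the leader and reads are served from the follower.
requests.put(
    f"{FOLLOWER}/_plugins/_replication/logs-follower/_start",
    json={"leader_alias": LEADER_ALIAS, "leader_index": "logs"},
)
```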
Comparison:
Additional context
No response