
[RFC] Indexing and Search Separation #14596

Open · amberzsy opened this issue Jun 28, 2024 · 28 comments
Labels: enhancement, Indexing & Search, Indexing, RFC, Roadmap:Modular Architecture, Search, Storage:Remote

Comments

amberzsy commented Jun 28, 2024

Is your feature request related to a problem? Please describe

Background
Currently, a data node performs both indexing and searching, leading to workload interference between these tasks. An expensive query can monopolize memory and CPU resources, causing indexing requests to fail, or vice versa. Additionally, scaling read traffic typically involves adding more replicas, which can slow down indexing and reduce throughput. Separating indexing and search therefore improves both read and write performance. Separation also allows each to scale independently: for example, additional resources can be allocated to indexing during data ingestion, while search can be scaled separately to handle query load.

Describe the solution you'd like

At a high level, there are two approaches to achieving indexing and search separation: node/role level and instance/cluster level.

Node/Role level separation
To achieve indexing and search separation, we would build on a new node role, “search”, split out from the existing data role, which would then focus on indexing only. The “search” node role would act as dedicated search nodes.

With remote storage, committed data is kept as segments and uncommitted data is appended to the translog. To maintain consistency, the same semantics apply when storing data in the remote store: data from the local translog is backed up to the remote translog store with each indexing operation, and whenever new segments are created during refresh, flush, or merge, they are uploaded to the remote segment store.

The control plane runs as active-standby for redundancy and would have a built-in auto-failover mechanism.
The search node downloads indexed data directly from remote storage and executes search operations, including aggregations. It operates in active-active mode to ensure availability during failures. The refresh interval should be configurable within system limits.
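
For illustration, a minimal sketch of how the role split could look in opensearch.yml (the standalone "search" role exists today for searchable snapshots; extending it to dedicated remote-store search nodes is what this RFC proposes):

# opensearch.yml on an indexing-only data node
node.roles: [ data, ingest ]

# opensearch.yml on a dedicated search node
node.roles: [ search ]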

Requirements:

  1. Traffic separation: separate the coordinator role, or consider a proxy layer on top to handle routing.
  2. Cluster status - Green/Yellow/Red:
    a. Today, any unassigned shard copy triggers a cluster status change to red (primary) or yellow (replica). With
    separation, we would have finer granularity for indexing and search status, e.g. when a primary fails to serve write
    traffic, there would be an Indexing Unhealthy/Unavailable indicator; similarly for search, if any or all replicas
    fail, the status should indicate a search failure (see the hypothetical sketch after this list).
  3. Shard allocation strategy and zone/rack awareness:
    a. Today, OpenSearch follows a set of resiliency policies and allocation preferences based on the primary/replica
    architecture. With separation, shard allocation would be based on role (search/data). For awareness, the primary
    active and standby should not be allocated in the same zone/rack, and search replicas should be distributed
    across zones/racks.
  4. Auto-failover mechanism for the primary active and standby. When the primary active fails to serve traffic, it should
    automatically fail over to the primary standby for indexing durability.
  5. Consistency guarantee:
    a. Ensure consistency guarantees comparable to today's standards by having a search replica shard monitor its
    synchronization with the data node. If the replica is out of date, it can redirect or fall back the request to the data
    node. Options could include requiring strict consistency, allowing a maximum lag of X, and so on.
  6. Lightweight snapshots.
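
As an illustration of requirement 2, a hypothetical sketch of what a more granular health response could look like (the indexing_status and search_status fields are invented for this example; no such API exists today):

curl localhost:9200/_cluster/health/my-index
{
  "status": "yellow",
  "indexing_status": "red",
  "search_status": "green"
}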

Related component

Other

Describe alternatives you've considered

Cluster / domain separation
Alternatively, a similar indexing and search separation can be achieved through Cross-Cluster Replication (CCR) with segment replication. With CCR (segrep), the leader cluster mostly handles the indexing/writes while the follower cluster keeps in sync via segment replication. All indexing requests route to the leader cluster and search requests route to the follower cluster.
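
For reference, starting replication with the cross-cluster-replication plugin looks roughly like this (a sketch; the connection alias, index names, and roles are placeholders):

# on the follower (search) cluster: replicate leader-01 from the cluster
# registered under the connection alias "leader-cluster"
curl -XPUT "localhost:9200/_plugins/_replication/follower-01/_start" \
  -H 'Content-Type: application/json' \
  -d '{
    "leader_alias": "leader-cluster",
    "leader_index": "leader-01",
    "use_roles": {
      "leader_cluster_role": "all_access",
      "follower_cluster_role": "all_access"
    }
  }'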


Comparison: (table attached as an image)

Additional context

No response

Pallavi-AWS commented

This RFC refreshes some of the concepts discussed in an older RFC (#7258) with a more targeted preference for role separation. We'll review for overlaps and come up with an execution plan.

sohami commented Jul 1, 2024

I will share an initial draft of how we can run a PoC with currently available mechanisms to simulate reader/writer separation, and experiments to show where this could benefit.

mch2 commented Jul 2, 2024

A really quick way to get node separation for experiments (without any standby writers):

  1. Set preference in OperationRouting.getShards to Preference.REPLICA. This will route search requests away from the primary.
  2. The TargetPoolAllocationDecider introduced for searchable snapshots allows us to make decisions based on a RoutingPool. We can update this to consider any replica as "Remote Capable" and force it to a node with the search role.

Played with this on a few test indices and ./gradlew run -PnumNodes=3 - code here

% curl localhost:9200/_cat/shards                                                         
test4 0 p STARTED 5 5.4kb 127.0.0.1 runTask-0
test4 0 r STARTED 5 5.4kb 127.0.0.1 runTask-1
test2 0 p STARTED 5 5.4kb 127.0.0.1 runTask-2
test2 0 r STARTED 5 5.5kb 127.0.0.1 runTask-1
test3 0 p STARTED 5 5.4kb 127.0.0.1 runTask-2
test3 0 r STARTED 5 5.4kb 127.0.0.1 runTask-1
test  0 p STARTED 5 9.6kb 127.0.0.1 runTask-0
test  0 r STARTED 5 5.5kb 127.0.0.1 runTask-1
% curl localhost:9200/_cat/nodes                                                          
127.0.0.1 23 99 15 2.33   dimr cluster_manager,data,ingest,remote_cluster_client * runTask-2
127.0.0.1 62 99 15 2.33   dimr cluster_manager,data,ingest,remote_cluster_client - runTask-0
127.0.0.1 60 99 15 2.33   s    search                                            - runTask-1

andrross commented Jul 2, 2024

Just to build a bit on what @sohami mentioned, here are the existing mechanisms I believe are relevant here:

  • Remote-backed storage: With this feature, segment and translog durability is offloaded to the remote repository. Replica nodes will pull the segments from the remote store as opposed to relying on node-to-node replication.
  • Search routing preference: A user can specify that searches are routed to replica shards. If all searches supply such a preference, then all search traffic goes to replicas (and implicitly all writes go to primaries, because only primaries can accept writes).
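
For example, pinning a search to replica copies with the preference parameter would look like this (a sketch, assuming the _replica preference value used elsewhere in this thread):

# send this query to a replica shard copy, leaving the primary free
# to handle writes
curl "localhost:9200/my-index/_search?preference=_replica" \
  -H 'Content-Type: application/json' \
  -d '{"query": {"match_all": {}}}'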

@sohami What else am I missing?

However, even with these mechanisms we're still missing some pieces that would be needed for true index/search separation:

  • By default, OpenSearch will attempt to evenly balance replica and primary shards across all nodes in the cluster. This is important in a homogeneous cluster to ensure the indexing load is spread evenly. For index/search separation we'd want exactly the opposite so that indexers and searchers are each allocated to a distinct subset of nodes.
  • While remote-backed storage got rid of the need to replicate the translog across primaries and replicas because the remote store provides durability, it still kept the concept of "no-op replication" (discussed here) in order to protect against isolated primaries continuing to accept writes. With true indexer/searcher separation we probably would not want search replicas to be involved on the write path at all to provide true independent scalability.
  • Somewhat related to the point above, we'd probably want the concept of this "standby primary" shown in the diagram to continue to provide the same consistency guarantees with no-op replication. OpenSearch today has only the "primary" and "replica" concepts so this would likely be a new concept. Alternatively, if the remote store could provide consistency guarantees (this would likely require a different type of store than an object store like S3), and OpenSearch could quickly provision a new primary fast enough in the case of a node failure, then the standby might not be necessary.

sohami commented Jul 2, 2024

@andrross and @mch2 Yes, I am thinking along similar lines, and to @andrross's point about separating indexers and searchers, we can use an existing index setting for the PoC.

With true indexer/searcher separation we probably would not want search replicas to be involved on the write path at all to provide true independent scalability.

For now I was thinking we can keep this as is, because with remote store the message between indexer and searcher will be lightweight (hence also calling out in the Assumptions below that we plan to mainly support remote-store-based indices). The majority of the work is in downloading the data from the remote store, which the searcher will need to do in a separated setup anyway.

This is what I think could be a good starting point for the PoC, let me know your thoughts.

Benefits of Reader/Writer Separation:

  • Failure isolation between readers and writers
  • Independent scaling of ingestion and search workloads as needed
  • Independent workload-based optimizations for reader and writer:
    • a) Using available knobs to tune
    • b) Selecting different instance/node types. This can potentially save cost, depending on the instance type needed for the indexing/search workload, e.g. compute-optimized instances are cheaper than memory-optimized instances:

| Instance Type | vCPU | Memory | Storage | Price per hour | Type |
|---|---|---|---|---|---|
| c6g.large.search | 2 | 4 GiB | EBS Only | $0.113 | Compute Optimized |
| r6g.large.search | 2 | 16 GiB | EBS Only | $0.167 | Memory Optimized |

Assumptions:

True reader and writer separation will make sense with segment replication and remote store, so we will not consider doc-rep-based indices in the mix, or segrep indices with local store.

Suggested PoC:

To achieve the reader/writer separation we can use the following setup (see the index-settings sketch after this list):
  • Create a cluster with
    • 2 coordinator nodes; this can be achieved by setting up nodes with no role
    • 2 nodes with the data role (use CPU-optimized c6g.large ones to make it easy to repro), and
    • 3 dedicated cluster manager nodes
  • Create an index with 1 shard having 1 primary and 1 replica
  • Update the index setting index.requests.cache.enable to disable the request cache
  • Optional: Set the index setting index.routing.allocation.total_shards_per_node to 1 to keep 1 shard per node. This will be needed if we plan to increase the node count and replica count; otherwise, a 2-node setup without this setting is also fine.
  • The OSB client should communicate with the coordinator-only nodes
  • We may need to modify OSB to send search requests with a preference parameter in the request (if not already supported). Ref: https://opensearch.org/docs/latest/api-reference/search/#the-preference-query-parameter
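
A sketch of the index-level setup described above (the index name is a placeholder):

# 1 primary / 1 replica test index
curl -XPUT "localhost:9200/test-index" -H 'Content-Type: application/json' -d '{
  "settings": {
    "index.number_of_shards": 1,
    "index.number_of_replicas": 1
  }
}'

# disable the request cache so repeated queries actually exercise the node
curl -XPUT "localhost:9200/test-index/_settings" -H 'Content-Type: application/json' -d '{
  "index.requests.cache.enable": false
}'

# optional: at most 1 shard of this index per node
curl -XPUT "localhost:9200/test-index/_settings" -H 'Content-Type: application/json' -d '{
  "index.routing.allocation.total_shards_per_node": 1
}'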

PoC Tests:


Scenario 1:

Search traffic on any shard (primary or replica) on a node can affect the indexing traffic on the same node, to either the same or a different index shard. The routing used in steps 2 and 5, and the scale-out in step 8, are sketched after this list.
  1. Start the indexing traffic on the test index. The indexing load will go to the primary shard, and the replica will sync from the remote store.
  2. Start search traffic on the index which is memory intensive and targeted only at the primary shard, using the request-level preference parameter set to _primary. This simulates the same index primary shard serving both read and write traffic.
    1. We use this setup to keep the simulation simple; alternatively, we could do the same thing with shards of different indices, such that the primary of one index and a replica shard of another index are colocated on a node.
    2. For a memory-intensive query, we can try the multi-terms aggs query or use the example here: [BUG] A sufficiently small interval value on a histogram can crash the node #14558 (comment)
  3. We can do 2 flavors here:
    1. Increase the search traffic and keep the indexing traffic constant until the node drops out of the cluster due to overload.
    2. Increase the indexing traffic and keep the search traffic constant until the node drops out of the cluster due to overload.
  4. Once we have a workload that can trigger search/indexing overload, we can see the impact on the other request type, i.e. search affecting indexing (a) or vice versa (b).
  5. Now, using the same workload as above, re-run it but direct the search traffic to replica shards using preference=_replica.
  6. In step 5, we should not encounter any failures, which showcases the workload isolation between search and indexing traffic.
  7. Now keep increasing the search traffic on the replica shard until the node hosting the replica shard drops out of the cluster. Once this happens there should not be any impact on the indexing traffic. This showcases the failure isolation between indexing and search.
    1. Note: We cannot do the same thing for indexing failure isolation, as that would trigger primary promotion and cause search traffic failure as well.
  8. Now add another node and increase the replica count to 2. Add the index setting total_shards_per_node of 1 to the index. Again trigger the same indexing and search traffic as in step 7, which caused the replica node crash. With the additional replica, we should be able to handle the search traffic. This showcases search scaling independent of indexing.
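
The preference routing used in steps 2 and 5, and the scale-out in step 8, would look roughly like this (a sketch; index and field names are placeholders):

# step 2: aim a memory-intensive multi-terms aggregation at the primary copy
curl "localhost:9200/test-index/_search?preference=_primary" \
  -H 'Content-Type: application/json' -d '{
  "size": 0,
  "aggs": {
    "expensive": {
      "multi_terms": {
        "terms": [ { "field": "field_a" }, { "field": "field_b" } ]
      }
    }
  }
}'

# step 5: the same query redirected to the replica copy -- only the
# preference changes: ?preference=_replica

# step 8: scale search out by adding a second replica
curl -XPUT "localhost:9200/test-index/_settings" \
  -H 'Content-Type: application/json' \
  -d '{ "index.number_of_replicas": 2 }'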

Scenario 2:

Different instance types for indexing and search. In a shared setup, if we need to scale to memory-optimized instances, then the nodes hosting both the primary (writer) and replica (reader) shards need to be scaled up, whereas in this case only the node hosting the replica (reader) shard is scaled up, showing the cost benefit.

  1. Use the above setup, the difference being CPU-optimized instances (c6g.large) for both the primary and replica nodes, and perform the workload from step 7.
  2. Now update the replica node instance type to a memory-optimized instance (r6g.large) instead of a CPU-optimized one. Again perform the workload from step 7; we should see the replica handle the workload without dropping from the cluster. This showcases that indexing and search can each be optimized with the instance type suitable for its workload.
    1. We may need to manually reroute the shards to ensure the primary and replica are allocated on the correct instance types (see the reroute sketch below).
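
The manual reroute in step 2.1 could use the cluster reroute API (a sketch; node names are placeholders):

# move the replica copy of shard 0 onto the memory-optimized node
curl -XPOST "localhost:9200/_cluster/reroute" -H 'Content-Type: application/json' -d '{
  "commands": [
    {
      "move": {
        "index": "test-index",
        "shard": 0,
        "from_node": "old-replica-node",
        "to_node": "r6g-node"
      }
    }
  ]
}'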

andrross commented Jul 2, 2024

For now I was thinking we can keep this as is because with remote store the message between indexer and search will be of light weight

@sohami Can you show the failure isolation aspect with no-op replication in place? If searches brown-out a replica, I think that will impact the indexer because it will still be waiting on acks from the replica.

showing the cost benefits

I think it would be interesting to compare the performance of two clusters: one with specialized nodes of different types, and one with all nodes of the same type and roles, but in both cases the total cost of all the nodes is the same. Then it would be interesting to see in what scenarios the specialized nodes can give better performance.

sohami commented Jul 2, 2024

For now I was thinking we can keep this as is because with remote store the message between indexer and search will be of light weight

@sohami Can you show the failure isolation aspect with no-op replication in place? If searches brown-out a replica, I think that will impact the indexer because it will still be waiting on acks from the replica.

Good point. My thought was that with the remote translog the no-op replication would probably not affect indexing, but it seems it does (thanks to @mch2 for confirming). Then we will need to bypass it for the PoC and make it a "dummy" that always succeeds.

showing the cost benefits

I think it would be interesting to compare the performance of two clusters: one with specialized nodes of different types, and one with all nodes of the same type and roles, but in both cases the total cost of all the nodes is the same. Then it would be interesting to see in what scenarios the specialized nodes can give better performance.

Need to think more on this. Let me know if you have any suggestions on how to achieve it. At first thought, I think we probably need a workload where indexing and search are competing for one of the resources (CPU/memory), which can be solved by adding a different instance type for indexing/search (this is what the suggested experiment in scenario 2 shows). But doing that with similar cost between the 2 setups could be challenging, especially when there is no significant difference between instance types. So I think it will require a bigger setup, probably with a large instance count. Creating a specific targeted workload at such a big setup will then become a challenge.

amberzsy commented Jul 2, 2024

@sohami Can you show the failure isolation aspect with no-op replication in place? If searches brown-out a replica, I think that will impact the indexer because it will still be waiting on acks from the replica.

Good point. My thought was that with the remote translog the no-op replication would probably not affect indexing, but it seems it does (thanks to @mch2 for confirming). Then we will need to bypass it for the PoC and make it a "dummy" that always succeeds.

Curious why the indexer would wait on acks from the replica (or why a browned-out replica can impact the indexer); per my understanding, with segrep and remote storage it would ack once the doc is flushed to the translog in remote storage.

andrross commented Jul 2, 2024

Curious why the indexer would wait on acks from the replica (or why a browned-out replica can impact the indexer)

@amberzsy You can find the full context here: #3706

yupeng9 commented Jul 3, 2024

Curious, do we need to upload the translog to the remote storage, or just the committed segments? I think the data nodes for indexing still need node-to-node replication for redundancy, but the search nodes only need to download the committed segments for serving?

peternied commented

[Triage]
@amberzsy Thanks for creating this RFC, looking forward to seeing how this resolves.

andrross commented Jul 3, 2024

Curious, do we need to upload the translog to the remote storage or just the committed segments? I think the data nodes for indexing still need to have node-to-node replication for redundancy, but the search nodes need to download the committed segments only for serving?

@yupeng9 With the current remote store design there is no node-to-node replication, as the remote store provides durability for both the translog and committed segments. There is no concept of "search nodes" with the remote store (yet), but the replica shards do not keep a copy of the translog: they are never sent the original documents, and they only download the translog if they are promoted to primary and need to take over indexing duties.

yupeng9 commented Jul 5, 2024

I see. If there's no node-to-node replication, then how do we ensure the durability of local changes not yet flushed into the translog file, or changes in the translog but not yet uploaded to remote?

mch2 commented Jul 8, 2024

@sohami I am running through your PoC tests and will share results ASAP.

I see. If there's no node-to-node replication, then how do we ensure the durability of local changes not yet flushed into the translog file, or changes in the translog but not yet uploaded to remote?

@yupeng9 The translog sync is on the _bulk write path, which by default provides request-level durability. So in the remote store case the request won't be ack'd until the upload has completed.
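
For context, that request-level durability corresponds to the index.translog.durability setting; a sketch of relaxing it (at the cost of potentially losing the last few seconds of writes on failure):

# request (default): _bulk is not ack'd until the translog is synced
# async: sync in the background every sync_interval instead
curl -XPUT "localhost:9200/my-index/_settings" -H 'Content-Type: application/json' -d '{
  "index.translog.durability": "async",
  "index.translog.sync_interval": "10s"
}'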

yupeng9 commented Jul 8, 2024

@yupeng9 The translog sync is on the _bulk write path that by default provides request level durability. So in remote store case the request won't be ack'd until the upload has completed.

I see. Do we have a benchmark report to share on its implications for throughput? Also, will this block the non-bulk write requests too?

rishabhmaurya commented Jul 9, 2024

Rockset claims to be 4x faster for streaming data ingestion, and they have a benchmark repo (Apache 2 license) for it. Compute-storage separation is one of the aspects they claim resulted in this performance improvement. So should we run this benchmark against the setup above as a follow-up exercise, to see where we stand on streaming data ingestion performance?

andrross commented Jul 9, 2024

@yupeng9 The translog sync is on the _bulk write path that by default provides request level durability. So in remote store case the request won't be ack'd until the upload has completed.

I see. Do we have a benchmark report to share on its implications for throughput?

@yupeng9 You can find some benchmarks here. The upshot is that if you properly load the system then the benefits of segment replication and not doing node-to-node copy outweigh the additional time spent waiting on remote uploads and result in throughput gains.

Also, will this block the non-bulk write requests too?

Non-bulk requests will wait on remote upload as well. However, independent indexing requests do not generally block one another.

reta commented Jul 10, 2024

@sohami @andrross I think with the addition of segment-based replication, we have a gap to address with respect to setting preference=_replica. Since the process is asynchronous, replicas are no longer equal, and each could be in a different state with respect to being in sync with the primary (or primaries).

What we may need is to send the seqNo and/or primaryTerm along with the search request (I think we support that only for SearchHits at the moment, on the response side) so the coordinator can decide a) which replica to pick, and b) whether to fall back to the primary if replicas are far behind.

Does that make sense, folks? Thanks!

yupeng9 commented Jul 10, 2024

@yupeng9 You can find some benchmarks here. The upshot is that if you properly load the system then the benefits of segment replication and not doing node-to-node copy outweigh the additional time spent waiting on remote uploads and result in throughput gains.

Thanks for sharing the benchmark. It's interesting to see that the latency of segment replication is lower than document replication; I guess it might be due to the consensus protocol for doc replication adding overhead.

I also agree with the other benefits mentioned for segment replication.

Interestingly enough, if we can leverage pull-based ingestion, we could skip uploading the translog, as we have the streams to replay history upon node recovery. This would improve latency and throughput because a request would not have to wait for the change to be flushed from memory to disk and then uploaded to remote storage.

In fact, this is what we do at Uber to ingest from Kafka: we only upload committed segments to remote, together with the offset of the last committed message. Upon recovery, we download the latest segments, rewind the offset, and re-ingest. So we have this optimization opportunity once we embrace pull-based ingestion.

sohami commented Jul 11, 2024

@sohami @andrross I think with the addition of segment-based replication, we have a gap to address with respect to setting preference=_replica. Since the process is asynchronous, replicas are no longer equal, and each could be in a different state with respect to being in sync with the primary (or primaries).

What we may need is to send the seqNo and/or primaryTerm along with the search request (I think we support that only for SearchHits at the moment, on the response side) so the coordinator can decide a) which replica to pick, and b) whether to fall back to the primary if replicas are far behind.

Does that make sense, folks? Thanks!

I think providing the seqNo/primaryTerm via the search request could be difficult for users, as they would now need to keep track of it. If the goal is to provide some sort of SLA around replica lag with segrep, then failing such replicas on the service side, so that they are not considered in search routing, could be the way to go. We have an existing mechanism which fails a replica if it lags by, say, 4 checkpoints. Ref here
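
That lag-based failure mechanism is part of the segment replication backpressure framework and is toggled via cluster settings; a rough sketch (the exact setting names here are my assumption, so verify against the linked code):

# fail replicas that fall more than 4 checkpoints behind the primary
curl -XPUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d '{
  "persistent": {
    "segrep.pressure.enabled": true,
    "segrep.pressure.checkpoint.limit": 4
  }
}'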

mch2 commented Jul 11, 2024

@sohami @andrross I think with the addition of segment-based replication, we have a gap to address with respect to setting preference=_replica. Since the process is asynchronous, replicas are no longer equal, and each could be in a different state with respect to being in sync with the primary (or primaries).
What we may need is to send the seqNo and/or primaryTerm along with the search request (I think we support that only for SearchHits at the moment, on the response side) so the coordinator can decide a) which replica to pick, and b) whether to fall back to the primary if replicas are far behind.
Does that make sense, folks? Thanks!

I think providing the seqNo/primaryTerm via the search request could be difficult for users, as they would now need to keep track of it. If the goal is to provide some sort of SLA around replica lag with segrep, then failing such replicas on the service side, so that they are not considered in search routing, could be the way to go. We have an existing mechanism which fails a replica if it lags by, say, 4 checkpoints. Ref here

Shard failure is an option but would be quite disruptive with tighter freshness requirements. I think a streaming index API gives a better experience here, where we notify that a particular seqNo has been replicated across all, or a portion of, the searchers.

Another search-time alternative that we discussed in this week's search meetup is accounting for replica state at the coordinator and routing accordingly, though this would be noisy for cluster state during steady-state indexing.

I know this is more of a discussion on replica freshness vs strong/eventual consistency, but for some more context: we explored supporting the existing _bulk wait/immediate refresh policy mechanisms at write time here, but this behavior depends on shards being able to internally refresh to make docs searchable, and it has scaling issues. More context on general consistency with segrep here.
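
For context, the write-time mechanism referenced there is the refresh parameter on _bulk; a minimal sketch:

# refresh=wait_for: the request returns only once its writes are visible
# to search -- with segrep this means waiting on replica catch-up, which
# is where the scaling issues mentioned above come from
curl -XPOST "localhost:9200/my-index/_bulk?refresh=wait_for" \
  -H 'Content-Type: application/json' \
  --data-binary $'{"index":{}}\n{"message":"hello"}\n'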

mch2 commented Jul 16, 2024

Hey everyone, I wanted to share the changes I'm using for these PoC benchmarks.

Code: https://github.com/mch2/OpenSearch/tree/rwsplit-benchmarks - namely this commit.
This allocates all replicas to search nodes and includes two new settings:

  1. cluster.routing.default.preference hardcodes the default routing preference. Set this to _replica when testing search-only nodes.
  2. index.pterm.check.enabled enables/disables the primary term check (no-op replication) on the write path. I disable this in tests for the r/w split (see the sketch after this list).
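
Against the PoC branch, these would be applied roughly like so (a sketch; both settings exist only in the linked branch, and whether the first is dynamic is an assumption):

# route all searches to replicas by default (PoC-branch setting)
curl -XPUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d '{
  "persistent": { "cluster.routing.default.preference": "_replica" }
}'

# disable the primary term check (no-op replication) on the write path
curl -XPUT "localhost:9200/test-index/_settings" -H 'Content-Type: application/json' -d '{
  "index.pterm.check.enabled": false
}'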

Workloads:
https://github.com/mch2/opensearch-benchmark-workloads
I've been using the http_logs workload with two new test procedures that run indexing and search concurrently.

  1. indexing-querying runs an initial 50% of writes, followed by the remaining 50% while running an expensive multi-term agg query. I've been using this to topple over nodes to test failure isolation.
  2. indexing-querying-all runs 50/50 and includes all queries in the normal http_logs workload. It hardcodes the multi-term agg so that it isn't as resource intensive. I use this for general throughput/latency comparisons.

cluster setup: https://github.com/mch2/opensearch-cluster-cdk/tree/benchmarking-update
adds an ASG for search nodes and configuration for adjusting its size and instance type. For example, to set up a cluster with 1 data node and 1 search node of different node types:

cdk deploy "*"  -c distributionUrl=https://github.com/mch2/TestActions/releases/download/test-search-routing/opensearch-min-2.15.1-SNAPSHOT-linux-arm64.tar.gz -c region=us-east-1 -c suffix=rwsplit-cluster -c securityDisabled=true -c cpuArch=arm64 -c singleNodeCluster=false -c distVersion=2.15.1 -c minDistribution=true -c dataInstanceType=c6g.xlarge -c searchInstanceType=r6g.xlarge -c use50PercentHeap=false -c clientNodeCount=2 -c dataNodeCount=1 -c searchNodeCount=1 -c enableRemoteStore=true --no-approval-required

The distributionUrl I included there will hold the artifact for the latest changes I'm working off of.

mch2 commented Jul 18, 2024

Here are the results of the PoC tests outlined by @sohami. Workload, cluster setup, and PoC changes are above.

Summary of results:

  • Initial PoC results show the specialized cluster achieves the expected workload/failure isolation and independent scalability.
  • On tuning to the workload: specialized clusters yielded 20-28% higher write throughput for the chosen benchmark workload, but showed significant variance in search latencies compared to uniform clusters. This is likely due to varying segment counts during the workload; intentionally, the workload does not merge to any specific segment count. However, search throughput does show roughly a 40-50% improvement for certain queries.
  • As expected, the no-op/pterm check on the write path needs to be removed, as it can hurt throughput.

Cluster Configuration:

3 cluster manager nodes
2 coordinator nodes
1 data and 1 searcher node (both c6g.large)
1 shard 1 replica.

Benchmark Configuration

All benchmark runs were executed with the indexing-querying test procedure with the following workload params. Failure simulation was achieved by adjusting the search_clients count to increase query load.

  • search_clients: increases search qps. Note that every +1 here increases concurrent queries by 14x, given the workload runs 14 queries concurrently.
  • search_iterations: 2000, iterations for each query.
  • initial_ingest_percentage: 50%, the amount of data to load before running concurrently. 50% is used throughout so we have enough data to simulate memory-intensive queries.
  • bulk_indexing_clients: 20, to simulate a significant write load.
  • index.pterm.check.enabled: disabled for r/w split runs, enabled for regular clusters.

Note: to save time I restored 50% of the load from a snapshot before running the concurrent step, so the graphs below will not show this. Normal indexing throughput reached is between 35-40k docs/s.

Results:

Scenario 1: Failure/workload isolation and scaling.

  1. Find a workload that overloads a primary while under indexing and search load.
    This is achieved with 3 search clients, which generates enough strain to reach ~60s query timeouts and topple over the primary within ~5 minutes. Load begins at ~18:35; throughput never reaches normal levels of 35-40k docs/s.

(Query latency and throughput screenshots attached.)

  2. Run the same workload against the specialized cluster:
    Query latency hovers around 35s but the cluster does not fall over. Write throughput reaches normal levels of ~40k docs/s, showing workload isolation. (Query latency screenshot attached.)

(_bulk throughput screenshot attached.)

  3. Fail a searcher on the specialized cluster. Search node failure was reached by increasing to 5 search clients per query. The node drops at roughly 9:39 below. Throughput remains steady, displaying failure isolation.

(Latency and throughput screenshots attached.)

  4. Add a search node. Note: in this execution I ran the initial 50% ingest. The dip in throughput around 19:45 is from a pause between workload steps. The workload completes successfully, displaying independent scaling. (Screenshots attached.)

Scenario 2: Scaling up search nodes.
I updated the search node to an r6g.large. Step 3 then succeeds as expected.

I think it would be interesting to compare the performance of two clusters: one with specialized nodes of different types, and one with all nodes of the same type and roles, but in both cases the total cost of all the nodes is the same. Then it would be interesting to see in what scenarios the specialized nodes can give better performance.

I've been running some experiments with this in mind using the indexing-querying-all test procedure. Note our specialized cluster does not have any failover mechanism in these benchmarks, so solutions like standby writers would add to its cost.

Comparison 1:
Baseline clusters: 4-node clusters of c6g.xlarge ($0.544/hr), m6g.xlarge ($0.616/hr), and r6g.xlarge ($0.8064/hr).
Contender: 2 indexers (c6g.xlarge) and 2 searchers (r6g.xlarge): $0.6752/hr.
Results: The specialized cluster achieves ~18% higher indexing throughput (while searching) compared to the baseline clusters. However, search latency was higher across the board, up to 60% for some query types, likely due to segment counts.

Comparison 2: Next I tried a run with a more appropriate sharding strategy, closer to 1.5 vCPU per node. I also tweaked the refresh interval from 1s to 10s and bumped the search threadpool size from the default to 50 on the search nodes, given they aren't indexing.
Baseline: 4 nodes, r6g.xlarge ($0.8064/hr), 6 shards, 1 replica.
Contender: 6 nodes: 3 c6g.xlarge, 3 r6g.xlarge ($1.0128/hr).
Results: The specialized cluster yielded 28% higher throughput while concurrently indexing. Search latencies showed quite a bit of variance but started to improve over the baseline; variance is expected here given we aren't force merging. Throughput shows a ~40% improvement for the default query and 50% for term. There is likely more tuning I'm missing that would be more optimal for the workload. Full results below:

------------------------------------------------------
    _______             __   _____
   / ____(_)___  ____ _/ /  / ___/_________  ________
  / /_  / / __ \/ __ `/ /   \__ \/ ___/ __ \/ ___/ _ \
 / __/ / / / / / /_/ / /   ___/ / /__/ /_/ / /  /  __/
/_/   /_/_/ /_/\__,_/_/   /____/\___/\____/_/   \___/
------------------------------------------------------

|                                                        Metric |                           Task |   Baseline |   Contender |     Diff |   Unit |
|--------------------------------------------------------------:|-------------------------------:|-----------:|------------:|---------:|-------:|
|                    Cumulative indexing time of primary shards |                                |    130.213 |     115.902 | -14.3113 |    min |
|             Min cumulative indexing time across primary shard |                                |    21.1971 |     18.1025 |  -3.0946 |    min |
|          Median cumulative indexing time across primary shard |                                |    21.7024 |     19.3585 | -2.34397 |    min |
|             Max cumulative indexing time across primary shard |                                |    22.1359 |     20.4479 | -1.68792 |    min |
|           Cumulative indexing throttle time of primary shards |                                |          0 |           0 |        0 |    min |
|    Min cumulative indexing throttle time across primary shard |                                |          0 |           0 |        0 |    min |
| Median cumulative indexing throttle time across primary shard |                                |          0 |           0 |        0 |    min |
|    Max cumulative indexing throttle time across primary shard |                                |          0 |           0 |        0 |    min |
|                       Cumulative merge time of primary shards |                                |    58.1833 |     49.1664 | -9.01692 |    min |
|                      Cumulative merge count of primary shards |                                |        305 |         305 |        0 |        |
|                Min cumulative merge time across primary shard |                                |    8.23887 |     7.30127 |  -0.9376 |    min |
|             Median cumulative merge time across primary shard |                                |    9.24085 |     7.56177 | -1.67908 |    min |
|                Max cumulative merge time across primary shard |                                |    11.8424 |     9.73745 | -2.10493 |    min |
|              Cumulative merge throttle time of primary shards |                                |    35.7409 |     30.5751 | -5.16577 |    min |
|       Min cumulative merge throttle time across primary shard |                                |    4.89218 |     4.38092 | -0.51127 |    min |
|    Median cumulative merge throttle time across primary shard |                                |    5.07107 |     4.58437 | -0.48669 |    min |
|       Max cumulative merge throttle time across primary shard |                                |    8.17898 |      6.3829 | -1.79608 |    min |
|                     Cumulative refresh time of primary shards |                                |    11.1164 |     9.70393 | -1.41248 |    min |
|                    Cumulative refresh count of primary shards |                                |        756 |         756 |        0 |        |
|              Min cumulative refresh time across primary shard |                                |    1.57233 |     1.59127 |  0.01893 |    min |
|           Median cumulative refresh time across primary shard |                                |    1.89271 |     1.62654 | -0.26617 |    min |
|              Max cumulative refresh time across primary shard |                                |    2.12252 |     1.63562 |  -0.4869 |    min |
|                       Cumulative flush time of primary shards |                                |    2.23885 |     1.06707 | -1.17178 |    min |
|                      Cumulative flush count of primary shards |                                |         14 |          14 |        0 |        |
|                Min cumulative flush time across primary shard |                                |   0.257417 |       0.084 | -0.17342 |    min |
|             Median cumulative flush time across primary shard |                                |     0.3732 |    0.178458 | -0.19474 |    min |
|                Max cumulative flush time across primary shard |                                |     0.4713 |    0.274083 | -0.19722 |    min |
|                                       Total Young Gen GC time |                                |      2.112 |      10.371 |    8.259 |      s |
|                                      Total Young Gen GC count |                                |        216 |         773 |      557 |        |
|                                         Total Old Gen GC time |                                |          0 |           0 |        0 |      s |
|                                        Total Old Gen GC count |                                |          0 |           0 |        0 |        |
|                                                    Store size |                                |    30.4872 |     30.6709 |  0.18372 |     GB |
|                                                 Translog size |                                |   0.280385 |    0.204366 | -0.07602 |     GB |
|                                        Heap used for segments |                                |          0 |           0 |        0 |     MB |
|                                      Heap used for doc values |                                |          0 |           0 |        0 |     MB |
|                                           Heap used for terms |                                |          0 |           0 |        0 |     MB |
|                                           Heap used for norms |                                |          0 |           0 |        0 |     MB |
|                                          Heap used for points |                                |          0 |           0 |        0 |     MB |
|                                   Heap used for stored fields |                                |          0 |           0 |        0 |     MB |
|                                                 Segment count |                                |        194 |         199 |        5 |        |
|                                                Min Throughput |           initial-index-append |     127314 |      118987 | -8326.96 | docs/s |
|                                               Mean Throughput |           initial-index-append |     131901 |      121796 | -10104.9 | docs/s |
|                                             Median Throughput |           initial-index-append |     131447 |      121149 | -10298.2 | docs/s |
|                                                Max Throughput |           initial-index-append |     137581 |      127116 |   -10465 | docs/s |
|                                       50th percentile latency |           initial-index-append |    717.688 |     765.043 |  47.3552 |     ms |
|                                       90th percentile latency |           initial-index-append |    1104.94 |     1146.56 |  41.6175 |     ms |
|                                       99th percentile latency |           initial-index-append |    1427.02 |     1421.82 | -5.20365 |     ms |
|                                     99.9th percentile latency |           initial-index-append |    1777.81 |     1755.99 | -21.8169 |     ms |
|                                    99.99th percentile latency |           initial-index-append |    2063.07 |     2258.95 |  195.888 |     ms |
|                                      100th percentile latency |           initial-index-append |    2073.21 |      2296.2 |  222.988 |     ms |
|                                  50th percentile service time |           initial-index-append |    717.688 |      764.94 |  47.2516 |     ms |
|                                  90th percentile service time |           initial-index-append |    1104.94 |     1146.63 |  41.6859 |     ms |
|                                  99th percentile service time |           initial-index-append |    1427.02 |     1421.82 | -5.20365 |     ms |
|                                99.9th percentile service time |           initial-index-append |    1777.81 |     1755.99 | -21.8169 |     ms |
|                               99.99th percentile service time |           initial-index-append |    2063.07 |     2258.95 |  195.888 |     ms |
|                                 100th percentile service time |           initial-index-append |    2073.21 |      2296.2 |  222.988 |     ms |
|                                                    error rate |           initial-index-append |          0 |           0 |        0 |      % |
|                                                Min Throughput |        concurrent-index-append |    4048.48 |     57235.1 |  53186.6 | docs/s |
|                                               Mean Throughput |        concurrent-index-append |    94505.2 |      121798 |  27292.3 | docs/s |
|                                             Median Throughput |        concurrent-index-append |      97105 |      121384 |  24279.1 | docs/s |
|                                                Max Throughput |        concurrent-index-append |     108285 |      131655 |  23370.6 | docs/s |
|                                       50th percentile latency |        concurrent-index-append |    887.651 |     803.963 | -83.6882 |     ms |
|                                       90th percentile latency |        concurrent-index-append |    1199.64 |      1142.3 | -57.3318 |     ms |
|                                       99th percentile latency |        concurrent-index-append |    1874.87 |     1448.45 | -426.416 |     ms |
|                                     99.9th percentile latency |        concurrent-index-append |    3913.57 |     1902.21 | -2011.36 |     ms |
|                                    99.99th percentile latency |        concurrent-index-append |     5748.3 |     2371.72 | -3376.59 |     ms |
|                                      100th percentile latency |        concurrent-index-append |    6448.95 |     2383.27 | -4065.68 |     ms |
|                                  50th percentile service time |        concurrent-index-append |    887.651 |     803.963 | -83.6882 |     ms |
|                                  90th percentile service time |        concurrent-index-append |    1199.64 |      1142.3 | -57.3318 |     ms |
|                                  99th percentile service time |        concurrent-index-append |    1874.87 |     1448.45 | -426.416 |     ms |
|                                99.9th percentile service time |        concurrent-index-append |    3913.57 |     1902.21 | -2011.36 |     ms |
|                               99.99th percentile service time |        concurrent-index-append |     5748.3 |     2371.72 | -3376.59 |     ms |
|                                 100th percentile service time |        concurrent-index-append |    6448.95 |     2383.27 | -4065.68 |     ms |
|                                                    error rate |        concurrent-index-append |          0 |           0 |        0 |      % |
|                                                Min Throughput |                 multi_term_agg |   0.643752 |    0.680205 |  0.03645 |  ops/s |
|                                               Mean Throughput |                 multi_term_agg |    2.34316 |     1.80455 | -0.53862 |  ops/s |
|                                             Median Throughput |                 multi_term_agg |    2.50054 |     1.49541 | -1.00512 |  ops/s |
|                                                Max Throughput |                 multi_term_agg |    3.31976 |     3.41526 |   0.0955 |  ops/s |
|                                       50th percentile latency |                 multi_term_agg |    247.519 |     184.612 | -62.9069 |     ms |
|                                       90th percentile latency |                 multi_term_agg |    408.567 |     537.188 |  128.621 |     ms |
|                                       99th percentile latency |                 multi_term_agg |    1836.93 |     2512.96 |  676.022 |     ms |
|                                      100th percentile latency |                 multi_term_agg |    2688.42 |     2901.71 |  213.291 |     ms |
|                                  50th percentile service time |                 multi_term_agg |    247.519 |     184.612 | -62.9069 |     ms |
|                                  90th percentile service time |                 multi_term_agg |    408.567 |     537.188 |  128.621 |     ms |
|                                  99th percentile service time |                 multi_term_agg |    1836.93 |     2512.96 |  676.022 |     ms |
|                                 100th percentile service time |                 multi_term_agg |    2688.42 |     2901.71 |  213.291 |     ms |
|                                                    error rate |                 multi_term_agg |          0 |           0 |        0 |      % |
|                                                Min Throughput |                        default |   0.313678 |     4.57656 |  4.26288 |  ops/s |
|                                               Mean Throughput |                        default |    17.9215 |     25.0105 |  7.08898 |  ops/s |
|                                             Median Throughput |                        default |    19.9914 |     26.7474 |  6.75599 |  ops/s |
|                                                Max Throughput |                        default |    22.6138 |     42.3027 |  19.6889 |  ops/s |
|                                       50th percentile latency |                        default |    277.477 |     111.678 | -165.799 |     ms |
|                                       90th percentile latency |                        default |    475.837 |     262.068 | -213.769 |     ms |
|                                       99th percentile latency |                        default |    728.193 |     1358.29 |  630.101 |     ms |
|                                     99.9th percentile latency |                        default |    3509.28 |     2530.93 |  -978.35 |     ms |
|                                      100th percentile latency |                        default |    3647.16 |     2796.07 | -851.091 |     ms |
|                                  50th percentile service time |                        default |    277.477 |     111.678 | -165.799 |     ms |
|                                  90th percentile service time |                        default |    475.837 |     262.068 | -213.769 |     ms |
|                                  99th percentile service time |                        default |    728.193 |     1358.29 |  630.101 |     ms |
|                                99.9th percentile service time |                        default |    3509.28 |     2530.93 |  -978.35 |     ms |
|                                 100th percentile service time |                        default |    3647.16 |     2796.07 | -851.091 |     ms |
|                                                    error rate |                        default |          0 |           0 |        0 |      % |
|                                                Min Throughput |                           term |   0.273971 |     3.39619 |  3.12222 |  ops/s |
|                                               Mean Throughput |                           term |     15.768 |      24.175 |    8.407 |  ops/s |
|                                             Median Throughput |                           term |    17.7638 |     25.7935 |  8.02969 |  ops/s |
|                                                Max Throughput |                           term |    20.3024 |      41.699 |  21.3966 |  ops/s |
|                                       50th percentile latency |                           term |    303.362 |     115.058 | -188.304 |     ms |
|                                       90th percentile latency |                           term |    553.245 |     252.345 |   -300.9 |     ms |
|                                       99th percentile latency |                           term |    916.694 |     1351.81 |  435.116 |     ms |
|                                     99.9th percentile latency |                           term |    3648.24 |     2892.84 | -755.407 |     ms |
|                                      100th percentile latency |                           term |    3667.08 |     3651.04 | -16.0481 |     ms |
|                                  50th percentile service time |                           term |    303.362 |     115.058 | -188.304 |     ms |
|                                  90th percentile service time |                           term |    553.245 |     252.345 |   -300.9 |     ms |
|                                  99th percentile service time |                           term |    916.694 |     1351.81 |  435.116 |     ms |
|                                99.9th percentile service time |                           term |    3648.24 |     2892.84 | -755.407 |     ms |
|                                 100th percentile service time |                           term |    3667.08 |     3651.04 | -16.0481 |     ms |
|                                                    error rate |                           term |          0 |           0 |        0 |      % |
|                                                Min Throughput |                          range |   0.318561 |     3.99246 |   3.6739 |  ops/s |
|                                               Mean Throughput |                          range |    17.3427 |     25.0957 |  7.75301 |  ops/s |
|                                             Median Throughput |                          range |    18.9655 |     26.7652 |  7.79972 |  ops/s |
|                                                Max Throughput |                          range |    22.2434 |     41.8312 |  19.5877 |  ops/s |
|                                       50th percentile latency |                          range |    280.534 |      112.36 | -168.174 |     ms |
|                                       90th percentile latency |                          range |    493.086 |     258.849 | -234.236 |     ms |
|                                       99th percentile latency |                          range |    742.171 |      1357.6 |  615.425 |     ms |
|                                     99.9th percentile latency |                          range |    3503.08 |     2794.61 | -708.474 |     ms |
|                                      100th percentile latency |                          range |    3643.28 |     3800.24 |  156.957 |     ms |
|                                  50th percentile service time |                          range |    280.534 |      112.36 | -168.174 |     ms |
|                                  90th percentile service time |                          range |    493.086 |     258.849 | -234.236 |     ms |
|                                  99th percentile service time |                          range |    742.171 |      1357.6 |  615.425 |     ms |
|                                99.9th percentile service time |                          range |    3503.08 |     2794.61 | -708.474 |     ms |
|                                 100th percentile service time |                          range |    3643.28 |     3800.24 |  156.957 |     ms |
|                                                    error rate |                          range |          0 |           0 |        0 |      % |
|                                                Min Throughput |                  200s-in-range |    4.90086 |     5.25107 |   0.3502 |  ops/s |
|                                               Mean Throughput |                  200s-in-range |     26.761 |     27.2229 |  0.46194 |  ops/s |
|                                             Median Throughput |                  200s-in-range |     30.369 |     29.3196 | -1.04942 |  ops/s |
|                                                Max Throughput |                  200s-in-range |    32.0328 |     44.6883 |  12.6555 |  ops/s |
|                                       50th percentile latency |                  200s-in-range |    197.482 |     105.716 | -91.7651 |     ms |
|                                       90th percentile latency |                  200s-in-range |    355.069 |     242.099 | -112.969 |     ms |
|                                       99th percentile latency |                  200s-in-range |    598.682 |      1350.8 |   752.12 |     ms |
|                                     99.9th percentile latency |                  200s-in-range |    3010.42 |     2875.54 | -134.887 |     ms |
|                                      100th percentile latency |                  200s-in-range |    3454.59 |     2923.99 | -530.603 |     ms |
|                                  50th percentile service time |                  200s-in-range |    197.482 |     105.716 | -91.7651 |     ms |
|                                  90th percentile service time |                  200s-in-range |    355.069 |     242.099 | -112.969 |     ms |
|                                  99th percentile service time |                  200s-in-range |    598.682 |      1350.8 |   752.12 |     ms |
|                                99.9th percentile service time |                  200s-in-range |    3010.42 |     2875.54 | -134.887 |     ms |
|                                 100th percentile service time |                  200s-in-range |    3454.59 |     2923.99 | -530.603 |     ms |
|                                                    error rate |                  200s-in-range |          0 |           0 |        0 |      % |
|                                                Min Throughput |                  400s-in-range |    3.63781 |       4.259 |  0.62119 |  ops/s |
|                                               Mean Throughput |                  400s-in-range |    26.6012 |     28.1788 |  1.57762 |  ops/s |
|                                             Median Throughput |                  400s-in-range |    30.0655 |      30.305 |  0.23949 |  ops/s |
|                                                Max Throughput |                  400s-in-range |    32.5541 |     45.2136 |  12.6595 |  ops/s |
|                                       50th percentile latency |                  400s-in-range |    191.478 |     102.386 | -89.0912 |     ms |
|                                       90th percentile latency |                  400s-in-range |    350.886 |     246.587 | -104.299 |     ms |
|                                       99th percentile latency |                  400s-in-range |    650.305 |     1354.49 |  704.185 |     ms |
|                                     99.9th percentile latency |                  400s-in-range |    2765.77 |     2860.45 |  94.6778 |     ms |
|                                      100th percentile latency |                  400s-in-range |    2810.37 |     2937.84 |   127.47 |     ms |
|                                  50th percentile service time |                  400s-in-range |    191.478 |     102.386 | -89.0912 |     ms |
|                                  90th percentile service time |                  400s-in-range |    350.886 |     246.587 | -104.299 |     ms |
|                                  99th percentile service time |                  400s-in-range |    650.305 |     1354.49 |  704.185 |     ms |
|                                99.9th percentile service time |                  400s-in-range |    2765.77 |     2860.45 |  94.6778 |     ms |
|                                 100th percentile service time |                  400s-in-range |    2810.37 |     2937.84 |   127.47 |     ms |
|                                                    error rate |                  400s-in-range |          0 |           0 |        0 |      % |
|                                                Min Throughput |                     hourly_agg |    3.71854 |     0.86257 | -2.85597 |  ops/s |
|                                               Mean Throughput |                     hourly_agg |    24.0515 |     24.1455 |  0.09395 |  ops/s |
|                                             Median Throughput |                     hourly_agg |    26.8234 |     26.3711 |  -0.4523 |  ops/s |
|                                                Max Throughput |                     hourly_agg |    28.4592 |     39.3035 |  10.8443 |  ops/s |
|                                       50th percentile latency |                     hourly_agg |    221.184 |      123.07 | -98.1142 |     ms |
|                                       90th percentile latency |                     hourly_agg |    381.459 |     251.196 | -130.264 |     ms |
|                                       99th percentile latency |                     hourly_agg |      749.2 |     1373.69 |  624.489 |     ms |
|                                     99.9th percentile latency |                     hourly_agg |    2844.31 |     2921.55 |  77.2387 |     ms |
|                                      100th percentile latency |                     hourly_agg |    3142.43 |      4978.6 |  1836.17 |     ms |
|                                  50th percentile service time |                     hourly_agg |    221.184 |      123.07 | -98.1142 |     ms |
|                                  90th percentile service time |                     hourly_agg |    381.459 |     251.196 | -130.264 |     ms |
|                                  99th percentile service time |                     hourly_agg |      749.2 |     1373.69 |  624.489 |     ms |
|                                99.9th percentile service time |                     hourly_agg |    2844.31 |     2921.55 |  77.2387 |     ms |
|                                 100th percentile service time |                     hourly_agg |    3142.43 |      4978.6 |  1836.17 |     ms |
|                                                    error rate |                     hourly_agg |          0 |           0 |        0 |      % |
|                                                Min Throughput |              multi-term-filter |   0.274136 |     0.53585 |  0.26171 |  ops/s |
|                                               Mean Throughput |              multi-term-filter |     15.835 |     22.3251 |  6.49017 |  ops/s |
|                                             Median Throughput |              multi-term-filter |    17.7293 |     24.4579 |  6.72867 |  ops/s |
|                                                Max Throughput |              multi-term-filter |    20.2603 |     39.0992 |  18.8389 |  ops/s |
|                                       50th percentile latency |              multi-term-filter |    308.862 |     122.542 |  -186.32 |     ms |
|                                       90th percentile latency |              multi-term-filter |    558.644 |     256.422 | -302.222 |     ms |
|                                       99th percentile latency |              multi-term-filter |    916.885 |        1490 |  573.116 |     ms |
|                                     99.9th percentile latency |              multi-term-filter |    3648.72 |     3670.59 |  21.8657 |     ms |
|                                      100th percentile latency |              multi-term-filter |    3651.48 |     4022.65 |  371.167 |     ms |
|                                  50th percentile service time |              multi-term-filter |    308.862 |     122.542 |  -186.32 |     ms |
|                                  90th percentile service time |              multi-term-filter |    558.644 |     256.422 | -302.222 |     ms |
|                                  99th percentile service time |              multi-term-filter |    916.885 |        1490 |  573.116 |     ms |
|                                99.9th percentile service time |              multi-term-filter |    3648.72 |     3670.59 |  21.8657 |     ms |
|                                 100th percentile service time |              multi-term-filter |    3651.48 |     4022.65 |  371.167 |     ms |
|                                                    error rate |              multi-term-filter |          0 |           0 |        0 |      % |
|                                                Min Throughput |                  asc_sort_size |   0.312739 |    0.415774 |  0.10304 |  ops/s |
|                                               Mean Throughput |                  asc_sort_size |     16.751 |     19.2261 |  2.47506 |  ops/s |
|                                             Median Throughput |                  asc_sort_size |    18.8039 |     19.3454 |  0.54158 |  ops/s |
|                                                Max Throughput |                  asc_sort_size |    21.4151 |     36.3675 |  14.9523 |  ops/s |
|                                       50th percentile latency |                  asc_sort_size |    289.051 |     125.394 | -163.657 |     ms |
|                                       90th percentile latency |                  asc_sort_size |    505.538 |     326.661 | -178.877 |     ms |
|                                       99th percentile latency |                  asc_sort_size |     794.98 |     1613.72 |  818.741 |     ms |
|                                     99.9th percentile latency |                  asc_sort_size |    3679.19 |     2860.79 | -818.407 |     ms |
|                                      100th percentile latency |                  asc_sort_size |    3693.35 |     2991.33 | -702.016 |     ms |
|                                  50th percentile service time |                  asc_sort_size |    289.051 |     125.394 | -163.657 |     ms |
|                                  90th percentile service time |                  asc_sort_size |    505.538 |     326.661 | -178.877 |     ms |
|                                  99th percentile service time |                  asc_sort_size |     794.98 |     1613.72 |  818.741 |     ms |
|                                99.9th percentile service time |                  asc_sort_size |    3679.19 |     2860.79 | -818.407 |     ms |
|                                 100th percentile service time |                  asc_sort_size |    3693.35 |     2991.33 | -702.016 |     ms |
|                                                    error rate |                  asc_sort_size |          0 |           0 |        0 |      % |
|                                                Min Throughput |            desc_sort_timestamp |    0.26801 |   0.0553743 | -0.21264 |  ops/s |
|                                               Mean Throughput |            desc_sort_timestamp |    10.0046 |     3.99463 | -6.01002 |  ops/s |
|                                             Median Throughput |            desc_sort_timestamp |    10.5564 |     4.57789 | -5.97851 |  ops/s |
|                                                Max Throughput |            desc_sort_timestamp |    11.2353 |     4.83491 | -6.40038 |  ops/s |
|                                       50th percentile latency |            desc_sort_timestamp |    554.392 |     1259.88 |  705.491 |     ms |
|                                       90th percentile latency |            desc_sort_timestamp |    966.434 |     1648.89 |  682.455 |     ms |
|                                       99th percentile latency |            desc_sort_timestamp |    1376.64 |     2646.68 |  1270.04 |     ms |
|                                     99.9th percentile latency |            desc_sort_timestamp |    3709.02 |     18012.1 |  14303.1 |     ms |
|                                      100th percentile latency |            desc_sort_timestamp |    3729.76 |     18968.7 |    15239 |     ms |
|                                  50th percentile service time |            desc_sort_timestamp |    554.392 |     1259.88 |  705.491 |     ms |
|                                  90th percentile service time |            desc_sort_timestamp |    966.434 |     1648.89 |  682.455 |     ms |
|                                  99th percentile service time |            desc_sort_timestamp |    1376.64 |     2646.68 |  1270.04 |     ms |
|                                99.9th percentile service time |            desc_sort_timestamp |    3709.02 |     18012.1 |  14303.1 |     ms |
|                                 100th percentile service time |            desc_sort_timestamp |    3729.76 |     18968.7 |    15239 |     ms |
|                                                    error rate |            desc_sort_timestamp |          0 |           0 |        0 |      % |
|                                                Min Throughput |                 desc_sort_size |   0.311762 |    0.274529 | -0.03723 |  ops/s |
|                                               Mean Throughput |                 desc_sort_size |    16.2587 |     16.2429 | -0.01572 |  ops/s |
|                                             Median Throughput |                 desc_sort_size |      18.02 |     17.1233 | -0.89662 |  ops/s |
|                                                Max Throughput |                 desc_sort_size |    20.2308 |     29.9714 |  9.74054 |  ops/s |
|                                       50th percentile latency |                 desc_sort_size |    308.556 |     178.623 | -129.933 |     ms |
|                                       90th percentile latency |                 desc_sort_size |    510.406 |     357.271 | -153.135 |     ms |
|                                       99th percentile latency |                 desc_sort_size |    852.202 |     1294.15 |   441.95 |     ms |
|                                     99.9th percentile latency |                 desc_sort_size |     3727.5 |     4348.98 |  621.476 |     ms |
|                                      100th percentile latency |                 desc_sort_size |    3750.07 |     5205.57 |   1455.5 |     ms |
|                                  50th percentile service time |                 desc_sort_size |    308.556 |     178.623 | -129.933 |     ms |
|                                  90th percentile service time |                 desc_sort_size |    510.406 |     357.271 | -153.135 |     ms |
|                                  99th percentile service time |                 desc_sort_size |    852.202 |     1294.15 |   441.95 |     ms |
|                                99.9th percentile service time |                 desc_sort_size |     3727.5 |     4348.98 |  621.476 |     ms |
|                                 100th percentile service time |                 desc_sort_size |    3750.07 |     5205.57 |   1455.5 |     ms |
|                                                    error rate |                 desc_sort_size |          0 |           0 |        0 |      % |
|                                                Min Throughput |             asc_sort_timestamp |   0.268091 |    0.120874 | -0.14722 |  ops/s |
|                                               Mean Throughput |             asc_sort_timestamp |    13.9931 |     13.6757 | -0.31741 |  ops/s |
|                                             Median Throughput |             asc_sort_timestamp |    15.7477 |     13.2429 | -2.50479 |  ops/s |
|                                                Max Throughput |             asc_sort_timestamp |    18.4789 |     30.5706 |  12.0917 |  ops/s |
|                                       50th percentile latency |             asc_sort_timestamp |    342.675 |     135.164 |  -207.51 |     ms |
|                                       90th percentile latency |             asc_sort_timestamp |    609.821 |     349.238 | -260.583 |     ms |
|                                       99th percentile latency |             asc_sort_timestamp |      990.2 |     1343.57 |  353.368 |     ms |
|                                     99.9th percentile latency |             asc_sort_timestamp |    3743.16 |     8271.48 |  4528.31 |     ms |
|                                      100th percentile latency |             asc_sort_timestamp |    3745.12 |     8271.86 |  4526.74 |     ms |
|                                  50th percentile service time |             asc_sort_timestamp |    342.675 |     135.164 |  -207.51 |     ms |
|                                  90th percentile service time |             asc_sort_timestamp |    609.821 |     349.238 | -260.583 |     ms |
|                                  99th percentile service time |             asc_sort_timestamp |      990.2 |     1343.57 |  353.368 |     ms |
|                                99.9th percentile service time |             asc_sort_timestamp |    3743.16 |     8271.48 |  4528.31 |     ms |
|                                 100th percentile service time |             asc_sort_timestamp |    3745.12 |     8271.86 |  4526.74 |     ms |
|                                                    error rate |             asc_sort_timestamp |          0 |           0 |        0 |      % |
|                                                Min Throughput | desc_sort_with_after_timestamp |   0.308755 |   0.0570799 | -0.25168 |  ops/s |
|                                               Mean Throughput | desc_sort_with_after_timestamp |    8.61967 |     3.68472 | -4.93495 |  ops/s |
|                                             Median Throughput | desc_sort_with_after_timestamp |    9.05854 |     4.19652 | -4.86202 |  ops/s |
|                                                Max Throughput | desc_sort_with_after_timestamp |    9.72995 |     4.67605 |  -5.0539 |  ops/s |
|                                       50th percentile latency | desc_sort_with_after_timestamp |    625.733 |     1337.33 |  711.594 |     ms |
|                                       90th percentile latency | desc_sort_with_after_timestamp |    1202.98 |     1618.75 |  415.766 |     ms |
|                                       99th percentile latency | desc_sort_with_after_timestamp |    1727.33 |     2940.55 |  1213.22 |     ms |
|                                     99.9th percentile latency | desc_sort_with_after_timestamp |    3614.63 |     17792.6 |    14178 |     ms |
|                                      100th percentile latency | desc_sort_with_after_timestamp |    3969.47 |     19920.2 |  15950.8 |     ms |
|                                  50th percentile service time | desc_sort_with_after_timestamp |    625.733 |     1337.33 |  711.594 |     ms |
|                                  90th percentile service time | desc_sort_with_after_timestamp |    1202.98 |     1618.75 |  415.766 |     ms |
|                                  99th percentile service time | desc_sort_with_after_timestamp |    1727.33 |     2940.55 |  1213.22 |     ms |
|                                99.9th percentile service time | desc_sort_with_after_timestamp |    3614.63 |     17792.6 |    14178 |     ms |
|                                 100th percentile service time | desc_sort_with_after_timestamp |    3969.47 |     19920.2 |  15950.8 |     ms |
|                                                    error rate | desc_sort_with_after_timestamp |          0 |           0 |        0 |      % |
|                                                Min Throughput |  asc_sort_with_after_timestamp |   0.317337 |    0.286228 | -0.03111 |  ops/s |
|                                               Mean Throughput |  asc_sort_with_after_timestamp |    16.2102 |     16.7537 |  0.54349 |  ops/s |
|                                             Median Throughput |  asc_sort_with_after_timestamp |    17.7013 |     17.2534 | -0.44796 |  ops/s |
|                                                Max Throughput |  asc_sort_with_after_timestamp |    20.1866 |      32.416 |  12.2295 |  ops/s |
|                                       50th percentile latency |  asc_sort_with_after_timestamp |    310.214 |     149.509 | -160.705 |     ms |
|                                       90th percentile latency |  asc_sort_with_after_timestamp |    516.761 |     332.227 | -184.534 |     ms |
|                                       99th percentile latency |  asc_sort_with_after_timestamp |    809.193 |     1620.48 |  811.288 |     ms |
|                                     99.9th percentile latency |  asc_sort_with_after_timestamp |    3371.72 |     3568.04 |  196.321 |     ms |
|                                      100th percentile latency |  asc_sort_with_after_timestamp |    3680.64 |     3715.23 |  34.5886 |     ms |
|                                  50th percentile service time |  asc_sort_with_after_timestamp |    310.214 |     149.509 | -160.705 |     ms |
|                                  90th percentile service time |  asc_sort_with_after_timestamp |    516.761 |     332.227 | -184.534 |     ms |
|                                  99th percentile service time |  asc_sort_with_after_timestamp |    809.193 |     1620.48 |  811.288 |     ms |
|                                99.9th percentile service time |  asc_sort_with_after_timestamp |    3371.72 |     3568.04 |  196.321 |     ms |
|                                 100th percentile service time |  asc_sort_with_after_timestamp |    3680.64 |     3715.23 |  34.5886 |     ms |
|                                                    error rate |  asc_sort_with_after_timestamp |          0 |           0 |        0 |      % |


@reta
Collaborator

reta commented Jul 18, 2024

Thanks a lot @mch2

Search latencies showed quite a bit of variance but started to improve over the baseline. Variance is expected here given we aren't force merging; throughput shows a ~40% improvement for the default query and ~50% for term.

Quick question: did we enable concurrent search (the search.concurrent_segment_search.enabled cluster setting), or were we using the defaults?

@mch2
Member

mch2 commented Jul 18, 2024

> Thanks a lot @mch2
>
> Search latencies showed quite a bit of variance but started to improve over the baseline. Variance is expected here given we aren't force merging; throughput shows a ~40% improvement for the default query and ~50% for term.
>
> Quick question: did we enable concurrent search (the search.concurrent_segment_search.enabled cluster setting), or were we using the defaults?

This is only with the defaults. I will kick off some runs with concurrent search for both clusters and share.
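
For reference, the dynamic cluster setting named above can be toggled through the cluster settings API without a restart; a minimal sketch, assuming the cluster is reachable on localhost:9200 (a placeholder endpoint):

```sh
# Enable concurrent segment search cluster-wide (dynamic cluster setting).
curl -X PUT "localhost:9200/_cluster/settings" \
  -H 'Content-Type: application/json' \
  -d '{"persistent": {"search.concurrent_segment_search.enabled": true}}'
```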

As a next step I think we can start drafting a lower-level design where we can answer the following:

  1. Properly update allocation paths so that replicas are assigned to separate nodes (a quick verification sketch follows this list).
  2. Disable the existing primary/replica failover and replace it with a primary-only failover mechanism. In the end state we can achieve this by leveraging an event stream to replay docs, but to start I think writer-replicas are a good option.
  3. Remove the primary term validation on writes that currently couples the write path to replicas. This validation (also referred to as no-op replication) significantly impacts write latency/throughput when replicas are overloaded.
  4. As a last step, package and deploy an Indexer and a Searcher separately.
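
For (1), once replica placement is updated, a quick sanity check could be to confirm that replica copies only appear on search nodes. A minimal sketch using the existing _cat/shards API, assuming the cluster is reachable on localhost:9200 and search nodes are identifiable by name (both placeholders):

```sh
# List shard placement; replica rows (prirep = r) should map only to search nodes,
# e.g. nodes following a hypothetical "search-node-*" naming convention.
curl -s "localhost:9200/_cat/shards?v&h=index,shard,prirep,state,node"
```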

@mch2
Member

mch2 commented Jul 19, 2024

@reta Thanks for the callout on concurrent search. With this enabled, results are much more favorable for the specialized cluster in all but two sort queries, desc_sort_timestamp and desc_sort_with_after_timestamp.

The contender here is the cluster with read/write separation (rwsplit).
To recap some settings:

  1. concurrent search is enabled.
  2. 7 search clients per query, 20 bulk indexing clients, and a batch size of 500 docs.
  3. search thread pool bumped to 50 on the search-only cluster (see the sketch after this list).
  4. primary term (pterm) validation check disabled.
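
Not from the thread itself, just a sketch of how items 2 and 3 could be applied: the thread-pool line is a standard static setting in opensearch.yml, and the benchmark invocation assumes an http_logs-style workload whose parameter names (bulk_indexing_clients, bulk_size, search_clients) and target host are assumptions that may differ from the exact workload used here:

```sh
# Static thread pool setting applied to the search-only nodes before startup;
# the config path is a placeholder for your installation's location.
echo "thread_pool.search.size: 50" >> /etc/opensearch/opensearch.yml

# Drive the run with the client counts from the recap (placeholder host/params).
opensearch-benchmark execute-test \
  --pipeline=benchmark-only \
  --workload=http_logs \
  --target-hosts=localhost:9200 \
  --workload-params="bulk_indexing_clients:20,bulk_size:500,search_clients:7"
```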

Max CPU on both clusters hits 100%, while min CPU fluctuates between 90% and 100%.

------------------------------------------------------
    _______             __   _____
   / ____(_)___  ____ _/ /  / ___/_________  ________
  / /_  / / __ \/ __ `/ /   \__ \/ ___/ __ \/ ___/ _ \
 / __/ / / / / / /_/ / /   ___/ / /__/ /_/ / /  /  __/
/_/   /_/_/ /_/\__,_/_/   /____/\___/\____/_/   \___/
------------------------------------------------------

|                                                        Metric |                           Task |   Baseline |   Contender |     Diff |   Unit |
|--------------------------------------------------------------:|-------------------------------:|-----------:|------------:|---------:|-------:|
|                    Cumulative indexing time of primary shards |                                |    131.557 |     107.692 | -23.8657 |    min |
|             Min cumulative indexing time across primary shard |                                |    20.6969 |     16.6308 | -4.06608 |    min |
|          Median cumulative indexing time across primary shard |                                |    22.3619 |     16.9826 | -5.37929 |    min |
|             Max cumulative indexing time across primary shard |                                |    22.6726 |     20.4617 | -2.21083 |    min |
|           Cumulative indexing throttle time of primary shards |                                |          0 |           0 |        0 |    min |
|    Min cumulative indexing throttle time across primary shard |                                |          0 |           0 |        0 |    min |
| Median cumulative indexing throttle time across primary shard |                                |          0 |           0 |        0 |    min |
|    Max cumulative indexing throttle time across primary shard |                                |          0 |           0 |        0 |    min |
|                       Cumulative merge time of primary shards |                                |    52.5284 |     50.1962 | -2.33217 |    min |
|                      Cumulative merge count of primary shards |                                |        308 |         293 |      -15 |        |
|                Min cumulative merge time across primary shard |                                |    6.77522 |     7.53618 |  0.76097 |    min |
|             Median cumulative merge time across primary shard |                                |    7.79866 |     7.92185 |  0.12319 |    min |
|                Max cumulative merge time across primary shard |                                |    11.5903 |     10.3832 | -1.20712 |    min |
|              Cumulative merge throttle time of primary shards |                                |    31.9221 |     32.2007 |  0.27853 |    min |
|       Min cumulative merge throttle time across primary shard |                                |    3.94138 |     4.53032 |  0.58893 |    min |
|    Median cumulative merge throttle time across primary shard |                                |    4.47705 |     5.07282 |  0.59577 |    min |
|       Max cumulative merge throttle time across primary shard |                                |    7.41687 |     7.11455 | -0.30232 |    min |
|                     Cumulative refresh time of primary shards |                                |     10.874 |     9.82985 |  -1.0441 |    min |
|                    Cumulative refresh count of primary shards |                                |        764 |         738 |      -26 |        |
|              Min cumulative refresh time across primary shard |                                |     1.5555 |     1.48273 | -0.07277 |    min |
|           Median cumulative refresh time across primary shard |                                |    1.88017 |     1.66124 | -0.21893 |    min |
|              Max cumulative refresh time across primary shard |                                |    2.00017 |     1.79798 | -0.20218 |    min |
|                       Cumulative flush time of primary shards |                                |    1.65658 |      0.9718 | -0.68478 |    min |
|                      Cumulative flush count of primary shards |                                |         13 |          13 |        0 |        |
|                Min cumulative flush time across primary shard |                                |   0.126417 |      0.1263 | -0.00012 |    min |
|             Median cumulative flush time across primary shard |                                |   0.216475 |    0.156983 | -0.05949 |    min |
|                Max cumulative flush time across primary shard |                                |     0.6117 |    0.214933 | -0.39677 |    min |
|                                       Total Young Gen GC time |                                |      1.813 |       9.899 |    8.086 |      s |
|                                      Total Young Gen GC count |                                |        202 |         770 |      568 |        |
|                                         Total Old Gen GC time |                                |          0 |           0 |        0 |      s |
|                                        Total Old Gen GC count |                                |          0 |           0 |        0 |        |
|                                                    Store size |                                |    30.4746 |     30.0966 | -0.37796 |     GB |
|                                                 Translog size |                                |   0.447467 |    0.214929 | -0.23254 |     GB |
|                                        Heap used for segments |                                |          0 |           0 |        0 |     MB |
|                                      Heap used for doc values |                                |          0 |           0 |        0 |     MB |
|                                           Heap used for terms |                                |          0 |           0 |        0 |     MB |
|                                           Heap used for norms |                                |          0 |           0 |        0 |     MB |
|                                          Heap used for points |                                |          0 |           0 |        0 |     MB |
|                                   Heap used for stored fields |                                |          0 |           0 |        0 |     MB |
|                                                 Segment count |                                |        208 |         219 |       11 |        |
|                                                Min Throughput |           initial-index-append |     120781 |      112613 | -8167.16 | docs/s |
|                                               Mean Throughput |           initial-index-append |     124193 |      126107 |  1914.01 | docs/s |
|                                             Median Throughput |           initial-index-append |     122821 |      127894 |  5072.49 | docs/s |
|                                                Max Throughput |           initial-index-append |     132759 |      131657 | -1101.55 | docs/s |
|                                       50th percentile latency |           initial-index-append |    830.499 |     669.668 | -160.831 |     ms |
|                                       90th percentile latency |           initial-index-append |    1001.05 |     1066.03 |  64.9844 |     ms |
|                                       99th percentile latency |           initial-index-append |    1234.63 |     1327.95 |  93.3212 |     ms |
|                                     99.9th percentile latency |           initial-index-append |    1660.91 |     1911.09 |   250.18 |     ms |
|                                    99.99th percentile latency |           initial-index-append |    1914.48 |     3816.62 |  1902.15 |     ms |
|                                      100th percentile latency |           initial-index-append |    1928.28 |     5038.53 |  3110.24 |     ms |
|                                  50th percentile service time |           initial-index-append |    830.499 |     669.668 | -160.831 |     ms |
|                                  90th percentile service time |           initial-index-append |    1001.05 |     1066.03 |  64.9844 |     ms |
|                                  99th percentile service time |           initial-index-append |    1234.63 |     1327.95 |  93.3212 |     ms |
|                                99.9th percentile service time |           initial-index-append |    1660.91 |     1911.09 |   250.18 |     ms |
|                               99.99th percentile service time |           initial-index-append |    1914.48 |     3816.62 |  1902.15 |     ms |
|                                 100th percentile service time |           initial-index-append |    1928.28 |     5038.53 |  3110.24 |     ms |
|                                                    error rate |           initial-index-append |          0 |           0 |        0 |      % |
|                                                Min Throughput |        concurrent-index-append |    53063.7 |     56813.7 |  3750.03 | docs/s |
|                                               Mean Throughput |        concurrent-index-append |      95696 |      116984 |  21287.8 | docs/s |
|                                             Median Throughput |        concurrent-index-append |    96962.7 |      115754 |  18790.8 | docs/s |
|                                                Max Throughput |        concurrent-index-append |     113920 |      131231 |  17310.4 | docs/s |
|                                       50th percentile latency |        concurrent-index-append |    821.547 |     861.293 |  39.7462 |     ms |
|                                       90th percentile latency |        concurrent-index-append |    1199.79 |     1040.89 | -158.896 |     ms |
|                                       99th percentile latency |        concurrent-index-append |    1808.56 |     1381.15 | -427.408 |     ms |
|                                     99.9th percentile latency |        concurrent-index-append |    3378.66 |     2015.37 | -1363.29 |     ms |
|                                    99.99th percentile latency |        concurrent-index-append |    5629.41 |     2346.31 | -3283.11 |     ms |
|                                      100th percentile latency |        concurrent-index-append |    7707.74 |     2846.51 | -4861.23 |     ms |
|                                  50th percentile service time |        concurrent-index-append |    821.547 |     861.293 |  39.7462 |     ms |
|                                  90th percentile service time |        concurrent-index-append |    1199.79 |     1040.89 | -158.896 |     ms |
|                                  99th percentile service time |        concurrent-index-append |    1808.56 |     1381.15 | -427.408 |     ms |
|                                99.9th percentile service time |        concurrent-index-append |    3378.66 |     2015.37 | -1363.29 |     ms |
|                               99.99th percentile service time |        concurrent-index-append |    5629.41 |     2346.31 | -3283.11 |     ms |
|                                 100th percentile service time |        concurrent-index-append |    7707.74 |     2846.51 | -4861.23 |     ms |
|                                                    error rate |        concurrent-index-append |          0 |           0 |        0 |      % |
|                                                Min Throughput |                 multi_term_agg |   0.432598 |     3.17739 |  2.74479 |  ops/s |
|                                               Mean Throughput |                 multi_term_agg |    1.69637 |     3.34544 |  1.64907 |  ops/s |
|                                             Median Throughput |                 multi_term_agg |    1.94974 |     3.26463 |  1.31489 |  ops/s |
|                                                Max Throughput |                 multi_term_agg |    2.36327 |     3.71185 |  1.34858 |  ops/s |
|                                       50th percentile latency |                 multi_term_agg |    343.601 |     318.678 |  -24.923 |     ms |
|                                       90th percentile latency |                 multi_term_agg |    619.387 |     378.576 | -240.811 |     ms |
|                                       99th percentile latency |                 multi_term_agg |    2407.05 |     587.891 | -1819.16 |     ms |
|                                      100th percentile latency |                 multi_term_agg |     2694.1 |      663.58 | -2030.52 |     ms |
|                                  50th percentile service time |                 multi_term_agg |    343.601 |     318.678 |  -24.923 |     ms |
|                                  90th percentile service time |                 multi_term_agg |    619.387 |     378.576 | -240.811 |     ms |
|                                  99th percentile service time |                 multi_term_agg |    2407.05 |     587.891 | -1819.16 |     ms |
|                                 100th percentile service time |                 multi_term_agg |     2694.1 |      663.58 | -2030.52 |     ms |
|                                                    error rate |                 multi_term_agg |          0 |           0 |        0 |      % |
|                                                Min Throughput |                        default |   0.208504 |     13.5951 |  13.3866 |  ops/s |
|                                               Mean Throughput |                        default |    12.7109 |      19.812 |   7.1011 |  ops/s |
|                                             Median Throughput |                        default |    14.4006 |     20.1099 |  5.70928 |  ops/s |
|                                                Max Throughput |                        default |    16.0965 |     20.9676 |  4.87106 |  ops/s |
|                                       50th percentile latency |                        default |    377.813 |     346.038 | -31.7752 |     ms |
|                                       90th percentile latency |                        default |     695.92 |     408.222 | -287.698 |     ms |
|                                       99th percentile latency |                        default |    1059.03 |     572.052 | -486.973 |     ms |
|                                     99.9th percentile latency |                        default |    4792.19 |     1084.88 | -3707.32 |     ms |
|                                      100th percentile latency |                        default |    4797.84 |     1143.39 | -3654.45 |     ms |
|                                  50th percentile service time |                        default |    377.813 |     346.038 | -31.7752 |     ms |
|                                  90th percentile service time |                        default |     695.92 |     408.222 | -287.698 |     ms |
|                                  99th percentile service time |                        default |    1059.03 |     572.052 | -486.973 |     ms |
|                                99.9th percentile service time |                        default |    4792.19 |     1084.88 | -3707.32 |     ms |
|                                 100th percentile service time |                        default |    4797.84 |     1143.39 | -3654.45 |     ms |
|                                                    error rate |                        default |          0 |           0 |        0 |      % |
|                                                Min Throughput |                           term |   0.253427 |     7.65096 |  7.39753 |  ops/s |
|                                               Mean Throughput |                           term |    11.9324 |     15.7275 |   3.7951 |  ops/s |
|                                             Median Throughput |                           term |    13.3708 |     16.2875 |  2.91665 |  ops/s |
|                                                Max Throughput |                           term |    14.4267 |      17.011 |   2.5843 |  ops/s |
|                                       50th percentile latency |                           term |    423.418 |     417.382 | -6.03606 |     ms |
|                                       90th percentile latency |                           term |    775.567 |     669.879 | -105.688 |     ms |
|                                       99th percentile latency |                           term |    1176.42 |     805.845 | -370.571 |     ms |
|                                     99.9th percentile latency |                           term |    3863.48 |     1213.52 | -2649.96 |     ms |
|                                      100th percentile latency |                           term |    3946.14 |     1381.09 | -2565.05 |     ms |
|                                  50th percentile service time |                           term |    423.418 |     417.382 | -6.03606 |     ms |
|                                  90th percentile service time |                           term |    775.567 |     669.879 | -105.688 |     ms |
|                                  99th percentile service time |                           term |    1176.42 |     805.845 | -370.571 |     ms |
|                                99.9th percentile service time |                           term |    3863.48 |     1213.52 | -2649.96 |     ms |
|                                 100th percentile service time |                           term |    3946.14 |     1381.09 | -2565.05 |     ms |
|                                                    error rate |                           term |          0 |           0 |        0 |      % |
|                                                Min Throughput |                          range |   0.253375 |     6.70487 |  6.45149 |  ops/s |
|                                               Mean Throughput |                          range |     12.971 |     18.9212 |  5.95024 |  ops/s |
|                                             Median Throughput |                          range |    14.6048 |     19.5734 |  4.96862 |  ops/s |
|                                                Max Throughput |                          range |    16.3646 |     20.0046 |     3.64 |  ops/s |
|                                       50th percentile latency |                          range |    373.499 |     349.868 | -23.6316 |     ms |
|                                       90th percentile latency |                          range |    683.115 |     411.423 | -271.692 |     ms |
|                                       99th percentile latency |                          range |    1084.46 |     632.354 | -452.107 |     ms |
|                                     99.9th percentile latency |                          range |    4793.57 |     1137.03 | -3656.54 |     ms |
|                                      100th percentile latency |                          range |    4797.91 |     1139.62 | -3658.29 |     ms |
|                                  50th percentile service time |                          range |    373.499 |     349.868 | -23.6316 |     ms |
|                                  90th percentile service time |                          range |    683.115 |     411.423 | -271.692 |     ms |
|                                  99th percentile service time |                          range |    1084.46 |     632.354 | -452.107 |     ms |
|                                99.9th percentile service time |                          range |    4793.57 |     1137.03 | -3656.54 |     ms |
|                                 100th percentile service time |                          range |    4797.91 |     1139.62 | -3658.29 |     ms |
|                                                    error rate |                          range |          0 |           0 |        0 |      % |
|                                                Min Throughput |                  200s-in-range |     3.2923 |     20.9868 |  17.6945 |  ops/s |
|                                               Mean Throughput |                  200s-in-range |    19.1443 |     22.2688 |  3.12446 |  ops/s |
|                                             Median Throughput |                  200s-in-range |    20.9796 |     21.7593 |  0.77976 |  ops/s |
|                                                Max Throughput |                  200s-in-range |    22.9957 |     27.1339 |  4.13824 |  ops/s |
|                                       50th percentile latency |                  200s-in-range |    268.897 |     334.469 |  65.5725 |     ms |
|                                       90th percentile latency |                  200s-in-range |    506.735 |     396.178 | -110.557 |     ms |
|                                       99th percentile latency |                  200s-in-range |    850.536 |     513.982 | -336.554 |     ms |
|                                     99.9th percentile latency |                  200s-in-range |    4095.68 |     1003.77 | -3091.92 |     ms |
|                                      100th percentile latency |                  200s-in-range |    4560.76 |     1139.25 | -3421.51 |     ms |
|                                  50th percentile service time |                  200s-in-range |    268.897 |     334.469 |  65.5725 |     ms |
|                                  90th percentile service time |                  200s-in-range |    506.735 |     396.178 | -110.557 |     ms |
|                                  99th percentile service time |                  200s-in-range |    850.536 |     513.982 | -336.554 |     ms |
|                                99.9th percentile service time |                  200s-in-range |    4095.68 |     1003.77 | -3091.92 |     ms |
|                                 100th percentile service time |                  200s-in-range |    4560.76 |     1139.25 | -3421.51 |     ms |
|                                                    error rate |                  200s-in-range |          0 |           0 |        0 |      % |
|                                                Min Throughput |                  400s-in-range |    3.08018 |     20.8627 |  17.7825 |  ops/s |
|                                               Mean Throughput |                  400s-in-range |    18.6961 |     22.4375 |  3.74139 |  ops/s |
|                                             Median Throughput |                  400s-in-range |    20.7626 |     21.8207 |  1.05812 |  ops/s |
|                                                Max Throughput |                  400s-in-range |    22.7035 |     29.1326 |  6.42908 |  ops/s |
|                                       50th percentile latency |                  400s-in-range |    276.853 |     332.685 |  55.8319 |     ms |
|                                       90th percentile latency |                  400s-in-range |     515.85 |     393.643 | -122.207 |     ms |
|                                       99th percentile latency |                  400s-in-range |    847.354 |     538.986 | -308.368 |     ms |
|                                     99.9th percentile latency |                  400s-in-range |    3859.49 |     934.947 | -2924.55 |     ms |
|                                      100th percentile latency |                  400s-in-range |    4028.79 |     1160.85 | -2867.94 |     ms |
|                                  50th percentile service time |                  400s-in-range |    276.853 |     332.685 |  55.8319 |     ms |
|                                  90th percentile service time |                  400s-in-range |     515.85 |     393.643 | -122.207 |     ms |
|                                  99th percentile service time |                  400s-in-range |    847.354 |     538.986 | -308.368 |     ms |
|                                99.9th percentile service time |                  400s-in-range |    3859.49 |     934.947 | -2924.55 |     ms |
|                                 100th percentile service time |                  400s-in-range |    4028.79 |     1160.85 | -2867.94 |     ms |
|                                                    error rate |                  400s-in-range |          0 |           0 |        0 |      % |
|                                                Min Throughput |                     hourly_agg |    3.06703 |     19.2404 |  16.1733 |  ops/s |
|                                               Mean Throughput |                     hourly_agg |    18.1523 |     20.2524 |  2.10018 |  ops/s |
|                                             Median Throughput |                     hourly_agg |    20.0668 |     19.8916 | -0.17515 |  ops/s |
|                                                Max Throughput |                     hourly_agg |    22.0081 |     25.9083 |  3.90019 |  ops/s |
|                                       50th percentile latency |                     hourly_agg |    282.087 |     360.943 |  78.8554 |     ms |
|                                       90th percentile latency |                     hourly_agg |    517.262 |     434.152 | -83.1107 |     ms |
|                                       99th percentile latency |                     hourly_agg |    870.141 |     565.954 | -304.187 |     ms |
|                                     99.9th percentile latency |                     hourly_agg |    3770.79 |     1071.53 | -2699.27 |     ms |
|                                      100th percentile latency |                     hourly_agg |     4186.3 |     1187.34 | -2998.96 |     ms |
|                                  50th percentile service time |                     hourly_agg |    282.087 |     360.943 |  78.8554 |     ms |
|                                  90th percentile service time |                     hourly_agg |    517.262 |     434.152 | -83.1107 |     ms |
|                                  99th percentile service time |                     hourly_agg |    870.141 |     565.954 | -304.187 |     ms |
|                                99.9th percentile service time |                     hourly_agg |    3770.79 |     1071.53 | -2699.27 |     ms |
|                                 100th percentile service time |                     hourly_agg |     4186.3 |     1187.34 | -2998.96 |     ms |
|                                                    error rate |                     hourly_agg |          0 |           0 |        0 |      % |
|                                                Min Throughput |              multi-term-filter |   0.279656 |     7.80522 |  7.52557 |  ops/s |
|                                               Mean Throughput |              multi-term-filter |    11.8558 |     15.3768 |  3.52097 |  ops/s |
|                                             Median Throughput |              multi-term-filter |    13.2512 |      15.946 |  2.69486 |  ops/s |
|                                                Max Throughput |              multi-term-filter |    14.3892 |     16.5402 |  2.15099 |  ops/s |
|                                       50th percentile latency |              multi-term-filter |    424.293 |     424.971 |  0.67797 |     ms |
|                                       90th percentile latency |              multi-term-filter |    778.221 |     682.984 | -95.2377 |     ms |
|                                       99th percentile latency |              multi-term-filter |    1222.13 |      819.67 | -402.457 |     ms |
|                                     99.9th percentile latency |              multi-term-filter |    3802.55 |     1219.64 | -2582.91 |     ms |
|                                      100th percentile latency |              multi-term-filter |    3970.96 |     1257.41 | -2713.56 |     ms |
|                                  50th percentile service time |              multi-term-filter |    424.293 |     424.971 |  0.67797 |     ms |
|                                  90th percentile service time |              multi-term-filter |    778.221 |     682.984 | -95.2377 |     ms |
|                                  99th percentile service time |              multi-term-filter |    1222.13 |      819.67 | -402.457 |     ms |
|                                99.9th percentile service time |              multi-term-filter |    3802.55 |     1219.64 | -2582.91 |     ms |
|                                 100th percentile service time |              multi-term-filter |    3970.96 |     1257.41 | -2713.56 |     ms |
|                                                    error rate |              multi-term-filter |          0 |           0 |        0 |      % |
|                                                Min Throughput |                  asc_sort_size |    0.20705 |     5.26184 |  5.05479 |  ops/s |
|                                               Mean Throughput |                  asc_sort_size |    12.5314 |     17.5014 |  4.96998 |  ops/s |
|                                             Median Throughput |                  asc_sort_size |    14.0875 |     18.1507 |  4.06323 |  ops/s |
|                                                Max Throughput |                  asc_sort_size |    15.5259 |     18.5857 |  3.05984 |  ops/s |
|                                       50th percentile latency |                  asc_sort_size |    393.494 |     373.422 | -20.0721 |     ms |
|                                       90th percentile latency |                  asc_sort_size |    713.768 |      452.08 | -261.689 |     ms |
|                                       99th percentile latency |                  asc_sort_size |    1058.27 |      769.43 | -288.842 |     ms |
|                                     99.9th percentile latency |                  asc_sort_size |    4554.43 |     1197.88 | -3356.55 |     ms |
|                                      100th percentile latency |                  asc_sort_size |     4828.3 |     1420.73 | -3407.56 |     ms |
|                                  50th percentile service time |                  asc_sort_size |    393.494 |     373.422 | -20.0721 |     ms |
|                                  90th percentile service time |                  asc_sort_size |    713.768 |      452.08 | -261.689 |     ms |
|                                  99th percentile service time |                  asc_sort_size |    1058.27 |      769.43 | -288.842 |     ms |
|                                99.9th percentile service time |                  asc_sort_size |    4554.43 |     1197.88 | -3356.55 |     ms |
|                                 100th percentile service time |                  asc_sort_size |     4828.3 |     1420.73 | -3407.56 |     ms |
|                                                    error rate |                  asc_sort_size |          0 |           0 |        0 |      % |
|                                                Min Throughput |            desc_sort_timestamp |   0.206681 |    0.505648 |  0.29897 |  ops/s |
|                                               Mean Throughput |            desc_sort_timestamp |    7.81624 |     5.02897 | -2.78727 |  ops/s |
|                                             Median Throughput |            desc_sort_timestamp |    8.04716 |     5.34092 | -2.70623 |  ops/s |
|                                                Max Throughput |            desc_sort_timestamp |    10.5078 |     5.44055 | -5.06729 |  ops/s |
|                                       50th percentile latency |            desc_sort_timestamp |    589.856 |     1321.44 |  731.588 |     ms |
|                                       90th percentile latency |            desc_sort_timestamp |    1098.29 |     1557.06 |  458.767 |     ms |
|                                       99th percentile latency |            desc_sort_timestamp |    1648.69 |     2153.89 |  505.201 |     ms |
|                                     99.9th percentile latency |            desc_sort_timestamp |    4983.26 |     5679.28 |  696.019 |     ms |
|                                      100th percentile latency |            desc_sort_timestamp |     6113.1 |     7149.19 |  1036.09 |     ms |
|                                  50th percentile service time |            desc_sort_timestamp |    589.856 |     1321.44 |  731.588 |     ms |
|                                  90th percentile service time |            desc_sort_timestamp |    1098.29 |     1557.06 |  458.767 |     ms |
|                                  99th percentile service time |            desc_sort_timestamp |    1648.69 |     2153.89 |  505.201 |     ms |
|                                99.9th percentile service time |            desc_sort_timestamp |    4983.26 |     5679.28 |  696.019 |     ms |
|                                 100th percentile service time |            desc_sort_timestamp |     6113.1 |     7149.19 |  1036.09 |     ms |
|                                                    error rate |            desc_sort_timestamp |          0 |           0 |        0 |      % |
|                                                Min Throughput |                 desc_sort_size |    0.20688 |     4.91388 |    4.707 |  ops/s |
|                                               Mean Throughput |                 desc_sort_size |    11.7654 |     15.9593 |  4.19392 |  ops/s |
|                                             Median Throughput |                 desc_sort_size |    13.1054 |      16.594 |  3.48858 |  ops/s |
|                                                Max Throughput |                 desc_sort_size |    14.5242 |     17.1713 |  2.64708 |  ops/s |
|                                       50th percentile latency |                 desc_sort_size |    419.735 |     401.019 | -18.7157 |     ms |
|                                       90th percentile latency |                 desc_sort_size |    716.654 |      482.95 | -233.704 |     ms |
|                                       99th percentile latency |                 desc_sort_size |    1113.88 |     793.882 | -319.993 |     ms |
|                                     99.9th percentile latency |                 desc_sort_size |    4832.89 |     1130.05 | -3702.84 |     ms |
|                                      100th percentile latency |                 desc_sort_size |    4964.87 |     1164.34 | -3800.53 |     ms |
|                                  50th percentile service time |                 desc_sort_size |    419.735 |     401.019 | -18.7157 |     ms |
|                                  90th percentile service time |                 desc_sort_size |    716.654 |      482.95 | -233.704 |     ms |
|                                  99th percentile service time |                 desc_sort_size |    1113.88 |     793.882 | -319.993 |     ms |
|                                99.9th percentile service time |                 desc_sort_size |    4832.89 |     1130.05 | -3702.84 |     ms |
|                                 100th percentile service time |                 desc_sort_size |    4964.87 |     1164.34 | -3800.53 |     ms |
|                                                    error rate |                 desc_sort_size |          0 |           0 |        0 |      % |
|                                                Min Throughput |             asc_sort_timestamp |   0.195203 |    0.705657 |  0.51045 |  ops/s |
|                                               Mean Throughput |             asc_sort_timestamp |    9.81914 |     10.7137 |  0.89459 |  ops/s |
|                                             Median Throughput |             asc_sort_timestamp |     10.878 |     11.7022 |  0.82422 |  ops/s |
|                                                Max Throughput |             asc_sort_timestamp |    13.3006 |      16.268 |   2.9674 |  ops/s |
|                                       50th percentile latency |             asc_sort_timestamp |    476.628 |     464.836 | -11.7922 |     ms |
|                                       90th percentile latency |             asc_sort_timestamp |    863.197 |     627.704 | -235.493 |     ms |
|                                       99th percentile latency |             asc_sort_timestamp |    1271.99 |      1239.6 | -32.3855 |     ms |
|                                     99.9th percentile latency |             asc_sort_timestamp |     4992.1 |     1416.78 | -3575.31 |     ms |
|                                      100th percentile latency |             asc_sort_timestamp |    5427.26 |     1420.51 | -4006.75 |     ms |
|                                  50th percentile service time |             asc_sort_timestamp |    476.628 |     464.836 | -11.7922 |     ms |
|                                  90th percentile service time |             asc_sort_timestamp |    863.197 |     627.704 | -235.493 |     ms |
|                                  99th percentile service time |             asc_sort_timestamp |    1271.99 |      1239.6 | -32.3855 |     ms |
|                                99.9th percentile service time |             asc_sort_timestamp |     4992.1 |     1416.78 | -3575.31 |     ms |
|                                 100th percentile service time |             asc_sort_timestamp |    5427.26 |     1420.51 | -4006.75 |     ms |
|                                                    error rate |             asc_sort_timestamp |          0 |           0 |        0 |      % |
|                                                Min Throughput | desc_sort_with_after_timestamp |   0.210526 |    0.400458 |  0.18993 |  ops/s |
|                                               Mean Throughput | desc_sort_with_after_timestamp |    6.75232 |     4.25158 | -2.50074 |  ops/s |
|                                             Median Throughput | desc_sort_with_after_timestamp |    6.92546 |     4.42836 |  -2.4971 |  ops/s |
|                                                Max Throughput | desc_sort_with_after_timestamp |    9.71189 |     4.99881 | -4.71308 |  ops/s |
|                                       50th percentile latency | desc_sort_with_after_timestamp |     582.84 |     1419.73 |  836.886 |     ms |
|                                       90th percentile latency | desc_sort_with_after_timestamp |    1345.07 |     1838.13 |  493.052 |     ms |
|                                       99th percentile latency | desc_sort_with_after_timestamp |    1942.93 |     2324.49 |  381.568 |     ms |
|                                     99.9th percentile latency | desc_sort_with_after_timestamp |    4997.63 |      6614.5 |  1616.86 |     ms |
|                                      100th percentile latency | desc_sort_with_after_timestamp |     5431.7 |     7315.79 |  1884.09 |     ms |
|                                  50th percentile service time | desc_sort_with_after_timestamp |     582.84 |     1419.73 |  836.886 |     ms |
|                                  90th percentile service time | desc_sort_with_after_timestamp |    1345.07 |     1838.13 |  493.052 |     ms |
|                                  99th percentile service time | desc_sort_with_after_timestamp |    1942.93 |     2324.49 |  381.568 |     ms |
|                                99.9th percentile service time | desc_sort_with_after_timestamp |    4997.63 |      6614.5 |  1616.86 |     ms |
|                                 100th percentile service time | desc_sort_with_after_timestamp |     5431.7 |     7315.79 |  1884.09 |     ms |
|                                                    error rate | desc_sort_with_after_timestamp |          0 |           0 |        0 |      % |
|                                                Min Throughput |  asc_sort_with_after_timestamp |   0.221182 |      5.5396 |  5.31841 |  ops/s |
|                                               Mean Throughput |  asc_sort_with_after_timestamp |     11.997 |     20.4488 |  8.45183 |  ops/s |
|                                             Median Throughput |  asc_sort_with_after_timestamp |    13.3521 |     21.3962 |  8.04414 |  ops/s |
|                                                Max Throughput |  asc_sort_with_after_timestamp |    14.4878 |     21.9964 |  7.50854 |  ops/s |
|                                       50th percentile latency |  asc_sort_with_after_timestamp |    422.612 |     317.452 |  -105.16 |     ms |
|                                       90th percentile latency |  asc_sort_with_after_timestamp |    732.793 |     402.329 | -330.464 |     ms |
|                                       99th percentile latency |  asc_sort_with_after_timestamp |    1108.18 |     605.466 | -502.716 |     ms |
|                                     99.9th percentile latency |  asc_sort_with_after_timestamp |    4639.65 |     1095.57 | -3544.08 |     ms |
|                                      100th percentile latency |  asc_sort_with_after_timestamp |    4765.84 |     1169.05 | -3596.79 |     ms |
|                                  50th percentile service time |  asc_sort_with_after_timestamp |    422.612 |     317.452 |  -105.16 |     ms |
|                                  90th percentile service time |  asc_sort_with_after_timestamp |    732.793 |     402.329 | -330.464 |     ms |
|                                  99th percentile service time |  asc_sort_with_after_timestamp |    1108.18 |     605.466 | -502.716 |     ms |
|                                99.9th percentile service time |  asc_sort_with_after_timestamp |    4639.65 |     1095.57 | -3544.08 |     ms |
|                                 100th percentile service time |  asc_sort_with_after_timestamp |    4765.84 |     1169.05 | -3596.79 |     ms |
|                                                    error rate |  asc_sort_with_after_timestamp |          0 |           0 |        0 |      % |

@reta commented Jul 20, 2024

Thanks a lot for running those, @mch2. Off the top of my head, I cannot explain why the desc_sort_* workloads are the outliers. @sohami, do you?

@andrross added the Roadmap:Modular Architecture label Aug 6, 2024
@Bukhtawar commented Aug 11, 2024

Good to see this discussion resurrected and some items getting prioritised. Thanks @andrross @sohami @mch2
From a search freshness perspective, there was some thought to decouple remote uploads from refresh latency so as to keep freshness on the primary lower (see #12450). The long-term vision is to see if we could do node-to-node block fetches from the primary when a search lands on the replica. This breaks the notion of separation between readers and writers, but it seems a promising option for improving refresh lag, and it is something we should think about as part of this reader/writer separation.

The other thing we should consider is re-evaluating push vs. pull as the replication mechanism for replicas. With decoupling as a core idea, I would prefer pull for replicas, making it consistent with remote cluster replication so that we have pull-based replication everywhere.
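As a rough illustration of what pull could look like on the replica side (again with hypothetical types, not the real API): the replica periodically polls the remote store for the latest checkpoint and downloads only the segments it is missing, so the primary never has to push anything.

```java
// Hypothetical sketch of pull-based replication on a replica; the
// RemoteSegmentStore interface and its methods are assumptions for illustration.
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

class PullBasedReplicator {
    interface RemoteSegmentStore {
        long latestCheckpoint();
        List<String> segmentsSince(long checkpoint);
        void download(String segmentFile);
    }

    private final RemoteSegmentStore store;
    private long localCheckpoint = -1;

    PullBasedReplicator(RemoteSegmentStore store) {
        this.store = store;
    }

    void start(long pollIntervalMillis) {
        // The replica drives replication: no push from the primary is needed.
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleWithFixedDelay(this::pollOnce, 0, pollIntervalMillis, TimeUnit.MILLISECONDS);
    }

    private void pollOnce() {
        long remote = store.latestCheckpoint();
        if (remote > localCheckpoint) {
            // Catch up by fetching only the segments added since our checkpoint.
            for (String segment : store.segmentsSince(localCheckpoint)) {
                store.download(segment);
            }
            localCheckpoint = remote;
        }
    }
}
```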
