
Add coordination time to profiler #705

Open
peternied opened this issue May 14, 2021 · 5 comments
Labels
distributed framework, enhancement

Comments

@peternied
Member

Describe the bug
The took time on the profiled request is 1.3 seconds, while the per-shard search/aggregation times all fall in the 339ms-460ms range. Even the slowest shard accounts for well under half of the total (roughly 1.3s - 0.46s ≈ 0.84s is unaccounted for), which implies coordination had a substantial impact on the response time.

To Reproduce
Steps to reproduce the behavior (a scripted sketch of steps 2-5 follows this list):

  1. Take a query from your ES instance's slow log
  2. Repeat the query with "profile":"true" until the took time is above an acceptable threshold
  3. Investigate the response by computing the per-shard query / rewrite / collector time
  4. Print the list of per-shard times
  5. Look for the longest-running shard to determine the bottleneck
  6. Inspect the coordination time breakdown
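Here is a minimal sketch of steps 2-5, assuming a local cluster without auth and the standard `_search` profile response shape; the host, index name, and query body are placeholders to be replaced with values from the slow log.

```python
# Minimal sketch of steps 2-5; host, index, and query are placeholders.
import requests

HOST = "http://localhost:9200"
INDEX = "my-index"                       # placeholder: replace with the real index
body = {
    "profile": True,                     # step 2: enable profiling on the repeated query
    "query": {"match_all": {}},          # placeholder: paste the slow-log query here
}

resp = requests.post(f"{HOST}/{INDEX}/_search", json=body).json()
took_ms = resp["took"]

# Step 3: per-shard total of query + rewrite + collector (+ aggregation) time, in ms.
shard_ms = {}
for shard in resp["profile"]["shards"]:
    total_ns = 0
    for search in shard["searches"]:
        total_ns += sum(q["time_in_nanos"] for q in search["query"])
        total_ns += search["rewrite_time"]
        total_ns += sum(c["time_in_nanos"] for c in search["collector"])
    total_ns += sum(a["time_in_nanos"] for a in shard.get("aggregations", []))
    shard_ms[shard["id"]] = total_ns / 1e6

# Steps 4-5: print shard times, slowest first, and compare the bottleneck shard to `took`.
for shard_id, ms in sorted(shard_ms.items(), key=lambda kv: -kv[1]):
    print(f"{shard_id}: {ms:.1f} ms")

slowest = max(shard_ms.values())
print(f"took={took_ms} ms, slowest shard={slowest:.1f} ms, "
      f"unaccounted (coordination/network/queueing)={took_ms - slowest:.1f} ms")
```

Step 6 is where the gap appears: there is no coordination section in the response to inspect.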

Expected behavior
There would be a breakdown for coordination

Plugins
N/A

Screenshots
N/A

Host/Environment (please complete the following information):

  • OS: Linux x64

Additional context
In some ways this is a feature request, but it also shows that performance issues like this cannot be fully diagnosed with the existing profiling tools.

@peternied added the bug, untriaged, and Beta labels on May 14, 2021
@Bukhtawar
Collaborator

I don't think it's a bug, as the documentation clearly calls out the gap. We should plan on bridging this holistically:

  1. Network round trip
  2. Time spent in queues
  3. Time spent on coordinator fanning out
  4. Time spent on coordinator merging responses
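For illustration only, a per-request coordination section covering these four gaps might look something like the sketch below; the field names and values are invented for this example and do not correspond to any existing profile API field.

```python
# Hypothetical response fragment only -- none of these fields exist in the profile API today.
coordination_profile = {
    "network_round_trip_in_nanos": {"max": 41_000_000, "avg": 23_000_000},   # gap 1
    "queue_wait_in_nanos": 5_200_000,                                        # gap 2
    "fan_out_in_nanos": 1_100_000,            # gap 3: building and dispatching shard requests
    "merge_responses_in_nanos": 310_000_000,  # gap 4: reducing shard results into one response
}
```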

@minalsha added the distributed framework label and removed the bug and Beta labels on Aug 10, 2021
@dblock added the enhancement label on Aug 10, 2021
@dblock changed the title from "[BUG] Profiler does not include coordination time" to "Add coordination time to profiler" on Aug 10, 2021
@Poojita-Raj
Contributor

Looking into this.

@peternied
Member Author

peternied commented Aug 30, 2021

Here is an activity diagram of what appears to be profiled; the gaps around fan-out, node-to-node communication, and waiting time are part of what I'd consider coordinator time, and they are currently absent.
[activity diagram image]
Gist of the plantuml diagram

@Poojita-Raj
Contributor

These are the limitations of profiling that we are trying to bridge with this issue: Profile limitations

Namely, the top two issues are what we're trying to target right now:

  1. Network overhead time measurement for node-to-node communication.
  2. Time spent on coordinator fan out and coordinator fan in.

Network overhead time measurement for node-to-node communication

  • Do we take the longest RTT or the average RTT (in wall clock nanoseconds) as a measurement for the network overhead in getting shard results from data nodes?
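As a toy comparison of the two choices (the per-shard round-trip values below are made up), the two aggregations behave quite differently when a single link is slow:

```python
# Toy comparison only: per-shard network round-trip times (nanoseconds) are invented.
rtt_ns = {"shard-0": 12_400_000, "shard-1": 9_800_000, "shard-2": 31_000_000}

longest_rtt = max(rtt_ns.values())                 # tracks the critical path; dominated by the slowest link
average_rtt = sum(rtt_ns.values()) / len(rtt_ns)   # smooths outliers; can hide a single slow node

print(f"longest: {longest_rtt / 1e6:.1f} ms, average: {average_rtt / 1e6:.1f} ms")
```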

Time spent on coordinator fan out and coordinator fan in

The coordinating node (i.e., the node that receives the client request) executes it in two phases.

  • Scatter phase: it forwards the request to the data nodes that hold the data, which execute it and return their results.
  • Gather phase: it reduces the individual results into a global result.

There are two questions to be answered here:

  • Do we take into account time spent waiting for results from data nodes in gather phase?
  • Do we consider scatter phase time and gather phase time separately or as a combined value called coordination time?

I'm putting possible choices/approaches for these metrics up for community transparency and to get opinions on alternate metrics that could be included!
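As a rough sketch of the second question, assuming the coordinator can timestamp the scatter and gather boundaries (this is not based on any existing profiler hook), the separate and combined values could all be derived from the same three timestamps:

```python
import time

def profile_coordination(scatter, gather):
    """Time the scatter and gather phases around stand-in callables."""
    t0 = time.monotonic_ns()
    futures = scatter()            # fan out shard-level requests
    t1 = time.monotonic_ns()
    result = gather(futures)       # wait for shard results and reduce them
    t2 = time.monotonic_ns()
    return result, {
        "scatter_in_nanos": t1 - t0,
        "gather_in_nanos": t2 - t1,        # includes time spent waiting on data nodes
        "coordination_in_nanos": t2 - t0,  # the combined alternative
    }

# Stand-in callables just to make the sketch runnable; real ones would dispatch and merge shard requests.
result, timings = profile_coordination(
    scatter=lambda: ["pending-shard-0", "pending-shard-1"],
    gather=lambda futures: {"hits": len(futures)},
)
print(timings)
```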

@peternied
Member Author

@Poojita-Raj Why are we closing this issue?
