Add Bulk stats track the bulk per shard #52208

Merged
merged 18 commits into from
Apr 20, 2020
Conversation

zhichen
Contributor

@zhichen zhichen commented Feb 11, 2020

Add bulk stats that track the bulk size per shard and the time spent on each shard-level bulk request.

It might make sense to track the average bulk size per shard, since a large bulk request may be chopped into much smaller shard-level bulk operations on an index with a high number of shards. This makes more sense to me than just tracking at the shard level, since most clients are not already partitioning by shard.

For the shard bulk size statistic, re-serializing each request would be too expensive, so only the source field of an IndexRequest and the doc field of an UpdateRequest are counted here. A DeleteRequest in a bulk is counted as 0 bytes.
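The size rule described above can be sketched as follows. This is a minimal illustration, not the code from this PR: `BulkItem`, `IndexItem`, `UpdateItem`, and `DeleteItem` are hypothetical stand-ins for the Elasticsearch request classes, and the byte arrays stand in for the already-serialized source and doc fields.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

final class ShardBulkSizeSketch {
    /** Hypothetical stand-in for one item of a shard-level bulk request. */
    interface BulkItem {
        long sizeInBytes();
    }

    /** Index item: counted by the length of its source bytes. */
    static final class IndexItem implements BulkItem {
        private final byte[] source;
        IndexItem(byte[] source) { this.source = source; }
        @Override public long sizeInBytes() { return source.length; }
    }

    /** Update item: counted by its partial doc; assumed null when no doc is supplied. */
    static final class UpdateItem implements BulkItem {
        private final byte[] doc;
        UpdateItem(byte[] doc) { this.doc = doc; }
        @Override public long sizeInBytes() { return doc == null ? 0 : doc.length; }
    }

    /** Delete item: carries no document body, so it is counted as 0. */
    static final class DeleteItem implements BulkItem {
        @Override public long sizeInBytes() { return 0; }
    }

    /** Sums the counted bytes of all items without re-serializing anything. */
    static long totalSizeInBytes(List<? extends BulkItem> items) {
        long total = 0;
        for (BulkItem item : items) {
            total += item.sizeInBytes();
        }
        return total;
    }

    static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        List<BulkItem> items = Arrays.asList(
            new IndexItem(utf8("{\"a\":1}")),   // 7 bytes
            new UpdateItem(utf8("{\"b\":22}")), // 8 bytes
            new DeleteItem());                  // 0 bytes
        System.out.println(totalSizeInBytes(items)); // prints 15
    }
}
```

The trade-off is that the reported size underestimates the bytes on the wire (metadata, routing, and delete operations are ignored) in exchange for avoiding any extra serialization on the indexing path.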

example output:

...
"bulk": {
    "total": 1,
    "total_time_in_millis": 412,
    "total_size_in_bytes": 83
}
...
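The three counters in the example output are enough to derive per-shard averages. The sketch below shows one way to accumulate them; it is an assumption-laden illustration, not the PR's implementation. The class name `ShardBulkStatsSketch` and its methods are hypothetical, and `LongAdder` is simply a convenient JDK counter for concurrent updates.

```java
import java.util.concurrent.atomic.LongAdder;

final class ShardBulkStatsSketch {
    // Mirrors the fields of the example output: total, total_time_in_millis, total_size_in_bytes.
    private final LongAdder total = new LongAdder();
    private final LongAdder totalTimeInMillis = new LongAdder();
    private final LongAdder totalSizeInBytes = new LongAdder();

    /** Records one completed shard-level bulk request. */
    void record(long tookMillis, long sizeInBytes) {
        total.increment();
        totalTimeInMillis.add(tookMillis);
        totalSizeInBytes.add(sizeInBytes);
    }

    long total() { return total.sum(); }
    long totalTimeInMillis() { return totalTimeInMillis.sum(); }
    long totalSizeInBytes() { return totalSizeInBytes.sum(); }

    /** Average time per shard bulk request, or 0 if nothing was recorded. */
    double avgTimeInMillis() {
        long n = total.sum();
        return n == 0 ? 0.0 : (double) totalTimeInMillis.sum() / n;
    }

    /** Average size per shard bulk request, or 0 if nothing was recorded. */
    double avgSizeInBytes() {
        long n = total.sum();
        return n == 0 ? 0.0 : (double) totalSizeInBytes.sum() / n;
    }

    public static void main(String[] args) {
        ShardBulkStatsSketch stats = new ShardBulkStatsSketch();
        stats.record(412, 83); // the single request from the example output
        System.out.println(stats.total() + " " + stats.avgTimeInMillis() + " " + stats.avgSizeInBytes());
    }
}
```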

Relates to #50536 and #47345.

@zhichen zhichen changed the title Add Bulk stats track the bulk sizes per shard Add Bulk stats track the bulk per shard Feb 11, 2020
@zhichen zhichen requested review from henningandersen, ywelsch, DaveCTurner and jbaiera and removed request for henningandersen and ywelsch February 11, 2020 14:45
@zhichen zhichen requested a review from dakrone February 11, 2020 14:53
@dakrone dakrone added the :Data Management/Stats Statistics tracking and retrieval APIs label Feb 11, 2020
@elasticmachine
Collaborator

Pinging @elastic/es-core-features (:Core/Features/Stats)

@zhichen zhichen requested a review from martijnvg February 12, 2020 10:13
@zhichen
Contributor Author

zhichen commented Feb 13, 2020

@elasticmachine TBR

Member

@dakrone dakrone left a comment

Thanks for opening this @zhichen! I took a look and left a few comments that we should address, let me know what you think.

Another thing that occurs to me is whether we should take this opportunity to also track the exponentially weighted moving average of the time and size of shard bulk requests, so we have an idea of a "recent average" for both. What do you think? If so, we already have an ExponentiallyWeightedMovingAverage class we could use and track alongside the totals (that way we could track both the overall average and the "more recent" average).
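To make the "recent average" idea concrete, here is a minimal, self-contained sketch of an exponentially weighted moving average. It is not the ExponentiallyWeightedMovingAverage class mentioned above; `EwmaSketch`, its constructor parameters, and its method names are hypothetical. Each new sample is blended in as `alpha * value + (1 - alpha) * previousAverage`, so a larger `alpha` weights recent requests more heavily.

```java
import java.util.concurrent.atomic.AtomicLong;

final class EwmaSketch {
    private final double alpha;
    // The current average is stored as raw double bits so it can be updated with compare-and-set.
    private final AtomicLong averageBits;

    EwmaSketch(double alpha, double initialAverage) {
        if (alpha < 0.0 || alpha > 1.0) {
            throw new IllegalArgumentException("alpha must be in [0, 1]");
        }
        this.alpha = alpha;
        this.averageBits = new AtomicLong(Double.doubleToLongBits(initialAverage));
    }

    /** Folds one sample (for example, a shard bulk request's time or size) into the average. */
    void addValue(double value) {
        long current;
        long updated;
        do {
            current = averageBits.get();
            double next = alpha * value + (1 - alpha) * Double.longBitsToDouble(current);
            updated = Double.doubleToLongBits(next);
        } while (!averageBits.compareAndSet(current, updated));
    }

    double average() {
        return Double.longBitsToDouble(averageBits.get());
    }

    public static void main(String[] args) {
        EwmaSketch recentSize = new EwmaSketch(0.5, 0.0);
        recentSize.addValue(100);
        recentSize.addValue(100);
        System.out.println(recentSize.average()); // prints 75.0
    }
}
```

Tracking this alongside the running totals gives two views: total divided by count is the lifetime average, while the EWMA reacts to a recent change in bulk size or latency.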

@dakrone
Member

dakrone commented Feb 14, 2020

@elasticmachine ok to test

@cla-checker-service

Author of the following commits did not sign a Contributor Agreement:
270e8d5, 0d05e87

Please, read and sign the above mentioned agreement if you want to contribute to this project

@cla-checker-service

Author of the following commits did not sign a Contributor Agreement:
270e8d5, 0d05e87, d988e98

Please, read and sign the above mentioned agreement if you want to contribute to this project

@jasontedor
Member

@probakowski Since indexing performance is incredibly important, would you mind running your methodology past someone on the @elastic/es-perf team (e.g., maybe @danielmitterdorfer?) to ensure there are not any flaws? Any regressions here would be concerning.

@dliappis
Contributor

@probakowski I'd like to better understand the fluctuations.

A few methodology questions:

  1. Did you use a load driver on a separate host?

  2. Did you ensure the environment (especially on the target hosts) was "reset" before each iteration? We've found that untrimmed local SSDs and Intel Turbo Boost (on Intel processors) are particular sources of variance. We typically run something like:

    sudo /sbin/fstrim --all
    sync
    sleep 3
    sudo sh -c "echo 3 > /proc/sys/vm/drop_caches"
    sudo sh -c "echo 1 > /proc/sys/vm/compact_memory"
    sudo sh -c "echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo"
    

@zhichen zhichen requested a review from dakrone March 19, 2020 02:00
@zhichen
Contributor Author

zhichen commented Mar 19, 2020

@dakrone Sorry, I requested a review by mistake; please ignore it.

@zhichen
Contributor Author

zhichen commented Mar 25, 2020

@probakowski @jasontedor Is any further indexing performance testing needed before merging this?

@jasontedor
Member

Yes, there is. I want to understand the methodology that was employed here. Most of the results have indicated performance regressions, and I'm not convinced that they are noise. I want to understand the methodology in light of the questions that @dliappis asked, and also understand where these benchmarks were run (a laptop?). Indexing performance is far too important; we need to be cautious here.

@bpintea bpintea added v7.8.0 and removed v7.7.0 labels Mar 25, 2020
@dliappis
Contributor

FYI, we synced up with @probakowski offline a few days ago to take a more critical look at the methodology, and he's currently working on a more thorough iteration, including a higher number of shards (since this PR adds stats per shard), an isolated load driver and nodes, a better choice of instances, etc.

@zhichen
Contributor Author

zhichen commented Apr 9, 2020

Hi @probakowski, is there any update?

@probakowski
Contributor

Hi @zhichen, very sorry for the late update. I was finally able to get a stable testing environment and a better testing methodology (thanks @dliappis and @danielmitterdorfer!) and was able to confirm that there's no visible impact on performance here (the difference in median throughput was less than 0.5%, the same range as between different runs of master).

I'll resolve conflicts and merge/backport the change.

Thanks for your work!
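The comparison described in the comment above (median throughput of the PR versus master, judged against run-to-run variation) can be sketched as below. This is only an illustration of the arithmetic: `MedianThroughputSketch` and its methods are hypothetical, and the throughput numbers in `main` are made up for the example, not measurements from this PR.

```java
import java.util.Arrays;

final class MedianThroughputSketch {
    /** Median of a non-empty set of throughput samples (for example, docs/s per run). */
    static double median(double[] samples) {
        if (samples.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        double[] sorted = samples.clone();
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        return sorted.length % 2 == 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0;
    }

    /** Relative difference between the two medians, as a fraction of the baseline median. */
    static double relativeDifference(double[] baseline, double[] candidate) {
        double base = median(baseline);
        return Math.abs(median(candidate) - base) / base;
    }

    public static void main(String[] args) {
        // Illustrative numbers only.
        double[] master = {1000, 1010, 990};
        double[] withBulkStats = {1003, 997, 1004};
        double diff = relativeDifference(master, withBulkStats);
        System.out.println(diff < 0.005 ? "within 0.5%" : "exceeds 0.5%");
    }
}
```

The same function applied to two independent sets of master runs gives the noise floor that the PR's difference should be compared against.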

@zhichen
Contributor Author

zhichen commented Apr 18, 2020

Thanks @probakowski. It's nice to see that this PR will be merged so that we can use this feature in 7.8 or 8.0.

@russcam
Contributor

russcam commented Jun 9, 2020

@probakowski this change doesn't appear to have been backported to 7.8:

    public CommonStats(CommonStatsFlags flags) {
        CommonStatsFlags.Flag[] setFlags = flags.getFlags();
        for (CommonStatsFlags.Flag flag : setFlags) {
            switch (flag) {
                case Docs:
                    docs = new DocsStats();
                    break;
                case Store:
                    store = new StoreStats();
                    break;
                case Indexing:
                    indexing = new IndexingStats();
                    break;
                case Get:
                    get = new GetStats();
                    break;
                case Search:
                    search = new SearchStats();
                    break;
                case Merge:
                    merge = new MergeStats();
                    break;
                case Refresh:
                    refresh = new RefreshStats();
                    break;
                case Flush:
                    flush = new FlushStats();
                    break;
                case Warmer:
                    warmer = new WarmerStats();
                    break;
                case QueryCache:
                    queryCache = new QueryCacheStats();
                    break;
                case FieldData:
                    fieldData = new FieldDataStats();
                    break;
                case Completion:
                    completion = new CompletionStats();
                    break;
                case Segments:
                    segments = new SegmentsStats();
                    break;
                case Translog:
                    translog = new TranslogStats();
                    break;
                case RequestCache:
                    requestCache = new RequestCacheStats();
                    break;
                case Recovery:
                    recoveryStats = new RecoveryStats();
                    break;
                default:
                    throw new IllegalStateException("Unknown Flag: " + flag);
            }
        }

Should it be backported, or should the v7.8.0 label be removed?

@ywelsch
Contributor

ywelsch commented Jul 3, 2020

This hasn't been backported to 7.9.0 either.

Please hold off on any backports for now (I will remove the version label as well), as this may interfere with other work that we are doing in this area (related to #58885). We will need a cohesive plan first.

Labels
:Data Management/Stats Statistics tracking and retrieval APIs >enhancement v8.0.0-alpha1