Ingest parses index requests twice when there is a final pipeline #81244

jpountz · 2021-12-02T09:20:17Z

I was looking at a flame graph of an ingestion workload with @dliappis, and it looks like we parse the JSON representation of the index request twice when a document has both a pipeline and a final pipeline.

It's hard to know how much we would save on this benchmark, since not all of its data streams had both a pipeline and a final pipeline, but the cost of parsing index requests into maps of maps via IngestService represented 4.06% of overall CPU time according to the flame graph, so this might not be negligible.

elasticmachine · 2021-12-02T09:20:20Z

Pinging @elastic/es-perf (Team:Performance)

elasticmachine · 2021-12-02T09:20:20Z

Pinging @elastic/es-data-management (Team:Data Management)

masseyke · 2022-04-26T22:22:42Z

I assume you're talking about the call to IndexRequest::sourceAsMap happening once per pipeline? I've got a quick change in a branch where I cache that (https://github.com/elastic/elasticsearch/compare/master...masseyke:fix/not-parsing-index-request-twice?expand=1), but I haven't set up a performance test to prove that that does any good yet. I'll try to do that at some point soon.
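To illustrate the caching idea mentioned above, here is a minimal sketch, not the code in the linked branch. `CachingRequest` is a hypothetical stand-in for an index request, and the parser is passed in as a plain function so the example does not depend on any Elasticsearch class. It only shows that a second call to `sourceAsMap()` can reuse the first result instead of parsing the raw source again.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical stand-in for an index request; not the real IndexRequest class. */
final class CachingRequest {
    private final String rawSource;
    // Assumption: parsing is modeled as an injected function from raw source to a map.
    private final Function<String, Map<String, Object>> parser;
    private Map<String, Object> cachedMap; // null until the first parse
    private int parseCount;                // counts how often the parser actually ran

    CachingRequest(String rawSource, Function<String, Map<String, Object>> parser) {
        this.rawSource = rawSource;
        this.parser = parser;
    }

    /** Parses the raw source on the first call and returns the cached map afterwards. */
    Map<String, Object> sourceAsMap() {
        if (cachedMap == null) {
            cachedMap = parser.apply(rawSource);
            parseCount++;
        }
        return cachedMap;
    }

    int parseCount() {
        return parseCount;
    }
}
```

One caveat with this kind of cache: if a pipeline rewrites the raw source, the cached map has to be invalidated or updated as well, otherwise later readers would see stale data.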

DJRickyB · 2022-05-03T14:15:35Z

In #85926 I attempted and failed to run the self-reference check once per document instead of once per pipeline. That enhancement should be coupled with this one in case it gets implemented.

Also, as far as I can tell, the reason for moving back to the IndexRequest from the IngestDocument seems to be to possibly pivot to other pipelines in case the target index changes. Any change for this issue should ensure we have sufficient coverage for the changing-target-index case.

joegallo · 2022-11-18T15:27:20Z

Confirmed for sure that the original description is correct, and that this is a significant time sink in terms of the total work we're doing during ingest pipeline processing. Without a big change, I can't imagine how we could get rid of the initial parse and final generate, but it does seem suboptimal to have the inner generate followed almost immediately by a subsequent parse.

joegallo · 2022-11-18T22:28:36Z

Actually, there's another small cost that's repeated -- we build an IngestDocument from the parsed bytes twice and a very good fix for this ticket would remove that, too. (You can see a little repeated pattern just to the right of the two 'parse' rectangles, I'm thinking it would drop out along with the center generate/parse anti-pair.)

joegallo · 2023-01-25T20:37:02Z

Here's an updated flamegraph. Fundamentally, we're still in the same situation (note: PR coming, we soon won't be). As before, the inefficiency here is that the center two highlighted operations are a very expensive generate/parse pair that can be optimized out.

#92995 and #93213 changed the overall shape of the flamegraph that we're getting here, so I thought an updated flamegraph was called for.

joegallo · 2023-01-30T16:49:33Z

Here's an updated flame graph from #93329, where there's now a single parse/generate cycle for all the pipelines, rather than for each pipeline:
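The shape of a single parse/generate cycle shared by all pipelines can be sketched as follows. This is an illustrative toy under stated assumptions, not the code from #93329: `SingleCycleIngest` is a hypothetical class, pipelines are modeled as simple `Consumer`s over an in-memory map, and a `key=value;key=value` string format stands in for JSON so the example needs nothing beyond the JDK.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

/** Hypothetical sketch: parse once, run every pipeline in memory, generate once. */
final class SingleCycleIngest {
    int parses = 0;    // number of times the raw source was parsed
    int generates = 0; // number of times the document was serialized again

    // Toy "key=value;key=value" parser standing in for JSON parsing.
    Map<String, String> parse(String source) {
        parses++;
        Map<String, String> doc = new LinkedHashMap<>();
        for (String pair : source.split(";")) {
            if (pair.isEmpty()) {
                continue;
            }
            String[] kv = pair.split("=", 2);
            doc.put(kv[0], kv.length > 1 ? kv[1] : "");
        }
        return doc;
    }

    // Toy serializer standing in for JSON generation.
    String generate(Map<String, String> doc) {
        generates++;
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : doc.entrySet()) {
            if (sb.length() > 0) {
                sb.append(';');
            }
            sb.append(e.getKey()).append('=').append(e.getValue());
        }
        return sb.toString();
    }

    /** Runs all pipelines (for example a default pipeline, then a final pipeline) on one parsed document. */
    String execute(String source, List<Consumer<Map<String, String>>> pipelines) {
        Map<String, String> doc = parse(source);      // the only parse
        for (Consumer<Map<String, String>> pipeline : pipelines) {
            pipeline.accept(doc);                     // no generate/parse between pipelines
        }
        return generate(doc);                         // the only generate
    }
}
```

The sketch deliberately leaves out the case raised earlier in the thread where a pipeline changes the target index and a different set of pipelines has to be resolved; a real implementation would need to handle that while still keeping the document in memory.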

jpountz added the >bug, :Data Management/Ingest Node, and :Performance labels on Dec 2, 2021

elasticmachine added the Team:Performance and Team:Data Management labels on Dec 2, 2021

joegallo self-assigned this Dec 6, 2022

joegallo mentioned this issue Jan 24, 2023

Extract some methods from IngestService's innerExecute method #93213

Merged

joegallo mentioned this issue Jan 27, 2023

Handle a default/request pipeline and a final pipeline with minimal additional overhead #93329

Merged

joegallo closed this as completed in #93329 Jan 30, 2023
