-
Notifications
You must be signed in to change notification settings - Fork 25k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ingest parses index requests twice when there is a final pipeline #81244
Comments
Pinging @elastic/es-perf (Team:Performance) |
Pinging @elastic/es-data-management (Team:Data Management) |
I assume you're talking about the call to IndexRequest::sourceAsMap happening once per pipeline? I've got a quick change in a branch where I cache that (https://github.com/elastic/elasticsearch/compare/master...masseyke:fix/not-parsing-index-request-twice?expand=1), but I haven't set up a performance test to prove that that does any good yet. I'll try to do that at some point soon. |
In #85926 I attempted and failed to run the self-reference check once per document instead of once per pipeline. That enhancement should be coupled with this one in case it gets implemented. Also as far as I can tell, the reason for moving back to the IndexRequest from the IngestDocument seems to be to possibly pivot to other pipelines in case the target index changes. Any change for this issue should ensure we have sufficient coverage for the changing-target-index case |
Confirmed for sure that the original description is correct, and that this is a significant time sink in terms of the total work we're doing during ingest pipeline processing. Without a big change, I can't imagine how we could get rid of the initial parse and final generate, but it does seem suboptimal to have the inner generate followed almost immediately by a subsequent parse. |
Actually, there's another small cost that's repeated -- we build an IngestDocument from the parsed bytes twice and a very good fix for this ticket would remove that, too. (You can see a little repeated pattern just to the right of the two 'parse' rectangles, I'm thinking it would drop out along with the center generate/parse anti-pair.) |
Here's an updated flamegraph, fundamentally we're still in the same situation (note: PR coming, we soon won't be). As before, the inefficiency here is that the center two highlighted operations are a very expensive generate/parse pair that can be optimized out. #92995 and #93213 changed the overall shape of the flamegraph that we're getting here, so I thought an updated flamegraph was called for. |
Here's an updated flame graph from #93329, where there's now a single parse/generate cycle for all the pipelines, rather than for each pipeline: |
I was looking at a flame graph of an ingestion workload with @dliappis, and it looks like we parse the JSON representation of the index request twice when a document has both a pipeline and a final pipeline.
It's hard to know how much we would save on this benchmark since not all data streams in this benchmark had both a pipeline and a final pipeline, but the cost of parsing index request into maps of maps via IngestService represented 4.06% of overall CPU time according to the flame graph, so this might not be negligible.
The text was updated successfully, but these errors were encountered: