[Util][NFC] OptimizeIntArithmetic: reduce calls to eraseState
#19130
Conversation
The dead code analysis dependency is a frequent footgun that is explained in the class comments of the int range optimization analysis: it has an implicit dependency on dead code analysis. It would be good to see if this can be fixed upstream, since it has tripped everyone.
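For context, a minimal sketch (not IREE's exact pass code) of how such a solver is typically configured: `IntegerRangeAnalysis` only yields results for IR the solver knows is live, so `DeadCodeAnalysis` (with a constant propagation analysis) must be loaded alongside it, which is the implicit dependency referred to above.

```cpp
#include "mlir/Analysis/DataFlow/ConstantPropagationAnalysis.h"
#include "mlir/Analysis/DataFlow/DeadCodeAnalysis.h"
#include "mlir/Analysis/DataFlow/IntegerRangeAnalysis.h"
#include "mlir/Analysis/DataFlowFramework.h"

using namespace mlir;

static LogicalResult runIntRangeAnalysis(Operation *root) {
  DataFlowSolver solver;
  // IntegerRangeAnalysis only reports results for IR the solver considers
  // live, so the liveness-related analyses must be loaded first. Forgetting
  // them is the footgun mentioned in the comment above.
  solver.load<dataflow::DeadCodeAnalysis>();
  solver.load<dataflow::SparseConstantPropagation>();
  solver.load<dataflow::IntegerRangeAnalysis>();
  return solver.initializeAndRun(root);
}
```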
Good find with respect to the compilation time impact. I am indeed concerned that this pessimizes the analysis. I thought I had a test for this case, but maybe it was just observed: if, prior to op removal, successors had their lattice fixed at the maximum range (i.e. no known bound), and removing the op would let that be resolved to a tighter bound, not doing a recursive erase will cause the analysis to get stuck at an overly broad range.
IIRC, I was seeing this with certain kinds of factorizations where only the simplified form could be analyzed.
But given the extreme overheads, I'm definitely pro finding a way to not pay this cost. I would need to study the loop more carefully, but I thought it had accounted for not looping too much.
One note I'll make here is that, at least originally, the optimization patterns were meant to run once and top-down. UnsignedWhenEquivalent was a separate pass and used dialect conversions. The int-range-optimizations themselves (back when it was just the constant folding) were even initially an IR walk. IIRC the eraseState() listener got added upstream due to concern about stale values, but I can't recall a concrete example of when it would come up. My overall thought here would be to try alternating cycles of walking the IR to apply optimization patterns, then canonicalization/folding, and repeating until convergence.
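A rough sketch of that alternating scheme, with assumed pass-local names (`optPatterns`, `canonPatterns`, and `runAlternating` are placeholders, not existing IREE symbols); the analysis the optimization patterns consult would be rebuilt inside them and is omitted here.

```cpp
#include "mlir/Transforms/GreedyPatternRewriteDriver.h"

using namespace mlir;

static LogicalResult runAlternating(Operation *root,
                                    const FrozenRewritePatternSet &optPatterns,
                                    const FrozenRewritePatternSet &canonPatterns,
                                    int maxIterations = 8) {
  for (int i = 0; i < maxIterations; ++i) {
    // Phase 1: one round of the range-based optimization patterns.
    bool optChanged = false;
    if (failed(applyPatternsAndFoldGreedily(root, optPatterns,
                                            GreedyRewriteConfig(), &optChanged)))
      return failure();

    // Phase 2: canonicalization/folding on the simplified IR.
    bool canonChanged = false;
    if (failed(applyPatternsAndFoldGreedily(root, canonPatterns,
                                            GreedyRewriteConfig(), &canonChanged)))
      return failure();

    // Converged: neither phase changed the IR.
    if (!optChanged && !canonChanged)
      break;
  }
  return success();
}
```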
I don't understand how this affects the analysis. If this isn't the correct approach, I'm open to other ideas on how to speed up this pass.
I tried removing the eraseState() calls and hit: APInt.cpp:285: int llvm::APInt::compare(const APInt &) const: Assertion `BitWidth == RHS.BitWidth && "Bit widths must be same for comparison"' failed. This corroborates what you were saying about stale values. Considering it only occurs occasionally, I'd suspect newly created values with the same address as deleted ones are being used to look up old state.
A bitwidth of 0 is usually used for the "no data" state. So what might make sense, if it's possible, is to migrate analysis results to replacement ops when they still apply. Though I don't know if the dataflow framework makes it safe to do that (given concurrency and all), so probably an eraseState() on removing a value is the way to go.
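For illustration only (not taken from the failing pipeline): a default-constructed APInt has bit width 0, which is why a stale "no data" state compared against a real bound trips the assert quoted above.

```cpp
#include "llvm/ADT/APInt.h"

int main() {
  llvm::APInt noData;         // default-constructed: bit width 0, the "no data" state
  llvm::APInt bound(32, 42);  // a real 32-bit range bound
  // noData.ule(bound);       // would fire "Bit widths must be same for comparison"
  return static_cast<int>(noData.getBitWidth()); // 0
}
```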
I do know that I didn't add that code because I wanted to; there was a failing case I was working on. There were some more advanced patterns that I didn't include in the initial submission. But there is no test for it, and as you say, it may not even be working as intended. It will not produce a correctness issue to reduce it like you have, but future us will probably find the case it was trying to optimize :) But yes, there must be an eraseState in there or else you get stray pointers.
I'm also perfectly open to switching the overall iteration as K suggests. I tried to write comprehensive tests in anticipation of needing to do implementation work on how it is done.
Thanks @IanWood1. So this is walking the entire analysis on every delete; is that the cause of the slowness? I think that now that optimize int arithmetic is implemented using the data flow solver, it might be better to keep it that way because it is a more general analysis. A walk-and-update approach would help, but if we add any control flow we will essentially be duplicating the data flow solver.
Force-pushed from 7d19205 to ee51d3d.
Signed-off-by: Ian Wood <ianwood2024@u.northwestern.edu>
@MaheshRavishankar yes, I think that's what's causing the slowness. Also, I rebased to fix/update CI. I'm not really sure where to go with this PR as there are several different paths discussed here.
It should be possible to improve this upstream. Potentially, we can ask each registered analysis to erase only its own state instead of scanning everything.
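To make the cost concrete, here is a toy model with entirely made-up names (not the upstream data structure) of why a single eraseState() call ends up scanning every stored state, plus one possible indexed alternative; the index idea is an assumption for illustration, not something spelled out in this thread.

```cpp
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// Toy model: states live in a map keyed by (anchor, analysis id). Erasing
// everything for one anchor therefore walks the whole map, which is what
// makes frequent eraseState() calls expensive on large modules.
using Anchor = const void *;       // stand-in for a lattice anchor
using AnalysisId = std::uint64_t;  // stand-in for a TypeID
using StateKey = std::pair<Anchor, AnalysisId>;

struct SlowStore {
  std::map<StateKey, int> states;  // the value type is irrelevant to the point

  // O(total states): scan everything to find the entries for `anchor`.
  void eraseState(Anchor anchor) {
    for (auto it = states.begin(); it != states.end();) {
      if (it->first.first == anchor)
        it = states.erase(it);
      else
        ++it;
    }
  }
};

// One possible direction: keep a secondary index from anchor to the ids that
// stored state for it, so erasure only touches entries that actually exist.
struct IndexedStore {
  std::map<StateKey, int> states;
  std::map<Anchor, std::vector<AnalysisId>> anchorIndex;

  void eraseState(Anchor anchor) {
    auto it = anchorIndex.find(anchor);
    if (it == anchorIndex.end())
      return;
    for (AnalysisId id : it->second)
      states.erase(StateKey(anchor, id));
    anchorIndex.erase(it);
  }
};
```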
This was kind of what I had in mind, but I don't have enough understanding of all this TypeID stuff. @Hardcode84 do you think you could help with this?
If this improves the situation it might be worth landing... I don't have enough know-how about this to immediately say if it is OK. CI passes. @stellaraccident you might have more context.
I'm pretty booked coding-wise, but the upstream change should be straightforward.
If it doesn't regress tests, then the proposed fix SGTM. I think this may pessimize some cases, but if they're not encoded in tests, that is something we will have to evaluate in the future.
I can try to take a crack at the eraseState() stuff at some point soon - just not sure if I should move from the current task to that. |
Ok, let's do this for now. I think I understand now what it pessimizes. Let's come back to that later.
…e` (iree-org#19130)" This reverts commit 81dd4e6. Signed-off-by: Ian Wood <ianwood2024@u.northwestern.edu>
This was hard to debug and I haven't yet been able to track down which op was modified causing the issue nor which pattern was triggering this. I was encountering the following assert during the `Stream` pipeline (but not global opt):

```shell
APInt.cpp:285: int llvm::APInt::compare(const APInt &) const: Assertion `BitWidth == RHS.BitWidth && "Bit widths must be same for comparison"' failed.
```

It might be worth reverting #19130 instead of this patch since I wasn't able to track this down fully. 2/2 fix for #19167.

Signed-off-by: Ian Wood <ianwood2024@u.northwestern.edu>
…-org#19130)

This pass is causing long compilation times for llama3 405b (even when cherry-picking llvm/llvm-project#115399). The majority of the time is spent in this one pass. The compilation times improve when calling `eraseState` only when ops are deleted. This is similar to the upstream listeners in `UnsignedWhenEquivalent.cpp` and `IntRangeOptimizations.cpp`. It appears this function loops over all `LatticeAnchors` on each invocation to find the one to delete, causing it to be slow. My (nonrigorous) experiment showed a decrease from 18 min to 3 min compile time. My main concern here would be this affecting correctness, as I don't know if this has unaccounted-for side effects.

Signed-off-by: Ian Wood <ianwood2024@u.northwestern.edu>
This pass is causing long compilation times for llama3 405b (even when cherry-picking llvm/llvm-project#115399). The majority of the time is spent in this one pass. The compilation times improve when calling `eraseState` only when ops are deleted. This is similar to the upstream listeners in `UnsignedWhenEquivalent.cpp` and `IntRangeOptimizations.cpp`. It appears this function loops over all `LatticeAnchors` on each invocation to find the one to delete, causing it to be slow. My (nonrigorous) experiment showed a decrease from 18 min to 3 min compile time. My main concern here would be this affecting correctness, as I don't know if this has unaccounted-for side effects.

Also, I'm not sure what `DeadCodeAnalysis` is being loaded/used for. I wasn't able to track down any users of it. Maybe that could be removed too?
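For readers following along, a minimal sketch of the kind of change described here, under assumed names (`EraseDeletedStateListener` and `applyOptPatterns` are placeholders, not the actual IREE symbols) and subject to the exact listener hook and `eraseState` signatures of the MLIR revision in use.

```cpp
#include "mlir/Analysis/DataFlowFramework.h"
#include "mlir/IR/PatternMatch.h"
#include "mlir/Transforms/GreedyPatternRewriteDriver.h"

using namespace mlir;

namespace {
// Erase solver state only when an op is actually deleted, instead of on every
// IR notification. Modeled on the upstream listeners mentioned above.
struct EraseDeletedStateListener : public RewriterBase::Listener {
  explicit EraseDeletedStateListener(DataFlowSolver &solver) : solver(solver) {}

  void notifyOperationErased(Operation *op) override {
    // Drop the lattice state anchored on the erased op's results so a later
    // Value reusing the same address cannot observe stale ranges.
    for (Value result : op->getResults())
      solver.eraseState(result);
  }

  DataFlowSolver &solver;
};
} // namespace

// Hand the listener to the greedy driver so it hears about erasures made by
// the optimization patterns.
static LogicalResult applyOptPatterns(Operation *root, DataFlowSolver &solver,
                                      const FrozenRewritePatternSet &patterns) {
  EraseDeletedStateListener listener(solver);
  GreedyRewriteConfig config;
  config.listener = &listener;
  return applyPatternsAndFoldGreedily(root, patterns, config);
}
```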