
[Util][NFC] OptimizeIntArithmetic: reduce calls to eraseState #19130

Merged
merged 1 commit into main from opt_int_arith_speed on Nov 14, 2024

Conversation

IanWood1
Contributor

@IanWood1 IanWood1 commented Nov 13, 2024

This pass is causing long compilation times for llama3 405b (even when cherry-picking llvm/llvm-project#115399). The majority of the time is spent in this one pass. The compilation times improve when calling eraseState only when ops are deleted. This is similar to the upstream listeners in UnsignedWhenEquivalent.cpp and IntRangeOptimizations.cpp. It appears this function loops over all LatticeAnchors on each invocation to find the one to delete, causing it to be slow. My (nonrigorous) experiment showed a decrease from 18 min to 3 min compile time. My main concern here would be this affecting correctness, as I don't know if this has unaccounted for side effects.
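For reference, the upstream listeners mentioned above look roughly like the sketch below (the class name and wiring here are illustrative, not the exact IREE change): lattice state is erased only for an op and its results when that op is actually deleted, rather than on every IR modification.

```c++
#include "mlir/Analysis/DataFlowFramework.h"
#include "mlir/IR/PatternMatch.h"

using namespace mlir;

namespace {
// Rewrite listener that invalidates solver state only for erased IR.
struct EraseStateOnDeleteListener : public RewriterBase::Listener {
  EraseStateOnDeleteListener(DataFlowSolver &solver) : solver(solver) {}

  void notifyOperationErased(Operation *op) override {
    // Drop state anchored on the op itself...
    solver.eraseState(solver.getProgramPointAfter(op));
    // ...and on each of its result values.
    for (Value result : op->getResults())
      solver.eraseState(result);
  }

  DataFlowSolver &solver;
};
} // namespace
```

The pass would then install a listener like this on whatever driver applies its rewrites, so only actual deletions pay the eraseState cost.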

Also, I'm not sure what DeadCodeAnalysis is being loaded/used for. I wasn't able to track down any users of it. Maybe that could be removed too?

@IanWood1 IanWood1 changed the title [Util][NFC] Change rewriter to erase state less often [Util][NFC] OptimizeIntArithmetic: reduce calls to eraseState Nov 13, 2024
@stellaraccident
Collaborator

The dead code analysis dependency is a frequent footgun that is explained in the class comments of the int range optimization analysis. It has an implicit dependency on it. It would be good to see if this can be fixed upstream: it has tripped everyone up.
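For context, the setup that dependency refers to usually looks like the sketch below (the common upstream pattern, not necessarily IREE's exact code): the integer range analysis is a silent no-op unless DeadCodeAnalysis (and, conventionally, SparseConstantPropagation) is loaded into the same solver, which is why the pass ends up loading an analysis that nothing appears to query directly.

```c++
#include "mlir/Analysis/DataFlow/ConstantPropagationAnalysis.h"
#include "mlir/Analysis/DataFlow/DeadCodeAnalysis.h"
#include "mlir/Analysis/DataFlow/IntegerRangeAnalysis.h"
#include "mlir/Analysis/DataFlowFramework.h"

using namespace mlir;

// Sketch: run integer range analysis over `root`. Dropping the first two
// load<> calls would not error; the range analysis would just never produce
// useful results.
static LogicalResult runIntRangeAnalysis(Operation *root) {
  DataFlowSolver solver;
  solver.load<dataflow::DeadCodeAnalysis>();
  solver.load<dataflow::SparseConstantPropagation>();
  solver.load<dataflow::IntegerRangeAnalysis>();
  return solver.initializeAndRun(root);
}
```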

Collaborator

@stellaraccident stellaraccident left a comment

Good find with respect to the compilation time impact. I am indeed concerned that this pessimizes the analysis. I thought I had a test for this case but maybe it was just observed: if, prior to op removal, successors had their lattice fixed to the maximum range (i.e., no known bound), and removing the op would let that be resolved to a tighter bound, then not doing a recursive erase will cause the analysis to get stuck at an overly broad range.

IIRC, I was seeing this with certain kinds of factorizations where only the simplified form could be analyzed.

But given the extreme overheads, I'm definitely pro finding a way to not pay this cost. I would need to study the loop more carefully, but I thought it had accounted for not looping too much.

@krzysz00
Contributor

One note I'll make here is that, at least originally, the optimization patterns were meant to run once and top-down. UnsignedWhenEquivalent was a separate pass and used dialect conversions because the context was int-range-optimizations; canonicalize; cse; unsigned-when-equivalent.

The int-range-optimizations themselves (back when it was just the constant folding) were even initially an IR walk.

IIRC the eraseState() listener got added upstream due to concern about stale values, but I can't recall a concrete example of when it would come up.

My overall thought here would be to try and alternate cycles of walking the IR to apply optimization patterns, then canonicalization/folding, and repeating until convergence.
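A rough sketch of that alternation (assumed structure only; the helper functions here are hypothetical, not existing IREE or upstream code): each round rebuilds the solver from scratch, so there is never any stale state to erase.

```c++
#include "mlir/Analysis/DataFlowFramework.h"

using namespace mlir;

// Hypothetical helpers, named for illustration only.
void loadRangeAnalyses(DataFlowSolver &solver);  // load<...>() the needed analyses
bool applyRangeBasedRewrites(Operation *root, DataFlowSolver &solver);
bool applyCanonicalizationAndFolding(Operation *root);

static LogicalResult runUntilConvergence(Operation *root) {
  bool changed = true;
  while (changed) {
    // Fresh solver each round: nothing to erase, no stale lattice values.
    DataFlowSolver solver;
    loadRangeAnalyses(solver);
    if (failed(solver.initializeAndRun(root)))
      return failure();
    // Walk the IR top-down and apply the range-based optimization patterns,
    // then canonicalize/fold, and repeat until nothing changes.
    changed = applyRangeBasedRewrites(root, solver);
    changed |= applyCanonicalizationAndFolding(root);
  }
  return success();
}
```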

@IanWood1
Contributor Author

I am indeed concerned that this pessimizes the analysis. I thought I had a test for this case but maybe it was just observed: if, prior to op removal, successors had their lattice fixed to the maximum range (i.e., no known bound), and removing the op would let that be resolved to a tighter bound, then not doing a recursive erase will cause the analysis to get stuck at an overly broad range.

I don't understand how this affects the analysis. It looks like initializeAndRun does a top-down recomputation of the states, so each should be flushed. And I don't think eraseState does what you mentioned either (but it could explain the slowness). I don't have a test to verify this, however.

If this isn't the correct approach, I'm open to other ideas on how to speed up this pass.

IIRC the eraseState() listener got added upstream due to concern about stale values, but I can't recall a concrete example of when it would come up.

I tried removing eraseState() completely which results in a test occasionally hitting assertions:

APInt.cpp:285: int llvm::APInt::compare(const APInt &) const: Assertion `BitWidth == RHS.BitWidth && "Bit widths must be same for comparison"' failed.

This corroborates what you were saying about stale values. Considering it only occurs occasionally, I'd suspect newly created values with the same address as deleted ones are being used to look up old state.

@krzysz00
Contributor

A bitwidth of 0 is usually used for the "no data" state.

So what might make sense is, if it's possible, to migrate analysis results to replacement ops if they still make sense.

Though I don't know if the dataflow framework makes it safe to do that (given concurrency and all), so probably an eraseState() on removing a value is the way to go.
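For illustration, "migrating analysis results to replacement ops" might look something like the hedged sketch below, using the upstream IntegerValueRangeLattice as the example state; this is hypothetical, not existing IREE or upstream code, and it ignores the concurrency caveat above.

```c++
#include "mlir/Analysis/DataFlow/IntegerRangeAnalysis.h"
#include "mlir/Analysis/DataFlowFramework.h"
#include "llvm/ADT/STLExtras.h"

using namespace mlir;
using mlir::dataflow::IntegerValueRangeLattice;

// Copy the known integer range of each replaced result onto its replacement,
// instead of letting the replacement start from the "no data" state.
static void migrateRangeState(DataFlowSolver &solver, Operation *op,
                              ValueRange replacements) {
  for (auto [from, to] : llvm::zip(op->getResults(), replacements)) {
    const auto *fromState = solver.lookupState<IntegerValueRangeLattice>(from);
    if (!fromState || fromState->getValue().isUninitialized())
      continue;
    // Join the old range into the replacement's lattice rather than erasing it.
    auto *toState = solver.getOrCreateState<IntegerValueRangeLattice>(to);
    (void)toState->join(fromState->getValue());
  }
}
```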

@stellaraccident
Collaborator

I do know that I didn't add that code because I wanted to. There was a failing case I was working on. There were some more advanced patterns that I didn't include in the initial submission. But there is no test for it, and as you say, it may not even have been working as intended. It will not produce a correctness issue to reduce it like you have, but future us will probably find the case it was trying to optimize :)

But yes, there must be an eraseState in there or else you get stray pointers.

@stellaraccident
Collaborator

I'm also perfectly open to switching the overall iteration as K suggests. I tried to write comprehensive tests in anticipation of needing to do implementation work on how it is done.

@MaheshRavishankar
Contributor

MaheshRavishankar commented Nov 14, 2024

Thanks @IanWood1. This is what eraseState is doing: https://github.com/llvm/llvm-project/blob/6c9256dc5cda9184e295bc8d00be35e61b3be892/mlir/include/mlir/Analysis/DataFlowFramework.h#L340

    for (auto it = analysisStates.begin(); it != analysisStates.end(); ++it) {
      if (it->first.first == la)
        analysisStates.erase(it);
    }

It is walking the entire set of analysis states on every delete. Is that the cause of the slowness?

I think now that optimize int arithmetic is implemented using the data flow solver, it might be better to keep it that way because it is a more general analysis. A walk-based update would help, but if we add any control flow we would essentially be duplicating the data flow solver.

@IanWood1 IanWood1 force-pushed the opt_int_arith_speed branch from 7d19205 to ee51d3d on November 14, 2024 17:00
Signed-off-by: Ian Wood <ianwood2024@u.northwestern.edu>
@IanWood1
Contributor Author

@MaheshRavishankar yes, I think that's what's causing the slowness.

Also, I rebased to fix/update CI. I'm not really sure where to go with this PR as there are several different paths discussed here.

@IanWood1 IanWood1 marked this pull request as ready for review November 14, 2024 17:08
@IanWood1 IanWood1 requested a review from benvanik as a code owner November 14, 2024 17:08
@Hardcode84

Hardcode84 commented Nov 14, 2024

It should be possible to improve the upstream eraseState, I think. analysisStates is DenseMap<std::pair<LatticeAnchor, TypeID>, std::unique_ptr<AnalysisState>>, and the current code walks the entire analysisStates map because it doesn't have the TypeIDs.

But potentially, we can ask each registered DataFlowAnalysis for their StateT TypeIDs and do a proper map removal.

@MaheshRavishankar
Contributor

It should be possible to improve the upstream eraseState, I think. analysisStates is DenseMap<std::pair<LatticeAnchor, TypeID>, std::unique_ptr<AnalysisState>>, and the current code walks the entire analysisStates map because it doesn't have the TypeIDs.

But potentially, we can ask each registered DataFlowAnalysis for their StateT TypeIDs and do a proper map removal.

This was kind of what I had in mind, but I don't have enough understanding of all this TypeID stuff. @Hardcode84, do you think you could help with this?

@MaheshRavishankar
Contributor

If this improves the situation it might be worth landing... I don't have enough know-how about this to immediately say if this is OK. CI passes. @stellaraccident you might have more context.

@Hardcode84

Hardcode84 commented Nov 14, 2024

I'm pretty booked coding-wise, but the upstream change should be straightforward:

  1. Add virtual TypeID getStateTypeId() to DataFlowAnalysis, and update derived classes.
  2. Change eraseState to something like
  template <typename AnchorT>
  void eraseState(AnchorT anchor) {
    LatticeAnchor la(anchor);
    for (auto &analysis : childAnalyses) {
      analysisStates.erase({la, analysis->getStateTypeId()});
    }
  }

@stellaraccident
Collaborator

If it doesn't regress tests, then the proposed fix SGTM. I think this may pessimize some cases, but if it's not encoded in tests, that is something we will have to evaluate in the future.

@krzysz00
Contributor

I can try to take a crack at the eraseState() stuff at some point soon - just not sure if I should move from the current task to that.

Contributor

@MaheshRavishankar MaheshRavishankar left a comment

OK, let's do this for now. I think I understand what it pessimizes now. Let's come back to that later.

@IanWood1 IanWood1 merged commit 81dd4e6 into iree-org:main Nov 14, 2024
36 checks passed
IanWood1 added a commit to IanWood1/iree that referenced this pull request Nov 15, 2024
…e` (iree-org#19130)"

This reverts commit 81dd4e6.

Signed-off-by: Ian Wood <ianwood2024@u.northwestern.edu>
IanWood1 added a commit that referenced this pull request Nov 18, 2024
This was hard to debug, and I haven't yet been able to track down which op was being modified to cause the issue, nor which pattern was triggering it.
I was encountering the following assert during `Stream` pipeline (but
not global opt):

```shell
APInt.cpp:285: int llvm::APInt::compare(const APInt &) const: Assertion `BitWidth == RHS.BitWidth && "Bit widths must be same for comparison"' failed.
```

It might be worth reverting #19130
instead of this patch since I wasn't able to track this down fully.


2/2 fix for #19167.

Signed-off-by: Ian Wood <ianwood2024@u.northwestern.edu>
Groverkss pushed a commit to Groverkss/iree that referenced this pull request Dec 1, 2024
Groverkss pushed a commit to Groverkss/iree that referenced this pull request Dec 1, 2024
giacs-epic pushed a commit to giacs-epic/iree that referenced this pull request Dec 4, 2024
giacs-epic pushed a commit to giacs-epic/iree that referenced this pull request Dec 4, 2024