
[Util][NFC] OptimizeIntArithmetic: reduce calls to eraseState #19130

Merged
merged 1 commit into main from opt_int_arith_speed on Nov 14, 2024

Conversation

IanWood1
Contributor

@IanWood1 IanWood1 commented Nov 13, 2024

This pass is causing long compilation times for llama3 405b (even when cherry-picking llvm/llvm-project#115399). The majority of the time is spent in this one pass. The compilation times improve when calling eraseState only when ops are deleted. This is similar to the upstream listeners in UnsignedWhenEquivalent.cpp and IntRangeOptimizations.cpp. It appears this function loops over all LatticeAnchors on each invocation to find the one to delete, causing it to be slow. My (nonrigorous) experiment showed a decrease from 18 min to 3 min compile time. My main concern here would be this affecting correctness, as I don't know if this has unaccounted for side effects.
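For reference, the upstream listeners mentioned above look roughly like the sketch below (the class name and wiring here are illustrative, not the exact IREE change): lattice state is erased only for an op and its results when that op is actually deleted, rather than on every IR modification.

```c++
#include "mlir/Analysis/DataFlowFramework.h"
#include "mlir/IR/PatternMatch.h"

using namespace mlir;

namespace {
// Rewrite listener that invalidates solver state only for erased IR.
struct EraseStateOnDeleteListener : public RewriterBase::Listener {
  EraseStateOnDeleteListener(DataFlowSolver &solver) : solver(solver) {}

  void notifyOperationErased(Operation *op) override {
    // Drop state anchored on the op itself...
    solver.eraseState(solver.getProgramPointAfter(op));
    // ...and on each of its result values.
    for (Value result : op->getResults())
      solver.eraseState(result);
  }

  DataFlowSolver &solver;
};
} // namespace
```

The pass would then install a listener like this on whatever driver applies its rewrites, so only actual deletions pay the eraseState cost.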

Also, I'm not sure what DeadCodeAnalysis is being loaded/used for. I wasn't able to track down any users of it. Maybe that could be removed too?

@IanWood1 IanWood1 changed the title [Util][NFC] Change rewriter to erase state less often [Util][NFC] OptimizeIntArithmetic: reduce calls to eraseState Nov 13, 2024
@stellaraccident
Collaborator

The dead code analysis dependency is a frequent footgun that is explained in the class comments of the int range optimization analysis. It has an implicit dependency on it. It would be good to see if this can be fixed upstream: it has tripped everyone up.
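For context, the setup that dependency refers to usually looks like the sketch below (the common upstream pattern, not necessarily IREE's exact code): the integer range analysis is a silent no-op unless DeadCodeAnalysis (and, conventionally, SparseConstantPropagation) is loaded into the same solver, which is why the pass ends up loading an analysis that nothing appears to query directly.

```c++
#include "mlir/Analysis/DataFlow/ConstantPropagationAnalysis.h"
#include "mlir/Analysis/DataFlow/DeadCodeAnalysis.h"
#include "mlir/Analysis/DataFlow/IntegerRangeAnalysis.h"
#include "mlir/Analysis/DataFlowFramework.h"

using namespace mlir;

// Sketch: run integer range analysis over `root`. Dropping the first two
// load<> calls would not error; the range analysis would just never produce
// useful results.
static LogicalResult runIntRangeAnalysis(Operation *root) {
  DataFlowSolver solver;
  solver.load<dataflow::DeadCodeAnalysis>();
  solver.load<dataflow::SparseConstantPropagation>();
  solver.load<dataflow::IntegerRangeAnalysis>();
  return solver.initializeAndRun(root);
}
```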

Collaborator

@stellaraccident stellaraccident left a comment

Good find with respect to the compilation time impact. I am indeed concerned that this pessimizes the analysis. I thought I had a test for this case but maybe it was just observed: if, prior to op removal, successors had their lattice fixed to the maximum range (i.e., no known bound), and removing the op would let that be resolved to a tighter bound, then not doing a recursive erase will cause the analysis to get stuck at an overly broad range.

IIRC, I was seeing this with certain kinds of factorizations where only the simplified form could be analyzed.

But given the extreme overheads, I'm definitely pro finding a way to not pay this cost. I would need to study the loop more carefully, but I thought it had accounted for not looping too much.

@krzysz00
Contributor

One note I'll make here is that, at least originally, the optimization patterns were meant to run once and top-down. UnsignedWhenEquivalent was a separate pass and used dialect conversions because the context was int-range-optimizations; canonicalize; cse; unsigned-when-equivalent.

The int-range-optimizations themselves (back when it was just the constant folding) were even initially an IR walk.

IIRC the eraseState() listener got added upstream due to concern about stale values, but I can't recall a concrete example of when it would come up.

My overall thought here would be to try and alternate cycles of walking the IR to apply optimization patterns, then canonicalization/folding, and repeating until convergence.
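A rough sketch of that alternation (assumed structure only; the helper functions here are hypothetical, not existing IREE or upstream code): each round rebuilds the solver from scratch, so there is never any stale state to erase.

```c++
#include "mlir/Analysis/DataFlowFramework.h"

using namespace mlir;

// Hypothetical helpers, named for illustration only.
void loadRangeAnalyses(DataFlowSolver &solver);  // load<...>() the needed analyses
bool applyRangeBasedRewrites(Operation *root, DataFlowSolver &solver);
bool applyCanonicalizationAndFolding(Operation *root);

static LogicalResult runUntilConvergence(Operation *root) {
  bool changed = true;
  while (changed) {
    // Fresh solver each round: nothing to erase, no stale lattice values.
    DataFlowSolver solver;
    loadRangeAnalyses(solver);
    if (failed(solver.initializeAndRun(root)))
      return failure();
    // Walk the IR top-down and apply the range-based optimization patterns,
    // then canonicalize/fold, and repeat until nothing changes.
    changed = applyRangeBasedRewrites(root, solver);
    changed |= applyCanonicalizationAndFolding(root);
  }
  return success();
}
```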

@IanWood1
Contributor Author

I am indeed concerned that this pessimizes the analysis. I thought I had a test for this case but maybe it was just observed: if, prior to op removal, successors had their lattice fixed to the maximum range (i.e., no known bound), and removing the op would let that be resolved to a tighter bound, then not doing a recursive erase will cause the analysis to get stuck at an overly broad range.

I don't understand how this affects the analysis. It looks like initializeAndRun does a top-down recomputation of the states, so each should be flushed. And I don't think eraseState does what you mentioned either (but it could explain the slowness). I don't have a test to verify this, however.

If this isn't the correct approach, I'm open to other ideas on how to speed up this pass.

IIRC the eraseState() listener got added upstream due to concern about stale values, but I can't recall a concrete example of when it would come up.

I tried removing eraseState() completely which results in a test occasionally hitting assertions:

APInt.cpp:285: int llvm::APInt::compare(const APInt &) const: Assertion `BitWidth == RHS.BitWidth && "Bit widths must be same for comparison"' failed.

This corroborates what you were saying about stale values. Considering it only occurs occasionally, I'd suspect newly created values with the same address as deleted ones are being used to look up old state.

@krzysz00
Contributor

A bitwidth of 0 is usually used for the "no data" state.

So what might make sense is, if it's possible, to migrate analysis results to replacement ops if they still make sense.

Though I don't know if the dataflow framework makes it safe to do that (given concurrency and all), so probably an eraseState() on removing a value is the way to go.
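For illustration, "migrating analysis results to replacement ops" might look something like the hedged sketch below, using the upstream IntegerValueRangeLattice as the example state; this is hypothetical, not existing IREE or upstream code, and it ignores the concurrency caveat above.

```c++
#include "mlir/Analysis/DataFlow/IntegerRangeAnalysis.h"
#include "mlir/Analysis/DataFlowFramework.h"
#include "llvm/ADT/STLExtras.h"

using namespace mlir;
using mlir::dataflow::IntegerValueRangeLattice;

// Copy the known integer range of each replaced result onto its replacement,
// instead of letting the replacement start from the "no data" state.
static void migrateRangeState(DataFlowSolver &solver, Operation *op,
                              ValueRange replacements) {
  for (auto [from, to] : llvm::zip(op->getResults(), replacements)) {
    const auto *fromState = solver.lookupState<IntegerValueRangeLattice>(from);
    if (!fromState || fromState->getValue().isUninitialized())
      continue;
    // Join the old range into the replacement's lattice rather than erasing it.
    auto *toState = solver.getOrCreateState<IntegerValueRangeLattice>(to);
    (void)toState->join(fromState->getValue());
  }
}
```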

@stellaraccident
Collaborator

I do know that I didn't add that code because I wanted to. There was a failing case I was working on. There were some more advanced patterns that I didn't include in the initial submission. But there is no test for it, and as you say, it may not even have been working as intended. It will not produce a correctness issue to reduce it like you have, but future us will probably find the case it was trying to optimize :)

But yes, there must be an eraseState in there or else you get stray pointers.

@stellaraccident
Collaborator

I'm also perfectly open to switching the overall iteration as K suggests. I tried to write comprehensive tests in anticipation of needing to do implementation work on how it is done.

@MaheshRavishankar
Contributor

MaheshRavishankar commented Nov 14, 2024

Thanks @IanWood1. This is what eraseState is doing: https://github.com/llvm/llvm-project/blob/6c9256dc5cda9184e295bc8d00be35e61b3be892/mlir/include/mlir/Analysis/DataFlowFramework.h#L340

    for (auto it = analysisStates.begin(); it != analysisStates.end(); ++it) {
      if (it->first.first == la)
        analysisStates.erase(it);
    }

It is walking the entire set of analysis states on every delete. Is that the cause of the slowness?

I think now that optimize int arithmetic is implemented using the data flow solver, it might be better to keep it that way because it is a more general analysis. A walk-based update would help, but if we add any control flow we would essentially be duplicating the data flow solver.

@IanWood1 IanWood1 force-pushed the opt_int_arith_speed branch from 7d19205 to ee51d3d on November 14, 2024 17:00
Signed-off-by: Ian Wood <ianwood2024@u.northwestern.edu>
@IanWood1
Contributor Author

@MaheshRavishankar yes, I think that's what's causing the slowness.

Also, I rebased to fix/update CI. I'm not really sure where to go with this PR as there are several different paths discussed here.

@IanWood1 IanWood1 marked this pull request as ready for review November 14, 2024 17:08
@IanWood1 IanWood1 requested a review from benvanik as a code owner November 14, 2024 17:08
@Hardcode84

Hardcode84 commented Nov 14, 2024

It should be possible to improve the upstream eraseState, I think. analysisStates is DenseMap<std::pair<LatticeAnchor, TypeID>, std::unique_ptr<AnalysisState>>, and the current code walks the entire analysisStates map because it doesn't have the TypeIDs.

But potentially, we can ask each registered DataFlowAnalysis for their StateT TypeIDs and do a proper map removal.

@MaheshRavishankar
Contributor

It should be possible to improve the upstream eraseState, I think. analysisStates is DenseMap<std::pair<LatticeAnchor, TypeID>, std::unique_ptr<AnalysisState>>, and the current code walks the entire analysisStates map because it doesn't have the TypeIDs.

But potentially, we can ask each registered DataFlowAnalysis for their StateT TypeIDs and do a proper map removal.

This was kind of what I had in mind, but I don't have enough understanding of all this TypeID stuff. @Hardcode84, do you think you could help with this?

@MaheshRavishankar
Contributor

If this improves the situation it might be worth landing... I don't have enough know-how about this to immediately say if this is OK. CI passes. @stellaraccident you might have more context.

@Hardcode84

Hardcode84 commented Nov 14, 2024

I'm pretty booked coding-wise, but the upstream change should be straightforward:

  1. Add virtual TypeID getStateTypeId() to DataFlowAnalysis, and update derived classes.
  2. Change eraseState to something like
  template <typename AnchorT>
  void eraseState(AnchorT anchor) {
    LatticeAnchor la(anchor);
    for (auto &analysis : childAnalyses) {
      analysisStates.erase({la, analysis->getStateTypeId()});
    }
  }

@stellaraccident
Collaborator

If it doesn't regress tests, then the proposed fix SGTM. I think this may pessimize some cases, but if it's not encoded in tests, that is something we will have to evaluate in the future.

@krzysz00
Contributor

I can try to take a crack at the eraseState() stuff at some point soon - just not sure if I should move from the current task to that.

Contributor

@MaheshRavishankar MaheshRavishankar left a comment

OK, let's do this for now. I think I understand what it pessimizes now. Let's come back to that later.

@IanWood1 IanWood1 merged commit 81dd4e6 into iree-org:main Nov 14, 2024
36 checks passed
IanWood1 added a commit to IanWood1/iree that referenced this pull request Nov 15, 2024
…e` (iree-org#19130)"

This reverts commit 81dd4e6.

Signed-off-by: Ian Wood <ianwood2024@u.northwestern.edu>
IanWood1 added a commit that referenced this pull request Nov 18, 2024
This was hard to debug, and I haven't yet been able to track down which op was being modified to cause the issue, nor which pattern was triggering it.
I was encountering the following assert during `Stream` pipeline (but
not global opt):

```shell
APInt.cpp:285: int llvm::APInt::compare(const APInt &) const: Assertion `BitWidth == RHS.BitWidth && "Bit widths must be same for comparison"' failed.
```

It might be worth reverting #19130
instead of this patch since I wasn't able to track this down fully.


2/2 fix for #19167.

Signed-off-by: Ian Wood <ianwood2024@u.northwestern.edu>
Groverkss pushed a commit to Groverkss/iree that referenced this pull request Dec 1, 2024
Groverkss pushed a commit to Groverkss/iree that referenced this pull request Dec 1, 2024
giacs-epic pushed a commit to giacs-epic/iree that referenced this pull request Dec 4, 2024
giacs-epic pushed a commit to giacs-epic/iree that referenced this pull request Dec 4, 2024