-
Notifications
You must be signed in to change notification settings - Fork 525
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[RFC] Bootstrapping the TCP Dialect in Torch-MLIR #1366
Comments
Overall I'm supportive, but I think the MLIR community needs to make a final decision here. I would recommend that we elaborate a bit on the details of the bringup plan, expected timelines, "unknowns", scoping, etc. and make a new RFC thread on MLIR discourse. I think that having a demo branch with Mlp2LayerModule_basic running e2e as prep work for the RFC would give folks some concrete code to rally around and really accelerate things. Especially Chris's point about "what's not part of TCP" is really useful to define (at least as a draft to rally around). |
Hey Sean, I obviously don't want to go against the community consensus here, but I do wonder if we are holding a bit of a different standard vs the rest of development that happens in torch-mlir (for which, it seems like this group is "the community"). For example, discussions about ONNX, MHLO connectivity, etc all happened here vs via LLVM side RFC. But it also sounds like you would like to see some more definition, and I don't want to discount that and assume anything. I'm running behind from vacation and think I missed some back and forth. Would you mind deep linking to the question you would like to see answered? (my personal opinion is that we have struggled a lot with the "mkdir" problem on this, and I'm somewhat interested in seeing motivated engineers be able to make contributions vs being homeless) |
How about to unblock progress, you all just create a There shouldn't be much development flow difference to working on the |
This is a good suggestion. I just created a |
sg. Just keep the history clean and reviewed. Then assuming we decide to go forward with it at a future checkpoint, we accept it in full (applying any additional feedback/patched per usual). |
I'm very excited about this starting. Definitely a lot of questions to answers and to refine. I bias there on being able to see things and evaluate how they fit and value they bring. Having an open venue where this initially bootstrapping can happen (while ensuring community has visibility and/or is able to participate in design discussions) SGTM. I think there is a lot to consider and these won't bet cut whole, but rather refined iteratively. Answering the questions from the community, showing the connectivity, testing, interop etc is part goal of this phase, and incubator is good avenue [Care of course that this be self-contained and not "infected" by PyTorch ( 🙂 ) to avoid some of the concerns from other thread.] Branch SGTM as start. And I know folks here have plans to ensure transparency and involvement. |
For those examples, the stakeholders were either just Torch-MLIR or "not broadly the MLIR upstream community". This case is different, because, as far as I know, the key stakeholders are the broader upstream MLIR community. The goal is for TCP to be "for all of the MLIR ecosystem and all frontends", not just Torch-MLIR + non-upstream stakeholders. If this were pitched as a way that did not interact with the broader community (for example, by saying "TCP + related lowerings is initially just a way for Torch-MLIR to greatly simplify its backend story, but we are designing it with an eye towards broader applicability") then it would be a different discussion. If we want to pivot that direction I'm open to that. Overall I am fine to have the development on a branch to solve the mkdir problem in the short term, as long as we have at least a vague plan for what's going to happen. I want to avoid a situation where we say "oh, it will only be a month" and then every month extend it another month for ~forever. If we see this branch as just a PoC for a future (2022Q4) more concerted RFC I think that is fine.
Especially "What are the principles that guide its design, and what are you saying “no” to?" Personally, I think that designing something that connects well with Linalg, TOSA, and MHLO/StableHLO is probably a strong enough regularizer on the design that I'm not worried, assuming that the dialect is brought up in a concerted fashion with that goal (and tested as such). I get more worried if we are just sort of abstractly designing a new dialect and not actually doing the work to demonstrate value-add by replacing the existing components with a better solution. Concretely, what I would like to see from Torch-MLIR's value-add perspective is that as a new TCP op is added, And I'm hoping that the dynamic shape design for TCP will be "right" and when we switch MHLO to StableHLO we adopt that to make the whole system way more robust. (@burmako has already expressed strong interest in using the dynamic shape design from TCP to inform StableHLO on a longer time frame). We might want to swap TCP and MHLO in that diagram too possibly, since |
Relu - Constant Propagation Signed-off-by: Robbie Perrone <robert.perrone1@marist.edu>
Updating this issue at year-end:
What we'd like to do next is: The goal remains to retain this in the same externals/llvm-external-projects path, with the same conditional compilation pathway, and to retain the same rigor of compilation but take away the branch. In a timebound manner in the future, we would like to then work on determining how to upstream this work out of torch-mlir so it is broadly accessible to other MLIR projects, while retaining the PyTorch related legalizations in Torch-MLIR. Please let us know your thoughts. @sanjoy @sjain-stanford @navahgar @asaadaldien @stellaraccident @silvasean @jpienaar @burmako |
After having some of our changes going through IREE (pack/unpack tensor ops), I am sceptical of this process, but I'm not against it. There may not be a better one, it seems. Our internal design has changed considerably from our internal prototype (we only had one op, While going through IREE, we split the ops (which was a good thing), changed from affine maps to attributes (which is a matter of opinion), and we lost the memref support, which broke any hope of using the upstream ops straight away, as we had no implementation further than tensors. This has led to us having three implementations:
With more lowering making its way into upstream, we're finally moving to those ops, but it's not a trivial exercise. IREE will have to do the same on their own repo, which multiplies the efforts. And this is just for two new ops in an existing dialect that does not replace, but adds, semantics. TCP is a whole different beast. I'm not sure the alternative (doing it full upstream) would be much easier. I can't measure the churn that pack/unpack would have had if it was done upstream, before we converged into a working form. I don't have enough information to say that it would have been easier for an entire dialect, so I just leave my own experiences here, without judgement. I think the silver lining is that you're including MHLO in the process from the start, so that it would at least be "good enough" for the two main (and reasonably different) formats (HLO, Torch). From our side (TPP-MLIR), we'd only be able to start testing it once it's on LLVM main. I imagine any small team like ours will be the same. It's also hard to review code in torch-mlir because of the differences in style and expectations. I imagine this will be a round between torch and xla folks first, until upstream to LLVM, then it will be another round of people not involved initially (like us), and there will be a "second wave" of reviews and change requests, and I'd like to request flexibility on this second wave if it seems the people that didn't review the first round find important changes and not be stuck on the design that the original group came up with. |
I don't see the current dialect as having reached a level of design maturity to merge into main. And purely from a timeline perspective, it has taken 3+ months to hash out the elementwise+broadcast design, which if I guesstimate correctly would put the first ResNet e2e run over a year out. We are aiming to use Torch-MLIR here as a temporary staging ground and year+ timelines do not fit that. To give a guide on the thinking here, I would like to contrast this with the addition of MHLO support to main (RFC here). In that RFC:
|
I suppose that MHLO will be under the OpenXLA GitHub org governance right? https://github.com/openxla/xla/tree/main/xla/mlir_hlo/mhlo/IR |
Hi everyone. To close the loop on this, we have found a new home for TCP (https://github.com/cruise-automation/mlir-tcp) and are in the last few stages of completing the internal integrations. We have now paused contributions to the cc: @stellaraccident , @ramiro050 , @silvasean, @sjarus , @zezhang , @navahgar , @sanjoy |
Awesome!! |
I can confirm our downstream integrations are now complete. I have deleted the |
We are announcing the effort to bootstrap the early development and technological demonstration of the TCP dialect (public specification, technical discussion) under llvm-external-projects within Torch-MLIR.
The TCP dialect intends to offer strong support for backend compilation connectivity to LinAlg, Tensor and other dialects, compiler friendly support for dynamic shapes, broadcasting and other capabilities, and the Torch-MLIR pathway offers an opportunity to demonstrate these dialect capabilities to augment the existing backend paths.
The initial effort will focus on connectivity both into and out of the TCP dialect, emphasizing the technical value proposition, and we welcome input into the mechanics of the op construct and connectivity.
@navahgar @sanjoy @sjain-stanford @silvasean @jpienaar @stellaraccident
The text was updated successfully, but these errors were encountered: