Failing test case for multi-GPU ProductionRuleField #2199
Conversation
Thanks for the test, @matt-gardner. I think we should move away from the scattering. It's complicated, and I've also noticed some performance weirdness. As an alternative I tried simply taking a batch for each GPU in this PR: https://github.com/allenai/allennlp/pull/2200/files#diff-043dcd121296c3cef3f3ff8c74127ff1 (includes unrelated changes; the relevant changes are in trainer.py). @matt-peters had some concerns about this approach earlier in the quarter, but I think we should revisit it, as it now seems simpler and more robust.
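For concreteness, here is a minimal sketch of what "one batch per GPU" could look like inside a trainer step. This is not the code in #2200; the `move_batch` helper, the assumption that the model returns a dict with a `"loss"` key, and the assumption that the iterator already yields one batch per device are all illustrative.

```python
import torch


def move_batch(batch: dict, device: torch.device) -> dict:
    """Move a flat dict of tensors onto one device.  (Hypothetical helper;
    nested structures like ProductionRules need more care -- see the end of
    this thread.)"""
    return {key: value.to(device) if isinstance(value, torch.Tensor) else value
            for key, value in batch.items()}


def train_step(model: torch.nn.Module, batches, devices, optimizer) -> float:
    """One optimizer step that gives each GPU its own complete batch
    instead of scattering a single large batch across devices."""
    assert len(batches) == len(devices)
    replicas = torch.nn.parallel.replicate(model, devices)
    kwargs_per_gpu = [move_batch(batch, device)
                      for batch, device in zip(batches, devices)]
    outputs = torch.nn.parallel.parallel_apply(
        replicas, [()] * len(replicas), kwargs_tup=kwargs_per_gpu)
    # Average the per-replica losses on the first device and backprop once;
    # replicate() broadcasts the parameters, so gradients flow back to `model`.
    loss = torch.stack([out["loss"].to(devices[0]) for out in outputs]).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```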
I'm also in favor of switching to taking one batch per GPU. I think @matt-peters' concerns were about getting an even amount of computation on each GPU. I think you can prove, however, that taking one batch per GPU will not be slower if you assume the batches are in the same order: you just have less padding computation on the GPUs with the smallest batches, and they might sit idle for a bit waiting for the longest batch. If you use a bucket iterator with a larger batch size, though, you'll be more likely to get per-GPU batches that are the same size, so there will be less padding computation overall (i.e., the batches will not be in the same order). That's the main difference, I think. Maybe we can do something to the bucket iterator to have a larger grouping when you have multiple GPUs, but I wouldn't worry about that until we've actually profiled things and know for sure that this is a problem.
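A rough sketch of the "larger grouping" idea, assuming each instance exposes a token count (the `"num_tokens"` key below is made up for illustration): sort a window of `batch_size * num_gpus` instances by length before splitting, so each GPU gets batches of similar lengths and padding waste is reduced.

```python
from typing import Iterator, List


def multi_gpu_buckets(instances: List[dict],
                      batch_size: int,
                      num_gpus: int) -> Iterator[List[dict]]:
    """Yield per-GPU batches: sort each window of batch_size * num_gpus
    instances by length, then chop the sorted window into num_gpus batches."""
    group_size = batch_size * num_gpus
    for start in range(0, len(instances), group_size):
        window = sorted(instances[start:start + group_size],
                        key=lambda instance: instance["num_tokens"])
        for i in range(0, len(window), batch_size):
            yield window[i:i + batch_size]
```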
My $0.02: if we do make modifications, we should move toward using the built-in PyTorch multi-GPU training instead of rolling our own (and then worrying about how to fix it in edge cases). If you go the route of separate batches in the iterator for each GPU, then you will also need to make the bucket iterator multi-GPU aware, as it will significantly hamper performance otherwise.
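For reference, the single-process flavor of the built-in route is `torch.nn.DataParallel`, which does the replicate/scatter/gather itself; the tiny model below is just a stand-in, not an AllenNLP model.

```python
import torch
from torch import nn


class TinyModel(nn.Module):
    """Stand-in model; a real AllenNLP model would go here."""
    def __init__(self) -> None:
        super().__init__()
        self.linear = nn.Linear(10, 2)

    def forward(self, inputs: torch.Tensor) -> torch.Tensor:
        return self.linear(inputs)


model = TinyModel()
if torch.cuda.device_count() > 1:
    # DataParallel replicates the module and splits each input batch
    # along dimension 0 across the visible GPUs.
    model = nn.DataParallel(model)
if torch.cuda.is_available():
    model = model.cuda()

batch = torch.randn(32, 10)
if torch.cuda.is_available():
    batch = batch.cuda()
outputs = model(batch)  # gathered back onto the default device
```

The other built-in option is `torch.nn.parallel.DistributedDataParallel` (one process per GPU), which PyTorch recommends over `DataParallel` for performance.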
@matt-peters and I spent a little time today trying to understand the performance of the Transformer ELMo reimplementation w.r.t. max sequence length. His initial concern, if I'm articulating it correctly, was that using a batch per GPU would result in batches with small max lengths being computed concurrently with batches with large max lengths (even though the total number of tokens would be roughly equal due to using …).

Fortunately, we observed this not to be the case. For batches with maximum lengths ranging from 14 tokens to 388 tokens, the forward pass time ranged from 0.040 seconds to just 0.065 seconds. Further, the batches with larger max lengths generally finished faster, as they had fewer total tokens. On a per-token basis they're slower, but runtimes lie within a fairly narrow range; any quadratic-time computation is getting washed out. Spreadsheet with a handful of data points: https://docs.google.com/spreadsheets/d/1tRj0bntfoBNUgnpMOwj--HK6MTRRltABXcHLi7z2-D0/edit?usp=sharing

Given this, it seems reasonable to use a batch per GPU in the interim, especially considering that the existing scattering implementation is causing imbalances in GPU memory for reasons we don't currently understand. In the longer term I heartily agree with @matt-peters that we should move to the built-in multi-GPU training if at all possible.
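This is not the script behind the spreadsheet, just a sketch of how such forward-pass timings can be taken; the explicit `torch.cuda.synchronize()` calls matter because CUDA kernels are launched asynchronously.

```python
import time

import torch


def time_forward(model: torch.nn.Module,
                 batch: torch.Tensor,
                 device: torch.device) -> float:
    """Return the wall-clock time of one forward pass on `device`."""
    model = model.to(device).eval()
    batch = batch.to(device)
    torch.cuda.synchronize()          # finish any pending work first
    start = time.perf_counter()
    with torch.no_grad():
        model(batch)
    torch.cuda.synchronize()          # wait for the forward pass to finish
    return time.perf_counter() - start
```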
We think this is superseded by another PR. @brendan-ai2 is going to check.
@brendan-ai2 please close this out and refer to the PR that supersedes it.
Thanks for the ping. I've merged this PR into my fix: #2200. That's still waiting on the trainer refactor.
#1944 was supposed to handle complex scattering, but it doesn't work for the `ProductionRuleField`, as brought up in #2057. I have reproduced this and made a failing test case, which is in this PR.

I'm afraid I'm not going to be able to actually fix the underlying problem, though, as I don't really understand the voodoo that @brendan-ai2 did to convert tensors to pointers and back, and why it doesn't work in this case. The issue is that the `ProductionRuleField` produces a list of `ProductionRules`, with tensors internal to the data structure that don't get sent to the GPUs - they all remain on the CPU.
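To make the failure mode concrete: a device-moving utility that only looks at top-level tensors will miss the tensors tucked inside each rule. The sketch below shows the kind of recursion needed; the `ProductionRule` class here is a simplified stand-in for AllenNLP's data structure, and this is not the actual scatter logic from #1944.

```python
from typing import Any, NamedTuple

import torch


class ProductionRule(NamedTuple):
    """Simplified stand-in for AllenNLP's ProductionRule data structure."""
    rule: str
    is_global_rule: bool
    rule_id: torch.Tensor


def move_to_device(obj: Any, device: torch.device) -> Any:
    """Recursively move tensors to `device`, descending into dicts, lists,
    and (named)tuples so tensors buried inside structures like
    ProductionRule don't get left behind on the CPU."""
    if isinstance(obj, torch.Tensor):
        return obj.to(device)
    if isinstance(obj, dict):
        return {key: move_to_device(value, device) for key, value in obj.items()}
    if isinstance(obj, tuple) and hasattr(obj, "_fields"):  # namedtuple
        return type(obj)(*(move_to_device(item, device) for item in obj))
    if isinstance(obj, (list, tuple)):
        return type(obj)(move_to_device(item, device) for item in obj)
    return obj
```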