Failing test case for multi-GPU ProductionRuleField #2199
Conversation
Thanks for the test, @matt-gardner. I think we should move away from the scattering. It's complicated, and I've also noticed some performance weirdness. As an alternative I tried simply taking a batch for each GPU in this PR: https://github.com/allenai/allennlp/pull/2200/files#diff-043dcd121296c3cef3f3ff8c74127ff1 (includes unrelated changes; the relevant changes are in trainer.py). @matt-peters had some concerns about this approach earlier in the quarter, but I think we should revisit it, as it now seems simpler and more robust.
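For concreteness, here is a minimal sketch of what "one batch per GPU" could look like inside a trainer step. This is not the code in #2200; the `move_batch` helper, the assumption that the model returns a dict with a `"loss"` key, and the assumption that the iterator already yields one batch per device are all illustrative.

```python
import torch


def move_batch(batch: dict, device: torch.device) -> dict:
    """Move a flat dict of tensors onto one device.  (Hypothetical helper;
    nested structures like ProductionRules need more care -- see the end of
    this thread.)"""
    return {key: value.to(device) if isinstance(value, torch.Tensor) else value
            for key, value in batch.items()}


def train_step(model: torch.nn.Module, batches, devices, optimizer) -> float:
    """One optimizer step that gives each GPU its own complete batch
    instead of scattering a single large batch across devices."""
    assert len(batches) == len(devices)
    replicas = torch.nn.parallel.replicate(model, devices)
    kwargs_per_gpu = [move_batch(batch, device)
                      for batch, device in zip(batches, devices)]
    outputs = torch.nn.parallel.parallel_apply(
        replicas, [()] * len(replicas), kwargs_tup=kwargs_per_gpu)
    # Average the per-replica losses on the first device and backprop once;
    # replicate() broadcasts the parameters, so gradients flow back to `model`.
    loss = torch.stack([out["loss"].to(devices[0]) for out in outputs]).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```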
I'm also in favor of switching to taking one batch per GPU. I think @matt-peters' concerns were about getting an even amount of computation on each GPU. I think you can prove, however, that taking one batch per GPU will not be slower if you assume the batches are in the same order: you just have less padding computation on the GPUs with the smallest batches, and they might sit idle for a bit waiting for the longest batch. If you use a bucket iterator with a larger batch size, though, you'll be more likely to get per-GPU batches that are the same size, so there will be less padding computation overall (i.e., the batches will not be in the same order). That's the main difference, I think. Maybe we can do something to the bucket iterator to have a larger grouping when you have multiple GPUs, but I wouldn't worry about that until we've actually profiled things and know for sure that this is a problem.
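A rough sketch of the "larger grouping" idea, assuming each instance exposes a token count (the `"num_tokens"` key below is made up for illustration): sort a window of `batch_size * num_gpus` instances by length before splitting, so each GPU gets batches of similar lengths and padding waste is reduced.

```python
from typing import Iterator, List


def multi_gpu_buckets(instances: List[dict],
                      batch_size: int,
                      num_gpus: int) -> Iterator[List[dict]]:
    """Yield per-GPU batches: sort each window of batch_size * num_gpus
    instances by length, then chop the sorted window into num_gpus batches."""
    group_size = batch_size * num_gpus
    for start in range(0, len(instances), group_size):
        window = sorted(instances[start:start + group_size],
                        key=lambda instance: instance["num_tokens"])
        for i in range(0, len(window), batch_size):
            yield window[i:i + batch_size]
```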
My $0.02: if we do make modifications, we should move toward using the built-in PyTorch multi-GPU training instead of rolling our own (and then worrying about how to fix it in edge cases). If you go the route of separate batches in the iterator for each GPU, then you will also need to make the bucket iterator multi-GPU aware, as it will significantly hamper performance otherwise.
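For reference, the single-process flavor of the built-in route is `torch.nn.DataParallel`, which does the replicate/scatter/gather itself; the tiny model below is just a stand-in, not an AllenNLP model.

```python
import torch
from torch import nn


class TinyModel(nn.Module):
    """Stand-in model; a real AllenNLP model would go here."""
    def __init__(self) -> None:
        super().__init__()
        self.linear = nn.Linear(10, 2)

    def forward(self, inputs: torch.Tensor) -> torch.Tensor:
        return self.linear(inputs)


model = TinyModel()
if torch.cuda.device_count() > 1:
    # DataParallel replicates the module and splits each input batch
    # along dimension 0 across the visible GPUs.
    model = nn.DataParallel(model)
if torch.cuda.is_available():
    model = model.cuda()

batch = torch.randn(32, 10)
if torch.cuda.is_available():
    batch = batch.cuda()
outputs = model(batch)  # gathered back onto the default device
```

The other built-in option is `torch.nn.parallel.DistributedDataParallel` (one process per GPU), which PyTorch recommends over `DataParallel` for performance.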
@matt-peters and I spent a little time today trying to understand the performance of the Transformer ELMo reimplementation w.r.t. max sequence length. His initial concern, if I'm articulating it correctly, was that using a batch per GPU would result in batches with small max lengths being computed concurrently with batches with large max lengths (even though the total number of tokens would be roughly equal due to using …).

Fortunately, we observed this not to be the case. For batches with maximum lengths ranging from 14 tokens to 388 tokens, the forward pass time ranged from 0.040 seconds to just 0.065 seconds. Further, the batches with larger max lengths generally finished faster, as they had fewer total tokens. On a per-token basis they're slower, but runtimes lie within a fairly narrow range; any quadratic-time computation is getting washed out. Spreadsheet with a handful of data points: https://docs.google.com/spreadsheets/d/1tRj0bntfoBNUgnpMOwj--HK6MTRRltABXcHLi7z2-D0/edit?usp=sharing

Given this, it seems reasonable to use a batch per GPU in the interim, especially considering that the existing scattering implementation is causing imbalances in GPU memory for reasons we don't currently understand. In the longer term I heartily agree with @matt-peters that we should move to the built-in multi-GPU training if at all possible.
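This is not the script behind the spreadsheet, just a sketch of how such forward-pass timings can be taken; the explicit `torch.cuda.synchronize()` calls matter because CUDA kernels are launched asynchronously.

```python
import time

import torch


def time_forward(model: torch.nn.Module,
                 batch: torch.Tensor,
                 device: torch.device) -> float:
    """Return the wall-clock time of one forward pass on `device`."""
    model = model.to(device).eval()
    batch = batch.to(device)
    torch.cuda.synchronize()          # finish any pending work first
    start = time.perf_counter()
    with torch.no_grad():
        model(batch)
    torch.cuda.synchronize()          # wait for the forward pass to finish
    return time.perf_counter() - start
```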
We think this is superseded by another PR. @brendan-ai2 is going to check.
@brendan-ai2 please close this out and refer to the PR that supersedes it.
Thanks for the ping. I've merged this PR into my fix: #2200. That's still waiting on the trainer refactor.
#1944 was supposed to handle complex scattering, but it doesn't work for the `ProductionRuleField`, as brought up in #2057. I have reproduced this and made a failing test case, which is in this PR.

I'm afraid I'm not going to be able to actually fix the underlying problem, though, as I don't really understand the voodoo that @brendan-ai2 did to convert tensors to pointers and back, and why it doesn't work in this case. The issue is that the `ProductionRuleField` produces a list of `ProductionRules`, with tensors internal to the data structure that don't get sent to the GPUs - they all remain on the CPU.
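To make the failure mode concrete: a device-moving utility that only looks at top-level tensors will miss the tensors tucked inside each rule. The sketch below shows the kind of recursion needed; the `ProductionRule` class here is a simplified stand-in for AllenNLP's data structure, and this is not the actual scatter logic from #1944.

```python
from typing import Any, NamedTuple

import torch


class ProductionRule(NamedTuple):
    """Simplified stand-in for AllenNLP's ProductionRule data structure."""
    rule: str
    is_global_rule: bool
    rule_id: torch.Tensor


def move_to_device(obj: Any, device: torch.device) -> Any:
    """Recursively move tensors to `device`, descending into dicts, lists,
    and (named)tuples so tensors buried inside structures like
    ProductionRule don't get left behind on the CPU."""
    if isinstance(obj, torch.Tensor):
        return obj.to(device)
    if isinstance(obj, dict):
        return {key: move_to_device(value, device) for key, value in obj.items()}
    if isinstance(obj, tuple) and hasattr(obj, "_fields"):  # namedtuple
        return type(obj)(*(move_to_device(item, device) for item in obj))
    if isinstance(obj, (list, tuple)):
        return type(obj)(move_to_device(item, device) for item in obj)
    return obj
```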