
Failing test case for multi-GPU ProductionRuleField #2199

Conversation

matt-gardner
Contributor

#1944 was supposed to handle complex scattering, but it doesn't work for the ProductionRuleField, as brought up in #2057. I have reproduced this and made a failing test case, which is in this PR.

I'm afraid I'm not going to be able to actually fix the underlying problem, though, as I don't really understand the voodoo that @brendan-ai2 did to convert tensors to pointers and back, or why it doesn't work in this case. The issue is that the ProductionRuleField produces a list of ProductionRules with tensors nested inside the data structure, and those tensors never get sent to the GPUs; they all remain on the CPU.
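As an illustration of the failure mode (a toy sketch, not AllenNLP's actual code), the snippet below shows how a recursive mover that only understands tensors, dicts, and lists silently leaves behind tensors that live inside opaque Python objects, which is essentially what happens to the tensors carried by each ProductionRule:

```python
import torch

class ProductionRuleStub:
    """Illustrative stand-in (not AllenNLP's actual ProductionRule) for an
    object that carries a tensor in a plain Python attribute."""
    def __init__(self, rule: str, rule_id: torch.Tensor):
        self.rule = rule
        self.rule_id = rule_id

def naive_move_to_device(obj, device):
    """A simplified mover that only understands tensors, dicts, and lists.
    Tensors hidden inside other Python objects are silently left on the CPU."""
    if isinstance(obj, torch.Tensor):
        return obj.to(device)
    if isinstance(obj, dict):
        return {key: naive_move_to_device(value, device) for key, value in obj.items()}
    if isinstance(obj, list):
        return [naive_move_to_device(value, device) for value in obj]
    return obj  # opaque object: its internal tensors are never touched

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
batch = {
    "tokens": torch.zeros(2, 5, dtype=torch.long),
    "actions": [[ProductionRuleStub("S -> NP VP", torch.tensor([3]))]],
}
moved = naive_move_to_device(batch, device)
print(moved["tokens"].device)                 # follows `device`
print(moved["actions"][0][0].rule_id.device)  # always cpu
```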

@brendan-ai2
Contributor

Thanks for the test, @matt-gardner. I think we should move away from the scattering. It's complicated, and I've also noticed some performance weirdness. As an alternative, I tried simply taking one batch per GPU in this PR: https://github.com/allenai/allennlp/pull/2200/files#diff-043dcd121296c3cef3f3ff8c74127ff1 (it includes unrelated changes; the relevant changes are in trainer.py). @matt-peters had some concerns about this approach earlier in the quarter, but I think we should revisit it, as it now seems simpler and more robust.
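For concreteness, here is a rough sketch (not the actual code in #2200) of what taking one batch per GPU could look like with PyTorch's replicate and parallel_apply. It assumes the model already lives on devices[0], takes its batch as keyword arguments, and returns a dict with a scalar "loss" entry:

```python
from itertools import islice

import torch
from torch.nn.parallel import parallel_apply, replicate

def move_to_device(obj, device):
    """Recursively move tensors inside dicts, lists, and tuples."""
    if isinstance(obj, torch.Tensor):
        return obj.to(device)
    if isinstance(obj, dict):
        return {key: move_to_device(value, device) for key, value in obj.items()}
    if isinstance(obj, (list, tuple)):
        return type(obj)(move_to_device(value, device) for value in obj)
    return obj

def multi_gpu_step(model, batch_iterator, devices):
    """Pull one batch per GPU, run model replicas in parallel, and average the losses.
    Assumes `model` lives on devices[0], its forward takes the batch as keyword
    arguments, and it returns a dict with a scalar "loss" entry."""
    batches = list(islice(batch_iterator, len(devices)))
    used_devices = devices[: len(batches)]
    inputs = [move_to_device(batch, device) for batch, device in zip(batches, used_devices)]
    replicas = replicate(model, used_devices)
    outputs = parallel_apply(replicas, [()] * len(inputs),
                             kwargs_tup=inputs, devices=used_devices)
    losses = [output["loss"].to(used_devices[0]) for output in outputs]
    return torch.stack(losses).mean()
```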

@matt-gardner
Contributor Author

I'm also in favor of switching to taking one batch per GPU. I think @matt-peters' concerns were about getting an even amount of computation on each GPU. I think you can prove that taking one batch per GPU will not be slower, however, if you assume that the batches are in the same order. You just have less padding computation on the GPUs with the smallest batches, and they might sit idle for a bit waiting for the longest batch.

I guess, though, that if you use a bucket iterator with a larger batch size, you'll be more likely to get per-GPU batches that are the same size, so there will be less padding computation overall (i.e., the batches will not be in the same order). That's the main difference, I think. Maybe we can change the bucket iterator to use a larger grouping window when you have multiple GPUs. But I wouldn't worry about that until we've actually profiled things and know for sure that this is a problem.
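A toy sketch of that grouping idea (names and details hypothetical, not AllenNLP's BucketIterator): collect batch_size * num_gpus instances at a time, sort the pool by length, and slice it into per-GPU batches so the batches that run concurrently need similar amounts of padding.

```python
from typing import Iterable, Iterator, List, Sequence

def multi_gpu_bucket_batches(instances: Iterable[Sequence[int]],
                             batch_size: int,
                             num_gpus: int) -> Iterator[List[List[Sequence[int]]]]:
    """Gather batch_size * num_gpus instances, sort the pool by length, then cut it
    into num_gpus batches so concurrently-running batches have similar max lengths."""
    pool: List[Sequence[int]] = []
    for instance in instances:
        pool.append(instance)
        if len(pool) == batch_size * num_gpus:
            pool.sort(key=len)
            yield [pool[i * batch_size:(i + 1) * batch_size] for i in range(num_gpus)]
            pool = []
    if pool:  # final, possibly smaller group
        pool.sort(key=len)
        batches = [pool[i * batch_size:(i + 1) * batch_size] for i in range(num_gpus)]
        yield [batch for batch in batches if batch]
```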

@matt-peters
Contributor

My $0.02 - if we do make modifications, we should move toward using the built-in PyTorch multi-GPU training instead of rolling our own (and then worrying about how to fix it in edge cases). If you go the route of separate batches in the iterator for each GPU, then you will also need to make the bucket iterator multi-GPU aware, as performance will be significantly hampered otherwise.
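For reference, the built-in route would look roughly like wrapping the model in DistributedDataParallel with one process per GPU, each process consuming its own stream of batches. A minimal sketch, assuming the processes are launched with something like torch.distributed.launch so the process-group environment variables are already set:

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel

def wrap_for_ddp(model: torch.nn.Module, local_rank: int) -> torch.nn.Module:
    """Minimal sketch of PyTorch's built-in data parallelism: one process per GPU,
    each consuming its own batches, with gradient synchronization handled for us.
    Assumes MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE are set in the
    environment, e.g. by torch.distributed.launch."""
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(local_rank)
    model = model.cuda(local_rank)
    return DistributedDataParallel(model, device_ids=[local_rank])
```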

@brendan-ai2
Contributor

@matt-peters and I spent a little time today trying to understand the performance of the Transformer ELMo reimplementation w.r.t. max sequence length. His initial concern, if I'm articulating it correctly, was that using a batch per GPU would result in batches with small max lengths being computed concurrently with batches with large max lengths. (Even though the total number of tokens would be roughly equal due to using maximum_samples_per_batch.) Then, given the quadratic nature of transformers, the GPU with the small max length would finish first and sit idle.

Fortunately, we observed that this is not the case. For batches with maximum lengths ranging from 14 tokens to 388 tokens, the forward pass time ranged from 0.040 seconds to just 0.065 seconds. Further, the batches with larger max lengths generally finished faster because they had fewer total tokens. On a per-token basis they are slower, but runtimes lie within a fairly narrow range, so any quadratic-time computation is getting washed out. Spreadsheet with a handful of data points: https://docs.google.com/spreadsheets/d/1tRj0bntfoBNUgnpMOwj--HK6MTRRltABXcHLi7z2-D0/edit?usp=sharing.
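For anyone who wants to reproduce this kind of measurement, a minimal timing sketch (assuming a model whose forward takes the batch as keyword arguments and a batch of tensors already on the target device) could look like this:

```python
import time

import torch

def time_forward_pass(model, batch, device):
    """Rough wall-clock timing of one forward pass. Assumes `batch` is a dict of
    tensors already on `device` and the model's forward takes keyword arguments."""
    model.eval()
    with torch.no_grad():
        model(**batch)  # warm-up pass so one-time CUDA overhead doesn't skew the number
        if device.type == "cuda":
            torch.cuda.synchronize()
        start = time.perf_counter()
        model(**batch)
        if device.type == "cuda":
            torch.cuda.synchronize()  # wait for asynchronous kernels before reading the clock
    return time.perf_counter() - start
```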

Given this, using a batch per GPU seems reasonable in the interim, especially considering that the existing scattering implementation causes imbalances in GPU memory for reasons we don't currently understand. In the longer term I heartily agree with @matt-peters that we should move to the built-in multi-GPU training if at all possible.

@schmmd
Member

schmmd commented Jan 4, 2019

We think this is superseded by another PR. @brendan-ai2 is going to check.

@schmmd
Member

schmmd commented Jan 15, 2019

@brendan-ai2 please close this out and refer to the PR that supersedes it.

@brendan-ai2
Contributor

Thanks for the ping. I've merged this PR into my fix: #2200. That's still waiting on the trainer refactor.

@matt-gardner deleted the production_rule_multi_gpu_failing_test branch on February 5, 2019.