Fix Process Group for tensors shared across processes #21449
Conversation
@mrshenli has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Out of curiosity, why didn't we test for this situation using
You seem to have modified nearly every call site to recordStream; I saw maybe only two missing sites:
and I suspect that these should not error if the pointer is not found. Maybe we should do away with the option / flip the default?
test/test_c10d_spawn.py (outdated)

@property
def world_size(self):
    return 2
Wouldn't `world_size = 2` be a much more idiomatic way to write this?
Yes, let me make the change.
test/test_c10d_spawn.py (outdated)

@unittest.skipIf(not TEST_MULTIGPU, "At least 2 CUDA GPUS needed")
@unittest.skipIf(NO_NCCL, "NCCL needed")
def test_shared_allgather_nccl(self):
    with tempfile.NamedTemporaryFile(delete=False) as file:
Generally speaking, it's better not to leak temporary files (which is what `delete=False` does, unless you explicitly delete it later). What is the reasoning for using `delete=False`?
I might be wrong, but if I don't add `delete=False`, it complains it cannot find the tmp file on delete. I was thinking maybe the `with` context and `tempfile` both try to delete the file when exiting the context, and as a result one of them hits the error. But let me double check whether that is the case, and I will add a comment if so.
I would be extremely surprised if that were the case. Here's a simple test:
macbook-pro-116:~ ezyang$ cat test.py
import tempfile
with tempfile.NamedTemporaryFile() as f:
pass
macbook-pro-116:~ ezyang$ python test.py
macbook-pro-116:~ ezyang$
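For what it's worth, if `delete=False` really does turn out to be necessary (e.g. because the file name has to outlive the handle), one way to avoid leaking the file is to clean it up explicitly. A minimal sketch under that assumption, not the PR's code:

import os
import tempfile

tmp = tempfile.NamedTemporaryFile(delete=False)
path = tmp.name
tmp.close()  # keep only the path; e.g. child processes reopen it by name
try:
    pass  # ... use `path`, e.g. as a file-based rendezvous shared with workers ...
finally:
    os.unlink(path)  # explicit cleanup, since delete=False leaves the file behind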
@ezyang thanks a lot for reviewing!
I agree this will test the changes to
test/test_c10d_spawn.py (outdated)

def assert_equal(cls, expected, value):
    assert (expected == value).all().item() == 1, (
        "Expecting tensor value {} but got {}."
    ).format(expected, value)
This is a bit suboptimal, because the way this will look to the test runner is that the multiprocessing-spawned subprocess died unceremoniously, and you have to read the tea leaves to see, "Ah, it failed due to an assert." The way that `test_multiprocessing` arranges this is to have the parent process (aka the test runner) always be responsible for doing the actual asserts (where you can do a normal `self.assertEqual`), and just have the child process pass back the tensors for the parent process to check (or, if the child process must do the test, pass back a boolean saying whether the result worked or not).
What does the test suite output look like when this assert fails?
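Not the PR's actual test code, but a minimal sketch of the pattern described above (names are made up): the child only ships tensors back, and the parent, i.e. the test runner, does the asserting.

import torch
import torch.multiprocessing as mp

# Module-level target so the spawn start method can pickle it.
def _child(rank, queue):
    # Do the work in the child and hand the result back instead of asserting here.
    queue.put(torch.ones(2, 2) * rank)

def main():
    ctx = mp.get_context("spawn")
    queue = ctx.SimpleQueue()
    procs = [ctx.Process(target=_child, args=(r, queue)) for r in range(2)]
    for p in procs:
        p.start()
    results = [queue.get() for _ in procs]
    for p in procs:
        p.join()
    # The parent performs the actual check; inside a TestCase this would be a
    # normal self.assertEqual(...), which yields a readable failure message.
    assert sum(r.sum().item() for r in results) == 4.0

if __name__ == "__main__":
    main()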
Hmm, I guess because you are using `multiprocessing.spawn`, the exceptions will get propagated back.
I agree. I don't like reinventing the wheel either. Will make the change.

> What does the test suite output look like when this assert fails?

I have only tried error cases in local pytest runs, where the error messages are clear. Let me add a deliberate error to see what the CI output shows.
    ).format(expected, value)

# Why classmethod? multiprocessing cannot pickle TestCase subclass when in
# spawn mode. See https://bugs.python.org/issue33884.
FWIW, `test_multiprocessing.py` does this by just having the test runner methods as honest-to-goodness top-level functions. This is just an informational comment, since a classmethod is just as good.
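Roughly the shape of that alternative (a hypothetical example, not `test_multiprocessing.py` itself): the spawn target lives at module level, so nothing from the TestCase subclass needs to be pickled.

import torch.multiprocessing as mp

# Module-level target: pickles cleanly under the spawn start method, whereas a
# method on a TestCase subclass may not (see the bpo issue referenced above).
def _hello(rank):
    print("hello from rank", rank)

if __name__ == "__main__":
    mp.spawn(_hello, nprocs=2)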
Please think about whether we should get rid of the `suppressError` argument and just suppress errors always. All the other comments are just nits.
Sorry, miscommunication here. I'm referring to how, in the logic of the PR, you determine whether a pointer was produced by the CUDA caching allocator by attempting to look it up in the block map; if it's not in the block map, it's not a CUDA caching allocator pointer. My point is that another way to test whether it's produced by the CUDA caching allocator is to check whether the deleter for the DataPtr in question is the CUDA caching allocator's (you can perform this test using
Remember, "different process" isn't the root cause of the problem; if any CUDA tensor we allocate comes from an allocation source that is not the caching allocator, the same problem shows up. Morally, the CUDA caching allocator's record-stream annotations are there to solve a problem that exists specifically because of how the caching allocator is implemented. Arguably, it is a generic interface for all CUDA allocators, but it just so happens that other allocators don't need to do anything.
@ezyang if I feel timid about a PR, can I force the test suite to run all 73 tests?
…ssage from subprocess assertion failures
@mrshenli has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Ops on a Process Group (pg) instance will hit an error when input/output tensors are created on a different process, because pg calls `recordStream` on `CUDACachingAllocator`, which only knows about tensors created within the same process.

The proposed solution is to add a `suppressError` arg (suggestions for better names?) to `recordStream`. See comments in code for arguments.

CC @pichuang1984
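For context, a rough sketch of the failing scenario (an approximation, not the PR's test; the c10d API calls and the 2-GPU setup are my assumptions):

import tempfile

import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def _worker(rank, world_size, shared, store_path):
    # Each spawned process joins the same NCCL process group via a FileStore.
    store = dist.FileStore(store_path, world_size)
    pg = dist.ProcessGroupNCCL(store, rank, world_size)
    outputs = [torch.zeros(2, 2, device=rank) for _ in range(world_size)]
    # shared[rank] was allocated by the parent process, so this process's
    # CUDACachingAllocator has never seen its pointer; calling recordStream on
    # it is what used to raise the error this PR addresses.
    pg.allgather([outputs], [shared[rank]]).wait()

if __name__ == "__main__":
    world_size = 2
    # CUDA tensors allocated in the parent and handed to the spawned children.
    shared = [torch.ones(2, 2, device=i) for i in range(world_size)]
    # delete=False mirrors the test file discussed above; the file is leaked here.
    with tempfile.NamedTemporaryFile(delete=False) as f:
        mp.spawn(_worker, args=(world_size, shared, f.name), nprocs=world_size)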