Distributed PyTorch on Grid (via IPFS/PubSub) #166
Merged
Conversation
* feat: modify pubsub_peers to handle newer IPFS api. (#153)
* finished minimal transfer of overloading code
* found an untested bug

* feat: modify pubsub_peers to handle newer IPFS api. (#153)
* finished minimal transfer of overloading code
* found an untested bug
* adjust comments
* this round of work sponsored by parallel jalebi
* in the middle of fixing #130 and #132
* resolved #132, #130 will take a bit more effort than I'd planned for
* completes #130, prepares #129 and #131; almost took care of #148 in the process

* laptop sync
* finished up ipfs integration, yet to test
* syncing with colab notebooks
* renamed channels.openmined to channels.om
* found a worker node error
* bug in Tensor.send_
* fixed two client side bugs
* keyerror in receive_obj message
* register tensors before sending
* well that was rough
* more bug fixes
* premerge
* fix utils import in hook_worker_service
* fix return_result for worker
* premerge
* premerge2
* BOOM

* lots o' comments
* reorganize notebooks
iamtrask approved these changes on Mar 31, 2018
This marks a new chapter in the OpenMined project - very, very excited to merge this!
Benardi pushed a commit that referenced this pull request on May 12, 2020
* First round of torch hooks integration (#152)
* Finished HookService, linked it with TorchService (#154)
* feat: modify pubsub_peers to handle newer IPFS api. (#153)
* finished minimal transfer of overloading code
* found an untested bug
* WIP for #130 and #132 (#155)
* feat: modify pubsub_peers to handle newer IPFS api. (#153)
* finished minimal transfer of overloading code
* found an untested bug
* adjust comments
* this round of work sponsored by parallel jalebi
* in the middle of fixing #130 and #132
* resolved #132, #130 will take a bit more effort than I'd planned for
* completes #130, prepares #129 and #131; almost took care of #148 in the process
* Worker side command processing and execution (#156) resolved #129
* Finished implementing IPFS into torch services (#161)
* laptop sync
* finished up ipfs integration, yet to test
* syncing with colab notebooks
* renamed channels.openmined to channels.om
* found a worker node error
* bug in Tensor.send_
* fixed two client side bugs
* keyerror in receive_obj message
* register tensors before sending
* well that was rough
* more bug fixes
* premerge
* fix utils import in hook_worker_service
* fix return_result for worker
* premerge
* premerge2
* BOOM
* multinode demo (#162)
* lots o' comments (#164)
* Reorganizing notebooks (#165)
* lots o' comments
* reorganize notebooks
This PR establishes a general framework for distributed tensor computation on Grid with PyTorch. This is accomplished by overloading all relevant functions and methods in the torch module, so that when a user calls a torch command, it is distributed across the Grid to the remote nodes that hold the actual tensors involved. When we send a tensor over the network, a 0-d tensor pointer remains behind for future computation. When we execute a command on remote tensors that returns a new tensor, the worker node returns registration attributes that are used to create a new 0-d pointer for the result. This means the original composability of the torch library is maintained, allowing lower-level operations to make their way into higher-level abstractions like Modules and Containers with minimal future effort.
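To make that flow concrete, here's a minimal, self-contained sketch of the idea (this is not the actual Grid/PySyft API; `FakeWorker` and its `receive_tensor`/`execute` methods are invented stand-ins for a remote node reachable over IPFS PubSub):

```python
import torch

class FakeWorker:
    """Invented stand-in for a remote Grid node: stores tensors by id and runs commands."""
    def __init__(self):
        self._objects = {}

    def receive_tensor(self, tensor, obj_id):
        # The tensor itself lives on the worker; in Grid the client would keep
        # only a 0-d pointer carrying this id and the worker's identity.
        self._objects[obj_id] = tensor

    def execute(self, command, arg_ids, result_id):
        args = [self._objects[i] for i in arg_ids]
        self._objects[result_id] = getattr(torch, command)(*args)
        # "Registration attributes": enough for the client to build a new
        # 0-d pointer to the result without the data leaving the worker.
        return {"id": result_id, "owner": "worker-1"}

worker = FakeWorker()
worker.receive_tensor(torch.ones(5), obj_id=1)       # client would keep pointer 1
worker.receive_tensor(torch.ones(5) * 2, obj_id=2)   # client would keep pointer 2

attrs = worker.execute("add", arg_ids=[1, 2], result_id=3)
print(attrs)  # {'id': 3, 'owner': 'worker-1'} -> wrapped as pointer 3 on the client
```

In the real system the command travels over IPFS PubSub rather than a direct method call, but the client-side bookkeeping is the same: only ids and registration attributes come back for a result, not the tensor itself.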
There are a few other major benefits to this approach -- mainly, it's the lowest level of abstraction we've been able to distribute successfully over IPFS. Beyond its potential to scale up to higher-level abstractions, this brings more fine-grained control of computation, tighter security guarantees, and support for larger models by keeping more IPFS blocks under the 1 MiB limit.
Currently, this work is complete for all Tensor types, although it could use a good stress test or two. The implementation for Variable is incomplete -- the only remaining bits are the special methods `send_`, `get_`, and `ser`, and ensuring that Variables are handled properly throughout transmission and remote computation. I'll be following up with another PR in the coming days to get autograd working.

Due to the nature of Grid at the moment, it's not entirely resilient. Future work should improve error reporting on the client side (#151), induce garbage collection on the worker nodes when a client signals they're done (or after a timeout), and figure out how to handle workers that drop out mid-computation (likely by notifying the client that a worker holding one of their tensors has disconnected from IPFS or stopped listening to openmined channels). There's a whole range of other things we need to do as well (e.g. #134), but let's get this merged first. 🙂
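As a rough illustration of the client/worker exchange this implies, here is a hypothetical sketch of the messages over PubSub; the field names (`functionality`, `tensor_ids`, `registration`, and the `gc` cleanup signal) are assumptions made for the example, not the actual Grid wire format:

```python
# Hypothetical message shapes for the PubSub exchange (illustrative only; the
# real Grid wire format and channel names may differ).

command_msg = {
    "functionality": "torch.add",     # overloaded torch command to run remotely
    "tensor_ids": ["t-41", "t-42"],   # ids of tensors the worker already holds
    "reply_to": "client-1",           # where to publish the response
}

# Worker's reply: just the registration attributes the client needs to build a
# 0-d pointer to the result, never the tensor data itself.
reply_msg = {
    "registration": {"id": "t-43", "owner": "worker-1"},
    "torch_type": "FloatTensor",
}

# A "done" signal along these lines could let workers garbage-collect a
# client's tensors, as proposed above.
cleanup_msg = {"functionality": "gc", "from": "client-1"}
```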
The critical files being created and modified are mostly in `grid/services`, and in particular in the `torch` subdirectory there, although changes have been made across all files in the repository relating to compute mode. A brief demo of computation on multiple nodes over IPFS can be found at `notebooks/experimental/torch_integration/Grid_MultiNode_Demo.ipynb`.

Happy Torching! 🎉