Support different tf.distribute.Strategies for distributed training on SageMaker #391
Comments
if you don't specify anything for
@laurenyu the config is different for different nodes in the cluster. Are you suggesting we do something like this?

```python
# Start of train.py
import socket

if socket.gethostname() == "algo-1":
    # write TF_CONFIG for node-1
    ...
elif socket.gethostname() == "algo-2":
    # write TF_CONFIG for node-2
    ...

import tensorflow

# Start the actual training script
```
@anirudhacharya yep, that's exactly what I was thinking. You can also use the environment variable
@laurenyu I can try this, but I am not sure it will work, because writing TF_CONFIG requires the addresses of every node in the cluster:

```python
os.environ['TF_CONFIG'] = json.dumps({
    'cluster': {
        'worker': [<list of addresses & ports of the nodes that make up the cluster>]
    },
    'task': {'type': 'worker', 'index': 0}
})
```

With a conventional cluster setup I can ssh into each node and get information like its IP address and port number; I am not sure how I would be able to do that from within the training script.
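For what it's worth, SageMaker training containers expose the cluster layout through environment variables, so no ssh access should be needed. A minimal sketch, assuming the standard `SM_HOSTS` and `SM_CURRENT_HOST` variables provided by the training toolkit:

```python
import json
import os

# SM_HOSTS is a JSON list of all hosts in the training cluster,
# e.g. ["algo-1", "algo-2"]; SM_CURRENT_HOST names this node.
hosts = json.loads(os.environ["SM_HOSTS"])
current_host = os.environ["SM_CURRENT_HOST"]
print(hosts, current_host)
```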
Did you give it a try?
While using the parameter server distribution type in the estimator.fit() call, I came across SageMaker's TF_CONFIG in the logs, such as the following (just an example for the parameter server distribution strategy).
Question:
Thank you
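For reference, a parameter-server style TF_CONFIG generally has the shape sketched below; this follows the generic format from the TensorFlow documentation and is not necessarily byte-for-byte what SageMaker writes to the logs:

```python
import json
import os

# Generic parameter-server TF_CONFIG: separate "worker" and "ps" tasks,
# with each process identified by its task type and index.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {
        "worker": ["algo-1:2222", "algo-2:2222"],
        "ps": ["algo-3:2223"],
    },
    "task": {"type": "worker", "index": 0},
})
```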
Sharing an implementation of a working TF_CONFIG for MultiWorkerMirroredStrategy below. This has been tested on a SageMaker Deep Learning Container with TensorFlow v2.8 (link to dockerfile).

```python
import json
import os


def _build_tf_config():
    hosts = json.loads(os.getenv("SM_HOSTS"))
    current_host = os.getenv("SM_CURRENT_HOST")
    workers = hosts

    def host_addresses(hosts, port=7777):
        return ["{}:{}".format(host, port) for host in hosts]

    tf_config = {"cluster": {}, "task": {}}
    tf_config["cluster"]["worker"] = host_addresses(workers)
    tf_config["task"] = {"index": workers.index(current_host), "type": "worker"}

    os.environ["TF_CONFIG"] = json.dumps(tf_config)
    return
```
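A sketch of how the helper above might be wired into the entry point, assuming `_build_tf_config` is defined earlier in the same file: TF_CONFIG has to be populated before the strategy is created (see the "Collective ops must be configured at program startup" error discussed further down), so the call goes at the very top of train.py. The model here is only illustrative:

```python
# Hypothetical top of train.py: build TF_CONFIG first, then create the strategy.
_build_tf_config()

import tensorflow as tf  # imported after TF_CONFIG is set

# Collective ops are configured when the strategy is constructed,
# so TF_CONFIG must already be in the environment at this point.
strategy = tf.distribute.MultiWorkerMirroredStrategy()

with strategy.scope():
    # Placeholder model; replace with the real training code.
    model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
    model.compile(optimizer="sgd", loss="mse")
```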
Is there any update on this? Is anybody working on a PR? I see a TF_CONFIG setup is already implemented in https://github.com/aws/sagemaker-tensorflow-training-toolkit/blob/master/src/sagemaker_tensorflow_container/training.py#L37. It would only need some minor modifications for MWMS. The only task remaining is to add a new distribution option named 'multi_worker_mirrored' and include it in this condition: https://github.com/aws/sagemaker-tensorflow-training-toolkit/blob/master/src/sagemaker_tensorflow_container/training.py#L139. I will be happy to cut a PR for this if required. This has been open for way too long.
Hey @Lokiiiiii so is the aim of your feature to add a distribution argument for multi worker mirrored strategy in SageMaker? i.e.
Yes, I have suggested
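A hypothetical sketch of what such an estimator call could look like; the option name 'multi_worker_mirrored' is taken from the suggestion above, and the remaining arguments are illustrative rather than a merged API:

```python
from sagemaker.tensorflow import TensorFlow

# Illustrative only: a TensorFlow estimator with a hypothetical
# "multi_worker_mirrored" distribution option, mirroring the existing
# {"parameter_server": {"enabled": True}} style of configuration.
estimator = TensorFlow(
    entry_point="train.py",
    role="<execution-role-arn>",
    instance_count=2,
    instance_type="ml.g4dn.xlarge",
    framework_version="2.8",
    py_version="py39",
    distribution={"multi_worker_mirrored": {"enabled": True}},
)
estimator.fit()
```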
Does anything else apart from the TF_CONFIG need to be set? If I use the same config from @vdabravolski, it identifies 2 workers properly,
This seems like a discussion for https://github.com/tensorflow/tensorflow/issues?q=is%3Aissue+multiworkermirroredstrategy+
Where would you add this function? There is also a similar function presented here: https://docs.aws.amazon.com/sagemaker/latest/dg/training-compiler-tensorflow-models.html, but when I use this function at the start of my training script I see a runtime error (RuntimeError: Collective ops must be configured at program startup), and if I set it up after creating the strategy it doesn't work as multi-node :(
The cluster setup required for MWMS has to be done at program startup. As indicated by the error message, the cluster setup, or in this case the environment variable, needs to be set before the training script is executed. You can specify the
For strategies like MultiWorkerMirroredStrategy, TF2 requires us to configure each node individually (https://www.tensorflow.org/tutorials/distribute/multi_worker_with_keras#multi-worker_configuration). Currently SageMaker does not provide a way of doing this when launching a distributed training job with MultiWorkerMirroredStrategy using the estimator.fit() method.