Merge ray upstream into master (#35)

* [rllib] Remove dependency on TensorFlow (ray-project#4764)
* remove hard tf dep
* add test
* comment fix
* fix test
* Dynamic Custom Resources - create and delete resources (ray-project#3742)
* Update tutorial link in doc (ray-project#4777)
* [rllib] Implement learn_on_batch() in torch policy graph
* Fix `ray stop` by killing raylet before plasma (ray-project#4778)
* Fatal check if object store dies (ray-project#4763)
* [rllib] fix clip by value issue as TF upgraded (ray-project#4697)
* fix clip_by_value issue
* fix typo
* [autoscaler] Fix submit (ray-project#4782)
* Queue tasks in the raylet in between async callbacks (ray-project#4766)
* Add a SWAP TaskQueue so that we can keep track of tasks that are temporarily dequeued
* Fix bug where tasks that fail to be forwarded don't appear to be local by adding them to SWAP queue
* cleanups
* updates
* updates
* [Java][Bazel] Refine auto-generated pom files (ray-project#4780)
* Bump version to 0.7.0 (ray-project#4791)
* [JAVA] setDefaultUncaughtExceptionHandler to log uncaught exception in user thread. (ray-project#4798)
* Add WorkerUncaughtExceptionHandler
* Fix
* revert bazel and pom
* [tune] Fix CLI test (ray-project#4801)
* Fix pom file generation (ray-project#4800)
* [rllib] Support continuous action distributions in IMPALA/APPO (ray-project#4771)
* [rllib] TensorFlow 2 compatibility (ray-project#4802)
* Change tagline in documentation and README. (ray-project#4807)
* Update README.rst, index.rst, tutorial.rst and _config.yml
* [tune] Support non-arg submit (ray-project#4803)
* [autoscaler] rsync cluster (ray-project#4785)
* [tune] Remove extra parsing functionality (ray-project#4804)
* Fix Java worker log dir (ray-project#4781)
* [tune] Initial track integration (ray-project#4362) Introduces a minimally invasive utility for logging experiment results. A broad requirement for this tool is that it should integrate seamlessly with Tune execution.
* [rllib] [RFC] Dynamic definition of loss functions and modularization support (ray-project#4795)
* dynamic graph
* wip
* clean up
* fix
* document trainer
* wip
* initialize the graph using a fake batch
* clean up dynamic init
* wip
* spelling
* use builder for ppo pol graph
* add ppo graph
* fix naming
* order
* docs
* set class name correctly
* add torch builder
* add custom model support in builder
* cleanup
* remove underscores
* fix py2 compat
* Update dynamic_tf_policy_graph.py
* Update tracking_dict.py
* wip
* rename
* debug level
* rename policy_graph -> policy in new classes
* fix test
* rename ppo tf policy
* port appo too
* forgot grads
* default policy optimizer
* make default config optional
* add config to optimizer
* use lr by default in optimizer
* update
* comments
* remove optimizer
* fix tuple actions support in dynamic tf graph
* [rllib] Rename PolicyGraph => Policy, move from evaluation/ to policy/ (ray-project#4819) This implements some of the renames proposed in ray-project#4813 We leave behind backwards-compatibility aliases for *PolicyGraph and SampleBatch.
* [Java] Dynamic resource API in Java (ray-project#4824)
* Add default values for Wgym flags
* Fix import
* Fix issue when starting `raylet_monitor` (ray-project#4829)
* Refactor ID Serial 1: Separate ObjectID and TaskID from UniqueID (ray-project#4776)
* Enable BaseId.
* Change TaskID and make python test pass
* Remove unnecessary functions and fix test failure and change TaskID to 16 bytes.
* Java code change draft
* Refine
* Lint
* Update java/api/src/main/java/org/ray/api/id/TaskId.java Co-Authored-By: Hao Chen <chenh1024@gmail.com>
* Update java/api/src/main/java/org/ray/api/id/BaseId.java Co-Authored-By: Hao Chen <chenh1024@gmail.com>
* Update java/api/src/main/java/org/ray/api/id/BaseId.java Co-Authored-By: Hao Chen <chenh1024@gmail.com>
* Update java/api/src/main/java/org/ray/api/id/ObjectId.java Co-Authored-By: Hao Chen <chenh1024@gmail.com>
* Address comment
* Lint
* Fix SINGLE_PROCESS
* Fix comments
* Refine code
* Refine test
* Resolve conflict
* Fix bug in which actor classes are not exported multiple times. (ray-project#4838)
* Bump Ray master version to 0.8.0.dev0 (ray-project#4845)
* Add section to bump version of master branch and cleanup release docs (ray-project#4846)
* Fix import
* Export remote functions when first used and also fix bug in which rem… (ray-project#4844)
* Export remote functions when first used and also fix bug in which remote functions and actor classes are not exported from workers during subsequent ray sessions.
* Documentation update
* Fix tests.
* Fix grammar
* Update wheel versions in documentation to 0.8.0.dev0 and 0.7.0. (ray-project#4847)
* [tune] Later expansion of local_dir (ray-project#4806)
* [rllib] [RFC] Deprecate Python 2 / RLlib (ray-project#4832)
* Fix a typo in kubernetes yaml (ray-project#4872)
* Move global state API out of global_state object. (ray-project#4857)
* Install bazel in autoscaler development configs. (ray-project#4874)
* [tune] Fix up Ax Search and Examples (ray-project#4851)
* update Ax for cleaner API
* docs update
* [rllib] Update concepts docs and add "Building Policies in Torch/TensorFlow" section (ray-project#4821)
* wip
* fix index
* fix bugs
* todo
* add imports
* note on get ph
* note on get ph
* rename to building custom algs
* add rnn state info
* [rllib] Fix error getting kl when simple_optimizer: True in multi-agent PPO
* Replace ReturnIds with NumReturns in TaskInfo to reduce the size (ray-project#4854)
* Refine TaskInfo
* Fix
* Add a test to print task info size
* Lint
* Refine
* Update deps commits of opencensus to support building with bzl 0.25.x (ray-project#4862)
* Update deps to support bzl 2.5.x
* Fix
* Upgrade arrow to latest master (ray-project#4858)
* [tune] Auto-init Ray + default SearchAlg (ray-project#4815)
* Bump version from 0.8.0.dev0 to 0.7.1. (ray-project#4890)
* [rllib] Allow access to batches prior to postprocessing (ray-project#4871)
* [rllib] Fix Multidiscrete support (ray-project#4869)
* Refactor redis callback handling (ray-project#4841)
* Add CallbackReply
* Fix
* fix linting by format.sh
* Fix linting
* Address comments.
* Fix
* Initial high-level code structure of CoreWorker. (ray-project#4875)
* Drop duplicated string format (ray-project#4897) This string format is unnecessary. java_worker_options has been appended to the commandline later.
* Refactor ID Serial 2: change all ID functions to `CamelCase` (ray-project#4896)
* Hotfix for change of from_random to FromRandom (ray-project#4909)
* [rllib] Fix documentation on custom policies (ray-project#4910)
* wip
* add docs
* lint
* todo sections
* fix doc
* [rllib] Allow Torch policies access to full action input dict in extra_action_out_fn (ray-project#4894)
* fix torch extra out
* preserve setitem
* fix docs
* [tune] Pretty print params json in logger.py (ray-project#4903)
* [sgd] Distributed Training via PyTorch (ray-project#4797) Implements distributed SGD using distributed PyTorch.
* [rllib] Rough port of DQN to build_tf_policy() pattern (ray-project#4823)
* fetching objects in parallel in _get_arguments_for_execution (ray-project#4775)
* [tune] Disallow setting resources_per_trial when it is already configured (ray-project#4880)
* disallow it
* import fix
* fix example
* fix test
* fix tests
* Update mock.py
* fix
* make less convoluted
* fix tests
* [rllib] Rename PolicyEvaluator => RolloutWorker (ray-project#4820)
* Fix local cluster yaml (ray-project#4918)
* [tune] Directional metrics for components (ray-project#4120) (ray-project#4915)
* [Core Worker] implement ObjectInterface and add test framework (ray-project#4899)
* [tune] Make PBT Quantile fraction configurable (ray-project#4912)
* Better organize ray_common module (ray-project#4898)
* Fix error
* [tune] Add requirements-dev.txt and update docs for contributing (ray-project#4925)
* Add requirements-dev.txt and update docs.
* Update doc/source/tune-contrib.rst Co-Authored-By: Richard Liaw <rliaw@berkeley.edu>
* Unpin everything except for yapf.
* Fix compute actions return value
* Bump version from 0.7.1 to 0.8.0.dev1. (ray-project#4937)
* Update version number in documentation after release 0.7.0 -> 0.7.1 and 0.8.0.dev0 -> 0.8.0.dev1. (ray-project#4941)
* [doc] Update developer docs with bazel instructions (ray-project#4944)
* [C++] Add hash table to Redis-Module (ray-project#4911)
* Flush lineage cache on task submission instead of execution (ray-project#4942)
* [rllib] Add docs on how to use TF eager execution (ray-project#4927)
* [rllib] Port remainder of algorithms to build_trainer() pattern (ray-project#4920)
* Fix resource bookkeeping bug with acquiring unknown resource. (ray-project#4945)
* Update aws keys for uploading wheels to s3. (ray-project#4948)
* Upload wheels on Travis to branchname/commit_id. (ray-project#4949)
* [Java] Fix serializing issues of `RaySerializer` (ray-project#4887)
* Fix
* Address comment.
* fix (ray-project#4950)
* [Java] Add inner class `Builder` to build call options. (ray-project#4956)
* Add Builder class
* format
* Refactor by IDE
* Remove uncessary dependency
* Make release stress tests work and improve them. (ray-project#4955)
* Use proper session directory for debug_string.txt (ray-project#4960)
* [core] Use int64_t instead of int to keep track of fractional resources (ray-project#4959)
* [core worker] add task submission & execution interface (ray-project#4922)
* [sgd] Add non-distributed PyTorch runner (ray-project#4933)
* Add non-distributed PyTorch runner
* use dist.is_available() instead of checking OS
* Nicer exception
* Fix bug in choosing port
* Refactor some code
* Address comments
* Address comments
* Flush all tasks from local lineage cache after a node failure (ray-project#4964)
* Remove typing from setup.py install_requirements. (ray-project#4971)
* [Java] Fix bug of `BaseID` in multi-threading case. (ray-project#4974)
* [rllib] Fix DDPG example (ray-project#4973)
* Upgrade CI clang-format to 6.0 (ray-project#4976)
* [Core worker] add store & task provider (ray-project#4966)
* Fix bugs in the a3c code template. (ray-project#4984)
* Inherit Function Docstrings and other metedata (ray-project#4985)
* Fix a crash when unknown worker registering to raylet (ray-project#4992)
* [gRPC] Use gRPC for inter-node-manager communication (ray-project#4968)
wingman-ai · Jun 21, 2019 · b850e14
1 parent 59274f7
commit b850e14
Showing 150 changed files with 4,846 additions and 2,076 deletions.
diff --git a/.bazelrc b/.bazelrc
@@ -2,3 +2,5 @@
 build --compilation_mode=opt
 build --action_env=PATH
 build --action_env=PYTHON_BIN_PATH
+# This workaround is needed due to https://github.com/bazelbuild/bazel/issues/4341
+build --per_file_copt="external/com_github_grpc_grpc/.*@-DGRPC_BAZEL_BUILD"
diff --git a/.travis.yml b/.travis.yml
@@ -1,5 +1,7 @@
 language: generic
 
+dist: xenial
+
 
 services:
 - docker

diff --git a/BUILD.bazel b/BUILD.bazel
@@ -1,12 +1,37 @@
 # Bazel build
 # C/C++ documentation: https://docs.bazel.build/versions/master/be/c-cpp.html
 
+load("@com_github_grpc_grpc//bazel:grpc_build_system.bzl", "grpc_proto_library")
 load("@com_github_google_flatbuffers//:build_defs.bzl", "flatbuffer_cc_library")
 load("@//bazel:ray.bzl", "flatbuffer_py_library")
 load("@//bazel:cython_library.bzl", "pyx_library")
 
 COPTS = ["-DRAY_USE_GLOG"]
 
+# Node manager gRPC lib.
+grpc_proto_library(
+    name = "node_manager_grpc_lib",
+    srcs = ["src/ray/protobuf/node_manager.proto"],
+)
+
+# Node manager server and client.
+cc_library(
+    name = "node_manager_rpc_lib",
+    srcs = glob([
+        "src/ray/rpc/*.cc",
+    ]),
+    hdrs = glob([
+        "src/ray/rpc/*.h",
+    ]),
+    copts = COPTS,
+    deps = [
+        ":node_manager_grpc_lib",
+        ":ray_common",
+        "@boost//:asio",
+        "@com_github_grpc_grpc//:grpc++",
+    ],
+)
+
 cc_binary(
     name = "raylet",
     srcs = ["src/ray/raylet/main.cc"],
@@ -89,6 +114,7 @@ cc_library(
         ":gcs",
         ":gcs_fbs",
         ":node_manager_fbs",
+        ":node_manager_rpc_lib",
         ":object_manager",
         ":ray_common",
         ":ray_util",
@@ -111,13 +137,18 @@ cc_library(
     srcs = glob(
         [
             "src/ray/core_worker/*.cc",
+            "src/ray/core_worker/store_provider/*.cc",
+            "src/ray/core_worker/transport/*.cc",
         ],
         exclude = [
             "src/ray/core_worker/*_test.cc",
+            "src/ray/core_worker/mock_worker.cc",
         ],
     ),
     hdrs = glob([
         "src/ray/core_worker/*.h",
+        "src/ray/core_worker/store_provider/*.h",
+        "src/ray/core_worker/transport/*.h",
     ]),
     copts = COPTS,
     deps = [
@@ -127,7 +158,15 @@ cc_library(
     ],
 )
 
-# This test is run by src/ray/test/run_core_worker_tests.sh
+cc_binary(
+    name = "mock_worker",
+    srcs = ["src/ray/core_worker/mock_worker.cc"],
+    copts = COPTS,
+    deps = [
+        ":core_worker_lib",
+    ],
+)
+
 cc_binary(
     name = "core_worker_test",
     srcs = ["src/ray/core_worker/core_worker_test.cc"],
@@ -535,7 +574,7 @@ flatbuffer_py_library(
         "ErrorTableData.py",
         "ErrorType.py",
         "FunctionTableData.py",
-        "GcsTableEntry.py",
+        "GcsEntry.py",
         "HeartbeatBatchTableData.py",
         "HeartbeatTableData.py",
         "Language.py",

diff --git a/README.rst b/README.rst
@@ -6,7 +6,7 @@
 .. image:: https://readthedocs.org/projects/ray/badge/?version=latest
     :target: http://ray.readthedocs.io/en/latest/?badge=latest
 
-.. image:: https://img.shields.io/badge/pypi-0.7.0-blue.svg
+.. image:: https://img.shields.io/badge/pypi-0.7.1-blue.svg
     :target: https://pypi.org/project/ray/
 
 |

diff --git a/bazel/ray_deps_build_all.bzl b/bazel/ray_deps_build_all.bzl
@@ -3,10 +3,14 @@ load("@com_github_nelhage_rules_boost//:boost/boost.bzl", "boost_deps")
 load("@com_github_jupp0r_prometheus_cpp//:repositories.bzl", "prometheus_cpp_repositories")
 load("@com_github_ray_project_ray//bazel:python_configure.bzl", "python_configure")
 load("@com_github_checkstyle_java//:repo.bzl", "checkstyle_deps")
+load("@com_github_grpc_grpc//bazel:grpc_deps.bzl", "grpc_deps")
+
 
 def ray_deps_build_all():
   gen_java_deps()
   checkstyle_deps()
   boost_deps()
   prometheus_cpp_repositories()
   python_configure(name = "local_config_python")
+  grpc_deps()
+
diff --git a/bazel/ray_deps_setup.bzl b/bazel/ray_deps_setup.bzl
@@ -101,3 +101,11 @@ def ray_deps_setup():
         # `https://github.com/jupp0r/prometheus-cpp/pull/225` getting merged.
         urls = ["https://github.com/jovany-wang/prometheus-cpp/archive/master.zip"],
     )
+
+    http_archive(
+        name = "com_github_grpc_grpc",
+        urls = [
+            "https://github.com/grpc/grpc/archive/7741e806a213cba63c96234f16d712a8aa101a49.tar.gz",
+        ],
+        strip_prefix = "grpc-7741e806a213cba63c96234f16d712a8aa101a49",
+    )
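
Note on the `http_archive` entry above: the archive URL and `strip_prefix` both embed the same gRPC commit hash, so the two fields have to be changed together whenever the pin moves. A minimal sketch of that relationship, assuming GitHub's usual `archive/<commit>.tar.gz` layout; the helper name `grpc_archive_fields` is hypothetical and not part of the repository:

```shell
#!/usr/bin/env bash

# Hypothetical helper (not in the repository): print the two http_archive
# fields that must stay in sync for a given gRPC commit hash.
grpc_archive_fields() {
    local commit=$1
    # GitHub serves a tarball of any commit at archive/<commit>.tar.gz ...
    echo "url=https://github.com/grpc/grpc/archive/${commit}.tar.gz"
    # ... and the tarball unpacks into a top-level <repo>-<commit> directory.
    echo "strip_prefix=grpc-${commit}"
}

grpc_archive_fields 7741e806a213cba63c96234f16d712a8aa101a49
```

Running it with the hash from the diff prints the same URL and prefix that the new `http_archive` rule uses.
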
diff --git a/ci/jenkins_tests/perf_integration_tests/run_perf_integration.sh b/ci/jenkins_tests/perf_integration_tests/run_perf_integration.sh
@@ -9,7 +9,7 @@ pushd "$ROOT_DIR"
 
 python -m pip install pytest-benchmark
 
-pip install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-0.8.0.dev0-cp27-cp27mu-manylinux1_x86_64.whl
+pip install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-0.8.0.dev1-cp27-cp27mu-manylinux1_x86_64.whl
 python -m pytest --benchmark-autosave --benchmark-min-rounds=10 --benchmark-columns="min, max, mean" $ROOT_DIR/../../../python/ray/tests/perf_integration_tests/test_perf_integration.py
 
 pushd $ROOT_DIR/../../../python

diff --git a/ci/jenkins_tests/run_rllib_tests.sh b/ci/jenkins_tests/run_rllib_tests.sh
@@ -392,6 +392,16 @@ docker run --rm --shm-size=${SHM_SIZE} --memory=${MEMORY_SIZE} $DOCKER_SHA \
 docker run --rm --shm-size=${SHM_SIZE} --memory=${MEMORY_SIZE} $DOCKER_SHA \
     /ray/ci/suppress_output python /ray/python/ray/rllib/examples/rollout_worker_custom_workflow.py
 
+docker run --rm --shm-size=${SHM_SIZE} --memory=${MEMORY_SIZE} $DOCKER_SHA \
+    /ray/ci/suppress_output python /ray/python/ray/rllib/examples/eager_execution.py --iters=2
+
+docker run --rm --shm-size=${SHM_SIZE} --memory=${MEMORY_SIZE} $DOCKER_SHA \
+    /ray/ci/suppress_output /ray/python/ray/rllib/train.py \
+    --env CartPole-v0 \
+    --run PPO \
+    --stop '{"training_iteration": 1}' \
+    --config '{"use_eager": true, "simple_optimizer": true}'
+
 docker run --rm --shm-size=${SHM_SIZE} --memory=${MEMORY_SIZE} $DOCKER_SHA \
     /ray/ci/suppress_output python /ray/python/ray/rllib/examples/custom_tf_policy.py --iters=2
 

diff --git a/ci/stress_tests/application_cluster_template.yaml b/ci/stress_tests/application_cluster_template.yaml
@@ -37,7 +37,7 @@ provider:
     # Availability zone(s), comma-separated, that nodes may be launched in.
     # Nodes are currently spread between zones by a round-robin approach,
     # however this implementation detail should not be relied upon.
-    availability_zone: us-west-2a,us-west-2b
+    availability_zone: us-west-2b
 
 # How Ray will authenticate with newly launched nodes.
 auth:
@@ -90,8 +90,8 @@ file_mounts: {
 # List of shell commands to run to set up nodes.
 setup_commands:
     - echo 'export PATH="$HOME/anaconda3/envs/tensorflow_<<<PYTHON_VERSION>>>/bin:$PATH"' >> ~/.bashrc
-    - ray || wget https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-0.8.0.dev0-<<<WHEEL_STR>>>-manylinux1_x86_64.whl
-    - rllib || pip install -U ray-0.8.0.dev0-<<<WHEEL_STR>>>-manylinux1_x86_64.whl[rllib]
+    - ray || wget https://s3-us-west-2.amazonaws.com/ray-wheels/releases/<<<RAY_VERSION>>>/<<<RAY_COMMIT>>>/ray-<<<RAY_VERSION>>>-<<<WHEEL_STR>>>-manylinux1_x86_64.whl
+    - rllib || pip install -U ray-<<<RAY_VERSION>>>-<<<WHEEL_STR>>>-manylinux1_x86_64.whl[rllib]
     - pip install tensorflow-gpu==1.12.0
     - echo "sudo halt" | at now + 60 minutes
     # Consider uncommenting these if you also want to run apt-get commands during setup
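
Note on the new placeholders: `<<<RAY_VERSION>>>` appears twice on the `wget` line, which is presumably why the stress-test script substitutes it with the `g` flag while `<<<RAY_COMMIT>>>` gets a single substitution. A minimal sketch of that rendering step, assuming values that contain no `/` characters; the function name `render_template_line` is hypothetical:

```shell
#!/usr/bin/env bash

# Hypothetical helper: fill in the template placeholders on stdin the same
# way the sed expression in run_application_stress_tests.sh does.
render_template_line() {
    local ray_version=$1 ray_commit=$2 wheel_str=$3
    # The g flag matters for RAY_VERSION: without it only the first
    # occurrence on each line would be replaced.
    sed -e "
        s/<<<RAY_VERSION>>>/$ray_version/g;
        s/<<<RAY_COMMIT>>>/$ray_commit/;
        s/<<<WHEEL_STR>>>/$wheel_str/;"
}

template_line='ray || wget https://s3-us-west-2.amazonaws.com/ray-wheels/releases/<<<RAY_VERSION>>>/<<<RAY_COMMIT>>>/ray-<<<RAY_VERSION>>>-<<<WHEEL_STR>>>-manylinux1_x86_64.whl'

printf '%s\n' "$template_line" |
    render_template_line 0.7.1 bc3b6efdb6933d410563ee70f690855c05f25483 cp36-cp36m
```

The version and commit used here are the example values from the script's own header comment.
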

diff --git a/ci/stress_tests/run_application_stress_tests.sh b/ci/stress_tests/run_application_stress_tests.sh
@@ -1,4 +1,11 @@
 #!/usr/bin/env bash
+
+# This script should be run as follows:
+#     ./run_application_stress_tests.sh <ray-version> <ray-commit>
+# For example, <ray-version> might be 0.7.1
+# and <ray-commit> might be bc3b6efdb6933d410563ee70f690855c05f25483. The commit
+# should be the latest commit on the branch "releases/<ray-version>".
+
 # This script runs all of the application tests.
 # Currently includes an IMPALA stress test and a SGD stress test.
 # on both Python 2.7 and 3.6.
@@ -10,26 +17,39 @@
 
 # This script will exit with code 1 if the test did not run successfully.
 
+# Show explicitly which commands are currently running. This should only be AFTER
+# the private key is placed.
+set -x
 
 ROOT_DIR=$(cd "$(dirname "${BASH_SOURCE:-$0}")"; pwd)
 RESULT_FILE=$ROOT_DIR/"results-$(date '+%Y-%m-%d_%H-%M-%S').log"
 
-echo "Logging to" $RESULT_FILE
-echo -e $RAY_AWS_SSH_KEY > /root/.ssh/ray-autoscaler_us-west-2.pem && chmod 400 /root/.ssh/ray-autoscaler_us-west-2.pem || true
+touch "$RESULT_FILE"
+echo "Logging to" "$RESULT_FILE"
 
+if [[ -z  "$1" ]]; then
+  echo "ERROR: The first argument must be the Ray version string."
+  exit 1
+else
+  RAY_VERSION=$1
+fi
 
-# Show explicitly which commands are currently running. This should only be AFTER
-# the private key is placed.
-set -x
+if [[ -z  "$2" ]]; then
+  echo "ERROR: The second argument must be the commit hash to test."
+  exit 1
+else
+  RAY_COMMIT=$2
+fi
 
-touch $RESULT_FILE
+echo "Testing ray==$RAY_VERSION at commit $RAY_COMMIT."
+echo "The wheels used will live under https://s3-us-west-2.amazonaws.com/ray-wheels/releases/$RAY_VERSION/$RAY_COMMIT/"
 
 # This function identifies the right string for the Ray wheel.
 _find_wheel_str(){
     local python_version=$1
     # echo "PYTHON_VERSION", $python_version
     local wheel_str=""
-    if [ $python_version == "p27" ]; then
+    if [ "$python_version" == "p27" ]; then
         wheel_str="cp27-cp27mu"
     else
         wheel_str="cp36-cp36m"
@@ -41,7 +61,7 @@ _find_wheel_str(){
 # Actual test runtime is roughly 10 minutes.
 test_impala(){
     local PYTHON_VERSION=$1
-    local WHEEL_STR=$(_find_wheel_str $PYTHON_VERSION)
+    local WHEEL_STR=$(_find_wheel_str "$PYTHON_VERSION")
 
     pushd "$ROOT_DIR"
         local TEST_NAME="rllib_impala_$PYTHON_VERSION"
@@ -50,32 +70,34 @@ test_impala(){
 
         cat application_cluster_template.yaml |
             sed -e "
+                s/<<<RAY_VERSION>>>/$RAY_VERSION/g;
+                s/<<<RAY_COMMIT>>>/$RAY_COMMIT/;
                 s/<<<CLUSTER_NAME>>>/$TEST_NAME/;
-                s/<<<HEAD_TYPE>>>/g3.16xlarge/;
+                s/<<<HEAD_TYPE>>>/p3.16xlarge/;
                 s/<<<WORKER_TYPE>>>/m5.24xlarge/;
                 s/<<<MIN_WORKERS>>>/5/;
                 s/<<<MAX_WORKERS>>>/5/;
                 s/<<<PYTHON_VERSION>>>/$PYTHON_VERSION/;
-                s/<<<WHEEL_STR>>>/$WHEEL_STR/;" > $CLUSTER
+                s/<<<WHEEL_STR>>>/$WHEEL_STR/;" > "$CLUSTER"
 
         echo "Try running IMPALA stress test."
         {
             RLLIB_DIR=../../python/ray/rllib/
-            ray --logging-level=DEBUG up -y $CLUSTER &&
-            ray rsync_up $CLUSTER $RLLIB_DIR/tuned_examples/ tuned_examples/ &&
+            ray --logging-level=DEBUG up -y "$CLUSTER" &&
+            ray rsync_up "$CLUSTER" $RLLIB_DIR/tuned_examples/ tuned_examples/ &&
             sleep 1 &&
-            ray --logging-level=DEBUG exec $CLUSTER "rllib || true" &&
-            ray --logging-level=DEBUG exec $CLUSTER "
+            ray --logging-level=DEBUG exec "$CLUSTER" "rllib || true" &&
+            ray --logging-level=DEBUG exec "$CLUSTER" "
                 rllib train -f tuned_examples/atari-impala-large.yaml --redis-address='localhost:6379' --queue-trials" &&
-            echo "PASS: IMPALA Test for" $PYTHON_VERSION >> $RESULT_FILE
-        } || echo "FAIL: IMPALA Test for" $PYTHON_VERSION >> $RESULT_FILE
+            echo "PASS: IMPALA Test for" "$PYTHON_VERSION" >> "$RESULT_FILE"
+        } || echo "FAIL: IMPALA Test for" "$PYTHON_VERSION" >> "$RESULT_FILE"
 
         # Tear down cluster.
         if [ "$DEBUG_MODE" = "" ]; then
-            ray down -y $CLUSTER
-            rm $CLUSTER
+            ray down -y "$CLUSTER"
+            rm "$CLUSTER"
         else
-            echo "Not tearing down cluster" $CLUSTER
+            echo "Not tearing down cluster" "$CLUSTER"
         fi
     popd
 }
@@ -93,32 +115,34 @@ test_sgd(){
 
         cat application_cluster_template.yaml |
             sed -e "
+                s/<<<RAY_VERSION>>>/$RAY_VERSION/g;
+                s/<<<RAY_COMMIT>>>/$RAY_COMMIT/;
                 s/<<<CLUSTER_NAME>>>/$TEST_NAME/;
-                s/<<<HEAD_TYPE>>>/g3.16xlarge/;
-                s/<<<WORKER_TYPE>>>/g3.16xlarge/;
+                s/<<<HEAD_TYPE>>>/p3.16xlarge/;
+                s/<<<WORKER_TYPE>>>/p3.16xlarge/;
                 s/<<<MIN_WORKERS>>>/3/;
                 s/<<<MAX_WORKERS>>>/3/;
                 s/<<<PYTHON_VERSION>>>/$PYTHON_VERSION/;
-                s/<<<WHEEL_STR>>>/$WHEEL_STR/;" > $CLUSTER
+                s/<<<WHEEL_STR>>>/$WHEEL_STR/;" > "$CLUSTER"
 
         echo "Try running SGD stress test."
         {
             SGD_DIR=$ROOT_DIR/../../python/ray/experimental/sgd/
-            ray --logging-level=DEBUG up -y $CLUSTER &&
+            ray --logging-level=DEBUG up -y "$CLUSTER" &&
             # TODO: fix submit so that args work
-            ray rsync_up $CLUSTER $SGD_DIR/mnist_example.py mnist_example.py &&
+            ray rsync_up "$CLUSTER" "$SGD_DIR/mnist_example.py" mnist_example.py &&
             sleep 1 &&
-            ray --logging-level=DEBUG exec $CLUSTER "
+            ray --logging-level=DEBUG exec "$CLUSTER" "
                 python mnist_example.py --redis-address=localhost:6379 --num-iters=2000 --num-workers=8 --devices-per-worker=2 --gpu" &&
-            echo "PASS: SGD Test for" $PYTHON_VERSION >> $RESULT_FILE
-        } || echo "FAIL: SGD Test for" $PYTHON_VERSION >> $RESULT_FILE
+            echo "PASS: SGD Test for" "$PYTHON_VERSION" >> "$RESULT_FILE"
+        } || echo "FAIL: SGD Test for" "$PYTHON_VERSION" >> "$RESULT_FILE"
 
         # Tear down cluster.
         if [ "$DEBUG_MODE" = "" ]; then
-            ray down -y $CLUSTER
-            rm $CLUSTER
+            ray down -y "$CLUSTER"
+            rm "$CLUSTER"
         else
-            echo "Not tearing down cluster" $CLUSTER
+            echo "Not tearing down cluster" "$CLUSTER"
         fi
     popd
 }
@@ -130,6 +154,6 @@ do
     test_sgd $PYTHON_VERSION
 done
 
-cat $RESULT_FILE
-cat $RESULT_FILE | grep FAIL > test.log
+cat "$RESULT_FILE"
+cat "$RESULT_FILE" | grep FAIL > test.log
 [ ! -s test.log ] || exit 1
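
Note on the quoting changes in this file: most hunks only add double quotes around variable expansions. The effect is easiest to see in `_find_wheel_str`, where an unquoted empty `$python_version` would leave `[` with a malformed expression. A minimal sketch, assuming the real function prints its result to stdout (the caller captures it with `$(...)`); `find_wheel_str_sketch` is a hypothetical stand-in, not the function from the script:

```shell
#!/usr/bin/env bash

# Hypothetical stand-in for _find_wheel_str, with the quoted comparison.
find_wheel_str_sketch() {
    local python_version=$1
    local wheel_str=""
    # With quotes, an empty argument is compared as an empty string.
    # Unquoted, the test would collapse to `[ == "p27" ]` and bash would
    # report "unary operator expected" before falling through to else.
    if [ "$python_version" = "p27" ]; then
        wheel_str="cp27-cp27mu"
    else
        wheel_str="cp36-cp36m"
    fi
    echo "$wheel_str"
}

find_wheel_str_sketch p27
find_wheel_str_sketch p36
find_wheel_str_sketch ""
```

The sketch uses the portable `=` operator; the script itself uses `==`, which bash also accepts inside `[ ]`.
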