This repository has been archived by the owner on Oct 12, 2023. It is now read-only.

Commit

README update
jiata committed Mar 13, 2017
1 parent 6563ac2 commit d532b50
Showing 4 changed files with 125 additions and 12 deletions.
39 changes: 30 additions & 9 deletions README.md
Import the package
library(doAzureParallel)
```

Set up your parallel backend with Azure. This is your set of Azure VMs.
```R
# 1. Generate a cluster configuration file.
generateClusterConfig("cluster_config.json")

# 2. Edit your cluster configuration file.
# Enter your Azure Batch Account & Azure Storage keys/account-info and configure your cluster settings.

# 3. Create your cluster. This will provision a new cluster if one hasn't already been created.
pool <- makeCluster("cluster_config.json")

# 4. Register the cluster as your parallel backend
registerDoAzureParallel(pool)
results <- foreach(i = 1:number_of_iterations) %do% { ... }
results <- foreach(i = 1:number_of_iterations) %dopar% { ... }
```

You can also run *long-running jobs* with doAzureParallel. With long-running jobs, you will need to keep track of your job ids and set your job to a non-blocking state. You can do this with the *.options.azure* option:

```R
# set the .options.azure option in the foreach loop
# NOTE - if wait = FALSE, foreach will return your unique job id
jobid <- foreach(i = 1:number_of_iterations, .options.azure = list(job = 'unique_job_id', wait = FALSE)) %dopar% { ... }

# get back your job results with your unique job id
results <- getJobResult(jobid)
```

After you finish running your R code in Azure, you may want to shut down your pool of VMs to make sure that you are not being charged anymore.

```R
# shut down your pool
stopCluster(pool)
```

### Pool Configuration JSON

Use your pool configuration JSON file to define your pool in Azure.
"key": <Azure Batch Account Key>,
"url": <Azure Batch Account URL>,
"pool": {
  "name": <your pool name>, // example: "myazurecluster"
  "vmSize": <your pool VM size name>, // example: "Standard_F2" ([Learn more](./docs/10-vm-sizes.md#vm-size-table))
  "maxTasksPerNode": <number of tasks to allocate to each node>, // example: 1 ([Learn more](./docs/22-parallelizing-cores.md))
  "poolSize": {
    "minNodes": <min number of nodes in cluster>, // example: 1
    "maxNodes": <max number of nodes to scale cluster to>, // example: 10
    "autoscaleFormula": <your autoscale formula name> // recommended: "QUEUE"
  }
},
"rPackages": {
"cran": {
  "source": "http://cran.us.r-project.org",
  "name": ["some_cran_package", "some_other_cran_package"]
},
"github": ["username/some_github_package", "another_username/some_other_github_package"]
}
"key": <Azure Storage Account Key>
},
"settings": {
"verbose": false // set to true to see debug logs
}
}
```
2 changes: 0 additions & 2 deletions docs/10-vm-sizes.md
Please see the table below for a curated list of VM types:

| VM Category | VM Size | Cores | Memory (GB) |
| ----------- | ------- | ----- | ----------- |
| Av2-Series | Standard_A1_v2 | 1 | 2 |
| Av2-Series | Standard_A2_v2 | 2 | 4 |
| Av2-Series | Standard_A4_v2 | 4 | 8 |
| Av2-Series | Standard_A8_v2 | 8 | 16 |
| Av2-Series | Standard_A2m_v2 | 2 | 16 |
71 changes: 71 additions & 0 deletions docs/11-autoscale.md
# Autoscale

The doAzureParallel package lets you autoscale your cluster in several ways, letting you save both time and money by automatically adjusting the number of nodes in your cluster to fit your job's demands.

This package pre-defines a few autoscale options (or *autoscale formulas*) that you can choose from and use in your JSON configuration file.

The options are:
- "QUEUE"
- "WORKDAY"
- "WEEKEND"
- "MAX_CPU"

*See the [Autoscale Formulas](#autoscale-formulas) section below to learn how each of these settings works.*

When configuring your autoscale formula, you also need to set the minimum and maximum number of nodes. Each autoscale formula uses these as its lower and upper bounds for pool size.

By default, doAzureParallel uses autoscale with the QUEUE formula. This can be easily configured:

```javascript
{
...
"poolSize": {
"minNodes": 1,
"maxNodes": 20,
"autoscaleFormula": "QUEUE"
}
...
}
```

## Autoscale Formulas

Four autoscale formulas can be selected for different scenarios:

### QUEUE
This is a task-based adjustment - the formula will scale the pool size up and down based on the amount of work in the queue.

### WORKDAY/WEEKEND
These are time-based adjustments - the formula will adjust your pool size based on the day/time of the week.

For WORKDAY, we will check the current time, and if it's a weekday during working hours (8am - 6pm), the pool size will increase to the maximum size. Otherwise it will default to the minimum size.

For WEEKEND, we will check the current time and see if it's the weekend; if it is, the pool size will increase to the maximum size, and otherwise it will default to the minimum size.
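
For example, to have a cluster sized up only during working hours, you could select the WORKDAY formula in your configuration file (a sketch; the node counts shown are illustrative):

```javascript
{
  ...
  "poolSize": {
    "minNodes": 1,
    "maxNodes": 20,
    "autoscaleFormula": "WORKDAY"
  }
  ...
}
```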

### MAX_CPU
This is a resource-based adjustment - the formula will grow the pool when the average CPU usage across the nodes stays high (for example, above 70%).

## When to use Autoscale

Autoscaling can be used in various scenarios when using the doAzureParallel package.

### Time-based scaling

For time-based autoscaling adjustments, you would want to autoscale your pool in anticipation of incoming work. If you know that you want your cluster ready during the workday, you can select the WORKDAY formula and expect your cluster to be ready when you get in for work, and to automatically scale down after work hours.

### Task-based scaling

In contrast, task-based autoscaling adjustments are ideal for when you don't have a pre-defined schedule for running work, and simply want your cluster to scale up or scale down according to your task queue.

A good example of this is when you want to execute long-running jobs: you can kick off a long-running foreach loop at the end of the day without worrying about having to shut down your cluster when the work is done. With task-based scaling (QUEUE), the cluster will automatically decrease in size until it reaches the minNodes property. This way you don't have to worry about monitoring your job and manually shutting down your cluster.

To take advantage of this, you will also need to understand how to retrieve the results of your foreach loop from storage. See [here](./23-persistent-storage.md) to learn more about it.

## Static Clusters

If you do not want your cluster to autoscale, you can simply set the minNodes property equal to maxNodes. For example, if you want a static cluster of 10 nodes, set *"minNodes": 10* and *"maxNodes": 10*.
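
A static 10-node cluster would then look like this in the configuration file (a sketch; with equal bounds, the formula has no room to scale):

```javascript
{
  ...
  "poolSize": {
    "minNodes": 10,
    "maxNodes": 10,
    "autoscaleFormula": "QUEUE"
  }
  ...
}
```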

---

doAzureParallel's autoscale comes from Azure Batch's autoscaling capabilities. To learn more about it, you can visit the [Azure Batch auto-scaling documentation](https://docs.microsoft.com/en-us/azure/batch/batch-automatic-scaling).

25 changes: 24 additions & 1 deletion docs/22-parallelizing-cores.md
# Parallelizing Cores

If you are using a VM size that has more than one core, you may want your R code to run on all the cores in each VM.

There are two methods to do this today:


## MaxTasksPerNode
MaxTasksPerNode is a property that tells Azure how many tasks it should send to each node in your cluster.

The maxTasksPerNode property can be configured in the configuration JSON file when creating your Azure pool. By default, we set this equal to 1, meaning that only one iteration of the foreach loop will execute on each node at a time. However, if you want to oversubscribe the cores in your cluster, you can set this number to up to four times (4x) the number of cores in each node. For example, if you select the Standard_F2 VM size, which has 2 cores, you can set the maxTasksPerNode property to up to 8.

However, because R is single-threaded, we recommend setting maxTasksPerNode equal to the number of cores in the VM size that you selected. For example, if you select the Standard_F2 VM size, which has 2 cores, we recommend setting the maxTasksPerNode property to 2. This way, Azure will run each iteration of the foreach loop on a separate core (as opposed to a separate node).

Here's an example of how you may want to set your JSON configuration file:
```javascript
{
...
"vmSize": "Standard_F2",
"maxTasksPerNode": 2
...
}
```

## Nested doParallel
To take advantage of all the cores on each node, you can nest a *foreach* loop using the *doParallel* package inside the outer *foreach* loop that uses doAzureParallel.

The *doParallel* package can detect the number of cores on a computer and parallelizes each iteration of the *foreach* loop across those cores. Pairing this with the doAzureParallel package, we can schedule work to each core of each VM in the pool.
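
A minimal sketch of this pattern, assuming a cluster object `pool` has already been created and registered as described in the README (the loop bounds and per-core work are placeholders):

```R
library(doAzureParallel)
library(doParallel)

# outer loop: each iteration is shipped to a VM in the Azure pool
results <- foreach(i = 1:4, .packages = c("doParallel")) %dopar% {

  # on each VM: detect the cores and register a local parallel backend
  registerDoParallel(parallel::detectCores())

  # inner loop: iterations run in parallel across this VM's cores
  foreach(j = 1:8, .combine = c) %dopar% {
    i * j  # placeholder for real per-core work
  }
}
```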

