This repository has been archived by the owner on Oct 12, 2023. It is now read-only.
Commit
* put images into vignette folder
* Created 00-azure-introduction.md
* Created 10-vm-sizes.md
* Update VM sizes link in docs/README.md
* Create 20-package-management.md
* Updated README.md with Azure Batch limitations
* Create 21-distributing-data.md
* Create 22-parallelizing-cores.md
* Create 23-persistent-storage.md
* standardized foreach keyword and pool keyword
Showing 10 changed files with 250 additions and 13 deletions.
# Azure Introduction

doAzureParallel lets users seamlessly take advantage of the scale and elasticity of Azure to run their parallel workloads. This section will describe how the doAzureParallel package uses Azure and some of the key benefits that Azure provides.

## Azure Batch

Azure Batch is a platform service for running large-scale parallel and high-performance computing (HPC) applications efficiently in the cloud.

### How does it work?
The doAzureParallel package is built on top of Azure Batch via the *rAzureBatch* package, which interacts with the Azure Batch service's REST API. Azure Batch schedules work across a managed collection of VMs (called a *pool*) and automatically scales the pool to meet the needs of your R jobs.

In Azure Batch, a pool consists of a collection of VMs (in our case, a collection of DSVMs) - this pool can be configured by the config file that is generated by this package. For each *foreach* loop, the Azure Batch Job Scheduler creates a group of tasks (called an Azure Batch Job), where each iteration in the loop maps to a task. Each task is then distributed across the pool, running the code inside its iteration of the loop.

To do this, we copy the existing R environment and store it in Azure Storage. As the VMs in the Azure Batch pool become ready to run the job, they fetch and load the R environment. Each VM runs the R code inside its iteration of the *foreach* loop under the loaded R environment. Once the code finishes, the results are pushed back into Azure Storage, and a merge task aggregates them. Finally, the aggregated results are returned to the user within the R session.
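From the user's side, that whole flow can be sketched in a few lines. This is a hypothetical illustration only: it assumes the `registerPool()` helper described in the VM sizes section returns a usable pool object (exact function names and signatures may differ in your version of the package), and `runAlgorithm()` stands in for your own function.

```R
library(doAzureParallel)

# Load the pool defined in your JSON config and register it as the
# foreach backend. registerPool() is described later in these docs;
# its return value is assumed here for illustration.
pool <- registerPool("pool_config.json")

# Each iteration of this loop becomes one Azure Batch task; the merged
# results come back from Azure Storage into the local R session.
results <- foreach(i = 1:10) %dopar% {
  runAlgorithm(i)   # runAlgorithm() is a stand-in for your own code
}
```

Nothing in the loop body needs to know about Azure: the backend registration alone decides where the iterations run.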
Learn more about Azure Batch [here](https://docs.microsoft.com/en-us/azure/batch/batch-technical-overview#pricing).

### Azure Batch Pricing

Azure Batch is a free service; you aren't charged for the Batch account itself. You are charged for the underlying Azure compute resources that your Batch solutions consume, and for the resources consumed by other services when your workloads run.
## Data Science Virtual Machines (DSVM)

The doAzureParallel package uses the Data Science Virtual Machine (DSVM) for each node in the pool. The DSVM is a customized VM image with many popular R tools pre-installed. Because these tools are pre-baked into the DSVM image, using it gives us a considerable speedup when provisioning the pool.

This package uses the Linux edition of the DSVM, which comes preinstalled with the Microsoft R Server Developer edition as well as many popular packages from Microsoft R Open (MRO). Because it uses and extends open source R, Microsoft R Server is fully compatible with R scripts, functions and CRAN packages.

Learn more about the DSVM [here](https://azuremarketplace.microsoft.com/en-us/marketplace/apps/microsoft-ads.standard-data-science-vm?tab=Overview).

### DSVM Pricing

Using the DSVM is free and doesn't add to the cost of the underlying VMs.
# Virtual Machine Sizes

The doAzureParallel package lets you choose the VMs that your code runs on, giving you full control over your infrastructure. By default, we start you on an economical, general-purpose VM size called **"Standard_A1_v2"**.
Each doAzureParallel pool consists of VMs of a single size, selected at pool creation. Once the pool is created, the VM size cannot be changed; to use a different size, you must provision another pool.
## Setting your VM size

The VM size is set in the configuration JSON file that is passed into the `registerPool()` method. To set your desired VM size, simply edit the `vmSize` key in the JSON:

```javascript
{
    ...
    "vmSize": <Your Desired VM Size>,
    ...
}
```
## Choosing your VM Size

Azure has a wide variety of VMs that you can choose from.

### VM Categories

The three recommended VM categories for the doAzureParallel package are:

- Av2-Series VMs
- F-Series VMs
- Dv2-Series VMs

Each VM category also has a variety of VM sizes (see the table below).

Generally speaking, F-Series VMs are ideal for compute-intensive workloads, Dv2-Series VMs are ideal for memory-intensive workloads, and Av2-Series VMs are economical, general-purpose VMs.

The Dv2-Series and F-Series VMs use the 2.4 GHz Intel Xeon® E5-2673 v3 (Haswell) processor.
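For example, to run a compute-intensive workload on 4-core F-Series VMs, the `vmSize` key in the configuration file shown earlier could be set as follows (a sketch; the size chosen here is illustrative, and the other keys in the file are unchanged):

```javascript
{
    ...
    "vmSize": "Standard_F4",
    ...
}
```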
### VM Size Table

Please see the table below for a curated list of VM sizes:

| VM Category | VM Size | Cores | Memory (GB) |
| ----------- | ------- | ----- | ----------- |
| Av2-Series | Standard_A1_v2 | 1 | 2 |
| Av2-Series | Standard_A2_v2 | 2 | 4 |
| Av2-Series | Standard_A4_v2 | 4 | 8 |
| Av2-Series | Standard_A8_v2 | 8 | 16 |
| Av2-Series | Standard_A2m_v2 | 2 | 16 |
| Av2-Series | Standard_A4m_v2 | 4 | 32 |
| Av2-Series | Standard_A8m_v2 | 8 | 64 |
| F-Series | Standard_F1 | 1 | 2 |
| F-Series | Standard_F2 | 2 | 4 |
| F-Series | Standard_F4 | 4 | 8 |
| F-Series | Standard_F8 | 8 | 16 |
| F-Series | Standard_F16 | 16 | 32 |
| Dv2-Series | Standard_D1_v2 | 1 | 3.5 |
| Dv2-Series | Standard_D2_v2 | 2 | 7 |
| Dv2-Series | Standard_D3_v2 | 4 | 14 |
| Dv2-Series | Standard_D4_v2 | 8 | 28 |
| Dv2-Series | Standard_D5_v2 | 16 | 56 |
| Dv2-Series | Standard_D11_v2 | 2 | 14 |
| Dv2-Series | Standard_D12_v2 | 4 | 28 |
| Dv2-Series | Standard_D13_v2 | 8 | 56 |
| Dv2-Series | Standard_D14_v2 | 16 | 112 |
The list above covers most scenarios that run R jobs. For special scenarios (such as GPU-accelerated R code), please see the full list of available VM sizes on the Azure Linux VM sizes page [here](https://docs.microsoft.com/en-us/azure/virtual-machines/virtual-machines-linux-sizes?toc=%2fazure%2fvirtual-machines%2flinux%2ftoc.json#a-series).

To get a sense of what each VM costs, please visit the Azure Virtual Machine pricing page [here](https://azure.microsoft.com/en-us/pricing/details/virtual-machines/linux/).
# Package Management

The doAzureParallel package allows you to install packages to your pool in two ways:

- Installing on pool creation
- Installing per-*foreach* loop

## Installing Packages on Pool Creation

You can install packages by specifying the package(s) in your JSON pool configuration file. The specified packages are then installed at pool creation time.
```javascript
{
    ...
    "rPackages": {
        "cran": {
            "source": "http://cran.us.r-project.org",
            "name": ["some_cran_package_name", "some_other_cran_package_name"]
        },
        "github": ["github_username/github_package_name", "another_github_username/another_github_package_name"]
    },
    ...
}
```
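As a concrete sketch, a pool that needs a couple of CRAN packages plus a GitHub package might fill in the same keys like this (the package choices here are purely illustrative, not required by doAzureParallel):

```javascript
{
    ...
    "rPackages": {
        "cran": {
            "source": "http://cran.us.r-project.org",
            "name": ["ggplot2", "dplyr"]
        },
        "github": ["Azure/rAzureBatch"]
    },
    ...
}
```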
## Installing Packages per-*foreach* Loop

You can also install packages by using the **.packages** option in the *foreach* loop. Instead of installing packages during pool creation, packages (and their dependencies) are installed before each iteration of the loop runs on your Azure cluster.
To install a single package:

```R
number_of_iterations <- 10
results <- foreach(i = 1:number_of_iterations, .packages='some_package') %dopar% { ... }
```

To install multiple packages:

```R
number_of_iterations <- 10
results <- foreach(i = 1:number_of_iterations, .packages=c('package_1', 'package_2')) %dopar% { ... }
```

Installing packages from GitHub using this method is not yet supported.
## Uninstalling packages

Uninstalling packages from your pool is not supported. However, you may consider rebuilding your pool instead.
# Distributing Data

The doAzureParallel package lets you distribute the data in your R session across your Azure pool.

As long as the data you wish to distribute fits in memory on your local machine as well as in the memory of the VMs in your pool, the doAzureParallel package will be able to manage the data.
```R
my_data_set <- data_set
number_of_iterations <- 10

results <- foreach(i = 1:number_of_iterations) %dopar% {
    runAlgorithm(my_data_set)
}
```
## Chunking Data

A common scenario is to chunk your data so that each chunk maps to an iteration of the *foreach* loop.

```R
chunks <- split(<data_set>, 10)

results <- foreach(chunk = iter(chunks)) %dopar% {
    runAlgorithm(chunk)
}
```
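As a self-contained sketch of the chunking step itself (made-up data, plain base R, no pool needed), one way to break a dataset into a fixed number of chunks is to split on recycled group labels:

```R
data_set <- 1:100          # made-up example data
number_of_chunks <- 10

# recycle the labels 1..10 across the data, so each chunk collects
# every 10th element of the original dataset
chunks <- split(data_set, rep(1:number_of_chunks, length.out = length(data_set)))

length(chunks)    # 10 chunks
lengths(chunks)   # 10 elements each
```

Each element of `chunks` can then be handed to one iteration of the *foreach* loop exactly as in the example above.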
# Parallelizing Cores

Depending on the VM size you select, you may want your R code to run on all the cores of each VM. To do this, we recommend nesting a *foreach* loop that uses the *doParallel* package inside the outer *foreach* loop that uses doAzureParallel.

The *doParallel* package can detect the number of cores on a machine and parallelize each iteration of the *foreach* loop across those cores. Pairing this with the doAzureParallel package, we can schedule work to every core of every VM in the pool.
```R
# register your Azure pool as the parallel backend
registerDoAzureParallel(pool)

# execute your outer foreach loop to schedule work to the pool
number_of_outer_iterations <- 10
results <- foreach(i = 1:number_of_outer_iterations, .packages='doParallel') %dopar% {

    # detect the number of cores on the VM
    cores <- detectCores()

    # make your cluster using the cores on the VM
    cl <- makeCluster(cores)

    # register the cluster as the parallel backend within the VM
    registerDoParallel(cl)

    # execute your inner foreach loop that will use all the cores in the VM
    number_of_inner_iterations <- 20
    inner_results <- foreach(j = 1:number_of_inner_iterations) %dopar% {
        runAlgorithm()
    }

    # shut down the inner cluster before returning the results
    stopCluster(cl)

    inner_results
}
```
# Persistent Storage

When executing long-running jobs, users may not want to keep their session open while waiting for results to be returned.

The doAzureParallel package automatically stores the results of the *foreach* loop in an Azure Storage account - this means that when the user exits the session, their results won't be lost. Instead, users can simply pull the results down from Azure at any time and load them into their current session.

To do so, users need to keep track of **job ids**. Each *foreach* loop is considered a *job* and is assigned a unique ID. The job id is returned to the user after the *foreach* loop is executed.

When the user returns and begins a new session, the user can pull down the results from their job:

```R
my_job_id <- "job123456789"
results <- GetJobResult(my_job_id)
```
File renamed without changes
File renamed without changes