-
Notifications
You must be signed in to change notification settings - Fork 49
Conversation
e42248d
to
cff4950
Compare
R/doAzureParallel.R
Outdated
@@ -435,17 +429,15 @@ setHttpTraffic <- function(value = FALSE) { | |||
rAzureBatch::updateJob(id) | |||
|
|||
if (enableCloudCombine) { | |||
mergeTask <- paste0(id, "-merge") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nitpick: mergeTask -> mergeTaksId
R/utility.R
Outdated
|
||
if (taskCounts$failed > 0 && | ||
errorHandling == "stop" && | ||
taskCounts$validationStatus == "Validated") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the scary call in my opinion. How long does this take to get Validated? Also, I think there is an upper limit the counting service provides. Maybe 100K? Can you double check and make sure it is a number we are comfortable with?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
200k is the upper limit
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Spoke with Task Count service team, validation happens often, and as the job progresses, validation increases.
R/utility.R
Outdated
rAzureBatch::listTask(jobId, filter = filter, select = select) | ||
|
||
tasksFailureWarningLabel <- | ||
sprintf("There are currently %i tasks failed while running the job:\n", taskCounts$failed) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
update to:
sprintf("%i task(s) failed while running the job. This caused the job to terminate automatically. To disable this behavior and continue on failure set .errorHandling="???" in the foreach loop\n", taskCounts$failed)
R/utility.R
Outdated
taskCounts$validationStatus == "Validated") { | ||
cat("\n") | ||
|
||
filter <- "state eq 'completed' and executionInfo/exitCode ne 0" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This may not be a sufficient filter. Double check with Jun around what is considered a failure. For example, when a resource file is not found it will also fail the task, but the exit code will not be set (instead some other value is there)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This filter does not work, for scheduling errors. Will have to do a client side filter for now...
filter <- "state eq 'completed' and executionInfo/result eq 'Failure'" is not queryable yet until we deprecate one of our service components.
R/utility.R
Outdated
"Error handling is set to 'stop' and has proceeded to terminate the job.", | ||
"The user will have to handle deleting the job.", | ||
"If this is not the correct behavior, change the errorHandling property", | ||
"in the foreach object. Use the 'getJobFile' function to obtain the logs.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
change to .errroHandling property to what?
R/utility.R
Outdated
|
||
setTxtProgressBar(pb, count) | ||
if (Sys.time() > timeToTimeout) { | ||
stop("A timeout has occurred when waiting for tasks to complete. ") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Be more verbose around what the timeout was, and how a user can change it. Also, what does this mean for the job? We ran into this before. The job will continue, right? The user us just responsible for tracking it manually.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The job will continue, the user will need to track it manually
stop( | ||
paste( | ||
sprintf("Job '%s' unable to install packages.", jobId), | ||
"Use the 'getJobFile' function to get more information about", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we do getJobFile on job packages? Is the api different?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be worked on later
inst/startup/merger.R
Outdated
batchTaskWorkingDirectory <- args[2] | ||
batchJobId <- args[3] | ||
getSimpleErrorMessage <- function(e) { | ||
print(e$message) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove print?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method not needed. Removed
parent.env(azbatchenv$exportenv) <- globalenv() | ||
sessionInfo() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
debug?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Keeping it, I think it's useful for users to know what packages are installed / what's their current R environment
inst/startup/merger.R
Outdated
full.names = TRUE) | ||
|
||
if (errorHandling == "stop" && length(files) != batchTasksCount) { | ||
stop("Issues with file upload") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What issues? Be more verbose.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added some comments. Also looks like this is missing any additional docs / samples.
R/utility.R
Outdated
taskCounts <- rAzureBatch::getJobTaskCounts(jobId) | ||
|
||
setTxtProgressBar(pb, taskCounts$completed) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nitpick: extra line
R/utility.R
Outdated
stop(sprintf(paste("Timeout has occurred while waiting for tasks to complete.", | ||
"Users will have to manually track the job '%s' and get the results.", | ||
"Use the getJobResults function to obtain the results and getJobList for", | ||
"tracking job status. To change the timeout, user can set 'timeout' property in the", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nitpick: remove 'user can' from 'timeout, user can set'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a few minor nitpicks. Please fix but otherwise looks good.
* treat warnings as failures and fail the creation of the cluster (#91) * treat warnings as failures and fail the creation of the cluster * fix unit tests * fix lintr lines too long issue * escape single quotes * Check if existing pool is deleted when makeCluster is called (#99) * Added deleting pool check for makeCluster * Fixed double quotes * cluster logs renamed from pool to cluster * Added correct imports and fix range * Feature/bio conductor docs (#106) * initial command line instructions for bioconductor * initial startup scripts for installing bioconductor * fix if then syntax * force update node environment with update path for R runtime * install bioconductor * wrap bioconductor install command in Rscript * bioconductor sample docs * update bioC docs * remove .gitignore rule for .json files * add pointer to BioC cluster config from docs * Feature/cluster logs (#98) * download merge result gets content raw * Added setHttpTraffic and logging functions docs * Fixed broken links * Shorten lines down to 120 characters * download merge result gets content raw * Added setHttpTraffic and logging functions docs * Fixed broken links * Shorten lines down to 120 characters * Renamed function names from past discussion * Fixed log documentation * Added new operations for storage management * Added dont run examples * Fixed unused arg for running example * Updated docs for storage management * Added a new doc dedicated for managing storage * Added attribute for container name in data frame * Fixed downloadBlob to work with new rAzureBatch function * Updated docs based on PR comments * Changed dependency version to razurebatch 0.5.0 * Feature/add azure files cluster config (#108) * initial command line instructions for bioconductor * initial startup scripts for installing bioconductor * fix if then syntax * force update node environment with update path for R runtime * install bioconductor * wrap bioconductor install command in Rscript * bioconductor sample docs * update bioC docs * remove .gitignore rule for .json files * add pointer to BioC cluster config from docs * add missing azureFiles cluster config to samples * Add 0.4.2 CHANGELOG comments (#111) * Added live scenario test (#107) * Added live scenario test so users do not have to write their own sample code to test * Added file names for test live * Removed single quote linter * Added comment about the reason for this test * Wait for job preparation task function (#109) * Fixed verbose for getDoParWorkers (#112) * Feature/faq (#110) * initial FAQ * rename faq to FAQ * merge FAQ and Troubleshooting docs * add info on how to reboot a node * refrence TSG and FAQ from main docs index page * add more info as per PR feedback * PR feedback * point raw scripts at master branch (#118) * Update DESCRIPTION (#117) Update version for new milestone. * Fix: Removed anaconda from path (#119) * Removed anaconda from environment path * Line is too long for blobxfer command * For BioConductor install, force remove MRO 3.3 prior to installing MRO 3.4 (#120) * force add PATH to current user * Update bioc_setup.sh * Check verbose null case (#121) * Change True/False to TRUE/FALSE in README example (#124) * add .gitiattrributes file to track line endings * True and False are not valid in R; changed to TRUE and FALSE * Fixed worker and merger scripts (#116) * Fixed worker and merger scripts * Fixed verbose logs based on PR comments * Added documentation on error handling * Fixed header on table markdown * Fixed based on PR comments * v0.4.3 Release (#131) * Upgraded description to use rAzureBatch v0.5.1 * Updated change log for job failure
* treat warnings as failures and fail the creation of the cluster (#91) * treat warnings as failures and fail the creation of the cluster * fix unit tests * fix lintr lines too long issue * escape single quotes * Check if existing pool is deleted when makeCluster is called (#99) * Added deleting pool check for makeCluster * Fixed double quotes * cluster logs renamed from pool to cluster * Added correct imports and fix range * Feature/bio conductor docs (#106) * initial command line instructions for bioconductor * initial startup scripts for installing bioconductor * fix if then syntax * force update node environment with update path for R runtime * install bioconductor * wrap bioconductor install command in Rscript * bioconductor sample docs * update bioC docs * remove .gitignore rule for .json files * add pointer to BioC cluster config from docs * Feature/cluster logs (#98) * download merge result gets content raw * Added setHttpTraffic and logging functions docs * Fixed broken links * Shorten lines down to 120 characters * download merge result gets content raw * Added setHttpTraffic and logging functions docs * Fixed broken links * Shorten lines down to 120 characters * Renamed function names from past discussion * Fixed log documentation * Added new operations for storage management * Added dont run examples * Fixed unused arg for running example * Updated docs for storage management * Added a new doc dedicated for managing storage * Added attribute for container name in data frame * Fixed downloadBlob to work with new rAzureBatch function * Updated docs based on PR comments * Changed dependency version to razurebatch 0.5.0 * Feature/add azure files cluster config (#108) * initial command line instructions for bioconductor * initial startup scripts for installing bioconductor * fix if then syntax * force update node environment with update path for R runtime * install bioconductor * wrap bioconductor install command in Rscript * bioconductor sample docs * update bioC docs * remove .gitignore rule for .json files * add pointer to BioC cluster config from docs * add missing azureFiles cluster config to samples * Add 0.4.2 CHANGELOG comments (#111) * Added live scenario test (#107) * Added live scenario test so users do not have to write their own sample code to test * Added file names for test live * Removed single quote linter * Added comment about the reason for this test * Wait for job preparation task function (#109) * Fixed verbose for getDoParWorkers (#112) * Feature/faq (#110) * initial FAQ * rename faq to FAQ * merge FAQ and Troubleshooting docs * add info on how to reboot a node * refrence TSG and FAQ from main docs index page * add more info as per PR feedback * PR feedback * point raw scripts at master branch (#118) * Update DESCRIPTION (#117) Update version for new milestone. * Fix: Removed anaconda from path (#119) * Removed anaconda from environment path * Line is too long for blobxfer command * For BioConductor install, force remove MRO 3.3 prior to installing MRO 3.4 (#120) * force add PATH to current user * Update bioc_setup.sh * Check verbose null case (#121) * Change True/False to TRUE/FALSE in README example (#124) * add .gitiattrributes file to track line endings * True and False are not valid in R; changed to TRUE and FALSE * Fixed worker and merger scripts (#116) * Fixed worker and merger scripts * Fixed verbose logs based on PR comments * Added documentation on error handling * Fixed header on table markdown * Fixed based on PR comments * v0.4.3 Release (#131) * Upgraded description to use rAzureBatch v0.5.1 * Updated change log for job failure * readme.md update * Merge from feature/getjobresult for long running job support (#130) * Added set chunk size * Added cluster configuration validation function (#30) * Added pool config test validation * Added a fix for validation * Added if checks for null tests and more validation tests * Install R packages at job run time (#29) * Added cran/github installation scripts * Added package installation tests * Upgraded package version to 0.3.2 * Output file support (#40) * Output files support * Added createOutputFile method * output files readme documentation * added tests and find container sas * Added more detailed variable names * Enable/disable merge task (#39) * Merge task pass params * Fixed enableMerge cases * Merge task documentation on README.md * Fixed typo on merge task description * Update doAzureParallel.R * Changed enableMerge to enableCloudCombine * convert getJobResult output from binary to text * Only write vector to temp file * save cloud merge enabled, chunk size and packages as job metadata * update cloudMergeEnabled to cloudCombineEnabled * Fix/backwards compatible (#68) * Added backwards compatible in make cluster * Added deprecated config validator * Added mismatch label * Added validation for quota limits and bad getPool requests in waitForNodesToComplete (#52) * Added validation for quota limits and bad getPool requests * Fixed based on PR * Fixed progress bar layout to use switch statements instead of if statements * Changed clusterId to poolId * Added comments and fixed messages * Added running state to the node status * Reformatted lines for function * Added end statement for node completion * Feature/custom script and reduce (#70) * Added custom scripts and removed dependencies parameter * Updated roxygen tool version * Added parallelThreads support * Added test coverage * Removed verbose message on command line * Added Reduce function for group of tasks * Fix build because of doc semantics mismatch with function * Removed unused function * Added command line arg * Added docs for custom script * Moved customize cluster to separate doc for future usage * Fixed typo * Bug - Waiting for tasks to completion function ends too early (#69) * Moved wait for tasks to complete to doAzureParallel utility * Removed unneeded variables and progress * Fixed camel case for skiptoken * Travis/lintr (#72) * Added lintr config file * Added travis github package installation * Removed snake case rule * Fixed documents on doAzureParallel * Based on lintr default_settins docs, correctly added default rules * Updated lintr package to use object_name_style * Added package :: operator * Reformatted after merge * Fixed command line tests * Upgraded roxygen to 6.0.1 * Cluster config docs * Removed additional delete job * add getJob api (#84) * add getJob api * reformat code * update styling in utility file * fix code styling * update chunksize to chunkSize and remove unused code * handle job metadata in getJob api * fix styling issue * update getJobList parameter from list of job ids to filter object, and output jobs status in data frame (#128) long running job support, getJob, getJobList and getJobResult implementation * reformat code * update styling in utility file * fix code styling * update chunksize to chunkSize and remove unused code * handle job metadata in getJob api * fix styling issue * use counting service api in getJobList * fix coding style * return data frame from getJobList * update getJobList parameter from job id list to filter by state * reformat code * update description for getJobList * remove dup code * address review feedback * jobId parameter check for getJobResult * update documentation for long run job * update version to 0.5.0 * update version * address review feedback * update chunkSizeValue to chunkSizeKeyValuePair * Validate job names and pool names (#129) * Added validator class * Added validators for lintr * Added exclusion list for validators * fix bug in metadata handling for packages and enableCloudCombine (#133) * fix bug in metadata handling for packages and enableCloudCombine * call long running job api in test * update test * add test for long running job feature * code style fix * update job state description in readme * use list for job state filter * address review feedback
* treat warnings as failures and fail the creation of the cluster (#91) * treat warnings as failures and fail the creation of the cluster * fix unit tests * fix lintr lines too long issue * escape single quotes * Check if existing pool is deleted when makeCluster is called (#99) * Added deleting pool check for makeCluster * Fixed double quotes * cluster logs renamed from pool to cluster * Added correct imports and fix range * Feature/bio conductor docs (#106) * initial command line instructions for bioconductor * initial startup scripts for installing bioconductor * fix if then syntax * force update node environment with update path for R runtime * install bioconductor * wrap bioconductor install command in Rscript * bioconductor sample docs * update bioC docs * remove .gitignore rule for .json files * add pointer to BioC cluster config from docs * Feature/cluster logs (#98) * download merge result gets content raw * Added setHttpTraffic and logging functions docs * Fixed broken links * Shorten lines down to 120 characters * download merge result gets content raw * Added setHttpTraffic and logging functions docs * Fixed broken links * Shorten lines down to 120 characters * Renamed function names from past discussion * Fixed log documentation * Added new operations for storage management * Added dont run examples * Fixed unused arg for running example * Updated docs for storage management * Added a new doc dedicated for managing storage * Added attribute for container name in data frame * Fixed downloadBlob to work with new rAzureBatch function * Updated docs based on PR comments * Changed dependency version to razurebatch 0.5.0 * Feature/add azure files cluster config (#108) * initial command line instructions for bioconductor * initial startup scripts for installing bioconductor * fix if then syntax * force update node environment with update path for R runtime * install bioconductor * wrap bioconductor install command in Rscript * bioconductor sample docs * update bioC docs * remove .gitignore rule for .json files * add pointer to BioC cluster config from docs * add missing azureFiles cluster config to samples * Add 0.4.2 CHANGELOG comments (#111) * Added live scenario test (#107) * Added live scenario test so users do not have to write their own sample code to test * Added file names for test live * Removed single quote linter * Added comment about the reason for this test * Wait for job preparation task function (#109) * Fixed verbose for getDoParWorkers (#112) * Feature/faq (#110) * initial FAQ * rename faq to FAQ * merge FAQ and Troubleshooting docs * add info on how to reboot a node * refrence TSG and FAQ from main docs index page * add more info as per PR feedback * PR feedback * point raw scripts at master branch (#118) * Update DESCRIPTION (#117) Update version for new milestone. * Fix: Removed anaconda from path (#119) * Removed anaconda from environment path * Line is too long for blobxfer command * For BioConductor install, force remove MRO 3.3 prior to installing MRO 3.4 (#120) * force add PATH to current user * Update bioc_setup.sh * Check verbose null case (#121) * Change True/False to TRUE/FALSE in README example (#124) * add .gitiattrributes file to track line endings * True and False are not valid in R; changed to TRUE and FALSE * Fixed worker and merger scripts (#116) * Fixed worker and merger scripts * Fixed verbose logs based on PR comments * Added documentation on error handling * Fixed header on table markdown * Fixed based on PR comments * v0.4.3 Release (#131) * Upgraded description to use rAzureBatch v0.5.1 * Updated change log for job failure * readme.md update * Merge from feature/getjobresult for long running job support (#130) * Added set chunk size * Added cluster configuration validation function (#30) * Added pool config test validation * Added a fix for validation * Added if checks for null tests and more validation tests * Install R packages at job run time (#29) * Added cran/github installation scripts * Added package installation tests * Upgraded package version to 0.3.2 * Output file support (#40) * Output files support * Added createOutputFile method * output files readme documentation * added tests and find container sas * Added more detailed variable names * Enable/disable merge task (#39) * Merge task pass params * Fixed enableMerge cases * Merge task documentation on README.md * Fixed typo on merge task description * Update doAzureParallel.R * Changed enableMerge to enableCloudCombine * convert getJobResult output from binary to text * Only write vector to temp file * save cloud merge enabled, chunk size and packages as job metadata * update cloudMergeEnabled to cloudCombineEnabled * Fix/backwards compatible (#68) * Added backwards compatible in make cluster * Added deprecated config validator * Added mismatch label * Added validation for quota limits and bad getPool requests in waitForNodesToComplete (#52) * Added validation for quota limits and bad getPool requests * Fixed based on PR * Fixed progress bar layout to use switch statements instead of if statements * Changed clusterId to poolId * Added comments and fixed messages * Added running state to the node status * Reformatted lines for function * Added end statement for node completion * Feature/custom script and reduce (#70) * Added custom scripts and removed dependencies parameter * Updated roxygen tool version * Added parallelThreads support * Added test coverage * Removed verbose message on command line * Added Reduce function for group of tasks * Fix build because of doc semantics mismatch with function * Removed unused function * Added command line arg * Added docs for custom script * Moved customize cluster to separate doc for future usage * Fixed typo * Bug - Waiting for tasks to completion function ends too early (#69) * Moved wait for tasks to complete to doAzureParallel utility * Removed unneeded variables and progress * Fixed camel case for skiptoken * Travis/lintr (#72) * Added lintr config file * Added travis github package installation * Removed snake case rule * Fixed documents on doAzureParallel * Based on lintr default_settins docs, correctly added default rules * Updated lintr package to use object_name_style * Added package :: operator * Reformatted after merge * Fixed command line tests * Upgraded roxygen to 6.0.1 * Cluster config docs * Removed additional delete job * add getJob api (#84) * add getJob api * reformat code * update styling in utility file * fix code styling * update chunksize to chunkSize and remove unused code * handle job metadata in getJob api * fix styling issue * update getJobList parameter from list of job ids to filter object, and output jobs status in data frame (#128) long running job support, getJob, getJobList and getJobResult implementation * reformat code * update styling in utility file * fix code styling * update chunksize to chunkSize and remove unused code * handle job metadata in getJob api * fix styling issue * use counting service api in getJobList * fix coding style * return data frame from getJobList * update getJobList parameter from job id list to filter by state * reformat code * update description for getJobList * remove dup code * address review feedback * jobId parameter check for getJobResult * update documentation for long run job * update version to 0.5.0 * update version * address review feedback * update chunkSizeValue to chunkSizeKeyValuePair * Validate job names and pool names (#129) * Added validator class * Added validators for lintr * Added exclusion list for validators * fix bug in metadata handling for packages and enableCloudCombine (#133) * fix bug in metadata handling for packages and enableCloudCombine * call long running job api in test * update test * add test for long running job feature * code style fix * update job state description in readme * use list for job state filter * address review feedback
#36
Output:
Job Summary:
Id: job20170915230023
Waiting for tasks to complete. . .
|=========================================================== | 55%
Errors have occurred while running the job 'job20170915230023'. Error handling is set to 'stop' and has proceeded to terminate the job. The user will have to handle deleting the job. If this is not the correct behavior, change the errorHandling property in the foreach object. Use the 'getJobFile' function to obtain the logs. For more information about getting job logs, follow this link: https://github.com/Azure/doAzureParallel/blob/master/docs/40-troubleshooting.md#viewing-files-directly-from-compute-nodeWarning message:
In waitForTasksToComplete(id, jobTimeout, obj$errorHandling) :
There are currently 2 tasks failed while running the job:
job20170915230023-task5
job20170915230023-task6