Commit

Update comments, README.md
jtnystrom committed Apr 24, 2021
1 parent ff11167 commit 8493bc0
Showing 3 changed files with 29 additions and 23 deletions.
27 changes: 16 additions & 11 deletions README.md
@@ -38,19 +38,21 @@ you can download a pre-built release from the [Releases](https://github.com/jtny

### Running Discount

-Spark applications, such as Discount, can run locally on your laptop, on a cluster, or in the cloud.
-On Google Cloud, we have tested on Dataproc image version 1.4 (Debian 9, Hadoop 2.9, Spark 2.4).
-However, Discount should run on any platform where Spark and Hadoop can run.
+Discount can run locally on your laptop, on a cluster, or in the cloud.
+It has been tested standalone with Spark 3.1.0 and Spark 2.4.6 (minor version differences should be compatible).
+In principle, Discount should run on any platform where Spark and Hadoop can run.
+On Google Cloud, we have tested on Dataproc image version 2.0 (Debian 10, Spark 3.1).
+On AWS, we have tested with emr-6.2.0 (Spark 3.0.1).

To run locally, first, install and configure Spark (http://spark.apache.org).
-Discount has been tested with Spark 3.1.0 and Spark 2.4.6 (minor version differences should be compatible).

-Run/submit scripts for macOS and Linux are provided. To run locally, copy `spark-submit.sh.template` to a new file called `spark-submit.sh`
-and edit the necessary variables in the file (at a minimum, set the path to your Spark installation). This will be the script used to run Discount.
-Alternatively, to submit to a GCloud cluster, you may use `submit-gcloud.sh.template`. In that case, change the example commands below to use that script instead, and insert your
-GCloud cluster name as an additional first parameter when invoking.
+Run/submit scripts for macOS and Linux are provided. To run locally, copy `spark-submit.sh.template` to a new file
+called `spark-submit.sh` and edit the necessary variables in the file (at a minimum, set the path to your Spark
+installation). This will be the script used to run Discount.

-To run on AWS EMR, you may use `submit-aws.sh.template` instead.
+To submit to a GCP cluster, you may use `submit-gcloud.sh.template`. In that case, change the example commands below to
+use that script instead, and insert your GCloud cluster name as an additional first parameter when invoking. To run on
+AWS EMR, you may use `submit-aws.sh.template` instead.
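
As a concrete sketch of the workflow described above (the Discount arguments and the cluster name are placeholders; the real options are shown in the usage examples below):

```bash
# Prepare the local run script from the provided template
cp spark-submit.sh.template spark-submit.sh
# Edit spark-submit.sh: set at least the path to your Spark installation

# Run Discount locally (arguments are placeholders; see the usage examples below)
./spark-submit.sh [discount arguments...]

# On a GCP Dataproc cluster, the cluster name is the additional first parameter
./submit-gcloud.sh my-cluster [discount arguments...]
```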

### Usage (k-mer counting)

@@ -161,7 +163,8 @@ expected single read length.
* If you are setting up Spark for the first time, you may want to configure key settings such as logging verbosity,
spark driver and executor memory, and the local directories for shuffle data (may get large).
You can edit the files in e.g. spark-3.1.0-bin-hadoopX.X/conf/ to do this.
-If you are running a local standalone Spark (everything in one process) then it is helpful to increase driver memory as much as possible.
+If you are running a local standalone Spark (everything in one process) then it is helpful to increase driver memory
+as much as possible.

* You can speed up the sampling stage somewhat by setting the `--numCPUs` argument.
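
As a rough illustration of the configuration tip above (the file names are Spark's standard conf files; the values are assumptions to adapt to your machine):

```bash
# Inside your Spark installation, e.g. spark-3.1.0-bin-hadoopX.X/

# Reduce logging verbosity: copy conf/log4j.properties.template to
# conf/log4j.properties and change the root logger, e.g.
#   log4j.rootCategory=WARN, console

# Increase driver memory and point shuffle data at a large local disk
# (example values) in conf/spark-defaults.conf:
cat >> conf/spark-defaults.conf <<'EOF'
spark.driver.memory 16g
spark.local.dir     /path/to/spark-scratch
EOF
```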

@@ -216,7 +219,9 @@ The same caveat as above applies.
To compile the software, the SBT build tool (https://www.scala-sbt.org/) is needed.
Although JDK 11 can be used, for maximum compatibility, we recommend compiling on JDK 8.
Discount is by default compiled for Scala 2.12/Spark 3.1.
-(You can use Scala 2.11 and Spark 2.4.x by editing build.sbt and the various run scripts according to the comments in those files.)
+(You can use Scala 2.11/Spark 2.4.x by editing build.sbt and the various run scripts according to the comments in those
+files. Note that generally, Spark 3.x is only compatible with Scala 2.12, and Spark 2.4.x is only compatible with
+Scala 2.11.)

The command `sbt assembly` will compile the software and produce the necessary jar file in
target/scala-2.12/Discount-assembly-x.x.x.jar. This will be a "fat" jar that also contains some necessary dependencies.
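
For reference, a typical build with the defaults described above might look like this (the exact jar version number will vary):

```bash
# Compile on JDK 8 for maximum compatibility (Scala 2.12 / Spark 3.1 by default)
java -version
sbt assembly

# The resulting "fat" jar, referenced by the run scripts
ls target/scala-2.12/Discount-assembly-*.jar
```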
4 changes: 3 additions & 1 deletion spark-submit.sh.template
@@ -1,8 +1,10 @@
#!/bin/bash
#Copy this file to spark-submit.sh and edit the config variables.

+#Run everything in one process (don't forget to adjust Spark's driver memory)
MASTER=local[*]
-#If you are running a standalone cluster, use the following instead
+
+#Full cluster running independently
#MASTER=spark://localhost:7077

SPARK=/path/to/spark-2.4.X-bin-hadoopX.X
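
For orientation, a filled-in `spark-submit.sh` could end up looking roughly like the sketch below. The main class name and the driver memory value are assumptions for illustration only; the actual template distributed with Discount is the authoritative starting point.

```bash
#!/bin/bash
# Sketch of a completed spark-submit.sh (assumed values, not the official template)
SPARK=/path/to/spark-3.1.0-bin-hadoopX.X
MASTER=local[*]
# Placeholder class name; use whatever the real template specifies
MAIN_CLASS=com.example.discount.Main
DISCOUNT_JAR=target/scala-2.12/Discount-assembly-*.jar

exec "$SPARK"/bin/spark-submit \
  --class "$MAIN_CLASS" \
  --master "$MASTER" \
  --driver-memory 16g \
  $DISCOUNT_JAR "$@"
```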
21 changes: 10 additions & 11 deletions submit-gcloud.sh.template
@@ -9,16 +9,16 @@ REGION=asia-northeast1
CLUSTER=$1
shift

-MAXRES=##spark.driver.maxResultSize=2g
+MAXRES=spark.driver.maxResultSize=2g

#High memory
-PARTITIONS=##spark.sql.shuffle.partitions=4000
+PARTITIONS=spark.sql.shuffle.partitions=4000
#Low memory
-#PARTITIONS=##spark.sql.shuffle.partitions=14000
+#PARTITIONS=spark.sql.shuffle.partitions=14000

#Max size of input splits in bytes. A smaller number reduces memory usage but increases the number of
#partitions for the first stage. If this variable is unset, Spark's default of 128 MB will be used.
-SPLIT=##spark.hadoop.mapreduce.input.fileinputformat.split.maxsize=$((64 * 1024 * 1024))
+SPLIT=spark.hadoop.mapreduce.input.fileinputformat.split.maxsize=$((64 * 1024 * 1024))

#On GCP, YARN memory is allocated using executor memory and memoryOverhead.
#The number of executors that will be spawned by YARN is (total memory)/(executor memory + memoryOverhead).
@@ -28,16 +28,15 @@ SPLIT=##spark.hadoop.mapreduce.input.fileinputformat.split.maxsize=$((64 * 1024

#The two settings below are suitable for k-mer counting on highcpu 16-core nodes.
#They also work well for standard 4-core nodes.
-#OVERHEAD=##spark.executor.memoryOverhead=768
-#EXECMEM=##spark.executor.memory=4352m
+#OVERHEAD=spark.executor.memoryOverhead=768
+#EXECMEM=spark.executor.memory=4352m

#Half memory setting for standard-16 nodes. Artificially inflated overhead to limit #executors
-#OVERHEAD=##spark.executor.memoryOverhead=$((11171 + 1117))
-#EXECMEM=##spark.executor.memory=11171m
+#OVERHEAD=spark.executor.memoryOverhead=$((11171 + 1117))
+#EXECMEM=spark.executor.memory=11171m

-#The special characters at the start make '##' the separator of properties in the list, making the comma sign
-#available for other purposes.
-PROPERTIES="^##^$PARTITIONS$MAXRES$OVERHEAD$EXECMEM$SPLIT"
+#Properties to actually use in the job. Empty values cannot be in this list.
+PROPERTIES="$PARTITIONS,$MAXRES,$SPLIT"

#Change 2.12 to 2.11 below if compiling for scala 2.11.
exec gcloud --verbosity=info dataproc jobs submit spark --region $REGION --cluster $CLUSTER \
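
To illustrate the new comma-separated PROPERTIES format above, with the defaults in this template the value passed to the job expands as follows (the executor-count arithmetic at the end uses an assumed per-node YARN memory figure purely for illustration):

```bash
# With the defaults above, PROPERTIES becomes:
#   spark.sql.shuffle.partitions=4000,spark.driver.maxResultSize=2g,spark.hadoop.mapreduce.input.fileinputformat.split.maxsize=67108864
echo $((64 * 1024 * 1024))   # 67108864 bytes = 64 MB max input split size

# Executors per node ~= floor(YARN memory / (executor memory + memoryOverhead)).
# With the half-memory setting above (11171m + 12288m overhead = 23459 MiB per executor)
# and an assumed ~48 GiB of YARN memory per node, roughly two executors fit per node.
```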
