@cwhelan mentioned at @jamesemery's presentation last week that there are more efficient options available for getting data into HDFS. It's worth exploring these.
I have found that `hadoop distcp` scales linearly with file size; its runtime does not appear to change as I scale the cluster from 10 to 100 machines. Furthermore, taken together with the performance findings from #1675, this means that for large files it is actually more efficient to first copy the BAM into HDFS using `hadoop distcp` and then run the Spark BQSR from there.
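For reference, the copy step is just a plain distcp from the GCS bucket into the cluster's HDFS; a minimal sketch follows (the bucket and paths are hypothetical, and it assumes the cluster has the GCS connector configured, as Dataproc clusters do by default):

```sh
# Stage the BAM from the GCS bucket into HDFS before running the Spark tool.
# distcp runs as a MapReduce job, so the copy itself runs on the cluster.
hadoop distcp \
  gs://my-bucket/NA12878.bam \
  hdfs:///user/"$(whoami)"/NA12878.bam

# Sanity-check that the copy landed and has the expected size.
hdfs dfs -ls -h /user/"$(whoami)"/NA12878.bam
```

Note that distcp parallelizes at file granularity, so a single large BAM is handled by one mapper; that would explain why the copy time doesn't improve as the cluster grows.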
Breakdown:
- Copying the 150GB BAM into HDFS with `hadoop distcp`: 25:30 (min:sec)
- BQSRSpark on the 150GB BAM in HDFS: 47:25
- Total (copy + run from HDFS): 72:55
- BQSRSpark reading directly from the GCS bucket: 77:15
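For concreteness, the only difference between the two BQSRSpark runs above is where the input URI points. The sketch below is illustrative only: the reference and known-sites paths are hypothetical, and the registered tool name and flag spellings differ between GATK builds, so don't treat it as exact:

```sh
# Spark BQSR against the copy already staged in HDFS (after the distcp above).
# Tool name and flag spellings follow GATK4-style conventions and are assumptions here.
./gatk BQSRPipelineSpark \
  -I hdfs:///user/me/NA12878.bam \
  -R gs://my-bucket/ref/Homo_sapiens_assembly38.fasta \
  --known-sites gs://my-bucket/ref/dbsnp_138.hg38.vcf.gz \
  -O hdfs:///user/me/NA12878.recal.bam \
  -- --spark-runner SPARK --spark-master yarn

# Same run, but reading the BAM directly from the GCS bucket instead of HDFS.
./gatk BQSRPipelineSpark \
  -I gs://my-bucket/NA12878.bam \
  -R gs://my-bucket/ref/Homo_sapiens_assembly38.fasta \
  --known-sites gs://my-bucket/ref/dbsnp_138.hg38.vcf.gz \
  -O hdfs:///user/me/NA12878.recal.bam \
  -- --spark-runner SPARK --spark-master yarn
```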
A small improvement in runtime, but it is worth keeping in mind for #2015 going forward when evaluating which approach is best to take on Spark.