@cwhelan mentioned at @jamesemery's presentation last week that there are more efficient options available for getting data into HDFS. It's worth exploring these.
I have found that `hadoop distcp` scales linearly with file size; its runtime does not appear to change as I scale the cluster from 10 to 100 machines. Furthermore, taken together with the performance findings from #1675, this means that for large files it is actually more efficient to first copy the BAM into HDFS using `hadoop distcp` and then run the Spark BQSR from there.
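For reference, the copy step is just a plain distcp from the GCS bucket into the cluster's HDFS; a minimal sketch follows (the bucket and paths are hypothetical, and it assumes the cluster has the GCS connector configured, as Dataproc clusters do by default):

```sh
# Stage the BAM from the GCS bucket into HDFS before running the Spark tool.
# distcp runs as a MapReduce job, so the copy itself runs on the cluster.
hadoop distcp \
  gs://my-bucket/NA12878.bam \
  hdfs:///user/"$(whoami)"/NA12878.bam

# Sanity-check that the copy landed and has the expected size.
hdfs dfs -ls -h /user/"$(whoami)"/NA12878.bam
```

Note that distcp parallelizes at file granularity, so a single large BAM is handled by one mapper; that would explain why the copy time doesn't improve as the cluster grows.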
Breakdown:
- Copying the 150GB BAM into HDFS with `hadoop distcp`: 25:30 (min:sec)
- BQSRSpark on the 150GB BAM in HDFS: 47:25
- Total (copy + run from HDFS): 72:55
- BQSRSpark reading directly from the GCS bucket: 77:15
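For concreteness, the only difference between the two BQSRSpark runs above is where the input URI points. The sketch below is illustrative only: the reference and known-sites paths are hypothetical, and the registered tool name and flag spellings differ between GATK builds, so don't treat it as exact:

```sh
# Spark BQSR against the copy already staged in HDFS (after the distcp above).
# Tool name and flag spellings follow GATK4-style conventions and are assumptions here.
./gatk BQSRPipelineSpark \
  -I hdfs:///user/me/NA12878.bam \
  -R gs://my-bucket/ref/Homo_sapiens_assembly38.fasta \
  --known-sites gs://my-bucket/ref/dbsnp_138.hg38.vcf.gz \
  -O hdfs:///user/me/NA12878.recal.bam \
  -- --spark-runner SPARK --spark-master yarn

# Same run, but reading the BAM directly from the GCS bucket instead of HDFS.
./gatk BQSRPipelineSpark \
  -I gs://my-bucket/NA12878.bam \
  -R gs://my-bucket/ref/Homo_sapiens_assembly38.fasta \
  --known-sites gs://my-bucket/ref/dbsnp_138.hg38.vcf.gz \
  -O hdfs:///user/me/NA12878.recal.bam \
  -- --spark-runner SPARK --spark-master yarn
```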
A small improvement in runtime, but it is worth keeping in mind for #2015 going forward when evaluating which approach is best to take on Spark.