Explore more efficient ways of getting inputs into HDFS #2014

Closed
droazen opened this issue Jul 18, 2016 · 4 comments

Comments

@droazen
Contributor

droazen commented Jul 18, 2016

@cwhelan mentioned at @jamesemery's presentation last week that there are more efficient options available for getting data into HDFS. It's worth exploring these.

@droazen
Contributor Author

droazen commented Jul 18, 2016

Should be no more than a 1-day investigation

@droazen droazen added this to the alpha-3 milestone Jul 18, 2016
@cwhelan
Member

cwhelan commented Jul 18, 2016

I was just suggesting trying out 'hadoop distcp' as an import method.
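As a rough illustration of what such an import could look like when scripted, here is a minimal sketch that builds and runs a `hadoop distcp` command. The bucket and HDFS paths are hypothetical placeholders, the helper names are made up for this example, and it assumes the `hadoop` binary is on the PATH and the cluster is already configured to read `gs://` URIs.

```python
import shlex
import subprocess


def build_distcp_command(source_uri, target_uri, max_maps=None):
    """Build the argument list for a `hadoop distcp` copy.

    max_maps maps to distcp's `-m` option (maximum number of
    simultaneous copies); it is omitted when left as None.
    """
    cmd = ["hadoop", "distcp"]
    if max_maps is not None:
        cmd += ["-m", str(max_maps)]
    cmd += [source_uri, target_uri]
    return cmd


def run_distcp(source_uri, target_uri, max_maps=None):
    """Run the copy and raise CalledProcessError if distcp fails."""
    cmd = build_distcp_command(source_uri, target_uri, max_maps)
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    source = "gs://my-bucket/sample.bam"
    target = "hdfs:///user/me/sample.bam"
    # Print the command rather than running it, so the sketch is safe to try.
    print(shlex.join(build_distcp_command(source, target, max_maps=20)))
```

Keeping command construction separate from execution makes it easy to inspect or log the exact command before launching a long copy.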

@jamesemery
Collaborator

I have found that the runtime of 'hadoop distcp' scales linearly with file size and does not seem to change as I scale the number of machines from 10 to 100. Taken together with the performance findings from #1675, this means that for large files it is actually more efficient to first load the bam file into HDFS using 'hadoop distcp' and then run BQSRSpark on it.

Breakdown:

  • 150GB bam file takes 25:30 minutes to download into HDFS
  • BQSRSpark takes 47:25 minutes to run on a 150GB bam file in HDFS
  • Total runtime (copy into HDFS + BQSRSpark on HDFS) = 72:55 minutes
  • BQSRSpark run directly from a GCS bucket = 77:15 minutes

This is only a small improvement in runtime, but it is worth keeping in mind for #2015 when evaluating which approach is best to take on Spark.
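The arithmetic behind the breakdown can be checked with a short sketch. It only uses the mm:ss figures quoted above; the helper names are made up for this example.

```python
def to_seconds(mmss):
    """Convert a 'minutes:seconds' string to a number of seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


def to_mmss(total_seconds):
    """Format a number of seconds as 'minutes:seconds'."""
    minutes, seconds = divmod(total_seconds, 60)
    return f"{minutes}:{seconds:02d}"


# Timings reported in the comment above.
copy_to_hdfs = to_seconds("25:30")   # distcp of the 150GB bam into HDFS
bqsr_on_hdfs = to_seconds("47:25")   # BQSRSpark reading from HDFS
bqsr_from_gcs = to_seconds("77:15")  # BQSRSpark reading from the GCS bucket

staged_total = copy_to_hdfs + bqsr_on_hdfs
saving = bqsr_from_gcs - staged_total

print(to_mmss(staged_total))                     # 72:55
print(to_mmss(saving))                           # 4:20
print(f"{100 * saving / bqsr_from_gcs:.1f}%")    # 5.6%
```

So staging the file in HDFS first saves about 4:20, roughly 5.6% of the GCS-based runtime, for this single 150GB measurement.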

@droazen
Contributor Author

droazen commented Mar 20, 2017

This is done
