Fix record oriented shuffle #599

Closed
fnothaft opened this issue Mar 1, 2015 · 4 comments

@fnothaft
Member

fnothaft commented Mar 1, 2015

Due to our shuffle being record oriented, we experience an approximately 8-10x increase in data volume when we shuffle. This is because our data is stored on disk in a columnar representation, but is shuffled in a row oriented format.
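To make the size gap concrete, here is a toy sketch in Python. It is not ADAM's or Spark's actual serialization: the record fields, the `row_encode` and `column_encode` helpers, and the run-length encoding are all assumptions chosen only to show why repetitive fields cost more when every record carries its own copy.

```python
from itertools import groupby

def row_encode(records):
    # Row oriented: every record carries all of its field values.
    return "\n".join(",".join(str(v) for v in r) for r in records).encode()

def column_encode(records):
    # Column oriented: each field's values are stored together and
    # run-length encoded, so long runs of repeats collapse to one entry.
    encoded = []
    for column in zip(*records):
        runs = [f"{value}x{len(list(group))}" for value, group in groupby(column)]
        encoded.append(",".join(runs))
    return "\n".join(encoded).encode()

# Hypothetical alignment-like records: (contig, start, mapq, record group).
records = [("chr1", 1000 + i, 60, "sample_rg0") for i in range(10_000)]

row_bytes = len(row_encode(records))
col_bytes = len(column_encode(records))
print(row_bytes, col_bytes, round(row_bytes / col_bytes, 1))
```

The ratio this prints depends entirely on the made-up data and should not be read as the 8-10x figure above, which comes from real datasets.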

@fnothaft fnothaft added this to the 0.17.0 milestone Mar 1, 2015
@tdanford
Contributor

tdanford commented Mar 1, 2015

So what's the proposed fix?

@fnothaft
Member Author

fnothaft commented Mar 1, 2015

TBD?

@tdanford
Contributor

tdanford commented Mar 1, 2015

Gotcha.

@ryan-williams
Member

FTR: presumably @massie's SPARK-7263 is our best hope here?

@fnothaft fnothaft modified the milestones: 1.0.0, 0.17.0 May 31, 2015
fnothaft added a commit to fnothaft/adam that referenced this issue Dec 27, 2015
Resolves bigdatagenomics#599. Since we have added the RecordGroupMetadata fields in
bdg-formats:0.7.0, we can read/write our metadata as separate Avro files. We
process these files when loading/writing the Parquet files where the alignment
data is stored. This allows us to both eliminate the bulky metadata that we are
currently storing in the AlignmentRecord, while maintaining the Sequence and
RecordGroup dictionaries that we need to keep around.
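The idea in this commit message can be sketched roughly as follows. This is only an illustration under stated assumptions: JSON files stand in for the Avro and Parquet files, and the file names, field names, and the `write_dataset` / `read_dataset` helpers are hypothetical rather than taken from ADAM. Each record keeps only a short record group key, while the full dictionary is written once alongside the records.

```python
import json
import tempfile
from pathlib import Path

def write_dataset(directory, records, record_groups):
    # Hypothetical layout: one sidecar metadata file plus one records file.
    directory = Path(directory)
    (directory / "record_groups.json").write_text(json.dumps(record_groups))
    with open(directory / "records.jsonl", "w") as out:
        for record in records:
            out.write(json.dumps(record) + "\n")

def read_dataset(directory):
    # Load the dictionary once, then the records that refer to it by key.
    directory = Path(directory)
    record_groups = json.loads((directory / "record_groups.json").read_text())
    with open(directory / "records.jsonl") as lines:
        records = [json.loads(line) for line in lines]
    return records, record_groups

# Made-up metadata; each record stores only the key "rg0", not the full entry.
record_groups = {"rg0": {"sample": "sample0", "library": "lib0"}}
records = [{"read_name": f"read{i}", "record_group": "rg0"} for i in range(3)]

with tempfile.TemporaryDirectory() as tmp:
    write_dataset(tmp, records, record_groups)
    loaded_records, loaded_groups = read_dataset(tmp)

# Metadata is looked up through the key carried by the record.
sample = loaded_groups[loaded_records[0]["record_group"]]["sample"]
```

Because the per-record payload no longer repeats the metadata, anything that moves records around row by row, such as a shuffle, has less to carry.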
fnothaft added a commit to fnothaft/adam that referenced this issue Dec 29, 2015
fnothaft added a commit to fnothaft/adam that referenced this issue Dec 29, 2015
fnothaft added a commit to fnothaft/adam that referenced this issue Jan 11, 2016
fnothaft added a commit to fnothaft/adam that referenced this issue Jan 12, 2016
fnothaft added a commit to fnothaft/adam that referenced this issue Jan 12, 2016
@heuermh heuermh modified the milestones: 1.0.0, 0.20.0 Oct 13, 2016