Fix record oriented shuffle #599
So what's the proposed fix?
TBD?
Gotcha.
FTR: presumably @massie's SPARK-7263 is our best hope here?
fnothaft added a commit to fnothaft/adam that referenced this issue on Dec 27, 2015:
Resolves bigdatagenomics#599. Since we added the RecordGroupMetadata fields in bdg-formats:0.7.0, we can read/write our metadata as separate Avro files. We process these files when loading/writing the Parquet files where the alignment data is stored. This allows us to both eliminate the bulky metadata that we are currently storing in each AlignmentRecord and maintain the Sequence and RecordGroup dictionaries that we need to keep around.
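
The side-file approach described in the commit message can be sketched roughly as follows. This is a minimal illustration only, not ADAM code: it uses JSON files in place of Avro, a plain data file in place of Parquet, and the function names and the `_rgdict.json` file name are hypothetical.

```python
import json
import os
import tempfile

# Sketch: record-group metadata is written once, as a small side file
# next to the bulk alignment data, instead of being repeated inside
# every alignment record. All names below are illustrative.

def save_alignments(path, records, record_groups):
    os.makedirs(path, exist_ok=True)
    # Bulk alignment data (stand-in for the Parquet part files).
    with open(os.path.join(path, "part-00000.json"), "w") as f:
        json.dump(records, f)
    # Record-group dictionary written once, alongside the data.
    with open(os.path.join(path, "_rgdict.json"), "w") as f:
        json.dump(record_groups, f)

def load_alignments(path):
    # Loading reads both files and reattaches the dictionary.
    with open(os.path.join(path, "part-00000.json")) as f:
        records = json.load(f)
    with open(os.path.join(path, "_rgdict.json")) as f:
        record_groups = json.load(f)
    return records, record_groups

tmp = tempfile.mkdtemp()
save_alignments(tmp,
                [{"read": "r1", "rg": "rg1"}],
                {"rg1": {"sample": "NA12878"}})
records, rgs = load_alignments(tmp)
```

The point of the design is that the dictionaries survive a save/load round trip while each individual record only carries a small record-group key.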
fnothaft added further commits to fnothaft/adam referencing this issue on Dec 29, 2015, Jan 11, 2016, and Jan 12, 2016, each carrying the same commit message.
Because our shuffle is record-oriented, we see an approximately 8-10x increase in data volume when we shuffle: our data is stored on disk in a columnar representation, but is shuffled in a row-oriented format.
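
A toy illustration of why the blow-up happens (not ADAM code, and the 8-10x figure comes from real workloads, not from this sketch): when records are serialized one at a time, every record carries its own copy of every field name and every repeated value, whereas a columnar layout stores each field as one run that dictionary-encodes well.

```python
import pickle

# 1,000 alignment-like records sharing the same sample and record group.
records = [
    {"name": f"read{i}", "start": 1_000_000 + i,
     "sample": "NA12878", "record_group": "rg-lane-1"}
    for i in range(1000)
]

# Row-oriented shuffle format: each record is serialized independently,
# so field names and repeated string values are duplicated per record.
row_bytes = sum(len(pickle.dumps(r)) for r in records)

# Columnar layout: one sequence per field; here the constant columns
# are approximated by a single dictionary entry each.
columns = {
    "name": [r["name"] for r in records],
    "start": [r["start"] for r in records],
    "sample": ["NA12878"],          # dictionary-encoded: stored once
    "record_group": ["rg-lane-1"],  # dictionary-encoded: stored once
}
col_bytes = len(pickle.dumps(columns))

print(f"row/columnar size ratio: {row_bytes / col_bytes:.1f}")
```

The exact ratio depends on the serializer and the data, but the row-wise form is several times larger, which is the effect the issue describes at shuffle time.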