Redelm File format design
The RedElm file format is a columnar format inspired by the ColumnIO format described in the Dremel paper from Google. Its primary target is HDFS.
This document reflects the current state of thinking around this format. It is not set in stone and is up for discussion.
The records are organized in row groups. Each row group is organized by column. There are three columns per leaf of the schema (primitive type):
- repetition level
- definition level
- data

Each column chunk in a row group is compressed using the configured Hadoop CompressionCodec. Metadata is stored in a footer.
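As a rough illustration of this layout, a leaf column chunk can be pictured as three parallel streams, one per column listed above. The class and field names below are hypothetical, not the actual RedElm API:

```java
// Hypothetical sketch of a leaf column chunk: three parallel streams per
// primitive field of the schema, as described above. Not the RedElm API.
class LeafColumnChunk {
  String[] path;           // path of the leaf in the schema, e.g. {"links", "backward"}
  int[] repetitionLevels;  // per value: at which repeated field on the path the record repeats
  int[] definitionLevels;  // per value: how many optional/repeated fields on the path are defined
  byte[] data;             // the encoded values themselves (encoding recorded in the chunk header)
}
```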
- The primary target is HDFS
- Each column chunk can be read independently so that only the data for the fields accessed is read.
- All the column data for the same row group should be co-located to speed up record assembly.
- Repetition levels and definition levels can be read independently of the data (for example, for a fast count implementation).
- Row groups should be large enough to amortize IO seek cost over scan.
- Backward compatibility: The format should allow for new extensions (column data encoding: dictionary, delta, ...) to be implemented while allowing new versions of the library to read old versions of the files.
- The file can be generated in one pass from an M/R job (either from the mappers or the reducers).
- The format should be readable in any language and not be tied to Java
- Application-specific metadata blocks can be added in the footer to allow customization of the conversion. For example, when writing from Pig, a Pig metadata block is added with the Pig schema. This allows converting the data back to the original Pig schema (Pig Maps are converted to lists of key/value pairs, so the Pig schema is required to convert them back to a Map rather than a bag of key/value tuples). A sketch of such named blocks follows below.
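For illustration only, the footer can be thought of as a set of named metadata blocks keyed by the application that wrote them. The names, types, and methods below are hypothetical, not the actual RedElm layout:

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch of named, application-specific metadata blocks in the footer.
class FooterSketch {
  // e.g. "redelm" -> schema and row-group metadata (required),
  //      "pig"    -> the original Pig schema (optional),
  //      "thrift" -> the Thrift class name (optional)
  final Map<String, byte[]> namedMetadataBlocks = new LinkedHashMap<>();

  // Only the Pig extension of the InputFormat knows how to interpret the "pig" block;
  // readers that do not recognize a block simply ignore it.
  void addPigBlock(String pigSchema) {
    namedMetadataBlocks.put("pig", pigSchema.getBytes(StandardCharsets.UTF_8));
  }
}
```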
- Column data in a row group has a header that records the encoding used in this particular data block for this particular column. The header should also allow leaving the content of the column uncompressed.
- The metadata is stored in a footer because it is accumulated in memory while the row groups are written. Once all the data blocks (one per row group) have been written, the metadata blocks are appended at the end of the file.
- The index of the footer is stored at the very end of the file so that readers can locate the footer.
- A given row group is buffered in memory and flushed to disk when a size threshold is reached (see the writer sketch after this list).
- Metadata blocks are stored in JSON (TODO), to simplify backward compatibility and cross-compatibility.
- The footer is gzipped.
- A summary file consolidating all the footers from the part files is created in the partition directory. This allows faster lookup of metadata by the InputFormat and makes it easy to add custom footer blocks that configure how the data should be read by applications that need to convert the data to their own types.
- We want to allow application-specific customization (Pig, Hive, ...) without breaking cross-compatibility.
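A minimal sketch of the write path implied by the points above: buffer a row group in memory, flush it once it grows past a size threshold, then append the footer, the footer index, and the trailing magic number. It assumes a Hadoop FSDataOutputStream; all method names are hypothetical, and the column encoding, compression, and footer serialization are elided:

```java
import java.io.IOException;
import org.apache.hadoop.fs.FSDataOutputStream;

// Hypothetical writer skeleton; not the actual RedElm implementation.
class RedElmWriterSketch {
  private static final byte[] MAGIC = {82, 101, 100, 32, 69, 108, 109, 10}; // "Red Elm\n"
  private final FSDataOutputStream out;
  private final long rowGroupThreshold;  // flush the buffered row group once it grows past this size
  private long bufferedSize = 0;

  RedElmWriterSketch(FSDataOutputStream out, long rowGroupThreshold) throws IOException {
    this.out = out;
    this.rowGroupThreshold = rowGroupThreshold;
    out.write(MAGIC);                    // magic number at the beginning of the file
  }

  void write(Object record) throws IOException {
    bufferedSize += bufferRecordByColumn(record); // accumulate column data for the current row group in memory
    if (bufferedSize >= rowGroupThreshold) {
      flushRowGroup();                   // one data block per row group
      bufferedSize = 0;
    }
  }

  void close() throws IOException {
    if (bufferedSize > 0) flushRowGroup();
    long footerStart = out.getPos();
    writeGzippedJsonFooter();            // metadata accumulated while writing the row groups
    out.writeLong(footerStart);          // footer index: lets a reader find the footer
    out.write(MAGIC);                    // magic number again at the very end
    out.close();
  }

  // Elided: column encoding, compression with the configured CompressionCodec,
  // and footer serialization.
  private long bufferRecordByColumn(Object record) { return 1; }
  private void flushRowGroup() throws IOException {}
  private void writeGzippedJsonFooter() throws IOException {}
}
```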
- Skiplists to skip values in a row group when a filter is being evaluated on a different column.
- Look into setting the HDFS block size to a large value (1-2 GB) for the resulting file. The goal is to ensure complete co-location of the column data within the same HDFS block (row groups do not cross HDFS block boundaries).
- Explore a two-pass generation: as row groups are buffered in memory, we want to flush them to disk when they get too big; however, when scanning the data we want larger blocks for more effective scans. We could write files with relatively small row groups and then rewrite them at the end, one column at a time, in bigger chunks.
- Figure out the best way to deal with schema conversion across systems (Pig, Hive, Thrift, ...) and a good way to capture types that are not in the Protobuf spec (Map, Set, ...).
See the picture after the description
- 8-byte magic number: 82, 101, 100, 32, 69, 108, 109, 10 ("Red Elm\n", to avoid trying to read non-RedElm files)
- row group (repeated):
  - column (repeated):
    - repetition levels (ShortInt, compressed)
    - definition levels (compressed)
    - data:
      - header: describes the encoding of the data (TODO)
      - actual data (compressed)
  - column (repeated): ...
- footer, storing the metadata blocks:
  - file version
  - metadata block count
  - RedElm block (required):
    - file-level metadata:
      - RedElm schema
      - codec used to compress (TODO: one of a predefined set to ensure cross-language compatibility, for example lzo, gz, snappy, ...); this is a generic compression codec applied on top of the existing encoding and may optionally be turned off per column (TBD)
    - block-level metadata, for each block:
      - start index (could be replaced by the block length)
      - end index (could be replaced by the block length)
      - record count
      - columns for this block:
        - r start (could be replaced by the r length)
        - d start (could be replaced by the d length)
        - data start (could be replaced by the data length)
        - data end (could be replaced by the data length)
        - identifier (path in the schema)
        - type (one of the primitive types)
        - value count
        - r uncompressed size (useful for planning)
        - d uncompressed size
        - data uncompressed size
  - file-level metadata (application-specific blocks):
    - Pig block (optional; shown as an example, only the Pig extension of the InputFormat knows about this block):
      - Pig schema (to allow reading with the exact same schema used to write, as the conversion from the Pig schema to the RedElm schema is lossy)
    - Thrift block (optional; likewise, the format just allows arbitrary named metadata blocks in the footer):
      - Thrift class name (same reasoning as the Pig schema)
- footer index
- magic number
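A matching sketch of how a reader could locate the footer from the end of the file, assuming an 8-byte footer index followed by the 8-byte magic number as listed above; the field sizes and names are assumptions for illustration, not the actual RedElm layout:

```java
import java.io.IOException;
import java.util.Arrays;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical footer lookup; not the actual RedElm implementation.
class FooterLocatorSketch {
  private static final byte[] MAGIC = {82, 101, 100, 32, 69, 108, 109, 10}; // "Red Elm\n"

  static byte[] readFooter(FileSystem fs, Path file) throws IOException {
    long len = fs.getFileStatus(file).getLen();
    try (FSDataInputStream in = fs.open(file)) {
      // Assumed trailing layout: ... footer | footer index (8 bytes) | magic number (8 bytes)
      in.seek(len - MAGIC.length - 8);
      long footerStart = in.readLong();
      byte[] trailingMagic = new byte[MAGIC.length];
      in.readFully(trailingMagic);
      if (!Arrays.equals(trailingMagic, MAGIC)) {
        throw new IOException(file + " is not a RedElm file");
      }
      // The footer spans from footerStart up to the footer index.
      byte[] gzippedJsonFooter = new byte[(int) (len - MAGIC.length - 8 - footerStart)];
      in.seek(footerStart);
      in.readFully(gzippedJsonFooter);
      return gzippedJsonFooter;          // still gzipped JSON; decompression and parsing omitted
    }
  }
}
```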