
feat: add dictionary encoding #3134

Merged
merged 14 commits into lancedb:main on Nov 22, 2024

Conversation

@broccoliSpicy (Contributor) commented Nov 18, 2024

This PR adds support for dictionary encoding by integrating it with the MiniBlock PageLayout.

The general approach: in a MiniBlock PageLayout, there is an optional dictionary field that stores a dictionary encoding if the miniblock has a dictionary.

/// A layout used for pages where the data is small
///
/// In this case we can fit many values into a single disk sector and transposing buffers is
/// expensive.  As a result, we do not transpose the buffers but compress the data into small
/// chunks (called mini blocks) which are roughly the size of a disk sector.
message MiniBlockLayout {
  // Description of the compression of repetition levels (e.g. how many bits per rep)
  ArrayEncoding rep_compression = 1;
  // Description of the compression of definition levels (e.g. how many bits per def)
  ArrayEncoding def_compression = 2;
  // Description of the compression of values
  ArrayEncoding value_compression = 3;
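  // Description of the compression of the dictionary, if this miniblock is dictionary encoded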
  ArrayEncoding dictionary = 4;
}

The rationale is that if we dictionary-encode something, its indices will definitely fall into a MiniBlockLayout.
By doing this, we don't need a dedicated DictionaryEncoding; it can be any ArrayEncoding.
The dictionary and the indices are each cascaded into another encoding automatically.
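To make the cascade concrete, here is a minimal sketch of the dictionary-encode step; the names and types are illustrative, not the actual Lance API. The input values are split into an indices block and a dictionary block, and each of these is then an ordinary block that any ArrayEncoding can compress further.

```rust
use std::collections::HashMap;

// Minimal sketch (hypothetical signature) of the dictionary-encode step
// described above: map each value to a small index into a dictionary.
fn dictionary_encode(values: &[u64]) -> (Vec<u32>, Vec<u64>) {
    let mut seen: HashMap<u64, u32> = HashMap::new();
    let mut dictionary: Vec<u64> = Vec::new();
    let indices: Vec<u32> = values
        .iter()
        .map(|v| {
            *seen.entry(*v).or_insert_with(|| {
                dictionary.push(*v);
                (dictionary.len() - 1) as u32
            })
        })
        .collect();
    // Both outputs are plain blocks again, so each can be cascaded into any
    // ArrayEncoding (e.g. bit-packing for the low-cardinality indices).
    (indices, dictionary)
}
```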

Currently, the dictionary is stored inside the page along with the chunk metadata and chunk data; this is not ideal and remains a TODO.

This is a draft for discussion of the above idea, so only FixedWidthDataBlock is supported with this encoding; the effort to add support for VariableWidth DataBlock is trivial.

#3123

The github-actions bot added the enhancement (New feature or request) label on Nov 18, 2024.
@broccoliSpicy changed the title from "feat: add dictionary encoding (draft, for discussion only)" to "feat: add dictionary encoding" on Nov 20, 2024.
@codecov-commenter commented Nov 20, 2024

Codecov Report

Attention: Patch coverage is 19.14894% with 266 lines in your changes missing coverage. Please review.

Project coverage is 77.69%. Comparing base (1d3b204) to head (0a6f6c9).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| .../lance-encoding/src/encodings/logical/primitive.rs | 13.10% | 172 Missing and 7 partials ⚠️ |
| ...st/lance-encoding/src/encodings/physical/binary.rs | 0.00% | 59 Missing ⚠️ |
| rust/lance-encoding/src/encoder.rs | 0.00% | 14 Missing ⚠️ |
| rust/lance-core/src/utils/hash.rs | 0.00% | 6 Missing ⚠️ |
| rust/lance-encoding/src/format.rs | 28.57% | 5 Missing ⚠️ |
| rust/lance-encoding/src/decoder.rs | 0.00% | 1 Missing ⚠️ |
| rust/lance-encoding/src/encodings/physical.rs | 0.00% | 1 Missing ⚠️ |
| rust/lance-encoding/src/statistics.rs | 97.14% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3134      +/-   ##
==========================================
- Coverage   77.95%   77.69%   -0.26%     
==========================================
  Files         242      243       +1     
  Lines       81904    82206     +302     
  Branches    81904    82206     +302     
==========================================
+ Hits        63848    63874      +26     
- Misses      14890    15152     +262     
- Partials     3166     3180      +14     
| Flag | Coverage Δ |
|---|---|
| unittests | 77.69% <19.14%> (-0.26%) ⬇️ |

@broccoliSpicy (Contributor, Author) commented Nov 21, 2024

After making the append method in DataBlockBuilderImpl use an immutable borrow:

| Table Name | Parquet Read Time | Lance Read Time | Parquet File Size | Lance File Size |
|---|---|---|---|---|
| customer | 0.45s | 0.42s | 120 MiB | 152 MiB |
| lineitem | 11.55s | 10.46s | 2,076 MiB | 3,374 MiB |
| orders | 2.93s | 2.78s | 559 MiB | 894 MiB |
| part | 0.34s | 0.40s | 61 MiB | 127 MiB |
| partsupp | 3.29s | 1.89s | 401 MiB | 470 MiB |
| Column Name | DataType | Parquet Read Time | Lance Read Time | Parquet File Size | Lance File Size | Cardinality |
|---|---|---|---|---|---|---|
| p_partkey | int32 | 0.05s | 0.02s | 8 MiB | 4 MiB | 2000000 |
| p_name | string | 0.45s | 0.09s | 25 MiB | 30 MiB | 1999828 |
| p_mfgr | string | 0.07s | 0.03s | 0.7 MiB | 0.7 MiB | 5 |
| p_brand | string | 0.06s | 0.03s | 1 MiB | 1 MiB | 25 |
| p_type | string | 0.10s | 0.23s | 1 MiB | 38 MiB | 150 |
| p_size | int32 | 0.02s | 0.02s | 1 MiB | 1 MiB | 50 |
| p_container | string | 0.08s | 0.03s | 1 MiB | 1 MiB | 40 |
| p_retailprice | decimal128(15, 2) | 0.11s | 0.20s | 3 MiB | 30 MiB | 31681 |
| p_comment | string | 0.32s | 0.19s | 16 MiB | 27 MiB | 754704 |

@broccoliSpicy (Contributor, Author) commented Nov 21, 2024

First 100 rows:
l_extendedprice
[[Decimal('33078.94')]
[Decimal('38306.16')]
[Decimal('15479.68')]
[Decimal('34616.68')]
[Decimal('28974.00')]
[Decimal('44842.88')]
[Decimal('63066.32')]
[Decimal('86083.65')]
[Decimal('70822.15')]
[Decimal('39620.34')]
[Decimal('3581.56')]
[Decimal('52411.80')]
[Decimal('35032.14')]
[Decimal('39819.00')]
[Decimal('25179.60')]
[Decimal('31387.20')]
[Decimal('68864.50')]
[Decimal('53697.73')]
[Decimal('17273.04')]
[Decimal('12423.15')]
[Decimal('84904.50')]
[Decimal('46245.92')]
[Decimal('74398.68')]
[Decimal('55806.45')]
[Decimal('7216.50')]
[Decimal('26963.72')]
[Decimal('40995.52')]
[Decimal('3091.16')]
[Decimal('5393.68')]
[Decimal('46642.64')]
[Decimal('6978.84')]
[Decimal('39224.92')]
[Decimal('34948.80')]
[Decimal('8803.10')]
[Decimal('49780.56')]
[Decimal('20768.41')]
[Decimal('24817.98')]
[Decimal('8558.10')]
[Decimal('33708.00')]
[Decimal('44788.54')]
[Decimal('13026.23')]
[Decimal('42317.50')]
[Decimal('42877.74')]
[Decimal('45516.80')]
[Decimal('74029.62')]
[Decimal('48691.20')]
[Decimal('69449.25')]
[Decimal('45538.29')]
[Decimal('63681.20')]
[Decimal('49288.36')]
[Decimal('46194.72')]
[Decimal('58892.42')]
[Decimal('57788.48')]
[Decimal('52982.88')]
[Decimal('68665.20')]
[Decimal('30837.66')]
[Decimal('52933.66')]
[Decimal('26050.42')]
[Decimal('37545.27')]
[Decimal('37916.72')]
[Decimal('78670.80')]
[Decimal('5069.36')]
[Decimal('21910.92')]
[Decimal('10159.55')]
[Decimal('48887.96')]
[Decimal('23784.30')]
[Decimal('33001.13')]
[Decimal('4925.01')]
[Decimal('84764.66')]
[Decimal('84721.88')]
[Decimal('26424.60')]
[Decimal('40541.31')]
[Decimal('46006.50')]
[Decimal('63853.40')]
[Decimal('54433.44')]
[Decimal('55447.68')]
[Decimal('29539.20')]
[Decimal('3279.00')]
[Decimal('72225.30')]
[Decimal('25852.69')]
[Decimal('9761.92')]
[Decimal('20974.98')]
[Decimal('1186.00')]
[Decimal('14182.41')]
[Decimal('50996.73')]
[Decimal('30371.88')]
[Decimal('30631.75')]
[Decimal('3330.36')]
[Decimal('61348.50')]
[Decimal('49876.20')]
[Decimal('57583.11')]
[Decimal('47574.50')]
[Decimal('38862.87')]
[Decimal('58554.90')]
[Decimal('24241.36')]
[Decimal('61777.05')]
[Decimal('39272.24')]
[Decimal('29739.92')]
[Decimal('1424.37')]
[Decimal('14056.42')]]

First 100 rows:
p_retailprice
[[Decimal('901.00')]
[Decimal('902.00')]
[Decimal('903.00')]
[Decimal('904.00')]
[Decimal('905.00')]
[Decimal('906.00')]
[Decimal('907.00')]
[Decimal('908.00')]
[Decimal('909.00')]
[Decimal('910.01')]
[Decimal('911.01')]
[Decimal('912.01')]
[Decimal('913.01')]
[Decimal('914.01')]
[Decimal('915.01')]
[Decimal('916.01')]
[Decimal('917.01')]
[Decimal('918.01')]
[Decimal('919.01')]
[Decimal('920.02')]
[Decimal('921.02')]
[Decimal('922.02')]
[Decimal('923.02')]
[Decimal('924.02')]
[Decimal('925.02')]
[Decimal('926.02')]
[Decimal('927.02')]
[Decimal('928.02')]
[Decimal('929.02')]
[Decimal('930.03')]
[Decimal('931.03')]
[Decimal('932.03')]
[Decimal('933.03')]
[Decimal('934.03')]
[Decimal('935.03')]
[Decimal('936.03')]
[Decimal('937.03')]
[Decimal('938.03')]
[Decimal('939.03')]
[Decimal('940.04')]
[Decimal('941.04')]
[Decimal('942.04')]
[Decimal('943.04')]
[Decimal('944.04')]
[Decimal('945.04')]
[Decimal('946.04')]
[Decimal('947.04')]
[Decimal('948.04')]
[Decimal('949.04')]
[Decimal('950.05')]
[Decimal('951.05')]
[Decimal('952.05')]
[Decimal('953.05')]
[Decimal('954.05')]
[Decimal('955.05')]
[Decimal('956.05')]
[Decimal('957.05')]
[Decimal('958.05')]
[Decimal('959.05')]
[Decimal('960.06')]
[Decimal('961.06')]
[Decimal('962.06')]
[Decimal('963.06')]
[Decimal('964.06')]
[Decimal('965.06')]
[Decimal('966.06')]
[Decimal('967.06')]
[Decimal('968.06')]
[Decimal('969.06')]
[Decimal('970.07')]
[Decimal('971.07')]
[Decimal('972.07')]
[Decimal('973.07')]
[Decimal('974.07')]
[Decimal('975.07')]
[Decimal('976.07')]
[Decimal('977.07')]
[Decimal('978.07')]
[Decimal('979.07')]
[Decimal('980.08')]
[Decimal('981.08')]
[Decimal('982.08')]
[Decimal('983.08')]
[Decimal('984.08')]
[Decimal('985.08')]
[Decimal('986.08')]
[Decimal('987.08')]
[Decimal('988.08')]
[Decimal('989.08')]
[Decimal('990.09')]
[Decimal('991.09')]
[Decimal('992.09')]
[Decimal('993.09')]
[Decimal('994.09')]
[Decimal('995.09')]
[Decimal('996.09')]
[Decimal('997.09')]
[Decimal('998.09')]
[Decimal('999.09')]
[Decimal('1000.10')]]

@westonpace (Contributor) left a comment:

Nicely done. A few initial questions, mostly around alignment, but I like the direction

assert!(block.bits_per_offset == 32);

let offsets = block.offsets.borrow_to_typed_slice::<u32>();
let offsets = offsets.as_ref();
let offsets: &[u32] = cast_slice(&block.offsets);
westonpace (Contributor):

Can we use try_cast_slice? This cast should always be safe, but this isn't in a critical section, so it's probably worth it just to be sure.

broccoliSpicy (Contributor, Author):

thanks! fixed.
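For reference, a minimal sketch of the checked cast with bytemuck; the function name and error message are illustrative:

```rust
use bytemuck::try_cast_slice;

// try_cast_slice verifies alignment and length instead of panicking the way
// cast_slice does; a misaligned or odd-sized buffer surfaces as an Err.
fn offsets_as_u32(raw_offsets: &[u8]) -> &[u32] {
    try_cast_slice(raw_offsets).expect("offsets buffer must be 4-byte aligned and sized")
}
```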

let encoding = ProtobufUtils::binary_block();
Ok((encoder, encoding))
} else {
todo!("Implement BlockCompression for VariableWidth DataBlock with offsets type u32")
westonpace (Contributor):

Suggested change
todo!("Implement BlockCompression for VariableWidth DataBlock with offsets type u32")
todo!("Implement BlockCompression for VariableWidth DataBlock with 64 bit offsets")

broccoliSpicy (Contributor, Author):

thanks! fixed

let offsets = offsets.as_ref();
// the first 4 bytes store the number of values, then 4 bytes for bytes_start_offset,
// then offsets data, then bytes data.
let bytes_start_offset = std::mem::size_of_val(offsets) as u32 + 4 + 4;
westonpace (Contributor):

Since you're using u32 directly above it seems you may as well do std::mem::size_of::<u32>() instead of size_of_val.

broccoliSpicy (Contributor, Author):

I am using size_of_val to get the total length in bytes of the &[u32] here;
std::mem::size_of::<u32>() only gives the size of a single element.
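A quick illustration of the distinction (the example values are made up):

```rust
fn main() {
    let offsets: &[u32] = &[0, 5, 9, 14];
    // size_of_val on the slice returns its total byte length (len * 4 here)...
    assert_eq!(std::mem::size_of_val(offsets), 16);
    // ...while size_of::<u32>() is the size of one element only.
    assert_eq!(std::mem::size_of::<u32>(), 4);
    // An equivalent spelled-out form of the slice's byte length:
    assert_eq!(std::mem::size_of::<u32>() * offsets.len(), 16);
}
```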

Comment on lines 765 to 766
let output_total_bytes =
((bytes_start_offset as usize + variable_width_data.data.len()) + 3) / 4;
westonpace (Contributor):

I'm not sure I understand why the +3 / 4?

broccoliSpicy (Contributor, Author):

yeah, this is a mistake, thanks!
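For context on the expression being questioned: `(n + 3) / 4` is the standard integer idiom for ceiling division by 4, i.e. the number of 4-byte words needed to hold n bytes, so as written the expression computed a word count rather than the byte total its name suggests — which may be the mistake acknowledged here. A small illustration:

```rust
fn main() {
    // (n + 3) / 4 rounds up: the number of 4-byte words needed for n bytes.
    assert_eq!((10 + 3) / 4, 3); // 10 bytes occupy 3 u32 words
    assert_eq!((12 + 3) / 4, 3); // exact multiples are unchanged
    // Padding a byte length up to a 4-byte boundary keeps the unit in bytes:
    assert_eq!((10 + 3) / 4 * 4, 12);
}
```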

output.extend_from_slice(
&BLOCK_PAD_BUFFER[..pad_bytes::<BINARY_BLOCK_ALIGNMENT>(output.len())],
);
Ok(LanceBuffer::reinterpret_vec(output))
westonpace (Contributor):

Since output is already u8 you can just do LanceBuffer::Owned(output).

broccoliSpicy (Contributor, Author):

thanks! fixed.

Comment on lines 777 to 780
// pad this chunk to make it align to 4 bytes.
output.extend_from_slice(
&BLOCK_PAD_BUFFER[..pad_bytes::<BINARY_BLOCK_ALIGNMENT>(output.len())],
);
westonpace (Contributor):

I think padding should be a concern of the code using the block compressor and not a concern of the block compressor itself.

broccoliSpicy (Contributor, Author):

thanks, fixed.
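For reference, the pad_bytes helper used above computes how many bytes are needed to reach the next alignment boundary; a minimal sketch of that arithmetic (an assumption, not necessarily the actual Lance implementation):

```rust
// Bytes required to round `len` up to the next multiple of ALIGN.
// (ALIGN is a power of two in practice, but this form works for any
// non-zero alignment.)
const fn pad_bytes<const ALIGN: usize>(len: usize) -> usize {
    (ALIGN - (len % ALIGN)) % ALIGN
}

fn main() {
    assert_eq!(pad_bytes::<4>(10), 2); // 10 -> 12
    assert_eq!(pad_bytes::<4>(12), 0); // already aligned
}
```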

Comment on lines 804 to 810
let offsets = header[2..2 + num_values as usize + 1].to_vec();

Ok(DataBlock::VariableWidth(VariableWidthBlock {
data: LanceBuffer::Owned(
data[bytes_start_offset
..bytes_start_offset + offsets[num_values as usize] as usize]
.to_vec(),
westonpace (Contributor):

We should be able to avoid both of these to_vec copies by using LanceBuffer::slice_with_length.

broccoliSpicy (Contributor, Author):

fixed.
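A minimal sketch of the zero-copy idea behind slice_with_length, using a hypothetical SharedBuffer type rather than the actual LanceBuffer implementation: narrow a view over a shared allocation instead of copying bytes out with to_vec.

```rust
use std::sync::Arc;

// Hypothetical stand-in for the zero-copy pattern: keep an Arc to the
// backing allocation and narrow the (offset, len) view.
struct SharedBuffer {
    data: Arc<[u8]>,
    offset: usize,
    len: usize,
}

impl SharedBuffer {
    fn slice_with_length(&self, offset: usize, len: usize) -> SharedBuffer {
        assert!(offset + len <= self.len, "slice out of bounds");
        SharedBuffer {
            data: Arc::clone(&self.data), // refcount bump, no byte copy
            offset: self.offset + offset,
            len,
        }
    }

    fn as_slice(&self) -> &[u8] {
        &self.data[self.offset..self.offset + self.len]
    }
}
```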

@@ -289,6 +289,7 @@ impl ValueEncoder {
}
}

// fn compress(&self, data: DataBlock) -> Result<(VariableWidthBlock, pb::ArrayEncoding)>;
westonpace (Contributor):

Suggested change (delete this commented-out line):
// fn compress(&self, data: DataBlock) -> Result<(VariableWidthBlock, pb::ArrayEncoding)>;

broccoliSpicy (Contributor, Author):

fixed.

// These come from the protobuf
dictionary_decompressor: Arc<dyn BlockDecompressor>,
dictionary_buf_position_and_size: (u64, u64),
dictionary_data_alignment: u64,
westonpace (Contributor):

What is this alignment for?

broccoliSpicy (Contributor, Author):

This dictionary_data_alignment is used later when converting the raw bytes::Bytes fetched from disk I/O into a LanceBuffer:

                        LanceBuffer::from_bytes(
                            dictionary_data,
                            dictionary.dictionary_data_alignment,
                        ),
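For intuition, the alignment matters because the fetched bytes may later be reinterpreted as wider integer values; a conversion can only borrow the bytes (rather than copy them into an aligned allocation) when a check like this illustrative one passes:

```rust
// Reinterpreting raw bytes as u32/u64 values is only sound if the backing
// pointer meets the required alignment; otherwise the bytes must be copied
// into a properly aligned allocation first.
fn is_aligned_to(bytes: &[u8], align: u64) -> bool {
    (bytes.as_ptr() as u64) % align == 0
}
```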

@@ -2189,6 +2316,114 @@ impl PrimitiveStructuralEncoder {
})
}

fn dicitionary_encoding(mut data_block: DataBlock, cardinality: u64) -> (DataBlock, DataBlock) {
westonpace (Contributor):

Maybe name this dictionary_encode?

broccoliSpicy (Contributor, Author):

haha, yeah, good point, fixed.

@westonpace (Contributor) left a comment:

Thanks!

@broccoliSpicy broccoliSpicy merged commit e32f393 into lancedb:main Nov 22, 2024
28 of 29 checks passed