
feat: add dictionary encoding #3134

Merged
merged 14 commits into lancedb:main on Nov 22, 2024

Conversation

@broccoliSpicy (Contributor) commented Nov 18, 2024

This PR adds support for dictionary encoding by integrating it with the MiniBlock PageLayout.

The general approach: in a MiniBlock PageLayout, there is an optional dictionary field that stores a dictionary encoding if the miniblock has a dictionary.

/// A layout used for pages where the data is small
///
/// In this case we can fit many values into a single disk sector and transposing buffers is
/// expensive.  As a result, we do not transpose the buffers but compress the data into small
/// chunks (called mini blocks) which are roughly the size of a disk sector.
message MiniBlockLayout {
  // Description of the compression of repetition levels (e.g. how many bits per rep)
  ArrayEncoding rep_compression = 1;
  // Description of the compression of definition levels (e.g. how many bits per def)
  ArrayEncoding def_compression = 2;
  // Description of the compression of values
  ArrayEncoding value_compression = 3;
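  // Description of the compression of the dictionary, if this miniblock is dictionary encoded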
  ArrayEncoding dictionary = 4;
}

The rationale is that if we dictionary-encode something, its indices will definitely fall into a MiniBlockLayout.
By doing this, we don't need a dedicated DictionaryEncoding; it can be any ArrayEncoding.
The dictionary and the indices are each cascaded into another encoding automatically.
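To make the cascade concrete, here is a minimal sketch of the dictionary-encode step; the names and types are illustrative, not the actual Lance API. The input values are split into an indices block and a dictionary block, and each of these is then an ordinary block that any ArrayEncoding can compress further.

```rust
use std::collections::HashMap;

// Minimal sketch (hypothetical signature) of the dictionary-encode step
// described above: map each value to a small index into a dictionary.
fn dictionary_encode(values: &[u64]) -> (Vec<u32>, Vec<u64>) {
    let mut seen: HashMap<u64, u32> = HashMap::new();
    let mut dictionary: Vec<u64> = Vec::new();
    let indices: Vec<u32> = values
        .iter()
        .map(|v| {
            *seen.entry(*v).or_insert_with(|| {
                dictionary.push(*v);
                (dictionary.len() - 1) as u32
            })
        })
        .collect();
    // Both outputs are plain blocks again, so each can be cascaded into any
    // ArrayEncoding (e.g. bit-packing for the low-cardinality indices).
    (indices, dictionary)
}
```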

Currently, the dictionary is stored inside the page along with the chunk metadata and chunk data; this is not ideal and remains a TODO.

This is a draft for discussion of the above idea, so only FixedWidthDataBlock is supported with this encoding; the effort to add support for VariableWidth DataBlock is trivial.

#3123

The github-actions bot added the enhancement (New feature or request) label on Nov 18, 2024.
@broccoliSpicy changed the title from "feat: add dictionary encoding (draft, for discussion only)" to "feat: add dictionary encoding" on Nov 20, 2024.
@codecov-commenter commented Nov 20, 2024

Codecov Report

Attention: Patch coverage is 19.14894% with 266 lines in your changes missing coverage. Please review.

Project coverage is 77.69%. Comparing base (1d3b204) to head (0a6f6c9).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| .../lance-encoding/src/encodings/logical/primitive.rs | 13.10% | 172 Missing and 7 partials ⚠️ |
| ...st/lance-encoding/src/encodings/physical/binary.rs | 0.00% | 59 Missing ⚠️ |
| rust/lance-encoding/src/encoder.rs | 0.00% | 14 Missing ⚠️ |
| rust/lance-core/src/utils/hash.rs | 0.00% | 6 Missing ⚠️ |
| rust/lance-encoding/src/format.rs | 28.57% | 5 Missing ⚠️ |
| rust/lance-encoding/src/decoder.rs | 0.00% | 1 Missing ⚠️ |
| rust/lance-encoding/src/encodings/physical.rs | 0.00% | 1 Missing ⚠️ |
| rust/lance-encoding/src/statistics.rs | 97.14% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3134      +/-   ##
==========================================
- Coverage   77.95%   77.69%   -0.26%     
==========================================
  Files         242      243       +1     
  Lines       81904    82206     +302     
  Branches    81904    82206     +302     
==========================================
+ Hits        63848    63874      +26     
- Misses      14890    15152     +262     
- Partials     3166     3180      +14     
| Flag | Coverage Δ |
|---|---|
| unittests | 77.69% <19.14%> (-0.26%) ⬇️ |

@broccoliSpicy (Contributor, Author) commented Nov 21, 2024

After making the append method in DataBlockBuilderImpl use an immutable borrow:

| Table Name | Parquet Read Time | Lance Read Time | Parquet File Size | Lance File Size |
|---|---|---|---|---|
| customer | 0.45s | 0.42s | 120 MiB | 152 MiB |
| lineitem | 11.55s | 10.46s | 2,076 MiB | 3,374 MiB |
| orders | 2.93s | 2.78s | 559 MiB | 894 MiB |
| part | 0.34s | 0.40s | 61 MiB | 127 MiB |
| partsupp | 3.29s | 1.89s | 401 MiB | 470 MiB |
| Column Name | DataType | Parquet Read Time | Lance Read Time | Parquet File Size | Lance File Size | Cardinality |
|---|---|---|---|---|---|---|
| p_partkey | int32 | 0.05s | 0.02s | 8 MiB | 4 MiB | 2000000 |
| p_name | string | 0.45s | 0.09s | 25 MiB | 30 MiB | 1999828 |
| p_mfgr | string | 0.07s | 0.03s | 0.7 MiB | 0.7 MiB | 5 |
| p_brand | string | 0.06s | 0.03s | 1 MiB | 1 MiB | 25 |
| p_type | string | 0.10s | 0.23s | 1 MiB | 38 MiB | 150 |
| p_size | int32 | 0.02s | 0.02s | 1 MiB | 1 MiB | 50 |
| p_container | string | 0.08s | 0.03s | 1 MiB | 1 MiB | 40 |
| p_retailprice | decimal128(15, 2) | 0.11s | 0.20s | 3 MiB | 30 MiB | 31681 |
| p_comment | string | 0.32s | 0.19s | 16 MiB | 27 MiB | 754704 |

@broccoliSpicy (Contributor, Author) commented Nov 21, 2024

First 100 rows:
l_extendedprice
[[Decimal('33078.94')]
[Decimal('38306.16')]
[Decimal('15479.68')]
[Decimal('34616.68')]
[Decimal('28974.00')]
[Decimal('44842.88')]
[Decimal('63066.32')]
[Decimal('86083.65')]
[Decimal('70822.15')]
[Decimal('39620.34')]
[Decimal('3581.56')]
[Decimal('52411.80')]
[Decimal('35032.14')]
[Decimal('39819.00')]
[Decimal('25179.60')]
[Decimal('31387.20')]
[Decimal('68864.50')]
[Decimal('53697.73')]
[Decimal('17273.04')]
[Decimal('12423.15')]
[Decimal('84904.50')]
[Decimal('46245.92')]
[Decimal('74398.68')]
[Decimal('55806.45')]
[Decimal('7216.50')]
[Decimal('26963.72')]
[Decimal('40995.52')]
[Decimal('3091.16')]
[Decimal('5393.68')]
[Decimal('46642.64')]
[Decimal('6978.84')]
[Decimal('39224.92')]
[Decimal('34948.80')]
[Decimal('8803.10')]
[Decimal('49780.56')]
[Decimal('20768.41')]
[Decimal('24817.98')]
[Decimal('8558.10')]
[Decimal('33708.00')]
[Decimal('44788.54')]
[Decimal('13026.23')]
[Decimal('42317.50')]
[Decimal('42877.74')]
[Decimal('45516.80')]
[Decimal('74029.62')]
[Decimal('48691.20')]
[Decimal('69449.25')]
[Decimal('45538.29')]
[Decimal('63681.20')]
[Decimal('49288.36')]
[Decimal('46194.72')]
[Decimal('58892.42')]
[Decimal('57788.48')]
[Decimal('52982.88')]
[Decimal('68665.20')]
[Decimal('30837.66')]
[Decimal('52933.66')]
[Decimal('26050.42')]
[Decimal('37545.27')]
[Decimal('37916.72')]
[Decimal('78670.80')]
[Decimal('5069.36')]
[Decimal('21910.92')]
[Decimal('10159.55')]
[Decimal('48887.96')]
[Decimal('23784.30')]
[Decimal('33001.13')]
[Decimal('4925.01')]
[Decimal('84764.66')]
[Decimal('84721.88')]
[Decimal('26424.60')]
[Decimal('40541.31')]
[Decimal('46006.50')]
[Decimal('63853.40')]
[Decimal('54433.44')]
[Decimal('55447.68')]
[Decimal('29539.20')]
[Decimal('3279.00')]
[Decimal('72225.30')]
[Decimal('25852.69')]
[Decimal('9761.92')]
[Decimal('20974.98')]
[Decimal('1186.00')]
[Decimal('14182.41')]
[Decimal('50996.73')]
[Decimal('30371.88')]
[Decimal('30631.75')]
[Decimal('3330.36')]
[Decimal('61348.50')]
[Decimal('49876.20')]
[Decimal('57583.11')]
[Decimal('47574.50')]
[Decimal('38862.87')]
[Decimal('58554.90')]
[Decimal('24241.36')]
[Decimal('61777.05')]
[Decimal('39272.24')]
[Decimal('29739.92')]
[Decimal('1424.37')]
[Decimal('14056.42')]]

First 100 rows:
p_retailprice
[[Decimal('901.00')]
[Decimal('902.00')]
[Decimal('903.00')]
[Decimal('904.00')]
[Decimal('905.00')]
[Decimal('906.00')]
[Decimal('907.00')]
[Decimal('908.00')]
[Decimal('909.00')]
[Decimal('910.01')]
[Decimal('911.01')]
[Decimal('912.01')]
[Decimal('913.01')]
[Decimal('914.01')]
[Decimal('915.01')]
[Decimal('916.01')]
[Decimal('917.01')]
[Decimal('918.01')]
[Decimal('919.01')]
[Decimal('920.02')]
[Decimal('921.02')]
[Decimal('922.02')]
[Decimal('923.02')]
[Decimal('924.02')]
[Decimal('925.02')]
[Decimal('926.02')]
[Decimal('927.02')]
[Decimal('928.02')]
[Decimal('929.02')]
[Decimal('930.03')]
[Decimal('931.03')]
[Decimal('932.03')]
[Decimal('933.03')]
[Decimal('934.03')]
[Decimal('935.03')]
[Decimal('936.03')]
[Decimal('937.03')]
[Decimal('938.03')]
[Decimal('939.03')]
[Decimal('940.04')]
[Decimal('941.04')]
[Decimal('942.04')]
[Decimal('943.04')]
[Decimal('944.04')]
[Decimal('945.04')]
[Decimal('946.04')]
[Decimal('947.04')]
[Decimal('948.04')]
[Decimal('949.04')]
[Decimal('950.05')]
[Decimal('951.05')]
[Decimal('952.05')]
[Decimal('953.05')]
[Decimal('954.05')]
[Decimal('955.05')]
[Decimal('956.05')]
[Decimal('957.05')]
[Decimal('958.05')]
[Decimal('959.05')]
[Decimal('960.06')]
[Decimal('961.06')]
[Decimal('962.06')]
[Decimal('963.06')]
[Decimal('964.06')]
[Decimal('965.06')]
[Decimal('966.06')]
[Decimal('967.06')]
[Decimal('968.06')]
[Decimal('969.06')]
[Decimal('970.07')]
[Decimal('971.07')]
[Decimal('972.07')]
[Decimal('973.07')]
[Decimal('974.07')]
[Decimal('975.07')]
[Decimal('976.07')]
[Decimal('977.07')]
[Decimal('978.07')]
[Decimal('979.07')]
[Decimal('980.08')]
[Decimal('981.08')]
[Decimal('982.08')]
[Decimal('983.08')]
[Decimal('984.08')]
[Decimal('985.08')]
[Decimal('986.08')]
[Decimal('987.08')]
[Decimal('988.08')]
[Decimal('989.08')]
[Decimal('990.09')]
[Decimal('991.09')]
[Decimal('992.09')]
[Decimal('993.09')]
[Decimal('994.09')]
[Decimal('995.09')]
[Decimal('996.09')]
[Decimal('997.09')]
[Decimal('998.09')]
[Decimal('999.09')]
[Decimal('1000.10')]]

@westonpace (Contributor) left a comment:

Nicely done. A few initial questions, mostly around alignment, but I like the direction

assert!(block.bits_per_offset == 32);

let offsets = block.offsets.borrow_to_typed_slice::<u32>();
let offsets = offsets.as_ref();
let offsets: &[u32] = cast_slice(&block.offsets);
westonpace (Contributor):

Can we use try_cast_slice? This cast should always be safe, but this isn't in a critical section, so it's probably worth it just to be sure.

broccoliSpicy (Contributor, Author):

thanks! fixed.
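For reference, a minimal sketch of the checked cast with bytemuck; the function name and error message are illustrative:

```rust
use bytemuck::try_cast_slice;

// try_cast_slice verifies alignment and length instead of panicking the way
// cast_slice does; a misaligned or odd-sized buffer surfaces as an Err.
fn offsets_as_u32(raw_offsets: &[u8]) -> &[u32] {
    try_cast_slice(raw_offsets).expect("offsets buffer must be 4-byte aligned and sized")
}
```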

let encoding = ProtobufUtils::binary_block();
Ok((encoder, encoding))
} else {
todo!("Implement BlockCompression for VariableWidth DataBlock with offsets type u32")
westonpace (Contributor):

Suggested change
todo!("Implement BlockCompression for VariableWidth DataBlock with offsets type u32")
todo!("Implement BlockCompression for VariableWidth DataBlock with 64 bit offsets")

broccoliSpicy (Contributor, Author):

thanks! fixed

let offsets = offsets.as_ref();
// the first 4 bytes store the number of values, then 4 bytes for bytes_start_offset,
// then offsets data, then bytes data.
let bytes_start_offset = std::mem::size_of_val(offsets) as u32 + 4 + 4;
westonpace (Contributor):

Since you're using u32 directly above it seems you may as well do std::mem::size_of::<u32>() instead of size_of_val.

broccoliSpicy (Contributor, Author):

I am using size_of_val to get the total length in bytes of the &[u32] here;
std::mem::size_of::<u32>() only gives the size of a single element.
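A quick illustration of the distinction (the example values are made up):

```rust
fn main() {
    let offsets: &[u32] = &[0, 5, 9, 14];
    // size_of_val on the slice returns its total byte length (len * 4 here)...
    assert_eq!(std::mem::size_of_val(offsets), 16);
    // ...while size_of::<u32>() is the size of one element only.
    assert_eq!(std::mem::size_of::<u32>(), 4);
    // An equivalent spelled-out form of the slice's byte length:
    assert_eq!(std::mem::size_of::<u32>() * offsets.len(), 16);
}
```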

Comment on lines 765 to 766
let output_total_bytes =
((bytes_start_offset as usize + variable_width_data.data.len()) + 3) / 4;
westonpace (Contributor):

I'm not sure I understand why the +3 / 4?

broccoliSpicy (Contributor, Author):

yeah, this is a mistake, thanks!
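For context on the expression being questioned: `(n + 3) / 4` is the standard integer idiom for ceiling division by 4, i.e. the number of 4-byte words needed to hold n bytes, so as written the expression computed a word count rather than the byte total its name suggests — which may be the mistake acknowledged here. A small illustration:

```rust
fn main() {
    // (n + 3) / 4 rounds up: the number of 4-byte words needed for n bytes.
    assert_eq!((10 + 3) / 4, 3); // 10 bytes occupy 3 u32 words
    assert_eq!((12 + 3) / 4, 3); // exact multiples are unchanged
    // Padding a byte length up to a 4-byte boundary keeps the unit in bytes:
    assert_eq!((10 + 3) / 4 * 4, 12);
}
```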

output.extend_from_slice(
&BLOCK_PAD_BUFFER[..pad_bytes::<BINARY_BLOCK_ALIGNMENT>(output.len())],
);
Ok(LanceBuffer::reinterpret_vec(output))
westonpace (Contributor):

Since output is already u8 you can just do LanceBuffer::Owned(output).

broccoliSpicy (Contributor, Author):

thanks! fixed.

Comment on lines 777 to 780
// pad this chunk to make it align to 4 bytes.
output.extend_from_slice(
&BLOCK_PAD_BUFFER[..pad_bytes::<BINARY_BLOCK_ALIGNMENT>(output.len())],
);
westonpace (Contributor):

I think padding should be a concern of the code using the block compressor and not a concern of the block compressor itself.

broccoliSpicy (Contributor, Author):

thanks, fixed.
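For reference, the pad_bytes helper used above computes how many bytes are needed to reach the next alignment boundary; a minimal sketch of that arithmetic (an assumption, not necessarily the actual Lance implementation):

```rust
// Bytes required to round `len` up to the next multiple of ALIGN.
// (ALIGN is a power of two in practice, but this form works for any
// non-zero alignment.)
const fn pad_bytes<const ALIGN: usize>(len: usize) -> usize {
    (ALIGN - (len % ALIGN)) % ALIGN
}

fn main() {
    assert_eq!(pad_bytes::<4>(10), 2); // 10 -> 12
    assert_eq!(pad_bytes::<4>(12), 0); // already aligned
}
```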

Comment on lines 804 to 810
let offsets = header[2..2 + num_values as usize + 1].to_vec();

Ok(DataBlock::VariableWidth(VariableWidthBlock {
data: LanceBuffer::Owned(
data[bytes_start_offset
..bytes_start_offset + offsets[num_values as usize] as usize]
.to_vec(),
westonpace (Contributor):

We should be able to avoid both of these to_vec copies by using LanceBuffer::slice_with_length.

broccoliSpicy (Contributor, Author):

fixed.
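A minimal sketch of the zero-copy idea behind slice_with_length, using a hypothetical SharedBuffer type rather than the actual LanceBuffer implementation: narrow a view over a shared allocation instead of copying bytes out with to_vec.

```rust
use std::sync::Arc;

// Hypothetical stand-in for the zero-copy pattern: keep an Arc to the
// backing allocation and narrow the (offset, len) view.
struct SharedBuffer {
    data: Arc<[u8]>,
    offset: usize,
    len: usize,
}

impl SharedBuffer {
    fn slice_with_length(&self, offset: usize, len: usize) -> SharedBuffer {
        assert!(offset + len <= self.len, "slice out of bounds");
        SharedBuffer {
            data: Arc::clone(&self.data), // refcount bump, no byte copy
            offset: self.offset + offset,
            len,
        }
    }

    fn as_slice(&self) -> &[u8] {
        &self.data[self.offset..self.offset + self.len]
    }
}
```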

@@ -289,6 +289,7 @@ impl ValueEncoder {
}
}

// fn compress(&self, data: DataBlock) -> Result<(VariableWidthBlock, pb::ArrayEncoding)>;
westonpace (Contributor):

Suggested change (delete this commented-out line):
// fn compress(&self, data: DataBlock) -> Result<(VariableWidthBlock, pb::ArrayEncoding)>;

broccoliSpicy (Contributor, Author):

fixed.

// These come from the protobuf
dictionary_decompressor: Arc<dyn BlockDecompressor>,
dictionary_buf_position_and_size: (u64, u64),
dictionary_data_alignment: u64,
westonpace (Contributor):

What is this alignment for?

broccoliSpicy (Contributor, Author):

This dictionary_data_alignment is used later when converting the raw bytes::Bytes fetched from disk I/O into a LanceBuffer:

                        LanceBuffer::from_bytes(
                            dictionary_data,
                            dictionary.dictionary_data_alignment,
                        ),
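For intuition, the alignment matters because the fetched bytes may later be reinterpreted as wider integer values; a conversion can only borrow the bytes (rather than copy them into an aligned allocation) when a check like this illustrative one passes:

```rust
// Reinterpreting raw bytes as u32/u64 values is only sound if the backing
// pointer meets the required alignment; otherwise the bytes must be copied
// into a properly aligned allocation first.
fn is_aligned_to(bytes: &[u8], align: u64) -> bool {
    (bytes.as_ptr() as u64) % align == 0
}
```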

@@ -2189,6 +2316,114 @@ impl PrimitiveStructuralEncoder {
})
}

fn dicitionary_encoding(mut data_block: DataBlock, cardinality: u64) -> (DataBlock, DataBlock) {
westonpace (Contributor):

Maybe name this dictionary_encode?

broccoliSpicy (Contributor, Author):

haha, yeah, good point, fixed.

@westonpace (Contributor) left a comment:

Thanks!

@broccoliSpicy broccoliSpicy merged commit e32f393 into lancedb:main Nov 22, 2024
28 of 29 checks passed