
Commit

HPCC-33601 Document the new lz4s and lz4shc index compression and options

Signed-off-by: Jim DeFabia <jamesdefabia@lexisnexis.com>
Jim DeFabia committed Mar 11, 2025
1 parent b04fa2a commit dfa3307
Showing 2 changed files with 213 additions and 45 deletions.
139 changes: 111 additions & 28 deletions docs/EN_US/ECLLanguageReference/ECLR_mods/BltInFunc-BUILD.xml
@@ -39,9 +39,9 @@

<para><informaltable colsep="1" frame="all" rowsep="1">
<tgroup cols="2">
<colspec colwidth="78.50pt" />
<colspec colwidth="78.50pt"/>

<colspec />
<colspec/>

<tbody>
<row>
@@ -241,9 +241,9 @@

<para><informaltable colsep="1" frame="all" rowsep="1">
<tgroup cols="2">
<colspec colwidth="125pt" />
<colspec colwidth="125pt"/>

<colspec />
<colspec/>

<tbody>
<row>
@@ -256,8 +256,8 @@
written to disk is always determined by the number of nodes in
the cluster on which the workunit executes, regardless of the
number of nodes on the target cluster(s) unless the WIDTH option
is also specified. Use this option for bare-metal deployments.
</entry>
is also specified. Use this option for bare-metal
deployments.</entry>
</row>

<row>
@@ -292,7 +292,7 @@
names of the plane(s) to write the
<emphasis>indexfile</emphasis> to. The
<emphasis>targetPlane</emphasis> names must be listed as they
are defined in the deployment. </entry>
are defined in the deployment.</entry>
</row>

<row>
@@ -441,8 +441,8 @@

<entry><para>Optional. Specifies the index should be compressed
using the type of compression specified. If omitted, the default
is <emphasis role="bold">LZW</emphasis>, a variant of the
Lempel-Ziv-Welch algorithm. </para></entry>
is <emphasis role="bold">'inplace:lz4shc'</emphasis>.
</para></entry>
</row>

<row>
@@ -856,17 +856,17 @@ BUILD(FilterDsLib1);

<informaltable colsep="1" frame="all" rowsep="1">
<tgroup cols="2">
<colspec align="left" colwidth="122.40pt" />
<colspec align="left" colwidth="188*"/>

<colspec />
<colspec colwidth="812*"/>

<tbody>
<row>
<entry><emphasis role="bold">LZW</emphasis></entry>

<entry>The default compression. It is a variant of the
Lempel-Ziv-Welch algorithm. It remains the default for backward
compatibility.</entry>
<entry>A variant of the Lempel-Ziv-Welch algorithm. This was the
default compression prior to versions 9.6.90, 9.8.66, and
9.10.12.</entry>
</row>

<row>
@@ -894,6 +894,28 @@ BUILD(FilterDsLib1);
compression on the payload. The resulting index can be smaller
than using lz4.</entry>
</row>

<row>
<entry><emphasis role="bold"><emphasis
role="bold">'inplace:lz4s'</emphasis> </emphasis></entry>

<entry>Causes inplace compression on the key fields and lz4s
compression on the payload. This uses the streaming API to build
up a compressed data stream and avoids recompressing it, resulting
in reduced build times.</entry>
</row>

<row>
<entry><emphasis role="bold"><emphasis
role="bold">'inplace:lz4shc'</emphasis> </emphasis></entry>

<entry>The default compression in versions after 9.6.90,
9.8.66, and 9.10.12. Causes inplace compression on the key fields
and lz4shc compression on the payload. This uses the streaming API
to build up a compressed data stream and avoids recompressing it,
resulting in reduced build times. The resulting index can be
smaller and should build faster than using lz4.</entry>
</row>
</tbody>
</tgroup>
</informaltable>
@@ -903,25 +925,86 @@ BUILD(FilterDsLib1);
without decompression. The original index compression implementation
decompresses the rows when they are read from disk.</para>

<para>The inplace index compression format (introduced in versions 9.6.90,
9.8.66, and 9.10.12 9.2.0 or later) improves compression and reduces build
time. These formats require an engine that supports them. In other words,
<emphasis role="bold">if you build an index using the lz4s or lz4shc
formats, you must use a platform later than 9.6.90, 9.8.66, and 9.10.12 to
read those indexes. </emphasis></para>

<para>If you attempt to read an index with the inplace compression format
on a system that does not support it, you will receive an error
message.</para>

<para>Because the branch nodes can be searched without decompression more
branch nodes fit into memory which can improve search performance. The lz4
compression used for the payload is significantly faster at decompressing
leaf pages than the previous LZW compression.</para>
leaf pages than the previous LZW compression. Whether performance is
better with lz4hc (a high-compression variant of lz4) on the payload
fields depends on the access characteristics of the data and how much of
the index is cached in memory.</para>

<para>Whether performance is better with lz4hc (a high-compression variant
of lz4) on the payload fields depends on the access characteristics of the
data and how much of the index is cached in memory.</para>
<para><emphasis role="bold">Compression Levels:</emphasis></para>

<para>If you attempt to read an index with the inplace compression format
on a system that does not support them, you will receive an error
message.</para>
<informaltable colsep="1" frame="all" rowsep="1">
<tgroup cols="2">
<colspec align="left" colwidth="240*"/>

<colspec colwidth="836*"/>

<tbody>
<row>
<entry><emphasis role="bold">hclevel</emphasis></entry>

<entry>An integer between 0 and 9 to specify the level of
compression. The default is 3. Higher levels increase compression
times, but may be cost-effective.</entry>
</row>

<row>
<entry><emphasis role="bold">maxcompression</emphasis></entry>

<entry>The maximum desired compression ratio. This avoids the leaf
nodes getting too large when expanded, but increases the size of
some indexes. The default is 20.</entry>
</row>

<row>
<entry><emphasis role="bold">maxrecompress</emphasis></entry>

<entry>Specifies the number of times the entire input dataset
should be compressed to free up space. Increasing the number
decreases the size of the indexes, and will probably decrease the
decompress time slightly (because there are fewer stream blocks),
but will increase the build time. The default is 1.</entry>
</row>
</tbody>
</tgroup>
</informaltable>

<para>See Also: <link linkend="INDEX_record_structure">INDEX</link>, <link
linkend="JOIN">JOIN</link>, <link linkend="FETCH">FETCH</link>, <link
linkend="MODULE_Structure">MODULE</link>, <link
linkend="INTERFACE_Structure">INTERFACE</link>, <link
linkend="LIBRARY">LIBRARY</link>, <link
linkend="DISTRIBUTE">DISTRIBUTE</link>, <link
linkend="_WORKUNIT">#WORKUNIT</link></para>
<para/>

<para>Example:</para>

<programlisting>Vehicles := DATASET('vehicles',
{STRING2 st,STRING20 city,STRING20 lname},FLAT);

SearchTerms := RECORD
Vehicles.st;
Vehicles.city;
END;
Payload := RECORD
Vehicles.lname;
END;
VehicleKey := INDEX(Vehicles,SearchTerms,Payload,'vkey::st.city',
COMPRESSED('inplace:lz4shc,compressopt(hclevel=9,maxcompression=25,maxrecompress=4)'));
BUILD(VehicleKey);</programlisting>
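
<para>As a minimal sketch of the simpler form, the same index could instead be
declared with the lz4s format and no compressopt clause, so that hclevel,
maxcompression, and maxrecompress keep the defaults listed in the table above.
The logical file name and the VehicleKeyLz4s definition name are hypothetical,
and Vehicles, SearchTerms, and Payload are assumed to be defined as in the
previous example:</para>

<programlisting>// Sketch only: assumes Vehicles, SearchTerms, and Payload from the previous example.
// 'vkey::st.city.lz4s' is a placeholder logical file name chosen for illustration.
VehicleKeyLz4s := INDEX(Vehicles,SearchTerms,Payload,'vkey::st.city.lz4s',
                        COMPRESSED('inplace:lz4s'));
BUILD(VehicleKeyLz4s);</programlisting>

<para>An index built this way can only be read on a platform that supports
the lz4s format.</para>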

<para>See Also: <link linkend="DATASET">DATASET</link>, <link
linkend="BUILD">BUILDINDEX</link>, <link linkend="JOIN">JOIN</link>, <link
linkend="FETCH">FETCH</link>, <link
linkend="KEYED-WILD">KEYED/WILD</link></para>
</sect2>
</sect1>
119 changes: 102 additions & 17 deletions docs/EN_US/ECLLanguageReference/ECLR_mods/Recrd-Index.xml
@@ -49,9 +49,9 @@

<informaltable colsep="1" frame="all" rowsep="1">
<tgroup cols="2">
<colspec align="left" colwidth="122.40pt" />
<colspec align="left" colwidth="122.40pt"/>

<colspec />
<colspec/>

<tbody>
<row>
@@ -139,8 +139,7 @@

<entry><para>Optional. Specifies the index should be compressed
using the type of compression specified. If omitted, the default is
<emphasis role="bold">LZW</emphasis>, a variant of the
Lempel-Ziv-Welch algorithm. </para></entry>
<emphasis role="bold">'inplace:lz4shc'</emphasis>. </para></entry>
</row>

<row>
@@ -266,7 +265,7 @@

<para>All STRINGs must be fixed length.</para>

<para></para>
<para/>
</listitem>
</itemizedlist></para>

@@ -365,17 +364,17 @@ BUILD(VehicleKey3);

<informaltable colsep="1" frame="all" rowsep="1">
<tgroup cols="2">
<colspec align="left" colwidth="122.40pt" />
<colspec align="left" colwidth="188*"/>

<colspec />
<colspec colwidth="836*"/>

<tbody>
<row>
<entry><emphasis role="bold">LZW</emphasis></entry>

<entry>The default compression. It is a variant of the
Lempel-Ziv-Welch algorithm. It remains the default for backward
compatibility.</entry>
<entry>A variant of the Lempel-Ziv-Welch algorithm. This was the
default compression prior to versions 9.6.90, 9.8.66, and
9.10.12.</entry>
</row>

<row>
@@ -403,6 +402,28 @@ BUILD(VehicleKey3);
compression on the payload. The resulting index can be smaller
than using lz4.</entry>
</row>

<row>
<entry><emphasis role="bold"><emphasis
role="bold">'inplace:lz4s'</emphasis> </emphasis></entry>

<entry>Causes inplace compression on the key fields and lz4s
compression on the payload. This uses the streaming API to build
up a compressed data stream and avoids recompressing it, resulting
in reduced build times.</entry>
</row>

<row>
<entry><emphasis role="bold"><emphasis
role="bold">'inplace:lz4shc'</emphasis> </emphasis></entry>

<entry>The default compression in versions after 9.6.90,
9.8.66, and 9.10.12. Causes inplace compression on the key fields
and lz4shc compression on the payload. This uses the streaming API
to build up a compressed data stream and avoids recompressing it,
resulting in reduced build times. The resulting index can be
smaller and should build faster than using lz4.</entry>
</row>
</tbody>
</tgroup>
</informaltable>
@@ -412,18 +433,82 @@ BUILD(VehicleKey3);
without decompression. The original index compression implementation
decompresses the rows when they are read from disk.</para>

<para>The inplace index compression format (introduced in versions 9.6.90,
9.8.66, and 9.10.12 9.2.0 or later) improves compression and reduces build
time. These formats require an engine that supports them. In other words,
<emphasis role="bold">if you build an index using the lz4s or lz4shc
formats, you must use a platform later than 9.6.90, 9.8.66, and 9.10.12 to
read those indexes. </emphasis></para>

<para>If you attempt to read an index with the inplace compression format
on a system that does not support it, you will receive an error
message.</para>

<para>Because the branch nodes can be searched without decompression more
branch nodes fit into memory which can improve search performance. The lz4
compression used for the payload is significantly faster at decompressing
leaf pages than the previous LZW compression.</para>
leaf pages than the previous LZW compression. Whether performance is
better with lz4hc (a high-compression variant of lz4) on the payload
fields depends on the access characteristics of the data and how much of
the index is cached in memory.</para>

<para>Whether performance is better with lz4hc (a high-compression variant
of lz4) on the payload fields depends on the access characteristics of the
data and how much of the index is cached in memory.</para>
<para><emphasis role="bold">Compression Levels:</emphasis></para>

<para>If you attempt to read an index with the inplace compression format
on a system that does not support them, you will receive an error
message.</para>
<informaltable colsep="1" frame="all" rowsep="1">
<tgroup cols="2">
<colspec align="left" colwidth="240*"/>

<colspec colwidth="733*"/>

<tbody>
<row>
<entry><emphasis role="bold">hclevel</emphasis></entry>

<entry>An integer between 0 and 9 to specify the level of
compression. The default is 3. Higher levels increase compression
times, but may be cost-effective.</entry>
</row>

<row>
<entry><emphasis role="bold">maxcompression</emphasis></entry>

<entry>The maximum desired compression ratio. This avoids the leaf
nodes getting too large when expanded, but increases the size of
some indexes. The default is 20.</entry>
</row>

<row>
<entry><emphasis role="bold">maxrecompress</emphasis></entry>

<entry>Specifies the number of times the entire input dataset
should be compressed to free up space. Increasing the number
decreases the size of the indexes, and will probably decrease the
decompress time slightly (because there are fewer stream blocks),
but will increase the build time. The default is 1.</entry>
</row>
</tbody>
</tgroup>
</informaltable>

<para/>

<para>Example:</para>

<programlisting>Vehicles := DATASET('vehicles',
{STRING2 st,STRING20 city,STRING20 lname},FLAT);

SearchTerms := RECORD
Vehicles.st;
Vehicles.city;
END;
Payload := RECORD
Vehicles.lname;
END;
VehicleKey := INDEX(Vehicles,SearchTerms,Payload,'vkey::st.city',
COMPRESSED('inplace:lz4shc,compressopt(hclevel=9,maxcompression=25,maxrecompress=4)'));
BUILD(VehicleKey);</programlisting>
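
<para>The options can presumably also be set individually. The following is a
sketch under that assumption: it raises only hclevel and leaves maxcompression
and maxrecompress at their documented defaults of 20 and 1. That compressopt
accepts a single option is an assumption, and the VehicleKeyHc5 definition name
and the logical file name are hypothetical:</para>

<programlisting>// Sketch only: assumes Vehicles, SearchTerms, and Payload from the previous example,
// and assumes compressopt accepts a subset of its options.
// 'vkey::st.city.hc5' is a placeholder logical file name chosen for illustration.
VehicleKeyHc5 := INDEX(Vehicles,SearchTerms,Payload,'vkey::st.city.hc5',
                       COMPRESSED('inplace:lz4shc,compressopt(hclevel=5)'));
BUILD(VehicleKeyHc5);</programlisting>

<para>Comparing the build time and resulting file size of this index against
one built with the default hclevel of 3 is one way to judge whether the higher
level is worthwhile for a given dataset.</para>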

<para>See Also: <link linkend="DATASET">DATASET</link>, <link
linkend="BUILD">BUILDINDEX</link>, <link linkend="JOIN">JOIN</link>, <link
