HPCC-33601 Document the new lz4s and lz4shc index compression and options

Signed-off-by: Jim DeFabia <jamesdefabia@lexisnexis.com>
hpcc-systems · Mar 11, 2025 · dfa3307
1 parent b04fa2a
commit dfa3307

Showing 2 changed files with 213 additions and 45 deletions.
diff --git a/docs/EN_US/ECLLanguageReference/ECLR_mods/BltInFunc-BUILD.xml b/docs/EN_US/ECLLanguageReference/ECLR_mods/BltInFunc-BUILD.xml
@@ -39,9 +39,9 @@
 
   <para><informaltable colsep="1" frame="all" rowsep="1">
       <tgroup cols="2">
-        <colspec colwidth="78.50pt" />
+        <colspec colwidth="78.50pt"/>
 
-        <colspec />
+        <colspec/>
 
         <tbody>
           <row>
@@ -241,9 +241,9 @@
 
     <para><informaltable colsep="1" frame="all" rowsep="1">
         <tgroup cols="2">
-          <colspec colwidth="125pt" />
+          <colspec colwidth="125pt"/>
 
-          <colspec />
+          <colspec/>
 
           <tbody>
             <row>
@@ -256,8 +256,8 @@
               written to disk is always determined by the number of nodes in
               the cluster on which the workunit executes, regardless of the
               number of nodes on the target cluster(s) unless the WIDTH option
-              is also specified. Use this option for bare-metal deployments.
-              </entry>
+              is also specified. Use this option for bare-metal
+              deployments.</entry>
             </row>
 
             <row>
@@ -292,7 +292,7 @@
               names of the plane(s) to write the
               <emphasis>indexfile</emphasis> to. The
               <emphasis>targetPlane</emphasis> names must be listed as they
-              are defined in the deployment. </entry>
+              are defined in the deployment.</entry>
             </row>
 
             <row>
@@ -441,8 +441,8 @@
 
               <entry><para>Optional. Specifies the index should be compressed
               using the type of compression specified. If omitted, the default
-              is <emphasis role="bold">LZW</emphasis>, a variant of the
-              Lempel-Ziv-Welch algorithm. </para></entry>
+              is <emphasis role="bold">'inplace:lz4shc'</emphasis>.
+              </para></entry>
             </row>
 
             <row>
@@ -856,17 +856,17 @@ BUILD(FilterDsLib1);
 
     <informaltable colsep="1" frame="all" rowsep="1">
       <tgroup cols="2">
-        <colspec align="left" colwidth="122.40pt" />
+        <colspec align="left" colwidth="188*"/>
 
-        <colspec />
+        <colspec colwidth="812*"/>
 
         <tbody>
           <row>
             <entry><emphasis role="bold">LZW</emphasis></entry>
 
-            <entry>The default compression. It is a variant of the
-            Lempel-Ziv-Welch algorithm. It remains the default for backward
-            compatibility.</entry>
+            <entry>A variant of the Lempel-Ziv-Welch algorithm. This was the
+            default compression prior to versions 9.6.90, 9.8.66, and
+            9.10.12.</entry>
           </row>
 
           <row>
@@ -894,6 +894,28 @@ BUILD(FilterDsLib1);
             compression on the payload. The resulting index can be smaller
             than using lz4.</entry>
           </row>
+
+          <row>
+            <entry><emphasis role="bold"><emphasis
+            role="bold">'inplace:lz4s'</emphasis> </emphasis></entry>
+
+            <entry>Causes inplace compression on the key fields and lz4s
+            compression on the payload. This uses the streaming API to build
+            up a compressed data stream and avoids recompressing it, which
+            reduces build times.</entry>
+          </row>
+
+          <row>
+            <entry><emphasis role="bold"><emphasis
+            role="bold">'inplace:lz4shc'</emphasis> </emphasis></entry>
+
+            <entry>The default compression in versions after 9.6.90, 9.8.66,
+            and 9.10.12. Causes inplace compression on the key fields and
+            lz4shc compression on the payload. This uses the streaming API to
+            build up a compressed data stream and avoids recompressing it,
+            which reduces build times. The resulting index can be smaller and
+            should build faster than using lz4.</entry>
+          </row>
         </tbody>
       </tgroup>
     </informaltable>
@@ -903,25 +925,86 @@ BUILD(FilterDsLib1);
     without decompression. The original index compression implementation
     decompresses the rows when they are read from disk.</para>
 
+    <para>The inplace:lz4s and inplace:lz4shc index compression formats
+    (introduced in versions 9.6.90, 9.8.66, and 9.10.12) improve compression
+    and reduce build time. These formats require an engine that supports
+    them. In other words, <emphasis role="bold">if you build an index using
+    the lz4s or lz4shc formats, you must use a platform later than 9.6.90,
+    9.8.66, and 9.10.12 to read those indexes.</emphasis></para>
+
+    <para>If you attempt to read an index with the inplace compression format
+    on a system that does not support it, you will receive an error
+    message.</para>
+
     <para>Because the branch nodes can be searched without decompression more
     branch nodes fit into memory which can improve search performance. The lz4
     compression used for the payload is significantly faster at decompressing
-    leaf pages than the previous LZW compression.</para>
+    leaf pages than the previous LZW compression. Whether performance is
+    better with lz4hc (a high-compression variant of lz4) on the payload
+    fields depends on the access characteristics of the data and how much of
+    the index is cached in memory.</para>
 
-    <para>Whether performance is better with lz4hc (a high-compression variant
-    of lz4) on the payload fields depends on the access characteristics of the
-    data and how much of the index is cached in memory.</para>
+    <para><emphasis role="bold">Compression Levels:</emphasis></para>
 
-    <para>If you attempt to read an index with the inplace compression format
-    on a system that does not support them, you will receive an error
-    message.</para>
+    <informaltable colsep="1" frame="all" rowsep="1">
+      <tgroup cols="2">
+        <colspec align="left" colwidth="240*"/>
+
+        <colspec colwidth="836*"/>
+
+        <tbody>
+          <row>
+            <entry><emphasis role="bold">hclevel</emphasis></entry>
+
+            <entry>An integer between 0 and 9 to specify the level of
+            compression. The default is 3. Higher levels increase compression
+            time, but the improved compression may make them
+            cost-effective.</entry>
+          </row>
+
+          <row>
+            <entry><emphasis role="bold">maxcompression</emphasis></entry>
+
+            <entry>The maximum desired compression ratio. This avoids the leaf
+            nodes getting too large when expanded, but increases the size of
+            some indexes. The default is 20.</entry>
+          </row>
+
+          <row>
+            <entry><emphasis role="bold">maxrecompress</emphasis></entry>
+
+            <entry>Specifies the number of times the entire input dataset
+            should be compressed to free up space. Increasing the number
+            decreases the size of the indexes, and will probably decrease the
+            decompress time slightly (because there are fewer stream blocks),
+            but will increase the build time. The default is 1.</entry>
+          </row>
+        </tbody>
+      </tgroup>
+    </informaltable>
 
-    <para>See Also: <link linkend="INDEX_record_structure">INDEX</link>, <link
-    linkend="JOIN">JOIN</link>, <link linkend="FETCH">FETCH</link>, <link
-    linkend="MODULE_Structure">MODULE</link>, <link
-    linkend="INTERFACE_Structure">INTERFACE</link>, <link
-    linkend="LIBRARY">LIBRARY</link>, <link
-    linkend="DISTRIBUTE">DISTRIBUTE</link>, <link
-    linkend="_WORKUNIT">#WORKUNIT</link></para>
+    <para/>
+
+    <para>Example:</para>
+
+    <programlisting>Vehicles := DATASET('vehicles',
+          {STRING2 st,STRING20 city,STRING20 lname},FLAT);
+
+SearchTerms := RECORD
+  Vehicles.st;
+  Vehicles.city;
+END; 
+Payload     := RECORD
+  Vehicles.lname;
+END; 
+VehicleKey := INDEX(Vehicles,SearchTerms,Payload,'vkey::st.city',
+  COMPRESSED(
+    'inplace:lz4shc,compressopt(hclevel=9,maxcompression=25,maxrecompress=4)'
+  ));
+BUILD(VehicleKey);</programlisting>
+
+    <para>See Also: <link linkend="DATASET">DATASET</link>, <link
+    linkend="BUILD">BUILDINDEX</link>, <link linkend="JOIN">JOIN</link>, <link
+    linkend="FETCH">FETCH</link>, <link
+    linkend="KEYED-WILD">KEYED/WILD</link></para>
   </sect2>
 </sect1>
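
A minimal sketch of how the compression types documented in this file might be chosen side by side. It reuses the Vehicles definitions from the example in the diff; the attribute names FastKey and LegacyKey and the logical file names 'vkey::st.city.lz4s' and 'vkey::st.city.lzw' are hypothetical and not part of the commit. It assumes COMPRESSED accepts the bare LZW keyword and the quoted 'inplace:lz4s' string exactly as they are listed in the options table.

```ecl
// Sketch only: FastKey, LegacyKey, and both index file names are hypothetical.
Vehicles := DATASET('vehicles',
      {STRING2 st,STRING20 city,STRING20 lname},FLAT);

SearchTerms := RECORD
  Vehicles.st;
  Vehicles.city;
END;
Payload := RECORD
  Vehicles.lname;
END;

// Streaming lz4 on the payload, with every compression option left at its default.
FastKey := INDEX(Vehicles,SearchTerms,Payload,'vkey::st.city.lz4s',
                 COMPRESSED('inplace:lz4s'));

// Explicit LZW, for an index that an older platform must still be able to read.
LegacyKey := INDEX(Vehicles,SearchTerms,Payload,'vkey::st.city.lzw',
                   COMPRESSED(LZW));

BUILD(FastKey);
BUILD(LegacyKey);
```

Naming the compression type explicitly, rather than relying on the default, keeps the on-disk format of the index the same if the platform default changes between versions.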
diff --git a/docs/EN_US/ECLLanguageReference/ECLR_mods/Recrd-Index.xml b/docs/EN_US/ECLLanguageReference/ECLR_mods/Recrd-Index.xml
@@ -49,9 +49,9 @@
 
   <informaltable colsep="1" frame="all" rowsep="1">
     <tgroup cols="2">
-      <colspec align="left" colwidth="122.40pt" />
+      <colspec align="left" colwidth="122.40pt"/>
 
-      <colspec />
+      <colspec/>
 
       <tbody>
         <row>
@@ -139,8 +139,7 @@
 
           <entry><para>Optional. Specifies the index should be compressed
           using the type of compression specified. If omitted, the default is
-          <emphasis role="bold">LZW</emphasis>, a variant of the
-          Lempel-Ziv-Welch algorithm. </para></entry>
+          <emphasis role="bold">'inplace:lz4shc'</emphasis>. </para></entry>
         </row>
 
         <row>
@@ -266,7 +265,7 @@
 
         <para>All STRINGs must be fixed length.</para>
 
-        <para></para>
+        <para/>
       </listitem>
     </itemizedlist></para>
 
@@ -365,17 +364,17 @@ BUILD(VehicleKey3);
 
     <informaltable colsep="1" frame="all" rowsep="1">
       <tgroup cols="2">
-        <colspec align="left" colwidth="122.40pt" />
+        <colspec align="left" colwidth="188*"/>
 
-        <colspec />
+        <colspec colwidth="836*"/>
 
         <tbody>
           <row>
             <entry><emphasis role="bold">LZW</emphasis></entry>
 
-            <entry>The default compression. It is a variant of the
-            Lempel-Ziv-Welch algorithm. It remains the default for backward
-            compatibility.</entry>
+            <entry>A variant of the Lempel-Ziv-Welch algorithm. This was the
+            default compression prior to versions 9.6.90, 9.8.66, and
+            9.10.12.</entry>
           </row>
 
           <row>
@@ -403,6 +402,28 @@ BUILD(VehicleKey3);
             compression on the payload. The resulting index can be smaller
             than using lz4.</entry>
           </row>
+
+          <row>
+            <entry><emphasis role="bold"><emphasis
+            role="bold">'inplace:lz4s'</emphasis> </emphasis></entry>
+
+            <entry>Causes inplace compression on the key fields and lz4s
+            compression on the payload. This uses the streaming API to build
+            up a compressed data stream and avoids recompressing it, which
+            reduces build times.</entry>
+          </row>
+
+          <row>
+            <entry><emphasis role="bold"><emphasis
+            role="bold">'inplace:lz4shc'</emphasis> </emphasis></entry>
+
+            <entry>The default compression in versions after 9.6.90, 9.8.66,
+            and 9.10.12. Causes inplace compression on the key fields and
+            lz4shc compression on the payload. This uses the streaming API to
+            build up a compressed data stream and avoids recompressing it,
+            which reduces build times. The resulting index can be smaller and
+            should build faster than using lz4.</entry>
+          </row>
         </tbody>
       </tgroup>
     </informaltable>
@@ -412,18 +433,82 @@ BUILD(VehicleKey3);
     without decompression. The original index compression implementation
     decompresses the rows when they are read from disk.</para>
 
+    <para>The inplace:lz4s and inplace:lz4shc index compression formats
+    (introduced in versions 9.6.90, 9.8.66, and 9.10.12) improve compression
+    and reduce build time. These formats require an engine that supports
+    them. In other words, <emphasis role="bold">if you build an index using
+    the lz4s or lz4shc formats, you must use a platform later than 9.6.90,
+    9.8.66, and 9.10.12 to read those indexes.</emphasis></para>
+
+    <para>If you attempt to read an index with the inplace compression format
+    on a system that does not support it, you will receive an error
+    message.</para>
+
     <para>Because the branch nodes can be searched without decompression more
     branch nodes fit into memory which can improve search performance. The lz4
     compression used for the payload is significantly faster at decompressing
-    leaf pages than the previous LZW compression.</para>
+    leaf pages than the previous LZW compression. Whether performance is
+    better with lz4hc (a high-compression variant of lz4) on the payload
+    fields depends on the access characteristics of the data and how much of
+    the index is cached in memory.</para>
 
-    <para>Whether performance is better with lz4hc (a high-compression variant
-    of lz4) on the payload fields depends on the access characteristics of the
-    data and how much of the index is cached in memory.</para>
+    <para><emphasis role="bold">Compression Levels:</emphasis></para>
 
-    <para>If you attempt to read an index with the inplace compression format
-    on a system that does not support them, you will receive an error
-    message.</para>
+    <informaltable colsep="1" frame="all" rowsep="1">
+      <tgroup cols="2">
+        <colspec align="left" colwidth="240*"/>
+
+        <colspec colwidth="733*"/>
+
+        <tbody>
+          <row>
+            <entry><emphasis role="bold">hclevel</emphasis></entry>
+
+            <entry>An integer between 0 and 9 to specify the level of
+            compression. The default is 3. Higher levels increase compression
+            time, but the improved compression may make them
+            cost-effective.</entry>
+          </row>
+
+          <row>
+            <entry><emphasis role="bold">maxcompression</emphasis></entry>
+
+            <entry>The maximum desired compression ratio. This avoids the leaf
+            nodes getting too large when expanded, but increases the size of
+            some indexes. The default is 20.</entry>
+          </row>
+
+          <row>
+            <entry><emphasis role="bold">maxrecompress</emphasis></entry>
+
+            <entry>Specifies the number of times the entire input dataset
+            should be compressed to free up space. Increasing the number
+            decreases the size of the indexes, and will probably decrease the
+            decompress time slightly (because there are fewer stream blocks),
+            but will increase the build time. The default is 1.</entry>
+          </row>
+        </tbody>
+      </tgroup>
+    </informaltable>
+
+    <para/>
+
+    <para>Example:</para>
+
+    <programlisting>Vehicles := DATASET('vehicles',
+          {STRING2 st,STRING20 city,STRING20 lname},FLAT);
+
+SearchTerms := RECORD
+  Vehicles.st;
+  Vehicles.city;
+END; 
+Payload     := RECORD
+  Vehicles.lname;
+END; 
+VehicleKey := INDEX(Vehicles,SearchTerms,Payload,'vkey::st.city',
+  COMPRESSED(
+    'inplace:lz4shc,compressopt(hclevel=9,maxcompression=25,maxrecompress=4)'
+  ));
+BUILD(VehicleKey);</programlisting>
 
     <para>See Also: <link linkend="DATASET">DATASET</link>, <link
     linkend="BUILD">BUILDINDEX</link>, <link linkend="JOIN">JOIN</link>, <link