update bench results

apache · Jan 27, 2023 · 84dcd52 · 84dcd52
1 parent e315b7d
commit 84dcd52
Showing 1 changed file with 57 additions and 62 deletions.
diff --git a/lucene/sandbox/src/java/org/apache/lucene/sandbox/queries/TiSBenchResults.md b/lucene/sandbox/src/java/org/apache/lucene/sandbox/queries/TiSBenchResults.md
@@ -13,14 +13,20 @@
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
-  */
+*/
+
+## Goals
+1. Gather some data on when postings vs. doc-values are efficient in evaluating a term-in-set clause.
+2. Determine if a custom, self-optimizing, term-in-set query provides any benefits over the existing
+   `IndexOrDocValuesQuery` on top of the existing `TermInSetQuery` + `DocValuesTermsQuery`.
+
 ## Setup
 Benchmarks run against an index using `allCountries.txt` geonames data. Index created with:
 * "name" field that indexed the parsed name of each record
 * "cc" field that indexed the country code ID for each record (single valued)
 * "id" field that indexed the unique ID for each record (single valued)
 
-Each benchmark query "lead" with a single term matching against the "name" field and included a required conjunction
+Each benchmark query "leads" with a single term matching against the "name" field and included a required conjunction
 over a disjunction of country codes (e.g., a "term-in-set" query that required one of the specified country codes
 for each hit to be considered a match). Details on the distributions of these terms for each task are below, followed by
 the results.
@@ -30,7 +36,7 @@ The benchmark results compare:
 2. Our current `DocValuesTermsQuery` implementation (always use doc values for the country code clause)
 3. Our current `IndexOrDocValuesQuery` wrapping #1 and #2
 4. A proposed new sandbox `TermInSetQuery` that acts a bit like `IndexOrDocValuesQuery` but goes beyond just using
-field-level stats when deciding which approach to take, starts to peek at individual term stats.
+   field-level stats when deciding which approach to take, starts to peek at individual term stats.
 
 ### Benchmark Term Distributions
 #### "Large" lead term counts:
@@ -59,7 +65,6 @@ field-level stats when deciding which approach to take, starts to peek at indivi
 #### Country Code filter term counts:
 #### High cost terms
 NOTE: First 10 used for "low cardinality" tasks, all 20 used for "high cardinality"
-
 * 2240232 US
 * 874153 CN
 * 649062 IN
@@ -83,7 +88,6 @@ NOTE: First 10 used for "low cardinality" tasks, all 20 used for "high cardinali
 
 #### Low cost terms
 NOTE: First 10 used for "low cardinality" tasks, all 20 used for "high cardinality"
-
 *  1 YU
 *  2 AN
 *  2 CS
@@ -110,7 +114,10 @@ In general, the proposed `TermInSetQuery` tends to perform at least as well as `
 some specific cases. These cases are where `IndexOrDocValuesQuery` incorrectly assumes a doc values approach will be
 better after looking at field-level stats, and the proposed `TermInSetQuery` makes a more informed decision to use
 doc values after loading some term-level stats. Primary-key cases also seem to perform significantly better with
-the proposed `TermInSetQuery`.
+the proposed `TermInSetQuery`, as the existing `IndexOrDocValuesQuery` query sometimes incorrectly decides to use a
+doc-values approach when postings would be better (they almost always are better for primary keys). This is because
+the `#cost()` estimation of `TermInSetQuery`, while specialized for primary-key cases, sort of assumes that all terms
+will occur in the segment, which is probably quite wrong for primary key cases.
 
 ### All Country Code Filter Terms
 * Term-in-set cardinality: 254 terms
@@ -122,10 +129,10 @@ deciding to use DV.
 
 | Approach     | Large Lead Terms | Medium Lead Terms | Small Lead Terms |
 |--------------|------------------|-------------------|------------------|
-| Current TiS  | 115.65           | 172.26            | 193.97           |
-| DV           | 4.88             | 6.06              | 5.71             |
-| IndexOrDV    | 4.96             | 6.19              | 5.88             |
-| Proposed TiS | 5.64             | 6.73              | 5.89             |
+| Current TiS  | 122.01           | 174.29            | 194.85           |
+| DV           | 12.01            | 7.83              | 5.68             |
+| IndexOrDV    | 12.11            | 7.97              | 5.85             |
+| Proposed TiS | 16.01            | 9.87              | 5.85             |
 
 
 ### Medium Cardinality + High Cost Country Code Filter Terms
@@ -137,10 +144,10 @@ the proposed TiS implementation is slightly slower due to the term seeking.
 
 | Approach     | Large Lead Terms | Medium Lead Terms | Small Lead Terms |
 |--------------|------------------|-------------------|------------------|
-| Current TiS  | 62.32            | 92.33             | 107.02           |
-| DV           | 3.29             | 2.42              | 1.90             |
-| IndexOrDV    | 3.31             | 2.45              | 1.94             |
-| Proposed TiS | 3.94             | 3.09              | 2.47             |
+| Current TiS  | 67.84            | 94.50             | 107.32           |
+| DV           | 9.59             | 4.05              | 1.72             |
+| IndexOrDV    | 9.61             | 4.09              | 1.76             |
+| Proposed TiS | 12.39            | 5.81              | 2.55             |
 
 
 ### Medium Cardinality + Low Cost Country Code Filter Terms
@@ -153,103 +160,91 @@ field-level stats, while the proposed TiS query uses term-level stats.
 
 | Approach     | Large Lead Terms | Medium Lead Terms | Small Lead Terms |
 |--------------|------------------|-------------------|------------------|
-| Current TiS  | 1.01             | 1.45              | 1.47             |
-| DV           | 14.73            | 5.48              | 1.65             |
-| IndexOrDV    | 14.75            | 5.52              | 1.69             |
-| Proposed TiS | 0.99             | 1.39              | 1.42             |
+| Current TiS  | 1.09             | 1.55              | 1.62             |
+| DV           | 8.15             | 3.47              | 1.50             |
+| IndexOrDV    | 8.17             | 3.49              | 1.53             |
+| Proposed TiS | 1.08             | 1.53              | 1.58             |
 
 
 ### Low Cardinality + High Cost Country Code Filter Terms
 * Term-in-set cardinality: 10 terms
 * Term-in-set cost: ranges from 243,458 to 2,240,232 across the 10 terms / avg: 645,352 / total: 6,453,524
 
-Here we see that our current TiS is better for both "large" and "medium" lead terms, but DV is best for a "small" lead.
+Here we see that our current TiS is better for "large" lead terms, but DV is best for "medium" and "small" leads.
 Because the number of filter terms is low (< 16 specifically), our current TiSQuery gets rewritten to a boolean query.
-Based on its cost estimation, which looks at the actual term-stats, IndexOrDV chooses to always use DV, which is
-less-than ideal, particularly for the "large" lead case (but the correct choice for the "small" lead case). Our
-proposed TiSQuery makes the same "mistakes" as our current IndexOrDV query as it's also looking at term-level stats,
-and decides DV should be better in all cases. I think this is just a situation where it's "unexpected" based on index
-stats that a postings-approach would be better given the high cost across the terms, but the arrangement of the docs
-in the postings relative to the lead term just happens to work out in a bit of a surprising way.
+This means it will estimate its cost based on the actual term-level stats, just like our proposed TiSQuery, so they're
+both pretty similar.
 
 | Approach     | Large Lead Terms | Medium Lead Terms | Small Lead Terms |
 |--------------|------------------|-------------------|------------------|
-| Current TiS  | 2.88             | 3.43              | 3.01             |
-| DV           | 5.63             | 3.93              | 1.86             |
-| IndexOrDV    | 6.27             | 4.91              | 2.86             |
-| Proposed TiS | 6.14             | 4.49              | 2.33             |
+| Current TiS  | 7.59             | 3.97              | 2.63             |
+| DV           | 9.45             | 3.89              | 1.55             |
+| IndexOrDV    | 8.79             | 4.22              | 2.16             |
+| Proposed TiS | 11.47            | 5.15              | 2.20             |
 
 
 ### Low Cardinality + Low Cost Country Code Filter Terms
 * Term-in-set cardinality: 10 terms
 * Term-in-set cost: ranges from 1 to 97 across the 10 terms / avg: 49 / total: 489
 
 Here we see that our current TiS is better than DV. This is expected as the filter terms all have such low costs, so
-a postings-approach is cheaper than a DV approach (although the two almost converge when the lead term gets "small").
+a postings-approach is cheaper than a DV approach (although the two converge when the lead term gets "small").
 This is an interesting case for IndexOrDV. While IndexOrDV generally uses field-level statistics and would be expected
 to get it "wrong" in this case (and use DV due to the over-estimate on cost at the field-level), the "index" query
 (TiSQuery) get rewritten to a standard BooleanQuery as there are fewer than 16 terms, and provides a more accurate
 cost by actually seeking the 16 terms and looking at term-level statistics. So IndexOrDV correctly chooses to use
-postings, but our proposed TiS implementation also correctly chooses to use postings and is a bit more efficient, likely
-explained by the choice to pre-populate a bitset from all terms up front due to their small size vs. maintaining a
-heap and doing doc-at-a-time scoring (as with the standard BooleanQuery that the current TiSQuery rewrites to).
+postings, like our proposed TiSQuery.
 
 | Approach     | Large Lead Terms | Medium Lead Terms | Small Lead Terms |
 |--------------|------------------|-------------------|------------------|
-| Current TiS  | 0.80             | 1.20              | 1.28             |
-| DV           | 14.23            | 5.15              | 1.44             |
-| IndexOrDV    | 1.06             | 1.55              | 1.66             |
-| Proposed TiS | 0.69             | 0.96              | 1.05             |
+| Current TiS  | 0.87             | 1.28              | 1.39             |
+| DV           | 7.84             | 3.20              | 1.31             |
+| IndexOrDV    | 0.96             | 1.43              | 1.55             |
+| Proposed TiS | 0.75             | 1.06              | 1.16             |
 
 
 ### High Cardinality PK Filter Terms
 * Term-in-set cardinality: 500 terms
 * Term-in-set cost: exactly 1 for each term / total cost of 500 across all terms
 
 In this situation, a postings-approach is generally better than a DV approach. Even though there are 500 individual
-filter terms, because the field is a PK, the overall cost is still relatively low (500 total), so it's generally
-better to use postings over DV, but in some cases--where the lead term has a very low cost--DV might out-perform.
-Both IndexOrDV and the proposed TiSQuery make different decisions between postings and DV depending on the cost of
-the specific lead term, and generally do the right thing. I suspect the reason IndexOrDV performs worse than
-the proposed TiSQuery has something to do with the 8x cost differential it uses to more strongly prefer postings,
-which may not really be the right thing to do all the time. The proposed TiSQuery is more aggressive about switching
-to DV when the lead cost is lower than the number of terms (500), while IndexOrDV will only do so when the lead cost
-is 8x cheaper than 500.
+filter terms, because the field is a PK, the number of terms in each segment is much lower. The cost estimation
+in our current TermInSetQuery has special-case logic for primary-key fields, but it overestimates the cost due to
+this issue, meaning IndexOrDV is incorrectly choosing to use DV. Our proposed TiS approach does a better job, but
+is still using DVs in some cases due to the heuristics not being perfect (confirmed through profiling). We can probably
+tune to be better here.
 
 | Approach     | Large Lead Terms | Medium Lead Terms | Small Lead Terms |
 |--------------|------------------|-------------------|------------------|
-| Current TiS  | 17.54            | 25.71             | 24.98            |
-| DV           | 63.72            | 72.83             | 67.27            |
-| IndexOrDV    | 60.53            | 82.03             | 76.00            |
-| Proposed TiS | 35.26            | 56.62             | 62.80            |
+| Current TiS  | 17.62            | 25.93             | 25.40            |
+| DV           | 53.36            | 65.10             | 61.45            |
+| IndexOrDV    | 56.48            | 76.43             | 70.72            |
+| Proposed TiS | 20.83            | 42.54             | 49.83            |
 
 
 ### Medium Cardinality PK Filter Terms
 * Term-in-set cardinality: 20 terms
 * Term-in-set cost: exactly 1 for each term / total cost of 20 across all terms
 
-This is a similar story to above, but postings outperforms DV in general even more due to the lower cardinality of
-the filter terms. Because of the 8x multiplier in IndexOrDV, it _never_ chooses a DV approach, while the proposed TiS
-query does occasionally, which probably accounts for the performance benefit.
+This is a similar story to above, but our proposed TiSQuery gets it right more often, producing better results.
 
 | Approach     | Large Lead Terms | Medium Lead Terms | Small Lead Terms |
 |--------------|------------------|-------------------|------------------|
-| Current TiS  | 1.74             | 2.57              | 2.59             |
-| DV           | 19.23            | 10.02             | 6.10             |
-| IndexOrDV    | 4.39             | 6.50              | 6.63             |
-| Proposed TiS | 1.79             | 3.32              | 3.49             |
+| Current TiS  | 1.84             | 2.71              | 2.79             |
+| DV           | 13.00            | 8.03              | 5.64             |
+| IndexOrDV    | 4.14             | 6.15              | 6.27             |
+| Proposed TiS | 1.52             | 2.52              | 2.73             |
 
 
 ### Low Cardinality PK Filter Terms
 * Term-in-set cardinality: 10 terms
 * Term-in-set cost: exactly 1 for each term / total cost of 10 across all terms
 
-This is yet again the same story as above. IndexOrDV always uses postings, while the proposed TiSQuery does sometimes
-use DV, which is better in some cases, so it outperforms.
+This is yet again the same story as above.
 
 | Approach     | Large Lead Terms | Medium Lead Terms | Small Lead Terms |
 |--------------|------------------|-------------------|------------------|
-| Current TiS  | 1.39             | 2.01              | 2.02             |
-| DV           | 16.36            | 8.17              | 4.42             |
-| IndexOrDV    | 3.55             | 5.20              | 5.29             |
-| Proposed TiS | 1.11             | 2.09              | 2.11             |
+| Current TiS  | 1.47             | 2.19              | 2.26             |
+| DV           | 10.94            | 6.38              | 4.13             |
+| IndexOrDV    | 2.28             | 3.39              | 3.58             |
+| Proposed TiS | 1.09             | 1.66              | 1.86             |