Skip to content

Commit

Permalink
update bench results
Browse files Browse the repository at this point in the history
  • Loading branch information
gsmiller committed Jan 27, 2023
1 parent e315b7d commit 84dcd52
Showing 1 changed file with 57 additions and 62 deletions.
Original file line number Diff line number Diff line change
Expand Up @@ -13,14 +13,20 @@
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
*/

## Goals
1. Gather some data on when postings vs. doc-values are efficient in evaluating a term-in-set clause.
2. Determine if a custom, self-optimizing, term-in-set query provides any benefits over the existing
`IndexOrDocValuesQuery` on top of the existing `TermInSetQuery` + `DocValuesTermsQuery`.

## Setup
Benchmarks run against an index using `allCountries.txt` geonames data. Index created with:
* "name" field that indexed the parsed name of each record
* "cc" field that indexed the country code ID for each record (single valued)
* "id" field that indexed the unique ID for each record (single valued)

Each benchmark query "lead" with a single term matching against the "name" field and included a required conjunction
Each benchmark query "leads" with a single term matching against the "name" field and included a required conjunction
over a disjunction of country codes (e.g., a "term-in-set" query that required one of the specified country codes
for each hit to be considered a match). Details on the distributions of these terms for each task are below, followed by
the results.
Expand All @@ -30,7 +36,7 @@ The benchmark results compare:
2. Our current `DocValuesTermsQuery` implementation (always use doc values for the country code clause)
3. Our current `IndexOrDocValuesQuery` wrapping #1 and #2
4. A proposed new sandbox `TermInSetQuery` that acts a bit like `IndexOrDocValuesQuery` but goes beyond just using
field-level stats when deciding which approach to take, starts to peek at individual term stats.
field-level stats when deciding which approach to take, starts to peek at individual term stats.

### Benchmark Term Distributions
#### "Large" lead term counts:
Expand Down Expand Up @@ -59,7 +65,6 @@ field-level stats when deciding which approach to take, starts to peek at indivi
#### Country Code filter term counts:
#### High cost terms
NOTE: First 10 used for "low cardinality" tasks, all 20 used for "high cardinality"

* 2240232 US
* 874153 CN
* 649062 IN
Expand All @@ -83,7 +88,6 @@ NOTE: First 10 used for "low cardinality" tasks, all 20 used for "high cardinali

#### Low cost terms
NOTE: First 10 used for "low cardinality" tasks, all 20 used for "high cardinality"

* 1 YU
* 2 AN
* 2 CS
Expand All @@ -110,7 +114,10 @@ In general, the proposed `TermInSetQuery` tends to perform at least as well as `
some specific cases. These cases are where `IndexOrDocValuesQuery` incorrectly assumes a doc values approach will be
better after looking at field-level stats, and the proposed `TermInSetQuery` makes a more informed decision to use
doc values after loading some term-level stats. Primary-key cases also seem to perform significantly better with
the proposed `TermInSetQuery`.
the proposed `TermInSetQuery`, as the existing `IndexOrDocValuesQuery` query sometimes incorrectly decides to use a
doc-values approach when postings would be better (they almost always are better for primary keys). This is because
the `#cost()` estimation of `TermInSetQuery`, while specialized for primary-key cases, sort of assumes that all terms
will occur in the segment, which is probably quite wrong for primary key cases.

### All Country Code Filter Terms
* Term-in-set cardinality: 254 terms
Expand All @@ -122,10 +129,10 @@ deciding to use DV.

| Approach | Large Lead Terms | Medium Lead Terms | Small Lead Terms |
|--------------|------------------|-------------------|------------------|
| Current TiS | 115.65 | 172.26 | 193.97 |
| DV | 4.88 | 6.06 | 5.71 |
| IndexOrDV | 4.96 | 6.19 | 5.88 |
| Proposed TiS | 5.64 | 6.73 | 5.89 |
| Current TiS | 122.01 | 174.29 | 194.85 |
| DV | 12.01 | 7.83 | 5.68 |
| IndexOrDV | 12.11 | 7.97 | 5.85 |
| Proposed TiS | 16.01 | 9.87 | 5.85 |


### Medium Cardinality + High Cost Country Code Filter Terms
Expand All @@ -137,10 +144,10 @@ the proposed TiS implementation is slightly slower due to the term seeking.

| Approach | Large Lead Terms | Medium Lead Terms | Small Lead Terms |
|--------------|------------------|-------------------|------------------|
| Current TiS | 62.32 | 92.33 | 107.02 |
| DV | 3.29 | 2.42 | 1.90 |
| IndexOrDV | 3.31 | 2.45 | 1.94 |
| Proposed TiS | 3.94 | 3.09 | 2.47 |
| Current TiS | 67.84 | 94.50 | 107.32 |
| DV | 9.59 | 4.05 | 1.72 |
| IndexOrDV | 9.61 | 4.09 | 1.76 |
| Proposed TiS | 12.39 | 5.81 | 2.55 |


### Medium Cardinality + Low Cost Country Code Filter Terms
Expand All @@ -153,103 +160,91 @@ field-level stats, while the proposed TiS query uses term-level stats.

| Approach | Large Lead Terms | Medium Lead Terms | Small Lead Terms |
|--------------|------------------|-------------------|------------------|
| Current TiS | 1.01 | 1.45 | 1.47 |
| DV | 14.73 | 5.48 | 1.65 |
| IndexOrDV | 14.75 | 5.52 | 1.69 |
| Proposed TiS | 0.99 | 1.39 | 1.42 |
| Current TiS | 1.09 | 1.55 | 1.62 |
| DV | 8.15 | 3.47 | 1.50 |
| IndexOrDV | 8.17 | 3.49 | 1.53 |
| Proposed TiS | 1.08 | 1.53 | 1.58 |


### Low Cardinality + High Cost Country Code Filter Terms
* Term-in-set cardinality: 10 terms
* Term-in-set cost: ranges from 243,458 to 2,240,232 across the 10 terms / avg: 645,352 / total: 6,453,524

Here we see that our current TiS is better for both "large" and "medium" lead terms, but DV is best for a "small" lead.
Here we see that our current TiS is better for "large" lead terms, but DV is best for "medium" and "small" leads.
Because the number of filter terms is low (< 16 specifically), our current TiSQuery gets rewritten to a boolean query.
Based on its cost estimation, which looks at the actual term-stats, IndexOrDV chooses to always use DV, which is
less-than ideal, particularly for the "large" lead case (but the correct choice for the "small" lead case). Our
proposed TiSQuery makes the same "mistakes" as our current IndexOrDV query as it's also looking at term-level stats,
and decides DV should be better in all cases. I think this is just a situation where it's "unexpected" based on index
stats that a postings-approach would be better given the high cost across the terms, but the arrangement of the docs
in the postings relative to the lead term just happens to work out in a bit of a surprising way.
This means it will estimate its cost based on the actual term-level stats, just like our proposed TiSQuery, so they're
both pretty similar.

| Approach | Large Lead Terms | Medium Lead Terms | Small Lead Terms |
|--------------|------------------|-------------------|------------------|
| Current TiS | 2.88 | 3.43 | 3.01 |
| DV | 5.63 | 3.93 | 1.86 |
| IndexOrDV | 6.27 | 4.91 | 2.86 |
| Proposed TiS | 6.14 | 4.49 | 2.33 |
| Current TiS | 7.59 | 3.97 | 2.63 |
| DV | 9.45 | 3.89 | 1.55 |
| IndexOrDV | 8.79 | 4.22 | 2.16 |
| Proposed TiS | 11.47 | 5.15 | 2.20 |


### Low Cardinality + Low Cost Country Code Filter Terms
* Term-in-set cardinality: 10 terms
* Term-in-set cost: ranges from 1 to 97 across the 10 terms / avg: 49 / total: 489

Here we see that our current TiS is better than DV. This is expected as the filter terms all have such low costs, so
a postings-approach is cheaper than a DV approach (although the two almost converge when the lead term gets "small").
a postings-approach is cheaper than a DV approach (although the two converge when the lead term gets "small").
This is an interesting case for IndexOrDV. While IndexOrDV generally uses field-level statistics and would be expected
to get it "wrong" in this case (and use DV due to the over-estimate on cost at the field-level), the "index" query
(TiSQuery) get rewritten to a standard BooleanQuery as there are fewer than 16 terms, and provides a more accurate
cost by actually seeking the 16 terms and looking at term-level statistics. So IndexOrDV correctly chooses to use
postings, but our proposed TiS implementation also correctly chooses to use postings and is a bit more efficient, likely
explained by the choice to pre-populate a bitset from all terms up front due to their small size vs. maintaining a
heap and doing doc-at-a-time scoring (as with the standard BooleanQuery that the current TiSQuery rewrites to).
postings, like our proposed TiSQuery.

| Approach | Large Lead Terms | Medium Lead Terms | Small Lead Terms |
|--------------|------------------|-------------------|------------------|
| Current TiS | 0.80 | 1.20 | 1.28 |
| DV | 14.23 | 5.15 | 1.44 |
| IndexOrDV | 1.06 | 1.55 | 1.66 |
| Proposed TiS | 0.69 | 0.96 | 1.05 |
| Current TiS | 0.87 | 1.28 | 1.39 |
| DV | 7.84 | 3.20 | 1.31 |
| IndexOrDV | 0.96 | 1.43 | 1.55 |
| Proposed TiS | 0.75 | 1.06 | 1.16 |


### High Cardinality PK Filter Terms
* Term-in-set cardinality: 500 terms
* Term-in-set cost: exactly 1 for each term / total cost of 500 across all terms

In this situation, a postings-approach is generally better than a DV approach. Even though there are 500 individual
filter terms, because the field is a PK, the overall cost is still relatively low (500 total), so it's generally
better to use postings over DV, but in some cases--where the lead term has a very low cost--DV might out-perform.
Both IndexOrDV and the proposed TiSQuery make different decisions between postings and DV depending on the cost of
the specific lead term, and generally do the right thing. I suspect the reason IndexOrDV performs worse than
the proposed TiSQuery has something to do with the 8x cost differential it uses to more strongly prefer postings,
which may not really be the right thing to do all the time. The proposed TiSQuery is more aggressive about switching
to DV when the lead cost is lower than the number of terms (500), while IndexOrDV will only do so when the lead cost
is 8x cheaper than 500.
filter terms, because the field is a PK, the number of terms in each segment is much lower. The cost estimation
in our current TermInSetQuery has special-case logic for primary-key fields, but it overestimates the cost due to
this issue, meaning IndexOrDV is incorrectly choosing to use DV. Our proposed TiS approach does a better job, but
is still using DVs in some cases due to the heuristics not being perfect (confirmed through profiling). We can probably
tune to be better here.

| Approach | Large Lead Terms | Medium Lead Terms | Small Lead Terms |
|--------------|------------------|-------------------|------------------|
| Current TiS | 17.54 | 25.71 | 24.98 |
| DV | 63.72 | 72.83 | 67.27 |
| IndexOrDV | 60.53 | 82.03 | 76.00 |
| Proposed TiS | 35.26 | 56.62 | 62.80 |
| Current TiS | 17.62 | 25.93 | 25.40 |
| DV | 53.36 | 65.10 | 61.45 |
| IndexOrDV | 56.48 | 76.43 | 70.72 |
| Proposed TiS | 20.83 | 42.54 | 49.83 |


### Medium Cardinality PK Filter Terms
* Term-in-set cardinality: 20 terms
* Term-in-set cost: exactly 1 for each term / total cost of 20 across all terms

This is a similar story to above, but postings outperforms DV in general even more due to the lower cardinality of
the filter terms. Because of the 8x multiplier in IndexOrDV, it _never_ chooses a DV approach, while the proposed TiS
query does occasionally, which probably accounts for the performance benefit.
This is a similar story to above, but our proposed TiSQuery gets it right more often, producing better results.

| Approach | Large Lead Terms | Medium Lead Terms | Small Lead Terms |
|--------------|------------------|-------------------|------------------|
| Current TiS | 1.74 | 2.57 | 2.59 |
| DV | 19.23 | 10.02 | 6.10 |
| IndexOrDV | 4.39 | 6.50 | 6.63 |
| Proposed TiS | 1.79 | 3.32 | 3.49 |
| Current TiS | 1.84 | 2.71 | 2.79 |
| DV | 13.00 | 8.03 | 5.64 |
| IndexOrDV | 4.14 | 6.15 | 6.27 |
| Proposed TiS | 1.52 | 2.52 | 2.73 |


### Low Cardinality PK Filter Terms
* Term-in-set cardinality: 10 terms
* Term-in-set cost: exactly 1 for each term / total cost of 10 across all terms

This is yet again the same story as above. IndexOrDV always uses postings, while the proposed TiSQuery does sometimes
use DV, which is better in some cases, so it outperforms.
This is yet again the same story as above.

| Approach | Large Lead Terms | Medium Lead Terms | Small Lead Terms |
|--------------|------------------|-------------------|------------------|
| Current TiS | 1.39 | 2.01 | 2.02 |
| DV | 16.36 | 8.17 | 4.42 |
| IndexOrDV | 3.55 | 5.20 | 5.29 |
| Proposed TiS | 1.11 | 2.09 | 2.11 |
| Current TiS | 1.47 | 2.19 | 2.26 |
| DV | 10.94 | 6.38 | 4.13 |
| IndexOrDV | 2.28 | 3.39 | 3.58 |
| Proposed TiS | 1.09 | 1.66 | 1.86 |

0 comments on commit 84dcd52

Please sign in to comment.