
Execution time overhead when reading qbeast indexed data (not using sampling) #16

Closed
eavilaes opened this issue Sep 30, 2021 · 5 comments


What went wrong?

I have tested three queries from TPC-DS v2.4, specifically queries 3, 7 and 15, on a 100 GB dataset (the size on disk is considerably smaller due to Parquet compression). I found that there is an execution-time overhead when the data is written in qbeast format compared to delta, as you can see below.
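For reference, the three configurations compared in the tables below correspond roughly to the following read paths. This is a minimal sketch: the paths and the example table are placeholders, and it assumes the qbeast data source from qbeast-spark, whose output can also be read through the delta reader (as the "qbeast format, read in delta" column suggests).

 import org.apache.spark.sql.SparkSession

 val spark = SparkSession.builder().getOrCreate()

 // Hypothetical locations of the same TPC-DS table, written once per format
 val deltaPath  = "s3://bench/tpcds/store_sales_delta"   // written with .format("delta")
 val qbeastPath = "s3://bench/tpcds/store_sales_qbeast"  // written with .format("qbeast")

 // 1) delta format, read in delta (baseline)
 spark.read.format("delta").load(deltaPath).createOrReplaceTempView("store_sales")

 // 2) qbeast format, read in delta (qbeast-written data read through the delta reader)
 spark.read.format("delta").load(qbeastPath).createOrReplaceTempView("store_sales")

 // 3) qbeast format, read in qbeast
 spark.read.format("qbeast").load(qbeastPath).createOrReplaceTempView("store_sales")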

The queries I have used are:

Query 3
 SELECT dt.d_year, item.i_brand_id brand_id, item.i_brand brand, SUM(ss_ext_sales_price) sum_agg
 FROM  date_dim dt, store_sales, item
 WHERE dt.d_date_sk = store_sales.ss_sold_date_sk
   AND store_sales.ss_item_sk = item.i_item_sk
   AND item.i_manufact_id = 128
   AND dt.d_moy=11
 GROUP BY dt.d_year, item.i_brand, item.i_brand_id
 ORDER BY dt.d_year, sum_agg desc, brand_id
 LIMIT 100
Query 7
 SELECT i_item_id,
        avg(ss_quantity) agg1,
        avg(ss_list_price) agg2,
        avg(ss_coupon_amt) agg3,
        avg(ss_sales_price) agg4
 FROM store_sales, customer_demographics, date_dim, item, promotion
 WHERE ss_sold_date_sk = d_date_sk AND
       ss_item_sk = i_item_sk AND
       ss_cdemo_sk = cd_demo_sk AND
       ss_promo_sk = p_promo_sk AND
       cd_gender = 'M' AND
       cd_marital_status = 'S' AND
       cd_education_status = 'College' AND
       (p_channel_email = 'N' or p_channel_event = 'N') AND
       d_year = 2000
 GROUP BY i_item_id
 ORDER BY i_item_id LIMIT 100
Query 15
 select ca_zip, SUM(cs_sales_price) sum_agg
 from catalog_sales, customer, customer_address, date_dim
 where cs_bill_customer_sk = c_customer_sk
 	and c_current_addr_sk = ca_address_sk
 	and ( substr(ca_zip,1,5) in ('85669', '86197','88274','83405','86475',
                                   '85392', '85460', '80348', '81792')
 	      or ca_state in ('CA','WA','GA')
 	      or cs_sales_price > 500)
 	and cs_sold_date_sk = d_date_sk
 	and d_qoy = 2 and d_year = 2001
 group by ca_zip
 order by ca_zip
 limit 100

I performed three different executions on the data. Each one iterated every query 10 times to compute an average time per query; the execution times for each execution and query follow.
As the following table shows, execution time increases when the data is written in qbeast format. Using the average times, I calculated the execution time relative to the delta baseline, shown in parentheses (e.g. 13.533 s / 7.822 s ≈ 173.01 %); you can find more detailed values below the table:

| Query | delta format, read in delta | qbeast format, read in delta | qbeast format, read in qbeast |
|-------|-----------------------------|------------------------------|-------------------------------|
| 3 | 7.822 s | 13.533 s (173.01 %) | 20.635 s (263.80 %) |
| 7 | 18.839 s | 24.319 s (129.09 %) | 35.546 s (188.68 %) |
| 15 | 7.201 s | 16.385 s (227.53 %) | 24.620 s (341.89 %) |

For more detailed values, I included maximum and minimum values for each execution:

Detailed values (AVG, MAX and MIN) for each execution

Data written in delta format, read in delta format

| Query | AVG | MAX | MIN |
|-------|-----|-----|-----|
| q3 | 7.822 s | 10.579 s | 7.203 s |
| q7 | 18.839 s | 21.281 s | 17.253 s |
| q15 | 7.201 s | 9.448 s | 6.344 s |

Data written in qbeast format (index using PK), read in delta format

| Query | AVG | MAX | MIN |
|-------|-----|-----|-----|
| q3 | 13.533 s | 17.612 s | 12.467 s |
| q7 | 24.319 s | 29.949 s | 22.478 s |
| q15 | 16.385 s | 21.824 s | 14.882 s |

Data written in qbeast format (index using PK), read in qbeast format

| Query | AVG | MAX | MIN |
|-------|-----|-----|-----|
| q3 | 20.635 s | 31.006 s | 18.756 s |
| q7 | 35.546 s | 43.923 s | 32.965 s |
| q15 | 24.620 s | 28.866 s | 22.850 s |

How to reproduce?

  1. Code that triggered the bug, or steps to reproduce:
    I ran the queries above using databricks/spark-sql-perf; the times in the tables correspond to that application's output (a sketch of such a run is included after this list).

  2. Branch and commit id:
    main, on commit 15667c2

  3. Spark version:
    3.1.1

  4. Hadoop version:
    2.7.4

  5. Are you running Spark inside a container? Are you launching the app on a remote K8s cluster? Or are you just running the tests in a local computer?
    I'm running Spark on a remote K8s cluster with 9 nodes, 8 of them Spark workers. Each node has 4 cores (3 for the executors) and 16 GB of memory (12 GB for the executors).

  6. Stack trace:
    N/A
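For context, a run against the registered TPC-DS tables with databricks/spark-sql-perf looks roughly like the sketch below, following that project's README; the database name, result location, timeout and query selection are illustrative, not the exact values used here.

 import com.databricks.spark.sql.perf.tpcds.TPCDS

 // `spark` is the active SparkSession (e.g. from spark-shell);
 // the TPC-DS tables are assumed to be registered in the current database.
 val tpcds = new TPCDS(sqlContext = spark.sqlContext)
 spark.sql("USE tpcds_100gb")  // hypothetical database name

 val queries = tpcds.tpcds2_4Queries  // could be filtered down to q3, q7 and q15
 val experiment = tpcds.runExperiment(
   queries,
   iterations = 10,                       // 10 runs per query, as in the tables above
   resultLocation = "s3://bench/results") // hypothetical output path
 experiment.waitForFinish(60 * 60)        // timeout in seconds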

@eavilaes eavilaes added the type: bug Something isn't working label Sep 30, 2021

eavilaes commented Oct 4, 2021

Using PR #17, I have rerun the same queries, which show some improvement when reading in qbeast format.
In the last column, I added the percentage relative to the delta format (as before) and the difference with respect to the main branch in qbeast format.

| Query | delta format, read in delta | qbeast format, read in qbeast (main 15667c2) | qbeast format, read in qbeast (PR #17) |
|-------|-----------------------------|----------------------------------------------|----------------------------------------|
| 3 | 7.822 s | 20.635 s (263.80 %) | 13.203 s (168.79 %) (-95.01 %) |
| 7 | 18.839 s | 35.546 s (188.68 %) | 24.319 s (129.09 %) (-59.59 %) |
| 15 | 7.201 s | 24.620 s (341.89 %) | 16.481 s (228.87 %) (-113.02 %) |
Detailed values (AVG, MAX and MIN) for the execution
| Query | AVG | MIN | MAX |
|-------|-----|-----|-----|
| 3 | 13.203 s | 12.462 s | 16.989 s |
| 7 | 24.319 s | 23.121 s | 27.730 s |
| 15 | 16.481 s | 15.152 s | 18.936 s |

@osopardo1 osopardo1 added this to the Christmas party milestone Nov 16, 2021
@cugni cugni self-assigned this Dec 7, 2021

cugni commented Dec 7, 2021

@eavilaes can you provide more info (e.g., a quick guide) on how you ran these tests?


eavilaes commented Dec 13, 2021

Well, the process is a bit complicated to handle (welcome to the world of benchmarking):
As I mentioned, I'm using Qbeast-io/spark-sql-perf-private together with our automated-deployments tools. The latter contains two directories: a Scala app ready to run TPC-DS benchmarks, which uses spark-sql-perf under the hood, and another with shell scripts that make it easier to run the app with all the dependencies it needs.
Note that these two repositories are currently for internal use, but they are based on the idea behind databricks' spark-sql-perf.
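For completeness, generating the dataset itself with spark-sql-perf looks roughly like this. Again a sketch following the upstream README: the dsdgen path, output location, partition count and format string are illustrative, and the format would be whichever source format is being prepared for the test.

 import com.databricks.spark.sql.perf.tpcds.TPCDSTables

 val tables = new TPCDSTables(
   spark.sqlContext,
   dsdgenDir = "/opt/tpcds-kit/tools",  // hypothetical path to the dsdgen binary
   scaleFactor = "100",                 // ~100 GB of raw data, as in these tests
   useDoubleForDecimal = false,
   useStringForDate = false)

 tables.genData(
   location = "s3://bench/tpcds/raw",   // hypothetical output location
   format = "parquet",
   overwrite = true,
   partitionTables = false,
   clusterByPartitionColumns = false,
   filterOutNullPartitionValues = false,
   tableFilter = "",
   numPartitions = 200)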


Following up on your question: after the big refactor in #39, which updates Delta to version 1.0.0, and #51 (thanks, I can now index large amounts of data), I ran these tests again, and you can see the results below:
| Query | delta format, read in delta | qbeast format, read in qbeast (after PR #17) | qbeast format, read in qbeast (#51 1d4812a) |
|-------|-----------------------------|----------------------------------------------|---------------------------------------------|
| 3 | 7.152 s | 13.203 s | 12.094 s |
| 7 | 16.504 s | 24.319 s | 18.666 s |
| 15 | 6.880 s | 16.481 s | 15.448 s |

To be mentioned: for the last column of the table, all the TPC-DS tables were indexed in qbeast format using the primary key of each table, with a cubeSize of 2,000,000 (~100 MiB per cube). The queries were executed 10 times, as before.
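A sketch of how that indexing could be expressed with qbeast-spark's write options, assuming the columnsToIndex and cubeSize options of the qbeast data source; the source/destination paths are placeholders and store_sales is just one example table:

 // Rewrite one TPC-DS table indexed on its primary key with a 2,000,000-row cube size
 spark.read.parquet("s3://bench/tpcds/raw/store_sales")        // hypothetical source
   .write.format("qbeast")
   .option("columnsToIndex", "ss_item_sk,ss_ticket_number")    // store_sales primary key
   .option("cubeSize", "2000000")                               // ~100 MiB per cube
   .save("s3://bench/tpcds/qbeast/store_sales")                 // hypothetical destination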

Detailed values (AVG, MAX and MIN)

Data written in qbeast format, read in qbeast format (#51 1d4812a)

| Query | AVG | MAX | MIN |
|-------|-----|-----|-----|
| q3 | 12.094 s | 18.893 s | 10.879 s |
| q7 | 18.666 s | 24.063 s | 17.670 s |
| q15 | 15.448 s | 19.026 s | 14.584 s |

@osopardo1 osopardo1 removed this from the Christmas party milestone Dec 13, 2021
osopardo1 (Member) commented

I don't think this is relevant, at least as an issue. We should move it to a discussion, probably. Do you agree? @eavilaes @cugni


eavilaes commented Mar 7, 2022

I don't think this is relevant, at least as an issue. We should move it to a discussion, probably. Do you agree? @eavilaes @cugni

Yep! I think that's more a discussion than a real issue. We can move it.

@Qbeast-io Qbeast-io locked and limited conversation to collaborators Mar 7, 2022
@eavilaes eavilaes converted this issue into discussion #80 Mar 7, 2022
