Replies: 5 comments
-
Using PR #17, I've rerun the same queries, showing some improvement when reading in `qbeast` format.
Detailed values (AVG, MAX and MIN) for the execution
-
@eavilaes can you provide more info (e.g. a quick guide) on how you run these tests?
-
Well, the process is a bit complicated to handle (welcome to the world of benchmarking). As per your quote, after the big refactor of #39, which includes the update of the Delta version to 1.0.0, and after #51 (thanks, I can now index big amounts of data), I ran these tests again, and you can see the results below:
To be mentioned: for the last column of the table, all the TPC-DS tables have been indexed in `qbeast` format.
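For context, writing a table in `qbeast` format with an index looks roughly like this sketch. The helper name, path, and column names are illustrative, not the ones from the actual runs; only the `columnsToIndex` option is taken from the qbeast-spark write API.

```scala
import org.apache.spark.sql.DataFrame

// Hypothetical helper: qbeast-spark takes the columns to index through
// the `columnsToIndex` option; path and column names are illustrative.
def writeAsQbeast(df: DataFrame, path: String, pkColumns: String): Unit =
  df.write
    .mode("overwrite")
    .format("qbeast")
    .option("columnsToIndex", pkColumns) // e.g. "ss_item_sk,ss_ticket_number"
    .save(path)
```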
-
I don't think this is relevant, at least as an issue. We should move it to a discussion, probably. Do you agree? @eavilaes @cugni
-
Yep! I think that's more a discussion than a real issue. We can move it.
-
What went wrong?
I have tested three queries from TPC-DS v2.4, specifically queries 3, 7 and 15, on a 100 GB dataset (the size on disk is quite a bit smaller, because of Parquet's compression). I found that there's an overhead in execution time when the data is written in `qbeast` format compared to `delta`, which you can see below.
The queries I have used are:
- Query 3 (a sketch of this one is shown below for reference)
- Query 7
- Query 15
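For reference, this is roughly what query 3 looks like in the v2.4 templates; the literal values (the manufacturer id and month) are parameters filled in by the query generator, so the exact run may differ. It assumes a `spark` session with the TPC-DS tables registered as views.

```scala
// TPC-DS query 3, roughly as generated from the v2.4 template.
// The literals (i_manufact_id = 128, d_moy = 11) are generator parameters.
val query3 = spark.sql("""
  SELECT dt.d_year,
         item.i_brand_id AS brand_id,
         item.i_brand    AS brand,
         SUM(ss_ext_sales_price) AS sum_agg
  FROM   date_dim dt, store_sales, item
  WHERE  dt.d_date_sk = store_sales.ss_sold_date_sk
    AND  store_sales.ss_item_sk = item.i_item_sk
    AND  item.i_manufact_id = 128
    AND  dt.d_moy = 11
  GROUP BY dt.d_year, item.i_brand_id, item.i_brand
  ORDER BY dt.d_year, sum_agg DESC, brand_id
  LIMIT 100
""")
```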
I performed three different executions on the data. Each one iterated every query 10 times, to calculate an average time per query. The execution time for each execution and query follows.
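A minimal sketch of that measurement, assuming the tables are already registered as views; spark-sql-perf does this bookkeeping itself, this is just the idea.

```scala
// Run a query `iterations` times and average the wall-clock time in ms.
def averageTimeMs(spark: org.apache.spark.sql.SparkSession,
                  query: String,
                  iterations: Int = 10): Double = {
  val timesMs = (1 to iterations).map { _ =>
    val start = System.nanoTime()
    spark.sql(query).collect() // force the full execution of the query
    (System.nanoTime() - start) / 1e6
  }
  timesMs.sum / iterations
}
```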
As you can see in the following table, there's an increase in execution time when the data is written in `qbeast` format. Using the average time, I have calculated the overhead percentage (you can find more details below the table). The table compares three configurations:

- written in `delta` format, read in `delta` format
- written in `qbeast` format, read in `delta` format
- written in `qbeast` format, read in `qbeast` format
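Switching between the last two configurations is only a matter of the read path, as in this sketch. The path is illustrative; the ability to read a `qbeast` table through the `delta` connector rests on qbeast-spark keeping a delta-compatible log, as described in its README.

```scala
// The same qbeast-written table, read through either connector;
// choosing "qbeast" as the read format is what enables the index.
val qbeastPath   = "/benchmark/tpcds/store_sales_qbeast"
val readAsDelta  = spark.read.format("delta").load(qbeastPath)
val readAsQbeast = spark.read.format("qbeast").load(qbeastPath)
```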
For more detailed values, I included maximum and minimum values for each execution.

Detailed values (AVG, MAX and MIN) for each execution:
- Data written in `delta` format, read in `delta` format
- Data written in `qbeast` format (indexed using the PK), read in `delta` format
- Data written in `qbeast` format (indexed using the PK), read in `qbeast` format
How to reproduce?
Code that triggered the bug, or steps to reproduce:
I ran the mentioned queries using databricks/spark-sql-perf. The times provided in the tables correspond to the output of the mentioned application.
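A sketch of how the queries can be run with spark-sql-perf, following its README; the result location is an assumption (any writable path works), and `spark` is the session from spark-shell.

```scala
import com.databricks.spark.sql.perf.tpcds.TPCDS

val tpcds = new TPCDS(sqlContext = spark.sqlContext)
// The full 2.4 query set; it can be filtered down to queries 3, 7 and 15.
val queries = tpcds.tpcds2_4Queries
val experiment = tpcds.runExperiment(
  queries,
  iterations = 10,
  resultLocation = "/benchmark/results"
)
experiment.waitForFinish(24 * 60 * 60) // timeout in seconds
```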
Branch and commit id: `main`, on commit 15667c2
Spark version: 3.1.1
Hadoop version: 2.7.4
Are you running Spark inside a container? Are you launching the app on a remote K8s cluster? Or are you just running the tests on a local computer?
I'm running Spark on a remote K8s cluster with 9 nodes and 8 spark-workers. Each node has 4 cores (3 for the executors) and 16 GB of memory (12 GB for the executors).
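For reproducibility, those resources translate into session settings roughly like the following; the app name is an assumption, and the values come from the cluster description above rather than any defaults.

```scala
import org.apache.spark.sql.SparkSession

// Illustrative settings: 8 executors with 3 cores and 12 GB each.
val spark = SparkSession.builder()
  .appName("qbeast-tpcds-benchmark")
  .config("spark.executor.instances", "8")
  .config("spark.executor.cores", "3")
  .config("spark.executor.memory", "12g")
  .getOrCreate()
```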
Stack trace:
N/A