
Execution time overhead when reading qbeast indexed data (not using sampling) #16

Closed
eavilaes opened this issue Sep 30, 2021 · 5 comments


What went wrong?

I have tested three queries from TPC-DS v2.4, specifically queries 3, 7 and 15, on a 100 GB dataset (the size on disk is considerably smaller due to Parquet compression). I found that there is an execution-time overhead when the data is written in qbeast format compared to delta, as you can see below.
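For reference, the three configurations compared in the tables below correspond roughly to the following read paths. This is a minimal sketch: the paths and the example table are placeholders, and it assumes the qbeast data source from qbeast-spark, whose output can also be read through the delta reader (as the "qbeast format, read in delta" column suggests).

 import org.apache.spark.sql.SparkSession

 val spark = SparkSession.builder().getOrCreate()

 // Hypothetical locations of the same TPC-DS table, written once per format
 val deltaPath  = "s3://bench/tpcds/store_sales_delta"   // written with .format("delta")
 val qbeastPath = "s3://bench/tpcds/store_sales_qbeast"  // written with .format("qbeast")

 // 1) delta format, read in delta (baseline)
 spark.read.format("delta").load(deltaPath).createOrReplaceTempView("store_sales")

 // 2) qbeast format, read in delta (qbeast-written data read through the delta reader)
 spark.read.format("delta").load(qbeastPath).createOrReplaceTempView("store_sales")

 // 3) qbeast format, read in qbeast
 spark.read.format("qbeast").load(qbeastPath).createOrReplaceTempView("store_sales")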

The queries I have used are:

Query 3
 SELECT dt.d_year, item.i_brand_id brand_id, item.i_brand brand, SUM(ss_ext_sales_price) sum_agg
 FROM  date_dim dt, store_sales, item
 WHERE dt.d_date_sk = store_sales.ss_sold_date_sk
   AND store_sales.ss_item_sk = item.i_item_sk
   AND item.i_manufact_id = 128
   AND dt.d_moy=11
 GROUP BY dt.d_year, item.i_brand, item.i_brand_id
 ORDER BY dt.d_year, sum_agg desc, brand_id
 LIMIT 100
Query 7
 SELECT i_item_id,
        avg(ss_quantity) agg1,
        avg(ss_list_price) agg2,
        avg(ss_coupon_amt) agg3,
        avg(ss_sales_price) agg4
 FROM store_sales, customer_demographics, date_dim, item, promotion
 WHERE ss_sold_date_sk = d_date_sk AND
       ss_item_sk = i_item_sk AND
       ss_cdemo_sk = cd_demo_sk AND
       ss_promo_sk = p_promo_sk AND
       cd_gender = 'M' AND
       cd_marital_status = 'S' AND
       cd_education_status = 'College' AND
       (p_channel_email = 'N' or p_channel_event = 'N') AND
       d_year = 2000
 GROUP BY i_item_id
 ORDER BY i_item_id LIMIT 100
Query 15
 select ca_zip, SUM(cs_sales_price) sum_agg
 from catalog_sales, customer, customer_address, date_dim
 where cs_bill_customer_sk = c_customer_sk
 	and c_current_addr_sk = ca_address_sk
 	and ( substr(ca_zip,1,5) in ('85669', '86197','88274','83405','86475',
                                   '85392', '85460', '80348', '81792')
 	      or ca_state in ('CA','WA','GA')
 	      or cs_sales_price > 500)
 	and cs_sold_date_sk = d_date_sk
 	and d_qoy = 2 and d_year = 2001
 group by ca_zip
 order by ca_zip
 limit 100

I performed three different executions on the data. Each one iterated every query 10 times to compute an average time per query; the execution times for each execution and query follow.
As the following table shows, execution time increases when the data is written in qbeast format. Using the average times, I calculated the execution time relative to the delta baseline, shown in parentheses (e.g. 13.533 s / 7.822 s ≈ 173.01 %); you can find more detailed values below the table:

| Query | delta format, read in delta | qbeast format, read in delta | qbeast format, read in qbeast |
|-------|-----------------------------|------------------------------|-------------------------------|
| 3 | 7.822 s | 13.533 s (173.01 %) | 20.635 s (263.80 %) |
| 7 | 18.839 s | 24.319 s (129.09 %) | 35.546 s (188.68 %) |
| 15 | 7.201 s | 16.385 s (227.53 %) | 24.620 s (341.89 %) |

For more detailed values, I included maximum and minimum values for each execution:

Detailed values (AVG, MAX and MIN) for each execution

Data written in delta format, read in delta format

| Query | AVG | MAX | MIN |
|-------|-----|-----|-----|
| q3 | 7.822 s | 10.579 s | 7.203 s |
| q7 | 18.839 s | 21.281 s | 17.253 s |
| q15 | 7.201 s | 9.448 s | 6.344 s |

Data written in qbeast format (index using PK), read in delta format

| Query | AVG | MAX | MIN |
|-------|-----|-----|-----|
| q3 | 13.533 s | 17.612 s | 12.467 s |
| q7 | 24.319 s | 29.949 s | 22.478 s |
| q15 | 16.385 s | 21.824 s | 14.882 s |

Data written in qbeast format (index using PK), read in qbeast format

| Query | AVG | MAX | MIN |
|-------|-----|-----|-----|
| q3 | 20.635 s | 31.006 s | 18.756 s |
| q7 | 35.546 s | 43.923 s | 32.965 s |
| q15 | 24.620 s | 28.866 s | 22.850 s |

How to reproduce?

  1. Code that triggered the bug, or steps to reproduce:
    I ran the queries above using databricks/spark-sql-perf; the times in the tables correspond to that application's output (a sketch of such a run is included after this list).

  2. Branch and commit id:
    main, on commit 15667c2

  3. Spark version:
    3.1.1

  4. Hadoop version:
    2.7.4

  5. Are you running Spark inside a container? Are you launching the app on a remote K8s cluster? Or are you just running the tests in a local computer?
    I'm running Spark on a remote K8s cluster with 9 nodes, 8 of them Spark workers. Each node has 4 cores (3 for the executors) and 16 GB of memory (12 GB for the executors).

  6. Stack trace:
    N/A
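For context, a run against the registered TPC-DS tables with databricks/spark-sql-perf looks roughly like the sketch below, following that project's README; the database name, result location, timeout and query selection are illustrative, not the exact values used here.

 import com.databricks.spark.sql.perf.tpcds.TPCDS

 // `spark` is the active SparkSession (e.g. from spark-shell);
 // the TPC-DS tables are assumed to be registered in the current database.
 val tpcds = new TPCDS(sqlContext = spark.sqlContext)
 spark.sql("USE tpcds_100gb")  // hypothetical database name

 val queries = tpcds.tpcds2_4Queries  // could be filtered down to q3, q7 and q15
 val experiment = tpcds.runExperiment(
   queries,
   iterations = 10,                       // 10 runs per query, as in the tables above
   resultLocation = "s3://bench/results") // hypothetical output path
 experiment.waitForFinish(60 * 60)        // timeout in seconds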

@eavilaes eavilaes added the type: bug Something isn't working label Sep 30, 2021

eavilaes commented Oct 4, 2021

Using PR #17, I have rerun the same queries, which show some improvement when reading in qbeast format.
In the last column, I added the percentage relative to the delta format (as before) and the difference with respect to the main branch in qbeast format.

| Query | delta format, read in delta | qbeast format, read in qbeast (main 15667c2) | qbeast format, read in qbeast (PR #17) |
|-------|-----------------------------|----------------------------------------------|----------------------------------------|
| 3 | 7.822 s | 20.635 s (263.80 %) | 13.203 s (168.79 %) (-95.01 %) |
| 7 | 18.839 s | 35.546 s (188.68 %) | 24.319 s (129.09 %) (-59.59 %) |
| 15 | 7.201 s | 24.620 s (341.89 %) | 16.481 s (228.87 %) (-113.02 %) |
Detailed values (AVG, MAX and MIN) for the execution
| Query | AVG | MIN | MAX |
|-------|-----|-----|-----|
| 3 | 13.203 s | 12.462 s | 16.989 s |
| 7 | 24.319 s | 23.121 s | 27.730 s |
| 15 | 16.481 s | 15.152 s | 18.936 s |

@osopardo1 osopardo1 added this to the Christmas party milestone Nov 16, 2021
@cugni cugni self-assigned this Dec 7, 2021

cugni commented Dec 7, 2021

@eavilaes can you provide more info (e.g., a quick guide) on how you ran these tests?


eavilaes commented Dec 13, 2021

Well, the process is a bit complicated to handle (welcome to the world of benchmarking):
As I mentioned, I'm using Qbeast-io/spark-sql-perf-private together with our automated-deployments tools. The latter contains two directories: a Scala app ready to run TPC-DS benchmarks, which uses spark-sql-perf under the hood, and another with shell scripts that make it easier to run the app with all the dependencies it needs.
Note that these two repositories are currently for internal use, but they are based on the idea behind databricks' spark-sql-perf.
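For completeness, generating the dataset itself with spark-sql-perf looks roughly like this. Again a sketch following the upstream README: the dsdgen path, output location, partition count and format string are illustrative, and the format would be whichever source format is being prepared for the test.

 import com.databricks.spark.sql.perf.tpcds.TPCDSTables

 val tables = new TPCDSTables(
   spark.sqlContext,
   dsdgenDir = "/opt/tpcds-kit/tools",  // hypothetical path to the dsdgen binary
   scaleFactor = "100",                 // ~100 GB of raw data, as in these tests
   useDoubleForDecimal = false,
   useStringForDate = false)

 tables.genData(
   location = "s3://bench/tpcds/raw",   // hypothetical output location
   format = "parquet",
   overwrite = true,
   partitionTables = false,
   clusterByPartitionColumns = false,
   filterOutNullPartitionValues = false,
   tableFilter = "",
   numPartitions = 200)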


Following up on your question: after the big refactor in #39, which updates Delta to version 1.0.0, and #51 (thanks, I can now index large amounts of data), I ran these tests again, and you can see the results below:
| Query | delta format, read in delta | qbeast format, read in qbeast (after PR #17) | qbeast format, read in qbeast (#51 1d4812a) |
|-------|-----------------------------|----------------------------------------------|---------------------------------------------|
| 3 | 7.152 s | 13.203 s | 12.094 s |
| 7 | 16.504 s | 24.319 s | 18.666 s |
| 15 | 6.880 s | 16.481 s | 15.448 s |

To be mentioned: for the last column of the table, all the TPC-DS tables were indexed in qbeast format using the primary key of each table, with a cubeSize of 2,000,000 (~100 MiB per cube). The queries were executed 10 times, as before.
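A sketch of how that indexing could be expressed with qbeast-spark's write options, assuming the columnsToIndex and cubeSize options of the qbeast data source; the source/destination paths are placeholders and store_sales is just one example table:

 // Rewrite one TPC-DS table indexed on its primary key with a 2,000,000-row cube size
 spark.read.parquet("s3://bench/tpcds/raw/store_sales")        // hypothetical source
   .write.format("qbeast")
   .option("columnsToIndex", "ss_item_sk,ss_ticket_number")    // store_sales primary key
   .option("cubeSize", "2000000")                               // ~100 MiB per cube
   .save("s3://bench/tpcds/qbeast/store_sales")                 // hypothetical destination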

Detailed values (AVG, MAX and MIN)

Data written in qbeast format, read in qbeast format (#51 1d4812a)

| Query | AVG | MAX | MIN |
|-------|-----|-----|-----|
| q3 | 12.094 s | 18.893 s | 10.879 s |
| q7 | 18.666 s | 24.063 s | 17.670 s |
| q15 | 15.448 s | 19.026 s | 14.584 s |

@osopardo1 osopardo1 removed this from the Christmas party milestone Dec 13, 2021
osopardo1 (Member) commented

I don't think this is relevant, at least as an issue. We should move it to a discussion, probably. Do you agree? @eavilaes @cugni


eavilaes commented Mar 7, 2022

I don't think this is relevant, at least as an issue. We should move it to a discussion, probably. Do you agree? @eavilaes @cugni

Yep! I think that's more a discussion than a real issue. We can move it.

@Qbeast-io Qbeast-io locked and limited conversation to collaborators Mar 7, 2022
@eavilaes eavilaes converted this issue into discussion #80 Mar 7, 2022
