This issue was moved to a discussion.
You can continue the conversation there. Go to discussion →
Execution time overhead when reading qbeast indexed data (not using sampling) #16
Comments
Using PR #17, I've rerun the same queries, showing some improvement when reading in
Detailed values (AVG, MAX and MIN) for the execution
@eavilaes can you provide more info (e.g. a quick guide) on how you run these tests?
Well, the process is a bit complicated to handle (welcome to the world of benchmarking). As per your quote, after the big refactor of #39, which includes the update of Delta to version 1.0.0, and #51 (thanks, I can now index large amounts of data), I ran these tests again, and you can see the results below:
To be mentioned: for the last column of the table, all the TPC-DS tables have been indexed in
What went wrong?
I have tested three queries from TPC-DS v2.4, specifically queries 3, 7 and 15, on a 100 GB dataset (the size on disk is considerably smaller because of Parquet's compression). I found that there is an overhead in execution time when the data is written in `qbeast` format compared to `delta`, which you can see below. The queries I have used are:
Query 3
Query 7
Query 15
I performed three different executions on the data. Each execution iterated every query 10 times to calculate an average time per query. The execution times for each execution and query follow.
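As a rough sketch, this measurement loop could look like the following (the `run_query` stub is hypothetical and stands in for submitting the actual Spark SQL query):

```python
import time

def benchmark(run_query, iterations=10):
    """Run a query `iterations` times; return (avg, max, min) wall-clock seconds."""
    times = []
    for _ in range(iterations):
        start = time.perf_counter()
        run_query()
        times.append(time.perf_counter() - start)
    return sum(times) / len(times), max(times), min(times)

# Example with a no-op stand-in for the real query:
avg, worst, best = benchmark(lambda: None)
```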
As you can see in the following table, there is an increase in execution time when the data is written in `qbeast` format. Using the average time, I have calculated the overhead percentage (you can find more details below the table). The three configurations compared are:

- Data written in `delta` format, read in `delta` format
- Data written in `qbeast` format, read in `delta` format
- Data written in `qbeast` format, read in `qbeast` format
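For illustration, the overhead percentage can be derived from the average times as follows; the example numbers here are made-up placeholders, not the measured results:

```python
def overhead_pct(baseline_avg, measured_avg):
    """Overhead of `measured_avg` relative to `baseline_avg`, in percent."""
    return (measured_avg - baseline_avg) / baseline_avg * 100

# Hypothetical example values in seconds (NOT the real results):
delta_avg = 10.0
qbeast_avg = 12.5
print(overhead_pct(delta_avg, qbeast_avg))  # → 25.0
```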
For more detailed values, I included maximum and minimum values for each execution:

Detailed values (AVG, MAX and MIN) for each execution

- Data written in `delta` format, read in `delta` format
- Data written in `qbeast` format (indexed using PK), read in `delta` format
- Data written in `qbeast` format (indexed using PK), read in `qbeast` format

How to reproduce?
Code that triggered the bug, or steps to reproduce:
I ran the mentioned queries using databricks/spark-sql-perf. The times provided in the tables correspond to the output of the mentioned application.
Branch and commit id: `main`, on commit 15667c2
Spark version: 3.1.1
Hadoop version: 2.7.4
Are you running Spark inside a container? Are you launching the app on a remote K8s cluster? Or are you just running the tests in a local computer?
I'm running Spark on a remote K8s cluster with 9 nodes, 8 of which are spark-workers. Each node has 4 cores (3 for the executors) and 16 GB of memory (12 GB for the executors).
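For reference, the aggregate executor resources implied by this setup (assuming each of the 8 workers runs one executor, as described) work out to:

```python
workers = 8
cores_per_executor = 3    # 4 cores per node, 3 given to the executor
mem_per_executor_gb = 12  # 16 GB per node, 12 given to the executor

total_cores = workers * cores_per_executor
total_mem_gb = workers * mem_per_executor_gb
print(total_cores, total_mem_gb)  # → 24 96
```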
Stack trace:
N/A