Replies: 5 comments
-
Using PR #17, I've rerun the same queries, showing some improvement when reading in `qbeast` format.
Detailed values (AVG, MAX and MIN) for the execution
-
@eavilaes can you provide more info (e.g. a quick guide) on how you run these tests?
-
Well, the process is a bit complicated to handle (welcome to the world of benchmarking). As per your quote, after the big refactor of #39, which includes the update of the Delta version to 1.0.0, and after #51 (thanks, I can now index big amounts of data), I ran these tests again, and you can see the results below:
To be mentioned: for the last column of the table, all the TPC-DS tables have been indexed in `qbeast` format.
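For context, writing a table in `qbeast` format with an index looks roughly like this sketch. The helper name, path, and column names are illustrative, not the ones from the actual runs; only the `columnsToIndex` option is taken from the qbeast-spark write API.

```scala
import org.apache.spark.sql.DataFrame

// Hypothetical helper: qbeast-spark takes the columns to index through
// the `columnsToIndex` option; path and column names are illustrative.
def writeAsQbeast(df: DataFrame, path: String, pkColumns: String): Unit =
  df.write
    .mode("overwrite")
    .format("qbeast")
    .option("columnsToIndex", pkColumns) // e.g. "ss_item_sk,ss_ticket_number"
    .save(path)
```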
-
I don't think this is relevant, at least as an issue. We should move it to a discussion, probably. Do you agree? @eavilaes @cugni
-
Yep! I think that's more a discussion than a real issue. We can move it.
-
What went wrong?
I have tested three queries from TPC-DS v2.4, specifically queries 3, 7 and 15, on a 100 GB dataset (the size on disk is quite a bit smaller, because of Parquet's compression). I found that there's an overhead in execution time when the data is written in `qbeast` format compared to `delta`, which you can see below.
The queries I have used are:
- Query 3 (a sketch of this one is shown below for reference)
- Query 7
- Query 15
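For reference, this is roughly what query 3 looks like in the v2.4 templates; the literal values (the manufacturer id and month) are parameters filled in by the query generator, so the exact run may differ. It assumes a `spark` session with the TPC-DS tables registered as views.

```scala
// TPC-DS query 3, roughly as generated from the v2.4 template.
// The literals (i_manufact_id = 128, d_moy = 11) are generator parameters.
val query3 = spark.sql("""
  SELECT dt.d_year,
         item.i_brand_id AS brand_id,
         item.i_brand    AS brand,
         SUM(ss_ext_sales_price) AS sum_agg
  FROM   date_dim dt, store_sales, item
  WHERE  dt.d_date_sk = store_sales.ss_sold_date_sk
    AND  store_sales.ss_item_sk = item.i_item_sk
    AND  item.i_manufact_id = 128
    AND  dt.d_moy = 11
  GROUP BY dt.d_year, item.i_brand_id, item.i_brand
  ORDER BY dt.d_year, sum_agg DESC, brand_id
  LIMIT 100
""")
```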
I performed three different executions on the data. Each one iterated every query 10 times, to calculate an average time per query. The execution time for each execution and query follows.
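A minimal sketch of that measurement, assuming the tables are already registered as views; spark-sql-perf does this bookkeeping itself, this is just the idea.

```scala
// Run a query `iterations` times and average the wall-clock time in ms.
def averageTimeMs(spark: org.apache.spark.sql.SparkSession,
                  query: String,
                  iterations: Int = 10): Double = {
  val timesMs = (1 to iterations).map { _ =>
    val start = System.nanoTime()
    spark.sql(query).collect() // force the full execution of the query
    (System.nanoTime() - start) / 1e6
  }
  timesMs.sum / iterations
}
```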
As you can see in the following table, there's an increase in execution time when the data is written in `qbeast` format. Using the average time, I have calculated the overhead percentage (you can find more details below the table). The table compares three configurations:

- written in `delta` format, read in `delta` format
- written in `qbeast` format, read in `delta` format
- written in `qbeast` format, read in `qbeast` format
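Switching between the last two configurations is only a matter of the read path, as in this sketch. The path is illustrative; the ability to read a `qbeast` table through the `delta` connector rests on qbeast-spark keeping a delta-compatible log, as described in its README.

```scala
// The same qbeast-written table, read through either connector;
// choosing "qbeast" as the read format is what enables the index.
val qbeastPath   = "/benchmark/tpcds/store_sales_qbeast"
val readAsDelta  = spark.read.format("delta").load(qbeastPath)
val readAsQbeast = spark.read.format("qbeast").load(qbeastPath)
```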
For more detailed values, I included maximum and minimum values for each execution.

Detailed values (AVG, MAX and MIN) for each execution:
- Data written in `delta` format, read in `delta` format
- Data written in `qbeast` format (indexed using the PK), read in `delta` format
- Data written in `qbeast` format (indexed using the PK), read in `qbeast` format
How to reproduce?
Code that triggered the bug, or steps to reproduce:
I ran the mentioned queries using databricks/spark-sql-perf. The times provided in the tables correspond to the output of the mentioned application.
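A sketch of how the queries can be run with spark-sql-perf, following its README; the result location is an assumption (any writable path works), and `spark` is the session from spark-shell.

```scala
import com.databricks.spark.sql.perf.tpcds.TPCDS

val tpcds = new TPCDS(sqlContext = spark.sqlContext)
// The full 2.4 query set; it can be filtered down to queries 3, 7 and 15.
val queries = tpcds.tpcds2_4Queries
val experiment = tpcds.runExperiment(
  queries,
  iterations = 10,
  resultLocation = "/benchmark/results"
)
experiment.waitForFinish(24 * 60 * 60) // timeout in seconds
```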
Branch and commit id: `main`, on commit 15667c2
Spark version: 3.1.1
Hadoop version: 2.7.4
Are you running Spark inside a container? Are you launching the app on a remote K8s cluster? Or are you just running the tests on a local computer?
I'm running Spark on a remote K8s cluster with 9 nodes and 8 spark-workers. Each node has 4 cores (3 for the executors) and 16 GB of memory (12 GB for the executors).
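For reproducibility, those resources translate into session settings roughly like the following; the app name is an assumption, and the values come from the cluster description above rather than any defaults.

```scala
import org.apache.spark.sql.SparkSession

// Illustrative settings: 8 executors with 3 cores and 12 GB each.
val spark = SparkSession.builder()
  .appName("qbeast-tpcds-benchmark")
  .config("spark.executor.instances", "8")
  .config("spark.executor.cores", "3")
  .config("spark.executor.memory", "12g")
  .getOrCreate()
```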
Stack trace:
N/A