Add Operation and File Metrics [Delta] #139
…er and delta OptimisticTransaction
Those trackers come from DeltaFileStatistics and measure the max, min and null count of each column
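As a rough illustration of what "max, min and null count of each column" means, here is a minimal sketch in plain Scala. It does not use the Spark or Delta tracker APIs: `ColumnStats`, `StatsSketch` and `collectStats` are hypothetical names, and rows are modelled as simple maps of nullable `Long` values.

```scala
object StatsSketch {

  // Hypothetical, simplified per-column statistics (not the Delta or Qbeast classes).
  final case class ColumnStats(min: Option[Long], max: Option[Long], nullCount: Long) {

    // Fold one (possibly null) value into the running statistics.
    def update(value: Option[Long]): ColumnStats = value match {
      case None => copy(nullCount = nullCount + 1)
      case Some(v) =>
        copy(
          min = Some(min.map(m => math.min(m, v)).getOrElse(v)),
          max = Some(max.map(m => math.max(m, v)).getOrElse(v)))
    }
  }

  val emptyStats: ColumnStats = ColumnStats(None, None, 0L)

  // Rows are modelled as column name -> optional value; None stands for null.
  def collectStats(rows: Seq[Map[String, Option[Long]]]): Map[String, ColumnStats] =
    rows.foldLeft(Map.empty[String, ColumnStats]) { (acc, row) =>
      row.foldLeft(acc) { case (stats, (column, value)) =>
        stats.updated(column, stats.getOrElse(column, emptyStats).update(value))
      }
    }
}
```

A real tracker would update these values incrementally while each file is being written, rather than over an in-memory sequence of rows.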
Codecov Report
@@ Coverage Diff @@
## main #139 +/- ##
==========================================
+ Coverage 92.89% 93.18% +0.28%
==========================================
Files 73 76 +3
Lines 1717 1775 +58
Branches 126 133 +7
==========================================
+ Hits 1595 1654 +59
+ Misses 122 121 -1
# Conflicts: # src/test/scala/io/qbeast/spark/utils/QbeastSparkCorrectnessTest.scala
On the new commit 8e7f304, I added a workaround to filter files using the stats. Since Delta Lake has already implemented a […]. This is a first naive implementation. We should think about something more efficient, since we are doubling the number of collects on the Delta Log, and we could simplify the data skipping into a single process.
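For context, the general idea of filtering files by their stats can be sketched as below. This is not the workaround from commit 8e7f304, just a minimal illustration of min/max pruning under simplifying assumptions: a single numeric column, an inclusive range predicate, and hypothetical names (`FileStats`, `pruneFiles`).

```scala
object DataSkippingSketch {

  // Hypothetical per-file summary: path plus the min/max of one numeric column.
  final case class FileStats(path: String, min: Long, max: Long)

  // Keep only the files whose [min, max] range overlaps the query range [lower, upper].
  // Files outside the range cannot contain matching rows, so they can be skipped.
  def pruneFiles(files: Seq[FileStats], lower: Long, upper: Long): Seq[FileStats] =
    files.filter(f => f.max >= lower && f.min <= upper)
}
```

The pruning itself is cheap. The overhead discussed in this thread comes from collecting the log entries a second time to obtain the stats.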
After doing an initial benchmark, I think it's best to leave the data-skipping strategy for another Pull Request. We saw that reading and filtering the Delta Log twice (as a naïve solution) creates an overhead on the queries. We should think of a better way of processing this information, but in the meantime, it's probably more important to have this metadata in the commit log.
Description
Fixes issues #137 and #4
Type of change
The goal was to add new information to the Commit Log entries. Delta Lake collects statistics at the operation level and at the file level. Since Qbeast has a modified implementation of the writing process, we were skipping this part.
In this PR, we use JobStatsTrackers (Spark processes that keep track of specific metrics when managing data) to fill in stats information. As a result, the CommitInfo and the AddFile present in the commit log should contain the following fields:
UPDATE.
After doing an initial benchmark, we will be leaving the data-skipping strategy for another Pull Request. We saw that reading and filtering the Delta Log twice (as a naïve solution) creates an overhead on the queries. We should think of a better way of processing this information, but in the meantime, I think it's more important to have this metadata in the commit log.
Checklist:
Here is the list of things you should do before submitting this pull request:
How Has This Been Tested? (Optional)
Please describe the tests that you ran to verify your changes.
Created a test under the QbeastDataSourceIntegrationTest class in which we make sure the information is committed correctly.
Since these additional measurements are going to impact performance, we should also benchmark this against version 0.3.1.