Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Operation and File Metrics [Delta] #139

Merged
merged 13 commits into from
Jan 16, 2023

Conversation

osopardo1
Copy link
Member

@osopardo1 osopardo1 commented Nov 3, 2022

Description

Fixes issues #137 and #4

Type of change

The goal was to add new information to the Commit Log entries. On Delta Lake, statistics at the operation level and file level are getting collected. Since Qbeast has a modified implementation of the writing process, we were skipping this part.

In this PR, we use JobStatsTrackers (spark processes that keep track of specific metrics when managing data) to fill in stats information. As a result, the CommitInfo and the AddFile present in the commit log, should contain the following fields:

{
  "commitInfo": {
    ...
    "operationMetrics": {
      "numFiles": "1",
      "numOutputRows": "3",
      "numOutputBytes": "660"
    },
   ...
}
 "add": {
   ...
    "stats": "{\"numRecords\":3,\"minValues\":{\"a\":\"A\",\"b\":1},\"maxValues\":{\"a\":\"C\",\"b\":3},\"nullCount\":{\"a\":0,\"b\":0}}",
   ...
  }
}

UPDATE.
After doing an initial benchmark, we will be leaving the data-skipping strategy for another Pull Request. We saw that reading and filtering twice the Delta Log (as a naïve solution) is creating an overhead on the queries. We should think of a better way os processing this information, but in the meantime, I think it's more important to have this metadata in the commit log.

Checklist:

Here is the list of things you should do before submitting this pull request:

  • New feature / bug fix has been committed following the Contribution guide.
  • Add comments to the code (make it easier for the community!).
  • Change the documentation.
  • Add tests.
  • Your branch is updated to the main branch (dependent changes have been merged).

How Has This Been Tested? (Optional)

Please describe the tests that you ran to verify your changes.

Created a test under QbeastDataSourceIntegrationTest class in which we make sure the information is commited correctly.
Since these additional measurements are going to impact performance, we should also benchmark this againts version 0.3.1.

osopardo1 and others added 6 commits October 24, 2022 09:37
@codecov
Copy link

codecov bot commented Nov 4, 2022

Codecov Report

Merging #139 (c34259d) into main (0bdee8e) will increase coverage by 0.28%.
The diff coverage is 98.61%.

@@            Coverage Diff             @@
##             main     #139      +/-   ##
==========================================
+ Coverage   92.89%   93.18%   +0.28%     
==========================================
  Files          73       76       +3     
  Lines        1717     1775      +58     
  Branches      126      133       +7     
==========================================
+ Hits         1595     1654      +59     
+ Misses        122      121       -1     
Impacted Files Coverage Δ
...he/spark/sql/delta/DeltaStatsCollectionUtils.scala 93.33% <93.33%> (ø)
...la/io/qbeast/spark/delta/DeltaMetadataWriter.scala 92.72% <100.00%> (+1.42%) ⬆️
...ala/io/qbeast/spark/delta/writer/BlockWriter.scala 98.21% <100.00%> (+0.38%) ⬆️
...east/spark/delta/writer/SparkDeltaDataWriter.scala 100.00% <100.00%> (ø)
...la/io/qbeast/spark/delta/writer/StatsTracker.scala 100.00% <100.00%> (ø)
...io/qbeast/spark/index/query/QuerySpecBuilder.scala 100.00% <0.00%> (ø)
.../scala/io/qbeast/spark/index/query/QuerySpec.scala 100.00% <0.00%> (ø)
...la/io/qbeast/spark/index/query/QueryExecutor.scala 97.05% <0.00%> (+0.39%) ⬆️
...ain/scala/io/qbeast/spark/table/IndexedTable.scala 90.80% <0.00%> (+1.14%) ⬆️
...c/main/scala/io/qbeast/core/model/QuerySpace.scala 100.00% <0.00%> (+5.26%) ⬆️

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

# Conflicts:
#	src/test/scala/io/qbeast/spark/utils/QbeastSparkCorrectnessTest.scala
@osopardo1 osopardo1 changed the title Add Operation and File Metrics Add Operation and File Metrics [Delta] Nov 29, 2022
@osopardo1
Copy link
Member Author

On the new commit 8e7f304, I added a workaround to filter files using the Stats.

Since Delta Lake has already implemented a DataSkippingReader strategy on the Snapshot class, we would use their method to filter according to the gathered stats. Once we had their filtering and ours, we can join the outputs with the file paths.

This is a first naive implementation. We should think about something more efficient, since we are duplicating the number of collects on the Delta Log, and we can simplify the data skipping with one single process.

@osopardo1
Copy link
Member Author

osopardo1 commented Jan 10, 2023

After doing an initial benchmark, I think it's best to leave the data-skipping strategy for another Pull Request. We saw that reading and filtering twice the Delta Log (as a naïve solution) is creating an overhead on the queries. We should think of a better way of processing this information, but in the meantime, probably it's more important to have this metadata in the commit log.
What do you think?

@osopardo1 osopardo1 marked this pull request as ready for review January 10, 2023 08:12
@osopardo1 osopardo1 requested a review from Jiaweihu08 January 10, 2023 08:12
@osopardo1 osopardo1 merged commit 99a0fb3 into Qbeast-io:main Jan 16, 2023
@osopardo1 osopardo1 deleted the 137-operation-metrics branch August 2, 2023 05:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants