Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Spark] Fix stats with tightBounds check for AddFiles with deletionVectors #3633

Merged
merged 7 commits into from
Sep 3, 2024

Conversation

juliuszsompolski
Copy link
Contributor

Which Delta project/connector is this regarding?

  • Spark
  • Standalone
  • Flink
  • Kernel
  • Other (fill in here)

Description

Check DeletionVectorFilesHaveWideBoundsCheck has been disabled for COMPUTE STATS because it reintroduces stats with tight bound to files with Deletion Vectors. However, there are other operations that can then copy over these AddFile actions with DVs and tight stats. These operations resulted in DELTA_ADDING_DELETION_VECTORS_WITH_TIGHT_BOUNDS_DISALLOWED error, which was a false positive.

In this PR we also attempt to introduce and discuss a "framework" for checks like that as a property of DeltaOperations, with DeltaOperations declaring as a member method whether a certain property and check should be performed. This is opposed to current practice, where many places in the code feature special cases like matching against a certain DeltaOperation and doing something special; this kind of code is very decentralized, and it's easy to miss if any new place or new operation needs such central handling. If this was centralized in DeltaOperations, this could lead to better discoverability of special cases and edge cases when implementing new operations.

How was this patch tested?

Tests added.

Does this PR introduce any user-facing changes?

No.

@juliuszsompolski
Copy link
Contributor Author

cc @bart-samwel @c27kwan

Copy link
Contributor

@c27kwan c27kwan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm!

*
* This is abstract to force the implementers of all operations to think about this setting.
* All operations should add a comment justifying this setting.
* Any operation that sets this to false should add a test in TightBoundsSuite.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a way to enforce the test? I feel like people might just cargo-cult the override + comment

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am afraid that has to be up to good reviewing practice...
I have found examples of some tests that e.g. use reflection to check the number of fields in some class and then comment there what needs to be done if a new one is added.
I can imagine using reflection to find all subclasses of DeltaOperation, listing the ones that already define this as false and have a test, and instruct in that test that if you add a new subclass that overrides it to false then you should add a test, but I think it's a bit overkill...

@vkorukanti vkorukanti merged commit 5d63094 into delta-io:master Sep 3, 2024
14 of 17 checks passed
vkorukanti pushed a commit that referenced this pull request Sep 19, 2024
…letionVectorStatsAreNotTightBounds (#3641)

Post merge review followup: update and fix some comments.

Co-authored-by: Julek Sompolski <Juliusz Sompolski>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants