
[Spark] Auto Compaction was incorrectly including large files towards minNumFiles #4045 #4178

Open · wants to merge 9 commits into master
Conversation

@mwc360 (Contributor) commented Feb 19, 2025

Which Delta project/connector is this regarding?

  • [x] Spark
  • [ ] Standalone
  • [ ] Flink
  • [ ] Kernel
  • [ ] Other (fill in here)

Description

  • If DELTA_AUTO_COMPACT_MIN_FILE_SIZE was unset, it defaulted to Long.MaxValue, which caused large files to count towards the minNumFiles threshold for AC to be triggered. As a result, compaction ran more and more frequently as the table grew, up to the point of running after every write.
    (screenshot: observed AC trigger frequency)
    The below is the expected behavior on the same test suite as produced by Databricks:
    (screenshot: expected AC trigger frequency)

  • The AC evaluation criteria didn't always require enough small files: AC would also trigger whenever it hadn't run as part of the last operation. AC should only evaluate as shouldCompact when compaction did not just run AND there are enough small files.

Resolves #4045 See issue for more details.
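The two fixes above can be sketched together. This is a language-agnostic Python sketch, not the actual Scala code from the PR; the names `min_file_size`, `min_num_files`, and `compacted_in_last_op` are stand-ins for the corresponding Delta configs and state, and the concrete threshold values in the example are illustrative assumptions:

```python
def should_auto_compact(file_sizes, min_file_size, min_num_files,
                        compacted_in_last_op):
    """Sketch of the corrected Auto Compaction trigger check.

    Before the fix: an unset min file size behaved like Long.MaxValue,
    so every file counted as "small", and the condition used OR, firing
    whenever AC hadn't run as part of the previous operation.
    """
    # Only files below the small-file threshold count towards minNumFiles.
    num_small_files = sum(1 for size in file_sizes if size < min_file_size)

    # Fixed logic: both conditions must hold (AND, not OR).
    return (not compacted_in_last_op) and (num_small_files >= min_num_files)


# A large file must not count towards min_num_files: here two large files
# and one small file fall below a min_num_files threshold of 2.
GIB = 1024 ** 3
sizes = [1 * GIB, 2 * GIB, 4 * 1024 * 1024]
print(should_auto_compact(sizes, min_file_size=128 * 1024 * 1024,
                          min_num_files=2, compacted_in_last_op=False))
# → False: only one file is small, so AC is not triggered
```

With the old OR semantics, the call above would have returned True simply because compaction had not just run, regardless of how many small files existed.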

How was this patch tested?

  • AutoCompactSuite.scala was updated to add more robust coverage to ensure large files don't trigger AC.
  • I separately ran a test suite which runs 200 iterations of merging data into a Delta table and monitors for AC being triggered, to ensure that compaction runs only when the number of small files is >= minNumFiles. I also ran this same test suite in Databricks to confirm that the behavior matches.

Does this PR introduce any user-facing changes?

No.

… the session config is unset. This resulted in large files counting towards minNumFiles

- auto compaction should always evaluate whether there are sufficient small files; the prior code used an `OR` condition when auto compaction hadn't been run, which resulted in AC running too frequently.

Signed-off-by: Miles Cole <m.w.c.360@gmail.com>
Development

Successfully merging this pull request may close these issues.

[BUG][Spark] Auto Compaction trigger logic is not consistent with documentation