
[Spark] Auto Compaction was incorrectly including large files towards minNumFiles #4045 #4178

Open · wants to merge 9 commits into master
Conversation

@mwc360 (Contributor) commented Feb 19, 2025

Which Delta project/connector is this regarding?

  • [x] Spark
  • [ ] Standalone
  • [ ] Flink
  • [ ] Kernel
  • [ ] Other (fill in here)

Description

  • If DELTA_AUTO_COMPACT_MIN_FILE_SIZE was unset, it defaulted to Long.MaxValue, which caused large files to count towards the minNumFiles threshold for AC to be triggered. As a result, compaction ran more and more frequently as the table grew, up to the point of running after every write.
    (screenshot: observed AC trigger frequency)
    The below is the expected behavior on the same test suite as produced by Databricks:
    (screenshot: expected AC trigger frequency)

  • The AC evaluation criteria didn't always require enough small files: AC would also trigger whenever it hadn't run as part of the last operation. AC should only evaluate as shouldCompact when compaction did not just run AND there are enough small files.

Resolves #4045 See issue for more details.
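The two fixes above can be sketched together. This is a language-agnostic Python sketch, not the actual Scala code from the PR; the names `min_file_size`, `min_num_files`, and `compacted_in_last_op` are stand-ins for the corresponding Delta configs and state, and the concrete threshold values in the example are illustrative assumptions:

```python
def should_auto_compact(file_sizes, min_file_size, min_num_files,
                        compacted_in_last_op):
    """Sketch of the corrected Auto Compaction trigger check.

    Before the fix: an unset min file size behaved like Long.MaxValue,
    so every file counted as "small", and the condition used OR, firing
    whenever AC hadn't run as part of the previous operation.
    """
    # Only files below the small-file threshold count towards minNumFiles.
    num_small_files = sum(1 for size in file_sizes if size < min_file_size)

    # Fixed logic: both conditions must hold (AND, not OR).
    return (not compacted_in_last_op) and (num_small_files >= min_num_files)


# A large file must not count towards min_num_files: here two large files
# and one small file fall below a min_num_files threshold of 2.
GIB = 1024 ** 3
sizes = [1 * GIB, 2 * GIB, 4 * 1024 * 1024]
print(should_auto_compact(sizes, min_file_size=128 * 1024 * 1024,
                          min_num_files=2, compacted_in_last_op=False))
# → False: only one file is small, so AC is not triggered
```

With the old OR semantics, the call above would have returned True simply because compaction had not just run, regardless of how many small files existed.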

How was this patch tested?

  • AutoCompactSuite.scala was updated to add more robust coverage to ensure large files don't trigger AC.
  • I separately ran a test suite which runs 200 iterations of merging data into a Delta table and monitors for AC being triggered, to ensure that compaction runs only when the number of small files is >= minNumFiles. I also ran this same test suite in Databricks to confirm that the behavior matches.

Does this PR introduce any user-facing changes?

No.

… the session config is unset. This resulted in large files counting towards minNumFiles

- auto compaction should always evaluate whether there are sufficient small files; the prior code used an `OR` condition when auto compaction hadn't been run, which resulted in AC running too frequently.

Signed-off-by: Miles Cole <m.w.c.360@gmail.com>
Development

Successfully merging this pull request may close these issues.

[BUG][Spark] Auto Compaction trigger logic is not consistent with documentation