Test: Validate memory limit for sort queries to extended test #14142

2010YOUY01 · 2025-01-16T04:51:42Z

Which issue does this PR close?

Rationale for this change

DataFusion supports memory-limited queries: it tracks its internal memory consumption and uses that accounting to cap total memory usage.
This feature needs to be verified externally: the memory usage observed by a profiler should be consistent with the specified limit.

Idea

Here is an example: compile and run datafusion-cli with a memory limit, and profile its physical memory consumption:

/usr/bin/time -l cargo run --release -- --mem-pool-type fair -m 400M -c 'select * from generate_series(1,100000000) as t1(c1) order by c1'

In theory, the source relation in this query should consume 800M of memory (8-byte int64 * 100M rows), which can be checked by running the same query without the order by.
An ideal sort implementation uses O(N) space, so without a memory limit the query should use about 800M plus a small amount of memory for other internal data structures. With a 400M memory limit, the query should run within around 400M of physical memory.
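The 800M estimate is simple arithmetic and can be sanity-checked in a few lines of Rust. This is a minimal sketch, not code from the PR: `raw_column_bytes` and `ROWS` are hypothetical names, and the figure ignores Arrow buffer and allocator overhead.

```rust
use std::mem::size_of;

/// Number of rows produced by `generate_series(1, 100000000)`.
const ROWS: usize = 100_000_000;

/// Bytes needed to hold `rows` values of type `T`, ignoring any
/// per-buffer or allocator overhead (hypothetical helper).
fn raw_column_bytes<T>(rows: usize) -> usize {
    rows * size_of::<T>()
}
```

Calling `raw_column_bytes::<i64>(ROWS)` gives 800,000,000 bytes, so a 400M limit is roughly half of the raw input and the sort cannot hold all of it in memory at once.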

This test module implements this kind of validation. (It also revealed that the memory consumption of sorting is not ideal: it uses 2X-3X the expected memory or worse. I plan to investigate this later.)

Implementation

Implementing such a test is a bit tricky. The utility functions for measuring RSS can only read the current process's RSS, so each test case has to run in a separate process. However, Rust's test harness runs tests from the same module in one process, on different threads.

This PR uses the following workaround.

#[test]
fn sort_mem_test_1() {
    // Return directly if environment variable `DATAFUSION_TEST_MEM_LIMIT_VALIDATION` is not set
    // ...
}

#[test]
fn test_runner() {
    // Set env var and execute command like 'cargo test sort_mem_test_1' to make sure all tests run in different processes
    // ...
}

If one of these test cases is run directly with 'cargo test', it returns immediately without doing any work. The runner is the actual entry point for all related tests.
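The two halves of this workaround can be sketched with the standard library alone. This is an illustration under assumptions rather than the PR's code: `validation_enabled` and `runner_command` are hypothetical helpers, and only the environment variable name and the `cargo test <name>` invocation come from the description above.

```rust
use std::env;
use std::process::Command;

/// Environment variable that gates the real test bodies (name from the PR).
const GATE_VAR: &str = "DATAFUSION_TEST_MEM_LIMIT_VALIDATION";

/// True only when the runner has opted this process in.
/// A gated test would start with `if !validation_enabled() { return; }`.
fn validation_enabled() -> bool {
    env::var(GATE_VAR).is_ok()
}

/// Hypothetical helper: builds the child `cargo test <test_name>` command
/// with the gate variable set, so the test runs in its own process.
fn runner_command(test_name: &str) -> Command {
    let mut cmd = Command::new("cargo");
    cmd.arg("test").arg(test_name).env(GATE_VAR, "1");
    cmd
}
```

A runner test would then call something like `runner_command("sort_mem_test_1").status()` and assert that the child exited successfully.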

What changes are included in this PR?

Added test utilities for memory limit validation (similar tests for external aggregate, join can be implemented later)
Added tests for simple sort queries

Are these changes tested?

Are there any user-facing changes?

2010YOUY01 · 2025-01-16T05:00:40Z

datafusion/core/tests/memory_limit/memory_limit_validation/utils.rs

+    query: &str,
+    baseline_query: &str,
+) {
+    if std::env::var("DATAFUSION_TEST_MEM_LIMIT_VALIDATION").is_err() {


I am aware of the test_with crate, which can run a test case only when an env var is set, but I couldn't get it working when the test is launched through a command.

alamb

Thank you for working on this @2010YOUY01 -- very exciting.

I would like to request we move these tests to the "extended" suite that runs on commits to main here: https://github.com/apache/datafusion/blob/main/.github/workflows/extended.yml

I worry that if we add these tests to every local and CI test run, it will significantly slow down development (as I think these tests force a recompile)

I tested this manually by running:

 cargo test --test core_integration -- memory_limit

I noticed that the submodule references to parquet-testing and testing (arrow-testing) are updated. I think that is fine but wanted to point it out

alamb · 2025-01-16T21:03:05Z

datafusion/core/tests/memory_limit/memory_limit_validation/sort_mem_validation.rs

+/// Runner that executes each test in a separate process with the required environment
+/// variable set. Memory limit validation tasks need to measure memory resident set
+/// size (RSS), so they must run in a separate process.
+#[test]


This test takes more than 60 seconds on my laptop (which is longer than any other test). Is there any way we can speed it up?

SLOW [> 60.000s] datafusion::core_integration memory_limit::memory_limit_validation::sort_mem_validation::test_runner
PASS [ 64.625s] datafusion::core_integration memory_limit::memory_limit_validation::sort_mem_validation::test_runner

I think it is because the subprocess is calling cargo test again (which is causing a recompile)

I found that the recompile happens only on the initial test compilation; later test runs don't recompile.
I can't find a way to avoid it 🤦🏼

alamb · 2025-01-16T21:03:58Z

datafusion/core/tests/memory_limit/memory_limit_validation/sort_mem_validation.rs

+
+    let mut handles = vec![];
+
+    // Run tests in parallel, each test in a separate process


I suggest we break these into their own tests (that each call a helper function) and leave the threading to the test runner (cargo test or cargo nextest)

That makes:

The reporting better (the test runner prints out what tests are running)

Controls threads better (the user can control the runner)

Great point, updated in abd5d4e

alamb · 2025-01-16T21:06:10Z

To add it to the extended suite I suggest gating the tests with an environment variable (so normal invocations of cargo test ... don't run these tests)

Perhaps like

DATAFUSION_EXTENDED_TEST=1 cargo test --test core_integration

Or something to that effect

alamb · 2025-01-16T21:07:16Z

datafusion/core/tests/memory_limit/memory_limit_validation/utils.rs

+    // Spawn a monitoring task
+    let monitor_handle = SpawnedTask::spawn(async move {
+        let mut sys = System::new_all();
+        let mut interval = interval(Duration::from_millis(20));


20 milliseconds seems quite long -- I would recommend a 1ms delay

I tried 1ms, but the tests take much longer to run. 7ms seems to be the smallest interval that doesn't affect execution speed. (So we have to make sure the profiled queries take much longer than the sampling interval to run; all current tests satisfy this.)
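The trade-off discussed here comes from peak sampling: a background task polls a reading at a fixed interval and keeps the maximum, so a peak shorter than the interval can be missed, while a very short interval adds overhead. The sketch below shows the idea with plain threads. It is not the PR's implementation (which uses an async task and `System`); `measure_peak` and its `sample` closure are hypothetical stand-ins for "read this process's RSS in bytes".

```rust
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

/// Runs `work` while a background thread calls `sample` every `interval`.
/// Returns the work's result together with the largest sample observed.
fn measure_peak<S, W, R>(sample: S, interval: Duration, work: W) -> (R, u64)
where
    S: Fn() -> u64 + Send + 'static,
    W: FnOnce() -> R,
{
    let done = Arc::new(AtomicBool::new(false));
    let peak = Arc::new(AtomicU64::new(0));

    let monitor = {
        let (done, peak) = (Arc::clone(&done), Arc::clone(&peak));
        thread::spawn(move || loop {
            // Record the sample first so at least one reading is always taken.
            peak.fetch_max(sample(), Ordering::Relaxed);
            if done.load(Ordering::Acquire) {
                break;
            }
            thread::sleep(interval);
        })
    };

    let result = work();
    // Signal the monitor to stop, then wait for its final sample.
    done.store(true, Ordering::Release);
    monitor.join().expect("monitor thread panicked");

    (result, peak.load(Ordering::Relaxed))
}
```

With this shape, the measured peak is only trustworthy when `work` runs for many multiples of `interval`, which matches the constraint noted in the comment above.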

alamb · 2025-01-16T21:08:15Z

datafusion/core/tests/memory_limit/memory_limit_validation/utils.rs

+
+    let (_, max_rss) = measure_max_rss(|| async { df.collect().await.unwrap() }).await;
+
+    println!(


this is quite clever, FWIW

2010YOUY01 · 2025-01-18T07:46:15Z

Thank you for the review @alamb

To add it to the extended suite I suggest gating the tests with an environment variable (so normal invocations of cargo test ... don't run these tests)

Perhaps like
DATAFUSION_EXTENDED_TEST=1 cargo test --test core_integration
Or something to that effect

I used a Cargo feature instead; this approach seems more common.
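For readers unfamiliar with feature gating, a rough sketch of what this can look like follows. The feature name `extended_tests` is an assumption taken from the follow-up issue titles on this PR, and the module layout and `extended_tests_enabled` helper are hypothetical; the feature would also need to be declared under `[features]` in Cargo.toml.

```rust
/// True when the crate was built with the assumed `extended_tests` feature,
/// e.g. via `cargo test --features extended_tests`.
fn extended_tests_enabled() -> bool {
    cfg!(feature = "extended_tests")
}

/// The slow validation tests live in a module that is compiled only when the
/// feature is on, so a plain `cargo test` neither builds nor runs them.
#[cfg(feature = "extended_tests")]
mod memory_limit_validation {
    #[test]
    fn sort_mem_test_1() {
        // ... memory limit validation for a sort query ...
    }
}
```

Compared with an environment variable, the gate is resolved at compile time, so the gated tests do not even appear in the test list unless the feature is enabled.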

alamb

Looks good to me -- thanks @2010YOUY01 -- I think we should merge it in and give it a try.

If it turns out that this is hard to maintain / fails intermittently we can reassess

alamb · 2025-01-19T14:52:36Z

I looked at the CI run after this merged to main and it ran well: https://github.com/apache/datafusion/actions/runs/12850882520/job/35830990299

I did file a small follow-on PR to make the job naming clearer:

Minor: Rename extended test job name #14199

alamb · 2025-01-22T21:43:14Z

🎉

External memory limit validation for sort

7cc98fd

github-actions bot added the core Core DataFusion crate label Jan 16, 2025

add bug tracker

e5370a9

2010YOUY01 commented Jan 16, 2025

2010YOUY01 mentioned this pull request Jan 16, 2025

A memory-limited sort query fails #14143

Open

cleanup

4d2e711

alamb reviewed Jan 16, 2025

2010YOUY01 added 2 commits January 18, 2025 10:59

Update submodule

ccdc233

reviews

abd5d4e

github-actions bot added the development-process Related to development process of DataFusion label Jan 18, 2025

2010YOUY01 added 2 commits January 18, 2025 15:36

fix CI

57f5446

Merge branch 'main' into mem-validation

4a8d97e

move feature to module level

4c6290c

alamb mentioned this pull request Jan 18, 2025

Jan 18, 2025: This week(s) in DataFusion #14179

Closed

alamb approved these changes Jan 18, 2025

alamb changed the title ~~Test: Validate memory limit for sort queries~~ Test: Validate memory limit for sort queries to extended test Jan 18, 2025

2010YOUY01 merged commit 0283077 into apache:main Jan 19, 2025
28 checks passed

alamb mentioned this pull request Jan 19, 2025

Minor: Rename extended test job name #14199

Merged

alamb mentioned this pull request Feb 11, 2025

Disable extended tests (extended_tests) that are failing on runner #14604

Merged

2010YOUY01 mentioned this pull request Feb 14, 2025

bug: Fix memory reservation and allocation problems for SortExec #14644

Merged

alamb mentioned this pull request Feb 15, 2025

extended_test (with memory limit tracking) are commented out #14680

Closed
