SQL query with window function PARTITION BY caused panic in 'tokio-runtime-worker' (SQLancer) #12057

Closed
2010YOUY01 opened this issue Aug 19, 2024 · 3 comments · Fixed by #12297
@2010YOUY01 (Contributor)

Describe the bug

The query below caused a panic.

SELECT
  sum(1) OVER (
    PARTITION BY false=false
  )
FROM
  t1
WHERE
  ((false > (v1 = v1)) IS DISTINCT FROM true);

This bug is triggered quite often by the fuzzer now that window functions have been added. I think it is related to the repartition execution used by window functions (common execution logic, not specific to any particular function); see the stack trace below.

To Reproduce

Run datafusion-cli on the latest main (commit 574dfeb):

DataFusion CLI v41.0.0
> create table t1(v1 int);
0 row(s) fetched.
Elapsed 0.065 seconds.

> insert into t1 values (42);
+-------+
| count |
+-------+
| 1     |
+-------+
1 row(s) fetched.
Elapsed 0.049 seconds.

> SELECT
  sum(1) OVER (
    PARTITION BY false=false
  )
FROM
  t1
WHERE
  ((false > (v1 = v1)) IS DISTINCT FROM true);

thread 'tokio-runtime-worker' panicked at /Users/yongting/Desktop/code/my_datafusion/arrow-datafusion/datafusion/physical-plan/src/repartition/mod.rs:313:79:
called `Result::unwrap()` on an `Err` value: InvalidArgumentError("must either specify a row count or at least one column")
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Join Error
caused by
External error: task 29 panicked

stack backtrace:
   0: rust_begin_unwind
             at /rustc/051478957371ee0084a7c0913941d2a8c4757bb9/library/std/src/panicking.rs:652:5
   1: core::panicking::panic_fmt
             at /rustc/051478957371ee0084a7c0913941d2a8c4757bb9/library/core/src/panicking.rs:72:14
   2: core::result::unwrap_failed
             at /rustc/051478957371ee0084a7c0913941d2a8c4757bb9/library/core/src/result.rs:1679:5
   3: core::result::Result<T,E>::unwrap
             at /rustc/051478957371ee0084a7c0913941d2a8c4757bb9/library/core/src/result.rs:1102:23
   4: datafusion_physical_plan::repartition::BatchPartitioner::partition_iter::{{closure}}
             at /Users/yongting/Desktop/code/my_datafusion/arrow-datafusion/datafusion/physical-plan/src/repartition/mod.rs:313:33
   5: core::ops::function::impls::<impl core::ops::function::FnOnce<A> for &mut F>::call_once
             at /rustc/051478957371ee0084a7c0913941d2a8c4757bb9/library/core/src/ops/function.rs:305:13
   6: core::option::Option<T>::map
             at /rustc/051478957371ee0084a7c0913941d2a8c4757bb9/library/core/src/option.rs:1075:29
   7: <core::iter::adapters::map::Map<I,F> as core::iter::traits::iterator::Iterator>::next
             at /rustc/051478957371ee0084a7c0913941d2a8c4757bb9/library/core/src/iter/adapters/map.rs:108:26
   8: <alloc::boxed::Box<I,A> as core::iter::traits::iterator::Iterator>::next
             at /rustc/051478957371ee0084a7c0913941d2a8c4757bb9/library/alloc/src/boxed.rs:1997:9
   9: datafusion_physical_plan::repartition::RepartitionExec::pull_from_input::{{closure}}
             at /Users/yongting/Desktop/code/my_datafusion/arrow-datafusion/datafusion/physical-plan/src/repartition/mod.rs:799:24
   ...
   tokio stuff
   ...
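For context on the panic message: in arrow-rs, a RecordBatch built from zero columns is rejected unless an explicit row count is supplied. A minimal standalone sketch (plain arrow-rs, not the DataFusion code path) that reproduces the same error message:

use std::sync::Arc;

use arrow::datatypes::Schema;
use arrow::record_batch::{RecordBatch, RecordBatchOptions};

fn main() {
    let empty_schema = Arc::new(Schema::empty());

    // Zero columns and no row count: fails with
    // "must either specify a row count or at least one column".
    let err = RecordBatch::try_new(empty_schema.clone(), vec![]).unwrap_err();
    println!("{err}");

    // Zero columns but an explicit row count: succeeds.
    let options = RecordBatchOptions::new().with_row_count(Some(1));
    let ok = RecordBatch::try_new_with_options(empty_schema, vec![], &options).unwrap();
    assert_eq!(ok.num_rows(), 1);
}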

Expected behavior

No response

Additional context

Found by SQLancer #11030

@2010YOUY01 added the bug (Something isn't working) label on Aug 19, 2024
@thinh2 (Contributor) commented Aug 19, 2024

take

@thinh2 (Contributor) commented Sep 1, 2024

Hi @2010YOUY01,

I have been stuck on this bug for several days without any progress; do you have any recommendations for debugging query execution issues like this? I can now reproduce the issue, and after turning on RUST_LOG=trace, here is the information I gathered, along with some guesses and questions:

  • Query's physical plan:
      CoalesceBatchesExec: target_batch_size=8192
        RepartitionExec: partitioning=Hash([true], 4), input_partitions=4
          RepartitionExec: partitioning=RoundRobinBatch(4), input_partitions=1
            ProjectionExec: expr=[]
              CoalesceBatchesExec: target_batch_size=8192
                FilterExec: (false > (v1@0 = v1@0)) IS DISTINCT FROM true
                  MemoryExec: partitions=1, partition_sizes=[1]
  • Debug log with error:
    [2024-08-31T01:31:23Z DEBUG datafusion_physical_plan::stream] Stopping execution: plan returned error: WindowAggExec: wdw=[sum(Int64(1)) PARTITION BY [Boolean(false) = Boolean(false)] ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING: Ok(Field { name: "sum(Int64(1)) PARTITION BY [Boolean(false) = Boolean(false)] ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }), frame: WindowFrame { units: Rows, start_bound: Preceding(UInt64(NULL)), end_bound: Following(UInt64(NULL)), is_causal: false }]

  • From my understanding, I suspect the issue lies in the hash repartition node RepartitionExec: partitioning=Hash([true], 4), input_partitions=4. My guess is that this RepartitionExec receives an empty (zero-column) RecordBatch because of the empty ProjectionExec: expr=[] above it, and processing that empty RecordBatch leads to the panic (see the sketch after this list). Is my guess correct, and how can I verify it? It seems to contradict your assumption that the bug is related to repartition execution in window functions. May I know what physical query plan the phrase repartition execution in window functions refers to?
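A rough way to test that guess in isolation, assuming BatchPartitioner::partition_iter rebuilds each per-partition batch by taking the selected rows from every column and then calling RecordBatch::try_new (which the unwrap at repartition/mod.rs:313 suggests). This is a hypothetical standalone reproduction, not the actual DataFusion code path:

use std::sync::Arc;

use arrow::array::{ArrayRef, UInt32Array};
use arrow::compute::take;
use arrow::datatypes::Schema;
use arrow::record_batch::{RecordBatch, RecordBatchOptions};

fn main() {
    // Roughly what ProjectionExec: expr=[] would emit: one row, zero columns.
    let schema = Arc::new(Schema::empty());
    let options = RecordBatchOptions::new().with_row_count(Some(1));
    let input = RecordBatch::try_new_with_options(schema.clone(), vec![], &options).unwrap();

    // Mimic the per-partition rebuild: take the selected row indices from
    // every column, then reassemble a RecordBatch. With zero columns nothing
    // is taken and try_new is handed an empty column list, which errors.
    let indices = UInt32Array::from(vec![0u32]);
    let columns: Vec<ArrayRef> = input
        .columns()
        .iter()
        .map(|c| take(c.as_ref(), &indices, None).unwrap())
        .collect();
    let result = RecordBatch::try_new(schema, columns);
    assert!(result.is_err()); // "must either specify a row count or at least one column"
}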

@2010YOUY01 (Contributor, Author)

@thinh2 I think you're correct; my earlier wording was ambiguous. The bug is likely related to RepartitionExec.
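For readers following along, one possible direction for a fix, sketched under the assumption that the zero-column rebuild is the culprit. This is only an illustration (the take_rows helper is hypothetical, and this is not necessarily what #12297 implements): supply the row count explicitly when reassembling the batch.

use arrow::array::{ArrayRef, UInt32Array};
use arrow::compute::take;
use arrow::error::ArrowError;
use arrow::record_batch::{RecordBatch, RecordBatchOptions};

// Hypothetical helper: rebuild `batch` keeping only the rows selected by
// `indices`, tolerating batches that have zero columns.
fn take_rows(batch: &RecordBatch, indices: &UInt32Array) -> Result<RecordBatch, ArrowError> {
    let columns: Vec<ArrayRef> = batch
        .columns()
        .iter()
        .map(|c| take(c.as_ref(), indices, None))
        .collect::<Result<_, _>>()?;

    // Carrying the row count forward keeps the zero-column case valid.
    let options = RecordBatchOptions::new().with_row_count(Some(indices.len()));
    RecordBatch::try_new_with_options(batch.schema(), columns, &options)
}

fn main() -> Result<(), ArrowError> {
    use std::sync::Arc;
    use arrow::datatypes::Schema;

    // A three-row, zero-column batch survives the rebuild with this approach.
    let options = RecordBatchOptions::new().with_row_count(Some(3));
    let batch = RecordBatch::try_new_with_options(Arc::new(Schema::empty()), vec![], &options)?;
    let out = take_rows(&batch, &UInt32Array::from(vec![0u32, 2]))?;
    assert_eq!(out.num_rows(), 2);
    Ok(())
}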
