[Fix](bug) fix the divide zero in local shuffle #37906

Merged
merged 1 commit into from
Jul 16, 2024

HappenLee
Contributor

Proposed changes

If num_buckets == 0, the fragment is colocated by an exchange node rather than a
scan node. In that case, use _num_instance in place of num_buckets to avoid dividing by zero
while still keeping the colocate plan after local shuffle.
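
The guard described above can be sketched roughly as follows. This is a minimal illustration, not the actual Doris change: the helpers `choose_partition_count` and `channel_for_hash` are hypothetical, and the parameter `num_instances` stands in for the `_num_instance` member mentioned in the description.

```cpp
#include <cstdint>
#include <stdexcept>

// Hypothetical helper (not the actual Doris code): pick the divisor used to
// map a row hash to a local-shuffle channel.
// Assumption taken from the description: num_buckets == 0 means the fragment
// is colocated by an exchange node, so the instance count is used instead.
inline uint32_t choose_partition_count(uint32_t num_buckets, uint32_t num_instances) {
    const uint32_t count = (num_buckets == 0) ? num_instances : num_buckets;
    if (count == 0) {
        // Defensive check: a zero divisor in `hash % count` is what raises SIGFPE.
        throw std::invalid_argument("partition count must be positive");
    }
    return count;
}

// Map a row hash to a channel index; the divisor is never zero here.
inline uint32_t channel_for_hash(uint32_t hash, uint32_t num_buckets, uint32_t num_instances) {
    return hash % choose_partition_count(num_buckets, num_instances);
}
```

With 4 instances and no buckets, for example, `channel_for_hash(10, 0, 4)` falls back to the instance count and yields channel 2 instead of dividing by zero.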

coredump:

SIGFPE integer divide by zero (@0x56431791a54a) received by PID 33673 (TID 37768 OR 0x7f8028018640) from PID 395421002; stack trace: ***
0# doris::signal::(anonymous namespace)::FailureSignalHandler(int, siginfo_t*, void*) at /home/zcp/repo_center/doris_branch-3.0/doris/be/src/common/signal_handler.h:421
1# PosixSignals::chained_handler(int, siginfo*, void*) [clone .part.0] in /usr/lib/jvm/java-17-openjdk-amd64/lib/server/libjvm.so
2# JVM_handle_linux_signal in /usr/lib/jvm/java-17-openjdk-amd64/lib/server/libjvm.so
3# 0x00007F8C47895520 in /lib/x86_64-linux-gnu/libc.so.6
4# doris::vectorized::Partitioner::do_partitioning(doris::RuntimeState*, doris::vectorized::Block*, doris::MemTracker*) const at /home/zcp/repo_center/doris_branch-3.0/doris/be/src/vec/runtime/partitioner.cpp:50
5# doris::pipeline::ShuffleExchanger::sink(doris::RuntimeState*, doris::vectorized::Block*, bool, doris::pipeline::LocalExchangeSinkLocalState&) at /home/zcp/repo_center/doris_branch-3.0/doris/be/src/pipeline/local_exchange/local_exchanger.cpp:33
6# doris::pipeline::LocalExchangeSinkOperatorX::sink(doris::RuntimeState*, doris::vectorized::Block*, bool) in /mnt/ssd01/doris-branch40preview/NEREIDS_ASAN/be/lib/doris_be
7# doris::pipeline::PipelineTask::execute(bool*) at /home/zcp/repo_center/doris_branch-3.0/doris/be/src/pipeline/pipeline_task.cpp:359
8# doris::pipeline::TaskScheduler::_do_work(unsigned long) at /home/zcp/repo_center/doris_branch-3.0/doris/be/src/pipeline/task_scheduler.cpp:138
9# doris::ThreadPool::dispatch_thread() in /mnt/ssd01/doris-branch40preview/NEREIDS_ASAN/be/lib/doris_be
10# doris::Thread::supervise_thread(void*) at /home/zcp/repo_center/doris_branch-3.0/doris/be/src/util/thread.cpp:499
11# start_thread at ./nptl/pthread_create.c:442
12# 0x00007F8C47979850 at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:83

@doris-robot

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR

Since 2024-03-18, the Document has been moved to doris-website.
See Doris Document.

@HappenLee
Contributor Author

run buildall

@github-actions github-actions bot added the approved Indicates a PR has been approved by one committer. label Jul 16, 2024
Contributor

PR approved by at least one committer and no changes requested.

Contributor

PR approved by anyone and no changes requested.

Contributor

clang-tidy review says "All clean, LGTM! 👍"

@HappenLee HappenLee changed the title [Fix] fix the divide zero in local shuffle [Fix](bug) fix the divide zero in local shuffle Jul 16, 2024
@doris-robot

TPC-H: Total hot run time: 40032 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
TPC-H sf100 test result on commit 955c4eb36ee441f2e180b35c5b7e59305680bcad, data reload: false

------ Round 1 ----------------------------------
query	cold run (ms)	hot run 1 (ms)	hot run 2 (ms)	best hot run (ms)
q1	18619	4764	4351	4351
q2	2015	195	197	195
q3	10443	1159	1113	1113
q4	10198	846	849	846
q5	7614	2748	2675	2675
q6	224	137	136	136
q7	952	620	595	595
q8	9220	2105	2126	2105
q9	8891	6599	6585	6585
q10	8704	3789	3789	3789
q11	450	234	240	234
q12	391	218	223	218
q13	17764	2996	2969	2969
q14	274	245	223	223
q15	525	493	479	479
q16	477	381	387	381
q17	972	678	746	678
q18	8116	7415	7533	7415
q19	7625	1351	1281	1281
q20	706	315	332	315
q21	4886	3172	3269	3172
q22	346	282	277	277
Total cold run time: 119412 ms
Total hot run time: 40032 ms

----- Round 2, with runtime_filter_mode=off -----
query	cold run (ms)	hot run 1 (ms)	hot run 2 (ms)	best hot run (ms)
q1	4385	4306	4232	4232
q2	372	278	270	270
q3	3037	2738	2748	2738
q4	1870	1611	1573	1573
q5	5330	5344	5351	5344
q6	221	131	130	130
q7	2139	1697	1747	1697
q8	3197	3373	3349	3349
q9	8435	8448	8412	8412
q10	3883	3678	3629	3629
q11	594	485	487	485
q12	761	618	614	614
q13	17532	2945	2956	2945
q14	307	278	268	268
q15	524	479	467	467
q16	487	424	434	424
q17	1791	1493	1456	1456
q18	7718	7452	7348	7348
q19	1664	1548	1610	1548
q20	1997	1781	1805	1781
q21	4866	4711	4595	4595
q22	562	537	472	472
Total cold run time: 71672 ms
Total hot run time: 53777 ms

@doris-robot

TPC-DS: Total hot run time: 172655 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 955c4eb36ee441f2e180b35c5b7e59305680bcad, data reload: false

query	cold run (ms)	hot run 1 (ms)	hot run 2 (ms)	best hot run (ms)
query1	908	380	368	368
query2	6449	1809	1762	1762
query3	6656	205	218	205
query4	28236	17591	17224	17224
query5	4235	489	477	477
query6	270	190	169	169
query7	4590	287	281	281
query8	252	204	197	197
query9	8670	2381	2362	2362
query10	450	286	264	264
query11	10719	10088	10080	10080
query12	131	86	94	86
query13	1665	378	366	366
query14	10117	7942	8108	7942
query15	244	165	167	165
query16	7356	326	314	314
query17	1810	545	520	520
query18	1344	283	284	283
query19	204	154	152	152
query20	94	82	81	81
query21	213	125	123	123
query22	4261	4120	4094	4094
query23	33627	33117	33158	33117
query24	11159	2905	2890	2890
query25	612	361	363	361
query26	1182	152	148	148
query27	2571	267	276	267
query28	7021	2018	2003	2003
query29	875	623	621	621
query30	284	149	149	149
query31	978	754	774	754
query32	99	56	52	52
query33	778	294	301	294
query34	919	484	485	484
query35	683	570	580	570
query36	1073	964	945	945
query37	152	77	80	77
query38	2858	2788	2793	2788
query39	879	812	801	801
query40	283	120	116	116
query41	47	47	52	47
query42	120	94	96	94
query43	534	440	450	440
query44	1159	746	736	736
query45	201	161	159	159
query46	1087	738	716	716
query47	1858	1768	1774	1768
query48	371	298	299	298
query49	1105	410	418	410
query50	783	396	390	390
query51	6945	6820	6805	6805
query52	100	94	94	94
query53	357	284	289	284
query54	999	448	443	443
query55	75	72	75	72
query56	287	265	265	265
query57	1125	1057	1052	1052
query58	245	245	298	245
query59	2972	2798	2584	2584
query60	294	289	283	283
query61	96	93	97	93
query62	845	651	659	651
query63	321	320	279	279
query64	9866	2229	1682	1682
query65	3166	3112	3101	3101
query66	1053	330	343	330
query67	15413	14969	14955	14955
query68	4554	526	516	516
query69	481	332	322	322
query70	1070	1085	1084	1084
query71	420	272	286	272
query72	7054	5541	5294	5294
query73	740	336	327	327
query74	6133	5744	5661	5661
query75	3497	2700	2712	2700
query76	2800	896	898	896
query77	477	303	307	303
query78	9583	8895	8887	8887
query79	2463	528	506	506
query80	2311	521	468	468
query81	609	219	216	216
query82	921	137	131	131
query83	298	166	166	166
query84	267	89	93	89
query85	2234	313	298	298
query86	486	301	307	301
query87	3294	3139	3091	3091
query88	4057	2475	2445	2445
query89	486	374	378	374
query90	1851	188	190	188
query91	131	99	99	99
query92	68	48	46	46
query93	3024	498	492	492
query94	1154	210	209	209
query95	391	319	312	312
query96	604	276	274	274
query97	3178	3019	3034	3019
query98	214	195	202	195
query99	1653	1246	1247	1246
Total cold run time: 282834 ms
Total hot run time: 172655 ms

@doris-robot

ClickBench: Total hot run time: 30.69 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 955c4eb36ee441f2e180b35c5b7e59305680bcad, data reload: false

query	cold run (s)	hot run 1 (s)	hot run 2 (s)
query1	0.04	0.03	0.04
query2	0.07	0.04	0.03
query3	0.23	0.05	0.05
query4	1.68	0.07	0.06
query5	0.51	0.49	0.49
query6	1.13	0.73	0.73
query7	0.02	0.02	0.01
query8	0.05	0.04	0.05
query9	0.55	0.48	0.49
query10	0.53	0.54	0.55
query11	0.15	0.11	0.12
query12	0.15	0.12	0.12
query13	0.61	0.58	0.59
query14	0.74	0.77	0.80
query15	0.84	0.82	0.82
query16	0.37	0.37	0.35
query17	1.05	0.96	0.96
query18	0.22	0.22	0.22
query19	1.78	1.68	1.72
query20	0.01	0.01	0.01
query21	15.42	0.75	0.66
query22	3.90	7.46	2.02
query23	18.69	1.42	1.32
query24	2.17	0.23	0.23
query25	0.17	0.08	0.09
query26	0.30	0.22	0.20
query27	0.45	0.23	0.22
query28	13.20	1.01	1.00
query29	12.63	3.31	3.33
query30	0.25	0.06	0.06
query31	2.88	0.38	0.39
query32	3.27	0.47	0.47
query33	2.88	2.91	2.94
query34	17.07	4.41	4.36
query35	4.38	4.35	4.43
query36	0.65	0.46	0.50
query37	0.19	0.15	0.16
query38	0.15	0.15	0.14
query39	0.04	0.03	0.04
query40	0.15	0.12	0.12
query41	0.10	0.05	0.05
query42	0.06	0.05	0.05
query43	0.04	0.05	0.04
Total cold run time: 109.77 s
Total hot run time: 30.69 s

@yiguolei yiguolei added the p0_c label Jul 16, 2024
Member

@mrhhsg mrhhsg left a comment

lgtm

@HappenLee HappenLee merged commit c138902 into apache:master Jul 16, 2024
26 of 31 checks passed
yiguolei pushed a commit that referenced this pull request Jul 16, 2024
## Proposed changes

cherry pick #37906 
seawinde pushed a commit to seawinde/doris that referenced this pull request Jul 17, 2024
If `num_buckets == 0`, the fragment is colocated by an exchange node rather than
a scan node. In that case, use `_num_instance` in place of `num_buckets` to avoid
dividing by zero while still keeping the colocate plan after local shuffle.


`coredump`:
```
SIGFPE integer divide by zero (@0x56431791a54a) received by PID 33673 (TID 37768 OR 0x7f8028018640) from PID 395421002; stack trace: ***
0# doris::signal::(anonymous namespace)::FailureSignalHandler(int, siginfo_t*, void*) at /home/zcp/repo_center/doris_branch-3.0/doris/be/src/common/signal_handler.h:421
1# PosixSignals::chained_handler(int, siginfo*, void*) [clone .part.0] in /usr/lib/jvm/java-17-openjdk-amd64/lib/server/libjvm.so
2# JVM_handle_linux_signal in /usr/lib/jvm/java-17-openjdk-amd64/lib/server/libjvm.so
3# 0x00007F8C47895520 in /lib/x86_64-linux-gnu/libc.so.6
4# doris::vectorized::Partitioner::do_partitioning(doris::RuntimeState*, doris::vectorized::Block*, doris::MemTracker*) const at /home/zcp/repo_center/doris_branch-3.0/doris/be/src/vec/runtime/partitioner.cpp:50
5# doris::pipeline::ShuffleExchanger::sink(doris::RuntimeState*, doris::vectorized::Block*, bool, doris::pipeline::LocalExchangeSinkLocalState&) at /home/zcp/repo_center/doris_branch-3.0/doris/be/src/pipeline/local_exchange/local_exchanger.cpp:33
6# doris::pipeline::LocalExchangeSinkOperatorX::sink(doris::RuntimeState*, doris::vectorized::Block*, bool) in /mnt/ssd01/doris-branch40preview/NEREIDS_ASAN/be/lib/doris_be
7# doris::pipeline::PipelineTask::execute(bool*) at /home/zcp/repo_center/doris_branch-3.0/doris/be/src/pipeline/pipeline_task.cpp:359
8# doris::pipeline::TaskScheduler::_do_work(unsigned long) at /home/zcp/repo_center/doris_branch-3.0/doris/be/src/pipeline/task_scheduler.cpp:138
9# doris::ThreadPool::dispatch_thread() in /mnt/ssd01/doris-branch40preview/NEREIDS_ASAN/be/lib/doris_be
10# doris::Thread::supervise_thread(void*) at /home/zcp/repo_center/doris_branch-3.0/doris/be/src/util/thread.cpp:499
11# start_thread at ./nptl/pthread_create.c:442
12# 0x00007F8C47979850 at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:83
```
dataroaring pushed a commit that referenced this pull request Jul 17, 2024
Labels
approved Indicates a PR has been approved by one committer. dev/2.1.5-merged dev/3.0.1-merged p0_c reviewed
6 participants