
Eager speculative execution for final LIMIT stages #18862

Merged · 21 commits · Sep 20, 2023

Conversation

losipiuk (Member):

It is quite common to use

  SELECT .... LIMIT N

queries for data exploration.
Currently, when run in FTE mode, such queries require completion of all (or almost all)
tasks which read source data, even though most of the time the final answer
could be obtained much sooner.

This commit enables the EAGER_SPECULATIVE task execution mode for stages
which have a FINAL LIMIT operator. This allows final results to be returned
to the user much faster (assuming the exchange plugin in use supports
concurrent reads and writes).

Release notes

(x) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
( ) Release notes are required, with the following suggested text:

@cla-bot added the cla-signed label on Aug 30, 2023
@losipiuk marked this pull request as ready for review on September 4, 2023 11:37
Comment on lines -1205 to -1241
if (nonSpeculativeTasksWaitingForNode >= maxTasksWaitingForNode) {
break;
Member:

In "Keep executionClass explicit in SchedulingQueue and PrioritizedSchedu…", where did it go to?

Not sure if this commit is supposed to be a pure syntactical refactor or something more?

Member Author:

It changes the internal model of SchedulingQueue a bit. No behavioral changes. Added some comments in the commit message.

@@ -1412,7 +1412,7 @@ private void loadMoreTaskDescriptorsIfNecessary()
{
boolean schedulingQueueIsFull = schedulingQueue.getTaskCount(STANDARD) >= maxTasksWaitingForExecution;
for (StageExecution stageExecution : stageExecutions.values()) {
-            if (!schedulingQueueIsFull || stageExecution.hasOpenTaskRunning()) {
+            if (!schedulingQueueIsFull || stageExecution.hasOpenTaskRunning() || stageExecution.isEager()) {
Member:

Should this be here or in some previous commit? (currently it's in "Eagerly enumerate splits for eager stages")

Member Author:

This one (surprisingly) is actually in the right commit :P

@losipiuk force-pushed the lo/specul-limit branch 2 times, most recently from 93b749a to 1439ea3 on September 5, 2023 12:28

while (!schedulingQueue.isEmpty()) {
PrioritizedScheduledTask scheduledTask;
if (schedulingQueue.getTaskCount(STANDARD) > 0) {

if (schedulingQueue.getTaskCount(EAGER_SPECULATIVE) > 0 && eagerSpeculativeTasksWaitingForNode < maxTasksWaitingForNode) {
Member:

Should maxTasksWaitingForNode account for both standard and eagerSpeculative?

Member Author:

I think it does not matter much; it is mostly about how much state we want to keep. The waiting tasks are not really utilizing resources.
The way I handle EAGER_SPECULATIVE matches the logic we had so far for STANDARD and SPECULATIVE, which are accounted for separately.
Having separate lists may be better in a way: if an EAGER_SPECULATIVE task requires fewer resources than a STANDARD task, we still have a chance to schedule it even if bigger tasks are waiting.
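
To make the per-class accounting concrete, here is a minimal sketch of the idea; the class name, method names, and the single shared cap are illustrative assumptions, not the actual Trino scheduler code:

    import java.util.EnumMap;
    import java.util.Map;

    // Each execution class is counted separately, so a small EAGER_SPECULATIVE
    // task can still start waiting for a node while STANDARD tasks queue up.
    class WaitingForNodeLimiter
    {
        enum ExecutionClass { STANDARD, SPECULATIVE, EAGER_SPECULATIVE }

        private final int maxTasksWaitingForNode; // cap applied per class
        private final Map<ExecutionClass, Integer> waiting = new EnumMap<>(ExecutionClass.class);

        WaitingForNodeLimiter(int maxTasksWaitingForNode)
        {
            this.maxTasksWaitingForNode = maxTasksWaitingForNode;
        }

        boolean tryStartWaiting(ExecutionClass executionClass)
        {
            int current = waiting.getOrDefault(executionClass, 0);
            if (current >= maxTasksWaitingForNode) {
                return false; // enough tasks of this class are already waiting
            }
            waiting.put(executionClass, current + 1);
            return true;
        }

        void nodeAcquired(ExecutionClass executionClass)
        {
            waiting.merge(executionClass, -1, Integer::sum);
        }
    }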

@@ -139,6 +142,7 @@ public BinPackingNodeAllocatorService(
this.memoryRequirementIncreaseOnWorkerCrashEnabled = memoryRequirementIncreaseOnWorkerCrashEnabled;
this.allowedNoMatchingNodePeriod = requireNonNull(allowedNoMatchingNodePeriod, "allowedNoMatchingNodePeriod is null");
this.taskRuntimeMemoryEstimationOverhead = requireNonNull(taskRuntimeMemoryEstimationOverhead, "taskRuntimeMemoryEstimationOverhead is null");
this.eagerSpeculativeTasksNodeMemoryOvercommit = eagerSpeculativeTasksNodeMemoryOvercommit;
Member:

We are able to overcommit because we can always kill them if there is not enough memory? Can we do the same for speculative tasks as well?

Member Author:

We can. We could even do it for STANDARD tasks; those will also be killed if we run out of memory. But if we wanted more STANDARD/SPECULATIVE tasks running together, we could just lower the memory estimate for those, which would be more straightforward.
The starting memory estimate for tasks should be adjusted to the level of concurrency we want; that is why we picked 5GB, which should match a ~100GB worker node fine.

Overcommit for EAGER tasks is important because we believe those are "special": typically they do not take many resources and have huge potential for finishing the query early. That is why we still want to run them, bumping task concurrency a bit, even if the cluster is fully booked on resources.
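
As a rough illustration of how such an overcommit could enter a bin-packing capacity check; the field names and sizes below are assumptions, not the actual BinPackingNodeAllocatorService logic:

    // Illustrative only: eager tasks see a slightly larger effective node
    // capacity, so one can still be placed on a fully booked worker.
    class NodeCapacityCheck
    {
        private final long nodeMemory = 100L << 30;      // e.g. a ~100GB worker pool
        private final long eagerOvercommit = 10L << 30;  // extra headroom for eager tasks

        boolean fits(long reservedMemory, long taskMemoryEstimate, boolean eagerSpeculative)
        {
            long effectiveCapacity = eagerSpeculative ? nodeMemory + eagerOvercommit : nodeMemory;
            return reservedMemory + taskMemoryEstimate <= effectiveCapacity;
        }
    }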

private static boolean hasSmallFinalLimitNode(SubPlan subPlan)
{
return PlanNodeSearcher.searchFrom(subPlan.getFragment().getRoot())
.where(node -> node instanceof LimitNode limitNode && !limitNode.isPartial() && limitNode.getCount() < 1_000_000)
Member:

For my knowledge, what does it mean if a limit node is "partial"?

Member Author:

Non-final. LIMIT is planned as

       LIMIT_FINAL (single)
                 |
       LIMIT_PARTIAL (distributed)

There is just a single task executing LIMIT_FINAL, while there are many LIMIT_PARTIAL nodes. Each LIMIT_PARTIAL outputs up to the limit number of rows, and LIMIT_FINAL does the final selection.

for (int i = 0; i < outputPartitionsCount; ++i) {
estimateBuilder.add(0);
}
return Optional.of(new OutputDataSizeEstimateResult(estimateBuilder.build(), OutputDataSizeEstimateStatus.ESTIMATED_FOR_EAGER_PARENT));
Member:

I don't quite understand here. Is this simply a placeholder full of 0s?

}

queue.addOrUpdate(prioritizedTask.task(), prioritizedTask.priority());
queues.values().forEach(queue -> queue.remove(prioritizedTask.task()));
Contributor:

What's the complexity of queue.remove?

Member Author:

It is a removal from a HashMap plus a removal from a TreeSet, so O(log N).

Contributor:

Oh, right. It's our custom priority queue. I was afraid it might involve a linear scan.
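
For reference, a minimal sketch of an indexed priority queue of this shape: a HashMap gives O(1) lookup of a task's current entry and a TreeSet keeps entries ordered by priority, so remove and addOrUpdate are O(log n) rather than a linear scan. This is illustrative, not the actual Trino implementation:

    import java.util.Comparator;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.TreeSet;

    class IndexedPriorityQueue<T>
    {
        private record Entry<T>(T value, long priority, long seq) {}

        private final Map<T, Entry<T>> index = new HashMap<>();
        private final TreeSet<Entry<T>> ordered = new TreeSet<>(
                Comparator.comparingLong((Entry<T> e) -> e.priority())
                        .thenComparingLong(Entry::seq));
        private long nextSeq; // tie-breaker so equal priorities never collide

        void addOrUpdate(T value, long priority)
        {
            remove(value); // drop any existing entry first: O(log n)
            Entry<T> entry = new Entry<>(value, priority, nextSeq++);
            index.put(value, entry);
            ordered.add(entry);
        }

        void remove(T value)
        {
            Entry<T> entry = index.remove(value); // O(1)
            if (entry != null) {
                ordered.remove(entry); // O(log n)
            }
        }

        T poll()
        {
            Entry<T> first = ordered.pollFirst();
            if (first == null) {
                return null;
            }
            index.remove(first.value());
            return first.value();
        }
    }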

int outputPartitionsCount = sinkPartitioningScheme.getPartitionCount();
ImmutableLongArray.Builder estimateBuilder = ImmutableLongArray.builder(outputPartitionsCount);
for (int i = 0; i < outputPartitionsCount; ++i) {
estimateBuilder.add(0);
Contributor:

This is supposed to ensure that only a single "eager" task is created? (With the current logic it should always be a single one anyway, as we are doing this only for simple LIMIT queries, right?)

Member Author:

Yeah, this is the current assumption: that a single task will be enough. If we wanted more tasks, we would need more elaborate estimate guessing here, but to be honest we do not have much information at hand, given that EAGER tasks are (at least for now) meant to handle cases when the query has just barely started.
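
To make the zero-estimate behavior concrete, here is a minimal sketch of size-based partition-to-task packing, assuming FTE groups output partitions into tasks up to a target size; the method and target are illustrative, not Trino's actual code:

    import java.util.ArrayList;
    import java.util.List;

    class PartitionPacking
    {
        // With an all-zero size estimate no partition ever exceeds the target,
        // so every output partition lands in a single downstream task.
        static List<List<Integer>> packPartitionsIntoTasks(long[] partitionSizes, long targetTaskSize)
        {
            List<List<Integer>> tasks = new ArrayList<>();
            List<Integer> current = new ArrayList<>();
            long currentSize = 0;
            for (int partition = 0; partition < partitionSizes.length; partition++) {
                if (!current.isEmpty() && currentSize + partitionSizes[partition] > targetTaskSize) {
                    tasks.add(current);
                    current = new ArrayList<>();
                    currentSize = 0;
                }
                current.add(partition);
                currentSize += partitionSizes[partition];
            }
            if (!current.isEmpty()) {
                tasks.add(current);
            }
            return tasks;
        }
    }

Called with all-zero sizes, e.g. packPartitionsIntoTasks(new long[] {0, 0, 0, 0}, 1024), the sketch returns a single task containing all partitions, matching the single eager task discussed above.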


private static boolean hasSmallFinalLimitNode(SubPlan subPlan)
{
return PlanNodeSearcher.searchFrom(subPlan.getFragment().getRoot())
Contributor:

Is it possible for a plan to be more complex than a single LIMIT node? (For example Final Aggregation -> Limit.)

What do you think about having an extra condition that the stage distribution is "SINGLE"?

Member Author:

Is it possible for a plan to be more complex than a single LIMIT node? (For example Final Aggregation -> Limit.)

I think it is possible, and it would still be caught by this predicate, as we are just looking for a final Limit node in the fragment, not assuming it is the root.

What do you think about having an extra condition that the stage distribution is "SINGLE"?

As part of the predicate or more like an assertion? I think we should always get "SINGLE" distribution for a fragment with a final Limit node, right?

Contributor:

As part of the predicate or more like an assertion? I think we should always get "SINGLE" distribution for a fragment with a final Limit node, right?

That is my understanding as well, unless I'm missing some very non-obvious corner case.

I wonder if, purely for "documentation" purposes, it would be better to have an explicit check for "SINGLE"?

Member Author (losipiuk, Sep 11, 2023):

Changed method to:

        private static boolean hasSmallFinalLimitNode(SubPlan subPlan)
        {
            if (!subPlan.getFragment().getPartitioning().isSingleNode()) {
                // Final LIMIT should always have SINGLE distribution
                return false;
            }
            return PlanNodeSearcher.searchFrom(subPlan.getFragment().getRoot())
                    .where(node -> node instanceof LimitNode limitNode && !limitNode.isPartial() && limitNode.getCount() < 1_000_000)
                    .matches();
        }

@losipiuk force-pushed the lo/specul-limit branch 3 times, most recently from d58369a to b5cc2ce on September 18, 2023 09:39
EAGER_SPECULATIVE is used for tasks from stages that have some upstream stages still running, but which we want to schedule with high priority.
Such tasks will be scheduled even if the resources could otherwise be used to schedule STANDARD tasks.
EAGER_SPECULATIVE tasks are used to implement early termination of queries, when it
is probable that we do not need to run the upstream stages in full to produce the final query result.
EAGER_SPECULATIVE tasks will not prevent STANDARD tasks from being scheduled, and will still be picked
to be killed if needed when a worker runs out of memory; this is needed to prevent deadlocks.

Reorder fields so their order matches the order of constructor parameters.

While most FTE-related session properties were already hidden, some
were left visible. This was not intentional. Those session properties are
low-level toggles to be used for tweaking engine mechanics and will be removed
eventually.
@losipiuk force-pushed the lo/specul-limit branch 2 times, most recently from 39c47d3 to 8cd1491 on September 19, 2023 21:25
Only close a source exchange if the source stage writing to it is already done.
It could be that closeSourceExchanges was called because the downstream stage
already finished while some upstream stages are still running.
E.g. this may happen in the case of early limit termination.
It is quite common to use
  SELECT .... LIMIT N
queries for data exploration.
Currently, when run in FTE mode, such queries require completion of all (or almost all)
tasks which read source data, even though most of the time the final answer
could be obtained much sooner.

This commit enables the EAGER_SPECULATIVE task execution mode for stages
which have a FINAL LIMIT operator. This allows final results to be returned
to the user much faster (assuming the exchange plugin in use supports
concurrent reads and writes).
Extend the Exchange SPI so the engine can tell an exchange that it should deliver
source handles as soon as any are available, even if from a throughput
perspective it would make more sense to wait a bit and deliver a bigger
batch. This makes it possible to swiftly process stages which may
short-circuit query execution (like a top-level LIMIT).
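
A rough sketch of what such an extension point could look like; the enum and method names below are assumptions for illustration, not necessarily the actual Trino SPI:

    // Illustrative sketch only; the real Trino SPI names may differ.
    public interface Exchange
            extends AutoCloseable
    {
        enum SourceHandlesDeliveryMode
        {
            STANDARD, // batch source handles for throughput (default)
            EAGER,    // deliver source handles as soon as any are available
        }

        // ... existing exchange methods elided ...

        // Called by the engine when a consuming stage may short-circuit the
        // query (e.g. a final LIMIT), so waiting to accumulate bigger batches
        // of handles would only delay the result.
        void setSourceHandlesDeliveryMode(SourceHandlesDeliveryMode deliveryMode);
    }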