
Bumped breeze version #40

Closed
wants to merge 1 commit into from

Conversation

@superbobry commented May 15, 2017

@bloody76 left a comment

Seems ok to me.

@ashangit

@superbobry Since we want to avoid diverging too much from Apache Spark, do we have a request open on the Spark JIRA for this version bump (at least for Spark 2.1)?

@superbobry (Author) commented May 16, 2017

@ashangit that's a valid point. I'll try to push that upstream first.

Update: there is an upstream PR merged into 2.2: apache#17746. Do you think we should backport it instead?

@ashangit commented May 16, 2017

@superbobry Yes, I think that would be better, since the bump also seems to imply some changes to the ML lib code and tests.
It would then be great to push the backport (for 2.1) back to the Spark community.

By the way, for future PRs we do not need them pushed upstream first, but we do need a JIRA reference on the upstream project to ensure work is ongoing to push the PR upstream.

@AnthonyTruchet

Could you please link to the upstream patch here?

@superbobry (Author) commented May 16, 2017

> Would be great then to push back the backport (for 2.1) to the spark community.

I've asked upstream why they didn't merge it into 2.1. If they don't want it for compatibility reasons, then we would have to merge it into the Criteo fork, bypassing upstream.

> Could you please link here to the upstream patch?

Sure, there is a link in the above comment: apache#17746.

@superbobry (Author)

So the bottom line is: 2.1 will not get the new Breeze because of backward-incompatible API changes.

I think we had better wait for 2.2 instead of backporting the patch to 2.1.

@ashangit

That's fine with me.
Just be aware that Spark 2.2 will drop support for Scala 2.10, and I don't know when we will migrate to 2.11, so Spark 2.2 could take a while to become available on our platform.

@superbobry superbobry closed this Dec 18, 2017
@Willymontaz Willymontaz deleted the bump-breeze-1.6 branch April 2, 2019 15:06
jetoile pushed a commit that referenced this pull request May 31, 2024
…edExecutorBackend

### What changes were proposed in this pull request?
Fix a subtle thread-safety issue in CoarseGrainedExecutorBackend where an executor process randomly gets stuck.

### Why are the changes needed?
For each executor, the single-threaded dispatcher can run into an "infinite loop" (as explained in SPARK-45227). Once an executor process runs into that state, it stops launching tasks from the driver and stops reporting task status back.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
```
$ build/mvn package -DskipTests -pl core
$ build/mvn -Dtest=none -DwildcardSuites=org.apache.spark.executor.CoarseGrainedExecutorBackendSuite test
```

### Was this patch authored or co-authored using generative AI tooling?
No

******************************************************************************
**_Please feel free to skip reading unless you're interested in details_**
******************************************************************************

### Symptom

Our Spark 3 app running on EMR 6.10.0 with Spark 3.3.1 got stuck in the very last step of writing a data frame to S3 by calling `df.write`. Looking at the Spark UI, we saw that an executor process hung for over an hour. After we manually killed the executor process, the app succeeded. Note that the same EMR cluster with two worker nodes was able to run the same app without any issue before and after the incident.

Below is what's observed from relevant container logs and thread dump.

- A regular task that was sent to the executor, which also reported back to the driver upon task completion.

```
    $zgrep 'task 150' container_1694029806204_12865_01_000001/stderr.gz
    23/09/12 18:13:55 INFO TaskSetManager: Starting task 150.0 in stage 23.0 (TID 923) (ip-10-0-185-107.ec2.internal, executor 3, partition 150, NODE_LOCAL, 4432 bytes) taskResourceAssignments Map()
    23/09/12 18:13:55 INFO TaskSetManager: Finished task 150.0 in stage 23.0 (TID 923) in 126 ms on ip-10-0-185-107.ec2.internal (executor 3) (16/200)

    $zgrep ' 923' container_1694029806204_12865_01_000004/stderr.gz
    23/09/12 18:13:55 INFO YarnCoarseGrainedExecutorBackend: Got assigned task 923

    $zgrep 'task 150' container_1694029806204_12865_01_000004/stderr.gz
    23/09/12 18:13:55 INFO Executor: Running task 150.0 in stage 23.0 (TID 923)
    23/09/12 18:13:55 INFO Executor: Finished task 150.0 in stage 23.0 (TID 923). 4495 bytes result sent to driver
```

- Another task that was sent to the executor but never launched, since the single-threaded dispatcher was stuck (presumably in an "infinite loop", as explained later).

```
    $zgrep 'task 153' container_1694029806204_12865_01_000001/stderr.gz
    23/09/12 18:13:55 INFO TaskSetManager: Starting task 153.0 in stage 23.0 (TID 924) (ip-10-0-185-107.ec2.internal, executor 3, partition 153, NODE_LOCAL, 4432 bytes) taskResourceAssignments Map()

    $zgrep ' 924' container_1694029806204_12865_01_000004/stderr.gz
    23/09/12 18:13:55 INFO YarnCoarseGrainedExecutorBackend: Got assigned task 924

    $zgrep 'task 153' container_1694029806204_12865_01_000004/stderr.gz
    >> note that the above command has no matching result, indicating that task 153.0 in stage 23.0 (TID 924) was never launched
```

- The thread dump shows that the dispatcher-Executor thread has the following stack trace.

```
    "dispatcher-Executor" #40 daemon prio=5 os_prio=0 tid=0x0000ffff98e37800 nid=0x1aff runnable [0x0000ffff73bba000]
    java.lang.Thread.State: RUNNABLE
    at scala.runtime.BoxesRunTime.equalsNumObject(BoxesRunTime.java:142)
    at scala.runtime.BoxesRunTime.equals2(BoxesRunTime.java:131)
    at scala.runtime.BoxesRunTime.equals(BoxesRunTime.java:123)
    at scala.collection.mutable.HashTable.elemEquals(HashTable.scala:365)
    at scala.collection.mutable.HashTable.elemEquals$(HashTable.scala:365)
    at scala.collection.mutable.HashMap.elemEquals(HashMap.scala:44)
    at scala.collection.mutable.HashTable.findEntry0(HashTable.scala:140)
    at scala.collection.mutable.HashTable.findOrAddEntry(HashTable.scala:169)
    at scala.collection.mutable.HashTable.findOrAddEntry$(HashTable.scala:167)
    at scala.collection.mutable.HashMap.findOrAddEntry(HashMap.scala:44)
    at scala.collection.mutable.HashMap.put(HashMap.scala:126)
    at scala.collection.mutable.HashMap.update(HashMap.scala:131)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:200)
    at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
    at org.apache.spark.rpc.netty.Inbox$$Lambda$323/1930826709.apply$mcV$sp(Unknown Source)
    at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
    at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
    at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
    at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)
```

### Relevant code paths

Within an executor process, there's a [dispatcher thread](https://github.com/apache/spark/blob/1fdd46f173f7bc90e0523eb0a2d5e8e27e990102/core/src/main/scala/org/apache/spark/rpc/netty/MessageLoop.scala#L170) dedicated to CoarseGrainedExecutorBackend (a single RPC endpoint) that launches tasks scheduled by the driver. Each task is run on a TaskRunner thread backed by a thread pool created for the executor. The TaskRunner thread and the dispatcher thread are different. However, they read and write a common object (i.e., taskResources) that's a mutable hashmap without thread-safety, in [Executor](https://github.com/apache/spark/blob/1fdd46f173f7bc90e0523eb0a2d5e8e27e990102/core/src/main/scala/org/apache/spark/executor/Executor.scala#L561) and [CoarseGrainedExecutorBackend](https://github.com/apache/spark/blob/1fdd46f173f7bc90e0523eb0a2d5e8e27e990102/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala#L189), respectively.
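The access pattern above can be sketched with a minimal standalone example. This is an illustration only, not Spark's actual code: the map name `taskResources` and the dispatcher/runner roles mirror the description above, and a `ConcurrentHashMap` is used here as one standard way to make such cross-thread map access safe (with a plain mutable `HashMap`, the same pattern is a data race).

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;

// Simplified illustration (not Spark's actual code): a "dispatcher" thread
// publishes per-task entries while a "TaskRunner" thread consumes and
// removes them. ConcurrentHashMap makes each put/remove atomic and safe
// for concurrent use; an unsynchronized HashMap would not be.
public class TaskResourcesDemo {
    // Hypothetical stand-in for the shared taskResources map.
    static final Map<Long, String> taskResources = new ConcurrentHashMap<>();

    public static void main(String[] args) throws InterruptedException {
        final int tasks = 10_000;
        CountDownLatch done = new CountDownLatch(2);

        // "Dispatcher" thread: registers resources for each task id.
        Thread dispatcher = new Thread(() -> {
            for (long id = 0; id < tasks; id++) {
                taskResources.put(id, "resources-" + id);
            }
            done.countDown();
        });

        // "TaskRunner" thread: removes each entry once the task finishes.
        Thread runner = new Thread(() -> {
            for (long id = 0; id < tasks; id++) {
                // Spin until the dispatcher has published this task's entry.
                while (taskResources.remove(id) == null) {
                    Thread.onSpinWait();
                }
            }
            done.countDown();
        });

        dispatcher.start();
        runner.start();
        done.await();

        // Every entry was produced exactly once and consumed exactly once.
        System.out.println("remaining=" + taskResources.size());
    }
}
```

Running this prints `remaining=0` deterministically; with an unsynchronized map, the same workload could corrupt internal buckets instead.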

### What's going on?

Based on the above observations, our hypothesis is that the dispatcher thread runs into an "infinite loop" due to a race condition when two threads access the same hashmap object. For illustration purposes, consider the following scenario where two threads (Thread 1 and Thread 2) access a hash table without thread-safety.

- Thread 1 sees A.next = B, but then yields execution to Thread 2
<img src="https://issues.apache.org/jira/secure/attachment/13063040/13063040_hashtable1.png" width="400">

- Thread 2 triggers a resize operation resulting in B.next = A (Note that hashmap doesn't care about ordering), and then yields execution to Thread 1.
<img src="https://issues.apache.org/jira/secure/attachment/13063041/13063041_hashtable2.png" width="400">

- After taking over the CPU, Thread 1 runs into an "infinite loop" when traversing the list in the last bucket, since A.next = B and B.next = A in its view.

Closes apache#43021 from xiongbo-sjtu/master.

Authored-by: Bo Xiong <xiongbo@amazon.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
(cherry picked from commit 8e6b160)
Signed-off-by: Mridul Muralidharan <mridulatgmail.com>