[SPARK-14649][CORE] DagScheduler should not run duplicate tasks on fe… #17297
Conversation
Test build #74558 has finished for PR 17297 at commit
cc @kayousterhout - Addressed your earlier comment about #12436 ignoring fetch failures from stale map output. I have addressed this by recording an epoch for each registered map output; that way, if the task's epoch is smaller than the epoch of the map output, we can ignore the fetch failure. This also takes care of the epoch changes triggered by executor loss, i.e. when the executor that ran a shuffle task's map tasks is gone, as pointed out by @mridulm. Let me know what you think of the approach.
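For illustration, a minimal sketch of the epoch bookkeeping described above (illustrative names only, not the exact PR code; the real change keeps the epoch per map status in MapOutputTrackerMaster, as the diff excerpts later in this thread show):

import scala.collection.mutable

// Illustrative sketch only: record the epoch at which each map output was registered.
// A FetchFailed from a task whose epoch is smaller than the output's registration
// epoch refers to output that has since been regenerated, so it can be ignored.
object EpochSketch {
  private val epochOfMapOutput = mutable.HashMap.empty[(Int, Int), Long]  // (shuffleId, mapId) -> epoch

  def registerMapOutput(shuffleId: Int, mapId: Int, currentEpoch: Long): Unit =
    epochOfMapOutput((shuffleId, mapId)) = currentEpoch

  // Handle the failure only if the registered output is at least as old as the task;
  // otherwise the output was regenerated after the task launched and the failure is stale.
  def shouldHandleFetchFailure(shuffleId: Int, mapId: Int, taskEpoch: Long): Boolean =
    epochOfMapOutput.get((shuffleId, mapId)).exists(_ <= taskEpoch)
}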
Force-pushed from e5429d3 to 279b09a.
Test build #74560 has finished for PR 17297 at commit
Force-pushed from f127150 to 0bcc69a.
Test build #74562 has finished for PR 17297 at commit
Test build #74566 has finished for PR 17297 at commit
@sitalkedia I won't have time to review this in detail for at least a few weeks, just so you know (although others may have time to review / merge it).

At a very high level, I'm concerned about the amount of complexity that this adds to the scheduler code. We've recently had to deal with a number of subtle bugs with jobs hanging or Spark crashing as a result of trying to handle map output from old tasks. As a result, I'm hesitant to add more complexity -- and the associated risk of bugs that cause job failures, plus the expense of maintaining the code -- to improve performance.

At this point I'd lean towards cancelling outstanding map tasks when a fetch failure occurs (there's currently a TODO in the code to do this) to simplify these issues. This would improve performance in some ways, by freeing up slots that could be used for something else, at the expense of wasted work if the tasks have already made significant progress. But it would significantly simplify the scheduler code, which, given the debugging and reviewer time that has gone into fixing subtle issues with this code path, I think is worthwhile.

Curious what other folks think here.
@kayousterhout - I understand your concern and I agree that canceling the running tasks is definitely a simpler approach, but it is very inefficient for large jobs where tasks can run for hours. In our environment, where fetch failures are common, this change not only improves the performance of jobs that hit fetch failures, it also helps reliability. If we cancel all running reducers, we might end up in a state where jobs make no progress at all under frequent fetch failures, because they just flip-flop between two stages.

Comparing this approach to how Hadoop handles fetch failures: Hadoop does not fail any reducer when it detects a missing map output. The reducers just continue processing output from other mappers while the missing output is recomputed concurrently. This gives Hadoop a big edge over Spark for long-running jobs with multiple fetch failures. This change is one step towards making Spark robust against fetch failures; we would eventually want the Hadoop model, where we do not fail any task when a map output goes missing.

Regarding the approach, please let me know if you can think of some way to reduce the complexity of this change. cc @markhamstra, @rxin, @sameeragarwal
@@ -193,13 +193,6 @@ private[spark] class TaskSchedulerImpl private[scheduler](
    val stageTaskSets =
      taskSetsByStageIdAndAttempt.getOrElseUpdate(stage, new HashMap[Int, TaskSetManager])
    stageTaskSets(taskSet.stageAttemptId) = manager
    val conflictingTaskSet = stageTaskSets.exists { case (_, ts) =>
Please note that this check is no longer needed, because the DAGScheduler already keeps track of running tasks and does not submit duplicate tasks.
actually, that is not really the point of this check. It's just checking whether one stage has two task sets (aka stage attempts) that are both in the "non-zombie" state. It doesn't do any checks at all on what tasks are actually in those task sets.

This is just checking an invariant which we believe to always be true, but we figure it's better to fail fast if we hit this condition, rather than proceed with some inconsistent state. This check was added because behavior gets really confusing when the invariant is violated, and though we think it should always be true, we've still hit cases where it happens.
@squito - That's correct, this is checking that we should not have more than one non-zombie attempt of a stage running. But in scenario (d) you described below, we will end up having multiple non-zombie attempts.

However, my point is that there is no reason we should not allow multiple concurrent attempts of a stage to run; the only thing we need to guarantee is that those attempts run mutually exclusive sets of tasks. With this change, since the DAGScheduler already keeps track of submitted/running tasks, it can guarantee that it will not resubmit duplicate tasks for a stage.
Test build #74631 has finished for PR 17297 at commit
I'm a bit confused by the description:

This is already true: when there is a fetch failure, the TaskSetManager is marked as zombie and the DAGScheduler resubmits stages, but nothing actively kills running tasks.

I don't think it's true that it relaunches all tasks that hadn't completed when the fetch failure occurred. It relaunches all the tasks that haven't completed by the time the stage gets resubmitted; more tasks can complete between the time of the first failure and the time the stage is resubmitted.

But there are several other potential issues you may be trying to address. Say there is stage 0 and stage 1, each with 10 tasks. Stage 0 completes fine on the first attempt, then stage 1 starts. Tasks 0 & 1 in stage 1 complete, but then there is a fetch failure in task 2. Let's also say we have an abundance of cluster resources, so tasks 3 - 9 from stage 1, attempt 0 are still running. Stage 0 gets resubmitted as attempt 1, just to regenerate the map output for whatever executor had the data for the fetch failure -- perhaps it's just one task from stage 0 that needs to be resubmitted. Now, lots of different scenarios are possible:

(a) Tasks 3 - 9 from stage 1 attempt 0 all finish successfully while stage 0 attempt 1 is running. So when stage 0 attempt 1 finishes, stage 1 attempt 1 is submitted with just task 2. If it completes successfully, we're done (no wasted work).

(b) Stage 0 attempt 1 finishes before tasks 3 - 9 from stage 1 attempt 0 have finished. So stage 1 gets submitted again as stage 1 attempt 1, with tasks 2 - 9. There are now two copies running for tasks 3 - 9. Maybe all the tasks from attempt 0 actually finish shortly after attempt 1 starts. In this case, the stage is complete as soon as there is one complete attempt for each task. But even after the stage completes successfully, all the other tasks keep running anyway. (Plenty of wasted work.)

(c) Like (b), but shortly after stage 1 attempt 1 is submitted, we get another fetch failure in one of the old "zombie" tasks from stage 1 attempt 0. But the DAGScheduler realizes it already has a more recent attempt for this stage, so it ignores the fetch failure. All the other tasks keep running as usual. If there aren't any other issues, the stage completes when there is one completed attempt for each task. (Same amount of wasted work as (b).)

(d) While stage 0 attempt 1 is running, we get another fetch failure from stage 1 attempt 0, say in task 3, which has a failure from a different executor. Maybe it's from a completely different host (just by chance, or there may be cluster maintenance where multiple hosts are serviced at once); or maybe it's from another executor on the same host (at least, until we do something about your other PR on unregistering all shuffle files on a host). To be honest, I don't understand how things work in this scenario. We mark stage 0 as failed, we unregister some shuffle output, and we resubmit stage 0. But stage 0 attempt 1 is still running, so I would have expected us to end up with conflicting task sets. Whatever the real behavior is here, it seems we're at risk of even more duplicated work from yet another attempt for stage 1. Etc.

So I think in (b) and (c), you are trying to avoid resubmitting tasks 3 - 9 in stage 1 attempt 1. The thing is, there is a strong reason to believe that the original versions of those tasks will fail: most likely, those tasks need map output from the same executor that caused the first fetch failure. So Kay is suggesting that we take the opposite approach, and instead actively kill the tasks from stage 1 attempt 0. OTOH, it's possible that (i) the issue was transient, or (ii) the tasks already finished fetching that data before the error occurred. We really have no idea.
Thanks a lot @squito for taking a look at it and for your feedback.
That is true, but currently the DAGScheduler has no idea which tasks are running and which are being aborted. With this change, the TaskSetManager informs the DAGScheduler about currently running/aborted tasks so that the DAGScheduler can avoid resubmitting duplicates.
Yes that's true. I will update the PR description.
In our case, we are observing that any transient issue on the shuffle service may cause a few tasks to fail, while other reducers do not see the fetch failure because they either already fetched the data from that shuffle service or are yet to fetch it. Killing all the reducers in those cases wastes a lot of work, and as I mentioned above, we might end up in a state where jobs make no progress at all under frequent fetch failures, because they just flip-flop between two stages.
Actually, I realized that this is not true. If you look at the code (https://github.com/sitalkedia/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1419), when the stage fails because of a fetch failure, we remove the stage from the output committer. So any task that completes between the time of the first fetch failure and the time the stage is resubmitted will be denied permission to commit its output, and so the scheduler re-launches all tasks in the stage that hadn't completed when the fetch failure occurred.
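For context, a simplified and entirely hypothetical sketch of that commit-denial behavior (this is not Spark's actual OutputCommitCoordinator API; the names are made up for illustration):

import scala.collection.mutable

// Hypothetical sketch: once a stage attempt is failed due to a fetch failure, commit
// requests from its still-running tasks are refused, so their partitions stay missing
// and get re-launched with the new stage attempt.
class CommitCoordinatorSketch {
  private val attemptAllowedToCommit = mutable.HashMap.empty[Int, Int]  // stageId -> attempt

  def stageStart(stageId: Int, attempt: Int): Unit = attemptAllowedToCommit(stageId) = attempt

  def stageFailed(stageId: Int): Unit = attemptAllowedToCommit -= stageId

  def canCommit(stageId: Int, attempt: Int): Boolean =
    attemptAllowedToCommit.get(stageId).contains(attempt)
}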
Oh, that is a great point. I was mostly thinking of another shuffle map stage, where that wouldn't matter, but if it's a result stage which needs to commit its output, you are right.
// It is possible that the map output was regenerated by rerun of the stage and the
// fetch failure is being reported for stale map output. In that case, we should just
// ignore the fetch failure and relaunch the task with latest map output info.
if (epochForMapOutput.nonEmpty && epochForMapOutput.get <= task.epoch) {
I'd be inclined to do this without the extra binding and get:
for (epochForMapOutput <- mapOutputTracker.getEpochForMapOutput(shuffleId, mapId)
     if epochForMapOutput <= task.epoch) {
  // Mark the map whose fetch failed as broken in the map stage
  if (mapId != -1) {
    mapStage.removeOutputLoc(mapId, bmAddress)
    mapOutputTracker.unregisterMapOutput(shuffleId, mapId, bmAddress)
  }
  // TODO: mark the executor as failed only if there were lots of fetch failures on it
  if (bmAddress != null) {
    handleExecutorLost(bmAddress.executorId, filesLost = true, Some(task.epoch))
  }
}
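(For readers less familiar with Scala for-comprehensions over Option: the guarded for above runs its body only when the Option is non-empty and the guard holds; it desugars to roughly the following withFilter/foreach chain, shown here with made-up toy values.)

// Toy illustration of the desugaring; the values are made up.
val taskEpoch = 5L
val epochForMapOutputOpt: Option[Long] = Some(3L)
epochForMapOutputOpt.withFilter(_ <= taskEpoch).foreach { epochForMapOutput =>
  println(s"handling fetch failure against output registered at epoch $epochForMapOutput")
}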
if (changeEpoch) {
  incrementEpoch()
}
mapStatuses.put(shuffleId, statuses.clone())
What was the point of moving this?
@@ -378,15 +382,17 @@ private[spark] class MapOutputTrackerMaster(conf: SparkConf,
    val array = mapStatuses(shuffleId)
    array.synchronized {
      array(mapId) = status
      val epochs = epochForMapStatus.get(shuffleId).get
val epochs = epochForMapStatus(shuffleId)
    return Some(epochForMapStatus.get(shuffleId).get(mapId))
  }
  None
}
First, arrayOpt.get != null isn't necessary since we don't put null values into mapStatuses. Second, epochForMapStatus.get(shuffleId).get is the same as epochForMapStatus(shuffleId). Third, I don't like all the explicit gets, null checks and the unnecessary non-local return. To my mind, this is better:
def getEpochForMapOutput(shuffleId: Int, mapId: Int): Option[Long] = {
for {
mapStatus <- mapStatuses.get(shuffleId).flatMap { mapStatusArray =>
Option(mapStatusArray(mapId))
}
} yield epochForMapStatus(shuffleId)(mapId)
}
for (task <- tasks) {
  stage.pendingPartitions -= task.partitionId
}
} |
for {
stage <- stageIdToStage.get(stageId)
task <- tasks
} stage.pendingPartitions -= task.partitionId
val partitionsToCompute: Seq[Int] = stage.findMissingPartitions()
val missingPartitions = stage.findMissingPartitions()
val partitionsToCompute =
missingPartitions.filter(id => !stage.pendingPartitions.contains(id)) |
missingPartitions.filterNot(stage.pendingPartitions)
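(This works because a Scala Set[Int] is itself a Function1[Int, Boolean], so it can be passed directly as the predicate to filterNot; a toy example with made-up values:)

// Toy illustration: a Set can be passed where a predicate is expected.
val missingPartitions = Seq(0, 1, 2, 3)
val pendingPartitions = scala.collection.mutable.HashSet(1, 3)
val partitionsToCompute = missingPartitions.filterNot(pendingPartitions)  // Seq(0, 2)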
Force-pushed from 901c9bf to 99b4069.
Test build #75029 has finished for PR 17297 at commit
Thanks @markhamstra for the review comments, addressed. I also found an issue with my previous implementation: we did not allow task commits from old stage attempts. I fixed that issue as well.
Test build #75030 has started for PR 17297 at commit
Test build #75126 has finished for PR 17297 at commit
Test build #75124 has finished for PR 17297 at commit
Test build #75127 has finished for PR 17297 at commit
Force-pushed from b179439 to 1e6e88a.
To recap the issue that Imran and I discussed here, I think it can be summarized as follows:
If my description above is correct, then this PR is assuming that scenario A is more likely than scenario B, but it seems to me that these two scenarios are equally likely (in which case this PR provides no net benefit). @sitalkedia what are your thoughts here / did I miss something in my description above?
@squito - I am not able to reproduce this issue locally. The tests fail with some other issue -
@sitalkedia how are you trying to run the test? Works fine for me on my laptop on master. Note that the test is referencing a var which is only defined if "spark.testing" is a system property: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/TaskMetrics.scala#L199 which it is in the sbt and maven build. (Maybe it doesn't work inside an IDE? I'd strongly suggest just using sbt.)
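(If running inside IntelliJ, one workaround -- an assumption about the local setup, not something from the PR itself -- is to pass the system property through the run configuration's VM options, e.g.:)

-Dspark.testing=true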
Test build #75287 has finished for PR 17297 at commit
@squito - I am able to reproduce the issue by running . Also, one weird thing is that after adding the spark.testing system property to my IntelliJ, all test cases succeed without being stuck :/ .
@sitalkedia they're in core/target/unit-tests.log. Sometimes it's easier to move the logs to the tests (so they show up in-line), which you can do by changing core/src/test/resources/log4j.properties to log to the console instead of to a file.
@kayousterhout - Both scenario A and scenario B that you described above are likely (it totally depends on the nature of the job and the available cluster resources), and you are right that in scenario B this PR will not provide any benefit. I am planning a follow-up PR to make the fetch-failure handling logic better by not failing a task at all. In that case, the reducers can just inform the scheduler of the lost map output and continue processing other available map outputs while the scheduler concurrently recomputes the lost output. But that will be a bigger change in the scheduler.
btw I filed https://issues.apache.org/jira/browse/SPARK-20128 for the test timeout -- fwiw I don't think it's a problem w/ the test but a potential real issue with the metrics system, though I don't really understand how it can happen.
@sitalkedia This change is pretty contentious; there are a lot of questions about whether or not this is a good change. I don't think discussing this here in GitHub comments on a PR is the best forum. I think of PR comments as being more about code details -- clarity, tests, whether the implementation is correct, etc. But here we're discussing whether the behavior is even desirable, as well as trying to discuss this in relation to other changes. I think a better format would be for you to open a jira and submit a design document (maybe a shared google doc at first), where we can focus more on the desired behavior and consider all the changes, even if the PRs are smaller to make them easier to review.

I'm explicitly not making a judgement on whether or not this is a good change. Also, I do appreciate you having the code changes ready, as a POC, as that can help folks consider the complexity of the change. But it seems clear to me that first we need to come to a decision about the end goal.

Also, assuming we do decide this is desirable behavior, there is also a question about how we can get changes like this in without risking breaking things -- I have started a thread on dev@ related to that topic in general, but we should figure that out for these changes in particular as well.

@kayousterhout @tgravescs @markhamstra makes sense?
Sounds good to me.
Agree, sounds good!
@squito - Sounds good to me; let me compile the list of pain points related to fetch failures that we are seeing, and also a design doc for better handling of these issues.
Agreed. Let's establish what we want to do before trying to discuss the details of how we are going to do it.
Test build #75332 has finished for PR 17297 at commit
Test build #75339 has finished for PR 17297 at commit
@kayousterhout, @squito - Since we need more discussion on this change over a design doc, I have put out a temporary change (#17485) to kill the running tasks in case of fetch failure. Although this is not ideal, it would be better than the current situation.
Should we temporarily close the PR and wait for the design doc to be finalized? @sitalkedia
Okay, closing the PR.
What changes were proposed in this pull request?
When a fetch failure occurs, the DAGScheduler re-launches the previous stage (to regenerate output that was missing), and then re-launches all tasks in the failed stage that haven't completed by the time the stage gets resubmitted (the DAGScheduler re-launches all of the tasks whose output data is not available, which is equivalent to the set of tasks that hadn't yet completed). This sometimes leads to wasteful duplicate task runs for jobs with long-running tasks.
To address the issue, the following change has been made:
The DAGScheduler maintains a pending task list, i.e. the tasks that have been submitted to the lower-level scheduler and should not be resubmitted when the stage is rerun (see the sketch below).
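Roughly, the mechanism looks like the following sketch (a simplified illustration with made-up names, not the actual DAGScheduler/TaskSetManager code in this PR):

import scala.collection.mutable

// Simplified, illustrative sketch of the duplicate-avoidance idea: a stage tracks the
// partitions already handed to the TaskScheduler, and a re-submission only computes
// partitions that are missing AND not still pending in an earlier attempt.
object DuplicateTaskAvoidanceSketch {
  class StageState(val numPartitions: Int) {
    val pendingPartitions = mutable.HashSet.empty[Int]  // submitted, not yet finished or aborted
    val outputAvailable = mutable.HashSet.empty[Int]    // partitions whose output is registered

    def findMissingPartitions(): Seq[Int] =
      (0 until numPartitions).filterNot(outputAvailable)
  }

  def partitionsToCompute(stage: StageState): Seq[Int] =
    stage.findMissingPartitions().filterNot(stage.pendingPartitions)
}

When the TaskSetManager reports a task as finished or aborted, the corresponding partition would be removed from pendingPartitions, so a later stage attempt can resubmit it if its output is still missing.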
How was this patch tested?
Added new tests.