
[SPARK-2325] Utils.getLocalDir had better check the directory and choose a good one instead of choosing the first one directly #1281

Closed

Conversation

YanTangZhai
Contributor

If the first directory in spark.local.dir is bad, the application will exit with this exception:
Exception in thread "main" java.io.IOException: Failed to create a temp directory (under /data1/sparkenv/local) after 10 attempts!
at org.apache.spark.util.Utils$.createTempDir(Utils.scala:258)
at org.apache.spark.broadcast.HttpBroadcast$.createServer(HttpBroadcast.scala:154)
at org.apache.spark.broadcast.HttpBroadcast$.initialize(HttpBroadcast.scala:127)
at org.apache.spark.broadcast.HttpBroadcastFactory.initialize(HttpBroadcastFactory.scala:31)
at org.apache.spark.broadcast.BroadcastManager.initialize(BroadcastManager.scala:48)
at org.apache.spark.broadcast.BroadcastManager.<init>(BroadcastManager.scala:35)
at org.apache.spark.SparkEnv$.create(SparkEnv.scala:218)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:202)
at JobTaskJoin$.main(JobTaskJoin.scala:9)
at JobTaskJoin.main(JobTaskJoin.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:292)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Utils.getLocalDir should check each directory and choose a good one instead of simply taking the first one. For example, if spark.local.dir is /data1/sparkenv/local,/data2/sparkenv/local and the disk data1 is bad while data2 is good, we should choose data2 rather than data1.
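A minimal sketch of the proposed behavior, assuming a hypothetical helper that probes each configured entry for writability and returns the first usable one (the name pickUsableLocalDir is illustrative, not part of the patch):

```scala
import java.io.File

// Hypothetical: walk the comma-separated spark.local.dir entries and return
// the first one that can be created and written to, instead of taking the
// first entry unconditionally.
def pickUsableLocalDir(localDirSetting: String): Option[String] =
  localDirSetting.split(",").map(_.trim).find { path =>
    val dir = new File(path)
    (dir.isDirectory || dir.mkdirs()) && dir.canWrite
  }

// With spark.local.dir = "/data1/sparkenv/local,/data2/sparkenv/local" and
// data1 on a failed disk, this would return Some("/data2/sparkenv/local")
// instead of failing on /data1.
```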

…ose a good one instead of choosing the first one directly
@AmplabJenkins

Can one of the admins verify this patch?

@ash211
Contributor

ash211 commented Jul 6, 2014

Hi @YanTangZhai, with the merge of #1274 is this change still needed?

@YanTangZhai
Contributor Author

Hi @ash211, I think this change is still needed. The method Utils.getLocalDir is used by components such as HttpBroadcast, which is different from DiskBlockManager, so the two problems are different. Even though #1274 has been merged, this problem still exists. Please review again. Thanks.

@mateiz
Contributor

mateiz commented Jul 26, 2014

When did this come up? I'm actually not sure this is a good behavior, because doing this means that a user might completely miss a misconfigured directory. With the current behavior, you immediately get an error and can fix your configuration. I was wondering if you had a scenario where it was just too difficult to configure this correctly on each machine.

@advancedxy
Contributor

Hi @mateiz, I think ignoring bad dirs is needed in a production cluster.
In production, disk failures are quite likely. I like the idea of being able to replace bad disks without service downtime, and I hope Spark clusters can support this.
Replacing disks without downtime requires:

  1. The service tolerates bad dirs, which is what this PR does.
  2. Make the dir read-only, or remove all permissions on it (chmod 000 /dir, assuming a Unix-like OS), so the service doesn't pick the wrong dir.
  3. Replace the bad disk (modern machines support hot plugging), mount it, and restore the permissions.
  4. The service auto-detects the new good dir (disk), or provides a reload API so we can notify it.

I haven't dug into the code, so I don't know everywhere spark.local.dir is used. But if it is used for storage, it is better to choose different dirs (disks) to spread the disk IO.

OK, back to the behavior in question. @mateiz, when a running Spark service has one of its configured dirs (disks) fail, I simply prefer ignoring the bad dir to bringing down the entire service.
Hadoop's DataNode and TaskTracker do exactly this: they ignore bad dirs, up to a maximum number.

What about misconfiguration? If a misconfigured directory is still usable, there is nothing we can do; that is the user's mistake. If the directory is bad, ignoring it isn't so bad.

@YanTangZhai, I believe we should log the bad dir so the user knows there is one. And what do you think of the idea of replacing bad disks?
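A minimal sketch combining two points from this comment, ignoring bad dirs with a logged warning and spreading files across the remaining good dirs; the helper names are illustrative, not code from this patch:

```scala
import java.io.File

// Hypothetical helpers: keep the usable dirs, warn about the bad ones so a
// failed disk is visible in the logs, and spread files over the survivors.
def goodLocalDirs(configured: Seq[String]): Seq[String] = {
  val (good, bad) = configured.partition { path =>
    val dir = new File(path)
    (dir.isDirectory || dir.mkdirs()) && dir.canWrite
  }
  bad.foreach(p => System.err.println(s"WARN: ignoring bad local dir: $p"))
  good
}

// Pick a dir for a given file name so IO is spread across the disks.
// Assumes dirs is non-empty; a real implementation would fail earlier otherwise.
def dirFor(fileName: String, dirs: Seq[String]): String =
  dirs(((fileName.hashCode % dirs.size) + dirs.size) % dirs.size)
```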

@mateiz
Contributor

mateiz commented Jul 27, 2014

I see, that makes sense, but in that case we need to do a couple more things to make this complete:

  1. We should have a max limit of broken dirs we tolerate, after which we'd throw an error.
  2. spark.local.dir is used to specify a list of directories that the DiskStore puts data in. You need to modify the DiskStore to allow skipping some of them, or else there will still be problems.

If you don't have time to look through the rest of the code to do this, then please just add your discussion above to the JIRA and other people will get to it later.
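A rough sketch of requirement (1) above; the threshold and its name are assumptions rather than an existing Spark setting:

```scala
import java.io.File

// Hypothetical: tolerate a bounded number of broken local dirs, then fail
// fast so a completely misconfigured node still surfaces an error.
val maxFailedLocalDirs = 1

def checkLocalDirs(dirs: Seq[String]): Seq[String] = {
  val good = dirs.filter { path =>
    val dir = new File(path)
    (dir.isDirectory || dir.mkdirs()) && dir.canWrite
  }
  val failed = dirs.size - good.size
  require(failed <= maxFailedLocalDirs && good.nonEmpty,
    s"$failed of ${dirs.size} configured local dirs are unusable")
  good
}
```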

@advancedxy
Contributor

Hi @mateiz, I'd love to contribute this to Spark. However, I believe it is more than one PR's worth of work, and there are many details to consider. I will make time and try to implement it. In any case, I will file a JIRA first.

@mateiz
Contributor

mateiz commented Jul 29, 2014

Sure, please start by adding a JIRA with a proposed design for this. Then people will be able to comment on that before you even have to start implementing stuff.

@JoshRosen
Contributor

I'd like to revisit this in light of SPARK-2974; now that #1274 has been merged, the directory returned from Utils.getLocalDir() might not exist, leading to confusing errors when workers fetch files.

@SparkQA

SparkQA commented Nov 4, 2014

Test build #22852 has started for PR 1281 at commit 08424ce.

  • This patch does not merge cleanly.

@SparkQA

SparkQA commented Nov 4, 2014

Test build #22852 has finished for PR 1281 at commit 08424ce.

  • This patch fails some tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22852/
Test FAILed.

@JoshRosen
Contributor

It looks like the JIRA referenced from this PR was resolved as a duplicate of an issue which was fixed in #2002. Therefore, do you mind closing this PR? Thanks!

@JoshRosen
Contributor

(I think 'close this issue' is the magic that the script needs)

@asfgit closed this in ca12608 on Dec 17, 2014