
[SPARK-2325] Utils.getLocalDir had better check the directory and choose a good one instead of choosing the first one directly #1281

Closed

Conversation

YanTangZhai
Contributor

If the first directory in spark.local.dir is bad, the application will exit with this exception:
Exception in thread "main" java.io.IOException: Failed to create a temp directory (under /data1/sparkenv/local) after 10 attempts!
at org.apache.spark.util.Utils$.createTempDir(Utils.scala:258)
at org.apache.spark.broadcast.HttpBroadcast$.createServer(HttpBroadcast.scala:154)
at org.apache.spark.broadcast.HttpBroadcast$.initialize(HttpBroadcast.scala:127)
at org.apache.spark.broadcast.HttpBroadcastFactory.initialize(HttpBroadcastFactory.scala:31)
at org.apache.spark.broadcast.BroadcastManager.initialize(BroadcastManager.scala:48)
at org.apache.spark.broadcast.BroadcastManager.<init>(BroadcastManager.scala:35)
at org.apache.spark.SparkEnv$.create(SparkEnv.scala:218)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:202)
at JobTaskJoin$.main(JobTaskJoin.scala:9)
at JobTaskJoin.main(JobTaskJoin.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:292)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Utils.getLocalDir should check each directory and choose a good one instead of simply taking the first one. For example, if spark.local.dir is /data1/sparkenv/local,/data2/sparkenv/local and the disk data1 is bad while data2 is good, we should choose data2 rather than data1.
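A minimal sketch of the proposed behavior, assuming a hypothetical helper that probes each configured entry for writability and returns the first usable one (the name pickUsableLocalDir is illustrative, not part of the patch):

```scala
import java.io.File

// Hypothetical: walk the comma-separated spark.local.dir entries and return
// the first one that can be created and written to, instead of taking the
// first entry unconditionally.
def pickUsableLocalDir(localDirSetting: String): Option[String] =
  localDirSetting.split(",").map(_.trim).find { path =>
    val dir = new File(path)
    (dir.isDirectory || dir.mkdirs()) && dir.canWrite
  }

// With spark.local.dir = "/data1/sparkenv/local,/data2/sparkenv/local" and
// data1 on a failed disk, this would return Some("/data2/sparkenv/local")
// instead of failing on /data1.
```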

…ose a good one instead of choosing the first one directly
@AmplabJenkins

Can one of the admins verify this patch?

@ash211
Contributor

ash211 commented Jul 6, 2014

Hi @YanTangZhai, with the merge of #1274 is this change still needed?

@YanTangZhai
Contributor Author

Hi @ash211, I think this change is still needed. The method Utils.getLocalDir is used by components such as HttpBroadcast, which is different from DiskBlockManager, so the two problems are different. Even though #1274 has been merged, this problem still exists. Please review again. Thanks.

@mateiz
Contributor

mateiz commented Jul 26, 2014

When did this come up? I'm actually not sure this is a good behavior, because doing this means that a user might completely miss a misconfigured directory. With the current behavior, you immediately get an error and can fix your configuration. I was wondering if you had a scenario where it was just too difficult to configure this correctly on each machine.

@advancedxy
Contributor

Hi @mateiz, I think ignoring bad dirs is needed in a production cluster.
In production, disk failures are quite likely. I like the idea of being able to replace bad disks without service downtime, and I hope Spark clusters can support this.
Replacing disks without downtime requires:

  1. The service tolerates bad dirs, which is what this PR does.
  2. Make the dir read-only, or remove all permissions on it (chmod 000 /dir, assuming a Unix-like OS), so the service doesn't pick the wrong dir.
  3. Replace the bad disk (modern machines support hot plugging), mount it, and restore the permissions.
  4. The service auto-detects the new good dir (disk), or provides a reload API so we can notify it.

I haven't dug into the code, so I don't know everywhere spark.local.dir is used. But if it is used for storage, it is better to choose different dirs (disks) to spread the disk IO.

OK, back to the behavior in question. @mateiz, when a running Spark service has one of its configured dirs (disks) fail, I simply prefer ignoring the bad dir to bringing down the entire service.
Hadoop's DataNode and TaskTracker do exactly this: they ignore bad dirs, up to a maximum number.

What about misconfiguration? If a misconfigured directory is still usable, there is nothing we can do; that is the user's mistake. If the directory is bad, ignoring it isn't so bad.

@YanTangZhai, I believe we should log the bad dir so the user knows there is one. And what do you think of the idea of replacing bad disks?
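A minimal sketch combining two points from this comment, ignoring bad dirs with a logged warning and spreading files across the remaining good dirs; the helper names are illustrative, not code from this patch:

```scala
import java.io.File

// Hypothetical helpers: keep the usable dirs, warn about the bad ones so a
// failed disk is visible in the logs, and spread files over the survivors.
def goodLocalDirs(configured: Seq[String]): Seq[String] = {
  val (good, bad) = configured.partition { path =>
    val dir = new File(path)
    (dir.isDirectory || dir.mkdirs()) && dir.canWrite
  }
  bad.foreach(p => System.err.println(s"WARN: ignoring bad local dir: $p"))
  good
}

// Pick a dir for a given file name so IO is spread across the disks.
// Assumes dirs is non-empty; a real implementation would fail earlier otherwise.
def dirFor(fileName: String, dirs: Seq[String]): String =
  dirs(((fileName.hashCode % dirs.size) + dirs.size) % dirs.size)
```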

@mateiz
Contributor

mateiz commented Jul 27, 2014

I see, that makes sense, but in that case we need to do a couple more things to make this complete:

  1. We should have a max limit of broken dirs we tolerate, after which we'd throw an error.
  2. spark.local.dir is used to specify a list of directories that the DiskStore puts data in. You need to modify the DiskStore to allow skipping some of them, or else there will still be problems.

If you don't have time to look through the rest of the code to do this, then please just add your discussion above to the JIRA and other people will get to it later.
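A rough sketch of requirement (1) above; the threshold and its name are assumptions rather than an existing Spark setting:

```scala
import java.io.File

// Hypothetical: tolerate a bounded number of broken local dirs, then fail
// fast so a completely misconfigured node still surfaces an error.
val maxFailedLocalDirs = 1

def checkLocalDirs(dirs: Seq[String]): Seq[String] = {
  val good = dirs.filter { path =>
    val dir = new File(path)
    (dir.isDirectory || dir.mkdirs()) && dir.canWrite
  }
  val failed = dirs.size - good.size
  require(failed <= maxFailedLocalDirs && good.nonEmpty,
    s"$failed of ${dirs.size} configured local dirs are unusable")
  good
}
```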

@advancedxy
Contributor

Hi @mateiz, I'd love to contribute this to Spark. However, I believe it is more than one PR's worth of work, and there are many details to consider. I will make time and try to implement it. In any case, I will file a JIRA first.

@mateiz
Contributor

mateiz commented Jul 29, 2014

Sure, please start by adding a JIRA with a proposed design for this. Then people will be able to comment on that before you even have to start implementing stuff.

@JoshRosen
Contributor

I'd like to revisit this in light of SPARK-2974; now that #1274 has been merged, the directory returned from Utils.getLocalDir() might not exist, leading to confusing errors when workers fetch files.

@SparkQA

SparkQA commented Nov 4, 2014

Test build #22852 has started for PR 1281 at commit 08424ce.

  • This patch does not merge cleanly.

@SparkQA

SparkQA commented Nov 4, 2014

Test build #22852 has finished for PR 1281 at commit 08424ce.

  • This patch fails some tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22852/
Test FAILed.

@JoshRosen
Contributor

It looks like the JIRA referenced from this PR was resolved as a duplicate of an issue which was fixed in #2002. Therefore, do you mind closing this PR? Thanks!

@JoshRosen
Contributor

(I think 'close this issue' is the magic that the script needs)

@asfgit closed this in ca12608 on Dec 17, 2014