[SPARK-2325] Utils.getLocalDir had better check the directory and choose a good one instead of choosing the first one directly #1281
Conversation
Can one of the admins verify this patch?
Hi @YanTangZhai, with the merge of #1274 is this change still needed?
When did this come up? I'm actually not sure this is a good behavior, because doing this means that a user might completely miss a misconfigured directory. With the current behavior, you immediately get an error and can fix your configuration. I was wondering if you had a scenario where it was just too difficult to configure this correctly on each machine.
Hi @mateiz, I think ignoring a bad directory is needed in a production cluster.
I didn't dig into the code, so I don't know where this comes up exactly. Ok, let's go back to this behavior. @mateiz, when running a Spark service, if one of the configured directories (disks) fails, I would simply prefer ignoring the bad directory rather than bringing down the entire service. What about a misconfiguration? If a misconfigured directory is usable, we cannot do anything; that's the user's mistake. If the directory is bad, ignoring it isn't that bad. @YanTangZhai, I believe we should log the bad directory so the user knows there is one. And what do you think of the idea of replacing bad disks?
I see, that makes sense, but in that case we need to do a couple more things to make this complete:
If you don't have time to look through the rest of the code to do this, then please just add your discussion above to the JIRA and other people will get to it later.
Hi @mateiz, I'd love to contribute to Spark. However, I believe this is more than one PR's worth of work. There must be a lot of details to be considered. I will make time and try to implement it. Anyway, I will file a JIRA first.
Sure, please start by adding a JIRA with a proposed design for this. Then people will be able to comment on that before you even have to start implementing stuff. |
I'd like to revisit this in light of SPARK-2974; now that #1274 has been merged, the directory returned from
Test build #22852 has started for PR 1281 at commit
Test build #22852 has finished for PR 1281 at commit
Test FAILed.
It looks like the JIRA referenced from this PR was resolved as a duplicate of an issue which was fixed in #2002. Therefore, do you mind closing this PR? Thanks!
(I think 'close this issue' is the magic that the script needs)
If the first directory of spark.local.dir is bad, the application will exit with this exception:
Exception in thread "main" java.io.IOException: Failed to create a temp directory (under /data1/sparkenv/local) after 10 attempts!
at org.apache.spark.util.Utils$.createTempDir(Utils.scala:258)
at org.apache.spark.broadcast.HttpBroadcast$.createServer(HttpBroadcast.scala:154)
at org.apache.spark.broadcast.HttpBroadcast$.initialize(HttpBroadcast.scala:127)
at org.apache.spark.broadcast.HttpBroadcastFactory.initialize(HttpBroadcastFactory.scala:31)
at org.apache.spark.broadcast.BroadcastManager.initialize(BroadcastManager.scala:48)
at org.apache.spark.broadcast.BroadcastManager.&lt;init&gt;(BroadcastManager.scala:35)
at org.apache.spark.SparkEnv$.create(SparkEnv.scala:218)
at org.apache.spark.SparkContext.&lt;init&gt;(SparkContext.scala:202)
at JobTaskJoin$.main(JobTaskJoin.scala:9)
at JobTaskJoin.main(JobTaskJoin.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:292)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Utils.getLocalDir should check the configured directories and choose a good one instead of blindly taking the first one. For example, suppose spark.local.dir is /data1/sparkenv/local,/data2/sparkenv/local. If the disk data1 is bad while data2 is good, we should choose data2, not data1.
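The behavior described above can be sketched as follows. This is a minimal, hypothetical illustration and not the actual Utils.getLocalDir implementation; the name getUsableLocalDir and the println logging (Spark would use logWarning) are assumptions for the sketch. It walks the comma-separated directory list, skips any entry that cannot be created or written to, and returns the first usable one.

```scala
import java.io.File

// Hypothetical sketch: pick the first usable directory from a
// comma-separated spark.local.dir-style setting, skipping bad entries.
def getUsableLocalDir(localDirSetting: String): Option[File] = {
  localDirSetting.split(",").map(_.trim).filter(_.nonEmpty).flatMap { path =>
    val dir = new File(path)
    // A directory is "good" if it exists (or can be created) and is writable.
    if ((dir.isDirectory || dir.mkdirs()) && dir.canWrite) {
      Some(dir)
    } else {
      // In Spark this would be logWarning, so the user still sees the bad dir.
      println(s"Ignoring bad local directory: $path")
      None
    }
  }.headOption
}
```

With /data1 broken and /data2 healthy, getUsableLocalDir("/data1/sparkenv/local,/data2/sparkenv/local") would log a warning for /data1 and return the /data2 directory, instead of failing after 10 createTempDir attempts as in the stack trace above. Returning Option also makes the "all directories are bad" case explicit, so the caller can still fail fast with a clear error.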