SPARK-2333 - spark_ec2 script should allow option for existing security group #1899
Conversation
vidaha commented on Aug 12, 2014
Can one of the admins verify this patch?
Just FYI, I tested this and it worked. I'm not sure if it's controversial to use the name tag rather than the security group for destroying a cluster.
Why is it useful to have the cluster name be different from the security group prefix? If I want to re-use an existing security group, I can just name my cluster after that security group. The cluster name only seems to be used to name the security groups, instance requests, and instances. What would happen if I launched a cluster with a name that's different from its security group, then attempted to run spark-ec2 commands by passing only the actual security group name as the cluster name? In this case, I think the instances would still be named with the cluster name, which might cause problems since the script would be expecting to find ones named after the security group.
Hi Josh,

IMHO, it's best not to require the Spark cluster name and the security group to be the same. While you can reuse an existing security group to launch another cluster, you can't launch more than one cluster with the same security group. Perhaps a company wants to have an internal-applications or dev security group and reuse it to launch multiple Spark clusters. In addition, AWS has a strict limit of 100 security groups per VPC, and since two security groups are required (one for the masters and one for the workers), this means that only 50 Spark clusters can be launched on a VPC. While that might seem like a reasonable limit, I can easily see companies having a use case to exceed it.

Do you mind illustrating the problem about name conflicts? If I understand what you are saying, you mentioned this scenario:

    % ./spark-ec2 … --security-group my-security-group launch my-cluster-name

And then later, you also run:

    % ./spark-ec2 … launch my-security-group

This works fine - I tested it - there will be two clusters with the same security group, but different names.

These are some error cases that I thought might come up; I tested for them manually and they worked out fine:
I tried this, and Amazon has the correct controls to prevent deleting a security group still in use by another cluster.
I tried this as well, and since the get_existing_cluster code was modified to use the name rather than the security group to identify the instances, this works right. You can't run:

    % ./spark-ec2 … launch my-cluster-name

if there is already a cluster with my-cluster-name launched.

Are there some other possible conflicts that you can think of? If you can write out the commands to illustrate the use cases you are thinking of, I can run them and see what happens.

-Vida
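The lookup change Vida describes (get_existing_cluster matching on the Name tag instead of the security group) can be sketched roughly as follows. These are hypothetical helpers over plain dicts, not the actual boto-based spark_ec2.py code; the function and field names are illustrative only:

```python
def get_cluster_by_group(instances, cluster_name):
    """Old behavior: match instances whose security group is
    <cluster>-master or <cluster>-slaves. Fails when several
    clusters share one security group."""
    groups = {cluster_name + "-master", cluster_name + "-slaves"}
    return [i for i in instances if groups & set(i["groups"])]


def get_cluster_by_name_tag(instances, cluster_name):
    """New behavior: match instances whose Name tag starts with the
    cluster name, so the security group can be shared across clusters."""
    prefix = cluster_name + "-"
    return [i for i in instances
            if i["tags"].get("Name", "").startswith(prefix)]


if __name__ == "__main__":
    # Two clusters sharing one "dev" group, plus one old-style cluster.
    instances = [
        {"groups": ["dev"], "tags": {"Name": "cluster-a-master-i-1"}},
        {"groups": ["dev"], "tags": {"Name": "cluster-b-slave-i-2"}},
        {"groups": ["cluster-a-master"], "tags": {"Name": "cluster-a-master-i-3"}},
    ]
    # Group-based lookup only sees the old-style instance.
    print(len(get_cluster_by_group(instances, "cluster-a")))
    # Name-tag lookup finds both cluster-a machines in the shared group.
    print(len(get_cluster_by_name_tag(instances, "cluster-a")))
```

Note that prefix matching on the Name tag assumes cluster names are not prefixes of one another (e.g. a cluster named "cluster" would also match "cluster-a-…" instances).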
After a closer look, the code in this PR looks good; a couple of TODOs:
Jenkins, this is ok to test. Test this please.
QA tests have started for PR 1899. This patch merges cleanly.
QA tests have started for PR 1899. This patch merges cleanly.
QA results for PR 1899:
QA results for PR 1899:
    parser.add_option(
        "--security-group-prefix", type="string", default=None,
        help="Use this prefix for teh security group rather than the cluster name.")
Typo: teh -> the
Some minor comments, but otherwise LGTM. One thing that might be worth noting is that we did use tags while launching clusters for AMPCamp [1], and one issue I ran into was that sometimes the create-tag command failed because of eventual consistency between the instance being allocated and the instance being taggable. We might want to retry the tag operation a few times, since it is now the only way to identify instances. [1] https://github.com/amplab/training-scripts/blob/ampcamp4/spark_ec2.py#L342
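The retry Shivaram suggests can be sketched as a small wrapper around any tagging call. This is a sketch under the assumption (confirmed later in the thread) that boto raises an exception when the instance is not yet taggable; `add_tag_with_retry` is a hypothetical helper name, not part of spark_ec2.py:

```python
import time


def add_tag_with_retry(tag_fn, key, value, attempts=5, delay=1.0):
    """Retry a tagging call a few times: a freshly launched EC2
    instance may not be taggable yet due to eventual consistency,
    and the underlying call is assumed to raise on failure."""
    for attempt in range(attempts):
        try:
            tag_fn(key=key, value=value)
            return  # tag applied; stop retrying
        except Exception:
            if attempt == attempts - 1:
                raise  # still failing after all attempts
            time.sleep(delay)
```

In the real script, `tag_fn` would be something like a boto instance's `add_tag` bound method; here it is kept generic so the retry policy can be tested with a stub.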
QA tests have started for PR 1899. This patch merges cleanly.
Great - I made the small edits and added a loop for retrying the tagging logic. I tested bringing up a cluster, and it went up fine - it didn't try tagging more than once. I have no way of testing the tagging retry logic in the failure case, though. By the way, I'm amending my commits rather than creating a new one per edit - is that standard Spark etiquette? Let me know if not.
@vidaha please update the PR title and description to say what the change does. Those actually become the final commit message in git.
QA tests have started for PR 1899. This patch merges cleanly.
QA results for PR 1899:
QA results for PR 1899:
…y group

- Uses the name tag to identify machines in a cluster.
- Allows overriding the security group name so it doesn't need to coincide with the cluster name.
- Outputs the request ids of up to 10 pending spot instance requests.
The committers use the merge_spark_pr.py script for merging pull requests. This script will squash together all of your commits into a single commit that uses the title of the PR plus the PR's description as the commit message, so you no longer need to worry about rebasing your patch into a single commit. I'd just copy-paste the description from your last commit into the description above so it becomes the commit message.
    value='{cn}-master-{iid}'.format(cn=cluster_name, iid=master.id))
    name = '{cn}-master-{iid}'.format(cn=cluster_name, iid=master.id)
    for i in range(0, 5):
        master.add_tag(key='Name', value=name)
What happens if an add_tag call fails? My bet is that it throws an exception rather than silently failing, in which case this re-try logic won't run. Rather than using this "set-and-test" logic, maybe we can just wrap the call in a try-except block?

@shivaram Did the eventual-consistency issue that you saw result in exceptions from add_tag?
Yes - I am pretty sure it throws an exception. I don't remember what the type is; all I see in my notes is that the exception says 'Instance not found'.
QA tests have started for PR 1899 at commit
Okay, I made the retry a try-catch, and edited the title again.
QA tests have finished for PR 1899 at commit
    @@ -440,14 +449,29 @@ def launch_cluster(conn, opts, cluster_name):
        print "Launched master in %s, regid = %s" % (zone, master_res.id)

        # Give the instances descriptive names
        # TODO: Add retry logic for tagging with name since it's used to identify a cluster.
Minor nit: You can remove this TODO now.
LGTM
Alright, great. I'm going to merge this into
…ty group

- Uses the name tag to identify machines in a cluster.
- Allows overriding the security group name so it doesn't need to coincide with the cluster name.
- Outputs the request ids of up to 10 pending spot instance requests.

Author: Vida Ha <vida@databricks.com>

Closes #1899 from vidaha/vida/ec2-reuse-security-group and squashes the following commits:

c80d5c3 [Vida Ha] wrap retries in a try catch block
b2989d5 [Vida Ha] SPARK-2333: spark_ec2 script should allow option for existing security group

(cherry picked from commit 94053a7)
Signed-off-by: Josh Rosen <joshrosen@apache.org>
Opened an issue related to this PR: https://issues.apache.org/jira/browse/SPARK-3332
This reverts #1899 and #2163, two patches that modified `spark-ec2` so that clusters are identified using tags instead of security groups.

The original motivation for this patch was to allow multiple clusters to run in the same security group. Unfortunately, tagging is not atomic with launching instances on EC2, so with this approach we have the possibility of `spark-ec2` launching instances and crashing before they can be tagged, effectively orphaning those instances. The orphaned instances won't belong to any cluster, so the `spark-ec2` script will be unable to clean them up.

Since this feature may still be worth supporting, there are several alternative approaches that we might consider, including detecting orphaned instances and logging warnings, or maybe using another mechanism to group instances into clusters. For the 1.1.0 release, though, I propose that we just revert this patch.

Author: Josh Rosen <joshrosen@apache.org>

Closes #2225 from JoshRosen/revert-ec2-cluster-naming and squashes the following commits:

0c18e86 [Josh Rosen] Revert "SPARK-2333 - spark_ec2 script should allow option for existing security group"
c2ca2d4 [Josh Rosen] Revert "Spark-3213 Fixes issue with spark-ec2 not detecting slaves created with "Launch More like this""
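The orphaning problem behind the revert can be illustrated with a small sketch: once instances are identified only by their Name tag, an instance launched but never tagged is invisible to a name-based lookup. This is a hypothetical helper over plain dicts, not spark-ec2 code; the "detect and warn" approach is one of the alternatives mentioned above:

```python
def partition_instances(instances, cluster_name):
    """Split instances into (members of cluster_name, untagged orphans).
    An instance that crashed out of spark-ec2 between launch and tagging
    has no Name tag, so a purely name-based lookup can neither find it
    nor destroy it."""
    prefix = cluster_name + "-"
    members, orphans = [], []
    for inst in instances:
        name = inst.get("tags", {}).get("Name", "")
        if name.startswith(prefix):
            members.append(inst)
        elif not name:
            orphans.append(inst)  # candidate for a logged warning
    return members, orphans


if __name__ == "__main__":
    instances = [
        {"tags": {"Name": "demo-master-i-1"}},
        {"tags": {}},  # launched, but tagging never happened
    ]
    members, orphans = partition_instances(instances, "demo")
    print(len(members), len(orphans))
```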
…ty group

- Uses the name tag to identify machines in a cluster.
- Allows overriding the security group name so it doesn't need to coincide with the cluster name.
- Outputs the request ids of up to 10 pending spot instance requests.

Author: Vida Ha <vida@databricks.com>

Closes apache#1899 from vidaha/vida/ec2-reuse-security-group and squashes the following commits:

c80d5c3 [Vida Ha] wrap retries in a try catch block
b2989d5 [Vida Ha] SPARK-2333: spark_ec2 script should allow option for existing security group
    name = '{cn}-slave-{iid}'.format(cn=cluster_name, iid=slave.id)
    for i in range(0, 5):
        try:
            slave.add_tag(key='Name', value=name)
Note that we need a break here.
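The effect of the missing break can be demonstrated with a counting stub. `FakeSlave` is a stand-in for a boto instance object, and both loop variants are sketches of the diff above, not the actual spark_ec2.py code:

```python
class FakeSlave:
    """Stand-in for a boto instance: counts add_tag calls, never fails."""
    def __init__(self):
        self.calls = 0

    def add_tag(self, key, value):
        self.calls += 1


def tag_without_break(slave, name, attempts=5):
    """As in the diff above: even after add_tag succeeds, the loop
    keeps going and re-tags the instance on every iteration."""
    for _ in range(attempts):
        try:
            slave.add_tag(key='Name', value=name)
        except Exception:
            pass


def tag_with_break(slave, name, attempts=5):
    """With the suggested break, the loop stops after the first success
    and only retries on failure."""
    for _ in range(attempts):
        try:
            slave.add_tag(key='Name', value=name)
            break  # success: stop retrying
        except Exception:
            pass
```

Re-tagging with the same value is harmless on EC2, so the missing break costs only redundant API calls here; it matters more because the pattern is meant to be a retry, not a repeat.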