SPARK-2333 - spark_ec2 script should allow option for existing security group #1899
Conversation
vidaha commented on Aug 12, 2014
Can one of the admins verify this patch?
Just FYI, I tested this and it worked. I'm not sure if it's controversial to use the name tag rather than the security group for destroying a cluster.
Why is it useful to have the cluster name be different from the security group prefix? If I want to re-use an existing security group, I can just name my cluster after that security group. The cluster name only seems to be used to name the security groups, instance requests, and instances. What would happen if I launched a cluster with a name that's different from its security group, then attempted to run spark-ec2 commands by passing only the actual security group name as the cluster name? In this case, I think the instances would still be named with the cluster name, which might cause problems since the script would be expecting to find ones named after the security group.
Hi Josh,

IMHO, it's best not to require the Spark cluster name and the security group to be the same. While you can reuse an existing security group to launch another cluster, you can't launch more than one cluster with the same security group. Perhaps a company wants to have an internal-applications or dev security group and reuse it to launch multiple Spark clusters. In addition, AWS has a strict limit of 100 security groups per VPC, and since two security groups are required (one for the masters and one for the workers), this means that only 50 Spark clusters can be launched on a VPC. While that might seem like a reasonable limit, I can easily see companies having a use case to exceed it.

Do you mind illustrating the problem about name conflicts? If I understand what you are saying, you mentioned this scenario:

    % ./spark-ec2 … --security-group my-security-group launch my-cluster-name

And then later, you also run:

    % ./spark-ec2 … launch my-security-group

This works fine - I tested it - there will be two clusters with the same security group, but different names.

These are some error cases that I thought might come up; I tested for them manually and they worked out fine:
I tried this, and Amazon has the correct controls to prevent deleting a security group still in use by another cluster.
I tried this as well, and since the get_existing_cluster code was modified to use the name rather than the security group to identify the instances, this works right. You can't run:

    % ./spark-ec2 … launch my-cluster-name

if there is already a cluster with my-cluster-name launched.

Are there some other possible conflicts that you can think of? If you can write out the commands to illustrate the use cases you are thinking of, I can run them and see what happens.

-Vida
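The lookup change Vida describes (get_existing_cluster matching on the Name tag instead of the security group) can be sketched roughly as follows. These are hypothetical helpers over plain dicts, not the actual boto-based spark_ec2.py code; the function and field names are illustrative only:

```python
def get_cluster_by_group(instances, cluster_name):
    """Old behavior: match instances whose security group is
    <cluster>-master or <cluster>-slaves. Fails when several
    clusters share one security group."""
    groups = {cluster_name + "-master", cluster_name + "-slaves"}
    return [i for i in instances if groups & set(i["groups"])]


def get_cluster_by_name_tag(instances, cluster_name):
    """New behavior: match instances whose Name tag starts with the
    cluster name, so the security group can be shared across clusters."""
    prefix = cluster_name + "-"
    return [i for i in instances
            if i["tags"].get("Name", "").startswith(prefix)]


if __name__ == "__main__":
    # Two clusters sharing one "dev" group, plus one old-style cluster.
    instances = [
        {"groups": ["dev"], "tags": {"Name": "cluster-a-master-i-1"}},
        {"groups": ["dev"], "tags": {"Name": "cluster-b-slave-i-2"}},
        {"groups": ["cluster-a-master"], "tags": {"Name": "cluster-a-master-i-3"}},
    ]
    # Group-based lookup only sees the old-style instance.
    print(len(get_cluster_by_group(instances, "cluster-a")))
    # Name-tag lookup finds both cluster-a machines in the shared group.
    print(len(get_cluster_by_name_tag(instances, "cluster-a")))
```

Note that prefix matching on the Name tag assumes cluster names are not prefixes of one another (e.g. a cluster named "cluster" would also match "cluster-a-…" instances).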
After a closer look, the code in this PR looks good; a couple of TODOs:
Jenkins, this is ok to test. Test this please.
QA tests have started for PR 1899. This patch merges cleanly.
QA tests have started for PR 1899. This patch merges cleanly.
QA results for PR 1899:
QA results for PR 1899:
    parser.add_option(
        "--security-group-prefix", type="string", default=None,
        help="Use this prefix for teh security group rather than the cluster name.")
Typo: teh -> the
Some minor comments, but otherwise LGTM. One thing that might be worth noting is that we did use tags while launching clusters for AMPCamp [1], and one issue I ran into was that sometimes the create-tag command failed because of eventual consistency between the instance being allocated and the instance being taggable. We might want to retry the tag operation a few times, since it is now the only way to identify instances. [1] https://github.com/amplab/training-scripts/blob/ampcamp4/spark_ec2.py#L342
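The retry Shivaram suggests can be sketched as a small wrapper around any tagging call. This is a sketch under the assumption (confirmed later in the thread) that boto raises an exception when the instance is not yet taggable; `add_tag_with_retry` is a hypothetical helper name, not part of spark_ec2.py:

```python
import time


def add_tag_with_retry(tag_fn, key, value, attempts=5, delay=1.0):
    """Retry a tagging call a few times: a freshly launched EC2
    instance may not be taggable yet due to eventual consistency,
    and the underlying call is assumed to raise on failure."""
    for attempt in range(attempts):
        try:
            tag_fn(key=key, value=value)
            return  # tag applied; stop retrying
        except Exception:
            if attempt == attempts - 1:
                raise  # still failing after all attempts
            time.sleep(delay)
```

In the real script, `tag_fn` would be something like a boto instance's `add_tag` bound method; here it is kept generic so the retry policy can be tested with a stub.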
QA tests have started for PR 1899. This patch merges cleanly.
Great - I made the small edits and added a loop for retrying the tagging logic. I tested bringing up a cluster, and it went up fine - it didn't try tagging more than once. I have no way of testing the tagging retry logic in the failure case, though. By the way, I'm amending my commits rather than creating a new one per edit - is that standard Spark etiquette? Let me know if not.
@vidaha please update the PR title and description to say what the change does. Those actually become the final commit message in git.
QA tests have started for PR 1899. This patch merges cleanly.
QA results for PR 1899:
QA results for PR 1899:
…y group

- Uses the name tag to identify machines in a cluster.
- Allows overriding the security group name so it doesn't need to coincide with the cluster name.
- Outputs the request ids of up to 10 pending spot instance requests.
The committers use the merge_spark_pr.py script for merging pull requests. This script will squash together all of your commits into a single commit that uses the title of the PR plus the PR's description as the commit message, so you no longer need to worry about rebasing your patch into a single commit. I'd just copy-paste the description from your last commit into the description above so it becomes the commit message.
    value='{cn}-master-{iid}'.format(cn=cluster_name, iid=master.id))
    name = '{cn}-master-{iid}'.format(cn=cluster_name, iid=master.id)
    for i in range(0, 5):
        master.add_tag(key='Name', value=name)
What happens if an add_tag call fails? My bet is that it throws an exception rather than silently failing, in which case this re-try logic won't run. Rather than using this "set-and-test" logic, maybe we can just wrap the call in a try-except block?

@shivaram Did the eventual-consistency issue that you saw result in exceptions from add_tag?
Yes - I am pretty sure it throws an exception. I don't remember what the type is; all I see in my notes is that the exception says 'Instance not found'.
QA tests have started for PR 1899 at commit
Okay, I made the retry a try-catch, and edited the title again.
QA tests have finished for PR 1899 at commit
    @@ -440,14 +449,29 @@ def launch_cluster(conn, opts, cluster_name):
        print "Launched master in %s, regid = %s" % (zone, master_res.id)

        # Give the instances descriptive names
        # TODO: Add retry logic for tagging with name since it's used to identify a cluster.
Minor nit: You can remove this TODO now.
LGTM
Alright, great. I'm going to merge this into
…ty group

- Uses the name tag to identify machines in a cluster.
- Allows overriding the security group name so it doesn't need to coincide with the cluster name.
- Outputs the request ids of up to 10 pending spot instance requests.

Author: Vida Ha <vida@databricks.com>

Closes #1899 from vidaha/vida/ec2-reuse-security-group and squashes the following commits:

c80d5c3 [Vida Ha] wrap retries in a try catch block
b2989d5 [Vida Ha] SPARK-2333: spark_ec2 script should allow option for existing security group

(cherry picked from commit 94053a7)
Signed-off-by: Josh Rosen <joshrosen@apache.org>
Opened an issue related to this PR: https://issues.apache.org/jira/browse/SPARK-3332
This reverts #1899 and #2163, two patches that modified `spark-ec2` so that clusters are identified using tags instead of security groups.

The original motivation for this patch was to allow multiple clusters to run in the same security group. Unfortunately, tagging is not atomic with launching instances on EC2, so with this approach we have the possibility of `spark-ec2` launching instances and crashing before they can be tagged, effectively orphaning those instances. The orphaned instances won't belong to any cluster, so the `spark-ec2` script will be unable to clean them up.

Since this feature may still be worth supporting, there are several alternative approaches that we might consider, including detecting orphaned instances and logging warnings, or maybe using another mechanism to group instances into clusters. For the 1.1.0 release, though, I propose that we just revert this patch.

Author: Josh Rosen <joshrosen@apache.org>

Closes #2225 from JoshRosen/revert-ec2-cluster-naming and squashes the following commits:

0c18e86 [Josh Rosen] Revert "SPARK-2333 - spark_ec2 script should allow option for existing security group"
c2ca2d4 [Josh Rosen] Revert "Spark-3213 Fixes issue with spark-ec2 not detecting slaves created with "Launch More like this""
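The orphaning problem behind the revert can be illustrated with a small sketch: once instances are identified only by their Name tag, an instance launched but never tagged is invisible to a name-based lookup. This is a hypothetical helper over plain dicts, not spark-ec2 code; the "detect and warn" approach is one of the alternatives mentioned above:

```python
def partition_instances(instances, cluster_name):
    """Split instances into (members of cluster_name, untagged orphans).
    An instance that crashed out of spark-ec2 between launch and tagging
    has no Name tag, so a purely name-based lookup can neither find it
    nor destroy it."""
    prefix = cluster_name + "-"
    members, orphans = [], []
    for inst in instances:
        name = inst.get("tags", {}).get("Name", "")
        if name.startswith(prefix):
            members.append(inst)
        elif not name:
            orphans.append(inst)  # candidate for a logged warning
    return members, orphans


if __name__ == "__main__":
    instances = [
        {"tags": {"Name": "demo-master-i-1"}},
        {"tags": {}},  # launched, but tagging never happened
    ]
    members, orphans = partition_instances(instances, "demo")
    print(len(members), len(orphans))
```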
…ty group

- Uses the name tag to identify machines in a cluster.
- Allows overriding the security group name so it doesn't need to coincide with the cluster name.
- Outputs the request ids of up to 10 pending spot instance requests.

Author: Vida Ha <vida@databricks.com>

Closes apache#1899 from vidaha/vida/ec2-reuse-security-group and squashes the following commits:

c80d5c3 [Vida Ha] wrap retries in a try catch block
b2989d5 [Vida Ha] SPARK-2333: spark_ec2 script should allow option for existing security group
    name = '{cn}-slave-{iid}'.format(cn=cluster_name, iid=slave.id)
    for i in range(0, 5):
        try:
            slave.add_tag(key='Name', value=name)
Note that we need a break here.
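The effect of the missing break can be demonstrated with a counting stub. `FakeSlave` is a stand-in for a boto instance object, and both loop variants are sketches of the diff above, not the actual spark_ec2.py code:

```python
class FakeSlave:
    """Stand-in for a boto instance: counts add_tag calls, never fails."""
    def __init__(self):
        self.calls = 0

    def add_tag(self, key, value):
        self.calls += 1


def tag_without_break(slave, name, attempts=5):
    """As in the diff above: even after add_tag succeeds, the loop
    keeps going and re-tags the instance on every iteration."""
    for _ in range(attempts):
        try:
            slave.add_tag(key='Name', value=name)
        except Exception:
            pass


def tag_with_break(slave, name, attempts=5):
    """With the suggested break, the loop stops after the first success
    and only retries on failure."""
    for _ in range(attempts):
        try:
            slave.add_tag(key='Name', value=name)
            break  # success: stop retrying
        except Exception:
            pass
```

Re-tagging with the same value is harmless on EC2, so the missing break costs only redundant API calls here; it matters more because the pattern is meant to be a retry, not a repeat.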