Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SPARK-2333 - spark_ec2 script should allow option for existing security group #1899

Closed
wants to merge 2 commits into from

Conversation

vidaha
Copy link
Contributor

@vidaha vidaha commented Aug 12, 2014

- Uses the name tag to identify machines in a cluster.
- Allows overriding the security group name so it doesn't need to coincide with the cluster name.
- Outputs the request id's of up to 10 pending spot instance requests.

@AmplabJenkins
Copy link

Can one of the admins verify this patch?

@vidaha
Copy link
Contributor Author

vidaha commented Aug 12, 2014

Just FYI - I tested this and it worked - not sure if it's controversial to use the name tag rather than the group for destroy a cluster.

@JoshRosen
Copy link
Contributor

Why is it useful to have the cluster name be different from the security group prefix? If I want to re-use an existing security group, I can just name my cluster after that security group. The cluster name only seems to be used to name the security groups, instance requests, and instances.

What would happen if I launched a cluster with a name that's different from its security group, then attempted to run spark-ec2 commands by passing only the actual security group name as the cluster name? In this case, I think the instances would still be named with the cluster name, which might cause problems since the script would be expecting to find ones named after the security group.

@vidaha
Copy link
Contributor Author

vidaha commented Aug 12, 2014

Hi Josh,

IMHO, it's best not to require a Spark cluster name and the security group to be the same. While you can reuse an existing security group to launch another cluster, you can't launch more than one cluster with the same security group. Perhaps a company wants to have an internal-applications or dev security group and reuse that for launch multiple Spark clusters. In addition, AWS has a strict limit of 100 on the number of security groups on a VPC, and since two security groups are required (one of the masters and one for the workers), this means, that only 50 Spark clusters can be launched on a VPC. While that might seem like a reasonable limit, I can easily see companies having a use case to exceed that.

Do you mind illustrating the problem about name conflicts? If I understand what you are saying, you are mentioned this scenario:

% ./spark-ec2 … —security-group my-security-group launch my-cluster-name

And then later, you also run:

% ./spark-ec2 … launch my-security-group

This works fine - I tested it - there will be two clusters with the same security group, but different names. These are some error cases that I thought might offer and tested for manually and they worked out fine:

  • Can you then run ./spark-ec2 … —delete-groups destroy my-security-group and delete the security group when another cluster is using that security group?

I tried this, and amazon has the correct controls to prevent deleting a security group still in use by another cluster.

  • Can you forget to include the security group override on a launched cluster and create problems?

I tried this as well, and since the get_existing_cluster code was modified to use the name rather than the security group to identify the instance, this works right. You can’t run:

% ./spark-ec2 … launch my-cluster-name

if there is already a cluster with my-cluster-name launched.

Are there some other possible conflicts that you can think of? If you can write out the commands to illustrate the use cases you are thinking of - I can run them and see what happens.

-Vida

@JoshRosen
Copy link
Contributor

After a closer look, it looks like the existing spark-ec2 already names instances after cluster_name, so it's safe and backwards-compatible to use that to identify existing clusters.

The code in this PR looks good; a couple of TODOs:

  • Update the pull request title to reference a new or existing JIRA. I think SPARK-2333 fits the bill, so I'd name it something like "[SPARK-2333] Allow multiple spark-ec2 clusters to be launched in the same security group".
  • Maybe update the intro of the Running Spark on EC2 guide guide to explain that the clusters are identified by their instances' names instead of security groups.

@vidaha vidaha changed the title Vida/ec2 reuse security group SPARK-2333 Aug 12, 2014
@JoshRosen
Copy link
Contributor

Jenkins, this is ok to test. Test this please.

@SparkQA
Copy link

SparkQA commented Aug 12, 2014

QA tests have started for PR 1899. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18376/consoleFull

@SparkQA
Copy link

SparkQA commented Aug 12, 2014

QA tests have started for PR 1899. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18379/consoleFull

@SparkQA
Copy link

SparkQA commented Aug 12, 2014

QA results for PR 1899:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test ouptut:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18376/consoleFull

@SparkQA
Copy link

SparkQA commented Aug 12, 2014

QA results for PR 1899:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test ouptut:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18379/consoleFull

@JoshRosen
Copy link
Contributor

This looks good to me. @pwendell or @shivaram, do you want to take a look at this?


parser.add_option(
"--security-group-prefix", type="string", default=None,
help="Use this prefix for teh security group rather than the cluster name.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo: teh -> the

@shivaram
Copy link
Contributor

Some minor comments, but otherwise LGTM.

One thing that might be worth noting is that we did use tags while launching clusters for AMPCamp[1] and one issue I ran into was that sometimes the create tag command failed because of eventual consistency between the instance being allocated and the instance being tag-able.

We might want to retry the tag operation a few times since it is the only way to identify instances now.

[1] https://github.com/amplab/training-scripts/blob/ampcamp4/spark_ec2.py#L342

@SparkQA
Copy link

SparkQA commented Aug 14, 2014

QA tests have started for PR 1899. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18505/consoleFull

@vidaha
Copy link
Contributor Author

vidaha commented Aug 14, 2014

Great - I made the small edits, and added a loop for retrying the tagging logic. I tested to see if I can bring up a cluster, and it went up fine - it didn't try tagging more than once. I have no way of testing the tagging retry logic though in failure case.

By the way, I'm amending my commits rather than create a new one per edit - is that standard Spark etiquette? Let me know if not.

@vanzin
Copy link
Contributor

vanzin commented Aug 14, 2014

@vidaha please update the PR title and description to say what the change does. Those actually become the final commit message in git.

@SparkQA
Copy link

SparkQA commented Aug 14, 2014

QA tests have started for PR 1899. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18510/consoleFull

@SparkQA
Copy link

SparkQA commented Aug 14, 2014

QA results for PR 1899:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test ouptut:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18505/consoleFull

@SparkQA
Copy link

SparkQA commented Aug 14, 2014

QA results for PR 1899:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test ouptut:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18510/consoleFull

@vidaha vidaha changed the title SPARK-2333 SPARK-2333 - Allow option to specify/reuse a security group Aug 14, 2014
…y group

- Uses the name tag to identify machines in a cluster.
- Allows overriding the security group name so it doesn't need to coincide with the cluster name.
- Outputs the request id's of up to 10 pending spot instance requests.
@JoshRosen
Copy link
Contributor

The committers use the merge_spark_pr.py script for merging pull requests. This script will squash together all of your commits into a single commit that uses the title of the PR plus the PR's description as the commit message, so you no longer need to worry about rebasing your patch into a single commit.

I'd just copy-paste the description from your last commit into the description above so it becomes the commit message

value='{cn}-master-{iid}'.format(cn=cluster_name, iid=master.id))
name = '{cn}-master-{iid}'.format(cn=cluster_name, iid=master.id)
for i in range(0, 5):
master.add_tag(key='Name', value=name)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if an add_tag call fails? My bet is that it throws an exception rather than silently failing, in which case this re-try logic won't run. Rather than using this "set-and-test" logic, maybe we can just wrap the call in a try-except block?

@shivaram Did the eventual-consistency issue that you saw result in exceptions from add_tag?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes - I am pretty sure it throws an exception. I don't remember what the type is -- All I see in my notes is that the exception says 'Instance not found'

@vidaha vidaha changed the title SPARK-2333 - Allow option to specify/reuse a security group SPARK-2333 - spark_ec2 script should allow option for existing security group Aug 19, 2014
@SparkQA
Copy link

SparkQA commented Aug 19, 2014

QA tests have started for PR 1899 at commit c80d5c3.

  • This patch merges cleanly.

@vidaha
Copy link
Contributor Author

vidaha commented Aug 19, 2014

Okay, I made the retry a try catch, and edited the title again.

@SparkQA
Copy link

SparkQA commented Aug 19, 2014

QA tests have finished for PR 1899 at commit c80d5c3.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JoshRosen
Copy link
Contributor

@shivaram @pwendell This looks good to me. Does one of you want to take a final look + signoff?

@@ -440,14 +449,29 @@ def launch_cluster(conn, opts, cluster_name):
print "Launched master in %s, regid = %s" % (zone, master_res.id)

# Give the instances descriptive names
# TODO: Add retry logic for tagging with name since it's used to identify a cluster.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor nit: You can remove this TODO now.

@shivaram
Copy link
Contributor

LGTM

@JoshRosen
Copy link
Contributor

Alright, great. I'm going to merge this into master and branch-1.1.

asfgit pushed a commit that referenced this pull request Aug 19, 2014
…ty group

    - Uses the name tag to identify machines in a cluster.
    - Allows overriding the security group name so it doesn't need to coincide with the cluster name.
    - Outputs the request id's of up to 10 pending spot instance requests.

Author: Vida Ha <vida@databricks.com>

Closes #1899 from vidaha/vida/ec2-reuse-security-group and squashes the following commits:

c80d5c3 [Vida Ha] wrap retries in a try catch block
b2989d5 [Vida Ha] SPARK-2333: spark_ec2 script should allow option for existing security group

(cherry picked from commit 94053a7)
Signed-off-by: Josh Rosen <joshrosen@apache.org>
@asfgit asfgit closed this in 94053a7 Aug 19, 2014
@douglaz
Copy link
Contributor

douglaz commented Aug 31, 2014

Opened an issue related to this PR: https://issues.apache.org/jira/browse/SPARK-3332

asfgit pushed a commit that referenced this pull request Sep 2, 2014
This reverts #1899 and #2163, two patches that modified `spark-ec2` so that clusters are identified using tags instead of security groups.  The original motivation for this patch was to allow multiple clusters to run in the same security group.

Unfortunately, tagging is not atomic with launching instances on EC2, so with this approach we have the possibility of `spark-ec2` launching instances and crashing before they can be tagged, effectively orphaning those instances.  The orphaned instances won't belong to any cluster, so the `spark-ec2` script will be unable to clean them up.

Since this feature may still be worth supporting, there are several alternative approaches that we might consider, including detecting orphaned instances and logging warnings, or maybe using another mechanism to group instances into clusters.  For the 1.1.0 release, though, I propose that we just revert this patch.

Author: Josh Rosen <joshrosen@apache.org>

Closes #2225 from JoshRosen/revert-ec2-cluster-naming and squashes the following commits:

0c18e86 [Josh Rosen] Revert "SPARK-2333 - spark_ec2 script should allow option for existing security group"
c2ca2d4 [Josh Rosen] Revert "Spark-3213 Fixes issue with spark-ec2 not detecting slaves created with "Launch More like this""
xiliu82 pushed a commit to xiliu82/spark that referenced this pull request Sep 4, 2014
…ty group

    - Uses the name tag to identify machines in a cluster.
    - Allows overriding the security group name so it doesn't need to coincide with the cluster name.
    - Outputs the request id's of up to 10 pending spot instance requests.

Author: Vida Ha <vida@databricks.com>

Closes apache#1899 from vidaha/vida/ec2-reuse-security-group and squashes the following commits:

c80d5c3 [Vida Ha] wrap retries in a try catch block
b2989d5 [Vida Ha] SPARK-2333: spark_ec2 script should allow option for existing security group
name = '{cn}-slave-{iid}'.format(cn=cluster_name, iid=slave.id)
for i in range(0, 5):
try:
slave.add_tag(key='Name', value=name)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note that we need a break here

snmvaughan pushed a commit to snmvaughan/spark that referenced this pull request Mar 26, 2024
…ing and column masking (apache#1899)

Initial PR to integrate UC-Spark-Plugin
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants