Support Accumulo installs on Microsoft Azure #270
Thanks for the contribution @srajtiwari. This is a large PR, so it's going to take some time for me to work through it. I noticed some binary files like OMS-collectd.pp; what is that?
ansible/roles/accumulo/templates/hadoop-metrics2-accumulo.properties
Hi @srajtiwari. This is quite a large pull request, and I think it's going to take a bit of time to review. In the future, it would probably help to contribute smaller, more narrowly scoped changes. You have 6 different bullet points that this pull request accomplishes; those could probably have been 6 different pull requests. Also, I noticed that there are some binary files in here. Can you explain what those are? We probably aren't going to be able to accept the binary files in the pull request, since binaries are not "open source" by any standard definition.
@ctubbsii and @keith-turner, firstly, thank you for your comments. We acknowledge the feedback about the size of the PR, and in the future we will definitely scope things down to much smaller chunks. About the binary files: these are SELinux policy modules obtained using audit2allow, which allow SELinux to permit the statsd plugin (for collectd) to bind to port 8125, and also allow collectd to talk to the Azure Log Analytics agent. We will figure out a way to not check in those binary files and instead have Ansible tasks that will generate and copy these files per deployment, again with a conditional for cluster_type == azure.
Oh, that makes sense. Could probably just check in the …
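The deploy-time generation described above might look roughly like the following Ansible tasks. This is only a sketch based on the comment: the task names, the working directory, and the exact commands are stand-ins for whatever the PR actually implements (only the module name OMS-collectd and the cluster_type == azure conditional come from the thread).

```yaml
# Sketch only: build the SELinux policy module at deploy time with
# audit2allow instead of committing the compiled .pp binary.
# Task names and paths are assumptions, not the PR's actual code.
- name: Generate SELinux policy module from audit log denials
  command: audit2allow -a -M OMS-collectd
  args:
    chdir: /tmp
  when: cluster_type == 'azure'

- name: Install the generated policy module
  command: semodule -i /tmp/OMS-collectd.pp
  when: cluster_type == 'azure'
```

This would keep only text under version control and produce the binary .pp fresh on each deployment.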
I am still looking at this, but here are the comments I have so far.
@arvindshmicrosoft, @srajtiwari, or @karthick-rn: so far I have only been looking at the code changes for this. I would like to try running these changes in the next day or two, but I think you may still be making changes. Let me know if you think I should wait before giving this a run.
Hi @keith-turner, we have addressed all of the comments (to the best of our belief) except for the HA configuration one, which (as discussed) we are tracking via #271; hopefully we will push that as a separate PR in the next week or so. Please let us know about other immediate issues or comments you may find, and we will quickly triage and decide whether we can address them in this PR or create issue(s) to track their subsequent resolution through later PR(s). Thank you very much again for your patience and advice working through this.
I tried running this branch against EC2 to set up a 12-node cluster and I ran into a few issues.
Below are the ZooKeeper setup errors I saw. The leader nodes only have
Below are the jps errors and the failure to download Accumulo.
@keith-turner, thanks for testing! Could you let me know the EC2 instance type that you were using for the tests? I'll take a look at the logic for assigning the worker data dirs ASAP. Now, given that leader1 failed (due to the drive folder issue) and presuming that leader1 is also the proxy host, it looks like the downstream "download Accumulo tarball" task did not run on the proxy (that is Ansible's behavior when an earlier task fails for a host), and hence the install failed as well. So IMHO the second observation, about the Accumulo install failing, is "by design". Will reply ASAP on the data directory issue.
@arvindshmicrosoft I am trying to use the following settings.
@arvindshmicrosoft thanks for that explanation, it was immensely helpful! I do not know Ansible very well, but I'm quickly learning. While debugging, I ran the following commands on the cluster; the first would not download and the second would.

    ansible-playbook ansible/site.yml --extra-vars "azure_proxy_host=_"
    ansible-playbook ansible/accumulo.yml --extra-vars "azure_proxy_host=_"

After reading your message, I now understand the difference in behavior.
Thanks, @keith-turner, for the details. m5d.xlarge has 1 ephemeral disk (as you said) and d2.xlarge has 3. Based on this, default_data_dirs has just the '/media/ephemeral0' location, while worker_data_dirs has all 3 ephemeral disks. Now, unfortunately it looks like the previous ZooKeeper and Hadoop roles were loosely using worker_data_dirs, and our changes continued using worker_data_dirs. From our understanding of the implementation, all references on the leader nodes should only be using default_data_dirs. This appears to be a latent bug which has been uncovered indirectly now. To unblock, using a homogeneous cluster with a common instance type across leaders and workers would work around this issue, but @karthick-rn and I discussed it and we feel we should address the latent bug. The only question is when: would you prefer we create a new issue and address it in the next week or so with a fresh PR, or would you prefer we bundle the fix for this new (latent) bug into this existing PR itself?
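The disk-count mismatch described above can be illustrated with a small Python sketch (Muchos's deployment code is Python, but the names below, EPHEMERAL_DISKS and data_dirs, are illustrative stand-ins, not Muchos's actual implementation):

```python
# Illustrative sketch of the latent bug discussed above: data dirs are
# derived from an instance type's ephemeral disk count, so leaders and
# workers with different instance types get different dir lists.
# Disk counts below are the ones stated in this thread.
EPHEMERAL_DISKS = {
    "m5d.xlarge": 1,  # leaders
    "d2.xlarge": 3,   # workers
}

def data_dirs(instance_type):
    """Return the ephemeral-disk mount points for an instance type."""
    count = EPHEMERAL_DISKS[instance_type]
    return ["/media/ephemeral{}".format(i) for i in range(count)]

default_data_dirs = data_dirs("m5d.xlarge")  # ['/media/ephemeral0']
worker_data_dirs = data_dirs("d2.xlarge")    # all 3 ephemeral disks

# Roles that use worker_data_dirs on a leader node reference mount
# points that do not exist there:
missing_on_leader = [d for d in worker_data_dirs if d not in default_data_dirs]
print(missing_on_leader)  # ['/media/ephemeral1', '/media/ephemeral2']
```

This is why a homogeneous cluster (one instance type everywhere) sidesteps the bug: the two lists coincide.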
I can't really say, because I don't know enough about the bug and how it manifests in this branch vs master. I do know the config using …
I think you have implicitly answered our question, @keith-turner. Given that you are seeing a regression from the current master, we will work to make sure our current PR is stable on EC2.
Force-pushed from 0ffd6ac to 6503458.
@keith-turner @ctubbsii I've updated the PR to encompass the minimum set of changes that are needed to support Azure-based installs. We will follow up with separate PRs for the other (monitoring, optional HA) pieces. Thanks for your help with this!
This is a much smaller and easier-to-parse change than the previous one. Thank you for trimming this down to the minimum needed for Azure support, and for deferring the further changes.
* Add new cluster type 'azure' which leverages VM Scale Sets
* Increase some CentOS defaults to improve cluster stability
* Add checksums for specific Spark and Hadoop versions, as well as for Accumulo 2.0.0
This looks good to me.
@keith-turner what do you think?
@SlickNik In the future, it's probably best to avoid force pushes, especially when responding to review feedback, since it makes things slightly harder to review. We typically squash-merge at the end anyway. When you went back to the "minimal" approach, it probably made sense, but this last change could have been a regular commit. No big deal, though; I manually did a diff since my last review 😸
@ctubbsii 👍 Sounds good, will keep that in mind; it makes total sense since we're doing a squash and merge. It's just a remnant from working with other projects that merge without squashing, where 'checkpoint' commits littering the git log can make history distracting! 😇
@SlickNik this was much easier to review, thanks for simplifying it. I finished reviewing this and now I am going to test running it to ensure it still works on EC2.
* Addition of comments to [azure] section in muchos.props.example
* Changes to README.md
* Minor spelling error corrections
* Update config.py to be stylistically consistent

Signed-off-by: Nikhil Manchanda <SlickNik@gmail.com>
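As context for the [azure] section mentioned in the commit above, a muchos.props fragment might look like the following. This is a hypothetical illustration: only cluster_type and the [azure] section name appear in the thread; the section layout and comments are assumptions.

```ini
; Hypothetical muchos.props fragment (illustrative only)
[general]
; 'azure' selects the new VM Scale Set based deployment path,
; alongside the existing ec2 cluster type
cluster_type = azure

[azure]
; Azure-specific settings would be configured here per deployment
```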
I was able to run this branch on EC2 without issue.
Thanks for the contribution, everyone! If any of you would like to be listed as a contributor, please edit Fluo's people page. The Apache Fluo project tweets about first contributions; if any of you would like a tweet to be made about this PR, just let me know what Twitter handles to include.
In addition to being listed as contributors, if anybody is interested in continuing to contribute to Fluo, please consider subscribing to the developer mailing list and perhaps introducing yourself on that list.
Accumulo master, and Zookeeper roles within Muchos. Note: HA is on by default and should not be disabled.