Followers not joining cluster #66

Closed
Narragansett opened this issue Aug 18, 2021 · 23 comments
Assignees: binlab
Labels: documentation (Improvements or additions to documentation)

Comments

@Narragansett

I can only ever get this result. How do I get followers to join the cluster?
$ vault operator raft list-peers
Node     Address                  State     Voter
----     -------                  -----     -----
node1    node1.vault.int:8201     leader    true

@binlab
Owner

binlab commented Aug 19, 2021

@Narragansett Could you describe your running configuration and which version of the module you are using?

@binlab binlab added the bug Something isn't working label Aug 19, 2021
@Narragansett
Author

I'm not using any custom configuration. I'm using the latest master branch from here - https://github.com/binlab/terraform-aws-vault-ha-raft
With the default setup, cluster size = 3. Does the master branch work for you as-is?
Thank you!

@Narragansett
Author

I tried pulling master again, with minimal, non-important changes (e.g. enabling debug, declaring private_key as sensitive = true). Same result: only the leader is part of the cluster; followers are absent. Can you try it? We need to build it from master. We should be close, but it isn't working. Thank you!
$ vault operator raft list-peers
Node     Address                  State     Voter
----     -------                  -----     -----
node2    node2.vault.int:8201     leader    true

@Narragansett
Author

Has anyone else tried this from master? I keep trying (cloning master), but all I get is one leader and no followers. Please help!

@binlab
Owner

binlab commented Aug 23, 2021

I'm not using any custom configuration. I'm using the latest master branch from here - https://github.com/binlab/terraform-aws-vault-ha-raft

You need at least some Terraform code to call the module; that's what I mean.

With the default setup, cluster size = 3. Does the master branch work for you as-is?

For cluster mode you need at least 3 nodes, that's correct. But for testing, even one node should work.
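
As a rough sketch (not the exact code used in this thread), a minimal root module calling the module could look like the following; the source reference and values are illustrative, and the variable names cluster_count and node_allow_public are the ones referenced elsewhere in this issue:

# Hypothetical minimal root module; adjust the source ref and values to your setup.
module "vault" {
  source = "github.com/binlab/terraform-aws-vault-ha-raft"

  cluster_count     = 3     # default cluster size discussed in this issue
  node_allow_public = true  # setting the reporter changed to reach the nodes
}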

@binlab
Owner

binlab commented Aug 23, 2021

I tried pulling master again, with minimal, non-important changes (e.g. enabling debug, declaring private_key as sensitive = true). Same result: only the leader is part of the cluster; followers are absent. Can you try it? We need to build it from master. We should be close, but it isn't working. Thank you!

The Vault module from master should work; it's tested. I just tried to create a cluster from scratch and it went fine. You can start from this example: basic-usage-quick-start (this is the one I tried just now).

Then you need to initialize the cluster; the manual is here: Initializing the newly created cluster.

After successful initialization and login, you should see the following screen at the link (example):
http://tf-vault-ha-basic-alb-1234567890.us-east-1.elb.amazonaws.com:443/ui/vault/storage/raft

(screenshot: vault-raft)
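
For reference, a rough sketch of the manual initialize-and-unseal flow against the cluster address; the ALB hostname is the example one from the link above, and the number of unseal key shares depends on how you ran the init, so treat this as an illustration rather than module documentation:

$ export VAULT_ADDR="http://tf-vault-ha-basic-alb-1234567890.us-east-1.elb.amazonaws.com:443"
$ vault operator init      # prints the unseal key shares and the initial root token
$ vault operator unseal    # repeat until the unseal threshold is reached
$ vault status             # "Sealed" should now be false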

@Narragansett
Author

Interesting. I pulled master again with minimal changes, and I still get only one node joining the cluster; see the attached screenshots.

(screenshots: Screen Shot 2021-08-23 at 6 58 24 PM, Screen Shot 2021-08-23 at 7 00 04 PM)

I'm also attaching the minimal changes needed to get it running with Terraform v1.0.4. cluster_count is at its default of 3, and I set:

variable "node_allow_public" {
  default = true
}

Let me think about what else to do. We can't be very far apart. Thank you.

@Narragansett
Author

I notice your testing screenshot (the token login at the bottom of the initialization page) shows (c) 2021 HashiCorp Vault 1.4.2. How can this be? 1.4.2 was released early in 2020. When I pull master and build, variables.tf line 562 has default = "1.8.1". That is the difference. Could you try the current master, please? That is what is not working, i.e. it ends up with only one active node in a cluster of 3. Thank you.

@Narragansett
Author

I tried Vault 1.4.1 with the same result: only one node in the cluster. I don't see any significant differences between us; indeed, I made only the few changes necessary to get it to work, and I'm now using Terraform v1.0.5. Let me think about what else it could be. Thank you. I need it to work as a cluster so I can work on HA concerns.

@binlab
Owner

binlab commented Aug 24, 2021

Interesting. I pulled master again with minimal changes, and I still get only one node joining the cluster; see the attached screenshots.

This is really very interesting, but I think I can guess what the reason might be. Did you try a fresh install from an example, or did you update an existing deployment? This is important because some resources are not updated in place, and if my guess is correct, that may be the reason.

@Narragansett
Author

I am doing a fresh install each time. My deploy cycle is: terraform init (once) -> terraform apply -> terraform destroy. I'm back on Vault 1.8.1, as it's better to stay with the latest. All resources are new, then destroyed; if I try again, I start over with everything new each time. And I initialize (1:1) just as you've done. Still thinking about what it could be. It still doesn't work. Thank you!

@binlab
Owner

binlab commented Aug 24, 2021

I notice your testing screenshot (the token login at the bottom of the initialization page) shows (c) 2021 HashiCorp Vault 1.4.2. How can this be? 1.4.2 was released early in 2020. When I pull master and build, variables.tf line 562 has default = "1.8.1". That is the difference. Could you try the current master, please? That is what is not working, i.e. it ends up with only one active node in a cluster of 3. Thank you.

I tried the example I advised you to use, and yes, from the latest master branch, but I was running a fresh install.

@Narragansett
Author

Okay, we have to be close. Can you share the minimal Terraform changes you made to get it to work, similar to what I shared with you? I'm not sure what else I can change to get it working. Thank you.

@binlab
Owner

binlab commented Aug 24, 2021

I am doing a fresh install each time. My deploy cycle is: terraform init (once) -> terraform apply -> terraform destroy. I'm back on Vault 1.8.1, as it's better to stay with the latest. All resources are new, then destroyed; if I try again, I start over with everything new each time. And I initialize (1:1) just as you've done. Still thinking about what it could be. It still doesn't work. Thank you!

Nice. Then, to understand the reason, you need to check the logs. Could you go to the instance via SSH and check the logs with the command journalctl --utc -a -u vault.service -r -n 100, or journalctl --utc -a -u vault.service -f for real time?
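
A sketch of that flow, assuming SSH access to a node (the key path, user, and IP below are placeholders):

$ ssh -i <path-to-private-key> <user>@<node-ip>
$ journalctl --utc -a -u vault.service -r -n 100   # last 100 entries, newest first
$ journalctl --utc -a -u vault.service -f          # follow the log in real time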

@Narragansett
Author

I will do it tonight, investigate, and reply back. Again, thank you so much for the advice.

@binlab
Owner

binlab commented Aug 24, 2021

Okay, we have to be close. Can you share the minimal Terraform changes you made to get it to work, similar to what I shared with you? I'm not sure what else I can change to get it working. Thank you.

My Terraform versions are

$ terraform version
Terraform v1.0.3
on linux_amd64
+ provider registry.terraform.io/community-terraform-providers/ignition v1.3.0
+ provider registry.terraform.io/hashicorp/aws v3.55.0
+ provider registry.terraform.io/hashicorp/local v2.1.0
+ provider registry.terraform.io/hashicorp/tls v3.1.0

and I used this example from scratch, with no changes

@binlab
Owner

binlab commented Aug 24, 2021

@Narragansett By the way, a similar issue was recently fixed by PR #52, and there are more details in #51.

But anyway, to help you I need to see the logs.

@Narragansett
Author

Great! I see the error, and I know what it means and why it's happening. I'm just not certain how to fix it, or why it would be happening to me alone. node-2 is the leader and it is alone in the cluster. Thank you!

vault-node-0.log.zip
vault-node-1.log.zip
vault-node-2.log.zip

Aug 24 21:00:10 ip-192-168-1-193.ec2.internal docker[1756]: 2021-08-24T21:00:10.577Z [INFO] core: security barrier not initialized
Aug 24 21:00:10 ip-192-168-1-193.ec2.internal docker[1756]: 2021-08-24T21:00:10.577Z [INFO] core: attempting to join possible raft >
Aug 24 21:00:10 ip-192-168-1-193.ec2.internal docker[1756]: 2021-08-24T21:00:10.584Z [WARN] core: join attempt failed:
Aug 24 21:00:10 ip-192-168-1-193.ec2.internal docker[1756]: error=
Aug 24 21:00:10 ip-192-168-1-193.ec2.internal docker[1756]: | error during raft bootstrap init call: Error making API request.
Aug 24 21:00:10 ip-192-168-1-193.ec2.internal docker[1756]: |
Aug 24 21:00:10 ip-192-168-1-193.ec2.internal docker[1756]: | URL: PUT https://node2.vault.int:8200/v1/sys/storage/raft/bootstrap/>
Aug 24 21:00:10 ip-192-168-1-193.ec2.internal docker[1756]: | Code: 503. Errors:
Aug 24 21:00:10 ip-192-168-1-193.ec2.internal docker[1756]: |
Aug 24 21:00:10 ip-192-168-1-193.ec2.internal docker[1756]: | * Vault is sealed

@Narragansett
Author

Also, I am following this example diligently, using today's master -
https://github.com/binlab/terraform-aws-vault-ha-raft/tree/master/examples/basic-usage-quick-start

The master doesn't quite do all of what the example says, but I trust that's not significant right now ...
with auto unseal by AWS KMS and AWS NAT Gateway enabled

Additionally, I had to change variables.tf:

variable "node_allow_public" {
  default = true
}

@Narragansett
Author

It appears you must be using auto unseal with KMS keys. I will try to set that up next. Thanks!
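
For reference, AWS KMS auto unseal ultimately corresponds to a seal "awskms" stanza in the Vault server configuration, presumably rendered by the module when auto unseal is enabled; the values below are placeholders, shown only as a sketch of the underlying mechanism:

seal "awskms" {
  region     = "us-east-1"      # placeholder region
  kms_key_id = "<kms-key-id>"   # placeholder KMS key id
}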

@Narragansett
Author

Okay! Using auto unseal, it works! I get a cluster of 3. Thanks for guiding me through it and for pointing me to the logs. It was better that way, as I learned a lot more. Thank you so much for such nice project work! User error; auto unseal made it work.

@binlab
Owner

binlab commented Aug 25, 2021

Okay! Using auto unseal, it works! I get a cluster of 3. Thanks for guiding me through it and for pointing me to the logs. It was better that way, as I learned a lot more. Thank you so much for such nice project work! User error; auto unseal made it work.

Glad to hear that the issue was solved!

@binlab
Owner

binlab commented Aug 25, 2021

The master doesn't quite do all of what the example says, but I trust that's not significant right now ...

Yes, the values in the master configuration don't envisage the use of KMS auto unseal by default. That was done intentionally, to reduce costs and to allow simple deployments in the form of a "cluster" of one node (in that case you can use manual unseal with no issue), which is really only useful for one goal: for example, to auto unseal a second cluster as a Vault transit KMS, or for testing and non-important data. For deployments that consist of more than one node, it is quite difficult to provide manual unseal without unsealing over SSH or through publicly available nodes. This problem was partially discussed in #58. But now I see that this configuration may not be completely obvious for new users and can mislead them. I will correct the default value and documentation soon. Thank you for reporting!

@binlab binlab added documentation Improvements or additions to documentation and removed bug Something isn't working labels Aug 25, 2021
@binlab binlab self-assigned this Aug 25, 2021