What network/firewall config is required for private cluster #305
Possible cause: since I'm allowing the module to create a service account, does that service account need extra permissions because of the shared VPC setup? There's no indication that I can't use a module-created service account with a shared VPC, but it's the only obvious potential source of a problem. |
@ideasculptor That might be it - can you see if granting |
Sure thing. It's going to take me an hour or two to get to it, but check back here later or tomorrow and I should have a result. Any idea if adding a role to the service account outside of the module is likely to get placed into the dependency graph in an order that will be useful, or do I need to modify the module itself to get that to happen before cluster creation fails and the module returns? |
Answered my own question - Adding the role outside of the module DOES result in an object graph that is ordered correctly. Should have an answer shortly. I'm trying (separately) each of the following:
|
Excellent! Let us know how it goes and we could possibly integrate into the module itself. |
I've tried every variant I can come up with, including using google_project_iam_member for the compute.networkUser role instead of assigning it just to the one subnet (incidentally, where can I see those per-subnet role assignments in the console? I couldn't find them anywhere, which is why I ended up granting the role project-wide, just so I could verify that the assignment was in place). I continue to get that same warning related to the node pool -
I've also disabled every possible optional feature (I had dashboard and http load balancing enabled, previously). |
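For illustration, granting compute.networkUser outside the module along these lines is roughly what was being tried here. This is a sketch only: the host project variable, region, subnet name, and the assumption that the module exposes a `service_account` output are all placeholders, not verbatim config from the thread.

```hcl
# Sketch: granting roles/compute.networkUser to the module-created service
# account on the Shared VPC host project. Names and variables are placeholders.

# Per-subnet grant (scoped to the GKE subnet only):
resource "google_compute_subnetwork_iam_member" "gke_network_user" {
  project    = var.host_project_id   # Shared VPC host project (hypothetical variable)
  region     = "us-central1"
  subnetwork = "gke-subnet"
  role       = "roles/compute.networkUser"
  member     = "serviceAccount:${module.gke.service_account}"
}

# Project-wide grant, easier to verify in the IAM console:
resource "google_project_iam_member" "gke_network_user" {
  project = var.host_project_id
  role    = "roles/compute.networkUser"
  member  = "serviceAccount:${module.gke.service_account}"
}
```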
I'm reasonably certain this is user error, but I'm still verifying a fix - the test cycle is pretty long. I had failed to note that the examples are explicitly adding the cidr range for the gke subnet to the master_authorized_networks_config in a private cluster. Seems like adding that subnet ought to go without saying rather than requiring explicit inclusion, but lacking an implicit inclusion of the subnet cidr block, it is probably worth calling out as a requirement in the variable description for master_authorized_networks_config. |
Hmm, perhaps I take that previous comment back - the documentation for that variable actually states that the subnet is implicitly included, which is why I hadn't added it despite the examples. That appears to be incorrect if the examples are anything to go by, since they always explicitly add the subnetwork cidr block. I'll know if it fixed my problem in 24 minutes, though then I have to go through re-enabling all the things I disabled while testing and make sure it stays up. The current description of master_authorized_networks_config includes the following:
Whereas the examples include the following:
|
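For illustration (not the verbatim example text), explicitly adding the GKE subnet's CIDR block the way the examples do might look roughly like this. The nested `cidr_blocks` shape assumes the list-of-objects format the module's examples used at the time; the CIDR and display name are placeholders.

```hcl
master_authorized_networks_config = [
  {
    cidr_blocks = [
      {
        cidr_block   = "10.10.0.0/20"   # primary range of the GKE subnet (placeholder)
        display_name = "gke-subnet"
      },
    ]
  },
]
```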
I give up. The only configuration I can make work is one in which all private cluster features are disabled. As soon as I set
The cluster comes up correctly, but is now complaining about no available nodes and failed autoscaling; at least it correctly creates the default-node-pool. Even getting to the point of provisioning the default-node-pool is new behaviour. When running as any variant of a private cluster, the default pool never finishes. I have tried ALL of the following variations of the following variables (always with master_authorized_networks_config configured with the cidr_block for the gke subnet):
Please note that each one of those trials requires 25 minutes to fail and clean up. This is NOT a rapid debugging process. My core module config looks like this:

```hcl
module "gke" {
  project_id                    = local.project_id
  ip_range_services             = local.ip_range_services
  deploy_using_private_endpoint = var.deploy_using_private_endpoint
```

Whatever it is that is failing is VERY non-obvious, because I have been through all of the examples and tried every possible variation for days on end, to no avail. This implies a dire lack of documentation for the private cluster modules; I've pored over all of the readmes and examples looking for clues, also to no avail. |
Can you try explicitly defining a node pool? That could be the issue. At this point I'm skeptical the issue is with the module. Does it return errors when running Terraform? It seems much more likely it's somehow an issue with your project/the GKE control plane. |
For what it is worth, an instance in default-node-pool DOES exist, but it is clearly unable to communicate back to the cluster, since the cluster thinks there are 0 nodes. At least I finally have a stable cluster and node pool that I can actually debug without having it automatically torn down by cluster creation failure. It has taken me since Friday afternoon just to get that much to function (it is Tuesday night). Except now I have to figure out how to get into the node, since it doesn't have an ssh tag. And I still don't have a private cluster. I manually added a firewall rule which allows port 22 access to the node's tag. Now that I can finally access a running node (though, rather obnoxiously, the node gets terminated periodically due to failed initialization), I can see the following failure:
So that's progress. Apparently the default network configuration when using the fabric network modules is not sufficient for access to |
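The throwaway debugging rule described above might look something like this. The network, project variable, tag, and source range are all placeholders.

```hcl
# Temporary rule to SSH into a node while debugging failed bootstrap.
resource "google_compute_firewall" "debug_ssh_to_nodes" {
  name          = "debug-allow-ssh-to-gke-nodes"
  project       = var.host_project_id
  network       = "shared-vpc"
  direction     = "INGRESS"
  source_ranges = ["203.0.113.0/24"]   # wherever you are SSHing from
  target_tags   = ["gke-node"]         # the tag applied to the node instances

  allow {
    protocol = "tcp"
    ports    = ["22"]
  }
}
```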
I get no errors from terraform, other than the eventual cluster startup failure. The account I am using definitely has the specified permissions to run the module, and it isn't failing due to API permissions problems. I need to figure out how to view that log in a manner which gives me the full command line and URL, and then I can set about debugging what is happening in the network. I thought there was some kind of private endpoint for accessing things like storage, in order to eliminate the need to go out through the internet gateway and access the public API endpoints. My network configuration is provided much further up in this comment thread. I declare only a single route, out to the public internet via the default gateway. I also run the firewall module with default values for everything, which sets up rules for inbound http, https, and ssh tags. There are no egress rules.
Finally got it up. For now, I added a route and firewall rule for the node tag. Docs could definitely use a mention of the requirement for outbound https access. Once I'm using a custom node pool, I can specify tags that work with the default routes and rules, but it seems to me that the default node pool ought to work with the default network configuration, or else the module ought to apply non-default network configuration to enable it. |
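A sketch of the route and egress rule described in this comment, under hypothetical names. Note that GCP allows all egress by default, so the firewall rule only matters if an egress-deny policy is in place; the missing piece in this case was the route.

```hcl
# Route tagged node traffic out through the default internet gateway.
resource "google_compute_route" "gke_nodes_egress" {
  name             = "gke-nodes-internet-egress"
  project          = var.host_project_id
  network          = "shared-vpc"
  dest_range       = "0.0.0.0/0"
  next_hop_gateway = "default-internet-gateway"
  tags             = ["gke-node"]
  priority         = 1000
}

# Explicit outbound HTTPS allow, only needed if egress is otherwise denied.
resource "google_compute_firewall" "gke_nodes_https_egress" {
  name               = "gke-nodes-allow-https-egress"
  project            = var.host_project_id
  network            = "shared-vpc"
  direction          = "EGRESS"
  destination_ranges = ["0.0.0.0/0"]
  target_tags        = ["gke-node"]

  allow {
    protocol = "tcp"
    ports    = ["443"]
  }
}
```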
Outbound https access shouldn't be a requirement, but rather Private Google Access as documented here. Ultimately I don't think this is an issue with the module. The module doesn't control your network config, and there are many possible network configs, so we should really rely on the GKE docs for that. I know this has been a frustrating experience for you, but I don't think the module or Terraform are to blame. |
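For reference, Private Google Access is a per-subnet flag. A minimal sketch with placeholder names and ranges:

```hcl
# With private_ip_google_access enabled, nodes without external IPs can reach
# Google APIs (storage.googleapis.com, gcr.io, etc.) without an internet route.
resource "google_compute_subnetwork" "gke" {
  name                     = "gke-subnet"
  project                  = var.host_project_id
  region                   = "us-central1"
  network                  = "shared-vpc"
  ip_cidr_range            = "10.10.0.0/20"
  private_ip_google_access = true

  secondary_ip_range {
    range_name    = "pods"
    ip_cidr_range = "10.20.0.0/16"
  }
  secondary_ip_range {
    range_name    = "services"
    ip_cidr_range = "10.30.0.0/20"
  }
}
```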
Terraform definitely isn't to blame. The module is, at least insofar as the documentation is usually insufficient to actually use the modules in a real-world context. Ultimately, it comes back to the examples being too simplified and too dependent on variables containing references to correctly configured resources, rather than showing what a valid config actually looks like - a problem I encounter awfully frequently with all of these modules. The examples are rarely sufficient to actually get a thing up and working in combination with the other modules, which give all appearances of having been designed to work together. Even the full fabric examples don't really show real-world use cases for an entire infrastructure.

Every time I put a new module into use, it requires a reverse engineering effort (often a multi-day one) to work out what assumptions are being made, in which modules, and how to effectively override them to actually wire two modules together to do something useful - frequently requiring module customization and/or submitting a PR. There appears to be a significant lack of integration examples - a gap that even careful deconstruction of the available unit examples is insufficient to fill. It's not that I don't know how to set things up, but figuring out that module X sets up something according to pattern X while module Y assumes a dependency created via pattern Y, which requires a bunch of unspecified configuration changes from the default pattern X, is often a really significant chore, and one that requires reading every line of code in both modules and all of the examples. Which kind of defeats the purpose, since it would often be faster to just develop an equivalent module from scratch without first having to understand all the prerequisites of someone else's work.

It feels like someone other than myself should be stringing all of these modules together into a cohesive architecture and documenting what that looks like. The examples in the fabric example repo don't seem to do so - for example: https://github.com/terraform-google-modules/cloud-foundation-fabric/blob/master/infrastructure/shared-vpc/main.tf There's nothing in there about enabling private google access or setting up outbound https for node pools, nor is there an example that creates a custom node pool with tags that match the expectations of the network fabric modules. I've looked through the various git branches, too, and haven't found anything that shows how I would string folders, projects, networks, routes, firewalls, and a GKE cluster together. So I end up losing days at a time to reverse engineering the expectations of each module. |
I think the gap is that the modules don't necessarily stand on their own. These are not a replacement for the official GCP docs, and we do expect you to spend some time with the docs as well. In this case, this module isn't responsible for configuring your network, so it certainly isn't what sets up private Google Access. Maybe we should add an example network setup with private Google Access enabled to the network module—probably a good idea. One thing I want to be clear about is that this isn't an 'expectation of these modules'—it's an expectation of the services themselves. GKE itself expects that your nodes have private Google Access properly configured, so you'd run into a similar problem if you simply tried to create a private cluster in the UI. Happy to look at targeted pull requests and issues, but as it stands this isn't really an actionable request. Everyone has a different network setup, and some things you want to do (e.g. keeping network creation and subnetwork creation separate) don't match our recommendations. |
Note that there is absolutely no mention made of private google access being required in the documentation for any of the private cluster modules. If there was an example which used the network module, or a private cluster example that included the network components (there is a public one, but no clue about private network access is included there, either), I'd have saved myself 2 full days of work. |
I understand the frustration, but again these modules are not a replacement for the GKE docs themselves (which do mention the requirement). That being said, I certainly appreciate your feedback! We'd love to have even more extensive documentation and examples, but please keep in mind these modules aren't anyone's full time job. We're doing the best we can, but can't cover all scenarios.
This is definitely actionable! If you could open an issue or PR for adding such an example, I think it'd be a great one to include. |
I don't think it is unreasonable to expect that the documentation for a module is sufficient to be able to use the module. I am very familiar with the documentation for GKE and know well how to run it - which is exactly why it consistently appears easier to build my own modules than to use those provided here - If I were putting these terraform resources together for myself, I would be aware of the assumptions I was making from one layer to the next. But I am not familiar with the internals of the modules, and figuring out what is going on when the module documentation doesn't provide any clues is a pretty tall order, especially when the problem is an incompatibility between one module and the next rather than an actual defect in the function of a single module. If private google access were mentioned in any context - an example, a variable description, even a comment in the source code of the module or another github issue - I'd have found the issue very quickly. I go through the docs and existing issues in extraordinary detail before I ever even open an issue here. Opening an issue or submitting a PR happens only after I've been unable to figure something out by reading the source code - and I do read the source code. |
Right, but Private Google Access is clearly documented on the GKE docs: https://cloud.google.com/kubernetes-engine/docs/how-to/private-clusters We're simply not going to be able to cover every possible use case, nor should we attempt to duplicate the full content of the GKE docs into our own documentation. |
Anyways, I opened #308 to add an example of the required networking config for a private cluster. I agree it would be helpful. |
And, it turns out, as I included in my comment about network config from several days ago, I have private_access enabled on the subnet in question
So it would appear that there's a bug in the GKE configuration which is causing it to fail to use private google access, or else there is more to enabling it than just setting that value to true - though nothing in the documentation for the module indicates as much. It's hardly surprising that I failed to spot a problem with private access, when I have every indication of it being enabled both in the code and in the console. This is without reconfiguring ANYTHING from how it was when the GKE module was failing to work, so there is clearly something more than just private google access required.

It's not a DNS problem, because I didn't add any firewall rules to enable DNS in order to get it working. But adding an outbound route through the default gateway did fix it, so it is failing to use the private google access. According to the page on private google access, it does nothing unless a node has no public ip address, which makes sense, so perhaps it's not that surprising that I need to provide an explicit outbound rule and route for port 443. But there is no mention of that in the docs or any of the examples. And running a fully private node is, without question, a config that I have tested repeatedly: I have had private access enabled this whole time, so that has never changed, and I have certainly tried running clusters with enable_private_nodes set to both true and false. I don't know if the problem is in the GKE module or in GKE itself, but there is still nothing to suggest that I am using it incorrectly - even though it does not work. |
Just chiming in with my 2 pennies (British here): https://github.com/andreyk-code/no-inet-gke-cluster I believe @ideasculptor is correct. Having followed this guide, there was apparently a lot more to configure with regard to the network and even DNS before my cluster would finally create OK and the rest of my plan could continue. I think this level of information is key when working with modules that are inherently opinionated (as they should be). |
I only just saw your comment, @SubatomicHero. That Medium post by @kopachevsky is a super helpful summary. @morgante, could someone throw a link to it in one of the READMEs in this repo so it's not hidden down here in a closed issue? I eventually went with a config that isn't totally locked down - I allow outbound access to the public internet, which let me get everything up without all the DNS changes and such - but it would have been a whole lot easier if I'd been working from the new private cluster with network and bastion example, or if I'd found Andrey's Medium post. I gather the DNS changes require connectivity to the private network from the terraforming host, so those of us who aren't working from a peered network will still have some small difficulties with a fully locked-down cluster. |
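For anyone who does want the fully locked-down variant without public egress, the pattern from that guide and the GCP docs is roughly the following. Names, project variable, and network are placeholders; 199.36.153.4/30 is Google's restricted.googleapis.com VIP range.

```hcl
# Route the restricted Google APIs VIP via the default internet gateway; this
# range is only reachable from inside Google's network, so no public egress opens.
resource "google_compute_route" "restricted_apis" {
  name             = "restricted-google-apis"
  project          = var.host_project_id
  network          = "shared-vpc"
  dest_range       = "199.36.153.4/30"
  next_hop_gateway = "default-internet-gateway"
}

# Private DNS zone that points *.googleapis.com at the restricted VIP.
resource "google_dns_managed_zone" "googleapis" {
  name       = "googleapis"
  project    = var.host_project_id
  dns_name   = "googleapis.com."
  visibility = "private"

  private_visibility_config {
    networks {
      network_url = "projects/${var.host_project_id}/global/networks/shared-vpc"
    }
  }
}

resource "google_dns_record_set" "restricted_a" {
  project      = var.host_project_id
  managed_zone = google_dns_managed_zone.googleapis.name
  name         = "restricted.googleapis.com."
  type         = "A"
  ttl          = 300
  rrdatas      = ["199.36.153.4", "199.36.153.5", "199.36.153.6", "199.36.153.7"]
}

resource "google_dns_record_set" "googleapis_cname" {
  project      = var.host_project_id
  managed_zone = google_dns_managed_zone.googleapis.name
  name         = "*.googleapis.com."
  type         = "CNAME"
  ttl          = 300
  rrdatas      = ["restricted.googleapis.com."]
}
```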
TL;DR - Using a very simple private, zonal configuration, node registration is failing and it is non-obvious what is going wrong. This is AFTER the networking issues of last weekend were finally resolved. Things get much farther along, but still never get all the way to completion.
The details:
I'm bringing up a private cluster in a shared VPC; my local network is configured as an authorized network in master_authorized_networks_config, and I just leave the default node pool alone, with a count of 1. The service account is created by the module. The master IP CIDR block is set to 10.0.0.0/28. I can access the endpoint via kubectl while the cluster is waiting for health checks to pass, but when that fails, the cluster gets deleted.
There's not really much else configured, but the cluster startup is failing because the node in the node pool fails to register. It seems reasonable to think this is being blocked by the network, but it's not clear what I need to enable as far as firewall rules and routes. I'd have thought the module sets up networking for access to the master cidr block, so is there something I need to enable for the subnet that the cluster is started in that wouldn't be handled by the module?
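A rough sketch of the setup being described, for orientation. Variable names follow the private-cluster submodule of that era and may need adjusting for your module version; all values here are placeholders rather than the reporter's actual config.

```hcl
module "gke" {
  source = "terraform-google-modules/kubernetes-engine/google//modules/beta-private-cluster"

  project_id         = local.project_id
  name               = "private-cluster"
  regional           = false
  region             = "us-central1"
  zones              = ["us-central1-a"]       # zonal cluster
  network_project_id = local.host_project_id   # Shared VPC host project
  network            = "shared-vpc"
  subnetwork         = "gke-subnet"
  ip_range_pods      = "pods"
  ip_range_services  = "services"

  enable_private_nodes    = true
  enable_private_endpoint = false
  master_ipv4_cidr_block  = "10.0.0.0/28"

  # Local/office network granted access to the master endpoint.
  master_authorized_networks_config = [{
    cidr_blocks = [{
      cidr_block   = "198.51.100.0/24"
      display_name = "local-network"
    }]
  }]
}
```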
My network config (using the fabric modules) is as follows:
I applied the firewall module to the network, but with no rules other than the defaults for ssh, http and https, so far:
I can easily imagine that https traffic to the master cidr range needs to be enabled for nodes, but I'd have thought the gke module would set that up. In fact, checking networking config while the cluster is coming up appears to show that it is doing so. I've attached screenshots of the routes and firewall rules that get added.
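For completeness, GKE creates the master-to-node rules itself (which is what the screenshots show). An explicit equivalent, if it were ever missing, would look roughly like this; the ports follow the GKE private cluster docs, and the names, project variable, and tag are placeholders.

```hcl
# Allow the control plane range to reach nodes on the ports GKE requires
# (443 for webhooks/metrics and 10250 for the kubelet).
resource "google_compute_firewall" "gke_master_to_nodes" {
  name          = "allow-gke-master-to-nodes"
  project       = var.host_project_id
  network       = "shared-vpc"
  direction     = "INGRESS"
  source_ranges = ["10.0.0.0/28"]   # master_ipv4_cidr_block
  target_tags   = ["gke-node"]

  allow {
    protocol = "tcp"
    ports    = ["443", "10250"]
  }
}
```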