What network/firewall config is required for private cluster #305

Closed
ideasculptor opened this issue Nov 4, 2019 · 26 comments · Fixed by #388
Labels
enhancement New feature or request P3 medium priority issues triaged Scoped and ready for work

Comments

@ideasculptor
Contributor

TL;DR - Using a very simple private, zonal configuration, node registration is failing and it is non-obvious what is going wrong. This is AFTER the networking issues of last weekend were finally resolved. Things get much farther along, but still never get all the way to completion.

The details:
I'm bringing up a private cluster in a shared VPC, my local network is configured as an authorized network in master_authorized_networks_config, and I just leave the default node pool alone, with a count of 1. Service account is created by the module. master ip cidr block is set to 10.0.0.0/28. I can access the endpoint via kubectl when the cluster is waiting for health checks to pass, but when that fails, the cluster gets deleted.

There's not really much else configured, but the cluster startup is failing due to a failure to register by the node in the node pool. It seems reasonable to think this is being blocked by the network, but it's not clear what I need to enable as far as firewall and routes. I'd have thought the module sets up networking to access the master cidr block, so is there something I need to enable for the subnet that the cluster is started in that wouldn't be handled by the module?

My network config (using the fabric modules) is as follows:

  subnets          = [
    {
      subnet_name           = "admin"
      subnet_ip             = "10.1.0.0/24"
      subnet_private_access = "true"
      subnet_flow_logs      = "true"
    },
    {
      subnet_name           = "gke"
      subnet_ip             = "10.10.11.0/24"
      subnet_private_access = "true"
      subnet_flow_logs      = "true"
    },
    {
      subnet_name           = "cloud-sql"
      subnet_ip             = "10.10.12.0/24"
      subnet_private_access = "true"
      subnet_flow_logs      = "true"
    },
  ]

  secondary_ranges = {
    gke = [
      {
        range_name = "services"
        ip_cidr_range = "192.168.0.0/22"
      },
      {
        range_name = "pods"
        ip_cidr_range = "192.168.16.0/20"
      },
    ]
  }

  routes = [
    {
      name              = "egress-inet"
      description       = "route through IGW to access internet"
      destination_range = "0.0.0.0/0"
      tags              = "egress-inet"
      next_hop_internet = "true"
    },
  ]

I applied the firewall module to the network, but with no rules other than the defaults for ssh, http and https, so far:

module "firewall" {
  source                  = "terraform-google-modules/network/google//modules/fabric-net-firewall"
  project_id              = data.terraform_remote_state.vpc.outputs.project_id
  network                 = data.terraform_remote_state.vpc.outputs.network_name
  admin_ranges            = local.admin_ranges
  admin_ranges_enabled    = true
  internal_ranges_enabled = true
  internal_ranges         = local.all_ranges
  ssh_source_ranges       = ["0.0.0.0/0"]
}

I can easily imagine that https traffic to the master cidr range needs to be enabled for nodes, but I'd have thought the gke module would set that up. In fact, checking networking config while the cluster is coming up appears to show that it is doing so. I've attached screenshots of the routes and firewall rules that get added.

Screen Shot 2019-11-04 at 9 49 06 AM

Screen Shot 2019-11-04 at 9 49 16 AM

@ideasculptor
Contributor Author

ideasculptor commented Nov 4, 2019

Screen Shot 2019-11-04 at 9 56 26 AM

This is the closest thing I get to an error message while the cluster is trying to come up.

Screen Shot 2019-11-04 at 9 58 34 AM

@ideasculptor
Contributor Author

Possible cause - allowing the module to create a service account, does the service account need extra permissions because of the shared VPC thing? There's no indication that I can't use a created service account when using shared vpc, but it's the only obvious potential source of a problem.

@morgante
Contributor

morgante commented Nov 5, 2019

@ideasculptor That might be it - can you see if granting compute.networkUser on the created Service Account resolves the issue?

@ideasculptor
Contributor Author

Sure thing. It's going to take me an hour or two to get to it, but check back here later or tomorrow and I should have a result. Any idea if adding a role to the service account outside of the module is likely to get placed into the dependency graph in an order that will be useful, or do I need to modify the module itself to get that to happen before cluster creation fails and the module returns?

@ideasculptor
Contributor Author

Answered my own question - Adding the role outside of the module DOES result in an object graph that is ordered correctly. Should have an answer shortly. I'm trying (separately) each of the following:

resource "google_compute_subnetwork_iam_member" "network_users" {
  project    = local.network_project_id
  region     = var.region
  subnetwork = local.subnetwork
  role       = "roles/compute.networkUser"
  member     = "serviceAccount:${module.gke.service_account}"
}

resource "google_project_iam_member" "service_agents" {
  project = local.network_project_id
  role    = "roles/container.hostServiceAgentUser"
  member  = "serviceAccount:${module.gke.service_account}"
}
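
For comparison, the GKE shared VPC documentation describes these two roles as being granted to the service project's Google-managed service agents rather than to the node service account. The following is only a sketch under that assumption: the resource names are hypothetical, the locals and variables are the ones already used in this thread, and it has not been tested against this setup.

```hcl
# Hypothetical sketch: grant the shared VPC roles to the service project's
# Google-managed service agents. local.project_id, local.network_project_id,
# local.subnetwork and var.region are assumed to be defined as elsewhere in
# this thread.
data "google_project" "service_project" {
  project_id = local.project_id
}

locals {
  # Both addresses are derived from the service project's number.
  gke_service_agent = "serviceAccount:service-${data.google_project.service_project.number}@container-engine-robot.iam.gserviceaccount.com"
  google_apis_agent = "serviceAccount:${data.google_project.service_project.number}@cloudservices.gserviceaccount.com"
}

# Subnet-level network access for both agents. The map keys are static so
# that for_each can be planned even if the project number is not yet known.
resource "google_compute_subnetwork_iam_member" "agent_network_users" {
  for_each = {
    gke  = local.gke_service_agent
    apis = local.google_apis_agent
  }

  project    = local.network_project_id
  region     = var.region
  subnetwork = local.subnetwork
  role       = "roles/compute.networkUser"
  member     = each.value
}

# Host-project role for the GKE service agent only.
resource "google_project_iam_member" "gke_host_service_agent" {
  project = local.network_project_id
  role    = "roles/container.hostServiceAgentUser"
  member  = local.gke_service_agent
}
```

Whether missing agent permissions played any part in the failure here is not established by the thread; the root cause identified later is network egress.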

@morgante
Contributor

morgante commented Nov 5, 2019

Excellent! Let us know how it goes and we could possibly integrate into the module itself.

@ideasculptor
Contributor Author

I've tried every variant I can come up with, including using google_project_iam_member for the compute.networkUser role instead of assigning it just to the one subnet (incidentally, where can I see those per-subnet role assignments in the console - I couldn't find them anywhere, which is why I ended up giving it the role project-wide, just so I could verify that the assignment was in place).

I continue to get that same warning related to the node pool -

The number of nodes is estimated by the number of Compute VM instances because the Kubernetes master did not respond, possibly due to a pending upgrade or missing IAM permissions.
The number of nodes in a node pool should match the number of Compute VM instances, except for:
A temporary skew during resize or upgrade
Uncommon configurations in which nodes or instances were manipulated directly with Kubernetes and/or Compute APIs

I've also disabled every possible optional feature (I had dashboard and http load balancing enabled, previously).

@ideasculptor
Contributor Author

ideasculptor commented Nov 6, 2019

I'm reasonably certain this is user error, but I'm still verifying a fix - the test cycle is pretty long.

I had failed to note that the examples are explicitly adding the cidr range for the gke subnet to the master_authorized_networks_config in a private cluster. Seems like adding that subnet ought to go without saying rather than requiring explicit inclusion, but lacking an implicit inclusion of the subnet cidr block, it is probably worth calling out as a requirement in the variable description for master_authorized_networks_config.

@ideasculptor
Contributor Author

ideasculptor commented Nov 6, 2019

Hmm, perhaps I take that previous comment back - the documentation for that variable actually states that it is implicitly included, which is why I hadn't added it based on the examples, but that appears to be incorrect if the examples are anything to go by, since they always explicitly add the subnetwork cidr block. I'll know if it fixed my problem in 24 minutes, though then I have to go through re-enabling all the things I disabled while testing and make sure it stays up.

The current description of the master_authorized_networks_config includes the following:

...(except the cluster node IPs, which GKE automatically whitelists).

Whereas the examples include the following:

data "google_compute_subnetwork" "subnetwork" {
  name    = var.subnetwork
  project = var.project_id
  region  = var.region
}

module "gke" {
...
  subnetwork                        = var.subnetwork
...
  master_authorized_networks_config = [
    {
      cidr_blocks = [
        {
          cidr_block   = data.google_compute_subnetwork.subnetwork.ip_cidr_range
          display_name = "VPC"
        },
      ]
    },
  ]
}

@ideasculptor
Contributor Author

ideasculptor commented Nov 6, 2019

10 minutes in, and it looks like it is still failing. I have run out of ideas. I have all optional features disabled. I've tried the config of every private cluster example as far as the values of deploy_using_private_endpoint, enable_private_endpoint, enable_private_nodes go, and I've corrected the master_authorized_networks_config.

I have tried configs that are totally identical to the examples, with the only difference being that create_service_account=true . There doesn't appear to be any example which allows the service account to be created by the module while creating a private cluster. There really only seems to be a single example that creates a service account. One example passes the value 'create' to var.service_account, which seems unlikely to actually work correctly unless a service account called create@<projectid>... exists.

I'm pretty certain the module simply does not work correctly if allowed to create a service account inside the module. It's worth noting that the created service account does NOT appear to have the permissions that are listed in the docs as required to run the module. It looks like most of the examples are passing in a service account with the permissions to RUN the module, rather than one with permissions that match those given to the module-created service account. Unfortunately, the 30 minute test cycle means I still haven't tried it with an externally created service account, but that's really the only variation I have left to try.

The IAM roles assigned to the created service account are provided via the following screenshots (1st is the gke project, 2nd is the network host project):
Screen Shot 2019-11-05 at 5 54 00 PM
Screen Shot 2019-11-05 at 5 54 25 PM

It is probably also worth pointing out that 100% of the resources this might be dependent on are created using fabric modules. Projects are created by project factory. subnets and firewalls are handled by the network modules. Permissions set via the iam modules. In most cases, things tend to be wired up with very generic/default values.

@ideasculptor
Contributor Author

I give up. The only configuration I can make work is one in which all private cluster features are disabled. As soon as I set

  deploy_using_private_endpoint = false
  enable_private_endpoint = false
  enable_private_nodes = false

The cluster comes up correctly, but is now complaining about no available nodes and failed autoscaling, but at least it correctly creates the default-node-pool pool. Even getting to the point of provisioning the default-node-pool is new behaviour. When running as any variant of a private cluster, the default-pool never finishes.

I have tried ALL of the following variations of the following variables (always with master_ipv4_network_config configured with the cidr_block for the gke subnet):

  deploy_using_private_endpoint = true
  enable_private_endpoint = true
  enable_private_nodes = true

  deploy_using_private_endpoint = false
  enable_private_endpoint = true
  enable_private_nodes = true

  deploy_using_private_endpoint = false
  enable_private_endpoint = false
  enable_private_nodes = true

  deploy_using_private_endpoint = true
  enable_private_endpoint = false
  enable_private_nodes = true

Please note that each one of those trials requires 25 minutes to fail and cleanup. This is NOT a rapid debugging process.

My core module config looks like this:

module "gke" {
  source = "terraform-google-modules/kubernetes-engine/google//modules/beta-private-cluster"

  project_id             = local.project_id
  network_project_id     = local.network_project_id
  registry_project_id    = local.project_id
  name                   = "gke-${var.environment}"
  description            = var.description
  region                 = "us-central1"
  regional               = false
  zones                  = ["us-central1-a"]
  network                = local.network
  subnetwork             = local.subnetwork
  ip_range_pods          = local.ip_range_pods
  ip_range_services      = local.ip_range_services
  create_service_account = true

  deploy_using_private_endpoint = var.deploy_using_private_endpoint
  enable_private_endpoint       = var.enable_private_endpoint
  enable_private_nodes          = var.enable_private_nodes
  master_ipv4_cidr_block        = var.master_ipv4_cidr_block
}

Whatever it is that is failing is VERY non-obvious, because I have been through all of the examples and tried every possible variation for days on end, to no avail. This implies a dire lack of documentation for private cluster modules, because I've pored over all of the readmes and examples, looking for clues to no avail.

Screen Shot 2019-11-05 at 7 01 11 PM

Screen Shot 2019-11-05 at 7 01 32 PM

Screen Shot 2019-11-05 at 7 01 42 PM

@morgante
Contributor

morgante commented Nov 6, 2019

Can you try explicitly defining a node pool? That could be the issue.

At this point I'm skeptical the issue is with the module. Does it return errors when running Terraform? It seems much more likely it's somehow an issue with your project/the GKE control plane.

@ideasculptor
Contributor Author

For what it is worth, an instance in default-node-pool DOES exist, but it is clearly unable to communicate back to the cluster, since the cluster thinks there are 0 nodes.

At least I finally have a stable cluster and node pool that I can actually debug without having it automatically torn down by cluster creation failure. It has taken me since Friday afternoon just to get that much to function (it is Tuesday night). Except now I have to figure out how to get into the node, since it doesn't have an ssh tag. And I still don't have a private cluster. I manually added a firewall rule which allows port 22 access to tag gke-gke-dev from 0.0.0.0/32 (and the firewall rule shows that node as being accessible to that rule) but it still won't allow me to ssh to the host. I don't have the slightest clue why and none of the error messaging provides a hint.
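
One detail worth checking in a rule like the one described: `0.0.0.0/32` matches only the single address 0.0.0.0, whereas `0.0.0.0/0` matches any IPv4 source. If the rule really was written with `/32`, that alone could explain the refused connection, though the thread does not confirm it. A minimal sketch of an ingress rule for the node tag, with a hypothetical rule name and the locals used elsewhere in this thread:

```hcl
# Hypothetical sketch of an SSH ingress rule targeting the GKE node tag.
# local.network_project_id and local.network are assumed to be defined as
# elsewhere in this thread; the rule name is made up for illustration.
resource "google_compute_firewall" "allow_ssh_to_gke_nodes" {
  project   = local.network_project_id
  network   = local.network
  name      = "allow-ssh-gke-dev"
  direction = "INGRESS"

  allow {
    protocol = "tcp"
    ports    = ["22"]
  }

  # /0 matches every IPv4 source; narrow this to known admin ranges where
  # possible.
  source_ranges = ["0.0.0.0/0"]

  # Tag observed on the nodes in this thread (gke-{cluster name}).
  target_tags = ["gke-gke-dev"]
}
```

Note that reply traffic also needs a route that applies to the node. In the network config above, the only default route carries the `egress-inet` tag, which the default node pool does not have.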

Now that I can finally access a running node (though, rather obnoxiously, the node gets terminated periodically due to failed initialization), I can see the following failure:

● kube-node-installation.service - Download and install k8s binaries and configurations
   Loaded: loaded (/etc/systemd/system/kube-node-installation.service; enabled; vendor preset: disabled)
   Active: activating (start) since Wed 2019-11-06 04:19:48 UTC; 1min 41s ago
  Process: 823 ExecStartPre=/bin/chmod 544 /home/kubernetes/bin/configure.sh (code=exited, status=0/SUCCESS)
  Process: 818 ExecStartPre=/bin/bash -c OPT=""; if curl --help | grep -q -- "--retry-connrefused"; then OPT="--retry-connrefused"; fi; /usr/bin/curl --fail --retry 5 --retry-delay 3 $OPT --silent --sho
  Process: 815 ExecStartPre=/bin/mount -o remount,exec /home/kubernetes/bin (code=exited, status=0/SUCCESS)
  Process: 812 ExecStartPre=/bin/mount --bind /home/kubernetes/bin /home/kubernetes/bin (code=exited, status=0/SUCCESS)
  Process: 809 ExecStartPre=/bin/mkdir -p /home/kubernetes/bin (code=exited, status=0/SUCCESS)
 Main PID: 826 (bash)
    Tasks: 2 (limit: 4915)
   Memory: 2.4M
      CPU: 177ms
   CGroup: /system.slice/kube-node-installation.service
           ├─826 bash /home/kubernetes/bin/configure.sh
           └─860 curl -f --ipv4 -Lo node-problem-detector-v0.6.6.tar.gz --connect-timeout 20 --max-time 300 --retry 6 --retry-delay 10 --retry-connrefused https://storage.googleapis.com/kubernetes-relea

Nov 06 04:19:49 gke-gke-dev-default-node-pool-ef2b7a90-7mtv configure.sh[826]: Running GKE internal configuration script gke-internal-configure.sh
Nov 06 04:19:49 gke-gke-dev-default-node-pool-ef2b7a90-7mtv configure.sh[826]: Downloading node-problem-detector-v0.6.6.tar.gz.
Nov 06 04:19:49 gke-gke-dev-default-node-pool-ef2b7a90-7mtv configure.sh[826]:   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
Nov 06 04:19:49 gke-gke-dev-default-node-pool-ef2b7a90-7mtv configure.sh[826]:                                  Dload  Upload   Total   Spent    Left  Speed
Nov 06 04:20:09 gke-gke-dev-default-node-pool-ef2b7a90-7mtv configure.sh[826]: [1.6K blob data]
Nov 06 04:20:09 gke-gke-dev-default-node-pool-ef2b7a90-7mtv configure.sh[826]: Warning: Transient problem: timeout Will retry in 10 seconds. 6 retries left.
Nov 06 04:20:39 gke-gke-dev-default-node-pool-ef2b7a90-7mtv configure.sh[826]: [1.6K blob data]
Nov 06 04:20:39 gke-gke-dev-default-node-pool-ef2b7a90-7mtv configure.sh[826]: Warning: Transient problem: timeout Will retry in 10 seconds. 5 retries left.
Nov 06 04:21:09 gke-gke-dev-default-node-pool-ef2b7a90-7mtv configure.sh[826]: [1.6K blob data]

So that's progress. Apparently the default network configuration when using the fabric network modules is not sufficient for access to https://storage.googleapis.com/kubernetes-relea...

@ideasculptor
Contributor Author

I get no errors from terraform, other than the eventual cluster startup failure. The account I am using definitely has the specified permissions to run the module and it isn't failing due to api permissions problems. I need to figure out how to view that log in a manner which gives me the full commandline and url, and then I can set about debugging what is happening in the network.

I thought there was some kind of private endpoint for accessing things like storage, in order to eliminate the need to go out through the internet gateway and access the public api endpoints.

My network configuration is provided much farther up in this comment stream. I declare only a single route, out to the public internet via the default gateway. I also run the firewall module with default values for everything, which sets up rules for inbound http, https, and ssh tags. There are no egress rules.

@ideasculptor
Contributor Author

Finally got it up. For now, I added a route and firewall rule for tag gke-{cluster name} which allows outbound tcp on port 443 to 0.0.0.0/0. I eventually need to figure out how to get access to storage.googleapis.com without going across the default gateway to the public internet.
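
The workaround described here could look roughly like the following. This is a sketch, not the exact resources that were applied: the resource and rule names are hypothetical, and `gke-gke-dev` is the node tag that appears elsewhere in this thread.

```hcl
# Hypothetical sketch of the workaround: a default route and an egress
# firewall rule, both scoped to the GKE node tag (gke-{cluster name}).
# local.network_project_id and local.network are assumed to be defined as
# elsewhere in this thread.
resource "google_compute_route" "gke_nodes_egress" {
  project          = local.network_project_id
  network          = local.network
  name             = "gke-dev-egress-inet"
  description      = "default route for GKE nodes via the internet gateway"
  dest_range       = "0.0.0.0/0"
  next_hop_gateway = "default-internet-gateway"
  tags             = ["gke-gke-dev"]
}

resource "google_compute_firewall" "gke_nodes_egress_https" {
  project   = local.network_project_id
  network   = local.network
  name      = "gke-dev-egress-https"
  direction = "EGRESS"

  allow {
    protocol = "tcp"
    ports    = ["443"]
  }

  destination_ranges = ["0.0.0.0/0"]
  target_tags        = ["gke-gke-dev"]
}
```

VPC networks carry an implied allow-egress rule, so with no other egress rules in place the route may have been the part that mattered; the thread does not settle which of the two made the difference.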

Docs could definitely use a mention of the requirement for outbound https access.

Once I'm using a custom node pool, I can specify tags that work with the default routes and rules, but it seems to me that the default node pool ought to work with the default network configuration or else the module ought to apply non-default network configuration to enable it.
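
The custom node pool idea can be sketched as below, assuming the module's `node_pools` and `node_pools_tags` inputs and reusing the `egress-inet` tag from the route declared in the network config earlier in this thread. The pool name, machine type, and counts are placeholders, not values from the original setup.

```hcl
# Hypothetical sketch: a custom node pool whose nodes carry the network tag
# that the existing "egress-inet" route applies to. Locals and variables are
# the ones used in the module config earlier in this thread.
module "gke" {
  source = "terraform-google-modules/kubernetes-engine/google//modules/beta-private-cluster"

  project_id             = local.project_id
  network_project_id     = local.network_project_id
  name                   = "gke-${var.environment}"
  region                 = "us-central1"
  regional               = false
  zones                  = ["us-central1-a"]
  network                = local.network
  subnetwork             = local.subnetwork
  ip_range_pods          = local.ip_range_pods
  ip_range_services      = local.ip_range_services
  create_service_account = true

  enable_private_endpoint = var.enable_private_endpoint
  enable_private_nodes    = var.enable_private_nodes
  master_ipv4_cidr_block  = var.master_ipv4_cidr_block

  # Placeholder pool definition; adjust the name and sizing as needed.
  node_pools = [
    {
      name         = "primary"
      machine_type = "n1-standard-2"
      min_count    = 1
      max_count    = 3
    },
  ]

  # "all" applies to every pool; per-pool keys must match the pool names.
  node_pools_tags = {
    all     = ["egress-inet"]
    primary = []
  }
}
```

With the tag in place, the tagged default route from the network module would apply to the nodes without any extra route resource.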

@morgante
Contributor

morgante commented Nov 6, 2019

Outbound https access shouldn't be a requirement, but rather Private Google Access as documented here.

Ultimately I don't think this is an issue with the module. The module doesn't control your network config, and there are many possible network configs, so we should really rely on the GKE docs for that.

I know this has been a frustrating experience for you, but I don't think the module or Terraform are to blame.

@morgante morgante closed this as completed Nov 6, 2019
@ideasculptor
Contributor Author

Terraform definitely isn't to blame. The module is, at least so far as the documentation is usually insufficient to actually use the modules in a real-world context. Ultimately, it comes back to the examples just being too simplified and too dependent on variables containing references to correctly configured resources, rather than showing what a valid config actually looks like - a problem I encounter awfully frequently with all of these modules. The examples are rarely sufficient to actually get a thing up and working in combination with the other modules, which give all appearances of having been designed to work together. Even the full fabric examples don't really show real world use-cases for an entire infrastructure.

Every time I put a new module into use, it requires a reverse engineering effort (often, a multi-day one) in order to work out what assumptions are being made, in which modules, and how to effectively override them to actually wire two modules together to do something useful - frequently requiring module customization and/or submitting a PR. There appears to be a significant lack of integration examples - one that even careful deconstruction of available unit examples is insufficient to replace.

It's not that I don't know how to set things up, but figuring out that module X sets up something according to pattern X but module Y assumes a dependency created via pattern Y, which requires a bunch of unspecified configuration changes from the default pattern X, is often a really significant chore, and one that requires reading every line of code in both modules and all of the examples. Which kind of defeats the purpose, since it would often be faster to just develop an equivalent module from scratch without first having to understand all the prerequisites of someone else's work.

It feels like someone other than myself should be stringing all of these modules together into a cohesive architecture, and documenting what that looks like. The examples in the fabric example repo don't seem to do so - for example: https://github.com/terraform-google-modules/cloud-foundation-fabric/blob/master/infrastructure/shared-vpc/main.tf

There's nothing in there about enabling private google access or setting up outbound https for node pools, nor is there an example that creates a custom node pool that includes tags that match the expectations of the network fabric modules. I've looked through the various git branches, too, and haven't found anything that shows how I would string folders, projects, networks, routes, firewalls, and a GKE cluster together. So I end up losing days at a time to reverse engineering the expectations of each module.

@morgante
Contributor

morgante commented Nov 6, 2019

I think the gap is that the modules don't necessarily stand on their own. These are not a replacement for the official GCP docs and we do expect you to spend some time with the docs as well.

In this case, this module isn't responsible for configuring your network so it certainly isn't what sets up private Google Access.

Maybe we should add an example network setup with private Google Access enabled to the network module—probably a good idea.

One thing I want to be clear about is that it isn't an 'expectation of these modules'; it's an expectation of the services themselves. GKE itself expects that your nodes have private Google Access properly configured, so you'd run into a similar problem if you simply tried to create a private cluster in the UI.

Happy to look at targeted pull requests and issues, but as is this isn't really an actionable request. Everyone has a different network setup and some things you want to do (ex. keeping network creation and subnetwork creation separate) don't match our recommendations.

@ideasculptor
Contributor Author

Note that there is absolutely no mention made of private google access being required in the documentation for any of the private cluster modules. If there was an example which used the network module, or a private cluster example that included the network components (there is a public one, but no clue about private network access is included there, either), I'd have saved myself 2 full days of work.

@morgante
Contributor

morgante commented Nov 6, 2019

> Note that there is absolutely no mention made of private google access being required in the documentation for any of the private cluster modules.

I understand the frustration, but again these modules are not a replacement for the GKE docs themselves (which do mention the requirement).

That being said, I certainly appreciate your feedback! We'd love to have even more extensive documentation and examples, but please keep in mind these modules aren't anyone's full time job. We're doing the best we can, but can't cover all scenarios.

> If there was an example which used the network module, or a private cluster example that included the network components (there is a public one, but no clue about private network access is included there, either), I'd have saved myself 2 full days of work.

This is definitely actionable! If you could open an issue or PR for adding such an example, I think it'd be a great one to include.

@ideasculptor
Contributor Author

I don't think it is unreasonable to expect that the documentation for a module is sufficient to be able to use the module. I am very familiar with the documentation for GKE and know well how to run it - which is exactly why it consistently appears easier to build my own modules than to use those provided here - If I were putting these terraform resources together for myself, I would be aware of the assumptions I was making from one layer to the next.

But I am not familiar with the internals of the modules, and figuring out what is going on when the module documentation doesn't provide any clues is a pretty tall order, especially when the problem is an incompatibility between one module and the next rather than an actual defect in the function of a single module. If private google access were mentioned in any context - an example, a variable description, even a comment in the source code of the module or another github issue - I'd have found the issue very quickly.

I go through the docs and existing issues in extraordinary detail before I ever even open an issue here. Opening an issue or submitting a PR happens only after I've been unable to figure something out by reading the source code - and I do read the source code.

@morgante
Contributor

morgante commented Nov 6, 2019

> If private google access were mentioned in any context - an example, a variable description, even a comment in the source code of the module or another github issue - I'd have found the issue very quickly.

Right, but Private Google Access is clearly documented on the GKE docs: https://cloud.google.com/kubernetes-engine/docs/how-to/private-clusters

We're simply not going to be able to cover every possible use case, nor should we attempt to duplicate the full content of the GKE docs into our own documentation.

@morgante
Contributor

morgante commented Nov 6, 2019

Anyways, I opened #308 to add an example of the required networking config for a private cluster. I agree it would be helpful.

@ideasculptor
Contributor Author

And, it turns out, as I included in my comment about network config from several days ago, I have private_access enabled on the subnet in question

  subnets          = [
    {
      subnet_name           = "gke"
      subnet_ip             = "10.10.11.0/24"
      subnet_private_access = "true"
      subnet_flow_logs      = "true"
    },
    {
      subnet_name           = "cloud-sql"
      subnet_ip             = "10.10.12.0/24"
      subnet_private_access = "true"
      subnet_flow_logs      = "true"
    },
  ]

  secondary_ranges = {
    gke = [
      {
        range_name = "services"
        ip_cidr_range = "192.168.0.0/22"
      },
      {
        range_name = "pods"
        ip_cidr_range = "192.168.16.0/20"
      },
    ]
  }

Screen Shot 2019-11-05 at 10 00 49 PM

So it would appear that there's a bug in the GKE configuration which is causing it to fail to use the private google access, or else there is more to enabling it than just setting that value to true - though nothing in the documentation for the module indicates as much. But hardly surprising that I failed to note that private access wasn't enabled, when I have every indication of private access being enabled in the code and in the console.

This is without reconfiguring ANYTHING from how it was when the GKE module was failing to work.

So there is clearly something more than just private google access required. It's not a DNS problem, because I didn't add any firewall rules to enable DNS in order to get it working. But adding an outbound route through the default gateway did fix it, so it is failing to use the private google access.

According to the page on private google access, it does nothing unless a node has no public ip address, which makes sense, so perhaps it's not that surprising that I need to provide an explicit outbound rule and route for port 443. But there is no mention of that in the docs or any of the examples. And running a fully private node is, without question, a config that I have tested repeatedly. I have had private access enabled this whole time, so that has never changed, and I have certainly tried running clusters with enable_private_nodes set to both true and false.

I don't know if the problem is in the GKE module or the GKE functionality, but there is still nothing to suggest that I am using it incorrectly - even though it does not work.

@SubatomicHero

Just chiming in with my 2 pennies (British here)

https://medium.com/google-cloud/completely-private-gke-clusters-with-no-internet-connectivity-945fffae1ccd

https://github.com/andreyk-code/no-inet-gke-cluster

I believe @ideasculptor is correct. Having followed this guide, there was apparently a lot more to configure with regards to the network and even DNS that finally enabled my cluster to create OK and continue with the rest of my plan. I think this level of information is key when working with modules that are inherently opinionated (as they should be).
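
As a rough illustration of the kind of extra network and DNS configuration mentioned here, the sketch below routes the restricted Google APIs range (`199.36.153.4/30`) through the default internet gateway and points `*.googleapis.com` at `restricted.googleapis.com` in a private DNS zone. It is not taken from the linked guide; the resource names are hypothetical and the locals are the ones used elsewhere in this thread.

```hcl
# Hypothetical sketch: reach Google APIs from nodes without a general
# internet route. local.network_project_id and local.network are assumed to
# be defined as elsewhere in this thread.
data "google_compute_network" "vpc" {
  project = local.network_project_id
  name    = local.network
}

# Untagged route covering only the restricted.googleapis.com range.
resource "google_compute_route" "restricted_google_apis" {
  project          = local.network_project_id
  network          = local.network
  name             = "restricted-google-apis"
  dest_range       = "199.36.153.4/30"
  next_hop_gateway = "default-internet-gateway"
}

# Private zone that overrides googleapis.com resolution inside the VPC.
resource "google_dns_managed_zone" "googleapis" {
  project    = local.network_project_id
  name       = "googleapis"
  dns_name   = "googleapis.com."
  visibility = "private"

  private_visibility_config {
    networks {
      network_url = data.google_compute_network.vpc.self_link
    }
  }
}

resource "google_dns_record_set" "restricted_a" {
  project      = local.network_project_id
  managed_zone = google_dns_managed_zone.googleapis.name
  name         = "restricted.googleapis.com."
  type         = "A"
  ttl          = 300
  rrdatas      = ["199.36.153.4", "199.36.153.5", "199.36.153.6", "199.36.153.7"]
}

resource "google_dns_record_set" "googleapis_cname" {
  project      = local.network_project_id
  managed_zone = google_dns_managed_zone.googleapis.name
  name         = "*.googleapis.com."
  type         = "CNAME"
  ttl          = 300
  rrdatas      = ["restricted.googleapis.com."]
}
```

Image pulls from `gcr.io` would likely need a similar zone of their own, and the restricted range only serves a subset of Google APIs, so treat this as a starting point rather than a complete lockdown recipe.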

@ideasculptor
Contributor Author

I only just saw your comment, @SubatomicHero. That Medium post by @kopachevsky is a super helpful summary. @morgante, could someone throw a link to it in one of the READMEs in this repo so it's not hidden down here in a closed issue?

I eventually went with a config that isn't totally locked down - I allow outbound access to the public internet, which allowed me to get everything up without all the DNS changes and such, but it would have been a whole lot easier if I'd been working from the new private cluster with network and bastion example, or if I'd found Andrey's Medium post. I gather the DNS changes require connectivity to the private network from the terraforming host, so those of us who aren't working from a peered network will still have some small difficulties with a fully locked-down cluster.

@morgante morgante reopened this Nov 15, 2019
@aaron-lane aaron-lane added the P3 medium priority issues label Dec 16, 2019
@aaron-lane aaron-lane added triaged Scoped and ready for work enhancement New feature or request labels Dec 16, 2019
kopachevsky added a commit to kopachevsky/terraform-google-kubernetes-engine that referenced this issue Dec 26, 2019
kopachevsky added a commit to kopachevsky/terraform-google-kubernetes-engine that referenced this issue Dec 30, 2019
CPL-markus pushed a commit to WALTER-GROUP/terraform-google-kubernetes-engine that referenced this issue Jul 15, 2024