
PreStop hook logic seems incorrect #4698

Closed
forestoden opened this issue Jul 27, 2021 · 2 comments · Fixed by #4801
Assignees
pebrc
Labels
>bug Something isn't working v1.8.0

Comments

@forestoden

Bug Report

What did you do?
When terminating an ES Pod, the PreStop hook script pre-stop-hook-script.sh attempts to wait for the POD_IP to be removed from the list of Endpoints of the headless service.
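Paraphrased, the hook's wait loop looks something like this (a sketch based on the behavior described here, not the operator's exact script):

start=$(date +%s)
while getent hosts "$HEADLESS_SERVICE_NAME" | grep -q "$POD_IP"; do
  # give up once PRE_STOP_MAX_WAIT_SECONDS have elapsed; Kubernetes then
  # proceeds with termination anyway
  if [ $(( $(date +%s) - start )) -ge "$PRE_STOP_MAX_WAIT_SECONDS" ]; then
    exit 1
  fi
  sleep 1
done
# grace period after the IP disappears from DNS
sleep "$PRE_STOP_ADDITIONAL_WAIT_SECONDS"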

I replicated what the PreStop hook does in an Ubuntu pod in the same namespace with the following:

while true; do getent hosts search-test-es-client | grep "$ES_POD_IP"; sleep 0.5; done

When modifying PRE_STOP_MAX_WAIT_SECONDS and PRE_STOP_ADDITIONAL_WAIT_SECONDS, you can observe that the IP remains in the Service until PRE_STOP_MAX_WAIT_SECONDS elapses, at which point the PreStop hook terminates unsuccessfully and Kubernetes continues with the termination. This holds true even if you set PRE_STOP_MAX_WAIT_SECONDS to a substantial value; I went as high as 180s, which is more than enough time.
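For reference, these timeouts can be overridden via environment variables on the elasticsearch container; a minimal sketch (the values shown are illustrative):

podTemplate:
  spec:
    containers:
    - name: elasticsearch
      env:
      # read by the pre-stop hook script at termination time
      - name: PRE_STOP_MAX_WAIT_SECONDS
        value: "180"
      - name: PRE_STOP_ADDITIONAL_WAIT_SECONDS
        value: "30"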

What did you expect to see?
I would expect to see the POD_IP get removed from the output of getent hosts $HEADLESS_SERVICE_NAME | grep $POD_IP before the Pod is terminated.

What did you see instead? Under which circumstances?
It appears (although documentation on headless services is sparse) that headless services do not remove the IP from the list of Endpoint IPs until after the Pod is terminated, rather than when it enters Terminating status. As a result, the PreStop hook simply waits PRE_STOP_MAX_WAIT_SECONDS and then lets termination continue, while the IP remains in service for some time afterwards. The logic here does not seem to work as intended, and given this behavior of headless Services I am not sure how it could.
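The same can be observed from outside the Pod by watching the Endpoints object while a Pod terminates (the Pod name below follows the usual StatefulSet naming and is illustrative):

# terminal 1: watch the headless service's endpoints
kubectl get endpoints search-test-es-client -o wide -w
# terminal 2: delete one of the Pods backing the service
kubectl delete pod search-test-es-client-0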

Environment

  • ECK version:

    1.3.0

  • Kubernetes information:
    GKE 1.19


kubectl version
Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.3", GitCommit:"ca643a4d1f7bfe34773c74f79527be4afd95bf39", GitTreeState:"clean", BuildDate:"2021-07-15T20:58:09Z", GoVersion:"go1.16.5", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"19+", GitVersion:"v1.19.10-gke.1600", GitCommit:"7b8e568a7fb4c9d199c2ba29a5f7d76f6b4341c2", GitTreeState:"clean", BuildDate:"2021-05-07T09:18:53Z", GoVersion:"go1.15.10b5", Compiler:"gc", Platform:"linux/amd64"}
  • Resource definition:
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: search-test
spec:
  auth: {}
  http:
    service:
      metadata:
        creationTimestamp: null
      spec: {}
    tls:
      certificate: {}
      selfSignedCertificate:
        disabled: true
  nodeSets:
  - config:
      node.data: false
      node.ingest: false
      node.master: true
      node.store.allow_mmap: true
    count: 1
    name: master-nodes
    podTemplate:
      metadata:
      spec:
        containers:
        - name: elasticsearch
        initContainers:
        - image: busybox:latest
          command:
          - sh
          - -c
          - |
            sysctl -w vm.max_map_count=262144
          name: sysctl
          securityContext:
            privileged: true
  - config:
      node.data: true
      node.ingest: false
      node.master: false
      node.store.allow_mmap: true
    count: 1
    name: data-nodes
    podTemplate:
      spec:
        containers:
        - name: elasticsearch
        initContainers:
        - image: busybox:latest
          command:
          - sh
          - -c
          - |
            sysctl -w vm.max_map_count=262144
          name: sysctl
          securityContext:
            privileged: true
  - config:
      node.data: false
      node.ingest: false
      node.master: false
      node.store.allow_mmap: true
    count: 2
    name: client
    podTemplate:
      spec:
        containers:
        - name: elasticsearch
        initContainers:
        - image: busybox:latest
          command:
          - sh
          - -c
          - |
            sysctl -w vm.max_map_count=262144
          name: sysctl
          securityContext:
            privileged: true
  transport:
    service:
      metadata:
        creationTimestamp: null
      spec: {}
  updateStrategy:
    changeBudget: {}
  version: 6.8.13
@botelastic botelastic bot added the triage label Jul 27, 2021
@david-kow david-kow added the >bug Something isn't working label Aug 5, 2021
@botelastic botelastic bot removed the triage label Aug 5, 2021
@pebrc
Collaborator

pebrc commented Aug 23, 2021

I have a suspicion that this PreStop script has been ineffective since the IPv6 PR, where the following change was introduced that keeps endpoints in the service even if the corresponding Pod is no longer ready (which is the case when it is terminating):

8cf990a#diff-f8806c3bf97eb22be2e2915e4fbe06d5a1e85f335e026cdc46a6e96f4d63da59R57-R58
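For context, that change sets PublishNotReadyAddresses on the headless Service; expressed as a manifest, the relevant part looks like this (a sketch of the field only, not the operator's full Service definition):

apiVersion: v1
kind: Service
metadata:
  name: search-test-es-client  # name taken from the manifest above, for illustration
spec:
  clusterIP: None                 # headless
  publishNotReadyAddresses: true  # endpoints stay published for unready (e.g. terminating) Pods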

I think this change was somewhat accidental: I was exploring switching to DNS-based discovery for Elasticsearch, an idea we later gave up on due to issues with DNS resolution and re-use of Pod IPs, but the PublishNotReadyAddresses change was merged nonetheless.

The simplest fix could therefore be to just revert that change.

@pebrc pebrc added the v1.8.0 label Aug 30, 2021
@pebrc pebrc self-assigned this Aug 30, 2021
@pebrc
Collaborator

pebrc commented Aug 31, 2021

Things are a bit more complicated, unfortunately. We are relying on unready Pods being published for node discovery from clients (aka sniffing): #3182

If we were to go back to only publishing ready Pods in the headless service we would break that capability.

I am thinking we could have a simpler version of the pre-stop hook that just waits for PRE_STOP_MAX_WAIT_SECONDS before allowing termination of the Pod. I am not even sure that this pre-stop hook ever worked the way it was intended. All clients would typically use the $CLUSTER_NAME-es-http service, which is a ClusterIP service, i.e. not relying on DNS anyway but (typically) on iptables managed by kube-proxy and fed by the endpoints API directly. So the only connections that relied on the headless service should have been inter-node connections in Elasticsearch itself. However, Elasticsearch has its own discovery mechanism, which we feed with a list of IP addresses, not DNS names. A minimal sketch of such a simplified hook follows.
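Illustrative only, including the default value:

#!/usr/bin/env bash
# simplified pre-stop hook: wait a fixed period so clients and kube-proxy
# can observe the endpoint change, then allow termination to proceed
sleep "${PRE_STOP_MAX_WAIT_SECONDS:-30}"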
