Autoscaling: handling unforeseeable volume capacities #4469

Closed
barkbay opened this issue May 3, 2021 · 1 comment · Fixed by #4493

barkbay commented May 3, 2021

This issue is a sub-issue of #4459, focusing on the case where the capacity of a persistent volume is larger than the one specified in the claim.

As an example, here is a 1Gi claim bound to a volume with an actual physical capacity of 368Gi:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  annotations:
    pv.kubernetes.io/bind-completed: "yes"
    pv.kubernetes.io/bound-by-controller: "yes"
  creationTimestamp: "2021-04-29T11:44:01Z"
  finalizers:
  - kubernetes.io/pvc-protection
  labels:
    common.k8s.elastic.co/type: elasticsearch
    elasticsearch.k8s.elastic.co/cluster-name: storage-sample
    elasticsearch.k8s.elastic.co/statefulset-name: storage-sample-es-data
  name: elasticsearch-data-storage-sample-es-data-0
  namespace: demo
  ownerReferences:
  - apiVersion: elasticsearch.k8s.elastic.co/v1
    kind: Elasticsearch
    name: storage-sample
    uid: b0b9b12d-092d-4c9d-a73e-8c5039bfb2d9
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi # Claim is 1Gi
  storageClassName: e2e-default
  volumeMode: Filesystem
  volumeName: local-pv-6df90d02
status:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 368Gi # Storage capacity is actually 368Gi
  phase: Bound
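
To make the comparison concrete, here is a minimal Go sketch (not taken from the operator codebase) that detects this situation by comparing the storage request in the claim spec with the capacity reported in the claim status, using the standard k8s.io/api and k8s.io/apimachinery types:

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// volumeLargerThanClaim returns true when the capacity reported in the PVC
// status exceeds the storage requested in the claim spec.
func volumeLargerThanClaim(pvc corev1.PersistentVolumeClaim) bool {
	requested, hasRequest := pvc.Spec.Resources.Requests[corev1.ResourceStorage]
	actual, hasCapacity := pvc.Status.Capacity[corev1.ResourceStorage]
	if !hasRequest || !hasCapacity {
		return false
	}
	return actual.Cmp(requested) > 0 // Cmp returns 1 when actual > requested
}

func main() {
	// Reproduce the PVC above: 1Gi requested, 368Gi actually bound.
	var pvc corev1.PersistentVolumeClaim
	pvc.Spec.Resources.Requests = corev1.ResourceList{corev1.ResourceStorage: resource.MustParse("1Gi")}
	pvc.Status.Capacity = corev1.ResourceList{corev1.ResourceStorage: resource.MustParse("368Gi")}
	fmt.Println(volumeLargerThanClaim(pvc)) // true
}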

Having a capacity larger than the request may lead to the following situations:

  1. Elasticsearch reports the total observed capacity as the required capacity for a tier. If the actual capacity is higher than the claim, this can lead to cascading scale-up events, up to the limit specified by the user. The reported capacity can also exceed the limit specified by the user, in which case spurious HorizontalScalingLimitReached events are generated.
  2. If the actual capacity of a volume is greater than the claim, the nodes may hold more data than the maximum specified in the autoscaling specification, which may lead to overloaded nodes. For example, assuming the following autoscaling policy:
{
    "name": "data",
    "roles": ["data", "ingest", "transform"],
    "resources": {
        "nodeCount": { "min": 2, "max": 5 },
        "memory": { "min": "2Gi", "max": "6Gi" },
        "storage": { "min": "1Gi",  "max": "3Gi" }
    }
}

Say that the 1Gi claims have been bound to volumes of 1Ti each; chances are that 2Gi of memory is not enough to handle that amount of data.

Unforeseeable storage capacity makes it hard to scale vertically. This is especially true as long as there is no memory requirement in the Elasticsearch autoscaling API and the operator attempts to infer the memory requirement from storage (see #4076).
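
To illustrate why this combination is fragile, here is a rough Go sketch of deriving a memory requirement from observed storage and clamping it to the user-defined limits. The 2:1 memory-to-storage ratio is purely hypothetical (loosely read off the example policy above, 6Gi memory max for 3Gi storage max) and is not the operator's actual formula from #4076:

package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/resource"
)

// memoryFromStorage derives a memory requirement from an observed storage
// capacity and clamps the result between the policy's min and max memory.
func memoryFromStorage(observedStorage, minMemory, maxMemory resource.Quantity) resource.Quantity {
	// Hypothetical ratio: 2 bytes of memory per byte of storage.
	required := resource.NewQuantity(observedStorage.Value()*2, resource.BinarySI)
	if required.Cmp(maxMemory) > 0 {
		return maxMemory
	}
	if required.Cmp(minMemory) < 0 {
		return minMemory
	}
	return *required
}

func main() {
	// A 1Gi claim bound to a 1Ti volume would imply ~2Ti of memory,
	// far beyond the 6Gi limit, so the result is clamped to 6Gi.
	got := memoryFromStorage(resource.MustParse("1Ti"), resource.MustParse("2Gi"), resource.MustParse("6Gi"))
	fmt.Println(got.String()) // 6Gi
}

The point of the sketch is that once the observed storage is far off the claim, any storage-to-memory ratio either saturates at the memory limit or leaves nodes undersized for the data they actually hold.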

Proposals

Documentation update

It should be clearly stated in the documentation that scaling data nodes vertically assumes that the storage provider provisions physical volumes with a predictable and consistent capacity across all the volumes managed by an autoscaling policy.

Autoscaling controller update

If the operator detects that the capacity of a volume is greater than the one specified in the claim, then:

  • Memory is computed according to the capacity of the volume, as reported in the PVC status, up to the memory limit specified by the user.
  • A log is printed, the status is updated, and a Kubernetes warning event is emitted.
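
A minimal sketch of the second bullet, assuming a client-go record.EventRecorder and a logr logger are available in the controller (as they usually are); the event reason and message wording are made up for illustration:

package sketch

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/tools/record"

	"github.com/go-logr/logr"
)

// reportCapacityMismatch logs the mismatch and attaches a warning event to
// the parent resource. ref stands in for the Elasticsearch object the
// operator would normally pass (any runtime.Object works here).
func reportCapacityMismatch(log logr.Logger, recorder record.EventRecorder, ref *corev1.ObjectReference, pvcName, claimed, actual string) {
	log.Info("volume capacity is greater than the claimed storage",
		"pvc", pvcName, "claimed", claimed, "actual", actual)
	recorder.Eventf(ref, corev1.EventTypeWarning, "UnexpectedVolumeCapacity",
		"PVC %s requests %s but the bound volume reports %s", pvcName, claimed, actual)
}
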
barkbay added the >bug (Something isn't working) and autoscaling labels May 3, 2021
barkbay self-assigned this May 3, 2021

barkbay commented May 4, 2021

Proposals

Autoscaling controller update

If the operator detects that the capacity of a volume is greater than the one specified in the claim, then:

  • Memory is computed according to the capacity of the volume, as reported in the PVC status, up to the memory limit specified by the user.

On second thought, I'm wondering if the capacity reported in the status of the PVC is always accurate. If it is not, the proposal above would not always help. Maybe we should just consider the capacity as reported by Elasticsearch and raise a warning if it is not the expected one, without trying to scale vertically.
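
A rough sketch of that fallback, assuming the storage Elasticsearch observes for the tier has already been retrieved (for instance from its autoscaling capacity API) and converted to a byte count; the helper only flags the divergence instead of driving a vertical scale-up:

package sketch

import "fmt"

// checkObservedCapacity compares the storage Elasticsearch reports for a
// tier with what the claims would predict (nodeCount * claimed size) and
// returns a warning message when the observed value is larger.
func checkObservedCapacity(observedBytes, claimedBytesPerNode, nodeCount int64) (string, bool) {
	expected := claimedBytesPerNode * nodeCount
	if observedBytes <= expected {
		return "", false
	}
	msg := fmt.Sprintf("tier reports %d bytes of storage but the claims only account for %d bytes; storage-driven vertical scaling is unreliable", observedBytes, expected)
	return msg, true
}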
