concepts: Document caveats with HPAs and PDBs #461

Merged
merged 15 commits into from
Oct 6, 2023
15 changes: 10 additions & 5 deletions modules/concepts/pages/operations/index.adoc
@@ -31,18 +31,23 @@ Sometimes you want to quickly shut down a product or update the Stackable operators without all of the products
restarting at the same time. You can achieve this using the following methods:

1. Quickly stop and start a whole product using `stopped` as described in xref:operations/cluster_operations.adoc[].
2. Prevent any changes to your deployed product using `reconciliationPaused` as described in xref:operations/cluster_operations.adoc[].

== Performance

1. *Compute resources*: You can configure the resources available to every product using xref:concepts:resources.adoc[]. The defaults are
very restrained, as you should be able to spin up multiple products on your laptop.
2. *Autoscaling*: Although not yet supported natively by the platform, you can use a
https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale[HorizontalPodAutoscaler] to dynamically scale the number of Pods
running for a given rolegroup based upon resource usage. To achieve this you must *not* configure any replicas on the rolegroup.
Afterwards you can deploy a HorizontalPodAutoscaler as usual (a minimal sketch is shown after this list). Please note that doing so is experimental and not officially supported by the
platform. Later platform versions will support autoscaling natively with sensible defaults and make it easy to enable and configure.
3. *Co-location*: You can not only use xref:operations/pod_placement.adoc[] to achieve more resilience, but also to co-locate products
that communicate frequently with each other. One example is placing HBase regionservers on the same Kubernetes node
as the HDFS datanodes. Our operators already take this into account and co-locate connected services. However, if
you are not satisfied with the automatically created affinities you can use xref:operations/pod_placement.adoc[] to
configure your own.
4. *Dedicated nodes*: If you want certain services to run on dedicated nodes you can also use xref:operations/pod_placement.adoc[]
to force the Pods to be scheduled on those nodes. This is especially helpful if you e.g. have Kubernetes nodes with
16 cores and 64 GB of memory, as you could allocate nearly 100% of these node resources to your Spark executors or Trino workers.
In this case it is important that you https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/[taint]
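To illustrate the autoscaling approach from item 2 above, here is a minimal HorizontalPodAutoscaler sketch targeting the StatefulSet a Stackable operator creates for a rolegroup. The cluster, role and rolegroup names as well as the CPU target are illustrative assumptions only, and the rolegroup itself must not have `replicas` set.

[source,yaml]
----
# Minimal sketch only: assumes a Trino cluster "simple-trino" whose "worker"
# role has a rolegroup "default" with no replicas configured. The StatefulSet
# name and the metric target are assumptions, not operator-guaranteed values.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: simple-trino-worker-default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: simple-trino-worker-default # StatefulSet created for the rolegroup
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80
----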
21 changes: 21 additions & 0 deletions modules/concepts/pages/operations/pod_disruptions.adoc
@@ -15,6 +15,27 @@ The defaults depend on the individual product and can be found below the "Operations" section of the respective product's documentation.
They are based on our knowledge of each product's fault tolerance.
In some cases they may be a little pessimistic, but they can be adjusted as documented in the following sections.

In general we split product roles into the following two categories, which serve as guidelines for the default values we apply:

=== Multiple replicas to increase availability

For these roles (e.g. ZooKeeper servers, HDFS journalnodes and namenodes, or HBase masters) we only allow a single Pod to be unavailable. Consider 7 ZooKeeper replicas, of which 4 are needed to form a quorum. If we allowed 2 to be unavailable,
there would be no single point of failure (as at least 5 nodes remain available), but only a single spare node would be left. The reason you chose 7 instead of 5
ZooKeeper replicas might be that you always want at least 2 spares. So increasing the number of allowed disruptions as you increase the number of replicas
is probably not what you want when you added the replicas to increase availability.
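As a rough sketch of what this means in practice, a PodDisruptionBudget that allows only a single unavailable Pod looks like the following. The name and label selector are assumptions for illustration; the operator manages the actual object for you.

[source,yaml]
----
# Illustrative sketch of a PDB for an availability-focused role such as
# ZooKeeper servers. Name and labels are assumed, not actual operator output.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: simple-zk-server
spec:
  maxUnavailable: 1 # only one server Pod may be voluntarily disrupted at a time
  selector:
    matchLabels:
      app.kubernetes.io/name: zookeeper
      app.kubernetes.io/instance: simple-zk
      app.kubernetes.io/component: server
----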

=== Multiple replicas to increase performance

For these roles (e.g. HDFS datanodes, HBase regionservers or Trino workers) we allow more than a single Pod to be unavailable, as otherwise rolling re-deployments could take a very long time.

IMPORTANT: The operators calculate the number of Pods for a given role by adding up the number of replicas of every rolegroup that is part of that role.
If there are no replicas defined on a rolegroup, one Pod is assumed for that rolegroup, as the created Kubernetes objects
(StatefulSets or Deployments) also default to a single replica. However, if there are
https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/[HorizontalPodAutoscalers] in place, the number of replicas of a rolegroup
can change dynamically. In that case the operators might falsely assume that rolegroups have fewer Pods than they actually have. This is a pessimistic approach,
as the number of allowed disruptions normally stays the same or even increases when the number of Pods increases. So this should be safe, but in some cases
more Pods *could* have been allowed to be unavailable, so rolling re-deployments can take a bit longer than needed.
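To make the counting rule concrete, consider the following hypothetical rolegroup layout. The role and rolegroup names and replica counts are made up purely to illustrate the arithmetic.

[source,yaml]
----
# Hypothetical example used only to illustrate how Pods are counted per role.
workers:
  roleGroups:
    default:
      replicas: 3   # counted as 3 Pods
    highMemory:
      replicas: 2   # counted as 2 Pods
    burst: {}       # no replicas set, counted as 1 Pod (StatefulSet default)
# The operator therefore assumes 3 + 2 + 1 = 6 Pods for the "workers" role when
# writing the PodDisruptionBudget, even if an HPA later scales a rolegroup up.
----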

== Influencing and disabling PDBs

You can configure