Skip to content

Commit

Permalink
Use different CNI conf file when configuring chaining with Antrea
Browse files Browse the repository at this point in the history
The current solution which consists of overwriting the existing CNI conf
file (e.g., 10-aws.conflist) suffers from one issue for which I cannot
find a simple workaround:
When a Node restarts, there can be a short window of time during which
the CNI conf file reverts to the old one (without Antrea). If some Pods
are restarted / scheduled on the Node during that time, they will not be
processed by Antrea and NetworkPolicies may not be applied to them.

The solution I have come up with is to create a new CNI conf file with
higher priority (00-antrea.conflist). Because that file will stay the
same during Node restart, the problematic window of time does not exist
anymore. We still watch for changes to the intial CNI conf file (e.g.,
10-aws.conflist), so we can update 00-antrea.conflist as needed.

We also update antrea-aks-node-init.yml and antrea-gke-node-init.yml to
use the same container image as antrea-eks-node-init.yml. Using v2
ensures that the script is run again if it is modified at runtime.

Signed-off-by: Antonin Bas <abas@vmware.com>
  • Loading branch information
antoninbas committed Jul 20, 2022
1 parent 2a092ab commit 0969602
Show file tree
Hide file tree
Showing 4 changed files with 35 additions and 44 deletions.
32 changes: 21 additions & 11 deletions build/images/scripts/install_cni_chaining
Original file line number Diff line number Diff line change
Expand Up @@ -33,9 +33,9 @@ while (( "$#" )); do
shift
done

# Find the cni conf file with lowest name
# Find the cni conf file with lowest name, which is not installed by us
while true; do
cni_conf_name=$(ls "$HOST_CNI_NET_DIR" | head -n1)
cni_conf_name=$(ls "$HOST_CNI_NET_DIR" | grep -v antrea | head -n1)
if [[ ! -z $cni_conf_name ]]; then
break
fi
Expand All @@ -44,16 +44,22 @@ while true; do
done
cni_conf_path="$HOST_CNI_NET_DIR/$cni_conf_name"

# EKS uses 10-aws.conflist
# AKS uses 10-azure.conflist
cni_conf_name_antrea="00-antrea.conflist"
cni_conf_path_antrea="$HOST_CNI_NET_DIR/$cni_conf_name_antrea"

cni_conf_sha=""

function update_cni_conf {
log_info "install_cni_chaining" "updating CNI conf file $cni_conf_name"
log_info "install_cni_chaining" "updating CNI conf file $cni_conf_name -> $cni_conf_name_antrea"

# To limit the risk of issue because of race conditions, we use the
# following steps:
# 1. read the file once and store its contents in a variable
# We use the following steps:
# 1. read the input file once and store its contents in a variable
# 2. perform the necessary changes on the variable contents
# 3. compute the sha of the updated variable contents
# 4. write the variable contents back to the file
# 3. write the variable contents to the output file if necessary
content=$(cat $cni_conf_path)
cni_conf_sha="$(echo -n "$content" | sha256sum | while read -r s _; do echo "$s"; done)"
echo "$content" | grep -sq "azure"
if [[ $? == 0 ]]; then
Expand All @@ -67,9 +73,13 @@ function update_cni_conf {
content="$(echo "$content" | jq '.plugins += [{"type": "antrea"}]')"
fi
cni_conf_sha="$(echo "$content" | sha256sum | while read -r s _; do echo "$s"; done)"
echo "$content" > $cni_conf_path
# we only write to the file if an update is necessary
cmp -s <(echo "$content") $cni_conf_path_antrea
if [[ $? != 0 ]]; then
echo "$content" > $cni_conf_path_antrea
else
log_info "install_cni_chaining" "CNI conf file is already up-to-date"
fi
}
# monitor will start a watch on host's CNI config directory.
Expand Down
2 changes: 1 addition & 1 deletion build/yamls/antrea-aks-node-init.yml
Original file line number Diff line number Diff line change
Expand Up @@ -21,7 +21,7 @@ spec:
hostNetwork: true
containers:
- name: node-init
image: gcr.io/google-containers/startup-script:v1
image: gcr.io/google-containers/startup-script:v2
imagePullPolicy: IfNotPresent
securityContext:
privileged: true
Expand Down
43 changes: 12 additions & 31 deletions build/yamls/antrea-eks-node-init.yml
Original file line number Diff line number Diff line change
Expand Up @@ -42,28 +42,12 @@ spec:
set -o pipefail
set -o nounset
# The STARTUP_SCRIPT environment variable (which is set to the contents of this
# script) will be available when the script is executed :)
checkpoint_path="/opt/cni/antrea-node-init-status-$(md5sum <<<"${STARTUP_SCRIPT}" | cut -c-32)"
if [ -f "$checkpoint_path" ]; then
echo "Antrea node init already done. Exiting"
exit
fi
echo "Initializing node for Antrea"
echo "Initializing Node for Antrea"
cni_conf="/etc/cni/net.d/00-antrea.conflist"
while true; do
cni_conf=$(ls /etc/cni/net.d | head -n1)
if [[ ! -z $cni_conf ]]; then break; fi
echo "Waiting for cni conf file"
sleep 2s
done
cni_conf="/etc/cni/net.d/$cni_conf"
while true; do
if grep -sq "antrea" $cni_conf; then break; fi
echo "Waiting for antrea config to be updated"
if [[ -f $cni_conf ]]; then break; fi
echo "Waiting for Antrea CNI conf file"
sleep 2s
done
Expand All @@ -72,19 +56,20 @@ spec:
test -e /var/run/docker.sock || container_runtime="containerd"
echo "Container runtime: $container_runtime"
# Wait for kubelet to register the file update. Default sync time is 5sec
# Wait for kubelet to register the file update.
# Default sync time is 5s so we sleep for 10s.
# https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/dockershim/network/cni/cni.go#L50
sleep 5s
sleep 10s
while true; do
curl localhost:61679 && retry=false || retry=true
curl -sS -o /dev/null localhost:61679 && retry=false || retry=true
if [ $retry == false ]; then break ; fi
sleep 2s
echo "Waiting for aws-k8s-agent"
done
# Fetch running containers from aws-k8s-agent and kill them
echo "\n"
echo
for container_id in $(cat /var/run/aws-node/ipam.json | jq -r '.allocations | .[] | .containerID'); do
echo "Restarting container with ID: ${container_id}"
if [[ "$container_runtime" == "docker" ]]; then
Expand All @@ -94,12 +79,8 @@ spec:
fi
done
# Save the Node init status, to avoid container restart in case of node-init Pod
# restart or worker Node reboot.
# Note that gcr.io/google-containers/startup-script:v2 also includes a similar
# mechanism but it doesn't prevent the script from being run again when the Node
# restarts, since the checkpoint path is located in the /tmp folder.
# The script will run again if the contents of the script change.
# It may also run again in case of worker Node reboot, assuming
# that the contents of /tmp are deleted.
# See https://github.com/kubernetes-retired/contrib/blob/master/startup-script/manage-startup-script.sh.
touch "$checkpoint_path"
echo "Node initialization completed"
2 changes: 1 addition & 1 deletion build/yamls/antrea-gke-node-init.yml
Original file line number Diff line number Diff line change
Expand Up @@ -21,7 +21,7 @@ spec:
hostNetwork: true
containers:
- name: node-init
image: gcr.io/google-containers/startup-script:v1
image: gcr.io/google-containers/startup-script:v2
imagePullPolicy: IfNotPresent
securityContext:
privileged: true
Expand Down

0 comments on commit 0969602

Please sign in to comment.