Migrating a Kubernetes cluster's Docker to 18.03
Docker versions 1.13 and 17.03 have problems finalizing containers.
Kubernetes has not been officially tested with Docker 18.xx, but I'm facing problems with pods getting stuck.
In this post I will describe my experiment of upgrading the workers to 18.03, a version that fixes some of the instability around stopping containers.
First, a caveat: I have done this in production, but the cluster is small. I would recommend that you test before trying this in production.
Some background on the problem:
github.com/moby/moby/issues/31768
github.com/kubernetes/kubernetes/issues/51835
github.com/kubernetes/kubernetes/issues/59564
docs.docker.com/docker-for-mac/kubernetes/#..
bugzilla.redhat.com/show_bug.cgi?id=1505687
We will do this node by node.
Start by evicting all pods from the node (DaemonSet pods can't be evicted, so we ignore them):
kubectl drain worker0 --ignore-daemonsets
The node will be Ready, but with scheduling disabled:
NAME STATUS ROLES AGE VERSION
master0 Ready master 18h v1.10.2
master1 Ready master 18h v1.10.2
master2 Ready master 17h v1.10.2
worker0 Ready,SchedulingDisabled <none> 17h v1.10.2
worker1 Ready <none> 17h v1.10.2
worker2 Ready <none> 17h v1.10.2
If this takes forever to finish, you may proceed by forcing the node down as shown below.
In my case I think all pods had already been evicted, so I proceeded even though the command had not finished.
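Another option when the drain hangs is to force it. This is only a hedged sketch, not something from my original run: --force also evicts pods that are not managed by a controller and --grace-period=0 skips graceful termination, so use it only when you are sure nothing important is left on the node. The KUBECTL variable and the drain_force name are mine, for illustration.

```shell
KUBECTL="${KUBECTL:-kubectl}"

# Forcibly drain a node: skip graceful termination and evict
# unmanaged pods too. Use only when the normal drain hangs.
drain_force() {
  node="$1"
  $KUBECTL drain "$node" \
    --ignore-daemonsets \
    --force \
    --grace-period=0
}
```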
Next, we will stop everything on the node by force. Log in to the node and run:
systemctl stop kubelet
docker stop $(docker ps -q)
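One small robustness note: docker stop $(docker ps -q) errors out when no containers are running, because docker stop then gets called with no arguments. A sketch that tolerates an empty list (the stop_all_containers helper and DOCKER variable are my naming; -r is a GNU xargs flag):

```shell
DOCKER="${DOCKER:-docker}"

# Stop every running container; xargs -r (GNU) skips the
# `docker stop` call entirely when `docker ps -q` prints nothing.
stop_all_containers() {
  $DOCKER ps -q | xargs -r "$DOCKER" stop
}
```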
Check that the node is down:
kubectl get nodes
NAME STATUS ROLES AGE VERSION
master0 Ready master 18h v1.10.2
master1 Ready master 18h v1.10.2
master2 Ready master 17h v1.10.2
worker0 NotReady,SchedulingDisabled <none> 17h v1.10.2
worker1 Ready <none> 17h v1.10.2
worker2 Ready <none> 17h v1.10.2
In my case I was using the docker.io package on Ubuntu Xenial. Remove it:
apt-get remove docker docker-engine docker.io
Follow Docker's installation instructions: docs.docker.com/install/linux/docker-ce/ubu..
It is a good idea to install a specific version (to avoid unwanted system updates).
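A sketch of pinning a version, assuming an apt-based setup like mine. The version string is only an example, so check what `apt-cache madison docker-ce` actually lists on your system; apt-mark hold keeps later upgrades from moving the package. The helper name and variables are mine, for illustration.

```shell
APT_GET="${APT_GET:-apt-get}"
APT_MARK="${APT_MARK:-apt-mark}"

# Install a specific docker-ce version and hold it against upgrades.
install_pinned_docker() {
  ver="$1"   # example: 18.03.1~ce-0~ubuntu (verify with `apt-cache madison docker-ce`)
  $APT_GET update
  $APT_GET install -y "docker-ce=$ver"
  $APT_MARK hold docker-ce
}
```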
Now start the kubelet:
systemctl start kubelet
Check that things are working:
docker ps
journalctl -u kubelet
kubectl get nodes
Return the node to cluster scheduling:
kubectl uncordon worker0
kubectl get nodes
# check docker version
kubectl describe node worker0
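The describe output is long; to see just the runtime, grepping for the "Container Runtime Version" line works. Wrapped in a small helper here so the node name is a parameter (the helper name is mine):

```shell
KUBECTL="${KUBECTL:-kubectl}"

# Print only the container runtime line from the node description,
# e.g. "Container Runtime Version:  docker://18.3.1".
runtime_version() {
  $KUBECTL describe node "$1" | grep -i 'container runtime'
}
```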
Your node should be ready again.
Repeat for the other nodes.
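The whole per-node cycle can be sketched as one function, with the upgrade itself left as a placeholder since it happens on the node over SSH. Function and variable names are mine, for illustration:

```shell
KUBECTL="${KUBECTL:-kubectl}"

# Drain, upgrade (manually, on the node), and uncordon one worker.
upgrade_node() {
  node="$1"
  $KUBECTL drain "$node" --ignore-daemonsets
  # ... log in to $node: stop kubelet and containers, swap Docker,
  #     start kubelet again (the steps above) ...
  $KUBECTL uncordon "$node"
}

# Usage, one node at a time with a 2-3 minute pause in between:
#   for node in worker1 worker2; do upgrade_node "$node"; sleep 150; done
```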
Tip: use the screen command to avoid losing your session when an SSH connection drops while working on servers.
The whole process took about 30 minutes to upgrade a cluster of 3 workers, while typing this blog post at the same time, so you could probably do it in 15 minutes or less.
Despite the lack of official testing of Kubernetes 1.10 against Docker 18.x, no problems were detected; in fact, stability when stopping containers has improved.
I like to give 2-3 minutes between each node upgrade for Kubernetes rescheduling, just in case.
My cluster has OpenEBS as persistent storage with a default replication factor of 2, and critical deployments run with 2 replicas. There were no critical services running at the time of the procedure, but I believe you can do it without service downtime (yes, this is the power of HA clusters, right?). Users and services noticed zero downtime, except for Jira and Confluence (they can't run more replicas), but Kubernetes rescheduled those containers for me and downtime was minimal. I have to mention that the number of users is very small at this time.