Migrating a Kubernetes cluster's Docker to 18.03

Docker versions 1.13 and 17.03 have problems finalizing containers.

Kubernetes has not been officially tested with Docker 18.xx, but I'm facing problems with pods getting stuck.

In this post I will describe my experiment upgrading the workers to 18.03, a version that fixes some of the instability when stopping containers.

To be clear, I have done this in production, but the cluster is small. I would recommend testing it elsewhere before trying it in production.

A background on the problem:

github.com/moby/moby/issues/31768

github.com/kubernetes/kubernetes/issues/51835

github.com/kubernetes/kubernetes/issues/59564

docs.docker.com/docker-for-mac/kubernetes/#..

bugzilla.redhat.com/show_bug.cgi?id=1505687

We will do this node by node.

Start by evicting all pods from the node (except DaemonSets, which can't be removed):

kubectl drain worker0 --ignore-daemonsets

The node will show as Ready but with scheduling disabled:

NAME      STATUS                        ROLES     AGE       VERSION
master0   Ready                         master    18h       v1.10.2
master1   Ready                         master    18h       v1.10.2
master2   Ready                         master    17h       v1.10.2
worker0   Ready,SchedulingDisabled      <none>    17h       v1.10.2
worker1   Ready                         <none>    17h       v1.10.2
worker2   Ready                         <none>    17h       v1.10.2

If this takes forever to finish, you may proceed by forcing the node down as shown below.

In my case I believe all pods had been evicted, so I proceeded even though the command had not finished.
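
To double-check what is still running on the node before forcing it down, a rough listing like the one below works; worker0 is just my node name. If you prefer, kubectl drain can also be forced, though I did not need that:

# pods still scheduled on worker0
kubectl get pods --all-namespaces -o wide | grep worker0

# optional: force the drain (also evicts pods not managed by a controller)
kubectl drain worker0 --ignore-daemonsets --force --grace-period=0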

In the next step we will stop everything on the node by force. Log in to the node and run:

systemctl stop kubelet
docker stop $(docker ps -q)

Check that the node is down:

kubectl get nodes

NAME      STATUS                        ROLES     AGE       VERSION
master0   Ready                         master    18h       v1.10.2
master1   Ready                         master    18h       v1.10.2
master2   Ready                         master    17h       v1.10.2
worker0   NotReady,SchedulingDisabled   <none>    17h       v1.10.2
worker1   Ready                         <none>    17h       v1.10.2
worker2   Ready                         <none>    17h       v1.10.2

In my case I was using the docker.io package on Ubuntu Xenial. Remove it:

apt-get remove docker docker-engine docker.io

Follow the instructions from Docker to install it: docs.docker.com/install/linux/docker-ce/ubu..

It is a good idea to install a specific version (to avoid unwanted upgrades during system updates).
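
For example, on Ubuntu something along these lines pins the version; the exact version string is whatever apt-cache madison reports for your repository, so treat the one below as a placeholder:

# list the docker-ce versions available in the repository
apt-cache madison docker-ce
# install a specific one (example version string)
apt-get install docker-ce=18.03.1~ce-0~ubuntu
# keep apt from upgrading it later
apt-mark hold docker-ce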

Now start the kubelet:

systemctl start kubelet

Check that things are working:

docker ps
journalctl -u kubelet
kubectl get nodes
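
On the node itself, docker version should now report the new engine:

docker version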

Return the node to cluster scheduling:

kubectl uncordon worker0
kubectl get nodes
# check docker version
kubectl describe node worker0
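
The container runtime version appears in the node's system info; assuming the standard describe output, a quick way to spot it:

kubectl describe node worker0 | grep "Container Runtime Version"
# depending on your kubectl version, kubectl get nodes -o wide may also show the runtime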

Your node should be ready again.

Repeat for the other nodes.

Tip: use the screen command to avoid losing your session on SSH disconnection when working on servers.
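
A minimal screen workflow, in case you don't use it often:

screen -S upgrade    # start a named session
# work normally; Ctrl-a d detaches the session
screen -r upgrade    # reattach after a disconnect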

The whole process took about 30 minutes to upgrade a cluster of 3 workers while typing this blog post at the same time, so you could probably do it in 15 minutes or less.

Despite the lack of official testing of Kubernetes 1.10 with Docker 18.x, no problems were detected; in fact, stability when stopping containers has improved.

I like to wait 2-3 minutes between each node upgrade to give Kubernetes time to reschedule, just in case.
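
Before moving on to the next node, a rough check that everything is back to Running (anything listed besides the header line still needs attention):

kubectl get pods --all-namespaces | grep -v Running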

My cluster uses OpenEBS as persistent storage with default replication set to 2, and critical deployments run with 2 replicas. There were no critical services running at the time of the procedure, but I believe you can do this without service downtime (yes, that is the power of HA clusters, right?). Users and services noticed zero downtime, except for Jira and Confluence (they can't have more replicas), but Kubernetes rescheduled those containers for me and the downtime was minimal. I have to mention that the number of users is very small at this time.