Kubernetes Caveats and Workarounds

In a previous blog post I talked about how we came to use Kubernetes on Google Container Engine (GKE). In this post I’ll mention some of the caveats and workarounds we’ve come across so far when using this combination. The intention of this blog post is not to discredit the awesome work done by the folks behind Kubernetes or Google Container Engine, but rather to point out some of the pain points we’ve experienced and present some of our workarounds. It should also be noted that some of the issues either have already been addressed or will be, and some of the fixes will ship in the upcoming version of Kubernetes (v1.2). Here’s a list of the caveats and workarounds I’ll talk about in this blog post:

  1. Application Configuration
  2. Restrict Google CloudSQL to a Kubernetes cluster
  3. Graceful Container Shutdown
  4. Liveness Probes and Basic Authentication
  5. Tail logs from multiple pods simultaneously
  6. Deployments
  7. Inability to define headers in Google HTTP Load Balancer
  8. Google HTTP Load Balancer SSL Termination
  9. Quotas
  10. Scheduling
  11. Rescheduling Pods with GCE Persistent Disk

There’s also a conclusion at the end. So without further ado…

 

Application Configuration

We’ve come from a background of using docker-compose, which let us supply a file of key-value pairs that are exposed as environment variables inside a Docker container. So for each application we had a config file consisting of such key-value pairs. For example:

URL=http://someurl.com
TIMEOUT=4000
...

Our continuous delivery (CD) pipeline made sure that these config files were indeed exposed as environment variables when deploying our Docker containers to a node in our cluster. When looking into how application configuration worked in Kubernetes, it didn’t seem to fit well with our current setup. For example, we found no way to pass environment variables to a replication controller (RC) template without hardcoding the values. We would’ve liked a template that looked something like this:

kind: "ReplicationController"
apiVersion: "v1"
metadata:
  name: "my-example"
  labels:
    name: "my-example"
    version: "{{VERSION}}"
  …
    spec:
      containers:
        - name: "my-pod"
          image: "gcr.io/my-project/my-example:{{VERSION}}"
          env:
            - name: "TIMEOUT"
              value: "{{TIMEOUT}}"
            - name: "URL"
              value: "{{URL}}"
      … 

so that the CD server could apply the environment-specific application configuration (test/prod) to the RC template before deploying it to the cluster. What made it even more difficult for us was that not all variables were defined in a single config file. For example, `VERSION` is just a sequential build number generated by the CD server, so we wanted to apply environment variables from two different sources (the application config file and the CD server). After searching for quite a while we decided to go with a workaround and created a bash script that allows us to apply our configuration like this:

$ VERSION=1.2.3 ./templater.sh rc-template.yaml \
                               -f app-config.properties > rc.yaml

This allows our CD server to define the `VERSION` property while the application-specific configuration resides in the `app-config.properties` file. I’ve described this workaround in more detail here. Kubernetes 1.2 will give us ConfigMaps, which will probably address the need for this workaround.
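
For reference, here’s a minimal sketch of what this could look like with ConfigMaps, based on the v1.2 API as I understand it (the `my-example-config` name and keys are just illustrative):

kind: "ConfigMap"
apiVersion: "v1"
metadata:
  name: "my-example-config"
data:
  # the same key-value pairs as in app-config.properties
  url: "http://someurl.com"
  timeout: "4000"

The RC template would then reference the keys instead of having the values baked in at deploy time:

env:
  - name: "URL"
    valueFrom:
      configMapKeyRef:
        name: "my-example-config"
        key: "url"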

 

Restrict Google CloudSQL to a Kubernetes cluster

If you’re on the Google cloud and are using a SQL database, it’s not unlikely that you’re using a Google CloudSQL instance or two. When creating a CloudSQL instance you want to lock down access so that only authorized networks can connect to your instance. This is typically done by configuring the allowed hosts of the database instance. But if you want to lock down the database to only be accessible from a Kubernetes cluster running on Google Container Engine, you need to manually maintain the allowed hosts list. The reason is that your Kubernetes cluster doesn’t have a fixed IP range and cluster instances may come and go at any time. Also, your cluster may be expanded with new machines that also need to access the database. This places a huge burden on the operations team to keep track of and maintain the list.

Luckily Jordi Collell has created a little app called cloudsqlip which can be deployed in a pod inside a Kubernetes cluster to monitor the nodes and maintain the Google CloudSQL allowed hosts list based on the current state of the cluster. Jordi was nice enough to publish the app to DockerHub for everyone to use, so this is what we ended up using. If you want to know more about how to actually use it, please refer to this blog post.
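
In essence, what cloudsqlip automates can be thought of as a loop around something like the following (a hedged sketch of the idea, not cloudsqlip’s actual code; `my-sql-instance` is a placeholder):

#!/bin/bash
# Collect the external IPs of all nodes currently in the cluster
ips=$(kubectl get nodes -o jsonpath='{.items[*].status.addresses[?(@.type=="ExternalIP")].address}' | tr ' ' ',')
# Overwrite the allowed hosts of the CloudSQL instance with that list
gcloud sql instances patch my-sql-instance --authorized-networks="${ips}"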

 

Graceful Container Shutdown

This is probably not an issue with Kubernetes per se, but I’m including it here since this workaround has served us well so far. We found that if we stopped a pod in Kubernetes it didn’t shut down gracefully. It was simply killed in the middle of operations without being given time to terminate properly. This is of course something we want to avoid. Luckily Kubernetes has support for lifecycle hooks, which we used to wait for the container to shut down properly. We’re mostly a JVM shop so we’ve added shutdown hooks to all our applications to shut them down gracefully. Thus we created a script that sends SIGTERM to the JVM and simply waits until it has stopped:

["/bin/sh", "-c", "PID=`pidof java` && kill -SIGTERM $PID && while ps -p $PID > /dev/null; do sleep 1; done;"]

This script was then applied to the pods in the pod template:

...
containers:
- name: "x"
  ...
  lifecycle:
    preStop:
      exec:
        command: {{SHUTDOWN_COMMAND}}

by using the templater script.
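
One related detail worth pointing out (an assumption based on the Kubernetes documentation rather than something we’ve had to tune ourselves): the Kubelet only waits for the pod’s termination grace period before force-killing the container, so the grace period must be long enough to cover the preStop hook. A sketch with the shutdown command inlined:

...
spec:
  # must exceed the worst-case JVM shutdown time or the preStop hook is cut short
  terminationGracePeriodSeconds: 60
  containers:
  - name: "x"
    lifecycle:
      preStop:
        exec:
          command: ["/bin/sh", "-c", "PID=`pidof java` && kill -SIGTERM $PID && while ps -p $PID > /dev/null; do sleep 1; done;"]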

 

Liveness Probes and Basic Authentication

Kubernetes has the notion of liveness probes that keep track of the health of a pod. If a pod doesn’t respond according to what’s specified by the liveness probe it’ll be restarted by the Kubelet. Usually it’s easy to define such a probe:

livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
      initialDelaySeconds: 15
      timeoutSeconds: 1

Our problem was that all our apps were using basic authentication in one way or another, and it’s not possible to define headers (or authentication) for the `GET` request issued by the liveness probe. The workaround for us was to run a `curl` command instead. Note that exec probes don’t run the command in a shell, so it needs to be wrapped in `/bin/sh -c` explicitly:

livenessProbe:
      exec:
        command: ["/bin/sh", "-c",
                  "curl --fail --silent --output /dev/null -u {{AUTH_USER}}:{{AUTH_PASSWD}} localhost:{{AUTH_PORT}}"]
      initialDelaySeconds: 15
      timeoutSeconds: 1
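
Worth noting: if I recall correctly, Kubernetes 1.2 adds support for custom headers on `httpGet` probes, which would remove the need for the curl workaround altogether. A hedged sketch (the `{{AUTH_BASE64}}` placeholder is a pre-computed base64 of user:password):

livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
        httpHeaders:
          # basic auth header, templated in like the other values
          - name: "Authorization"
            value: "Basic {{AUTH_BASE64}}"
      initialDelaySeconds: 15
      timeoutSeconds: 1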

 

Tail logs from multiple pods simultaneously

Tailing the logs of a single pod in Kubernetes from the command line is straightforward:

$ kubectl logs -f my-pod --since=10m

This will tail the logs of `my-pod`, displaying only the last 10 minutes of logs. But it’s very common to have a replication controller that spins up multiple instances of a pod, so how can we get an aggregated view of all the logs from the command line? It turns out that this is quite a hassle and it doesn’t work out of the box (afaik). You could use the graphical interface of Google Cloud Logging (if you’re running on GKE), but it would be convenient to be able to do this from the command line as well. This is why I created a bash script called kubetail that tries to do exactly this. It allows you to tail multiple pods (even from different replication controllers) by doing:

$ kubetail my-pod

It will present an aggregated view of all pods containing “my-pod” in their names. For more information see the github page or refer to this blog post.
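
Under the hood the core idea is simple (this is a hedged sketch of the concept, not kubetail’s actual implementation): start one `kubectl logs -f` per matching pod in the background and let the output interleave:

#!/bin/bash
# Tail every pod whose name contains the given substring, interleaved
match="${1}"
for pod in $(kubectl get pods --no-headers=true | awk -v m="${match}" '$1 ~ m {print $1}'); do
  kubectl logs -f "${pod}" &   # one background tail per pod
done
wait # keep the script alive while the background tails run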

 

Deployments

Kubernetes has a unified way of creating resources by using the `create` command. For example, to create a replication controller defined in a file called `rc.yaml` you can do:

$ kubectl create -f rc.yaml

Kubernetes will now create the replication controller defined in this file and start up the pods. Once the replication controller is up and running one can issue a rolling update by doing:

$ kubectl rolling-update my-rc-v1 -f my-rc-v2.json

This will update the pods of `my-rc-v1` using the new replication controller data in `my-rc-v2.json`, and it’ll try to do so without downtime. This is great! But besides the fact that this is controlled from the client side (i.e. if you `ctrl+c` a rolling update it won’t continue in the cluster), there’s no built-in way of unifying “create” and “rolling-update”. If you have a continuous delivery pipeline like we have, it needs to figure out whether a replication controller is already deployed in the cluster. If so, it should perform a `rolling-update`; if not, it should simply create it (using `kubectl create`). To accommodate this we’ve created the following script that is used by our Jenkins CD pipeline when deploying a container:

#!/bin/bash

kubectl_context="${1}"
replication_controller_label="${2}"
controller_file="${3}"

# Fail the entire script as soon as a command fails
set -e
# Get the replication controller with the specified label
rc=$(kubectl get rc --context "${kubectl_context}" -l name="${replication_controller_label}" --no-headers=true | sed 's/ .*//')

if [ -z "${rc}" ]; then
    # No replication controller exists, so we just create it
    echo "No existing replication controller was found for ${replication_controller_label}, creating a new one."
    kubectl create --context "${kubectl_context}" -f "${controller_file}"
else
    # A replication controller exists, so we do a rolling update
    echo "Replication controller ${rc} already exists for label ${replication_controller_label}, will do a rolling update."
    kubectl rolling-update --update-period="30s" "${rc}" -f "${controller_file}" --context "${kubectl_context}"
fi

The problem is that this doesn’t work for pods using Google persistent disks, so we need another mechanism for those. What we want is a unified deployment mechanism that is executed server side. Luckily Kubernetes now has something called deployments that addresses this issue, but it’s not yet available in Google Container Engine.
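
Once deployments become available on GKE, both branches of the script above should collapse into a single declarative resource. A hedged sketch of what that might look like with the v1.2 API (reusing the names from the RC template earlier):

apiVersion: "extensions/v1beta1"
kind: "Deployment"
metadata:
  name: "my-example"
spec:
  replicas: 2
  template:
    metadata:
      labels:
        name: "my-example"
    spec:
      containers:
        - name: "my-pod"
          image: "gcr.io/my-project/my-example:{{VERSION}}"

Creating it the first time and rolling out a new version should then be the same server-side operation, something like `kubectl apply -f deployment.yaml`.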

 

Inability to define headers in Google HTTP Load Balancer

We’re using ingress resources to expose our services to the outside world through a Google HTTP Load Balancer. When such a load balancer is created it brings with it a health check that is probed to find out whether the service is up and running. The problem, as described here, was that we were using basic authentication for virtually all of our HTTP resources, and there’s currently no way to add headers to the health check. This caused us A LOT of pain since we now had to rewrite many of our services (about 20) to allow unauthenticated access to a ping resource. This is actually harder than it might sound and it took a while to port everything. It also caused us a lot of issues when we were trying to redirect our WordPress instance from http to https, as described in this blog post. I’ve created an issue for this that has been accepted, so hopefully this will be addressed in the future.

 

Google HTTP Load Balancer SSL Termination


This was by far the most difficult thing we’ve encountered so far when running Kubernetes on Google Container Engine. We wanted the HTTP Load Balancer (LB) to do SSL termination for us instead of hosting Nginx services (pods) for this in our Kubernetes cluster. Currently (v1.1) Kubernetes supports specifying ingress resources that allow us to expose our services to the outside world using an HTTP Load Balancer. They look something like this:

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: test
spec:
  rules:
  - host: foo.bar.com
    http:
      paths:
      - path: /foo
        backend:
          serviceName: s1
          servicePort: 80
      - path: /bar
        backend:
          serviceName: s2
          servicePort: 80

The problem is that SSL termination (https) is not supported until Kubernetes 1.2. But we decided to go down this route anyway and wire up SSL termination manually, hoping that this will let us upgrade more smoothly once version 1.2 is released. Being completely new to the Google Cloud environment didn’t make things easier. The entire process of getting this to work is too long to present here (it could be the sole subject of another blog post), but here’s a list of things that need to be done for each service:

  • Create and deploy a new ingress resource
  • Create firewall rules
  • Set the generated IP to static
  • Create target rules for the HTTPS LB
  • Create forwarding rules for the HTTPS LB
  • Fix the health check path and port (especially problematic if the service exposes two ports where one is just for management resources)

Also, for this to work you need to increase the project’s resource quotas.
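
To give a taste of what this involves, here’s a hedged sketch of some of the `gcloud` plumbing (all names are illustrative placeholders and the exact flags may differ between gcloud versions):

# Reserve a static global IP (or promote the ingress-generated
# ephemeral IP by passing it via --addresses)
gcloud compute addresses create my-service-ip --global

# Upload the SSL certificate that the HTTPS proxy will terminate with
gcloud compute ssl-certificates create my-cert \
    --certificate=cert.pem --private-key=key.pem

# Create a target HTTPS proxy that reuses the URL map the ingress created
gcloud compute target-https-proxies create my-https-proxy \
    --url-map=my-url-map --ssl-certificates=my-cert

# Forward port 443 on the static IP to the HTTPS proxy
gcloud compute forwarding-rules create my-https-rule --global \
    --address=my-service-ip --ports=443 \
    --target-https-proxy=my-https-proxy

Once Kubernetes 1.2 lands, the hope is that most of this can be replaced by declaring TLS directly on the ingress resource, something like this (hedged, based on the 1.2 proposal as I understand it):

spec:
  tls:
    # secret holding tls.crt and tls.key
    - secretName: my-tls-secret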

 

Quotas

As mentioned here, one of the problems with ingress resources in Google Container Engine is that you need to increase the resource quotas of the project. If you don’t, you’ll run into issues very quickly. The main culprit is the small number of backend services that you’re allowed to create by default (limited to 3). When an ingress is created on GKE it creates one of these backend services for each port defined in a service, plus one additional backend service for the error page that is presented when the service is unreachable. We typically expose two different ports for each service, one application port and one management port. This means that we can only deploy ONE single ingress resource before we hit the quota limit. In order to increase the limit one has to manually fill out a form and send a request to Google (which is handled manually and may get rejected). And even then the maximum value is 30, which isn’t a large number if you need to expose a lot of services to the outside world. This is a huge problem and I know that Google engineers are working on it. Besides increasing the quota for backend services you must also increase the quota for at least the following resources:

  • Target HTTP proxies
  • Global static IPs
  • Forwarding rules
  • Health checks
  • Instance groups
  • URL maps
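
You can check where your project currently stands with a one-liner (hedged: the format flag syntax may vary between gcloud versions):

# List project-wide quotas together with their current usage
gcloud compute project-info describe --format="yaml(quotas)"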

Hopefully this will be less painful soon.

 

Scheduling

The Kubernetes scheduler doesn’t always distribute the pods in the cluster the way you would expect. For example, have a look at this:

NAME                 READY     STATUS    RESTARTS   AGE       NODE
mypod-36-an5sw       1/1       Running   0          1d        gke-b99a-node-1rab
mypod-36-gbhqq       1/1       Running   0          1d        gke-b99a-node-1rab

If you look closely you can see that we have two instances of `mypod` running, but they are deployed to the same node (gke-b99a-node-1rab). To me this doesn’t make any sense. There are other nodes available in the cluster with enough resources to host one of the pods, so why doesn’t the scheduler deploy it to one of those? If `gke-b99a-node-1rab` dies we’ll have downtime of `mypod`, which seems unnecessary. The only safe workaround that I know of (but it doesn’t yet work on GKE since it’s in beta) would be to deploy your most important services as a daemon set in order to avoid downtime. But this might waste resources since you probably don’t want to host `mypod` on every node in the cluster; you just want it to be highly available. I’ve heard some Google engineers say that you can write your own scheduler, but that’s not realistic for most people. One thing that tended to improve this though (but it could also be chance) was creating resource requests/limits for each pod. For example:

...
 containers:
  - name: db
    image: mysql
    resources:
      requests:
        memory: "64Mi"
        cpu: "250m"
      limits:
        memory: "128Mi"
        cpu: "500m"

 

Rescheduling Pods with GCE Persistent Disk

Some of our pods require persistent volumes, and one way to achieve this is to use a Google Persistent Disk (which unfortunately can only be mounted by one pod at a time for write access). But there’s a bug that we’ve run into on several occasions in our test environment (for example if a node is killed abruptly) that makes the disk unmountable by a pod in Kubernetes. This means that the pod bound to this disk cannot be started without manual intervention (as described here). At first it looked like this issue might not be fixed until Kubernetes 1.3, but from what I’ve seen lately it might be fixed sooner than that.
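
For reference, here’s a hedged sketch of the kind of manual intervention needed (instance, disk and pod names are placeholders, and the exact steps are in the description linked above):

# Detach the persistent disk from the node that died with it attached
gcloud compute instances detach-disk gke-b99a-node-dead --disk=my-disk
# Delete the stuck pod so the replication controller reschedules it
kubectl delete pod my-pod-xyz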

 

Conclusion

I’ve presented some of the issues we’ve run into with Kubernetes and Google Container Engine so far, but the good news is that we’ve managed to work around most of them and the future looks promising. I’m sure that we would have run into issues with whatever platform or provider we had chosen. To get support one typically turns to stackoverflow (Google is sponsoring certain tags there) or Slack. But we were also contacted by a Google representative to whom we mailed a lot of questions and (after a couple of weeks) got really good answers. This was way more than we had expected since we’re not a very large customer (yet :)), so big kudos for this.

One thing that I really miss though is persistent volumes à la Flocker (which supports multiple simultaneous writers), available in other services such as Docker Cloud (formerly Tutum). I know you can set this up manually, but doing so defeats the purpose of using Google Container Engine as a managed service. Although no silver bullet, it would’ve made our lives easier when deploying certain kinds of services (such as a highly available WordPress cluster). I really hope this can be incorporated natively into Google Container Engine in the future.

