Health checks

Defining health checks
Using health checks
Automated rollbacks
Implementing auto-rollback in practice
Enabling auto-rollback
Visualizing a rolling update
Visualizing an automated rollback
CLI flags for health checks and rollbacks

(New in Docker Engine 1.12)

Commands that are executed at regular intervals in a container
Must exit with status 0 ("all is good") or 1 ("something's wrong")
Must execute quickly (timeouts = failures)
Example:
```
curl -f http://localhost/_ping || false
```
- the -f flag ensures that curl returns non-zero for 404 and similar errors
- || false ensures that any non-zero exit status gets mapped to 1
- curl must be installed in the container that is being checked
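
The exit-status mapping can be tried without a web server. This is a minimal sketch: failing_check is a hypothetical stand-in for curl -f hitting a 404 (curl exits with status 22 in that case).

```shell
# Hypothetical stand-in for "curl -f" hitting a 404: curl exits with 22 there.
failing_check() { return 22; }

# Without "|| false", the raw exit status leaks through.
raw=0
failing_check || raw=$?

# With "|| false", any non-zero status becomes exactly 1.
mapped=0
{ failing_check || false; } || mapped=$?

echo "raw=$raw mapped=$mapped"   # prints: raw=22 mapped=1
```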

Defining health checks

In a Dockerfile, with the HEALTHCHECK instruction

HEALTHCHECK --interval=1s --timeout=3s CMD curl -f http://localhost/ || false

From the command line, when running containers or services

docker run --health-cmd "curl -f http://localhost/ || false" ...
docker service create --health-cmd "curl -f http://localhost/ || false" ...

In Compose files, with a per-service healthcheck section

  www:
    image: hellowebapp
    healthcheck:
      test: "curl -f http://localhost/ || false"
      timeout: 3s

Using health checks

With docker run, health checks are purely informative
- docker ps shows health status
- docker inspect has extra details (including health check command output)
With docker service:
- unhealthy tasks are terminated and replaced (i.e. the task is restarted, not the whole service)
- failed deployments can be rolled back automatically
  (by setting at least the flag --update-failure-action rollback)
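
For containers started with docker run, the health status has to be read and acted upon by hand. The sketch below shows one possible way to do it with docker inspect and its .State.Health.Status field. The function names health_status and wait_healthy, and the POLL_INTERVAL variable, are hypothetical helpers introduced for illustration.

```shell
# Print the health status of a container: starting, healthy, or unhealthy.
health_status() {
  docker inspect --format '{{.State.Health.Status}}' "$1"
}

# Poll until the container reports "healthy".
# Returns 0 when healthy, 1 when unhealthy, on error, or after too many attempts.
wait_healthy() {
  container=$1
  attempts=${2:-30}
  i=0
  while [ "$i" -lt "$attempts" ]; do
    status=$(health_status "$container") || return 1
    case $status in
      healthy)   return 0 ;;
      unhealthy) return 1 ;;
    esac
    i=$((i + 1))
    sleep "${POLL_INTERVAL:-1}"
  done
  return 1
}
```

Possible usage, assuming a container named www that defines a health check: wait_healthy www && echo "www is ready"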

Automated rollbacks

Here is a comprehensive example using the CLI:

docker service update \
  --update-delay 5s \
  --update-failure-action rollback \
  --update-max-failure-ratio .25 \
  --update-monitor 5s \
  --update-parallelism 1 \
  --rollback-delay 5s \
  --rollback-failure-action pause \
  --rollback-max-failure-ratio .5 \
  --rollback-monitor 5s \
  --rollback-parallelism 0 \
  --health-cmd "curl -f http://localhost/ || exit 1" \
  --health-interval 2s \
  --health-retries 1 \
  --image yourimage:newversion yourservice

Implementing auto-rollback in practice

We will use the following Compose file (stacks/dockercoins+healthcheck.yml):

...
  hasher:
    build: dockercoins/hasher
    image: ${REGISTRY-127.0.0.1:5000}/hasher:${TAG-latest}
    deploy:
      replicas: 7
      update_config:
        delay: 5s
        failure_action: rollback
        max_failure_ratio: .5
        monitor: 5s
        parallelism: 1
...
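
With 7 replicas and max_failure_ratio: .5, it can help to work out how many failed tasks it takes to trigger the rollback. The sketch below assumes that an update counts as failed as soon as the fraction of failed tasks strictly exceeds the ratio. That reading is an assumption, not something stated in this section, and failures_to_trigger is a hypothetical helper.

```shell
# Print the smallest number of failed tasks whose share of all tasks
# strictly exceeds the tolerated ratio (assumed semantics, see above).
# Usage: failures_to_trigger <replicas> <max_failure_ratio>
failures_to_trigger() {
  awk -v n="$1" -v r="$2" 'BEGIN {
    for (f = 1; f <= n; f++) {
      if (f / n > r) { print f; exit }
    }
    print "never"
  }'
}

failures_to_trigger 7 .5    # prints 4 (4/7 is the first share above .5)
failures_to_trigger 7 .25   # prints 2 (2/7 is the first share above .25)
```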

Enabling auto-rollback

Go to the stacks directory:
```
cd ~/container.training/stacks
```

Deploy the updated stack:

docker stack deploy --compose-file dockercoins+healthcheck.yml dockercoins

This will also scale the hasher service to 7 instances.

Visualizing a rolling update

First, let's make an "innocent" change and deploy it.

Update the sleep delay in the code:

sed -i "s/sleep 0.1/sleep 0.2/" dockercoins/hasher/hasher.rb

Build, ship, and run the new image:

export TAG=v0.5
docker-compose -f dockercoins+healthcheck.yml build
docker-compose -f dockercoins+healthcheck.yml push
docker service update dockercoins_hasher \
         --image=127.0.0.1:5000/hasher:$TAG
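
To follow the rolling update from a script, one option is to poll the update state that docker service inspect reports. This is a sketch, not the only way to watch an update: update_state, wait_for_update, and POLL_INTERVAL are hypothetical names, and the service is assumed to have been updated at least once.

```shell
# Print the update state of a service, e.g. "updating" or "completed".
# Assumption: the service has been updated at least once; otherwise
# .UpdateStatus is empty and docker may report a template error.
update_state() {
  docker service inspect --format '{{.UpdateStatus.State}}' "$1"
}

# Poll until the update is no longer in progress, printing each state change.
wait_for_update() {
  service=$1
  previous=
  while :; do
    state=$(update_state "$service") || return 1
    if [ "$state" != "$previous" ]; then
      echo "$service: $state"
      previous=$state
    fi
    case $state in
      updating|rollback_started) sleep "${POLL_INTERVAL:-1}" ;;
      *) return 0 ;;
    esac
  done
}
```

Possible usage: wait_for_update dockercoins_hasher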

Visualizing an automated rollback

And now, a breaking change that will cause the health check to fail:

Change the HTTP listening port:

sed -i "s/80/81/" dockercoins/hasher/hasher.rb

Build, ship, and run the new image:

export TAG=v0.6
docker-compose -f dockercoins+healthcheck.yml build
docker-compose -f dockercoins+healthcheck.yml push
docker service update dockercoins_hasher \
         --image=127.0.0.1:5000/hasher:$TAG
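
Once the failing tasks have been detected, the update state of the service should show that a rollback took place. The sketch below translates the state values reported by docker service inspect into plain sentences; describe_update_state is a hypothetical helper, and the wording of each message is ours.

```shell
# Translate an update state (as found in .UpdateStatus.State) into a sentence.
describe_update_state() {
  case $1 in
    updating)           echo "update in progress" ;;
    completed)          echo "update completed" ;;
    paused)             echo "update paused" ;;
    rollback_started)   echo "rollback in progress" ;;
    rollback_completed) echo "rolled back to the previous version" ;;
    rollback_paused)    echo "rollback paused" ;;
    *)                  echo "unknown state: $1" ;;
  esac
}

# Possible usage against the real service:
#   describe_update_state "$(docker service inspect \
#     --format '{{.UpdateStatus.State}}' dockercoins_hasher)"
```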

CLI flags for health checks and rollbacks

--health-cmd string                  Command to run to check health
--health-interval duration           Time between running the check (ms|s|m|h)
--health-retries int                 Consecutive failures needed to report unhealthy
--health-start-period duration       Start period for the container to initialize before counting retries towards unstable (ms|s|m|h)
--health-timeout duration            Maximum time to allow one check to run (ms|s|m|h)
--no-healthcheck                     Disable any container-specified HEALTHCHECK
--restart-condition string           Restart when condition is met ("none"|"on-failure"|"any")
--restart-delay duration             Delay between restart attempts (ns|us|ms|s|m|h)
--restart-max-attempts uint          Maximum number of restarts before giving up
--restart-window duration            Window used to evaluate the restart policy (ns|us|ms|s|m|h)
--rollback                           Rollback to previous specification
--rollback-delay duration            Delay between task rollbacks (ns|us|ms|s|m|h)
--rollback-failure-action string     Action on rollback failure ("pause"|"continue")
--rollback-max-failure-ratio float   Failure rate to tolerate during a rollback
--rollback-monitor duration          Duration after each task rollback to monitor for failure (ns|us|ms|s|m|h)
--rollback-order string              Rollback order ("start-first"|"stop-first")
--rollback-parallelism uint          Maximum number of tasks rolled back simultaneously (0 to roll back all at once)
--update-delay duration              Delay between updates (ns|us|ms|s|m|h)
--update-failure-action string       Action on update failure ("pause"|"continue"|"rollback")
--update-max-failure-ratio float     Failure rate to tolerate during an update
--update-monitor duration            Duration after each task update to monitor for failure (ns|us|ms|s|m|h)
--update-order string                Update order ("start-first"|"stop-first")
--update-parallelism uint            Maximum number of tasks updated simultaneously (0 to update all at once)
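
The health flags interact: a rough upper bound on how long a broken container can go undetected is retries × (interval + timeout), since each check may run for up to the timeout and the next one starts one interval later. The sketch below computes that estimate in whole seconds. It is an approximation that ignores the start period, and worst_case_unhealthy_seconds is a hypothetical helper. When a flag is not set, Docker's defaults apply (interval 30s, timeout 30s, retries 3).

```shell
# Rough upper bound, in seconds, before a failing container is reported
# unhealthy. Usage: worst_case_unhealthy_seconds <interval> <timeout> <retries>
worst_case_unhealthy_seconds() {
  interval=$1
  timeout=$2
  retries=$3
  echo $(( retries * (interval + timeout) ))
}

# --health-interval 2s --health-retries 1, default timeout of 30s
worst_case_unhealthy_seconds 2 30 1   # prints 32

# --interval=1s --timeout=3s, default of 3 retries
worst_case_unhealthy_seconds 1 3 3    # prints 12
```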