Health checks
- Defining health checks
- Using health checks
- Automated rollbacks
- Implementing auto-rollback in practice
- Enabling auto-rollback
- Visualizing a rolling update
- Visualizing an automated rollback
- CLI flags for health checks and rollbacks
(New in Docker Engine 1.12)
Commands that are executed on regular intervals in a container
Must return 0 or 1 to indicate "all is good" or "something's wrong"
Must execute quickly (timeouts = failures)
Example:
curl -f http://localhost/_ping || false
- the -f flag ensures that curl returns non-zero for 404 and similar errors
- || false ensures that any non-zero exit status gets mapped to 1
- curl must be installed in the container that is being checked
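The effect of || false can be tried in plain shell, without a web server. This is only a minimal sketch: the probe function is a hypothetical stand-in for a failing curl -f (which exits with status 22 when the server answers with an HTTP error) and is not part of the example above.

```shell
# Hypothetical stand-in for a failing check: curl -f exits with 22 on HTTP errors.
probe() { return 22; }

# Without "|| false", the raw exit status leaks through.
raw_status=0
probe || raw_status=$?

# With "|| false", any non-zero status is mapped to 1.
mapped_status=0
{ probe || false; } || mapped_status=$?

echo "raw=$raw_status mapped=$mapped_status"
```

This prints raw=22 mapped=1, which is why the health check command ends with || false.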
Defining health checks
In a Dockerfile, with the HEALTHCHECK instruction
HEALTHCHECK --interval=1s --timeout=3s CMD curl -f http://localhost/ || false
From the command line, when running containers or services
docker run --health-cmd "curl -f http://localhost/ || false" ...
docker service create --health-cmd "curl -f http://localhost/ || false" ...
In Compose files, with a per-service healthcheck section
www:
  image: hellowebapp
  healthcheck:
    test: "curl -f https://localhost/ || false"
    timeout: 3s
Using health checks
With docker run, health checks are purely informative
- docker ps shows health status
- docker inspect has extra details (including health check command output)
With docker service:
- unhealthy tasks are terminated and replaced (i.e. the task is restarted, not the whole service)
- failed deployments can be rolled back automatically (by setting at least the flag --update-failure-action rollback)
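For scripting, the status shown by docker ps can also be read with docker inspect and a Go template. A minimal sketch: the helper name health_status is an illustrative assumption, and the .State.Health field is only present for containers that have a health check defined.

```shell
# Print the health status of one container: "starting", "healthy", or "unhealthy".
# health_status is a hypothetical helper name, not a Docker command.
health_status() {
  docker inspect --format '{{.State.Health.Status}}' "$1"
}
```

For instance, health_status www would report the state of a container named www (the name is made up here).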
Automated rollbacks
Here is a comprehensive example using the CLI:
docker service update \
--update-delay 5s \
--update-failure-action rollback \
--update-max-failure-ratio .25 \
--update-monitor 5s \
--update-parallelism 1 \
--rollback-delay 5s \
--rollback-failure-action pause \
--rollback-max-failure-ratio .5 \
--rollback-monitor 5s \
--rollback-parallelism 0 \
--health-cmd "curl -f http://localhost/ || exit 1" \
--health-interval 2s \
--health-retries 1 \
--image yourimage:newversion yourservice
Implementing auto-rollback in practice
We will use the following Compose file (stacks/dockercoins+healthcheck.yml):
...
  hasher:
    build: dockercoins/hasher
    image: ${REGISTRY-127.0.0.1:5000}/hasher:${TAG-latest}
    deploy:
      replicas: 7
      update_config:
        delay: 5s
        failure_action: rollback
        max_failure_ratio: .5
        monitor: 5s
        parallelism: 1
...
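To get a feel for max_failure_ratio: .5 with 7 replicas, here is a back-of-the-envelope calculation. It rests on two assumptions that are not stated above: that the failure action triggers once the share of failed tasks is strictly greater than the ratio, and that all 7 tasks are part of the update. Variable names are illustrative.

```shell
replicas=7
ratio_percent=50   # max_failure_ratio: .5, expressed as a percentage

# Find the smallest number of failed tasks whose share exceeds the ratio
# (assumed rule: failures / replicas > ratio).
failures=0
while [ $((failures * 100)) -le $((ratio_percent * replicas)) ]; do
  failures=$((failures + 1))
done

echo "rollback after $failures failed tasks out of $replicas"
```

Under those assumptions, 3 failures out of 7 (about 43%) are tolerated and the 4th (about 57%) triggers the rollback.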
Enabling auto-rollback
Go to the stacks directory:
cd ~/container.training/stacks
Deploy the updated stack:
docker stack deploy --compose-file dockercoins+healthcheck.yml dockercoins
This will also scale the hasher service to 7 instances.
Visualizing a rolling update
First, let's make an "innocent" change and deploy it.
Update the sleep delay in the code:
sed -i "s/sleep 0.1/sleep 0.2/" dockercoins/hasher/hasher.rb
Build, ship, and run the new image:
export TAG=v0.5
docker-compose -f dockercoins+healthcheck.yml build
docker-compose -f dockercoins+healthcheck.yml push
docker service update dockercoins_hasher \
  --image=127.0.0.1:5000/hasher:$TAG
Visualizing an automated rollback
And now, a breaking change that will cause the health check to fail:
Change the HTTP listening port:
sed -i "s/80/81/" dockercoins/hasher/hasher.rb
Build, ship, and run the new image:
export TAG=v0.6
docker-compose -f dockercoins+healthcheck.yml build
docker-compose -f dockercoins+healthcheck.yml push
docker service update dockercoins_hasher \
  --image=127.0.0.1:5000/hasher:$TAG
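To follow what happens after this update, the update state of the service can be polled. A sketch, relying on the UpdateStatus.State field reported by docker service inspect (with values such as updating, rollback_started, rollback_completed); the helper name update_state is hypothetical.

```shell
# Print the current update state of a service, e.g. "updating" or "rollback_completed".
# update_state is a hypothetical helper name, not a Docker command.
update_state() {
  docker service inspect --format '{{.UpdateStatus.State}}' "$1"
}
```

Calling update_state dockercoins_hasher every few seconds, or running docker service ps dockercoins_hasher to see individual tasks, shows the update starting and then being rolled back.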
CLI flags for health checks and rollbacks
--health-cmd string                 Command to run to check health
--health-interval duration          Time between running the check (ms|s|m|h)
--health-retries int                Consecutive failures needed to report unhealthy
--health-start-period duration      Start period for the container to initialize before counting retries towards unstable (ms|s|m|h)
--health-timeout duration           Maximum time to allow one check to run (ms|s|m|h)
--no-healthcheck                    Disable any container-specified HEALTHCHECK
--restart-condition string          Restart when condition is met ("none"|"on-failure"|"any")
--restart-delay duration            Delay between restart attempts (ns|us|ms|s|m|h)
--restart-max-attempts uint         Maximum number of restarts before giving up
--restart-window duration           Window used to evaluate the restart policy (ns|us|ms|s|m|h)
--rollback                          Rollback to previous specification
--rollback-delay duration           Delay between task rollbacks (ns|us|ms|s|m|h)
--rollback-failure-action string    Action on rollback failure ("pause"|"continue")
--rollback-max-failure-ratio float  Failure rate to tolerate during a rollback
--rollback-monitor duration         Duration after each task rollback to monitor for failure (ns|us|ms|s|m|h)
--rollback-order string             Rollback order ("start-first"|"stop-first")
--rollback-parallelism uint         Maximum number of tasks rolled back simultaneously (0 to roll back all at once)
--update-delay duration             Delay between updates (ns|us|ms|s|m|h)
--update-failure-action string      Action on update failure ("pause"|"continue"|"rollback")
--update-max-failure-ratio float    Failure rate to tolerate during an update
--update-monitor duration           Duration after each task update to monitor for failure (ns|us|ms|s|m|h)
--update-order string               Update order ("start-first"|"stop-first")
--update-parallelism uint           Maximum number of tasks updated simultaneously (0 to update all at once)
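How --health-retries combines with individual check results can be sketched with a small simulation. This is plain shell, not Docker's actual implementation, and the sequence of exit statuses in results is made up for illustration.

```shell
retries=3                  # as in --health-retries 3
results="0 1 1 0 1 1 1"    # made-up exit statuses of successive checks

streak=0                   # consecutive failures seen so far
status=healthy
checks=0
for rc in $results; do
  checks=$((checks + 1))
  if [ "$rc" -eq 0 ]; then
    streak=0               # a successful check resets the count
  else
    streak=$((streak + 1))
  fi
  if [ "$streak" -ge "$retries" ]; then
    status=unhealthy
    break
  fi
done

echo "status=$status after $checks checks"
```

The two early failures do not count because a success follows them; only the three consecutive failures at the end flip the status, so this prints status=unhealthy after 7 checks.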