Copy-on-write filesystems

Introduction

Container engines rely on copy-on-write to be able to start containers quickly, regardless of their size.

We will explain how that works, and review some of the copy-on-write storage systems available on Linux.

What is copy-on-write?

• Copy-on-write is a mechanism that lets multiple users share the same data.

• The data appears to be a copy, but is only a link (or reference) to the original data.

• The actual copy happens only when someone tries to change the shared data.

• Whoever changes the shared data ends up using their own copy instead of the shared data.
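
The idea can be sketched in a few lines of Python (a toy illustration, not how any real filesystem implements it; the CowBuffer class is made up for this example):

```python
# Toy copy-on-write wrapper: readers share one buffer;
# the actual copy happens only on the first write.
class CowBuffer:
    def __init__(self, data):
        self._shared = data      # reference to the shared data
        self._own = None         # private copy, created lazily

    def read(self, i):
        src = self._own if self._own is not None else self._shared
        return src[i]

    def write(self, i, value):
        if self._own is None:            # first write: copy now
            self._own = list(self._shared)
        self._own[i] = value

original = [0, 1, 2]
a = CowBuffer(original)
b = CowBuffer(original)   # "copying" is free: both share `original`
b.write(0, 42)            # b now pays for its own private copy
print(a.read(0), b.read(0), original[0])  # 0 42 0
```

Note that `a` never paid any copy cost: only the writer did.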

A few metaphors

• First metaphor:
white board and tracing paper

• Second metaphor:

• Third metaphor:
just-in-time house building

Copy-on-write is everywhere

• Process creation with fork().

• Consistent disk snapshots.

• Efficient VM provisioning.

• And, of course, containers.
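
The fork() case can be observed directly: after fork(), parent and child share the same memory pages, and the kernel copies a page only when one side writes to it. A small sketch (Linux-only, since it uses os.fork()):

```python
# After fork(), parent and child share the same memory pages;
# the kernel copies a page only when one of them writes to it.
# Observable effect: the child's writes never reach the parent.
import os

def fork_demo():
    data = [0] * 1000
    pid = os.fork()
    if pid == 0:               # child
        data[0] = 42           # triggers copy-on-write of that page
        os._exit(0)
    os.waitpid(pid, 0)         # parent
    return data[0]             # still 0: the parent kept its version

print(fork_demo())  # 0
```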

Copy-on-write and containers

Copy-on-write is essential to give us "convenient" containers.

• Creating a new container (from an existing image) is "free".

(Otherwise, we would have to copy the image first.)

• Customizing a container (by tweaking a few files) is cheap.

(Adding a 1 KB configuration file to a 1 GB container takes 1 KB, not 1 GB.)

• We can take snapshots, i.e. have "checkpoints" or "save points" when building images.
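
The arithmetic is worth spelling out (a back-of-the-envelope sketch, using the numbers from the example above):

```python
# Disk usage of n containers started from one image,
# with and without copy-on-write.
IMAGE = 1_000_000_000   # 1 GB image
TWEAK = 1_000           # each container adds a 1 KB config file

def naive(n):           # full copy of the image per container
    return n * (IMAGE + TWEAK)

def cow(n):             # one shared image + per-container deltas
    return IMAGE + n * TWEAK

print(naive(100))  # ~100 GB
print(cow(100))    # ~1 GB
```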

AUFS overview

• The original (legacy) copy-on-write filesystem, used by the first versions of Docker.

• It combines multiple branches in a specific order.

• Each branch is just a normal directory.

• You generally have:

• at least one read-only branch (at the bottom),

• exactly one read-write branch (at the top).

(But other fun combinations are possible too!)

AUFS operations: opening a file

• With O_RDONLY - read-only access:

• look it up in each branch, starting from the top

• open the first one we find

• With O_WRONLY or O_RDWR - write access:

• if the file exists on the top branch: open it

• if the file exists on another branch: "copy up"
(i.e. copy the file to the top branch and open the copy)

• if the file doesn't exist on any branch: create it on the top branch

That "copy-up" operation can take a while if the file is big!
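
The lookup rules above can be sketched as a toy model (branches are plain dicts here; real AUFS of course operates on directories and inodes):

```python
# Toy model of union-filesystem open() logic, as described above.
# branches[0] is the top (read-write) branch.
def lookup(branches, name):
    for i, branch in enumerate(branches):
        if name in branch:
            return i                        # first match, top-down
    return None

def open_file(branches, name, mode):
    i = lookup(branches, name)
    if mode == "r":
        if i is None:
            raise FileNotFoundError(name)
        return branches[i][name]            # open the first one we find
    # write access:
    if i is None:
        branches[0][name] = ""              # create on the top branch
    elif i > 0:
        branches[0][name] = branches[i][name]  # "copy up" to the top
    return branches[0][name]

top = {}                                    # read-write branch
bottom = {"/etc/shadow": "root:*:..."}      # read-only branch
open_file([top, bottom], "/etc/shadow", "w")
print("/etc/shadow" in top)  # True: the file was copied up
```

The cost of the copy-up step is proportional to the file size, which is exactly why big writable files hurt.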

AUFS operations: deleting a file

• A whiteout file is created.

• This is similar to the concept of "tombstones" used in some data systems.

# docker run ubuntu rm /etc/shadow

# ls -la /var/lib/docker/aufs/diff/$(docker ps --no-trunc -lq)/etc
total 8
drwxr-xr-x 2 root root 4096 Jan 27 15:36 .
drwxr-xr-x 5 root root 4096 Jan 27 15:36 ..
-r--r--r-- 2 root root    0 Jan 27 15:36 .wh.shadow
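
A toy model of whiteouts, continuing the dict-based sketch from before (the .wh. prefix matches what AUFS uses; everything else is made up for illustration):

```python
# Deleting a file that lives in a lower branch creates a
# ".wh.<name>" marker (a tombstone) in the top branch.
import os.path

def whiteout(name):
    d, f = os.path.split(name)
    return os.path.join(d, ".wh." + f)

def delete(branches, name):
    branches[0].pop(name, None)        # remove any copy on top
    branches[0][whiteout(name)] = ""   # hide the lower copies

def exists(branches, name):
    for branch in branches:
        if whiteout(name) in branch:
            return False               # tombstone: stop looking
        if name in branch:
            return True
    return False

top = {}
bottom = {"/etc/shadow": "..."}
delete([top, bottom], "/etc/shadow")
print(exists([top, bottom], "/etc/shadow"))   # False
print(whiteout("/etc/shadow"))                # /etc/.wh.shadow
```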


AUFS performance

• AUFS mount() is fast, so creation of containers is quick.

• Read/write access has native speeds.

• But initial open() is expensive in two scenarios:

• when writing big files (log files, databases ...),

• when searching many directories (PATH, classpath, etc.) over many layers.

• Protip: when we built dotCloud, we ended up putting all important data on volumes.

• When starting the same container multiple times:

• the data is loaded only once from disk, and cached only once in memory;

• but dentries will be duplicated.

Device Mapper

Device Mapper is a rich subsystem with many features.

It can be used for: RAID, encrypted devices, snapshots, and more.

In the context of containers (and Docker in particular), "Device Mapper" means:

"the Device Mapper system + its thin provisioning target"

If you see the abbreviation "thinp", it stands for "thin provisioning".

Device Mapper principles

• Copy-on-write happens on the block level (instead of the file level).

• Each container and each image gets its own block device.

• At any given time, it is possible to take a snapshot:

• of an existing container (to create a frozen image),

• of an existing image (to create a container from it).

• If a block has never been written to:

• it's assumed to be all zeros,

• it's not allocated on disk.

(That last property is the reason for the name "thin" provisioning.)

Device Mapper operational details

• Two storage areas are needed: one for data, another for metadata.

• "data" is also called the "pool"; it's just a big pool of blocks.

(Docker uses the smallest possible block size, 64 KB.)

• "metadata" contains the mappings between virtual offsets (in the snapshots) and physical offsets (in the pool).

• Each time a new block (or a copy-on-write block) is written, a block is allocated from the pool.

• When there are no more blocks in the pool, attempts to write will stall until the pool is increased (or the write operation is aborted).

• In other words: when running out of space, containers are frozen, but operations will resume as soon as space is available.
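
The mapping and allocation logic can be sketched as a toy model (ThinDevice and PhysBlock are invented names for this sketch; the real target also journals its metadata, which is skipped here):

```python
# Toy model of the thin-provisioning target described above:
# metadata maps virtual blocks to pool blocks, allocation happens
# on first write, and unwritten blocks read back as zeros.
BLOCK = 64 * 1024                  # 64 KB blocks, as in Docker's setup

class PhysBlock:                   # one block in the shared pool
    def __init__(self):
        self.data = b"\0" * BLOCK

class ThinDevice:
    def __init__(self, pool):
        self.pool = pool           # shared free list ("the pool")
        self.mapping = {}          # metadata: virtual -> physical block

    def read(self, vblock):
        if vblock not in self.mapping:
            return b"\0" * BLOCK   # never written: assumed all zeros
        return self.mapping[vblock].data

    def write(self, vblock, data):
        if vblock not in self.mapping:
            if not self.pool:      # real target: writes stall instead
                raise BlockingIOError("pool exhausted")
            self.mapping[vblock] = self.pool.pop()
        self.mapping[vblock].data = data

pool = [PhysBlock(), PhysBlock()]  # a deliberately tiny pool
dev = ThinDevice(pool)
print(dev.read(7) == b"\0" * BLOCK)   # True: reads need no allocation
dev.write(0, b"x" * BLOCK)
dev.write(1, b"y" * BLOCK)            # pool is now empty
try:
    dev.write(2, b"z" * BLOCK)
except BlockingIOError as e:
    print(e)                          # pool exhausted
```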

Device Mapper performance

• By default, Docker puts data and metadata on a loop device backed by a sparse file.

• This is great from a usability point of view, since zero configuration is needed.

• But it is terrible from a performance point of view:

• each time a container writes to a new block,

• a block has to be allocated from the pool,

• and when it's written to,

• a block has to be allocated from the sparse file,

• and sparse file performance isn't great anyway.

• If you use Device Mapper, make sure to put data (and metadata) on real block devices!

BTRFS principles

• BTRFS is a filesystem (like EXT4, XFS, NTFS...) with built-in snapshots.

• The "copy-on-write" happens at the filesystem level.

• BTRFS integrates the snapshot and block pool management features at the filesystem level.

(Instead of the block level for Device Mapper.)

• In practice, we create a "subvolume" and later take a "snapshot" of that subvolume.

Imagine: mkdir with Super Powers and cp -a with Super Powers.

• These operations can be executed with the btrfs CLI tool.

BTRFS in practice with Docker

• Docker can use BTRFS and its snapshotting features to store container images.

• The only requirement is that /var/lib/docker is on a BTRFS filesystem.

(Or, the directory specified with the --data-root flag when starting the engine.)

BTRFS quirks

• BTRFS works by dividing its storage into chunks.

• A chunk can contain data or metadata.

• You can run out of chunks (and get "No space left on device") even though df shows space available.

(Because chunks are only partially allocated.)

• Quick fix:

 # btrfs filesystem balance start -dusage=1 /var/lib/docker
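
A toy model of that failure mode (Volume and its methods are invented for this sketch; real chunk sizes are much larger and the balance operation is far more involved):

```python
# df counts free bytes inside allocated chunks, but a new write
# may need a brand-new chunk; balance compacts partially-used chunks.
CHUNK = 1024                      # bytes per chunk (toy value)

class Volume:
    def __init__(self, total_chunks):
        self.total = total_chunks
        self.chunks = []          # used bytes per allocated chunk

    def df_free(self):            # free space as df would report it
        inside = sum(CHUNK - used for used in self.chunks)
        outside = (self.total - len(self.chunks)) * CHUNK
        return inside + outside

    def alloc_chunk(self, used):
        if len(self.chunks) >= self.total:
            raise OSError("No space left on device")
        self.chunks.append(used)

    def balance(self):            # compact partially-used chunks
        used = sum(self.chunks)
        self.chunks = [CHUNK] * (used // CHUNK)
        if used % CHUNK:
            self.chunks.append(used % CHUNK)

vol = Volume(total_chunks=4)
for _ in range(4):
    vol.alloc_chunk(used=256)     # four chunks, each only 25% full
print(vol.df_free())              # 3072: df still shows free space...
try:
    vol.alloc_chunk(used=256)     # ...but no new chunk fits
except OSError as e:
    print(e)
vol.balance()                     # rough equivalent of: btrfs balance
vol.alloc_chunk(used=256)         # now it works
```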


Overlay2

• Overlay2 is very similar to AUFS.

• However, it has been merged into the "upstream" kernel.

• It is therefore available on all modern kernels.

(AUFS was available on Debian and Ubuntu, but required custom kernels on other distros.)

• It is simpler than AUFS: a mount combines a single read-write "upper" layer with one or more read-only "lower" layers.

• The container engine abstracts this detail, so this is not a concern.

• Overlay2 storage drivers generally use hard links between layers.

• This improves stat() and open() performance, at the expense of inode usage.

ZFS

• ZFS is similar to BTRFS (at least from a container user's perspective).

• Pros:

• high performance
• high reliability (with e.g. data checksums)
• optional data compression and deduplication
• Cons:

• high memory usage
• not in upstream kernel
• It is available as a kernel module or through FUSE.

Which one is the best?

• Eventually, overlay2 should be the best option.

• It is available on all modern systems.

• Its memory usage is better than that of Device Mapper, BTRFS, or ZFS.

• The remarks about write performance shouldn't bother you:
data should always be stored in volumes anyway!