Copy-on-write filesystems

Introduction

Container engines rely on copy-on-write to be able to start containers quickly, regardless of their size.

We will explain how that works, and review some of the copy-on-write storage systems available on Linux.

What is copy-on-write?

• Copy-on-write is a mechanism that lets multiple users share the same data.

• The data appears to be a copy, but is only a link (or reference) to the original data.

• The actual copy happens only when someone tries to change the shared data.

• Whoever changes the shared data ends up using their own copy instead of the shared data.
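
The idea can be sketched in a few lines of Python (a toy illustration, not how any real filesystem implements it; the CowBuffer class is made up for this example):

```python
# Toy copy-on-write wrapper: readers share one buffer;
# the actual copy happens only on the first write.
class CowBuffer:
    def __init__(self, data):
        self._shared = data      # reference to the shared data
        self._own = None         # private copy, created lazily

    def read(self, i):
        src = self._own if self._own is not None else self._shared
        return src[i]

    def write(self, i, value):
        if self._own is None:            # first write: copy now
            self._own = list(self._shared)
        self._own[i] = value

original = [0, 1, 2]
a = CowBuffer(original)
b = CowBuffer(original)   # "copying" is free: both share `original`
b.write(0, 42)            # b now pays for its own private copy
print(a.read(0), b.read(0), original[0])  # 0 42 0
```

Note that `a` never paid any copy cost: only the writer did.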

A few metaphors

• First metaphor:
white board and tracing paper

• Second metaphor:

• Third metaphor:
just-in-time house building

Copy-on-write is everywhere

• Process creation with fork().

• Consistent disk snapshots.

• Efficient VM provisioning.

• And, of course, containers.
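
The fork() case can be observed directly: after fork(), parent and child share the same memory pages, and the kernel copies a page only when one side writes to it. A small sketch (Linux-only, since it uses os.fork()):

```python
# After fork(), parent and child share the same memory pages;
# the kernel copies a page only when one of them writes to it.
# Observable effect: the child's writes never reach the parent.
import os

def fork_demo():
    data = [0] * 1000
    pid = os.fork()
    if pid == 0:               # child
        data[0] = 42           # triggers copy-on-write of that page
        os._exit(0)
    os.waitpid(pid, 0)         # parent
    return data[0]             # still 0: the parent kept its version

print(fork_demo())  # 0
```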

Copy-on-write and containers

Copy-on-write is essential to give us "convenient" containers.

• Creating a new container (from an existing image) is "free".

(Otherwise, we would have to copy the image first.)

• Customizing a container (by tweaking a few files) is cheap.

(Adding a 1 KB configuration file to a 1 GB container takes 1 KB, not 1 GB.)

• We can take snapshots, i.e. have "checkpoints" or "save points" when building images.
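
The arithmetic is worth spelling out (a back-of-the-envelope sketch, using the numbers from the example above):

```python
# Disk usage of n containers started from one image,
# with and without copy-on-write.
IMAGE = 1_000_000_000   # 1 GB image
TWEAK = 1_000           # each container adds a 1 KB config file

def naive(n):           # full copy of the image per container
    return n * (IMAGE + TWEAK)

def cow(n):             # one shared image + per-container deltas
    return IMAGE + n * TWEAK

print(naive(100))  # ~100 GB
print(cow(100))    # ~1 GB
```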

AUFS overview

• The original (legacy) copy-on-write filesystem, used by the first versions of Docker.

• It combines multiple branches in a specific order.

• Each branch is just a normal directory.

• You generally have:

• at least one read-only branch (at the bottom),

• exactly one read-write branch (at the top).

(But other fun combinations are possible too!)

AUFS operations: opening a file

• With O_RDONLY - read-only access:

• look it up in each branch, starting from the top

• open the first one we find

• With O_WRONLY or O_RDWR - write access:

• if the file exists on the top branch: open it

• if the file exists on another branch: "copy up"
(i.e. copy the file to the top branch and open the copy)

• if the file doesn't exist on any branch: create it on the top branch

That "copy-up" operation can take a while if the file is big!
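
The lookup rules above can be sketched as a toy model (branches are plain dicts here; real AUFS of course operates on directories and inodes):

```python
# Toy model of union-filesystem open() logic, as described above.
# branches[0] is the top (read-write) branch.
def lookup(branches, name):
    for i, branch in enumerate(branches):
        if name in branch:
            return i                        # first match, top-down
    return None

def open_file(branches, name, mode):
    i = lookup(branches, name)
    if mode == "r":
        if i is None:
            raise FileNotFoundError(name)
        return branches[i][name]            # open the first one we find
    # write access:
    if i is None:
        branches[0][name] = ""              # create on the top branch
    elif i > 0:
        branches[0][name] = branches[i][name]  # "copy up" to the top
    return branches[0][name]

top = {}                                    # read-write branch
bottom = {"/etc/shadow": "root:*:..."}      # read-only branch
open_file([top, bottom], "/etc/shadow", "w")
print("/etc/shadow" in top)  # True: the file was copied up
```

The cost of the copy-up step is proportional to the file size, which is exactly why big writable files hurt.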

AUFS operations: deleting a file

• A whiteout file is created.

• This is similar to the concept of "tombstones" used in some data systems.

# docker run ubuntu rm /etc/shadow

# ls -la /var/lib/docker/aufs/diff/$(docker ps --no-trunc -lq)/etc
total 8
drwxr-xr-x 2 root root 4096 Jan 27 15:36 .
drwxr-xr-x 5 root root 4096 Jan 27 15:36 ..
-r--r--r-- 2 root root    0 Jan 27 15:36 .wh.shadow
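
A toy model of whiteouts, continuing the dict-based sketch from before (the .wh. prefix matches what AUFS uses; everything else is made up for illustration):

```python
# Deleting a file that lives in a lower branch creates a
# ".wh.<name>" marker (a tombstone) in the top branch.
import os.path

def whiteout(name):
    d, f = os.path.split(name)
    return os.path.join(d, ".wh." + f)

def delete(branches, name):
    branches[0].pop(name, None)        # remove any copy on top
    branches[0][whiteout(name)] = ""   # hide the lower copies

def exists(branches, name):
    for branch in branches:
        if whiteout(name) in branch:
            return False               # tombstone: stop looking
        if name in branch:
            return True
    return False

top = {}
bottom = {"/etc/shadow": "..."}
delete([top, bottom], "/etc/shadow")
print(exists([top, bottom], "/etc/shadow"))   # False
print(whiteout("/etc/shadow"))                # /etc/.wh.shadow
```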


AUFS performance

• AUFS mount() is fast, so creation of containers is quick.

• Read/write access has native speeds.

• But initial open() is expensive in two scenarios:

• when writing big files (log files, databases ...),

• when searching many directories (PATH, classpath, etc.) over many layers.

• Protip: when we built dotCloud, we ended up putting all important data on volumes.

• When starting the same container multiple times:

• the data is loaded only once from disk, and cached only once in memory;

• but dentries will be duplicated.

Device Mapper

Device Mapper is a rich subsystem with many features.

It can be used for: RAID, encrypted devices, snapshots, and more.

In the context of containers (and Docker in particular), "Device Mapper" means:

"the Device Mapper system + its thin provisioning target"

If you see the abbreviation "thinp", it stands for "thin provisioning".

Device Mapper principles

• Copy-on-write happens on the block level (instead of the file level).

• Each container and each image gets its own block device.

• At any given time, it is possible to take a snapshot:

• of an existing container (to create a frozen image),

• of an existing image (to create a container from it).

• If a block has never been written to:

• it's assumed to be all zeros,

• it's not allocated on disk.

(That last property is the reason for the name "thin" provisioning.)

Device Mapper operational details

• Two storage areas are needed: one for data, another for metadata.

• "data" is also called the "pool"; it's just a big pool of blocks.

(Docker uses the smallest possible block size, 64 KB.)

• "metadata" contains the mappings between virtual offsets (in the snapshots) and physical offsets (in the pool).

• Each time a new block (or a copy-on-write block) is written, a block is allocated from the pool.

• When there are no more blocks in the pool, attempts to write will stall until the pool is increased (or the write operation is aborted).

• In other words: when running out of space, containers are frozen, but operations will resume as soon as space is available.
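
The mapping and allocation logic can be sketched as a toy model (ThinDevice and PhysBlock are invented names for this sketch; the real target also journals its metadata, which is skipped here):

```python
# Toy model of the thin-provisioning target described above:
# metadata maps virtual blocks to pool blocks, allocation happens
# on first write, and unwritten blocks read back as zeros.
BLOCK = 64 * 1024                  # 64 KB blocks, as in Docker's setup

class PhysBlock:                   # one block in the shared pool
    def __init__(self):
        self.data = b"\0" * BLOCK

class ThinDevice:
    def __init__(self, pool):
        self.pool = pool           # shared free list ("the pool")
        self.mapping = {}          # metadata: virtual -> physical block

    def read(self, vblock):
        if vblock not in self.mapping:
            return b"\0" * BLOCK   # never written: assumed all zeros
        return self.mapping[vblock].data

    def write(self, vblock, data):
        if vblock not in self.mapping:
            if not self.pool:      # real target: writes stall instead
                raise BlockingIOError("pool exhausted")
            self.mapping[vblock] = self.pool.pop()
        self.mapping[vblock].data = data

pool = [PhysBlock(), PhysBlock()]  # a deliberately tiny pool
dev = ThinDevice(pool)
print(dev.read(7) == b"\0" * BLOCK)   # True: reads need no allocation
dev.write(0, b"x" * BLOCK)
dev.write(1, b"y" * BLOCK)            # pool is now empty
try:
    dev.write(2, b"z" * BLOCK)
except BlockingIOError as e:
    print(e)                          # pool exhausted
```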

Device Mapper performance

• By default, Docker puts data and metadata on a loop device backed by a sparse file.

• This is great from a usability point of view, since zero configuration is needed.

• But it is terrible from a performance point of view:

• each time a container writes to a new block,

• a block has to be allocated from the pool,

• and when it's written to,

• a block has to be allocated from the sparse file,

• and sparse file performance isn't great anyway.

• If you use Device Mapper, make sure to put data (and metadata) on real block devices!

BTRFS principles

• BTRFS is a filesystem (like EXT4, XFS, NTFS...) with built-in snapshots.

• The "copy-on-write" happens at the filesystem level.

• BTRFS integrates the snapshot and block pool management features at the filesystem level.

(Instead of the block level for Device Mapper.)

• In practice, we create a "subvolume" and later take a "snapshot" of that subvolume.

Imagine: mkdir with Super Powers and cp -a with Super Powers.

• These operations can be executed with the btrfs CLI tool.

BTRFS in practice with Docker

• Docker can use BTRFS and its snapshotting features to store container images.

• The only requirement is that /var/lib/docker is on a BTRFS filesystem.

(Or, the directory specified with the --data-root flag when starting the engine.)

BTRFS quirks

• BTRFS works by dividing its storage into chunks.

• A chunk can contain data or metadata.

• You can run out of chunks (and get "No space left on device") even though df shows space available.

(Because chunks are only partially allocated.)

• Quick fix:

 # btrfs filesystem balance start -dusage=1 /var/lib/docker
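
A toy model of that failure mode (Volume and its methods are invented for this sketch; real chunk sizes are much larger and the balance operation is far more involved):

```python
# df counts free bytes inside allocated chunks, but a new write
# may need a brand-new chunk; balance compacts partially-used chunks.
CHUNK = 1024                      # bytes per chunk (toy value)

class Volume:
    def __init__(self, total_chunks):
        self.total = total_chunks
        self.chunks = []          # used bytes per allocated chunk

    def df_free(self):            # free space as df would report it
        inside = sum(CHUNK - used for used in self.chunks)
        outside = (self.total - len(self.chunks)) * CHUNK
        return inside + outside

    def alloc_chunk(self, used):
        if len(self.chunks) >= self.total:
            raise OSError("No space left on device")
        self.chunks.append(used)

    def balance(self):            # compact partially-used chunks
        used = sum(self.chunks)
        self.chunks = [CHUNK] * (used // CHUNK)
        if used % CHUNK:
            self.chunks.append(used % CHUNK)

vol = Volume(total_chunks=4)
for _ in range(4):
    vol.alloc_chunk(used=256)     # four chunks, each only 25% full
print(vol.df_free())              # 3072: df still shows free space...
try:
    vol.alloc_chunk(used=256)     # ...but no new chunk fits
except OSError as e:
    print(e)
vol.balance()                     # rough equivalent of: btrfs balance
vol.alloc_chunk(used=256)         # now it works
```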


Overlay2

• Overlay2 is very similar to AUFS.

• However, it has been merged into the "upstream" kernel.

• It is therefore available on all modern kernels.

(AUFS was available on Debian and Ubuntu, but required custom kernels on other distros.)

• It is simpler than AUFS: a mount combines a single read-write "upper" layer with one or more read-only "lower" layers.

• The container engine abstracts this detail, so this is not a concern.

• Overlay2 storage drivers generally use hard links between layers.

• This improves stat() and open() performance, at the expense of inode usage.

ZFS

• ZFS is similar to BTRFS (at least from a container user's perspective).

• Pros:

• high performance
• high reliability (with e.g. data checksums)
• optional data compression and deduplication
• Cons:

• high memory usage
• not in upstream kernel
• It is available as a kernel module or through FUSE.

Which one is the best?

• Eventually, overlay2 should be the best option.

• It is available on all modern systems.

• Its memory usage is better than that of Device Mapper, BTRFS, or ZFS.

• The remarks about write performance shouldn't bother you:
data should always be stored in volumes anyway!