Copy-on-write filesystems

Introduction

Container engines rely on copy-on-write to start containers quickly, regardless of the size of their images.

We will explain how that works, and review some of the copy-on-write storage systems available on Linux.

What is copy-on-write?

  • Copy-on-write is a mechanism that lets us share data.

  • The data appears to be a copy, but is only a link (or reference) to the original data.

  • The actual copy happens only when someone tries to change the shared data.

  • Whoever changes the shared data ends up using their own copy instead of the shared data.
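
A quick hands-on illustration: on a filesystem with reflink support (e.g. BTRFS or XFS), GNU cp can create copy-on-write copies. The file names below are just placeholders.

 $ cp --reflink=always bigfile bigfile.clone   # instant: the clone shares the original's blocks
 $ echo "small change" >> bigfile.clone        # only the modified blocks get copied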

A few metaphors

  • First metaphor:
    white board and tracing paper

  • Second metaphor:
    magic books with shadowy pages

  • Third metaphor:
    just-in-time house building

Copy-on-write is everywhere

  • Process creation with fork().

  • Consistent disk snapshots.

  • Efficient VM provisioning.

  • And, of course, containers.
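
For instance, efficient VM provisioning is often done with QCOW2 backing files: a new disk image referencing a base image is created instantly, and only stores the blocks that differ from it. (File names below are placeholders; on older QEMU versions, the -F flag can be omitted.)

 $ qemu-img create -f qcow2 -b base.qcow2 -F qcow2 vm1.qcow2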

Copy-on-write and containers

Copy-on-write is essential to give us "convenient" containers.

  • Creating a new container (from an existing image) is "free".

    (Otherwise, we would have to copy the image first.)

  • Customizing a container (by tweaking a few files) is cheap.

    (Adding a 1 KB configuration file to a 1 GB container takes 1 KB, not 1 GB.)

  • We can take snapshots, i.e. have "checkpoints" or "save points" when building images.
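
We can see that top writable layer in action with the docker CLI (names below are just examples):

 $ docker run --name tweaked ubuntu sh -c 'echo hello > /etc/motd'
 $ docker diff tweaked                      # lists only the files changed in the container's own layer
 $ docker commit tweaked tweaked:snapshot   # turns that layer into a new image (a "save point")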

AUFS overview

  • The original (legacy) copy-on-write filesystem used by the first versions of Docker.

  • It combines multiple branches in a specific order.

  • Each branch is just a normal directory.

  • You generally have:

    • at least one read-only branch (at the bottom),

    • exactly one read-write branch (at the top).

    (But other fun combinations are possible too!)
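
Outside of Docker, an AUFS mount can be assembled by hand, assuming a kernel with AUFS support (the directories below are placeholders):

 # mkdir /tmp/ro /tmp/rw /tmp/union
 # mount -t aufs -o br=/tmp/rw:/tmp/ro none /tmp/union

The first branch (/tmp/rw) is writable; the following ones default to read-only.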

AUFS operations: opening a file

  • With O_RDONLY - read-only access:

    • look it up in each branch, starting from the top

    • open the first one we find

  • With O_WRONLY or O_RDWR - write access:

    • if the file exists on the top branch: open it

    • if the file exists on another branch: "copy up"
      (i.e. copy the file to the top branch and open the copy)

    • if the file doesn't exist on any branch: create it on the top branch

That "copy-up" operation can take a while if the file is big!

AUFS operations: deleting a file

  • A whiteout file is created.

  • This is similar to the concept of "tombstones" used in some data systems.

 # docker run ubuntu rm /etc/shadow

 # ls -la /var/lib/docker/aufs/diff/$(docker ps --no-trunc -lq)/etc
 total 8
 drwxr-xr-x 2 root root 4096 Jan 27 15:36 .
 drwxr-xr-x 5 root root 4096 Jan 27 15:36 ..
 -r--r--r-- 2 root root    0 Jan 27 15:36 .wh.shadow

AUFS performance

  • AUFS mount() is fast, so creation of containers is quick.

  • Read/write access happens at native speed.

  • But initial open() is expensive in two scenarios:

    • when writing big files (log files, databases ...),

    • when searching many directories (PATH, classpath, etc.) over many layers.

  • Protip: when we built dotCloud, we ended up putting all important data on volumes (see the example after this list).

  • When starting the same container multiple times:

    • the data is loaded only once from disk, and cached only once in memory;

    • but dentries will be duplicated.
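
For instance (following up on the volumes protip above), a volume keeps a data directory out of the copy-on-write filesystem entirely (the path is just an example):

 $ docker run -ti -v /data ubuntu

Everything written under /data bypasses AUFS and gets native I/O performance.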

Device Mapper

Device Mapper is a rich subsystem with many features.

It can be used for: RAID, encrypted devices, snapshots, and more.

In the context of containers (and Docker in particular), "Device Mapper" means:

"the Device Mapper system + its thin provisioning target"

If you see the abbreviation "thinp", it stands for "thin provisioning".
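
To check whether the thin provisioning target is available on a given machine (it usually lives in the dm_thin_pool kernel module):

 # dmsetup targets | grep thin
 # modprobe dm_thin_pool      # load the module if the line above prints nothing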

Device Mapper principles

  • Copy-on-write happens on the block level (instead of the file level).

  • Each container and each image gets its own block device.

  • At any given time, it is possible to take a snapshot:

    • of an existing container (to create a frozen image),

    • of an existing image (to create a container from it).

  • If a block has never been written to:

    • it's assumed to be all zeros,

    • it's not allocated on disk.

(That last property is the reason for the name "thin" provisioning.)
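
As a rough sketch of what happens under the hood (the engine uses the devicemapper library rather than the CLI, and all names, devices, and sizes below are placeholders), a thin pool ties together a data device, a metadata device, a block size (in 512-byte sectors: 128 = 64 KB), and a low water mark:

 # dmsetup create docker-pool --table \
     "0 20971520 thin-pool /dev/vg0/metadata /dev/vg0/data 128 32768"
 # dmsetup message /dev/mapper/docker-pool 0 "create_thin 0"    # allocate thin device #0 in the pool
 # dmsetup create thin0 --table "0 2097152 thin /dev/mapper/docker-pool 0"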

Device Mapper operational details

  • Two storage areas are needed: one for data, another for metadata.

  • "data" is also called the "pool"; it's just a big pool of blocks.

    (Docker uses the smallest possible block size, 64 KB.)

  • "metadata" contains the mappings between virtual offsets (in the snapshots) and physical offsets (in the pool).

  • Each time a new block (or a copy-on-write block) is written, a block is allocated from the pool.

  • When there are no more blocks in the pool, attempts to write will stall until the pool is increased (or the write operation aborted).

  • In other words: when running out of space, containers are frozen, but operations will resume as soon as space is available.
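
The fill level of the pool can be checked at any time with dmsetup (the pool name is a placeholder); the status line reports used/total metadata blocks and used/total data blocks:

 # dmsetup status docker-pool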

Device Mapper performance

  • By default, Docker puts data and metadata on a loop device backed by a sparse file.

  • This is great from a usability point of view, since zero configuration is needed.

  • But it is terrible from a performance point of view:

    • each time a container writes to a new block,
    • a block has to be allocated from the pool,
    • and when it's written to,
    • a block has to be allocated from the sparse file,
    • and sparse file performance isn't great anyway.

  • If you use Device Mapper, make sure to put data (and metadata) on devices!
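
For example, with a pre-created LVM thin pool (the pool name below is just an example), the engine can be pointed straight at it:

 # dockerd --storage-driver devicemapper \
           --storage-opt dm.thinpooldev=/dev/mapper/docker-thinpool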

BTRFS principles

  • BTRFS is a filesystem (like EXT4, XFS, NTFS...) with built-in snapshots.

  • The "copy-on-write" happens at the filesystem level.

  • BTRFS integrates the snapshot and block pool management features at the filesystem level.

    (Instead of at the block level, as Device Mapper does.)

  • In practice, we create a "subvolume" and later take a "snapshot" of that subvolume.

    Imagine: mkdir with Super Powers and cp -a with Super Powers.

  • These operations can be executed with the btrfs CLI tool.
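
For example, assuming /mnt/btrfs is a mounted BTRFS filesystem:

 # btrfs subvolume create /mnt/btrfs/demo                          # the "mkdir with Super Powers"
 # btrfs subvolume snapshot /mnt/btrfs/demo /mnt/btrfs/demo-snap   # the "cp -a with Super Powers"

Both operations are nearly instantaneous, regardless of how much data the subvolume contains.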

BTRFS in practice with Docker

  • Docker can use BTRFS and its snapshotting features to store container images.

  • The only requirement is that /var/lib/docker is on a BTRFS filesystem.

    (Or, the directory specified with the --data-root flag when starting the engine.)
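
To check that the requirement is met, and that the engine picked the right driver:

 $ df -T /var/lib/docker            # the Type column should show "btrfs"
 $ docker info | grep -i "storage driver"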

BTRFS quirks

  • BTRFS works by dividing its storage in chunks.

  • A chunk can contain data or metadata.

  • You can run out of chunks (and get No space left on device) even though df shows space available.

    (Because all the space has already been allocated to chunks, even if those chunks are only partially filled.)

  • Quick fix:

 # btrfs filesys balance start -dusage=1 /var/lib/docker
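
To see how much space is allocated to (and actually used within) each kind of chunk:

 # btrfs filesystem df /var/lib/docker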

Overlay2

  • Overlay2 is very similar to AUFS.

  • However, it has been merged into the "upstream" kernel.

  • It is therefore available on all modern kernels.

    (AUFS was available on Debian and Ubuntu, but required custom kernels on other distros.)

  • It is simpler than AUFS (it can only have two branches, called "layers").

  • The container engine abstracts this detail, so this is not a concern.

  • Overlay2 storage drivers generally use hard links between layers.

  • This improves stat() and open() performance, at the expense of inode usage.
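
Under the hood, this corresponds to an ordinary overlay mount (the directories below are placeholders and must exist beforehand):

 # mount -t overlay overlay \
     -o lowerdir=/lower,upperdir=/upper,workdir=/work /merged

Everything in /lower stays read-only; all changes land in /upper.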

ZFS

  • ZFS is similar to BTRFS (at least from a container user's perspective).

  • Pros:

    • high performance
    • high reliability (with e.g. data checksums)
    • optional data compression and deduplication

  • Cons:

    • high memory usage
    • not in upstream kernel

  • It is available as a kernel module or through FUSE.
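
The snapshot/clone workflow looks like this (assuming an existing pool named tank; all names are placeholders):

 # zfs create tank/demo
 # zfs snapshot tank/demo@base            # read-only snapshot (a "frozen image")
 # zfs clone tank/demo@base tank/demo1    # writable copy-on-write clone (a "container")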

Which one is the best?

  • Eventually, overlay2 should be the best option.

  • It is available on all modern systems.

  • Its memory usage is better than that of Device Mapper, BTRFS, or ZFS.

  • The remarks about write performance shouldn't bother you:
    data should always be stored in volumes anyway!