Copy-on-write filesystems
- Introduction
- What is copy-on-write?
- A few metaphors
- Copy-on-write is everywhere
- Copy-on-write and containers
- AUFS overview
- AUFS operations: opening a file
- AUFS operations: deleting a file
- AUFS performance
- Device Mapper
- Device Mapper principles
- Device Mapper operational details
- Device Mapper performance
- BTRFS principles
- BTRFS in practice with Docker
- BTRFS quirks
- Overlay2
- ZFS
- Which one is the best?
Introduction
Container engines rely on copy-on-write to be able to start containers quickly, regardless of their size.
We will explain how that works, and review some of the copy-on-write storage systems available on Linux.
What is copy-on-write?
Copy-on-write is a mechanism that allows data to be shared.
The data appears to be a copy, but is only a link (or reference) to the original data.
The actual copy happens only when someone tries to change the shared data.
Whoever changes the shared data ends up using their own copy instead of the shared data.
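We can see this at the file level on filesystems with reflink support (e.g. BTRFS, or XFS with reflinks): the "copy" below returns instantly regardless of file size, and blocks are duplicated only when modified.
$ cp --reflink=always bigfile bigfile.copy
$ echo "one small change" >> bigfile.copy
Only the modified blocks of bigfile.copy now use their own disk space.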
A few metaphors
First metaphor:
white board and tracing paper.
Second metaphor:
magic books with shadowy pages.
Third metaphor:
just-in-time house building.
Copy-on-write is everywhere
Process creation with fork().
Consistent disk snapshots.
Efficient VM provisioning.
And, of course, containers.
Copy-on-write and containers
Copy-on-write is essential to give us "convenient" containers.
Creating a new container (from an existing image) is "free".
(Otherwise, we would have to copy the image first.)
Customizing a container (by tweaking a few files) is cheap.
(Adding a 1 KB configuration file to a 1 GB container takes 1 KB, not 1 GB.)
We can take snapshots, i.e. have "checkpoints" or "save points" when building images.
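We can check this with docker ps --size, which shows the size of a container's read-write layer separately from the (shared, "virtual") image size. A quick sketch, using nginx as an arbitrary example image:
$ docker run -d --name web nginx
$ docker ps --size --filter name=web
The SIZE column should show a few kilobytes of container-specific data, next to the full image size marked "virtual".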
AUFS overview
The original (legacy) copy-on-write filesystem used by the first versions of Docker.
It combines multiple branches in a specific order.
Each branch is just a normal directory.
You generally have:
at least one read-only branch (at the bottom),
exactly one read-write branch (at the top).
(But other fun combinations are possible too!)
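For the curious, a manual AUFS mount looks roughly like this (a sketch, assuming the aufs module is loaded and the directories exist):
# mount -t aufs -o br=/tmp/rw=rw:/tmp/ro=ro none /mnt
Both branches appear merged under /mnt; writes land in /tmp/rw.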
AUFS operations: opening a file
With O_RDONLY (read-only access):
look it up in each branch, starting from the top;
open the first one we find.
With O_WRONLY or O_RDWR (write access):
if the file exists on the top branch: open it;
if the file exists on another branch: "copy up"
(i.e. copy the file to the top branch and open the copy);
if the file doesn't exist on any branch: create it on the top branch.
That "copy-up" operation can take a while if the file is big!
AUFS operations: deleting a file
A whiteout file is created.
This is similar to the concept of "tombstones" used in some data systems.
# docker run ubuntu rm /etc/shadow
# ls -la /var/lib/docker/aufs/diff/$(docker ps --no-trunc -lq)/etc
total 8
drwxr-xr-x 2 root root 4096 Jan 27 15:36 .
drwxr-xr-x 5 root root 4096 Jan 27 15:36 ..
-r--r--r-- 2 root root 0 Jan 27 15:36 .wh.shadow
AUFS performance
AUFS mount() is fast, so creation of containers is quick.
Read/write access has native speeds.
But the initial open() is expensive in two scenarios:
when writing big files (log files, databases...),
when searching many directories (PATH, classpath, etc.) over many layers.
Protip: when we built dotCloud, we ended up putting all important data on volumes.
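For instance, with a hypothetical image myapp writing its data under /data, this keeps those writes out of the copy-on-write filesystem entirely:
$ docker run -d -v /data myapp
(The -v /data flag creates an anonymous volume mounted at /data.)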
When starting the same container multiple times:
the data is loaded only once from disk, and cached only once in memory;
but dentries will be duplicated.
Device Mapper
Device Mapper is a rich subsystem with many features.
It can be used for: RAID, encrypted devices, snapshots, and more.
In the context of containers (and Docker in particular), "Device Mapper" means:
"the Device Mapper system + its thin provisioning target"
If you see the abbreviation "thinp", it stands for "thin provisioning".
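To check that the thin provisioning target is available on a machine, we can list the targets known to Device Mapper (assuming dmsetup is installed; the dm-thin-pool module may need to be loaded first):
$ sudo dmsetup targets
The output should include thin and thin-pool.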
Device Mapper principles
Copy-on-write happens on the block level (instead of the file level).
Each container and each image gets its own block device.
At any given time, it is possible to take a snapshot:
of an existing container (to create a frozen image),
of an existing image (to create a container from it).
If a block has never been written to:
it's assumed to be all zeros,
it's not allocated on disk.
(That last property is the reason for the name "thin" provisioning.)
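"Not allocated until written" is the same trick used by sparse files, which makes the concept easy to demo:
$ truncate -s 1G sparse.img
$ ls -lh sparse.img   # apparent size: 1 GB
$ du -h sparse.img    # actual disk usage: 0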
Device Mapper operational details
Two storage areas are needed: one for data, another for metadata.
"data" is also called the "pool"; it's just a big pool of blocks.
(Docker uses the smallest possible block size, 64 KB.)
"metadata" contains the mappings between virtual offsets (in the snapshots) and physical offsets (in the pool).
Each time a new block (or a copy-on-write block) is written, a block is allocated from the pool.
When there are no more blocks in the pool, attempts to write will stall until the pool is extended (or until the write operation is aborted).
In other words: when running out of space, containers are frozen, but operations will resume as soon as space is available.
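If the pool lives on LVM (a common setup), extending it can be as simple as this sketch, which assumes a volume group named docker containing a thin pool named thinpool:
# lvextend -L +10G docker/thinpool
Stalled writes resume as soon as the extra space is available.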
Device Mapper performance
By default, Docker puts data and metadata on a loop device backed by a sparse file.
This is great from a usability point of view, since zero configuration is needed.
But it is terrible from a performance point of view:
- each time a container writes to a new block,
- a block has to be allocated from the pool,
- and when it's written to,
- a block has to be allocated from the sparse file,
- and sparse file performance isn't great anyway.
If you use Device Mapper, make sure to put data (and metadata) on devices!
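For example, assuming an LVM thin pool named docker-thinpool has been created beforehand, the engine can be pointed at it like this:
# dockerd --storage-driver devicemapper \
    --storage-opt dm.thinpooldev=/dev/mapper/docker-thinpool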
BTRFS principles
BTRFS is a filesystem (like EXT4, XFS, NTFS...) with built-in snapshots.
The "copy-on-write" happens at the filesystem level.
BTRFS integrates the snapshot and block pool management features at the filesystem level.
(Instead of at the block level, as with Device Mapper.)
In practice, we create a "subvolume" and later take a "snapshot" of that subvolume.
Imagine mkdir with Super Powers and cp -a with Super Powers.
These operations can be executed with the btrfs CLI tool.
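Concretely (a sketch, assuming /mnt/btrfs is a BTRFS mount point):
# btrfs subvolume create /mnt/btrfs/data
# btrfs subvolume snapshot /mnt/btrfs/data /mnt/btrfs/data-snapshot
The snapshot is itself a writable subvolume, initially sharing all its blocks with the original.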
BTRFS in practice with Docker
Docker can use BTRFS and its snapshotting features to store container images.
The only requirement is that /var/lib/docker is on a BTRFS filesystem.
(Or the directory specified with the --data-root flag when starting the engine.)
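A minimal setup could look like this (assuming a spare block device, here /dev/xvdb):
# mkfs.btrfs /dev/xvdb
# mount /dev/xvdb /var/lib/docker
# dockerd --storage-driver btrfs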
BTRFS quirks
BTRFS works by dividing its storage in chunks.
A chunk can contain data or metadata.
You can run out of chunks (and get "No space left on device") even though df shows space available.
(This happens when all chunks are allocated, even though each one is only partially full.)
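To see how much space is allocated to each chunk type (versus actually used inside them):
# btrfs filesystem df /var/lib/docker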
Quick fix:
# btrfs filesys balance start -dusage=1 /var/lib/docker
Overlay2
Overlay2 is very similar to AUFS.
However, it has been merged into the upstream kernel.
It is therefore available on all modern kernels.
(AUFS was available on Debian and Ubuntu, but required custom kernels on other distros.)
It is simpler than AUFS (it can only have two branches, called "layers").
The container engine abstracts this detail, so this is not a concern.
The overlay2 storage driver generally uses hard links between layers.
This improves stat() and open() performance, at the expense of inode usage.
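A manual overlay mount shows the moving pieces (a sketch; upperdir and workdir must be on the same filesystem, and workdir must be empty):
# mount -t overlay overlay -o lowerdir=/lower,upperdir=/upper,workdir=/work /merged
The engine does the equivalent for each container, stacking image layers as the lower layer.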
ZFS
ZFS is similar to BTRFS (at least from a container user's perspective).
Pros:
- high performance
- high reliability (with e.g. data checksums)
- optional data compression and deduplication
Cons:
- high memory usage
- not in upstream kernel
It is available as a kernel module or through FUSE.
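A typical setup sketch (device and dataset names are placeholders):
# zpool create tank /dev/xvdb
# zfs create -o mountpoint=/var/lib/docker tank/docker
# dockerd --storage-driver zfs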
Which one is the best?
Eventually, overlay2 should be the best option.
It is available on all modern systems.
Its memory usage is better than that of Device Mapper, BTRFS, or ZFS.
The remarks about write performance shouldn't bother you:
data should always be stored in volumes anyway!