Deep dive into container internals

Introduction

In this chapter, we will explain some of the fundamental building blocks of containers.

This will give you a solid foundation so you can:

  • understand "what's going on" in complex situations,

  • anticipate the behavior of containers (performance, security...) in new scenarios,

  • implement your own container engine.

The last item should be done for educational purposes only!

There is no container code in the Linux kernel

  • If we search "container" in the Linux kernel code, we find:

    • generic code to manipulate data structures (like linked lists, etc.),

    • unrelated concepts like "ACPI containers",

    • nothing relevant to "our" containers!

  • Containers are composed using multiple independent features.

  • On Linux, containers rely on "namespaces, cgroups, and some filesystem magic."

  • Security also requires features like capabilities, seccomp, LSMs...

Namespaces

  • Provide processes with their own view of the system.

  • Namespaces limit what you can see (and therefore, what you can use).

  • These namespaces are available in modern kernels:

    • pid
    • net
    • mnt
    • uts
    • ipc
    • user

    (We are going to detail them individually.)

  • Each process belongs to one namespace of each type.

Namespaces are always active

  • Namespaces exist even when you don't use containers.

  • This is a bit similar to the UID field in UNIX processes:

    • all processes have the UID field, even if no user exists on the system

    • the field always has a value / the value is always defined
      (i.e. any process running on the system has some UID)

    • the value of the UID field is used when checking permissions
      (the UID field determines which resources the process can access)

  • You can replace "UID field" with "namespace" above and it still works!

  • In other words: even when you don't use containers,
    there is one namespace of each type, containing all the processes on the system.

Manipulating namespaces

  • Namespaces are created with two methods:

    • the clone() system call (used when creating new threads and processes),

    • the unshare() system call.

  • The Linux tool unshare lets you do that from a shell.

  • A new process can re-use none / all / some of the namespaces of its parent.

  • It is possible to "enter" a namespace with the setns() system call.

  • The Linux tool nsenter lets you do that from a shell.
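
For example (a quick sketch; PID 1234 is hypothetical):

$ sudo unshare --uts hostname changed    # runs one command in a new, short-lived uts namespace
$ hostname                               # the host's hostname is unaffected
$ sudo nsenter --target 1234 --net sh    # starts a shell in the net namespace of process 1234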

Namespaces lifecycle

  • When the last process of a namespace exits, the namespace is destroyed.

  • All the associated resources are then removed.

  • Namespaces are materialized by pseudo-files in /proc/<pid>/ns.

    ls -l /proc/self/ns
    
  • It is possible to compare namespaces by checking these files.

    (This helps to answer the question, "are these two processes in the same namespace?")

  • It is possible to preserve a namespace by bind-mounting its pseudo-file.
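
For instance (a sketch; the PIDs and inode numbers are examples):

$ sudo readlink /proc/1234/ns/net /proc/5678/ns/net
net:[4026531992]
net:[4026531992]

(Identical values mean that the two processes are in the same net namespace.)

$ sudo mkdir -p /var/run/netns
$ sudo touch /var/run/netns/mynetns
$ sudo mount --bind /proc/1234/ns/net /var/run/netns/mynetns

(The bind mount keeps the namespace alive even after PID 1234 exits; this is how ip netns does it.)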

Namespaces can be used independently

  • As mentioned in the previous slides:

    A new process can re-use none / all / some of the namespaces of its parent.

  • We are going to use that property in the examples in the next slides.

  • We are going to present each type of namespace.

  • For each type, we will provide an example using only that namespace.

UTS namespace

  • gethostname / sethostname

  • Allows setting a custom hostname for a container.

  • That's (mostly) it!

  • Also allows setting the NIS domain.

    (If you don't know what a NIS domain is, you don't have to worry about it!)

  • If you're wondering: UTS = UNIX time sharing.

  • This namespace was named like this because of the struct utsname,
    which is commonly used to obtain the machine's hostname, architecture, etc.

    (The more you know!)

Creating our first namespace

Let's use unshare to create a new process that will have its own UTS namespace:

$ sudo unshare --uts
  • We have to use sudo for most unshare operations.

  • We indicate that we want a new uts namespace, and nothing else.

  • If we don't specify a program to run, $SHELL is started.

Demonstrating our uts namespace

In our new "container", check the hostname, change it, and check it:

 # hostname
 nodeX
 # hostname tupperware
 # hostname
 tupperware

In another shell, check that the machine's hostname hasn't changed:

$ hostname
nodeX

Exit the "container" with exit or Ctrl-D.

Net namespace overview

  • Each network namespace has its own private network stack.

  • The network stack includes:

    • network interfaces (including lo),

    • routing tables and policy rules (ip route, ip rule, etc.),

    • iptables chains and rules,

    • sockets (as seen by ss, netstat).

  • You can move a network interface from a network namespace to another:

    ip link set dev eth0 netns PID
    

Net namespace typical use

  • Each container is given its own network namespace.

  • For each network namespace (i.e. each container), a veth pair is created.

    (Two veth interfaces act as if they were connected with a cross-over cable.)

  • One veth is moved to the container network namespace (and renamed eth0).

  • The other veth remains on the host, attached to a bridge (e.g. the docker0 bridge).

Creating a network namespace

Start a new process with its own network namespace:

$ sudo unshare --net

See that this new network namespace is unconfigured:

 # ping 1.1
 connect: Network is unreachable
 # ifconfig
 # ip link ls
 1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group default qlen 1000
     link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00

Creating the veth interfaces

In another shell (on the host), create a veth pair:

$ sudo ip link add name in_host type veth peer name in_netns

Configure the host side (in_host):

$ sudo ip link set in_host master docker0 up

Moving the veth interface

In the process created by unshare, check the PID of our "network container":

 # echo $$
 533

On the host, move the other side (in_netns) to the network namespace:

$ sudo ip link set in_netns netns 533

(Make sure to update "533" with the actual PID obtained above!)

Basic network configuration

Let's set up lo (the loopback interface):

 # ip link set lo up

Activate the veth interface and rename it to eth0:

 # ip link set in_netns name eth0 up

Allocating IP address and default route

On the host, check the address of the Docker bridge:

$ ip addr ls dev docker0

(It could be something like 172.17.0.1.)

Pick an unused IP address in the same subnet, e.g. 172.17.0.99. (Picking one in the middle of the range reduces the risk of collision with addresses allocated by Docker.)

In the process created by unshare, configure the interface:

 # ip addr add 172.17.0.99/24 dev eth0
 # ip route add default via 172.17.0.1

(Make sure to update the IP addresses if necessary.)

Validating the setup

Check that we now have connectivity:

 # ping 1.1

Note: we were able to take a shortcut because Docker is running and provides us with a docker0 bridge and a valid iptables setup.

If Docker is not running, you will need to take care of this!

Cleaning up network namespaces

  • Terminate the process created by unshare (with exit or Ctrl-D).

  • Since this was the only process in the network namespace, it is destroyed.

  • All the interfaces in the network namespace are destroyed.

  • When a veth interface is destroyed, it also destroys the other half of the pair.

  • So we don't have anything else to do to clean up!

Other ways to use network namespaces

  • --net none gives an empty network namespace to a container.

    (Effectively isolating it completely from the network.)

  • --net host means "do not containerize the network".

    (No network namespace is created; the container uses the host network stack.)

  • --net container means "reuse the network namespace of another container".

    (As a result, both containers share the same interfaces, routes, etc.)
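
With the Docker CLI, for example ("web" is a hypothetical container name):

$ docker run --net none alpine ip link ls            # only an isolated lo interface
$ docker run --net host alpine ip link ls            # shows the host's interfaces
$ docker run --net container:web alpine ip addr ls   # shares the namespace of "web"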

Mnt namespace

  • Processes can have their own root fs (à la chroot).

  • Processes can also have "private" mounts. This makes it possible to:

    • isolate /tmp (per user, per service...)

    • mask /proc, /sys (for processes that don't need them)

    • mount remote filesystems or sensitive data,
      but make it visible only for allowed processes

  • Mounts can be totally private, or shared.

  • At this point, there is no easy way to pass along a mount from one namespace to another.

Setting up a private /tmp

Create a new mount namespace:

$ sudo unshare --mount

In that new namespace, mount a brand new /tmp:

 # mount -t tmpfs none /tmp

Check the content of /tmp in the new namespace, and compare to the host.
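
For example (a quick check; the file name is arbitrary):

 # touch /tmp/private-file
 # ls /tmp
 private-file

On the host, ls /tmp will not show private-file.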

The mount is automatically cleaned up when you exit the process.

PID namespace

  • Processes within a PID namespace only "see" processes in the same PID namespace.

  • Each PID namespace has its own numbering (starting at 1).

  • When PID 1 goes away, the whole namespace is killed.

    (When PID 1 goes away on a normal UNIX system, the kernel panics!)

  • Those namespaces can be nested.

  • A process ends up having multiple PIDs (one per namespace in which it is nested).

PID namespace in action

Create a new PID namespace:

$ sudo unshare --pid --fork

(We need the --fork flag because the PID namespace is special.)

Check the process tree in the new namespace:

 # ps faux

🤔 Why do we see all the processes?!?

PID namespaces and /proc

  • Tools like ps rely on the /proc pseudo-filesystem.

  • Our new namespace still has access to the original /proc.

  • Therefore, it still sees host processes.

  • But it cannot affect them.

    (Try to kill a process: you will get No such process.)

PID namespaces, take 2

  • This can be solved by mounting /proc in the namespace.

  • The unshare utility provides a convenience flag, --mount-proc.

  • This flag will mount /proc in the namespace.

  • It will also unshare the mount namespace, so that this mount is local.

Try it:

 $ sudo unshare --pid --fork --mount-proc
 # ps faux

OK, really, why do we need --fork?

It is not necessary to remember all these details.
This is just an illustration of the complexity of namespaces!

The unshare tool calls the unshare syscall, then execs the new binary.
A process calling unshare to create new namespaces is moved to the new namespaces...
... Except for the PID namespace.
(Because this would change the current PID of the process from X to 1.)

The processes created by the new binary are placed into the new PID namespace.
The first one will be PID 1.
If PID 1 exits, it is not possible to create additional processes in the namespace.
(Attempting to do so will result in ENOMEM.)

Without the --fork flag, the first command that we execute will be PID 1 ...
... And once it exits, we cannot create more processes in the namespace!

Check man 2 unshare and man pid_namespaces if you want more details.

IPC namespace

  • Does anybody know about IPC?

  • Does anybody care about IPC?

  • Allows a process (or group of processes) to have its own:

    • IPC semaphores
    • IPC message queues
    • IPC shared memory

    ... without risk of conflict with other instances.

  • Older versions of PostgreSQL cared about this.

No demo for that one.

User namespace

  • Allows UIDs and GIDs to be mapped (see the uid_map example after this list); e.g.:

    • UID 0→1999 in container C1 is mapped to UID 10000→11999 on host
    • UID 0→1999 in container C2 is mapped to UID 12000→13999 on host
    • etc.
  • UID 0 in the container can still perform privileged operations in the container.

    (For instance: setting up network interfaces.)

  • But outside of the container, it is a non-privileged user.

  • It also means that the UID in containers becomes unimportant.

    (Just use UID 0 in the container, since it gets squashed to a non-privileged user outside.)

  • Ultimately enables better privilege separation in container engines.
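
The mappings are exposed in /proc/<pid>/uid_map (and gid_map). For container C1 above, the file would contain something like:

$ cat /proc/<pid>/uid_map
         0      10000       2000

(Each line reads: first UID inside the namespace, first UID outside, length of the mapped range.)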

User namespace challenges

  • UIDs need to be mapped when passed between processes or kernel subsystems.

  • Filesystem permissions and file ownership are more complicated.

    (E.g. when the same root filesystem is shared by multiple containers running with different UIDs.)

  • With the Docker Engine:

    • some feature combinations are not allowed
      (e.g. user namespace + host network namespace sharing)

    • user namespaces need to be enabled/disabled globally
      (when the daemon is started)

    • container images are stored separately
      (so the first time you toggle user namespaces, you need to re-pull images)

No demo for that one.

Control groups

  • Control groups provide resource metering and limiting.

  • This covers a number of "usual suspects" like:

    • memory

    • CPU

    • block I/O

    • network (with cooperation from iptables/tc)

  • And a few exotic ones:

    • huge pages (a special way to allocate memory)

    • RDMA (resources specific to InfiniBand / remote memory transfer)

Crowd control

  • Control groups also make it possible to group processes for special operations:

    • freezer (conceptually similar to a "mass-SIGSTOP/SIGCONT")

    • perf_event (gather performance events on multiple processes)

    • cpuset (limit or pin processes to specific CPUs)

  • There is a "pids" cgroup to limit the number of processes in a given group.

  • There is also a "devices" cgroup to control access to device nodes.

    (i.e. everything in /dev.)

Generalities

  • Cgroups form a hierarchy (a tree).

  • We can create nodes in that hierarchy.

  • We can associate limits to a node.

  • We can move a process (or multiple processes) to a node.

  • The process (or processes) will then respect these limits.

  • We can check the current usage of each node.

  • In other words: limits are optional (if we only want accounting).

  • When a process is created, it is placed in its parent's groups.

Example

The numbers are PIDs.

The names are the names of our nodes (arbitrarily chosen).

cpu                      memory
├── batch                ├── stateless
│   ├── cryptoscam       │   ├── 25
│   │   └── 52           │   ├── 26
│   └── ffmpeg           │   ├── 27
│       ├── 109          │   ├── 52
│       └── 88           │   ├── 109
└── realtime             │   └── 88
    ├── nginx            └── databases
    │   ├── 25               ├── 1008
    │   ├── 26               └── 524
    │   └── 27
    ├── postgres
    │   └── 524
    └── redis
        └── 1008

Cgroups v1 vs v2

  • Cgroups v1 are available on all systems (and widely used).

  • Cgroups v2 are a huge refactor.

    (Development started in Linux 3.10, released in 4.5.)

  • Cgroups v2 have a number of differences:

    • single hierarchy (instead of one tree per controller),

    • processes can only be on leaf nodes (not inner nodes),

    • and of course many improvements / refactorings.
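
To check which version a given system uses (a quick heuristic):

$ stat -fc %T /sys/fs/cgroup
cgroup2fs

(cgroup2fs indicates the v2 unified hierarchy; tmpfs indicates cgroups v1, with one mount per controller below /sys/fs/cgroup.)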

Memory cgroup: accounting

  • Keeps track of pages used by each group:

    • file (read/write/mmap from block devices),
    • anonymous (stack, heap, anonymous mmap),
    • active (recently accessed),
    • inactive (candidate for eviction).
  • Each page is "charged" to a group.

  • Pages can be shared across multiple groups.

    (Example: multiple processes reading from the same files.)

  • To view all the counters kept by this cgroup:

    $ cat /sys/fs/cgroup/memory/memory.stat
    

Memory cgroup: limits

  • Each group can have (optional) hard and soft limits.

  • Limits can be set for different kinds of memory:

    • physical memory,

    • kernel memory,

    • total memory (including swap).

Soft limits and hard limits

  • Soft limits are not enforced.

    (But they influence reclaim under memory pressure.)

  • Hard limits cannot be exceeded:

    • if a group of processes exceeds a hard limit,

    • and if the kernel cannot reclaim any memory,

    • then the OOM (out-of-memory) killer is triggered,

    • and processes are killed until memory gets below the limit again.

Avoiding the OOM killer

  • For some workloads (databases and stateful systems), killing processes because we run out of memory is not acceptable.

  • The "oom-notifier" mechanism helps with that.

  • When "oom-notifier" is enabled and a hard limit is exceeded:

    • all processes in the cgroup are frozen,

    • a notification is sent to user space (instead of killing processes),

    • user space can then raise limits, migrate containers, etc.,

    • once the memory usage is below the hard limit, unfreeze the cgroup.
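
With cgroups v1, this corresponds to the memory.oom_control pseudo-file (a sketch; <group> is a placeholder):

$ cat /sys/fs/cgroup/memory/<group>/memory.oom_control
oom_kill_disable 0
under_oom 0
$ sudo tee /sys/fs/cgroup/memory/<group>/memory.oom_control <<< 1    # freeze instead of killing

(To receive the actual notification, a process registers an eventfd through cgroup.event_control.)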

Overhead of the memory cgroup

  • Each time a process grabs or releases a page, the kernel updates counters.

  • This adds some overhead.

  • Unfortunately, this cannot be enabled/disabled per process.

  • It has to be done system-wide, at boot time.

  • Also, when multiple groups use the same page:

    • only the first group gets "charged",

    • but if it stops using it, the "charge" is moved to another group.

Setting up a limit with the memory cgroup

Create a new memory cgroup:

$ CG=/sys/fs/cgroup/memory/onehundredmegs
$ sudo mkdir $CG

Limit it to approximately 100MB of memory usage (memory+swap):

$ sudo tee $CG/memory.memsw.limit_in_bytes <<< 100000000

(The memory.memsw.* files only exist if swap accounting is enabled; and depending on the kernel version, you may have to set memory.limit_in_bytes to the same value first.)

Move the current process to that cgroup:

$ sudo tee $CG/tasks <<< $$

The current process and all its future children are now limited.

(Confused about <<<? Look at the next slide!)

What's <<<?

  • This is a "here string". (It is a non-POSIX shell extension.)

  • The following commands are equivalent:

    foo <<< hello
    
    echo hello | foo
    
    foo <<EOF
    hello
    EOF
    
  • Why did we use that?

Writing to cgroups pseudo-files requires root

Instead of:

sudo tee $CG/tasks <<< $$

We could have done:

sudo sh -c "echo $$ > $CG/tasks"

The following commands, however, would be invalid:

sudo echo $$ > $CG/tasks

(The redirection is performed by our regular, non-privileged shell, so it fails with "Permission denied".)

sudo -i # (or su)
echo $$ > $CG/tasks

(Here, $$ expands to the PID of the new root shell, so we would move the wrong process to the cgroup.)

Testing the memory limit

Start the Python interpreter:

$ python
Python 3.6.4 (default, Jan  5 2018, 02:35:40)
[GCC 7.2.1 20171224] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>

Allocate 80 megabytes:

>>> s = "!" * 1000000 * 80

Add 20 megabytes more:

>>> t = "!" * 1000000 * 20
Killed

(The process was killed by the OOM killer; the kernel logs the details, visible with dmesg.)

CPU cgroup

  • Keeps track of CPU time used by a group of processes.

    (This is easier and more accurate than getrusage and /proc.)

  • Keeps track of usage per CPU as well.

    (i.e., "this group of process used X seconds of CPU0 and Y seconds of CPU1".)

  • Allows setting relative weights used by the scheduler.
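
With cgroups v1, this maps to the cpu and cpuacct controllers (a sketch; <group> is a placeholder):

$ cat /sys/fs/cgroup/cpuacct/<group>/cpuacct.usage           # total CPU time consumed, in nanoseconds
$ cat /sys/fs/cgroup/cpuacct/<group>/cpuacct.usage_percpu    # same, broken down per CPU
$ sudo tee /sys/fs/cgroup/cpu/<group>/cpu.shares <<< 512     # half of the default relative weight (1024)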

Cpuset cgroup

  • Pin groups to specific CPU(s) (see the sketch after this list).

  • Use-case: reserve CPUs for specific apps.

  • Warning: make sure that "default" processes aren't using all CPUs!

  • CPU pinning can also avoid performance loss due to cache flushes.

  • This is also relevant for NUMA systems.

  • Provides extra dials and knobs.

    (Per zone memory pressure, process migration costs...)
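
A minimal v1 sketch ("reserved" is an arbitrary group name; both cpuset.cpus and cpuset.mems must be set before processes can be attached):

$ CG=/sys/fs/cgroup/cpuset/reserved
$ sudo mkdir $CG
$ sudo tee $CG/cpuset.cpus <<< 2-3    # pin the group to CPUs 2 and 3
$ sudo tee $CG/cpuset.mems <<< 0      # allow memory allocations from NUMA node 0
$ sudo tee $CG/tasks <<< $$           # move the current shell into the group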

Blkio cgroup

  • Keeps track of I/Os for each group:

    • per block device
    • read vs write
    • sync vs async
  • Set throttle (limits) for each group (see the example after this list):

    • per block device
    • read vs write
    • ops vs bytes
  • Set relative weights for each group.

  • Note: most writes go through the page cache.
    (So classic writes will appear to be unthrottled at first.)
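
For example, to set the read throttle mentioned above (a sketch; 8:0 usually designates /dev/sda, check with ls -l):

$ sudo tee /sys/fs/cgroup/blkio/<group>/blkio.throttle.read_bps_device <<< "8:0 1048576"

(This caps the group at 1 MB/s of reads on that device.)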

Net_cls and net_prio cgroup

  • Only works for egress (outgoing) traffic.

  • Automatically set traffic class or priority for traffic generated by processes in the group.

  • Net_cls will assign traffic to a class.

  • Classes have to be matched with tc or iptables (see the sketch below), otherwise traffic just flows normally.

  • Net_prio will assign traffic to a priority.

  • Priorities are used by queuing disciplines.
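
A sketch of the net_cls flow, adapted from the kernel documentation (eth0 and the class IDs are examples):

$ sudo tee /sys/fs/cgroup/net_cls/<group>/net_cls.classid <<< 0x00100001    # i.e. class 10:1
$ sudo tc qdisc add dev eth0 root handle 10: htb
$ sudo tc class add dev eth0 parent 10: classid 10:1 htb rate 40mbit
$ sudo tc filter add dev eth0 parent 10: protocol ip prio 10 handle 1: cgroup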

Devices cgroup

  • Controls what the group can do on device nodes.

  • Permissions include read/write/mknod.

  • Typical use (sketched after this list):

    • allow /dev/{tty,zero,random,null} ...
    • deny everything else
  • A few interesting nodes:

    • /dev/net/tun (network interface manipulation)
    • /dev/fuse (filesystems in user space)
    • /dev/kvm (VMs in containers, yay inception!)
    • /dev/dri (GPU)
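
The typical allow-list above would look like this with cgroups v1 (a sketch; c 1:3 is /dev/null, c 1:5 is /dev/zero):

$ sudo tee /sys/fs/cgroup/devices/<group>/devices.deny <<< a                # deny everything...
$ sudo tee /sys/fs/cgroup/devices/<group>/devices.allow <<< "c 1:3 rwm"     # ...then allow /dev/null
$ sudo tee /sys/fs/cgroup/devices/<group>/devices.allow <<< "c 1:5 rwm"     # ...and /dev/zero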

Security features

  • Namespaces and cgroups are not enough to ensure strong security.

  • We need extra mechanisms: capabilities, seccomp, LSMs.

  • These mechanisms were already used before containers to harden security.

  • They can be used together with containers.

  • Good container engines will automatically leverage these features.

    (So that you don't have to worry about it.)

Capabilities

  • In traditional UNIX, many operations are possible if and only if UID=0 (root).

  • Some of these operations are very powerful:

    • changing file ownership, accessing all files ...
  • Some of these operations deal with system configuration, but can be abused:

    • setting up network interfaces, mounting filesystems ...
  • Some of these operations are not very dangerous but are needed by servers:

    • binding to a port below 1024.
  • Capabilities are per-process flags to allow these operations individually.

Some capabilities

  • CAP_CHOWN: arbitrarily change file ownership (UIDs and GIDs).

  • CAP_DAC_OVERRIDE: bypass file read, write, and execute permission checks.

  • CAP_NET_ADMIN: configure network interfaces, iptables rules, etc.

  • CAP_NET_BIND_SERVICE: bind a port below 1024.

See man capabilities for the full list and details.
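
Capabilities can also be attached to executable files. For instance, to let a program bind to port 80 without running as root (a sketch; the binary path is hypothetical):

$ sudo setcap cap_net_bind_service=+ep /usr/local/bin/myserver
$ getcap /usr/local/bin/myserver
/usr/local/bin/myserver = cap_net_bind_service+ep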

Using capabilities

  • Container engines will typically drop all "dangerous" capabilities.

  • You can then re-enable capabilities on a per-container basis, as needed.

  • With the Docker engine: docker run --cap-add ...

  • If you write your own code to manage capabilities:

    • make sure that you understand what each capability does,

    • read about ambient capabilities as well.

Seccomp

  • Seccomp is secure computing.

  • Achieves a high level of security by drastically restricting the available syscalls.

  • Original seccomp only allows read(), write(), exit(), sigreturn().

  • The seccomp-bpf extension makes it possible to specify custom filters with BPF rules.

  • This allows filtering by syscall and by parameter.

  • BPF code can perform arbitrarily complex checks, quickly, and safely.

  • Container engines take care of this so you don't have to.
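
With the Docker engine, for instance, the default seccomp profile can be replaced or disabled (the profile path is an example):

$ docker run --security-opt seccomp=./myprofile.json ...
$ docker run --security-opt seccomp=unconfined ...    # disables seccomp filtering; not recommended!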

Linux Security Modules

  • The most popular ones are SELinux and AppArmor.

  • Red Hat distros generally use SELinux.

  • Debian-based distros (in particular, Ubuntu) generally use AppArmor.

  • LSMs add a layer of access control to all process operations.

  • Container engines take care of this so you don't have to.
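
To check what is active on a given machine (available commands vary by distro):

$ getenforce                      # SELinux: Enforcing, Permissive, or Disabled
$ sudo aa-status                  # AppArmor: lists the loaded profiles
$ cat /sys/kernel/security/lsm    # comma-separated list of active LSMs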