Deep dive into container internals
- Introduction
- There is no container code in the Linux kernel
- Namespaces
- Namespaces are always active
- Manipulating namespaces
- Namespaces lifecycle
- Namespaces can be used independently
- UTS namespace
- Creating our first namespace
- Demonstrating our uts namespace
- Net namespace overview
- Net namespace typical use
- Creating a network namespace
- Creating the veth interfaces
- Moving the veth interface
- Basic network configuration
- Allocating IP address and default route
- Validating the setup
- Cleaning up network namespaces
- Other ways to use network namespaces
- Mnt namespace
- Setting up a private /tmp
- PID namespace
- PID namespace in action
- PID namespaces and /proc
- PID namespaces, take 2
- OK, really, why do we need --fork?
- IPC namespace
- User namespace
- User namespace challenges
- Control groups
- Crowd control
- Generalities
- Example
- Cgroups v1 vs v2
- Memory cgroup: accounting
- Memory cgroup: limits
- Soft limits and hard limits
- Avoiding the OOM killer
- Overhead of the memory cgroup
- Setting up a limit with the memory cgroup
- What's <<<?
- Writing to cgroups pseudo-files requires root
- Testing the memory limit
- CPU cgroup
- Cpuset cgroup
- Blkio cgroup
- Net_cls and net_prio cgroup
- Devices cgroup
- Security features
- Capabilities
- Some capabilities
- Using capabilities
- Seccomp
- Linux Security Modules
Introduction
In this chapter, we will explain some of the fundamental building blocks of containers.
This will give you a solid foundation so you can:
understand "what's going on" in complex situations,
anticipate the behavior of containers (performance, security...) in new scenarios,
implement your own container engine.
The last item should be done for educational purposes only!
There is no container code in the Linux kernel
If we search for "container" in the Linux kernel code, we find:
generic code to manipulate data structures (like linked lists, etc.),
unrelated concepts like "ACPI containers",
nothing relevant to "our" containers!
Containers are composed using multiple independent features.
On Linux, containers rely on "namespaces, cgroups, and some filesystem magic."
Security also requires features like capabilities, seccomp, LSMs...
Namespaces
Provide processes with their own view of the system.
Namespaces limit what you can see (and therefore, what you can use).
These namespaces are available in modern kernels:
- pid
- net
- mnt
- uts
- ipc
- user
(We are going to detail them individually.)
Each process belongs to one namespace of each type.
Namespaces are always active
Namespaces exist even when you don't use containers.
This is a bit similar to the UID field in UNIX processes:
all processes have the UID field, even if no user exists on the system,
the field always has a value / the value is always defined
(i.e. any process running on the system has some UID),
the value of the UID field is used when checking permissions
(the UID field determines which resources the process can access).
You can replace "UID field" with "namespace" above and it still works!
In other words: even when you don't use containers,
there is one namespace of each type, containing all the processes on the system.
Manipulating namespaces
Namespaces are created with two methods:
the clone() system call (used when creating new threads and processes),
the unshare() system call.
The Linux tool unshare lets us do that from a shell.
A new process can re-use none / all / some of the namespaces of its parent.
It is possible to "enter" a namespace with the setns() system call.
The Linux tool nsenter lets us do that from a shell.
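For instance, here is a hypothetical two-terminal walkthrough combining both tools (the PID shown is made up; use the one you observe):
In the first terminal, create a shell in a new UTS namespace, and note its PID:
$ sudo unshare --uts
# echo $$
1234
In the second terminal, enter that namespace and run a command there:
$ sudo nsenter --uts --target 1234 hostname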
Namespaces lifecycle
When the last process of a namespace exits, the namespace is destroyed.
All the associated resources are then removed.
Namespaces are materialized by pseudo-files in /proc/<pid>/ns.
$ ls -l /proc/self/ns
It is possible to compare namespaces by checking these files.
(This helps to answer the question, "are these two processes in the same namespace?")
It is possible to preserve a namespace by bind-mounting its pseudo-file.
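For instance, comparing the symlink targets tells us if two processes share a namespace (the inode numbers below are illustrative):
$ readlink /proc/self/ns/uts /proc/1/ns/uts
uts:[4026531838]
uts:[4026531838]
And a bind mount can keep a namespace alive after its last process exits (hypothetical PID and path):
$ sudo touch /var/run/mynetns
$ sudo mount --bind /proc/1234/ns/net /var/run/mynetns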
Namespaces can be used independently
As mentioned in the previous slides:
A new process can re-use none / all / some of the namespaces of its parent.
We are going to use that property in the examples in the next slides.
We are going to present each type of namespace.
For each type, we will provide an example using only that namespace.
UTS namespace
gethostname / sethostname
Allows setting a custom hostname for a container.
That's (mostly) it!
Also allows setting the NIS domain.
(If you don't know what a NIS domain is, you don't have to worry about it!)
If you're wondering: UTS = UNIX time sharing.
This namespace was named after struct utsname,
which is commonly used to obtain the machine's hostname, architecture, etc.
(The more you know!)
Creating our first namespace
Let's use unshare to create a new process that will have its own UTS namespace:
$ sudo unshare --uts
We have to use sudo for most unshare operations.
We indicate that we want a new uts namespace, and nothing else.
If we don't specify a program to run, a $SHELL is started.
Demonstrating our uts namespace
In our new "container", check the hostname, change it, and check it:
# hostname
nodeX
# hostname tupperware
# hostname
tupperware
In another shell, check that the machine's hostname hasn't changed:
$ hostname
nodeX
Exit the "container" with exit
or Ctrl-D
.
Net namespace overview
Each network namespace has its own private network stack.
The network stack includes:
network interfaces (including lo),
routing tables (as in ip rule etc.),
iptables chains and rules,
sockets (as seen by ss, netstat).
You can move a network interface from one network namespace to another:
ip link set dev eth0 netns PID
Net namespace typical use
Each container is given its own network namespace.
For each network namespace (i.e. each container), a veth pair is created.
(Two veth interfaces act as if they were connected with a cross-over cable.)
One veth is moved to the container network namespace (and renamed eth0).
The other veth is moved to a bridge on the host (e.g. the docker0 bridge).
Creating a network namespace
Start a new process with its own network namespace:
$ sudo unshare --net
See that this new network namespace is unconfigured:
# ping 1.1
connect: Network is unreachable
# ifconfig
# ip link ls
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
Creating the veth interfaces
In another shell (on the host), create a veth pair:
$ sudo ip link add name in_host type veth peer name in_netns
Configure the host side (in_host):
$ sudo ip link set in_host master docker0 up
Moving the veth interface
In the process created by unshare, check the PID of our "network container":
# echo $$
533
On the host, move the other side (in_netns) to the network namespace:
$ sudo ip link set in_netns netns 533
(Make sure to update "533" with the actual PID obtained above!)
Basic network configuration
Let's set up lo (the loopback interface):
# ip link set lo up
Activate the veth interface and rename it to eth0:
# ip link set in_netns name eth0 up
Allocating IP address and default route
On the host, check the address of the Docker bridge:
$ ip addr ls dev docker0
(It could be something like 172.17.0.1.)
Pick an IP address in the middle of the same subnet, e.g. 172.17.0.99.
In the process created by unshare, configure the interface:
# ip addr add 172.17.0.99/24 dev eth0
# ip route add default via 172.17.0.1
(Make sure to update the IP addresses if necessary.)
Validating the setup
Check that we now have connectivity:
# ping 1.1
Note: we were able to take a shortcut, because Docker is running,
and provides us with a docker0 bridge and a valid iptables setup.
If Docker is not running, you will need to take care of this!
Cleaning up network namespaces
Terminate the process created by unshare (with exit or Ctrl-D).
Since this was the only process in the network namespace, it is destroyed.
All the interfaces in the network namespace are destroyed.
When a veth interface is destroyed, it also destroys the other half of the pair.
So we don't have anything else to do to clean up!
Other ways to use network namespaces
--net none gives an empty network namespace to a container.
(Effectively isolating it completely from the network.)
--net host means "do not containerize the network".
(No network namespace is created; the container uses the host network stack.)
--net container means "reuse the network namespace of another container".
(As a result, both containers share the same interfaces, routes, etc.)
Mnt namespace
Processes can have their own root fs (à la chroot).
Processes can also have "private" mounts. This makes it possible to:
isolate /tmp (per user, per service...),
mask /proc, /sys (for processes that don't need them),
mount remote filesystems or sensitive data, but make them visible only to allowed processes.
Mounts can be totally private, or shared.
At this point, there is no easy way to pass along a mount from one namespace to another.
Setting up a private /tmp
Create a new mount namespace:
$ sudo unshare --mount
In that new namespace, mount a brand new /tmp:
# mount -t tmpfs none /tmp
Check the content of /tmp in the new namespace, and compare to the host.
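For instance (the file name is arbitrary):
# touch /tmp/private-file
# ls /tmp
private-file
Meanwhile, on the host, ls /tmp still shows the original content, without private-file.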
The mount is automatically cleaned up when you exit the process.
PID namespace
Processes within a PID namespace only "see" processes in the same PID namespace.
Each PID namespace has its own numbering (starting at 1).
When PID 1 goes away, the whole namespace is killed.
(When PID 1 goes away on a normal UNIX system, the kernel panics!)
Those namespaces can be nested.
A process ends up having multiple PIDs (one per namespace in which it is nested).
PID namespace in action
Create a new PID namespace:
$ sudo unshare --pid --fork
(We need the --fork flag because the PID namespace is special.)
Check the process tree in the new namespace:
# ps faux
🤔 Why do we see all the processes?!?
PID namespaces and /proc
Tools like ps rely on the /proc pseudo-filesystem.
Our new namespace still has access to the original /proc.
Therefore, it still sees host processes.
But it cannot affect them.
(Try to kill a process: you will get No such process.)
PID namespaces, take 2
This can be solved by mounting /proc in the namespace.
The unshare utility provides a convenience flag, --mount-proc.
This flag will mount /proc in the namespace.
It will also unshare the mount namespace, so that this mount is local.
Try it:
$ sudo unshare --pid --fork --mount-proc
# ps faux
OK, really, why do we need --fork?
It is not necessary to remember all these details.
This is just an illustration of the complexity of namespaces!
The unshare tool calls the unshare syscall, then execs the new binary.
A process calling unshare to create new namespaces is moved to the new namespaces...
... except for the PID namespace.
(Because this would change the current PID of the process from X to 1.)
The processes created by the new binary are placed into the new PID namespace.
The first one will be PID 1.
If PID 1 exits, it is not possible to create additional processes in the namespace.
(Attempting to do so will result in ENOMEM.)
Without the --fork flag, the first command that we execute will be PID 1...
... and once it exits, we cannot create more processes in the namespace!
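We can observe this first-hand (the exact error message can vary with the shell):
$ sudo unshare --pid
# ls
bin boot dev etc ...
# ls
-bash: fork: Cannot allocate memory
The first ls became PID 1 of the new namespace; once it exited, our shell (still in the original PID namespace) could not fork new children into it.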
Check man 2 unshare and man pid_namespaces if you want more details.
IPC namespace
Does anybody know about IPC?
Does anybody care about IPC?
Allows a process (or group of processes) to have its own:
- IPC semaphores
- IPC message queues
- IPC shared memory
... without risk of conflict with other instances.
Older versions of PostgreSQL cared about this.
No demo for that one.
User namespace
Allows mapping UID/GID; e.g.:
- UID 0→1999 in container C1 is mapped to UID 10000→11999 on host
- UID 0→1999 in container C2 is mapped to UID 12000→13999 on host
- etc.
UID 0 in the container can still perform privileged operations in the container.
(For instance: setting up network interfaces.)
But outside of the container, it is a non-privileged user.
It also means that the UID in containers becomes unimportant.
(Just use UID 0 in the container, since it gets squashed to a non-privileged user outside.)
Ultimately enables better privilege separation in container engines.
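A minimal sketch with the unshare tool (--map-root-user maps our current user to UID 0 inside the namespace, so no sudo is required):
$ unshare --user --map-root-user
# id -u
0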
User namespace challenges
UID needs to be mapped when passed between processes or kernel subsystems.
Filesystem permissions and file ownership are more complicated.
(E.g. when the same root filesystem is shared by multiple containers running with different UIDs.)
With the Docker Engine:
some feature combinations are not allowed
(e.g. user namespace + host network namespace sharing),
user namespaces need to be enabled/disabled globally (when the daemon is started),
container images are stored separately
(so the first time you toggle user namespaces, you need to re-pull images).
No demo for that one.
Control groups
Control groups provide resource metering and limiting.
This covers a number of "usual suspects" like:
memory
CPU
block I/O
network (with cooperation from iptables/tc)
And a few exotic ones:
huge pages (a special way to allocate memory)
RDMA (resources specific to InfiniBand / remote memory transfer)
Crowd control
Control groups also allow grouping processes for special operations:
freezer (conceptually similar to a "mass-SIGSTOP/SIGCONT"),
perf_event (gather performance events on multiple processes),
cpuset (limit or pin processes to specific CPUs).
There is a "pids" cgroup to limit the number of processes in a given group.
There is also a "devices" cgroup to control access to device nodes
(i.e. everything in /dev).
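As a quick illustration, freezing and thawing a group with cgroups v1 looks like this (the group name demo is hypothetical):
$ sudo mkdir /sys/fs/cgroup/freezer/demo
$ sleep 1000 &
$ echo $! | sudo tee /sys/fs/cgroup/freezer/demo/tasks
$ echo FROZEN | sudo tee /sys/fs/cgroup/freezer/demo/freezer.state
$ echo THAWED | sudo tee /sys/fs/cgroup/freezer/demo/freezer.state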
Generalities
Cgroups form a hierarchy (a tree).
We can create nodes in that hierarchy.
We can associate limits to a node.
We can move a process (or multiple processes) to a node.
The process (or processes) will then respect these limits.
We can check the current usage of each node.
In other words: limits are optional (if we only want accounting).
When a process is created, it is placed in its parent's groups.
Example
The numbers are PIDs.
The names are the names of our nodes (arbitrarily chosen).
cpu                    memory
├── batch              ├── stateless
│   ├── cryptoscam     │   ├── 25
│   │   └── 52         │   ├── 26
│   └── ffmpeg         │   ├── 27
│       ├── 109        │   ├── 52
│       └── 88         │   ├── 109
└── realtime           │   └── 88
    ├── nginx          └── databases
    │   ├── 25             ├── 1008
    │   ├── 26             └── 524
    │   └── 27
    ├── postgres
    │   └── 524
    └── redis
        └── 1008
Cgroups v1 vs v2
Cgroups v1 are available on all systems (and widely used).
Cgroups v2 are a huge refactor.
(Development started in Linux 3.10, released in 4.5.)
Cgroups v2 have a number of differences:
single hierarchy (instead of one tree per controller),
processes can only be on leaf nodes (not inner nodes),
and of course many improvements / refactorings.
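One quick way to check which version a machine uses: the cgroup.controllers pseudo-file only exists at the root of a cgroups v2 hierarchy (the output below is illustrative):
$ cat /sys/fs/cgroup/cgroup.controllers
cpuset cpu io memory pids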
Memory cgroup: accounting
Keeps track of pages used by each group:
- file (read/write/mmap from block devices),
- anonymous (stack, heap, anonymous mmap),
- active (recently accessed),
- inactive (candidate for eviction).
Each page is "charged" to a group.
Pages can be shared across multiple groups.
(Example: multiple processes reading from the same files.)
To view all the counters kept by this cgroup:
$ cat /sys/fs/cgroup/memory/memory.stat
Memory cgroup: limits
Each group can have (optional) hard and soft limits.
Limits can be set for different kinds of memory:
physical memory,
kernel memory,
total memory (including swap).
Soft limits and hard limits
Soft limits are not enforced.
(But they influence reclaim under memory pressure.)
Hard limits cannot be exceeded:
if a group of processes exceeds a hard limit,
and if the kernel cannot reclaim any memory,
then the OOM (out-of-memory) killer is triggered,
and processes are killed until memory gets below the limit again.
Avoiding the OOM killer
For some workloads (databases and stateful systems), killing processes because we run out of memory is not acceptable.
The "oom-notifier" mechanism helps with that.
When "oom-notifier" is enabled and a hard limit is exceeded:
all processes in the cgroup are frozen,
a notification is sent to user space (instead of killing processes),
user space can then raise limits, migrate containers, etc.,
once the memory usage is below the hard limit, unfreeze the cgroup.
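With cgroups v1, the relevant pseudo-file is memory.oom_control (the group name demo is hypothetical): writing 1 to it disables the OOM killer for that group, and an eventfd registered through cgroup.event_control receives the notifications.
$ echo 1 | sudo tee /sys/fs/cgroup/memory/demo/memory.oom_control
$ cat /sys/fs/cgroup/memory/demo/memory.oom_control
oom_kill_disable 1
under_oom 0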
Overhead of the memory cgroup
Each time a process grabs or releases a page, the kernel updates counters.
This adds some overhead.
Unfortunately, this cannot be enabled/disabled per process.
It has to be done system-wide, at boot time.
Also, when multiple groups use the same page:
only the first group gets "charged",
but if it stops using it, the "charge" is moved to another group.
Setting up a limit with the memory cgroup
Create a new memory cgroup:
$ CG=/sys/fs/cgroup/memory/onehundredmegs
$ sudo mkdir $CG
Limit it to approximately 100MB of memory usage:
$ sudo tee $CG/memory.limit_in_bytes <<< 100000000
$ sudo tee $CG/memory.memsw.limit_in_bytes <<< 100000000
(We set memory.limit_in_bytes first, because the kernel rejects a memory+swap limit lower than the plain memory limit; and if memory.memsw.limit_in_bytes doesn't exist at all, swap accounting is disabled on your system.)
Move the current process to that cgroup:
$ sudo tee $CG/tasks <<< $$
The current process and all its future children are now limited.
(Confused about <<<? Look at the next slide!)
What's <<<?
This is a "here string". (It is a non-POSIX shell extension.)
The following commands are equivalent:
foo <<< hello
echo hello | foo
foo <<EOF
hello
EOF
Why did we use that?
Writing to cgroups pseudo-files requires root
Instead of:
sudo tee $CG/tasks <<< $$
We could have done:
sudo sh -c "echo $$ > $CG/tasks"
The following commands, however, would be invalid:
sudo echo $$ > $CG/tasks
(The output redirection would be done by our non-privileged shell, before sudo even starts.)
sudo -i # (or su)
echo $$ > $CG/tasks
(Here, $$ would be the PID of the new root shell, not the process we meant to add.)
Testing the memory limit
Start the Python interpreter:
$ python
Python 3.6.4 (default, Jan 5 2018, 02:35:40)
[GCC 7.2.1 20171224] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>
Allocate 80 megabytes:
>>> s = "!" * 1000000 * 80
Add 20 megabytes more:
>>> t = "!" * 1000000 * 20
Killed
CPU cgroup
Keeps track of CPU time used by a group of processes.
(This is easier and more accurate than getrusage and /proc.)
Keeps track of usage per CPU as well.
(i.e., "this group of processes used X seconds of CPU0 and Y seconds of CPU1".)
Allows setting relative weights used by the scheduler.
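With cgroups v1, those weights are exposed as cpu.shares (default 1024; the group name lowprio is hypothetical):
$ sudo mkdir /sys/fs/cgroup/cpu/lowprio
$ echo 256 | sudo tee /sys/fs/cgroup/cpu/lowprio/cpu.shares
Under CPU contention, processes in lowprio would get roughly a quarter of the CPU time given to a default group.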
Cpuset cgroup
Pin groups to specific CPU(s).
Use-case: reserve CPUs for specific apps.
Warning: make sure that "default" processes aren't using all CPUs!
CPU pinning can also avoid performance loss due to cache flushes.
This is also relevant for NUMA systems.
Provides extra dials and knobs.
(Per zone memory pressure, process migration costs...)
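A cgroups v1 sketch (the group name pinned is hypothetical; note that cpuset.mems must be set before tasks can be added):
$ sudo mkdir /sys/fs/cgroup/cpuset/pinned
$ echo 0-1 | sudo tee /sys/fs/cgroup/cpuset/pinned/cpuset.cpus
$ echo 0 | sudo tee /sys/fs/cgroup/cpuset/pinned/cpuset.mems
$ echo $$ | sudo tee /sys/fs/cgroup/cpuset/pinned/tasks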
Blkio cgroup
Keeps track of I/Os for each group:
- per block device
- read vs write
- sync vs async
Set throttle (limits) for each group:
- per block device
- read vs write
- ops vs bytes
Set relative weights for each group.
Note: most writes go through the page cache.
(So classic writes will appear to be unthrottled at first.)
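A cgroups v1 throttling sketch, capping reads from one device to 1 MB/s (the group name demo is hypothetical; "8:0" is the device's major:minor numbers, see ls -l /dev/sda):
$ echo "8:0 1048576" | sudo tee /sys/fs/cgroup/blkio/demo/blkio.throttle.read_bps_device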
Net_cls and net_prio cgroup
Only works for egress (outgoing) traffic.
Automatically set traffic class or priority for traffic generated by processes in the group.
Net_cls will assign traffic to a class.
Classes have to be matched with tc or iptables, otherwise traffic just flows normally.
Net_prio will assign traffic to a priority.
Priorities are used by queuing disciplines.
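A cgroups v1 sketch (the group name demo is hypothetical): the classid 0x100001 tags traffic from the group as tc class 10:1, which a tc filter (or iptables -m cgroup) can then match:
$ echo 0x100001 | sudo tee /sys/fs/cgroup/net_cls/demo/net_cls.classid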
Devices cgroup
Controls what the group can do on device nodes.
Permissions include read/write/mknod.
Typical use:
- allow /dev/{tty,zero,random,null} ...
- ... deny everything else.
A few interesting nodes:
- /dev/net/tun (network interface manipulation)
- /dev/fuse (filesystems in user space)
- /dev/kvm (VMs in containers, yay inception!)
- /dev/dri (GPU)
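A cgroups v1 sketch: deny everything, then re-allow read/write on /dev/null, which is character device 1:3 (the group name demo is hypothetical):
$ echo a | sudo tee /sys/fs/cgroup/devices/demo/devices.deny
$ echo "c 1:3 rw" | sudo tee /sys/fs/cgroup/devices/demo/devices.allow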
Security features
Namespaces and cgroups are not enough to ensure strong security.
We need extra mechanisms: capabilities, seccomp, LSMs.
These mechanisms were already used before containers to harden security.
They can be used together with containers.
Good container engines will automatically leverage these features.
(So that you don't have to worry about it.)
Capabilities
In traditional UNIX, many operations are possible if and only if UID=0 (root).
Some of these operations are very powerful:
- changing file ownership, accessing all files ...
Some of these operations deal with system configuration, but can be abused:
- setting up network interfaces, mounting filesystems ...
Some of these operations are not very dangerous but are needed by servers:
- binding to a port below 1024.
Capabilities are per-process flags to allow these operations individually.
Some capabilities
CAP_CHOWN: arbitrarily change file ownership and permissions.
CAP_DAC_OVERRIDE: arbitrarily bypass file ownership and permissions.
CAP_NET_ADMIN: configure network interfaces, iptables rules, etc.
CAP_NET_BIND_SERVICE: bind a port below 1024.
See man capabilities for the full list and details.
Using capabilities
Container engines will typically drop all "dangerous" capabilities.
You can then re-enable capabilities on a per-container basis, as needed.
With the Docker engine:
docker run --cap-add ...
If you write your own code to manage capabilities:
make sure that you understand what each capability does,
read about ambient capabilities as well.
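For instance, a locked-down web server could drop everything and re-add only what it needs (nginx is just an example image):
$ docker run --rm --cap-drop ALL --cap-add NET_BIND_SERVICE nginx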
Seccomp
Seccomp is secure computing.
Achieves a high level of security by drastically restricting the available syscalls.
Original seccomp only allows read(), write(), exit(), sigreturn().
The seccomp-bpf extension allows specifying custom filters with BPF rules.
This allows filtering by syscall, and by parameter.
BPF code can perform arbitrarily complex checks, quickly, and safely.
Container engines take care of this so you don't have to.
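For instance, the Docker Engine applies a default seccomp profile automatically, and accepts a custom one (the file name here is hypothetical):
$ docker run --rm --security-opt seccomp=./myprofile.json alpine sh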
Linux Security Modules
The most popular ones are SELinux and AppArmor.
Red Hat distros generally use SELinux.
Debian distros (in particular, Ubuntu) generally use AppArmor.
LSMs add a layer of access control to all process operations.
Container engines take care of this so you don't have to.
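To check which LSM is active on a machine (assuming the corresponding tooling is installed):
On AppArmor systems:
$ sudo aa-status
On SELinux systems:
$ getenforce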