Where did fork go?

13 Jun 2014

A couple of days ago I was playing around with strace and bash on a Linux box. My goal was to get a better understanding of Unix shells and how they operate at a systems level. I was happily launching commands inside the bash process and watching the output of strace when it suddenly struck me: “Wait! Where is fork? It’s a system call! strace is supposed to show this! Where is it?” Nowhere in the strace output was a call to fork(2) to be found.

I was really confused and my curiosity was piqued. So I spent the next couple of hours searching for an explanation, and I found one that I think is worth sharing. But first of all let me explain why I was confused.

fork and execve in the Unix environment

Processes in Unix environments are based on a pretty simple idea: the combination of fork(2) and execve(2).

Every process running on a Unix system started out as a call to fork(2) followed by a call to execve(2). Well, not every process, since the first process, the init process, the one that starts up the rest of the operating system, didn’t. But every process that came after.

The idea is rather simple: fork(2) creates a new process and execve(2) turns the new process into the kind of process you want it to be.

Let’s say you’re a shell and your user wants to start his productivity utility vimwonderhorse. Now, the first thing you’ve got to do is to start a new process. The reason for that is simple: when the user quits vimwonderhorse you should still be there, waiting for the user’s input again. If you, as the shell, had changed into vimwonderhorse and the user quit, well, then you would be gone, too. So you start a new process with fork(2).

A new process started with fork(2) is a replica (ignoring some details here) of the process making the call: the same instructions, the same open file descriptors, the same working directory and so on. Only the PID (and the parent PID) and the memory address space have changed. But since the user wanted to start another program, it’s not that useful to now have two shell processes. And that’s what execve(2) is for.

A call to execve(2) transforms the calling process into the executable specified in the arguments to execve(2). execve(2) never returns unless something goes wrong. That means that once you execve(2) into something else, you can’t go back.
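To make the "never returns unless something goes wrong" part concrete, here is a minimal sketch. The helper name exec_and_report and the exit code 127 are my own choices for illustration, not something the rest of this post depends on. The child tries to execve(2) a path, and the lines after the call are only reached if that attempt failed.

```c
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork a child that tries to execve() `path`.
 * Returns the child's exit code, or -1 if something else went wrong. */
int exec_and_report(const char *path)
{
    char *const argv[] = {(char *)path, NULL};
    char *const envp[] = {NULL};
    int status;

    pid_t pid = fork();
    if (pid == -1)
        return -1;

    if (pid == 0) {
        execve(path, argv, envp);
        /* We only get here if execve() failed, e.g. because the
         * file does not exist. On success this code is gone. */
        perror("execve");
        _exit(127);
    }

    if (waitpid(pid, &status, 0) == -1 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

Calling exec_and_report() with a path that does not exist should give back 127, because the child fell through to the code after execve(2).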

So, breaking it down, it’s just this: fork(2) to create a new process and then execve(2) to turn the newly-created process into the process you want it to be. Normally you would close file descriptors that are open but no longer needed and clean up other things between fork(2) and execve(2).
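Here is a rough sketch of what that clean-up between fork(2) and execve(2) can look like. It is only an illustration under a few assumptions: the helper capture_echo is made up for this example, and I assume /bin/echo exists on the system. The child points its stdout at a pipe and closes the descriptors it no longer needs before it execs.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: run "/bin/echo <word>" in a child and copy what
 * it prints into buf. Returns the number of bytes captured, or -1. */
ssize_t capture_echo(const char *word, char *buf, size_t bufsize)
{
    int fds[2];
    if (bufsize == 0 || pipe(fds) == -1)
        return -1;

    pid_t pid = fork();
    if (pid == -1) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    if (pid == 0) {
        /* Child: tidy up file descriptors, then exec. */
        char *const argv[] = {"/bin/echo", (char *)word, NULL};
        char *const envp[] = {NULL};
        close(fds[0]);                /* the child never reads the pipe */
        dup2(fds[1], STDOUT_FILENO);  /* stdout now points at the pipe */
        close(fds[1]);                /* the duplicate is enough */
        execve(argv[0], argv, envp);
        _exit(127);                   /* only reached if execve failed */
    }

    /* Parent: close the write end, otherwise read() never sees EOF. */
    close(fds[1]);
    size_t total = 0;
    ssize_t n;
    while (total < bufsize - 1 &&
           (n = read(fds[0], buf + total, bufsize - 1 - total)) > 0)
        total += (size_t)n;
    buf[total] = '\0';
    close(fds[0]);
    waitpid(pid, NULL, 0);
    return (ssize_t)total;
}
```

This is roughly the trick a shell can use for things like command substitution: the new program has no idea its stdout is a pipe, because the rewiring happened before execve(2).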

Let’s see how this works in practice with a little C code snippet.

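#include <stdio.h>
#include <unistd.h>
#include <sys/wait.h>

int main(int argc, char *argv[])
{
    pid_t p, p_wait;
    int status;
    char *cmd[] = {"/bin/ls", "-l", ".", NULL};

    // fork
    if ((p = fork()) > 0) {
        // parent: wait for the child and report its exit code
        p_wait = wait(&status);
        printf("%s [%d] exited with %d\n", cmd[0], p_wait, WEXITSTATUS(status));
    } else if (p == 0) {
        // child process
        // execve only returns if it failed
        execve(cmd[0], cmd, NULL);
        perror("execve");
        _exit(127);
    } else {
        // fork failed, no child process was created
        perror("fork");
        return 1;
    }

    return 0;
}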

All this snippet really does is create a new process with fork(2) and use execve(2) to turn it into ls. We set up a few variables to help us and then start right away: we call fork(2), which returns the PID of the newly created process (in the parent process) and 0 in the created process itself, the child process. That sounds funny, but actually makes a lot of sense when you think about the fact that fork(2) does little more than duplicate the current process.

After the call to fork(2) the parent and the child process run the same code. To find out which process we are in, we need to check the return value of fork(2). In the child process we call execve(2) to turn the process into ls with some arguments (effectively running ls -l .). In the parent process we call wait(2) to check the exit code of the child and to not leave a zombie process.
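One detail worth spelling out: the status that wait(2) hands back is not the exit code itself but an encoded value, and fork(2) can also return -1 when no child could be created. The following sketch (the helper child_exit_code is invented for this example) handles all three return values of fork(2) and decodes the status with the macros from sys/wait.h.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork a child that exits with `code` and return
 * the exit code as seen by the parent, or -1 on any error. */
int child_exit_code(int code)
{
    int status;
    pid_t pid = fork();

    if (pid == -1)
        return -1;      /* fork failed: there is no child */
    if (pid == 0)
        _exit(code);    /* child: fork() returned 0 */

    /* Parent: fork() returned the PID of the child. */
    if (waitpid(pid, &status, 0) == -1)
        return -1;
    if (!WIFEXITED(status))
        return -1;      /* the child did not exit normally */
    return WEXITSTATUS(status);
}
```

With this, child_exit_code(3) gives back 3, whereas printing the raw status would show a different number.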

Compiling and running this gives us the following output:

$ cc -o basic_pattern basic_pattern.c && ./basic_pattern
total 12
-rwxr-xr-x 1 mrnugget mrnugget 7201 Jun  8 13:05 basic_pattern
-rw-r--r-- 1 mrnugget mrnugget  453 Jun  8 13:04 basic_pattern.c
/bin/ls [2021] exited with 0

That looks good: first the output of ls and then the output of our parent process, which waited for ls to exit.

In the snippet we use three system calls: fork, execve and wait. Let’s use strace to see this confirmed. We will use strace -f to make strace follow every created child process:

$ sudo strace -f ./basic_pattern
execve("./basic_pattern", ["./basic_pattern"], [/* 14 vars */]) = 0
[...]
Process 2316 attached
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f108d62e9d0) = 2316
[pid  2315] wait4(-1, Process 2315 suspended <unfinished ...>
[pid  2316] execve("/bin/ls", ["/bin/ls", "-l", "."], [/* 0 vars */]) = 0
[...]
[pid  2316] write(1, "-rwxr-xr-x 1 mrnugget mrnugget 7"..., 63-rwxr-xr-x 1 mrnugget mrnugget 7201 Jun  8 13:05 basic_pattern) = 63
[...]
[pid  2316] write(1, "-rw-r--r-- 1 mrnugget mrnugget  "..., 65-rw-r--r-- 1 mrnugget mrnugget  453 Jun  8 13:04 basic_pattern.c) = 65
[...]
[...]
[...]
[pid  2316] exit_group(0)               = ?
Process 2315 resumed
Process 2316 detached
<... wait4 resumed> [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 2316
--- SIGCHLD (Child exited) @ 0 (0) ---
fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 6), ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f108d646000
write(1, "/bin/ls [2316] exited with 0\n", 30/bin/ls [2316] exited with 0) = 30
exit_group(0)                           = ?

I left out some lines that are not relevant here (and which basically show memory allocation, loading of shared libraries and the internals of ls).

But we can see what’s happening. A new process is created, the parent process calls wait for the child, the child execves into /bin/ls, which writes the contents of the working directory to STDOUT and then exits. After the child exits, the call to wait in the parent returns and the parent writes the status message to STDOUT.

So, where is fork? We explicitly called fork() in the code, which is supposed to be a system call, but it’s nowhere to be seen. Come to think of it, where is wait(2)? wait4 shows up, yes, but that’s not what we called.

fork, clone, library and system calls

It turns out that when we call fork() in our code, we don’t actually call the system call fork(). Instead, we call a library function in the C standard library (yes, called fork()) that is a small wrapper around the system call.

The top answer to this post on Stack Overflow explains in detail, with links to the relevant parts of the glibc source, that the fork() we call in our code is actually a wrapper in glibc that calls the clone(2) system call. (The same goes for wait(2); see code here.)

Even the man page for fork(2) explains this:

Since version 2.3.3, rather than invoking the kernel's fork() system
call, the glibc fork() wrapper that is provided as part of the NPTL threading
implementation invokes clone(2) with flags that provide the same effect as the
traditional system call.  (A call to fork() is equivalent to a call to clone(2)
specifying flags as just  SIGCHLD.)   The glibc wrapper invokes any fork
handlers that have been established using pthread_atfork(3).

If we use ltrace, instead of strace, which traces library calls instead of system calls, we can see this happening:

$ sudo ltrace -f ./basic_pattern
[pid 8178] __libc_start_main(0x4005cd, 1, 0x7fffb39059b8, 0x400660 <unfinished ...>
[pid 8178] fork() = 8179
[pid 8178] wait(0x7fffb39058c4 <unfinished ...>
[pid 8179] <... fork resumed> ) = 0
[pid 8179] execve(0x4006e4, 0x7fffb39058a0, 0, 0x7fffb39058a0 <no return ...>
[pid 8179] --- Called exec() ---
[pid 8179] __libc_start_main(0x4028c0, 3, 0x7fffd94860c8, 0x411e60 <unfinished ...>
[pid 8179] strrchr("/bin/ls", '/') = "/ls"
[...]
[pid 8179] +++ exited (status 0) +++
[pid 8178] --- SIGCHLD (Child exited) ---
[pid 8178] <... wait resumed> ) = 8179
[pid 8178] printf("%s [%d] exited with %d\n", "/bin/ls", 8179, 0/bin/ls [8179] exited with 0) = 30
[pid 8178] +++ exited (status 0) +++

We could stop here and conclude by saying that reading man pages is a wise and noble thing to do and nobody should speak about anything without checking the man page for it. But, here’s the thing: digging deeper provides some really interesting information about processes in the Linux environment. So let’s do that and get back to the topic at hand.

Why does glibc do that? Why call clone(2) instead of fork(2)? And why does it wrap system calls in library functions?

After digging around a bit I found out that making a system call is actually harder than just calling fork() somewhere in my code. I’d need to know the unique number of the system call I was about to make, set up registers, execute a special instruction (which varies between machine architectures) to switch to kernel mode and then handle the results when I’m back in user space.

By providing a wrapper around certain system calls, glibc makes it a lot easier and more portable for developers to use system calls. There is still the possibility to use syscall(2) to call system calls somewhat more directly.
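As a small illustration of the syscall(2) route, here is a sketch that asks the kernel for the PID by system call number instead of going through the getpid() wrapper. The function name raw_getpid is made up for this example, and I am assuming a Linux system with glibc, where SYS_getpid is defined in sys/syscall.h.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: fetch the PID by system call number.
 * syscall() takes care of the registers and the trap instruction. */
pid_t raw_getpid(void)
{
    return (pid_t)syscall(SYS_getpid);
}
```

Both raw_getpid() and getpid() should return the same value. The difference is only in how we get to the kernel.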

So why does fork() in glibc call clone(2) instead of just being a wrapper for the fork system call? The reason for that is the implementation of threads and processes in Linux. Processes are just “fat” threads. Under the hood they don’t differ too much, at least from the kernel’s point of view. The main difference is that instead of sharing a memory address space with other processes, each process gets its own. Of course, this is a simplified idea of what is actually going on, but what it boils down to is this: threads are lightweight processes that can be created with clone(2).

In contrast to fork(2), which takes no arguments, we can call clone(2) with different arguments to change what kind of process will be created. Does it need to share its execution context with its parent? Memory? File descriptors? Signal handlers? clone(2) allows us to change these attributes of newly created processes. This is clearly much more flexible and powerful than fork(2), which creates the “fat processes” we can see when we run ps.
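To see what such a flag changes, here is a small sketch using the glibc clone() wrapper. The names set_flag and child_write_visible are invented for this example, and the 64 KiB stack size is an arbitrary choice of mine. The child writes to a variable in the parent: with CLONE_VM both share one address space and the parent sees the write, without it the child only changes its own copy.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>

/* Runs in the child: write 42 to the int that arg points to. */
static int set_flag(void *arg)
{
    *(int *)arg = 42;
    return 0;
}

/* Hypothetical helper: create a child with the given clone flags and
 * return what the parent sees in `value` afterwards (-1 on error). */
int child_write_visible(int flags)
{
    const size_t stack_size = 64 * 1024;  /* arbitrary for this sketch */
    char *stack = malloc(stack_size);
    int value = 0;

    if (stack == NULL)
        return -1;

    /* The stack grows downwards here, so pass the top of the block. */
    pid_t pid = clone(set_flag, stack + stack_size, flags | SIGCHLD, &value);
    if (pid == -1) {
        free(stack);
        return -1;
    }

    waitpid(pid, NULL, 0);
    free(stack);
    return value;
}
```

child_write_visible(CLONE_VM) should return 42, while child_write_visible(0) should return 0, which is the fork-like behavior.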

The functionality fork(2) provides is covered by clone(2). So glibc implements fork() on top of clone(2): the API we program against stays the same, and the creation of processes is centralized in a single system call.
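If you want to try this yourself, the following sketch calls clone(2) through syscall(2) with nothing but SIGCHLD as flags, which the man page describes as equivalent to fork(). Please treat it as an experiment and not as a fork() replacement: the name raw_clone_exit_code is made up, I am assuming an architecture such as x86-64 where the flags are the first argument of the raw system call, and going around glibc skips the handlers registered with pthread_atfork(3).

```c
#define _GNU_SOURCE
#include <signal.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: create a child via the raw clone system call,
 * let it exit with `code`, and return that code (-1 on error). */
int raw_clone_exit_code(int code)
{
    int status;

    /* Assumption: flags come first (true on x86-64). All remaining
     * arguments are 0, so the child keeps using a copy of our stack. */
    long pid = syscall(SYS_clone, (long)SIGCHLD, 0L, 0L, 0L, 0L);

    if (pid == -1)
        return -1;
    if (pid == 0)
        _exit(code);    /* child: do as little as possible */

    if (waitpid((pid_t)pid, &status, 0) == -1 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

Running a program that uses this under strace -f should show a clone call with only SIGCHLD in its flags, without the CLONE_CHILD_* flags glibc adds.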

And that is the reason why strace won’t show fork(2): calling fork(2) uses the wrapper provided by glibc, which uses clone(2) to create a process.

Follow me on twitter: @thorstenball. Or send me an email to me@thorstenball.com. Or check out my books at interpreterbook.com and compilerbook.com.
