Where did fork go?
A couple of days ago I was playing around with strace and bash on a
Linux box. My goal was to get a better understanding of Unix shells and how
they operate at a systems level. I was happily launching commands inside the
bash process and watching the output of strace when it suddenly struck me:
“Wait! Where is fork? It’s a system call! strace is supposed to show this!
Where is it?” Nowhere in the strace output was a call to fork(2) to be found.
I was really confused and my curiosity was piqued. So I spent the next couple of hours searching for an explanation, and I found one that I think is worth sharing. But first, let me explain why I was confused.
fork and execve in the Unix environment
Processes in Unix environments are based on a pretty simple idea: the
combination of fork(2) and execve(2).
Every process running on a Unix system started out as a call to fork(2)
followed by a call to execve(2). Well, not every process, since the first
process, the init process, the one that starts up the rest of the operating
system, didn’t. But every process that came after did.
The idea is rather simple: fork(2) creates a new process and execve(2) turns
the new process into the kind of process you want it to be.
Let’s say you’re a shell and your user wants to start his productivity utility
vimwonderhorse. Now, the first thing you’ve got to do is to start a new
process. The reason for that is simple: when the user quits vimwonderhorse
you should still be there, waiting for the user’s input again. If you, as the
shell, had turned into vimwonderhorse and the user quit, well, then
you would be gone, too. So you start a new process with fork(2).
A new process started with fork(2) is a replica (ignoring some details here)
of the process making the call: the same instructions, the same open file
descriptors, the same working directory and so on. Only the PID (and the
parent PID) have changed, and the child gets its own copy of the memory
address space. But since the user wanted to start another program it’s not
that useful to now have two shell processes. And that’s what execve(2) is for.
A call to execve(2) transforms the calling process into the executable
specified in the arguments to execve(2). execve(2) never returns unless
something goes wrong. That means that once you execve(2) into something else,
you can’t go back.
So, breaking it down, it’s just this: fork(2) to create a new process and
then execve(2) to turn the newly created process into the process you want it
to be. Normally you would close open but unneeded file descriptors and clean up
other things between fork(2) and execve(2).
Let’s see how this works in practice with a little C code snippet.
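    #include <stdio.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/wait.h>

    int main(int argc, char *argv[])
    {
        pid_t p, p_wait;
        int status;
        char *cmd[] = {"/bin/ls", "-l", ".", NULL};

        // fork: returns the child's PID in the parent, 0 in the child, -1 on error
        p = fork();
        if (p < 0) {
            perror("fork");
            return 1;
        } else if (p > 0) {
            // parent: wait for the child so it does not linger as a zombie
            p_wait = wait(&status);
            printf("%s [%d] exited with %d\n", cmd[0], p_wait,
                   WIFEXITED(status) ? WEXITSTATUS(status) : -1);
        } else {
            // child process
            // execve: Linux treats a NULL environment like an empty one
            execve(cmd[0], cmd, NULL);
            // only reached if execve failed
            perror("execve");
            _exit(127);
        }
        return 0;
    }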
All this snippet really does is create a new process with fork(2) and use
execve(2) to turn it into ls. We set up a few variables to help us and then
start right away: we call fork(2), which returns the PID of the newly created
process (in the parent process) and 0 in the created process itself, the child
process. That sounds funny, but actually makes a lot of sense when you think
about the fact that fork(2) does little more than duplicate the current
process.
After the call to fork(2) the parent and the child process run the same code.
To find out which process we are in we need to check the return value of
fork(2). In the child process we call execve(2) to turn the process into
ls with some arguments (effectively running ls -l .). In the parent process
we call wait(2) to check the exit code of the child and to avoid leaving a
zombie process behind.
Compiling and running this gives us the following output:
$ cc -o basic_pattern basic_pattern.c && ./basic_pattern
total 12
-rwxr-xr-x 1 mrnugget mrnugget 7201 Jun 8 13:05 basic_pattern
-rw-r--r-- 1 mrnugget mrnugget 453 Jun 8 13:04 basic_pattern.c
/bin/ls [2021] exited with 0
That looks good: first the output of ls and then the output of our parent
process, which waited for ls to exit.
In the basic_pattern snippet we use three system calls: fork, execve and
wait. Let’s use strace to see this confirmed. We will use strace -f to make
strace follow every created child process:
$ sudo strace -f ./basic_pattern
execve("./basic_pattern", ["./basic_pattern"], [/* 14 vars */]) = 0
[...]
Process 2316 attached
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f108d62e9d0) = 2316
[pid 2315] wait4(-1, Process 2315 suspended <unfinished ...>
[pid 2316] execve("/bin/ls", ["/bin/ls", "-l", "."], [/* 0 vars */]) = 0
[...]
[pid 2316] write(1, "-rwxr-xr-x 1 mrnugget mrnugget 7"..., 63-rwxr-xr-x 1 mrnugget mrnugget 7201 Jun 8 13:05 basic_pattern) = 63
[...]
[pid 2316] write(1, "-rw-r--r-- 1 mrnugget mrnugget "..., 65-rw-r--r-- 1 mrnugget mrnugget 453 Jun 8 13:04 basic_pattern.c) = 65
[...]
[...]
[...]
[pid 2316] exit_group(0) = ?
Process 2315 resumed
Process 2316 detached
<... wait4 resumed> [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 2316
--- SIGCHLD (Child exited) @ 0 (0) ---
fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 6), ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f108d646000
write(1, "/bin/ls [2316] exited with 0\n", 30/bin/ls [2316] exited with 0) = 30
exit_group(0) = ?
I left out some lines that are not relevant here (and which basically show
memory allocation, loading of shared libraries and the internals of ls).
But we can see what’s happening. A new process is created, the parent process
calls wait for the child, the child execves into /bin/ls, which writes the
contents of the working directory to STDOUT and then exits. After the child
exits the call to wait in the parent returns and the parent writes the status
message to STDOUT.
So, where is fork? We explicitly called fork() in the code, which
is supposed to be a system call, but it’s nowhere to be seen. Come to think of
it, where is wait(2)? wait4 shows up, yes, but that’s not what we called.
fork, clone, library and system calls
It turns out that when we call fork() in our code, we don’t actually call
the system call fork(). Instead, we call a library function in the C standard
library (yes, called fork()) that is a small wrapper around the system call.
The top answer to this post on Stack Overflow explains in
detail and with links to the relevant parts of the glibc source that the
fork() function we use in our code is actually a wrapper in glibc that
calls the clone(2) system call. (The same goes for wait(2), which ends up
calling wait4; see code here.)
Even the man page for fork(2) explains this:
Since version 2.3.3, rather than invoking the kernel's fork() system
call, the glibc fork() wrapper that is provided as part of the NPTL threading
implementation invokes clone(2) with flags that provide the same effect as the
traditional system call. (A call to fork() is equivalent to a call to clone(2)
specifying flags as just SIGCHLD.) The glibc wrapper invokes any fork
handlers that have been established using pthread_atfork(3).
If we use ltrace instead of strace, which traces library calls instead of
system calls, we can see this happening:
$ sudo ltrace -f ./basic_pattern
[pid 8178] __libc_start_main(0x4005cd, 1, 0x7fffb39059b8, 0x400660 <unfinished ...>
[pid 8178] fork() = 8179
[pid 8178] wait(0x7fffb39058c4 <unfinished ...>
[pid 8179] <... fork resumed> ) = 0
[pid 8179] execve(0x4006e4, 0x7fffb39058a0, 0, 0x7fffb39058a0 <no return ...>
[pid 8179] --- Called exec() ---
[pid 8179] __libc_start_main(0x4028c0, 3, 0x7fffd94860c8, 0x411e60 <unfinished ...>
[pid 8179] strrchr("/bin/ls", '/') = "/ls"
[...]
[pid 8179] +++ exited (status 0) +++
[pid 8178] --- SIGCHLD (Child exited) ---
[pid 8178] <... wait resumed> ) = 8179
[pid 8178] printf("%s [%d] exited with %d\n", "/bin/ls", 8179, 0/bin/ls [8179] exited with 0) = 30
[pid 8178] +++ exited (status 0) +++
We could stop here and conclude by saying that reading man pages is a wise and noble thing to do and nobody should speak about anything without checking the man page for it. But, here’s the thing: digging deeper provides some really interesting information about processes in the Linux environment. So let’s do that and get back to the topic at hand.
Why does glibc do that? Why call clone(2) instead of fork(2)? And why
does it wrap system calls in library functions?
After digging around a bit I found out that making a system call is
actually harder than just calling fork() somewhere in my code. I’d need to
know the unique number of the system call I was about to make, set up
registers, execute a special instruction (which differs between machine
architectures) to switch to kernel mode and then handle the results when I’m
back in user space. By providing a wrapper around certain system calls glibc
makes using them a lot easier and more portable for developers. There is still
the possibility to use syscall(2) to call system calls somewhat more directly.
So why does fork() in glibc call clone(2) instead of just being a wrapper
for the fork system call? The reason for that is the implementation of
threads and processes in Linux. Processes are just “fat” threads. Under the
hood they don’t differ too much, at least from the kernel’s point of view. The
main difference is that instead of sharing a memory address space the way
threads do, each process gets its own. Of course, this is a simplified idea of
what is actually going on, but what it boils down to is this: threads are
lightweight processes that can be created with clone(2).
In contrast to fork(2), which takes no arguments, we can call clone(2) with
different arguments to change what kind of process will be created. Should
parent and child share their execution context? Memory? File descriptors?
Signal handlers? clone(2) allows us to change these attributes of newly
created processes. This is clearly much more flexible and powerful than
fork(2), which creates the “fat processes” we can see when we run ps.
The functionality fork(2) provides is covered by clone(2). So glibc
implements fork() on top of clone(2): the familiar fork() API keeps working
unchanged, and the creation of processes is centralized in a single system
call.
And that is the reason why strace won’t show fork(2): calling fork()
uses the wrapper provided by glibc, which uses clone(2) to create a process.