This blog post explains how strace works internally. We’ll examine the ptrace system call, which strace relies on, both at the API layer and inside the kernel, to understand exactly how strace gets information about the system calls being made in a running process.
ptrace is a system call which a program can use to:
- trace system calls
- read and write memory and registers
- manipulate signal delivery to the traced process
As you can see, ptrace is a really useful system call for tracing and manipulating other programs. It is used by strace, GDB, and other tools as well.
You should take a look at the man page for ptrace for a lot of useful documentation.
ptrace and syscalls
For simplicity, let’s call the program which traces the system calls made by another program the tracer, and the program being traced the tracee.
The tracer can call ptrace with the PTRACE_ATTACH request and supply the process ID of the tracee. Once the tracee has stopped, this is followed by another call to ptrace with the PTRACE_SYSCALL request and the same process ID.
The tracee will run until it enters a system call, at which point it will be stopped by the Linux kernel. For the tracer, it will appear as if the tracee has been stopped because it received a SIGTRAP signal. The tracer can then inspect the arguments to the system call and print any relevant information.
Next, the tracer can call ptrace with PTRACE_SYSCALL again, which will resume the tracee but cause it to be stopped by the kernel when it exits the system call.
This pattern continues to trace the entry and exit from system calls allowing the tracer to inspect the tracee and print arguments, return values, timing information, and more.
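The pattern above can be sketched in C. This is a minimal sketch, not how strace itself is written: to stay self-contained it forks its own tracee, which opts in with PTRACE_TRACEME, rather than attaching to an existing process with PTRACE_ATTACH. The helper name count_syscall_stops is made up for illustration.

```c
#include <signal.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: traces a forked child and returns how many
 * syscall stops (entries plus exits) were observed, or -1 on error. */
long count_syscall_stops(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Tracee: ask to be traced by the parent, then stop so the
         * parent can take over before the calls of interest run. */
        if (ptrace(PTRACE_TRACEME, 0, NULL, NULL) < 0)
            _exit(1);
        raise(SIGSTOP);
        getppid();
        getppid();
        _exit(0);
    }

    int status;
    long stops = 0;

    /* Tracer: wait for the tracee's initial stop. */
    if (waitpid(pid, &status, 0) < 0 || !WIFSTOPPED(status))
        return -1;

    for (;;) {
        /* Resume the tracee until its next syscall entry or exit. */
        if (ptrace(PTRACE_SYSCALL, pid, NULL, NULL) < 0)
            break;
        if (waitpid(pid, &status, 0) < 0)
            break;
        if (WIFEXITED(status) || WIFSIGNALED(status))
            return stops;
        /* With default options a syscall stop is reported as SIGTRAP. */
        if (WIFSTOPPED(status) && WSTOPSIG(status) == SIGTRAP)
            stops++;
    }

    /* Error path: do not leave a stopped tracee behind. */
    kill(pid, SIGKILL);
    waitpid(pid, &status, 0);
    return -1;
}
```

A real tracer would read the tracee’s registers at each stop; here the stops are only counted. The exact count depends on what the C library does around raise and _exit, so it is best treated as a lower bound of two stops per getppid call.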
Now that the order of operations for working with ptrace has been outlined, let’s take a look at how this actually works under the hood in the kernel.
A good place to start looking is the code in the kernel for the ptrace system call. The discussion below refers to the Linux kernel 3.13 source, which is available on GitHub.
We’ll start by understanding what PTRACE_ATTACH does. The ptrace system call code can be found in kernel/ptrace.c. Take a look at the source on GitHub.
A few lines in, the source for ptrace checks the request parameter for PTRACE_ATTACH. If it matches, ptrace_attach is called, which you can find toward the top of the file.
ptrace_attach does a few things early on:
- prepares the flags which will eventually be stored on a structure in the kernel representing the process that is being attached to
- ensures that the task it will attach to is not a kernel thread
- ensures that the task it will attach to is not a thread of the current process
- calls __ptrace_may_access, which does some security checks
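The check against tracing a thread of the current process can be observed from user space. The sketch below, with a made-up helper name, asks ptrace to attach to the calling process itself. On Linux the call is expected to fail with EPERM, though a sandbox that blocks ptrace altogether may fail it for a different reason.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <unistd.h>

/* Hypothetical helper: tries to PTRACE_ATTACH to the calling process
 * and returns the errno from the failed call, or 0 if it succeeded. */
int attach_to_self_errno(void)
{
    errno = 0;
    if (ptrace(PTRACE_ATTACH, getpid(), NULL, NULL) == -1)
        return errno;
    return 0;
}
```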
Of the flags prepared at the start of ptrace_attach, the one that applies in our case marks the task as being traced.
Once the ptrace_attach function completes, execution resumes in ptrace. For an attach request there is nothing more to do at this point: the generic code skips the remaining request handling, and the ptrace system call code finishes up by returning.
Now that we’ve seen how PTRACE_ATTACH works, we can start examining PTRACE_SYSCALL.
The ptrace system call begins as it did in the PTRACE_ATTACH case. Since this is not an attach request, the first thing the source does is call ptrace_check_attach to ensure that the process is ready for ptrace operations.
Next, the function arch_ptrace is called, which is supplied by the CPU architecture specific code. For our purposes, this is the x86 ptrace code, which can be found in arch/x86/kernel/ptrace.c.
If you follow the huge switch statement in arch_ptrace to the bottom, you’ll see that there is no case for PTRACE_SYSCALL, so the default case will execute, which simply hands execution back up to the generic ptrace code in a function called ptrace_request.
This is where the handling of PTRACE_SYSCALL diverges from the attach path, which never gets this far.
In ptrace_request, for the case of PTRACE_SYSCALL, ptrace_resume is called.
This function starts by setting the TIF_SYSCALL_TRACE flag on the thread info structure for the tracee.
A few possible states are checked (as other functions might call ptrace_resume) and finally the tracee is woken up and execution resumes until the tracee enters a system call.
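The flag handling at the start of ptrace_resume can be modeled in a few lines of user-space C. This is a toy model under stated assumptions, not kernel code: the struct, the enum, and the bit number are invented for illustration, and the clearing branch reflects my understanding that resume requests other than PTRACE_SYSCALL turn syscall tracing back off.

```c
#include <stdbool.h>

/* Illustrative stand-ins; these are not the kernel's definitions. */
enum model_request { MODEL_CONT, MODEL_SYSCALL };
#define MODEL_TIF_SYSCALL_TRACE 0   /* bit number chosen for this sketch */

struct model_thread_info {
    unsigned long flags;
};

/* Models the first step of a resume: a syscall-tracing request sets
 * the flag, any other resume request clears it. */
void model_resume(struct model_thread_info *ti, enum model_request req)
{
    if (req == MODEL_SYSCALL)
        ti->flags |= 1UL << MODEL_TIF_SYSCALL_TRACE;
    else
        ti->flags &= ~(1UL << MODEL_TIF_SYSCALL_TRACE);
}

/* Reports whether the modeled task would stop at its next system call. */
bool model_syscall_traced(const struct model_thread_info *ti)
{
    return (ti->flags & (1UL << MODEL_TIF_SYSCALL_TRACE)) != 0;
}
```

Because the flag lives on the tracee’s own thread info, nothing else needs to be recorded at this point: the tracee carries the request with it until its next system call.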
Entering system calls
So, we’ve seen how the kernel tracks when a process’s system calls should be traced by setting a flag (TIF_SYSCALL_TRACE) on a thread info structure associated with a process.
The question then follows: when is that flag checked and acted upon?
Whenever a system call is made by a program, there is CPU architecture specific code that is executed on the kernel side prior to the execution of the system call itself.
The code that is executed on x86 CPUs when a system call is made is written in assembly.
If you take a look at the code for the assembly function system_call, you will see that about 20 lines into the function a flag called _TIF_WORK_SYSCALL_ENTRY is checked.
If this flag is set, execution moves to tracesys. Looking at the definition of this flag, you can see that it is actually a combination of several flags, including _TIF_SYSCALL_TRACE, the mask form of the TIF_SYSCALL_TRACE flag we saw getting set earlier.
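To make the combined-flag idea concrete, here is a small user-space sketch of testing several flags with a single mask. Everything prefixed with DEMO_ or demo_ is invented for illustration, including the bit positions and the placeholder flags; the kernel’s actual definitions live in its x86 headers and are not reproduced here.

```c
#include <stdbool.h>

/* Bit positions invented for this sketch; they are not the kernel's values. */
#define DEMO_TIF_SYSCALL_TRACE (1UL << 0)
#define DEMO_TIF_OTHER_A       (1UL << 1)  /* placeholder for another entry-time flag */
#define DEMO_TIF_OTHER_B       (1UL << 2)  /* placeholder for another entry-time flag */
#define DEMO_TIF_UNRELATED     (1UL << 3)  /* a flag the entry path ignores */

/* One mask covering every flag that forces the slower, traced entry path. */
#define DEMO_TIF_WORK_SYSCALL_ENTRY \
    (DEMO_TIF_SYSCALL_TRACE | DEMO_TIF_OTHER_A | DEMO_TIF_OTHER_B)

/* A single AND against the mask answers "is any of these flags set?". */
bool demo_takes_traced_path(unsigned long flags)
{
    return (flags & DEMO_TIF_WORK_SYSCALL_ENTRY) != 0;
}
```

Checking one mask keeps the common, untraced case cheap: a process with none of the flags set pays for a single test before its system call runs.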
And so: on every system call made, the thread info structure for a process has its flags checked for _TIF_SYSCALL_TRACE. If the flag is set, execution moves to tracesys, which calls into the function that handles tracing on system call entry.
This function is actually defined in the CPU specific ptrace code found in arch/x86/kernel/ptrace.c.
It checks to see if the _TIF_SYSCALL_TRACE flag is set and, if so, calls tracehook_report_syscall_entry.
tracehook_report_syscall_entry is a static-inline wrapper function from include/linux/tracehook.h and it has some great documentation.
It calls ptrace_report_syscall, which is defined in the same file, but near the top.
The ptrace_report_syscall function matches what was described earlier: a SIGTRAP is generated when a traced process enters a system call.
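From the tracer’s side, that SIGTRAP shows up in the status returned by waitpid. The sketch below shows one way a tracer might recognize it. The helper names are made up, and the PTRACE_O_TRACESYSGOOD variant, which the ptrace man page describes as setting bit 7 of the signal number, is included as an aside that the text above does not cover.

```c
#include <signal.h>
#include <stdbool.h>
#include <sys/wait.h>

/* Signal number reported for a syscall stop: plain SIGTRAP, or SIGTRAP
 * with bit 7 set when the tracer enabled PTRACE_O_TRACESYSGOOD. */
int syscall_stop_signal(bool tracesysgood)
{
    return SIGTRAP | (tracesysgood ? 0x80 : 0);
}

/* True if a waitpid() status describes a syscall stop in the given mode. */
bool is_syscall_stop(int status, bool tracesysgood)
{
    return WIFSTOPPED(status) &&
           WSTOPSIG(status) == syscall_stop_signal(tracesysgood);
}
```

The second mode exists because a plain SIGTRAP is ambiguous: the tracee can receive one for other reasons, and the extra bit lets the tracer tell syscall stops apart.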
The signal is raised through ptrace_notify. It calls ptrace_do_notify, which prepares a signal info structure for delivery to the process by ptrace_stop, which is found in the same file as ptrace_notify.
Once the tracee receives a SIGTRAP, the tracee is stopped and the tracer is notified that a signal is pending for the process. The tracer can then examine the state of the tracee and print register values, timestamps, or other information.
This is how strace prints its information to the terminal when you trace a process.
syscall_trace_leave for the exit path
A similar code path is executed for the exit path of the system call:
- syscall_trace_leave is called by the assembly code
- this function calls the exit-path counterpart of tracehook_report_syscall_entry
- which also calls ptrace_report_syscall, just like the entry path.
And this is how tracing processes are notified when a system call completes so they can harvest the return value, timing information, or anything else needed to print useful output for the user.
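Harvesting a return value at the exit stop might look like the sketch below. It is x86-64 only and rests on a few assumptions: the helper name is made up, the tracee is a forked child using PTRACE_TRACEME rather than an attached process, and entry and exit stops are told apart by simple alternation, which is a common simplification rather than a robust method.

```c
#include <signal.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/user.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper, x86-64 only: traces a child that calls getppid()
 * and returns the value seen at that call's exit stop, or -1 on error. */
long traced_getppid_result(void)
{
#if defined(__x86_64__)
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        if (ptrace(PTRACE_TRACEME, 0, NULL, NULL) < 0)
            _exit(1);
        raise(SIGSTOP);
        getppid();
        _exit(0);
    }

    int status;
    int in_syscall = 0;   /* 1 after an entry stop, 0 after an exit stop */
    long result = -1;

    if (waitpid(pid, &status, 0) < 0 || !WIFSTOPPED(status))
        return -1;

    for (;;) {
        if (ptrace(PTRACE_SYSCALL, pid, NULL, NULL) < 0)
            break;
        if (waitpid(pid, &status, 0) < 0)
            break;
        if (WIFEXITED(status) || WIFSIGNALED(status))
            return result;
        if (!WIFSTOPPED(status) || WSTOPSIG(status) != SIGTRAP)
            continue;

        in_syscall = !in_syscall;
        if (in_syscall)
            continue;     /* entry stop: no return value yet */

        struct user_regs_struct regs;
        if (ptrace(PTRACE_GETREGS, pid, NULL, &regs) < 0)
            break;
        /* orig_rax holds the syscall number, rax the return value. */
        if (regs.orig_rax == SYS_getppid)
            result = (long)regs.rax;
    }

    /* Error path: do not leave a stopped tracee behind. */
    kill(pid, SIGKILL);
    waitpid(pid, &status, 0);
    return -1;
#else
    return -1;            /* register layout below is x86-64 specific */
#endif
}
```

Since the tracee is a direct child of the tracer, the value captured should equal the tracer’s own process ID, which makes the result easy to sanity-check.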
ptrace is an incredibly useful system call for debuggers, tracers, and other system programs that need to extract useful information from programs.
strace is implemented primarily by relying on ptrace. The ptrace internals are a bit tricky, as execution is transferred between a set of files, but the implementation itself is relatively straightforward.
I also encourage you to check out the source code for your favorite debugger and see how it uses ptrace to examine program state, modify registers and memory, etc.