TL;DR
This blog post explains how strace works internally. We’ll examine the ptrace system call, which strace relies on, at the API layer and inside the kernel to understand exactly how strace gets information about the system calls being made by a running process.
ptrace
ptrace is a system call which a program can use to:
- trace system calls
- read and write memory and registers
- manipulate signal delivery to the traced process
As you can see, ptrace is a really useful system call for tracing and manipulating other programs. It is used by strace, GDB, and other tools as well.
You should take a look at the man page for ptrace for a lot of useful documentation.
ptrace and syscalls
For simplicity, let’s call the program which traces the system calls made by another program the tracer, and the program which is being traced the tracee.
The tracer can call ptrace with the PTRACE_ATTACH request and supply the process ID of the tracee. This is followed by another call to ptrace with the PTRACE_SYSCALL request and the same process ID.
The tracee will run until it enters a system call, at which point it will be stopped by the Linux kernel. To the tracer, it will appear as if the tracee has been stopped because it received a SIGTRAP signal. The tracer can then inspect the arguments to the system call and print any relevant information.
Next, the tracer can call ptrace with PTRACE_SYSCALL again, which will resume the tracee but cause it to be stopped by the kernel when it exits the system call.
This pattern continues to trace the entry and exit from system calls allowing the tracer to inspect the tracee and print arguments, return values, timing information, and more.
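The attach, resume, and wait pattern described above can be sketched from the tracer's side roughly as follows. This is a minimal sketch, not how strace itself is written: count_syscall_stops is a hypothetical helper, the forked child stands in for a real target process, and the pipe exists only so the child does not run ahead before the tracer has attached. Any signal other than SIGTRAP is simply swallowed here, which a real tracer would not do.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: fork a tracee, attach to it, and count how many
 * times it stops with SIGTRAP while being resumed with PTRACE_SYSCALL.
 * Returns the number of stops, or -1 if the setup (including the attach)
 * fails, for example because ptrace is not permitted on this system.
 */
int count_syscall_stops(void)
{
    int fds[2];
    int status;
    int stops = 0;
    pid_t pid;

    if (pipe(fds) < 0)
        return -1;

    pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    if (pid == 0) {
        /* Tracee: block until the tracer is ready, make a call, exit. */
        char byte;
        close(fds[1]);
        if (read(fds[0], &byte, 1) != 1)
            _exit(1);
        getpid();
        _exit(0);
    }

    close(fds[0]);

    /* Tracer: attach, then wait for the tracee to stop. */
    if (ptrace(PTRACE_ATTACH, pid, NULL, NULL) < 0 ||
        waitpid(pid, &status, 0) < 0) {
        kill(pid, SIGKILL);
        waitpid(pid, NULL, 0);
        close(fds[1]);
        return -1;
    }

    /* Let the tracee proceed past its read(). */
    if (write(fds[1], "x", 1) != 1) {
        /* The tracee will see end-of-file below and exit on its own. */
    }
    close(fds[1]);

    for (;;) {
        /* Resume the tracee until the next system call entry or exit. */
        if (ptrace(PTRACE_SYSCALL, pid, NULL, NULL) < 0)
            break;
        if (waitpid(pid, &status, 0) < 0)
            break;
        if (WIFEXITED(status) || WIFSIGNALED(status))
            return stops;
        if (WIFSTOPPED(status) && WSTOPSIG(status) == SIGTRAP)
            stops++;
    }

    /* Something failed mid-trace: make sure the child does not linger. */
    kill(pid, SIGKILL);
    waitpid(pid, NULL, 0);
    return stops;
}
```

Calling count_syscall_stops() from a main function and printing the result gives a rough count of the entry and exit stops the child went through. The exact number depends on where the child was when the attach landed.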
Now that the order of operations for working with ptrace has been outlined, let’s take a look at how this actually works under the hood in the kernel.
PTRACE_ATTACH
A good place to start looking is the code in the kernel for the ptrace system call. The code samples below will refer to the Linux Kernel 3.13 and there will be links to the source on GitHub.
We’ll start by understanding what PTRACE_ATTACH does.
The generic ptrace system call code can be found in kernel/ptrace.c. Take a look at the source on GitHub.
A few lines in, the source for ptrace checks the request parameter for PTRACE_ATTACH.
ptrace_attach
If it matches, ptrace_attach is called, which you can find toward the top of the file.
ptrace_attach does a few things early on:
- sets flags which will eventually be stored on a structure in the kernel representing the process that is being attached to
- ensures that the task it will attach to is not a kernel thread
- ensures that the task it will attach to is not a thread of the current process
- __ptrace_may_access does some security checks
After that, flags are set and the process is stopped.
In our case, that flag is PT_PTRACED.
The ptrace_attach function completes and execution resumes in ptrace.
Finishing the PTRACE_ATTACH from ptrace
Back in ptrace, a successful attach is followed by a call to arch_ptrace_attach, which gives the CPU architecture specific code a chance to do any book-keeping it needs. ptrace then jumps straight to its cleanup code and returns, so an attach request never reaches the rest of the function.
Now that we’ve seen how PTRACE_ATTACH works, we can start examining PTRACE_SYSCALL.
PTRACE_SYSCALL
The ptrace system call begins as it did in the PTRACE_ATTACH case. Since this is not an attach request, the next thing the source does is call ptrace_check_attach to ensure that the process is ready for ptrace operations.
Next, ptrace calls arch_ptrace, which is a function supplied by the CPU architecture specific code. For our purposes, this is the x86 ptrace code, which can be found in arch/x86/kernel/ptrace.c.
If you follow the huge switch statement in arch_ptrace to the bottom, you’ll see that there is no case for PTRACE_SYSCALL, so the default case will execute, which simply hands execution back up to the generic ptrace code in a function called ptrace_request.
In ptrace_request, for the case of PTRACE_SYSCALL, ptrace_resume is called.
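The hand-off between the architecture specific switch and the generic one can be pictured with a small user-space model. This is only a sketch of the routing idea, not kernel code: the model_ functions and the handled_by enum are hypothetical names, and PTRACE_PEEKUSER is used as an example of a request that, as far as I can tell, the x86 code handles itself.

```c
#include <sys/ptrace.h>

/* Hypothetical labels for "which layer dealt with the request". */
enum handled_by {
    HANDLED_BY_ARCH,    /* handled inside the model of arch_ptrace */
    HANDLED_BY_RESUME,  /* reached the model of ptrace_resume */
    HANDLED_BY_NOBODY   /* fell through every switch */
};

/* Model of the generic layer: requests that resume the tracee. */
enum handled_by model_ptrace_request(long request)
{
    switch (request) {
    case PTRACE_SYSCALL:
    case PTRACE_CONT:
        return HANDLED_BY_RESUME;
    default:
        return HANDLED_BY_NOBODY;
    }
}

/* Model of the architecture layer: anything unknown goes to the generic layer. */
enum handled_by model_arch_ptrace(long request)
{
    switch (request) {
    case PTRACE_PEEKUSER:
        return HANDLED_BY_ARCH;
    default:
        return model_ptrace_request(request);
    }
}
```

Passing PTRACE_SYSCALL to model_arch_ptrace falls through the first switch and lands in the resume branch of the second, which mirrors the path described above.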
ptrace_resume
This function starts by setting the TIF_SYSCALL_TRACE flag on the thread info structure for the tracee.
A few possible states are checked (as other functions might call ptrace_resume) and finally the tracee is woken up and execution resumes until the tracee enters a system call.
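A toy version of that flag handling might look like the following. It is a user-space model under stated assumptions: struct model_tracee, its fields, and the bit position of MODEL_TIF_SYSCALL_TRACE are all made up for illustration. The clearing of the flag for requests other than PTRACE_SYSCALL reflects my reading of the kernel source rather than anything stated above.

```c
#include <stdbool.h>
#include <sys/ptrace.h>

/* Illustrative bit; the real value lives in the kernel's thread info headers. */
#define MODEL_TIF_SYSCALL_TRACE (1UL << 0)

/* Hypothetical stand-in for the tracee's thread info and run state. */
struct model_tracee {
    unsigned long flags;
    bool running;
};

void model_ptrace_resume(struct model_tracee *t, long request)
{
    /* Only PTRACE_SYSCALL asks for a stop at the next system call. */
    if (request == PTRACE_SYSCALL)
        t->flags |= MODEL_TIF_SYSCALL_TRACE;
    else
        t->flags &= ~MODEL_TIF_SYSCALL_TRACE;

    /* Wake the tracee so it runs until something stops it again. */
    t->running = true;
}
```

The point of the model is that the resume request does no tracing itself. It only leaves a marker behind that later code will notice.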
Entering system calls
So, we’ve seen how the kernel tracks when a process’ system calls should be traced by setting a flag (TIF_SYSCALL_TRACE) on a thread info structure associated with a process.
The question then follows: when is that flag checked and acted upon?
Whenever a system call is made by a program, there is CPU architecture specific code that is executed on the kernel side prior to the execution of the system call itself.
The code that is executed on x86 CPUs when a system call is made is written in assembly and can be found in arch/x86/kernel/entry_64.S.
_TIF_WORK_SYSCALL_ENTRY
If you take a look at the code for the assembly function system_call, you will see that about 20 lines into the function a flag called _TIF_WORK_SYSCALL_ENTRY is checked.
If this flag is set, execution moves to tracesys. Looking at the definition of this flag, you can see that it is actually a combination of several flags, including the one we saw getting set earlier: _TIF_SYSCALL_TRACE.
And so: on every system call made, the thread info structure for a process has its flags checked against _TIF_WORK_SYSCALL_ENTRY, which includes _TIF_SYSCALL_TRACE. If any of those flags is set, execution moves to tracesys.
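The combined-mask check can be illustrated in plain C. This is a sketch with assumed values: the MODEL_ names and bit positions are placeholders, and the audit and seccomp bits are included only as examples of other flags that, to my knowledge, are part of the real mask.

```c
#include <stdbool.h>

/* Illustrative bits; the real definitions are in the kernel's thread info headers. */
#define MODEL_TIF_SYSCALL_TRACE (1UL << 0)
#define MODEL_TIF_SYSCALL_AUDIT (1UL << 7)
#define MODEL_TIF_SECCOMP       (1UL << 8)

/* One mask covering every reason to leave the fast path at syscall entry. */
#define MODEL_TIF_WORK_SYSCALL_ENTRY \
    (MODEL_TIF_SYSCALL_TRACE | MODEL_TIF_SYSCALL_AUDIT | MODEL_TIF_SECCOMP)

/* True if any bit in the mask is set, which is the test the entry code performs. */
bool model_takes_tracesys_path(unsigned long thread_flags)
{
    return (thread_flags & MODEL_TIF_WORK_SYSCALL_ENTRY) != 0;
}
```

Testing one mask instead of each flag in turn keeps the common case, where nothing is set, down to a single comparison.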
tracesys
If you read the source for tracesys, you can see that about 10 lines in, the function syscall_trace_enter is called.
This function is actually defined in the CPU specific ptrace code found in arch/x86/kernel/ptrace.c.
This function checks to see if the _TIF_SYSCALL_TRACE flag is set and if so, tracehook_report_syscall_entry is called.
tracehook_report_syscall_entry is a static-inline wrapper function from include/linux/tracehook.h and it has some great documentation.
It calls ptrace_report_syscall, which is defined in the same file, but near the top.
ptrace_report_syscall
The ptrace_report_syscall function matches what was described earlier: a SIGTRAP is generated when a traced process enters a system call.
It does so by calling ptrace_notify, which is implemented in kernel/signal.c.
As you can see, ptrace_notify calls ptrace_do_notify, which prepares a signal info structure for delivery to the process by ptrace_stop, which is also found in kernel/signal.c.
SIGTRAP
Once the tracee receives a SIGTRAP, the tracee is stopped and the tracer is notified that a signal is pending for the process. The tracer can then examine the state of the tracee and print register values, timestamps, or other information.
This is how strace prints its information to the terminal when you trace a process.
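From the tracer's point of view, the notification arrives as a wait status. A small classifier along these lines shows how such a status might be decoded. The tracee_event enum and classify_wait_status are hypothetical names, and treating every SIGTRAP stop as a system call stop is a simplification.

```c
#include <signal.h>
#include <sys/wait.h>

/* Hypothetical labels for what a wait status is telling the tracer. */
enum tracee_event {
    TRACEE_EXITED,
    TRACEE_KILLED,
    TRACEE_SYSCALL_STOP,
    TRACEE_OTHER_STOP,
    TRACEE_UNKNOWN
};

/* Decode a status value as filled in by waitpid(). */
enum tracee_event classify_wait_status(int status)
{
    if (WIFEXITED(status))
        return TRACEE_EXITED;
    if (WIFSIGNALED(status))
        return TRACEE_KILLED;
    if (WIFSTOPPED(status)) {
        /* Simplification: assume a SIGTRAP stop comes from system call tracing. */
        if (WSTOPSIG(status) == SIGTRAP)
            return TRACEE_SYSCALL_STOP;
        return TRACEE_OTHER_STOP;
    }
    return TRACEE_UNKNOWN;
}
```

A plain SIGTRAP stop can have other causes as well. The ptrace man page describes the PTRACE_O_TRACESYSGOOD option for telling system call stops apart from them.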
syscall_trace_leave for the exit path
A similar code path is executed for the exit path of the system call:
- syscall_trace_leave is called by the assembly code
- this function calls tracehook_report_syscall_exit
- which also calls ptrace_report_syscall, just like the entry path.
And this is how tracing processes are notified when a system call completes so they can harvest the return value, timing information, or anything else needed to print useful output for the user.
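Because the entry and exit paths both end in the same report, a tracer generally has to remember which kind of stop it expects next. A minimal sketch of that book-keeping, using a hypothetical syscall_tracker type, could be:

```c
#include <stdbool.h>

/* Hypothetical per-tracee state kept by the tracer. */
struct syscall_tracker {
    bool in_syscall; /* true between an entry stop and its exit stop */
};

/*
 * Record one system call stop. Returns true if the stop is an entry,
 * false if it is the matching exit.
 */
bool note_syscall_stop(struct syscall_tracker *t)
{
    t->in_syscall = !t->in_syscall;
    return t->in_syscall;
}
```

On an entry stop the tracer would read the arguments, and on the following exit stop it would read the return value.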
Conclusion
ptrace is an incredibly useful system call for debuggers, tracers, and other system programs that need to extract useful information from programs. strace is implemented primarily by relying on ptrace.
ptrace internals are a bit tricky, as execution is transferred between a set of files, but the implementation itself is relatively straightforward.
I also encourage you to check out the source code for your favorite debugger and see how it uses ptrace to examine program state, modify registers and memory, etc.