The history of a linux kernel design flaw - DevConf.CZ 2020

4 years ago

Speakers
Speakers: Dmitry Levin This talk about fixing long-standing bugs in the Linux kernel is based on a real story of a design flaw in the Linux kernel on x86-64 architecture. That design flaw existed since the beginning of Linux x86-64 support in 2001 and was finally fixed in 2019. 18 years ago, when the first Linux kernel with x86-64 architecture support was released, it was capable of running processes that execute native x86-64 CPU instructions, and processes that execute legacy x86 CPU instructions. This feature was very popular in the early years of x86-64 architecture when the amount of already existing legacy 32-bit binary code exceeded the amount of native 64-bit binary code. The feature was implemented in a way that allowed native 64-bit processes to invoke both native 64-bit system calls and compat 32-bit system calls - a fact that was not widely known and caused quite a few surprises for many years to come. At the same time Linux kernel provided no API that would allow user processes to determine in a reliable way whether the system call being invoked is a native 64-bit system call or a compat 32-bit one. For this reason debuggers and system call tracers traditionally decided on the bitness of syscalls by looking at the value of CS register that describes the bitness of processes. While this approach works in most cases, it fails miserably when the assumption that system call bitness matches the process bitness is not valid. This problem was reported many years ago. The first report I'm aware of is the bug #459820 reported in 2008 to the Debian bug tracker. That bug report contains an example program which invokes a 32-bit 'fork' system call, deceiving strace to think it's a 64-bit 'open' system call with weird arguments and return code. The alternative method of obtaining system call bitness that is also known since the early years of x86-64 architecture is to fetch from the process memory and analyze the opcode of the CPU instruction that caused the system call invocation. This approach is also not reliable because reading process memory after the system call invocation is inherently racy, and in 2012 Linus Torvalds even produced a short example code that reliably deceives the tracer. The fact that Linux kernel provided no API to obtain this crucial piece of system call information in a reliable way was recognized by Linux kernel developers as a problem many years ago. For example, in the beginning of 2012 there was a lively discussion on that matter. Many kernel hackers took part, many interesting ideas were discussed, but, unfortunately, there were no follow-up because none of these people were interested enough to implement a solution for the problem. It took many years to find people who really care and were capable of delivering a fix. It was November 2018 when the first RFC patch to fix the problem by extending ptrace API was proposed by Elvira Khabirova. Shortly after the discussion that followed there was a second edition consisting of 16 patches, 15 of them were extending and fixing internal Linux kernel API on various architectures. As result of subsequent iterations the patchset grew further, it affected all architectures supported by the kernel and extended audit and ptrace subsystems. To get it accepted into the kernel, we had no other choice but to split it into parts and upstream them via appropriate maintainer trees. It was an amusing process full of pings. To cut story short, it took almost 9 months to get all 29 patches implementing PTRACE_GET_SYSCALL_INFO API merged into the kernel, the last patch of the series was accepted in July 2019, the first Linux kernel release with this feature is 5.3. PTRACE_GET_SYSCALL_INFO API is supported in strace starting with version 4.26 released in December 2018. strace performs a runtime check for PTRACE_GET_SYSCALL_INFO support in the kernel and automatically switches to use this API when it's available. Other userspace debuggers and tracers will follow. [ https://sched.co/YOrK ] -- Recordings of talks at DevConf are a community effort. Unfortunately not everything works perfectly every time. If you're interested in helping us improve, let us know.
Soyez le premier à laisser un commentaire