eBPF - a new Swiss army knife in the system
eBPF is a revolutionary technology that has become a crucial tool for system-level work. According to the official eBPF site[1], it can run sandboxed programs in a privileged context, such as the operating system kernel. This allows for safe and efficient kernel extension capabilities without the need to alter kernel source code or load kernel modules.
But how exactly does eBPF work, and why is it so important? In this article, I will provide a brief introduction to eBPF, discuss several interesting projects that utilise eBPF, and demonstrate how eBPF tools can be used to monitor CUDA functions in a GPU.
Introduction to eBPF
BPF: BSD Packet Filter
Back in 1992, a paper titled 'The BSD Packet Filter: A New Architecture for User-level Packet Capture[2]' addressed the challenge of efficiently discarding unwanted packets in a timely manner. To tackle this problem, the paper proposed the design of a 'filter machine' that took both performance and usability into account.
Essentially, the filter machine comprises two main components: the network tap and the packet filter. The network tap collects copies of packets from network device drivers and delivers them to listening applications. On the other hand, the packet filter is a boolean-based function that evaluates incoming packets. If the result of the function is true, the kernel copies the packet for the application to consume. If the result is false, the packet is ignored. This approach significantly improves the efficiency and effectiveness of packet filtering.
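The tap-and-filter split described above can be sketched in a few lines of Python. This is only an illustrative model, not the code from the paper: the `Packet`, `Listener`, and `NetworkTap` names are hypothetical, and real BPF filters operate on raw packet bytes inside the kernel rather than on Python objects.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Packet:
    proto: str      # e.g. "tcp" or "udp"
    dst_port: int
    payload: bytes


# A packet filter is a boolean predicate over a packet.
PacketFilter = Callable[[Packet], bool]


class Listener:
    """A user-space application with its own filter and receive queue."""

    def __init__(self, name: str, accept: PacketFilter) -> None:
        self.name = name
        self.accept = accept
        self.received: List[Packet] = []


class NetworkTap:
    """Offers every packet seen by the driver to each attached listener."""

    def __init__(self) -> None:
        self.listeners: List[Listener] = []

    def attach(self, listener: Listener) -> None:
        self.listeners.append(listener)

    def on_packet(self, packet: Packet) -> None:
        for listener in self.listeners:
            # The packet is copied to the listener only if its filter says True.
            if listener.accept(packet):
                listener.received.append(packet)


tap = NetworkTap()
web = Listener("web", lambda p: p.proto == "tcp" and p.dst_port == 80)
dns = Listener("dns", lambda p: p.proto == "udp" and p.dst_port == 53)
tap.attach(web)
tap.attach(dns)

for pkt in [
    Packet("tcp", 80, b"GET /"),
    Packet("udp", 53, b"query"),
    Packet("tcp", 22, b"ssh"),
]:
    tap.on_packet(pkt)

print(len(web.received), len(dns.received))
```

The SSH packet matches neither filter, so it is never copied. Avoiding that copy is the saving the paper is after.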
A prime example of a tool that employs the BPF approach is tcpdump. Based on BPF, tcpdump operates in user space and sends filter instructions to BPF while also receiving filtered packets. In other words, tcpdump leverages BPF's capabilities to capture and display network packets that match user-specified criteria. This makes tcpdump an essential tool for network administrators and security professionals who need to monitor network traffic for troubleshooting, analysis, and security reasons.
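To give a feel for what "filter instructions" look like, here is a toy interpreter for a tiny subset of classic BPF. It is a sketch under assumptions: the tuple encoding, the `run_filter` function, and the `IPV4_ONLY` program are made up for illustration and only loosely mirror the shape of the listing that tcpdump can print for a filter such as `ip`. The real instruction set has many more opcodes.

```python
import struct

# Toy instruction format: (opcode, operand k, jump-if-true, jump-if-false).
# Jump offsets are relative to the next instruction.
IPV4_ONLY = [
    ("ldh", 12, 0, 0),       # load the 16-bit EtherType at byte offset 12
    ("jeq", 0x0800, 0, 1),   # IPv4? continue : skip one instruction
    ("ret", 262144, 0, 0),   # accept: non-zero means "keep up to k bytes"
    ("ret", 0, 0, 0),        # reject: zero means "drop the packet"
]


def run_filter(program, packet: bytes) -> int:
    """Run a toy filter program over raw packet bytes and return its verdict."""
    acc = 0  # the accumulator register
    pc = 0   # program counter
    while pc < len(program):
        op, k, jt, jf = program[pc]
        pc += 1
        if op == "ldh":
            if k + 2 > len(packet):
                return 0  # toy rule: an out-of-bounds load rejects the packet
            acc = struct.unpack_from("!H", packet, k)[0]
        elif op == "jeq":
            pc += jt if acc == k else jf
        elif op == "ret":
            return k
        else:
            raise ValueError(f"unknown opcode {op!r}")
    return 0  # falling off the end also rejects


# Minimal Ethernet frames: 6-byte dst MAC, 6-byte src MAC, 2-byte EtherType.
ipv4_frame = bytes(12) + b"\x08\x00" + b"ip payload"
arp_frame = bytes(12) + b"\x08\x06" + b"arp payload"

print(run_filter(IPV4_ONLY, ipv4_frame), run_filter(IPV4_ONLY, arp_frame))
```

In this model, a user-space tool only has to hand over the instruction list. The decision to keep or drop each packet is then made without the packet ever leaving the kernel.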
This idea has had a broader impact: researchers and developers now want to monitor much more than network packets in the system.
That is where eBPF, the extended Berkeley Packet Filter, comes in.
eBPF
In 2014, Alexei Starovoitov implemented eBPF on top of the earlier BPF design[3]. To make BPF more powerful, eBPF introduced more registers, widened the registers from 32 to 64 bits, and made more kernel functions available through the bpf_call function.[4]
By design, BPF handles incoming filter instructions that are written in a compact instruction set. Before execution, these instructions must first be verified by a separate component, the verifier, to ensure they will not harm the running system. Once verified, the instructions are passed to the BPF JIT, which appears as a tiny virtual machine within the system. During execution, BPF instructions can access certain registers and call BPF helper functions, as well as obtain information from other event signals. Overall, this approach enables BPF to provide a secure and efficient way to extend the capabilities of the kernel without requiring any changes to its source code or loading kernel modules.
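The verify-then-execute flow can be illustrated with a toy virtual machine. Everything here is an assumption made for the sketch: the instruction names, the `get_pid` helper, and the rules in `verify` are hypothetical and far simpler than the real eBPF verifier (which, for example, also tracks register state and memory bounds). The point is only that a program is rejected before it ever runs.

```python
NUM_REGS = 11  # eBPF exposes r0..r10; this toy VM only mirrors the count

# Hypothetical helper table: the only "kernel functions" a program may call.
HELPERS = {
    "get_pid": lambda regs: 4242,  # made-up constant instead of a real PID
}


class VerifierError(Exception):
    """Raised when a program is rejected before execution."""


def _check_reg(reg: int) -> None:
    if not 0 <= reg < NUM_REGS:
        raise VerifierError(f"invalid register r{reg}")


def verify(program) -> None:
    """Statically reject programs that could misbehave (toy rules)."""
    if not program or program[-1][0] != "exit":
        raise VerifierError("program must end with exit")
    for pc, (op, *args) in enumerate(program):
        if op == "mov":
            _check_reg(args[0])
        elif op == "add":
            _check_reg(args[0])
            _check_reg(args[1])
        elif op == "ja":
            target = pc + 1 + args[0]
            # Toy simplification: only forward, in-range jumps, so no loops.
            if args[0] < 0 or target >= len(program):
                raise VerifierError("backward or out-of-range jump")
        elif op == "call":
            if args[0] not in HELPERS:
                raise VerifierError(f"unknown helper {args[0]!r}")
        elif op != "exit":
            raise VerifierError(f"unknown opcode {op!r}")


def run(program) -> int:
    """Verify first, then interpret; the result is whatever ends up in r0."""
    verify(program)
    regs = [0] * NUM_REGS
    pc = 0
    while True:
        op, *args = program[pc]
        pc += 1
        if op == "mov":
            regs[args[0]] = args[1]
        elif op == "add":
            regs[args[0]] += regs[args[1]]
        elif op == "ja":
            pc += args[0]
        elif op == "call":
            regs[0] = HELPERS[args[0]](regs)  # helper result lands in r0
        elif op == "exit":
            return regs[0]


good = [("call", "get_pid"), ("mov", 1, 1), ("add", 0, 1), ("exit",)]
looping = [("mov", 0, 0), ("ja", -2), ("exit",)]

print(run(good))
try:
    run(looping)
except VerifierError as err:
    print("rejected:", err)
```

Because jumps may only go forward and the last instruction must be `exit`, every accepted program is guaranteed to terminate. That is the kind of property a verifier exists to establish.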
Thanks to its innovative and creative implementations, eBPF can provide a powerful ability to observe and monitor the entire system. To facilitate this process, eBPF also offers a suite of other infrastructure components to operate effectively and efficiently[1][5], including:
- BPF Map: eBPF programs require the ability to share collected data and store state information accurately. In this regard, eBPF maps serve as crucial components that allow eBPF programs to store and retrieve data in a broad set of data structures.
- Helper Function: This feature enables the eBPF virtual machine to interact with the kernel seamlessly. As of early 2023, there are over 220 eBPF Helpers in the kernel.
- Tail call: eBPF programs can also leverage the tail call mechanism, which enables them to call other eBPF programs efficiently. Similar to execve, this mechanism allows the program to replace its content and proceed to the next program.
- LLVM backend: clang can now generate eBPF bytecode and eBPF object files.
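Two of the components above, maps and tail calls, can be modelled together in a short sketch. This is a plain-Python analogy, not the kernel API: the `counters` dictionary stands in for a BPF hash map, `prog_array` stands in for a program-array map, and `tail_call`, `entry`, and `handle_tcp` are hypothetical names. A successful tail call hands control to the target program for good, while a miss lets the caller fall through.

```python
from collections import defaultdict

counters = defaultdict(int)  # stands in for a BPF hash map shared with user space
prog_array = {}              # stands in for a program-array map: slot -> program

TCP_SLOT, UDP_SLOT = 0, 1
_MISS = object()             # sentinel: the requested slot was empty


def tail_call(ctx, slot):
    """Jump to the program in `slot`; report a miss if the slot is empty."""
    target = prog_array.get(slot)
    if target is None:
        return _MISS
    return target(ctx)


def handle_tcp(ctx):
    counters["tcp"] += 1
    return "handled by tcp program"


def entry(ctx):
    """Entry program: count the event, then dispatch by protocol."""
    counters["seen"] += 1
    slot = TCP_SLOT if ctx["proto"] == "tcp" else UDP_SLOT
    result = tail_call(ctx, slot)
    if result is not _MISS:
        return result            # the target program's verdict replaces ours
    counters["fallthrough"] += 1  # only reached when the tail call missed
    return "no handler"


# "User space" installs a handler for TCP only, then reads the map afterwards.
prog_array[TCP_SLOT] = handle_tcp
results = [entry({"proto": p}) for p in ("tcp", "udp", "tcp")]
snapshot = dict(counters)
print(results, snapshot)
```

The map is the only channel between the two sides in this model. The programs update it as events arrive, and the observer reads a snapshot whenever it likes.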
What can eBPF do?
Overall, eBPF serves three main purposes: networking, observability, and security.
In terms of networking, eBPF allows for custom packet filtering rules that enable the creation of high-performance networking capabilities. By handling network packets within the kernel, eBPF avoids costly transitions to and from user space, thus providing efficient and effective packet filtering.
Regarding observability, eBPF's powerful ability to observe the system allows for the creation of comprehensive monitoring and debugging tools. With eBPF, developers can snoop specific packets and calls between user space and kernel space, providing richer and more detailed troubleshooting information. Additionally, tracing and profiling are made more accessible and flexible for developers with eBPF.
Lastly, eBPF has also been found to be useful in the realm of security. By leveraging eBPF's observability capabilities, enterprises can detect and even prevent various types of malicious activities originating from within the kernel.
In summary, eBPF's observability feature is a crucial aspect of its overall usefulness, providing developers with deeper insights and greater flexibility when monitoring and debugging their systems.
Why do we need eBPF?
That raises the next question: why do we need eBPF? While it is true that some of the tasks above can also be accomplished within the Linux kernel itself, there are several reasons why eBPF is gaining popularity:
- Flexibility: eBPF allows for more flexibility in terms of what can be monitored and how it can be monitored. It allows for monitoring at various layers of the stack, including the network layer, application layer, and kernel layer.
- Safety: eBPF provides a safe way to execute custom code within the kernel without compromising system stability or security.
- Performance: eBPF is designed to be highly performant and efficient, allowing for real-time monitoring and analysis with minimal overhead.
- Portability: eBPF programs can be written once and run on any Linux kernel version that supports eBPF.
The primary objective of the Linux kernel is to offer a consistent API (system calls) that abstracts the underlying hardware or virtual hardware and facilitates resource sharing among applications. To achieve this goal, the Linux kernel relies on various subsystems and layers that allocate different responsibilities. Typically, each subsystem allows some degree of configuration to cater to different user needs. However, in cases where the desired behavior cannot be configured, there are traditionally two options: either change the kernel source code and advocate for the modification to be accepted by the Linux kernel community (which can take years), or write a kernel module and maintain it regularly to avoid compatibility issues, potentially risking the security of the Linux kernel.
Practically, neither of these options is widely adopted. The former is too expensive, while the latter is not particularly portable. Fortunately, eBPF provides a new option: users can now write programs in eBPF instructions and load them into the eBPF virtual machine. This approach offers maximum programmability and flexibility while still retaining the virtual machine design, making programs portable in much the same way that the JVM does for Java bytecode. This feature is typically referred to as BPF CO-RE (Compile Once - Run Everywhere). As a result, more and more services and tools are adopting eBPF to take advantage of its programmability and portability.
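The idea behind "compile once, run everywhere" can be shown with a small model. The sketch below rests on stated assumptions: the two layouts, the `PROGRAM` format, and the `load` and `run` functions are hypothetical, and real CO-RE relies on kernel type information and loader-side relocations rather than Python dictionaries. It only shows why referring to a field by name, and resolving its offset at load time, keeps one compiled program valid across kernels whose structures differ.

```python
import struct

# Hypothetical struct layouts for two kernel versions:
# field name -> (byte offset, struct format of the field)
KERNEL_A = {"state": (0, "<q"), "pid": (8, "<i")}
KERNEL_B = {"state": (0, "<q"), "flags": (8, "<i"), "pid": (12, "<i")}

# The "compiled" program names the field it wants instead of hardcoding an offset.
PROGRAM = [("read_field", "pid")]


def load(program, layout):
    """Relocate symbolic field accesses to concrete offsets for one kernel."""
    relocated = []
    for op, field in program:
        if field not in layout:
            raise KeyError(f"field {field!r} does not exist on this kernel")
        offset, fmt = layout[field]
        relocated.append((op, offset, fmt))
    return relocated


def run(relocated, memory: bytes):
    """Execute the relocated reads against a raw memory image."""
    return [struct.unpack_from(fmt, memory, offset)[0]
            for _op, offset, fmt in relocated]


# The same logical record, laid out differently in memory on each kernel.
task_on_a = struct.pack("<qi", 0, 1234)
task_on_b = struct.pack("<qii", 0, 0x40, 1234)

print(run(load(PROGRAM, KERNEL_A), task_on_a))
print(run(load(PROGRAM, KERNEL_B), task_on_b))
```

Had the program hardcoded offset 8, it would have read the `flags` field on the second layout. Resolving the offset at load time avoids that silent error.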
Profile in a flash
For developers and researchers, achieving optimal performance is crucial. To improve performance, it's necessary to identify the hotspots in the code and optimize them. Profiling tools have been designed to assist in such scenarios. Popular tools such as Intel VTune, gprof, nvprof, and others are designed to observe the executable program. However, these tools require the code to be run from the beginning, and in some cases, the code needs to be rebuilt or recompiled with specific options.
With the introduction of eBPF, the game has changed. It can observe all system calls and fetch the arguments of each call without needing the program to be rerun with tracing enabled. The ability to capture system call data in real-time with eBPF allows for more efficient and comprehensive tracing, as well as the ability to analyze and manipulate the data being captured. Moreover, eBPF can be used for other purposes beyond tracing, making custom statistics available.
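As a rough picture of what "custom statistics" can mean, the sketch below aggregates a stream of system call events into per-call counts and a power-of-two latency histogram, a summary style that eBPF tracing tools often print. The events are made up for illustration and the aggregation runs in ordinary Python. In a real tool, the counting would happen inside the kernel and only the summary would be read from user space.

```python
from collections import Counter


def log2_bucket(value: int) -> int:
    """Index i of the power-of-two bucket [2**i, 2**(i+1)) that holds value."""
    return max(value, 1).bit_length() - 1


# Hypothetical events: (system call name, latency in microseconds).
events = [("read", 3), ("read", 5), ("write", 17), ("read", 900), ("openat", 40)]

calls = Counter()
latency_hist = Counter()
for name, latency_us in events:
    calls[name] += 1
    latency_hist[log2_bucket(latency_us)] += 1

# Print one row per non-empty bucket, lowest latencies first.
for bucket in sorted(latency_hist):
    low, high = 2 ** bucket, 2 ** (bucket + 1) - 1
    print(f"{low:>5} -> {high:<5} : {'*' * latency_hist[bucket]}")
```

Keeping only bucket counts rather than every event is what makes this kind of tracing cheap enough to leave running on a busy system.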
How to use eBPF
BCC
The BCC[6] (BPF Compiler Collection) is a comprehensive toolkit designed for the creation of efficient kernel tracing and manipulation programs. It comes with a variety of helpful tools and examples to streamline the process.
BCC simplifies BPF program development by allowing kernel instrumentation in C, providing a C wrapper around LLVM, and offering front-ends in Python and Lua. It is well-suited for various tasks such as performance analysis, network traffic control, and more[7].
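A minimal sketch of this split between a C kernel part and a Python front end is shown below. It is not one of the original examples from this article. It assumes BCC is installed and that `run_trace` is called with root privileges, and the names `BPF_PROGRAM`, `hello`, and `run_trace` are chosen here for illustration. The BCC import is deliberately placed inside the function so that the file can be read and imported on a machine without BCC.

```python
# Kernel-side part: a small C function that BCC compiles when the program loads.
BPF_PROGRAM = r"""
int hello(void *ctx) {
    bpf_trace_printk("clone() called\n");
    return 0;
}
"""


def run_trace() -> None:
    """Attach `hello` to the clone system call and stream the trace output.

    Assumes BCC is installed and the caller has root privileges.
    Runs until interrupted (Ctrl-C).
    """
    from bcc import BPF  # imported lazily: only needed when actually tracing

    b = BPF(text=BPF_PROGRAM)
    # Resolve the kernel's name for the clone system call, then attach a kprobe.
    b.attach_kprobe(event=b.get_syscall_fnname("clone"), fn_name="hello")
    b.trace_print()
```

Calling `run_trace()` as root and then starting a new process in another terminal should produce one trace line per `clone` call, although the exact output format depends on the kernel and the BCC version.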
Installation steps are listed here: https://github.com/iovisor/bcc/blob/master/INSTALL.md
Within BCC, there is an array of pre-built tools that are user-friendly. However, the compilation process can be somewhat tedious. On most systems, it is necessary to install the appropriate Linux-header packages. This can be challenging for some virtual machines, and occasionally, the installation process can be time-consuming and resource-intensive.
Recommended resources for BCC:
- bcc Reference Guide (googlesource.com)
- iovisor/bcc: BCC - Tools for BPF-based Linux IO analysis, networking, monitoring, and more (github.com)
BCC example 1
BCC example 2
BCC example 3
Call BCC execsnoop directly:
Bpftrace
Bpftrace[8] is a high-level tracing language for Linux enhanced Berkeley Packet Filter (eBPF) available in recent Linux kernels (4.x). bpftrace uses LLVM as a backend to compile scripts to BPF-bytecode and makes use of BCC for interacting with the Linux BPF system, as well as existing Linux tracing capabilities: kernel dynamic tracing (kprobes), user-level dynamic tracing (uprobes), and tracepoints. The bpftrace language is inspired by awk and C, and predecessor tracers such as DTrace and SystemTap.
One of the main benefits of bpftrace is its compatibility with different Linux kernel versions, which makes it a versatile tool for debugging complex system problems across different Linux distributions such as Ubuntu, CentOS, and Debian. With bpftrace, users can write and execute lightweight scripts that can trace kernel functions, system calls, and user-space functions.
These scripts can be used to monitor network activity, debug performance bottlenecks, and detect security threats. Moreover, bpftrace can help users troubleshoot various problems, such as memory leaks, network congestion, and application crashes. It can also be utilised to identify rogue processes, analyse system logs, and track down performance issues in real-time.
In summary, bpftrace represents a reliable and efficient solution for developers, system administrators, and security professionals seeking to diagnose and fix complex system problems. Its powerful capabilities, ease of use, and compatibility with different Linux kernel versions make it an indispensable tool in the Linux ecosystem.
The GitHub repository lists some one-liners as examples: One Liners
libbpf
BCC and Bpftrace are undoubtedly powerful tools for tracing and monitoring performance on Linux systems. However, their use cases are limited by predefined constraints, which in turn hinders their extensibility.
That's where libbpf comes in, which offers a standardised library for developing and sharing BPF (Berkeley Packet Filter) programs across Linux systems. The authoritative source code for it is developed as part of the bpf-next Linux source tree under the tools/lib/bpf subdirectory and is periodically synced to GitHub.
However, developing directly against a Linux kernel library can be a daunting task. Fortunately, the scaffolding framework known as libbpf-bootstrap can significantly aid developers in building their own tools. This framework offers numerous examples in its repository, which makes the development process less complicated. By following these examples, developers can quickly get up to speed and start creating their own custom tools.
Recommended resources:
- Building BPF applications with libbpf-bootstrap (nakryiko.com)
libbpf-bootstrap example
Perf-tools
Link: https://github.com/brendangregg/perf-tools
A miscellaneous collection of in-development and unsupported performance analysis tools for Linux ftrace and perf_events (aka the "perf" command). Both ftrace and perf are core Linux tracing tools, included in the kernel source. Your system probably has ftrace already, and perf is often just a package add (see Prerequisites).
These tools are designed to be easy to install (fewest dependencies), provide advanced performance observability, and be simple to use: do one thing and do it well. This collection was created by Brendan Gregg (author of the DTraceToolkit).
Many of these tools employ workarounds so that functionality is possible on existing Linux kernels. Because of this, many tools have caveats (see man pages), and their implementation should be considered a placeholder until future kernel features, or new tracing subsystems, are added.
perf-tools example
pycuda.py:
GPTtrace
Project link: https://github.com/eunomia-bpf/GPTtrace
Even though BCC and bpftrace simplify eBPF programming, it still takes a lot of time to read through the manual of each tool. Furthermore, eBPF programming requires a strong understanding of the underlying Linux kernel and networking stack, making it a challenging task for beginners. Additionally, debugging eBPF programs can be difficult due to limited visibility into the kernel.
GPTtrace might help: it uses ChatGPT to generate eBPF programs and run traces from natural-language descriptions.
What's next?
After this quick overview of eBPF, we can see that observing the system from many different angles has become easier than ever. Admittedly, eBPF has some shortcomings. The set of BPF helper functions is still limited, many eBPF hooks are read-only, and every eBPF program still has to go through strict verification.
Furthermore, eBPF is a relatively new technology and there is still a lack of documentation and resources available for developers to learn how to use it effectively. This can make it difficult for organisations to adopt eBPF as part of their infrastructure.
Despite these shortcomings, eBPF offers researchers a way to gain quick, insightful views of their programs, and it depicts a modern view of the operating system. By allowing researchers to dynamically trace and analyse kernel-level events and functions, eBPF provides a powerful tool for understanding program behaviour and identifying performance bottlenecks. This real-time visibility can help researchers optimise their code and improve overall system efficiency.
Additionally, eBPF's flexibility and programmability make it well-suited for modern operating systems that require dynamic, adaptable solutions. Its ability to operate at the kernel level while maintaining safety and security ensures that it can be used effectively in a wide range of applications.
Overall, eBPF represents an exciting new direction for operating systems research, offering a unique combination of performance monitoring, analysis, and adaptability that can help researchers better understand and optimise their programs.
[1] What is eBPF? An Introduction and Deep Dive into the eBPF Technology
[2] https://www.tcpdump.org/papers/bpf-usenix93.pdf
[3] The untold story of BPF | Kernel Recipes 2022 (kernel-recipes.org)
[4] BPF Performance Tools (Book) (brendangregg.com)
[5] gojue/ebpf-slide: Collection of Linux eBPF slides/documents. (github.com)
[6] bcc Reference Guide (googlesource.com)
[7] iovisor/bcc: BCC - Tools for BPF-based Linux IO analysis, networking, monitoring, and more (github.com)
[8] https://github.com/iovisor/bpftrace