eBPF - a new Swiss army knife in the system

eBPF is a revolutionary technology that has become a crucial tool in system-level work. According to the official eBPF site[1], it can run sandboxed programs in a privileged context, such as the operating system kernel. This allows kernel capabilities to be extended safely and efficiently without altering kernel source code or loading kernel modules.

But how exactly does eBPF work, and why is it so important? In this article, I will provide a brief introduction to eBPF, discuss several interesting projects that utilise eBPF, and demonstrate how eBPF tools can be used to monitor CUDA functions in a GPU.

Introduction to eBPF

BPF: BSD Packet Filter

Back in 1992, a paper titled ‘The BSD Packet Filter: A New Architecture for User-level Packet Capture’[2] addressed the challenge of efficiently discarding unwanted packets in a timely manner. To tackle this problem, the paper proposed the design of a ‘filter machine’ that took both performance and usability into account.

Essentially, the filter machine comprises two main components: the network tap and the packet filter. The network tap collects copies of packets from network device drivers and delivers them to listening applications. On the other hand, the packet filter is a boolean-based function that evaluates incoming packets. If the result of the function is true, the kernel copies the packet for the application to consume. If the result is false, the packet is ignored. This approach significantly improves the efficiency and effectiveness of packet filtering.
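As a rough illustration (not the in-kernel implementation), the tap/filter split can be modelled in a few lines of Python: the tap sees every packet, and a boolean filter function decides which ones are copied up to the application. The packet fields and the DNS filter below are made up for the example.

```python
# Illustrative model of the BPF tap/filter split (not the real kernel code).
# The "tap" sees every packet; a boolean "filter" decides which packets are
# copied to the listening application, and the rest are ignored.

def udp_port_53_filter(packet):
    """Boolean filter: accept only UDP packets to or from port 53 (DNS)."""
    return packet["proto"] == "udp" and 53 in (packet["sport"], packet["dport"])

def network_tap(packets, packet_filter):
    """Deliver a copy of each packet the filter accepts; drop the rest."""
    return [dict(p) for p in packets if packet_filter(p)]

packets = [
    {"proto": "udp", "sport": 53533, "dport": 53},   # DNS query
    {"proto": "tcp", "sport": 41000, "dport": 443},  # HTTPS, filtered out
    {"proto": "udp", "sport": 53, "dport": 53533},   # DNS reply
]
captured = network_tap(packets, udp_port_53_filter)
print(len(captured))  # 2 packets reach the application
```

The key design point the paper made is that the predicate runs *before* any copy to user space, so rejected packets cost almost nothing.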

A prime example of a tool that employs the BPF approach is tcpdump. Based on BPF, tcpdump operates in user space and sends filter instructions to BPF while also receiving filtered packets. In other words, tcpdump leverages BPF’s capabilities to capture and display network packets that match user-specified criteria. This makes tcpdump an essential tool for network administrators and security professionals who need to monitor network traffic for troubleshooting, analysis, and security reasons.

This idea had a broader impact: researchers and developers began asking whether the same approach could be used to monitor more than just network packets in the system.

The answer is eBPF, the extended Berkeley Packet Filter.

eBPF

In 2014, Alexei Starovoitov implemented eBPF on top of the previous BPF design[3]. To make BPF more powerful, eBPF introduced more registers, widened them to 64 bits, and made more kernel functions callable through the bpf_call instruction.[4]

In this design, BPF is responsible for handling incoming filter instructions written in a compact instruction set. Before execution, these instructions must first be checked by a separate component, the verifier, to ensure they cannot harm the running system. Once verified, the instructions are passed to the BPF JIT, which acts as a tiny virtual machine within the kernel. During execution, BPF instructions can access a set of registers, call BPF helper functions, and obtain information from other event sources. Overall, this approach lets BPF extend the kernel's capabilities securely and efficiently, without requiring any changes to its source code or loading kernel modules.

Thanks to its innovative and creative implementations, eBPF can provide a powerful ability to observe and monitor the entire system. To facilitate this process, eBPF also offers a suite of other infrastructure components to operate effectively and efficiently[1][5], including:

  • BPF Map: eBPF programs require the ability to share collected data and store state information accurately. In this regard, eBPF maps serve as crucial components that allow eBPF programs to store and retrieve data in a broad set of data structures.
  • Helper Function: This feature enables the eBPF virtual machine to interact with the kernel seamlessly. As of early 2023, there are over 220 eBPF Helpers in the kernel.
  • Tail call: eBPF programs can also leverage the tail-call mechanism to jump to other eBPF programs efficiently. Similar to execve, a tail call replaces the current program's execution context with the next program instead of returning to the caller.
  • LLVM backend: clang can now compile restricted C into eBPF bytecode and emit eBPF object files.
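To make the map idea concrete, here is a toy dict-backed Python model of a BPF hash map's update/lookup/delete operations. This is an illustration of the semantics only; real BPF maps live in kernel memory and are accessed from user space through the bpf() syscall.

```python
# Dict-backed model of a BPF hash map (illustration only; real BPF maps
# live in kernel memory and are shared between eBPF programs and user space).
class BpfHashMap:
    def __init__(self, max_entries=10240):
        self.max_entries = max_entries
        self.entries = {}

    def update(self, key, value):
        # Like BPF_MAP_UPDATE_ELEM: insert or overwrite, bounded by max_entries
        if key not in self.entries and len(self.entries) >= self.max_entries:
            raise MemoryError("map is full (the kernel returns an error here)")
        self.entries[key] = value

    def lookup(self, key):
        # Like BPF_MAP_LOOKUP_ELEM: returns None if the key is absent
        return self.entries.get(key)

    def delete(self, key):
        # Like BPF_MAP_DELETE_ELEM
        self.entries.pop(key, None)

# A "kernel-side" probe records a start timestamp; "user space" reads it back.
start = BpfHashMap()
start.update(0xdeadbeef, 123456789)   # request pointer -> start time (ns)
print(start.lookup(0xdeadbeef))       # 123456789
start.delete(0xdeadbeef)
print(start.lookup(0xdeadbeef))       # None
```

The disksnoop example later in this article uses exactly this update/lookup/delete pattern on a real BPF_HASH map to compute I/O latency.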

What can eBPF do?

Overall, eBPF serves three main purposes: networking, observability, and security.

In terms of networking, eBPF allows for custom packet filtering rules that enable the creation of high-performance networking capabilities. By handling network packets within the kernel, eBPF avoids costly transitions to and from user space, thus providing efficient and effective packet filtering.

Regarding observability, eBPF’s powerful ability to observe the system allows for the creation of comprehensive monitoring and debugging tools. With eBPF, developers can snoop specific packets and calls between user space and kernel space, providing richer and more detailed troubleshooting information. Additionally, tracing and profiling are made more accessible and flexible for developers with eBPF.

Lastly, eBPF has also been found to be useful in the realm of security. By leveraging eBPF’s observability capabilities, enterprises can detect and even prevent various types of malicious activities originating from within the kernel.

In summary, eBPF’s observability feature is a crucial aspect of its overall usefulness, providing developers with deeper insights and greater flexibility when monitoring and debugging their systems.

Why do we need eBPF?

The next question is why we need eBPF at all; it seems the tasks above could also be handled inside the Linux kernel itself. While it is true that some of the tasks eBPF can accomplish can also be achieved within the Linux kernel, there are several reasons why eBPF is gaining popularity:

  1. Flexibility: eBPF allows for more flexibility in terms of what can be monitored and how it can be monitored. It allows for monitoring at various layers of the stack, including the network layer, application layer, and kernel layer.

  2. Safety: eBPF provides a safe way to execute custom code within the kernel without compromising system stability or security.

  3. Performance: eBPF is designed to be highly performant and efficient, allowing for real-time monitoring and analysis with minimal overhead.

  4. Portability: eBPF programs can be written once and run on any Linux kernel version that supports them.

The primary objective of the Linux kernel is to offer a consistent API (system calls) that abstracts the underlying hardware or virtual hardware and facilitates resource sharing among applications. To achieve this goal, the Linux kernel relies on various subsystems and layers that allocate different responsibilities. Typically, each subsystem allows some degree of configuration to cater to different user needs. However, in cases where the desired behavior cannot be configured, there are traditionally two options: either change the kernel source code and advocate for the modification to be accepted by the Linux kernel community (which can take years), or write a kernel module and maintain it regularly to avoid compatibility issues, potentially risking the security of the Linux kernel.

Practically, neither of these options is widely adopted. The former is too expensive, while the latter is not particularly portable. Fortunately, eBPF provides a new option - users can now program in eBPF instructions and load them into the eBPF virtual machine. This approach offers maximum programmability and flexibility while still retaining the virtual machine design, making the program portable like a JVM. This feature is typically referred to as BPF CO-RE (Compile Once - Run Everywhere). As a result, more and more services and tools are adopting eBPF to take advantage of its programmability and portability capabilities.

Profile in a flash

For developers and researchers, achieving optimal performance is crucial. To improve performance, it’s necessary to identify the hotspot in the code and optimize it. Profiling tools have been designed to assist in such scenarios. Popular tools such as Intel Vtune, gprof, nvprof, and others are designed to observe the executable program. However, these tools require the code to be run from the beginning, and in some cases, the code needs to be rebuilt or recompiled with specific options.

With the introduction of eBPF, the game has changed. It can observe all system calls and fetch the arguments of each call without needing the program to be rerun with tracing enabled. The ability to capture system call data in real-time with eBPF allows for more efficient and comprehensive tracing, as well as the ability to analyze and manipulate the data being captured. Moreover, eBPF can be used for other purposes beyond tracing, making custom statistics available.

How to use eBPF

BCC

The BCC[6] (BPF Compiler Collection) is a comprehensive toolkit designed for the creation of efficient kernel tracing and manipulation programs. It comes with a variety of helpful tools and examples to streamline the process.

BCC simplifies BPF program development by allowing kernel instrumentation in C, providing a C wrapper around LLVM, and offering front-ends in Python and Lua. It is well-suited for various tasks such as performance analysis, network traffic control, and more[7].

Installation steps are listed here: https://github.com/iovisor/bcc/blob/master/INSTALL.md

Within BCC, there is an array of pre-built tools that are user-friendly. However, the compilation process can be somewhat tedious. On most systems, it is necessary to install the appropriate Linux-header packages. This can be challenging for some virtual machines, and occasionally, the installation process can be time-consuming and resource-intensive.

Recommended resources for BCC:

BCC example 1

#
# disksnoop.py Trace block device I/O: basic version of iosnoop.
# For Linux, uses BCC, eBPF. Embedded C.
#
# Written as a basic example of tracing latency.
#
# Copyright (c) 2015 Brendan Gregg.
# Licensed under the Apache License, Version 2.0 (the "License")
#
# 11-Aug-2015 Brendan Gregg Created this.

from __future__ import print_function
from bcc import BPF
from bcc.utils import printb

REQ_WRITE = 1  # from include/linux/blk_types.h

# load BPF program
b = BPF(text="""
#include <uapi/linux/ptrace.h>
#include <linux/blk-mq.h>

BPF_HASH(start, struct request *);

void trace_start(struct pt_regs *ctx, struct request *req) {
    // stash start timestamp by request ptr
    u64 ts = bpf_ktime_get_ns();

    start.update(&req, &ts);
}

void trace_completion(struct pt_regs *ctx, struct request *req) {
    u64 *tsp, delta;

    tsp = start.lookup(&req);
    if (tsp != 0) {
        delta = bpf_ktime_get_ns() - *tsp;
        bpf_trace_printk("%d %x %d\\n", req->__data_len,
            req->cmd_flags, delta / 1000);
        start.delete(&req);
    }
}
""")

if BPF.get_kprobe_functions(b'blk_start_request'):
    b.attach_kprobe(event="blk_start_request", fn_name="trace_start")
b.attach_kprobe(event="blk_mq_start_request", fn_name="trace_start")
if BPF.get_kprobe_functions(b'__blk_account_io_done'):
    b.attach_kprobe(event="__blk_account_io_done", fn_name="trace_completion")
else:
    b.attach_kprobe(event="blk_account_io_done", fn_name="trace_completion")

# header
print("%-18s %-2s %-7s %8s" % ("TIME(s)", "T", "BYTES", "LAT(ms)"))

# format output
while 1:
    try:
        (task, pid, cpu, flags, ts, msg) = b.trace_fields()
        (bytes_s, bflags_s, us_s) = msg.split()

        if int(bflags_s, 16) & REQ_WRITE:
            type_s = b"W"
        elif bytes_s == b"0":  # see blk_fill_rwbs() for logic
            type_s = b"M"
        else:
            type_s = b"R"
        ms = float(int(us_s, 10)) / 1000

        printb(b"%-18.9f %-2s %-7s %8.2f" % (ts, type_s, bytes_s, ms))
    except KeyboardInterrupt:
        exit()

BCC example 2

# Copyright (c) PLUMgrid, Inc.
# Licensed under the Apache License, Version 2.0 (the "License")

# run in project examples directory with:
# sudo ./hello_world.py
# see trace_fields.py for a longer example

from bcc import BPF

# Note: on x86-64 kernels since 4.17 this may not work as-is; you may need to
# replace kprobe__sys_clone with kprobe____x64_sys_clone
BPF(text='int kprobe__sys_clone(void *ctx) { bpf_trace_printk("Hello, World!\\n"); return 0; }').trace_print()

BCC example 3

Call BCC execsnoop directly:

Bpftrace

Bpftrace[8] is a high-level tracing language for Linux enhanced Berkeley Packet Filter (eBPF) available in recent Linux kernels (4.x). bpftrace uses LLVM as a backend to compile scripts to BPF-bytecode and makes use of BCC for interacting with the Linux BPF system, as well as existing Linux tracing capabilities: kernel dynamic tracing (kprobes), user-level dynamic tracing (uprobes), and tracepoints. The bpftrace language is inspired by awk and C, and predecessor tracers such as DTrace and SystemTap.

One of the main benefits of bpftrace is its compatibility with different Linux kernel versions, which makes it a versatile tool for debugging complex system problems across different operating systems like Ubuntu, CentOS, and Debian. With bpftrace, users can write and execute lightweight scripts that can trace kernel functions, system calls, and user-space functions.

These scripts can be used to monitor network activity, debug performance bottlenecks, and detect security threats. Moreover, bpftrace can help users troubleshoot various problems, such as memory leaks, network congestion, and application crashes. It can also be utilised to identify rogue processes, analyse system logs, and track down performance issues in real-time.

In summary, bpftrace represents a reliable and efficient solution for developers, system administrators, and security professionals seeking to diagnose and fix complex system problems. Its powerful capabilities, ease of use, and compatibility with different Linux kernel versions make it an indispensable tool in the Linux ecosystem.

The GitHub repository lists some one-liners as examples: One-Liners

libbpf

BCC and bpftrace are undoubtedly powerful tools for tracing and monitoring performance on Linux systems. However, both are oriented toward ready-made or scripted tracing, and the constraints of their predefined frameworks limit how far they can be extended.

That’s where libbpf comes in: it offers a standardised library for developing and sharing BPF (Berkeley Packet Filter) programs across Linux systems. The authoritative source code is developed as part of the bpf-next Linux source tree under the tools/lib/bpf subdirectory and is periodically synced to GitHub.

But developing directly against a kernel library can be a daunting task. Fortunately, the companion scaffolding project known as libbpf-bootstrap can significantly aid developers in building their own tools. It offers numerous examples in its repository, which makes the development process less complicated. By following these examples, developers can quickly get up to speed and start creating their own custom tools.

Recommended resources:

libbpf-bootstrap example

// SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause
/* Copyright (c) 2020 Facebook */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

char LICENSE[] SEC("license") = "Dual BSD/GPL";

int my_pid = 0;

SEC("tp/syscalls/sys_enter_write")
int handle_tp(void *ctx)
{
    int pid = bpf_get_current_pid_tgid() >> 32;

    if (pid != my_pid)
        return 0;

    bpf_printk("BPF triggered from PID %d.\n", pid);

    return 0;
}
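One detail worth unpacking: bpf_get_current_pid_tgid() returns a 64-bit value with the thread-group id (the "PID" user space sees) in the upper 32 bits and the thread id in the lower 32, which is why the example shifts right by 32. A quick Python sketch of the packing, with hypothetical IDs:

```python
# bpf_get_current_pid_tgid() returns (tgid << 32) | tid: the upper 32 bits are
# the thread-group id, the lower 32 bits the thread id. The IDs below are made
# up for illustration.
def pack_pid_tgid(tgid, tid):
    return (tgid << 32) | tid

pid_tgid = pack_pid_tgid(4242, 4244)  # hypothetical process/thread ids
tgid = pid_tgid >> 32                 # what the example above compares to my_pid
tid = pid_tgid & 0xFFFFFFFF
print(tgid, tid)  # 4242 4244
```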

Perf-tools

Link: https://github.com/brendangregg/perf-tools

A miscellaneous collection of in-development and unsupported performance analysis tools for Linux ftrace and perf_events (aka the “perf” command). Both ftrace and perf are core Linux tracing tools, included in the kernel source. Your system probably has ftrace already, and perf is often just a package add (see Prerequisites).

These tools are designed to be easy to install (fewest dependencies), provide advanced performance observability, and be simple to use: do one thing and do it well. This collection was created by Brendan Gregg (author of the DTraceToolkit).

Many of these tools employ workarounds so that functionality is possible on existing Linux kernels. Because of this, many tools have caveats (see man pages), and their implementation should be considered a placeholder until future kernel features, or new tracing subsystems, are added.

perf-tools example

pycuda.py:

import torch

a_full = torch.randn(10240, 10240, dtype=torch.double, device='cuda')
b_full = torch.randn(10240, 10240, dtype=torch.double, device='cuda')
ab_full = a_full @ b_full
mean = ab_full.abs().mean() # 80.7277

a = a_full.float()
b = b_full.float()

# Do matmul at TF32 mode.
torch.backends.cuda.matmul.allow_tf32 = True
ab_tf32 = a @ b # takes 0.016s on GA100
error = (ab_tf32 - ab_full).abs().max() # 0.1747
relative_error = error / mean # 0.0022

# Do matmul with TF32 disabled.
torch.backends.cuda.matmul.allow_tf32 = False
ab_fp32 = a @ b # takes 0.11s on GA100
error = (ab_fp32 - ab_full).abs().max() # 0.0031
relative_error = error / mean # 0.000039

print(relative_error)

GPTtrace

Project link: https://github.com/eunomia-bpf/GPTtrace

Even though BCC and bpftrace simplify eBPF programming, they still require a lot of time spent reading each tool's manual. Furthermore, eBPF programming requires a strong understanding of the underlying Linux kernel and networking stack, making it a challenging task for beginners. Additionally, debugging eBPF programs can be difficult due to limited visibility into the kernel.

GPTtrace might help here: it uses ChatGPT to generate eBPF programs and run traces from natural-language descriptions.

What’s next?

After this quick overview of eBPF, we can see that observing the system from many different angles has become easier than ever. Admittedly, eBPF has some shortcomings: the BPF helper functions are still limited, many eBPF hooks are read-only, and every program must still pass strict verification before it is loaded.

Furthermore, eBPF is a relatively new technology and there is still a lack of documentation and resources available for developers to learn how to use it effectively. This can make it difficult for organisations to adopt eBPF as part of their infrastructure.

Despite these shortcomings, eBPF offers researchers a quick, insightful view of their programs, and it reflects a modern view of the operating system. By allowing researchers to dynamically trace and analyse kernel-level events and functions, eBPF provides a powerful tool for understanding program behaviour and identifying performance bottlenecks. This real-time visibility can help researchers optimise their code and improve overall system efficiency.

Additionally, eBPF’s flexibility and programmability make it well-suited for modern operating systems that require dynamic, adaptable solutions. Its ability to operate at the kernel level while maintaining safety and security ensures that it can be used effectively in a wide range of applications.

Overall, eBPF represents an exciting new direction for operating systems research, offering a unique combination of performance monitoring, analysis, and adaptability that can help researchers better understand and optimise their programs.


Author: Chivier Humber
Posted on: April 28, 2023
Original post: http://blog.chivier.site/2023-04-28/c30f7c4d12e1/