eBPF Updates #5: CNCF Proposals, Kinvolk Acquired, eCHO, the Cost of Tail Calls, Systemd Features, Reverse Debugging, Static Linking

Spring is back in the Northern Hemisphere, and with it all kinds of features are blooming for eBPF. Kernel 5.12 is out, and version 5.13 is on track to bring exciting new features. At the same time, new projects are hatching and companies are enjoying renewed activity. Several eBPF-related projects have applied to join the CNCF or to upgrade their status within it. After some delay, no doubt caused by winter hibernation, welcome to the fifth issue of the eBPF Updates!

Important News

Cilium applies to become a CNCF project.
The Cilium Community has sent a proposal to move the project into the CNCF (Cloud Native Computing Foundation) at the Incubation level. If the proposal is accepted, there should be very little change to development models or governance, which are already fully open and follow open-source best practices. Why do this? A number of users looking to build large projects on Cilium have requested a donation to a foundation, as a final piece of assurance on the commitment to open source. The CNCF was naturally considered, as the core focus of Cilium is to provide networking and security for cloud-native environments and its goals align well with those of the Foundation. At the same time, the project is finalizing its version 1.10, for which it just tagged a release candidate.

Sysdig contributes Falco’s kernel module, eBPF probe, and libraries to the CNCF.
This contribution is a commitment to provide and keep those components as open source. The kernel module and eBPF probe components are roughly equivalent, and implement a system call capture framework in the Linux kernel, used by Sysdig and Falco. The libraries sit on top of them, and “enrich” the information collected by the probes to turn it into meaningful data. All components become the property of the Cloud Native Computing Foundation. Some have been relicensed, although this does not seem to be the case for the eBPF probes. From a technical point of view, this means that the contributors worked on decoupling those components from the Sysdig code base, making them easier to reuse in other projects. At nearly the same time, the maintainers of Falco submitted a proposal for CNCF graduation, which would represent a marker of health and maturity for the project. Additionally, Sysdig, the company behind Falco, raised a $188 million Series F at a $1.19 billion valuation.

Microsoft acquires Kinvolk.
Launched in 2015, Kinvolk has been working on open source projects such as the Flatcar Container Linux distribution, Lokomotive, and Inspektor Gadget. The latter is a suite of eBPF-based tools for debugging and inspecting Kubernetes clusters, and the company has been actively contributing to the eBPF ecosystem (see for example the posts from Mauricio Vásquez Bernal referenced later in this document). The company is joining the Azure team. Congratulations, and best of luck for the future!

LPC 2021 Networking and BPF Track CFP, on the development mailing list for eBPF.
Just like the previous editions, the next Linux Plumbers Conference will have a Networking and BPF track. The call for participation is open until the 13th of August. The conference itself was initially due to take place in Dublin, Ireland, from the 27th to the 29th of September 2021, but the organizers recently announced that it will be a fully virtual event instead.

libbpf: the road to v1.0, from Andrii Nakryiko.
From a very basic library wrapping around the bpf() system call, libbpf has become the reference for handling eBPF objects and has grown to support all the recent features available in the kernel. This came with a number of issues or inconsistencies from an API perspective. In an effort to address those, Andrii defines a roadmap for the release 1.0 of the library. The announcement email also contains a link to a file on Google Docs to contribute and discuss ideas for this milestone.

eCHO – eBPF & Cilium Office Hours.
This is a new series of livestream episodes about, you guessed it, eBPF and Cilium. The first three episodes have already aired.
You can easily add a show to your calendar from the link above, and you can even suggest topics for discussion on the GitHub tracker. A new episode streams every Friday, so make sure you attend the next one!

New Resources

Introductory Material

eBPF and Kubernetes – What’s the Deal? (video), from Dawid Ziółkowski.
eBPF can greatly improve networking for environments running Kubernetes. After providing some historical context, this presentation lays out the limitations imposed by the use of iptables. It introduces eBPF and Cilium, and explains how to leverage these tools to improve routing, observability, and network policies in clusters.

Getting Started with eBPF and Go, from Michael Kashin.
Have you tried working with Go and eBPF, and are you getting confused by the multiple libraries offering to help manage the programs and the maps? Libbpfgo, goebpf, gobpf, and cilium/ebpf are several possibilities. The first three wrap around different components, whereas cilium/ebpf has a pure Go implementation. This post aims to disambiguate those distinct libraries, and explains how to write, manage, and ship a simple program with the cilium/ebpf library. The program, xdp-xconnect, cross-connects different Linux interfaces and redirects packets with XDP.

Getting started with bpf and libbpfgo, from Grant Seltzer.
Another approach to writing a simple program and managing it in user space with Go, this time with the libbpfgo library which wraps around libbpf, and with an example oriented towards tracing.

In the Industry

Liz Rice: Following the ‘Superpower’ Promise of eBPF, from Liz Rice.
The chair of the CNCF’s Technical Oversight Committee recently joined Isovalent to work on Cilium, described as “the most widely deployed eBPF project in production”. She looks back on the reasons that led her to this decision, and explains in what sense “eBPF is another transformational technology”, as Docker has been.

fledge.io brings eBPF to multi-cloud and edge, from Pramodh Mallipatna.
The solution proposes to help define, deploy, and manage geo-distributed applications that span multi-cloud and edge environments. It now uses eBPF programs to monitor the local CPU, disk, and network resources, as well as to gather and present richer application information in real time.

Linux Kernel insights with eBPF, from Manos Saratsis.
Netdata is already using eBPF to some extent, but has more plans to leverage the technology and provide a variety of charts to visualize metrics on all components of a system. This post is a roadmap announcement of the features to come, to inform the community as well as to gather some feedback. Netdata version 1.30.0, released a few weeks later, seems to contain some of these changes.

eBPF Integration for Pyroscope.
Pyroscope, an open source continuous profiling platform, now integrates with eBPF to provide efficient CPU profiling of applications and of the system itself.

Using eBPF and immudb to audit executed commands on a Linux server.
CodeNotary explains how to load an eBPF program with bpftrace, attach it to uretprobes to capture shell commands, and store the collected data in a tamper-proof way in immudb, the open source immutable database solution developed by the company.

New Relic: What Is eBPF and Why Does It Matter for Observability?
This introduction to eBPF includes a basic overview of its functioning, advantages, and limitations, as well as an overview of the main tools based on eBPF. New Relic concludes by presenting how Pixie, its open source observability platform for Kubernetes, leverages eBPF to extract richer information.

Gathering insights on Kubernetes applications, services, and network traffic with Pixie.
Directly related to the previous item, Pixie is also the object of a new partnership between New Relic and AWS, and the project has even applied to join the CNCF.

Deep Dives

BPF meets io_uring, from Jonathan Corbet.
This article from LWN.net comments on an RFC submitted to the io-uring mailing list, which proposes to add a new hook and eBPF program type, BPF_PROG_TYPE_IOURING, to run in the io_uring context. With some extra effort, it could lead to making decisions based on the outcome of previous operations in the ring, such as submitting other I/O operations or processing the next file in a list.

Comparing SystemTap and bpftrace, from Emanuele Rocca.
SystemTap and bpftrace are tools to dynamically instrument the kernel or user applications on Linux. While bpftrace uses eBPF programs, SystemTap is older and compiles its scripts into kernel modules. This post from LWN.net lays out the differences between the two tools in terms of installation procedure, program structure, and features. As one could expect, using kernel modules is ultimately more powerful, but eBPF makes for a faster and easier-to-use tracing tool.

Toward signed BPF programs, from Jonathan Corbet.
Another LWN.net article, this time on a recent proposal sent by Alexei Starovoitov in order to introduce support for signed eBPF programs, so that the system would only accept authorized eBPF programs. The mechanism is compared with the one in place for kernel modules, but signing eBPF programs is more complex (due to the relocations and map creations). The proposal currently under review solves this with a new program type, which allows a special eBPF program to run bpf() and close() system calls to load other eBPF programs from within the kernel. The work is still in progress.

What is vmlinux.h?, from Grant Seltzer.
The vmlinux.h header is generated automatically and contains BTF information about the kernel itself. Modern eBPF programs may use this information to access a number of features, CO-RE being one of them.

DevConf.CZ 2021 (Schedule) hosted several talks related to eBPF, some of which are summarized below. Sadly I could not find the videos and slides for all presentations.
- eBPF Iterators (slides), from Jiri Olsa.
  This is possibly the only document on eBPF iterators we have at the moment. Those iterators are hooked on various objects in the kernel (for example, among many others: tasks, eBPF programs or maps, active TCP connections). Pinned to the eBPF virtual file system, they can be dumped (simply with cat) to iterate over the selected objects and print information about them. This is very similar to what procfs already offers, although more flexible. It is even possible to preload and attach some iterators at boot time, so that listing eBPF programs and maps is available at all times. BCC tools, bpftrace, bpftool, and perf all have support for iterators (at various degrees of progress).
- Capturing network traffic in an eXpress Data Path (slides), from Eelco Chaudron.
  How can one inspect network packets with an XDP program attached to the interface, when only the XDP_PASS return code will pass packets to the stack and lead them to a path where tcpdump can see them? To answer this question, xdpdump was developed. It relies on the fentry and fexit hooks that eBPF programs can use to attach to the entry and the exit of functions, including other eBPF programs, and inspects the packets before and after they are processed by the XDP program. The tool provides a command-line interface, and also led to improvements of the PcapNG capture format. Wireshark v3.4.0 and newer can already benefit from those changes, with new filters to show packets for which the XDP program returned a specific action code or packets coming from a specific interface queue.
- Always present type information thanks to BPF: BTF (video, slides), from Arnaldo Melo.
  Recent eBPF features increasingly use BTF (BPF Type Format, a format for storing debug information). But BTF is generic enough to be used by other applications as well. This presentation focuses on BTF and on the kind of kernel information it can help retrieve. Pahole, a tool used to examine data structure layouts encoded in debugging information formats, is central in this talk. Not only do we use it to produce BTF information from the kernel, but it also has extensive support for BTF and is able to extract information for a number of use cases, from pretty-printing type information to extracting relevant tokens for kernel live patching.

The Cost of BPF Tail Calls, from Paul Chaignon.
In order to better understand the overhead related to tail calls in eBPF programs (long jumps from one program to another, with no coming back), Paul ran extensive tests spanning multiple kernel versions, with and without retpolines. This led to a few discoveries: version 5.5 of the Linux kernel performed noticeably better than the previous versions, but 5.10 reintroduced some overhead. The post, nearly a scientific report, contains more details on the methodology and results.
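
To make the "no coming back" part concrete, here is a small userspace model in C of a chain of tail calls. It is only a sketch under loud assumptions: real programs call the bpf_tail_call() helper on a BPF_MAP_TYPE_PROG_ARRAY map, and a failed tail call falls through to the rest of the calling program, which this model does not reproduce. All names (model_run, model_progs, MODEL_MAX_TAIL_CALLS) and the value of the cap are made up for the illustration and do not come from Paul's post.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace model only, not kernel code. The cap below is a hypothetical
 * value chosen for this illustration. */
#define MODEL_MAX_TAIL_CALLS 32
#define MODEL_NO_TAIL_CALL   (-1)

struct model_ctx {
	int hops;     /* number of programs that ran */
	int verdict;  /* set by the last program of the chain */
};

/* A "program" does some work, then returns the slot of the program to
 * jump to, or MODEL_NO_TAIL_CALL. It never gets control back. */
typedef int (*model_prog_fn)(struct model_ctx *ctx);

static int prog_parse(struct model_ctx *ctx)  { ctx->hops++; return 1; }
static int prog_policy(struct model_ctx *ctx) { ctx->hops++; return 2; }
static int prog_final(struct model_ctx *ctx)
{
	ctx->hops++;
	ctx->verdict = 42;
	return MODEL_NO_TAIL_CALL;
}
/* Jumps to itself forever: only the cap stops it. */
static int prog_loop(struct model_ctx *ctx)   { ctx->hops++; return 3; }

/* Stand-in for the program array map. */
const model_prog_fn model_progs[] = {
	prog_parse, prog_policy, prog_final, prog_loop,
};

/* Run the chain starting at slot `entry`; return the number of jumps. */
int model_run(const model_prog_fn *progs, size_t nr_progs, int entry,
	      struct model_ctx *ctx)
{
	int tail_calls = 0;
	int slot = entry;

	assert(progs != NULL && ctx != NULL);
	assert(entry >= 0 && (size_t)entry < nr_progs);

	for (;;) {
		/* Indirect call through the table of programs. */
		int next = progs[slot](ctx);

		if (next < 0 || (size_t)next >= nr_progs || progs[next] == NULL)
			break;  /* no jump requested, or empty slot */
		if (tail_calls == MODEL_MAX_TAIL_CALLS)
			break;  /* cap reached: refuse to jump again */
		tail_calls++;
		slot = next;
	}
	return tail_calls;
}
```

Starting from slot 0, the model runs three programs and performs two jumps; starting from the self-referencing slot 3, it stops once the cap is reached. The indirect call in the dispatcher is the kind of branch that retpolines tend to make more expensive, which is presumably why the measurements distinguish kernels with and without them.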

Exploring a New Detection Evasion Technique on Linux, from Alan Cao.
One of the use cases for application monitoring with eBPF is to ensure the security of the system and to detect any mischievous process. Over the years, some malware has been found to implement countermeasures to evade detection, and to behave differently depending on whether it is under observation or not. Can this be applied to eBPF monitoring? Yes, explains this post. eBPF events can be logged to systemd, and if a process is able to access these logs, it can check whether an eBPF program has been loaded at the same time as itself, which would likely indicate an attempt to monitor it. Limitations and possible mitigations are included in the post.

Examining Problematic Memory in C/C++ Applications with BPF, perf, and Memcheck, from Filip Busic.
This long post goes way beyond what one usually finds in introductions and tutorials. It explains how to simply trace memory leaks in applications with Memcheck, and then turns to perf and eBPF. But the author also strives to provide all the technical background to understand what is going on: stack unwinding and how to do it, Linux’s memory model, tracing event sources, an introduction to perf and eBPF, and installation steps are all sections to read and learn from. Finally, a variety of example use cases, often involving flame graphs, show how to troubleshoot memory issues.

Extending systemd Security Features with eBPF, from Mauricio Vásquez Bernal.
Two new systemd properties implemented through eBPF programs are in development, leveraging the upcoming support for eBPF programs written in C (rather than bytecode only) for systemd. The first one restricts the file system types that processes in a systemd unit can access. Setting RestrictFileSystems=ext4, for example, will prevent the processes from interacting with tmpfs, thanks to a program attached to the eBPF LSM (Linux Security Module) and running on all attempts to open files. The program checks for the presence of the magic number associated with the file system in a dedicated eBPF map. The second property, RestrictNetworkInterfaces, attaches a program to the cgroup hooks for sending and receiving packets and blocks any attempt that is not associated with one of the authorized interfaces listed in the dedicated eBPF map.
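
The check behind the first property can be pictured with the following userspace model in C. This is not the systemd implementation: the real logic lives in an eBPF program attached to an LSM hook and reads from an eBPF map, whereas here a plain array stands in for the map. The type and function names are hypothetical, as is the choice of error codes; the two magic numbers are the values defined in <linux/magic.h>.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace model only, not kernel or systemd code. */

/* File system magic numbers, as defined in <linux/magic.h>. */
#define MODEL_EXT4_SUPER_MAGIC 0xEF53UL
#define MODEL_TMPFS_MAGIC      0x01021994UL

#define MODEL_MAX_ALLOWED 8

/* Stand-in for the eBPF map holding the allowed magic numbers. */
struct model_fs_allowlist {
	uint64_t magic[MODEL_MAX_ALLOWED];
	size_t count;
};

/* What user space would do when parsing RestrictFileSystems=...:
 * record one allowed file system type. */
int model_allow_fs(struct model_fs_allowlist *list, uint64_t magic)
{
	assert(list != NULL);
	if (list->count == MODEL_MAX_ALLOWED)
		return -ENOSPC;
	list->magic[list->count++] = magic;
	return 0;
}

/* Stand-in for the program run on every attempt to open a file:
 * look up the magic number of the file's file system, and return
 * 0 to allow the operation or -EPERM to deny it. */
int model_file_open(const struct model_fs_allowlist *list, uint64_t sb_magic)
{
	assert(list != NULL);
	for (size_t i = 0; i < list->count; i++)
		if (list->magic[i] == sb_magic)
			return 0;
	return -EPERM;
}
```

With only the ext4 magic number registered, an open on an ext4 file is allowed and an open on a tmpfs file is denied, which mirrors the RestrictFileSystems=ext4 example above.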

Using eBPF in Flatcar Container Linux, from Mauricio Vásquez Bernal.
Some specific kernel options must be set at compile time to enable all of the eBPF features. This post describes a few of them, all enabled in the Flatcar Container Linux distribution. It includes an overview of what CONFIG_IKHEADERS, CONFIG_DEBUG_INFO_BTF, and CONFIG_BPF_LSM mean and what features they activate.

Reverse debugging at scale, from Walter Erquinigo, David Carrillo-Cisneros, and Alston Tang.
These engineers from Facebook explain how they deployed a solution to record CPU activity on their (many) servers, to be able to extract information when a process crashes. The data is stored in a circular buffer, to be analyzed later with the LLDB debugger after an incident. One issue was to find a way to quickly notify the collector when a process crashes, to avoid having too much useful data overwritten in the circular buffer between the crash and the extraction. The best solution they found was to use an eBPF program attached to a kprobe to trigger the copy.

BPF binaries: BTF, CO-RE, and the future of BPF perf tools, from Brendan Gregg.
Catching up: This post is from November 2020, which does not really correspond to the time frame otherwise covered in this issue, but I missed it at the time and it feels important.
After a brief overview of BTF and CO-RE, the post covers the next steps for existing eBPF-based tracing tools. In particular, it announces that the Python bindings used with the BCC tools should be considered as deprecated, and that libbpf should be preferred instead to build and manage these programs. This does not mean that BCC itself is deprecated: the project is simply transitioning to libbpf-based tools and abandoning the Python wrappers in newer tracing utilities.

Academic Works

BMC: Accelerating Memcached using Safe In-kernel Caching and Pre-stack Processing (PDF), from Yoann Ghigoff et al.
Published at NSDI ’21 (USENIX), this paper introduces BMC, the BPF Memcached Cache, a first-level in-kernel cache relying on XDP to accelerate Memcached. It improves the response time, by avoiding costly traversals of the network stack. The authors provide extensive evaluation of the mechanism and compare it to an improved version of Memcached. Paul Chaignon also wrote a dedicated review of that article, rightly concluding that “it’s exciting to see BPF applied to application protocols”.

Software Projects

The Cilium community is proud to take part in this year’s Google Summer of Code. The application period is now closed and accepted projects will be announced on May 17th.
The Go library cilium/ebpf v0.5.0 is out, and adds support for attaching programs to kprobes, kretprobes, or tracepoints out of the box.
ipftrace2 v0.3.0 was released and now relies on the new static linker from libbpf. A few weeks earlier, the tool had revived the support for tracing the journey of network packets in kernel module functions, thanks to BTF for modules.

Software Demos and Experiments

CrowdStrike released BPFMon, a proof-of-concept utility to map updates in an eBPF map. This is part of a study on how to detect attackers who would tamper with some configuration options, for example, passed through maps to eBPF programs.
Conntracker is “a firewall sniffer”. It analyzes in real time the flow going through Linux’s tables (netfilter or nf_tables), and provides output to understand, debug and optimize firewall rules. It can use eBPF to trace TCP and UDP flows and their associated processes.
EBPFCat is a Python implementation of a primary device for EtherCAT, relying on XDP programs to process the packets with very low latency. The project contains a Python-based generator to produce eBPF code on-the-fly!
The sonde-rs library provides a way to compile USDT (Userland Statically Defined Tracing) probes into Rust applications, so that they can be traced with any tools supporting those probes, including for example BCC tools or bpftrace.
eBPFSnitch is an experimental application-level firewall for Linux, based on eBPF and NFQUEUE. At this time, it filters all outgoing flows, and filtering of incoming connections is in the works. It aims at providing a good integration with containerized applications.
Project Kube-Knark uses the pcap capture format and eBPF programs to monitor calls to the Kubernetes API and changes to the configuration files for the primary node, so as to help detect rogue applications that would attempt to change that configuration.
QEMU-CSD is a “full stack prototype to execute BPF programs as if they are running on a Zoned Namespace (ZNS) SSD Computational Storage Device (CSD). The entire prototype can be run from userspace by utilizing existing technologies such as SPDK and uBPF”, the latter being a user space implementation of an eBPF runtime.

The Kernel Side

Kernel 5.12 was released on the 25th of April, and with it a number of features discussed in the previous issues made their way into the latest stable version of Linux.

As for the 5.13 release, the time frame covered by this post spans its entire development cycle. There were five pull requests from the bpf-next tree, sent on March 10th, March 25th, April 1st, April 24th, and April 27th. As usual, the list below only contains a few highlights. Follow the links above to see the full list of changes involved in those pull requests. Here, we broke down the changes into categories.

Core

Support calling kernel functions from eBPF programs. This feature has some similarities with eBPF helper functions, which are compiled as part of the kernel and can be called from eBPF programs. But instead of writing dedicated functions, this is about calling pre-existing functions from the kernel. This does not apply to any function: a list of allowed functions is maintained in the kernel for each eBPF program type. A crucial difference with eBPF helpers is that the kernel functions that can be called are not bound to a fixed ABI contract. This means that they remain free to evolve, even if this breaks existing eBPF programs. BTF is what makes this possible. The motivation behind this set is to reuse some code portions from the kernel, in particular for those eBPF programs that override specific kernel operations (BPF_PROG_TYPE_STRUCT_OPS), like TCP congestion control. Several related functions are marked as allowed for eBPF programs overriding TCP congestion control. (Martin KaFai Lau, link)

XDP

Make the bpf_redirect_map() helper faster by turning it internally into a map operation, to access it immediately instead of traversing a switch-statement to select the relevant function for the current map type. This, with another improvement on the xdp_do_redirect() kernel function, led to an improvement of up to 8% in performance. (Björn Töpel, link)
For all drivers implementing XDP, move the drop error path for XDP_REDIRECT to devmap. This should help implement better queue overflow handling, and represents a step towards the addition of an XDP hook on the transmit queue. (Lorenzo Bianconi, link)
Improve AF_XDP selftests and program loading. AF_XDP sockets need an XDP program to filter the packets to redirect to user space. But when multiple AF_XDP sockets (“xdpsock” instances) are running on a single interface and one of them is terminated, the XDP program would be automatically unloaded, thus rendering the other sockets unable to operate. Besides improving the selftests, this PR addresses the issue by making libbpf use eBPF “links” to properly reference the XDP programs and make them persistent. (Maciej Fijalkowski, link)
Convert “cpumap” (for redirecting packets to specific CPUs with XDP) to use netif_receive_skb_list(), which allows receiving socket buffers in bulk, thus improving i-cache usage. This results in a performance improvement of about 15% on a test with the xdp_redirect_cpu kernel sample program. (Lorenzo Bianconi, link)

eBPF Helper functions

Add a new helper function bpf_for_each_map_elem() to iterate and run a callback eBPF function with a given context on all elements of a map. This requires BTF information, and targets arrays, hash maps, LRU hash maps, and their per-CPU derivatives. (Yonghong Song, link)
Implement a new bpf_snprintf() helper, with a behavior close to the classic snprintf() function. The signature differs a little:
```
bpf_snprintf(char *str, u32 str_size, const char *fmt, u64 *data, u32 data_len)
```
Format specifiers %s or %p are available, among others. The validation of the format string is performed by the verifier. (Florent Revest, link)
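
The main difference with snprintf() is how arguments are passed: instead of a variable argument list, the caller packs them into an array of 64-bit values and gives its size in bytes. The userspace model below, in C, only mimics that calling convention. It is a sketch, not the kernel implementation: the name model_bpf_snprintf, the short list of supported specifiers (%d, %u, %x, %s, %%), and the error handling are choices made for this illustration. The return value (length of the full string, trailing zero included) follows what the helper documentation describes, to the best of my understanding.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Append src to str at offset len, writing only what fits (one byte is
 * kept for the trailing zero). Returns the new untruncated length. */
static size_t append(char *str, uint32_t str_size, size_t len, const char *src)
{
	for (; *src != '\0'; src++, len++)
		if (len + 1 < str_size)
			str[len] = *src;
	return len;
}

/* Userspace model only: same argument layout as bpf_snprintf(), with
 * arguments packed in a u64 array whose size is given in bytes. */
long model_bpf_snprintf(char *str, uint32_t str_size, const char *fmt,
			const uint64_t *data, uint32_t data_len)
{
	uint32_t nr_args = data_len / sizeof(uint64_t);
	uint32_t arg = 0;
	size_t len = 0;

	assert(str != NULL && fmt != NULL);
	/* The array length must be a whole number of 64-bit slots. */
	if (data_len % sizeof(uint64_t) != 0 || (nr_args > 0 && data == NULL))
		return -EINVAL;

	for (const char *p = fmt; *p != '\0'; p++) {
		char piece[32];
		const char *src = piece;

		if (*p != '%') {
			piece[0] = *p;
			piece[1] = '\0';
		} else if (*++p == '%') {
			piece[0] = '%';
			piece[1] = '\0';
		} else {
			if (arg >= nr_args)
				return -EINVAL;  /* more specifiers than slots */
			switch (*p) {
			case 'd':
				snprintf(piece, sizeof(piece), "%d", (int)data[arg]);
				break;
			case 'u':
				snprintf(piece, sizeof(piece), "%u", (unsigned int)data[arg]);
				break;
			case 'x':
				snprintf(piece, sizeof(piece), "%x", (unsigned int)data[arg]);
				break;
			case 's':
				/* Pointers travel as 64-bit integers. */
				src = (const char *)(uintptr_t)data[arg];
				break;
			default:
				return -EINVAL;  /* unsupported in this model */
			}
			arg++;
		}
		len = append(str, str_size, len, src);
	}

	if (str_size > 0)
		str[len < str_size ? len : str_size - 1] = '\0';
	return (long)len + 1;  /* full length, trailing zero included */
}
```

A caller would fill an array such as `uint64_t args[] = { pid, (uint64_t)(uintptr_t)comm };` and pass `sizeof(args)` as the last argument. If the returned length is greater than the buffer size, the output was truncated.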

Miscellaneous

Add support for floating point types (float and double) in BTF. The objective is to help load programs with BTF information on the s390 architecture. (Ilya Leoshkevich, link)
Document the various sub-commands for the bpf() system call (map operations, program load, object pinning, and so on). This documentation is added to the UAPI header file, just like the documentation of the eBPF helpers. From there it can be parsed and turned into friendlier formats, and should eventually be available from the kernel’s API guide once version 5.13 is out. (Joe Stringer, link)
Add support for the BPF_PROG_TEST_RUN subcommand to programs of type BPF_PROG_TYPE_SK_LOOKUP, to be able to “test-run” them and evaluate their performance. This recent program type helps select a socket for new TCP or UDP flows and overcomes some of the bind() limitations in specific use cases. (Lorenz Bauer, link)
Enable task-local storage for tracing programs (only LSM programs could access it so far). (Song Liu, link)
Add UDP support to socket maps (“sockmaps”). Only TCP was supported so far. The use case that motivated the change is the need for an efficient solution to proxy connections over AF_UNIX sockets for thousands of services connected to a daemon, after they have been moved into a VM. (Cong Wang, link)
Extend batch map operations (lookup, update, delete) to LPM (Longest Prefix Match) maps. (Pedro Tammela, link)
Extend batch map operations (lookup, update, delete) to per-CPU array maps. (Pedro Tammela, link)
Allow detaching and re-attaching the trampolines of eBPF links associated with programs of certain types (tracing programs or eBPF LSM programs in particular). This makes it possible to reattach such programs after detaching them from their hook, as long as they remain loaded in the kernel. (Jiri Olsa, link)

Tools

Update libbpf to support static linking of multiple ELF files containing eBPF bytecode. This is a huge step forward for building modular programs. The new linker added to libbpf supports extern resolution of global symbols, which means that global variables, eBPF sub-programs (functions), and maps defined with (or without) BTF information can all be compiled individually into multiple, separate object files, and then assembled by libbpf into a single object file. One can achieve this by calling dedicated functions from the library, or with the new bpftool command bpftool gen object <output-file> <input_file>.... A few follow-up issues are still being worked on. (Andrii Nakryiko, link (static linker), link (extern resolution))

Community

We are happy to host more and more people interested in the technology on the eBPF Community Slack!

The #ebpf slack channel is growing like crazy, we passed 2.3K people. Lots of cool stuff being built and shared. Lurking and see what others are building might be one of the best ways to get started learning about eBPF. https://t.co/Geqq9QU9AZ

— Thomas Graf (@tgraf__) March 4, 2021

eBPF experience is improving by the day.

can’t believe an eBPF thing “just worked” the first time i ran it; the future is amazing

— 𓃭𓇋𓊃𓄿𓁐 (@mycoliza) March 16, 2021

Wow eBPF.

It’s been 10 years since I stopped doing kernel work. Every time I peek in, of course I find new things, but usually along the lines of what I know is happening in the industry. This is the first time I’ve been kind of shocked. In a good way.

— jlbec (@jlbec) March 29, 2021

This leads to critical gains.

Massive #DDoS with 650Gbps of volumetric UDP, 0 impact to the clients network. #ebpf #xdp pic.twitter.com/IgPC2zu1fs

— Path Network (@path_network) February 25, 2021

… And we’re just getting started!

eBPF is going to eat the world.

— Jaana Dogan ヤナドガン (@rakyll) March 18, 2021

Credits

eBPF Updates are brought to you by the Cilium project. This report was produced by Quentin Monnet (Isovalent). Thanks to the Cilium engineering team (Paul Chaignon in particular) for input and reviews. And many thanks to all the contributors to the eBPF community and ecosystem, who generated the contents listed in this post!

If you would like to submit contributions for the next report, please submit them via the #ebpf-news channel on eBPF Slack.