Packet processing in Linux - 101
Imagine your computer as a high-speed train station, where millions of passengers (packets) arrive every second. Without efficient routing and scheduling, chaos would ensue, slowing everything down. In the same way, Linux optimizes packet processing to ensure smooth network performance. But how does this work under the hood? Let's break down the process of receiving a packet.
Life of a Packet: From Arrival to Application
When a packet arrives at a computer's Network Interface Card (NIC), it triggers a complex sequence of events that involve multiple hardware and software components.
1. Initial Reception
The journey begins when the NIC receives a packet from the network. The NIC performs preliminary processing, including checksum verification and basic packet filtering. Once the packet passes these initial checks, the NIC generates an interrupt to notify the CPU that new data has arrived.
You can inspect `/proc/interrupts` to find out which IRQ number is associated with a NIC; as interrupts fire, the per-CPU counters in that file keep increasing.
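As a quick illustration, the minimal C sketch below prints the `/proc/interrupts` lines for a given device; the interface name "eth0" is an assumption, so substitute your own. Run it twice under traffic and you will see the counters grow.

```c
#include <stdio.h>
#include <string.h>

/* Print the /proc/interrupts lines for a given device name. */
int main(void) {
    const char *dev = "eth0";  /* assumed NIC name; adjust for your system */
    char line[4096];
    FILE *f = fopen("/proc/interrupts", "r");
    if (!f) { perror("fopen"); return 1; }
    while (fgets(line, sizeof(line), f)) {
        /* Each matching line shows the IRQ number followed by
           per-CPU interrupt counters, which grow as packets arrive. */
        if (strstr(line, dev))
            fputs(line, stdout);
    }
    fclose(f);
    return 0;
}
```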
2. Interrupt Handling
When the CPU receives the interrupt, it temporarily pauses its current task to handle the network packet. The operating system's network driver takes control and copies the packet from the NIC's buffer to a kernel buffer.
To prevent performance degradation from excessive interrupts during high-traffic situations, modern NICs employ techniques like interrupt coalescing, which bundles multiple packets into a single interrupt.
3. Kernel Processing and Socket Buffers
Once the packet reaches the kernel, it enters a sophisticated buffering and processing system. At the heart of this system are socket buffers, which serve as the fundamental unit of data management throughout the networking stack. Socket buffers are represented by the `sk_buff` struct; its full definition can be found in the Linux source tree in `include/linux/skbuff.h`.
The following sections cover the purpose of this struct, as well as the traversal of the network stack, at a high level.
3.1 Socket Buffer Structure
A socket buffer (sk_buff) is a data structure that:
Maintains packet data and metadata
Tracks the packet's current position in the protocol stack
Contains protocol headers and payload
Manages memory references and buffer states
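For intuition, here is a heavily simplified, illustrative sketch of the structure; the real definition in `include/linux/skbuff.h` has dozens more fields, and the names below are only loosely modeled on it:

```c
/* A heavily simplified, illustrative take on struct sk_buff;
 * the real definition lives in include/linux/skbuff.h. */
struct sock;        /* owning socket (opaque here) */
struct net_device;  /* device the packet arrived on (opaque here) */

struct sk_buff_simplified {
    struct sk_buff_simplified *next;  /* queue linkage */
    struct sock        *sk;     /* destination socket, once matched */
    struct net_device  *dev;    /* receiving network device */
    unsigned int        len;    /* bytes of data currently in the buffer */
    unsigned char      *head;   /* start of the allocated buffer */
    unsigned char      *data;   /* start of the *current* layer's data */
    unsigned char      *tail;   /* end of the data */
    unsigned char      *end;    /* end of the allocated buffer */
    /* ... protocol header offsets, checksum state, refcounts, ... */
};
```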
3.2 Protocol Stack Traversal
As the packet moves through the layers of the networking stack, the same sk_buff is modified in place. Reusing one sk_buff structure across layers avoids unnecessary memory allocations and improves efficiency.
The Linux networking stack is closely aligned with the TCP/IP model. You can think of it this way: each layer of the TCP/IP model processes the sk_buff only after the previous layer has finished with it.
1. Link Layer (Ethernet)
The sk_buff structure is allocated and initialized
Link layer headers are processed and stripped
2. Network Layer (IP)
IP headers are validated and processed
Routing decisions are made
3. Transport Layer (TCP/UDP)
Transport headers are processed
Data is segmented or reassembled as needed
The packet is matched to its destination socket
4. Application Delivery and Socket Interface
The final stage involves moving data between kernel space and user space through the socket interface.
For this purpose, the Linux networking stack uses per-socket send and receive buffers, as well as backlog queues.
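To make the traversal through layers 1 to 3 concrete: the kernel does not copy the packet at each layer; it simply advances the sk_buff's data pointer past each header it has consumed, which is what the kernel's skb_pull() helper does. A minimal sketch of that idea, using a toy stand-in struct and fixed header sizes for illustration:

```c
#include <stdio.h>

/* Minimal stand-in for an sk_buff: only the fields needed here. */
struct skb {
    unsigned char *data;  /* start of the current layer's data */
    unsigned int   len;   /* bytes remaining */
};

/* Advance past a consumed header, mimicking the kernel's skb_pull(). */
static unsigned char *skb_pull(struct skb *skb, unsigned int n) {
    skb->len -= n;
    skb->data += n;
    return skb->data;
}

int main(void) {
    unsigned char frame[1514] = {0};  /* pretend Ethernet frame */
    struct skb skb = { frame, sizeof(frame) };

    skb_pull(&skb, 14);  /* link layer: strip the 14-byte Ethernet header */
    skb_pull(&skb, 20);  /* network layer: strip a 20-byte IPv4 header */
    skb_pull(&skb, 20);  /* transport layer: strip a 20-byte TCP header */

    /* skb.data now points at the application payload. */
    printf("payload bytes remaining: %u\n", skb.len);
    return 0;
}
```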
4. Packet Delivery
Packet delivery is a layered process. The following diagram shows the journey of a network packet through the Linux networking stack:
[ NIC ] ---> [ Kernel Buffer ] ---> [ Network Stack ] ---> [ Socket Buffer ] ---> [ Application ]
As packets are received, they are first stored in a buffer on the NIC. From there they are quickly moved to a backlog queue as part of the hardware interrupt handling; this keeps the time spent in the interrupt handler as short as possible. A software interrupt (softirq) is then raised to move the packets from the backlog, through the protocol stack, into a socket buffer. Once the data is in the socket buffer, an application can consume it using one of the syscalls (e.g. read).
When talking about packet delivery there are roughly four areas of interest.
1. Packet Matching
The kernel identifies the destination socket based on a number of parameters, including the protocol (e.g. TCP or UDP), port numbers, and IP addresses. This is usually done by computing a hash over these fields, either on the CPU or, more commonly these days, on the network card itself.
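To illustrate the idea only: the toy FNV-1a hash below is not what Linux actually uses (the kernel's software path uses jhash, and NICs commonly compute a Toeplitz hash), but it shows how a flow's 5-tuple collapses to a single bucket index:

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Toy FNV-1a hash over a flow's 5-tuple. Illustrative only: the kernel
 * uses jhash in software and NICs typically use a Toeplitz hash. */
static uint32_t flow_hash(uint32_t saddr, uint32_t daddr,
                          uint16_t sport, uint16_t dport, uint8_t proto) {
    uint8_t key[13];
    memcpy(key, &saddr, 4);
    memcpy(key + 4, &daddr, 4);
    memcpy(key + 8, &sport, 2);
    memcpy(key + 10, &dport, 2);
    key[12] = proto;

    uint32_t h = 2166136261u;            /* FNV offset basis */
    for (size_t i = 0; i < sizeof(key); i++)
        h = (h ^ key[i]) * 16777619u;    /* FNV prime */
    return h;
}

int main(void) {
    /* Hash one hypothetical TCP flow; the result could index a socket
     * hash table or an RSS/RPS indirection table. */
    uint32_t h = flow_hash(0x0A000001, 0x0A000002, 44321, 443, 6);
    printf("flow hash: 0x%08x -> bucket %u\n", h, h % 128u);
    return 0;
}
```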
2. Buffer Management
Packets are queued in the socket's receive buffer. The buffer size is limited and the limit is enforced; it can be adjusted per socket via the SO_RCVBUF socket option.
If a socket's receive buffer becomes full, further packets will be dropped. Some network protocols (e.g. TCP) implement flow control to prevent or manage buffer overflows.
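A minimal sketch of adjusting the receive buffer from an application; the 256 KiB value is an arbitrary example, and note that Linux doubles the requested value to account for bookkeeping overhead and caps it at net.core.rmem_max:

```c
#include <stdio.h>
#include <sys/socket.h>

int main(void) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    /* Request a 256 KiB receive buffer (a hypothetical value). */
    int req = 256 * 1024;
    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &req, sizeof(req)) < 0)
        perror("setsockopt");

    /* Read back the effective size: Linux doubles the requested value
     * and caps it at net.core.rmem_max. */
    int got;
    socklen_t len = sizeof(got);
    if (getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &got, &len) == 0)
        printf("effective receive buffer: %d bytes\n", got);
    return 0;
}
```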
3. Data Delivery
When an application calls recv() or read(), the kernel checks whether data is available in the socket buffer. If data exists, it is copied from kernel space to user space. If no data is available, the call either blocks or returns immediately, depending on the socket's flags (e.g. O_NONBLOCK).
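A hedged sketch of the non-blocking side of this behaviour, using MSG_DONTWAIT to make a single recv() call return immediately when the socket buffer is empty; fd is assumed to be an already-connected TCP socket:

```c
#include <errno.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Drain whatever is currently queued in a connected socket's
 * receive buffer, without ever blocking. */
void drain(int fd) {
    char buf[4096];
    for (;;) {
        /* MSG_DONTWAIT makes this single call non-blocking even if
         * the socket itself is in blocking mode. */
        ssize_t n = recv(fd, buf, sizeof(buf), MSG_DONTWAIT);
        if (n > 0)
            continue;                /* n bytes copied from kernel space; process them */
        if (n == 0)
            break;                   /* peer closed the connection */
        if (errno == EAGAIN || errno == EWOULDBLOCK)
            break;                   /* socket receive buffer is empty */
        perror("recv");
        break;
    }
}
```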
4. Memory Management
Once the data is copied to user space, the kernel may free or recycle the sk_buff structure and update the socket's receive buffer accounting.
4.1 Socket Buffer Tuning
System administrators can tune socket buffer behaviour through several sysctl parameters:
Maximum receive buffer size: net.core.rmem_max
Default receive buffer size: net.core.rmem_default
Maximum number of packets in backlog queue: net.core.netdev_max_backlog
These settings can significantly impact network performance, especially for high-throughput applications.
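These parameters are usually inspected and changed with sysctl(8) (e.g. `sysctl net.core.rmem_max`); the small sketch below just reads the current values through their /proc/sys counterparts, where dots in the sysctl name become slashes:

```c
#include <stdio.h>

/* Print current values of a few networking sysctls by reading
 * their /proc/sys counterparts. */
int main(void) {
    const char *knobs[] = {
        "/proc/sys/net/core/rmem_max",
        "/proc/sys/net/core/rmem_default",
        "/proc/sys/net/core/netdev_max_backlog",
    };
    for (int i = 0; i < 3; i++) {
        FILE *f = fopen(knobs[i], "r");
        long v;
        if (f && fscanf(f, "%ld", &v) == 1)
            printf("%s = %ld\n", knobs[i], v);
        if (f) fclose(f);
    }
    return 0;
}
```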
5. Turbocharging Packet Processing: Key Optimizations
In modern multi-core systems, efficient packet processing requires careful consideration of how network interrupts and processing tasks are distributed across CPU cores. Several techniques can help optimize this process:
1. Core Affinity Control
This optimization technique involves controlling which CPU cores handle network packets. By default, operating systems distribute network interrupts across multiple cores for load balancing. However, you can manually control this distribution through several mechanisms:
1. IRQ Affinity: You can bind specific network interrupts to particular cores by modifying the system's IRQ affinity settings. In Linux, this is done through the `/proc/irq/<IRQ number>/smp_affinity` interface.
It is often useful to bind a NIC's interrupt to a specific CPU core and minimize other work on that core. This keeps the relevant caches (e.g. the L1 caches and the TLB) warm, so the core can process packets as efficiently as possible.
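A minimal sketch of setting IRQ affinity programmatically; the IRQ number 24 is purely hypothetical, so look up your NIC's actual IRQ in /proc/interrupts first (root privileges required):

```c
#include <stdio.h>

/* Pin an IRQ to CPU 2 by writing a hex CPU bitmask to
 * /proc/irq/<n>/smp_affinity. IRQ 24 is hypothetical: look up your
 * NIC's real IRQ number in /proc/interrupts. Needs root. */
int main(void) {
    FILE *f = fopen("/proc/irq/24/smp_affinity", "w");
    if (!f) { perror("fopen"); return 1; }
    /* "4" is the hex mask 0b100, i.e. CPU 2 only. */
    fprintf(f, "4\n");
    fclose(f);
    return 0;
}
```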
2. Receive Side Scaling (RSS)
RSS distributes network receive processing across multiple hardware-based receive (RX) queues. If a NIC supports multiple receive queues, a different CPU core can be assigned to each one.
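One way to check how many RX queues a NIC exposes is to list its queue directories in sysfs, as in this sketch (the interface name eth0 is an assumption; `ethtool -l <dev>` reports similar information):

```c
#include <dirent.h>
#include <stdio.h>
#include <string.h>

/* Count the RX queues a NIC exposes by listing its sysfs queue
 * directories ("eth0" is an assumed interface name). */
int main(void) {
    DIR *d = opendir("/sys/class/net/eth0/queues");
    if (!d) { perror("opendir"); return 1; }
    int rx = 0;
    struct dirent *e;
    while ((e = readdir(d)) != NULL)
        if (strncmp(e->d_name, "rx-", 3) == 0)
            rx++;
    closedir(d);
    printf("eth0 has %d receive queue(s)\n", rx);
    return 0;
}
```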
3. Receive Packet Steering (RPS)
RPS is similar to RSS in that it distributes packet processing among different cores. However, instead of being a hardware feature, RPS is implemented in software. RSS selects the queue and the CPU that will run the interrupt handler, whereas RPS selects the CPU to use for processing "above" the interrupt handler. This is useful when a NIC has only one receive queue but we would like to distribute the receive load across multiple CPU cores (a configuration sketch follows the list below).
This approach has a few advantages such as:
can be used by any NIC
can easily add new software filters
doesn’t increase hardware device interrupt rate
Note, however, that RPS does not take application locality into account; it merely distributes processing across CPUs based on a hash.
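A minimal RPS configuration sketch, assuming an interface named eth0 with a single RX queue; the CPU mask 0xf (CPUs 0 through 3) is an arbitrary example (root privileges required):

```c
#include <stdio.h>

/* Enable RPS on eth0's first (and possibly only) RX queue by writing a
 * hex CPU bitmask to its rps_cpus file. The interface name and the
 * mask are assumptions. Needs root. */
int main(void) {
    FILE *f = fopen("/sys/class/net/eth0/queues/rx-0/rps_cpus", "w");
    if (!f) { perror("fopen"); return 1; }
    fprintf(f, "f\n");   /* 0xf = CPUs 0, 1, 2, 3 */
    fclose(f);
    return 0;
}
```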
4. Receive Flow Steering (RFS)
RFS is an optimization technique that aims to improve cache efficiency by ensuring packets are processed on the same CPU core as the application thread that will consume them. To achieve this, RFS maintains a flow lookup table (rps_sock_flow_table) that maps flows to the CPUs where they are being processed; the packet's hash is used as the index into this table. Each table entry is updated during calls to recvmsg and sendmsg (specifically, inet_recvmsg(), inet_sendmsg() and tcp_splice_read()).
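A hedged sketch of enabling RFS; the table size of 32768 and the interface name eth0 are assumptions, and sizing is typically based on the expected number of concurrent flows (root privileges required):

```c
#include <stdio.h>

/* Minimal RFS enablement sketch: size the global socket flow table and
 * the per-queue flow count. Needs root. */
static int write_str(const char *path, const char *val) {
    FILE *f = fopen(path, "w");
    if (!f) { perror(path); return -1; }
    fputs(val, f);
    fclose(f);
    return 0;
}

int main(void) {
    /* Global table backing rps_sock_flow_table. */
    write_str("/proc/sys/net/core/rps_sock_flow_entries", "32768\n");
    /* Per-queue flow count; with a single queue it can simply
     * equal the global table size. */
    write_str("/sys/class/net/eth0/queues/rx-0/rps_flow_cnt", "32768\n");
    return 0;
}
```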
Best Practices and Considerations
When optimizing network packet processing:
1. Monitor Performance: Use tools like netstat, perf, and sar to measure the impact of your optimizations.
2. Consider Workload: RFS and similar optimizations may add unnecessary overhead for low-traffic scenarios. Evaluate whether the potential benefits justify the additional complexity.
3. Thread Stability: If possible, pin application threads to specific cores to maximize the effectiveness of flow steering and reduce the overhead of updating flow mappings (see the sketch after this list).
4. Memory Impact: Be aware that features like RFS consume memory to track flow mappings. Configure limits appropriate for your system's resources.
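As an example of thread pinning, here is a minimal sketch using pthread_setaffinity_np(); the choice of CPU 3 is hypothetical and would normally match the core, or at least the cache domain, handling that thread's flows:

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

/* Pin the calling thread to a single CPU core. */
static int pin_to_cpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

int main(void) {
    int err = pin_to_cpu(3);   /* CPU 3 is a hypothetical choice */
    if (err != 0)
        fprintf(stderr, "pthread_setaffinity_np failed: %d\n", err);
    /* ... run the network-consuming loop on this pinned thread ... */
    return 0;
}
```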
Conclusion
Understanding how network packets move through your system and knowing how to optimize processing can significantly impact application performance. While modern operating systems handle most optimizations automatically, knowledge of these mechanisms allows you to fine-tune your system for specific workloads and requirements. Whether you're building high-frequency trading applications or managing busy web servers, these optimization techniques can help ensure your network-intensive applications perform at their best.