Linux Networking Stack tutorial: Receiving Data #

A censorship circumvention tool may be required to open the links in this post from mainland China.

TL;DR #

This post aims to give readers a high-level overview of how the Linux network stack receives data from the network.

Consider the following UDP code:

int main() {
    // Create a UDP socket and bind it to a local address/port (details elided).
    int serverSocketFd = socket(AF_INET, SOCK_DGRAM, 0);
    bind(serverSocketFd, ...);

    // Block in recvfrom until a datagram arrives, then print it.
    char buff[BUFFSIZE];
    int readCount = recvfrom(serverSocketFd, buff, BUFFSIZE - 1, 0, ...);
    buff[readCount] = '\0';
    printf("Receive from client:%s\n", buff);

    return 0;
}

From the application developer's perspective, as long as the client sends data and the server calls recvfrom, the server should be able to receive it and print it out. However, we'd like to understand what happens under the hood from the moment a packet arrives at the NIC until the application receives the data from recvfrom.

P.S. The following discussion is based on Linux kernel v5.15.

Overview #

The TCP/IP model organizes the Internet Protocol Suite into the Physical, Data Link, Network, Transport, and Application layers. Things like the Network Interface Controller/Card (NIC) and the network cable are considered part of the Physical layer, while applications like Nginx, Envoy, and so on belong to the Application layer. What Linux implements are the Data Link, Network, and Transport layers: more specifically, the NIC driver implements the Data Link layer, and the kernel protocol stack implements the Network and Transport layers.

Linux Network protocol stack
Fig.1 Linux Network protocol stack

The high-level path a packet takes from arrival to the socket receive buffer is as follows:

  1. Driver is loaded and initialized.
  2. Packet arrives at the NIC from the network.
  3. Packet is copied (via DMA) to a ring buffer in kernel memory.
  4. Hardware interrupt is generated to let the system know a packet is in memory.
  5. Driver calls into NAPI to start a poll loop if one was not running already (via napi_schedule or other NAPI APIs from the device driver).
  6. ksoftirqd processes run on each CPU on the system. They are registered at boot time. The ksoftirqd processes pull packets off the ring buffer by calling the NAPI poll function that the device driver registered during initialization.
  7. Memory regions in the ring buffer that have had network data written to them are unmapped.
  8. Data that was DMA’d into memory is passed up the networking layer as an skb for more processing.
  9. Incoming network data frames are distributed among multiple CPUs if Receive Packet Steering (RPS) is enabled or if the NIC has multiple receive queues.
  10. Network data frames are handed to the protocol layers from the queues.
  11. Protocol layers process data.
  12. Data is added to receive buffers attached to sockets by protocol layers.

Here is a simplified version of the overall flow (some of the steps above are left out):

flow-overview
Fig.2 Simplified Overall Flow

This entire flow will be separated into different parts to walk through.

The protocol layers examined below are the IP and UDP protocol layers. Much of the information presented will serve as a reference for other protocol layers, as well.

Initial setup #

Devices have many ways of alerting the rest of the computer system that some work is ready for processing. In the case of network devices, it is common for the NIC to raise an IRQ to signal that a packet has arrived and is ready to be processed. When an IRQ handler is executed by the Linux kernel, it runs at a very, very high priority and often blocks additional IRQs from being generated. As such, IRQ handlers in device drivers must execute as quickly as possible and defer all long running work to execute outside of this context. This is why the softirq system exists.

softirq initialization #

The softirq system in the Linux kernel is the mechanism the kernel uses to process work outside of the device driver IRQ context. In the case of network devices, the softirq system is responsible for processing incoming packets. The softirq system is initialized early during the boot process of the kernel.

The initialization of the softirq system is as follows:

InitialSetup
Fig.3 softirq initial setup
  1. softirq kernel threads are created (one per CPU) in spawn_ksoftirqd in kernel/softirq.c with a call to smpboot_register_percpu_thread from kernel/smpboot.c. As seen in the code, the function run_ksoftirqd is listed as thread_fn, which is the function that will be executed in a loop.
  2. The ksoftirqd threads begin executing their processing loops in the run_ksoftirqd function.
  3. Next, the softnet_data structures are created, one per CPU. These structures hold references to important data structures for processing network data. One we’ll see again is the poll_list. The poll_list is where NAPI poll worker structures will be added by calls to napi_schedule or other NAPI APIs from device drivers.
  4. net_dev_init then registers the NET_RX_SOFTIRQ softirq with the softirq system by calling open_softirq, as shown here. The handler function that is registered is called net_rx_action. This is the function the softirq kernel threads will execute to process packets.

Steps 5 - 8 on the diagram relate to the arrival of data for processing and will be mentioned in the next section. Read on for more!
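
To make steps 1-4 concrete, here is a heavily abridged sketch of the registration code, modeled on kernel/softirq.c and net/core/dev.c (error handling and unrelated setup omitted):

/* Abridged sketch based on kernel/softirq.c and net/core/dev.c. */

/* Steps 1-2: one ksoftirqd/<cpu> thread per CPU runs run_ksoftirqd() in a loop. */
static struct smp_hotplug_thread softirq_threads = {
    .store             = &ksoftirqd,
    .thread_should_run = ksoftirqd_should_run,
    .thread_fn         = run_ksoftirqd,      /* the per-CPU processing loop */
    .thread_comm       = "ksoftirqd/%u",
};

static __init int spawn_ksoftirqd(void)
{
    /* ... */
    BUG_ON(smpboot_register_percpu_thread(&softirq_threads));
    return 0;
}
early_initcall(spawn_ksoftirqd);

/* Step 4: net_dev_init() registers net_rx_action() as the NET_RX_SOFTIRQ handler. */
static int __init net_dev_init(void)
{
    /* ... per-CPU softnet_data structures (and their poll_list) are set up here ... */
    open_softirq(NET_TX_SOFTIRQ, net_tx_action);
    open_softirq(NET_RX_SOFTIRQ, net_rx_action);
    /* ... */
    return 0;
}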

NIC driver initialization #

All drivers are required to use module_init to register their init function with the kernel, which the kernel calls when the driver is loaded. For the igb NIC driver, the init function can be found in drivers/net/ethernet/intel/igb/igb_main.c.

nic-init
Fig.4 NIC driver initial setup
  1. When the driver is loaded into the kernel, the kernel calls pci_register_driver in igb_init_module, so the Linux kernel knows the driver's details, such as igb_driver_name, igb_probe, and so on (see the sketch after this list).
  2. When the kernel detects that the NIC is a device this driver supports, it calls the registered probe function, i.e., igb_probe in our case. igb_probe then performs the next steps:
  3. MAC initialization
  4. Initialize ethernet
  5. Set ethtool ops
  6. Initialize netdev and net_device_ops
  7. Initialize NAPI and register the poll function
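
For reference, the registration in steps 1-2 follows the standard PCI driver pattern; the sketch below is abridged from igb_main.c (most fields and all error handling omitted):

/* Abridged from drivers/net/ethernet/intel/igb/igb_main.c. */
static struct pci_driver igb_driver = {
    .name     = igb_driver_name,
    .id_table = igb_pci_tbl,    /* the PCI device IDs this driver supports */
    .probe    = igb_probe,      /* called when a matching NIC is detected  */
    .remove   = igb_remove,
};

static int __init igb_init_module(void)
{
    /* Hand the driver's details to the PCI subsystem. */
    return pci_register_driver(&igb_driver);
}
module_init(igb_init_module);   /* run igb_init_module() when the driver is loaded */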

Bringing a NIC up #

Recall the net_device_ops structure we saw earlier which registered a set of functions for bringing the NIC up, transmitting packets, setting the MAC address, etc.

When a NIC is brought up (for example, with ifconfig eth0 up), the function attached to the ndo_open field of the net_device_ops structure is called. i.e., igb_open in our case.
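
As an illustration, the igb driver's net_device_ops table looks roughly like this (abridged from igb_main.c; most fields omitted):

/* Abridged from drivers/net/ethernet/intel/igb/igb_main.c. */
static const struct net_device_ops igb_netdev_ops = {
    .ndo_open            = igb_open,        /* called on "ifconfig eth0 up" */
    .ndo_stop            = igb_close,
    .ndo_start_xmit      = igb_xmit_frame,  /* transmit path                */
    .ndo_set_mac_address = igb_set_mac,
    /* ... many more operations ... */
};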

The igb_open function will typically do things like the following (see the sketch after this list):

start-nic
Fig.5 NIC start
  1. Allocate RX and TX queue memory
  2. Enable NAPI
  3. Register an interrupt handler
  4. Enable hardware interrupts
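
A heavily abridged sketch of what happens inside igb_open (via __igb_open), based on igb_main.c, follows; the real function performs much more configuration and error handling:

/* Heavily abridged sketch of __igb_open() from igb_main.c. */
static int __igb_open(struct net_device *netdev, bool resuming)
{
    struct igb_adapter *adapter = netdev_priv(netdev);
    int i;

    igb_setup_all_tx_resources(adapter);    /* 1. allocate TX queue memory   */
    igb_setup_all_rx_resources(adapter);    /* 1. allocate RX ring buffers   */
    igb_request_irq(adapter);               /* 3. register the IRQ handler   */

    for (i = 0; i < adapter->num_q_vectors; i++)
        napi_enable(&adapter->q_vector[i]->napi);   /* 2. enable NAPI        */

    igb_irq_enable(adapter);                /* 4. enable hardware interrupts */
    netif_tx_start_all_queues(netdev);
    return 0;
}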

Data arrives #

When network data arrives at a NIC, the NIC will use DMA to write the packet data to RAM. In the case of the igb network driver, a ring buffer is set up in RAM that points to received packets. It is important to note that some NICs are multiqueue NICs, meaning that they can DMA incoming packets to one of many ring buffers in RAM. As we'll see soon, such NICs are able to make use of multiple processors for processing incoming network data. Read more about multiqueue NICs. The diagram in Fig.6 below shows just a single ring buffer for simplicity, but depending on the NIC you are using and your hardware settings, you may have multiple queues on your system.

Read more detail about the process described below in this section of the networking blog post.

Let’s walk through the process of receiving data:

DataArrives
Fig.6 Data arrives
  1. Data is received by the NIC from the network.
  2. The NIC uses DMA to write the network data to RAM.
  3. The NIC raises an IRQ.
  4. The NIC driver’s registered IRQ handler is executed.
  5. The IRQ is cleared on the NIC, so that it can generate IRQs for new packet arrivals.
  6. NAPI softirq poll loop is started with a call to napi_schedule.

The call to napi_schedule triggers the start of steps 5 - 8 in Fig.3 above. As we’ll see, the NAPI softirq poll loop is started by simply flipping a bit in a bitfield and adding a structure to the poll_list for processing. No other work is done by napi_schedule and this is precisely how a driver defers processing to the softirq system.
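
That deferral boils down to roughly the following (simplified from net/core/dev.c; the real ____napi_schedule also handles threaded NAPI and runs with hardware interrupts masked):

/* Simplified from net/core/dev.c: what napi_schedule() ultimately does. */
static inline void ____napi_schedule(struct softnet_data *sd,
                                     struct napi_struct *napi)
{
    /* Add the driver's NAPI poll structure to this CPU's poll_list ... */
    list_add_tail(&napi->poll_list, &sd->poll_list);

    /* ... and mark NET_RX_SOFTIRQ pending so the softirq system will run
     * net_rx_action() for this CPU. */
    __raise_softirq_irqoff(NET_RX_SOFTIRQ);
}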

Continuing on to the diagram in Fig.3, using the numbers found there:

  1. The call to napi_schedule in the driver adds the driver’s NAPI poll structure to the poll_list for the current CPU.
  2. The softirq pending bit is set so that the ksoftirqd process on this CPU knows that there are packets to process.
  3. The run_ksoftirqd function (which is being run in a loop by the ksoftirqd kernel thread) executes.
  4. __do_softirq is called which checks the pending bitfield, sees that a softirq is pending, and calls the handler registered for the pending softirq: net_rx_action which does all the heavy lifting for incoming network data processing.

It is important to note that the softirq kernel thread is executing net_rx_action, not the device driver IRQ handler.
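
Conceptually, the dispatch in steps 3-4 looks like the following; this is a simplified sketch, not the literal kernel code (which also handles rescheduling, accounting, and restart limits):

/* Simplified sketch of the softirq dispatch path. */
static void run_ksoftirqd(unsigned int cpu)
{
    local_irq_disable();
    if (local_softirq_pending()) {
        /* __do_softirq() walks the pending bitfield and calls the handler
         * registered for each pending softirq; for NET_RX_SOFTIRQ that
         * handler is net_rx_action(). */
        __do_softirq();
    }
    local_irq_enable();
}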

Network data processing begins #

Now, data processing begins. The net_rx_action function (called from the ksoftirqd kernel thread) will start to process any NAPI poll structures that have been added to the poll_list for the current CPU. Poll structures are added in two general cases: by device drivers calling napi_schedule, and by an Inter-processor Interrupt (IPI) in the case of Receive Packet Steering (RPS).

We’re going to start by walking through what happens when a driver’s NAPI structure is retrieved from the poll_list. (The next section describes how NAPI structures registered with IPIs for RPS work.)

data-processing-1
Fig.7 Data processing begins

The diagram above is explained in depth here, but can be summarized as follows (a simplified code sketch follows this list):

  1. net_rx_action loop starts by checking the NAPI poll list for NAPI structures.
  2. The budget and elapsed time are checked to ensure that the softirq will not monopolize CPU time.
  3. The registered poll function is called. In this case, the function igb_poll was registered by the igb driver.
  4. The driver’s poll function harvests packets from the ring buffer in RAM.
  5. Packets are handed over to napi_gro_receive, which will deal with possible Generic Receive Offloading.
  6. Packets are either held for GRO and the call chain ends or packets are passed on to netif_receive_skb to proceed up toward the protocol stacks.
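
Steps 1-4 can be sketched as follows (simplified from net_rx_action in net/core/dev.c; list splicing, re-polling, and statistics are omitted):

/* Simplified sketch of net_rx_action() from net/core/dev.c. */
static void net_rx_action(struct softirq_action *h)
{
    struct softnet_data *sd = this_cpu_ptr(&softnet_data);
    unsigned long time_limit = jiffies + usecs_to_jiffies(netdev_budget_usecs);
    int budget = netdev_budget;
    LIST_HEAD(repoll);                      /* NAPI structs with work left over */

    while (!list_empty(&sd->poll_list)) {
        struct napi_struct *n;

        n = list_first_entry(&sd->poll_list, struct napi_struct, poll_list);

        /* napi_poll() invokes the driver's registered poll function
         * (igb_poll() here), which harvests packets from the RX ring
         * and hands them to napi_gro_receive(). */
        budget -= napi_poll(n, &repoll);

        /* Don't let the softirq monopolize the CPU: stop when either
         * the packet budget or the time limit is exhausted. */
        if (budget <= 0 || time_after_eq(jiffies, time_limit))
            break;
    }
}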

We’ll see next how netif_receive_skb deals with Receive Packet Steering to distribute the packet processing load amongst multiple CPUs.

Network data processing continues #

Network data processing continues from netif_receive_skb, but the path the data takes depends on whether Receive Packet Steering (RPS) is enabled. An out-of-the-box Linux kernel does not have RPS enabled by default; it needs to be explicitly enabled and configured if you want to use it (by writing a CPU bitmask to /sys/class/net/<dev>/queues/rx-<n>/rps_cpus).

data-processing-2
Fig.8 Data processing continues

In the case where RPS is disabled, using the numbers in the above Fig.8:

  1. netif_receive_skb passes the data on to __netif_receive_skb_core.
  2. __netif_receive_skb_core delivers data to any taps (like PCAP).
  3. __netif_receive_skb_core delivers data to registered protocol layer handlers. In many cases, this would be the ip_rcv function that the IPv4 protocol stack has registered.

In the case where RPS is enabled:

  1. netif_receive_skb passes the data on to enqueue_to_backlog.
  2. Packets are placed on a per-CPU input queue for processing.
  3. The remote CPU’s NAPI structure is added to that CPU’s poll_list and an IPI is queued which will trigger the ksoftirqd kernel thread on the remote CPU to wake-up if it is not running already.
  4. When the ksoftirqd kernel thread on the remote CPU runs, it follows the same pattern described in the previous section, but this time the registered poll function is process_backlog, which harvests packets from the current CPU’s input queue.
  5. Packets are passed on toward __netif_receive_skb_core.
  6. __netif_receive_skb_core delivers data to any taps (like PCAP).
  7. __netif_receive_skb_core delivers data to registered protocol layer handlers. In many cases, this would be the ip_rcv function that the IPv4 protocol stack has registered (see the sketch after this list).
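
How did ip_rcv end up registered as a protocol layer handler in the first place? The IPv4 stack registers a packet_type for the IPv4 ethertype at boot; the snippet below is abridged from net/ipv4/af_inet.c:

/* Abridged from net/ipv4/af_inet.c: how IPv4 registers its receive handler. */
static struct packet_type ip_packet_type __read_mostly = {
    .type = cpu_to_be16(ETH_P_IP),  /* match the IPv4 ethertype                  */
    .func = ip_rcv,                 /* called by __netif_receive_skb_core        */
    .list_func = ip_list_rcv,
};

static int __init inet_init(void)
{
    /* ... */
    dev_add_pack(&ip_packet_type);  /* hook into the protocol handler list */
    /* ... */
    return 0;
}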

Protocol stacks and userland sockets #

Next up are the protocol stacks, netfilter, Berkeley Packet Filters, and finally the userland socket. This code path is long, but linear and relatively straightforward.

You can continue following the detailed path for network data. A very brief, high level summary of the path is:

  1. Packets are received by the IPv4 protocol layer with ip_rcv.
  2. netfilter (iptables) hooks are run and a routing decision (including the early-demux optimization) is performed.
  3. Data destined for the current system is delivered to higher-level protocol layers, like UDP.
  4. Packets are received by the UDP protocol layer with udp_rcv and are queued to the receive buffer of a userland socket by udp_queue_rcv_skb and sock_queue_rcv. Prior to queuing to the receive buffer, netfilter hooks are processed (see the sketch after this list).
  5. Finally, any processes in userland waiting on data to arrive in the socket are notified with a call to the sk_data_ready notification handler function in sock_queue_rcv.
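
Conceptually, the last two steps boil down to appending the skb to the socket's receive queue and waking any blocked reader. The sketch below only illustrates that idea; it is not the literal kernel code, which also performs socket memory accounting and error handling:

/*
 * Conceptual sketch only -- not the literal kernel code.
 * The protocol layer appends the skb to the socket's receive queue and then
 * calls sk_data_ready() (sock_def_readable() by default), which wakes any
 * process blocked in recvfrom() on this socket.
 */
static void deliver_to_socket(struct sock *sk, struct sk_buff *skb)
{
    skb_queue_tail(&sk->sk_receive_queue, skb);   /* what recvfrom() reads   */
    sk->sk_data_ready(sk);                        /* wake up sleeping reader */
}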

Note that netfilter & iptables are consulted multiple times throughout this process. The exact locations can be found in this detailed walk-through.

Keep in mind: if you have numerous or very complex netfilter or iptables rules, those rules will be executed in the softirq context and can lead to latency in your network stack. This may be unavoidable, though, if you need to have a particular set of rules installed.

Monitoring & Tuning #

It is impossible to tune or monitor the Linux networking stack without reading the source code of the kernel and having a deep understanding of what exactly is happening.

As a starting point for monitoring & tuning, here are some hints gathered from this awesome post on monitoring & tuning the Linux networking stack.

Hints for diving in #

Note: Since some of the sections are not directly linkable in the original post, you may need to use the browser's search to locate them manually.

Hints for tools #

  • ethtool: a powerful monitoring & tuning tool for network devices, e.g.,
    # Increase the size of each RX queue (ring buffer) to 4096 with ethtool -G
    $ sudo ethtool -G eth0 rx 4096
    
  • sysfs: slightly higher-level (compared to ethtool) statistics for the NIC, e.g.,
    $ cat /sys/class/net/eth0/statistics/rx_dropped
    2
    
  • /proc/net/dev: an even higher-level, summary-esque view of each network adapter, e.g.,
    $ cat /proc/net/dev
    Inter-|   Receive                                                |  Transmit
     face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets errs drop fifo colls carrier compressed
      eth0: 110346752214 597737500    0    2    0     0          0  20963860 990024805984 6066582604    0    0    0     0       0          0
        lo: 428349463836 1579868535   0    0    0     0          0         0 428349463836 1579868535    0    0    0     0       0          0
    
  • /proc/softirqs: softirq system statistic counters, e.g.,
    $ cat /proc/softirqs
                         CPU0       CPU1       CPU2       CPU3
                HI:          0          0          0          0
             TIMER: 2831512516 1337085411 1103326083 1423923272
            NET_TX:   15774435     779806     733217     749512
            NET_RX: 1671622615 1257853535 2088429526 2674732223
             BLOCK: 1800253852    1466177    1791366     634534
      BLOCK_IOPOLL:          0          0          0          0
           TASKLET:         25          0          0          0
             SCHED: 2642378225 1711756029  629040543  682215771
           HRTIMER:    2547911    2046898    1558136    1521176
               RCU: 2056528783 4231862865 3545088730  844379888
    
  • /proc/interrupts: Hardware interrupt requests, e.g.,
    $ cat /proc/interrupts
                CPU0       CPU1       CPU2       CPU3
    0:         46          0          0          0 IR-IO-APIC-edge      timer
    1:          3          0          0          0 IR-IO-APIC-edge      i8042
    30: 3361234770          0          0          0 IR-IO-APIC-fasteoi   aacraid
    64:          0          0          0          0 DMAR_MSI-edge      dmar0
    65:          1          0          0          0 IR-PCI-MSI-edge      eth0
    66:  863649703          0          0          0 IR-PCI-MSI-edge      eth0-TxRx-0
    67:  986285573          0          0          0 IR-PCI-MSI-edge      eth0-TxRx-1
    68:         45          0          0          0 IR-PCI-MSI-edge      eth0-TxRx-2
    69:        394          0          0          0 IR-PCI-MSI-edge      eth0-TxRx-3
    NMI:    9729927    4008190    3068645    3375402  Non-maskable interrupts
    LOC: 2913290785 1585321306 1495872829 1803524526  Local timer interrupts
    
  • /proc/irq/IRQ_NUMBER/smp_affinity: IRQ affinities, e.g.,
    # Check the irqbalance configuration before running this command.
    # Set the IRQ affinity for IRQ 8 to CPU 0
    $ sudo bash -c 'echo 1 > /proc/irq/8/smp_affinity'
    
  • /proc/net/softnet_stat: statistics tracked as part of the struct softnet_data associated with each CPU, e.g.,
    $ cat /proc/net/softnet_stat
    6dcad223 00000000 00000001 00000000 00000000 00000000 00000000 00000000 00000000 00000000
    6f0e1565 00000000 00000002 00000000 00000000 00000000 00000000 00000000 00000000 00000000
    660774ec 00000000 00000003 00000000 00000000 00000000 00000000 00000000 00000000 00000000
    61c99331 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
    6794b1b3 00000000 00000005 00000000 00000000 00000000 00000000 00000000 00000000 00000000
    6488cb92 00000000 00000001 00000000 00000000 00000000 00000000 00000000 00000000 00000000    
    
  • /proc/net/snmp: detailed IP protocol statistics
    $ cat /proc/net/snmp
    Ip: Forwarding DefaultTTL InReceives InHdrErrors InAddrErrors ForwDatagrams InUnknownProtos InDiscards InDelivers OutRequests OutDiscards OutNoRoutes ReasmTimeout ReasmReqds ReasmOKs ReasmFails FragOKs FragFails FragCreates
    Ip: 1 64 25922988125 0 0 15771700 0 0 25898327616 22789396404 12987882 51 1 10129840 2196520 1 0 0 0
    ...
    
  • /proc/net/netstat: extended IP protocol statistics
    $ cat /proc/net/netstat | grep IpExt
    IpExt: InNoRoutes InTruncatedPkts InMcastPkts OutMcastPkts InBcastPkts OutBcastPkts InOctets OutOctets InMcastOctets OutMcastOctets InBcastOctets OutBcastOctets InCsumErrors InNoECTPkts InECT0Pkts InCEPkts
    IpExt: 0 0 0 0 277959 0 14568040307695 32991309088496 0 0 58649349 0 0 0 0 0
    
  • /proc/net/udp: UDP socket statistics
    $ cat /proc/net/udp
    sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode ref pointer drops
    515: 00000000:B346 00000000:0000 07 00000000:00000000 00:00000000 00000000   104        0 7518 2 0000000000000000 0
    558: 00000000:0371 00000000:0000 07 00000000:00000000 00:00000000 00000000     0        0 7408 2 0000000000000000 0
    588: 0100007F:038F 00000000:0000 07 00000000:00000000 00:00000000 00000000     0        0 7511 2 0000000000000000 0
    769: 00000000:0044 00000000:0000 07 00000000:00000000 00:00000000 00000000     0        0 7673 2 0000000000000000 0
    812: 00000000:006F 00000000:0000 07 00000000:00000000 00:00000000 00000000     0        0 7407 2 0000000000000000 0
    
  • /proc/net/tcp: TCP socket statistics
    $ cat /proc/net/tcp
    sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode
    0: 0100007F:F199 00000000:0000 0A 00000000:00000000 00:00000000 00000000  1000        0 127900 1 0000000000000000 100 0 0 10 0
    1: 00000000:0386 00000000:0000 0A 00000000:00000000 00:00000000 00000000     0        0 45649 1 0000000000000000 100 0 0 10 0
    2: 00000000:0016 00000000:0000 0A 00000000:00000000 00:00000000 00000000     0        0 36314 1 0000000000000000 100 0 0 10 0
    3: 00000000:18EB 00000000:0000 0A 00000000:00000000 00:00000000 00000000     0        0 323024 1 0000000000000000 100 0 0 10 0
    4: 0100007F:6F93 00000000:0000 0A 00000000:00000000 00:00000000 00000000  1000        0 502101 1 0000000000000000 100 0 0 10 0
    5: 0100007F:EF19 00000000:0000 0A 00000000:00000000 00:00000000 00000000     0        0 43064 1 0000000000000000 100 0 0 10 0
    

kernel parameters #

  • net.core.netdev_budget: how much packet processing can be spent among all NAPI structures registered to a CPU in one softirq cycle; the default is 300.
  • net.core.netdev_max_backlog: the maximum number of packets that can be queued per CPU in enqueue_to_backlog; increasing it helps prevent drops if you are using RPS or if your driver calls netif_rx; the default is 1000.
  • net.core.dev_weight: how much of the overall budget the backlog poll loop can consume; the default is 64.
  • net.core.flow_limit_table_len: the size of the flow limit table; the default is 4096.
  • net.core.rps_sock_flow_entries: the size of the RFS socket flow hash.
  • net.core.netdev_tstamp_prequeue: controls when packets are timestamped after they are received.
  • net.core.rmem_max: the maximum receive buffer size.
  • net.core.rmem_default: the default initial receive buffer size.

Conclusion #

The Linux network stack is incredibly complex and has many different systems interacting together. Any effort to tune or monitor these complex systems must strive to understand the interaction between all of them and how changing settings in one system will affect others.

This (poorly) illustrated post attempts to help readers form a basic picture of how the Linux network stack works.

Reference #