Linux Networking Stack tutorial: Receiving Data #

TL;DR #

This post gives readers a high-level overview of how the Linux network stack receives data from the network.

Consider the following UDP code:

int main(){
    int serverSocketFd = socket(AF_INET, SOCK_DGRAM, 0);
    bind(serverSocketFd, ...);

    char buff[BUFFSIZE];
    /* read at most BUFFSIZE - 1 bytes so the terminating '\0' always fits */
    int readCount = recvfrom(serverSocketFd, buff, BUFFSIZE - 1, 0, ...);
    buff[readCount] = '\0';
    printf("Receive from client:%s\n", buff);

    return 0;
}

From the application developer's perspective, as long as data arrives from the client and the server calls recvfrom, the server should be able to receive it and print it out. However, we'd like to understand what happens under the hood from the moment a packet arrives at the NIC until the application receives the data from recvfrom.

P.S. The following discussion is based on Linux kernel v5.15.

Overview #

The TCP/IP model organizes the Internet Protocol Suite into the Physical, Data link, Network, Transport, and Application layers. Things like the Network Interface Controller (NIC) and the network cable are considered part of the Physical layer, while applications like Nginx, Envoy, and so on belong to the Application layer. What Linux implements are the Data link, Network, and Transport layers. More specifically, the network device driver implements the Data link layer, and the kernel protocol stack implements the Network and Transport layers.

Linux Network protocol stack
Fig.1 Linux Network protocol stack

The network device driver source code can be found under drivers/net/ethernet; the source code for Intel drivers is under drivers/net/ethernet/intel. The source code for the network stack is located under kernel and net.

The kernel and the network device driver interact with each other via interrupts. When data arrives at the network device, it signals a hardware interrupt request (IRQ) to the CPU to trigger execution of the interrupt handler. When an IRQ handler is executed by the Linux kernel, it runs at a very high priority and often blocks additional IRQs from being generated. As such, IRQ handlers in device drivers must execute as quickly as possible and defer all long-running work to execute outside of this context. This is why the soft IRQ system (serviced by the ksoftirqd kernel threads) exists.
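
To make the hand-off concrete, here is a hypothetical sketch (not igb code; the demo_* names and struct are made up) of what the hard-IRQ half of a driver typically does: acknowledge the device, schedule NAPI (which raises NET_RX_SOFTIRQ internally), and return immediately.

struct demo_nic {
    struct napi_struct napi;          /* per-device NAPI instance */
    /* ... hypothetical device state ... */
};

static irqreturn_t demo_nic_irq_handler(int irq, void *data)
{
    struct demo_nic *nic = data;      /* hypothetical per-device state */

    demo_nic_disable_irq(nic);        /* hypothetical: quiet the device */
    napi_schedule(&nic->napi);        /* defer the heavy lifting to soft IRQ context */

    return IRQ_HANDLED;               /* hard-IRQ context ends here */
}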

Here is a simplified packet-receiving flow:

flow-overview
Fig.2 Simplified Overall Flow

After the packets arrive at the NIC,

  1. the NIC writes the frames to RAM via DMA
  2. the NIC generates a hardware interrupt request (IRQ) to the CPU to indicate that data has arrived
  3. the CPU calls the interrupt handler registered by the NIC driver; the handler executes a few instructions, raises a soft IRQ, and then releases the CPU
  4. ksoftirqd (one per CPU) picks up the soft IRQ and calls the registered poll function to pull packets off the ring buffer and deliver them upward to the protocol layers

Initial setup #

Before modules such as the NIC driver and the kernel protocol stack are able to receive packets, some groundwork has to be done first: creating the ksoftirqd kernel threads, registering protocol handlers, initializing the network subsystem, and starting the NIC.

ksoftirqd kernel threads #

Linux soft IRQs are processed by a group of dedicated kernel threads (ksoftirqd); the number of ksoftirqd threads equals the number of CPUs.

During kernel initialization, spawn_ksoftirqd (registered with early_initcall) calls smpboot_register_percpu_thread in kernel/smpboot.c, which creates one ksoftirqd thread per CPU.

ksoftirqd
Fig.3 create ksoftirqd kernel threads

static struct smp_hotplug_thread softirq_threads = {
	.store			= &ksoftirqd,
	.thread_should_run	= ksoftirqd_should_run,
	.thread_fn		= run_ksoftirqd,
	.thread_comm		= "ksoftirqd/%u",
};

static __init int spawn_ksoftirqd(void)
{
	register_cpu_notifier(&cpu_nfb);

	BUG_ON(smpboot_register_percpu_thread(&softirq_threads));

	return 0;
}
early_initcall(spawn_ksoftirqd);

After the ksoftirqd kernel threads are created, they constantly run ksoftirqd_should_run and run_ksoftirqd in their processing loop to check whether any soft IRQ is pending. P.S. ksoftirqd threads are designed to process all kinds of soft IRQs, which are defined in include/linux/interrupt.h:

//file: include/linux/interrupt.h
enum{
    HI_SOFTIRQ=0,
    TIMER_SOFTIRQ,
    NET_TX_SOFTIRQ,
    NET_RX_SOFTIRQ,
    BLOCK_SOFTIRQ,
    BLOCK_IOPOLL_SOFTIRQ,
    TASKLET_SOFTIRQ,
    SCHED_SOFTIRQ,
    HRTIMER_SOFTIRQ,
    RCU_SOFTIRQ,  
};
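
To make that processing loop concrete, here is a conceptual sketch (not verbatim kernel source; the real code in kernel/softirq.c adds accounting and preemption details) of the two callbacks the smpboot framework keeps invoking on each CPU:

static int ksoftirqd_should_run(unsigned int cpu)
{
    return local_softirq_pending();   /* any pending soft IRQ bits on this CPU? */
}

static void run_ksoftirqd(unsigned int cpu)
{
    local_irq_disable();
    if (local_softirq_pending())
        __do_softirq();               /* walk softirq_vec[] and run each pending handler */
    local_irq_enable();
}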

network sub-system initialization #

ksoftirqd
Fig.4 network subsys init

The Linux kernel initializes the supported subsystems via subsys_initcall; net_dev_init is one of them.

//file: net/core/dev.c

static int __init net_dev_init(void){

    ......

    for_each_possible_cpu(i) {
        struct softnet_data *sd = &per_cpu(softnet_data, i);

        memset(sd, 0, sizeof(*sd));
        skb_queue_head_init(&sd->input_pkt_queue);
        skb_queue_head_init(&sd->process_queue);
        sd->completion_queue = NULL;
        INIT_LIST_HEAD(&sd->poll_list);
        ......
    }
    ......
    open_softirq(NET_TX_SOFTIRQ, net_tx_action);
    open_softirq(NET_RX_SOFTIRQ, net_rx_action);

}

subsys_initcall(net_dev_init);

In net_dev_init, a softnet_data structure is allocated for every CPU; its poll_list is the list that keeps the registered NIC driver poll functions. The registration process will be discussed later on.

open_softirq registers net_rx_action as the handler of NET_RX_SOFTIRQ and net_tx_action as the handler of NET_TX_SOFTIRQ. Both IRQ numbers and handlers are saved into the softirq_vec array, which is used later for handler lookup when a soft IRQ is processed in run_ksoftirqd.

//file: kernel/softirq.c

void open_softirq(int nr, void (*action)(struct softirq_action *)){

    softirq_vec[nr].action = action;

}
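
For illustration, here is a hedged sketch of how the table filled by open_softirq is consumed later: raising NET_RX_SOFTIRQ only marks a pending bit, and the soft IRQ core then walks softirq_vec and invokes the matching action, i.e. net_rx_action (the demo_* functions below are illustrative, not kernel functions).

static void demo_raise_net_rx(void)
{
    __raise_softirq_irqoff(NET_RX_SOFTIRQ);   /* mark NET_RX_SOFTIRQ pending (IRQs assumed off) */
}

/* conceptual core of the dispatch loop inside __do_softirq() */
static void demo_dispatch(u32 pending)
{
    struct softirq_action *h = softirq_vec;

    while (pending) {
        if (pending & 1)
            h->action(h);                     /* e.g. net_rx_action for NET_RX_SOFTIRQ */
        h++;
        pending >>= 1;
    }
}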

protocol stack registration #

The kernel implements the IP protocol at the Network layer, and the TCP and UDP protocols at the Transport layer. The respective implementation handlers are ip_rcv, tcp_v4_rcv, and udp_rcv. Unlike application code, the kernel manages these implementations via registration. Like subsys_initcall above, fs_initcall (another entry point of kernel initialization) calls inet_init to register the protocol stacks, in which the protocol handlers are registered into the inet_protos and ptype_base structures.

ksoftirqd
Fig.5 AF_INET protocol stack registration

The related code follows:

//file: net/ipv4/af_inet.c

static struct packet_type ip_packet_type __read_mostly = {

    .type = cpu_to_be16(ETH_P_IP),
    .func = ip_rcv,
};
static const struct net_protocol udp_protocol = {
    .handler =  udp_rcv,
    .err_handler =  udp_err,
    .no_policy =    1,
    .netns_ok = 1,
};
static const struct net_protocol tcp_protocol = {
    .early_demux    =   tcp_v4_early_demux,
    .handler    =   tcp_v4_rcv,
    .err_handler    =   tcp_v4_err,
    .no_policy  =   1,
    .netns_ok   =   1,
};

static int __init inet_init(void){

    ......
    if (inet_add_protocol(&icmp_protocol, IPPROTO_ICMP) < 0)
        pr_crit("%s: Cannot add ICMP protocol\n", __func__);
    if (inet_add_protocol(&udp_protocol, IPPROTO_UDP) < 0)
        pr_crit("%s: Cannot add UDP protocol\n", __func__);
    if (inet_add_protocol(&tcp_protocol, IPPROTO_TCP) < 0)
        pr_crit("%s: Cannot add TCP protocol\n", __func__);
    ......
    dev_add_pack(&ip_packet_type);

}

The above code shows that udp_rcv is registered as the udp_protocol handler and tcp_v4_rcv as the tcp_protocol handler via inet_add_protocol:

int inet_add_protocol(const struct net_protocol *prot, unsigned char protocol){
    if (!prot->netns_ok) {
        pr_err("Protocol %u is not namespace aware, cannot register.\n",
            protocol);
        return -EINVAL;
    }

    return !cmpxchg((const struct net_protocol **)&inet_protos[protocol],
            NULL, prot) ? 0 : -1;

}

inet_add_protocol registers the TCP and UDP handlers into the inet_protos array. Furthermore, ip_packet_type (with ip_rcv as its func) is registered into the ptype_base hash table:

//file: net/core/dev.c

void dev_add_pack(struct packet_type *pt){

    struct list_head *head = ptype_head(pt);
    ......

}

static inline struct list_head *ptype_head(const struct packet_type *pt){

    if (pt->type == htons(ETH_P_ALL))
        return &ptype_all;
    else
        return &ptype_base[ntohs(pt->type) & PTYPE_HASH_MASK];

}

The TCP and UDP handlers are kept in inet_protos, and the IP handler ip_rcv is kept in ptype_base. run_ksoftirqd can deliver packets upward to the IP protocol handler ip_rcv via ptype_base, and ip_rcv then continues delivering packets upward to udp_rcv or tcp_v4_rcv by asking inet_protos for a handler lookup.
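
As an illustration of that second lookup (a hedged sketch; the in-tree equivalent lives in ip_protocol_deliver_rcu in net/ipv4/ip_input.c), the packet's protocol number simply indexes inet_protos to find the transport handler:

static void demo_deliver_to_transport(struct sk_buff *skb, int protocol)
{
    const struct net_protocol *ipprot;

    ipprot = rcu_dereference(inet_protos[protocol]);  /* e.g. IPPROTO_UDP */
    if (ipprot)
        ipprot->handler(skb);                         /* udp_rcv() or tcp_v4_rcv() */
}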

P.S. Protocol handlers like ip_rcv and udp_rcv do the heavy lifting of data processing. For example, the ip_rcv handler is one of the netfilter/iptables consulting points, and udp_rcv checks whether the socket receive buffer queue is full to decide whether to drop the packet (net.core.rmem_max and net.core.rmem_default can be used to adjust the buffer size).
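
As a small userspace illustration of that buffer sizing (a sketch; the fd is assumed to be a UDP socket such as the one in the example at the top), SO_RCVBUF requests a larger per-socket receive buffer, and the kernel caps the request at net.core.rmem_max:

#include <stdio.h>
#include <sys/socket.h>

/* ask for a bigger receive buffer and print the size the kernel actually granted */
static void demo_grow_rcvbuf(int fd)
{
    int want = 4 * 1024 * 1024;   /* request 4 MiB */
    int got = 0;
    socklen_t len = sizeof(got);

    setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &want, sizeof(want));
    getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &got, &len);
    printf("effective receive buffer: %d bytes\n", got);
}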

NIC driver registration #

All drivers are required to use module_init to register their init function with the kernel, which calls it when the driver is loaded. For the igb NIC driver, the init function can be found at drivers/net/ethernet/intel/igb/igb_main.c

//file: drivers/net/ethernet/intel/igb/igb_main.c

static struct pci_driver igb_driver = {

    .name     = igb_driver_name,
    .id_table = igb_pci_tbl,
    .probe    = igb_probe,
    .remove   = igb_remove,
    ......

};

static int __init igb_init_module(void){

    ......
    ret = pci_register_driver(&igb_driver);
    return ret;

}
module_init(igb_init_module);

After pci_register_driver(&igb_driver) is called, the Linux kernel knows the driver details such as igb_driver_name and igb_probe. When the kernel detects a matching NIC, it calls the registered probe function to make the card ready for processing data; in the case of the igb driver, that is igb_probe. The major initialization steps are shown in Fig.6:

nic-init
Fig.6 NIC initialization

The igb_netdev_ops registered at step #6 contains a function called igb_open, which will be called during the NIC start phase.

//file: drivers/net/ethernet/intel/igb/igb_main.c

static const struct net_device_ops igb_netdev_ops = {

  .ndo_open               = igb_open,
  .ndo_stop               = igb_close,
  .ndo_start_xmit         = igb_xmit_frame,
  .ndo_get_stats64        = igb_get_stats64,
  .ndo_set_rx_mode        = igb_set_rx_mode,
  .ndo_set_mac_address    = igb_set_mac,
  .ndo_change_mtu         = igb_change_mtu,
  .ndo_do_ioctl           = igb_ioctl,

  ......
};

At step #7, during igb_probe initialization, the NAPI poll function, i.e. igb_poll, is registered via netif_napi_add in igb_alloc_q_vector; the corresponding NAPI instance is later placed on the per-CPU softnet_data poll_list when the interrupt handler schedules it.

static int igb_alloc_q_vector(struct igb_adapter *adapter,
                 int v_count, int v_idx,
                 int txr_count, int txr_idx,
                 int rxr_count, int rxr_idx){
   ......
   /* initialize NAPI */
   netif_napi_add(adapter->netdev, &q_vector->napi,
              igb_poll, 64);

}
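
For orientation, here is a hedged sketch of the NAPI poll pattern that igb_poll follows (the demo_* names are hypothetical, not igb functions): pull at most budget packets off the RX ring, hand them to the stack, and re-enable the device interrupt once the ring is drained.

static int demo_poll(struct napi_struct *napi, int budget)
{
    struct demo_nic *nic = container_of(napi, struct demo_nic, napi);  /* hypothetical driver struct */
    int done = 0;

    while (done < budget) {
        struct sk_buff *skb = demo_fetch_from_rx_ring(nic);   /* hypothetical ring-buffer helper */

        if (!skb)
            break;
        napi_gro_receive(napi, skb);    /* hand the frame to the protocol stack */
        done++;
    }

    /* ring drained before the budget ran out: leave polling mode and re-arm IRQs */
    if (done < budget && napi_complete_done(napi, done))
        demo_nic_enable_irq(nic);       /* hypothetical */

    return done;
}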

Start NIC #

After the initial setup, the NIC is ready to be started. Since the net_device_ops structure was registered with the kernel during the NIC initialization phase, igb_open in net_device_ops is called when the NIC is brought up. The steps to start a NIC are shown in Fig.7:

start-nic
Fig.7 NIC start

//file: drivers/net/ethernet/intel/igb/igb_main.c
static int __igb_open(struct net_device *netdev, bool resuming){

    /* allocate transmit descriptors */
    err = igb_setup_all_tx_resources(adapter);

    /* allocate receive descriptors */
    err = igb_setup_all_rx_resources(adapter);

    err = igb_request_irq(adapter);
    if (err)
        goto err_req_irq;

    for (i = 0; i < adapter->num_q_vectors; i++)
        napi_enable(&(adapter->q_vector[i]->napi));
    ......
}

__igb_open calls igb_setup_all_tx_resources and igb_setup_all_rx_resources, which allocate the Tx and Rx ring buffers for each queue (the number of queues can be configured with ethtool if the NIC supports multiple queues).
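
For reference, what ethtool -g eth0 does under the hood is the SIOCETHTOOL ioctl; the sketch below (with "eth0" as an assumed interface name) queries the RX ring size from userspace:

#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>
#include <linux/ethtool.h>
#include <linux/sockios.h>

int main(void)
{
    struct ethtool_ringparam ring = { .cmd = ETHTOOL_GRINGPARAM };
    struct ifreq ifr;
    int fd = socket(AF_INET, SOCK_DGRAM, 0);

    memset(&ifr, 0, sizeof(ifr));
    strncpy(ifr.ifr_name, "eth0", IFNAMSIZ - 1);   /* assumed interface name */
    ifr.ifr_data = (char *)&ring;

    if (ioctl(fd, SIOCETHTOOL, &ifr) == 0)
        printf("rx ring: %u entries (hardware max %u)\n",
               ring.rx_pending, ring.rx_max_pending);
    return 0;
}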

In igb_request_irq of the igb driver, the functions igb_msix_ring, igb_intr_msi, and igb_intr are the interrupt handlers for the MSI-X, MSI, and legacy interrupt modes, respectively.

/**
 *  igb_request_irq - initialize interrupts
 *  @adapter: board private structure to initialize
 *
 *  Attempts to configure interrupts using the best available
 *  capabilities of the hardware and kernel.
 **/
static int igb_request_irq(struct igb_adapter *adapter)
{
	struct net_device *netdev = adapter->netdev;
	struct pci_dev *pdev = adapter->pdev;
	int err = 0;

	if (adapter->flags & IGB_FLAG_HAS_MSIX) {
		err = igb_request_msix(adapter);
		if (!err)
			goto request_done;
		/* fall back to MSI */
		igb_free_all_tx_resources(adapter);
		igb_free_all_rx_resources(adapter);

        .....
	}

    .....

	if (adapter->flags & IGB_FLAG_HAS_MSI) {
		err = request_irq(pdev->irq, igb_intr_msi, 0,
				  netdev->name, adapter);
		if (!err)
			goto request_done;

		/* fall back to legacy interrupts */
        ......
	}

	err = request_irq(pdev->irq, igb_intr, IRQF_SHARED,
			  netdev->name, adapter);

	if (err)
		dev_err(&pdev->dev, "Error %d getting interrupt\n",
			err);

request_done:
	return err;
}

In the MSI-X case, every RX queue has its own interrupt, which allows the NIC to notify different CPUs to process incoming packets (the IRQs of multiple RX queues can be managed by irqbalance, or by writing to /proc/irq/IRQ_NUMBER/smp_affinity to set the CPU affinity manually).
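
For completeness, setting the affinity from a program is just a write to that proc file; the sketch below (42 is a hypothetical IRQ number) pins it to CPU 0, equivalent to echo 1 > /proc/irq/42/smp_affinity:

#include <stdio.h>

int main(void)
{
    /* needs root; 42 is a hypothetical IRQ number taken from /proc/interrupts */
    FILE *f = fopen("/proc/irq/42/smp_affinity", "w");

    if (!f) {
        perror("fopen");
        return 1;
    }
    fputs("1", f);   /* CPU bitmask 0x1 = CPU 0 */
    fclose(f);
    return 0;
}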

After all of this, the NIC and the kernel are ready to process incoming data.

Data processing #

Reference #