Kubernetes LoadBalancer Service with MetalLB
May 24, 2023
In Kubernetes, a LoadBalancer Service is the most common way of exposing backend applications to the outside world. Its API is very similar to that of NodePort, the only difference being spec.type: LoadBalancer. At the very least, a user is expected to define which ports to expose and a label selector to match backend Pods:
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  ports:
  - name: web
    port: 80
  selector:
    app: web
  type: LoadBalancer
From the networking point of view, a LoadBalancer Service is expected to accomplish three things:
- Allocate a new, externally routable IP from a pool of addresses and release it when a Service is deleted.
- Make sure the packets for this IP get delivered to one of the Kubernetes Nodes.
- Program the Node-local data plane to deliver the incoming traffic to one of the healthy backend Endpoints.
By default, Kubernetes only takes care of the last item: kube-proxy (or its equivalent) programs the Node-local data plane to enable external reachability – most of this work is already done by the NodePort implementation. However, the most challenging parts – IP allocation and reachability – are left to external implementations. What this means is that in a vanilla Kubernetes cluster, LoadBalancer Services will remain in a “pending” state, i.e. they will have no external IP and will not be reachable from the outside:
$ kubectl get svc web
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
web LoadBalancer 10.96.86.198 <pending> 80:30956/TCP 43s
However, as soon as a LoadBalancer controller gets installed, it collects all “pending” Services and allocates a unique external IP from its own pool of addresses. It then updates the Service status with the allocated IP and configures the external infrastructure to deliver incoming packets to (by default) all Kubernetes Nodes.
apiVersion: v1
kind: Service
metadata:
  name: web
  namespace: default
spec:
  clusterIP: 10.96.174.4
  ports:
  - name: web
    nodePort: 32634
    port: 80
  selector:
    app: web
  type: LoadBalancer
status:
  loadBalancer:
    ingress:
    - ip: 198.51.100.0
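If you need the allocated address programmatically, it can be read straight out of the status subresource with a JSONPath query (this assumes the Service from the example above):

$ kubectl get svc web -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
198.51.100.0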
There are many implementations of these cluster add-ons, ranging from simple controllers designed to work in isolated environments all the way to feature-rich, production-grade projects. This is a relatively active area of development, with new projects appearing almost every year: MetalLB, OpenELB, kube-vip, PureLB, Klipper and so on. Here we use MetalLB as an example.
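For reference, here is a minimal sketch of a MetalLB configuration, assuming MetalLB is installed in the metallb-system namespace and running in its Layer 2 mode; the pool name and address range below are only illustrations and need to match addresses that are actually routable in your environment:

apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: example-pool
  namespace: metallb-system
spec:
  addresses:
  - 198.51.100.0-198.51.100.15
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: example-l2
  namespace: metallb-system
spec:
  ipAddressPools:
  - example-pool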
Here is a Service of type LoadBalancer after it has been assigned an external IP:
$ kubectl get svc web
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
web LoadBalancer 10.96.86.198 198.51.100.0 80:30956/TCP 2d23h
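Assuming the web Pods actually serve HTTP on port 80, the Service should now be reachable from any machine that can route to the external IP, e.g.:

$ curl -s http://198.51.100.0/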
When the kube-proxy IPTables mode is used as the data plane, we can find the following rules on a Kubernetes Node. As soon as a LoadBalancer controller publishes an external IP in the status.loadBalancer field, kube-proxy, which watches all Services, gets notified and inserts a KUBE-FW-* chain right next to the ClusterIP entry of the same Service. So somewhere inside the KUBE-SERVICES chain, you will see a rule that matches the external IP:
$ export NODE=k8s-guide-worker2
$ docker exec $NODE iptables -t nat -nvL KUBE-SERVICES
...
0 0 KUBE-SVC-LOLE4ISW44XBNF3G tcp -- * * 0.0.0.0/0 10.96.86.198 /* default/web cluster IP */ tcp dpt:80
5 300 KUBE-FW-LOLE4ISW44XBNF3G tcp -- * * 0.0.0.0/0 198.51.100.0 /* default/web loadbalancer IP */ tcp dpt:80
Inside the KUBE-FW chain packets get marked for IP masquerading (SNAT to the outgoing interface's address) and get redirected to the KUBE-SVC-* chain. The last KUBE-MARK-DROP entry is only used when spec.loadBalancerSourceRanges is defined, in order to drop packets from prefixes outside of the allowed ranges (see the sketch after the chain listing below):
$ export NODE=k8s-guide-worker2
$ docker exec $NODE iptables -t nat -nvL KUBE-FW-LOLE4ISW44XBNF3G
Chain KUBE-FW-LOLE4ISW44XBNF3G (1 references)
pkts bytes target prot opt in out source destination
5 300 KUBE-MARK-MASQ all -- * * 0.0.0.0/0 0.0.0.0/0 /* default/web loadbalancer IP */
5 300 KUBE-SVC-LOLE4ISW44XBNF3G all -- * * 0.0.0.0/0 0.0.0.0/0 /* default/web loadbalancer IP */
0 0 KUBE-MARK-DROP all -- * * 0.0.0.0/0 0.0.0.0/0 /* default/web loadbalancer IP */
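As a sketch of when that KUBE-MARK-DROP rule would be used, this is roughly how the same Service could restrict client prefixes; the CIDR below is only an illustration:

apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  type: LoadBalancer
  loadBalancerSourceRanges:
  - 203.0.113.0/24
  ports:
  - name: web
    port: 80
  selector:
    app: web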
The KUBE-SVC chain is the same as the one used for the ClusterIP Services – one of the Endpoints gets chosen randomly and incoming packets get DNAT’ed to its address inside one of the KUBE-SEP-* chains:
$ export NODE=k8s-guide-worker2
$ docker exec $NODE iptables -t nat -nvL KUBE-SVC-LOLE4ISW44XBNF3G
Chain KUBE-SVC-LOLE4ISW44XBNF3G (1 references)
pkts bytes target prot opt in out source destination
1 60 KUBE-MARK-MASQ tcp -- * * !10.244.0.0/16 10.96.86.198 /* default/web cluster IP */ tcp dpt:80
0 0 KUBE-SEP-MQ4W7Q2CV67URU6Q all -- * * 0.0.0.0/0 0.0.0.0/0 /* default/web -> 10.244.1.2:80 */ statistic mode random probability 0.50000000000
1 60 KUBE-SEP-YUPMFTK3IHSQP2LT all -- * * 0.0.0.0/0 0.0.0.0/0 /* default/web -> 10.244.2.6:80 */
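To follow a packet one level deeper, you can dump one of the KUBE-SEP-* chains from the listing above (the chain name is taken from that output and will differ in your cluster):

$ docker exec $NODE iptables -t nat -nvL KUBE-SEP-MQ4W7Q2CV67URU6Q

It typically contains two rules: a KUBE-MARK-MASQ rule matching traffic originating from the Endpoint itself (the hairpin case) and a DNAT rule that rewrites the destination address to the Pod IP and port (10.244.1.2:80 in this example).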