Kubernetes LoadBalancer Service with MetalLB
May 24, 2023
In Kubernetes, a LoadBalancer Service is the most common way of exposing backend applications to the outside world. Its API is very similar to that of NodePort, the only difference being spec.type: LoadBalancer. At the very least, a user is expected to define which ports to expose and a label selector to match backend Pods:
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  ports:
  - name: web
    port: 80
  selector:
    app: web
  type: LoadBalancer
From the networking point of view, a LoadBalancer Service is expected to accomplish three things:
- Allocate a new, externally routable IP from a pool of addresses and release it when a Service is deleted.
- Make sure the packets for this IP get delivered to one of the Kubernetes Nodes.
- Program the Node-local data plane to deliver the incoming traffic to one of the healthy backend Endpoints.
By default, Kubernetes only takes care of the last item: kube-proxy (or its equivalent) programs the Node-local data plane to enable external reachability – most of this work is already done by the NodePort implementation. However, the most challenging parts – IP allocation and reachability – are left to external implementations. What this means is that in a vanilla Kubernetes cluster, LoadBalancer Services will remain in a “pending” state, i.e. they will have no external IP and will not be reachable from the outside:
$ kubectl get svc web
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
web LoadBalancer 10.96.86.198 <pending> 80:30956/TCP 43s
However, as soon as a LoadBalancer controller gets installed, it collects all “pending” Services and allocates a unique external IP from its own pool of addresses. It then updates the Service status with the allocated IP and configures the external infrastructure to deliver incoming packets to (by default) all Kubernetes Nodes.
apiVersion: v1
kind: Service
metadata:
  name: web
  namespace: default
spec:
  clusterIP: 10.96.174.4
  ports:
  - name: web
    nodePort: 32634
    port: 80
  selector:
    app: web
  type: LoadBalancer
status:
  loadBalancer:
    ingress:
    - ip: 198.51.100.0
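If you need the allocated address programmatically, it can be read straight out of the status subresource with a JSONPath query (this assumes the Service from the example above):

$ kubectl get svc web -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
198.51.100.0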
There are many implementations of these cluster add-ons, ranging from simple controllers designed to work in isolated environments all the way to feature-rich, production-grade projects. This is a relatively active area of development, with new projects appearing almost every year: MetalLB, OpenELB, kube-vip, PureLB, Klipper and so on. Here we use MetalLB as an example.
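For reference, here is a minimal sketch of a MetalLB configuration, assuming MetalLB is installed in the metallb-system namespace and running in its Layer 2 mode; the pool name and address range below are only illustrations and need to match addresses that are actually routable in your environment:

apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: example-pool
  namespace: metallb-system
spec:
  addresses:
  - 198.51.100.0-198.51.100.15
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: example-l2
  namespace: metallb-system
spec:
  ipAddressPools:
  - example-pool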
Here is a Service of type LoadBalancer after it has been assigned an external IP:
$ kubectl get svc web
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
web LoadBalancer 10.96.86.198 198.51.100.0 80:30956/TCP 2d23h
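Assuming the web Pods actually serve HTTP on port 80, the Service should now be reachable from any machine that can route to the external IP, e.g.:

$ curl -s http://198.51.100.0/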
When the kube-proxy IPTables mode is used as the data plane, we can find the following rules on a Kubernetes Node. As soon as a LoadBalancer controller publishes an external IP in the status.loadBalancer field, kube-proxy, which watches all Services, gets notified and inserts a KUBE-FW-* chain right next to the ClusterIP entry of the same Service. So somewhere inside the KUBE-SERVICES chain, you will see a rule that matches the external IP:
$ export NODE=k8s-guide-worker2
$ docker exec $NODE iptables -t nat -nvL KUBE-SERVICES
...
0 0 KUBE-SVC-LOLE4ISW44XBNF3G tcp -- * * 0.0.0.0/0 10.96.86.198 /* default/web cluster IP */ tcp dpt:80
5 300 KUBE-FW-LOLE4ISW44XBNF3G tcp -- * * 0.0.0.0/0 198.51.100.0 /* default/web loadbalancer IP */ tcp dpt:80
Inside the KUBE-FW chain packets get marked for IP masquerading (SNAT to the outgoing interface's address) and get redirected to the KUBE-SVC-* chain. The last KUBE-MARK-DROP entry is only used when spec.loadBalancerSourceRanges is defined, in order to drop packets from prefixes outside of the allowed ranges (see the sketch after the chain listing below):
$ export NODE=k8s-guide-worker2
$ docker exec $NODE iptables -t nat -nvL KUBE-FW-LOLE4ISW44XBNF3G
Chain KUBE-FW-LOLE4ISW44XBNF3G (1 references)
pkts bytes target prot opt in out source destination
5 300 KUBE-MARK-MASQ all -- * * 0.0.0.0/0 0.0.0.0/0 /* default/web loadbalancer IP */
5 300 KUBE-SVC-LOLE4ISW44XBNF3G all -- * * 0.0.0.0/0 0.0.0.0/0 /* default/web loadbalancer IP */
0 0 KUBE-MARK-DROP all -- * * 0.0.0.0/0 0.0.0.0/0 /* default/web loadbalancer IP */
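As a sketch of when that KUBE-MARK-DROP rule would be used, this is roughly how the same Service could restrict client prefixes; the CIDR below is only an illustration:

apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  type: LoadBalancer
  loadBalancerSourceRanges:
  - 203.0.113.0/24
  ports:
  - name: web
    port: 80
  selector:
    app: web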
The KUBE-SVC chain is the same as the one used for the ClusterIP Services – one of the Endpoints gets chosen randomly and incoming packets get DNAT’ed to its address inside one of the KUBE-SEP-* chains:
$ export NODE=k8s-guide-worker2
$ docker exec $NODE iptables -t nat -nvL KUBE-SVC-LOLE4ISW44XBNF3G
Chain KUBE-SVC-LOLE4ISW44XBNF3G (1 references)
pkts bytes target prot opt in out source destination
1 60 KUBE-MARK-MASQ tcp -- * * !10.244.0.0/16 10.96.86.198 /* default/web cluster IP */ tcp dpt:80
0 0 KUBE-SEP-MQ4W7Q2CV67URU6Q all -- * * 0.0.0.0/0 0.0.0.0/0 /* default/web -> 10.244.1.2:80 */ statistic mode random probability 0.50000000000
1 60 KUBE-SEP-YUPMFTK3IHSQP2LT all -- * * 0.0.0.0/0 0.0.0.0/0 /* default/web -> 10.244.2.6:80 */
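To follow a packet one level deeper, you can dump one of the KUBE-SEP-* chains from the listing above (the chain name is taken from that output and will differ in your cluster):

$ docker exec $NODE iptables -t nat -nvL KUBE-SEP-MQ4W7Q2CV67URU6Q

It typically contains two rules: a KUBE-MARK-MASQ rule matching traffic originating from the Endpoint itself (the hairpin case) and a DNAT rule that rewrites the destination address to the Pod IP and port (10.244.1.2:80 in this example).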