How conntrack Could Be Limiting Your k8s Gateway

Under high load in specific scenarios, a Kubernetes gateway may be limited by more than just its obvious CPU and memory limits and requests, or by Karpenter aggressively sizing the node (a different topic!). You may be hitting a wall: conntrack exhaustion.

For the uninitiated: conntrack, put simply, is a Linux kernel subsystem that tracks every network connection entering, exiting, or passing through the system, maintaining the state of each connection. That state is crucial for tasks like NAT (Network Address Translation), firewalling, and session continuity. It operates as part of Netfilter, the kernel's framework for network packet filtering, which provides the underlying infrastructure for connection tracking, packet filtering, and network address translation. The short version of the problem: if the number of tracked connections exceeds nf_conntrack_max (found with sysctl net.netfilter.nf_conntrack_max), whether through long-lived connections, stale entries, or an inundation of requests, your CPU and memory headroom will look fine but requests will be dropped.

$ sysctl net.netfilter.nf_conntrack_max
net.netfilter.nf_conntrack_max = 131072
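
As a quick sanity check, and assuming the nf_conntrack module is loaded, you can compare the live entry count against that limit and look for the kernel's drop message (the count below is a hypothetical value, and the exact log wording varies by kernel version):

$ sysctl net.netfilter.nf_conntrack_count
net.netfilter.nf_conntrack_count = 118342 # hypothetical, uncomfortably close to the 131072 limit

$ dmesg | grep conntrack
nf_conntrack: table full, dropping packet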

How to monitor

How can we monitor for this type of event, along with the many other hardware- and OS-level metrics that are important to collect? Prometheus ships a collector called node_exporter. By utilizing it, you can track and monitor conntrack statistics, which the project describes as: "Shows conntrack statistics (does nothing if no /proc/sys/net/netfilter/ present)." If you are running on AWS with a Nitro-based instance type and ENA driver version 2.8.1 or newer, AWS can also gather these metrics into CloudWatch should you prefer.
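
As a rough sketch, assuming node_exporter is listening on its default port 9100, the two relevant gauges can be pulled straight from the metrics endpoint (the values below are hypothetical):

$ curl -s http://localhost:9100/metrics | grep nf_conntrack
node_nf_conntrack_entries 118342
node_nf_conntrack_entries_limit 131072

A PromQL expression along the lines of node_nf_conntrack_entries / node_nf_conntrack_entries_limit > 0.8 makes a reasonable early-warning alert, firing well before the table actually fills.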

Ways around the problem

So how can we get around the issue? The most straightforward answer, in the context of AWS, is to move up to a larger EC2 instance size. AWS sets the conntrack table size based on the instance's CPU, memory, and OS architecture (32/64-bit), so bigger instances get bigger tables. Throw more power at it! But wait, isn't that a monolith mentality?

root@ip-192-168-30-28:/# sysctl net.netfilter.nf_conntrack_max
net.netfilter.nf_conntrack_max = 262144 # m5.large

root@ip-192-168-98-251:/# sysctl net.netfilter.nf_conntrack_max
net.netfilter.nf_conntrack_max = 131072 # t2.micro

Another way is simply knowing your machine and traffic type, establishing what it can handle through performance tests, and setting conntrack_max accordingly. Using the kube-proxy ConfigMap, we can set the maximum declaratively, as seen in the Kubernetes docs:

    conntrack:
      maxPerCore: 32768
      min: 131072
      tcpCloseWaitTimeout: 1h0m0s
      tcpEstablishedTimeout: 24h0m0s
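
kube-proxy then sizes the table as maxPerCore multiplied by the node's core count, floored at min. A minimal sketch of rolling the change out, assuming a cluster where kube-proxy runs as a DaemonSet and reads its configuration from a ConfigMap in kube-system (the exact ConfigMap name varies by distribution; kube-proxy is the kubeadm default):

$ kubectl -n kube-system edit configmap kube-proxy # adjust the conntrack block shown above
$ kubectl -n kube-system rollout restart daemonset kube-proxy # restart so the new values are picked up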

A less likely option, and possibly excessive for this issue, is a design change that touches several parts of your architecture: switching from iptables to IPVS. Shifting kube-proxy from iptables to IPVS for load balancing addresses the bottleneck of hitting maximum connection-tracking capacity. Unlike iptables, which filters and inspects packets and relies on connection tracking, IPVS routes traffic to backends with purpose-built load-balancing algorithms, bypassing the exhaustive state tracking.
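
If you do go that route, the mode switch also lives in the kube-proxy configuration. A minimal sketch, using field names from the KubeProxyConfiguration API (the rr scheduler here is just an example, and IPVS mode requires the ip_vs kernel modules to be available on every node):

    mode: "ipvs"
    ipvs:
      scheduler: "rr"
      syncPeriod: 30s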

Last and least likely: tunnels. By using an IP tunnel and encapsulating the traffic, the far end only sees the tunnel itself as the tracked connection. This is a more feasible option for public-facing proxies hosted outside the Kubernetes cluster (with the same conntrack adjustments applied there), and/or for setups that are not serving a public gateway.

Wrapping up

Like anything in IT, it's important to monitor everything you can while staying mindful of the pitfalls of cardinality and the flood of metrics that brings. Conntrack is only one small piece of the larger puzzle of issues we solve every day!