Pod cannot access the Service LoadBalancer IP

Recently, while developing a CNI network plugin, I ran into an issue where containers could not reach the LoadBalancer IP of a Service, even though access to pod, Service, and node addresses worked correctly. The environment was as follows:

  1. The network plugin is based on the cloud provider’s Elastic Network Interface (ENI), with multiple pods sharing a single ENI in a policy routing mode. This setup directs pod traffic in and out through the ENI.
  2. The primary network interface is bound to an IP, while the secondary network interfaces (ENIs) do not have IP bindings. The IPs of the pods are assigned to the secondary network interfaces.
  3. kube-proxy binds the LoadBalancer IP to the kube-ipvs0 network interface, adds IPVS rules for LoadBalancer IP forwarding, and sets up SNAT rules in iptables to allow pod IPs to access the LoadBalancer IP.

Accessing the LoadBalancer address 10.12.115.101:80 from inside a container times out:

# nsenter -t 188058 -n curl 10.12.115.101:80
curl: (7) Failed connect to 10.12.115.101:80; Connection timed out

Inspecting IPVS Rules

# ipvsadm -l -n |grep -A 3 10.12.115.101
TCP  10.12.115.101:80 rr
  -> 10.253.1.139:8000            Masq    1      0          0         
  -> 10.253.3.5:8000              Masq    1      0          0         
  -> 10.253.4.194:8000            Masq    1      0          1         
--
TCP  10.12.115.101:443 rr
  -> 10.253.1.139:8443            Masq    1      0          1         
  -> 10.253.3.5:8443              Masq    1      0          0         
  -> 10.253.4.194:8443            Masq    1      0          1   
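
The backend addresses in this listing are the ones the packet captures further down filter on. As a minimal sketch, the real servers behind one virtual address can be pulled out of the `ipvsadm -l -n` output with a small awk filter. The function name `backends_for_vip` is hypothetical and not part of ipvsadm.

```shell
# Print the real servers (ip:port) listed under one virtual service.
# Reads `ipvsadm -l -n` output on stdin; $1 is the virtual address as ip:port.
backends_for_vip() {
    awk -v vip="$1" '
        # A service line starts a new section; remember whether it is ours.
        $1 == "TCP" || $1 == "UDP" { in_svc = ($2 == vip); next }
        # Backend lines start with "->" and carry the address in field 2.
        in_svc && $1 == "->" { print $2 }
    '
}

# Example (needs root because of ipvsadm):
#   ipvsadm -l -n | backends_for_vip 10.12.115.101:80
```

The separator lines that `grep -A` inserts (`--`) are ignored because they match neither pattern.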

Viewing Routing Tables and Network Interfaces

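The container’s IP is 10.12.192.44, and the host-side peer of its veth pair is fsk7c22dbae71f7. Outbound traffic from the container leaves the node through eth1, as selected by the "from 10.12.192.44 lookup 1003" rule and routing table 1003 shown below.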

# ip rule
0:      from all lookup local 
1000:   from all to 10.12.192.3 lookup main 
1000:   from all to 10.12.192.12 lookup main 
1000:   from all to 10.12.192.14 lookup main 
1000:   from all to 10.12.192.10 lookup main 
1000:   from all to 10.12.192.16 lookup main 
1000:   from all to 10.12.192.44 lookup main 
1000:   from all to 10.12.192.42 lookup main 
1000:   from all to 10.12.192.35 lookup main 
1000:   from all to 10.12.192.34 lookup main 
1000:   from all to 10.12.192.45 lookup main 
2000:   from 10.12.192.3 lookup 1003 
2000:   from 10.12.192.12 lookup 1003 
2000:   from 10.12.192.14 lookup 1003 
2000:   from 10.12.192.10 lookup 1003 
2000:   from 10.12.192.16 lookup 1003 
2000:   from 10.12.192.44 lookup 1003 
2000:   from 10.12.192.42 lookup 1003 
2000:   from 10.12.192.35 lookup 1003 
2000:   from 10.12.192.34 lookup 1003 
2000:   from 10.12.192.45 lookup 1003 
32766:  from all lookup main 
32767:  from all lookup default 
# nsenter -t 188058 -n ip add
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
28: eth0@if29: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc noqueue state UP group default qlen 1000
    link/ether c2:c9:cc:9c:b7:51 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.12.192.44/32 scope global eth0
       valid_lft forever preferred_lft forever
# ip route
default via 10.12.97.1 dev eth0 
10.12.97.0/24 dev eth0 proto kernel scope link src 10.12.97.49 
10.12.192.3 dev fsk43227a97fc8c scope link 
10.12.192.34 dev fskc611c3dc93de scope link 
10.12.192.35 dev fskf124324b1db6 scope link 
10.12.192.42 dev fskf589f83877f6 scope link 
10.12.192.44 dev fsk7c22dbae71f7 scope link 
10.12.192.45 dev fskfaeb6762732b scope link 
169.254.0.0/16 dev eth0 scope link metric 1002 

# ip route show table 1003
default via 10.12.192.1 dev eth1 table 1003
10.12.192.1 dev eth1 table 1003 scope link

# ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 52:54:00:bb:f5:77 brd ff:ff:ff:ff:ff:ff
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 20:90:6f:42:35:13 brd ff:ff:ff:ff:ff:ff
4: dummy0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 4e:b7:46:71:d2:b4 brd ff:ff:ff:ff:ff:ff
5: kube-ipvs0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN mode DEFAULT group default 
    link/ether 6e:f5:43:c4:f7:64 brd ff:ff:ff:ff:ff:ff
27: fsk43227a97fc8c@if26: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether 86:18:c7:ad:54:28 brd ff:ff:ff:ff:ff:ff link-netnsid 1
29: fsk7c22dbae71f7@if28: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether 4e:8d:9e:0f:42:89 brd ff:ff:ff:ff:ff:ff link-netnsid 2
31: fskf589f83877f6@if30: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether f2:5d:a3:80:c2:52 brd ff:ff:ff:ff:ff:ff link-netnsid 3
33: fskf124324b1db6@if32: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether ae:15:61:f1:59:4f brd ff:ff:ff:ff:ff:ff link-netnsid 4
35: fskc611c3dc93de@if34: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether a2:5d:11:0b:aa:af brd ff:ff:ff:ff:ff:ff link-netnsid 5
37: fskfaeb6762732b@if36: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether 12:66:02:1d:79:a9 brd ff:ff:ff:ff:ff:ff link-netnsid 9
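
The per-pod entries above follow a regular pattern: a /32 route in the main table points at the pod's veth, a priority-1000 rule sends traffic destined to the pod through the main table, and a priority-2000 rule sends traffic from the pod through the ENI's table. The sketch below only prints the commands that would create such entries for one pod. The function name and argument order are made up for illustration, and this is not the plugin's actual code.

```shell
# Print (do not execute) the route and rules for one pod.
# $1 = pod IP, $2 = host-side veth name, $3 = routing table of the pod's ENI
pod_policy_rules() {
    # Host route to the pod through its veth (main table).
    echo "ip route add $1 dev $2 scope link"
    # Traffic to the pod: use the main table, where the route above lives.
    echo "ip rule add to $1 lookup main pref 1000"
    # Traffic from the pod: use the ENI table, whose default route is eth1.
    echo "ip rule add from $1 lookup $3 pref 2000"
}

pod_policy_rules 10.12.192.44 fsk7c22dbae71f7 1003
```

Printing the commands instead of running them keeps the sketch safe to try on any machine; piping the output to `sh` as root would apply it.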

iptables Rules

Among the iptables rules, the nat-table rule -A KUBE-SERVICES -m comment --comment "Kubernetes service lb portal" -m set --match-set KUBE-LOAD-BALANCER dst,dst -j KUBE-LOAD-BALANCER is the relevant one. It matches packets whose destination IP and port belong to a LoadBalancer address and sends them to the KUBE-LOAD-BALANCER chain, where -A KUBE-LOAD-BALANCER -j KUBE-MARK-MASQ sets the 0x4000 mark. KUBE-POSTROUTING then applies MASQUERADE (SNAT) to every packet that carries this mark.

# ipset list KUBE-LOAD-BALANCER
Name: KUBE-LOAD-BALANCER
Type: hash:ip,port
Revision: 5
Header: family inet hashsize 1024 maxelem 65536
Size in memory: 512
References: 2
Number of entries: 5
Members:
10.12.115.101,tcp:80
10.12.97.112,tcp:80
10.12.115.101,tcp:443
10.12.97.70,tcp:15021
10.12.97.70,tcp:80

# iptables-save 
# Generated by iptables-save v1.4.21 on Wed Jun 22 16:03:15 2022
*mangle
:PREROUTING ACCEPT [10381156:5905007979]
:INPUT ACCEPT [4485839:3718471785]
:FORWARD ACCEPT [5448862:2173827292]
:OUTPUT ACCEPT [4420992:1947555378]
:POSTROUTING ACCEPT [9861006:4119693380]
:KUBE-KUBELET-CANARY - [0:0]
COMMIT
# Completed on Wed Jun 22 16:03:15 2022
# Generated by iptables-save v1.4.21 on Wed Jun 22 16:03:15 2022
*filter
:INPUT ACCEPT [149:13360]
:FORWARD ACCEPT [19:532]
:OUTPUT ACCEPT [134:148180]
:KUBE-FIREWALL - [0:0]
:KUBE-FORWARD - [0:0]
:KUBE-KUBELET-CANARY - [0:0]
:KUBE-NODE-PORT - [0:0]
-A INPUT -j KUBE-FIREWALL
-A INPUT -m comment --comment "kubernetes health check rules" -j KUBE-NODE-PORT
-A FORWARD -m comment --comment "kubernetes forwarding rules" -j KUBE-FORWARD
-A OUTPUT -j KUBE-FIREWALL
-A KUBE-FIREWALL -m comment --comment "kubernetes firewall for dropping marked packets" -m mark --mark 0x8000/0x8000 -j DROP
-A KUBE-FIREWALL ! -s 127.0.0.0/8 -d 127.0.0.0/8 -m comment --comment "block incoming localnet connections" -m conntrack ! --ctstate RELATED,ESTABLISHED,DNAT -j DROP
-A KUBE-FORWARD -m comment --comment "kubernetes forwarding rules" -m mark --mark 0x4000/0x4000 -j ACCEPT
-A KUBE-FORWARD -m comment --comment "kubernetes forwarding conntrack rule" -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
-A KUBE-NODE-PORT -m comment --comment "Kubernetes health check node port" -m set --match-set KUBE-HEALTH-CHECK-NODE-PORT dst -j ACCEPT
COMMIT
# Completed on Wed Jun 22 16:03:15 2022
# Generated by iptables-save v1.4.21 on Wed Jun 22 16:03:15 2022
*nat
:PREROUTING ACCEPT [60:1848]
:INPUT ACCEPT [9:420]
:OUTPUT ACCEPT [10:660]
:POSTROUTING ACCEPT [29:1192]
:KUBE-FIREWALL - [0:0]
:KUBE-KUBELET-CANARY - [0:0]
:KUBE-LOAD-BALANCER - [0:0]
:KUBE-MARK-DROP - [0:0]
:KUBE-MARK-MASQ - [0:0]
:KUBE-NODE-PORT - [0:0]
:KUBE-POSTROUTING - [0:0]
:KUBE-SERVICES - [0:0]
-A PREROUTING -m comment --comment "kubernetes service portals" -j KUBE-SERVICES
-A OUTPUT -m comment --comment "kubernetes service portals" -j KUBE-SERVICES
-A POSTROUTING -m comment --comment "kubernetes postrouting rules" -j KUBE-POSTROUTING
-A KUBE-FIREWALL -j KUBE-MARK-DROP
-A KUBE-LOAD-BALANCER -j KUBE-MARK-MASQ
-A KUBE-MARK-DROP -j MARK --set-xmark 0x8000/0x8000
-A KUBE-MARK-MASQ -j MARK --set-xmark 0x4000/0x4000
-A KUBE-NODE-PORT -p tcp -m comment --comment "Kubernetes nodeport TCP port for masquerade purpose" -m set --match-set KUBE-NODE-PORT-TCP dst -j KUBE-MARK-MASQ
-A KUBE-POSTROUTING -m mark ! --mark 0x4000/0x4000 -j RETURN
-A KUBE-POSTROUTING -j MARK --set-xmark 0x4000/0x0
-A KUBE-POSTROUTING -m comment --comment "kubernetes service traffic requiring SNAT" -j MASQUERADE
-A KUBE-SERVICES -m comment --comment "Kubernetes service lb portal" -m set --match-set KUBE-LOAD-BALANCER dst,dst -j KUBE-LOAD-BALANCER
-A KUBE-SERVICES ! -s 10.12.0.0/16 -m comment --comment "Kubernetes service cluster ip + port for masquerade purpose" -m set --match-set KUBE-CLUSTER-IP dst,dst -j KUBE-MARK-MASQ
-A KUBE-SERVICES -m addrtype --dst-type LOCAL -j KUBE-NODE-PORT
-A KUBE-SERVICES -m set --match-set KUBE-CLUSTER-IP dst,dst -j ACCEPT
-A KUBE-SERVICES -m set --match-set KUBE-LOAD-BALANCER dst,dst -j ACCEPT
COMMIT
# Completed on Wed Jun 22 16:03:15 2022
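
The mark handling in the nat rules above can be reproduced with plain shell arithmetic. This is a sketch of how `--set-xmark value/mask` and the `0x4000/0x4000` match behave, not of kube-proxy itself; the function names are hypothetical.

```shell
# Model of iptables "MARK --set-xmark value/mask":
# clear the bits given by mask, then XOR value into the mark.
# $1 = current mark, $2 = value, $3 = mask
set_xmark() {
    echo $(( ($1 & ~$3) ^ $2 ))
}

# Model of "-m mark --mark 0x4000/0x4000": true when the masquerade bit is set.
has_masq_mark() {
    [ $(( $1 & 0x4000 )) -eq $(( 0x4000 )) ]
}

mark=0                                    # fresh packet from the pod
mark=$(set_xmark "$mark" 0x4000 0x4000)   # KUBE-MARK-MASQ sets the bit
if has_masq_mark "$mark"; then
    echo "mark $mark: KUBE-POSTROUTING goes on to MASQUERADE"
    mark=$(set_xmark "$mark" 0x4000 0x0)  # KUBE-POSTROUTING flips the bit off
fi
echo "mark after KUBE-POSTROUTING: $mark"
```

A packet that never matched the KUBE-LOAD-BALANCER set keeps a mark without the 0x4000 bit, so KUBE-POSTROUTING returns early and no SNAT is applied.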

Packet Capture

Packets were captured on both eth0 and eth1. 10.12.97.49 is the IP address of the primary network interface. On eth0, the SYN-ACK replies from the backend were seen arriving, retransmitted several times because the handshake never completes:

# tcpdump -i eth0 -nn -v  ip host 10.253.1.139 or 10.253.3.5 or 10.253.4.194
tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
18:47:35.011943 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto TCP (6), length 60)
    10.253.4.194.8000 > 10.12.97.49.44624: Flags [S.], cksum 0x1ed1 (correct), seq 4144203775, ack 204263431, win 65160, options [mss 1424,sackOK,TS val 3705347097 ecr 794582693,nop,wscale 7], length 0
18:47:36.036868 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto TCP (6), length 60)
    10.253.4.194.8000 > 10.12.97.49.44624: Flags [S.], cksum 0x1ad0 (correct), seq 4144203775, ack 204263431, win 65160, options [mss 1424,sackOK,TS val 3705348122 ecr 794582693,nop,wscale 7], length 0
18:47:37.097201 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto TCP (6), length 60)
    10.253.4.194.8000 > 10.12.97.49.44624: Flags [S.], cksum 0x16ac (correct), seq 4144203775, ack 204263431, win 65160, options [mss 1424,sackOK,TS val 3705349182 ecr 794582693,nop,wscale 7], length 0
18:47:39.145257 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto TCP (6), length 60)
    10.253.4.194.8000 > 10.12.97.49.44624: Flags [S.], cksum 0x0eac (correct), seq 4144203775, ack 204263431, win 65160, options [mss 1424,sackOK,TS val 3705351230 ecr 794582693,nop,wscale 7], length 0

On eth1, the SYN packets sent on behalf of the container were captured. Note that the routing decision is made before SNAT takes place in the POSTROUTING phase, which is why the packets leave through eth1 even though their source address has already been rewritten to 10.12.97.49:

# tcpdump -i eth1 -nn -v  ip host 10.253.1.139 or 10.253.3.5 or 10.253.4.194
tcpdump: listening on eth1, link-type EN10MB (Ethernet), capture size 262144 bytes
18:47:35.011699 IP (tos 0x0, ttl 63, id 65073, offset 0, flags [DF], proto TCP (6), length 60)
    10.12.97.49.44624 > 10.253.4.194.8000: Flags [S], cksum 0x82ea (correct), seq 204263430, win 62727, options [mss 8961,sackOK,TS val 794582693 ecr 0,nop,wscale 7], length 0
18:47:36.036692 IP (tos 0x0, ttl 63, id 65074, offset 0, flags [DF], proto TCP (6), length 60)
    10.12.97.49.44624 > 10.253.4.194.8000: Flags [S], cksum 0x7ee9 (correct), seq 204263430, win 62727, options [mss 8961,sackOK,TS val 794583718 ecr 0,nop,wscale 7], length 0
18:48:17.389548 IP (tos 0x0, ttl 63, id 20326, offset 0, flags [DF], proto TCP (6), length 60)
    10.12.97.49.44676 > 10.253.3.5.8000: Flags [S], cksum 0x11be (correct), seq 3721513867, win 62727, options [mss 8961,sackOK,TS val 794625071 ecr 0,nop,wscale 7], length 0
18:48:18.404658 IP (tos 0x0, ttl 63, id 20327, offset 0, flags [DF], proto TCP (6), length 60)
    10.12.97.49.44676 > 10.253.3.5.8000: Flags [S], cksum 0x0dc7 (correct), seq 3721513867, win 62727, options [mss 8961,sackOK,TS val 794626086 ecr 0,nop,wscale 7], length 0

Overall Observation

After SNAT in iptables, traffic from the pod appears to come from 10.12.97.49 (the primary network interface’s IP, because the secondary interfaces have no IP assigned). IPVS DNATs the connection to a backend and the requests leave through eth1, but the backend’s replies (for example from 10.253.4.194:8000) are addressed to 10.12.97.49 and therefore arrive on eth0, a different interface from the one the requests left through.

Since the outgoing and incoming paths are asymmetric, the replies fail strict reverse-path filtering. One simple solution is to set the kernel parameter /proc/sys/net/ipv4/conf/{network_interface}/rp_filter to 2 (loose mode) for all elastic network interfaces (ENIs).

Note that the value actually applied to an interface is the maximum of /proc/sys/net/ipv4/conf/all/rp_filter and /proc/sys/net/ipv4/conf/{network_interface}/rp_filter. The default on this system is 1 (the kernel’s own default is 0, but some distributions enable it in startup scripts), so setting the primary network interface’s value to 2 is enough to get the desired behavior. The kernel documentation describes the parameter as follows:

rp_filter - INTEGER
 0 - No source validation.
 1 - Strict mode as defined in RFC3704 Strict Reverse Path
     Each incoming packet is tested against the FIB and if the interface
     is not the best reverse path the packet check will fail.
     By default failed packets are discarded.
 2 - Loose mode as defined in RFC3704 Loose Reverse Path
     Each incoming packet's source address is also tested against the FIB
     and if the source address is not reachable via any interface
     the packet check will fail.

 Current recommended practice in RFC3704 is to enable strict mode
 to prevent IP spoofing from DDos attacks. If using asymmetric routing
 or other complicated routing, then loose mode is recommended.

 The max value from conf/{all,interface}/rp_filter is used
 when doing source validation on the {interface}.

 Default value is 0. Note that some distributions enable it
 in startup scripts.
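
Because the kernel uses the larger of the `all` and per-interface values, it can help to check what is effectively in force before changing anything. The following is a minimal sketch; the function name `effective_rp_filter` and the `CONF_DIR` override are assumptions made for illustration (the override exists only so the function can be tried against a scratch directory without root).

```shell
# Directory holding the per-interface IPv4 settings; overridable for testing.
CONF_DIR="${CONF_DIR:-/proc/sys/net/ipv4/conf}"

# Print max(conf/all/rp_filter, conf/<iface>/rp_filter) for interface $1.
effective_rp_filter() {
    all_val=$(cat "$CONF_DIR/all/rp_filter") || return 1
    if_val=$(cat "$CONF_DIR/$1/rp_filter") || return 1
    if [ "$all_val" -gt "$if_val" ]; then
        echo "$all_val"
    else
        echo "$if_val"
    fi
}

# Example (read-only):
#   effective_rp_filter eth0
# Switching eth0 to loose mode (needs root):
#   sysctl -w net.ipv4.conf.eth0.rp_filter=2
```

The max rule cuts both ways: with `all` at 1, raising one interface to 2 gives that interface loose mode, while lowering an interface to 0 does not turn the check off.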

https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt

https://access.redhat.com/solutions/53031

The asymmetric path exists because the SNAT address is the primary network interface’s IP. A second solution is therefore to let the SNAT address follow a symmetric path, which can be achieved by binding a routable IP (the primary IP of the ENI) to each secondary network interface.

A third approach resembles the one used for NodePort: set /proc/sys/net/ipv4/conf/{network_interface}/rp_filter to 2 for the primary network interface, mark the affected traffic in the iptables mangle table, and define a policy routing rule that routes the marked traffic through the primary network interface.

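The GitHub comment linked directly below this excerpt describes a comparable approach for NodePort traffic, reproduced here:

I have a similar solution for scenario 2, but it is more complicated because it needs to be applied to all pods: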

  1. Mark packets received through veth and not from host
  2. Use conntrack to restore the same mark on packets in the reverse direction
  3. Create a rule with higher priority to force returning through veth interface for this mark using a new routing table

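# Mark connections that enter through the veth and do not come from the host IP
iptables -t mangle -A PREROUTING -i veth0 ! -s 172.30.187.226 -j CONNMARK --set-mark 42
# Copy the connection mark back onto locally generated reply packets
iptables -t mangle -A OUTPUT -j CONNMARK --restore-mark
# Send marked packets to table 100, which returns them through the veth
ip rule add fwmark 42 lookup 100 pref 1024
ip route add 172.30.187.226 dev veth0 scope link table 100
ip route add default via 172.30.187.226 dev veth0 table 100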

Where 172.30.187.226 is the host IP.

This assumes that all traffic arriving through the veth from a source other than the host IP is NodePort traffic.

Both solutions work but add a lot of complexity. I hope we can find a simpler solution.

https://github.com/lyft/cni-ipvlan-vpc-k8s/issues/38#issuecomment-387059850

https://github.com/aws/amazon-vpc-cni-k8s/commit/2cce7de02bbfef66b12f0d61d3e9f7cb96d2c186

https://github.com/cilium/cilium/commit/c7f9997d7001c8561583d374dcbd4d973bad6fac

https://github.com/cilium/cilium/commit/01f8dcc51c84e1cab269f84e782e09d8261ac495

https://github.com/kubernetes/kubernetes/issues/66607

https://kubernetes.io/blog/2018/07/09/ipvs-based-in-cluster-load-balancing-deep-dive/

https://unix.stackexchange.com/questions/590123/is-it-possible-to-mention-destination-interface-in-iptables-while-dnat

https://en.wikipedia.org/wiki/Netfilter#/media/File:Netfilter-packet-flow.svg

https://www.hwchiu.com/ipvs-4.html

https://www.digihunch.com/2020/11/ipvs-iptables-and-kube-proxy/

https://serverfault.com/questions/869751/how-does-masquerade-choose-an-ip-address-if-there-are-multiple
