Recently, while developing a CNI network plugin, I encountered an issue where containers couldn’t access the LoadBalancer IP of a service. However, access to pod, service, and node addresses was functioning correctly.
- The network plugin is based on the cloud provider’s Elastic Network Interface (ENI), with multiple pods sharing a single ENI in a policy routing mode. This setup directs pod traffic in and out through the ENI.
- The primary network interface is bound to an IP, while the secondary network interfaces (ENIs) do not have IP bindings. The IPs of the pods are assigned to the secondary network interfaces.
- kube-proxy binds the LoadBalancer IP to the kube-ipvs0 network interface, adds IPVS rules for LoadBalancer IP forwarding, and sets up SNAT rules in iptables to allow pod IPs to access the LoadBalancer IP.
Accessing the LoadBalancer address 10.12.115.101:80 from within a container times out:
|
# nsenter -t 188058 -n curl 10.12.115.101:80
curl: (7) Failed connect to 10.12.115.101:80; Connection timed out
|
Inspecting IPVS Rules
|
# ipvsadm -l -n |grep -A 3 10.12.115.101
TCP 10.12.115.101:80 rr
-> 10.253.1.139:8000 Masq 1 0 0
-> 10.253.3.5:8000 Masq 1 0 0
-> 10.253.4.194:8000 Masq 1 0 1
--
TCP 10.12.115.101:443 rr
-> 10.253.1.139:8443 Masq 1 0 1
-> 10.253.3.5:8443 Masq 1 0 0
-> 10.253.4.194:8443 Masq 1 0 1
|
Viewing Routing Tables and Network Interfaces
The container's IP is 10.12.192.44, and its peer network interface on the host is fsk7c22dbae71f7. Traffic from the container leaves the node through eth1 (routing table 1003).
|
# ip rule
0: from all lookup local
1000: from all to 10.12.192.3 lookup main
1000: from all to 10.12.192.12 lookup main
1000: from all to 10.12.192.14 lookup main
1000: from all to 10.12.192.10 lookup main
1000: from all to 10.12.192.16 lookup main
1000: from all to 10.12.192.44 lookup main
1000: from all to 10.12.192.42 lookup main
1000: from all to 10.12.192.35 lookup main
1000: from all to 10.12.192.34 lookup main
1000: from all to 10.12.192.45 lookup main
2000: from 10.12.192.3 lookup 1003
2000: from 10.12.192.12 lookup 1003
2000: from 10.12.192.14 lookup 1003
2000: from 10.12.192.10 lookup 1003
2000: from 10.12.192.16 lookup 1003
2000: from 10.12.192.44 lookup 1003
2000: from 10.12.192.42 lookup 1003
2000: from 10.12.192.35 lookup 1003
2000: from 10.12.192.34 lookup 1003
2000: from 10.12.192.45 lookup 1003
32766: from all lookup main
32767: from all lookup default
# nsenter -t 188058 -n ip add
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
28: eth0@if29: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc noqueue state UP group default qlen 1000
link/ether c2:c9:cc:9c:b7:51 brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet 10.12.192.44/32 scope global eth0
valid_lft forever preferred_lft forever
# ip route
default via 10.12.97.1 dev eth0
10.12.97.0/24 dev eth0 proto kernel scope link src 10.12.97.49
10.12.192.3 dev fsk43227a97fc8c scope link
10.12.192.34 dev fskc611c3dc93de scope link
10.12.192.35 dev fskf124324b1db6 scope link
10.12.192.42 dev fskf589f83877f6 scope link
10.12.192.44 dev fsk7c22dbae71f7 scope link
10.12.192.45 dev fskfaeb6762732b scope link
169.254.0.0/16 dev eth0 scope link metric 1002
# ip route show table 1003
default via 10.12.192.1 dev eth1 table 1003
10.12.192.1 dev eth1 table 1003 scope link
# ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc mq state UP mode DEFAULT group default qlen 1000
link/ether 52:54:00:bb:f5:77 brd ff:ff:ff:ff:ff:ff
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc mq state UP mode DEFAULT group default qlen 1000
link/ether 20:90:6f:42:35:13 brd ff:ff:ff:ff:ff:ff
4: dummy0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether 4e:b7:46:71:d2:b4 brd ff:ff:ff:ff:ff:ff
5: kube-ipvs0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN mode DEFAULT group default
link/ether 6e:f5:43:c4:f7:64 brd ff:ff:ff:ff:ff:ff
27: fsk43227a97fc8c@if26: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc noqueue state UP mode DEFAULT group default qlen 1000
link/ether 86:18:c7:ad:54:28 brd ff:ff:ff:ff:ff:ff link-netnsid 1
29: fsk7c22dbae71f7@if28: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc noqueue state UP mode DEFAULT group default qlen 1000
link/ether 4e:8d:9e:0f:42:89 brd ff:ff:ff:ff:ff:ff link-netnsid 2
31: fskf589f83877f6@if30: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc noqueue state UP mode DEFAULT group default qlen 1000
link/ether f2:5d:a3:80:c2:52 brd ff:ff:ff:ff:ff:ff link-netnsid 3
33: fskf124324b1db6@if32: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc noqueue state UP mode DEFAULT group default qlen 1000
link/ether ae:15:61:f1:59:4f brd ff:ff:ff:ff:ff:ff link-netnsid 4
35: fskc611c3dc93de@if34: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc noqueue state UP mode DEFAULT group default qlen 1000
link/ether a2:5d:11:0b:aa:af brd ff:ff:ff:ff:ff:ff link-netnsid 5
37: fskfaeb6762732b@if36: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc noqueue state UP mode DEFAULT group default qlen 1000
link/ether 12:66:02:1d:79:a9 brd ff:ff:ff:ff:ff:ff link-netnsid 9
|
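The listing above follows a regular per-pod pattern: a host route points at the pod's veth peer, a priority-1000 rule resolves traffic to the pod in the main table, and a priority-2000 rule sends traffic from the pod to the ENI's table. A minimal sketch of how such rules could be generated is shown below. The function name `pod_policy_rules` is hypothetical and it only prints the commands, so the real plugin code may well differ.

```shell
#!/bin/sh
# Hypothetical helper: print the ip commands that would reproduce the
# per-pod entries seen in `ip rule` and `ip route`. Nothing is executed.
pod_policy_rules() {
    pod_ip=$1      # e.g. 10.12.192.44
    veth=$2        # host-side veth peer, e.g. fsk7c22dbae71f7
    table=$3       # routing table of the ENI, e.g. 1003

    # Host route so the node reaches the pod through its veth peer.
    echo "ip route add ${pod_ip} dev ${veth} scope link"
    # Traffic to the pod: resolve in the main table (priority 1000).
    echo "ip rule add to ${pod_ip} lookup main pref 1000"
    # Traffic from the pod: leave through the ENI's table (priority 2000).
    echo "ip rule add from ${pod_ip} lookup ${table} pref 2000"
}

pod_policy_rules 10.12.192.44 fsk7c22dbae71f7 1003
```

Because the priority-2000 rule matches on the pod's source address, every packet the pod sends is routed via eth1 regardless of its destination.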
iptables Rules
Among the iptables rules, the one in the nat table that reads -A KUBE-SERVICES -m comment --comment "Kubernetes service lb portal" -m set --match-set KUBE-LOAD-BALANCER dst,dst -j KUBE-LOAD-BALANCER is the relevant one. It matches packets whose destination IP and port belong to a LoadBalancer address in the KUBE-LOAD-BALANCER ipset and sends them to the KUBE-LOAD-BALANCER chain, where -A KUBE-LOAD-BALANCER -j KUBE-MARK-MASQ sets the 0x4000 mark. Marked packets are then SNATed by the MASQUERADE rule in KUBE-POSTROUTING.
|
# ipset list KUBE-LOAD-BALANCER
Name: KUBE-LOAD-BALANCER
Type: hash:ip,port
Revision: 5
Header: family inet hashsize 1024 maxelem 65536
Size in memory: 512
References: 2
Number of entries: 5
Members:
10.12.115.101,tcp:80
10.12.97.112,tcp:80
10.12.115.101,tcp:443
10.12.97.70,tcp:15021
10.12.97.70,tcp:80
# iptables-save
# Generated by iptables-save v1.4.21 on Wed Jun 22 16:03:15 2022
*mangle
:PREROUTING ACCEPT [10381156:5905007979]
:INPUT ACCEPT [4485839:3718471785]
:FORWARD ACCEPT [5448862:2173827292]
:OUTPUT ACCEPT [4420992:1947555378]
:POSTROUTING ACCEPT [9861006:4119693380]
:KUBE-KUBELET-CANARY - [0:0]
COMMIT
# Completed on Wed Jun 22 16:03:15 2022
# Generated by iptables-save v1.4.21 on Wed Jun 22 16:03:15 2022
*filter
:INPUT ACCEPT [149:13360]
:FORWARD ACCEPT [19:532]
:OUTPUT ACCEPT [134:148180]
:KUBE-FIREWALL - [0:0]
:KUBE-FORWARD - [0:0]
:KUBE-KUBELET-CANARY - [0:0]
:KUBE-NODE-PORT - [0:0]
-A INPUT -j KUBE-FIREWALL
-A INPUT -m comment --comment "kubernetes health check rules" -j KUBE-NODE-PORT
-A FORWARD -m comment --comment "kubernetes forwarding rules" -j KUBE-FORWARD
-A OUTPUT -j KUBE-FIREWALL
-A KUBE-FIREWALL -m comment --comment "kubernetes firewall for dropping marked packets" -m mark --mark 0x8000/0x8000 -j DROP
-A KUBE-FIREWALL ! -s 127.0.0.0/8 -d 127.0.0.0/8 -m comment --comment "block incoming localnet connections" -m conntrack ! --ctstate RELATED,ESTABLISHED,DNAT -j DROP
-A KUBE-FORWARD -m comment --comment "kubernetes forwarding rules" -m mark --mark 0x4000/0x4000 -j ACCEPT
-A KUBE-FORWARD -m comment --comment "kubernetes forwarding conntrack rule" -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
-A KUBE-NODE-PORT -m comment --comment "Kubernetes health check node port" -m set --match-set KUBE-HEALTH-CHECK-NODE-PORT dst -j ACCEPT
COMMIT
# Completed on Wed Jun 22 16:03:15 2022
# Generated by iptables-save v1.4.21 on Wed Jun 22 16:03:15 2022
*nat
:PREROUTING ACCEPT [60:1848]
:INPUT ACCEPT [9:420]
:OUTPUT ACCEPT [10:660]
:POSTROUTING ACCEPT [29:1192]
:KUBE-FIREWALL - [0:0]
:KUBE-KUBELET-CANARY - [0:0]
:KUBE-LOAD-BALANCER - [0:0]
:KUBE-MARK-DROP - [0:0]
:KUBE-MARK-MASQ - [0:0]
:KUBE-NODE-PORT - [0:0]
:KUBE-POSTROUTING - [0:0]
:KUBE-SERVICES - [0:0]
-A PREROUTING -m comment --comment "kubernetes service portals" -j KUBE-SERVICES
-A OUTPUT -m comment --comment "kubernetes service portals" -j KUBE-SERVICES
-A POSTROUTING -m comment --comment "kubernetes postrouting rules" -j KUBE-POSTROUTING
-A KUBE-FIREWALL -j KUBE-MARK-DROP
-A KUBE-LOAD-BALANCER -j KUBE-MARK-MASQ
-A KUBE-MARK-DROP -j MARK --set-xmark 0x8000/0x8000
-A KUBE-MARK-MASQ -j MARK --set-xmark 0x4000/0x4000
-A KUBE-NODE-PORT -p tcp -m comment --comment "Kubernetes nodeport TCP port for masquerade purpose" -m set --match-set KUBE-NODE-PORT-TCP dst -j KUBE-MARK-MASQ
-A KUBE-POSTROUTING -m mark ! --mark 0x4000/0x4000 -j RETURN
-A KUBE-POSTROUTING -j MARK --set-xmark 0x4000/0x0
-A KUBE-POSTROUTING -m comment --comment "kubernetes service traffic requiring SNAT" -j MASQUERADE
-A KUBE-SERVICES -m comment --comment "Kubernetes service lb portal" -m set --match-set KUBE-LOAD-BALANCER dst,dst -j KUBE-LOAD-BALANCER
-A KUBE-SERVICES ! -s 10.12.0.0/16 -m comment --comment "Kubernetes service cluster ip + port for masquerade purpose" -m set --match-set KUBE-CLUSTER-IP dst,dst -j KUBE-MARK-MASQ
-A KUBE-SERVICES -m addrtype --dst-type LOCAL -j KUBE-NODE-PORT
-A KUBE-SERVICES -m set --match-set KUBE-CLUSTER-IP dst,dst -j ACCEPT
-A KUBE-SERVICES -m set --match-set KUBE-LOAD-BALANCER dst,dst -j ACCEPT
COMMIT
# Completed on Wed Jun 22 16:03:15 2022
|
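The mark handling in the nat rules above is plain bit arithmetic: `--set-xmark value/mask` clears the bits in `mask` and then XORs in `value`, while `-m mark --mark value/mask` matches when `(mark & mask) == value`. The sketch below replays that arithmetic for the 0x4000 masquerade bit. The function names are illustrative only and are not part of iptables.

```shell
#!/bin/sh
# --set-xmark value/mask: clear the bits in mask, then XOR in value.
set_xmark() {  # usage: set_xmark MARK VALUE MASK
    printf '0x%x\n' $(( ($1 & ~$3) ^ $2 ))
}

# -m mark --mark value/mask: match when (mark & mask) == value.
mark_matches() {  # usage: mark_matches MARK VALUE MASK
    [ $(( $1 & $3 )) -eq $(( $2 )) ]
}

# KUBE-MARK-MASQ: --set-xmark 0x4000/0x4000 sets the bit.
mark=$(set_xmark 0x0 0x4000 0x4000)
# KUBE-POSTROUTING returns early unless this match succeeds.
mark_matches "$mark" 0x4000 0x4000 && echo "packet continues to MASQUERADE"
# KUBE-POSTROUTING: --set-xmark 0x4000/0x0 only XORs, which clears the bit again.
mark=$(set_xmark "$mark" 0x4000 0x0)
echo "mark after KUBE-POSTROUTING: $mark"
```

The bit is cleared just before the MASQUERADE rule, so it only serves to carry the SNAT decision from KUBE-SERVICES to KUBE-POSTROUTING.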
Packet Capture
Packet captures were taken on both eth0 and eth1. 10.12.97.49 is the main network card's IP address. On eth0, only the SYN-ACK replies from the backend show up, retransmitted because the handshake never completes:
|
# tcpdump -i eth0 -nn -v ip host 10.253.1.139 or 10.253.3.5 or 10.253.4.194
tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
18:47:35.011943 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto TCP (6), length 60)
10.253.4.194.8000 > 10.12.97.49.44624: Flags [S.], cksum 0x1ed1 (correct), seq 4144203775, ack 204263431, win 65160, options [mss 1424,sackOK,TS val 3705347097 ecr 794582693,nop,wscale 7], length 0
18:47:36.036868 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto TCP (6), length 60)
10.253.4.194.8000 > 10.12.97.49.44624: Flags [S.], cksum 0x1ad0 (correct), seq 4144203775, ack 204263431, win 65160, options [mss 1424,sackOK,TS val 3705348122 ecr 794582693,nop,wscale 7], length 0
18:47:37.097201 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto TCP (6), length 60)
10.253.4.194.8000 > 10.12.97.49.44624: Flags [S.], cksum 0x16ac (correct), seq 4144203775, ack 204263431, win 65160, options [mss 1424,sackOK,TS val 3705349182 ecr 794582693,nop,wscale 7], length 0
18:47:39.145257 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto TCP (6), length 60)
10.253.4.194.8000 > 10.12.97.49.44624: Flags [S.], cksum 0x0eac (correct), seq 4144203775, ack 204263431, win 65160, options [mss 1424,sackOK,TS val 3705351230 ecr 794582693,nop,wscale 7], length 0
|
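On eth1, the SYN packets sent from the container were captured. Note that the routing decision is made before SNAT happens in POSTROUTING: the packets are routed by their original source address (the pod IP, which selects table 1003), so they leave through eth1 even though their source address has already been rewritten to 10.12.97.49: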
|
# tcpdump -i eth1 -nn -v ip host 10.253.1.139 or 10.253.3.5 or 10.253.4.194
tcpdump: listening on eth1, link-type EN10MB (Ethernet), capture size 262144 bytes
18:47:35.011699 IP (tos 0x0, ttl 63, id 65073, offset 0, flags [DF], proto TCP (6), length 60)
10.12.97.49.44624 > 10.253.4.194.8000: Flags [S], cksum 0x82ea (correct), seq 204263430, win 62727, options [mss 8961,sackOK,TS val 794582693 ecr 0,nop,wscale 7], length 0
18:47:36.036692 IP (tos 0x0, ttl 63, id 65074, offset 0, flags [DF], proto TCP (6), length 60)
10.12.97.49.44624 > 10.253.4.194.8000: Flags [S], cksum 0x7ee9 (correct), seq 204263430, win 62727, options [mss 8961,sackOK,TS val 794583718 ecr 0,nop,wscale 7], length 0
18:48:17.389548 IP (tos 0x0, ttl 63, id 20326, offset 0, flags [DF], proto TCP (6), length 60)
10.12.97.49.44676 > 10.253.3.5.8000: Flags [S], cksum 0x11be (correct), seq 3721513867, win 62727, options [mss 8961,sackOK,TS val 794625071 ecr 0,nop,wscale 7], length 0
18:48:18.404658 IP (tos 0x0, ttl 63, id 20327, offset 0, flags [DF], proto TCP (6), length 60)
10.12.97.49.44676 > 10.253.3.5.8000: Flags [S], cksum 0x0dc7 (correct), seq 3721513867, win 62727, options [mss 8961,sackOK,TS val 794626086 ecr 0,nop,wscale 7], length 0
|
Overall Observation:
Traffic from the pod, after SNAT by iptables, appears to come from 10.12.97.49 (the main network card's IP address, because the secondary network cards have no IP assigned). After DNAT in LVS the request leaves through eth1, but the response packets from the backends (for example 10.253.4.194:8000) are addressed to 10.12.97.49 and therefore arrive on eth0, which is not the interface the request left on.
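One way to confirm the asymmetry without a packet capture is to ask the kernel which interface it would pick for the pod's packet and compare it with the interface where replies were seen. The sketch below assumes `ip route get` prints the chosen device after the word `dev`, as iproute2 normally does. Both function names are made up for illustration.

```shell
#!/bin/sh
# Extract the first device name from `ip route get` output read on stdin.
route_dev() {
    awk '{ for (i = 1; i < NF; i++) if ($i == "dev") { print $(i + 1); exit } }'
}

# Compare the egress interface chosen for a forwarded pod packet with the
# interface on which the replies were observed.
check_path() {  # usage: check_path POD_IP POD_VETH BACKEND_IP REPLY_DEV
    out_dev=$(ip route get "$3" from "$1" iif "$2" | route_dev)
    if [ "$out_dev" = "$4" ]; then
        echo "symmetric: out and in via $out_dev"
    else
        echo "asymmetric: out via $out_dev, replies seen on $4"
    fi
}
```

On the node in question this could be invoked as `check_path 10.12.192.44 fsk7c22dbae71f7 10.253.4.194 eth0`, using the addresses and interface names from the captures above.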
Since the outgoing and incoming paths are not symmetrical, one simple solution is to set the kernel parameter /proc/sys/net/ipv4/conf/{network_interface}/rp_filter to 2 (loose mode) for all elastic network cards (ENIs).
Note that the effective value for an interface is the maximum of /proc/sys/net/ipv4/conf/all/rp_filter and /proc/sys/net/ipv4/conf/{network_interface}/rp_filter. The default value here is 1 (the kernel's own default is 0, but some distributions enable it in startup scripts). The check is applied on the interface that receives the packet, so setting the main network card's value to 2 achieves the desired behavior. The kernel documentation describes the parameter as follows:
|
rp_filter - INTEGER
    0 - No source validation.
    1 - Strict mode as defined in RFC3704 Strict Reverse Path
        Each incoming packet is tested against the FIB and if the interface
        is not the best reverse path the packet check will fail.
        By default failed packets are discarded.
    2 - Loose mode as defined in RFC3704 Loose Reverse Path
        Each incoming packet's source address is also tested against the FIB
        and if the source address is not reachable via any interface
        the packet check will fail.

    Current recommended practice in RFC3704 is to enable strict mode
    to prevent IP spoofing from DDos attacks. If using asymmetric routing
    or other complicated routing, then loose mode is recommended.

    The max value from conf/{all,interface}/rp_filter is used
    when doing source validation on the {interface}.

    Default value is 0. Note that some distributions enable it
    in startup scripts.
|
https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt
https://access.redhat.com/solutions/53031
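Because the kernel uses the maximum of the `all` and per-interface values, it helps to look at the effective setting rather than a single file. The following is a small sketch of that calculation. The function name `effective_rp_filter` and its optional second argument (an alternative sysctl directory, added here only so the function can be pointed at a test tree) are assumptions, not existing tooling.

```shell
#!/bin/sh
# Effective rp_filter for an interface = max(conf/all, conf/<iface>).
effective_rp_filter() {
    rp_iface=$1
    conf_dir=${2:-/proc/sys/net/ipv4/conf}
    all=$(cat "${conf_dir}/all/rp_filter")
    own=$(cat "${conf_dir}/${rp_iface}/rp_filter")
    if [ "$all" -gt "$own" ]; then echo "$all"; else echo "$own"; fi
}

# Report the effective mode for every interface on this host.
for dir in /proc/sys/net/ipv4/conf/*/; do
    [ -r "${dir}rp_filter" ] || continue
    name=$(basename "$dir")
    case $name in all|default) continue ;; esac
    echo "${name}: $(effective_rp_filter "$name")"
done
```

With the max rule, writing 2 to a single interface is enough even when `all` stays at 1, for example `sysctl -w net.ipv4.conf.eth0.rp_filter=2` for the main network card in this setup. Lowering an interface to 0 while `all` is 1 has no effect.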
Because the SNAT IP is the main network card's IP, replies come back on a different interface than the one the request left on. A second solution is to make the SNAT path symmetric by binding a routable IP (the primary IP of the elastic network card) to each secondary network card, so that MASQUERADE can use the address of the interface the packet actually leaves through.
Another approach is similar to the one used for NodePort: set /proc/sys/net/ipv4/conf/{network_interface}/rp_filter to 2 for the main network card, mark the traffic in the iptables mangle table, and add a policy routing rule that routes the marked traffic through the main network card.
I have a similar solution for scenario 2 but it is more complicated because it needs to be applied to all pods:
- Mark packets received through veth and not from host
- Use conntrack to restore the same mark on packets in the reverse direction
- Create a rule with higher priority to force returning through veth interface for this mark using a new routing table
|
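# Mark connections that enter through the veth and do not come from the host IP.
iptables -t mangle -A PREROUTING -i veth0 ! -s 172.30.187.226 -j CONNMARK --set-mark 42
# Copy the connection mark back onto packets in the reverse direction.
iptables -t mangle -A OUTPUT -j CONNMARK --restore-mark
# Route marked packets with a dedicated table that points back through the veth.
ip rule add fwmark 42 lookup 100 pref 1024
ip route add 172.30.187.226 dev veth0 scope link table 100
ip route add default via 172.30.187.226 dev veth0 table 100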
|
Here 172.30.187.226 is the host IP.
This assumes that all traffic arriving through the veth that does not come from the host IP is NodePort traffic.
Both solutions work but add a lot of complexity. I hope we can find a simpler solution.
https://github.com/lyft/cni-ipvlan-vpc-k8s/issues/38#issuecomment-387059850
https://github.com/aws/amazon-vpc-cni-k8s/commit/2cce7de02bbfef66b12f0d61d3e9f7cb96d2c186
https://github.com/cilium/cilium/commit/c7f9997d7001c8561583d374dcbd4d973bad6fac
https://github.com/cilium/cilium/commit/01f8dcc51c84e1cab269f84e782e09d8261ac495
https://github.com/kubernetes/kubernetes/issues/66607
https://kubernetes.io/blog/2018/07/09/ipvs-based-in-cluster-load-balancing-deep-dive/
https://unix.stackexchange.com/questions/590123/is-it-possible-to-mention-destination-interface-in-iptables-while-dnat
https://en.wikipedia.org/wiki/Netfilter#/media/File:Netfilter-packet-flow.svg
https://www.hwchiu.com/ipvs-4.html
https://www.digihunch.com/2020/11/ipvs-iptables-and-kube-proxy/
https://serverfault.com/questions/869751/how-does-masquerade-choose-an-ip-address-if-there-are-multiple