CNI plugin: pods cannot reach a Service's LoadBalancer IP

While developing a CNI network plugin recently, I hit a problem: containers could not reach a Service's LoadBalancer address, while pod IPs, Service ClusterIPs, and node IPs were all reachable.

  1. The plugin is based on the cloud provider's elastic network interfaces (ENIs), in the mode where multiple pods share one ENI. Policy routing steers pod traffic in and out through the ENI.
  2. The primary NIC has an IP bound to it; the secondary NICs do not. All pod IPs live on the secondary NICs.
  3. kube-proxy binds the LoadBalancer IP on the kube-ipvs0 interface, adds IPVS forwarding rules for the LoadBalancer IP, and adds an iptables rule so that pod traffic destined for the LoadBalancer IP gets SNAT'ed.
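
The per-pod policy routing in step 1 can be sketched as below. The values (pod IP, table number, gateway) are illustrative, taken from the captures later in this post; a real plugin would get them from IPAM. The commands are printed rather than executed so the sketch runs without root or the NICs present.

```shell
# Hypothetical sketch of the rules the plugin installs for one pod.
POD_IP=10.12.192.44   # illustrative pod IP
RT_TABLE=1003         # illustrative per-ENI routing table
ENI_GW=10.12.192.1    # illustrative ENI subnet gateway

# pref 1000: traffic *to* the pod IP uses the main table (reaches the veth)
echo "ip rule add pref 1000 to ${POD_IP} lookup main"
# pref 2000: traffic *from* the pod IP uses the ENI table (egress via eth1)
echo "ip rule add pref 2000 from ${POD_IP} lookup ${RT_TABLE}"
# the ENI table holds a default route out of the secondary NIC
echo "ip route replace default via ${ENI_GW} dev eth1 table ${RT_TABLE}"
```

This matches the `ip rule` output shown below: pref-1000 rules for the return direction, pref-2000 rules steering pod-sourced traffic into table 1003.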

Accessing the LB address 10.12.115.101:80 from inside the container fails:

# nsenter -t 188058 -n curl 10.12.115.101:80
curl: (7) Failed connect to 10.12.115.101:80; Connection timed out

Check the IPVS rules:

# ipvsadm -l -n |grep -A 3 10.12.115.101
TCP  10.12.115.101:80 rr
  -> 10.253.1.139:8000            Masq    1      0          0         
  -> 10.253.3.5:8000              Masq    1      0          0         
  -> 10.253.4.194:8000            Masq    1      0          1         
--
TCP  10.12.115.101:443 rr
  -> 10.253.1.139:8443            Masq    1      0          1         
  -> 10.253.3.5:8443              Masq    1      0          0         
  -> 10.253.4.194:8443            Masq    1      0          1   

Check the routing tables and NICs

The container's IP is 10.12.192.44, its host-side peer interface is fsk7c22dbae71f7, and the container's egress traffic goes out via eth1.

# ip rule
0:      from all lookup local 
1000:   from all to 10.12.192.3 lookup main 
1000:   from all to 10.12.192.12 lookup main 
1000:   from all to 10.12.192.14 lookup main 
1000:   from all to 10.12.192.10 lookup main 
1000:   from all to 10.12.192.16 lookup main 
1000:   from all to 10.12.192.44 lookup main 
1000:   from all to 10.12.192.42 lookup main 
1000:   from all to 10.12.192.35 lookup main 
1000:   from all to 10.12.192.34 lookup main 
1000:   from all to 10.12.192.45 lookup main 
2000:   from 10.12.192.3 lookup 1003 
2000:   from 10.12.192.12 lookup 1003 
2000:   from 10.12.192.14 lookup 1003 
2000:   from 10.12.192.10 lookup 1003 
2000:   from 10.12.192.16 lookup 1003 
2000:   from 10.12.192.44 lookup 1003 
2000:   from 10.12.192.42 lookup 1003 
2000:   from 10.12.192.35 lookup 1003 
2000:   from 10.12.192.34 lookup 1003 
2000:   from 10.12.192.45 lookup 1003 
32766:  from all lookup main 
32767:  from all lookup default 
# nsenter -t 188058 -n ip add
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
28: eth0@if29: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc noqueue state UP group default qlen 1000
    link/ether c2:c9:cc:9c:b7:51 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.12.192.44/32 scope global eth0
       valid_lft forever preferred_lft forever
# ip route
default via 10.12.97.1 dev eth0 
10.12.97.0/24 dev eth0 proto kernel scope link src 10.12.97.49 
10.12.192.3 dev fsk43227a97fc8c scope link 
10.12.192.34 dev fskc611c3dc93de scope link 
10.12.192.35 dev fskf124324b1db6 scope link 
10.12.192.42 dev fskf589f83877f6 scope link 
10.12.192.44 dev fsk7c22dbae71f7 scope link 
10.12.192.45 dev fskfaeb6762732b scope link 
169.254.0.0/16 dev eth0 scope link metric 1002 

# ip route show table 1003
default via 10.12.192.1 dev eth1 table 1003
10.12.192.1 dev eth1 table 1003 scope link

# ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 52:54:00:bb:f5:77 brd ff:ff:ff:ff:ff:ff
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 20:90:6f:42:35:13 brd ff:ff:ff:ff:ff:ff
4: dummy0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 4e:b7:46:71:d2:b4 brd ff:ff:ff:ff:ff:ff
5: kube-ipvs0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN mode DEFAULT group default 
    link/ether 6e:f5:43:c4:f7:64 brd ff:ff:ff:ff:ff:ff
27: fsk43227a97fc8c@if26: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether 86:18:c7:ad:54:28 brd ff:ff:ff:ff:ff:ff link-netnsid 1
29: fsk7c22dbae71f7@if28: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether 4e:8d:9e:0f:42:89 brd ff:ff:ff:ff:ff:ff link-netnsid 2
31: fskf589f83877f6@if30: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether f2:5d:a3:80:c2:52 brd ff:ff:ff:ff:ff:ff link-netnsid 3
33: fskf124324b1db6@if32: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether ae:15:61:f1:59:4f brd ff:ff:ff:ff:ff:ff link-netnsid 4
35: fskc611c3dc93de@if34: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether a2:5d:11:0b:aa:af brd ff:ff:ff:ff:ff:ff link-netnsid 5
37: fskfaeb6762732b@if36: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether 12:66:02:1d:79:a9 brd ff:ff:ff:ff:ff:ff link-netnsid 9

iptables rules

In the nat table, the rule -A KUBE-SERVICES -m comment --comment "Kubernetes service lb portal" -m set --match-set KUBE-LOAD-BALANCER dst,dst -j KUBE-LOAD-BALANCER matches packets whose destination IP and port are an LB address and port (via the ipset) and jumps to the KUBE-LOAD-BALANCER chain, where -A KUBE-LOAD-BALANCER -j KUBE-MARK-MASQ marks them so that pod traffic to the LB IP ultimately gets SNAT'ed.

# ipset list KUBE-LOAD-BALANCER
Name: KUBE-LOAD-BALANCER
Type: hash:ip,port
Revision: 5
Header: family inet hashsize 1024 maxelem 65536
Size in memory: 512
References: 2
Number of entries: 5
Members:
10.12.115.101,tcp:80
10.12.97.112,tcp:80
10.12.115.101,tcp:443
10.12.97.70,tcp:15021
10.12.97.70,tcp:80

# iptables-save 
# Generated by iptables-save v1.4.21 on Wed Jun 22 16:03:15 2022
*mangle
:PREROUTING ACCEPT [10381156:5905007979]
:INPUT ACCEPT [4485839:3718471785]
:FORWARD ACCEPT [5448862:2173827292]
:OUTPUT ACCEPT [4420992:1947555378]
:POSTROUTING ACCEPT [9861006:4119693380]
:KUBE-KUBELET-CANARY - [0:0]
COMMIT
# Completed on Wed Jun 22 16:03:15 2022
# Generated by iptables-save v1.4.21 on Wed Jun 22 16:03:15 2022
*filter
:INPUT ACCEPT [149:13360]
:FORWARD ACCEPT [19:532]
:OUTPUT ACCEPT [134:148180]
:KUBE-FIREWALL - [0:0]
:KUBE-FORWARD - [0:0]
:KUBE-KUBELET-CANARY - [0:0]
:KUBE-NODE-PORT - [0:0]
-A INPUT -j KUBE-FIREWALL
-A INPUT -m comment --comment "kubernetes health check rules" -j KUBE-NODE-PORT
-A FORWARD -m comment --comment "kubernetes forwarding rules" -j KUBE-FORWARD
-A OUTPUT -j KUBE-FIREWALL
-A KUBE-FIREWALL -m comment --comment "kubernetes firewall for dropping marked packets" -m mark --mark 0x8000/0x8000 -j DROP
-A KUBE-FIREWALL ! -s 127.0.0.0/8 -d 127.0.0.0/8 -m comment --comment "block incoming localnet connections" -m conntrack ! --ctstate RELATED,ESTABLISHED,DNAT -j DROP
-A KUBE-FORWARD -m comment --comment "kubernetes forwarding rules" -m mark --mark 0x4000/0x4000 -j ACCEPT
-A KUBE-FORWARD -m comment --comment "kubernetes forwarding conntrack rule" -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
-A KUBE-NODE-PORT -m comment --comment "Kubernetes health check node port" -m set --match-set KUBE-HEALTH-CHECK-NODE-PORT dst -j ACCEPT
COMMIT
# Completed on Wed Jun 22 16:03:15 2022
# Generated by iptables-save v1.4.21 on Wed Jun 22 16:03:15 2022
*nat
:PREROUTING ACCEPT [60:1848]
:INPUT ACCEPT [9:420]
:OUTPUT ACCEPT [10:660]
:POSTROUTING ACCEPT [29:1192]
:KUBE-FIREWALL - [0:0]
:KUBE-KUBELET-CANARY - [0:0]
:KUBE-LOAD-BALANCER - [0:0]
:KUBE-MARK-DROP - [0:0]
:KUBE-MARK-MASQ - [0:0]
:KUBE-NODE-PORT - [0:0]
:KUBE-POSTROUTING - [0:0]
:KUBE-SERVICES - [0:0]
-A PREROUTING -m comment --comment "kubernetes service portals" -j KUBE-SERVICES
-A OUTPUT -m comment --comment "kubernetes service portals" -j KUBE-SERVICES
-A POSTROUTING -m comment --comment "kubernetes postrouting rules" -j KUBE-POSTROUTING
-A KUBE-FIREWALL -j KUBE-MARK-DROP
-A KUBE-LOAD-BALANCER -j KUBE-MARK-MASQ
-A KUBE-MARK-DROP -j MARK --set-xmark 0x8000/0x8000
-A KUBE-MARK-MASQ -j MARK --set-xmark 0x4000/0x4000
-A KUBE-NODE-PORT -p tcp -m comment --comment "Kubernetes nodeport TCP port for masquerade purpose" -m set --match-set KUBE-NODE-PORT-TCP dst -j KUBE-MARK-MASQ
-A KUBE-POSTROUTING -m mark ! --mark 0x4000/0x4000 -j RETURN
-A KUBE-POSTROUTING -j MARK --set-xmark 0x4000/0x0
-A KUBE-POSTROUTING -m comment --comment "kubernetes service traffic requiring SNAT" -j MASQUERADE
-A KUBE-SERVICES -m comment --comment "Kubernetes service lb portal" -m set --match-set KUBE-LOAD-BALANCER dst,dst -j KUBE-LOAD-BALANCER
-A KUBE-SERVICES ! -s 10.12.0.0/16 -m comment --comment "Kubernetes service cluster ip + port for masquerade purpose" -m set --match-set KUBE-CLUSTER-IP dst,dst -j KUBE-MARK-MASQ
-A KUBE-SERVICES -m addrtype --dst-type LOCAL -j KUBE-NODE-PORT
-A KUBE-SERVICES -m set --match-set KUBE-CLUSTER-IP dst,dst -j ACCEPT
-A KUBE-SERVICES -m set --match-set KUBE-LOAD-BALANCER dst,dst -j ACCEPT
COMMIT
# Completed on Wed Jun 22 16:03:15 2022

Packet capture

Capture the backend traffic on eth0 (10.12.97.49 is the primary NIC's IP). eth0 receives the backends' response packets, the retransmitted SYN-ACKs below:

# tcpdump -i eth0 -nn -v  ip host 10.253.1.139 or 10.253.3.5 or 10.253.4.194
tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
18:47:35.011943 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto TCP (6), length 60)
    10.253.4.194.8000 > 10.12.97.49.44624: Flags [S.], cksum 0x1ed1 (correct), seq 4144203775, ack 204263431, win 65160, options [mss 1424,sackOK,TS val 3705347097 ecr 794582693,nop,wscale 7], length 0
18:47:36.036868 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto TCP (6), length 60)
    10.253.4.194.8000 > 10.12.97.49.44624: Flags [S.], cksum 0x1ad0 (correct), seq 4144203775, ack 204263431, win 65160, options [mss 1424,sackOK,TS val 3705348122 ecr 794582693,nop,wscale 7], length 0
18:47:37.097201 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto TCP (6), length 60)
    10.253.4.194.8000 > 10.12.97.49.44624: Flags [S.], cksum 0x16ac (correct), seq 4144203775, ack 204263431, win 65160, options [mss 1424,sackOK,TS val 3705349182 ecr 794582693,nop,wscale 7], length 0
18:47:39.145257 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto TCP (6), length 60)
    10.253.4.194.8000 > 10.12.97.49.44624: Flags [S.], cksum 0x0eac (correct), seq 4144203775, ack 204263431, win 65160, options [mss 1424,sackOK,TS val 3705351230 ecr 794582693,nop,wscale 7], length 0

Capture the backend traffic on eth1 (10.12.97.49 is the primary NIC's IP). The SYN packets go out via eth1: route selection had already happened before the SNAT in POSTROUTING (the policy route matched), choosing eth1 as the egress interface.

# tcpdump -i eth1 -nn -v  ip host 10.253.1.139 or 10.253.3.5 or 10.253.4.194
tcpdump: listening on eth1, link-type EN10MB (Ethernet), capture size 262144 bytes
18:47:35.011699 IP (tos 0x0, ttl 63, id 65073, offset 0, flags [DF], proto TCP (6), length 60)
    10.12.97.49.44624 > 10.253.4.194.8000: Flags [S], cksum 0x82ea (correct), seq 204263430, win 62727, options [mss 8961,sackOK,TS val 794582693 ecr 0,nop,wscale 7], length 0
18:47:36.036692 IP (tos 0x0, ttl 63, id 65074, offset 0, flags [DF], proto TCP (6), length 60)
    10.12.97.49.44624 > 10.253.4.194.8000: Flags [S], cksum 0x7ee9 (correct), seq 204263430, win 62727, options [mss 8961,sackOK,TS val 794583718 ecr 0,nop,wscale 7], length 0
18:48:17.389548 IP (tos 0x0, ttl 63, id 20326, offset 0, flags [DF], proto TCP (6), length 60)
    10.12.97.49.44676 > 10.253.3.5.8000: Flags [S], cksum 0x11be (correct), seq 3721513867, win 62727, options [mss 8961,sackOK,TS val 794625071 ecr 0,nop,wscale 7], length 0
18:48:18.404658 IP (tos 0x0, ttl 63, id 20327, offset 0, flags [DF], proto TCP (6), length 60)
    10.12.97.49.44676 > 10.253.3.5.8000: Flags [S], cksum 0x0dc7 (correct), seq 3721513867, win 62727, options [mss 8961,sackOK,TS val 794626086 ecr 0,nop,wscale 7], length 0

Summary of the observations:

Pod traffic to the LB is DNAT'ed by LVS to a backend such as 10.253.3.5:8000 and SNAT'ed by iptables to 10.12.97.49 (the primary NIC's IP, because the secondary NICs have no IP bound). It leaves via eth1, as directed by the policy route, but the response from 10.253.3.5:8000 comes back through eth0.

In other words, the outbound path and the return path are asymmetric.

Given the asymmetry, the simplest fix is to set the kernel parameter /proc/sys/net/ipv4/conf/{nic}/rp_filter to 2 for the elastic NICs.

Note that the kernel takes the maximum of /proc/sys/net/ipv4/conf/all/rp_filter and /proc/sys/net/ipv4/conf/{nic}/rp_filter, and the default here is 1, so setting just the primary NIC's value to 2 is enough. In my case: echo 2 > /proc/sys/net/ipv4/conf/eth0/rp_filter
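
The max() behavior can be illustrated with a tiny sketch. The values are hardcoded rather than read from /proc so it runs without the NICs present:

```shell
# Effective rp_filter on eth0 = max(conf/all, conf/eth0).
# With the default all=1 and eth0 set to 2, loose mode (2) wins,
# so asymmetric return traffic arriving on eth0 is no longer dropped.
all_rp=1     # stands in for /proc/sys/net/ipv4/conf/all/rp_filter
eth0_rp=2    # stands in for /proc/sys/net/ipv4/conf/eth0/rp_filter
effective=$(( all_rp > eth0_rp ? all_rp : eth0_rp ))
echo "effective rp_filter on eth0: ${effective}"
```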

rp_filter - INTEGER
 0 - No source validation.
 1 - Strict mode as defined in RFC3704 Strict Reverse Path
     Each incoming packet is tested against the FIB and if the interface
     is not the best reverse path the packet check will fail.
     By default failed packets are discarded.
 2 - Loose mode as defined in RFC3704 Loose Reverse Path
     Each incoming packet's source address is also tested against the FIB
     and if the source address is not reachable via any interface
     the packet check will fail.

 Current recommended practice in RFC3704 is to enable strict mode
 to prevent IP spoofing from DDos attacks. If using asymmetric routing
 or other complicated routing, then loose mode is recommended.

 The max value from conf/{all,interface}/rp_filter is used
 when doing source validation on the {interface}.

 Default value is 0. Note that some distributions enable it
 in startup scripts.

https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt

https://access.redhat.com/solutions/53031

Since the SNAT IP is the primary NIC's IP, the outbound and return paths end up asymmetric. Another fix, then, is to give the SNAT a source IP whose outbound and return paths are symmetric.

That is, bind a routable IP (the ENI's primary IP) to each secondary NIC.
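
A minimal sketch of that idea follows; the address is hypothetical, and on most clouds the ENI's primary IP must also be assigned to the interface in the provider's API before the fabric will route it. MASQUERADE picks a source address belonging to the outgoing interface, so once eth1 carries its ENI primary IP, SNAT uses an address whose return path is also eth1 and the flow becomes symmetric. Printed rather than executed, since it needs root:

```shell
# Hypothetical: the secondary ENI's own primary IP (not a pod IP).
ENI_PRIMARY_IP=10.12.192.2
echo "ip addr add ${ENI_PRIMARY_IP}/32 dev eth1"
```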

Alternatively, reuse the NodePort fix: set /proc/sys/net/ipv4/conf/{nic}/rp_filter to 2 on the primary NIC, set a mark in the iptables mangle table, and add a policy-routing rule that sends the marked traffic out through the primary NIC.

I have a similar solution for scenario 2 but it is more complicated because it needs to be applied to all pods:

  1. Mark packets received through veth and not from host
  2. Use conntrack to restore the same mark on packets in the reverse direction
  3. Create a rule with higher priority to force returning through veth interface for this mark using a new routing table
iptables -t mangle -A PREROUTING -i veth0 ! -s 172.30.187.226 -j CONNMARK --set-mark 42
iptables -t mangle -A OUTPUT -j CONNMARK --restore-mark
ip rule add fwmark 42 lookup 100 pref 1024
ip route add default via 172.30.187.226 dev veth0 table 100
ip route add 172.30.187.226 dev veth0  scope link table 100

Where 172.30.187.226 is the host IP

This assumes that all traffic from veth and not from host ip is nodeport traffic

Both solutions work but add a lot of complexity. I hope we can find a simpler solution

https://github.com/lyft/cni-ipvlan-vpc-k8s/issues/38#issuecomment-387059850

https://github.com/aws/amazon-vpc-cni-k8s/commit/2cce7de02bbfef66b12f0d61d3e9f7cb96d2c186

https://github.com/cilium/cilium/commit/c7f9997d7001c8561583d374dcbd4d973bad6fac

https://github.com/cilium/cilium/commit/01f8dcc51c84e1cab269f84e782e09d8261ac495

https://github.com/kubernetes/kubernetes/issues/66607

https://kubernetes.io/blog/2018/07/09/ipvs-based-in-cluster-load-balancing-deep-dive/

https://unix.stackexchange.com/questions/590123/is-it-possible-to-mention-destination-interface-in-iptables-while-dnat

https://en.wikipedia.org/wiki/Netfilter#/media/File:Netfilter-packet-flow.svg

https://www.hwchiu.com/ipvs-4.html

https://www.digihunch.com/2020/11/ipvs-iptables-and-kube-proxy/

https://serverfault.com/questions/869751/how-does-masquerade-choose-an-ip-address-if-there-are-multiple
