CNI plugin cannot access a service's LoadBalancer IP
While recently developing a CNI network plugin, I ran into a problem where containers could not reach a service's LoadBalancer address, even though pod, service, and node addresses were all reachable.
Background
- The network plugin is built on the cloud provider's elastic network interfaces (ENIs), with multiple pods sharing one ENI. Policy routing is used so that pod traffic enters and leaves through the ENI (a rough sketch of such rules follows this list).
- The primary NIC has an IP bound to it, while the secondary NICs do not; all pod IPs live on the secondary NICs.
- kube-proxy binds the LoadBalancer IP to the kube-ipvs0 interface, adds forwarding for the LoadBalancer IP in the IPVS rules, and adds an SNAT rule in iptables so that pod traffic to the LoadBalancer IP gets SNATed.
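The exact rules the plugin installs are not shown in this post. As a minimal sketch (the routing table number, rule priority, and gateway address are assumptions; the pod IP and veth name are taken from the capture section below), per-pod policy routing of this kind typically looks like:

# Host route to the pod through its veth peer
ip route add 10.12.192.44 dev fsk7c22dbae71f7 scope link
# Send the pod's outbound traffic to a dedicated table that routes via the secondary ENI
ip rule add from 10.12.192.44 lookup 101 pref 512
ip route add default via 10.12.192.1 dev eth1 table 101   # 10.12.192.1 is an assumed subnet gateway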
Symptoms
Accessing the LB address 10.12.115.101:80 from inside the container fails.
Check the IPVS rules
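The original command output did not survive here. The check itself is done with ipvsadm, and for this service the listing would look roughly like the following (the scheduler and weight are assumptions):

ipvsadm -Ln
# TCP  10.12.115.101:80 rr
#   -> 10.253.3.5:8000    Masq    1      ...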
Check the routing table and NICs
The container's IP is 10.12.192.44, its host-side peer interface is fsk7c22dbae71f7, and the container's outbound traffic goes out through eth1.
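The output is likewise missing; roughly, this state can be confirmed as follows (the table number is an assumption):

ip addr show eth0            # carries the primary NIC IP 10.12.97.49
ip addr show eth1            # secondary ENI, no IP bound
ip rule show                 # expect a rule like: from 10.12.192.44 lookup 101
ip route show table 101      # expect a default route out of eth1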
iptables rules
In the nat table, the rule -A KUBE-SERVICES -m comment --comment "Kubernetes service lb portal" -m set --match-set KUBE-LOAD-BALANCER dst,dst -j KUBE-LOAD-BALANCER matches packets whose destination IP and port are the LB's address and port and sends them to the KUBE-LOAD-BALANCER chain, where -A KUBE-LOAD-BALANCER -j KUBE-MARK-MASQ marks them so that they eventually get SNATed.
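For context, KUBE-MARK-MASQ only sets a firewall mark; the MASQUERADE itself happens later in POSTROUTING. In a typical kube-proxy (IPVS mode) ruleset the relevant nat-table chains look roughly like this (the exact mark value and options vary by version):

-A KUBE-MARK-MASQ -j MARK --set-xmark 0x4000/0x4000
-A POSTROUTING -m comment --comment "kubernetes postrouting rules" -j KUBE-POSTROUTING
-A KUBE-POSTROUTING -m comment --comment "kubernetes service traffic requiring SNAT" -m mark --mark 0x4000/0x4000 -j MASQUERADE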
Packet capture
Capturing the backend's packets on eth0 (10.12.97.49 is the primary NIC's IP): eth0 receives the response packets from the backend.
Capturing the backend's packets on eth1 (10.12.97.49 is the primary NIC's IP): eth1 sends out the SYN packets, which shows that route selection (matching the policy route) has already happened before the SNAT in POSTROUTING, choosing eth1 as the outbound interface.
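The captures above can be reproduced with filters on the backend's address, for example:

tcpdump -nn -i eth0 host 10.253.3.5   # the backend's responses arrive here
tcpdump -nn -i eth1 host 10.253.3.5   # the outgoing SYN packets leave here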
To summarize the symptoms:
The pod's traffic is SNATed by iptables to 10.12.97.49 (the primary NIC's IP, since the secondary NICs have no IP bound), then DNATed by LVS to the backend 10.253.3.5:8000. The outbound interface is eth1 (chosen by the policy route), while the response from 10.253.3.5:8000 comes back in through eth0.
In other words, the outbound path and the return path are not symmetric.
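The asymmetry can be double-checked with ip route get, keeping in mind that routing is decided with the pod's source address, before the SNAT in POSTROUTING:

ip route get 10.253.3.5 from 10.12.192.44 iif fsk7c22dbae71f7
#   expected to match the pod's policy rule and pick dev eth1
# After SNAT the source becomes 10.12.97.49, which is bound to eth0, so the
# backend addresses its reply to eth0's IP and it comes back in through eth0.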
Solutions
Option 1
Since the paths are asymmetric, the simplest fix is to set the kernel parameter /proc/sys/net/ipv4/conf/{interface}/rp_filter to 2 on all of the elastic network interfaces.
Note that for a given {interface} the kernel takes the maximum of /proc/sys/net/ipv4/conf/all/rp_filter and /proc/sys/net/ipv4/conf/{interface}/rp_filter, and the default value here is 1, so it is actually enough to set just the primary NIC's value to 2. In my case: echo 2 > /proc/sys/net/ipv4/conf/eth0/rp_filter
rp_filter - INTEGER
	0 - No source validation.
	1 - Strict mode as defined in RFC3704 Strict Reverse Path
	    Each incoming packet is tested against the FIB and if the interface
	    is not the best reverse path the packet check will fail.
	    By default failed packets are discarded.
	2 - Loose mode as defined in RFC3704 Loose Reverse Path
	    Each incoming packet's source address is also tested against the FIB
	    and if the source address is not reachable via any interface
	    the packet check will fail.

	Current recommended practice in RFC3704 is to enable strict mode
	to prevent IP spoofing from DDos attacks. If using asymmetric routing
	or other complicated routing, then loose mode is recommended.

	The max value from conf/{all,interface}/rp_filter is used
	when doing source validation on the {interface}.

	Default value is 0. Note that some distributions enable it
	in startup scripts.
https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt
https://access.redhat.com/solutions/53031
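The same setting expressed with sysctl, plus a way to verify and persist it (the drop-in file name is an assumption; adjust to your distribution):

sysctl -w net.ipv4.conf.eth0.rp_filter=2
sysctl net.ipv4.conf.all.rp_filter net.ipv4.conf.eth0.rp_filter   # the kernel uses the max of the two
echo "net.ipv4.conf.eth0.rp_filter = 2" > /etc/sysctl.d/90-rp-filter.conf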
Option 2
Since the SNAT IP is the primary NIC's IP, the outbound and return paths are asymmetric; the fix is therefore to make the SNAT IP one whose outbound and return paths are symmetric, i.e. bind a routable IP (the ENI's primary IP) to the secondary NIC.
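A minimal sketch of this option, with 10.12.97.80/24 standing in for the ENI's primary private IP (the real value comes from the cloud provider):

# Hypothetical address; use the secondary ENI's actual primary private IP
ip addr add 10.12.97.80/24 dev eth1
# With an address present on eth1, the MASQUERADE picks eth1's IP as the SNAT
# source, so the backend's reply is routed back to eth1 and the path is symmetric.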
Option 3
Use the same approach as for the NodePort problem: set the primary NIC's /proc/sys/net/ipv4/conf/{interface}/rp_filter to 2, set a mark in the iptables mangle table, and add a policy-routing rule that sends this traffic out through the primary NIC.
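A rough sketch of the marking part (the mark value, rule priority, and the reuse of kube-proxy's KUBE-LOAD-BALANCER ipset are all assumptions):

# Mark pod traffic whose destination is a LoadBalancer IP, before the routing decision
iptables -t mangle -A PREROUTING -m set --match-set KUBE-LOAD-BALANCER dst,dst -j MARK --set-mark 0x80
# Marked traffic skips the pod's policy route and uses the main table, i.e. leaves via the primary NIC
ip rule add fwmark 0x80 lookup main pref 500

The comment below, from the lyft/cni-ipvlan-vpc-k8s issue linked underneath, describes a similar pod-side variant of this idea: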
I have a similar solution for scenario 2 but it is more complicated because it needs to be applied to all pods:
- Mark packets received through veth and not from host
- Use conntrack to restore the same mark on packets in the reverse direction
- Create a rule with higher priority to force returning through veth interface for this mark using a new routing table
iptables -A PREROUTING -i veth0 -t mangle -s ! 172.30.187.226 -j CONNMARK --set-mark 42
iptables -t mangle -A OUTPUT -j CONNMARK --restore-mark
ip rule add fwmark 42 lookup 100 pref 1024
ip route add default via 172.30.187.226 dev veth0 table 100
ip route add 172.30.187.226 dev veth0 scope link table 100
Where 172.30.187.226 is the host IP
This assumes that all traffic from veth and not from host ip is nodeport traffic
Both solutions work but add a lot of complexity. I hope we can find a simpler solution
https://github.com/lyft/cni-ipvlan-vpc-k8s/issues/38#issuecomment-387059850
References
https://github.com/aws/amazon-vpc-cni-k8s/commit/2cce7de02bbfef66b12f0d61d3e9f7cb96d2c186
https://github.com/cilium/cilium/commit/c7f9997d7001c8561583d374dcbd4d973bad6fac
https://github.com/cilium/cilium/commit/01f8dcc51c84e1cab269f84e782e09d8261ac495
https://github.com/kubernetes/kubernetes/issues/66607
https://kubernetes.io/blog/2018/07/09/ipvs-based-in-cluster-load-balancing-deep-dive/
https://en.wikipedia.org/wiki/Netfilter#/media/File:Netfilter-packet-flow.svg
https://www.hwchiu.com/ipvs-4.html
https://www.digihunch.com/2020/11/ipvs-iptables-and-kube-proxy/