All our customers have an online presence. A subset of these have higher demands when it comes to latency and reliability than others. Sometimes this is purely because of a high volume of real end-user traffic - and sometimes it’s more malicious: a DDoS attack.
In most OpenStack configurations, you have the concept of «port security». This is a firewall enforced on the network interface of the virtual instance. It is also there to prevent a malicious self-service user from spoofing their IP or MAC address. This is enforced using classic Linux iptables, which in turn relies on connection tracking tables.
In the event of high amounts of traffic, application bugs, DDoS, combinations of the above and what have you - the connection tracking tables can fill up. When that happens, packets are dropped - regardless of who the recipient of the packet is. You can increase the size of the tables, but you may just move the bottleneck: your virtio interrupt queues get saturated. You can increase the number of queues, but your hypervisor is spending a lot of resources dealing with the network virtualization - it is clearly struggling.
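To see how close a hypervisor is to that limit, you can compare the current number of tracked connections with the configured maximum - a quick sanity check that is not specific to OpenStack (the value in the last line is just an example):
sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max
# Raising the limit only buys you headroom - it moves the bottleneck, as described above
sysctl -w net.netfilter.nf_conntrack_max=2097152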
You need something better. What if we somehow could remove one of the bottlenecks altogether, and make the other one significantly wider?
What we’ve got to work with
We have quite a few hypervisors based on Intel S2600BPS blade servers. These come with an X722 network card - it has two SFP+ ports, and it supports virtual functions / SR-IOV. This allows a physical PCI device to be represented as several virtual PCI devices - which in turn can be passed down to a virtual instance with PCI passthrough - bypassing the host OS entirely.
Our hypervisors use LACP-bonding in an attempt to get the most bandwidth we can from our network.
NOTE: Different network card vendors have different ways to approach this problem. The solution presented here for Intel NICs is kind of a hack - but it’s an effective hack. Mellanox hardware behaves very differently, and this article may be of less relevance there (although the advantages of SR-IOV should be the same).
How to prepare things
We set this up mostly according to the SR-IOV documentation from OpenStack’s website. However, for the bonding to work - we need to add some trickery.
Assuming you have a compute node with one dual port NIC - one port present as eth0, and the other as eth1 - you need to ensure that VFs passed from them are mapped to separate networks. Why is that? Well - when using SR-IOV, we need some way to ensure that the two ports we attach to the VMs are mapped to the two different network ports of the physical network card. And we do this by mapping one of the ports to physnet2, and the other port to physnet3 - assuming physnet1 is already in use for ordinary provider networks.
Neutron server / Nova Scheduler
Ensure that the VLAN ranges are added and appropriately mapped to the correct physnets:
network_vlan_ranges = physnet1:100:2999,physnet2:100:2999,physnet3:100:2999
The compute node
The SR-IOV Agent needs the mappings added in sriov_agent.ini:
physical_device_mappings = physnet2:eth0,physnet3:eth1
The Nova compute agent needs to have these devices whitelisted in nova.conf:
passthrough_whitelist={"devname":"eth0","physical_network":"physnet2"}
passthrough_whitelist={"devname":"eth1","physical_network":"physnet3"}
You also need to ensure that VFs are created from both eth0 and eth1.
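How you create the VFs depends on your driver and distribution. A minimal sketch, assuming the in-kernel sysfs interface and eight VFs per port (the count is an arbitrary example - and you will want to make this persistent across reboots):
# Create eight VFs on each physical port
echo 8 > /sys/class/net/eth0/device/sriov_numvfs
echo 8 > /sys/class/net/eth1/device/sriov_numvfs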
Bonding
When you set up a bond interface, Linux will change the MAC address of the slave interfaces so that they are the same. However, the OS inside the instance will not be allowed to change the MAC address if it tries. To circumvent this, you need to create the ports for both network interfaces with the same MAC address - there is no need to change the MAC if it is already the same.
Setting up the network objects
Creating the network objects is by default a privileged operation - simply because users shouldn’t be able to just bring up network interfaces which are tagged with a VLAN of their arbitrary choice. So ensure that you have admin access when you create these resources.
VLAN_ID=1337
TARGET_PROJECT=acme_org
IPV4_NET=10.0.0.0/24
openstack network create --project $TARGET_PROJECT \
--provider-network-type vlan \
--provider-physical-network physnet2 \
--provider-segment $VLAN_ID vlan-$VLAN_ID-eth0
openstack network create --project $TARGET_PROJECT \
--provider-network-type vlan \
--provider-physical-network physnet3 \
--provider-segment $VLAN_ID vlan-$VLAN_ID-eth1
openstack subnet create \
--no-dhcp \
--subnet-range $IPV4_NET \
--network vlan-$VLAN_ID-eth0 vlan-$VLAN_ID-ipv4-eth0
openstack subnet create \
--no-dhcp \
--subnet-range $IPV4_NET \
--network vlan-$VLAN_ID-eth1 vlan-$VLAN_ID-ipv4-eth1
If you want to bond the network interfaces, Linux needs to be able to change the MAC address of the interfaces. And as a security measure, a virtual instance is not allowed to change the MAC address of the virtual function device - so you need to pass two interfaces to the instance, using the same MAC address.
Create the ports - these can be created without being admin:
openstack port create \
--vnic-type direct \
--network vlan-1337-eth0 myserverport-eth0
openstack port create \
--vnic-type direct \
--network vlan-1337-eth1 \
--mac-address $(openstack port show myserverport-eth0 \
-f value \
-c mac_address) myserverport-eth1
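A quick sanity check that both ports really ended up with the same MAC address (using the port names from above):
openstack port show myserverport-eth0 -f value -c mac_address
openstack port show myserverport-eth1 -f value -c mac_address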
You are now ready to create the server. Note the following limitations:
- You have to create a new server and ensure these ports are used on creation, as PCI allocations only happen on server creation. Attaching them later on will not work.
- On our blade servers, instances created with these ports will only run on CPU socket 0. This means that about half of the hypervisor capacity will not be available for SR-IOV.
- The image you are booting needs the driver for the VF card (if the hypervisor uses ixgbe, the instance needs ixgbevf).
- If you need cloud-init, you should use a configuration drive.
The following cloud-configuration can be used to set up bonding on CentOS 7 - or you can do it manually I guess:
#cloud-config
write_files:
  - path: /etc/modules-load.d/bonding.conf
    content: bonding
  - path: /etc/sysconfig/network-scripts/ifcfg-bond0
    content: |
      BONDING_MASTER=yes
      BOOTPROTO=none
      DEFROUTE=yes
      DEVICE=bond0
      DNS1=DNS_SERVER
      GATEWAY=SERVER_GATEWAY
      IPADDR=SERVER_IPV4
      NAME=bond0
      ONBOOT=yes
      PREFIX=SUBNET_PREFIX_SIZE
      TYPE=Bond
  - path: /etc/sysconfig/network-scripts/ifcfg-ens4
    content: |
      DEVICE=ens4
      MASTER=bond0
      ONBOOT=yes
      SLAVE=yes
      TYPE=Ethernet
  - path: /etc/sysconfig/network-scripts/ifcfg-ens5
    content: |
      DEVICE=ens5
      MASTER=bond0
      ONBOOT=yes
      SLAVE=yes
      TYPE=Ethernet
runcmd:
  - [rm, /etc/sysconfig/network-scripts/ifcfg-eth0]
power_state:
  mode: reboot
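With the ports and the cloud-configuration in place, creating the server could look roughly like this - flavor, image, user-data file and server names are placeholders for your own values:
openstack server create \
  --flavor my-flavor \
  --image centos-7 \
  --port myserverport-eth0 \
  --port myserverport-eth1 \
  --config-drive True \
  --user-data bond-cloud-config.yaml \
  my-sriov-server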
You may notice the instance being stuck in the creating state for about five minutes - followed by it falling into an error state.
The bug
Most resources in OpenStack use a 128-bit UUID as their primary key - and ports are no exception to this rule.
Unless they are for use with SR-IOV. In this case, looking up the correct port and interface is done using a different unique identifier - the interface MAC address (Neutron bug 1791159).
So you have two ports with the same MAC address, but the SR-IOV Agent isn’t completely able to distinguish between the two. The two interfaces are (mostly) set up and attached to the instance by the SR-IOV Agent, but this is not properly communicated back to nova-compute, which is waiting for a message that the second interface is attached. After five minutes it will give up and put the instance in the ERROR state.
The following workaround will pass the message to nova, and the instance will boot up:
#!/bin/bash
# This script takes a single argument - the server UUID.
# It also requires that you are authenticating as the neutron service user
SERVER_ID=$1
PORT_ID=$(openstack port list --server $SERVER_ID -c ID -c Status | grep DOWN | awk '{ print $2 }')
KS_TOKEN=$(openstack token issue -f value -c id)
curl -H "x-auth-token: $KS_TOKEN" \
-H 'Content-Type: application/json' \
--data "{\"events\": [{\"status\": \"completed\", \"tag\": \"$PORT_ID\", \"name\": \"network-vif-plugged\", \"server_uuid\": \"$SERVER_ID\"}]}" \
http://NOVA_ENDPOINT/v2/${OS_PROJECT_ID}/os-server-external-events
If the server seems like it’s booting, and is reported as ACTIVE by nova, you probably did it correctly.
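Inside the instance, you can confirm that the interfaces are actually backed by the VF driver rather than virtio (interface names may differ - ens4/ens5 match the cloud-configuration above):
# Should report the VF driver (for example ixgbevf), not virtio_net
ethtool -i ens4
ethtool -i ens5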
On the compute node, the output of ip link may give you something like this:
2: eth0: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 9000 qdisc mq master bond0 state UP mode DEFAULT group default qlen 1000
link/ether ac:1f:f8:72:42:4f brd ff:ff:ff:ff:ff:ff
vf 0 MAC 00:00:00:00:00:00, spoof checking on, link-state auto, trust off, query_rss off
vf 1 MAC 00:00:00:00:00:00, spoof checking on, link-state auto, trust off, query_rss off
vf 2 MAC 00:00:00:00:00:00, spoof checking on, link-state auto, trust off, query_rss off
vf 3 MAC 00:00:00:00:00:00, spoof checking on, link-state auto, trust off, query_rss off
vf 4 MAC 00:00:00:00:00:00, spoof checking on, link-state auto, trust off, query_rss off
vf 5 MAC 00:00:00:00:00:00, spoof checking on, link-state auto, trust off, query_rss off
vf 6 MAC fa:ca:fa:fe:f0:f3, vlan 1337, spoof checking on, link-state auto, trust off, query_rss off
vf 7 MAC 00:00:00:00:00:00, spoof checking on, link-state auto, trust off, query_rss off
vf 8 MAC 00:00:00:00:00:00, spoof checking on, link-state auto, trust off, query_rss off
vf 9 MAC 00:00:00:00:00:00, spoof checking on, link-state auto, trust off, query_rss off
3: eth1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 9000 qdisc mq master bond0 state UP mode DEFAULT group default qlen 1000
link/ether ac:1f:f8:72:42:4f brd ff:ff:ff:ff:ff:ff
vf 0 MAC 00:00:00:00:00:00, spoof checking on, link-state auto, trust off, query_rss off
vf 1 MAC 00:00:00:00:00:00, spoof checking on, link-state auto, trust off, query_rss off
vf 2 MAC 00:00:00:00:00:00, spoof checking on, link-state auto, trust off, query_rss off
vf 3 MAC 00:00:00:00:00:00, spoof checking on, link-state auto, trust off, query_rss off
vf 4 MAC fa:ca:fa:fe:f0:f3, vlan 1337, spoof checking on, link-state auto, trust off, query_rss off
vf 5 MAC 00:00:00:00:00:00, spoof checking on, link-state auto, trust off, query_rss off
vf 6 MAC 00:00:00:00:00:00, spoof checking on, link-state auto, trust off, query_rss off
vf 7 MAC 00:00:00:00:00:00, spoof checking on, link-state auto, trust off, query_rss off
vf 8 MAC 00:00:00:00:00:00, spoof checking on, link-state auto, trust off, query_rss off
vf 9 MAC 00:00:00:00:00:00, spoof checking on, link-state auto, trust off, query_rss off
We see both eth0/eth1, and a bunch of provisioned VF NICs. Most of them have MAC 00:00:00:00:00:00, but one on each interface has the MAC and VLAN of the instance ports set. If the link-state isn’t auto (which will probably be the case on one of the interfaces), it is advisable to set it like this:
## ip link set PFINTERFACE vf VFINDEX state auto
# ip link set eth1 vf 4 state auto
If it has a link-state of enable, you will not get the benefit of bonding when the switch link goes down. The instance will not know, and will fire 50% of its packets into a black hole.
If you’re lucky, the cloud-configuration did what you needed to get bonding up and ready (or maybe the image had what you need). If not, you need to set it up somehow. The bonding mode should be round-robin, and NOT LACP. LACP works on the link level, and the switches and the physical interfaces have already established that relationship.
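To verify which bonding mode the instance actually ended up with, you can check the bonding state inside the guest:
# Should report "load balancing (round-robin)" - not "IEEE 802.3ad Dynamic link aggregation"
grep "Bonding Mode" /proc/net/bonding/bond0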
The end result
There are quite a few advantages with this setup as opposed to ordinary virtio-based interfaces.
- Lower latency
- Less CPU used by the hypervisor (there is no need for vhost threads - the NIC hardware does the work and the VFs are passed as PCI devices to the instances). This is especially visible on setups where you do CPU pinning, because the pinning affects the vhost threads as well - a highly utilized system which also pushes a lot of packets will experience high amounts of CPU steal time because of this.
- The traffic sent back and forth through these ports skips connection tracking - removing one potential bottleneck entirely.
- The instance should have more interrupt queues (see the quick check below).
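For the last point, a rough way to compare is to look at the channel configuration of an interface inside the instance - the same command on a virtio interface typically shows fewer queues (ens4 is the example name used above):
# Shows the number of RX/TX/combined channels exposed by the VF
ethtool -l ens4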
This is especially suitable for busy frontend servers that may be targeted by DDoS attempts - Varnish caches, for example.
Any packet filtering you have previously done with security groups needs to be implemented by other means - and in case it isn’t obvious from the complicated hoops you need to jump through to get this working: you should know what you’re doing before putting this into production.
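One of those other means is simply filtering inside the instance. A minimal iptables sketch as a starting point (IPv4 only, the ports are examples - and note that stateful rules bring the guest’s own connection tracking into play):
# Accept loopback, established connections and the services you actually run
iptables -A INPUT -i lo -j ACCEPT
iptables -A INPUT -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT
iptables -A INPUT -p tcp --dport 22 -j ACCEPT
iptables -A INPUT -p tcp --dport 80 -j ACCEPT
iptables -A INPUT -p tcp --dport 443 -j ACCEPT
iptables -A INPUT -p icmp --icmp-type echo-request -j ACCEPT
# Drop everything else - set the policy last so you do not lock yourself out mid-way
iptables -P INPUT DROP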