All our customers have an online presence. A subset of these have higher demands when it comes to latency and reliability than others. Sometimes this is purely because of a high volume of real end-user traffic - and sometimes it’s more malicious: a DDoS attack.
In most OpenStack configurations, you have the concept of «port security». This is a firewall enforced on the network interface of the virtual instance. It is also there to prevent a malicious self-service user from spoofing their IP or MAC address. This is enforced using classic Linux iptables, which in turn relies on connection tracking tables.
In the event of high amounts of traffic, application bugs, DDoS, combinations of the above and what have you - the connection tracking tables can fill up. When that happens, packets are dropped - regardless of who the recipient of the packet is. You can increase the size of the tables, but you may just move the bottleneck: your virtio interrupt queues get saturated. You can increase the number of queues, but your hypervisor is spending a lot of resources dealing with the network virtualization - it is clearly struggling.
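To see how close a hypervisor is to that limit, you can compare the current number of tracked connections with the configured maximum - a quick sanity check that is not specific to OpenStack (the value in the last line is just an example):
sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max
# Raising the limit only buys you headroom - it moves the bottleneck, as described above
sysctl -w net.netfilter.nf_conntrack_max=2097152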
You need something better. What if we somehow could remove one of the bottlenecks altogether, and make the other one significantly wider?
What we’ve got to work with
We have quite a few hypervisors based on Intel S2600BPS blade servers. These come with an X722 network card - it has two SFP+ ports, and it supports virtual functions / SR-IOV. This allows a physical PCI device to be represented as several virtual PCI devices - which in turn can be passed down to a virtual instance with PCI passthrough - bypassing the host OS entirely.
Our hypervisors use LACP-bonding in an attempt to get the most bandwidth we can from our network.
NOTE: Different network card vendors have different ways to approach this problem. The solution presented here for Intel NICs is kind of a hack - but it’s an effective hack. Mellanox hardware behaves very differently, and this article may be of less relevance there (although the advantages of SR-IOV should be the same).
How to prepare things
We set this up mostly according to the SR-IOV documentation from OpenStack’s website. However, for the bonding to work - we need to add some trickery.
Assuming you have a compute node with one dual port NIC - one port present as eth0, and the other as eth1 - you need to ensure that VFs passed from them are mapped to separate networks. Why is that? Well - when using SR-IOV, we need some way to ensure that the two ports we attach to the VMs are mapped to the two different network ports of the physical network card. And we do this by mapping one of the ports to physnet2, and the other port to physnet3 - assuming physnet1 is already in use for ordinary provider networks.
Neutron server / Nova Scheduler
Ensure that the VLAN ranges are added and appropriately mapped to the correct physnets:
network_vlan_ranges = physnet1:100:2999,physnet2:100:2999,physnet3:100:2999
The compute node
The SR-IOV Agent needs the mappings added in sriov_agent.ini:
physical_device_mappings = physnet2:eth0,physnet3:eth1
The Nova compute agent needs to have these devices whitelisted in nova.conf:
passthrough_whitelist={"devname":"eth0","physical_network":"physnet2"}
passthrough_whitelist={"devname":"eth1","physical_network":"physnet3"}
You also need to ensure that VFs are created from both eth0 and eth1.
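How you create the VFs depends on your driver and distribution. A minimal sketch, assuming the in-kernel sysfs interface and eight VFs per port (the count is an arbitrary example - and you will want to make this persistent across reboots):
# Create eight VFs on each physical port
echo 8 > /sys/class/net/eth0/device/sriov_numvfs
echo 8 > /sys/class/net/eth1/device/sriov_numvfs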
Bonding
When you set up a bond interface, Linux will change the MAC address of the slave interfaces so that they are the same. However, the OS inside the instance will not be allowed to change the MAC address if it tries. To circumvent this, you need to create the ports for both network interfaces with the same MAC address - there is no need to change the MAC if it is already the same.
Setting up the network objects
Creating the network objects is by default a privileged operation - simply because users shouldn’t be able to just bring up network interfaces which are tagged with a VLAN of their arbitrary choice. So ensure that you have admin access when you create these resources.
VLAN_ID=1337
TARGET_PROJECT=acme_org
IPV4_NET=10.0.0.0/24
openstack network create --project $TARGET_PROJECT \
--provider-network-type vlan \
--provider-physical-network physnet2 \
--provider-segment $VLAN_ID vlan-$VLAN_ID-eth0
openstack network create --project $TARGET_PROJECT \
--provider-network-type vlan \
--provider-physical-network physnet3 \
--provider-segment $VLAN_ID vlan-$VLAN_ID-eth1
openstack subnet create \
--no-dhcp \
--subnet-range $IPV4_NET \
--network vlan-$VLAN_ID-eth0 vlan-$VLAN_ID-ipv4-eth0
openstack subnet create \
--no-dhcp \
--subnet-range $IPV4_NET \
--network vlan-$VLAN_ID-eth1 vlan-$VLAN_ID-ipv4-eth1
If you want to bond the network interfaces, Linux needs to be able to change the MAC address of the interfaces. And as a security measure, a virtual instance is not allowed to change the MAC address of the virtual function device - so you need to pass two interfaces to the instance, using the same MAC address.
Create the ports - these can be created without being admin:
openstack port create \
--vnic-type direct \
--network vlan-1337-eth0 myserverport-eth0
openstack port create \
--vnic-type direct \
--network vlan-1337-eth1 \
--mac-address $(openstack port show myserverport-eth0 \
-f value \
-c mac_address) myserverport-eth1
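A quick sanity check that both ports really ended up with the same MAC address (using the port names from above):
openstack port show myserverport-eth0 -f value -c mac_address
openstack port show myserverport-eth1 -f value -c mac_address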
You are now ready to create the server. Note the following limitations:
- You have to create a new server and ensure these ports are used on creation, as PCI allocations only happen on server creation. Attaching them later on will not work.
- On our blade servers, instances created with these ports will only run on CPU socket 0. This means that about half of the hypervisor capacity will not be available for SR-IOV.
- The image you are booting needs the driver for the VF card (if the hypervisor uses ixgbe, the instance needs ixgbevf).
- If you need cloud-init, you should use a configuration drive.
The following cloud-configuration can be used to set up bonding on CentOS 7 - or you can do it manually I guess:
#cloud-config
write_files:
  - path: /etc/modules-load.d/bonding.conf
    content: bonding
  - path: /etc/sysconfig/network-scripts/ifcfg-bond0
    content: |
      BONDING_MASTER=yes
      BOOTPROTO=none
      DEFROUTE=yes
      DEVICE=bond0
      DNS1=DNS_SERVER
      GATEWAY=SERVER_GATEWAY
      IPADDR=SERVER_IPV4
      NAME=bond0
      ONBOOT=yes
      PREFIX=SUBNET_PREFIX_SIZE
      TYPE=Bond
  - path: /etc/sysconfig/network-scripts/ifcfg-ens4
    content: |
      DEVICE=ens4
      MASTER=bond0
      ONBOOT=yes
      SLAVE=yes
      TYPE=Ethernet
  - path: /etc/sysconfig/network-scripts/ifcfg-ens5
    content: |
      DEVICE=ens5
      MASTER=bond0
      ONBOOT=yes
      SLAVE=yes
      TYPE=Ethernet
runcmd:
  - [rm, /etc/sysconfig/network-scripts/ifcfg-eth0]
power_state:
  mode: reboot
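With the ports and the cloud-configuration in place, creating the server could look roughly like this - flavor, image, user-data file and server names are placeholders for your own values:
openstack server create \
  --flavor my-flavor \
  --image centos-7 \
  --port myserverport-eth0 \
  --port myserverport-eth1 \
  --config-drive True \
  --user-data bond-cloud-config.yaml \
  my-sriov-server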
You may notice the instance being stuck in the creating state for about five minutes - followed by it falling into an error state.
The bug
Most resources in OpenStack use a 128-bit UUID as their primary key - and ports are no exception to this rule.
Unless they are for use with SR-IOV. In this case, looking up the correct port and interface is done using a different unique identifier - the interface MAC address (Neutron bug 1791159).
So you have two ports with the same MAC address, but the SR-IOV Agent isn’t completely able to distinguish between the two. The two interfaces are (mostly) set up and attached to the instance by the SR-IOV Agent, but this is not properly communicated back to nova-compute, which is waiting for a message that the second interface is attached. After five minutes it will give up and put the instance in the ERROR state.
The following workaround will pass the message to nova, and the instance will boot up:
#!/bin/bash
# This script takes a single argument - the server UUID.
# It also requires that you are authenticating as the neutron service user
SERVER_ID=$1
PORT_ID=$(openstack port list --server $SERVER_ID -c ID -c Status | grep DOWN | awk '{ print $2 }')
KS_TOKEN=$(openstack token issue -f value -c id)
curl -H "x-auth-token: $KS_TOKEN" \
-H 'Content-Type: application/json' \
--data "{\"events\": [{\"status\": \"completed\", \"tag\": \"$PORT_ID\", \"name\": \"network-vif-plugged\", \"server_uuid\": \"$SERVER_ID\"}]}" \
http://NOVA_ENDPOINT/v2/${OS_PROJECT_ID}/os-server-external-events
If the server seems like it’s booting, and is reported as ACTIVE by nova, you probably did it correctly.
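Inside the instance, you can confirm that the interfaces are actually backed by the VF driver rather than virtio (interface names may differ - ens4/ens5 match the cloud-configuration above):
# Should report the VF driver (for example ixgbevf), not virtio_net
ethtool -i ens4
ethtool -i ens5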
On the compute node, the output of ip link may give you something like this:
2: eth0: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 9000 qdisc mq master bond0 state UP mode DEFAULT group default qlen 1000
link/ether ac:1f:f8:72:42:4f brd ff:ff:ff:ff:ff:ff
vf 0 MAC 00:00:00:00:00:00, spoof checking on, link-state auto, trust off, query_rss off
vf 1 MAC 00:00:00:00:00:00, spoof checking on, link-state auto, trust off, query_rss off
vf 2 MAC 00:00:00:00:00:00, spoof checking on, link-state auto, trust off, query_rss off
vf 3 MAC 00:00:00:00:00:00, spoof checking on, link-state auto, trust off, query_rss off
vf 4 MAC 00:00:00:00:00:00, spoof checking on, link-state auto, trust off, query_rss off
vf 5 MAC 00:00:00:00:00:00, spoof checking on, link-state auto, trust off, query_rss off
vf 6 MAC fa:ca:fa:fe:f0:f3, vlan 1337, spoof checking on, link-state auto, trust off, query_rss off
vf 7 MAC 00:00:00:00:00:00, spoof checking on, link-state auto, trust off, query_rss off
vf 8 MAC 00:00:00:00:00:00, spoof checking on, link-state auto, trust off, query_rss off
vf 9 MAC 00:00:00:00:00:00, spoof checking on, link-state auto, trust off, query_rss off
3: eth1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 9000 qdisc mq master bond0 state UP mode DEFAULT group default qlen 1000
link/ether ac:1f:f8:72:42:4f brd ff:ff:ff:ff:ff:ff
vf 0 MAC 00:00:00:00:00:00, spoof checking on, link-state auto, trust off, query_rss off
vf 1 MAC 00:00:00:00:00:00, spoof checking on, link-state auto, trust off, query_rss off
vf 2 MAC 00:00:00:00:00:00, spoof checking on, link-state auto, trust off, query_rss off
vf 3 MAC 00:00:00:00:00:00, spoof checking on, link-state auto, trust off, query_rss off
vf 4 MAC fa:ca:fa:fe:f0:f3, vlan 1337, spoof checking on, link-state auto, trust off, query_rss off
vf 5 MAC 00:00:00:00:00:00, spoof checking on, link-state auto, trust off, query_rss off
vf 6 MAC 00:00:00:00:00:00, spoof checking on, link-state auto, trust off, query_rss off
vf 7 MAC 00:00:00:00:00:00, spoof checking on, link-state auto, trust off, query_rss off
vf 8 MAC 00:00:00:00:00:00, spoof checking on, link-state auto, trust off, query_rss off
vf 9 MAC 00:00:00:00:00:00, spoof checking on, link-state auto, trust off, query_rss off
We see both eth0/eth1, and a bunch of provisioned VF NICs. Most of them have MAC 00:00:00:00:00:00, but one on each interface has the MAC and VLAN of the instance ports set. If the link-state isn’t auto (which will probably be the case on one of the interfaces), it is advisable to set it like this:
## ip link set PFINTERFACE vf VFINDEX state auto
# ip link set eth1 vf 4 state auto
If it has a link-state of enable, you will not get the benefit of bonding when the switch link goes down. The instance will not know, and will fire 50% of its packets into a black hole.
If you’re lucky, the cloud-configuration did what you needed to get bonding up and ready (or maybe the image had what you need). If not, you need to set it up somehow. The bonding mode should be round-robin, and NOT LACP. LACP works on the link level, and the switches and the physical interfaces have already established that relationship.
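To verify which bonding mode the instance actually ended up with, you can check the bonding state inside the guest:
# Should report "load balancing (round-robin)" - not "IEEE 802.3ad Dynamic link aggregation"
grep "Bonding Mode" /proc/net/bonding/bond0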
The end result
There are quite a few advantages with this setup as opposed to ordinary virtio-based interfaces.
- Lower latency
- Less CPU used by the hypervisor (there is no need for vhost threads - the NIC hardware does the work and the VFs are passed as PCI devices to the instances). This is especially visible on setups where you do CPU pinning, because the pinning affects the vhost threads as well - a highly utilized system which also pushes a lot of packets will experience high amounts of CPU steal time because of this.
- The traffic sent back and forth through these ports skips connection tracking - removing one potential bottleneck entirely.
- The instance should have more interrupt queues (see the quick check below).
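For the last point, a rough way to compare is to look at the channel configuration of an interface inside the instance - the same command on a virtio interface typically shows fewer queues (ens4 is the example name used above):
# Shows the number of RX/TX/combined channels exposed by the VF
ethtool -l ens4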
This is especially suitable for busy frontend servers that may be targeted by DDoS attempts - Varnish caches, for example.
Any packet filtering you have previously done with security groups needs to be implemented by other means - and in case it isn’t obvious from the complicated hoops you need to jump through to get this working: you should know what you’re doing before putting this into production.
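One of those other means is simply filtering inside the instance. A minimal iptables sketch as a starting point (IPv4 only, the ports are examples - and note that stateful rules bring the guest’s own connection tracking into play):
# Accept loopback, established connections and the services you actually run
iptables -A INPUT -i lo -j ACCEPT
iptables -A INPUT -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT
iptables -A INPUT -p tcp --dport 22 -j ACCEPT
iptables -A INPUT -p tcp --dport 80 -j ACCEPT
iptables -A INPUT -p tcp --dport 443 -j ACCEPT
iptables -A INPUT -p icmp --icmp-type echo-request -j ACCEPT
# Drop everything else - set the policy last so you do not lock yourself out mid-way
iptables -P INPUT DROP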