
100Gbit Ethernet Tuning

A summary of internet resources on tuning 100 Gbit Ethernet. Background: a 100 Gb Ethernet adapter will not get anywhere near 100 Gb/s without tuning. A new installation over a 200-mile link shows 12 Gb/s.

fasterdata.es.net

romoreira

Stanford

Redhat-1

Redhat-2

Redhat-3

NVidia (https://docs.nvidia.com/networking/display/winof2v320/Performance+Tuning)

bbr

PerfSonar

Intel

What PerfSonar does automatically

net.core.rmem_max = 536870912
net.core.wmem_max = 536870912
net.ipv4.conf.all.arp_announce = 2
net.ipv4.conf.all.arp_filter = 1
net.ipv4.conf.all.arp_ignore = 1
net.ipv4.conf.default.arp_filter = 1
net.ipv4.tcp_congestion_control = htcp
net.ipv4.tcp_mtu_probing = 1
net.ipv4.tcp_no_metrics_save = 1
net.ipv4.tcp_rmem = 4096 87380 268435456
net.ipv4.tcp_wmem = 4096 65536 268435456

ifconfig ens16f1 mtu 9000
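
An MTU of 9000 only helps if every hop on the path passes jumbo frames. A minimal sketch of one way to check, assuming IPv4 and the Linux iputils ping: the largest ICMP payload that fits in one packet is the MTU minus the 20-byte IPv4 header and the 8-byte ICMP header, sent with fragmentation prohibited. The function names and the host name remote-host are placeholders, not part of any of the sources above.

```shell
# Largest ICMP echo payload that fits in one IPv4 packet of the given MTU:
# MTU minus 20 bytes of IPv4 header minus 8 bytes of ICMP header.
icmp_payload_for_mtu() {
    echo $(( $1 - 20 - 8 ))
}

# Print (not run) a ping command that tests the path to host $1 at MTU $2.
# "-M do" sets the don't-fragment bit, so an undersized hop makes the ping fail.
jumbo_ping_cmd() {
    echo "ping -M do -c 3 -s $(icmp_payload_for_mtu "$2") $1"
}

jumbo_ping_cmd remote-host 9000
# prints: ping -M do -c 3 -s 8972 remote-host
```

If the printed command fails with a "message too long" style error while a 1472-byte payload works, some device on the path is still at MTU 1500.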
Redhat 1
net.ipv4.tcp_window_scaling = 1
net.ipv4.tcp_rmem = 8192 Y 4194304
net.ipv4.tcp_wmem = 8192 Y 4194304
The middle value, Y, is the default buffer size and is the most important of the three. Start at 524288 (512 KB) and work up from there in small multiples of your bandwidth-delay product (BDP): try BDP x1, then BDP x1.25, then BDP x1.5, and so on. Once speeds start to improve, refine in smaller steps, for example BDP x2.5, then BDP x2.6. You are unlikely to need a value larger than BDP x5.
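
The BDP itself is just bandwidth times round-trip time. A small sketch of the arithmetic, assuming the link speed is given in Gbit/s and the RTT in milliseconds; the 10 ms RTT is an illustrative figure, not a measurement from this installation, and the function names are made up for the example.

```shell
# BDP in bytes = bandwidth (bit/s) * RTT (s) / 8
#              = Gbit/s * RTT (ms) * 125000
bdp_bytes() {
    echo $(( $1 * $2 * 125000 ))
}

# Scale a BDP by a multiplier given in percent (125 means BDP x1.25).
scaled_bdp() {
    echo $(( $1 * $2 / 100 ))
}

bdp=$(bdp_bytes 100 10)    # 100 Gbit/s link, assumed 10 ms RTT
for pct in 100 125 150; do
    echo "BDP x${pct}%: $(scaled_bdp "$bdp" "$pct") bytes"
done
```

Measure the real RTT with ping between the two endpoints and substitute it for the assumed value. Note that the third (maximum) field of tcp_rmem and tcp_wmem has to be at least as large as the Y you choose.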
NVidia

These are Windows settings, so they need translating to their Linux equivalents.

Disable SACK
Enable Fast Datagram UDP
Set RSS=enabled
Set closest NUMA
Set receive buffers=512
Set send buffers=2048
IPV4 Checksum offload enabled
TCP/UDP Checksum offload enabled
IPV6 TCP/UDP Checksum offload enabled
Large Send Option offload enabled
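
A rough, unverified attempt at that translation. The mapping is an assumption on my part: "receive/send buffers" is taken to mean the NIC ring sizes, "Large Send Option offload" to mean TSO, and RSS to mean receive hashing. "Fast Datagram UDP" has no obvious Linux counterpart and is left out, and the NUMA item is covered by the IRQ affinity notes further down. The function only prints the commands; it does not run them, and ethX is a placeholder.

```shell
# Print (do not run) rough Linux counterparts of the Windows settings above.
# $1 is the interface name. Order: SACK off, receive hashing (RSS) on,
# ring sizes, checksum offload, TCP segmentation offload.
linux_equivalents() {
    cat <<EOF
sysctl -w net.ipv4.tcp_sack=0
ethtool -K $1 rxhash on
ethtool -G $1 rx 512 tx 2048
ethtool -K $1 rx on tx on
ethtool -K $1 tso on
EOF
}

linux_equivalents ethX
```

Check what the adapter supports with ethtool -g (ring limits) and ethtool -k (offload features) before applying any of these; a driver may reject values outside its range.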
BBR

Enable TCP/BBR congestion control. See BBR and Redhat-3 above.

echo "net.ipv4.tcp_congestion_control = bbr" >> /etc/sysctl.conf
sysctl -p
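
Setting net.ipv4.tcp_congestion_control to bbr only works if the kernel has BBR available (it was added in Linux 4.9 and is often built as the tcp_bbr module). A small sketch for checking the kernel's list of available algorithms; the helper name cc_available is made up for this example.

```shell
# Succeeds if algorithm $1 appears in the space-separated list $2.
cc_available() {
    case " $2 " in
        *" $1 "*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Example with a sample list; on a live host take the list from the kernel:
#   avail=$(sysctl -n net.ipv4.tcp_available_congestion_control)
#   cc_available bbr "$avail" || sudo modprobe tcp_bbr
if cc_available bbr "reno cubic bbr"; then
    echo "bbr is available"
fi
```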

Stanford
firewall-cmd --zone=public --add-port=61617/tcp --permanent
firewall-cmd --zone=public --add-port=8090/tcp --permanent
firewall-cmd --zone=public --add-port=8096/tcp --permanent
firewall-cmd --zone=public --add-port=4823/tcp --permanent
firewall-cmd --zone=public --add-port=6001-6200/tcp --permanent
firewall-cmd --zone=public --add-port=6001-6200/udp --permanent
firewall-cmd --zone=public --add-port=5001-5900/udp --permanent
firewall-cmd --zone=public --add-port=5001-5900/tcp --permanent
firewall-cmd --zone=public --add-port=861/tcp --permanent
firewall-cmd --zone=public --add-port=8760-9960/udp --permanent
firewall-cmd --zone=public --add-port=33434-33634/udp --permanent
 
This opens the default port used by the fdt.jar tool:
firewall-cmd --zone=public --add-port=54321/tcp --permanent
 
Reload the rule database so the permanent rules above take effect:
firewall-cmd --reload
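
The rule list above is repetitive, so it can be generated from a list of port specs. A sketch that only prints the commands (a dry run) so they can be reviewed before being piped to a root shell; the function name fw_add_cmds is made up for this example, and the zone is assumed to be public as above.

```shell
# Print one permanent --add-port command per port spec, then the reload.
# Each argument is a spec such as 861/tcp or 5001-5900/udp.
fw_add_cmds() {
    for spec in "$@"; do
        echo "firewall-cmd --zone=public --add-port=${spec} --permanent"
    done
    echo "firewall-cmd --reload"
}

fw_add_cmds 861/tcp 5001-5900/tcp 5001-5900/udp 54321/tcp
```

After reloading, firewall-cmd --zone=public --list-ports shows which ports are actually open.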

Let's find the CPU that the 100G NIC is attached to:
cat /sys/class/net/<100g-NIC-name>/device/numa_node
 
The usual response is '0' or '1', meaning the NIC is attached to CPU '0' or CPU '1'. (A response of '-1' probably means it is a single-CPU system.) Let's assume it returned '1'.

Knowing that, run this command and look for the 'NUMA node1 CPU(s)' line, which lists the cores on that CPU:
lscpu
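
The same information can be read straight from sysfs, which is handy in scripts. A sketch, assuming the NIC is a PCI device so that its device directory exposes numa_node and local_cpulist; the function names and the optional second argument (an alternative sysfs root, useful only for testing) are made up for this example.

```shell
# NUMA node of interface $1 (prints -1 when the kernel has no NUMA info).
nic_numa_node() {
    cat "${2:-/sys}/class/net/$1/device/numa_node"
}

# CPU cores local to interface $1, as a range list such as 12-23.
nic_local_cpus() {
    cat "${2:-/sys}/class/net/$1/device/local_cpulist"
}

# Usage on a live host (replace the placeholder with the real NIC name):
#   nic_numa_node <100g-NIC-name>
#   nic_local_cpus <100g-NIC-name>
```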


Most modern CPUs can run at different clock frequencies and often do so to save energy. Here we want the CPU to run as fast as possible. First, let's see what speed each CPU core is running at and what the maximum speed could be. Run this command:
grep -E '^model name|^cpu MHz' /proc/cpuinfo
 
You'll probably see that the cores aren't running near their rated speed, most often because the frequency governor is set to 'powersave'.
 
This simple command sets all the cores to 'performance' instead:
sudo cpupower frequency-set --governor performance
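
To confirm the change took, the governor of every core can be read back from sysfs. A sketch, assuming the cpufreq driver exposes scaling_governor for each core; the function name and the optional argument (an alternative sysfs root, for testing) are made up for this example.

```shell
# Print each governor in use followed by the number of cores using it.
governor_summary() {
    cat "${1:-/sys}"/devices/system/cpu/cpu[0-9]*/cpufreq/scaling_governor |
        sort | uniq -c | awk '{ print $2, $1 }'
}

# Usage on a live host:
#   governor_summary
# A single line reading "performance <number of cores>" means every core switched.
```

The setting does not survive a reboot on its own, so it usually needs to be reapplied at boot.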

(These values are uncontroversial to add or change for high-speed NICs.)
# increase TCP max buffer size settable using setsockopt()
# allow testing with 256MB buffers
net.core.rmem_max = 268435456
net.core.wmem_max = 268435456
# increase Linux autotuning TCP buffer limits
# min, default, and max number of bytes to use
# allow auto-tuning up to 128MB buffers
net.ipv4.tcp_rmem = 4096 87380 134217728
net.ipv4.tcp_wmem = 4096 65536 134217728
# recommended to increase this for CentOS6 with 10G NICs or higher
net.core.netdev_max_backlog = 250000
# don't cache ssthresh from previous connection
net.ipv4.tcp_no_metrics_save = 1
# Explicitly set htcp as the congestion control: cubic buggy in older 2.6 kernels
net.ipv4.tcp_congestion_control = htcp
# If you are using Jumbo Frames, also set this
net.ipv4.tcp_mtu_probing = 1
# recommended for CentOS7/Debian8 hosts
net.core.default_qdisc = fq
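
Once settings like these are saved in a file and loaded with sysctl -p, it is worth checking that the running kernel really has them. A sketch of such a check; the function names are made up, and the file path in the usage comment is hypothetical.

```shell
# Print "key=value" for each setting in a sysctl-style file, skipping
# comment and blank lines and collapsing runs of whitespace in values.
sysctl_pairs() {
    awk -F= '
        /^[ \t]*#/ || /^[ \t]*$/ { next }
        {
            key = $1; val = $2
            gsub(/^[ \t]+|[ \t]+$/, "", key)
            gsub(/^[ \t]+|[ \t]+$/, "", val)
            gsub(/[ \t]+/, " ", val)
            print key "=" val
        }' "$1"
}

# Compare every setting in file $1 with the live kernel value and
# report the ones that differ.
check_sysctl_file() {
    sysctl_pairs "$1" | while IFS='=' read -r key want; do
        have=$(sysctl -n "$key" 2>/dev/null | tr '\t' ' ' | tr -s ' ')
        [ "$have" = "$want" ] || echo "MISMATCH $key: want '$want', have '$have'"
    done
}

# Usage on a live host (hypothetical file name):
#   sudo sysctl -p /etc/sysctl.d/90-100g.conf
#   check_sysctl_file /etc/sysctl.d/90-100g.conf
```

No output from check_sysctl_file means every value in the file matches the running kernel.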
Intel
To configure IRQ affinity, stop irqbalance and then either use the set_irq_affinity script from the
i40e source package or pin queues manually.
Disable user-space IRQ balancer to enable queue pinning:
• systemctl disable irqbalance
• systemctl stop irqbalance
Using the set_irq_affinity script from the i40e source package (recommended):
• To use all cores:
[path-to-i40epackage]/scripts/set_irq_affinity -x all ethX
• To use only cores on the local NUMA socket:
[path-to-i40epackage]/scripts/set_irq_affinity -x local ethX
• You can also select a range of cores. Avoid using cpu0 because it runs timer tasks.
[path-to-i40epackage]/scripts/set_irq_affinity 1-2 ethX
Manually:
• Find the processors attached to each node using:
numactl --hardware
lscpu
• Find the bit masks for each of the processors:
• Assuming cores 0-11 for node 0, the masks in hexadecimal are: [1,2,4,8,10,20,40,80,100,200,400,800]
• Find the IRQs assigned to the port being configured:
grep ethX /proc/interrupts
Note the IRQ values, for example 181-192 for the 12 vectors loaded.
• Echo the SMP affinity value into the corresponding IRQ entry. Note that this needs to be done for
each IRQ entry:
echo 1 > /proc/irq/181/smp_affinity
echo 2 > /proc/irq/182/smp_affinity
echo 4 > /proc/irq/183/smp_affinity
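
The manual steps above can be sketched as a small generator. The mask for core N is 1 shifted left by N, written in hexadecimal; for cores 32 and above, smp_affinity expects comma-separated 32-bit groups. The function names are made up for this example, the IRQ and core numbers are the ones used in the example above, and the commands are only printed, not run.

```shell
# Hex smp_affinity mask for a single core number $1.
# Cores >= 32 get one ",00000000" group appended per 32 cores skipped.
cpu_mask() {
    word=$(( $1 / 32 )); bit=$(( $1 % 32 ))
    mask=$(printf '%x' $(( 1 << bit )))
    while [ "$word" -gt 0 ]; do
        mask="$mask,00000000"
        word=$(( word - 1 ))
    done
    echo "$mask"
}

# Print pinning commands for $3 consecutive IRQs starting at IRQ $1,
# assigned one per core starting at core $2.
irq_pin_cmds() {
    irq=$1; core=$2; n=$3
    while [ "$n" -gt 0 ]; do
        echo "echo $(cpu_mask "$core") > /proc/irq/$irq/smp_affinity"
        irq=$(( irq + 1 )); core=$(( core + 1 )); n=$(( n - 1 ))
    done
}

irq_pin_cmds 181 0 12
```

The kernel also offers /proc/irq/<irq>/smp_affinity_list, which takes plain core numbers (for example 4 or 0-3) and avoids the mask arithmetic altogether.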
ethtool

From the VM host:

[root@fiona ~]# ethtool -k ens4np0
Features for ens4np0:
rx-checksumming: on
tx-checksumming: on
	tx-checksum-ipv4: off [fixed]
	tx-checksum-ip-generic: on
	tx-checksum-ipv6: off [fixed]
	tx-checksum-fcoe-crc: off [fixed]
	tx-checksum-sctp: off [fixed]
scatter-gather: on
	tx-scatter-gather: on
	tx-scatter-gather-fraglist: off [fixed]
tcp-segmentation-offload: on
	tx-tcp-segmentation: on
	tx-tcp-ecn-segmentation: off [fixed]
	tx-tcp-mangleid-segmentation: off
	tx-tcp6-segmentation: on
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off [fixed]
rx-vlan-offload: on
tx-vlan-offload: on
ntuple-filters: off
receive-hashing: on
highdma: on [fixed]
rx-vlan-filter: on
vlan-challenged: off [fixed]
tx-lockless: off [fixed]
netns-local: off [fixed]
tx-gso-robust: off [fixed]
tx-fcoe-segmentation: off [fixed]
tx-gre-segmentation: on
tx-gre-csum-segmentation: on
tx-ipxip4-segmentation: on
tx-ipxip6-segmentation: on
tx-udp_tnl-segmentation: on
tx-udp_tnl-csum-segmentation: on
tx-gso-partial: on
tx-tunnel-remcsum-segmentation: off [fixed]
tx-sctp-segmentation: off [fixed]
tx-esp-segmentation: off [fixed]
tx-udp-segmentation: on
tx-gso-list: off [fixed]
rx-udp-gro-forwarding: off
rx-gro-list: off
tls-hw-rx-offload: off [fixed]
fcoe-mtu: off [fixed]
tx-nocache-copy: off
loopback: off [fixed]
rx-fcs: off
rx-all: off
tx-vlan-stag-hw-insert: on
rx-vlan-stag-hw-parse: off [fixed]
rx-vlan-stag-filter: on [fixed]
l2-fwd-offload: off [fixed]
hw-tc-offload: off
esp-hw-offload: off [fixed]
esp-tx-csum-hw-offload: off [fixed]
rx-udp_tunnel-port-offload: on
tls-hw-tx-offload: off [fixed]
rx-gro-hw: off [fixed]
tls-hw-record: off [fixed]
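
In a listing this long, the interesting lines are the features that are off but not marked [fixed], since those are the only ones that can still be switched on with ethtool -K. A sketch of a filter for that; the function name is made up, and the sample input is a few lines copied from the output above.

```shell
# Read "ethtool -k" output on stdin and print the names of features
# that are off and not marked [fixed].
changeable_off() {
    awk -F': ' '$2 == "off" { name = $1; gsub(/^[ \t]+/, "", name); print name }'
}

# Live usage would be:  ethtool -k ens4np0 | changeable_off
changeable_off <<'EOF'
Features for ens4np0:
rx-checksumming: on
large-receive-offload: off [fixed]
ntuple-filters: off
rx-gro-list: off
hw-tc-offload: off
EOF
```

Whether any of these should be enabled depends on the workload; the filter only narrows down what is worth looking at.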