==100Gbit Ethernet Tuning==

A summary of internet resources.

Background: A 100 Gb Ethernet adapter will not reach anything near 100 Gb/s without tuning. A new installation over 200 miles shows 12 Gb/s.

[[https://fasterdata.es.net/host-tuning/linux/100g-tuning/ | fasterdata.es.net]]
[[https://github.com/leandromoreira/linux-network-performance-parameters | leandromoreira]]
[[https://srcc.stanford.edu/100g-network-adapter-tuning | Stanford]]
[[https://access.redhat.com/solutions/168483 | Redhat-1]]
[[https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8/html/configuring_and_managing_virtualization/configuring-virtual-machine-network-connections_configuring-and-managing-virtualization | Redhat-2]]
[[https://access.redhat.com/solutions/3713681 | Redhat-3]]
[[https://docs.nvidia.com/networking/display/winof2v320/Performance+Tuning | NVidia]]
[[https://cloud.google.com/blog/products/networking/tcp-bbr-congestion-control-comes-to-gcp-your-internet-just-got-faster | bbr]]
[[https://www.perfsonar.net/index.html | PerfSonar]]
[[https://cdrdv2-public.intel.com/334019/xl710-x710-performance-tuning-linux-guide.pdf | Intel]]

=== What PerfSonar does automatically ===

net.core.rmem_max = 536870912
net.core.wmem_max = 536870912
net.ipv4.conf.all.arp_announce = 2
net.ipv4.conf.all.arp_filter = 1
net.ipv4.conf.all.arp_ignore = 1
net.ipv4.conf.default.arp_filter = 1
net.ipv4.tcp_congestion_control = htcp
net.ipv4.tcp_mtu_probing = 1
net.ipv4.tcp_no_metrics_save = 1
net.ipv4.tcp_rmem = 4096 87380 268435456
net.ipv4.tcp_wmem = 4096 65536 268435456
ifconfig ens16f1 mtu 9000

==Redhat 1==

net.ipv4.tcp_window_scaling = 1
net.ipv4.tcp_rmem = 8192 X 4194304
net.ipv4.tcp_wmem = 8192 Y 4194304

The middle value (X or Y above) is the default buffer size. This is the most important value. You might wish to start at 524288 (512 KiB) and move up from there. You will generally wish to try small increments of your Bandwidth Delay Product (BDP). Try BDP x1, then BDP x1.25, then BDP x1.5, and so on.
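As a rough illustration of the BDP arithmetic, the sketch below computes the Bandwidth Delay Product in bytes and a few scaled candidates for the middle (default) buffer value. The function names `bdp_bytes` and `scale_pct` are made up for this example, and the 10 ms round-trip time is a placeholder assumption, not a measurement of the 200 mile link; substitute the RTT you observe with ping.

```shell
#!/bin/sh
# Bandwidth Delay Product in bytes: (bits per second * RTT in ms) / 8000.
# $1 = link speed in bits/s, $2 = round-trip time in milliseconds.
bdp_bytes() {
    echo $(( $1 * $2 / 8000 ))
}

# Scale a byte count by a multiplier given in percent (125 means x1.25).
# Integer percent is used because POSIX shell arithmetic has no floats.
scale_pct() {
    echo $(( $1 * $2 / 100 ))
}

LINK_BPS=100000000000   # 100 Gbit/s
RTT_MS=10               # ASSUMPTION: hypothetical RTT, replace with a measured value

bdp=$(bdp_bytes "$LINK_BPS" "$RTT_MS")
for pct in 100 125 150; do
    echo "BDP x${pct}%: candidate default buffer = $(scale_pct "$bdp" "$pct") bytes"
done
```

Each printed value is a candidate for the middle field of net.ipv4.tcp_rmem and net.ipv4.tcp_wmem; the third (maximum) field should be kept at or above whatever default is being tried.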
Once you start to get increased speeds, you may wish to refine your testing in smaller steps, for example BDP x2.5, then BDP x2.6, and so on. It is unlikely you will need a value larger than BDP x5.

==NVidia==

This guide is for Windows, so the settings require translation to Linux:

• Disable SACK
• Enable Fast Datagram UDP
• Set RSS=enabled
• Set closest NUMA
• Set receive buffers=512
• Set send buffers=2048
• IPv4 Checksum offload enabled
• TCP/UDP Checksum offload enabled
• IPv6 TCP/UDP Checksum offload enabled
• Large Send Option offload enabled

==BBR==

Enable TCP/BBR congestion control. See the bbr and Redhat-3 links above.

echo "net.ipv4.tcp_congestion_control = bbr" >> /etc/sysctl.conf

==Stanford==

firewall-cmd --zone=public --add-port=61617/tcp --permanent
firewall-cmd --zone=public --add-port=8090/tcp --permanent
firewall-cmd --zone=public --add-port=8096/tcp --permanent
firewall-cmd --zone=public --add-port=4823/tcp --permanent
firewall-cmd --zone=public --add-port=6001-6200/tcp --permanent
firewall-cmd --zone=public --add-port=6001-6200/udp --permanent
firewall-cmd --zone=public --add-port=5001-5900/udp --permanent
firewall-cmd --zone=public --add-port=5001-5900/tcp --permanent
firewall-cmd --zone=public --add-port=861/tcp --permanent
firewall-cmd --zone=public --add-port=8760-9960/udp --permanent
firewall-cmd --zone=public --add-port=33434-33634/udp --permanent

This allows the fdt.jar tool to use its default port:

firewall-cmd --zone=public --add-port=54321/tcp --permanent

Make these permanent and reload the rule database:

firewall-cmd --reload

Let's find the CPU that the 100g NIC is associated with:

cat /sys/class/net/<100g-NIC-name>/device/numa_node

The usual response is either '0' or '1', meaning the NIC is associated with either the '0' CPU or the '1' CPU. (If it comes back with '-1', that probably suggests a single-CPU system.) Let's assume it returned '1'. Knowing that, run this command:

lscpu

Most modern CPUs can run at different clock frequencies and often do so to save energy.
In our case we want to run the CPU as fast as possible. First let's see what speed each CPU core is running at and what the maximum speed could be. Just run this funky command:

grep -E '^model name|^cpu MHz' /proc/cpuinfo

You'll probably see that the cores aren't running near their spec speed, most often because the frequency governor is set to 'powersave'. This simple command sets all the cores to 'performance' instead:

sudo cpupower frequency-set --governor performance

(Adding or changing these values for high-speed NICs is uncontroversial.)

# increase TCP max buffer size settable using setsockopt()
# allow testing with 256MB buffers
net.core.rmem_max = 268435456
net.core.wmem_max = 268435456
# increase Linux autotuning TCP buffer limits
# min, default, and max number of bytes to use
# allow auto-tuning up to 128MB buffers
net.ipv4.tcp_rmem = 4096 87380 134217728
net.ipv4.tcp_wmem = 4096 65536 134217728
# recommended to increase this for CentOS6 with 10G NICs or higher
net.core.netdev_max_backlog = 250000
# don't cache ssthresh from previous connection
net.ipv4.tcp_no_metrics_save = 1
# Explicitly set htcp as the congestion control: cubic is buggy in older 2.6 kernels
net.ipv4.tcp_congestion_control = htcp
# If you are using Jumbo Frames, also set this
net.ipv4.tcp_mtu_probing = 1
# recommended for CentOS7/Debian8 hosts
net.core.default_qdisc = fq

==Intel==

To configure IRQ affinity, stop irqbalance and then either use the set_irq_affinity script from the i40e source package or pin queues manually.

Disable the user-space IRQ balancer to enable queue pinning:

• systemctl disable irqbalance
• systemctl stop irqbalance

Using the set_irq_affinity script from the i40e source package (recommended):

• To use all cores: [path-to-i40epackage]/scripts/set_irq_affinity -x all ethX
• To use only cores on the local NUMA socket: [path-to-i40epackage]/scripts/set_irq_affinity -x local ethX
• You can also select a range of cores. Avoid using cpu0 because it runs timer tasks.
[path-to-i40epackage]/scripts/set_irq_affinity 1-2 ethX

Manually:

• Find the processors attached to each node using:
numactl --hardware
lscpu
• Find the bit masks for each of the processors. Assuming cores 0-11 for node 0, the hexadecimal masks are: [1,2,4,8,10,20,40,80,100,200,400,800]
• Find the IRQs assigned to the port being configured:
grep ethX /proc/interrupts
and note the IRQ values, for example 181-192 for the 12 vectors loaded.
• Echo the SMP affinity value into the corresponding IRQ entry. Note that this needs to be done for each IRQ entry:
echo 1 > /proc/irq/181/smp_affinity
echo 2 > /proc/irq/182/smp_affinity
echo 4 > /proc/irq/183/smp_affinity

==ethtool==

From the VM host:

[root@fiona ~]# ethtool -k ens4np0
Features for ens4np0:
rx-checksumming: on
tx-checksumming: on
tx-checksum-ipv4: off [fixed]
tx-checksum-ip-generic: on
tx-checksum-ipv6: off [fixed]
tx-checksum-fcoe-crc: off [fixed]
tx-checksum-sctp: off [fixed]
scatter-gather: on
tx-scatter-gather: on
tx-scatter-gather-fraglist: off [fixed]
tcp-segmentation-offload: on
tx-tcp-segmentation: on
tx-tcp-ecn-segmentation: off [fixed]
tx-tcp-mangleid-segmentation: off
tx-tcp6-segmentation: on
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off [fixed]
rx-vlan-offload: on
tx-vlan-offload: on
ntuple-filters: off
receive-hashing: on
highdma: on [fixed]
rx-vlan-filter: on
vlan-challenged: off [fixed]
tx-lockless: off [fixed]
netns-local: off [fixed]
tx-gso-robust: off [fixed]
tx-fcoe-segmentation: off [fixed]
tx-gre-segmentation: on
tx-gre-csum-segmentation: on
tx-ipxip4-segmentation: on
tx-ipxip6-segmentation: on
tx-udp_tnl-segmentation: on
tx-udp_tnl-csum-segmentation: on
tx-gso-partial: on
tx-tunnel-remcsum-segmentation: off [fixed]
tx-sctp-segmentation: off [fixed]
tx-esp-segmentation: off [fixed]
tx-udp-segmentation: on
tx-gso-list: off [fixed]
rx-udp-gro-forwarding: off
rx-gro-list: off
tls-hw-rx-offload: off [fixed]
fcoe-mtu: off [fixed]
tx-nocache-copy: off
loopback: off [fixed]
rx-fcs: off
rx-all: off
tx-vlan-stag-hw-insert: on
rx-vlan-stag-hw-parse: off [fixed]
rx-vlan-stag-filter: on [fixed]
l2-fwd-offload: off [fixed]
hw-tc-offload: off
esp-hw-offload: off [fixed]
esp-tx-csum-hw-offload: off [fixed]
rx-udp_tunnel-port-offload: on
tls-hw-tx-offload: off [fixed]
rx-gro-hw: off [fixed]
tls-hw-record: off [fixed]
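
A long feature listing like the one above is easier to review when reduced to the entries that are off but not marked [fixed], since those are the only ones that could still be toggled. The sketch below is one possible filter; the function name `changeable_off` is made up for this example, and the short here-document is an excerpt of the listing used only as demo input. It assumes one "name: state" pair per line, as ethtool -k prints.

```shell
#!/bin/sh
# Read `ethtool -k` output on stdin and print the names of features that
# are off and NOT marked [fixed].
changeable_off() {
    awk '
        { sub(/^[ \t]+/, ""); sub(/[ \t]+$/, "") }   # trim indentation and trailing blanks
        /: off$/ { sub(/: off$/, ""); print }        # keep only plain "off" entries
    '
}

# Demo input: a few lines excerpted from the listing above.
changeable_off <<'EOF'
Features for ens4np0:
rx-checksumming: on
tx-checksum-ipv4: off [fixed]
ntuple-filters: off
rx-gro-list: off
EOF
```

On a live host the same function would be fed directly, as in `ethtool -k ens4np0 | changeable_off`. Whether any of the reported features should actually be enabled depends on the workload and is not settled by this listing.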