The fastest interconnect for hundreds of CPUs, GPUs and FPGAs (make a supercomputer)

About the fastest inter-server and inter-device communication with bandwidth more than 100 Gbit/sec and latency 0.5 usec

1.  Possible protocols and benefits
2.   Physical interfaces and benefits
3.   The fastest and the most available interface – PCIe
4.   CPU – CPU interconnect
5.   CPU – GPU – FPGA interconnect



1.   Possible protocols and benefits
We all know the stack of data transfer protocols via the network - TCP/IP, which includes the transport level protocols TCP and UDP, which use IP addresses to address computers. In programming languages, these protocols use the API: BSD Socket.
·       TCP-protocol requires a connection setup and ensures delivery of a data package, and if the package is not delivered, it tells us about it.

·       UDP-protocol does not require a connection setup and does not ensure delivery of a data package, but it can broadcast.

These protocols are used for data exchange in the Internet global network and in local networks.
Even if there were faster protocols, then their use in the Internet would be very difficult, because all routers in the Internet use the TCP / IP stack, the TCP and UDP transport protocols, and the IP network protocol. 
However, work with Internet connections can still be speeded up by means of using IRQ-coalescing setting, by processing the TCP/IP hardware-based stack on the Ethernet-TOE-adapter, by using kernel-bypass approach Onload, Netmap/PF_RING/DPDK or by processing TCP/IP on the GPU with remapping Netmap into GPU-UVA using Intel Ethernet-card. GPU-Cores copy the data direct from the PCIe-BAR, to which the physical Ethernet-adapter buffers are mapped, with a total additional round-trip latency of 1 - 4 usec, and the GPU uses thousands of physical cores to process network packets and tens of thousands of threads for hide-latency ["Implementing an architecture for efficient network traffic on modern graphics hardware", Lazaros Koromilas, 2012].
Recall that the TCP and UDP exchange from the TCP/IP stack by means of using Ethernet equipment is called IPoE (IP over Ethernet).
However, on the local network, we can dramatically accelerate data exchange by using faster protocols and equipment:

·       Protocols: IPoEth, IPoIB, IPoPCIe, Onload, SDP, VMA, RDS, uDAPL, VERBS, SISCI

·       Hardware (adapter and wiring): Ethernet (TOE, RoCE, iWarp), Intel Omni-Path, Infiniband, PCIe

The difference between these protocols:
·       IPoEth, IPoIB and IPoPCIe - it is the same standard TCP/IP stack with TCP and UDP protocols, but it uses equipment Ethernet (Eth), Infinband (IB) and PCI-Express (PCIe) for the exchange, respectively.

·       Onload - user-space implementation of TCP/IP-stack with highly optimized TCP, UDP and IP protocols using the standard API BSD Sockets, that reduces CPU utilisation and latency (app-to-app 1.7 usec), and can be used on the Solarflare Ethernet-adapter at one end of the network, even if all the rest are the conventional Etnernet equipment.

·       SDP and VMA - (Socket Direct Protocols) and (Voltaire Message Accelerator) are highly optimized TCP and UDP protocols with the standard API BSD Sockets, for the equipment that allows for direct RDMA-exchange (memory to memory) at the hardware level.

·       RDS (Reliable Datagram Sockets) is a highly optimized UDP-based protocol but with guaranteed delivery of datagrams, this is especially effective on lossless devices with hardware-based reliable delivery (Infiniband, PCIe, …), RDS was included in the Linux kernel 2.6.30.

·       uDAPL (userspace Direct Access Programming Library) - is a standard API for the fast data exchange on the network by means of using RDMA transport from user-space on network devices with VERBS / RDMA support.

·      VERBS is the fastest API (libibverbs.so) standardized by the OpenFabric alliance for using RDMA on the following devices: RoCE, iWARP, Infinband (Mellanox, QLogic, IBM, Chelsio).

·       SISCI (Software Infrastructure Shared-Memory Cluster Interconnect) is the fastest way to exchange data, requiring the API only for optimal allocation of memory, but not for accessing the data, because data exchange occurs by using ordinary assembler MOV-instructions, i.e. CPU-0 accesses by ordinary pointer to the memory of another CPU-1 located on another server via PCI Express (similarly as on a multiprocessor server via QPI / HyperTransport but without cache coherency). I.e. the physical memory of one server-0 is mapped into the physical address space of another server-1 and is re-mapped into the process user-space. You can use two ways: Programmed IO (PIO) - CPU cores by using pointer directly accesses to remote memory with the lowest latency, or (RDMA) - DMA-Controller of PCIe-NTB copies data from remote memory to local memory with the highest bandwidth.

RDMA - (Remote Direct Memory Accesshardware support to provide direct access to the RAM of another computer by using network equipment.
Typically, to ensure the maximum performance, application programmers, instead of VERBS and uDAPL, use a more standard API - MPI library (MPI-2 standard) that works over the uDAPL protocol, uses the uDAPL-API and has a larger built-in functionality: gather, scatter, scan, reduce, rma-operations, ...

2.   Physical interfaces and benefits

Network adapters:
·       Ethernet - the most widely used network cards for data exchange that allow to effectively build complex topologies and duplicate channels; the maximum length of the optical cable is 40 km (IEEE 802.3ba - 100GBase-ER4), currently the maximum speed is 100 Gbit/sec (12.5 GB/sec).

·       Ethernet TCP offload engine (TOE) is a similar Ethernet-adapter, but here, the TCP/IP stack processing is hardware-based and is executed in the adapter chip direct, instead of software-based processing in linux-kernel, this reduces packages delivery latencies and reduces the CPU load (e.g. Chelsio, Broadcom).

·       RoCE (RDMA over Converged Ethernet v2) is an Ethernet-adapter, in addition to TCP/IP, supporting hardware RDMA - transfer of data from the memory of one server to the memory of another server without unnecessary intermediate copies by using the UDP protocol as transport. It works via normal Ethernet-connections and Ethernet-switches / routers, but requires RoCE-adapters to be installed on all the computers using RDMA. It supports RDMA multicast (e.g. Mellanox ConnectX-3 Pro Ethernet).

·      Ethernet iWARP is also an Ethernet adapter, in addition to TCP/IP, supporting hardware RDMA, but using TCP and SCTP protocols as transport. It works via normal Ethernet connections and Ethernet-switches/routers and requires installation of iWARP-adapters to support RDMA (e.g. Chelsio T5/T6).

·       Infiniband - is used mainly for data exchange in the local network for High-performance-computing clusters and super computers from top500.org, at the hardware level, it guarantees delivery of packages even for multicasting (hardware-based reliable multicast) and allows you to build complex network topologies, the maximum length of the optical cable is 200 m (Mellanox MFS1200-E200), the maximum speed for the Infiniband EDR 12x is 300 Gbit/sec (37.5 GB/sec), latency 1.5 usec (e.g. Mellanox, Intel).

·       Intel Omni-Path is used for HPC clusters and supercomputers top-10 from top500.org and allows you to build complex network topologies, as well as combines the advantages of Ethernet/iWARP/Infiniband, the maximum length of the optical cable is 30 meters; working at a speed of 100 Gbit/sec, it has latency of 56% lower than a similar implementation of InfiniBand, latency 1 usec.

·       PCI Express (PCIe) is a standard interface for connecting to the CPU devices: GPU, FGPA, Ethernet, Infiniband, Intel Omni-Path - the chips of each of these devices have a built-in PCIe-adapter. PCIe can also be used for connection of one CPU to another CPU via the NTB bridge or for combining multiple CPU servers into a single cluster by using multiple PCIe-Switches from PLX or IDT. The maximum length of the optical cable is 30 meters, the maximum actual throughput of PCIe 3.0 16x is 100 Gbit/sec (12.5 GB/s) with a latency of 0.5 usec. In all Intel Server (Xeon) and High-end desktop (Core) processors Intel there is 40-lanes PCI Express 3.0 with a total throughput of 256 Gbit/sec (32 GB/s) and a latency of 0.5 sec (e.g. IDT PCIe-Switch, Broadcom / PLX PCIe-Switch).

·       NVLink is NVidia's built-in GPU interface supporting up to 8 GPUs in a single network, and these 8 GPUs are additionally connected to the Intel CPU by PCIe. The nVidia GPU Volta features 4-Ports NVLink 2.0 with a total performance of 1600 Gbit/sec (200 GB/s). I.e. NVLink 2.0 4-Ports are 6.5 times faster than PCIe 3.0 40-Lanes, but NVLink allows you to connect only 8 GPUs.

As PCI Express is located on the CPU chip direct and any network adapters (Ethernet, Infiniband, Omni Path, ...) are connected to the CPU via the PCIe interface, the CPUs cannot exchange the data via any of the above adapters faster than via PCI Express. 
The exception is not the above-described QPI interface, which is also located on Xeon-Intel server CPU chips, provides something like a cache-coherent RDMA, where CPU-0 can access to RAM of CPU-1. But it allows to combine no more than 8 CPUs into a single server - i.e. the system does not look like a cluster of multiple servers, but as a single multi-core server (ccNUMA). Typically, server prices with QPI connections grow exponentially, depending on the number of CPUs.
The PCI Express has the maximum throughput and the minimum latencies. It is integrated into most devices and can combine hundreds of CPUs, GPUs, FPGAs, Ethernet, Infiniband and other devices supporting PCI Express into the same network by using PCIe-Switches. PCIe is the fastest solution for building HPC clusters from hundreds of different devices.

All network devices, except for the standard Ethernet, allow using any of the protocols listed above, for example: 
   Computer1 -> uDAPL -> iWARP -> Ethernet -> iWARP -> uDAPL -> Computer2
You can read about each of the protocols and network adapters in Wikipedia, and you can find them by their names. There are other less common protocols and interfaces not described here.
To run the program using the Onload, SDP and VMA protocols instead of TCP and UDP, it is not necessary to change the source code of the program. The program (binary executable program) compiled for work with TCP and UDP can be made to work by using Onload, SDP or VMA protocols. To do this, run the program as follows: 

  LD_PRELOAD = libonload.so program.bin
  LD_PRELOAD = libsdp.so program.bin or LD_PRELOAD = libvma.so program.bin
OpenOnload retains the TCP/IP protocol so can be used at single end, even if on other computers on the network, the usual Ethernet-adapters are installed and the usual Linux-kernel stack TCP/IP is used.
However, for SDP or VMA, it should be taken into account that all the parties involved in data exchange (computers) should work via the same protocols - at both ends only SDP or only VMA. I.e. the program trying to establish a connection via TCP will not be able to establish a connection with the program that accepts connections via SDP.

General recommendations:

(These recommendations can be used to process client connections from the Internet, without changing the source code of your program, and without changing the hardware – but it is recommended to use Solarflare Ethernet 40 Gbit to use OpenOnload – TCP offload engine)

1.   Install 2 utilities by using: aptitude install / sudo apt-get install

Then check that these utilities are installed:
·       numactl --hardware
·       ethtool -S eth0
·       ethtool -k eth0

Also, you can get other useful information:
·       ifconfig
·       lspci -t -n -D -v
·       cat /proc/interrupts
·       cat /proc/net/softnet_stat
·       perf top

2.   If you have a server with 2 or more CPUs on the same motherboard (i.e. NUMA), then connect the network adapters to the PCIe of different CPUs evenly - i.e. each CPU should has the same number of network cards.

3.   Run a individual instance of your application for each CPU (NUMA-node) as follows, so that each instance will be rigidly bound to a specific NUMA-node, example for 2 x Xeon CPU Server:

·      service irqbalance stop (or set ENABLED=0 in the file /etc/default/irqbalance)
·      numactl --localalloc --cpunodebind=0 ./program.bin
·      numactl --localalloc --cpunodebind=1 ./program.bin

These instances can communicate with each other via sockets by a local connection.
However, If you necessarily want to use only one instance of your program, then run your program as follows: numactl --interleave=all ./program.bin

4.   Use RSS (Receive Side Scaling) – hardware based – usually enabled by default

·     For Intel use:   echo "options igb RSS=0,0" >>/etc/modprobe.d/igb.conf
·     For Solarflare: echo "options sfc rss_cpus=cores” >>/etc/modprobe.d/scf.conf

Then reload driver:
·     rmmod sfc
·     modprobe sfc

5.   Choose the optimal number of queues of network card up to 4 – 32, but not more than the number of Cores per one CPU, example for 2 x Ethernet cards: 

·      ethtool -L eth0 combined 4
·      ethtool -L eth1 combined 4
(The most efficient high-rate configuration is likely the one with the smallest number of receive queues where no receive queue overflows due to a saturated CPU)

6.   Bind interrupts of each network card to the CPU (NUMA-node) to which it is connected:

·     service irqbalance stop (or set ENABLED=0 in the file /etc/default/irqbalance)
·     echo 1  >/proc/irq/[eth0-irq-num]/node
·     echo 2  >/proc/irq/[eth1-irq-num]/node
·     echo 1  >/sys/class/net/eth0/device/numa_node
·     echo 2  >/sys/class/net/eth1/device/numa_node

7.   Distribute interrupts of queues of network card to the different Cores of CPU to which this network card is attached, by using bit-mask in HEX (1,2,4,8,10,20,40,80,100,200,…):

·     echo 1 >/proc/irq/[eth0-queue0-irq-num]/smp_affinity
·     echo 2 >/proc/irq/[eth0-queue1-irq-num]/smp_affinity
·     echo 4 >/proc/irq/[eth0-queue2-irq-num]/smp_affinity
·     echo 8 >/proc/irq/[eth0-queue3-irq-num]/smp_affinity

·     echo 1 >/proc/irq/[eth1-queue0-irq-num]/smp_affinity
·     echo 2 >/proc/irq/[eth1-queue1-irq-num]/smp_affinity
·     echo 4 >/proc/irq/[eth1-queue2-irq-num]/smp_affinity
·     echo 8 >/proc/irq/[eth1-queue3-irq-num]/smp_affinity

Or by using – set_irq_affinity.sh: https://gist.github.com/SaveTheRbtz/8875474

8.   Enable RPS (Receive Packet Steering) – software based – it can use any number of CPU-Cores in contrast to the described above RSS, which can usually only use 8 Cores. Also RPS allows us to enable following RFS.

Note: RPS requires a kernel compiled with the CONFIG_RPS kconfig symbol (on by default for SMP).

·     echo 1 > /sys/class/net/eth0/queues/rx-0/rps_cpus
·     echo 2 > /sys/class/net/eth0/queues/rx-1/rps_cpus
·     echo 4 > /sys/class/net/eth0/queues/rx-2/rps_cpus
·     echo 8 > /sys/class/net/eth0/queues/rx-3/rps_cpus

·     echo 1 > /sys/class/net/eth1/queues/rx-0/rps_cpus
·     echo 2 > /sys/class/net/eth1/queues/rx-1/rps_cpus
·     echo 4 > /sys/class/net/eth1/queues/rx-2/rps_cpus
·     echo 8 > /sys/class/net/eth1/queues/rx-3/rps_cpus


9.   Enable RFS (Receive Flow Steering) – software based – then TCP/IP-stack in kernel-space will be executed on the same CPU-Core as thread of your Application in user-space that processes the same packet - good cache locality.

·       ethtool -K eth0 ntuple on
·       ethtool -K eth1 ntuple on
·       sudo sysctl -w net.core.rps_sock_flow_entries=32768
·       echo 32768 > /proc/sys/net/core/rps_sock_flow_entries
·       echo 4096 > /sys/class/net/eth0/queues/rx-0/rps_flow_cnt
·       echo 4096 > /sys/class/net/eth0/queues/rx-1/rps_flow_cnt
·       echo 4096 > /sys/class/net/eth0/queues/rx-2/rps_flow_cnt
·       echo 4096 > /sys/class/net/eth0/queues/rx-3/rps_flow_cnt

·       echo 4096 > /sys/class/net/eth1/queues/rx-0/rps_flow_cnt
·       echo 4096 > /sys/class/net/eth1/queues/rx-1/rps_flow_cnt
·       echo 4096 > /sys/class/net/eth1/queues/rx-2/rps_flow_cnt
·       echo 4096 > /sys/class/net/eth1/queues/rx-3/rps_flow_cnt

(Set the value of this file to the value of rps_sock_flow_entries divided by N, where N is the number of receive queues on a device)

10.           Enable XPS (Transmit Packet Steering) – software based – the same as RPS, but XPS used for transmitting packets instead of receiving. Can use any number of CPU-Cores to process packets on the TCP/IP-stack in kernel-space.

Note: XPS requires a kernel compiled with the CONFIG_XPS kconfig symbol (on by default for SMP).

·     echo 1 > /sys/class/net/eth0/queues/tx-0/xps_cpus
·     echo 2 > /sys/class/net/eth0/queues/tx-1/xps_cpus
·     echo 4 > /sys/class/net/eth0/queues/tx-2/xps_cpus
·     echo 8 > /sys/class/net/eth0/queues/tx-3/xps_cpus

·     echo 1 > /sys/class/net/eth1/queues/tx-0/xps_cpus
·     echo 2 > /sys/class/net/eth1/queues/tx-1/xps_cpus
·     echo 4 > /sys/class/net/eth1/queues/tx-2/xps_cpus
·     echo 8 > /sys/class/net/eth1/queues/tx-3/xps_cpus

11.           If interrupts use too much CPU time and if more bandwidth (GB/sec) is required, but the minimal latency (usec) is not critical, then use Interrupt coalescing – increase up to 100 or 1000.
(However, if you want to achieve low latency, then set these values to 0)
·     ethtool -c eth0
·     ethtool -c eth1
·     ethtool -C eth0 rx-usecs 100
·     ethtool -C eth1 rx-usecs 100
·     ethtool -C eth0 tx-usecs 100
·     ethtool -C eth1 tx-usecs 100
·     ethtool -C eth0 rx-frames 100
·     ethtool -C eth1 rx-frames 100

12.           If a lot of packets are lost – this you can see using the command:
·       ethtool -S eth0

Then see the size of the current and maximum rx/tx buffers of your network card:
·       ethtool -g eth0

Then try to increase current value to maximum by using:
·       ethtool -G eth0 rx 4096 tx 4096


13. Disable SSR (slow-start restart) for long-lived TCP connections:

·       sudo sysctl -w net.ipv4.tcp_slow_start_after_idle=0

14. Change CPU governor to ‘performance’ for all CPUs/cores, run as root:

for CPUFREQ in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do [ -f $CPUFREQ ] || continue; echo -n performance > $CPUFREQ; done

service cpuspeed stop



15. Use Accelerated RFS/SARFS – hardware based RFS – to bind network-packets to network-stack on that CPU-Core on which this packet will be processed – this is cache-friendly:

·       Enable A-RFS if you are using Mellanox with mlx4-driver (or Intel Ethernet adapter if it supports A-RFS). Accelerated RFS is only available if the kernel is complied with the kconfig symbol CONFIG_RFS_ACCEL enabled.

·       Enable SARFS (Solarflare Accelerated Receive Flow) if you are using Solarflare Ethernet adapters – supported since Solarflare Linux network driver v4.1.0.6734 Create file: /etc/modprobe.d/sfc.conf
Then add this content to this file:
sxps_enabled
sarfs_table_size
sarfs_global_holdoff_ms
sarfs_sample_rate

Also, see “A Guide to Optimising Memcache Performance on SolarFlare”: http://10gigabitethernet.typepad.com/files/sf-110694-tc.pdf

16. Use processing TCP/IP-stack in the user-space (kernel bypass) if possible – this will: increase throughput, reduce latency (-10 usec), and reduce CPU usage.

If used Solarflare Ethernet card then use OpenOnload – run your program as follows without recompile: 
·      LD_PRELOAD=libonload.so numactl --localalloc --cpunodebind=0 ./program.bin
·      LD_PRELOAD=libonload.so numactl --localalloc --cpunodebind=1 ./program.bin

17.           Try to use fast memory allocator tcmalloc or jemalloc: https://github.com/jemalloc/jemalloc/wiki/Getting-Started

·     LD_PRELOAD=libonload.so:jemalloc.so.1 numactl --localalloc --cpunodebind=0 ./program.bin
·     LD_PRELOAD=libonload.so:jemalloc.so.1 numactl --localalloc --cpunodebind=1 ./program.bin

18.           Find the best combination for performance of your system by switching on/off two parameters in the BIOS: 1. C-sleep state (C3 or greater), 2. Hyper Threading.

19.           Increase socket limits:

·       cat /proc/sys/fs/file-max
·       echo "100000" > /proc/sys/fs/file-max
(or equivalently: sysctl -w fs.file-max=100000)

20.           Try to check the effect of offload engine parameters on performance – this offloads the execution part of the algorithms to the NIC:

·       Look at the parameters: ethtool -k eth0
·       And try to turn them on or off: ethtool -K eth0 rx on tx on tso on gso on lso on lro on

21.           Try to check the effect of these parameters on performance: https://stackoverflow.com/a/3923785/1558037

·       sudo sysctl net.core.somaxconn=16384
·       ifconfig eth0 txqueuelen 5000
·       sudo sysctl net.core.netdev_max_backlog=2000
·       sudo sysctl net.ipv4.tcp_max_syn_backlog=4096



Also read:
·     Scaling in the Linux Networking Stack:  https://www.kernel.org/doc/Documentation/networking/scaling.txt

·     Monitoring and Tuning the Linux Networking Stack: Receiving Data: https://blog.packagecloud.io/eng/2016/06/22/monitoring-tuning-linux-networking-stack-receiving-data/

·     Monitoring and Tuning the Linux Networking Stack: Sending Data: https://blog.packagecloud.io/eng/2017/02/06/monitoring-tuning-linux-networking-stack-sending-data/

·     How to achieve low latency with 10Gbps Ethernet: https://blog.cloudflare.com/how-to-receive-a-million-packets/




Recommendations for Application:

1.   Use thread-pool: i.e. don’t create and destroy threads for each connection, but create so much threads as the number of CPU-Cores, evenly distribute connections over threads and bind threads to CPU-Cores by using: sched_setaffinity(), pthread_setaffinity_np()

2.   Use memory-pool: i.e. repeatedly re-use pre-allocated memory areas to avoid the overhead of extra memory allocation.

3.   Use platform-specific optimal demultiplexing approaches: epoll (Linux), kqueue (Open/Net/FreeBSD, MacOS, iOS), IOCP (Windows), IOCP/POLL_SET (AIX), /dev/poll (Solaris)

4.   If your application is working with a large number of short connections, then in each thread use individual own epoll_wait() processing cycle and individual own acceptor-socket with the same <ip:port> and enabled SO_REUSEADDR and SO_REUSEPORT (Linux kernel 3.9 or later), as it is used in Ngnix: https://www.nginx.com/blog/socket-sharding-nginx-release-1-9-1/

epoll_event event, ret_events[N];
event.data.fd = socket(AF_INET, SOCK_STREAM, 0);
bind(event.data.fd, (struct sockaddr *)&serv_addr, sizeof(serv_addr));
event.events = EPOLLIN | EPOLLET;
int enable = 1; setsockopt(event.data.fd, SOL_SOCKET, SO_REUSEADDR, &enable, sizeof(int));
enable = 1; setsockopt(event.data.fd, SOL_SOCKET, SO_REUSEPORT, &enable, sizeof(int));
fcntl(event.data.fd, F_SETFL, fcntl(sfd, F_GETFL, 0) | O_NONBLOCK); 
listen(event.data.fd, SOMAXCONN);
int efd = epoll_create1(EPOLL_CLOEXEC); 
epoll_ctl(efd, EPOLL_CTL_ADD, event.data.fd, &event);
while(1) { int events_num = epoll_wait (efd, ret_events, N, -1); for(int i=0; i<events_num; ++i){…} }
// call stack with URLs: epoll_wait() ->kernel-> SYSCALL_DEFINE4() -> ep_poll()

// or polling for UDP: while(1) { int r = recvmsg(socket_desc, msg, MSG_DONTWAIT); if (r > 0) {…} }

5.   Use polling: for low-latency and a small number of long connections, then in each thread use individual own poll() processing cycle and individual own acceptor-socket with the same <ip:port> and enabled SO_REUSEADDR and SO_REUSEPORT.

pollfd pfd[N]; while(1) { int events_num = poll(pfd, N, -1); if(events_num > 0) { read(…); }  } // https://linux.die.net/man/2/poll
// call stack with URLs: poll() ->kernel-> do_sys_poll() -> do_poll()
// call stack with URLs: select() ->kernel-> core_sys_select() -> do_select()

6.   If your application is working with a large number of long connections, then accept connections using the acceptor-socket in only one thread, and then transfer batches of the received connection sockets to other threads using a thread-safe 1P-MC queues (wait-free or write-contention-free). In each thread use its own epoll_wait() processing cycle.

7.   Use work stealing: for high throughput and long connections, if the CPU-Cores usage becomes uneven over time. I.e. if there is too much work for the one of thread, the thread sends extra work to the thread-safe queue. Use individual 1P-MC queues for each thread – so busy-thread pushes extra works to its own queue with relaxed-ordering using only Write-operations, but free-threads which have too little work concurrently get works using RMW-operations with acq-rel-ordering from any of these queues). Work – is a connection socket descriptor or something else.

8.   Use batching for sending data: i.e. if required high throughput, then prepare large block of data in your buffer and then send this entire block to the socket

9.   Use batching for receiving and sending data over UDP: i.e. use recvmmsg() and sendmmsg() instead of recv() and send()

10.           To filter traffic use libs: PF_RING or (Netmap or DPDK on Intel-Eth only) – this will give the highest possible performance. Comparison: https://www.net.in.tum.de/fileadmin/bibtex/publications/theses/2014-gallenmueller-high-speed-packet-processing.pdf

Alternatively, use RAW-sockets with: batching, zero-copy and fan-out.
int packet_socket = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL)); // raw-sockets

int version = TPACKET_V3;  // read/poll per-block instead of per-packet
setsockopt(packet_socket, SOL_PACKET, PACKET_VERSION, &version, sizeof(version));

// map socket buffer from kernel to user-space for zero-copy
struct tpacket_req3 req;
setsockopt(packet_socket, SOL_PACKET , PACKET_RX_RING , (void*)&req , sizeof(req));
mapped_buffer = (uint8_t*)mmap(NULL, req.tp_block_size * req.tp_block_nr, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_LOCKED, packet_socket, 0);

// to avoid locks when used one socket from many threads
int fanout_arg = (fanout_group_id | (PACKET_FANOUT_CPU << 16));
setsockopt(packet_socket, SOL_PACKET, PACKET_FANOUT, &fanout_arg, sizeof(fanout_arg));

11.           Accept connections to different listening IP addresses on different NUMA-nodes – i.e. for each process use only that Ethernet-card which connected to this NUMA-node on which executed this process





Recommendations for LAN:

1.   Enable Jumbo-frames – i.e. increase MTU:

·       ifconfig eth0 mtu 32000
·       ifconfig eth1 mtu 32000
·       ip route flush cache
·       ping -M do -c 4 -s 31972 192.168.0.1  (try to ping each server in LAN)

2.   If you are using Ethernet Mellanox adapter, then use VMA

·       LD_PRELOAD=libvma.so numactl --localalloc --cpunodebind=0 ./program.bin
·       LD_PRELOAD=libvma.so numactl --localalloc --cpunodebind=1 ./program.bin

(If you are using iWarp offload adapter or RoCE adapter, then try to use SDP)

3.   Try to use Infininband (switches, wires and adapters): Mellanox or Intel
(Use preferably VMA for Mellanox, or SDP for Intel, otherwise use IPoIB)

For SDP (Sockets Direct Protocol):
·       modprobe ib_sdp
·       LD_PRELOAD=libsdp.so numactl --localalloc --cpunodebind=0 ./program.bin
·       LD_PRELOAD=libsdp.so numactl --localalloc --cpunodebind=1 ./program.bin

4.   Try to connect servers direct via PCI Express by using IDT or PLX Switches with NTB, virtual NIC and TWC, for example: PEX9797

I.e. use PEX9700 series or more modern PCIe-switches which support virtual Ethernet NICs, NIC DMA and Tunneled Window Connection (TWC): https://docs.broadcom.com/docs/AV00-0327EN

5.   If you are using network adapters that support RDMA (iWarp, RoCE, Infiniband, PCIe NTB-Switches) then try to use API uDAPL, MPI (MPI-2 over uDAPL) or SISCI in your program instead of BSD Sockets







Screenshots and Quotes:
Next, we give a lot of screenshots from various articles and links to them - the screenshots show page numbers on the basis of which they are easy to find in the articles. We’ll show which problems are solved by using PCIe interface, DAPL protocol and MPI software interface.



We’ll show how the throughput and latency of various network protocols are related in general, and which of them use the standard API - BSD Socket:







Openonload retains the TCP/IP protocol so can be used single ended.

This can be used in any of three modes each with decreasing latency but with increasing API complexity:

  • The simplest is to take an existing TCP/IP socket based program, it can even be in binary format and preload onload before it is started. 
  • The next is to use TCPdirect, which uses a simplified socket API referred to as zsockets. This requires refactoring of existing Linux socket 'C' code. 
  • The best latency however, for Onload, is to re-write your application to use their EF_VI API. 







TOE (TCP/IP Offload Engine) Ethernet Adapter: 
Solarflare Flareon SFN7002F Dual-Port 10GbE PCIe 3.0 Server I/O Adapter - Part ID: SFN7002F
  • Hardware Offloads: LSO, LRO, GSO; IPv4/IPv6; TCP, UDP checksums






Comparison on Ngnix:

  • SFN7002F - OpenOnload = TOE(hardware TCP/IP) + Onload (kernel-bypass)
  • SFN7002F - Kernel = TOE(hardware TCP/IP)
  • X710 - Kernel = Conventional Ethernet Adapter + Linux Kernel








Here are the results of the test in 2010 to compare different protocols: TCP, SDP, VERBS.
On a single ConnectX network device, in one thread, different protocols are run in two modes:

      ·       Ethernet 10 GigE – TCP with on/off AIC-Rx
      ·       Infiniband – TCP (IPoIB), SDP, VERBS
    





We’ll show what protocols and which adapters can be used, as well as what connections can be used for exchange, for example: 
    Ethernet-Switch <-> RoCE-adapter <-> Verbs-API.



Which APIs and protocols can be used for an Ethernet connection:



Feature comparison: Infiniband, iWARP, RoE, RoCE



Comparison of the throughput of two Ethernet types with RDMA support: iWARP and RoCE



There are many implementations of iWARP:

  •         Software-iWARP using standard Ethernet-adapter and Ethernet TCP Offload
  •         Hardware-iWARP using adapter with Offload-iWARP hardware support




Infiniband adapter supports hardware multicasting with guaranteed delivery.



We’ll show 4 screenshots of the overhead costs of the standard operation of the TCP/IP stack in the Linux kernel - these overheads are avoided in more optimized protocols: SDP / VMA, uDAPL.








VMA (Voltaire Messaging Accelerator) - is libvma.so library, which can be used to speed up data exchange not only for the UDP protocol, but also for TCP. 
Finally, to maximize the benefits of low latency networking hardware for the end user application, the Mellanox Messaging Accelerator (VMA) Linux library has been enhanced in 2012. VMA has long been capable of performing socket acceleration via OS/kernel bypass for UDP unicast and multicast messages (typically associated with market data feeds from Exchanges), without rewriting the applications. In version 6 this dynamically-linked user-space library can now also accelerate any application using TCP messages (typically associated with orders/acknowledgements to and from the Exchanges), passing TCP traffic directly from the user-space application to the network adapter. Bypassing the kernel and IP stack interrupts delivers extremely low-latency because context switches and buffer copies are minimized. The result is high packet-per-second rates, low data distribution latency, low CPU utilization and increased application scalability.

SDP (Socket Direct Protocol) - is a protocol and the libsdp.so loadable library, which can be used to speed up data exchange via the TCP protocol. 
Latency tests on InfiniBand used the VMA (Voltaire Messaging Accelerator) library for UDP traffic and the SDP library was used for TCP traffic.


The way of SDP and VMA protocols’ reducing the overhead for data exchange by working from user-space direct with the Infiniband device bypassing the kernel-space:






Socket Direct Protocol (SDP) over PCI Express interconnect: Design, implementation and evaluation.

Simplified scheme of operation of SDP-protocol (used instead of TCP):


What changes are needed to pass from TCP to SDP.
It’s not necessary to change the code. I remind you that a binary executable program compiled for TCP and UDP can be made to work by using SDP or VMA protocols. For this purpose, it is necessary to run the program as follows: LD_PRELOAD=libsdp.so program.bin or LD_PRELOAD=libvma.so program.bin


MPI (Message Passing Interface) - is an API that can use the uDAPL protocol and physical interfaces (RoCE, Infiniband, Intel Omni-Path, PCIe - that support RDMA and uDAPL protocol) for direct reading from the memory of another server without requiring it to send this data and even without a mandatory call to any hardware interruptions. 
This interaction is more and more similar to one multiprocessor server with shared NUMA (Non-uniform memory access) memory, but without cache coherency.
    



     3.   The fastest and most available interface – PCIe

The number of redundant copies inside the devices and between them when data is exchanged between two CPUs (servers) is shown schematically. This is a short summary of the above.
  • bypassing the kernel-space
  • bypassing the network-adapter


Latencies from 0.5 microseconds when exchanging the data via PCI Express 2.0 16x with a speed up to 7 GB/s (56 Gbit/sec)
(bandwidth of modern PCIe 3.0 with 16-lanes up to 14 GB/s (112 Gbit/sec))
"Shared-Memory Cluster Interconnect (SISCI) API makes developing PCI Express® Network applications faster and easier."
http://www.dolphinics.com/products/embedded-sisci-developers-kit.html


Starting with version 2.0, PCI Express supports compound atomic operations: FetchAdd, CAS, Swap





4.    CPU – CPU interconnect

An example of 2 CPUs via QPI connection to a single server (remember, you can connect a maximum of 8 CPUs, and the price of the system grows exponentially)




Connection of 2 CPU Intel Xeon via PCI Express interface instead of QPI.
(The used PCIe BARs are shown).



1. PCIe “Non-Transparent” Bridge
  • Forwards PCIe traffic between busses like a bridge
  • CPU sees the bridge as an end-point device
  • CPU does not see devices on the other side
  • Other side is typically attached to another CPU

2. NTB Features
  • Low Cost: Already present on CPU or PCIe bridge chips
  • High Performance – PCIe wire speed: NTB connects PCIe buses
  • Internally Wired – Not accessible to customer, low maintenance
  • External setup also supported (redriver and cable)



The memory of 4 CPUs connected via PCIe is mapped to their physical address spaces.



Bandwidth of PCIe 3.0 16-lanes:
  • for raw-data is 16 GB/s (128 Gbit/sec) by specification
  • for actual-data 14 GB/s (112 Gbit/sec) by benchmark

Total bandwidth for 3 PCIe-connections 40-lanes=(16+16+8) on one CPU is 45 GB/s (280 Gbit/sec).





5.     CPU – GPU – FPGA interconnect

The computing cores (ALUs) of the GPU0 can read the memory locations from the GPU1 direct, without accessing the CPU, by using GPU-Direct 2.0.

  • GPU Direct 1


  • GPU Direct 2

  • GPU Direct 3




It is possible to connect tens and hundreds of GPUs into a single cascaded network by using a lot of PCI-Switches from PLX or IDT. (In this network there can also be hundreds of CPUs (Intel, PowerPC ...), FPGA, Ethernet, Infiniband ...).
(Page 16)


Functions for initializing direct access from the GPU to the CPU memory, into which the memory of any devices connected to the CPU can be mapped: FPGA, Ethernet, ...
(Page 14)


Connection scheme of several CPUs and GPUs in one HPC computer nVidia DGX-1:
     ·       Black - PCI Express connections; any CPU and GPU devices are connected

     ·       Green - NVLink connections; 8 x nVidia GPU P100 (or new CPU Power PC) only


PCI Express - direct exchange among CPU, GPU and FPGA is carried out by direct data transfer without occupying the interface of neighboring modules and without interfering with them.



Through what PCI Express BARs the memory windows for the exchange are configured:
     ·       P2P (GPU Direct 2.0) – for GPU-GPU exchange
     ·       RDMA (GPU Direct 3.0) – for exchange between GPU and any other device: FPGA, Ethernet



The GPU directly uses the Ethernet-adapter via PCIe for efficient network traffic processing on modern graphics hardware.
A prerequisite for the GPU to read the data direct from the Ethernet-adapter, the physical memory of which is mapped to the physical addressing of the CPU, - is that this memory should be mapped to virtual CPU addressing from there by using vm_insert_page instead of remap_pfn_range.
Processing is carried out like in case of Netmap: https://github.com/luigirizzo/netmap
(Page 27)


As various CPU и GPU calculators can be combined into a single network, then we will briefly show the types of multithreading that are found on these calculators:
     ·       CPU Intel Core/Xeon uses: SMT (for Hyper-Threading) and CMP (for Cores)
     ·       GPU uses: Fine-grained (for Threads/Cores) and CMP (for Blocks/SM)




Comments

Popular posts from this blog

C++: Thread-safe std::map with the speed of lock-free map

C++: We make any object thread-safe