Additional latency approach

A quick overview of what ocp-network-split does to introduce latency between nodes in different cluster zones.

Network latency script

Latency between nodes from different zones is introduced by setting up a netem qdisc on the egress traffic queue of each cluster node, so that packets targeted at nodes in other zones flow through a netem qdisc which adds the given delay. For latency to affect incoming packets as well, the netem delay has to be set up on all nodes of the cluster(s), and the total RTT (round-trip time) added this way equals two times the specified delay.
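
As a minimal illustration of the underlying netem mechanism (not what the script itself does, since the script only delays traffic towards other zones; the interface name eth0 is just a placeholder):

# delay every packet leaving eth0 by 15 ms (one direction only);
# if a peer node applies the same delay, ping between the two shows ~30 ms extra RTT
tc qdisc add dev eth0 root netem delay 15ms
# remove the delay again by deleting the root qdisc
tc qdisc del dev eth0 root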

This setup is implemented in the network-latency.sh script, which takes the requested delay in milliseconds (half of the RTT) as a command line argument. Zone configuration and detection are handled in the same way as in the network split script (in fact, both scripts share this part).

One can see what changes would be introduced via the -d option, which makes the script report what it would do instead of performing the setup:

$ export ZONE_A="198.51.100.199"
$ export ZONE_B="198.51.100.109 198.51.100.96 198.51.100.97 198.51.100.99"
$ export ZONE_C="198.51.100.103 198.51.100.84 198.51.100.87 198.51.100.98"
$ ./network-latency.sh -d 15
ZONE_A="198.51.100.199"
ZONE_B="198.51.100.109 198.51.100.96 198.51.100.97 198.51.100.99"
ZONE_C="198.51.100.103 198.51.100.84 198.51.100.87 198.51.100.98"
current zone: ZONE_B
network interface: ens192
tc qdisc del dev ens192 root
tc qdisc add dev ens192 root handle 1: prio bands 4
tc qdisc add dev ens192 parent 1:4 handle 40: netem delay 15ms
tc filter add dev ens192 parent 1: protocol ip prio 1 u32 match ip dst 198.51.100.199/32 flowid 1:4
tc filter add dev ens192 parent 1: protocol ip prio 1 u32 match ip dst 198.51.100.103/32 flowid 1:4
tc filter add dev ens192 parent 1: protocol ip prio 1 u32 match ip dst 198.51.100.84/32 flowid 1:4
tc filter add dev ens192 parent 1: protocol ip prio 1 u32 match ip dst 198.51.100.87/32 flowid 1:4
tc filter add dev ens192 parent 1: protocol ip prio 1 u32 match ip dst 198.51.100.98/32 flowid 1:4
tc qdisc show dev ens192
tc class show dev ens192

It’s also possible to specify latencies between particular zones, e.g. the command network-latency.sh -l ab=25 -l ac=35 5 will set up 25 ms (50 ms RTT) latency between zones a and b, 35 ms (70 ms RTT) between zones a and c, and 5 ms (10 ms RTT) between the rest of the zones (which in this particular case means between b and c).

$ ./network-latency.sh -d -l ab=25 -l ac=35 5
ZONE_A="198.51.100.199"
ZONE_B="198.51.100.109 198.51.100.96 198.51.100.97 198.51.100.99"
ZONE_C="198.51.100.103 198.51.100.84 198.51.100.87 198.51.100.98"
current zone: ZONE_B
network interface: ens192
tc qdisc del dev ens192 root
tc qdisc add dev ens192 root handle 1: prio bands 6
tc qdisc add dev ens192 parent 1:4 handle 40: netem delay 5ms
tc qdisc add dev ens192 parent 1:6 handle 60: netem delay 35ms
tc qdisc add dev ens192 parent 1:5 handle 50: netem delay 25ms
tc filter add dev ens192 parent 1: protocol ip prio 1 u32 match ip dst 198.51.100.199/32 flowid 1:5
tc filter add dev ens192 parent 1: protocol ip prio 1 u32 match ip dst 198.51.100.103/32 flowid 1:4
tc filter add dev ens192 parent 1: protocol ip prio 1 u32 match ip dst 198.51.100.84/32 flowid 1:4
tc filter add dev ens192 parent 1: protocol ip prio 1 u32 match ip dst 198.51.100.87/32 flowid 1:4
tc filter add dev ens192 parent 1: protocol ip prio 1 u32 match ip dst 198.51.100.98/32 flowid 1:4
tc qdisc show dev ens192
tc class show dev ens192

As you can see, the script removes the existing root qdisc and creates new traffic queues, filtering packets destined for particular zones into qdiscs with netem-introduced latency. This is obviously not optimal from a production perspective, but it’s a reasonable trade-off for testing purposes.

The script can remove the extra latency via its teardown command: network-latency.sh teardown. Note, however, that it does so by removing the root qdisc and relying on the fact that the default qdisc will be recreated. The script doesn’t provide a way to revert to the traffic queue configuration that was in place before the latency was set (as noted above, that original configuration gets deleted).
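
As a rough sketch of what the teardown boils down to (interface name ens192 taken from the examples above), deleting the root qdisc drops the whole prio/netem hierarchy together with its filters, and the kernel falls back to its default qdisc:

# drop the prio/netem queues and the filters attached to them
tc qdisc del dev ens192 root
# the kernel recreates its default qdisc automatically; verify with:
tc qdisc show dev ens192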


Systemd Unit

The latency script described above is not used directly, but via the systemd network-latency.service unit. Starting the service configures the latency, while stopping the service removes the latency setup (via the teardown command described above). This means that checking the status of this service on a given node reveals whether the additional latency is currently in effect. When deployed via MachineConfig or Ansible Playbook as explained below, the latency service is started during boot.

[root@example-0 ~]# systemctl status network-latency
● network-latency.service - Linux Traffic Control enforced network latency setup
   Loaded: loaded (/etc/systemd/system/network-latency.service; enabled; vendor preset: disabled)
   Active: active (exited) since Fri 2023-02-03 15:31:54 UTC; 17s ago
  Process: 20864 ExecStop=/usr/bin/bash -c /etc/network-latency.sh teardown (code=exited, status=0/SUCCESS)
  Process: 20882 ExecStart=/usr/bin/bash -c /etc/network-latency.sh -l ab=11 -l ac=7 5 (code=exited, status=0/SUCCESS)
 Main PID: 20882 (code=exited, status=0/SUCCESS)

Feb 03 15:31:54 osd-0 bash[20917]: qdisc netem 60: parent 1:6 limit 1000 delay 11ms
Feb 03 15:31:54 osd-0 bash[20917]: qdisc netem 40: parent 1:4 limit 1000 delay 5ms
Feb 03 15:31:54 osd-0 bash[20917]: qdisc netem 50: parent 1:5 limit 1000 delay 7ms
Feb 03 15:31:54 osd-0 bash[20918]: class prio 1:1 parent 1:
Feb 03 15:31:54 osd-0 bash[20918]: class prio 1:2 parent 1:
Feb 03 15:31:54 osd-0 bash[20918]: class prio 1:3 parent 1:
Feb 03 15:31:54 osd-0 bash[20918]: class prio 1:4 parent 1: leaf 40:
Feb 03 15:31:54 osd-0 bash[20918]: class prio 1:5 parent 1: leaf 50:
Feb 03 15:31:54 osd-0 bash[20918]: class prio 1:6 parent 1: leaf 60:
Feb 03 15:31:54 osd-0 systemd[1]: Started Linux Traffic Control enforced network latency setup.
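
The latency can also be toggled manually on a given node, for example (a sketch; normally the service is deployed and started via MachineConfig or Ansible as described below):

# remove the additional latency (runs the teardown command via ExecStop)
systemctl stop network-latency
# re-apply the latency setup
systemctl start network-latency
# quick check whether the latency is currently in effect
systemctl is-active network-latency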

MachineConfig

A MachineConfig resource is used to deploy both the script and the systemd service unit file on each node of the OpenShift cluster.

Using the OpenShift interface has the advantage of better visibility of such changes, which can be easily inspected via the Machine Config Operator (MCO) API. Moreover, the latency setup survives a node reboot (assuming the IP addresses of the nodes don’t change).

Both the ocp-network-split-setup (single cluster mode) and ocp-network-split-multisetup tools, which generate MachineConfig resources, can include the latency setup when a latency configuration is specified via the --latency and --latency-spec options.

Example of passing latency values to the ocp-network-split-multisetup tool:

$ ocp-network-split-multisetup zone.ini --mc example.mc.yaml --env example.env --latency 5 --latency-spec ab=50 ac=50
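
The generated MachineConfig can then be applied to the cluster as usual (a sketch; the exact rollout workflow depends on how the cluster is managed):

# create the MachineConfig resources generated above
oc apply -f example.mc.yaml
# watch the machine config operator roll the change out to the nodes
oc get machineconfigpool -w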

Ansible Playbook

In multi cluster mode, the Ansible playbook multisetup-latency.yml is used to deploy the latency script and systemd service to RHEL machines which are part of a zone but outside of any OpenShift cluster. The playbook receives the latency values via the following variables:

Variable name    Meaning                              Example
latency          default latency between zones        5
latency_spec     dictionary with per-zone latencies   {"ab":"50","ac":"50"}

Example of passing the values via --extra-vars:

$ ansible-playbook -i ceph.hosts --extra-vars '{"latency":"5","latency_spec":{"ab":"50","ac":"50"}}' multisetup-latency.yml
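
After the playbook run, one way to check that the service is active on all the RHEL machines is an ad-hoc Ansible command (a sketch, assuming the same ceph.hosts inventory covers exactly those machines):

$ ansible -i ceph.hosts all -m command -a 'systemctl is-active network-latency'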

If multi cluster zones contain both OpenShift nodes and classic RHEL machines outside of any OpenShift cluster, one needs to use both the MachineConfig and the Ansible playbook setup so that the latency service is deployed and running on all nodes in all zones.

Single Cluster Example

This example assumes that we have deployed the network latency MachineConfig and the OpenShift cluster has already applied the configuration on all its nodes.

For demonstration purposes, we connect to a cluster node via oc debug and check the status of the network-latency service there:

sh-4.4# systemctl status network-latency
● network-latency.service - Linux Traffic Control enforced network latency setup
   Loaded: loaded (/etc/systemd/system/network-latency.service; enabled; vendor preset: disabled)
   Active: inactive (dead) since Tue 2021-09-28 00:32:15 UTC; 4min 59s ago
  Process: 1614 ExecStart=/usr/bin/bash -c /etc/network-latency.sh 106 (code=exited, status=0/SUCCESS)
 Main PID: 1614 (code=exited, status=0/SUCCESS)
      CPU: 46ms

Sep 28 00:32:15 compute-5 systemd[1]: Starting Linux Traffic Control enforced network latency setup...
Sep 28 00:32:15 compute-5 bash[1614]: ZONE_A="198.51.100.94"
Sep 28 00:32:15 compute-5 bash[1614]: ZONE_B="198.51.100.109 198.51.100.96 198.51.100.97 198.51.100.99"
Sep 28 00:32:15 compute-5 bash[1614]: ZONE_C="198.51.100.103 198.51.100.84 198.51.100.87 198.51.100.98"
Sep 28 00:32:15 compute-5 bash[1614]: current zone: ZONE_C
Sep 28 00:32:15 compute-5 bash[1614]: Error: Cannot delete qdisc with handle of zero.
Sep 28 00:32:15 compute-5 systemd[1]: network-latency.service: Succeeded.
Sep 28 00:32:15 compute-5 systemd[1]: Started Linux Traffic Control enforced network latency setup.
Sep 28 00:32:15 compute-5 systemd[1]: network-latency.service: Consumed 46ms CPU time

There we can see that the introduced delay is 106 ms, along with the zone configuration, the detected zone of the node, and the fact that the setup succeeded. Now when we ping a node from zone A or B, we observe that the RTT is two times the delay, i.e. 212 ms:

sh-4.4# ping 198.51.100.96
PING 198.51.100.96 (198.51.100.96) 56(84) bytes of data.
64 bytes from 198.51.100.96: icmp_seq=1 ttl=64 time=212 ms
64 bytes from 198.51.100.96: icmp_seq=2 ttl=64 time=212 ms
64 bytes from 198.51.100.96: icmp_seq=3 ttl=64 time=212 ms
64 bytes from 198.51.100.96: icmp_seq=4 ttl=64 time=212 ms
^C
--- 198.51.100.96 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3004ms
rtt min/avg/max/mdev = 212.292/212.326/212.347/0.564 ms

But when we ping a node from the same zone C, we see that there is no additional delay:

sh-4.4# ping 198.51.100.84
PING 198.51.100.84 (198.51.100.84) 56(84) bytes of data.
64 bytes from 198.51.100.84: icmp_seq=1 ttl=64 time=0.086 ms
64 bytes from 198.51.100.84: icmp_seq=2 ttl=64 time=0.059 ms
64 bytes from 198.51.100.84: icmp_seq=3 ttl=64 time=0.060 ms
^C
--- 198.51.100.84 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2053ms
rtt min/avg/max/mdev = 0.059/0.068/0.086/0.014 ms

Verifying latency via a testing script

To make sure that the latency configuration works as expected, both the MachineConfig and the Ansible Playbook deploy a simple testing script, /etc/network-pingtest.sh, on all machines where the latency script is installed.

Here is an example of its usage from a machine in zone b:

# /etc/network-pingtest.sh
===============================================================================
ZONE_A
===============================================================================
PING 198.51.100.43 rtt min/avg/max/mdev = 10.300/10.377/10.510/0.125 ms
===============================================================================
ZONE_B
===============================================================================
PING 198.51.100.131 rtt min/avg/max/mdev = 0.202/0.223/0.243/0.016 ms
PING 198.51.100.159 rtt min/avg/max/mdev = 0.035/0.041/0.052/0.007 ms
PING 198.51.100.160 rtt min/avg/max/mdev = 0.172/0.200/0.218/0.026 ms
===============================================================================
ZONE_C
===============================================================================
PING 198.51.100.109 rtt min/avg/max/mdev = 10.213/10.242/10.296/0.122 ms
PING 198.51.100.140 rtt min/avg/max/mdev = 10.171/10.196/10.214/0.118 ms
PING 198.51.100.176 rtt min/avg/max/mdev = 10.223/10.254/10.286/0.086 ms
===============================================================================
ZONE_X
===============================================================================