.. _overview_latency:

Additional latency approach
===========================

A quick overview of what ocp-network-split does to introduce latency among
nodes of different cluster zones.

Network latency script
----------------------

Latency between nodes from different zones is introduced by setting up a
`netem qdisc`_ egress `traffic queue`_ on each node of the cluster, so that
packets targeted at nodes in other zones flow through a netem qdisc which
introduces the given delay. This means that for latency to be introduced for
incoming packets as well, it's necessary to set up the netem introduced
latency on all nodes of the cluster(s), and that the total RTT (round-trip
time) added this way equals two times the specified delay.

This setup is implemented in the ``network-latency.sh`` script, which takes
the requested delay in milliseconds (half of the RTT) as a command line
argument. Zone configuration and detection are handled in the same way as in
the network split script (in fact, both scripts share this part).

One can see what changes would be introduced via the ``-d`` option, which
makes the script report what it would do instead of performing the setup:

.. code:: console

    $ export ZONE_A="198.51.100.199"
    $ export ZONE_B="198.51.100.109 198.51.100.96 198.51.100.97 198.51.100.99"
    $ export ZONE_C="198.51.100.103 198.51.100.84 198.51.100.87 198.51.100.98"
    $ ./network-latency.sh -d 15
    ZONE_A="198.51.100.199"
    ZONE_B="198.51.100.109 198.51.100.96 198.51.100.97 198.51.100.99"
    ZONE_C="198.51.100.103 198.51.100.84 198.51.100.87 198.51.100.98"
    current zone: ZONE_B
    network interface: ens192
    tc qdisc del dev ens192 root
    tc qdisc add dev ens192 root handle 1: prio bands 4
    tc qdisc add dev ens192 parent 1:4 handle 40: netem delay 15ms
    tc filter add dev ens192 parent 1: protocol ip prio 1 u32 match ip dst 198.51.100.199/32 flowid 1:4
    tc filter add dev ens192 parent 1: protocol ip prio 1 u32 match ip dst 198.51.100.103/32 flowid 1:4
    tc filter add dev ens192 parent 1: protocol ip prio 1 u32 match ip dst 198.51.100.84/32 flowid 1:4
    tc filter add dev ens192 parent 1: protocol ip prio 1 u32 match ip dst 198.51.100.87/32 flowid 1:4
    tc filter add dev ens192 parent 1: protocol ip prio 1 u32 match ip dst 198.51.100.98/32 flowid 1:4
    tc qdisc show dev ens192
    tc class show dev ens192
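When experimenting with the script by hand (outside of the systemd unit
described below), one would apply the configuration by running it without
``-d`` on every node, with root privileges and with the same zone variables
exported in that shell. A minimal sketch, reusing the interface and delay
value from the example above:

.. code:: console

    # ./network-latency.sh 15
    # tc qdisc show dev ens192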
It's also possible to specify different latencies between particular pairs of
zones. For example, the command ``network-latency.sh -l ab=25 -l ac=35 5``
will set up 25 ms (50 ms RTT) latency between zones ``a`` and ``b``, 35 ms
(70 ms RTT) between zones ``a`` and ``c``, and 5 ms (10 ms RTT) between the
rest of the zones (which in this particular case means between ``b`` and
``c``).

.. code:: console

    $ ./network-latency.sh -d -l ab=25 -l ac=35 5
    ZONE_A="198.51.100.199"
    ZONE_B="198.51.100.109 198.51.100.96 198.51.100.97 198.51.100.99"
    ZONE_C="198.51.100.103 198.51.100.84 198.51.100.87 198.51.100.98"
    current zone: ZONE_B
    network interface: ens192
    tc qdisc del dev ens192 root
    tc qdisc add dev ens192 root handle 1: prio bands 6
    tc qdisc add dev ens192 parent 1:4 handle 40: netem delay 5ms
    tc qdisc add dev ens192 parent 1:6 handle 60: netem delay 35ms
    tc qdisc add dev ens192 parent 1:5 handle 50: netem delay 25ms
    tc filter add dev ens192 parent 1: protocol ip prio 1 u32 match ip dst 198.51.100.199/32 flowid 1:5
    tc filter add dev ens192 parent 1: protocol ip prio 1 u32 match ip dst 198.51.100.103/32 flowid 1:4
    tc filter add dev ens192 parent 1: protocol ip prio 1 u32 match ip dst 198.51.100.84/32 flowid 1:4
    tc filter add dev ens192 parent 1: protocol ip prio 1 u32 match ip dst 198.51.100.87/32 flowid 1:4
    tc filter add dev ens192 parent 1: protocol ip prio 1 u32 match ip dst 198.51.100.98/32 flowid 1:4
    tc qdisc show dev ens192
    tc class show dev ens192

As you can see, the script removes the existing root qdisc and creates new
traffic queues, filtering packets destined for particular zones into qdiscs
with netem introduced latency. This is obviously not optimal from a production
perspective, but it's a good trade-off for testing purposes.

The script can remove the extra latency via its teardown command:
``network-latency.sh teardown``. Note, however, that the script does this by
removing the root qdisc, relying on the fact that the default qdisc will be
recreated. The script doesn't provide the ability to revert to the original
traffic queue configuration which was in effect before the latency was set up
(as noted above, the original configuration gets deleted).

See also:

- `Classful Queueing Disciplines`_
- `Classifying packets with filters`_
- Description of the `PRIO qdisc`_
- Description of the `netem qdisc`_, a network delay and loss emulator

.. _`Classful Queueing Disciplines`: https://lartc.org/howto/lartc.qdisc.classful.html
.. _`Classifying packets with filters`: https://lartc.org/howto/lartc.qdisc.filters.html
.. _`netem qdisc`: https://wiki.linuxfoundation.org/networking/netem
.. _`PRIO qdisc`: https://linux.die.net/man/8/tc-prio
.. _`traffic queue`: https://www.coverfire.com/articles/queueing-in-the-linux-network-stack/

Systemd Unit
------------

The latency script described above is not used directly, but via the systemd
``network-latency.service`` unit. Starting the service configures the latency,
while stopping the service removes the latency setup (via the teardown command
as described above). This means that checking the status of this service on a
given node reveals whether the additional latency is currently in effect. When
deployed via MachineConfig or Ansible Playbook as explained below, the latency
service is started during boot.
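For example, one can toggle the additional latency on a single machine by
starting and stopping the unit directly (a minimal sketch, assuming the unit
file and the script have already been deployed on that machine):

.. code:: console

    # systemctl start network-latency
    # systemctl stop network-latency

On a node where the latency setup is in effect, the status of the unit looks
like this: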
.. code:: console

    [root@example-0 ~]# systemctl status network-latency
    ● network-latency.service - Linux Traffic Control enforced network latency setup
       Loaded: loaded (/etc/systemd/system/network-latency.service; enabled; vendor preset: disabled)
       Active: active (exited) since Fri 2023-02-03 15:31:54 UTC; 17s ago
      Process: 20864 ExecStop=/usr/bin/bash -c /etc/network-latency.sh teardown (code=exited, status=0/SUCCESS)
      Process: 20882 ExecStart=/usr/bin/bash -c /etc/network-latency.sh -l ab=11 -l ac=7 5 (code=exited, status=0/SUCCESS)
     Main PID: 20882 (code=exited, status=0/SUCCESS)

    Feb 03 15:31:54 osd-0 bash[20917]: qdisc netem 60: parent 1:6 limit 1000 delay 11ms
    Feb 03 15:31:54 osd-0 bash[20917]: qdisc netem 40: parent 1:4 limit 1000 delay 5ms
    Feb 03 15:31:54 osd-0 bash[20917]: qdisc netem 50: parent 1:5 limit 1000 delay 7ms
    Feb 03 15:31:54 osd-0 bash[20918]: class prio 1:1 parent 1:
    Feb 03 15:31:54 osd-0 bash[20918]: class prio 1:2 parent 1:
    Feb 03 15:31:54 osd-0 bash[20918]: class prio 1:3 parent 1:
    Feb 03 15:31:54 osd-0 bash[20918]: class prio 1:4 parent 1: leaf 40:
    Feb 03 15:31:54 osd-0 bash[20918]: class prio 1:5 parent 1: leaf 50:
    Feb 03 15:31:54 osd-0 bash[20918]: class prio 1:6 parent 1: leaf 60:
    Feb 03 15:31:54 osd-0 systemd[1]: Started Linux Traffic Control enforced network latency setup.

MachineConfig
-------------

A MachineConfig resource is used to deploy both the script and the systemd
service unit file on each node of an OpenShift cluster. Using the OpenShift
interface has the advantage of better visibility of such changes, which can be
easily inspected via the machine config operator (MCO) API. Moreover, the
latency setup survives a node reboot (assuming the IP address of the node
doesn't change).

Both the ``ocp-network-split-setup`` (single cluster mode) and
``ocp-network-split-multisetup`` tools, which generate the MachineConfig
resources, can include the latency setup there when the latency configuration
is specified via the ``--latency`` and ``--latency-spec`` options. Example of
passing latency values to the ``ocp-network-split-multisetup`` tool:

.. code:: console

    $ ocp-network-split-multisetup zone.ini --mc example.mc.yaml --env example.env --latency 5 --latency-spec ab=50 ac=50

Ansible Playbook
----------------

In *multi cluster* mode, the ansible playbook ``multisetup-latency.yml`` is
used to deploy the latency script and systemd service to RHEL machines which
are part of a zone but outside of any OpenShift cluster. The playbook receives
the latency values via the following variables:

================ =================================== =========================
Variable name    Meaning                             Example
================ =================================== =========================
``latency``      default latency between zones       ``5``
``latency_spec`` dictionary with zone pair latencies ``{"ab":"50","ac":"50"}``
================ =================================== =========================

Example of passing the values via ``--extra-vars``:

.. code:: console

    $ ansible-playbook -i ceph.hosts --extra-vars '{"latency":"5","latency_spec":{"ab":"50","ac":"50"}}' multisetup-latency.yml

If *multi cluster* zones contain both OpenShift nodes and classic RHEL
machines outside of any OpenShift cluster, one needs to use both the
MachineConfig and the ansible playbook setup, so that the latency service is
deployed and running on all nodes of all zones.

Single Cluster Example
----------------------

This example assumes that the network latency MachineConfig has been deployed
and that the OpenShift cluster has already applied the configuration on all
its nodes.
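One way to check that the cluster has finished applying the MachineConfig is
to look at the machine config pool status (a sketch, not part of the
ocp-network-split tooling; the standard machine config pools and a logged-in
``oc`` client are assumed):

.. code:: console

    $ oc get mcp
    $ oc wait mcp --all --for=condition=Updated --timeout=30m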
For demonstration purposes, we connect to one of the cluster nodes via
``oc debug`` and check the status of the ``network-latency`` service there:

.. code:: console

    sh-4.4# systemctl status network-latency
    ● network-latency.service - Linux Traffic Control enforced network latency setup
       Loaded: loaded (/etc/systemd/system/network-latency.service; enabled; vendor preset: disabled)
       Active: inactive (dead) since Tue 2021-09-28 00:32:15 UTC; 4min 59s ago
      Process: 1614 ExecStart=/usr/bin/bash -c /etc/network-latency.sh 106 (code=exited, status=0/SUCCESS)
     Main PID: 1614 (code=exited, status=0/SUCCESS)
          CPU: 46ms

    Sep 28 00:32:15 compute-5 systemd[1]: Starting Linux Traffic Control enforced network latency setup...
    Sep 28 00:32:15 compute-5 bash[1614]: ZONE_A="198.51.100.94"
    Sep 28 00:32:15 compute-5 bash[1614]: ZONE_B="198.51.100.109 198.51.100.96 198.51.100.97 198.51.100.99"
    Sep 28 00:32:15 compute-5 bash[1614]: ZONE_C="198.51.100.103 198.51.100.84 198.51.100.87 198.51.100.98"
    Sep 28 00:32:15 compute-5 bash[1614]: current zone: ZONE_C
    Sep 28 00:32:15 compute-5 bash[1614]: Error: Cannot delete qdisc with handle of zero.
    Sep 28 00:32:15 compute-5 systemd[1]: network-latency.service: Succeeded.
    Sep 28 00:32:15 compute-5 systemd[1]: Started Linux Traffic Control enforced network latency setup.
    Sep 28 00:32:15 compute-5 systemd[1]: network-latency.service: Consumed 46ms CPU time

There we can see that the introduced delay is 106 ms, we see the zone
configuration and the detected zone of the node, and that the setup succeeded.

Now when we ping some node from zone A or B, we observe that the RTT is two
times the delay, 212 ms:

.. code:: console

    sh-4.4# ping 198.51.100.96
    PING 198.51.100.96 (198.51.100.96) 56(84) bytes of data.
    64 bytes from 198.51.100.96: icmp_seq=1 ttl=64 time=212 ms
    64 bytes from 198.51.100.96: icmp_seq=2 ttl=64 time=212 ms
    64 bytes from 198.51.100.96: icmp_seq=3 ttl=64 time=212 ms
    64 bytes from 198.51.100.96: icmp_seq=4 ttl=64 time=212 ms
    ^C
    --- 198.51.100.96 ping statistics ---
    4 packets transmitted, 4 received, 0% packet loss, time 3004ms
    rtt min/avg/max/mdev = 212.292/212.326/212.347/0.564 ms

But when we ping a node from the same zone C, we see that there is no
additional delay:

.. code:: console

    sh-4.4# ping 198.51.100.84
    PING 198.51.100.84 (198.51.100.84) 56(84) bytes of data.
    64 bytes from 198.51.100.84: icmp_seq=1 ttl=64 time=0.086 ms
    64 bytes from 198.51.100.84: icmp_seq=2 ttl=64 time=0.059 ms
    64 bytes from 198.51.100.84: icmp_seq=3 ttl=64 time=0.060 ms
    ^C
    --- 198.51.100.84 ping statistics ---
    3 packets transmitted, 3 received, 0% packet loss, time 2053ms
    rtt min/avg/max/mdev = 0.059/0.068/0.086/0.014 ms
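One can also confirm on the node that cross-zone traffic really flows through
the netem band by inspecting the per-qdisc and per-class packet counters (a
sketch; the interface name is assumed to be the same ``ens192`` as in the
earlier examples):

.. code:: console

    sh-4.4# tc -s qdisc show dev ens192
    sh-4.4# tc -s class show dev ens192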
Verifying latency via a testing script
--------------------------------------

To make sure that the latency configuration works as expected, both the
``MachineConfig`` and the Ansible Playbook deploy a simple testing script
``/etc/network-pingtest.sh`` on all machines where the latency scripts are
installed. See an example of its usage from a machine in zone ``b``:

.. code:: console

    # /etc/network-pingtest.sh
    ===============================================================================
    ZONE_A
    ===============================================================================
    PING 198.51.100.43 rtt min/avg/max/mdev = 10.300/10.377/10.510/0.125 ms
    ===============================================================================
    ZONE_B
    ===============================================================================
    PING 198.51.100.131 rtt min/avg/max/mdev = 0.202/0.223/0.243/0.016 ms
    PING 198.51.100.159 rtt min/avg/max/mdev = 0.035/0.041/0.052/0.007 ms
    PING 198.51.100.160 rtt min/avg/max/mdev = 0.172/0.200/0.218/0.026 ms
    ===============================================================================
    ZONE_C
    ===============================================================================
    PING 198.51.100.109 rtt min/avg/max/mdev = 10.213/10.242/10.296/0.122 ms
    PING 198.51.100.140 rtt min/avg/max/mdev = 10.171/10.196/10.214/0.118 ms
    PING 198.51.100.176 rtt min/avg/max/mdev = 10.223/10.254/10.286/0.086 ms
    ===============================================================================
    ZONE_X
    ===============================================================================
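To run the same check on all RHEL machines at once, an ansible ad-hoc command
could be used (a sketch, assuming the same inventory file as in the playbook
example above; OpenShift nodes are not part of this inventory and have to be
checked separately, e.g. via ``oc debug``):

.. code:: console

    $ ansible -i ceph.hosts all -m command -a /etc/network-pingtest.sh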