Summary of Hydra: Serialization-Free Network Ordering for Strongly Consistent Distributed Applications

Replicated systems almost always carry some overhead compared to unreplicated systems, at least if you want strong consistency for your data: extra work is needed to make sure every node ends up with the same result. The fastest systems minimise or avoid that coordination, but where it can’t be avoided, we need an algorithm to manage the consensus.

Network-ordered distributed protocols can be surprisingly performant compared to unreplicated systems. Network-Ordered Paxos (NOPaxos) achieves throughput within 2% of an unreplicated system, while for comparison Paxos only achieves around 25%[1]. However, these protocols have drawbacks. NOPaxos requires sending all packets for a consensus group through a single sequencer. This makes it difficult to scale the size of groups within a data centre, and can increase the time to recover when a sequencer fails.

This paper provides an algorithm for consistent network packet ordering with drop detection over a parallel set of sequencers. This means we can get the benefits of packet sequencing (higher-throughput and lower-latency consensus algorithms) while avoiding a single point of failure in the system.

How it works

Single sequencer

Let’s start with the case of a single sequencer, as in NOPaxos. A single-node sequencer works by maintaining a counter. When it receives a packet from a sender it adds a header with the current counter, increments the counter, and sends[2] the packet to a group of receivers.
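To make that concrete, here is a minimal Python sketch of the stamping step. The paper’s sequencers are programmable switches or dedicated hosts, so the class, field names, and in-process “multicast” below are purely illustrative assumptions:

```python
# Illustrative single sequencer: stamp each incoming packet with the next
# sequence number, then forward it to every receiver in the group.
class SingleSequencer:
    def __init__(self, receivers):
        self.counter = 0          # monotonically increasing sequence number
        self.receivers = receivers

    def forward(self, payload):
        self.counter += 1
        stamped = {"seq": self.counter, "payload": payload}
        for r in self.receivers:  # stands in for the network multicast
            r.on_packet(stamped)
```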

When the receivers get a packet, they can recreate the ordering the sequencer saw fairly simply: because each packet carries a monotonically increasing number (1, 2, 3…), sorting the sequence is easy. The same numbers are what enable drop detection. If a receiver has received packets with sequence counts [1, 2, 4], it can tell it’s missing the packet with sequence count 3. The receiver then sends a drop notification to the other members of its group, which can decide to either permanently ignore the packet, or accept it and resend it to the members that are missing it. They do that through an elected leader, which initiates a round of agreement. Leaders are regularly elected in a process similar to Paxos, but we don’t need those details to cover the differences with Hydra.
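The receiver side, in the same illustrative style (pairing with the `SingleSequencer` sketch above), buffers out-of-order packets, delivers them in sequence order, and flags gaps; `deliver` and `notify_drop` are stand-ins for handing off to the application and to the group’s leader:

```python
# Illustrative receiver for the single-sequencer case: deliver packets in
# sequence order and report any gap as a suspected drop.
class SingleSeqReceiver:
    def __init__(self):
        self.next_expected = 1    # next sequence number to deliver
        self.buffer = {}          # out-of-order packets, keyed by sequence number

    def on_packet(self, stamped):
        self.buffer[stamped["seq"]] = stamped
        # Deliver everything that is now contiguous.
        while self.next_expected in self.buffer:
            self.deliver(self.buffer.pop(self.next_expected))
            self.next_expected += 1
        # Anything still buffered means a lower-numbered packet is missing.
        if self.buffer:
            self.notify_drop(self.next_expected)

    def deliver(self, stamped):
        print("deliver", stamped["seq"])            # hand off to the application

    def notify_drop(self, seq):
        print("drop notification for seq", seq)     # sent via the group's leader
```

Feeding it the [1, 2, 4] example above delivers 1 and 2 in order, then raises a drop notification for 3.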

Together this provides consistent ordering and drop detection of packets.

Hydra

Hydra takes this protocol and adds the ability to run multiple sequencers. This means you’re not limited by the throughput of a single switch or host when scaling your service. The paper shows a roughly linear increase in throughput as the number of switches increases: if one switch were limited to a throughput of 200k messages per second for a group, then with two you should get roughly 400k.

Just like in the single-sequencer case, each sequencer maintains a monotonically increasing counter (1, 2, 3…), kept locally. When a sequencer receives a packet, it adds the counter as a header, increments the counter, and forwards the packet on to the receiver group. Each sequencer also has its own ID, which it adds to the packet and which is used to deterministically settle the order of packets at the receivers. If we stopped at this naive approach, though, we would lose drop detection. Imagine a receiver is getting packets from one sequencer but missing those from another: how can it tell that a packet from the other sequencer has been dropped?

To resolve this issue, the paper introduces a combination of sequence numbers and physical clocks. Each sequencer also has a physical clock tracking real time. When forwarding packets on, as well as adding the local counter to the packet, the sequencer also adds its current clock value and its sequencer ID. Because the physical clocks are monotonically increasing, the protocol is able to guarantee that each message broadcast by the sequencers has a consistent partial ordering:

Partial ordering definition - $\S 4.3.1$

For messages $m_1$ and $m_2$ sent to the same recipients, with respective clock values $c_1$ and $c_2$, sequenced by sequencers with IDs $i$ and $j$, $m_1$ is ordered before $m_2$ if $c_1 < c_2 \vee (c_1 = c_2 \wedge i < j)$

This essentially means when a receiver gets two messages from different sequencers, the one with the lower clock value is ordered first, and if the clock values are the same then ties are broken using the sequencer ID.
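In code, the rule is just a lexicographic comparison on the (clock value, sequencer ID) pair. A small sketch, with field names of my own choosing:

```python
# Order messages by clock value, breaking ties with the sequencer ID (§4.3.1).
def hydra_order_key(msg):
    return (msg["clock"], msg["seq_id"])

msgs = [
    {"clock": 14, "seq_id": 2, "payload": "a"},
    {"clock": 12, "seq_id": 1, "payload": "b"},
    {"clock": 12, "seq_id": 2, "payload": "c"},
]
print([m["payload"] for m in sorted(msgs, key=hydra_order_key)])  # ['b', 'c', 'a']
```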

Since every receiver sees the same stamps on each packet, they all arrive at the same ordering. So that’s great! But how does that help us with drop detection?

Drop Detection

This is where the sequence numbers come back in. Remember how the single-sequencer scenario uses these to detect drops? Hydra uses them in a similar way. Each Hydra receiver first buffers the messages it gets, and only “delivers” them (logically, to the application) once it has determined that no message with a lower clock value from another sequencer can still arrive.

To do that, the receivers track two values for each sequencer: the largest sequence number and the largest clock value seen in its messages. A receiver only delivers messages up to the minimum of those largest clock values, i.e. the point up to which it knows every sequencer has already sent something with a clock value at least that high.

Let’s take an example where a receiver is listening to two sequencers with IDs $1$ and $2$. Suppose the receiver has received three messages:

  • $m_1$ that has a clock value $c = 14$, a sequencer ID $s_{id} = 2$ and a sequencer count of $s_c = 1$
  • $m_2$ with $c = 12, s_{id} = 1, s_c = 1$
  • $m_3$ with $c = 20, s_{id} = 1, s_c = 2$

The first two messages can be delivered in the order $m_2, m_1$, because the receiver knows that the time on sequencer 1 is at least 20, and the time on sequencer 2 is at least 14. Therefore the minimum time at all sequencers is 14, and it can deliver all messages with a physical time up to that point. It can’t yet deliver $m_3$ because it hasn’t received a message from sequencer 2 with a time of 20 or greater.
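The delivery rule therefore reduces to a watermark: the minimum over sequencers of the largest clock value seen from each. A hedged sketch (field names `clock`, `seq_id`, and `seq_count` are my own, matching the example above):

```python
# Deliver buffered messages whose clock value is at or below the smallest of
# the largest clock values seen from each sequencer this receiver listens to.
def deliverable(buffer, max_clock_seen):
    watermark = min(max_clock_seen.values())   # minimum time across sequencers
    ready = [m for m in buffer if m["clock"] <= watermark]
    return sorted(ready, key=lambda m: (m["clock"], m["seq_id"]))

# After m1, m2, m3 the watermark is min(20, 14) = 14, so only m2 (clock 12)
# and m1 (clock 14) are deliverable; m3 (clock 20) stays buffered.
buffer = [
    {"clock": 14, "seq_id": 2, "seq_count": 1},   # m1
    {"clock": 12, "seq_id": 1, "seq_count": 1},   # m2
    {"clock": 20, "seq_id": 1, "seq_count": 2},   # m3
]
print(deliverable(buffer, {1: 20, 2: 14}))
```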

If the receiver then received another message, $m_4$, with $c = 30, s_{id} = 2, s_c = 3$, it would know that it’s missing the message with $s_{id} = 2 \wedge s_c = 2$. At that point it delivers a drop notification for $s_{id} = 2 \wedge s_c = 2$ to the application.
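The gap check itself is the same trick as in the single-sequencer case, just tracked per sequencer. Another small sketch under the same field-name assumptions:

```python
# If a message from a sequencer arrives with a sequence count more than one
# above the last count seen from that sequencer, the skipped counts are drops.
def missing_counts(last_count_seen, msg):
    expected = last_count_seen[msg["seq_id"]] + 1
    return list(range(expected, msg["seq_count"]))

# m4 from sequencer 2 with seq_count 3, while the last count seen from it was 1:
print(missing_counts({1: 2, 2: 1}, {"clock": 30, "seq_id": 2, "seq_count": 3}))
# -> [2]  (raise a drop notification for sequencer 2, count 2)
```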

Because we’re now waiting for updates from all sequencers before we can deliver messages, it’s clear there could be some issues. What if one sequencer gets fewer messages to forward on than others? A receiver could be waiting a while for an update from a slow sequencer, while messages from the other sequencers queue up. To fix this progress issue, the paper presents configurable flush messages. These are a kind of heartbeat, where the sequencer sends a packet with its current physical clock and its current sequence number (without incrementing it). This keeps all receivers up to date on the minimum time across all sequencers, so they can deliver buffered messages up to that time.
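A flush is just a clock update that doesn’t consume a sequence number; here is how a receiver might fold one into its watermark, continuing the illustrative sketches above:

```python
# A flush advances the receiver's view of that sequencer's clock without a new
# sequence number, letting the delivery watermark move forward even when the
# sequencer has no real traffic to forward.
def on_flush(max_clock_seen, flush):
    seq_id, clock = flush["seq_id"], flush["clock"]
    max_clock_seen[seq_id] = max(max_clock_seen[seq_id], clock)

max_clock_seen = {1: 20, 2: 14}
on_flush(max_clock_seen, {"seq_id": 2, "clock": 25})
print(max_clock_seen)   # {1: 20, 2: 25} -> watermark rises from 14 to 20
```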

It should be clear enough that clock drift doesn’t affect the safety of this protocol, just its performance. If the sequencers have a huge difference in their physical clocks, then receivers may be waiting a long time for the slowest sequencer’s clock to catch up to the high-water-mark time. However, the messages still have a consistent ordering, and drop detection is not affected by a slow physical clock.

Thoughts

I thought this was a really interesting paper! It was great to dig into the network ordering that enables NOPaxos. The contributions here definitely extend the work in practical and useful directions. It’s pretty rare that you need to implement a system like this, but interesting to know that network ordering can be scaled beyond a single sequencer.

1. Figures from Just Say NO to Paxos Overhead

2. Technically this is a multicast