<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://blog.benjscho.dev/feed.xml" rel="self" type="application/atom+xml" /><link href="https://blog.benjscho.dev/" rel="alternate" type="text/html" /><updated>2026-02-13T22:55:36+00:00</updated><id>https://blog.benjscho.dev/feed.xml</id><title type="html">Ben Schofield</title><subtitle>Software developer</subtitle><entry><title type="html">How DSQL Makes Sequences Scale</title><link href="https://blog.benjscho.dev/technical/2026/02/13/dsql-sequences.html" rel="alternate" type="text/html" title="How DSQL Makes Sequences Scale" /><published>2026-02-13T17:15:00+00:00</published><updated>2026-02-13T17:15:00+00:00</updated><id>https://blog.benjscho.dev/technical/2026/02/13/dsql-sequences</id><content type="html" xml:base="https://blog.benjscho.dev/technical/2026/02/13/dsql-sequences.html"><![CDATA[<p>Sequences are one of those Postgres features that you don’t think much about.
You can ask for the next number in the sequence, and you get it. That works
pretty well when you have one machine asking for the next number, but what
about 10,000?</p>

<!--more-->

<p>We’ve just launched Sequence support in DSQL and we’re excited about it. Up
until now our recommendation has been to use UUIDs, and for truly massive
scale it still is. But we recognize that there are plenty of places where
you’d like to use a unique number for an identity on a table.</p>

<p>If you just want to get started with sequences, here’s how you make one
in DSQL:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">CREATE</span> <span class="n">SEQUENCE</span> <span class="n">my_sequence</span> <span class="k">CACHE</span> <span class="mi">65536</span><span class="p">;</span>
<span class="k">SELECT</span> <span class="n">nextval</span><span class="p">(</span><span class="s1">'my_sequence'</span><span class="p">);</span> 
</code></pre></div></div>

<p>Or if you want to use it as an identity in a column:<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup></p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">orders</span> <span class="p">(</span>
  <span class="n">id</span> <span class="nb">BIGINT</span> <span class="k">GENERATED</span> <span class="k">BY</span> <span class="k">DEFAULT</span> <span class="k">AS</span> <span class="k">IDENTITY</span> <span class="p">(</span><span class="k">CACHE</span> <span class="mi">65536</span><span class="p">)</span> <span class="k">PRIMARY</span> <span class="k">KEY</span><span class="p">,</span>
  <span class="n">customer_name</span> <span class="nb">TEXT</span>
<span class="p">);</span>
</code></pre></div></div>
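
<p>Once the table exists, inserts pick up ids from the backing sequence automatically. As a quick usage sketch (assuming <code class="language-plaintext highlighter-rouge">RETURNING</code> behaves as it does in stock Postgres, and with a made-up customer name):</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- The id column is filled in from the identity's backing sequence
INSERT INTO orders (customer_name) VALUES ('Ada') RETURNING id;
</code></pre></div></div>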

<p>This post is going to go into more detail on why sequences look the way
they do in DSQL, but to get started that’s all you need to know!<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup></p>

<h2 id="sequences-in-dsql">Sequences in DSQL</h2>

<p>In single box (or single writer) SQL systems, sequences are very lightweight.
They provide unique values without a unique index and without taking heavy
locks. In Postgres, sequences are stored like any other data: in a table. That
table is then stored for durability on disk, with writes and updates going
through a log for crash recovery. When a backend process calls <code class="language-plaintext highlighter-rouge">nextval()</code> it
reads the value, increments it, and writes the new value back. To avoid going
to disk too much, backends can also cache a number of values, set by the
<code class="language-plaintext highlighter-rouge">CACHE</code> value of the sequence; we’ll come back to that later.</p>
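
<p>You can see that table-like representation for yourself in stock Postgres: selecting from a sequence returns its single row of state. A small illustration (the sequence name here is just a throwaway):</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>CREATE SEQUENCE demo_seq CACHE 50;

-- A sequence is a one-row relation you can read like any other table
SELECT last_value, log_cnt, is_called FROM demo_seq;

-- nextval() reads that row, increments it, and writes it back
SELECT nextval('demo_seq');
</code></pre></div></div>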

<p>In a distributed architecture things are a little less simple, but not by much!
Marc’s <a href="https://marc-bowes.com/dsql-circle-of-life.html">blog on the circle of
life</a> is a good primer on
DSQL’s architecture. When you call <code class="language-plaintext highlighter-rouge">nextval()</code>, the read goes to storage,
checks the latest value, and increments the value before writing the update to
the journal. So far so simple? The important thing to remember is that
getting the next set of values in a sequence goes through a full circle of
life.</p>

<p>In distributed systems, creating scalable applications is a mutual 
responsibility<sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup>. The big problem with scaling sequences is they’re a 
classic <a href="https://marc-bowes.com/dsql-avoid-hot-keys.html">hot key</a>. DSQL’s advantage is that we can support endless horizontal
scaling at every component, but to scale, DSQL needs to be able
to spread your changes out across workers. For typical inserts we do that
by partitioning the data. However, you can’t partition a single-row table.</p>

<h2 id="cache-to-the-rescue">CACHE to the rescue</h2>

<p>Remember the <code class="language-plaintext highlighter-rouge">CACHE</code> value from sequences in Postgres? This sets how many
values a given backend reserves when it has to go to disk. So with <code class="language-plaintext highlighter-rouge">CACHE=3</code>
a backend would fetch 3 values on its first call to <code class="language-plaintext highlighter-rouge">nextval()</code>, which it can
then hand out on subsequent calls without performing extra expensive IO.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>                         ┌─────────────┐
                         │    Disk     │
                         │  seq = 10   │
                         └──────┬──────┘
                                │
           ┌────────────────────┼────────────────────┐
           │                    │                    │
           ▼                    ▼                    ▼
    ┌─────────────┐      ┌─────────────┐      ┌─────────────┐
    │  Backend A  │      │  Backend B  │      │  Backend C  │
    │ cache: 1-3  │      │ cache: 4-6  │      │ cache: 7-9  │
    └─────────────┘      └─────────────┘      └─────────────┘
           │                    │                    │
           ▼                    ▼                    ▼
      nextval()=1          nextval()=4          nextval()=7
      nextval()=2          nextval()=5          nextval()=8
      nextval()=3          nextval()=6          nextval()=9
</code></pre></div></div>

<p>Each backend reserves a chunk of sequence values on its first call.
Subsequent <code class="language-plaintext highlighter-rouge">nextval()</code> calls return from the local cache without going to disk.
If a backend crashes, those values from the sequence are discarded, so higher
cache values can result in gaps in sequences.</p>
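
<p>To make that concrete, here’s an illustrative sketch of what two connections might see with <code class="language-plaintext highlighter-rouge">CACHE 3</code> (the exact values depend on which connection reserves which chunk first):</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- Connection A: first call reserves values 1-3
SELECT nextval('my_sequence');  -- returns 1
-- Connection B: first call reserves the next chunk, 4-6
SELECT nextval('my_sequence');  -- returns 4
-- Connection A again: served from its local cache, no trip to disk
SELECT nextval('my_sequence');  -- returns 2
-- If connection B disconnects now, 5 and 6 are never handed out,
-- leaving a permanent gap in the sequence.
</code></pre></div></div>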

<p>DSQL parallelises our compute in the form of the Query Processor. Instead of
individual backends, statements are executed by QPs. Here, <code class="language-plaintext highlighter-rouge">CACHE</code> functions
the same as in Postgres. When a QP calls <code class="language-plaintext highlighter-rouge">nextval()</code> it gets a cached set
of values, and hands them out. So now to the elephant in the room for DSQL
support: we only support <code class="language-plaintext highlighter-rouge">CACHE=1</code> or <code class="language-plaintext highlighter-rouge">CACHE&gt;=65536</code>.</p>

<p>The point of these values is to highlight the decision for the developer. 
Either you want your sequence to be densely packed and low throughput, or 
you want it to be able to scale. With large cache values (&gt;=65k), sequences
are rarely a bottleneck in DSQL transactions.</p>

<h2 id="what-if-i-dont-need-scale-though">What if I don’t need scale though?</h2>

<p>That is totally fine too! Not every project is trying to hit 100k TPS. We also
know there are plenty of applications where you have a slow rate of inserts
and would prefer a dense, increasing sequence. That’s why DSQL supports
<code class="language-plaintext highlighter-rouge">CACHE=1</code>.</p>

<p>To put some numbers on it, I ran some experiments<sup id="fnref:4" role="doc-noteref"><a href="#fn:4" class="footnote" rel="footnote">4</a></sup>.</p>

<p>I tested each of these with <code class="language-plaintext highlighter-rouge">CACHE 1</code> and <code class="language-plaintext highlighter-rouge">CACHE 65536</code>, and I provided an example
with UUID for a value that doesn’t require coordination. Since a UUID is always
locally generated, there’s no way for it to conflict, so it serves as a good
baseline.</p>

<p>This is what the DDL for my first test looks like:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">CREATE</span> <span class="n">SEQUENCE</span> <span class="n">seq_cache_1</span> <span class="k">CACHE</span> <span class="mi">1</span><span class="p">;</span>
<span class="k">CREATE</span> <span class="n">SEQUENCE</span> <span class="n">seq_cache_65536</span> <span class="k">CACHE</span> <span class="mi">65536</span><span class="p">;</span>
</code></pre></div></div>
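
<p>For the UUID baseline there’s no DDL to set up: each value comes from <code class="language-plaintext highlighter-rouge">gen_random_uuid()</code>, generated locally with no shared state. Roughly speaking, the baseline is just calling:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- No coordination needed: every call produces a fresh value locally
SELECT gen_random_uuid();
</code></pre></div></div>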

<p>Here in the first test I’m creating two sequences, one with CACHE=1, and one
with CACHE=65536. We’re then fetching new values serially, so we’re making one 
request to get a new value, waiting until we get it back, and then making 
another. The majority of the time is spent on the network, waiting for
the request to go from my laptop to DSQL’s QP and back. You’ll notice that
the high cache value is faster, because the QP I’m connected to isn’t having
to fetch an update from Storage every time, but it’s not faster by much.
Comparing with UUID, you can see it’s pretty much the same as our high
cache option.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>------------------------------------------------------------
Experiment 1: Individual nextval() calls (100 iterations)
------------------------------------------------------------

CACHE=1:
  Mean:  20.26 ms
  Min:   17.94 ms
  Max:   97.10 ms
  Total: 2025.65 ms

CACHE=65536:
  Mean:  13.60 ms
  Min:   11.53 ms
  Max:   26.95 ms
  Total: 1360.24 ms

UUID:
  Mean:  13.31 ms
  Min:   11.81 ms
  Max:   27.11 ms
  Total: 1330.73 ms

Speedup with CACHE=65536 vs CACHE=1: 1.5x faster
Speedup with UUID vs CACHE=1: 1.5x faster
</code></pre></div></div>

<p>Okay great! But that doesn’t really tell us that much about how it scales.
We’re just using a single connection and fetching one value at a time.
Let’s look now at the case of a bulk insert. So here, we’re inserting 1000
rows into a table with a sequence:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">-- Creating the table</span>
<span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">bench_table</span> <span class="p">(</span><span class="n">id</span> <span class="nb">BIGINT</span><span class="p">,</span> <span class="k">data</span> <span class="nb">TEXT</span><span class="p">);</span>

<span class="c1">-- My insert statements look like this:</span>
<span class="k">INSERT</span> <span class="k">INTO</span> <span class="n">bench_table</span> <span class="p">(</span><span class="n">id</span><span class="p">,</span> <span class="k">data</span><span class="p">)</span>
<span class="k">SELECT</span> <span class="n">nextval</span><span class="p">(</span><span class="s1">'seq_cache_1'</span><span class="p">),</span> <span class="s1">'row '</span> <span class="o">||</span> <span class="k">g</span>
<span class="k">FROM</span> <span class="n">generate_series</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1000</span><span class="p">)</span> <span class="k">g</span><span class="p">;</span>
</code></pre></div></div>
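
<p>For comparison, the UUID variant of the same bulk insert looks something like the sketch below (I’m assuming a separate table with a UUID id column here; the exact harness is in the gist linked in the footnotes):</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- Same 1000-row insert, but with locally generated ids
CREATE TABLE bench_table_uuid (id UUID, data TEXT);

INSERT INTO bench_table_uuid (id, data)
SELECT gen_random_uuid(), 'row ' || g
FROM generate_series(1, 1000) g;
</code></pre></div></div>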

<p>We’re still running on one connection, but now we’re running a bulk insert of
1000 rows instead of fetching the nextval a bunch of times. So what does that
look like?</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>------------------------------------------------------------
Experiment 2: Bulk INSERT with 1000 rows
------------------------------------------------------------

CACHE=1:     6229.11 ms total
CACHE=65536: 83.16 ms total
UUID:        77.24 ms total

Sequence CACHE=65536 vs CACHE=1: 74.9x faster
UUID vs CACHE=1: 80.6x faster
</code></pre></div></div>

<p>Well that’s a big difference! The reason for this is that incrementing a
sequence doesn’t follow the transaction semantics that we have for other values.
It would be strange if something like:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">BEGIN</span><span class="p">;</span>
<span class="k">select</span> <span class="n">nextval</span><span class="p">(</span><span class="s1">'my_seq'</span><span class="p">);</span> <span class="c1">-- returns 4</span>
<span class="k">ROLLBACK</span><span class="p">;</span>
<span class="k">select</span> <span class="n">nextval</span><span class="p">(</span><span class="s1">'my_seq'</span><span class="p">);</span> <span class="c1">-- returns 4 again ??</span>
</code></pre></div></div>

<p>were to happen. To preserve the expectation that sequences always return a
unique, incrementing value, under the hood the values are fetched by special internal
transactions. This means that every call to fetch a nextval goes through
the DSQL circle of life. With that in mind, the results for the bulk insert on
a single connection make sense! For <code class="language-plaintext highlighter-rouge">CACHE=1</code>, even though we’re only on one
connection, the QP has to go through the full loop for each row: fetching
a value from storage, writing back to the journal, and waiting for the transaction
to finish before the next value can be read. With a large CACHE value, our
QP only needs to do that once. This is on a single region cluster, but on a
multi-region cluster the difference would be even more marked, because we’d
need to wait for the write to be committed to our second region.</p>
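
<p>As a rough sanity check on those numbers: 6229 ms over 1000 rows works out to about 6 ms per row, i.e. roughly one full internal round trip per value, whereas the large cache and UUID runs pay their fixed costs once for the whole statement.</p>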

<p>Now, that was still on a single connection. What about when we actually want
to <em>scale</em>? How do sequences behave when we throw more connections at them?
This experiment is the same as experiment one, except now we’re running it
contested. Instead of just one connection, let’s create 100, and have each of
those fetch 100 next values:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>------------------------------------------------------------
Experiment 3: Conflict test (100 workers x 100 nextvals each)
------------------------------------------------------------

CACHE=1:
  Total time:  51589.87 ms
  Throughput:  193.8 calls/sec
  Mean:        353.03 ms
  Min:         17.16 ms
  Max:         21182.18 ms
  Errors:      0

CACHE=65536:
  Total time:  3020.29 ms
  Throughput:  3310.9 calls/sec
  Mean:        17.46 ms
  Min:         10.57 ms
  Max:         1538.64 ms
  Errors:      0

UUID:
  Total time:  2902.25 ms
  Throughput:  3445.6 calls/sec
  Mean:        15.00 ms
  Min:         10.86 ms
  Max:         1439.94 ms
  Errors:      0

Throughput speedup with CACHE=65536 vs CACHE=1: 17.1x
Throughput speedup with UUID vs CACHE=1: 17.8x
</code></pre></div></div>

<p>The throughput difference is again just as marked. In the <code class="language-plaintext highlighter-rouge">CACHE=1</code> case,
the majority of internal transactions to fetch a cache value are conflicting.
DSQL hides the internal details that would otherwise show up as OCC errors;
instead the contention shows up as latency, just as conflicts would in regular
Postgres. With high cache values we have almost no contention. Comparing to our
baseline of UUIDs we can see the difference is minimal.</p>

<h2 id="so-what-do-i-use">So what do I use?</h2>

<p>If you want to use a sequence, our recommendation is to use a high cache
value. It’s going to keep up with your scale and avoid being a bottleneck
in your system. If you really want densely packed sequences and you don’t 
expect your table to ever be running higher than a few transactions 
per second, then <code class="language-plaintext highlighter-rouge">CACHE=1</code> will work just fine. If you change your mind or
see it becoming a blocker down the line, you can always go back and fix it with:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">ALTER</span> <span class="n">SEQUENCE</span> <span class="n">my_seq</span> <span class="k">CACHE</span> <span class="mi">65536</span><span class="p">;</span>
</code></pre></div></div>

<p>But if you <em>truly</em> don’t want to worry about scale, just use UUIDs:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">orders</span> <span class="p">(</span>
  <span class="n">id</span> <span class="n">UUID</span> <span class="k">DEFAULT</span> <span class="n">gen_random_uuid</span><span class="p">()</span> <span class="k">PRIMARY</span> <span class="k">KEY</span>
  <span class="c1">--...</span>
<span class="p">);</span>
</code></pre></div></div>

<hr />

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p>If you’re looking for <code class="language-plaintext highlighter-rouge">SERIAL</code> support, there are <a href="https://www.naiyerasif.com/post/2024/09/04/stop-using-serial-in-postgres/">a lot of reasons not
to use
it</a>.
<code class="language-plaintext highlighter-rouge">SERIAL</code> is essentially a wrapper over a sequence with <code class="language-plaintext highlighter-rouge">CACHE=1</code>. We decided
that a default <code class="language-plaintext highlighter-rouge">CACHE</code> of 1 was a performance footgun that it’s worth
protecting customers from. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:2" role="doc-endnote">
      <p>For more details on sequences and how they work in DSQL, you can see the <a href="https://docs.aws.amazon.com/aurora-dsql/latest/userguide/sequences-identity-columns.html">documentation here</a> and the <a href="https://docs.aws.amazon.com/aurora-dsql/latest/userguide/create-sequence-syntax-support.html">supported syntax</a>. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:3" role="doc-endnote">
      <p>I like Pat Helland’s <a href="https://www.cidrdb.org/cidr2024/papers/p63-helland.pdf">BIG
DEAL</a> paper for
  discussing the deal between infra providers and app developers here:</p>
      <blockquote>
        <ul>
          <li>Scalable apps don’t concurrently update the same key.</li>
          <li>Scalable DBs don’t coordinate across disjoint TXs.</li>
        </ul>
      </blockquote>
      <p><a href="#fnref:3" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:4" role="doc-endnote">
      <p>You can find the code for this <a href="https://gist.github.com/Benjscho/7573e0e1e6b7cc574c384cd0492cbcb6">available here</a>. <a href="#fnref:4" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name></name></author><category term="technical" /><category term="dsql" /><category term="postgres" /><category term="sequences" /><summary type="html"><![CDATA[Sequences are one of those Postgres features that you don’t think much about. You can ask for the next number in the sequence, and you get it. That works pretty well when you have one machine asking for the next number, but what about 10,000?]]></summary></entry><entry><title type="html">It’s Time to Replace TCP in the Datacenter</title><link href="https://blog.benjscho.dev/papers/2025/01/23/tcp-datacenter.html" rel="alternate" type="text/html" title="It’s Time to Replace TCP in the Datacenter" /><published>2025-01-23T17:00:00+00:00</published><updated>2025-01-23T17:00:00+00:00</updated><id>https://blog.benjscho.dev/papers/2025/01/23/tcp-datacenter</id><content type="html" xml:base="https://blog.benjscho.dev/papers/2025/01/23/tcp-datacenter.html"><![CDATA[<p><em>Summary of <a href="https://arxiv.org/pdf/2210.00714">It’s Time to Replace TCP in the Datacenter</a></em></p>

<p>This position paper from John Ousterhout sets out everything that’s wrong with TCP and exactly how we should fix it. <!--more--> It’s an interesting and purposefully polemical paper<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>. Ousterhout has serious pedigree in distributed systems. He was one of the co-authors of <a href="https://raft.github.io/raft.pdf"><em>In Search of an Understandable Consensus Algorithm</em></a>, the paper that introduced Raft, created the Tcl scripting language, and has led a number of teams to impressive results over the years, so he’s talking from a place of experience.</p>

<p>The paper proposes that there are core issues with TCP that can’t be fixed. It argues that these issues are so core to TCP as to require breaking changes, at which point you might as well fix everything at once. It then goes on to discuss <a href="https://homa-transport.atlassian.net/wiki/spaces/HOMA/overview">Homa</a>, a protocol designed specifically for the datacenter, that fixes all of these issues.</p>

<h2 id="whats-wrong-with-tcp">What’s wrong with TCP?</h2>

<p>Let’s run through the properties that the author sets out as needing to be reworked:</p>

<p>First, TCP is <strong>stream oriented</strong>. Work comes in as bytes, but in the datacenter
it’s typically executed in complete <em>messages</em>, which have to be read from the
stream and reconstructed.  This means messages can’t be rerouted to available
cores. Over time network speeds have increased to the point that server cores
can’t keep up. To make full use of a network link you need to spread the load
equally across cores, but stream orientation makes that difficult to do.
The stream is tied to whichever core is reading from the stream and you either
need to then dispatch work to other cores or process whatever incoming message
you have, blocking further work on the same stream. This quote highlights the
issue well:</p>

<blockquote>
  <p>The fundamental problem with streaming is that the units in which data is received (range of bytes) do not correspond to dispatchable units of work (messages)</p>
</blockquote>

<p>In a similar vein, TCP is also <strong>connection oriented</strong>. This adds overhead: each open remote connection on a Linux server requires around 2000 bytes of state in the kernel. Connections also take non-trivial time to set up, with 1 RTT to connect. While connections made sense previously when clients and hosts were long lived, now many applications are serverless. Paying the connection cost makes less sense in that world.<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup></p>

<p>TCP requires <strong>in order delivery</strong>, although as Ousterhout admits, with a limited amount of reordering allowed. This prevents techniques for reducing load throughout the network, such as packet spraying, where packets are routed via different network pathways, reducing congestion at the hot nodes.</p>

<p><strong>Congestion control</strong> is highlighted as another problem point. TCP’s congestion control is driven by senders reacting to backpressure. This means there must be packet queueing when the network is loaded.</p>

<p><strong>Bandwidth sharing</strong> (or “fair scheduling”) shares bandwidth on a host link equally between active connections. But Ousterhout argues that this impacts short messages disproportionately, leading to much higher tail latencies under load.</p>

<h2 id="why-is-homa-better">Why is Homa better?</h2>

<p>The main thrust of Homa is that it fixes all of these issues. It’s message oriented, so work arrives in dispatchable units. It’s connectionless, so there’s no setup or ongoing overhead. It can be delivered out of order, allowing packet spraying–balancing load evenly across network links. It also lets receivers control congestion through a kind of token bucket method. Senders can only send packets in response to grants from a receiver, so the receiver can limit congestion and use grants to prioritize certain (shorter) messages.</p>

<p><img src="/assets/2025/homa-slowdown.png" alt="Graph displaying the comparative slowdown between Homa, TCP, and DTCP. Homa appears to have a much better slowdown ratio throughout" /></p>

<p>The only data provided in this paper is a graph displaying the 99th percentile slowdown on a loaded network. This took me a little while to parse so I’m going to talk my way through my understanding. As this is the slowdown, it’s graphing the ratio between the latency of messages in an unloaded, vs loaded network. It’s essentially showing how much slower the p99 is when the network is loaded, vs unloaded for each of these protocols, so we can see that for Homa, messages are about 6-10X slower under load for the p99, while for TCP it’s over 100X for small messages, dropping down to a little under 20X for 1M messages.</p>

<p>I found this a slightly confusing way to present the information, but it gets the message across! Homa is clearly designed to benefit tail latencies for smaller messages.</p>

<h3 id="but-what-about-encryption">But what about encryption?</h3>

<p>My big question reading this paper was the lack of any mention of encryption. TCP works very well with TLS and there are so many easy to set up integrations. There’s no excuse or reason to have datacenter traffic containing customer data communicating over plaintext. Although I believe there are existing standards for protocols like this, such as <a href="https://en.wikipedia.org/wiki/Datagram_Transport_Layer_Security">DTLS</a> for datagrams and UDP, it would be great to have some kind of mention about how encryption fits into the picture.</p>

<h3 id="what-else-is-out-there">What else is out there?</h3>

<p>There are other protocols in the space. Discussing this paper with colleagues, I learned about the <a href="https://assets.amazon.science/a6/34/41496f64421faafa1cbe301c007c/a-cloud-optimized-transport-protocol-for-elastic-and-scalable-hpc.pdf">SRD protocol</a>, which is used by Elastic Block Store for high throughput. SRD takes advantage of many of the same improvements that Homa does, such as packet spraying through a network. This means packets can arrive unordered (which the protocol then handles) but the common case is in order. In cases like this, the work is already being done; however, EBS (even within AWS) is relatively unique in its needs. This paper does mention other alternatives, mainly Infiniband, but it doesn’t mention SRD.</p>

<h2 id="conclusion">Conclusion</h2>

<p>I found this a fun read. Like a lot of foundational software, TCP becomes one of those things that you rarely think about <em>not</em> using. It’s a strong default for many good reasons. TCP is incredibly common, almost any host that you can think of has a TCP implementation, and because it’s so common it’s been heavily optimised over the years.</p>

<p>But it’s always worth revisiting building blocks as things change, particularly the most common ones. Saving resources at the lower levels can be so powerful <em>because</em> they are so common. Saving 1% on a $100M cost can justify spending a lot more engineering time than saving 90% on $1000.</p>

<p>However, I don’t know if this is one of those cases. TCP is just so ubiquitous, and so well optimised already, that for 95% of use cases it’s not worth the effort to switch. With every new technology there are new operational scars to learn. The further out on the bleeding edge you are, the more you have to debug yourself. I think for extremely high throughput systems where you have control over the vertical system it will make sense. I definitely agree with the position that integrating any such protocol with a few major RPC frameworks is the best start to get things off the ground. I’ll be interested to see over the next few years how this space continues to develop.</p>

<h2 id="related-reading">Related reading</h2>

<ul>
  <li><a href="https://dl.acm.org/doi/pdf/10.1145/3015146">Attack of the Killer Microseconds</a></li>
  <li><a href="https://assets.amazon.science/a6/34/41496f64421faafa1cbe301c007c/a-cloud-optimized-transport-protocol-for-elastic-and-scalable-hpc.pdf">A Cloud-Optimized Transport Protocol for Elastic and Scalable HPC</a></li>
  <li><a href="https://systemsapproach.substack.com/p/its-tcp-vs-rpc-all-over-again">It’s TCP vs. RPC All Over Again</a></li>
</ul>
<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p>Don’t blame an old lit grad for forcing the alliteration <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:2" role="doc-endnote">
      <p>I’m not sure I fully agree here as many serverless applications end up keeping the compute around to amortise setup costs. For example, AWS Lambda will keep your function running for a time after execution and reuse the micro-VM if another request comes in. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name></name></author><category term="papers" /><category term="software-engineering" /><category term="distributed-systems" /><category term="academic" /><category term="networking" /><category term="datacenter" /><summary type="html"><![CDATA[Summary of It’s Time to Replace TCP in the Datacenter This position paper from John Ousterhout sets out everything that’s wrong with TCP and exactly how we should fix it.]]></summary></entry><entry><title type="html">Rust is Safe for X</title><link href="https://blog.benjscho.dev/technical/2025/01/10/rust-safe-for-x.html" rel="alternate" type="text/html" title="Rust is Safe for X" /><published>2025-01-10T17:55:00+00:00</published><updated>2025-01-10T17:55:00+00:00</updated><id>https://blog.benjscho.dev/technical/2025/01/10/rust-safe-for-x</id><content type="html" xml:base="https://blog.benjscho.dev/technical/2025/01/10/rust-safe-for-x.html"><![CDATA[<p>I love <a href="https://lwn.net/Articles/995814/">this article from lwn</a> and this conclusion especially:</p>
<blockquote>
  <p>this ability to take a property that the language does not know about and “teach” it to Rust, so that now it is enforced at compile time, is why he likes to call Rust an “X-safe” language. It’s not just memory-safe or thread-safe, but X-safe for any X that one takes the time to implement in the type system.</p>
</blockquote>

<!--more-->

<p>Rust is a language where you can use the type system to enforce safety guarantees at compile time. This isn’t exclusive to Rust, it’s a feature of any language with a strong type system. I think the ergonomics of Rust are particularly well suited to it, definitely over other systems languages.</p>

<p>I find myself coming back to this when we have implementation decisions. For example, if you are working on a service that has to handle encryption keys, you can use the type system to craft APIs and structs that simplify their handling. You can use a struct to ensure that your keys are never logged accidentally, by creating custom <code class="language-plaintext highlighter-rouge">Display</code> and <code class="language-plaintext highlighter-rouge">Debug</code> implementations that censor the plaintext. If you have multiple different kinds of encryption keys, you can craft APIs that require a key of type <code class="language-plaintext highlighter-rouge">Key&lt;ComponentA&gt;</code>, so you can’t accidentally pass in the key for <code class="language-plaintext highlighter-rouge">ComponentB</code>. There are all sorts of nice things you can do here, and it’s important to take advantage of them!</p>

<p>I think a good rule of thumb is when you find someone saying “as long as…”. As long as we use it in this way… As long as we pass it in the same way we receive it… That just means we know how we should use it, so we should encode that in the type system! People forget, people make mistakes. Lets make it easy on ourselves by making it <em>harder</em> to make mistakes than to just use the API as intended.</p>]]></content><author><name></name></author><category term="technical" /><category term="software-engineering" /><category term="rust" /><category term="type-safety" /><summary type="html"><![CDATA[I love this article from lwn and this conclusion especially: this ability to take a property that the language does not know about and “teach” it to Rust, so that now it is enforced at compile time, is why he likes to call Rust an “X-safe” language. It’s not just memory-safe or thread-safe, but X-safe for any X that one takes the time to implement in the type system.]]></summary></entry><entry><title type="html">Kafka: a Distributed Messaging System for Log Processing</title><link href="https://blog.benjscho.dev/papers/2025/01/06/kafka.html" rel="alternate" type="text/html" title="Kafka: a Distributed Messaging System for Log Processing" /><published>2025-01-06T17:00:00+00:00</published><updated>2025-01-06T17:00:00+00:00</updated><id>https://blog.benjscho.dev/papers/2025/01/06/kafka</id><content type="html" xml:base="https://blog.benjscho.dev/papers/2025/01/06/kafka.html"><![CDATA[<p><em>Summary of <a href="https://www.microsoft.com/en-us/research/wp-content/uploads/2017/09/Kafka.pdf">Kafka: a Distributed Messaging System for Log Processing</a></em></p>

<p>Apache Kafka is a system for streaming logs that prioritises high throughput over strict delivery guarantees. This was an interesting paper to read ~13 years on, as Kafka has become more and more ubiquitous in system design.</p>

<!--more-->

<h3 id="design">Design</h3>

<p>This paper presents Kafka as a relatively simple system, consisting of <strong>brokers</strong> and <strong>consumers</strong>. Brokers receive messages from <strong>producers</strong>. Consumers poll brokers for messages that they care about, which are broken up into topics. Messages are opaque byte strings, allowing consumers to define their own formats. LinkedIn, for example, used <a href="https://en.wikipedia.org/wiki/Apache_Avro">Avro</a> encoding, a binary serialization format with a versioned schema<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>.</p>

<p>Producers send messages to brokers on particular topics. To distribute load across the nodes, topics are sharded into a number of partitions. Each broker stores one or more of these. Consumers subscribe to a topic by creating 1+ message streams, which each provide an iterator over a stream of messages. Producers can publish a message to either a randomly selected partition, or one determined by a partition key and a partition function.</p>

<p><img src="/assets/2025/kafka-design.png" alt="Kafka design" /></p>

<h3 id="let-someone-else-do-the-hard-work">Let someone else do the hard work</h3>

<p>In a few key areas the Kafka developers chose simplicity over complexity in their design, letting someone else do the hard work. This first comes up in the caching (or lack of). They decided to rely on the file system’s page cache instead of caching messages in memory. This has the benefit of avoiding double buffering as well as maintaining a cache between broker process restarts. It’s also simpler to implement and maintain.</p>

<p>Similarly, the developers use the sendfile API to eliminate additional copying on the host. This eliminates 2 buffer copies and 1 system call from a standard approach of sending bytes. This adds up to make Kafka more efficient for throughput.</p>

<p>Instead of organising their own consensus mechanism, Kafka uses Zookeeper for the coordination primitives. This saves reinventing the wheel and keeps the coordination of the system much simpler. Zookeeper’s API mimics a simple file system. Paths can be created, have their values set, read, or deleted. Nodes can register to watch a path - meaning a watcher can be notified when the children of a path or its value have been changed. This is used by Kafka for nodes to know when they need to reconfigure.</p>

<p>Paths can also be created as ephemeral, so when the client that creates them disappears the path is automatically removed. By offloading the complexity of this management to Zookeeper, Kafka again prioritises simplicity. Instead of having a centralized main node, the consumers and producers can coordinate in a decentralized way.</p>

<p>Whenever brokers or consumers are added, a rebalancing process is triggered, which spreads the partitions of a topic over the new set of consumers. This is a pretty simple algorithm that runs deterministically on each consumer. Because Kafka only guarantees “at least once” delivery, there are no real correctness issues that crop up here. But it’s important for developers to be aware that they should implement their own idempotency mechanism if it’s needed.</p>

<h3 id="pull-vs-push">Pull vs Push</h3>

<p>Kafka operates on a “pull” model for messages, meaning consumers are in charge of the state necessary to pull messages. Each message is stored in the broker’s log and identified by its offset within that log. Logs are partitions of topics, implemented as a set of files of roughly the same size.</p>

<p>Instead of giving each message a unique ID, it is identified by its offset in the log file. This is a pretty interesting implementation! It means that the brokers have much less state or information to manage for each message. Consumers start reading messages from the queue, and request the next message by sending the offset they have consumed up to. This means that the brokers don’t need to store any state for consumers; they can just feed them the next messages when requested. Another benefit of this is easy checkpointing. If a consumer fails while processing, it can pick up from its last successful checkpoint.</p>

<p>To trim messages, brokers wait for a specific time period, e.g., 7 days. This allows replaying messages over a longer term for consumers; they can just roll back to an earlier point in the queue. Since message queues are saved to files on the brokers, the only added pressure is to disk utilization – which is a lot cheaper than memory.</p>

<h3 id="why-did-kafka-become-so-successful">Why did Kafka become so successful?</h3>

<p>I struggled to find any hard data showing the market adoption of Kafka aside from <a href="https://survey.stackoverflow.co/2023/#section-most-popular-technologies-other-frameworks-and-libraries">this 2023 Stack Overflow survey</a> where 10% of professional devs reported using it - however that still put it behind RabbitMQ at 12%. There are a lot of articles claiming that Kafka has ‘won’ the message queueing space, and it certainly seems to be used more and more each year, either as a system or a compatible protocol.</p>

<p>There are probably a lot of people better qualified to draw conclusions here than I am. From my understanding, the simplicity of Kafka seems really advantageous. As an API, it’s super simple to work with. You have a continuous iteration of messages which simply hangs when waiting for new ones along the stream. It’s also quite powerful for development. Thanks to the pull model, consumers can replay messages for up to a configurable time limit. This means from an operations standpoint, it’s much simpler to re-read a queue than it is to re-push a queue of messages (as you would in the case of SNS).</p>

<p>The performance benefits of this approach are pretty clear from the paper, at least against the existing top log processors of the time. Since 2011 a number of other data streaming services have emerged, both open and closed source. Kafka’s by no means the only option, but it’s a good example of what the authors put forward: by having a specialized system you can get a good deal of extra performance.</p>
<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p>I think whatever format you use, versioning or a schema definition is a pretty good choice to have! The paper describes a system where producers and consumers could load the schemas from a lightweight schema registry, which is neat. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name></name></author><category term="papers" /><category term="software-engineering" /><category term="distributed-systems" /><category term="academic" /><summary type="html"><![CDATA[Summary of Kafka: a Distributed Messaging System for Log Processing Apache Kafka is a system for streaming logs with the aim of providing high throughput over strict guarantees around delivery. This was an interesting paper to read ~13 years on, as Kafka has become more and more ubiquitous in system design.]]></summary></entry><entry><title type="html">Prompting Experiments</title><link href="https://blog.benjscho.dev/technical/2024/11/16/code-prompting.html" rel="alternate" type="text/html" title="Prompting Experiments" /><published>2024-11-16T10:00:00+00:00</published><updated>2024-11-16T10:00:00+00:00</updated><id>https://blog.benjscho.dev/technical/2024/11/16/code-prompting</id><content type="html" xml:base="https://blog.benjscho.dev/technical/2024/11/16/code-prompting.html"><![CDATA[<p>I’m on vacation, so I’m getting some time to do things that interest me in between time spent with family and recharging. As part of that, I wanted to write a blog on <a href="https://github.com/tokio-rs/turmoil">turmoil</a> and how to use it for testing, and I’ve ended up yak-shaving my way into making a preprocessor for mdBook to compile examples that use external dependencies. I say making instead of writing, because I prompted my way to a solution.</p>

<!--more-->

<p>This took me all of 8:00am to 9:30am today to get a working solution. This feels very much in a similar vein to <a href="https://www.linkedin.com/posts/marc-brooker-b431772b_one-thing-ive-enjoyed-in-the-run-up-to-re-activity-7255614212155006976-k47m?utm_source=share&amp;utm_medium=member_desktop">Marc Brooker’s</a> experience with gen AI coding tools. It’s great for small, greenfield tasks where you want a solution, but don’t have the time to dive into all of the weeds yourself.</p>

<p>Here I was using Claude 3.5 Sonnet to iterate. I’ve added the chatbot log to the repo to keep a record of what it was like to actually iterate and reach the solution. There’s a few things I need to do first, but hopefully I’ll be able to share it soon.</p>

<p>The code itself isn’t particularly long, and it’s mostly glue that’s supported by incredible open source projects (mdBook and the whole Rust ecosystem). However, I still find it super impressive that this is where we are at. It feels like a step change in tooling. I’m having more and more of these moments when it comes to reaching to LLMs to help me fill a gap in my available tools.</p>

<p>Long term, would I blindly use this crate? No. I think I would go through the code with a line by line review before publishing, add some TODOs and make notes of the sketchier parts that are likely to bite you. But to get a solution off the ground in no time at all, it’s awesome.</p>

<p>Now I can get back to the blog writing I actually meant to do.</p>]]></content><author><name></name></author><category term="technical" /><category term="software-engineering" /><category term="gen-ai" /><category term="claude-3.5" /><category term="anthropic" /><summary type="html"><![CDATA[I’m on vacation, so I’m getting some time to do things that interest me in between time spent with family and recharging. As part of that, I wanted to write a blog on turmoil and how to use it for testing, and I’ve ended up yak-shaving my way into making a preprocessor for mdBook to compile examples that use external dependencies. I say making instead of writing, because I prompted my way to a solution.]]></summary></entry><entry><title type="html">Anvil: Verifying Liveness of Cluster Management Controllers</title><link href="https://blog.benjscho.dev/papers/2024/10/15/anvil-liveness.html" rel="alternate" type="text/html" title="Anvil: Verifying Liveness of Cluster Management Controllers" /><published>2024-10-15T23:00:00+00:00</published><updated>2024-10-15T23:00:00+00:00</updated><id>https://blog.benjscho.dev/papers/2024/10/15/anvil-liveness</id><content type="html" xml:base="https://blog.benjscho.dev/papers/2024/10/15/anvil-liveness.html"><![CDATA[<p><em>Summary of <a href="https://www.usenix.org/conference/osdi24/presentation/sun-xudong">Anvil: Verifying Liveness of Cluster Management Controllers</a></em></p>

<p>Wouldn’t it be nice to write some software and confidently say that you know it’s right? That, as long as some assumptions about the world hold, it’s going to do exactly what you want it to, no matter what strange permutations or combinations of failures happen. In broad strokes that’s the promise of formal verification and proofs in software.</p>

<!--more-->

<p>This paper, in particular, is about formally verifying controllers. More specifically, how to verify liveness for Kubernetes cluster controllers. While the researchers start there, I think their contributions are more broadly applicable for other control plane architectures too.</p>

<h2 id="whats-tla">What’s TLA?</h2>

<p>Let’s start with a quick diversion/refresher, feel free to skip if you already know! TLA stands for temporal logic of actions. It’s a way of thinking, step by step, how things change over time as the result of actions. Leslie Lamport came up with the field as a way to describe concurrent programs in a formal, logical way.</p>

<p>We use formal methods like TLA to be able to make concrete mathematical statements about things that we can then prove or disprove. Something like “the ground is dry” is a logical proposition about the world. If then an action happens like “it rains” we can know that in the next state “the ground is dry” is false. TLA allows us to draw these same kinds of logical conclusions about concurrent systems.</p>

<p>It’s really useful when we want to describe properties of systems. These could be safety properties, or liveness properties. A safety property says that our system doesn’t do a bad thing, while a liveness property says that our system eventually does a good thing, or the thing we want it to do. Liveness is a tricky thing! Safety properties are usually easier to prove because a system that does nothing can be perfectly safe. Liveness properties require the system to keep making progress towards a goal.</p>

<h2 id="what-does-this-paper-provide">What does this paper provide?</h2>

<p>This paper provides a property called <strong>eventually stable reconciliation</strong> (ESR). They claim that this property precludes a lot of bugs in controllers. It also presents <em>Anvil</em>, a framework for developing kubernetes controllers and verifying that they meet this property. They then use Anvil to verify three Kubernetes controllers and show they have comparable performance to their un-verified counterparts.</p>

<p>ESR is a TLA formula, it’s a statement about a system that can be expressed in TLA. It essentially says that if what the controller wants stops changing, then the system should eventually reach that state.</p>

<p>That seems super reasonable as a goal to me! Intuitively, if a controller is able to update a system to match its desired state then it seems like it’s doing its job. If you don’t have familiarity with controllers or control plane software, you can think of them as being like a thermostat. You set some kind of desired state through configuration (turn the thermometer to 20C), and then the controller makes adjustments to the environment (turns your central heating on/off) while monitoring for changes (checking the current temp) until its desired state is reached.</p>

<p>In the case of a Kubernetes controller, this looks something more like creating service configuration files, or spinning up containerized nodes. The overall principle is the same, but the amount of input or ongoing state increases, and the number of ways in which things can go wrong multiplies.</p>

<p>Formally ESR is represented by the TLA formula:</p>

<p>\(\forall d.\Box(\Box \text{desire}(d) \implies \Diamond \Box \text{match}(d))\)<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup></p>

<p>This asserts that for all desired states $d$, if the controller always desires $d$, then eventually $d$ will always match the current state. It seems clear that this will prevent a whole range of bugs that prevent the controller from moving towards its goal.</p>

<h2 id="how-do-they-prove-things">How do they prove things?</h2>

<p>The authors use <a href="https://github.com/verus-lang/verus">Verus</a>, a language that allows writing proofs in the form of preconditions and postconditions. Verus leverages Rust’s type system and reduces the problem to an SMT-solvable form. The team extended Verus with a set of simple temporal logic constructs to handle the temporal aspects of their proofs.</p>

<p>The advantage of using Verus is that it allows for modular proofs, breaking down complex properties into smaller, more manageable pieces. This approach makes it easier to reason about and verify complex systems like cluster controllers. Like <a href="https://github.com/dafny-lang/dafny">Dafny</a>, and in contrast to a model language like TLA+, it allows verifying the <em>actual code</em> that you run in production.</p>

<p><img src="/assets/2024/10/anvil.png" alt="A diagram from the paper of the anvil workflow" /></p>

<p>The Anvil toolkit provides a good deal of work towards proving ESR for Kubernetes controllers. Proving implementations with Anvil is still a manual process, you still have to write proof code to go with your implementation, but Anvil does provide a lot of handy looking lemmas<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup> to help.</p>

<p>By applying their ESR property and proof methodology, the authors were able to identify and fix bugs in existing controllers that had been missed by extensive testing. This shows the power of formal verification in catching subtle issues that can be nearly impossible to uncover with traditional testing.</p>

<h2 id="the-future">The Future</h2>

<p>A big part of why I find this exciting is that it shows a future where more and more of our software is written proof first. With projects like this, it doesn’t seem hard to imagine a world where we start our software with a given proven core and then extend it to match the service at hand. I think this could be extremely powerful for Control Plane software where the problems are relatively common and abstractable: identify what’s wrong, make changes to fix it.</p>

<p>We’re already seeing other aspects of software trend in this formally verified direction. The AWS crypto team have delivered huge speed improvements by <a href="https://www.amazon.science/blog/formal-verification-makes-rsa-faster-and-faster-to-deploy">formally verifying RSA</a>. By building on the foundation of a correct solution they were able to make more aggressive optimisations.</p>

<p>To be a bit more pedantic, I see this happening for software at the largest scales. At AWS scale events that would be one in a billion are happening every hour. Formal correctness can help eliminate these. A big benefit from that is that, much like with testing, formal verification can give teams the ability to make big changes more confidently. If you know that you have a process to catch issues, it’s much easier to be bold in making big changes.</p>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p>If you’re unfamiliar with TLA, $\forall$ means “for all”, $\Box$ means “always” (in the future) and $\Diamond$ means “eventually”. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:2" role="doc-endnote">
      <p>A lemma (not a <a href="https://en.wikipedia.org/wiki/Lemming">lemming</a>) is a reuseable bit of logic that you can use to prove something else. It’s like a useful function that can convert from thing to another. In the case of proofs, if you want to prove some statement $C$, and you know how to prove that $A$ means $B$ is true, it’s really handy to have something that shows that if $B$ is true then so is $C$. That would be an example of a lemma. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name></name></author><category term="papers" /><category term="software-engineering" /><category term="distributed-systems" /><category term="academic" /><category term="automated-reasoning" /><summary type="html"><![CDATA[Summary of Anvil: Verifying Liveness of Cluster Management Controllers Wouldn’t it be nice to write some software and confidently say that you know it’s right? That, as long as some assumptions about the world hold, it’s going to do exactly what you want it to, no matter what strange permutations or combinations of failures happen. In broad strokes that’s the promise of formal verification and proofs in software.]]></summary></entry><entry><title type="html">MRTOM: Mostly Reliable Totally Ordered Multicast</title><link href="https://blog.benjscho.dev/papers/2024/07/04/mrtom.html" rel="alternate" type="text/html" title="MRTOM: Mostly Reliable Totally Ordered Multicast" /><published>2024-07-04T19:00:00+00:00</published><updated>2024-07-04T19:00:00+00:00</updated><id>https://blog.benjscho.dev/papers/2024/07/04/mrtom</id><content type="html" xml:base="https://blog.benjscho.dev/papers/2024/07/04/mrtom.html"><![CDATA[<p><em>Summary of <a href="https://ieeexplore.ieee.org/document/10272412">MRTOM: Mostly Reliable Totally Ordered Multicast</a></em></p>

<p>This paper is about building a network primitive to speed up consensus protocols. Like other papers in this area, MRTOM builds on the fact that network ordered protocols can have much higher throughput than standard consensus protocols. MRTOM takes this one step further by offloading not just packet ordering, but also the fast path of consensus protocols to programmable switches.</p>

<!--more-->

<h2 id="the-consensus-problem">The consensus problem</h2>

<p>It’s probably a good idea to have a quick refresher on what consensus problems are, and why network ordered protocols can be faster.</p>

<p>Broadly, the problem of consensus is agreeing with a group what’s happened. It’s like if you were sending a series of letters to 20 different people in different countries. Some of your letters might get lost or intercepted before they get there, or one of the people you’re sending to might be missing a letter and would disagree with the others on what you’ve said.</p>

<p>Even worse, the letters need to have an agreed order. If you send three messages, the end meaning might be completely different if they arrive in a different order:</p>
<ul>
  <li>I’ll be arriving at midnight</li>
  <li>I’ll be arriving at noon</li>
  <li>Ignore that last message, the next one will be my new arrival time</li>
</ul>

<p>These three messages have different meanings depending on when they arrive. We could understand that you’ll be arriving at midnight, or at noon, or we could be unsure of what time you’re arriving at all.<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup></p>

<p>Without a way of deciding what letters all of the group has received and in what order, we can’t agree on exactly what you’ve said. Consensus protocols like Paxos solve this by ensuring a consensus group agrees on what messages they’ve received and <em>in what order</em>. Typically a Paxos like protocol has a leader that accepts requests from clients, while followers can be used to read committed data or will forward new requests onto the leader. During the process of agreeing on a message, the members of a Paxos group have to send messages back and forth with each other, which takes time, meaning we can process fewer messages and respond slower than if there was just one recipient.</p>

<p>The benefit of having a group of recipients is resilience: because the data is replicated, one of the recipients can become unavailable and we’re still able to provide that data when someone comes to read it.</p>

<h2 id="if-you-know-what-youre-expecting-next-you-can-act-fast">If you know what you’re expecting next, you can act fast</h2>

<p>Network ordered consensus protocols, like NOPaxos, exploit the fact that if you have a total ordering for incoming client messages, you can get consensus pretty quickly! Typically consensus incurs a penalty, as a group of servers incur a communication overhead in agreeing what they’ve received. Network ordered protocols make the observation that if you have a monotonically increasing dense ID<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup> assigned to incoming requests, 1) it’s very easy to know that you have things in order, and 2) it’s very obvious if you’re missing a message. If the packets arrive reliably and in order<sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup>, then servers in a consensus group don’t need to communicate to agree on what they’ve received and in what order.</p>

<p>Since not doing something is always cheaper and faster than doing something, this lets network ordered protocols do as little as possible. They only have to communicate when packets <em>don’t</em> arrive. This is the separation of the fast and the slow path of network ordered consensus protocols. The fast path is when everything gets there on time and in order, while the slow path happens when a server is missing a message.</p>
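
<p>To make that split concrete, here’s a minimal sketch (my own illustration, not code from any of these papers) of a receiver behind a single sequencer: it stays on the fast path while the dense sequence numbers arrive in order, and drops to the slow path the moment it spots a gap:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Sketch of a receiver behind a single sequencer. Messages carry the dense,
# monotonically increasing sequence number the sequencer stamped on them.
class Receiver:
    def __init__(self):
        self.next_expected = 1
        self.log = []

    def on_message(self, seq_no, payload):
        if seq_no == self.next_expected:
            # Fast path: exactly the message we were waiting for, so no
            # coordination with the rest of the group is needed.
            self.log.append(payload)
            self.next_expected += 1
        elif seq_no &gt; self.next_expected:
            # Slow path: everything from next_expected up to seq_no - 1 is
            # missing, so coordinate with the group to recover or skip it.
            self.request_recovery(range(self.next_expected, seq_no))
        # seq_no &lt; next_expected is a duplicate; ignore it.

    def request_recovery(self, missing):
        print(f"slow path: missing sequence numbers {list(missing)}")

r = Receiver()
r.on_message(1, "a")
r.on_message(2, "b")
r.on_message(4, "d")   # gap: triggers the slow path for sequence number 3
</code></pre></div></div>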

<p>The downside of network ordered protocols is they need something to order the messages. Typically this is a single network switch which all messages to the group have to pass through. This switch will then assign each message a monotonically increasing ID. The negative of this is that we’re introducing a single point of failure (bad) and a scaling bottleneck (also not great). There’s no such thing as a free lunch.</p>

<h2 id="mrtom-design">MRTOM Design</h2>

<p>MRTOM (pronounced Mr. Tom) is based on the observation that the fast path of many consensus protocols “is precisely the reliable, ordered, and acknowledged delivery of messages to a set of nodes”. The core idea of the paper is to offload that fast path to the network, freeing up server capacity for handling the slow path and other application logic.</p>

<p>MRTOM works by having a MRTOM instance, typically a programmable switch, in between clients and the group of servers running the consensus protocol. So far this is the familiar network ordered story: the MRTOM instance provides an ordering for packets, which is then used to speed up consensus.</p>

<p>It differs in two important ways. First, it tries to increase the reliability of delivery by maintaining a loopback of packets. Once a packet has been <code class="language-plaintext highlighter-rouge">ack</code>ed by all servers in the group, MRTOM considers it delivered and can discard it. If it’s not acknowledged within a certain time then MRTOM re-sends the packet.</p>
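
<p>A rough sketch of that loopback, heavily simplified and with names of my own invention: the MRTOM instance buffers each packet until every server in the group has acked it, and re-sends anything that times out:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import time

# Toy model of the delivery loopback: buffer every multicast packet until all
# servers in the group have acked it, re-sending it if an ack deadline passes.
class MrtomInstance:
    def __init__(self, servers, timeout=0.05):
        self.servers = set(servers)     # ids of servers in the group
        self.timeout = timeout
        self.pending = {}               # seq_no -&gt; (packet, acks, last_sent)
        self.seq_no = 0

    def multicast(self, packet):
        self.seq_no += 1
        self.pending[self.seq_no] = (packet, set(), time.monotonic())
        self.send_to_group(self.seq_no, packet)

    def on_ack(self, server, seq_no):
        packet, acks, last_sent = self.pending[seq_no]
        acks.add(server)
        if acks == self.servers:
            # Every server has the packet, so the buffered copy can go.
            del self.pending[seq_no]

    def tick(self):
        now = time.monotonic()
        for seq_no, (packet, acks, last_sent) in self.pending.items():
            if now - last_sent &gt; self.timeout:
                # Not fully acked in time: send it again.
                self.send_to_group(seq_no, packet)
                self.pending[seq_no] = (packet, acks, now)

    def send_to_group(self, seq_no, packet):
        print(f"multicast seq={seq_no} payload={packet!r}")
</code></pre></div></div>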

<p>Second, and more importantly, MRTOM offloads the fast path from the server group, instead running it on the switch through eBPF. This is the main aspect that gives the protocol a big throughput and latency advantage over NOPaxos or other protocols.</p>

<p><img src="/assets/2024/07/ThroughputComp.png" alt="A graph showing a comparison of throughput versus latency for different Paxos implementations in a 3 node setup. MRTOM-Paxos shows lower latency and higher throughput than the other implementations, with NOPaxos being next best." /></p>

<h3 id="ebpf-usage">eBPF usage</h3>

<p>MRTOM allows offloading the fast path of protocols to network switches. The authors do this using eBPF, which is a way of using Linux kernel capabilities without needing to change kernel source code.</p>

<p>Extended BPF developed from Berkeley Packet Filtering, but is mostly referred to as eBPF now, as the capabilities go beyond just packet filtering. eBPF programs let you run verified bytecode inside the kernel. This means they can attach hooks to kernel events without recompiling the kernel or building kernel modules. As a result, we can get very efficient program execution, as it cuts out the syscall middle man and avoids allocations for packet processing, which in turn cuts CPU &amp; memory overheads. eBPF also provides shared data structures (maps) through which kernel-level programs can interact with user space programs, which is how the switch can then push packets into the slow path.</p>

<p>In the case of MRTOM-Paxos, the fast path is integrated into the MRTOM edge interface. This interface handles aggregating acknowledgement responses from servers in the MRTOM group. Once a majority of servers in the group (including the leader) have responded, the switch sends back an acknowledgement to the client. This aggregation reduces the number of messages and coordination between the client and servers in the group.</p>
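
<p>A hypothetical sketch of that aggregation (the real version runs as eBPF at the network edge, and these names are mine, not the paper’s):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Toy model of the switch-side fast path: collect acks per sequence number and
# reply to the client once a majority that includes the leader has responded.
class AckAggregator:
    def __init__(self, group_size, leader_id):
        self.majority = group_size // 2 + 1
        self.leader_id = leader_id
        self.acks = {}          # seq_no -&gt; set of server ids that have acked
        self.replied = set()    # seq_nos already acknowledged to the client

    def on_server_ack(self, server_id, seq_no):
        acked = self.acks.setdefault(seq_no, set())
        acked.add(server_id)
        if (seq_no not in self.replied
                and len(acked) &gt;= self.majority
                and self.leader_id in acked):
            self.replied.add(seq_no)
            self.reply_to_client(seq_no)

    def reply_to_client(self, seq_no):
        print(f"seq {seq_no} committed, acking client")

agg = AckAggregator(group_size=3, leader_id=0)
agg.on_server_ack(1, seq_no=7)       # one follower: not enough yet
agg.on_server_ack(0, seq_no=7)       # leader ack completes the majority
</code></pre></div></div>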

<h3 id="single-switch">Single switch</h3>

<p>This paper uses a single switch, much like NOPaxos, which gives a single bottleneck. However, we’ve recently seen other protocols, like Hydra, demonstrate the ability to scale the number of sequencers while maintaining a total consistent order. I’d like to see a combination of these ideas, to see if there would be a way to merge the multiple sequencer approach with MRTOM’s fast path offloading. My intuition is that it wouldn’t be possible to mix them, given Hydra’s reliance on the receivers reconstructing the order.</p>

<h2 id="summary">Summary</h2>

<p>MRTOM isn’t a real <em>theoretical</em> advance, but it presents a nice practical idea for increasing throughput and decreasing latency. Doing more at the network level, or even on the server while bypassing the typical overhead of the Linux stack with eBPF/XDP, can help us be fast.</p>

<p>However, I’m not sure the industry is trending in this direction. While programmable switches have been around for a while, very few commercial cloud operators will let you use them, and when you can it’s certainly not easy. The trend seems to be toward more abstracted and easily fungible hardware, rather than investing time programming switches that will need to be replaced in 4-5 years.</p>
<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p>Note that in this example we’re not necessarily understanding what the sender meant, we’re just <em>agreeing</em> on what they meant. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:2" role="doc-endnote">
      <p>This just means that we’re counting up (0, 1, 2, 3, …) with each new message, with no gaps in the IDs. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:3" role="doc-endnote">
      <p>or close enough to it that you can find it in your message buffer <a href="#fnref:3" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name></name></author><category term="papers" /><category term="software-engineering" /><category term="distributed-systems" /><category term="academic" /><category term="Paxos" /><category term="NOPaxos" /><summary type="html"><![CDATA[Summary of MRTOM: Mostly Reliable Totally Ordered Multicast This paper is about building a network primitive to speed up consensus protocols. Like other papers in this area, MRTOM builds on the fact that network ordered protocols can have much higher throughput than standard consensus protocols. MRTOM takes this one step further by offloading not just packet ordering, but also the fast path of consensus protocols to programmable switches.]]></summary></entry><entry><title type="html">How Hard is Asynchronous Weight Reassignment?</title><link href="https://blog.benjscho.dev/papers/2024/05/01/async-weights.html" rel="alternate" type="text/html" title="How Hard is Asynchronous Weight Reassignment?" /><published>2024-05-01T14:00:00+00:00</published><updated>2024-05-01T14:00:00+00:00</updated><id>https://blog.benjscho.dev/papers/2024/05/01/async-weights</id><content type="html" xml:base="https://blog.benjscho.dev/papers/2024/05/01/async-weights.html"><![CDATA[<p><em>Summary of <a href="https://arxiv.org/pdf/2306.03185">How Hard is Asynchronous Weight Reassignment?</a></em></p>

<p>Majority quorum systems are useful in providing a simple mechanism for consensus. To accept a value, you need a majority of servers to agree to accepting it. Weighted majority quorum services (WQMS) take this approach and recognise that some servers are going to have better performance than others, so they should get more voting power.</p>

<!--more-->

<p>The main contribution of this paper is defining three ways of reassigning weights in a WQMS. The first two are shown to be as difficult as consensus, while the third can be performed consensus-free. I’m not fully sure how practically useful this is, but that’s also because I don’t know much about the real world uses of WQMS<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>! As far as the paper goes, it’s very clearly written, with some nicely presented proofs.</p>

<h3 id="weight-reassignment">Weight Reassignment</h3>
<p>The paper presents the problem of weight reassignment. Solving this problem means providing an algorithm to update the weights of a static set of servers running a WQMS.</p>

<p>Weight reassignment has some restrictions. With the system, we want to be able to tolerate up to $f$ servers (out of a total $n$) crashing at once. This means that we need to place a limit on the total weight of the $f$ most weighted servers so they have less than half of the total weight. This way, if <em>any</em> $f$ servers fail, the remaining $n - f$ can continue to make progress. We call this restriction <em>integrity</em>.</p>

<p>This means that we can’t reassign weights in a way that would violate integrity. Say we had a quorum of 5 servers, each with a voting weight of $1$, and we wanted to tolerate up to two server failures. If we were to add $2$ to the weight of any of the servers, now the $f$ top weighted servers would have a total weight of $4$, and the total weight of all servers would be $7$. Since the total weight of the top $f$ servers is greater than half of the total weight, this would violate integrity. For brevity going forward, we’ll call the set of the top $f$ servers $F$.</p>
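
<p>That check is easy to state in code. A quick sketch (mine, not the paper’s notation) that reproduces the example above:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Integrity: the f heaviest servers must hold strictly less than half of the
# total weight, so that any f crashes still leave a weighted majority behind.
def satisfies_integrity(weights, f):
    top_f = sum(sorted(weights, reverse=True)[:f])
    return top_f &lt; sum(weights) / 2

weights = [1, 1, 1, 1, 1]                  # five servers, weight 1 each
print(satisfies_integrity(weights, f=2))   # True: 2 &lt; 2.5

weights[0] += 2                            # bump one server up to weight 3
print(satisfies_integrity(weights, f=2))   # False: 3 + 1 = 4 is more than 3.5
</code></pre></div></div>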

<p>There are three other restrictions on the weight reassignment algorithm that are simpler:</p>
<ul>
  <li>Validity-I - when a weight change is proposed, if it violates integrity then a no-op change (a change with zero weight difference) is created. If it <em>doesn’t</em> violate integrity then the proposed change is created.</li>
  <li>Validity-II - there’s an API clients can use to check the weights of a server $s$, <code class="language-plaintext highlighter-rouge">read_changes(s)</code>. When it’s called, a client gets a list of all of the weight changes made to $s$, from which they can reconstruct the current weight of $s$ by summing the changes.</li>
  <li>Liveness - if a server calls <code class="language-plaintext highlighter-rouge">reassign</code>, the operation will eventually complete, and the server will get back a message indicating the set of changes made.</li>
</ul>

<h4 id="equivalence-to-consensus">Equivalence to consensus</h4>

<p>Laid out in this way, the authors go on to show that this problem is equivalent to consensus, meaning it’s at least as difficult to solve. Given we’re using a quorum system to solve consensus, that sounds like it would defeat the point!</p>

<p>I really liked the proof they use to demonstrate this. The authors construct a scenario in which every server proposes a weight change. These changes are constructed such that one, and only one, of the weight changes can succeed with a non-zero weight change. If two or more of the changes succeeded, then integrity would be violated.</p>

<p>For the proof, the servers are divided into two disjoint sets, $F$ and $S \backslash F$, where $F$ is the set of servers $F = \{ s_1, s_2, \dots, s_f \}$. $S$ is the set of all servers, so $S \backslash F$ is the set of all servers not in $F$. Note that the sets have the following sizes: $|S| = n, |F| = f, |S \backslash F| = n - f$. The initial weight of each server $s \in F$ is $\frac{n - 1}{2f}$ while the weight of every server $s \in S \backslash F$ is $\frac{n+1}{2(n - f)}$. We can call the total weight of a set $S$ at time $t$: $\texttt{W}_{S,t}$. Based on the initial weights of each server in the sets, $\texttt{W}_{F,0} = \frac{n - 1}{2f} \times f$ and $\texttt{W}_{S \backslash F,0} = \frac{n+1}{2(n - f)} \times (n - f)$. Since $\frac{n - 1}{2} &lt; \frac{n+1}{2}$, we can see that the total weight of $F$ is less than the total weight of $S \backslash F$ and integrity is satisfied by the initial weights.</p>

<p>Each of the servers $s_i$ proposes a weight change. For the servers in $F$, they propose adding $0.5$ to their weight: $\texttt{reassign}(s_i, 0.5)$, while all servers in $S \backslash F$ propose subtracting $0.5$ from their weight: $\texttt{reassign}(s_i, -0.5)$. From this, we can see that accepting one of these changes would not violate integrity. E.g., if we accept one change from $F$, then the new total weight of $F$ becomes $\frac{n - 1}{2} + 0.5$, which is still less than $\frac{n+1}{2}$. Similarly, if we accept one change from $S \backslash F$, then $\texttt{W}_{S \backslash F,1} = \frac{n+1}{2} - 0.5$.</p>

<p>However, it’s clear that if we accept more than one change in any combination then integrity will be violated. E.g., if we accept one change from both sets, we can see that $\frac{n - 1}{2} + 0.5 = \frac{n+1}{2} - 0.5$, which would violate integrity. Since a change that doesn’t violate integrity has to be accepted, we must accept one and only one change, which is the same as deciding consensus on a value among the group. The paper also provides an algorithm for solving consensus by proposing weight changes and deciding on one, which is fun but probably not as interesting or useful as the equivalence proof.</p>
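
<p>It’s a nice construction to check numerically. Here’s a small sketch of mine doing that for $n = 5, f = 2$: with the initial weights and proposed changes from the proof, accepting any single change preserves integrity, while accepting any two violates it.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from itertools import combinations

def satisfies_integrity(weights, f):
    # The f heaviest servers must hold strictly less than half the total weight.
    return sum(sorted(weights, reverse=True)[:f]) &lt; sum(weights) / 2

n, f = 5, 2
# Initial weights from the proof: the f servers in F, then the n - f in S \ F.
weights = [(n - 1) / (2 * f)] * f + [(n + 1) / (2 * (n - f))] * (n - f)
# Proposed changes: +0.5 for each server in F, -0.5 for each server in S \ F.
changes = [0.5] * f + [-0.5] * (n - f)

for accepted in range(3):                  # accept 0, 1, or 2 of the changes
    outcomes = set()
    for chosen in combinations(range(n), accepted):
        new_weights = [w + (changes[i] if i in chosen else 0)
                       for i, w in enumerate(weights)]
        outcomes.add(satisfies_integrity(new_weights, f))
    print(accepted, outcomes)
# 0 {True}   -- the initial weights satisfy integrity
# 1 {True}   -- any single change keeps integrity
# 2 {False}  -- any pair of changes violates it
</code></pre></div></div>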

<h4 id="pairwise-weight-reassignment">Pairwise weight reassignment</h4>

<p>The authors then tried restricting the problem to make it easier to solve. If it’s hard to arbitrarily reassign weights up and down among the servers, what about only allowing pairs of servers to exchange weight? E.g., for server $s_2$ to gain weight, some other server $s_4$ needs to lose the same amount of weight. This means that the total weight of all servers can remain constant throughout.</p>

<p>Apart from this change, weight reassignment remains the same. The integrity requirement is still in place. Instead of a server proposing $\texttt{reassign}$, they can propose to $\texttt{transfer}(s_i, s_j, \Delta)$, where the $\Delta$ change in weight is taken from $s_i$ and given to $s_j$ if integrity is not violated. If integrity would be violated by the change, then two zero weight changes are created (one for $s_i$ and one for $s_j$).</p>

<p>The authors then show in much the same way that this is also equivalent to consensus. Just like before, they craft a scenario based on the sets of servers $F$ and $S \backslash F$ where one and only one weight $\texttt{transfer}$ can complete with a non-zero weight, showing that the problem is equivalent to consensus. The proof is pretty similar to the previous one so I won’t go in to the exact scenario. Again, they provide an algorithm for consensus based on this impossibility proof.</p>

<h4 id="restricted-pairwise-weight-reassignment">Restricted pairwise weight reassignment</h4>
<p>Finally the authors introduce restricted pairwise weight reassignment, which can be performed <em>without</em> consensus. There are two restrictions they place on transfer:</p>
<ol>
  <li>Only $s_i$ can call $\texttt{transfer}(s_i, *, \Delta)$. So only $s_i$ can transfer away some of its weight</li>
  <li>The weight of $s_i$ has to always be greater than $\frac{\texttt{W}_{S,0}}{2(n - f)}$. That means there’s a floor on the weight of each server: as long as every server in $S \backslash F$ stays above it, together they hold a majority of the total weight.</li>
</ol>

<p>The authors assert that if the first condition holds, then the second is locally verifiable, meaning it can be checked without consensus. I found this argument pretty straightforward! Any server can give away its weight, but only while remaining above the floor weight, leaving all of the servers not in $F$ with enough weight to continue as a quorum. This is proved with the inequality:</p>

\[|S\backslash F| \times \frac{\texttt{W}_{S, 0}}{2(n - f)} = \frac{\texttt{W}_{S, 0}}{2}\]

<p>So $\texttt{W}_{S \backslash F, t} &gt; \frac{\texttt{W}_{S, 0}}{2}$ and integrity is preserved at all times.</p>
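
<p>In code, the local check a server might run before giving weight away is about as simple as it sounds (a sketch of mine, with hypothetical names):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># A server can transfer weight away without consensus as long as its own
# weight stays strictly above the floor W_{S,0} / (2 * (n - f)).
def can_transfer(my_weight, delta, initial_total_weight, n, f):
    floor = initial_total_weight / (2 * (n - f))
    return my_weight - delta &gt; floor

# Five servers with weight 1 each (initial total 5), tolerating f = 2 crashes,
# so the floor is 5/6. A weight-1 server can't give away 0.5 ...
print(can_transfer(1.0, 0.5, initial_total_weight=5, n=5, f=2))   # False
# ... but a server that has accumulated weight 2 can.
print(can_transfer(2.0, 0.5, initial_total_weight=5, n=5, f=2))   # True
</code></pre></div></div>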

<p>The reason no other server can take away another’s weight without consensus is that if two servers both try to take weight from some server $s_i$, either transfer on its own could respect the second condition, but there’s no way for both to succeed, and deciding which one does requires some form of consensus.</p>

<h3 id="thoughts">Thoughts</h3>

<p>The authors proved this result for a static set of servers. It would be great to see if the results could be extended to a dynamic set of servers, but given all of the proofs relied on a static set, I’d imagine that would be hard. At a minimum you could imagine using consensus to decide weights as you add or remove servers, before switching back to consensus-free weight reassignment.</p>

<p>How useful is it? You can easily imagine a scenario where a server doesn’t know that it’s holding up progress for a quorum, and since a server can only give away its own weight, that might prove tricky to use in practice. It would be great to see an experimental evaluation of how helpful this could be.</p>
<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p>I’ve been pointed to <a href="https://dl.acm.org/doi/pdf/10.1145/3447865.3457962">this paper (<em>Read-Write Quorum Systems Made Practical</em>)</a> which seems to be a good read on this. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name></name></author><category term="papers" /><category term="software-engineering" /><category term="distributed-systems" /><category term="academic" /><category term="weights" /><category term="wqms" /><category term="quorum" /><summary type="html"><![CDATA[Summary of How Hard is Asynchronous Weight Reassignment? Majority quorum systems are useful in providing a simple mechanism for consensus. To accept a value, you need a majority of servers to agree to accepting it. Weighted majority quorum services (WQMS) take this approach and recognise that some servers are going to have better performance than others, so they should get more voting power.]]></summary></entry><entry><title type="html">Hydra: Serialization-Free Network Ordering for Strongly Consistent Distributed Applications</title><link href="https://blog.benjscho.dev/papers/2024/04/28/hydra.html" rel="alternate" type="text/html" title="Hydra: Serialization-Free Network Ordering for Strongly Consistent Distributed Applications" /><published>2024-04-28T21:00:00+00:00</published><updated>2024-04-28T21:00:00+00:00</updated><id>https://blog.benjscho.dev/papers/2024/04/28/hydra</id><content type="html" xml:base="https://blog.benjscho.dev/papers/2024/04/28/hydra.html"><![CDATA[<p><em>Summary of <a href="https://www.usenix.org/conference/nsdi23/presentation/choi">Hydra: Serialization-Free Network Ordering for Strongly Consistent Distributed Applications</a></em></p>

<p>Replicated systems pretty much always have some overhead in comparison to unreplicated systems, at least if you want <a href="https://dl.acm.org/doi/10.1145/2500500">strong consistency</a> for your data. We need to do extra work in order to make sure that we get the same result across all nodes. The fastest systems <a href="https://blog.benjscho.dev/papers/2024/03/11/keeping-calm.html">minimise or avoid that coordination</a>, but where we can’t avoid it, we need an algorithm to manage that consensus.</p>

<!--more-->

<p>Network ordered distributed protocols can be surprisingly performant compared to unreplicated systems. Network-Ordered Paxos (NOPaxos) is able to achieve throughput within 2% of an unreplicated system, while for comparison Paxos only achieves around 25%<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>. However, they have drawbacks. NOPaxos requires sending all packets for a consensus group through a single point. This makes it difficult to scale the size of groups within a data centre, and can increase the time to recover when a sequencer fails.</p>

<p>This paper provides an algorithm for consistent network packet ordering with drop detection over a parallel set of sequencers. This means we can get the benefits of packet sequencing (higher throughput and lower latency consensus algorithms) while avoiding the single point of failure in a system.</p>

<h3 id="how-it-works">How it works</h3>
<h4 id="single-sequencer">Single sequencer</h4>

<p>Let’s start with the case of a single sequencer, as in <a href="https://www.usenix.org/system/files/conference/osdi16/osdi16-li.pdf">NOPaxos</a>. A single node sequencer works by maintaining a counter. When it receives a packet from a sender it adds a header with the current counter, increments the counter, and sends<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup> the packet to a group of receivers.</p>

<p>When the receivers get the packet, they can then recreate the same ordering that the sequencer saw pretty simply. Because the packets each have a number that’s monotonically increasing (1, 2, 3…) it’s easy to sort the sequence. Similarly, this is how drop detection comes in. If a receiver has received a set of packets with sequence counts <code class="language-plaintext highlighter-rouge">[1, 2, 4]</code>, then it can tell it’s missing a packet with sequence count <code class="language-plaintext highlighter-rouge">3</code>. The receiver will then send a drop notification to the other members of its group, who can then decide either to permanently ignore the packet, or to accept it and resend it to all the members that are missing it. They do that through an elected leader, which initiates a round of agreement. Leaders are regularly elected in a process similar to Paxos, but we don’t need to go into that to understand the differences with Hydra.</p>

<p>Together this provides <em>consistent ordering</em> and <em>drop detection</em> of packets.</p>

<h4 id="hydra">Hydra</h4>
<p>Hydra takes this protocol and adds the ability to run multiple sequencers. This means you’re not limited by the throughput of a single switch or host when scaling your service. The paper shows a roughly linear increase in throughput as the number of switches increases. If one switch was limited to a throughput of 200k messages per second across a group, then with two you should have a throughput of 400k.</p>

<p>Just like in the single sequencer case, each sequencer maintains a monotonically increasing counter (1, 2, 3…). This counter is maintained locally by each sequencer. When they receive a packet, they add the counter as a header, increment their counter, and forward the packet on to the receiver group. The sequencers also have their own ID, which they add to the packet and which is used to deterministically settle the order of packets at the receivers. If we took this naive approach then we would lose drop detection. Imagine a receiver is getting packets from one sequencer, but missing those from another. How can it tell that a packet from the other sequencer has been dropped?</p>

<p>To resolve this issue, the paper introduces a combination of sequence numbers and physical clocks. Each sequencer also has a physical clock tracking real time. When forwarding packets on, as well as adding the local counter to the packet, the sequencer also adds its current clock value and its sequencer ID. Because the physical clocks are monotonically increasing, the protocol is able to guarantee that each message broadcast by the sequencers has a consistent partial ordering:</p>

<blockquote>
  <p>Partial ordering definition - $\S 4.3.1$</p>

  <p>For messages $m_1$ and $m_2$ sent to the same recipients, with respective clock values $c_1$ and $c_2$, sequenced by sequencers with IDs $i$ and $j$, $m_1$ is ordered before $m_2$ if $c_1 &lt; c_2 \vee (c_1 = c_2 \wedge i &lt; j)$</p>
</blockquote>

<p>This essentially means when a receiver gets two messages from different sequencers, the one with the lower clock value is ordered first, and if the clock values are the same then ties are broken using the sequencer ID.</p>
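
<p>In other words, receivers just sort on the (clock value, sequencer ID) pair. A quick sketch:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hydra's consistent ordering: sort by clock value, then break ties with the
# sequencer ID. Every receiver sorting this way gets the same total order.
def order_key(message):
    return (message["clock"], message["sequencer_id"])

messages = [
    {"clock": 14, "sequencer_id": 2, "payload": "m1"},
    {"clock": 12, "sequencer_id": 1, "payload": "m2"},
    {"clock": 12, "sequencer_id": 2, "payload": "m3"},   # tie on clock = 12
]
for m in sorted(messages, key=order_key):
    print(m["payload"])   # m2, m3, m1 -- the tie goes to the lower sequencer ID
</code></pre></div></div>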

<p>Since the stamps on each packet are consistent for each receiver, the ordering is the same among all of them. So that’s great! But how does that help us with drop detection?</p>

<h4 id="drop-detection">Drop Detection</h4>

<p>This is where the sequence numbers come back in. Remember how the single sequencer scenario uses these to detect drops? Hydra uses them in a similar way. Each Hydra receiver first buffers the messages it gets, and only “delivers” them (logically to the application) once they have determined that no message with a lower clock value from another sequencer will be delivered.</p>

<p>To do that, the receivers track two values for each sequencer: the largest sequence number, and the largest clock value seen in its messages. The receiver will only deliver messages up to the point when it knows that all sequencers have reached a clock value at least that high, so nothing earlier can still arrive.</p>

<p>Let’s take an example where a receiver is listening to two sequencers with IDs $1$ and $2$. If a receiver has received three messages:</p>
<ul>
  <li>$m_1$ that has a clock value $c = 14$, a sequencer ID $s_{id} = 2$ and a sequencer count of $s_c = 1$</li>
  <li>$m_2$ with $c = 12, s_{id} = 1, s_c = 1$</li>
  <li>$m_3$ with $c = 20, s_{id} = 1, s_c = 2$</li>
</ul>

<p>The first two messages can be delivered in the order $m_2, m_1$, because the receiver knows that the time on sequencer 1 is at least 20, and the time on sequencer 2 is at least 14. Therefore the minimum time at all sequencers is 14, and it can deliver all messages with a physical time up to that point. It can’t yet deliver $m_3$ because it hasn’t received a message from sequencer 2 with a time of 20 or greater.</p>

<p>If the receiver then received another message, $m_4$, with $c = 30, s_{id} = 2, s_c = 3$, it would know that it’s missing the message $s_{id} = 2 \wedge s_c = 2$. At that point it delivers a drop notification for $s_{id} = 2 \wedge s_c = 2$ to the application.</p>
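
<p>Putting the pieces together, here’s a heavily simplified sketch (mine, not the paper’s code) of a receiver that buffers messages, delivers everything at or below the lowest clock value it has seen from every sequencer, and flags a drop when a per-sequencer sequence count is skipped:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Simplified Hydra receiver: buffer messages, deliver them in (clock, id) order
# once no sequencer can still produce something earlier, and report a drop when
# a sequencer's per-message count skips a value.
class HydraReceiver:
    def __init__(self, sequencer_ids):
        self.buffer = []                                   # undelivered messages
        self.max_clock = {s: 0 for s in sequencer_ids}     # highest clock seen
        self.next_count = {s: 1 for s in sequencer_ids}    # next expected count

    def on_message(self, seq_id, seq_count, clock, payload):
        if seq_count &gt; self.next_count[seq_id]:
            missing = list(range(self.next_count[seq_id], seq_count))
            print(f"drop notification: sequencer {seq_id}, counts {missing}")
        self.next_count[seq_id] = seq_count + 1
        self.max_clock[seq_id] = max(self.max_clock[seq_id], clock)
        self.buffer.append((clock, seq_id, payload))
        self.try_deliver()

    def try_deliver(self):
        # Anything at or below the smallest clock value seen from every
        # sequencer is safe to deliver: nothing earlier can still arrive.
        watermark = min(self.max_clock.values())
        ready = sorted(m for m in self.buffer if m[0] &lt;= watermark)
        self.buffer = [m for m in self.buffer if m[0] &gt; watermark]
        for clock, seq_id, payload in ready:
            print(f"deliver {payload} (clock={clock}, sequencer={seq_id})")

r = HydraReceiver(sequencer_ids=[1, 2])
r.on_message(seq_id=2, seq_count=1, clock=14, payload="m1")   # buffered
r.on_message(seq_id=1, seq_count=1, clock=12, payload="m2")   # delivers m2
r.on_message(seq_id=1, seq_count=2, clock=20, payload="m3")   # delivers m1, holds m3
r.on_message(seq_id=2, seq_count=3, clock=30, payload="m4")   # flags drop of count 2, delivers m3
</code></pre></div></div>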

<p>Because we’re now waiting for updates from all sequencers before we can deliver messages, it’s clear there could be some issues. What if one sequencer gets fewer messages to forward on than others? A receiver could be waiting a while to get an update from a slow sequencer, while a load of messages from other sequencers are queueing up. To fix this progress issue, the paper presents configurable flush messages. This is a kind of heartbeat notification, where the sequencer sends a packet with its current physical clock, and its current sequence number (without incrementing it). This allows all receivers to be updated on the minimum time across all sequencers, so they can deliver the messages buffered up to that time.</p>

<p>It should be clear enough that the <em>safety</em> of this protocol isn’t affected by clock drift, just the performance. If the sequencers have a huge difference in their physical clocks then receivers may be waiting a long time for all sequencers to catch up to a high water mark time. However, the messages still have consistent ordering, and drop detection is not affected by a slow physical clock.</p>

<h4 id="thoughts">Thoughts</h4>

<p>I thought this was a really interesting paper! It was great to dig into the network ordering that enables NOPaxos. The contributions here definitely extend the work in practical and useful directions. It’s pretty rare that you need to implement a system like this, but interesting to know that network ordering can be scaled beyond a single sequencer.</p>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p>Figures from <a href="https://www.usenix.org/conference/osdi16/technical-sessions/presentation/li">Just Say NO to Paxos Overhead</a> <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:2" role="doc-endnote">
      <p>Technically this is a <a href="https://en.wikipedia.org/wiki/Multicast">multicast</a>. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name></name></author><category term="papers" /><category term="software-engineering" /><category term="distributed-systems" /><category term="swe" /><category term="productivity" /><category term="academic" /><category term="NOPaxos" /><summary type="html"><![CDATA[Summary of Hydra: Serialization-Free Network Ordering for Strongly Consistent Distributed Applications Replicated systems typically pretty much always have some overhead in comparison to unreplicated systems, at least if you want strong consistency for your data. We need to do extra work in order to make sure that we get the same result across all nodes. The fastest systems minimise or avoid that coordination, but where we can’t avoid it, we need an algorithm to manage that consensus.]]></summary></entry><entry><title type="html">Breaking into tech: Advice on getting your first role or internship</title><link href="https://blog.benjscho.dev/technical/2024/04/19/breaking-in.html" rel="alternate" type="text/html" title="Breaking into tech: Advice on getting your first role or internship" /><published>2024-04-19T17:00:00+00:00</published><updated>2024-04-19T17:00:00+00:00</updated><id>https://blog.benjscho.dev/technical/2024/04/19/breaking-in</id><content type="html" xml:base="https://blog.benjscho.dev/technical/2024/04/19/breaking-in.html"><![CDATA[<p>This is my rough sketch of advice for people trying to get their first role in the tech industry. A lot of this is synthesized and regurgitated from what others have told me, so might be pretty recognizable! <a href="https://www.amazon.ca/Cracking-Coding-Interview-Programming-Questions/dp/0984782850">Cracking the Coding Interview</a> is a really excellent resource for all of this too.</p>

<!--more-->

<h2 id="getting-an-interview">Getting an interview</h2>

<p>Tech companies look for engineers that can work with data. They want people that know how it can tell a story, how to get the right data, what’s important, and how you can use it to support your arguments. Your CV should reflect this. Try to identify nuggets of data about your previous experiences to include.</p>

<p>What was the impact of your actions in your projects? This doesn’t even have to be technical projects. If you were part of a student society that held events, did you grow their attendance by some %-age year on year?  Companies want to see that you can set goals for improvement and hit them, or at least have awareness of it. Include this in both your CV bullet points and in your interview prep.</p>

<p>When it comes to applications, a referral is at least 10X more effective than applying to a portal – at least when it comes to getting you through the first screening. After that it’s up to you, but it’s the best way to get an introduction to a company. There are so many low-quality submissions coming in through online application portals that it’s very hard for recruiters to screen them, particularly with the rise of job application bots and LLM usage.</p>

<p>If you have a target company you want to work at, try to get a referral instead of applying just through a portal. Look up people that work there on LinkedIn. You can filter by criteria that connect you to them to make the introduction easier: mutual connections, people who went to the same school as you, and so on. That’s not the key part, but it helps build familiarity. Reach out to them with a short message about yourself, what you’re aiming for, and ask if they can refer you.</p>

<p>If there’s a position advertised you’re looking at, reach out to the hiring manager and ask for more information. Hiring managers are much more likely to hire candidates that they know. It’s a good way to show interest and that you’re a real person before putting an application in. This stuff is a little scary at first, but it gets easier when you understand what people want. Hiring managers want good quality candidates that won’t waste their time. Hiring is an incredibly expensive process! You have something to offer them as a friendly, competent, understanding individual.</p>

<h2 id="interview-prep">Interview prep</h2>

<p>Have a 50/50 emphasis split on leadership questions (stories about what work you’ve done in the past and how they show certain characteristics the company is looking for) vs DSA coding skills (Leetcode or whatever other platform). You should prepare the stories for interviews in much the same way you prepare your coding skills.</p>

<p>For coding prep I have no other advice than to do it. I did a bunch prior to my first role, and I really enjoyed a lot of the learning. Leetcode Easy and Mediums will serve you well for entry level roles. It does suck, but it’s a bit of a necessary evil. No one needs you to solve DSA problems under time pressure in your day to day job, but like exams at school, it’s a way of demonstrating you can play the game.</p>

<p>Write out a grid of all of the various experiences you had against all of the public values that your target company advertises<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>. For each box write a few bullet points in the STAR<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup> format (situation, task, action, result) about the story, talking about how you demonstrated that principle. People care most about the action and the result, while the situation and the task are the context so they can understand. You should be able to relay these stories in around 5 minutes.</p>

<p>When it comes to the interview process, <em>performing</em> that you understand and can demonstrate what they want to see <em>is as good as actually doing it</em>. Interviewers want to know that you can play the game of internal values, because that’s how companies organise themselves without pulling in many different directions. They’re used in all company decisions, hiring, project prioritisation. Playing a role in a company is as good as <em>being</em> the role.</p>

<p>When it comes to interviewing, come with a few questions. It’s your chance to learn about the place too, what the day to day is like. Be curious about the team and the work. It’s a good way to demonstrate you are engaged in the process and evaluating your options carefully.</p>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p>Amazon has well documented <a href="https://www.aboutamazon.com/about-us/leadership-principles">leadership principles</a> which they use to interview and make decisions, other companies will have different principles and criteria. Where you can, find what they are, but you can also prep in this way for questions that you expect to get. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:2" role="doc-endnote">
      <p>STAR just kind of works. It’s a great way to trick yourself into being coherent, and makes it a lot easier for others to follow your thoughts. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name></name></author><category term="technical" /><category term="software-engineering" /><category term="swe" /><category term="intern" /><category term="advice" /><summary type="html"><![CDATA[This is my rough sketch of advice for people trying to get their first role in the tech industry. A lot of this is synthesized and regurgitated from what others have told me, so might be pretty recognizable! Cracking the Coding Interview is a really excellent resource for all of this too.]]></summary></entry></feed>