<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://blog.benjscho.dev/feed.xml" rel="self" type="application/atom+xml" /><link href="https://blog.benjscho.dev/" rel="alternate" type="text/html" /><updated>2026-02-13T22:55:36+00:00</updated><id>https://blog.benjscho.dev/feed.xml</id><title type="html">Ben Schofield</title><subtitle>Software developer</subtitle><entry><title type="html">How DSQL Makes Sequences Scale</title><link href="https://blog.benjscho.dev/technical/2026/02/13/dsql-sequences.html" rel="alternate" type="text/html" title="How DSQL Makes Sequences Scale" /><published>2026-02-13T17:15:00+00:00</published><updated>2026-02-13T17:15:00+00:00</updated><id>https://blog.benjscho.dev/technical/2026/02/13/dsql-sequences</id><content type="html" xml:base="https://blog.benjscho.dev/technical/2026/02/13/dsql-sequences.html"><![CDATA[<p>Sequences are one of those Postgres features that you don’t think much about.
You can ask for the next number in the sequence, and you get it. That works
pretty well when you have one machine asking for the next number, but what
about 10,000?</p>

<!--more-->

<p>We’ve just launched Sequence support in DSQL and we’re excited about it. Up
until now our recommendation has been to use UUIDs, and for truly massive
scale it still is. But we recognize that there are plenty of places where
you’d like to use a unique number for an identity on a table.</p>

<p>If you just want to get started with sequences, here’s how you make one
in DSQL:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">CREATE</span> <span class="n">SEQUENCE</span> <span class="n">my_sequence</span> <span class="k">CACHE</span> <span class="mi">65536</span><span class="p">;</span>
<span class="k">SELECT</span> <span class="n">nextval</span><span class="p">(</span><span class="s1">'my_sequence'</span><span class="p">);</span> 
</code></pre></div></div>

<p>Or if you want to use it as an identity in a column:<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup></p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">orders</span> <span class="p">(</span>
  <span class="n">id</span> <span class="nb">BIGINT</span> <span class="k">GENERATED</span> <span class="k">BY</span> <span class="k">DEFAULT</span> <span class="k">AS</span> <span class="k">IDENTITY</span> <span class="p">(</span><span class="k">CACHE</span> <span class="mi">65536</span><span class="p">)</span> <span class="k">PRIMARY</span> <span class="k">KEY</span><span class="p">,</span>
  <span class="n">customer_name</span> <span class="nb">TEXT</span>
<span class="p">);</span>
</code></pre></div></div>
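
<p>Once the table exists, inserts pick up ids from the backing sequence automatically. As a quick usage sketch (assuming <code class="language-plaintext highlighter-rouge">RETURNING</code> behaves as it does in stock Postgres, and with a made-up customer name):</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- The id column is filled in from the identity's backing sequence
INSERT INTO orders (customer_name) VALUES ('Ada') RETURNING id;
</code></pre></div></div>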

<p>This post is going to go into more detail on why sequences look the way
they do in DSQL, but to get started that’s all you need to know!<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup></p>

<h2 id="sequences-in-dsql">Sequences in DSQL</h2>

<p>In single box (or single writer) SQL systems, sequences are very lightweight.
They provide unique values without a unique index and without taking heavy
locks. In Postgres, sequences are stored like any other data: in a table. That
table is then stored for durability on disk, with writes and updates going
through a log for crash recovery. When a backend process calls <code class="language-plaintext highlighter-rouge">nextval()</code> it
reads the value, increments it, and writes the new value back. To avoid going
to disk too much, backends can also cache a number of values, set by the
<code class="language-plaintext highlighter-rouge">CACHE</code> value of the sequence; we’ll come back to that later.</p>
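
<p>You can see that table-like representation for yourself in stock Postgres: selecting from a sequence returns its single row of state. A small illustration (the sequence name here is just a throwaway):</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>CREATE SEQUENCE demo_seq CACHE 50;

-- A sequence is a one-row relation you can read like any other table
SELECT last_value, log_cnt, is_called FROM demo_seq;

-- nextval() reads that row, increments it, and writes it back
SELECT nextval('demo_seq');
</code></pre></div></div>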

<p>In a distributed architecture things are a little less simple, but not by much!
Marc’s <a href="https://marc-bowes.com/dsql-circle-of-life.html">blog on the circle of
life</a> is a good primer on
DSQL’s architecture. When you call <code class="language-plaintext highlighter-rouge">nextval()</code>, the read goes to storage,
checks the latest value, and increments the value before writing the update to
the journal. So far so simple? The important thing to remember is that
getting the next set of values in a sequence goes through a full circle of
life.</p>

<p>In distributed systems, creating scalable applications is a mutual 
responsibility<sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup>. The big problem with scaling sequences is they’re a 
classic <a href="https://marc-bowes.com/dsql-avoid-hot-keys.html">hot key</a>. DSQL’s advantage is that we can support endless horizontal
scaling at every component, but to scale, DSQL needs to be able
to spread your changes out across workers. For typical inserts we do that
by partitioning the data. However, you can’t partition a single-row table.</p>

<h2 id="cache-to-the-rescue">CACHE to the rescue</h2>

<p>Remember the <code class="language-plaintext highlighter-rouge">CACHE</code> value from sequences in Postgres? This sets how many
values a given backend reserves when it has to go to disk. So with <code class="language-plaintext highlighter-rouge">CACHE=3</code>
a backend would fetch 3 values on its first call to <code class="language-plaintext highlighter-rouge">nextval()</code>, which it can
then hand out on subsequent calls without performing extra expensive IO.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>                         ┌─────────────┐
                         │    Disk     │
                         │  seq = 10   │
                         └──────┬──────┘
                                │
           ┌────────────────────┼────────────────────┐
           │                    │                    │
           ▼                    ▼                    ▼
    ┌─────────────┐      ┌─────────────┐      ┌─────────────┐
    │  Backend A  │      │  Backend B  │      │  Backend C  │
    │ cache: 1-3  │      │ cache: 4-6  │      │ cache: 7-9  │
    └─────────────┘      └─────────────┘      └─────────────┘
           │                    │                    │
           ▼                    ▼                    ▼
      nextval()=1          nextval()=4          nextval()=7
      nextval()=2          nextval()=5          nextval()=8
      nextval()=3          nextval()=6          nextval()=9
</code></pre></div></div>

<p>Each backend reserves a chunk of sequence values on its first call.
Subsequent <code class="language-plaintext highlighter-rouge">nextval()</code> calls return from the local cache without going to disk.
If a backend crashes, those values from the sequence are discarded, so higher
cache values can result in gaps in sequences.</p>
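
<p>To make that concrete, here’s an illustrative sketch of what two connections might see with <code class="language-plaintext highlighter-rouge">CACHE 3</code> (the exact values depend on which connection reserves which chunk first):</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- Connection A: first call reserves values 1-3
SELECT nextval('my_sequence');  -- returns 1
-- Connection B: first call reserves the next chunk, 4-6
SELECT nextval('my_sequence');  -- returns 4
-- Connection A again: served from its local cache, no trip to disk
SELECT nextval('my_sequence');  -- returns 2
-- If connection B disconnects now, 5 and 6 are never handed out,
-- leaving a permanent gap in the sequence.
</code></pre></div></div>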

<p>DSQL parallelises our compute in the form of the Query Processor. Instead of
individual backends, statements are executed by QPs. Here, <code class="language-plaintext highlighter-rouge">CACHE</code> functions
the same as in Postgres. When a QP calls <code class="language-plaintext highlighter-rouge">nextval()</code> it gets a cached set
of values, and hands them out. So now to the elephant in the room for DSQL
support: we only support <code class="language-plaintext highlighter-rouge">CACHE=1</code> or <code class="language-plaintext highlighter-rouge">CACHE&gt;=65536</code>.</p>

<p>The point of these values is to highlight the decision for the developer. 
Either you want your sequence to be densely packed and low throughput, or 
you want it to be able to scale. With large cache values (&gt;=65k), sequences
are rarely a bottleneck in DSQL transactions.</p>

<h2 id="what-if-i-dont-need-scale-though">What if I don’t need scale though?</h2>

<p>That is totally fine too! Not every project is trying to hit 100k TPS. We also
know there are plenty of applications where you have a slow rate of inserts
and would prefer a dense, increasing sequence. That’s why DSQL supports
<code class="language-plaintext highlighter-rouge">CACHE=1</code>.</p>

<p>To put some numbers on it, I ran some experiments<sup id="fnref:4" role="doc-noteref"><a href="#fn:4" class="footnote" rel="footnote">4</a></sup>.</p>

<p>I tested each of these with <code class="language-plaintext highlighter-rouge">CACHE 1</code> and <code class="language-plaintext highlighter-rouge">CACHE 65536</code>, and I provided an example
with UUID for a value that doesn’t require coordination. Since a UUID is always
locally generated, there’s no way for it to conflict, so it serves as a good
baseline.</p>

<p>This is what the DDL for my first test looks like:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">CREATE</span> <span class="n">SEQUENCE</span> <span class="n">seq_cache_1</span> <span class="k">CACHE</span> <span class="mi">1</span><span class="p">;</span>
<span class="k">CREATE</span> <span class="n">SEQUENCE</span> <span class="n">seq_cache_65536</span> <span class="k">CACHE</span> <span class="mi">65536</span><span class="p">;</span>
</code></pre></div></div>
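
<p>For the UUID baseline there’s no DDL to set up: each value comes from <code class="language-plaintext highlighter-rouge">gen_random_uuid()</code>, generated locally with no shared state. Roughly speaking, the baseline is just calling:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- No coordination needed: every call produces a fresh value locally
SELECT gen_random_uuid();
</code></pre></div></div>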

<p>Here in the first test I’m creating two sequences, one with CACHE=1, and one
with CACHE=65536. We’re then fetching new values serially, so we’re making one 
request to get a new value, waiting until we get it back, and then making 
another. The majority of the time is spent on the network, waiting for
the request to go from my laptop to DSQL’s QP and back. You’ll notice that
the high cache value is faster, because the QP I’m connected to isn’t having
to fetch an update from Storage every time, but it’s not faster by much.
Comparing with UUID, you can see it’s pretty much the same as our high
cache option.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>------------------------------------------------------------
Experiment 1: Individual nextval() calls (100 iterations)
------------------------------------------------------------

CACHE=1:
  Mean:  20.26 ms
  Min:   17.94 ms
  Max:   97.10 ms
  Total: 2025.65 ms

CACHE=65536:
  Mean:  13.60 ms
  Min:   11.53 ms
  Max:   26.95 ms
  Total: 1360.24 ms

UUID:
  Mean:  13.31 ms
  Min:   11.81 ms
  Max:   27.11 ms
  Total: 1330.73 ms

Speedup with CACHE=65536 vs CACHE=1: 1.5x faster
Speedup with UUID vs CACHE=1: 1.5x faster
</code></pre></div></div>

<p>Okay great! But that doesn’t really tell us that much about how it scales.
We’re just using a single connection and fetching one value at a time.
Let’s look now at the case of a bulk insert. So here, we’re inserting 1000
rows into a table with a sequence:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">-- Creating the table</span>
<span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">bench_table</span> <span class="p">(</span><span class="n">id</span> <span class="nb">BIGINT</span><span class="p">,</span> <span class="k">data</span> <span class="nb">TEXT</span><span class="p">);</span>

<span class="c1">-- My insert statements look like this:</span>
<span class="k">INSERT</span> <span class="k">INTO</span> <span class="n">bench_table</span> <span class="p">(</span><span class="n">id</span><span class="p">,</span> <span class="k">data</span><span class="p">)</span>
<span class="k">SELECT</span> <span class="n">nextval</span><span class="p">(</span><span class="s1">'seq_cache_1'</span><span class="p">),</span> <span class="s1">'row '</span> <span class="o">||</span> <span class="k">g</span>
<span class="k">FROM</span> <span class="n">generate_series</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1000</span><span class="p">)</span> <span class="k">g</span><span class="p">;</span>
</code></pre></div></div>
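
<p>For comparison, the UUID variant of the same bulk insert looks something like the sketch below (I’m assuming a separate table with a UUID id column here; the exact harness is in the gist linked in the footnotes):</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- Same 1000-row insert, but with locally generated ids
CREATE TABLE bench_table_uuid (id UUID, data TEXT);

INSERT INTO bench_table_uuid (id, data)
SELECT gen_random_uuid(), 'row ' || g
FROM generate_series(1, 1000) g;
</code></pre></div></div>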

<p>We’re still running on one connection, but now we’re running a bulk insert of
1000 rows instead of fetching the nextval a bunch of times. So what does that
look like?</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>------------------------------------------------------------
Experiment 2: Bulk INSERT with 1000 rows
------------------------------------------------------------

CACHE=1:     6229.11 ms total
CACHE=65536: 83.16 ms total
UUID:        77.24 ms total

Sequence CACHE=65536 vs CACHE=1: 74.9x faster
UUID vs CACHE=1: 80.6x faster
</code></pre></div></div>

<p>Well that’s a big difference! The reason for this is that incrementing a
sequence doesn’t follow the transaction semantics that we have for other values.
It would be strange if something like:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">BEGIN</span><span class="p">;</span>
<span class="k">select</span> <span class="n">nextval</span><span class="p">(</span><span class="s1">'my_seq'</span><span class="p">);</span> <span class="c1">-- returns 4</span>
<span class="k">ROLLBACK</span><span class="p">;</span>
<span class="k">select</span> <span class="n">nextval</span><span class="p">(</span><span class="s1">'my_seq'</span><span class="p">);</span> <span class="c1">-- returns 4 again ??</span>
</code></pre></div></div>

<p>were to happen. To preserve the expectation that sequences always return a
unique, incrementing value, under the hood the values are fetched by special internal
transactions. This means that every call to fetch a nextval goes through
the DSQL circle of life. With that in mind, the results for the bulk insert on
a single connection make sense! For <code class="language-plaintext highlighter-rouge">CACHE=1</code>, even though we’re only on one
connection, the QP has to go through the full loop for each row: fetching
a value from storage, writing back to the journal, and waiting for the transaction
to finish before the next value can be read. With a large CACHE value, our
QP only needs to do that once. This is on a single region cluster, but on a
multi-region cluster the difference would be even more marked, because we’d
need to wait for the write to be committed to our second region.</p>
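
<p>As a rough sanity check on those numbers: 6229 ms over 1000 rows works out to about 6 ms per row, i.e. roughly one full internal round trip per value, whereas the large cache and UUID runs pay their fixed costs once for the whole statement.</p>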

<p>Now, that was still on a single connection. What about when we actually want
to <em>scale</em>? How do sequences behave when we throw more connections at them?
This experiment is the same as experiment one, except now we’re running it
contested. Instead of just one connection, let’s create 100, and have each of
those fetch 100 next values:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>------------------------------------------------------------
Experiment 3: Conflict test (100 workers x 100 nextvals each)
------------------------------------------------------------

CACHE=1:
  Total time:  51589.87 ms
  Throughput:  193.8 calls/sec
  Mean:        353.03 ms
  Min:         17.16 ms
  Max:         21182.18 ms
  Errors:      0

CACHE=65536:
  Total time:  3020.29 ms
  Throughput:  3310.9 calls/sec
  Mean:        17.46 ms
  Min:         10.57 ms
  Max:         1538.64 ms
  Errors:      0

UUID:
  Total time:  2902.25 ms
  Throughput:  3445.6 calls/sec
  Mean:        15.00 ms
  Min:         10.86 ms
  Max:         1439.94 ms
  Errors:      0

Throughput speedup with CACHE=65536 vs CACHE=1: 17.1x
Throughput speedup with UUID vs CACHE=1: 17.8x
</code></pre></div></div>

<p>The throughput difference is again just as marked. In the <code class="language-plaintext highlighter-rouge">CACHE=1</code> case,
the majority of internal transactions to fetch a cache value are conflicting.
DSQL hides the internal details that would otherwise show up as OCC errors;
instead the contention shows up as latency, just as conflicts would in regular
Postgres. With high cache values we have almost no contention. Comparing to our
baseline of UUIDs we can see the difference is minimal.</p>

<h2 id="so-what-do-i-use">So what do I use?</h2>

<p>If you want to use a sequence, our recommendation is to use a high cache
value. It’s going to keep up with your scale and avoid being a bottleneck
in your system. If you really want densely packed sequences and you don’t 
expect your table to ever be running higher than a few transactions 
per second, then <code class="language-plaintext highlighter-rouge">CACHE=1</code> will work just fine. If you change your mind or
see it becoming a blocker down the line, you can always go back and fix it with:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">ALTER</span> <span class="n">SEQUENCE</span> <span class="n">my_seq</span> <span class="k">CACHE</span> <span class="mi">65536</span><span class="p">;</span>
</code></pre></div></div>

<p>But if you <em>truly</em> don’t want to worry about scale, just use UUIDs:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">orders</span> <span class="p">(</span>
  <span class="n">id</span> <span class="n">UUID</span> <span class="k">DEFAULT</span> <span class="n">gen_random_uuid</span><span class="p">()</span> <span class="k">PRIMARY</span> <span class="k">KEY</span>
  <span class="c1">--...</span>
<span class="p">);</span>
</code></pre></div></div>

<hr />

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p>If you’re looking for <code class="language-plaintext highlighter-rouge">SERIAL</code> support, there are <a href="https://www.naiyerasif.com/post/2024/09/04/stop-using-serial-in-postgres/">a lot of reasons not
to use
it</a>.
<code class="language-plaintext highlighter-rouge">SERIAL</code> is essentially a wrapper over a sequence with <code class="language-plaintext highlighter-rouge">CACHE=1</code>. We decided
that a default <code class="language-plaintext highlighter-rouge">CACHE</code> of 1 was a performance footgun that it’s worth
protecting customers from. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:2" role="doc-endnote">
      <p>For more details on sequences and how they work in DSQL, you can see the <a href="https://docs.aws.amazon.com/aurora-dsql/latest/userguide/sequences-identity-columns.html">documentation here</a> and the <a href="https://docs.aws.amazon.com/aurora-dsql/latest/userguide/create-sequence-syntax-support.html">supported syntax</a>. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:3" role="doc-endnote">
      <p>I like Pat Helland’s <a href="https://www.cidrdb.org/cidr2024/papers/p63-helland.pdf">BIG
DEAL</a> paper for
  discussing the deal between infra providers and app developers here:</p>
      <blockquote>
        <ul>
          <li>Scalable apps don’t concurrently update the same key.</li>
          <li>Scalable DBs don’t coordinate across disjoint TXs.</li>
        </ul>
      </blockquote>
      <p><a href="#fnref:3" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:4" role="doc-endnote">
      <p>You can find the code for this <a href="https://gist.github.com/Benjscho/7573e0e1e6b7cc574c384cd0492cbcb6">available here</a>. <a href="#fnref:4" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name></name></author><category term="technical" /><category term="dsql" /><category term="postgres" /><category term="sequences" /><summary type="html"><![CDATA[Sequences are one of those Postgres features that you don’t think much about. You can ask for the next number in the sequence, and you get it. That works pretty well when you have one machine asking for the next number, but what about 10,000?]]></summary></entry><entry><title type="html">It’s Time to Replace TCP in the Datacenter</title><link href="https://blog.benjscho.dev/papers/2025/01/23/tcp-datacenter.html" rel="alternate" type="text/html" title="It’s Time to Replace TCP in the Datacenter" /><published>2025-01-23T17:00:00+00:00</published><updated>2025-01-23T17:00:00+00:00</updated><id>https://blog.benjscho.dev/papers/2025/01/23/tcp-datacenter</id><content type="html" xml:base="https://blog.benjscho.dev/papers/2025/01/23/tcp-datacenter.html"><![CDATA[<p><em>Summary of <a href="https://arxiv.org/pdf/2210.00714">It’s Time to Replace TCP in the Datacenter</a></em></p>

<p>This position paper from John Ousterhout sets out everything that’s wrong with TCP and exactly how we should fix it. <!--more--> It’s an interesting and purposefully polemical paper<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>. Ousterhout has serious pedigree in distributed systems. He was one of the co-authors of <a href="https://raft.github.io/raft.pdf"><em>In Search of an Understandable Consensus Algorithm</em></a>, the paper that introduced Raft, created the Tcl scripting language, and has led a number of teams to impressive results over the years, so he’s talking from a place of experience.</p>

<p>The paper proposes that there are core issues with TCP that can’t be fixed. It argues that these issues are so core to TCP as to require breaking changes, at which point you might as well fix everything at once. It then goes on to discuss <a href="https://homa-transport.atlassian.net/wiki/spaces/HOMA/overview">Homa</a>, a protocol designed specifically for the datacenter, that fixes all of these issues.</p>

<h2 id="whats-wrong-with-tcp">What’s wrong with TCP?</h2>

<p>Let’s run through the properties that the author sets out as needing to be reworked:</p>

<p>First, TCP is <strong>stream oriented</strong>. Work comes in as bytes, but in the datacenter
it’s typically executed in complete <em>messages</em>, which have to be read from the
stream and reconstructed.  This means messages can’t be rerouted to available
cores. Over time network speeds have increased to the point that server cores
can’t keep up. To make full use of a network link you need to spread the load
equally across cores, but stream orientation makes that difficult to do.
The stream is tied to whichever core is reading from the stream and you either
need to then dispatch work to other cores or process whatever incoming message
you have, blocking further work on the same stream. This quote highlights the
issue well:</p>

<blockquote>
  <p>The fundamental problem with streaming is that the units in which data is received (range of bytes) do not correspond to dispatchable units of work (messages)</p>
</blockquote>

<p>In a similar vein, TCP is also <strong>connection oriented</strong>. This adds overhead: each open remote connection on a Linux server requires around 2000 bytes of state in the kernel. Connections also take non-trivial time to set up, with 1 RTT to connect. While connections made sense previously when clients and hosts were long lived, now many applications are serverless. Paying the connection cost makes less sense in that world.<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup></p>

<p>TCP requires <strong>in order delivery</strong>, although as Ousterhout admits, with a limited amount of reordering allowed. This prevents techniques for reducing load throughout the network, such as packet spraying, where packets are routed via different network pathways, reducing congestion at the hot nodes.</p>

<p><strong>Congestion control</strong> is highlighted as another problem point. TCP’s congestion control is driven by senders reacting to backpressure. This means there must be packet queueing when the network is loaded.</p>

<p><strong>Bandwidth sharing</strong> (or “fair scheduling”) shares bandwidth on a host link equally between active connections. But Ousterhout argues that this impacts short messages disproportionately, leading to much higher tail latencies under load.</p>

<h2 id="why-is-homa-better">Why is Homa better?</h2>

<p>The main thrust of Homa is that it fixes all of these issues. It’s message oriented, so work arrives in dispatchable units. It’s connectionless, so there’s no setup or ongoing overhead. It can be delivered out of order, allowing packet spraying–balancing load evenly across network links. It also lets receivers control congestion through a kind of token bucket method. Senders can only send packets in response to grants from a receiver, so the receiver can limit congestion and use grants to prioritize certain (shorter) messages.</p>

<p><img src="/assets/2025/homa-slowdown.png" alt="Graph displaying the comparative slowdown between Homa, TCP, and DTCP. Homa appears to have a much better slowdown ratio throughout" /></p>

<p>The only data provided in this paper is a graph displaying the 99th percentile slowdown on a loaded network. This took me a little while to parse so I’m going to talk my way through my understanding. As this is the slowdown, it’s graphing the ratio between the latency of messages in an unloaded, vs loaded network. It’s essentially showing how much slower the p99 is when the network is loaded, vs unloaded for each of these protocols, so we can see that for Homa, messages are about 6-10X slower under load for the p99, while for TCP it’s over 100X for small messages, dropping down to a little under 20X for 1M messages.</p>

<p>I found this a slightly confusing way to present the information, but it gets the message across! Homa is clearly designed to benefit tail latencies for smaller messages.</p>

<h3 id="but-what-about-encryption">But what about encryption?</h3>

<p>My big question reading this paper was the lack of any mention of encryption. TCP works very well with TLS and there are so many easy to set up integrations. There’s no excuse or reason to have datacenter traffic containing customer data communicating over plaintext. Although I believe there are existing standards for protocols like this, such as <a href="https://en.wikipedia.org/wiki/Datagram_Transport_Layer_Security">DTLS</a> for datagrams and UDP, it would be great to have some kind of mention about how encryption fits into the picture.</p>

<h3 id="what-else-is-out-there">What else is out there?</h3>

<p>There are other protocols in the space. Discussing this paper with colleagues, I learned about the <a href="https://assets.amazon.science/a6/34/41496f64421faafa1cbe301c007c/a-cloud-optimized-transport-protocol-for-elastic-and-scalable-hpc.pdf">SRD protocol</a>, which is used by Elastic Block Store for high throughput. SRD takes advantage of many of the same improvements that Homa does, such as packet spraying through a network. This means packets can arrive unordered (which the protocol then handles) but the common case is in order. In cases like this, the work is already being done; however, EBS (even within AWS) is relatively unique in its needs. This paper does mention other alternatives, mainly Infiniband, but it doesn’t mention SRD.</p>

<h2 id="conclusion">Conclusion</h2>

<p>I found this a fun read. Like a lot of foundational software, TCP becomes one of those things that you rarely think about <em>not</em> using. It’s a strong default for many good reasons. TCP is incredibly common, almost any host that you can think of has a TCP implementation, and because it’s so common it’s been heavily optimised over the years.</p>

<p>But it’s always worth revisiting building blocks as things change, particularly the most common ones. Saving resources at the lower levels can be so powerful <em>because</em> they are so common. Saving 1% on a $100M cost can justify spending a lot more engineering time than saving 90% on $1000.</p>

<p>However, I don’t know if this is one of those cases. TCP is just so ubiquitous, and so well optimised already, that for 95% of use cases it’s not worth the effort to switch. With every new technology there are new operational scars to learn. The further out on the bleeding edge you are, the more you have to debug yourself. I think for extremely high throughput systems where you have control over the vertical system it will make sense. I definitely agree with the position that integrating any such protocol with a few major RPC frameworks is the best start to get things off the ground. I’ll be interested to see over the next few years how this space continues to develop.</p>

<h2 id="related-reading">Related reading</h2>

<ul>
  <li><a href="https://dl.acm.org/doi/pdf/10.1145/3015146">Attack of the Killer Microseconds</a></li>
  <li><a href="https://assets.amazon.science/a6/34/41496f64421faafa1cbe301c007c/a-cloud-optimized-transport-protocol-for-elastic-and-scalable-hpc.pdf">A Cloud-Optimized Transport Protocol for Elastic and Scalable HPC</a></li>
  <li><a href="https://systemsapproach.substack.com/p/its-tcp-vs-rpc-all-over-again">It’s TCP vs. RPC All Over Again</a></li>
</ul>
<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p>Don’t blame an old lit grad for forcing the alliteration <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:2" role="doc-endnote">
      <p>I’m not sure I fully agree here as many serverless applications end up keeping the compute around to amortise setup costs. For example, AWS Lambda will keep your function running for a time after execution and reuse the micro-VM if another request comes in. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name></name></author><category term="papers" /><category term="software-engineering" /><category term="distributed-systems" /><category term="academic" /><category term="networking" /><category term="datacenter" /><summary type="html"><![CDATA[Summary of It’s Time to Replace TCP in the Datacenter This position paper from John Ousterhout sets out everything that’s wrong with TCP and exactly how we should fix it.]]></summary></entry><entry><title type="html">Rust is Safe for X</title><link href="https://blog.benjscho.dev/technical/2025/01/10/rust-safe-for-x.html" rel="alternate" type="text/html" title="Rust is Safe for X" /><published>2025-01-10T17:55:00+00:00</published><updated>2025-01-10T17:55:00+00:00</updated><id>https://blog.benjscho.dev/technical/2025/01/10/rust-safe-for-x</id><content type="html" xml:base="https://blog.benjscho.dev/technical/2025/01/10/rust-safe-for-x.html"><![CDATA[<p>I love <a href="https://lwn.net/Articles/995814/">this article from lwn</a> and this conclusion especially:</p>
<blockquote>
  <p>this ability to take a property that the language does not know about and “teach” it to Rust, so that now it is enforced at compile time, is why he likes to call Rust an “X-safe” language. It’s not just memory-safe or thread-safe, but X-safe for any X that one takes the time to implement in the type system.</p>
</blockquote>

<!--more-->

<p>Rust is a language where you can use the type system to enforce safety guarantees at compile time. This isn’t exclusive to Rust, it’s a feature of any language with a strong type system. I think the ergonomics of Rust are particularly well suited to it, definitely over other systems languages.</p>

<p>I find myself coming back to this when we have implementation decisions. For example, if you are working on a service that has to handle encryption keys, you can use the type system to craft APIs and structs that simplify their handling. You can use a struct to ensure that your keys are never logged accidentally, by creating custom <code class="language-plaintext highlighter-rouge">Display</code> and <code class="language-plaintext highlighter-rouge">Debug</code> implementations that censor the plaintext. If you have multiple different kinds of encryption keys, you can craft APIs that require a key of type <code class="language-plaintext highlighter-rouge">Key&lt;ComponentA&gt;</code>, so you can’t accidentally pass in the key for <code class="language-plaintext highlighter-rouge">ComponentB</code>. There are all sorts of nice things you can do here, and it’s important to take advantage of them!</p>

<p>I think a good rule of thumb is when you find someone saying “as long as…”. As long as we use it in this way… As long as we pass it in the same way we receive it… That just means we know how we should use it, so we should encode that in the type system! People forget, people make mistakes. Lets make it easy on ourselves by making it <em>harder</em> to make mistakes than to just use the API as intended.</p>]]></content><author><name></name></author><category term="technical" /><category term="software-engineering" /><category term="rust" /><category term="type-safety" /><summary type="html"><![CDATA[I love this article from lwn and this conclusion especially: this ability to take a property that the language does not know about and “teach” it to Rust, so that now it is enforced at compile time, is why he likes to call Rust an “X-safe” language. It’s not just memory-safe or thread-safe, but X-safe for any X that one takes the time to implement in the type system.]]></summary></entry><entry><title type="html">Kafka: a Distributed Messaging System for Log Processing</title><link href="https://blog.benjscho.dev/papers/2025/01/06/kafka.html" rel="alternate" type="text/html" title="Kafka: a Distributed Messaging System for Log Processing" /><published>2025-01-06T17:00:00+00:00</published><updated>2025-01-06T17:00:00+00:00</updated><id>https://blog.benjscho.dev/papers/2025/01/06/kafka</id><content type="html" xml:base="https://blog.benjscho.dev/papers/2025/01/06/kafka.html"><![CDATA[<p><em>Summary of <a href="https://www.microsoft.com/en-us/research/wp-content/uploads/2017/09/Kafka.pdf">Kafka: a Distributed Messaging System for Log Processing</a></em></p>

<p>Apache Kafka is a system for streaming logs that prioritises high throughput over strict delivery guarantees. This was an interesting paper to read ~13 years on, as Kafka has become more and more ubiquitous in system design.</p>

<!--more-->

<h3 id="design">Design</h3>

<p>This paper presents Kafka as a relatively simple system, consisting of <strong>brokers</strong> and <strong>consumers</strong>. Brokers receive messages from <strong>producers</strong>. Consumers poll brokers for messages that they care about, which are broken up into topics. Messages are opaque byte strings, allowing consumers to define their own formats. LinkedIn, for example, used <a href="https://en.wikipedia.org/wiki/Apache_Avro">Avro</a> encoding, a binary serialization format with a versioned schema<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>.</p>

<p>Producers send messages to brokers on particular topics. To distribute load across the nodes, topics are sharded into a number of partitions. Each broker stores one or more of these. Consumers subscribe to a topic by creating 1+ message streams, which each provide an iterator over a stream of messages. Producers can publish a message to either a randomly selected partition, or one determined by a partition key and a partition function.</p>

<p><img src="/assets/2025/kafka-design.png" alt="Kafka design" /></p>

<h3 id="let-someone-else-do-the-hard-work">Let someone else do the hard work</h3>

<p>In a few key areas the Kafka developers chose simplicity over complexity in their design, letting someone else do the hard work. This first comes up in the caching (or lack of). They decided to rely on the file system’s page cache instead of caching messages in memory. This has the benefit of avoiding double buffering as well as maintaining a cache between broker process restarts. It’s also simpler to implement and maintain.</p>

<p>Similarly, the developers use the sendfile API to eliminate additional copying on the host. This eliminates 2 buffer copies and 1 system call from a standard approach of sending bytes. This adds up to make Kafka more efficient for throughput.</p>

<p>Instead of organising their own consensus mechanism, Kafka uses Zookeeper for the coordination primitives. This saves reinventing the wheel and keeps the coordination of the system much simpler. Zookeeper’s API mimics a simple file system. Paths can be created, have their values set, read, or deleted. Nodes can register to watch a path - meaning a watcher can be notified when the children of a path or its value have been changed. This is used by Kafka for nodes to know when they need to reconfigure.</p>

<p>Paths can also be created as ephemeral, so when the client that creates them disappears the path is automatically removed. By offloading the complexity of this management to Zookeeper, Kafka again prioritises simplicity. Instead of having a centralized main node, the consumers and producers can coordinate in a decentralized way.</p>

<p>Whenever brokers or consumers are added, a rebalancing process is triggered, which spreads the partitions of a topic over the new set of consumers. This is a pretty simple algorithm that runs deterministically on each consumer. Because Kafka only guarantees “at least once” delivery, there are no real correctness issues that crop up here. But it’s important for developers to be aware that they should implement their own idempotency mechanism if it’s needed.</p>

<h3 id="pull-vs-push">Pull vs Push</h3>

<p>Kafka operates on a “pull” model for messages, meaning consumers are in charge of the state necessary to pull messages. Each message is stored in the broker’s log and identified by its offset within that log. Logs are partitions of topics, implemented as a set of files of roughly the same size.</p>

<p>Instead of giving each message a unique ID, it is identified by its offset in the log file. This is a pretty interesting implementation! It means that the brokers have much less state or information to manage for each message. Consumers start reading messages from the queue, and request the next message by sending the offset they have consumed up to. This means that the brokers don’t need to store any state for consumers; they can just feed them the next messages when requested. Another benefit of this is easy checkpointing. If a consumer fails while processing, it can pick up from its last successful checkpoint.</p>

<p>To trim messages, brokers wait for a specific time period, e.g., 7 days. This allows replaying messages over a longer term for consumers; they can just roll back to an earlier point in the queue. Since message queues are saved to files on the brokers, the only added pressure is to disk utilization – which is a lot cheaper than memory.</p>

<h3 id="why-did-kafka-become-so-successful">Why did Kafka become so successful?</h3>

<p>I struggled to find any hard data showing the market adoption of Kafka aside from <a href="https://survey.stackoverflow.co/2023/#section-most-popular-technologies-other-frameworks-and-libraries">this 2023 Stack Overflow survey</a> where 10% of professional devs reported using it - however that still put it behind RabbitMQ at 12%. There are a lot of articles claiming that Kafka has ‘won’ the message queueing space, and it certainly seems to be used more and more each year, either as a system or a compatible protocol.</p>

<p>There are probably a lot of people better qualified to draw conclusions here than I am. From my understanding, the simplicity of Kafka seems really advantageous. As an API, it’s super simple to work with. You have a continuous iteration of messages which simply hangs when waiting for new ones along the stream. It’s also quite powerful for development. Thanks to the pull model, consumers can replay messages for up to a configurable time limit. This means from an operations standpoint, it’s much simpler to re-read a queue than it is to re-push a queue of messages (as you would in the case of SNS).</p>

<p>The performance benefits of this approach are pretty clear from the paper, at least against the existing top log processors of the time. Since 2011 a number of other data streaming services have emerged, both open and closed source. Kafka’s by no means the only option, but it’s a good example of what the authors put forward: by having a specialized system you can get a good deal of extra performance.</p>
<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p>I think whatever format you use, versioning or a schema definition is a pretty good choice to have! The paper describes a system where producers and consumers could load the schemas from a lightweight schema registry, which is neat. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name></name></author><category term="papers" /><category term="software-engineering" /><category term="distributed-systems" /><category term="academic" /><summary type="html"><![CDATA[Summary of Kafka: a Distributed Messaging System for Log Processing Apache Kafka is a system for streaming logs with the aim of providing high throughput over strict guarantees around delivery. This was an interesting paper to read ~13 years on, as Kafka has become more and more ubiquitous in system design.]]></summary></entry><entry><title type="html">Prompting Experiments</title><link href="https://blog.benjscho.dev/technical/2024/11/16/code-prompting.html" rel="alternate" type="text/html" title="Prompting Experiments" /><published>2024-11-16T10:00:00+00:00</published><updated>2024-11-16T10:00:00+00:00</updated><id>https://blog.benjscho.dev/technical/2024/11/16/code-prompting</id><content type="html" xml:base="https://blog.benjscho.dev/technical/2024/11/16/code-prompting.html"><![CDATA[<p>I’m on vacation, so I’m getting some time to do things that interest me in between time spent with family and recharging. As part of that, I wanted to write a blog on <a href="https://github.com/tokio-rs/turmoil">turmoil</a> and how to use it for testing, and I’ve ended up yak-shaving my way into making a preprocessor for mdBook to compile examples that use external dependencies. I say making instead of writing, because I prompted my way to a solution.</p>

<!--more-->

<p>This took me all of 8:00am to 9:30am today to get a working solution. This feels very much in a similar vein to <a href="https://www.linkedin.com/posts/marc-brooker-b431772b_one-thing-ive-enjoyed-in-the-run-up-to-re-activity-7255614212155006976-k47m?utm_source=share&amp;utm_medium=member_desktop">Marc Brooker’s</a> experience with gen AI coding tools. It’s great for small, greenfield tasks where you want a solution, but don’t have the time to dive into all of the weeds yourself.</p>

<p>Here I was using Claude 3.5 Sonnet to iterate. I’ve added the chatbot log to the repo to keep a record of what it was like to actually iterate and reach the solution. There’s a few things I need to do first, but hopefully I’ll be able to share it soon.</p>

<p>The code itself isn’t particularly long, and it’s mostly glue that’s supported by incredible open source projects (mdBook and the whole Rust ecosystem). However, I still find it super impressive that this is where we are at. It feels like a step change in tooling. I’m having more and more of these moments when it comes to reaching to LLMs to help me fill a gap in my available tools.</p>

<p>Long term, would I blindly use this crate? No. I think I would go through the code with a line by line review before publishing, add some TODOs and make notes of the sketchier parts that are likely to bite you. But to get a solution off the ground in no time at all, it’s awesome.</p>

<p>Now I can get back to the blog writing I actually meant to do.</p>]]></content><author><name></name></author><category term="technical" /><category term="software-engineering" /><category term="gen-ai" /><category term="claude-3.5" /><category term="anthropic" /><summary type="html"><![CDATA[I’m on vacation, so I’m getting some time to do things that interest me in between time spent with family and recharging. As part of that, I wanted to write a blog on turmoil and how to use it for testing, and I’ve ended up yak-shaving my way into making a preprocessor for mdBook to compile examples that use external dependencies. I say making instead of writing, because I prompted my way to a solution.]]></summary></entry><entry><title type="html">Anvil: Verifying Liveness of Cluster Management Controllers</title><link href="https://blog.benjscho.dev/papers/2024/10/15/anvil-liveness.html" rel="alternate" type="text/html" title="Anvil: Verifying Liveness of Cluster Management Controllers" /><published>2024-10-15T23:00:00+00:00</published><updated>2024-10-15T23:00:00+00:00</updated><id>https://blog.benjscho.dev/papers/2024/10/15/anvil-liveness</id><content type="html" xml:base="https://blog.benjscho.dev/papers/2024/10/15/anvil-liveness.html"><![CDATA[<p><em>Summary of <a href="https://www.usenix.org/conference/osdi24/presentation/sun-xudong">Anvil: Verifying Liveness of Cluster Management Controllers</a></em></p>

<p>Wouldn’t it be nice to write some software and confidently say that you know it’s right? That, as long as some assumptions about the world hold, it’s going to do exactly what you want it to, no matter what strange permutations or combinations of failures happen. In broad strokes that’s the promise of formal verification and proofs in software.</p>

<!--more-->

<p>This paper, in particular, is about formally verifying controllers. More specifically, how to verify liveness for Kubernetes cluster controllers. While the researchers start there, I think their contributions are more broadly applicable for other control plane architectures too.</p>

<h2 id="whats-tla">What’s TLA?</h2>

<p>Let’s start with a quick diversion/refresher, feel free to skip if you already know! TLA stands for temporal logic of actions. It’s a way of thinking, step by step, how things change over time as the result of actions. Leslie Lamport came up with the field as a way to describe concurrent programs in a formal, logical way.</p>

<p>We use formal methods like TLA to be able to make concrete mathematical statements about things that we can then prove or disprove. Something like “the ground is dry” is a logical proposition about the world. If then an action happens like “it rains” we can know that in the next state “the ground is dry” is false. TLA allows us to draw these same kinds of logical conclusions about concurrent systems.</p>

<p>It’s really useful when we want to describe properties of systems. These could be safety properties, or liveness properties. A safety property says that our system doesn’t do a bad thing, while a liveness property says that our system eventually does a good thing, or the thing we want it to do. Liveness is a tricky thing! Safety properties are usually easier to prove because a system that does nothing can be perfectly safe. Liveness properties require the system to keep making progress towards a goal.</p>

<h2 id="what-does-this-paper-provide">What does this paper provide?</h2>

<p>This paper provides a property called <strong>eventually stable reconciliation</strong> (ESR). They claim that this property precludes a lot of bugs in controllers. It also presents <em>Anvil</em>, a framework for developing kubernetes controllers and verifying that they meet this property. They then use Anvil to verify three Kubernetes controllers and show they have comparable performance to their un-verified counterparts.</p>

<p>ESR is a TLA formula, it’s a statement about a system that can be expressed in TLA. It essentially says that if what the controller wants stops changing, then the system should eventually reach that state.</p>

<p>That seems super reasonable as a goal to me! Intuitively, if a controller is able to update a system to match its desired state then it seems like it’s doing its job. If you don’t have familiarity with controllers or control plane software, you can think of them as being like a thermostat. You set some kind of desired state through configuration (turn the thermometer to 20C), and then the controller makes adjustments to the environment (turns your central heating on/off) while monitoring for changes (checking the current temp) until its desired state is reached.</p>

<p>In the case of a Kubernetes controller, this looks something more like creating service configuration files, or spinning up containerized nodes. The overall principle is the same, but the amount of input or ongoing state increases, and the number of ways in which things can go wrong multiplies.</p>

<p>Formally ESR is represented by the TLA formula:</p>

<p>\(\forall d.\Box(\Box \text{desire}(d) \implies \Diamond \Box \text{match}(d))\)<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup></p>

<p>This asserts that for all desired states $d$, if the controller always desires $d$, then eventually $d$ will always match the current state. It seems clear that this will prevent a whole range of bugs that prevent the controller from moving towards its goal.</p>

<h2 id="how-do-they-prove-things">How do they prove things?</h2>

<p>The authors use <a href="https://github.com/verus-lang/verus">Verus</a>, a language that allows writing proofs in the form of preconditions and postconditions. Verus leverages Rust’s type system and reduces the problem to an SMT-solvable form. The team extended Verus with a set of simple temporal logic constructs to handle the temporal aspects of their proofs.</p>

<p>The advantage of using Verus is that it allows for modular proofs, breaking down complex properties into smaller, more manageable pieces. This approach makes it easier to reason about and verify complex systems like cluster controllers. Like <a href="https://github.com/dafny-lang/dafny">Dafny</a>, and in contrast to a model language like TLA+, it allows verifying the <em>actual code</em> that you run in production.</p>

<p><img src="/assets/2024/10/anvil.png" alt="A diagram from the paper of the anvil workflow" /></p>

<p>The Anvil toolkit provides a good deal of work towards proving ESR for Kubernetes controllers. Proving implementations with Anvil is still a manual process, you still have to write proof code to go with your implementation, but Anvil does provide a lot of handy looking lemmas<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup> to help.</p>

<p>By applying their ESR property and proof methodology, the authors were able to identify and fix bugs in existing controllers that had been missed by extensive testing. This shows the power of formal verification in catching subtle issues that can be nearly impossible to uncover with traditional testing.</p>

<h2 id="the-future">The Future</h2>

<p>A big part of why I find this exciting is that it shows a future where more and more of our software is written proof first. With projects like this, it doesn’t seem hard to imagine a world where we start our software with a given proven core and then extend it to match the service at hand. I think this could be extremely powerful for Control Plane software where the problems are relatively common and abstractable: identify what’s wrong, make changes to fix it.</p>

<p>We’re already seeing other aspects of software trend in this formally verified direction. The AWS crypto team have delivered huge speed improvements by <a href="https://www.amazon.science/blog/formal-verification-makes-rsa-faster-and-faster-to-deploy">formally verifying RSA</a>. By building on the foundation of a correct solution they were able to make more aggressive optimisations.</p>

<p>To be a bit more pedantic, I see this happening for software at the largest scales. At AWS scale events that would be one in a billion are happening every hour. Formal correctness can help eliminate these. A big benefit from that is that, much like with testing, formal verification can give teams the ability to make big changes more confidently. If you know that you have a process to catch issues, it’s much easier to be bold in making big changes.</p>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p>If you’re unfamiliar with TLA, $\forall$ means “for all”, $\Box$ means “always” (in the future) and $\Diamond$ means “eventually”. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:2" role="doc-endnote">
      <p>A lemma (not a <a href="https://en.wikipedia.org/wiki/Lemming">lemming</a>) is a reuseable bit of logic that you can use to prove something else. It’s like a useful function that can convert from thing to another. In the case of proofs, if you want to prove some statement $C$, and you know how to prove that $A$ means $B$ is true, it’s really handy to have something that shows that if $B$ is true then so is $C$. That would be an example of a lemma. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name></name></author><category term="papers" /><category term="software-engineering" /><category term="distributed-systems" /><category term="academic" /><category term="automated-reasoning" /><summary type="html"><![CDATA[Summary of Anvil: Verifying Liveness of Cluster Management Controllers Wouldn’t it be nice to write some software and confidently say that you know it’s right? That, as long as some assumptions about the world hold, it’s going to do exactly what you want it to, no matter what strange permutations or combinations of failures happen. In broad strokes that’s the promise of formal verification and proofs in software.]]></summary></entry><entry><title type="html">MRTOM: Mostly Reliable Totally Ordered Multicast</title><link href="https://blog.benjscho.dev/papers/2024/07/04/mrtom.html" rel="alternate" type="text/html" title="MRTOM: Mostly Reliable Totally Ordered Multicast" /><published>2024-07-04T19:00:00+00:00</published><updated>2024-07-04T19:00:00+00:00</updated><id>https://blog.benjscho.dev/papers/2024/07/04/mrtom</id><content type="html" xml:base="https://blog.benjscho.dev/papers/2024/07/04/mrtom.html"><![CDATA[<p><em>Summary of <a href="https://ieeexplore.ieee.org/document/10272412">MRTOM: Mostly Reliable Totally Ordered Multicast</a></em></p>

<p>This paper is about building a network primitive to speed up consensus protocols. Like other papers in this area, MRTOM builds on the fact that network ordered protocols can have much higher throughput than standard consensus protocols. MRTOM takes this one step further by offloading not just packet ordering, but also the fast path of consensus protocols to programmable switches.</p>

<!--more-->

<h2 id="the-consensus-problem">The consensus problem</h2>

<p>It’s probably a good idea to have a quick refresher on what consensus problems are, and why network ordered protocols can be faster.</p>

<p>Broadly, the problem of consensus is agreeing with a group what’s happened. It’s like if you were sending a series of letters to 20 different people in different countries. Some of your letters might get lost or intercepted before they get there, or one of the people you’re sending to might be missing a letter and would disagree with the others on what you’ve said.</p>

<p>Even worse, the letters need to have an agreed order. If you send three messages, the end meaning might be completely different if they arrive in a different order:</p>
<ul>
  <li>I’ll be arriving at midnight</li>
  <li>I’ll be arriving at noon</li>
  <li>Ignore that last message, the next one will be my new arrival time</li>
</ul>

<p>These three messages have different meanings depending on when they arrive. We could understand that you’ll be arriving at midnight, or at noon, or we could be unsure of what time you’re arriving at all.<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup></p>

<p>Without a way of deciding what letters all of the group has received and in what order, we can’t agree on exactly what you’ve said. Consensus protocols like Paxos solve this by ensuring a consensus group agrees on what messages they’ve received and <em>in what order</em>. Typically a Paxos like protocol has a leader that accepts requests from clients, while followers can be used to read committed data or will forward new requests onto the leader. During the process of agreeing on a message, the members of a Paxos group have to send messages back and forth with each other, which takes time, meaning we can process fewer messages and respond slower than if there was just one recipient.</p>

<p>The benefit of having a group of recipients is resilience: because the data is replicated, one of the recipients can become unavailable and we’re still able to provide that data when someone comes to read it.</p>

<h2 id="if-you-know-what-youre-expecting-next-you-can-act-fast">If you know what you’re expecting next, you can act fast</h2>

<p>Network ordered consensus protocols, like NOPaxos, exploit the fact that if you have a total ordering for incoming client messages, you can get consensus pretty quickly! Typically consensus incurs a penalty, as a group of servers incur a communication overhead in agreeing what they’ve received. Network ordered protocols make the observation that if you have a monotonically increasing dense ID<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup> assigned to incoming requests, 1) it’s very easy to know that you have things in order, and 2) it’s very obvious if you’re missing a message. If the packets arrive reliably and in order<sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup>, then servers in a consensus group don’t need to communicate to agree on what they’ve received and in what order.</p>

<p>Since not doing something is always cheaper and faster than doing something, this lets network ordered protocols do as little as possible. They only have to communicate when packets <em>don’t</em> arrive. This is the separation of the fast and the slow path of network ordered consensus protocols. The fast path is when everything gets there on time and in order, while the slow path happens when a server is missing a message.</p>
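
<p>To make that split concrete, here’s a minimal sketch (my own illustration, not code from any of these papers) of a receiver behind a single sequencer: it stays on the fast path while the dense sequence numbers arrive in order, and drops to the slow path the moment it spots a gap:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Sketch of a receiver behind a single sequencer. Messages carry the dense,
# monotonically increasing sequence number the sequencer stamped on them.
class Receiver:
    def __init__(self):
        self.next_expected = 1
        self.log = []

    def on_message(self, seq_no, payload):
        if seq_no == self.next_expected:
            # Fast path: exactly the message we were waiting for, so no
            # coordination with the rest of the group is needed.
            self.log.append(payload)
            self.next_expected += 1
        elif seq_no &gt; self.next_expected:
            # Slow path: everything from next_expected up to seq_no - 1 is
            # missing, so coordinate with the group to recover or skip it.
            self.request_recovery(range(self.next_expected, seq_no))
        # seq_no &lt; next_expected is a duplicate; ignore it.

    def request_recovery(self, missing):
        print(f"slow path: missing sequence numbers {list(missing)}")

r = Receiver()
r.on_message(1, "a")
r.on_message(2, "b")
r.on_message(4, "d")   # gap: triggers the slow path for sequence number 3
</code></pre></div></div>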

<p>The downside of network ordered protocols is they need something to order the messages. Typically this is a single network switch which all messages to the group have to pass through. This switch will then assign each message a monotonically increasing ID. The negative of this is that we’re introducing a single point of failure (bad) and a scaling bottleneck (also not great). There’s no such thing as a free lunch.</p>

<h2 id="mrtom-design">MRTOM Design</h2>

<p>MRTOM (pronounced Mr. Tom) is based on the observation that the fast path of many consensus protocols “is precisely the reliable, ordered, and acknowledged delivery of messages to a set of nodes”. The core idea of the paper is to offload that fast path to the network, freeing up server capacity for handling the slow path and other application logic.</p>

<p>MRTOM works by having a MRTOM instance, typically a programmable switch, in between clients and the group of servers running the consensus protocol. So far this is the familiar network ordered story: the MRTOM instance provides an ordering for packets, which is then used to speed up consensus.</p>

<p>It differs in two important ways. First, it tries to increase the reliability of delivery by maintaining a loopback of packets. Once a packet has been <code class="language-plaintext highlighter-rouge">ack</code>ed by all servers in the group, MRTOM considers it delivered and can discard it. If it’s not acknowledged within a certain time then MRTOM re-sends the packet.</p>
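
<p>A rough sketch of that loopback, heavily simplified and with names of my own invention: the MRTOM instance buffers each packet until every server in the group has acked it, and re-sends anything that times out:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import time

# Toy model of the delivery loopback: buffer every multicast packet until all
# servers in the group have acked it, re-sending it if an ack deadline passes.
class MrtomInstance:
    def __init__(self, servers, timeout=0.05):
        self.servers = set(servers)     # ids of servers in the group
        self.timeout = timeout
        self.pending = {}               # seq_no -&gt; (packet, acks, last_sent)
        self.seq_no = 0

    def multicast(self, packet):
        self.seq_no += 1
        self.pending[self.seq_no] = (packet, set(), time.monotonic())
        self.send_to_group(self.seq_no, packet)

    def on_ack(self, server, seq_no):
        packet, acks, last_sent = self.pending[seq_no]
        acks.add(server)
        if acks == self.servers:
            # Every server has the packet, so the buffered copy can go.
            del self.pending[seq_no]

    def tick(self):
        now = time.monotonic()
        for seq_no, (packet, acks, last_sent) in self.pending.items():
            if now - last_sent &gt; self.timeout:
                # Not fully acked in time: send it again.
                self.send_to_group(seq_no, packet)
                self.pending[seq_no] = (packet, acks, now)

    def send_to_group(self, seq_no, packet):
        print(f"multicast seq={seq_no} payload={packet!r}")
</code></pre></div></div>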

<p>Second, and more importantly, MRTOM offloads the fast path from the server group, instead running it on the switch through eBPF. This is the main aspect that gives the protocol a big throughput and latency advantage over NOPaxos or other protocols.</p>

<p><img src="/assets/2024/07/ThroughputComp.png" alt="A graph showing a comparison of throughput versus latency for different Paxos implementations in a 3 node setup. MRTOM-Paxos shows lower latency and higher throughput than the other implementations, with NOPaxos being next best." /></p>

<h3 id="ebpf-usage">eBPF usage</h3>

<p>MRTOM allows offloading the fast path of protocols to network switches. The authors do this using eBPF, which is a way of using Linux kernel capabilities without needing to change kernel source code.</p>

<p>Extended BPF developed from Berkeley Packet Filtering, but is mostly referred to as eBPF now, as the capabilities go beyond just packet filtering. eBPF programs let you run verified bytecode inside the kernel. This means they can attach hooks to kernel events without recompiling the kernel or building kernel modules. As a result, we can get very efficient program execution, as it cuts out the syscall middle man and avoids allocations for packet processing, which in turn cuts CPU &amp; memory overheads. eBPF also provides shared data structures (maps) through which kernel-level programs can interact with user space programs, which is how the switch can then push packets into the slow path.</p>

<p>In the case of MRTOM-Paxos, the fast path is integrated into the MRTOM edge interface. This interface handles aggregating acknowledgement responses from servers in the MRTOM group. Once a majority of servers in the group (including the leader) have responded, the switch sends back an acknowledgement to the client. This aggregation reduces the number of messages and coordination between the client and servers in the group.</p>
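
<p>A hypothetical sketch of that aggregation (the real version runs as eBPF at the network edge, and these names are mine, not the paper’s):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Toy model of the switch-side fast path: collect acks per sequence number and
# reply to the client once a majority that includes the leader has responded.
class AckAggregator:
    def __init__(self, group_size, leader_id):
        self.majority = group_size // 2 + 1
        self.leader_id = leader_id
        self.acks = {}          # seq_no -&gt; set of server ids that have acked
        self.replied = set()    # seq_nos already acknowledged to the client

    def on_server_ack(self, server_id, seq_no):
        acked = self.acks.setdefault(seq_no, set())
        acked.add(server_id)
        if (seq_no not in self.replied
                and len(acked) &gt;= self.majority
                and self.leader_id in acked):
            self.replied.add(seq_no)
            self.reply_to_client(seq_no)

    def reply_to_client(self, seq_no):
        print(f"seq {seq_no} committed, acking client")

agg = AckAggregator(group_size=3, leader_id=0)
agg.on_server_ack(1, seq_no=7)       # one follower: not enough yet
agg.on_server_ack(0, seq_no=7)       # leader ack completes the majority
</code></pre></div></div>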

<h3 id="single-switch">Single switch</h3>

<p>This paper uses a single switch, much like NOPaxos, which gives a single bottleneck. However, we’ve recently seen other protocols, like Hydra, demonstrate the ability to scale the number of sequencers while maintaining a total consistent order. I’d like to see a combination of these ideas, to see if there would be a way to merge the multiple sequencer approach with MRTOM’s fast path offloading. My intuition is that it wouldn’t be possible to mix them, given Hydra’s reliance on the receivers reconstructing the order.</p>

<h2 id="summary">Summary</h2>

<p>MRTOM isn’t a real <em>theoretical</em> advance, but it presents a nice practical idea for increasing throughput and decreasing latency. Doing more at the network level, or even on the server while bypassing the typical overhead of the Linux stack with eBPF/XDP, can help us be fast.</p>

<p>However, I’m not sure the industry is trending in this direction. While programmable switches have been around for a while, very few commercial cloud operators will let you use them, and when you can it’s certainly not easy. The trend seems to be toward more abstracted and easily fungible hardware, rather than investing time programming switches that will need to be replaced in 4-5 years.</p>
<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p>Note that in this example we’re not necessarily understanding what the sender meant, we’re just <em>agreeing</em> on what they meant. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:2" role="doc-endnote">
      <p>This just means that we’re counting up (0, 1, 2, 3, …) with each new message, with no gaps in the IDs. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:3" role="doc-endnote">
      <p>or close enough to it that you can find it in your message buffer <a href="#fnref:3" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name></name></author><category term="papers" /><category term="software-engineering" /><category term="distributed-systems" /><category term="academic" /><category term="Paxos" /><category term="NOPaxos" /><summary type="html"><![CDATA[Summary of MRTOM: Mostly Reliable Totally Ordered Multicast This paper is about building a network primitive to speed up consensus protocols. Like other papers in this area, MRTOM builds on the fact that network ordered protocols can have much higher throughput than standard consensus protocols. MRTOM takes this one step further by offloading not just packet ordering, but also the fast path of consensus protocols to programmable switches.]]></summary></entry><entry><title type="html">How Hard is Asynchronous Weight Reassignment?</title><link href="https://blog.benjscho.dev/papers/2024/05/01/async-weights.html" rel="alternate" type="text/html" title="How Hard is Asynchronous Weight Reassignment?" /><published>2024-05-01T14:00:00+00:00</published><updated>2024-05-01T14:00:00+00:00</updated><id>https://blog.benjscho.dev/papers/2024/05/01/async-weights</id><content type="html" xml:base="https://blog.benjscho.dev/papers/2024/05/01/async-weights.html"><![CDATA[<p><em>Summary of <a href="https://arxiv.org/pdf/2306.03185">How Hard is Asynchronous Weight Reassignment?</a></em></p>

<p>Majority quorum systems are useful in providing a simple mechanism for consensus. To accept a value, you need a majority of servers to agree to accepting it. Weighted majority quorum services (WQMS) take this approach and recognise that some servers are going to have better performance than others, so they should get more voting power.</p>

<!--more-->

<p>The main contribution of this paper is defining three ways of reassigning weights in a WQMS. The first two are shown to be as difficult as consensus, while the third can be performed consensus-free. I’m not fully sure how practically useful this is, but that’s also because I don’t know much about the real world uses of WQMS<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>! As far as the paper goes, it’s very clearly written, with some nicely presented proofs.</p>

<h3 id="weight-reassignment">Weight Reassignment</h3>
<p>The paper presents the problem of weight reassignment. Solving this problem means providing an algorithm to update the weights of a static set of servers running a WQMS.</p>

<p>Weight reassignment has some restrictions. With the system, we want to be able to tolerate up to $f$ servers (out of a total $n$) crashing at once. This means that we need to place a limit on the total weight of the $f$ most weighted servers so they have less than half of the total weight. This way, if <em>any</em> $f$ servers fail, the remaining $n - f$ can continue to make progress. We call this restriction <em>integrity</em>.</p>

<p>This means that we can’t reassign weights in a way that would violate integrity. Say we had a quorum of 5 servers, each with a voting weight of $1$, and we wanted to tolerate up to two server failures. If we were to add $2$ to the weight of any of the servers, now the $f$ top weighted servers would have a total weight of $4$, and the total weight of all servers would be $7$. Since the total weight of the top $f$ servers is greater than half of the total weight, this would violate integrity. For brevity going forward, we’ll call the set of the top $f$ servers $F$.</p>
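
<p>That check is easy to state in code. A quick sketch (mine, not the paper’s notation) that reproduces the example above:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Integrity: the f heaviest servers must hold strictly less than half of the
# total weight, so that any f crashes still leave a weighted majority behind.
def satisfies_integrity(weights, f):
    top_f = sum(sorted(weights, reverse=True)[:f])
    return top_f &lt; sum(weights) / 2

weights = [1, 1, 1, 1, 1]                  # five servers, weight 1 each
print(satisfies_integrity(weights, f=2))   # True: 2 &lt; 2.5

weights[0] += 2                            # bump one server up to weight 3
print(satisfies_integrity(weights, f=2))   # False: 3 + 1 = 4 is more than 3.5
</code></pre></div></div>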

<p>There are three other restrictions on the weight reassignment algorithm that are simpler:</p>
<ul>
  <li>Validity-I - when a weight change is proposed, if it violates integrity then a no-op change (a change with zero weight difference) is created. If it <em>doesn’t</em> violate integrity then the proposed change is created.</li>
  <li>Validity-II - there’s an API clients can use to check the weights of a server $s$, <code class="language-plaintext highlighter-rouge">read_changes(s)</code>. When it’s called, a client gets a list of all of the weight changes made to $s$, from which they can reconstruct the current weight of $s$ by summing the changes.</li>
  <li>Liveness - if a server calls <code class="language-plaintext highlighter-rouge">reassign</code>, the operation will eventually complete, and the server will get back a message indicating the set of changes made.</li>
</ul>

<h4 id="equivalence-to-consensus">Equivalence to consensus</h4>

<p>Laid out in this way, the authors go on to show that this problem is equivalent to consensus, meaning it’s at least as difficult to solve. Given we’re using a quorum system to solve consensus, that sounds like it would defeat the point!</p>

<p>I really liked the proof they use to demonstrate this. The authors construct a scenario in which every server proposes a weight change. These changes are constructed such that one, and only one, of the weight changes can succeed with a non-zero weight change. If two or more of the changes succeeded, then integrity would be violated.</p>

<p>For the proof, the servers are divided into two disjoint sets, $F$ and $S \backslash F$, where $F$ is the set of servers $F = \{ s_1, s_2, \dots, s_f \}$. $S$ is the set of all servers, so $S \backslash F$ is the set of all servers not in $F$. Note that the sets have the following sizes: $|S| = n, |F| = f, |S \backslash F| = n - f$. The initial weight of each server $s \in F$ is $\frac{n - 1}{2f}$ while the weight of every server $s \in S \backslash F$ is $\frac{n+1}{2(n - f)}$. We can call the total weight of a set $S$ at time $t$: $\texttt{W}_{S,t}$. Based on the initial weights of each server in the sets, $\texttt{W}_{F,0} = \frac{n - 1}{2f} \times f$ and $\texttt{W}_{S \backslash F,0} = \frac{n+1}{2(n - f)} \times (n - f)$. Since $\frac{n - 1}{2} &lt; \frac{n+1}{2}$, we can see that the total weight of $F$ is less than the total weight of $S \backslash F$ and integrity is satisfied by the initial weights.</p>

<p>Each of the servers $s_i$ proposes a weight change. For the servers in $F$, they propose adding $0.5$ to their weight: $\texttt{reassign}(s_i, 0.5)$, while all servers in $S \backslash F$ propose subtracting $0.5$ from their weight: $\texttt{reassign}(s_i, -0.5)$. From this, we can see that accepting one of these changes would not violate integrity. E.g., if we accept one change from $F$, then the new total weight of $F$ becomes $\frac{n - 1}{2} + 0.5$, which is still less than $\frac{n+1}{2}$. Similarly, if we accept one change from $S \backslash F$, then $\texttt{W}_{S \backslash F,1} = \frac{n+1}{2} - 0.5$.</p>

<p>However, it’s clear that if we accept more than one change in any combination then integrity will be violated. E.g., if we accept one change from both sets, we can see that $\frac{n - 1}{2} + 0.5 = \frac{n+1}{2} - 0.5$, which would violate integrity. Since a change that doesn’t violate integrity has to be accepted, we must accept one and only one change, which is the same as deciding consensus on a value among the group. The paper also provides an algorithm for solving consensus by proposing weight changes and deciding on one, which is fun but probably not as interesting or useful as the equivalence proof.</p>
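
<p>It’s a nice construction to check numerically. Here’s a small sketch of mine doing that for $n = 5, f = 2$: with the initial weights and proposed changes from the proof, accepting any single change preserves integrity, while accepting any two violates it.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from itertools import combinations

def satisfies_integrity(weights, f):
    # The f heaviest servers must hold strictly less than half the total weight.
    return sum(sorted(weights, reverse=True)[:f]) &lt; sum(weights) / 2

n, f = 5, 2
# Initial weights from the proof: the f servers in F, then the n - f in S \ F.
weights = [(n - 1) / (2 * f)] * f + [(n + 1) / (2 * (n - f))] * (n - f)
# Proposed changes: +0.5 for each server in F, -0.5 for each server in S \ F.
changes = [0.5] * f + [-0.5] * (n - f)

for accepted in range(3):                  # accept 0, 1, or 2 of the changes
    outcomes = set()
    for chosen in combinations(range(n), accepted):
        new_weights = [w + (changes[i] if i in chosen else 0)
                       for i, w in enumerate(weights)]
        outcomes.add(satisfies_integrity(new_weights, f))
    print(accepted, outcomes)
# 0 {True}   -- the initial weights satisfy integrity
# 1 {True}   -- any single change keeps integrity
# 2 {False}  -- any pair of changes violates it
</code></pre></div></div>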

<h4 id="pairwise-weight-reassignment">Pairwise weight reassignment</h4>

<p>The authors then tried restricting the problem to make it easier to solve. If it’s hard to arbitrarily reassign weights up and down among the servers, what about only allowing pairs of servers to exchange weight? E.g., for server $s_2$ to gain weight, some other server $s_4$ needs to lose the same amount of weight. This means that the total weight of all servers can remain constant throughout.</p>

<p>Apart from this change, weight reassignment remains the same. The integrity requirement is still in place. Instead of a server proposing $\texttt{reassign}$, they can propose to $\texttt{transfer}(s_i, s_j, \Delta)$, where the $\Delta$ change in weight is taken from $s_i$ and given to $s_j$ if integrity is not violated. If integrity would be violated by the change, then two zero weight changes are created (one for $s_i$ and one for $s_j$).</p>

<p>The authors then show in much the same way that this is also equivalent to consensus. Just like before, they craft a scenario based on the sets of servers $F$ and $S \backslash F$ where one and only one weight $\texttt{transfer}$ can complete with a non-zero weight, showing that the problem is equivalent to consensus. The proof is pretty similar to the previous one so I won’t go in to the exact scenario. Again, they provide an algorithm for consensus based on this impossibility proof.</p>

<h4 id="restricted-pairwise-weight-reassignment">Restricted pairwise weight reassignment</h4>
<p>Finally the authors introduce restricted pairwise weight reassignment, which can be performed <em>without</em> consensus. There are two restrictions they place on transfer:</p>
<ol>
  <li>Only $s_i$ can call $\texttt{transfer}(s_i, *, \Delta)$. So only $s_i$ can transfer away some of its weight</li>
  <li>The weight of $s_i$ has to always be greater than $\frac{\texttt{W}_{S,0}}{2(n - f)}$. That means there’s a floor on the weight of each server: as long as every server in $S \backslash F$ stays above it, together they hold a majority of the total weight.</li>
</ol>

<p>The authors assert that if the first condition holds, then the second is locally verifiable, meaning it can be checked without consensus. I found this argument pretty straightforward! Any server can give away its weight, but only while remaining above the floor weight, leaving all of the servers not in $F$ with enough weight to continue as a quorum. This is proved with the inequality:</p>

\[|S\backslash F| \times \frac{\texttt{W}_{S, 0}}{2(n - f)} = \frac{\texttt{W}_{S, 0}}{2}\]

<p>So $\texttt{W}_{S \backslash F, t} &gt; \frac{\texttt{W}_{S, 0}}{2}$ and integrity is preserved at all times.</p>
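
<p>In code, the local check a server might run before giving weight away is about as simple as it sounds (a sketch of mine, with hypothetical names):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># A server can transfer weight away without consensus as long as its own
# weight stays strictly above the floor W_{S,0} / (2 * (n - f)).
def can_transfer(my_weight, delta, initial_total_weight, n, f):
    floor = initial_total_weight / (2 * (n - f))
    return my_weight - delta &gt; floor

# Five servers with weight 1 each (initial total 5), tolerating f = 2 crashes,
# so the floor is 5/6. A weight-1 server can't give away 0.5 ...
print(can_transfer(1.0, 0.5, initial_total_weight=5, n=5, f=2))   # False
# ... but a server that has accumulated weight 2 can.
print(can_transfer(2.0, 0.5, initial_total_weight=5, n=5, f=2))   # True
</code></pre></div></div>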

<p>The reason no other server can take away another’s weight without consensus is that if two servers both try to take weight from some server $s_i$, either transfer on its own could respect the second condition, but there’s no way for both to succeed, and deciding which one does requires some form of consensus.</p>

<h3 id="thoughts">Thoughts</h3>

<p>The authors proved this result for a static set of servers. It would be great to see if the results could be extended to a dynamic set of servers, but given all of the proofs relied on a static set, I’d imagine that would be hard. At a minimum you could imagine using consensus to decide weights as you add or remove servers, before switching back to consensus-free weight reassignment.</p>

<p>How useful is it? You can easily imagine a scenario where a server doesn’t know that it’s holding up progress for a quorum, and since a server can only give away its own weight, that might prove tricky to use in practice. It would be great to see an experimental evaluation of how helpful this could be.</p>
<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p>I’ve been pointed to <a href="https://dl.acm.org/doi/pdf/10.1145/3447865.3457962">this paper (<em>Read-Write Quorum Systems Made Practical</em>)</a> which seems to be a good read on this. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name></name></author><category term="papers" /><category term="software-engineering" /><category term="distributed-systems" /><category term="academic" /><category term="weights" /><category term="wqms" /><category term="quorum" /><summary type="html"><![CDATA[Summary of How Hard is Asynchronous Weight Reassignment? Majority quorum systems are useful in providing a simple mechanism for consensus. To accept a value, you need a majority of servers to agree to accepting it. Weighted majority quorum services (WQMS) take this approach and recognise that some servers are going to have better performance than others, so they should get more voting power.]]></summary></entry><entry><title type="html">Hydra: Serialization-Free Network Ordering for Strongly Consistent Distributed Applications</title><link href="https://blog.benjscho.dev/papers/2024/04/28/hydra.html" rel="alternate" type="text/html" title="Hydra: Serialization-Free Network Ordering for Strongly Consistent Distributed Applications" /><published>2024-04-28T21:00:00+00:00</published><updated>2024-04-28T21:00:00+00:00</updated><id>https://blog.benjscho.dev/papers/2024/04/28/hydra</id><content type="html" xml:base="https://blog.benjscho.dev/papers/2024/04/28/hydra.html"><![CDATA[<p><em>Summary of <a href="https://www.usenix.org/conference/nsdi23/presentation/choi">Hydra: Serialization-Free Network Ordering for Strongly Consistent Distributed Applications</a></em></p>

<p>Replicated systems pretty much always have some overhead in comparison to unreplicated systems, at least if you want <a href="https://dl.acm.org/doi/10.1145/2500500">strong consistency</a> for your data. We need to do extra work in order to make sure that we get the same result across all nodes. The fastest systems <a href="https://blog.benjscho.dev/papers/2024/03/11/keeping-calm.html">minimise or avoid that coordination</a>, but where we can’t avoid it, we need an algorithm to manage that consensus.</p>

<!--more-->

<p>Network ordered distributed protocols can be surprisingly performant compared to unreplicated systems. Network-Ordered Paxos (NOPaxos) is able to achieve throughput within 2% of an unreplicated system, while for comparison Paxos only achieves around 25%<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>. However, they have drawbacks. NOPaxos requires sending all packets for a consensus group through a single point. This makes it difficult to scale the size of groups within a data centre, and can increase the time to recover when a sequencer fails.</p>

<p>This paper provides an algorithm for consistent network packet ordering with drop detection over a parallel set of sequencers. This means we can get the benefits of packet sequencing (higher throughput and lower latency consensus algorithms) while avoiding the single point of failure in a system.</p>

<h3 id="how-it-works">How it works</h3>
<h4 id="single-sequencer">Single sequencer</h4>

<p>Let’s start with the case of a single sequencer, as in <a href="https://www.usenix.org/system/files/conference/osdi16/osdi16-li.pdf">NOPaxos</a>. A single node sequencer works by maintaining a counter. When it receives a packet from a sender it adds a header with the current counter, increments the counter, and sends<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup> the packet to a group of receivers.</p>

<p>When the receivers get the packet, they can then recreate the same ordering that the sequencer saw pretty simply. Because the packets each have a number that’s monotonically increasing (1, 2, 3…) it’s easy to sort the sequence. Similarly, this is how drop detection comes in. If a receiver has received a set of packets with sequence counts <code class="language-plaintext highlighter-rouge">[1, 2, 4]</code>, then it can tell it’s missing a packet with sequence count <code class="language-plaintext highlighter-rouge">3</code>. The receiver will then send a drop notification to the other members of its group, who can then decide either to permanently ignore the packet, or to accept it and resend it to all the members that are missing it. They do that through an elected leader, which initiates a round of agreement. Leaders are regularly elected in a process similar to Paxos, but we don’t need to go into that to understand the differences with Hydra.</p>

<p>Together this provides <em>consistent ordering</em> and <em>drop detection</em> of packets.</p>

<h4 id="hydra">Hydra</h4>
<p>Hydra takes this protocol and adds the ability to run multiple sequencers. This means you’re not limited by the throughput of a single switch or host when scaling your service. The paper shows a roughly linear increase in throughput as the number of switches increases. If one switch was limited to a throughput of 200k messages per second across a group, then with two you should have a throughput of 400k.</p>

<p>Just like in the single sequencer case, each sequencer maintains a monotonically increasing counter (1, 2, 3…). This counter is maintained locally by each sequencer. When they receive a packet, they add the counter as a header, increment their counter, and forward the packet on to the receiver group. The sequencers also have their own ID, which they add to the packet and which is used to deterministically settle the order of packets at the receivers. If we took this naive approach then we would lose drop detection. Imagine a receiver is getting packets from one sequencer, but missing those from another. How can it tell that a packet from the other sequencer has been dropped?</p>

<p>To resolve this issue, the paper introduces a combination of sequence numbers and physical clocks. Each sequencer also has a physical clock tracking real time. When forwarding packets on, as well as adding the local counter to the packet, the sequencer also adds its current clock value and its sequencer ID. Because the physical clocks are monotonically increasing, the protocol is able to guarantee that each message broadcast by the sequencers has a consistent partial ordering:</p>

<blockquote>
  <p>Partial ordering definition - $\S 4.3.1$</p>

  <p>For messages $m_1$ and $m_2$ sent to the same recipients, with respective clock values $c_1$ and $c_2$, sequenced by sequencers with IDs $i$ and $j$, $m_1$ is ordered before $m_2$ if $c_1 &lt; c_2 \vee (c_1 = c_2 \wedge i &lt; j)$</p>
</blockquote>

<p>This essentially means when a receiver gets two messages from different sequencers, the one with the lower clock value is ordered first, and if the clock values are the same then ties are broken using the sequencer ID.</p>
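
<p>In other words, receivers just sort on the (clock value, sequencer ID) pair. A quick sketch:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hydra's consistent ordering: sort by clock value, then break ties with the
# sequencer ID. Every receiver sorting this way gets the same total order.
def order_key(message):
    return (message["clock"], message["sequencer_id"])

messages = [
    {"clock": 14, "sequencer_id": 2, "payload": "m1"},
    {"clock": 12, "sequencer_id": 1, "payload": "m2"},
    {"clock": 12, "sequencer_id": 2, "payload": "m3"},   # tie on clock = 12
]
for m in sorted(messages, key=order_key):
    print(m["payload"])   # m2, m3, m1 -- the tie goes to the lower sequencer ID
</code></pre></div></div>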

<p>Since the stamps on each packet are consistent for each receiver, the ordering is the same among all of them. So that’s great! But how does that help us with drop detection?</p>

<h4 id="drop-detection">Drop Detection</h4>

<p>This is where the sequence numbers come back in. Remember how the single sequencer scenario uses these to detect drops? Hydra uses them in a similar way. Each Hydra receiver first buffers the messages it gets, and only “delivers” them (logically to the application) once they have determined that no message with a lower clock value from another sequencer will be delivered.</p>

<p>To do that, the receivers track two values for each sequencer: the largest sequence number, and the largest clock value seen in its messages. The receiver will only deliver messages up to the point when it knows that all sequencers have reached a clock value at least that high, so nothing earlier can still arrive.</p>

<p>Let’s take an example where a receiver is listening to two sequencers with IDs $1$ and $2$. If a receiver has received three messages:</p>
<ul>
  <li>$m_1$ that has a clock value $c = 14$, a sequencer ID $s_{id} = 2$ and a sequencer count of $s_c = 1$</li>
  <li>$m_2$ with $c = 12, s_{id} = 1, s_c = 1$</li>
  <li>$m_3$ with $c = 20, s_{id} = 1, s_c = 2$</li>
</ul>

<p>The first two messages can be delivered in the order $m_2, m_1$, because the receiver knows that the time on sequencer 1 is at least 20, and the time on sequencer 2 is at least 14. Therefore the minimum time at all sequencers is 14, and it can deliver all messages with a physical time up to that point. It can’t yet deliver $m_3$ because it hasn’t received a message from sequencer 2 with a time of 20 or greater.</p>

<p>If the receiver then received another message, $m_4$, with $c = 30, s_{id} = 2, s_c = 3$, it would know that it’s missing the message $s_{id} = 2 \wedge s_c = 2$. At that point it delivers a drop notification for $s_{id} = 2 \wedge s_c = 2$ to the application.</p>
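
<p>Putting the pieces together, here’s a heavily simplified sketch (mine, not the paper’s code) of a receiver that buffers messages, delivers everything at or below the lowest clock value it has seen from every sequencer, and flags a drop when a per-sequencer sequence count is skipped:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Simplified Hydra receiver: buffer messages, deliver them in (clock, id) order
# once no sequencer can still produce something earlier, and report a drop when
# a sequencer's per-message count skips a value.
class HydraReceiver:
    def __init__(self, sequencer_ids):
        self.buffer = []                                   # undelivered messages
        self.max_clock = {s: 0 for s in sequencer_ids}     # highest clock seen
        self.next_count = {s: 1 for s in sequencer_ids}    # next expected count

    def on_message(self, seq_id, seq_count, clock, payload):
        if seq_count &gt; self.next_count[seq_id]:
            missing = list(range(self.next_count[seq_id], seq_count))
            print(f"drop notification: sequencer {seq_id}, counts {missing}")
        self.next_count[seq_id] = seq_count + 1
        self.max_clock[seq_id] = max(self.max_clock[seq_id], clock)
        self.buffer.append((clock, seq_id, payload))
        self.try_deliver()

    def try_deliver(self):
        # Anything at or below the smallest clock value seen from every
        # sequencer is safe to deliver: nothing earlier can still arrive.
        watermark = min(self.max_clock.values())
        ready = sorted(m for m in self.buffer if m[0] &lt;= watermark)
        self.buffer = [m for m in self.buffer if m[0] &gt; watermark]
        for clock, seq_id, payload in ready:
            print(f"deliver {payload} (clock={clock}, sequencer={seq_id})")

r = HydraReceiver(sequencer_ids=[1, 2])
r.on_message(seq_id=2, seq_count=1, clock=14, payload="m1")   # buffered
r.on_message(seq_id=1, seq_count=1, clock=12, payload="m2")   # delivers m2
r.on_message(seq_id=1, seq_count=2, clock=20, payload="m3")   # delivers m1, holds m3
r.on_message(seq_id=2, seq_count=3, clock=30, payload="m4")   # flags drop of count 2, delivers m3
</code></pre></div></div>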

<p>Because we’re now waiting for updates from all sequencers before we can deliver messages, it’s clear there could be some issues. What if one sequencer gets fewer messages to forward on than others? A receiver could be waiting a while to get an update from a slow sequencer, while a load of messages from other sequencers are queueing up. To fix this progress issue, the paper presents configurable flush messages. This is a kind of heartbeat notification, where the sequencer sends a packet with its current physical clock, and its current sequence number (without incrementing it). This allows all receivers to be updated on the minimum time across all sequencers, so they can deliver the messages buffered up to that time.</p>

<p>It should be clear enough that the <em>safety</em> of this protocol isn’t affected by clock drift, just the performance. If the sequencers have a huge difference in their physical clocks then receivers may be waiting a long time for all sequencers to catch up to a high water mark time. However, the messages still have consistent ordering, and drop detection is not affected by a slow physical clock.</p>

<h4 id="thoughts">Thoughts</h4>

<p>I thought this was a really interesting paper! It was great to dig into the network ordering that enables NOPaxos. The contributions here definitely extend the work in practical and useful directions. It’s pretty rare that you need to implement a system like this, but interesting to know that network ordering can be scaled beyond a single sequencer.</p>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p>Figures from <a href="https://www.usenix.org/conference/osdi16/technical-sessions/presentation/li">Just Say NO to Paxos Overhead</a> <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:2" role="doc-endnote">
      <p>Technically this is a <a href="https://en.wikipedia.org/wiki/Multicast">multicast</a>. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name></name></author><category term="papers" /><category term="software-engineering" /><category term="distributed-systems" /><category term="swe" /><category term="productivity" /><category term="academic" /><category term="NOPaxos" /><summary type="html"><![CDATA[Summary of Hydra: Serialization-Free Network Ordering for Strongly Consistent Distributed Applications Replicated systems typically pretty much always have some overhead in comparison to unreplicated systems, at least if you want strong consistency for your data. We need to do extra work in order to make sure that we get the same result across all nodes. The fastest systems minimise or avoid that coordination, but where we can’t avoid it, we need an algorithm to manage that consensus.]]></summary></entry><entry><title type="html">Breaking into tech: Advice on getting your first role or internship</title><link href="https://blog.benjscho.dev/technical/2024/04/19/breaking-in.html" rel="alternate" type="text/html" title="Breaking into tech: Advice on getting your first role or internship" /><published>2024-04-19T17:00:00+00:00</published><updated>2024-04-19T17:00:00+00:00</updated><id>https://blog.benjscho.dev/technical/2024/04/19/breaking-in</id><content type="html" xml:base="https://blog.benjscho.dev/technical/2024/04/19/breaking-in.html"><![CDATA[<p>This is my rough sketch of advice for people trying to get their first role in the tech industry. A lot of this is synthesized and regurgitated from what others have told me, so might be pretty recognizable! <a href="https://www.amazon.ca/Cracking-Coding-Interview-Programming-Questions/dp/0984782850">Cracking the Coding Interview</a> is a really excellent resource for all of this too.</p>

<!--more-->

<h2 id="getting-an-interview">Getting an interview</h2>

<p>Tech companies look for engineers that can work with data. They want people that know how it can tell a story, how to get the right data, what’s important, and how you can use it to support your arguments. Your CV should reflect this. Try to identify nuggets of data about your previous experiences to include.</p>

<p>What was the impact of your actions in your projects? This doesn’t even have to be technical projects. If you were part of a student society that held events, did you grow their attendance by some %-age year on year?  Companies want to see that you can set goals for improvement and hit them, or at least have awareness of it. Include this in both your CV bullet points and in your interview prep.</p>

<p>When it comes to applications, a referral is at least 10X more effective than applying to a portal – at least when it comes to getting you through the first screening. After that it’s up to you, but it’s the best way to get an introduction to a company. There are so many low-quality submissions coming in through online application portals that it’s very hard for recruiters to screen them, particularly with the rise of job application bots and LLM usage.</p>

<p>If you have a target company you want to work at, try to get a referral instead of applying just through a portal. Look up people that work there on LinkedIn. You can filter by criteria that connect you to them to make the introduction easier: mutual connections, people who went to the same school as you, and so on. That’s not the key part, but it helps build familiarity. Reach out to them with a short message about yourself, what you’re aiming for, and ask if they can refer you.</p>

<p>If there’s a position advertised you’re looking at, reach out to the hiring manager and ask for more information. Hiring managers are much more likely to hire candidates that they know. It’s a good way to show interest and that you’re a real person before putting an application in. This stuff is a little scary at first, but it gets easier when you understand what people want. Hiring managers want good quality candidates that won’t waste their time. Hiring is an incredibly expensive process! You have something to offer them as a friendly, competent, understanding individual.</p>

<h2 id="interview-prep">Interview prep</h2>

<p>Have a 50/50 emphasis split on leadership questions (stories about what work you’ve done in the past and how they show certain characteristics the company is looking for) vs DSA coding skills (Leetcode or whatever other platform). You should prepare the stories for interviews in much the same way you prepare your coding skills.</p>

<p>For coding prep I have no other advice than to do it. I did a bunch prior to my first role, and I really enjoyed a lot of the learning. Leetcode Easy and Mediums will serve you well for entry level roles. It does suck, but it’s a bit of a necessary evil. No one needs you to solve DSA problems under time pressure in your day to day job, but like exams at school, it’s a way of demonstrating you can play the game.</p>

<p>Write out a grid of all of the various experiences you had against all of the public values that your target company advertises<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>. For each box write a few bullet points in the STAR<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup> format (situation, task, action, result) about the story, talking about how you demonstrated that principle. People care most about the action and the result, while the situation and the task are the context so they can understand. You should be able to relay these stories in around 5 minutes.</p>

<p>When it comes to the interview process, <em>performing</em> that you understand and can demonstrate what they want to see <em>is as good as actually doing it</em>. Interviewers want to know that you can play the game of internal values, because that’s how companies organise themselves without pulling in many different directions. They’re used in all company decisions, hiring, project prioritisation. Playing a role in a company is as good as <em>being</em> the role.</p>

<p>When it comes to interviewing, come with a few questions. It’s your chance to learn about the place too, what the day to day is like. Be curious about the team and the work. It’s a good way to demonstrate you are engaged in the process and evaluating your options carefully.</p>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p>Amazon has well documented <a href="https://www.aboutamazon.com/about-us/leadership-principles">leadership principles</a> which they use to interview and make decisions, other companies will have different principles and criteria. Where you can, find what they are, but you can also prep in this way for questions that you expect to get. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:2" role="doc-endnote">
      <p>STAR just kind of works. It’s a great way to trick yourself into being coherent, and makes it a lot easier for others to follow your thoughts. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name></name></author><category term="technical" /><category term="software-engineering" /><category term="swe" /><category term="intern" /><category term="advice" /><summary type="html"><![CDATA[This is my rough sketch of advice for people trying to get their first role in the tech industry. A lot of this is synthesized and regurgitated from what others have told me, so might be pretty recognizable! Cracking the Coding Interview is a really excellent resource for all of this too.]]></summary></entry></feed>