Implementing Poptrie in Lua and DynASM
Max Rottenkolber <max@mr.gy>
When routing IP packets it is common to do a longest prefix match (LPM) lookup on the destination address to find a suitable next hop for a packet. Networks are denoted by address prefixes which are matched against the address fields of packets in order to determine where they are headed or where they come from.
Poptrie is a high-performance LPM lookup algorithm designed for general purpose computers. What makes it perform well on regular CPUs, you ask? Well, it lays out the lookup structure in a way that minimizes memory accesses, which are notoriously slow. If we are to route packets at high speeds in software we need to keep our business in fast memory—otherwise known as the cache.
One reason to choose Poptrie is that it is sufficiently general to work on prefixes of arbitrary lengths. Hence, it is suitable for both IPv4 and IPv6 addresses, unlike say DIR-24-8, which targets prefixes of up to 32 bits exclusively. It would be neat if we could use a single, highly optimized lookup routine for all our routing needs (and possibly other use cases as well).
A rough overview of Poptrie (go read the paper!)
At its heart, Poptrie uses a compressed trie structure to hold the forwarding information base (FIB), which is networking-speak for a mapping from prefixes (keys) to next hops (values). In this compressed trie each node holds edges for the next k = 6 bits of the prefix, which in turn point to either child nodes or leaves.
Poptrie nodes are efficiently encoded using a trick based on the popcnt instruction which counts the number of bits set in a bit vector. Each node consists of two base offsets—one into an array of nodes, and another into an array of leaves—as well as two 64‑bit vectors that hold compressed information about the edges of the node. Let us call them the node and the leaf vectors.
Since k = 6 bits can represent 64 distinct values, each of the vectors has one bit of information for every possible value of the next k‑bit chunk of the prefix. A set bit at the corresponding position of the node vector indicates that the edge points to a node, and the number of bits set up to and including that position gives the child node’s offset from the node base.
While we could just invert the node vector and then count the set bits to get an offset into the leaves array, we expect the trie to be sparse and leaves to be plentiful—a “no match” is represented by a leaf with value zero. A node with a single edge would imply 63 additional zero leaves in the leaves array. Enter leaf compression. In order to minimize the size of the leaves array, the leaf vector has a bit set at the corresponding position only if the edge points to a leaf and the leaf’s value differs from the value of the preceding leaf. I.e., adjacent identical leaves are deduplicated.
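To make the encoding concrete, here is a minimal Lua sketch of how an edge for the next 6-bit chunk v is resolved to either a child node or a leaf index. It assumes LuaJIT and its bit library and follows the paper’s formulas; it is not taken from lib.poptrie (which uses the POPCNT instruction rather than a Lua loop).

    local bit = require("bit")
    local band, lshift, rshift = bit.band, bit.lshift, bit.rshift

    -- Count the set bits in a 64-bit cdata value. (The optimized lookup
    -- routines use the POPCNT instruction instead of a loop.)
    local function popcnt (v)
       local n = 0
       while v ~= 0 do
          n = n + tonumber(band(v, 1ULL))
          v = rshift(v, 1)
       end
       return n
    end

    -- Resolve the edge for the 6-bit chunk v (0..63) of a node described by
    -- its node/leaf vectors and base offsets. Returns either ("node", index)
    -- or ("leaf", index), the index being relative to the respective array.
    local function resolve_edge (node_vector, leaf_vector, node_base, leaf_base, v)
       local mask = lshift(2ULL, v) - 1 -- covers bit positions 0..v
       if band(rshift(node_vector, v), 1ULL) ~= 0 then
          -- Edge points to a child node: its offset is the number of node
          -- edges at positions up to and including v, minus one.
          return "node", node_base + popcnt(band(node_vector, mask)) - 1
       else
          -- Edge points to a leaf: adjacent identical leaves are deduplicated,
          -- so only set bits in the leaf vector contribute to the offset.
          return "leaf", leaf_base + popcnt(band(leaf_vector, mask)) - 1
       end
    end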
Finally, a direct pointing optimization is prepended to the lookup. It trades memory space for decreased depth of the trie, which translates to reduced branching and pointer chasing. This is the same mechanism employed in DIR-24-8 and DXR. In this scheme an array of size 2^s holds references to the nodes and leaves for the first s bits of prefixes (the Poptrie paper suggests s = 18). In a sense this table of roots allows us to fast-forward the lookup to the chunk at offset s and proceed with the regular trie search from there.
Structure
Besides the lookup routine there is quite a bit of logic required to build the Poptrie lookup structure in the first place. An incremental way of tackling the problem is to first build a simple binary trie to hold the routing information. Following networking-speak (rather inaccurately) our code calls this the RIB (for routing information base). Once we have built the RIB we can convert it to an equivalent, compressed Poptrie structure, the FIB.
Note that no particular consideration was given to optimizing either building the RIB or converting it to a FIB, because we do not necessarily have to perform these operations in the data plane. Instead, we make sure that the final FIB used for lookups is easily serializable and hence portable. I.e., our Poptrie object constructor can be used to create an empty RIB for us to populate and convert to a FIB, but we can also do that in a separate process and instantiate a new Poptrie object from the resulting binary blobs (the nodes and leaves arrays, and the direct map).
Currently, lib.poptrie has the following interface:
- new(init) — Instantiate a new Poptrie object, either empty or from a pre-built FIB. Additionally, init is a map that allows you to tweak things like s or whether you want to enable the direct pointing optimization at all.
- Poptrie:add(key, length, value) — Add prefix key (a uint8_t *) of length bits to the RIB, mapping it to value (a uint16_t, with zero being reserved to indicate “no match”).
- Poptrie:build() — Derive the FIB from the RIB (overwrites the previous FIB).
- Poptrie:lookup32/64/128(key) — Look up key and return the associated value. These routines expect 32‑bit, 64‑bit, and 128‑bit keys respectively.
There is currently no interface to remove individual prefixes or even clear the RIB, although there is no particular reason that prevents this from being implemented.
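To illustrate the interface, here is a rough usage sketch. The exact module path, the name of the init flag, and the key conventions (byte order, whether the lookup routines take a byte pointer) are assumptions for the sake of the example rather than verbatim lib.poptrie usage.

    local ffi = require("ffi")
    local poptrie = require("lib.poptrie")

    -- Build a small FIB mapping 192.168.0.0/16 to next hop 1 and
    -- 192.168.1.0/24 to next hop 2, then query it.
    local t = poptrie.new{direct_pointing=true} -- assumed init flag name
    t:add(ffi.new("uint8_t[4]", 192, 168, 0, 0), 16, 1)
    t:add(ffi.new("uint8_t[4]", 192, 168, 1, 0), 24, 2)
    t:build()

    -- The longest matching prefix for 192.168.1.7 should be the /24.
    local addr = ffi.new("uint8_t[4]", 192, 168, 1, 7)
    print(t:lookup32(addr)) -- expected: 2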
The recurring theme of the code is bootstrapping from “simple, slow, and obviously (ha!) correct” to “more complex but fast”. For instance, we start with an innocent Lua implementation of a binary trie, and transition to tricky but safe Lua code that builds the compressed Poptrie structure. I am not going to go into that hairy piece of code here, but recommend that those who are interested check out the Poptrie:build_node function. Let’s just say I am happy I was able to do this in a high-level language.
Nodes and leaves are allocated on a contiguous chunk of memory. When a chunk is exhausted, a new chunk twice the size of the previous chunk is allocated. The old chunk is copied into the new chunk, and the next node or leaf is allocated in the new extra space. The old chunk becomes garbage and is eventually collected. This works because all references to leaves or nodes are relative to the pointer to their backing chunk, so you can just swap out chunks by updating the pointers to the nodes and leaves arrays respectively.
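A minimal sketch of this growth strategy, using the LuaJIT FFI; the element type and sizes below are placeholders, not lib.poptrie’s actual layout:

    local ffi = require("ffi")

    -- Double the capacity of a backing chunk, copying the old contents into
    -- the new allocation. References into the chunk stay valid because they
    -- are indices relative to the (updated) base pointer, not raw addresses.
    local function grow (chunk, capacity, element_t)
       local new_capacity = capacity * 2
       local new_chunk = ffi.new(ffi.typeof("$[?]", element_t), new_capacity)
       ffi.copy(new_chunk, chunk, ffi.sizeof(element_t) * capacity)
       return new_chunk, new_capacity -- the old chunk becomes garbage
    end

    -- Example: grow a chunk of leaves (uint16_t next hop values).
    local leaf_t = ffi.typeof("uint16_t")
    local leaves, capacity = ffi.new("uint16_t[?]", 4), 4
    leaves, capacity = grow(leaves, capacity, leaf_t)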
It remains visible in the code that it was constructed incrementally. At first lib.poptrie did not implement any of the optimizations described in the paper, which were then added one after another. The leftovers of this scaffolding are toggles that you can switch off to access intermediary versions of the algorithm.
Finally, we generate hand-optimized assembler routines for looking up prefixes of various sizes using DynASM. More on that in the next section.
Credit where credit is due: a lot of the design choices were influenced by Pete Bristow’s earlier work on LPM implementations for Snabb. Notably, the requirement that the compressed FIB is easily copied between processes was inherited from his design. Likewise, the idea of bootstrapping from a simple trie was apparent from looking at the prior LPM implementations.
Lookup
There are three flavors of lookup routines. First, there is a simple Poptrie:rib_lookup routine to look up prefixes in the RIB. It is a dead simple recursive binary trie search, and useful as a reference since the Poptrie lookup routines should return the exact same results when querying the compressed FIB.
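In spirit it looks something like the following sketch; the node fields (left, right, value) and the bit-list key representation are hypothetical stand-ins, not Poptrie:rib_lookup’s actual data structures.

    -- Return the value of the longest prefix matching the key, or nil.
    -- Nodes are Lua tables; bits is the key as a list of 0/1, most
    -- significant bit first.
    local function rib_lookup (node, bits, i, best)
       if node == nil then return best end
       if node.value ~= nil then best = node.value end -- a prefix ends here
       if i > #bits then return best end
       local child
       if bits[i] == 0 then child = node.left else child = node.right end
       return rib_lookup(child, bits, i + 1, best)
    end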
Let us quickly recap how the Poptrie lookup works (a condensed Lua sketch follows the steps below):
1. If we use direct pointing, look up the first s bits in the direct table. Check whether the entry denotes a leaf by testing bit 32 (the leaf tag) of the value. If the entry is a leaf, return early with its value, which is obtained by masking off the leaf tag. Otherwise start the lookup at the node denoted by the entry.
2. Check if the edge denoted by the next k-bit segment of the prefix points to a node. If so, fetch the child node and repeat this step with the child node and the next prefix segment.
3. Fetch the leaf denoted by the k-bit segment and return its value.
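In Lua, the trie walk (steps 2 and 3) looks roughly like the following. This is a condensed sketch, not the actual lib.poptrie implementation: it omits direct pointing, assumes LuaJIT, uses the paper’s field names (vector, leafvec, base0, base1), and leaves the bit extraction to a caller-supplied extract(key, offset, k) function.

    local bit = require("bit")
    local band, lshift, rshift = bit.band, bit.lshift, bit.rshift

    -- Count the set bits in a 64-bit cdata value (POPCNT in the assembly).
    local function popcnt (v)
       local n = 0
       while v ~= 0 do
          n = n + tonumber(band(v, 1ULL))
          v = rshift(v, 1)
       end
       return n
    end

    -- fib.nodes and fib.leaves are zero-based arrays (e.g. FFI arrays);
    -- extract(key, offset, k) must return the k bits of key at bit offset.
    local function lookup (fib, key, extract)
       local node = fib.nodes[0] -- root node (direct pointing omitted)
       local offset = 0
       local v = extract(key, offset, 6)
       -- Step 2: while bit v of the node vector is set, descend to the child.
       while band(rshift(node.vector, v), 1ULL) ~= 0 do
          local mask = lshift(2ULL, v) - 1
          node = fib.nodes[node.base1 + popcnt(band(node.vector, mask)) - 1]
          offset = offset + 6
          v = extract(key, offset, 6)
       end
       -- Step 3: the edge is a leaf; the compressed leaf vector gives its offset.
       local mask = lshift(2ULL, v) - 1
       return fib.leaves[node.base0 + popcnt(band(node.leafvec, mask)) - 1]
    end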
Our first implementation is written in Lua. It is not particularly fast, but supports all combinations of optimizations turned on or off and works on prefixes of any length. It is mainly used as a runnable reference for comparison with the x86 assembly implementation.
The DynASM generator for the optimized x86 lookup routines on the other hand is supposed to generate the best code possible, and keeps its working data set in registers. Pete Cawley helped a lot by reviewing the first version of the code and suggesting a bunch of x86 tricks that justify calling the resulting code “optimized”. :-)
Note that to use the optimized lookup routines you must ensure that the RIB and FIB contain no prefixes longer than the keys you will look up. I.e., you would only call Poptrie:lookup32 on a FIB with prefixes of length less than or equal to 32 bits.
The generator is parameterized on prefix/key size as well as on the availability of newer x86 extensions such as BMI2; a simplified sketch of the pattern follows below. That is, we assemble different routines for each key size, and different code depending on which features the executing CPU supports at runtime. This ifdef-esque code is certainly not pretty to look at, but we would rather not leave any performance on the table here.
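The sketch below is illustrative only: it emits assembly mnemonics as strings from plain Lua, whereas the real generator emits machine code through DynASM. The register choices are arbitrary; the point is merely the shape of the key-size and BMI2 branches.

    -- Emit different instruction sequences depending on key size and CPU
    -- features. Illustrative only: strings instead of DynASM directives.
    local function generate_lookup (keysize, have_bmi2)
       local asm = {}
       local function emit (line) asm[#asm+1] = line end

       -- Loading the key differs by key size: 32- and 64-bit keys fit into
       -- a single register, 128-bit keys need two loads.
       if keysize <= 64 then
          emit("mov rdx, [key]")
       else
          emit("mov rdx, [key]")
          emit("mov r8,  [key+8]")
       end

       -- Extracting the next 6-bit chunk: with BMI2 the shift count can live
       -- in any register (SHRX), without it we are stuck with CL.
       if have_bmi2 then
          emit("shrx rax, rdx, rcx")
       else
          emit("mov rax, rdx")
          emit("shr rax, cl")
       end
       emit("and rax, 63") -- keep the low k = 6 bits

       return table.concat(asm, "\n")
    end

    print(generate_lookup(32, false))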
Property testing
Hopefully the routines that build the initial binary trie are so simple as to obviously have no bugs. Assuming we can verify Poptrie:add and Poptrie:rib_lookup by means of unit tests and staring at them, we can property-test our tower from there.
We randomly generate RIBs and then make sure that our Poptrie lookup routines return identical results when querying the corresponding FIB. The Lua version of the Poptrie lookup is useful to compare with the assembly routines. If both return identical, incorrect results then there is most likely a bug in the FIB building logic. If only one of them is incorrect then the bug is probably in that implementation.
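The skeleton of such a check might look like the following; the helper names (make_rib, reference_lookup, optimized_lookup) are hypothetical, the real test lives in lib.poptrie’s self-test.

    -- Generate small random RIBs, build the FIB, and check that a reference
    -- lookup and an optimized lookup agree on every queried key.
    local function check_equivalence (make_rib, reference_lookup, optimized_lookup)
       math.randomseed(0)
       for trial = 1, 1000 do
          local t, keys = make_rib(math.random(1, 3)) -- small RIBs suffice
          t:build()
          for _, key in ipairs(keys) do
             assert(reference_lookup(t, key) == optimized_lookup(t, key),
                    "lookup mismatch in trial "..trial)
          end
       end
    end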
While the testing logic is quite simple (e.g., no test case reduction) we are happy to realize that we apparently were able to find all bugs with RIBs of just up to three entries and minimal prefix lengths (1, 7, 33, 65) to trigger edge cases. Since we are checking the simplest cases first and only iteratively increase test case complexity we end up with minimal examples to reproduce bugs. Neat!
Benchmarking
The library also includes a micro-benchmark to help quantify changes and optimizations. To judge micro-optimizations in the x86 code it appears most useful to ensure that the benchmark’s working set fits into the L3 cache, in order to highlight how the code performs on the CPU pipeline. Cache misses and the ensuing memory latency would mask pressure on the pipeline and could cause us to miss valuable optimizations.
The number of instructions needed per lookup increases linearly with prefix length, but is mostly unaffected by the total number of entries in the Poptrie, at least as long as the set of prefixes you query stays in cache. It seems difficult to make general statements about how memory latency impacts performance because it really depends on the runtime profile of your application when it processes a real-world workload.
However, it is clear that memory access latency will become the bottleneck depending on the size of your routing table and traffic profile. You can tweak the numhit variable, which controls how many distinct entries will be looked up during the benchmark. Doing so shows that the instructions-per-cycle (IPC) ratio takes a dive once the live data outgrows the cache.
Testing in Vita has shown practically equivalent performance when compared to the previously used DIR-24-8 implementation. This is good news for Vita because we need to support IPv6 routes, and it looks like we can do that with a single algorithm without a performance regression in IPv4 routing. Vita might not be a particularly representative test case for other applications though, since it typically uses very small routing tables.
Next Steps
Luke Gorrie brought up the possibility of parallelizing lookups to amortize memory latency. That is, if we could somehow perform multiple lookups in parallel we could do useful work while waiting for the next nodes to be fetched from memory. At least in theory this could help us keep a more steady IPC ratio with random accesses on big routing tables. Maybe someone wants to take a stab at vectorizing the Poptrie lookup routine? :-)