commit 376404dbf6ccaadfad15e5eb0f51423df4581799 Merge: 2371df4 cf2a14d Author: Max Rottenkolber Date: Tue Mar 26 10:12:58 2019 +0100 Merge branch 'blog/poptrie' and publish article commit cf2a14d6f2e21c85322ddd8c9ea98899eefd955f Author: Max Rottenkolber Date: Tue Mar 26 10:12:31 2019 +0100 blog/poptrie: more edits and some credits diff --git a/blog/poptrie-dynasm.mk2 b/blog/poptrie-dynasm.mk2 index 7229eba..4990ac4 100644 --- a/blog/poptrie-dynasm.mk2 +++ b/blog/poptrie-dynasm.mk2 @@ -17,8 +17,8 @@ known as the cache. One reason to choose Poptrie is that it is sufficiently general to work on prefixes of arbitrary lengths. Hence, it is suitable for both IPv4 and IPv6 addresses, unlike say [DIR-24-8](http://tiny-tera.stanford.edu/~nickm/papers/Infocom98_lookup.pdf) -which targets 32‑bit prefixes exclusively. It would be neat if we can use a -single, highly optimized lookup routine for all our routing needs (and +which targets prefixes of up to 32 bits exclusively. It would be neat if we could +use a single, highly optimized lookup routine for all our routing needs (and possibly other use cases as well). < A rough overview of Poptrie (go read the paper!) @@ -171,10 +171,11 @@ function Poptrie:allocate_node () end # - It remains visible in the code that it was constructed incrementally, - initially lacking any of the optimizations described in the paper, which were - then added one after the other. The leftovers of this scaffolding are toggles - that you can switch off to access intermediary versions of the algorithm. + It remains visible in the code that it was constructed incrementally. At first + {lib.poptrie} did not implement any of the optimizations described in the + paper, which were then added one after another. The leftovers of this + scaffolding are toggles that you can switch off to access intermediary + versions of the algorithm. #code# -- Compress RIB into Poptrie @@ -208,6 +209,13 @@ function Poptrie:configure_lookup () unpack(asm_cache[config]) end # + + Credit where credit is due: a lot of the design choices were influenced by + Pete Bristow’s earlier work on [LPM implementations](https://github.com/snabbco/snabb/tree/v2018.09/src/lib/lpm) + for [Snabb](https://github.com/snabbco/snabb). Notably, the requirement that + the compressed FIB is easily copied between processes was inherited from his + design. Likewise, the idea of bootstrapping from a simple trie was apparent + from looking at the prior LPM implementations. > @@ -254,8 +262,8 @@ end for the optimized x86 lookup routines on the other hand is supposed to generate the best code possible, and keeps its working data set in registers. [Pete Cawley](https://www.corsix.org/) helped a lot by reviewing the first - version of the code and subsequently pointing out a bunch of x86 tricks that - justify this code to be called “optimized”. :-) + version of the code and suggesting a bunch of x86 tricks that justify calling + the resulting code “optimized”. :-) Note that to use the optimized lookup routines you must ensure that RIB and FIB contain no prefixes longer than the keys you will look up. I.e., you would @@ -342,7 +350,7 @@ end optimizations. To help quantify micro-optimizations in the x86 code it appears most useful to ensure the benchmark executes in L3 cache in order to highlight how the code performs on the CPU pipeline. 
Cache misses and the ensuing memory - latency masks pressure on the pipeline and could cause us to miss valuable + latency mask pressure on the pipeline and could cause us to miss valuable optimizations. #code Micro-benchmark results with the default parameters on a laptop. commit 86a7f2e4543d1e2ebd1a2e85c56ed3c68e99ac61 Author: Max Rottenkolber Date: Sat Mar 23 17:08:19 2019 +0300 blog/poptrie-dynasm: show ipc in plots diff --git a/blog/poptrie-dynasm.mk2 b/blog/poptrie-dynasm.mk2 index a74a9d5..7229eba 100644 --- a/blog/poptrie-dynasm.mk2 +++ b/blog/poptrie-dynasm.mk2 @@ -339,47 +339,65 @@ end < Benchmarking The library also includes a micro-benchmark to help quantify changes and - optimizations. It seems most useful to ensure the benchmark executes in L3 - cache to highlight how the code performs on the CPU pipeline. Cache misses and - the ensuing memory latency masks pressure on the pipeline and could cause us - to miss valuable optimizations. - - #code Micro-benchmark results with the default parameters on a laptop. Hot - cache.# -PMU analysis (numentries=10000, keysize=32) build: 0.2241 seconds -lookup: 7877.08 cycles/lookup 17605.28 instructions/lookup -lookup32: 60.41 cycles/lookup 99.95 instructions/lookup -lookup64: 60.31 cycles/lookup 99.96 instructions/lookup -lookup128: 70.94 cycles/lookup 118.61 instructions/lookup -build(direct_pointing): 0.1960 seconds -lookup(direct_pointing): 1218.53 cycles/lookup 3033.57 instructions/lookup -lookup32(direct_pointing): 32.84 cycles/lookup 62.59 instructions/lookup -lookup64(direct_pointing): 32.63 cycles/lookup 62.59 instructions/lookup -lookup128(direct_pointing): 34.39 cycles/lookup 66.80 instructions/lookup + optimizations. To help quantify micro-optimizations in the x86 code it appears + most useful to ensure the benchmark executes in L3 cache in order to highlight + how the code performs on the CPU pipeline. Cache misses and the ensuing memory + latency masks pressure on the pipeline and could cause us to miss valuable + optimizations. + + #code Micro-benchmark results with the default parameters on a laptop. + Executes within cache.# +PMU analysis (numentries=10000, numhit=100, keysize=32) build: 0.2241 seconds +lookup: 7922.54 cycles/lookup 17583.86 instructions/lookup +lookup32: 61.34 cycles/lookup 99.96 instructions/lookup +lookup64: 60.50 cycles/lookup 99.95 instructions/lookup +lookup128: 70.93 cycles/lookup 118.62 instructions/lookup +build(direct_pointing): 0.2007 seconds +lookup(direct_pointing): 1281.85 cycles/lookup 3033.97 instructions/lookup +lookup32(direct_pointing): 33.12 cycles/lookup 62.59 instructions/lookup +lookup64(direct_pointing): 33.46 cycles/lookup 62.59 instructions/lookup +lookup128(direct_pointing): 34.72 cycles/lookup 66.80 instructions/lookup # - It is difficult to make general statements about how memory latency impacts + The number of instructions needed per lookup increases linearly with prefix + length, but is mostly unaffected by the total number of entries in the + Poptrie. That is, as long as the set of prefixes you query stays in cache. It + seems difficult to make general statements about how memory latency impacts performance because it really depends on the runtime profile of your - application when it processes a real world workload. Worst case performance - might not necessarily be a particularly interesting case either. However, it is - clear that memory accesses will become the bottleneck depending on your - routing table and traffic profile. + application when it processes a real world workload. 
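  (For orientation: dividing instructions per lookup by cycles per lookup gives the pipeline throughput in instructions per cycle; e.g., {lookup64(direct_pointing)} above retires 62.59 / 33.46 ≈ 1.9 instructions per cycle. This is the IPC figure referred to below.)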
+ + #media Plot showing instructions/lookup vs number of Poptrie entries for + multiple prefix lengths on a Xeon E3-1246v3.# + poptrie-ins-numentries.svg + + However, it is clear that memory access latency will become the bottleneck + depending on the size of your routing table and traffic profile. You can tweak + the {numhit} variable which controls how many distinct entries will be looked + up during the benchmark. Doing so shows that the instructions executed per + cycle ratio (IPC) takes a dive once the live data outgrows the cache. + + #media Plot showing IPC vs the number of unique prefixes queried in a Poptrie + with 100,000 prefixes on a Xeon E3-1246v3.# + poptrie-ins-cycle-numhit.svg Testing in [Vita](https://github.com/inters/vita) has shown practically equivalent performance when compared to the previously used _DIR-24-8_ implementation. This is good news for Vita because we need to support IPv6 - routes, but might not be particularly representative for other applications - as Vita typically uses very small routing tables. + routes, and it looks like we can do that with a single algorithm without a + performance regression in IPv4 routing. Vita might not be a particularly + representative test case for other applications though, since it typically + uses very small routing tables. > < Next Steps Luke Gorrie brought up the possibility of parallelizing lookups to amortize - memory latency. That is, if we could somehow perform multiple lookups - concurrently we could do useful work while waiting for the next nodes to be - fetched from memory. Maybe someone wants to take a stab at vectorizing the - poptrie lookup routine? :-) + memory latency. That is, if we could somehow perform multiple lookups in + parallel we could do useful work while waiting for the next nodes to be + fetched from memory. At least in theory this could help us keep a more steady + IPC ratio with random accesses on big routing tables. Maybe someone wants to + take a stab at vectorizing the Poptrie lookup routine? :-) > commit 1e48fa5ba940153fcf9704006186099b1cf9d421 Author: Max Rottenkolber Date: Wed Mar 20 20:03:49 2019 +0300 blog/poptrie-dynasm: edits diff --git a/blog/poptrie-dynasm.mk2 b/blog/poptrie-dynasm.mk2 index 323e2dd..a74a9d5 100644 --- a/blog/poptrie-dynasm.mk2 +++ b/blog/poptrie-dynasm.mk2 @@ -27,31 +27,30 @@ possibly other use cases as well). structure to hold the _forwarding information base_ (FIB), which is networking-speak for a mapping from prefixes (keys) to next hops (values). In this compressed trie each node holds edges for the next _k_ = 6 bits of the - prefix which again point to either nodes or leaves. + prefix which in turn point to either child nodes or leaves. Poptrie nodes are efficiently encoded using a trick based on the [popcnt](https://www.felixcloutier.com/x86/popcnt) instruction which counts the number of bits set in a bit vector. Each node - consists of two base pointers—one into an array of nodes, and another into an + consists of two base offsets—one into an array of nodes, and another into an array of leaves—as well as two 64‑bit vectors that hold compressed information - about the edges of the node. Let us call them the node vector and the leaf - vector respectively. + about the edges of the node. Let us call them the node and the leaf vectors. 
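  To make the popcount trick concrete before the details below: the following is an illustrative Lua sketch of the node-side indexing rule, not code from {lib.poptrie}. Plain numbers stand in for the 64‑bit vectors, a naive loop stands in for the popcnt instruction, and all names are made up for this example.

  #code#
-- Illustrative sketch: find the child reached via chunk v of the key,
-- given a node with fields .vector (node vector) and .base (node base).
local bit = require("bit")

-- Count the set bits among the low n bits of vec (stand-in for POPCNT).
-- vec is a plain (up to 32-bit) number here; the real vectors are 64-bit.
local function popcount (vec, n)
   local count = 0
   for i = 0, n - 1 do
      count = count + bit.band(bit.rshift(vec, i), 1)
   end
   return count
end

-- A set bit at position v in the node vector means chunk v leads to a
-- child node; its index is the node base plus the number of set bits up
-- to and including position v, minus one.
local function child_index (node, v)
   assert(bit.band(bit.rshift(node.vector, v), 1) == 1)
   return node.base + popcount(node.vector, v + 1) - 1
end
 #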
#media Figure 3 from the Poptrie paper showing the encoding of a node.# poptrie-figure3.png Since _k_ = 6 bits can represent 64 distinct values, each of the vectors has - one bit of information for every possible value of the next 6 bit chunk of the - prefix. A set bit in the corresponding position of the node vector indicates - that the edge points to a node, and the number of bits set up to that position - indicates the child node’s offset from the node base pointer. + one bit of information for every possible value of the next _k_‑bit chunk of + the prefix. A set bit in the corresponding position of the node vector + indicates that the edge points to a node, and the number of bits set up to that + position indicates the child node’s offset from the node base. While we could just invert the node vector and then count the set bits to get an offset into the leaves array, we expect the trie to be sparse and leaves to - be plenty (a “no match” is represented by a leaf with value zero.) A node with + be plenty—a “no match” is represented by a leaf with value zero. A node with a single edge would imply 63 additional zero leaves in the leaves array. Enter _leaf compression_. In order to minimize the size of the leaves array, the leaf vector has bits at the corresponding positions set if the edge points to a - leaf and only if the leaf’s value differs from the value of the preceeding + leaf, and only if the leaf’s value differs from the value of the preceding leaf. I.e., adjacent identical leaves are deduplicated. #media Figure 5 from the Poptrie paper illustrating the _direct pointing_ @@ -74,9 +73,9 @@ possibly other use cases as well). Besides the lookup routine there is quite a bit of logic required to build the Poptrie lookup structure in the first place. An incremental way of tackling the problem is to first build a simple binary trie to hold the routing - information. Following networking-speak rather inaccurately our code calls + information. Following networking-speak (rather inaccurately) our code calls this the RIB (for _routing information base_). After we built the RIB we can - convert it to an equivalent, compressed Poptrie structure (the FIB). + convert it to an equivalent, compressed Poptrie structure, the FIB. Note that no particular consideration was given towards optimizing either building the RIB or converting it to a FIB because we do not necessarily have @@ -90,16 +89,16 @@ possibly other use cases as well). Currently, [lib.poptrie](https://github.com/eugeneia/snabb/blob/poptrie/src/lib/poptrie.lua) has the following interface: - + {new}(_init_) — instantiate a new Poptrie object, either empty or from a + + {new}(_init_) — Instantiate a new Poptrie object, either empty or from a pre-built FIB. Additionally, _init_ is a map that allows you to tweak things - like _s_ or if you want the _direct pointing_ optimization at all. - + {Poptrie:add}(_key_, _length_, _value_) — add prefix _key_ (a {uint8_t *}) + like _s_ or whether you want to enable the _direct pointing_ optimization at + all. + + {Poptrie:add}(_key_, _length_, _value_) — Add prefix _key_ (a {uint8_t *}) of _length_ to the RIB, mapping it to _value_ (a {uint16_t} with zero being reserved to indicate “no match”) 
+ + {Poptrie:build}() — Derive the FIB from the RIB (overwrite the previous FIB) + + {Poptrie:lookup32/64/128}(_key_) — Look up _key_ and return the associated + value. These routines expect 32‑bit, 64‑bit, and 128‑bit keys respectively. There is currently no interface to remove individual prefixes or even clear the RIB, although there is no particular reason that prevents this from being implemented. @@ -124,13 +123,13 @@ function Poptrie:add (key, length, value) end # - The recurring theme of the code is bootstrapping from “simple, slow but - correct” to “more complex but fast”. For instance we start with a innocent Lua - implementation of a binary trie to tricky but safe Lua code that builds the - compressed Poptrie structure. I am not going to go into that hairy piece of - code here, but recommend to checkout the {Poptrie:build_node} function for - those that are interested. Let’s just say I am happy I was able to do this in - a high-level language. + The recurring theme of the code is bootstrapping from “simple, slow, and + obviously (ha!) correct” to “more complex but fast”. For instance we start + with an innocent Lua implementation of a binary trie, and transition to tricky + but safe Lua code that builds the compressed Poptrie structure. I am not going + to go into that hairy piece of code here, but recommend checking out the + {Poptrie:build_node} function for those that are interested. Let’s just say I + am happy I was able to do this in a high-level language. #code# -- Map f over keys of length in RIB @@ -150,12 +149,12 @@ end # Nodes and leaves are allocated on a contiguous chunk of memory. When a chunk - is exhausted a new chunk twice the size of the previous chunk is allocated. - The old chunk is copied into the new chunk, and the next node or leaf are - allocated in the new extra space. The old chunk is eventually garbage - collected. This works because all references to leaves or nodes are relative - to the beginning of their backing chunk, so you can just swap out chunks by - updating the pointers to the nodes and leaves arrays respectively. + is exhausted, a new chunk twice the size of the previous chunk is allocated. + The old chunk is copied into the new chunk, and the next node or leaf is + allocated in the new extra space. The old chunk becomes garbage and is + eventually collected. This works because all references to leaves or nodes are + relative to the pointer to their backing chunk, so you can just swap out + chunks by updating the pointers to the nodes and leaves arrays respectively. #code# function Poptrie:allocate_node () @@ -173,9 +172,9 @@ end # It remains visible in the code that it was constructed incrementally, - initially lacking any of the optimizations described in the paper, and then - adding them one by one. The leftovers of this scaffolding are toggles that you - can switch off to access intermediary versions of the algorithm. + initially lacking any of the optimizations described in the paper, which were + then added one after the other. The leftovers of this scaffolding are toggles + that you can switch off to access intermediary versions of the algorithm. #code# -- Compress RIB into Poptrie @@ -214,11 +213,11 @@ end < Lookup - There are three flavours of lookup routines. First there is a simple + There are three flavors of lookup routines. First there is a simple {Poptrie:rib_lookup} routine to look up prefixes in the RIB. 
It is a dead - simple recurive binary tree search, and useful as a reference since the - Poptrie lookup routines should return the exact same results when querying our - the compressed FIB. + simple recursive binary trie search, and useful as a reference since the + Poptrie lookup routines should return the exact same results when querying the + compressed FIB. Let us quickly recap how the Poptrie lookup works: @@ -227,14 +226,14 @@ end tag) of the value is set. If the entry is a leaf then return early with its value, which is obtained by masking off the leaf tag. Otherwise start the lookup at the node denoted by the entry. - + *2.* Check if the edge denoted by the next _k_ bit segement of the prefix + + *2.* Check if the edge denoted by the next _k_ bit segment of the prefix points to a node. If so fetch the next node and repeat this step with the child node and the next prefix segment. + *3.* Fetch the leaf denoted by the _k_ bit segment and return its value. - Our first implementation is written in Lua. It is not particularily fast, but + Our first implementation is written in Lua. It is not particularly fast, but supports all combinations of optimizations turned on or off and works on - prefixes of any length. It is mainly used as a runnable reference for + prefixes of any length. It is mainly used as a runnable reference for comparison with the x86 assembly implementation. #code# ... @@ -253,10 +252,15 @@ end The [DynASM generator](https://github.com/eugeneia/snabb/blob/poptrie/src/lib/poptrie_lookup.dasl) for the optimized x86 lookup routines on the other hand is supposed to - generate the best code possible, and keeps all of its working set in - registers. [Pete Cawley](https://www.corsix.org/) helped a lot by reviewing - the first version of the code and subsequently pointing out a bunch of x86 - tricks that earned its right to be called “optimized”. :-) + generate the best code possible, and keeps its working data set in registers. + [Pete Cawley](https://www.corsix.org/) helped a lot by reviewing the first + version of the code and subsequently pointing out a bunch of x86 tricks that + justify this code to be called “optimized”. :-) + + Note that to use the optimized lookup routines you must ensure that RIB and + FIB contain no prefixes longer than the keys you will look up. I.e., you would + call {Poptrie:lookup32} on a FIB with prefixes of length less than or equal to + 32 bits. #code# ... @@ -288,12 +292,12 @@ end ... # - The above excerpt shows how its parameterized on prefix/key size as well as on - the availability of certain new x86 features such as [BMI2](https://en.wikipedia.org/wiki/Bit_Manipulation_Instruction_Sets#BMI2_\(Bit_Manipulation_Instruction_Set_2\)). - This {ifdef}-esque code is certainly not pretty to look at, but we are not - leaving any performance on the table here. Note that to use the optimized - lookup routines you must ensure that the FIB contains no keys longer than - the keys you will look up. + The above excerpt shows how the generator is parameterized on prefix/key size + as well as on the availability of certain new x86 features such as [BMI2](https://en.wikipedia.org/wiki/Bit_Manipulation_Instruction_Sets#BMI2_\(Bit_Manipulation_Instruction_Set_2\)). + That is, we assemble different routines for each key size, and different code + depending on which features the executing CPU supports at runtime. This + {ifdef}-esque code is certainly not pretty to look at, but we would rather not + leave any performance on the table here. 
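  Putting the interface together, a minimal usage sketch might look as follows. This is hypothetical code: the call sequence follows the documented interface, but the exact construction options and key conventions may differ.

  #code#
-- Hypothetical usage sketch; names follow the interface described above.
local ffi = require("ffi")
local poptrie = require("lib.poptrie")

local pt = poptrie.new{direct_pointing=true}
local prefix = ffi.new("uint8_t[4]", {192, 168, 0, 0})
pt:add(prefix, 16, 42) -- map 192.168.0.0/16 to next hop 42
pt:build()             -- compress the RIB into the FIB

-- lookup32 is safe here because no prefix in the FIB is longer than 32 bits.
local dst = ffi.new("uint8_t[4]", {192, 168, 7, 1})
assert(pt:lookup32(dst) == 42)
 #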
> @@ -358,14 +362,14 @@ lookup128(direct_pointing): 34.39 cycles/lookup 66.80 instructions/lookup It is difficult to make general statements about how memory latency impacts performance because it really depends on the runtime profile of your application when it processes a real world workload. Worst case performance - might not necessairly a particularily interesting cas either. It is clear - however that memory accesses will become the bottleneck depending on your + might not necessarily be a particularly interesting case either. However, it is + clear that memory accesses will become the bottleneck depending on your routing table and traffic profile. Testing in [Vita](https://github.com/inters/vita) has shown practically equivalent performance when compared to the previously used _DIR-24-8_ implementation. This is good news for Vita because we need to support IPv6 - routes, but might not be particularily representative for other applications + routes, but might not be particularly representative for other applications as Vita typically uses very small routing tables. > commit 9f71ed8592b26daaabb5ad7058447d599a1550e8 Author: Max Rottenkolber Date: Wed Mar 20 17:09:02 2019 +0300 blog/poptrie-dynasm: draft diff --git a/blog/poptrie-dynasm.meta b/blog/poptrie-dynasm.meta new file mode 100644 index 0000000..b7fa03f --- /dev/null +++ b/blog/poptrie-dynasm.meta @@ -0,0 +1,5 @@ +:document (:title "Implementing Poptrie in Lua and DynASM" + :author "Max Rottenkolber " + :index-p nil + :index-headers-p nil) +:publication-date nil diff --git a/blog/poptrie-dynasm.mk2 b/blog/poptrie-dynasm.mk2 new file mode 100644 index 0000000..323e2dd --- /dev/null +++ b/blog/poptrie-dynasm.mk2 @@ -0,0 +1,381 @@ +#media# +hongkong-caratoes.jpg + +When routing IP packets it is common to do a longest prefix match (LPM) lookup +on the destination address to find a suitable next hop for a packet. Networks +are denoted by address prefixes which are matched against the address fields of +packets in order to determine where they are headed or where they come from. + +[Poptrie](http://conferences.sigcomm.org/sigcomm/2015/pdf/papers/p57.pdf) is a +high-performance LPM lookup algorithm designed for general purpose computers. +What makes it perform well on regular CPUs you ask? Well, it lays out the +lookup structure in a way that minimizes memory accesses, which are notoriously +slow. If we are to route packets at high speeds in software we need to keep our +business in the [muffer](https://twitter.com/rygorous/status/1100600336958382081)—otherwise +known as the cache. + +One reason to choose Poptrie is that it is sufficiently general to work on +prefixes of arbitrary lengths. Hence, it is suitable for both IPv4 and IPv6 +addresses, unlike say [DIR-24-8](http://tiny-tera.stanford.edu/~nickm/papers/Infocom98_lookup.pdf) +which targets 32‑bit prefixes exclusively. It would be neat if we can use a +single, highly optimized lookup routine for all our routing needs (and +possibly other use cases as well). + +< A rough overview of Poptrie (go read the paper!) + + At its heart, Poptrie uses a compressed [trie](https://en.wikipedia.org/wiki/Trie) + structure to hold the _forwarding information base_ (FIB), which is + networking-speak for a mapping from prefixes (keys) to next hops (values). In + this compressed trie each node holds edges for the next _k_ = 6 bits of the + prefix which again point to either nodes or leaves. 
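  In other words, a lookup consumes the key six bits at a time, as in this toy Lua fragment (illustrative only; a plain number stands in for the key, while the real code reads keys of up to 128 bits through byte pointers):

  #code#
-- Illustrative only: walk a (toy, up to 32-bit) key k = 6 bits at a time.
local bit = require("bit")
local k = 6

-- Extract the k-bit chunk of key starting at bit offset.
local function chunk (key, offset)
   return bit.band(bit.rshift(key, offset), 2^k - 1)
end

-- A lookup inspects chunk(key, 0), chunk(key, 6), chunk(key, 12), and
-- so on, descending one trie level per chunk.
 #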
+ + Poptrie nodes are efficiently encoded using a trick based on the [popcnt](https://www.felixcloutier.com/x86/popcnt) + instruction which counts the number of bits set in a bit vector. Each node + consists of two base pointers—one into an array of nodes, and another into an + array of leaves—as well as two 64‑bit vectors that hold compressed information + about the edges of the node. Let us call them the node vector and the leaf + vector respectively. + + #media Figure 3 from the Poptrie paper showing the encoding of a node.# + poptrie-figure3.png + + Since _k_ = 6 bits can represent 64 distinct values, each of the vectors has + one bit of information for every possible value of the next 6 bit chunk of the + prefix. A set bit in the corresponding position of the node vector indicates + that the edge points to a node, and the number of bits set up to that position + indicates the child node’s offset from the node base pointer. + + While we could just invert the node vector and then count the set bits to get + an offset into the leaves array, we expect the trie to be sparse and leaves to + be plenty (a “no match” is represented by a leaf with value zero.) A node with + a single edge would imply 63 additional zero leaves in the leaves array. Enter + _leaf compression_. In order to minimize the size of the leaves array, the leaf + vector has bits at the corresponding positions set if the edge points to a + leaf and only if the leaf’s value differs from the value of the preceeding + leaf. I.e., adjacent identical leaves are deduplicated. + + #media Figure 5 from the Poptrie paper illustrating the _direct pointing_ + optimization.# + poptrie-figure5.png + + Finally, a _direct pointing_ optimization is prepended to the lookup. It + trades off memory space for decreased depth of the trie which translates to + reduced branching and pointer chasing. This is the same mechanism as is also + employed in _DIR-24-8_ and [DXR](http://www.nxlab.fer.hr/dxr/). In this scheme + an array of size 2^_s_ holds references to the nodes and leaves for the first + _s_ bits of prefixes (the Poptrie paper suggests _s_ = 18). In a sense this + table of roots allows us to fast-forward the lookup to the chunk at _s_ and + proceed with the regular trie search from there. + +> + +< Structure + + Besides the lookup routine there is quite a bit of logic required to build the + Poptrie lookup structure in the first place. An incremental way of tackling + the problem is to first build a simple binary trie to hold the routing + information. Following networking-speak rather inaccurately our code calls + this the RIB (for _routing information base_). After we built the RIB we can + convert it to an equivalent, compressed Poptrie structure (the FIB). + + Note that no particular consideration was given towards optimizing either + building the RIB or converting it to a FIB because we do not necessarily have + to perform these operations in the data-plane. Instead, we make sure that the + final FIB that is used for lookups is easily serializable and hence portable. + I.e., our Poptrie object constructor can be used to create an empty RIB left + for us to populate and convert to a FIB, but we can also do that in a separate + process and instantiate it with the resulting binary blobs (node and leaves + arrays, and the direct map). 
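  As a sketch of that split, using the interface documented below (the FIB field names here are invented for illustration, and the details of what {new} accepts as a pre-built FIB may differ):

  #code#
-- Hypothetical sketch of the build-elsewhere workflow described above.
-- In a control process: populate a RIB and compress it into a FIB.
local poptrie = require("lib.poptrie")
local pt = poptrie.new{}
-- ... pt:add(key, length, value) for each route ...
pt:build()

-- Hand the resulting binary blobs to the data plane, which instantiates
-- a ready-to-use Poptrie from them without rebuilding anything.
-- (Field names invented here.)
local fib = {nodes = pt.nodes, leaves = pt.leaves, directmap = pt.directmap}
local pt2 = poptrie.new(fib)
 #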
+ + Currently, [lib.poptrie](https://github.com/eugeneia/snabb/blob/poptrie/src/lib/poptrie.lua) + has the following interface: + + + {new}(_init_) — instantiate a new Poptrie object, either empty or from a + pre-built FIB. Additionally, _init_ is a map that allows you to tweak things + like _s_ or if you want the _direct pointing_ optimization at all. + + {Poptrie:add}(_key_, _length_, _value_) — add prefix _key_ (a {uint8_t *}) + of _length_ to the RIB, mapping it to _value_ (a {uint16_t} with zero being + reserved to indicate “no match”) + + {Poptrie:build}() — derive the FIB from the RIB (overwrite the previous FIB) + + {Poptrie:lookup32},{Poptrie:lookup64},{Poptrie:lookup128}(_key_) — look + up _key_ and return the associated value. These routines expect 32‑bit, + 64‑bit, and 128‑bit keys respectively. + + There is currently no interface to remove individual prefixes or even clear + the RIB, although there is no particular reason that prevents this from being + implemented. + + #code# +-- Add key/value pair to RIB (intermediary binary trie) +-- key=uint8_t[?], length=uint16_t, value=uint16_t +function Poptrie:add (key, length, value) + assert(value) + local function add (node, offset) + if offset == length then + node.value = value + elseif extract(key, offset, 1) == 0 then + node.left = add(node.left or {}, offset + 1) + elseif extract(key, offset, 1) == 1 then + node.right = add(node.right or {}, offset + 1) + else error("invalid state") end + return node + end + self.rib = add(self.rib or {}, 0) +end + # + + The recurring theme of the code is bootstrapping from “simple, slow but + correct” to “more complex but fast”. For instance we start with a innocent Lua + implementation of a binary trie to tricky but safe Lua code that builds the + compressed Poptrie structure. I am not going to go into that hairy piece of + code here, but recommend to checkout the {Poptrie:build_node} function for + those that are interested. Let’s just say I am happy I was able to do this in + a high-level language. + + #code# +-- Map f over keys of length in RIB +function Poptrie:rib_map (f, length, root) + local function map (node, offset, key, value) + value = (node and node.value) or value + local left, right = node and node.left, node and node.right + if offset == length then + f(key, value, (left or right) and node) + else + map(left, offset + 1, key, value) + map(right, offset + 1, bor(key, lshift(1, offset)), value) + end + end + return map(root or self.rib, 0, 0) +end + # + + Nodes and leaves are allocated on a contiguous chunk of memory. When a chunk + is exhausted a new chunk twice the size of the previous chunk is allocated. + The old chunk is copied into the new chunk, and the next node or leaf are + allocated in the new extra space. The old chunk is eventually garbage + collected. This works because all references to leaves or nodes are relative + to the beginning of their backing chunk, so you can just swap out chunks by + updating the pointers to the nodes and leaves arrays respectively. + + #code# +function Poptrie:allocate_node () + if self.direct_pointing then + -- When using direct_pointing, the node index space is split into half in + -- favor of a bit used for disambiguation in Poptrie:build_directmap. 
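-- (In other words, node references must not grow into the leaf tag bit, or -- they could no longer be told apart from leaf references in the direct map; -- the assert below catches that overflow.)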
+ assert(band(self.node_base, Poptrie.leaf_tag) == 0, "Node overflow") + end + while self.node_base >= self.num_nodes do + self:grow_nodes() + end + self.node_base = self.node_base + 1 + return self.node_base - 1 +end + # + + It remains visible in the code that it was constructed incrementally, + initially lacking any of the optimizations described in the paper, and then + adding them one by one. The leftovers of this scaffolding are toggles that you + can switch off to access intermediary versions of the algorithm. + + #code# +-- Compress RIB into Poptrie +function Poptrie:build () + self:clear_fib() + if self.direct_pointing then + self:build_directmap(self.rib) + else + self:build_node(self.rib, self:allocate_node()) + end +end + # + + Finally, we generate hand-optimized assembler routines for looking up prefixes + of various sizes using DynASM. More on that in the next section. + + #code# +local asm_cache = {} + +function Poptrie:configure_lookup () + local config = ("leaf_compression=%s,direct_pointing=%s,s=%s") + :format(self.leaf_compression, self.direct_pointing, self.s) + if not asm_cache[config] then + asm_cache[config] = { + poptrie_lookup.generate(self, 32), + poptrie_lookup.generate(self, 64), + poptrie_lookup.generate(self, 128) + } + end + self.asm_lookup32, self.asm_lookup64, self.asm_lookup128 = + unpack(asm_cache[config]) +end + # + +> + +< Lookup + + There are three flavours of lookup routines. First there is a simple + {Poptrie:rib_lookup} routine to look up prefixes in the RIB. It is a dead + simple recurive binary tree search, and useful as a reference since the + Poptrie lookup routines should return the exact same results when querying our + the compressed FIB. + + Let us quickly recap how the Poptrie lookup works: + + + *1.* If we use _direct pointing_ look up the first _s_ bits in the direct + table. Check if the entry denotes a leaf by checking if bit 32 (the leaf + tag) of the value is set. If the entry is a leaf then return early with its + value, which is obtained by masking off the leaf tag. Otherwise start the + lookup at the node denoted by the entry. + + *2.* Check if the edge denoted by the next _k_ bit segement of the prefix + points to a node. If so fetch the next node and repeat this step with the + child node and the next prefix segment. + + *3.* Fetch the leaf denoted by the _k_ bit segment and return its value. + + Our first implementation is written in Lua. It is not particularily fast, but + supports all combinations of optimizations turned on or off and works on + prefixes of any length. It is mainly used as a runnable reference for + comparison with the x86 assembly implementation. + + #code# + |.define leaves, rdi -- pointer to leaves array + |.define nodes, rsi -- pointer to nodes array + |.define key, rdx -- key to look up + |.define dmap, rcx -- pointer to directmap + |.define index, r8d -- index into node array + |.define node, r8 -- pointer into node array + |.define offset, r9d -- offset into key + |.define offsetx, r9 -- (offset as qword) + |.define v, r10 -- k or s bits extracted from key + |.define v_dw, r10d -- (v as dword) + |.define vec, r11 -- 64-bit vector or leafvec + # + + The [DynASM generator](https://github.com/eugeneia/snabb/blob/poptrie/src/lib/poptrie_lookup.dasl) + for the optimized x86 lookup routines on the other hand is supposed to + generate the best code possible, and keeps all of its working set in + registers. 
[Pete Cawley](https://www.corsix.org/) helped a lot by reviewing + the first version of the code and subsequently pointing out a bunch of x86 + tricks that earned its right to be called “optimized”. :-) + + #code# + ... + -- v = extract(key, offset, k=6) + if keysize == 32 then + if BMI2 then + | shrx v_dw, dword [key], offset + else + | mov ecx, offset + | mov v_dw, dword [key] + | shr v, cl + end + elseif keysize == 64 then + if BMI2 then + | shrx v, [key], offsetx + else + | mov ecx, offset + | mov v, [key] + | shr v, cl + end + elseif keysize == 128 then + | mov ecx, offset + | mov v, [key] + | mov vec, [key+8] + | test cl, 64 + | cmovnz v, vec + | shrd v, vec, cl + else error("NYI") end + ... + # + + The above excerpt shows how its parameterized on prefix/key size as well as on + the availability of certain new x86 features such as [BMI2](https://en.wikipedia.org/wiki/Bit_Manipulation_Instruction_Sets#BMI2_\(Bit_Manipulation_Instruction_Set_2\)). + This {ifdef}-esque code is certainly not pretty to look at, but we are not + leaving any performance on the table here. Note that to use the optimized + lookup routines you must ensure that the FIB contains no keys longer than + the keys you will look up. + +> + +< Property testing + + Hopefully the routines that build the initial binary trie are so simple as to + obviously have no bugs. Assuming we can verify {Poptrie:add} and + {Poptrie:rib_lookup} by means of unit tests and staring at them, we can + property test our tower from there. + + We randomly generate RIBs and then make sure that our Poptrie lookup routines + return identical results when querying the corresponding FIB. The Lua version + of the Poptrie lookup is useful to compare with the assembly routines. If both + return identical, incorrect results then there is most likely a bug in the FIB + building logic. If only one of them is incorrect then the issue is probably + with that particular implementation. + + #code# + local seed = lib.getenv("SNABB_RANDOM_SEED") or 0 + for keysize = 1, 128 do + print("keysize:", keysize) + -- ramp up the geometry below to crank up test coverage + for entries = 1, 3 do + for i = 1, 10 do + -- add {direct_pointing=true} to test direct pointing + for _, config in ipairs{ {} } do + ... -- assert that FIB lookup = RIB lookup + # + + While the testing logic is quite simple (e.g., no test case reduction) we are + happy to realize that we apparently were able to find all bugs with RIBs of + just up to three entries and minimal prefix lengths (1, 7, 33, 65) to trigger + edge cases. Since we are checking the simplest cases first and only + iteratively increasing test case complexity we end up with minimal examples to + reproduce bugs. Neat! + +> + +< Benchmarking + + The library also includes a micro-benchmark to help quantify changes and + optimizations. It seems most useful to ensure the benchmark executes in L3 + cache to highlight how the code performs on the CPU pipeline. Cache misses and + the ensuing memory latency masks pressure on the pipeline and could cause us + to miss valuable optimizations. + + #code Micro-benchmark results with the default parameters on a laptop. 
Hot + cache.# +PMU analysis (numentries=10000, keysize=32) +build: 0.2241 seconds +lookup: 7877.08 cycles/lookup 17605.28 instructions/lookup +lookup32: 60.41 cycles/lookup 99.95 instructions/lookup +lookup64: 60.31 cycles/lookup 99.96 instructions/lookup +lookup128: 70.94 cycles/lookup 118.61 instructions/lookup +build(direct_pointing): 0.1960 seconds +lookup(direct_pointing): 1218.53 cycles/lookup 3033.57 instructions/lookup +lookup32(direct_pointing): 32.84 cycles/lookup 62.59 instructions/lookup +lookup64(direct_pointing): 32.63 cycles/lookup 62.59 instructions/lookup +lookup128(direct_pointing): 34.39 cycles/lookup 66.80 instructions/lookup + # + + It is difficult to make general statements about how memory latency impacts + performance because it really depends on the runtime profile of your + application when it processes a real world workload. Worst case performance + might not necessairly a particularily interesting cas either. It is clear + however that memory accesses will become the bottleneck depending on your + routing table and traffic profile. + + Testing in [Vita](https://github.com/inters/vita) has shown practically + equivalent performance when compared to the previously used _DIR-24-8_ + implementation. This is good news for Vita because we need to support IPv6 + routes, but might not be particularily representative for other applications + as Vita typically uses very small routing tables. + +> + +< Next Steps + + Luke Gorrie brought up the possibility parallelizing lookups to amortize + memory latency. That is, if we could somehow perform multiple lookups + concurrently we could do useful work while waiting for the next nodes to be + fetched from memory. Maybe someone wants to take a stab at vectorizing the + poptrie lookup routine? :-) + +>
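  To make the idea concrete, here is one possible call shape for such a routine. This is purely a hypothetical sketch with invented names: the scalar loop gains nothing by itself; the point is the batch boundary that a vectorized or software-pipelined implementation could slot into.

  #code#
-- Hypothetical batched front end (invented name and signature). A
-- vectorized version would interleave the n trie walks inside a single
-- routine so that memory fetches for one key can overlap with compute
-- for the others; this scalar reference loop only fixes the interface.
function Poptrie:lookup32_batch (keys, results, n)
   for i = 0, n - 1 do
      results[i] = self:lookup32(keys + i*4) -- 4 bytes per 32-bit key
   end
end
 #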