NOTES ON REWRITING THE RAPTORJIT INTERPRETER USING LLVM’S MUSTTAIL AND PRESERVE_NONE ATTRIBUTES

Max Rottenkolber
Wednesday, 26 November 2025

A contemporary bytecode interpreter might be written in assembly language and look like this:

- N bytecodes are implemented by adjacent subroutines R_0 .. R_{N-1}, aligned evenly to M bytes.
- Each subroutine ends with inlined dispatch code that jumps to R_0 + i * M, where i is in 0 .. N-1.
- By convention, certain registers hold important values across bytecode subroutines.
- The expected-to-be-common cases are implemented directly in each subroutine, while so-called “slow paths” make ABI calls to functions (possibly written in a high-level language) which eventually return to the bytecode subroutine.

The LuaJIT interpreter follows this design with one extra layer of indirection: instead of jumping to evenly aligned subroutines, it looks up the addresses of bytecode subroutines in a dispatch table. Being the backbone of a tracing JIT compiler, the LuaJIT interpreter uses this dispatch table to dynamically swap out bytecode subroutines depending on whether it is currently recording a trace or not.

For RaptorJIT we would like to rewrite the interpreter in a high-level language. Recently (relatively speaking) added features of LLVM allow us to construct an equivalent interpreter in C.

reverberate.org 1 (https://blog.reverberate.org/2021/04/21/musttail-efficient-interpreters.html), 2 (https://blog.reverberate.org/2025/02/10/tail-call-updates.html)

    __attribute__((always_inline))
    void dispatch (Bytecode *pc, Bytecode bc, Value *stack, Subroutine *disp) {
        /* Fetch the next instruction and advance the program counter. */
        bc = *pc++;
        /* Jump (not call) to the subroutine registered for this opcode. */
        __attribute__((musttail)) return disp[bc.op](pc, bc, stack, disp);
    }

The dispatch logic (musttail was introduced in LLVM 17).

The interpreter dispatch can be expressed as a tail call, where the arguments hold important values across bytecode subroutines. Fast-path logic can be inlined, while slow paths can be contained in tail calls to slow-path functions.
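The dispatch-table indirection and the table swap described above can be tried out in isolation. What follows is a minimal sketch under stated assumptions, not RaptorJIT or LuaJIT code: the `MUSTTAIL` macro, `struct Dispatch`, the three opcodes, `op_record`, `recorded_ops` and `run_example` are all hypothetical names invented for illustration. The table is wrapped in a struct only because a C function pointer type cannot name itself in its own parameter list. Where the compiler does not report support for `musttail`, the macro expands to nothing and the calls become ordinary calls.

```c
/* Use musttail where the compiler provides it; otherwise fall back to a
   plain call (acceptable for a program this short). */
#if defined(__has_attribute)
#  if __has_attribute(musttail)
#    define MUSTTAIL __attribute__((musttail))
#  endif
#endif
#ifndef MUSTTAIL
#  define MUSTTAIL
#endif

enum { OP_LOADI, OP_ADD, OP_HALT, OP_COUNT };   /* hypothetical opcodes */

typedef struct { unsigned char op, a, b, c; } Bytecode;
typedef struct { double num; } Value;

/* Wrapping the table in a struct lets subroutines take it as a parameter. */
struct Dispatch;
typedef void (*Subroutine)(const Bytecode *pc, Bytecode bc, Value *stack,
                           const struct Dispatch *disp);
struct Dispatch { Subroutine op[OP_COUNT]; };

static inline __attribute__((always_inline))
void dispatch(const Bytecode *pc, Bytecode bc, Value *stack,
              const struct Dispatch *disp)
{
    bc = *pc++;  /* fetch the next instruction and advance */
    MUSTTAIL return disp->op[bc.op](pc, bc, stack, disp);
}

/* stack[a] = b, where b is a small immediate */
static void op_LOADI(const Bytecode *pc, Bytecode bc, Value *stack,
                     const struct Dispatch *disp)
{
    stack[bc.a].num = bc.b;
    MUSTTAIL return dispatch(pc, bc, stack, disp);
}

/* stack[a] = stack[b] + stack[c] */
static void op_ADD(const Bytecode *pc, Bytecode bc, Value *stack,
                   const struct Dispatch *disp)
{
    stack[bc.a].num = stack[bc.b].num + stack[bc.c].num;
    MUSTTAIL return dispatch(pc, bc, stack, disp);
}

/* Ends the chain of tail calls by returning to whoever entered the VM. */
static void op_HALT(const Bytecode *pc, Bytecode bc, Value *stack,
                    const struct Dispatch *disp)
{
    (void)pc; (void)bc; (void)stack; (void)disp;
}

const struct Dispatch normal_table = {
    .op = { [OP_LOADI] = op_LOADI, [OP_ADD] = op_ADD, [OP_HALT] = op_HALT }
};

/* "Recording" mode: every opcode first lands here, is counted, and is then
   forwarded to the normal subroutine. disp still points at the recording
   table, so the next dispatch comes back through op_record. */
static unsigned long recorded_ops;

static void op_record(const Bytecode *pc, Bytecode bc, Value *stack,
                      const struct Dispatch *disp)
{
    recorded_ops++;
    MUSTTAIL return normal_table.op[bc.op](pc, bc, stack, disp);
}

const struct Dispatch recording_table = {
    .op = { [OP_LOADI] = op_record, [OP_ADD] = op_record, [OP_HALT] = op_record }
};

/* Runs a fixed program: r0 = 2; r1 = 3; r2 = r0 + r1; halt. Returns r2. */
double run_example(const struct Dispatch *table)
{
    static const Bytecode program[] = {
        { OP_LOADI, 0, 2, 0 },
        { OP_LOADI, 1, 3, 0 },
        { OP_ADD,   2, 0, 1 },
        { OP_HALT,  0, 0, 0 },
    };
    Value stack[3] = { {0}, {0}, {0} };
    Bytecode first = program[0];
    /* One ordinary call enters the VM; everything after it is a tail call. */
    table->op[first.op](program + 1, first, stack, table);
    return stack[2].num;
}
```

Calling `run_example(&normal_table)` and `run_example(&recording_table)` runs the same bytecode through either table. Only the table pointer differs, which is the property the LuaJIT design relies on when it switches into trace recording.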
    void op_ADD (Bytecode *pc, Bytecode bc, Value *stack, Subroutine *disp) {
        /* Slow path: either operand is not a number. */
        if (!isnumber(stack[bc.b]) || !isnumber(stack[bc.c]))
            __attribute__((musttail)) return add_slowpath(pc, bc, stack, disp);
        /* Fast path: plain numeric addition. */
        stack[bc.a].num = stack[bc.b].num + stack[bc.c].num;
        __attribute__((musttail)) return dispatch(pc, bc, stack, disp);
    }

An exemplary bytecode subroutine.

Using this technique we end up with machine code comparable to the LuaJIT assembler VM. By enabling frame pointer omission (-fomit-frame-pointer) we can avoid saving the frame pointer to the stack for most functions.

There is one caveat, namely that code generated for the bytecode subroutines follows the default calling convention. Under the default convention, caller-save registers are usually limited. Once LLVM runs out of those it will resort to callee-save registers, which have to be saved to the stack and restored from it between tail calls.

We can use the preserve_none calling convention for all interpreter subroutines instead by tagging their definitions with __attribute__((preserve_none)) (available since LLVM 21). This directs LLVM to use a calling convention where all registers are caller-save, and enables penalty-free use of previously callee-save registers.

We found that with -fzero-call-used-regs (which appears to be enabled by default on some installations) LLVM will zero-initialize many registers in between tail calls when using preserve_none in LLVM 21. In our case this caused redundant code to be emitted, and we resorted to -fzero-call-used-regs=skip to inhibit this behavior.
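To make the tagging concrete, here is a minimal sketch of how preserve_none might be applied to both the subroutine definitions and the function pointer type used for dispatch. It is not RaptorJIT code: the `PRESERVE_NONE` and `MUSTTAIL` macros, `State`, `struct Steps`, `step_add`, `step_halt` and `sum_to` are hypothetical names, and the build line in the comment (including the file name) is an assumption. The macros expand to nothing on compilers that do not report the attributes, so the sketch still compiles there, just without the calling-convention change. The attribute goes on the pointer type as well as on the definitions because the caller and callee of a musttail call must agree on the calling convention.

```c
/* Possible build line (an assumption, adjust to your setup):
     clang -O2 -fomit-frame-pointer -fzero-call-used-regs=skip sketch.c */

#if defined(__has_attribute)
#  if __has_attribute(musttail)
#    define MUSTTAIL __attribute__((musttail))
#  endif
#  if __has_attribute(preserve_none)
#    define PRESERVE_NONE __attribute__((preserve_none))
#  endif
#endif
#ifndef MUSTTAIL
#  define MUSTTAIL
#endif
#ifndef PRESERVE_NONE
#  define PRESERVE_NONE
#endif

typedef struct { long remaining; long sum; } State;

/* The calling convention is part of the function pointer type. */
struct Steps;
typedef void (PRESERVE_NONE *Step)(State *st, const struct Steps *steps);
struct Steps { Step op[2]; };

/* Adds the counter to the sum, decrements it, and tail-calls the next
   step: op[0] (itself) while work remains, op[1] (halt) once it hits zero. */
PRESERVE_NONE static void step_add(State *st, const struct Steps *steps)
{
    st->sum += st->remaining;
    st->remaining--;
    MUSTTAIL return steps->op[st->remaining == 0](st, steps);
}

/* Ends the chain of tail calls. */
PRESERVE_NONE static void step_halt(State *st, const struct Steps *steps)
{
    (void)st; (void)steps;
}

/* Entry point with the default calling convention: computes 1 + 2 + ... + n
   (0 for n <= 0) by entering the preserve_none subroutines once. */
long sum_to(long n)
{
    static const struct Steps steps = { { step_add, step_halt } };
    State st = { n, 0 };
    steps.op[n <= 0](&st, &steps);
    return st.sum;
}
```

Only `sum_to` uses the default convention, so it is the one place that pays for saving registers around the call into the subroutines. Comparing the disassembly of `step_add` with and without -fzero-call-used-regs=skip is one way to check whether a given toolchain emits the redundant register zeroing.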