NOTES ON REWRITING THE RAPTORJIT INTERPRETER USING LLVM’S MUSTTAIL AND
                        PRESERVE_NONE ATTRIBUTES

                                             Max Rottenkolber <max@mr.gy>
                                              Wednesday, 26 November 2025

   A contemporary bytecode interpreter might be written in assembly
   language and look like this: N bytecodes are implemented by adjacent
   subroutines R_0..R_(N-1), evenly aligned to M bytes. Each subroutine
   ends with inlined dispatch code that performs a

   jump R_0 + i * M; where i (0..N-1) is the index of the next bytecode

   By convention certain registers will hold important values across
   bytecode subroutines. The expected-to-be-common cases will be
   implemented directly in each subroutine, while so-called “slow paths”
   will cause ABI calls to functions—possibly written in a high-level
   language—which eventually return to the bytecode subroutine.

   The LuaJIT interpreter follows this design with one extra layer of
   indirection: instead of jumping to evenly aligned subroutines, it
   looks up the addresses of bytecode subroutines in a dispatch table.
   Being the backbone of a tracing JIT compiler, the LuaJIT interpreter
   uses this dispatch table to dynamically swap out bytecode subroutines
   depending on whether it is currently recording a trace or not.
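
   The table-swapping idea can be sketched in C as follows. This is a
   minimal illustration and not LuaJIT's actual code: the VM struct, the
   opcodes (OP_INC, OP_TRACE_ON, OP_HALT) and the rec_* subroutines are
   hypothetical names, and a plain loop stands in for the dispatch code
   that a real interpreter would inline into each subroutine.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical opcodes, for illustration only. */
enum { OP_INC, OP_TRACE_ON, OP_HALT, OP_COUNT };

typedef struct VM VM;
typedef void (*Subroutine)(VM *vm);

struct VM {
    const Subroutine *disp;   /* currently active dispatch table */
    const unsigned char *pc;  /* next bytecode */
    long acc;                 /* accumulator */
    size_t recorded;          /* bytecodes seen while recording */
    int running;
};

static void op_trace_on(VM *vm);  /* defined after the tables */

/* Regular subroutines. */
static void op_inc(VM *vm)  { vm->acc++; }
static void op_halt(VM *vm) { vm->running = 0; }

/* Recording variants: note the bytecode, then do the regular work. */
static void rec_inc(VM *vm)      { vm->recorded++; op_inc(vm); }
static void rec_halt(VM *vm)     { vm->recorded++; op_halt(vm); }
static void rec_trace_on(VM *vm) { vm->recorded++; }

static const Subroutine disp_normal[OP_COUNT] = {
    [OP_INC] = op_inc, [OP_TRACE_ON] = op_trace_on, [OP_HALT] = op_halt,
};
static const Subroutine disp_record[OP_COUNT] = {
    [OP_INC] = rec_inc, [OP_TRACE_ON] = rec_trace_on, [OP_HALT] = rec_halt,
};

/* Swapping one pointer changes the behavior of every later bytecode. */
static void op_trace_on(VM *vm) { vm->disp = disp_record; }

static void run(VM *vm)
{
    while (vm->running) {
        unsigned char op = *vm->pc++;
        assert(op < OP_COUNT);
        vm->disp[op](vm);  /* table lookup instead of R_0 + i * M */
    }
}

/* Runs INC, TRACE_ON, INC, INC, HALT and returns the final VM state. */
static VM run_example(void)
{
    static const unsigned char code[] = {
        OP_INC, OP_TRACE_ON, OP_INC, OP_INC, OP_HALT,
    };
    VM vm = { disp_normal, code, 0, 0, 1 };
    run(&vm);
    return vm;
}
```

   Calling run_example() executes three increments, of which only the
   bytecodes after OP_TRACE_ON go through the recording table. The cost
   of the indirection is one extra load per dispatch.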

   For RaptorJIT we would like to rewrite the interpreter in a high-level
   language. Relatively recent additions to LLVM allow us to construct an
   equivalent interpreter in C, as described on reverberate.org: 1
   (https://blog.reverberate.org/2021/04/21/musttail-efficient-interpreters.html),
   2 (https://blog.reverberate.org/2025/02/10/tail-call-updates.html).

     __attribute__((always_inline))
     void dispatch
     (Bytecode *pc, Bytecode bc, Value *stack, Subroutine *disp)
     {
         bc = *pc++;
         __attribute__((musttail))
         return disp[bc.op](pc, bc, stack, disp);
     }

       The dispatch logic (Clang has supported the musttail statement
       attribute since version 13).

   The interpreter dispatch can be expressed as a tail call, where the
   arguments hold important values across bytecode subroutines. Fast-path
   logic can be inlined, while slow paths can be contained in tail calls
   to slow-path functions.

     void op_ADD
     (Bytecode *pc, Bytecode bc, Value *stack, Subroutine *disp)
     {
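         /* Fast path requires both operands (b and c) to be numbers. */
         if (!isnumber(stack[bc.b]) || !isnumber(stack[bc.c]))
             __attribute__((musttail))
             return add_slowpath(pc, bc, stack, disp);
     
         stack[bc.a].num = stack[bc.b].num + stack[bc.c].num;
         /* Pass the dispatch table along, not the dispatch function. */
         __attribute__((musttail))
         return dispatch(pc, bc, stack, disp);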
     }

       An example bytecode subroutine.
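
   To see the pieces working together, the following is a self-contained
   sketch of a two-instruction interpreter in the same style. Several
   details are assumptions made for illustration and are not taken from
   RaptorJIT: the Bytecode and Value layouts, the OP_HALT opcode, the
   behavior of add_slowpath, and the struct wrapper around the function
   pointer (which lets a subroutine take a pointer to its own dispatch
   table). The MUSTTAIL macro only expands to the attribute under Clang
   and otherwise falls back to ordinary calls.

```c
#include <assert.h>
#include <stdint.h>

/* Enable musttail only where Clang provides it; elsewhere fall back to
   ordinary calls, which the optimizer may or may not turn into jumps. */
#if defined(__clang__) && defined(__has_attribute)
#  if __has_attribute(musttail)
#    define MUSTTAIL __attribute__((musttail))
#  endif
#endif
#ifndef MUSTTAIL
#  define MUSTTAIL
#endif

enum { OP_ADD, OP_HALT, OP_COUNT };                /* hypothetical opcodes */
typedef struct { uint8_t op, a, b, c; } Bytecode;  /* hypothetical encoding */
typedef struct { int is_num; double num; } Value;  /* hypothetical tagged value */

typedef struct Subroutine Subroutine;
struct Subroutine {
    void (*fn)(Bytecode *pc, Bytecode bc, Value *stack, Subroutine *disp);
};

static inline __attribute__((always_inline))
void dispatch(Bytecode *pc, Bytecode bc, Value *stack, Subroutine *disp)
{
    bc = *pc++;  /* fetch the next bytecode */
    MUSTTAIL return disp[bc.op].fn(pc, bc, stack, disp);
}

static void add_slowpath(Bytecode *pc, Bytecode bc, Value *stack, Subroutine *disp)
{
    /* Hypothetical slow path: the result is marked as a non-number. */
    stack[bc.a].is_num = 0;
    stack[bc.a].num = 0.0;
    MUSTTAIL return dispatch(pc, bc, stack, disp);
}

static void op_ADD(Bytecode *pc, Bytecode bc, Value *stack, Subroutine *disp)
{
    if (!stack[bc.b].is_num || !stack[bc.c].is_num) {
        MUSTTAIL return add_slowpath(pc, bc, stack, disp);
    }
    stack[bc.a].is_num = 1;
    stack[bc.a].num = stack[bc.b].num + stack[bc.c].num;
    MUSTTAIL return dispatch(pc, bc, stack, disp);
}

static void op_HALT(Bytecode *pc, Bytecode bc, Value *stack, Subroutine *disp)
{
    /* A plain return unwinds to whoever entered the interpreter. */
    (void)pc; (void)bc; (void)stack; (void)disp;
}

/* Runs "ADD 2 0 1; HALT" on the given operands and returns slot 2. */
static Value run_add(Value x, Value y)
{
    Subroutine table[OP_COUNT] = {
        [OP_ADD] = { op_ADD }, [OP_HALT] = { op_HALT },
    };
    Bytecode code[] = { { OP_ADD, 2, 0, 1 }, { OP_HALT, 0, 0, 0 } };
    Value stack[3];
    stack[0] = x;
    stack[1] = y;
    stack[2].is_num = 0;
    stack[2].num = 0.0;

    Bytecode *pc = code;
    Bytecode bc = *pc++;
    assert(bc.op < OP_COUNT);
    table[bc.op].fn(pc, bc, stack, table);  /* ordinary call to enter */
    return stack[2];
}
```

   All subroutines share one signature, which is what allows each tail
   call to reuse the caller's argument registers. Entering the
   interpreter is an ordinary call, and the subroutine that ends
   execution simply returns.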

   Using this technique we end up with machine code comparable to the
   LuaJIT assembler VM. By enabling frame pointer omission
   (-fomit-frame-pointer) we can avoid saving the frame pointer to the
   stack for most functions. There is one caveat: code generated for the
   bytecode subroutines follows the default calling convention, under
   which only a limited number of registers are caller-save. Once LLVM
   runs out of those it resorts to callee-save registers, which each
   subroutine has to save to the stack and restore again before its
   tail call.

   We can use the preserve_none calling convention for all interpreter
   subroutines instead by tagging their definitions with
   __attribute__((preserve_none)) (available since LLVM 21). This directs
   LLVM to use a calling convention where all registers are caller-save,
   and enables penalty-free use of previously callee-save registers.
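
   A minimal sketch of the tagging might look as follows. The opcodes,
   the accumulator-style signature and the PRESERVE_NONE and MUSTTAIL
   macros are assumptions for illustration; the macros expand to nothing
   unless Clang reports support for the respective attribute. Note that
   the function pointer type used for dispatch carries the same
   attribute as the definitions, since a tail call requires caller and
   callee to agree on the calling convention.

```c
#include <assert.h>

#if defined(__clang__) && defined(__has_attribute)
#  if __has_attribute(musttail)
#    define MUSTTAIL __attribute__((musttail))
#  endif
#  if __has_attribute(preserve_none)
#    define PRESERVE_NONE __attribute__((preserve_none))
#  endif
#endif
#ifndef MUSTTAIL
#  define MUSTTAIL
#endif
#ifndef PRESERVE_NONE
#  define PRESERVE_NONE
#endif

enum { OP_INC, OP_DEC, OP_HALT, OP_COUNT };  /* hypothetical opcodes */

typedef struct Subroutine Subroutine;
struct Subroutine {
    /* The pointer type is tagged like the definitions below. */
    void (PRESERVE_NONE *fn)(const unsigned char *pc, long *acc,
                             const Subroutine *disp);
};

static inline __attribute__((always_inline)) PRESERVE_NONE
void dispatch(const unsigned char *pc, long *acc, const Subroutine *disp)
{
    unsigned char op = *pc++;
    MUSTTAIL return disp[op].fn(pc, acc, disp);
}

static PRESERVE_NONE
void op_INC(const unsigned char *pc, long *acc, const Subroutine *disp)
{
    *acc += 1;
    MUSTTAIL return dispatch(pc, acc, disp);
}

static PRESERVE_NONE
void op_DEC(const unsigned char *pc, long *acc, const Subroutine *disp)
{
    *acc -= 1;
    MUSTTAIL return dispatch(pc, acc, disp);
}

static PRESERVE_NONE
void op_HALT(const unsigned char *pc, long *acc, const Subroutine *disp)
{
    (void)pc; (void)acc; (void)disp;  /* return to the entry point */
}

static const Subroutine table[OP_COUNT] = {
    [OP_INC] = { op_INC }, [OP_DEC] = { op_DEC }, [OP_HALT] = { op_HALT },
};

/* The entry point keeps the default calling convention. */
static long run(const unsigned char *code)
{
    long acc = 0;
    assert(code[0] < OP_COUNT);
    table[code[0]].fn(code + 1, &acc, table);
    return acc;
}

/* Three increments followed by one decrement. */
static long run_example(void)
{
    static const unsigned char code[] = {
        OP_INC, OP_INC, OP_INC, OP_DEC, OP_HALT,
    };
    return run(code);
}
```

   Only the interpreter subroutines are tagged. The entry point stays on
   the default convention, so the cost of saving registers is paid once
   when entering the interpreter rather than on every bytecode.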

   We found that with -fzero-call-used-regs (which appears to be enabled
   by default on some installations) LLVM 21 zero-initializes many
   registers between tail calls when preserve_none is in use. In our
   case this caused redundant code to be emitted, so we resorted to
   -fzero-call-used-regs=skip to inhibit this behavior.