commit 8a03585de7c77e623b2d50a4224de9994a1fdd03
Author: Max Rottenkolber
Date:   Wed Nov 26 11:45:13 2025 +0100

    ....

diff --git a/blog/raptorjit-vm-musttail-preserve_none.mk2 b/blog/raptorjit-vm-musttail-preserve_none.mk2
index 4c9a83f..66cf7e7 100644
--- a/blog/raptorjit-vm-musttail-preserve_none.mk2
+++ b/blog/raptorjit-vm-musttail-preserve_none.mk2
@@ -21,7 +21,7 @@ interpreter in C.
 [reverberate.org 1](https://blog.reverberate.org/2021/04/21/musttail-efficient-interpreters.html),
 [2](https://blog.reverberate.org/2025/02/10/tail-call-updates.html)

-#code The _dispatch_ logic ({musttail} was introduced in LLVM 17).#
+#code The _dispatch_ logic ({musttail} was introduced in LLVM 17).#
 __attribute__((always_inline))
 void dispatch
 (Bytecode *pc, Bytecode bc, Value *stack, Subroutine *disp)

commit a6ac1ce2fd5558092a801fa637dbaa8644d3eb47
Author: Max Rottenkolber
Date:   Wed Nov 26 11:33:04 2025 +0100

    blog/raptorjit-vm-musttail-preserve_none: publish

diff --git a/blog/raptorjit-vm-musttail-preserve_none.meta b/blog/raptorjit-vm-musttail-preserve_none.meta
index 4306d62..0609137 100644
--- a/blog/raptorjit-vm-musttail-preserve_none.meta
+++ b/blog/raptorjit-vm-musttail-preserve_none.meta
@@ -2,4 +2,4 @@
            :author "Max Rottenkolber "
            :index-p nil
            :index-headers-p nil)
-:publication-date nil
\ No newline at end of file
+:publication-date "2025-11-26 11:32"
\ No newline at end of file

commit ea6c690993df024e3718b7156747968ccd1dc987
Author: Max Rottenkolber
Date:   Wed Nov 26 11:00:49 2025 +0100

    blog/raptorjit-vm-musttail-preserve_none: edits

diff --git a/blog/raptorjit-vm-musttail-preserve_none.mk2 b/blog/raptorjit-vm-musttail-preserve_none.mk2
index d2acac9..4c9a83f 100644
--- a/blog/raptorjit-vm-musttail-preserve_none.mk2
+++ b/blog/raptorjit-vm-musttail-preserve_none.mk2
@@ -1,11 +1,8 @@
 A contemporary bytecode interpreter might be written in assembly language and look like this:
-{N} bytecodes are implemented by adjacent subroutines {R}0..{N}-1_ aligned evenly to {M} bytes.
+{N} bytecodes are implemented by adjacent subroutines {R0}..{N}-1 aligned evenly to {M} bytes.
 Each subroutine ends with dispatch code inlined that does a

-#code where {i} is 0..{N}-1#
-jump R0 + i * M
-#
-
+_jump_ {R0} + {i} * {M}; where {i} is 0..{N}-1

 By convention certain registers will hold important values across bytecode subroutines.
 The expected-to-be-common cases will be implemented directly in each subroutine, while so-called

commit 7e04fbbf7adb251e2e9364a6ddf9fccd3dd40661
Author: Max Rottenkolber
Date:   Tue Nov 25 21:59:35 2025 +0100

    blog/raptorjit-vm-musttail-preserve_none: edits

diff --git a/blog/raptorjit-vm-musttail-preserve_none.mk2 b/blog/raptorjit-vm-musttail-preserve_none.mk2
index d748e44..d2acac9 100644
--- a/blog/raptorjit-vm-musttail-preserve_none.mk2
+++ b/blog/raptorjit-vm-musttail-preserve_none.mk2
@@ -1,7 +1,11 @@
 A contemporary bytecode interpreter might be written in assembly language and look like this:
-_N_ bytecodes are implemented by adjacent subroutines _R0..N-1_ aligned evenly to _M_ bytes.
-Each subroutine ends with dispatch code inlined that does a {jump R0 + i * M}
-where _i_ is 0.._N_-1.
+{N} bytecodes are implemented by adjacent subroutines {R}0..{N}-1_ aligned evenly to {M} bytes.
+Each subroutine ends with dispatch code inlined that does a
+
+#code where {i} is 0..{N}-1#
+jump R0 + i * M
+#
+

 By convention certain registers will hold important values across bytecode subroutines.
 The expected-to-be-common cases will be implemented directly in each subroutine, while so-called
@@ -10,16 +14,15 @@ eventually return to the bytecode subroutine.

 The LuaJIT interpreter follows this design with one extra layer of indirection:
 instead of jumping to evenly aligned subroutines, it looks up the addresses of bytecode
-subroutines in a dispatch table. Being the backbone of a tracing JIT compilers, the
-LuaJIT interpreter used this dispatch table to dynamically swap out bytecode subroutines
+subroutines in a dispatch table.
Being the backbone of a tracing JIT compiler, the
+LuaJIT interpreter uses this dispatch table to dynamically swap out bytecode subroutines
 depending wheter it is currently recording a trace or not.

 For RaptorJIT we would like to rewrite the interpreter in a high-level language.
 Recently (relatively speaking) added features of LLVM allow us to construct an equivalent
 interpreter in C.
-[1](https://blog.reverberate.org/2021/04/21/musttail-efficient-interpreters.html),
-[2](https://transactional.blog/copy-and-patch/how-it-works),
-[3](https://blog.reverberate.org/2025/02/10/tail-call-updates.html)
+[reverberate.org 1](https://blog.reverberate.org/2021/04/21/musttail-efficient-interpreters.html),
+[2](https://blog.reverberate.org/2025/02/10/tail-call-updates.html)

 #code The _dispatch_ logic ({musttail} was introduced in LLVM 17).#
 __attribute__((always_inline))
@@ -63,9 +66,9 @@ saved and restored to and from the stack between tail calls.
 We can use the {preserve_none} calling convention for all interpreter subroutines instead
 by tagging their definitions with {__attribute__((preserve_none))} (available since LLVM 21).
 This directs LLVM to use a calling convention where all registers are caller-save, and
-lets enables penalty-free use of previously callee-save registers.
+enables penalty-free use of previously callee-save registers.

 We found that with {-fzero-call-used-regs} (which appears to be enabled by default on some
 installations) LLVM will zero-initialize many registers in between tail calls when using
 {preserve_none} in LLVM 21. In our case this caused redundant code to be emitted and we
-resorted to {-fzero-call-used-regs=skip} to inhibit this behavior.
+resort to {-fzero-call-used-regs=skip} to inhibit this behavior.
commit e22f021a909b60b74421f1f4544ba3baab609069
Author: Max Rottenkolber
Date:   Tue Nov 25 17:58:58 2025 +0100

    blog/raptorjit-vm-musttail-preserve_none

diff --git a/blog/raptorjit-vm-musttail-preserve_none.meta b/blog/raptorjit-vm-musttail-preserve_none.meta
new file mode 100644
index 0000000..4306d62
--- /dev/null
+++ b/blog/raptorjit-vm-musttail-preserve_none.meta
@@ -0,0 +1,5 @@
+:document (:title "Notes on rewriting the RaptorJIT interpreter using LLVM’s musttail and preserve_none attributes"
+           :author "Max Rottenkolber "
+           :index-p nil
+           :index-headers-p nil)
+:publication-date nil
\ No newline at end of file
diff --git a/blog/raptorjit-vm-musttail-preserve_none.mk2 b/blog/raptorjit-vm-musttail-preserve_none.mk2
new file mode 100644
index 0000000..d748e44
--- /dev/null
+++ b/blog/raptorjit-vm-musttail-preserve_none.mk2
@@ -0,0 +1,71 @@
+A contemporary bytecode interpreter might be written in assembly language and look like this:
+_N_ bytecodes are implemented by adjacent subroutines _R0..N-1_ aligned evenly to _M_ bytes.
+Each subroutine ends with dispatch code inlined that does a {jump R0 + i * M}
+where _i_ is 0.._N_-1.
+
+By convention certain registers will hold important values across bytecode subroutines.
+The expected-to-be-common cases will be implemented directly in each subroutine, while so-called
+“slow paths” will cause ABI calls to functions—possibly written in a high-level language—which
+eventually return to the bytecode subroutine.
+
+The LuaJIT interpreter follows this design with one extra layer of indirection:
+instead of jumping to evenly aligned subroutines, it looks up the addresses of bytecode
+subroutines in a dispatch table. Being the backbone of a tracing JIT compilers, the
+LuaJIT interpreter used this dispatch table to dynamically swap out bytecode subroutines
+depending wheter it is currently recording a trace or not.
+
+For RaptorJIT we would like to rewrite the interpreter in a high-level language.
+Recently (relatively speaking) added features of LLVM allow us to construct an equivalent
+interpreter in C.
+[1](https://blog.reverberate.org/2021/04/21/musttail-efficient-interpreters.html),
+[2](https://transactional.blog/copy-and-patch/how-it-works),
+[3](https://blog.reverberate.org/2025/02/10/tail-call-updates.html)
+
+#code The _dispatch_ logic ({musttail} was introduced in LLVM 17).#
+__attribute__((always_inline))
+void dispatch
+(Bytecode *pc, Bytecode bc, Value *stack, Subroutine *disp)
+{
+  bc = *pc++;
+  __attribute__((musttail))
+  return disp[bc.op](pc, bc, stack, disp);
+}
+#
+
+The interpreter dispatch can be expressed as a tail call,
+where the arguments hold important values across bytecode subroutines.
+Fast-path logic can be inlined, while slow paths can be contained in tail calls
+to slow-path functions.
+
+#code An exemplary bytecode subroutine.#
+void op_ADD
+(Bytecode *pc, Bytecode bc, Value *stack, Subroutine *disp)
+{
+  if (!isnumber(stack[bc.b]) || !isnumber(stack[bc.a]))
+    __attribute__((musttail))
+    return add_slowpath(pc, bc, stack, dispatch);
+
+  stack[bc.a] = stack[bc.b].num + stack[bc.c].num;
+  __attribute__((musttail))
+  return dispatch(pc, bc, stack, dispatch);
+}
+#
+
+Using this technique we end up with machine code comparable to the LuaJIT assembler VM.
+By enabling frame pointer omission ({-fomit-frame-pointer}) we can avoid saving the
+frame pointer to the stack for most functions.
+There is one caveat, namely that code generated for the bytecode subroutines follows
+the default calling convention.
+Under the default convention, caller-save registers are usually limited.
+Once LLVM runs out of those it will resort to callee-save registers which have to be
+saved and restored to and from the stack between tail calls.
+
+We can use the {preserve_none} calling convention for all interpreter subroutines instead
+by tagging their definitions with {__attribute__((preserve_none))} (available since LLVM 21).
+This directs LLVM to use a calling convention where all registers are caller-save, and
+lets enables penalty-free use of previously callee-save registers.
+
+We found that with {-fzero-call-used-regs} (which appears to be enabled by default on some
+installations) LLVM will zero-initialize many registers in between tail calls when using
+{preserve_none} in LLVM 21. In our case this caused redundant code to be emitted and we
+resorted to {-fzero-call-used-regs=skip} to inhibit this behavior.