What is Tail Call Elimination?

Previous posts highlight my interest in (obsession with?) interpreters. I keep marveling at techniques people have come up with over the years to make our interactions with computers feel conversational. I was recently reminded of Guy Steele’s seminal Lambda: The Ultimate GOTO article. Around the same time, I came across old release notes for Clang 13.0.0 announcing, among other things, a musttail attribute. This immediately reminded me of my decision to use GNU’s labels as values to implement jonesforth in C. Naturally, I started wondering what a FORTH implemented in C would look like with goto replaced by musttail. A quick google search turned up a public gist by Jens Alfke. As an aside, I didn’t notice Jens’s more elaborate Tails project until later.

The first step to migrating from goto to musttail is to abandon all the void ** (and occasional void *** because of indirect threading) in favor of explicit types. I adapted the union below from Jens:

union instr_t {
    intptr_t *(*code)(struct interp_t *, intptr_t *, union instr_t **, union instr_t *, union instr_t *);
    intptr_t literal;
    union instr_t *word;
};

A FORTH instruction is either an address to a function with a complicated signature (more on that later), a literal word (branching instructions encode their offsets directly in the instruction stream), or a pointer to an instruction (for indirect threading). Let’s pick one example apart

intptr_t *SWAP(struct interp_t *, intptr_t *, union instr_t **, union instr_t *, union instr_t *);
static struct word_t name_SWAP __attribute__((used)) = {
    .link = &name_DROP,
    .flag = 4,
    .name = "SWAP",
    .code = {.code = SWAP}
};
intptr_t *SWAP(struct interp_t *env, intptr_t *sp, union instr_t **rsp, union instr_t *ip, union instr_t *target __attribute__((unused)))
{
    register intptr_t tmp = sp[1];
    sp[1] = sp[0];
    sp[0] = tmp;
    __attribute__((musttail)) return ip->word->code(env, sp, rsp, ip + 1, ip->word)
}

This example begins with a forward declaration of the SWAP function so we can include it in the name_SWAP struct. That struct is part of the singly linked list that implements the FORTH vocabulary. The name_* structs let the FORTH interpreter/compiler associate textual names of FORTH words with their implementations. That logic lives in INTERPRET which invokes WORD to recognize user input then either appends to the word being compiled or transfers control when in interpreter mode. STATE distinguishes between interpreter and compiler mode. By convention, INTERPRET also tries to interpret strings it couldn’t map to words as literal numbers. If it fails, it emits an error message and moves on to the next token. This illustrates FORTH parsimony: no tokenizer, lexer, or syntax tree. FORTH is a WYSIWYG language.

Performance specialists might worry that hot functions with so many parameters will challenge clang’s register allocation. Fortunately, those functions will never be called by a CALL instruction! The parameters are as follow

struct interp_t *env

A number of interpreter global variables like BASE, STATE (whether we’re compiling or interpreting), HERE

intprt_t *sp

A pointer to the top of the value stack. Both stacks grow downward

union instr_t **rsp

A pointer to the top of the return stack

union instr_t *ip

A pointer to the next instruction. Because we’re using indirect threading, ip->code rarely makes sense, only ip->literal and ip->word do.

union instr_t *target

A pointer to the current instruction. Used by DOCOL to transfer control to the next word. Having it as rightmost argument is reminiscent of continuation-passing style. If anything, having so many parameters simplifies clang’s register allocation as the System V ABI mandates that the first six integer or pointer arguments are passed in registers %rdi, %rsi, %rdx, %rcx, %r8, and %r9.

All word initializers start with a forward declaration so the first eight lines (before the last opening curly) can be generated by a preprocessor macro. A previous version injected the code (from the last opening to the end) as a macro argument. This made for a sub-optimal debugging experience since macros expand everything into a single line. The macro version of SWAP looks much tidier, like this:

DEFCODE(DROP, 0, "SWAP", SWAP)
{
    register intptr_t tmp = sp[1];
    sp[1] = sp[0];
    sp[0] = tmp;
    NEXT;
}

Amazingly, the x86 version of SWAP is barely longer than the c version:

SWAP:                           # @SWAP
    movdqu  (%rsi), %xmm0
    pshufd  $78, %xmm0, %xmm0   # xmm0 = xmm0[2,3,0,1]
    movdqu  %xmm0, (%rsi)
    movq    (%rcx), %r8         # target = ip->word
    movq    (%r8), %rax         # rax = target->code
    addq    $8, %rcx            # ip + 1
    jmpq    *%rax               # TAILCALL

The first three lines actually swap sp[0] and sp[1] (using pshufd to shuffle packed doublewords). The last four lines dereference ip twice (indirect threading) the perform the tail call using an indirect branch. Staring at that generated assembly highlights a clever bit of code golf in Richard WM Jones’s handcrafted jonesforth.S: he condensed the last 4 instructions above into just 2 (lodsl is equivalent to mov (%esi), %eax; add $4, %esi while jmp *(%eax) is equivalent to mov (%eax), %eax; jmp *%eax)! 97e00d7 is a (failed) attempt at using extended asm to achieve the same compression. Unfortunately, I found no better way to enforce pre-conditions than explicit mov instructions (which more than offset the saving).

The webassembly version, courtesy of wasm2wat, is only slightly longer:

(module $5th.wasm
  (type $t0 (func (param i32 i32 i32 i32 i32) (result i32)))
  ...
  (func $SWAP (type $t0) (param $p0 i32) (param $p1 i32) (param $p2 i32) (param $p3 i32) (param $p4 i32) (result i32)
    (i64.store align=4
      (local.get $p1)
      (i64.rotl
        (i64.load align=4
          (local.get $p1))
        (i64.const 32)))
    (return_call_indirect (type $t0)
      (local.get $p0)
      (local.get $p1)
      (local.get $p2)
      (i32.add
        (local.get $p3)
        (i32.const 4))
      (local.tee $p3
        (i32.load
          (local.get $p3)))
      (i32.load
        (local.get $p3))))
  ...
)

In summary, 5th.c looks different from 4th.c, at least when considering their C sources. Turns out that compilers generate very similar code from those difference sources. That code is also reminiscent of jonesforth.S which was hand-crafted in 32-bit x86. 5th.c is probably the easiest to extend of the bunch. New native instructions only require implementing a new C function with a well-defined signature. It might be the shiny object syndrome but think that 5th.c is a reasonable blueprint for implementing your next interpreter.