What, Forth, Again?

We already discussed Chuck Moore’s discovery of FORTH in a previous post but leveraging python bytecodes to implement it felt like cheating. Like many others, I started understanding FORTH through Richard WM Jones’ jonesforth. I trudged through the unfamiliar assembly syntax and settled down with many cups of coffee to understand the difference between direct and indirect threaded code. And yet, after all this time, I realized I didn’t really understand it. So I decided to translate jonesforth.S to C.

The first bit of actual code in jonesforth.S is the NEXT macro:

    .macro NEXT
    lodsl
    jmp *(%eax)
    .endm

NEXT is fundamental to FORTH yet remarkably only requires two x86 instructions. The second instruction reminds us of What makes Julia delightful, cont’d?. The wikipedia entry on threaded code outlines a rather vast array of potential translations. However, in the spirit of keeping our translation as faithful to the original as possible, I chose GNU C’s Labels as Values. This is what my NEXT macro looks like in C:

#define NEXT do { target = *ip++; goto **target; } while (0)

Checking the generated assembly (using -fverbose-asm), we see that GCC always translates goto **target to

    movq    (%rdx), %rax
    jmp     *%rax

so that looks rather promising. GCC’s register allocation logic means that the line which immediately precedes these instructions can vary greatly. ip is not always kept in the same register. Fortunately, we don’t need to worry about that.

With this first challenge behind us, we move on to the precise memory layout of FORTH dictionary entries. FORTH words can vary in length. Words written in assembly (or in our case C) language only require a single address. Words written in FORTH can refer to N previously defined words plus the special DOCOL label (since they were introduced by a :) that is at the center of indirect threading. Jones explains that extra indirection step in great details, complete with ASCII diagrams. I will therefore not repeat it here. Flexible array members were officially standardized in C99 so we will use them to store the collection of labels in the dictionary entry. We could store the first label, aka Code Field, as a separate member but that only increases the cognitive load of remembering that following parameters are contiguous. jonesforth.S words actually contain not one but two flexible arrays (name and code) where C only allows one. That’s our first big deviation from jonesforth.S (there will be more, don’t worry):

struct word_t {
    struct word_t *link;
    char flags;
    char name[15] __attribute__((nonstring));  /* big enough for builtins, forth words might overflow  */
    void *code[];
};

Looking at all the words written in assembly, the longest name ("O_NONBLOCK") is 11 characters long. Assuming 8 bytes per word, we need two words to encode it. That (and a bit of the lost art of structure packing) explains how name[15] was chosen. This choice is marginally wasteful on 32 bit chips (where words only occupy 4 bytes) but lets us get away with one fewer __SIZEOF_POINTER__ conditional. Note that this choice doesn’t impose a 15 character limit on FORTH words provided our code never accesses the .code member directly and respects .flags & F_LENMASK instead. We introduced code_field_address(word) for convenience.

I can honestly say that I have never written code with a higher bug density before this. Every tiny little detail matters, particularly when testing some of the more advanced FORTH words like CFA> and SEE. Things mostly work right now but I wouldn’t trust it for space exploration.

Less generous readers are sure to ask: “are you happy now, what was the point of it all?” One of the cool things having this code in C lets us do is use emscripten to build it for the web! Emscripten is based on the Clang/LLVM stack, not GCC/libgccjit. Another GCC extension my code relies on, which I didn’t bother to mention, are Nested Functions to push on and pop from the data stack. This led me to discover Clang’s Blocks. The wikipedia entry for them is more advanced than strictly necessary but still incredibly useful. The basic syntax is reasonably intuitive:

    intptr_t (^pop)(void) = ^(void)
    {
        assert(sp < stack + STACK_SIZE);
        return *sp++;
    };

This might also be a good time to point out that our stacks grow downwards to match x86 push/pop conventions.

Then I hit an issue where emscripten decided to not implement syscall(2). I threw a simplistic shim together and moved on. Finally, I wrapped the whole thing in a thin xterm-pty layer and got a rather more heterogeneous clone of WAForth.

Many thanks to stefnotch for explaining how a Service Worker lets us access the required SharedArrayBuffer on github pages

Thanks also to Richard Jones for pointing out I forgot to include a link to my GNU C translation of his code. Here it is.

3/1 Update: I was really impressed by how jonesforth.f defines ARGC, ARGV, and ENVIRON so I set out to understand it better. I came across two tremendously helpful blog posts about the subject:

These posts highlight that the variation between x86 and x86-64 makes it very difficult to write code that will “find” argc in 32-bit and 64-bit environments. That difficulty can be avoided with the -nostartfiles link option. This option lets us define a void _start(void) entry point instead of the standard int main(int argc, char *argv[], char *envp[]). Furthermore, -mpreferred-stack-boundary=3 also avoids a one word padding difference between GCC and Clang. The result is unfortunately still not ABI-compatible with jonesforth.S and requires a small tweak to the definition of ARGC:

< : ARGC S0 @ @ ;
---
> : ARGC 7 CELLS S0 @ + @ ;

4th.fs needs to skip 7 cells because neither GCC nor Clang provide a way to avoid saving callee-saved registers. Visual inspection of the generated assembly code reveals that both compilers save 6 registers (%r12, %r13, %r14, %r15, %rbp, %rbx in 64-bit) plus an extra word that holds the base address of our stack. Emscripten does not care for this low-level chicanery and will require a different (as of yet unidentified) approach. Making ARGC, ARGV, and ENVIRON work in the browser is left as an exercise to the reader.

6/10 Update: SEE >CFA and SEE QUIT used to segfault on x86-64 and, while I understood the immediate cause (second WHILE loop in SEE went one step past the end of their respective .code[]), I struggled to fix it. As luck would have it, I inspected the asm output (using gcc’s -S and -fverbose-asm command line flags) and noticed that each word ends with an .align 16 directive. This indicates that gcc pads flexible array members to double word boundaries. Appending an unreachable EXIT to these two words fixes the decompilation. This kludge is significantly easier than influencing gcc’s padding.

While I had the patient on the table, I finally understood the gcc doc explaining how to “stringize the result of expansion of a macro argument” and replaced several #if __SIZEOF_POINTER__ by XSTR(__SIZEOF_POINTER__). I also realized that our DEFCONST macro handles more than constants and simplified a number of short words. Then, I improved the DEFCODE macro to reduce boilerplate further. Finally, I leveraged python’s subprocess to write a simple unit test. This revealed another segmentation fault in my github action which runs on ubuntu-latest and GCC 11 whereas I still run GCC 10 locally. gdb could attribute the segmentation fault to an unaligned MOVAPS instruction but couldn’t show the corresponding stack. The root cause was the -mpreferred-stack-boundary=3 switch I introduced to simplify ARGC. So I replaced it by -mstackrealign, adjusted the definition of ARGC accordingly (7 → 8) and added a test for it. I would never have figured any of this out without nektos/act and Visual Studio Code Dev Containers.