What, Forth, Again?
We already discussed Chuck Moore’s discovery of FORTH in a previous post but leveraging python bytecodes to implement it felt like cheating. Like many others, I started understanding FORTH through Richard WM Jones’ jonesforth. I trudged through the unfamiliar assembly syntax and settled down with many cups of coffee to understand the difference between direct and indirect threaded code. And yet, after all this time, I realized I didn’t really understand it. So I decided to translate jonesforth.S to C.
The first bit of actual code in jonesforth.S
is the NEXT
macro:
.macro NEXT
lodsl
jmp *(%eax)
.endm
NEXT
is fundamental to FORTH
yet remarkably only requires two x86 instructions. The second instruction
reminds us of What makes Julia delightful, cont’d? The
wikipedia entry on threaded code outlines a rather vast array
of potential translations. However, in the spirit of keeping our translation as faithful to the original as
possible, I chose GNU C’s Labels as Values. This
is what my NEXT
macro looks like in C:
#define NEXT do { target = *ip++; goto **target; } while (0)
Checking the generated assembly (using -fverbose-asm),
we see that GCC always translates goto **target
to
movq (%rdx), %rax
jmp *%rax
so that looks rather promising. GCC’s register allocation logic
means that the line which immediately precedes these instructions can vary greatly. ip
is not always kept in
the same register. Fortunately, we don’t need to worry about that.
With this first challenge behind us, we move on to the precise memory layout of FORTH
dictionary entries. FORTH words
can vary in length. Words written in assembly (or in our case C) language only require a single address. Words
written in FORTH can refer to N previously defined words plus the special DOCOL
label (since they were
introduced by a :
) that is at the center of indirect threading. Jones explains that extra indirection step
in great details, complete with ASCII diagrams. I will therefore not repeat it here.
Flexible array members were officially standardized
in C99 so we will use them to store the collection of labels in the dictionary
entry. We could store the first label, aka Code Field
, as a separate member but that only increases the
cognitive load of remembering that following parameters are contiguous. jonesforth.S
words actually contain
not one but two flexible arrays (name
and code
) where C only allows one. That’s our first big deviation
from jonesforth.S
(there will be more, don’t worry):
struct word_t {
struct word_t *link;
char flags;
char name[15] __attribute__((nonstring)); /* big enough for builtins, forth words might overflow */
void *code[];
};
Looking at all the words written in assembly, the longest name ("O_NONBLOCK"
) is 11 characters long. Assuming
8 bytes per word, we need two words to encode it. That (and a bit of
the lost art of structure packing)
explains how name[15]
was chosen. This choice is marginally wasteful
on 32 bit chips (where words only occupy 4 bytes) but lets us get away with one fewer __SIZEOF_POINTER__
conditional.
Note that this choice doesn’t impose a 15 character limit on FORTH words provided our code never accesses the .code
member directly and respects .flags & F_LENMASK
instead. We introduced code_field_address(word)
for convenience.
I can honestly say that I have never written code with a higher bug density before this. Every tiny little detail
matters, particularly when testing some of the more advanced FORTH words like CFA>
and SEE
. Things mostly work
right now but I wouldn’t trust it for space exploration.
Less generous readers are sure to ask: “are you happy now, what was the point of it all?” One of the cool things having this code in C lets us do is use emscripten to build it for the web! Emscripten is based on the Clang/LLVM stack, not GCC/libgccjit. Another GCC extension my code relies on, which I didn’t bother to mention, are Nested Functions to push on and pop from the data stack. This led me to discover Clang’s Blocks. The wikipedia entry for them is more advanced than strictly necessary but still incredibly useful. The basic syntax is reasonably intuitive:
intptr_t (^pop)(void) = ^(void)
{
assert(sp < stack + STACK_SIZE);
return *sp++;
};
This might also be a good time to point out that our stacks grow downwards to match x86 push
/pop
conventions.
Then I hit an issue where emscripten decided to not implement syscall(2). I threw a simplistic shim together and moved on. Finally, I wrapped the whole thing in a thin xterm-pty layer and got a rather more heterogeneous clone of WAForth.
Many thanks to stefnotch for explaining how a Service Worker lets us access the required SharedArrayBuffer on github pages
Thanks also to Richard Jones for pointing out I forgot to include a link to my GNU C translation of his code. Here it is.
3/1 Update: I was really impressed by how jonesforth.f
defines ARGC
, ARGV
, and ENVIRON
so I set out to
understand it better. I came across two tremendously helpful blog posts about the subject:
These posts highlight that the variation between x86 and x86-64 makes it very difficult to write code that will “find” argc
in
32-bit and 64-bit environments. That difficulty can be avoided with the -nostartfiles
link option. This option lets us define a void _start(void)
entry point instead of the standard
int main(int argc, char *argv[], char *envp[])
. Furthermore, -mpreferred-stack-boundary=3
also avoids a one word padding difference between GCC and Clang. The result is unfortunately still not ABI-compatible with jonesforth.S
and
requires a small tweak to the definition of ARGC
:
< : ARGC S0 @ @ ;
---
> : ARGC 7 CELLS S0 @ + @ ;
4th.fs
needs to skip 7 cells because neither GCC nor Clang provide a way to avoid saving callee-saved registers.
Visual inspection of the generated assembly code reveals that both compilers save 6 registers (%r12
, %r13
, %r14
, %r15
, %rbp
, %rbx
in 64-bit)
plus an extra word that holds the base address of our stack
. Emscripten does not care for this low-level chicanery and will require a different
(as of yet unidentified) approach. Making ARGC
, ARGV
, and ENVIRON
work in the browser is
left as an exercise to the reader.
6/10 Update: SEE >CFA
and SEE QUIT
used to segfault on x86-64
and, while I understood the immediate cause (second WHILE
loop in SEE
went one step past the end of their
respective .code[]
), I struggled to fix it. As luck would have it, I inspected the asm
output (using gcc’s
-S
and -fverbose-asm
command line flags) and noticed that each word ends with an .align 16
directive. This
indicates that gcc pads flexible array members to double word boundaries. Appending an unreachable EXIT
to
these two words fixes the decompilation. This kludge is significantly
easier than influencing gcc’s padding.
While I had the patient on the table, I finally understood the
gcc doc explaining how to “stringize the result of
expansion of a macro argument” and replaced several #if __SIZEOF_POINTER__
by XSTR(__SIZEOF_POINTER__)
.
I also realized that our DEFCONST
macro handles more than constants and simplified a number of short words.
Then, I improved the DEFCODE
macro to reduce boilerplate further. Finally, I leveraged
python’s subprocess to write a simple unit test.
This revealed another segmentation fault in my github action which runs on
ubuntu-latest and GCC 11
whereas I still run GCC 10 locally.
gdb could attribute the segmentation fault to an unaligned
MOVAPS instruction but couldn’t show the corresponding
stack. The root cause was the -mpreferred-stack-boundary=3
switch I introduced to simplify ARGC
.
So I replaced it by -mstackrealign
, adjusted the definition of ARGC
accordingly (7
→ 8
) and added a test for it.
I would never have figured any of this out without nektos/act and
Visual Studio Code Dev Containers.