How Many Roads Must a Man Walk Down?
Look at me being all highfalutin and referencing “The Bard” when I ought to admit this is more an “Oops!… I Did It Again” situation! I promised myself I wouldn’t and yet, here it is, I wrote yet another JONESFORTH clone. I have no excuse. Well, I kinda do. I was nerd sniped on discord. See, what happened was this: I built all previous clones to WebAssembly so Paul Tarvydas whether I considered “writing Forth-ish directly in WAT (WASM)”. I said no because I was already aware of WAForth which took the conversation into an exploration of WAForth with its author, Remko. WAForth inlines much more aggressively than JONESFORTH. It uses subroutine threading and “generates WebAssembly instructions in binary format”. Quite the feat if I’m honest.
But that strays rather far from JONESFORTH “simple”
indirect threading.
So I couldn’t just let it go now, could I. And luckily, WebAssembly introduced
return_call and return_call_indirect around 2023. So off I went.
I initially tried to have Claude do all the heavy lifting so I issued the
following prompt (with jonesforth.S in the context window):
Please translate
jonesforth.Sfrom x86 assembly to WAT the textual Webassembly representation
And off it went, quickly producing reams of code. I saved everything it produced
as a GitHub gist.
It looked great at first but soon hit a bunch of “pointer out of bounds” errors
and Claude is not great at debugging, at least not yet. I took a crack at it myself
but the horse was already out of the barn, Claude had produced too much code and
some of it was too different from the original source to be worth fixing. Here is
an interesting data point though: Claude ported WORD and NUMBER from JONESFORTH
but missed the, admittedly subtle, distinction between WORD (the primitive) and
_WORD (the actual implementation). Wonder how I could have adjusted the prompt
to highlight it.
Claude also stumbled over the fact that JONESFORT uses indirect threading. And that’s when I was humbly reminded of the quote, often attributed to Einstein:
If you can’t explain it simply, you don’t understand it well enough.
I had already ported JONESFORTH three times then (4th.c, 5th.c, and 6th.zig) yet I still couldn’t explain the difference between direct and indirect threading simply. Ironically, I think I can now!
I made a second attempt from scrach with Claude, this time using the following prompt:
jonesforth.Sis an indirect threaded Forth interpreter written in 32-bit x86 assembly. I want you to produce the closest possible equivalent in WAT, the textual representation of WebAssembly. You will usereturn_call_indirectto replaceJMP *(%eax)in theNEXTmacro. Thename_*labels which jonesforth.S introduces via thedefcodemacro will become WASM functions.
This time I tried to explain the WORD/_WORD duality in follow-up prompts but we
were going in circles so I stopped, grabbed the outer shell, and started down
the good ole-fashioned “artisanal” route of typing the bloody code myself!
I will admit that Claude’s two failed attempts gave me plenty of good pointers
on the WebAssembly text format.
I chose to use S-expressions throughout for readability although YMMV.
Richard WM Jones uses a lot of
GNU assembler macros in jonesforth.S. My C implementations use the
C Preprocessor to similar effect. Zig prides itself on having “no preprocessor,
no macros” but offers comptime metaprogramming instead. WebAssembly provides
none of that. So I wrote a poor man’s assembler of sorts in
python.
At its core, it’s but a dict of word definitions and a loop to print them along with headers (using
data)
and funcref registrations (using
elem).
Take SWAP for example:
(data (i32.const 0x5054) "\44\50\00\00\04SWAP\00\00\00\02\00\00\00")
(func $swap (type 0)
(local $t i32)
(local.set $t (i32.load offset=4 (global.get $sp)))
(i32.store offset=4 (global.get $sp) (i32.load (global.get $sp)))
(i32.store (global.get $sp) (local.get $t))
(return_call $next)
)
(elem (i32.const 0x2) $swap)
Its header starts at address 0x5054 with a link to the start of the preceding word
(in this case DROP at address 0x5044 and yes, WebAssembly is
little-endian). The next byte (0x04) encodes
the length and potential flags like HIDDEN or IMMEDIATE. The name itself ("SWAP") follows,
with padding to the next 4 byte word boundary. Finally, because SWAP is built in (i.e.
implemented in WebAssembly), its code word is just the index of $swap in the funcref table.
The first 2 and last 3 lines in this snippet were generated by python.
jonesforth.wast
started as a direct threading interpreter which meant it could at best execute composite
words consisting entirely of builtin words. One could imagine an implementation that
inlines aggressively
to maintain that invariant but that would lead to
code bloat and is certainly not how JONESFORTH
works. 2eb3615
converted to indirect threading and the following two lines go a long way to
explain the difference:
- (data (i32.const 0x53ec) "\dc\53\00\00\04>DFA\00\00\00\00\00\00\00\42\00\00\00\0d\00\00\00\23\00\00\00")
+ (data (i32.const 0x53ec) "\dc\53\00\00\04>DFA\00\00\00\00\00\00\00\e8\53\00\00\fc\50\00\00\10\52\00\00")
>DFA is the first composite word to appear in jonesforth.S. It decompiles to
: >DFA >CFA 4+ ;. Before
2eb3615,
its data consisted of the indices of >CFA, 4+, and EXIT in the funcref table.
Afterwards, its data became the code field addresses of those same words. So NEXT
loads the value
at memory address 0x53fc (16 bytes past 0x53ec to skip the link to >CFA, flag, name,
padding, and index of DOCOL (0)), see another memory address (0x53e8) which it needs to
load before it can return_call_indirect the corresponding function. Note also how the first
data word (0x53e8) is exactly 12 bytes past the link (0x53dc). The link points to the
>CFA word (with header) whereas >DFA’s first data element points to the address of >CFA’s
code field! Oh, the joys of pointer arithmetic!
To recap, it’s somehow fitting that this should be my fourth Forth!
| # | Language | Strategy | Source |
|---|---|---|---|
| 1 | C | Labels as Values | 4th.c |
| 2 | C | musttail |
5th.c |
| 3 | Zig | .always_tail |
6th.zig |
| 4 | WAT | return_call_indirect |
jonesforth.wast |
And it’s entirely unacceptable that the “native” WebAssembly implementation is the only one to
not have a browser demo! That’s because I developed it on WASI to
iterate quickly in the command line. I’ll work on a shim
real soon now:
Epilogue
I had so much fun writing jonesforth.wast
that I continued to experiment. First I replaced the 4 most often modified global variables by local ones.
With that, $swap becomes
(data (i32.const 0x5054) "\44\50\00\00\04SWAP\00\00\00\02\00\00\00")
(func $swap (param $cfa i32) (param $ip i32) (param $sp i32) (param $rsp i32)
(local i32)
(local.set 4 (i32.load offset=4 (local.get $sp)))
(i32.store offset=4 (local.get $sp) (i32.load (local.get $sp)))
(i32.store (local.get $sp) (local.get 4))
(return_call $next (local.get $cfa) (local.get $ip) (local.get $sp) (local.get $rsp))
)
(elem (i32.const 0x2) $swap)
You can see the whole thing in localize.wast.
After that, I remembered something Remko shared on Discord: uxn.wasm.
He uses br_table
to implement the UXN virtual machine. That’s quite cool so I took a
crack at it. The result is in tabulate.wast.
The whole point is how parentheses are balanced. More than a hundred are opened between lines 223 and 234 to
introduce the nested blocks that br_table requires. They are closed one at a time after each builtin word, e.g.
;; swap
(local.set 4 (i32.load offset=4 (local.get $sp)))
(i32.store offset=4 (local.get $sp) (i32.load (local.get $sp)))
(i32.store (local.get $sp) (local.get 4))
(br $next)) ;; ⇐ note 2 right parentheses here!
The only clever bit I came up with in that implementation are the jumps in INTERPRET and EXECUTE.
I realized I could simply branch “in the middle” of NEXT (after the first indirection from ip
into cfa but before the second one). All this takes is another label, which I named (rather
unimaginatively) $dispatch.
That led me to realize that good ole C offers
just enough flexibilty to achieve the same thing without non-standard extensions like labels as values
or tail calls. break and continue offer just enough daylight to skip to the end of a switch
statement or the top of the surrounding loop. This lends itself to the following scheme:
while (1) {
switch (memory[cfa]) {
case DOCOL:
memory[--rsp] = ip;
ip = cfa + 1;
break;
case DROP:
++sp;
break;
...
case EXECUTE:
cfa = memory[sp++] >> 2;
continue;
...
}
cfa = memory[ip++] >> 2;
}
(extracted from here).
As an added bonus, this last implementation (and I hope to everything I hold dear this is
the last one) is relocatable.
Addresses are relative to memory (which explains the SYSCALL1 hack). The left shifts
are required to conform to
jonesforth.f address arithmetic,
specifically around control structures (IF, THEN, …) and decompilation (CFA>, SEE).
In hindsight, my long and convoluted arc with JONESFORTH reminds me of this joke.