When did Basic become insulting?
Two thoughts come to mind at the start of this third post along the same theme:
- Letâs hope that, unlike The Hitchhikerâs Guide to the Galaxy, this is not part of an âincreasingly inaccurately namedâ trilogy
- All this talk of embedding interpreters in our browser is starting to remind us of the classic brandy snifter SNL skit
We recently covered FORTH, which uses postfix notation, and Lisp, which uses prefix notation (albeit with lots of parentheses). It seems fair to pick an infix language at this point. But itâs also fun to stick with our retro theme. Infix programming languages are trickier because they rely on a formal grammar to define their syntax as well as the precedence of their operators. In computer science, these grammars are often written down in Backus-Naur form named after John Backus of FORTRAN fame and Peter Naur who contributed to ALGOL 60.
Modern interpreters and compilers often rely on tools which generate lexers and parsers directly from the formal grammar. Lex and Yacc are early examples of such tools. What did hobbyists do before these tools became widely available? A hand-crafted recursive descent parser is one approach. It looks straightforward and almost systematic at first but, speaking from experience, can prove frustratingly fiddly.
The Homebrew Computer Club, an informal group of hackers, came up with a wonderfully pragmatic solution to this problem in response to Bill Gates kvetching about people pirating Altair BASIC: they defined their own dialect called Tiny BASIC. Dennis Allison wrote a technical specification for it. His design was based on an intermediate language (TBIL) to aid porting and running light without overbyte. This makes for a very layered system, a bit like an onion. As an interesting historical aside, the original BASIC was designed by John G. Kemeny and Thomas E. Kurtz who were both at Dartmouth College. Coincidentally, Dartmouth is also where John McCarthy began thinking about creating a language. And if youâre curious what it felt like to discover programming with BASIC, I highly recommend this article.
Luckily, Tom Pittman published
a bunch of content about his amazing contributions to Tiny BASIC on
http://www.ittybittycomputers.com.
In particular, the Tiny BASIC Experimenterâs Kit
(also in HTML)
describes every âInterpretive Language Operation Codeâ starting on page 15. Some of these
opcodes are geared towards parsing like BC
(String Match Branch). Its behavior is best understood by
looking at a specific use case, like its first appearance in the âOriginal Tiny Basic Intermediate Interpreterâ
on page 35. Bytes 13-16 read
000D ; 17 . STATEMENT EXECUTOR
000D ; 18 .
000D ; 19 :STMT BC GOTO "LET"
000D 8B4C45D4;
The first byte, 0x8B
, gets interpretered as 0x80
(the opcode for BC
)
plus 0x0B
to jump 11 bytes ahead if the user program does not match the string encoded in the following
bytes. The next byte, 0x4C
is simply the ASCII encoding for 'L'
.
0x45
similarly means 'E'
but 0xD4
is not 'T'
. The ASCII encoding for 'T'
is 0x54
. 0xD4
encodes an
additional (literal) bit of information, namely that this is the last character to match in the user code.
0xD4
is 'L' + 2â·
. And sure enough, bytes 25-30 read
0019 ; 27 :GOTO BC PRNT "GO"
0019 9447CF;
001C ; 28 BC GOSB "TO"
001C 8854CF;
Tiny BASIC is getting quite cute here. It matches "GOTO"
as "GO"
+ "TO"
so it can try "GOSUB"
next
if the first segment matched but not the second one, like a radix tree.
A good exercise to appreciate how much thought went into the design of TBIL is to write an
assembler for it.
That assembler must recognize opcodes that are followed by an argument, like branches (BR
, BV
, BN
, BE
, BC
),
jumps (JS
and J
), loads (LB
and LN
), and SX
. Most of them combine themselves with (the first byte of)
their argument. In the case of loaders, the argument follows. The inner loop could look something like this:
if op is OpCode.SX:
out.append(int(op) + int(arg))
elif op is OpCode.LB:
out.append(int(op))
out.append(int(arg))
elif op is OpCode.LN:
out.append(int(op))
out.extend(struct.pack(">H", eval(arg)))
elif op >= OpCode.BR:
offset = offsets.get(arg, here + 1) - (here + 1)
if op is OpCode.BR:
offset += 0x20
out.append(int(op) + offset)
elif op >= OpCode.JS:
target = struct.pack(">H", offsets.get(arg, 0))
out.append(int(op) + target[0])
out.extend(target[1:])
else:
out.append(int(op))
Those two calls to struct.pack(">H", ...)
determine the Tiny BASICâs
endianness. Tom picked big-endian while
WASM is little-endian for portability.
I initially convinced myself that endianness didnât matter as long as the virtual
machine was internally consistent. I was wrong because of LN
, JS
, and J
!
Another interesting aspect of TBIL is that it, like most assembly, requires two passes. The first pass constructs an incomplete byte stream and records the positions of labels. The second pass is identical to the first except it can compute jump offsets correctly from those positions. As a matter of fact, the outer loop of our assembler looks like
offsets = {}
for _ in range(2): # first pass to compute offsets
out = bytearray()
for j, line in enumerate(interpreter[1:].splitlines()):
...
Converting TBIL text to bytes might strike some readers as one low-level step too far. However, this two-bit history post reminds us of a (fortunately) distant past where assemblers were the first piece of software developers had to create before using a new machine. Can you imagine feverishly unboxing your brand new laptop only to discover that it only understand hex?