Sila Dev Log: Tokenization and Compiling Basic Arithmetic
Posted on 19th of August 2023A while ago, I wrote about why would anyone start writing a new programming
language.
Consequently, I decided to gradually initiate the project. Simultaneously, I
found the idea of documenting my progress to be intriguing. I opted to adopt a
strategy of incremental development, wherein I would construct small,
functional components and refine them over time.
This incremental methodology draws inspiration from the
paper by Abdulaziz Ghuloum.
Other tremendously useful resources include works such as
tcc, A small C compiler written by Fabrice Bellard,
lcc: Another small C compiler. The creators wrote
a book about the
internals of lcc, which I found a good resource to see how a compiler is
implemented, and also chibicc by Rui
Ueyama.
Before embarking on this endeavor, I want to underscore that I currently
harbor modest ambitions for this project. Consequently, my approach might be
somewhat sporadic, aligning with my personal inclination and availability.
Primarily, I intend this project to serve educational and research purposes,
surpassing other objectives.
Getting Started With Sila Compiler
I decided to call this project Sila, which comes from Pali and is often used
in Buddhist texts meaning:
śīla (Pali: sīla; T. tshul khrims ཚུལ་ཁྲིམས་), literally, ‘acting
appropriately’. Sila is translated as “skillful conduct,” “discipline,”
“ethical conduct,” “moral conduct,” “virtue,” etc.
Sila is said to be a way of being that is conducive to positive and happy
states of mind.
With the Buddhist teachings, sila is identified as:
- one of the three trainings
- one of the six paramitas in the Sanskrit tradition
- one of the ten paramis in the Pali tradition
From https://encyclopediaofbuddhism.org/wiki/Śīla
To initiate the process, my initial step was to attempt the compilation of the
most elementary language conceivable. Essentially, my goal was to input an
integer into the compiler, which would then produce corresponding assembly
code. This assembly code, when compiled, would result in an exit code
identical to the provided integer. On a x86-64 Linux platform, the assembly
program to achieve this might resemble the following:
.globl main
main:
mov $0, %rax
ret
So what is happening here?
.globl main: This directive tells the assembler that the label main should
be considered a global symbol. This means that the symbol main can be accessed
from other parts of the program, possibly in other source files or modules.
main:: This is a label that marks the beginning of the main function or
code block. Labels are used as targets for jumps and branches in assembly
code.
mov $0, %rax: This instruction moves the immediate value 0 into the %rax
register. In the x86-64 architecture, %rax is a general-purpose register that
is often used to hold return values of functions.
ret: This instruction is used to return from a function. In this case,
since we’re looking at the main function, it’s effectively ending the program.
The value in %rax would typically be used as the exit status of the program.
Simple arithmetic, like “5 + 20 - 4”, is also easily represented x86 assembly:
.globl main
main:
mov $5, %rax
add $20, %rax
sub $4, %rax
ret
mov $5, %rax: This instruction moves the immediate value 5 into the %rax
register.
add $20, %rax: This instruction adds the immediate value 20 to the current
value stored in the %rax register. After this operation, %rax would hold
the value 25 (5 + 20).
sub $4, %rax: This instruction subtracts the immediate value 4 from the
current value in the %rax register. After this operation, %rax would hold
the value 21 (25 - 4).
Why Common Lisp?
The internet is already saturated with fervent support for Lisp, leaving me
with a sense that there is little additional I can contribute on the matter.
My decision to opt for Common Lisp stems from pragmatic considerations: it is
a language in which I am effective, and, above all, it aligns with my personal
enjoyment of programming.
Testing Infrastructure
Considering my primary focus on Linux and x86-64, I must undertake certain
preparations to facilitate program testing. While I possess an x86-64 machine,
it operates on macOS, rendering the same assembly directives ineffective as
they are in Linux. Fortunately, this predicament is effortlessly resolved
through the utilization of Docker.
My container requirements are rather minimal. Given my work with Common Lisp,
the inclusion of a CL compiler is paramount. I’ve chosen
SBCL for this purpose, primarily due to my familiarity
with it. While other implementations might suffice, my present focus won’t
extend to testing those alternatives. Alongside the CL compiler, an assembly
compiler for the generated code is essential. In this context, GCC serves as
an apt choice.
So the Dockerfile for development could look something like this:
FROM fukamachi/sbcl:2.3.7-debian
LABEL maintainer="Topi Kettunen <topi@topikettunen.com>"
ARG DEBIAN_FRONTEND=noninteractive
RUN apt-get update -qq \
&& apt-get install -qq -y --no-install-recommends \
build-essentials \
&& rm -rf /var/lib/apt/lists/*
# For unit testing
RUN sbcl --eval '(ql:quickload :rove)'
WORKDIR /root/.roswell/local-projects/sila
ENTRYPOINT ["/bin/bash"]I decided to use Eitaro Fukamachi’s
dockerfiles. These Dockerfiles are
built around roswell, which I’m also using
locally.
I also write some simple Makefile to just make my life easier:
LISP ?= ros -Q run
.PHONY: build
build:
$(LISP) --load sila.asd \
--eval '(ql:quickload :sila)' \
--eval '(asdf:make :sila)' \
--eval '(quit)'
.PHONY: test
test:
$(LISP) --load sila.asd \
--eval '(ql:quickload :sila)' \
--eval '(asdf:test-system :sila)' \
--eval '(quit)'
.PHONY: docker-build
docker-build:
docker build -t sila .
.PHONY: docker-run
docker-run:
docker run -it --rm -v ${PWD}:/root/.roswell/local-projects/sila silaInitial Tests
Since I already know what I want to output from my compiler at this point,
it’s easy to start development by writing some nifty little unit tests for it.
For now, I decided to write tests for emitting the assembly code and than
separately the compilation process.
(deftest test-emit-asm-integer
(let ((emit-asm-tests '(("0" . " .globl main
main:
mov $0, %rax
ret
")
("42" . " .globl main
main:
mov $42, %rax
ret
"))))
(dolist (test emit-asm-tests)
(ok (string-equal (sila::emit-asm (car test)) (cdr test))
(format nil "Expect to be equal: ~%~A~%=>~A"
`(sila::emit-asm ,(car test))
(cdr test))))))
(deftest test-emit-asm-add-sub
(let ((emit-asm-tests '(("5+20-4" . " .globl main
main:
mov $5, %rax
add $20, %rax
sub $4, %rax
ret
")
(" 5 + 20 - 4 " . " .globl main
main:
mov $5, %rax
add $20, %rax
sub $4, %rax
ret
"))))
(dolist (test emit-asm-tests)
(ok (string-equal (sila::emit-asm (car test)) (cdr test))
(format nil "Expect to be equal: ~%~A~%=>~A"
`(sila::emit-asm ,(car test))
(cdr test))))))These tests are relatively straight forward, I generate the output and just
compare strings to each other to see that they both are the same. Compilation
itself is little bit more involved. Most likely you would achieve easier test
setup by just simple shell scripts, but since I want to write Common Lisp, I
decided to write some small testing harness for it.
(defvar *bin-name* "sila")
(defvar *tmp-name* "tmp.s")
(defvar *tmp-bin-name* "tmp")
(defun sila-project-root (path)
(namestring (asdf:system-relative-pathname :sila path)))
(defun compile-program-and-compare-rc (input expected-rc)
(multiple-value-bind (out err rc)
(uiop:run-program (list (sila-project-root *bin-name*)
input)
:output :string
:error-output :string)
(declare (ignore err rc))
(with-open-file (fstream (sila-project-root *tmp-name*)
:direction :output
:if-exists :supersede
:if-does-not-exist :create)
(format fstream out))
(uiop:run-program (list "gcc" "-static" "-o" *tmp-bin-name* *tmp-name*))
(multiple-value-bind (out err got-rc)
(uiop:run-program (sila-project-root *tmp-bin-name*)
;; Since we're working with arbitrary return codes.
:ignore-error-status t)
(declare (ignore out err))
(= expected-rc got-rc))))
(deftest test-compilation-and-compare-rc
(ok (compile-program-and-compare-rc "0" 0))
(ok (compile-program-and-compare-rc "5 + 40 - 20" 25)))What I’m doing here is that I use the freshly created executable that outputs
assembly and pipe the output to a file called tmp.s. After that I run gcc
for that generated assembly and compile it to a binary called tmp. Then I
just run the program and check what return code it returns and compare those
to what I expect them to be.
Obviously the tests will fail for now, but those give me a nice foundation on
top of which is nice to start developing.
Tokenization
Before I do anything for my compiler, I need to do some lexical analysis and
generate some tokens from given input. I start this by defining some
structures and types for holding these tokens:
(deftype kind ()
'(member
:punct
:num
:eof))
(defstruct token
kind
pos
len
val
next)Since I’m currently only working with strings given as input, I will do the
tokenization simply moving a pointer through and creating tokens based on the
subsequences I encounter in the given input string. Before that, I need some
helper functions:
(defun whitespacep (c)
(member c '(#\Space #\Tab #\Return #\Newline)))
(defun punctuatorp (c)
;; TODO(topi): Add more punctuators when needed.
(member c '(#\+ #\-)))
(defun skip-to-punctuator (input start)
;; TODO(topi): Improve this. Won't work when new tokens are added.
(position-if-not #'digit-char-p input :start start))
I also define a condition for encountering invalid tokens, which I can throw
in case something bad happens:
(define-condition invalid-token (error)
((token :initarg :token
:initform nil
:reader token)
(token-position :initarg :token-position
:initform nil
:reader token-position))
(:report (lambda (condition stream)
(format stream "Invalid token: '~A' at position ~A.~&"
(token condition)
(token-position condition))))
(:documentation "Condition for when we encounter invalid token."))As you can see in the token structure above, there is a slot called next
indicating the next token. So I’m working with linked lists here. I’ll start
the tokenization by creating head node and also cur node which I will be
populating while I’m iterating through the list. I also initialise a variable
for pointing me where I’m currently at in the given input string.
The way I decided to do this initial tokenization is simply looping through
the given input with loop. At the moment, I’m not interested in whitespace,
so when I encounter whitespace character, I’ll just move the src-pos
forward.
Currently, I’m interested in at what position I will encounter a punctuator,
either + or -, in the given input string. To do that, I’ll use the
skip-to function defined above, which returns the position from the input
string where first given item is found at. When I know that, it’s easy the
deduce the positions of each token in the given input.
(defun tokenize (src)
(let* ((head (make-token))
(cur head)
(src-pos 0))
(loop :while (< src-pos (length src))
:do (let ((punct-pos (skip-to-punctuator src src-pos)))
(cond (;; Whitespace
(whitespacep (char src src-pos))
(incf src-pos))
(;; Numeric literal
(digit-char-p (char src src-pos))
(setf (token-next cur)
(make-token :kind num
:val (parse-integer src
:start src-pos
:junk-allowed t)
:pos src-pos))
(setf (token-len (token-next cur))
(if punct-pos
(- punct-pos src-pos)
;; No more punctuators.
(- (length src) src-pos)))
(setf cur (token-next cur))
(if punct-pos
(setf src-pos punct-pos)
;; No more punctuators, move `src-pos` to the end.
(setf src-pos (length src))))
(;; Punctuator: + | -
(punctuatorp (char src src-pos))
(setf (token-next cur)
(make-token :kind punct
:val (char src src-pos)
:pos src-pos
;; TODO(topi): Currently I only have
;; 1 char puncts, change when more is
;; needed.
:len 1))
(setf cur (token-next cur))
(incf src-pos))
(t
(error 'invalid-token
:token (char src src-pos)
:token-position src-pos)))))
;; No more tokens.
(setf (token-next cur) (make-token :kind eof
:pos src-pos))
(setf cur (token-next cur))
(token-next head))We can then test this out in the REPL, to see what it prints out:
SILA> (sila/lexer:tokenize "1 + 1 - 123")
#S(SILA/LEXER::TOKEN
:KIND :NUM
:POS 0
:LEN 2
:VAL 1
:NEXT #S(SILA/LEXER::TOKEN
:KIND :PUNCT
:POS 2
:LEN 1
:VAL #\+
:NEXT #S(SILA/LEXER::TOKEN
:KIND :NUM
:POS 4
:LEN 2
:VAL 1
:NEXT #S(SILA/LEXER::TOKEN
:KIND :PUNCT
:POS 6
:LEN 1
:VAL #\-
:NEXT #S(SILA/LEXER::TOKEN
:KIND :NUM
:POS 8
:LEN 3
:VAL 123
:NEXT #S(SILA/LEXER::TOKEN
:KIND :EOF
:POS 11
:LEN NIL
:VAL NIL
:NEXT NIL))))))And it seems to work nicely giving us crucial information such as how long the
tokens are in the input and at what position they’re found at, something that
will be useful in the future.
Emitting Assembly
Since now the tokenization is working more or less as intended, we can move
forward with the assembly generation. For now, the only thing that matters to
me is that we can generate some simple assembly to standard output. The way I
decided to this, is to just concatenate assembly instructions to one string
and just print the result. I’m only working with one label main for now, so
I’m not starting to do anything too fancy yet.
The way I’m emitting the assembly is simply by looping all the tokens that I
got from tokenize until I encounter token which kind is :eof. Then
based on the token, I’ll just concatenate assembly instructions as needed.
I’ll also define simple entrypoint to my executable which only accepts one
string as input that will generated to assembly.
(defun main ()
"Sila programming language entrypoint."
(multiple-value-bind (options free-args)
(opts:get-opts)
;; TODO(topi): Add some useful options.
(declare (ignore options))
(unless (= (length free-args) 1)
(format *error-output* "Invalid number of arguments, expected 1.~%")
(opts:exit 1))
(format *standard-output* (emit-asm (first free-args)))
(opts:exit)))
(defun emit-asm (src)
"Emit assembly code from given source code. Currently emits only x86-64 and only Linux
is tested."
(let ((asm "")
(tok (tokenize src)))
(labels ((asm-conc (inst)
(setf asm (concatenate 'string asm inst)))
(asm-inst (inst)
(format nil " ~A~%" inst))
(asm-directive (dir)
(asm-inst dir))
(asm-label (label)
(format nil "~A:~%" label)))
(asm-conc (asm-directive ".globl main"))
(asm-conc (asm-label "main"))
;; For now, first token must be a number.
(asm-conc (asm-inst (format nil "mov $~d, %rax" (token-val tok))))
(setf tok (token-next tok))
(loop :until (equal (token-kind tok) :eof)
:do (cond ((eq (token-kind tok) :num)
(setf tok (token-next tok)))
((char= (token-val tok) #\+)
(asm-conc (asm-inst
(format nil "add $~d, %rax"
(token-val (token-next tok)))))
(setf tok (token-next tok)))
((char= (token-val tok) #\-)
(asm-conc (asm-inst
(format nil "sub $~d, %rax"
(token-val (token-next tok)))))
(setf tok (token-next tok)))))
(asm-conc (asm-inst "ret"))
asm)))With these we should be able to run our testing suite to see what happens.
$ make test
[...]
Testing System sila/tests
;; testing 'sila/tests/main'
test-compilation-and-compare-rc
✓ Expect (COMPILE-PROGRAM-AND-COMPARE-RC "0" 0) to be true. (2763ms)
✓ Expect (COMPILE-PROGRAM-AND-COMPARE-RC "5 + 40 - 20" 25) to be true. (2509ms)
;; testing 'sila/tests/emit-asm-x86-64'
test-emit-asm-integer
✓ Expect to be equal:
(EMIT-ASM 0)
=> .globl main
main:
mov $0, %rax
ret
✓ Expect to be equal:
(EMIT-ASM 42)
=> .globl main
main:
mov $42, %rax
ret
test-emit-asm-add-sub
✓ Expect to be equal:
(EMIT-ASM 5+20-4)
=> .globl main
main:
mov $5, %rax
add $20, %rax
sub $4, %rax
ret
✓ Expect to be equal:
(EMIT-ASM 5 + 20 - 4 )
=> .globl main
main:
mov $5, %rax
add $20, %rax
sub $4, %rax
ret
✓ 1 test completed
Summary:
All 1 test passed.Cool seems to pass!
Next Steps
I’m continuing the development of my compiler at my own pace, and will write
when something new happens. I think I will come back to some parts when I
decide to some improvements or refactorings so that everything happens in
small iterations. Bugs will be found and hopefully fixed. This might be a long
project, but I don’t mind, it’s fun! If you’re interested, feel free to follow
my blog to know when something new comes.