Posted on 5th of September 2023
| 661 wordsSummer months went by fast when you had lot on your hands. Didn’t feel like
keeping the reading log up to date during these months, so I’ll just do one
big post-summer update here.
Aldous Huxley: Brave New World (reread)
On May, I seemed to be rereading bunch of classics especially with the
common theme of dystopian future. Why you might ask? That I don’t know,
maybe something inside me just thinks that future depicted in these classics
is inevitable. Brave New World is probably one of my favorite books ever
written. Wonderful vision of advanced utopia, but at what cost? Masterful
critique of the dehumanizing effects of a highly controlled and
pleasure-driven society challenges readers to reflect on the consequences of
sacrificing personal freedom for comfort and conformity.
George Orwell: 1984 (reread)
Continuing on the series of dystopia and nightmarish tales of possible
future. Tale about the dangers of totalitarianism and government
surveillance. Serves as a stark reminder of the importance of safeguarding
individual freedoms, critical thinking, and truth itself.
William Golding: Lord of the Flies (reread)
More classics! Compelling exploration of human nature and the thin veneer of
civilization that separates order from chaos.
J.D. Sallinger: The Catcher in the Rye
The Catcher in the Rye was, one of many, classics that I hadn’t read before
and to be honest, wasn’t a huge fan of it. Maybe since I read this as a
little bit older. Don’t know. Topics in this book was really interesting,
considering teenager alienation, identity etc. While Holden’s struggles in
his adolescent were definitely unique, I just felt more annoyed about him
that anything else.
Jacques Ellul: The Technological Society
Jacques Ellul: Propaganda: The Formation of Men’s Attitudes
Ted Kaczynski: Industrial Society and Its Future
Ted Kaczynski: Anti-Tech Revolution: Why and How
Every once in a while my inner luddite wakes up and I start hating
everything about technology. Reading Jacques Ellul was definitely part of
this, but this time it was mainly the death of Unabomber, Ted Kaczynski,
that brought me to him. While I don’t agree/support on what Ted Kaczynski
did, he had some great points about how technology affects us. Ellul was
many ways inspiration for him in these sort of anti-technology philosophies
and it affection on humans. But Kaczynski and Ellul had one big difference
between them, Ellul was a pacifist were as Kaczynski was a domestic
terrorist.
Nonetheless, all of these books make some great points about technology so
despite your opinion about it, these are definitely worth a read.
Cormac McCarthy: The Road
Cormac McCarthy: No Country for Old Men
During the summer unfortunately we had a couple of passings of great
authors, one of which were Cormac McCarthy. I had never read his books
before, but I was familiar with as a movie from (No Country for Old Men by
Coen Brothers) and I had heard great things about his writing. So I grabbed
a copy of The Road and No Country for Old Men, since those were quite highly
recommended. And the recommendations were definitely true.
The Road offers a haunting depiction of post-apocalyptic America and the
struggles of father and son in this world. Immersing in bleak landscape
where hope and love endure against all odds. No Country for Old Men delves
into the realms of crime, fate, and the inexorable consequences of one’s
choices.
Both of these works showcase McCarthy’s masterful storytelling, unique
narrative styles, and philosophical depth, making them essential reads for
those interested in literature that explores the human condition in its most
challenging and thought-provoking forms.
Richard P. Gabriel: Patterns of Software
Richard P. Gabriel is a pretty known name in the software world and
especially in the Lisp world. One of the famous writings of Richard P.
Gabriel is the concept of “Worse Is
Better”
.
Patterns of Software gives a great look into software design, programming
and business around it. Definitely a must read for everyone working in this
industry.
Posted on 24th of August 2023
| 1860 wordsFor the initial implementation of Sila’s parses, I decided to go with the
recursive descent route due to its simplicity. So what it is? Recursive
descent parsing is a top-down parsing technique employed to analyze the syntax
of structured input, such as programming languages or mathematical
expressions. It operates by breaking down the input into smaller, manageable
components using a set of recursive functions that correspond to the
non-terminal symbols in the grammar.
Beginning with a starting symbol, typically representing the highest-level
construct, these functions are invoked recursively to recognize and process
the various syntactic elements of the input. Each function handles a specific
grammar rule, and by calling each other based on the grammar’s structure, the
parser traverses the input while building a parse tree that reflects the
hierarchical organization of the language’s syntax.
This approach allows for clear code organization and error reporting, though
it can face challenges with certain types of grammars and may require
augmentation with techniques like lookahead or memoization for optimal
efficiency and effectiveness. But for early stages of the project, I don’t
really mind that.
Parser Implementation
First I create some structures and helpers that I will be using in the parsing
process.
(deftype ast-node-kind ()
"Sila AST node kind."
'(member
:add
:sub
:mul
:div
:number))
(defstruct ast-node
"Structure for Sila AST nodes."
kind
val
lhs
rhs)
(defun skip-to-token (val tok)
"Search for token which is `val'. Returns `nil' when not found.
TODO(topi): Probably should do some proper error handling if `val' isn't
found."
(cond ((eq (token-kind tok) :eof) nil)
((equal (token-val tok) val) tok)
(t (search-token val (token-next tok)))))
For now, I’m interested in parsing something simple like 2 / (1 + 1) * 8
, so
simple multiplicative calculations and expressions inside parenthesis. So if I
would write some metanotation like BNF for that, it would look something like
this:
expr ::== mul ( '+' mul | '-' mul ) *
mul ::== primary ( '*' primary | '/' primary ) *
primary ::== '(' expr ')' | number
expr
This function is responsible for parsing arithmetic expressions. It follows
the grammar rule: expr ::= mul ( '+' mul | '-' mul ) *
. It starts by calling
the mul
function to parse the initial term. Then, it enters a loop
to
handle addition and subtraction operators. If it encounters a +
, it parses
the next term using mul and creates an AST node representing addition. If it
encounters a -
, it does the same for subtraction.
The loop
continues until it encounters an operator that’s not +
or -
, at
which point it returns the parsed expression node and the remaining token
stream.
(defun expr (tok)
"expr ::== mul ( '+' mul | '-' mul ) *"
(multiple-value-bind (node rest)
(mul tok)
(loop
(cond ((eq (token-val rest) #\+)
(multiple-value-bind (node2 rest2)
(mul (token-next rest))
(setf node (make-ast-node :kind :add :lhs node :rhs node2))
(setf rest rest2)))
((eq (token-val rest) #\-)
(multiple-value-bind (node2 rest2)
(mul (token-next rest))
(setf node (make-ast-node :kind :sub :lhs node :rhs node2))
(setf rest rest2)))
(t
(return-from expr
(values node rest)))))))
mul
Parsing mul
expressions works in a similar fashion:
(defun mul (tok)
"mul ::== primary ( '*' primary | '/' primary ) *"
(multiple-value-bind (node rest)
(primary tok)
(loop
(cond ((eq (token-val rest) #\*)
(multiple-value-bind (node2 rest2)
(primary (token-next rest))
(setf node (make-ast-node :kind :mul :lhs node :rhs node2))
(setf rest rest2)))
((eq (token-val rest) #\/)
(multiple-value-bind (node2 rest2)
(primary (token-next rest))
(setf node (make-ast-node :kind :div :lhs node :rhs node2))
(setf rest rest2)))
(t
(return-from mul
(values node rest)))))))
primary
This function parses primary expressions, which can be either numeric values
or sub-expressions enclosed in parentheses. It follows the grammar rule:
primary ::= '(' expr ')' | number
. If the current token is a number, it
creates an AST node representing a numeric value. If the token is an opening
parenthesis, it recursively calls the expr function to parse the enclosed
expression and skips to the closing parenthesis. It then returns the parsed
node and the remaining token stream.
(defun primary (tok)
"primary ::== '(' expr ')' | number"
(cond ((eq (token-kind tok) :num)
(values (make-ast-node :kind :number :val (token-val tok))
(token-next tok)))
((eq (token-val tok) #\()
(multiple-value-bind (node rest)
(expr (token-next tok))
(values node (token-next (skip-to-token #\) rest)))))
(t (error 'parser-error))))
Testing the Initial Parser
For now, I will be testing this parser only from the REPL. Firing up it, and
running the command should print something like this:
* (expr (tokenize "2 / (1 + 1) * 8"))
#S(AST-NODE
:KIND :MUL
:VAL NIL
:LHS #S(AST-NODE
:KIND :DIV
:VAL NIL
:LHS #S(AST-NODE :KIND :NUMBER :VAL 2 :LHS NIL :RHS NIL)
:RHS #S(AST-NODE
:KIND :ADD
:VAL NIL
:LHS #S(AST-NODE :KIND :NUMBER :VAL 1 :LHS NIL :RHS NIL)
:RHS #S(AST-NODE :KIND :NUMBER :VAL 1 :LHS NIL :RHS NIL)))
:RHS #S(AST-NODE :KIND :NUMBER :VAL 8 :LHS NIL :RHS NIL))
#S(SILA/LEXER::TOKEN :KIND :EOF :POS 15 :LEN NIL :VAL NIL :NEXT NIL)
*
Here we can see how our parser parses the expressions inside it. Transforming
it to ASCII would look something like this:
Mul(*)
/ \
Div(/) 8
/ \
2 Add(+)
/ \
1 1
Code Generation
Assembly code of something like 2 / (1 + 1) * 8
is still quite simple, but
definitely more instruction will be needed compared to the simple addition and
subtraction. Assembly code for it would equal to:
.globl main
main:
mov $8, %rax
push %rax
mov $1, %rax
push %rax
mov $1, %rax
pop %rdi
add %rdi, %rax
push %rax
mov $2, %rax
pop %rdi
cqo
idiv %rdi, %rax
pop %rdi
imul %rdi, %rax
ret
So what is happening here:
Implementation
For the code generation aspects, we can start by creating some unit tests since we already know what would be the output of our function:
(deftest test-emit-asm-div-mul-parens
(let ((emit-asm-tests '(("2 / (1 + 1) * 8" . " .globl main
main:
mov $8, %rax
push %rax
mov $1, %rax
push %rax
mov $1, %rax
pop %rdi
add %rdi, %rax
push %rax
mov $2, %rax
pop %rdi
cqo
idiv %rdi, %rax
pop %rdi
imul %rdi, %rax
ret
"))))
(dolist (test emit-asm-tests)
(ok (string-equal (sila::emit-asm (car test)) (cdr test))
(format nil "Expect to be equal: ~%~a~%=>~a"
`(sila::emit-asm ,(car test))
(cdr test))))))
The implementation aspects also will be relatively straight forward.
(defun asm-inst (inst)
(format nil " ~a~%" inst))
(defun asm-directive (dir)
(asm-inst dir))
(defun asm-label (label)
(format nil "~a:~%" label))
(defun asm-push ()
(asm-inst "push %rax"))
(defun asm-pop (reg)
(asm-inst (format nil "pop %~a" reg)))
(defun generate-expr (node)
(let ((asm ""))
(flet ((asm-conc (inst)
(setf asm (concatenate 'string asm inst))))
(when (eq (ast-node-kind node) :number)
(asm-conc (asm-inst (format nil "mov $~d, %rax"
(ast-node-val node))))
(return-from generate-expr asm))
(asm-conc (generate-expr (ast-node-rhs node)))
(asm-conc (asm-push))
(asm-conc (generate-expr (ast-node-lhs node)))
(asm-conc (asm-pop "rdi"))
(alexandria:switch ((ast-node-kind node) :test #'eq)
(:add (asm-conc (asm-inst "add %rdi, %rax")))
(:sub (asm-conc (asm-inst "sub %rdi, %rax")))
(:mul (asm-conc (asm-inst "imul %rdi, %rax")))
(:div (asm-conc (asm-inst "cqo"))
(asm-conc (asm-inst "idiv %rdi, %rax")))
(t (error 'parser-error))))
asm))
This function recursively generates assembly code from an abstract syntax tree
node representing an arithmetic expression. It constructs assembly
instructions based on the kind of node encountered. It concatenates these
instructions into a string that represents the generated assembly code.
For a number node, it generates a move instruction (mov
) to load the value
into %rax
.
For binary operations (addition, subtraction, multiplication, division), it
generates assembly for the right-hand side expression, pushes %rax
onto the
stack, generates assembly for the left-hand side expression, pops the value
into %rdi, and then performs the corresponding operation.
Then lastly we just define the emit function, like so:
(defun emit-asm (src)
"Emit assembly code from given source code. Currently emits only x86-64 and
only Linux is tested."
(let ((asm ""))
(flet ((asm-conc (inst)
(setf asm (concatenate 'string asm inst))))
(multiple-value-bind (node rest)
(expr (tokenize src))
(unless (eq (token-kind rest) :eof)
(error 'parser-error :error-msg "Extra tokens"))
(asm-conc (asm-directive ".globl main"))
(asm-conc (asm-label "main"))
(asm-conc (generate-expr node))
(asm-conc (asm-inst "ret"))))
asm))
Again, we can test it directly in the REPL first:
* (emit-asm "2 / (1 + 1) * 8")
" .globl main
main:
mov $8, %rax
push %rax
mov $1, %rax
push %rax
mov $1, %rax
pop %rdi
add %rdi, %rax
push %rax
mov $2, %rax
pop %rdi
cqo
idiv %rdi, %rax
pop %rdi
imul %rdi, %rax
ret
"
Seems to work. Now to run the tests:
# make test
rove sila.asd
Testing System sila/tests
;; testing 'sila/tests'
test-compilation-and-compare-rc
✓ Expect (COMPILE-PROGRAM-AND-COMPARE-RC "0" 0) to be true. (2726ms)
✓ Expect (COMPILE-PROGRAM-AND-COMPARE-RC "5 + 40 - 20" 25) to be true. (2693ms)
✓ Expect (COMPILE-PROGRAM-AND-COMPARE-RC "2 / (1 + 1) * 8" 8) to be true. (2703ms)
test-emit-asm-integer
✓ Expect to be equal:
(EMIT-ASM 0)
=> .globl main
main:
mov $0, %rax
ret
✓ Expect to be equal:
(EMIT-ASM 42)
=> .globl main
main:
mov $42, %rax
ret
test-emit-asm-add-sub
✓ Expect to be equal:
(EMIT-ASM 5+20-4)
=> .globl main
main:
mov $4, %rax
push %rax
mov $20, %rax
push %rax
mov $5, %rax
pop %rdi
add %rdi, %rax
pop %rdi
sub %rdi, %rax
ret
✓ Expect to be equal:
(EMIT-ASM 5 + 20 - 4 )
=> .globl main
main:
mov $4, %rax
push %rax
mov $20, %rax
push %rax
mov $5, %rax
pop %rdi
add %rdi, %rax
pop %rdi
sub %rdi, %rax
ret
test-emit-asm-div-mul-parens
✓ Expect to be equal:
(EMIT-ASM 2 / (1 + 1) * 8)
=> .globl main
main:
mov $8, %rax
push %rax
mov $1, %rax
push %rax
mov $1, %rax
pop %rdi
add %rdi, %rax
push %rax
mov $2, %rax
pop %rdi
cqo
idiv %rdi, %rax
pop %rdi
imul %rdi, %rax
ret
✓ 1 test completed
Summary:
All 1 test passed.
Cool, seems to work!
Posted on 19th of August 2023
| 2491 wordsA while ago, I wrote about why would anyone start writing a new programming
language
.
Consequently, I decided to gradually initiate the project. Simultaneously, I
found the idea of documenting my progress to be intriguing. I opted to adopt a
strategy of incremental development, wherein I would construct small,
functional components and refine them over time.
This incremental methodology draws inspiration from the
paper
by Abdulaziz Ghuloum.
Other tremendously useful resources include works such as
tcc
, A small C compiler written by Fabrice
Bellard, lcc
: Another small C compiler. The
creators wrote a
book
about the
internals of lcc, which I found a good resource to see how a compiler is
implemented, and also chibicc
by Rui
Ueyama.
Before embarking on this endeavor, I want to underscore that I currently
harbor modest ambitions for this project. Consequently, my approach might be
somewhat sporadic, aligning with my personal inclination and availability.
Primarily, I intend this project to serve educational and research purposes,
surpassing other objectives.
Getting Started With Sila Compiler
I decided to call this project Sila, which comes from Pali and is often used
in Buddhist texts meaning:
śīla (Pali: sīla; T. tshul khrims ཚུལ་ཁྲིམས་), literally, ‘acting
appropriately’. Sila is translated as “skillful conduct,” “discipline,”
“ethical conduct,” “moral conduct,” “virtue,” etc.
Sila is said to be a way of being that is conducive to positive and happy
states of mind.
With the Buddhist teachings, sila is identified as:
- one of the three trainings
- one of the six paramitas in the Sanskrit tradition
- one of the ten paramis in the Pali tradition
From https://encyclopediaofbuddhism.org/wiki/
Śīla
To initiate the process, my initial step was to attempt the compilation of the
most elementary language conceivable. Essentially, my goal was to input an
integer into the compiler, which would then produce corresponding assembly
code. This assembly code, when compiled, would result in an exit code
identical to the provided integer. On a x86-64 Linux platform, the assembly
program to achieve this might resemble the following:
.globl main
main:
mov $0, %rax
ret
So what is happening here?
.globl main
: This directive tells the assembler that the label main should
be considered a global symbol. This means that the symbol main can be accessed
from other parts of the program, possibly in other source files or modules.
main:
: This is a label that marks the beginning of the main function or
code block. Labels are used as targets for jumps and branches in assembly
code.
mov $0, %rax
: This instruction moves the immediate value 0 into the %rax
register. In the x86-64 architecture, %rax
is a general-purpose register that
is often used to hold return values of functions.
ret
: This instruction is used to return from a function. In this case,
since we’re looking at the main function, it’s effectively ending the program.
The value in %rax would typically be used as the exit status of the program.
Simple arithmetic, like “5 + 20 - 4”, is also easily represented x86 assembly:
.globl main
main:
mov $5, %rax
add $20, %rax
sub $4, %rax
ret
mov $5, %rax
: This instruction moves the immediate value 5 into the %rax
register.
add $20, %rax
: This instruction adds the immediate value 20 to the current
value stored in the %rax
register. After this operation, %rax
would hold
the value 25 (5 + 20).
sub $4, %rax
: This instruction subtracts the immediate value 4 from the
current value in the %rax
register. After this operation, %rax
would hold
the value 21 (25 - 4).
Why Common Lisp?
The internet is already saturated with fervent support for Lisp, leaving me
with a sense that there is little additional I can contribute on the matter.
My decision to opt for Common Lisp stems from pragmatic considerations: it is
a language in which I am effective, and, above all, it aligns with my personal
enjoyment of programming.
Testing Infrastructure
Considering my primary focus on Linux and x86-64, I must undertake certain
preparations to facilitate program testing. While I possess an x86-64 machine,
it operates on macOS, rendering the same assembly directives ineffective as
they are in Linux. Fortunately, this predicament is effortlessly resolved
through the utilization of Docker.
My container requirements are rather minimal. Given my work with Common Lisp,
the inclusion of a CL compiler is paramount. I’ve chosen
SBCL
for this purpose, primarily due to my familiarity
with it. While other implementations might suffice, my present focus won’t
extend to testing those alternatives. Alongside the CL compiler, an assembly
compiler for the generated code is essential. In this context, GCC serves as
an apt choice.
So the Dockerfile for development could look something like this:
FROM fukamachi/sbcl:2.3.7-debian
LABEL maintainer="Topi Kettunen <topi@topikettunen.com>"
ARG DEBIAN_FRONTEND=noninteractive
RUN apt-get update -qq \
&& apt-get install -qq -y --no-install-recommends \
build-essentials \
&& rm -rf /var/lib/apt/lists/*
# For unit testing
RUN sbcl --eval '(ql:quickload :rove)'
WORKDIR /root/.roswell/local-projects/sila
ENTRYPOINT ["/bin/bash"]
I decided to use Eitaro Fukamachi’s
dockerfiles
. These Dockerfiles are
built around roswell
, which I’m also using
locally.
I also write some simple Makefile to just make my life easier:
LISP ?= ros -Q run
.PHONY: build
build:
$(LISP) --load sila.asd \
--eval '(ql:quickload :sila)' \
--eval '(asdf:make :sila)' \
--eval '(quit)'
.PHONY: test
test:
$(LISP) --load sila.asd \
--eval '(ql:quickload :sila)' \
--eval '(asdf:test-system :sila)' \
--eval '(quit)'
.PHONY: docker-build
docker-build:
docker build -t sila .
.PHONY: docker-run
docker-run:
docker run -it --rm -v ${PWD}:/root/.roswell/local-projects/sila sila
Initial Tests
Since I already know what I want to output from my compiler at this point,
it’s easy to start development by writing some nifty little unit tests for it.
For now, I decided to write tests for emitting the assembly code and than
separately the compilation process.
(deftest test-emit-asm-integer
(let ((emit-asm-tests '(("0" . " .globl main
main:
mov $0, %rax
ret
")
("42" . " .globl main
main:
mov $42, %rax
ret
"))))
(dolist (test emit-asm-tests)
(ok (string-equal (sila::emit-asm (car test)) (cdr test))
(format nil "Expect to be equal: ~%~A~%=>~A"
`(sila::emit-asm ,(car test))
(cdr test))))))
(deftest test-emit-asm-add-sub
(let ((emit-asm-tests '(("5+20-4" . " .globl main
main:
mov $5, %rax
add $20, %rax
sub $4, %rax
ret
")
(" 5 + 20 - 4 " . " .globl main
main:
mov $5, %rax
add $20, %rax
sub $4, %rax
ret
"))))
(dolist (test emit-asm-tests)
(ok (string-equal (sila::emit-asm (car test)) (cdr test))
(format nil "Expect to be equal: ~%~A~%=>~A"
`(sila::emit-asm ,(car test))
(cdr test))))))
These tests are relatively straight forward, I generate the output and just
compare strings to each other to see that they both are the same. Compilation
itself is little bit more involved. Most likely you would achieve easier test
setup by just simple shell scripts, but since I want to write Common Lisp, I
decided to write some small testing harness for it.
(defvar *bin-name* "sila")
(defvar *tmp-name* "tmp.s")
(defvar *tmp-bin-name* "tmp")
(defun sila-project-root (path)
(namestring (asdf:system-relative-pathname :sila path)))
(defun compile-program-and-compare-rc (input expected-rc)
(multiple-value-bind (out err rc)
(uiop:run-program (list (sila-project-root *bin-name*)
input)
:output :string
:error-output :string)
(declare (ignore err rc))
(with-open-file (fstream (sila-project-root *tmp-name*)
:direction :output
:if-exists :supersede
:if-does-not-exist :create)
(format fstream out))
(uiop:run-program (list "gcc" "-static" "-o" *tmp-bin-name* *tmp-name*))
(multiple-value-bind (out err got-rc)
(uiop:run-program (sila-project-root *tmp-bin-name*)
;; Since we're working with arbitrary return codes.
:ignore-error-status t)
(declare (ignore out err))
(= expected-rc got-rc))))
(deftest test-compilation-and-compare-rc
(ok (compile-program-and-compare-rc "0" 0))
(ok (compile-program-and-compare-rc "5 + 40 - 20" 25)))
What I’m doing here is that I use the freshly created executable that outputs
assembly and pipe the output to a file called tmp.s
. After that I run gcc
for that generated assembly and compile it to a binary called tmp
. Then I
just run the program and check what return code it returns and compare those
to what I expect them to be.
Obviously the tests will fail for now, but those give me a nice foundation on
top of which is nice to start developing.
Tokenization
Before I do anything for my compiler, I need to do some lexical analysis and
generate some tokens from given input. I start this by defining some
structures and types for holding these tokens:
(deftype kind ()
'(member
:punct
:num
:eof))
(defstruct token
kind
pos
len
val
next)
Since I’m currently only working with strings given as input, I will do the
tokenization simply moving a pointer through and creating tokens based on the
subsequences I encounter in the given input string. Before that, I need some
helper functions:
(defun whitespacep (c)
(member c '(#\Space #\Tab #\Return #\Newline)))
(defun punctuatorp (c)
;; TODO(topi): Add more punctuators when needed.
(member c '(#\+ #\-)))
(defun skip-to-punctuator (input start)
;; TODO(topi): Improve this. Won't work when new tokens are added.
(position-if-not #'digit-char-p input :start start))
I also define a condition for encountering invalid tokens, which I can throw
in case something bad happens:
(define-condition invalid-token (error)
((token :initarg :token
:initform nil
:reader token)
(token-position :initarg :token-position
:initform nil
:reader token-position))
(:report (lambda (condition stream)
(format stream "Invalid token: '~A' at position ~A.~&"
(token condition)
(token-position condition))))
(:documentation "Condition for when we encounter invalid token."))
As you can see in the token
structure above, there is a slot called next
indicating the next token. So I’m working with linked lists here. I’ll start
the tokenization by creating head
node and also cur
node which I will be
populating while I’m iterating through the list. I also initialise a variable
for pointing me where I’m currently at in the given input string.
The way I decided to do this initial tokenization is simply looping through
the given input with loop
. At the moment, I’m not interested in whitespace,
so when I encounter whitespace character, I’ll just move the src-pos
forward.
Currently, I’m interested in at what position I will encounter a punctuator,
either +
or -
, in the given input string. To do that, I’ll use the
skip-to
function defined above, which returns the position from the input
string where first given item is found at. When I know that, it’s easy the
deduce the positions of each token in the given input.
(defun tokenize (src)
(let* ((head (make-token))
(cur head)
(src-pos 0))
(loop :while (< src-pos (length src))
:do (let ((punct-pos (skip-to-punctuator src src-pos)))
(cond (;; Whitespace
(whitespacep (char src src-pos))
(incf src-pos))
(;; Numeric literal
(digit-char-p (char src src-pos))
(setf (token-next cur)
(make-token :kind num
:val (parse-integer src
:start src-pos
:junk-allowed t)
:pos src-pos))
(setf (token-len (token-next cur))
(if punct-pos
(- punct-pos src-pos)
;; No more punctuators.
(- (length src) src-pos)))
(setf cur (token-next cur))
(if punct-pos
(setf src-pos punct-pos)
;; No more punctuators, move `src-pos` to the end.
(setf src-pos (length src))))
(;; Punctuator: + | -
(punctuatorp (char src src-pos))
(setf (token-next cur)
(make-token :kind punct
:val (char src src-pos)
:pos src-pos
;; TODO(topi): Currently I only have
;; 1 char puncts, change when more is
;; needed.
:len 1))
(setf cur (token-next cur))
(incf src-pos))
(t
(error 'invalid-token
:token (char src src-pos)
:token-position src-pos)))))
;; No more tokens.
(setf (token-next cur) (make-token :kind eof
:pos src-pos))
(setf cur (token-next cur))
(token-next head))
We can then test this out in the REPL, to see what it prints out:
SILA> (sila/lexer:tokenize "1 + 1 - 123")
#S(SILA/LEXER::TOKEN
:KIND :NUM
:POS 0
:LEN 2
:VAL 1
:NEXT #S(SILA/LEXER::TOKEN
:KIND :PUNCT
:POS 2
:LEN 1
:VAL #\+
:NEXT #S(SILA/LEXER::TOKEN
:KIND :NUM
:POS 4
:LEN 2
:VAL 1
:NEXT #S(SILA/LEXER::TOKEN
:KIND :PUNCT
:POS 6
:LEN 1
:VAL #\-
:NEXT #S(SILA/LEXER::TOKEN
:KIND :NUM
:POS 8
:LEN 3
:VAL 123
:NEXT #S(SILA/LEXER::TOKEN
:KIND :EOF
:POS 11
:LEN NIL
:VAL NIL
:NEXT NIL))))))
And it seems to work nicely giving us crucial information such as how long the
tokens are in the input and at what position they’re found at, something that
will be useful in the future.
Emitting Assembly
Since now the tokenization is working more or less as intended, we can move
forward with the assembly generation. For now, the only thing that matters to
me is that we can generate some simple assembly to standard output. The way I
decided to this, is to just concatenate assembly instructions to one string
and just print the result. I’m only working with one label main
for now, so
I’m not starting to do anything too fancy yet.
The way I’m emitting the assembly is simply by looping all the tokens that I
got from tokenize
until I encounter token
which kind is :eof
. Then
based on the token, I’ll just concatenate assembly instructions as needed.
I’ll also define simple entrypoint to my executable which only accepts one
string as input that will generated to assembly.
(defun main ()
"Sila programming language entrypoint."
(multiple-value-bind (options free-args)
(opts:get-opts)
;; TODO(topi): Add some useful options.
(declare (ignore options))
(unless (= (length free-args) 1)
(format *error-output* "Invalid number of arguments, expected 1.~%")
(opts:exit 1))
(format *standard-output* (emit-asm (first free-args)))
(opts:exit)))
(defun emit-asm (src)
"Emit assembly code from given source code. Currently emits only x86-64 and only Linux
is tested."
(let ((asm "")
(tok (tokenize src)))
(labels ((asm-conc (inst)
(setf asm (concatenate 'string asm inst)))
(asm-inst (inst)
(format nil " ~A~%" inst))
(asm-directive (dir)
(asm-inst dir))
(asm-label (label)
(format nil "~A:~%" label)))
(asm-conc (asm-directive ".globl main"))
(asm-conc (asm-label "main"))
;; For now, first token must be a number.
(asm-conc (asm-inst (format nil "mov $~d, %rax" (token-val tok))))
(setf tok (token-next tok))
(loop :until (equal (token-kind tok) :eof)
:do (cond ((eq (token-kind tok) :num)
(setf tok (token-next tok)))
((char= (token-val tok) #\+)
(asm-conc (asm-inst
(format nil "add $~d, %rax"
(token-val (token-next tok)))))
(setf tok (token-next tok)))
((char= (token-val tok) #\-)
(asm-conc (asm-inst
(format nil "sub $~d, %rax"
(token-val (token-next tok)))))
(setf tok (token-next tok)))))
(asm-conc (asm-inst "ret"))
asm)))
With these we should be able to run our testing suite to see what happens.
$ make test
[...]
Testing System sila/tests
;; testing 'sila/tests/main'
test-compilation-and-compare-rc
✓ Expect (COMPILE-PROGRAM-AND-COMPARE-RC "0" 0) to be true. (2763ms)
✓ Expect (COMPILE-PROGRAM-AND-COMPARE-RC "5 + 40 - 20" 25) to be true. (2509ms)
;; testing 'sila/tests/emit-asm-x86-64'
test-emit-asm-integer
✓ Expect to be equal:
(EMIT-ASM 0)
=> .globl main
main:
mov $0, %rax
ret
✓ Expect to be equal:
(EMIT-ASM 42)
=> .globl main
main:
mov $42, %rax
ret
test-emit-asm-add-sub
✓ Expect to be equal:
(EMIT-ASM 5+20-4)
=> .globl main
main:
mov $5, %rax
add $20, %rax
sub $4, %rax
ret
✓ Expect to be equal:
(EMIT-ASM 5 + 20 - 4 )
=> .globl main
main:
mov $5, %rax
add $20, %rax
sub $4, %rax
ret
✓ 1 test completed
Summary:
All 1 test passed.
Cool seems to pass!
Next Steps
I’m continuing the development of my compiler at my own pace, and will write
when something new happens. I think I will come back to some parts when I
decide to some improvements or refactorings so that everything happens in
small iterations. Bugs will be found and hopefully fixed. This might be a long
project, but I don’t mind, it’s fun! If you’re interested, feel free to follow
my blog to know when something new comes.