Posted on 5th of September 2023
| 661 wordsSummer months went by fast when you had lot on your hands. Didn’t feel like
keeping the reading log up to date during these months, so I’ll just do one
big post-summer update here.
Aldous Huxley: Brave New World (reread)
On May, I seemed to be rereading bunch of classics especially with the
common theme of dystopian future. Why you might ask? That I don’t know,
maybe something inside me just thinks that future depicted in these classics
is inevitable. Brave New World is probably one of my favorite books ever
written. Wonderful vision of advanced utopia, but at what cost? Masterful
critique of the dehumanizing effects of a highly controlled and
pleasure-driven society challenges readers to reflect on the consequences of
sacrificing personal freedom for comfort and conformity.
George Orwell: 1984 (reread)
Continuing on the series of dystopia and nightmarish tales of possible
future. Tale about the dangers of totalitarianism and government
surveillance. Serves as a stark reminder of the importance of safeguarding
individual freedoms, critical thinking, and truth itself.
William Golding: Lord of the Flies (reread)
More classics! Compelling exploration of human nature and the thin veneer of
civilization that separates order from chaos.
J.D. Sallinger: The Catcher in the Rye
The Catcher in the Rye was, one of many, classics that I hadn’t read before
and to be honest, wasn’t a huge fan of it. Maybe since I read this as a
little bit older. Don’t know. Topics in this book was really interesting,
considering teenager alienation, identity etc. While Holden’s struggles in
his adolescent were definitely unique, I just felt more annoyed about him
that anything else.
Jacques Ellul: The Technological Society
Jacques Ellul: Propaganda: The Formation of Men’s Attitudes
Ted Kaczynski: Industrial Society and Its Future
Ted Kaczynski: Anti-Tech Revolution: Why and How
Every once in a while my inner luddite wakes up and I start hating
everything about technology. Reading Jacques Ellul was definitely part of
this, but this time it was mainly the death of Unabomber, Ted Kaczynski,
that brought me to him. While I don’t agree/support on what Ted Kaczynski
did, he had some great points about how technology affects us. Ellul was
many ways inspiration for him in these sort of anti-technology philosophies
and it affection on humans. But Kaczynski and Ellul had one big difference
between them, Ellul was a pacifist were as Kaczynski was a domestic
terrorist.
Nonetheless, all of these books make some great points about technology so
despite your opinion about it, these are definitely worth a read.
Cormac McCarthy: The Road
Cormac McCarthy: No Country for Old Men
During the summer unfortunately we had a couple of passings of great
authors, one of which were Cormac McCarthy. I had never read his books
before, but I was familiar with as a movie from (No Country for Old Men by
Coen Brothers) and I had heard great things about his writing. So I grabbed
a copy of The Road and No Country for Old Men, since those were quite highly
recommended. And the recommendations were definitely true.
The Road offers a haunting depiction of post-apocalyptic America and the
struggles of father and son in this world. Immersing in bleak landscape
where hope and love endure against all odds. No Country for Old Men delves
into the realms of crime, fate, and the inexorable consequences of one’s
choices.
Both of these works showcase McCarthy’s masterful storytelling, unique
narrative styles, and philosophical depth, making them essential reads for
those interested in literature that explores the human condition in its most
challenging and thought-provoking forms.
Richard P. Gabriel: Patterns of Software
Richard P. Gabriel is a pretty known name in the software world and
especially in the Lisp world. One of the famous writings of Richard P.
Gabriel is the concept of “Worse Is
Better”
.
Patterns of Software gives a great look into software design, programming
and business around it. Definitely a must read for everyone working in this
industry.
Posted on 24th of August 2023
| 1860 wordsFor the initial implementation of Sila’s parses, I decided to go with the
recursive descent route due to its simplicity. So what it is? Recursive
descent parsing is a top-down parsing technique employed to analyze the syntax
of structured input, such as programming languages or mathematical
expressions. It operates by breaking down the input into smaller, manageable
components using a set of recursive functions that correspond to the
non-terminal symbols in the grammar.
Beginning with a starting symbol, typically representing the highest-level
construct, these functions are invoked recursively to recognize and process
the various syntactic elements of the input. Each function handles a specific
grammar rule, and by calling each other based on the grammar’s structure, the
parser traverses the input while building a parse tree that reflects the
hierarchical organization of the language’s syntax.
This approach allows for clear code organization and error reporting, though
it can face challenges with certain types of grammars and may require
augmentation with techniques like lookahead or memoization for optimal
efficiency and effectiveness. But for early stages of the project, I don’t
really mind that.
Parser Implementation
First I create some structures and helpers that I will be using in the parsing
process.
(deftype ast-node-kind ()
"Sila AST node kind."
'(member
:add
:sub
:mul
:div
:number))
(defstruct ast-node
"Structure for Sila AST nodes."
kind
val
lhs
rhs)
(defun skip-to-token (val tok)
"Search for token which is `val'. Returns `nil' when not found.
TODO(topi): Probably should do some proper error handling if `val' isn't
found."
(cond ((eq (token-kind tok) :eof) nil)
((equal (token-val tok) val) tok)
(t (search-token val (token-next tok)))))
For now, I’m interested in parsing something simple like 2 / (1 + 1) * 8
, so
simple multiplicative calculations and expressions inside parenthesis. So if I
would write some metanotation like BNF for that, it would look something like
this:
expr ::== mul ( '+' mul | '-' mul ) *
mul ::== primary ( '*' primary | '/' primary ) *
primary ::== '(' expr ')' | number
expr
This function is responsible for parsing arithmetic expressions. It follows
the grammar rule: expr ::= mul ( '+' mul | '-' mul ) *
. It starts by calling
the mul
function to parse the initial term. Then, it enters a loop
to
handle addition and subtraction operators. If it encounters a +
, it parses
the next term using mul and creates an AST node representing addition. If it
encounters a -
, it does the same for subtraction.
The loop
continues until it encounters an operator that’s not +
or -
, at
which point it returns the parsed expression node and the remaining token
stream.
(defun expr (tok)
"expr ::== mul ( '+' mul | '-' mul ) *"
(multiple-value-bind (node rest)
(mul tok)
(loop
(cond ((eq (token-val rest) #\+)
(multiple-value-bind (node2 rest2)
(mul (token-next rest))
(setf node (make-ast-node :kind :add :lhs node :rhs node2))
(setf rest rest2)))
((eq (token-val rest) #\-)
(multiple-value-bind (node2 rest2)
(mul (token-next rest))
(setf node (make-ast-node :kind :sub :lhs node :rhs node2))
(setf rest rest2)))
(t
(return-from expr
(values node rest)))))))
mul
Parsing mul
expressions works in a similar fashion:
(defun mul (tok)
"mul ::== primary ( '*' primary | '/' primary ) *"
(multiple-value-bind (node rest)
(primary tok)
(loop
(cond ((eq (token-val rest) #\*)
(multiple-value-bind (node2 rest2)
(primary (token-next rest))
(setf node (make-ast-node :kind :mul :lhs node :rhs node2))
(setf rest rest2)))
((eq (token-val rest) #\/)
(multiple-value-bind (node2 rest2)
(primary (token-next rest))
(setf node (make-ast-node :kind :div :lhs node :rhs node2))
(setf rest rest2)))
(t
(return-from mul
(values node rest)))))))
primary
This function parses primary expressions, which can be either numeric values
or sub-expressions enclosed in parentheses. It follows the grammar rule:
primary ::= '(' expr ')' | number
. If the current token is a number, it
creates an AST node representing a numeric value. If the token is an opening
parenthesis, it recursively calls the expr function to parse the enclosed
expression and skips to the closing parenthesis. It then returns the parsed
node and the remaining token stream.
(defun primary (tok)
"primary ::== '(' expr ')' | number"
(cond ((eq (token-kind tok) :num)
(values (make-ast-node :kind :number :val (token-val tok))
(token-next tok)))
((eq (token-val tok) #\()
(multiple-value-bind (node rest)
(expr (token-next tok))
(values node (token-next (skip-to-token #\) rest)))))
(t (error 'parser-error))))
Testing the Initial Parser
For now, I will be testing this parser only from the REPL. Firing up it, and
running the command should print something like this:
* (expr (tokenize "2 / (1 + 1) * 8"))
#S(AST-NODE
:KIND :MUL
:VAL NIL
:LHS #S(AST-NODE
:KIND :DIV
:VAL NIL
:LHS #S(AST-NODE :KIND :NUMBER :VAL 2 :LHS NIL :RHS NIL)
:RHS #S(AST-NODE
:KIND :ADD
:VAL NIL
:LHS #S(AST-NODE :KIND :NUMBER :VAL 1 :LHS NIL :RHS NIL)
:RHS #S(AST-NODE :KIND :NUMBER :VAL 1 :LHS NIL :RHS NIL)))
:RHS #S(AST-NODE :KIND :NUMBER :VAL 8 :LHS NIL :RHS NIL))
#S(SILA/LEXER::TOKEN :KIND :EOF :POS 15 :LEN NIL :VAL NIL :NEXT NIL)
*
Here we can see how our parser parses the expressions inside it. Transforming
it to ASCII would look something like this:
Mul(*)
/ \
Div(/) 8
/ \
2 Add(+)
/ \
1 1
Code Generation
Assembly code of something like 2 / (1 + 1) * 8
is still quite simple, but
definitely more instruction will be needed compared to the simple addition and
subtraction. Assembly code for it would equal to:
.globl main
main:
mov $8, %rax
push %rax
mov $1, %rax
push %rax
mov $1, %rax
pop %rdi
add %rdi, %rax
push %rax
mov $2, %rax
pop %rdi
cqo
idiv %rdi, %rax
pop %rdi
imul %rdi, %rax
ret
So what is happening here:
Implementation
For the code generation aspects, we can start by creating some unit tests since we already know what would be the output of our function:
(deftest test-emit-asm-div-mul-parens
(let ((emit-asm-tests '(("2 / (1 + 1) * 8" . " .globl main
main:
mov $8, %rax
push %rax
mov $1, %rax
push %rax
mov $1, %rax
pop %rdi
add %rdi, %rax
push %rax
mov $2, %rax
pop %rdi
cqo
idiv %rdi, %rax
pop %rdi
imul %rdi, %rax
ret
"))))
(dolist (test emit-asm-tests)
(ok (string-equal (sila::emit-asm (car test)) (cdr test))
(format nil "Expect to be equal: ~%~a~%=>~a"
`(sila::emit-asm ,(car test))
(cdr test))))))
The implementation aspects also will be relatively straight forward.
(defun asm-inst (inst)
(format nil " ~a~%" inst))
(defun asm-directive (dir)
(asm-inst dir))
(defun asm-label (label)
(format nil "~a:~%" label))
(defun asm-push ()
(asm-inst "push %rax"))
(defun asm-pop (reg)
(asm-inst (format nil "pop %~a" reg)))
(defun generate-expr (node)
(let ((asm ""))
(flet ((asm-conc (inst)
(setf asm (concatenate 'string asm inst))))
(when (eq (ast-node-kind node) :number)
(asm-conc (asm-inst (format nil "mov $~d, %rax"
(ast-node-val node))))
(return-from generate-expr asm))
(asm-conc (generate-expr (ast-node-rhs node)))
(asm-conc (asm-push))
(asm-conc (generate-expr (ast-node-lhs node)))
(asm-conc (asm-pop "rdi"))
(alexandria:switch ((ast-node-kind node) :test #'eq)
(:add (asm-conc (asm-inst "add %rdi, %rax")))
(:sub (asm-conc (asm-inst "sub %rdi, %rax")))
(:mul (asm-conc (asm-inst "imul %rdi, %rax")))
(:div (asm-conc (asm-inst "cqo"))
(asm-conc (asm-inst "idiv %rdi, %rax")))
(t (error 'parser-error))))
asm))
This function recursively generates assembly code from an abstract syntax tree
node representing an arithmetic expression. It constructs assembly
instructions based on the kind of node encountered. It concatenates these
instructions into a string that represents the generated assembly code.
For a number node, it generates a move instruction (mov
) to load the value
into %rax
.
For binary operations (addition, subtraction, multiplication, division), it
generates assembly for the right-hand side expression, pushes %rax
onto the
stack, generates assembly for the left-hand side expression, pops the value
into %rdi, and then performs the corresponding operation.
Then lastly we just define the emit function, like so:
(defun emit-asm (src)
"Emit assembly code from given source code. Currently emits only x86-64 and
only Linux is tested."
(let ((asm ""))
(flet ((asm-conc (inst)
(setf asm (concatenate 'string asm inst))))
(multiple-value-bind (node rest)
(expr (tokenize src))
(unless (eq (token-kind rest) :eof)
(error 'parser-error :error-msg "Extra tokens"))
(asm-conc (asm-directive ".globl main"))
(asm-conc (asm-label "main"))
(asm-conc (generate-expr node))
(asm-conc (asm-inst "ret"))))
asm))
Again, we can test it directly in the REPL first:
* (emit-asm "2 / (1 + 1) * 8")
" .globl main
main:
mov $8, %rax
push %rax
mov $1, %rax
push %rax
mov $1, %rax
pop %rdi
add %rdi, %rax
push %rax
mov $2, %rax
pop %rdi
cqo
idiv %rdi, %rax
pop %rdi
imul %rdi, %rax
ret
"
Seems to work. Now to run the tests:
# make test
rove sila.asd
Testing System sila/tests
;; testing 'sila/tests'
test-compilation-and-compare-rc
✓ Expect (COMPILE-PROGRAM-AND-COMPARE-RC "0" 0) to be true. (2726ms)
✓ Expect (COMPILE-PROGRAM-AND-COMPARE-RC "5 + 40 - 20" 25) to be true. (2693ms)
✓ Expect (COMPILE-PROGRAM-AND-COMPARE-RC "2 / (1 + 1) * 8" 8) to be true. (2703ms)
test-emit-asm-integer
✓ Expect to be equal:
(EMIT-ASM 0)
=> .globl main
main:
mov $0, %rax
ret
✓ Expect to be equal:
(EMIT-ASM 42)
=> .globl main
main:
mov $42, %rax
ret
test-emit-asm-add-sub
✓ Expect to be equal:
(EMIT-ASM 5+20-4)
=> .globl main
main:
mov $4, %rax
push %rax
mov $20, %rax
push %rax
mov $5, %rax
pop %rdi
add %rdi, %rax
pop %rdi
sub %rdi, %rax
ret
✓ Expect to be equal:
(EMIT-ASM 5 + 20 - 4 )
=> .globl main
main:
mov $4, %rax
push %rax
mov $20, %rax
push %rax
mov $5, %rax
pop %rdi
add %rdi, %rax
pop %rdi
sub %rdi, %rax
ret
test-emit-asm-div-mul-parens
✓ Expect to be equal:
(EMIT-ASM 2 / (1 + 1) * 8)
=> .globl main
main:
mov $8, %rax
push %rax
mov $1, %rax
push %rax
mov $1, %rax
pop %rdi
add %rdi, %rax
push %rax
mov $2, %rax
pop %rdi
cqo
idiv %rdi, %rax
pop %rdi
imul %rdi, %rax
ret
✓ 1 test completed
Summary:
All 1 test passed.
Cool, seems to work!
Posted on 19th of August 2023
| 2491 wordsA while ago, I wrote about why would anyone start writing a new programming
language
.
Consequently, I decided to gradually initiate the project. Simultaneously, I
found the idea of documenting my progress to be intriguing. I opted to adopt a
strategy of incremental development, wherein I would construct small,
functional components and refine them over time.
This incremental methodology draws inspiration from the
paper
by Abdulaziz Ghuloum.
Other tremendously useful resources include works such as
tcc
, A small C compiler written by Fabrice
Bellard, lcc
: Another small C compiler. The
creators wrote a
book
about the
internals of lcc, which I found a good resource to see how a compiler is
implemented, and also chibicc
by Rui
Ueyama.
Before embarking on this endeavor, I want to underscore that I currently
harbor modest ambitions for this project. Consequently, my approach might be
somewhat sporadic, aligning with my personal inclination and availability.
Primarily, I intend this project to serve educational and research purposes,
surpassing other objectives.
Getting Started With Sila Compiler
I decided to call this project Sila, which comes from Pali and is often used
in Buddhist texts meaning:
śīla (Pali: sīla; T. tshul khrims ཚུལ་ཁྲིམས་), literally, ‘acting
appropriately’. Sila is translated as “skillful conduct,” “discipline,”
“ethical conduct,” “moral conduct,” “virtue,” etc.
Sila is said to be a way of being that is conducive to positive and happy
states of mind.
With the Buddhist teachings, sila is identified as:
- one of the three trainings
- one of the six paramitas in the Sanskrit tradition
- one of the ten paramis in the Pali tradition
From https://encyclopediaofbuddhism.org/wiki/
Śīla
To initiate the process, my initial step was to attempt the compilation of the
most elementary language conceivable. Essentially, my goal was to input an
integer into the compiler, which would then produce corresponding assembly
code. This assembly code, when compiled, would result in an exit code
identical to the provided integer. On a x86-64 Linux platform, the assembly
program to achieve this might resemble the following:
.globl main
main:
mov $0, %rax
ret
So what is happening here?
.globl main
: This directive tells the assembler that the label main should
be considered a global symbol. This means that the symbol main can be accessed
from other parts of the program, possibly in other source files or modules.
main:
: This is a label that marks the beginning of the main function or
code block. Labels are used as targets for jumps and branches in assembly
code.
mov $0, %rax
: This instruction moves the immediate value 0 into the %rax
register. In the x86-64 architecture, %rax
is a general-purpose register that
is often used to hold return values of functions.
ret
: This instruction is used to return from a function. In this case,
since we’re looking at the main function, it’s effectively ending the program.
The value in %rax would typically be used as the exit status of the program.
Simple arithmetic, like “5 + 20 - 4”, is also easily represented x86 assembly:
.globl main
main:
mov $5, %rax
add $20, %rax
sub $4, %rax
ret
mov $5, %rax
: This instruction moves the immediate value 5 into the %rax
register.
add $20, %rax
: This instruction adds the immediate value 20 to the current
value stored in the %rax
register. After this operation, %rax
would hold
the value 25 (5 + 20).
sub $4, %rax
: This instruction subtracts the immediate value 4 from the
current value in the %rax
register. After this operation, %rax
would hold
the value 21 (25 - 4).
Why Common Lisp?
The internet is already saturated with fervent support for Lisp, leaving me
with a sense that there is little additional I can contribute on the matter.
My decision to opt for Common Lisp stems from pragmatic considerations: it is
a language in which I am effective, and, above all, it aligns with my personal
enjoyment of programming.
Testing Infrastructure
Considering my primary focus on Linux and x86-64, I must undertake certain
preparations to facilitate program testing. While I possess an x86-64 machine,
it operates on macOS, rendering the same assembly directives ineffective as
they are in Linux. Fortunately, this predicament is effortlessly resolved
through the utilization of Docker.
My container requirements are rather minimal. Given my work with Common Lisp,
the inclusion of a CL compiler is paramount. I’ve chosen
SBCL
for this purpose, primarily due to my familiarity
with it. While other implementations might suffice, my present focus won’t
extend to testing those alternatives. Alongside the CL compiler, an assembly
compiler for the generated code is essential. In this context, GCC serves as
an apt choice.
So the Dockerfile for development could look something like this:
FROM fukamachi/sbcl:2.3.7-debian
LABEL maintainer="Topi Kettunen <topi@topikettunen.com>"
ARG DEBIAN_FRONTEND=noninteractive
RUN apt-get update -qq \
&& apt-get install -qq -y --no-install-recommends \
build-essentials \
&& rm -rf /var/lib/apt/lists/*
# For unit testing
RUN sbcl --eval '(ql:quickload :rove)'
WORKDIR /root/.roswell/local-projects/sila
ENTRYPOINT ["/bin/bash"]
I decided to use Eitaro Fukamachi’s
dockerfiles
. These Dockerfiles are
built around roswell
, which I’m also using
locally.
I also write some simple Makefile to just make my life easier:
LISP ?= ros -Q run
.PHONY: build
build:
$(LISP) --load sila.asd \
--eval '(ql:quickload :sila)' \
--eval '(asdf:make :sila)' \
--eval '(quit)'
.PHONY: test
test:
$(LISP) --load sila.asd \
--eval '(ql:quickload :sila)' \
--eval '(asdf:test-system :sila)' \
--eval '(quit)'
.PHONY: docker-build
docker-build:
docker build -t sila .
.PHONY: docker-run
docker-run:
docker run -it --rm -v ${PWD}:/root/.roswell/local-projects/sila sila
Initial Tests
Since I already know what I want to output from my compiler at this point,
it’s easy to start development by writing some nifty little unit tests for it.
For now, I decided to write tests for emitting the assembly code and than
separately the compilation process.
(deftest test-emit-asm-integer
(let ((emit-asm-tests '(("0" . " .globl main
main:
mov $0, %rax
ret
")
("42" . " .globl main
main:
mov $42, %rax
ret
"))))
(dolist (test emit-asm-tests)
(ok (string-equal (sila::emit-asm (car test)) (cdr test))
(format nil "Expect to be equal: ~%~A~%=>~A"
`(sila::emit-asm ,(car test))
(cdr test))))))
(deftest test-emit-asm-add-sub
(let ((emit-asm-tests '(("5+20-4" . " .globl main
main:
mov $5, %rax
add $20, %rax
sub $4, %rax
ret
")
(" 5 + 20 - 4 " . " .globl main
main:
mov $5, %rax
add $20, %rax
sub $4, %rax
ret
"))))
(dolist (test emit-asm-tests)
(ok (string-equal (sila::emit-asm (car test)) (cdr test))
(format nil "Expect to be equal: ~%~A~%=>~A"
`(sila::emit-asm ,(car test))
(cdr test))))))
These tests are relatively straight forward, I generate the output and just
compare strings to each other to see that they both are the same. Compilation
itself is little bit more involved. Most likely you would achieve easier test
setup by just simple shell scripts, but since I want to write Common Lisp, I
decided to write some small testing harness for it.
(defvar *bin-name* "sila")
(defvar *tmp-name* "tmp.s")
(defvar *tmp-bin-name* "tmp")
(defun sila-project-root (path)
(namestring (asdf:system-relative-pathname :sila path)))
(defun compile-program-and-compare-rc (input expected-rc)
(multiple-value-bind (out err rc)
(uiop:run-program (list (sila-project-root *bin-name*)
input)
:output :string
:error-output :string)
(declare (ignore err rc))
(with-open-file (fstream (sila-project-root *tmp-name*)
:direction :output
:if-exists :supersede
:if-does-not-exist :create)
(format fstream out))
(uiop:run-program (list "gcc" "-static" "-o" *tmp-bin-name* *tmp-name*))
(multiple-value-bind (out err got-rc)
(uiop:run-program (sila-project-root *tmp-bin-name*)
;; Since we're working with arbitrary return codes.
:ignore-error-status t)
(declare (ignore out err))
(= expected-rc got-rc))))
(deftest test-compilation-and-compare-rc
(ok (compile-program-and-compare-rc "0" 0))
(ok (compile-program-and-compare-rc "5 + 40 - 20" 25)))
What I’m doing here is that I use the freshly created executable that outputs
assembly and pipe the output to a file called tmp.s
. After that I run gcc
for that generated assembly and compile it to a binary called tmp
. Then I
just run the program and check what return code it returns and compare those
to what I expect them to be.
Obviously the tests will fail for now, but those give me a nice foundation on
top of which is nice to start developing.
Tokenization
Before I do anything for my compiler, I need to do some lexical analysis and
generate some tokens from given input. I start this by defining some
structures and types for holding these tokens:
(deftype kind ()
'(member
:punct
:num
:eof))
(defstruct token
kind
pos
len
val
next)
Since I’m currently only working with strings given as input, I will do the
tokenization simply moving a pointer through and creating tokens based on the
subsequences I encounter in the given input string. Before that, I need some
helper functions:
(defun whitespacep (c)
(member c '(#\Space #\Tab #\Return #\Newline)))
(defun punctuatorp (c)
;; TODO(topi): Add more punctuators when needed.
(member c '(#\+ #\-)))
(defun skip-to-punctuator (input start)
;; TODO(topi): Improve this. Won't work when new tokens are added.
(position-if-not #'digit-char-p input :start start))
I also define a condition for encountering invalid tokens, which I can throw
in case something bad happens:
(define-condition invalid-token (error)
((token :initarg :token
:initform nil
:reader token)
(token-position :initarg :token-position
:initform nil
:reader token-position))
(:report (lambda (condition stream)
(format stream "Invalid token: '~A' at position ~A.~&"
(token condition)
(token-position condition))))
(:documentation "Condition for when we encounter invalid token."))
As you can see in the token
structure above, there is a slot called next
indicating the next token. So I’m working with linked lists here. I’ll start
the tokenization by creating head
node and also cur
node which I will be
populating while I’m iterating through the list. I also initialise a variable
for pointing me where I’m currently at in the given input string.
The way I decided to do this initial tokenization is simply looping through
the given input with loop
. At the moment, I’m not interested in whitespace,
so when I encounter whitespace character, I’ll just move the src-pos
forward.
Currently, I’m interested in at what position I will encounter a punctuator,
either +
or -
, in the given input string. To do that, I’ll use the
skip-to
function defined above, which returns the position from the input
string where first given item is found at. When I know that, it’s easy the
deduce the positions of each token in the given input.
(defun tokenize (src)
(let* ((head (make-token))
(cur head)
(src-pos 0))
(loop :while (< src-pos (length src))
:do (let ((punct-pos (skip-to-punctuator src src-pos)))
(cond (;; Whitespace
(whitespacep (char src src-pos))
(incf src-pos))
(;; Numeric literal
(digit-char-p (char src src-pos))
(setf (token-next cur)
(make-token :kind num
:val (parse-integer src
:start src-pos
:junk-allowed t)
:pos src-pos))
(setf (token-len (token-next cur))
(if punct-pos
(- punct-pos src-pos)
;; No more punctuators.
(- (length src) src-pos)))
(setf cur (token-next cur))
(if punct-pos
(setf src-pos punct-pos)
;; No more punctuators, move `src-pos` to the end.
(setf src-pos (length src))))
(;; Punctuator: + | -
(punctuatorp (char src src-pos))
(setf (token-next cur)
(make-token :kind punct
:val (char src src-pos)
:pos src-pos
;; TODO(topi): Currently I only have
;; 1 char puncts, change when more is
;; needed.
:len 1))
(setf cur (token-next cur))
(incf src-pos))
(t
(error 'invalid-token
:token (char src src-pos)
:token-position src-pos)))))
;; No more tokens.
(setf (token-next cur) (make-token :kind eof
:pos src-pos))
(setf cur (token-next cur))
(token-next head))
We can then test this out in the REPL, to see what it prints out:
SILA> (sila/lexer:tokenize "1 + 1 - 123")
#S(SILA/LEXER::TOKEN
:KIND :NUM
:POS 0
:LEN 2
:VAL 1
:NEXT #S(SILA/LEXER::TOKEN
:KIND :PUNCT
:POS 2
:LEN 1
:VAL #\+
:NEXT #S(SILA/LEXER::TOKEN
:KIND :NUM
:POS 4
:LEN 2
:VAL 1
:NEXT #S(SILA/LEXER::TOKEN
:KIND :PUNCT
:POS 6
:LEN 1
:VAL #\-
:NEXT #S(SILA/LEXER::TOKEN
:KIND :NUM
:POS 8
:LEN 3
:VAL 123
:NEXT #S(SILA/LEXER::TOKEN
:KIND :EOF
:POS 11
:LEN NIL
:VAL NIL
:NEXT NIL))))))
And it seems to work nicely giving us crucial information such as how long the
tokens are in the input and at what position they’re found at, something that
will be useful in the future.
Emitting Assembly
Since now the tokenization is working more or less as intended, we can move
forward with the assembly generation. For now, the only thing that matters to
me is that we can generate some simple assembly to standard output. The way I
decided to this, is to just concatenate assembly instructions to one string
and just print the result. I’m only working with one label main
for now, so
I’m not starting to do anything too fancy yet.
The way I’m emitting the assembly is simply by looping all the tokens that I
got from tokenize
until I encounter token
which kind is :eof
. Then
based on the token, I’ll just concatenate assembly instructions as needed.
I’ll also define simple entrypoint to my executable which only accepts one
string as input that will generated to assembly.
(defun main ()
"Sila programming language entrypoint."
(multiple-value-bind (options free-args)
(opts:get-opts)
;; TODO(topi): Add some useful options.
(declare (ignore options))
(unless (= (length free-args) 1)
(format *error-output* "Invalid number of arguments, expected 1.~%")
(opts:exit 1))
(format *standard-output* (emit-asm (first free-args)))
(opts:exit)))
(defun emit-asm (src)
"Emit assembly code from given source code. Currently emits only x86-64 and only Linux
is tested."
(let ((asm "")
(tok (tokenize src)))
(labels ((asm-conc (inst)
(setf asm (concatenate 'string asm inst)))
(asm-inst (inst)
(format nil " ~A~%" inst))
(asm-directive (dir)
(asm-inst dir))
(asm-label (label)
(format nil "~A:~%" label)))
(asm-conc (asm-directive ".globl main"))
(asm-conc (asm-label "main"))
;; For now, first token must be a number.
(asm-conc (asm-inst (format nil "mov $~d, %rax" (token-val tok))))
(setf tok (token-next tok))
(loop :until (equal (token-kind tok) :eof)
:do (cond ((eq (token-kind tok) :num)
(setf tok (token-next tok)))
((char= (token-val tok) #\+)
(asm-conc (asm-inst
(format nil "add $~d, %rax"
(token-val (token-next tok)))))
(setf tok (token-next tok)))
((char= (token-val tok) #\-)
(asm-conc (asm-inst
(format nil "sub $~d, %rax"
(token-val (token-next tok)))))
(setf tok (token-next tok)))))
(asm-conc (asm-inst "ret"))
asm)))
With these we should be able to run our testing suite to see what happens.
$ make test
[...]
Testing System sila/tests
;; testing 'sila/tests/main'
test-compilation-and-compare-rc
✓ Expect (COMPILE-PROGRAM-AND-COMPARE-RC "0" 0) to be true. (2763ms)
✓ Expect (COMPILE-PROGRAM-AND-COMPARE-RC "5 + 40 - 20" 25) to be true. (2509ms)
;; testing 'sila/tests/emit-asm-x86-64'
test-emit-asm-integer
✓ Expect to be equal:
(EMIT-ASM 0)
=> .globl main
main:
mov $0, %rax
ret
✓ Expect to be equal:
(EMIT-ASM 42)
=> .globl main
main:
mov $42, %rax
ret
test-emit-asm-add-sub
✓ Expect to be equal:
(EMIT-ASM 5+20-4)
=> .globl main
main:
mov $5, %rax
add $20, %rax
sub $4, %rax
ret
✓ Expect to be equal:
(EMIT-ASM 5 + 20 - 4 )
=> .globl main
main:
mov $5, %rax
add $20, %rax
sub $4, %rax
ret
✓ 1 test completed
Summary:
All 1 test passed.
Cool seems to pass!
Next Steps
I’m continuing the development of my compiler at my own pace, and will write
when something new happens. I think I will come back to some parts when I
decide to some improvements or refactorings so that everything happens in
small iterations. Bugs will be found and hopefully fixed. This might be a long
project, but I don’t mind, it’s fun! If you’re interested, feel free to follow
my blog to know when something new comes.
Posted on 28th of July 2023
| 519 wordsIn the vast and ever-evolving landscape of software development, programming
languages play a pivotal role in how developers interact with computers and
build applications. While a plethora of well-established programming languages
exist, there has always been an intriguing curiosity and desire among some
developers, including myself, to embark on the quest of creating our own
languages.
Existing programming languages come with their own set of strengths and
weaknesses. However, they may not always cater optimally to specific domains
or problem sets. This has been a driving force for many language creators who
seek full control over the language’s semantics. Building a new language
offers the opportunity to tailor the language’s features and syntax,
empowering developers and teams to solve certain classes of problems more
efficiently compared to existing solutions.
However, it is essential to acknowledge that creating a new language is no
easy task. Beyond the initial development, considerations for onboarding other
developers and promoting adoption must be taken into account. Moreover, when
considering building a language within a company setting, one must be mindful
of potential monetary implications.
The decision to build a new programming language should not solely be driven
by pragmatic concerns; it is also an intellectually stimulating exercise. The
process involves delving into various aspects of language design, including
syntax, semantics, and the implementation of the compiler or interpreter.
Aspiring language creators gain a deeper understanding of computer science
principles and enhance their problem-solving skills through this complex
undertaking.
Throughout my journey in the world of computer science, I have dabbled in
creating multiple toy compilers. Though they may not have been groundbreaking,
they ignited my curiosity about compiler implementation. Despite having
explored various compilers and interpreters of varying complexity, I felt a
lingering sense of not fully comprehending the intricacies occurring under the
hood.
Fortuitously, the summer provided a respite from my regular work commitments,
granting me the perfect opportunity to embark on a more serious attempt at
building a language from scratch. This time, I was determined to delve deep
into the intricate details of lexing, parsing, and code generation. This
ambitious undertaking offered an exciting challenge, and I was eager to
explore uncharted territories in language design.
The journey of building a new programming language is undoubtedly exciting and
rewarding, fueled by an array of motivations. Whether it is optimizing
performance to meet specific demands, addressing domain-specific challenges,
or simply exploring innovative language design paradigms, language creation is
a conduit for personal growth and innovation.
As the development of my language progresses, I anticipate gaining a more
comprehensive understanding of how programming languages shape our
interactions with computers and applications. I hope that my creation, like
many others in the world of programming languages, will contribute to the
diversity of tools available to developers, enabling them to craft exceptional
solutions tailored to their unique needs.
In conclusion, the decision to build a new programming language should be
approached with a blend of curiosity, pragmatism, and enthusiasm. It is a
journey that fosters personal growth, deepens technical understanding, and
holds the potential to revolutionize the way developers solve problems and
interact with computers.
Posted on 16th of July 2023
| 312 wordsI stumbled upon a fun-sounding challenge from Werner Vogels'
blog
so I wanted to partake in it personally. The challenge was to list
your favourite album for every year of your life with a restriction of
only one album per year and no repeats of artists.
Here is my list:
1995: Alice in Chains, Alice in Chains
1996: Type O Negative, October Rust
1997: Deftones, Around the Fur
1998: Neutral Milk Hotel, In the Aeroplane Over the Sea
1999: Sleep, Jerusalem (Dopesmoker)
2000: Godspeed You! Black Emperor - Lift Your Skinny Fists Like Antennas to Heaven
2001: B.R.M.C., Black Rebel Motorcycle Club
2002: Queens of the Stone Age, Songs for the Deaf
2003: Songs: Ohia, The Magnolia Electric Co.
2004: MF DOOM, MM...FOOD
2005: Boris, Pink
2006: Tool, 10,000 Days
2007: Porcupine Tree, Fear of a Blank Planet
2008: Have a Nice Life, Deathconsciousness
2009: Them Crooked Vultures, Them Crooked Vultures
2010: Nails, Unsilent Death
2011: Gillian Welch, The Harrow & the Harvest
2012: Death Grips, The Money Store
2013: Jason Isbell, Southeastern
2014: Swans, To Be Kind
2015: Kendrick Lamar, To Pimp a Butterfly
2016: Gojira, Magma
2017: Slowdive, Slowdive
2018: Anna Von Hausswolff, Dead Magic
2019: Justin Townes Earle, The Saint of Lost Causes
2020: Lianne La Havas, Lianne La Havas
2021: Silk Sonic, An Evening with Silk Sonic
2022: Weyes Blood, And in the Darkness, Hearts Aglow
2023: King Gizzard & the Lizard Wizard, PetroDragonic Apocalypse (as of July 2023)
Challenge turned out to be quite hard for me since most of my
favourite artists and albums are from time before my birth, but I
managed to find something for each year. Especially difficult aspect
was leaving great albums out. Rule of no repeats of artists didn’t
make it easier either. Most likely I missed some great artists/albums
so this list most likely just represents how I currently feel.