Sila Dev Log: Initial Recursive Descent Parsing

Posted on 24th of August 2023 | 1860 words

For the initial implementation of Sila’s parses, I decided to go with the recursive descent route due to its simplicity. So what it is? Recursive descent parsing is a top-down parsing technique employed to analyze the syntax of structured input, such as programming languages or mathematical expressions. It operates by breaking down the input into smaller, manageable components using a set of recursive functions that correspond to the non-terminal symbols in the grammar.

Beginning with a starting symbol, typically representing the highest-level construct, these functions are invoked recursively to recognize and process the various syntactic elements of the input. Each function handles a specific grammar rule, and by calling each other based on the grammar’s structure, the parser traverses the input while building a parse tree that reflects the hierarchical organization of the language’s syntax.

This approach allows for clear code organization and error reporting, though it can face challenges with certain types of grammars and may require augmentation with techniques like lookahead or memoization for optimal efficiency and effectiveness. But for early stages of the project, I don’t really mind that.

Parser Implementation

First I create some structures and helpers that I will be using in the parsing process.

(deftype ast-node-kind ()
  "Sila AST node kind."
  '(member
    :add
    :sub
    :mul
    :div
    :number))

(defstruct ast-node
  "Structure for Sila AST nodes."
  kind
  val
  lhs
  rhs)

(defun skip-to-token (val tok)
  "Search for token which is `val'. Returns `nil' when not found.

TODO(topi): Probably should do some proper error handling if `val' isn't
found."
  (cond ((eq (token-kind tok) :eof) nil)
        ((equal (token-val tok) val) tok)
        (t (search-token val (token-next tok)))))

For now, I’m interested in parsing something simple like 2 / (1 + 1) * 8, so simple multiplicative calculations and expressions inside parenthesis. So if I would write some metanotation like BNF for that, it would look something like this:

expr ::== mul ( '+' mul | '-' mul ) *
mul ::== primary ( '*' primary | '/' primary ) *
primary ::== '(' expr ')' | number

expr

This function is responsible for parsing arithmetic expressions. It follows the grammar rule: expr ::= mul ( '+' mul | '-' mul ) *. It starts by calling the mul function to parse the initial term. Then, it enters a loop to handle addition and subtraction operators. If it encounters a +, it parses the next term using mul and creates an AST node representing addition. If it encounters a -, it does the same for subtraction.

The loop continues until it encounters an operator that’s not + or -, at which point it returns the parsed expression node and the remaining token stream.

(defun expr (tok)
  "expr ::== mul ( '+' mul | '-' mul ) *"
  (multiple-value-bind (node rest)
      (mul tok)
    (loop
      (cond ((eq (token-val rest) #\+)
             (multiple-value-bind (node2 rest2)
                 (mul (token-next rest))
               (setf node (make-ast-node :kind :add :lhs node :rhs node2))
               (setf rest rest2)))
            ((eq (token-val rest) #\-)
             (multiple-value-bind (node2 rest2)
                 (mul (token-next rest))
               (setf node (make-ast-node :kind :sub :lhs node :rhs node2))
               (setf rest rest2)))
            (t
             (return-from expr
               (values node rest)))))))

mul

Parsing mul expressions works in a similar fashion:

(defun mul (tok)
  "mul ::== primary ( '*' primary | '/' primary ) *"
  (multiple-value-bind (node rest)
      (primary tok)
    (loop
      (cond ((eq (token-val rest) #\*)
             (multiple-value-bind (node2 rest2)
                 (primary (token-next rest))
               (setf node (make-ast-node :kind :mul :lhs node :rhs node2))
               (setf rest rest2)))
            ((eq (token-val rest) #\/)
             (multiple-value-bind (node2 rest2)
                 (primary (token-next rest))
               (setf node (make-ast-node :kind :div :lhs node :rhs node2))
               (setf rest rest2)))
            (t
             (return-from mul
               (values node rest)))))))

primary

This function parses primary expressions, which can be either numeric values or sub-expressions enclosed in parentheses. It follows the grammar rule: primary ::= '(' expr ')' | number. If the current token is a number, it creates an AST node representing a numeric value. If the token is an opening parenthesis, it recursively calls the expr function to parse the enclosed expression and skips to the closing parenthesis. It then returns the parsed node and the remaining token stream.

(defun primary (tok)
  "primary ::== '(' expr ')' | number"
  (cond ((eq (token-kind tok) :num)
         (values (make-ast-node :kind :number :val (token-val tok))
                 (token-next tok)))
        ((eq (token-val tok) #\()
         (multiple-value-bind (node rest)
             (expr (token-next tok))
           (values node (token-next (skip-to-token #\) rest)))))
        (t (error 'parser-error))))

Testing the Initial Parser

For now, I will be testing this parser only from the REPL. Firing up it, and running the command should print something like this:

* (expr (tokenize "2 / (1 + 1) * 8"))
#S(AST-NODE
   :KIND :MUL
   :VAL NIL
   :LHS #S(AST-NODE
           :KIND :DIV
           :VAL NIL
           :LHS #S(AST-NODE :KIND :NUMBER :VAL 2 :LHS NIL :RHS NIL)
           :RHS #S(AST-NODE
                   :KIND :ADD
                   :VAL NIL
                   :LHS #S(AST-NODE :KIND :NUMBER :VAL 1 :LHS NIL :RHS NIL)
                   :RHS #S(AST-NODE :KIND :NUMBER :VAL 1 :LHS NIL :RHS NIL)))
   :RHS #S(AST-NODE :KIND :NUMBER :VAL 8 :LHS NIL :RHS NIL))
#S(SILA/LEXER::TOKEN :KIND :EOF :POS 15 :LEN NIL :VAL NIL :NEXT NIL)
*

Here we can see how our parser parses the expressions inside it. Transforming it to ASCII would look something like this:

     Mul(*)
    /     \
 Div(/)    8
 /   \
2   Add(+)
     /  \
    1    1

Code Generation

Assembly code of something like 2 / (1 + 1) * 8 is still quite simple, but definitely more instruction will be needed compared to the simple addition and subtraction. Assembly code for it would equal to:

	.globl main
main:
	mov $8, %rax
	push %rax
	mov $1, %rax
	push %rax
	mov $1, %rax
	pop %rdi
	add %rdi, %rax
	push %rax
	mov $2, %rax
	pop %rdi
	cqo
	idiv %rdi, %rax
	pop %rdi
	imul %rdi, %rax
	ret

So what is happening here:

  • Initializing and Pushing Values:

    • mov $8, %rax: This instruction moves the immediate value 8 into the %rax register.

    • push %rax: The value in %rax (which is now 8) is pushed onto the stack.

  • Further Operations:

    • mov $1, %rax: Another immediate value, 1, is moved into the %rax register.

    • push %rax: The value in %rax (now 1) is pushed onto the stack.

    • mov $1, %rax: Yet another immediate value, 1, is moved into %rax.

    • pop %rdi: The top value from the stack (1) is popped into the %rdi register.

    • add %rdi, %rax: The value in %rdi (previously 1) is added to the value in %rax (also 1), and the result is stored in %rax.

  • More Operations:

    • push %rax: The value in %rax (now 2) is pushed onto the stack.

    • mov $2, %rax: The immediate value 2 is moved into %rax.

    • pop %rdi: The top value from the stack (2) is popped into %rdi.

  • Division and Multiplication:

    • cqo: This instruction extends the sign of %rax into %rdx in preparation for the division operation.

    • idiv %rdi, %rax: The value in %rax (now 2) is divided by the value in %rdi (previously 1). The quotient is stored in %rax and the remainder in %rdx.

    • pop %rdi: The value 2 is popped from the stack and stored in %rdi.

    • imul %rdi, %rax: The value in %rax (the quotient of the division) is multiplied by the value in %rdi (previously 2), and the result is stored in %rax.

Implementation

For the code generation aspects, we can start by creating some unit tests since we already know what would be the output of our function:

(deftest test-emit-asm-div-mul-parens
  (let ((emit-asm-tests '(("2 / (1 + 1) * 8" . "  .globl main
main:
  mov $8, %rax
  push %rax
  mov $1, %rax
  push %rax
  mov $1, %rax
  pop %rdi
  add %rdi, %rax
  push %rax
  mov $2, %rax
  pop %rdi
  cqo
  idiv %rdi, %rax
  pop %rdi
  imul %rdi, %rax
  ret
"))))
    (dolist (test emit-asm-tests)
      (ok (string-equal (sila::emit-asm (car test)) (cdr test))
          (format nil "Expect to be equal: ~%~a~%=>~a"
                  `(sila::emit-asm ,(car test))
                  (cdr test))))))

The implementation aspects also will be relatively straight forward.

(defun asm-inst (inst)
  (format nil "  ~a~%" inst))

(defun asm-directive (dir)
  (asm-inst dir))

(defun asm-label (label)
  (format nil "~a:~%" label))

(defun asm-push ()
  (asm-inst "push %rax"))

(defun asm-pop (reg)
  (asm-inst (format nil "pop %~a" reg)))

(defun generate-expr (node)
  (let ((asm ""))
    (flet ((asm-conc (inst)
             (setf asm (concatenate 'string asm inst))))
      (when (eq (ast-node-kind node) :number)
        (asm-conc (asm-inst (format nil "mov $~d, %rax"
                                    (ast-node-val node))))
        (return-from generate-expr asm))
      (asm-conc (generate-expr (ast-node-rhs node)))
      (asm-conc (asm-push))
      (asm-conc (generate-expr (ast-node-lhs node)))
      (asm-conc (asm-pop "rdi"))
      (alexandria:switch ((ast-node-kind node) :test #'eq)
        (:add (asm-conc (asm-inst "add %rdi, %rax")))
        (:sub (asm-conc (asm-inst "sub %rdi, %rax")))
        (:mul (asm-conc (asm-inst "imul %rdi, %rax")))
        (:div (asm-conc (asm-inst "cqo"))
              (asm-conc (asm-inst "idiv %rdi, %rax")))
        (t (error 'parser-error))))
    asm))

This function recursively generates assembly code from an abstract syntax tree node representing an arithmetic expression. It constructs assembly instructions based on the kind of node encountered. It concatenates these instructions into a string that represents the generated assembly code.

  • For a number node, it generates a move instruction (mov) to load the value into %rax.

  • For binary operations (addition, subtraction, multiplication, division), it generates assembly for the right-hand side expression, pushes %rax onto the stack, generates assembly for the left-hand side expression, pops the value into %rdi, and then performs the corresponding operation.

Then lastly we just define the emit function, like so:

(defun emit-asm (src)
  "Emit assembly code from given source code. Currently emits only x86-64 and
only Linux is tested."
  (let ((asm ""))
    (flet ((asm-conc (inst)
             (setf asm (concatenate 'string asm inst))))
      (multiple-value-bind (node rest)
          (expr (tokenize src))
        (unless (eq (token-kind rest) :eof)
          (error 'parser-error :error-msg "Extra tokens"))
        (asm-conc (asm-directive ".globl main"))
        (asm-conc (asm-label "main"))
        (asm-conc (generate-expr node))
        (asm-conc (asm-inst "ret"))))
    asm))

Again, we can test it directly in the REPL first:

* (emit-asm "2 / (1 + 1) * 8")
"  .globl main
main:
  mov $8, %rax
  push %rax
  mov $1, %rax
  push %rax
  mov $1, %rax
  pop %rdi
  add %rdi, %rax
  push %rax
  mov $2, %rax
  pop %rdi
  cqo
  idiv %rdi, %rax
  pop %rdi
  imul %rdi, %rax
  ret
"

Seems to work. Now to run the tests:

# make test
rove sila.asd

Testing System sila/tests

;; testing 'sila/tests'
test-compilation-and-compare-rc
  ✓ Expect (COMPILE-PROGRAM-AND-COMPARE-RC "0" 0) to be true. (2726ms)
  ✓ Expect (COMPILE-PROGRAM-AND-COMPARE-RC "5 + 40 - 20" 25) to be true. (2693ms)
  ✓ Expect (COMPILE-PROGRAM-AND-COMPARE-RC "2 / (1 + 1) * 8" 8) to be true. (2703ms)
test-emit-asm-integer
  ✓ Expect to be equal:
    (EMIT-ASM 0)
    =>  .globl main
    main:
      mov $0, %rax
      ret

  ✓ Expect to be equal:
    (EMIT-ASM 42)
    =>  .globl main
    main:
      mov $42, %rax
      ret

test-emit-asm-add-sub
  ✓ Expect to be equal:
    (EMIT-ASM 5+20-4)
    =>  .globl main
    main:
      mov $4, %rax
      push %rax
      mov $20, %rax
      push %rax
      mov $5, %rax
      pop %rdi
      add %rdi, %rax
      pop %rdi
      sub %rdi, %rax
      ret

  ✓ Expect to be equal:
    (EMIT-ASM     5   +  20  -  4   )
    =>  .globl main
    main:
      mov $4, %rax
      push %rax
      mov $20, %rax
      push %rax
      mov $5, %rax
      pop %rdi
      add %rdi, %rax
      pop %rdi
      sub %rdi, %rax
      ret

test-emit-asm-div-mul-parens
  ✓ Expect to be equal:
    (EMIT-ASM 2 / (1 + 1) * 8)
    =>  .globl main
    main:
      mov $8, %rax
      push %rax
      mov $1, %rax
      push %rax
      mov $1, %rax
      pop %rdi
      add %rdi, %rax
      push %rax
      mov $2, %rax
      pop %rdi
      cqo
      idiv %rdi, %rax
      pop %rdi
      imul %rdi, %rax
      ret


✓ 1 test completed

Summary:
  All 1 test passed.

Cool, seems to work!

Sila Dev Log: Tokenization and Compiling Basic Arithmetic

Posted on 19th of August 2023 | 2491 words

A while ago, I wrote about why would anyone start writing a new programming language . Consequently, I decided to gradually initiate the project. Simultaneously, I found the idea of documenting my progress to be intriguing. I opted to adopt a strategy of incremental development, wherein I would construct small, functional components and refine them over time.

This incremental methodology draws inspiration from the paper by Abdulaziz Ghuloum. Other tremendously useful resources include works such as tcc , A small C compiler written by Fabrice Bellard, lcc : Another small C compiler. The creators wrote a book about the internals of lcc, which I found a good resource to see how a compiler is implemented, and also chibicc by Rui Ueyama.

Before embarking on this endeavor, I want to underscore that I currently harbor modest ambitions for this project. Consequently, my approach might be somewhat sporadic, aligning with my personal inclination and availability. Primarily, I intend this project to serve educational and research purposes, surpassing other objectives.

Getting Started With Sila Compiler

I decided to call this project Sila, which comes from Pali and is often used in Buddhist texts meaning:

śīla (Pali: sīla; T. tshul khrims ཚུལ་ཁྲིམས་), literally, ‘acting appropriately’. Sila is translated as “skillful conduct,” “discipline,” “ethical conduct,” “moral conduct,” “virtue,” etc.

Sila is said to be a way of being that is conducive to positive and happy states of mind.

With the Buddhist teachings, sila is identified as:

  • one of the three trainings
  • one of the six paramitas in the Sanskrit tradition
  • one of the ten paramis in the Pali tradition

From https://encyclopediaofbuddhism.org/wiki/ Śīla

To initiate the process, my initial step was to attempt the compilation of the most elementary language conceivable. Essentially, my goal was to input an integer into the compiler, which would then produce corresponding assembly code. This assembly code, when compiled, would result in an exit code identical to the provided integer. On a x86-64 Linux platform, the assembly program to achieve this might resemble the following:

	.globl main
main:
	mov $0, %rax
	ret

So what is happening here?

  • .globl main: This directive tells the assembler that the label main should be considered a global symbol. This means that the symbol main can be accessed from other parts of the program, possibly in other source files or modules.

  • main:: This is a label that marks the beginning of the main function or code block. Labels are used as targets for jumps and branches in assembly code.

  • mov $0, %rax: This instruction moves the immediate value 0 into the %rax register. In the x86-64 architecture, %rax is a general-purpose register that is often used to hold return values of functions.

  • ret: This instruction is used to return from a function. In this case, since we’re looking at the main function, it’s effectively ending the program. The value in %rax would typically be used as the exit status of the program.

Simple arithmetic, like “5 + 20 - 4”, is also easily represented x86 assembly:

	.globl main
main:
	mov $5, %rax
	add $20, %rax
	sub $4, %rax
	ret
  • mov $5, %rax: This instruction moves the immediate value 5 into the %rax register.

  • add $20, %rax: This instruction adds the immediate value 20 to the current value stored in the %rax register. After this operation, %rax would hold the value 25 (5 + 20).

  • sub $4, %rax: This instruction subtracts the immediate value 4 from the current value in the %rax register. After this operation, %rax would hold the value 21 (25 - 4).

Why Common Lisp?

The internet is already saturated with fervent support for Lisp, leaving me with a sense that there is little additional I can contribute on the matter. My decision to opt for Common Lisp stems from pragmatic considerations: it is a language in which I am effective, and, above all, it aligns with my personal enjoyment of programming.

Testing Infrastructure

Considering my primary focus on Linux and x86-64, I must undertake certain preparations to facilitate program testing. While I possess an x86-64 machine, it operates on macOS, rendering the same assembly directives ineffective as they are in Linux. Fortunately, this predicament is effortlessly resolved through the utilization of Docker.

My container requirements are rather minimal. Given my work with Common Lisp, the inclusion of a CL compiler is paramount. I’ve chosen SBCL for this purpose, primarily due to my familiarity with it. While other implementations might suffice, my present focus won’t extend to testing those alternatives. Alongside the CL compiler, an assembly compiler for the generated code is essential. In this context, GCC serves as an apt choice.

So the Dockerfile for development could look something like this:

FROM fukamachi/sbcl:2.3.7-debian

LABEL maintainer="Topi Kettunen <topi@topikettunen.com>"

ARG DEBIAN_FRONTEND=noninteractive

RUN apt-get update -qq \
    && apt-get install -qq -y --no-install-recommends \
         build-essentials \
    && rm -rf /var/lib/apt/lists/*

# For unit testing
RUN sbcl --eval '(ql:quickload :rove)'

WORKDIR /root/.roswell/local-projects/sila

ENTRYPOINT ["/bin/bash"]

I decided to use Eitaro Fukamachi’s dockerfiles . These Dockerfiles are built around roswell , which I’m also using locally.

I also write some simple Makefile to just make my life easier:

LISP ?= ros -Q run

.PHONY: build
build:
	$(LISP) --load sila.asd \
		--eval '(ql:quickload :sila)' \
		--eval '(asdf:make :sila)' \
		--eval '(quit)'

.PHONY: test
test:
	$(LISP) --load sila.asd \
		--eval '(ql:quickload :sila)' \
		--eval '(asdf:test-system :sila)' \
		--eval '(quit)'

.PHONY: docker-build
docker-build:
	docker build -t sila .

.PHONY: docker-run
docker-run:
	docker run -it --rm -v ${PWD}:/root/.roswell/local-projects/sila sila

Initial Tests

Since I already know what I want to output from my compiler at this point, it’s easy to start development by writing some nifty little unit tests for it. For now, I decided to write tests for emitting the assembly code and than separately the compilation process.

(deftest test-emit-asm-integer
  (let ((emit-asm-tests '(("0" . "  .globl main
main:
  mov $0, %rax
  ret
")
                          ("42" . "  .globl main
main:
  mov $42, %rax
  ret
"))))
    (dolist (test emit-asm-tests)
      (ok (string-equal (sila::emit-asm (car test)) (cdr test))
          (format nil "Expect to be equal: ~%~A~%=>~A"
                       `(sila::emit-asm ,(car test))
                       (cdr test))))))

(deftest test-emit-asm-add-sub
  (let ((emit-asm-tests '(("5+20-4" . "  .globl main
main:
  mov $5, %rax
  add $20, %rax
  sub $4, %rax
  ret
")
                          ("    5   +  20  -  4   " . "  .globl main
main:
  mov $5, %rax
  add $20, %rax
  sub $4, %rax
  ret
"))))
    (dolist (test emit-asm-tests)
      (ok (string-equal (sila::emit-asm (car test)) (cdr test))
          (format nil "Expect to be equal: ~%~A~%=>~A"
                       `(sila::emit-asm ,(car test))
                       (cdr test))))))

These tests are relatively straight forward, I generate the output and just compare strings to each other to see that they both are the same. Compilation itself is little bit more involved. Most likely you would achieve easier test setup by just simple shell scripts, but since I want to write Common Lisp, I decided to write some small testing harness for it.

(defvar *bin-name* "sila")
(defvar *tmp-name* "tmp.s")
(defvar *tmp-bin-name* "tmp")

(defun sila-project-root (path)
  (namestring (asdf:system-relative-pathname :sila path)))

(defun compile-program-and-compare-rc (input expected-rc)
  (multiple-value-bind (out err rc)
      (uiop:run-program (list (sila-project-root *bin-name*)
                              input)
                        :output :string
                        :error-output :string)
    (declare (ignore err rc))
    (with-open-file (fstream (sila-project-root *tmp-name*)
                             :direction :output
                             :if-exists :supersede
                             :if-does-not-exist :create)
      (format fstream out))
    (uiop:run-program (list "gcc" "-static" "-o" *tmp-bin-name* *tmp-name*))
    (multiple-value-bind (out err got-rc)
        (uiop:run-program (sila-project-root *tmp-bin-name*)
                          ;; Since we're working with arbitrary return codes.
                          :ignore-error-status t)
      (declare (ignore out err))
      (= expected-rc got-rc))))

(deftest test-compilation-and-compare-rc
  (ok (compile-program-and-compare-rc "0" 0))
  (ok (compile-program-and-compare-rc "5 + 40 - 20" 25)))

What I’m doing here is that I use the freshly created executable that outputs assembly and pipe the output to a file called tmp.s. After that I run gcc for that generated assembly and compile it to a binary called tmp. Then I just run the program and check what return code it returns and compare those to what I expect them to be.

Obviously the tests will fail for now, but those give me a nice foundation on top of which is nice to start developing.

Tokenization

Before I do anything for my compiler, I need to do some lexical analysis and generate some tokens from given input. I start this by defining some structures and types for holding these tokens:

(deftype kind ()
  '(member
    :punct
    :num
    :eof))

(defstruct token
  kind
  pos
  len
  val
  next)

Since I’m currently only working with strings given as input, I will do the tokenization simply moving a pointer through and creating tokens based on the subsequences I encounter in the given input string. Before that, I need some helper functions:

(defun whitespacep (c)
  (member c '(#\Space #\Tab #\Return #\Newline)))

(defun punctuatorp (c)
  ;; TODO(topi): Add more punctuators when needed.
  (member c '(#\+ #\-)))

(defun skip-to-punctuator (input start)
  ;; TODO(topi): Improve this. Won't work when new tokens are added.
  (position-if-not #'digit-char-p input :start start))

I also define a condition for encountering invalid tokens, which I can throw in case something bad happens:

(define-condition invalid-token (error)
  ((token :initarg :token
          :initform nil
          :reader token)
   (token-position :initarg :token-position
                   :initform nil
                   :reader token-position))
  (:report (lambda (condition stream)
             (format stream "Invalid token: '~A' at position ~A.~&"
                     (token condition)
                     (token-position condition))))
  (:documentation "Condition for when we encounter invalid token."))

As you can see in the token structure above, there is a slot called next indicating the next token. So I’m working with linked lists here. I’ll start the tokenization by creating head node and also cur node which I will be populating while I’m iterating through the list. I also initialise a variable for pointing me where I’m currently at in the given input string.

The way I decided to do this initial tokenization is simply looping through the given input with loop. At the moment, I’m not interested in whitespace, so when I encounter whitespace character, I’ll just move the src-pos forward.

Currently, I’m interested in at what position I will encounter a punctuator, either + or -, in the given input string. To do that, I’ll use the skip-to function defined above, which returns the position from the input string where first given item is found at. When I know that, it’s easy the deduce the positions of each token in the given input.

(defun tokenize (src)
  (let* ((head (make-token))
         (cur head)
         (src-pos 0))
    (loop :while (< src-pos (length src))
          :do (let ((punct-pos (skip-to-punctuator src src-pos)))
                (cond (;; Whitespace
                       (whitespacep (char src src-pos))
                       (incf src-pos))
                      (;; Numeric literal
                       (digit-char-p (char src src-pos))
                       (setf (token-next cur)
                             (make-token :kind num
                                         :val (parse-integer src
                                                             :start src-pos
                                                             :junk-allowed t)
                                         :pos src-pos))
                       (setf (token-len (token-next cur))
                             (if punct-pos
                                 (- punct-pos src-pos)
                                 ;; No more punctuators.
                                 (- (length src) src-pos)))
                       (setf cur (token-next cur))
                       (if punct-pos
                           (setf src-pos punct-pos)
                           ;; No more punctuators, move `src-pos` to the end.
                           (setf src-pos (length src))))
                      (;; Punctuator: + | -
                       (punctuatorp (char src src-pos))
                       (setf (token-next cur)
                             (make-token :kind punct
                                         :val (char src src-pos)
                                         :pos src-pos
                                         ;; TODO(topi): Currently I only have
                                         ;; 1 char puncts, change when more is
                                         ;; needed.
                                         :len 1))
                       (setf cur (token-next cur))
                       (incf src-pos))
                      (t
                       (error 'invalid-token
                              :token (char src src-pos)
                              :token-position src-pos)))))
    ;; No more tokens.
    (setf (token-next cur) (make-token :kind eof
                                       :pos src-pos))
    (setf cur (token-next cur))
    (token-next head))

We can then test this out in the REPL, to see what it prints out:

SILA> (sila/lexer:tokenize "1 + 1 - 123")
#S(SILA/LEXER::TOKEN
   :KIND :NUM
   :POS 0
   :LEN 2
   :VAL 1
   :NEXT #S(SILA/LEXER::TOKEN
            :KIND :PUNCT
            :POS 2
            :LEN 1
            :VAL #\+
            :NEXT #S(SILA/LEXER::TOKEN
                     :KIND :NUM
                     :POS 4
                     :LEN 2
                     :VAL 1
                     :NEXT #S(SILA/LEXER::TOKEN
                              :KIND :PUNCT
                              :POS 6
                              :LEN 1
                              :VAL #\-
                              :NEXT #S(SILA/LEXER::TOKEN
                                       :KIND :NUM
                                       :POS 8
                                       :LEN 3
                                       :VAL 123
                                       :NEXT #S(SILA/LEXER::TOKEN
                                                :KIND :EOF
                                                :POS 11
                                                :LEN NIL
                                                :VAL NIL
                                                :NEXT NIL))))))

And it seems to work nicely giving us crucial information such as how long the tokens are in the input and at what position they’re found at, something that will be useful in the future.

Emitting Assembly

Since now the tokenization is working more or less as intended, we can move forward with the assembly generation. For now, the only thing that matters to me is that we can generate some simple assembly to standard output. The way I decided to this, is to just concatenate assembly instructions to one string and just print the result. I’m only working with one label main for now, so I’m not starting to do anything too fancy yet.

The way I’m emitting the assembly is simply by looping all the tokens that I got from tokenize until I encounter token which kind is :eof. Then based on the token, I’ll just concatenate assembly instructions as needed. I’ll also define simple entrypoint to my executable which only accepts one string as input that will generated to assembly.

(defun main ()
  "Sila programming language entrypoint."
  (multiple-value-bind (options free-args)
      (opts:get-opts)
    ;; TODO(topi): Add some useful options.
    (declare (ignore options))
    (unless (= (length free-args) 1)
      (format *error-output* "Invalid number of arguments, expected 1.~%")
      (opts:exit 1))
    (format *standard-output* (emit-asm (first free-args)))
    (opts:exit)))

(defun emit-asm (src)
  "Emit assembly code from given source code. Currently emits only x86-64 and only Linux
is tested."
  (let ((asm "")
        (tok (tokenize src)))
    (labels ((asm-conc (inst)
               (setf asm (concatenate 'string asm inst)))
             (asm-inst (inst)
               (format nil "  ~A~%" inst))
             (asm-directive (dir)
               (asm-inst dir))
             (asm-label (label)
               (format nil "~A:~%" label)))
      (asm-conc (asm-directive ".globl main"))
      (asm-conc (asm-label "main"))
      ;; For now, first token must be a number.
      (asm-conc (asm-inst (format nil "mov $~d, %rax" (token-val tok))))
      (setf tok (token-next tok))
      (loop :until (equal (token-kind tok) :eof)
            :do (cond ((eq (token-kind tok) :num)
                       (setf tok (token-next tok)))
                      ((char= (token-val tok) #\+)
                       (asm-conc (asm-inst
                                  (format nil "add $~d, %rax"
                                          (token-val (token-next tok)))))
                       (setf tok (token-next tok)))
                      ((char= (token-val tok) #\-)
                       (asm-conc (asm-inst
                                  (format nil "sub $~d, %rax"
                                          (token-val (token-next tok)))))
                       (setf tok (token-next tok)))))
      (asm-conc (asm-inst "ret"))
      asm)))

With these we should be able to run our testing suite to see what happens.

$ make test

[...]

Testing System sila/tests

;; testing 'sila/tests/main'
test-compilation-and-compare-rc
  ✓ Expect (COMPILE-PROGRAM-AND-COMPARE-RC "0" 0) to be true. (2763ms)
  ✓ Expect (COMPILE-PROGRAM-AND-COMPARE-RC "5 + 40 - 20" 25) to be true. (2509ms)

;; testing 'sila/tests/emit-asm-x86-64'
test-emit-asm-integer
  ✓ Expect to be equal:
    (EMIT-ASM 0)
    =>	.globl main
    main:
    	mov $0, %rax
    	ret

  ✓ Expect to be equal:
    (EMIT-ASM 42)
    =>	.globl main
    main:
    	mov $42, %rax
    	ret

test-emit-asm-add-sub
  ✓ Expect to be equal:
    (EMIT-ASM 5+20-4)
    =>	.globl main
    main:
    	mov $5, %rax
    	add $20, %rax
    	sub $4, %rax
    	ret

  ✓ Expect to be equal:
    (EMIT-ASM     5   +  20  -  4   )
    =>	.globl main
    main:
    	mov $5, %rax
    	add $20, %rax
    	sub $4, %rax
    	ret


✓ 1 test completed

Summary:
  All 1 test passed.

Cool seems to pass!

Next Steps

I’m continuing the development of my compiler at my own pace, and will write when something new happens. I think I will come back to some parts when I decide to some improvements or refactorings so that everything happens in small iterations. Bugs will be found and hopefully fixed. This might be a long project, but I don’t mind, it’s fun! If you’re interested, feel free to follow my blog to know when something new comes.

Why Would Anyone Build a New Programming Language?

Posted on 28th of July 2023 | 519 words

In the vast and ever-evolving landscape of software development, programming languages play a pivotal role in how developers interact with computers and build applications. While a plethora of well-established programming languages exist, there has always been an intriguing curiosity and desire among some developers, including myself, to embark on the quest of creating our own languages.

Existing programming languages come with their own set of strengths and weaknesses. However, they may not always cater optimally to specific domains or problem sets. This has been a driving force for many language creators who seek full control over the language’s semantics. Building a new language offers the opportunity to tailor the language’s features and syntax, empowering developers and teams to solve certain classes of problems more efficiently compared to existing solutions.

However, it is essential to acknowledge that creating a new language is no easy task. Beyond the initial development, considerations for onboarding other developers and promoting adoption must be taken into account. Moreover, when considering building a language within a company setting, one must be mindful of potential monetary implications.

The decision to build a new programming language should not solely be driven by pragmatic concerns; it is also an intellectually stimulating exercise. The process involves delving into various aspects of language design, including syntax, semantics, and the implementation of the compiler or interpreter. Aspiring language creators gain a deeper understanding of computer science principles and enhance their problem-solving skills through this complex undertaking.

Throughout my journey in the world of computer science, I have dabbled in creating multiple toy compilers. Though they may not have been groundbreaking, they ignited my curiosity about compiler implementation. Despite having explored various compilers and interpreters of varying complexity, I felt a lingering sense of not fully comprehending the intricacies occurring under the hood.

Fortuitously, the summer provided a respite from my regular work commitments, granting me the perfect opportunity to embark on a more serious attempt at building a language from scratch. This time, I was determined to delve deep into the intricate details of lexing, parsing, and code generation. This ambitious undertaking offered an exciting challenge, and I was eager to explore uncharted territories in language design.

The journey of building a new programming language is undoubtedly exciting and rewarding, fueled by an array of motivations. Whether it is optimizing performance to meet specific demands, addressing domain-specific challenges, or simply exploring innovative language design paradigms, language creation is a conduit for personal growth and innovation.

As the development of my language progresses, I anticipate gaining a more comprehensive understanding of how programming languages shape our interactions with computers and applications. I hope that my creation, like many others in the world of programming languages, will contribute to the diversity of tools available to developers, enabling them to craft exceptional solutions tailored to their unique needs.

In conclusion, the decision to build a new programming language should be approached with a blend of curiosity, pragmatism, and enthusiasm. It is a journey that fosters personal growth, deepens technical understanding, and holds the potential to revolutionize the way developers solve problems and interact with computers.

An Album for Each Year

Posted on 16th of July 2023 | 312 words

I stumbled upon a fun-sounding challenge from Werner Vogels' blog so I wanted to partake in it personally. The challenge was to list your favourite album for every year of your life with a restriction of only one album per year and no repeats of artists.

Here is my list:

1995: Alice in Chains, Alice in Chains
1996: Type O Negative, October Rust
1997: Deftones, Around the Fur
1998: Neutral Milk Hotel, In the Aeroplane Over the Sea
1999: Sleep, Jerusalem (Dopesmoker)
2000: Godspeed You! Black Emperor - Lift Your Skinny Fists Like Antennas to Heaven
2001: B.R.M.C., Black Rebel Motorcycle Club
2002: Queens of the Stone Age, Songs for the Deaf
2003: Songs: Ohia, The Magnolia Electric Co.
2004: MF DOOM, MM...FOOD
2005: Boris, Pink
2006: Tool, 10,000 Days
2007: Porcupine Tree, Fear of a Blank Planet
2008: Have a Nice Life, Deathconsciousness
2009: Them Crooked Vultures, Them Crooked Vultures
2010: Nails, Unsilent Death
2011: Gillian Welch, The Harrow & the Harvest
2012: Death Grips, The Money Store
2013: Jason Isbell, Southeastern
2014: Swans, To Be Kind
2015: Kendrick Lamar, To Pimp a Butterfly
2016: Gojira, Magma
2017: Slowdive, Slowdive
2018: Anna Von Hausswolff, Dead Magic
2019: Justin Townes Earle, The Saint of Lost Causes
2020: Lianne La Havas, Lianne La Havas
2021: Silk Sonic, An Evening with Silk Sonic
2022: Weyes Blood, And in the Darkness, Hearts Aglow
2023: King Gizzard & the Lizard Wizard, PetroDragonic Apocalypse (as of July 2023)

Challenge turned out to be quite hard for me since most of my favourite artists and albums are from time before my birth, but I managed to find something for each year. Especially difficult aspect was leaving great albums out. Rule of no repeats of artists didn’t make it easier either. Most likely I missed some great artists/albums so this list most likely just represents how I currently feel.

Embrace Authenticity in Your Blog

Posted on 10th of June 2023 | 522 words

I’ve been a longtime fan of Julie Evans’ writing and recently stumbled upon a great post of her called Some blogging myths which inspired to me write something about the same topic.

I have always seen blogging or personal websites as this powerful platforms for self-expression, enabling people from all walks of life to share their thoughts, ideas, and experiences with the whole wide world. Beyond that, I’ve never really given too much thought to it.

While it’s true that if you happen to be some sort of artist, or even a blogger, that uses his site for some capitalistic reasons, attracting traffic and engagement is important. This can be seen today in social media. This basically means that people tend to approach blogging nowadays from this mindset. Meaning, that they are only able to write about topics that are popular or, especially in tech world, about topics that they consider themselves being an expert in.

Now I don’t say that you shouldn’t write about these sort of topics if you really happen to be expert in the field in question. But, limiting the topics solely to topics that are likely to generate traffic, you most likely inadvertently stifle your own authenticity. Blog should be a place for you to have freedom to explore diverse subjects that just happens to pique your interest. By doing so, you invite your readers into a genuine and vulnerable space where they can connect with you on a deeper level.

This sort of safe haven fosters a certain trust and connection with your audience. When you write about wide range of topics, even those outside your comfort zone, you invite readers who share those interest or experience to engage with you. This way you create an inclusive environment where your audience can feel heard and understood strengthening the bond between you and them.

While first and foremost blog is mainly about your voice but at the same time it’s also about providing a platform for other to share their stories and experiences. Embracing a wide range of topics sparks conversations that can educate, enlighten, and inspire everyone involved.

I don’t say that if your blog is dedicated to a either popular or very niche topic would be a bad thing. If that’s something that you enjoy, godspeed. The way I see it is that if I would limit the topics written in my blog, I would inadvertently hinder my own personal growth as a writer and individual. Exploring new subjects, especially those that you’re by no mean an expert of, allows you to expand your knowledge, learn from others, and develop new skills. Possibly it can also help you to discover hidden passions and uncover something new about yourself.

When you write about anything that sparks your interest, you also demonstrate that it’s okay to write about anything and to step outside the boundaries of popular opinion or expertise, you empower individuals to express themselves more authentically in their own spheres. This sort of vulnerability and explorations can serve as an inspiration and catalyst for others to break free from limitations and embrace their true selves.