Sila Dev Log: Tokenization and Compiling Basic Arithmetic

Posted on 19th of August 2023 | 2491 words
Plug: Follow the Sila development here.

A while ago, I wrote about why would anyone start writing a new programming language . Consequently, I decided to gradually initiate the project. Simultaneously, I found the idea of documenting my progress to be intriguing. I opted to adopt a strategy of incremental development, wherein I would construct small, functional components and refine them over time.

This incremental methodology draws inspiration from the paper by Abdulaziz Ghuloum. Other tremendously useful resources include works such as tcc , A small C compiler written by Fabrice Bellard, lcc : Another small C compiler. The creators wrote a book about the internals of lcc, which I found a good resource to see how a compiler is implemented, and also chibicc by Rui Ueyama.

Before embarking on this endeavor, I want to underscore that I currently harbor modest ambitions for this project. Consequently, my approach might be somewhat sporadic, aligning with my personal inclination and availability. Primarily, I intend this project to serve educational and research purposes, surpassing other objectives.

Getting Started With Sila Compiler

I decided to call this project Sila, which comes from Pali and is often used in Buddhist texts meaning:

śīla (Pali: sīla; T. tshul khrims ཚུལ་ཁྲིམས་), literally, ‘acting appropriately’. Sila is translated as “skillful conduct,” “discipline,” “ethical conduct,” “moral conduct,” “virtue,” etc.

Sila is said to be a way of being that is conducive to positive and happy states of mind.

With the Buddhist teachings, sila is identified as:

  • one of the three trainings
  • one of the six paramitas in the Sanskrit tradition
  • one of the ten paramis in the Pali tradition

From https://encyclopediaofbuddhism.org/wiki/ Śīla

To initiate the process, my initial step was to attempt the compilation of the most elementary language conceivable. Essentially, my goal was to input an integer into the compiler, which would then produce corresponding assembly code. This assembly code, when compiled, would result in an exit code identical to the provided integer. On a x86-64 Linux platform, the assembly program to achieve this might resemble the following:

	.globl main
main:
	mov $0, %rax
	ret

So what is happening here?

  • .globl main: This directive tells the assembler that the label main should be considered a global symbol. This means that the symbol main can be accessed from other parts of the program, possibly in other source files or modules.

  • main:: This is a label that marks the beginning of the main function or code block. Labels are used as targets for jumps and branches in assembly code.

  • mov $0, %rax: This instruction moves the immediate value 0 into the %rax register. In the x86-64 architecture, %rax is a general-purpose register that is often used to hold return values of functions.

  • ret: This instruction is used to return from a function. In this case, since we’re looking at the main function, it’s effectively ending the program. The value in %rax would typically be used as the exit status of the program.

Simple arithmetic, like “5 + 20 - 4”, is also easily represented x86 assembly:

	.globl main
main:
	mov $5, %rax
	add $20, %rax
	sub $4, %rax
	ret
  • mov $5, %rax: This instruction moves the immediate value 5 into the %rax register.

  • add $20, %rax: This instruction adds the immediate value 20 to the current value stored in the %rax register. After this operation, %rax would hold the value 25 (5 + 20).

  • sub $4, %rax: This instruction subtracts the immediate value 4 from the current value in the %rax register. After this operation, %rax would hold the value 21 (25 - 4).

Why Common Lisp?

The internet is already saturated with fervent support for Lisp, leaving me with a sense that there is little additional I can contribute on the matter. My decision to opt for Common Lisp stems from pragmatic considerations: it is a language in which I am effective, and, above all, it aligns with my personal enjoyment of programming.

Testing Infrastructure

Considering my primary focus on Linux and x86-64, I must undertake certain preparations to facilitate program testing. While I possess an x86-64 machine, it operates on macOS, rendering the same assembly directives ineffective as they are in Linux. Fortunately, this predicament is effortlessly resolved through the utilization of Docker.

My container requirements are rather minimal. Given my work with Common Lisp, the inclusion of a CL compiler is paramount. I’ve chosen SBCL for this purpose, primarily due to my familiarity with it. While other implementations might suffice, my present focus won’t extend to testing those alternatives. Alongside the CL compiler, an assembly compiler for the generated code is essential. In this context, GCC serves as an apt choice.

So the Dockerfile for development could look something like this:

FROM fukamachi/sbcl:2.3.7-debian

LABEL maintainer="Topi Kettunen <topi@topikettunen.com>"

ARG DEBIAN_FRONTEND=noninteractive

RUN apt-get update -qq \
    && apt-get install -qq -y --no-install-recommends \
         build-essentials \
    && rm -rf /var/lib/apt/lists/*

# For unit testing
RUN sbcl --eval '(ql:quickload :rove)'

WORKDIR /root/.roswell/local-projects/sila

ENTRYPOINT ["/bin/bash"]

I decided to use Eitaro Fukamachi’s dockerfiles . These Dockerfiles are built around roswell , which I’m also using locally.

I also write some simple Makefile to just make my life easier:

LISP ?= ros -Q run

.PHONY: build
build:
	$(LISP) --load sila.asd \
		--eval '(ql:quickload :sila)' \
		--eval '(asdf:make :sila)' \
		--eval '(quit)'

.PHONY: test
test:
	$(LISP) --load sila.asd \
		--eval '(ql:quickload :sila)' \
		--eval '(asdf:test-system :sila)' \
		--eval '(quit)'

.PHONY: docker-build
docker-build:
	docker build -t sila .

.PHONY: docker-run
docker-run:
	docker run -it --rm -v ${PWD}:/root/.roswell/local-projects/sila sila

Initial Tests

Since I already know what I want to output from my compiler at this point, it’s easy to start development by writing some nifty little unit tests for it. For now, I decided to write tests for emitting the assembly code and than separately the compilation process.

(deftest test-emit-asm-integer
  (let ((emit-asm-tests '(("0" . "  .globl main
main:
  mov $0, %rax
  ret
")
                          ("42" . "  .globl main
main:
  mov $42, %rax
  ret
"))))
    (dolist (test emit-asm-tests)
      (ok (string-equal (sila::emit-asm (car test)) (cdr test))
          (format nil "Expect to be equal: ~%~A~%=>~A"
                       `(sila::emit-asm ,(car test))
                       (cdr test))))))

(deftest test-emit-asm-add-sub
  (let ((emit-asm-tests '(("5+20-4" . "  .globl main
main:
  mov $5, %rax
  add $20, %rax
  sub $4, %rax
  ret
")
                          ("    5   +  20  -  4   " . "  .globl main
main:
  mov $5, %rax
  add $20, %rax
  sub $4, %rax
  ret
"))))
    (dolist (test emit-asm-tests)
      (ok (string-equal (sila::emit-asm (car test)) (cdr test))
          (format nil "Expect to be equal: ~%~A~%=>~A"
                       `(sila::emit-asm ,(car test))
                       (cdr test))))))

These tests are relatively straight forward, I generate the output and just compare strings to each other to see that they both are the same. Compilation itself is little bit more involved. Most likely you would achieve easier test setup by just simple shell scripts, but since I want to write Common Lisp, I decided to write some small testing harness for it.

(defvar *bin-name* "sila")
(defvar *tmp-name* "tmp.s")
(defvar *tmp-bin-name* "tmp")

(defun sila-project-root (path)
  (namestring (asdf:system-relative-pathname :sila path)))

(defun compile-program-and-compare-rc (input expected-rc)
  (multiple-value-bind (out err rc)
      (uiop:run-program (list (sila-project-root *bin-name*)
                              input)
                        :output :string
                        :error-output :string)
    (declare (ignore err rc))
    (with-open-file (fstream (sila-project-root *tmp-name*)
                             :direction :output
                             :if-exists :supersede
                             :if-does-not-exist :create)
      (format fstream out))
    (uiop:run-program (list "gcc" "-static" "-o" *tmp-bin-name* *tmp-name*))
    (multiple-value-bind (out err got-rc)
        (uiop:run-program (sila-project-root *tmp-bin-name*)
                          ;; Since we're working with arbitrary return codes.
                          :ignore-error-status t)
      (declare (ignore out err))
      (= expected-rc got-rc))))

(deftest test-compilation-and-compare-rc
  (ok (compile-program-and-compare-rc "0" 0))
  (ok (compile-program-and-compare-rc "5 + 40 - 20" 25)))

What I’m doing here is that I use the freshly created executable that outputs assembly and pipe the output to a file called tmp.s. After that I run gcc for that generated assembly and compile it to a binary called tmp. Then I just run the program and check what return code it returns and compare those to what I expect them to be.

Obviously the tests will fail for now, but those give me a nice foundation on top of which is nice to start developing.

Tokenization

Before I do anything for my compiler, I need to do some lexical analysis and generate some tokens from given input. I start this by defining some structures and types for holding these tokens:

(deftype kind ()
  '(member
    :punct
    :num
    :eof))

(defstruct token
  kind
  pos
  len
  val
  next)

Since I’m currently only working with strings given as input, I will do the tokenization simply moving a pointer through and creating tokens based on the subsequences I encounter in the given input string. Before that, I need some helper functions:

(defun whitespacep (c)
  (member c '(#\Space #\Tab #\Return #\Newline)))

(defun punctuatorp (c)
  ;; TODO(topi): Add more punctuators when needed.
  (member c '(#\+ #\-)))

(defun skip-to-punctuator (input start)
  ;; TODO(topi): Improve this. Won't work when new tokens are added.
  (position-if-not #'digit-char-p input :start start))

I also define a condition for encountering invalid tokens, which I can throw in case something bad happens:

(define-condition invalid-token (error)
  ((token :initarg :token
          :initform nil
          :reader token)
   (token-position :initarg :token-position
                   :initform nil
                   :reader token-position))
  (:report (lambda (condition stream)
             (format stream "Invalid token: '~A' at position ~A.~&"
                     (token condition)
                     (token-position condition))))
  (:documentation "Condition for when we encounter invalid token."))

As you can see in the token structure above, there is a slot called next indicating the next token. So I’m working with linked lists here. I’ll start the tokenization by creating head node and also cur node which I will be populating while I’m iterating through the list. I also initialise a variable for pointing me where I’m currently at in the given input string.

The way I decided to do this initial tokenization is simply looping through the given input with loop. At the moment, I’m not interested in whitespace, so when I encounter whitespace character, I’ll just move the src-pos forward.

Currently, I’m interested in at what position I will encounter a punctuator, either + or -, in the given input string. To do that, I’ll use the skip-to function defined above, which returns the position from the input string where first given item is found at. When I know that, it’s easy the deduce the positions of each token in the given input.

(defun tokenize (src)
  (let* ((head (make-token))
         (cur head)
         (src-pos 0))
    (loop :while (< src-pos (length src))
          :do (let ((punct-pos (skip-to-punctuator src src-pos)))
                (cond (;; Whitespace
                       (whitespacep (char src src-pos))
                       (incf src-pos))
                      (;; Numeric literal
                       (digit-char-p (char src src-pos))
                       (setf (token-next cur)
                             (make-token :kind num
                                         :val (parse-integer src
                                                             :start src-pos
                                                             :junk-allowed t)
                                         :pos src-pos))
                       (setf (token-len (token-next cur))
                             (if punct-pos
                                 (- punct-pos src-pos)
                                 ;; No more punctuators.
                                 (- (length src) src-pos)))
                       (setf cur (token-next cur))
                       (if punct-pos
                           (setf src-pos punct-pos)
                           ;; No more punctuators, move `src-pos` to the end.
                           (setf src-pos (length src))))
                      (;; Punctuator: + | -
                       (punctuatorp (char src src-pos))
                       (setf (token-next cur)
                             (make-token :kind punct
                                         :val (char src src-pos)
                                         :pos src-pos
                                         ;; TODO(topi): Currently I only have
                                         ;; 1 char puncts, change when more is
                                         ;; needed.
                                         :len 1))
                       (setf cur (token-next cur))
                       (incf src-pos))
                      (t
                       (error 'invalid-token
                              :token (char src src-pos)
                              :token-position src-pos)))))
    ;; No more tokens.
    (setf (token-next cur) (make-token :kind eof
                                       :pos src-pos))
    (setf cur (token-next cur))
    (token-next head))

We can then test this out in the REPL, to see what it prints out:

SILA> (sila/lexer:tokenize "1 + 1 - 123")
#S(SILA/LEXER::TOKEN
   :KIND :NUM
   :POS 0
   :LEN 2
   :VAL 1
   :NEXT #S(SILA/LEXER::TOKEN
            :KIND :PUNCT
            :POS 2
            :LEN 1
            :VAL #\+
            :NEXT #S(SILA/LEXER::TOKEN
                     :KIND :NUM
                     :POS 4
                     :LEN 2
                     :VAL 1
                     :NEXT #S(SILA/LEXER::TOKEN
                              :KIND :PUNCT
                              :POS 6
                              :LEN 1
                              :VAL #\-
                              :NEXT #S(SILA/LEXER::TOKEN
                                       :KIND :NUM
                                       :POS 8
                                       :LEN 3
                                       :VAL 123
                                       :NEXT #S(SILA/LEXER::TOKEN
                                                :KIND :EOF
                                                :POS 11
                                                :LEN NIL
                                                :VAL NIL
                                                :NEXT NIL))))))

And it seems to work nicely giving us crucial information such as how long the tokens are in the input and at what position they’re found at, something that will be useful in the future.

Emitting Assembly

Since now the tokenization is working more or less as intended, we can move forward with the assembly generation. For now, the only thing that matters to me is that we can generate some simple assembly to standard output. The way I decided to this, is to just concatenate assembly instructions to one string and just print the result. I’m only working with one label main for now, so I’m not starting to do anything too fancy yet.

The way I’m emitting the assembly is simply by looping all the tokens that I got from tokenize until I encounter token which kind is :eof. Then based on the token, I’ll just concatenate assembly instructions as needed. I’ll also define simple entrypoint to my executable which only accepts one string as input that will generated to assembly.

(defun main ()
  "Sila programming language entrypoint."
  (multiple-value-bind (options free-args)
      (opts:get-opts)
    ;; TODO(topi): Add some useful options.
    (declare (ignore options))
    (unless (= (length free-args) 1)
      (format *error-output* "Invalid number of arguments, expected 1.~%")
      (opts:exit 1))
    (format *standard-output* (emit-asm (first free-args)))
    (opts:exit)))

(defun emit-asm (src)
  "Emit assembly code from given source code. Currently emits only x86-64 and only Linux
is tested."
  (let ((asm "")
        (tok (tokenize src)))
    (labels ((asm-conc (inst)
               (setf asm (concatenate 'string asm inst)))
             (asm-inst (inst)
               (format nil "  ~A~%" inst))
             (asm-directive (dir)
               (asm-inst dir))
             (asm-label (label)
               (format nil "~A:~%" label)))
      (asm-conc (asm-directive ".globl main"))
      (asm-conc (asm-label "main"))
      ;; For now, first token must be a number.
      (asm-conc (asm-inst (format nil "mov $~d, %rax" (token-val tok))))
      (setf tok (token-next tok))
      (loop :until (equal (token-kind tok) :eof)
            :do (cond ((eq (token-kind tok) :num)
                       (setf tok (token-next tok)))
                      ((char= (token-val tok) #\+)
                       (asm-conc (asm-inst
                                  (format nil "add $~d, %rax"
                                          (token-val (token-next tok)))))
                       (setf tok (token-next tok)))
                      ((char= (token-val tok) #\-)
                       (asm-conc (asm-inst
                                  (format nil "sub $~d, %rax"
                                          (token-val (token-next tok)))))
                       (setf tok (token-next tok)))))
      (asm-conc (asm-inst "ret"))
      asm)))

With these we should be able to run our testing suite to see what happens.

$ make test

[...]

Testing System sila/tests

;; testing 'sila/tests/main'
test-compilation-and-compare-rc
  ✓ Expect (COMPILE-PROGRAM-AND-COMPARE-RC "0" 0) to be true. (2763ms)
  ✓ Expect (COMPILE-PROGRAM-AND-COMPARE-RC "5 + 40 - 20" 25) to be true. (2509ms)

;; testing 'sila/tests/emit-asm-x86-64'
test-emit-asm-integer
  ✓ Expect to be equal:
    (EMIT-ASM 0)
    =>	.globl main
    main:
    	mov $0, %rax
    	ret

  ✓ Expect to be equal:
    (EMIT-ASM 42)
    =>	.globl main
    main:
    	mov $42, %rax
    	ret

test-emit-asm-add-sub
  ✓ Expect to be equal:
    (EMIT-ASM 5+20-4)
    =>	.globl main
    main:
    	mov $5, %rax
    	add $20, %rax
    	sub $4, %rax
    	ret

  ✓ Expect to be equal:
    (EMIT-ASM     5   +  20  -  4   )
    =>	.globl main
    main:
    	mov $5, %rax
    	add $20, %rax
    	sub $4, %rax
    	ret


✓ 1 test completed

Summary:
  All 1 test passed.

Cool seems to pass!

Next Steps

I’m continuing the development of my compiler at my own pace, and will write when something new happens. I think I will come back to some parts when I decide to some improvements or refactorings so that everything happens in small iterations. Bugs will be found and hopefully fixed. This might be a long project, but I don’t mind, it’s fun! If you’re interested, feel free to follow my blog to know when something new comes.

Note: If you want to read earlier posts of this dev log, head over here.