# Sila Dev Log: Initial Recursive Descent Parsing

Posted on 24th of August 2023 | 1881 words
 Plug: Follow the Sila development here.

For the initial implementation of Sila’s parses, I decided to go with the recursive descent route due to its simplicity. So what it is? Recursive descent parsing is a top-down parsing technique employed to analyze the syntax of structured input, such as programming languages or mathematical expressions. It operates by breaking down the input into smaller, manageable components using a set of recursive functions that correspond to the non-terminal symbols in the grammar.

Beginning with a starting symbol, typically representing the highest-level construct, these functions are invoked recursively to recognize and process the various syntactic elements of the input. Each function handles a specific grammar rule, and by calling each other based on the grammar’s structure, the parser traverses the input while building a parse tree that reflects the hierarchical organization of the language’s syntax.

This approach allows for clear code organization and error reporting, though it can face challenges with certain types of grammars and may require augmentation with techniques like lookahead or memoization for optimal efficiency and effectiveness. But for early stages of the project, I don’t really mind that.

## Parser Implementation

First I create some structures and helpers that I will be using in the parsing process.

``````(deftype ast-node-kind ()
"Sila AST node kind."
'(member
:sub
:mul
:div
:number))

(defstruct ast-node
"Structure for Sila AST nodes."
kind
val
lhs
rhs)

(defun skip-to-token (val tok)

TODO(topi): Probably should do some proper error handling if `val' isn't
found."
(cond ((eq (token-kind tok) :eof) nil)
((equal (token-val tok) val) tok)
(t (search-token val (token-next tok)))))``````

For now, I’m interested in parsing something simple like `2 / (1 + 1) * 8`, so simple multiplicative calculations and expressions inside parenthesis. So if I would write some metanotation like BNF for that, it would look something like this:

``````expr ::== mul ( '+' mul | '-' mul ) *
mul ::== primary ( '*' primary | '/' primary ) *
primary ::== '(' expr ')' | number``````

### `expr`

This function is responsible for parsing arithmetic expressions. It follows the grammar rule: `expr ::= mul ( '+' mul | '-' mul ) *`. It starts by calling the `mul` function to parse the initial term. Then, it enters a `loop` to handle addition and subtraction operators. If it encounters a `+`, it parses the next term using mul and creates an AST node representing addition. If it encounters a `-`, it does the same for subtraction.

The `loop` continues until it encounters an operator that’s not `+` or `-`, at which point it returns the parsed expression node and the remaining token stream.

``````(defun expr (tok)
"expr ::== mul ( '+' mul | '-' mul ) *"
(multiple-value-bind (node rest)
(mul tok)
(loop
(cond ((eq (token-val rest) #\+)
(multiple-value-bind (node2 rest2)
(mul (token-next rest))
(setf node (make-ast-node :kind :add :lhs node :rhs node2))
(setf rest rest2)))
((eq (token-val rest) #\-)
(multiple-value-bind (node2 rest2)
(mul (token-next rest))
(setf node (make-ast-node :kind :sub :lhs node :rhs node2))
(setf rest rest2)))
(t
(return-from expr
(values node rest)))))))``````

### `mul`

Parsing `mul` expressions works in a similar fashion:

``````(defun mul (tok)
"mul ::== primary ( '*' primary | '/' primary ) *"
(multiple-value-bind (node rest)
(primary tok)
(loop
(cond ((eq (token-val rest) #\*)
(multiple-value-bind (node2 rest2)
(primary (token-next rest))
(setf node (make-ast-node :kind :mul :lhs node :rhs node2))
(setf rest rest2)))
((eq (token-val rest) #\/)
(multiple-value-bind (node2 rest2)
(primary (token-next rest))
(setf node (make-ast-node :kind :div :lhs node :rhs node2))
(setf rest rest2)))
(t
(return-from mul
(values node rest)))))))``````

### `primary`

This function parses primary expressions, which can be either numeric values or sub-expressions enclosed in parentheses. It follows the grammar rule: `primary ::= '(' expr ')' | number`. If the current token is a number, it creates an AST node representing a numeric value. If the token is an opening parenthesis, it recursively calls the expr function to parse the enclosed expression and skips to the closing parenthesis. It then returns the parsed node and the remaining token stream.

``````(defun primary (tok)
"primary ::== '(' expr ')' | number"
(cond ((eq (token-kind tok) :num)
(values (make-ast-node :kind :number :val (token-val tok))
(token-next tok)))
((eq (token-val tok) #\()
(multiple-value-bind (node rest)
(expr (token-next tok))
(values node (token-next (skip-to-token #\) rest)))))
(t (error 'parser-error))))``````

### Testing the Initial Parser

For now, I will be testing this parser only from the REPL. Firing up it, and running the command should print something like this:

``````* (expr (tokenize "2 / (1 + 1) * 8"))
#S(AST-NODE
:KIND :MUL
:VAL NIL
:LHS #S(AST-NODE
:KIND :DIV
:VAL NIL
:LHS #S(AST-NODE :KIND :NUMBER :VAL 2 :LHS NIL :RHS NIL)
:RHS #S(AST-NODE
:VAL NIL
:LHS #S(AST-NODE :KIND :NUMBER :VAL 1 :LHS NIL :RHS NIL)
:RHS #S(AST-NODE :KIND :NUMBER :VAL 1 :LHS NIL :RHS NIL)))
:RHS #S(AST-NODE :KIND :NUMBER :VAL 8 :LHS NIL :RHS NIL))
#S(SILA/LEXER::TOKEN :KIND :EOF :POS 15 :LEN NIL :VAL NIL :NEXT NIL)
*``````

Here we can see how our parser parses the expressions inside it. Transforming it to ASCII would look something like this:

``````     Mul(*)
/     \
Div(/)    8
/   \
/  \
1    1``````

## Code Generation

Assembly code of something like `2 / (1 + 1) * 8` is still quite simple, but definitely more instruction will be needed compared to the simple addition and subtraction. Assembly code for it would equal to:

``````	.globl main
main:
mov \$8, %rax
push %rax
mov \$1, %rax
push %rax
mov \$1, %rax
pop %rdi
push %rax
mov \$2, %rax
pop %rdi
cqo
idiv %rdi, %rax
pop %rdi
imul %rdi, %rax
ret``````

So what is happening here:

• Initializing and Pushing Values:

• `mov \$8, %rax`: This instruction moves the immediate value 8 into the `%rax` register.

• `push %rax`: The value in `%rax` (which is now 8) is pushed onto the stack.

• Further Operations:

• `mov \$1, %rax`: Another immediate value, 1, is moved into the `%rax` register.

• `push %rax`: The value in `%rax` (now 1) is pushed onto the stack.

• `mov \$1, %rax`: Yet another immediate value, 1, is moved into `%rax`.

• `pop %rdi`: The top value from the stack (1) is popped into the `%rdi` register.

• `add %rdi, %rax`: The value in `%rdi` (previously 1) is added to the value in `%rax` (also 1), and the result is stored in `%rax`.

• More Operations:

• `push %rax`: The value in %rax (now 2) is pushed onto the stack.

• `mov \$2, %rax`: The immediate value 2 is moved into `%rax`.

• `pop %rdi`: The top value from the stack (2) is popped into `%rdi`.

• Division and Multiplication:

• `cqo`: This instruction extends the sign of `%rax` into `%rdx` in preparation for the division operation.

• `idiv %rdi, %rax`: The value in `%rax` (now 2) is divided by the value in `%rdi` (previously 1). The quotient is stored in `%rax` and the remainder in `%rdx`.

• `pop %rdi`: The value 2 is popped from the stack and stored in `%rdi`.

• `imul %rdi, %rax`: The value in `%rax` (the quotient of the division) is multiplied by the value in `%rdi` (previously 2), and the result is stored in `%rax`.

### Implementation

For the code generation aspects, we can start by creating some unit tests since we already know what would be the output of our function:

``````(deftest test-emit-asm-div-mul-parens
(let ((emit-asm-tests '(("2 / (1 + 1) * 8" . "  .globl main
main:
mov \$8, %rax
push %rax
mov \$1, %rax
push %rax
mov \$1, %rax
pop %rdi
push %rax
mov \$2, %rax
pop %rdi
cqo
idiv %rdi, %rax
pop %rdi
imul %rdi, %rax
ret
"))))
(dolist (test emit-asm-tests)
(ok (string-equal (sila::emit-asm (car test)) (cdr test))
(format nil "Expect to be equal: ~%~a~%=>~a"
`(sila::emit-asm ,(car test))
(cdr test))))))``````

The implementation aspects also will be relatively straight forward.

``````(defun asm-inst (inst)
(format nil "  ~a~%" inst))

(defun asm-directive (dir)
(asm-inst dir))

(defun asm-label (label)
(format nil "~a:~%" label))

(defun asm-push ()
(asm-inst "push %rax"))

(defun asm-pop (reg)
(asm-inst (format nil "pop %~a" reg)))

(defun generate-expr (node)
(let ((asm ""))
(flet ((asm-conc (inst)
(setf asm (concatenate 'string asm inst))))
(when (eq (ast-node-kind node) :number)
(asm-conc (asm-inst (format nil "mov \$~d, %rax"
(ast-node-val node))))
(return-from generate-expr asm))
(asm-conc (generate-expr (ast-node-rhs node)))
(asm-conc (asm-push))
(asm-conc (generate-expr (ast-node-lhs node)))
(asm-conc (asm-pop "rdi"))
(alexandria:switch ((ast-node-kind node) :test #'eq)
(:sub (asm-conc (asm-inst "sub %rdi, %rax")))
(:mul (asm-conc (asm-inst "imul %rdi, %rax")))
(:div (asm-conc (asm-inst "cqo"))
(asm-conc (asm-inst "idiv %rdi, %rax")))
(t (error 'parser-error))))
asm))``````

This function recursively generates assembly code from an abstract syntax tree node representing an arithmetic expression. It constructs assembly instructions based on the kind of node encountered. It concatenates these instructions into a string that represents the generated assembly code.

• For a number node, it generates a move instruction (`mov`) to load the value into `%rax`.

• For binary operations (addition, subtraction, multiplication, division), it generates assembly for the right-hand side expression, pushes `%rax` onto the stack, generates assembly for the left-hand side expression, pops the value into %rdi, and then performs the corresponding operation.

Then lastly we just define the emit function, like so:

``````(defun emit-asm (src)
"Emit assembly code from given source code. Currently emits only x86-64 and
only Linux is tested."
(let ((asm ""))
(flet ((asm-conc (inst)
(setf asm (concatenate 'string asm inst))))
(multiple-value-bind (node rest)
(expr (tokenize src))
(unless (eq (token-kind rest) :eof)
(error 'parser-error :error-msg "Extra tokens"))
(asm-conc (asm-directive ".globl main"))
(asm-conc (asm-label "main"))
(asm-conc (generate-expr node))
(asm-conc (asm-inst "ret"))))
asm))``````

Again, we can test it directly in the REPL first:

``````* (emit-asm "2 / (1 + 1) * 8")
"  .globl main
main:
mov \$8, %rax
push %rax
mov \$1, %rax
push %rax
mov \$1, %rax
pop %rdi
push %rax
mov \$2, %rax
pop %rdi
cqo
idiv %rdi, %rax
pop %rdi
imul %rdi, %rax
ret
"``````

Seems to work. Now to run the tests:

``````# make test
rove sila.asd

Testing System sila/tests

;; testing 'sila/tests'
test-compilation-and-compare-rc
✓ Expect (COMPILE-PROGRAM-AND-COMPARE-RC "0" 0) to be true. (2726ms)
✓ Expect (COMPILE-PROGRAM-AND-COMPARE-RC "5 + 40 - 20" 25) to be true. (2693ms)
✓ Expect (COMPILE-PROGRAM-AND-COMPARE-RC "2 / (1 + 1) * 8" 8) to be true. (2703ms)
test-emit-asm-integer
✓ Expect to be equal:
(EMIT-ASM 0)
=>  .globl main
main:
mov \$0, %rax
ret

✓ Expect to be equal:
(EMIT-ASM 42)
=>  .globl main
main:
mov \$42, %rax
ret

✓ Expect to be equal:
(EMIT-ASM 5+20-4)
=>  .globl main
main:
mov \$4, %rax
push %rax
mov \$20, %rax
push %rax
mov \$5, %rax
pop %rdi
pop %rdi
sub %rdi, %rax
ret

✓ Expect to be equal:
(EMIT-ASM     5   +  20  -  4   )
=>  .globl main
main:
mov \$4, %rax
push %rax
mov \$20, %rax
push %rax
mov \$5, %rax
pop %rdi
pop %rdi
sub %rdi, %rax
ret

test-emit-asm-div-mul-parens
✓ Expect to be equal:
(EMIT-ASM 2 / (1 + 1) * 8)
=>  .globl main
main:
mov \$8, %rax
push %rax
mov \$1, %rax
push %rax
mov \$1, %rax
pop %rdi
push %rax
mov \$2, %rax
pop %rdi
cqo
idiv %rdi, %rax
pop %rdi
imul %rdi, %rax
ret

✓ 1 test completed

Summary:
All 1 test passed.``````

Cool, seems to work!

 Note: If you want to read earlier posts of this dev log, head over here.