Sila Dev Log: Initial Recursive Descent Parsing

Posted on 24th of August 2023 | 1860 words
Plug: Follow the Sila development here.

For the initial implementation of Sila’s parses, I decided to go with the recursive descent route due to its simplicity. So what it is? Recursive descent parsing is a top-down parsing technique employed to analyze the syntax of structured input, such as programming languages or mathematical expressions. It operates by breaking down the input into smaller, manageable components using a set of recursive functions that correspond to the non-terminal symbols in the grammar.

Beginning with a starting symbol, typically representing the highest-level construct, these functions are invoked recursively to recognize and process the various syntactic elements of the input. Each function handles a specific grammar rule, and by calling each other based on the grammar’s structure, the parser traverses the input while building a parse tree that reflects the hierarchical organization of the language’s syntax.

This approach allows for clear code organization and error reporting, though it can face challenges with certain types of grammars and may require augmentation with techniques like lookahead or memoization for optimal efficiency and effectiveness. But for early stages of the project, I don’t really mind that.

Parser Implementation

First I create some structures and helpers that I will be using in the parsing process.

(deftype ast-node-kind ()
  "Sila AST node kind."
  '(member
    :add
    :sub
    :mul
    :div
    :number))

(defstruct ast-node
  "Structure for Sila AST nodes."
  kind
  val
  lhs
  rhs)

(defun skip-to-token (val tok)
  "Search for token which is `val'. Returns `nil' when not found.

TODO(topi): Probably should do some proper error handling if `val' isn't
found."
  (cond ((eq (token-kind tok) :eof) nil)
        ((equal (token-val tok) val) tok)
        (t (search-token val (token-next tok)))))

For now, I’m interested in parsing something simple like 2 / (1 + 1) * 8, so simple multiplicative calculations and expressions inside parenthesis. So if I would write some metanotation like BNF for that, it would look something like this:

expr ::== mul ( '+' mul | '-' mul ) *
mul ::== primary ( '*' primary | '/' primary ) *
primary ::== '(' expr ')' | number

expr

This function is responsible for parsing arithmetic expressions. It follows the grammar rule: expr ::= mul ( '+' mul | '-' mul ) *. It starts by calling the mul function to parse the initial term. Then, it enters a loop to handle addition and subtraction operators. If it encounters a +, it parses the next term using mul and creates an AST node representing addition. If it encounters a -, it does the same for subtraction.

The loop continues until it encounters an operator that’s not + or -, at which point it returns the parsed expression node and the remaining token stream.

(defun expr (tok)
  "expr ::== mul ( '+' mul | '-' mul ) *"
  (multiple-value-bind (node rest)
      (mul tok)
    (loop
      (cond ((eq (token-val rest) #\+)
             (multiple-value-bind (node2 rest2)
                 (mul (token-next rest))
               (setf node (make-ast-node :kind :add :lhs node :rhs node2))
               (setf rest rest2)))
            ((eq (token-val rest) #\-)
             (multiple-value-bind (node2 rest2)
                 (mul (token-next rest))
               (setf node (make-ast-node :kind :sub :lhs node :rhs node2))
               (setf rest rest2)))
            (t
             (return-from expr
               (values node rest)))))))

mul

Parsing mul expressions works in a similar fashion:

(defun mul (tok)
  "mul ::== primary ( '*' primary | '/' primary ) *"
  (multiple-value-bind (node rest)
      (primary tok)
    (loop
      (cond ((eq (token-val rest) #\*)
             (multiple-value-bind (node2 rest2)
                 (primary (token-next rest))
               (setf node (make-ast-node :kind :mul :lhs node :rhs node2))
               (setf rest rest2)))
            ((eq (token-val rest) #\/)
             (multiple-value-bind (node2 rest2)
                 (primary (token-next rest))
               (setf node (make-ast-node :kind :div :lhs node :rhs node2))
               (setf rest rest2)))
            (t
             (return-from mul
               (values node rest)))))))

primary

This function parses primary expressions, which can be either numeric values or sub-expressions enclosed in parentheses. It follows the grammar rule: primary ::= '(' expr ')' | number. If the current token is a number, it creates an AST node representing a numeric value. If the token is an opening parenthesis, it recursively calls the expr function to parse the enclosed expression and skips to the closing parenthesis. It then returns the parsed node and the remaining token stream.

(defun primary (tok)
  "primary ::== '(' expr ')' | number"
  (cond ((eq (token-kind tok) :num)
         (values (make-ast-node :kind :number :val (token-val tok))
                 (token-next tok)))
        ((eq (token-val tok) #\()
         (multiple-value-bind (node rest)
             (expr (token-next tok))
           (values node (token-next (skip-to-token #\) rest)))))
        (t (error 'parser-error))))

Testing the Initial Parser

For now, I will be testing this parser only from the REPL. Firing up it, and running the command should print something like this:

* (expr (tokenize "2 / (1 + 1) * 8"))
#S(AST-NODE
   :KIND :MUL
   :VAL NIL
   :LHS #S(AST-NODE
           :KIND :DIV
           :VAL NIL
           :LHS #S(AST-NODE :KIND :NUMBER :VAL 2 :LHS NIL :RHS NIL)
           :RHS #S(AST-NODE
                   :KIND :ADD
                   :VAL NIL
                   :LHS #S(AST-NODE :KIND :NUMBER :VAL 1 :LHS NIL :RHS NIL)
                   :RHS #S(AST-NODE :KIND :NUMBER :VAL 1 :LHS NIL :RHS NIL)))
   :RHS #S(AST-NODE :KIND :NUMBER :VAL 8 :LHS NIL :RHS NIL))
#S(SILA/LEXER::TOKEN :KIND :EOF :POS 15 :LEN NIL :VAL NIL :NEXT NIL)
*

Here we can see how our parser parses the expressions inside it. Transforming it to ASCII would look something like this:

     Mul(*)
    /     \
 Div(/)    8
 /   \
2   Add(+)
     /  \
    1    1

Code Generation

Assembly code of something like 2 / (1 + 1) * 8 is still quite simple, but definitely more instruction will be needed compared to the simple addition and subtraction. Assembly code for it would equal to:

	.globl main
main:
	mov $8, %rax
	push %rax
	mov $1, %rax
	push %rax
	mov $1, %rax
	pop %rdi
	add %rdi, %rax
	push %rax
	mov $2, %rax
	pop %rdi
	cqo
	idiv %rdi, %rax
	pop %rdi
	imul %rdi, %rax
	ret

So what is happening here:

  • Initializing and Pushing Values:

    • mov $8, %rax: This instruction moves the immediate value 8 into the %rax register.

    • push %rax: The value in %rax (which is now 8) is pushed onto the stack.

  • Further Operations:

    • mov $1, %rax: Another immediate value, 1, is moved into the %rax register.

    • push %rax: The value in %rax (now 1) is pushed onto the stack.

    • mov $1, %rax: Yet another immediate value, 1, is moved into %rax.

    • pop %rdi: The top value from the stack (1) is popped into the %rdi register.

    • add %rdi, %rax: The value in %rdi (previously 1) is added to the value in %rax (also 1), and the result is stored in %rax.

  • More Operations:

    • push %rax: The value in %rax (now 2) is pushed onto the stack.

    • mov $2, %rax: The immediate value 2 is moved into %rax.

    • pop %rdi: The top value from the stack (2) is popped into %rdi.

  • Division and Multiplication:

    • cqo: This instruction extends the sign of %rax into %rdx in preparation for the division operation.

    • idiv %rdi, %rax: The value in %rax (now 2) is divided by the value in %rdi (previously 1). The quotient is stored in %rax and the remainder in %rdx.

    • pop %rdi: The value 2 is popped from the stack and stored in %rdi.

    • imul %rdi, %rax: The value in %rax (the quotient of the division) is multiplied by the value in %rdi (previously 2), and the result is stored in %rax.

Implementation

For the code generation aspects, we can start by creating some unit tests since we already know what would be the output of our function:

(deftest test-emit-asm-div-mul-parens
  (let ((emit-asm-tests '(("2 / (1 + 1) * 8" . "  .globl main
main:
  mov $8, %rax
  push %rax
  mov $1, %rax
  push %rax
  mov $1, %rax
  pop %rdi
  add %rdi, %rax
  push %rax
  mov $2, %rax
  pop %rdi
  cqo
  idiv %rdi, %rax
  pop %rdi
  imul %rdi, %rax
  ret
"))))
    (dolist (test emit-asm-tests)
      (ok (string-equal (sila::emit-asm (car test)) (cdr test))
          (format nil "Expect to be equal: ~%~a~%=>~a"
                  `(sila::emit-asm ,(car test))
                  (cdr test))))))

The implementation aspects also will be relatively straight forward.

(defun asm-inst (inst)
  (format nil "  ~a~%" inst))

(defun asm-directive (dir)
  (asm-inst dir))

(defun asm-label (label)
  (format nil "~a:~%" label))

(defun asm-push ()
  (asm-inst "push %rax"))

(defun asm-pop (reg)
  (asm-inst (format nil "pop %~a" reg)))

(defun generate-expr (node)
  (let ((asm ""))
    (flet ((asm-conc (inst)
             (setf asm (concatenate 'string asm inst))))
      (when (eq (ast-node-kind node) :number)
        (asm-conc (asm-inst (format nil "mov $~d, %rax"
                                    (ast-node-val node))))
        (return-from generate-expr asm))
      (asm-conc (generate-expr (ast-node-rhs node)))
      (asm-conc (asm-push))
      (asm-conc (generate-expr (ast-node-lhs node)))
      (asm-conc (asm-pop "rdi"))
      (alexandria:switch ((ast-node-kind node) :test #'eq)
        (:add (asm-conc (asm-inst "add %rdi, %rax")))
        (:sub (asm-conc (asm-inst "sub %rdi, %rax")))
        (:mul (asm-conc (asm-inst "imul %rdi, %rax")))
        (:div (asm-conc (asm-inst "cqo"))
              (asm-conc (asm-inst "idiv %rdi, %rax")))
        (t (error 'parser-error))))
    asm))

This function recursively generates assembly code from an abstract syntax tree node representing an arithmetic expression. It constructs assembly instructions based on the kind of node encountered. It concatenates these instructions into a string that represents the generated assembly code.

  • For a number node, it generates a move instruction (mov) to load the value into %rax.

  • For binary operations (addition, subtraction, multiplication, division), it generates assembly for the right-hand side expression, pushes %rax onto the stack, generates assembly for the left-hand side expression, pops the value into %rdi, and then performs the corresponding operation.

Then lastly we just define the emit function, like so:

(defun emit-asm (src)
  "Emit assembly code from given source code. Currently emits only x86-64 and
only Linux is tested."
  (let ((asm ""))
    (flet ((asm-conc (inst)
             (setf asm (concatenate 'string asm inst))))
      (multiple-value-bind (node rest)
          (expr (tokenize src))
        (unless (eq (token-kind rest) :eof)
          (error 'parser-error :error-msg "Extra tokens"))
        (asm-conc (asm-directive ".globl main"))
        (asm-conc (asm-label "main"))
        (asm-conc (generate-expr node))
        (asm-conc (asm-inst "ret"))))
    asm))

Again, we can test it directly in the REPL first:

* (emit-asm "2 / (1 + 1) * 8")
"  .globl main
main:
  mov $8, %rax
  push %rax
  mov $1, %rax
  push %rax
  mov $1, %rax
  pop %rdi
  add %rdi, %rax
  push %rax
  mov $2, %rax
  pop %rdi
  cqo
  idiv %rdi, %rax
  pop %rdi
  imul %rdi, %rax
  ret
"

Seems to work. Now to run the tests:

# make test
rove sila.asd

Testing System sila/tests

;; testing 'sila/tests'
test-compilation-and-compare-rc
  ✓ Expect (COMPILE-PROGRAM-AND-COMPARE-RC "0" 0) to be true. (2726ms)
  ✓ Expect (COMPILE-PROGRAM-AND-COMPARE-RC "5 + 40 - 20" 25) to be true. (2693ms)
  ✓ Expect (COMPILE-PROGRAM-AND-COMPARE-RC "2 / (1 + 1) * 8" 8) to be true. (2703ms)
test-emit-asm-integer
  ✓ Expect to be equal:
    (EMIT-ASM 0)
    =>  .globl main
    main:
      mov $0, %rax
      ret

  ✓ Expect to be equal:
    (EMIT-ASM 42)
    =>  .globl main
    main:
      mov $42, %rax
      ret

test-emit-asm-add-sub
  ✓ Expect to be equal:
    (EMIT-ASM 5+20-4)
    =>  .globl main
    main:
      mov $4, %rax
      push %rax
      mov $20, %rax
      push %rax
      mov $5, %rax
      pop %rdi
      add %rdi, %rax
      pop %rdi
      sub %rdi, %rax
      ret

  ✓ Expect to be equal:
    (EMIT-ASM     5   +  20  -  4   )
    =>  .globl main
    main:
      mov $4, %rax
      push %rax
      mov $20, %rax
      push %rax
      mov $5, %rax
      pop %rdi
      add %rdi, %rax
      pop %rdi
      sub %rdi, %rax
      ret

test-emit-asm-div-mul-parens
  ✓ Expect to be equal:
    (EMIT-ASM 2 / (1 + 1) * 8)
    =>  .globl main
    main:
      mov $8, %rax
      push %rax
      mov $1, %rax
      push %rax
      mov $1, %rax
      pop %rdi
      add %rdi, %rax
      push %rax
      mov $2, %rax
      pop %rdi
      cqo
      idiv %rdi, %rax
      pop %rdi
      imul %rdi, %rax
      ret


✓ 1 test completed

Summary:
  All 1 test passed.

Cool, seems to work!

Note: If you want to read earlier posts of this dev log, head over here.