Sila Dev Log: Initial Recursive Descent Parsing
Posted on 24th of August 2023 | 1860 wordsPlug: Follow the Sila development here. |
For the initial implementation of Sila’s parses, I decided to go with the recursive descent route due to its simplicity. So what it is? Recursive descent parsing is a top-down parsing technique employed to analyze the syntax of structured input, such as programming languages or mathematical expressions. It operates by breaking down the input into smaller, manageable components using a set of recursive functions that correspond to the non-terminal symbols in the grammar.
Beginning with a starting symbol, typically representing the highest-level construct, these functions are invoked recursively to recognize and process the various syntactic elements of the input. Each function handles a specific grammar rule, and by calling each other based on the grammar’s structure, the parser traverses the input while building a parse tree that reflects the hierarchical organization of the language’s syntax.
This approach allows for clear code organization and error reporting, though it can face challenges with certain types of grammars and may require augmentation with techniques like lookahead or memoization for optimal efficiency and effectiveness. But for early stages of the project, I don’t really mind that.
Parser Implementation
First I create some structures and helpers that I will be using in the parsing process.
(deftype ast-node-kind () "Sila AST node kind." '(member :add :sub :mul :div :number)) (defstruct ast-node "Structure for Sila AST nodes." kind val lhs rhs) (defun skip-to-token (val tok) "Search for token which is `val'. Returns `nil' when not found. TODO(topi): Probably should do some proper error handling if `val' isn't found." (cond ((eq (token-kind tok) :eof) nil) ((equal (token-val tok) val) tok) (t (search-token val (token-next tok)))))
For now, I’m interested in parsing something simple like 2 / (1 + 1) * 8
, so
simple multiplicative calculations and expressions inside parenthesis. So if I
would write some metanotation like BNF for that, it would look something like
this:
expr ::== mul ( '+' mul | '-' mul ) * mul ::== primary ( '*' primary | '/' primary ) * primary ::== '(' expr ')' | number
expr
This function is responsible for parsing arithmetic expressions. It follows
the grammar rule: expr ::= mul ( '+' mul | '-' mul ) *
. It starts by calling
the mul
function to parse the initial term. Then, it enters a loop
to
handle addition and subtraction operators. If it encounters a +
, it parses
the next term using mul and creates an AST node representing addition. If it
encounters a -
, it does the same for subtraction.
The loop
continues until it encounters an operator that’s not +
or -
, at
which point it returns the parsed expression node and the remaining token
stream.
(defun expr (tok) "expr ::== mul ( '+' mul | '-' mul ) *" (multiple-value-bind (node rest) (mul tok) (loop (cond ((eq (token-val rest) #\+) (multiple-value-bind (node2 rest2) (mul (token-next rest)) (setf node (make-ast-node :kind :add :lhs node :rhs node2)) (setf rest rest2))) ((eq (token-val rest) #\-) (multiple-value-bind (node2 rest2) (mul (token-next rest)) (setf node (make-ast-node :kind :sub :lhs node :rhs node2)) (setf rest rest2))) (t (return-from expr (values node rest)))))))
mul
Parsing mul
expressions works in a similar fashion:
(defun mul (tok) "mul ::== primary ( '*' primary | '/' primary ) *" (multiple-value-bind (node rest) (primary tok) (loop (cond ((eq (token-val rest) #\*) (multiple-value-bind (node2 rest2) (primary (token-next rest)) (setf node (make-ast-node :kind :mul :lhs node :rhs node2)) (setf rest rest2))) ((eq (token-val rest) #\/) (multiple-value-bind (node2 rest2) (primary (token-next rest)) (setf node (make-ast-node :kind :div :lhs node :rhs node2)) (setf rest rest2))) (t (return-from mul (values node rest)))))))
primary
This function parses primary expressions, which can be either numeric values
or sub-expressions enclosed in parentheses. It follows the grammar rule:
primary ::= '(' expr ')' | number
. If the current token is a number, it
creates an AST node representing a numeric value. If the token is an opening
parenthesis, it recursively calls the expr function to parse the enclosed
expression and skips to the closing parenthesis. It then returns the parsed
node and the remaining token stream.
(defun primary (tok) "primary ::== '(' expr ')' | number" (cond ((eq (token-kind tok) :num) (values (make-ast-node :kind :number :val (token-val tok)) (token-next tok))) ((eq (token-val tok) #\() (multiple-value-bind (node rest) (expr (token-next tok)) (values node (token-next (skip-to-token #\) rest))))) (t (error 'parser-error))))
Testing the Initial Parser
For now, I will be testing this parser only from the REPL. Firing up it, and running the command should print something like this:
* (expr (tokenize "2 / (1 + 1) * 8")) #S(AST-NODE :KIND :MUL :VAL NIL :LHS #S(AST-NODE :KIND :DIV :VAL NIL :LHS #S(AST-NODE :KIND :NUMBER :VAL 2 :LHS NIL :RHS NIL) :RHS #S(AST-NODE :KIND :ADD :VAL NIL :LHS #S(AST-NODE :KIND :NUMBER :VAL 1 :LHS NIL :RHS NIL) :RHS #S(AST-NODE :KIND :NUMBER :VAL 1 :LHS NIL :RHS NIL))) :RHS #S(AST-NODE :KIND :NUMBER :VAL 8 :LHS NIL :RHS NIL)) #S(SILA/LEXER::TOKEN :KIND :EOF :POS 15 :LEN NIL :VAL NIL :NEXT NIL) *
Here we can see how our parser parses the expressions inside it. Transforming it to ASCII would look something like this:
Mul(*) / \ Div(/) 8 / \ 2 Add(+) / \ 1 1
Code Generation
Assembly code of something like 2 / (1 + 1) * 8
is still quite simple, but
definitely more instruction will be needed compared to the simple addition and
subtraction. Assembly code for it would equal to:
.globl main main: mov $8, %rax push %rax mov $1, %rax push %rax mov $1, %rax pop %rdi add %rdi, %rax push %rax mov $2, %rax pop %rdi cqo idiv %rdi, %rax pop %rdi imul %rdi, %rax ret
So what is happening here:
-
Initializing and Pushing Values:
-
mov $8, %rax
: This instruction moves the immediate value 8 into the%rax
register. -
push %rax
: The value in%rax
(which is now 8) is pushed onto the stack.
-
-
Further Operations:
-
mov $1, %rax
: Another immediate value, 1, is moved into the%rax
register. -
push %rax
: The value in%rax
(now 1) is pushed onto the stack. -
mov $1, %rax
: Yet another immediate value, 1, is moved into%rax
. -
pop %rdi
: The top value from the stack (1) is popped into the%rdi
register. -
add %rdi, %rax
: The value in%rdi
(previously 1) is added to the value in%rax
(also 1), and the result is stored in%rax
.
-
-
More Operations:
-
push %rax
: The value in %rax (now 2) is pushed onto the stack. -
mov $2, %rax
: The immediate value 2 is moved into%rax
. -
pop %rdi
: The top value from the stack (2) is popped into%rdi
.
-
-
Division and Multiplication:
-
cqo
: This instruction extends the sign of%rax
into%rdx
in preparation for the division operation. -
idiv %rdi, %rax
: The value in%rax
(now 2) is divided by the value in%rdi
(previously 1). The quotient is stored in%rax
and the remainder in%rdx
. -
pop %rdi
: The value 2 is popped from the stack and stored in%rdi
. -
imul %rdi, %rax
: The value in%rax
(the quotient of the division) is multiplied by the value in%rdi
(previously 2), and the result is stored in%rax
.
-
Implementation
For the code generation aspects, we can start by creating some unit tests since we already know what would be the output of our function:
(deftest test-emit-asm-div-mul-parens (let ((emit-asm-tests '(("2 / (1 + 1) * 8" . " .globl main main: mov $8, %rax push %rax mov $1, %rax push %rax mov $1, %rax pop %rdi add %rdi, %rax push %rax mov $2, %rax pop %rdi cqo idiv %rdi, %rax pop %rdi imul %rdi, %rax ret ")))) (dolist (test emit-asm-tests) (ok (string-equal (sila::emit-asm (car test)) (cdr test)) (format nil "Expect to be equal: ~%~a~%=>~a" `(sila::emit-asm ,(car test)) (cdr test))))))
The implementation aspects also will be relatively straight forward.
(defun asm-inst (inst) (format nil " ~a~%" inst)) (defun asm-directive (dir) (asm-inst dir)) (defun asm-label (label) (format nil "~a:~%" label)) (defun asm-push () (asm-inst "push %rax")) (defun asm-pop (reg) (asm-inst (format nil "pop %~a" reg))) (defun generate-expr (node) (let ((asm "")) (flet ((asm-conc (inst) (setf asm (concatenate 'string asm inst)))) (when (eq (ast-node-kind node) :number) (asm-conc (asm-inst (format nil "mov $~d, %rax" (ast-node-val node)))) (return-from generate-expr asm)) (asm-conc (generate-expr (ast-node-rhs node))) (asm-conc (asm-push)) (asm-conc (generate-expr (ast-node-lhs node))) (asm-conc (asm-pop "rdi")) (alexandria:switch ((ast-node-kind node) :test #'eq) (:add (asm-conc (asm-inst "add %rdi, %rax"))) (:sub (asm-conc (asm-inst "sub %rdi, %rax"))) (:mul (asm-conc (asm-inst "imul %rdi, %rax"))) (:div (asm-conc (asm-inst "cqo")) (asm-conc (asm-inst "idiv %rdi, %rax"))) (t (error 'parser-error)))) asm))
This function recursively generates assembly code from an abstract syntax tree node representing an arithmetic expression. It constructs assembly instructions based on the kind of node encountered. It concatenates these instructions into a string that represents the generated assembly code.
-
For a number node, it generates a move instruction (
mov
) to load the value into%rax
. -
For binary operations (addition, subtraction, multiplication, division), it generates assembly for the right-hand side expression, pushes
%rax
onto the stack, generates assembly for the left-hand side expression, pops the value into %rdi, and then performs the corresponding operation.
Then lastly we just define the emit function, like so:
(defun emit-asm (src) "Emit assembly code from given source code. Currently emits only x86-64 and only Linux is tested." (let ((asm "")) (flet ((asm-conc (inst) (setf asm (concatenate 'string asm inst)))) (multiple-value-bind (node rest) (expr (tokenize src)) (unless (eq (token-kind rest) :eof) (error 'parser-error :error-msg "Extra tokens")) (asm-conc (asm-directive ".globl main")) (asm-conc (asm-label "main")) (asm-conc (generate-expr node)) (asm-conc (asm-inst "ret")))) asm))
Again, we can test it directly in the REPL first:
* (emit-asm "2 / (1 + 1) * 8") " .globl main main: mov $8, %rax push %rax mov $1, %rax push %rax mov $1, %rax pop %rdi add %rdi, %rax push %rax mov $2, %rax pop %rdi cqo idiv %rdi, %rax pop %rdi imul %rdi, %rax ret "
Seems to work. Now to run the tests:
# make test rove sila.asd Testing System sila/tests ;; testing 'sila/tests' test-compilation-and-compare-rc ✓ Expect (COMPILE-PROGRAM-AND-COMPARE-RC "0" 0) to be true. (2726ms) ✓ Expect (COMPILE-PROGRAM-AND-COMPARE-RC "5 + 40 - 20" 25) to be true. (2693ms) ✓ Expect (COMPILE-PROGRAM-AND-COMPARE-RC "2 / (1 + 1) * 8" 8) to be true. (2703ms) test-emit-asm-integer ✓ Expect to be equal: (EMIT-ASM 0) => .globl main main: mov $0, %rax ret ✓ Expect to be equal: (EMIT-ASM 42) => .globl main main: mov $42, %rax ret test-emit-asm-add-sub ✓ Expect to be equal: (EMIT-ASM 5+20-4) => .globl main main: mov $4, %rax push %rax mov $20, %rax push %rax mov $5, %rax pop %rdi add %rdi, %rax pop %rdi sub %rdi, %rax ret ✓ Expect to be equal: (EMIT-ASM 5 + 20 - 4 ) => .globl main main: mov $4, %rax push %rax mov $20, %rax push %rax mov $5, %rax pop %rdi add %rdi, %rax pop %rdi sub %rdi, %rax ret test-emit-asm-div-mul-parens ✓ Expect to be equal: (EMIT-ASM 2 / (1 + 1) * 8) => .globl main main: mov $8, %rax push %rax mov $1, %rax push %rax mov $1, %rax pop %rdi add %rdi, %rax push %rax mov $2, %rax pop %rdi cqo idiv %rdi, %rax pop %rdi imul %rdi, %rax ret ✓ 1 test completed Summary: All 1 test passed.
Cool, seems to work!
NB: If you want to read earlier posts of this dev log, head over here. |