BruteForce/Compiler
/** Analyse and compile the language (create an AST) */ class Compiler extends Object dependsOn(Tokenizer) dependsOn(AST); // terminals const __BEGIN = "begin"; const __END = "end"; const __SEMICOLON = ";"; const __VAR = "var"; const __INTEGER = "int"; const __STRING = "string"; const __FLOAT = "float"; const __BOOLEAN = "bool"; const __FUNC = "function"; const __BECOMES = "="; const __WHILE = "while"; const __DO = "do"; const __IF = "if"; const __THEN = "then"; const __ELSE = "else"; const __LT = "<"; const __LE = "<="; const __GT = ">"; const __GE = ">="; const __EQ = "=="; const __NE = "!="; const __AND = "&&"; const __OR = "||"; const __PLUS = "+"; const __MINUS = "-"; const __MULTIPLY = "*"; const __DIVIDE = "/"; const __MOD = "%"; const __NOT = "!"; const __LBRACK = "("; const __RBRACK = ")"; const __TRUE = "true"; const __FALSE = "false"; const __COMMA = ","; // terminals -- end
All the terminals as defined in the grammar, it's best to use constant instead of inline strings because this prevents you from making typos and it will also make it easier to change terminal. For example if you want to use { ... } instead of begin ... end you can simply change the constant value.
var private Tokenizer t; var private AST a; function Compile(Tokenizer tokenizer, AST tree) { t = tokenizer; a = tree; _program(); }
The main function of this class, this will start the parsing of the code.
function bool has(Tokenizer.tokenType token, optional string text) { if (text == "") return token ~= t.currentToken(); return (text ~= t.tokenString()) && (token ~= t.currentToken()); }
The function has checks if the current token has the correct type, and optional if the token string is equal to the input.
function require(Tokenizer.tokenType token, optional string text) { local bool res; res = has(token, text); if (!res) { Warn("Expected ("$token$") \""$text$"\" but has ("$t.currentToken()$") \""$t.tokenString()$"\" @ "$t.currentLine()$","$t.currentPos()); assert(false); } }
Require does the same as has() except that if the result is false it will abort the whole compilation. Require is used to check for tokens that have to be there in order to for the language to be conform to the specification.
Currently it uses assert() to stop the compilation, this will ofcourse stop the whole system, you may want to write a diffirent instruction here to stop compiling.
function _program() { _declarations(); _statements(); require(TT_EOF); }
The initial rule, note that it requires an end of file at the end, this is to ensure that no more declarations are possible after the main code has been finished.
function _declarations() { while (has(TT_Identifier, __VAR) || has(TT_Identifier, __FUNC)) { if (has(TT_Identifier, __VAR)) _declaration(); else if (has(TT_Identifier, __FUNC)) _function(); require(TT_Literal, __SEMICOLON); t.nextToken(); } }
The declaration block, a declaration is either a var or function, while this is true we keep reading declarations. When it's not true it has to be a statement.
function _declaration() { a.AddRoot(NT_Keyword, __VAR); t.nextToken(); // __VAR _type(); require(TT_Identifier); a.AddChild(NT_Identifier, t.tokenString()); t.nextToken(); a.CloseRoot(); }
A variable declaration, we start by adding a new root node to the /AST, next we decent into the type specification where type is added to the current root node. Next we should get the variable identifier which we also add to the current root node.
At the end we close the current root node, everything required for a var AST node is not in the root node so we can close it to start with a new one.
function _type() { if (has(TT_Identifier, __INTEGER)) { a.AddChild(NT_Keyword, __INTEGER); t.nextToken(); } else if (has(TT_Identifier, __STRING)) { a.AddChild(NT_Keyword, __STRING); t.nextToken(); } else if (has(TT_Identifier, __FLOAT)) { a.AddChild(NT_Keyword, __FLOAT); t.nextToken(); } else if (has(TT_Identifier, __BOOLEAN)) { a.AddChild(NT_Keyword, __BOOLEAN); t.nextToken(); } else { Warn("Unrecognised type:"@t.tokenString()@"@ "$t.currentLine()$","$t.currentPos()); assert(false); } }
We only support four types at the moment, if the encounter something else it's wrong and we abort the compilation.
function _function() { a.AddRoot(NT_Keyword, __FUNC); t.nextToken(); // function _type(); require(TT_Identifier); a.AddRoot(NT_Identifier, t.tokenString()); // function name t.nextToken(); require(TT_Literal, __LBRACK); t.nextToken(); _arguments(); require(TT_Literal, __RBRACK); a.CloseRoot(); t.nextToken(); _declarations(); require(TT_Identifier, __BEGIN); t.nextToken(); _statements(); require(TT_Identifier, __END); t.nextToken(); a.CloseRoot(); }
We received a function declarations, we start by adding a new root node of type function. Just like with the variable declaration we add the type of the function to the root node and the name of the function. But, unlike with the variable name we make the function name the new root node, to this root node we will add the function arguments.
After we added the arguments, if any, we close the root node (the function name) and we start adding a declarations and statements block to the function root node.
Note that I don't add the begin, end, etc to the AST. We don't need these things in the AST because in the AST we know how many children a block has.
function _arguments() { while (!has(TT_Literal, __RBRACK)) { a.AddRoot(NT_Keyword, __VAR); _type(); require(TT_Identifier); a.AddChild(NT_Identifier, t.tokenString()); t.nextToken(); a.CloseRoot(); if (has(TT_Literal, __SEMICOLON)) t.nextToken(); else break; } }
The function arguments are just defined as local variable of the function, the only reason we added them as children of the function name instead of the function because it's easy to assign values to those when we call the function.
function _statements() { while (!(has(TT_Identifier, __END) || has(TT_EOF))) { _statement(); require(TT_Literal, __SEMICOLON); t.nextToken(); } }
Statements are read until the end of file.
function _statement() { if (has(TT_Identifier, __WHILE)) _whiledo(); else if (has(TT_Identifier, __IF)) _ifthenelse(); else if (has(TT_Identifier)) _assignment(); else { Warn("Unrecognised statement:"@t.tokenString()@"@ "$t.currentLine()$","$t.currentPos()); assert(false); } }
As defined in the /EBNF a statement can be a while ... do, an if ... then ... else or an assignment. When we have something else as the current token something has to be wrong and we stop with compiling.
function _whiledo() { a.AddRoot(NT_Keyword, __WHILE); t.nextToken(); // WHILE _expr(); require(TT_Identifier, __DO); t.nextToken(); _codeblock(); a.CloseRoot(); }
A while to begins with adding the new root node of type while, after that we add a child node with the expression, and a child node of a codeblock (these child nodes are added by the called functions)
function _codeblock() { if (has(TT_Identifier, __BEGIN)) { a.AddRoot(NT_Keyword, __BEGIN); t.nextToken(); _statements(); require(TT_Identifier, __END); t.nextToken(); a.CloseRoot(); } else { _statement(); } }
If the code block starts with a begin we open a new root node because a code block may only add one child node to the current root node. This new root node will then get more than a group of statements and an end which will close the root node.
If the code block doesn't start with a begin we asume it's just a single statement.
function _ifthenelse() { a.AddRoot(NT_Keyword, __IF); t.nextToken(); // IF _expr(); require(TT_Identifier, __THEN); t.nextToken(); _codeblock(); if (has(TT_Identifier, __ELSE)) { t.nextToken(); _codeblock(); } a.CloseRoot(); }
This works just like the while ... do except that there might be a third child to be added to the if root node, the else is optional.
function _assignment() { local string tmp; tmp = t.tokenString(); t.nextToken(); if (has(TT_Literal, __LBRACK)) _functioncall(tmp); else { a.AddRoot(NT_Keyword, __BECOMES); a.AddChild(NT_Identifier, tmp); require(TT_Operator, __BECOMES); t.nextToken(); _expr(); a.CloseRoot(); } }
An assignment starts with a identifier, however this identifier can be a variable or a function, so we have to take a look at the next token if it's either a becomes or a left bracket. For this we had to make the compiler look a head one token (e.g. LL(2)).
If it's really an assigment we will add a new root node of type becomes and we assign the childs identifier and expression to that node.
function _expr() { _boolex(); }
This is an nice example of a useless looking function, but it might be usefull in the future when we want to add an operator with a lower precendance than the boolean expressions.
function _boolex() { _accum(); while (has(TT_Operator, __LT)||has(TT_Operator, __LE)||has(TT_Operator, __GT)||has(TT_Operator, __GE)|| has(TT_Operator, __EQ)||has(TT_Operator, __NE)||has(TT_Operator, __AND)||has(TT_Operator, __OR)) { a.AddRoot(NT_Keyword, t.tokenString()); a.SwitchNode(); t.nextToken(); _accum(); a.CloseRoot(); } }
Well here is where it's get difficult, we first decent into the _accum() function which will finally result into a child node that can have a complete sub tree, or just a single child node. After that we might get one or more boolean operators. This is the infix notation, but we want to make our AST prefix. So we add a new root node for the current boolean operator and switch that root node around with the last child node in the current root.
Let's take the following code: x = 1 + 2
before swap | after swap |
= x 1 + 2 |
= x + 1 2 |
The same applies to the following few functions below
function _accum() { _mult(); while (has(TT_Operator, __PLUS)||has(TT_Operator, __MINUS)) { a.AddRoot(NT_Keyword, t.tokenString()); a.SwitchNode(); t.nextToken(); _mult(); a.CloseRoot(); } } function _mult() { _preop(); while (has(TT_Operator, __MULTIPLY)||has(TT_Operator, __DIVIDE)||has(TT_Operator, __MOD)) { a.AddRoot(NT_Keyword, t.tokenString()); a.SwitchNode(); t.nextToken(); _preop(); a.CloseRoot(); } } function _preop() { local bool open; open = false; if (has(TT_Operator, __MINUS)) { open = true; a.AddRoot(NT_Keyword, __MINUS); a.AddChild(NT_Integer, "0"); t.nextToken(); } else if (has(TT_Operator, __NOT)) { open = true; a.AddRoot(NT_Keyword, __NOT); t.nextToken(); } _operand(); if (open) a.CloseRoot(); }
For the - preoperator I've added a hack, it will be inserted into the AST as if it was 0 - operand
function _operand() { local string tmp; if (has(TT_Identifier, __TRUE)) { a.AddChild(NT_Boolean, t.tokenString()); t.nextToken(); } else if (has(TT_Identifier, __FALSE)) { a.AddChild(NT_Boolean, t.tokenString()); t.nextToken(); } else if (has(TT_Identifier)) { tmp = t.tokenString(); t.nextToken(); if (has(TT_Literal, __LBRACK)) // is function ?? { _functioncall(tmp); } else { a.AddChild(NT_Identifier, tmp); } } else if (has(TT_Integer)) { a.AddChild(NT_Integer, t.tokenString()); t.nextToken(); } else if (has(TT_String)) { a.AddChild(NT_String, t.tokenString()); t.nextToken(); } else if (has(TT_Float)) { a.AddChild(NT_Float, t.tokenString()); t.nextToken(); } else if (has(TT_Literal, __LBRACK)) { t.nextToken(); _expr(); require(TT_Literal, __RBRACK); t.nextToken(); } else { Warn("Unexpected token:"@t.tokenString()@"@ "$t.currentLine()$","$t.currentPos()); assert(false); } }
The operand function is the finaly stop, first we check if the the identifier is either a true or a false, if not we check if the identifier is either a variable or a function call. If it's just a variable we will add a normal child to the current root node.
Next we check if it is a inline constant, or an expression between braces.
If everything fails we have something wierd and stop compiling.
function _functioncall(string name) { a.AddRoot(NT_Function, name); t.nextToken(); while (!has(TT_Literal, __RBRACK)) { _expr(); if (has(TT_Literal, __COMMA)) t.nextToken(); else break; } require(TT_Literal, __RBRACK); t.nextToken(); // __RBRACK a.CloseRoot(); }
If it's a function call we will add a new root with the function name, and give it the type function, as child nodes we add the arguments passed to it. And finally close the function call root.