The Parser
Linguists call the process of reading text “parsing.” So that’s what we call the part of a programming language that reads the text and finds out what commands the programmer is giving.
But wait — how would we remember commands? Well, in the end, you can think of everything
in a programming language as a function call. Instead of having a +
operator, you could
as well have a add(firstNumber, secondNumber)
function. Functions can be nested, and
you know what nesting means: We’ll probably want some sort of tree data structure.
So let’s just do that, define a data structure for a function:
struct Function
{
string name;
array<Function *> parameters;
string stringValue;
}
Right now we’ll just assume a program is an array of commands.
array<Function> commands;
So we’ll just loop over our tokenList
from the last chapter and try to read stuff until
we run out:
int currentToken = 0;
while (x < tokenList.size())
{
if (tokenList[currentToken].type == IDENTIFIER && tokenList[currentToken].text == "print")
{
Function printFunction;
printFunction.name = tokenList[currentToken].text
++currentToken;
if (currentToken == tokenList.size())
{
error("Expected parameter after \"print\".");
}
if (tokenList[currentToken].type != STRING)
{
error("Expected parameter after \"print\".");
}
Function getStringFunction;
getStringFunction.name = "__get_string";
getStringFunction.stringValue = tokenList[currentToken].text;
++currentToken;
printFunction.parameters.append(getStringFunction);
commands.append(printFunction);
}
else if (tokenList[currentToken].type == LINE_BREAK)
{
++currentToken; // Skip empty line.
}
else
{
error("Unknown command.");
}
if (currentToken == tokenList.size()) // File has no line break at end, that's fine.
{
break;
}
if (tokenList[currentToken].type != LINE_BREAK)
{
error("Extra tokens after end of command.");
}
++currentToken;
}
As you can see, we’re being a little hackish here: Our functions can also have an extra
string attached to it, so if we need a string instead of a function call, we just use
the fake function name __get_string
to mean “this is a string, not a function call” and
set its stringValue
to the string.
If you’re using an object-oriented programming language, you could instead make a
base class that is called e.g. Value
, and just gives back a string or whatever. And
Function
would be a subclass of Value
that can also contain parameters (which are
other Value
s, not necessarily Functions
).
What does that get us? It gets us an array of commands. We could now loop over these commands, look at what the function name is, and do stuff with them. But that is not the parser’s job.
Syntax Trees
That array of Function
s we just made is a “syntax tree”. It contains all our code,
already properly structured. Right now we only do a bunch of commands. If you wanted to
let our programmer define functions, instead of one array of function calls, we’d have a
dictionary (or “hash map”) of function call arrays, with the function name as the key.
Change the parser so it reads whatever your function definitions look like, and then use
the code above to fill the function definition’s array.
For example, if a function in your programming language looks like
function main
{
print "Hello world"
}
Your parser would look like:
dictionary<string,array<Function>> functionDefinitions;
int currentToken = 0;
while (x < tokenList.size())
{
if (tokenList[currentToken].type == IDENTIFIER && tokenList[currentToken].text == "function")
{
++currentToken;
if (currentToken == tokenList.size() or tokenList[currentToken].type != IDENTIFIER)
{
error("Expected function name after \"function\".");
}
string functionName = tokenList[currentToken].text;
++currentToken;
array<Function> commands = parse_function_body(¤tToken, &tokenList); // This looks for the { and } as well.
functionDefinitions[functionName] = commands;
}
else if (tokenList[currentToken].type == LINE_BREAK)
{
++currentToken; // Skip empty line.
}
else
{
error("Unknown top-level construct, expected a function.");
}
}
Now of course, functions usually don’t just have a name, so instead of the raw array, a real programming language would likely use a struct that contains the array of commands, plus the parameter list, plus the return type, plus whatever else your programming language wants to attach to its functions.
No Flow Control?
At the moment, our programming language is very linear. There is no if/else, no loops. Let’s leave it like that for now, until we know how to run the program. Once we know how the program runs (in the next chapter), it will be easier to understand how flow control can be achieved.