Linguists call the process of reading text “parsing.” So that’s what we call the part of a programming language that reads the text and finds out what commands the programmer is giving.

But wait — how would we remember commands? Well, in the end, you can think of everything in a programming language as a function call. Instead of having a + operator, you could as well have a add(firstNumber, secondNumber) function. Functions can be nested, and you know what nesting means: We’ll probably want some sort of tree data structure.

So let’s just do that, define a data structure for a function:

struct Function
{
	string				name;
	array<Function *>	parameters;
	string				stringValue;
}

Right now we’ll just assume a program is an array of commands.

array<Function> commands;

So we’ll just loop over our tokenList from the last chapter and try to read stuff until we run out:

int currentToken = 0;
while (x < tokenList.size())
{
	if (tokenList[currentToken].type == IDENTIFIER && tokenList[currentToken].text == "print")
	{
		Function printFunction;
		printFunction.name = tokenList[currentToken].text
		
		++currentToken;
		if (currentToken == tokenList.size())
		{
			error("Expected parameter after \"print\".");
		}
		
		if (tokenList[currentToken].type != STRING)
		{
			error("Expected parameter after \"print\".");
		}
		
		Function getStringFunction;
		getStringFunction.name = "__get_string";
		getStringFunction.stringValue = tokenList[currentToken].text;
		
		++currentToken;
		
		printFunction.parameters.append(getStringFunction);
		commands.append(printFunction);
	}
	else if (tokenList[currentToken].type == LINE_BREAK)
	{
		++currentToken; // Skip empty line.
	}
	else
	{
		error("Unknown command.");
	}

	if (currentToken == tokenList.size()) // File has no line break at end, that's fine.
	{
		break;
	}
	
	if (tokenList[currentToken].type != LINE_BREAK)
	{
		error("Extra tokens after end of command.");
	}
	
	++currentToken;
}

As you can see, we’re being a little hackish here: Our functions can also have an extra string attached to it, so if we need a string instead of a function call, we just use the fake function name __get_string to mean “this is a string, not a function call” and set its stringValue to the string.

If you’re using an object-oriented programming language, you could instead make a base class that is called e.g. Value, and just gives back a string or whatever. And Function would be a subclass of Value that can also contain parameters (which are other Values, not necessarily Functions).

What does that get us? It gets us an array of commands. We could now loop over these commands, look at what the function name is, and do stuff with them. But that is not the parser’s job.

Syntax Trees

That array of Functions we just made is a “syntax tree”. It contains all our code, already properly structured. Right now we only do a bunch of commands. If you wanted to let our programmer define functions, instead of one array of function calls, we’d have a dictionary (or “hash map”) of function call arrays, with the function name as the key. Change the parser so it reads whatever your function definitions look like, and then use the code above to fill the function definition’s array.

For example, if a function in your programming language looks like

function main
{
	print "Hello world"
}

Your parser would look like:

dictionary<string,array<Function>> functionDefinitions;

int currentToken = 0;
while (x < tokenList.size())
{
	if (tokenList[currentToken].type == IDENTIFIER && tokenList[currentToken].text == "function")
	{
		++currentToken;
		
		if (currentToken == tokenList.size() or tokenList[currentToken].type != IDENTIFIER)
		{
			error("Expected function name after \"function\".");
		}
		
		string functionName = tokenList[currentToken].text;
		
		++currentToken;
		
		array<Function> commands = parse_function_body(&currentToken, &tokenList); // This looks for the { and } as well.
		functionDefinitions[functionName] = commands;
	}
	else if (tokenList[currentToken].type == LINE_BREAK)
	{
		++currentToken; // Skip empty line.
	}
	else
	{
		error("Unknown top-level construct, expected a function.");
	}
}

Now of course, functions usually don’t just have a name, so instead of the raw array, a real programming language would likely use a struct that contains the array of commands, plus the parameter list, plus the return type, plus whatever else your programming language wants to attach to its functions.

No Flow Control?

At the moment, our programming language is very linear. There is no if/else, no loops. Let’s leave it like that for now, until we know how to run the program. Once we know how the program runs (in the next chapter), it will be easier to understand how flow control can be achieved.