The Tokenizer

The first thing any programming language has to do, is take the text that the user typed in and make sense of it. Pretty much all programming languages treat text in a way that makes this awkward. Whenever you read text, you get an array of characters, or a string.

So a program that says

print "hello world"

will look to your program like:

'p', 'r', 'i', 'n', 't', ' ', '"', 'h', 'e', 'l', 'o', ' ', 'w', 'o', 'r', 'l', 'd', '"'

You can now extract parts from this array, but you usually have to kind of scan into it to see what you are really dealing with. Some programming languages make this a bit easier by providing you with regular expressions, but in the end, you still have to scan through your text from the start, character by character.

Regular expressions help here, but they are not very good at handling nested constructs, or handling escape sequences like “Hello "World"” where we have a string (whose start and end is marked by quote characters, but which also contains quote characters, which are marked with backslashes to indicate to people reading that there are supposed to be quotes here inside the string, and this is not the end of the string already), so you usually still need to handle those by hand.

Also, when reading a program, you will often find bits of text that starts with similar words and assume it is one thing, then realize it isn’t, and backtrack and try to interpret it as something else. If, every time you backtracked, you had to re-read the individual words of your text, that would be annoying, and would make the code for reading the program very hard to read.

We want nice, readable code, so we can keep adding things to our language for years to come! So instead of saying

are the first 5 characters of the program "print"?
are the next 1 or more characters whitespace?
	Is after them one or more numbers?
	Or a quote character followed by arbitrary characters followed by a quote?
	Or an alphabetic character followed by alphabetic characters or numbers?
Is there a line-end marker after that?

we just want to say

Is word 1 "print"?
Is word 2 a quoted string or number or identifier?
Is there a line-end marker after that?

We want to look at our text at a higher level, and if we find out that word 2 is something else, we just want to be able to say, “rewind to word 1” instead of having to remember how much whitespace was between print and whatever followed it.

Enter the Token

You may have been confused by me calling a string “word 2”. That’s true, there is a space inside the string so that string would actually be words 2 and 3 or so, but we don’t care about the contents of the string at this point, we just want whatever is between the two quotes. So let’s give this thing we care about a special name, we call it a “token”. But if it helps you, think about it as something like a word, just with a meaning we define however it fits our programming language. Typical tokens are:

- Mathematical operators like `+`, `-`, `*`, `/` and even `(` and `)`.
- Strings in `"quotes"`
- Integers like `123` (note that we treat signs as operators for simplicity)
- Decimal numbers like `123.45`
- Identifiers, i.e. language keywords and variable names like `print`, `for`, `repeat`, `variable12`
- A line break

but really, anything goes that is a common, easily-recognized “word” in your programming language. For example, many languages have several types of strings (e.g. C has single-quoted and double-quoted strings, Perl has multi-line “heredoc” strings). Some languages also have characters following a number to indicate its type (e.g. 120U is 120 as an unsigned integer in C, whereas 120L would be the same number as a signed long integer).

In a language that is used for a lot of internet things, you might even have an IP address as a recognized token (192.168.1.1 or whatever).

Also keep in mind that you don’t need all (or any) of these token types in every programming language. For example, in C, whitespace and newlines are just there to indicate points where a new token starts. But there is no actual token needed for them (because C doesn’t care if you split a single command onto several lines). If this sounds weird, that’s okay, it will make sense later once you’ve made a programming language.

The token list

So let’s make a little function in our programming language that splits up that chunk of text from a source file into tokens and creates an list of tokens. Usually you would define a data structure of some sort with information about each token, and collect the entries for all tokens in whatever list or array data structure your programming language uses.

My token data structure looks like

struct Token
{
	string		text;
	TokenType	type;
	int			startOffset;
	int			endOffset;
}

Where TokenType is just an int that can have different values identifying what type of token we’re dealing with:

enum TokenType : int
{
	NOTHING = 0,
	PLUS_OPERATOR = 1,
	MINUS_OPERATOR = 2,
	STRING = 3,
	INTEGER = 4,
	DECIMAL_NUMBER = 5,
	IDENTIFIER = 6,
	LINE_BREAK = 7
}

No magic here, really. Now let’s just loop over our text and create tokens:

array<Token> tokenList;
Token currentToken;
currentToken.type = NOTHING;

for (int x = 0; x < text.length; x += 1)
{
	char currentCharacter = text[x];
	if (currentCharacter == ' ' or currentCharacter == '\t') // whitespace?
	{
		if (currentToken.type == STRING)
		{
			currentToken.text.append(currentCharacter);
		}
		else if (currentToken.type ≠ NOTHING)
		{
			tokenList.append(currentToken);
			currentToken.type = NOTHING;
			currentToken.text = "";
		}
	}
	else if (currentCharacter == '"')
	{
		if (currentToken.type == STRING)
		{
			currentToken.endOffset = x;
			tokenList.append(currentToken);
			currentToken.type = NOTHING;
			currentToken.text = "";
		}
		else if (currentToken.type ≠ STRING)
		{
			if (currentToken.type ≠ NOTHING)
			{
				tokenList.append(currentToken);
			}
			currentToken.type = STRING;
			currentToken.text = "";
			currentToken.startOffset = currentToken.endOffset = x;
		}
	}
	else if (is_number(currentCharacter))
	{
		if (currentToken.type == NOTHING)
		{
			currentToken.type = INTEGER;
			currentToken.startOffset = currentToken.endOffset = x;
			currentToken.text = "";
			currentToken.text.append(currentCharacter);
		}
		else
		{
			currentToken.endOffset = x;
			currentToken.text.append(currentCharacter);
		}
	}
	else if (currentCharacter == '.')
	{
		if (currentToken.type == NOTHING)
		{
			currentToken.type = OPERATOR;
			currentToken.startOffset = currentToken.endOffset = x;
			currentToken.text = "";
			currentToken.text.append(currentCharacter);
			tokenList.append(currentToken);
			currentToken.type = NOTHING;
			currentToken.text = "";
		}
		else if (currentToken.type == DECIMAL_NUMBER) // A *second* decimal point?
		{
			tokenList.append(currentToken);
			
			// Assume this is a number followed by a dot operator:
			currentToken.type = OPERATOR;
			currentToken.startOffset = currentToken.endOffset = x;
			currentToken.text = "";
			currentToken.text.append(currentCharacter);
			tokenList.append(currentToken);
			
			currentToken.type = NOTHING;
			currentToken.text = "";
			currentToken.text.append(currentCharacter);
		}
		else if (currentToken.type == INTEGER) // Integers can't contain dots, so this must be a decimal number.
		{
			currentToken.type = DECIMAL_NUMBER;
			currentToken.endOffset = x;
			currentToken.text.append(currentCharacter);
		}
		else if (currentToken.type ≠ NOTHING)
		{
			currentToken.endOffset = x;
			currentToken.text.append(currentCharacter);
		}
	}
	else if (is_operator(currentCharacter))
	{
		if (currentToken.type == STRING)
		{
			currentToken.text.append(currentCharacter);
			currentToken.endOffset = x;
		}
		else if (currentToken.type ≠ NOTHING)
		{
			tokenList.append(currentToken);
			currentToken.type = OPERATOR;
			currentToken.startOffset = currentToken.endOffset = x;
			currentToken.text = "";
			currentToken.text.append(currentCharacter);
			tokenList.append(currentToken);
			currentToken.type = NOTHING;
			currentToken.text = "";
		}
	}
	else if (currentCharacter == '\n')
	{
		if (currentToken.type == STRING)
		{
			currentToken.text.append(currentCharacter);
			currentToken.endOffset = x;
		}
		else if (currentToken.type ≠ NOTHING)
		{
			tokenList.append(currentToken);
			currentToken.type = LINE_BREAK;
			currentToken.startOffset = currentToken.endOffset = x;
			currentToken.text = "";
			currentToken.text.append(currentCharacter);
			tokenList.append(currentToken);
			currentToken.type = NOTHING;
			currentToken.text = "";
		}
	}
	else
	{
		if (currentToken.type == INTEGER or currentToken.type == DECIMAL_NUMBER) // Text right after a number? End number, start identifier!
		{
			tokenList.append(currentToken);
			currentToken.type = IDENTIFIER;
			currentToken.startOffset = x;
		}
					
		if (currentToken.type ≠ NOTHING)
		{
			currentToken.endOffset = x;
			currentToken.text.append(currentCharacter);
		}
	}
}

if (currentToken.type ≠ NOTHING)
{
	tokenList.append(currentToken);
}

Wow. That’s a lot of code. But don’t worry, I’ll explain it. Basically, what this code does is that it goes over each character and looks at it, and depending on what characters came before it, decides what token to create, or whether to just collect characters.

State Machine

The code is what’s called a “state machine”, which is a word fancy people use for a loop with lots of booleans that remember things that previously happened. Don’t see any booleans?

That’s because I replaced them with currToken.type. That is our current state. We could just as well have boolean isString;, boolean isNumberOrDecimalNumber;, boolean isDecimalNumber; etc. But since we already have the token’s type and want to keep track of other things, we’re just re-using the type to hold those booleans. After all, it is not as if we could be in two states at once, so if one of these booleans is true, none of the others are.

Okay, that’s not quite true: If you look at our DECIMAL_NUMBER and INTEGER types, you can see that there is one special case: When we see a number, we assume at first that it is an INTEGER. Then, when we encounter a . character, we know that it can’t be an integer. In that case, we just change its type to DECIMAL_NUMBER and keep going.

If we encounter a second decimal point inside a DECIMAL_NUMBER, that makes no sense. If we supported IP addresses as a native type, we’d probably do another switch like we did from INTEGER and switch to IP_ADDRESS. But in this case, we just assume that it must be a decimal number followed by a dot operator.

Note that our parser is a bit weird now: 12.34 is a decimal number, 12.abc is 12.0 followed by the identifier abc, and 12.34.abc is 12.34 followed by the dot operator followed by the identifier abc. For consistency’s sake, one would probably expect 12. to be read as the integer 12 followed by the dot operator.

I’m leaving it like this for now. It’s easy enough to fix though: Instead of directly calling tokenList.append(currentToken), we could simply have a function end_token(currentToken, tokenList) that appends the token to the list. Before it does that, it can check whether a DECIMAL_NUMBER ends in a decimal point, and if so, remove it from the end of the number, change it back into an INTEGER, and then create the dot operator token. Alternately, we could have a special TokenType that is only for parsing, DECIMAL_POINT, that we switch to when we encounter a decimal point in an integer, then only when we encounter a number in this state, we make the token a DECIMAL_NUMBER. We would still need the end_token() function to catch this state and make the integer and dot operator tokens though.

The result of all this

Once this function runs through, we have a list of tokens in tokenList. In our example case, it would look something like:

{ 'print', IDENTIFIER, 0, 4 }, { 'hello world', STRING, 6, 16 }

And now, to check which command it is, we can simply do

if (tokenList[0] == "print")

which is much more readable than having to check the start of the text whether it is “print” and whether there is whitespace after it. After all, the first token could really be “printer”, which is something entirely different.

But we are getting ahead of ourselves here. Reading the tokens is for the next chapter.

Let’s look at some other choice things the tokenizer has to do:

Operators

Operators are a bit special. Anywhere we encounter e.g. a + sign (except inside a string, obviously), it immediately ends the current token, and then inserts a single token for the operator, and then switches back to our “NOTHING” state. Why? Well, because there doesn’t have to be any whitespace between operators and their operands. In maths, it’s totally fine to write 1+2*3 instead of 1 + 2 * 3.

A not dissimilar case is when a non-numeric character follows a number: We end the number and start an identifier. And I guess while we’re at it: Yes, if we encounter a quote character while reading an identifier or number, then yes, we also end the current token and start a string.

Also, the code for line breaks is practically identical to the operator code. In fact, depending on your language, you could just make line breaks an “end-of-line-operator”, just like the semicolon that indicates the end of line in C.

startOffset and endOffset?

You may wonder why we keep around not just the text of the token we read (in text), but also its start and end offset. The latter is for error reporting later during reading. This way, we can tell the programmer using our language where they made an error.

Some programming languages also count the current line number and write that into every token. That is useful for code that can be read just fine, but has an error when run. We can at least give the programmer an indication of which of the five print commands in the file failed.

For programming languages that can run code from multiple files, it also makes sense to remember the file name in each token, so we can later extract it from there and use it in error messages.

Escaping in the String

While I talked about escaping characters in strings at the beginning, you may have noted that this code actually doesn’t do that. It skips the leading and trailing quotes of the string, but will happily end the string at \". This is a fairly simple fix: add a boolean isInEscape; Set it to true when you encounter a \ inside a string (and do not append the backslash to currentToken.text)

	else if (currentCharacter == '\\' and currentToken.type == STRING)
	{
		isInEscape = true;
	}
	else
	{

then add a case before the whitespace if at the very start:

	char currentCharacter = text[x];
	
	if (isInEscape)
	{
		currentToken.text.append(currentCharacter);
	}
	else if (currentCharacter == ' ' or currentCharacter == '\t') // whitespace?

Of course, this example only takes care of “unpacking” escaped quotes. If you want to correctly handle other escape sequences, you’ll have to add a little code that looks at the character in the escape and e.g. turns \t into a tab key, etc.

The end result of this? We have the actual string our program will need. No surrounding quotes, no escapes, no nothing.

Token subtypes

Many programming languages have certain standard tokens that show up again and again. Doing string comparisons each time you scan across a token is a bit error-prone. What if you made a typo in the string you compare against? The compiler doesn’t look into strings and doesn’t know what is a valid string in this case.

One solution to this to add a subtype to your token struct. For example, for identifiers, you could compare the string to whatever commonly used standard tokens you have at the time the token is appended to the tokenList already, and then you can compare the token subtype to a constant that the compiler has to have seen or it will raise an error. For example, the end_token function mentioned earlier could look like

void end_token(Token *currentToken, array<Token> *tokenList)
{
	if (currentToken.type == DECIMAL_NUMBER and currentToken.text.has_suffix("."))
	{
		currentToken.type = INTEGER;
		currentToken.text.delete(currentToken.text.size() - 1, 1);
		currentToken.endOffset -= 1;
		tokenList.append(currentToken);
		
		currentToken.type = OPERATOR;
		currentToken.text = ".";
		currentToken.startOffset = currentToken.endOffset;
		tokenList.append(currentToken);
		
		currentToken.type = NOTHING;
	}
	else if (currentToken.type == IDENTIFIER)
	{
		if (currentToken.text == "return")
		{
			currentToken.subtype = RETURN_IDENTIFIER;
		}
		else if (currentToken.text == "break")
		{
			currentToken.subtype = BREAK_IDENTIFIER;
		}
		else
		{
			currentToken.subtype = USER_DEFINED;
		}
	
		tokenList.append(currentToken);
	}
	else if (currentToken.type ≠ NOTHING)
	{
		tokenList.append(currentToken);
	}
}

This way, your comparisons change from

if (currentToken.type == IDENTIFIER and currentToken.text == "returnn")

into

if (currentToken.type == IDENTIFIER and currentToken.subtype == RETURN_IDENTIFIER)

As a convenience, one could add an is_identifier(currentToken, RETURN_IDENTIFIER) function that checks both the type and the subtype, to save you some typing.

The same approach could be done for operators.