Internals: State Machine Lexer

Jakub T. Jankiewicz
LIPS maintainer

The first version of LIPS Scheme had a regex-based tokenizer. It used a single regular expression to split the input string into tokens. In this article, I will show the internals of the new Lexer in LIPS Scheme.

You can still find the first version of what later became LIPS on CodePen.

When I started working on version 1.0 (you can read that story in the article LIPS Scheme History), the code became more and more complex, and the regular expression became dynamic, mostly because of syntax extensions that needed to update the regular expression and the tokenizer.

You can see this code on GitHub on the old-version branch.

// Build a single global regular expression that matches every kind of token:
// strings, regex literals, parentheses, quotes, comments, numbers, and the
// dynamic syntax extensions stored in specials.
function makeTokenRe() {
    var tokens = Object.keys(specials).map(escapeRegex).join('|');
    return new RegExp(`("(?:\\\\[\\S\\s]|[^"])*"|\\/(?! )[^\\/\\\\]*(?:\\\\[\\S\\s][^\\/\\\\]*)*\\/[gimy]*(?=\\s|\\(|\\)|$)|\\(|\\)|'|"(?:\\\\[\\S\\s]|[^"])+|\\n|(?:\\\\[\\S\\s]|[^"])*"|;.*|(?:[-+]?(?:(?:\\.[0-9]+|[0-9]+\\.[0-9]+)(?:[eE][-+]?[0-9]+)?)|[0-9]+\\.)[0-9]|\\.{2,}|${tokens}|[^(\\s)]+)`, 'gim');
}

At one point I realized that I needed to change my approach to parsing and tokenization, because you could not define a new syntax extension and use it later in the same file: the whole code was tokenized at once.

State Machine Lexer

This limitation of syntax extensions led to the introduction of a new Lexer and a Streaming Parser (if you're interested in this topic, I will write an article about it in the future).

The new Lexer is much simpler and easier to maintain; it only had one recent bug related to its inner workings (#433).

The new Lexer is a class that has rules for the state machine. This is an example sequence of rules for a string:

Lexer.string = Symbol.for('string');
Lexer.string_escape = Symbol.for('string_escape');
...
Lexer._rules = [
    ...
    [/"/, null, null, Lexer.string, null],
    [/"/, null, null, null, Lexer.string],
    [/"/, null, null, Lexer.string_escape, Lexer.string],
    [/\\/, null, null, Lexer.string, Lexer.string_escape],
    [/./, /\\/, null, Lexer.string_escape, Lexer.string],
    ...
]

A single rule consists of a current character, a previous character, and a next character (they can be single-character strings or regular expressions). If an element is null, it matches any character. The last two elements of the array are the starting and the ending state (they are symbols, so they are unique values).

The Lexer starts with the null state and iterates over every rule for every character until it finds a match. If a rule enters a state and a later rule finishes with the null state, it means that the whole rule sequence was matched and a full token is created.

If no rule matches and the state is not null, the character is collected and will be included in the final token.

That's why, in the example above, there is no rule like this:

[/./, null, null, Lexer.string, Lexer.string]

This rule may be added in the future to speed up the Lexer.
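
To make this concrete, here is a minimal sketch of such a rule-driven loop, reduced to the string rules above. It is only an illustration of the mechanism, with made-up names (rules, matches, next_token); it is not the actual LIPS implementation:

// A simplified sketch of the rule-driven matching loop; it only knows about
// strings, and the names are not taken from the actual LIPS source.
const string = Symbol.for('string');
const string_escape = Symbol.for('string_escape');

// rule format: [current char, previous char, next char, starting state, ending state]
const rules = [
    [/"/, null, null, string, null],            // closing quote: token is complete
    [/"/, null, null, null, string],            // opening quote: enter the string state
    [/"/, null, null, string_escape, string],   // escaped quote: back to the string state
    [/\\/, null, null, string, string_escape],  // backslash: enter the escape state
    [/./, /\\/, null, string_escape, string]    // any escaped character: back to string
];

// true when a pattern (null, string or regex) matches a single character
function matches(pattern, char) {
    if (pattern === null) return true;
    if (char === null) return false;
    return pattern instanceof RegExp ? pattern.test(char) : pattern === char;
}

// scan the input starting at `start` and return the first full token (or null)
function next_token(input, start = 0) {
    let state = null;
    let token = '';
    for (let i = start; i < input.length; i++) {
        const char = input[i];
        const prev = i > 0 ? input[i - 1] : null;
        const next = i + 1 < input.length ? input[i + 1] : null;
        token += char;
        const rule = rules.find(([c, p, n, from]) =>
            from === state && matches(c, char) && matches(p, prev) && matches(n, next));
        if (rule) {
            state = rule[4];
            if (state === null) {
                return token;          // the rule sequence finished: full token created
            }
        } else if (state === null) {
            return null;               // nothing matched outside of any state
        }
        // no rule matched but we are inside a state: the character stays in the
        // token and scanning continues
    }
    return null;                       // input ended before the token was complete
}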

Example

When we have a string like this:

"foo\"bar"

The first character is a quote, so it matches the second rule and the Lexer enters the Lexer.string state (the first rule doesn't match because the initial state is null). The characters foo are collected because no rule matches them. When the Lexer finds the backslash \ it changes state from Lexer.string to Lexer.string_escape, and on the next character it enters Lexer.string again. Then it consumes the sequence of characters bar, and the last quote matches the first rule. And that's how we get the full token.
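
With the simplified next_token sketch from the previous section (again, not the real LIPS Lexer), the same trace can be reproduced directly:

// the JavaScript literal below contains the ten characters: "foo\"bar"
next_token('"foo\\"bar"');
// -> '"foo\\"bar"' -- the whole string literal, escaped quote included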

Syntax Extensions and Constants

The static rules are located in Lexer._rules, but Lexer.rules is a getter that creates the final rules dynamically by adding all the tokens registered as syntax extensions (they are called specials in the code). This is also where other constants that start with a hash are added, like #t, #f, or #void. They are added together with the syntax extensions to handle the order of rule matching.

As an optimization, the value of the dynamic rules is cached, and the cache is invalidated when a new syntax extension is added.
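
As a rough illustration of that idea (the cache variable, the exact shape of specials, and the way Lexer.literal_rule is called below are assumptions, not the actual LIPS code), such a cached getter could look like this:

// Illustration only: a cached dynamic `rules` getter built from the static
// Lexer._rules, the user-registered specials, and the hash constants,
// expanded into rules with Lexer.literal_rule (described below).
let rules_cache = null;

Object.defineProperty(Lexer, 'rules', {
    configurable: true,
    get() {
        if (rules_cache === null) {
            const tokens = [...Object.keys(specials), '#t', '#f', '#void'];
            const dynamic = tokens.flatMap(token => Lexer.literal_rule(token));
            rules_cache = dynamic.concat(Lexer._rules);
        }
        return rules_cache;
    }
});

// whatever registers a new syntax extension would also reset the cache:
// rules_cache = null;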

A syntax extension creates a lexer rule using Lexer.literal_rule, which creates an array of rules matching the literal characters of the token passed as its first argument.
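
A minimal version of such a helper could look like the sketch below; the per-position state symbols and the function body are only an approximation of the idea, not the real Lexer.literal_rule:

// Sketch: expand a literal token such as "#t" into one state-machine rule per
// character, with a unique state symbol for every position inside the token.
function literal_rule(token) {
    const chars = Array.from(token);
    const states = chars.map((_, i) => Symbol(`${token}:${i}`));
    return chars.map((char, i) => [
        char,                                     // current character must equal this one
        null,                                     // any previous character
        null,                                     // any next character
        i === 0 ? null : states[i - 1],           // state required before this character
        i === chars.length - 1 ? null : states[i] // null end state = token complete
    ]);
}

// literal_rule('#t') produces two rules:
// ['#', null, null, null, Symbol(#t:0)] and ['t', null, null, Symbol(#t:0), null]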

The Lexer is important not only for reading input LIPS Scheme code; it's also used when reading from I/O ports.

Conclusion

And that's it, this is the whole Lexer. As you can see from the above, it's very simple and easy to maintain. If you want to see how it works for yourself, you can jump into the source code and search for "class Lexer", "Lexer._rules", and "Object.defineProperty(Lexer, 'rules'".

The source code is in one file, so to navigate it you need to use search. I made an attempt to split the code into modules, but failed because of Rollup errors about circular dependencies.

This was the first part of a series of articles about LIPS Scheme internals.