Type-safe Way to Build a Parser

This post is for testing my hugely updated blog theme.

¶Prerequisites

You’re required to know TypeScript types, such as some built-in types like Extract and Omit.

And, you may need to know how JavaScript parser works, but I’ll try to explain it.

¶Background

Suppose we’re building a JavaScript parser, and we only focus on the parser which accepts a stream of tokens then produces AST.

For a parser written in JavaScript or TypeScript, there’s a common first step which is creating an object which only contains locations or other basic information, since we can’t know the detail of that node before consuming tokens. Once we finished consuming enough tokens and gathered detail, we can set properties on that object.

¶The Way That Babel Does

Here, we take some code from Babel as example. Babel is written in Flow but they’re migrating to TypeScript progressively at the time I wrote this post.

Let’s see how Babel parses IfStatement. I’ll show you the code without implementation detail:

parseIfStatement(node: N.IfStatement): N.IfStatement {
  this.next();
  node.test = // call another method to parse something else
  node.consequent = // call another method to parse something else
  node.alternate = // call another method to parse something else
  return this.finishNode(node, "IfStatement");
}

For the IfStatement, there’re three important properties: test, consequent and alternate. When parsing IfStatement, Babel calls other parser methods, then once those methods return ASTs, it mount those ASTs to corresponding properties.

This looks okay, right? Yes, it actually works. However, this isn’t type-safe.

What if we forget to mount AST to a property? Please note line 3.

parseIfStatement(node: N.IfStatement): N.IfStatement {
  this.next();
  // just call another method to parse something else, without setting it to current node
  node.consequent = // call another method to parse something else
  node.alternate = // call another method to parse something else
  return this.finishNode(node, "IfStatement");
}

If we run the code above, oh, the AST will be broken. We got an IfStatement with the test property whose value is undefined. This introduces a bug, but it hasn’t been caught at compile-time.

The reason why that error won’t be caught at compile-time is that the type of parameter node is already IfStatement, and when compiler checks this code and sees the type of that parameter is IfStatement, it thinks: okay, it’s already an IfStatement, so all the properties of that type have been properly set and I don’t need to care whether there’ll be assignments to those properties later or not.

¶To be Type-safe

To address the problem above, we shouldn’t assign values to properties on an object directly, because this makes harder to ensure type-safety.

We can move those assignments to the finishNode method. After changed, finishNode may look like this:

this.finishNode(node, 'IfStatement', { test, consequent, alternate })

TypeScript can check the call above and make sure all required properties are properly given.

But, how can we tell TypeScript we’re parsing IfStatement or other type of nodes? Use generics. But before using generics, we must declare a discriminated union type.

For most parsers, all kinds of nodes have a shared property called type which indicates the type of a specific kind of AST node, so we can declare it like this:

type Node = IfStatement | WhileStatement | /* etc */

And, each kind of AST node must have the type property whose type is a string literal:

type IfStatement = {
  type: 'IfStatement'
  /* other properties */
}
// or
interface IfStatement {
  type: 'IfStatement'
  /* other properties */
}

This is necessary; otherwise, TypeScript won’t be able to distinguish them.

Now the signature of method finishNode can be:

finishNode<N extends Node['type']>(node: /* type? */, type: N, properties: /* type? */)

So far, the second parameter has become type-safe. We can’t give an arbitrary string like 'YuruCamp' because there’s no AST node whose type is 'YuruCamp'.

How about the first parameter and the third parameter?

To solve it, we must find a solution that we can retrieve a specific AST node by type name. For example, if we have a string 'Identifier', we should get the Identifier AST node.

Thanks to discriminated union and the Node type we declared above, we can make a good use of built-in Extract type:

type GetNodeByType<N extends Node['type']> = Extract<Node, { type: N }>

Extract returns the type which is a subtype of the Node type. If we give 'Identifier', it will be instantiated like this:

Extract<IfStatement | Identifier | /* etc */, { type: 'Identifier' }>

There’s only one type which is a subtype of { type: 'Identifier' }, and that is the Identifier.

Now the signature of method finishNode can be:

finishNode<N extends Node['type']>(node: GetNodeByType<N>, type: N, properties: /* type? */)

The rest problem, completing the type of parameter properties, would be easy if you know another TypeScript built-in type: Omit. We omit some information like locations and range:

finishNode<N extends Node['type']>(node: GetNodeByType<N>, type: N, properties: Omit<GetNodeByType<N>, 'loc' | 'start' | 'end' | 'range'>)

Here’s the final code of calling finishNode:

parseIfStatement(node: N.NodeBase): N.IfStatement {
  this.next();
  const test = // call another method to parse something else
  const consequent = // call another method to parse something else
  const alternate = // call another method to parse something else
  return this.finishNode(node, "IfStatement", { test, consequent, alternate });
}

You may find that we introduced a new type NodeBase. There’s nothing magical: it only contains shared properties in all kinds of AST nodes, such as locations and range.

Therefore, the signature of method finishNode can be simplified:

finishNode<N extends Node['type']>(node: GetNodeByType<N>, type: N, properties: Omit<GetNodeByType<N>, keyof NodeBase>)

Thanks for reading!