Acorn internals, main concepts

Acorn is a JavaScript-based JavaScript parser. It is easy to understand and to extend due to its small size. There is a relatively small number of concepts you need to get familiar with it to augment its parsing operations and to enrich its subject language. I’ll get into most of them here.

Parser

To parse a program, you might use Acorn.parse and give it the input code string:

import Acorn from 'acorn'
const parsed = Acorn.parse('var i = 1');

The result for this example is:

Node {
  type: 'Program',
  start: 0,
  end: 9,
  body:[{
    type: 'VariableDeclaration',
    start: 0,
    end: 9,
    declarations: [{
      type: 'VariableDeclarator',
      start: 4,
      end: 9,
      id: {type: 'Identifier', start: 4, end: 5, name: 'i'},
      init: {type: 'Literal', start: 8, end: 9, value: 1, raw: '1'}
    }],
    kind: 'var'
  }],
  sourceType: 'script'
}

Acorn puts all operations in one class, Parser. It is a parser and a tokenizer. It is defined in state.js then enriched by other modules. Acorn.parse is a static method that creates a Parser instance then uses it to parse the input.

Parser constructor takes a set of options, a source code to parse or the name of a Javascript file, and the starting position. It also accepts an AST to which it adds the result of parsing the given source.

Options specify the valid input. They decide whether to accept import statements in the middle of the code, whether to accept await and return statements at the top-level, whether to use strict mode rules or not and a couple of other similar settings.

Options also define callbacks that are called by Acorn during different stages of parsing. This allows you to define a function that is called after reading a token or after parsing a node or a comment.

Parsing the top level

That top-level node is either created by parseTopLevel itself or it is taken from the AST given by the options. In both situations, the result of the parsing is an AST whose root is a node with the type 'Program'. Parser has a method parse that uses parseTopLevel to loop over and parse top-level statements in the input, then to collect their nodes under the top-level node.

Each node has a type, a position index, and type-specific attributes (like body for a block statement, condition consequent and alternate for an if statement, declarations in the variable declaration, …). Acorn models node types as string. This makes sense as each type is used once and is highly correlated to the method using it.

Acorn defines a parsing method for each type of node. parseForStatement parses for loops and returns a ForStatement node, parseIfStatement parses a conditional if statement, parseReturnStatement a return statement, and so on.

Each parsing method takes an empty node, a node that contains only the first character position. It sets its type and its type-specific attributes and returns it back.

Tokens

Parser keeps track of the current token being parsed and the previously parsed token. parseStatement relies on those attributes to create the next node in the tree.

It stores the current line in curLine, the current line beginning in lineStart, the token type in type, and the token value value. Then, it uses pos for the current position in the input, start/end for the token boundaries in the source, and lastTokEnd for the last token end position.

Parser navigates the code using next and nextToken. next stores the last token and calls nextToken, which tries to find the next token. nextToken skips over insignificant white space and sets start to the current position. Then, depending on the first token character, it advances until the end of a recognized token. It increments this.pos by 1 if the code is one byte, and by 2 if it takes 2 bytes (when the code point is above 0xfff).

Acorn decides whether the current token is a word or not by checking its first character using isIdentifierStart. Different methods are used to proceed in each condition. For a word, readWord is used. It reads the token and tries to match it against a keyword or a reserved word and a type. When no predefined word and type is found, it sets the token type to name. This is the type of variables and classes identifiers. For example, if the word is while, readWord returns _while token type, which evaluates to ‘while’. Token types are defined in tokentype.js. Acorn creates a RegExp that checks a token against version-specific ECMAScript keywords.

If the current character cannot be the beginning of a word, getTokenFromCode handles it. If the token is a punctuation mark, it creates a token with finishToken. Otherwise, it delegates to other helpers that read long tokens. It delegates to readString when it encounters ' or a ', readNumber when it finds a digit, and so on.

Ther characters and the functions used for each are:

// The interpretation of a dot depends on whether it is followed
// by a digit or another two dots.
'.'->  readToken_dot

'/' -> readToken_slash
'%*' -> readToken_mult_modulo_exp
'|&' -> readToken_pipe_amp
'^' -> readToken_caret
'+-' -> readToken_plus_min
'<>' -> readToken_lt_gt
'=!' -> readToken_eq_excl
'?' -> readToken_question
'~' -> finishOp.

In addition to a type, a token also has a context. Acorn keeps track of the current context and its parents in a stack.

Token contexts are defined by their first token. TokenContext constructor is defined as follows:

class TokContext {
  constructor(token, isExpr, preserveSpace, override, generator)

Contexts are defined in tokencontext.js as follows:

b_stat    : new TokContext('{', false)
b_expr    : new TokContext('{', true)
b_tmpl    : new TokContext('${', false)
p_stat    : new TokContext('(', false)
p_expr    : new TokContext('(', true)
q_tmpl    : new TokContext('`', true, true, p => p.tryReadTemplateToken())
f_stat    : new TokContext('function', false)
f_expr    : new TokContext('function', true)
f_expr_gen: new TokContext('function', true, false, null, true)
f_gen     : new TokContext('function', false, false, null, true)

When parsing function compute(a) { return (a - 1) * 2; }, the context at different times stack will be:

// when the tokenizer is reading the function argument
[
{ token: '{', isExpr: false, preserveSpace: false, generator: false },
{ token: 'function', isExpr: false, preserveSpace: false, generator: false },
{ token: '(', isExpr: true, preserveSpace: false, generator: false }
]

// when the tokenizer is reading the 'a - 1' part of the return expression
[
{ token: '{', isExpr: false, preserveSpace: false, generator: false },
{ token: 'function', isExpr: false, preserveSpace: false, generator: false },
{ token: '{', isExpr: false, preserveSpace: false, generator: false },
{ token: '(', isExpr: true, preserveSpace: false, generator: false }
]

The top-level context (and the first added item to the stack) is always a block context. It is defined in Parser constructor using initalContext.

Parser.parseStatement

parseStatement creates a node from the current token, the one created by next. It is called by successively parseTopLevel as long as no token with the type tt.eof is found. To zoom out, parse first calls next then parseTopLevel. parseTopLevel calls parseStatement for each top-level statement and adds the result to the top-level node body. parseStatement uses a helper method that calls next in the end so that the next call to parseStament gets the token that follows the last node as a current token.

Each parsing method is specific to a type of statement. The only pattern shared by most of them is reading a semicolon.

Acorn uses this.eat(tt.semi) (), this.insertSemicolon() (), and this.semicolon ().

eat is a:

// Predicate that tests whether the next token is of the given
// type, and if yes, consumes it as a side effect. (by calling this.next())

Here is an example from parseForStatement:

  node.init = init
  this.expect(tt.semi)
  node.test = this.type === tt.semi ? null : this.parseExpression()
  this.expect(tt.semi)
  node.update = this.type === tt.parenR ? null : this.parseExpression()
  this.expect(tt.parenR)
  node.body = this.parseStatement('for')

insertSemicolon:

// Consume a semicolon, or, failing that, see if we are allowed to
// pretend that there is a semicolon at this position.

It is defined as follows

pp.semicolon = function() {
  if (!this.eat(tt.semi) && !this.insertSemicolon()) this.unexpected()
}

semicolon meanwhile checks whether we are allowed to insert a semicolon after the current token. It is true if we are at the end of the file, in a }, or if lineBreak.test(this.input.slice(this.lastTokEnd, this.start)). The last call checks that the last token ends at the end of a line by checking whether a line break exists between the last token ending position and the start of the current token.

Scope

Parser keeps track of the current scope and its parents in scopeStack. In the constructor, Parser initializes scopeStack and calls this.enterScope(SCOPE_TOP). Then scope.js module defines getters that check the scope of the current token.

Acorn models scopes using bitsets and uses logical-end to compare them. Scopes are defined in src/scopeflags. Each scpe is a binary with a shifted 1:

    SCOPE_TOP = 1,
    SCOPE_FUNCTION = 2,
    SCOPE_VAR = SCOPE_TOP | SCOPE_FUNCTION,
    SCOPE_ASYNC = 4,
    SCOPE_GENERATOR = 8,
    SCOPE_ARROW = 16,
    SCOPE_SIMPLE_CATCH = 32,
    SCOPE_SUPER = 64,
    SCOPE_DIRECT_SUPER = 128

export function functionFlags(async, generator) {
  return SCOPE_FUNCTION | (async ? SCOPE_ASYNC : 0) | (generator ? SCOPE_GENERATOR : 0)
}

A scope with value 0 is neither of those. 0 is used when the current statement introduces a new lexical scope. It is used in parseForStatement, parseSwitchStatement, to parse a non-simple catch block in parseTryStatement, and in parseBlock when createNewLexicalScope is true. These helpers use enterScope to create a new scope.

Acorn usually calls this.enterScope before parsing the body of the statement, and this.exitScope after parsing the body.

The following pattern is used in multiple parsing helpers:

this.enterScope(/* flags */);
node.body = this.parseBlock()
this.exitScope()

enterScope and exitScope are defined as follows:

pp.enterScope = function(flags) {
  this.scopeStack.push(new Scope(flags))
}

pp.exitScope = function() {
  this.scopeStack.pop()
}

A ScopeStack frame, an instance of Scope, contains a list of variables, a list of lexically-declared names, and a list of FunctionDeclaration names. As the parsing goes on, declareName is used to add names (like variable declarations) to the current scope.