In-depth Babel Principles Series (3) Tokenizer

The last blog covered the workflow of babel-parser. It has two main components: the Tokenizer, which splits the code string into an array of Tokens, and the parser, which converts that Token array into an AST.

This time, let’s take a closer look at the logic of Tokenizer.

There are only four files in the Tokenizer source: context.js, index.js, state.js, and type.js. Let's analyze them one by one.

Type

First up is type.js. As the name implies, this file defines the types of all Tokens.

Let's look at the definition of TokenType first. Note that the official comments matter: they help us understand why each of these fields exists:

// The `beforeExpr` property is used to disambiguate between 1) binary
// expression (<) and JSX Tag start (<name>); 2) object literal and JSX
// texts. It is set on the `updateContext` function in the JSX plugin.

// The `startsExpr` property is used to determine whether an expression
// may be the "argument" subexpression of a `yield` expression or
// `yield` statement. It is set on all token types that may be at the
// start of a subexpression.

// `isLoop` marks a keyword as starting a loop, which is important
// to know when parsing a label, in order to allow or disallow
// continue jumps to that label.

const beforeExpr = true;
const startsExpr = true;
const isLoop = true;
const isAssign = true; // whether the Token can signify an assignment, e.g. =
const prefix = true; // whether the Token can be a prefix of a unary expression, e.g. !
const postfix = true; // whether the Token can be a postfix of a unary expression, e.g. ++

type TokenOptions = {
  keyword?: string,
  beforeExpr?: boolean,
  startsExpr?: boolean,
  rightAssociative?: boolean,
  isLoop?: boolean,
  isAssign?: boolean,
  prefix?: boolean,
  postfix?: boolean,
  binop?: ?number,
};

export class TokenType {
  label: string;
  keyword: ?string;
  beforeExpr: boolean;
  startsExpr: boolean;
  rightAssociative: boolean;
  isLoop: boolean;
  isAssign: boolean;
  prefix: boolean;
  postfix: boolean;
  binop: ?number;
  updateContext: ?(context: Array<TokContext>) => void;

  constructor(label: string, conf: TokenOptions = {}) {
    this.label = label;
    this.keyword = conf.keyword;
    this.beforeExpr = !!conf.beforeExpr;
    this.startsExpr = !!conf.startsExpr;
    this.rightAssociative = !!conf.rightAssociative;
    this.isLoop = !!conf.isLoop;
    this.isAssign = !!conf.isAssign;
    this.prefix = !!conf.prefix;
    this.postfix = !!conf.postfix;
    this.binop = conf.binop != null ? conf.binop : null;
    this.updateContext = null;
  }
}

This structure is not very complicated, but the intent behind each of these parameters may not be obvious at first glance. That is fine; let's analyze a few specific examples.

Before the concrete examples, let's look at two utility functions:

export const keywords = new Map<string, TokenType>();

function createKeyword(name: string, options: TokenOptions = {}): TokenType {
  options.keyword = name;
  const token = new TokenType(name, options);
  keywords.set(name, token);
  return token;
}

function createBinop(name: string, binop: number) {
  return new TokenType(name, { beforeExpr, binop });
}

createKeyword creates a keyword TokenType. Combined with the code above, we can see that the special feature of a keyword TokenType is that its keyword field has a value, and that value is the same as label. Each keyword is also registered in the keywords map.

The role of createBinop is to create a binary-operator TokenType, which is characterized by beforeExpr being true and binop carrying the operator's precedence.
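For instance, the pipeline token we will meet below, pipeline: createBinop("|>", 0), expands by hand to:

// equivalent to createBinop("|>", 0); binop is the operator's precedence
pipeline: new TokenType("|>", { beforeExpr, binop: 0 }),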

Well, after reading this, let’s take a look at a few concrete examples

First are the basic literal types, e.g.:

num: new TokenType("num", { startsExpr }),
bigint: new TokenType("bigint", { startsExpr }),

Then there are the punctuation types:

bracketL: new TokenType("[", { beforeExpr, startsExpr }),
bracketR: new TokenType("]"),
question: new TokenType("?", { beforeExpr }),

Operator types:

eq: new TokenType("=", { beforeExpr, isAssign }),
incDec: new TokenType("++/--", { prefix, postfix, startsExpr }),
pipeline: createBinop("|>", 0),
nullishCoalescing: createBinop("??", 1),

Keyword types:

_break: createKeyword("break"),
_case: createKeyword("case", { beforeExpr }),
_default: createKeyword("default", { beforeExpr }),
_do: createKeyword("do", { isLoop, beforeExpr }),
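Why does the keywords map matter? When the tokenizer finishes reading a word, it must decide whether that word is a keyword or an ordinary identifier. A minimal sketch of that lookup (my simplification; the real logic lives in readWord in index.js and must also respect containsEsc, which we will meet below):

// Hedged sketch: map a freshly read word to a TokenType.
function wordToTokenType(word) {
  // `keywords` was populated by every createKeyword() call above
  const keywordType = keywords.get(word);
  return keywordType !== undefined ? keywordType : tt.name;
}

wordToTokenType("break"); // => the "break" keyword TokenType
wordToTokenType("foo");   // => tt.name, an ordinary identifier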

Context

In the Type section we skipped over updateContext. That function is tied to context.js and is also set by the JSX plugin. Let's look at them together.

At the top of the file, a comment explains what the context does:

// The token context is used to track whether the apostrophe "`"
// starts or ends a string template

So it is used when parsing template strings, and its type definition is even simpler:

export class TokContext {
  constructor(token: string, preserveSpace?: boolean) {
    this.token = token;
    this.preserveSpace = !!preserveSpace;
  }

  token: string;
  preserveSpace: boolean;
}

// Template strings can be nested, so during analysis there is a context
// stack used to check that every template string is closed. Only two kinds
// of data live on this stack: braces and backquotes.
export const types: {
  [key: string]: TokContext,
} = {
  brace: new TokContext("{"),
  template: new TokContext("`", true),
};

Then all that is left is to add updateContext methods to the token types related to template strings, so that the context stack is maintained during parsing:

/*
backQuote: new TokenType("`", { startsExpr }),
braceL: new TokenType("{", { beforeExpr, startsExpr }),
braceHashL: new TokenType("#{", { beforeExpr, startsExpr }),
dollarBraceL: new TokenType("${", { beforeExpr, startsExpr }),
braceR: new TokenType("}", { beforeExpr }),
*/

tt.braceR.updateContext = context => {
  context.pop();
};

// we don't need to update context for tt.braceBarL because we do not pop context for tt.braceBarR
// ideally only dollarBraceL "${" needs a non-template context
// in order to indicate that the last "`" in `${`" starts a new string template
// inside a template element within outer string template.
// but when we popped such context in `}`, we lost track of whether this
// `}` matches a `${` or other tokens matching `}`, so we have to push
// such context in every token that `}` will match.
tt.braceL.updateContext =
  tt.braceHashL.updateContext =
  tt.dollarBraceL.updateContext =
    context => {
      context.push(types.brace);
    };

tt.backQuote.updateContext = context => {
  // When a backquote is parsed, check whether the current top of the stack
  // is the template context. If it is, the opening backquote already went
  // through the else branch and is on the stack, so this backquote closes
  // the template: pop it. Otherwise this backquote opens a template: push it.
  if (context[context.length - 1] === types.template) {
    context.pop();
  } else {
    context.push(types.template);
  }
};
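To make the push/pop rules concrete, here is a hand trace (mine, not from the source) of the context stack while tokenizing a nested template string. As we will see in state.js, the stack starts as [brace]:

// Input: `a${`b`}c`
//
// token   updateContext action                  stack afterwards
// `       top is not template, push template    [brace, template]
// ${      push brace                            [brace, template, brace]
// `       top is brace, push template           [brace, template, brace, template]
// `       top is template, pop                  [brace, template, brace]
// }       pop                                   [brace, template]
// `       top is template, pop                  [brace]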

State

init

This file defines the state of the parsing process: where in the input we are, which line, which column, and so on.

Everything in this file lives in the State class declaration, so all the code that follows belongs to this class.

First is the initialization function. This class does not have a constructor; instead there is an init function, which the caller must invoke explicitly.

strict: boolean;
curLine: number;

// And, if locations are used, the {line, column} object
// corresponding to those offsets
startLoc: Position;
endLoc: Position;

// The current position of the tokenizer in the input.
pos: number = 0;
lineStart: number = 0;

init(options: Options): void {
  this.strict =
    options.strictMode === false ? false : options.sourceType === "module";

  this.curLine = options.startLine;
  this.startLoc = this.endLoc = this.curPosition();
}

As you can see, this initialization function does a few simple things. First, it determines and stores whether we are in strict mode: module source implies strict unless strictMode is explicitly false.

Next, it initializes curLine to the incoming start line; this will change as parsing proceeds.

Last, it initializes startLoc and endLoc. These two values are not changed anywhere within the State class itself; if I find where they are updated as I keep reading, I will come back and update this.

And this curPosition function is also very simple

curPosition(): Position {
  return new Position(this.curLine, this.pos - this.lineStart);
}

The first parameter of Position is the line number, the second is the column number.

curLine was just initialized to startLine, while pos and lineStart have not been assigned yet, so both are 0.
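A quick hand-worked example (assuming the default startLine of 1): suppose the input is "ab\ncd" and the tokenizer has just advanced onto the "d".

// input:  a  b  \n  c  d
// pos:    0  1  2   3  4
//
// After consuming the newline the tokenizer sets curLine = 2 and
// lineStart = 3 (the offset just past the "\n"). So with pos === 4:
new Position(this.curLine, this.pos - this.lineStart); // => Position(2, 1)
// i.e. line 2, column 1 (columns are 0-based).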

clone

In addition to the above two functions, there is a clone function used to copy the state (copying arrays element by element unless skipArrays is set). There is no special logic; it is included here for anyone interested.

clone(skipArrays?: boolean): State {
  const state = new State();
  const keys = Object.keys(this);
  for (let i = 0, length = keys.length; i < length; i++) {
    const key = keys[i];
    // $FlowIgnore
    let val = this[key];

    if (!skipArrays && Array.isArray(val)) {
      val = val.slice();
    }

    // $FlowIgnore
    state[key] = val;
  }

  return state;
}
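The main consumer of clone is lookahead, which snapshots the state, reads one token ahead, and then restores the snapshot. A simplified sketch of how that looks (the real lookahead in index.js does a little more bookkeeping):

lookahead(): State {
  const old = this.state;
  // skipArrays is fine here: the lookahead result does not need its own
  // copies of the token/comment arrays
  this.state = old.clone(true);

  this.isLookahead = true;
  this.nextToken();
  this.isLookahead = false;

  const curr = this.state;
  this.state = old; // restore the snapshot; nothing was "really" consumed
  return curr;
}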

Other properties

State has a large number of other properties, all used to save parsing state and to help generate the AST. Many of them do not look useful on their own, but with their comments the purpose of each is clear at a glance:

// The current position of the tokenizer in the input.
pos: number = 0;
lineStart: number = 0;

// Properties of the current token.
// Its type
type: TokenType = tt.eof;

// For tokens that include more information than their type, the value
value: any = null;

// Its start and end offset
start: number = 0;
end: number = 0;

// Position information for the previous token
// $FlowIgnore this is initialized when generating the second token.
lastTokEndLoc: Position = null;
// $FlowIgnore this is initialized when generating the second token.
lastTokStartLoc: Position = null;
lastTokStart: number = 0;
lastTokEnd: number = 0;

// The context stack is used to track whether the apostrophe "`" starts
// or ends a string template
context: Array<TokContext> = [ct.brace];
// Used to track whether a JSX element is allowed to form
exprAllowed: boolean = true;

// Used to signal to callers of `readWord1` whether the word
// contained any escape sequences. This is needed because words with
// escape sequences must not be interpreted as keywords.
containsEsc: boolean = false;

// This property is used to track the following errors
// - StrictNumericEscape
// - StrictOctalLiteral
//
// in a literal that occurs prior to/immediately after a "use strict" directive.

// todo(JLHwung): set strictErrors to null and avoid recording string errors
// after a non-directive is parsed
strictErrors: Map<number, ErrorTemplate> = new Map();

// Tokens length in token store
tokensLength: number = 0;

index

Finally, index.js holds the main logic of the Tokenizer, close to 1600 lines. Let's work through it slowly.

Although there is a lot of code, the core idea is not complicated. First is the constructor:

isLookahead: boolean;

// Token store.
tokens: Array<Token | N.Comment> = [];

constructor(options: Options, input: string) {
  super();
  this.state = new State();
  this.state.init(options);
  this.input = input;
  this.length = input.length;
  this.isLookahead = false;
}

Here is where the State is initialized and the init function is called.
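As an aside, you can observe the resulting Token store from the public API: with the tokens option turned on, the tokens collected by pushToken (shown below) are attached to the parse result.

const { parse } = require("@babel/parser");

const file = parse("a + 1", { tokens: true });
// file.tokens is the array filled in by pushToken, roughly:
//   name "a", +/- "+", num 1, eof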

Then come several entry functions for parsing, i.e. the ones called from outside, such as next:

pushToken(token: Token | N.Comment) {
  // Pop out invalid tokens trapped by try-catch parsing.
  // Those parsing branches are mainly created by typescript and flow plugins.
  this.tokens.length = this.state.tokensLength;
  this.tokens.push(token);
  ++this.state.tokensLength;
}

// Move to the next token

next(): void {
  this.checkKeywordEscapes();
  if (this.options.tokens) {
    this.pushToken(new Token(this.state));
  }

  this.state.lastTokEnd = this.state.end;
  this.state.lastTokStart = this.state.start;
  this.state.lastTokEndLoc = this.state.endLoc;
  this.state.lastTokStartLoc = this.state.startLoc;
  this.nextToken();
}

// Consume the current token and advance if it matches the given type;
// report whether it matched

eat(type: TokenType): boolean {
  if (this.match(type)) {
    this.next();
    return true;
  } else {
    return false;
  }
}

// Whether the current token is of the given type

match(type: TokenType): boolean {
  return this.state.type === type;
}
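To see how the parser layer drives these primitives, here is a hypothetical sketch (not actual babel-parser code) of a method using match, eat, and next to parse a parenthesized expression:

// Hypothetical method; tt.parenL / tt.parenR are the "(" and ")" types.
parseParenExpression() {
  this.next(); // the caller matched tt.parenL; consume it
  const expr = this.parseExpression();
  if (!this.eat(tt.parenR)) {
    // eat() returned false: the current token is not ")"
    this.raise(this.state.start, "Expected a closing parenthesis");
  }
  return expr;
}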

We covered nextToken in the last blog. It does two main things: one is handling template strings, the other is getTokenFromCode. The heart of it is getTokenFromCode, which switches on the current character code and dispatches to different read methods, e.g.:

// Anything else beginning with a digit is an integer, octal
// number, or float. (fall through)
case charCodes.digit1:
case charCodes.digit2:
case charCodes.digit3:
case charCodes.digit4:
case charCodes.digit5:
case charCodes.digit6:
case charCodes.digit7:
case charCodes.digit8:
case charCodes.digit9:
  this.readNumber(false);
  return;

// Quotes produce strings.
case charCodes.quotationMark:
case charCodes.apostrophe:
  this.readString(code);
  return;

There are tons of other functions like readRegexp (reads a regular expression), readEscapedChar, etc. Most of the code is of this kind: each one is internally a small state machine, like the template-string reader I talked about in my last blog.
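As an illustration of that small-state-machine shape, here is a simplified version of readString (trimmed by me; the real method also guards against line terminators inside strings and has richer error reporting):

readString(quote: number): void {
  let out = "";
  let chunkStart = ++this.state.pos; // step over the opening quote
  for (;;) {
    if (this.state.pos >= this.length) {
      // ran off the end of the input: the string was never closed
      throw this.raise(this.state.start, "Unterminated string constant");
    }
    const ch = this.input.charCodeAt(this.state.pos);
    if (ch === quote) break; // matching quote: the literal is complete
    if (ch === charCodes.backslash) {
      // flush the plain chunk, then let readEscapedChar run its own
      // small state machine over \n, \u{...}, \x.. and friends
      out += this.input.slice(chunkStart, this.state.pos);
      out += this.readEscapedChar(false);
      chunkStart = this.state.pos;
    } else {
      ++this.state.pos;
    }
  }
  out += this.input.slice(chunkStart, this.state.pos++); // skip closing quote
  this.finishToken(tt.string, out);
}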