In-depth Babel Principle Series (II) Parser Code Structure Introduction

In the previous article, when analyzing Babel's compilation process, we mentioned that Babel converts JS code into an AST (Abstract Syntax Tree). This step is a generic one: whatever the programming language, source code gets parsed into an AST, so the AST is not specific to Babel, let alone to JS.

Why do we need to do this? The original JS file is just text that the computer cannot understand, so it is difficult to modify the JS code directly. Once it has been converted to an AST, which is essentially a set of objects that represent the structure of the program, we can modify the code indirectly by modifying those objects. Other tools build on the AST too; an engine, for example, can generate bytecode from it.
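To make the idea concrete, here is a minimal sketch of an AST for the expression 1 + 2, written by hand as plain objects. The node names follow Babel's conventions (BinaryExpression, NumericLiteral), but the shapes are simplified, and the generate helper is a hypothetical illustration, not Babel's generator.

```javascript
// A hand-written AST for the source text "1 + 2" (simplified node shapes).
const ast = {
  type: "BinaryExpression",
  operator: "+",
  left: { type: "NumericLiteral", value: 1 },
  right: { type: "NumericLiteral", value: 2 },
};

// Hypothetical helper: turn the tiny AST back into source text.
function generate(node) {
  switch (node.type) {
    case "NumericLiteral":
      return String(node.value);
    case "BinaryExpression":
      return `${generate(node.left)} ${node.operator} ${generate(node.right)}`;
    default:
      throw new Error(`Unknown node type: ${node.type}`);
  }
}

const before = generate(ast); // "1 + 2"

// Modifying the objects indirectly modifies the code.
ast.operator = "*";
ast.right.value = 10;

const after = generate(ast); // "1 * 10"
```

Changing two properties on the objects is enough to change the generated code, which is exactly what a transform step relies on.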

The parser works in two steps. The first is lexical analysis, the finite state machine from compiler theory, which splits a piece of code into individual tokens. The second is syntax analysis, which converts the token array into an AST.
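The two steps can be sketched for a toy language that only knows numbers and +. This is a simplified illustration under that assumption: the function names tokenize and parseTokens and the token shapes are made up here, and Babel's real tokenizer produces tokens on demand rather than building the whole array first.

```javascript
// Step 1: lexical analysis. Split the source text into tokens.
function tokenize(source) {
  const result = [];
  let pos = 0;
  while (pos < source.length) {
    const ch = source[pos];
    if (ch === " ") {
      pos++; // skip whitespace
    } else if (ch >= "0" && ch <= "9") {
      let end = pos;
      while (end < source.length && source[end] >= "0" && source[end] <= "9") end++;
      result.push({ type: "num", value: Number(source.slice(pos, end)) });
      pos = end;
    } else if (ch === "+") {
      result.push({ type: "plus" });
      pos++;
    } else {
      throw new Error(`Unexpected character "${ch}" at ${pos}`);
    }
  }
  result.push({ type: "eof" });
  return result;
}

// Step 2: syntax analysis. Turn the token array into a tree.
// Grammar: expr := num ("+" num)*
function parseTokens(tokens) {
  let index = 0;
  const parseNumber = () => {
    const token = tokens[index++];
    if (token.type !== "num") throw new Error(`Expected a number, got ${token.type}`);
    return { type: "NumericLiteral", value: token.value };
  };
  let left = parseNumber();
  while (tokens[index].type === "plus") {
    index++; // consume "+"
    left = { type: "BinaryExpression", operator: "+", left, right: parseNumber() };
  }
  if (tokens[index].type !== "eof") throw new Error("Unexpected token after expression");
  return left;
}

const tokenList = tokenize("1 + 23");
const tree = parseTokens(tokenList);
```

For the input "1 + 23", tokenList holds four tokens (num, plus, num, eof) and tree is a BinaryExpression whose right operand has the value 23.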

This time, I'll look at the source code and briefly analyze the process.

First, take a look at the directory structure of babel-parser.

There are four main folders: util, plugins, tokenizer, and parser.

Entry point

export function parse(input: string, options?: Options): File {
  if (options?.sourceType === "unambiguous") {
    // Copy the options so the caller's object is not mutated
    options = {
      ...options,
    };
    try {
      // First attempt: parse the input as a module
      options.sourceType = "module";
      const parser = getParser(options, input);
      const ast = parser.parse();

      // Some other code omitted

      return ast;
    } catch (moduleError) {
      try {
        // Second attempt: parse the input as a script
        options.sourceType = "script";
        return getParser(options, input).parse();
      } catch {}

      // Both attempts failed: report the error from the module attempt
      throw moduleError;
    }
  } else {
    return getParser(options, input).parse();
  }
}

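The core of this code is to obtain a parser through the getParser method and then call parse on it. When sourceType is "unambiguous", it first tries to parse the input as a module and falls back to parsing it as a script if that fails.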
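The try-module-then-script fallback in the code above can be sketched on its own. In this illustration fakeParse is a hypothetical stand-in for a real parser (it only looks for import/export and with), and parseUnambiguous is a made-up name; only the control flow mirrors the entry function.

```javascript
// Hypothetical stand-in for a real parser: it rejects import/export in
// "script" mode and rejects `with` statements in "module" mode.
function fakeParse(input, sourceType) {
  if (sourceType === "script" && /^\s*(import|export)\b/m.test(input)) {
    throw new SyntaxError("import/export only allowed in modules");
  }
  if (sourceType === "module" && /\bwith\s*\(/.test(input)) {
    throw new SyntaxError("'with' is not allowed in strict mode");
  }
  return { type: "File", sourceType };
}

// Same shape as the "unambiguous" branch: try module first, then script,
// and rethrow the module error if both attempts fail.
function parseUnambiguous(input) {
  try {
    return fakeParse(input, "module");
  } catch (moduleError) {
    try {
      return fakeParse(input, "script");
    } catch {}
    throw moduleError;
  }
}

const asModule = parseUnambiguous("import a from 'a';");
const asScript = parseUnambiguous("with (obj) { x; }");
```

Here asModule.sourceType is "module" and asScript.sourceType is "script". Input that is invalid in both modes surfaces the error from the module attempt, because the inner catch swallows the script error.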

Next, let's look at getParser:

function getParser(options: ?Options, input: string): Parser {
  // Start from the base Parser class
  let cls = Parser;
  // If plugins are declared in options, first check that they are declared
  // in a valid way, and if so, enable them. The parser's plugins are all
  // built in; the configuration can only switch them on or off.
  if (options?.plugins) {
    validatePlugins(options.plugins);
    cls = getParserClass(options.plugins);
  }

  return new cls(options, input);
}

const parserClassCache: { [key: string]: Class<Parser> } = {};

/** Get a Parser class with plugins applied. */
function getParserClass(pluginsFromOptions: PluginList): Class<Parser> {
  // mixinPluginNames holds the names of the built-in plugins
  const pluginList = mixinPluginNames.filter(name =>
    hasPlugin(pluginsFromOptions, name),
  );

  // Cache the class built for the current combination of plugins
  const key = pluginList.join("/");
  let cls = parserClassCache[key];
  if (!cls) {
    cls = Parser;
    for (const plugin of pluginList) {
      cls = mixinPlugins[plugin](cls);
    }
    parserClassCache[key] = cls;
  }
  return cls;
}
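The mixin-plus-cache pattern used by getParserClass can be sketched in isolation. Everything named here (BaseParser, samplePlugins, getClassFor, the features method) is hypothetical and only illustrates the technique: each plugin is a function that takes a class and returns a subclass, and the resulting class is cached under a key built from the enabled plugin names.

```javascript
// Base class standing in for Parser (hypothetical, heavily simplified).
class BaseParser {
  features() {
    return ["base"];
  }
}

// Each "plugin" is a function that takes a class and returns a subclass.
const samplePlugins = {
  jsx: (superClass) =>
    class extends superClass {
      features() {
        return [...super.features(), "jsx"];
      }
    },
  flow: (superClass) =>
    class extends superClass {
      features() {
        return [...super.features(), "flow"];
      }
    },
};
const samplePluginNames = Object.keys(samplePlugins); // fixed order: jsx, flow

const classCache = {};

function getClassFor(requested) {
  // Keep only known plugins, in the built-in order, so that
  // ["flow", "jsx"] and ["jsx", "flow"] share one cache entry.
  const enabled = samplePluginNames.filter((name) => requested.includes(name));
  const key = enabled.join("/");
  let cls = classCache[key];
  if (!cls) {
    cls = BaseParser;
    for (const name of enabled) {
      cls = samplePlugins[name](cls);
    }
    classCache[key] = cls;
  }
  return cls;
}

const MixedParser = getClassFor(["flow", "jsx"]);
const mixedFeatures = new MixedParser().features(); // ["base", "jsx", "flow"]
```

Because the key is derived from the filtered, ordered list rather than from the user's input, requesting the same plugins in a different order returns the very same class object.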

Parser parsing process

By now we have figured out the logic of the entry file. It has two main parts: the first obtains the Parser class, and the second, if plugins are configured, enables them on the Parser.

Next, let's look at the Parser's own logic:

export default class Parser extends StatementParser {
  constructor(options: ?Options, input: string) {
    // Preparatory work omitted
  }

  parse(): File {
    this.enterInitialScopes();
    const file = this.startNode();
    const program = this.startNode();
    this.nextToken();
    file.errors = null;
    this.parseTopLevel(file, program);
    file.errors = this.state.errors;
    return file;
  }
}

The constructor only does preparatory work, so we can ignore it for now. The main logic is in the parse function.

1. enterInitialScopes

enterInitialScopes() {
  let paramFlags = PARAM;
  if (this.hasPlugin("topLevelAwait") && this.inModule) {
    paramFlags |= PARAM_AWAIT;
  }
  this.scope.enter(SCOPE_PROGRAM);
  this.prodParam.enter(paramFlags);
}

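This step sets up the initial state before the root node is parsed: it enters the program-level scope and pushes the corresponding parameter flags, adding the await flag when the topLevelAwait plugin is enabled in a module.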
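The paramFlags |= PARAM_AWAIT line is ordinary bit-flag arithmetic. The sketch below shows the idea with made-up names and values (FLAG_NONE, FLAG_AWAIT, FlagStack); the real constants and handler live in babel-parser and are not reproduced here.

```javascript
// Hypothetical flag values, chosen only for illustration.
const FLAG_NONE = 0b00;
const FLAG_AWAIT = 0b10;

function initialFlags(hasTopLevelAwait, inModule) {
  let flags = FLAG_NONE;
  if (hasTopLevelAwait && inModule) {
    flags |= FLAG_AWAIT; // switch the await bit on
  }
  return flags;
}

// A stack of flags: entering a context pushes, leaving it pops.
class FlagStack {
  constructor() {
    this.stack = [];
  }
  enter(flags) {
    this.stack.push(flags);
  }
  exit() {
    this.stack.pop();
  }
  get hasAwait() {
    // Test the await bit of the innermost context.
    return (this.stack[this.stack.length - 1] & FLAG_AWAIT) > 0;
  }
}

const flagStack = new FlagStack();
flagStack.enter(initialFlags(true, true));
const awaitAllowedInModule = flagStack.hasAwait; // true
flagStack.enter(initialFlags(true, false));
const awaitAllowedInScript = flagStack.hasAwait; // false
```

Packing several booleans into one number keeps entering and leaving a context cheap: a single push or pop carries all the flags at once.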

2. startNode

startNode<T: NodeType>(): T {
  // $FlowIgnore
  return new Node(this, this.state.start, this.state.startLoc);
}

3. nextToken

This part is the heart of the parsing, and its code is more complex. The tokenizer scans forward one character at a time and uses finite-state-machine transitions to track the current state; when it reaches a certain state, it produces a token.

For example, when reading the number 123456, once the final 6 has been read and the next character is a space or a semicolon, a numeric token is produced.

nextToken(): void {
  // curContext = this.state.context[this.state.context.length - 1];
  const curContext = this.curContext();
  // skipSpace loops internally, skipping all whitespace such as spaces and tabs
  if (!curContext.preserveSpace) this.skipSpace();
  this.state.start = this.state.pos;
  if (!this.isLookahead) this.state.startLoc = this.state.curPosition();
  if (this.state.pos >= this.length) {
    this.finishToken(tt.eof);
    return;
  }
  if (curContext === ct.template) {
    // Read a template string token
    this.readTmplToken();
  } else {
    // Read a normal token; codePointAtPos returns the code point of the character at pos
    this.getTokenFromCode(this.codePointAtPos(this.state.pos));
  }
}
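The 123456 example from the text can be sketched as a two-state machine. This is a minimal illustration with made-up names (START, IN_NUMBER, readNumberToken) that only handles unsigned integers; Babel's own number reading covers far more cases.

```javascript
// States of a tiny machine that recognizes unsigned integers.
const START = "start";
const IN_NUMBER = "inNumber";

function readNumberToken(input, pos) {
  let state = START;
  const begin = pos;
  while (pos < input.length) {
    const isDigit = input[pos] >= "0" && input[pos] <= "9";
    if (state === START) {
      if (!isDigit) throw new Error(`Expected a digit at ${pos}`);
      state = IN_NUMBER; // the first digit moves us into the number state
    } else if (!isDigit) {
      break; // a space, ";" or any other non-digit ends the number
    }
    pos++;
  }
  if (state !== IN_NUMBER) throw new Error("Unexpected end of input");
  return { type: "num", value: Number(input.slice(begin, pos)), end: pos };
}

const numberToken = readNumberToken("123456;", 0);
// numberToken.value is 123456 and numberToken.end is 6, the index of ";"
```

The token is only produced once the machine sees a character that cannot continue the number (or the input ends), which is why the character after the final 6 matters.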

4. parseTopLevel

parseTopLevel(file: N.File, program: N.Program): N.File {
  file.program = this.parseProgram(program);
  file.comments = this.state.comments;

  if (this.options.tokens) file.tokens = babel7CompatTokens(this.tokens);

  return this.finishNode(file, "File");
}

From here, the call continues into parseProgram:

parseProgram(
  program: N.Program,
  end: TokenType = tt.eof,
  sourceType: SourceType = this.options.sourceType,
): N.Program {
  program.sourceType = sourceType;
  program.interpreter = this.parseInterpreterDirective();
  this.parseBlockBody(program, true, true, end);
  // In a module, report exports that refer to undeclared names
  if (
    this.inModule &&
    !this.options.allowUndeclaredExports &&
    this.scope.undefinedExports.size > 0
  ) {
    for (const [name] of Array.from(this.scope.undefinedExports)) {
      const pos = this.scope.undefinedExports.get(name);
      // $FlowIssue
      this.raise(pos, Errors.ModuleExportUndefined, name);
    }
  }
  return this.finishNode<N.Program>(program, "Program");
}

This in turn calls parseBlockBody:

parseBlockBody(
  node: N.BlockStatementLike,
  allowDirectives: ?boolean,
  topLevel: boolean,
  end: TokenType,
  afterBlockParse?: (hasStrictModeDirective: boolean) => void,
): void {
  const body = (node.body = []);
  const directives = (node.directives = []);
  this.parseBlockOrModuleBlockBody(
    body,
    allowDirectives ? directives : undefined,
    topLevel,
    end,
    afterBlockParse,
  );
}

It then calls parseBlockOrModuleBlockBody and eventually enters recursion: through parseStatement, next, and other functions, nextToken is called again and again until the string passed to the parse method at the beginning has been completely parsed.

parseBlockOrModuleBlockBody(
  body: N.Statement[],
  directives: ?(N.Directive[]),
  topLevel: boolean,
  end: TokenType,
  afterBlockParse?: (hasStrictModeDirective: boolean) => void,
): void {
  const oldStrict = this.state.strict;
  let hasStrictModeDirective = false;
  let parsedNonDirective = false;

  // Keep parsing statements until the end token is reached
  while (!this.match(end)) {
    const stmt = this.parseStatement(null, topLevel);

    if (directives && !parsedNonDirective) {
      if (this.isValidDirective(stmt)) {
        const directive = this.stmtToDirective(stmt);
        directives.push(directive);

        if (
          !hasStrictModeDirective &&
          directive.value.value === "use strict"
        ) {
          hasStrictModeDirective = true;
          this.setStrict(true);
        }

        continue;
      }
      parsedNonDirective = true;
      // clear strict errors since the strict mode will not change within the block
      this.state.strictErrors.clear();
    }
    body.push(stmt);
  }

  if (afterBlockParse) {
    afterBlockParse.call(this, hasStrictModeDirective);
  }

  if (!oldStrict) {
    this.setStrict(false);
  }

  this.next();
}

Summary

Roughly, the process can be represented by a simple diagram, which omits many details.

nextToken method analysis

readTmplToken reads the template string

This is the state machine I worked out from the code.

Let’s try it and see the results

getTokenFromCode

This function's logic is not complex, but it has a great many branches, because it has to handle many different characters. Here is a brief overview:

charcodes

The various charCodes used in the code are the contents of another repository, here is the link: https://github.com/xtuc/charcodes/blob/master/packages/charcodes/src/index.js

TokenType

The argument to finishToken is one of the built-in TokenTypes. For example, tt.parenL is actually parenL: new TokenType("(", { beforeExpr, startsExpr }).

These TokenTypes are all of Babel's built-in token types. They come from two sources: some are built into the Tokenizer and others are provided by parser plugins. But as we said, a parser plugin is just an on/off switch for the user, so in essence every TokenType is built into babel-parser to begin with.

There are four main categories: value types, such as number and string; punctuation, such as brackets and colons; operators, such as equal and greater than; and keywords, such as switch and case.
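The four categories can be illustrated with a stripped-down token type class. SketchTokenType and sketchTypes are hypothetical names, and the flags on each entry are chosen for illustration rather than copied from babel-parser; only the constructor call shape follows the parenL example quoted above.

```javascript
// Shorthand flags, used the same way as in the parenL example.
const beforeExpr = true;
const startsExpr = true;

// Hypothetical, stripped-down token type.
class SketchTokenType {
  constructor(label, conf = {}) {
    this.label = label;
    this.keyword = conf.keyword;
    this.beforeExpr = !!conf.beforeExpr; // an expression may follow this token
    this.startsExpr = !!conf.startsExpr; // this token may begin an expression
  }
}

const sketchTypes = {
  // value types
  num: new SketchTokenType("num", { startsExpr }),
  string: new SketchTokenType("string", { startsExpr }),
  // punctuation
  parenL: new SketchTokenType("(", { beforeExpr, startsExpr }),
  colon: new SketchTokenType(":", { beforeExpr }),
  // operators
  eq: new SketchTokenType("=", { beforeExpr }),
  // keywords
  _switch: new SketchTokenType("switch", { keyword: "switch" }),
  _case: new SketchTokenType("case", { keyword: "case", beforeExpr }),
};
```

Since every token type is a shared object, the tokenizer can hand the same instance to finishToken each time and the parser can compare token types by identity.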

Function Logic

getTokenFromCode(code: number): void {
  switch (code) {
    // The interpretation of a dot depends on whether it is followed
    // by a digit or another two dots.

    case charCodes.dot:
      this.readToken_dot();
      return;

    // Punctuation tokens.
    case charCodes.leftParenthesis:
      ++this.state.pos;
      this.finishToken(tt.parenL);
      return;
    case charCodes.rightParenthesis:
      ++this.state.pos;
      this.finishToken(tt.parenR);
      return;
    case charCodes.semicolon:
      ++this.state.pos;
      this.finishToken(tt.semi);
      return;
    case charCodes.comma:
      ++this.state.pos;
      this.finishToken(tt.comma);
      return;

    // Dozens of conditions are omitted here

    default:
      // Anything that can start an identifier is read as a word
      if (isIdentifierStart(code)) {
        this.readWord(code);
        return;
      }
  }

  // No branch matched: the character cannot start any token
  throw this.raise(
    this.state.pos,
    Errors.InvalidOrUnexpectedToken,
    String.fromCodePoint(code),
  );
}
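The dispatch-on-character-code idea above can be tried out in a self-contained form. In this sketch the codes are written out by hand instead of coming from the charcodes package, isIdentStart only accepts ASCII letters, $ and _, and tokenFromCode returns a plain object instead of calling finishToken; all three are simplifications made for illustration.

```javascript
// Character codes written out by hand for this sketch.
const CODE_LEFT_PAREN = 40; // "("
const CODE_RIGHT_PAREN = 41; // ")"
const CODE_SEMICOLON = 59; // ";"

function isIdentStart(code) {
  // Simplified: ASCII letters, "$" and "_" only.
  return (
    (code >= 65 && code <= 90) ||
    (code >= 97 && code <= 122) ||
    code === 36 ||
    code === 95
  );
}

function tokenFromCode(input, pos) {
  const code = input.codePointAt(pos);
  switch (code) {
    case CODE_LEFT_PAREN:
      return { type: "(", end: pos + 1 };
    case CODE_RIGHT_PAREN:
      return { type: ")", end: pos + 1 };
    case CODE_SEMICOLON:
      return { type: ";", end: pos + 1 };
    default:
      if (isIdentStart(code)) {
        // Read the rest of the word: identifier characters or digits.
        let end = pos + 1;
        while (
          end < input.length &&
          (isIdentStart(input.codePointAt(end)) || /[0-9]/.test(input[end]))
        ) {
          end++;
        }
        return { type: "name", value: input.slice(pos, end), end };
      }
  }
  throw new SyntaxError(`Unexpected character "${String.fromCodePoint(code)}"`);
}

const wordToken = tokenFromCode("foo1(", 0); // { type: "name", value: "foo1", end: 4 }
const parenToken = tokenFromCode("foo1(", 4); // { type: "(", end: 5 }
```

Single-character punctuation needs only one step, while a word keeps consuming characters until one no longer fits, which is where most of the real function's length comes from.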

parseTopLevel method analysis

In the process analysis earlier, we saw that the main logic of this function lives in parseBlockOrModuleBlockBody, so let's look at that function first.

parseBlockOrModuleBlockBody(
  body: N.Statement[],
  directives: ?(N.Directive[]),
  topLevel: boolean,
  end: TokenType,
  afterBlockParse?: (hasStrictModeDirective: boolean) => void,
): void {
  const oldStrict = this.state.strict;
  let hasStrictModeDirective = false;
  let parsedNonDirective = false;

  // Keep parsing statements until the end token is reached
  while (!this.match(end)) {
    const stmt = this.parseStatement(null, topLevel);

    if (directives && !parsedNonDirective) {
      if (this.isValidDirective(stmt)) {
        const directive = this.stmtToDirective(stmt);
        directives.push(directive);

        if (
          !hasStrictModeDirective &&
          directive.value.value === "use strict"
        ) {
          hasStrictModeDirective = true;
          this.setStrict(true);
        }

        continue;
      }
      parsedNonDirective = true;
      // clear strict errors since the strict mode will not change within the block
      this.state.strictErrors.clear();
    }
    body.push(stmt);
  }

  if (afterBlockParse) {
    afterBlockParse.call(this, hasStrictModeDirective);
  }

  if (!oldStrict) {
    this.setStrict(false);
  }

  this.next();
}

This function is not short, but its main logic is the while loop: it keeps parsing as long as !this.match(end) holds. Here end is tt.eof, one of the TokenTypes we just saw, which marks the end of the file.

The loop body mainly does two things: const stmt = this.parseStatement(null, topLevel); and body.push(stmt);. Here stmt is an AST node.
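The rest of the loop body deals with directives such as "use strict", which only count when they appear before any other statement. The sketch below isolates that logic; it assumes the statements have already been parsed into simplified node objects, and splitBlockBody is a made-up name.

```javascript
// Input: already-parsed statements with simplified, hypothetical shapes.
function splitBlockBody(statements) {
  const body = [];
  const directives = [];
  let parsedNonDirective = false;
  let strict = false;

  for (const stmt of statements) {
    // A directive is a bare string literal statement at the top of the block.
    const isDirective =
      !parsedNonDirective &&
      stmt.type === "ExpressionStatement" &&
      stmt.expression.type === "StringLiteral";
    if (isDirective) {
      directives.push({ type: "Directive", value: stmt.expression.value });
      if (stmt.expression.value === "use strict") strict = true;
      continue;
    }
    parsedNonDirective = true; // from here on, strings are ordinary statements
    body.push(stmt);
  }
  return { body, directives, strict };
}

const stringStatement = (value) => ({
  type: "ExpressionStatement",
  expression: { type: "StringLiteral", value },
});

const blockParts = splitBlockBody([
  stringStatement("use strict"),
  { type: "ExpressionStatement", expression: { type: "Identifier", name: "x" } },
  stringStatement("use strict"), // too late to be a directive
]);
```

For this input, blockParts.directives has one entry, blockParts.body has two, and blockParts.strict is true: the second string comes after a regular statement, so it stays in the body.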

parseStatement

parseStatement(context: ?string, topLevel?: boolean): N.Statement {
  if (this.match(tt.at)) {
    this.parseDecorators(true);
  }
  return this.parseStatementContent(context, topLevel);
}

The first few lines check whether the current token is @; if so, it is a decorator, which we will ignore for now.

Next, let's look at parseStatementContent.

parseStatementContent

This function is very much like getTokenFromCode in the Tokenizer. getTokenFromCode generates various types of tokens based on the character code, while parseStatementContent generates AST nodes based on the type of the current token.

During parsing, there are also some special cases where the Tokenizer's nextToken is called again to produce a new token, such as when parsing an import. The excerpt below shows the same thing happening for an async function.

parseStatementContent(context: ?string, topLevel: ?boolean): N.Statement {
  let starttype = this.state.type;
  const node = this.startNode();
  let kind;

  if (this.isLet(context)) {
    starttype = tt._var;
    kind = "let";
  }

  // Most types of statements are recognized by the keyword they
  // start with. Many are trivial to parse, some require a bit of
  // complexity.

  switch (starttype) {
    case tt._break:
    case tt._continue:
      // $FlowFixMe
      return this.parseBreakContinueStatement(node, starttype.keyword);
    case tt._debugger:
      return this.parseDebuggerStatement(node);
    case tt._do:
      return this.parseDoStatement(node);
    case tt._for:
      return this.parseForStatement(node);

    // The checks for the other token types are omitted

    default: {
      if (this.isAsyncFunction()) {
        if (context) {
          this.raise(
            this.state.start,
            Errors.AsyncFunctionInSingleStatementContext,
          );
        }
        this.next(); // Here again, the nextToken method of the Tokenizer is called
        return this.parseFunctionStatement(node, true, !context);
      }
    }
  }

  // If the statement does not start with a statement keyword or a
  // brace, it's an ExpressionStatement or LabeledStatement. We
  // simply start parsing an expression, and afterwards, if the
  // next token is a colon and the expression was a simple
  // Identifier node, we switch to interpreting it as a label.
  const maybeName = this.state.value;
  const expr = this.parseExpression();

  if (
    starttype === tt.name &&
    expr.type === "Identifier" &&
    this.eat(tt.colon)
  ) {
    return this.parseLabeledStatement(node, maybeName, expr, context);
  } else {
    return this.parseExpressionStatement(node, expr);
  }
}
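The two halves of this function, keyword dispatch followed by the label-or-expression decision, can be sketched over a plain token array. The token shapes and the name parseStatementSketch are assumptions made for this illustration, and the "expression" here is just a single identifier.

```javascript
// Tokens are plain objects: { type, value? } (hypothetical shapes).
// Returns the parsed node and the index of the next unread token.
function parseStatementSketch(tokens, index = 0) {
  const startType = tokens[index].type;

  // Statements recognized by the keyword they start with.
  switch (startType) {
    case "debugger":
      return { node: { type: "DebuggerStatement" }, next: index + 1 };
    case "break":
    case "continue":
      return {
        node: { type: startType === "break" ? "BreakStatement" : "ContinueStatement" },
        next: index + 1,
      };
  }

  // Not a statement keyword: parse a (trivial) expression first...
  if (startType !== "name") {
    throw new SyntaxError(`Unexpected token ${startType}`);
  }
  const expr = { type: "Identifier", name: tokens[index].value };
  index++;

  // ...then decide between a label and an expression statement.
  if (tokens[index] && tokens[index].type === "colon") {
    return { node: { type: "LabeledStatement", label: expr }, next: index + 1 };
  }
  return { node: { type: "ExpressionStatement", expression: expr }, next: index };
}

const labeled = parseStatementSketch([{ type: "name", value: "outer" }, { type: "colon" }]);
const plain = parseStatementSketch([{ type: "name", value: "x" }, { type: "semi" }]);
```

Here labeled.node is a LabeledStatement and plain.node is an ExpressionStatement: the parser cannot tell them apart until it has looked at the token that follows the identifier.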