In the previous article, while analyzing Babel's compilation process, we mentioned that Babel converts JS code into an AST (Abstract Syntax Tree). This is a generic technique: whatever the programming language, a compiler parses source code into an AST first, so ASTs are not specific to Babel, let alone to JS.
Why do we need to do this? The raw JS file is just text that the computer cannot understand, and modifying JS code directly as a string would be difficult. By converting it to an AST, which is essentially a set of objects representing the structure of the program, we can modify the code indirectly by modifying those objects. The AST can then also be used to generate bytecode.
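As a concrete illustration (a hand-written, simplified sketch, not Babel's exact output, which carries many more fields such as `start`, `end`, and `loc`), the statement `const a = 1;` is represented by a tree of plain objects roughly like this, and "modifying the code" becomes mutating the objects:

```javascript
// Simplified AST for `const a = 1;` (shape modeled on Babel's node types)
const ast = {
  type: "Program",
  body: [
    {
      type: "VariableDeclaration",
      kind: "const",
      declarations: [
        {
          type: "VariableDeclarator",
          id: { type: "Identifier", name: "a" },
          init: { type: "NumericLiteral", value: 1 },
        },
      ],
    },
  ],
};

// Modify the code indirectly: rename the variable by mutating the object
ast.body[0].declarations[0].id.name = "b";
console.log(ast.body[0].declarations[0].id.name); // "b"
```

A code generator would then walk this tree and print `const b = 1;` back out.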
The Parser's work is divided into two steps. The first is lexical analysis, the finite state machine from compiler theory, which splits a piece of code into individual tokens. The second is syntax analysis, which converts the token array into an AST.
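The two steps can be sketched in miniature. This is a toy illustration, not Babel's real implementation: a character-level tokenizer first splits `x = 1` into tokens, then a second pass assembles them into a tiny tree:

```javascript
// Step 1: lexical analysis — walk the input one character at a time
function tokenize(input) {
  const tokens = [];
  let pos = 0;
  while (pos < input.length) {
    const ch = input[pos];
    if (ch === " ") { pos++; continue; }            // skip whitespace
    if (/[0-9]/.test(ch)) {                          // "number" state
      let value = "";
      while (pos < input.length && /[0-9]/.test(input[pos])) value += input[pos++];
      tokens.push({ type: "num", value });
    } else if (/[A-Za-z_$]/.test(ch)) {              // "identifier" state
      let value = "";
      while (pos < input.length && /[A-Za-z0-9_$]/.test(input[pos])) value += input[pos++];
      tokens.push({ type: "name", value });
    } else if (ch === "=") {
      tokens.push({ type: "eq", value: "=" });
      pos++;
    } else {
      throw new SyntaxError(`Unexpected character ${ch}`);
    }
  }
  return tokens;
}

// Step 2: syntax analysis — turn the token array into a (tiny) AST
function parse(tokens) {
  const [left, eq, right] = tokens;
  if (!eq || eq.type !== "eq") throw new SyntaxError("expected =");
  return {
    type: "AssignmentExpression",
    left: { type: "Identifier", name: left.value },
    right: { type: "NumericLiteral", value: Number(right.value) },
  };
}

const ast = parse(tokenize("x = 1"));
console.log(ast.left.name, ast.right.value); // x 1
```

The real tokenizer handles far more states (strings, templates, operators, comments), but the skeleton is the same: loop over characters, stay in a state while the input matches, emit a token when it no longer does.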
This time I’ll look at源码The process is briefly analyzed.
First, take a look at the directory structure of babel-parser.

There are four main folders: util, plugins, tokenizer, and parser.
```javascript
function getParser(options: ?Options, input: string): Parser {
  // Get the Parser class
  let cls = Parser;
  // If plugins are declared in options, first check that they are
  // declared correctly, and if so, enable them. Parser plugins are
  // all built in; the options can only switch them on or off.
  if (options?.plugins) {
    validatePlugins(options.plugins);
    cls = getParserClass(options.plugins);
  }

  return new cls(options, input);
}
```
```javascript
/** Get a Parser class with plugins applied. */
function getParserClass(pluginsFromOptions: PluginList): Class<Parser> {
  // mixinPluginNames lists the names of all built-in plugins
  const pluginList = mixinPluginNames.filter(name =>
    hasPlugin(pluginsFromOptions, name),
  );

  // Cache the Parser class for the current combination of plugins
  const key = pluginList.join("/");
  let cls = parserClassCache[key];
  if (!cls) {
    cls = Parser;
    for (const plugin of pluginList) {
      cls = mixinPlugins[plugin](cls);
    }
    parserClassCache[key] = cls;
  }
  return cls;
}
```
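The mixin pattern used here — each plugin is a function that takes a class and returns an extended subclass — can be demonstrated in isolation. The plugin names and `features` method below are illustrative, not Babel's:

```javascript
// A toy version of the plugin-mixin pattern in getParserClass
class BaseParser {
  features() { return []; }
}

// Each "plugin" wraps a class and returns a subclass of it
const mixins = {
  jsx: Cls => class extends Cls {
    features() { return [...super.features(), "jsx"]; }
  },
  flow: Cls => class extends Cls {
    features() { return [...super.features(), "flow"]; }
  },
};

const cache = {};
function getParserClass(plugins) {
  const key = plugins.join("/");
  if (!cache[key]) {
    let cls = BaseParser;
    for (const name of plugins) cls = mixins[name](cls); // chain the subclasses
    cache[key] = cls;
  }
  return cache[key];
}

const Cls = getParserClass(["jsx", "flow"]);
console.log(new Cls().features());                    // ["jsx", "flow"]
console.log(getParserClass(["jsx", "flow"]) === Cls); // true — served from cache
```

Caching by the joined plugin names matters because building the class chain is repeated work, and two parses with the same plugin set should share one class.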
Parser parsing process
By now we have worked out the logic of the entry file. It has two parts: the first declares the Parser, and the second, if plugins are configured, enables those plugin features on the Parser.
The constructor is all preparatory work, so we can set it aside for now; the main logic is in the parse function.
1. enterInitialScopes
```javascript
enterInitialScopes() {
  let paramFlags = PARAM;
  if (this.hasPlugin("topLevelAwait") && this.inModule) {
    paramFlags |= PARAM_AWAIT;
  }
  this.scope.enter(SCOPE_PROGRAM);
  this.prodParam.enter(paramFlags);
}
```
This step performs the initialization for the root node: entering the program scope and setting the corresponding parameter flags.
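The parameter flags are combined with bitwise OR, so several independent switches fit in one number, and scopes are tracked as a stack. A minimal sketch of both ideas (the constant values and class below are illustrative, not Babel's actual definitions):

```javascript
// Toy versions of the bit flags combined in enterInitialScopes
const PARAM = 0b00;
const PARAM_AWAIT = 0b01;
const SCOPE_PROGRAM = 0b01;

// A scope handler is essentially a stack of flag values
class ScopeHandler {
  constructor() { this.stack = []; }
  enter(flags) { this.stack.push(flags); }
  exit() { this.stack.pop(); }
  get currentFlags() { return this.stack[this.stack.length - 1]; }
}

let paramFlags = PARAM;
const topLevelAwaitEnabled = true; // pretend the plugin is on and we are in a module
if (topLevelAwaitEnabled) paramFlags |= PARAM_AWAIT;

const scope = new ScopeHandler();
scope.enter(SCOPE_PROGRAM);
console.log((paramFlags & PARAM_AWAIT) !== 0); // true — await is allowed at top level
```

Checking a flag is then a single `&` against the mask, which is why the real parser uses this encoding instead of a bag of booleans.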
2. startNode
```javascript
startNode<T: NodeType>(): T {
  // $FlowIgnore
  return new Node(this, this.state.start, this.state.startLoc);
}
```
3. nextToken
This part is the heart of the parsing, and its code is more complex. The tokenizer moves forward one character at a time, using finite-state-machine state transitions to track what it is reading, and when a terminal state is reached, it produces a token.
For example, when reading the number 123456, once the 6 has been consumed and the next character is a space or a semicolon, a numeric token is produced.
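That digit-reading loop can be sketched as a tiny state machine. This is a simplification of the real number-reading logic, integer-only, with no floats, exponents, or numeric separators:

```javascript
// Stay in the "number" state while the next character is a digit;
// any other character (space, semicolon, end of input...) ends the token
function readNumberToken(input, startPos) {
  let pos = startPos;
  while (pos < input.length && input[pos] >= "0" && input[pos] <= "9") {
    pos++;
  }
  return { type: "num", value: input.slice(startPos, pos), start: startPos, end: pos };
}

const token = readNumberToken("123456;", 0);
console.log(token.value, token.end); // "123456" 6
```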
```javascript
nextToken(): void {
  // curContext = this.state.context[this.state.context.length - 1]
  const curContext = this.curContext();
  // skipSpace loops internally, skipping all whitespace: spaces, tabs, etc.
  if (!curContext.preserveSpace) this.skipSpace();
  this.state.start = this.state.pos;
  if (!this.isLookahead) this.state.startLoc = this.state.curPosition();
  if (this.state.pos >= this.length) {
    this.finishToken(tt.eof);
    return;
  }

  if (curContext === ct.template) {
    // Read a template-string token
    this.readTmplToken();
  } else {
    // Read an ordinary token; codePointAtPos returns the code point
    // of the character at position pos
    this.getTokenFromCode(this.codePointAtPos(this.state.pos));
  }
}
```
parseBlockOrModuleBlockBody is then called and the process becomes recursive: nextToken is invoked again and again through parseStatement, next, and similar functions, until the string originally passed to the parse method has been completely consumed.
Summary
Roughly, the whole process can be represented with a simple diagram, omitting many details:
nextToken method analysis
readTmplToken reads the template string
This is the state machine I analyzed from the code
Let’s try it and see the results
getTokenFromCode
The logic of this function is not complex, but it has a great many branches, because it must distinguish between many different characters. A simplified version is shown below:
The argument to finishToken is actually a built-in TokenType; for example, `tt.parenL` is defined as `parenL: new TokenType("(", { beforeExpr, startsExpr })`.
These TokenTypes are all Babel's built-in token types. There are two sources of TokenTypes: one set is built into the Tokenizer, and the other is provided by parser plugins. But as we said, a parser plugin is just an on/off switch for the user, so essentially all TokenTypes are built into babel-parser to begin with.
There are four main categories: literal types, such as number and string; punctuation, such as brackets and colons; operators, such as equals and greater-than; and keywords, such as switch and case.
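A minimal sketch of the TokenType idea, modeled on but simpler than Babel's actual class:

```javascript
// Each token type is a shared singleton describing one kind of token
class TokenType {
  constructor(label, conf = {}) {
    this.label = label;
    this.beforeExpr = !!conf.beforeExpr; // an expression may follow this token
    this.startsExpr = !!conf.startsExpr; // this token can begin an expression
  }
}

// A few types in the spirit of babel-parser's table (illustrative subset)
const tt = {
  num: new TokenType("num", { startsExpr: true }),          // literal
  parenL: new TokenType("(", { beforeExpr: true, startsExpr: true }), // punctuation
  eq: new TokenType("=", { beforeExpr: true }),             // operator
  _switch: new TokenType("switch"),                          // keyword
};

console.log(tt.parenL.label, tt.parenL.startsExpr); // ( true
```

Because every `(` token shares the single `tt.parenL` instance, the parser can compare token types with `===`, as in `this.match(tt.parenL)`.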
```javascript
getTokenFromCode(code: number): void {
  switch (code) {
    // The interpretation of a dot depends on whether it is followed
    // by a digit or another two dots.
    case charCodes.dot:
      this.readToken_dot();
      return;

    // Punctuation tokens.
    case charCodes.leftParenthesis:
      ++this.state.pos;
      this.finishToken(tt.parenL);
      return;
    case charCodes.rightParenthesis:
      ++this.state.pos;
      this.finishToken(tt.parenR);
      return;
    case charCodes.semicolon:
      ++this.state.pos;
      this.finishToken(tt.semi);
      return;
    case charCodes.comma:
    // Dozens of cases are omitted here

    default:
      if (isIdentifierStart(code)) {
        this.readWord(code);
        return;
      }
  }
}
```
Back at the start of the process analysis, we saw that the main logic of parse lands in the parseBlockOrModuleBlockBody function, so let's look at that function first.
```javascript
        // ...the preceding part of the function is omitted
        continue;
      }
      parsedNonDirective = true;
      // clear strict errors since the strict mode will not change within the block
      this.state.strictErrors.clear();
    }
    body.push(stmt);
  }

  if (afterBlockParse) {
    afterBlockParse.call(this, hasStrictModeDirective);
  }

  if (!oldStrict) {
    this.setStrict(false);
  }

  this.next();
}
```
This function is not short, but its main logic is the while loop: as long as `!this.match(end)` holds, it keeps parsing. Here `end` is actually `tt.eof`, one of the TokenTypes we just saw, which marks the end of the file.
The main content of the loop body is two lines: `const stmt = this.parseStatement(null, topLevel);` and `body.push(stmt);`. This stmt is a Node of the AST.
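The shape of that loop, reduced to its essentials (a toy sketch, not Babel's code — here each token simply becomes one statement Node):

```javascript
// Simplified skeleton of the statement loop in parseBlockOrModuleBlockBody:
// keep calling parseStatement and collecting Nodes until we reach eof
function parseBody(tokens) {
  const body = [];
  let pos = 0;
  const match = type => tokens[pos] && tokens[pos].type === type;

  // Toy parseStatement: consume one token, produce one statement Node
  function parseStatement() {
    const token = tokens[pos++];
    return { type: "ExpressionStatement", expression: token.value };
  }

  while (!match("eof") && pos < tokens.length) {
    const stmt = parseStatement();
    body.push(stmt);
  }
  return { type: "Program", body };
}

const program = parseBody([
  { type: "name", value: "a" },
  { type: "num", value: 1 },
  { type: "eof" },
]);
console.log(program.body.length); // 2
```

In the real parser, parseStatement itself pulls more tokens from the Tokenizer as needed, which is what makes the whole process recursive.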
The first line checks whether the current token is @; if so, it is a decorator, which we will set aside for now.
Let’s look at this parseStatementContent
parseStatementContent
This function is very much like getTokenFromCode in the Tokenizer just now. getTokenFromCode generates various types of tokens based on code, while parseStatementContent generates AST Node based on different types of tokens.
During parsing, there are also special cases where the Tokenizer's nextToken is called again to produce a new token, for example when an import is parsed.
```javascript
if (this.isLet(context)) {
  starttype = tt._var;
  kind = "let";
}

// Most types of statements are recognized by the keyword they
// start with. Many are trivial to parse, some require a bit of
// complexity.
switch (starttype) {
  case tt._break:
  case tt._continue:
    // $FlowFixMe
    return this.parseBreakContinueStatement(node, starttype.keyword);
  case tt._debugger:
    return this.parseDebuggerStatement(node);
  case tt._do:
    return this.parseDoStatement(node);
  case tt._for:
    return this.parseForStatement(node);
  // The other TokenType cases are omitted here
  default: {
    if (this.isAsyncFunction()) {
      if (context) {
        this.raise(
          this.state.start,
          Errors.AsyncFunctionInSingleStatementContext,
        );
      }
      this.next(); // The Tokenizer's nextToken method is called again here
      return this.parseFunctionStatement(node, true, !context);
    }
  }
}

// If the statement does not start with a statement keyword or a
// brace, it's an ExpressionStatement or LabeledStatement. We
// simply start parsing an expression, and afterwards, if the
// next token is a colon and the expression was a simple
// Identifier node, we switch to interpreting it as a label.
const maybeName = this.state.value;
const expr = this.parseExpression();
if (
  starttype === tt.name &&
  expr.type === "Identifier" &&
  this.eat(tt.colon)
) {
  return this.parseLabeledStatement(node, maybeName, expr, context);
} else {
  return this.parseExpressionStatement(node, expr);
}
```