In-depth Babel Principles Series (3) Tokenizer
The last post covered the overall workflow of babel-parser. It has two main components: the Tokenizer, which splits the code string into an array of Tokens, and the parser, which converts that Token array into an AST.
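As a quick refresher, both stages can be observed from the outside. Here is a minimal sketch, assuming @babel/parser is installed, that uses its documented tokens: true option to expose the token array next to the AST:

```js
// Sketch: look at Tokenizer output (tokens) and parser output (AST) side by side.
const { parse } = require("@babel/parser");

const ast = parse("const n = 1 + 2;", { tokens: true });

// The token array the Tokenizer produced, attached because of tokens: true.
console.log(ast.tokens.map(t => (t.type.label ? t.type.label : t.type)));
// → labels along the lines of 'const', 'name', '=', 'num', '+', 'num', ';'

// The AST the parser built from that token stream.
console.log(ast.program.body[0].type); // 'VariableDeclaration'
```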
This time, let’s take a closer look at the logic of Tokenizer.
The Tokenizer source contains only four files: context.js, index.js, state.js, and types.js. Let's analyze them one by one.
Type
First up is types.js. As the name implies, this file defines all of the Token types.

Let's look at the definition of TokenType first. Note the official comments, which are important for understanding why these fields exist:
```js
// The `beforeExpr` property is used to disambiguate between 1) binary
// ...

export class TokenType {
  label: string;
  keyword: ?string;
  beforeExpr: boolean;
  startsExpr: boolean;
  rightAssociative: boolean;
  isLoop: boolean;
  isAssign: boolean;
  prefix: boolean;
  postfix: boolean;
  binop: ?number;
  updateContext: ?(prevType: TokenType) => void;

  constructor(label: string, conf: TokenOptions = {}) {
    this.label = label;
    this.keyword = conf.keyword;
    this.beforeExpr = !!conf.beforeExpr;
    this.startsExpr = !!conf.startsExpr;
    this.rightAssociative = !!conf.rightAssociative;
    this.isLoop = !!conf.isLoop;
    this.isAssign = !!conf.isAssign;
    this.prefix = !!conf.prefix;
    this.postfix = !!conf.postfix;
    this.binop = conf.binop != null ? conf.binop : null;
    this.updateContext = null;
  }
}
```
This structure is not especially complicated, but the intent behind each of these flags may not be obvious at first glance. That's fine; we'll analyze a few concrete examples.

Before the specific examples, let's look at two helper functions:
```js
export const keywords = new Map<string, TokenType>();

function createKeyword(name: string, options: TokenOptions = {}): TokenType {
  options.keyword = name;
  const token = new TokenType(name, options);
  keywords.set(name, token);
  return token;
}

function createBinop(name: string, binop: number) {
  return new TokenType(name, { beforeExpr, binop });
}
```
createKeyword creates a keyword TokenType. Combined with the code above, we can see that the distinguishing feature of a keyword TokenType is that its keyword property has a value, and that value is the same as its label.

createBinop creates a binary-operator type, characterized by beforeExpr being true and binop holding a concrete precedence value.
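To make those two features concrete, here is a standalone sketch (a re-implementation for illustration, not the babel source itself) showing what each helper produces:

```js
// Standalone sketch of the two helpers and the flags they set.
const beforeExpr = true;

class TokenType {
  constructor(label, conf = {}) {
    this.label = label;
    this.keyword = conf.keyword;
    this.beforeExpr = !!conf.beforeExpr;
    this.binop = conf.binop != null ? conf.binop : null;
  }
}

const keywords = new Map();
function createKeyword(name) {
  const token = new TokenType(name, { keyword: name });
  keywords.set(name, token);
  return token;
}
function createBinop(name, binop) {
  return new TokenType(name, { beforeExpr, binop });
}

const _break = createKeyword("break");
console.log(_break.keyword === _break.label); // true: the keyword marker
const logicalAND = createBinop("&&", 2);
console.log(logicalAND.beforeExpr, logicalAND.binop); // true 2
```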
With those helpers covered, let's look at a few concrete examples.

First are the basic value types, e.g.:
```js
num: new TokenType("num", { startsExpr }),
bigint: new TokenType("bigint", { startsExpr }),
string: new TokenType("string", { startsExpr }),
regexp: new TokenType("regexp", { startsExpr }),
name: new TokenType("name", { startsExpr }),
```
Then the punctuation types:
```js
bracketL: new TokenType("[", { beforeExpr, startsExpr }),
bracketR: new TokenType("]"),
braceL: new TokenType("{", { beforeExpr, startsExpr }),
braceR: new TokenType("}"),
parenL: new TokenType("(", { beforeExpr, startsExpr }),
parenR: new TokenType(")"),
comma: new TokenType(",", { beforeExpr }),
semi: new TokenType(";", { beforeExpr }),
```
Operator types:
```js
eq: new TokenType("=", { beforeExpr, isAssign }),
assign: new TokenType("_=", { beforeExpr, isAssign }),
incDec: new TokenType("++/--", { prefix, postfix, startsExpr }),
bang: new TokenType("!", { beforeExpr, prefix, startsExpr }),
logicalAND: createBinop("&&", 2),
plusMin: new TokenType("+/-", { beforeExpr, binop: 9, prefix, startsExpr }),
```
Keyword types:
```js
_break: createKeyword("break"),
_case: createKeyword("case", { beforeExpr }),
_catch: createKeyword("catch"),
_continue: createKeyword("continue"),
_debugger: createKeyword("debugger"),
_default: createKeyword("default", { beforeExpr }),
_do: createKeyword("do", { isLoop, beforeExpr }),
```
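The payoff of the keywords map comes when the tokenizer finishes reading an identifier-like word: it looks the word up and falls back to the generic name type. Continuing the standalone sketch from above (classifyWord is a hypothetical name; the real babel method doing this is readWord):

```js
// Continuing the sketch: decide between a keyword type and a plain identifier.
const ttName = new TokenType("name"); // stand-in for the generic tt.name type

function classifyWord(word) {
  return keywords.get(word) || ttName;
}

console.log(classifyWord("break").keyword); // 'break': a reserved word
console.log(classifyWord("foo").keyword);   // undefined: a plain identifier
```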
Context
Earlier we skipped over the updateContext field on TokenType. That function is tied to context.js, which is also used by the JSX plugin. Let's look at them together.

The comment at the top of the file explains what the context does:
```js
// The token context is used to track whether the apostrophe "`"
// starts or ends a string template
```
In other words, it is used to parse template strings. Its type definition is even simpler:
```js
export class TokContext {
  constructor(token: string, preserveSpace?: boolean) {
    this.token = token;
    this.preserveSpace = !!preserveSpace;
  }

  token: string;
  preserveSpace: boolean;
}

export const types: { [key: string]: TokContext } = {
  brace: new TokContext("{"),
  template: new TokContext("`", true),
};
```
Then all that remains is to add updateContext methods to the types involved with template strings, so that the contexts stack is kept up to date during parsing:
```js
/*
 * Keep the contexts stack in sync as tokens are consumed, so the
 * tokenizer knows whether "}" and "`" close a template or a block.
 */
tt.braceR.updateContext = context => {
  context.pop();
};

tt.braceL.updateContext = tt.dollarBraceL.updateContext = context => {
  context.push(types.brace);
};

tt.backQuote.updateContext = context => {
  if (context[context.length - 1] === types.template) {
    context.pop();
  } else {
    context.push(types.template);
  }
};
```
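To see why a stack is needed rather than a simple boolean, trace a nested template such as `` `a${ `b` }c` ``. Below is a small simulation of the three handlers above (my own sketch, not babel code):

```js
// Simulate the context stack while tokenizing: `a${ `b` }c`
const template = "`";
const brace = "{";
const context = [];

function onToken(label) {
  if (label === "`") {
    // backQuote: closes the template if one is on top, otherwise opens one
    if (context[context.length - 1] === template) context.pop();
    else context.push(template);
  } else if (label === "${") {
    context.push(brace); // dollarBraceL: an interpolation opens a brace context
  } else if (label === "}") {
    context.pop(); // braceR: popping may put us back inside a template
  }
  console.log(label.padEnd(2), JSON.stringify(context));
}

["`", "${", "`", "`", "}", "`"].forEach(onToken);
// The top of the stack is what tells the tokenizer whether "}" should
// resume template parsing or close an ordinary block.
```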
State
init
This file defines the state of the parsing process: where we are in the input, which line and column, how many characters in, and so on.

Everything in this file lives in the class that declares the state, so all the code that follows belongs to that class.

First is the initialization function. This class has no constructor; instead it exposes an init function that callers must invoke themselves.
```js
strict: boolean;
curLine: number;
startLoc: Position;
endLoc: Position;

init(options: Options): void {
  this.strict =
    options.strictMode === false ? false : options.sourceType === "module";

  this.curLine = options.startLine;
  this.startLoc = this.endLoc = this.curPosition();
}
```
As you can see, this initialization function does something simple. First, it determines whether we are in strict mode and saves the result.

It also initializes curLine to the startLine passed in through the options, a value that will keep changing as parsing proceeds.

Last, it initializes startLoc and endLoc. These two values are never changed again, at least not anywhere inside the State class; if I spot where they change as I keep reading the code, I will come back and update this.
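The strict-mode expression is worth isolating, since the precedence of the two checks matters. Here is a tiny sketch of the exact rule from init (computeStrict is a hypothetical helper name):

```js
// The rule from init, extracted: an explicit strictMode: false wins;
// otherwise sourceType "module" implies strict mode.
function computeStrict(strictMode, sourceType) {
  return strictMode === false ? false : sourceType === "module";
}

console.log(computeStrict(undefined, "module")); // true
console.log(computeStrict(undefined, "script")); // false
console.log(computeStrict(false, "module"));     // false (explicit opt-out)
```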
The curPosition function is also very simple:
```js
curPosition(): Position {
  return new Position(this.curLine, this.pos - this.lineStart);
}
```
The first argument to Position is the line number, and the second is the column number.

curLine was just initialized to startLine above, while pos and lineStart have not been assigned yet at this point, so both are 0.
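A worked example, under the assumption that parsing has advanced a bit, shows how the subtraction yields a column:

```js
// Worked example for curPosition: input "ab\ncd", with "ab\nc" consumed.
const pos = 4;       // absolute offset of the next character, 'd'
const lineStart = 3; // offset where the current (second) line begins
const curLine = 2;

// curPosition() would compute new Position(2, 4 - 3), i.e. line 2, column 1.
console.log({ line: curLine, column: pos - lineStart }); // { line: 2, column: 1 }
```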
clone
Besides the two functions above, there is a clone function used to deep-copy the state. There is no special logic to it; I include it here for anyone interested.
```js
clone(skipArrays?: boolean): State {
  const state = new State();
  const keys = Object.keys(this);
  for (let i = 0, length = keys.length; i < length; i++) {
    const key = keys[i];
    let val = this[key];

    if (!skipArrays && Array.isArray(val)) {
      val = val.slice();
    }

    state[key] = val;
  }
  return state;
}
```
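One detail worth noting: the copy is shallow except for arrays, which get sliced so that the clone's stacks (such as context) can diverge from the original. A small sketch of why that slice matters (illustrative object, not the real State):

```js
// Why clone() slices arrays: without the slice, both states share one stack.
const original = { context: ["`"] };

const aliased = { context: original.context };         // no slice: shared array
const cloned  = { context: original.context.slice() }; // slice: independent copy

cloned.context.push("{");
console.log(original.context.length); // 1: the clone's push did not leak back
aliased.context.push("{");
console.log(original.context.length); // 2: the alias mutates the original
```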
Other properties
State has a large number of other properties, all used to save parsing state and to generate the AST. Many of them don't mean much in isolation, so here are a few whose purpose is clear at a glance:
```js
// The current position of the tokenizer in the input.
pos: number = 0;
lineStart: number = 0;

// Properties of the current token:
// Its type
type: TokenType = tt.eof;

// For tokens that include more information than their type, the value
value: any = null;

// Its start and end offset
start: number = 0;
end: number = 0;
```
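These offsets are what tie a token back to the source text: slicing the input between start and end recovers the raw token, while value holds the parsed form. A small sketch of that relationship (made-up state values for illustration):

```js
// Sketch: how start/end relate to the raw source of the current token.
const input = "let answer = 42;";
const state = { start: 4, end: 10, value: "answer" }; // as if a name token was read

console.log(input.slice(state.start, state.end)); // 'answer'
// For a string token, `value` would instead hold the unquoted, unescaped
// contents, which is why it exists separately from the raw slice.
```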
index
Finally, index.js contains the main logic of the Tokenizer, close to 1,600 lines. Let's work through it slowly.

Although there is a lot of code, the core idea is not complicated. First, the constructor:
```js
isLookahead: boolean;

// Token store.
tokens: Array<Token | N.Comment> = [];

constructor(options: Options, input: string) {
  super();
  this.state = new State();
  this.state.init(options);
  this.input = input;
  this.length = input.length;
  this.isLookahead = false;
}
```
This is where the State is created and its init function is called.

Then there are several entry functions for parsing, i.e. the ones called from the outside, such as next:
```js
pushToken(token: Token | N.Comment) {
  // Pop out invalid tokens trapped by try-catch parsing.
  // Those parsing branches are mainly created by typescript and jsx plugins.
  this.tokens.length = this.state.tokensLength;
  this.tokens.push(token);
  ++this.state.tokensLength;
}

next(): void {
  this.checkKeywordEscapes();
  if (this.options.tokens) {
    this.pushToken(new Token(this.state));
  }

  this.state.lastTokEnd = this.state.end;
  this.state.lastTokStart = this.state.start;
  this.state.lastTokEndLoc = this.state.endLoc;
  this.state.lastTokStartLoc = this.state.startLoc;
  this.nextToken();
}
```
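The Token | N.Comment values that pushToken stores are lightweight snapshots of the current state. For reference, the Token class lives in this same file; quoted here from memory and abbreviated, so treat it as approximate:

```js
export class Token {
  constructor(state: State) {
    this.type = state.type;
    this.value = state.value;
    this.start = state.start;
    this.end = state.end;
    this.loc = new SourceLocation(state.startLoc, state.endLoc);
  }
}
```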
We talked about nextToken in the last post. It mainly does two things: one branch parses template strings, and the other calls getTokenFromCode. The heart of it is getTokenFromCode, which checks a variety of cases and dispatches to different read methods, for example:
```js
// Anything else beginning with a digit is an integer, octal
// number, or float.
case charCodes.digit1:
case charCodes.digit2:
case charCodes.digit3:
case charCodes.digit4:
case charCodes.digit5:
case charCodes.digit6:
case charCodes.digit7:
case charCodes.digit8:
case charCodes.digit9:
  this.readNumber(false);
  return;

// Quotes produce strings.
case charCodes.quotationMark:
case charCodes.apostrophe:
  this.readString(code);
  return;
```
There are plenty of other functions like these, such as readRegexp for regular expressions, readEscapedChar for escape sequences, and so on. Most of the remaining code is of this kind: each function is internally a small state machine, like the template-string reader I walked through in the last post.
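To close, here is a deliberately tiny state machine in the same spirit as those read* functions. It is not babel code, just a sketch of the shared pattern: advance a cursor, branch on the current character, and accumulate a value (with one escape case handled):

```js
// Toy reader in the style of the tokenizer's small state machines:
// scan a double-quoted string starting at `pos`, handling \" escapes.
function readString(input, pos) {
  let out = "";
  pos++; // skip the opening quote
  for (;;) {
    const ch = input[pos];
    if (ch === undefined) throw new SyntaxError("Unterminated string");
    if (ch === '"') return { value: out, end: pos + 1 };
    if (ch === "\\") {
      out += input[pos + 1]; // minimal escape handling: keep the next char
      pos += 2;
    } else {
      out += ch;
      pos++;
    }
  }
}

console.log(readString('say "hi\\"!" now', 4)); // { value: 'hi"!', end: 11 }
```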