Babel Internals in Depth (3): The Tokenizer

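The previous post gave a rough overview of how babel-parser works. It has two main parts: the Tokenizer, which splits the source string into an array of Tokens, and the parser, which turns that Token array into an AST.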

This time we take a closer look at the Tokenizer's logic.

The Tokenizer's source consists of just four files: context.js, index.js, state.js, and types.js. Let's go through them one by one.

Type

First up is types.js. As the name suggests, this file defines the type of every Token.

Let's start with the definition of TokenType. Note that the official comments matter here: they help explain why these variables exist.

// The `beforeExpr` property is used to disambiguate between 1) binary
// expression (<) and JSX Tag start (<name>); 2) object literal and JSX
// texts. It is set on the `updateContext` function in the JSX plugin.

// The `startsExpr` property is used to determine whether an expression
// may be the “argument” subexpression of a `yield` expression or
// `yield` statement. It is set on all token types that may be at the
// start of a subexpression.

// `isLoop` marks a keyword as starting a loop, which is important
// to know when parsing a label, in order to allow or disallow
// continue jumps to that label.

const beforeExpr = true;
const startsExpr = true;
const isLoop = true;
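const isAssign = true; // whether this token marks an assignment, e.g. =
const prefix = true; // whether this token can be the prefix of a unary expression, e.g. !
const postfix = true; // whether this token can be the postfix of a unary expression, e.g. ++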

type TokenOptions = {
  keyword?: string,
  beforeExpr?: boolean,
  startsExpr?: boolean,
  rightAssociative?: boolean,
  isLoop?: boolean,
  isAssign?: boolean,
  prefix?: boolean,
  postfix?: boolean,
  binop?: ?number,
};

export class TokenType {
  label: string;
  keyword: ?string;
  beforeExpr: boolean;
  startsExpr: boolean;
  rightAssociative: boolean;
  isLoop: boolean;
  isAssign: boolean;
  prefix: boolean;
  postfix: boolean;
  binop: ?number;
  updateContext: ?(context: Array<TokContext>) => void;

  constructor(label: string, conf: TokenOptions = {}) {
    this.label = label;
    this.keyword = conf.keyword;
    this.beforeExpr = !!conf.beforeExpr;
    this.startsExpr = !!conf.startsExpr;
    this.rightAssociative = !!conf.rightAssociative;
    this.isLoop = !!conf.isLoop;
    this.isAssign = !!conf.isAssign;
    this.prefix = !!conf.prefix;
    this.postfix = !!conf.postfix;
    this.binop = conf.binop != null ? conf.binop : null;
    this.updateContext = null;
  }
}

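The structure itself is fairly simple, but the purpose of every parameter may not be obvious at first glance. That's fine; a few concrete examples will help.

Before getting to the examples, let's look at two helper functions: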

export const keywords = new Map<string, TokenType>();

function createKeyword(name: string, options: TokenOptions = {}): TokenType {
  options.keyword = name;
  const token = new TokenType(name, options);
  keywords.set(name, token);
  return token;
}

function createBinop(name: string, binop: number) {
  return new TokenType(name, { beforeExpr, binop });
}

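createKeyword creates a keyword TokenType. Together with the code above, we can see what makes a keyword TokenType special: its keyword field is set, and its value is the same as label. The new type is also registered in the keywords map.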

createBinop creates a TokenType for a binary operator. Its defining traits are that beforeExpr is true and binop has a concrete value.
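
To make the two helpers concrete, here is a minimal sketch that re-creates them in plain JavaScript. The Flow annotations are dropped and TokenType is trimmed to a handful of fields so the snippet runs on its own; it is an illustration, not the actual Babel source.

```javascript
// Trimmed-down TokenType: only the fields this example needs.
class TokenType {
  constructor(label, conf = {}) {
    this.label = label;
    this.keyword = conf.keyword;
    this.beforeExpr = !!conf.beforeExpr;
    this.isLoop = !!conf.isLoop;
    this.binop = conf.binop != null ? conf.binop : null;
  }
}

const beforeExpr = true;
const isLoop = true;
const keywords = new Map();

function createKeyword(name, options = {}) {
  options.keyword = name;
  const token = new TokenType(name, options);
  keywords.set(name, token);
  return token;
}

function createBinop(name, binop) {
  return new TokenType(name, { beforeExpr, binop });
}

const _do = createKeyword("do", { isLoop, beforeExpr });
const nullishCoalescing = createBinop("??", 1);

console.log(_do.keyword === _do.label); // true
console.log(keywords.get("do") === _do); // true
console.log(nullishCoalescing.beforeExpr, nullishCoalescing.binop); // true 1
console.log(_do.binop); // null
```

A keyword type can be looked up by name through the keywords map, while a binary operator type carries a number in binop and leaves keyword undefined.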

With that out of the way, let's look at a few concrete examples.

First, the types for basic value literals, such as:

num: new TokenType("num", { startsExpr }),
bigint: new TokenType("bigint", { startsExpr }),

Then the punctuation types:

bracketL: new TokenType("[", { beforeExpr, startsExpr }),
bracketR: new TokenType("]"),
question: new TokenType("?", { beforeExpr }),

Operator types:

eq: new TokenType("=", { beforeExpr, isAssign }),
incDec: new TokenType("++/--", { prefix, postfix, startsExpr }),
pipeline: createBinop("|>", 0),
nullishCoalescing: createBinop("??", 1),
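
As far as I understand it, the number passed to createBinop acts as the operator's precedence: a larger value binds tighter. The sketch below shows how a parser could use such a number with a small precedence-climbing loop. Only the values for "|>" and "??" come from the snippet above; the entries for "+" and "*" and the parseExpr function are made up for this illustration and are not Babel's parser.

```javascript
// Precedence table. "|>" -> 0 and "??" -> 1 are from the snippet above;
// "+" and "*" are assumed values chosen only for this example.
const binop = new Map([
  ["|>", 0],
  ["??", 1],
  ["+", 9],
  ["*", 10],
]);

// Parses a flat list like ["a", "+", "b", "*", "c"] into nested
// [operator, left, right] arrays. Consumes the tokens array as it goes.
function parseExpr(tokens, minPrec = -1) {
  let left = tokens.shift(); // an operand
  while (tokens.length > 0) {
    const prec = binop.get(tokens[0]);
    // Stop at anything that is not an operator, or that binds no tighter
    // than the operator that called us.
    if (prec === undefined || prec <= minPrec) break;
    const op = tokens.shift();
    const right = parseExpr(tokens, prec);
    left = [op, left, right];
  }
  return left;
}

console.log(JSON.stringify(parseExpr(["a", "+", "b", "*", "c"])));
// ["+","a",["*","b","c"]]
console.log(JSON.stringify(parseExpr(["a", "*", "b", "+", "c"])));
// ["+",["*","a","b"],"c"]
```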

Keyword types:

_break: createKeyword("break"),
_case: createKeyword("case", { beforeExpr }),
_default: createKeyword("default", { beforeExpr }),
_do: createKeyword("do", { isLoop, beforeExpr }),

Context

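There is one field of TokenType we haven't covered yet: updateContext. This function is related to context.js and is used by the JSX plugin. Let's take a look.

Right at the top of the file, a comment explains what the context is for: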

// The token context is used to track whether the apostrophe "`"
// starts or ends a string template

In other words, it is used when analyzing template strings. Its type definition is even simpler:

export class TokContext {
  constructor(token: string, preserveSpace?: boolean) {
    this.token = token;
    this.preserveSpace = !!preserveSpace;
  }

  token: string;
  preserveSpace: boolean;
}

// A template string can contain nested template strings, so a stack of
// contexts is kept to check that every template string gets closed. The
// stack only ever holds two kinds of entries: braces and backquotes.
export const types: {
  [key: string]: TokContext,
} = {
  brace: new TokContext("{"),
  template: new TokContext("`", true),
};

The rest of the file attaches an updateContext method to every type that can be involved with template strings, so that the context stack is maintained during parsing.

/* 
backQuote: new TokenType("`", { startsExpr }),
braceL: new TokenType("{", { beforeExpr, startsExpr }),
braceHashL: new TokenType("#{", { beforeExpr, startsExpr }),
dollarBraceL: new TokenType("${", { beforeExpr, startsExpr }),
braceR: new TokenType("}", { beforeExpr }),
*/

tt.braceR.updateContext = context => {
  context.pop();
};

// we don't need to update context for tt.braceBarL because we do not pop context for tt.braceBarR
// ideally only dollarBraceL "${" needs a non-template context
// in order to indicate that the last "`" in `${`" starts a new string template
// inside a template element within outer string template.
// but when we popped such context in `}`, we lost track of whether this
// `}` matches a `${` or other tokens matching `}`, so we have to push
// such context in every token that `}` will match.
tt.braceL.updateContext =
  tt.braceHashL.updateContext =
  tt.dollarBraceL.updateContext =
    context => {
      context.push(types.brace);
    };

tt.backQuote.updateContext = context => {
  // On a backquote, check whether the top of the stack is a template
  // context. If it is, the previous backquote has already been pushed
  // (the else branch ran once), so pop that template context off.
  if (context[context.length - 1] === types.template) {
    context.pop();
  } else {
    context.push(types.template);
  }
};
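
To see how the stack evolves, the following sketch replays the same push and pop rules over a hand-written list of token labels for a nested template string. The trace function and the label-keyed updateContext table are hypothetical helpers for this example. The real tokenizer also emits tokens for the text chunks between these punctuators; they are left out here because they have no updateContext in the code above.

```javascript
class TokContext {
  constructor(token, preserveSpace) {
    this.token = token;
    this.preserveSpace = !!preserveSpace;
  }
}

const types = {
  brace: new TokContext("{"),
  template: new TokContext("`", true),
};

// Same rules as the updateContext functions above, keyed by token label.
const updateContext = {
  "}": context => {
    context.pop();
  },
  "{": context => {
    context.push(types.brace);
  },
  "${": context => {
    context.push(types.brace);
  },
  "`": context => {
    if (context[context.length - 1] === types.template) {
      context.pop();
    } else {
      context.push(types.template);
    }
  },
};

// Replays a list of token labels and records the stack after each one.
function trace(labels) {
  const context = [types.brace]; // same initial value as State.context
  const steps = [];
  for (const label of labels) {
    const update = updateContext[label];
    if (update) update(context);
    steps.push(context.map(c => c.token).join(" "));
  }
  return steps;
}

// Punctuator labels for the source text: `a${ `b` }c`
console.log(trace(["`", "${", "`", "`", "}", "`"]));
// [ '{ `', '{ ` {', '{ ` { `', '{ ` {', '{ `', '{' ]
```

The third backquote is pushed rather than popped because the top of the stack is a brace at that point, which is exactly why "${" has to push a brace context.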

State

init

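This file defines the state of the parse in progress: the current position, which line, which column, which character offset, and so on.

The whole file is the declaration of the State class, so all of the code that follows lives inside that class.

First, initialization. The class has no constructor; instead it has an init function that the caller must invoke.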

strict: boolean;
curLine: number;

// And, if locations are used, the {line, column} object
// corresponding to those offsets
startLoc: Position;
endLoc: Position;

// The current position of the tokenizer in the input.
pos: number = 0;
lineStart: number = 0;

init(options: Options): void {
  this.strict =
    options.strictMode === false ? false : options.sourceType === "module";

  this.curLine = options.startLine;
  this.startLoc = this.endLoc = this.curPosition();
}

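The init function does very little. First, it determines and stores whether strict mode is on.

It also sets curLine to the start line passed in; curLine keeps changing as parsing proceeds.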

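Finally it initializes startLoc and endLoc. The State class itself has no method that changes them afterwards; they are reassigned from outside by the tokenizer as tokens are read (next(), shown later, copies them into lastTokStartLoc and lastTokEndLoc before moving on).

The curPosition function is just as simple: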
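
The strict-mode expression in init is compact enough that it helps to evaluate it for a few option combinations. The sketch below copies the expression into a standalone function; initialStrict is a name invented for this example.

```javascript
// Standalone copy of the expression used in init (the function name is made up).
function initialStrict(options) {
  return options.strictMode === false ? false : options.sourceType === "module";
}

console.log(initialStrict({ sourceType: "module" })); // true
console.log(initialStrict({ sourceType: "script" })); // false
console.log(initialStrict({ strictMode: false, sourceType: "module" })); // false
console.log(initialStrict({ strictMode: true, sourceType: "script" })); // false
```

Going by this expression alone, strictMode can only turn strict mode off at initialization: passing strictMode: true with a script source still leaves strict at false here.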

curPosition(): Position {
  return new Position(this.curLine, this.pos - this.lineStart);
}

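The first argument to Position is the line and the second is the column.

curLine has just been initialized to startLine, while pos and lineStart still have their default value of 0, so the initial column is 0.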
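
The column is simply the distance from the start of the current line. Here is a small sketch of that arithmetic; MiniState and the advance helper are hypothetical stand-ins for the real State and for whatever code moves pos forward, written only to show how curLine and lineStart feed into curPosition.

```javascript
class Position {
  constructor(line, column) {
    this.line = line;
    this.column = column;
  }
}

// Hypothetical cut-down State with only the fields curPosition needs.
class MiniState {
  constructor(startLine) {
    this.pos = 0;
    this.lineStart = 0;
    this.curLine = startLine;
  }

  curPosition() {
    return new Position(this.curLine, this.pos - this.lineStart);
  }
}

// Hypothetical helper: step over `count` characters, and on each newline
// bump the line counter and remember where the next line starts.
function advance(state, input, count) {
  for (let i = 0; i < count; i++) {
    if (input[state.pos] === "\n") {
      state.curLine++;
      state.lineStart = state.pos + 1;
    }
    state.pos++;
  }
}

const input = "ab\ncd";
const state = new MiniState(1);
const before = state.curPosition();
advance(state, input, 4); // pos now points at "d"
const after = state.curPosition();

console.log(before.line, before.column); // 1 0
console.log(after.line, after.column); // 2 1
```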

clone

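Besides the two functions above there is also clone, which copies a State. It is not a full deep copy: arrays are copied with slice (unless skipArrays is set) and every other value is copied by reference. There is no special logic; it's included here for anyone interested.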

clone(skipArrays?: boolean): State {
  const state = new State();
  const keys = Object.keys(this);
  for (let i = 0, length = keys.length; i < length; i++) {
    const key = keys[i];
    // $FlowIgnore
    let val = this[key];

    if (!skipArrays && Array.isArray(val)) {
      val = val.slice();
    }

    // $FlowIgnore
    state[key] = val;
  }

  return state;
}
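
What clone does and does not copy is easiest to see by running it. The sketch below applies the same loop to a hypothetical MiniState with one number, one array, and one object field; the field values are invented for the example.

```javascript
// Hypothetical cut-down State with one field of each interesting kind.
class MiniState {
  constructor() {
    this.pos = 0;
    this.context = ["{"];
    this.startLoc = { line: 1, column: 0 };
  }

  // Same logic as the clone method above, minus the Flow annotations.
  clone(skipArrays) {
    const state = new MiniState();
    for (const key of Object.keys(this)) {
      let val = this[key];
      if (!skipArrays && Array.isArray(val)) {
        val = val.slice();
      }
      state[key] = val;
    }
    return state;
  }
}

const original = new MiniState();
const copy = original.clone();
copy.pos = 10;
copy.context.push("`");

console.log(original.pos, original.context.length); // 0 1
console.log(copy.startLoc === original.startLoc); // true

const sharing = original.clone(true);
console.log(sharing.context === original.context); // true
```

Arrays get their own copy by default, so pushing onto the clone's context stack leaves the original untouched, while object values such as startLoc stay shared between the two states.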

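Other properties

State has a large number of properties, all of which hold parsing state used to build the AST. Many of them are hard to appreciate out of context, so here are just a few whose purpose is clear at a glance: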

// The current position of the tokenizer in the input.
pos: number = 0;
lineStart: number = 0;

// Properties of the current token:
// Its type
type: TokenType = tt.eof;

// For tokens that include more information than their type, the value
value: any = null;

// Its start and end offset
start: number = 0;
end: number = 0;

// Position information for the previous token
// $FlowIgnore this is initialized when generating the second token.
lastTokEndLoc: Position = null;
// $FlowIgnore this is initialized when generating the second token.
lastTokStartLoc: Position = null;
lastTokStart: number = 0;
lastTokEnd: number = 0;

// The context stack is used to track whether the apostrophe "`" starts
// or ends a string template
context: Array<TokContext> = [ct.brace];
// Used to track whether a JSX element is allowed to form
exprAllowed: boolean = true;

// Used to signal to callers of `readWord1` whether the word
// contained any escape sequences. This is needed because words with
// escape sequences must not be interpreted as keywords.
containsEsc: boolean = false;

// This property is used to track the following errors
// - StrictNumericEscape
// - StrictOctalLiteral
//
// in a literal that occurs prior to/immediately after a "use strict" directive.

// todo(JLHwung): set strictErrors to null and avoid recording string errors
// after a non-directive is parsed
strictErrors: Map<number, ErrorTemplate> = new Map();

// Tokens length in token store
tokensLength: number = 0;

index

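Last comes the main Tokenizer logic, close to 1,600 lines. Let's take it slowly.

Although there is a lot of code, the core idea is not complicated. We start with the constructor: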

isLookahead: boolean;

// Token store.
tokens: Array<Token | N.Comment> = [];

constructor(options: Options, input: string) {
  super();
  this.state = new State();
  this.state.init(options);
  this.input = input;
  this.length = input.length;
  this.isLookahead = false;
}

This creates the State and calls its init function.

Then come the entry points of the parse, that is, the methods called from outside, such as next:

pushToken(token: Token | N.Comment) {
  // Pop out invalid tokens trapped by try-catch parsing.
  // Those parsing branches are mainly created by typescript and flow plugins.
  this.tokens.length = this.state.tokensLength;
  this.tokens.push(token);
  ++this.state.tokensLength;
}

// Move to the next token

next(): void {
  this.checkKeywordEscapes();
  if (this.options.tokens) {
    this.pushToken(new Token(this.state));
  }

  this.state.lastTokEnd = this.state.end;
  this.state.lastTokStart = this.state.start;
  this.state.lastTokEndLoc = this.state.endLoc;
  this.state.lastTokStartLoc = this.state.startLoc;
  this.nextToken();
}

// TODO

eat(type: TokenType): boolean {
  if (this.match(type)) {
    this.next();
    return true;
  } else {
    return false;
  }
}

// TODO

match(type: TokenType): boolean {
  return this.state.type === type;
}
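
The interplay of match, eat, next, and pushToken can be tried out on a toy model. In the sketch below, MiniTokenizer is a hypothetical class that walks a pre-split list of token type names instead of scanning source text, and it stores every token unconditionally rather than checking options.tokens. The rollback at the end imitates a speculative parse that restores an earlier state, which is my reading of what the comment in pushToken refers to.

```javascript
// Hypothetical toy tokenizer over pre-split token type names.
class MiniTokenizer {
  constructor(typeNames) {
    this.input = typeNames;
    this.tokens = []; // token store
    this.state = {
      index: 0,
      type: typeNames.length > 0 ? typeNames[0] : "eof",
      tokensLength: 0,
    };
  }

  pushToken(token) {
    // Drop anything stored past the length recorded in the current state.
    this.tokens.length = this.state.tokensLength;
    this.tokens.push(token);
    ++this.state.tokensLength;
  }

  next() {
    this.pushToken(this.state.type);
    this.state.index++;
    this.state.type =
      this.state.index < this.input.length
        ? this.input[this.state.index]
        : "eof";
  }

  match(type) {
    return this.state.type === type;
  }

  eat(type) {
    if (this.match(type)) {
      this.next();
      return true;
    }
    return false;
  }
}

const t = new MiniTokenizer(["name", "=", "num"]);
const ateEq = t.eat("="); // false: the current token is "name", nothing moves
const ateName = t.eat("name"); // true: "name" is stored, current token is now "="

// Speculative branch: remember the state, consume two tokens, then roll back.
const saved = { ...t.state };
t.next();
t.next();
const storedDuringBranch = t.tokens.length; // 3
t.state = saved;

// The next push truncates the store back to the saved length first.
t.next();
console.log(ateEq, ateName, storedDuringBranch, t.tokens); // false true 3 [ 'name', '=' ]
```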

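nextToken was covered in the previous post. It does two main things: handle template strings, and call getTokenFromCode. getTokenFromCode is the important one: it checks for each possible case and dispatches to a different method, for example: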

// Anything else beginning with a digit is an integer, octal
// number, or float. (fall through)
case charCodes.digit1:
case charCodes.digit2:
case charCodes.digit3:
case charCodes.digit4:
case charCodes.digit5:
case charCodes.digit6:
case charCodes.digit7:
case charCodes.digit8:
case charCodes.digit9:
  this.readNumber(false);
  return;

// Quotes produce strings.
case charCodes.quotationMark:
case charCodes.apostrophe:
  this.readString(code);
  return;

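There are many more functions, such as readRegexp for reading regular expressions, readEscapedChar, and so on. Most of the code looks like this, and each function is internally a small state machine, like the template string reader described in the previous post.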
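
As a rough picture of this dispatch-then-read pattern, here is a self-contained toy version. The tokenize function and its two readers are written for this post and are far simpler than Babel's: only decimal integers, spaces, and quoted strings without escape sequences are handled, and the character codes are written as plain numbers instead of coming from a charCodes module.

```javascript
// Toy tokenizer: switch on the current character code, then hand off to a
// small reader that loops until its token ends.
function tokenize(input) {
  const tokens = [];
  let pos = 0;

  function isDigit(code) {
    return code >= 48 && code <= 57; // "0".."9"
  }

  function readNumber() {
    const start = pos;
    while (pos < input.length && isDigit(input.charCodeAt(pos))) pos++;
    tokens.push({ type: "num", value: Number(input.slice(start, pos)) });
  }

  function readString(quote) {
    const start = ++pos; // skip the opening quote
    while (pos < input.length && input.charCodeAt(pos) !== quote) pos++;
    if (pos >= input.length) throw new SyntaxError("Unterminated string");
    tokens.push({ type: "string", value: input.slice(start, pos) });
    pos++; // skip the closing quote
  }

  while (pos < input.length) {
    const code = input.charCodeAt(pos);
    switch (code) {
      case 32: // space
        pos++;
        break;
      case 48: case 49: case 50: case 51: case 52:
      case 53: case 54: case 55: case 56: case 57:
        readNumber();
        break;
      case 34: // double quote
      case 39: // single quote
        readString(code);
        break;
      default:
        throw new SyntaxError("Unexpected character at " + pos);
    }
  }
  return tokens;
}

console.log(tokenize(`12 "ab" 'c'`));
// [ { type: 'num', value: 12 }, { type: 'string', value: 'ab' }, { type: 'string', value: 'c' } ]
```

Passing the quote's character code into readString mirrors the readString(code) call in the excerpt: the reader needs to know which quote opened the string in order to find the matching one.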