If you don’t know how it works, find out. If you’re not sure if it will work, try it. If it doesn’t make sense, play with it until it does. If it’s not broken, break it. If it might not be true, find out.
Seth Godin
Token Link to heading
All tokens are defined by a macro TOKEN_LIST
. It takes a list of 3 macros M all of which satisfy the same signature M(name, string, precedence), where name is the symbolic token name, string is the corresponding syntactic symbol (or NULL, for literals), and precedence is the precedence (or 0).
The parameters are invoked for token categories as follows:
- T: Non-keyword tokens
- K: Keyword tokens
- F: Future (reserved) keyword tokens
// token.h
#define TOKEN_LIST(T, K, F) \
/* End of source indicator. */ \
T(EOS, "EOS", 0) \
\
/* Punctuators (ECMA-262, section 7.7, page 15). */ \
T(LPAREN, "(", 0) \
T(RPAREN, ")", 0) \
T(LBRACK, "[", 0) \
T(RBRACK, "]", 0) \
/* some more tokens, and keywords */ \
K(INSTANCEOF, "instanceof", 10) \
K(IN, "in", 10) \
/* and future reserved words */ \
/* Future reserved words (ECMA-262, section 7.5.3, page 14). */ \
F(ABSTRACT, "abstract", 0) \
F(BOOLEAN, "boolean", 0)
Token class Link to heading
The Token
class provides static values and methods to work with token list.
// token.h
class Token {
public:
// All token values.
#define T(name, string, precedence) name,
enum Value {
TOKEN_LIST(T, T, IGNORE_TOKEN)
NUM_TOKENS
};
#undef T
This combines with TOKEN_LIST
will give a result of
class Token {
public:
enum Value {
EOS, LPAREN, RPAREN, LBRACK, RBRACK, LBRACE, ...,
NUM_TOKENS
};
Conviniently, NUM_TOKENS
’s value is exact length of TOKEN_LIST
.
Use gcc src/token.h -E -D DEBUG
to see the full source code of Token
header generated by the macro. When debug mode is on, this macro will generate extra information to help debugging. Name(tok)
will return the string corresponding to the C++ token name (e.g. “LT” for the token LT).
#ifdef DEBUG
static const char* Name(Value tok) {
ASSERT(0 <= tok && tok < NUM_TOKENS);
return name_[tok];
}
#endif
But we don’t see how precedence is managed yet. Let’s take a look into src/token.cc
.
// token.cc
#include "token.h"
#ifdef DEBUG
#define T(name, string, precedence) #name,
const char* Token::name_[NUM_TOKENS] = {
TOKEN_LIST(T, T, IGNORE_TOKEN)
};
#undef T
#endif
#define T(name, string, precedence) string,
const char* Token::string_[NUM_TOKENS] = {
TOKEN_LIST(T, T, IGNORE_TOKEN)
};
#undef T
#define T(name, string, precedence) precedence,
int8_t Token::precedence_[NUM_TOKENS] = {
TOKEN_LIST(T, T, IGNORE_TOKEN)
};
#undef T
Here we can see the Token::name_
is assigned with token names, and Token::string_
is assigned with actual token string. And Token::name_
only requires more memory if we are in debug mode.
The precedences defined in TOKEN_LIST
will be assigned to Token::precedence_
.
Since token value is a number from 0
to NUM_TOKENS - 1
, we need a way to quickly convert a token string to corresponding value, and vice versa.A Map<string, Token::Value>
can work. But V8 does it the fast way by making Hash
and Lookup
helpers. If a given string does not match any value in hash table, it must be an identifier.
// token.cc
static unsigned int Hash(const char* s) {
const unsigned int L = 5, S = 4, M = 3;
unsigned int h = 0;
for (unsigned int i = 0; s[i] != '\0' && i < L; i++) {
h += (h << S) + s[i];
}
// unsigned int % by a power of 2 (otherwise this will not be a bit mask)
return h * M % N;
}
Token::Value Token::Lookup(const char* str) {
ASSERT(IsInitialized);
Value k = static_cast<Value>(Hashtable[Hash(str)]);
const char* s = string_[k];
ASSERT(s != NULL || k == IDENTIFIER);
if (s == NULL || strcmp(s, str) == 0) {
return k;
}
return IDENTIFIER;
}
Question: If javascripts support more keywords such as async, await, private, static
, how can we maintain logic of Hash
and Lookup
?
Answer: V8 uses gperf
to auto generate a perfect hash table code.
Token
class must be initialized once, checking for collisions, and verify the result again.