Token

If you don’t know how it works, find out. If you’re not sure if it will work, try it. If it doesn’t make sense, play with it until it does. If it’s not broken, break it. If it might not be true, find out.
Seth Godin

Token Link to heading

All tokens are defined by a macro TOKEN_LIST. It takes a list of 3 macros M all of which satisfy the same signature M(name, string, precedence), where name is the symbolic token name, string is the corresponding syntactic symbol (or NULL, for literals), and precedence is the precedence (or 0). The parameters are invoked for token categories as follows:

T: Non-keyword tokens
K: Keyword tokens
F: Future (reserved) keyword tokens

// token.h
#define TOKEN_LIST(T, K, F)                                             \
  /* End of source indicator. */                                        \
  T(EOS, "EOS", 0)                                                      \
                                                                        \
  /* Punctuators (ECMA-262, section 7.7, page 15). */                   \
  T(LPAREN, "(", 0)                                                     \
  T(RPAREN, ")", 0)                                                     \
  T(LBRACK, "[", 0)                                                     \
  T(RBRACK, "]", 0)                                                     \
  /* some more tokens, and keywords */                                  \
  K(INSTANCEOF, "instanceof", 10)                                       \
  K(IN, "in", 10)                                                       \
  /* and future reserved words */                                       \
  /* Future reserved words (ECMA-262, section 7.5.3, page 14). */       \
  F(ABSTRACT, "abstract", 0)                                            \
  F(BOOLEAN, "boolean", 0)

Token class Link to heading

The Token class provides static values and methods to work with token list.

// token.h
class Token {
 public:
  // All token values.
#define T(name, string, precedence) name,
  enum Value {
    TOKEN_LIST(T, T, IGNORE_TOKEN)
    NUM_TOKENS
  };
#undef T

This combines with TOKEN_LIST will give a result of

class Token {
 public:

  enum Value {
    EOS, LPAREN, RPAREN, LBRACK, RBRACK, LBRACE, ...,
    NUM_TOKENS
  };

Conviniently, NUM_TOKENS ’s value is exact length of TOKEN_LIST. Use gcc src/token.h -E -D DEBUG to see the full source code of Token header generated by the macro. When debug mode is on, this macro will generate extra information to help debugging. Name(tok) will return the string corresponding to the C++ token name (e.g. “LT” for the token LT).

#ifdef DEBUG
  static const char* Name(Value tok) {
    ASSERT(0 <= tok && tok < NUM_TOKENS);
    return name_[tok];
  }
#endif

But we don’t see how precedence is managed yet. Let’s take a look into src/token.cc.

// token.cc
#include "token.h"

#ifdef DEBUG
#define T(name, string, precedence) #name,
const char* Token::name_[NUM_TOKENS] = {
  TOKEN_LIST(T, T, IGNORE_TOKEN)
};
#undef T
#endif

#define T(name, string, precedence) string,
const char* Token::string_[NUM_TOKENS] = {
  TOKEN_LIST(T, T, IGNORE_TOKEN)
};
#undef T

#define T(name, string, precedence) precedence,
int8_t Token::precedence_[NUM_TOKENS] = {
  TOKEN_LIST(T, T, IGNORE_TOKEN)
};
#undef T

Here we can see the Token::name_ is assigned with token names, and Token::string_ is assigned with actual token string. And Token::name_ only requires more memory if we are in debug mode. The precedences defined in TOKEN_LIST will be assigned to Token::precedence_.

Since token value is a number from 0 to NUM_TOKENS - 1, we need a way to quickly convert a token string to corresponding value, and vice versa.A Map<string, Token::Value> can work. But V8 does it the fast way by making Hash and Lookup helpers. If a given string does not match any value in hash table, it must be an identifier.

// token.cc
static unsigned int Hash(const char* s) {
  const unsigned int L = 5, S = 4, M = 3;

  unsigned int h = 0;
  for (unsigned int i = 0; s[i] != '\0' && i < L; i++) {
    h += (h << S) + s[i];
  }
  // unsigned int % by a power of 2 (otherwise this will not be a bit mask)
  return h * M % N;
}

Token::Value Token::Lookup(const char* str) {
  ASSERT(IsInitialized);
  Value k = static_cast<Value>(Hashtable[Hash(str)]);
  const char* s = string_[k];
  ASSERT(s != NULL || k == IDENTIFIER);
  if (s == NULL || strcmp(s, str) == 0) {
    return k;
  }
  return IDENTIFIER;
}

Question: If javascripts support more keywords such as async, await, private, static, how can we maintain logic of Hash and Lookup?

Answer: V8 uses gperf to auto generate a perfect hash table code.

Token class must be initialized once, checking for collisions, and verify the result again.