Understanding the Go compiler: The Scanner

This is part of a series where I’ll walk you through the entire Go compiler, covering each phase from source code to executable. If you’ve ever wondered what happens when you run go build, you’re in the right place.

Note: This article is based on Go 1.25.3. The compiler internals may change in future versions, but the core concepts will likely remain the same.

I’m going to use the simplest example possible to guide us through the process—a classic “hello world” program:

package main

import "fmt"

func main() {
    fmt.Println("Hello world")
}

Let’s start with the very first step: the scanner.

What the Scanner Does

The Go scanner (also called the lexer) is the first component of the compiler. Its job is straightforward: convert your source code into tokens. Each token typically represents a word or symbol—things like package, main, {, (, or string literals.

Here’s the key thing to understand: the scanner reads your code one character at a time and doesn’t care about context. It doesn’t know whether you’re inside a function or declaring a variable. It just knows: “This sequence of characters forms a valid token” or “This is invalid.”

The scanner also handles automatic semicolon insertion. You might not write semicolons in your Go code, but the scanner adds them after certain tokens when it sees a newline. From your perspective, semicolons are optional. From the compiler’s perspective, they’re always there.

Two Scanner Implementations

Go actually has two scanner implementations:

  1. Standard library scanner (src/go/scanner/) - This is what you’d use if you’re writing Go tools that need to parse Go code.
  2. Compiler scanner (src/cmd/compile/internal/syntax/scanner.go) - This is the real deal, what the compiler actually uses.

We’re going to focus on the compiler scanner.

Tokens: The Scanner’s Output

Let’s see what tokens actually look like. When the scanner processes our hello world program, it produces this sequence:

Position   Token      Literal
--------   -----      -------
1:1        package    "package"
1:9        IDENT      "main"
1:14       ;          "\n"
3:1        import     "import"
3:8        STRING     "\"fmt\""
3:13       ;          "\n"
5:1        func       "func"
5:6        IDENT      "main"
5:10       (          ""
5:11       )          ""
5:13       {          ""
6:5        IDENT      "fmt"
6:8        .          ""
6:9        IDENT      "Println"
6:16       (          ""
6:17       STRING     "\"Hello world\""
6:30       )          ""
6:31       ;          "\n"
7:1        }          ""
7:2        ;          "\n"

Notice how the scanner automatically inserted semicolon tokens (their literal is the newline that triggered them) after main, after the "fmt" import string, at the end of the print statement, and after the closing brace. This is the automatic semicolon insertion I mentioned earlier.

Try It Yourself

The Go standard library includes a scanner package that lets you experiment with tokenization. Here’s a complete program you can run:

package main

import (
	"fmt"
	"go/scanner"
	"go/token"
)

func main() {
	src := []byte(`package main

import "fmt"

func main() {
    fmt.Println("Hello world")
}`)

	var s scanner.Scanner
	fset := token.NewFileSet()
	file := fset.AddFile("", fset.Base(), len(src))
	s.Init(file, src, nil, scanner.ScanComments)

	for {
		pos, tok, lit := s.Scan()
		if tok == token.EOF {
			break
		}
		fmt.Printf("%s\t%s\t%q\n", fset.Position(pos), tok, lit)
	}
}

Now that we understand what the scanner produces, let’s look at how it actually works.

Inside the Scanner

The scanner needs to be initialized before it can start scanning. This happens in the init method (src/cmd/compile/internal/syntax/scanner.go):

func (s *scanner) init(src io.Reader, errh func(line, col uint, msg string), mode uint) {
    s.source.init(src, errh)
    s.mode = mode
    s.nlsemi = false
}

This sets up three key things: the underlying source reader (where the code comes from), the scanning mode (whether to report comments, for example), and the semicolon insertion state (initially off).

Under the hood, the source reader initialization does the heavy lifting:

func (s *source) init(in io.Reader, errh func(line, col uint, msg string)) {
    s.in = in
    s.errh = errh

    if s.buf == nil {
        s.buf = make([]byte, nextSize(0))
    }
    s.buf[0] = sentinel
    s.ioerr = nil
    s.b, s.r, s.e = -1, 0, 0
    s.line, s.col = 0, 0
    s.ch = ' '
    s.chw = 0
}

This creates a buffered reader that’s optimized for Go code. The buffer (buf) stores chunks of source code, and three indices (b, r, e) track which parts have been read and which parts are being actively processed. The line and col fields track the current position for error reporting. The sentinel is a special marker that makes it faster to detect when we’ve reached the end of buffered content. Finally, ch holds the current character being examined (initialized to a space), and we’re ready to start reading.
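To make the sentinel trick concrete, here's a minimal sketch (my own simplified code, not the compiler's): with a sentinel byte appended to the buffer, the hot loop only tests the character itself and never needs a separate bounds check, because the sentinel can't match any token character:

```go
package main

import "fmt"

// sentinel is a marker byte for this sketch; the real scanner uses
// its own marker value at the end of the buffered data.
const sentinel = 0

// isLetter reports whether c is an ASCII letter.
func isLetter(c byte) bool {
	return 'a' <= c && c <= 'z' || 'A' <= c && c <= 'Z'
}

// scanIdent consumes letters starting at i. It never checks len(buf):
// the sentinel at the end is not a letter, so the loop always stops.
func scanIdent(buf []byte, i int) int {
	for isLetter(buf[i]) {
		i++
	}
	return i
}

func main() {
	src := append([]byte("main"), sentinel)
	end := scanIdent(src, 0)
	fmt.Println(string(src[:end])) // main
}
```

Without the sentinel, the loop condition would need an extra `i < len(buf)` check on every character, which adds up when scanning megabytes of source.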

Once initialized, the scanner is ready to start producing tokens. Every call to the next function advances through the source code until it finds the next token.

How Token Recognition Works

Here’s where the magic happens. Let’s walk through the next function in chunks.

First, the scanner handles semicolon insertion:

func (s *scanner) next() {
    nlsemi := s.nlsemi
    s.nlsemi = false

The nlsemi field tracks whether the scanner should insert a semicolon if it encounters a newline. This is how Go lets you skip writing semicolons—the scanner adds them for you after certain tokens.

Next, it skips whitespace:

 redo:
    // skip white space
    s.stop()
    startLine, startCol := s.pos()
    for s.ch == ' ' || s.ch == '\t' || s.ch == '\n' && !nlsemi || s.ch == '\r' {
        s.nextch()
    }

The stop call ends the previous token's text segment so the next token starts fresh. Then the scanner consumes whitespace until it hits something meaningful. Note the !nlsemi guard in the loop condition: a newline only counts as skippable whitespace when no semicolon insertion is pending.

After that, it records token metadata—specifically, where this token starts in the source file:

    // token start
    s.line, s.col = s.pos()
    s.blank = s.line > startLine || startCol == colbase
    s.start()

This captures the line and column where the token begins (for error messages), checks if the line was blank up to this point (useful for formatting tools), and marks the start of the token’s text in the buffer.

Now the scanner needs to figure out what kind of token this is. It does this by checking the first character. Let’s start with identifiers and keywords:

    if isLetter(s.ch) || s.ch >= utf8.RuneSelf && s.atIdentChar(true) {
        s.nextch()
        s.ident()
        return
    }

If the current character is a letter (or a valid Unicode identifier character), the scanner knows it’s looking at either a keyword (like package or func) or an identifier (like main or fmt). It consumes that first character with nextch(), then delegates to the ident method to read the rest of the characters and determine whether it’s a keyword or identifier:

func (s *scanner) ident() {
    // accelerate common case (7bit ASCII)
    for isLetter(s.ch) || isDecimal(s.ch) {
        s.nextch()
    }

    // general case
    if s.ch >= utf8.RuneSelf {
        for s.atIdentChar(false) {
            s.nextch()
        }
    }

    // possibly a keyword
    lit := s.segment()
    if len(lit) >= 2 {
        if tok := keywordMap[hash(lit)]; tok != 0 && tokStrFast(tok) == string(lit) {
            s.nlsemi = contains(1<<_Break|1<<_Continue|1<<_Fallthrough|1<<_Return, tok)
            s.tok = tok
            return
        }
    }

    s.nlsemi = true
    s.lit = string(lit)
    s.tok = _Name
}

Here’s what ident() does step by step:

  1. Read the identifier: It keeps consuming characters as long as they’re letters or digits (handling both ASCII and Unicode)
  2. Check if it’s a keyword: Once it has the complete word, it looks it up in Go’s keywordMap using a hash function for speed
  3. Return the appropriate token: If it finds a match in the keyword map, it returns that specific keyword token (like _Package or _Func). If there’s no match, it’s just a regular identifier, so it returns _Name and stores the actual text in s.lit

The nlsemi flag is also set here—it tells the scanner whether a semicolon should be automatically inserted after this token if a newline follows.

Handling Symbols and Operators

Remember, if the first character wasn’t a letter, the ident() path didn’t execute. Instead, the scanner continues in the next function with a large switch statement that checks what kind of character we’re looking at. This is where symbols, operators, numbers, strings, and other tokens get recognized. Let’s look at some examples.

End of file is simple:

switch s.ch {
case -1:
    if nlsemi {
        s.lit = "EOF"
        s.tok = _Semi
        break
    }
    s.tok = _EOF

When the scanner hits -1 (EOF), it returns the appropriate token. If it needs to insert a semicolon before EOF, it does that first.

Simple symbols are straightforward:

case ',':
    s.nextch()
    s.tok = _Comma

case ';':
    s.nextch()
    s.lit = "semicolon"
    s.tok = _Semi

A comma is a comma. A semicolon is a semicolon. Easy.

Multi-character operators:

case '+':
    s.nextch()
    s.op, s.prec = Add, precAdd
    if s.ch != '+' {
        goto assignop
    }
    s.nextch()
    s.nlsemi = true
    s.tok = _IncOp

Here’s where lookahead comes into play. When the scanner sees a +, it can’t immediately decide what token it is—it could be +, ++, or +=. So it consumes the + with nextch() and then checks what’s in s.ch (the next character in the stream) without consuming it yet. This is lookahead: peeking at the next character to make a decision.

If s.ch is another +, we have the increment operator ++, so we consume that second + and set the token. If it’s not, we jump to the assignop label to check if it’s += or just a plain +:

assignop:
    if s.ch == '=' {
        s.nextch()
        s.tok = _AssignOp
        return
    }
    s.tok = _Operator

If the next character is =, we have an assignment operator like +=. If not, it’s just the operator by itself, and the scanner doesn’t consume the next character—it leaves it for the next token.

More Complex Cases

I haven’t covered everything here. The scanner also handles string tokens (with escape sequences), numeric tokens (including floats, exponents, and different number bases like hex and binary), and comments. These follow similar patterns but with more complexity. If you’re curious, I encourage you to explore src/cmd/compile/internal/syntax/scanner.go yourself.

Walking Through an Example

We’ve covered a lot of ground—initialization, token recognition, lookahead, and the different code paths the scanner takes. Now let’s bring it all together by walking through our hello world program line by line. This will help you see how all these pieces work together in practice, from the first character to the final EOF token.

The scanner starts by reading p, which is a letter. It continues reading until it has the full word package. Then it checks whether package is a keyword. It is, so the scanner returns a package token.

Next, it reads m, another letter. It keeps reading: a, i, n. Now it has main. Is this a keyword? Nope. So the scanner returns an IDENT token with the literal "main".

Then it hits a newline. The previous token was an identifier, which means the scanner should insert a semicolon here. It does.

Next up is import, which is a keyword. The scanner returns the import token.

The scanner then encounters ", signaling the start of a string. It reads the entire string "fmt" and returns a STRING token.

After another newline (and semicolon insertion), the scanner sees func, another keyword.

Then main again—an identifier.

The characters (, ), and { are all single-character tokens, so the scanner emits each one immediately.

Next is fmt (identifier), followed by . (dot token), followed by Println (identifier).

Then ( (open paren), the string "Hello world" (string token), ) (close paren), and } (close brace).

Finally, the scanner inserts a semicolon after the closing brace, hits the end of the file, and returns an EOF token.

And that’s the complete tokenization of a hello world program.

Summary

The scanner is the first phase of the Go compiler. It reads your source code character by character and produces a stream of tokens—a much more structured representation that the rest of the compiler can work with.

We’ve seen how the scanner:

  • Automatically inserts semicolons so you don’t have to
  • Distinguishes between keywords and identifiers using a lookup table
  • Handles multi-character operators by peeking ahead
  • Processes different token types using a combination of lookahead and pattern matching

If you want to go deeper, I highly recommend reading through the actual scanner code. There are plenty of interesting details in how it handles strings, numbers, and edge cases that I didn’t cover here.

Want to Learn More?

If you’d like to continue exploring the Go compiler, the next post covers the parser—the component that takes this stream of tokens and builds an Abstract Syntax Tree, giving the compiler a structural understanding of your code.