In the previous article, we explored how PostgreSQL establishes connections and communicates using its wire protocol. Once your connection is established and the backend process is ready, you can finally send queries. But when PostgreSQL receives your SQL, it’s just a string of text—the database can’t execute text directly.
Let me show you what happens when PostgreSQL receives this query:
SELECT name FROM users WHERE id = 42;
PostgreSQL doesn’t see this as a command yet. It sees characters: S, E, L, E, C, T, and so on. The journey from this raw text to something PostgreSQL can execute involves two major transformations: parsing (understanding structure) and semantic analysis (adding meaning).
In this article, we’ll explore both phases—how PostgreSQL validates your SQL syntax and then ensures it actually makes sense in the context of your database.
The Two-Phase Transformation
PostgreSQL breaks this first stage of query processing into two phases.
Phase 1: Parsing transforms your SQL text into a tree structure using only grammar rules. First, it breaks the text into tokens—things like keywords, identifiers, and numbers. Then it organizes those tokens into a tree that shows how the pieces fit together.
Phase 2: Semantic Analysis takes that tree and adds meaning. This is where PostgreSQL looks up table and column names in the database catalog—PostgreSQL’s own internal tables where it keeps track of what exists in your database and what properties those things have.
It also checks if your operations make sense. Can you compare an integer to a text string? Do the types match up?
And it records what permissions the query will need (though the actual permission check happens later, when the query runs).
The result? A complete query representation that PostgreSQL can optimize and execute.
Here’s the complete pipeline:
digraph pipeline {
rankdir=LR;
node [shape=box, style=rounded];
SQLText [label="SQL Text", shape=box, style="rounded,filled", fillcolor=lightblue];
Lexer [label="Lexer", shape=ellipse, fillcolor=lightyellow, style=filled];
Tokens [label="Tokens", shape=box, style="rounded,filled", fillcolor=lightgreen];
Parser [label="Parser", shape=ellipse, fillcolor=lightyellow, style=filled];
ParseTree [label="Parse Tree", shape=box, style="rounded,filled", fillcolor=lightgreen];
Analyzer [label="Analyzer", shape=ellipse, fillcolor=lightyellow, style=filled];
QueryTree [label="Query Tree", shape=box, style="rounded,filled", fillcolor=lightgreen];
SQLText -> Lexer -> Tokens -> Parser -> ParseTree -> Analyzer -> QueryTree;
}
Why separate these phases? Performance and clarity. The parser can catch syntax errors instantly using only CPU and memory—no database lookups needed, no transaction required. When you write SELECT name FROM users WHER id = 42 (notice the typo), PostgreSQL rejects it during parsing without ever checking if users exists. No transaction begins, no locks are acquired, no expensive catalog access occurs.
This separation provides fast failure for syntax errors while deferring expensive semantic validation (which requires transaction and catalog access) until it’s actually needed.
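You can see the difference directly in psql. The first query below dies in the parser; the second is grammatically fine, so it reaches semantic analysis before failing (assuming no table called no_such_table exists):

```sql
-- Rejected by the parser: pure grammar check, no catalog access
SELECT name FROM users WHER id = 42;
-- ERROR:  syntax error at or near "WHER"

-- Grammatically valid, so it survives parsing and fails in the analyzer
SELECT name FROM no_such_table;
-- ERROR:  relation "no_such_table" does not exist
```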
Let’s trace our example query through both phases.
Phase 1: Parsing
Phase 1 transforms raw SQL text into a structured parse tree using only grammatical rules. This happens in two steps: lexical analysis (breaking text into tokens) and syntactic analysis (building those tokens into a tree).
Breaking Text into Tokens
The lexer reads your SQL character by character, grouping them into meaningful tokens. Think of it like reading a sentence and identifying each word’s role—except here we’re identifying keywords, identifiers, numbers, and operators.
PostgreSQL uses a tool called Flex to implement its lexer (you can see the lexer definition in src/backend/parser/scan.l). Let’s watch it break down our query:
SELECT name FROM users WHERE id = 42;
The lexer transforms this into a stream of tokens:
SELECT → keyword token
name → identifier token
FROM → keyword token
users → identifier token
WHERE → keyword token
id → identifier token
= → operator token
42 → number token
; → terminator token
Each token carries two pieces of information: its type (what kind of thing it is) and its value (the actual text).
Not all tokens are created equal. Keywords have the highest priority—when the lexer sees letters, it first checks whether they form a keyword before treating them as an identifier. This is why you can’t use reserved keywords like SELECT or FROM as table names without quoting them:
SELECT select FROM users; -- Syntax error!
SELECT "select" FROM users; -- This works!
If it’s not a keyword, the lexer treats it as an identifier—a name for a table, column, or function. Identifiers must start with a letter or underscore and can contain letters, digits, underscores, and dollar signs. Want special characters or preserved case? Use quoted identifiers.
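Quoting also affects case. Unquoted identifiers are folded to lowercase before lookup, so these two queries can hit different tables:

```sql
SELECT id FROM Users;    -- unquoted: folded to lowercase, looks up "users"
SELECT id FROM "Users";  -- quoted: case preserved, looks up a table literally named "Users"
```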
Numbers get their own treatment. Numeric literals can be integers like 42, decimals like 3.14159, or scientific notation like 1.23e10. Similarly, string literals use single quotes ('Hello World') or PostgreSQL’s dollar-quote syntax ($tag$content$tag$) when you need to avoid escaping quotes inside the string.
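For example, all three of these produce the same string, but the dollar-quoted forms avoid doubling the embedded quote:

```sql
SELECT 'It''s a string';        -- a single quote inside must be doubled
SELECT $$It's a string$$;       -- dollar quoting: no escaping needed
SELECT $tag$It's a string$tag$; -- tagged form, useful when $$ appears in the content
```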
Finally, operators and punctuation like =, <>, ,, and ; each get their own token types. These are the glue that connects everything together.
With these tokens identified, the parser can start building structure.
Building the Parse Tree
The parser takes the flat stream of tokens and organizes it into a hierarchical tree that captures relationships in your query. PostgreSQL uses Bison, a parser generator, to implement its grammar (you can see the grammar rules in src/backend/parser/gram.y).
Grammar rules work like recipes—they specify how tokens combine to form valid SQL constructs. Here are simplified rules for SELECT statements:
SelectStmt: SELECT target_list FROM from_list WHERE where_clause
target_list: target_item
| target_list ',' target_item
expression: identifier
| number
| expression operator expression
For our query SELECT name FROM users WHERE id = 42, the parser builds this structure:
digraph parse_tree {
rankdir=TB;
node [shape=box, style=rounded];
SelectStmt [label="SelectStmt", style="filled,rounded", fillcolor=lightblue];
TargetList [label="target_list"];
FromClause [label="from_clause"];
WhereClause [label="where_clause"];
TargetItem [label="target_item\n'name'"];
Table [label="table\n'users'"];
Comparison [label="comparison\n'id = 42'"];
SelectStmt -> TargetList;
SelectStmt -> FromClause;
SelectStmt -> WhereClause;
TargetList -> TargetItem;
FromClause -> Table;
WhereClause -> Comparison;
}
This tree captures the query’s structure—a SELECT statement with one column, one table, and a WHERE clause.
But here’s the key: at this stage, PostgreSQL doesn’t know what anything means. The parse tree shows structure, but PostgreSQL doesn’t know if users exists, if name is a valid column, or if comparing id to 42 makes sense. That semantic validation comes next.
Phase 2: Adding Meaning Through Semantic Analysis
Semantic analysis transforms the parse tree into a Query tree—a complete, validated representation that PostgreSQL can optimize and execute. The analyzer must answer critical questions about our query:
- Does the users table exist? Where is it?
- Is name a valid column in users? What’s its type?
- Is id a valid column? What’s its type?
- Can we compare id to 42? Are the types compatible?
- What permissions will be needed to execute this query?
As the analyzer works through the query, it tracks which tables are available, what columns they contain, and what context is being processed.
Let’s explore the three main operations of semantic analysis.
Resolving Names: What Does Everything Refer To?
Name resolution is the most fundamental operation. PostgreSQL must determine what every identifier refers to by consulting the database catalog.
Table Resolution
When you write SELECT * FROM users, PostgreSQL needs to figure out which users table you mean. You might have tables named users in multiple schemas (public.users, hr.users, sales.users).
PostgreSQL uses the search path—an ordered list of schemas to search. You can see yours:
SHOW search_path;
search_path
-----------------
"$user", public
(1 row)
PostgreSQL searches schemas in order until it finds a matching table. You can be explicit by qualifying the table name:
SELECT * FROM public.users; -- Skip search path, use this specific table
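You can also change the search path for your session. Assuming an hr schema exists, this makes it win the lookup:

```sql
SET search_path TO hr, public;
SELECT * FROM users;  -- resolves to hr.users if it exists, falling back to public.users
```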
Once PostgreSQL finds the correct table, it needs to record this information for the rest of query processing.
PostgreSQL assigns every database object (tables, indexes, functions, etc.) a unique number called an OID (Object Identifier). When the analyzer finds the users table, it looks up its OID—let’s say it’s 16384. From this point forward, PostgreSQL can refer to this specific table by its OID instead of searching by name again.
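You can look up a table’s OID yourself through the pg_class catalog (the actual number varies by installation):

```sql
SELECT oid, relname, relnamespace::regnamespace AS schema
FROM pg_class
WHERE relname = 'users';
```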
The analyzer also creates a Range Table Entry (RTE)—a data structure that holds all the information about how this query uses the table. The RTE includes the table’s OID, which columns are accessed, what alias it has (if any), and what permissions are needed. Think of the range table as a directory at the front of the query that lists all the tables involved. Other parts of the query can then reference “table #1 in the range table” instead of repeatedly looking up table names.
With tables resolved and added to the range table, PostgreSQL can now figure out what each column reference in the query actually means.
Column Resolution
For column references, PostgreSQL must match names against available tables. Consider:
SELECT name, u.email, p.title
FROM users u
JOIN posts p ON u.id = p.user_id;
Unqualified columns (name): PostgreSQL searches all tables in scope. If exactly one table contains the column, resolution succeeds. If multiple tables have it, that’s ambiguous. If no table has it, that’s an error.
Qualified columns (u.email, p.title): PostgreSQL resolves the alias first, then looks for the column within that specific table.
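You can trigger the ambiguous case yourself when two joined tables share a column name:

```sql
SELECT id            -- both users and posts have an "id" column
FROM users u
JOIN posts p ON u.id = p.user_id;
-- ERROR:  column reference "id" is ambiguous
```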
The analyzer consults pg_attribute (the system catalog containing column information) to verify each column exists and records its type and attributes.
You can explore the same catalogs PostgreSQL uses:
SELECT column_name, data_type
FROM information_schema.columns
WHERE table_name = 'users';
column_name | data_type
-------------+-----------
id | integer
name | text
email | text
active | boolean
(4 rows)
Once PostgreSQL finds a column, it records this as a reference to the Range Table Entry—something like “table #1, column #3” instead of storing the text name. These numeric references are much faster for later processing.
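Those column numbers come straight from pg_attribute. You can see them directly—attnum is the column number the analyzer records in its references:

```sql
SELECT attnum, attname, atttypid::regtype AS type
FROM pg_attribute
WHERE attrelid = 'users'::regclass
  AND attnum > 0          -- skip system columns like ctid and xmin
  AND NOT attisdropped;   -- skip columns removed by ALTER TABLE
```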
Now that PostgreSQL knows what tables and columns you’re referencing, it needs to verify that the operations you’re performing actually make sense.
Type Checking: Making Sure Operations Make Sense
Knowing the tables and columns isn’t enough—PostgreSQL also has to check that the operations you apply to them are valid. Can you multiply a number by a text string? Can you compare an integer to a date? Type checking catches these mismatches before the query ever runs.
Let’s look at a query:
SELECT * FROM products
WHERE price > 19.99
AND quantity > '5';
For price > 19.99, PostgreSQL sees you’re comparing a numeric column to a numeric literal. That works fine.
For quantity > '5', you’re comparing an integer column to a text string. PostgreSQL automatically converts '5' to an integer so the comparison works. This is called type coercion—PostgreSQL tries to make compatible types work together when it’s safe to do so.
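You can spell out the coercion with an explicit cast, which is effectively what the analyzer inserts for you—and if the literal can’t be converted, the query fails during analysis:

```sql
SELECT * FROM products WHERE quantity > '5'::integer;  -- the implicit coercion, made explicit

SELECT * FROM products WHERE quantity > 'five';
-- ERROR:  invalid input syntax for type integer: "five"
```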
But not everything can be converted:
SELECT name + price FROM products;
ERROR: operator does not exist: text + numeric
You can’t add text and numbers together. There’s no + operator that works with those types, and no safe way to convert one to the other. PostgreSQL catches this during analysis and tells you before trying to run the query.
Function calls work similarly. When you call length(name), PostgreSQL picks the right version of the length function based on whether name is text or binary data. Each version does something slightly different, and PostgreSQL figures out which one you need.
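You can list the overloads the analyzer chooses between by querying the pg_proc catalog:

```sql
SELECT oid::regprocedure AS signature
FROM pg_proc
WHERE proname = 'length';
-- one row per overload: length(text), length(bytea), length(character), ...
```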
With names resolved and types checked, there’s one more thing the analyzer needs to track.
Recording Permission Requirements
During semantic analysis, PostgreSQL records what permissions will be needed to execute the query. The actual permission checks happen later, during execution.
For our query:
SELECT name FROM users WHERE id = 42;
The analyzer records in the Range Table Entry that SELECT privilege is required on the users table. It also notes which specific columns are accessed (name and id), since PostgreSQL supports column-level privileges where a user might have access to some columns but not others. This information gets stored in the Query tree and will be checked when the query actually executes.
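Column-level privileges make this bookkeeping visible. Suppose a (hypothetical) reporting_user is granted access to only one column:

```sql
GRANT SELECT (name) ON users TO reporting_user;

-- As reporting_user:
SELECT name FROM users;               -- allowed: only "name" is read
SELECT name FROM users WHERE id = 42; -- fails: the WHERE clause reads "id"
-- ERROR:  permission denied for table users
```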
With all names resolved, types checked, and permission requirements recorded, PostgreSQL has everything it needs to build the complete query representation.
Building the Final Query Tree
The culmination of semantic analysis is the Query tree—a fully validated, type-checked, and permission-verified representation. For this query:
SELECT u.name, COUNT(p.id)
FROM users u
JOIN posts p ON u.id = p.user_id
WHERE u.active = true
GROUP BY u.name;
The Query tree organizes all components into a hierarchical structure:
digraph query_tree {
rankdir=TB;
node [shape=box, style=rounded];
Query [label="Query\n(SELECT)", style="filled,rounded", fillcolor=lightblue];
RangeTable [label="Range Table\n1: users (alias: u)\n2: posts (alias: p)"];
TargetList [label="Target List\n1: u.name (TEXT)\n2: COUNT(p.id) (BIGINT)"];
FromClause [label="FROM Clause"];
WhereClause [label="WHERE Clause"];
GroupBy [label="GROUP BY\nu.name"];
Query -> RangeTable;
Query -> TargetList;
Query -> FromClause;
Query -> WhereClause;
Query -> GroupBy;
FromClause -> JoinTree;
JoinTree [label="Join Tree\nINNER JOIN"];
JoinTree -> LeftTable [label="left"];
JoinTree -> RightTable [label="right"];
JoinTree -> JoinCond [label="condition"];
LeftTable [label="RTE 1\nusers u"];
RightTable [label="RTE 2\nposts p"];
JoinCond [label="u.id = p.user_id"];
WhereClause -> EqOp [label="expression"];
EqOp [label="= operator"];
EqOp -> VarActive [label="left"];
EqOp -> ConstTrue [label="right"];
VarActive [label="u.active\n(BOOLEAN)"];
ConstTrue [label="true\n(BOOLEAN)"];
}
This complete Query tree contains everything PostgreSQL’s planner needs to generate an optimal execution plan. Every table reference is resolved, every type is checked, every permission requirement is recorded, and every expression is properly structured.
Playing With SQL Parsing
Want to see parsing in action? You can use the pg_query Ruby library to explore how PostgreSQL breaks down your SQL:
gem install pg_query
Watch the lexer break down your query:
require 'pg_query'
query = "SELECT name FROM users WHERE id = 42"
scan_result = PgQuery.scan(query)
puts "Tokens:"
scan_result.first.tokens.each do |token|
token_text = query[token.start...token.end]
puts " #{token.token.to_s.ljust(12)} '#{token_text}'"
end
Output:
Tokens:
SELECT 'SELECT'
IDENT 'name'
FROM 'FROM'
IDENT 'users'
WHERE 'WHERE'
IDENT 'id'
'=' '='
ICONST '42'
See the parse tree structure:
parsed = PgQuery.parse(query)
stmt = parsed.tree.stmts.first.stmt.select_stmt
puts "Parse Tree Structure:"
puts " Statement Type: SelectStmt"
puts " Target List: #{stmt.target_list.length} items"
puts " FROM Clause: #{stmt.from_clause.length} items"
PostgreSQL provides a debug option to see the complete query tree after both parsing and analysis:
SET debug_print_parse = on;
SELECT name FROM users WHERE id = 42;
This writes the query tree to the PostgreSQL server logs (not to your query results). Despite its name, debug_print_parse shows the tree after semantic analysis completes—you’ll see Range Table Entries, resolved types, and all the semantic information we’ve discussed.
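If you’d rather not dig through the server logs, you can raise your session’s client message level so the tree appears directly in psql (debug_pretty_print, which is on by default, indents the output):

```sql
SET client_min_messages = log;  -- surface LOG-level output in this session
SET debug_print_parse = on;
SELECT name FROM users WHERE id = 42;
```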
Now you’ve seen the complete journey from SQL text to query tree. Let’s tie it all together.
Wrapping Up
So that’s the journey from SQL text to query tree. Your simple SELECT name FROM users WHERE id = 42 goes through quite a transformation—first parsed into tokens and then into a syntactic tree, then analyzed to resolve names, check types, and record what the query needs to run.
What makes this architecture elegant is the separation of concerns. The parser only cares about grammar—no database lookups, no transactions, just fast validation of structure. The analyzer handles the expensive stuff—consulting the catalog, resolving ambiguities, checking types. By the time PostgreSQL finishes analysis, it has a complete, validated query tree ready for execution.
But we’re not done yet. This query tree still needs to go through the rewriter (where PostgreSQL applies views, rules, and row-level security) and the planner (where it figures out the fastest way to actually execute your query). Those phases deserve their own deep dives—we’ll explore them in future articles.
For now, you’ve seen how PostgreSQL takes your SQL and transforms it into something it can understand and validate. Every query goes through this process, and understanding it helps you write better SQL and debug problems when things go wrong.
