Ever wonder what happens when you type `SELECT * FROM users WHERE id = 42;` and hit Enter? That simple query triggers a fascinating journey through PostgreSQL’s internals—a complex series of operations involving multiple processes, sophisticated memory management, and decades of optimization research.
This is the first article in a series where we’ll explore PostgreSQL’s query execution in depth. In this overview, I’ll walk you through the complete journey from SQL text to results, giving you the roadmap. Then, in subsequent articles, we’ll dive deep into each component—the parser, analyzer, rewriter, planner, and executor—exploring the details of how each one works.
Understanding this journey will transform how you write efficient queries, diagnose performance issues, and think about database development. Let’s start with the big picture.
The Five-Stage Journey
Before we dive deep, let’s get our bearings. Every query takes the same fundamental path through PostgreSQL:
Parsing → Analysis → Rewriting → Planning → Execution
That’s it. Five stages. Your SQL text goes in one end, results come out the other. But within each stage, there’s a lot happening. Let’s trace the journey.
The Journey Begins: Sending Your Query
Let’s trace our query from the moment we submit it. Your application establishes a connection to the PostgreSQL server and starts sending queries using PostgreSQL’s communication protocol.
Here’s something important: when you send `SELECT * FROM users WHERE id = 42;`, PostgreSQL receives it exactly as you typed it—a plain text string. This is true whether you’re typing in psql, running a query from your application, or using an ORM.
PostgreSQL receives this text and performs some quick validation—making sure it’s properly encoded and well-formed. Then the real work begins.
Stage 1: Parsing - From Text to Structure
The parser is the first stop. Its job is converting your SQL text into a structured parse tree.
The parser reads your query character by character, applying PostgreSQL’s SQL grammar rules. It identifies keywords (`SELECT`, `FROM`, `WHERE`), table names, column references, operators—all the syntactic pieces of your query.
Think of this like a language teacher analyzing a sentence. The parser identifies nouns, verbs, and clauses, but it doesn’t yet understand what they mean in the context of your database.
After parsing `SELECT name FROM users WHERE id = 42`, PostgreSQL knows:
- There’s a SELECT clause with a column reference (`name`)
- There’s a FROM clause with a table reference (`users`)
- There’s a WHERE clause with a comparison (`id = 42`)
But it doesn’t know if the `users` table exists, if `name` is a valid column, or what data types are involved. That’s the next stage’s job.
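You can see this division of labor yourself. A syntax error is caught by the parser before PostgreSQL ever touches the catalogs, while a query with valid grammar but a made-up table name sails through parsing and fails later, in the analysis stage (the table name below is deliberately fictional):

```sql
-- Caught by the parser: invalid grammar
SELECT name FORM users WHERE id = 42;
-- ERROR:  syntax error at or near "FORM"

-- Parses fine; fails in the next stage because the table doesn't exist
SELECT name FROM no_such_table WHERE id = 42;
-- ERROR:  relation "no_such_table" does not exist
```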
Stage 2: Analysis - Adding Meaning
The analyzer transforms the parse tree into a semantically validated query tree. This is where PostgreSQL moves from understanding grammar to understanding meaning.
The analyzer does several critical things:
Resolves references: It looks up `users` in the system catalogs to verify the table exists. It checks that `name` and `id` are real columns in that table.
Type checking: It verifies that the comparison `id = 42` makes sense—is `id` a numeric column? Can it be compared to the integer `42`? Does the comparison operator exist for these types?
Permission checking: It verifies you have SELECT permission on the `users` table and the specific columns you’re accessing.
Adds semantic information: The query tree now knows that `users` refers to a specific table with a specific OID (object identifier), that `name` is a text column, and that `id` is an integer column.
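You can peek at the same catalog information the analyzer consults. A quick sketch, assuming a `users` table exists in your database:

```sql
-- The OID that "users" resolves to
SELECT oid, relname FROM pg_class WHERE relname = 'users';

-- The table's columns and their types, straight from the catalogs
SELECT attname, atttypid::regtype AS data_type
FROM pg_attribute
WHERE attrelid = 'users'::regclass
  AND attnum > 0 AND NOT attisdropped;
```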
If anything fails—table doesn’t exist, column is misspelled, you lack permissions—the analyzer throws an error and the query stops here.
Now that PostgreSQL understands what our query means, it needs to check if any automatic transformations should be applied before optimization begins.
Stage 3: Rewriting - Automatic Transformations
The rewriter takes our semantically valid query tree and applies automatic transformations. The most common examples:
View expansion: If you query a view, the rewriter converts it into a query against the underlying base tables. For instance, if `active_users` is a view defined as `SELECT * FROM users WHERE status = 'active'`, querying `SELECT * FROM active_users` gets rewritten to directly access the `users` table with the appropriate filter.
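You can watch view expansion happen by comparing plans. A minimal sketch using the `active_users` view just described:

```sql
CREATE VIEW active_users AS
  SELECT * FROM users WHERE status = 'active';

-- These two queries produce identical plans: the rewriter expands
-- the view into a query against users with the status filter applied
EXPLAIN SELECT * FROM active_users WHERE id = 42;
EXPLAIN SELECT * FROM users WHERE status = 'active' AND id = 42;
```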
Row-level security: If you have security policies defined, the rewriter adds additional WHERE conditions to enforce them. Your query might become `SELECT * FROM users WHERE id = 42 AND tenant_id = 123` if there’s a policy restricting access by tenant.
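For completeness, here’s roughly what such a policy looks like (the `tenant_id` column and the `app.tenant_id` setting are made up for illustration):

```sql
ALTER TABLE users ENABLE ROW LEVEL SECURITY;

-- Every query against users now gets this condition added to it
CREATE POLICY tenant_isolation ON users
  USING (tenant_id = current_setting('app.tenant_id')::int);
```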
User-defined rules: PostgreSQL allows custom rewrite rules (though these are less common in modern applications).
For most simple queries, the rewriter doesn’t change much. But for complex applications with views and security policies, this stage can significantly transform your query.
With all transformations applied, PostgreSQL now has the final query that needs to be executed. The next challenge is determining the most efficient way to execute it.
Stage 4: Planning - Finding the Optimal Path
This is where we reach the most sophisticated part: the planner. Its job is answering the question: “What’s the most efficient way to execute this query?”
The planner evaluates multiple dimensions to make this decision. Let’s look at the key factors it considers:
Access Paths
For each table in the query, the planner decides how to retrieve the data. Should we scan the entire `users` table row by row, or is there an index on `id` we can use to jump directly to the matching rows?
The planner examines all available indexes for each table and evaluates which access method will be fastest. Sometimes an index lookup is the clear winner. Other times—especially for small tables or when you need most of the rows—a sequential scan is actually faster because it avoids the overhead of index navigation.
If your query involves multiple tables, the planner makes this decision independently for each one.
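You can watch the access-path decision change with the query. A sketch, assuming `users` has an index on `id` but not on `status`:

```sql
-- Highly selective predicate with a matching index:
-- the planner chooses an index scan
EXPLAIN SELECT * FROM users WHERE id = 42;

-- No useful index (or most rows match): a sequential scan wins
EXPLAIN SELECT * FROM users WHERE status = 'active';
```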
Join Strategies
When your query joins multiple tables, the planner faces two critical decisions: what order should we join them in, and which join algorithm should we use for each join?
Join order matters tremendously. Joining table A to B first, then adding C, can be much faster or slower than joining B to C first, then adding A. The planner considers different orderings to find the most efficient path.
For each join, the planner also chooses an algorithm. PostgreSQL has three main options:
- Nested loop join: Best for small datasets or when one side has very few rows
- Hash join: Efficient for medium to large datasets when you have enough memory
- Merge join: Works well when both inputs are already sorted
The planner evaluates which combination of order and algorithms will minimize the overall cost for your specific query and data sizes.
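`EXPLAIN` shows which combination the planner picked. A hypothetical two-table example, assuming an `orders` table with `user_id` and `total` columns:

```sql
EXPLAIN
SELECT u.name, o.total
FROM users u
JOIN orders o ON o.user_id = u.id;
-- The plan's top node names the algorithm chosen:
-- "Hash Join", "Nested Loop", or "Merge Join"
```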
Statistics
All these planning decisions depend on one critical ingredient: statistics. The planner needs to know how much data it’s dealing with to make good choices.
PostgreSQL collects and maintains statistics about your tables through the `ANALYZE` command. For each table, it tracks things like:
- How many rows are in the table
- How many distinct values each column has
- The distribution of data (are values evenly spread or clustered?)
These statistics answer questions like “If I filter by `id = 42`, how many rows will match?” or “If I join these two tables, will the result be 10 rows or 10 million?”
Without accurate statistics, the planner is flying blind. It might choose an index when a sequential scan would be faster, or pick a nested loop join when a hash join would work better. Running `ANALYZE` regularly—or enabling autovacuum to do it automatically—is essential for good performance.
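You can inspect these statistics directly. A sketch using the standard `pg_stats` view:

```sql
-- Refresh the statistics for one table
ANALYZE users;

-- The row-count estimate the planner starts from
SELECT reltuples FROM pg_class WHERE relname = 'users';

-- Per-column statistics: distinct-value counts and common values
SELECT attname, n_distinct, most_common_vals
FROM pg_stats
WHERE tablename = 'users';
```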
Cost Estimation and the Final Plan
With all this information, the planner evaluates different execution strategies. For each possible plan, PostgreSQL calculates an estimated cost based on:
- Expected disk I/O (reading pages from disk or cache)
- CPU usage (processing rows, evaluating conditions)
- Memory requirements (sorting, hashing)
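Each of these is priced using a handful of tunable unit costs; you can inspect the defaults on your own server:

```sql
SHOW seq_page_cost;    -- sequential page read, the baseline (default 1.0)
SHOW random_page_cost; -- random page read (default 4.0)
SHOW cpu_tuple_cost;   -- processing one row (default 0.01)
```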
The planner explores multiple options and picks the one with the lowest estimated cost. For simple queries with a few tables, it can exhaustively evaluate all possibilities. But when your query joins many tables, the number of possible join orders grows factorially. In these cases, PostgreSQL switches to the Genetic Query Optimizer (GEQO), which uses heuristics to find a good plan without evaluating every possibility. We’ll explore how this genetic algorithm works in a future article.
Once the planner settles on the best approach, the result is an execution plan—a tree of operations specifying exactly how to retrieve your data.
For our simple query `SELECT name FROM users WHERE id = 42`, the plan might be: “Use the index on `id` to find the matching row, then fetch the `name` column from that row.”
You can see this plan yourself using `EXPLAIN`:

```sql
EXPLAIN SELECT name FROM users WHERE id = 42;
```
This shows you the planner’s decision process and cost estimates, which is invaluable for understanding and optimizing query performance.
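When you want the estimates checked against reality, `EXPLAIN ANALYZE` actually executes the query and reports real row counts and timings alongside the planner’s guesses:

```sql
EXPLAIN ANALYZE SELECT name FROM users WHERE id = 42;
```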
With the optimal execution plan in hand, PostgreSQL is finally ready to retrieve our data and return the results.
Stage 5: Execution - Producing Results
The executor takes your execution plan and produces results. PostgreSQL uses a pull-based execution model.
What does pull-based mean? Instead of each operation actively pushing data to the next operation, each operation requests data from its child operation. Think of it like a chain of waiters—you ask the waiter for food, the waiter asks the kitchen, the kitchen asks the pantry, and so on. Data only flows when requested.
Here’s how it works for our query `SELECT name FROM users WHERE id = 42`. The execution plan might look like:
```
                              QUERY PLAN
-----------------------------------------------------------------------
 Index Scan using users_id_idx on users  (cost=0.00..8.27 rows=1 width=64)
   Index Cond: (id = 42)
(2 rows)
```
When execution starts:
- The top of the plan (returning results to the client) asks for a row
- This triggers the Index Scan node to request a row
- The Index Scan uses the `users_id_idx` index to locate the row where `id = 42`
- It fetches that row from disk or cache
- The row flows back up: Index Scan → Client
This pull-based model is memory-efficient because PostgreSQL only processes data when it’s actually needed. It also makes implementing limits and early termination simple—just stop asking for more rows.
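`LIMIT` is the clearest illustration of early termination. In the sketch below, the Limit node stops pulling from its child after ten rows, so the scan never reads the rest of the table:

```sql
-- Only ten rows are ever requested from the scan node
EXPLAIN SELECT name FROM users LIMIT 10;
```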
The executor processes rows one by one, formats them according to the wire protocol, and sends them to your client over the network connection.
Once all results are sent, PostgreSQL automatically:
- Destroys temporary memory contexts
- Releases locks acquired during execution
- Cleans up any temporary files created
Your backend process is now ready for the next query.
Bringing It All Together
Let’s trace our complete journey one more time with `SELECT name FROM users WHERE id = 42`:
- Sending the query: Your application connects and sends the query as plain text
- Parsing: Text becomes a parse tree (syntactic structure)
- Analysis: Validation and semantic meaning—tables exist, types match, permissions check
- Rewriting: Automatic transformations—views expanded, security policies applied
- Planning: Evaluates access paths, join strategies, and statistics to create an optimal execution plan
- Execution: Runs the plan using a pull-based model, returns results
- Cleanup: Frees memory, releases locks, cleans up resources
Every query follows this same path—simple or complex, the difference is just how much work happens at each stage.
Why This Matters
Understanding this pipeline helps you:
Write better queries: Knowing how the planner works helps you structure queries for optimal performance. You’ll understand why certain queries can’t use indexes, why join order matters, and when to use CTEs vs subqueries.
Debug performance issues: When a query is slow, you can use `EXPLAIN` to see which stage is problematic. Is the planner choosing a bad path? Are statistics out of date? Is an index missing?
Design better schemas: Understanding how PostgreSQL processes queries informs index design, table partitioning, and view usage.
Appreciate the complexity: There’s a lot happening behind that simple `SELECT` statement. PostgreSQL handles connection management, memory allocation, syntax validation, semantic analysis, rule application, cost-based optimization, and pull-based execution—all transparently.
What’s Next?
This overview covered the complete journey at a high level, giving you the roadmap of how PostgreSQL transforms SQL into results. But each stage has fascinating details worth exploring in depth.
Over the coming months, we’ll dive deep into topics like:
- The wire protocol: How clients and servers communicate, message formats, and the extended query protocol
- The query planner: Cost estimation algorithms, the Genetic Query Optimizer, and how PostgreSQL chooses between execution strategies
- Index types: B-tree, Hash, GiST, GIN, BRIN—how each works internally and when to use them
- Statistical data: What statistics PostgreSQL collects, how they’re used in planning, and why `ANALYZE` matters
- Join algorithms: The internals of nested loop, hash, and merge joins
- The executor model: How the pull-based execution works in detail
Each article will explore the actual implementation, helping you understand not just what PostgreSQL does, but how and why it does it.
The next time you run a query, you’ll know exactly what journey it’s taking through PostgreSQL’s internals. And as we explore each component in the coming articles, that knowledge will deepen into true expertise.