In the previous article, we explored how PostgreSQL’s planner chooses the optimal execution strategy. The planner produces an abstract plan tree—nodes like “Sequential Scan,” “Hash Join,” “Sort”—that describe what to do. Now the execution engine needs to actually do the work: read pages from disk, follow indexes, join tables, and produce results.
The executor uses the Volcano execution model—a beautifully simple pattern where every operation implements the same interface: “give me the next row.” A sort operation doesn’t care whether its input comes from a table scan or a join—it just asks for rows and sorts whatever it gets. This uniform approach allows arbitrarily complex queries to be built from simple, composable pieces.
But underneath this elegant interface lies the complex reality of accessing data stored in 8KB pages (configurable at compile time, though 8KB is the standard). Understanding how operations work at the page level reveals why some queries fly and others crawl.
Let’s start with the organizing principle behind it all.
The Volcano Model: Simple Iterator Pattern
PostgreSQL’s execution engine is organized around a deceptively simple idea: every operation in the plan tree implements the same basic interface. Whether it’s reading from a table, sorting rows, or joining two inputs—every operation acts like an iterator that produces rows on demand.
This means complex operations can be built by combining simpler ones. A hash join doesn’t need to know whether its input comes from a table scan, an index scan, or another join. It just asks for the next row and processes whatever it gets.
This composability is what allows PostgreSQL to handle arbitrarily complex queries using a relatively simple execution framework.
How It Works: Three-Step Lifecycle
Every operation follows the same three-step lifecycle. First comes setup—the operation initializes itself and prepares any resources it needs. A sequential scan opens the table file and gets ready to read pages. A hash join prepares to build a hash table from its input.
Then comes execution, the heart of the iterator pattern. The operation repeatedly returns the next row when asked, or signals that it’s finished. Each operation acts like a generator that produces rows on demand—it doesn’t do any work until someone asks for data. We’ll see a concrete example of this in action very soon.
Finally, cleanup—the operation releases any resources it was using. Files get closed, memory gets freed, temporary data gets cleaned up.
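To make the lifecycle concrete, here is a minimal sketch of the pattern in C, PostgreSQL’s implementation language. The names (PlanNode, init, next, end) are illustrative, not PostgreSQL’s actual executor API, but the shape is the same: every node exposes the same three entry points, and the caller drives setup, repeated pulls, and cleanup.

```c
/* A minimal sketch of the Volcano iterator interface. Names are
 * illustrative, not PostgreSQL's real executor API. */
#include <stdio.h>
#include <stdbool.h>

typedef struct PlanNode PlanNode;
struct PlanNode {
    void (*init)(PlanNode *self);           /* setup: open files, allocate state    */
    bool (*next)(PlanNode *self, int *row); /* execution: produce one row on demand */
    void (*end)(PlanNode *self);            /* cleanup: release resources           */
    void *state;                            /* node-private execution state         */
};

/* A toy "scan" that yields the integers 1..3, one per next() call. */
typedef struct { int current, last; } ScanState;

static void scan_init(PlanNode *self) {
    ScanState *st = self->state;
    st->current = 1;                        /* position the scan at the start */
    st->last = 3;
}
static bool scan_next(PlanNode *self, int *row) {
    ScanState *st = self->state;
    if (st->current > st->last)
        return false;                       /* no more rows */
    *row = st->current++;
    return true;
}
static void scan_end(PlanNode *self) { (void) self; /* nothing to free here */ }

int main(void) {
    ScanState st;
    PlanNode scan = { scan_init, scan_next, scan_end, &st };
    int row;
    scan.init(&scan);                       /* 1. setup     */
    while (scan.next(&scan, &row))          /* 2. pull rows */
        printf("row: %d\n", row);
    scan.end(&scan);                        /* 3. cleanup   */
    return 0;
}
```

Note that no work happens until the loop starts pulling: the scan produces rows only when asked, which is exactly the lazy behavior described above.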
Now that we understand the lifecycle, let’s see how these operations actually coordinate with each other to process a query.
Data Flow Through the Plan Tree
As operations request data from their children, rows flow up through the plan tree from data sources (like table scans) to operations that transform them (like joins and sorts) and finally to your client.
Here’s the key insight: the plan describes what to do, while the execution state tracks where you are. The plan says “scan this table”—that never changes. The state tracks which page you’re currently reading, which rows you’ve processed, what temporary data you’ve built—that changes constantly as execution progresses.
But how do operations actually pass row data to each other? PostgreSQL uses a standardized container called a tuple slot. Think of it as a generic box that can hold a row’s column values, no matter where that row came from—whether it was read from a disk page, computed by an expression, or built by combining values from multiple sources.
This uniform interface is what makes the magic happen—a sort operation doesn’t need special knowledge about where its input comes from. It just asks for rows and sorts whatever it gets.
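Here is a toy version of the idea in C. The struct and helper below are hypothetical stand-ins (PostgreSQL’s real container is TupleTableSlot, defined in src/include/executor/tuptable.h), but they capture the essence: an array of column values plus per-column NULL flags, filled the same way no matter who produced the row.

```c
/* A sketch of a tuple slot: a generic container for one row's column
 * values. Names are illustrative, not PostgreSQL's actual types. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uintptr_t Datum;        /* a column value, type-agnostic */

typedef struct {
    int    natts;               /* number of columns in the row  */
    Datum *values;              /* one entry per column          */
    bool  *isnull;              /* NULL flag per column          */
} TupleSlot;

/* Any producer -- scan, join, expression -- fills a slot the same way. */
static void slot_store(TupleSlot *slot, int col, Datum value, bool isnull) {
    slot->values[col] = value;
    slot->isnull[col] = isnull;
}

int main(void) {
    Datum values[2];
    bool  nulls[2];
    TupleSlot slot = { 2, values, nulls };
    slot_store(&slot, 0, (Datum) 1, false);   /* e.g. customer_id = 1 */
    slot_store(&slot, 1, (Datum) 0, true);    /* second column is NULL */
    printf("col 0 = %lu, col 1 is %s\n",
           (unsigned long) slot.values[0],
           slot.isnull[1] ? "NULL" : "non-NULL");
    return 0;
}
```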
Let’s see how this works with a concrete example from the previous chapter. Consider this query finding all rentals for a specific customer:
SELECT *
FROM customer c
JOIN rental r ON c.customer_id = r.customer_id
WHERE c.customer_id = 1;
As we saw in the planner article, PostgreSQL produces this plan:
Nested Loop (cost=0.28..359.16 rows=32 width=114)
-> Index Scan using customer_pkey on customer c (cost=0.28..8.29 rows=1 width=74)
Index Cond: (customer_id = 1)
-> Seq Scan on rental r (cost=0.00..350.55 rows=32 width=40)
Filter: (customer_id = 1)
Here’s how execution actually flows through this plan tree. When the client sends the query, the executor takes over and drives the entire process. The client doesn’t ask for rows one at a time—it simply requests query execution, and the executor handles everything from there, streaming results back to the client as they become available (with some buffering in between). The pull-based model happens inside the executor, where plan nodes ask other plan nodes for rows.
1. Executor asks the root node for the first row: the Nested Loop node receives the request.
2. Nested Loop asks Index Scan on customer: “Give me your first row.”
- Index Scan navigates the index to find the location of customer_id = 1 in the table data
- Index Scan fetches the row from the table data
- Index Scan returns the one matching customer row to Nested Loop
3. Nested Loop asks Seq Scan on rental: “Give me your first row.”
- Seq Scan reads rental table pages, checking each row for customer_id = 1
- Seq Scan sends the first matching rental to Nested Loop
4. Nested Loop combines and returns: it joins the customer and rental rows into one result row and passes it up to the executor, which streams it to the client.
5. Executor asks for the next row: the process continues.
- Nested Loop still has the same customer row
- Nested Loop asks Seq Scan on rental: “Give me your next row”
- Seq Scan sends the next matching rental to Nested Loop
- Nested Loop sends the second joined row up, which gets streamed to the client
This continues until Seq Scan has found all 32 matching rentals. When the executor asks for another row, Nested Loop goes back to Index Scan for the next customer. Since there’s only one customer with id = 1, Index Scan reports that it has no more rows, and Nested Loop signals it’s finished.
Notice how the outer side (Index Scan) controls the overall flow—the inner side (Seq Scan) runs completely for each outer row. This pull-based model means the Nested Loop coordinates the two scans, combining rows as they arrive, and stops as soon as either side runs out of data. The key insight is that plan nodes pull from other plan nodes—the Volcano model describes how operations within the executor communicate with each other, not how the client communicates with PostgreSQL.
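Here is a compact sketch of that coordination in C. The toy children below stand in for the Index Scan and Seq Scan (names and structure are invented for illustration); the point is the shape of the Nested Loop’s “next row” logic: hold one outer row, drain the inner side, then advance the outer side.

```c
/* A sketch of a Nested Loop's pull-based coordination. Toy children
 * stand in for the real Index Scan and Seq Scan plan nodes. */
#include <stdbool.h>
#include <stdio.h>

static int outer_rows[] = {1};               /* one customer: id = 1      */
static int inner_rows[] = {101, 102, 103};   /* rental ids for customer 1 */
static int outer_pos, inner_pos;

static bool outer_next(int *row) {
    if (outer_pos >= 1) return false;
    *row = outer_rows[outer_pos++]; return true;
}
static bool inner_next(int *row) {
    if (inner_pos >= 3) return false;
    *row = inner_rows[inner_pos++]; return true;
}
static void inner_rescan(void) { inner_pos = 0; } /* restart the inner scan */

/* Nested Loop: for each outer row, run the inner side to completion. */
static bool nestloop_next(int *outer, int *inner) {
    static bool have_outer = false;
    static int  cur_outer;
    for (;;) {
        if (!have_outer) {                       /* advance the outer side  */
            if (!outer_next(&cur_outer))
                return false;                    /* outer exhausted: done   */
            have_outer = true;
            inner_rescan();                      /* inner restarts per outer row */
        }
        if (inner_next(inner)) {                 /* next match for this outer    */
            *outer = cur_outer;
            return true;
        }
        have_outer = false;                      /* inner exhausted: next outer  */
    }
}

int main(void) {
    int o, i;
    while (nestloop_next(&o, &i))                /* the executor pulls joined rows */
        printf("joined: customer %d, rental %d\n", o, i);
    return 0;
}
```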
Not All Operations Are Lazy: Blocking Nodes
The Volcano model’s “give me the next row” interface might suggest that all operations process data lazily, one row at a time. But that’s not quite true—some operations need to consume all their input before they can produce even a single output row.
Sort is the classic example. When you ask a Sort node for its first row, it can’t just return the first row it receives from its child—that row might belong at the end of the sorted result. So Sort must read every single row from its input, sort them all in memory (or on disk if they don’t fit), and only then can it start returning rows. The first row takes a long time; subsequent rows come quickly.
Hash Join has a similar pattern, but only on one side. Before the join can produce any results, it must read the entire “build” side (usually the smaller table) and construct the hash table in memory. Only then can it start reading from the “probe” side and returning matches. So if you see a Hash Join in your plan, one of the inputs will be fully consumed before any results appear.
This has real performance implications. If you run SELECT * FROM large_table ORDER BY column LIMIT 10, the Sort node still needs to read and sort the entire table before it can return even those 10 rows—adding a LIMIT doesn’t make Sort any faster (unless there’s an index on that column, in which case the planner can use an Index Scan that returns rows already sorted, avoiding the Sort node entirely). Understanding which operations in your plan are blocking helps explain why some queries take a long time before the first result appears.
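A small sketch makes the blocking behavior obvious: the first call to the Sort node’s “next row” function drains its child completely before anything comes back, and every later call just reads from the sorted buffer. This is illustrative C, not PostgreSQL’s actual sort code (which lives in src/backend/utils/sort/tuplesort.c).

```c
/* A sketch of a blocking Sort node: the first next() call consumes the
 * entire input; subsequent calls return already-sorted rows cheaply. */
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

static int input[] = {42, 7, 19, 3};        /* rows from the child node */
static int input_pos;

static bool child_next(int *row) {          /* the Sort node's input */
    if (input_pos >= 4) return false;
    *row = input[input_pos++]; return true;
}

static int cmp_int(const void *a, const void *b) {
    return (*(const int *) a > *(const int *) b) -
           (*(const int *) a < *(const int *) b);
}

static bool sort_next(int *row) {
    static int  buffered[4];
    static int  nrows, out_pos;
    static bool sorted = false;
    if (!sorted) {
        int r;
        while (child_next(&r))              /* drain the child completely */
            buffered[nrows++] = r;
        qsort(buffered, nrows, sizeof(int), cmp_int);
        sorted = true;                      /* only now can we emit rows  */
    }
    if (out_pos >= nrows) return false;
    *row = buffered[out_pos++];             /* subsequent calls are cheap */
    return true;
}

int main(void) {
    int row;
    while (sort_next(&row))
        printf("%d\n", row);                /* prints 3 7 19 42 */
    return 0;
}
```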
Now that we understand the organizing principle, let’s look at how operations actually access data stored on disk.
How Data Access Really Works
Behind the simple iterator interface lies the complex reality of accessing data stored on disk. PostgreSQL organizes all data into 8KB pages, and understanding how different access methods work at the page level reveals why some queries are fast and others are slow.
The two fundamental approaches are sequential scans (reading pages in order) and index scans (using an index structure to find specific pages). Each has dramatically different performance characteristics.
Sequential Scans: Reading Pages in Order
A sequential scan is the simplest data access method—read through all pages of a table from beginning to end, examining every row to see if it matches the query conditions.
Think of it like reading a book cover to cover. PostgreSQL opens the table’s data file on disk and starts reading from page 0. Each 8KB page contains multiple rows, so PostgreSQL scans through each row in the page, checking if it matches your WHERE clause conditions. When a row matches, it gets returned to the parent operation. Once all rows on a page are processed, PostgreSQL moves to the next page and repeats the process.
Here’s the key insight: sequential scans read every page of the table, even if only a few rows are actually needed. This makes them efficient when you need most of the table’s data—why bother with an index if you’re reading 80% of the rows anyway? But they’re inefficient for finding small numbers of specific rows—you’re still reading the entire table even if you only need 10 rows out of a million.
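The loop structure is simple enough to sketch directly. The page and row layouts below are toys (real pages hold variable-length tuples plus headers and a line-pointer array), but the access pattern is the point: every page, every row, filter applied per row.

```c
/* A sketch of a sequential scan at the page level: read pages in order
 * and test every row on each page against the filter. Structures are
 * illustrative, not PostgreSQL's actual heap page layout. */
#include <stdio.h>

#define ROWS_PER_PAGE 4

typedef struct { int customer_id; } Row;
typedef struct { Row rows[ROWS_PER_PAGE]; } Page;

/* Stand-in for the table's data file on disk. */
static Page table[2] = {
    {{{3}, {1}, {7}, {1}}},
    {{{2}, {1}, {9}, {5}}},
};

int main(void) {
    int matches = 0;
    for (int pageno = 0; pageno < 2; pageno++) {   /* every page...       */
        Page *page = &table[pageno];               /* "read" it from disk */
        for (int i = 0; i < ROWS_PER_PAGE; i++) {  /* ...every row        */
            if (page->rows[i].customer_id == 1) {  /* WHERE filter        */
                printf("match on page %d, slot %d\n", pageno, i);
                matches++;
            }
        }
    }
    printf("%d matching rows, but all pages were read\n", matches);
    return 0;
}
```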
Implementation: src/backend/executor/nodeSeqscan.c
Index Scans: Following Pointers to Data
Index scans use a separate index structure to find specific rows without reading the entire table. The index acts like a book’s index—it tells you which pages contain the data you’re looking for.
The process involves two levels of page access. First, PostgreSQL navigates the index’s internal pages to reach the right leaf location. Then it reads the index leaf pages to find the row identifiers (tuple IDs) that match your search criteria. With those tuple IDs in hand, PostgreSQL fetches the heap pages—the actual data pages from the main table—and extracts the matching rows.
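Here is the two-level pattern in miniature. The flat array below is a stand-in for a real index descent (a B-tree would navigate internal pages to reach the leaf), but the shape holds: probe the index for tuple IDs, then follow each one into the heap.

```c
/* A sketch of the two-level access pattern of an index scan: probe the
 * index for tuple IDs (page number + slot), then fetch those heap pages.
 * The flat "index" array stands in for a real B-tree descent. */
#include <stdio.h>

typedef struct { int page; int slot; } TupleId;     /* where the row lives */
typedef struct { int key; TupleId tid; } IndexEntry;

static IndexEntry index_leaf[] = {                  /* sorted by key */
    {1, {4, 2}}, {2, {0, 7}}, {5, {9, 1}},
};

static void fetch_from_heap(TupleId tid) {
    /* In reality: read heap page tid.page into the buffer pool, then
     * extract the row at slot tid.slot -- potentially a random I/O. */
    printf("heap fetch: page %d, slot %d\n", tid.page, tid.slot);
}

int main(void) {
    int search_key = 2;
    /* Level 1: navigate the index structure to the matching entries. */
    for (int i = 0; i < 3; i++) {
        if (index_leaf[i].key == search_key)
            /* Level 2: follow the tuple ID into the table's heap pages. */
            fetch_from_heap(index_leaf[i].tid);
    }
    return 0;
}
```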
PostgreSQL supports different types of indexes—B-tree, hash, GiST, GIN, and others—each with its own way of storing and searching for keys. We’ll explore these index structures in detail in the next article in this series.
Index scans are efficient when you need only a small fraction of the table’s rows. But here’s the catch: if you need a large percentage of rows, index scans can actually be inefficient because they result in random page access patterns rather than sequential reads.
This is why PostgreSQL’s planner sometimes chooses a sequential scan even when an index exists—reading the entire table sequentially can be faster than thousands of random index lookups.
Implementation: src/backend/executor/nodeIndexscan.c
Now that we’ve seen how sequential and index scans access pages, let’s look at the crucial layer that sits between these operations and the disk.
Page Access Patterns and Buffer Management
Whether reading pages sequentially or jumping around via indexes, all data access ultimately comes down to reading 8KB pages from disk into PostgreSQL’s shared buffer pool (a shared memory area that caches frequently accessed pages). Understanding how this buffer pool works and how different access patterns interact with it explains why sequential scans and index scans have such different performance characteristics.
Buffer Pool Management
PostgreSQL’s shared buffer pool acts as a cache between the execution engine and the disk. When an operation needs a page, it first checks if that page is already in memory—if so, use it directly. If not, PostgreSQL needs to load it from disk. If there’s free space in the buffer pool, it loads the page there. If the buffer pool is full, PostgreSQL picks a page to remove (one that’s not currently being used) to make room for the new page. While an operation is actively using a page, PostgreSQL marks it so it won’t be removed.
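The lookup logic can be sketched in a few lines. This is heavily simplified: the real buffer manager uses a partitioned hash table for lookups and clock-sweep eviction rather than the linear searches below, but the hit/miss/evict/pin flow is the same idea.

```c
/* A sketch of buffer pool lookup: check the cache first, load from disk
 * on a miss, evict an unpinned buffer when full. Illustrative only. */
#include <stdbool.h>
#include <stdio.h>

#define POOL_SIZE 3

typedef struct {
    int  pageno;    /* which disk page this buffer holds (-1 = empty) */
    int  pins;      /* >0 while an operation is actively using it     */
    bool valid;
} Buffer;

static Buffer pool[POOL_SIZE] = {{-1, 0, false}, {-1, 0, false}, {-1, 0, false}};

static Buffer *get_page(int pageno) {
    /* 1. Cache hit: the page is already in memory. */
    for (int i = 0; i < POOL_SIZE; i++)
        if (pool[i].valid && pool[i].pageno == pageno) {
            pool[i].pins++;              /* pin: don't evict while in use */
            return &pool[i];
        }
    /* 2. Miss: find a free or evictable (unpinned) buffer. */
    for (int i = 0; i < POOL_SIZE; i++)
        if (!pool[i].valid || pool[i].pins == 0) {
            printf("loading page %d from disk into buffer %d\n", pageno, i);
            pool[i] = (Buffer) {pageno, 1, true};
            return &pool[i];
        }
    return NULL;                         /* all buffers pinned */
}

static void release_page(Buffer *buf) { buf->pins--; }

int main(void) {
    Buffer *a = get_page(10);            /* miss: disk read  */
    release_page(a);
    Buffer *b = get_page(10);            /* hit: no disk I/O */
    release_page(b);
    return 0;
}
```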
This buffer management is crucial for performance—keeping frequently accessed pages in memory can make queries orders of magnitude faster than having to read from disk every time.
Implementation: src/backend/storage/buffer/
Now that we understand how the buffer pool works, let’s see how different access patterns interact with it.
Sequential Access Patterns
Sequential scans read pages in order, which is very efficient. The operating system can prefetch upcoming pages because the access pattern is predictable. Disk drives don’t need to move the read head between distant locations, minimizing seeking. And recently read pages might still be in the buffer pool for other operations to use.
PostgreSQL implements a clever ring buffer strategy for large sequential scans. Instead of using the entire buffer pool, large scans use a small ring of buffers that gets reused as the scan progresses. This prevents a single large table scan from pushing out all other pages and destroying the cache for all other queries—brilliant design that keeps the system responsive even during big scans.
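The core of the trick is just modular reuse, as this sketch shows. The ring here is four slots for readability; the real ring for bulk reads is sized in pages (on the order of 256KB), but the effect is identical: a scan of any size touches only a bounded slice of the buffer pool.

```c
/* A sketch of the ring-buffer strategy for large sequential scans: the
 * scan cycles through a small, fixed set of buffers instead of spreading
 * across the whole pool, so it can't evict everyone else's cached pages. */
#include <stdio.h>

#define RING_SIZE 4   /* toy size; the real ring is a few hundred KB of pages */

int main(void) {
    int ring[RING_SIZE];
    for (int pageno = 0; pageno < 10; pageno++) {
        int slot = pageno % RING_SIZE;   /* reuse buffers round-robin  */
        ring[slot] = pageno;             /* stand-in for the page data */
        printf("page %d -> ring slot %d\n", ring[slot], slot);
    }
    /* After the scan, at most RING_SIZE buffers were touched; the rest of
     * the buffer pool still holds whatever other queries had cached. */
    return 0;
}
```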
Random Access Patterns
Index scans often result in random page access because the heap tuples they need may be scattered across many different pages. This creates performance challenges. The I/O is unpredictable, making it hard to prefetch the right pages. Disk drives need more seeking, which is especially expensive on traditional hard drives. And random access patterns don’t benefit as much from caching since recently accessed pages are less likely to be needed again soon.
However, PostgreSQL has several optimizations for random access. Index-only scans avoid heap access entirely when all needed columns are in the index. Bitmap index scans collect many tuple IDs from the index, sort them by page location, then access heap pages in order—turning random access into sequential. And tables can be physically ordered to match an index through clustering, making index scans more sequential.
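The bitmap trick in particular is worth seeing in code: gather the tuple IDs first, sort them by heap page, and each page is read exactly once, in order. This sketch uses a plain sorted array rather than PostgreSQL’s actual bitmap data structure, but it demonstrates the same transformation from random to sequential access.

```c
/* A sketch of the bitmap-scan idea: collect tuple IDs from the index,
 * sort them by heap page number, then visit each page once, in order. */
#include <stdio.h>
#include <stdlib.h>

typedef struct { int page; int slot; } TupleId;

static int cmp_by_page(const void *a, const void *b) {
    const TupleId *x = a, *y = b;
    if (x->page != y->page) return x->page - y->page;
    return x->slot - y->slot;
}

int main(void) {
    /* TIDs as the index produced them: scattered across heap pages. */
    TupleId tids[] = {{57, 2}, {3, 1}, {57, 9}, {12, 4}, {3, 6}};
    int n = 5;
    qsort(tids, n, sizeof(TupleId), cmp_by_page);  /* group by page */
    int last_page = -1;
    for (int i = 0; i < n; i++) {
        if (tids[i].page != last_page) {           /* one visit per page */
            printf("read heap page %d\n", tids[i].page);
            last_page = tids[i].page;
        }
        printf("  fetch slot %d\n", tids[i].slot);
    }
    return 0;
}
```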
The buffer pool helps with pages that fit in memory, but what about operations that need more memory than the buffer pool provides?
Handling Large Operations: When Data Doesn’t Fit in Memory
Some operations like sorting large tables or building hash tables for joins need more memory than PostgreSQL is configured to use. When this happens, PostgreSQL automatically “spills” data to temporary files on disk, allowing operations to complete even with limited memory. This spilling behavior is controlled by the work_mem setting.
For hash joins, if the hash table won’t fit in memory, PostgreSQL partitions the data into batches and processes each batch separately. For large sorts, it creates sorted chunks that fit in memory, writes them to temporary files, then merges them back together. Both approaches are much slower than in-memory operations, which is why proper work_mem sizing matters for performance.
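The external sort follows the classic sort-runs-then-merge pattern, sketched below. The runs here live in arrays and the “work_mem” limit is three rows, whereas PostgreSQL writes real runs to temporary files and merges many of them at once, but the two phases are the same.

```c
/* A sketch of spill-to-disk sorting: sort chunks that fit in "work_mem",
 * store each sorted run, then merge the runs. Runs are kept in arrays
 * here; PostgreSQL writes them to temporary files instead. */
#include <stdio.h>
#include <stdlib.h>

#define WORK_MEM_ROWS 3                 /* pretend only 3 rows fit in memory */

static int cmp_int(const void *a, const void *b) {
    return (*(const int *) a > *(const int *) b) -
           (*(const int *) a < *(const int *) b);
}

int main(void) {
    int input[] = {9, 4, 7, 1, 8, 2};   /* 6 rows: needs two runs */
    int runs[2][WORK_MEM_ROWS];
    int pos[2] = {0, 0};

    /* Phase 1: fill work_mem, sort, "write" the run out, repeat. */
    for (int r = 0; r < 2; r++) {
        for (int i = 0; i < WORK_MEM_ROWS; i++)
            runs[r][i] = input[r * WORK_MEM_ROWS + i];
        qsort(runs[r], WORK_MEM_ROWS, sizeof(int), cmp_int);
    }

    /* Phase 2: merge the sorted runs back together. */
    for (int out = 0; out < 6; out++) {
        int pick;
        if (pos[0] >= WORK_MEM_ROWS)      pick = 1;
        else if (pos[1] >= WORK_MEM_ROWS) pick = 0;
        else pick = runs[0][pos[0]] <= runs[1][pos[1]] ? 0 : 1;
        printf("%d ", runs[pick][pos[pick]++]);
    }
    printf("\n");                        /* prints: 1 2 4 7 8 9 */
    return 0;
}
```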
Beyond managing data and memory, the executor also needs to compute the actual logic of your query.
Expression Evaluation: Computing WHERE Clauses and SELECT Lists
Every query needs to evaluate expressions—WHERE clause conditions, join conditions, and function calls. These get evaluated for every single row that flows through the executor.
Consider a simple expression like WHERE price * 1.1 > 100. For each row, PostgreSQL needs to:
- Fetch the price column value from the current row
- Multiply it by 1.1
- Compare the result to 100
- Return true or false
During planning, PostgreSQL compiles these expressions into a sequence of simple steps that can be executed efficiently. Instead of navigating a complex expression tree for every row, it creates a straight list of operations—load this column, multiply by this constant, compare, return result.
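Here is what such a flattened step list might look like, sketched in C for the price example. The opcodes and the accumulator model are invented for illustration (PostgreSQL’s real step kinds are defined alongside execExpr.c), but the principle matches: compile once, then run a straight-line program per row.

```c
/* A sketch of expression compilation: WHERE price * 1.1 > 100 flattened
 * into a list of steps executed per row, instead of walking a tree.
 * Opcodes are made up for illustration. */
#include <stdbool.h>
#include <stdio.h>

typedef enum { LOAD_COLUMN, MUL_CONST, CMP_GT_CONST, DONE } OpCode;

typedef struct {
    OpCode op;
    int    column;    /* for LOAD_COLUMN     */
    double constant;  /* for MUL_* and CMP_* */
} Step;

/* Compiled once at plan time, reused for every row. */
static const Step program[] = {
    {LOAD_COLUMN,  0, 0.0},     /* load price into the accumulator */
    {MUL_CONST,    0, 1.1},     /* acc *= 1.1                      */
    {CMP_GT_CONST, 0, 100.0},   /* result = (acc > 100)            */
    {DONE,         0, 0.0},
};

static bool eval_row(const double *row) {
    double acc = 0.0;
    bool result = false;
    for (const Step *s = program; ; s++) {
        switch (s->op) {
            case LOAD_COLUMN:  acc = row[s->column];       break;
            case MUL_CONST:    acc *= s->constant;         break;
            case CMP_GT_CONST: result = acc > s->constant; break;
            case DONE:         return result;
        }
    }
}

int main(void) {
    double cheap[] = {50.0}, pricey[] = {120.0};
    printf("50 -> %s\n",  eval_row(cheap)  ? "true" : "false"); /* false */
    printf("120 -> %s\n", eval_row(pricey) ? "true" : "false"); /* true  */
    return 0;
}
```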
PostgreSQL can also optimize during compilation. For example, if part of an expression only involves constants, it evaluates that part once during planning instead of repeating it for every row. For very complex expressions on large datasets, PostgreSQL can even use JIT compilation to generate optimized machine code.
Implementation: src/backend/executor/execExpr.c
All the optimizations we’ve discussed so far happen within a single process. But for really large queries, PostgreSQL can leverage an even more powerful optimization.
Parallel Execution: Using Multiple CPU Cores
For large queries, PostgreSQL can split the work across multiple worker processes to use multiple CPU cores. This can significantly speed up operations like large table scans, sorts, and aggregations.
Parallel execution works by having a leader process coordinate with several worker processes. The leader launches workers and shares the query plan with them through shared memory. Workers dynamically claim chunks of work (like which pages to scan), process their assigned data, and send results back through shared memory queues. The leader collects results from all workers and sends the final results to the client.
For table scans, workers dynamically claim pages as they finish their current work. For operations like hash joins, workers cooperate to build shared data structures before processing their portions. When queries need sorted results, the leader merges the sorted streams from each worker to maintain overall order.
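The page-claiming idea can be sketched with a shared counter that workers advance atomically. One caveat: PostgreSQL’s workers are separate processes coordinating through shared memory, not threads, and real scans claim ranges of pages rather than one at a time. The threads below just keep the sketch self-contained; the claiming logic is the same idea.

```c
/* A sketch of parallel scan work distribution: each worker atomically
 * claims the next unscanned page from a shared counter, so no page is
 * scanned twice and faster workers naturally take more pages. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define TOTAL_PAGES 8
#define NUM_WORKERS 2

static atomic_int next_page;        /* shared scan position */

static void *worker(void *arg) {
    int id = *(int *) arg;
    for (;;) {
        int page = atomic_fetch_add(&next_page, 1);   /* claim one page */
        if (page >= TOTAL_PAGES)
            break;                                    /* table finished */
        printf("worker %d scans page %d\n", id, page);
    }
    return NULL;
}

int main(void) {
    pthread_t threads[NUM_WORKERS];
    int ids[NUM_WORKERS] = {0, 1};
    for (int i = 0; i < NUM_WORKERS; i++)
        pthread_create(&threads[i], NULL, worker, &ids[i]);
    for (int i = 0; i < NUM_WORKERS; i++)
        pthread_join(threads[i], NULL);
    return 0;
}
```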
Parallel execution can provide significant speedups (2-4x or more) for large queries. However, it adds coordination overhead, so it’s most beneficial for queries that process substantial amounts of data.
Implementation: src/backend/executor/execParallel.c, src/backend/executor/nodeGather.c
Now let’s step back and look at what we’ve covered.
Wrapping Up
So that’s PostgreSQL’s execution engine—the machinery that transforms abstract query plans into concrete operations that access data and produce results.
We started with the Volcano execution model—the organizing principle where every operation implements the same simple iterator interface. This uniform approach allows complex queries to be built from composable pieces, each asking its children for the next row and processing whatever it gets.
At the storage level, we explored how operations actually access data stored in 8KB pages. Sequential scans read pages in order—efficient for large result sets. Index scans use index structures to find specific pages—efficient for small lookups but potentially lots of random I/O. Understanding these different access methods explains why the planner chooses different strategies for different queries.
The buffer pool sits between the executor and disk, caching frequently accessed pages in memory. Sequential access patterns benefit from predictable I/O and prefetching, while random access patterns face challenges from unpredictable I/O. PostgreSQL uses clever strategies like ring buffers to prevent large scans from destroying the cache.
When operations need more memory than available, PostgreSQL automatically spills to disk, partitioning data into batches that fit in memory. Expression evaluation is heavily optimized through compilation and JIT for speed. Parallel execution distributes work across multiple CPU cores for significant speedups on large queries.
The execution engine is where PostgreSQL’s elegance meets the messy reality of disk I/O, memory limits, and CPU constraints. Understanding these mechanics reveals how PostgreSQL turns logical query plans into actual results and why queries perform as they do.
Next up: we’ll explore PostgreSQL’s index structures—B-tree, hash, GiST, GIN, and others—diving into how each type organizes and searches for keys to enable efficient data access.
