AI Dev Tools
·5 min read·explainer

Demystifying the Database Index: A B-Tree Explained Simply

Confused about database performance? Here is a database index b-tree explained simply, showing how database indexes work under the hood with TypeScript examples.

Imagine walking into a physical archive containing 10 million printed documents. If you want to find a document containing the invoice for transaction #87654, and the documents are simply thrown into boxes in the order they arrived, you have only one option: pick up every single piece of paper, read the transaction number, and put it down until you find the right one.

In database systems, this is a Sequential Scan (or Full Table Scan). It scales linearly: $O(N)$ time complexity. If it takes 1 millisecond to inspect one record, you'll be waiting nearly three hours for your query to complete.

Now, imagine the archivist maintains a separate, highly organized card catalog. This catalog contains only two columns: the transaction number (sorted sequentially) and the physical box-and-folder number where the actual invoice is stored. That card catalog is a Database Index. It is a separate, highly structured lookup map designed to let you bypass the manual labor and jump straight to the exact physical coordinates of your data.


The Simple Version: A Database Index B-Tree Explained

In a relational database like PostgreSQL or MySQL, table data is stored on disk in blocks (often called pages), typically 8KB in size. When you insert a row, the database engine appends it to the current page. If that page is full, it requests a new page from the operating system. Crucially, there is no inherent order to how these rows are physically laid out on disk. They form an unordered heap.

An index is a secondary data structure that sits alongside your table. It extracts the values of the indexed column(s) and pairs them with a Tuple ID (TID) or physical row pointer.

Instead of reading every page on disk, the database engine queries the index, finds the matching key in logarithmic time ($O(\log N)$), grabs the physical pointer, and performs a single, highly targeted read of the actual table data in the heap.


What's Actually Happening Under the Hood

To understand why indexes are so fast, we have to look past the high-level abstraction. Databases do not use simple sorted arrays for indexing; they use a highly specialized tree structure called a B+ Tree (a self-balancing variation of the classic B-Tree).

Why Not a Binary Search Tree (BST)?

If you used a standard Binary Search Tree, every node would have at most two children. For 10 million rows, a balanced BST would be roughly 24 levels deep ($2^ \approx 16.7$ million).

In database systems, traversing a node means reading a page from disk. Doing 24 sequential disk seeks for a single query is a performance disaster because disk I/O is the ultimate bottleneck.

B-Trees solve this by having a massive fan-out (the number of child nodes per parent). A single node in a B+ Tree can hold hundreds of keys. This keeps the tree incredibly shallow—usually only 3 to 4 levels deep, even for billions of records.

In a modern B+ Tree:

  1. Internal nodes only contain routing keys and pointers to child nodes. They do not store row data or physical pointers. This maximizes the number of keys that can fit into a single 8KB page, maximizing fan-out.
  2. Leaf nodes store the actual indexed values alongside pointers to the physical table rows.
  3. Leaf node linking: Leaf nodes are linked together sequentially in a doubly linked list (represented by the bold horizontal arrows above). This makes range scans (e.g., WHERE age BETWEEN 20 AND 40) incredibly fast. Once the engine traverses down to 20 using the tree, it simply walks the leaf nodes horizontally until it hits a value greater than 40, avoiding the need to traverse back up and down the tree.

Simulating the Mechanics in TypeScript

To make this concrete, let's write a simplified simulation. We will contrast a Sequential Scan (scanning an unordered heap) with an Index Scan (performing a binary search on a sorted index mapping).

typescript
// Simulate a database table row
interface UserRow {
  id: number;
  email: string;
  age: number;
}
 
// Simulate physical storage (The Heap) - Unordered array
const tableHeap: UserRow[] = Array.from({ length: 100_000 }, (_, index) => ({
  id: index + 1,
  email: `user_${index + 1}@example.com`,
  age: Math.floor(Math.random() * 80) + 18,
}));
 
// Simulate an Index Entry: A sorted key mapping to a physical position in the heap
interface IndexEntry {
  key: string;
  heapIndex: number;
}
 
// Build the index: Copy keys, sort them alphabetically, keep reference to physical index
console.time("Build Index Duration");
const emailIndex: IndexEntry[] = tableHeap
  .map((row, index) => ({ key: row.email, heapIndex: index }))
  .sort((a, b) => a.key.localeCompare(b.key));
console.timeEnd("Build Index Duration");
 
/**
 * 1. Sequential Scan - O(N) Complexity
 * We must check every single row in the heap until we find a match.
 */
function sequentialScan(email: string): UserRow | null {
  let comparisons = 0;
  for (let i = 0; i < tableHeap.length; i++) {
    comparisons++;
    if (tableHeap[i].email === email) {
      console.log(`[Seq Scan] Found record in ${comparisons} comparisons`);
      return tableHeap[i];
    }
  }
  console.log(`[Seq Scan] Record not found after ${comparisons} comparisons`);
  return null;
}
 
/**
 * 2. Index Scan (Binary Search Simulation) - O(log N) Complexity
 * We search the sorted index, then use the heap index pointer for O(1) retrieval.
 */
function indexScan(email: string): UserRow | null {
  let low = 0;
  let high = emailIndex.length - 1;
  let comparisons = 0;
 
  while (low <= high) {
    comparisons++;
    const mid = Math.floor((low + high) / 2);
    const midVal = emailIndex[mid].key;
 
    if (midVal === email) {
      console.log(`[Index Scan] Found key in ${comparisons} comparisons`);
      const heapPointer = emailIndex[mid].heapIndex;
      return tableHeap[heapPointer]; // Direct pointer lookup in the heap
    } else if (midVal < email) {
      low = mid + 1;
    } else {
      high = mid - 1;
    }
  }
 
  console.log(`[Index Scan] Key not found after ${comparisons} comparisons`);
  return null;
}
 
// Execute benchmarks
const targetEmail = "user_87654@example.com";
 
console.log("\n--- Executing Queries ---");
console.time("Sequential Scan Runtime");
sequentialScan(targetEmail);
console.timeEnd("Sequential Scan Runtime");
 
console.time("Index Scan Runtime");
indexScan(targetEmail);
console.timeEnd("Index Scan Runtime");

When you run this code, you will see that the sequential scan requires tens of thousands of comparisons, while the index scan finds the record in roughly 15 to 17 comparisons. That is the power of logarithmic scaling.


Why It Matters for Your Code

If indexes are so fast, why don't we just index every single column in our database? This is where many engineers run into production bottlenecks. Understanding the trade-offs is what separates senior developers from juniors.

1. The Write Tax

Every time you execute an INSERT, UPDATE, or DELETE, the database engine must not only write the raw data to the table heap, but it must also update every single B-Tree index associated with that table.

If you have 5 indexes on a table, a single insert generates at least 6 write operations. The database must traverse the tree, find the correct leaf node, insert the key, and potentially rebalance the tree (which involves splitting nodes and writing to disk).

If you are dealing with lightweight state, you might not even need a heavy relational database (we've written before about the comical limits of abusing Gmail as a database), but when you are in the realm of millions of rows, understanding your index layout is non-negotiable.

2. The Left-to-Right Rule (Composite Indexes)

A composite index is an index built on multiple columns, such as:

sql
CREATE INDEX idx_users_last_first ON users(last_name, first_name);

The B-Tree sorts this index first by last_name, and then by first_name within duplicate last

ShareTweet

Related Posts