5 Counter-Intuitive Truths About How Databases Actually Work

Introduction: Beyond the Black Box
Databases are the foundation of modern applications, quietly storing and retrieving the information that powers everything from social media feeds to online shopping carts. For many developers, they are a "black box": you put data in, you get data out. But what actually happens inside that box? When you save a user profile or log a transaction, how does the database physically arrange that data to find it again, instantly, among billions of other records?
A German proverb offers a surprisingly relevant insight into this challenge:
Wer Ordnung hält, ist nur zu faul zum Suchen. (If you keep things tidily ordered, you’re just too lazy to go searching.)
This proverb perfectly captures the fundamental tension inside every database storage engine. It pits the upfront effort of organization (keeping things tidy) against the later effort of retrieval (searching). This is the exact trade-off databases constantly make: fast writes (quickly organizing new data) often come at the cost of slower reads (searching for that data), and vice versa.
This article will look under the hood to reveal a few powerful and counter-intuitive truths about how databases actually work, challenging common assumptions about storing and retrieving data.
1. The Simplest High-Performance Database is... Just a Log File
If you were to build the world's simplest database, you could do it with two short Bash functions:
db_set () {
echo "$1,$2" >> database
}
db_get () {
grep "^$1," database | sed -e "s/^$1,//" | tail -n 1
}
The db_set function takes a key and a value and simply appends them as a new line to the end of a file called database. This file is an append-only data file, more commonly known as a log. If you update a key, the old value isn't overwritten; the new value is just added to the end of the file.
Here's the counter-intuitive part: this incredibly simple db_set operation is extremely fast. Appending to a file is a highly efficient, sequential write operation. This very principle treating the database as a simple, ever-growing log forms the high-performance foundation of many real-world database systems.
The massive drawback, of course, is retrieval. The db_get function has terrible performance because, for every lookup, it must scan the entire file from beginning to end to find the last (and therefore most recent) entry for a given key. This is an operation with a cost of O(n), meaning the lookup time grows in direct proportion to the number of records. To solve this, we need something to help us find data without scanning everything: an index.
2. Indexes Are Powerful, But They Aren't a Free Lunch
To avoid the slow, O(n) scan, databases use an index an additional structure derived from the primary data that acts as a signpost to help locate data quickly. But these signposts come at a cost. Maintaining them requires overhead, which introduces a critical trade-off that every developer must understand.
This is an important trade-off in storage systems: well-chosen indexes speed up read queries, but every index slows down writes.
Why do indexes slow down writes? Because every time you add or update data in your main data file, the index must also be updated to reflect that change. The more indexes you have, the more work the database has to do for every single write operation.
For this reason, databases don't index everything by default. They require you the application developer or database administrator to choose indexes manually, using your knowledge of the application’s typical query patterns. This elevates indexing from a simple performance tweak to a strategic decision that reflects how you expect your application to be used.
3. The Best Way to Update Data Is To Never Actually Update It
It seems logical that updating a record would mean finding the old data on disk and overwriting it with the new data. However, many high-performance databases do the opposite: they never modify data files in place.
Instead, they fully embrace the append-only design. When a record is "updated," the new version is just appended to a log file. To prevent the log file from growing infinitely, it is broken into segments. Periodically, a background process performs compaction and merging. This process creates a new, compacted segment file by reviewing the older segments, throwing away any duplicate or outdated values, and keeping only the most recent version of each key.
This append-only design is highly effective for three main reasons:
Appending and merging are sequential write operations, which are generally much faster than random writes, especially on magnetic spinning-disk hard drives. To some extent, sequential writes are also preferable on flash-based solid state drives (SSDs).
Concurrency and crash recovery are much simpler. With immutable files, you don’t have to worry about the case where a crash happened while a value was being overwritten, leaving you with a file containing a mix of the old and part of the new value spliced together.
Merging old segments avoids data fragmentation over time, keeping storage compact.
4. There Are Two Warring Philosophies for Storing Your Data
When it comes to the complex indexes that power most databases, two dominant and fundamentally different philosophies have emerged: Log-Structured Merge-Trees (LSM-Trees) and B-Trees.
The Log-Structured Merge-Tree (LSM-Tree) is a direct evolution of the append-only philosophy we've just discussed. Instead of writing directly to a file on disk, an LSM-Tree first adds all incoming writes to a sorted data structure in memory, called a memtable. Because it’s in memory, keeping it sorted is fast and easy. Once the memtable reaches a certain size, it is flushed to disk as a new, immutable file segment called a Sorted String Table (SSTable). In the background, a process constantly merges these SSTable segments, compacting the data and removing duplicates. This clever two-step process sort in memory, then write sequentially to disk is how systems like LevelDB, RocksDB, and Cassandra achieve incredibly high write throughput.
In the other corner stands the B-Tree, the most widely used indexing structure in the world and the foundation for nearly all relational databases. B-Trees adopt the exact opposite philosophy: updating data in place. A B-Tree breaks the database into fixed-size blocks or pages (typically 4KB). When data is changed, the B-Tree algorithm finds the correct page on disk, modifies it with the new information, and overwrites the entire page in place.
This fundamental difference leads to a simple performance heuristic.
As a rule of thumb, LSM-trees are typically faster for writes, whereas B-trees are thought to be faster for reads.
Both approaches also contend with a phenomenon known as write amplification where a single write from an application result in multiple physical writes to the disk. This is a key performance concern, especially for SSDs which have a finite lifespan.
5. For Big Data Analytics, We Flip Tables on Their Side
Database workloads are not all the same. They generally fall into two categories:
Online Transaction Processing (OLTP): The rapid, small reads and writes typical of user-facing applications (e.g., getting a user's profile, posting a comment).
Online Analytic Processing (OLAP): The large, aggregate queries typical of business intelligence and data analysis (e.g., "What was the total revenue for all stores last month?").
For OLTP workloads, traditional databases use row-oriented storage, where all the values belonging to a single row are stored together. This is efficient for fetching an entire record. However, it's terribly inefficient for analytic queries, which might need to calculate the sum of a single column (e.g., net_price) across millions of rows. A row-oriented database would have to load all 100+ columns for every one of those million rows into memory, just to use one of them.
The solution for OLAP is a simple but powerful idea: column-oriented storage.
...don’t store all the values from one row together but store all the values from each column together instead.
Each column gets its own file. This columnar approach has two huge benefits for analytics:
A query only needs to read the files for the columns it actually uses, dramatically reducing the amount of data read from disk.
The data within a single column is often highly repetitive (e.g., a
categorycolumn might only have a few dozen unique values repeated millions of times). This makes it extremely suitable for compression, further reducing the amount of data the system needs to process.
Conclusion: A Look Inside the Engine
We've journeyed from a simple append-only log file to the sophisticated engineering of B-Trees and the clever "flipped" world of columnar stores. The underlying theme is that the way data is physically stored and indexed has profound implications for performance.
The counter-intuitive truths that appending is better than modifying, that indexes have a cost, and that storage layout depends entirely on the workload reveal that there is no single "best" database. The ideal choice is always a matter of trade-offs, optimized for either the rapid, targeted queries of OLTP or the massive, aggregate scans of OLAP.
Now that you've seen inside the engine, will you think differently about the trade-offs involved in the databases you use every day?



