Skip to main content

Command Palette

Search for a command to run...

Senior Engineers Don’t Trust ACID: 6 Hard Truths About Database Transactions

In the cozy, idealized world of local development, we write code for the “happy path.”

Published
7 min read
Senior Engineers Don’t Trust ACID: 6 Hard Truths About Database Transactions

On localhost, we code for a fantasy where hardware is immortal and networks are perfect.

Production is a violent correction. Packets drop, disks fail, and race conditions trigger 3 AM nightmares.

Transactions are our attempt to tame this chaos, but they aren’t magic.

As an engineer, you must realize that “ACID compliance” is not a binary law of physics it is a spectrum of trade-offs, and many of the guarantees you rely on are just marketing fluff.

Here are six truths about transactions that every senior engineer needs to know.

1. The “C” in ACID is Mostly for Show

We are taught early in our careers that ACID (Atomicity, Consistency, Isolation, and Durability) is the bedrock of reliability. But if you look under the hood, the “C” is a glaring outlier.

While Atomicity, Isolation, and Durability are core properties of the database engine itself, Consistency is almost entirely an application-level property.

The database can enforce technical constraints uniqueness, foreign keys, data types but it has no idea if your business logic remains “consistent.” It doesn’t know that a debit in Table A requires a credit in Table B for the books to balance. The application relies on the database’s other three properties to achieve consistency, but it is the developer’s responsibility to define what a “good state” looks like.

As Joe Hellerstein, a renowned computer scientist at UC Berkeley, famously remarked:

“The C in ACID was ‘tossed in to make the acronym work’ in Härder and Reuter’s paper, and that it wasn’t considered important at the time.”

In practice, the term “Consistency” is a marketing tag. If you write bad data that violates your business rules, an ACID database will store it with perfect atomicity and durability.

2. Atomicity is Actually About “Abortability”

In the world of multi-threaded programming, “atomic” usually refers to isolation ensuring a thread can’t see a half-finished result. In the context of ACID, however, atomicity is not about concurrency at all.

It is about error handling.

ACID Atomicity describes what happens when a client wants to make several writes, but a fault occurs partway through a process crashes, a disk fills up, or a network link snaps. The database’s job is to ensure that the system doesn’t end up in a partially updated, nonsensical state.

This “all-or-nothing” guarantee is the foundation of a retry loop. It allows an application to safely retry a failed operation because it knows that an aborted transaction has discarded every single write it attempted.

If we were being technically honest, we wouldn’t call it Atomicity. We would call it “Abortability.”

3. The “Repeatable Read” Naming Nightmare

One of the greatest risks to a system’s integrity is blindly relying on tools. Many architects assume that an isolation level like “Repeatable Read” is a standardized contract.

It isn’t.

The SQL standard’s definition of isolation levels is fundamentally flawed. It is ambiguous, imprecise, and far too dependent on specific 1970s-era implementations (like System R). This ambiguity has led to a naming disaster where different vendors provide vastly different guarantees under the exact same label.

  • Oracle: The isolation level it markets as “Serializable” is actually just Snapshot Isolation.

  • PostgreSQL & MySQL: They use “Repeatable Read” to describe their implementations of Snapshot Isolation.

  • IBM DB2: When it says, “Repeatable Read,” it actually refers to full Serializability.

You cannot assume your tool handles a specific race condition just because it claims standards compliance. You must look past the marketing label to the underlying implementation, or you’re just waiting for a production anomaly to strike.

4. The Mantra of Snapshot Isolation: Readers and Writers Never Meet

To solve the problem of “read skew” where a user sees an inconsistent state because they queried a record while a transfer was in flight most modern databases utilize Snapshot Isolation. This is powered by Multi-Version Concurrency Control (MVCC).

MVCC works by keeping multiple versions of an object side-by-side. When a transaction starts, it is assigned a unique, always-increasing transaction ID (txid). The database then applies strict visibility rules:

  • It creates a list of all other in-progress transactions.

  • It ignores any of their writes.

  • It ignores any writes made by transactions with a later txid.

By using metadata fields like created_by and deleted_by to track these versions, the database allows each transaction to see a "frozen" snapshot of the data as it existed at the moment it began.

“A key principle of snapshot isolation is readers never block writers, and writers never block readers.”

This mantra is why Snapshot Isolation is a boon for backups and analytics. You can scan a multi-terabyte table for hours without ever interfering with the incoming stream of writes.

5. “Write Skew” The Subtle Bug That Weak Isolation Misses

The most deceptive race condition you will encounter is Write Skew. It is a generalization of the “lost update” problem, but far harder to detect. It occurs when two transactions read the same data, but then update different objects based on a premise that is about to be invalidated.

Consider a hospital app where the business rule is: At least one doctor must be on call.

  1. Alice and Bob are both on call. Both feel ill.

  2. Alice queries the DB: SELECT COUNT(*) WHERE on_call = true. She sees "2".

  3. Bob queries the DB: SELECT COUNT(*) WHERE on_call = true. He also sees "2".

  4. Both conclude it is safe to go off call.

  5. Alice updates her record to false. Bob updates his record to false.

  6. Both commit.

The Result: Zero doctors are on call. Patients are at risk.

This is the “Aha!” moment for why weak isolation fails. Write Skew is often caused by a Phantom. A write in one transaction changes the result of a search query in another. Because the query was looking for the absence of a condition (or a count of rows), there was “no object to which we can attach a lock.”

Your standard row-level locking won’t save you here.

6. The Modern Renaissance of Serializability

Serializability is the gold standard it guarantees that the result of concurrent transactions is the same as if they had run one at a time. After years of being dismissed as a performance killer, it is seeing a renaissance through three distinct approaches:

  1. Actual Serial Execution (Redis, VoltDB): Eliminates concurrency entirely by running transactions on a single thread. This works brilliantly, but only if transactions are very short and the data fits in RAM.

  2. Two-Phase Locking (2PL): The classic pessimistic approach. It prevents all race conditions, but at a massive cost. 2PL often leads to “unstable latencies” and systems “grinding to a halt” because one slow transaction can block every other reader and writer.

  3. Serializable Snapshot Isolation (SSI): The optimistic modern choice used in PostgreSQL. It allows transactions to proceed without blocking, but tracks when a transaction might have acted on an outdated premise (the “tripwire”). If a violation is detected at commit time, the transaction is aborted.

When looking at high-profile systems like Google’s Spanner, the industry consensus is shifting. As James Corbett and the Spanner team noted, it is often better to have your architects deal with performance bottlenecks caused by transactions than to have them “always coding around the lack of transactions.”

Conclusion

Transactions are not laws of nature; they are man-made tools with specific, often sharp, trade-offs. The labels “ACID” and “Serializable” are frequently used as marketing terms that obscure the subtle bugs phantoms and write skews lurking in your code.

As an architect, you must challenge your assumptions. Do you actually know the default isolation level of your production database? Are you ignoring “unstable latencies” or “deadlock frequencies” that indicate your isolation level is poorly matched to your workload?

If you don’t know exactly which “phantoms” your database allows, you aren’t managing a system you’re just getting lucky.

References & Further Reading

For those who want to dig deeper into the mechanics of distributed data, I highly recommend the following sources, which heavily inspired this article:

  1. Kleppmann, M. (2017). Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems. O’Reilly Media. (Chapters 7 & 9 are essential reading on this topic).

  2. Härder, T., & Reuter, A. (1983). Principles of Transaction-Oriented Database Recovery. ACM Computing Surveys. (The paper that coined the ACID acronym).

  3. Corbett, J. C., et al. (2012). Spanner: Google’s Globally Distributed Database. OSDI.

  4. Berenson, H., et al. (1995). A Critique of ANSI SQL Isolation Levels. SIGMOD. (The paper that exposed the ambiguity in SQL standards).