Your Data Isn’t Stored. It’s Negotiated.

You didn’t lose your data.Your system just decided someone else deserved to keep it.

The Lie We All Build On

You click Save.
The screen refreshes.
Your data is still there.

So, you move on.

In your head, a contract has been signed:

“The system has my data. The system will protect my data.”

But modern systems don’t store your data in one place.
They scatter it across machines, racks, regions, and sometimes continents.

And somewhere in that journey, something changes.

Your data stops being a fact
and becomes a negotiation.

The Hidden Cost of “Always On”

Most production databases replicate data asynchronously.

That means your write is accepted by one node and only promised to the rest.

That promise lives in a dangerous gap called replication lag.

And inside that gap, reality bends.

1. Your App Can Time Travel

You update your profile photo.
The system says “Saved.”

You refresh.

It’s gone.

Not deleted.
Just unread by the server you happened to hit next.

Now the weirder version:

You load a page.
You see a new comment.
You reload.

It disappears.

Nothing broke.
You were routed to a slower replica.

From a user’s perspective, your app just rewound time.

From a system’s perspective, this is called:

Eventual consistency

Engineers try to patch over this with guarantees like:

Read-your-writes → You always see your own changes
Monotonic reads → You never see older data after newer data

Each guarantee costs you something.

Speed. Scale. Or availability.

You don’t get all three.
You choose which one hurts.

Stability isn’t free. You’re just paying for it somewhere you can’t see yet.

2. When Safety Systems Become Failure Machines

When a leader database dies, a replica is promoted.

This is called failover.
It sounds heroic.

In reality, it’s a gamble.

Here’s what can go wrong.

Silent Data Loss

If the replica never received your last writes, they’re gone.

The system said “Saved.”
The system lied.

Corruption at Scale

GitHub once promoted a replica whose ID counter was behind.

It reused primary keys.
Those keys were tied to cached user data.

Some users briefly saw other people’s private information.

Split Brain

Two servers believe they’re both in charge.
Both accept writes.

Now your database is arguing with itself.

Some systems solve this by killing one node on purpose.

If that logic fails?

They kill both.
Total outage.

This is why some veteran teams still do failovers manually.

Not because automation is bad
But because automated mistakes happen faster.

3. Why Two Leaders Create Twice the Trouble

Multi-leader systems sound like a dream:

Global apps
Offline support
Regional independence

Until two people change the same thing.

A wiki title starts as “A”
User 1 changes it to “B”
User 2 changes it to “C”

Both succeed.
Both get a green checkmark.

Later, the systems sync.

Now the database has a question it cannot answer:

Which one is true?

The users are gone.
The context is gone.

So, the system guesses.

This is why multi-leader architectures are often described as:

Powerful and dangerous territory

They don’t fail loudly.
They fail creatively.

4. “Last Write Wins” Is a Polite Way to Say “Someone Loses”

The most common conflict strategy is simple:

Whichever write is newest survives.
Everything else is deleted.

Three users can save data.
Two of them will lose it.

No error.
No warning.
No trace.

From their perspective, the system didn’t fail.

It betrayed them.

Some major databases still use this as the default.

It’s not a consistency model.
It’s a data shredder with good branding.

Every distributed system has bugs. The best ones just hide them from users longer.

The Real Fix Isn’t Technical

It’s philosophical.

The moment you stop asking:

“How do I replicate my data?”

And start asking:

“What does my user believe ‘saved’ means?”

Your architecture changes.

Strong systems don’t chase perfect consistency.
They design intentional inconsistency.

They:

Make critical data immutable
Route users to the same replicas per session
Surface conflicts instead of hiding them
Choose correctness or availability per feature not per system

They don’t eliminate failure.

They decide where it’s allowed to exist.

The 4 Questions That Change Your Architecture

Before you ship your next system, write these down:

Which actions can never be lost?
Which screens can never go backward in time?
Where are conflicts acceptable and where are they catastrophic?
When the system is unsure, should it fail or guess?

Your replication strategy should answer those.

Not your database.
Not your cloud provider.

You.

Final Thought

Your system isn’t broken when data behaves strangely.

It’s doing exactly what you designed it to do.

The real question is:

Did you design it for a clean diagram…
or for a world where servers fail, networks split, and users still expect the truth?

Your Data Isn’t Stored. It’s Negotiated.

The Lie We All Build On

The Hidden Cost of “Always On”

1. Your App Can Time Travel

2. When Safety Systems Become Failure Machines

Silent Data Loss

Corruption at Scale

Split Brain

3. Why Two Leaders Create Twice the Trouble

4. “Last Write Wins” Is a Polite Way to Say “Someone Loses”

The Real Fix Isn’t Technical

The 4 Questions That Change Your Architecture

Final Thought

Comments

More from this blog

I Was Doing AI Engineering Wrong; Chip Huyen’s Book Set Me Straight

If You Think Distributed Systems Are Deterministic, Read This

Senior Engineers Don’t Trust ACID: 6 Hard Truths About Database Transactions

How to Crash Your Production Database by Adding a Single Node 101

Command Palette

The Lie We All Build On

The Hidden Cost of “Always On”

1. Your App Can Time Travel

2. When Safety Systems Become Failure Machines

Silent Data Loss

Corruption at Scale

Split Brain

3. Why Two Leaders Create Twice the Trouble

4. “Last Write Wins” Is a Polite Way to Say “Someone Loses”

The Real Fix Isn’t Technical

The 4 Questions That Change Your Architecture

Final Thought

Comments

More from this blog