The Prophetess Nobody Believed: The Origin Story of Apache Cassandra


1. THE HOOK — The Name Is The Story

In 2007, two engineers at Facebook needed a name for the distributed database they'd just built in a company hackathon. They chose Cassandra — the Trojan prophetess who could see the future but whom no one would ever believe.

The irony was intentional. But it worked on two levels, and only one of them got remembered.

The version everyone tells: it was named after a mythological oracle, a clever nod to a database that "sees" and stores everything. Pretty enough metaphor. But co-creator Prashant Malik, in a 2019 interview, gave the real reason: "We had named it after an oracle in Greek mythology. We thought it would become bigger than Oracle."

They were poking the bear. Oracle — the company that dominated enterprise databases, that no serious infrastructure team could imagine replacing — was the implicit target. Two engineers at Facebook, who'd just built something at a hackathon, were quietly confident they'd built its successor.

Nobody believed them.

Which is, of course, exactly what Cassandra the prophetess would have predicted.


2. THE BACKSTORY — The Exact Problem Facebook Had

The standard story says Cassandra was built to solve Facebook's "scale" problem. That is technically true and completely misleading. The specific problem was Inbox Search — and the engineering constraints of that feature forced a set of design decisions that would define distributed databases for the next two decades.

Here is what Inbox Search required, precisely:

  • A user types a word into Facebook's message search bar.
  • The system must instantly search billions of messages across that user's entire message history.
  • This has to work for 100+ million users simultaneously.
  • The data must be globally replicated across multiple data centers.
  • It must be available 24/7 with no tolerance for downtime.
  • The write volume is enormous — every message sent anywhere must be indexed in real time.

The challenge was write throughput. Traditional relational databases — MySQL, Oracle, PostgreSQL — are optimized for reads and complex joins. They use B+ tree data structures that require random disk seeks on writes. At Facebook's message volume, random disk writes become the bottleneck. The database can't keep up.
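To see why the bottleneck mattered, it helps to sketch the alternative Cassandra took — a storage design borrowed, as we'll see, from Google's Bigtable: never update data in place. Every write is appended to a commit log, buffered in memory, and periodically flushed out as an immutable sorted file, so the disk only ever does sequential I/O. Below is a deliberately toy Python sketch of that write path — not Cassandra's code, and with every name invented — just to make the shape of the idea concrete.

```python
import json

class ToyLogStructuredStore:
    """Toy sketch of a log-structured write path (not Cassandra's code).

    Every write is (1) appended to a commit log for durability, (2) buffered
    in an in-memory table, and (3) eventually flushed as an immutable sorted
    run. The disk only ever sees sequential appends -- no random seeks.
    """

    def __init__(self, flush_threshold=4):
        self.commit_log = open("toy_commit.log", "a")  # sequential appends only
        self.memtable = {}                             # in-memory buffer
        self.sstables = []                             # immutable sorted runs
        self.flush_threshold = flush_threshold

    def write(self, key, value):
        # Durability first: one sequential append, no seek.
        self.commit_log.write(json.dumps([key, value]) + "\n")
        self.commit_log.flush()
        # Then buffer in memory until the memtable is full.
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_threshold:
            self._flush()

    def _flush(self):
        # Dump the buffer as one sorted, immutable run -- again sequential I/O.
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable.clear()

    def read(self, key):
        # Newest data wins: check the memtable, then runs from newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.sstables):
            for k, v in run:
                if k == key:
                    return v
        return None
```

The real storage engine layers compaction, bloom filters, and replication on top, but the core move — turning random writes into sequential appends — is what made Facebook's write volume tractable.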

Avinash Lakshman had seen this problem before. At Amazon, he was one of the co-authors of the Dynamo paper (2007) — Amazon's internal distributed key-value store built for the shopping cart, designed around the insight that availability matters more than perfect consistency. He understood the trade-off intimately.

At Facebook, he partnered with Prashant Malik, an IIT Delhi classmate who had worked on database technologies at Oracle. Together they had a moment of insight: "Why not bring Bigtable on top of Dynamo?" Google's Bigtable (2006) offered the data model — a wide column store that could handle enormous datasets. Dynamo offered the distribution strategy — consistent hashing, peer-to-peer architecture, no single master node.
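The Dynamo half of that sentence is what made the architecture feel alien: there is no master node coordinating writes. Every node hashes onto the same ring, a key's position on the ring decides which node owns it, and any node can take any request. Here is a toy consistent-hashing ring in Python — an illustration of the idea, not Cassandra's implementation, which adds virtual nodes and replication on top:

```python
import hashlib
from bisect import bisect_right

def token(key: str) -> int:
    """Hash any key onto a fixed ring, the way Dynamo-style systems do."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class Ring:
    """Toy consistent-hash ring: each node owns the arc ending at its token."""

    def __init__(self, nodes):
        self.nodes = sorted(nodes, key=token)
        self.tokens = [token(n) for n in self.nodes]

    def owner(self, key: str) -> str:
        # Walk clockwise from the key's position to the next node token,
        # wrapping around the ring if we fall off the end.
        i = bisect_right(self.tokens, token(key)) % len(self.nodes)
        return self.nodes[i]

ring = Ring(["node-a", "node-b", "node-c"])
print(ring.owner("user:42:inbox"))  # any node can be asked; none is special

# Adding a node only claims a slice of the ring; keys outside that slice
# keep their owners -- the "consistent" in consistent hashing.
bigger = Ring(["node-a", "node-b", "node-c", "node-d"])
```

Because growing the cluster only remaps a fraction of the keys, nodes can be added without a central coordinator or a full data reshuffle — the property that makes peer-to-peer scaling practical.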

Nobody had combined them. They did it at a hackathon.

The result ran on Facebook's first production deployment with:
- 600+ processor cores
- 120+ terabytes of disk space
- Handling the inbox search index for 100 million users

When Inbox Search launched in June 2008, it ran on Cassandra. Facebook open-sourced the code the following month, and by the time Lakshman and Malik published the Cassandra paper in 2009, the cluster was storing data for roughly 250 million users.


3. THE GRIND — The "Nobody Believed It" Period

Facebook open-sourced Cassandra on Google Code in July 2008. The reaction from the industry was not applause. It was skepticism verging on dismissal.

The specific shape of the resistance came from a fundamental design choice: Cassandra chose availability over consistency. The CAP theorem — the foundational framework for distributed systems — says you cannot guarantee Consistency, Availability, and Partition Tolerance all at once: when the network partitions, you have to choose between staying consistent and staying available. Cassandra picked AP: availability and partition tolerance.

What this meant in practice: if you write a record to Cassandra, and then immediately read it back, you might not get the data you just wrote. You might get stale data from a replica that hasn't caught up yet. This is called "eventual consistency" — the system promises your data will eventually propagate everywhere, but offers no guarantee about when.

For developers trained on relational databases with ACID transactions, this felt like madness. Databases were supposed to be correct. Cassandra was designed to be available even when it might be slightly wrong. That was a philosophical inversion most engineers weren't ready for.
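Worth noting: the trade-off was never quite all-or-nothing. Cassandra lets the client decide, per operation, how many replicas must acknowledge a read or a write. As a rough sketch of what that looks like for a developer today — using the open-source DataStax Python driver against a hypothetical keyspace and table, with every name below invented for illustration:

```python
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

# Hypothetical cluster, keyspace, and table -- invented for illustration.
cluster = Cluster(["127.0.0.1"])
session = cluster.connect("inbox_demo")

# Write acknowledged by a single replica: fastest, most available.
write_one = SimpleStatement(
    "INSERT INTO messages (user_id, msg_id, body) VALUES (%s, %s, %s)",
    consistency_level=ConsistencyLevel.ONE,
)
session.execute(write_one, (42, 1001, "hello"))

# Immediately reading at ONE may land on a replica that hasn't caught up yet.
read_one = SimpleStatement(
    "SELECT body FROM messages WHERE user_id = %s AND msg_id = %s",
    consistency_level=ConsistencyLevel.ONE,
)
maybe_stale = session.execute(read_one, (42, 1001)).one()  # can be old or missing

# Writing AND reading at QUORUM forces the replica sets to overlap,
# so the read is guaranteed to see the write (at the cost of latency).
write_q = SimpleStatement(
    "INSERT INTO messages (user_id, msg_id, body) VALUES (%s, %s, %s)",
    consistency_level=ConsistencyLevel.QUORUM,
)
session.execute(write_q, (42, 1002, "hello again"))

read_q = SimpleStatement(
    "SELECT body FROM messages WHERE user_id = %s AND msg_id = %s",
    consistency_level=ConsistencyLevel.QUORUM,
)
fresh = session.execute(read_q, (42, 1002)).one()
```

Read and write at ONE and you get the fast, always-available, possibly stale behavior described above. Read and write at QUORUM — two of three replicas each way, with the usual replication factor of 3 — and the replica sets must overlap, so a read sees its own write. In 2008, though, the headline was simply "eventually consistent," and that is what stuck.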

The Facebook open-source release also had a specific operational problem: Facebook maintained an internal repository and periodically pushed code externally. As Jonathan Ellis — the man who would later co-found DataStax — wrote, this was "no way to run an OSS project." The project looked moribund. The community couldn't contribute, couldn't trust the roadmap, couldn't build on it.

Ellis, then at Rackspace, started working with the code anyway. He got voted in as an Apache committer. In March 2009, Cassandra moved to Apache's infrastructure as an incubator project. That move — putting the code genuinely in the open — is what restarted momentum.

Then came the Digg v4 disaster.

2010: The Public Humiliation

Digg — then one of the biggest social news sites on the internet, a direct competitor to Reddit — rebuilt its entire platform on Cassandra for version 4. The relaunch was catastrophic. The site was slow, buggy, and deeply unpopular with users. Kevin Rose, the founder, publicly suggested Cassandra was to blame.

The internet believed him. Database engineers did not.

Digg's own engineers went on Quora to clarify: the architectural migration involved multiple technologies simultaneously, the bugs weren't in Cassandra specifically, and the cluster hadn't been properly tuned. Riptano (the commercial Cassandra support company, later DataStax) stated flatly: "Cassandra can scale to levels equal to or greater than what Digg was putting on it."

It didn't matter. The headline was written: "Digg bet on Cassandra. Digg is dying." The association stuck.

Meanwhile, quietly, Reddit was running the same migration. They finished it. They credited Cassandra with enabling their 3x traffic growth in 2010 — the year Digg collapsed and users fled to Reddit. The prophetess who wasn't believed was, again, correct.


4. THE BREAKTHROUGH — Open Source Release and the Adoption Wave

February 17, 2010: Apache Cassandra graduates from the incubator to a top-level Apache Software Foundation project.

The timing was right. The world was changing in a way that made Cassandra's design assumptions suddenly look prophetic rather than reckless. Mobile was exploding. Every app needed global replication. Social networks were discovering what "write-heavy workloads" actually meant at scale. The companies that needed Cassandra didn't know they needed it yet — but they were building toward it.

The early adopter list reads like a who's who of early internet scale:
- Twitter — used Cassandra for geolocation data and real-time analytics (more on their complicated relationship later)
- Digg — the cautionary tale (but engineers insist it wasn't Cassandra's fault)
- Reddit — the vindication
- Cisco, Rackspace, eBay — enterprise legitimacy
- Disney, Netflix — entertainment scale
- By 2012: 1,000+ production deployments

The commercialization parallel: In 2010, Jonathan Ellis and Matt Pfeil left Rackspace to found Riptano in Austin, Texas — the first company built specifically to support Cassandra enterprise deployments. They renamed it DataStax later that year, moved to Santa Clara, and released the first commercial distribution (DataStax Enterprise v1.0) in October 2011.

DataStax's pitch was elegant: Cassandra is free. We sell the expertise, the tooling, the support, and the integrated analytics layer you need to actually run it in production. The classic open-source commercialization playbook, executed at exactly the right moment.

By 2015, 90% of Fortune 100 companies were running Cassandra in some form.


5. THE AFTERMATH — The Prophet Who Was Right

The scale numbers, when you actually look at them, are hard to process.

Apple:
- 75,000+ Cassandra nodes
- 10+ petabytes of data
- Millions of operations per second
- At least one cluster exceeding 1,000 nodes

Apple doesn't talk about its infrastructure publicly. This number surfaced from engineers at Cassandra conferences, almost as an aside. Seventy-five thousand nodes. A single company. Running one database.

Netflix:
- 10,000+ Cassandra instances
- 6+ petabytes of data
- 100+ clusters
- Over 1 trillion requests per day

Netflix runs Cassandra as the backbone of its viewing history, payments, and recommendation systems. When you press play on a movie, Cassandra is part of what happens in the next 200 milliseconds.

Instagram:
One of the major early social platforms to run Cassandra at scale, pushing the feed data of millions of concurrent users through the system.

Uber:
Scaled Cassandra to tens of thousands of nodes for location tracking, trip data, and real-time surge pricing.

The prophet who predicted the end of Oracle — who was dismissed, blamed for Digg's failure, criticized for making the wrong CAP theorem trade-off — ended up as the infrastructure underneath Apple's iCloud, Netflix's homepage, and Instagram's feed.

Facebook's 150-node cluster from 2008 became Apple's 75,000-node empire.

Cassandra saw it coming.


6. THE TWITTER COMPLICATION — The Messy Non-Hero Arc

Twitter's relationship with Cassandra is the most honest part of this story, because it isn't clean.

2009-2010: Twitter uses Cassandra for geolocation storage, analytics, and real-time counters. They publish blog posts about it. They have engineers dedicated to it. The tech press writes headlines like "Twitter Drops MySQL For Cassandra."

2010: Ryan King of Twitter says the quiet part out loud — they're not going to migrate tweet storage to Cassandra. The actual quote: "This is a change in strategy. Instead we're going to continue to maintain our existing MySQL-based storage." The reason wasn't a technical failure. It was bandwidth: "We believe that this isn't the time to make a large scale migration to a new technology."

2014: Twitter builds Manhattan, their own internal distributed database, and begins phasing out Cassandra for several workloads.

The read: Twitter adopted Cassandra, got ambivalent, built their own thing instead, and never came back.

The more accurate read: Twitter used Cassandra for what it was good for (high-throughput, availability-critical analytics and counters), made a pragmatic decision not to migrate their core tweet storage, and eventually outgrew the early version of the project. This is not failure — it's what mature infrastructure decisions actually look like. Not every major company that adopted Cassandra is still running it in its original form. Some outgrew the open-source version and built derivatives. Some switched workloads. Some went deeper.

The "Twitter abandoned Cassandra" headline made for a compelling story. The reality was more mundane and more interesting: they made a sensible risk-management call about a large-scale migration during a period of rapid growth.


7. FIVE THINGS NOBODY TALKS ABOUT WITH CASSANDRA

1. The name was a shot at Oracle, not a mythological poetry exercise.
Prashant Malik confirmed it: "We thought it would become bigger than Oracle." Two hackathon engineers named their database after a competing oracle because they actually believed it. They were right. Cassandra now powers infrastructure that Oracle hasn't touched.

2. Facebook had a 150-node cluster running for 500 million users in 2010 — the same year Digg's engineers couldn't make it work.
Same database. Radically different outcomes. The Digg failure wasn't a Cassandra failure. It was an operational failure on a new architecture launched before it was ready. Facebook ran the same technology silently, at orders of magnitude larger scale, without incident.

3. Apple runs more Cassandra nodes than most tech companies have engineers.
75,000 nodes. Apple operates this infrastructure for iCloud. They have never published a case study, given a conference talk about it, or acknowledged it in a press release. The number emerged sideways from infrastructure conference discussions. A company famous for secrecy runs the world's largest known Cassandra deployment, and almost nobody talks about it.

4. The "eventually consistent" criticism almost killed it — and is exactly what makes it work for most real-world use cases.
The engineers who criticized Cassandra's AP trade-off were thinking about ACID transactions for financial data. Cassandra was designed for something different: user feeds, message indices, location data, analytics. For those workloads, availability matters far more than strict, immediate consistency. The criticisms were technically correct about the wrong use case.

5. The project was nearly moribund two years after launch, saved by a single engineer who got voted in as a committer.
Jonathan Ellis, working at Rackspace with no Facebook affiliation, started contributing to the open-source codebase in 2009 when it was stalled. He got voted in as a committer, helped push it to the Apache Foundation, and then left Rackspace to build DataStax on top of it. The entire commercial trajectory of Cassandra — 90% of Fortune 100 running it — traces back to one engineer who decided to show up when the project was going nowhere.


SOURCE NOTES

Primary sources consulted for this biography:
- Facebook Engineering Blog (2008): Cassandra — A Structured Storage System on a P2P Network
- Cassandra original paper: Lakshman & Malik, LADIS 2009
- YourStory interview with Prashant Malik, November 2019
- Jonathan Ellis (DataStax co-founder), spyced.blogspot.com, 2009–2011
- High Scalability: "Why Twitter Really Not Using Cassandra to Store Tweets" (2010)
- LinkedIn Pulse: "Breathtaking Scale: 75,000 Cassandra Nodes and 10 Petabytes of Data"
- Netflix Technology Blog: "Benchmarking Cassandra Scalability on AWS"
- Quora: "Is Cassandra to blame for Digg v4's technical failures?"
- TechCrunch/DataStax: "How the World Caught Up with Apache Cassandra"
- DataStax Wikipedia / Instaclustr history articles
