Seven Academics Who Had Never Run a Company: The Origin Story of Databricks



THE HOOK: A PhD Student Stares at a Number

Berkeley, California. 2009.

Matei Zaharia is in Soda Hall, the computer science building at UC Berkeley that smells of old coffee and ambition, and he is watching a machine learning job crawl.

The job should take minutes. It is going to take hours. He knows why, because the reason is obvious once you understand how MapReduce works—and understanding MapReduce is one of the few things Zaharia is genuinely good at.

Hadoop MapReduce, the system Google described in a 2004 paper and Yahoo built into a product, had become the religion of big data. If you wanted to process a massive dataset, you used Hadoop. Every major tech company ran it. Every data engineering job posting required it. The logic was settled, the architecture was blessed, and the consensus was total.

Zaharia looked at his job logs and thought: this is wrong.

Not wrong in a small way. Wrong in a foundational way. Wrong in the way that makes a certain kind of person put down their coffee and start building something new.

What he built would eventually be used by thousands of companies to process petabytes of data every day. What his colleagues built around it would become one of the most valuable private technology companies in history. But none of that was visible in 2009. What was visible, in that moment, was a number on a screen—a job that should have finished but hadn't—and a 24-year-old PhD student who had a different idea about why.


THE BACKSTORY: The Lab That Wanted to Change the World

The AMPLab—Algorithms, Machines, and People Laboratory—was founded at UC Berkeley in 2009 with $40 million in funding from DARPA, the National Science Foundation, and a consortium of technology companies.

The premise was unusual for academic research. Most university labs optimized for papers. AMPLab wanted to optimize for impact. The faculty running it were some of the best distributed systems researchers in the world, and they wanted to solve real problems—not just describe them.

Ion Stoica was one of the principal investigators. Romanian-born, Berkeley-trained, Stoica was the kind of academic who wrote foundational papers and then watched the industry build billion-dollar companies on top of them. His earlier work on Chord, a distributed hash table protocol, had become infrastructure for peer-to-peer systems worldwide. He was patient with ideas, methodical with proofs, and deeply curious about the gap between what systems could theoretically do and what they actually did.

Scott Shenker was another. Shenker spanned disciplines—networking, economics, computer science—in a way that made him hard to categorize and easy to collaborate with. He had a gift for seeing the essential shape of a problem before the details resolved.

Michael Franklin ran the database side, focused on how you query and govern data at scale. His background was in systems that had to handle not just large data but uncertain data—data that was incomplete, inconsistent, or still arriving.

Into this lab arrived Matei Zaharia.


THE GRIND: What Hadoop Got Wrong

To understand what Zaharia built, you have to understand what MapReduce was designed to do—and what it was never designed to do at all.

Google's MapReduce paper described a system for batch processing. You had an enormous dataset. You wanted to run a computation across it. The system split the work across hundreds of machines, processed it in parallel, merged the results. It was elegant, fault-tolerant, and genuinely powerful for the problem it addressed.

The problem it addressed was batch processing. Not machine learning. Not iterative computation. Not anything where you needed to see the data multiple times.

Machine learning algorithms are, almost by definition, iterative. Gradient descent—the engine behind most ML models—requires running repeatedly over the same dataset, adjusting weights each time, until the model converges. With MapReduce, each iteration meant reading from disk, processing, writing to disk. Every single pass. The intermediate results were never kept in memory. They were flushed to the distributed file system and reloaded from scratch.

If your dataset lived on 200 hard drives, and each iteration required a full read-write cycle across all of them, and your algorithm needed 100 iterations to converge, you were performing 20,000 full disk read-write cycles for what was conceptually a single training run.

This was not a bug. It was the architecture. MapReduce had been designed for workloads where you read once and write once. It worked beautifully for those workloads. It was catastrophic for everything else.
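The difference between the two execution models can be sketched in a few lines. This is an illustrative toy, not Spark code: `load_dataset` is a hypothetical stand-in for an expensive distributed disk scan, and the "model" is a single weight fit by gradient descent.

```python
import random

def load_dataset():
    # Stand-in for an expensive distributed disk read (e.g., an HDFS scan).
    random.seed(0)
    return [(x, 3.0 * x) for x in [random.uniform(0, 1) for _ in range(1000)]]

def gradient_step(data, w, lr=0.1):
    # One gradient descent step for the model y = w * x under squared loss.
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    return w - lr * grad

# MapReduce-style: the dataset is re-read from "disk" on every iteration.
w = 0.0
for _ in range(100):
    data = load_dataset()           # full scan, every single pass
    w = gradient_step(data, w)

# Spark-style: load once, keep the working set in memory across iterations.
cached = load_dataset()             # one scan, then cached
w2 = 0.0
for _ in range(100):
    w2 = gradient_step(cached, w2)  # no I/O inside the loop

# Both loops converge to the same weight (~3.0); only the I/O pattern differs.
```

The computation is identical in both cases; what Spark eliminated was the per-iteration I/O, which is where nearly all the wall-clock time went.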

Zaharia's insight was not subtle. Keep the data in memory.

That sentence contains multitudes.


THE INVENTION: Spark

The technical challenge was this: if you keep data in memory across a cluster of machines, you have a fault tolerance problem. Memory is volatile. Machines crash. If a node holding 40GB of intermediate data dies, that data is gone—and in MapReduce's model, data loss meant restarting from scratch, because you always had the original dataset on disk to fall back to.

If you eliminate the disk writes to get speed, you also eliminate the recovery mechanism that the disk writes provided.

Zaharia's solution was the Resilient Distributed Dataset, or RDD.

An RDD was not stored redundantly. It did not replicate itself across machines the way a traditional database would. Instead, it remembered. Each RDD knew its lineage—the sequence of transformations that had been applied to the original source data to produce it. If a partition of an RDD was lost, the system didn't need a backup copy. It just replayed the lineage from wherever a clean copy existed.

This was fault tolerance through provenance rather than redundancy. It was faster, it was cleaner, and it worked.
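The lineage idea can be reduced to a toy model (this is not the real Spark RDD API, just the shape of the mechanism): a partition is never backed up; only the recipe for producing it is kept, and losing the in-memory copy triggers recomputation from the durable source.

```python
class ToyRDD:
    """Toy model of lineage-based fault tolerance. Not the real Spark API."""

    def __init__(self, source, lineage=()):
        self.source = source       # durable base data (e.g., files on disk)
        self.lineage = lineage     # sequence of transformations to replay
        self._cache = None         # in-memory materialization; may be lost

    def map(self, fn):
        # Record the transformation instead of eagerly copying data.
        return ToyRDD(self.source, self.lineage + (("map", fn),))

    def filter(self, fn):
        return ToyRDD(self.source, self.lineage + (("filter", fn),))

    def compute(self):
        # Replay the full lineage starting from the durable source.
        data = list(self.source)
        for op, fn in self.lineage:
            if op == "map":
                data = [fn(x) for x in data]
            else:
                data = [x for x in data if fn(x)]
        return data

    def collect(self):
        if self._cache is None:           # cache lost (or never built):
            self._cache = self.compute()  # recover by replay, not by replica
        return self._cache

rdd = ToyRDD(range(10)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
print(rdd.collect())   # [0, 4, 16, 36, 64]
rdd._cache = None      # simulate a node crash wiping the in-memory copy
print(rdd.collect())   # same result, rebuilt from lineage alone
```

Nothing was ever replicated here; the lineage tuple is the only "backup," and it is tiny compared to the data it can regenerate.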

In 2010, Zaharia and his collaborators—Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, Ion Stoica—published the first Spark paper at HotCloud. Two years later, in 2012, the full RDD paper appeared at NSDI and won the Best Paper Award.

The benchmarks were not modest. Spark ran iterative machine learning algorithms up to 100 times faster than Hadoop MapReduce. On disk-based workloads, even without the memory advantage in full effect, it was still 10 times faster.

Zaharia won the ACM Doctoral Dissertation Award in 2014 for this work. The committee cited it as one of the most impactful dissertations in the history of computer science.

But by then, something else was already happening.


THE SWEDE FROM TEHRAN: Ali Ghodsi

In December 1978, Ali Ghodsi was born in Tehran. In 1983, when he was five years old, his family fled Iran as the Iran-Iraq war escalated. They settled in Stockholm.

Sweden gave the Ghodsi family safety, but not ease. His parents—both physicians in Iran—faced recertification barriers and had to rebuild their careers. Ghodsi grew up aware of being an outsider, aware that the credentials his parents had worked decades to earn meant nothing on the other side of a border, aware that belonging was not given but constructed.

He studied engineering at Mid Sweden University, then completed a PhD at KTH Royal Institute of Technology in Stockholm in 2006, working in distributed computing under Seif Haridi. At Berkeley he would go on to co-author the "Dominant Resource Fairness" paper—an algorithm for fairly allocating multiple types of resources across competing tasks in a shared cluster. The paper became foundational to Apache Mesos, the cluster manager that would later power Twitter's and Apple's infrastructure.
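The core of Dominant Resource Fairness fits in a short sketch: each user's "dominant share" is their largest fractional use of any resource, and the scheduler repeatedly grants one task to whichever user currently has the lowest dominant share. This is a minimal illustration of the published idea, not a production scheduler.

```python
def drf_allocate(capacity, demands):
    """Dominant Resource Fairness: repeatedly give one task to the user
    with the lowest dominant share, until no user's next task fits.
    capacity and each per-user demand are {resource: amount} dicts."""
    used = {r: 0.0 for r in capacity}
    tasks = {u: 0 for u in demands}
    while True:
        def dom_share(u):
            # Dominant share = max over resources of (user's usage / capacity).
            return max(tasks[u] * demands[u][r] / capacity[r] for r in capacity)
        # Try the most starved user first; grant a task if it still fits.
        for u in sorted(demands, key=dom_share):
            if all(used[r] + demands[u][r] <= capacity[r] for r in capacity):
                tasks[u] += 1
                for r in capacity:
                    used[r] += demands[u][r]
                break
        else:
            return tasks  # no user's next task fits: allocation is complete

# The worked example from the DRF paper: 9 CPUs, 18 GB of memory.
cap = {"cpu": 9, "mem": 18}
dem = {"A": {"cpu": 1, "mem": 4},   # A's dominant resource is memory
       "B": {"cpu": 3, "mem": 1}}   # B's dominant resource is CPU
print(drf_allocate(cap, dem))       # {'A': 3, 'B': 2}
```

Both users end up with a dominant share of 2/3—A holds 12 of 18 GB, B holds 6 of 9 CPUs—which is exactly the equalization the algorithm aims for.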

He became an assistant professor at KTH. He taught well. He published. He found the environment suffocating.

Swedish academia in 2009 rewarded seniority. Progress was incremental and slow. Ghodsi was 30 years old and already impatient. When a visiting scholar position at UC Berkeley's AMPLab opened, he took it.

Berkeley was different. The faculty moved fast. The grad students were doing work that would become products within years. The distance between an idea and an implementation was measured in months, not decades.

In the halls of Soda Hall, Ghodsi met Zaharia. They were studying the same problems from different angles—Zaharia obsessed with computation speed, Ghodsi obsessed with resource allocation and fairness. They found the intersection of their obsessions immediately. What if you could build a system that was both fast and fair? What if the resource scheduler and the computation engine were designed together?

Ghodsi saw in Spark something that Zaharia, who was a systems researcher rather than a businessman, was not quite seeing yet. He saw a company.


THE FOUNDING: Seven Academics Walk Into a Startup

In early 2010, Spark was open-sourced. Not with commercial intent—with academic intent. The AMPLab researchers published their code because that's what researchers did. They wanted the community to use it, improve it, find the bugs they hadn't found.

The community did exactly that.

By 2012, Spark had contributors from Yahoo, Cloudera, and Twitter. Companies were running it in production on real workloads. The GitHub repository was accumulating stars. The mailing list was active. Something was happening that was larger than a research project.

Ben Horowitz, co-founder of Andreessen Horowitz, heard about Spark through Scott Shenker. He reportedly told the team something direct: a $10 billion company could be built around this technology.

He was underestimating.

In 2013, seven researchers from the AMPLab founded Databricks:

  • Matei Zaharia: Spark's creator. CTO.
  • Ion Stoica: The senior professor, first CEO.
  • Ali Ghodsi: VP of Engineering and Product, destined to be CEO.
  • Andy Konwinski: Co-founder, who had studied cluster scheduling and MapReduce inefficiencies.
  • Patrick Wendell: Co-founder, Spark's open-source release manager.
  • Reynold Xin: Co-founder, who had built Shark—a SQL layer on top of Spark that would eventually become Spark SQL.
  • Scott Shenker: Co-founder, the Berkeley legend who had helped advise the whole project into existence.

Andreessen Horowitz led the Series A. $13.9 million.

The initial product was unglamorous: a managed Spark service. You give us your data, we run it on Spark, you get your results. No infrastructure. No DevOps. No cluster management. Just fast computation, delivered as a service.

It was the right product for 2013, when running Spark yourself required serious engineering muscle. It was not the product that would take them to $62 billion. That product hadn't been invented yet.


THE TENSION: Open Source vs. Commerce

There is a permanent tension in building a company on top of an open source project, and the people at Databricks understood it intellectually long before they felt it viscerally.

Apache Spark was not Databricks' property. It was a community project, governed by the Apache Software Foundation, contributed to by engineers at dozens of companies. Databricks was its largest contributor—by far—but it could not own the standard it had created. It could only be the best at using it.

This was intentional. Ghodsi, even before becoming CEO, argued that open-sourcing was the right strategy. A proprietary Spark would have been smaller. A Spark that the whole industry could use would become infrastructure—ubiquitous, essential, unavoidable. And Databricks would be the company that built the best product on top of it.

The risk was that customers could use Spark without Databricks. And many did. Competitors built managed Spark services. Cloud providers offered Spark clusters. The raw technology was free.

Databricks had to be better than free.

For the first few years, the company grew, but not explosively. Ion Stoica was an extraordinary researcher and a capable operator, but the company needed a CEO built for enterprise sales, enterprise culture, and the kind of patient, aggressive account-hunting that turns a promising product into a dominant platform.

In January 2016, Ali Ghodsi became CEO.


THE PIVOT: From Spark Company to Data Platform

Ghodsi made the pivot quickly.

Databricks was not a Spark company. Databricks was a data company. Spark was the engine—but engines don't win markets. Platforms do.

He hired enterprise executives. He went after Fortune 500 accounts. He built the sales machine that a company fresh off a $60 million Series C needed if it was going to become the company he saw in his head.

And he started thinking about the next architectural problem.

By 2017, enterprises had a data infrastructure that was at war with itself.

On one side: the data warehouse. Fast queries. Structured data. ACID transactions. Reliable. Expensive. Poor at machine learning. Hostile to unstructured data. Locked into proprietary formats that made switching painful and migration projects endless.

On the other side: the data lake. Cheap storage on cloud object storage. Flexible. Capable of handling raw, unstructured, semi-structured data. Friendly to machine learning workloads. Also: chaotic. No transactions. No schema enforcement. No reliability guarantees. A place where data went to become inconsistent and ungovernable.

Every enterprise was running both. Warehousing their clean, structured business data in Redshift or Snowflake. Dumping raw logs and events and clickstreams into S3 or GCS or ADLS. ETL pipelines moving data between the two systems. Data engineers keeping the pipelines from breaking. A second team of ML engineers working on the lake side, unable to query the data the warehouse team was using.

Two systems. Two teams. Two sets of governance rules. One company.


THE LAKEHOUSE: A New Paradigm

The insight that became the Lakehouse was this: the separation was not inevitable.

It was a historical accident. Data warehouses were built in the era of spinning disk. They optimized for the constraints of the hardware that existed then. When cloud object storage arrived—S3, GCS, ADLS, storing data at a fraction of traditional warehouse costs—no one went back and rebuilt the warehouse architecture from scratch. They layered on top of it.

What if you started from cloud object storage—cheap, scalable, durable—and built warehouse-grade reliability directly on it?

That was Delta Lake.

Delta Lake was an open-source storage layer that added ACID transactions, schema enforcement, and versioning to data sitting in cloud object storage. It made the data lake reliable. It made the data lake queryable with SQL performance that rivaled dedicated warehouses. And because the underlying storage was open format—Parquet files that any tool could read—it didn't lock you in.

The Lakehouse combined what was good about both architectures and eliminated what was bad about both. The flexibility and cost of a lake. The reliability and queryability of a warehouse. One system. One team. One governance model.

In 2020, Databricks published the academic paper formally defining the Lakehouse architecture. Delta Lake itself had been open-sourced in 2019 and donated to the Linux Foundation later that year; in 2022, Databricks open-sourced the remaining proprietary Delta features as well.

The naming was strategic, the architecture was real, and the timing was exactly right. Snowflake had just IPO'd at a $33 billion valuation, becoming the largest software IPO in history. Snowflake's market proved that enterprises would pay enormous amounts of money for reliable data infrastructure. Databricks positioned the Lakehouse as the next step: all of what Snowflake offered, plus machine learning, plus open formats, plus AI.


THE AFTERMATH: What $62 Billion Looks Like

The funding rounds came in waves:

  • 2013: $13.9M Series A — Andreessen Horowitz
  • 2014: $33M Series B — New Enterprise Associates
  • 2016: $60M Series C — New Enterprise Associates
  • 2017: $140M Series D — Andreessen Horowitz
  • 2019: $250M Series E — Andreessen Horowitz, Coatue, Microsoft
  • 2019: $400M Series F — $6.2B valuation
  • 2021: $1B Series G — $28B valuation
  • 2021: $1.6B Series H — $38B valuation
  • 2023: $500M Series I — $43B valuation
  • 2025: $10B Series J — $62B valuation

The 2025 round included $5.25 billion in debt from JPMorgan Chase, Barclays, Citi, Goldman Sachs, and Morgan Stanley. Meta participated as a strategic investor.

Ben Horowitz was wrong in 2013. Not about the company. About the number.

Matei Zaharia is now an Associate Professor of EECS at UC Berkeley (he returned after years at Stanford) and CTO of Databricks. The project he started in Soda Hall in 2009—to make a machine learning job run in minutes instead of hours—now processes petabytes of data daily for thousands of companies on every continent.

Ion Stoica transitioned from CEO to Executive Chairman in 2016, where he remains one of the company's most prominent faces.

Ali Ghodsi, the Iranian refugee who became a Swedish academic who found the whole thing too slow, who came to Berkeley as a visiting scholar in 2009 and never really left—he runs a company worth $62 billion. He leads 7,000+ employees. He's building what he calls a "Data Intelligence Platform," which means: all your data, in one place, with AI embedded throughout, making sense of it in ways that were previously impossible.

In 2024, Databricks acquired Tabular—the company co-founded by Ryan Blue and the other original creators of Apache Iceberg—for $2 billion, to strengthen its open table format capabilities. Databricks now employed the creators of both major open table formats, Delta Lake and Iceberg.

The IPO is coming. It will likely be the largest in tech history when it arrives.


WHAT IT MEANS

Databricks is often described as a data company. That framing is accurate but incomplete.

What Databricks actually built was a new paradigm for how computation relates to storage, and a new assertion about who owns the standard.

The assertion: that the data layer of enterprise technology should be open. Not open as a marketing claim, but open as architecture—open formats, open protocols, open source engines. Databricks bets that it can build the best product on top of an open standard faster than any competitor can build a competing standard. And that when a customer picks the open ecosystem, Databricks wins more often than not, because Databricks wrote most of the ecosystem.

This is a different kind of moat than Snowflake's. Snowflake's moat is proprietary—your data is in Snowflake's format, on Snowflake's storage, and migrating out is painful. Databricks' moat is relational—your data is in open formats, and Databricks is simply the best at working with open formats, because Databricks invented most of them.

Whether the open bet wins is still being written.

But it started in Soda Hall. With a graduate student who thought MapReduce was too slow. With an immigrant from Tehran who thought Swedish academia was too cautious. With seven academics who decided that the gap between what was possible and what existed was wide enough to build a company in.

They were right. They are still building.


PEOPLE

Matei Zaharia — Romanian-born, Canadian-raised, Berkeley-educated. Created Apache Spark as his PhD project at AMPLab, 2009. Won ACM Doctoral Dissertation Award 2014. Co-founder and CTO of Databricks. Current Associate Professor of EECS, UC Berkeley.

Ali Ghodsi — Born Tehran, 1978. Fled to Sweden at age 5. PhD from KTH Royal Institute of Technology, Stockholm, 2006. Visited AMPLab as a scholar in 2009. Co-founder of Databricks. VP Engineering and Product (2013–2016). CEO (2016–present).

Ion Stoica — Romanian-born Berkeley professor. Expert in distributed systems and networking. AMPLab co-director. First CEO of Databricks (2013–2016). Executive Chairman (2016–present). Still faculty at UC Berkeley.

Scott Shenker — Berkeley professor spanning networking, economics, and CS. AMPLab co-director. Co-founder of Databricks. Board member.

Andy Konwinski — PhD from Berkeley, studied MapReduce inefficiencies and cluster scheduling. Co-founder of Databricks.

Patrick Wendell — Apache Spark's open-source release manager. Co-founder of Databricks.

Reynold Xin — Built Shark (SQL on Spark), which became Spark SQL. Co-founder of Databricks.


TIMELINE

Year Event
2004 Google publishes the MapReduce paper
2006 Hadoop becomes an open-source Apache project, with Yahoo as its main early backer
2009 AMPLab founded at UC Berkeley. Matei Zaharia begins building Spark. Ali Ghodsi arrives at Berkeley as visiting scholar
2010 Apache Spark open-sourced. First Spark paper published (HotCloud)
2012 RDD paper wins Best Paper at NSDI. Spark adoption begins at Yahoo, Twitter, Cloudera
2013 Spark donated to Apache Software Foundation. Databricks founded. Series A: $13.9M from Andreessen Horowitz
2014 Zaharia wins ACM Doctoral Dissertation Award. Databricks Cloud platform launched
2015 AWS partnership. Project Tungsten (performance optimization) begins
2016 Azure partnership. Ali Ghodsi becomes CEO. Series C: $60M
2017 Delta Lake development begins. Series D: $140M
2019 Delta Lake open-sourced. Series E + F: $650M total, $6.2B valuation
2020 Lakehouse architecture paper published. Snowflake IPOs at $33B valuation
2021 Series G + H: $2.6B total, $38B valuation
2022 Delta Lake 2.0 open-sources all remaining Delta features under the Linux Foundation
2023 Series I: $500M at $43B valuation. MosaicML acquired for $1.3B
2024 Tabular acquired for $2B
2025 Series J: $10B at $62B valuation. Meta joins as strategic investor. $3.7B annual revenue run rate

Sources: AMPLab research papers (2010–2015), Apache Spark research history page, Matei Zaharia's Berkeley faculty page, Ali Ghodsi biographical research (Digidai/2025), Bigeye company history, MicroVentures milestones, TechCrunch funding coverage, Facebook Engineering Blog (Spark at scale), Hadoop MapReduce documentation, InfoQ Spark analysis.
