The Living Database: The Origin Story of Materialize


A company biography


THE HOOK

Somewhere inside Microsoft Research's Silicon Valley lab, Frank McSherry had a problem that most people would consider a fantasy. He was one of the most decorated computer scientists working in a research environment — the kind of researcher that universities fight over, that DARPA funds without blinking, that gets cited thousands of times before most people finish their dissertations. He had co-invented differential privacy, the mathematical framework that Apple and Google would later deploy to protect the data of a billion users. He had built Timely Dataflow and Differential Dataflow, two open-source systems that would eventually be recognized as foundational contributions to how computers process changing data at scale.

And he was restless.

The lab closed in 2014. Microsoft shut down its Silicon Valley research outpost as part of a broader restructuring, and McSherry — along with a generation of researchers who had been quietly changing what was possible in distributed systems — had to make a choice. He could go to another research institution. He could join a big company as a staff researcher. Or he could do what seemed, on the surface, like the least natural thing for someone of his background: go build a startup.

He chose the startup. And the thing he chose to build was a database that works the way a spreadsheet does — except that it never stops updating, and it can handle the entire internet's worth of data change without blinking.

This is the story of Materialize.


THE BACKSTORY

To understand what Frank McSherry built, you have to understand what kind of mind he is.

In 2006, McSherry co-authored a paper with Cynthia Dwork, Kobbi Nissim, and Adam Smith that introduced the formal mathematical definition of differential privacy. The paper — "Calibrating Noise to Sensitivity in Private Data Analysis" — presented a deceptively simple idea: you could take any statistical query on a dataset, add a precisely calibrated amount of mathematical noise to the result, and guarantee that no individual record in the dataset could be re-identified from the answer. The privacy guarantee was provable. Not "probably safe." Not "approximately private." Mathematically provable, in the way that a theorem is proven.
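
The core mechanism from that paper can be sketched in a few lines. This is an illustrative model only, with invented function names and data, not code from any production system: a COUNT query has sensitivity 1 (one record changes the answer by at most 1), so Laplace noise with scale 1/epsilon is enough for the guarantee.

```python
import math
import random

def laplace_noise(scale: float) -> float:
    """Sample from Laplace(0, scale) via the inverse CDF of a uniform draw."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def private_count(records, predicate, epsilon: float) -> float:
    """Differentially private count: sensitivity of COUNT is 1, so
    Laplace noise with scale 1/epsilon calibrates noise to sensitivity."""
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon)

# Hypothetical dataset: did each person visit the clinic?
visits = [True, False, True, True, False]
noisy = private_count(visits, lambda v: v, epsilon=1.0)
```

Smaller epsilon means more noise and stronger privacy; the analyst sees a useful aggregate, but no single record is ever decisive in the answer.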

The tech industry took years to catch up. When Apple announced in 2016 at WWDC that iOS 10 would use differential privacy to collect usage statistics without learning individual user behavior, they were operationalizing a decade-old idea from a Microsoft Research paper. Google had already shipped its own implementation, RAPPOR, in Chrome in 2014. Today, differential privacy is baked into iOS, Chrome, macOS, Android, and dozens of enterprise data systems. It has become the standard mathematical framework for "privacy-preserving analytics" across the entire industry. The idea that created it came from a paper McSherry co-authored before the first iPhone existed.

That is the kind of thinker McSherry is. He does not solve the problem in front of him. He formalizes the problem class, proves the solution space, and lets the world catch up.

At Microsoft Research's Silicon Valley lab, McSherry spent years working on distributed data processing — specifically, the question of what happens when your dataset never stops changing. Most databases are built around a single, seductive abstraction: the snapshot. You have a table. The table contains rows. You run a query. The query returns results based on the current state of the table, at the exact moment you pressed Enter. This is how SQL has worked since 1974. It is also, McSherry came to believe, fundamentally the wrong model for how data actually moves through organizations.

His answer was Timely Dataflow, and then its extension Differential Dataflow. The ideas are connected but distinct.

Timely Dataflow is a model for distributed computation where data flows through a graph of operators, and each message carries a logical timestamp. The system can track exactly where in time any piece of computation is, enabling it to reason about progress — about what work is complete and what is still outstanding. This sounds administrative. It is not. It is the thing that makes it possible to run iterative algorithms over changing data without the system losing track of itself.
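
The progress-tracking idea can be modeled in miniature. This is a toy with invented names, not Timely Dataflow's actual API: an operator buffers input by logical timestamp and releases a timestamp's batch only once the upstream frontier guarantees that no more data for that time can arrive.

```python
from collections import defaultdict

class BufferingOperator:
    """Toy model of timely-style progress tracking (not the real API):
    hold messages per logical timestamp, and emit a timestamp's batch
    only once the upstream frontier has moved past it."""
    def __init__(self):
        self.pending = defaultdict(list)  # timestamp -> buffered messages

    def receive(self, timestamp, message):
        self.pending[timestamp].append(message)

    def advance_frontier(self, frontier):
        """Upstream promises: no future message has timestamp < frontier.
        Every buffered time below the frontier is therefore complete."""
        ready = sorted(t for t in self.pending if t < frontier)
        return [(t, self.pending.pop(t)) for t in ready]

op = BufferingOperator()
op.receive(2, "b")          # data can arrive out of timestamp order
op.receive(1, "a")
op.receive(2, "c")
assert op.advance_frontier(2) == [(1, ["a"])]          # time 1 is complete
assert op.advance_frontier(3) == [(2, ["b", "c"])]     # now time 2 is too
```

The real system tracks frontiers per operator across a distributed cluster; the point of the toy is only that progress statements of the form "no more data before time t" are what license an operator to act on its buffered state.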

Differential Dataflow builds on top of this. The insight is to represent data not as a static table but as a stream of changes — additions and retractions — each timestamped. When new data arrives, you don't recompute your query from scratch. You compute only the difference. The delta. The minimum amount of work required to update the answer from its previous state to its new, correct state.

A simple analogy: imagine you have a spreadsheet tracking the total revenue of your top 100 customers. A traditional database approach — a snapshot — means every time you refresh the page, it re-sums every transaction from every customer, from the beginning of time, to produce the total. Differential dataflow means the database keeps track of the running total, and when a new transaction arrives, it adds the transaction value to the existing total. One addition versus millions of re-computations. The answers are identical. The work is incomparably different.
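
The spreadsheet analogy can be made concrete. A sketch with invented names: both approaches return the same total, but one re-reads the full history on every refresh while the other touches only the new transaction.

```python
# Snapshot model: every refresh re-sums the entire transaction history.
def total_revenue_snapshot(transactions):
    return sum(transactions)

# Incremental model: keep the running total, apply only the delta.
class TotalRevenueView:
    def __init__(self):
        self.total = 0
    def apply(self, amount):   # one new transaction arrives
        self.total += amount   # O(1) work, regardless of history size

history = [120, 75, 300]
view = TotalRevenueView()
for t in history:
    view.apply(t)

view.apply(50)                 # a new transaction
history.append(50)
assert view.total == total_revenue_snapshot(history) == 545
```

Real systems must also handle retractions and out-of-order arrival, which is where differential dataflow's formal machinery earns its keep.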

This seems obvious when you say it like that. It is extraordinarily difficult to make correct in practice. The word "correct" is doing enormous work in that sentence. Because in a system where data can arrive out of order, where upstream sources can emit corrections and retractions, where a JOIN across three changing tables requires coordinating timestamps from three different streams — maintaining the correct answer incrementally, and proving that it is always the correct answer, is a problem that took Frank McSherry years to solve rigorously.

He published the work. He open-sourced the implementations. And then, when the lab closed, he started thinking about what it would mean to turn this research into a product.


THE GRIND

Arjun Narayan was the other half of the origin story.

Where McSherry came from deep systems research, Narayan came from a more commercially oriented background in databases and distributed systems. He had spent time at Cockroach Labs, deep in the work of building production-grade, fault-tolerant databases — the kind of systems that have to actually work when a customer's production environment is on fire at 3 a.m. He understood both the engineering rigor required to ship a database and the market dynamics of the modern data infrastructure space.

Narayan and McSherry founded Materialize in January 2019. The core thesis was specific and arguable: SQL is the right language for data analytics, it is not going away, and the problem is not the language but the execution model underneath it. Every time you run a SQL query, the database recomputes the answer from scratch. What if instead the database maintained a "live view" — a pre-computed result that updates itself the moment the underlying data changes?

The technical term is "incrementally maintained materialized views." The concept is not new. Databases have had materialized views for decades. But traditional materialized views are refreshed on a schedule — you run a refresh job every hour, every day, every week, and the view becomes stale in between. What Materialize was promising was something different: materialized views that are always current, updated in milliseconds as data changes, without any refresh job, without any scheduler, without any staleness.

Explaining this to investors was, by all accounts, genuinely hard.

The challenge is that "incrementally maintained views" sounds like a database performance optimization when you say it to a non-technical audience. It doesn't sound like a paradigm shift. It doesn't evoke the visceral problem the way "your reports are 24 hours stale" does. McSherry himself has described the difficulty of making people feel the significance of the idea without drowning them in the formalism that makes it rigorous.

And the formalism matters enormously. The entire value proposition of Materialize rests on a guarantee: the answer is always correct. Not eventually correct. Not approximately correct. Precisely, provably, always current. In a world where "real-time" analytics often means "runs a query against a replica that's 15 minutes behind the primary," this is a strong claim. But it requires differential dataflow's mathematical foundations to make it true — because without the formal machinery underneath, incremental computation can produce wrong answers in subtle, catastrophic ways. A JOIN between two changing tables, updated naively, can produce phantom rows. A GROUP BY with incremental updates can drift from the true aggregate without any error being raised.
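
The join hazard described above has a standard incremental answer, which differential dataflow implements rigorously. A simplified sketch with names invented for the example: the change to A ⋈ B caused by new rows ΔA and ΔB is exactly ΔA ⋈ B plus A ⋈ ΔB plus ΔA ⋈ ΔB, never a full recomputation.

```python
def join(a, b):
    """Inner join two lists of (key, value) pairs on the key."""
    return [(k, va, vb) for (k, va) in a for (k2, vb) in b if k == k2]

def join_delta(a, b, da, db):
    """Delta rule: the change to join(a + da, b + db) relative to
    join(a, b) is exactly the sum of these three partial joins."""
    return join(da, b) + join(a, db) + join(da, db)

a  = [(1, "alice")]
b  = [(1, "x")]
da = [(2, "bob")]          # new rows arriving on each side
db = [(1, "y"), (2, "z")]

# The incremental result equals the from-scratch result.
full = join(a + da, b + db)
incr = join(a, b) + join_delta(a, b, da, db)
assert sorted(full) == sorted(incr)
```

Handling retractions only requires carrying signed multiplicities through the same three terms; proving that this stays correct under timestamped, out-of-order input is the hard part that differential dataflow formalizes.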

McSherry's years of formalism-first research at Microsoft Research were not academic overhead. They were the product. The rigor was the moat.

Materialize released its first public version in February 2020 — written in Rust, a choice that reflected both the team's engineering culture and the reality that a system promising low-latency, always-correct incremental computation cannot afford garbage collection pauses or memory safety violations. They launched at the beginning of a pandemic. The modern data stack was in the middle of its most turbulent period of growth. dbt was reorganizing how analysts thought about transformations. Databricks was growing explosively on top of Apache Spark. Snowflake was about to file for what would become the largest software IPO in history.

Materialize occupied a strange, precise gap: it was neither a data warehouse nor a stream processor. It was a database where SQL queries returned live results.


THE BREAKTHROUGH

The Kafka integration was not incidental. It was the key that turned the lock.

Apache Kafka had become, over the preceding decade, the default nervous system of the modern enterprise data architecture. Event-driven systems, microservices, change data capture pipelines — nearly everything that moved in a large technical organization eventually touched Kafka. Data flowed through it in real time: user events, transaction records, sensor readings, inventory updates, fraud signals.

The problem was that Kafka is not a database. It is a durable log — a record of what happened, in order. You can replay it. You can consume it. But you cannot run SELECT * FROM fraud_signals WHERE risk_score > 0.9 AND user_id = 'abc123' against it. The Kafka ecosystem had grown up around stream processors — Flink, Kafka Streams, KSQL — that let you write code to transform and react to the stream, but these systems required you to think like a stream processor, not like a database analyst. You had to understand windowing functions, watermarks, state stores. You could not just write SQL.

Materialize could.

You pointed Materialize at a Kafka topic. Materialize read it as a source. You wrote SQL over it — joins, aggregations, subqueries, window functions, the full relational algebra — and Materialize translated that SQL into differential dataflow operators running over the live stream. The results were materialized into a view that updated, incrementally, with every new message in the Kafka topic.

This meant that instead of building a Flink job with hundreds of lines of Java to detect whether a user's purchasing behavior exceeded a fraud threshold in the last 60 seconds, you could write:

CREATE MATERIALIZED VIEW suspicious_users AS
SELECT user_id, COUNT(*) AS transaction_count, SUM(amount) AS total_spend
FROM transactions
-- Materialize requires mz_now() rather than NOW() in temporal filters,
-- because NOW() changes on every read and cannot be incrementally maintained.
WHERE mz_now() <= event_time + INTERVAL '60 seconds'
GROUP BY user_id
HAVING SUM(amount) > 10000;

And Materialize would keep that view perpetually current. Every new transaction that arrived would update the relevant aggregates. When you queried suspicious_users, the answer was already there, already correct, already reflecting the most recent second of data. No query latency for the computation. Milliseconds, not seconds or minutes.

The use cases cascaded. Fraud detection — where the difference between a 50ms response and a 5-second response is a fraudulent transaction that clears or doesn't. Real-time leaderboards in gaming and trading platforms, where staleness is not a data quality problem but a user experience failure. Operational analytics for e-commerce, where inventory levels and pricing rules had to update continuously across thousands of SKUs. Feature stores for machine learning — where the feature values fed into a model had to reflect the current state of the world, not a snapshot from the previous ETL run.

Vontive, a real estate lending platform, reduced their loan eligibility calculation time from 27 seconds to half a second. Neo Financial reduced costs on their online feature store by 80%. SuperScript achieved 50-millisecond feature lookups across three data sources for ML scoring. These were not marginal improvements. They were order-of-magnitude changes in what was possible.


THE AFTERMATH

The $100 million Series C, announced in January 2022, arrived at a moment when the modern data stack was both ascendant and contested. Kleiner Perkins led the round, joined by Lightspeed and Redpoint, investors who had collectively watched the data infrastructure market consolidate around a handful of large platforms and believed that something was still missing at the operational layer.

The strategic logic was clear even to outside observers: the data warehouse was optimized for analytical queries at rest. The stream processor was optimized for event-by-event computation. Neither was optimized for the thing that a growing number of applications actually needed — a queryable, SQL-accessible, always-fresh view of operational data that could serve both analysts and production systems with sub-second latency.

By 2022, Materialize had introduced a cloud-native distributed platform — rewriting the original single-node architecture into a multi-active, horizontally scalable system that separated compute from storage, used cloud object storage for near-infinite persistence, and could scale elastically to handle enterprise-grade workloads. This was not a small engineering project. It was essentially a second version of the entire system, built while the first version was running in production at real customers.

Nate Stewart, former product leader at Cockroach Labs, became CEO, bringing commercial operational experience to sit alongside McSherry's role as Chief Scientist — the arrangement that made both the technical depth and the go-to-market momentum sustainable simultaneously.

The competitive landscape remained genuinely complex. dbt was reorganizing how data teams thought about transformation, but it operated in batch cadence — running transforms on a schedule, not continuously. Databricks had Structured Streaming built into Spark, but it required Databricks and Python and Spark expertise, not just SQL. Apache Flink was the incumbent stream processor, but it required stream processor expertise, not SQL expertise. ClickHouse and Apache Druid offered fast analytics over large datasets but were not incrementally maintained — they re-scanned data on every query.

Materialize occupied a position that was genuinely differentiated: a PostgreSQL-compatible SQL interface, incremental correctness guarantees, live streaming data as first-class input, and sub-millisecond query response times because the computation had already been done. The product could be described in one sentence — "a database where queries never get stale" — but the technical depth required to make that sentence true had taken the better part of a decade to build.

Frank McSherry, still writing code, still publishing insights, still treating SQL as an investigative instrument rather than a static retrieval mechanism, remained what he had always been: a researcher who had found a problem worth leaving the lab for.


5 THINGS NOBODY KNOWS ABOUT MATERIALIZE

1. The privacy framework on your iPhone came from the same mind that built the database engine.

Frank McSherry co-invented differential privacy in 2006 — the mathematical framework that Apple announced at WWDC 2016 as the cornerstone of iOS 10's privacy model, and that Google independently deployed for Chrome and Android data collection. When you use Siri and Apple tells you they've "used differential privacy" to learn from your behavior without identifying you, that is McSherry's theorem in production. Materialize is not a privacy company. But the person who built its core engine also invented the privacy technology protecting a billion devices. The two ideas — differential privacy and differential dataflow — are not coincidentally both called "differential." Both are about tracking changes precisely enough to say something rigorous about what changed, and what didn't.

2. "Differential dataflow" is a pun about subtraction, not a description of the data flow.

The word "differential" in differential dataflow does not mean "different from other dataflows." It means the mathematical differential — the change, the delta, the derivative. The entire model represents data as collections of additions and retractions: "+1 transaction for user A worth $50" and, if that transaction is corrected, "-1 transaction for user A worth $50, +1 transaction for user A worth $55." These tuples flow through the computation graph, and every operator — joins, aggregations, filters — is designed to process differences rather than whole datasets. The result is that when a single row changes in a table being joined against a billion-row dataset, Materialize computes only the impact of that one change, not the join of two billion rows. The "differential" is doing precise mathematical work, not marketing work.
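
The addition/retraction model can be sketched as a multiset of signed updates. The representation below is an illustrative simplification of differential dataflow's collections, not its actual API.

```python
from collections import defaultdict

class Collection:
    """A changing collection as record -> signed multiplicity.
    An addition is diff +1, a retraction is diff -1."""
    def __init__(self):
        self.counts = defaultdict(int)

    def update(self, record, diff):
        self.counts[record] += diff
        if self.counts[record] == 0:
            del self.counts[record]   # fully retracted records vanish

    def sum_amounts(self):
        """An aggregate that respects multiplicities and retractions."""
        return sum(amount * n for (_user, amount), n in self.counts.items())

c = Collection()
c.update(("user_a", 50), +1)   # +1 transaction for user A worth $50
c.update(("user_a", 50), -1)   # correction: retract the $50 row...
c.update(("user_a", 55), +1)   # ...and add the corrected $55 row
assert c.sum_amounts() == 55
```

Every operator downstream consumes these signed diffs, so a correction flows through the dataflow graph as two small updates rather than a recomputation.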

3. The problem with "real-time" analytics is that almost nothing described as real-time actually is.

Most systems that call themselves real-time analytics are running against a replica database that lags the primary by 15 to 60 minutes. Most "materialized views" in traditional databases are refreshed on a schedule — once an hour, once a day. The "real-time dashboards" in most data teams run SQL queries that re-execute every few minutes against a data warehouse that ingested the last batch two hours ago. Materialize's claim — that the view is always current, that the answer to a query reflects the last millisecond of data — is not a slight improvement over these systems. It is a categorically different architecture. The confusion is linguistic: "real-time" has been diluted into meaninglessness by every vendor who wanted to imply freshness without delivering it. Materialize's actual value proposition is not real-time. It is "always-correct, never-stale, no-refresh-required." That is a harder thing to say, and a harder thing to build, than anything that calls itself real-time.

4. Writing SQL for Materialize is counterintuitive because SQL was never designed to describe ongoing computation.

SQL is a declarative language for expressing what result you want from a static dataset. SELECT COUNT(*) FROM transactions WHERE status = 'pending' means "right now, count the rows that match this condition." Materialize converts this into a continuous dataflow program that maintains the count perpetually as rows are added, updated, and deleted. This conversion — from a declarative, snapshot-oriented query to an incremental, live-updating computation — is the core technical challenge of the product. The reason it works is that SQL's relational algebra maps cleanly onto differential dataflow operators: a WHERE clause is a filter, GROUP BY is a reduce, JOIN is a join, and each of these has an incremental form in differential dataflow. The user writes SQL. What runs underneath is a dataflow graph executing differential updates at microsecond intervals. The abstraction is seamless. Making the abstraction seamless took years.
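
That operator mapping can be illustrated with two incremental forms, using invented names: a filter passes matching diffs through unchanged, while a count keeps per-key state and adjusts it by each signed diff.

```python
from collections import defaultdict

# Incremental WHERE: a diff stream passes through the predicate unchanged.
def inc_filter(diffs, predicate):
    """diffs: iterable of (record, +1/-1). Matching diffs flow onward."""
    return [(r, d) for (r, d) in diffs if predicate(r)]

# Incremental GROUP BY ... COUNT(*): per-key state adjusted by each diff.
class IncCount:
    def __init__(self):
        self.counts = defaultdict(int)
    def apply(self, diffs):
        for (key, d) in diffs:
            self.counts[key] += d
        return dict(self.counts)

pending = IncCount()
diffs = [("alice", +1), ("bob", +1), ("alice", +1)]
assert pending.apply(diffs) == {"alice": 2, "bob": 1}
assert pending.apply([("alice", -1)]) == {"alice": 1, "bob": 1}
```

Note that neither operator ever sees the full dataset: each consumes only the stream of changes, which is what keeps the maintained view's update cost proportional to the delta rather than the history.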

5. Materialize is written in Rust because the correctness guarantees require it.

The entire commercial value of Materialize rests on one promise: the answer is always correct. A single bug that allows a view to return a stale result, a phantom row, or a dropped update undermines the entire product. Rust's memory safety model — its guarantee that memory cannot be accessed after it is freed, and that data races cannot occur in safe concurrent code — is not a performance choice. It is a correctness choice. The team chose Rust in 2019, before Rust was fashionable in the database industry, because they were building a system where the cost of being wrong was uniquely high. Systems that promise eventual consistency can tolerate bugs that occasionally produce slightly stale results. Systems that promise always-correct incremental computation cannot. The language choice was the first design decision, and it was made on the same grounds as every other design decision in the system: mathematical rigor first, everything else second.


Materialize was founded in 2019 by Arjun Narayan and Frank McSherry. It has raised over $100 million from Kleiner Perkins, Lightspeed, and Redpoint. The company is headquartered in New York.
