A paper published by Google in 2010 describes a system that had already been quietly running inside the company for four years. It processes petabytes of data in seconds. It has no concept of "managing servers" or "provisioning clusters." It can scan a trillion rows and return an answer before your coffee gets cold.
The paper gets published. The industry reads it. Companies across the industry, Snowflake among them, realize they need to build this. They spend years trying. The original, the system Google built first, is productized as BigQuery.
The founding engineer of BigQuery spends a decade helping enterprises query petabytes. Then he leaves. He builds something called MotherDuck. His thesis: the entire premise of big data was mostly wrong.
This is the story of a database that changed the industry — and the man who built it and then argued it was overkill.
BigQuery did not start as a product. It started as a 20% project.
In 2006, a Google engineer named Andrey Gubarev conceived a system called Dremel. The idea was deceptively simple: what if you could query massive datasets interactively, in seconds, the way you use a search engine — instead of waiting hours for a MapReduce job to finish?
At the time, the dominant paradigm inside Google for large-scale data analysis was MapReduce. MapReduce was powerful but slow. You'd write a job, submit it, wait hours, get your result. If your query had a bug, you'd wait hours again. Interactive analytics at petabyte scale was considered technically impossible.
Dremel proved it wasn't.
Dremel's core innovations were two architectural decisions that, together, were radical:
1. Columnar storage for nested data. Traditional databases store data row-by-row. Columnar databases store it column-by-column, so a query that needs only two fields out of fifty doesn't have to read the other forty-eight. But Google's data was deeply nested — Protobuf objects inside objects inside objects. Nobody had figured out how to do columnar storage on nested data. Dremel's team (Sergey Melnik et al.) invented a novel encoding using repetition levels and definition levels to "shred" nested records into columns and reassemble them. This algorithm directly influenced Apache Parquet, the file format that the entire modern data stack now runs on.
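The shredding idea is compact enough to sketch. Here is a toy Python version for the simplest possible schema, a single repeated string field per record; the real algorithm generalizes to arbitrary nesting and optional fields with higher repetition and definition levels:

```python
def shred(records):
    """Shred records (each a list of values for one repeated field)
    into a flat column of (value, repetition_level, definition_level)."""
    column = []
    for record in records:
        if not record:
            # Field absent: definition level 0 marks the missing value.
            column.append((None, 0, 0))
        for i, value in enumerate(record):
            # Repetition level 0 starts a new record; 1 continues the list.
            column.append((value, 0 if i == 0 else 1, 1))
    return column

def assemble(column):
    """Reverse the shredding: rebuild the original records."""
    records = []
    for value, rep, defn in column:
        if rep == 0:
            records.append([])          # repetition level 0 => new record
        if defn == 1:
            records[-1].append(value)   # defined value => append it
    return records

docs = [["x", "y"], [], ["z"]]
col = shred(docs)
# col == [("x", 0, 1), ("y", 1, 1), (None, 0, 0), ("z", 0, 1)]
assert assemble(col) == docs
```

The key property: the column is flat and losslessly reversible, so the engine can scan one field of deeply nested records without touching the rest.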
2. Massively parallel execution trees. Dremel runs queries across thousands of machines simultaneously using a hierarchical tree structure — root servers, intermediate servers, leaf servers — with each layer aggregating partial results. The result: a trillion-row table could return aggregate answers in seconds, not hours.
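The tree is easy to mimic in miniature. A sketch in plain Python (serial here; in Dremel each layer runs in parallel across thousands of machines), computing a COUNT and SUM through leaf, intermediate, and root layers:

```python
def leaf(shard):
    # Leaf servers scan their shard of the table and emit a partial aggregate.
    return (len(shard), sum(shard))

def merge(partials):
    # Intermediate and root servers only merge partials; they never touch
    # raw rows, which is what keeps the upper layers of the tree cheap.
    return (sum(c for c, _ in partials), sum(s for _, s in partials))

rows = list(range(1_000))                    # stand-in for a huge table
shards = [rows[i::8] for i in range(8)]      # 8 "leaf servers"
partials = [leaf(s) for s in shards]
mid_layer = [merge(partials[:4]), merge(partials[4:])]  # 2 intermediates
count, total = merge(mid_layer)              # root
assert (count, total) == (1_000, 499_500)
```

Because partial aggregates are tiny compared to the rows they summarize, adding more leaves scales the scan without overwhelming the root.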
The storage infrastructure: By 2010, Google had decoupled Dremel's compute from its storage, using Colossus (Google's distributed file system, successor to GFS), the Jupiter network fabric for petabit-per-second internal bandwidth, and Borg (the precursor to Kubernetes) for compute orchestration. This disaggregated architecture — storage and compute scaling independently — is now considered best practice for cloud data warehouses. In 2010, it was novel enough to be published as a research paper.
In 2010, with Dremel having been in production at Google for four years, the team published: "Dremel: Interactive Analysis of Web-Scale Datasets" — authored by Sergey Melnik, Andrey Gubarev, Jing Jing Long, Geoffrey Romer, Shiva Shivakumar, Matt Tolton, and Theo Vassilakis, all from Google.
The paper was presented at VLDB 2010 (Very Large Data Bases conference). It described a system that could run aggregation queries over trillion-row tables in seconds. It described disaggregated compute and storage, columnar storage for semistructured/nested data, in situ analysis, and serverless scaling.
Google has a long tradition of publishing its internal infrastructure papers — GFS (2003), MapReduce (2004), Bigtable (2006), Spanner (2012). The pattern is consistent: Google publishes the paper, the industry reads it, open-source clones emerge, and companies are built on top of those clones. Publishing the Dremel paper was no different. The day it was published, the clock started on a new generation of data warehouse companies — and on BigQuery itself.
That same year, 2010, Google announced BigQuery at Google I/O.
Dremel had been running at Google since 2006. It powered internal analytics for Google Search, Ads, YouTube, and other products. Google engineers could query petabytes of log data interactively. This was not available to anyone outside Google.
When Google announced BigQuery at Google I/O in May 2010, it was simultaneously publishing the paper that described its technical foundations. But "announced" and "available" are different things.
Timeline:
- 2006: Dremel conceived as a 20% project by Andrey Gubarev
- 2006–2010: Dremel in production at Google, powering internal analytics
- 2010: VLDB paper published. BigQuery announced at Google I/O. Limited external access.
- 2011: Limited availability for select external customers
- 2012: General availability — BigQuery open to anyone with a Google account
The gap from internal inception to public availability was six years. By the time enterprises could actually use BigQuery, the team had been running it at Google scale for half a decade.
Here is the counterintuitive part of the story. Inside Google in the 2000s, the received wisdom was that SQL doesn't scale. MapReduce was the paradigm. SQL was for small databases. This wasn't fringe thinking — it was the consensus at the company that was simultaneously building the world's most advanced SQL-at-scale system.
Dremel broke this assumption in practice years before it changed the consensus in theory. When the paper was published, it didn't just describe a faster database — it challenged a foundational belief about what distributed systems could do.
When BigQuery became publicly available, it was unlike anything the data world had seen.
With Amazon Redshift (announced November 2012, generally available February 2013), you provision a cluster. You pick your node type. You configure your cluster size. You pay for that cluster whether you're running queries or not. You do capacity planning. If your queries get bigger, you resize the cluster. If your business grows, you resize again. The database behaves like infrastructure.
With BigQuery, there is no cluster. There is no provisioning. You sign up, you get a web UI or a REST API, and you start querying. Google's infrastructure scales behind you invisibly. You pay only for the bytes your query scans: run one query against a terabyte table and you pay for one query's worth of scanning.
In 2011, "serverless" wasn't yet a mainstream term; it was popularized years later, after AWS Lambda launched in 2014. BigQuery was serverless before the concept had a name.
This should have been a decisive competitive advantage. And eventually it was. But for most of the 2010s, Redshift won enterprise adoption anyway. The reason is counterintuitive: enterprises preferred the predictability of a cluster. They could budget for a known monthly cost. BigQuery's per-query pricing felt unpredictable. Finance teams didn't know how to model it. IT teams were uncomfortable with infrastructure they couldn't touch and configure.
The technically superior product was slower to win because of pricing psychology.
BigQuery's pricing model is based on bytes scanned per query. At the current rate of roughly $6.25 per terabyte, most queries are cheap. But the model has a crucial property that catches engineers off guard: BigQuery charges for data referenced, not data returned.
This means:
- A SELECT * on a 10TB table that returns 100 rows costs as much as a SELECT * that returns 10 billion rows.
- A LIMIT 1,000,000 clause does not reduce what BigQuery charges — it limits the output, not the scan.
- If you touch a 1PB table, even if you only need 1 MB, BigQuery charges for 1PB.
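The rule above can be made concrete with a back-of-envelope cost model. This is a simplified sketch with hypothetical column sizes; real billing also involves partition pruning, clustering, result caching, and a free tier. The point it demonstrates: cost depends only on the columns a query references, never on the rows it returns.

```python
PRICE_PER_TB = 6.25  # on-demand rate, USD

# Hypothetical physical sizes of each column, in TB.
COLUMNS_TB = {"tx_hash": 8.0, "value": 1.5, "timestamp": 0.5}

def scan_cost(referenced_columns, limit=None):
    # `limit` is accepted only to show it has no effect on cost:
    # BigQuery bills for data referenced, not data returned.
    tb = sum(COLUMNS_TB[c] for c in referenced_columns)
    return tb * PRICE_PER_TB

full = scan_cost(COLUMNS_TB)                           # SELECT * ...
assert full == scan_cost(COLUMNS_TB, limit=1_000_000)  # LIMIT changes nothing
assert full == 62.5                                    # 10 TB * $6.25
assert scan_cost(["value"]) == 9.375                   # one column is cheaper
```

Selecting only the columns you need is the one lever that always works under this model, which is why `SELECT *` is the classic BigQuery footgun.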
In one documented incident, a company ran three queries against a BigQuery public cryptocurrency dataset. The queries used LIMIT 1,000,000 to restrict results, a habit carried over from row-oriented databases, where LIMIT genuinely cuts work. The queries ran in 22 seconds. The bill: $9,847.24.
The reason: the underlying table was roughly 509 TB, and each query scanned it in full. The LIMIT clause was irrelevant to BigQuery's billing engine. Across the three queries, the company was charged for scanning roughly 1,576 TB of data.
Shopify's engineering team built a marketing analytics pipeline using Apache Flink to process one billion rows. When they loaded the dataset into BigQuery and calculated the cost at production scale — 60 requests per minute — the math was:
60 RPM × 60 min × 24 hrs × 30 days = 2.59 million queries/month
2.59 million queries × 75 GB per query ≈ 194.4 million GB/month (roughly 190,000 TB)
At the then-current on-demand rate of $5/TB ≈ $949,218.75/month
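The quoted monthly figure reproduces exactly under two assumptions worth stating: the on-demand rate at the time ($5/TB, before the 2023 increase to $6.25) and a binary GB-to-TB conversion. A quick check:

```python
queries_per_month = 60 * 60 * 24 * 30  # 60 RPM, every minute of a 30-day month
assert queries_per_month == 2_592_000

gb_scanned = queries_per_month * 75    # 75 GB scanned per query
tb_scanned = gb_scanned / 1024         # binary GB -> TB (assumption)
monthly_cost = tb_scanned * 5.00       # $5/TB, the rate at the time (assumption)
assert monthly_cost == 949_218.75
```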
The solution was table clustering — reorganizing the data storage to match the query's WHERE clauses. After clustering, the same query scanned 508 MB instead of 75 GB. Monthly cost dropped to $1,370.67 — a 692x reduction.
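Why clustering cuts the scan so dramatically can be simulated. Roughly speaking, BigQuery stores a table in blocks and tracks the min/max of the clustering key per block, so a filtered query can skip any block whose range can't contain the predicate value. A toy sketch with invented keys and block sizes:

```python
import random

ROW_BYTES, BLOCK_ROWS = 100, 1_000

def make_blocks(keys):
    # Each storage block records the min/max of the clustering key it holds.
    return [(min(chunk), max(chunk), len(chunk) * ROW_BYTES)
            for chunk in (keys[i:i + BLOCK_ROWS]
                          for i in range(0, len(keys), BLOCK_ROWS))]

def bytes_scanned(blocks, key):
    # A block is read only if its [min, max] range could contain the key.
    return sum(size for lo, hi, size in blocks if lo <= key <= hi)

keys = list(range(10_000))
random.seed(0)
shuffled = random.sample(keys, len(keys))

unclustered = make_blocks(shuffled)    # every block spans the key range
clustered = make_blocks(sorted(keys))  # disjoint ranges: most blocks skipped

assert bytes_scanned(clustered, 4321) == 100_000      # 1 block out of 10
assert bytes_scanned(unclustered, 4321) == 1_000_000  # all 10 blocks
```

The same filter, the same data: sorting the storage to match the WHERE clause is what turns a full scan into a near-point lookup.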
The lesson embedded in these stories: BigQuery's pricing model rewards people who understand its internal architecture. It punishes people who assume it behaves like a traditional SQL database.
Jordan Tigani was a founding engineer at Google BigQuery. He spent a decade on the product — as an engineer, then as an engineering lead, then as a product manager. He was the person who would query a petabyte live on stage at conferences. He was the face of BigQuery's technical ambition.
He left Google and became Chief Product Officer at SingleStore, a fast-growing Series E database startup.
Then he did something unexpected. He wrote a blog post titled "Big Data is Dead."
Tigani's argument, published in early 2023, was built on data he'd seen across BigQuery's entire customer base: most customers held far less data than the big-data narrative suggested, and roughly 90% of queries processed less than 100 MB.
The punchline: at SingleStore itself, a fast-growing Series E unicorn, all data sources combined (finance, customers, marketing, logs) totaled a few gigabytes.
Tigani had spent ten years building infrastructure to query petabytes. His conclusion: most companies don't have petabytes, and even the ones that do mostly run queries that touch a small fraction of them.
After discovering DuckDB — an in-process analytical database that runs on a laptop — Tigani saw the same pattern he'd seen at BigQuery: someone had built an extraordinary query engine, and it needed a cloud layer to be accessible to ordinary developers.
He co-founded MotherDuck in 2022, building a serverless cloud analytics platform on top of DuckDB. MotherDuck raised $52.5M at a $400M valuation in 2023.
The architectural irony is almost perfect: Tigani took the same playbook he'd used to build BigQuery (take an exceptional query engine, wrap it in a serverless cloud service, charge per query) and applied it to DuckDB — an engine designed for single-node execution, the opposite of BigQuery's distributed architecture.
His framing: BigQuery was built to query Google's data at Google scale. MotherDuck is built for the reality that most companies' data fits comfortably in RAM on a modern laptop.
The man who built the petabyte warehouse concluded that most people don't need the petabyte warehouse.
By 2019, BigQuery had a storage and compute story. What it didn't have was a business intelligence layer — a way for non-engineers to actually visualize and act on the data sitting in BigQuery.
Google's answer was a $2.6 billion acquisition: Looker.
Looker, founded in 2012, had built something unusual in BI: LookML, a modeling language that sits between the database and the visualization layer. Instead of every analyst writing ad hoc SQL, an engineer writes LookML once — defining what "revenue" means, what "customer" means, what "active user" means — and every downstream report uses those definitions consistently. It's called a semantic layer.
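The semantic-layer idea, stripped of LookML's actual syntax, is small enough to sketch. This is a hypothetical Python analogue, with invented metric names and tables, not Looker's API: definitions live in one place, and every generated query reuses them.

```python
# One central definition per business metric, written once by an engineer.
METRICS = {
    "revenue": "SUM(order_total)",
    "active_users": "COUNT(DISTINCT user_id)",
}

def build_query(metric, table, dimension):
    # Every downstream report goes through the same definition,
    # so "revenue" means the same thing in every dashboard.
    expr = METRICS[metric]
    return (f"SELECT {dimension}, {expr} AS {metric} "
            f"FROM {table} GROUP BY {dimension}")

q = build_query("revenue", "orders", "region")
assert q == ("SELECT region, SUM(order_total) AS revenue "
             "FROM orders GROUP BY region")
```

LookML does far more (joins, access control, caching), but this is the core trick: the definition of a metric is code, versioned once, rather than SQL copied into a hundred reports.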
The strategic logic for Google was clear: BigQuery could store and query the data, but without a semantic layer, every company needed a team of engineers to translate BigQuery's output into business decisions. Looker closed that gap. Together, BigQuery + Looker was Google's answer to the complete data platform — the same vision Snowflake was building from a different direction.
The acquisition was announced June 2019. Completed February 2020. Google committed to keeping Looker multi-cloud — meaning it still works with AWS and Azure — which was unusual for a Google acquisition and signaled that Google saw Looker's value as an industry standard, not just a BigQuery lock-in play.
By 2023, three platforms dominated enterprise data:
BigQuery (Google, announced 2010, GA 2012): Fully serverless. No clusters. Pay per query. An in-memory shuffle tier that is claimed to open a 5-10x performance gap on shuffle-heavy real-world analytics. Architecture: Dremel + Colossus + Jupiter. Native JSON and nested data. Weakness: query pricing unpredictability. Strength: zero operations overhead.
Snowflake (2012): Multi-cloud. Separates compute and storage but still requires provisioning virtual warehouses (clusters). Directly cited Dremel as architectural inspiration. Massive enterprise adoption through 2020-2022. The Snowflake IPO in September 2020 — the largest software IPO in history at the time — was partly a bet that the enterprise data warehouse market was larger than BigQuery's Google-centric positioning had captured.
Databricks (2013): Built on Apache Spark. The "lakehouse" paradigm — unified data engineering, data science, and analytics on one platform. Delta Lake for transactions. Photon execution engine for performance. Cheapest of the three by some estimates. The ML/AI angle — Databricks is where data scientists and ML engineers live.
By 2025-2026, the three platforms had converged architecturally — all use columnar storage, cost-based query planning, pipelined execution, just-in-time compilation. Benchmarks show near parity on standard queries.
The differentiator has shifted from "who's fastest" to "who owns the workflow." Google ties BigQuery to Vertex AI, BigQuery ML, Looker, and the broader GCP ecosystem. Snowflake bets on data sharing and multi-cloud flexibility. Databricks bets on the unified engineering-to-production pipeline.
BigQuery's structural advantage — no clusters, no provisioning, true serverless — remains its cleanest story. Its structural vulnerability — pricing unpredictability and deep GCP tie-in — remains its clearest obstacle for enterprises already invested in AWS.
1. Dremel was a 20% project that ran for four years before the paper was published.
Andrey Gubarev conceived Dremel in 2006. Google ran it internally for four years — powering Search, Ads, and YouTube analytics — before publishing the paper that would inspire an entire industry. By the time Snowflake's founders read about it, Google had been doing it in production longer than most startups survive.
2. The 2010 Dremel paper's record-shredding algorithm is inside almost every data file format you use.
The repetition-level / definition-level encoding for nested columnar data that Sergey Melnik et al. described in the paper directly influenced Apache Parquet, the file format that underlies virtually every modern data lake. Every time a Snowflake query runs, or a dbt model executes, or a Databricks job reads a Parquet file, it's executing an idea from that 2010 paper.
3. BigQuery's founding engineer concluded, after 10 years, that big data was mostly a myth.
Jordan Tigani spent a decade building infrastructure to query petabytes. Then he analyzed BigQuery's actual customer data and found that 90% of queries touched less than 100 MB. His conclusion: the big data paradigm was real at Google scale and almost nowhere else. He built MotherDuck — serverless DuckDB — as the answer for everyone who doesn't actually have Google's problem.
4. Shopify almost paid $949,000/month for a single query.
Before clustering their BigQuery tables, Shopify's data pipeline would have cost nearly $1 million per month for one analytics workload. The fix — table clustering — reduced the same query from 75 GB of scanning to 508 MB. A 692x cost reduction from one architectural decision. This is why BigQuery rewards engineers who understand its internals and punishes those who treat it like a normal SQL database.
5. Google bought a $2.6B BI company and kept it multi-cloud.
When Google acquired Looker in 2019, they did something unusual: they committed to keeping Looker working on AWS and Azure. For a company famous for pushing GCP exclusivity, this was a signal that Looker's semantic layer (LookML) was valuable enough as an industry standard that Google didn't want to limit its adoption. It was a product bet masquerading as a competitive concession.
| Fact | Detail |
|---|---|
| Dremel conceived | 2006, by Andrey Gubarev (20% project) |
| Dremel paper published | VLDB 2010 — Melnik, Gubarev, Long, Romer, Shivakumar, Tolton, Vassilakis |
| BigQuery announced | May 2010, Google I/O |
| BigQuery GA | 2012 |
| Redshift GA | February 2013 |
| Looker acquisition announced | June 2019 |
| Looker acquisition closed | February 2020 |
| Looker price | $2.6 billion |
| MotherDuck founded | 2022 |
| MotherDuck raise | $52.5M at $400M valuation (2023) |
| Jordan Tigani role | Founding engineer, BigQuery (10+ years) |
| BigQuery pricing | $6.25/TB scanned (on-demand) |
| Worst pricing incident documented | $9,847 in 22 seconds (1,576 TB scanned) |
| Shopify optimization | $949K/month → $1,370/month (692x reduction via clustering) |
| Dremel's successor influence | Apache Parquet file format |
| Systems inspired by Dremel paper | Snowflake (cited directly), Hive, Spark SQL, Presto
Sources: VLDB 2010 Dremel paper, MotherDuck "Big Data is Dead" (Tigani, 2023), Shopify Engineering Blog, DEV.to billing incident report, VentureBeat/TechCrunch Looker acquisition coverage, The Register Tigani interview (2023), Google Cloud Blog, GeekWire MotherDuck funding coverage.