A paper published by Google in 2010 describes a system that had already been quietly running inside the company for four years. It processes petabytes of data in seconds. It has no concept of "managing servers" or "provisioning clusters." It can scan a trillion rows and return an answer before your coffee gets cold.
The paper gets published. The industry reads it. Companies across the industry, Snowflake among them, realize they need to build this. They spend years trying. The original, the system Google built first, is productized as BigQuery.
The founding engineer of BigQuery spends a decade helping enterprises query petabytes. Then he leaves. He builds something called MotherDuck. His thesis: the entire premise of big data was mostly wrong.
This is the story of a database that changed the industry — and the man who built it and then argued it was overkill.
BigQuery did not start as a product. It started as a 20% project.
In 2006, a Google engineer named Andrey Gubarev conceived a system called Dremel. The idea was deceptively simple: what if you could query massive datasets interactively, in seconds, the way you use a search engine — instead of waiting hours for a MapReduce job to finish?
At the time, the dominant paradigm inside Google for large-scale data analysis was MapReduce. MapReduce was powerful but slow. You'd write a job, submit it, wait hours, get your result. If your query had a bug, you'd wait hours again. Interactive analytics at petabyte scale was considered technically impossible.
Dremel proved it wasn't.
Dremel's core innovations were two architectural decisions that, together, were radical:
1. Columnar storage for nested data. Traditional databases store data row-by-row. Columnar databases store it column-by-column, so a query that needs only two fields out of fifty doesn't have to read the other forty-eight. But Google's data was deeply nested — Protobuf objects inside objects inside objects. Nobody had figured out how to do columnar storage on nested data. Dremel's team (Sergey Melnik et al.) invented a novel encoding using repetition levels and definition levels to "shred" nested records into columns and reassemble them. This algorithm directly influenced Apache Parquet, the file format that the entire modern data stack now runs on.
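The shredding idea is compact enough to sketch. Here is a toy Python version for the simplest possible schema, a single repeated string field per record; the real algorithm generalizes to arbitrary nesting and optional fields with higher repetition and definition levels:

```python
def shred(records):
    """Shred records (each a list of values for one repeated field)
    into a flat column of (value, repetition_level, definition_level)."""
    column = []
    for record in records:
        if not record:
            # Field absent: definition level 0 marks the missing value.
            column.append((None, 0, 0))
        for i, value in enumerate(record):
            # Repetition level 0 starts a new record; 1 continues the list.
            column.append((value, 0 if i == 0 else 1, 1))
    return column

def assemble(column):
    """Reverse the shredding: rebuild the original records."""
    records = []
    for value, rep, defn in column:
        if rep == 0:
            records.append([])          # repetition level 0 => new record
        if defn == 1:
            records[-1].append(value)   # defined value => append it
    return records

docs = [["x", "y"], [], ["z"]]
col = shred(docs)
# col == [("x", 0, 1), ("y", 1, 1), (None, 0, 0), ("z", 0, 1)]
assert assemble(col) == docs
```

The key property: the column is flat and losslessly reversible, so the engine can scan one field of deeply nested records without touching the rest.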
2. Massively parallel execution trees. Dremel runs queries across thousands of machines simultaneously using a hierarchical tree structure — root servers, intermediate servers, leaf servers — with each layer aggregating partial results. The result: a trillion-row table could return aggregate answers in seconds, not hours.
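The tree is easy to mimic in miniature. A sketch in plain Python (serial here; in Dremel each layer runs in parallel across thousands of machines), computing a COUNT and SUM through leaf, intermediate, and root layers:

```python
def leaf(shard):
    # Leaf servers scan their shard of the table and emit a partial aggregate.
    return (len(shard), sum(shard))

def merge(partials):
    # Intermediate and root servers only merge partials; they never touch
    # raw rows, which is what keeps the upper layers of the tree cheap.
    return (sum(c for c, _ in partials), sum(s for _, s in partials))

rows = list(range(1_000))                    # stand-in for a huge table
shards = [rows[i::8] for i in range(8)]      # 8 "leaf servers"
partials = [leaf(s) for s in shards]
mid_layer = [merge(partials[:4]), merge(partials[4:])]  # 2 intermediates
count, total = merge(mid_layer)              # root
assert (count, total) == (1_000, 499_500)
```

Because partial aggregates are tiny compared to the rows they summarize, adding more leaves scales the scan without overwhelming the root.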
The storage infrastructure: By 2010, Google had decoupled Dremel's compute from its storage, using Colossus (Google's distributed file system, successor to GFS), the Jupiter network fabric for petabit-per-second internal bandwidth, and Borg (the precursor to Kubernetes) for compute orchestration. This disaggregated architecture — storage and compute scaling independently — is now considered best practice for cloud data warehouses. In 2010, it was novel enough to be published as a research paper.
In 2010, with Dremel having been in production at Google for four years, the team published: "Dremel: Interactive Analysis of Web-Scale Datasets" — authored by Sergey Melnik, Andrey Gubarev, Jing Jing Long, Geoffrey Romer, Shiva Shivakumar, Matt Tolton, and Theo Vassilakis, all from Google.
The paper was presented at VLDB 2010 (Very Large Data Bases conference). It described a system that could run aggregation queries over trillion-row tables in seconds. It described disaggregated compute and storage, columnar storage for semistructured/nested data, in situ analysis, and serverless scaling.
Google has a long tradition of publishing its internal infrastructure papers — GFS (2003), MapReduce (2004), Bigtable (2006), Spanner (2012). The pattern is consistent: Google publishes the paper, the industry reads it, open-source clones emerge, and companies are built on top of those clones. Publishing the Dremel paper was no different. The day it was published, the clock started on a new generation of data warehouse companies — and on BigQuery itself.
That same year, 2010, Google announced BigQuery at Google I/O.
Dremel had been running at Google since 2006. It powered internal analytics for Google Search, Ads, YouTube, and other products. Google engineers could query petabytes of log data interactively. This was not available to anyone outside Google.
When Google announced BigQuery at Google I/O in May 2010, it was simultaneously publishing the paper that described its technical foundations. But "announced" and "available" are different things.
Timeline:
- 2006: Dremel conceived as a 20% project by Andrey Gubarev
- 2006–2010: Dremel in production at Google, powering internal analytics
- 2010: VLDB paper published. BigQuery announced at Google I/O. Limited external access.
- 2011: Limited availability for select external customers
- 2012: General availability — BigQuery open to anyone with a Google account
The gap from internal inception to public availability was six years. By the time enterprises could actually use BigQuery, the team had been running it at Google scale for half a decade.
Here is the counterintuitive part of the story. Inside Google in the 2000s, the received wisdom was that SQL doesn't scale. MapReduce was the paradigm. SQL was for small databases. This wasn't fringe thinking — it was the consensus at the company that was simultaneously building the world's most advanced SQL-at-scale system.
Dremel broke this assumption in practice years before it changed the consensus in theory. When the paper was published, it didn't just describe a faster database — it challenged a foundational belief about what distributed systems could do.
When BigQuery became publicly available, it was unlike anything the data world had seen.
With Amazon Redshift (announced November 2012, generally available February 2013), you provision a cluster. You pick your node type. You configure your cluster size. You pay for that cluster whether you're running queries or not. You do capacity planning. If your queries get bigger, you resize the cluster. If your business grows, you resize again. The database behaves like infrastructure.
With BigQuery, there is no cluster. There is no provisioning. You sign up, you get a web UI or a REST API, and you start querying. Google's infrastructure scales behind you invisibly. You pay only for the bytes your query scans: run one query against a terabyte table and you pay for one query's worth of scanning.
In 2011, "serverless" wasn't yet a mainstream term; it was popularized years later, after AWS Lambda launched in 2014. BigQuery was serverless before the concept had a name.
This should have been a decisive competitive advantage. And eventually it was. But for most of the 2010s, Redshift won enterprise adoption anyway. The reason is counterintuitive: enterprises preferred the predictability of a cluster. They could budget for a known monthly cost. BigQuery's per-query pricing felt unpredictable. Finance teams didn't know how to model it. IT teams were uncomfortable with infrastructure they couldn't touch and configure.
The technically superior product was slower to win because of pricing psychology.
BigQuery's pricing model is based on bytes scanned per query. At the current rate of roughly $6.25 per terabyte, most queries are cheap. But the model has a crucial property that catches engineers off guard: BigQuery charges for data referenced, not data returned.
This means:
- A SELECT * on a 10TB table that returns 100 rows costs as much as a SELECT * that returns 10 billion rows.
- A LIMIT 1,000,000 clause does not reduce what BigQuery charges — it limits the output, not the scan.
- If you touch a 1PB table, even if you only need 1 MB, BigQuery charges for 1PB.
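The rule above can be made concrete with a back-of-envelope cost model. This is a simplified sketch with hypothetical column sizes; real billing also involves partition pruning, clustering, result caching, and a free tier. The point it demonstrates: cost depends only on the columns a query references, never on the rows it returns.

```python
PRICE_PER_TB = 6.25  # on-demand rate, USD

# Hypothetical physical sizes of each column, in TB.
COLUMNS_TB = {"tx_hash": 8.0, "value": 1.5, "timestamp": 0.5}

def scan_cost(referenced_columns, limit=None):
    # `limit` is accepted only to show it has no effect on cost:
    # BigQuery bills for data referenced, not data returned.
    tb = sum(COLUMNS_TB[c] for c in referenced_columns)
    return tb * PRICE_PER_TB

full = scan_cost(COLUMNS_TB)                           # SELECT * ...
assert full == scan_cost(COLUMNS_TB, limit=1_000_000)  # LIMIT changes nothing
assert full == 62.5                                    # 10 TB * $6.25
assert scan_cost(["value"]) == 9.375                   # one column is cheaper
```

Selecting only the columns you need is the one lever that always works under this model, which is why `SELECT *` is the classic BigQuery footgun.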
In one documented incident, a company ran three queries against a BigQuery public cryptocurrency dataset. The queries used LIMIT 1,000,000 to restrict results, a habit carried over from row-oriented databases, where LIMIT genuinely cuts work. The queries ran in 22 seconds. The bill: $9,847.24.
The reason: the underlying table was roughly 509 TB, and each query scanned it in full. The LIMIT clause was irrelevant to BigQuery's billing engine. Across the three queries, the company was charged for scanning roughly 1,576 TB of data.
Shopify's engineering team built a marketing analytics pipeline using Apache Flink to process one billion rows. When they loaded the dataset into BigQuery and calculated the cost at production scale — 60 requests per minute — the math was:
60 RPM × 60 min × 24 hrs × 30 days = 2.59 million queries/month
2.59 million queries × 75 GB per query ≈ 194.4 million GB/month (roughly 190,000 TB)
At the then-current on-demand rate of $5/TB ≈ $949,218.75/month
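The quoted monthly figure reproduces exactly under two assumptions worth stating: the on-demand rate at the time ($5/TB, before the 2023 increase to $6.25) and a binary GB-to-TB conversion. A quick check:

```python
queries_per_month = 60 * 60 * 24 * 30  # 60 RPM, every minute of a 30-day month
assert queries_per_month == 2_592_000

gb_scanned = queries_per_month * 75    # 75 GB scanned per query
tb_scanned = gb_scanned / 1024         # binary GB -> TB (assumption)
monthly_cost = tb_scanned * 5.00       # $5/TB, the rate at the time (assumption)
assert monthly_cost == 949_218.75
```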
The solution was table clustering — reorganizing the data storage to match the query's WHERE clauses. After clustering, the same query scanned 508 MB instead of 75 GB. Monthly cost dropped to $1,370.67 — a 692x reduction.
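Why clustering cuts the scan so dramatically can be simulated. Roughly speaking, BigQuery stores a table in blocks and tracks the min/max of the clustering key per block, so a filtered query can skip any block whose range can't contain the predicate value. A toy sketch with invented keys and block sizes:

```python
import random

ROW_BYTES, BLOCK_ROWS = 100, 1_000

def make_blocks(keys):
    # Each storage block records the min/max of the clustering key it holds.
    return [(min(chunk), max(chunk), len(chunk) * ROW_BYTES)
            for chunk in (keys[i:i + BLOCK_ROWS]
                          for i in range(0, len(keys), BLOCK_ROWS))]

def bytes_scanned(blocks, key):
    # A block is read only if its [min, max] range could contain the key.
    return sum(size for lo, hi, size in blocks if lo <= key <= hi)

keys = list(range(10_000))
random.seed(0)
shuffled = random.sample(keys, len(keys))

unclustered = make_blocks(shuffled)    # every block spans the key range
clustered = make_blocks(sorted(keys))  # disjoint ranges: most blocks skipped

assert bytes_scanned(clustered, 4321) == 100_000      # 1 block out of 10
assert bytes_scanned(unclustered, 4321) == 1_000_000  # all 10 blocks
```

The same filter, the same data: sorting the storage to match the WHERE clause is what turns a full scan into a near-point lookup.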
The lesson embedded in these stories: BigQuery's pricing model rewards people who understand its internal architecture. It punishes people who assume it behaves like a traditional SQL database.
Jordan Tigani was a founding engineer at Google BigQuery. He spent a decade on the product — as an engineer, then as an engineering lead, then as a product manager. He was the person who would query a petabyte live on stage at conferences. He was the face of BigQuery's technical ambition.
He left Google and became Chief Product Officer at SingleStore, a fast-growing Series E database startup.
Then he did something unexpected. He wrote a blog post titled "Big Data is Dead."
Tigani's argument, published in early 2023, was built on data he'd seen across BigQuery's entire customer base: most customers held far less data than the big-data narrative suggested, and roughly 90% of queries processed less than 100 MB.
The punchline: at SingleStore itself, a fast-growing Series E unicorn, all data sources combined (finance, customers, marketing, logs) totaled a few gigabytes.
Tigani had spent ten years building infrastructure to query petabytes. His conclusion: most companies don't have petabytes, and even the ones that do mostly run queries that touch a small fraction of them.
After discovering DuckDB — an in-process analytical database that runs on a laptop — Tigani saw the same pattern he'd seen at BigQuery: someone had built an extraordinary query engine, and it needed a cloud layer to be accessible to ordinary developers.
He co-founded MotherDuck in 2022, building a serverless cloud analytics platform on top of DuckDB. MotherDuck raised $52.5M at a $400M valuation in 2023.
The architectural irony is almost perfect: Tigani took the same playbook he'd used to build BigQuery (take an exceptional query engine, wrap it in a serverless cloud service, charge per query) and applied it to DuckDB — an engine designed for single-node execution, the opposite of BigQuery's distributed architecture.
His framing: BigQuery was built to query Google's data at Google scale. MotherDuck is built for the reality that most companies' data fits comfortably in RAM on a modern laptop.
The man who built the petabyte warehouse concluded that most people don't need the petabyte warehouse.
By 2019, BigQuery had a storage and compute story. What it didn't have was a business intelligence layer — a way for non-engineers to actually visualize and act on the data sitting in BigQuery.
Google's answer was a $2.6 billion acquisition: Looker.
Looker, founded in 2012, had built something unusual in BI: LookML, a modeling language that sits between the database and the visualization layer. Instead of every analyst writing ad hoc SQL, an engineer writes LookML once — defining what "revenue" means, what "customer" means, what "active user" means — and every downstream report uses those definitions consistently. It's called a semantic layer.
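The semantic-layer idea, stripped of LookML's actual syntax, is small enough to sketch. This is a hypothetical Python analogue, with invented metric names and tables, not Looker's API: definitions live in one place, and every generated query reuses them.

```python
# One central definition per business metric, written once by an engineer.
METRICS = {
    "revenue": "SUM(order_total)",
    "active_users": "COUNT(DISTINCT user_id)",
}

def build_query(metric, table, dimension):
    # Every downstream report goes through the same definition,
    # so "revenue" means the same thing in every dashboard.
    expr = METRICS[metric]
    return (f"SELECT {dimension}, {expr} AS {metric} "
            f"FROM {table} GROUP BY {dimension}")

q = build_query("revenue", "orders", "region")
assert q == ("SELECT region, SUM(order_total) AS revenue "
             "FROM orders GROUP BY region")
```

LookML does far more (joins, access control, caching), but this is the core trick: the definition of a metric is code, versioned once, rather than SQL copied into a hundred reports.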
The strategic logic for Google was clear: BigQuery could store and query the data, but without a semantic layer, every company needed a team of engineers to translate BigQuery's output into business decisions. Looker closed that gap. Together, BigQuery + Looker was Google's answer to the complete data platform — the same vision Snowflake was building from a different direction.
The acquisition was announced June 2019. Completed February 2020. Google committed to keeping Looker multi-cloud — meaning it still works with AWS and Azure — which was unusual for a Google acquisition and signaled that Google saw Looker's value as an industry standard, not just a BigQuery lock-in play.
By 2023, three platforms dominated enterprise data:
BigQuery (Google, announced 2010, GA 2012): Fully serverless. No clusters. Pay per query. An in-memory shuffle tier that is claimed to open a 5-10x performance gap on shuffle-heavy real-world analytics. Architecture: Dremel + Colossus + Jupiter. Native JSON and nested data. Weakness: query pricing unpredictability. Strength: zero operations overhead.
Snowflake (2012): Multi-cloud. Separates compute and storage but still requires provisioning virtual warehouses (clusters). Directly cited Dremel as architectural inspiration. Massive enterprise adoption through 2020-2022. The Snowflake IPO in September 2020 — the largest software IPO in history at the time — was partly a bet that the enterprise data warehouse market was larger than BigQuery's Google-centric positioning had captured.
Databricks (2013): Built on Apache Spark. The "lakehouse" paradigm — unified data engineering, data science, and analytics on one platform. Delta Lake for transactions. Photon execution engine for performance. Cheapest of the three by some estimates. The ML/AI angle — Databricks is where data scientists and ML engineers live.
By 2025-2026, the three platforms had converged architecturally — all use columnar storage, cost-based query planning, pipelined execution, just-in-time compilation. Benchmarks show near parity on standard queries.
The differentiator has shifted from "who's fastest" to "who owns the workflow." Google ties BigQuery to Vertex AI, BigQuery ML, Looker, and the broader GCP ecosystem. Snowflake bets on data sharing and multi-cloud flexibility. Databricks bets on the unified engineering-to-production pipeline.
BigQuery's structural advantage — no clusters, no provisioning, true serverless — remains its cleanest story. Its structural vulnerability — pricing unpredictability and deep GCP tie-in — remains its clearest obstacle for enterprises already invested in AWS.
1. Dremel was a 20% project that ran for four years before the paper was published.
Andrey Gubarev conceived Dremel in 2006. Google ran it internally for four years — powering Search, Ads, and YouTube analytics — before publishing the paper that would inspire an entire industry. By the time Snowflake's founders read about it, Google had been doing it in production longer than most startups survive.
2. The 2010 Dremel paper's record-shredding algorithm is inside almost every data file format you use.
The repetition-level / definition-level encoding for nested columnar data that Sergey Melnik et al. described in the paper directly influenced Apache Parquet, the file format that underlies virtually every modern data lake. Every time a Snowflake query runs, or a dbt model executes, or a Databricks job reads a Parquet file, it's executing an idea from that 2010 paper.
3. BigQuery's founding engineer concluded, after 10 years, that big data was mostly a myth.
Jordan Tigani spent a decade building infrastructure to query petabytes. Then he analyzed BigQuery's actual customer data and found that 90% of queries touched less than 100 MB. His conclusion: the big data paradigm was real at Google scale and almost nowhere else. He built MotherDuck — serverless DuckDB — as the answer for everyone who doesn't actually have Google's problem.
4. Shopify almost paid $949,000/month for a single query.
Before clustering their BigQuery tables, Shopify's data pipeline would have cost nearly $1 million per month for one analytics workload. The fix — table clustering — reduced the same query from 75 GB of scanning to 508 MB. A 692x cost reduction from one architectural decision. This is why BigQuery rewards engineers who understand its internals and punishes those who treat it like a normal SQL database.
5. Google bought a $2.6B BI company and kept it multi-cloud.
When Google acquired Looker in 2019, they did something unusual: they committed to keeping Looker working on AWS and Azure. For a company famous for pushing GCP exclusivity, this was a signal that Looker's semantic layer (LookML) was valuable enough as an industry standard that Google didn't want to limit its adoption. It was a product bet masquerading as a competitive concession.
| Fact | Detail |
|---|---|
| Dremel conceived | 2006, by Andrey Gubarev (20% project) |
| Dremel paper published | VLDB 2010 — Melnik, Gubarev, Long, Romer, Shivakumar, Tolton, Vassilakis |
| BigQuery announced | May 2010, Google I/O |
| BigQuery GA | 2012 |
| Redshift GA | February 2013 |
| Looker acquisition announced | June 2019 |
| Looker acquisition closed | February 2020 |
| Looker price | $2.6 billion |
| MotherDuck founded | 2022 |
| MotherDuck raise | $52.5M at $400M valuation (2023) |
| Jordan Tigani role | Founding engineer, BigQuery (10+ years) |
| BigQuery pricing | $6.25/TB scanned (on-demand) |
| Worst pricing incident documented | $9,847 in 22 seconds (1,576 TB scanned) |
| Shopify optimization | $949K/month → $1,370/month (692x reduction via clustering) |
| Dremel's successor influence | Apache Parquet file format |
| Systems inspired by Dremel paper | Snowflake (cited directly), Hive, Spark SQL, Presto
Sources: VLDB 2010 Dremel paper, MotherDuck "Big Data is Dead" (Tigani, 2023), Shopify Engineering Blog, DEV.to billing incident report, VentureBeat/TechCrunch Looker acquisition coverage, The Register Tigani interview (2023), Google Cloud Blog, GeekWire MotherDuck funding coverage.