The Open Data Lakehouse: Architecting with BigQuery, BigLake, and Apache Iceberg
Enterprises managing modern data workloads face a persistent tension: the structured, governed performance of a data warehouse versus the cost-effective, flexible storage of a data lake. For years, teams either paid a premium for warehouse rigidity or accepted the sprawl of an ungoverned lake. The data lakehouse architecture resolves this tension — and on Google Cloud, the combination of BigLake, Apache Iceberg, and BigQuery provides a production-ready path to get there.
This article explains the architectural foundations from first principles, clarifies what each service actually does under the hood, and shows how they compose into a unified, interoperable lakehouse.
The Evolution: Warehouse, Lake, and Lakehouse
1. Data Warehouse
The data warehouse has long been the gold standard for business intelligence. It uses a schema-on-write model — you define the schema before loading data. This rigidity delivers strong SQL performance, full ACID transaction guarantees, and mature governance tooling. The cost: proprietary storage tightly coupled to compute, high per-TB pricing, and no support for unstructured or semi-structured data.
2. Data Lake
A data lake inverts the model. Raw data of any type — structured, semi-structured, unstructured — is stored cheaply in object storage and a schema is applied at query time (schema-on-read). This makes it ideal for machine learning and archival use cases. The downside is well-known: without transactional guarantees, data quality erodes over time. Lakes frequently become “data swamps.”
3. Data Lakehouse
The lakehouse adds a management and metadata layer on top of open-format files stored in low-cost object storage, delivering ACID transactions, schema enforcement, and governance without proprietary lock-in. It is designed to serve both BI/SQL and ML/AI workloads from the same storage layer.
The Foundation: Apache Iceberg
For a data lake to behave like a warehouse, raw files need a management layer. Apache Iceberg provides exactly this. Critically, Iceberg does not move or copy your data. Instead, it adds a layer of metadata and structure on top of files already stored in Cloud Storage — acting as an index and catalog for your data files. When you query an Iceberg table, you are querying the same underlying Parquet or Avro files that have always lived in GCS; Iceberg just makes them addressable as a versioned, transactional table.
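The metadata hierarchy described above can be made concrete with a small sketch. This is an illustration of the idea, not the real Iceberg implementation: the table is just layers of metadata (table metadata, snapshots, manifests) that point at Parquet files already sitting in GCS. All paths and snapshot IDs below are hypothetical.

```python
# Illustrative model of an Iceberg table: metadata that *references* data
# files in object storage. Nothing here moves or copies the Parquet files.
ICEBERG_TABLE = {
    # Table metadata: schema, partition spec, and a log of snapshots.
    "metadata": "gs://my-bucket/events/metadata/v2.metadata.json",
    "snapshots": [
        {
            "snapshot_id": 1,
            "manifest_list": "gs://my-bucket/events/metadata/snap-1.avro",
            # Each snapshot enumerates the data files that were already
            # sitting in the bucket when the snapshot was committed.
            "data_files": [
                "gs://my-bucket/events/data/part-000.parquet",
                "gs://my-bucket/events/data/part-001.parquet",
            ],
        },
        {
            # A later append: same files plus one new one. Reading
            # snapshot 1 instead of 2 is, in essence, time travel.
            "snapshot_id": 2,
            "manifest_list": "gs://my-bucket/events/metadata/snap-2.avro",
            "data_files": [
                "gs://my-bucket/events/data/part-000.parquet",
                "gs://my-bucket/events/data/part-001.parquet",
                "gs://my-bucket/events/data/part-002.parquet",
            ],
        },
    ],
}

def files_for_snapshot(table: dict, snapshot_id: int) -> list[str]:
    """Resolve the data files a query against a given snapshot would scan.

    Mirrors how an engine plans an Iceberg read: walk the metadata,
    never touch the Parquet files themselves.
    """
    for snap in table["snapshots"]:
        if snap["snapshot_id"] == snapshot_id:
            return snap["data_files"]
    raise KeyError(f"unknown snapshot {snapshot_id}")

print(files_for_snapshot(ICEBERG_TABLE, 1))
```

Because old snapshots keep pointing at the files that existed when they were committed, querying an earlier snapshot ID gives a consistent historical view of the table at no extra storage cost beyond the retained files.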
Iceberg’s success as the industry-standard open table format stems from three key properties: its engine-agnostic design (any compute engine can read and write it), its ability to scale metadata efficiently even at petabyte scale, and native support for critical features like time travel and schema evolution.
Building an Open Lakehouse on Google Cloud
1. Google Cloud Storage: Scalable, low-cost object storage. Stores the actual data files (Parquet, Avro, ORC) and the Iceberg metadata.
2. Apache Iceberg: The open table format. Adds transactional management, schema enforcement, and time travel on top of GCS files, without moving data.
3. BigQuery + BigLake: The analytical engine and storage bridge. Queries Iceberg tables on GCS with native warehouse-grade performance and governance.
BigLake: The Storage Bridge
BigLake is the critical connector between BigQuery’s query engine and data stored in open formats on GCS. When you query an Iceberg table via BigQuery, BigLake gives the query engine the instructions to read and process that data directly from the data lake as if it were native, applying the same parallel processing that powers BigQuery’s internal tables. No data movement, no duplication.
Critically, BigQuery has first-class, native support for Apache Iceberg through BigLake. It understands Iceberg’s metadata for advanced optimisations like partition pruning and data clustering, and it supports direct UPDATE, DELETE, and MERGE statements on Iceberg tables — the same DML operations you’d expect on a native warehouse table, now applied directly to your open-format data in GCS.
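To make the DML point concrete, here is a hedged sketch of the kind of MERGE statement BigQuery accepts directly against an Iceberg table exposed through BigLake. The project, dataset, and column names are hypothetical; executing the statement requires a BigQuery client and GCP credentials, so this sketch only composes the SQL.

```python
# Sketch: composing an upsert MERGE for an Iceberg table queried through
# BigQuery. Table and column names are placeholders, not a real project.

def build_merge(target: str, staging: str, key: str) -> str:
    """Build a MERGE that upserts staging rows into the Iceberg target."""
    return (
        f"MERGE `{target}` AS t\n"
        f"USING `{staging}` AS s\n"
        f"ON t.{key} = s.{key}\n"
        "WHEN MATCHED THEN UPDATE SET t.status = s.status\n"
        "WHEN NOT MATCHED THEN INSERT (event_id, status) "
        "VALUES (s.event_id, s.status)"
    )

sql = build_merge(
    "my-project.lakehouse.events",
    "my-project.staging.event_updates",
    "event_id",
)
print(sql)
```

The point is that this is ordinary BigQuery DML: nothing in the statement reveals that the target’s storage is open-format Parquet in your own GCS bucket rather than BigQuery’s native storage.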
BigLake Metastore: The Single Source of Truth
To enable multiple compute engines to share the same Iceberg tables without copying data or duplicating metadata, Google provides BigLake Metastore — a fully managed, serverless metadata service. It exposes the standard Apache Iceberg REST catalog interface, which means any engine that speaks the Iceberg REST spec (Spark, Trino, Flink, and others) can register and read tables from the same central registry that BigQuery uses.
This has two important security implications. First, credential vending: BigLake Metastore can assume a service account’s permissions to access underlying GCS buckets, so analysts can query data through BigQuery without requiring direct read/write access to the storage layer. Second, centralised policies defined once in Dataplex Universal Catalog — such as row-level or column-level security — are enforced consistently across every engine that queries through BigLake.
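The multi-engine arrangement above can be sketched as a Spark session configuration that attaches to an Iceberg REST catalog such as BigLake Metastore. The Spark/Iceberg property names (`type=rest`, `uri`, `warehouse`) are standard Iceberg catalog settings, but the catalog name, endpoint URI, and bucket below are illustrative assumptions; check the current BigLake documentation for the exact endpoint and authentication properties for your project.

```python
# Sketch: Spark properties registering an Iceberg REST catalog named "blms".
# Endpoint and bucket are placeholders, not verified values.
SPARK_CONF = {
    # Register an Iceberg catalog backed by the REST protocol.
    "spark.sql.catalog.blms": "org.apache.iceberg.spark.SparkCatalog",
    "spark.sql.catalog.blms.type": "rest",
    # Placeholder endpoint for the BigLake Metastore Iceberg REST catalog.
    "spark.sql.catalog.blms.uri": (
        "https://biglake.googleapis.com/iceberg/v1/restcatalog"
    ),
    # Warehouse location: the GCS bucket holding data and metadata files.
    "spark.sql.catalog.blms.warehouse": "gs://my-bucket/warehouse",
}

for key, value in SPARK_CONF.items():
    print(f"{key}={value}")
```

Any other engine that speaks the Iceberg REST spec (Trino, Flink) would point at the same endpoint, which is what makes the catalog a single source of truth rather than one registry per engine.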
Key BigLake Capabilities
- Engine interoperability
- Credential vending
- Unified governance via Dataplex
- Iceberg REST Catalog support
- Native DML on Iceberg (UPDATE/DELETE/MERGE)
- Serverless, fully managed
BigLake Tables in BigQuery: How It Works
There are two ways to expose Iceberg data through BigQuery, depending on who creates and owns the table:
External BigLake tables (OSS-created): An external tool such as Spark creates the Iceberg table and its metadata on GCS, then registers it with BigLake Metastore. BigQuery exposes this as an external table. BigLake reads the Iceberg metadata for optimised query planning — applying partition pruning and clustering hints — and BigQuery’s Dremel engine queries the data files directly. No data movement occurs.
BigLake tables for Apache Iceberg (BigQuery-managed): For teams that want a fully managed experience, BigQuery can also create and own the Iceberg table directly. This provides the same open-format storage on GCS but with BigQuery handling all metadata management. Both read and full DML operations — INSERT, UPDATE, DELETE, MERGE — are supported natively, giving you warehouse-grade write semantics on your own GCS bucket.
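For the BigQuery-managed flavor, the DDL looks roughly like the sketch below: a CREATE TABLE with a connection and open-format storage options. The connection, dataset, and bucket names are placeholders, and the exact OPTIONS keys should be verified against the current BigQuery documentation before use; the sketch only composes the statement.

```python
# Sketch: the shape of the DDL for a BigQuery-managed BigLake table for
# Apache Iceberg. All resource names below are hypothetical.
CREATE_MANAGED_ICEBERG = """
CREATE TABLE lakehouse.events (
  event_id STRING,
  status   STRING,
  ts       TIMESTAMP
)
WITH CONNECTION `my-project.us.lake-connection`
OPTIONS (
  file_format = 'PARQUET',
  table_format = 'ICEBERG',
  storage_uri = 'gs://my-bucket/events'
);
"""

print(CREATE_MANAGED_ICEBERG.strip())
```

The key design point is `storage_uri`: BigQuery owns the metadata and handles compaction and optimization, but the Parquet files land in a bucket you control, so the data stays in an open format.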
Real-World Scenario
The Unified Lakehouse in Practice
A data engineering team uses Apache Spark on Dataproc to ingest and transform raw event data. They write the output as an Iceberg table in GCS and register it with BigLake Metastore via the Iceberg REST catalog API. A data analyst, working entirely within BigQuery, can immediately run SQL against that same Iceberg table — including UPDATE, DELETE, and MERGE operations — without any data movement, schema translation, or pipeline overhead. Governance policies defined in Dataplex apply to both the Spark writer and the BigQuery reader from a single control plane.
Conclusion
The combination of BigLake Metastore and BigQuery provides a robust foundation for building a modern, open data lakehouse on Google Cloud.
- BigLake provides the essential abstraction layer, enabling an open, managed, and high-performance lakehouse with automated data management and built-in governance using Apache Iceberg.
- BigQuery offers the powerful analytical engine that can directly query Apache Iceberg open table formats.
By decoupling storage from compute, embracing open standards like Apache Iceberg, and unifying metadata with BigLake Metastore, Google Cloud helps enterprises break down data silos, reduce costs, and accelerate the journey from raw data to actionable AI-driven insights.
Contact us today to build an enterprise-level data lakehouse on GCP and unlock real productivity gains for your teams.
Sources
Introduction to BigLake external tables | BigQuery | Google Cloud Documentation
Author: Lae Lae Win
Date Published: Mar 31, 2026
