Apache Iceberg has become a popular open-source table format for data lakes because its flexibility gives organizations freedom of choice. Whether an organization wants to use best-of-breed services or has inherited multiple data platforms through mergers and acquisitions, adopting an open table format can be key to eliminating data silos.

Use Case

There is often no one-size-fits-all approach to tackling complex business challenges. Organizations often store their data in different places and use multiple tools and platforms to put that data to work. By uniting data across platforms and query engines using an open table format, organizations can serve a variety of business needs, including:

Prerequisites

What You'll Learn

What You'll Need

What You'll Build

In this section, you will create a table in Snowflake using the Iceberg format and also create a BigLake (external) table in BigQuery that points to the same Iceberg files.

Log in to Snowflake

USE ROLE accountadmin;

Create a warehouse to hold the data

CREATE OR REPLACE WAREHOUSE ICEBERG_WAREHOUSE WITH WAREHOUSE_SIZE='XSMALL';

Reference: https://docs.snowflake.com/en/sql-reference/sql/create-warehouse

Create a database to hold the Iceberg table

CREATE OR REPLACE DATABASE ICEBERG_DATABASE;

Create a bucket to hold your Snowflake-managed Iceberg table

Open: https://console.cloud.google.com/storage/browser

Create a link between Snowflake and GCS.

CREATE STORAGE INTEGRATION bigquery_integration
  TYPE = EXTERNAL_STAGE
  STORAGE_PROVIDER = 'GCS'
  ENABLED = TRUE
  STORAGE_ALLOWED_LOCATIONS = ('gcs://bigquery-snowflake-sharing');

Reference: https://docs.snowflake.com/en/sql-reference/sql/create-storage-integration

Get the service account that we will grant the Storage Object Admin role on our GCS bucket

DESC STORAGE INTEGRATION bigquery_integration;
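
The DESC output includes a STORAGE_GCP_SERVICE_ACCOUNT property; that value is the service account you grant access to in the next step. If you script this, a minimal sketch of pulling that value out of the DESC rows (the row shape matches what the Snowflake Python connector returns for DESC commands; the sample values below are illustrative, not real):

```python
def find_gcp_service_account(desc_rows):
    """Return the GCS service account from DESC STORAGE INTEGRATION rows.

    desc_rows is a list of (property, property_type, property_value, ...)
    tuples, the shape the Snowflake Python connector returns for DESC output.
    """
    for row in desc_rows:
        if row[0] == "STORAGE_GCP_SERVICE_ACCOUNT":
            return row[2]
    raise KeyError("STORAGE_GCP_SERVICE_ACCOUNT not found")

# Illustrative sample rows (your actual service account value will differ):
rows = [
    ("ENABLED", "Boolean", "true", "false"),
    ("STORAGE_GCP_SERVICE_ACCOUNT", "String",
     "abc123@gcpuscentral1-1dfa.iam.gserviceaccount.com", ""),
]
print(find_gcp_service_account(rows))
```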

Create a custom IAM role that has the permissions required to access the bucket and get objects.

Open the bucket you created

In Snowflake, create an external volume on GCS

CREATE EXTERNAL VOLUME snowflake_ext_volume
  STORAGE_LOCATIONS =
    (
      (
        NAME = 'us-central1'
        STORAGE_PROVIDER = 'GCS'
        STORAGE_BASE_URL = 'gcs://bigquery-snowflake-sharing/snowflake-volume/'
      )
    ),
    ALLOW_WRITES = TRUE;

Reference: https://docs.snowflake.com/en/sql-reference/sql/create-external-volume

Describe the volume

DESCRIBE EXTERNAL VOLUME snowflake_ext_volume;

Set the current database

USE ICEBERG_DATABASE;

Create a schema in Snowflake

CREATE SCHEMA iceberg_schema;

Make the schema active

USE SCHEMA iceberg_schema;

Create an Iceberg table using the Snowflake catalog

CREATE ICEBERG TABLE driver (driver_id int, driver_name string)
  CATALOG = 'SNOWFLAKE'
  EXTERNAL_VOLUME = 'snowflake_ext_volume'
  BASE_LOCATION = 'driver';

Reference: https://docs.snowflake.com/en/sql-reference/sql/create-iceberg-table-snowflake

Show the table you just created

SHOW TABLES;

Insert new data

INSERT INTO driver (driver_id, driver_name) VALUES (1, 'Driver 001');
SELECT * FROM driver;

Retrieve the latest metadata JSON file that Snowflake points to; BigQuery must be pointed at the same file

SELECT REPLACE(JSON_EXTRACT_PATH_TEXT(
     SYSTEM$GET_ICEBERG_TABLE_INFORMATION('ICEBERG_DATABASE.iceberg_schema.driver'),
          'metadataLocation'), 'gcs://', 'gs://');
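
The same extraction can also be done client-side. A minimal sketch, assuming you already have the JSON string returned by SYSTEM$GET_ICEBERG_TABLE_INFORMATION:

```python
import json

def bq_metadata_uri(iceberg_info_json: str) -> str:
    """Convert Snowflake's metadataLocation (gcs://...) to the gs:// form BigQuery expects."""
    info = json.loads(iceberg_info_json)
    return info["metadataLocation"].replace("gcs://", "gs://", 1)

# Illustrative sample of the function's JSON output:
sample = ('{"metadataLocation": "gcs://bigquery-snowflake-sharing/'
          'snowflake-volume/driver/metadata/00001-abc.metadata.json",'
          ' "status": "success"}')
print(bq_metadata_uri(sample))
# gs://bigquery-snowflake-sharing/snowflake-volume/driver/metadata/00001-abc.metadata.json
```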

Open the bucket you created

Create a BigQuery Dataset

CREATE SCHEMA IF NOT EXISTS snowflake_dataset OPTIONS(location = 'us-central1');

Navigate to BigQuery

Expand your project in the left-hand panel

Open the bucket you created

The `uris` value must be the metadata file path returned by the Snowflake command you ran above

CREATE OR REPLACE EXTERNAL TABLE `snowflake_dataset.driver`
WITH CONNECTION `us-central1.snowflake-connection`
OPTIONS (
  format = "ICEBERG",
  uris = ["gs://bigquery-snowflake-sharing/snowflake-volume/driver/metadata/00001-2d763c77-df0a-4230-bd52-033877d02c40.metadata.json"]
);

View the data in BQ

SELECT * FROM `snowflake_dataset.driver`;

Update table metadata in BQ

BigQuery will not see updated data in Snowflake because the external table points at a specific metadata JSON file (a snapshot). You will have to update the table definition to point at the latest table metadata. Retrieve the latest metadata location in Snowflake with:

SYSTEM$GET_ICEBERG_TABLE_INFORMATION('<iceberg_table_name>')

Then re-create the BigQuery external table using the returned path.
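
One way to script the refresh is to fetch the latest metadata location from Snowflake and then re-issue the CREATE OR REPLACE EXTERNAL TABLE statement with the new URI. A sketch that only builds the DDL string (the table and connection names default to the ones used earlier in this guide; actually executing it would require the BigQuery client library and credentials):

```python
def build_refresh_ddl(metadata_uri: str,
                      table: str = "snowflake_dataset.driver",
                      connection: str = "us-central1.snowflake-connection") -> str:
    """Build the BigQuery DDL that re-points the external Iceberg table at a new metadata file."""
    return (
        f"CREATE OR REPLACE EXTERNAL TABLE `{table}`\n"
        f"WITH CONNECTION `{connection}`\n"
        "OPTIONS (\n"
        '  format = "ICEBERG",\n'
        f'  uris = ["{metadata_uri}"]\n'
        ");"
    )

# Illustrative metadata path; use the one returned by Snowflake.
ddl = build_refresh_ddl(
    "gs://bigquery-snowflake-sharing/snowflake-volume/driver/metadata/00002-latest.metadata.json")
print(ddl)
```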

In this section, you will create a table in BigQuery using the Iceberg format and also create an Iceberg (external) table in Snowflake that points to the same Iceberg files.

Create a bucket to hold your BigLake Managed Table

Navigate to BigQuery

Expand your project in the left hand panel


CREATE SCHEMA IF NOT EXISTS blmt_dataset OPTIONS(location = 'us');

Create a BigLake Managed table

CREATE OR REPLACE TABLE `blmt_dataset.driver`
(
 driver_id                 INT64,
 driver_name               STRING,
 driver_mobile_number      STRING,
 driver_license_number     STRING,
 driver_email_address      STRING,
 driver_dob                DATE,
 driver_ach_routing_number STRING,
 driver_ach_account_number STRING
)
CLUSTER BY driver_id
WITH CONNECTION `us-central1.blmt-connection`
OPTIONS (
 file_format = 'PARQUET',
 table_format = 'ICEBERG',
 storage_uri = 'gs://blmt-snowflake-sharing/driver'
);

Load the table with some data

LOAD DATA INTO `blmt_dataset.driver`
FROM FILES (
 format = 'parquet',
 uris = ['gs://data-analytics-golden-demo/biglake/v1-source/managed-table-source/driver/*.parquet']);

View the data

SELECT * FROM `blmt_dataset.driver` LIMIT 1000;

View the storage metadata

To export the most recent metadata, run this in BigQuery (a separate window is preferred):

EXPORT TABLE METADATA FROM blmt_dataset.driver;

In your storage window
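
Each export writes a new versioned metadata JSON file under the table's metadata/ prefix (the v<timestamp>.metadata.json naming here is an assumption based on the example path used later in this guide). If you script the lookup of the newest file from a listing of object names, a sketch might look like:

```python
import re

def latest_metadata_file(object_names):
    """Pick the metadata JSON with the highest version number from a list of object names."""
    pattern = re.compile(r"v(\d+)\.metadata\.json$")
    versioned = [(int(m.group(1)), name)
                 for name in object_names
                 if (m := pattern.search(name))]
    if not versioned:
        raise ValueError("no metadata files found")
    # max() compares the (version, name) tuples by version first
    return max(versioned)[1]

# Illustrative object listing; in practice this would come from a GCS list call.
names = [
    "driver/metadata/v1730762887.metadata.json",
    "driver/metadata/v1730769001.metadata.json",
    "driver/data/part-00000.parquet",
]
print(latest_metadata_file(names))
# driver/metadata/v1730769001.metadata.json
```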

Let's connect the data to Snowflake

USE ROLE accountadmin;

Create a warehouse to hold the data

CREATE OR REPLACE WAREHOUSE BLMT_WAREHOUSE WITH WAREHOUSE_SIZE='XSMALL';

Create a database in Snowflake

CREATE OR REPLACE DATABASE BLMT_DATABASE;

Select the database

USE DATABASE BLMT_DATABASE;

Create the schema to hold the table

CREATE SCHEMA IF NOT EXISTS BLMT_SCHEMA;

Select the schema

USE SCHEMA BLMT_SCHEMA;

Create our GCS volume integration. This will create a link between Snowflake and GCS.

CREATE STORAGE INTEGRATION gcs_storage_integration
 TYPE = EXTERNAL_STAGE
 STORAGE_PROVIDER = 'GCS'
 ENABLED = TRUE
 STORAGE_ALLOWED_LOCATIONS = ('gcs://blmt-snowflake-sharing/');

Get the service account that we will grant the Storage Object Admin role on our GCS bucket

DESC STORAGE INTEGRATION gcs_storage_integration;

Configure an external volume for Google Cloud Storage

Open the bucket you created

Create an external volume on GCS

Reference: https://docs.snowflake.com/en/sql-reference/sql/create-external-volume

CREATE OR REPLACE EXTERNAL VOLUME gcs_volume
  STORAGE_LOCATIONS =
    (
      (
        NAME = 'gcs_volume'
        STORAGE_PROVIDER = 'GCS'
        STORAGE_BASE_URL = 'gcs://blmt-snowflake-sharing/'
      )
    );

Create a catalog integration to manage Iceberg tables in the external volume

CREATE OR REPLACE CATALOG INTEGRATION catalog_integration
   CATALOG_SOURCE=OBJECT_STORE   -- Indicates that the catalog is backed by an object store
   TABLE_FORMAT=ICEBERG          -- Specifies the table format as Iceberg
   ENABLED=TRUE;                 -- Enables the catalog integration

Create the Iceberg table, pointing to the existing metadata file

CREATE OR REPLACE ICEBERG TABLE driver
   CATALOG='catalog_integration'   -- The catalog where the table will reside
   EXTERNAL_VOLUME='gcs_volume'    -- The external volume for table data
   BASE_LOCATION=''                -- Optional: Subdirectory within the storage location
   METADATA_FILE_PATH='driver/metadata/v1730762887.metadata.json'; -- Path to the existing metadata file

This will show the table just created

SHOW TABLES;

Query the Iceberg table

SELECT * FROM driver;
SELECT COUNT(*) FROM driver;

Now that the tables are linked, you can try the following. Insert a record into the BigQuery table:

INSERT INTO `blmt_dataset.driver`
(driver_id, driver_name, driver_mobile_number, driver_license_number, driver_email_address,
driver_dob, driver_ach_routing_number, driver_ach_account_number)
VALUES (0, 'New Driver', 'xxx-xxx-xxxx', 'driver_license_number', 'driver_email_address',
CURRENT_DATE(), 'driver_ach_routing_number','driver_ach_account_number');

Now query the record in Snowflake

SELECT * FROM driver WHERE driver_id = 0;

NOTE: You will not see the new record. You first need to tell BigQuery to export the latest metadata (see step "View the storage metadata") and then update the metadata used by Snowflake (see step "Create the Iceberg table, pointing to the existing metadata file") so that it points to the latest JSON file.
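
Rather than re-creating the Snowflake table each time, the table can be refreshed in place with ALTER ICEBERG TABLE ... REFRESH '<metadata-file-path>' (supported for Iceberg tables backed by an object-store catalog integration). A sketch that only builds that statement from a metadata path (running it would require a Snowflake connection):

```python
def build_snowflake_refresh(table: str, metadata_path: str) -> str:
    """Build the ALTER ICEBERG TABLE ... REFRESH statement that re-points an
    object-store-cataloged Iceberg table at a newer metadata file."""
    return f"ALTER ICEBERG TABLE {table} REFRESH '{metadata_path}';"

# Illustrative metadata path; use the latest file exported by BigQuery.
stmt = build_snowflake_refresh("driver", "driver/metadata/v1730769001.metadata.json")
print(stmt)
# ALTER ICEBERG TABLE driver REFRESH 'driver/metadata/v1730769001.metadata.json';
```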

In this section you will set up BigQuery to stay in sync with Iceberg tables hosted in Snowflake.

  1. So far, in this Quickstart, you've created an Iceberg table and exported Snowflake metadata to the Google Cloud Storage account upon each update. (Section 1 of this Quickstart)
  2. You've also created BigLake Iceberg tables in BigQuery and pointed to a specific metadata JSON file on storage. (Section 2 of this Quickstart). However, you are pointing at a specific version of the data and if changes are made to the Iceberg table in Snowflake, BigQuery will not see them.
  3. To sync Iceberg tables in Snowflake with BigQuery, BigQuery will have to read the Iceberg Catalog in Snowflake which will provide the latest metadata json file. Please see this GitHub repository for steps to enable this sync and automating the process:
  4. Finally, use the PATCH command to update BigQuery's Iceberg table to the latest metadata from Snowflake.

Technical Considerations

Technical considerations for the data sharing integration between BigQuery and Snowflake today include:

What you learned

By following this quickstart, you learned how to:

Resources

Google Cloud and Snowflake's commitment to Apache Iceberg's open-source table format empowers our customers to unite their data across query engines and platforms and use it to solve business problems with familiar features, tools, and services from both organizations. Learn how to build your first open data lakehouse with Google Cloud and Snowflake by exploring documentation from Google Cloud and Snowflake, watching a tutorial, then diving into the Quickstart Guide.