Snowflake's native integration with Apache Iceberg empowers organizations to build a highly interoperable and open lakehouse architecture. Additionally, Snowflake Open Catalog, a managed service for Apache Polaris, offers robust role-based access controls, ensuring seamless data governance and secure collaboration across multiple engines.

This guide covers how to create managed Apache Iceberg™ tables in Snowflake from a CSV file and how to sync those Iceberg tables with Snowflake Open Catalog.

What You'll Learn

What You'll Need

Set Up Snowflake Open Catalog

(Screenshot: create Snowflake Open Catalog account)

(Screenshot: Open Catalog home page)

Creating an S3 Bucket

Create an IAM policy that grants access to your S3 location

(Screenshot: STS endpoints page)

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:GetObject",
                "s3:GetObjectVersion",
                "s3:DeleteObject",
                "s3:DeleteObjectVersion"
            ],
            "Resource": "arn:aws:s3:::<my_bucket>/*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket",
                "s3:GetBucketLocation"
            ],
            "Resource": "arn:aws:s3:::<my_bucket>",
            "Condition": {
                "StringLike": {
                    "s3:prefix": [
                        "*"
                    ]
                }
            }
        }
    ]
}
Before proceeding, replace "<my_bucket>" with the name of the bucket you specified in step 2.e.i (i.e., deconnect-hol-demo-<name>) in both places where it appears. Use only the bucket name ("deconnect-hol-demo-<name>"), without the "s3://" protocol prefix.
In the bottom right, click "Next"
For "Policy name", enter "deconnect-hol-demo-policy"
Scroll down, and in the bottom right click "Create policy"

Create an IAM role to grant privileges on your S3 bucket

Creating a Catalog in Snowflake Open Catalog

Grant the Snowflake Open Catalog IAM user permissions to access bucket objects

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "",
            "Effect": "Allow",
            "Principal": {
                "AWS": "<open_catalog_iam_user_arn>"
            },
            "Action": "sts:AssumeRole",
            "Condition": {
                "StringEquals": {
                    "sts:ExternalId": "deconnect_hol_demo"
                }
            }
        }
    ]
}
Before clicking "Update policy", replace "<open_catalog_iam_user_arn>" with the ARN of the Snowflake Open Catalog IAM user you copied in step 5.f.
Click "Update policy"

Creating a Principal Role

Creating a Catalog Role

Granting the Catalog Role to the Principal Role

Creating a Service Connection

Create Catalog Integration

CREATE CATALOG INTEGRATION deconnect_hol_demo_int 
  CATALOG_SOURCE=POLARIS 
  TABLE_FORMAT=ICEBERG 
  CATALOG_NAMESPACE='default' 
  REST_CONFIG = (
    CATALOG_URI ='<account_locator_url>/polaris/api/catalog' 
    WAREHOUSE = 'external_catalog_snowflake'
  )
  REST_AUTHENTICATION = (
    TYPE=OAUTH 
    OAUTH_CLIENT_ID='<client_id>' 
    OAUTH_CLIENT_SECRET='<secret>' 
    OAUTH_ALLOWED_SCOPES=('PRINCIPAL_ROLE:ALL') 
  ) 
  ENABLED=true;
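
Before moving on, you can optionally check that the integration can reach your Open Catalog account. A quick sanity check (not one of the original lab steps) using Snowflake's catalog-integration verification function:

-- Optional: verify connectivity between Snowflake and the Open Catalog endpoint
SELECT SYSTEM$VERIFY_CATALOG_INTEGRATION('deconnect_hol_demo_int');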

Create an external volume object in Snowflake

CREATE DATABASE deconnect_hol_iceberg;

CREATE SCHEMA deconnect_hol_iceberg.hol_demo;

USE deconnect_hol_iceberg.hol_demo;

CREATE OR REPLACE EXTERNAL VOLUME iceberg_hol_demo_ext_vol
  STORAGE_LOCATIONS =
      (
        (
            NAME = 'main_hol_s3'
            STORAGE_PROVIDER = 'S3'
            STORAGE_BASE_URL ='<s3_uri>'
            STORAGE_AWS_ROLE_ARN = '<role_arn>'
            STORAGE_AWS_EXTERNAL_ID = 'deconnect_hol_demo'
        )
      );

Grant your Snowflake account IAM user permissions on your bucket
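
To find the value to use for "<snowflake_extvol_iam_user>" in the trust policy below, describe the external volume you just created and note the STORAGE_AWS_IAM_USER_ARN (and STORAGE_AWS_EXTERNAL_ID) returned for your storage location:

-- Look up the IAM user ARN that Snowflake uses for this external volume
DESC EXTERNAL VOLUME iceberg_hol_demo_ext_vol;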

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "",
            "Effect": "Allow",
            "Principal": {
                "AWS": "<open_catalog_iam_user>"
            },
            "Action": "sts:AssumeRole",
            "Condition": {
                "StringEquals": {
                    "sts:ExternalId": "deconnect_hol_demo"
                }
            }
        },
        {
            "Sid": "",
            "Effect": "Allow",
            "Principal": {
                "AWS": "<snowflake_extvol_iam_user>"
            },
            "Action": "sts:AssumeRole",
            "Condition": {
                "StringEquals": {
                    "sts:ExternalId": "deconnect_hol_demo"
                }
            }
        }
    ]
}
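
Once the trust policy is updated, you can optionally confirm from Snowflake that the external volume can access the bucket. Another quick check that isn't part of the original lab steps:

-- Optional: validate that Snowflake can access the S3 location via the external volume
SELECT SYSTEM$VERIFY_EXTERNAL_VOLUME('iceberg_hol_demo_ext_vol');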

Syncing Snowflake-managed tables with Snowflake Open Catalog

To query a Snowflake-managed Apache Iceberg™ table using a third-party engine such as Apache Spark™, you can sync the table with Snowflake Open Catalog.

alter schema deconnect_hol_iceberg.hol_demo 
    set catalog = 'snowflake' 
        external_volume = iceberg_hol_demo_ext_vol 
        CATALOG_SYNC = deconnect_hol_demo_int;

Create a Snowflake-managed Iceberg table using a CSV file from a stage

CREATE STAGE rawdatastage;
ls @rawdatastage;
CREATE OR REPLACE FILE FORMAT my_csv
  TYPE = CSV
  PARSE_HEADER = TRUE
  FIELD_OPTIONALLY_ENCLOSED_BY = '"'
  ERROR_ON_COLUMN_COUNT_MISMATCH = FALSE
  REPLACE_INVALID_CHARACTERS= TRUE
  SKIP_BLANK_LINES = TRUE;
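
This lab assumes the CSV file has already been uploaded to rawdatastage (for example through the Snowsight stage UI), which is what the ls command above lists. If you're working from SnowSQL instead, a minimal sketch of staging it with PUT; the local path here is just a placeholder for wherever you saved the file:

-- Placeholder path: point this at your local copy of the CSV file
PUT file:///path/to/your_poi_file.csv @rawdatastage;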

Use INFER_SCHEMA on the staged file to generate a column list; the output of this query informs the column definitions used in the CREATE ICEBERG TABLE statement below.

SELECT GENERATE_COLUMN_DESCRIPTION(
    ARRAY_AGG(OBJECT_CONSTRUCT(*)) WITHIN GROUP (ORDER BY ORDER_ID ASC), 'table') AS COLUMNS
  FROM TABLE (
    INFER_SCHEMA(
      LOCATION => '@rawdatastage/',
      FILE_FORMAT => 'MY_CSV',
      IGNORE_CASE => FALSE,
      MAX_FILE_COUNT => 1,
      MAX_RECORDS_PER_FILE => 1000
    )
  );

CREATE OR REPLACE ICEBERG TABLE POI (
  LOCATION_NAME TEXT,
  TOP_CATEGORY TEXT,
  SUB_CATEGORY TEXT,
  NAICS_CODE NUMBER(6, 0),
  LATITUDE NUMBER(11, 9),
  LONGITUDE NUMBER(12, 9),
  STREET_ADDRESS TEXT,
  CITY TEXT,
  REGION TEXT,
  POSTAL_CODE NUMBER(5, 0),
  ISO_COUNTRY_CODE TEXT,
  PHONE_NUMBER NUMBER(11, 0)
);

-- Load the CSV file(s); Snowflake will write Parquet data files and Iceberg metadata files under the table's base location on the external volume.

COPY INTO POI
FROM @rawdatastage
FILE_FORMAT = my_csv 
MATCH_BY_COLUMN_NAME = 'case_sensitive' 
ON_ERROR = CONTINUE
LOAD_MODE = FULL_INGEST;
select * from poi;
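
To see where the Parquet and metadata files mentioned above were written, you can optionally ask Snowflake for the table's latest Iceberg metadata location (an extra check, not required for the lab):

-- Optional: show the location of the latest Iceberg metadata file for the POI table
SELECT SYSTEM$GET_ICEBERG_TABLE_INFORMATION('deconnect_hol_iceberg.hol_demo.POI');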

Run arbitrary SQL queries on this Iceberg table from Snowflake

Feel free to run any SQL you want against this Iceberg table. Here's a sample query you can run:

select city, region, count(*) 
    from poi 
    group by 1, 2

Now, let's persist those query results as a dynamic Iceberg table. You'll note that TARGET_LAG is set to 1 minute: Snowflake will automatically keep the dynamic table's contents up to date within 1 minute of any change to the upstream source data. If the upstream table (in this case "poi") hasn't changed when Snowflake checks, no refresh runs and you won't be billed for it.

CREATE DYNAMIC ICEBERG TABLE poi_dt_iceberg 
  TARGET_LAG = '1 minute'
  WAREHOUSE = COMPUTE_WH -- or a different warehouse name if you prefer
  AS
    select city, region, count(*) 
    from poi 
    group by 1, 2;

You can see the checks Snowflake has been making in the Snowsight UI via Monitoring -> Dynamic Tables -> poi_dt_iceberg, then select the "Refresh History" tab. To confirm that Snowflake has been checking each minute without using any compute, uncheck the "Warehouse used only" checkbox.
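
If you prefer SQL to the Snowsight UI, a sketch of the same check using the INFORMATION_SCHEMA DYNAMIC_TABLE_REFRESH_HISTORY table function (run in the deconnect_hol_iceberg.hol_demo context used above):

-- Show recent refresh attempts for the dynamic table, including checks that required no work
SELECT *
  FROM TABLE(INFORMATION_SCHEMA.DYNAMIC_TABLE_REFRESH_HISTORY(NAME => 'POI_DT_ICEBERG'));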

Query the dynamic Iceberg table from Snowflake

Feel free to query this dynamic Iceberg table, either via a SELECT * or whatever arbitrary SQL you want to run against it.

SELECT * FROM poi_dt_iceberg;

To prepare for the optional Spark section below, create a local conda environment using the following environment.yml file:

name: deconnect-hol-demo
channels:
  - conda-forge
dependencies:
  - findspark=2.0.1
  - jupyter=1.0.0
  - pyspark=3.5.0
  - openjdk=11.0.13

(Optional) Query the Snowflake-managed Iceberg table from Spark

import os

# Set your Open Catalog credentials for Spark as "<client id>:<client secret>" (keep the colon between them)
os.environ['OPEN_CATALOG_SPARK_CRED'] = '<spark client id>:<spark client secret>'

import pyspark
from pyspark.sql import SparkSession 

spark = SparkSession.builder.appName('hol-demo') \
    .config('spark.driver.host', 'localhost') \
    .config('spark.jars.packages', 'org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.6.1,software.amazon.awssdk:bundle:2.20.160') \
    .config('spark.sql.extensions', 'org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions') \
    .config('spark.sql.defaultCatalog', 'open_catalog') \
    .config('spark.sql.catalog.open_catalog', 'org.apache.iceberg.spark.SparkCatalog') \
    .config('spark.sql.catalog.open_catalog.type', 'rest') \
    .config('spark.sql.catalog.open_catalog.uri','<Open Catalog Account Locator URL>/polaris/api/catalog') \
    .config('spark.sql.catalog.open_catalog.credential', os.environ['OPEN_CATALOG_SPARK_CRED']) \
    .config('spark.sql.catalog.open_catalog.header.X-Iceberg-Access-Delegation','vended-credentials') \
    .config('spark.sql.catalog.open_catalog.warehouse','external_catalog_snowflake') \
    .config('spark.sql.catalog.open_catalog.scope','PRINCIPAL_ROLE:ALL')\
    .config('spark.sql.catalog.open_catalog.client.region','us-west-2')\
    .config("spark.sql.iceberg.vectorization.enabled", "false")\
    .getOrCreate()

print('Spark is running')

    Replace "<Open Catalog Account Locator URL>" with the URL you pasted in the HOL spreadsheet
    Replace "<spark client id>" with the client ID you copied in the HOL spreadsheet for spark
    Replace "<spark client secret> with the client ID you copied in the HOL spreadsheet for spark. Note you'll need to retain the colon (":") in between the two
try:
    spark.sql("show namespaces in open_catalog").show()
    spark.sql("use open_catalog.DECONNECT_HOL_ICEBERG.HOL_DEMO").show()
    spark.sql("show tables").show(20, False)
    spark.sql("select * from POI limit 10").show()
    spark.sql("select * from POI_DT_ICEBERG limit 10").show()
except Exception:
    import traceback
    traceback.print_exc()

❄️ Congratulations! This concludes our lab.

What we've covered:

Additional Resources