Apache Iceberg is an open table format for huge analytical datasets that enables high-performance analytics on open data formats with ACID compliance. Both Snowflake and AWS support the Iceberg format, enabling customers to drastically improve data interoperability, speed of implementation, and performance for integrated data lakes.

This guide will take you through the steps of converting existing Parquet data to Iceberg and using it to build open analytics environments using Snowflake and AWS Glue.

Workflow

For this guide we will use a Financial Services use case in which insurance data is analyzed. The Quotes data is collected from source systems and stored as Parquet on S3, while the Customer and Policy data is already available as internal Snowflake tables. We will try to identify customers who are likely to churn, as well as potential fraud indicated by a high number of recent quote requests.

Prerequisites

What You'll Learn

What You'll Need

What You'll Build

In this step we will use CloudFormation to configure the objects needed for the quickstart in the AWS account. We will also cover the basics of the AWS console and how to use browser tabs to multitask across the various AWS services.

(Screenshot: AWS Region selection)

We will now run a CloudFormation template to configure an S3 bucket and an IAM role with specific policies in the AWS account.

(Screenshot: AWS CloudFormation console)

(Screenshot: AWS Region selection)

https://snowflake-corp-se-workshop.s3.us-west-1.amazonaws.com/VHOL_Iceberg_SNOW_AWS/setup/glue-snflk-devday-v1.3.yaml

(Screenshot: Create stack)

It will take a few seconds to create the resources in your AWS account. In the Events screen you can click the refresh button to see updates.

(Screenshot: CloudFormation stack outputs)

In this step we will use AWS Glue to create an Iceberg table and convert the Parquet data to Iceberg format. Glue Studio will be used.

Create Glue Database

(Screenshot: AWS Region selection)

(Screenshot: Add database)

(Screenshot: Create database)

Create Glue Table

(Screenshot: Add table)

(Screenshot: Table properties)

(Screenshot: Schema)

(Screenshot: Delete)

(Screenshot: Save schema)

(Screenshot: Schema - Next)

Create Glue ETL Job

In this section, we will show you how to create an ETL job in the Glue Studio visual interface.

(Screenshot: ETL job)

(Screenshot: ETL job configuration)

(Screenshot: ETL source)

(Screenshot: Quotes raw data)

(Screenshot: Transform)

(Screenshot: Change Schema transform)

Leave everything as is; don't change anything here. This step simply shows that you have the option to change the schema using this transform.

(Screenshot: Target)

(Screenshot: Target options)

(Screenshot: Saved job)

We can now run the Glue job to convert the raw Parquet files and save them as an Iceberg table.

(Screenshot: Run ETL job)

(Screenshot: ETL run complete)

You have successfully created an Iceberg table in your AWS account from Parquet files using Glue Studio.
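
Under the hood, Glue Studio generates a Spark script for this job. Conceptually, the conversion is roughly equivalent to a Spark SQL CTAS like the sketch below; the catalog name, database, table, and S3 paths are illustrative assumptions, not the exact script Glue generates.

-- Hedged sketch of the Parquet-to-Iceberg conversion in Spark SQL.
-- glue_catalog, the iceberg database, and both S3 paths are placeholders.
CREATE TABLE glue_catalog.iceberg.quotes
USING iceberg
LOCATION 's3://<your-bucket>/iceberg/quotes/'
AS SELECT * FROM parquet.`s3://<your-bucket>/quotes-raw/`;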

As a bonus step, you can open the browser tab with the S3 console and explore the new folders and files where the data and metadata/manifest files are stored.
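
The layout follows the Iceberg spec and will look roughly like this (exact file names will differ):

iceberg/quotes/
  data/       <- Parquet data files
  metadata/   <- *.metadata.json files, manifest lists and manifests (*.avro)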

In this step we will configure the Snowflake account with internal tables to represent the Customer and Policy data. We will also create the Snowflake objects used in the quickstart, such as a database, warehouse, and role. Then we will configure the external volume that allows Snowflake to read and write on S3. We will also create an integration with the Glue Data Catalog so Snowflake can retrieve the Iceberg catalog information directly from the Glue Catalog.

Configure your Snowflake account

Setup SQL

Workflow SQL

(Screenshot: new worksheet)

(Screenshot: worksheet SQL)

(Screenshot: Run)

USE ROLE SECURITYADMIN;

-- Create the lab role and make it visible to SYSADMIN
CREATE OR REPLACE ROLE HOL_ICE_RL COMMENT='Iceberg Role';
GRANT ROLE HOL_ICE_RL TO ROLE SYSADMIN;
USE ROLE ACCOUNTADMIN;

-- Grant the account-level privileges the lab role needs
GRANT CREATE INTEGRATION ON ACCOUNT TO ROLE HOL_ICE_RL;
GRANT CREATE EXTERNAL VOLUME ON ACCOUNT TO ROLE HOL_ICE_RL;
GRANT CREATE DATABASE ON ACCOUNT TO ROLE HOL_ICE_RL;
GRANT CREATE WAREHOUSE ON ACCOUNT TO ROLE HOL_ICE_RL;
USE ROLE HOL_ICE_RL;

-- Create the lab database and a small warehouse
CREATE OR REPLACE DATABASE HOL_ICE_DB;

CREATE OR REPLACE WAREHOUSE HOL_ICE_WH
  WITH WAREHOUSE_SIZE = 'XSMALL'
  INITIALLY_SUSPENDED = TRUE;

Configure the Snowflake External Volume integration with S3 and Catalog integration with Glue

(Screenshot: add worksheet)

USE ROLE HOL_ICE_RL;

USE HOL_ICE_DB.PUBLIC;

USE WAREHOUSE HOL_ICE_WH;

We will now create an External Volume and configure it to work with your Snowflake account.

(Screenshot: account ID)

-- Use the bucket name and role ARN from your CloudFormation outputs
CREATE OR REPLACE EXTERNAL VOLUME HOL_ICE_EXT_VOL
   STORAGE_LOCATIONS =
      (
         (
            NAME = 'my-s3-ice-ext-vol'
            STORAGE_PROVIDER = 'S3'
            STORAGE_BASE_URL = 's3://glue-snowflake-lab-6546xxxxxxx/iceberg/'
            STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::6546xxxxxxx:role/glue-snowflake-GluesnowflakedevdayLabRole-xxxxxxxxxxx'
         )
      );

We will now set up the trust relationship with your AWS account role. First, describe the external volume to retrieve the values Snowflake generated:

DESC EXTERNAL VOLUME HOL_ICE_EXT_VOL;

The STORAGE_LOCATIONS row in the output will have a property value in JSON that looks like this:

{"NAME":"my-s3-ice-ext-vol",
"STORAGE_PROVIDER":"S3",
"STORAGE_BASE_URL":"s3://glue-snowflake-devday-lab-6546xxxxxxxx/iceberg/",
"STORAGE_ALLOWED_LOCATIONS":["s3://glue-snowflake-lab-6546xxxxxxxx/iceberg/*"],
"STORAGE_AWS_ROLE_ARN":"arn:aws:iam::65465xxxxxxx:role/glue-snowflake-GluesnowflakedevdayLabRole-crOqCT36mDB4",
"STORAGE_AWS_IAM_USER_ARN":"arn:aws:iam::90541xxxxxxxxxx:user/vvyk0000-s",
"STORAGE_AWS_EXTERNAL_ID":"YJB50193_SFCRole=2_f1IsD5b8/DAFxxxxxxxxxxxx",
"ENCRYPTION_TYPE":"NONE",
"ENCRYPTION_KMS_KEY_ID":""}

(Screenshot: Trust policy)

Select everything in the JSON policy and replace it with the JSON below. You will then replace <snowflake storage arn> with the STORAGE_AWS_IAM_USER_ARN value and <snowflake external id ext volume> with the STORAGE_AWS_EXTERNAL_ID value from the DESC output above.

{
    "Version": "2012-10-17",
    "Statement": [
      {
        "Effect": "Allow",
        "Principal": {
            "Service": "glue.amazonaws.com"
        },
        "Action": "sts:AssumeRole"
      },
      {
        "Effect": "Allow",
        "Principal": {
          "AWS": "<snowflake storage arn>"
        },
        "Action": "sts:AssumeRole",
        "Condition": {
          "StringEquals": {
            "sts:ExternalId": "<snowflake external id ext volume>"
          }
        }
      },
      {
        "Effect": "Allow",
        "Principal": {
          "AWS": "<snowflake glue arn>"
        },
        "Action": "sts:AssumeRole",
        "Condition": {
          "StringEquals": {
            "sts:ExternalId": "<snowflake external id glue catalog>"
          }
        }
      }
    ]
  }

Take your time and be careful when pasting the values into the JSON policy in the next few steps.

Next we will create the integration with the Glue Data Catalog, which works in a very similar manner. Substitute your role ARN and your Glue catalog ID (your AWS account ID).

It will look something like this:

CREATE or REPLACE CATALOG INTEGRATION HOL_ICE_GLUE_CAT_INT
  CATALOG_SOURCE=GLUE
  CATALOG_NAMESPACE='iceberg'
  TABLE_FORMAT=ICEBERG
  GLUE_AWS_ROLE_ARN='arn:aws:iam::6546xxxxxxxx:role/glue-snowflake-GluesnowflakedevdayLabRole-xxxxxxxxxxxx'
  GLUE_CATALOG_ID='6546xxxxxxxxx'
  ENABLED=TRUE; 
DESC CATALOG INTEGRATION HOL_ICE_GLUE_CAT_INT;

The results will include the two property values needed to finalize the IAM trust policy in AWS: GLUE_AWS_IAM_USER_ARN and GLUE_AWS_EXTERNAL_ID.
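
For example, the two rows will look something like this (values masked):

GLUE_AWS_IAM_USER_ARN   arn:aws:iam::90541xxxxxxx:user/vvyk0000-s
GLUE_AWS_EXTERNAL_ID    YJB50193_SFCRole=2_cTvotHxxxxxxxxxxxxxxxx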

In the IAM trust policy, replace <snowflake glue arn> with the GLUE_AWS_IAM_USER_ARN value and <snowflake external id glue catalog> with the GLUE_AWS_EXTERNAL_ID value.

Your IAM trust policy will look something like this:

{
    "Version": "2012-10-17",
    "Statement": [
      {
        "Effect": "Allow",
        "Principal": {
            "Service": "glue.amazonaws.com"
        },
        "Action": "sts:AssumeRole"
      },
      {
        "Effect": "Allow",
        "Principal": {
          "AWS": "arn:aws:iam::90541xxxxxx:user/vvyk0000-s"
        },
        "Action": "sts:AssumeRole",
        "Condition": {
          "StringEquals": {
            "sts:ExternalId": "YJB50193_SFCRole=2_f1IsD5b8/DAFYPxxxxxxxxxxxx"
          }
        }
      },
      {
        "Effect": "Allow",
        "Principal": {
          "AWS": "arn:aws:iam::90541xxxxxxx:user/vvyk0000-s"
        },
        "Action": "sts:AssumeRole",
        "Condition": {
          "StringEquals": {
            "sts:ExternalId": "YJB50193_SFCRole=2_cTvotHxxxxxxxxxxxxxxxx"
          }
        }
      }
    ]
  }

This completes the integration with both S3 and the Glue Data Catalog.
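
As an optional check (a suggestion beyond the original lab flow), Snowflake provides a system function that verifies the external volume can read and write its S3 location:

-- Runs access checks against the external volume's storage location
SELECT SYSTEM$VERIFY_EXTERNAL_VOLUME('HOL_ICE_EXT_VOL');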

In this step we will use the integrations with S3 and Glue to create an Iceberg table object referencing the Quotes Iceberg table. That table will be used for analysis together with the internal Customer and Policy tables. We will then create an aggregate table written in Iceberg format on S3 and use Snowflake to query it.

We can now create an Iceberg table by referencing the Glue catalog. We simply execute the command below.

CREATE OR REPLACE ICEBERG TABLE QUOTES_ICE
  EXTERNAL_VOLUME='HOL_ICE_EXT_VOL'
  CATALOG='HOL_ICE_GLUE_CAT_INT'
  CATALOG_TABLE_NAME='QUOTES';
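
As an aside, if later Glue job runs add new snapshots to the table, Snowflake's copy of the Iceberg metadata can be refreshed with a single command (optional here, shown as a sketch):

-- Refresh Snowflake's view of an Iceberg table managed by an external catalog
ALTER ICEBERG TABLE QUOTES_ICE REFRESH;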

Let's read the Quotes data in Iceberg format from S3:

SELECT * FROM QUOTES_ICE LIMIT 10;

We can now do some analysis by combining the Iceberg table (Quotes data) with the internal Snowflake Customer table data.

SELECT C.FULLNAME, C.POSTCODE, C.CUSTID, C.IPID, C.PRODUCTNAME, C.QUOTECOUNT,
Q.POLICYNO, Q.QUOTEDATE, Q.QUOTE_PRODUCT, Q.ORIGINALPREMIUM, Q.TOTALPREMIUMPAYABLE 
FROM CUSTOMER C, QUOTES_ICE Q
WHERE C.FULLNAME = Q.FULLNAME
AND C.POSTCODE = Q.POSTCODE
AND C.QUOTECOUNT > 5
ORDER BY C.QUOTECOUNT DESC;

We can now also create an aggregate data set by combining the Quotes data in Iceberg with Customer and Policy data in internal Snowflake tables.

WITH CUSTQUOTE AS
(SELECT C.FULLNAME, C.POSTCODE, C.CUSTID, C.IPID, C.PRODUCTNAME, C.QUOTECOUNT,
Q.POLICYNO, Q.QUOTEDATE, Q.QUOTE_PRODUCT, Q.ORIGINALPREMIUM, Q.TOTALPREMIUMPAYABLE 
FROM CUSTOMER C, QUOTES_ICE Q
WHERE C.FULLNAME = Q.FULLNAME
AND C.POSTCODE = Q.POSTCODE
AND C.QUOTECOUNT > 5)
SELECT CQ.FULLNAME, CQ.POSTCODE, CQ.CUSTID, CQ.IPID, CQ.PRODUCTNAME,
CQ.QUOTECOUNT, CQ.POLICYNO, CQ.QUOTEDATE, CQ.QUOTE_PRODUCT,
CQ.ORIGINALPREMIUM, CQ.TOTALPREMIUMPAYABLE, 
P.CREATEDDATE, P.BRAND, P.BRANCHCODE, P.POLICY_STATUS_DESC,
P.TYPEOFCOVER_DESC, P.INSURER_NAME, P.INCEPTIONDATE, P.RENEWALDATE
FROM CUSTQUOTE CQ, POLICIES P
WHERE CQ.CUSTID = P.CUSTID;

This aggregate data can also be written in Iceberg format back to S3 for consumption by other services and engines.

CREATE OR REPLACE ICEBERG TABLE QUOTE_ANALYSIS_ICE  
  CATALOG='SNOWFLAKE'
  EXTERNAL_VOLUME='HOL_ICE_EXT_VOL'
  BASE_LOCATION='quoteanalysisiceberg'
  AS 
  WITH CUSTQUOTE AS
(SELECT C.FULLNAME, C.POSTCODE, C.CUSTID, C.IPID, C.PRODUCTNAME, C.QUOTECOUNT,
Q.POLICYNO, Q.QUOTEDATE, Q.QUOTE_PRODUCT, Q.ORIGINALPREMIUM, Q.TOTALPREMIUMPAYABLE 
FROM CUSTOMER C, QUOTES_ICE Q
WHERE C.FULLNAME = Q.FULLNAME
AND C.POSTCODE = Q.POSTCODE
AND C.QUOTECOUNT > 5)
SELECT CQ.FULLNAME, CQ.POSTCODE, CQ.CUSTID, CQ.IPID, CQ.PRODUCTNAME,
CQ.QUOTECOUNT, CQ.POLICYNO, CQ.QUOTEDATE, CQ.QUOTE_PRODUCT,
CQ.ORIGINALPREMIUM, CQ.TOTALPREMIUMPAYABLE, 
P.CREATEDDATE, P.BRAND, P.BRANCHCODE, P.POLICY_STATUS_DESC,
P.TYPEOFCOVER_DESC, P.INSURER_NAME, P.INCEPTIONDATE, P.RENEWALDATE
FROM CUSTQUOTE CQ, POLICIES P
WHERE CQ.CUSTID = P.CUSTID;
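
Other engines find an Iceberg table through its latest metadata file. As an optional aside (not part of the original lab flow), you can ask Snowflake where that metadata was written on S3:

-- Returns JSON that includes the metadataLocation for the Snowflake-managed table
SELECT SYSTEM$GET_ICEBERG_TABLE_INFORMATION('QUOTE_ANALYSIS_ICE');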

Lastly, we can use Snowflake to query the aggregate data in Iceberg. Let's see which customers with more than 5 quotes have a policy that was renewed and a premium higher than 100.

SELECT DISTINCT(CUSTID), FULLNAME, POSTCODE,IPID, PRODUCTNAME, QUOTECOUNT,
POLICYNO, QUOTEDATE, QUOTE_PRODUCT, ORIGINALPREMIUM, TOTALPREMIUMPAYABLE,
CREATEDDATE, BRAND, BRANCHCODE, POLICY_STATUS_DESC, TYPEOFCOVER_DESC,
INSURER_NAME, INCEPTIONDATE, RENEWALDATE
FROM QUOTE_ANALYSIS_ICE
WHERE TOTALPREMIUMPAYABLE >100
AND POLICY_STATUS_DESC = 'Renewed' 
ORDER BY CREATEDDATE DESC;

This completes the creation of Iceberg tables in Snowflake using an external catalog, the Glue Data Catalog. You also used the Iceberg table alongside internal tables to perform analysis on insurance customers, quotes, and policies. Finally, a Snowflake-managed Iceberg table containing the aggregated data was written to S3, enabling different engines to query it; we used Snowflake as the query engine for a quick analysis of policy renewals.

You've successfully converted Parquet data to Iceberg format using AWS Glue, integrated Snowflake with S3 and the Glue Data Catalog, combined Iceberg table data with internal Snowflake data for analytics, written aggregate data in Iceberg format to S3, and finally used Snowflake to analyze the Iceberg data.

What You Learned

Related Resources