Apache Iceberg is an open table format for huge analytical datasets that enables high-performance analytics on open data formats with ACID compliance. Snowflake and AWS both support the Iceberg format, enabling customers to drastically improve data interoperability, speed of implementation, and performance for integrated data lakes.

This guide will take you through the steps of converting existing Parquet data to Iceberg and using it to build open analytic environments with Snowflake and AWS Glue.

Workflow

For this guide we will use a Financial Services use case in which insurance data is analyzed. The Quotes data is collected from source systems and stored as Parquet on S3, while the Customer and Policy data is already available as internal Snowflake tables. We will identify customers with a high number of recent quote requests, who are likely to churn or may indicate potential fraud.

Prerequisites

What You'll Learn

What You'll Need

What You'll Build

In this step we will use CloudFormation to configure the objects in the AWS account needed for the quickstart. We will also cover the basics of the AWS console and how to leverage browser tabs to multitask across various AWS services.

Start by logging into your AWS Console.

Once logged in to the AWS Console look for the AWS Region in the top right hand corner and select the US West (Oregon) us-west-2 region.

AWSRegion

We will now run a CloudFormation Template to configure an S3 bucket and a Role with specific Policies in the AWS account.

In the menu bar at the top, enter CloudFormation in the search bar, then select the CloudFormation service from the dropdown list.

AWSCloudformation

Once in the CloudFormation screen, verify that the AWS Region is US West (Oregon) us-west-2.

AWSRegion

We recommend duplicating the browser tab 3 times to make it simple to use multiple AWS services in the same browser session.

You can now create a CloudFormation Stack. Click the Create stack button, or if it is not visible, click on the 3 horizontal bars in the top left corner and then select Stacks to see the Create stack button.

Once you are in the Create stack screen select the following options:

Then copy the CloudFormation Template URL below into the Amazon S3 URL box.

https://snowflake-corp-se-workshop.s3.us-west-1.amazonaws.com/VHOL_Iceberg_SNOW_AWS/setup/glue-snflk-devday-v1.2.yaml

CreateStack

Click Next

On the next screen, Specify stack details, enter a name for the stack in Stack name. In this case you can use iceberg-devday. Then click Next.

On the following screen, Configure stack options, simply click Next.

On the last screen, Review and create, scroll down to the bottom, select the check box next to I acknowledge that AWS CloudFormation might create IAM resources, and then click Submit to create the stack.

It will take a few seconds to create the resources in your AWS account. In the Events screen you can click the refresh button to see updates.

Once complete, click on the Outputs tab in the center top. This will display the Role name and S3 bucket name that were created. These will be used later to configure the integrations with Snowflake.

CFoutput

In this step we will use AWS Glue to create an Iceberg table and convert Parquet data to Iceberg format without rewriting the data on S3. Glue Notebooks will be used.

In one of the browser tabs with the AWS console type S3 in the search bar and select S3. This will allow you to quickly see what data and folders are created in S3. Verify the AWS Region in the S3 screen.

In the next browser tab type Glue and then select AWS Glue.

Once in the AWS Glue screen, verify that the AWS Region is US West (Oregon) us-west-2.

AWSRegion

Click on the 3 horizontal bars on the top left to open the left-side menu. Then select Notebooks under ETL jobs.

Notebooks

We have developed a Python notebook that can be downloaded to your local system. Click on the Download Notebook button below, which will download a .ipynb file to your downloads folder.

Download Notebook

In the AWS Glue Studio main frame click on the Notebook button

Notebookconfig

Give the notebook a little time to start.

We will execute the notebook code cells one at a time to step through the process. To run a code cell, simply select it and then click the run button at the top or press Shift + Return on the keyboard.

In the next cell we will configure the variables for the notebook session. You first need to update the bucket_name variable with the name of the S3 bucket that was created by the CloudFormation Template. The bucket name can be found on the browser tab with the CloudFormation Outputs.

Notebookvariables
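For reference, here is a minimal sketch of what the variables cell might look like. Only bucket_name is referenced in this guide; the other names are illustrative and your notebook's actual cell may differ.

# Hypothetical sketch of the notebook's variables cell
bucket_name = "glue-snowflake-devday-lab-6546xxxxxxx"  # replace with the S3 bucket name from the CloudFormation Outputs
database_name = "iceberg_devday"                       # Glue Data Catalog database used later by the Snowflake catalog integration
iceberg_s3_path = f"s3://{bucket_name}/iceberg"        # base location for the Iceberg data and metadata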

This step will take a little time as Glue is provisioning resources to support execution of the notebook.

Next we will copy parquet data to your S3 bucket from a public bucket. Run the Configure S3 path and copy files code cell.
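Behind the scenes this copy can be done with boto3. The sketch below is illustrative only; the source bucket and prefix are hypothetical placeholders and the actual notebook cell may differ.

import boto3

s3 = boto3.client("s3")
source_bucket = "public-workshop-bucket"   # hypothetical public bucket holding the sample data
source_prefix = "quotes/"                  # hypothetical prefix holding the Parquet files

# Copy every Parquet object from the public bucket into your lab bucket
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=source_bucket, Prefix=source_prefix):
    for obj in page.get("Contents", []):
        s3.copy({"Bucket": source_bucket, "Key": obj["Key"]}, bucket_name, obj["Key"])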

Now we will create a Glue Data Catalog Database to be used for the lab by executing the Create Glue Database code cell.
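As a rough sketch, the database can be created through the boto3 Glue client (the notebook may equally use a Spark SQL statement). The name iceberg_devday matches the CATALOG_NAMESPACE used later by the Snowflake catalog integration.

import boto3

glue = boto3.client("glue")
try:
    glue.create_database(DatabaseInput={"Name": "iceberg_devday"})
except glue.exceptions.AlreadyExistsException:
    pass  # database already exists from an earlier run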

Run the Initialize Spark Session code cell to configure the Spark environment.
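The session needs the Iceberg Spark extensions and a catalog backed by the Glue Data Catalog. The sketch below shows a typical configuration; the catalog name glue_catalog and the exact options are assumptions, and your notebook may set them through Glue session magics instead.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.glue_catalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.glue_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
    .config("spark.sql.catalog.glue_catalog.warehouse", iceberg_s3_path)
    .getOrCreate()
)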

In the next 2 code cells we will simply verify that there are no tables in the Glue database we created.

We can now create a temporary view that is used to generate the DDL for the Iceberg table. Run the code cell to create the temporary view.

Then run the cell to create the Iceberg table in the database using this temporary view.
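A rough sketch of these two cells, assuming the Parquet files were copied under a quotes/ prefix and the Iceberg table is named quotes (the name the Snowflake catalog integration references later); the view and path names are illustrative.

# Read the copied Parquet files and expose them as a temporary view
quotes_df = spark.read.parquet(f"s3://{bucket_name}/quotes/")
quotes_df.createOrReplaceTempView("quotes_parquet_vw")

# Create an empty Iceberg table with the same schema as the Parquet data
spark.sql("""
    CREATE TABLE IF NOT EXISTS glue_catalog.iceberg_devday.quotes
    USING iceberg
    AS SELECT * FROM quotes_parquet_vw LIMIT 0
""")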

We can now verify that the Iceberg table was created by running the Show Tables cell.

showtables

We are now ready to add the parquet files to the Iceberg table. It is important to note that we are not rewriting the data, but just adding it in place to the Iceberg table.

Run the Execute add_files procedure code cell to use Glue to add the files and generate the Iceberg metadata and manifest files.
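add_files is a standard Iceberg Spark procedure that registers existing data files with a table without rewriting them. A hedged sketch of the cell, assuming the same catalog, database, table and path names as above:

spark.sql(f"""
    CALL glue_catalog.system.add_files(
        table => 'iceberg_devday.quotes',
        source_table => '`parquet`.`s3://{bucket_name}/quotes/`'
    )
""")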

Lastly we will verify that the Parquet files were added to the Iceberg table and that the metadata and manifest files were generated. Run the Check data files and Check new metadata files code cells.
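One way to verify this is to query the Iceberg metadata tables, as sketched below using the same names as above; your notebook's cells may instead list the S3 objects directly.

# Data files now tracked by the Iceberg table (still the original Parquet files)
spark.sql("SELECT file_path, record_count FROM glue_catalog.iceberg_devday.quotes.files").show(truncate=False)

# Snapshots created by the add_files operation
spark.sql("SELECT * FROM glue_catalog.iceberg_devday.quotes.snapshots").show(truncate=False)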

Run the Stop session cell to shut down the resources used in Glue.

As a bonus step you can open the browser tab with the S3 console and see the new folders and files where the data and metadata/manifest files are stored.

In this step we will configure the Snowflake account with internal tables to represent Customer and Policy data. We will also create Snowflake objects such as a database, warehouse and role that are used in the quickstart. Then we will configure the external volume to allow Snowflake to read and write on S3. We will also create an integration with the Glue Data Catalog to allow Snowflake to retrieve the Iceberg catalog information directly from the Glue Catalog.

First we will download 2 SQL files to your local system, which will be used to run the lab.

First is the Setup SQL script

Setup SQL

Next is the Workflow SQL

Workflow SQL

Open a browser tab and log into your Snowflake account as a user with ACCOUNTADMIN privileges.

On the left hand menu select Projects and then Worksheet in that section. Click on the Blue + button in the top right corner to create a new worksheet and select SQL Worksheet.

newsheet

The worksheet screen will appear with a new tab that has a timestamp on it. When you hover the mouse over the tab, 3 dots will appear. Click on them and select Import SQL from File.

sheetSQL

Now select the hol_ice_setup_vxx.sql file that you downloaded earlier, and click Open.

This script is used to create a Snowflake Role, Database, Warehouse and the tables used for the Quickstart. We will execute the script in sections, by highlighting a section and then pressing the blue Run button in the top corner.

run

Select the following SQL statements with the mouse cursor and then click Run.

USE ROLE SECURITYADMIN;

CREATE OR REPLACE ROLE HOL_ICE_RL COMMENT='Iceberg Role';
GRANT ROLE HOL_ICE_RL TO ROLE SYSADMIN;

Once completed repeat for the following SQL statements

USE ROLE ACCOUNTADMIN;

GRANT CREATE INTEGRATION ON ACCOUNT TO ROLE HOL_ICE_RL;
GRANT CREATE EXTERNAL VOLUME ON ACCOUNT TO ROLE HOL_ICE_RL;
GRANT CREATE DATABASE ON ACCOUNT TO ROLE HOL_ICE_RL;
GRANT CREATE WAREHOUSE ON ACCOUNT TO ROLE HOL_ICE_RL;

Then for the next section

USE ROLE HOL_ICE_RL;

CREATE OR REPLACE DATABASE HOL_ICE_DB;

CREATE OR REPLACE WAREHOUSE HOL_ICE_WH
  WITH WAREHOUSE_SIZE = 'XSMALL'
  INITIALLY_SUSPENDED = TRUE;

You can now select the rest of the SQL statements and execute them to create the tables and load data. These are lines 44 to 110 on the worksheet.

We are ready to go through the Snowflake workflow of the Quickstart.

On the top tabs next to the existing worksheet click on the + button to create a new worksheet. Again select SQL Worksheet

addsheet

Following the same steps as before, load the SQL script file into the worksheet. This time, however, select the hol_ice_workflow_vxx.sql file that you downloaded earlier, and click Open.

Run lines 16 - 20 to set the context for the worksheet.

USE ROLE HOL_ICE_RL;

USE HOL_ICE_DB.PUBLIC;

USE WAREHOUSE HOL_ICE_WH;

We will now create an External Volume and configure it to work with your Snowflake account.

First, in STORAGE_BASE_URL, replace the enter your S3 bucket name placeholder with your S3 bucket name, which you can copy from the CloudFormation Outputs browser tab.

Then, in STORAGE_AWS_ROLE_ARN, replace the enter your AWS account ID placeholder with your AWS account ID. You can copy it from the AWS console by clicking on the AWS account name in the top right corner and then the copy button next to the account ID.

accountID

Lastly, replace the your AWS Role that was created placeholder with the AWS Role name from the CloudFormation Outputs window.

Now execute the SQL statement, which will look something like this.

CREATE OR REPLACE EXTERNAL VOLUME HOL_ICE_EXT_VOL
   STORAGE_LOCATIONS =
      (
         (
            NAME = 'my-s3-ice-ext-vol'
            STORAGE_PROVIDER = 'S3'
            STORAGE_BASE_URL = 's3://glue-snowflake-devday-lab-6546xxxxxxx/iceberg/'
            STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::6546xxxxxxx:role/iceberg-devday-GluesnowflakedevdayLabRole-xxxxxxxxxxx'

         )
      );

We will now set up the trust relationship with your AWS account role.

Run the describe external volume command to get the Snowflake values for the object.

DESC EXTERNAL VOLUME HOL_ICE_EXT_VOL;

The storage locations output row will have a property value in JSON that looks like this

{"NAME":"my-s3-ice-ext-vol",
"STORAGE_PROVIDER":"S3",
"STORAGE_BASE_URL":"s3://glue-snowflake-devday-lab-6546xxxxxxxx/iceberg/",
"STORAGE_ALLOWED_LOCATIONS":["s3://glue-snowflake-devday-lab-6546xxxxxxxx/iceberg/*"],
"STORAGE_AWS_ROLE_ARN":"arn:aws:iam::65465xxxxxxx:role/iceberg-devday-GluesnowflakedevdayLabRole-crOqCT36mDB4",
"STORAGE_AWS_IAM_USER_ARN":"arn:aws:iam::90541xxxxxxxxxx:user/vvyk0000-s",
"STORAGE_AWS_EXTERNAL_ID":"YJB50193_SFCRole=2_f1IsD5b8/DAFxxxxxxxxxxxx",
"ENCRYPTION_TYPE":"NONE",
"ENCRYPTION_KMS_KEY_ID":""}

Go back to the CloudFormation browser tab and create a duplicate browser tab for the AWS Console. In the new tab, search for IAM and click on that option. This will open the IAM service.

In IAM, select Roles from the left menu and find the role created by the CloudFormation stack (the Role name is shown on the CloudFormation Outputs tab). Open the role, go to the Trust relationships tab, and click Edit trust policy.

Select everything in the JSON policy and replace it with the JSON below.

{
    "Version": "2012-10-17",
    "Statement": [
      {
        "Effect": "Allow",
        "Principal": {
          "AWS": "<snowflake storage arn>"
        },
        "Action": "sts:AssumeRole",
        "Condition": {
          "StringEquals": {
            "sts:ExternalId": "<snowflake external id ext volume>"
          }
        }
      },
      {
        "Effect": "Allow",
        "Principal": {
          "AWS": "<snowflake glue arn>"
        },
        "Action": "sts:AssumeRole",
        "Condition": {
          "StringEquals": {
            "sts:ExternalId": "<snowflake external id glue catalog>"
          }
        }
      }
    ]
  }

Take your time and be careful when pasting the values into the JSON policy in the next few steps.

Next we will create the integration with the Glue Data Catalog, which works in a very similar manner.

Go back to the Snowflake worksheet and look at lines 62 to 68. You will again have to replace some values with those of your AWS account.

It will look something like this

CREATE or REPLACE CATALOG INTEGRATION HOL_ICE_GLUE_CAT_INT
  CATALOG_SOURCE=GLUE
  CATALOG_NAMESPACE='iceberg_devday'
  TABLE_FORMAT=ICEBERG
  GLUE_AWS_ROLE_ARN='arn:aws:iam::6546xxxxxxxx:role/iceberg-devday-GluesnowflakedevdayLabRole-xxxxxxxxxxxx'
  GLUE_CATALOG_ID='6546xxxxxxxxx'
  ENABLED=TRUE; 

Now run the describe catalog integration command on line 70.

DESC CATALOG INTEGRATION HOL_ICE_GLUE_CAT_INT;

The results will contain the 2 property values needed to finalize the IAM Trust Policy in AWS: GLUE_AWS_IAM_USER_ARN and GLUE_AWS_EXTERNAL_ID.

In the IAM Trust Policy, update the following placeholders: replace <snowflake storage arn> with the STORAGE_AWS_IAM_USER_ARN value and <snowflake external id ext volume> with the STORAGE_AWS_EXTERNAL_ID value from the DESC EXTERNAL VOLUME output, and replace <snowflake glue arn> with the GLUE_AWS_IAM_USER_ARN value and <snowflake external id glue catalog> with the GLUE_AWS_EXTERNAL_ID value from the DESC CATALOG INTEGRATION output.

Your IAM Trust policy will look something like this

{
    "Version": "2012-10-17",
    "Statement": [
      {
        "Effect": "Allow",
        "Principal": {
          "AWS": "arn:aws:iam::90541xxxxxx:user/vvyk0000-s"
        },
        "Action": "sts:AssumeRole",
        "Condition": {
          "StringEquals": {
            "sts:ExternalId": "YJB50193_SFCRole=2_f1IsD5b8/DAFYPxxxxxxxxxxxx"
          }
        }
      },
      {
        "Effect": "Allow",
        "Principal": {
          "AWS": "arn:aws:iam::90541xxxxxxx:user/vvyk0000-s"
        },
        "Action": "sts:AssumeRole",
        "Condition": {
          "StringEquals": {
            "sts:ExternalId": "YJB50193_SFCRole=2_cTvotHxxxxxxxxxxxxxxxx"
          }
        }
      }
    ]
  }

Click the Update policy button and your trust relationship will be configured.

This completes the integrations with both S3 and the Glue Data Catalog.

In this step we will use the integrations with S3 and Glue to create an Iceberg table object referencing the Quotes Iceberg table. That table will be used for analysis with the internal Customer and Policy tables. We will then create an aggregate table written in Iceberg format on S3 and use Snowflake to query it.

We can now create an Iceberg table by referencing the Glue catalog. Simply execute the command below (lines 80 to 83).

CREATE OR REPLACE ICEBERG TABLE QUOTES_ICE
  EXTERNAL_VOLUME='HOL_ICE_EXT_VOL'
  CATALOG='HOL_ICE_GLUE_CAT_INT'
  CATALOG_TABLE_NAME='QUOTES';

Let's read the Quotes data in Iceberg format from S3

SELECT * FROM QUOTES_ICE LIMIT 10;

We can now do some analysis by combining the Iceberg table (Quotes data) with the internal Snowflake Customer table data.

SELECT C.FULLNAME, C.POSTCODE, C.CUSTID, C.IPID, C.PRODUCTNAME, C.QUOTECOUNT,
Q.POLICYNO, Q.QUOTEDATE, Q.QUOTE_PRODUCT, Q.ORIGINALPREMIUM, Q.TOTALPREMIUMPAYABLE 
FROM CUSTOMER C, QUOTES_ICE Q
WHERE C.FULLNAME = Q.FULLNAME
AND C.POSTCODE = Q.POSTCODE
AND C.QUOTECOUNT > 5
ORDER BY C.QUOTECOUNT DESC;

We can now also create an aggregate data set by combining the Quotes data in Iceberg with Customer and Policy data in internal Snowflake tables.

WITH CUSTQUOTE AS
(SELECT C.FULLNAME, C.POSTCODE, C.CUSTID, C.IPID, C.PRODUCTNAME, C.QUOTECOUNT,
Q.POLICYNO, Q.QUOTEDATE, Q.QUOTE_PRODUCT, Q.ORIGINALPREMIUM, Q.TOTALPREMIUMPAYABLE 
FROM CUSTOMER C, QUOTES_ICE Q
WHERE C.FULLNAME = Q.FULLNAME
AND C.POSTCODE = Q.POSTCODE
AND C.QUOTECOUNT > 5)
SELECT CQ.FULLNAME, CQ.POSTCODE, CQ.CUSTID, CQ.IPID, CQ.PRODUCTNAME,
CQ.QUOTECOUNT, CQ.POLICYNO, CQ.QUOTEDATE, CQ.QUOTE_PRODUCT,
CQ.ORIGINALPREMIUM, CQ.TOTALPREMIUMPAYABLE, 
P.CREATEDDATE, P.BRAND, P.BRANCHCODE, P.POLICY_STATUS_DESC,
P.TYPEOFCOVER_DESC, P.INSURER_NAME, P.INCEPTIONDATE, P.RENEWALDATE
FROM CUSTQUOTE CQ, POLICIES P
WHERE CQ.CUSTID = P.CUSTID;

This aggregate data can also be written in Iceberg format back to S3 for consumption by other services and engines.

CREATE OR REPLACE ICEBERG TABLE QUOTE_ANALYSIS_ICE  
  CATALOG='SNOWFLAKE'
  EXTERNAL_VOLUME='HOL_ICE_EXT_VOL'
  BASE_LOCATION='quoteanalysisiceberg'
  AS 
  WITH CUSTQUOTE AS
(SELECT C.FULLNAME, C.POSTCODE, C.CUSTID, C.IPID, C.PRODUCTNAME, C.QUOTECOUNT,
Q.POLICYNO, Q.QUOTEDATE, Q.QUOTE_PRODUCT, Q.ORIGINALPREMIUM, Q.TOTALPREMIUMPAYABLE 
FROM CUSTOMER C, QUOTES_ICE Q
WHERE C.FULLNAME = Q.FULLNAME
AND C.POSTCODE = Q.POSTCODE
AND C.QUOTECOUNT > 5)
SELECT CQ.FULLNAME, CQ.POSTCODE, CQ.CUSTID, CQ.IPID, CQ.PRODUCTNAME,
CQ.QUOTECOUNT, CQ.POLICYNO, CQ.QUOTEDATE, CQ.QUOTE_PRODUCT,
CQ.ORIGINALPREMIUM, CQ.TOTALPREMIUMPAYABLE, 
P.CREATEDDATE, P.BRAND, P.BRANCHCODE, P.POLICY_STATUS_DESC,
P.TYPEOFCOVER_DESC, P.INSURER_NAME, P.INCEPTIONDATE, P.RENEWALDATE
FROM CUSTQUOTE CQ, POLICIES P
WHERE CQ.CUSTID = P.CUSTID;

Lastly, we can use Snowflake to query the aggregate data in Iceberg. Let's see which customers with more than 5 quotes have a policy with a status of Renewed and a premium higher than 100.

SELECT DISTINCT(CUSTID), FULLNAME, POSTCODE,IPID, PRODUCTNAME, QUOTECOUNT,
POLICYNO, QUOTEDATE, QUOTE_PRODUCT, ORIGINALPREMIUM, TOTALPREMIUMPAYABLE,
CREATEDDATE, BRAND, BRANCHCODE, POLICY_STATUS_DESC, TYPEOFCOVER_DESC,
INSURER_NAME, INCEPTIONDATE, RENEWALDATE
FROM QUOTE_ANALYSIS_ICE
WHERE TOTALPREMIUMPAYABLE >100
AND POLICY_STATUS_DESC = 'Renewed' 
ORDER BY CREATEDDATE DESC;

You've successfully converted Parquet data to Iceberg format using AWS Glue, integrated Snowflake with S3 and the Glue Data Catalog, combined Iceberg table data with internal Snowflake data for analytics, written aggregate data in Iceberg format to S3, and finally used Snowflake to analyze the Iceberg data.

What You Learned

Related Resources