Snowflake's Snowpipe streaming capabilities are designed for rowsets with variable arrival frequency. It focuses on lower latency and cost for smaller data sets. This helps data workers stream rows into Snowflake without requiring files with a more attractive cost/latency profile.
Here are some of the use cases that can benefit from Snowpipe streaming:
In our demo, we will use real-time commercial flight data over the San Francisco Bay Area from the Opensky Network to illustrate the solution using Snowflake's Snowpipe streaming and Azure Event Hubs.
The architecture diagram below shows the deployment. An Azure Event Hub and an Azure Linux Virtual Machine (jumphost) will be provisioned in an Azure Virtual Network. The Linux jumphost will host the Kafka producer and Snowpipe streaming via Kafka Connect.
The Kafka producer calls the data sources' REST API and receives time-series data in JSON format. This data is then ingested into the Kafka cluster before being picked up by the Snowflake Connector for Kafka and delivered to a Snowflake table. The data in Snowflake table can be visualized in real-time with Azure Managed Grafana and Streamlit The historical data can also be analyzed by BI tools like Microsoft Power BI on Azure.
To participate in the virtual hands-on lab, attendees need the following resources.
ACCOUNTADMIN
accessFollow this Azure doc to create a resource group, say Streaming
.
Go to the newly created resource group, and click + Create
tab to create an event hub by following this Azure doc. Select the Standard
pricing tier to use Apache Kafka. Make sure that you select public access to the Event Hub in the networking setting.
See below sample screen capture for reference, here we have created a namespace called SnowflakeTest
.
In the same resource group, create a Linux(Red Hat enterprise) virtual machine by following this doc. Choose Redhat Enterprise
as the image. Note that this quickstart guide was written using scripts based on the RedHat syntax, optionally you can select Ubuntu or other Linux distributions but will need to modify the scripts.
Also make sure that you allow ssh access to the VM in the network rule setting.
Download and save the private key for use in the next step.
Once the VM is provisioned, we will then use it to run the Kafka connector with Snowpipe streaming SDK and the producer. We will be using the default VM user azureuser
for this workshop.
Here is a screen capture of the VM overview for reference.
From you local machine, either using a ssh application such as Putty if you have a Windows PC or simply the ssh CLI (ssh -i
for Linux or Mac based systems.
Create a key pair in the VM console by executing the following commands. You will be prompted to give an encryption password, remember this phrase, you will need it later.
cd $HOME
openssl genrsa 2048 | openssl pkcs8 -topk8 -inform PEM -out rsa_key.p8
See below example screenshot:
Next we will create a public key by running following commands. You will be prompted to type in the phrase you used in above step.
openssl rsa -in rsa_key.p8 -pubout -out rsa_key.pub
see below example screenshot:
Next we will print out the public key string in a correct format that we can use for Snowflake.
grep -v KEY rsa_key.pub | tr -d '\n' | awk '{print $1}' > pub.Key
cat pub.Key
see below example screenshot:
Run the following command to install the Kafka connector and Snowpipe streaming SDK
passwd=changeit # Use the default password for the Java keystore, you should change it after finishing the lab
directory=/home/azureuser/snowpipe-streaming # Installation directory
cd $HOME
mkdir -p $directory
cd $directory
pwd=`pwd`
sudo yum clean all
sudo yum -y install openssl vim-common java-1.8.0-openjdk-devel.x86_64 gzip tar jq python3-pip
wget https://archive.apache.org/dist/kafka/2.8.1/kafka_2.12-2.8.1.tgz
tar xvfz kafka_2.12-2.8.1.tgz -C $pwd
rm -rf $pwd/kafka_2.12-2.8.1.tgz
cd /tmp && cp /usr/lib/jvm/java-openjdk/jre/lib/security/cacerts kafka.client.truststore.jks
cd /tmp && keytool -genkey -keystore kafka.client.keystore.jks -validity 300 -storepass $passwd -keypass $passwd -dname "CN=snowflake.com" -alias snowflake -storetype pkcs12
#Snowflake kafka connector
wget https://repo1.maven.org/maven2/com/snowflake/snowflake-kafka-connector/2.2.1/snowflake-kafka-connector-2.2.1.jar -O $pwd/kafka_2.12-2.8.1/libs/snowflake-kafka-connector-2.2.1.jar
#Snowpipe streaming SDK
wget https://repo1.maven.org/maven2/net/snowflake/snowflake-ingest-sdk/2.1.0/snowflake-ingest-sdk-2.1.0.jar -O $pwd/kafka_2.12-2.8.1/libs/snowflake-ingest-sdk-2.1.0.jar
wget https://repo1.maven.org/maven2/net/snowflake/snowflake-jdbc/3.14.5/snowflake-jdbc-3.14.5.jar -O $pwd/kafka_2.12-2.8.1/libs/snowflake-jdbc-3.14.5.jar
wget https://repo1.maven.org/maven2/org/bouncycastle/bc-fips/1.0.1/bc-fips-1.0.1.jar -O $pwd/kafka_2.12-2.8.1/libs/bc-fips-1.0.1.jar
wget https://repo1.maven.org/maven2/org/bouncycastle/bcpkix-fips/1.0.3/bcpkix-fips-1.0.3.jar -O $pwd/kafka_2.12-2.8.1/libs/bcpkix-fips-1.0.3.jar
Note that the version numbers for Kafka, the Snowflake Kafka connector, and the Snowpipe Streaming SDK are dynamic, as new versions are continually published. We are using the version numbers that have been validated to work.
Go to the Event Hubs console, select the namespace you just created, then select Settings
and Shared access policies
on the left pane and click RootManageSharedAccessKey
policy. Then copy the Connection string-primary key
, see screenshot below.
See example screenshot below.
Now switch back to the VM console and execute the following command by replacing
with the copied values. DO NOT forget to include the double quotes.
export CS="<connection string>"
see the example screen capture below.
We also need to extract the kafka broker string(BS) from the connection string by running this command:
export BS=`echo $CS | awk -F\/ '{print $3":9093"}'`
Now run the following command to add CS
as an environment variable so it is recognized across the Linux sessions.
echo "export CS=\"$CS\"" >> ~/.bashrc
echo "export BS=$BS" >> ~/.bashrc
Run the following commands to generate a configuration file connect-standalone.properties
in directory /home/azureuser/snowpipe-streaming/scripts
for the client to authenticate with the Event hubs namespace.
dir=/home/azureuser/snowpipe-streaming/scripts
mkdir -p $dir && cd $dir
cat << EOF > $dir/connect-standalone.properties
#************CREATING SNOWFLAKE Connector****************
bootstrap.servers=$BS
#************SNOWFLAKE VALUE CONVERSION****************
key.converter=org.apache.kafka.connect.storage.StringConverter
value.converter=com.snowflake.kafka.connector.records.SnowflakeJsonConverter
key.converter.schemas.enable=true
value.converter.schemas.enable=true
#************SNOWFLAKE ****************
offset.storage.file.filename=/tmp/connect.offsets
# Flush much faster than normal, which is useful for testing/debugging
offset.flush.interval.ms=10000
# required EH Kafka security settings
security.protocol=SASL_SSL
sasl.mechanism=PLAIN
sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required username="\$ConnectionString" password="$CS";
consumer.security.protocol=SASL_SSL
consumer.sasl.mechanism=PLAIN
consumer.sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required username="\$ConnectionString" password="$CS";
plugin.path=/home/azureuser/snowpipe-streaming/kafka_2.12-2.8.1/libs
EOF
A configuration file connect-standalone.properties
is created in directory /home/azureuser/snowpipe-streaming/scripts
Run the following commands to create a security configuration file client.properties
.
dir=/home/azureuser/snowpipe-streaming/scripts
cat << EOF > $dir/client.properties
# required EH Kafka security settings
security.protocol=SASL_SSL
sasl.mechanism=PLAIN
sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required username="\$ConnectionString" password="$CS";
EOF
A configuration file client.properties
is created in directory /home/azureuser/snowpipe-streaming/scripts
Go to the Event Hubs namespace and clicke + Event Hub
.
Give the event hub a name, say streaming
, then click Next: Capture
.
Leave everything as default, and click Next: Review + create
.
Click Create
to create the event hub.
Scroll down to the list of event hubs, you will see the streaming
event hub there.
To describe the topic, run the following commands in the VM shell:
$HOME/snowpipe-streaming/kafka_2.12-2.8.1/bin/kafka-topics.sh --bootstrap-server $BS --command-config $HOME/snowpipe-streaming/scripts/client.properties --describe --topic streaming
See below example screenshot:
First login to your Snowflake account as a power user with ACCOUNTADMIN role. Then run the following SQL commands in a worksheet to create a user, database and the role that we will use in the lab.
-- Set default value for multiple variables
-- For purpose of this workshop, it is recommended to use these defaults during the exercise to avoid errors
-- You should change them after the workshop
SET PWD = 'Test1234567';
SET USER = 'STREAMING_USER';
SET DB = 'AZ_STREAMING_DB';
SET WH = 'AZ_STREAMING_WH';
SET ROLE = 'AZ_STREAMING_RL';
USE ROLE ACCOUNTADMIN;
-- CREATE USERS
CREATE USER IF NOT EXISTS IDENTIFIER($USER) PASSWORD=$PWD COMMENT='STREAMING USER';
-- CREATE ROLES
CREATE OR REPLACE ROLE IDENTIFIER($ROLE);
-- CREATE DATABASE AND WAREHOUSE
CREATE DATABASE IF NOT EXISTS IDENTIFIER($DB);
USE IDENTIFIER($DB);
CREATE OR REPLACE WAREHOUSE IDENTIFIER($WH) WITH WAREHOUSE_SIZE = 'SMALL';
-- GRANTS
GRANT CREATE WAREHOUSE ON ACCOUNT TO ROLE IDENTIFIER($ROLE);
GRANT ROLE IDENTIFIER($ROLE) TO USER IDENTIFIER($USER);
GRANT OWNERSHIP ON DATABASE IDENTIFIER($DB) TO ROLE IDENTIFIER($ROLE);
GRANT USAGE ON WAREHOUSE IDENTIFIER($WH) TO ROLE IDENTIFIER($ROLE);
-- SET DEFAULTS
ALTER USER IDENTIFIER($USER) SET DEFAULT_ROLE=$ROLE;
ALTER USER IDENTIFIER($USER) SET DEFAULT_WAREHOUSE=$WH;
-- RUN FOLLOWING COMMANDS TO FIND YOUR ACCOUNT IDENTIFIER, COPY IT DOWN FOR USE LATER
-- IT WILL BE SOMETHING LIKE <organization_name>-<account_name>
-- e.g. ykmxgak-wyb52636
WITH HOSTLIST AS
(SELECT * FROM TABLE(FLATTEN(INPUT => PARSE_JSON(SYSTEM$allowlist()))))
SELECT REPLACE(VALUE:host,'.snowflakecomputing.com','') AS ACCOUNT_IDENTIFIER
FROM HOSTLIST
WHERE VALUE:type = 'SNOWFLAKE_DEPLOYMENT_REGIONLESS';
Please write down the Account Identifier, we will need it later.
Next we need to configure the public key for the streaming user to access Snowflake programmatically.
First, in the Snowflake worksheet, replace < pubKey > with the content of the file /home/azureuser/pub.Key
(see step 5
by clicking on section #2 Create an Event Hub and a Linux virtual machine in Azure cloud
in the left pane) in the following SQL command and execute.
use role accountadmin;
alter user streaming_user set rsa_public_key='< pubKey >';
See below example screenshot:
Now logout of Snowflake, sign back in as the default user streaming_user
we just created with the associated password (default: Test1234567). Run the following SQL commands in a worksheet to create a schema (e.g. AZ_STREAMING_SCHEMA
) in the default database (e.g. AZ_STREAMING_DB
):
SET DB = 'AZ_STREAMING_DB';
SET SCHEMA = 'AZ_STREAMING_SCHEMA';
USE IDENTIFIER($DB);
CREATE OR REPLACE SCHEMA IDENTIFIER($SCHEMA);
This step is optional for this workshop but is highly recommended if you prefer to use the CLI to interact with Snowflake later instead of the web console.
SnowSQL is the command line client for connecting to Snowflake to execute SQL queries and perform all DDL and DML operations, including loading data into and unloading data out of database tables.
To install SnowSQL. Execute the following commands on the Linux Session Manager console:
curl https://sfc-repo.snowflakecomputing.com/snowsql/bootstrap/1.2/linux_x86_64/snowsql-1.2.24-linux_x86_64.bash -o /tmp/snowsql-1.2.24-linux_x86_64.bash
echo -e "~/bin \n y" > /tmp/ans
bash /tmp/snowsql-1.2.24-linux_x86_64.bash < /tmp/ans
See below example screenshot:
Next set the environment variable for Snowflake Private Key Phrase:
export SNOWSQL_PRIVATE_KEY_PASSPHRASE=<key phrase you set up when running openssl previously>
Note that you should add the command above in the ~/.bashrc file to preserve this environment variable across sessions.
echo "export SNOWSQL_PRIVATE_KEY_PASSPHRASE=$SNOWSQL_PRIVATE_KEY_PASSPHRASE" >> ~/.bashrc
Now you can execute this command to interact with Snowflake:
$HOME/bin/snowsql -a <Snowflake Account Identifier> -u streaming_user --private-key-path $HOME/rsa_key.p8 -d az_streaming_db -s az_streaming_schema
Type Ctrl-D
to get out of SnowSQL session.
You can edit the ~/.snowsql/config
file to set default parameters and eliminate the need to specify them every time you run snowsql.
At this point, the Snowflake setup is complete.
Run the following commands to collect various connection parameters for the Kafka connector.
cd $HOME
outf=/tmp/params
cat << EOF > /tmp/get_params
a=''
until [ ! -z \$a ]
do
read -p "Input Snowflake account identifier: e.g. ylmxgak-wyb53646 ==> " a
done
echo export clstr_url=\$a.snowflakecomputing.com > $outf
export clstr_url=\$a.snowflakecomputing.com
read -p "Snowflake cluster user name: default: streaming_user ==> " user
if [[ \$user == "" ]]
then
user="streaming_user"
fi
echo export user=\$user >> $outf
export user=\$user
pass=''
until [ ! -z \$pass ]
do
read -p "Private key passphrase ==> " pass
done
echo export key_pass=\$pass >> $outf
export key_pass=\$pass
read -p "Full path to your Snowflake private key file, default: /home/azureuser/rsa_key.p8 ==> " p8
if [[ \$p8 == "" ]]
then
p8="/home/azureuser/rsa_key.p8"
fi
priv_key=\`cat \$p8 | grep -v PRIVATE | tr -d '\n'\`
echo export priv_key=\$priv_key >> $outf
export priv_key=\$priv_key
cat $outf >> $HOME/.bashrc
EOF
. /tmp/get_params
See below example screen capture.
Run the following commands to generate a configuration file for the Kafka connector.
dir=/home/azureuser/snowpipe-streaming/scripts
cat << EOF > $dir/snowflakeconnectorAZ.properties
name=snowpipeStreaming
connector.class=com.snowflake.kafka.connector.SnowflakeSinkConnector
tasks.max=4
topics=streaming
snowflake.private.key.passphrase=$key_pass
snowflake.database.name=AZ_STREAMING_DB
snowflake.schema.name=AZ_STREAMING_SCHEMA
snowflake.topic2table.map=streaming:AZ_STREAMING_TBL
buffer.count.records=10000
buffer.flush.time=5
buffer.size.bytes=20000000
snowflake.url.name=$clstr_url
snowflake.user.name=$user
snowflake.private.key=$priv_key
snowflake.role.name=AZ_STREAMING_RL
snowflake.ingestion.method=snowpipe_streaming
snowflake.enable.schematization=false
value.converter.schemas.enable=false
jmx=true
key.converter=org.apache.kafka.connect.storage.StringConverter
valur.converter=com.snowflake.kafka.connector.records.SnowflakeJsonConverter
errors.tolerance=all
EOF
Finally, we are ready to start ingesting data into the Snowflake table.
Go back to the Linux console and execute the following commands to start the Kafka connector.
$HOME/snowpipe-streaming/kafka_2.12-2.8.1/bin/connect-standalone.sh $HOME/snowpipe-streaming/scripts/connect-standalone.properties $HOME/snowpipe-streaming/scripts/snowflakeconnectorAZ.properties
If everything goes well, you should see something similar to screen capture below:
Leave this screen open and let the connector continue to run.
Open up a new ssh session connection to the VM. In the shell, run the following command:
curl --connect-timeout 5 http://ecs-alb-1504531980.us-west-2.elb.amazonaws.com:8502/opensky | $HOME/snowpipe-streaming/kafka_2.12-2.8.1/bin/kafka-console-producer.sh --broker-list $BS --producer.config $HOME/snowpipe-streaming/scripts/client.properties --topic streaming
You should see response similar to screen capture below if everything works well.
Note that in the script above, the producer queries a Rest API that provides real-time flight data over the San Francisco Bay Area in JSON format. The data includes information such as timestamps, icao numbers, flight IDs, destination airport, longitude, latitude, and altitude of the aircraft, etc. The data is ingested into the streaming
topic on the event hub and then picked up by the Snowpipe streaming Kafka connector, which delivers it directly into a Snowflake table az_streaming_db.az_streaming_schema.az_streaming_tbl
.
Now, switch back to the Snowflake console and make sure that you signed in as the default user streaming_user
. The data should have been streamed into a table, ready for further processing.
To verify that data has been streamed into Snowflake, execute the following SQL commands.
use az_streaming_db;
use schema az_streaming_schema;
show channels in table az_streaming_tbl;
You should see that there are two channels, corresponding to the two partitions created earlier in the topic. .
Note that, unlike the screen capture above, at this point, you should only see one row in the table, as we have only ingested data once. We will see new rows being added later as we continue to ingest more data.
Now run the following query on the table.
select * from az_streaming_tbl;
You should see there are two columns in the table: RECORD_METADATA
and RECORD_CONTENT
as shown in the screen capture below.
The RECORD_CONTENT
column is an JSON array that needs to be flattened.
Now execute the following SQL commands to flatten the raw JSONs and create a materialized view with multiple columns based on the key names.
create or replace view flights_vw
as select
f.value:utc::timestamp_ntz ts_utc,
CONVERT_TIMEZONE('UTC','America/Los_Angeles',ts_utc::timestamp_ntz) as ts_pt,
f.value:alt::integer alt,
f.value:dest::string dest,
f.value:orig::string orig,
f.value:id::string id,
f.value:icao::string icao,
f.value:lat::float lat,
f.value:lon::float lon,
st_geohash(to_geography(ST_MAKEPOINT(lon, lat)),12) geohash,
year(ts_pt) yr,
month(ts_pt) mo,
day(ts_pt) dd,
hour(ts_pt) hr
FROM az_streaming_tbl,
Table(Flatten(az_streaming_tbl.record_content)) f;
The SQL commands create a view, convert timestamps to different time zones, and use Snowflake's Geohash function to generate geohashes that can be used in time-series visualization tools like Grafana
Let's query the view flights_vw
now.
select * from flights_vw;
As a result, you will see a nicely structured output with columns derived from the JSONs
We can now write a loop to stream the flight data continuously into Snowflake.
Go back to the Linux session and run the following script.
while true
do
curl --connect-timeout 5 -k http://ecs-alb-1504531980.us-west-2.elb.amazonaws.com:8502/opensky | $HOME/snowpipe-streaming/kafka_2.12-2.8.1/bin/kafka-console-producer.sh --broker-list $BS --producer.config $HOME/snowpipe-streaming/scripts/client.properties --topic streaming
sleep 10
done
You can now go back to the Snowflake worksheet to run a select count(1) from flights_vw
query every 10 seconds to verify that the row counts is indeed increasing.
When you are done with the demo, to tear down the Azure resources, follow this doc to dismantle the resource group and its associated resources.
For Snowflake cleanup, execute the following SQL commands.
USE ROLE ACCOUNTADMIN;
DROP DATABASE AZ_STREAMING_DB;
DROP WAREHOUSE AZ_STREAMING_WH;
DROP ROLE AZ_STREAMING_RL;
-- Drop the streaming user
DROP USER IF EXISTS STREAMING_USER;
In this lab, we built a demo to show how to ingest time-series data using Snowpipe streaming and Kafka with low latency. We demonstrated this using an Azure event hub and the Kafka connector for Snowpipe streaming hosted on a VM. You can also containerize the connector on the Azure Kubernetes Services (AKS) to leverage the benefits of scability and manageability.
Related Resources