Using the Snowpark API, you can query and manipulate data by writing code that uses objects (like a DataFrame) rather than SQL statements. Snowpark is designed to make building complex data pipelines easy, allowing you to interact with Snowflake directly without moving data. When you use the Snowpark API, the library uploads and runs your code in Snowflake so that you don't need to move the data to a separate system for processing.

What You'll Build

What You'll Learn

Prerequisites

You can also use a development tool or environment that supports SBT projects for Scala 2.12 (specifically, version 2.12.9 or a later 2.12.x release). Snowpark does not yet support Scala versions after 2.12 (for example, Scala 2.13).

Snowpark supports code compiled to run on Java 11.
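
If you set up your own SBT project instead of using the demo's build file, a minimal build.sbt looks roughly like the following sketch. The version numbers here are illustrative; check the demo repository's build.sbt and the Snowpark documentation for current releases.

scalaVersion := "2.12.15"  // must be 2.12.9 or a later 2.12.x release

// The Snowpark library for Scala (Maven coordinates com.snowflake:snowpark).
// Use a current 1.x release here.
libraryDependencies += "com.snowflake" % "snowpark" % "1.6.2"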

You'll find the demo in a Snowflake GitHub repository. After installing git, you can clone the repository using your terminal.

Clone the repository

  1. Open a terminal window and run the following commands to change to the directory where you want the repository cloned, and then clone the repository.
    cd {directory_where_you_want_the_repository}
    git clone https://github.com/Snowflake-Labs/sfguide-getting-started-snowpark-scala
    
  2. Change to the directory of the repository that you cloned:
    cd sfguide-snowpark-demo
    

The repository's demo directory contains the SBT build file and the example code used in this guide (HelloWorld.scala, UDFDemoSetup.scala, and UDFDemo.scala), along with the connection properties file described below.

The demo directory contains a snowflake_connection.properties file that the example code uses to create a session to connect to Snowflake. You'll need to edit these properties so the code connects to your Snowflake account.

Configure the connection settings

Edit this file and replace the <placeholder> values with the values that you use to connect to Snowflake. For example:

URL = https://myaccount.snowflakecomputing.com
USER = myusername
PRIVATE_KEY_FILE = /home/username/rsa_key.p8
PRIVATE_KEY_FILE_PWD = my_passphrase_for_my_encrypted_private_key_file
ROLE = my_role
WAREHOUSE = my_warehouse
DB = my_db
SCHEMA = my_schema

In this step, you'll confirm that you can connect to Snowflake with the demo code and your connection properties.

Confirm that you can connect to Snowflake

Using the SBT command-line tool, run the following command to build and run the HelloWorld.scala example to verify that you can connect to Snowflake:

sbt "runMain HelloWorld"

Code walkthrough

When the HelloWorld application runs successfully, take a look at the following walkthrough of its code and output.
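
The full source is in the repository, but in outline the example builds a Session from the properties file you edited and runs a simple query to prove the connection works. The sketch below shows that shape; treat it as an approximation of HelloWorld.scala rather than its exact contents.

import com.snowflake.snowpark._

object HelloWorld {
  def main(args: Array[String]): Unit = {
    // Build a session from the connection properties you configured earlier.
    val session = Session.builder.configFile("snowflake_connection.properties").create

    // Run a trivial query and print the result to confirm the connection.
    session.sql("select 'Hello, Snowflake!'").show()

    session.close()
  }
}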

Now that you have verified that you can connect to Snowflake, you need to get the data and libraries to make the UDF work.

In this step, you'll download the sample data file and libraries you need to run the user-defined function you're going to create. This demo uses the sentiment140 dataset and libraries from the CoreNLP project.

Download dependency JARs and sample data file

  1. Go to the sentiment140 page and click Download to download the ZIP archive containing the dataset.
  2. Unzip training.1600000.processed.noemoticon.csv from the archive.zip file that you downloaded.
  3. Download version 3.6.0 of the CoreNLP libraries.
  4. Unzip the libraries from the stanford-corenlp-full-2015-12-09.zip file that you downloaded.
  5. In your sfguide-snowpark-demo directory, create a temporary directory for the data and JAR files – for example, mkdir files_to_upload.
  6. Copy the following file extracted from archive.zip to your sfguide-snowpark-demo/files_to_upload/ directory:
    • training.1600000.processed.noemoticon.csv
  7. Copy the following files extracted from stanford-corenlp-full-2015-12-09.zip to your sfguide-snowpark-demo/files_to_upload/ directory.
    • stanford-corenlp-3.6.0.jar
    • stanford-corenlp-3.6.0-models.jar
    • slf4j-api.jar
    • ejml-0.23.jar

In sfguide-snowpark-demo/files_to_upload/, you should now see the following files:

$ pwd
<path>/sfguide-snowpark-demo

$ ls files_to_upload
ejml-0.23.jar					stanford-corenlp-3.6.0.jar
slf4j-api.jar					training.1600000.processed.noemoticon.csv
stanford-corenlp-3.6.0-models.jar

Next, run the UDFDemoSetup.scala example to create the stages for these files, and upload the files to the stages.

In this section, you'll run the UDFDemoSetup.scala example to create internal stages to hold the data file and libraries, then upload those files to the stages.

Because the user-defined function in the demo executes in Snowflake, you have to upload the JAR files for these libraries to an internal stage to make them available to Snowflake. You also need to upload the dataset to a stage, where the demo will access the data.

Upload dependency JARs and sample data file

Run the following command to run the code.

sbt "runMain UDFDemoSetup"

Code walkthrough

When the UDFDemoSetup application runs successfully, read the following to understand what it does.

After creating a session, the application code calls uploadDemoFiles twice: once to upload the sample data CSV file, and again to upload the JAR files that will be dependencies of the UDF you'll create. Each call uses the Snowpark library to create a stage and upload the specified files to it.
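
A simplified sketch of such a helper, using the Snowpark file API, is shown below. The parameter names and the compress flag are illustrative rather than the exact code in UDFDemoSetup.scala.

// Illustrative helper: create an internal stage and upload local files to it.
def uploadDemoFiles(session: Session, stageName: String, filePattern: String, compress: Boolean): Unit = {
  // Create (or replace) the internal stage that will hold the files.
  session.sql(s"create or replace stage $stageName").collect()

  // Upload the matching local files. JAR files must stay uncompressed to be
  // usable as UDF imports; the CSV data file can be gzip-compressed on upload.
  val options = Map("AUTO_COMPRESS" -> (if (compress) "TRUE" else "FALSE"))
  val results = session.file.put(s"files_to_upload/$filePattern", s"@$stageName", options)
  results.foreach(r => println(s"  ${r.sourceFileName}: ${r.status}"))
}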

Next, run the UDFDemo.scala example to create the user-defined function.

In this step, you'll run the UDFDemo.scala demo application to create and call a user-defined function (UDF). Read the topics that follow to take a closer look at the example and the output to see how the Snowpark library does this.

Create and call the UDF

Run the following command to run the code:

sbt "runMain UDFDemo"

This example does the following:

  • collectTweetData creates a DataFrame for reading the tweet data from the CSV file in the stage.
  • createUDF sets up the dependencies for a UDF that analyzes tweets for sentiment, then creates the UDF in Snowflake.
  • processHappyTweets uses the UDF to analyze tweet text and discover which tweets are happy.

See the topics that follow for a closer look at each of these methods.

The collectTweetData method creates a DataFrame to read CSV data from a file in a stage. It does this with a DataFrameReader object.

Code walkthrough
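
The repository has the full method; the sketch below shows its general shape, assuming the stage and file names used elsewhere in this guide and a session created as in HelloWorld.

import com.snowflake.snowpark.functions.col
import com.snowflake.snowpark.types.{StringType, StructField, StructType}

// Describe the columns of the sentiment140 CSV file.
val schema = StructType(Seq(
  StructField("target", StringType),
  StructField("ids", StringType),
  StructField("date", StringType),
  StructField("flag", StringType),
  StructField("user", StringType),
  StructField("text", StringType)))

// DataFrameReader: apply the schema, then read the CSV file from the stage.
val origData = session.read
  .schema(schema)
  .option("compression", "gzip")
  .csv("@snowpark_demo_data/training.1600000.processed.noemoticon.csv")

// Keep just the tweet text and a small sample of rows for the demo.
val tweetData = origData.select(col("text")).limit(100)
tweetData.show(10)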

Output

For collectTweetData, you'll see output such as the following.

=== Setting up the DataFrame for the data in the stage ===

[sbt-bg-threads-1]  INFO (Logging.scala:22) - Actively querying parameter snowpark_lazy_analysis from server.

=== Retrieving the data and printing the text of the first 10 tweets
[sbt-bg-threads-1]  INFO (Logging.scala:22) - Execute query [queryID: 01a255c9-0504-b313-0000-438301da600a]  CREATE  TEMPORARY  FILE  FORMAT  If  NOT  EXISTS "SNOWPARK_DEMO_DATA"."PUBLIC".SN_TEMP_OBJECT_1879375406 TYPE  = csv COMPRESSION  =  'gzip'
[sbt-bg-threads-1]  INFO (Logging.scala:22) - Execute query [queryID: 01a255c9-0504-b313-0000-438301da600e]  SELECT  *  FROM ( SELECT  *  FROM ( SELECT "TEXT" FROM ( SELECT $1::STRING AS "TARGET", $2::STRING AS "IDS", $3::STRING AS "DATE", $4::STRING AS "FLAG", $5::STRING AS "USER", $6::STRING AS "TEXT" FROM @snowpark_demo_data/training.1600000.processed.noemoticon.csv( FILE_FORMAT  => '"SNOWPARK_DEMO_DATA"."PUBLIC".SN_TEMP_OBJECT_1879375406'))) LIMIT 100) LIMIT 10
------------------------------------------------------
|"TEXT"                                              |
------------------------------------------------------
|"...

The createUDF method sets up dependencies for a UDF that analyzes tweets for sentiment, then it creates the UDF in Snowflake.

Code walkthrough
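
In outline, the method registers the staged JAR files as dependencies of the session and then registers a Scala function as the UDF. The sketch below assumes an analyze helper that scores a tweet's sentiment with the CoreNLP pipeline; the real code is in UDFDemo.scala.

import com.snowflake.snowpark.functions.udf

// Make the staged CoreNLP JARs available to the UDF when it runs in Snowflake.
session.addDependency("@snowpark_demo_udf_dependency_jars/stanford-corenlp-3.6.0.jar")
session.addDependency("@snowpark_demo_udf_dependency_jars/stanford-corenlp-3.6.0-models.jar")
session.addDependency("@snowpark_demo_udf_dependency_jars/slf4j-api.jar")
session.addDependency("@snowpark_demo_udf_dependency_jars/ejml-0.23.jar")

// Register the sentiment-scoring function as a UDF. analyze(text) is assumed
// to return an Int sentiment score computed with CoreNLP.
val sentimentFunc = udf((text: String) => analyze(text))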

Output

The processHappyTweets method uses the UDF to analyze tweet text to discover which tweets are happy.

Code walkthrough
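
A sketch of the idea, assuming the sentimentFunc UDF and tweetData DataFrame from the earlier sketches, and treating one particular score as "happy" (the exact threshold comes from the real code in UDFDemo.scala):

import com.snowflake.snowpark.SaveMode
import com.snowflake.snowpark.functions.col

// Score every tweet with the UDF, keep the ones scored as happy, and save
// them to the table that the stored procedure step queries later.
val analyzed = tweetData.withColumn("sentiment", sentimentFunc(col("text")))
val happyTweets = analyzed.filter(col("sentiment") === 3)
happyTweets.write.mode(SaveMode.Overwrite).saveAsTable("demo_happy_tweets")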

That's it for this part! So far, you've uploaded Scala code as a user-defined function, then run the UDF to analyze tweet data for happy tweets.

In the last step, you'll take the code you've got already and turn it into a stored procedure in Snowflake.

In this step, you'll take the code you've just run and create a stored procedure from it. To do that, you'll copy the Scala code into a Snowflake worksheet, wrap the code in an SQL statement, and run it to create the stored procedure.

The following steps use the new web interface.

For related documentation, be sure to read Writing Stored Procedures in Snowpark (Scala).

Create and run the stored procedure

  1. In the new web interface, create a new worksheet and call it discoverHappyTweets.
  2. In the worksheet editor, ensure that the session context matches the settings you specified in the snowflake_connection.properties file you edited earlier. For example:
    • From the dropdown in the upper left, select the role you specified.
    • From the session context dropdown in the upper-right, select the role and warehouse values you specified.
    • From the database dropdown, select the database and schema you specified.
  3. Into the discoverHappyTweets worksheet, paste the following:
    create or replace procedure discoverHappyTweets()
    returns string
    language scala
    runtime_version=2.12
    packages=('com.snowflake:snowpark:latest')
    imports=('@snowpark_demo_udf_dependency_jars/ejml-0.23.jar','@snowpark_demo_udf_dependency_jars/slf4j-api.jar','@snowpark_demo_udf_dependency_jars/stanford-corenlp-3.6.0-models.jar','@snowpark_demo_udf_dependency_jars/stanford-corenlp-3.6.0.jar')
    handler = 'UDFDemo.discoverHappyTweets'
    target_path = '@snowpark_demo_udf_dependency_jars/discoverHappyTweets.jar'
    as
    $$
    
    /*Paste the UDFDemo object Scala code, including imports. You can optionally
    omit the main function.*/
    
    $$;
    
    This code creates a stored procedure called discoverHappyTweets. The Snowflake documentation covers the full CREATE PROCEDURE syntax, but be sure to note the following:
    • The packages parameter specifies the Snowpark version to use.
    • The imports parameter specifies JAR files needed as your code's dependencies. This is the list of JAR files you uploaded to Snowflake earlier.
    • The handler parameter specifies the function Snowflake should call when executing the stored procedure.
    • The target_path parameter specifies the name and location of the JAR to create when you run this code. Snowflake compiles the code and packages the compiled classes.
  4. Into the section between the $$ delimiters, paste the Scala code from UDFDemo. Be sure to include the import statements, but omit the main function; you don't need it. Instead of the Session you created there, Snowflake passes its own Session object as the first argument to the method you specified with the handler parameter when the procedure is called (a sketch of the expected handler shape follows these steps).
  5. Run the code in the worksheet. Snowflake compiles the code in the body of CREATE PROCEDURE and packages it into a JAR file.
  6. Create a new worksheet and name it call discoverHappyTweets.
  7. Into the new worksheet, paste the following code:
    call discoverHappyTweets();
    select * from demo_happy_tweets;
    
  8. In the worksheet, select both lines, then run the code in the worksheet. Beneath the worksheet, you should see messages as Snowflake calls the stored procedure. When the procedure call completes, these are replaced by the results of the query: the list of happy tweets. Calling the stored procedure simply executes the method you specified as the handler, with the JAR file Snowflake built on the classpath.
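
For reference, the handler named in CREATE PROCEDURE needs roughly the following shape: Snowflake passes the injected Session as the first argument, and the String you return becomes the procedure's result. This is only a sketch of the signature; the body comes from the UDFDemo code you pasted between the $$ delimiters.

object UDFDemo {
  // Stored procedure handler: Snowflake supplies the Session and returns
  // this method's String result to the caller.
  def discoverHappyTweets(session: Session): String = {
    // ... create the UDF and analyze the tweets using the injected session ...
    "Complete"
  }
}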

That's how easy it is to create a stored procedure. For bonus points, you could call the new stored procedure from a nightly task:

create or replace task process_tweets_nightly
warehouse = 'small'
schedule = '1440 minute'
as
call discoverHappyTweets();

Congratulations! You used Snowpark to perform sentiment analysis on tweets. You used a sample dataset of tweets for this guide. If you want to automatically ingest new tweets as they are written, follow the Auto Ingest Twitter Data into Snowflake guide.

What You Covered

Related Resources