This Snowflake Quickstart covers the basics of training machine learning models, interpreting them, and deploying them to make predictions.

With Dataiku's Visual SnowparkML plugin, you won't need to write a single line of code. That's right!

Use Case

Consumer lending is difficult. What factors about an individual and their loan application could indicate whether they're likely to pay back the loan? How can our bank optimize the loans approved and rejected based on our risk tolerance? We'll use machine learning to help with this decision making process. Our model will learn patterns between historical loan applications and default, then we can use it to make predictions for a fresh batch of applications.

What You'll Learn

The exercises in this lab will walk you through the steps to:

What You'll Need

If you haven't already, register for a Snowflake free 30-day trial. The rest of this lab assumes you are using a new Snowflake account created by registering for the trial.

After activation, you will create a username and password. Write down these credentials, and bookmark the URL for easy future access.

Log Into the Snowflake User Interface (UI)

Open a browser window and enter the URL of your Snowflake 30-day trial environment. You should see the login screen below. Enter your unique credentials to log in.

img

You may see "welcome" and "helper" boxes in the UI when you log in for the first time. Close them by clicking Skip for now in the bottom right corner, as shown in the screenshot below.

img

Create Dataiku trial via Partner Connect

Confirm that your current role is ACCOUNTADMIN by clicking on your profile at the top right of the page.

  1. Click on Data Products on the left-hand menu
  2. Click on Partner Connect
  3. Search for Dataiku
  4. Click on the Dataiku tile

img

This will automatically create the connection parameters required for Dataiku to connect to Snowflake. Snowflake will create a dedicated database, warehouse, system user, system password, and system role, intended for use by the Dataiku account.

For this lab we'd like to use the PC_DATAIKU_USER to connect from Dataiku to Snowflake, and use the PC_DATAIKU_WH when performing activities within Dataiku that are pushed down into Snowflake.

This shows that a data science team working in Dataiku, and by extension in Snowflake, can work completely independently from the data engineering team that loads data into Snowflake, because they use different roles and warehouses.
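You won't need to write any of this yourself: Dataiku stores these parameters in its Snowflake connection. Purely for illustration, a connection built on the Partner Connect objects might look like the following Snowpark for Python sketch; the account identifier and password are placeholders, not values from your account.

```python
# Illustrative only - Dataiku's Snowflake connection handles this for you.
# The account identifier and password below are placeholders.
from snowflake.snowpark import Session

connection_parameters = {
    "account": "<your_account_identifier>",   # placeholder
    "user": "PC_DATAIKU_USER",                # system user created by Partner Connect
    "password": "<system_user_password>",     # placeholder
    "role": "PC_DATAIKU_ROLE",                # system role created by Partner Connect
    "warehouse": "PC_DATAIKU_WH",             # warehouse for work pushed down from Dataiku
    "database": "PC_DATAIKU_DB",              # database created by Partner Connect
}

session = Session.builder.configs(connection_parameters).create()
print(session.sql("SELECT CURRENT_ROLE(), CURRENT_WAREHOUSE()").collect())
```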

img

  1. Click Connect
  2. You will get a pop-up telling you that your partner account has been created. Click on Activate

This opens a new tab that redirects you to Dataiku's launch page.

Here, you will have two options:

  1. Login with an existing Dataiku username
  2. Sign up for a new Dataiku account

Make sure you use the same email address that you used for your Snowflake trial.

img

When using your email address, ensure your password fits the following criteria:

  1. At least 8 characters in length
  2. Should contain lower case letters (a-z), upper case letters (A-Z), and numbers (0-9)

You should receive an email from Dataiku at the address you signed up with. Activate your Dataiku account via the link in that email.

Review Dataiku Setup

After clicking the activation link, briefly review the Dataiku Cloud Terms of Service, scrolling down to the bottom of the page to do so.

img

img

This is the Cloud administration console, where you can perform tasks such as inviting other users to collaborate, adding plugin extensions, installing industry solutions to accelerate projects, and accessing community and Academy resources to support your learning journey.

Add the Visual SnowparkML Plugin

Covering plugins in depth is beyond the scope of this course, but for this lab we need to enable the Visual SnowparkML plugin, so let's do that now.

  1. Click on Plugins on the left menu
  2. Select + ADD A PLUGIN
  3. Find Visual SnowparkML
  4. Check Install on my Dataiku instance, and click INSTALL

img

img

  1. Click on Code Envs on the left menu
  2. Select ADD A CODE ENVIRONMENT
  3. Select NEW PYTHON ENV
  4. Name your code env py_39_snowpark. NOTE: the name must match exactly
  5. Click CREATE

img

img

img

  1. Select Pandas 1.3 (Python 3.7 and above) from the Core Packages menu
  2. Add the following packages:
scikit-learn==1.3.2
mlflow==2.9.2
statsmodels==0.12.2
protobuf==3.16.0
xgboost==1.7.3
lightgbm==3.3.5
matplotlib==3.7.1
scipy==1.10.1
snowflake-snowpark-python==1.14.0
snowflake-snowpark-python[pandas]==1.14.0
snowflake-connector-python[pandas]==3.7.0
MarkupSafe==2.0.1
cloudpickle==2.0.0
flask==1.0.4
Jinja2==2.11.3
snowflake-ml-python==1.5.0
  1. Select rebuild env from the menu on the left
  2. Click Save and update
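Once the environment finishes building, you can optionally confirm that the pinned packages resolved as expected by running a quick check in a Python notebook that uses the py_39_snowpark code env. This is a sketch, not a required lab step.

```python
# Optional sanity check: print the installed versions of the key pinned packages.
from importlib.metadata import version

for pkg in [
    "scikit-learn", "xgboost", "lightgbm", "mlflow",
    "snowflake-snowpark-python", "snowflake-ml-python",
]:
    print(pkg, version(pkg))
```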

img

You've now successfully set up your Dataiku trial account via Snowflake's Partner Connect. We are now ready to continue with the lab. For this, move back to your Snowflake browser.

Return to the Snowflake UI

We will now create a Snowpark-optimized warehouse.

  1. Click Admin from the bottom of the left hand menu
  2. Then Warehouses
  3. Then click + Warehouse in the top right corner

img

Once in the New Warehouse creation screen, perform the following steps (an equivalent SQL sketch follows the list):

  1. Create a new warehouse called SNOWPARK_WAREHOUSE
  2. For the type select Snowpark-optimized
  3. Select Medium as the size
  4. Lastly click Create Warehouse
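Assuming a default Snowflake connection configured with the ACCOUNTADMIN role, the same warehouse could be created with SQL issued through Snowpark for Python, as in this minimal sketch:

```python
# Sketch: create the Snowpark-optimized warehouse with SQL instead of the UI.
from snowflake.snowpark import Session

session = Session.builder.getOrCreate()  # assumes a default connection using ACCOUNTADMIN
session.sql("""
    CREATE WAREHOUSE IF NOT EXISTS SNOWPARK_WAREHOUSE
      WAREHOUSE_SIZE = 'MEDIUM'
      WAREHOUSE_TYPE = 'SNOWPARK-OPTIMIZED'
""").collect()
```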

img



img

We need to grant the Dataiku role that Partner Connect created earlier access to this new warehouse.

img

  1. For the Role select the role PC_DATAIKU_ROLE
  2. Under Privileges grant the USAGE privilege
  3. Click on Grant Privileges (an equivalent SQL sketch follows below)
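If you prefer SQL, the same grant can be issued through Snowpark for Python; this assumes a default connection configured with an admin role:

```python
# Sketch: grant the Partner Connect role usage on the new warehouse with SQL.
from snowflake.snowpark import Session

session = Session.builder.getOrCreate()  # assumes a default connection using ACCOUNTADMIN
session.sql(
    "GRANT USAGE ON WAREHOUSE SNOWPARK_WAREHOUSE TO ROLE PC_DATAIKU_ROLE"
).collect()
```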

img

You should now see that the new privileges have been applied.

img

Return to the Dataiku trial launchpad in your browser

  1. Ensure you are on the Overview page
  2. Click on OPEN INSTANCE to get started.

img

Congratulations, you are now using the Dataiku platform! For the remainder of this lab we will be working from this environment, called the design node. It's the pre-production environment where teams collaborate to build data products.

Now let's import our first project.

Starter Project

Once you have downloaded the starter project, we can create our first project.

  1. Click + NEW PROJECT
  2. Then Import project

img

img



You should see a project with 4 datasets: two local CSVs, which we've then imported into Snowflake.

img

Now that we have all our setup done, let's start working with our data.

Before we begin analyzing the data in our new project, let's take a minute to understand some of the concepts and terminology of a project in Dataiku.

Here is the project we are going to build along with some annotations to help you understand some key concepts in Dataiku.

img

img



  1. Click the Statistics tab on the top
  2. Next click + Create first worksheet

img



img



img

Question: What trends do you notice in the data?

Look at the correlation matrix, and the DEFAULTED row. Notice that INTEREST_RATE has the highest correlation with DEFAULTED. We should definitely include this feature in our models!

img

Positive correlation means as INTEREST_RATE rises, so does DEFAULTED (higher interest rate -> higher probability of default). Notice MONTHLY_INCOME has a negative correlation to DEBT_TO_INCOME_RATIO. This means that as monthly income goes up, applicants' debt to income ratio generally goes down.

See if you can identify a few other features we should include in our models.
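If you'd like to double-check the worksheet's numbers yourself, a rough pandas equivalent of the correlation matrix looks like this; the file name is illustrative, and DEFAULTED is the target column from this lab's dataset.

```python
import pandas as pd

# Load the known loan applications; the file name here is illustrative.
loans = pd.read_csv("loan_requests_known.csv")

# Pearson correlation between the numeric columns, mirroring the Statistics worksheet.
corr = loans.corr()

# Features most strongly correlated (positively or negatively) with the target.
print(corr["DEFAULTED"].sort_values(key=abs, ascending=False))
```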

Create a new Visual SnowparkML recipe

Now we will train an ML model using our plugin. Return to the Flow either by clicking on the Flow icon or by using the keyboard shortcut (g+f).

  1. From the Flow click once on the LOAN_REQUESTS_KNOWN_SF dataset.
  2. From the Actions menu on the right scroll down and select the Visual Snowpark ML plugin

img



img

We now need to set our three Outputs.

img

  1. Set the name to train
  2. Select PC_DATAIKU_DB to store into
  3. Click CREATE DATASET

img

We will now repeat this process for the other two outputs

  1. Set the name to test
  2. Select PC_DATAIKU_DB to store into
  3. Click CREATE DATASET

img

  1. Set the name to models
  2. Select dataiku-managed-storage to store into
  3. Click CREATE FOLDER

img

img

Define model training settings

Let's fill out the parameters for our training session.

  1. Give your model a name
  2. Choose DEFAULTED as the target column
  3. Select Two-class classification as the prediction type
  4. Choose ROC AUC as our model metric. This is a common machine learning metric for classification problems.

Leave the Train ratio and random seed as is. This will split our input dataset into 80% of records for training, leaving 20% for an unbiased evaluation of the model.
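ROC AUC measures how well the model ranks defaulters above non-defaulters across every possible threshold: 1.0 is perfect ranking, 0.5 is random guessing. A tiny illustration with scikit-learn and made-up values:

```python
from sklearn.metrics import roc_auc_score

# Made-up example: true outcomes (1 = defaulted) and predicted default probabilities
# for five loan applications.
y_true = [0, 0, 1, 1, 0]
y_prob = [0.10, 0.35, 0.80, 0.65, 0.20]

# 1.0 here, because every defaulter is ranked above every non-defaulter.
print(roc_auc_score(y_true, y_prob))
```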

img

img

img



img

  1. Leave the Search space limit as 4
  2. Enter SNOWPARK_WAREHOUSE to use the Snowpark-optimized warehouse we created earlier.
  3. Check the "Deploy to Snowflake ML Model Registry" box. This will deploy our best trained model to Snowflake's Model Registry, where we can use it to make predictions later on (a sketch of what this looks like in code follows the list).
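The plugin handles the deployment for you. Roughly, logging a model to the Snowflake Model Registry with snowflake-ml-python looks like the sketch below; the model, names, and connection are all illustrative, not what the plugin produces verbatim.

```python
# Sketch of a Model Registry deployment; everything here is illustrative.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from snowflake.snowpark import Session
from snowflake.ml.registry import Registry

# Tiny stand-in for the best model found during training.
X = pd.DataFrame({"FICO": [620, 700, 540, 760], "INTEREST_RATE": [14.0, 9.5, 18.2, 7.1]})
y = [1, 0, 1, 0]
best_model = LogisticRegression().fit(X, y)

session = Session.builder.getOrCreate()  # assumes a default Snowflake connection
registry = Registry(session=session, database_name="PC_DATAIKU_DB", schema_name="PUBLIC")

registry.log_model(
    best_model,
    model_name="LOAN_DEFAULT_MODEL",  # illustrative name
    version_name="V1",
    sample_input_data=X,              # used to infer the model signature
)
```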

img

While we're waiting for our models to train, let's learn a bit about machine learning. This is an oversimplification of some complicated topics. If you're interested there are links at the end of the course for the Dataiku Academy and many other free resources online.

Machine Learning, Classification, and Regression

Machine learning - the use and development of computer systems that are able to learn and adapt without following explicit instructions, by using algorithms and statistical models to analyze and draw inferences from patterns in data.

Oversimplified definition of machine learning - Fancy pattern matching based on the data you feed into it

The two most common types of machine learning solutions are supervised and unsupervised learning.
Supervised learning

Goal: predict a target variable

Examples:

Unsupervised learning

Goal: identify patterns

Examples:

We need a structured dataset to train a model, in particular:

img

Train / Test split

Once we have a structured dataset with observations, a target, and features, we split it into train and test sets

We could split it:

A random split of 80% train / 20% test is common

img

We train models on the train set, then evaluate them on the test set. This way, we can simulate how the model will perform on data that it hasn't seen before.
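In code, this split is a one-liner with scikit-learn. A sketch using this lab's target column (the file name is illustrative):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

loans = pd.read_csv("loan_requests_known.csv")  # illustrative file name

# 80% of rows for training, 20% held out for evaluation; stratifying keeps the
# share of defaults similar in both sets.
train_df, test_df = train_test_split(
    loans,
    test_size=0.2,
    random_state=42,               # the "random seed" from the recipe settings
    stratify=loans["DEFAULTED"],
)
print(len(train_df), len(test_df))
```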

Feature Handling

Two keys to choosing features to include in our model:

Most machine learning models require variables to be in a specific format to be able to find patterns in the data.

We can generally break up our variables into two categories:

Here are some ways to transform these types of features:

Numeric - e.g. AMOUNT_REQUESTED, DEBT_TO_INCOME_RATIO

Things you typically want to consider:

img


Categorical - e.g. LOAN_PURPOSE, STATE

Things you typically want to consider:
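The plugin's feature handling options map closely to standard scikit-learn preprocessing. Here is a hedged sketch of typical handling for the two feature types above, using column names from this lab's dataset:

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_features = ["AMOUNT_REQUESTED", "DEBT_TO_INCOME_RATIO"]
categorical_features = ["LOAN_PURPOSE", "STATE"]

# Numeric: fill missing values, then rescale so features are on comparable scales.
numeric_transformer = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Categorical: fill missing values, then one-hot encode into 0/1 columns.
categorical_transformer = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

preprocessor = ColumnTransformer([
    ("num", numeric_transformer, numeric_features),
    ("cat", categorical_transformer, categorical_features),
])
```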

Machine Learning Algorithms

Let's go through a few common machine learning algorithms.

Linear Regression

For linear regression (predicting a number), we find the line of best fit through a plot of our feature variables and our target.

y = b0 + b1 * x

If we were training a model to predict exam scores based on # hours of study, we would solve for this equation

exam_score = b0 + b1 * (num_hours_study)

img

We use math (specifically a technique called Ordinary Least Squares[1]) to find the b0 and b1 of our best-fit line.

exam_score = b0 + b1 * (num_hours_study)

exam_score = 32 + 8 * (num_hours_study)
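As a minimal sketch of that fit, here is made-up data where the true relationship is exam_score = 32 + 8 * num_hours_study plus some noise:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
num_hours_study = rng.uniform(0, 8, size=100).reshape(-1, 1)
exam_score = 32 + 8 * num_hours_study.ravel() + rng.normal(0, 3, size=100)

# LinearRegression solves the ordinary least squares problem for b0 and b1.
model = LinearRegression().fit(num_hours_study, exam_score)
print(model.intercept_, model.coef_[0])  # roughly 32 and 8
```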

img

Logistic Regression

Logistic regression is similar to linear regression - except built for a classification problem (e.g. loan default prediction).

log(p/1-p) = b0 + b1 * (num_hours_study)

log(p/1-p) = 32 + 8 * (num_hours_study)

p = probability of exam success
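A similar sketch for logistic regression, again with made-up data; the model outputs a probability of passing rather than a score:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
num_hours_study = rng.uniform(0, 10, size=200).reshape(-1, 1)

# Simulate pass/fail outcomes whose probability rises with hours of study.
p_pass = 1 / (1 + np.exp(-(num_hours_study.ravel() - 5)))
passed = (rng.uniform(size=200) < p_pass).astype(int)

model = LogisticRegression().fit(num_hours_study, passed)
print(model.predict_proba([[6.0]])[0, 1])  # predicted probability of passing after 6 hours
```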

img

Decision Trees

Imagine our exam pass/fail model with more variables.

Decision trees will smartly create if / then statements, sending each row along a branch until it makes a prediction of your target variable

img

Random Forest

A Random Forest model trains many decision trees, introduces randomness into each one so they behave differently, and then averages their predictions for a final prediction.
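A quick sketch with synthetic data standing in for loan applications:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic two-class dataset standing in for loan applications.
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# 200 decision trees, each trained on a bootstrap sample with a random subset of
# features, whose predictions are averaged for the final prediction.
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
print(forest.score(X_test, y_test))  # accuracy on the held-out test set
```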

img

Overfitting

We want our ML model to be able to understand true patterns in the data - uncover the signal, and ignore the noise (random, unexplained variation in the data)

Overfitting is an undesirable behavior that occurs when a machine learning model gives accurate predictions for training data but not for new data.
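You can see this for yourself with a small sketch: an unconstrained decision tree memorizes noisy training data (near-perfect train accuracy, noticeably worse test accuracy), while a shallower tree keeps the two closer together:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with deliberately noisy labels (flip_y) so overfitting is visible.
X, y = make_classification(n_samples=500, n_features=20, flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

for depth in [None, 3]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
    print(depth, tree.score(X_train, y_train), tree.score(X_test, y_test))
```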

img
img

How to control for overfitting

Logistic Regression

Example

C = 0.01: log(p/1-p) = 32 + 8 * (num_hours_study) + 6 * (num_hours_sleep)

C = 0.1: log(p/1-p) = 32 + 5 * (num_hours_study) + 4 * (num_hours_sleep)

C = 1: log(p/1-p) = 32 + 3 * (num_hours_study) + 2 * (num_hours_sleep)

C = 10: log(p/1-p) = 32 + 2 * (num_hours_study) + 0 * (num_hours_sleep)

C = 100: log(p/1-p) = 32 + 1 * (num_hours_study) + 0 * (num_hours_sleep)

Random Forest

For more in-depth tutorials and self-paced machine learning courses, see the links to Dataiku's freely available Academy in the last chapter of this course.

Once we've trained our models, we'll want to take a deeper dive into how they're performing, what features they're considering, and whether they may be biased. Dataiku has a number of tools for evaluating models.

img

img


  1. Select Feature Importance from the menu on the left side
  2. Then click COMPUTE NOW

img



Here we can see that the top 3 features impacting the model are applicants' FICO scores, the interest rate of the loan, and the amount requested. This makes sense!

img



Scroll down on this page and you'll see the directional effect of each feature on default predictions. Notice that higher FICO scores generally mean a lower probability of default.

img



Here we can see how the model would have performed on the hold-out test set of loan applicants. Notice that my model is very good at catching defaulters (83 Predicted 1.0 out of 84 Actually 1.0), at the expense of mistakenly rejecting 124 applicants that would have paid back their loan.

Try moving the threshold bar back and forth. It will cause the model to be more or less sensitive. Based on your business problem, you may want a higher or lower threshold.
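As a toy illustration of that trade-off, here is how a confusion matrix shifts as the threshold moves, using made-up probabilities:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up true outcomes (1 = defaulted) and predicted default probabilities.
y_true = np.array([0, 0, 1, 0, 1, 0, 1, 0, 0, 1])
y_prob = np.array([0.05, 0.20, 0.85, 0.40, 0.55, 0.10, 0.70, 0.65, 0.30, 0.45])

for threshold in [0.3, 0.5, 0.7]:
    y_pred = (y_prob >= threshold).astype(int)
    # Rows are actual 0/1, columns are predicted 0/1.
    print(threshold)
    print(confusion_matrix(y_true, y_pred))
```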

img

Using a machine learning model to make predictions is called scoring or inference.
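Because the plugin deployed our best model to the Snowflake Model Registry, scoring can also run entirely inside Snowflake. Here is a rough sketch with snowflake-ml-python; the model and table names are illustrative and a default connection is assumed.

```python
# Sketch of scoring against the Model Registry; names are illustrative.
from snowflake.snowpark import Session
from snowflake.ml.registry import Registry

session = Session.builder.getOrCreate()  # assumes a default Snowflake connection
registry = Registry(session=session, database_name="PC_DATAIKU_DB", schema_name="PUBLIC")

model = registry.get_model("LOAN_DEFAULT_MODEL").default  # the deployed model version
unknown_loans = session.table("LOAN_REQUESTS_UNKNOWN")    # table name is illustrative

# Run inference inside Snowflake; the result is a Snowpark DataFrame with predictions.
scored = model.run(unknown_loans, function_name="predict_proba")
scored.show()
```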

Score the unknown loan applications using the trained model

  1. Go to the project Flow, click once on the LOAN_REQUESTS_UNKNOWN_SF dataset
  2. Then click on the Visual Snowpark ML plugin from the right hand Actions menu.

img



img



We need to add our model as an input and set an output dataset for the results of the scoring.

  1. In the Inputs under the Saved Model option click on SET to add your saved model
  2. In the Outputs section under Scored Dataset Option click on SET and give your output dataset a name
  3. For Store into use the PC_DATAIKU_DB connection
  4. Click on CREATE DATASET

img

img



img



When it finishes, your flow should look like this:

img

img

Let's say we want to automatically run new loan applications through our model every week on Sunday night.

Assume that LOAN_REQUESTS_UNKNOWN is a live dataset of new loan applications that is updated throughout the week.

We want to rerun all the recipes leading up to unknown_loans_scored, where our model makes predictions.

Build a weekly scoring scenario

img

img

img



img



img



img



img



img



img



You'll be able to see scenario run details in the "Last runs" tab.

img

Build a monthly model retraining scenario (optional)

It's good practice to retrain machine learning models on a regular basis with more up-to-date data. The world changes around us; the patterns of loan applicant attributes affecting default probability are likely to change too.

If you have time, assume that LOAN_REQUESTS_KNOWN is a live dataset of historical loan applications that is updated with new loan payback and default data on an ongoing basis.

With scenarios, you can automatically retrain your model every month, add an AUC check to make sure the model still performs well, and then rebuild the scored dataset.

Congratulations on completing this introductory lab exercise! You've mastered the Snowflake basics and taken your first steps toward a no-code approach to training machine learning models with Dataiku.

You have seen how Dataiku's deep integrations with Snowflake allow teams with different skill sets to get the most out of their data at every stage of the machine learning lifecycle.

We encourage you to continue with your free trial, refining your models using some of the more advanced capabilities not covered in this lab.

What You Learned:

Related Resources