This Snowflake Quickstart covers the basics of training machine learning models, interpreting them, and deploying them to make predictions.
With Dataiku's Visual Snowpark ML plugin, you won't need to write a single line of code. That's right!
Consumer lending is difficult. What factors about an individual and their loan application could indicate whether they're likely to pay back the loan? How can our bank optimize the loans approved and rejected based on our risk tolerance? We'll use machine learning to help with this decision making process. Our model will learn patterns between historical loan applications and default, then we can use it to make predictions for a fresh batch of applications.
The exercises in this lab will walk you through the steps to:
(Optional) Set up an MLOps process to retrain the model, check for accuracy, and make new predictions on a weekly basis

If you haven't already, register for a Snowflake free 30-day trial. The rest of the sections in this lab assume you are using a new Snowflake account created by registering for a trial.
Select AWS as the cloud provider for this lab, and choose the Enterprise edition so you can leverage some advanced capabilities that are not available in the Standard edition. After activation, you will create a username and password. Write down these credentials, and bookmark this URL for easy future access.
Open a browser window and enter the URL of your Snowflake 30-day trial environment. You should see the login screen below. Enter your unique credentials to log in.
You may see "welcome" and "helper" boxes in the UI when you log in for the first time. Close them by clicking Skip for now in the bottom right corner, as shown in the screenshot below.
At the top right of the page, click on your profile and confirm that your current role is ACCOUNTADMIN.
Click Data Products on the left-hand menu, then select Partner Connect. Click on the Dataiku tile. This will automatically create the connection parameters required for Dataiku to connect to Snowflake. Snowflake will create a dedicated database, warehouse, system user, system password and system role, intended to be used by the Dataiku account.
For this lab we'd like to use the PC_DATAIKU_USER to connect from Dataiku to Snowflake, and use the PC_DATAIKU_WH when performing activities within Dataiku that are pushed down into Snowflake.
This shows that a data science team working in Dataiku (and, by extension, on Snowflake) can work completely independently from the data engineering team that loads data into Snowflake using different roles and warehouses.
Click Connect, then Activate. This will launch a new page that redirects you to a Dataiku launch page.
Here, you will have two options:
Ensure the "Sign Up" box is selected, and sign up with either GitHub, Google, or your email address and a new password. Make sure you use the same email address that you used for your Snowflake trial. Click Sign Up. When using your email address, ensure your password fits the following criteria:
You should have received an email from Dataiku at the address you signed up with. Activate your Dataiku account via the link in that email.
Upon clicking on the activation link, please briefly review the Terms of Service of Dataiku Cloud. In order to do so, please scroll down to the bottom of the page.
Click I AGREE and then click NEXT. Click Start, then GOT IT! to continue.
This is the Cloud administration console, where you can perform tasks such as inviting other users to collaborate, adding plugin extensions, installing industry solutions to accelerate projects, and accessing community and academy resources to support your learning journey.
It's beyond the scope of this course to cover plugins in depth, but for this lab we need to enable the Visual SnowparkML plugin, so let's do that now.
Click Plugins on the left menu, then + ADD A PLUGIN. Select Visual SnowparkML, choose Install on my Dataiku instance, and click INSTALL.
Click Code Envs on the left menu, then ADD A CODE ENVIRONMENT and NEW PYTHON ENV. Name it py_39_snowpark (NOTE: the name must match exactly) and click CREATE. Add the following packages to the environment:
scikit-learn==1.3.2
mlflow==2.9.2
statsmodels==0.12.2
protobuf==3.16.0
xgboost==1.7.3
lightgbm==3.3.5
matplotlib==3.7.1
scipy==1.10.1
snowflake-snowpark-python==1.14.0
snowflake-snowpark-python[pandas]==1.14.0
snowflake-connector-python[pandas]==3.7.0
MarkupSafe==2.0.1
cloudpickle==2.0.0
flask==1.0.4
Jinja2==2.11.3
snowflake-ml-python==1.5.0
Select rebuild env from the menu on the left, then click Save and update.
You've now successfully set up your Dataiku trial account via Snowflake's Partner Connect. We are now ready to continue with the lab. For this, move back to your Snowflake browser.
We will now create a Snowpark-optimized warehouse.
Click Admin at the bottom of the left-hand menu, then Warehouses. Click + Warehouse in the top right corner. Once in the New Warehouse creation screen, perform the following steps:
Name the warehouse SNOWPARK_WAREHOUSE.
Select Snowpark-optimized as the type.
Select Medium as the size.
Click Create Warehouse.
Select the new warehouse by clicking on it once. We need to grant the Dataiku role that was created by Partner Connect in the earlier chapter access to this new warehouse.
Click + Privilege.
Select the PC_DATAIKU_ROLE role and grant it the USAGE privilege.
Click Grant Privileges.
You should now see that your new privileges have been applied.
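If you prefer working from a script, the same warehouse and grant can also be expressed in SQL. Below is an optional, minimal sketch using the snowflake-connector-python package; the connection values are placeholders for your own trial credentials, and the statements simply mirror the UI steps above.

```python
# Optional sketch: create the Snowpark-optimized warehouse and grant usage
# to the Partner Connect role via SQL. Connection values are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="<your_account_identifier>",
    user="<your_username>",
    password="<your_password>",
    role="ACCOUNTADMIN",
)
cur = conn.cursor()
cur.execute(
    """
    CREATE WAREHOUSE IF NOT EXISTS SNOWPARK_WAREHOUSE
      WAREHOUSE_SIZE = 'MEDIUM'
      WAREHOUSE_TYPE = 'SNOWPARK-OPTIMIZED'
    """
)
cur.execute("GRANT USAGE ON WAREHOUSE SNOWPARK_WAREHOUSE TO ROLE PC_DATAIKU_ROLE")
conn.close()
```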
Return to the Dataiku trial launchpad in your browser. From the Overview page, click OPEN INSTANCE to get started.
Congratulations, you are now using the Dataiku platform! For the remainder of this lab we will be working from this environment, which is called the design node; it's the pre-production environment where teams collaborate to build data products.
Now let's import our first project.
Once you have downloaded the starter project, we can create our first project. Click + NEW PROJECT and select Import project. Choose the downloaded file and click IMPORT.
You should see a project with 4 datasets: two local CSVs, which we've then imported into Snowflake.
Now that we have all our setup done, let's start working with our data.
Before we begin analyzing the data in our new project, let's take a minute to understand some of the concepts and terminology of a project in Dataiku.
Here is the project we are going to build, along with some annotations to help you understand some key concepts in Dataiku.
Double click into the LOAN_REQUESTS_KNOWN_SF dataset. This is our dataset of historical loan applications, a number of attributes about each one, and whether the loan was paid back or defaulted (the DEFAULTED column: 1.0 = default, 0.0 = paid back).
Click the Statistics tab at the top, then + Create first worksheet. Select Automatically suggest analyses and click CREATE SELECTED CARDS.
Question: What trends do you notice in the data?
Look at the correlation matrix, and the DEFAULTED row. Notice that INTEREST_RATE has the highest correlation with DEFAULTED. We should definitely include this feature in our models!
Positive correlation means as INTEREST_RATE rises, so does DEFAULTED (higher interest rate -> higher probability of default). Notice MONTHLY_INCOME has a negative correlation to DEBT_TO_INCOME_RATIO. This means that as monthly income goes up, applicants' debt to income ratio generally goes down.
See if you can identify a few other features we should include in our models.
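None of this is required for the lab (the Statistics worksheet does it visually), but if you'd like to double-check these relationships in code, here is a minimal pandas sketch. It assumes a hypothetical CSV export of the historical loans dataset with the same column names.

```python
import pandas as pd

# Hypothetical CSV export of the LOAN_REQUESTS_KNOWN_SF dataset.
df = pd.read_csv("loan_requests_known.csv")

# Correlation of each numeric column with the DEFAULTED target;
# INTEREST_RATE should show the strongest positive correlation.
correlations = (
    df.select_dtypes("number").corr()["DEFAULTED"].sort_values(ascending=False)
)
print(correlations)
```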
Now we will train an ML model using our plugin. Return to the flow, either by clicking on the flow icon or by using the keyboard shortcut (g+f).
Click once on the LOAN_REQUESTS_KNOWN_SF dataset. In the Actions menu on the right, scroll down and select the Visual Snowpark ML plugin.
We now need to set our three Outputs.
Click Set under the Generated Train Dataset. Name it train, select PC_DATAIKU_DB to store into, and click CREATE DATASET.
We will now repeat this process for the other two outputs.
Click Set under the Generated Test Dataset. Name it test, select PC_DATAIKU_DB to store into, and click CREATE DATASET.
Click Set under the Models Folder. Name it models, select dataiku-managed-storage to store into, and click CREATE FOLDER.
Click CREATE.
Let's fill out the parameters for our training session.
Select DEFAULTED as the target column.
Select Two-class classification as the prediction type.
Select ROC AUC as our model metric. This is a common machine learning metric for classification problems.
Leave the Train ratio and random seed as is. This will split our input dataset into 80% of records for training, leaving 20% for an unbiased evaluation of the model.
Click the RUN button in the bottom left-hand corner to start training our models.
While we're waiting for our models to train, let's learn a bit about machine learning. This is an oversimplification of some complicated topics. If you're interested, there are links at the end of the course to the Dataiku Academy and many other free resources online.
Machine learning - the use and development of computer systems that are able to learn and adapt without following explicit instructions, by using algorithms and statistical models to analyze and draw inferences from patterns in data.
Oversimplified definition of machine learning - Fancy pattern matching based on the data you feed into it
The two most common types of machine learning solutions are supervised and unsupervised learning.
Supervised learning Goal: predict a target variable
Examples:
Unsupervised learning Goal: identify patterns
Examples:
We need a structured dataset to train a model; in particular, one with observations (rows), a target variable we want to predict, and features that describe each observation.
Once we have a structured dataset with observations, a target, and features, we split it into train and test sets
We could split it in different ways; a random split of 80% train / 20% test is common.
We train models on the train set, then evaluate them on the test set. This way, we can simulate how the model will perform on data that it hasn't seen before.
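The plugin performs this split for you, but for reference, a hand-rolled version with scikit-learn would look roughly like this (the CSV filename is a hypothetical stand-in for the dataset):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical CSV export of the historical loans dataset.
df = pd.read_csv("loan_requests_known.csv")

features = df.drop(columns=["DEFAULTED"])
target = df["DEFAULTED"]

# 80% train / 20% test random split; a fixed random_state makes it repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.2, random_state=42
)
```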
Two keys to choosing features to include in our model:
Most machine learning models require variables to be in a specific format to be able to find patterns in the data.
We can generally break up our variables into two categories: numeric and categorical.
Here are some ways to transform these types of features:
Numeric - e.g. AMOUNT_REQUESTED, DEBT_TO_INCOME_RATIO
Things you typically want to consider: handling missing values and rescaling features so they are on comparable ranges.
Categorical - e.g. LOAN_PURPOSE, STATE
Things you typically want to consider: handling missing values and encoding categories as numbers (e.g. one-hot / dummy encoding).
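If you were doing these transformations by hand in scikit-learn, a minimal sketch might look like the following. The column lists reuse the example features mentioned above, and the specific choices (median imputation, one-hot encoding) are just illustrative defaults, not the plugin's exact settings.

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Example feature lists based on the columns mentioned above.
numeric_cols = ["AMOUNT_REQUESTED", "DEBT_TO_INCOME_RATIO"]
categorical_cols = ["LOAN_PURPOSE", "STATE"]

preprocessor = ColumnTransformer(
    transformers=[
        # Numeric: fill missing values with the median, then rescale.
        ("num", Pipeline([
            ("impute", SimpleImputer(strategy="median")),
            ("scale", StandardScaler()),
        ]), numeric_cols),
        # Categorical: fill missing values, then one-hot encode.
        ("cat", Pipeline([
            ("impute", SimpleImputer(strategy="most_frequent")),
            ("encode", OneHotEncoder(handle_unknown="ignore")),
        ]), categorical_cols),
    ]
)
```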
Let's go through a few common machine learning algorithms.
Linear Regression
For linear regression (predicting a number), we find the line of best fit between our feature variables and our target.
y = b0 + b1 * x
If we were training a model to predict exam scores based on # hours of study, we would solve for this equation
exam_score = b0 + b1 * (num_hours_study)
We use math (specifically a technique called Ordinary Least Squares[1]) to find the b0 and b1 of our best fit line
exam_score = b0 + b1 * (num_hours_study)
exam_score = 32 + 8 * (num_hours_study)
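Here is a tiny, hypothetical version of that exam-score model in scikit-learn; the study-hours data is made up purely for illustration, so the fitted coefficients won't match the numbers above exactly.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up example data: hours of study vs. exam score.
hours_studied = np.array([[1], [2], [3], [5], [8]])
exam_scores = np.array([40, 48, 57, 72, 95])

model = LinearRegression().fit(hours_studied, exam_scores)

# model.intercept_ plays the role of b0 and model.coef_[0] the role of b1.
print(model.intercept_, model.coef_[0])
```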
Logistic Regression
Logistic regression is similar to linear regression - except built for a classification problem (e.g. loan default prediction).
log(p / (1 - p)) = b0 + b1 * (num_hours_study)
log(p / (1 - p)) = 32 + 8 * (num_hours_study)
p = probability of exam success
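A comparable hypothetical sketch for logistic regression; predict_proba returns p, the probability of passing, for a new number of study hours.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up example data: hours of study -> exam passed (1) or failed (0).
hours_studied = np.array([[1], [2], [3], [5], [6], [8]])
passed = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression().fit(hours_studied, passed)

# The second column of predict_proba is p, the probability of exam success.
print(model.predict_proba([[4]])[:, 1])
```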
Decision Trees
Imagine our exam pass/fail model with more variables.
Decision trees will smartly create if / then statements, sending each row along a branch until it makes a prediction of your target variable
Random Forest
A Random Forest model trains many decision trees, introducing randomness into each one so they behave differently, then averages their predictions for a final prediction.
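A minimal scikit-learn sketch of the idea, using synthetic stand-in data (not the loan dataset) just to show the shape of the API:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data, not the loan dataset.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# 100 decision trees, each trained on a random sample of rows and features;
# their individual predictions are combined for the final prediction.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)
print(forest.score(X_test, y_test))
```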
Overfitting
We want our ML model to be able to understand true patterns in the data - uncover the signal, and ignore the noise (random, unexplained variation in the data)
Overfitting is an undesirable machine learning behavior that occurs when the machine learning model gives accurate predictions for training data but not for new data
How to control for overfitting
Logistic Regression
Example
C = 0.01: log(p / (1 - p)) = 32 + 8 * (num_hours_study) + 6 * (num_hours_sleep)
C = 0.1: log(p / (1 - p)) = 32 + 5 * (num_hours_study) + 4 * (num_hours_sleep)
C = 1: log(p / (1 - p)) = 32 + 3 * (num_hours_study) + 2 * (num_hours_sleep)
C = 10: log(p / (1 - p)) = 32 + 2 * (num_hours_study) + 0 * (num_hours_sleep)
C = 100: log(p / (1 - p)) = 32 + 1 * (num_hours_study) + 0 * (num_hours_sleep)
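If you experiment with this in scikit-learn (one of the packages in the code environment we built earlier), note that LogisticRegression's C parameter is defined as the inverse of the regularization strength, so smaller C values shrink the coefficients more. A minimal sketch with made-up data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Made-up data: two features (hours of study, hours of sleep) -> pass/fail.
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# In scikit-learn, smaller C = stronger regularization = coefficients pulled toward zero.
for C in [0.01, 0.1, 1.0, 10.0]:
    model = LogisticRegression(C=C).fit(X, y)
    print(C, model.coef_[0])
```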
Random Forest
For more in-depth tutorials and self-paced machine learning courses, see the links to Dataiku's freely available Academy in the last chapter of this course.
Once we've trained our models, we'll want to take a deeper dive into how they're performing, what features they're considering, and whether they may be biased. Dataiku has a number of tools for evaluating models.
Double click on your model (the green diamond) in the flow, then click into it. Select Feature Importance from the menu on the left side and click COMPUTE NOW.
Here we can see that the top 3 features impacting the model are applicants' FICO scores, the interest rate of the loan, and the amount requested. This makes sense!
Scroll down on this page and you'll see the directional effect of each feature on default predictions. You can see that higher FICO scores generally mean a lower probability of default.
Click the Confusion matrix tab from the menu on the left. Here we can see how the model would have performed on the held-out test set of loan applicants. Notice that this model is very good at catching defaulters (83 predicted 1.0 out of 84 actually 1.0), at the expense of mistakenly rejecting 124 applicants who would have paid back their loans.
Try moving the threshold bar back and forth. It will cause the model to be more or less sensitive. Based on your business problem, you may want a higher or lower threshold.
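Under the hood, moving the threshold just changes the probability cutoff used to turn the model's predicted probabilities into 0/1 decisions. A small hypothetical sketch of that idea:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predicted default probabilities for six loans.
y_true = np.array([0, 0, 1, 1, 0, 1])
probabilities = np.array([0.10, 0.35, 0.80, 0.55, 0.45, 0.90])

# A lower threshold catches more defaulters but rejects more good applicants;
# a higher threshold does the opposite.
for threshold in [0.3, 0.5, 0.7]:
    predictions = (probabilities >= threshold).astype(int)
    print(threshold)
    print(confusion_matrix(y_true, predictions))
```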
Using a machine learning model to make predictions is called scoring or inference.
Click once on the LOAN_REQUESTS_UNKNOWN_SF dataset and select the Visual Snowpark ML plugin from the right-hand Actions menu. Choose Score New Records using Snowpark.
We need to add our model as an input and set an output dataset for the results of the scoring.
In the Inputs section, under the Saved Model option, click SET to add your saved model.
In the Outputs section, under the Scored Dataset option, click SET and give your output dataset a name. For Store into, use the PC_DATAIKU_DB connection and click CREATE DATASET.
Click CREATE.
Make sure SNOWPARK_WAREHOUSE is selected, then click RUN.
When it finishes, your flow should look like this
Double click into the output scored dataset, scroll to the right, and you should see predictions of whether someone is likely to pay back their loan or not!

Let's say we want to automatically run new loan applications through our model every week on Sunday night.
Assume that LOAN_REQUESTS_UNKNOWN is a live dataset of new loan applications that is updated throughout the week.
We want to rerun all the recipes leading up to unknown_loans_scored, where our model makes predictions.
Click + CREATE YOUR FIRST SCENARIO and name it "Weekly Loan Application Scoring".
Add a trigger to run it every week on Sunday at 9pm.
In the Steps tab, click Add Step, then Build / Train, and add the unknown_loans_scored dataset.
Select the Force-build option to recursively build all datasets leading up to unknown_loans_scored, then click the run button to test it out.
You'll be able to see scenario run details in the "Last runs" tab.
It's good practice to retrain machine learning models on a regular basis with more up-to-date data. The world changes around us; the patterns of loan applicant attributes affecting default probability are likely to change too.
If you have time, assume that LOAN_REQUESTS_KNOWN is a live dataset of historical loan applications that is updated with new loan payback and default data on an ongoing basis.
You can automatically retrain your model every month with scenarios, and put in an AUC check to make sure that the model is still performing well before building the scored dataset.
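Outside of the scenario UI, that kind of AUC guardrail is only a few lines of Python. Here is a minimal sketch using scikit-learn's roc_auc_score; the 0.75 cutoff is an arbitrary example, not a value from this lab.

```python
from sklearn.metrics import roc_auc_score

def check_model_quality(y_true, predicted_probabilities, min_auc=0.75):
    """Raise an error if the retrained model's ROC AUC falls below min_auc.

    min_auc=0.75 is an arbitrary example threshold, not a value from the lab.
    """
    auc = roc_auc_score(y_true, predicted_probabilities)
    if auc < min_auc:
        raise ValueError(f"Model AUC {auc:.3f} is below the acceptance threshold {min_auc}")
    return auc
```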
Congratulations on completing this introductory lab exercise! You've mastered the Snowflake basics and taken your first steps toward a no-code approach to training machine learning models with Dataiku.
You have seen how Dataiku's deep integrations with Snowflake allow teams with different skill sets to get the most out of their data at every stage of the machine learning lifecycle.
We encourage you to continue with your free trial, refining your models and exploring some of the more advanced capabilities not covered in this lab.
(Optional) Set up an MLOps process to retrain the model, check for accuracy, and make new predictions on a weekly basis.