This Snowflake Quickstart introduces you to using Snowflake together with Dataiku Cloud as part of a machine learning project to build an end-to-end machine learning solution. The lab showcases the seamless integration of Snowflake and Dataiku at every stage of the ML life cycle. We will also use the Snowflake Data Marketplace to enrich the dataset.

Business Problem

We will work through a supervised machine learning problem by building a binary classification model that predicts whether a borrower will default on a loan, using LOAN_STATUS (yes/no) as the target and multiple features as inputs.

Supervised machine learning is the process of taking a historical dataset with KNOWN outcomes of what we would like to predict and using it to train a model that can then make future predictions. After building the model, we will deploy it back to Snowflake for scoring using a Snowpark Java UDF.

Dataset

We will explore a financial services use case: evaluating loan information to predict whether a borrower will default on a loan. The base dataset was derived from loan data from the Lending Club.

In addition, the base data will be enriched with unemployment data from Knoema on the Snowflake Data Marketplace.

What We're Going To Build

We will build a Dataiku project containing the input datasets from Snowflake. We'll build a data science pipeline by applying data transformations, enriching the data with Marketplace employment data, building a machine learning model, and deploying it to the Flow. We will then see how you can score the model against fresh data from Snowflake and automate the pipeline.


Prerequisites

What You'll Need During the Lab

To participate in the virtual hands-on lab, attendees need the following:

What You'll Build

An operational end-to-end ML project using the joint capabilities of Snowflake and Dataiku, from data collection to deployment.


After registering, you will receive an email with an activation link and your Snowflake account URL. Activate the account.


After activation, you will create a user name and password. Write down these credentials. Bookmark this URL for easy, future access.


Step 1

Log in with your credentials. Bookmark this URL for easy, future access.


Resize your browser window, so that you can view this guide and your web browser side-by-side and follow the lab instructions. If possible, use a secondary display dedicated to the lab guide.

Step 2

Log into your Snowflake account. By default, it will open the home page.


Step 3

To create a worksheet, click on the Worksheets tab. A new screen will open.


Step 4

Click on + Worksheet to create your first worksheet.


Step 5

A new worksheet will be created, named with a timestamp. Let's rename this worksheet by clicking on the timestamp.


You can use any name, but for this lab we will rename it to Data Loading.


Download the following .sql file, which contains the series of SQL commands we will execute throughout this lab. You can either execute the commands step by step from the SQL file or copy the code blocks below and follow along.

Snowflake_Dataiku_ML.sql

Part 1 : Step 1 - Step 4

Creating the database and warehouse, and loading the dataset

Part 2 : Step 5 - Step 8

Tapping a Snowflake Marketplace dataset

After creating the worksheet in the last step, we can import the SQL file provided.

Click on the drop-down button.


Select the Import SQL from File option to import the SQL file you just downloaded. Select the file and press Enter.


Data Loading : Steps

Each step throughout the guide has an associated SQL command to perform the work, so feel free to step through each action, running the code line by line as we walk through the lab.

If you wish to run the code all at once, Part 1 (Step 1 - Step 4) needs to run first; additional steps are then required before executing Part 2 (Step 5 - Step 8).

To execute the code, place your cursor on the line you wish to run and then either hit the "run" button at the top left of the worksheet or press Cmd/Ctrl + Enter.

Step 1 : Create the virtual warehouse we will use for compute, using the SYSADMIN role.

USE ROLE SYSADMIN;

CREATE OR REPLACE WAREHOUSE ML_WH

  WITH WAREHOUSE_SIZE = 'XSMALL'

  AUTO_SUSPEND = 120

  AUTO_RESUME = true

  INITIALLY_SUSPENDED = TRUE;
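Optionally, as a quick check that is not part of the provided script, you can confirm the warehouse was created and inspect its settings:

SHOW WAREHOUSES LIKE 'ML_WH';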

Step 2 : In this step we will first create the ML_DB database and then create a LOAN_DATA table in it.

USE WAREHOUSE ML_WH;

CREATE DATABASE IF NOT EXISTS ML_DB;

USE DATABASE ML_DB;

CREATE OR REPLACE TABLE loan_data (
  
        LOAN_ID NUMBER(38,0),
  
        LOAN_AMNT FLOAT,

        FUNDED_AMNT FLOAT,

        TERM VARCHAR(4194304),

        INT_RATE VARCHAR(4194304),

        INSTALLMENT FLOAT,

        GRADE VARCHAR(4194304),

        SUB_GRADE VARCHAR(4194304),

        EMP_TITLE VARCHAR(4194304),

        EMP_LENGTH_YEARS NUMBER(38,0),

        HOME_OWNERSHIP VARCHAR(4194304),

        ANNUAL_INC FLOAT,

        VERIFICATION_STATUS VARCHAR(4194304),

        ISSUE_DATE_PARSED TIMESTAMP_TZ(9),

        LOAN_STATUS VARCHAR(4194304),

        PYMNT_PLAN BOOLEAN,
        
        PURPOSE VARCHAR(4194304),

        TITLE VARCHAR(4194304),
    
        ZIP_CODE VARCHAR(4194304),

        ADDR_STATE VARCHAR(4194304),

        DTI FLOAT,

        DELINQ_2YRS FLOAT,

        EARLIEST_CR_LINE VARCHAR(4194304),

        INQ_LAST_6MTHS FLOAT,

        MTHS_SINCE_LAST_DELINQ FLOAT,

        MTHS_SINCE_LAST_RECORD FLOAT,

        OPEN_ACC FLOAT,

        REVOL_BAL FLOAT,

        REVOL_UTIL FLOAT,

        TOTAL_ACC FLOAT,

        TOTAL_PYMNT FLOAT,

        MTHS_SINCE_LAST_MAJOR_DEROG FLOAT,

        TOT_CUR_BAL FLOAT,

        ISSUE_MONTH NUMBER(38,0),

        ISSUE_YEAR NUMBER(38,0)
  
);


After running the cell above, we have successfully created a loan_data table.


Step 3 : In this step we will create an external stage named LOAN_DATA to load the lab data. The stage points to a public S3 bucket to keep things simple for this workshop.

Typically an external stage would use a secure storage integration, as described in this link.
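For reference, a secured stage typically pairs a storage integration with the stage definition. The sketch below is illustrative only; the integration name, role ARN, and bucket are placeholders and are not used in this lab:

CREATE STORAGE INTEGRATION MY_S3_INT
  TYPE = EXTERNAL_STAGE
  STORAGE_PROVIDER = 'S3'
  ENABLED = TRUE
  STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/my-snowflake-role'
  STORAGE_ALLOWED_LOCATIONS = ('s3://my-private-bucket/data/');

CREATE STAGE MY_SECURE_STAGE
  URL = 's3://my-private-bucket/data/'
  STORAGE_INTEGRATION = MY_S3_INT;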

CREATE OR REPLACE STAGE LOAN_DATA

  url='s3://snowflake-corp-se-workshop/Summit_Snowflake_Dataiku/data/';
  
 
 ---- List the files in the stage 

 list @LOAN_DATA;

Listing the files from S3 bucket


Step 4 : In this step we will copy the loans_data.csv file into the loan_data table we created.

COPY INTO loan_data FROM @LOAN_DATA/loans_data.csv
FILE_FORMAT = (TYPE = 'CSV' FIELD_OPTIONALLY_ENCLOSED_BY = '"' SKIP_HEADER = 1);

SELECT * FROM loan_data LIMIT 100;

Below is a snapshot of the data; it represents an aggregation of borrower and loan information from various internal systems. We can take a quick look at the various attributes it contains.


We have successfully loaded the data from the external stage into Snowflake.
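As an optional sanity check (not part of the provided script), you can confirm the row count and preview the distribution of the target column we will predict later:

SELECT COUNT(*) AS loaded_rows FROM loan_data;

SELECT LOAN_STATUS, COUNT(*) AS loans
FROM loan_data
GROUP BY LOAN_STATUS;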

Step 5 : Time to get the Knoema employment data from the Snowflake Marketplace.

We can now look at additional data in the Snowflake Marketplace that can be helpful for improving ML models. It may be good to look at employment data in the region when analyzing loan defaults. Let's look in the Snowflake Data Marketplace and see what external data is available from the data providers.

Let's go to the home screen by clicking on the home icon.


Important Note

  1. Click the Marketplace tab
  2. Make sure the ACCOUNTADMIN role is selected
  3. In the search bar, type Labor Data Atlas


Click on the tile with Labor Data Atlas


Next, click the Get Data button. This will open a pop-up window in which you can create a database in your account that will expose the data from the data provider.

Important : Steps

  1. Change the name of the database to KNOEMA_LABOR_DATA_ATLAS
  2. In the additional roles drop-down, select PUBLIC
  3. Click Get Data


When the confirmation appears, click Done; you can then close the browser tab with the Preview App.


Another advantage of the Snowflake Data Marketplace is that the shared data requires no additional work and simply shows up as a database in your account. A further benefit is that the data is automatically updated as soon as the data provider makes any changes on their account.

  1. Once done, confirm the datasets are properly configured
  2. Click on the Data tab, then Databases
  3. You should see KNOEMA_LABOR_DATA_ATLAS and ML_DB


After confirming the databases, let's go to the Worksheets tab and open the Data Loading worksheet.


Step 6 : Query KNOEMA_LABOR_DATA_ATLAS for some basic analysis.

There are multiple datasets. Let's narrow down our search to find a U.S. unemployment dataset.

USE WAREHOUSE ML_WH;

USE DATABASE KNOEMA_LABOR_DATA_ATLAS;

SELECT * 
FROM "LABOR"."DATASETS"
WHERE "DatasetName" ILIKE '%unemployment%' 
AND "DatasetName" ILIKE '%U.S%';


Amazing! We have successfully tapped into a live collection of the most important, widely used, high-quality datasets on the labor market and human resources, at national and sub-national levels, from a dozen sources.

We can answer questions such as: what is the number of initial claims for unemployment insurance in the US over time?

SELECT * FROM "LABOR"."USUID2017Sep" WHERE "Region Name" = 'United States' AND 
      "Indicator Name" = 'Initial Claims' AND "Measure Name" = 'Value' AND 
       "Seasonal Adjustment Name" = 'Seasonally Adjusted' ORDER BY "Date";


Now, for this exercise, we are going to enrich the loan dataset we created earlier using the BLSLA dataset.

Step 7 : Create a KNOEMA_EMPLOYMENT_DATA view over the Marketplace data. We will pivot the data so the different employment metrics can be used easily for analysis.

USE DATABASE ML_DB;

CREATE OR REPLACE VIEW KNOEMA_EMPLOYMENT_DATA AS (

SELECT *

FROM (SELECT "Measure Name" MeasureName, "Date", 
      "RegionId" State, 
      AVG("Value") Value 
      FROM "KNOEMA_LABOR_DATA_ATLAS"."LABOR"."BLSLA" WHERE "RegionId" is not null 
      and "Date" >= '2018-01-01' AND "Date" < '2018-12-31' GROUP BY "RegionId", "Measure Name", "Date")
  PIVOT(AVG(Value) FOR MeasureName
  IN ('civilian noninstitutional population', 'employment', 'employment-population ratio', 
     'labor force', 'labor force participation rate', 'unemployment', 'unemployment rate')) AS 
        p (Date, State, civilian_noninstitutional_population, employment, employment_population_ratio, 
           labor_force, labor_force_participation_rate, unemployment, unemployment_rate)
);

SELECT * FROM KNOEMA_EMPLOYMENT_DATA LIMIT 100;


We have successfully created the view.
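As an optional check (not part of the provided script), you can query the new view, for example to compare average unemployment rates across states:

SELECT State, AVG(unemployment_rate) AS avg_unemployment_rate
FROM KNOEMA_EMPLOYMENT_DATA
GROUP BY State
ORDER BY avg_unemployment_rate DESC
LIMIT 10;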

Step 8 : In this step we will create a new table called UNEMPLOYMENT_DATA by joining the LOAN_DATA table loaded from S3 with the KNOEMA_EMPLOYMENT_DATA view created in the last step, matching on geography and time period.

This will provide us with unemployment data in the region associated with the specific loan.

CREATE OR REPLACE TABLE UNEMPLOYMENT_DATA AS

 SELECT l.LOAN_ID, e.CIVILIAN_NONINSTITUTIONAL_POPULATION, 
        e.EMPLOYMENT, e.EMPLOYMENT_POPULATION_RATIO, e.LABOR_FORCE, 
        e.LABOR_FORCE_PARTICIPATION_RATE, e.UNEMPLOYMENT, e.UNEMPLOYMENT_RATE

  FROM LOAN_DATA l LEFT JOIN KNOEMA_EMPLOYMENT_DATA e

 on l.ADDR_STATE = right(e.state,2) and l.issue_month = month(e.date) and l.issue_year = year(e.date);

SELECT * FROM UNEMPLOYMENT_DATA LIMIT 100;
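Because this is a left join on state and issue month/year, some loans may not find a matching employment record. An optional check (not part of the provided script):

SELECT COUNT(*) AS total_loans,
       COUNT_IF(UNEMPLOYMENT_RATE IS NULL) AS loans_without_employment_match
FROM UNEMPLOYMENT_DATA;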


Go to the home screen by clicking the home button.


Select Admin from the list.


For the next steps

  1. Click Partner Connect
  2. From the role drop-down, make sure ACCOUNTADMIN is selected
  3. In the search box, type Dataiku
  4. Click on the Dataiku tile.

Your screen should look like the screenshot below.


After you click on Dataiku, the following window will launch; it will automatically create the connection parameters required for Dataiku to connect to Snowflake.

29

Snowflake will create a dedicated database, warehouse, system user, system password and system role, with the intention of those being used by the Dataiku account.

We'd like to use the PC_DATAIKU_USER to connect from Dataiku to Snowflake, and use the PC_DATAIKU_WH when performing activities within Dataiku that are pushed down into Snowflake.

Note that the user password (which is autogenerated by Snowflake and never displayed), along with all of the other Snowflake connection parameters, are passed to the Dataiku server so that they will automatically be used for the Dataiku connection. DO NOT CHANGE THE PC_DATAIKU_USER password, otherwise Dataiku will not be able to connect to the Snowflake database.

Click on Connect. You may be asked to provide your first and last name; if so, add them and click Connect. Your partner account has been created. Click on Activate to activate it.


This will launch a new page that will redirect you to a launch page from Dataiku.

For the lab we assume that you're new to Dataiku, so ensure the "Sign Up" box is selected, and sign up using your email address (Note: this should be the same email address that you used to set up your Snowflake account) and a new password of your choosing.


When signing up with your email address, ensure your password fits the following criteria:

  1. At least 8 characters in length
  2. Contains lower case letters (a-z), upper case letters (A-Z), and numbers (0-9)

Upon clicking on the activation link, please briefly review the Terms of Service of Dataiku Cloud. In order to do so, please scroll down to the bottom of the page. Click on I AGREE


Next, you'll need to complete your sign up information then click on Start.

You will be redirected to the Dataiku Cloud Launchpad site. Click GOT IT! to continue.


You've now successfully set up your Dataiku trial account via Snowflake's Partner Connect. We are now ready to continue with the lab. For this, move back to your Snowflake browser.

After connecting Snowflake to Dataiku via Partner Connect, we will clone the tables created in ML_DB into PC_DATAIKU_DB for Dataiku to consume.

Snowflake provides a unique feature called Zero-Copy Cloning that creates a new copy of the data by copying only the metadata of the objects. This drastically speeds up the creation of copies and also drastically reduces the storage space needed for data copies.
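To illustrate the idea (this is not part of the lab, and the table name below is hypothetical), a clone is created almost instantly because only metadata is copied, and changes to the clone never affect the source:

CREATE TABLE ML_DB.PUBLIC.LOAN_DATA_DEV CLONE ML_DB.PUBLIC.LOAN_DATA;

-- Modifying the clone leaves the source table untouched.
DELETE FROM ML_DB.PUBLIC.LOAN_DATA_DEV WHERE ISSUE_YEAR < 2017;

-- Remove the illustration table.
DROP TABLE ML_DB.PUBLIC.LOAN_DATA_DEV;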

You should now see three databases; PC_DATAIKU_DB is the system-generated database created by Partner Connect.


You should also see PC_DATAIKU_USER, the system-generated user created by Partner Connect.


Go back to the Data Loading worksheet you were working in and run the commands below.

Granting privileges on ML_DB to PC_Dataiku_role

grant all privileges on database ML_DB to role PC_Dataiku_role;
grant usage on all schemas in database ML_DB to role PC_Dataiku_role;
grant select on all tables in schema ML_DB.public to role PC_Dataiku_role;
grant select on all views in schema ML_DB.public to role PC_Dataiku_role;

There are two options after this: you can either create a new worksheet or continue in the same worksheet. We will continue in the same worksheet; we just have to refresh the browser after the next step.

USE ROLE PC_DATAIKU_ROLE;

Important: Refresh the web page

After running the above command you might see the prompt below. Refresh the browser.


Cloning the tables into the Dataiku database before consuming them in Dataiku DSS

USE DATABASE PC_DATAIKU_DB;
USE WAREHOUSE PC_DATAIKU_WH;

--- cloning 

CREATE OR REPLACE TABLE LOANS_ENRICHED CLONE ML_DB.PUBLIC.LOAN_DATA;
CREATE OR REPLACE TABLE UNEMPLOYMENT_DATA CLONE ML_DB.PUBLIC.UNEMPLOYMENT_DATA;


SELECT * FROM LOANS_ENRICHED LIMIT 10;

After running the above commands, we have created clones of the tables to be used for analysis. Check PC_DATAIKU_DB; you should see the two tables LOANS_ENRICHED and UNEMPLOYMENT_DATA.
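You can also verify the clones from the worksheet (optional; not part of the provided script):

SHOW TABLES IN DATABASE PC_DATAIKU_DB;

SELECT (SELECT COUNT(*) FROM PC_DATAIKU_DB.PUBLIC.LOANS_ENRICHED) AS cloned_rows,
       (SELECT COUNT(*) FROM ML_DB.PUBLIC.LOAN_DATA)              AS source_rows;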


Return to Dataiku Online and, if you haven't already, click OPEN DATAIKU DSS from the Launchpad to start your instance of Dataiku DSS.


Here is the project we are going to build along with some annotations to help you understand some key concepts in Dataiku DSS:


Input dataset: The dataset is based on the Loans Dataset from LendingClub, a peer-to-peer lending company that matches borrowers and investors.

In the interests of time we have performed some initial steps of the data pipeline such as cleansing and transformations on the loans dataset. These steps can be created in Dataiku from the raw datasets from the Lending Club to form a complete pipeline with the data and execution happening in Snowflake.

Reminder of our goal

Our goal is to build an optimized machine learning model that can be used to predict the risk of default on loans for customers and advise them on how to reduce their risk. To do this, we'll join the input datasets, perform transformations & feature engineering so that they are ready to use for building a binary classification model.

Creating a Dataiku Project

Once you've logged in, click on + NEW PROJECT and select + Blank project to create a new project.

Name the project as Credit Scoring


The project home acts as the command center from which you can see the overall status of a project, view recent activity, and collaborate through comments, tags, and a project to-do list. Let's add our datasets from Snowflake.


In DSS, the datasets and the recipes together make up the flow. We have created a visual grammar for data science, so users can quickly understand a data pipeline through the flow.

Using the flow, DSS knows the lineage of every dataset in the flow. DSS, therefore, is able to dynamically rebuild datasets whenever one of their parent datasets or recipes has been modified. This is where we will work from in this lab.

Now we have all of the raw data needed for this lab. Let's explore what's inside these datasets.

One column to note is the LOAN_STATUS column. This will be our target variable to predict against later in the lab.


Join the Data

So far, your Flow only contains datasets. To take action on datasets, you need to apply recipes. The LOANS_ENRICHED and UNEMPLOYMENT_DATA datasets both contain a column of Loan IDs. Let's join these two datasets together using a visual recipe.


Your flow should now look like this


Data cleaning and preparation is typically one of the most time-consuming tasks for anyone working with data. In our lab, in order to save some of that time, our main lending dataset already has a number of cleaning steps applied. In the real world this would be done by other colleagues, say from the data analytics team, collaborating on this project, and you would see their work as steps in our project's Flow.

Let's take a brief look at the Prepare recipe, the workhorse of the visual recipes in Dataiku, and perform some final investigations and transformations.


In a Prepare recipe, you assemble a series of steps to transform your data, drawing on a library of ~100 processors. There are a couple of ways to select these processors to build your script. First, you can select a processor directly using the +ADD A NEW STEP button on the left. Second, because Dataiku DSS infers a meaning for each column, it suggests relevant actions in many cases. In the example below, although the column is stored in Snowflake as a string, Dataiku DSS recognizes it as a date format, so it infers a Date (unparsed) meaning and suggests the Parse date processor; selecting the More actions menu item reveals further suggestions.
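For coders, the same kind of transformation can also be expressed as SQL pushed down to Snowflake. This is only a sketch; it assumes EARLIEST_CR_LINE holds strings such as 'Apr-2011', and the actual column and format handled by the Parse date processor in your project may differ:

SELECT EARLIEST_CR_LINE,
       TRY_TO_DATE(EARLIEST_CR_LINE, 'MON-YYYY') AS EARLIEST_CR_LINE_PARSED
FROM LOANS_ENRICHED
LIMIT 10;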


Let's try using processors with both methods, firstly via the suggested actions:


Now that we have our desired feature, we can remove the two date columns.


Your script steps should now look like this:


Optionally, you can place the three date transformation script steps into their own group with comments to make it simple for a colleague to follow everything you have done. Let's turn our attention to the INT_RATE column. The interest rate is likely to be a powerful predictive feature when modeling credit defaults, but currently it's stored as a string:


Our INT_RATE column has some suspiciously high values. Let's use the Analyze tool again and see how it can be used to take certain actions in a Prepare recipe.
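As a rough SQL equivalent of this investigation (a sketch only, assuming values are stored like '13.5%'), you could convert the string to a number and inspect the outliers:

SELECT INT_RATE,
       TRY_TO_DOUBLE(REPLACE(INT_RATE, '%', '')) AS INT_RATE_NUM
FROM LOANS_ENRICHED
WHERE TRY_TO_DOUBLE(REPLACE(INT_RATE, '%', '')) > 30
LIMIT 20;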


Finally, let's take a look at our DTI column, the ratio of the borrower's total monthly debt payments on their total debt obligations to the borrower's self-reported monthly income.


We can see that there are a very small number of missing rows. We're going to perform some calculations using this column in the next lab section, so let's fix that now.
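For reference, an equivalent check and a simple median-based imputation in SQL might look like the sketch below (the Prepare recipe step you configure may use a different fill strategy):

SELECT COUNT_IF(DTI IS NULL) AS missing_dti FROM LOANS_ENRICHED;

SELECT LOAN_ID,
       COALESCE(DTI, (SELECT MEDIAN(DTI) FROM LOANS_ENRICHED)) AS DTI_FILLED
FROM LOANS_ENRICHED
LIMIT 10;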

Your final series of steps should look like this


As before you can optionally group and comment your transformation steps.

Dataiku DSS integrates with Snowpark for Python, allowing coders to take advantage of all the benefits of Snowflake while collaborating alongside their no/low-code colleagues on projects, accelerating time to value in DSS, an end-to-end, governed AI lifecycle platform.

When using Dataiku's SaaS option from Partner Connect the setup is done for us automatically. Let's check that.

Return to your browser tab with the Dataiku Launchpad open (if you have closed it, just go back to the Launchpad).

Select the Features menu.

Your Snowpark extension is now ready to use.

A Note on Code Environments: Dataiku uses the concept of code environments to address the problem of managing dependencies and versions when writing code in R and Python. Code environments provide a number of benefits, such as isolation and reproducibility of results.

When using Snowpark for Python from Dataiku DSS you will use a code environment that includes the Snowpark library as well as any other packages you wish to use. In our lab, to make things easy, we are using a default Snowpark code environment that contains just the minimum required libraries, but once you have completed the lab and wish to explore further you can create your own code environments.

In addition to selecting an appropriate code environment there are just a couple of extra lines of code to add to your DSS recipe to start using Snowpark for Python.

Let's take a look at a simple example.

Firstly you need to add the following line to your imports:

from dataiku.snowpark import DkuSnowpark

Then read the inputs, instantiate Snowpark, get the dataframe, write your code then write your output.

import dataiku

# Read recipe inputs
input_dataset = dataiku.Dataset("my_input_dataset")

# get input dataset as snowpark dataframe
dku_snowpark = DkuSnowpark()
snowdf = dku_snowpark.get_dataframe(input_dataset)

# ALL YOUR CODE HERE

# get output dataset
OUTPUT_DATASET = dataiku.Dataset("my_output_dataset")

# write input dataframe to output dataset
dku_snowpark.write_with_schema(OUTPUT_DATASET,snowdf)

We have an example Jupyter notebook to help you get started. Download the notebook from the S3 bucket to a local drive; we will then upload it to DSS (Note: you would typically use the Git integrations in DSS for managing team notebooks developed outside of DSS).

Snowpark_Jupyter_notebook.ipynb

Either select Notebooks from the menu or use the G+N keyboard shortcut. Select the option to upload your notebook, choose the file, and click Upload.


Here is the notebook we imported; click CREATE RECIPE.


Select Python recipe and click OK.


For the input dataset we will select LOANS_ENRICHED_joined_prepared, and for the output dataset type LOANS_FE, then click Create recipe.


You now have the notebook set up with the correct input and output datasets in our flow. You can use either the default code editor or the Jupyter notebook. We will work in the Jupyter notebook. Click EDIT IN NOTEBOOK.


Ensure your Jupyter notebook is using the Snowpark kernel; if not, change it from the Change Kernel menu.


Test running your cells (note that the code assumes the dataset names specified above; if you have changed any input or output dataset names, be sure to make those updates in the code).

Feel free to add your own code and experiment; when you are done, click SAVE BACK TO RECIPE.

From the default code editor, let's check that the correct code environment is applied. Click on Advanced and then select a Snowpark code environment from the dropdown (Note: your available code environments may differ from the screenshot).


Return to the Code screen and click the Run button to execute the recipe using Snowpark and to generate the output dataset in the flow.

Having sufficiently explored and prepared the loans and employment data, the next stage of the AI lifecycle is to experiment with machine learning models.

This experimentation stage encompasses two key phases: model building and model assessment.

Model building: Users have full control over the choice and design of a model — its features, algorithms, hyperparameters and more.

Model assessment: Tools such as visualizations and statistical summaries allow users to compare model performance.

These two phases work in tandem to realize the idea of Responsible AI. Either through a visual interface or code, building models with DSS can be transparently done in an automated fashion. At the same time, the model assessment tools provide a window into ensuring the model is not a black box.

Before building our model, we will first split the output dataset from our Python step.

This is how your flow should look before splitting.


When building a visual model, users can choose a template instructing DSS to prioritize considerations like speed, performance, and interpretability. Having decided on the basic type of machine learning task, you retain full freedom to adjust the default settings chosen by DSS before training any models. These options include the metric for which to optimize, what features to include, and what algorithms should be tested etc.

Let's take a look at the settings from the template.


On the left side we can view and adjust the various settings for our current experiment. We don't have time in today's lab to cover all the options, but here is a brief outline of a few we will use in the lab:

TRAIN/TEST SET - When training a model, it is important to test the performance of the model on a "test set". During the training phase, DSS "holds out" on the test set, and the model is only trained on the train set. In this section you can adjust the strategy.

DEBUGGING - ML Diagnostics are designed to identify and help troubleshoot potential problems and suggest possible improvements at different stages of training and building machine learning models.

FEATURES HANDLING - We can allow Dataiku DSS to automatically choose the features included in our model, or we can manually select which features we want to include when our model is trained and how we handle the feature types.

ALGORITHMS - DSS natively supports algorithms that can be used to train predictive models depending on the machine learning task: Clustering or Prediction (Classification or Regression). We can also choose to use our own machine learning algorithm, by adding a custom Python model. In our case we are using the algorithms based on the Scikit-Learn, LightGBM and XGBoost ML libraries.

Let's use the defaults the template has set.


The RESULTS pane in DSS provides a single interface to compare performance in terms of sessions or models, making it easy to find the best performing model for the chosen metric.

In the RESULTS screen we can see the output of our first experiment. DSS displays a graph of the evolution of the best cross-validation scores found so far. Hovering over one of the points, we can see the evolution of the hyperparameter values that yielded an improvement. In the right part of the charts, we can see final test scores.

We can also see that some Diagnostics checks have been flagged.

Important note: your results may vary from the screenshots below.


Here we can see there are a number of potential issues DSS has identified for us. It seems we have an imbalanced dataset, which is leading to the model almost always predicting class 1 (that there will be no default on the loan).

We can see this in our distribution.


Here we can see that loan defaults make up only 4% of the dataset. So even if our model erroneously predicted that no loan would ever default, it would still be correct 96% of the time for this imbalanced dataset! This is a common issue in certain types of classification problems such as credit card fraud, identifying rare diseases or, as in our case, loan defaults.
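You can confirm the class distribution directly in Snowflake with a quick query against the original loans table (optional; the values DSS displays may be formatted differently):

SELECT LOAN_STATUS,
       COUNT(*) AS loans,
       ROUND(100 * COUNT(*) / SUM(COUNT(*)) OVER (), 2) AS pct_of_total
FROM LOANS_ENRICHED
GROUP BY LOAN_STATUS
ORDER BY loans DESC;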

Although this is a common problem in machine learning, it is not always easy to solve. Fortunately, DSS has a number of ways to help, such as weighting strategies, class rebalance sampling, algorithm selection, and more. Let's look at a couple of these techniques.

First, let's take a look at class rebalancing.


Let's also change the algorithms we are using, as logistic regression and tree-based algorithms tend not to perform as well on imbalanced datasets. Let's look at some of our boosting algorithms.


As you can see on the results page, we saw an improvement in our score and addressed our imbalance issue. The diagnostics warn us that the test set might be too small now, but we have a much larger dataset available from LendingClub if we want to use it.


After having trained as many models as desired, DSS offers tools for full training management to track and compare model performance across different algorithms. DSS also makes it easy to update models as new data becomes available and to monitor performance across sessions over time.


We can compare across our experiments, saved models and evaluations from a DSS evaluation store (not part of this lab). You can set a champion and compare to challengers.


Clicking on any model produces a full report of tables and visualizations of performance against a range of different possible metrics.


Here we can see Variable importance


After experimenting with a range of models built on historic training data, the next stage is to deploy our chosen model to score new, unseen records.

For many AI applications, batch scoring, where new data is collected over some period of time before being passed to the model, is the most effective scoring pattern. Deploying a model creates a "saved" model in the Flow, together with its lineage. A saved model is the output of a Training recipe which takes as input the original training data used while designing the model.


Your flow should now look like this:


Your final project flow should now look like this.


We can now see the results back on the Snowflake tab. If you hit the refresh icon near the top left of the screen by your databases, you should see the CREDIT_SCORING_LOANS_TEST_SCORED table that was created once we kicked off our prediction job.

Preview Data will give you a glimpse of the additional columns added to the table.

USE ROLE SYSADMIN;
USE DATABASE PC_DATAIKU_DB;
USE WAREHOUSE PC_DATAIKU_WH;
SELECT * 
FROM LOANS_TEST_SCORED_CREDITSCORING_1 
LIMIT 10;


Additional info

SELECT 
	EMP_TITLE ,
	SUM(CASE WHEN "prediction" = 'ok' THEN 1 ELSE 0 END) AS prediction_yes,
	SUM(CASE WHEN "prediction" = 'incident' THEN 1 ELSE 0 END) AS prediction_no
	FROM LOANS_TEST_SCORED_CREDITSCORING_1 
GROUP BY 
	EMP_TITLE 
order by prediction_yes DESC;

Congratulations! You have now successfully built and deployed your model and scored results back to Snowflake. Your final flow should look like this.


What we have covered

Related Resources

Snowflake University

Dataiku Academy

To enable the Anaconda libraries on a Snowflake account:

1. Create a new trial account on https://signup.snowflake.com

2. Log in

3. Switch to the ORGADMIN role

4. Go to Admin » Billing

5. Click on Terms & Billing, and enable the Anaconda terms (see the example below).
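Once the terms are accepted, Anaconda-curated packages can be used inside Snowpark Python UDFs. As a minimal sketch to verify the setup (the function name is just an example, not part of the lab), you could run:

CREATE OR REPLACE FUNCTION PY_SQRT(x FLOAT)
RETURNS FLOAT
LANGUAGE PYTHON
RUNTIME_VERSION = '3.9'
PACKAGES = ('numpy')
HANDLER = 'compute'
AS
$$
import numpy as np

def compute(x):
    # Uses the Anaconda-provided numpy package inside Snowflake.
    return float(np.sqrt(x))
$$;

SELECT PY_SQRT(16);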
