H2O Driverless AI is a supervised machine learning platform built on the concept of automated machine learning. Supervised machine learning is a method that takes historical data where the response or target is known and builds relationships between the input variables and the target variable. Driverless AI automates most of the difficult parts of the supervised machine learning workflow, such as feature engineering, model validation, model tuning, model selection, and model deployment. Modeling pipelines produced by H2O Driverless AI can be exported as standalone scoring artifacts to power your AI/ML use case.
This tutorial presents a quick introduction to the Driverless AI platform via Snowflake Partner Connect.
We will use a dataset from LendingClub.com to build a classification model that predicts the likelihood a LendingClub.com borrower will default on their loan. LendingClub.com is an established online loan marketplace that funds personal loans, commercial loans, medical procedures, and other financing needs. The data consist of 25 columns and approximately 39,000 rows, with each row corresponding to a customer. Here is a preview of the data:
Note that the dataset consists of numerical columns (loan_amnt, installment, emp_length, dti, etc.), categorical columns (term, home_ownership, verification_status, purpose, etc.), and a text column (desc). Our target variable is bad_loan, which is a Boolean with values True and False, so this will be a binary classification problem.
We will use Snowflake and Driverless AI to build and deploy a model that predicts loan default. (You will need the ACCOUNTADMIN role, which is required for this lab.) The first thing you will need to do is download the following .sql file, which contains the series of SQL commands we will execute throughout this lab.
At this point, log into your Snowflake account so you have a clear screen to start working with. If you have just created a free trial account, feel free to minimize or close any hint boxes intended to guide you; they will not be needed for this lab, as most of the hints are covered throughout the remainder of this exercise.
To ingest our script in the Snowflake UI, navigate to the ellipsis button on the top right-hand side of a "New Worksheet" and load our script.
Snowflake provides "worksheets" as the place for you to execute your code. This lab assumes you have already run a few queries in Snowflake, so we are going to execute a series of commands quickly to get the data into tables and move on to the more interesting part of the lab: building and deploying models. The .sql file that you upload should look like this:
USE ROLE PC_H2O_ROLE;
USE DATABASE PC_H2O_DB;
USE SCHEMA public;
USE WAREHOUSE PC_H2O_WH;
CREATE OR REPLACE TABLE loans (
id INTEGER,
loan_amnt INTEGER,
term String(1024),
installment Real,
grade String(1024),
...)
...
Note: before you execute the SQL statements, please proceed to the next section to connect to H2O and launch your Driverless AI instance.
Snowflake's Partner Connect feature allows you to seamlessly get started with partner tools and manages most of the connection details for you to get up and running as quickly as possible.
Go ahead and click on the "Partner Connect" application. This should take you to the following screen, where you will see many of the Snowflake partners; a simple process of setting up an account and integration lets you quickly move data into a partner tool.
To be able to continue testing out partner applications (in our case, H2O), we need to promote ourselves to the ACCOUNTADMIN role. This is done outside the worksheet, so there isn't a command for us to run; we need to do this one manually.
Once you have completed this step, go ahead and click on the H2O application. This will present you with a screen to connect to H2O and will outline a number of Snowflake objects that will be auto-created. For the purposes of this lab, we have already created the Snowflake objects that we will need, so you can press "Connect".
This creates a partner account, which you can immediately Activate. You next need to accept the H2O Terms and Conditions for the Trial Agreement and wait while your H2O Driverless AI instance is configured and launched.
Your brand new Driverless AI instance looks like this:
A summary of the information and views we will cover in this tutorial includes:
- DATASETS: A view for importing, listing, and operating on datasets.
- AUTOVIZ: The automatic visualizations of data view.
- EXPERIMENTS: The view where we build and deploy predictive models.
- DIAGNOSTICS: Model diagnostics view.
- MLI: Machine learning interpretability view, with information to help us understand our models.
- RESOURCES: A pull-down menu for accessing system information, clients, help, and other resources.

Now let's set up the database and warehouse in Snowflake and create a table to use for the lab.
In the Snowflake worksheet, you have previously loaded a .sql script. The SQL commands in this script import the LendingClub dataset and create a table called loans. This table will be used with H2O Driverless AI to train and deploy a machine learning model.
To execute the entire .sql script, which contains 9 different statements, all we need to do is click "All Queries" next to the blue "Run" button at the top left of the worksheet and then press "Run". You should see that the "Run" button shows "(9)", meaning it will execute all 9 commands in the uploaded file.
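If you want to confirm that the loans table was created and populated before moving on, a quick sanity check along these lines should work (a sketch; per the dataset description above, roughly 39,000 rows are expected):
-- Confirm the loans table exists and was populated by the uploaded script
SELECT COUNT(*) AS row_count FROM loans;   -- expect roughly 39,000 rows
SELECT * FROM loans LIMIT 5;               -- eyeball a few records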
From the empty Datasets view, click the Add Dataset button and select the SNOWFLAKE connector.
This launches the Make Snowflake Query form.
Enter the following into the form:
- Database: PC_H2O_DB
- Warehouse: PC_H2O_WH
- Schema: PUBLIC
- File name: loans.csv
- File formatting parameters: FIELD_OPTIONALLY_ENCLOSED_BY = '"' (note: the quotation marks are single, double, single)
- Query: SELECT * FROM LOANS
Then click the CLICK TO MAKE QUERY button. This imports the data into the Driverless AI system.
The dataset is now available for the next steps in Driverless AI.
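The FIELD_OPTIONALLY_ENCLOSED_BY = '"' setting is a standard Snowflake CSV file-format option; it tells Snowflake that field values may be wrapped in double quotes (useful when text columns such as desc contain commas). You do not need to run anything extra for this lab, but as an illustration of where the same option appears in plain Snowflake SQL, a CSV unload might look like this (the stage name my_stage is hypothetical):
-- Illustration only: unload the loans table to CSV with optionally quoted fields
COPY INTO @my_stage/loans.csv
  FROM loans
  FILE_FORMAT = (TYPE = CSV FIELD_OPTIONALLY_ENCLOSED_BY = '"')
  SINGLE = TRUE;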
Right click the loans dataset to get details.
The Dataset Details view is a quick way to inspect the dataset columns, see their storage type (integer, string, etc.), and get summary statistics and distribution plots for each column.
In more advanced usage, you can edit the data type interactively.
Scrolling to the right, inspect the bad_loan column, our target variable.
The target bad_loan is Boolean with 38,980 observations and a mean value of 0.1592. This means that 15.92% of the customers (rows) in this dataset have a loan that was not paid off.
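You can cross-check this figure directly in Snowflake. A minimal query along these lines should reproduce it, assuming bad_loan was loaded as a BOOLEAN (adjust the IFF expression if it is stored as a string):
-- Fraction of bad loans, computed in Snowflake
SELECT AVG(IFF(bad_loan, 1, 0)) AS bad_loan_rate,  -- expect about 0.1592
       COUNT(*)                 AS n_rows          -- expect about 38,980
FROM loans;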
Clicking the DATASET ROWS button on the upper right yields a spreadsheet format.
This is helpful in understanding the layout of the data. A quick inspection of your dataset using Details is a good practice that we always recommend.
Autoviz in Driverless AI automatically creates a variety of informative, interactive graphs designed for understanding the data to be used in building a predictive model. Autoviz is unique in that it shows only the graphs that are applicable, based on the characteristics of your data.
Right click the dataset name and select VISUALIZE to launch AutoViz.
The available visualizations for the loans data are shown below.
Selecting the SKEWED HISTOGRAMS section, for example, yields a series of histograms on only the columns that are sufficiently skewed. We show one below for the credit_length column.
Clicking the left and right navigation arrows allows you to inspect additional variables, ordered by their skewness.
Close the SKEWED HISTOGRAMS display and scroll down to see RECOMMENDATIONS.
Selecting RECOMMENDATIONS produces the following:
The philosophy underlying automatic visualizations is to make it easy for the data scientist to quickly understand their data fields, but it does not make decisions for the data scientist.
There are a number of additional useful graphs that can be navigated to fully understand your data prior to modeling.
Splitting data into train and test sets allows models to be built with the train set and evaluated on the test data. This protects against overfitting and yields more accurate error estimates. To use the Dataset Splitter utility, right click the dataset and select SPLIT.
Name your train and test splits, then select a split ratio (here we use 0.8).
For a time series use case, enter the time column. If your data have predefined folds for k-fold cross validation, enter the fold column. A seed is available for reproducibility. Select the target column bad_loan.
The data type of the target column determines the splitting algorithm. For classification problems, stratified random sampling is used; for numeric target columns, simple random sampling is used.
Click SAVE to create the datasets.
The train dataset has around 31,000 rows and the test dataset around 8,000 rows.
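For intuition only, here is a rough sketch of what an 80/20 split stratified by bad_loan looks like when written by hand in SQL. The Dataset Splitter does all of this for you (and its exact algorithm may differ); the table names loans_train and loans_test are hypothetical:
-- Illustration only: sample 80% within each bad_loan class
CREATE OR REPLACE TABLE loans_train AS
SELECT * EXCLUDE (rn, class_size)
FROM (
    SELECT l.*,
           ROW_NUMBER() OVER (PARTITION BY bad_loan ORDER BY RANDOM()) AS rn,
           COUNT(*)    OVER (PARTITION BY bad_loan)                    AS class_size
    FROM loans l
) s
WHERE s.rn <= 0.8 * s.class_size;

-- Everything not in the train split becomes the test split
CREATE OR REPLACE TABLE loans_test AS
SELECT l.*
FROM loans l
WHERE NOT EXISTS (SELECT 1 FROM loans_train t WHERE t.id = l.id);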
We use the term Experiment in Driverless AI to refer to the entire feature engineering and model evolution process. Instead of fitting one model, we fit many and use a "survival of the fittest" approach to optimize features and model hyperparameters. The result is a combined feature engineering and modeling pipeline, which can easily be investigated and promoted into production.
We start an experiment from the Datasets view by clicking on the line corresponding to the train dataset and selecting PREDICT from the dropdown menu.
This opens the following form for configuring an experiment. The numbered fields cover the experiment display name, the datasets, dropped columns, the target column, and other settings such as assigning the experiment to a Project.
For our experiment, enter "Baseline" as the display name (#1). Next select the TEST DATASET file test (#5). The desc column contains a written explanation from the customer describing the reason for requesting a loan. Although Driverless AI has extensive NLP (natural language processing) capabilities, we omit them in this baseline model, so under DROPPED COLUMNS (#3), select desc:
Next select bad_loan as the TARGET COLUMN (#6). You will have to scroll down, since bad_loan is the next-to-last variable in the dataset.
After selecting the target variable, Driverless AI analyzes the data and experimental settings and prefills additional options:
These include:
- The ACCURACY/TIME/INTERPRETABILITY dials, which range from 1 to 10 and largely determine the recipe for the experiment.
- The CLASSIFICATION/REPRODUCIBLE/GPUS DISABLED clickable buttons.
- The SCORER used in model building and evaluation.
- EXPERT SETTINGS, for fine control over a vast number of system, model, feature, recipe, and specialty options.
- LAUNCH EXPERIMENT, to run the experiment defined by the dial settings, scorer, and expert settings.

For our experiment:
- The target bad_loan is bool (Boolean) with 31,184 observations, 4,963 of which are equal to 1 (#1). The CLASSIFICATION button (#3) is enabled by default because the target is Boolean.
- The ACCURACY dial is set to 5. Higher values of accuracy are more computationally intensive. The description under (#6) shows that ACCURACY impacts how features are evaluated (model & validation strategy) and what form the final pipeline will take (individual models vs. ensembles, and the validation strategy).
- The TIME dial is set to 4. Higher values of TIME allow for longer feature evolution. TIME levels also include early stopping rules for efficiency.
- Higher ACCURACY and TIME do not always lead to better predictive models. Model performance should always be evaluated using a holdout test dataset.
- The INTERPRETABILITY dial ranges from 1 (least interpretable = most complex) to 10 (most interpretable = least complex). INTERPRETABILITY set to 7 or higher enables monotonicity constraints, which significantly increases model understanding.

Click on the REPRODUCIBLE button to enable reproducibility. This may be important for regulatory reasons or, as in our case, for educational purposes. Also select AUC as the scorer (#4).
Clicking on EXPERT SETTINGS (#5) exposes an immense array of options and settings.
This gives the expert data scientist complete control over the Driverless AI experience, including the ability to customize models, feature transformers, scorers, and data using CUSTOM RECIPES. Select CANCEL to exit out of the expert settings screen.
Before launching the experiment, your settings should look like the following.
Click LAUNCH EXPERIMENT to commence. The Driverless AI UI now includes a descriptive rotating dial in the center, with live monitoring displays for model evolution, variable importance, resource usage, and model evaluation.
To get more detailed resource monitoring, go to RESOURCES in the menu and select SYSTEM INFO.
The System Info view shows hardware usage and live activity monitoring of individual CPU cores.
Clicking CLOSE sends us back to the running Experiment Baseline view.
Note that:
- The CPU/MEMORY section (2) shows a compact version of the System Info view we just closed.
- Each dot in the ITERATION monitor corresponds to an individual model. The last model evaluated is a LightGBM model with 21 features and an AUC of 0.7316. Moving your mouse over any of the model dots will highlight that model and its summary information.
- The VARIABLE IMPORTANCE display shows the features of the latest model (or the model selected in the ITERATION DATA display) and their relative importance.

Selecting Notifications in the CPU/MEMORY section (2) opens important information and discoveries from Driverless AI. Ours reports the following:
Notifications are important to read and understand. The advice in notifications often leads to better models.
The technical data scientist might consider selecting Log in the CPU/MEMORY section. Driverless AI logs its entire process in great detail. Clicking Log opens a system logging window for live monitoring. Logs can be downloaded during or after the experiment.
Nearing the conclusion of the experiment, we see that the dial is at 100% complete, the elapsed time is approximately 6:30 (while results are reproducible, run times are not exactly reproducible), and the experiment stopped early, needing only 33 of 56 iterations.
Upon completion, the Experiment Baseline view replaces the spinning dial in the center with a stack of clickable bars.
The lower right panel includes an experiment summary, zoomed in below:
The summary contains information about the experiment settings, its seed, the train, validation, and test data, system (hardware) specifications, features created, models created, timing, and scores.
In particular, note that the MOJO latency timing of 0.13 milliseconds indicates the speed of scoring this model in production.
Selecting ROC in the lower right replaces the summary with the ROC curve.
You can toggle between VALIDATION METRICS and TEST SET METRICS for this display.
Selecting any point along the curve produces a confusion matrix with additional performance metrics.
You can also view other model performance metrics, including Precision-Recall, Lift chart, Gains chart, and Kolmogorov-Smirnov.
Once an experiment is completed, it is important to understand the final model's predictive performance, its features, parameters, and how the features and model combine to make a pipeline.
The DIAGNOSE MODEL ON NEW DATASET ... button is used to create extensive diagnostics for a model built in Driverless AI. After clicking the button, select the dataset to use for diagnostics; we will use the test dataset.
The Diagnostics view that is returned is very complete. You can choose from a plethora of Scores on the left, and each of the Metric Plots on the right is interactive.
Selecting the confusion matrix plot yields the following:
Likewise, the interactive ROC curve produces the following:
By default, an automated report is created for each experiment that is run. Download the AutoReport as shown below:
The document that is created is a very thorough summary of the experiment in the form of a white paper, documenting in detail the data, settings, and methodologies used to create the final pipeline.
This includes detailed information on the features that were engineered and the process for engineering them.
It also contains validation and test metrics and plots.
For this particular experiment, the AutoReport is a 36-page technically detailed document.
Selecting the VISUALIZE SCORING PIPELINE button returns a visual representation of the pipeline.
This pipeline is also available in the AutoReport, along with explanatory notes. The pipeline consists of the engineered features feeding the final model, which outputs predicted probabilities for bad_loan = False and bad_loan = True.
One of Driverless AI's most important features is the implementation of a host of cutting-edge techniques and methodologies for interpreting and explaining the results of black-box models. In this tutorial, we just highlight some of the MLI features available in Driverless AI without discussing their theoretical underpinnings.
To launch MLI from a completed experiment, select the INTERPRET THIS MODEL button.
The MLI view allows easy navigation through the various interactive plots.
The Dashboard view displays four useful summary plots. Each of these plots is available in a larger format from the main MLI view.
Other plots include feature importance on the transformed features and on the original features. Shapley values are also available for the transformed and original features.
The MLI view provides tools for disparate impact analysis and sensitivity analysis, also called "What If" analysis.
The final model from a Driverless AI experiment can be exported as either a MOJO scoring pipeline or a Python scoring pipeline. The MOJO scoring pipeline comes with a pipeline.mojo file that can be deployed in any environment that supports Java or C++. There are myriad deployment scenarios for real-time, batch, or stream scoring with the pipeline.mojo file. In this tutorial, we deploy the final model as a Snowflake Java UDF.
We need to collect the following components from Driverless AI:
- pipeline.mojo
- mojo2-runtime.jar
- H2oDaiScore.jar
- license.sig
We will download the first two files directly from Driverless AI. Select DOWNLOAD MOJO SCORING PIPELINE from the STATUS: COMPLETE buttons, and then DOWNLOAD MOJO SCORING PIPELINE again from the MOJO Scoring Pipeline instructions screen.
This downloads a file mojo.zip, which contains the pipeline.mojo and mojo2-runtime.jar files, along with a number of other files we will not need.
The next file, H2oDaiScore, is a custom scorer developed by H2O.ai to deploy MOJOs using Snowflake Java UDFs. It can be downloaded from H2O here: https://s3.amazonaws.com/artifacts.h2o.ai/releases/ai/h2o/dai-snowflake-integration/java-udf/download/index.html. Select the latest release (0.0.7 at the time of this writing). Extract the downloaded H2oScore-0.0.7.tgz file to find H2oDaiScore-0.0.7.jar.
Last, you will need your Driverless AI license file, license.sig.
The first step in creating a Java UDF in Snowflake is to put the 4 Driverless AI artifacts into the table stage, which was created when we created the loans table and uploaded data at the very beginning.
In order to do that, we will need to leverage SnowSQL (Snowflake's CLI tool), which will need to be installed locally so you can put the artifacts on your local computer into the table stage in your Snowflake Cloud.
Go to your command line and enter the following:
snowsql
You will be asked for your Account. This is part of the unique URL you were given when creating the trial; the URL has the form <account>.snowflakecomputing.com. Enter only the account portion.
Next, enter your User and Password. These are the login name and password you created after navigating to the unique URL of your Snowflake deployment.
Once logged in, you can now execute the following:
USE DATABASE PC_H2O_DB;
USE SCHEMA public;
USE WAREHOUSE PC_H2O_WH;
USE ROLE PC_H2O_ROLE;
Finally, we can now upload the 4 artifacts:
put file://{path}/pipeline.mojo @%loans auto_compress=false;
put file://{path}/license.sig @%loans auto_compress=false;
put file://{path}/H2oDaiScore-0.0.7.jar @%loans auto_compress=false;
put file://{path}/mojo2-runtime.jar @%loans auto_compress=false;
Note: you will need to change {path} in the put commands to the path where the downloaded files are located. The upload will take 1-2 minutes.
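To confirm that all four artifacts landed in the table stage, you can list its contents:
-- List the files in the loans table stage
LIST @%loans;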
We are now ready to create the Java UDF via the CREATE FUNCTION statement. To do so, you must provide the function's parameters and return type, the imports (pipeline.mojo and all of the other artifacts), and the handler (the class and method that performs the scoring).
The code has been prepared for you. At this point, it can be run either in SnowSQL or back in your GUI session.
CREATE FUNCTION H2OScore_Java(params STRING, rowData ARRAY)
RETURNS VARIANT
LANGUAGE JAVA
IMPORTS = ('@%loans/pipeline.mojo',
           '@%loans/license.sig',
           '@%loans/mojo2-runtime.jar',
           '@%loans/H2oDaiScore-0.0.7.jar')
HANDLER = 'h2oDai.H2oDaiScore.h2oDaiScore';
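If you want to verify that the function was registered, a quick check like the following should work (unquoted function names are stored in uppercase):
-- Confirm the UDF exists
SHOW USER FUNCTIONS LIKE 'H2OSCORE_JAVA';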
The syntax for calling a Java UDF in Snowflake is
SELECT <JAVA_UDF_FUNCTION_NAME>(<JAVA_UDF_FUNCTION_PARAMS>) FROM <TABLE_NAME>;
Important: H2O's custom scorer, H2oDaiScore.jar, has a unique feature to autogenerate the SQL command for scoring. Simply call the Java UDF you just created (H2OScore_Java) with the parameter Sql set to true.
For example,
SELECT H2OScore_Java('Modelname=pipeline.mojo Sql=true', ARRAY_CONSTRUCT());
Results Preview
"select ROW_NUMBER() OVER (ORDER BY (select 0)) as RowNumber, H2OScore_Java('Modelname=pipeline.mojo', ARRAY_CONSTRUCT(loan_amnt, term, int_rate, installment, emp_length, home_ownership, annual_inc, verification_status, addr_state, dti, delinq_2yrs, inq_last_6mths, pub_rec, revol_bal, revol_util, total_acc)) from <add-table-name>;"
Now let's look at an example using the H2OScore_Java UDF defined above to score our table loans using pipeline.mojo as follows:
SELECT
ROW_NUMBER() OVER (ORDER BY (select 0)) as RowNumber,
H2OScore_Java(
'Modelname=pipeline.mojo',
ARRAY_CONSTRUCT(loan_amnt, term, int_rate, installment, emp_length,
home_ownership, annual_inc, verification_status, addr_state,
dti, delinq_2yrs, inq_last_6mths, pub_rec, revol_bal, revol_util, total_acc)
) AS H2OScore
FROM loans;
It should take about 7 seconds to score and the results should look like this:
Results Preview (first 3 rows)
Row | ID | H2OScore |
1 | 1077501 | 0.8469023406505585 |
2 | 1077430 | 0.5798575133085251 |
3 | 1077175 | 0.5994115248322487 |
And as they say, that is all, folks! We have now scored a model inside Snowflake, which gives you the flexibility of Snowflake's scale-up and scale-out capabilities to score as much data as you want.
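If you want to persist the predictions rather than just view them, one option is to write the scores into a new table. A minimal sketch, mirroring the scoring query above (the table name loan_scores and column h2o_score are hypothetical):
-- Store the scores alongside the loan id for later analysis
CREATE OR REPLACE TABLE loan_scores AS
SELECT
    id,
    H2OScore_Java(
        'Modelname=pipeline.mojo',
        ARRAY_CONSTRUCT(loan_amnt, term, int_rate, installment, emp_length,
                        home_ownership, annual_inc, verification_status, addr_state,
                        dti, delinq_2yrs, inq_last_6mths, pub_rec, revol_bal,
                        revol_util, total_acc)
    ) AS h2o_score
FROM loans;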
A Snowflake Worksheet template to deploy and score DAI MOJOs using Java UDFs can be automatically generated using the H2O REST Server deployment:
curl "<ip>:<port>/autogen?name=<model_name>¬ebook=snowflake.udf"
For example,
curl "http://127.0.0.1:8080/autogen?name=pipeline.mojo¬ebook=snowflake.udf"