In this guide, we'll be walking you through how to validate that the assumptions you've made about your identity model are correct using Simon Data's Snowflake native app, IdentityQA. Let's get going!
By the end of this guide, you'll learn:
First, it's important to understand how Snowflake native apps work and the benefits they provide from a security & privacy standpoint.
By default, the application has no permissions to do anything in your Snowflake account. You must grant account- & object-level privileges to the application. Once permission is granted, the app only has read-only access to the data within and will never make any changes or transformations to the data. Once the app is run, the output is a set of tables in the application schema that you can query to see the results of the QA on your identity model.
In Snowflake, under Marketplace, search for Simon Data's IdentityQA listing. Click Get to request the application for your Snowflake account.
You will be asked to select a warehouse to be used for installation and name the application.
In Snowflake, under Data → Private Sharing, install the IdentityQA listing.
Once installed, you'll have to change the name of the app to what you named it upon installation by running the following command in a worksheet:
CREATE APPLICATION <CHOSEN APP NAME>
FROM APPLICATION PACKAGE IDQAAPP;
Expand the APP schema and see a number of Procedures that make up the IdentityQA app. The rest of this guide will walk you through how to use those procedures.
Before the app can run, however, you need to give it access to the correct database, schema, and table. Run the following commands:
GRANT USAGE ON DATABASE <YOUR_DATA_DB_NAME> to APPLICATION <CHOSEN APP NAME>;
GRANT USAGE ON SCHEMA <YOUR_DATA_SCHEMA_NAME> to APPLICATION <CHOSEN APP NAME>;
GRANT SELECT ON TABLE <YOUR_TABLE_NAME> TO APPLICATION <CHOSEN APP NAME>;
In a worksheet, run the SET_INPUT_TABLE stored procedure, as follows. Be sure to include as many lines as you have identifiers in your source table.
CALL <CHOSEN APP NAME>.APP.SET_INPUT_TABLE('SOURCE_TABLE', [['<SOURCE_COLUMN>', '<IDENTIFIER_TYPE>', <IS_STABLE>],
['<SOURCE_COLUMN>', '<IDENTIFIER_TYPE>', <IS_STABLE>],
['<SOURCE_COLUMN>', '<IDENTIFIER_TYPE>', <IS_STABLE>]]);
CALL SIMONIDQA.APP.SET_INPUT_TABLE('SIMONDATA_TEST.PUBLIC.IDENTITY', [['EMAIL', 'EMAIL', TRUE],
['CLIENT_ID', 'CUSTOM_ID', FALSE],
['PHONE_NUMBER', 'PHONE_NUMBER', FALSE]]);
Once you've configured your identifiers, you can set certain constraints (or assumptions) to validate. These constraints outline the relationships between two identifiers (i.e. 1:1 or 1:many) and identifier-specific limits (e.g. device_id can be shared across a maximum of 5 profiles).
CALL <CHOSEN APP NAME>.APP.SET_RELATIONSHIP_CONSTRAINT('<RELATIONSHIP_TYPE>', '<IDENTIFIER_1>', '<IDENTIFIER_2>');
The query below tells the application to validate whether or not user_id and email truly have a 1:1 relationship. If there are instances where the 1:1 relationship constraint is invalidated, the app flags them in the generated report.
CALL SIMONIDQA.APP.SET_RELATIONSHIP_CONSTRAINT('ONE-TO-ONE', 'CLIENT_ID', 'EMAIL');
This constraint checks for the uniqueness of the values for a particular identifier.
CALL <CHOSEN APP NAME>.APP.SET_UNIQUE_CONSTRAINT('<IDENTIFIER_NAME>');
CALL SIMONIDQA.APP.SET_UNIQUE_CONSTRAINT('EMAIL');
This constraint sets a maximum threshold for which a single identifier can be shared across different profiles.
CALL <CHOSEN APP NAME>.APP.SET_SHARED_IDENTIFIER_LIMIT('<IDENTIFIER_NAME>', <THRESHOLD>);
A customer may have a landline that they share with everyone in the household. Let's say the average number of people in a household is 3 and you believe your customer base tends to have landlines. You could set your shared identifier limit threshold to 3 to validate this assumption.
CALL SIMONIDQA.APP.SET_SHARED_IDENTIFIER_LIMIT('phone_number', 3);
Run the following command to generate your IdentityQA report!
CALL <CHOSEN APP NAME>.APP.generate_report();
The output will be available in the following tables in the REPORT schema within the SIMON_IDENTITY_QA application as well as on the Report page in the Streamlit. To query the tables directly in the REPORT schema within the application, you can run the following:
-- RUN THIS QUERY TO PULL THE TABLE NAMES SPECIFIC TO YOUR IDENTIFIERS
SELECT * FROM <CHOSEN APP NAME>.REPORT.CARDINALITY_TABLES
-- THEN RUN A SELECT * ON EACH OF THE TABLES GENERATED. FOR EXAMPLE:
SELECT * FROM <CHOSEN APP NAME>.REPORT.CARDINALITY_CHECK_USER_ID_EMAIL
-- THE REST OF THESE QUERIES CAN BE RUN TO INSPECT OTHER RESULTS
SELECT * FROM <CHOSEN APP NAME>.REPORT.IDENTIFIER_VALIDATION
SELECT * FROM <CHOSEN APP NAME>.REPORT.INVALID_IDENTIFIERS
SELECT * FROM <CHOSEN APP NAME>.REPORT.RELATIONSHIP_CONSTRAINTS_CHECKS
SELECT * FROM <CHOSEN APP NAME>.REPORT.SHARED_IDENTIFIER_LIMIT_CHECKS
SELECT * FROM <CHOSEN APP NAME>.REPORT.UNIQUE_CONSTRAINTS_CHECKS
To view your report in Streamlit, return to the app & launch it in Streamlit. Steps below:
This is an example input table that we used to generate the report shown in screenshots below.
The top of the report shows you the name of the input table you configured as well as the total number of rows in said input table.
Next, the report will show pass/fail statuses on the constraints you configured.
As you can see in the screenshot below, the client_id & email identifiers did not pass the 1:1 relationship check. This tells you that the app found instances where an email or client_id had more than one of the other identifier. Examples of these failures will be explored later.
Additionally, the shared identifier limit check failed on phone_number. This means that the app discovered instances where a single phone_number appears on more than 3 profiles, which violates the assumption that phone_number is shared at most across 3 profiles at any given time.
The next check is a high-cardinality test for each identifier in the input table compared to the stable identifier. IdentityQA tests for cardinality because exceptionally high counts of any identifier per single stable identifier generally mean that there is an issue with the underlying data. In the screenshot below, you can see that there are 10 profiles that have an abnormally high count of client_ids compared to the other 15 profiles in the sample. This tells you that these emails may be fake, or maybe they're emails used for testing purposes and should be excluded from marketing.
Your next step here should be to dive into these email addresses and determine why they might have so many client_ids associated with them and if they're valid profiles or not.
Next, the report shows email & phone_number validations. You can see in this example that out of the ~1M total records in the input table, ~30K emails are considered invalid and ~80K phone numbers are considered invalid.
Congratulations! You've successfully QA'd your customer identity model using Simon Data's Native Snowflake App, IdentityQA! You now have the insight you need to be able to finetune & improve your identity model.