Is Soda the data quality testing solution you've been looking for? 🥤Take a sip and see! Use this guide to install Soda, connect it to your Snowflake data source, and run a simple Soda scan for data quality.
Soda is a tool that enables Data Engineers to test data for quality where and when they need to.
Is your data fresh? Is it complete or missing values? Are there unexpected duplicate values? Did something go wrong during transformation? Are all the data values valid? These are the questions that Soda answers for Data Engineers.
Soda works by taking the data quality checks that you prepare and using them to run a scan of datasets in a data source. A scan is a CLI command which instructs Soda to prepare optimized SQL queries that execute data quality checks on your data source to find invalid, missing, or unexpected data. When checks fail, they surface bad-quality data and present check results that help you investigate and address quality issues.
To test your data quality, you install the Soda Library CLI tool and sign up for a Soda Cloud account so that you can complete the following tasks:
configuration.yml
file which stores access details for your data source such as host, port, and data source login credentials.checks.yml
file. A Soda Check is a test that Soda performs when it scans a dataset in your data source. The checks YAML file stores the checks you write using the Soda Checks Language (SodaCL), a domain-specific language for data quality testing.configuration.yml
file to push check results to your account so you can access visualized scan results, set alert notifications, track trends in data quality over time, and integrate with the messaging, ticketing, and data cataloging tools you already use, like Slack, Jira, and Alation.Add Soda data quality checks to your data pipeline to prevent downstream issues.
Use GitHub Actions to add automated Soda data quality checks to your development workflow to prevent merging issues into production.
mkdir soda_sip
cd soda_sip
.venv
directory. python3 -m venv .venv
source .venv/bin/activate
pip install -i https://pypi.cloud.soda.io soda-snowflake
soda --help
# Example output
Usage: soda [OPTIONS] COMMAND [ARGS]...
Soda Library CLI version 1.0.0, Soda Core CLI version 3.0.39
Options:
--version Show the version and exit.
--help Show this message and exit.
Commands:
ingest Ingests test results from a different tool
scan Runs a scan
suggest Generates suggestions for a dataset
test-connection Tests a connection
update-dro Updates contents of a distribution reference file
To exit the virtual environment when you are done with this tutorial, use the command deactivate
.
To connect Soda to Snowflake, you use a configuration.yml
file which stores access details for your data source.
This guide also instructs you to connect to a Soda Cloud account using API keys that you create and add to the same configuration.yml
file. Available for free as a 45-day trial, your Soda Cloud account gives you access to visualized scan results, tracks trends in data quality over time, enables you to set alert notifications, and much more.
configuration.yml
and save it in your soda_sip
directory.my_datasource_name
with the name of your own Snowflake data source. data_source my_datasource_name:
type: snowflake
connection:
username: ${SNOWFLAKE_USER}
password: ${SNOWFLAKE_PASS}
account: plu449.us-west-1
database: sodadata_test
warehouse: compute_wh
role: analyst
session_parameters:
QUERY_TAG: soda-queries
QUOTED_IDENTIFIERS_IGNORE_CASE: false
schema: public
soda_cloud
syntax and paste into your configuration.yml
file. Do not nest the soda_cloud
syntax in the data_source
block. data_source my_datasource_name:
type: snowflake
connection:
username: ${SNOWFLAKE_USER}
password: ${SNOWFLAKE_PASS}
...
soda_cloud:
host: cloud.soda.io
api_key_id: 2ca***4679
api_key_secret: 1iDldq***vhg
configuration.yml
file and close the API modal in your Soda account.soda_sip
directory, run the following command to test Soda's connection to Snowflake, replacing the value of my_datasource_name
with the name of your Snowflake data source. soda test-connection -d my_datasource_name -c configuration.yml
# Example output
Soda Library 1.0.0
Soda Core 3.0.39
Successfully connected to 'adventureworks'.
Connection 'adventureworks' is valid.
Need help? Ask the Soda community on Slack.A check is a test that Soda executes when it scans a dataset in your data source. The checks.yml
file stores the checks you write using the Soda Checks Language (SodaCL). You can create multiple checks.yml files to organize your data quality checks and run all, or some of them, at scan time.
soda_sip
directory, create another file named checks.yml
.checks.yml
file in your code editor, then copy and paste the following rather generic checks into the file. Note that the row_count
check is written to fail to demonstrate what happens when a data quality check fails.dataset_name
with the name of a dataset in your Snowflake data source.column1
with the name of a column in the dataset. checks for dataset_name:
# Checks that dataset contains fewer than 2 rows; written to fail
- row_count < 2:
name: Dataset is unreasonably small
# Checks that column contains no NULL values
- missing_count(column1) = 0:
name: No NULL values
# Checks for columns removed or added, or changed type or index
- schema:
warn:
when schema changes: any
name: No changes to schema
checks.yml
file, then, from the command-line, use the following command to run a scan. A scan is a CLI command which instructs Soda to prepare SQL queries that execute data quality checks on your data source. As input, the command requires:-d
the name of the data source to scan; replace my_datasource_name
with your Snowflake data source name-c
the filepath and name of the configuration.yml
filechecks.yml
file soda scan -d my_datasource_name -c configuration.yml checks.yml
# Example output
Soda 1.0.0
Soda Core 3.0.39
By downloading and using Soda Library, you agree to Sodas Terms & Conditions (https://go.soda.io/t&c) and Privacy Policy (https://go.soda.io/privacy).
Sending failed row samples to Soda Cloud
Scan summary:
3/5 checks PASSED:
dim_customer in adventureworks
No changes to schema [PASSED]
Emails formatted correctly [PASSED]
No null values for last name [PASSED]
2/5 checks FAILED:
dim_customer in adventureworks
No duplicate phone numbers [FAILED]
check_value: 715
Data is fresh [FAILED]
max_column_timestamp: 2014-01-28 23:59:59.999999
max_column_timestamp_utc: 2014-01-28 23:59:59.999999+00:00
now_variable_name: NOW
now_timestamp: 2023-04-24T21:02:15.900007+00:00
now_timestamp_utc: 2023-04-24 21:02:15.900007+00:00
freshness: 3372 days, 21:02:15.900008
Oops! 2 failures. 0 warnings. 0 errors. 3 pass.
Sending results to Soda Cloud
Soda Cloud Trace: 4417******32502
Need help? Ask the Soda community on Slack.✨Well done!✨ You've taken the first step towards a future in which you and your colleagues can trust the quality and reliability of your data. Huzzah!
Now that you have seen Soda in action, learn more about how and where to integrate data quality into your existing workflows and pipelines.