soda-data-quality

Is Soda the data quality testing solution you've been looking for? 🥤 Take a sip and see! Use this guide to install Soda, connect it to your Snowflake data source, and run a simple Soda scan for data quality.

Soda is a tool that enables Data Engineers to test data for quality where and when they need to.

Is your data fresh? Is it complete or missing values? Are there unexpected duplicate values? Did something go wrong during transformation? Are all the data values valid? These are the questions that Soda answers for Data Engineers.

Prerequisites

What You Will Learn

What You Need

What You Will Build

Soda works by taking the data quality checks that you prepare and using them to run a scan of datasets in a data source. A scan is a CLI command which instructs Soda to prepare optimized SQL queries that execute data quality checks on your data source to find invalid, missing, or unexpected data. When checks fail, they surface bad-quality data and present check results that help you investigate and address quality issues.

To test your data quality, you install the Soda Library CLI tool and sign up for a Soda Cloud account so that you can complete the following tasks:

Add Soda data quality checks to your data pipeline to prevent downstream issues.

Use GitHub Actions to add automated Soda data quality checks to your development workflow to prevent merging issues into production.

  1. In your command-line interface, create a Soda project directory in your local environment, then navigate to the directory.
    mkdir soda_sip
    cd soda_sip
    
  2. As a best practice, install Soda in a virtual environment. In your command-line interface, create a virtual environment in a .venv directory.
    python3 -m venv .venv
    
  3. Activate the virtual environment.
    source .venv/bin/activate
    
  4. Execute the following command to install the Soda package for Snowflake in your virtual environment.
    pip install -i https://pypi.cloud.soda.io soda-snowflake
    
  5. Validate the installation.
    soda --help
    
    # Example output
    Usage: soda [OPTIONS] COMMAND [ARGS]...
    
      Soda Library CLI version 1.0.0, Soda Core CLI version 3.0.39
    
    Options:
      --version  Show the version and exit.
      --help     Show this message and exit.
    
    Commands:
      ingest           Ingests test results from a different tool
      scan             Runs a scan
      suggest          Generates suggestions for a dataset
      test-connection  Tests a connection
      update-dro       Updates contents of a distribution reference file
    

To exit the virtual environment when you are done with this tutorial, use the command deactivate.

To connect Soda to Snowflake, you use a configuration.yml file which stores access details for your data source.

This guide also instructs you to connect to a Soda Cloud account using API keys that you create and add to the same configuration.yml file. Available for free as a 45-day trial, your Soda Cloud account gives you access to visualized scan results, tracks trends in data quality over time, enables you to set alert notifications, and much more.
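
The example configuration in the steps below reads the Snowflake username and password from the environment variables SNOWFLAKE_USER and SNOWFLAKE_PASS instead of storing them in the file. A minimal sketch of setting those variables in the same shell session, before you run any Soda command, might look like the following. The values shown are placeholders, not real credentials.

```shell
# Placeholder values (assumptions); replace them with your own Snowflake credentials.
export SNOWFLAKE_USER="your_snowflake_username"
export SNOWFLAKE_PASS="your_snowflake_password"

# Confirm that both variables are set without printing the password.
[ -n "$SNOWFLAKE_USER" ] && [ -n "$SNOWFLAKE_PASS" ] && echo "Snowflake credentials are set"
```

Variables exported this way last only for the current shell session, so set them again if you open a new terminal.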

  1. In a code editor such as Sublime or Visual Studio Code, create a new file called configuration.yml and save it in your soda_sip directory.
  2. Copy and paste the connection configuration details for Snowflake into the file, as in the data_source block of the example under step 5.
  3. In a browser, navigate to cloud.soda.io/signup to create a new Soda account. If you already have a Soda account, log in.
  4. Navigate to your avatar > Profile, then access the API keys tab. Click the plus icon to generate a new set of API keys.
  5. Copy the soda_cloud syntax and paste it into your configuration.yml file. Do not nest the soda_cloud syntax in the data_source block.
     data_source my_datasource_name:
       type: snowflake
       connection:
         username: ${SNOWFLAKE_USER}
         password: ${SNOWFLAKE_PASS}
         ...
    
     soda_cloud:
       host: cloud.soda.io
       api_key_id: 2ca***4679
       api_key_secret: 1iDldq***vhg
    
  6. Save the configuration.yml file and close the API modal in your Soda account.
  7. From the command line, in the virtual environment in the soda_sip directory, run the following command to test Soda's connection to Snowflake, replacing the value of my_datasource_name with the name of your Snowflake data source.
    soda test-connection -d my_datasource_name -c configuration.yml
    
    # Example output
    Soda Library 1.0.0
    Soda Core 3.0.39
    Successfully connected to 'adventureworks'.
    Connection 'adventureworks' is valid.
    
    Need help? Ask the Soda community on Slack.

A check is a test that Soda executes when it scans a dataset in your data source. The checks.yml file stores the checks you write using the Soda Checks Language (SodaCL). You can create multiple checks.yml files to organize your data quality checks and run all, or some of them, at scan time.

  1. In the same soda_sip directory, create another file named checks.yml.
  2. Open the checks.yml file in your code editor, then copy and paste the following rather generic checks into the file. Note that the row_count check is written to fail to demonstrate what happens when a data quality check fails.
  3. Save the checks.yml file, then, from the command line, run a scan. A scan is a CLI command which instructs Soda to prepare SQL queries that execute data quality checks on your data source. As input, the command requires the name of the data source to scan, the path to the configuration.yml file, and the path to the checks.yml file.
  4. As you can see from the CLI output, some checks failed and Soda sent the results to your Soda Cloud account. To access visualized check results and further examine the failed checks, return to your Soda account in your browser and click Checks.
  5. In the table of check results that Soda displays, click the line item for one of the failed checks to examine the visualized results in a line graph and to access the failed row samples that Soda automatically collected when it ran the scan and executed the checks.
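
The step that writes the checks and the step that runs the scan can be sketched as follows. This is an illustration only: my_dataset and my_column are hypothetical names that you would replace with a dataset and column from your own Snowflake data source, and the three checks are examples of common SodaCL checks, not necessarily the ones the original guide supplied. The row_count check is written so that it fails on any dataset with two or more rows.

```shell
# Write a small, illustrative checks.yml.
# my_dataset and my_column are hypothetical names (assumptions).
cat > checks.yml <<'EOF'
checks for my_dataset:
  # Written to fail on any dataset that has two or more rows
  - row_count < 2
  # Passes only if the column contains no missing (NULL) values
  - missing_count(my_column) = 0
  # Passes only if the column contains no repeated values
  - duplicate_count(my_column) = 0
EOF

# Run the scan only if the Soda CLI is available in this environment.
# -d names the data source, -c points to the configuration file,
# and the final argument is the path to the checks file.
if command -v soda >/dev/null 2>&1; then
  soda scan -d my_datasource_name -c configuration.yml checks.yml \
    || echo "soda scan exited with status $?"
fi
```

Run the sketch from the soda_sip directory with the virtual environment active, so that the soda command and configuration.yml are both found.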

✨Well done!✨ You've taken the first step towards a future in which you and your colleagues can trust the quality and reliability of your data. Huzzah!

Now that you have seen Soda in action, learn more about how and where to integrate data quality into your existing workflows and pipelines.

Choose Your Adventure

Experiment

Sip More Soda

Need help?