The Business Problem

Imagine you're a merchandising manager at ABT, a consumer electronics retailer. You need to perform a competitive pricing analysis against Best Buy to ensure your prices remain competitive in the market. You've successfully scraped product data from Best Buy's website, but now face a critical challenge: how do you match Best Buy's product descriptions to your own catalog when the data formats, naming conventions, and product identifiers are completely different?

This is a classic data harmonization problem – determining which records in different datasets refer to the same real-world entity. Traditional solutions would require extensive manual mapping, brittle rule-based systems, or complex ML model training.

In this quickstart, you'll build a modern, AI-powered data harmonization pipeline that leverages Snowflake's native capabilities to automatically harmonize and match product data with minimal manual effort.

The Solution Architecture

This quickstart demonstrates a three-stage approach to data harmonization:

  1. Data Harmonization: Use Snowflake Cortex AI to analyze schema differences and create unified datasets with semantic field mappings
  2. Hybrid Entity Matching: Combine vector similarity (fast, efficient) with AI_CLASSIFY (intelligent, context-aware) for optimal matching performance
  3. Unmatched Record Reconciliation: Provide an interactive interface for reviewing and correcting low-confidence matches



What You'll Learn

What You'll Need

What You'll Build

By the end of this quickstart, you will have built:

  1. A Data Harmonization Streamlit app that uses AI to map schema differences between datasets
  2. A Hybrid Entity Matching notebook that combines vector similarity and AI classification for intelligent product matching
  3. An Unmatched Records Review app for human-in-the-loop validation of uncertain matches
  4. A complete, production-ready data harmonization pipeline with audit trails and quality metrics

The solution processes 1,000+ product records from two different retailers and produces high-quality matches with confidence scores, achieving 85%+ accuracy through the hybrid approach.

Download the Source Files

Note on the Datasets Used

In this section, you'll prepare your Snowflake environment by creating the necessary database objects and loading the sample datasets. We'll be working with three CSV files containing product data from ABT and Best Buy, plus a ground truth mapping file for validation.

Step 1: Create Database and Schema

First, create a dedicated database and schema for this project:

-- Create the main database
CREATE OR REPLACE DATABASE ABT_BEST_BUY;

-- Create the schema to organize our tables
CREATE OR REPLACE SCHEMA ABT_BEST_BUY.STRUCTURED;

-- Set context for subsequent operations
USE DATABASE ABT_BEST_BUY;
USE SCHEMA STRUCTURED;

Step 2: Create a Warehouse

Create a warehouse sized appropriately for the AI and vector operations we'll be performing:

-- Create your data harmonization Snowflake Warehouse (we used a 2XL in the demo video)
CREATE OR REPLACE WAREHOUSE ENTITY_RESOLUTION_WH
WITH
    WAREHOUSE_SIZE = '2X-LARGE'
    AUTO_SUSPEND = 300
    AUTO_RESUME = TRUE
    INITIALLY_SUSPENDED = TRUE
    COMMENT = 'Warehouse for data harmonization with Cortex AI';

-- Set the warehouse as active
USE WAREHOUSE ENTITY_RESOLUTION_WH;

Step 3: Create File Format

Create a reusable file format for loading CSV files:

-- Create file format for CSV loading
CREATE OR REPLACE FILE FORMAT ABT_BEST_BUY.STRUCTURED.CSV_FORMAT
    TYPE = CSV
    SKIP_HEADER = 1
    FIELD_DELIMITER = ','
    TRIM_SPACE = TRUE
    FIELD_OPTIONALLY_ENCLOSED_BY = '"'
    REPLACE_INVALID_CHARACTERS = TRUE
    DATE_FORMAT = AUTO
    TIME_FORMAT = AUTO
    TIMESTAMP_FORMAT = AUTO
    COMMENT = 'CSV format for data harmonization data';

Step 4: Create Internal Stage

Create an Internal Stage where we will upload our CSV files:

-- Create stage for CSV files
CREATE OR REPLACE STAGE ABT_BEST_BUY.STRUCTURED.ABT_BEST_BUY_DATA_STAGE
 -- Enable the directory table feature
    DIRECTORY = ( ENABLE = TRUE )
    COMMENT = 'Internal Stage for ABT and Best Buy product data';

-- Verify stage contents (which will be empty initially)
LIST @ABT_BEST_BUY.STRUCTURED.ABT_BEST_BUY_DATA_STAGE;

Step 5: Add Datasets to Internal Stage



Unzip the downloaded archive. Then, using Snowsight, upload the three CSV files to the stage we just created:



After the files are uploaded, run the LIST command again to verify they are in the stage. The three files you just uploaded should appear in the results.

-- Verify stage contents
LIST @ABT_BEST_BUY.STRUCTURED.ABT_BEST_BUY_DATA_STAGE;

Step 6: Create Tables

Now create the three tables that will hold our product data:

-- Create ABT products table
CREATE OR REPLACE TABLE ABT_BEST_BUY.STRUCTURED.ABT (
    SKU NUMBER(38,0),
    PRODUCT_LABEL VARCHAR(16777216),
    ITEM_DETAILS VARCHAR(16777216),
    RETAIL_PRICE VARCHAR(16777216)
);

-- Create Best Buy products table
CREATE OR REPLACE TABLE ABT_BEST_BUY.STRUCTURED.BEST_BUY (
    PRODUCTID NUMBER(38,0),
    NAME VARCHAR(16777216),
    DESCRIPTION VARCHAR(16777216),
    MANUFACTURER VARCHAR(16777216),
    PRICE VARCHAR(16777216)
);

-- Create ground truth mapping table for validation
CREATE OR REPLACE TABLE ABT_BEST_BUY.STRUCTURED.ABT_BEST_BUY_PERFECT_MAPPING (
    IDABT NUMBER(38,0),
    IDBUY NUMBER(38,0)
);

Step 7: Load Data

Load data from the staged CSV files into the tables:

-- Load ABT products
COPY INTO ABT_BEST_BUY.STRUCTURED.ABT
FROM (
    SELECT 
        $1::NUMBER AS SKU,
        $2::VARCHAR AS PRODUCT_LABEL,
        $3::VARCHAR AS ITEM_DETAILS,
        $4::VARCHAR AS RETAIL_PRICE
    FROM @ABT_BEST_BUY.STRUCTURED.ABT_BEST_BUY_DATA_STAGE
)
FILES = ('Abt.csv')
FILE_FORMAT = (FORMAT_NAME = 'ABT_BEST_BUY.STRUCTURED.CSV_FORMAT')
ON_ERROR = CONTINUE;

-- Load Best Buy products
COPY INTO ABT_BEST_BUY.STRUCTURED.BEST_BUY
FROM (
    SELECT 
        $1::NUMBER AS PRODUCTID,
        $2::VARCHAR AS NAME,
        $3::VARCHAR AS DESCRIPTION,
        $4::VARCHAR AS MANUFACTURER,
        $5::VARCHAR AS PRICE
    FROM @ABT_BEST_BUY.STRUCTURED.ABT_BEST_BUY_DATA_STAGE
)
FILES = ('Buy.csv')
FILE_FORMAT = (FORMAT_NAME = 'ABT_BEST_BUY.STRUCTURED.CSV_FORMAT')
ON_ERROR = CONTINUE;

-- Load ground truth mapping
COPY INTO ABT_BEST_BUY.STRUCTURED.ABT_BEST_BUY_PERFECT_MAPPING
FROM (
    SELECT 
        $1::NUMBER AS IDABT,
        $2::NUMBER AS IDBUY
    FROM @ABT_BEST_BUY.STRUCTURED.ABT_BEST_BUY_DATA_STAGE
)
FILES = ('abt_buy_perfectMapping.csv')
FILE_FORMAT = (FORMAT_NAME = 'ABT_BEST_BUY.STRUCTURED.CSV_FORMAT')
ON_ERROR = CONTINUE;

Step 8: Verify Data Load

Confirm that data has been loaded successfully:

-- Check ABT table
SELECT 'ABT' AS TABLE_NAME, COUNT(*) AS RECORD_COUNT FROM ABT_BEST_BUY.STRUCTURED.ABT
UNION ALL
SELECT 'BUY' AS TABLE_NAME, COUNT(*) AS RECORD_COUNT FROM ABT_BEST_BUY.STRUCTURED.BEST_BUY
UNION ALL
SELECT 'ABT_BUY_PERFECTMAPPING' AS TABLE_NAME, COUNT(*) AS RECORD_COUNT FROM ABT_BEST_BUY.STRUCTURED.ABT_BEST_BUY_PERFECT_MAPPING;

-- Preview sample records from each table
SELECT 'ABT Sample' AS SOURCE, * FROM ABT_BEST_BUY.STRUCTURED.ABT LIMIT 100;
SELECT 'BUY Sample' AS SOURCE, * FROM ABT_BEST_BUY.STRUCTURED.BEST_BUY LIMIT 100;
SELECT 'Mapping Sample' AS SOURCE, * FROM ABT_BEST_BUY.STRUCTURED.ABT_BEST_BUY_PERFECT_MAPPING LIMIT 100;

Expected Output:



Your Snowflake environment is now ready! In the next section, we'll use the Data Harmonization Streamlit app to create unified datasets from these disparate schemas.

Understanding the Challenge

Looking at the ABT and BEST_BUY tables you just created, you'll notice they have completely different column names: ABT uses SKU, PRODUCT_LABEL, ITEM_DETAILS, and RETAIL_PRICE, while BEST_BUY uses PRODUCTID, NAME, DESCRIPTION, MANUFACTURER, and PRICE.

While both tables contain product information, the column names, data types, and structures differ. Before we can match products between these datasets, we need to harmonize them – creating a unified schema with consistent field mappings.

Traditional approaches would require manual analysis and custom ETL code. Instead, we'll use Snowflake Cortex AI to automatically analyze these schemas and recommend semantic mappings.
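
For intuition, the heart of that analysis is a single Cortex call. Below is a minimal sketch, assuming the app prompts the model with the column lists of both tables; the real app builds a much richer prompt from its profiling step, and the prompt wording here is illustrative only.

-- Minimal sketch: ask Cortex AI to propose semantic mappings (prompt text is an assumption, not the app's)
SELECT SNOWFLAKE.CORTEX.COMPLETE(
    'mistral-large',
    'You are mapping two product catalogs to a unified schema. ' ||
    'Table ABT has columns: SKU, PRODUCT_LABEL, ITEM_DETAILS, RETAIL_PRICE. ' ||
    'Table BEST_BUY has columns: PRODUCTID, NAME, DESCRIPTION, MANUFACTURER, PRICE. ' ||
    'Return a JSON object mapping each source column to one of: ' ||
    'product_id, product_name, product_description, manufacturer, price, or None.'
) AS mapping_recommendation;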

What the Data Harmonization App Does

The Data Harmonization Streamlit application performs the following workflow:

  1. Table Profiling: Analyzes both tables to understand column types, cardinality, null rates, sample values, and statistical distributions
  2. AI-Powered Mapping: Uses Snowflake Cortex AI (mistral-large model) to recommend semantic mappings between fields based on content analysis
  3. Interactive Review: Allows you to review and adjust the AI recommendations through a user-friendly interface
  4. Dataset Creation: Generates harmonized output tables (see the sketch after this list) with:
    • Consistent column names (e.g., PRODUCT_ID, PRODUCT_NAME, PRODUCT_DESCRIPTION)
    • Cleaned text (special characters removed)
    • An audit table tracking all mapping decisions

Running the Data Harmonization App

Step 1: Access Streamlit in Snowflake

  1. Navigate to Projects > Streamlit in Snowsight
  2. Click + Streamlit App
  3. Name it Data_Harmonization_App
  4. Select your ABT_BEST_BUY database and STRUCTURED schema
  5. Select the ENTITY_RESOLUTION_WH warehouse

Step 2: Upload the Application Code

  1. Delete the example code in the streamlit_app.py file
  2. Copy the contents of Data Harmonization with Snowflake Cortex AI.py
  3. Paste the copied code into streamlit_app.py



4. Ensure you have the appropriate packages installed.



5. Click Run to launch the application

Step 3: Configure Tables

In the app interface:

  1. Verify the database and schema are set to ABT_BEST_BUY.STRUCTURED
  2. Select ABT as the Reference Table
  3. Select BEST_BUY as the Input Table
  4. The app will automatically profile both tables



Step 4: Generate AI Recommendations

  1. Click Run AI Analysis to invoke Cortex AI
  2. The AI will analyze the schemas and suggest mappings:
    • ABT.SKU → product_id ← BEST_BUY.PRODUCTID
    • ABT.PRODUCT_LABEL → product_name ← BEST_BUY.NAME
    • ABT.ITEM_DETAILS → product_description ← BEST_BUY.DESCRIPTION
    • (The MANUFACTURER field maps to 'None' on the ABT side because ABT has no corresponding column.)

The Cortex AI model analyzes your data to identify:

Sample Profiling Output

For each column, the app computes:

Step 5: Review and Adjust Mappings

In the Review and Edit Mappings section:

Your output should look like this:



Step 6: Create Harmonized Datasets

  1. Preview the output structure to verify column names and transformations
  2. Set the target database and schema (keep defaults: ABT_BEST_BUY.STRUCTURED)
  3. Click Create Datasets



The app will create three tables:

Audit Table Structure

The audit table tracks:

Why This Matters

By harmonizing the data:

  1. Standardized Schema: Both datasets now have consistent column names
  2. Clean Text: Special characters removed, enabling better matching
  3. Audit Trail: Complete lineage from source to harmonized tables
  4. Ready for Matching: Data is now in the perfect format for the hybrid entity matching algorithm

In the next section, we'll use these harmonized tables to perform intelligent entity matching using a hybrid approach that combines vector similarity with AI classification.

Now that we have harmonized datasets, it's time to tackle the core challenge: determining which ABT products correspond to which Best Buy products. This is where the magic of hybrid entity matching comes in.

The Hybrid Matching Strategy

Traditional entity matching approaches typically fall into two camps:

  1. Rule-based matching: Fast but brittle (e.g., "if product names match exactly")
  2. Pure ML models: Accurate but require training data and can be slow

Our hybrid approach combines the best of both worlds: fast vector similarity narrows the candidate space, and AI_CLASSIFY is applied only to the ambiguous cases that need deeper reasoning.

This strategy is inspired by Snowflake's AI SQL best practices and provides an optimal balance of speed, accuracy, and cost.

Step 1: Upload the Data Harmonization Notebook

  1. From the Snowflake navigation pane, click Create > Notebook > Import .ipynb File
  2. Locate the HYBRID_ENTITY_MATCHING_WITH_AI_CLASSIFY.ipynb file you downloaded and upload it
  3. Name it HYBRID_ENTITY_MATCHING_WITH_AI_CLASSIFY
  4. Select your ABT_BEST_BUY database and STRUCTURED schema
  5. Select the ENTITY_RESOLUTION_WH warehouse as your query warehouse
  6. All other defaults can be kept the same



Notebook Overview

The notebook is organized into several key sections. Let's walk through what each major cell accomplishes.

Section 1: Configuration Setup (Python Cell)

What It Does: This interactive cell discovers your harmonization output tables and lets you select which entity columns to use for matching.

Key Operations:

Why It Matters: This dynamic configuration means the notebook adapts to your specific harmonization output – no hardcoded table names or column names. You can run this workflow on any pair of harmonized datasets.

  1. Select your database (ABT_BEST_BUY) and schema (STRUCTURED)
  2. Choose the audit table created by the harmonization app
  3. Select the entity column pair (for this run, we will use PRODUCT_LABEL ↔ NAME)
  4. Verify the table names and click Validate, Save Configuration, and Set Up Environment

Running this cell prepares the SQL environment and extracts configuration details for the matching process, ensuring all subsequent queries reference the correct tables and columns dynamically.
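
Conceptually, that discovery boils down to queries along the lines below; the notebook's Python simply wraps equivalent calls in interactive widgets. The audit table name shown is the one referenced later in this guide and may differ for your run.

-- Sketch: discover harmonization audit tables and inspect their recorded mappings
SHOW TABLES LIKE 'AUDIT_HARMONIZATION%' IN SCHEMA ABT_BEST_BUY.STRUCTURED;

SELECT *
FROM ABT_BEST_BUY.STRUCTURED.AUDIT_HARMONIZATION_HYBRID_AUDIT   -- name referenced later in this guide
LIMIT 10;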



Section 2: Vector Feature Engineering (SQL Cell)

What It Does: Creates the foundational features for hybrid matching by generating vector embeddings for all entity names and computing initial similarity scores.

Key Operations (see the SQL sketch after this list):

  1. Generate Embeddings: Uses SNOWFLAKE.CORTEX.EMBED_TEXT_1024 with the voyage-multilingual-2 model to create 1024-dimension vector embeddings for each product name
  2. Cross-Join: Creates all possible pairs between reference (ABT) and input (Best Buy) products
  3. Compute Similarity: Calculates VECTOR_COSINE_SIMILARITY between each pair's embeddings
  4. Filter Low Scores: Drops pairs with similarity < 0.2 to reduce search space
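
A minimal SQL sketch of these four operations is shown below. The harmonized table and column names (ABT_HARMONIZED, BEST_BUY_HARMONIZED, PRODUCT_ID, PRODUCT_NAME) are assumptions carried over from the earlier harmonization sketch; the notebook takes the real names from your configuration cell.

-- Sketch: embeddings, cross join, cosine similarity, and the 0.2 filter (table/column names are assumptions)
CREATE OR REPLACE TABLE VECTOR_CANDIDATES AS
WITH abt_vec AS (
    SELECT PRODUCT_ID   AS abt_id,
           PRODUCT_NAME AS abt_name,
           SNOWFLAKE.CORTEX.EMBED_TEXT_1024('voyage-multilingual-2', PRODUCT_NAME) AS emb
    FROM ABT_HARMONIZED
),
buy_vec AS (
    SELECT PRODUCT_ID   AS buy_id,
           PRODUCT_NAME AS buy_name,
           SNOWFLAKE.CORTEX.EMBED_TEXT_1024('voyage-multilingual-2', PRODUCT_NAME) AS emb
    FROM BEST_BUY_HARMONIZED
)
SELECT *
FROM (
    SELECT a.abt_id, a.abt_name, b.buy_id, b.buy_name,
           VECTOR_COSINE_SIMILARITY(a.emb, b.emb) AS similarity
    FROM abt_vec a
    CROSS JOIN buy_vec b
)
WHERE similarity >= 0.2;   -- drop clearly unrelated pairs to shrink the search space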

Why It Matters:

Expected Output: The cell displays a summary showing:

Section 3: Hybrid Matching with AI_CLASSIFY (SQL Cell)

What It Does: Implements the two-stage matching logic that combines vector similarity with AI classification for optimal results. At the end of this step, we will have matched our two datasets together, keeping only what we believe to be the actual matches and discarding the rest.

The Algorithm:

Stage 1: High-Confidence Vector Matches

Stage 2: AI_CLASSIFY for Ambiguous Cases

Stage 3: Confidence Scoring

Why This Works:

Expected Output: The cell creates the hybrid_final_results table and displays metrics:

Section 4: Record Management and Threshold Selection (Python Cell)

What It Does: Provides an interactive Streamlit interface to review match quality and separate records into MATCHED and UNMATCHED tables based on a confidence threshold.
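
As a concrete illustration, once you settle on a threshold the split amounts to statements like the following. The 0.80 cutoff, the CONFIDENCE_SCORE column, and the output table names are assumptions for this sketch; the notebook writes a paired set of matched and unmatched tables, as you'll see in the review app section.

-- Sketch: split hybrid results on a chosen confidence threshold (threshold, column, and table names are assumptions)
CREATE OR REPLACE TABLE HYBRID_MATCHED AS
SELECT * FROM hybrid_final_results WHERE CONFIDENCE_SCORE >= 0.80;

CREATE OR REPLACE TABLE HYBRID_UNMATCHED AS
SELECT * FROM hybrid_final_results WHERE CONFIDENCE_SCORE < 0.80;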

Why This Matters: Different use cases require different confidence thresholds:

Expected Interaction:



Section 5: Performance Evaluation (SQL Cell)

What It Does: Evaluates the hybrid matching approach against our golden dataset mapping table to measure accuracy and identify areas for improvement. In many real-world scenarios you won't have a golden dataset to compare against; here we have one, and we use it to test the accuracy of our method.

Key Metrics Calculated (the first is sketched in SQL after this list):

  1. Overall Accuracy: Percentage of matches that align with ground truth
  2. Accuracy by Confidence Bucket: How accuracy varies across confidence ranges
  3. Accuracy by Match Method: Performance of HIGH_CONFIDENCE_VECTOR vs AI_CLASSIFY vs MEDIUM_CONFIDENCE_VECTOR
  4. False Match Analysis: Understanding where the algorithm makes mistakes
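
As one concrete example, the first metric (overall accuracy) could be computed with a join against the ground-truth table, roughly as sketched below; the hybrid result columns are assumptions carried over from the earlier sketches.

-- Sketch: overall accuracy versus the ground-truth mapping (result column names are assumptions)
SELECT
    COUNT(*)                                                      AS total_predicted_matches,
    COUNT_IF(gt.IDABT IS NOT NULL)                                AS correct_matches,
    ROUND(100.0 * COUNT_IF(gt.IDABT IS NOT NULL) / COUNT(*), 1)   AS accuracy_pct
FROM hybrid_final_results r
LEFT JOIN ABT_BEST_BUY.STRUCTURED.ABT_BEST_BUY_PERFECT_MAPPING gt
    ON  r.abt_id = gt.IDABT
    AND r.buy_id = gt.IDBUY
WHERE r.match_decision = 'MATCH';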

Why This Matters:

As you can see, against this test dataset our method achieves over 90% accuracy when compared to the golden dataset!



In the next section, we'll build an interactive Streamlit app to review and correct the unmatched records, creating a human-in-the-loop validation workflow.

After running the hybrid matching algorithm, you'll have two sets of records:

  1. Matched Records: High-confidence matches ready for use in analysis
  2. Unmatched Records: Lower-confidence cases that need human review

This section addresses the unmatched records through an interactive Streamlit application that enables efficient, user-friendly review and correction.

The Problem with Unmatched Records

Unmatched records typically fall into a few categories:

  1. Close Calls: Multiple viable candidates with similar scores (e.g., different models of the same product line)
  2. Data Quality Issues: Missing information, typos, or incomplete descriptions
  3. Unique Products: Items that genuinely don't have a match in the other dataset
  4. Algorithm Limitations: Cases where both vector similarity and AI_CLASSIFY struggled

Rather than accepting these as lost opportunities, the Unmatched Records Reviewer provides a structured workflow to:

What the Unmatched Records App Does

The application implements a comprehensive review workflow:

1. Intelligent Data Loading

2. Interactive Review Interface

For each unmatched record, the reviewer displays:

3. Dynamic Search and Filtering

4. Batch Processing

Running the Unmatched Records App

Step 1: Create the Streamlit App

  1. Navigate to Projects > Streamlit in Snowsight
  2. Click + Streamlit App
  3. Name it data harmonization - Unmatched Records
  4. Select your ABT_BEST_BUY database and STRUCTURED schema
  5. Select the ENTITY_RESOLUTION_WH warehouse

Step 2: Upload the Application Code

  1. Delete the example code in the streamlit_app.py file
  2. Copy the contents of data harmonization - Unmatched Records.py
  3. Paste the copied code into streamlit_app.py
  4. Ensure you have the appropriate packages installed
  5. Click Run to launch the application



Step 3: Select Unmatched Table

  1. The app automatically queries AUDIT_HARMONIZATION_HYBRID_AUDIT to find unmatched/matched table pairs
  2. From the dropdown, select your unmatched table (e.g., AUDIT_HARMONIZATION_2025_10_04_HYBRID_UNMATCHED)
  3. Click Process to load the records

Step 4: Review Records

For each record on the page:

Reference Record Display:



Candidate Selection: The dropdown shows candidates with similarity scores:



Actions:

Step 5: Navigate and Continue Reviewing

Step 6: Submit Reviews

Once you reach the last page:

  1. Review the Review Summary showing:
    • Records Approved to Move
    • Records Not Approved
  2. Click Submit All Reviews to process all approved matches across all pages
  3. The app will (conceptually sketched below):
    • Update the FINAL_MATCH column for approved records
    • Move approved records from unmatched to matched table
    • Delete processed records from unmatched table
    • Display a detailed summary of changes
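
Conceptually, that move is equivalent to statements like the sketch below. The matched table name is inferred from the unmatched example shown earlier, and the idea that approvals are flagged via FINAL_MATCH is an illustrative assumption; the app's actual statements may differ.

-- Conceptual sketch only; the app issues equivalent statements for your actual table pair
INSERT INTO AUDIT_HARMONIZATION_2025_10_04_HYBRID_MATCHED
    SELECT * FROM AUDIT_HARMONIZATION_2025_10_04_HYBRID_UNMATCHED
    WHERE FINAL_MATCH IS NOT NULL;          -- assumption: approval is recorded in FINAL_MATCH

DELETE FROM AUDIT_HARMONIZATION_2025_10_04_HYBRID_UNMATCHED
WHERE FINAL_MATCH IS NOT NULL;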

Step 7: Review Completion Summary

After submission, you'll see a summary of the records you sent to the matched table.

Click 📑 Continue with Analysis to review remaining records or refresh the view.



Best Practices for Record Review

1. Prioritize by Confidence

Start with records that have similarity scores just below your threshold (e.g., 0.75-0.80), as these are most likely to be correctable with manual review.
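
One way to surface those records first is a quick query against the unmatched table before opening the app. The table name reuses the earlier example, and the SIMILARITY column and 0.75-0.80 band are assumptions.

-- Sketch: list near-threshold unmatched records for manual review first (names and band are assumptions)
SELECT *
FROM AUDIT_HARMONIZATION_2025_10_04_HYBRID_UNMATCHED
WHERE SIMILARITY BETWEEN 0.75 AND 0.80
ORDER BY SIMILARITY DESC;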

2. Look for Patterns

If you notice multiple records failing for the same reason (e.g., missing manufacturer information), consider updating your harmonization mapping to include additional fields.

3. Use Domain Knowledge

Product expertise is invaluable here. If you know ABT's catalog, you can quickly identify:

4. Don't Force Matches

If none of the candidates are correct, leave the record unchecked. It's better to have an unmatched record than an incorrect match that pollutes your analysis.

5. Iterative Review Sessions

You don't have to review all records in one sitting. The app preserves state across sessions (records remain in the unmatched table until explicitly moved).

The Value of Human-in-the-Loop

This review workflow exemplifies the human-in-the-loop approach to AI:

  1. AI Does the Heavy Lifting: The hybrid matching algorithm handles 75-85% of records automatically
  2. Humans Handle Edge Cases: Reviewers focus only on ambiguous cases that need domain expertise
  3. Continuous Improvement: Insights from manual review can inform future matching logic improvements
  4. Quality Assurance: Human validation provides confidence in match quality for downstream analytics

By the end of this section, you'll have:

You now have a complete, production-ready data harmonization pipeline with matched records ready for competitive pricing analysis!

Congratulations! You've successfully built an end-to-end data harmonization solution using Snowflake's native AI capabilities, combining automated matching with human-in-the-loop validation.

What You Accomplished

Throughout this quickstart, you:

  1. Prepared Data: Loaded product catalogs from two different retailers with different schemas
  2. Harmonized Schemas: Used Snowflake Cortex AI to automatically map semantic field differences and create unified datasets
  3. Implemented Hybrid Matching: Combined vector similarity (fast) with AI_CLASSIFY (intelligent) for optimal matching performance
  4. Validated Quality: Measured accuracy against ground truth data and achieved 85%+ match quality
  5. Reviewed Edge Cases: Used an interactive Streamlit app to manually validate uncertain matches
  6. Built Production Pipeline: Created a complete workflow with audit trails, confidence scoring, and quality metrics

Key Takeaways

Business Value:

Technical Innovation:

Best Practices:

Related Resources

Documentation: