This QuickStart helps you understand and build the Snowflake Document AI Data Extraction & Validation Pipeline, designed for seamless and reusable document processing. The pipeline lets users load files into a Snowflake stage and uses streams, tasks, and Python-based procedures for efficient data extraction, validation, and downstream integration. It can easily be adapted to work with multiple Document AI models by creating an appropriate end table to capture the final data points.

DataFlow

The pipeline ensures that documents meet business expectations through pre-processing checks, feeds suitable documents into a DOC AI model for data extraction, validates the extracted data against business-specific accuracy thresholds, and routes clean data to downstream systems. Additionally, the Streamlit UI provides a live view of the data flow and key performance metrics, enabling real-time monitoring and insights into pipeline performance. This flexible, automated solution is ideal for handling diverse document processing requirements.

Prerequisites

What You'll Learn

What You'll Need

What You'll Build

The pre-processing stage validates document attributes such as size and page count to ensure only processable documents proceed further. Criteria can be customized to align with business needs and Document AI guidelines, as outlined in the Snowflake Documentation. Files failing these checks are routed to a "Manual Review" stage, where users can review and, if necessary, resend documents for processing via a Streamlit app.

Pre-Process

Documents are ingested from the Internal Stage and picked up through a stream. A Python-based UDF evaluates the document attributes and stores the results in a pre-filter table. Documents that pass validation move on to DOC AI data extraction, while invalid ones are redirected for manual review. A minimal sketch of this pattern is shown below.
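The actual stream, table, and UDF definitions are created by the setup script; the following is only an illustrative sketch of the pattern, using the INVOICE_DOCS stage from this QuickStart and assumed names (INVOICE_DOCS_STREAM, PRE_FILTER_TABLE, CHECK_DOC_ATTRIBUTES).

-- Illustrative only: a stream over the stage's directory table feeds a pre-filter table.
-- CHECK_DOC_ATTRIBUTES stands in for the Python UDF that verifies size/page count.
CREATE OR REPLACE STREAM DS_DEV_DB.DOC_AI_SCHEMA.INVOICE_DOCS_STREAM
  ON STAGE DS_DEV_DB.DOC_AI_SCHEMA.INVOICE_DOCS;

INSERT INTO DS_DEV_DB.DOC_AI_SCHEMA.PRE_FILTER_TABLE (relative_path, file_size, passed_checks)
SELECT RELATIVE_PATH,
       SIZE,
       CHECK_DOC_ATTRIBUTES(GET_PRESIGNED_URL(@DS_DEV_DB.DOC_AI_SCHEMA.INVOICE_DOCS, RELATIVE_PATH))
FROM DS_DEV_DB.DOC_AI_SCHEMA.INVOICE_DOCS_STREAM
WHERE METADATA$ACTION = 'INSERT';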

The DOC_AI_QuickStart.SQL script sets up the environment required by the Document AI pipeline. When executed in your Snowflake account, it creates the database, schema, warehouse, internal stage, and the streams, tasks, and procedures that the pipeline relies on.

Setup Instructions

To set up your environment, copy the code from the DOC_AI_QuickStart.SQL script available in the GitHub repository and execute it in your Snowflake account.

This will configure the necessary infrastructure to support document ingestion, preprocessing, extraction, and validation, ensuring a seamless Document AI pipeline execution.
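To give a sense of the kind of objects involved, they are along these lines (a sketch only: the object names are the ones used throughout this QuickStart, the warehouse name and size are assumptions, and the script itself remains the source of truth):

-- Illustrative excerpt of the kind of setup objects used in this QuickStart.
CREATE DATABASE IF NOT EXISTS DS_DEV_DB;
CREATE SCHEMA IF NOT EXISTS DS_DEV_DB.DOC_AI_SCHEMA;
CREATE WAREHOUSE IF NOT EXISTS DOC_AI_WH WAREHOUSE_SIZE = 'XSMALL';  -- name and size assumed
CREATE STAGE IF NOT EXISTS DS_DEV_DB.DOC_AI_SCHEMA.INVOICE_DOCS
  DIRECTORY = (ENABLE = TRUE)                 -- directory table, so streams can track files
  ENCRYPTION = (TYPE = 'SNOWFLAKE_SSE');      -- server-side encryption so files can be read via presigned URLs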

Download sample documents from the GitHub repository. These AI-generated sample invoices will be used to demonstrate processing multiple Document AI models in a single pipeline.

We will use two Document AI models for this demonstration, but you can add more as needed. The pipeline will process all models sequentially.

Create Document AI Models

  1. Go to Snowsight: Navigate to AI & ML → Document AI.
  2. Create First Model: Click the + Build button and fill in:
    • Build Name: INVOICE_MODEL
    • Location: DS_DEV_DB
    • Schema: DOC_AI_SCHEMA
  3. Train the Model:
    • Upload a few documents from the sample dataset.
    • In production, training on at least 20 varied documents improves accuracy.
    • Click on a document and add Values (labels/columns to extract).
    • Enter a Prompt/Question the model uses to extract the relevant data.
    • Use the following values and prompts for this QuickStart:

    Value Name       Prompt
    PO_NUMBER        What is the purchase order number (P.O)?
    ORDER_DATE       What is the invoice order date?
    DELIVERY_DATE    What is the delivery date?
    TOTAL_AMOUNT     What is the total?
    SUB_TOTAL        What is the sub-total amount?
    TAX_AMOUNT       What is the amount withheld (3%) as tax?

    • Review and correct extracted values if needed. Accept corrections to improve accuracy.
    • Click Accept All and Close after reviewing all documents.
  4. Create the Second Model: Repeat the steps for another model.
    • Build Name: PURCHASE_MODEL
    • Location: DS_DEV_DB
    • Schema: DOC_AI_SCHEMA

Doc_AI

Once both models are trained, they will be processed sequentially through the pipeline. Continue with the next steps to configure and execute the pipeline.
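For orientation, sequential processing of multiple models can be expressed as a task that calls an extraction procedure once per model. The sketch below is illustrative only: the task and procedure names are assumptions, and the actual streams, tasks, and procedures are the ones created by the DOC_AI_QuickStart.SQL script.

-- Illustrative pattern only; RUN_DOC_AI_MODELS and PROCESS_MODEL are assumed names.
CREATE OR REPLACE TASK DS_DEV_DB.DOC_AI_SCHEMA.RUN_DOC_AI_MODELS
  SCHEDULE = '5 MINUTE'
AS
BEGIN
  CALL DS_DEV_DB.DOC_AI_SCHEMA.PROCESS_MODEL('INVOICE_MODEL');
  CALL DS_DEV_DB.DOC_AI_SCHEMA.PROCESS_MODEL('PURCHASE_MODEL');
END;
-- Tasks are created suspended; ALTER TASK ... RESUME starts the schedule.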

Once the models are trained, publish them in the Document AI Model Training screen.

To test model results, copy and execute the SQL queries listed under Extracting Query:

Doc_AI_1

For a Single Document

SELECT DS_DEV_DB.DOC_AI_SCHEMA.INVOICE_MODEL!PREDICT(
    GET_PRESIGNED_URL(@<stage_name>, '<relative_file_path>'), 1
);

For a Batch of Files in Stage

SELECT DS_DEV_DB.DOC_AI_SCHEMA.INVOICE_MODEL!PREDICT(
    GET_PRESIGNED_URL(@<stage_name>, RELATIVE_PATH), 1
)
FROM DIRECTORY(@<stage_name>);

Here, the number 1 represents the model version. Ensure you replace <stage_name> and <relative_file_path> with actual values from your setup before execution.
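PREDICT returns a JSON object keyed by the value names defined during training, where each key holds candidate values with confidence scores, so downstream tables typically flatten it into columns. A minimal sketch, assuming the INVOICE_DOCS stage and the value names used in this QuickStart:

-- Sketch: flatten the PREDICT output into columns (stage and column choices are illustrative).
WITH raw AS (
    SELECT RELATIVE_PATH,
           DS_DEV_DB.DOC_AI_SCHEMA.INVOICE_MODEL!PREDICT(
               GET_PRESIGNED_URL(@DS_DEV_DB.DOC_AI_SCHEMA.INVOICE_DOCS, RELATIVE_PATH), 1) AS extracted
    FROM DIRECTORY(@DS_DEV_DB.DOC_AI_SCHEMA.INVOICE_DOCS)
)
SELECT RELATIVE_PATH,
       extracted:PO_NUMBER[0]:value::STRING     AS po_number,
       extracted:PO_NUMBER[0]:score::FLOAT      AS po_number_score,
       extracted:TOTAL_AMOUNT[0]:value::STRING  AS total_amount,
       extracted:TOTAL_AMOUNT[0]:score::FLOAT   AS total_amount_score
FROM raw;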

Deploying the Streamlit App

The final step is to create a Streamlit app to visualize the pipeline in action.

  1. Open Snowsight: Navigate to Projects → Streamlit on the left panel.
  2. Create a New App: Click the blue + Streamlit App button in the top right corner.
  3. Configure the App:
    • Name your app.
    • Select the Database, Schema, and Warehouse created earlier in this QuickStart.
    • Click Create.
  4. Replace Default Code: Delete the example code in the editor and paste in the Streamlit app code for this QuickStart from the GitHub repository.
  5. Add Required Packages:
    • In the Packages dropdown, select:
      • plotly
      • pypdfium2
  6. Run the App: Click Run.

Python

Streamlit App Features

The Streamlit app provides full control over the pipeline and enables real-time monitoring, giving a live view of documents moving through each stage of the flow along with key performance metrics.

Testing the Pipeline in Action

  1. Load Sample Documents:
    • Upload all sample documents from GitHub Sample Docs to your INVOICE_DOCS stage under the folder Invoice.
    • For this QuickStart, load the same set of files into another folder called Purchase.

Stage
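You can upload the files through the Snowsight stage UI; if you prefer the command line, an equivalent load from SnowSQL would look roughly like this (local paths are illustrative and should point to where you saved the samples):

-- Illustrative CLI alternative to the Snowsight upload; adjust the local path.
PUT file:///path/to/sample_docs/*.pdf @DS_DEV_DB.DOC_AI_SCHEMA.INVOICE_DOCS/Invoice  AUTO_COMPRESS = FALSE;
PUT file:///path/to/sample_docs/*.pdf @DS_DEV_DB.DOC_AI_SCHEMA.INVOICE_DOCS/Purchase AUTO_COMPRESS = FALSE;
ALTER STAGE DS_DEV_DB.DOC_AI_SCHEMA.INVOICE_DOCS REFRESH;  -- refresh the directory table after loading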

  2. Monitor the Pipeline in Streamlit:
    • Open the Streamlit UI.
    • Select the Document AI model.
    • Observe the documents flowing through the pipeline in real time.

This step completes the setup, allowing you to see the automated document processing pipeline in action!
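In addition to the Streamlit view, you can verify that the pipeline's tasks are firing directly in SQL with the standard TASK_HISTORY information schema function:

-- Optional check that the pipeline's tasks are running and completing without errors.
SELECT name, state, error_message, scheduled_time, completed_time
FROM TABLE(DS_DEV_DB.INFORMATION_SCHEMA.TASK_HISTORY())
ORDER BY scheduled_time DESC
LIMIT 20;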

Streamlit

Once extracted, data is validated to ensure accuracy. Business-critical data points, such as the invoice number, are held to high validation-score thresholds, since incorrect values impact downstream processes. Validation rules help maintain data integrity and ensure only high-quality data is processed.

Validation
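A validation rule of this kind can be as simple as a confidence-threshold filter on the flattened extraction results. The sketch below is illustrative, with table and column names assumed rather than taken from the setup script:

-- Illustrative: promote rows to the final table only when critical fields clear their thresholds.
INSERT INTO DS_DEV_DB.DOC_AI_SCHEMA.INVOICE_FINAL
SELECT relative_path, po_number, order_date, total_amount
FROM DS_DEV_DB.DOC_AI_SCHEMA.INVOICE_EXTRACTED
WHERE po_number_score    >= 0.9    -- invoice/PO number: strict threshold
  AND total_amount_score >= 0.8;   -- amounts: slightly more tolerant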

Remember to clean up the objects you created after completing this QuickStart. Run the cleanup script available at:
Cleanup
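In essence, the cleanup drops the QuickStart objects, roughly along these lines (the warehouse name is an assumption, and the provided script is the source of truth):

-- Rough equivalent of the cleanup; run the provided Cleanup script for the authoritative version.
DROP DATABASE IF EXISTS DS_DEV_DB;
DROP WAREHOUSE IF EXISTS DOC_AI_WH;  -- warehouse name assumed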

Conclusion

This QuickStart provides an introduction to automating a Document AI pipeline that processes multiple models and pre-processes documents efficiently. In real-world scenarios, pre-processing logic can be customized, batch processing can be optimized, and validation rules can be enhanced to improve accuracy. This pipeline serves as a foundation for scalable, AI-powered document processing solutions.

What You Learned

Resources