This QuickStart helps you build an automated invoice reconciliation pipeline in Snowflake using Document AI, with a design aimed at seamless, reusable document processing. You load your invoice files into a Snowflake stage, and the pipeline uses streams, tasks, and Document AI's Table Extraction feature for data extraction, validation, and downstream integration.

The pipeline uses a Document AI table extraction model to automatically extract data from invoice documents uploaded to a Snowflake internal stage. An initial reconciliation then compares the extracted bronze-layer data against the main bronze-layer database tables.

If the data matches, the invoice is considered 'auto-reconciled' and its data is passed to the gold layer table, where it becomes the official reconciled result for that invoice.

If there are discrepancies, the reconciliation results are passed to a Streamlit application in Snowflake, where you can view the mismatched invoice values and review them manually. After manual review and reconciliation approval, the invoice values are passed to the final gold layer table.
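
The routing described above can be sketched in a few lines of plain Python. This is only an illustration of the match-or-review decision, not the logic in the QuickStart's SQL scripts: the `LineItem` fields, the status strings, and the `reconcile` function are hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class LineItem:
    # Hypothetical columns; the real bronze tables may differ.
    invoice_id: str
    product: str
    quantity: int
    total: Decimal


def reconcile(extracted, reference):
    """Compare extracted line items with reference line items for one invoice.

    Returns ("auto_reconciled", []) when everything matches, otherwise
    ("needs_review", [list of human-readable discrepancies]).
    """
    ref_by_product = {item.product: item for item in reference}
    discrepancies = []

    for item in extracted:
        expected = ref_by_product.pop(item.product, None)
        if expected is None:
            discrepancies.append(f"{item.product}: not in reference data")
            continue
        if item.quantity != expected.quantity:
            discrepancies.append(
                f"{item.product}: quantity {item.quantity} != {expected.quantity}"
            )
        if item.total != expected.total:
            discrepancies.append(
                f"{item.product}: total {item.total} != {expected.total}"
            )

    # Anything left in the reference was never seen in the extracted data.
    for product in ref_by_product:
        discrepancies.append(f"{product}: missing from extracted data")

    status = "auto_reconciled" if not discrepancies else "needs_review"
    return status, discrepancies
```

An invoice that comes back as `auto_reconciled` would go straight to the gold layer, while a `needs_review` result would carry its discrepancy list to the Streamlit app for manual review.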

What is Document AI?

Document AI is a Snowflake machine learning feature that uses a large language model (LLM) to extract data from documents. With Document AI, you can prepare pipelines for continuous processing of new documents of a specific type, such as invoice or finance statement documents.

Document AI uses a model that provides both zero-shot extraction and fine-tuning. Zero-shot means that the foundation model is trained on a large volume of various documents, so the model broadly understands the type of document being processed. In this way, it can locate and extract information specific to this document type, even if the model has never seen the document before.

Additionally, you can create your own customized, fine-tuned Document AI model to improve your results by training the model on the documents specific to your use case.

Prerequisites

What You'll Learn

What You'll Need

What You'll Build

The docai_invoice_qs_setup.sql script is designed to set up the necessary environment for the Document AI pipeline. When executed in your Snowflake account, it will:

Setup Instructions

To set up your environment, copy the code from the docai_invoice_qs_setup.sql script available in the GitHub repository and execute it in your Snowflake account.

This will configure the necessary infrastructure to support invoice ingestion and extraction, the first part of our Document AI pipeline.

Download sample documents from the GitHub repository. These AI-generated sample invoices will be used to demonstrate processing multiple Document AI models in a single pipeline.

Create Document AI Model

  1. Go to Snowsight: Navigate to AI & ML → Document AI.
  2. Create First Model: Click the + Build button and fill in:
    • Build Name: DOC_AI_QS_INVOICES
    • Location: DOC_AI_QS_DB
    • Schema: DOC_AI_SCHEMA
  3. Define Values to Extract:
    • Upload one of the documents from the folder extraction documents.
    • Click on a document and select the Document Processing Type (labels/columns to extract).

For this quickstart, we don't need to train our Document AI model before publishing, as it has been designed to handle these types of invoices out of the box.
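
Once the model build is published, it can be called from SQL with the `!PREDICT` method on each staged file. The sketch below only assembles such a query as a string, so you can see its shape. It assumes the stage is named DOC_AI_STAGE, that the stage has a directory table enabled, and that version 1 of the build is the published one. `build_predict_query` is a hypothetical helper, not part of the QuickStart scripts.

```python
import re

# Unquoted Snowflake identifiers: a letter or underscore, then letters,
# digits, underscores, or dollar signs.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_$]*$")


def build_predict_query(database, schema, build_name, stage, model_version=1):
    """Return a SELECT that runs the Document AI build on every file in a stage."""
    for name in (database, schema, build_name, stage):
        if not _IDENTIFIER.match(name):
            raise ValueError(f"unsafe identifier: {name!r}")

    model = f"{database}.{schema}.{build_name}"
    return (
        "SELECT relative_path, "
        f"{model}!PREDICT(GET_PRESIGNED_URL(@{stage}, relative_path), "
        f"{int(model_version)}) AS extracted "
        f"FROM DIRECTORY(@{stage})"
    )


if __name__ == "__main__":
    print(
        build_predict_query(
            "DOC_AI_QS_DB", "DOC_AI_SCHEMA", "DOC_AI_QS_INVOICES", "DOC_AI_STAGE"
        )
    )
```

You could paste the printed statement into a worksheet to check the raw extraction output for your sample invoices before the automated tasks pick them up.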

The docai_invoice_qs_reconcile.sql script sets up the infrastructure for the invoice reconciliation layers and kicks off the process by loading invoice data into the bronze database. When executed in your Snowflake account, it will:

Setup Instructions

To set up your environment, copy the code from the docai_invoice_qs_reconcile.sql script available in the GitHub repository and execute it in your Snowflake account.

This will configure the necessary infrastructure to support invoice reconciliation, the final part of our Document AI pipeline.
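
Before extracted values can be compared against the bronze tables, the model's JSON output has to be reshaped into rows. The sketch below shows one way to do that in Python. The response shape is an assumption made for illustration: each non-metadata key is treated as a table column holding a list of `{"score", "value"}` cells in row order, and the column names `product` and `quantity` are hypothetical. Check the actual output of your model build before relying on this layout.

```python
import json


def flatten_table(response_json):
    """Turn a column-oriented extraction response into a list of row dicts.

    Assumed input shape (not confirmed by the QuickStart):
      {"__documentMetadata": {...},
       "<column>": [{"score": ..., "value": ...}, ...], ...}
    """
    doc = json.loads(response_json)

    # Keep only list-valued keys that are not metadata (metadata starts with "__").
    columns = {
        name: cells
        for name, cells in doc.items()
        if not name.startswith("__") and isinstance(cells, list)
    }
    if not columns:
        return []

    row_count = max(len(cells) for cells in columns.values())
    rows = []
    for i in range(row_count):
        row = {}
        for name, cells in columns.items():
            # Columns shorter than the longest one are padded with None.
            cell = cells[i] if i < len(cells) else {}
            row[name] = cell.get("value")
        rows.append(row)
    return rows
```

Padding short columns with `None` keeps a partially extracted row visible, so it surfaces as a discrepancy during reconciliation rather than silently disappearing.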

Deploying the Streamlit App

The final step is to create a Streamlit app for manual review of documents that could not be auto-reconciled.

  1. Open Snowsight: Navigate to Projects → Streamlit on the left panel.
  2. Create a New App: Click the blue + Streamlit App button in the top right corner.
  3. Configure the App:
    • Name your app.
    • Select the Database, Schema, and Warehouse created earlier in this QuickStart.
    • Click Create.
  4. Replace Default Code:
  5. Add Required Packages:
    • In the Packages dropdown, select:
      • pypdfium2
  6. Run the App: Click Run.

Streamlit App Features

The Streamlit app provides full manual review capabilities for the invoice reconciliation process. It shows:

Putting the Invoice Reconciliation in Action

  1. Load Sample Invoices:
    • Upload all sample documents from the extraction documents folder to your DOC_AI_STAGE stage, either through the Databases GUI in Snowsight or with the document upload widget in the Streamlit app.

  2. Wait a few minutes for your automated tasks to execute, then review the results:
    • Open the Streamlit UI.
    • Observe the auto-reconciled metrics.
    • Review the invoices with discrepancies and provide the final manual reconciliation.

This step completes the quickstart, allowing you to see the invoice reconciliation process in action!
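
If you want to sanity-check the auto-reconciled metrics yourself, the calculation is simple to reproduce. The sketch below is a hedged illustration in plain Python: the status strings and the `reconciliation_metrics` function are hypothetical and are not taken from the Streamlit app's code.

```python
from collections import Counter


def reconciliation_metrics(statuses):
    """Summarize one status string per invoice.

    Hypothetical status values: "auto_reconciled", "needs_review",
    "manually_reconciled".
    """
    counts = Counter(statuses)
    total = sum(counts.values())
    auto = counts["auto_reconciled"]
    return {
        "total_invoices": total,
        "auto_reconciled": auto,
        "needs_review": counts["needs_review"],
        "manually_reconciled": counts["manually_reconciled"],
        # Guard against an empty pipeline to avoid dividing by zero.
        "auto_reconciled_rate": auto / total if total else 0.0,
    }
```

A low auto-reconciled rate on the sample invoices would suggest looking at the extraction output or the matching rules before reviewing invoices one by one.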

If you'd like, you can clean up the objects you created after completing this QuickStart. Run the cleanup script available at:
docai_invoice_qs_cleanup.sql

Conclusion

This QuickStart provides an introduction to automating a Document AI invoice reconciliation pipeline that extracts table values from invoice documents and reconciles them against an invoice database. In real-world scenarios, invoice reconciliation logic can be customized, batch processing can be optimized, and validation rules can be enhanced to improve accuracy. This pipeline serves as a foundation for scalable, AI-powered document processing solutions.

What You Learned

Resources