This QuickStart helps you understand and build the Snowflake Document AI Data Extraction & Validation Pipeline, designed for seamless and reusable document processing. The pipeline lets users load their files into a Snowflake stage and leverages streams, tasks, and Python-based procedures for efficient data extraction, validation, and downstream integration. It can easily be adapted to work with multiple Document AI models by creating an appropriate end table to capture the final data points.
The pipeline ensures that documents meet business expectations through pre-processing checks, feeds suitable documents into a DOC AI model for data extraction, validates the extracted data against business-specific accuracy thresholds, and routes clean data to downstream systems. Additionally, the Streamlit UI provides a live view of the data flow and key performance metrics, enabling real-time monitoring and insights into pipeline performance. This flexible, automated solution is ideal for handling diverse document processing requirements.
The pre-processing stage validates document attributes such as size and page count to ensure only processable documents proceed further. Criteria can be customized to align with business needs and Document AI guidelines, as outlined in the Snowflake Documentation. Files failing these checks are routed to a "Manual Review" stage, where users can review and, if necessary, resend documents for processing via a Streamlit app.
Documents are ingested from the Internal Stage and processed through a stream. A Python-based UDF evaluates the attributes, storing results in a pre-filter table. Documents passing the validation move to the DOC AI Data Extraction, while invalid ones are redirected for manual review.
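As a rough illustration of how this stage can be wired together, the sketch below uses a stream on the stage's directory table and a scheduled task that records a simple check in a pre-filter table. The actual pipeline uses a Python-based UDF for attribute checks such as page count; the sketch simplifies that to a SQL size check to show the stream-and-task wiring. The object names, warehouse, and size limit are hypothetical; the authoritative definitions live in the DOC_AI_QuickStart.SQL script.

-- Hypothetical pre-filter wiring: a stream on the stage's directory table
-- captures newly uploaded files, and a task records a size check per file.
CREATE OR REPLACE STREAM DS_DEV_DB.DOC_AI_SCHEMA.INVOICE_DOCS_STREAM
  ON STAGE DS_DEV_DB.DOC_AI_SCHEMA.INVOICE_DOCS;

CREATE OR REPLACE TASK DS_DEV_DB.DOC_AI_SCHEMA.PREFILTER_TASK
  WAREHOUSE = COMPUTE_WH
  SCHEDULE  = '5 MINUTE'
  WHEN SYSTEM$STREAM_HAS_DATA('DS_DEV_DB.DOC_AI_SCHEMA.INVOICE_DOCS_STREAM')
AS
  INSERT INTO DS_DEV_DB.DOC_AI_SCHEMA.PREFILTER_RESULTS (relative_path, file_size, passed_check)
  SELECT
    relative_path,
    size,
    size <= 10 * 1024 * 1024   -- example rule: reject files larger than 10 MB
  FROM DS_DEV_DB.DOC_AI_SCHEMA.INVOICE_DOCS_STREAM
  WHERE METADATA$ACTION = 'INSERT';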
The DOC_AI_QuickStart.SQL script sets up the necessary environment for the Document AI pipeline. To set up your environment, copy the code from the DOC_AI_QuickStart.SQL script available in the GitHub repository and execute it in your Snowflake account. This configures the infrastructure needed to support document ingestion, pre-processing, extraction, and validation, ensuring a seamless Document AI pipeline execution.
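For orientation, the kind of objects the script sets up looks roughly like the following. This is a simplified sketch only; refer to DOC_AI_QuickStart.SQL for the exact objects and options.

-- Simplified sketch of the environment setup
CREATE DATABASE IF NOT EXISTS DS_DEV_DB;
CREATE SCHEMA   IF NOT EXISTS DS_DEV_DB.DOC_AI_SCHEMA;

-- Internal stage with a directory table and server-side encryption,
-- so files can be listed, streamed, and passed to Document AI
CREATE STAGE IF NOT EXISTS DS_DEV_DB.DOC_AI_SCHEMA.INVOICE_DOCS
  DIRECTORY  = (ENABLE = TRUE)
  ENCRYPTION = (TYPE = 'SNOWFLAKE_SSE');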
Download sample documents from the GitHub repository. These AI-generated sample invoices will be used to demonstrate processing multiple Document AI models in a single pipeline.
We will use two Document AI models for this demonstration, but you can add more as needed. The pipeline will process all models sequentially.
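Later, once both models exist, one way to picture this sequential processing is a small task graph in which the second model's task runs AFTER the first. This is only an illustrative sketch; the task names and stored procedures below are hypothetical, and the actual orchestration is defined by the QuickStart script.

-- Illustrative task chaining so the two models run one after the other
CREATE OR REPLACE TASK DS_DEV_DB.DOC_AI_SCHEMA.RUN_INVOICE_MODEL_TASK
  WAREHOUSE = COMPUTE_WH
  SCHEDULE  = '60 MINUTE'
AS
  CALL DS_DEV_DB.DOC_AI_SCHEMA.PROCESS_INVOICES();    -- hypothetical procedure calling INVOICE_MODEL

CREATE OR REPLACE TASK DS_DEV_DB.DOC_AI_SCHEMA.RUN_PURCHASE_MODEL_TASK
  WAREHOUSE = COMPUTE_WH
  AFTER DS_DEV_DB.DOC_AI_SCHEMA.RUN_INVOICE_MODEL_TASK
AS
  CALL DS_DEV_DB.DOC_AI_SCHEMA.PROCESS_PURCHASES();   -- hypothetical procedure calling PURCHASE_MODEL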
Click the + Build button and fill in:
Name: INVOICE_MODEL
Database: DS_DEV_DB
Schema: DOC_AI_SCHEMA
Upload the sample invoice documents, then define the following values and prompts for the model:
| Value Name | Prompt |
| --- | --- |
|  | What is the purchase order number (P.O)? |
|  | What is the invoice order date? |
|  | What is the delivery date? |
|  | What is the total? |
|  | What is the sub-total amount? |
|  | What is the amount withheld (3%) as tax? |
Repeat the same steps for the second model, filling in:
Name: PURCHASE_MODEL
Database: DS_DEV_DB
Schema: DOC_AI_SCHEMA
Once both models are trained, they will be processed sequentially through the pipeline. Continue with the next steps to configure and execute the pipeline.
Once the models are trained, publish them in the Document AI Model Training screen.
To test the model results, copy and execute the SQL queries listed under Extracting Query.

To extract data from a single document:

SELECT DS_DEV_DB.DOC_AI_SCHEMA.INVOICE_MODEL!PREDICT(
  GET_PRESIGNED_URL(@<stage_name>, <relative_file_path>), 1
);

To extract data from every document in the stage:

SELECT DS_DEV_DB.DOC_AI_SCHEMA.INVOICE_MODEL!PREDICT(
  GET_PRESIGNED_URL(@<stage_name>, RELATIVE_PATH), 1
)
FROM DIRECTORY(@<stage_name>);
Here, the number 1 represents the model version. Ensure you replace <stage_name> and <relative_file_path> with actual values from your setup before execution.
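The PREDICT call returns a JSON object containing each extracted value together with a confidence score. As a rough sketch of how that output can be flattened into columns, assuming a value name purchase_order_number was defined on the model (substitute the value names from your own build):

WITH raw AS (
  SELECT
    RELATIVE_PATH,
    DS_DEV_DB.DOC_AI_SCHEMA.INVOICE_MODEL!PREDICT(
      GET_PRESIGNED_URL(@<stage_name>, RELATIVE_PATH), 1
    ) AS extraction
  FROM DIRECTORY(@<stage_name>)
)
SELECT
  RELATIVE_PATH,
  extraction:purchase_order_number[0]:value::STRING AS po_number,  -- hypothetical value name
  extraction:purchase_order_number[0]:score::FLOAT  AS po_score    -- its confidence score
FROM raw;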
The final step is to create a Streamlit app to visualize the pipeline in action.
Click the + Streamlit App blue button in the top right corner to create the app, then add the plotly and pypdfium2 packages to it.
The Streamlit app provides full control over the pipeline and enables real-time monitoring of the data flow and key performance metrics. Upload the sample invoice documents to the INVOICE_DOCS stage under the folder Invoice, and the sample purchase documents under the folder Purchase.
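If you prefer loading the sample files from the command line instead of the Snowsight UI, a PUT along these lines works from SnowSQL or another client that supports PUT; the local paths are illustrative.

-- Illustrative upload of the sample files into the two stage folders
PUT file:///path/to/samples/invoice_*.pdf  @DS_DEV_DB.DOC_AI_SCHEMA.INVOICE_DOCS/Invoice  AUTO_COMPRESS = FALSE;
PUT file:///path/to/samples/purchase_*.pdf @DS_DEV_DB.DOC_AI_SCHEMA.INVOICE_DOCS/Purchase AUTO_COMPRESS = FALSE;

-- Refresh the directory table so the stream and pipeline see the new files
ALTER STAGE DS_DEV_DB.DOC_AI_SCHEMA.INVOICE_DOCS REFRESH;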
This step completes the setup, allowing you to see the automated document processing pipeline in action!
Once extracted, data is validated to ensure accuracy. Key data points, such as the invoice number, are held to stricter validation thresholds, since incorrect values directly impact downstream processes. Validation rules help maintain data integrity and ensure only high-quality data is processed.
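A threshold rule of this kind can be expressed as a simple filter on the confidence scores returned by the model. The sketch below is hypothetical: the table and column names, and the thresholds themselves, are examples rather than the pipeline's actual definitions.

-- Example: critical fields require a stricter confidence threshold than others
INSERT INTO DS_DEV_DB.DOC_AI_SCHEMA.INVOICES_VALIDATED
SELECT *
FROM DS_DEV_DB.DOC_AI_SCHEMA.INVOICES_EXTRACTED
WHERE invoice_number_score >= 0.9   -- critical field: strict threshold
  AND total_amount_score   >= 0.7;  -- supporting field: more lenient threshold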
Remember to clean up the objects you created after completing this QuickStart by running the Cleanup script available in the GitHub repository.
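In essence, cleanup amounts to dropping the objects created for this QuickStart; a minimal sketch is below, but the provided Cleanup script is the authoritative version and may remove additional objects.

-- Minimal cleanup sketch (assumes everything was created inside DS_DEV_DB)
DROP DATABASE IF EXISTS DS_DEV_DB;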
This QuickStart provides an introduction to automating a Document AI pipeline that processes multiple models and pre-processes documents efficiently. In real-world scenarios, pre-processing logic can be customized, batch processing can be optimized, and validation rules can be enhanced to improve accuracy. This pipeline serves as a foundation for scalable, AI-powered document processing solutions.