1. Overview
In this quickstart, you'll learn how to build an end-to-end application that extracts text from images and makes it searchable using Large Language Models (LLMs). The application combines Optical Character Recognition (OCR), vector embeddings, and Retrieval Augmented Generation (RAG) to create an intelligent document assistant.
What You'll Learn
- Setting up OCR processing in Snowflake using Tesseract
- Creating and managing vector embeddings for semantic search
- Building a RAG-based question answering system
- Developing an interactive Streamlit interface
What You'll Build
A full-stack application that enables users to:
- Upload images containing text
- Process images through OCR to extract text
- Search through extracted text using semantic similarity
- Ask questions about the documents' content and get AI-powered responses
- View source images alongside answers
Prerequisites
- Snowflake account in a supported region for Cortex functions
- Account with these features enabled: Snowflake Notebooks, Snowflake Cortex LLM functions, Streamlit in Snowflake, and the Snowflake Anaconda packages channel
2. Setup Environment
Create Database and Schema
- Open a new worksheet in Snowflake
- Create the database and schema:
CREATE DATABASE IF NOT EXISTS ocr_rag;
CREATE SCHEMA IF NOT EXISTS ocr_rag;
Create Image Storage Stage
- Create a stage to store your images. SNOWFLAKE_SSE enables server-side encryption at rest, and DIRECTORY = (ENABLE = true) maintains a directory table so staged files can be listed and referenced in SQL:
CREATE STAGE IF NOT EXISTS ocr_rag.images_to_ocr
ENCRYPTION = (TYPE = 'SNOWFLAKE_SSE')
DIRECTORY = (ENABLE = true);
Upload Images
- Navigate to Data » Databases » OCR_RAG » OCR_RAG » Stages » IMAGES_TO_OCR
- Click "Upload Files" button in top right
- Select your image files
- Verify upload success:
ls @images_to_ocr;
You should see your uploaded files listed with their sizes.
Dataset citation
Sample Images taken from RVL-CDIP Dataset
A. W. Harley, A. Ufkes, K. G. Derpanis, "Evaluation of Deep Convolutional Nets for Document Image Classification and Retrieval," in ICDAR, 2015
Bibtex format:
@inproceedings{harley2015icdar,
  title     = {Evaluation of Deep Convolutional Nets for Document Image Classification and Retrieval},
  author    = {Adam W Harley and Alex Ufkes and Konstantinos G Derpanis},
  booktitle = {International Conference on Document Analysis and Recognition ({ICDAR})},
  year      = {2015}
}
3. Open Snowflake Notebooks
- Click on Getting Started Notebook to download the Notebook from GitHub. (NOTE: Do NOT right-click to download.)
- In your Snowflake account:
- On the left hand navigation menu, click on Projects » Notebooks
- On the top right, click on Notebook down arrow and select Import .ipynb file from the dropdown menu
- Select the file you downloaded in step 1 above
- In the Create Notebook popup
- For Notebook location, select ocr_rag for your database and ocr_rag as your schema
- Select your Warehouse
- Click on Create button
- On the top right Click Packages
- Install tesserocr, pillow, and snowflake
Here we add the imports we will use in our project. Key components:
- Tesseract: a powerful OCR engine that converts images to text
- PIL (Pillow, the Python Imaging Library): handles image processing tasks
# Import python packages
import streamlit as st
import tesserocr
import io
import pandas as pd
from PIL import Image
# Import Snowpark packages
from snowflake.snowpark.context import get_active_session
from snowflake.snowpark.types import StringType, StructField, StructType, IntegerType
from snowflake.snowpark.files import SnowflakeFile
from snowflake.core import CreateMode
from snowflake.core.table import Table, TableColumn
from snowflake.core.schema import Schema
from snowflake.core import Root
# Setup session
session = get_active_session()
session.use_schema("ocr_rag")
root = Root(session)
database = root.databases[session.get_current_database()]
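As a quick optional check, you can confirm the session is pointing at the right context before creating any objects:
# Should print the current database and schema (OCR_RAG / OCR_RAG)
print(session.get_current_database(), session.get_current_schema())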
4. Create Table Structure
In the Notebook we will create the table that will store processed documents:
docs_chunks_table = Table(
    name="docs_chunks_table",
    columns=[
        TableColumn(name="relative_path", datatype="string"),
        TableColumn(name="file_url", datatype="string"),
        TableColumn(name="scoped_file_url", datatype="string"),
        TableColumn(name="chunk", datatype="string"),
        TableColumn(name="chunk_vec", datatype="vector(float,768)"),
    ],
)
database.schemas["ocr_rag"].tables.create(docs_chunks_table, mode=CreateMode.or_replace)
This table stores:
- relative_path: path to the source image in the stage
- file_url: full URL to access the image
- scoped_file_url: temporary URL for secure access
- chunk: extracted text segment
- chunk_vec: vector embedding for semantic search
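To confirm the table was created as expected, you can describe it from the notebook; chunk_vec should be reported as VECTOR(FLOAT, 768):
# Optional check on the freshly created table
session.sql("DESC TABLE ocr_rag.docs_chunks_table").show()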
5. Implement OCR Processing
Create OCR Function
In the Notebook we will create a User-Defined Table Function (UDTF) for OCR:
session.sql("DROP FUNCTION IF EXISTS IMAGE_TEXT(VARCHAR)").collect()
class ImageText:
    def process(self, file_url: str):
        # Read the staged image's binary contents into memory
        with SnowflakeFile.open(file_url, 'rb') as f:
            buffer = io.BytesIO(f.readall())
        # Decode the bytes into a PIL Image and run Tesseract OCR on it
        image = Image.open(buffer)
        text = tesserocr.image_to_text(image)
        yield (text,)
output_schema = StructType([StructField("full_text", StringType())])
session.udtf.register(
    ImageText,
    name="IMAGE_TEXT",
    is_permanent=True,
    stage_location="@ocr_rag.images_to_ocr",
    schema="ocr_rag",
    output_schema=output_schema,
    packages=["tesserocr", "pillow", "snowflake-snowpark-python"],
    replace=True,
)
This function:
- Reads image binary data from stage
- Converts to PIL Image object
- Processes with Tesseract OCR
- Returns extracted text
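Before running OCR over the whole stage, you can smoke-test the UDTF on a single file. A minimal sketch, where 'invoice_001.png' is a hypothetical file name; substitute one of your uploads:
# Hypothetical file name -- replace with one of your staged images
test_file = 'invoice_001.png'
session.sql(f"""
    SELECT ocr_result.full_text
    FROM directory(@ocr_rag.images_to_ocr),
        TABLE(IMAGE_TEXT(build_scoped_file_url(@ocr_rag.images_to_ocr, relative_path))) AS ocr_result
    WHERE relative_path = '{test_file}'
""").show()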
Process Images
In the Notebook we will run OCR on staged images:
SELECT
relative_path,
file_url,
build_scoped_file_url(@ocr_rag.images_to_ocr, relative_path) AS scoped_file_url,
ocr_result.full_text
FROM
directory(@ocr_rag.images_to_ocr),
TABLE(IMAGE_TEXT(build_scoped_file_url(@ocr_rag.images_to_ocr, relative_path))) AS ocr_result;
6. Process Text and Create Embeddings
In the Notebook we will insert the processed text and create embeddings. The `{{run_through_files_to_ocr}}` reference in the FROM clause pulls in the output of the previous SQL cell, so the OCR cell above must be named run_through_files_to_ocr in your notebook:
INSERT INTO docs_chunks_table (relative_path, file_url, scoped_file_url, chunk, chunk_vec)
SELECT
relative_path,
file_url,
scoped_file_url,
chunk.value,
SNOWFLAKE.CORTEX.EMBED_TEXT_768('e5-base-v2', chunk.value) AS chunk_vec
FROM
{{run_through_files_to_ocr}},
LATERAL FLATTEN(SNOWFLAKE.CORTEX.SPLIT_TEXT_RECURSIVE_CHARACTER(full_text,'none', 4000, 400)) chunk;
This query:
- Takes the OCR output text from the previous cell
- Splits it into chunks of up to 4,000 characters with a 400-character overlap (see the example below)
- Creates a vector embedding for each chunk using the e5-base-v2 model
- Stores the results in docs_chunks_table
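To get a feel for how the splitter behaves, you can call it on a short literal with deliberately small chunk sizes (illustrative parameters only; the pipeline above uses 4000 and 400):
# Returns an array of overlapping text chunks
session.sql("""
    SELECT SNOWFLAKE.CORTEX.SPLIT_TEXT_RECURSIVE_CHARACTER(
        'The quick brown fox jumps over the lazy dog', 'none', 16, 4) AS chunks
""").show()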
Verify data insertion:
SELECT COUNT(*) FROM docs_chunks_table;
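With embeddings stored, you can also sanity-check retrieval from the notebook before building any UI. A minimal sketch, reusing the session from earlier (the question text is arbitrary):
# Embed the question and rank chunks by cosine similarity
question = "What are my documents about?"
session.sql("""
    SELECT relative_path, chunk,
        VECTOR_COSINE_SIMILARITY(chunk_vec,
            SNOWFLAKE.CORTEX.EMBED_TEXT_768('e5-base-v2', ?)) AS similarity
    FROM docs_chunks_table
    ORDER BY similarity DESC
    LIMIT 3
""", params=[question]).show()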
7. Build Question-Answering Interface
Create Streamlit App
This section in the Notebook creates a Streamlit application that enables users to ask questions about their OCR-processed documents. The application uses semantic search to find relevant text and generates answers using Snowflake Cortex LLMs.
import streamlit as st
from snowflake.snowpark.context import get_active_session
from snowflake.core import Root
import pandas as pd
num_chunks = 3
model = "mistral-7b"
def create_prompt(myquestion):
    cmd = """
        with results as
        (SELECT RELATIVE_PATH,
            VECTOR_COSINE_SIMILARITY(ocr_rag.docs_chunks_table.chunk_vec,
                SNOWFLAKE.CORTEX.EMBED_TEXT_768('e5-base-v2', ?)) as similarity,
            chunk
        from ocr_rag.docs_chunks_table
        order by similarity desc
        limit ?)
        select chunk, relative_path from results
    """
    df_context = session.sql(cmd, params=[myquestion, num_chunks]).to_pandas()
    context_length = len(df_context)
    prompt_context = ""
    for i in range(0, context_length):
        prompt_context += df_context._get_value(i, 'CHUNK')
    prompt_context = prompt_context.replace("'", "")
    relative_path = df_context._get_value(0, 'RELATIVE_PATH')
    prompt = f"""
        You are an expert assistant extracting information from the context provided.
        Answer the question based on the context. Be concise and do not hallucinate.
        If you don't have the information, just say so.
        Context: {prompt_context}
        Question:
        {myquestion}
        Answer:
    """
    # Generate a presigned URL (valid for 360 seconds) so the source image can be viewed
    cmd2 = f"select GET_PRESIGNED_URL(@ocr_rag.images_to_ocr, '{relative_path}', 360) as URL_LINK from directory(@ocr_rag.images_to_ocr)"
    df_url_link = session.sql(cmd2).to_pandas()
    url_link = df_url_link._get_value(0, 'URL_LINK')
    return prompt, url_link, relative_path
def complete(myquestion, model_name):
    prompt, url_link, relative_path = create_prompt(myquestion)
    cmd = """
        select SNOWFLAKE.CORTEX.COMPLETE(?, ?) as response
    """
    df_response = session.sql(cmd, params=[model_name, prompt]).collect()
    return df_response, url_link, relative_path

def display_response(question, model):
    response, url_link, relative_path = complete(question, model)
    res_text = response[0].RESPONSE
    st.markdown(res_text)
    display_url = f"Link to [{relative_path}]({url_link}) that may be useful"
    st.markdown(display_url)

st.title("Asking Questions to Your Scanned Documents with Snowflake Cortex:")
docs_available = session.sql("ls @ocr_rag.images_to_ocr").collect()
question = st.text_input("Enter question", placeholder="What are my documents about?", label_visibility="collapsed")
if question:
    display_response(question, model)
Code Walkthrough
The application:
- Uses vector similarity to find relevant text chunks
- Creates a prompt with context and question
- Calls Cortex LLM to generate answer
- Displays answer and source image link
Key parameters:
- num_chunks: number of context chunks retrieved per question (default: 3)
- model: Cortex LLM used to generate answers (default: mistral-7b)
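Both values can be changed at the top of the app. A sketch of one variation (llama3-8b is one example of another model served by SNOWFLAKE.CORTEX.COMPLETE; availability varies by region, so check your account first):
# Retrieve more context chunks and swap in a different Cortex model
num_chunks = 5
model = "llama3-8b"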
8. Conclusion and Resources
Congratulations! You've successfully built an end-to-end OCR and RAG application in Snowflake that transforms images into searchable, queryable content. Using Snowflake Notebooks and Cortex capabilities, you've implemented a solution that processes images through OCR, creates vector embeddings for semantic search, and provides AI-powered answers using Large Language Models - all while keeping your data secure within Snowflake's environment. Finally, you created a Streamlit application that allows users to interactively query their document content using natural language.
What You Learned
- How to implement OCR processing in Snowflake using Tesseract and Snowpark Python
- How to use open-source Python libraries from the curated Snowflake Anaconda channel
- How to create and manage vector embeddings for semantic search capabilities
- How to build RAG applications using Snowflake Cortex embedding, vector similarity, and LLM functions
- How to run an end-to-end document processing pipeline entirely inside Snowflake
- How to develop interactive Streamlit applications within Snowflake
Related Resources
Documentation:
Blogs & Articles:
Sample Code & Guides: