In this quickstart, you'll learn how to build an end-to-end application that extracts text from images and makes it searchable using Large Language Models (LLMs). The application combines Optical Character Recognition (OCR), vector embeddings, and Retrieval Augmented Generation (RAG) to create an intelligent document assistant.
A full-stack application that enables users to:
- Upload scanned documents and images to Snowflake
- Extract text from those images using OCR
- Convert the extracted text into searchable vector embeddings
- Ask natural-language questions about the documents and receive LLM-generated answers
First, create the database, schema, and stage that will hold the images:
CREATE DATABASE IF NOT EXISTS ocr_rag;
CREATE SCHEMA IF NOT EXISTS ocr_rag;
CREATE STAGE IF NOT EXISTS ocr_rag.images_to_ocr
ENCRYPTION = (TYPE = 'SNOWFLAKE_SSE')
DIRECTORY = (ENABLE = true);
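Next, upload your sample images to the stage. You can drag and drop them through the Snowsight UI, or use a short Snowpark snippet; a minimal sketch, assuming an active session and a placeholder local path:
# Minimal upload sketch -- the local path below is a placeholder assumption
from snowflake.snowpark.context import get_active_session
session = get_active_session()
session.file.put(
    "/path/to/local/images/*.png",  # placeholder: point at your sample images
    "@ocr_rag.images_to_ocr",
    auto_compress=False,  # keep raw image bytes so OCR can read them
    overwrite=True,
)
# Refresh the directory table so the new files are visible to directory()
session.sql("ALTER STAGE ocr_rag.images_to_ocr REFRESH").collect()
Then list the stage contents to confirm: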
ls @ocr_rag.images_to_ocr;
You should see your uploaded files listed with their sizes.
Sample Images taken from RVL-CDIP Dataset
A. W. Harley, A. Ufkes, K. G. Derpanis, "Evaluation of Deep Convolutional Nets for Document Image Classification and Retrieval," in ICDAR, 2015
Bibtex format:
@inproceedings{harley2015icdar,
  title = {Evaluation of Deep Convolutional Nets for Document Image Classification and Retrieval},
  author = {Adam W Harley and Alex Ufkes and Konstantinos G Derpanis},
  booktitle = {International Conference on Document Analysis and Recognition ({ICDAR})},
  year = {2015}
}
Here we add the imports we will use for the project.
Key Components:
- Tesseract: A powerful OCR engine that converts images to text
- Streamlit: Creates an intuitive web interface for user interaction
- PIL (Python Imaging Library): Handles image processing tasks
- Snowpark: Provides the Python API for Snowflake operations
# Import python packages
import streamlit as st
import tesserocr
import io
import pandas as pd
from PIL import Image
# Import Snowpark packages
from snowflake.snowpark.context import get_active_session
from snowflake.snowpark.types import StringType, StructField, StructType, IntegerType
from snowflake.snowpark.files import SnowflakeFile
from snowflake.core import CreateMode
from snowflake.core.table import Table, TableColumn
from snowflake.core.schema import Schema
from snowflake.core import Root
# Setup session
session = get_active_session()
session.use_schema("ocr_rag")
root = Root(session)
database = root.databases[session.get_current_database()]
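As an optional sanity check (not part of the original flow), you can confirm the session is pointed at the expected database and schema before creating objects:
# Optional: verify the session context before creating tables and functions
print(session.get_current_database(), session.get_current_schema())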
In the Notebook we will create the table that will store processed documents:
docs_chunks_table = Table(
    name="docs_chunks_table",
    columns=[
        TableColumn(name="relative_path", datatype="string"),
        TableColumn(name="file_url", datatype="string"),
        TableColumn(name="scoped_file_url", datatype="string"),
        TableColumn(name="chunk", datatype="string"),
        TableColumn(name="chunk_vec", datatype="vector(float,768)"),
    ],
)
database.schemas["ocr_rag"].tables.create(docs_chunks_table, mode=CreateMode.or_replace)
This table stores:
- relative_path: Path to the source image in the stage
- file_url: Full URL to access the image
- scoped_file_url: Temporary URL for secure access
- chunk: Extracted text segment
- chunk_vec: Vector embedding for semantic search
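Optionally, you can verify the table structure; chunk_vec should appear as VECTOR(FLOAT, 768):
# Inspect the new table's columns, including the vector column
session.sql("DESCRIBE TABLE ocr_rag.docs_chunks_table").show()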
session.sql("DROP FUNCTION IF EXISTS IMAGE_TEXT(VARCHAR)").collect()
class ImageText:
def process(self, file_url: str):
with SnowflakeFile.open(file_url, 'rb') as f:
buffer = io.BytesIO(f.readall())
image = Image.open(buffer)
text = tesserocr.image_to_text(image)
yield (text,)
output_schema = StructType([StructField("full_text", StringType())])
session.udtf.register(
ImageText,
name="IMAGE_TEXT",
is_permanent=True,
stage_location="@ocr_rag.images_to_ocr",
schema="ocr_rag",
output_schema=output_schema,
packages=["tesserocr", "pillow","snowflake-snowpark-python"],
replace=True
)
This function opens each staged image with SnowflakeFile, runs Tesseract OCR on it, and yields the extracted text as a single-column row.
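If you want to sanity-check the OCR call before registering the UDTF, you can run the same two lines locally; a minimal sketch, assuming tesserocr and pillow are installed in your environment and a placeholder image path:
# Local smoke test of the OCR logic used inside the UDTF
import tesserocr
from PIL import Image

image = Image.open("sample_scan.png")  # placeholder: any scanned image
text = tesserocr.image_to_text(image)  # same call the UDTF makes
print(text[:500])                      # preview the first 500 characters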
In the Notebook we will run OCR on the staged images. Name this SQL cell run_through_files_to_ocr, since the next step references its results:
SELECT
relative_path,
file_url,
build_scoped_file_url(@ocr_rag.images_to_ocr, relative_path) AS scoped_file_url,
ocr_result.full_text
FROM
directory(@ocr_rag.images_to_ocr),
TABLE(IMAGE_TEXT(build_scoped_file_url(@ocr_rag.images_to_ocr, relative_path))) AS ocr_result;
In the Notebook we will insert the processed text and create embeddings; {{run_through_files_to_ocr}} is the Notebook's cell-reference syntax for the results of the previous SQL cell:
INSERT INTO docs_chunks_table (relative_path, file_url, scoped_file_url, chunk, chunk_vec)
SELECT
relative_path,
file_url,
scoped_file_url,
chunk.value,
SNOWFLAKE.CORTEX.EMBED_TEXT_768('e5-base-v2', chunk.value) AS chunk_vec
FROM
{{run_through_files_to_ocr}},
LATERAL FLATTEN(SNOWFLAKE.CORTEX.SPLIT_TEXT_RECURSIVE_CHARACTER(full_text,'none', 4000, 400)) chunk;
This query splits each document's full text into 4,000-character chunks with a 400-character overlap, embeds each chunk with the e5-base-v2 model, and inserts the resulting rows into docs_chunks_table.
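To get a feel for how the splitter behaves before running it over full documents, you can call it on a short string with smaller parameters; a sketch (the sample text and sizes here are arbitrary):
# Preview SPLIT_TEXT_RECURSIVE_CHARACTER with small chunk/overlap values
# so the chunking is visible on a short string
session.sql("""
    SELECT SNOWFLAKE.CORTEX.SPLIT_TEXT_RECURSIVE_CHARACTER(
        'This is a short example paragraph used to preview chunking behavior.',
        'none', 40, 10
    ) AS chunks
""").show()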
Verify data insertion:
SELECT COUNT(*) FROM docs_chunks_table;
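Before building the app, you can also spot-check the stored chunks and try a manual similarity search (the query string below is an arbitrary example):
# Peek at a few stored chunks
session.sql(
    "SELECT relative_path, LEFT(chunk, 120) AS preview "
    "FROM ocr_rag.docs_chunks_table LIMIT 5"
).show()

# Manual semantic search against the stored embeddings
session.sql("""
    SELECT relative_path,
           VECTOR_COSINE_SIMILARITY(
               chunk_vec,
               SNOWFLAKE.CORTEX.EMBED_TEXT_768('e5-base-v2', 'invoice total')
           ) AS similarity
    FROM ocr_rag.docs_chunks_table
    ORDER BY similarity DESC
    LIMIT 3
""").show()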
This section in the Notebook creates a Streamlit application that enables users to ask questions about their OCR-processed documents. The application uses semantic search to find relevant text and generates answers using Snowflake Cortex LLMs.
import streamlit as st
from snowflake.snowpark.context import get_active_session
from snowflake.core import Root
import pandas as pd

session = get_active_session()

num_chunks = 3        # number of context chunks to retrieve per question
model = "mistral-7b"  # Cortex LLM used to generate answers

def create_prompt(myquestion):
    # Retrieve the chunks most similar to the question
    cmd = """
    WITH results AS (
        SELECT relative_path,
               VECTOR_COSINE_SIMILARITY(chunk_vec,
                   SNOWFLAKE.CORTEX.EMBED_TEXT_768('e5-base-v2', ?)) AS similarity,
               chunk
        FROM ocr_rag.docs_chunks_table
        ORDER BY similarity DESC
        LIMIT ?
    )
    SELECT chunk, relative_path FROM results
    """
    df_context = session.sql(cmd, params=[myquestion, num_chunks]).to_pandas()

    # Concatenate all retrieved chunks into a single context string
    prompt_context = "".join(df_context['CHUNK']).replace("'", "")
    relative_path = df_context.loc[0, 'RELATIVE_PATH']

    prompt = f"""
    You are an expert assistant extracting information from the context provided.
    Answer the question based on the context. Be concise and do not hallucinate.
    If you don't have the information just say so.
    Context: {prompt_context}
    Question:
    {myquestion}
    Answer:
    """
    # Presigned URL so the user can open the top-ranked source image
    cmd2 = f"SELECT GET_PRESIGNED_URL(@ocr_rag.images_to_ocr, '{relative_path}', 360) AS URL_LINK"
    df_url_link = session.sql(cmd2).to_pandas()
    url_link = df_url_link.loc[0, 'URL_LINK']
    return prompt, url_link, relative_path

def complete(myquestion, model_name):
    prompt, url_link, relative_path = create_prompt(myquestion)
    cmd = "SELECT SNOWFLAKE.CORTEX.COMPLETE(?, ?) AS response"
    df_response = session.sql(cmd, params=[model_name, prompt]).collect()
    return df_response, url_link, relative_path

def display_response(question, model_name):
    response, url_link, relative_path = complete(question, model_name)
    res_text = response[0].RESPONSE
    st.markdown(res_text)
    st.markdown(f"Link to [{relative_path}]({url_link}) that may be useful")

st.title("Asking Questions to Your Scanned Documents with Snowflake Cortex:")
docs_available = session.sql("ls @ocr_rag.images_to_ocr").collect()
question = st.text_input("Enter question", placeholder="What are my documents about?", label_visibility="collapsed")
if question:
    display_response(question, model)
The application embeds the user's question, retrieves the most similar text chunks with VECTOR_COSINE_SIMILARITY, builds a prompt from that context, generates an answer with SNOWFLAKE.CORTEX.COMPLETE, and displays it alongside a presigned link to the source image.
Key parameters:
- num_chunks: Number of context chunks retrieved per question (default: 3)
- model: Cortex LLM used to generate answers (default: mistral-7b)
Congratulations! You've successfully built an end-to-end OCR and RAG application in Snowflake that transforms images into searchable, queryable content. Using Snowflake Notebooks and Cortex capabilities, you've implemented a solution that processes images through OCR, creates vector embeddings for semantic search, and provides AI-powered answers using Large Language Models, all while keeping your data secure within Snowflake's environment. Finally, you created a Streamlit application that allows users to interactively query their document content using natural language.
Documentation:
Blogs & Articles:
Sample Code & Guides: