In this quickstart, you'll learn how to build an end-to-end application that extracts text from images and makes it searchable using Large Language Models (LLMs). The application combines Optical Character Recognition (OCR), vector embeddings, and Retrieval Augmented Generation (RAG) to create an intelligent document assistant.

What You'll Learn

  - How to extract text from staged images with a Tesseract-based OCR function in Snowflake
  - How to chunk extracted text and create vector embeddings with Snowflake Cortex
  - How to retrieve relevant chunks with vector similarity search and generate answers with Cortex LLMs
  - How to build a Streamlit app for querying your documents in natural language

What You'll Build

A full-stack application that enables users to:

  - Upload scanned documents (images) to a Snowflake stage
  - Extract, chunk, and embed the text those images contain
  - Ask natural-language questions and receive LLM-generated answers grounded in that text

Prerequisites

  - A Snowflake account with access to Snowflake Notebooks and Snowflake Cortex functions

Create Database and Schema

  1. Open a new worksheet in Snowflake
  2. Create the database and schema:
CREATE DATABASE IF NOT EXISTS ocr_rag;
CREATE SCHEMA IF NOT EXISTS ocr_rag;

Create Image Storage Stage

  1. Create a stage to store your images:
CREATE STAGE IF NOT EXISTS ocr_rag.images_to_ocr
  ENCRYPTION = (TYPE = 'SNOWFLAKE_SSE')
  DIRECTORY = (ENABLE = true);

Upload Images

Download Sample Images

  1. Navigate to Data > Databases > OCR_RAG > OCR_RAG > Stages > IMAGES_TO_OCR
  2. Click "Upload Files" button in top right
  3. Select your image files
  4. Verify upload success:
ls @ocr_rag.images_to_ocr;

You should see your uploaded files listed with their sizes.
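
Alternatively, if you are working from SnowSQL or a Snowflake driver rather than the Snowsight UI, a PUT command uploads local files to the stage. This is a minimal sketch; the local path below is a placeholder for wherever you saved the sample images:

PUT file:///tmp/sample_images/*.png @ocr_rag.images_to_ocr
  AUTO_COMPRESS = FALSE;  -- keep images uncompressed so the OCR function can read them as-is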

Dataset citation

Sample Images taken from RVL-CDIP Dataset

A. W. Harley, A. Ufkes, K. G. Derpanis, "Evaluation of Deep Convolutional Nets for Document Image Classification and Retrieval," in ICDAR, 2015

Bibtex format:

@inproceedings{harley2015icdar,
  title     = {Evaluation of Deep Convolutional Nets for Document Image Classification and Retrieval},
  author    = {Adam W Harley and Alex Ufkes and Konstantinos G Derpanis},
  booktitle = {International Conference on Document Analysis and Recognition ({ICDAR})},
  year      = {2015}
}

Set Up Your Notebook

  1. Click on Getting Started Notebook to download the Notebook from GitHub. (NOTE: Do NOT right-click to download.)
  2. In your Snowflake account, navigate to Notebooks and import the downloaded .ipynb file.
  3. In the Create Notebook popup, select the OCR_RAG database and schema created earlier, then create the Notebook.

Here we add the imports we will use for this project:

# Import python packages
import streamlit as st
import tesserocr
import io
import pandas as pd
from PIL import Image

# Import Snowpark packages
from snowflake.snowpark.context import get_active_session
from snowflake.snowpark.types import StringType, StructField, StructType, IntegerType
from snowflake.snowpark.files import SnowflakeFile
from snowflake.core import CreateMode
from snowflake.core.table import Table, TableColumn
from snowflake.core.schema import Schema
from snowflake.core import Root

# Setup session
session = get_active_session()
session.use_schema("ocr_rag")
root = Root(session)
database = root.databases[session.get_current_database()]

In the Notebook we will create the table that will store processed documents:

docs_chunks_table = Table(
    name="docs_chunks_table",
    columns=[TableColumn(name="relative_path", datatype="string"),
            TableColumn(name="file_url", datatype="string"),
            TableColumn(name="scoped_file_url", datatype="string"),
            TableColumn(name="chunk", datatype="string"),
            TableColumn(name="chunk_vec", datatype="vector(float,768)")]
)
database.schemas["ocr_rag"].tables.create(docs_chunks_table, mode=CreateMode.or_replace)
);

This table stores:

  - relative_path: the path of the source image within the stage
  - file_url: the permanent URL of the staged file
  - scoped_file_url: a scoped URL for secure access to the staged file
  - chunk: a chunk of text extracted from the image
  - chunk_vec: the 768-dimension vector embedding of the chunk

Create OCR Function

In the Notebook we will create a User-Defined Table Function (UDTF) for OCR:

session.sql("DROP FUNCTION IF EXISTS IMAGE_TEXT(VARCHAR)").collect()

class ImageText:
    def process(self, file_url: str):
        with SnowflakeFile.open(file_url, 'rb') as f:
            buffer = io.BytesIO(f.readall())
        image = Image.open(buffer)
        text = tesserocr.image_to_text(image)
        yield (text,)

output_schema = StructType([StructField("full_text", StringType())])

session.udtf.register(
    ImageText,
    name="IMAGE_TEXT",
    is_permanent=True,
    stage_location="@ocr_rag.images_to_ocr",
    schema="ocr_rag",
    output_schema=output_schema,
    packages=["tesserocr", "pillow","snowflake-snowpark-python"],
    replace=True
)

This function:

  1. Reads image binary data from stage
  2. Converts to PIL Image object
  3. Processes with Tesseract OCR
  4. Returns extracted text
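
Before processing the whole stage, you can sanity-check the function on a single file. This is a minimal sketch that uses the same directory-table pattern as the next step:

-- Run the OCR UDTF against one staged image as a smoke test
SELECT ocr.full_text
FROM directory(@ocr_rag.images_to_ocr),
     TABLE(IMAGE_TEXT(build_scoped_file_url(@ocr_rag.images_to_ocr, relative_path))) AS ocr
LIMIT 1;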

Process Images

In the Notebook we will run OCR on the staged images. Note that this SQL cell is named run_through_files_to_ocr; the insert statement in the next step references its result set:

SELECT 
    relative_path, 
    file_url, 
    build_scoped_file_url(@ocr_rag.images_to_ocr, relative_path) AS scoped_file_url,
    ocr_result.full_text
FROM 
    directory(@ocr_rag.images_to_ocr),
    TABLE(IMAGE_TEXT(build_scoped_file_url(@ocr_rag.images_to_ocr, relative_path))) AS ocr_result;

In the Notebook we will insert processed text and create embeddings:

INSERT INTO docs_chunks_table (relative_path, file_url, scoped_file_url, chunk, chunk_vec)
SELECT 
    relative_path, 
    file_url,
    scoped_file_url,
    chunk.value,
    SNOWFLAKE.CORTEX.EMBED_TEXT_768('e5-base-v2', chunk.value) AS chunk_vec
FROM
    {{run_through_files_to_ocr}},
    LATERAL FLATTEN(SNOWFLAKE.CORTEX.SPLIT_TEXT_RECURSIVE_CHARACTER(full_text,'none', 4000, 400)) chunk;

This query:

  1. Takes OCR output text
  2. Splits it into 4000-character chunks with a 400-character overlap
  3. Creates vector embeddings using e5-base-v2 model
  4. Stores results in docs_chunks_table
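
To get a feel for how the splitter behaves before running it over real OCR output, you can call it on a short string. The tiny chunk size and overlap here are purely illustrative; the pipeline above uses 4000 and 400:

SELECT SNOWFLAKE.CORTEX.SPLIT_TEXT_RECURSIVE_CHARACTER(
    'The quick brown fox jumps over the lazy dog.',  -- sample text
    'none',  -- no format-aware splitting
    20,      -- chunk size in characters
    5        -- overlap in characters
) AS chunks;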

Verify data insertion:

SELECT COUNT(*) FROM docs_chunks_table;
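
Beyond the row count, it is worth spot-checking a few rows to confirm the OCR text looks sensible:

-- Preview the first 80 characters of a handful of stored chunks
SELECT relative_path,
       LEFT(chunk, 80) AS chunk_preview
FROM docs_chunks_table
LIMIT 5;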

Create Streamlit App

This section in the Notebook creates a Streamlit application that enables users to ask questions about their OCR-processed documents. The application uses semantic search to find relevant text and generates answers using Snowflake Cortex LLMs.

import streamlit as st
from snowflake.snowpark.context import get_active_session
import pandas as pd

# Setup session and parameters
session = get_active_session()

num_chunks = 3
model = "mistral-7b"

def create_prompt(myquestion):
    cmd = """
     with results as
     (SELECT RELATIVE_PATH,
       VECTOR_COSINE_SIMILARITY(chunk_vec,
                SNOWFLAKE.CORTEX.EMBED_TEXT_768('e5-base-v2', ?)) as similarity,
       chunk
     from ocr_rag.docs_chunks_table
     order by similarity desc
     limit ?)
     select chunk, relative_path from results
     """
    df_context = session.sql(cmd, params=[myquestion, num_chunks]).to_pandas()

    # Concatenate every retrieved chunk into the prompt context
    prompt_context = ""
    for i in range(len(df_context)):
        prompt_context += df_context.loc[i, 'CHUNK']
    prompt_context = prompt_context.replace("'", "")
    relative_path = df_context.loc[0, 'RELATIVE_PATH']
    prompt = f"""
      'You are an expert assistance extracting information from context provided. 
       Answer the question based on the context. Be concise and do not hallucinate. 
       If you don´t have the information just say so.
      Context: {prompt_context}
      Question:  
       {myquestion} 
       Answer: '
       """
    cmd2 = f"select GET_PRESIGNED_URL(@ocr_rag.images_to_ocr, '{relative_path}', 360) as URL_LINK from directory(@ocr_rag.images_to_ocr)"
    df_url_link = session.sql(cmd2).to_pandas()
    url_link = df_url_link.loc[0, 'URL_LINK']

    return prompt, url_link, relative_path

def complete(myquestion, model_name):
    prompt, url_link, relative_path = create_prompt(myquestion)
    cmd = """
             select SNOWFLAKE.CORTEX.COMPLETE(?,?) as response
           """
    df_response = session.sql(cmd, params=[model_name, prompt]).collect()
    return df_response, url_link, relative_path

def display_response(question, model):
    response, url_link, relative_path = complete(question, model)
    res_text = response[0].RESPONSE
    st.markdown(res_text)
    display_url = f"Link to [{relative_path}]({url_link}) that may be useful"
    st.markdown(display_url)

st.title("Asking Questions to Your Scanned Documents with Snowflake Cortex:")
docs_available = session.sql("ls @ocr_rag.images_to_ocr").collect()
question = st.text_input("Enter question", placeholder="What are my documents about?", label_visibility="collapsed")
if question:
    display_response(question, model)

Code Walkthrough

The application:

  1. Uses vector similarity to find relevant text chunks
  2. Creates a prompt with context and question
  3. Calls Cortex LLM to generate answer
  4. Displays answer and source image link

Key parameters:

  - num_chunks: the number of top-similarity chunks passed to the LLM as context (3 here)
  - model: the Snowflake Cortex LLM used to generate answers (mistral-7b here)
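
To see what the retrieval step returns for a given question outside of Streamlit, you can run the similarity query on its own. The question string below is just an example:

-- Inspect the top-matching chunks and their similarity scores
SELECT relative_path,
       VECTOR_COSINE_SIMILARITY(
           chunk_vec,
           SNOWFLAKE.CORTEX.EMBED_TEXT_768('e5-base-v2', 'What are my documents about?')
       ) AS similarity,
       LEFT(chunk, 80) AS chunk_preview
FROM docs_chunks_table
ORDER BY similarity DESC
LIMIT 3;  -- matches num_chunks in the app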

Congratulations! You've successfully built an end-to-end OCR and RAG application in Snowflake that transforms images into searchable, queryable content. Using Snowflake Notebooks and Cortex capabilities, you've implemented a solution that processes images through OCR, creates vector embeddings for semantic search, and provides AI-powered answers using Large Language Models - all while keeping your data secure within Snowflake's environment. Finally, you created a Streamlit application that allows users to interactively query their document content using natural language.

What You Learned

  - How to run OCR on staged images with a Python UDTF built on tesserocr
  - How to chunk extracted text and create vector embeddings with Snowflake Cortex functions
  - How to retrieve relevant chunks with vector similarity search and generate answers with Cortex LLMs
  - How to build a Streamlit app that lets users query their documents in natural language
