Chunking strategies play a critical role in the effectiveness of Retrieval Augmented Generation (RAG) applications. The way you split your documents can significantly impact the quality of responses your AI system provides. In this guide, you'll learn how to systematically compare different chunking approaches using Snowflake's AI Observability features.
You'll build a complete RAG application that analyzes SEC 10-K filings, implementing two different chunking strategies: basic paragraph-based chunks, and the same chunks enriched with a document-level summary for added context.
By the end, you'll have quantitative metrics to determine which approach delivers better results for your specific use case.
A complete RAG application that can answer questions about SEC 10-K filings, with tools to measure and compare the performance of different chunking strategies.
Run setup.sql to create the necessary databases, schemas, warehouses, and roles.
To follow along with this quickstart, download the notebook file from the GitHub repository, then import it in Snowsight to create a new Snowflake notebook in the CHUNKING_EVALUATION database and CHUNKING_EVALUATION schema.
In your Snowflake notebook, install the following Python packages from the Snowflake conda channel:
Then, get the active Snowpark session and enable TruLens OpenTelemetry tracing.
from snowflake.snowpark.context import get_active_session
session = get_active_session()
# Enable OpenTelemetry tracing for AI Observability
import os
os.environ["TRULENS_OTEL_TRACING"] = "1"
Download the PDF documents from the data folder in the GitHub repository and upload them to the stage @CHUNKING_EVALUATION.DOCS.PDF_10KS.
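If you prefer to stay in the notebook rather than uploading through Snowsight, a minimal sketch using Snowpark's file API looks like this. It assumes the PDFs sit in a local folder named data; adjust the path to wherever you saved them.
# Upload all local PDFs to the stage used in the rest of this guide
put_results = session.file.put(
    "data/*.pdf",                            # local files (assumed location)
    "@CHUNKING_EVALUATION.DOCS.PDF_10KS",    # target stage
    auto_compress=False,                     # keep PDFs uncompressed so they can be parsed
    overwrite=True
)
for result in put_results:
    print(result.source, result.status)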
In this section, we'll extract text from the PDF documents using Snowflake's PARSE_DOCUMENT function. This function can handle complex document layouts, including tables and columns.
First, let's define our database and schema variables:
DB_NAME = 'CHUNKING_EVALUATION'
DOC_SCHEMA_NAME = 'DOCS'
TEXT_SCHEMA_NAME = 'PARSED_DATA'
Now, create a table to store the parsed text:
-- Create a table to hold the extracted text from the PDF files
CREATE TABLE IF NOT EXISTS CHUNKING_EVALUATION.PARSED_DATA.PARSED_TEXT (
    relative_path VARCHAR(500),
    raw_text VARIANT
);
Next, we'll use the PARSE_DOCUMENT function to extract text from each PDF file:
INSERT INTO CHUNKING_EVALUATION.PARSED_DATA.PARSED_TEXT (relative_path, raw_text)
WITH pdf_files AS (
SELECT DISTINCT
METADATA$FILENAME AS relative_path
FROM @CHUNKING_EVALUATION.DOCS.PDF_10KS
WHERE METADATA$FILENAME ILIKE '%.pdf'
-- Exclude files that have already been parsed
AND METADATA$FILENAME NOT IN (SELECT relative_path FROM CHUNKING_EVALUATION.PARSED_DATA.PARSED_TEXT)
)
SELECT
relative_path,
SNOWFLAKE.CORTEX.PARSE_DOCUMENT(
'@CHUNKING_EVALUATION.DOCS.PDF_10KS', -- Your stage name
relative_path, -- File path
{'mode': 'layout'} -- Using layout mode to preserve document structure
) AS raw_text
FROM pdf_files;
The 'layout' mode preserves the document's structure, including tables and formatting, which is important for financial documents like 10-K filings.
Let's check our parsed documents and count the tokens in each:
-- Inspect the results and count the tokens for each document
SELECT *, SNOWFLAKE.CORTEX.COUNT_TOKENS('mistral-7b', RAW_TEXT) as token_count
FROM CHUNKING_EVALUATION.PARSED_DATA.PARSED_TEXT;
This gives us a sense of the document sizes and helps us plan our chunking strategy.
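If you want to eyeball the extracted text before chunking, a quick preview query against the content field (a sketch, run through the Snowpark session) looks like this:
# Preview the first 500 characters of each document's extracted text
preview_df = session.sql("""
    SELECT
        relative_path,
        LEFT(raw_text:content::STRING, 500) AS content_preview
    FROM CHUNKING_EVALUATION.PARSED_DATA.PARSED_TEXT
    LIMIT 3
""")
preview_df.show()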
Now we'll implement two different chunking strategies to compare their effectiveness.
First, let's create chunks based on paragraph separators:
-- Chunk the text based on paragraph separators
CREATE OR REPLACE TABLE CHUNKING_EVALUATION.PARSED_DATA.PARAGRAPH_CHUNKS AS
WITH text_chunks AS (
SELECT
relative_path,
SNOWFLAKE.CORTEX.SPLIT_TEXT_RECURSIVE_CHARACTER(
raw_text:content::STRING, -- Extract the 'content' field from the JSON
'markdown', -- Format type
2000, -- Chunk size (in characters)
100, -- Overlap size (in characters)
['\n\n'] -- Paragraph separators
) AS chunks
FROM CHUNKING_EVALUATION.PARSED_DATA.PARSED_TEXT
)
SELECT
relative_path,
c.value AS chunk -- Extract each chunk of the parsed text
FROM text_chunks,
LATERAL FLATTEN(INPUT => chunks) c;
This approach splits the text at paragraph boundaries, creating chunks of approximately 2,000 characters with a 100-character overlap between consecutive chunks.
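To sanity-check the result, you can look at how many chunks each document produced and how long they are; a sketch using the Snowpark session from earlier:
# How many chunks per document, and how long are they?
session.sql("""
    SELECT
        relative_path,
        COUNT(*) AS num_chunks,
        ROUND(AVG(LENGTH(chunk::STRING))) AS avg_chunk_chars,
        MAX(LENGTH(chunk::STRING)) AS max_chunk_chars
    FROM CHUNKING_EVALUATION.PARSED_DATA.PARAGRAPH_CHUNKS
    GROUP BY relative_path
    ORDER BY relative_path
""").show()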
For our second strategy, we'll enhance each chunk with document-level context by adding a document summary generated with AI_SUMMARIZE_AGG:
-- Add the DOC_SUMMARY column if it doesn't exist
ALTER TABLE CHUNKING_EVALUATION.PARSED_DATA.PARAGRAPH_CHUNKS
ADD COLUMN IF NOT EXISTS DOC_SUMMARY VARCHAR(5000);
-- Generate document summaries
UPDATE CHUNKING_EVALUATION.PARSED_DATA.PARAGRAPH_CHUNKS AS tgt
SET DOC_SUMMARY = src.DOC_SUMMARY
FROM (
SELECT
RELATIVE_PATH,
AI_SUMMARIZE_AGG(CHUNK) AS DOC_SUMMARY
FROM CHUNKING_EVALUATION.PARSED_DATA.PARAGRAPH_CHUNKS
GROUP BY RELATIVE_PATH
) AS src
WHERE tgt.RELATIVE_PATH = src.RELATIVE_PATH
AND tgt.DOC_SUMMARY IS NULL;
-- Create combined chunks with summary context
ALTER TABLE CHUNKING_EVALUATION.PARSED_DATA.PARAGRAPH_CHUNKS
ADD COLUMN IF NOT EXISTS CHUNK_WITH_SUMMARY VARCHAR;
UPDATE CHUNKING_EVALUATION.PARSED_DATA.PARAGRAPH_CHUNKS
SET CHUNK_WITH_SUMMARY = DOC_SUMMARY || '\n\n' || CHUNK;
This strategy prepends each chunk with a summary of the entire document, providing additional context that might help the model understand the content better.
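To see the effect, you can compare a raw chunk with its summary-augmented counterpart; a small preview sketch:
# Compare a plain chunk with its summary-prefixed version
session.sql("""
    SELECT
        relative_path,
        LEFT(chunk::STRING, 300) AS chunk_preview,
        LEFT(chunk_with_summary, 300) AS chunk_with_summary_preview
    FROM CHUNKING_EVALUATION.PARSED_DATA.PARAGRAPH_CHUNKS
    LIMIT 2
""").show()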
Now we'll create two Cortex Search services, one for each chunking strategy, to enable vector search capabilities.
CREATE OR REPLACE CORTEX SEARCH SERVICE CHUNKING_EVALUATION.PARSED_DATA.SEC_CHUNK_RETRIEVAL
ON SEARCH_COL
WAREHOUSE = COMPUTE
TARGET_LAG = '1 hour'
EMBEDDING_MODEL = 'snowflake-arctic-embed-l-v2.0'
AS (
SELECT
RELATIVE_PATH,
CHUNK::STRING AS SEARCH_COL
FROM CHUNKING_EVALUATION.PARSED_DATA.PARAGRAPH_CHUNKS
);
CREATE OR REPLACE CORTEX SEARCH SERVICE CHUNKING_EVALUATION.PARSED_DATA.SEC_CONTEXTUAL_CHUNK_RETRIEVAL
ON SEARCH_COL
WAREHOUSE = COMPUTE
TARGET_LAG = '1 hour'
EMBEDDING_MODEL = 'snowflake-arctic-embed-l-v2.0'
AS (
SELECT
RELATIVE_PATH,
CHUNK,
CHUNK_WITH_SUMMARY::STRING AS SEARCH_COL
FROM CHUNKING_EVALUATION.PARSED_DATA.PARAGRAPH_CHUNKS
);
Both services use the same embedding model but index different content: one uses the basic chunks, while the other uses chunks enhanced with document summaries.
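Before querying them, you can confirm that both services were created and have finished their initial indexing; one way, using the session from earlier, is:
# List the Cortex Search services in the schema to confirm they exist
session.sql(
    "SHOW CORTEX SEARCH SERVICES IN SCHEMA CHUNKING_EVALUATION.PARSED_DATA"
).show()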
Let's test our search services to make sure they're working correctly:
# Query the basic search service
from snowflake.snowpark import Session
from snowflake.core import Root
root = Root(session)
sec_search_service = (root
.databases[DB_NAME]
.schemas[TEXT_SCHEMA_NAME]
.cortex_search_services['SEC_CHUNK_RETRIEVAL']
)
resp = sec_search_service.search(
query="What was under armour's net sales?",
columns=['SEARCH_COL'],
limit=3
)
results = resp.results
results
# Query the contextual search service
sec_contextual_search_service = (root
.databases[DB_NAME]
.schemas[TEXT_SCHEMA_NAME]
.cortex_search_services['SEC_CONTEXTUAL_CHUNK_RETRIEVAL']
)
resp = sec_contextual_search_service.search(
query="What was Zscaler's net sales?",
columns=['SEARCH_COL'],
limit=3
)
results = resp.results
results
These tests help us verify that both search services are operational and returning relevant results.
Now we'll build two RAG applications, one for each chunking strategy, using the search services we created.
First, let's define a RAG class that we can use for both versions:
from snowflake.cortex import complete
from trulens.core.otel.instrument import instrument
from trulens.otel.semconv.trace import SpanAttributes
class RAG:
    def __init__(self, search_service):
        self.search_service = search_service

    @instrument(
        span_type=SpanAttributes.SpanType.RETRIEVAL,
        attributes={
            SpanAttributes.RETRIEVAL.QUERY_TEXT: "query",
            SpanAttributes.RETRIEVAL.RETRIEVED_CONTEXTS: "return",
        }
    )
    def retrieve_context(self, query: str) -> list:
        """
        Retrieve relevant text from the vector store.
        """
        response = self.search_service.search(
            query=query,
            columns=['SEARCH_COL'],
            limit=4
        )
        if response.results:
            return [curr["SEARCH_COL"] for curr in response.results]
        else:
            return []

    @instrument(
        span_type=SpanAttributes.SpanType.GENERATION)
    def generate_completion(self, query: str, context_list: list) -> str:
        """
        Generate an answer from the retrieved context.
        """
        prompt = f"""
        You are an expert assistant extracting information from context provided.
        Answer the question as concisely as possible without any preface.
        Context: {context_list}
        Question:
        {query}
        Answer:
        """
        response = complete("claude-4-sonnet", prompt)
        return response

    @instrument(
        span_type=SpanAttributes.SpanType.RECORD_ROOT,
        attributes={
            SpanAttributes.RECORD_ROOT.INPUT: "query",
            SpanAttributes.RECORD_ROOT.OUTPUT: "return",
        })
    def query(self, query: str) -> str:
        context_str = self.retrieve_context(query)
        return self.generate_completion(query, context_str)
Now, let's create two instances of our RAG class, one for each search service:
rag = RAG(search_service = sec_search_service)
contextual_rag = RAG(search_service = sec_contextual_search_service)
The @instrument decorators in our RAG class enable AI Observability, which will help us track and evaluate the performance of our RAG applications.
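As a quick, optional smoke test, you can ask both versions the same question to confirm retrieval and generation work end to end; for example:
# Ask both RAG versions the same question to verify they respond
question = "What was Under Armour's net revenue for the quarter ending June 30, 2020?"
print("Base:", rag.query(question))
print("Contextual:", contextual_rag.query(question))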
Now we'll set up AI Observability to compare the performance of our two RAG applications.
First, let's establish a connection to Snowflake for AI Observability:
from trulens.apps.app import TruApp
from trulens.connectors.snowflake import SnowflakeConnector
tru_snowflake_connector = SnowflakeConnector(snowpark_session=session)
Next, we'll register our two RAG applications with AI Observability:
app_name = "sec_10k_chat_app"
base_tru_recorder = TruApp(
rag,
app_name=app_name,
app_version="base",
connector=tru_snowflake_connector,
main_method_name="query"
)
contextual_tru_recorder = TruApp(
contextual_rag,
app_name=app_name,
app_version="contextual chunks",
connector=tru_snowflake_connector,
main_method_name="query"
)
This registration allows AI Observability to track and compare the performance of our two RAG versions.
To evaluate our RAG applications, we need a test dataset with questions and ground truth answers.
CREATE OR REPLACE TABLE SEC_FILINGS_QA (
COMPANY_NAME STRING,
QUESTION STRING,
GROUND_TRUTH_ANSWER STRING
);
INSERT INTO SEC_FILINGS_QA
(COMPANY_NAME, QUESTION, GROUND_TRUTH_ANSWER)
VALUES
(
'Autodesk, Inc.',
'In Autodesk''s Form 10-K covering fiscal year 2024, what amount of Remaining Performance Obligations (RPO) was reported as of the close of the fiscal year on January 31, 2024?',
'$6.11 billion'
),
(
'IQVIA Holdings Inc.',
'According to IQVIA''s most recent annual 10-K, what was the approximate total number of employees worldwide at the end of the reporting period?',
'IQVIA has approximately 87,000 employees.'
),
(
'Alcoa Corporation',
'Within Alcoa''s 2022 10-K, how many thousand metric tons of alumina shipments to third parties were disclosed for the six months ending June 30, 2022?',
'4,715 kmt'
),
(
'Zscaler, Inc.',
'As reported in Zscaler''s 2024 10-K, what was the total deferred revenue recorded at the fiscal year-end date of July 31, 2024?',
'$1,895.0 million'
),
(
'3M Company',
'In 3M''s discussion of environmental litigation matters in its 2018 Form 10-K, what was the pre-tax amount recorded in connection with the Minnesota settlement over PFAS?',
'$897 million'
),
(
'Under Armour, Inc.',
'Looking at Under Armour''s 2020 10-K, what net revenue and cost of goods sold figures were reported for the three-month period ending June 30, 2020?',
'Net revenue: $707,640 thousand - Cost of goods sold: $358,471 thousand'
),
(
'Packaging Corporation of America',
'According to PCA''s 2019 annual 10-K disclosures, what was the final purchase price paid to acquire the assets of Englander?',
'$57.7 million'
),
(
'Spectrum Brands Holdings',
'In Spectrum''s 2022 10-K, which specific appliances and cookware business acquisition was highlighted in the notes to the financial statements?',
'Tristar Business'
),
(
'Spectrum Brands Holdings',
'Per Spectrum''s most recent 10-K description of its operating structure, what are the principal business segments reported?',
'GPC (Global Pet Care), H&G (Home & Garden), and HPC (Home & Personal Care)'
),
(
'Southwestern Energy Company',
'In the 2018 10-K, Southwestern Energy reported the divestiture of its Arkansas subsidiaries — which buyer purchased them and at what agreed price?',
'Flywheel Energy Operating LLC, for a price of $1,650 million.'
);
This dataset contains specific questions about information in the 10-K filings, along with the correct answers.
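A quick check confirms the dataset loaded as expected (this assumes the notebook's current schema is the one where SEC_FILINGS_QA was created):
# Verify the evaluation dataset
qa_df = session.table("SEC_FILINGS_QA")
print(f"{qa_df.count()} question/answer pairs loaded")
qa_df.show(3)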
Now we'll create and run evaluations to compare our two RAG applications.
First, let's configure the evaluation runs:
from trulens.core.run import Run
from trulens.core.run import RunConfig
from datetime import datetime
TEST_RUN_NAME = f"base_run_{datetime.now().strftime('%Y%m%d%H%M%S')}"
base_run_config = RunConfig(
run_name=TEST_RUN_NAME,
description="Questions about SEC 10KS",
dataset_name="SEC_FILINGS_QA",
source_type="TABLE",
label="CHUNKS",
dataset_spec={
"RECORD_ROOT.INPUT": "QUESTION",
"RECORD_ROOT.GROUND_TRUTH_OUTPUT":"GROUND_TRUTH_ANSWER",
},
)
base_run = base_tru_recorder.add_run(run_config=base_run_config)
CONTEXTUAL_TEST_RUN_NAME = f"contextual_run_{datetime.now().strftime('%Y%m%d%H%M%S')}"
contextual_run_config = RunConfig(
run_name=CONTEXTUAL_TEST_RUN_NAME,
dataset_name="SEC_FILINGS_QA",
description="Questions about SEC 10KS",
source_type="TABLE",
label="CONTEXTUAL_CHUNKS",
dataset_spec={
"RECORD_ROOT.INPUT": "QUESTION",
"RECORD_ROOT.GROUND_TRUTH_OUTPUT":"GROUND_TRUTH_ANSWER",
}
)
contextual_run = contextual_tru_recorder.add_run(run_config=contextual_run_config)
Now, let's start the evaluation runs:
base_run.start()
print("Finished base run")
contextual_run.start()
print("Finished contextual run")
run_list = [base_run, contextual_run]
After the runs complete, we'll compute metrics to evaluate performance:
import time

for run in run_list:
    # Wait for the run's invocation to finish before computing metrics
    while run.get_status() == "INVOCATION_IN_PROGRESS":
        time.sleep(3)

    if run.get_status() == "INVOCATION_COMPLETED":
        run.compute_metrics([
            "correctness",
            "answer_relevance",
            "context_relevance",
            "groundedness",
        ])
        print(f"Kicked off metrics computation for run {run.run_name}")

    if run.get_status() in ["FAILED", "UNKNOWN"]:
        print("Not able to compute metrics! Run status:", run.get_status())
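Metric computation itself runs asynchronously; if you want to check on the runs later, you can poll their status again:
# Check the current status of each run
for run in run_list:
    print(run.run_name, run.get_status())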
These metrics will help us understand how well each RAG version performs: correctness measures how closely the generated answer matches the ground-truth answer, answer relevance checks whether the answer actually addresses the question, context relevance scores how relevant the retrieved chunks are to the question, and groundedness checks whether the answer is supported by the retrieved context.
Finally, let's view the results of our runs by opening the Evaluations page in Snowsight (under AI & ML).
In the dashboard, you can inspect each run's records, traces, and aggregated metric scores for every version of the application.
To compare the runs head to head, select the checkbox for each version and then click Compare on the right side of the screen.
Congratulations! You've successfully built and compared two different chunking strategies for a RAG application using Snowflake's AI Observability features. By systematically evaluating these approaches, you now have data-driven insights into which chunking strategy works better for your specific use case.
This methodology can be applied to optimize any RAG application, helping you make informed decisions about document processing, chunking, and retrieval strategies. The quantitative metrics provided by AI Observability allow you to iterate and improve your RAG applications with confidence.
Read more: