Hybrid Tables Performance Optimization Primer

Hybrid Tables

A hybrid table is a Snowflake table type that is optimized for hybrid transactional and operational workloads that require low latency and high throughput on small random point reads and writes. A hybrid table supports unique and referential integrity constraint enforcement that is critical for transactional workloads. You can use a hybrid table along with other Snowflake tables and features to power Unistore workloads that bring transactional and analytical data together in a single platform.

Use cases that may benefit from hybrid tables include:

Applications requiring INSERT/UPDATE/DELETE concurrency matched with SELECT performance
Usage requiring low-latency, high concurrency INSERT/UPDATE like log tables, metadata tracking, or status information
Low latency responses for highly selective record retrieval over very large data sets

Architecture

Hybrid tables are integrated seamlessly into the existing Snowflake architecture. Customers connect to the same Snowflake database service. Queries are compiled and optimized in the cloud services layer and executed in the same query engine in virtual warehouses. This provides several key benefits:

Snowflake platform features, such as data governance, work with hybrid tables out of the box.
You can run hybrid workloads mixing operational and analytical queries.
You can join hybrid tables with other Snowflake tables and the query executes natively and efficiently in the same query engine. No federation is required.
You can execute an atomic transaction across hybrid tables and other Snowflake tables. There is no need to orchestrate your own two-phase commit.

Architecture

Hybrid tables leverage a row store as the primary data store to provide excellent operational query performance. When you write to a hybrid table, the data is written directly into the rowstore. Data is asynchronously copied into object storage in order to provide better performance and workload isolation for large scans without impacting your ongoing operational workloads. Some data may also be cached in columnar format on your warehouse in order to provide better performance on analytical queries. You simply execute SQL statements against the logical hybrid table and Snowflake's query optimizer decides where to read data from in order to provide the best performance. You get one consistent view of your data without needing to worry about the underlying infrastructure.

What You Will Learn

How to design hybrid tables and access patterns that achieve high performance
How to analyze queries using Snowsight
How to adjust queries to improve performance
How hybrid table unique characteristics like Indexes, primary keys, unique and foreign keys affect performance

Prerequisites

Familiarity with the Snowflake's Snowsight interface
Familiarity with SQL
Snowflake non-trial account with Hybrid Tables enabled
A database and/or schema within which you have permissions to create tables

Process

We will be creating three tables that represent a data model. Following the trucking example from the Getting Started with Hybrid Tables quickstart, we will use TRUCK and ORDER_HEADER tables to represent our data.
Create and work with TRUCK table to understand primary keys, secondary indexes, and accompanying query patterns
Build on TRUCK with ORDER_HEADER to work with foreign key relationships
Add a secondary index to find orders for trucks based on order date

We will use table structures populated with synthetic data to create scenarios We will initially create scenarios that do not perform well. Following the guide, you will then adjust the query and/or the table structure to create an optimal query.

Creating an Environment

Create a worksheet and establish a database and schema to do the work. We are doing exploration work here so make sure you are using a role that has permissions to create tables and view query profiles.

-- CUSTOMIZE AS NEEDED
-- CREATE OR REPLACE DATABASE HT_PERFORMANCE;
-- CREATE OR REPLACE SCHEMA HT_PERFORMANCE.PRIMER;

-- CREATE OR REPLACE WAREHOUSE HT_PERFORMANCE_WAREHOUSE WITH WAREHOUSE_SIZE=XSMALL;
    
-- THE PRIMER WILL USE THIS SCHEMA FOR THE REMAINDER OF THE LEARNING
-- CUSTOMIZE AS NEEDED
USE SCHEMA HT_PERFORMANCE.PRIMER;
USE WAREHOUSE HT_PERFORMANCE_WAREHOUSE;

Now, let's explore primary keys and secondary indexes

Primary keys are a required configuration for a hybrid table. The key establishes the rules for row uniqueness as well as the sort order for how data is written to the row storage. See primary key documentation for more details.

Secondary Indexes provide fast access to query patterns that may not be able to use the primary key.

What We Will Cover

Creating a hybrid table with test data
Primary Key setup and analysis
Secondary index setup and analysis

Create the `TRUCKS` Hybrid Table

Create and populate a simple hybrid table. This table is similar to the table used in the getting started with hybrid tables guide but is simplified for brevity. This quickstart will use similar queries for the remainder of the work:

--- CREATE LOTS OF TRUCKS WITH RANDOM YEARS, MAKES, FRANCHISE AND EMAILS.
CREATE OR REPLACE HYBRID TABLE TRUCK (
    TRUCK_ID NUMBER(38,0) NOT NULL,
    YEAR NUMBER(38,0) NOT NULL,
    MAKE VARCHAR(250) NOT NULL,
    FRANCHISE_ID NUMBER(38,0) NOT NULL,
    TRUCK_EMAIL VARCHAR NOT NULL,
    PRIMARY KEY (TRUCK_ID) 
)
AS SELECT
    SEQ4(),
    UNIFORM(1970, 2025, RANDOM()),
    ARRAY_CONSTRUCT( 'Freightliner', 'Peterbilt', 'Kenworth', 'Volvo Trucks', 'Mack', 'International (Navistar)', 'Western Star')[UNIFORM(0,6, RANDOM())],
    UNIFORM(1, 250, RANDOM()),
    RANDSTR(8, RANDOM()) || '@sf-quickstart'
FROM TABLE(GENERATOR(ROWCOUNT => 1000000)) -- WE NEED ENOUGH RECORDS FOR REASONABLE TESTING OTHERWISE EVERYTHING IS IN-MEMORY
;
-- SEE WHAT THE RESULTS LOOK LIKE
SELECT * FROM TRUCK LIMIT 10;

Query with the Primary Key

Primary keys are a highly performant way to query hybrid tables. Let's look at a query profile of what querying using the primary key looks like.

-- SELECT AND SET A RANDOM TRUCK_ID TO SESSION VARIABLE FOR CONVENIENCE
SET TRUCK_ID = (SELECT TRUCK_ID FROM TRUCK SAMPLE(1 ROWS));

SELECT *
FROM TRUCK
WHERE TRUCK_ID = $TRUCK_ID;

Now, click on the resulting query id (QID) on the Snowsight console. A new tab will open, and you will be presented with a screen that looks like this:

The screen gives you basic information about the query.

Clicking on the "Query Profile" button:

will lead you to a screen showing you the visual representation of how the query planner decided to execute the query:

This is the primary screen we will be using for query analysis.

Analyze the Query Profile Output

Now that you understand how to get to the query profile screen, let's discover how this query was executed. Once we understand how the query planner decided to execute the query, we can understand how to implement any changes in the query that will improve performance.

Clicking on the bottom "TableScan" node will focus our attention on what happened in the data access. In this example we can see that:

The Access Predicate was indeed the primary key TRUCK.TRUCK_ID
The Scan Mode was ROW_BASED
Exactly one record was retrieved from storage by the scan, exactly matching the truck ID.

ROW_BASED scans means that the query directly accessed the row store and did not use the backing object store to produce the results. For selective predicates against the primary key, ROW_BASED access indicates that the primary key index was used to push down the predicate and is the fastest and most efficient way to execute the query.

Query with `COLUMN_BASED` Access (no key)

Instead of querying the table using the primary key, let's query the table using a column that is not indexed and see what the resulting query plan looks like.

SET EMAIL = (SELECT ANY_VALUE(TRUCK_EMAIL) FROM TRUCK);
    
SELECT *
FROM TRUCK
WHERE TRUE
    AND TRUCK_EMAIL = $EMAIL;

We can see from the query profile screen that the query plan looks far different. Much time is spent on the table scan and it is the most expensive node in the plan:

Looking at the "TableScan" node, we can clearly see that the scan was COLUMN_BASED and processed many partitions with lots of data. The optimizer was not able to push down any access predicates against an index.

Next, we will explore secondary indexes.

Add a Secondary Index

Clearly, the above query is not optimal. A secondary index can help the query run faster. Let's create a secondary index and use it to explore the query improvement:

CREATE OR REPLACE HYBRID TABLE TRUCK (
    TRUCK_ID NUMBER(38,0) NOT NULL,
    YEAR NUMBER(38,0) NOT NULL,
    MAKE VARCHAR(250) NOT NULL,
    FRANCHISE_ID NUMBER(38,0) NOT NULL,
    TRUCK_EMAIL VARCHAR NOT NULL UNIQUE,
    PRIMARY KEY (TRUCK_ID),
    INDEX IDX_TRUCK_EMAIL (TRUCK_EMAIL)
)
AS SELECT
    SEQ4(),
    UNIFORM(1970, 2025, RANDOM()),
    ARRAY_CONSTRUCT( 'Freightliner', 'Peterbilt', 'Kenworth', 'Volvo Trucks', 'Mack', 'International (Navistar)', 'Western Star')[UNIFORM(0,6, RANDOM())],
    UNIFORM(1, 250, RANDOM()),
    RANDSTR(8, RANDOM()) || '@sf-quickstart'
FROM TABLE(GENERATOR(ROWCOUNT => 1000000)) -- WE NEED ENOUGH RECORDS FOR REASONABLE TESTING OTHERWISE EVERYTHING IS IN-MEMORY
;
-- SEE WHAT THE RESULTS LOOK LIKE
SELECT * FROM TRUCK LIMIT 10;

Once the table is created we can re-run the query and show that it is using the newly created index:

SET EMAIL = (SELECT ANY_VALUE(TRUCK_EMAIL) FROM TRUCK);

SELECT *
FROM TRUCK
WHERE TRUE
  AND TRUCK_EMAIL = $EMAIL;

Using the index results in a much better query plan that uses the index we created:

Next, we will cover foreign keys.

RDBMS data models use foreign keys to establish relationships between tables. All kinds of models take advantage of the relationships to keep data from becoming corrupted or otherwise orphaned from a definition. We call the relationship between the primary key of the reference table and the usage of the key in another table a foreign key.

Setup

We will use a TRUCK and ORDER_HEADER relationship to create a foreign key relationship. Let's create the tables and explore how the query plan looks with and without a foreign key.

Let's reuse the TRUCK table and create orders for each truck.

--- CREATE ORDERS FOR EACH TRUCK
CREATE OR REPLACE HYBRID TABLE ORDER_HEADER (
	ORDER_ID NUMBER(38,0) NOT NULL,
	TRUCK_ID NUMBER(38,0) NOT NULL,
    ORDER_TIMESTAMP TIMESTAMP_NTZ NOT NULL,
	ORDER_TOTAL NUMBER(38,2),
	ORDER_STATUS VARCHAR(200) DEFAULT 'INQUEUE',
	PRIMARY KEY (ORDER_ID)
) AS
SELECT
    SEQ4(),
    T.TRUCK_ID,
    DATEADD('seconds', UNIFORM(-1*60*60*24*365, -1*60*60, RANDOM()), CURRENT_TIMESTAMP()), -- SPREAD RANDOM ORDER IDS OUT
    UNIFORM(200.0, 25000.0, RANDOM()),
    ARRAY_CONSTRUCT('DELIVERED', 'INQUEUE', 'LOADING', 'EN-ROUTE')[UNIFORM(0,3, RANDOM())],
FROM TABLE(GENERATOR(ROWCOUNT => 5000))
CROSS JOIN (SELECT TRUCK_ID FROM TRUCK SAMPLE (1000 ROWS)) T    -- ASSIGN ORDERS TO EVERY TRUCK = # TRUCKS X # ORDERS PER TRUCK
;

Consider a query that loads all orders for a specific truck. This query will join the TRUCK table with the ORDER_HEADER table while filtering for a particular truck.

-- SELECT A TRUCK THAT HAS ORDER DETAILS
SET TRUCK_ID = (SELECT ANY_VALUE(TRUCK_ID) FROM ORDER_HEADER);

-- QUERY FOR ALL ORDERS (5000) FOR THIS PARTICULAR TRUCK 
SELECT *
FROM TRUCK T
INNER JOIN ORDER_HEADER O ON O.TRUCK_ID=T.TRUCK_ID
WHERE TRUE
  AND T.TRUCK_ID = $TRUCK_ID
;

Note that the query optimizer does not have information to relate the two tables within the join. Thus, the optimizer chooses a column scan mode when reading data from the ORDER_HEADER table. Technically, the optimizer must interpret the join as a many-to-many join.

Introducing a FOREIGN KEY will serve two important purposes. First, it establishes a relationship between the tables that guarantees a one-to-many join for a given PRIMARY KEY in the parent table. Second, the relationship will enforce a rule that requires data to be consistent such that child table TRUCK_ID values must exist in the TRUCK primary key. Recreate the table with a FOREIGN KEY:

-- CREATE ORDER_HEADER WITH A FOREIGN KEY
CREATE OR REPLACE HYBRID TABLE ORDER_HEADER (
    ORDER_ID NUMBER(38,0) NOT NULL,
    TRUCK_ID NUMBER(38,0) NOT NULL,
    ORDER_TIMESTAMP TIMESTAMP_NTZ NOT NULL,
    ORDER_TOTAL NUMBER(38,2),
    ORDER_STATUS VARCHAR(200) DEFAULT 'INQUEUE',
    PRIMARY KEY (ORDER_ID),
    CONSTRAINT FK_TRUCK_ID FOREIGN KEY (TRUCK_ID) REFERENCES TRUCK(TRUCK_ID)
);
-- USE INSERT INTO METHOD BECAUSE THE FOREIGN KEY MUST EXIST TO BE ENFORCED
INSERT INTO ORDER_HEADER
SELECT
    SEQ4(),
    T.TRUCK_ID,
    DATEADD('seconds', UNIFORM(-1*60*60*24*365, -1*60*60, RANDOM()), CURRENT_TIMESTAMP()), -- SPREAD RANDOM ORDER IDS OUT
    UNIFORM(200.0, 25000.0, RANDOM()),
    ARRAY_CONSTRUCT('DELIVERED', 'INQUEUE', 'LOADING', 'EN-ROUTE')[UNIFORM(0,3, RANDOM())],
FROM TABLE(GENERATOR(ROWCOUNT => 5000))
CROSS JOIN (SELECT TRUCK_ID FROM TRUCK SAMPLE (1000 ROWS)) T    -- ASSIGN ORDERS TO EVERY TRUCK = # TRUCKS X # ORDERS PER TRUCK
;

Next, find a new TRUCK_ID and execute the query to retrieve the order header information.

SET TRUCK_ID = (SELECT ANY_VALUE(TRUCK_ID) FROM ORDER_HEADER);

SELECT *
FROM TRUCK T
INNER JOIN ORDER_HEADER O ON O.TRUCK_ID=T.TRUCK_ID
WHERE TRUE
    AND T.TRUCK_ID = $TRUCK_ID
;

Checking the query plan, we can see that the query optimizer is now using the foreign key to efficiently select the order detail information.

Next, consider a query to retrieve a specific order and join in details for the truck for which the order was submitted. This sort of pattern is common when dealing with normalized data models in transactional applications:

-- SELECT A SPECIFIC ORDER
SET ORDER_ID = (SELECT ANY_VALUE(ORDER_ID) FROM ORDER_HEADER);

SELECT *
FROM ORDER_HEADER O
INNER JOIN TRUCK T ON O.TRUCK_ID=T.TRUCK_ID
WHERE TRUE
  AND O.ORDER_ID = $ORDER_ID
;

The query plan shows both primary keys in use loading only a single record from each table by using an Nested Loop Join pattern where we first retrieve the order record and then use the truck ID for that order to push down the truck lookup against the primary key of the truck table:

Next, we will explore how secondary indexes help queries that do not use primary keys.

A common query pattern is to join two tables using foreign keys and filter one of them with another column. For example, this query is fetching orders over the last few days for a specific truck:

SELECT *
FROM ORDER_HEADER O
INNER JOIN TRUCK T ON T.TRUCK_ID=O.TRUCK_ID
WHERE TRUE
    AND O.ORDER_TIMESTAMP > DATEADD('days', -2, CURRENT_TIMESTAMP()) :: TIMESTAMP_NTZ  -- NOTE THE CAST
    AND T.TRUCK_ID = $TRUCK_ID
;

The TRUCK_ID will be used as a primary key filter on TRUCK and a foreign key filter on ORDER_HEADER. The timestamp filter will be applied after data was fetched from hybrid table object store. From the query profile, we can see that only the TRUCK_ID was used as an access predicate. The ORDER_TIMESTAMP filtering was completed after data was fetched.

A secondary index on (TRUCK_ID, ORDER_TIMESTAMP) will improve performance by allowing us to also push the timestamp filter down to storage:

--- CREATE ORDERS FOR EACH TRUCK
CREATE OR REPLACE HYBRID TABLE ORDER_HEADER (
    ORDER_ID NUMBER(38,0) NOT NULL,
    TRUCK_ID NUMBER(38,0) NOT NULL,
    ORDER_TIMESTAMP TIMESTAMP_NTZ NOT NULL,
    ORDER_TOTAL NUMBER(38,2),
    ORDER_STATUS VARCHAR(200) DEFAULT 'INQUEUE',  -- USE DEFAULT VALUE
    PRIMARY KEY (ORDER_ID),
    FOREIGN KEY (TRUCK_ID) REFERENCES TRUCK(TRUCK_ID),  -- REFER TO THE TRUCK TABLE FOR TRUCK_ID
    INDEX IDX_TRUCK_TIMESTAMPS (TRUCK_ID, ORDER_TIMESTAMP )  -- ADD THE SECONDARY INDEX FOR ORDER_TIMESTAMP
);
INSERT INTO ORDER_HEADER (ORDER_ID, TRUCK_ID, ORDER_TIMESTAMP, ORDER_TOTAL)
SELECT
    SEQ4(),
    T.TRUCK_ID,
    DATEADD('seconds', UNIFORM(-1*60*60*24*365, -1*60*60, RANDOM()), CURRENT_TIMESTAMP()), -- SPREAD RANDOM ORDER IDS OUT
    UNIFORM(200.0, 25000.0, RANDOM()),
FROM TABLE(GENERATOR(ROWCOUNT => 5000))
CROSS JOIN (SELECT TRUCK_ID FROM TRUCK SAMPLE (1000 ROWS)) T    -- ASSIGN ORDERS TO EVERY TRUCK = # TRUCKS X # ORDERS PER TRUCK
;

Now that the index exists and is ACTIVE, we can use it in a query:

SET TRUCKID = (SELECT ANY_VALUE(TRUCK_ID) FROM ORDER_HEADER);

SELECT *
FROM ORDER_HEADER O
INNER JOIN TRUCK T ON T.TRUCK_ID=O.TRUCK_ID
WHERE TRUE
    AND O.ORDER_TIMESTAMP > DATEADD('days', -1, CURRENT_TIMESTAMP()) :: TIMESTAMP_NTZ
    AND T.TRUCK_ID = $TRUCKID
;

Now the truck and timestamp predicates are being pushed down against the index in order to reduce the amount of data retrieved from the order_header table improving latency and efficiency.

Next, we will summarize best practices.

Hybrid table performance benefits from following best practices that extract the best performance and highest value from this Snowflake feature.

Use a PRIMARY KEY, built from business columns, to drive the highest performance.
Implement FOREIGN KEYS when joining hybrid tables.
Write queries that use keys and indexes to keep queries in ROW_BASED mode.
Use Snowsight query plans to verify optimal query plans and troubleshoot performance issues.
Pay close attention to data types, join criteria, and index utilization.
Follow Snowflake hybrid table best practices.

Overview

Thank you for completing this Hybrid Tables Performance Optimization Primer. You now have a solid understanding of how to relate query performance to the query planning tool. You will use this knowledge to help create and optimize your hybrid table solutions.

What You Learned

Having completed this quickstart you have successfully

Created hybrid tables with primary keys
Wrote and learned how to analyze a query profile
Learned how to use secondary indexes and constraints
Created a child table using foreign keys referencing a parent table
Joined tables and worked with secondary indexes to maximize performance
Utilized secondary indexes to speed query performance across non-primary key columns

Hybrid Tables

Architecture

What You Will Learn

Prerequisites

Process

Creating an Environment

What We Will Cover

Create the TRUCKS Hybrid Table

Query with the Primary Key

Analyze the Query Profile Output

Query with COLUMN_BASED Access (no key)

Add a Secondary Index

Setup

Overview

What You Learned

Related Resources:

Create the `TRUCKS` Hybrid Table

Query with `COLUMN_BASED` Access (no key)