A hybrid table is a Snowflake table type that is optimized for hybrid transactional and operational workloads that require low latency and high throughput on small random point reads and writes. A hybrid table supports unique and referential integrity constraint enforcement that is critical for transactional workloads. You can use a hybrid table along with other Snowflake tables and features to power Unistore workloads that bring transactional and analytical data together in a single platform.
Use cases that may benefit from hybrid tables include:
Hybrid tables are integrated seamlessly into the existing Snowflake architecture. Customers connect to the same Snowflake database service. Queries are compiled and optimized in the cloud services layer and executed in the same query engine in virtual warehouses. This provides several key benefits:
Hybrid tables leverage a row store as the primary data store to provide excellent operational query performance. When you write to a hybrid table, the data is written directly into the rowstore. Data is asynchronously copied into object storage in order to provide better performance and workload isolation for large scans without impacting your ongoing operational workloads. Some data may also be cached in columnar format on your warehouse in order to provide better performance on analytical queries. You simply execute SQL statements against the logical hybrid table and Snowflake's query optimizer decides where to read data from in order to provide the best performance. You get one consistent view of your data without needing to worry about the underlying infrastructure.
We will use table structures populated with synthetic data to create scenarios We will initially create scenarios that do not perform well. Following the guide, you will then adjust the query and/or the table structure to create an optimal query.
Create a worksheet and establish a database and schema to do the work. We are doing exploration work here so make sure you are using a role that has permissions to create tables and view query profiles.
-- CUSTOMIZE AS NEEDED
-- CREATE OR REPLACE DATABASE HT_PERFORMANCE;
-- CREATE OR REPLACE SCHEMA HT_PERFORMANCE.PRIMER;
-- CREATE OR REPLACE WAREHOUSE HT_PERFORMANCE_WAREHOUSE WITH WAREHOUSE_SIZE=XSMALL;
-- THE PRIMER WILL USE THIS SCHEMA FOR THE REMAINDER OF THE LEARNING
-- CUSTOMIZE AS NEEDED
USE SCHEMA HT_PERFORMANCE.PRIMER;
USE WAREHOUSE HT_PERFORMANCE_WAREHOUSE;
Now, let's explore primary keys and secondary indexes
Primary keys are a required configuration for a hybrid table. The key establishes the rules for row uniqueness as well as the sort order for how data is written to the row storage. See primary key documentation for more details.
Secondary Indexes provide fast access to query patterns that may not be able to use the primary key.
TRUCKS
Hybrid TableCreate and populate a simple hybrid table. This table is similar to the table used in the getting started with hybrid tables guide but is simplified for brevity. This quickstart will use similar queries for the remainder of the work:
--- CREATE LOTS OF TRUCKS WITH RANDOM YEARS, MAKES, FRANCHISE AND EMAILS.
CREATE OR REPLACE HYBRID TABLE TRUCK (
TRUCK_ID NUMBER(38,0) NOT NULL,
YEAR NUMBER(38,0) NOT NULL,
MAKE VARCHAR(250) NOT NULL,
FRANCHISE_ID NUMBER(38,0) NOT NULL,
TRUCK_EMAIL VARCHAR NOT NULL,
PRIMARY KEY (TRUCK_ID)
)
AS SELECT
SEQ4(),
UNIFORM(1970, 2025, RANDOM()),
ARRAY_CONSTRUCT( 'Freightliner', 'Peterbilt', 'Kenworth', 'Volvo Trucks', 'Mack', 'International (Navistar)', 'Western Star')[UNIFORM(0,6, RANDOM())],
UNIFORM(1, 250, RANDOM()),
RANDSTR(8, RANDOM()) || '@sf-quickstart'
FROM TABLE(GENERATOR(ROWCOUNT => 1000000)) -- WE NEED ENOUGH RECORDS FOR REASONABLE TESTING OTHERWISE EVERYTHING IS IN-MEMORY
;
-- SEE WHAT THE RESULTS LOOK LIKE
SELECT * FROM TRUCK LIMIT 10;
Primary keys are a highly performant way to query hybrid tables. Let's look at a query profile of what querying using the primary key looks like.
-- SELECT AND SET A RANDOM TRUCK_ID TO SESSION VARIABLE FOR CONVENIENCE
SET TRUCK_ID = (SELECT TRUCK_ID FROM TRUCK SAMPLE(1 ROWS));
SELECT *
FROM TRUCK
WHERE TRUCK_ID = $TRUCK_ID;
Now, click on the resulting query id (QID) on the Snowsight console. A new tab will open, and you will be presented with a screen that looks like this:
The screen gives you basic information about the query.
Clicking on the "Query Profile" button:
will lead you to a screen showing you the visual representation of how the query planner decided to execute the query:
This is the primary screen we will be using for query analysis.
Now that you understand how to get to the query profile screen, let's discover how this query was executed. Once we understand how the query planner decided to execute the query, we can understand how to implement any changes in the query that will improve performance.
Clicking on the bottom "TableScan" node will focus our attention on what happened in the data access. In this example we can see that:
TRUCK.TRUCK_ID
ROW_BASED
ROW_BASED
scans means that the query directly accessed the row store and did not use the backing object store to produce the results. For selective predicates against the primary key, ROW_BASED
access indicates that the primary key index was used to push down the predicate and is the fastest and most efficient way to execute the query.
COLUMN_BASED
Access (no key)Instead of querying the table using the primary key, let's query the table using a column that is not indexed and see what the resulting query plan looks like.
SET EMAIL = (SELECT ANY_VALUE(TRUCK_EMAIL) FROM TRUCK);
SELECT *
FROM TRUCK
WHERE TRUE
AND TRUCK_EMAIL = $EMAIL;
We can see from the query profile screen that the query plan looks far different. Much time is spent on the table scan and it is the most expensive node in the plan:
Looking at the "TableScan" node, we can clearly see that the scan was COLUMN_BASED
and processed many partitions with lots of data. The optimizer was not able to push down any access predicates against an index.
Next, we will explore secondary indexes.
Clearly, the above query is not optimal. A secondary index can help the query run faster. Let's create a secondary index and use it to explore the query improvement:
CREATE OR REPLACE HYBRID TABLE TRUCK (
TRUCK_ID NUMBER(38,0) NOT NULL,
YEAR NUMBER(38,0) NOT NULL,
MAKE VARCHAR(250) NOT NULL,
FRANCHISE_ID NUMBER(38,0) NOT NULL,
TRUCK_EMAIL VARCHAR NOT NULL UNIQUE,
PRIMARY KEY (TRUCK_ID),
INDEX IDX_TRUCK_EMAIL (TRUCK_EMAIL)
)
AS SELECT
SEQ4(),
UNIFORM(1970, 2025, RANDOM()),
ARRAY_CONSTRUCT( 'Freightliner', 'Peterbilt', 'Kenworth', 'Volvo Trucks', 'Mack', 'International (Navistar)', 'Western Star')[UNIFORM(0,6, RANDOM())],
UNIFORM(1, 250, RANDOM()),
RANDSTR(8, RANDOM()) || '@sf-quickstart'
FROM TABLE(GENERATOR(ROWCOUNT => 1000000)) -- WE NEED ENOUGH RECORDS FOR REASONABLE TESTING OTHERWISE EVERYTHING IS IN-MEMORY
;
-- SEE WHAT THE RESULTS LOOK LIKE
SELECT * FROM TRUCK LIMIT 10;
Once the table is created we can re-run the query and show that it is using the newly created index:
SET EMAIL = (SELECT ANY_VALUE(TRUCK_EMAIL) FROM TRUCK);
SELECT *
FROM TRUCK
WHERE TRUE
AND TRUCK_EMAIL = $EMAIL;
Using the index results in a much better query plan that uses the index we created:
Next, we will cover foreign keys.
RDBMS data models use foreign keys to establish relationships between tables. All kinds of models take advantage of the relationships to keep data from becoming corrupted or otherwise orphaned from a definition. We call the relationship between the primary key of the reference table and the usage of the key in another table a foreign key.
We will use a TRUCK
and ORDER_HEADER
relationship to create a foreign key relationship. Let's create the tables and explore how the query plan looks with and without a foreign key.
Let's reuse the TRUCK
table and create orders for each truck.
--- CREATE ORDERS FOR EACH TRUCK
CREATE OR REPLACE HYBRID TABLE ORDER_HEADER (
ORDER_ID NUMBER(38,0) NOT NULL,
TRUCK_ID NUMBER(38,0) NOT NULL,
ORDER_TIMESTAMP TIMESTAMP_NTZ NOT NULL,
ORDER_TOTAL NUMBER(38,2),
ORDER_STATUS VARCHAR(200) DEFAULT 'INQUEUE',
PRIMARY KEY (ORDER_ID)
) AS
SELECT
SEQ4(),
T.TRUCK_ID,
DATEADD('seconds', UNIFORM(-1*60*60*24*365, -1*60*60, RANDOM()), CURRENT_TIMESTAMP()), -- SPREAD RANDOM ORDER IDS OUT
UNIFORM(200.0, 25000.0, RANDOM()),
ARRAY_CONSTRUCT('DELIVERED', 'INQUEUE', 'LOADING', 'EN-ROUTE')[UNIFORM(0,3, RANDOM())],
FROM TABLE(GENERATOR(ROWCOUNT => 5000))
CROSS JOIN (SELECT TRUCK_ID FROM TRUCK SAMPLE (1000 ROWS)) T -- ASSIGN ORDERS TO EVERY TRUCK = # TRUCKS X # ORDERS PER TRUCK
;
Consider a query to retrieve a specific order and join in details for the truck for which the order was submitted. This sort of pattern is common when dealing with normalized data models in transactional applications:
-- SELECT A SPECIFIC ORDER
SET ORDER_ID = (SELECT ANY_VALUE(ORDER_ID) FROM ORDER_HEADER);
SELECT *
FROM ORDER_HEADER O
INNER JOIN TRUCK T ON O.TRUCK_ID=T.TRUCK_ID
WHERE TRUE
AND O.ORDER_ID = $ORDER_ID
;
The query plan shows both primary keys in use loading only a single record from each table by using an Nested Loop Join pattern where we first retrieve the order record and then use the truck ID for that order to push down the truck lookup against the primary key of the truck table:
Next, we will explore how secondary indexes help queries that do not use primary keys.
A common query pattern is to join two tables using foreign keys and filter one of them with another column. For example, this query is fetching orders over the last few days for a specific truck:
SELECT *
FROM ORDER_HEADER O
INNER JOIN TRUCK T ON T.TRUCK_ID=O.TRUCK_ID
WHERE TRUE
AND O.ORDER_TIMESTAMP > DATEADD('days', -2, CURRENT_TIMESTAMP()) :: TIMESTAMP_NTZ -- NOTE THE CAST
AND T.TRUCK_ID = $TRUCK_ID
;
The TRUCK_ID
will be used as a primary key filter on TRUCK
and a foreign key filter on ORDER_HEADER
. The timestamp filter will be applied after data was fetched from hybrid table object store. From the query profile, we can see that only the TRUCK_ID
was used as an access predicate. The ORDER_TIMESTAMP
filtering was completed after data was fetched.
A secondary index on (TRUCK_ID, ORDER_TIMESTAMP) will improve performance by allowing us to also push the timestamp filter down to storage:
--- CREATE ORDERS FOR EACH TRUCK
CREATE OR REPLACE HYBRID TABLE ORDER_HEADER (
ORDER_ID NUMBER(38,0) NOT NULL,
TRUCK_ID NUMBER(38,0) NOT NULL,
ORDER_TIMESTAMP TIMESTAMP_NTZ NOT NULL,
ORDER_TOTAL NUMBER(38,2),
ORDER_STATUS VARCHAR(200) DEFAULT 'INQUEUE', -- USE DEFAULT VALUE
PRIMARY KEY (ORDER_ID),
FOREIGN KEY (TRUCK_ID) REFERENCES TRUCK(TRUCK_ID), -- REFER TO THE TRUCK TABLE FOR TRUCK_ID
INDEX IDX_TRUCK_TIMESTAMPS (TRUCK_ID, ORDER_TIMESTAMP ) -- ADD THE SECONDARY INDEX FOR ORDER_TIMESTAMP
);
INSERT INTO ORDER_HEADER (ORDER_ID, TRUCK_ID, ORDER_TIMESTAMP, ORDER_TOTAL)
SELECT
SEQ4(),
T.TRUCK_ID,
DATEADD('seconds', UNIFORM(-1*60*60*24*365, -1*60*60, RANDOM()), CURRENT_TIMESTAMP()), -- SPREAD RANDOM ORDER IDS OUT
UNIFORM(200.0, 25000.0, RANDOM()),
FROM TABLE(GENERATOR(ROWCOUNT => 5000))
CROSS JOIN (SELECT TRUCK_ID FROM TRUCK SAMPLE (1000 ROWS)) T -- ASSIGN ORDERS TO EVERY TRUCK = # TRUCKS X # ORDERS PER TRUCK
;
Now that the index exists and is ACTIVE
, we can use it in a query:
SET TRUCKID = (SELECT ANY_VALUE(TRUCK_ID) FROM ORDER_HEADER);
SELECT *
FROM ORDER_HEADER O
INNER JOIN TRUCK T ON T.TRUCK_ID=O.TRUCK_ID
WHERE TRUE
AND O.ORDER_TIMESTAMP > DATEADD('days', -1, CURRENT_TIMESTAMP()) :: TIMESTAMP_NTZ
AND T.TRUCK_ID = $TRUCKID
;
Now the truck and timestamp predicates are being pushed down against the index in order to reduce the amount of data retrieved from the order_header table improving latency and efficiency.
Next, we will summarize best practices.
Hybrid table performance benefits from following best practices that extract the best performance and highest value from this Snowflake feature.
PRIMARY KEY
, built from business columns, to drive the highest performance.FOREIGN KEYS
when joining hybrid tables.ROW_BASED
mode.Thank you for completing this Hybrid Tables Performance Optimization Primer. You now have a solid understanding of how to relate query performance to the query planning tool. You will use this knowledge to help create and optimize your hybrid table solutions.
Having completed this quickstart you have successfully