This guide will show you how to use the R language to analyze data stored in Snowflake. It will take you through the process of setting up an R environment in the RStudio Pro IDE that comes with the Posit Workbench Native App in Snowflake. It will also show you how to use R's {dbplyr} interface to translate R code into SQL that runs directly in a Snowflake database, an environment that provides more memory and compute power than the typical R session.
We will explore a financial services use case of analyzing web traffic to identify market opportunities.
First, we will procure free data from the Snowflake Marketplace and load it into Snowflake. The data contains metrics collected by the Web Traffic Foundation and made available by Cybersyn. We will then launch an R session and access the data from R, where we will clean, transform, and plot the data, preparing both an HTML report and an interactive Shiny app of our analysis. We will also demonstrate how to write into a Snowflake database from R. In this way, the guide will show you how to do an end-to-end R analysis entirely within Snowflake.
Some of the setup steps below, such as installing apps and data from the Snowflake Marketplace, require the accountadmin role (or help from someone who has it). Along the way, you will use R to analyze trends in web traffic for the websites of large corporations. You can follow along with this quickstart guide, or look at the materials provided in the accompanying repository: https://github.com/posit-dev/snowflake-posit-quickstart-r.
Before we begin, there are a few components we need to get ready. We need to: add the web traffic data from the Snowflake Marketplace to our account, launch the Posit Workbench Native App, start an RStudio Pro session, and install the R packages we will use.
For this analysis, we will use the Web Traffic Foundation (Experimental)
database from Cybersyn, which is available for free in the Snowflake Marketplace.
In your Snowflake account, go to Data Products > Marketplace to open the Marketplace.
You can search for cybersyn and look for the Web Traffic Foundation (Experimental) database. Once you find the database, load it into your Snowflake account by clicking the Get button on the right-hand side.
If everything is set up properly, you should see the database by navigating to Data > Databases. The database will appear as WEB_TRAFFIC_FOUNDATION_EXPERIMENTAL. The tables we will use are available in the CYBERSYN schema. You can see the tables (views) under the Views dropdown.
Cybersyn makes its data available as views. We will mostly work with the WEBTRAFFIC_SYNDICATE_TIMESERIES view. We also need the WEBTRAFFIC_SYNDICATE_ATTRIBUTES view so we can more easily work with the different types of counts.
A "view" is a saved SQL query that returns a table. Think of it as a way to save a SQL query and have it masquerade around like a table.
We can then start our data exploration using Posit Workbench. You can find Posit Workbench as a Snowflake Native Application, and use it to connect to your database.
In your Snowflake account, go to Data Products > Apps to open the Native Apps collection. If Posit Workbench is not already installed, click Get.
Once Posit Workbench is installed, click on the app under Installed Apps to launch it. If you do not see the Posit Workbench app listed, ask your Snowflake account administrator for access to the app.
After clicking on the app, you will see a page with configuration instructions and a blue Launch app
button.
Click the Launch app button. This should take you to the webpage generated for the Workbench application. You may be prompted to first log in to Snowflake using your regular credentials or authentication method.
Posit Workbench provides several IDEs, such as RStudio Pro, JupyterLab, and VS Code. For this analysis we will use an RStudio Pro IDE.
Within Posit Workbench, click New Session to spin up your coding environment. When prompted, select the RStudio Pro IDE.
Next, connect to your Snowflake account from within Posit Workbench. Click Sign In in the Snowflake section and follow the sign-in prompts. When Snowflake authentication is successful, you will be able to start your session; otherwise, the Start Session button will be greyed out.
Once everything is ready, you will be able to work with your Snowflake data in the familiar RStudio Pro IDE. Since the IDE is provided by the Posit Workbench Native App, your entire analysis will occur securely within Snowflake.
This Quickstart will step you through the analysis contained in https://github.com/posit-dev/snowflake-posit-quickstart-r/blob/main/quarto.qmd. To follow along, open the file in your RStudio Pro IDE: navigate to quarto.qmd in the Files pane and click on the file to open it.

Now that we're in a familiar R environment, we need to prepare the packages we will use. For this analysis, we will use the Tidyverse suite of packages, as well as a few others.
install.packages(c("tidyverse", "DBI", "dbplyr", "gt", "gtExtras"))
We will also use the development version of the {odbc}
package to connect to the Snowflake database.
install.packages("pak")
pak::pak("r-dbi/odbc")
After we install the packages, we load them.
library(tidyverse)
library(DBI)
library(dbplyr)
library(gt)
library(gtExtras)
library(odbc)
We will run our code in the R environment created by the RStudio Pro IDE, but the code will use data stored in the WEB_TRAFFIC_FOUNDATION_EXPERIMENTAL
database on Snowflake.
To access this data, we will use the {DBI}
and {odbc}
R packages to connect to the database. We will then use {dplyr}
and {dbplyr}
to query the data with R, without having to write raw SQL. Let's take a look at how this works.
{DBI} is an R package that allows us to connect to databases with DBI::dbConnect().
To connect to our Snowflake database, we will use a driver provided by the {odbc}
package. We will need to provide a warehouse
for compute, and a database
to connect to. We can also provide a schema
here to make connecting to specific tables and views easier.
# Connect to the database
conn <-
DBI::dbConnect(
drv = odbc::snowflake(),
warehouse = "DEVREL_WH_LARGE",
database = "WEB_TRAFFIC_FOUNDATION_EXPERIMENTAL",
schema = "CYBERSYN"
)
We will save our connection as an object, named conn
, which will make it easy to use.
Once we build a connection, we can see the databases, schemas, and tables available to us in the RStudio IDE Connections pane. Click on the database icon to the right of a database to see its schemas. Click on the schema icon to the right of a schema to see its tables. Click on the table icon to the right of a table to see a preview of the table.
The Connections pane displays the tables in a schema, but not the views associated with those tables. Since Cybersyn provides its data as views, they will not appear here.
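Although the Connections pane won't list them, you can confirm the views exist from R by sending a Snowflake SHOW command over the connection. This is a minimal sketch; the exact columns returned depend on your Snowflake version.

# List the views available in the CYBERSYN schema of the connected database
DBI::dbGetQuery(conn, "SHOW VIEWS IN SCHEMA CYBERSYN")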
Once we build a connection, we can use dplyr::tbl() to create tbls that correspond to tables or views in the database. A tbl is an R object that represents a table or view accessed through a connection.
timeseries <- tbl(conn, "WEBTRAFFIC_SYNDICATE_TIMESERIES")
attributes <- tbl(conn, "WEBTRAFFIC_SYNDICATE_ATTRIBUTES")
If we did not provide the schema argument to DBI::dbConnect() earlier, we would need to specify the view with tbl(conn, in_schema("CYBERSYN", "WEBTRAFFIC_SYNDICATE_TIMESERIES")), as spelled out below.
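That alternative looks like the following sketch. It is only needed when the connection has no default schema; with the connection we created above, the shorter tbl() calls are enough.

# Fully qualify the view names when no default schema is set on the connection
timeseries <- tbl(conn, dbplyr::in_schema("CYBERSYN", "WEBTRAFFIC_SYNDICATE_TIMESERIES"))
attributes <- tbl(conn, dbplyr::in_schema("CYBERSYN", "WEBTRAFFIC_SYNDICATE_ATTRIBUTES"))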
We can now rely on {dbplyr} to translate our R code to SQL, which means we can use timeseries and attributes as if they were tibbles in R. For example, we can filter rows and select columns from our data.
facebook <-
timeseries |>
filter(
DOMAIN_ID == "facebook.com",
lubridate::month(DATE) == 5
) |>
select(DATE, VARIABLE_NAME, VALUE)
When we use tbl
s, {dbplyr}
translates our R code into SQL queries. Think of any object made downstream from a tbl like a view: it contains SQL that represents a table built from the tbl. If we want to see the SQL code that {dbplyr}
generates, we can run dbplyr::show_query()
.
facebook |>
show_query()
SELECT "DATE", "VARIABLE_NAME", "VALUE"
FROM "WEBTRAFFIC_SYNDICATE_TIMESERIES"
WHERE ("DOMAIN_ID" = 'facebook.com') AND (EXTRACT('month', "DATE") = 5.0)
To save compute, R waits to execute a query until we request the data that it represents.
To save memory, R stores the data as a temporary file in the database (instead of in R's global environment). When we inspect the data, R only returns the first few rows of this file to display.
facebook
# Source: SQL [?? x 3]
# Database: Snowflake 8.16.0[@Snowflake/WEB_TRAFFIC_FOUNDATION_EXPERIMENTAL]
DATE VARIABLE_NAME VALUE
<date> <chr> <dbl>
1 2021-05-02 Sessions (Count) | Week | All Devices | Model version 2024.01.25 10016599466
2 2021-05-09 Sessions (Count) | Week | All Devices | Model version 2024.01.25 10024864464
3 2021-05-16 Sessions (Count) | Week | All Devices | Model version 2024.01.25 9948687298
4 2021-05-23 Sessions (Count) | Week | All Devices | Model version 2024.01.25 9913216616
5 2021-05-30 Sessions (Count) | Week | All Devices | Model version 2024.01.25 9886247565
6 2022-05-01 Sessions (Count) | Week | All Devices | Model version 2024.01.25 12502175558
7 2022-05-08 Sessions (Count) | Week | All Devices | Model version 2024.01.25 12140393116
8 2022-05-15 Sessions (Count) | Week | All Devices | Model version 2024.01.25 12446181861
9 2022-05-22 Sessions (Count) | Week | All Devices | Model version 2024.01.25 12436995354
10 2022-05-29 Sessions (Count) | Week | All Devices | Model version 2024.01.25 12418458176
# ℹ more rows
# ℹ Use `print(n = ...)` to see more rows
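A quick way to see this laziness in action: R knows the column names from the translated query, but it cannot report a row count until the query actually runs. A small sketch:

# The query has not executed yet, so the row count is unknown (returns NA)
nrow(facebook)

# The column names are known from the query translation alone
colnames(facebook)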
We can trigger R to execute the query and return the entire result as a tibble with dbplyr::collect(). Keep in mind that if your result is large, you may not want to collect it into R.
facebook |> collect()
# A tibble: 42 × 3
DATE VARIABLE_NAME VALUE
<date> <chr> <dbl>
1 2021-05-02 Pageviews (Count) | Week | All Devices | Model version 2024.01.25 39917694787
2 2021-05-09 Pageviews (Count) | Week | All Devices | Model version 2024.01.25 40870514816
3 2021-05-16 Pageviews (Count) | Week | All Devices | Model version 2024.01.25 41073679841
4 2021-05-23 Pageviews (Count) | Week | All Devices | Model version 2024.01.25 41184673335
5 2021-05-30 Pageviews (Count) | Week | All Devices | Model version 2024.01.25 41837064266
6 2022-05-01 Pageviews (Count) | Week | All Devices | Model version 2024.01.25 99321575179
7 2022-05-08 Pageviews (Count) | Week | All Devices | Model version 2024.01.25 98206319156
8 2022-05-15 Pageviews (Count) | Week | All Devices | Model version 2024.01.25 101792763304
9 2022-05-22 Pageviews (Count) | Week | All Devices | Model version 2024.01.25 102066608868
10 2022-05-29 Pageviews (Count) | Week | All Devices | Model version 2024.01.25 103002669500
# ℹ 32 more rows
# ℹ Use `print(n = ...)` to see more rows
This system keeps the data in the database and pushes computation to Snowflake, so we only pull results into R when we ask for them. We don't need to manage the process; it happens automatically behind the scenes.
{dbplyr}
can translate the most common tidyverse and base R functions to SQL. However, you may sometimes use a function that {dbplyr}
does not recognize or for which there is no SQL analogue. On these occasions, you will need to collect()
your data into R, where you can process it as a real tibble.
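For example, {dbplyr} has no SQL translation for model-fitting functions such as lm(). A minimal sketch of that workflow, reusing the facebook query from above: collect the result, then model it as an ordinary tibble.

# Bring the query result into R as a tibble
facebook_local <- facebook |> collect()

# lm() cannot run in the database, so we fit the model locally
fit <- lm(VALUE ~ as.numeric(DATE), data = facebook_local)
summary(fit)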
Learn more about {dbplyr}
at dbplyr.tidyverse.org
You can also use {DBI}
to create a new table in a database or append to an existing table.
To add a new table, use DBI::dbWriteTable()
. The Cybersyn database doesn't allow creating new tables, but the following code demonstrates how we would add a data frame df
to a different connection.
dbWriteTable(conn2, name = "new_table", value = df)
To append to an existing table, specify append = TRUE
. If appending, name
should specify an existing table.
dbWriteTable(conn2, name = "existing_table", value = df, append = TRUE)
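If you are unsure whether the target table already exists, you can check before deciding whether to create or append. A hedged sketch, again assuming conn2 points at a database where you have write access:

# Append if the table exists; otherwise create it
if (DBI::dbExistsTable(conn2, "existing_table")) {
  DBI::dbWriteTable(conn2, name = "existing_table", value = df, append = TRUE)
} else {
  DBI::dbWriteTable(conn2, name = "existing_table", value = df)
}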
Now that we understand how R will interact with the database, we can use R to perform our analysis.
We want to compare the pageviews, users, and sessions over time for a few popular websites. This information is split between the timeseries
and attributes
tables. Before we begin, we will need to join and clean the tables.
First we standardize column names, so we won't need to worry about capitalization.
# Standardize column names
timeseries <- timeseries |> rename_with(str_to_lower)
attributes <- attributes |> rename_with(str_to_lower)
Then we join the tables, matching rows based on their values of variable
and variable_name
, which are shared across the tables.
# Join to make complete table
timeseries <-
timeseries |>
left_join(
attributes,
by = join_by(variable, variable_name)
)
Our combined timeseries
table will look like this:
> timeseries
# Source: SQL [?? x 12]
# Database: Snowflake 8.15.0[@Snowflake/WEB_TRAFFIC_FOUNDATION_EXPERIMENTAL]
domain_id geo_id geo_name variable variable_name date value measure unit frequency device
<chr> <chr> <chr> <chr> <chr> <date> <dbl> <chr> <chr> <chr> <chr>
1 larepubliquedesp... World... Worldwi... pagevie... Pageviews (C... 2021-11-14 3.83e6 Pagevi... Count Week All D...
2 larepubliquedesp... World... Worldwi... pagevie... Pageviews (C... 2021-11-21 2.85e7 Pagevi... Count Week All D...
3 larepubliquedesp... World... Worldwi... pagevie... Pageviews (C... 2021-11-28 2.61e7 Pagevi... Count Week All D...
4 larepubliquedesp... World... Worldwi... pagevie... Pageviews (C... 2021-12-05 3.65e7 Pagevi... Count Week All D...
5 larepubliquedesp... World... Worldwi... pagevie... Pageviews (C... 2021-12-12 3.80e7 Pagevi... Count Week All D...
6 larepubliquedesp... World... Worldwi... pagevie... Pageviews (C... 2021-12-19 1.38e7 Pagevi... Count Week All D...
7 larepubliquedesp... World... Worldwi... pagevie... Pageviews (C... 2021-12-26 2.31e7 Pagevi... Count Week All D...
8 larepubliquedesp... World... Worldwi... pagevie... Pageviews (C... 2022-01-02 6.40e6 Pagevi... Count Week All D...
9 larepubliquedesp... World... Worldwi... pagevie... Pageviews (C... 2022-01-09 1.72e7 Pagevi... Count Week All D...
10 larepubliquedesp... World... Worldwi... pagevie... Pageviews (C... 2022-01-16 1.50e7 Pagevi... Count Week All D...
# ℹ more rows
# ℹ 1 more variable: model_version <chr>
# ℹ Use `print(n = ...)` to see more rows
The data contains information about 67,671
distinct websites, but we are only interested in a few of them. To make our lives easier, we filter the data to just the domains we are interested in. We also reduce the data to just the columns we are interested in.
top_domains <-
c(
"youtube.com",
"google.com",
"facebook.com",
"tiktok.com",
"instagram.com",
"airbnb.com",
"vrbo.com",
"lyft.com",
"uber.com"
)
timeseries <-
timeseries |>
filter(domain_id %in% top_domains) |>
select(domain_id, date, measure, value) # selects columns
INTERACTIVITY NOTE: We are reducing the list of domains here so we can easily compare a handful of them. A user may want to choose their own set of domain names to compare.
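If you are curious where the 67,671 figure comes from, you can count the distinct domains in the full view. A quick sketch (we query the view again, since timeseries has just been overwritten with the filtered version; the counting happens in the database):

tbl(conn, "WEBTRAFFIC_SYNDICATE_TIMESERIES") |>
  distinct(DOMAIN_ID) |>
  count() |>
  collect()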
Finally, to make the table easier to work with, we move values of pageviews
, users
, and sessions
into their own columns.
timeseries <-
timeseries |>
pivot_wider(names_from = measure, values_from = value) |>
rename_with(str_to_lower)
Our table now looks like this.
> timeseries
# Source: SQL [?? x 5]
# Database: Snowflake 8.15.0[@Snowflake/WEB_TRAFFIC_FOUNDATION_EXPERIMENTAL]
domain_id date pageviews users sessions
<chr> <date> <dbl> <dbl> <dbl>
1 airbnb.com 2022-02-27 171313810 15433961 26568865
2 youtube.com 2022-07-17 262185037173 2472788560 21915523137
3 youtube.com 2023-12-17 340737187803 3083645673 32503255500
4 youtube.com 2023-07-09 236276769477 2857094928 27375541301
5 youtube.com 2023-10-15 263861379113 2976403065 28240902614
6 uber.com 2022-08-07 5158044 2189269 2548854
7 uber.com 2022-08-28 5203070 2024092 2451623
8 uber.com 2023-06-18 5920585 2244755 3118792
9 vrbo.com 2022-02-06 45043973 4594557 7508303
10 tiktok.com 2021-07-25 43315325647 817664398 1887858571
# ℹ more rows
# ℹ Use `print(n = ...)` to see more rows
When we run these commands against a database connection, {dbplyr} translates the code into SQL for us under the hood. We don't need to write raw SQL, and the compute happens directly in the database. You can pipe (|>) the code into show_query() if you want to see the generated SQL query.
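If you want the generated SQL as a string, for example to reuse it elsewhere in Snowflake, dbplyr::remote_query() returns it instead of printing it. A small sketch:

# Capture the SQL that produces the current timeseries table
sql_text <- dbplyr::remote_query(timeseries)
sql_text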
The Cybersyn data provides signals that can help us identify market opportunities, such as how web traffic for competing companies changes over time. One of the best ways to spot those signals is to visualize the performance of web domains over time.
We can use {ggplot2}
to visually compare the web traffic of two domains over time. Here we can see that, since 2021, airbnb.com
has always been more popular than vrbo.com
, but page views for both really started to take off in 2024.
domains <- c("airbnb.com", "vrbo.com")
timeseries |>
filter(domain_id %in% domains) |>
ggplot(aes(date, pageviews, color = domain_id)) +
geom_line() +
scale_y_log10() +
theme_minimal() +
theme(legend.position="bottom") +
labs(
x = "",
y = "",
color = "",
title = "Pageviews"
)
INTERACTIVITY NOTE: We've written the code above to make it simple to visualize other domains. A user can assign one or more new domain names to domains and rerun the code, as sketched below.
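For instance, to compare the two ride-hailing sites instead, you could assign:

# Swap in a different pair of domains, then rerun the plotting code above
domains <- c("uber.com", "lyft.com")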
We can continue to explore our data with visualizations, or we can make a table that concisely displays multiple pieces of information at once. For example, we can use {dplyr}
verbs to identify the median number of weekly pageviews, users, and sessions for our top domains.
timeseries |>
summarize(
across(
c("pageviews", "users", "sessions"),
\(x) median(x, na.rm = TRUE),
.names = "avg_{.col}"
)
)
# Source: SQL [1 x 3]
# Database: Snowflake 8.15.0[@Snowflake/WEB_TRAFFIC_FOUNDATION_EXPERIMENTAL]
avg_pageviews avg_users avg_sessions
<dbl> <dbl> <dbl>
1 15814434389 319991480. 1190998470.
With {dplyr}
's group_by()
command, we can compute separate metrics for each domain.
comparison <-
timeseries |>
group_by(domain_id) |>
summarize(
across(
c("pageviews", "users", "sessions"),
\(x) median(x, na.rm = TRUE),
.names = "avg_{.col}"
),
.groups = "drop"
) |>
arrange(desc(avg_pageviews))
comparison
# Source: SQL [9 x 4]
# Database: Snowflake 8.15.0[@Snowflake/WEB_TRAFFIC_FOUNDATION_EXPERIMENTAL]
# Ordered by: desc(avg_pageviews)
domain_id avg_pageviews avg_users avg_sessions
<chr> <dbl> <dbl> <dbl>
1 google.com 222944428888 3373207378 27864010868.
2 youtube.com 217373542132. 2535250883 22623754428
3 facebook.com 114193789484. 2205116942. 12941002565
4 tiktok.com 50081097228 1105659086. 2691429058.
5 instagram.com 16405246854. 319991480. 1190998470.
6 airbnb.com 312666776. 15437368. 27302062.
7 vrbo.com 36548417 4173328. 6663558.
8 uber.com 5472395 2133430. 2598333
9 lyft.com 5251633 1660472. 2321456
Here we see, among other things, that youtube.com
receives almost as many pageviews as google.com, but from a much smaller set of users.
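To make that comparison explicit, we could compute pageviews per user from the summary. A quick sketch on the collected comparison table:

comparison |>
  collect() |>
  mutate(pageviews_per_user = avg_pageviews / avg_users) |>
  select(domain_id, pageviews_per_user) |>
  arrange(desc(pageviews_per_user))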
This is a useful way to examine the information for ourselves. However, if we wish to share the information with others, we might prefer to present the table in a more polished format. We can do this with commands from R's {gt}
package.
The following code creates a table displaying the information in comparison
and highlighting the two domains we investigated in the previous section.
comparison |>
gt(rowname_col = "domain_id") |>
fmt_number(suffixing = TRUE, decimals = 0) |>
cols_label(
avg_pageviews = "Pageviews",
avg_users = "Users",
avg_sessions = "Sessions"
) |>
gt_highlight_rows(
rows = (domain_id == "airbnb.com"),
fill = "#ffce67"
) |>
gt_highlight_rows(
rows = (domain_id == "vrbo.com"),
fill = "#78c2ad"
)
|               | Pageviews | Users | Sessions |
|---------------|-----------|-------|----------|
| google.com    | 221B      | 3B    | 28B      |
| youtube.com   | 219B      | 3B    | 23B      |
| facebook.com  | 113B      | 2B    | 13B      |
| tiktok.com    | 49B       | 1B    | 3B       |
| instagram.com | 15B       | 296M  | 1B       |
| airbnb.com    | 357M      | 17M   | 31M      |
| vrbo.com      | 37M       | 4M    | 7M       |
| uber.com      | 6M        | 2M    | 3M       |
| lyft.com      | 5M        | 2M    | 2M       |
Now that we've accumulated some insights, let's think about how we might present the results of our data analysis to our colleagues.
We've conveniently written our analysis in a Quarto (.qmd) document, quarto.qmd. Quarto is an open-source publishing system that makes it easy to create data products such as documents, presentations, dashboards, websites, and books.
By placing our work in a Quarto document, we have interwoven all of our code, results, output, and prose text into a single literate programming document. This way everything can travel together in a reproducible data product.
A Quarto document can be thought of as a regular markdown document, but with the ability to run code chunks.
You can run any of the code chunks by clicking the play button above the chunk in the RStudio Pro IDE.
You can render the entire document into a polished report to share with the Render button. This will run all the code in the document from top to bottom in a new R session and generate an HTML file, by default, for you to view and share.
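You can also render the document from the R console, assuming the {quarto} R package is installed. A minimal sketch:

# Render quarto.qmd to its default HTML output
quarto::quarto_render("quarto.qmd")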
You can learn more about Quarto here: https://quarto.org/, and find documentation for all of the various Quarto outputs here: https://quarto.org/docs/guide/. Quarto works with R, Python, and Observable JavaScript code out of the box, and is a great tool for communicating your data science analyses.
Our visual analysis compared two specific websites: airbnb.com and vrbo.com, but we could reuse the code to compare any combination of websites. An efficient way to do this is with an interactive Shiny app.
We've prepared an example Shiny app in the accompanying repository: https://github.com/posit-dev/snowflake-posit-quickstart-r
To run the app, open app.R
and then click the Run App button at the top of the script in the RStudio Pro IDE.
To use the app, update the domains and date range in the sidebar. The app will respond automatically by updating its plot, table, and value boxes.
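To give a flavor of how such an app hangs together, here is a heavily simplified sketch. It is not the app in the repository; it assumes the conn connection, top_domains, and the prepared timeseries table from the analysis above already exist in the session.

library(shiny)

ui <- fluidPage(
  selectInput(
    "domains", "Domains",
    choices = top_domains,
    selected = c("airbnb.com", "vrbo.com"),
    multiple = TRUE
  ),
  plotOutput("pageviews")
)

server <- function(input, output, session) {
  output$pageviews <- renderPlot({
    # {dbplyr} pushes the filter to Snowflake; ggplot2 collects the result
    timeseries |>
      filter(domain_id %in% input$domains) |>
      ggplot(aes(date, pageviews, color = domain_id)) +
      geom_line() +
      scale_y_log10() +
      theme_minimal() +
      labs(x = "", y = "", color = "", title = "Pageviews")
  })
}

shinyApp(ui, server)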
You can learn more about Shiny at: https://shiny.posit.co/. This example uses Shiny for R, but Shiny for Python is also available!
If you're new to Shiny, you can try it online with shinylive. It, too, comes in a Python version.
R is beloved by data scientists for its intuitive, concise syntax. You can now combine this syntax with the power and peace of mind of Snowflake. The Posit Workbench Native Application provides an IDE for R within Snowflake. You can then use R's existing database packages—{DBI}
, {odbc}
, {dbplyr}
—to access your Snowflake databases.