As datasets continue to grow, GPU acceleration has become critical for machine learning workflows. To address this, Snowflake ML has integrated with NVIDIA's cuML and cuDF libraries to provide significant performance boosts for popular ML algorithms. These libraries are part of the NVIDIA CUDA-X Data Science ecosystem, an open-source suite of GPU-accelerated tools designed to speed up data processing pipelines.
Snowflake ML is an integrated set of capabilities for end-to-end machine learning workflows that run directly on your governed data. It now integrates NVIDIA's cuML and cuDF libraries directly into the Container Runtime, accelerating libraries such as scikit-learn and pandas on GPUs. This helps Snowflake customers work with large datasets in computationally demanding areas like topic modeling.
This guide shows how to use this native integration in Snowflake. You will learn how to accelerate model development with libraries like scikit-learn and pandas without changing your existing code, cutting processing times from hours to minutes. A topic modeling example demonstrates how these integrated libraries make it fast to work with large datasets in Snowflake ML.
You will build and execute a topic modeling pipeline that processes 500,000 book reviews in minutes instead of hours using GPU-accelerated libraries.
To follow along with this topic modeling use case, download the topic-modeling.ipynb notebook and upload it to your Snowflake environment.
To get started, you need to configure your Snowflake Notebook to run on a container with access to GPU instances. The integration with NVIDIA's libraries is available through the Container Runtime, a pre-built environment for machine learning development.
To create or import the notebook, select + > Notebook > New Notebook or Import .ipynb File.
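Before going further, it can help to confirm that the notebook actually sees a GPU. The following is a minimal sketch, not part of the original notebook: it assumes the `nvidia-smi` command-line tool is available inside the container (an assumption about the runtime image), and the helper name `visible_gpus` is hypothetical.

```python
import shutil
import subprocess


def visible_gpus() -> list[str]:
    """Return the GPU names reported by nvidia-smi, or [] if none are visible."""
    # Assumption: nvidia-smi is on the PATH whenever a GPU driver is present.
    if shutil.which("nvidia-smi") is None:
        return []
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
            capture_output=True,
            text=True,
            check=True,
            timeout=30,
        )
    except (subprocess.SubprocessError, OSError):
        # The tool exists but failed or timed out; treat that as "no usable GPU".
        return []
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


gpus = visible_gpus()
if gpus:
    print(f"Visible GPUs: {gpus}")
else:
    print("No GPU visible; check that the notebook runs on a GPU container.")
```

If the list comes back empty, the acceleration steps that follow will most likely fall back to the CPU or fail, so it is worth revisiting the notebook's container settings first.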
With the latest update to the Snowflake ML Container Runtime, cuML and cuDF are fully integrated into the default GPU environment. To activate their drop-in acceleration capabilities for pandas, scikit-learn, UMAP, and HDBSCAN, you only need to import them and run their respective install() functions at the beginning of your notebook.
# Import the libraries and enable the acceleration
import cudf; cudf.pandas.install()
import cuml; cuml.accel.install()
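If the same notebook may also run in a CPU-only environment, the two setup calls can be wrapped so that a missing library does not stop execution. This is a hedged sketch rather than the quickstart's own code: the helper name `enable_gpu_acceleration` is hypothetical, and it only relies on the `cudf.pandas.install()` and `cuml.accel.install()` entry points shown above.

```python
import importlib


def enable_gpu_acceleration() -> dict[str, bool]:
    """Try to install the cuDF and cuML accelerators and report which succeeded."""
    status: dict[str, bool] = {}
    for module_name in ("cudf.pandas", "cuml.accel"):
        try:
            module = importlib.import_module(module_name)
            module.install()
            status[module_name] = True
        except Exception as exc:
            # ImportError when the library is absent, or a runtime error
            # when no usable GPU is present; continue on the CPU either way.
            print(f"{module_name} not enabled: {exc}")
            status[module_name] = False
    return status


# Call this before importing pandas or scikit-learn.
status = enable_gpu_acceleration()
print(status)
```

The returned dictionary makes it easy to log whether a given run was actually accelerated, which is useful when comparing runtimes later.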
Topic modeling is a common text analysis technique, but it can be computationally expensive, especially with large datasets. The iterative nature of data science means that waiting hours for a single run is not practical. With NVIDIA CUDA-X libraries in Snowflake, you can achieve significant speed-ups with zero or near-zero code changes. This example demonstrates topic modeling on 500,000 book reviews, reducing the runtime from over 8 hours on a CPU to a few minutes on a GPU.
The topic modeling workflow consists of four main steps, all of which can now be accelerated on GPUs:
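1. Read the data with pandas (accelerated by cuDF).
2. Create embeddings with SentenceTransformers (runs on CUDA-enabled PyTorch).
3. Reduce dimensionality with umap-learn (accelerated by cuML).
4. Cluster the embeddings with hdbscan (accelerated by cuML).

After running the setup commands from the previous section, the rest of your code remains unchanged.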
import cudf; cudf.pandas.install()
import cuml; cuml.accel.install()
import pandas as pd
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
# This pandas operation is now GPU-accelerated by cuDF
# data_path must point to the JSON Lines file of book reviews
data = pd.read_json(data_path, lines=True)
docs = data["text"].tolist()
# This model automatically uses CUDA-enabled PyTorch
sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = sentence_model.encode(docs, batch_size=128, show_progress_bar=True)
# BERTopic uses UMAP and HDBSCAN, which are now GPU-accelerated by cuML
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs, embeddings)
With just two initial lines of code, the entire topic modeling pipeline is accelerated, allowing for rapid iteration and analysis.
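To see the speed-up on your own data, you can time each stage with and without the two setup lines. The sketch below uses only the Python standard library; the `timed` context manager and the `timings` dictionary are hypothetical helpers, not part of the notebook, and the workload shown is a stand-in.

```python
import time
from contextlib import contextmanager

timings: dict[str, float] = {}


@contextmanager
def timed(stage: str):
    """Record the wall-clock duration of a pipeline stage in `timings`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = time.perf_counter() - start


# Stand-in workload. In the notebook, wrap the real calls instead, e.g.:
#   with timed("embed"):
#       embeddings = sentence_model.encode(docs, batch_size=128)
#   with timed("fit"):
#       topics, probs = topic_model.fit_transform(docs, embeddings)
with timed("example"):
    sum(i * i for i in range(100_000))

for stage, seconds in timings.items():
    print(f"{stage}: {seconds:.3f} s")
```

Timing the stages separately shows where the time goes, since embedding generation runs on PyTorch while the dimensionality reduction and clustering inside BERTopic are the parts accelerated by cuML.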
You can proceed to the remainder of the topic modeling notebook.

The integration of NVIDIA's cuML and cuDF libraries into Snowflake ML accelerates large-scale machine learning workflows. Because Snowflake manages the GPU infrastructure, data scientists can speed up popular libraries like pandas and scikit-learn without changing their existing code. For computationally demanding tasks such as topic modeling, this cuts iteration time from hours to minutes.