In this quickstart, you'll learn how to build an end-to-end application that analyzes audio files for emotional tone and sentiment using Snowflake Notebooks on Container Runtime. The application combines audio processing, speech recognition, and sentiment analysis to create comprehensive insights from audio data.

What is Container Runtime?

Snowflake Notebooks on Container Runtime enable advanced data science and machine learning workflows directly within Snowflake. Powered by Snowpark Container Services, Container Runtime provides a flexible environment for building and operationalizing a wide range of workloads, especially those that require Python packages from multiple sources and powerful compute resources, including CPUs and GPUs. With this Snowflake-native experience, you can process audio, perform speech recognition, and run sentiment analysis while seamlessly executing SQL queries. NOTE: This feature is currently in Public Preview.

Learn more about Container Runtime.

What is wav2vec2?

Wav2vec2 is a state-of-the-art framework for self-supervised learning of speech representations. Developed by Facebook AI, it's specifically designed for speech recognition tasks but has been adapted for various audio analysis tasks including emotion recognition. The model we use in this guide, "ehcalabres/wav2vec2-lg-xlsr-en-speech-emotion-recognition", is fine-tuned for detecting emotions in speech, capable of identifying emotions like happiness, sadness, anger, and neutral tones from audio input.
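As a rough sketch of how such a model can be used, the Hugging Face `transformers` library exposes fine-tuned wav2vec2 checkpoints through its `audio-classification` pipeline, which returns a list of `{"label": ..., "score": ...}` predictions. The pipeline call downloads model weights, so it is kept inside an uncalled `demo()` function here; the filename `customer_call1.wav` is a hypothetical local file.

```python
from typing import Dict, List, Tuple


def top_emotion(predictions: List[Dict]) -> Tuple[str, float]:
    """Pick the highest-confidence label from audio-classification output.

    `predictions` is a list of {"label": ..., "score": ...} dicts, the
    shape returned by a Hugging Face audio-classification pipeline.
    """
    best = max(predictions, key=lambda p: p["score"])
    return best["label"], best["score"]


def demo():
    # Not called here: loading the checkpoint downloads model weights
    # and requires the transformers and torch packages.
    from transformers import pipeline

    classifier = pipeline(
        "audio-classification",
        model="ehcalabres/wav2vec2-lg-xlsr-en-speech-emotion-recognition",
    )
    preds = classifier("customer_call1.wav")  # hypothetical local file
    label, score = top_emotion(preds)
    print(f"{label}: {score:.3f}")
```

The `top_emotion` helper is plain Python, so the same post-processing works whether the classifier runs locally or inside a Container Runtime notebook.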

Learn more about wav2vec2.

What is Snowflake Cortex?

Snowflake Cortex is a suite of AI features that use large language models (LLMs) to understand unstructured data, answer freeform questions, and provide intelligent assistance. In this guide, we use Cortex's sentiment analysis capabilities to analyze the emotional content of transcribed speech.
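As a minimal sketch, Cortex sentiment scoring is exposed as the SQL function SNOWFLAKE.CORTEX.SENTIMENT, which returns a score from -1 (negative) to 1 (positive). The column and table names below (`transcript`, `audio_results`) are hypothetical placeholders; running the query requires an active Snowpark session, so that part sits in an uncalled `demo()` function.

```python
def sentiment_sql(text_column: str, table: str) -> str:
    """Build a query that scores each row's text with Cortex sentiment.

    SNOWFLAKE.CORTEX.SENTIMENT returns a float from -1 (negative)
    to 1 (positive).
    """
    return (
        f"SELECT {text_column}, "
        f"SNOWFLAKE.CORTEX.SENTIMENT({text_column}) AS sentiment_score "
        f"FROM {table}"
    )


def demo():
    # Not called here: get_active_session() only works inside Snowflake.
    from snowflake.snowpark.context import get_active_session

    session = get_active_session()
    session.sql(sentiment_sql("transcript", "audio_results")).show()
```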

Learn more about Snowflake Cortex.

What is Whisper?

OpenAI's Whisper is an open-source automatic speech recognition (ASR) model designed for high-quality transcription and translation of spoken language. Trained on diverse multilingual data, it handles various languages, accents, and challenging audio conditions like background noise. Whisper supports transcription, language detection, and translation to English, making it versatile for applications such as subtitles, accessibility tools, and voice interfaces.
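A transcription call with the open-source `openai-whisper` package looks roughly like the sketch below: `transcribe()` returns a dict with the full `"text"` plus timestamped `"segments"`. Model loading downloads weights, so it lives in an uncalled `demo()` function; `join_segments` is a small pure helper for rebuilding a transcript from the segment list, and `customer_call1.wav` is a hypothetical file.

```python
from typing import Iterable, Mapping


def join_segments(segments: Iterable[Mapping]) -> str:
    """Stitch Whisper's per-segment texts into one clean transcript."""
    return " ".join(seg["text"].strip() for seg in segments).strip()


def demo():
    # Not called here: requires `pip install openai-whisper` and a
    # one-time model weight download.
    import whisper

    model = whisper.load_model("base")               # small multilingual model
    result = model.transcribe("customer_call1.wav")  # hypothetical file
    print(result["text"])                            # full transcript
    print(join_segments(result["segments"]))         # rebuilt from segments
```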

Learn more about Whisper.

What You'll Learn

What You'll Build

A full-stack application that enables users to:

Prerequisites

Step 1. In Snowsight, create a SQL Worksheet and open setup.sql to execute all statements in order from top to bottom.

Step 2. In Snowsight, switch your user role to AUDIO_CONTAINER_RUNTIME_ROLE.

Step 3. Click on sfguide_getting_started_with_audio_sentiment_analysis_using_snowflake_notebooks.ipynb to download the Notebook from GitHub. (NOTE: Do NOT right-click to download.)

Step 4. In Snowsight:

Step 5. Open Notebook

Supported Audio Formats

Audio Specifications

For best results, your audio files should have:
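As a quick pre-flight check, you can validate WAV files with Python's standard-library `wave` module. The sketch below assumes the common target format for wav2vec2-style models, mono 16-bit PCM at 16 kHz; adjust the expectations to match your own pipeline.

```python
import wave


def check_wav(path: str, expected_rate: int = 16_000) -> list:
    """Report mismatches against an assumed target format
    (mono, 16-bit PCM, 16 kHz). Returns a list of problems; empty = OK."""
    problems = []
    with wave.open(path, "rb") as wf:
        if wf.getnchannels() != 1:
            problems.append(f"expected mono, got {wf.getnchannels()} channels")
        if wf.getsampwidth() != 2:
            problems.append(f"expected 16-bit samples, got {wf.getsampwidth() * 8}-bit")
        if wf.getframerate() != expected_rate:
            problems.append(f"expected {expected_rate} Hz, got {wf.getframerate()} Hz")
    return problems
```

Running this over a folder before kicking off the notebook catches format problems early, instead of surfacing them as model errors mid-run.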

Best Practices

For optimal analysis:

Common Use Cases

This system works well for analyzing:

Here's a walkthrough of the notebook cells and their functions:

Cell 1: Package Installation

Cell 2: Environment Setup
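A typical environment-setup cell imports the required libraries and grabs the Snowpark session that the notebook already has open. The sketch below is a hedged example, not the notebook's exact code; `get_active_session()` only works inside Snowflake, so it is wrapped in a function rather than run at import time.

```python
def setup_session():
    """Grab the active Snowpark session inside a Snowflake Notebook.

    Not called here: get_active_session() raises outside Snowflake.
    """
    from snowflake.snowpark.context import get_active_session

    session = get_active_session()
    # Sanity check: confirm the role and warehouse the notebook is using,
    # e.g. AUDIO_CONTAINER_RUNTIME_ROLE from the setup script.
    print(session.get_current_role(), session.get_current_warehouse())
    return session
```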

Cell 3: Audio Processing Configuration

Key components:

Cell 4: Process Files
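Conceptually, this cell iterates over the staged audio files and runs each one through the emotion model, Whisper, and sentiment scoring. The sketch below shows that shape under stated assumptions: the extension set is an assumption, and `classify_emotion`, `transcribe`, and `score_sentiment` in the uncalled `demo()` stand in for the per-file steps defined in earlier cells.

```python
from pathlib import Path

# Assumption: adjust this extension set to whatever your stage accepts.
SUPPORTED = {".wav", ".mp3", ".flac", ".m4a"}


def audio_files(folder: str) -> list:
    """Return supported audio files in a folder, sorted for stable output."""
    return sorted(
        p for p in Path(folder).iterdir()
        if p.suffix.lower() in SUPPORTED
    )


def demo():
    # Not called here: classify_emotion, transcribe, and score_sentiment
    # are hypothetical names for the steps built in earlier cells.
    rows = []
    for path in audio_files("audio/"):
        emotion, e_score = classify_emotion(path)
        transcript = transcribe(path)
        s_score = score_sentiment(transcript)
        rows.append((path.name, emotion, e_score, transcript, s_score))
    return rows
```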

Example output:

                 File  Emotion  Emotion_Score                                         Transcript  Sentiment_Score Tone_Sentiment_Match
0  customer_call1.wav    happy          0.892  Thank you so much for your help today! You've...              0.8                Match
1  customer_call2.wav    angry          0.945     I've been waiting for hours and nobody has...             -0.7                Match
2  customer_call3.wav  neutral          0.756       I would like to inquire about the status...              0.1              Unknown
3  customer_call4.wav    happy          0.834      This service has exceeded my expectations...              0.9                Match

The notebook outputs results showing the file name, detected emotion, emotion confidence score, transcript, sentiment score, and whether the tone matches the sentiment analysis.

The analysis provides multiple metrics that work together to give a comprehensive view of the audio:

Sentiment Analysis

Emotional Classification

Tone-Sentiment Matching

The system compares emotional tone with sentiment scores to verify consistency:

The 'Tone_Sentiment_Match' field in the results indicates:
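One way to implement this comparison is to map each detected emotion to a polarity and check it against the sign of the sentiment score. The sketch below is an assumption-laden example consistent with the sample output above (happy/0.8 → Match, angry/-0.7 → Match, neutral → Unknown); the label sets and the 0.2 neutrality threshold are illustrative choices, not the notebook's exact values.

```python
# Assumption: label sets depend on the emotion model's output vocabulary.
POSITIVE = {"happy", "surprised"}
NEGATIVE = {"angry", "sad", "fearful", "disgust"}


def tone_sentiment_match(emotion: str, sentiment: float,
                         threshold: float = 0.2) -> str:
    """Compare detected vocal emotion with the transcript's sentiment score.

    Returns "Match" when both agree in polarity, "Mismatch" when they
    conflict, and "Unknown" when either signal is neutral or ambiguous.
    """
    if emotion in POSITIVE:
        tone = 1
    elif emotion in NEGATIVE:
        tone = -1
    else:
        return "Unknown"      # e.g. "neutral" or an unrecognized label
    if abs(sentiment) < threshold:
        return "Unknown"      # sentiment too weak to compare confidently
    text_polarity = 1 if sentiment > 0 else -1
    return "Match" if tone == text_polarity else "Mismatch"
```

A "Mismatch" row (say, a happy-sounding voice with a strongly negative transcript) is often the most interesting output, since it flags calls where how something was said diverges from what was said.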

Congratulations! You've successfully built an end-to-end audio analysis application in Snowflake that combines emotional tone detection, speech recognition, and sentiment analysis using Container Runtime for ML.

What You Learned

Related Resources

Webpages:

Documentation:

Sample Code & Guides:

Related Quickstarts: