In this quickstart, you'll learn how to build an end-to-end multimodal analysis application using AI models through Snowflake Cortex AI. The application uses AI_COMPLETE with models like Claude 4 Sonnet and Pixtral-large to extract insights, detect emotions, and generate descriptions from images, and uses AI_TRANSCRIBE to transcribe audio with speaker identification, all within the Snowflake ecosystem.

Note: AI_COMPLETE multimodal capability and AI_TRANSCRIBE are currently in Public Preview.

What You'll Learn

What You'll Build

A multimodal analysis system that enables users to:

Prerequisites

To set up your Snowflake environment for multimodal analysis:

  1. Download the setup.sql file
  2. Open a new worksheet in Snowflake
  3. Paste the contents of setup.sql into the worksheet (or upload the file) and run it
  4. The script will create:
    • A new database and schema for your project
    • Image and audio storage stages
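The setup.sql file itself is not reproduced here. As a rough sketch of what a script with this effect might contain, the statements below create the objects named later in the upload steps (MULTIMODAL_ANALYSIS, MEDIA, IMAGES, AUDIO). The directory-table and encryption options are assumptions, chosen because Cortex functions that read staged files generally expect server-side encrypted internal stages; the real script may differ.

```sql
-- Assumed equivalent of setup.sql; object names are taken from the upload steps.
CREATE DATABASE IF NOT EXISTS MULTIMODAL_ANALYSIS;
CREATE SCHEMA IF NOT EXISTS MULTIMODAL_ANALYSIS.MEDIA;

-- Internal stage for images (options are assumptions, not confirmed by the quickstart)
CREATE STAGE IF NOT EXISTS MULTIMODAL_ANALYSIS.MEDIA.IMAGES
  DIRECTORY = (ENABLE = TRUE)
  ENCRYPTION = (TYPE = 'SNOWFLAKE_SSE');

-- Internal stage for audio files, configured the same way
CREATE STAGE IF NOT EXISTS MULTIMODAL_ANALYSIS.MEDIA.AUDIO
  DIRECTORY = (ENABLE = TRUE)
  ENCRYPTION = (TYPE = 'SNOWFLAKE_SSE');
```

If you run the provided setup.sql, you do not need to run this sketch as well.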

Upload Media Files

After running the setup script:

  1. Download data.zip and unzip it to get the sample images and audio files
  2. For images:
    • Navigate to Data > Databases > MULTIMODAL_ANALYSIS > MEDIA > IMAGES > Stages
    • Click the "Upload Files" button in the top right
    • Select your image files
  3. For audio files:
    • Navigate to Data > Databases > MULTIMODAL_ANALYSIS > MEDIA > AUDIO > Stages
    • Click the "Upload Files" button in the top right
    • Select your audio files
  4. Verify upload success:
-- Check uploaded images
LS @MULTIMODAL_ANALYSIS.MEDIA.IMAGES;

-- Check uploaded audio files
LS @MULTIMODAL_ANALYSIS.MEDIA.AUDIO;

You should see your uploaded files listed with their sizes.
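With files staged, a quick smoke test can confirm that image analysis works before moving on. The query below is a minimal sketch rather than the notebook's actual code: the file name example.jpg is hypothetical (substitute one of the names returned by LS), and the prompt wording is only an illustration.

```sql
-- Ask a multimodal model to describe one staged image.
-- 'example.jpg' is a placeholder; use a file name listed by LS above.
SELECT AI_COMPLETE(
  'claude-4-sonnet',
  'Describe this image in two sentences and note the dominant emotion, if any.',
  TO_FILE('@MULTIMODAL_ANALYSIS.MEDIA.IMAGES', 'example.jpg')
) AS image_description;
```

Swapping the first argument for 'pixtral-large' runs the same prompt against the other model mentioned in this quickstart, which makes it easy to compare their answers on the same file.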

Let's create a notebook to explore multimodal analysis techniques:

  1. Navigate to Projects > Notebooks in Snowflake
  2. Click the "+ Notebook" button in the top right
  3. To import the existing notebook:
  4. In the Create Notebook popup:
    • Select your MULTIMODAL_ANALYSIS database and MEDIA schema
    • Choose an appropriate warehouse
    • Click "Create" to finish the import
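Once the notebook exists, a SQL cell can call Cortex functions directly against the staged files. As a hedged sketch of transcription with speaker identification, the cell below assumes a hypothetical file named example.wav in the AUDIO stage and assumes that the 'speaker' value of the timestamp_granularity option is what requests speaker labels; check the AI_TRANSCRIBE reference for the options available in your account.

```sql
-- Transcribe one staged audio file and request speaker-level segments.
-- 'example.wav' is a placeholder; use a file name listed by LS on the AUDIO stage.
SELECT AI_TRANSCRIBE(
  TO_FILE('@MULTIMODAL_ANALYSIS.MEDIA.AUDIO', 'example.wav'),
  {'timestamp_granularity': 'speaker'}
) AS transcript;
```

Omitting the second argument should return a plain transcript without speaker labels, which is a cheaper first check that the file is readable.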

The notebook includes:

Congratulations! You've successfully built an end-to-end multimodal analysis system using AI models via Snowflake Cortex. This solution lets you extract insights from both images and audio: you can transcribe audio with speaker identification, detect emotions, analyze scenes, and generate rich descriptions, all within the Snowflake environment using the AI_COMPLETE and AI_TRANSCRIBE functions.

Combining visual and audio analysis supports use cases such as content understanding, customer experience analysis, compliance monitoring, and automated content processing workflows.

What You Learned

Related Resources