In this guide, we'll walk through the steps to build a prediction model for the Euro 2024 Football (Soccer) Tournament. Leveraging Snowpark, Snowpark ML, and the Snowflake Model Registry, we'll use historical results and international rankings to create a model and then simulate all 51 games of the tournament.


What You'll Learn

What You'll Build

If you do not have a Snowflake account, you can register for a Snowflake free 30-day trial. The cloud provider (AWS, Azure, Google Cloud) and region (e.g., US East, EU) do not matter for this lab. However, we suggest you select the region which is physically closest to you.

To easily follow the instructions, resize your browser window so you can view this Quickstart and your Snowflake environment side-by-side. Even better, if possible, use a secondary display dedicated to the Quickstart.

Create a Warehouse, Database, and Stages

Run the following SQL commands in a Snowsight Worksheet to set up your environment:



Note that these commands can also be found in setup.sql.
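The setup creates a warehouse, a database, and a stage. As a sketch of how the same step could be driven from Snowpark Python (the statement bodies here are assumptions; the authoritative versions live in setup.sql, and object sizes or options may differ):

```python
# Hedged sketch: the exact DDL comes from setup.sql; these statements are
# illustrative assumptions with the object names used later in this guide.
SETUP_STATEMENTS = [
    "CREATE OR REPLACE WAREHOUSE EURO2024_WH WITH WAREHOUSE_SIZE = 'SMALL'",
    "CREATE OR REPLACE DATABASE EURO_2024",
    "CREATE OR REPLACE STAGE EURO_2024.PUBLIC.DATA",
]

def run_setup(session):
    """Execute each DDL statement through a Snowpark session."""
    for stmt in SETUP_STATEMENTS:
        session.sql(stmt).collect()  # session.sql() is the standard Snowpark entry point
```

Calling `run_setup(session)` with an active Snowpark `Session` would run the three statements in order; in this lab you simply paste the SQL into a worksheet instead.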

Staging the Data

Next, you will need to upload the data files from the repo into the DATA stage you just created.

  1. Open the Snowsight UI, and navigate to DATA / DATABASES / EURO_2024
  2. Click on the PUBLIC schema, then STAGES, and then click on the DATA Stage
  3. In the top right corner, click on the +FILES option and add the fixtures.csv, rankings.csv, and results.csv files, then make sure to click on the UPLOAD button.

Loading the Notebooks

Finally, we need to load the five notebooks from the repo found here:

This should include the following files:

To load the notebooks in Snowflake's Notebook environment:

  1. Within Snowsight navigate to PROJECTS / NOTEBOOKS
  2. Click on the dropdown option on the +NOTEBOOK in the top right, and select Import from .ipynb
  3. Check that the database is set to EURO_2024, the PUBLIC schema is selected, and the warehouse is set to EURO2024_WH


To load the notebooks in Hex, please follow these instructions.

Run through the cells in 1_euro2024_data_ingest.ipynb

In this notebook you will ingest the data from the staged files and create 3 tables: the historical games, the historical team rankings, and the fixtures for the upcoming tournament.
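Conceptually, the ingest step reads each staged CSV and lands it as a typed table. A minimal plain-Python illustration of the same idea (the column names below are assumptions for illustration, not the repo's actual schema):

```python
import csv
import io

# Hypothetical slice of a results.csv-style file; the real columns may differ.
raw = """date,home_team,away_team,home_score,away_score
2021-06-11,Turkey,Italy,0,3
2021-06-12,Wales,Switzerland,1,1
"""

def load_results(text):
    """Parse CSV text into typed row dicts (scores converted to ints)."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        row["home_score"] = int(row["home_score"])
        row["away_score"] = int(row["away_score"])
        rows.append(row)
    return rows

results = load_results(raw)
```

In the notebook itself, Snowpark reads the staged files and writes the rows straight into Snowflake tables, so no client-side parsing loop is needed.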

At the end of this step, you should see 3 new tables in the EURO_2024 database;

Run through the cells in 2_euro2024_data_transformation.ipynb

In this notebook you will execute various feature engineering steps to build the training dataset. In summary, the steps are:
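To make the idea concrete, here is a small stand-alone sketch of one typical feature for this kind of model: average goal difference per team across their matches. The notebook's actual feature set is richer; this is illustrative only:

```python
from collections import defaultdict

# Toy historical results: (home, away, home_goals, away_goals).
matches = [
    ("Italy", "Turkey", 3, 0),
    ("Italy", "Wales", 1, 0),
    ("Turkey", "Wales", 0, 2),
]

def avg_goal_diff(matches):
    """Average goal difference per team across all their matches."""
    totals = defaultdict(lambda: [0, 0])  # team -> [goal_diff_sum, games]
    for home, away, hg, ag in matches:
        totals[home][0] += hg - ag
        totals[home][1] += 1
        totals[away][0] += ag - hg
        totals[away][1] += 1
    return {team: s / n for team, (s, n) in totals.items()}

features = avg_goal_diff(matches)  # e.g. Italy averages +2.0 here
```

In the notebook, the equivalent aggregation runs inside Snowflake via Snowpark DataFrame operations rather than in local Python.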

Run through the cells in 3_euro2024_modelling.ipynb

In this notebook you will be using Snowpark ML and Snowflake Model Registry to perform the following:

  1. Scale up the warehouse with a single line of code.
  2. Use Snowpark ML to run hyperparameter tuning and determine the parameter set with the highest accuracy.
  3. Scale the warehouse back down.
  4. Train a new model with Snowpark ML using the best parameters from tuning.
  5. Save the newly trained model to the Snowflake Model Registry.
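The tuning step above uses Snowpark ML's distributed search inside Snowflake, but the underlying idea is a plain grid search. A self-contained sketch, with a toy scoring function standing in for "train a model and measure validation accuracy" (the parameter grid here is a made-up example, not the notebook's):

```python
from itertools import product

# Hypothetical parameter grid; the notebook's actual grid will differ.
grid = {"n_estimators": [50, 100, 200], "max_depth": [3, 5]}

def toy_accuracy(params):
    """Stand-in for 'train a model and measure validation accuracy'."""
    return 0.6 + 0.001 * params["n_estimators"] - 0.01 * params["max_depth"]

def grid_search(grid, score_fn):
    """Try every parameter combination and keep the best-scoring one."""
    keys = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[k] for k in keys)):
        params = dict(zip(keys, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

best_params, best_score = grid_search(grid, toy_accuracy)
```

Scaling the warehouse up before this step matters because the real search trains one model per combination; Snowpark ML parallelizes that work across the warehouse.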

Run through the cells in 4_euro2024_create_sprocs_udtfs.ipynb

In this notebook you will register a set of reusable stored procedures and UDTFs that will be used after each simulation of the tournament:
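A UDTF (user-defined table function) returns a set of rows per input, which in Python terms behaves like a generator. As a minimal stand-alone analogue, here is a hypothetical group-standings function yielding rows the way a UDTF's `process` method would (the real UDTFs are registered in Snowflake via Snowpark and their logic will differ):

```python
def standings_rows(group, results):
    """Yield (group, team, points) rows for one group, UDTF-style."""
    points = {}
    for home, away, hg, ag in results:
        points.setdefault(home, 0)
        points.setdefault(away, 0)
        if hg > ag:
            points[home] += 3      # home win
        elif ag > hg:
            points[away] += 3      # away win
        else:
            points[home] += 1      # draw: one point each
            points[away] += 1
    for team, pts in sorted(points.items(), key=lambda t: -t[1]):
        yield (group, team, pts)

rows = list(standings_rows("A", [("Germany", "Scotland", 5, 1),
                                 ("Switzerland", "Hungary", 3, 1),
                                 ("Germany", "Hungary", 2, 0)]))
```

Registering logic like this once as a UDTF means every simulation run can call it from SQL or Snowpark without re-deploying code.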

At the end of this step, we will have a model, and all the functions and procedures needed to simulate the entire tournament!

Run through the cells in 5_euro2024_predictions.ipynb

In the final notebook, you will step through all rounds of the Euro 2024 tournament, using the model you created earlier to run predictions for every game.
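The simulation loop can be sketched as: feed each fixture to a prediction function and advance the winner to the next round. Here `predict_winner` is a hypothetical stand-in for inference against the registered model, and the ratings are invented for illustration:

```python
def predict_winner(home, away, ratings):
    """Hypothetical stand-in for model inference: higher rating wins."""
    return home if ratings[home] >= ratings[away] else away

def play_round(fixtures, ratings):
    """Simulate one knockout round and return the winners."""
    return [predict_winner(h, a, ratings) for h, a in fixtures]

def simulate_knockout(teams, ratings):
    """Run successive rounds until one team remains."""
    while len(teams) > 1:
        fixtures = list(zip(teams[::2], teams[1::2]))  # pair adjacent teams
        teams = play_round(fixtures, ratings)
    return teams[0]

ratings = {"Spain": 95, "Germany": 90, "France": 92, "England": 91}
champion = simulate_knockout(["Spain", "Germany", "France", "England"], ratings)
```

In the notebook, each round's results are written back to Snowflake and the stored procedures and UDTFs from the previous step build the next round's fixtures.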

At the end of this stage, we will have a predicted winner of the tournament. Great job!


Congratulations! You've completed all the steps to build a full end-to-end model pipeline with Snowpark ML, seen how to persist models to the Snowflake Model Registry, and used those models for inference with both Snowpark Python and SQL.

Where to go from here

This guide was an introduction to the techniques of feature engineering, model training, and inference. To extend it, you could introduce more features, experiment with different model types, or add automation to the pipeline using streams and tasks. In principle, the same approach could be used to predict any sporting event for which you can obtain historical data.

For more information on what we've covered today, please check out these links: