Analyzing location-specific data in order to make decisions often requires niche techniques, traditionally the preserve of geography experts. In Snowflake, anyone can become an expert in any location they need to understand - whether it's points of interest, transport links or government boundaries, all of it is feasible using standard functionality available within Snowflake.

There are already many location-specific datasets available within the Snowflake Marketplace, which significantly reduces data ingestion and engineering time. Getting access to these datasets is as simple as pressing Get Data, after which you can leverage location-specific features using either SQL queries or Snowpark DataFrames. The result sets are then easily rendered using Streamlit in Snowflake dashboards.

Snowflake Cortex LLMs save the analyst, engineer or BI developer time across a multitude of tasks - from improving the readability of location tooltips to generating synthetic data for testing purposes.

In this quickstart, we will be leveraging the tools within Snowflake to:

Prerequisites

alt text

Note: All datasets used in this quickstart are either frequently replicated or originate in AWS London. If you choose an alternative region, you may need to wait 10 minutes for the replication process to finish.

What You'll Learn

What You'll Build

Open up a new SQL worksheet and run the following commands. To open up a new SQL worksheet, select Projects » Worksheets, then click the blue plus button and select SQL worksheet.

CREATE OR REPLACE DATABASE BUILD_UK;

CREATE SCHEMA DOCUMENT_AI;

CREATE OR REPLACE SCHEMA STREAMLITS;

CREATE OR REPLACE SCHEMA NOTEBOOKS;

CREATE OR REPLACE SCHEMA DATA;

CREATE OR REPLACE WAREHOUSE BUILD_UK_WAREHOUSE;

GRANT USAGE, OPERATE ON WAREHOUSE BUILD_UK_WAREHOUSE TO ROLE ACCOUNTADMIN;

GRANT USAGE ON DATABASE BUILD_UK TO ROLE ACCOUNTADMIN;

-------- use this parameter to try the LLMs from a region which does not currently support them.

ALTER ACCOUNT SET CORTEX_ENABLED_CROSS_REGION = 'ANY_REGION';

Once logged in, go to the Snowflake Marketplace - this is under Data Products » Snowflake Marketplace.

alt text

Search for Northern Trains Station Data

alt text

Press Get to get the data from the marketplace. Do not change the database name.

alt text

Search for Overture Maps - Places

alt text

Click on the following dataset, then press Get. Do not change the database name.

alt text

Search for the Met Office Weather Data and press Get - keep the name as it is.

alt text

You will have access to the dataset for a period of 14 days.

In Snowsight, go back to the home page and select Projects » Notebooks.

alt text Observe the alternative actions - you can also import a notebook or create one from a GitHub repository (if you have GitHub connected to Snowflake).

alt text

Today we will be creating a new notebook from scratch.

alt text

In Notebook location, select BUILD_UK from the list of available databases and NOTEBOOKS from the available schemas. All data within the notebook will be held inside the chosen database and schema and will sit inside an automatically generated stage. Select the BUILD_UK_WAREHOUSE and press Create.

Note

If you wish, you can import the previously created notebook from the following location:

notebook in GitHub

However, to experience the creation of the notebook yourself, carry on with the blank notebook, and copy/paste the code as we go along.

Creating your first Map

Once the notebook has loaded, remove all cells in the notebook. You should see something like this:

alt text

Before we start working in our notebook, we will add an additional package which is not installed by default. This package is available within the Snowflake Anaconda channel and is easily accessible.

Add the Pydeck package for visualising location data

In the top menu, under Packages » Anaconda Packages, import pydeck. You need this to render maps.

alt text

Now we have all the packages we need. Let's start the notebook by pressing Start, which is at the top right-hand corner of the screen.

alt text

You will be using a variety of Snowflake functions to carry out transformation tasks. Some of the functions are built into the Snowpark library (such as array_agg and parse_json); others we need to call using the call_function module. call_function allows the user to leverage ANY Snowflake scalar function - you can use it for both built-in and user-defined functions.

All functions are held inside the snowflake.snowpark.functions module.
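As a brief illustration (a sketch only - the dataframe df and MY_SCHEMA.MY_UDF below are hypothetical), a built-in function and a user-defined function are called in exactly the same way:

from snowflake.snowpark.functions import call_function, col, lit

# built-in scalar function - equivalent to SQL: ST_MAKEPOINT("Longitude", "Latitude")
df = df.with_column('POINT', call_function('ST_MAKEPOINT', col('"Longitude"'), col('"Latitude"')))

# a user-defined function is invoked the same way (MY_SCHEMA.MY_UDF is hypothetical)
df = df.with_column('SCORE', call_function('MY_SCHEMA.MY_UDF', col('POINT'), lit(5)))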

We will import call_function along with the streamlit, json, pandas, numpy and pydeck packages. You added the pydeck package in the previous step, as it is not installed by default.

Click on the + Python button to add a new Python cell. On the top left hand corner of the screen, you can rename your cell from ‘cell1' to something more meaningful. This is useful for debugging purposes.

Rename cell1 to libraries by typing over the text ‘cell1'.

Copy and paste the following code into the newly created cell.

# Import python packages
import streamlit as st
from snowflake.snowpark.context import get_active_session
from snowflake.snowpark.functions import max,min,avg,call_function, split,substr,hour,concat,col,sqrt,lit,array_slice,array_agg,object_construct,parse_json, to_geography, to_array,to_date, round, replace
from snowflake.snowpark.types import StringType,VariantType, DateType, IntegerType,DecimalType
import json
import pandas as pd
import numpy as np
import pydeck as pdk

# Write directly to the app
st.title("UK Analytics within the North of England :train:")
st.write(
    """This app shows key insight of places and events that may effect Northern Trains.
    """
)

# Get the current credentials
session = get_active_session()

alt text

Add a new python cell by hovering over the bottom edge of the cell and select Python.

alt text

Rename the cell to n_trains_data.

We will first leverage the Northern Trains dataset to filter the Carto Overture Maps places dataset. We want to do this in order to get points of interest that are relevant to the Northern Trains locality. Joining by each station code would be resource hungry - plus we do not want to join by exact locations, only by roughly where all of the train stations are situated.

Create a Boundary Box to filter the data based on Northern Trains

We will see where the Northern Trains stations are and create a boundary box so that all our data is filtered to fall within the area the train stations cover. First, let's find out where the stations are so we can effectively draw a boundary box around them.

Copy and paste the code into the new python cell and run this notebook cell:

trains_latlon = session.table('NORTHERN_TRAINS_STATION_DATA.TESTING."StationLatLong"')

st.markdown('#### A dataframe which shows all the train stations')
st.dataframe(trains_latlon)
st.map(trains_latlon, latitude='Latitude', longitude='Longitude')

Below is an example of what you should see

alt text

We have created a dataframe which is simply selecting the shared station table. We have also leveraged Streamlit to create a title, leveraged st.dataframe to display the snowpark dataframe in a clear way, and finally created a simple map using the streamlit st.map function.

st.map is useful for quickly generating simple maps by rendering latitude and longitude as points. Pydeck, however, has more sophisticated capabilities, such as rendering lines, points, polygons and H3 indexes. We will be leveraging the Pydeck library in the next step.

You previously loaded the places dataset from Carto Overture maps. This dataset offers a comprehensive list of places of interest across the world such as restaurants, bars and schools. We want to filter this dataset to only list places of interest that occur within the Northern Trains locality. Creating a Boundary box is the easiest option.

Copy and paste the following into a new cell and name the new cell boundary_box

#create a point from the coordinates
envelope = trains_latlon.with_column('POINT',call_function('ST_MAKEPOINT',col('"Longitude"'),col('"Latitude"')))

#collect all the points into one row of data
envelope = envelope.select(call_function('ST_COLLECT',col('POINT')).alias('POINTS'))

#create a rectangular shape which borders the minimum possible size covering all of the points
envelope = envelope.select(call_function('ST_ENVELOPE',col('POINTS')).alias('BOUNDARY'))
envelope.collect()[0][0]

You will see this has generated 5 sets of coordinates in order to draw the boundary box.
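If you would like to sanity-check that shape, the boundary comes back as a GeoJSON string, so it can be parsed with the json module (an optional check using the envelope dataframe created above):

import json

boundary = json.loads(envelope.collect()[0][0])   # the GeoJSON polygon as a Python dict
ring = boundary['coordinates'][0]                 # the outer ring of the rectangle
print(len(ring))                                  # 5 - the closing point repeats the first
print(ring[0] == ring[-1])                        # True for any valid polygon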

FACT: Every valid polygon has the same last pair of coordinates as its first pair. Let's visualise what this looks like using the pydeck library. Although st.map is useful for quick, simple visualisation of points, pydeck can visualise lines, points and polygons in 2D and 3D. It also has layer options for lines, points, icons and H3 indexes.

https://deckgl.readthedocs.io/en/latest/

Add a new Python cell and name it boundary_layer. Copy and paste the following:

#find the centre point so the map will render from that location

centre = envelope.with_column('CENTROID',call_function('ST_CENTROID',col('BOUNDARY')))
centre = centre.with_column('LON',call_function('ST_X',col('CENTROID')))
centre = centre.with_column('LAT',call_function('ST_Y',col('CENTROID')))

#create LON and LAT variables

centrepd = centre.select('LON','LAT').to_pandas()
LON = centrepd.LON.iloc[0]
LAT = centrepd.LAT.iloc[0]

### transform the data in pandas so the pydeck visualisation tool can view it as a polygon

envelopepd = envelope.to_pandas()
envelopepd["coordinates"] = envelopepd["BOUNDARY"].apply(lambda row: json.loads(row)["coordinates"][0])


####visualise on a map

#### create a layer - this layer will visualise the rectangle

polygon_layer = pdk.Layer(
            "PolygonLayer",
            envelopepd,
            opacity=0.3,
            get_polygon="coordinates",
            filled=True,
            get_fill_color=[16, 14, 40],
            auto_highlight=True,
            pickable=False,
        )

 
#### render the map 
    
st.pydeck_chart(pdk.Deck(
    map_style=None,
    initial_view_state=pdk.ViewState(
        latitude=LAT,
        longitude=LON,
        zoom=5,
        height=400
        ),
    
layers= [polygon_layer]

))

You will see that to render the map, we present the data in a format pydeck can accurately read. The final transformed dataset is a pandas DataFrame. We specify the dataframe in a pydeck layer, then apply this layer to a Streamlit pydeck chart. If we want, we can use the same logic to create a Streamlit app. Snowflake Notebooks are great as you can render Streamlit on the fly - without having to run the 'app' separately.

When you run the cell, the boundary box should look like this:

boundarybox

Next, let's filter the Overture Maps places so they only consist of data within this area. Overture Maps consists of location data across the entire globe.

Create and preview the places dataframe - we are also filtering to only display places in the UK.

Copy and paste the python code into a new python cell. Name the cell places_dataset

places = session.table('OVERTURE_MAPS__PLACES.CARTO.PLACE')
places = places.filter(col('ADDRESSES')['list'][0]['element']['country'] =='GB')

places.limit(3)

When you run the cell, you will see that a lot of semi-structured data is returned. We will use Snowflake's native semi-structured querying capability to extract the key elements of the data, including information concerning the location.

In a new cell which you will name places_refined, paste the following and run the cell:

places = places.select(col('NAMES')['primary'].astype(StringType()).alias('NAME'),
                        col('PHONES')['list'][0]['element'].astype(StringType()).alias('PHONE'),
                      col('CATEGORIES')['primary'].astype(StringType()).alias('CATEGORY'),
                        col('CATEGORIES')['alternate']['list'][0]['element'].astype(StringType()).alias('ALTERNATE'),
                    col('websites')['list'][0]['element'].astype(StringType()).alias('WEBSITE'),
                      col('GEOMETRY'))
                        

places.limit(10)

This is what you should see. The exact dataset may not be the same as the screenshot but will contain the same type of data. semi_structured_filtered

You will now filter the data to only view places which are categorised as restaurant.

Copy and paste the code below into a new python cell. Name the cell places_filtered_boundary

places = places.filter(col('CATEGORY') =='restaurant')

places = places.join(envelope,call_function('ST_WITHIN',places['GEOMETRY'],envelope['boundary']))
places = places.with_column('LON',call_function('ST_X',col('GEOMETRY')))
places = places.with_column('LAT',call_function('ST_Y',col('GEOMETRY')))
st.write(places)

We could view the points on a map easily using st.map(places), but as pydeck has many more options, such as different mark types, tooltips and layers, we will create an additional pydeck layer which adds this data to the previously created data layer. When you hover over a point within the boundary box, you will see a tooltip containing the alternate category as well as the place name.

Copy and paste the following into a new cell and name the cell places_visualised:

placespd = places.to_pandas()
poi_l = pdk.Layer(
            'ScatterplotLayer',
            data=placespd,
            get_position='[LON, LAT]',
            get_color='[255,255,255]',
            get_radius=600,
            pickable=True)

#### render the map showing restaurants based on overture maps
    
st.pydeck_chart(pdk.Deck(
    map_style=None,
    initial_view_state=pdk.ViewState(
        latitude=LAT,
        longitude=LON,
        zoom=5,
        height=400
        ),
    
layers= [polygon_layer, poi_l], tooltip = {'text':"Place Name: {NAME}, alternate: {ALTERNATE}"}

))

The results should look like this.

boundarybox

Now we have a map with all the restaurants within the Northern Trains boundary. Let's now compare this with another layer which shows the train stations provided by Northern Trains. We already loaded the train station locations into the notebook when we created the boundary box.

Create a new cell and name it trains_visualised. Copy and paste the following:

trains_latlon_renamed = trains_latlon

trains_latlon_renamed = trains_latlon_renamed.with_column_renamed('"CrsCode"','NAME')
trains_latlon_renamed = trains_latlon_renamed.with_column_renamed('"Postcode"','ALTERNATE')
trains_latlon_renamed = trains_latlon_renamed.with_column_renamed('"Latitude"','LAT')
trains_latlon_renamed = trains_latlon_renamed.with_column_renamed('"Longitude"','LON')
trains_latlon_renamed_pd = trains_latlon_renamed.to_pandas()

nw_trains_l = pdk.Layer(
            'ScatterplotLayer',
            data=trains_latlon_renamed_pd,
            get_position='[LON, LAT]',
            get_color='[0,187,255]',
            get_radius=600,
            pickable=True)

#### render the map showing northern train stations alongside the overture maps restaurants
    
st.pydeck_chart(pdk.Deck(
    map_style=None,
    initial_view_state=pdk.ViewState(
        latitude=LAT,
        longitude=LON,
        zoom=5,
        height=400
        ),
    
layers= [polygon_layer, poi_l, nw_trains_l], tooltip = {'text':"Place Name: {NAME}, alternate: {ALTERNATE}"}

))

We have now rendered a multi-layer map which overlays restaurants and Northern Rail train stations. Next, we will leverage Cortex to curate descriptive tooltips derived from station attributes.

At the moment we only have very basic train station information. Let's add more info from the shared dataset:

Copy and paste the following code into a new cell. Name the cell station_attributes

further_train_info = session.table('NORTHERN_TRAINS_STATION_DATA.TESTING."STATION ATTRIBUTES 2"')
further_train_info.limit(4)

boundarybox

We have quite a bit of information; it would be great if we could use a Snowflake Cortex LLM to explain this data and then add the results to our tooltip! Right now the tooltip only contains the postcode.

Let's do that by creating a new cell. Name the cell cortex_tooltip.

Paste the following into the new cell and run it. This takes around 1.5 minutes to complete, as it writes a summary report for all 400 train stations.

further_train_info= further_train_info.with_column('OBJECT',object_construct(lit('CRS Code'),
col('"CRS Code"'),
lit('Full Timetable Calls'),
col('"Dec21 Weekday Full Timetable Daily Calls"').astype(IntegerType()),
lit('Emergency Timetable Calls'),
col('"Dec21 Weekday Emergency Timetable Daily Calls"').astype(IntegerType()),
lit('Footfall'),
col( '"ORR Station Footfall 2020-21"').astype(IntegerType()),
lit('Parking'),
col('"Car Parking - Free/Chargeable"'),
lit('MP'),
col("MP"),
lit("Political Party"),
col('"Political Party"'),
lit('MP Email Address'),
col('"MP Email Address"'),                                                                             
lit('Car Parking Spaces'),
col('"Car Parking Spaces"').astype(IntegerType()),
lit('Staffed?'),
col('"Staffed?"')))

prompt = 'In less than 200 words, write a summary based on the following train station details'
prompt2 = 'write in the best way for it to describe a point on a map'

further_train_info = further_train_info.select('"CRS Code"',
        'MP',
        '"Political Party"', 
        '"MP Email Address"',
        call_function('snowflake.cortex.complete','mistral-large2',
            concat(lit(prompt),
            col('OBJECT').astype(StringType()),
            lit(prompt2))).alias('ALTERNATE'))

further_train_info.write.mode('overwrite').save_as_table("DATA.TRAIN_STATION_INFORMATION")
session.table('DATA.TRAIN_STATION_INFORMATION')

While we wait for the train station tooltips to materialise, add a markdown cell above the cell we have just created. Markdown is useful to help tell a data story within the notebook. Hover above the newly created cell (which is currently running) and press Markdown. Name the cell cortex_description and paste the following narrative:

Below we are leveraging Mistral-large2 to produce meaningful tooltips relating to over **400** train stations which are managed by **Northern Trains**

Press the tick on the top right hand side of the cell to confirm the edit.

train_details

We used call_function to call Snowflake Cortex COMPLETE, which returns a response that completes an input prompt. Snowflake Cortex runs LLMs that are fully hosted and managed by Snowflake, requiring no setup. In this example, we are using mistral-large2, one of the large language models available through Cortex.
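If you want to try the function in isolation, the same call can be issued directly in SQL from a Python cell - a quick sketch, with an illustrative prompt (the response will vary between runs):

answer = session.sql("""
    SELECT SNOWFLAKE.CORTEX.COMPLETE(
        'mistral-large2',
        'In one sentence, describe what makes a useful map tooltip.'
    ) AS ANSWER
""")
answer.show()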

You should get a new table that will look like this:

train_details

IMPORTANT: Comment out the write-to-table command to prevent the LLM being called every time you refresh the notebook. The data is now saved in a table, so you do not need to run this model again.

# further_train_info.write.mode('overwrite').\
# save_as_table("DATA.TRAIN_STATION_INFORMATION")

Ok, let's now revise the map. Create a new python cell, name it visualise_tooltip and add the following code:

trains_latlon_renamed = trains_latlon

trains_latlon_renamed = trains_latlon_renamed.with_column_renamed('"CrsCode"','NAME')
trains_latlon_renamed = trains_latlon_renamed.with_column_renamed('"Latitude"','LAT')
trains_latlon_renamed = trains_latlon_renamed.with_column_renamed('"Longitude"','LON')

station_info = session.table('DATA.TRAIN_STATION_INFORMATION')

trains_latlon_renamed = trains_latlon_renamed.join(station_info,station_info['"CRS Code"']==trains_latlon_renamed['NAME']).drop('"CRS Code"')
trains_latlon_renamed_pd = trains_latlon_renamed.to_pandas()

nw_trains_l = pdk.Layer(
            'ScatterplotLayer',
            data=trains_latlon_renamed_pd,
            get_position='[LON, LAT]',
            get_color='[0,187,2]',
            get_radius=600,
            pickable=True)

#### render the map with the cortex-generated tooltips

tooltip = {
   "html": """<b>Name:</b> {NAME} <br> <b>Alternate:</b> {ALTERNATE}""",
   "style": {
       "width":"50%",
        "backgroundColor": "steelblue",
        "color": "white",
       "text-wrap": "balance"
   }
}
    
st.pydeck_chart(pdk.Deck(
    map_style=None,
    initial_view_state=pdk.ViewState(
        latitude=LAT,
        longitude=LON,
        zoom=5,
        height=700
        ),
    
layers= [polygon_layer, poi_l, nw_trains_l], tooltip = tooltip

))

Hover over the map and check the updated tooltips.

tool_tip_image

Any location may be impacted by key events. Let's try to pinpoint key events happening in the north of England and see how restaurants and train stations may be impacted by them. We do not have specific event data for this, so we will leverage Snowflake Cortex to suggest events that may impact this area. Bear in mind the LLM is not a live data repository - it simply returns information based on the history it was trained on.

Create the following in a new cell named cortex_events. This will generate event data and save the results in a new table.

json1 = '''{"DATE":"YYYY-MM-DD", "NAME":"event", "DESCRIPTION":"describe what the event is", "CENTROID":{
  "coordinates": [
    0.000000<<<this needs to be longitude,
    0.000000<<<<this needs to be latitude
  ],
  "type": "Point"
},"COLOR":"Random bright and unique color in RGB presented in an array"}'''


prompt = f''' Retrieve 6 events within different cities of the north of England which will happen in 2024. Do not include commentary or notes. Retrieve this in the following json format {json1}  '''
events = session.create_dataframe([{'prompt':prompt}])

events = events.select(call_function('SNOWFLAKE.CORTEX.COMPLETE','mistral-large2',prompt).alias('EVENT_DATA'))

events = events.with_column('EVENT_DATA',replace(col('EVENT_DATA'),lit('''```json'''),lit('')))
events = events.with_column('EVENT_DATA',replace(col('EVENT_DATA'),lit('''```'''),lit('')))

events.write.mode('overwrite').save_as_table("DATA.EVENTS_IN_THE_NORTH")
session.table('DATA.EVENTS_IN_THE_NORTH')

Again, we will utilise Snowflake's semi-structured data support to flatten the retrieved json into a tabular data frame.

Copy and paste the code below into a new cell which you will rename as events_normalised:

events = session.table('DATA.EVENTS_IN_THE_NORTH')
events = events.join_table_function('flatten',parse_json('EVENT_DATA')).select('VALUE')
events=events.with_column('NAME',col('VALUE')['NAME'].astype(StringType()))
events=events.with_column('DESCRIPTION',col('VALUE')['DESCRIPTION'].astype(StringType()))
events=events.with_column('CENTROID',to_geography(col('VALUE')['CENTROID']))
events=events.with_column('COLOR',col('VALUE')['COLOR'])
events=events.with_column('DATE',col('VALUE')['DATE'].astype(DateType())).drop('VALUE')
events

Leveraging H3

We now have a list of events in a new table. We would like to use this to understand which restaurants and train stations may be impacted. H3 indexes are a way to bucket multiple points into a standardised grid of hexagons. Every hexagon at each resolution has a unique index, and this index can be used to join with other datasets which have been indexed to the same standardised grid - which we will do later in the lab.
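As a quick illustration of the bucketing (the coordinates below are roughly central Leeds and purely illustrative), the same point yields one cell string per resolution, and any two points that fall in the same hexagon share that string - which is exactly what makes the join work:

cells = session.sql("""
    SELECT H3_LATLNG_TO_CELL_STRING(53.7965, -1.5478, 5) AS RES_5,
           H3_LATLNG_TO_CELL_STRING(53.7965, -1.5478, 7) AS RES_7
""")
cells.show()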

Create a new cell which you will name h3index, and add the H3 index at resolution 5 to the dataframe:

events=events.with_column('H3',call_function('H3_POINT_TO_CELL_STRING',col('CENTROID'),lit(5)))

events

You will see a new column called H3:

events_H3

We will now add these events onto the map as another layer. Add a new cell and name it visualise_h3.

events = events.with_column('R',col('COLOR')[0])
events = events.with_column('G',col('COLOR')[1])
events = events.with_column('B',col('COLOR')[2])
events = events.with_column_renamed('DESCRIPTION','ALTERNATE')
eventspd = events.group_by('H3','NAME','ALTERNATE','R','G','B').count().to_pandas()

st.write(eventspd)

h3_events = pdk.Layer(
        "H3HexagonLayer",
        eventspd,
        pickable=True,
        stroked=True,
        filled=True,
        extruded=False,
        get_hexagon="H3",
        get_fill_color=["255-R","255-G","255-B"],
        line_width_min_pixels=2,
        opacity=0.4)

#### render the map including the h3 event hexagons

tooltip = {
   "html": """<b>Name:</b> {NAME} <br> <b>Alternate:</b> {ALTERNATE}""",
   "style": {
       "width":"50%",
        "backgroundColor": "steelblue",
        "color": "white",
       "text-wrap": "balance"
   }
}

st.pydeck_chart(pdk.Deck(
    map_style=None,
    initial_view_state=pdk.ViewState(
        latitude=LAT,
        longitude=LON,
        zoom=5,
        height=600
        ),
    
layers= [polygon_layer, poi_l, h3_events,nw_trains_l, ], tooltip = tooltip

))

events_map

So we can see the train stations and restaurants that might be impacted by the events. The events rendered may differ from what is shown in the screenshot. Let's create a dataset that extracts only the impacted areas.

Join the events data frame to the train stations data frame. Then, join the events data frame to the restaurants data frame.

You may notice that there are new H3 columns for the restaurants and train stations at the same resolution as the events. This naturally creates a key to join on. There is also a new column which shows how far each restaurant is from the event; this was created using a standard geospatial function. Copy and paste the following into a new cell called affected_by_events

trains_h3 = trains_latlon_renamed.with_column('H3',call_function('H3_LATLNG_TO_CELL_STRING',col('LAT'),col('LON'),lit(5)))
trains_h3 = trains_h3.join(events.select('H3',col('NAME').alias('EVENT_NAME'),'DATE'),'H3')

st.markdown('#### Affected Train Stations')
st.write(trains_h3.limit(1))
places_h3 = places.with_column('H3',call_function('H3_POINT_TO_CELL_STRING',col('GEOMETRY'),lit(5)))
places_h3 = places_h3.join(events.select('H3','CENTROID',col('NAME').alias('EVENT_NAME'),'DATE'),'H3')
places_h3 = places_h3.with_column('DISTANCE_FROM_EVENT',call_function('ST_DISTANCE',col('CENTROID'),col('GEOMETRY')))
places_h3 = places_h3.filter(col('DISTANCE_FROM_EVENT')< 3000)
places_h3 = places_h3.sort(col('DISTANCE_FROM_EVENT').asc())
st.markdown('#### Affected Restaurants')                             
st.write(places_h3.limit(10))

events_map

We now have all of this joined together - in the next step we will use an LLM to write a letter to each MP detailing the concerns arising from the events.

Now that we can see where the events impact stations and restaurants, let's use an LLM to craft a letter to each MP to notify them of these effects. To do this, we need to put all the information into objects so it can easily be passed through the Cortex function.

Create an object which links all affected restaurants to the respective MP. We are also including the distance from the event for each restaurant.

Create a new cell called letter_restaurants and copy and paste the following code into it:

object3 = trains_h3.select('H3','MP','"MP Email Address"').distinct()
object3 = places_h3.join(object3,'H3')  
object3 = object3.group_by('MP','"MP Email Address"').agg(array_agg(object_construct(lit('NAME'),
                                                                col('NAME'),
                                                                lit('DISTANCE_FROM_EVENT'),
                                                                round('DISTANCE_FROM_EVENT',5).astype(DecimalType(20,4)),
                                                                lit('PHONE'),
                                                                col('PHONE'),
                                                               lit('WEBSITE'),
                                                               col('WEBSITE'))).within_group('MP').alias('RESTAURANTS'))
object3


You will now create another object which links all the affected train stations to the respective MP. Copy and paste the following code into a new python cell called letter_trains

object1 = trains_h3.group_by('MP').agg(array_agg(object_construct(lit('Train Station information'),col('ALTERNATE'))).within_group('MP').alias('TRAIN_STATIONS'))
object1

Create an object which links all affected events to the respective MP. Create a new python cell called letter_events and paste the following code into it:

object2 = trains_h3.select('MP','EVENT_NAME','DATE').distinct()
object2 = object2.group_by('MP').agg(array_agg(object_construct(lit('EVENT'),col('EVENT_NAME'),lit('DATE'),col('DATE'))).within_group('MP').alias('EVENTS'))
object2

You will now join all these objects together and persist the results in a table. Create a new cell called letter_object and copy and paste the python below into it:

all_3 = object1.join(object2,'MP')
all_3 = all_3.join(object3,'MP')

all_3.write.mode('overwrite').save_as_table("DATA.EVENTS_AND_WHAT_IS_AFFECTED")
session.table('DATA.EVENTS_AND_WHAT_IS_AFFECTED')

events_map

The results can include a large number of restaurants per MP, so let's only refer to the first 8 restaurants for each MP letter, based on distance from the event. The array_slice method does just that. Create a new Python cell called filt_restaurant_obj and paste the following python content into it:

all_3 = session.table("DATA.EVENTS_AND_WHAT_IS_AFFECTED")
all_3 = all_3.select('MP','"MP Email Address"','TRAIN_STATIONS','EVENTS',
                     
array_slice(col('RESTAURANTS'),lit(0),lit(8)).alias('RESTAURANTS'))

          
all_3

Create a prompt for the LLM which pulls all this information together. You may want to change who the letter is written by. The prompt will encourage the letter to be written by Becky.

Copy the code below into a new python cell called prompt_letter

prompt = concat(lit('Write an email addressed to this MP:'),
                col('MP'),
               lit('about these events: '),
               col('EVENTS').astype(StringType()),
               lit('affecting these stations: '),
               col('TRAIN_STATIONS').astype(StringType()),
                lit('And these Restaurants: '),
                col('RESTAURANTS').astype(StringType()),
               lit('The letter is written by Becky - becky.oconnor@snowflake.com - a concerned Citizen'))


Call the LLM with the prompt by copying the code below into a new cell. You may want to be creative and change who the letter is written by, or even ask Cortex to write it in the style of someone. The LLM we are using for this is mixtral-8x7b, as it's good at writing letters. Copy and paste the following into a new python cell called cortex_letter

letters = all_3.select('MP','"MP Email Address"', call_function('SNOWFLAKE.CORTEX.COMPLETE','mixtral-8x7b',prompt).alias('LETTER'))
letters.write.mode('overwrite').save_as_table("DATA.LETTERS_TO_MP")

We have now saved our letters. The next task is to view them by copying this code into a new python cell called new_letters:

letters = session.table('DATA.LETTERS_TO_MP')
letters

The code below allows the user to browse the letters with a slider and visualise a letter. Try it out by copying the code into a new python cell called letter_reader.

letterspd = letters.to_pandas()
selected_letter = st.slider('Choose Letter:',0,letterspd.shape[0]-1,1)
st.markdown(f''' **Email**: {letterspd['MP Email Address'].iloc[selected_letter]}''')
st.write()
st.write(letterspd.LETTER.iloc[selected_letter])

events_map

That's it. If you wish, you can download the completed python code from here by exporting it as a .ipynb file, which can then be imported into a new notebook.

This Streamlit application will generate synthetic incidents occurring at the time of the events, involving the train stations and the restaurants.

events_map

Go back to the home page and navigate to Streamlit, which is within the Projects menu.

Add a new Streamlit App

create_streamlit

Populate the prompts as suggested below.

alt text

Replace all the sample code with the following:

# Import python packages
import streamlit as st
from snowflake.snowpark.context import get_active_session
from snowflake.snowpark.functions import col, call_function, lit,concat, parse_json,object_construct,replace
from snowflake.snowpark.types import StringType, FloatType, ArrayType, VariantType, DateType

# Write directly to the app


# Get the current credentials
session = get_active_session()

col1,col2 = st.columns([0.2,0.8])
with col1:
    st.image('https://cdn.prgloo.com/web/NorthernRail/NorthernNewLogo.png')

with col2:
    st.title("EVENT SIMULATOR")
st.write(
    """Here are all the events and what places are potentially effected again).
    """
)


####Use a higher order filter to filter each array to only show restaurants that are no more than 250m from the event and include this in the LLM

events_what_affected = session.sql('SELECT MP, TRAIN_STATIONS,EVENTS,CASE WHEN ARRAY_SIZE(RESTAURANTS) >10 THEN FILTER(RESTAURANTS,i -> i:DISTANCE_FROM_EVENT <=250) ELSE RESTAURANTS END RESTAURANTS FROM DATA.EVENTS_AND_WHAT_IS_AFFECTED')
mps = events_what_affected.select('MP').to_pandas()

with st.expander("View Prompt Information"):
    st.dataframe(events_what_affected,column_config={
                                    'MP':st.column_config.ListColumn('MP',
                                     help='The current acting MP responsible for the area',
                                     width='medium')   
                                        }
              )

st.image('https://cdn.prgloo.com/media/aad47116f9cc4c33b6f02e51fb8070f4.jpg?width=1200&height=400')



json_template = {"DATE":"01/01/2024",
                 "Location Details":"CRS Code of train station or Restaurant name and address",
                 "LOCATION":{"LAT":0,"LON":0},
                 "REPORTED_BY":"BECKY O'CONNOR","DESCRIPTION_OF_INCIDENTS":"generate unique ficticious incident details here"}

#st.write(json_template)

with st.form('Generate Events'):
    'Generate Synthetic Events based on the following:'
    col1,col2, col3,col4 = st.columns(4)
    with col1:
        model = st.selectbox('Choose Model',['mistral-large2'])
    with col2:
        mp = st.selectbox('Choose MP: ',mps)
    with col3:
        activity = st.selectbox('Choose Activity: ', ['Overcrowding','Food Poisoning','Train Incident','Fight'])
    with col4:
        event_count = st.number_input('How many events:',1,10,5)
    
    submitted = st.form_submit_button('Generate Reports')


if submitted:
    filtered_data = events_what_affected.filter(col('MP')==lit(mp))

    st.markdown('Filtered Objects')

    st.dataframe(filtered_data)

    filtered_data_pd = filtered_data.to_pandas()

    prompt = concat(lit('create'),
                    lit(event_count),
                    lit('synthetic incidents using this json template'),
                    lit(json_template).astype(StringType()), 
                    lit('involving'), 
                    lit(activity), 
                    lit('concerning one of these train stations:'), 
                    col('TRAIN_STATIONS').astype(StringType()), 
                    lit('populate the incident date as the same date as one of these events'),
                    col('EVENTS').astype(StringType()),
                    lit('.Each incident will have connection with the provided event, and will also involve'),
                    lit('one of the following restaurants:'), 
                    col('RESTAURANTS').astype(StringType()),
                    lit('.  Each Incident will be Reported By a synthetic and randomly generated full name'),
                    lit('populate the latitude and longitude as one json element using the provided json template'),
                   lit('Nest all the generated incidents in a single json object called incidents.  Do not include Note'), 
                    lit('RETURN ONLY THE JSON'))

    mistral = call_function('snowflake.cortex.complete', lit(model), prompt)

    generated = filtered_data.with_column('generated_events',mistral)

    generated = generated.with_column('generated_events',replace(col('generated_events'),lit('''```json'''),lit('')))
    generated = generated.with_column('generated_events',replace(col('generated_events'),lit('''```'''),lit('')))
    #st.write(generated)
    generated = generated.select('MP',parse_json('GENERATED_EVENTS').alias('GENERATED_EVENTS'))
    generated = generated.with_column('INCIDENT_TYPE',lit(activity))
    #st.write(generated)

    sql2 = '''create table if not exists BUILD_UK.DATA.INCIDENTS (MP VARCHAR(255),
            GENERATED_EVENTS VARIANT,
         INCIDENT_TYPE VARCHAR(255))'''
    
    session.sql(sql2).collect()
    generated.write.mode('append').save_as_table("DATA.INCIDENTS")

    

    st.markdown('##### NEW EVENTS')

    st.dataframe(generated)

st.markdown('#### GENERATED EVENTS')


try:
    
    incident_table = session.table('DATA.INCIDENTS')
    st.markdown('##### ALL GENERATED EVENTS')
    
    sql = 'DROP TABLE DATA.INCIDENTS'

    clear = st.button('clear incident table')
    
    if clear:
        session.sql(sql).collect()
    
    #st.dataframe(incident_table)

    flatten = incident_table.select('MP','INCIDENT_TYPE',parse_json('GENERATED_EVENTS').alias('JSON'))
    #st.write(flatten)
    flatten = flatten.join_table_function('FLATTEN',col('JSON')['incidents'])
    flatten = flatten.select('MP','INCIDENT_TYPE','VALUE')
    
    flatten = flatten.with_column('DESCRIPTION_OF_INCIDENTS',
                                  col('VALUE')['DESCRIPTION_OF_INCIDENTS'].astype(StringType()),
                                 )

    flatten = flatten.with_column('LAT',
                                  col('VALUE')['LOCATION']['LAT'].astype(FloatType()))
                                 
    flatten = flatten.with_column('LON',
                                  col('VALUE')['LOCATION']['LON'].astype(FloatType()))
                                 

    flatten = flatten.with_column('REPORTED_BY',
                                  col('VALUE')['REPORTED_BY'].astype(StringType()),
                                 )

    flatten = flatten.with_column('DATE',
                                  col('VALUE')['DATE'].astype(StringType()),
                                 ).drop('VALUE')

    st.write(flatten)

    st.divider()

    st.markdown('#### INDIVIDUAL INCIDENTS')
    MP = st.selectbox('Choose MP:',flatten.select('MP').distinct())
    flatten = flatten.filter(col('MP')==MP)
    
    map = flatten.select('LAT','LON')
    st.map(map)
    flattenpd = flatten.to_pandas()
    count = flattenpd.shape[0]
    record = st.slider('Choose Incident Record:',0,count-1,count-1)
    
    st.markdown('###### INCIDENT TYPE')
    st.write(flattenpd.INCIDENT_TYPE.iloc[record])
    st.markdown('###### REPORTED BY')
    st.write(flattenpd.REPORTED_BY.iloc[record])
    st.markdown('###### DATE OF INCIDENT')
    st.write(flattenpd.DATE.iloc[record])
    st.markdown('###### DESCRIPTION OF INCIDENT')
    st.write(flattenpd.DESCRIPTION_OF_INCIDENTS.iloc[record])

    st.divider()

    st.markdown(''' #### NEWLY GENERATED SOCIAL MEDIA ''')

    #### SOCIAL MEDIA DATA
    
    social_media = session.create_dataframe([0])
    json = '''{"date":"YYYY-MM-DD","post":"abcdefg","sentiment_score":0.2,"username":"bob"}'''
    
    social_media = social_media.with_column('V',call_function('SNOWFLAKE.CORTEX.COMPLETE',model, 
    concat(lit('generate 4 random synthetic social media posts concerning the following incident:'), 
    lit(f'''{flattenpd.DESCRIPTION_OF_INCIDENTS.iloc[record]}'''), 
    lit('Add a date, username and relevant emojis to each post.\
    Include emotion.  Return the results as a json object and include sentiment scores.')
           ,lit('use the following json template to structure the data'),lit(json))).astype(VariantType()))
    

    social_media = social_media.with_column('V',replace(col('V'),lit('''```json'''),lit('')))
    social_media = social_media.with_column('V',replace(col('V'),lit('''```'''),lit('')))
    
    smedia = social_media.join_table_function('flatten',parse_json('V')).select('VALUE')
    smedia = smedia.select(object_construct(lit('INCIDENT_TYPE'),lit(flattenpd.INCIDENT_TYPE.iloc[record]),lit('MP'),lit(MP),lit('DATA'),col('VALUE')).alias('V'))
    smedia.write.mode('append').save_as_table('DATA.SOCIAL_MEDIA')
    smedia = smedia.with_column('"Date"',col('V')['DATA']['date'].astype(DateType()))
    smedia = smedia.with_column('"Post"',col('V')['DATA']['post'].astype(StringType()))
    smedia = smedia.with_column('"Sentiment"',col('V')['DATA']['sentiment_score'])
    smedia = smedia.with_column('"Username"',col('V')['DATA']['username'].astype(StringType()))
    smedia = smedia.with_column('"Incident Type"',col('V')['INCIDENT_TYPE'].astype(StringType()))
    smedia = smedia.with_column('"MP"',col('V')['MP'].astype(StringType()))
    st.dataframe(smedia)

    st.divider()
except:
    st.info('No Results Found')

try:
    st.markdown(''' #### ALL SOCIAL MEDIA POSTINGS ''')
    smediaV = session.table('DATA.SOCIAL_MEDIA')
    smediaV = smediaV.with_column('"Date"',col('V')['DATA']['date'].astype(DateType()))
    smediaV = smediaV.with_column('"Post"',col('V')['DATA']['post'].astype(StringType()))
    smediaV = smediaV.with_column('"Sentiment"',col('V')['DATA']['sentiment_score'])
    smediaV = smediaV.with_column('"Username"',col('V')['DATA']['username'].astype(StringType()))
    smediaV = smediaV.with_column('"Incident Type"',col('V')['INCIDENT_TYPE'].astype(StringType()))
    smediaV = smediaV.with_column('"MP"',col('V')['MP'].astype(StringType()))
    smediaV.create_or_replace_view('DATA.V_SOCIAL_MEDIA')
    st.write(session.table('DATA.V_SOCIAL_MEDIA'))
except:
    st.info('No Results Found')

Running the App to generate data

Within the app, generate events for various MPs. You can decide how many you would like to generate. Once you have selected what you would like to generate, press Generate Reports. This will create incidents relating to the MP, in the context of the specified activity.

Once you have generated events, select each MP under the Individual incidents section. This will generate synthetic social media posts for the incidents.

alt text

Finally we will create a Streamlit app which adds this additional information to a map.

alt text

You will now be creating another streamlit app in order to visualise the results.

Below is sample code which takes what we have learnt to create a streamlit app with all the place, event, location and incident data from the shared datasets as well as the synthetic data.

You will need to install pydeck.

import streamlit as st
from snowflake.snowpark.context import get_active_session
from snowflake.snowpark.functions import max,min,avg,call_function, split,substr,hour,concat,col,sqrt,lit,array_slice,array_agg,object_construct,parse_json, to_geography, to_array,to_date,round
from snowflake.snowpark.types import StringType,VariantType, DateType,FloatType, IntegerType,DecimalType
import json
import pandas as pd
import numpy as np
import pydeck as pdk
st.set_page_config(layout="wide")
# Write directly to the app
st.title("UK Analytics within the North of England :train:")
st.write(
    """This app shows key insight of places and events that may effect Northern Trains).
    """
)

# Get the current credentials
session = get_active_session()

trains_latlon = session.table('NORTHERN_TRAINS_STATION_DATA.TESTING."StationLatLong"')


#create a point from the coordinates
envelope = trains_latlon.with_column('POINT',call_function('ST_MAKEPOINT',col('"Longitude"'),col('"Latitude"')))

#collect all the points into one row of data
envelope = envelope.select(call_function('ST_COLLECT',col('POINT')).alias('POINTS'))

#create a rectangular shape which borders the minimum possible size covering all of the points
envelope = envelope.select(call_function('ST_ENVELOPE',col('POINTS')).alias('BOUNDARY'))

#find the centre point so the map will render from that location

centre = envelope.with_column('CENTROID',call_function('ST_CENTROID',col('BOUNDARY')))
centre = centre.with_column('LON',call_function('ST_X',col('CENTROID')))
centre = centre.with_column('LAT',call_function('ST_Y',col('CENTROID')))

#create LON and LAT variables

centrepd = centre.select('LON','LAT').to_pandas()
LON = centrepd.LON.iloc[0]
LAT = centrepd.LAT.iloc[0]

### transform the data in pandas so the pydeck visualisation tool can view it as a polygon

envelopepd = envelope.to_pandas()
envelopepd["coordinates"] = envelopepd["BOUNDARY"].apply(lambda row: json.loads(row)["coordinates"][0])

places = session.table('OVERTURE_MAPS__PLACES.CARTO.PLACE')
places = places.filter(col('ADDRESSES')['list'][0]['element']['country'] =='GB')

places = places.select(col('NAMES')['primary'].astype(StringType()).alias('NAME'),
                        col('PHONES')['list'][0]['element'].astype(StringType()).alias('PHONE'),
                      col('CATEGORIES')['primary'].astype(StringType()).alias('CATEGORY'),
                        col('CATEGORIES')['alternate']['list'][0]['element'].astype(StringType()).alias('ALTERNATE'),
                    col('websites')['list'][0]['element'].astype(StringType()).alias('WEBSITE'),
                      col('GEOMETRY'))

places = places.filter(col('CATEGORY') =='restaurant')

places = places.join(envelope,call_function('ST_WITHIN',places['GEOMETRY'],envelope['boundary']))
places = places.with_column('LON',call_function('ST_X',col('GEOMETRY')))
places = places.with_column('LAT',call_function('ST_Y',col('GEOMETRY')))

placespd = places.to_pandas()

trains_latlon_renamed = trains_latlon

trains_latlon_renamed = trains_latlon_renamed.with_column_renamed('"CrsCode"','NAME')
trains_latlon_renamed = trains_latlon_renamed.with_column_renamed('"Latitude"','LAT')
trains_latlon_renamed = trains_latlon_renamed.with_column_renamed('"Longitude"','LON')

station_info = session.table('BUILD_UK.DATA.TRAIN_STATION_INFORMATION')

trains_latlon_renamed = trains_latlon_renamed.join(station_info,station_info['"CRS Code"']==trains_latlon_renamed['NAME']).drop('"CRS Code"')
trains_latlon_renamed_pd = trains_latlon_renamed.to_pandas()

events = session.table('BUILD_UK.DATA.EVENTS_IN_THE_NORTH')
events = events.join_table_function('flatten',parse_json('EVENT_DATA')).select('VALUE')
events=events.with_column('NAME',col('VALUE')['NAME'].astype(StringType()))
events=events.with_column('DESCRIPTION',col('VALUE')['DESCRIPTION'].astype(StringType()))
events=events.with_column('CENTROID',to_geography(col('VALUE')['CENTROID']))
events=events.with_column('COLOR',col('VALUE')['COLOR'])
events=events.with_column('DATE',col('VALUE')['DATE'].astype(DateType())).drop('VALUE')
events=events.with_column('H3',call_function('H3_POINT_TO_CELL_STRING',col('CENTROID'),lit(5)))

events = events.with_column('R',col('COLOR')[0])
events = events.with_column('G',col('COLOR')[1])
events = events.with_column('B',col('COLOR')[2])
events = events.with_column_renamed('DESCRIPTION','ALTERNATE')
eventspd = events.group_by('H3','NAME','ALTERNATE','R','G','B').count().to_pandas()

incident_table = session.table('BUILD_UK.DATA.INCIDENTS')
flatten = incident_table.select('MP','INCIDENT_TYPE',parse_json('GENERATED_EVENTS').alias('JSON'))
flatten = flatten.join_table_function('FLATTEN',col('JSON')['incidents'])
flatten = flatten.select('MP',col('INCIDENT_TYPE').alias('NAME'),'VALUE')
flatten = flatten.with_column('ALTERNATE',
                                  col('VALUE')['DESCRIPTION_OF_INCIDENTS'].astype(StringType()),
                                 )
flatten = flatten.with_column('LAT',
                                  col('VALUE')['LOCATION']['LAT'].astype(FloatType()))                           
flatten = flatten.with_column('LON',
                                  col('VALUE')['LOCATION']['LON'].astype(FloatType()))
flatten = flatten.with_column('REPORTED_BY',
                                  col('VALUE')['REPORTED_BY'].astype(StringType()),
                                 )
flatten = flatten.with_column('DATE',
                                  col('VALUE')['DATE'].astype(StringType()),
                                 ).drop('VALUE')
flattenpd = flatten.to_pandas()
####MAP LAYERS

polygon_layer = pdk.Layer(
            "PolygonLayer",
            envelopepd,
            opacity=0.3,
            get_polygon="coordinates",
            filled=True,
            get_fill_color=[16, 14, 40],
            auto_highlight=True,
            pickable=False,
        )

 
poi_l = pdk.Layer(
            'ScatterplotLayer',
            data=placespd,
            get_position='[LON, LAT]',
            get_color='[255,255,255]',
            get_radius=600,
            pickable=True)


nw_trains_l = pdk.Layer(
            'ScatterplotLayer',
            data=trains_latlon_renamed_pd,
            get_position='[LON, LAT]',
            get_color='[0,187,2]',
            get_radius=600,
            pickable=True)

h3_events = pdk.Layer(
        "H3HexagonLayer",
        eventspd,
        pickable=True,
        stroked=True,
        filled=True,
        extruded=False,
        get_hexagon="H3",
        get_fill_color=["255-R","255-G","255-B"],
        line_width_min_pixels=2,
        opacity=0.4)



incidents_layer = pdk.Layer(
            'ScatterplotLayer',
            data=flattenpd,
            get_position='[LON, LAT]',
            get_color='[255,165,0]',
            get_radius=2000,
            pickable=True)

#### render the map showing all layers

tooltip = {
   "html": """<b>Name:</b> {NAME} <br> <b>Alternate:</b> {ALTERNATE}""",
   "style": {
       "width":"50%",
        "backgroundColor": "steelblue",
        "color": "white",
       "text-wrap": "balance"
    }
    }
    
letters = session.table('BUILD_UK.DATA.LETTERS_TO_MP')
mps = letters.select('MP').distinct()
selected_mp = st.selectbox('Choose MP:',mps)
letterspd = letters.filter(col('MP')==lit(selected_mp)).to_pandas()
#st.write(letterspd.PROMPT.iloc[0])

st.divider()
col1,col2 = st.columns([0.5,0.5])

with col1:
    st.markdown('##### MAP OF EVENTS WITH ALL AFFECTED STATIONS AND RESTAURANTS')
    st.pydeck_chart(pdk.Deck(
    map_style=None,
    initial_view_state=pdk.ViewState(
        latitude=LAT,
        longitude=LON,
        zoom=7,
        height=750
        ),
    
    layers= [polygon_layer, poi_l, h3_events,nw_trains_l,incidents_layer ], tooltip = tooltip

    ))
    st.caption('Hover for more info')
with col2:
    st.markdown('#### LETTER TO CHOSEN MP')
    st.write(letterspd.LETTER.iloc[0])
    st.divider()

social_media = session.table('DATA.V_SOCIAL_MEDIA').filter(col('MP')==selected_mp)

st.markdown('##### SOCIAL MEDIA')
st.table(social_media.drop('V'))


    

During the lab we have produced quite a bit of unstructured data - from social media posts to incidents through to letters. Now let's use vector embedding functionality to make this information searchable. This is really useful when you would like an LLM to answer questions but do not want to send the entire dataset as one large object - which could be expensive and slow to run. For large text blocks, you may wish to 'chunk' the data first; as the text in this scenario is relatively small, we will keep it as is. Once you have explored the principles of how text searching works, you will create a Cortex Search Service.
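To make the idea concrete before opening the notebook, here is a minimal sketch of the principle, assuming the DATA.LETTERS_TO_MP table from earlier and the snowflake-arctic-embed-m embedding model - the imported notebook remains the definitive version:

from snowflake.snowpark.functions import call_function, col, lit

# embed each letter once into a 768-dimension vector
letters = session.table('DATA.LETTERS_TO_MP')
letters = letters.with_column('EMBEDDING',
    call_function('SNOWFLAKE.CORTEX.EMBED_TEXT_768',
                  lit('snowflake-arctic-embed-m'), col('LETTER')))

# embed the question with the same model, then rank the letters by cosine similarity
question = call_function('SNOWFLAKE.CORTEX.EMBED_TEXT_768',
                         lit('snowflake-arctic-embed-m'),
                         lit('Which letters mention overcrowding at stations?'))

results = letters.with_column('SIMILARITY',
    call_function('VECTOR_COSINE_SIMILARITY', col('EMBEDDING'), question))
results.sort(col('SIMILARITY').desc()).select('MP','SIMILARITY').show(3)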

Click here and download the notebook from GitHub

Vector_Embeddings.ipynb

Import it as a new Snowflake notebook. Add it to the BUILD_UK.NOTEBOOKS schema and follow the instructions provided in the notebook.

Met Office Weather

Now we will leverage the weather data to understand the weather forecasts for each Event. This time we will import a notebook which has already been created.

weather.ipynb

alt text

Streamlit showing weather by Train Station

alt text

This is an example weather app which shows the weather forecast for each train station:

# Import python packages
import streamlit as st
from snowflake.snowpark.context import get_active_session
from snowflake.snowpark.functions import any_value,max,min,avg,call_function, split,substr,hour,concat,col,sqrt,lit,array_slice,array_agg,object_construct,parse_json, to_geography, to_array,to_date
from snowflake.snowpark.types import StringType,VariantType, DateType, IntegerType,DecimalType
import json
import pandas as pd
import numpy as np
import pydeck as pdk

st.set_page_config(layout="wide")

# Write directly to the app

col1,col2,col3 = st.columns([0.2,0.6,0.2])
with col1:
    st.image('https://cdn.prgloo.com/web/NorthernRail/NorthernNewLogo.png')

with col3:
    st.image('https://upload.wikimedia.org/wikipedia/en/thumb/f/f4/Met_Office.svg/1200px-Met_Office.svg.png')
st.title("Northern Trains Weather Data")

st.write(
    """This app shows the weather that may affect Northern Trains).
    """
)

# Get the current credentials
session = get_active_session()

#### add postcode sector to trains
trains_latlon = session.table('NORTHERN_TRAINS_STATION_DATA.TESTING."StationLatLong"')
trains_latlon = trains_latlon.with_column('POSTCODE_SPLIT',split(col('"Postcode"'),lit(' ')))
trains_latlon = trains_latlon.with_column('Postcode_outcode',col('POSTCODE_SPLIT')[0].astype(StringType()))
trains_latlon = trains_latlon.with_column('Postcode_inward_code',col('POSTCODE_SPLIT')[1].astype(StringType()))
trains_latlon = trains_latlon.with_column('Postcode_Sector',concat('POSTCODE_OUTCODE',lit('_'),substr(col('Postcode_inward_code'),1,1)))
trains_latlon = trains_latlon.drop('POSTCODE_SPLIT','POSTCODE_OUTCODE','POSTCODE_INWARD_CODE')


# Add hourly and daily weather for each train station

weather_hourly = session.table('POSTCODE_SECTOR_WEATHER_FORECASTS__ADVANCED_WITH_SOLAR.PCSECT_FORECAST."advanced_with_solar_hourly_view"')
weather_daily = session.table('POSTCODE_SECTOR_WEATHER_FORECASTS__ADVANCED_WITH_SOLAR.PCSECT_FORECAST."advanced_with_solar_daily_view"')

weather_hourly_max = weather_hourly.agg(max('"Issued_at"').alias('MAX'))
weather_hourly = weather_hourly.join(weather_hourly_max,weather_hourly_max['MAX']==weather_hourly['"Issued_at"']).drop('MAX')

weather_daily_max = weather_daily.agg(max('"Issued_at"').alias('MAX'))
weather_daily = weather_daily.join(weather_daily_max,weather_daily_max['MAX']==weather_daily['"Issued_at"']).drop('MAX')

weather_hourly = weather_hourly.join(trains_latlon,trains_latlon['POSTCODE_SECTOR']==weather_hourly['PC_SECT']).drop('PC_SECT')
weather_hourly = weather_hourly.with_column('VALIDITY_DATE',to_date('"Validity_date_and_time"'))

weather_daily_t = weather_daily.join(trains_latlon,trains_latlon['POSTCODE_SECTOR']==weather_daily['PC_SECT']).drop('PC_SECT')

#add h3 column to weather
weather_daily_h3 = weather_daily_t.with_column('H3',call_function('H3_LATLNG_TO_CELL_STRING',col('"Latitude"'),col('"Longitude"'),lit(5)))


#add northern trains boundary box
#create a point from the coordinates
envelope = trains_latlon.with_column('POINT',call_function('ST_MAKEPOINT',col('"Longitude"'),col('"Latitude"')))

#collect all the points into one row of data
envelope = envelope.select(call_function('ST_COLLECT',col('POINT')).alias('POINTS'))

#create a rectangular shape which borders the minimum possible size covering all of the points
envelope = envelope.select(call_function('ST_ENVELOPE',col('POINTS')).alias('BOUNDARY'))

# Create a centre point to position the maps
centre = envelope.with_column('CENTROID',call_function('ST_CENTROID',col('BOUNDARY')))
centre = centre.with_column('LON',call_function('ST_X',col('CENTROID')))
centre = centre.with_column('LAT',call_function('ST_Y',col('CENTROID')))



#create LON and LAT variables

centrepd = centre.select('LON','LAT').to_pandas()
LON = centrepd.LON.iloc[0]
LAT = centrepd.LAT.iloc[0]

# index the boundary with h3 res 5
boundary_h3 = envelope.with_column('H3',call_function('H3_COVERAGE_STRINGS',col('BOUNDARY'),lit(5))).join_table_function('flatten',col('H3')).select(col('VALUE').astype(StringType()).alias('H3'))


#### use the postcodes data set to add H3 to postcode sectors
postcodes = weather_daily.group_by('PC_SECT').agg(any_value('POINT').alias('POINT'))
postcodes = postcodes.with_column('H3',call_function('H3_POINT_TO_CELL_STRING',col('"POINT"'),lit(5)))
postcodes = postcodes.join(boundary_h3,boundary_h3['H3']==postcodes['H3'],lsuffix='L').drop('H3L')


### join weather data to the boundary
weather_boundary = boundary_h3.join(weather_daily_h3,weather_daily_h3['H3']==boundary_h3['H3'],lsuffix='L',type='right').drop('H3L')
weather_boundarypd = weather_boundary.select('H3','"Max_temperature_day"','"Max_feels_like_temperature_day"').to_pandas()


from snowflake.snowpark.functions import dateadd


## join the latest daily weather to all postcode sectors within the boundary

weather_daily_north = weather_daily.join(weather_daily_max,weather_daily_max['MAX']==weather_daily['"Issued_at"']).drop('MAX')
weather_daily_north = weather_daily_north.join(postcodes,postcodes['PC_SECT']==weather_daily_north['PC_SECT'])


station_filter = trains_latlon.select('"CrsCode"')
date_filter = weather_hourly.agg(max('VALIDITY_DATE').alias('MAX'),
                          min('VALIDITY_DATE').alias('MIN'))

date_filter = date_filter.with_column('MIN',dateadd('DAY',lit(1),col('MIN'))).to_pandas()
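# selectable dates run from one day after the earliest forecast through the latest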

with st.form('select_data'):
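    # st.form batches all the widget inputs - nothing under 'if run:' executes
    # until the Run Dashboard button is pressed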
    col1,col2 = st.columns([0.3,0.7])
    with col1:
        selected_station = st.selectbox('Select Station:',station_filter)
        selected_date = st.date_input('Select Date:',date_filter.MIN.iloc[0],date_filter.MIN.iloc[0],date_filter.MAX.iloc[0])
        models = ['reka-flash','mixtral-8x7b','gemma-7b','llama2-70b-chat']
        select_model = st.selectbox('Choose Model',models)
        run = st.form_submit_button('Run Dashboard')
    with col2:
        st.image('https://cdn.prgloo.com/media/aad47116f9cc4c33b6f02e51fb8070f4.jpg?width=1200&height=400')

if run:

    todaypd = weather_daily_t.filter((col('"Validity_date"')==selected_date)
                                     & (col('"CrsCode"')==selected_station)).to_pandas()
    st.write(todaypd)
    melt = pd.melt(todaypd)
    melt['variable'] = melt['variable'].astype("string")
    melt['value'] = melt['value'].astype("string")

    ##### create an object to feed into the LLM
    object = session.create_dataframe(melt)
    object = object.with_column('object',object_construct(col('"variable"'),col('"value"')))
    
    object = object.select(array_agg('OBJECT').alias('OBJECT'))
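    # object_construct turns each variable/value pair into a JSON object, and
    # array_agg rolls those objects up into a single-row array - one compact
    # text blob the LLM can read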

    

    ##### create an LLM prompt which includes the data object
    prompt = concat(lit('Summarise the weather report in 500 words using paragraphs for the date specified which includes relevant emojis to summarise the weather based on the following dataset'),
                    col('object').astype(StringType()),
                    lit('USE APPROPRIATE MARKDOWN TO ENHANCE THE PRESENTATION. NO COMMENTS'))
    
    # construct the cortex.complete function - this will run based on the chosen model
    complete = call_function('snowflake.cortex.complete',lit(select_model),lit(prompt))
    
    object = object.select(complete)
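    # Snowpark is lazy - the model is only invoked when .collect() below
    # materialises the result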

    #### print the results
    st.write(object.collect()[0][0])

    st.divider()
    
    # dataframes for h3 maps

    # filter by date first ("Validity_date" is dropped by the select below)
    weather_daily_north_0 = weather_daily_north.filter(col('"Validity_date"')== dateadd('day',lit(1),lit(selected_date)))
    weather_daily_north_1 = weather_daily_north.filter(col('"Validity_date"')== dateadd('day',lit(2),lit(selected_date)))
    weather_daily_north_2 = weather_daily_north.filter(col('"Validity_date"')== dateadd('day',lit(3),lit(selected_date)))

    map_columns = ['H3',
                   col('"Validity_date"').astype(StringType()).alias('DATE'),
                   '"Max_temperature_day"',
                   '"Max_feels_like_temperature_day"']

    weather_daily_north_0 = weather_daily_north_0.select(map_columns)
    weather_daily_north_1 = weather_daily_north_1.select(map_columns)
    weather_daily_north_2 = weather_daily_north_2.select(map_columns)


    ####
    weather_daily_northpd = weather_daily_north_0.to_pandas()
    weather_daily_northpd1 = weather_daily_north_1.to_pandas()
    weather_daily_northpd2 = weather_daily_north_2.to_pandas()

    

    #################CONSTRUCT THE MAP

    ### create a layer to cover the boundary with H3 Indexes

    st.markdown('#### TEMPERATURE COMPARISON OVER THE NEXT 3 DAYS')

    filtered_station = trains_latlon.filter(col('"CrsCode"')==selected_station)
    filtered_station_pd = filtered_station.to_pandas()




    station = pdk.Layer(
            'ScatterplotLayer',
            data=filtered_station_pd,
            get_position='[Longitude,Latitude]',
            get_color='[0,0,0]',
            get_radius=2000,
            radius_scale=6,
            radius_min_pixels=1,
            opacity=0.6,
            radius_max_pixels=1000,
            filled=True,
            line_width_min_pixels=1,
            pickable=False)
    

    h3_coverage = pdk.Layer(
        "H3HexagonLayer",
        weather_daily_northpd,
        pickable=True,
        stroked=True,
        filled=True,
        extruded=False,
        get_hexagon="H3",
        get_fill_color=["0+Max_temperature_day*8","(255/100) * 0+Max_temperature_day*12","(255/100) * 200-Max_temperature_day*9"],
        line_width_min_pixels=0,
        opacity=0.9)
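    # the fill colour expressions are evaluated per hexagon in the browser:
    # red and green rise with Max_temperature_day while blue falls, giving a
    # cold-blue to warm-red temperature ramp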

    h3_coverage_1 = pdk.Layer(
        "H3HexagonLayer",
        weather_daily_northpd1,
        pickable=True,
        stroked=True,
        filled=True,
        extruded=False,
        get_hexagon="H3",
        get_fill_color=["0+Max_temperature_day*8","(255/100) * 0+Max_temperature_day*12","(255/100) * 200-Max_temperature_day*9"],
        line_width_min_pixels=0,
        opacity=0.9)

    h3_coverage_2 = pdk.Layer(
        "H3HexagonLayer",
        weather_daily_northpd2,
        pickable=True,
        stroked=True,
        filled=True,
        extruded=False,
        get_hexagon="H3",
        get_fill_color=["0+Max_temperature_day*8","(255/100) * 0+Max_temperature_day*12","(255/100) * 200-Max_temperature_day*9"],
        line_width_min_pixels=0,
        opacity=0.9)



    #### render the map 


    col1,col2,col3 = st.columns(3)

    with col1:
        st.write(weather_daily_northpd.DATE.iloc[1])
        st.pydeck_chart(pdk.Deck(
        map_style=None,
        initial_view_state=pdk.ViewState(
        latitude=LAT,
        longitude=LON,
        zoom=5,
        height=400
        ),
    
        layers= [h3_coverage,station],tooltip = {'text':"Temperature: {Max_temperature_day}, Feels like: {Max_feels_like_temperature_day}"}

        ))

    with col2:
        st.write(weather_daily_northpd1.DATE.iloc[1])
        st.pydeck_chart(pdk.Deck(
        map_style=None,
        initial_view_state=pdk.ViewState(
            latitude=LAT,
            longitude=LON,
            zoom=5,
            height=400
            ),
    
        layers= [h3_coverage_1,station],tooltip = {'text':"Temperature: {Max_temperature_day}, Feels like: {Max_feels_like_temperature_day}"}

        ))

    with col3:
        st.write(weather_daily_northpd2.DATE.iloc[1])
        st.pydeck_chart(pdk.Deck(
        map_style=None,
        initial_view_state=pdk.ViewState(
        latitude=LAT,
        longitude=LON,
        zoom=5,
        height=400
            ),
    
        layers= [h3_coverage_2,station],tooltip = {'text':"Temperature: {Max_temperature_day}, Feels like: {Max_feels_like_temperature_day}"}

        ))



    
    weather_filter_hour = weather_hourly.filter(col('"CrsCode"') == selected_station)
    weather_filter_day = weather_daily_t.filter(col('"CrsCode"') == selected_station)

    weather_filter_hour = weather_filter_hour.filter(col('VALIDITY_DATE')==selected_date)
    #weather_filter_day = weather_filter_day.filter(col('"Validity_date"')==selected_date)
    weather_filter_hour = weather_filter_hour.with_column('HOUR',hour('"Validity_date_and_time"'))


    # average each metric by hour; the single-space alias keeps the UV chart label blank
    weather_filter_hour = weather_filter_hour.group_by('HOUR').agg(avg('"Feels_like_temperature"').alias('Feels Like'),
                                                                   avg('"Screen_temperature"').alias('"Screen"'),
                                                                   avg('"Screen_Dew_Point_Temperature"').alias('"Dew Point"'),
                                                                   avg('"UV_index"').alias(' '),
                                                                   avg('"Relative_humidity"').alias('"Relative"'),
                                                                   avg('"Wind_speed"').alias('"Speed"'),
                                                                   avg('"Wind_gust"').alias('"Gust"')).to_pandas()

    st.markdown('#### 12 DAY DAILY FORECAST')

    weather_filter_daypd = weather_filter_day.to_pandas()

    col1,col2 = st.columns(2)
    with col1:
        st.markdown('##### TEMPERATURE')
        st.line_chart(weather_filter_daypd,x='Validity_date',y=['Max_temperature_day','Max_feels_like_temperature_day','Min_feels_like_temperature_night'])

    with col2:
        st.markdown('##### HUMIDITY')
        st.line_chart(weather_filter_daypd,x='Validity_date',y=['Relative_humidity_Approx_Local_Midday','Relative_humidity_Approx_Local_Midnight'])

    st.divider()

    st.markdown('#### HOURLY VIEW')

    col1,col2 = st.columns(2)

    with col1:
        st.markdown('##### TEMPERATURE')
        st.line_chart(weather_filter_hour,x='HOUR',y=['Feels Like','Screen','Dew Point'])
    with col2:
        st.markdown('##### UV INDEX')
        st.bar_chart(weather_filter_hour,x='HOUR',y=[' '])

    with col1:
        st.markdown('##### WIND')
        st.line_chart(weather_filter_hour,x='HOUR',y=['Speed','Gust'])
    with col2:
        st.markdown('##### HUMIDITY')
        st.bar_chart(weather_filter_hour,x='HOUR',y=['Relative'])
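
As an optional refinement (not part of the lab), note that every press of Run Dashboard re-executes each Snowpark query above. Queries that do not depend on the form inputs can be cached with Streamlit's st.cache_data. A minimal sketch (the function name here is illustrative):

@st.cache_data
def station_list_pd():
    # the station list never changes during a session, so fetch it once;
    # this could replace the station_filter query defined earlier
    return trains_latlon.select('"CrsCode"').to_pandas()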


Conclusion

Snowflake provides a powerful platform for location analytics when you bring Cortex LLMs, Snowpark and Streamlit's visualization capabilities together.

What You Learned

In this quickstart, you learned the following:

Related Resources

Datasources used in the lab

Source code

Further Related Material