In this tutorial, you'll build a complete bioinformatics project from scratch in Snowflake. You'll work with the Delaney solubility dataset to analyze molecular properties that are crucial for successful drug discovery efforts.

By the end of this guide, you'll have created an interactive dashboard that visualizes important molecular properties and their relationship to solubility, providing valuable insights for pharmaceutical research.

What You'll Learn

What You'll Build

An interactive bioinformatics dashboard that visualizes molecular properties from the Delaney solubility dataset, allowing researchers to explore relationships between molecular weight, rotatable bonds, LogP values, and aromatic proportions.

Here's an illustration of the overview of this bioinformatics project that you'll build:

image

What You'll Need

Prepare Your Environment

Firstly, to follow along with this quickstart, you can click on Bioinformatics_Solubility_Dashboard.ipynb to download the Notebook from GitHub.

Snowflake Notebooks come pre-installed with common Python libraries for data science and machine learning, such as NumPy, Pandas, Matplotlib, and more! If you need additional packages for this tutorial, click on the Packages dropdown on the top right to add them to your notebook.

Create a database

Before we can upload data into Snowflake, we'll need to create a database. In Snowsight, you can create a database in by clicking on Projects → Worksheets in the left sidebar and on the top-right hand corner click on the "+" blue button to spin up a SQL worksheet.

Enter the following SQL query into the worksheet and run it:

CREATE DATABASE IF NOT EXISTS CHANINN_DEMO_DATA;

Note: Please replace CHANINN_DEMO_DATA with a database name of your choice.

Upload the dataset

Download the Delaney data set as a delaney_solubility_with_descriptors.csv file.

In Snowsight, click on Data → Add Data and then click on the "Load data into a Table" button. Drag-and-drop the CSV file or click on Browse to select the CSV file.

image

Scroll down, under the "Select or create a database and schema" section, select your database and schema which I'll use CHANINN_DEMO_DATA.PUBLIC and for the table name I'll use SOLUBILITY. Finally, click on the Next button to proceed. After a few moments, your CSV data will be loaded into Snowflake.

The data table is now available at CHANINN_DEMO_DATA.PUBLIC.SOLUBILITY.

Load the Delaney Dataset

First, we'll load the Delaney solubility dataset from Snowflake using a simple SQL query:

SELECT * FROM CHANINN_DEMO_DATA.PUBLIC.SOLUBILITY

This query retrieves all records from the solubility table. The result is stored in a variable called sql_data.

Convert to Pandas DataFrame

Next, we'll convert the SQL result to a Pandas DataFrame for easier manipulation and analysis:

df = sql_data.to_pandas()

Let's examine the DataFrame:

image

This would display the first segments of our dataset, showing columns like MOLWT (molecular weight), NUMROTATABLEBONDS (number of rotatable bonds), MOLLOGP (measure of lipophilicity), and AROMATICPROPORTION (proportion of aromatic atoms).

Basic Data Analysis

We can perform some basic statistical analysis to understand our data better:

df.describe()

This provides statistical summaries for each numerical column, including count, mean, standard deviation, minimum, and maximum values.

Classify Molecules by Size

For our analysis, we'll classify molecules based on their molecular weight:

This classification will help us compare properties between smaller and larger molecules:

df['MOLWT_CLASS'] = pd.Series(['small' if x < 300 else 'large' for x in df['MOLWT']])

Aggregate Data by Molecular Weight Class

Now, let's calculate the average values of various properties for each molecular weight class:

df_class = df.groupby('MOLWT_CLASS').mean().reset_index()
df_class

image

This aggregation gives us a clear comparison between small and large molecules across all measured properties.

Analyze the Results

The aggregated data reveals interesting patterns:

These insights will form the basis of our visualization dashboard.

Create a Streamlit Dashboard

Now we'll build an interactive dashboard using Streamlit to visualize our findings. Here's the complete code for our dashboard:

import streamlit as st

st.title('☘️ Solubility Dashboard')

# Data Filtering
mol_size = st.slider('Select a value', 100, 500, 300)
df['MOLWT_CLASS'] = pd.Series(['small' if x < mol_size else 'large' for x in df['MOLWT']])
df_class = df.groupby('MOLWT_CLASS').mean().reset_index()

st.divider()

# Calculate Metrics
molwt_large = round(df_class['MOLWT'][0], 2)
molwt_small = round(df_class['MOLWT'][1], 2)
numrotatablebonds_large = round(df_class['NUMROTATABLEBONDS'][0], 2)
numrotatablebonds_small = round(df_class['NUMROTATABLEBONDS'][1], 2)
mollogp_large = round(df_class['MOLLOGP'][0], 2)
mollogp_small = round(df_class['MOLLOGP'][1], 2)
aromaticproportion_large = round(df_class['AROMATICPROPORTION'][0], 2)
aromaticproportion_small = round(df_class['AROMATICPROPORTION'][1], 2)

# Data metrics and visualizations
col = st.columns(2)
with col[0]:
    st.subheader('Molecular Weight')
    st.metric('Large', molwt_large)
    st.metric('Small', molwt_small)
    st.bar_chart(df_class, x='MOLWT_CLASS', y='MOLWT', color='MOLWT_CLASS')

    st.subheader('Number of Rotatable Bonds')
    st.metric('Large', numrotatablebonds_large)
    st.metric('Small', numrotatablebonds_small)
    st.bar_chart(df_class, x='MOLWT_CLASS', y='NUMROTATABLEBONDS', color='MOLWT_CLASS')
with col[1]:
    st.subheader('Molecular LogP')
    st.metric('Large', mollogp_large)
    st.metric('Small', mollogp_small)
    st.bar_chart(df_class, x='MOLWT_CLASS', y='MOLLOGP', color='MOLWT_CLASS')

    st.subheader('Aromatic Proportion')
    st.metric('Large', mollogp_large)
    st.metric('Small', mollogp_small)
    st.bar_chart(df_class, x='MOLWT_CLASS', y='AROMATICPROPORTION', color='MOLWT_CLASS')

with st.expander('Show Original DataFrame'):
    st.dataframe(df)
with st.expander('Show Aggregated DataFrame'):
    st.dataframe(df_class)

Here's the rendered dashboard:

image

Dashboard Components

Our dashboard includes several key components:

  1. Interactive Slider: Allows users to dynamically adjust the threshold for classifying molecules as "small" or "large"
  2. Metrics Display: Shows the average values for each property by molecular weight class
  3. Bar Charts: Visualizes the differences between small and large molecules for each property
  4. Expandable Data Views: Provides access to the raw and aggregated data for deeper analysis

Interpret the Dashboard

The dashboard makes it easy to observe how molecular properties change with size:

Users can adjust the molecular weight threshold to explore different classification schemes and see how that affects the property distributions.

Before diving into the code, let's understand what we're working with. The Delaney dataset contains information about various molecules and their properties, including:

These properties are crucial in determining whether a molecule might make a good drug candidate or "drug-like" according to the Lipinski's Rule of 5.

Why Solubility Matters in Drug Development

Molecular solubility is a critical property in pharmaceutical research that directly impacts whether a drug can effectively reach its target in the human body.

Importance of Solubility

Solubility refers to a molecule's ability to dissolve in a liquid - specifically in the human bloodstream. If a drug cannot dissolve properly, it cannot be transported to its target site, rendering it ineffective regardless of its other properties.

Poorly soluble drugs often require:

All of these factors can lead to increased side effects and reduced patient compliance.

Lipinski's Rule of 5

In drug development, scientists often refer to Lipinski's Rule of 5 (RO5), a set of guidelines that help predict whether a molecule will be sufficiently soluble to make an effective oral medication. The rule states that an orally active drug generally has:

These parameters help researchers focus their efforts on molecules that are more likely to succeed in clinical trials and eventually become viable medications.

By understanding and optimizing solubility early in the drug discovery process, pharmaceutical companies can develop more effective medicines that can be easily administered and work efficiently in the body.

Congratulations! You've successfully built a bioinformatics dashboard that analyzes molecular properties crucial for drug discovery. Using Snowflake, Python, and Streamlit, you've created an interactive tool that helps visualize and understand the relationships between molecular weight, rotatable bonds, LogP values, and aromatic proportions - all key factors in determining a molecule's solubility and potential as a drug candidate.

What You Learned

Related Resources

Articles:

Documentation:

Happy coding and happy exploration of the fascinating intersection of data science and bioinformatics!