In this tutorial, you'll build a complete bioinformatics project from scratch in Snowflake. You'll work with the Delaney solubility dataset to analyze molecular properties that are crucial for successful drug discovery efforts.
By the end of this guide, you'll have created an interactive dashboard that visualizes important molecular properties and their relationship to solubility, providing valuable insights for pharmaceutical research.
The finished project is an interactive bioinformatics dashboard that visualizes molecular properties from the Delaney solubility dataset, allowing researchers to explore relationships between molecular weight, rotatable bonds, LogP values, and aromatic proportions.
Here's an overview of the bioinformatics project that you'll build:
To follow along with this quickstart, click on Bioinformatics_Solubility_Dashboard.ipynb to download the Notebook from GitHub.
Snowflake Notebooks come pre-installed with common Python libraries for data science and machine learning, such as NumPy, Pandas, Matplotlib, and more! If you need additional packages for this tutorial, click on the Packages dropdown on the top right to add them to your notebook.
Before we can upload data into Snowflake, we'll need to create a database. In Snowsight, click on Projects → Worksheets in the left sidebar, then click the blue "+" button in the top-right corner to open a SQL worksheet.
Enter the following SQL query into the worksheet and run it:
CREATE DATABASE IF NOT EXISTS CHANINN_DEMO_DATA;
Note: Replace CHANINN_DEMO_DATA with a database name of your choice.
Download the Delaney data set as a delaney_solubility_with_descriptors.csv file.
In Snowsight, click on Data → Add Data and then click on the "Load data into a Table" button. Drag-and-drop the CSV file or click on Browse to select the CSV file.
Scroll down to the "Select or create a database and schema" section and select your database and schema; I'll use CHANINN_DEMO_DATA.PUBLIC, with SOLUBILITY as the table name. Finally, click on the Next button to proceed. After a few moments, your CSV data will be loaded into Snowflake.
The data table is now available at CHANINN_DEMO_DATA.PUBLIC.SOLUBILITY.
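Before or after uploading, it can help to confirm that the CSV header carries the columns used later in this guide. The following is a minimal sketch using only the standard library. The header line and the single data row are assumptions made up for illustration (they are not copied from the real file), and the upper-cased names, including LOGS, reflect the column names this guide expects to see in Snowflake.

```python
import csv
import io

# Assumed header plus one made-up row; swap in open("your_file.csv") for a real check.
sample = io.StringIO(
    "MolLogP,MolWt,NumRotatableBonds,AromaticProportion,logS\n"
    "2.6,167.9,0.0,0.0,-2.2\n"
)

reader = csv.DictReader(sample)

# Column names this guide refers to after loading (upper-cased).
expected = {"MOLLOGP", "MOLWT", "NUMROTATABLEBONDS", "AROMATICPROPORTION", "LOGS"}
found = {name.upper() for name in reader.fieldnames}

missing = expected - found   # columns the guide needs but the file lacks
rows = list(reader)          # remaining lines parsed as dictionaries

print("Missing columns:", missing)
print("Data rows read:", len(rows))
```

If `missing` is not empty, the later steps that reference those columns would need to be adjusted.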
First, we'll load the Delaney solubility dataset from Snowflake using a simple SQL query:
SELECT * FROM CHANINN_DEMO_DATA.PUBLIC.SOLUBILITY
This query retrieves all records from the solubility table. The result is stored in a variable called sql_data.
Next, we'll convert the SQL result to a Pandas DataFrame for easier manipulation and analysis:
df = sql_data.to_pandas()
Let's examine the DataFrame:
This displays the first rows of our dataset, showing columns like MOLWT (molecular weight), NUMROTATABLEBONDS (number of rotatable bonds), MOLLOGP (a measure of lipophilicity), and AROMATICPROPORTION (proportion of aromatic atoms).
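One way to take that first look is with `head()`, `shape`, and `columns`. This is a small sketch that runs outside Snowflake: the DataFrame below is a stand-in for the result of `sql_data.to_pandas()`, its numbers are made up for illustration, and the LOGS column name is an assumption.

```python
import pandas as pd

# Stand-in for the DataFrame returned by sql_data.to_pandas();
# the values are invented, not real Delaney records.
df = pd.DataFrame({
    "MOLLOGP": [2.6, 1.2, 4.3],
    "MOLWT": [167.9, 84.2, 358.1],
    "NUMROTATABLEBONDS": [0.0, 1.0, 5.0],
    "AROMATICPROPORTION": [0.0, 0.5, 0.4],
    "LOGS": [-2.2, -0.9, -5.1],
})

preview = df.head(2)       # first two rows only
print(preview)
print(df.shape)            # (number of rows, number of columns)
print(list(df.columns))    # column names as loaded
```

In a notebook cell, leaving `df` or `df.head()` as the last expression renders the same preview as a table.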
We can perform some basic statistical analysis to understand our data better:
df.describe()
This provides statistical summaries for each numerical column, including count, mean, standard deviation, minimum, quartiles, and maximum values.
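To make the summary concrete, here is a small sketch of `describe()` on a single column. The four molecular weights are made-up round numbers chosen so the statistics are easy to check by hand; they are not taken from the dataset.

```python
import pandas as pd

# Four invented molecular weights, chosen for easy arithmetic.
molwt = pd.Series([100.0, 200.0, 300.0, 400.0], name="MOLWT")

summary = molwt.describe()
print(summary)

# Individual statistics can be pulled out by label.
print("mean:", summary["mean"])     # arithmetic mean of the four values
print("median:", summary["50%"])    # the 50% quantile is the median
print("std:", summary["std"])       # sample standard deviation (divides by n - 1)
```

Note that pandas reports the sample standard deviation, so it will differ slightly from a population standard deviation computed on the same values.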
For our analysis, we'll classify molecules by molecular weight: those with a MOLWT below 300 are labeled small, and all others are labeled large.
This classification will help us compare properties between smaller and larger molecules:
import pandas as pd
df['MOLWT_CLASS'] = pd.Series(['small' if x < 300 else 'large' for x in df['MOLWT']])
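As an alternative to the list comprehension, the same labeling can be written in vectorized form with `numpy.where`. This is only a sketch of an equivalent approach, not the notebook's own code; the DataFrame and the `threshold` variable are illustrative, with values picked to show what happens right at the boundary.

```python
import numpy as np
import pandas as pd

# Invented molecular weights, including one exactly at the threshold.
df = pd.DataFrame({"MOLWT": [84.2, 299.9, 300.0, 358.1]})

threshold = 300  # same cut-off as in the text

# Strictly below the threshold is 'small'; the threshold itself counts as 'large'.
df["MOLWT_CLASS"] = np.where(df["MOLWT"] < threshold, "small", "large")

print(df)
print(df["MOLWT_CLASS"].value_counts())
```

Because the comparison is strict, a molecule weighing exactly 300 lands in the large class, which matches the behavior of the list comprehension.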
Now, let's calculate the average values of various properties for each molecular weight class:
df_class = df.groupby('MOLWT_CLASS').mean().reset_index()
df_class
This aggregation gives us a clear comparison between small and large molecules across all measured properties.
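The aggregation step can be tried on a tiny example to see exactly what `groupby(...).mean()` returns. The values below are made up for easy arithmetic, and `numeric_only=True` is an optional safeguard added here in case a table contains text columns.

```python
import pandas as pd

# Invented values: two molecules below 300 and two above.
df = pd.DataFrame({
    "MOLWT": [100.0, 200.0, 400.0, 600.0],
    "MOLLOGP": [1.0, 2.0, 4.0, 6.0],
})
df["MOLWT_CLASS"] = ["small" if x < 300 else "large" for x in df["MOLWT"]]

# One row per class; each cell is the mean of that column within the class.
df_class = df.groupby("MOLWT_CLASS").mean(numeric_only=True).reset_index()
print(df_class)

# groupby sorts the class labels alphabetically, so 'large' is row 0 and 'small' is row 1.
print(df_class["MOLWT_CLASS"].tolist())
```

The alphabetical ordering matters later: any code that reads row 0 as the large class and row 1 as the small class relies on it.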
The aggregated data reveals interesting patterns:
These insights will form the basis of our visualization dashboard.
Now we'll build an interactive dashboard using Streamlit to visualize our findings. Here's the complete code for our dashboard:
import pandas as pd
import streamlit as st

st.title('☘️ Solubility Dashboard')

# Data Filtering: the slider sets the molecular weight threshold (min 100, max 500, default 300)
mol_size = st.slider('Select a value', 100, 500, 300)
df['MOLWT_CLASS'] = pd.Series(['small' if x < mol_size else 'large' for x in df['MOLWT']])
df_class = df.groupby('MOLWT_CLASS').mean().reset_index()

st.divider()

# Calculate Metrics (groupby sorts labels alphabetically, so row 0 is 'large' and row 1 is 'small')
molwt_large = round(df_class['MOLWT'][0], 2)
molwt_small = round(df_class['MOLWT'][1], 2)
numrotatablebonds_large = round(df_class['NUMROTATABLEBONDS'][0], 2)
numrotatablebonds_small = round(df_class['NUMROTATABLEBONDS'][1], 2)
mollogp_large = round(df_class['MOLLOGP'][0], 2)
mollogp_small = round(df_class['MOLLOGP'][1], 2)
aromaticproportion_large = round(df_class['AROMATICPROPORTION'][0], 2)
aromaticproportion_small = round(df_class['AROMATICPROPORTION'][1], 2)

# Data metrics and visualizations
col = st.columns(2)
with col[0]:
    st.subheader('Molecular Weight')
    st.metric('Large', molwt_large)
    st.metric('Small', molwt_small)
    st.bar_chart(df_class, x='MOLWT_CLASS', y='MOLWT', color='MOLWT_CLASS')

    st.subheader('Number of Rotatable Bonds')
    st.metric('Large', numrotatablebonds_large)
    st.metric('Small', numrotatablebonds_small)
    st.bar_chart(df_class, x='MOLWT_CLASS', y='NUMROTATABLEBONDS', color='MOLWT_CLASS')
with col[1]:
    st.subheader('Molecular LogP')
    st.metric('Large', mollogp_large)
    st.metric('Small', mollogp_small)
    st.bar_chart(df_class, x='MOLWT_CLASS', y='MOLLOGP', color='MOLWT_CLASS')

    st.subheader('Aromatic Proportion')
    st.metric('Large', aromaticproportion_large)
    st.metric('Small', aromaticproportion_small)
    st.bar_chart(df_class, x='MOLWT_CLASS', y='AROMATICPROPORTION', color='MOLWT_CLASS')

# Expanders keep the raw and aggregated tables available without cluttering the page
with st.expander('Show Original DataFrame'):
    st.dataframe(df)
with st.expander('Show Aggregated DataFrame'):
    st.dataframe(df_class)
Here's the rendered dashboard:
Our dashboard includes several key components:
The dashboard makes it easy to observe how molecular properties change with size:
Users can adjust the molecular weight threshold to explore different classification schemes and see how that affects the property distributions.
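The effect of moving the threshold can also be explored outside Streamlit. The sketch below wraps the classify-and-aggregate step in a helper, here called `class_means` (a hypothetical name, not part of the notebook), and looks up results by class label rather than by row position, so it still behaves sensibly if a threshold leaves one class empty. The data values are made up.

```python
import pandas as pd

def class_means(df: pd.DataFrame, threshold: float) -> pd.DataFrame:
    """Mean of every numeric column per size class, indexed by class label."""
    labels = (
        df["MOLWT"].lt(threshold)                  # True where the molecule is below the threshold
        .map({True: "small", False: "large"})
        .rename("MOLWT_CLASS")
    )
    return df.groupby(labels).mean(numeric_only=True)

# Invented values for illustration only.
df = pd.DataFrame({
    "MOLWT": [100.0, 200.0, 400.0, 600.0],
    "MOLLOGP": [1.0, 2.0, 4.0, 6.0],
})

# Sweep a few thresholds, much like dragging the slider.
for threshold in (150, 300, 500):
    means = class_means(df, threshold)
    print(threshold, means["MOLWT"].to_dict())

# A threshold below every molecule leaves no 'small' class at all.
print("small" in class_means(df, 50).index)
```

Looking values up with `.loc["large", "MOLWT"]` instead of `[0]` avoids depending on row order, at the cost of having to handle a missing label explicitly.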
Before diving into the code, let's understand what we're working with. The Delaney dataset contains information about various molecules and their properties, including:
- Molecular weight (MOLWT)
- Number of rotatable bonds (NUMROTATABLEBONDS)
- Lipophilicity (MOLLOGP)
- Proportion of aromatic atoms (AROMATICPROPORTION)
- Aqueous solubility (logS)
These properties are crucial in determining whether a molecule might make a good drug candidate, or be "drug-like", according to Lipinski's Rule of 5.
Molecular solubility is a critical property in pharmaceutical research that directly impacts whether a drug can effectively reach its target in the human body.
Solubility refers to a molecule's ability to dissolve in a liquid - specifically in the human bloodstream. If a drug cannot dissolve properly, it cannot be transported to its target site, rendering it ineffective regardless of its other properties.
Poorly soluble drugs often require:
All of these factors can lead to increased side effects and reduced patient compliance.
In drug development, scientists often refer to Lipinski's Rule of 5 (RO5), a set of guidelines that help predict whether a molecule will be sufficiently soluble to make an effective oral medication. The rule states that an orally active drug generally has:
These parameters help researchers focus their efforts on molecules that are more likely to succeed in clinical trials and eventually become viable medications.
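As a rough illustration, the commonly cited Rule of 5 thresholds (molecular weight no greater than 500 Da, LogP no greater than 5, at most 5 hydrogen-bond donors, and at most 10 hydrogen-bond acceptors) can be expressed as a small check. This is a sketch only: the `Molecule` class, the `ro5_violations` function, and both example molecules are hypothetical, and the Delaney table used in this guide does not include donor or acceptor counts, so those would have to come from another source.

```python
from dataclasses import dataclass

@dataclass
class Molecule:
    mol_wt: float           # molecular weight in daltons
    log_p: float            # octanol-water partition coefficient (LogP)
    h_bond_donors: int      # number of hydrogen-bond donors
    h_bond_acceptors: int   # number of hydrogen-bond acceptors

def ro5_violations(m: Molecule) -> list:
    """Return a description of every Rule of 5 criterion the molecule breaks."""
    checks = {
        "molecular weight > 500 Da": m.mol_wt > 500,
        "LogP > 5": m.log_p > 5,
        "H-bond donors > 5": m.h_bond_donors > 5,
        "H-bond acceptors > 10": m.h_bond_acceptors > 10,
    }
    return [name for name, violated in checks.items() if violated]

# Two made-up molecules: one comfortably inside the limits, one well outside.
compact = Molecule(mol_wt=180.2, log_p=1.2, h_bond_donors=1, h_bond_acceptors=4)
bulky = Molecule(mol_wt=650.0, log_p=6.1, h_bond_donors=2, h_bond_acceptors=12)

print(ro5_violations(compact))
print(ro5_violations(bulky))
```

Returning the list of broken criteria, rather than a single pass/fail flag, makes it easy to apply whatever tolerance a project chooses, such as allowing one violation.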
By understanding and optimizing solubility early in the drug discovery process, pharmaceutical companies can develop more effective medicines that can be easily administered and work efficiently in the body.
Congratulations! You've successfully built a bioinformatics dashboard that analyzes molecular properties crucial for drug discovery. Using Snowflake, Python, and Streamlit, you've created an interactive tool that helps visualize and understand the relationships between molecular weight, rotatable bonds, LogP values, and aromatic proportions - all key factors in determining a molecule's solubility and potential as a drug candidate.
Happy coding and happy exploration of the fascinating intersection of data science and bioinformatics!