<center>
    <img src="./images/molssi_ai.png"
         alt="MolSSI-AI Logo"
         width=400 
         height=250
    />
</center>

## Authors

* **Bonnie Hall**, Grand View University, Des Moines, IA, USA
* **Ashley Ringer McDonald**, California Polytechnic State University, San Luis Obispo, CA, USA

## Learning Objectives
- Learn how molecules can be represented in a machine-readable format
- Generate a cheminformatics data set using the RDKit library starting from a list of SMILES codes
- Create and train a random forest regression model and compare the performance to a linear regression model
- Visualize a random forest tree
- Tune a random forest model for optimal performance
- Brainstorm how you could add a Python and/or machine learning module to a course you teach


### Importing libraries
Before we start, let's make sure we have the necessary libraries ready for use.  Once again, we will be installing a few packages that are not included in the default packages.  Remember, for the installation commands to work they must each be in their own code cell.

In [None]:
import pandas as pd                 # for data manipulation
import seaborn as sns               # for data visualization
import matplotlib.pyplot as plt     # for data visualization
import numpy as np                  # for numerical operations
import sweetviz as sv               # for fast exploratory data analysis (eda)

from rdkit import Chem              # for calculating cheminformatics properties of molecules
from rdkit.Chem import Descriptors  # for determining chemical descriptors
from rdkit.Chem import Crippen      # for calculating logP (cLogP)
from rdkit.Chem import PandasTools  # for displaying molecules
PandasTools.RenderImagesInAllDataFrames(images=True) # Ensures molecules are rendered in the notebook

from sklearn.preprocessing import StandardScaler            # for scaling the data
from sklearn.model_selection import train_test_split        # for splitting the data into training and testing sets
from sklearn.model_selection import cross_val_score, KFold  # for K-fold cross-validation
from sklearn.linear_model import LinearRegression           # for creating a linear regression model
from sklearn.ensemble import RandomForestRegressor          # for creating a random forest regression model
from sklearn.metrics import mean_squared_error, r2_score    # for evaluating the model
from sklearn.pipeline import make_pipeline                  # for building operational pipelines

<div class="alert alert-block alert-info">
    <b>Note</b>
    We have added comments to clarify the purpose of each imported library.
</div>

## Problem Statement

In this tutorial, we will be **creating a cheminformatics data set** from a machine-readable list of molecules. The goal is to use the provided molecules to calculate various chemical properties of each molecule and then predict the solubility of a molecule based on its chemical structure using regression models. We will re-create the pre-processed version of [Delaney's solubility dataset](https://doi.org/10.1021/ci034243x) that we used earlier. We will then use this re-created data set for **building a Random Forest model** and compare its performance with a linear regression model.

<br>
<br>





## SMILES Representation of Molecules

SMILES stands for "Simplified Molecular-Input Line-Entry System" and is a way to represent molecules as a string of characters.

Consider the molecule ethanol. The image below shows a representation that we are used to seeing in chemistry:

![ethanol](images/ethanol.png)

However, the SMILES representation of this molecule would be "CCO".

You can read more about SMILES at [this tutorial](https://archive.epa.gov/med/med_archive_03/web/html/smiles.html), but rules for atoms and bonds are also repeated below.

### Atoms
SMILES supports all elements in the periodic table. An atom is represented using its respective atomic symbol. Upper case letters refer to non-aromatic atoms; lower case letters refer to aromatic atoms. If the atomic symbol has more than one letter the second letter must be lower case.
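
The upper/lower case rule can be illustrated with a few lines of plain Python. This is only a toy sketch, assuming simple SMILES strings that use the common organic elements and no bracket atoms such as `[Na+]`; the name `classify_atoms` is made up for this example, and real work should use a full parser such as RDKit.

```python
import re

# Toy tokenizer for common organic atom symbols only.
# Assumption: no bracket atoms such as [Na+] or [nH]; this is NOT a full SMILES parser.
ATOM_PATTERN = re.compile(r"Cl|Br|[BCNOPSFI]|[bcnops]")

def classify_atoms(smiles):
    """Return (symbol, 'aromatic' or 'non-aromatic') for each atom symbol found."""
    atoms = []
    for symbol in ATOM_PATTERN.findall(smiles):
        # Lower case symbols are aromatic atoms; symbols starting upper case are not
        kind = "aromatic" if symbol.islower() else "non-aromatic"
        atoms.append((symbol, kind))
    return atoms

print(classify_atoms("CCO"))        # ethanol
print(classify_atoms("n1ccccc1"))   # pyridine
print(classify_atoms("ClCCBr"))     # two-letter symbols: second letter is lower case
```

Digits, parentheses, and bond symbols are skipped by this sketch; they are covered by the rules in the sections that follow.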

### Bonds
```
-	Single bond
=	Double bond
#	Triple bond
:	Aromatic bond
.	Disconnected structures
```
Single bonds are the default and therefore need not be entered. For example, 'CC' would mean that there is a non-aromatic carbon attached to another non-aromatic carbon by a single bond, and the computer would identify the structure as the chemical ethane. It is also assumed that the bond between two lower case atom symbols is aromatic. A blank terminates the SMILES string.

### Branches

A branch from a chain is specified by placing the SMILES symbol(s) for the branch in parentheses. Some examples:

```

CC(O)C	2-Propanol
CC(=O)C	2-Propanone
```

### Rings

A ring is specified by placing a number directly after the SMILES symbol where the ring closure occurs. This number acts as a marker, indicating that the atoms with the same number are connected, thus forming a ring. For instance:

```
C1CCCC1	Cyclopentane
n1ccccc1	Pyridine
```

### SMILES Examples

<div style="text-align:center;">
  <img src="images/smiles_example_1.png" style="display: block; margin: 0 auto; max-height:300px;">
</div>

<div style="text-align:center;">
  <img src="images/smiles_example_2.png" style="display: block; margin: 0 auto; max-height:300px;">
</div>

### Using Online Resources
Most of the time, you will not need to write a SMILES string by hand. You will be able to look up a molecule's SMILES string from a web database like [PubChem](https://pubchem.ncbi.nlm.nih.gov/).

You can also use tools like this [molecule sketcher from the Protein Data Bank](https://www.rcsb.org/chemical-sketch)
to draw molecules and get their SMILES strings.

### Other Molecular File Formats

Molecules can also be represented using a number of different file formats. As you work more in chemistry, you may see a number of these. Sometimes you will have to pick a file format based on the software you are using or the molecular information you want to save. 

| File Format | Description | Features | Common Uses |
|-------------|-------------|----------|-------------|
| SMILES | Simplified Molecular Input Line Entry System | Line notation for representing molecular structures | Databases |
| InChI | International Chemical Identifier | Textual identifier for chemical substances | Databases |
| MOL/SDF | MDL MOLfile and Structure-Data File | Contains 2D/3D coordinates, atoms, bonds | Structure visualization, cheminformatics |
| PDB | Protein Data Bank format | Often used for 3D structures of proteins and nucleic acids, but can also be used for small molecules. Often does not contain molecule information, and cannot store partial charges. | Structural biology, bioinformatics |
| XYZ | Cartesian coordinates | Simple text format with atom types and 3D coordinates | Computational chemistry, molecular dynamics |
| CIF | Crystallographic Information File | Text file format for representing crystal structure data | Crystallography |
| PQR | Extended PDB format with partial charges and radii | Includes atomic coordinates, partial charges, and radii | Electrostatics calculations |
| PDBQT | PDB format with torsion angles and charges used in AutoDock | Includes atomic coordinates, partial charges, torsion angles | Molecular docking |
| MOL2 | Tripos Mol2 format | Contains atomic coordinates, bonds, molecule types, substructures, and partial charges | Molecular modeling, cheminformatics, computational chemistry |


# Introduction to RDKit Molecules



Some Python libraries are designed specifically for working with chemical data. One commonly used Python library for cheminformatics is [RDKit](https://en.wikipedia.org/wiki/RDKit). RDKit is an open-source cheminformatics library that is written primarily in C++ and has been under development since the year 2000. We will be using the Python interface to RDKit, though there are interfaces in other languages.

RDKit provides a molecule object that allows you to manipulate chemical structures. It has capabilities for reading and writing molecular file formats, calculating molecular properties, and performing substructure searches. In addition, it offers a wide range of cheminformatics algorithms such as molecular fingerprint generation, similarity metrics calculation, and molecular descriptor computation. This notebook introduces RDKit basics.

<div class="alert alert-block alert-success"> 
<strong>Python Skills: Python Objects</strong>

Most of this functionality is achieved through the RDKit `mol` object. In Python, we use the word "object" to refer to a variable type with associated data and methods. 
One example of an object we have seen in notebooks is a list - we could also call it a "list object". An object has `attributes` (data) and `methods`. 
You access information about objects with the syntax
```python
object.data
```
where data is the attribute name.

You access object methods with the syntax
```python
object.method(arguments)
```
</div>    

In this lesson, we will create and manipulate RDKit `mol` objects. RDKit `mol` objects represent molecules and have
attributes (data) and methods (actions) associated with molecules.

We are going to use a part of RDKit called `Chem`. To use `Chem` we have to import it, which we did above in the importing libraries section. 

### The original data set

Let's load the list of molecules using the ``pandas`` library
and take a look at a few samples in the dataset.

In [None]:
# Path to the list of molecules data file
data_path = "./data/solubility-molecule-list.csv"

# Read the data into a DataFrame
df = pd.read_csv(data_path)

# Display the first few rows of the DataFrame
df.head()

The dataset contains the following columns:
- **Compound ID**: compound name in a range of formats
- **smiles**: SMILES string representation of each molecule
- **logS**: the base-10 logarithm of the aqueous solubility of the molecule in mol/L, measured at 25 °C

### Adding molecule structures to the data set using the SMILES strings

Above we learned about molecular representations using SMILES strings. Now we will use SMILES strings to create molecule objects in RDKit. 

We can create a representation of methane using RDKit by using the `MolFromSmiles` function in `rdkit.Chem`.

In [None]:
# visualizing methane as an example
methane = Chem.MolFromSmiles("C")
methane

In [None]:
# visualizing propane as another example
propane = Chem.MolFromSmiles("CCC")
propane
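
Once a molecule object exists, its methods can be used to ask questions about it. The sketch below assumes RDKit is installed and uses two common methods, `GetNumAtoms` and `GetAtoms`. It also shows that `MolFromSmiles` returns `None` when it cannot parse a string; the invalid string used here (a ring that is opened but never closed) is made up for illustration.

```python
from rdkit import Chem

# Build molecule objects from SMILES strings
methane = Chem.MolFromSmiles("C")
propane = Chem.MolFromSmiles("CCC")

# GetNumAtoms() counts heavy (non-hydrogen) atoms by default
print(methane.GetNumAtoms(), propane.GetNumAtoms())

# Loop over the atoms and read their element symbols
symbols = [atom.GetSymbol() for atom in propane.GetAtoms()]
print(symbols)

# A SMILES string that cannot be parsed gives None instead of a molecule,
# so it is worth checking before using the result
bad = Chem.MolFromSmiles("C1CC")
print(bad is None)
```

Checking for `None` becomes important when converting a whole column of SMILES strings, since a single bad entry would otherwise cause errors in later calculations.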

### The `.apply` function

In the examples above, we made a single molecule object from a single SMILES string. However, when we are working with a lot of data, we might have a whole column of SMILES strings that we need to use to make molecule objects. Further, we would like to save those molecule objects as a new column in our pandas DataFrame. This is a common situation: you often want to calculate a new column of data using an existing column in your DataFrame. The way to accomplish this is to use the `.apply` method. You select an existing column of your pandas DataFrame, add `.apply()`, and inside the parentheses give the name of a Python function that calculates the value you want. In the code below, we will take the column of SMILES strings, apply the `Chem.MolFromSmiles` function, and save the results as a new column of the DataFrame.

In [None]:
# creating (and displaying) a molecule object for every SMILES string in our data set
df['mol'] = df['smiles'].apply(Chem.MolFromSmiles)
df.head()
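
The same `.apply` pattern works with any function that takes one value and returns one value. As a minimal sketch that does not need RDKit, the example below uses a made-up three-row table and a hypothetical helper, `count_characters`, rather than the solubility data set.

```python
import pandas as pd

# Hypothetical three-row table, not the solubility data set
toy = pd.DataFrame({"smiles": ["C", "CCO", "c1ccccc1"]})

def count_characters(smiles):
    """Return the number of characters in a SMILES string."""
    return len(smiles)

# Pass the function itself (no parentheses); pandas calls it once per row
toy["n_chars"] = toy["smiles"].apply(count_characters)
print(toy)
```

Note that the function name is passed without parentheses, exactly as `Chem.MolFromSmiles` is in the cell above.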

RDKit molecule objects have a number of methods we can use to get more information about the molecule. In the next few cells, we'll look at some methods that can tell us some things about the molecules we've created.  

We can use the `.apply` function that we just discussed to apply these methods to our molecule objects and save the results in a new column.

### Calculating Molecular Weights

In [None]:
df['mol_weight'] = df['mol'].apply(Descriptors.MolWt)
df.head()

### Calculating number of rotatable bonds

In [None]:
df['rot_bonds'] = df['mol'].apply(Chem.rdMolDescriptors.CalcNumRotatableBonds)
df.head()

### Calculating logP

This uses the Wildman-Crippen LogP calculation, an atom-based scheme based on the values in the paper S. A. Wildman and G. M. Crippen, JCICS 39, 868-873 (1999).

In [None]:
df['clogP'] = df['mol'].apply(Crippen.MolLogP)
df.head()

### Calculating aromatic ratio

The aromatic proportion was calculated in the original paper by dividing the number of aromatic atoms by the total number of heavy (non-hydrogen) atoms. Although there is not a function in RDKit that calculates this directly, we can calculate it by creating our own function that uses two existing RDKit methods to perform the calculation. Then we can use our new function and ``.apply`` to make a new column in our data set.

In [None]:
# defining the function that will calculate the aromatic proportion

def aromatic_calc(mol):
    """Return the fraction of the atoms in a molecule object that are aromatic."""
    prop_aromatic = len(mol.GetAromaticAtoms())/mol.GetNumAtoms()
    return prop_aromatic

In [None]:
df['aromatic_ratio'] = df['mol'].apply(aromatic_calc)
df.head()
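
A quick way to build trust in a hand-written descriptor is to try it on molecules where the answer is known. The sketch below assumes RDKit is installed and uses a hypothetical helper, `aromatic_fraction`, that counts aromatic atoms with `GetIsAromatic`; this is an alternative way of counting, not the function defined above. Ethanol has no aromatic atoms, every heavy atom in benzene is aromatic, and toluene has six aromatic atoms out of seven heavy atoms.

```python
from rdkit import Chem

def aromatic_fraction(mol):
    """Fraction of heavy atoms flagged as aromatic (an alternative way to count)."""
    n_aromatic = sum(atom.GetIsAromatic() for atom in mol.GetAtoms())
    return n_aromatic / mol.GetNumAtoms()

# Small molecules with known aromatic content
examples = {"ethanol": "CCO", "benzene": "c1ccccc1", "toluene": "Cc1ccccc1"}

fractions = {}
for name, smiles in examples.items():
    mol = Chem.MolFromSmiles(smiles)
    fractions[name] = aromatic_fraction(mol)
    print(name, round(fractions[name], 3))
```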

### Other calculations RDKit can perform

There are many other properties of molecules that RDKit can calculate. In general, the methods in RDKit are organized into modules based on the type of property they calculate. For instance, in some of the examples above, we used methods from the
[`Descriptors` module](https://www.rdkit.org/docs/source/rdkit.Chem.Descriptors.html) and the [`rdMolDescriptors` module](https://www.rdkit.org/docs/source/rdkit.Chem.rdMolDescriptors.html). You can click on either of those links to see the full list of the different properties you can access.

<div class="alert alert-block alert-success"> 
    <b>Exercise</b>
    Look through the documentation and find some additional molecular properties you want
    to add to your dataframe.
</div>

### EDA of the re-created data set

We will use SweetViz again to verify the data we generated has no missing values and has the expected value distributions. The molecule images, however, will cause an error. So we will create a version of the DataFrame called df_nomol with the 'mol' column dropped.

In [None]:
# dropping the mol columns

df_nomol = df.drop(columns = ['mol'])
df_nomol.head()

In [None]:
# Analyse Dataset
report = sv.analyze(df_nomol)

# Display the report in the notebook
report.show_notebook()

<div class="alert alert-block alert-success"> 
<b>What to check in the data set:</b>

1. There should be no missing values.<br>
2. The associations between variables should be similar to what you saw previously.<br>
3. The numerical values of skewness should be similar to what you saw previously, using the skewness value ranges below:
- The skewness value of zero indicates a perfect symmetrical distribution,
- a skewness between -0.5 and 0.5 indicates an approximately symmetric distribution,
- a skewness between -1 and -0.5 (or 0.5 and 1) indicates a moderately skewed distribution,
- a skewness between -1.5 and -1 (or 1 and 1.5) indicates a highly skewed distribution, and
- a skewness less than -1.5 (or greater than 1.5) indicates an extremely skewed distribution.
</div>
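
The skewness ranges above can also be checked in code. The following is a minimal sketch, assuming the skewness is computed with the pandas `Series.skew` method; the helper name `describe_skewness` and the example values are made up for illustration, and values that fall exactly on a boundary are assigned to the less skewed category.

```python
import pandas as pd

def describe_skewness(skew):
    """Map a skewness value onto the ranges listed above."""
    size = abs(skew)
    if size <= 0.5:
        return "approximately symmetric"
    if size <= 1.0:
        return "moderately skewed"
    if size <= 1.5:
        return "highly skewed"
    return "extremely skewed"

# Hypothetical values with one large value creating a long right tail
values = pd.Series([1, 1, 2, 2, 3, 3, 4, 20])
skew = values.skew()
print(f"skewness = {skew:.2f} -> {describe_skewness(skew)}")
```

For a whole table, `df_nomol.skew(numeric_only=True)` gives the skewness of every numerical column at once.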

# Building a multifeature linear regression

## Splitting the data

We will again split the data into training and testing sets before doing any other data cleaning or engineering, to prevent data leakage between the training and testing data. We will use the ``train_test_split`` function from the ``sklearn.model_selection`` module to split the data. The training set will be used to train the model, while the testing set will be used to evaluate the model's performance. We will also set ``random_state`` again, so the split is the same when comparing different models.

We will drop the target vector 'logS', and also the 'smiles', 'mol', and 'Compound ID' columns, as we do not want the model to use those to predict the solubility.

In [None]:
# Create the feature matrix (x) and target vector (y)
x = df.drop(columns=['logS', 'smiles', 'mol', 'Compound ID'])
y = df['logS']

# Split the data into training and testing sets (80% training, 20% testing)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=123, shuffle=True)

# Display the shapes of the training and testing sets
x_train.shape, x_test.shape, y_train.shape, y_test.shape

### Feature engineering

After splitting our data, we need to scale our training (and test) features. Scaling is a crucial step in the data preprocessing pipeline because many machine learning models are sensitive to the scale of the input features, and scaling puts all features on the same scale. We will use the ``StandardScaler`` from the ``sklearn.preprocessing`` module to scale our features. Note that a Random Forest model does not require data scaling, as it is a tree-based model and so different scales will not affect model performance. We will still scale the data, however, so we can also build a linear regression model and compare the performance of the two models.

In [None]:
# Create the standard scaler object
scaler = StandardScaler()

# Fit and transform the training feature vector x_train
x_train_scaled = scaler.fit_transform(x_train)

# Transform the test feature vector x_test
x_test_scaled = scaler.transform(x_test)

# Make sure the training data is scaled correctly
print(f" Training feature mean: {x_train_scaled.mean():.5f}")
print(f" Training feature standard deviation: {x_train_scaled.std():.5f}\n")

# Print the scaler statistics on the test data
print(f" Testing feature mean: {x_test_scaled.mean():.5f}")
print(f" Testing feature standard deviation: {x_test_scaled.std():.5f}")

<div class="alert alert-block alert-info">
    <b>Reminder:</b>
    It is extremely important to split the data first and then fit the scaler on the training data only. Fitting the scaler on the entire data and then splitting it can cause a <b>data leakage</b> problem which violates our intention to treat the test data as a good representative sample of the real-world data.  
</div>
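
To see what "fit on the training data only" means in practice, the sketch below reproduces the scaler's calculation by hand on a small made-up feature column (the numbers are hypothetical, not taken from the solubility data). The test values are scaled with the training mean and standard deviation, so the scaled test mean is generally not zero.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical feature values (one column), already split into train and test
x_train = np.array([[100.0], [150.0], [200.0], [250.0]])
x_test = np.array([[120.0], [300.0]])

scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train)   # learns mean and std from the training data only
x_test_scaled = scaler.transform(x_test)         # reuses the training mean and std

# The same calculation by hand: z = (x - training mean) / training std
train_mean = x_train.mean(axis=0)
train_std = x_train.std(axis=0)
manual_test_scaled = (x_test - train_mean) / train_std

print(np.allclose(x_test_scaled, manual_test_scaled))
print(x_train_scaled.mean(), x_test_scaled.mean())
```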

In [None]:
# Create a linear regression model
multi_feature_model = LinearRegression()

# Fit the model to the training data
multi_feature_model.fit(x_train_scaled, y_train)

# Make predictions on the test data
y_pred_linear_multi = multi_feature_model.predict(x_test_scaled)

# Calculate the performance metrics and store them in a DataFrame
results = pd.DataFrame({
    "MSE": mean_squared_error(y_test, y_pred_linear_multi),     # the mean squared error
    "R2": r2_score(y_test, y_pred_linear_multi)                 # the coefficient of determination
}, index=["Multi-Linear-Regression"])
                            

# Set the formatting style
results.style.format(
    {
        "MSE": "{:.3f}",
        "R2": "{:.2f}"
    }
)
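
Both metrics are short formulas, and computing them by hand can make the numbers in the table easier to interpret. The sketch below uses made-up measured and predicted values (not results from this notebook) and checks the hand calculation against the scikit-learn functions.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical measured and predicted logS values
y_true = np.array([-1.0, -2.5, -3.0, -4.2])
y_pred = np.array([-1.2, -2.0, -3.3, -4.0])

# MSE: the average squared difference between measured and predicted values
mse_manual = np.mean((y_true - y_pred) ** 2)

# R2: 1 - (residual sum of squares) / (total sum of squares about the mean)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(np.isclose(mse_manual, mean_squared_error(y_true, y_pred)))
print(np.isclose(r2_manual, r2_score(y_true, y_pred)))
```

A lower MSE is better, while an R2 closer to 1 is better; an R2 of 0 corresponds to always predicting the mean of the measured values.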

<div class="alert alert-block alert-success"> 
    <b>Exercise</b>
    Is the model performance using the data set you constructed similar to the performance using the provided pre-processed data?
</div>

## Building and Training a Random Forest Regression Model

### What is a random forest model?
The next step after the data preparation is to build and train our random forest regression model. A random forest is an ensemble model built from a "forest" of many decision trees, with the variables used in each tree chosen at random. Generally, each individual tree is not that good at making a prediction, but collectively the trees are quite good at making predictions. Note that each tree predicts an individual value and the predictions are then combined to determine the final predicted value: a vote in the case of classification, or an average of the predicted values in the case of regression.

<div style="text-align:center;">
  <img src="images/Random_forest_explain.png" style="display: block; margin: 0 auto; max-height:400px;">
</div>
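
The averaging step can be verified directly in scikit-learn, since a fitted `RandomForestRegressor` keeps its individual trees in the `estimators_` attribute. The sketch below uses synthetic data from `make_regression` (an assumption made to keep the example self-contained, not the solubility data) and compares the average of the per-tree predictions with the forest's own prediction.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data, used only for illustration
X, y = make_regression(n_samples=200, n_features=4, noise=10.0, random_state=0)

forest = RandomForestRegressor(n_estimators=25, random_state=0)
forest.fit(X, y)

# Each fitted tree makes its own prediction for the first five samples
per_tree = np.array([tree.predict(X[:5]) for tree in forest.estimators_])

# Averaging over the trees reproduces the forest's prediction
manual_average = per_tree.mean(axis=0)
print(np.allclose(manual_average, forest.predict(X[:5])))
```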

### Hyperparameters
Random forest models (and many other models) have hyperparameters that can be 'tuned' to optimize model performance. The number of trees (``n_estimators``) can be specified, with a default value of 100 trees. Each tree also has a maximum depth (``max_depth``), which limits the number of successive splits between the top of the tree and any leaf; the default is no limit on the depth (``None``). Too few trees, or trees that are too shallow, result in a model that predicts poorly due to underfitting the data: the model is too simple to predict well using either the train or the test data. Trees that are too deep can result in a model that predicts poorly due to overfitting the data: the model is so complex that it predicts the original data extremely well but cannot predict new data. Adding more trees mainly increases training time, with smaller and smaller gains in accuracy.
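
In scikit-learn, the number of trees is set with the `n_estimators` argument and the depth limit with `max_depth`. As a small sketch on synthetic data (again an assumption for illustration), the defaults can be read with `get_params`, and the effect of `max_depth` can be confirmed by asking each fitted tree for its depth.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Read the default hyperparameter values without training anything
defaults = RandomForestRegressor().get_params()
print(defaults["n_estimators"], defaults["max_depth"])

# Synthetic data, used only to have something to fit
X, y = make_regression(n_samples=200, n_features=4, noise=10.0, random_state=0)

# A small forest of shallow trees
shallow = RandomForestRegressor(n_estimators=20, max_depth=3, random_state=0).fit(X, y)
deepest = max(tree.get_depth() for tree in shallow.estimators_)

print(len(shallow.estimators_))   # number of trees that were built
print(deepest)                    # no tree is deeper than max_depth
```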

### Building and training a default random forest regression model
We will first build and train a random forest model that uses the default parameters. We will use the ``RandomForestRegressor`` class from the ``sklearn.ensemble`` module to create the model. 

In [None]:
# Create a random forest regression model
default_rf_model = RandomForestRegressor()

# Fit the model to the training data
default_rf_model.fit(x_train_scaled, y_train)

# Make predictions on the test data
y_pred_default_rf = default_rf_model.predict(x_test_scaled)

# Calculate the performance metrics
default_rf_model_results = pd.DataFrame({
    "MSE": mean_squared_error(y_test, y_pred_default_rf),     # the mean squared error
    "R2": r2_score(y_test, y_pred_default_rf)                 # the coefficient of determination
}, index=["Default_RF_Regression"])

# Store the results into results DataFrame
results = pd.concat([results, default_rf_model_results])
results

<div class="alert alert-block alert-success"> 
    <b>Exercise</b>
    Which model is best at predicting the value of logS?
</div>

In [None]:
# Create a plot object
fig, ax = plt.subplots(figsize=(5, 5))

# Plot the test data against itself as a reference (a perfect prediction falls on these points)
ax.scatter(y_test, y_test, color='blue', label='Test Data')

# Plot the random forest regression model predictions
ax.scatter(y_test, y_pred_default_rf, color='red', label='Random Forest Regression')

# Plot the multifeature linear regression model predictions
ax.scatter(y_test, y_pred_linear_multi, color='green', label="Multifeature Linear Regression")

# Create the legends
fig.legend(facecolor='white')

# Show the plot
plt.show()

### Improving model performance using cross-validation

We will use cross-validation to train and evaluate both the multi-feature linear regression model and the random forest regression model. We will again use the ``cross_val_score`` function from the ``sklearn.model_selection`` module to perform a 5-fold cross-validation experiment.

In [None]:
# Create a pipeline object
pipeline_multi_feature_linear = make_pipeline(scaler, multi_feature_model)
pipeline_default_rf = make_pipeline(scaler, default_rf_model)

# Create a KFold object
kf = KFold(n_splits=5, shuffle=True, random_state=123)

# Perform cross-validation
cv_results_multi_feature_linear = cross_val_score(pipeline_multi_feature_linear, x, y, cv=kf, scoring="neg_mean_squared_error")
cv_results_default_rf = cross_val_score(pipeline_default_rf, x, y, cv=kf, scoring="neg_mean_squared_error")

# Calculate the mean and standard deviation of the cross-validation results
print(f"CV Results Multifeature Linear Regression Mean MSE: {-cv_results_multi_feature_linear.mean():.5f} +/- {cv_results_multi_feature_linear.std():.5f} ")
print(f"CV Results Default Random Forest Mean MSE: {-cv_results_default_rf.mean():.5f} +/- {cv_results_default_rf.std():.5f}")
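
It can help to see what `cross_val_score` does behind the scenes. The sketch below writes out the 5-fold loop by hand on synthetic data (an assumption for illustration): for each fold, a fresh pipeline is fitted on four fifths of the data and scored on the remaining fifth, so the scaler never sees the held-out fold.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, used only for illustration
X, y = make_regression(n_samples=100, n_features=4, noise=10.0, random_state=0)
kf = KFold(n_splits=5, shuffle=True, random_state=123)

# Manual cross-validation loop
manual_mse = []
for train_idx, val_idx in kf.split(X):
    # A new scaler and model are fitted on the training part of each fold
    model = make_pipeline(StandardScaler(), LinearRegression())
    model.fit(X[train_idx], y[train_idx])
    y_val_pred = model.predict(X[val_idx])
    manual_mse.append(mean_squared_error(y[val_idx], y_val_pred))

# The same experiment using cross_val_score (scores are negative MSE values)
scores = cross_val_score(
    make_pipeline(StandardScaler(), LinearRegression()),
    X, y, cv=kf, scoring="neg_mean_squared_error",
)

print(np.allclose(manual_mse, -scores))
```

The spread of the five fold scores gives a sense of how much the performance estimate depends on which samples end up in the test portion.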

<div class="alert alert-block alert-success"> 
    <b>Exercise</b>
    Did using cross-validation improve the performance of the models?
</div>

### Tuning a Random Forest Regression Model

Our first Random Forest model used the default hyperparameter settings. Tuning (optimizing) hyperparameter settings can improve model performance. We will use a loop to try a range of hyperparameter settings with our model. Although we don't have time to use them today, tools such as ``GridSearchCV`` and ``Optuna`` can be used to help identify the best set of hyperparameters.

In [None]:

# Defining the values to test for the n_estimators and max_depth hyperparameters
trees = [50, 100, 300, 500, 1000]
depths = [1, 3, 5, 7, 9, 11]

# Defining lists to store results for each set of hyperparameters
tree_count = []
tree_depth = []
tuned_rf_MSE = []
tuned_rf_R2 = []

for tree in trees:
    
    for depth in depths:
        
        # Create a random forest regression model
        tuned_rf_model = RandomForestRegressor(n_estimators = tree, max_depth = depth)

        # Fit the model to the training data
        tuned_rf_model.fit(x_train_scaled, y_train)

        # Make predictions on the test data
        y_pred_tuned_rf = tuned_rf_model.predict(x_test_scaled)

        # Storing the results in lists
        tree_count.append(tree)
        tree_depth.append(depth)
        tuned_rf_MSE.append(mean_squared_error(y_test, y_pred_tuned_rf))
        tuned_rf_R2.append(r2_score(y_test, y_pred_tuned_rf))
        
        
# Create a DataFrame from the lists of results
results_df = pd.DataFrame(
    {'Number of trees': tree_count,
     'Max depth': tree_depth,
     'MSE': tuned_rf_MSE,
     'R2': tuned_rf_R2
    })

# display the results DataFrame
results_df
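
For reference, here is a minimal sketch of how the same kind of search might look with `GridSearchCV`. It uses synthetic data and a deliberately small grid so that it runs quickly; the grid values are placeholders rather than recommendations. Unlike the loop above, `GridSearchCV` scores each combination by cross-validation on the training data, so the test set is not used to choose the hyperparameters.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic training data, used only for illustration
X_train, y_train = make_regression(n_samples=150, n_features=4, noise=10.0, random_state=0)

# A small, hypothetical grid of hyperparameter values to try
param_grid = {"n_estimators": [25, 50], "max_depth": [3, 7]}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=3,                               # 3-fold cross-validation for each combination
    scoring="neg_mean_squared_error",   # higher (less negative) is better
)
search.fit(X_train, y_train)

print(search.best_params_)    # the combination with the best mean CV score
print(-search.best_score_)    # its mean cross-validated MSE
```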

<div class="alert alert-block alert-success"> 
    <b>Exercise</b>
    Which set of hyperparameters gave the best model performance?
</div>

<div class="alert alert-block alert-success"> 
    <b>Final Exercise</b>
    Where could you use Python coding or machine learning in your courses?
</div>