{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<center>\n",
" <img src=\"./images/molssi_ai.png\"\n",
" alt=\"MolSSI-AI Logo\"\n",
" width=400 \n",
" height=250\n",
" />\n",
"</center>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Authors\n",
"\n",
"* **Bonnie Hall**, Grand View University, Des Moines, IA, USA\n",
"* **Ashley Ringer McDonald**, California Polytechnic State University, San Luis Obispo, CA, USA"
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"## Learning Objectives\n",
"- Learn how molecules can be represented in a machine-readable format\n",
"- Generate a cheminformatics data set using the RDKit library starting from a list of SMILES codes\n",
"- Create and train a random forest regression model and compare the performance to a linear regression model\n",
"- Visualize a random forest tree\n",
"- Tune a random forest model for optimal performance\n",
"- Brainstorm how you could add a Python and/or machine learning module to a course you teach\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Importing libraries\n",
"Before we start, let's make sure we have the necessary libraries ready for use. Once again, we will be installing a few packages that are not included in the default packages. Remember, for the installation commands to work they must each be in their own code cell."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2024-06-26T16:51:46.314577Z",
"start_time": "2024-06-26T16:51:46.301095Z"
}
},
"outputs": [],
"source": [
"import pandas as pd # for data manipulation\n",
"import seaborn as sns # for data visualization\n",
"import matplotlib.pyplot as plt # for data visualization\n",
"import numpy as np # for numerical operations\n",
"import sweetviz as sv # for fast exploratory data analysis (eda)\n",
"\n",
"from rdkit import Chem # for calculating cheminformatics properties of molecules\n",
"from rdkit.Chem import Descriptors # for determining chemical descriptors\n",
"from rdkit.Chem import Crippen # for calculating logP (cLogP)\n",
"from rdkit.Chem import PandasTools # for displaying molecules\n",
"PandasTools.RenderImagesInAllDataFrames(images=True) # Ensures molecules are rendered in the notebook\n",
"\n",
"from sklearn.preprocessing import StandardScaler # for scaling the data\n",
"from sklearn.model_selection import train_test_split # for splitting the data into training and testing sets\n",
"from sklearn.model_selection import cross_val_score, KFold # for K-fold cross-validation\n",
"from sklearn.linear_model import LinearRegression # for creating a linear regression model\n",
"from sklearn.ensemble import RandomForestRegressor # for creating a random forest regression model\n",
"from sklearn.metrics import mean_squared_error, r2_score # for evaluating the model\n",
"from sklearn.pipeline import make_pipeline # for building operational pipelines"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-block alert-info\">\n",
" <b>Note</b>\n",
" We have added comments to clarify the purpose of each imported library.\n",
"</div>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Problem Statement"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this tutorial, we will be **creating a cheminformatics data set** from a machine-readable list of molecules. The goal is to use the provided molecules to calculate various chemical properties of each molecule and then predict the solubility of a molecule based on its chemical structure using regression models. We will re-create the pre-processed version of [Delaney's solubility dataset](https://doi.org/10.1021/ci034243x) that we used earlier. We will then use this re-created data set for **building a Random Forest model** and compare its performance with a linear regression model. \n",
"\n",
"<br>\n",
"<br>\n",
"\n",
"\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## SMILES Representation of Molecules"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"SMILES stands for \"Simplified Molecular-Input Line-Entry System\" and is a way to represent molecules as a string of characters.\n",
"\n",
"Consider the molecule ethanol. The image below shows a representation that we are used to seeing in chemistry:\n",
"\n",
"\n",
"\n",
"However, the SMILES representation of this molecule would be \"CCO\".\n",
"\n",
"You can read more about SMILES at [this tutorial](https://archive.epa.gov/med/med_archive_03/web/html/smiles.html), but rules for atoms and bonds are also repeated below.\n",
"\n",
"### Atoms\n",
"SMILES supports all elements in the periodic table. An atom is represented using its respective atomic symbol. Upper case letters refer to non-aromatic atoms; lower case letters refer to aromatic atoms. If the atomic symbol has more than one letter the second letter must be lower case.\n",
"\n",
"### Bonds\n",
"```\n",
"-\tSingle bond\n",
"=\tDouble bond\n",
"#\tTriple bond\n",
"*\tAromatic bond\n",
".\tDisconnected structures\n",
"```\n",
"Single bonds are the default and therefore need not be entered. For example, 'CC' would mean that there is a non-aromatic carbon attached to another non-aromatic carbon by a single bond, and the computer would identify the structure as the chemical ethane. It is also assumed that the bond between two lower case atom symbols is aromatic. A blank terminates the SMILES string.\n",
"\n",
"### Branches\n",
"\n",
"A branch from a chain is specified by placing the SMILES symbol(s) for the branch between parenthesis. Some examples:\n",
"\n",
"```\n",
"\n",
"CC(O)C\t2-Propanol\n",
"CC(=O)C\t2-Propanone\n",
"```\n",
"\n",
"### Rings\n",
"\n",
"A ring is specified by placing a number directly after the SMILES symbol where the ring closure occurs. This number acts as a marker, indicating that the atoms with the same number are connected, thus forming a ring. For instance:\n",
"\n",
"```\n",
"C1CCCC1 cyclopentane\n",
"n1ccccc1\tPyridine\n",
"```\n",
"\n",
"### SMILES Examples\n",
"\n",
"<div style=\"text-align:center;\">\n",
" <img src=\"images/smiles_example_1.png\" style=\"display: block; margin: 0 auto; max-height:300px;\">\n",
"</div>\n",
"\n",
"<div style=\"text-align:center;\">\n",
" <img src=\"images/smiles_example_2.png\" style=\"display: block; margin: 0 auto; max-height:300px;\">\n",
"</div>\n",
"\n",
"### Using Online Resources\n",
"Most of the time, you will not need to write a SMILES string by hand. You will be able to look up a molecule's SMILES string from a web database like [PubChem](https://pubchem.ncbi.nlm.nih.gov/).\n",
"\n",
"You can also use tools like this [molecule sketcher from the Protein Data Bank](https://www.rcsb.org/chemical-sketch)\n",
"to draw molecules and get their SMILES strings."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Other Molecular File Formats\n",
"\n",
"Molecules can also be represented using a number of different file formats. As you work more in chemistry, you may see a number of these. Sometimes you will have to pick a file format based on the software you are using or the molecular information you want to save. \n",
"\n",
"| File Format | Description | Features | Common Uses |\n",
"|-------------|-------------|----------|-------------|\n",
"| SMILES | Simplified Molecular Input Line Entry System | Line notation for representing molecular structures | Databases |\n",
"| InChI | International Chemical Identifier | Textual identifier for chemical substances | Databases |\n",
"| MOL/SDF | MDL MOLfile and Structure-Data File | Contains 2D/3D coordinates, atoms, bonds | Structure visualization, cheminformatics |\n",
"| PDB | Protein Data Bank format | Often used for 3D structures of proteins and nucleic acids, but can also be used for small molecules. Often does not contain molecule information, and cannot store partial charges. | Structural biology, bioinformatics |\n",
"| XYZ | Cartesian coordinates | Simple text format with atom types and 3D coordinates | Computational chemistry, molecular dynamics |\n",
"| CIF | Crystallographic Information File | Text file format for representing crystal structure data | Crystallography |\n",
"| PQR | Extended PDB format with partial charges and radii | Includes atomic coordinates, partial charges, and radii | Electrostatics calculations |\n",
"| PDBQT | PDB format with torsion angles and charges used in AutoDock | Includes atomic coordinates, partial charges, torsion angles | Molecular docking |\n",
"| MOL2 | Tripos Mol2 format | Contains atomic coordinates, bonds, molecule types, substructures, and partial charges | Molecular modeling, cheminformatics, computational chemistry |\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Introduction to RDKit Molecules\n",
"\n",
"\n",
"\n",
"There are Python libraries that are made for working just with chemical data. One commonly used library in Python for data science (or cheminformatics) is called [RDKit](https://en.wikipedia.org/wiki/RDKit). RDKit is an open-source cheminformatics library, primarily developed in C++ and has been under development since the year 2000. We will be using the Python interface to RDKit, though there are interfaces in other languages.\n",
"\n",
"RDKit provides a molecule object that allows you to manipulate chemical structures. It has capabilities for reading and writing molecular file formats, calculating molecular properties, and performing substructure searches. In addition, it offers a wide range of cheminformatics algorithms such as molecular fingerprint generation, similarity metrics calculation, and molecular descriptor computation. This notebook introduces RDKit basics.\n",
"\n",
"<div class=\"alert alert-block alert-success\"> \n",
"<strong>Python Skills: Python Objects</strong>\n",
"\n",
"Most of this functionality is achieved through the RDKit `mol` object. In Python, we use the word \"object\" to refer to a variable type with associated data and methods. \n",
"One example of an object we have seen in notebooks is a list - we could also call it a \"list object\". An object has `attributes` (data) and `methods`. \n",
"You access information about objects with the syntax\n",
"```python\n",
"object.data\n",
"```\n",
"where data is the attribute name.\n",
"\n",
"You access object methods with the syntax\n",
"```python\n",
"object.method(arguments)\n",
"```\n",
"</div> \n",
"\n",
"In this lesson, we will create and manipulate RDKit `mol` objects. RDKit `mol` objects represent molecules and have\n",
"attributes (data) and methods (actions) associated with molecules.\n",
"\n",
"We are going to use a part of RDKit called `Chem`. To use `Chem` we have to import it, which we did above in the importing libraries section. "
]
},
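{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the attribute and method syntax above concrete, here is a minimal sketch in plain Python. The `Atom` class, its `symbol` and `mass` attributes, and its `describe` method are hypothetical names invented for illustration; they are not part of RDKit.\n",
"\n",
"```python\n",
"# A list is an object: append and count are methods of the list object\n",
"symbols = ['C', 'C', 'O']\n",
"symbols.append('H')             # method call that changes the list\n",
"n_carbon = symbols.count('C')   # method call that returns a value\n",
"\n",
"\n",
"class Atom:\n",
"    # Hypothetical class, used only to show attributes next to methods\n",
"    def __init__(self, symbol, mass):\n",
"        self.symbol = symbol    # attribute (data)\n",
"        self.mass = mass        # attribute (data)\n",
"\n",
"    def describe(self):         # method (action)\n",
"        return f'{self.symbol}: {self.mass} g/mol'\n",
"\n",
"\n",
"oxygen = Atom('O', 15.999)\n",
"print(n_carbon)                 # number of 'C' entries in the list\n",
"print(oxygen.symbol)            # attribute access: object.data\n",
"print(oxygen.describe())        # method access: object.method(arguments)\n",
"```\n",
"\n",
"RDKit `mol` objects follow the same pattern, with methods such as `GetNumAtoms()` that act on the molecule they belong to."
]
},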
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### The original data set\n",
"\n",
"Let's load the list of molecules using the ``pandas`` library \n",
"and take a look at a few samples in the dataset"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2024-06-26T16:49:13.095479Z",
"start_time": "2024-06-26T16:49:13.088117Z"
}
},
"outputs": [],
"source": [
"# Path to the list of molecules data file\n",
"data_path = \"./data/solubility-molecule-list.csv\"\n",
"\n",
"# Read the data into a DataFrame\n",
"df = pd.read_csv(data_path)\n",
"\n",
"# Display the first few rows of the DataFrame\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The dataset contains the following columns:\n",
"- **Compound ID**: compound name in a range of formats\n",
"- **smiles**: SMILES string representation of each molecule\n",
"- **logS**: the base-10 logarithm of the aqueous solubility of the molecule in mol/L, measured at 25 °C"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Adding molecule structures to the data set using the SMILES strings\n",
"\n",
"Above we learned about molecular representations using SMILES strings. Now we will use SMILES strings to create molecule objects in RDKit. \n",
"\n",
"We can create a representation of methane using RDKit by using the `MolFromSmiles` function in `rdkit.Chem`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# visualizing methane as an example\n",
"methane = Chem.MolFromSmiles(\"C\")\n",
"methane"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# visualizing propane as another example\n",
"propane = Chem.MolFromSmiles(\"CCC\")\n",
"propane"
]
},
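{
"cell_type": "markdown",
"metadata": {},
"source": [
"`Chem.MolFromSmiles` returns `None` instead of raising an error when it cannot parse a string, so it can be worth checking the result before calculating properties. The sketch below uses SMILES strings from the examples earlier in this notebook plus one deliberately malformed string (an unclosed ring, chosen here purely for illustration).\n",
"\n",
"```python\n",
"from rdkit import Chem\n",
"\n",
"# SMILES strings from the examples earlier in this notebook\n",
"examples = {\n",
"    'ethanol': 'CCO',\n",
"    '2-propanol': 'CC(O)C',\n",
"    'cyclopentane': 'C1CCCC1',\n",
"    'pyridine': 'n1ccccc1',\n",
"}\n",
"\n",
"for name, smiles in examples.items():\n",
"    mol = Chem.MolFromSmiles(smiles)\n",
"    # Count heavy atoms and how many of them RDKit marks as aromatic\n",
"    n_aromatic = len(mol.GetAromaticAtoms())\n",
"    print(name, mol.GetNumAtoms(), n_aromatic)\n",
"\n",
"# The ring opened by '1' is never closed, so parsing fails and None is returned\n",
"bad_mol = Chem.MolFromSmiles('C1CC')\n",
"print(bad_mol is None)\n",
"```\n",
"\n",
"A `None` entry in a column of molecule objects would cause later descriptor calculations to fail, so a check like this is a reasonable first step when working with a new list of SMILES strings."
]
},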
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"### The `.apply` function\n",
"\n",
"In the examples above, we made a single molecule object from a single SMILES string. However, when we are working with a lot of data, we might have a whole column of SMILES strings that we need to turn into molecule objects. Further, we would like to save those molecule objects as a new column in our pandas DataFrame. This is a general pattern: you often want to calculate a new column of data from an existing column in your DataFrame. The way to accomplish this is the `.apply` method. You select an existing column of your pandas DataFrame, call `.apply()` on it, and pass in the parentheses the Python function that calculates the value you want. In the code below, we take the column of SMILES strings, apply the `Chem.MolFromSmiles` function to each entry, and save the results as a new column of the DataFrame."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# visualizing all the molecules in our data set\n",
"df['mol'] = df['smiles'].apply(Chem.MolFromSmiles)\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"RDKit molecule objects have a number of methods we can use to get more information about the molecule. In the next few cells, we'll look at some methods that can tell us some things about the molecules we've created. \n",
"\n",
"We can use the `.apply` function that we just discussed to apply these methods to our molecule objects and save the results in a new column."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Calculating Molecular Weights"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df['mol_weight'] =df['mol'].apply(Descriptors.MolWt)\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"### Calculating number of rotatable bonds"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df['rot_bonds'] =df['mol'].apply(Chem.rdMolDescriptors.CalcNumRotatableBonds)\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Calculating logP\n",
"\n",
"This uses the Wildman-Crippen logP calculation, an atom-based scheme based on the values in S. A. Wildman and G. M. Crippen, J. Chem. Inf. Comput. Sci. 39, 868-873 (1999)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df['clogP'] =df['mol'].apply(Chem.Crippen.MolLogP)\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Calculating aromatic ratio\n",
"\n",
"The aromatic proportion was calculated in the original paper by dividing the number of aromatic atoms by the number of total atoms. Although there is not a function in RDKit that calculates this directly, we can calculate it by creating our own function that uses two existing RDKit functions to perform the calculation. Then we can use our new function and ``.apply`` to make a new column in our data set."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2024-06-26T16:49:45.746932Z",
"start_time": "2024-06-26T16:49:45.735372Z"
},
"tags": []
},
"outputs": [],
"source": [
"## defining the function that will calculate the aromatic proportion\n",
"\n",
"def aromatic_calc(mol):\n",
" prop_aromatic = len(mol.GetAromaticAtoms())/mol.GetNumAtoms()\n",
" return prop_aromatic"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df['aromatic_ratio'] =df['mol'].apply(aromatic_calc)\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Other calculations RDKit can perform\n",
"\n",
"There are many other properties of molecules that RDKit can calculate. In general, the functions in RDKit are organized into modules based on the type of property they calculate. For instance, in some of the examples above, we used functions from the \n",
"[`Descriptors` module](https://www.rdkit.org/docs/source/rdkit.Chem.Descriptors.html) and the [`rdMolDescriptors` module](https://www.rdkit.org/docs/source/rdkit.Chem.rdMolDescriptors.html). You can click on either of those links to see the full list of the different properties you can access. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-block alert-success\"> \n",
" <b>Exercise</b>\n",
" Look through the documentation and find some additional molecular properties you want\n",
" to add to your dataframe.\n",
"</div>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### EDA of the re-created data set\n",
"\n",
"We will use SweetViz again to verify the data we generated has no missing values and has the expected value distributions. The molecule images, however, will cause an error. So we will create a version of the DataFrame called df_nomol with the 'mol' column dropped."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2024-06-26T16:50:52.798960Z",
"start_time": "2024-06-26T16:50:52.792664Z"
}
},
"outputs": [],
"source": [
"# dropping the mol columns\n",
"\n",
"df_nomol = df.drop(columns = ['mol'])\n",
"df_nomol.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Analyse Dataset\n",
"report = sv.analyze(df_nomol)\n",
"\n",
"# View and Save\n",
"report.show_notebook()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-block alert-success\"> \n",
"<b>What to check in the data set:</b>\n",
"\n",
"1. There should be no missing values.<br>\n",
"2. The associations between variables should be similar to what you saw previously.<br>\n",
"3. The numerical values of skewness should be similar to what you saw previously, using the skewness value ranges below:\n",
"- The skewness value of zero indicates a perfect symmetrical distribution,\n",
"- a skewness between -0.5 and 0.5 indicates an approximately symmetric distribution,\n",
"- a skewness between -1 and -0.5 (or 0.5 and 1) indicates a moderately skewed distribution,\n",
"- a skewness between -1.5 and -1 (or 1 and 1.5) indicates a highly skewed distribution, and\n",
"- a skewness less than -1.5 (or greater than 1.5) indicates an extremely skewed distribution.\n",
"</div>"
]
},
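{
"cell_type": "markdown",
"metadata": {},
"source": [
"The skewness ranges listed above can be written as a small helper function. This is only a sketch: `describe_skewness` is a hypothetical name, and values that fall exactly on a boundary are assigned to the less skewed category, which is an assumption because the listed ranges share their end points.\n",
"\n",
"```python\n",
"def describe_skewness(skew):\n",
"    # Hypothetical helper: map a skewness value to the categories listed above\n",
"    size = abs(skew)  # the ranges are symmetric about zero\n",
"    if size == 0:\n",
"        return 'perfectly symmetric'\n",
"    if size <= 0.5:\n",
"        return 'approximately symmetric'\n",
"    if size <= 1:\n",
"        return 'moderately skewed'\n",
"    if size <= 1.5:\n",
"        return 'highly skewed'\n",
"    return 'extremely skewed'\n",
"\n",
"\n",
"for value in [0.0, -0.3, 0.8, -1.2, 2.4]:\n",
"    print(value, describe_skewness(value))\n",
"```\n",
"\n",
"For a numeric column of a pandas DataFrame, the `.skew()` method returns a value that can be passed to a helper like this one."
]
},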
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Building a multifeature linear regression"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Splitting the data\n",
"\n",
"We will again split the data into training and testing sets before doing any other data cleaning or engineering, to prevent data leakage between the training and testing data. We will use the ``train_test_split`` function from the ``sklearn.model_selection`` module to split the data. The training set will be used to train the model, while the testing set will be used to evaluate the model's performance. We will also use a ``random_state`` again, so the split is the same when comparing different models. \n",
"\n",
"We will drop the target vector 'logS', and also the 'Compound ID', 'smiles', and 'mol' columns, as we do not want the model to use those to predict the solubility."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Create the feature matrix (x) and target vector (y)\n",
"x = df.drop(columns=['logS', 'smiles', 'mol', 'Compound ID'])\n",
"y = df['logS']\n",
"\n",
"# Split the data into training and testing sets (80% training, 20% testing)\n",
"x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=123, shuffle=True)\n",
"\n",
"# Display the shapes of the training and testing sets\n",
"x_train.shape, x_test.shape, y_train.shape, y_test.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Feature engineering"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"After splitting our data, we need to scale our training (and test) features. Scaling is a crucial step in the data preprocessing pipeline because many machine learning models are sensitive to the scale of the input features, and scaling puts all features on the same scale. We will use the ``StandardScaler`` from the ``sklearn.preprocessing`` module to scale our features. Note that a Random Forest model does not require data scaling: it is a tree-based model, so different feature scales will not affect model performance. We will still scale the data, however, so that we can also build a linear regression model and compare the performance of the two models. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Create the standard scaler object\n",
"scaler = StandardScaler()\n",
"\n",
"# Fit and transform the training feature vector x_train\n",
"x_train_scaled = scaler.fit_transform(x_train)\n",
"\n",
"# Transform the test feature vector x_test\n",
"x_test_scaled = scaler.transform(x_test)\n",
"\n",
"# Make sure the training data is scaled correctly\n",
"print(f\" Training feature mean: {x_train_scaled.mean():.5f}\")\n",
"print(f\" Training feature standard deviation: {x_train_scaled.std():.5f}\\n\")\n",
"\n",
"# Print the scaler statistics on the test data\n",
"print(f\" Testing feature mean: {x_test_scaled.mean():.5f}\")\n",
"print(f\" Testing feature standard deviation: {x_test_scaled.std():.5f}\")"
]
},
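{
"cell_type": "markdown",
"metadata": {},
"source": [
"For reference, ``StandardScaler`` standardizes each feature column $j$ of the feature matrix separately, using the mean $\\mu_j$ and standard deviation $\\sigma_j$ that `fit_transform` learns from the training data:\n",
"\n",
"$$z_{ij} = \\frac{x_{ij} - \\mu_j}{\\sigma_j}$$\n",
"\n",
"Because `transform` reuses the training values of $\\mu_j$ and $\\sigma_j$ on the test set, the scaled test features will generally have a mean close to, but not exactly, 0 and a standard deviation close to, but not exactly, 1."
]
},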
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-block alert-info\">\n",
" <b>Reminder:</b>\n",
" It is extremely important to split the data first and then fit the scaler on the training data only. Fitting the scaler on the entire data and then splitting it can cause a <b>data leakage</b> problem which violates our intention to treat the test data as a good representative sample of the real-world data. \n",
"</div>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Create a linear regression model\n",
"multi_feature_model = LinearRegression()\n",
"\n",
"# Fit the model to the training data\n",
"multi_feature_model.fit(x_train_scaled, y_train)\n",
"\n",
"# Make predictions on the test data\n",
"y_pred_linear_multi = multi_feature_model.predict(x_test_scaled)\n",
"\n",
"# Calculate the performance metrics and store them in a DataFrame\n",
"results = pd.DataFrame({\n",
" \"MSE\": mean_squared_error(y_test, y_pred_linear_multi), # the mean squared error\n",
" \"R2\": r2_score(y_test, y_pred_linear_multi) # the coefficient of determination\n",
"}, index=[\"Multi-Linear-Regression\"])\n",
" \n",
"\n",
"# Set the formatting style\n",
"results.style.format(\n",
" {\n",
" \"MSE\": \"{:.3f}\",\n",
" \"R2\": \"{:.2f}\"\n",
" }\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-block alert-success\"> \n",
" <b>Exercise</b>\n",
" Is the model performance using the data set you constructed similar to the performance using the provided pre-processed data?\n",
"</div>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Building and Training a Random Forest Regression Model"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### What is a random forest model?\n",
"The next step after the data preparation is to build and train our random forest regression model. A random forest is a decision tree model that uses a \"forest\" of multiple decision trees and randomly chooses which variables to use in each tree. Generally, each individual tree is not that good at making a prediction, but collectively the trees are quite good at making predictions. Note that each tree predicts an individual value and then a vote is taken (in the case of regression, an average of the predicted values) to determine the final predicted value.\n",
"\n",
"<div style=\"text-align:center;\">\n",
" <img src=\"images/Random_forest_explain.png\" style=\"display: block; margin: 0 auto; max-height:400px;\">\n",
"</div>\n",
"\n",
"### Hyperparameters\n",
"Random forest models (and many other models) have hyperparameters that can be 'tuned' to optimize model performance. The number of trees (``n_estimators``) can be specified, with a default value of 100 trees. Each tree also has a maximum depth (``max_depth``), which limits how many levels of splits a tree can have, with the default value being no limit on the depth (``None``). Too few trees or trees that are too shallow can result in a model that predicts poorly due to underfitting the data: the model is too simple to predict well using either training or test data. Trees that are too deep can result in a model that predicts poorly due to overfitting the data: the model is so complex that it predicts the original training data extremely well but cannot predict using new data. Adding more trees mainly increases training time, with diminishing gains in performance.\n",
"\n",
"### Building and training a default random forest regression model\n",
"We will first build and train a random forest model that uses the default parameters. We will use the ``RandomForestRegressor`` class from the ``sklearn.ensemble`` module to create the model. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Create a random forest regression model\n",
"default_rf_model = RandomForestRegressor()\n",
"\n",
"# Fit the model to the training data\n",
"default_rf_model.fit(x_train_scaled, y_train)\n",
"\n",
"# Make predictions on the test data\n",
"y_pred_default_rf = default_rf_model.predict(x_test_scaled)\n",
"\n",
"# Calculate the performance metrics\n",
"default_rf_model_results = pd.DataFrame({\n",
" \"MSE\": mean_squared_error(y_test, y_pred_default_rf), # the mean squared error\n",
" \"R2\": r2_score(y_test, y_pred_default_rf) # the coefficient of determination\n",
"}, index=[\"Default_RF_Regression\"])\n",
"\n",
"# Store the results into results DataFrame\n",
"results = pd.concat([results, default_rf_model_results])\n",
"results"
]
},
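{
"cell_type": "markdown",
"metadata": {},
"source": [
"As described earlier, a random forest regressor averages the predictions of its individual trees. The sketch below checks this directly. It uses a small synthetic data set from `make_regression` (an assumption made so the example stands alone; it is not the solubility data) and compares the forest's prediction with the mean of the predictions of its trees, which scikit-learn stores in the `estimators_` attribute after fitting.\n",
"\n",
"```python\n",
"import numpy as np\n",
"from sklearn.datasets import make_regression\n",
"from sklearn.ensemble import RandomForestRegressor\n",
"\n",
"# Synthetic regression data, used only so this sketch is self-contained\n",
"X, y = make_regression(n_samples=200, n_features=4, noise=0.5, random_state=0)\n",
"\n",
"forest = RandomForestRegressor(n_estimators=25, random_state=0)\n",
"forest.fit(X, y)\n",
"\n",
"# Predictions of every individual tree, shape (n_trees, n_samples)\n",
"tree_predictions = np.array([tree.predict(X) for tree in forest.estimators_])\n",
"\n",
"# The forest prediction is the average over the trees\n",
"manual_average = tree_predictions.mean(axis=0)\n",
"print(np.allclose(manual_average, forest.predict(X)))\n",
"```\n",
"\n",
"Looking at the rows of `tree_predictions` also shows how much the individual trees disagree with one another before their predictions are averaged."
]
},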
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-block alert-success\"> \n",
" <b>Exercise</b>\n",
" Which model is best at predicting the value of logS?\n",
"</div>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2024-06-26T16:52:00.069272Z",
"start_time": "2024-06-26T16:51:59.817331Z"
}
},
"outputs": [],
"source": [
"# Create a plot object\n",
"fig, ax = plt.subplots(figsize=(5, 5))\n",
"\n",
"# Plot the test data against itself as a reference line (perfect predictions fall on it)\n",
"ax.scatter(y_test, y_test, color='blue', label='Test Data')\n",
"\n",
"# Plot the random forest regression predictions\n",
"ax.scatter(y_test, y_pred_default_rf, color='red', label='Random Forest Regression')\n",
"\n",
"# Plot the multifeature linear regression predictions\n",
"ax.scatter(y_test, y_pred_linear_multi, color='green', label=\"Multifeature Linear Regression\")\n",
"\n",
"# Label the axes\n",
"ax.set_xlabel('Measured logS')\n",
"ax.set_ylabel('Predicted logS')\n",
"\n",
"# Create the legends\n",
"fig.legend(facecolor='white')\n",
"\n",
"# Show the plot\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Improving model performance using cross-validation\n",
"\n",
"We will use cross-validation to train both the multi-feature linear regression model and the random forest regression model. We will again use the ``cross_val_score`` function from the ``sklearn.model_selection`` module to perform a 5-fold cross validation experiment."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Create a pipeline object\n",
"pipeline_multi_feature_linear = make_pipeline(scaler, multi_feature_model)\n",
"pipeline_default_rf = make_pipeline(scaler, default_rf_model)\n",
"\n",
"# Create a KFold object\n",
"kf = KFold(n_splits=5, shuffle=True, random_state=123)\n",
"\n",
"# Perform cross-validation\n",
"cv_results_multi_feature_linear = cross_val_score(pipeline_multi_feature_linear, x, y, cv=kf, scoring=\"neg_mean_squared_error\")\n",
"cv_results_default_rf = cross_val_score(pipeline_default_rf, x, y, cv=kf, scoring=\"neg_mean_squared_error\")\n",
"\n",
"# Calculate the mean and standard deviation of the cross-validation results\n",
"print(f\"CV Results Multifeature Linear Regression Mean MSE: {-cv_results_multi_feature_linear.mean():.5f} +/- {cv_results_multi_feature_linear.std():.5f} \")\n",
"print(f\"CV Results Default Random Forest Mean MSE: {-cv_results_default_rf.mean():.5f} +/- {cv_results_default_rf.std():.5f}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-block alert-success\"> \n",
" <b>Exercise</b>\n",
" Did using cross-validation improve the performance of the models?\n",
"</div>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Tuning a Random Forest Regression Model"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Our first Random Forest model used the default hyperparameter settings. Tuning (optimizing) hyperparameter settings can improve model performance. We will use a loop to try a range of hyperparameter settings with our model. Although we don't have time to use them today, tools such as ``GridSearchCV`` and ``Optuna`` can be used to help identify the best set of hyperparameters. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"\n",
"# Defining the values to test for the n_estimators and max_depth hyperparameters\n",
"trees = [50, 100, 300, 500, 1000]\n",
"depths = [1, 3, 5, 7, 9, 11]\n",
"\n",
"# Defining lists to store results for each set of hyperparameters\n",
"tree_count = []\n",
"tree_depth = []\n",
"tuned_rf_MSE = []\n",
"tuned_rf_R2 = []\n",
"\n",
"for tree in trees:\n",
" \n",
" for depth in depths:\n",
" \n",
" # Create a random forest regression model\n",
" tuned_rf_model = RandomForestRegressor(n_estimators = tree, max_depth = depth)\n",
"\n",
" # Fit the model to the training data\n",
" tuned_rf_model.fit(x_train_scaled, y_train)\n",
"\n",
" # Make predictions on the test data\n",
" y_pred_tuned_rf = tuned_rf_model.predict(x_test_scaled)\n",
"\n",
" # Storing the results in lists\n",
" tree_count.append(tree)\n",
" tree_depth.append(depth)\n",
" tuned_rf_MSE.append(mean_squared_error(y_test, y_pred_tuned_rf))\n",
" tuned_rf_R2.append(r2_score(y_test, y_pred_tuned_rf))\n",
" \n",
" \n",
"# Create a DataFrame from the lists of results\n",
"results_df = pd.DataFrame(\n",
" {'Number of trees': tree_count,\n",
" 'Max depth': tree_depth,\n",
" 'MSE': tuned_rf_MSE,\n",
" 'R2': tuned_rf_R2\n",
" })\n",
"\n",
"# display the results DataFrame\n",
"results_df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-block alert-success\"> \n",
" <b>Exercise</b>\n",
" Which set of hyperparameters gave the best model performance?\n",
"</div>"
]
},
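{
"cell_type": "markdown",
"metadata": {},
"source": [
"As mentioned above, `GridSearchCV` can automate this kind of search. The following is a minimal sketch, not the procedure used in this notebook: it uses synthetic data from `make_regression` and a deliberately small grid so that it runs quickly, and it scores each combination with 3-fold cross-validation instead of a single test set. The grid values are assumptions chosen for illustration, not recommendations.\n",
"\n",
"```python\n",
"from sklearn.datasets import make_regression\n",
"from sklearn.ensemble import RandomForestRegressor\n",
"from sklearn.model_selection import GridSearchCV\n",
"\n",
"# Synthetic data, used only so this sketch is self-contained\n",
"X, y = make_regression(n_samples=150, n_features=4, noise=0.5, random_state=0)\n",
"\n",
"# A small illustrative grid of hyperparameter values\n",
"param_grid = {'n_estimators': [10, 50], 'max_depth': [3, 7]}\n",
"\n",
"search = GridSearchCV(\n",
"    RandomForestRegressor(random_state=0),\n",
"    param_grid,\n",
"    cv=3,\n",
"    scoring='neg_mean_squared_error',\n",
")\n",
"search.fit(X, y)\n",
"\n",
"print(search.best_params_)   # best combination found on this grid\n",
"print(-search.best_score_)   # its mean cross-validated MSE\n",
"```\n",
"\n",
"Compared with the manual loop, this approach selects hyperparameters using cross-validation on the data passed to `fit`, so a separate test set can be kept aside for the final evaluation."
]
},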
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-block alert-success\"> \n",
" <b>Final Exercise</b>\n",
" Where could you use Python coding or machine learning in your courses?\n",
"</div>"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.5"
}
},
"nbformat": 4,
"nbformat_minor": 4
}