| 1 | +# Molecular Solubility Prediction Report |
| 2 | + |
| 3 | +DISCLAIMER: this is an AI-generated report, so it may contain errors. Please check the reasoning traces and executed code for accuracy. |
| 4 | + |
| 5 | +## Executive Summary |
| 6 | + |
| 7 | +This report details the development of machine learning models to predict molecular solubility from chemical structure data. Using the ESOL dataset containing 1,144 compounds, we built and evaluated several regression models, with Gradient Boosting emerging as the best performer (R²=0.917). The analysis demonstrates that computational methods can accurately estimate aqueous solubility from molecular descriptors, offering valuable insights for drug discovery and chemical research. |
| 8 | + |
| 9 | +## Introduction |
| 10 | + |
| 11 | +Molecular solubility is a critical property in pharmaceutical development and chemical research, influencing drug bioavailability and formulation. Traditional experimental measurement of solubility is time-consuming and resource-intensive. This project aimed to develop machine learning models that can predict aqueous solubility (measured as log mol/L) directly from molecular structures represented as SMILES strings. |
| 12 | + |
| 13 | +The ESOL dataset used contains: |
| 14 | +- 1,144 organic compounds |
| 15 | +- Experimentally measured solubility values (log mol/L) |
| 16 | +- SMILES string representations of each molecule |
| 17 | +- Existing ESOL model predictions for comparison |
| 18 | + |
| 19 | +## Data Exploration |
| 20 | + |
| 21 | +The dataset was thoroughly examined before model development: |
| 22 | + |
| 23 | + |
| 24 | + |
| 25 | +- **Data Quality**: Complete dataset with no missing values (1,144 compounds) |
| 26 | +- **Target Variable**: |
| 27 | + - Range: -11.6 to 1.58 log mol/L |
| 28 | + - Mean: -3.06 ± 2.1 (standard deviation) |
| 29 | + - Distribution: Approximately normal with slight left skew |
| 30 | +- **Features**: |
| 31 | + - Initial 217 molecular descriptors computed from SMILES strings |
| 32 | +  - Reduced to the 50 features most strongly correlated with solubility |
| 33 | + |
| 34 | +Key molecular descriptors identified included: |
| 35 | +1. MolLogP (estimated log of the octanol-water partition coefficient) |
| 36 | +2. PEOE_VSA6 (partial charge surface area descriptor) |
| 37 | +3. Molecular weight |
| 38 | +4. Morgan fingerprint density |
| 39 | +5. BCUT descriptors (eigenvalue-based descriptors derived from a property-weighted connectivity matrix) |
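The reduction from 217 descriptors to 50 is described only as "correlation analysis". A minimal sketch of one way such a filter could work is shown below, in plain Python on made-up toy data. The function names, the `redundancy_cutoff` value, and the toy columns are assumptions for illustration; the report does not state the thresholds or the exact procedure that was used.

```python
import math

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)

def select_features(features, target, top_k=50, redundancy_cutoff=0.95):
    """Rank descriptors by |r| with the target, skipping constant columns and
    columns highly correlated with an already selected descriptor.
    The cutoff of 0.95 is an assumed value, not one taken from the report."""
    ranked = sorted(
        ((abs(pearson(col, target)), name)
         for name, col in features.items()
         if len(set(col)) > 1),          # drop constant columns
        reverse=True,
    )
    selected = []
    for _score, name in ranked:
        redundant = any(
            abs(pearson(features[name], features[kept])) > redundancy_cutoff
            for kept in selected
        )
        if redundant:
            continue
        selected.append(name)
        if len(selected) == top_k:
            break
    return selected

# Toy data (invented for illustration, not taken from ESOL).
toy_features = {
    "logp":      [1.0, 2.0, 3.0, 4.0, 5.0],
    "logp_copy": [2.0, 4.0, 6.0, 8.0, 10.0],  # redundant with "logp"
    "constant":  [1.0, 1.0, 1.0, 1.0, 1.0],   # carries no information
    "noise":     [3.0, 1.0, 4.0, 1.0, 5.0],
}
toy_target = [-1.1, -2.0, -2.9, -4.2, -5.0]

chosen = select_features(toy_features, toy_target, top_k=2)
print(chosen)
```

On the toy data, the constant column is dropped and only one of the two perfectly correlated columns survives, so the second slot goes to the weaker but non-redundant descriptor.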
| 40 | + |
| 41 | +## Analysis & Methodology |
| 42 | + |
| 43 | +The analytical approach followed these steps: |
| 44 | + |
| 45 | +1. **Feature Engineering**: |
| 46 | + - Computed 217 molecular descriptors using RDKit |
| 47 | + - Removed constant and highly correlated features |
| 48 | + - Selected top 50 features by correlation with solubility |
| 49 | + |
| 50 | +2. **Model Selection**: |
| 51 | + - Evaluated three regression approaches: |
| 52 | + - Random Forest |
| 53 | + - Gradient Boosting |
| 54 | + - Support Vector Regression |
| 55 | +   - Used an 80/20 train-test split for evaluation |
| 56 | + |
| 57 | +3. **Evaluation Metrics**: |
| 58 | + - R² (coefficient of determination) |
| 59 | + - MAE (Mean Absolute Error) |
| 60 | + - RMSE (Root Mean Squared Error) |
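The evaluation protocol above (an 80/20 split followed by R², MAE, and RMSE) can be sketched in plain Python. This is an illustrative sketch rather than the code that produced the report: `split_indices`, `regression_metrics`, the seed, and the rounding of the test-set size are all assumptions.

```python
import math
import random

def split_indices(n_samples, test_fraction=0.2, seed=0):
    """Shuffle sample indices and split them into (train, test) lists.
    Rounding the test size up is an assumption, not stated in the report."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]

def regression_metrics(y_true, y_pred):
    """Return R², MAE, and RMSE for paired true and predicted values."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(r ** 2 for r in residuals)            # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "mae": sum(abs(r) for r in residuals) / n,
        "rmse": math.sqrt(ss_res / n),
    }

# With 1,144 compounds, an 80/20 split leaves 915 for training and 229 for testing.
train_idx, test_idx = split_indices(1144, test_fraction=0.2, seed=0)
print(len(train_idx), len(test_idx))  # 915 229

# Toy predictions (invented for illustration).
y_true = [-1.0, -2.0, -3.0, -4.0]
y_pred = [-1.5, -2.0, -2.5, -4.0]
metrics = regression_metrics(y_true, y_pred)
print(metrics)  # r2 = 0.9, mae = 0.25, rmse ≈ 0.354
```

RMSE penalizes large errors more heavily than MAE, so a gap between the two indicates a few compounds with large prediction errors.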
| 61 | + |
| 62 | +## Results & Findings |
| 63 | + |
| 64 | +The models achieved the following performance: |
| 65 | + |
| 66 | + |
| 67 | + |
| 68 | +| Model | R² | MAE | RMSE | |
| 69 | +|--------------------------|--------|--------|--------| |
| 70 | +| Random Forest | 0.910 | 0.460 | 0.627 | |
| 71 | +| Gradient Boosting | 0.917 | 0.452 | 0.600 | |
| 72 | +| Support Vector Regression| 0.816 | 0.562 | 0.894 | |
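As a rough consistency check on the table above, R² and RMSE are linked by R² = 1 − RMSE² / Var(y), where Var(y) is the variance of the evaluated targets. The sketch below plugs in the full-dataset standard deviation of 2.1 from the Data Exploration section. This is an approximation, because the variance of the held-out test set is not reported and 2.1 is itself rounded; `r2_from_rmse` is a hypothetical helper name.

```python
def r2_from_rmse(rmse, target_std):
    """R² implied by an RMSE, given the standard deviation of the targets."""
    return 1.0 - (rmse / target_std) ** 2

# Assumption: the full-dataset standard deviation stands in for the
# unreported test-set standard deviation.
TARGET_STD = 2.1

# (reported R², reported RMSE) copied from the results table.
reported = {
    "Random Forest": (0.910, 0.627),
    "Gradient Boosting": (0.917, 0.600),
    "Support Vector Regression": (0.816, 0.894),
}

implied = {name: r2_from_rmse(rmse, TARGET_STD)
           for name, (_r2, rmse) in reported.items()}

for name, (r2, _rmse) in reported.items():
    print(f"{name}: reported R2 = {r2:.3f}, implied R2 = {implied[name]:.3f}")
# Random Forest: reported R2 = 0.910, implied R2 = 0.911
# Gradient Boosting: reported R2 = 0.917, implied R2 = 0.918
# Support Vector Regression: reported R2 = 0.816, implied R2 = 0.819
```

The implied values fall within about 0.003 of the reported ones, so the R² and RMSE columns are mutually consistent under this approximation.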
| 73 | + |
| 74 | +Key findings: |
| 75 | +- Gradient Boosting demonstrated the best overall performance |
| 76 | +- All models outperformed the baseline ESOL predictions included in the dataset |
| 77 | +- Molecular weight and hydrophobicity (MolLogP) were among the most important predictive features |
| 78 | +- The models captured both the central tendency and extremes of the solubility range well |
| 79 | + |
| 80 | +## Conclusions |
| 81 | + |
| 82 | +This analysis successfully developed machine learning models capable of accurately predicting molecular solubility from chemical structure data. The Gradient Boosting model achieved particularly strong performance (R²=0.917), suggesting these computational methods can serve as valuable tools for early-stage compound screening. |
| 83 | + |
| 84 | +**Limitations:** |
| 85 | +- Model performance may degrade for compounds very different from those in the training set |
| 86 | +- SMILES parsing could fail for certain complex or unusual molecular structures |
| 87 | +- Limited interpretability of some molecular descriptors |
| 88 | + |
| 89 | +**Future Work:** |
| 90 | +- Hyperparameter tuning to optimize model performance |
| 91 | +- Exploration of deep learning approaches using graph neural networks |
| 92 | +- Incorporation of additional molecular representations (e.g., molecular fingerprints) |
| 93 | +- Application to larger and more diverse chemical datasets |
| 94 | + |
| 95 | +The developed models provide a foundation for computational solubility prediction that could significantly accelerate chemical research and drug discovery workflows. |