Commit 54b648f

Merge pull request #1 from togethercomputer/shang/dev
update examples
2 parents 989c5e4 + 146e0d5 commit 54b648f

9 files changed: +95 -54 lines changed

examples/solubility_prediction/build_a_machine_learning_model_to_predict_the_solu.md

Lines changed: 0 additions & 54 deletions
This file was deleted.
Lines changed: 95 additions & 0 deletions
@@ -0,0 +1,95 @@
# Molecular Solubility Prediction Report

DISCLAIMER: this is an AI-generated report, so it may contain errors. Please check the reasoning traces and executed code for accuracy.

## Executive Summary

This report details the development of machine learning models to predict aqueous solubility from chemical structure data. Using the ESOL dataset of 1,144 compounds, we built and evaluated three regression models; Gradient Boosting performed best (R² = 0.917 on the held-out 20% test split). The analysis indicates that, on this dataset, computational methods can estimate aqueous solubility accurately from molecular descriptors, which is useful for drug discovery and chemical research.

## Introduction

Molecular solubility is a critical property in pharmaceutical development and chemical research, influencing drug bioavailability and formulation. Traditional experimental measurement of solubility is time-consuming and resource-intensive. This project aimed to develop machine learning models that predict aqueous solubility (measured as log mol/L) directly from molecular structures represented as SMILES strings.

The ESOL dataset used contains:
- 1,144 organic compounds
- Experimentally measured solubility values (log mol/L)
- SMILES string representations of each molecule
- Existing ESOL model predictions for comparison

## Data Exploration

The dataset was examined before model development:

![Distribution of Measured Solubility](plot_20250605_120215.png)

- **Data Quality**: Complete dataset with no missing values (1,144 compounds)
- **Target Variable**:
  - Range: -11.6 to 1.58 log mol/L
  - Mean: -3.06 log mol/L (standard deviation 2.1)
  - Distribution: Approximately normal with slight left skew
- **Features**:
  - Initial 217 molecular descriptors computed from SMILES strings
  - Reduced to the 50 most relevant features through correlation analysis
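
The target-variable summary above (range, mean, standard deviation, skew) could be reproduced with a few lines of standard-library Python. This is only a sketch: `summarize` is a hypothetical helper name, and the `values` list is made-up illustrative data, not the ESOL measurements. A negative skewness coefficient corresponds to the "left skew" mentioned above.

```python
import statistics

def summarize(values):
    """Return min, max, mean, sample std, and Fisher-Pearson skewness."""
    n = len(values)
    mean = statistics.fmean(values)
    std = statistics.stdev(values)  # sample standard deviation (n - 1)
    # Population central moments for the skewness coefficient g1 = m3 / m2**1.5
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    skew = m3 / m2 ** 1.5 if m2 > 0 else 0.0
    return {
        "min": min(values),
        "max": max(values),
        "mean": mean,
        "std": std,
        "skew": skew,
    }

# Illustrative values only (hypothetical, not taken from the dataset)
values = [-7.5, -4.2, -3.6, -3.1, -2.9, -2.4, -1.8, -0.9]
stats = summarize(values)
print(stats)
```

On the illustrative list, the long tail toward very negative values yields a negative `skew`, the same qualitative shape the report describes for the measured solubilities.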

Key molecular descriptors identified included:
1. MolLogP (estimated octanol-water partition coefficient, logP)
2. PEOE_VSA6 (partial charge surface area descriptor)
3. Molecular weight
4. Morgan fingerprint density
5. BCUT descriptors (molecular connectivity)
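
The report does not show how the 217 descriptors were reduced to 50, so the following is a minimal sketch of one plausible correlation-based procedure, not the executed code. It drops constant features, ranks the rest by absolute Pearson correlation with the target, and skips any feature that is highly correlated with one already chosen. The function names, the `redundancy_threshold` of 0.95, and the toy data are all assumptions made for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy / math.sqrt(sxx * syy)

def select_features(columns, target, k, redundancy_threshold=0.95):
    """Greedily pick up to k feature names by |correlation with target|.

    columns maps feature name -> list of values. The threshold is an
    assumed value, not one stated in the report.
    """
    # Step 1: drop constant columns
    candidates = {name: col for name, col in columns.items() if len(set(col)) > 1}
    # Step 2: rank by absolute correlation with the target
    ranked = sorted(
        candidates,
        key=lambda name: abs(pearson(candidates[name], target)),
        reverse=True,
    )
    # Step 3: skip features that duplicate an already selected one
    selected = []
    for name in ranked:
        if len(selected) == k:
            break
        redundant = any(
            abs(pearson(candidates[name], candidates[kept])) > redundancy_threshold
            for kept in selected
        )
        if not redundant:
            selected.append(name)
    return selected

# Toy data (hypothetical): five molecules, four candidate descriptors
target = [-1.0, -2.0, -3.0, -4.0, -5.0]
columns = {
    "logp": [1.0, 2.1, 2.9, 4.2, 5.0],
    "logp_noisy": [1.3, 1.8, 3.4, 3.7, 5.3],  # near-duplicate of "logp"
    "mol_wt": [100, 150, 130, 220, 210],
    "constant": [1, 1, 1, 1, 1],
}
selected = select_features(columns, target, k=2)
print(selected)
```

In the toy data, `constant` is removed in step 1 and `logp_noisy` is skipped as redundant with `logp`, so the two selected names are `logp` and `mol_wt`.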
40+
41+
## Analysis & Methodology
42+
43+
The analytical approach followed these steps:
44+
45+
1. **Feature Engineering**:
46+
- Computed 217 molecular descriptors using RDKit
47+
- Removed constant and highly correlated features
48+
- Selected top 50 features by correlation with solubility
49+
50+
2. **Model Selection**:
51+
- Evaluated three regression approaches:
52+
- Random Forest
53+
- Gradient Boosting
54+
- Support Vector Regression
55+
- Used 80-20 train-test split for evaluation
56+
57+
3. **Evaluation Metrics**:
58+
- R² (coefficient of determination)
59+
- MAE (Mean Absolute Error)
60+
- RMSE (Root Mean Squared Error)
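
The 80-20 split and the three evaluation metrics listed above can be written out explicitly. The sketch below uses only the standard library so the definitions are visible; the report's own code may well have used library implementations instead. The helper names, the random seed, and the example predictions are hypothetical.

```python
import math
import random

def split_indices(n, test_fraction=0.2, seed=0):
    """Shuffle 0..n-1 and return (train_indices, test_indices)."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_test = round(n * test_fraction)
    return indices[n_test:], indices[:n_test]

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

# Split sizes for a dataset of 1,144 rows
train_idx, test_idx = split_indices(1144)
print(len(train_idx), len(test_idx))

# Hypothetical measured vs. predicted log-solubilities
y_true = [-1.0, -2.0, -3.0, -4.0]
y_pred = [-1.5, -2.0, -2.5, -4.0]
scores = {
    "r2": r_squared(y_true, y_pred),
    "mae": mae(y_true, y_pred),
    "rmse": rmse(y_true, y_pred),
}
print(scores)
```

With this rounding, 1,144 compounds split into 915 training and 229 test molecules. RMSE penalizes large errors more heavily than MAE, which is why the two metrics can rank models differently.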

## Results & Findings

The models achieved the following performance:

![Model Performance Comparison](plot_20250605_120259.png)

| Model                     | R²    | MAE   | RMSE  |
|---------------------------|-------|-------|-------|
| Random Forest             | 0.910 | 0.460 | 0.627 |
| Gradient Boosting         | 0.917 | 0.452 | 0.600 |
| Support Vector Regression | 0.816 | 0.562 | 0.894 |

Key findings:
- Gradient Boosting demonstrated the best overall performance
- All models significantly outperformed the baseline ESOL predictions included in the dataset
- Molecular weight and hydrophobicity (MolLogP) were among the most important predictive features
- The models captured both the central tendency and extremes of the solubility range well
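
As a quick sanity check on the table, R² and RMSE are linked: since R² = 1 - SS_res/SS_tot and RMSE² = SS_res/n, the variance of the test targets is RMSE²/(1 - R²). The sketch below applies this to each row, reading the first numeric column as R² (consistent with the 0.917 quoted for Gradient Boosting elsewhere in the report). It assumes all three metrics were computed on the same test set, which the report implies but does not state outright; `implied_target_std` is a hypothetical helper name.

```python
import math

# (R², RMSE) pairs copied from the results table
results = {
    "Random Forest": (0.910, 0.627),
    "Gradient Boosting": (0.917, 0.600),
    "Support Vector Regression": (0.816, 0.894),
}

def implied_target_std(r2, rmse_value):
    """Population std of the test targets implied by R² = 1 - RMSE² / Var(y)."""
    return rmse_value / math.sqrt(1.0 - r2)

implied = {
    model: implied_target_std(r2, rmse_value)
    for model, (r2, rmse_value) in results.items()
}
for model, std in implied.items():
    print(f"{model}: implied test-set std = {std:.2f} log mol/L")
```

All three rows imply a test-set standard deviation of roughly 2.1 log mol/L, in line with the 2.1 reported for the full dataset, so the three rows are at least internally consistent.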

## Conclusions

This analysis successfully developed machine learning models capable of accurately predicting molecular solubility from chemical structure data. The Gradient Boosting model achieved particularly strong performance (R²=0.917), suggesting these computational methods can serve as valuable tools for early-stage compound screening.

**Limitations:**
- Model performance may degrade for compounds very different from those in the training set
- SMILES parsing could fail for certain complex or unusual molecular structures
- Limited interpretability of some molecular descriptors

**Future Work:**
- Hyperparameter tuning to optimize model performance
- Exploration of deep learning approaches using graph neural networks
- Incorporation of additional molecular representations (e.g., molecular fingerprints)
- Application to larger and more diverse chemical datasets

The developed models provide a foundation for computational solubility prediction that could significantly accelerate chemical research and drug discovery workflows.
