Commit 95239e2

Sync over internal code to open source (#266)
Co-authored-by: Steve Chan <>
1 parent 9abce48 commit 95239e2

66 files changed (+9029, -1072 lines)

.github/workflows/ci.yml

Lines changed: 6 additions & 17 deletions
@@ -18,7 +18,7 @@ concurrency:
 
 jobs:
   test-precommit:
-    runs-on: ubuntu-22.04-16core
+    runs-on: ubuntu-24.04
 
     steps:
       - name: Checkout code
@@ -32,20 +32,11 @@ jobs:
           cache-write: ${{ github.event_name == 'push' && github.ref_name == 'main' }}
 
       - name: Check formatting and typing
-        run: |
-          set -e
-          # pyright seems to do something weird at initialization that causes it to error out
-          # We can ignore the first invocation here.
-          pyright esm/__init__.py || true
-          pre-commit install
-          env NODE_OPTIONS="--max-old-space-size=16384" pre-commit run --all-files --show-diff-on-failure
-          [ -z "$(git status --porcelain)" ] && true || (echo "❌❌❌ pre-commit hook failed! A few files changed ❌❌❌]" && git status --porcelain && false)
-          git reset --hard HEAD # test without the pre-commit changes
-        shell: pixi run bash -e {0}
+        run: pixi run lint-all
 
 
   test-esm:
-    runs-on: ubuntu-22.04-16core
+    runs-on: ubuntu-24.04
 
     steps:
       - name: Checkout code
@@ -59,20 +50,18 @@ jobs:
           cache-write: ${{ github.event_name == 'push' && github.ref_name == 'main' }}
 
       - name: Run tests
-        run: |
-          set -o pipefail
-          pytest -v --junitxml=pytest.xml tests/ | tee pytest-coverage.txt
-        shell: pixi run bash -e {0}
+        run: pixi run cov-test
 
       - name: Run Docker tests
         env:
           DOCKER_TAG: ${{ github.sha }}
           FORGE_URL: https://forge.evolutionaryscale.ai/
+          ESM3_FORGE_TOKEN: ${{ secrets.ESM3_FORGE_TOKEN }}
         run: |
           set -e
           cd tests
           make build-oss-ci
-          make start-docker-oss URL=${{ env.FORGE_URL }} DOCKER_TAG=${{ env.DOCKER_TAG }} ESM3_FORGE_TOKEN=${{ secrets.ESM3_FORGE_TOKEN }}
+          make start-docker-oss URL=${{ env.FORGE_URL }} DOCKER_TAG=${{ env.DOCKER_TAG }} ESM3_FORGE_TOKEN=${{ env.ESM3_FORGE_TOKEN }}
         shell: pixi run bash -e {0}
 
       - name: cleanup docker containers if they're hanging

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -2,3 +2,4 @@ esm.egg-info
 # pixi environments
 .pixi
 *.egg-info
+*.pyc

.pre-commit-config.yaml

Lines changed: 33 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,33 @@
1+
# See https://pre-commit.com for more information
2+
# See https://pre-commit.com/hooks.html for more hooks
3+
exclude: (fasta|pdb|cif|mds|json)$
4+
repos:
5+
- repo: https://github.com/pre-commit/pre-commit-hooks
6+
rev: v3.2.0
7+
hooks:
8+
- id: trailing-whitespace
9+
- id: end-of-file-fixer
10+
- id: check-yaml
11+
- id: check-added-large-files
12+
exclude: pixi.lock
13+
- id: check-merge-conflict
14+
- repo: https://github.com/seddonym/import-linter
15+
rev: v1.12.1
16+
hooks:
17+
- id: import-linter
18+
- repo: https://github.com/astral-sh/ruff-pre-commit
19+
rev: v0.7.3
20+
hooks:
21+
- id: ruff # linter
22+
args: [ --fix ]
23+
- id: ruff-format # formatter
24+
types_or: [python, jupyter]
25+
- repo: https://github.com/RobertCraigie/pyright-python
26+
rev: v1.1.399
27+
hooks:
28+
- id: pyright
29+
name: pyright
30+
entry: pyright
31+
language: system
32+
types: [python]
33+
pass_filenames: true # For speed, we only check the files that are changed
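
The top-level `exclude` pattern in the config above is a Python regular expression that pre-commit searches for anywhere in each file path. The sketch below reproduces that matching with the standard `re` module to show which paths the hooks would skip. The helper name `is_excluded` and the sample paths are made up for illustration and are not part of the repository.

```python
import re

# Same pattern as the top-level `exclude` key in .pre-commit-config.yaml.
EXCLUDE = re.compile(r"(fasta|pdb|cif|mds|json)$")


def is_excluded(path: str) -> bool:
    """Return True if the path matches the exclude pattern (search, not full match)."""
    return EXCLUDE.search(path) is not None


# Hypothetical paths, chosen only to exercise the pattern.
sample_paths = [
    "tests/data/1abc.pdb",
    "data/seqs.fasta",
    "esm/models/esm3.py",
    "pixi.lock",
    "notes/not_json",
]

for path in sample_paths:
    print(f"{path}: {'skipped' if is_excluded(path) else 'checked'}")
```

Because the pattern is not anchored to a dot, any path that merely ends in one of the listed strings (such as the made-up `notes/not_json`) is skipped as well. A stricter variant would be `\.(fasta|pdb|cif|mds|json)$`.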

README.md

Lines changed: 26 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -9,6 +9,7 @@
99

1010

1111
- [Installation ](#installation-)
12+
- [Available Models](#available-models-)
1213
- [ESM 3](#esm-3-)
1314
- [Quickstart for ESM3 Open](#esm3-quickstart-)
1415
- [ESM3 98B via Forge API](#esm3-forge)
@@ -33,6 +34,31 @@ To get started with ESM, install the python library using pip:
3334
pip install esm
3435
```
3536

37+
## Available Models <a name="available-models"></a>
38+
39+
### ESM 3 Family
40+
41+
| Model | Model Size | Release Date | Note |
42+
|-------|------------|--------------|------|
43+
| **Flagship Models** | | | Most users will be interested in using one of these models. |
44+
| esm3-large-2024-03 | 98B | 2024-03 | |
45+
| esm3-medium-2024-08 | 7B | 2024-08 | |
46+
| esm3-small-2024-08 | 1.4B | 2024-08 | |
47+
| **Published Models** | | | These models were used to generate all of the results in the ESM3 paper and are provided to facilitate reproducibility. |
48+
| esm3-large-2024-03 | 98B | 2024-03 | |
49+
| esm3-medium-2024-03 | 7B | 2024-03 | |
50+
| esm3-small-2024-03 | 1.4B | 2024-03 | |
51+
| **Experimental Models** | | | These models are provided for early use by researchers and are still under development. |
52+
| esm3-medium-multimer-2024-09 | 7B | 2024-09 | |
53+
54+
### ESM C Models
55+
56+
| Model | Model Size | Number of Layers | Release Date |
57+
|-------|------------|------------------|--------------|
58+
| esmc-6b-2024-12 | 6B | 80 | 2024-12 |
59+
| esmc-600m-2024-12 | 600M | 36 | 2024-12 |
60+
| esmc-300m-2024-12 | 300M | 30 | 2024-12 |
61+
3662
## ESM 3 <a name="esm3"></a>
3763

3864
[ESM3](https://www.evolutionaryscale.ai/papers/esm3-simulating-500-million-years-of-evolution-with-a-language-model) is a frontier generative model for biology, able to jointly reason across three fundamental biological properties of proteins: sequence, structure, and function. These three data modalities are represented as tracks of discrete tokens at the input and output of ESM3. You can present the model with a combination of partial inputs across the tracks, and ESM3 will provide output predictions for all the tracks.

cookbook/local/open_generate.ipynb

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -38,6 +38,7 @@
3838
"\n",
3939
"!pip install py3Dmol\n",
4040
"import py3Dmol\n",
41+
"\n",
4142
"from esm.models.esm3 import ESM3\n",
4243
"from esm.sdk.api import ESMProtein, GenerationConfig\n",
4344
"from esm.utils.structure.protein_chain import ProteinChain"

cookbook/local/raw_forwards.py

Lines changed: 1 addition & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -13,9 +13,7 @@
1313
from esm.tokenization.function_tokenizer import (
1414
InterProQuantizedTokenizer as EsmFunctionTokenizer,
1515
)
16-
from esm.tokenization.sequence_tokenizer import (
17-
EsmSequenceTokenizer,
18-
)
16+
from esm.tokenization.sequence_tokenizer import EsmSequenceTokenizer
1917
from esm.utils.structure.protein_chain import ProteinChain
2018
from esm.utils.types import FunctionAnnotation
2119

cookbook/tutorials/1_esmprotein.ipynb

Lines changed: 3 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -72,6 +72,7 @@
7272
"outputs": [],
7373
"source": [
7474
"from biotite.database import rcsb\n",
75+
"\n",
7576
"from esm.sdk.api import ESMProtein\n",
7677
"from esm.utils.structure.protein_chain import ProteinChain\n",
7778
"from esm.utils.types import FunctionAnnotation\n",
@@ -496,9 +497,10 @@
496497
"# Functions for visualizing InterPro function annotations\n",
497498
"\n",
498499
"from dna_features_viewer import GraphicFeature, GraphicRecord\n",
499-
"from esm.utils.function.interpro import InterPro, InterProEntryType\n",
500500
"from matplotlib import colormaps\n",
501501
"\n",
502+
"from esm.utils.function.interpro import InterPro, InterProEntryType\n",
503+
"\n",
502504
"\n",
503505
"def visualize_function_annotations(\n",
504506
" annotations: list[FunctionAnnotation],\n",

cookbook/tutorials/3_gfp_design.ipynb

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -64,6 +64,7 @@
6464
"import matplotlib.pyplot as pl\n",
6565
"import py3Dmol\n",
6666
"import torch\n",
67+
"\n",
6768
"from esm.sdk import client\n",
6869
"from esm.sdk.api import ESMProtein, GenerationConfig\n",
6970
"from esm.utils.structure.protein_chain import ProteinChain"

cookbook/tutorials/4_forge_generate.ipynb

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -36,6 +36,7 @@
3636
"\n",
3737
"!pip install py3Dmol\n",
3838
"import py3Dmol\n",
39+
"\n",
3940
"from esm.sdk import client\n",
4041
"from esm.sdk.api import ESMProtein, GenerationConfig\n",
4142
"from esm.utils.structure.protein_chain import ProteinChain"

cookbook/tutorials/5_guided_generation.ipynb

Lines changed: 64 additions & 9 deletions
Original file line numberDiff line numberDiff line change
@@ -14,13 +14,13 @@
1414
"3. Minimize a biophysical energy function\n",
1515
"4. Use experimental screening data to guide designs with a regression model\n",
1616
"\n",
17-
"As long as your scoring function takes a protein as input and outputs a single score, you can use it to guide designs. To accomplish this, we use an implementation of derivative-free guidance inspired by Soft Value-Based Decoding described in [Li, et al 2024](https://arxiv.org/abs/2408.08252).\n",
17+
"As long as your scoring function takes a protein as input and outputs a single score, you can use it to guide designs. To accomplish this, we use an implementation of derivative-free guidance inspired by Soft Value-Based Decoding described in [Li, et al 2024](https://arxiv.org/abs/2408.08252) and constrained optimization using the Modified Differential Method of Multipliers from [Platt & Barr 1987](https://proceedings.neurips.cc/paper_files/paper/1987/file/a1126573153ad7e9f44ba80e99316482-Paper.pdf)\n",
1818
"\n",
1919
"In this notebook we will walk through a few examples to illustrate how to use guided generation. \n",
2020
"\n",
2121
"1. Guide towards high pTM for improved generation quality\n",
2222
"2. Generate a protein with no cysteine (C) residues\n",
23-
"3. Maximize protein globularity by minimizing the radius of gyration\n",
23+
"3. Maximize protein globularity by minimizing the radius of gyration, while keeping pTM high\n",
2424
"\n"
2525
]
2626
},
@@ -49,6 +49,7 @@
4949
"source": [
5050
"import biotite.structure as bs\n",
5151
"import py3Dmol\n",
52+
"\n",
5253
"from esm.sdk.api import ESMProtein, GenerationConfig\n",
5354
"from esm.sdk.experimental import ESM3GuidedDecoding, GuidedDecodingScoringFunction"
5455
]
@@ -269,6 +270,11 @@
269270
"metadata": {},
270271
"outputs": [],
271272
"source": [
273+
"# Start from a fully masked protein\n",
274+
"PROTEIN_LENGTH = 256\n",
275+
"starting_protein = ESMProtein(sequence=\"_\" * PROTEIN_LENGTH)\n",
276+
"\n",
277+
"# Call guided_generate\n",
272278
"no_cysteine_protein = no_cysteine_guided_decoding.guided_generate(\n",
273279
" protein=starting_protein,\n",
274280
" num_decoding_steps=len(starting_protein) // 8,\n",
@@ -302,7 +308,20 @@
302308
"source": [
303309
"## Maximize Globularity\n",
304310
"\n",
305-
"We use the radius of gyration as a proxy to maximize globularity, we also encourage generations to have high pTM"
311+
"We use the radius of gyration as a proxy to maximize globularity, and we will also encourage generations to have high pTM by using constraints"
312+
]
313+
},
314+
{
315+
"cell_type": "code",
316+
"execution_count": null,
317+
"metadata": {},
318+
"outputs": [],
319+
"source": [
320+
"from esm.sdk.experimental import (\n",
321+
" ConstraintType,\n",
322+
" ESM3GuidedDecodingWithConstraints,\n",
323+
" GenerationConstraint,\n",
324+
")"
306325
]
307326
},
308327
{
@@ -313,12 +332,11 @@
313332
"source": [
314333
"class RadiousOfGyrationScoringFunction(GuidedDecodingScoringFunction):\n",
315334
" def __call__(self, protein: ESMProtein) -> float:\n",
335+
" # Use the negative radius of gyration as the score to maximize\n",
316336
" score = -1 * self.radius_of_gyration(protein)\n",
317337
"\n",
318-
" assert protein.ptm is not None, \"Protein must have pTM scores to be scored\"\n",
319-
" if protein.ptm < 0.5:\n",
320-
" # Penalize proteins with low pTM scores\n",
321-
" score = score * 2\n",
338+
" # Re-scale the score to be in a similar magnitude as pTM\n",
339+
" score = score / 100.0\n",
322340
"\n",
323341
" return score\n",
324342
"\n",
@@ -335,8 +353,19 @@
335353
"metadata": {},
336354
"outputs": [],
337355
"source": [
338-
"radius_guided_decoding = ESM3GuidedDecoding(\n",
339-
" client=model, scoring_function=RadiousOfGyrationScoringFunction()\n",
356+
"# Constrain generation to have pTM > 0.75\n",
357+
"ptm_constraint = GenerationConstraint(\n",
358+
" scoring_function=PTMScoringFunction(),\n",
359+
" constraint_type=ConstraintType.GREATER_EQUAL,\n",
360+
" value=0.75,\n",
361+
")\n",
362+
"\n",
363+
"radius_guided_decoding = ESM3GuidedDecodingWithConstraints(\n",
364+
" client=model,\n",
365+
" scoring_function=RadiousOfGyrationScoringFunction(),\n",
366+
" constraints=[ptm_constraint], # Add list of constraints\n",
367+
" damping=1.0, # Damping factor for the MMDM algorithm\n",
368+
" learning_rate=10.0, # Learning rate for the MMDM algorithm\n",
340369
")"
341370
]
342371
},
@@ -346,6 +375,11 @@
346375
"metadata": {},
347376
"outputs": [],
348377
"source": [
378+
"# Start from a fully masked protein\n",
379+
"PROTEIN_LENGTH = 256\n",
380+
"starting_protein = ESMProtein(sequence=\"_\" * PROTEIN_LENGTH)\n",
381+
"\n",
382+
"# Call guided_generate\n",
349383
"radius_guided_protein = radius_guided_decoding.guided_generate(\n",
350384
" protein=starting_protein,\n",
351385
" num_decoding_steps=len(starting_protein) // 8,\n",
@@ -359,11 +393,32 @@
359393
"metadata": {},
360394
"outputs": [],
361395
"source": [
396+
"# Visualize the trajectory of the constrained generation\n",
397+
"radius_guided_decoding.visualize_latest_trajectory()"
398+
]
399+
},
400+
{
401+
"cell_type": "code",
402+
"execution_count": null,
403+
"metadata": {},
404+
"outputs": [],
405+
"source": [
406+
"# Visualize the generated protein\n",
362407
"view = py3Dmol.view(width=800, height=400)\n",
363408
"view.addModel(radius_guided_protein.to_pdb_string(), \"pdb\")\n",
364409
"view.setStyle({\"cartoon\": {\"color\": \"spectrum\"}})\n",
365410
"view.zoomTo()"
366411
]
412+
},
413+
{
414+
"cell_type": "code",
415+
"execution_count": null,
416+
"metadata": {},
417+
"outputs": [],
418+
"source": [
419+
"# Check pTM\n",
420+
"radius_guided_protein.ptm"
421+
]
367422
}
368423
],
369424
"metadata": {
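
The guided-generation notebook above cites the Modified Differential Method of Multipliers (Platt & Barr 1987) and exposes `damping` and `learning_rate` parameters for it. As a rough intuition for what those two knobs do, here is a toy sketch of the method on a two-variable problem with analytic gradients. It is not the `esm` implementation, which the notebook describes as derivative-free. The function name `mdmm_minimize`, the objective, and the constraint are all invented for illustration.

```python
def mdmm_minimize(steps: int = 2000, learning_rate: float = 0.05, damping: float = 1.0):
    """Toy MDMM: minimize f(x, y) = x^2 + y^2 subject to g(x, y) = x + y - 1 = 0.

    Damped Lagrangian: L = f + lam * g + (damping / 2) * g^2.
    The variables follow gradient descent on L; the multiplier follows
    gradient ascent, i.e. it grows while the constraint is violated.
    """
    x, y, lam = 0.0, 0.0, 0.0
    for _ in range(steps):
        g = x + y - 1.0  # constraint violation at the current point
        # dL/dx = 2x + lam * dg/dx + damping * g * dg/dx, with dg/dx = dg/dy = 1
        grad_x = 2.0 * x + lam + damping * g
        grad_y = 2.0 * y + lam + damping * g
        x -= learning_rate * grad_x
        y -= learning_rate * grad_y
        lam += learning_rate * g  # ascent step on the multiplier
    return x, y, lam


x, y, lam = mdmm_minimize()
print(round(x, 4), round(y, 4), round(lam, 4))
```

With these settings the iterates settle at the constrained optimum x = y = 0.5 with multiplier -1. The damping term is what suppresses the oscillation between the variables and the multiplier that a plain Lagrangian descent/ascent scheme would show. An inequality such as the notebook's pTM threshold would need extra handling (for example a slack variable), which this sketch omits.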
