smahbub committed · Commit 6598e0b · verified · 1 Parent(s): 7367c4e

Update README.md

  1. README.md +21 -10
README.md CHANGED
@@ -5,7 +5,7 @@ license: other
  ---

  # Protein Inverse Folding
- We finetune the [AIDO.Protein-16B](https://huggingface.co/genbio-ai/AIDO.Protein-16B) model with LoRA on the [CATH 4.2](https://pubmed.ncbi.nlm.nih.gov/9309224/) benmark dataset. We use the same train, validation, and test splits used by the previous studies, such as [LM-Design](https://arxiv.org/abs/2302.01649), and [DPLM](https://arxiv.org/abs/2402.18567). Current version of ModelGenerator contains the inference pipeline for protein inverse folding. Experimental pipeline on other datasets (both training and testing) will be included in the future.

  #### Setup:
  Install [ModelGenerator](https://github.com/genbio-ai/modelgenerator).
@@ -31,15 +31,20 @@ Install [ModelGenerator](https://github.com/genbio-ai/modelgenerator).
  nvidia-smi # this should print the GPUs' details
  ```
  - Execute the following steps from **within** the docker container you just created.

- #### Download model checkpoints:

  - Download all the 15 model checkpoint chunks (named as `chunk_<chunk_ID>.bin`) from [here](https://huggingface.co/genbio-ai/AIDO.ProteinIF-16B/tree/main). Place them inside the directory `${MGEN_DATA_DIR}/modelgenerator/huggingface_models/protein_inv_fold/AIDO.ProteinIF-16B/model_chunks`.

- **Alternatively**, you can simply run the following script to do this (Note: this script uses the [wget](https://www.gnu.org/software/wget/) tool):
  ```
- mkdir -p ${MGEN_DATA_DIR}/modelgenerator/huggingface_models/protein_inv_fold/AIDO.ProteinIF-16B/model_chunks
- bash download_model_chunks.sh ${MGEN_DATA_DIR}/modelgenerator/huggingface_models/protein_inv_fold/AIDO.ProteinIF-16B/model_chunks
  ```

  #### Download data:
@@ -48,16 +53,22 @@ Install [ModelGenerator](https://github.com/genbio-ai/modelgenerator).
  **Alternatively**, you can do it by simply running the following script:
  ```
  mkdir -p ${MGEN_DATA_DIR}/modelgenerator/datasets/protein_inv_fold/cath_4.2/
- wget -P ${MGEN_DATA_DIR}/modelgenerator/datasets/protein_inv_fold/cath_4.2/ https://huggingface.co/datasets/genbio-ai/protein-inverse-folding/resolve/main/cath-4.2/chain_set_map.pkl
- wget -P ${MGEN_DATA_DIR}/modelgenerator/datasets/protein_inv_fold/cath_4.2/ https://huggingface.co/datasets/genbio-ai/protein-inverse-folding/resolve/main/cath-4.2/chain_set_splits.json
  ```

  #### Run inference:
- - Then run the bash script for inference:
  ```
- bash prot_inverse_folding.sh
  ```
- - **Note:** Multi-GPU inference for inverse folding is not currently supported and will be included in the future.

  #### Outputs:
  - The evaluation score will be printed on the console.
 
  ---

  # Protein Inverse Folding
+ Protein inverse folding represents a computational technique aimed at generating protein sequences that will fold into specific three-dimensional structures. The central challenge in protein inverse folding involves identifying sequences capable of reliably adopting the intended structure. In our research, we concentrate on designing sequences based on the known backbone structure of a protein, represented with 3D coordinates of the atoms of the backbone (without any information about what the individual amino-acids are). Specifically, we finetune the [AIDO.Protein-16B](https://huggingface.co/genbio-ai/AIDO.Protein-16B) model with LoRA on the [CATH 4.2](https://pubmed.ncbi.nlm.nih.gov/9309224/) benchmark dataset. We use the same train, validation, and test splits used by the previous studies, such as [LM-Design](https://arxiv.org/abs/2302.01649), and [DPLM](https://arxiv.org/abs/2402.18567). Current version of ModelGenerator contains the inference pipeline for protein inverse folding. Experimental pipeline on other datasets (both training and testing) will be included in the future.

  #### Setup:
  Install [ModelGenerator](https://github.com/genbio-ai/modelgenerator).
 
  nvidia-smi # this should print the GPUs' details
  ```
  - Execute the following steps from **within** the docker container you just created.
+ - **Note:** Multi-GPU inference for inverse folding is not currently supported and will be included in the future.

+ #### Download and merge model checkpoint chunks:

  - Download all the 15 model checkpoint chunks (named as `chunk_<chunk_ID>.bin`) from [here](https://huggingface.co/genbio-ai/AIDO.ProteinIF-16B/tree/main). Place them inside the directory `${MGEN_DATA_DIR}/modelgenerator/huggingface_models/protein_inv_fold/AIDO.ProteinIF-16B/model_chunks`.

+ **Alternatively**, you can do this by simply running the following script:
  ```
+ mkdir -p ${MGEN_DATA_DIR}/modelgenerator/huggingface_models/protein_inv_fold/AIDO.ProteinIF-16B/
+ huggingface-cli download genbio-ai/AIDO.ProteinIF-16B \
+ --repo-type model \
+ --local-dir ${MGEN_DATA_DIR}/modelgenerator/huggingface_models/protein_inv_fold/AIDO.ProteinIF-16B/
+ # Merge chunks
+ python merge_ckpt.py ${MGEN_DATA_DIR}/modelgenerator/huggingface_models/protein_inv_fold/AIDO.ProteinIF-16B/model_chunks ${MGEN_DATA_DIR}/modelgenerator/huggingface_models/protein_inv_fold/AIDO.ProteinIF-16B/model.ckpt
  ```
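
Before merging, it can help to confirm that all 15 chunk files actually arrived. The following is a minimal sketch and not part of ModelGenerator: the helper name `count_chunks` is hypothetical, and the script assumes only the `chunk_<chunk_ID>.bin` naming and the `model_chunks` directory described in the README text.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of ModelGenerator): count the files named
# chunk_*.bin that sit directly inside the given directory.
count_chunks() {
  local dir="$1"
  find "$dir" -maxdepth 1 -type f -name 'chunk_*.bin' | wc -l | tr -d ' '
}

# Directory taken from the README; falls back to "." if MGEN_DATA_DIR is unset.
CHUNK_DIR="${MGEN_DATA_DIR:-.}/modelgenerator/huggingface_models/protein_inv_fold/AIDO.ProteinIF-16B/model_chunks"
EXPECTED=15  # number of chunks stated in the README

if [ -d "$CHUNK_DIR" ]; then
  found="$(count_chunks "$CHUNK_DIR")"
  if [ "$found" -eq "$EXPECTED" ]; then
    echo "OK: found all $EXPECTED chunks"
  else
    echo "WARNING: found $found of $EXPECTED chunks in $CHUNK_DIR"
  fi
else
  echo "WARNING: $CHUNK_DIR does not exist"
fi
```

Running this between the download and the `merge_ckpt.py` call makes an incomplete download visible before the merge step is attempted.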

  #### Download data:

  **Alternatively**, you can do it by simply running the following script:
  ```
  mkdir -p ${MGEN_DATA_DIR}/modelgenerator/datasets/protein_inv_fold/cath_4.2/
+ huggingface-cli download genbio-ai/protein-inverse-folding \
+ --repo-type dataset \
+ --local-dir ${MGEN_DATA_DIR}/modelgenerator/datasets/protein_inv_fold
  ```
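
After the download, a quick check that the dataset files are where `--data.path` will look can save a failed inference run. This is a minimal sketch under stated assumptions: the helper name `missing_files` is hypothetical, and the two file names (`chain_set_map.pkl`, `chain_set_splits.json`) are taken from the previous `wget`-based instructions, so they may not match what the dataset repository currently ships.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of ModelGenerator): print each expected file
# that is missing from the given directory, one name per line.
missing_files() {
  local dir="$1"
  shift
  local name
  for name in "$@"; do
    [ -f "$dir/$name" ] || echo "$name"
  done
}

# Directory that --data.path points at; falls back to "." if MGEN_DATA_DIR is unset.
DATA_DIR="${MGEN_DATA_DIR:-.}/modelgenerator/datasets/protein_inv_fold/cath_4.2"
# Assumed file names, taken from the previous wget-based download instructions.
missing="$(missing_files "$DATA_DIR" chain_set_map.pkl chain_set_splits.json)"

if [ -z "$missing" ]; then
  echo "OK: dataset files found in $DATA_DIR"
else
  echo "WARNING: missing from $DATA_DIR:"
  echo "$missing"
fi
```

If the check reports missing files, listing the contents of `${MGEN_DATA_DIR}/modelgenerator/datasets/protein_inv_fold` shows where the download actually placed them.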

  #### Run inference:
+ - From your terminal, change to the `experiments/AIDO.Protein/protein_inverse_folding` directory and run the following commands:
  ```
+ cd experiments/AIDO.Protein/protein_inverse_folding
+ # Run inference
+ mgen test --config protein_inv_fold_test.yaml \
+ --trainer.default_root_dir ${MGEN_DATA_DIR}/modelgenerator/logs/protein_inv_fold/ \
+ --ckpt_path ${MGEN_DATA_DIR}/modelgenerator/huggingface_models/protein_inv_fold/AIDO.ProteinIF-16B/model.ckpt \
+ --trainer.devices 0, \
+ --data.path ${MGEN_DATA_DIR}/modelgenerator/datasets/protein_inv_fold/cath_4.2/
  ```

  #### Outputs:
  - The evaluation score will be printed on the console.