|
### Format of the input JSON file |
|
The JSON file format closely resembles that used by the AlphaFold Server, with a few key differences: |
|
|
|
1. There are no restrictions on the types of ligands, ions, and modifications, whereas the AlphaFold Server currently supports only a limited set of specific CCD codes. |
|
2. Users can specify bonds between entities, such as covalent bonds between ligands and polymers. |
|
3. It supports inputting ligands in the form of SMILES strings or molecular structure files. |
|
4. Ligands composed of multiple CCD codes can be treated as a single entity. This feature is useful for representing glycans, for example, "NAG-NAG". |
|
5. The "glycans" field is no longer supported. Glycans can be fully represented by inputting multiple ligands with defined bonding or by providing their SMILES strings. |
|
|
|
Here is an overview of the JSON file format: |
|
```json |
|
[ |
|
{ |
|
"name": "Test Fold Job Number One", |
|
"sequences": [...], |
|
"covalent_bonds": [...] |
|
} |
|
] |
|
``` |
|
The JSON file consists of a list of dictionaries, where each dictionary represents a set of sequences you want to model. |
|
Even if you are modeling only one set of sequences, the top-level structure should still be a list. |
|
|
|
Each dictionary contains the following three keys: |
|
* `name`: A string representing the name of the inference job. |
|
* `sequences`: A list of dictionaries that describe the entities (e.g., proteins, DNA, RNA, small molecules, and ions) involved in the inference. |
|
* `covalent_bonds`: An optional list of dictionaries that define the covalent bonds between atoms from different entities. |
|
|
|
Details of `sequences` and `covalent_bonds` are provided below. |
|
|
|
#### sequences |
|
There are 5 kinds of supported sequences: |
|
* `proteinChain` β used for proteins |
|
* `dnaSequence` β used for DNA (single strand) |
|
* `rnaSequence` β used for RNA (single strand) |
|
* `ligand` β used for ligands |
|
* `ion` β used for ions |
|
|
|
##### proteinChain |
|
```json |
|
{ |
|
"proteinChain": { |
|
"sequence": "PREACHINGS", |
|
"count": 1, |
|
"modifications": [ |
|
{ |
|
"ptmType": "CCD_HY3", |
|
"ptmPosition": 1, |
|
}, |
|
{ |
|
"ptmType": "CCD_P1L", |
|
"ptmPosition": 5 |
|
} |
|
], |
|
"msa":{ |
|
"precomputed_msa_dir": "./precomputed_msa", |
|
"pairing_db": "uniref100", |
|
}, |
|
}, |
|
} |
|
``` |
|
* `sequence`: A string representating a protein sequence, which can only contain the 20 standard amino acid type and X (UNK) for unknown residues. |
|
* `count`: The number of copies of this protein chain (integer). |
|
* `modifications`: An optional list of dictionaries that describe post-translational modifications. |
|
|
|
* `ptmType`: A string containing CCD code of the modification. |
|
* `ptmPosition`: The position of the modified amino acid (integer). |
|
* `msa`: A dictionary containing options for Multiple Sequence Alignment (MSA). **If you want to search MSAs using our inference pipeline, you should not set this field or set it to an empty dictionary**: |
|
* `precomputed_msa_dir`: The path to a directory containing precomputed MSAs. This directory should contain two specific files: "pairing.a3m" for MSAs used for pairing, and "non_pairing.a3m" for non-pairing MSAs. |
|
* `pairing_db`: The name of the genomic database used for pairing MSAs. The default is "uniref100" and you should not change it. In fact, The MSA search against the UniRef30, a clustered version of the UniRef100. |
|
|
|
##### dnaSequence |
|
```json |
|
{ |
|
"dnaSequence": { |
|
"sequence": "GATTACA", |
|
"modifications": [ |
|
{ |
|
"modificationType": "CCD_6OG", |
|
"basePosition": 1 |
|
}, |
|
{ |
|
"modificationType": "CCD_6MA", |
|
"basePosition": 2 |
|
} |
|
], |
|
"count": 1 |
|
} |
|
}, |
|
{ |
|
"dnaSequence": { |
|
"sequence": "TGTAATC", |
|
"count": 1 |
|
} |
|
} |
|
``` |
|
Please note that the `dnaSequence` type refers to a single stranded DNA sequence. If you |
|
wish to model double-stranded DNA, please add a second `dnaSequence` entry representing |
|
the sequence of the reverse complement strand. |
|
|
|
* `sequence`: A string containing a DNA sequence; only letters A, T, G, C and N (unknown ribonucleotide) are allowed. |
|
* `count`: The number of copies of this DNA chain (integer). |
|
* `modifications`: An optional list of dictionaries describing of |
|
the DNA chemical modifications: |
|
* `modificationType`: A string containing CCD code of modification. |
|
* `basePosition`: A position of the modified nucleotide (integer). |
|
|
|
##### rnaSequence |
|
```json |
|
{ |
|
"rnaSequence": { |
|
"sequence": "GUAC", |
|
"modifications": [ |
|
{ |
|
"modificationType": "CCD_2MG", |
|
"basePosition": 1 |
|
}, |
|
{ |
|
"modificationType": "CCD_5MC", |
|
"basePosition": 4 |
|
} |
|
], |
|
"count": 1 |
|
} |
|
} |
|
``` |
|
* `sequence`: A string representing the RNA sequence (single-stranded); only letters A, U, G, C and N (unknown nucleotides) are allowed. |
|
* `count`: The number of copies of this RNA chain (integer). |
|
* `modifications`: An optional list of dictionaries describing RNA chemical modifications: |
|
* `modificationType`: A string containing |
|
CCD code of modification. |
|
* `basePosition`: The position of the modified nucleotide (integer). |
|
|
|
##### ligand |
|
```json |
|
{ |
|
"ligand": { |
|
"ligand": "CCD_ATP", |
|
"count": 1 |
|
} |
|
}, |
|
{ |
|
"ligand": { |
|
"ligand": "FILE_your_file_path/atp.sdf", |
|
"count": 1 |
|
} |
|
}, |
|
{ |
|
"ligand": { |
|
"ligand": "Nc1ncnc2c1ncn2[C@@H]1O[C@H](CO[P@@](=O)(O)O[P@](=O)(O)OP(=O)(O)O)[C@@H](O)[C@H]1O", |
|
"count": 1 |
|
} |
|
} |
|
``` |
|
* `ligand`: A string representing the ligand. `ligand` can be one of the following three: |
|
* A string containing the CCD code of the ligand, prefixed with "CCD_". For glycans or similar structures, this can be a concatenation of multiple CCD codes, for example, "CCD_NAG_BMA_BGC". |
|
* A molecular SMILES string representing the ligand. |
|
* A path to a molecular structure file, prefixed with "FILE_", where the supported file formats are PDB, SDF, MOL, and MOL2. The file must include the 3D conformation of the molecule. |
|
|
|
* `count` is the number of copies of this ligand (integer). |
|
|
|
##### ion |
|
```json |
|
{ |
|
"ion": { |
|
"ion": "MG", |
|
"count": 2 |
|
} |
|
}, |
|
{ |
|
"ion": { |
|
"ion": "NA", |
|
"count": 3 |
|
} |
|
} |
|
``` |
|
* `ion`: A string containing the CCD code for the ion. Note that, unlike ligands, the ion code **does not** start with "CCD_". |
|
* `count`: The number of copies of this ion (integer). |
|
|
|
#### covalent_bonds |
|
```json |
|
"covalent_bonds": [ |
|
{ |
|
"entity1": "2", |
|
"copy1": 1, |
|
"position1": "2", |
|
"atom1": "N6", |
|
"entity2": "3", |
|
"copy2": 1, |
|
"position2": "1", |
|
"atom2": "C1" |
|
} |
|
] |
|
``` |
|
|
|
The `covalent_bonds` section specifies covalent bonds between a polymer and a ligand, or between two ligands. |
|
To define a covalent bond, two atoms involved in the bond must be identified. The following fields are used: |
|
|
|
* `entity1`, `entity2`: The entity numbers for the two atoms involved in the bond. |
|
The entity number corresponds to the order in which the entity appears in the `sequences` list, starting from 1. |
|
* `copy2`, `copy2`: The copy index (starting from 1) of the `left_entity` and `right_entity`, respectively. These fields are optional, but if specified, both `left_copy` and `right_copy` must be filled simultaneously or left empty at the same time. If neither field is provided, a bond will be created between all pairs of copies of the two entities. For example, if both entity1 and entity2 have two copies, a bond will be formed between entity1.copy1 and entity2.copy1, as well as between entity1.copy2 and entity2.copy2. In this case, the number of copies for both entities must be equal. |
|
* `position1`, `position2` - The position of the residue (or ligand part) within the entity. |
|
The position value starts at 1 and can vary based on the type of entity: |
|
* For **polymers** (e.g., proteins, DNA, RNA), the position corresponds to the location of the residue in the sequence. |
|
* For **ligands** composed of multiple CCD codes, the position refers to the serial number of the CCD code. |
|
* For **single CCD code ligands**, or ligands defined by **SMILES** or **FILE**, the position is always set to 1. |
|
|
|
* `atom1`, `atom2` - The atom names (or atom indices) of the atoms to be bonded. |
|
* If the entity is a polymer or described by a CCD code, the atom names are consistent with those defined in the CCD. |
|
* If the entity is a ligand defined by SMILES or a FILE, atoms can be specified by their atom index. The atom index corresponds to the position of the atom in the file or in the SMILES string, starting from 0. |
|
|
|
Deprecation Notice: The previous fields such as old `left_entity`, `right_entity`, and other fields starting with `left`/`right` have been updated to use `1` and `2` to denote the two atoms forming a bond. The current code still supports the old field names, but they may be deprecated in the future, leaving only the new field names. |
|
|
|
### Format of the model output |
|
The outputs will be saved in the directory provided via the `--dump_dir` flag in the inference script. The outputs include the predicted structures in CIF format and the confidence in JSON files. The `--dump_dir` will have the following structure: |
|
|
|
```bash |
|
βββ <name>/ # specified in the input JSON file |
|
β βββ <seed>/ # specified via the `--seeds` flag in the inference script |
|
β β βββ <name>_<seed>_sample_0.cif |
|
β β βββ <name>_<seed>_summary_confidence_sample_0.json |
|
β β βββ... # the number of samples in each seed is specified via `--sample_diffusion.N_sample ` flag in the inference script |
|
β βββ... |
|
βββ ... |
|
``` |
|
|
|
The contents of each output file are as follows: |
|
- `<name>_<seed>_sample_*.cif` - A CIF format text file containing the predicted structure |
|
- `<name>_<seed>_summary_confidence_sample_*.json` - A JSON format text file containing various confidence scores for assessing the reliability of predictions. Hereβs a description of each score: |
|
|
|
- `plddt` - Predicted Local Distance Difference Test (pLDDT) score. Higher values indicate greater confidence. |
|
- `gpde` - Globl Predicted Distance Error (PDE) score. Lower values indicate greater confidence. |
|
- `ptm` - Predicted TM-score (pTM). Values closer to 1 indicate greater confidence. |
|
- `iptm` - Interface Predicted TM-score, used to estimate the accuracy of interfaces between chains. Values closer to 1 indicate greater confidence. |
|
- `chain_ptm` - pTM score calculated for individual chains with the shape of [N_chains], indicating the reliability of specific chain structure. |
|
- `chain_pair_iptm`: Pairwise interface pTM scores between chain pairs with the shape of [N_chains, N_chains], indicating the reliability of specific chain-chain interactions. |
|
- `chain_iptm` - Average ipTM scores for each chain with the shape of [N_chains]. |
|
- `chain_pair_iptm_global` - Averge `chain_iptm` between chain pairs with the shape of [N_chains, N_chains]. For interface containing a small molecule, ion, or bonded ligand chain (named `C*`), this value is equal to the `chain_iptm` value of `C*`. |
|
- `chain_plddt` - pLDDT scores calculated for individual chains with the shape of [N_chains]. |
|
- `chain_pair_plddt` - Pairwise pLDDT scores for chain pairs with the shape of [N_chains, N_chains]. |
|
- `has_clash` - Boolean flag indicating if there are steric clashes in the predicted structure. |
|
- `disorder` - Predicted regions of intrinsic disorder within the protein, highlighting residues that may be flexible or unstructured. |
|
- `ranking_score` - Predicted confidence score for ranking complexes. Higher values indicate greater confidence. |
|
- `num_recycles`: Number of recycling steps used during inference. |