|
--- |
|
license: other |
|
tags: |
|
- biology |
|
- RNA |
|
- Torsional |
|
- Angles |
|
pipeline_tag: token-classification |
|
base_model: |
|
- zhihan1996/DNA_bert_3 |
|
--- |
|
# `RNA-TorsionBERT` |
|
|
|
## Model Description |
|
|
|
`RNA-TorsionBERT` is a 86.9 MB parameter BERT-based language model that predicts RNA torsional and pseudo-torsional angles from the sequence. |
|
|
|
`RNA-TorsionBERT` is a DNABERT model that was pre-trained on ~4200 RNA structures. |
|
|
|
It provides improvement of [MCQ](https://github.com/tzok/mcq4structures) over the previous state-of-the-art models like |
|
[SPOT-RNA-1D](https://github.com/jaswindersingh2/SPOT-RNA-1D) or inferred angles from existing methods, on the Test Set (composed of RNA-Puzzles and CASP-RNA). |
|
|
|
|
|
**Key Features** |
|
* Torsional and Pseudo-torsional angles prediction |
|
* Predict sequences up to 512 nucleotides |
|
|
|
## Usage |
|
|
|
Get started generating text with `RNA-TorsionBERT` by using the following code snippet: |
|
|
|
```python |
|
from transformers import AutoModel, AutoTokenizer |
|
|
|
tokenizer = AutoTokenizer.from_pretrained("sayby/rna_torsionbert", trust_remote_code=True) |
|
model = AutoModel.from_pretrained("sayby/rna_torsionbert", trust_remote_code=True) |
|
|
|
sequence = "ACG CGG GGT GTT" |
|
params_tokenizer = { |
|
"return_tensors": "pt", |
|
"padding": "max_length", |
|
"max_length": 512, |
|
"truncation": True, |
|
} |
|
inputs = tokenizer(sequence, **params_tokenizer) |
|
output = model(inputs)["logits"] |
|
``` |
|
|
|
- Please note that it was fine-tuned from a DNABERT-3 model and therefore the tokenizer is the same as the one used for DNABERT. Nucleotide `U` should therefore be replaced by `T` in the input sequence. |
|
- The output is the sinus and the cosine for each angle. The angles are in the following order: `alpha`, `beta`,`gamma`,`delta`,`epsilon`,`zeta`,`chi`,`eta`,`theta`,`eta'`,`theta'`,`v0`,`v1`,`v2`,`v3`,`v4`. |
|
|
|
To convert the predictions into angles, you can use the following code snippet: |
|
|
|
```python |
|
import transformers |
|
from transformers import AutoModel, AutoTokenizer |
|
import numpy as np |
|
import pandas as pd |
|
from typing import Optional, Dict |
|
import os |
|
|
|
os.environ["TOKENIZERS_PARALLELISM"] = "false" |
|
|
|
transformers.logging.set_verbosity_error() |
|
|
|
|
|
BACKBONE = [ |
|
"alpha", |
|
"beta", |
|
"gamma", |
|
"delta", |
|
"epsilon", |
|
"zeta", |
|
"chi", |
|
"eta", |
|
"theta", |
|
"eta'", |
|
"theta'", |
|
"v0", |
|
"v1", |
|
"v2", |
|
"v3", |
|
"v4", |
|
] |
|
|
|
|
|
class RNATorsionBERTHelper: |
|
def __init__(self): |
|
self.model_name = "sayby/rna_torsionbert" |
|
self.tokenizer = AutoTokenizer.from_pretrained( |
|
self.model_name, trust_remote_code=True |
|
) |
|
self.params_tokenizer = { |
|
"return_tensors": "pt", |
|
"padding": "max_length", |
|
"max_length": 512, |
|
"truncation": True, |
|
} |
|
self.model = AutoModel.from_pretrained(self.model_name, trust_remote_code=True) |
|
|
|
def predict(self, sequence: str): |
|
sequence_tok = self.convert_raw_sequence_to_k_mers(sequence) |
|
inputs = self.tokenizer(sequence_tok, **self.params_tokenizer) |
|
outputs = self.model(inputs)["logits"] |
|
outputs = self.convert_sin_cos_to_angles( |
|
outputs.cpu().detach().numpy(), inputs["input_ids"] |
|
) |
|
output_angles = self.convert_logits_to_dict( |
|
outputs[0, :], inputs["input_ids"][0, :].cpu().detach().numpy() |
|
) |
|
output_angles.index = list(sequence)[:-2] # Because of the 3-mer representation |
|
return output_angles |
|
|
|
def convert_raw_sequence_to_k_mers(self, sequence: str, k_mers: int = 3): |
|
""" |
|
Convert a raw RNA sequence into sequence readable for the tokenizer. |
|
It converts the sequence into k-mers, and replace U by T |
|
:return: input readable by the tokenizer |
|
""" |
|
sequence = sequence.upper().replace("U", "T") |
|
k_mers_sequence = [ |
|
sequence[i : i + k_mers] |
|
for i in range(len(sequence)) |
|
if len(sequence[i : i + k_mers]) == k_mers |
|
] |
|
return " ".join(k_mers_sequence) |
|
|
|
def convert_sin_cos_to_angles( |
|
self, output: np.ndarray, input_ids: Optional[np.ndarray] = None |
|
): |
|
""" |
|
Convert the raw predictions of the RNA-TorsionBERT into angles. |
|
It converts the cos and sinus into angles using: |
|
alpha = arctan(sin(alpha)/cos(alpha)) |
|
:param output: Dictionary with the predictions of the RNA-TorsionBERT per angle |
|
:param input_ids: the input_ids of the RNA-TorsionBERT. It allows to only select the of the sequence, |
|
and not the special tokens. |
|
:return: a np.ndarray with the angles for the sequence |
|
""" |
|
if input_ids is not None: |
|
output[ |
|
(input_ids == 0) |
|
| (input_ids == 2) |
|
| (input_ids == 3) |
|
| (input_ids == 4) |
|
] = np.nan |
|
pair_indexes, impair_indexes = np.arange(0, output.shape[-1], 2), np.arange( |
|
1, output.shape[-1], 2 |
|
) |
|
sin, cos = output[:, :, impair_indexes], output[:, :, pair_indexes] |
|
tan = np.arctan2(sin, cos) |
|
angles = np.degrees(tan) |
|
return angles |
|
|
|
def convert_logits_to_dict(self, output: np.ndarray, input_ids: np.ndarray) -> Dict: |
|
""" |
|
Convert the raw predictions into dictionary format. |
|
It removes the special tokens and only keeps the predictions for the sequence. |
|
:param output: predictions from the models in angles |
|
:param input_ids: input ids from the tokenizer |
|
:return: a dictionary with the predictions for each angle |
|
""" |
|
index_start, index_end = ( |
|
np.where(input_ids == 2)[0][0], |
|
np.where(input_ids == 3)[0][0], |
|
) |
|
output_non_pad = output[index_start + 1 : index_end, :] |
|
output_angles = { |
|
angle: output_non_pad[:, angle_index] |
|
for angle_index, angle in enumerate(BACKBONE) |
|
} |
|
out = pd.DataFrame(output_angles) |
|
return out |
|
|
|
|
|
if __name__ == "__main__": |
|
sequence = "AGGGCUUUAGUCUUUGGAG" |
|
rna_torsionbert_helper = RNATorsionBERTHelper() |
|
output_angles = rna_torsionbert_helper.predict(sequence) |
|
print(output_angles) |
|
``` |