rna_torsionBERT / README.md

Add(last version: new angles + add TB-MCQ)

fe64e43 verified 3 months ago

6.18 kB

	---
	license: other
	tags:
	- biology
	- RNA
	- Torsional
	- Angles
	pipeline_tag: token-classification
	base_model:
	- zhihan1996/DNA_bert_3
	---
	# `RNA-TorsionBERT`

	## Model Description

	`RNA-TorsionBERT` is a 86.9 MB parameter BERT-based language model that predicts RNA torsional and pseudo-torsional angles from the sequence.

	`RNA-TorsionBERT` is a DNABERT model that was pre-trained on ~4200 RNA structures.

	It provides improvement of [MCQ](https://github.com/tzok/mcq4structures) over the previous state-of-the-art models like
	[SPOT-RNA-1D](https://github.com/jaswindersingh2/SPOT-RNA-1D) or inferred angles from existing methods, on the Test Set (composed of RNA-Puzzles and CASP-RNA).


	Key Features
	* Torsional and Pseudo-torsional angles prediction
	* Predict sequences up to 512 nucleotides

	## Usage

	Get started generating text with `RNA-TorsionBERT` by using the following code snippet:

	```python
	from transformers import AutoModel, AutoTokenizer

	tokenizer = AutoTokenizer.from_pretrained("sayby/rna_torsionbert", trust_remote_code=True)
	model = AutoModel.from_pretrained("sayby/rna_torsionbert", trust_remote_code=True)

	sequence = "ACG CGG GGT GTT"
	params_tokenizer = {
	"return_tensors": "pt",
	"padding": "max_length",
	"max_length": 512,
	"truncation": True,
	}
	inputs = tokenizer(sequence, **params_tokenizer)
	output = model(inputs)["logits"]
	```

	- Please note that it was fine-tuned from a DNABERT-3 model and therefore the tokenizer is the same as the one used for DNABERT. Nucleotide `U` should therefore be replaced by `T` in the input sequence.
	- The output is the sinus and the cosine for each angle. The angles are in the following order: `alpha`, `beta`,`gamma`,`delta`,`epsilon`,`zeta`,`chi`,`eta`,`theta`,`eta'`,`theta'`,`v0`,`v1`,`v2`,`v3`,`v4`.

	To convert the predictions into angles, you can use the following code snippet:

	```python
	import transformers
	from transformers import AutoModel, AutoTokenizer
	import numpy as np
	import pandas as pd
	from typing import Optional, Dict
	import os

	os.environ["TOKENIZERS_PARALLELISM"] = "false"

	transformers.logging.set_verbosity_error()


	BACKBONE = [
	"alpha",
	"beta",
	"gamma",
	"delta",
	"epsilon",
	"zeta",
	"chi",
	"eta",
	"theta",
	"eta'",
	"theta'",
	"v0",
	"v1",
	"v2",
	"v3",
	"v4",
	]


	class RNATorsionBERTHelper:
	def __init__(self):
	self.model_name = "sayby/rna_torsionbert"
	self.tokenizer = AutoTokenizer.from_pretrained(
	self.model_name, trust_remote_code=True
	)
	self.params_tokenizer = {
	"return_tensors": "pt",
	"padding": "max_length",
	"max_length": 512,
	"truncation": True,
	}
	self.model = AutoModel.from_pretrained(self.model_name, trust_remote_code=True)

	def predict(self, sequence: str):
	sequence_tok = self.convert_raw_sequence_to_k_mers(sequence)
	inputs = self.tokenizer(sequence_tok, **self.params_tokenizer)
	outputs = self.model(inputs)["logits"]
	outputs = self.convert_sin_cos_to_angles(
	outputs.cpu().detach().numpy(), inputs["input_ids"]
	)
	output_angles = self.convert_logits_to_dict(
	outputs[0, :], inputs["input_ids"][0, :].cpu().detach().numpy()
	)
	output_angles.index = list(sequence)[:-2] # Because of the 3-mer representation
	return output_angles

	def convert_raw_sequence_to_k_mers(self, sequence: str, k_mers: int = 3):
	"""
	Convert a raw RNA sequence into sequence readable for the tokenizer.
	It converts the sequence into k-mers, and replace U by T
	:return: input readable by the tokenizer
	"""
	sequence = sequence.upper().replace("U", "T")
	k_mers_sequence = [
	sequence[i : i + k_mers]
	for i in range(len(sequence))
	if len(sequence[i : i + k_mers]) == k_mers
	]
	return " ".join(k_mers_sequence)

	def convert_sin_cos_to_angles(
	self, output: np.ndarray, input_ids: Optional[np.ndarray] = None
	):
	"""
	Convert the raw predictions of the RNA-TorsionBERT into angles.
	It converts the cos and sinus into angles using:
	alpha = arctan(sin(alpha)/cos(alpha))
	:param output: Dictionary with the predictions of the RNA-TorsionBERT per angle
	:param input_ids: the input_ids of the RNA-TorsionBERT. It allows to only select the of the sequence,
	and not the special tokens.
	:return: a np.ndarray with the angles for the sequence
	"""
	if input_ids is not None:
	output[
	(input_ids == 0)
	\| (input_ids == 2)
	\| (input_ids == 3)
	\| (input_ids == 4)
	] = np.nan
	pair_indexes, impair_indexes = np.arange(0, output.shape[-1], 2), np.arange(
	1, output.shape[-1], 2
	)
	sin, cos = output[:, :, impair_indexes], output[:, :, pair_indexes]
	tan = np.arctan2(sin, cos)
	angles = np.degrees(tan)
	return angles

	def convert_logits_to_dict(self, output: np.ndarray, input_ids: np.ndarray) -> Dict:
	"""
	Convert the raw predictions into dictionary format.
	It removes the special tokens and only keeps the predictions for the sequence.
	:param output: predictions from the models in angles
	:param input_ids: input ids from the tokenizer
	:return: a dictionary with the predictions for each angle
	"""
	index_start, index_end = (
	np.where(input_ids == 2)[0][0],
	np.where(input_ids == 3)[0][0],
	)
	output_non_pad = output[index_start + 1 : index_end, :]
	output_angles = {
	angle: output_non_pad[:, angle_index]
	for angle_index, angle in enumerate(BACKBONE)
	}
	out = pd.DataFrame(output_angles)
	return out


	if __name__ == "__main__":
	sequence = "AGGGCUUUAGUCUUUGGAG"
	rna_torsionbert_helper = RNATorsionBERTHelper()
	output_angles = rna_torsionbert_helper.predict(sequence)
	print(output_angles)
	```