tags:
- bio-bert
- enigma
- bio-enigma
---

# enigma-1.5b

## Model Details

This is a transformer-based model trained on DNA sequence data, capable of generating new DNA sequences. It is a 2.5-billion-parameter model trained until convergence.
It also comes with a BERT-based model of 47 million parameters that is likewise capable of generating new sequences.

### Model Description

- **Developed by:** [Shivendra Singh]()
- **License:** MIT

### Model Sources

- **Repository:** [github/enigma-1.5b](https://github.com/shivendrra/enigma-1.5b)
- **Paper:** [Understanding the Natural Language of DNA using Encoder-Decoder Foundation Models with Byte-level Precision](https://arxiv.org/html/2311.02333v2#bib.bib35)

## Uses

The model can be used to generate new DNA sequences from a given input of tokens, or as a starting point for further research. It is fairly basic in nature.

### Direct Use

Load the model and use it to generate new sequences, with `max_length=512` for the 2.5b model and `max_length=256` for the enbert-47m model.

## Bias, Risks, and Limitations

This model was trained on only ~500 MB of DNA data, tokenized at the per-character level rather than the sub-word or sequence level used in language models. That gives it high per-base precision, but its capability is limited by the small amount of training data.
I wasn't able to train it on other datasets for better generalization because of technical limits: lack of GPUs and good hardware.
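
For illustration, here is a minimal sketch of what per-character tokenization of DNA looks like; the exact vocabulary used by this repository is not shown here, so the mapping below is an assumption:

```python
# Hypothetical character-level tokenizer for DNA: each base is one token.
# The actual training vocabulary may differ (e.g. extra symbols such as 'N' or newlines).
chars = sorted(set("ATGCN\n"))
stoi = {ch: i for i, ch in enumerate(chars)}   # char -> id
itos = {i: ch for ch, i in stoi.items()}       # id -> char

encode = lambda seq: [stoi[c] for c in seq]
decode = lambda ids: "".join(itos[i] for i in ids)

print(encode("ATGC"))          # -> per-base integer ids, e.g. [1, 5, 3, 2]
print(decode(encode("ATGC")))  # -> "ATGC"
```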

## How to Get Started with the Model

Use the code below to get started with the model.

```python
# Load the model directly from the Hugging Face Hub
from transformers import AutoModel
model = AutoModel.from_pretrained("Shivendrra/enigma-1.5b")

# Generate from the model using the repository's Transformer class
import torch
import torch.nn.functional as F
from model import Transformer

class Generator(Transformer):
  def __init__(self, vocab_size):
    super().__init__(vocab_size=vocab_size)
    self.vocab_size = vocab_size
    self.block_size = Transformer.block_size

  def generate(self, idx, max_new_tokens, temperature=1.0, top_k=0):
    generated_tokens = []

    for _ in range(max_new_tokens):
      # crop the running sequence to the last block_size tokens
      idx_cond = idx[:, -self.block_size:]
      logits, _ = self(idx_cond)
      logits = logits[:, -1, :]            # keep only the final position
      scaled_logits = logits / temperature

      if top_k > 0:
        scaled_logits = self._top_k_filtering(scaled_logits, top_k)
      probs = F.softmax(scaled_logits, dim=-1)
      sampled_idx = torch.multinomial(probs, num_samples=1)
      generated_tokens.append(sampled_idx.item())
      idx = torch.cat((idx, sampled_idx), dim=1)
    return generated_tokens

  def _top_k_filtering(self, logits, top_k):
    # keep the top_k logits per row and push everything else to -inf before sampling
    values, _ = torch.topk(logits, top_k, dim=-1)
    min_value = values[:, -1].unsqueeze(-1).expand_as(logits)
    filtered_logits = torch.where(logits < min_value, torch.full_like(logits, float('-inf')), logits)
    return filtered_logits

model = Generator(vocab_size=vocab_size)  # vocab_size comes from the character-level tokenizer
```
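
A minimal usage sketch for the class above; the seed tokens and decoding step assume a character-level `encode`/`decode` pair like the one sketched earlier, which is an assumption rather than the repository's exact API:

```python
# Hypothetical usage: seed the model with a short DNA prompt and sample new bases.
import torch

generator = Generator(vocab_size=vocab_size)
seed = torch.tensor([encode("ATGC")], dtype=torch.long)                 # shape (1, T)
new_tokens = generator.generate(seed, max_new_tokens=512, temperature=0.8, top_k=5)
print(decode(new_tokens))                                               # generated DNA string
```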

## Training Details

### Training Data

Data used from this dataset: [human_ref_data](https://huggingface.co/datasets/samchain/human_ref_dna)
Eight ~500 MB files were consolidated into one big dataset. The data used for training has been uploaded as well.
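
A rough sketch of how such a consolidation could be done with the `datasets` library; the column name (`text`) and output path are assumptions, not the repository's actual preprocessing script:

```python
# Hypothetical consolidation: pull the DNA dataset from the Hub and write it to one text file.
from datasets import load_dataset

ds = load_dataset("samchain/human_ref_dna", split="train")
with open("train_dna.txt", "w") as f:
    for row in ds:
        f.write(row["text"] + "\n")   # assumes the sequence lives in a "text" column
```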

### Training Procedure

These models were trained for 3k-4k iterations each, on ~500 million letters of DNA, roughly 500 MB of data. Final losses were around ~0.02 for the 47-million-parameter model and ~0.003 for the 2.5-billion-parameter model. I had saved a lot more data than this, but couldn't train further due to technical limitations.
Try to train it yourself if possible. The `enigma/TrainEnigma` file contains all the functions needed to train it, from scratch or as pre-training.

#### Functions:

This uses a basic training procedure: `get_batch()` generates batches of data, `estimate_loss()` estimates the train/val losses, and `train()` acts as the master function, calling the other functions every set number of iterations.

```python
import torch
from model import Transformer

# Hyperparameters (batch_size, block_size, max_iters, eval_interval, learning_rate,
# eval_iters, ...) come from enigma/config-enigma.json.
# train_data / val_data are 1-D tensors of encoded DNA characters prepared beforehand.
device = 'cuda' if torch.cuda.is_available() else 'cpu'

def get_batch(split):
  # generate a small batch of data of inputs x and targets y
  data = train_data if split == 'train' else val_data
  ix = torch.randint(len(data) - block_size, (batch_size,))
  x = torch.stack([data[i:i+block_size] for i in ix])
  y = torch.stack([data[i+1:i+block_size+1] for i in ix])
  x, y = x.to(device), y.to(device)
  return x, y

@torch.no_grad()
def estimate_loss():
  # average the loss over eval_iters batches for both splits
  out = {}
  model.eval()
  for split in ['train', 'val']:
    losses = torch.zeros(eval_iters)
    for k in range(eval_iters):
      X, Y = get_batch(split)
      logits, loss = model(X, Y)
      losses[k] = loss.item()
    out[split] = losses.mean()
  model.train()
  return out

model = Transformer(vocab_size=vocab_size)
m = model.to(device)

n_param = sum(p.numel() for p in m.parameters()) / 1e6
print(f"{n_param:.2f} million")
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)
steps = []
train_losses = []
val_losses = []

for iter in range(max_iters):
  # periodically evaluate on the train and val splits
  if iter % eval_interval == 0 or iter == max_iters - 1:
    losses = estimate_loss()
    print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")
    steps.append(iter)
    train_losses.append(losses['train'])
    val_losses.append(losses['val'])

  # one optimization step on a fresh batch
  xb, yb = get_batch('train')
  logits, loss = model(xb, yb)
  optimizer.zero_grad(set_to_none=True)
  loss.backward()
  optimizer.step()

torch.save(model.state_dict(), f'enigma_{n_param:.0f}m.pth')
```
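
To reuse a trained checkpoint later, it can be restored with `load_state_dict`; the filename below is only an example of what the `torch.save` call above would produce:

```python
# Reload a saved checkpoint into a freshly constructed model.
ckpt_path = 'enigma_47m.pth'   # whatever name the torch.save call above produced
model = Transformer(vocab_size=vocab_size)
model.load_state_dict(torch.load(ckpt_path, map_location='cpu'))
model.eval()
```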

#### Training Hyperparameters

Configurations are saved in the `enigma/config-enigma.json` file. These settings are for the 2.5b model.

```json
{
  "batch_size": 10,
  "block_size": 512,
  "max_iters": 5000,
  "eval_interval": 50,
  "learning_rate": 3e-5,
  "eval_iters": 100,
  "d_model": 384,
  "n_head": 12,
  "n_layer": 12,
  "dropout": 0.2,
  "norm_eps": 1e-5
}
```
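
The training script above reads these values as plain hyperparameters, so loading them is just a matter of unpacking the JSON; a minimal sketch (the exact loading code in `enigma/TrainEnigma` may differ):

```python
# Hypothetical config loading: unpack the JSON into the hyperparameters used by the training loop.
import json

with open('enigma/config-enigma.json') as f:
    config = json.load(f)

batch_size    = config['batch_size']
block_size    = config['block_size']
max_iters     = config['max_iters']
eval_interval = config['eval_interval']
learning_rate = config['learning_rate']
eval_iters    = config['eval_iters']
```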

### Model Architecture and Objective

EnBERT is a 47-million-parameter model that follows the BERT architecture, with one additional masked self-attention layer used to predict next tokens.
Enigma-2.5b is a transformer model with a fairly complex architecture, described below.



#### Encoder Part:
---
The encoder consists of two different sub-layers, each followed by its own normalization and dropout layers. Input embeddings, along with positional embeddings, are fed to the encoder block:
##### Self Attention:
- Each head of the self-attention layer is similar to the one used in `grokAI`. The Key and Query matrices have biases whereas the Value matrix doesn't.
- After `torch.matmul()` is applied to Key and Query, relational positional embeddings are applied to the attention matrix.
- The attention and value matrices are then multiplied using `torch.matmul()`.
- The multi-head attention layer then concatenates all the head outputs and passes them through a linear layer (see the sketch below).
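
A minimal sketch of one such attention head under the assumptions above (biased Key/Query projections, bias-free Value, and a learned relative-position bias added to the attention scores); the real implementation in the repository may differ in detail:

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionHead(nn.Module):
  def __init__(self, d_model, head_size, block_size, dropout=0.2):
    super().__init__()
    self.key = nn.Linear(d_model, head_size, bias=True)     # Key has a bias
    self.query = nn.Linear(d_model, head_size, bias=True)   # Query has a bias
    self.value = nn.Linear(d_model, head_size, bias=False)  # Value does not
    # learned relative-position term added to the attention scores (assumed form)
    self.rel_pos_bias = nn.Parameter(torch.zeros(block_size, block_size))
    self.dropout = nn.Dropout(dropout)

  def forward(self, x):
    B, T, C = x.shape
    k, q, v = self.key(x), self.query(x), self.value(x)
    scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(k.size(-1))
    scores = scores + self.rel_pos_bias[:T, :T]   # apply positional term to the attention matrix
    attn = self.dropout(F.softmax(scores, dim=-1))
    return torch.matmul(attn, v)                  # weight the values by the attention matrix
```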

##### FeedForward:
- The normalized outputs are then passed to a position-wise `feedforward` layer with an `expansion_factor` of 5.
- GELU is used as the activation function, between two linear layers: one for the input projection and one for the output projection.
- Finally, dropout is applied; the resulting outputs carry deep global contextual information about the input tokens (a sketch follows below).
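
A small sketch of a position-wise feed-forward block matching that description (expansion factor 5, GELU, dropout); treat it as an illustration rather than the repository's exact module:

```python
import torch.nn as nn

class FeedForward(nn.Module):
  def __init__(self, d_model, expansion_factor=5, dropout=0.2):
    super().__init__()
    self.net = nn.Sequential(
      nn.Linear(d_model, expansion_factor * d_model),  # input projection
      nn.GELU(),                                       # GELU activation
      nn.Linear(expansion_factor * d_model, d_model),  # output projection
      nn.Dropout(dropout),
    )

  def forward(self, x):
    return self.net(x)
```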

#### Decoder Part:
---
The decoder consists of three different layers:
##### Masked Attention:
- This layer is similar to the self-attention implemented in the encoder part, except it uses a triangular mask that forbids tokens from attending to the context of later tokens (see the mask sketch after this list).
- The rest is the same; relational positional embeddings are applied in the same way, but to the masked attention matrix this time.
- The attention and value matrices are then multiplied using `torch.matmul()`.
- The multi-head attention layer then concatenates all the head outputs and passes them through a linear layer.
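
The triangular mask can be illustrated in a few lines; this is the standard causal-masking idiom, assumed to be what the masked attention layer does before the softmax:

```python
import torch

T = 5                                                    # sequence length
mask = torch.tril(torch.ones(T, T))                      # lower-triangular mask
scores = torch.randn(T, T)                               # attention scores for one head
scores = scores.masked_fill(mask == 0, float('-inf'))    # block attention to future tokens
attn = torch.softmax(scores, dim=-1)                     # each row attends only to positions <= itself
```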

##### Self-Attention:
- Before this layer, the outputs from the encoder block and the masked-attention layer are added together and then passed in.
- It is otherwise the same as the encoder's unmasked attention layer; the Key, Query, and Value matrices are created with the same technique.
- Finally, all the outputs are normalized and passed to a dropout layer.

##### FeedForward:
- The normalized outputs are then passed to a position-wise `feedforward` layer with an `expansion_factor` of 5.
- GELU is used as the activation function, between two linear layers: one for the input projection and one for the output projection.
- Finally, dropout is applied; the resulting outputs carry deep global contextual information about the input tokens.
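
Putting the decoder description together, here is a compact sketch of how one decoder block could be wired (masked attention over the decoder input, addition of the encoder output, plain self-attention, then the feed-forward block). The residual connections, norm placement, and the use of `nn.MultiheadAttention` as a stand-in for the repository's own attention heads are all assumptions, so this follows the prose above rather than the actual implementation:

```python
import torch
import torch.nn as nn

class DecoderBlockSketch(nn.Module):
  """Illustrative only: nn.MultiheadAttention stands in for the custom heads sketched earlier."""
  def __init__(self, d_model, n_head, block_size, dropout=0.2):
    super().__init__()
    self.masked_attn = nn.MultiheadAttention(d_model, n_head, dropout=dropout, batch_first=True)
    self.self_attn = nn.MultiheadAttention(d_model, n_head, dropout=dropout, batch_first=True)
    self.ffwd = nn.Sequential(
      nn.Linear(d_model, 5 * d_model), nn.GELU(),
      nn.Linear(5 * d_model, d_model), nn.Dropout(dropout),
    )
    self.norm1 = nn.LayerNorm(d_model)
    self.norm2 = nn.LayerNorm(d_model)
    self.norm3 = nn.LayerNorm(d_model)
    # True above the diagonal = future positions that may not be attended to
    self.register_buffer("mask", torch.triu(torch.ones(block_size, block_size), diagonal=1).bool())

  def forward(self, x, enc_out):
    T = x.size(1)
    # 1) masked attention: tokens cannot attend to later positions
    masked, _ = self.masked_attn(x, x, x, attn_mask=self.mask[:T, :T])
    h = self.norm1(masked + x)
    # 2) add the encoder output, then run unmasked self-attention
    h = h + enc_out
    attended, _ = self.self_attn(h, h, h)
    h = self.norm2(attended + h)
    # 3) position-wise feed-forward with expansion factor 5 and GELU
    return self.norm3(self.ffwd(h) + h)
```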