shivendrra committed
Commit e9e5531 · verified · 1 Parent(s): d895254

Update README.md

Files changed (1)
  1. README.md +196 -1
README.md CHANGED
@@ -10,4 +10,199 @@ tags:
  - bio-bert
  - enigma
  - bio-enigma
- ---
+ ---
+
+ # enigma-1.5b
+
+
+ ## Model Details
+ This is a transformer-based model trained on DNA sequence data, capable of generating new sequences of DNA. It is a 2.5-billion-parameter model trained until convergence.
+ There is also a BERT-based model with 47 million parameters that is likewise capable of generating new sequences.
+ ### Model Description
+
+ - **Developed by:** [Shivendra Singh]()
+ - **License:** MIT
+
+ ### Model Sources
+
+ - **Repository:** [github/enigma-1.5b](https://github.com/shivendrra/enigma-1.5b)
+ - **Paper:** [Understanding the Natural Language of DNA using Encoder-Decoder Foundation Models with Byte-level Precision](https://arxiv.org/html/2311.02333v2#bib.bib35)
+
+ ## Uses
+
+ The model can generate new DNA sequences from a given input of tokens, or serve as a starting point for further research. It is fairly basic in nature.
+ ### Direct Use
+
+ Load the model and use it to generate new sequences, with `max_length=512` for the 2.5b model and `max_length=256` for the enbert-47m model.
+
+ ## Bias, Risks, and Limitations
+
+ This model was trained on only around ~500 MB of DNA data, tokenized at the per-character level rather than the sub-word or sequence level used in language models. This gives it more precision, but its capability is limited by the small amount of training data.
+ I wasn't able to train it on other datasets for better generalization because of technical limits: lack of a GPU and good hardware.
+
+ ## How to Get Started with the Model
+
+ Use the code below to get started with the model.
+
+ ```python
+ # Load model directly from the Hugging Face hub
+ from transformers import AutoModel
+ model = AutoModel.from_pretrained("Shivendrra/enigma-1.5b")
+
+ # generate from the model
+ import torch
+ import torch.nn.functional as F
+ from model import Transformer
+
+ model = Transformer(vocab_size=vocab_size)
+
+ class Generator(Transformer):
+   def __init__(self, vocab_size):
+     super().__init__(vocab_size=vocab_size)
+     self.vocab_size = vocab_size
+     self.block_size = Transformer.block_size
+
+   def generate(self, idx, max_new_tokens, temperature=1.0, top_k=0):
+     generated_tokens = []
+
+     for _ in range(max_new_tokens):
+       # crop the running context to the last block_size tokens
+       idx_cond = idx[:, -self.block_size:]
+       logits, _ = self(idx_cond)
+       logits = logits[:, -1, :]
+       scaled_logits = logits / temperature
+
+       if top_k > 0:
+         scaled_logits = self._top_k_filtering(scaled_logits, top_k)
+       probs = F.softmax(scaled_logits, dim=-1)
+       sampled_idx = torch.multinomial(probs, num_samples=1)
+       generated_tokens.append(sampled_idx.item())
+       idx = torch.cat((idx, sampled_idx), dim=1)
+     return generated_tokens
+
+   def _top_k_filtering(self, logits, top_k):
+     # keep only the top_k logits and set the rest to -inf before sampling
+     values, indices = torch.topk(logits, top_k, dim=-1)
+     min_value = values[:, -1].unsqueeze(-1).expand_as(logits)
+     filtered_logits = torch.where(logits < min_value, torch.ones_like(logits) * -float('inf'), logits)
+     return filtered_logits
+ ```
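+
+ A minimal usage sketch, continuing from the snippet above. The character-level vocabulary here is an assumption for illustration; the actual character set and `vocab_size` must match the ones used during training, and the trained weights need to be loaded before generation produces meaningful sequences.
+
+ ```python
+ # hypothetical character-level vocabulary over DNA letters
+ chars = ['A', 'C', 'G', 'T']
+ stoi = {ch: i for i, ch in enumerate(chars)}
+ itos = {i: ch for i, ch in enumerate(chars)}
+
+ generator = Generator(vocab_size=len(chars))
+ # generator.load_state_dict(...) would go here to use the pretrained weights
+
+ prompt = "ATGCGTACGTTAG"
+ idx = torch.tensor([[stoi[c] for c in prompt]], dtype=torch.long)
+ tokens = generator.generate(idx, max_new_tokens=512, temperature=0.8, top_k=4)
+ print(''.join(itos[t] for t in tokens))
+ ```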
+
+ ## Training Details
+
+ ### Training Data
+
+ Data was taken from this dataset: [human_ref_data](https://huggingface.co/datasets/samchain/human_ref_dna)
+ Eight ~500 MB files were consolidated into one big dataset. I've uploaded the training data as well.
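+
+ A minimal sketch of how the consolidated file can be turned into the character-level `train_data`/`val_data` tensors that the training code below expects; the file name and the 90/10 split are assumptions, not the exact preprocessing used.
+
+ ```python
+ import torch
+
+ # read the consolidated DNA text file (hypothetical path)
+ with open('consolidated_dna.txt', 'r', encoding='utf-8') as f:
+     text = f.read()
+
+ # per-character vocabulary and encoding
+ chars = sorted(set(text))
+ vocab_size = len(chars)
+ stoi = {ch: i for i, ch in enumerate(chars)}
+ data = torch.tensor([stoi[c] for c in text], dtype=torch.long)
+
+ # 90/10 train/validation split
+ n = int(0.9 * len(data))
+ train_data, val_data = data[:n], data[n:]
+ ```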
+
+ ### Training Procedure
+
+ These models were trained for 3k-4k iterations each, on ~500 million letters of DNA, roughly 500 MB of data. Final losses were around ~0.02 for the 47-million-parameter model and ~0.003 for the 2.5-billion-parameter model. I had saved a lot more data than this, but couldn't train further due to technical limitations.
+ Try to train it yourself if possible. The `enigma/TrainEnigma` file contains all the functions needed to train it, from scratch or for pre-training.
+ #### Functions:
+ This uses a basic training procedure: `get_batch()` generates batches of data, `estimate_loss()` estimates the losses, and the `train()` function acts as the master function, calling the others after each iteration or at set intervals.
+
+ ```python
+ import torch
+
+ # hyperparameters (batch_size, block_size, learning_rate, ...) come from config-enigma.json;
+ # train_data / val_data are the encoded character-level tensors built from the dataset
+ def get_batch(split):
+     # generate a small batch of data of inputs x and targets y
+     data = train_data if split == 'train' else val_data
+     ix = torch.randint(len(data) - block_size, (batch_size,))
+     x = torch.stack([data[i:i+block_size] for i in ix])
+     y = torch.stack([data[i+1:i+block_size+1] for i in ix])
+     x, y = x.to(device), y.to(device)
+
+     return x, y
+
+ @torch.no_grad()
+ def estimate_loss():
+     out = {}
+     model.eval()
+     for split in ['train', 'val']:
+         losses = torch.zeros(eval_iters)
+         for k in range(eval_iters):
+             X, Y = get_batch(split)
+             logits, loss = model(X, Y)
+             losses[k] = loss.item()
+         out[split] = losses.mean()
+     model.train()
+     return out
+
+ from model import Transformer
+ model = Transformer(vocab_size=vocab_size)
+ m = model.to(device)
+
+ n_param = sum(p.numel() for p in m.parameters())/1e6
+ print(f"{n_param:.2f} million")
+ optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)
+ steps = []
+ train_losses = []
+ val_losses = []
+
+ for iter in range(max_iters):
+   if iter % eval_interval == 0 or iter == max_iters - 1:
+     losses = estimate_loss()
+     print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")
+     steps.append(iter)
+     train_losses.append(losses['train'])
+     val_losses.append(losses['val'])
+
+   xb, yb = get_batch('train')
+   logits, loss = model(xb, yb)
+   optimizer.zero_grad(set_to_none=True)
+   loss.backward()
+   optimizer.step()
+
+ torch.save(model.state_dict(), f'enigma_{n_param:.0f}m.pth')
+ ```
+
+ #### Training Hyperparameters
+
+ Configurations are saved in the `enigma/config-enigma.json` file. These values are suited for the 2.5b model.
+
+ ```json
+ {
+   "batch_size": 10,
+   "block_size": 512,
+   "max_iters": 5000,
+   "eval_interval": 50,
+   "learning_rate": 3e-5,
+   "eval_iters": 100,
+   "d_model": 384,
+   "n_head": 12,
+   "n_layer": 12,
+   "dropout": 0.2,
+   "norm_eps": 1e-5
+ }
+ ```
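+
+ A small sketch of loading this config and exposing its values as the globals that the training snippet above relies on (the exact loading code in `enigma/TrainEnigma` may differ):
+
+ ```python
+ import json
+ import torch
+
+ with open('enigma/config-enigma.json', 'r') as f:
+     config = json.load(f)
+
+ # hyperparameters used by get_batch() and the training loop
+ batch_size = config['batch_size']
+ block_size = config['block_size']
+ max_iters = config['max_iters']
+ eval_interval = config['eval_interval']
+ learning_rate = config['learning_rate']
+ eval_iters = config['eval_iters']
+ device = 'cuda' if torch.cuda.is_available() else 'cpu'
+ ```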
+
+ ### Model Architecture and Objective
+
+ EnBERT is a 47-million-parameter model that follows the BERT architecture, with an additional masked self-attention layer to predict next tokens.
+ Enigma-2.5b is an encoder-decoder transformer model with a fairly complex architecture.
+
+ ![architecture](https://github.com/shivendrra/enigma-1.5b/blob/main/architecture.png)
+ #### Encoder Part:
+ ---
+ It consists of two different layers, each followed by its own normalization and dropout layers. Input embeddings along with positional embeddings are fed to the encoder block:
+ ##### Self Attention:
+ - Each head of the self-attention layer is similar to the one used in `grokAI`. The Key and Query matrices have biases whereas the Value matrix doesn't.
+ - After applying `torch.matmul()` to Key and Query, relational positional embeddings are applied to the attention matrix.
+ - The attention and value matrices are then multiplied using `torch.matmul()`.
+ - The multi-head attention layer then concatenates all the outputs together and passes them through a linear layer (a minimal sketch of one such head follows below).
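+
+ A minimal sketch of one attention head with biased Key/Query projections and an unbiased Value projection, as described above. The class name is illustrative rather than taken from the repository, and the relational positional embedding term is only indicated by a comment.
+
+ ```python
+ import torch
+ import torch.nn as nn
+ import torch.nn.functional as F
+
+ class AttentionHead(nn.Module):
+   def __init__(self, d_model, head_size, dropout=0.2):
+     super().__init__()
+     # Key and Query projections carry biases, Value does not
+     self.key = nn.Linear(d_model, head_size, bias=True)
+     self.query = nn.Linear(d_model, head_size, bias=True)
+     self.value = nn.Linear(d_model, head_size, bias=False)
+     self.dropout = nn.Dropout(dropout)
+
+   def forward(self, x):
+     B, T, C = x.shape
+     k, q, v = self.key(x), self.query(x), self.value(x)
+     # scaled dot-product attention matrix
+     attn = torch.matmul(q, k.transpose(-2, -1)) / (k.shape[-1] ** 0.5)
+     # the real model adds relational positional embeddings to `attn` here (omitted)
+     attn = F.softmax(attn, dim=-1)
+     attn = self.dropout(attn)
+     return torch.matmul(attn, v)
+ ```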
+
+ #### FeedForward:
+ - The normalized outputs are then passed to a position-wise `feedforward` layer with an `expansion_factor` of 5.
+ - GELU is used as the activation function, with two linear layers: one for the input projection and one for the output projection.
+ - Finally, dropout is applied, and the resulting outputs carry deep global contextual information about the input tokens (a sketch follows this list).
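+
+ A corresponding sketch of the position-wise feed-forward layer with `expansion_factor` of 5 and GELU, under the same naming assumptions as the attention sketch above:
+
+ ```python
+ import torch.nn as nn
+
+ class FeedForward(nn.Module):
+   def __init__(self, d_model, expansion_factor=5, dropout=0.2):
+     super().__init__()
+     self.net = nn.Sequential(
+       nn.Linear(d_model, expansion_factor * d_model),  # input projection
+       nn.GELU(),
+       nn.Linear(expansion_factor * d_model, d_model),  # output projection
+       nn.Dropout(dropout),
+     )
+
+   def forward(self, x):
+     return self.net(x)
+ ```
+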
+ #### Decoder Part:
+ ---
+ It consists of three different layers:
+ ##### Masked Attention:
+ - This layer is similar to the self-attention implemented in the encoder part, except that it has a triangular mask that prevents tokens from attending to the context of future tokens.
+ - The rest is the same: relational positional embeddings are applied in the same way, but this time to the masked attention matrix.
+ - The attention and value matrices are then multiplied using `torch.matmul()`.
+ - The multi-head attention layer then concatenates all the outputs together and passes them through a linear layer (see the masking illustration below).
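+
+ A minimal illustration of the triangular (causal) mask on a toy attention matrix; the values are random and only the masking mechanics are shown:
+
+ ```python
+ import torch
+
+ T = 6                                                # sequence length
+ attn = torch.randn(T, T)                             # unnormalized attention scores
+ mask = torch.tril(torch.ones(T, T))                  # lower-triangular causal mask
+ attn = attn.masked_fill(mask == 0, float('-inf'))    # block attention to future positions
+ probs = torch.softmax(attn, dim=-1)                  # each row attends only to itself and earlier tokens
+ ```
+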
200
+ #### Self-Attention:
201
+ - Before this, outputs from encoder layer and masked-attention layer are added together, and then passed to this layer.
202
+ - Same as the encoder's unmasked attention layer. Key, Query and Value matrices are created using same technique.
203
+ - Finally all the outputs are normalized and passed to dropout layer.
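+
+ A rough sketch of that combination step, following the description above (encoder output and masked-attention output are summed before the unmasked attention); it reuses the hypothetical `AttentionHead` from the encoder sketch and is illustrative only:
+
+ ```python
+ import torch.nn as nn
+
+ class DecoderAttentionBlock(nn.Module):
+   def __init__(self, d_model, dropout=0.2):
+     super().__init__()
+     self.attn = AttentionHead(d_model, d_model, dropout)  # unmasked head from the encoder sketch
+     self.norm = nn.LayerNorm(d_model)
+     self.dropout = nn.Dropout(dropout)
+
+   def forward(self, masked_out, encoder_out):
+     # per the description: encoder output and masked-attention output are added,
+     # then fed through an unmasked self-attention layer, normalized, and dropped out
+     x = masked_out + encoder_out
+     return self.dropout(self.norm(self.attn(x)))
+ ```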
204
+
205
+ #### FeedForward:
206
+ - Normalized outputs are then passed to position-wise `feedforward` layer, with `expansion_factor` of 5.
207
+ - GELU is used as the activation function in this case and two linear layers, one for output and other for input.
208
+ - Finally dropout is applied and the outputs that are produced have deep global contextual information about the input tokens.