Update README.md
README.md
## Why We Made This Model
The Llama 3 base (non-instruct) model, while powerful, came with a significant oversight: some of the special tokens used for instruction following were left untrained, which can derail further fine-tuning. This was first noted by [Daniel Han on X](https://twitter.com/danielhanchen/status/1781395882925343058), highlighting a critical but fixable flaw in a widely used model.
<img src="https://cdn-uploads.huggingface.co/production/uploads/655ad0f8727df37c77a09cb9/1U2rRrx60p1pNeeAZw8Rd.png" alt="graph" width="400"/>
The primary goal of releasing a patched version of this model is to address this issue so that the community can use Llama 3 without facing training instabilities, such as sudden gradient explosions or `NaN` gradients, and without having to go through complicated steps to fix the model themselves before fine-tuning.
## Details of the Adjustment
The [meta-llama/Meta-Llama-3-8B](https://huggingface.co/meta-llama/Meta-Llama-3-8B) model was pulled directly from Hugging Face and loaded with `transformers`. The input and output embedding matrices were then retrieved via `model.get_input_embeddings().weight.data` and `model.get_output_embeddings().weight.data`. These two matrices are identical in shape, with each row representing a token ID and each column representing an embedding feature.
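As a rough illustration of that step, the snippet below loads the base model and pulls out both embedding matrices. This is a minimal sketch; the loading arguments (such as the dtype) are assumptions for illustration, not the exact settings used for this release.

```python
# Minimal sketch: load the base model and retrieve its embedding matrices.
# The dtype used here is an illustrative assumption.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    torch_dtype=torch.bfloat16,
)

# Both matrices have shape (vocab_size, hidden_size):
# one row per token ID, one column per embedding feature.
input_embeddings = model.get_input_embeddings().weight.data
output_embeddings = model.get_output_embeddings().weight.data
print(input_embeddings.shape, output_embeddings.shape)
```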
The special (untrained and problematic) tokens can be found by locating the rows whose embedding values are all zeros, which implies they were never updated during Meta's pretraining of the model. Such untrained tokens can cause serious numerical issues, such as gradient explosions or `NaN` gradients, during downstream fine-tuning on specific tasks.
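For illustration, the all-zero rows can be located with a simple check like the one below. This is a sketch rather than the exact script used to produce the patch; the tokenizer lookup is only there to show which token strings the zero rows correspond to.

```python
# Sketch: find token IDs whose input-embedding row is entirely zero,
# i.e. tokens that appear to be untrained.
# Assumes `model` from the previous snippet is already loaded.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

input_embeddings = model.get_input_embeddings().weight.data
zero_row_mask = (input_embeddings == 0).all(dim=1)
untrained_token_ids = zero_row_mask.nonzero(as_tuple=True)[0].tolist()

for token_id in untrained_token_ids:
    print(token_id, tokenizer.convert_ids_to_tokens(token_id))
```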
<details>