lapp0 committed (verified)
Commit 7a470a0 · 1 Parent(s): 6958d56

End of training

Files changed (1)
  1. README.md +5 -5
README.md CHANGED
@@ -72,7 +72,7 @@ More information needed
 
 # Resource Usage Comparison
 
-- VRAM Use: 8.2920 GB
+- VRAM Use: 7.7861 GB
 
 # Distillation (Teacher -> Student) Architecture Difference:
 
@@ -92,7 +92,7 @@ More information needed
 <br/>
 
 # Train Dataset
-Trained on 145,724,804 tokens from the [wikimedia/wikipedia](https://huggingface.co/datasets/wikimedia/wikipedia) dataset.
+Trained on 145,724,337 tokens from the [wikimedia/wikipedia](https://huggingface.co/datasets/wikimedia/wikipedia) dataset.
 
 - Num Samples: `247,500`
 - Subset: `20231101.en`
@@ -102,7 +102,7 @@ Trained on 145,724,804 tokens from the [wikimedia/wikipedia](https://huggingface.co/datasets/wikimedia/wikipedia) dataset.
 # Training Objective
 
 ```
-DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl), attn_loss_component=LossComponent(label=attn, weight=25.0, loss_fn=logsum, layer_mapper=all, projector=miles))
+DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl), attn_loss_component=LossComponent(label=attn, weight=25.0, loss_fn=logsum_v2, layer_mapper=all, projector=miles))
 ```
 
 # Hyperparameters
@@ -119,9 +119,9 @@ The following hyperparameters were used during training:
 - lr_scheduler_type: `cosine_with_min_lr`
 - lr_scheduler_warmup_ratio: `0.5`
 - num_epochs: `1.0`
-- distillation_objective: `DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl), attn_loss_component=LossComponent(label=attn, weight=25.0, loss_fn=logsum, layer_mapper=all, projector=miles))`
+- distillation_objective: `DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl), attn_loss_component=LossComponent(label=attn, weight=25.0, loss_fn=logsum_v2, layer_mapper=all, projector=miles))`
 - train_embeddings: `True`
-- lr_scheduler: `<torch.optim.lr_scheduler.LambdaLR object at 0x7f6927719540>`
+- lr_scheduler: `<torch.optim.lr_scheduler.LambdaLR object at 0x7faed8466410>`
 - student_model_name_or_path: `None`
 - student_config_name_or_path: `None`
 - student_model_config: `None`
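
For context on the Resource Usage Comparison change above: a peak-VRAM figure like this is commonly read from PyTorch's CUDA allocator statistics. A minimal sketch, assuming the number is peak allocated memory on a single GPU; this is illustrative, not necessarily how the training harness measured it:

```python
import torch

torch.cuda.reset_peak_memory_stats()

# ... run the training loop here ...

# Peak bytes allocated by tensors since the reset, reported in GiB.
peak_gib = torch.cuda.max_memory_allocated() / 1024**3
print(f"VRAM Use: {peak_gib:.4f} GB")
```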
 
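The substantive change in this commit is `loss_fn=logsum -> logsum_v2` in the attention-loss component of the objective (weight 25.0); the logits component (`loss_fn=kl`, weight 1) is unchanged. As a reference for what a KL logits-distillation component typically computes, here is a minimal PyTorch sketch; the function name `kl_logits_loss` and the reduction choice are illustrative assumptions, not the trainer's actual implementation:

```python
import torch
import torch.nn.functional as F

def kl_logits_loss(student_logits: torch.Tensor,
                   teacher_logits: torch.Tensor) -> torch.Tensor:
    """Forward KL(teacher || student) over the vocabulary dimension.

    Illustrative sketch; the trainer's `loss_fn=kl` may handle masking,
    temperature, or reduction differently.
    """
    student_log_probs = F.log_softmax(student_logits, dim=-1)
    teacher_log_probs = F.log_softmax(teacher_logits, dim=-1)
    # kl_div(input, target, log_target=True) computes KL(target || input)
    # with both arguments given as log-probabilities.
    return F.kl_div(student_log_probs, teacher_log_probs,
                    log_target=True, reduction="batchmean")
```

With `log_target=True`, `F.kl_div` treats both tensors as log-probabilities and computes KL(teacher || student), the usual direction for distillation.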
 
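On the `lr_scheduler` hyperparameter: the change from `0x7f6927719540` to `0x7faed8466410` is only the object's memory address in its `repr`, which differs between runs; it does not indicate a configuration change. The README's own value shows that `cosine_with_min_lr` is realized as a `torch.optim.lr_scheduler.LambdaLR`. A hand-rolled sketch of such a schedule, with linear warmup over `lr_scheduler_warmup_ratio = 0.5` of training followed by cosine decay to a floor; `min_lr_rate` is an assumed illustrative parameter, and the library's exact formula may differ:

```python
import math

import torch

def make_cosine_with_min_lr(
    optimizer: torch.optim.Optimizer,
    num_training_steps: int,
    warmup_ratio: float = 0.5,
    min_lr_rate: float = 0.1,
) -> torch.optim.lr_scheduler.LambdaLR:
    """Linear warmup to the peak LR, then cosine decay to min_lr_rate * peak.

    Hand-rolled approximation; the training library's exact
    `cosine_with_min_lr` formula may differ.
    """
    num_warmup_steps = int(num_training_steps * warmup_ratio)

    def lr_lambda(step: int) -> float:
        if step < num_warmup_steps:
            # Linear warmup from 0 up to the peak learning rate.
            return step / max(1, num_warmup_steps)
        progress = (step - num_warmup_steps) / max(
            1, num_training_steps - num_warmup_steps
        )
        # Cosine decay from 1.0 down to min_lr_rate.
        cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
        return min_lr_rate + (1.0 - min_lr_rate) * cosine

    return torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```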