CocoRoF committed (verified)
Commit 9df150d · 1 Parent(s): 53cd02f

Update README.md

Files changed (1)
  1. README.md +74 -40
README.md CHANGED
@@ -1,65 +1,99 @@
  ---
  library_name: transformers
- base_model: CocoRoF/KoModernBERT-base-mlm-v04-retry-model-chp19
- tags:
- - generated_from_trainer
  model-index:
- - name: KoModernBERT-base-mlm-v04-retry-model-chp20
  results: []
  ---

- <!-- This model card has been generated automatically according to the information the Trainer had access to. You
- should probably proofread and complete it, then remove this comment. -->

- # KoModernBERT-base-mlm-v04-retry-model-chp20

- This model is a fine-tuned version of [CocoRoF/KoModernBERT-base-mlm-v04-retry-model-chp19](https://huggingface.co/CocoRoF/KoModernBERT-base-mlm-v04-retry-model-chp19) on an unknown dataset.
- It achieves the following results on the evaluation set:
- - Loss: 2.0527

- ## Model description

- More information needed

- ## Intended uses & limitations

- More information needed

- ## Training and evaluation data

- More information needed

- ## Training procedure

- ### Training hyperparameters

- The following hyperparameters were used during training:
- - learning_rate: 1e-06
- - train_batch_size: 4
- - eval_batch_size: 4
- - seed: 42
- - distributed_type: multi-GPU
- - num_devices: 8
- - gradient_accumulation_steps: 64
- - total_train_batch_size: 2048
- - total_eval_batch_size: 32
- - optimizer: Use OptimizerNames.ADAMW_TORCH with betas=(0.9,0.999) and epsilon=1e-08 and optimizer_args=No additional optimizer arguments
- - lr_scheduler_type: linear
- - num_epochs: 1.0

- ### Training results

- | Training Loss | Epoch  | Step  | Validation Loss |
- |:-------------:|:------:|:-----:|:---------------:|
- | 132.9037      | 0.2001 | 2500  | 2.0796          |
- | 133.5153      | 0.4002 | 5000  | 2.0743          |
- | 131.6747      | 0.6002 | 7500  | 2.0601          |
- | 131.0512      | 0.8003 | 10000 | 2.0527          |

  ### Framework versions

- - Transformers 4.48.3
  - Pytorch 2.5.1+cu124
  - Datasets 3.2.0
- - Tokenizers 0.21.0
  ---
  library_name: transformers
+ license: apache-2.0
+ base_model: answerdotai/ModernBERT-base
  model-index:
+ - name: x2bee/KoModernBERT-base-mlm
  results: []
+ language:
+ - ko
  ---

+ # KoModernBERT-base-v02

+ This model is a fine-tuned version of [answerdotai/ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base), trained with:

+ * Flash-Attention 2 (see the loading sketch below)
+ * StableAdamW
+ * Unpadding & Sequence Packing

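+ Since the card lists Flash-Attention 2, the checkpoint can presumably be loaded with the standard `transformers` option `attn_implementation="flash_attention_2"`. A minimal sketch, assuming the `flash-attn` package is installed and the GPU supports it (illustrative, not part of the original card):
+
+ ```python
+ import torch
+ from transformers import AutoModelForMaskedLM, AutoTokenizer
+
+ model_id = "x2bee/KoModernBERT-base-mlm-v01"
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+ model = AutoModelForMaskedLM.from_pretrained(
+     model_id,
+     torch_dtype=torch.bfloat16,               # FA2 requires fp16/bf16
+     attn_implementation="flash_attention_2",  # assumes flash-attn is installed
+ ).to("cuda")
+ ```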
+ ## Example Use
+
+ ```python
+ import random
+
+ import torch
+ from transformers import AutoTokenizer, AutoModelForMaskedLM
+ from huggingface_hub import HfApi, login
+
+ # authenticate with a token stored on disk
+ with open('./api_key/HGF_TOKEN.txt', 'r') as hgf:
+     login(token=hgf.read())
+ api = HfApi()
+
+ model_id = "x2bee/KoModernBERT-base-mlm-v01"
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+ model = AutoModelForMaskedLM.from_pretrained(model_id).to("cuda")
+
+ def modern_bert_convert_with_multiple_masks(text: str, top_k: int = 1, select_method: str = "Logit") -> str:
+     if "[MASK]" not in text:
+         raise ValueError("Input to the MLM model should include '[MASK]' in the sentence")
+
+     # Fill the masks one at a time, left to right.
+     while "[MASK]" in text:
+         inputs = tokenizer(text, return_tensors="pt").to("cuda")
+         with torch.no_grad():
+             outputs = model(**inputs)
+
+         input_ids = inputs["input_ids"][0].tolist()
+         mask_indices = [i for i, token_id in enumerate(input_ids) if token_id == tokenizer.mask_token_id]
+         current_mask_index = mask_indices[0]
+
+         logits = outputs.logits[0, current_mask_index]
+         top_k_logits, top_k_indices = logits.topk(top_k)
+         top_k_tokens = top_k_indices.tolist()
+
+         if select_method == "Logit":
+             # sample among the top-k candidates, weighted by softmax probability
+             probabilities = torch.softmax(top_k_logits, dim=0).tolist()
+             predicted_token_id = random.choices(top_k_tokens, weights=probabilities, k=1)[0]
+         elif select_method == "Random":
+             # sample uniformly among the top-k candidates
+             predicted_token_id = random.choice(top_k_tokens)
+         elif select_method == "Best":
+             # always take the highest-scoring candidate
+             predicted_token_id = top_k_tokens[0]
+         else:
+             raise ValueError("select_method should be one of ['Logit', 'Random', 'Best']")
+
+         predicted_token = tokenizer.decode([predicted_token_id]).strip()
+         text = text.replace("[MASK]", predicted_token, 1)
+         print(f"Predicted: {predicted_token} | Current text: {text}")
+
+     return text
+ ```
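+
+ For a sentence with a single `[MASK]`, the stock `fill-mask` pipeline returns the top-k candidates directly, without the custom loop above (a minimal sketch using the standard `transformers` pipeline API; not part of the original card):
+
+ ```python
+ from transformers import pipeline
+
+ # device=0 runs on the first GPU; use device=-1 for CPU
+ fill = pipeline("fill-mask", model="x2bee/KoModernBERT-base-mlm-v01", device=0)
+
+ # "The capital of South Korea is [MASK]" (illustrative sentence, not from the card)
+ for candidate in fill("대한민국의 수도는 [MASK]이다", top_k=3):
+     print(candidate["token_str"], round(candidate["score"], 4))
+ ```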
 
 
 
 
+ ```
+ text = "30일 전남 무안국제[MASK] 활주로에 전날 발생한 제주항공 [MASK] 당시 기체가 [MASK]착륙하면서 강한 마찰로 생긴 흔적이 남아 있다. 이 참사로 [MASK]과 승무원 181명 중 179명이 숨지고 [MASK]는 형체를 알아볼 수 없이 [MASK]됐다. [MASK] 규모와 [MASK] 원인 등에 대해 다양한 [MASK]이 제기되고 있는 가운데 [MASK]에 설치된 [MASK](착륙 유도 안전시설)가 [MASK]를 키웠다는 [MASK]이 나오고 있다."
+ result = modern_bert_convert_with_multiple_masks(text, top_k=1)
+
+ '30일 전남 무안국제터미널 활주로에 전날 발생한 제주항공 사고 당시 기체가 무단착륙하면서 강한 마찰로 생긴 흔적이 남아 있다. 이 참사로 승객과 승무원 181명 중 179명이 숨지고 일부는 형체를 알아볼 수 없이 실종됐다. 사고 규모와 사고 원인 등에 대해 다양한 의혹이 제기되고 있는 가운데 기내에 설치된 ESC(착륙 유도 안전시설)가 사고를 키웠다는 주장이 나오고 있다.'
+ ```
+
+ (The input is a Korean news passage about the Jeju Air crash at Muan International Airport; the model fills each mask with tokens such as 터미널 "terminal", 사고 "accident", 승객 "passengers", and 주장 "claim".)
+
+ ```
+ text = "중국의 수도는 [MASK]이다"
+ result = modern_bert_convert_with_multiple_masks(text, top_k=1)
+ '중국의 수도는 베이징이다'
+
+ text = "일본의 수도는 [MASK]이다"
+ result = modern_bert_convert_with_multiple_masks(text, top_k=1)
+ '일본의 수도는 도쿄이다'
+
+ text = "대한민국의 가장 큰 도시는 [MASK]이다"
+ result = modern_bert_convert_with_multiple_masks(text, top_k=1)
+ '대한민국의 가장 큰 도시는 서울이다'
+ ```
+
+ (In English: "The capital of China is [MASK]" → Beijing; "The capital of Japan is [MASK]" → Tokyo; "The largest city in South Korea is [MASK]" → Seoul.)
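+
+ The examples above use `top_k=1`, which always takes the single best candidate. With a larger `top_k`, the default `select_method="Logit"` samples among the candidates in proportion to their softmax probabilities, so repeated calls can yield different completions (illustrative; outputs vary from run to run):
+
+ ```python
+ text = "대한민국의 가장 큰 도시는 [MASK]이다"
+
+ # probability-weighted sampling among the top 5 candidates
+ print(modern_bert_convert_with_multiple_masks(text, top_k=5, select_method="Logit"))
+
+ # uniform sampling among the top 5 candidates
+ print(modern_bert_convert_with_multiple_masks(text, top_k=5, select_method="Random"))
+ ```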

  ### Framework versions

+ - Transformers 4.48.0
  - Pytorch 2.5.1+cu124
  - Datasets 3.2.0
+ - Tokenizers 0.21.0