
# Token classification[[token-classification]]

[[open-in-colab]]

Token classification assigns a label to individual tokens in a sentence. One of the most common token classification tasks is Named Entity Recognition (NER). NER attempts to find a label for each entity in a sentence, such as a person, location, or organization.

์ด ๊ฐ€์ด๋“œ์—์„œ ํ•™์Šตํ•  ๋‚ด์šฉ์€:

  1. WNUT 17 ๋ฐ์ดํ„ฐ ์„ธํŠธ์—์„œ DistilBERT๋ฅผ ํŒŒ์ธ ํŠœ๋‹ํ•˜์—ฌ ์ƒˆ๋กœ์šด ๊ฐœ์ฒด๋ฅผ ํƒ์ง€ํ•ฉ๋‹ˆ๋‹ค.
  2. ์ถ”๋ก ์„ ์œ„ํ•ด ํŒŒ์ธ ํŠœ๋‹ ๋ชจ๋ธ์„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค.
์ด ํŠœํ† ๋ฆฌ์–ผ์—์„œ ์„ค๋ช…ํ•˜๋Š” ์ž‘์—…์€ ๋‹ค์Œ ๋ชจ๋ธ ์•„ํ‚คํ…์ฒ˜์— ์˜ํ•ด ์ง€์›๋ฉ๋‹ˆ๋‹ค:

ALBERT, BERT, BigBird, BioGpt, BLOOM, CamemBERT, CANINE, ConvBERT, Data2VecText, DeBERTa, DeBERTa-v2, DistilBERT, ELECTRA, ERNIE, ErnieM, ESM, FlauBERT, FNet, Funnel Transformer, GPT-Sw3, OpenAI GPT-2, GPTBigCode, I-BERT, LayoutLM, LayoutLMv2, LayoutLMv3, LiLT, Longformer, LUKE, MarkupLM, MEGA, Megatron-BERT, MobileBERT, MPNet, Nezha, Nystrรถmformer, QDQBert, RemBERT, RoBERTa, RoBERTa-PreLayerNorm, RoCBert, RoFormer, SqueezeBERT, XLM, XLM-RoBERTa, XLM-RoBERTa-XL, XLNet, X-MOD, YOSO

Before you begin, make sure you have all the necessary libraries installed:

```bash
pip install transformers datasets evaluate seqeval
```

We encourage you to log in to your Hugging Face account so you can upload and share your model with the community. When prompted, enter your token to log in:

```py
>>> from huggingface_hub import notebook_login

>>> notebook_login()
```

## Load WNUT 17 dataset[[load-wnut-17-dataset]]

Start by loading the WNUT 17 dataset from the 🤗 Datasets library:

```py
>>> from datasets import load_dataset

>>> wnut = load_dataset("wnut_17")
```

๋‹ค์Œ ์˜ˆ์ œ๋ฅผ ์‚ดํŽด๋ณด์„ธ์š”:

>>> wnut["train"][0]
{'id': '0',
 'ner_tags': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 7, 8, 8, 0, 7, 0, 0, 0, 0, 0, 0, 0, 0],
 'tokens': ['@paulwalk', 'It', "'s", 'the', 'view', 'from', 'where', 'I', "'m", 'living', 'for', 'two', 'weeks', '.', 'Empire', 'State', 'Building', '=', 'ESB', '.', 'Pretty', 'bad', 'storm', 'here', 'last', 'evening', '.']
}

ner_tags์˜ ๊ฐ ์ˆซ์ž๋Š” ๊ฐœ์ฒด๋ฅผ ๋‚˜ํƒ€๋ƒ…๋‹ˆ๋‹ค. ์ˆซ์ž๋ฅผ ๋ ˆ์ด๋ธ” ์ด๋ฆ„์œผ๋กœ ๋ณ€ํ™˜ํ•˜์—ฌ ๊ฐœ์ฒด๊ฐ€ ๋ฌด์—‡์ธ์ง€ ํ™•์ธํ•ฉ๋‹ˆ๋‹ค:

```py
>>> label_list = wnut["train"].features["ner_tags"].feature.names
>>> label_list
[
    "O",
    "B-corporation",
    "I-corporation",
    "B-creative-work",
    "I-creative-work",
    "B-group",
    "I-group",
    "B-location",
    "I-location",
    "B-person",
    "I-person",
    "B-product",
    "I-product",
]
```

๊ฐ ner_tag์˜ ์•ž์— ๋ถ™์€ ๋ฌธ์ž๋Š” ๊ฐœ์ฒด์˜ ํ† ํฐ ์œ„์น˜๋ฅผ ๋‚˜ํƒ€๋ƒ…๋‹ˆ๋‹ค:

  • B-๋Š” ๊ฐœ์ฒด์˜ ์‹œ์ž‘์„ ๋‚˜ํƒ€๋ƒ…๋‹ˆ๋‹ค.
  • I-๋Š” ํ† ํฐ์ด ๋™์ผํ•œ ๊ฐœ์ฒด ๋‚ด๋ถ€์— ํฌํ•จ๋˜์–ด ์žˆ์Œ์„ ๋‚˜ํƒ€๋ƒ…๋‹ˆ๋‹ค(์˜ˆ๋ฅผ ๋“ค์–ด State ํ† ํฐ์€ Empire State Building์™€ ๊ฐ™์€ ๊ฐœ์ฒด์˜ ์ผ๋ถ€์ž…๋‹ˆ๋‹ค).
  • 0๋Š” ํ† ํฐ์ด ์–ด๋–ค ๊ฐœ์ฒด์—๋„ ํ•ด๋‹นํ•˜์ง€ ์•Š์Œ์„ ๋‚˜ํƒ€๋ƒ…๋‹ˆ๋‹ค.
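
To see this IOB scheme in action, you can map the tags of the first training example back to their label names; this quick check (an illustrative extra, not part of the original guide) shows that `Empire State Building` and `ESB` are the tagged entities:

```py
>>> example = wnut["train"][0]
>>> # Keep only the tokens whose tag is not 0 ("O"), paired with their label names.
>>> [(token, label_list[tag]) for token, tag in zip(example["tokens"], example["ner_tags"]) if tag != 0]
[('Empire', 'B-location'), ('State', 'I-location'), ('Building', 'I-location'), ('ESB', 'B-location')]
```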

## Preprocess[[preprocess]]

๋‹ค์Œ์œผ๋กœ tokens ํ•„๋“œ๋ฅผ ์ „์ฒ˜๋ฆฌํ•˜๊ธฐ ์œ„ํ•ด DistilBERT ํ† ํฌ๋‚˜์ด์ €๋ฅผ ๊ฐ€์ ธ์˜ต๋‹ˆ๋‹ค:

```py
>>> from transformers import AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
```

์œ„์˜ ์˜ˆ์ œ tokens ํ•„๋“œ๋ฅผ ๋ณด๋ฉด ์ž…๋ ฅ์ด ์ด๋ฏธ ํ† ํฐํ™”๋œ ๊ฒƒ์ฒ˜๋Ÿผ ๋ณด์ž…๋‹ˆ๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ ์‹ค์ œ๋กœ ์ž…๋ ฅ์€ ์•„์ง ํ† ํฐํ™”๋˜์ง€ ์•Š์•˜์œผ๋ฏ€๋กœ ๋‹จ์–ด๋ฅผ ํ•˜์œ„ ๋‹จ์–ด๋กœ ํ† ํฐํ™”ํ•˜๊ธฐ ์œ„ํ•ด is_split_into_words=True๋ฅผ ์„ค์ •ํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. ์˜ˆ์ œ๋กœ ํ™•์ธํ•ฉ๋‹ˆ๋‹ค:

```py
>>> example = wnut["train"][0]
>>> tokenized_input = tokenizer(example["tokens"], is_split_into_words=True)
>>> tokens = tokenizer.convert_ids_to_tokens(tokenized_input["input_ids"])
>>> tokens
['[CLS]', '@', 'paul', '##walk', 'it', "'", 's', 'the', 'view', 'from', 'where', 'i', "'", 'm', 'living', 'for', 'two', 'weeks', '.', 'empire', 'state', 'building', '=', 'es', '##b', '.', 'pretty', 'bad', 'storm', 'here', 'last', 'evening', '.', '[SEP]']
```

๊ทธ๋Ÿฌ๋‚˜ ์ด๋กœ ์ธํ•ด [CLS]๊ณผ [SEP]๋ผ๋Š” ํŠน์ˆ˜ ํ† ํฐ์ด ์ถ”๊ฐ€๋˜๊ณ , ํ•˜์œ„ ๋‹จ์–ด ํ† ํฐํ™”๋กœ ์ธํ•ด ์ž…๋ ฅ๊ณผ ๋ ˆ์ด๋ธ” ๊ฐ„์— ๋ถˆ์ผ์น˜๊ฐ€ ๋ฐœ์ƒํ•ฉ๋‹ˆ๋‹ค. ํ•˜๋‚˜์˜ ๋ ˆ์ด๋ธ”์— ํ•ด๋‹นํ•˜๋Š” ๋‹จ์ผ ๋‹จ์–ด๋Š” ์ด์ œ ๋‘ ๊ฐœ์˜ ํ•˜์œ„ ๋‹จ์–ด๋กœ ๋ถ„ํ• ๋  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ํ† ํฐ๊ณผ ๋ ˆ์ด๋ธ”์„ ๋‹ค์Œ๊ณผ ๊ฐ™์ด ์žฌ์ •๋ ฌํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค:

1. Mapping all tokens to their corresponding word with the `word_ids` method.
2. Assigning the label `-100` to the special tokens `[CLS]` and `[SEP]` so they're ignored by the PyTorch loss function.
3. Only labeling the first token of a given word. Assign `-100` to other subtokens from the same word.

๋‹ค์Œ์€ ํ† ํฐ๊ณผ ๋ ˆ์ด๋ธ”์„ ์žฌ์ •๋ ฌํ•˜๊ณ  DistilBERT์˜ ์ตœ๋Œ€ ์ž…๋ ฅ ๊ธธ์ด๋ณด๋‹ค ๊ธธ์ง€ ์•Š๋„๋ก ์‹œํ€€์Šค๋ฅผ ์ž˜๋ผ๋‚ด๋Š” ํ•จ์ˆ˜๋ฅผ ๋งŒ๋“œ๋Š” ๋ฐฉ๋ฒ•์ž…๋‹ˆ๋‹ค:

```py
>>> def tokenize_and_align_labels(examples):
...     tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)

...     labels = []
...     for i, label in enumerate(examples["ner_tags"]):
...         word_ids = tokenized_inputs.word_ids(batch_index=i)  # Map tokens to their respective word.
...         previous_word_idx = None
...         label_ids = []
...         for word_idx in word_ids:  # Set the special tokens to -100.
...             if word_idx is None:
...                 label_ids.append(-100)
...             elif word_idx != previous_word_idx:  # Only label the first token of a given word.
...                 label_ids.append(label[word_idx])
...             else:
...                 label_ids.append(-100)
...             previous_word_idx = word_idx
...         labels.append(label_ids)

...     tokenized_inputs["labels"] = labels
...     return tokenized_inputs
```
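
As a quick sanity check (an illustrative extra, not part of the original guide), you can run the function on a small batch and confirm that every realigned `labels` sequence has the same length as its `input_ids`:

```py
>>> batch = tokenize_and_align_labels(wnut["train"][:2])
>>> # By construction, one label per token, including the -100 placeholders.
>>> all(len(labels) == len(input_ids) for labels, input_ids in zip(batch["labels"], batch["input_ids"]))
True
```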

์ „์ฒด ๋ฐ์ดํ„ฐ ์„ธํŠธ์— ์ „์ฒ˜๋ฆฌ ํ•จ์ˆ˜๋ฅผ ์ ์šฉํ•˜๋ ค๋ฉด, ๐Ÿค— Datasets [~datasets.Dataset.map] ํ•จ์ˆ˜๋ฅผ ์‚ฌ์šฉํ•˜์„ธ์š”. batched=True๋กœ ์„ค์ •ํ•˜์—ฌ ๋ฐ์ดํ„ฐ ์„ธํŠธ์˜ ์—ฌ๋Ÿฌ ์š”์†Œ๋ฅผ ํ•œ ๋ฒˆ์— ์ฒ˜๋ฆฌํ•˜๋ฉด map ํ•จ์ˆ˜์˜ ์†๋„๋ฅผ ๋†’์ผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

```py
>>> tokenized_wnut = wnut.map(tokenize_and_align_labels, batched=True)
```

Now create a batch of examples using [DataCollatorForTokenClassification]. It's more efficient to dynamically pad the sentences to the longest length in a batch during collation, instead of padding the whole dataset to the maximum length.

<frameworkcontent>
<pt>
```py
>>> from transformers import DataCollatorForTokenClassification

>>> data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)
```
</pt>
<tf>
```py
>>> from transformers import DataCollatorForTokenClassification

>>> data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer, return_tensors="tf")
```
</tf>
</frameworkcontent>
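
To see dynamic padding in action (an illustrative extra, not from the original guide), you can collate two tokenized examples of different lengths; with the PyTorch collator, both are padded to the longer length, and padded positions in `labels` are filled with `-100` so the loss ignores them:

```py
>>> # Assemble two raw feature dicts of different lengths from the tokenized dataset.
>>> features = [
...     {key: tokenized_wnut["train"][i][key] for key in ["input_ids", "attention_mask", "labels"]}
...     for i in range(2)
... ]
>>> batch = data_collator(features)
>>> batch["labels"].shape  # both examples padded to the length of the longer one
```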

## Evaluate[[evaluation]]

Including a metric during training is often helpful for evaluating your model's performance. You can quickly load an evaluation method with the 🤗 Evaluate library. For this task, load the seqeval metric (see the 🤗 Evaluate quick tour to learn more about how to load and compute a metric). Seqeval actually produces several scores: precision, recall, F1, and accuracy.

```py
>>> import evaluate

>>> seqeval = evaluate.load("seqeval")
```
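
To get a feel for the metric (an illustrative extra, not part of the original guide), you can call it on a toy prediction; a perfectly predicted sentence scores 1.0 across the board:

```py
>>> toy_predictions = [["O", "B-location", "I-location", "O"]]
>>> toy_references = [["O", "B-location", "I-location", "O"]]
>>> seqeval.compute(predictions=toy_predictions, references=toy_references)["overall_f1"]
1.0
```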

Get the NER labels first, and then create a function that passes your true predictions and true labels to [~evaluate.EvaluationModule.compute] to calculate the scores:

```py
>>> import numpy as np

>>> labels = [label_list[i] for i in example["ner_tags"]]


>>> def compute_metrics(p):
...     predictions, labels = p
...     predictions = np.argmax(predictions, axis=2)

...     true_predictions = [
...         [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
...         for prediction, label in zip(predictions, labels)
...     ]
...     true_labels = [
...         [label_list[l] for (p, l) in zip(prediction, label) if l != -100]
...         for prediction, label in zip(predictions, labels)
...     ]

...     results = seqeval.compute(predictions=true_predictions, references=true_labels)
...     return {
...         "precision": results["overall_precision"],
...         "recall": results["overall_recall"],
...         "f1": results["overall_f1"],
...         "accuracy": results["overall_accuracy"],
...     }
```

Your `compute_metrics` function is ready to go now, and you'll return to it when you set up your training.

## Train[[train]]

๋ชจ๋ธ์„ ํ›ˆ๋ จํ•˜๊ธฐ ์ „์—, id2label์™€ label2id๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ์˜ˆ์ƒ๋˜๋Š” id์™€ ๋ ˆ์ด๋ธ”์˜ ๋งต์„ ์ƒ์„ฑํ•˜์„ธ์š”:

```py
>>> id2label = {
...     0: "O",
...     1: "B-corporation",
...     2: "I-corporation",
...     3: "B-creative-work",
...     4: "I-creative-work",
...     5: "B-group",
...     6: "I-group",
...     7: "B-location",
...     8: "I-location",
...     9: "B-person",
...     10: "I-person",
...     11: "B-product",
...     12: "I-product",
... }
>>> label2id = {
...     "O": 0,
...     "B-corporation": 1,
...     "I-corporation": 2,
...     "B-creative-work": 3,
...     "I-creative-work": 4,
...     "B-group": 5,
...     "I-group": 6,
...     "B-location": 7,
...     "I-location": 8,
...     "B-person": 9,
...     "I-person": 10,
...     "B-product": 11,
...     "I-product": 12,
... }
```
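
Since these two maps mirror `label_list` exactly, you could also build them programmatically (a small convenience, not part of the original guide):

```py
>>> id2label = dict(enumerate(label_list))  # {0: "O", 1: "B-corporation", ...}
>>> label2id = {label: i for i, label in enumerate(label_list)}
```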

If you aren't familiar with finetuning a model with the [Trainer], take a look at the basic tutorial here!

์ด์ œ ๋ชจ๋ธ์„ ํ›ˆ๋ จ์‹œํ‚ฌ ์ค€๋น„๊ฐ€ ๋˜์—ˆ์Šต๋‹ˆ๋‹ค! [AutoModelForSequenceClassification]๋กœ DistilBERT๋ฅผ ๊ฐ€์ ธ์˜ค๊ณ  ์˜ˆ์ƒ๋˜๋Š” ๋ ˆ์ด๋ธ” ์ˆ˜์™€ ๋ ˆ์ด๋ธ” ๋งคํ•‘์„ ์ง€์ •ํ•˜์„ธ์š”:

```py
>>> from transformers import AutoModelForTokenClassification, TrainingArguments, Trainer

>>> model = AutoModelForTokenClassification.from_pretrained(
...     "distilbert-base-uncased", num_labels=13, id2label=id2label, label2id=label2id
... )
```

At this point, only three steps remain:

1. Define your training hyperparameters in [TrainingArguments]. The only required parameter is `output_dir`, which specifies where to save your model. You'll push this model to the Hub by setting `push_to_hub=True` (you need to be signed in to Hugging Face to upload your model). At the end of each epoch, the [Trainer] will evaluate the seqeval scores and save the training checkpoint.
2. Pass the training arguments to [Trainer] along with the model, dataset, tokenizer, data collator, and `compute_metrics` function.
3. Call [~Trainer.train] to finetune your model.

```py
>>> training_args = TrainingArguments(
...     output_dir="my_awesome_wnut_model",
...     learning_rate=2e-5,
...     per_device_train_batch_size=16,
...     per_device_eval_batch_size=16,
...     num_train_epochs=2,
...     weight_decay=0.01,
...     evaluation_strategy="epoch",
...     save_strategy="epoch",
...     load_best_model_at_end=True,
...     push_to_hub=True,
... )

>>> trainer = Trainer(
...     model=model,
...     args=training_args,
...     train_dataset=tokenized_wnut["train"],
...     eval_dataset=tokenized_wnut["test"],
...     tokenizer=tokenizer,
...     data_collator=data_collator,
...     compute_metrics=compute_metrics,
... )

>>> trainer.train()
```

ํ›ˆ๋ จ์ด ์™„๋ฃŒ๋˜๋ฉด, [~transformers.Trainer.push_to_hub] ๋ฉ”์†Œ๋“œ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๋ชจ๋ธ์„ ํ—ˆ๋ธŒ์— ๊ณต์œ ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

```py
>>> trainer.push_to_hub()
```

If you aren't familiar with finetuning a model with Keras, take a look at the basic tutorial here!

TensorFlow์—์„œ ๋ชจ๋ธ์„ ํŒŒ์ธ ํŠœ๋‹ํ•˜๋ ค๋ฉด, ๋จผ์ € ์˜ตํ‹ฐ๋งˆ์ด์ € ํ•จ์ˆ˜์™€ ํ•™์Šต๋ฅ  ์Šค์ผ€์ฅด, ๊ทธ๋ฆฌ๊ณ  ์ผ๋ถ€ ํ›ˆ๋ จ ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ์„ค์ •ํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค:
>>> from transformers import create_optimizer

>>> batch_size = 16
>>> num_train_epochs = 3
>>> num_train_steps = (len(tokenized_wnut["train"]) // batch_size) * num_train_epochs
>>> optimizer, lr_schedule = create_optimizer(
...     init_lr=2e-5,
...     num_train_steps=num_train_steps,
...     weight_decay_rate=0.01,
...     num_warmup_steps=0,
... )

๊ทธ๋Ÿฐ ๋‹ค์Œ [TFAutoModelForSequenceClassification]์„ ์‚ฌ์šฉํ•˜์—ฌ DistilBERT๋ฅผ ๊ฐ€์ ธ์˜ค๊ณ , ์˜ˆ์ƒ๋˜๋Š” ๋ ˆ์ด๋ธ” ์ˆ˜์™€ ๋ ˆ์ด๋ธ” ๋งคํ•‘์„ ์ง€์ •ํ•ฉ๋‹ˆ๋‹ค:

```py
>>> from transformers import TFAutoModelForTokenClassification

>>> model = TFAutoModelForTokenClassification.from_pretrained(
...     "distilbert-base-uncased", num_labels=13, id2label=id2label, label2id=label2id
... )
```

Convert your datasets to the `tf.data.Dataset` format with [~transformers.TFPreTrainedModel.prepare_tf_dataset]:

```py
>>> tf_train_set = model.prepare_tf_dataset(
...     tokenized_wnut["train"],
...     shuffle=True,
...     batch_size=16,
...     collate_fn=data_collator,
... )

>>> tf_validation_set = model.prepare_tf_dataset(
...     tokenized_wnut["validation"],
...     shuffle=False,
...     batch_size=16,
...     collate_fn=data_collator,
... )
```

Configure the model for training with `compile`. Note that Transformers models all have a default task-relevant loss function, so you don't need to specify one:

```py
>>> import tensorflow as tf

>>> model.compile(optimizer=optimizer)
```

ํ›ˆ๋ จ์„ ์‹œ์ž‘ํ•˜๊ธฐ ์ „์— ์„ค์ •ํ•ด์•ผํ•  ๋งˆ์ง€๋ง‰ ๋‘ ๊ฐ€์ง€๋Š” ์˜ˆ์ธก์—์„œ seqeval ์ ์ˆ˜๋ฅผ ๊ณ„์‚ฐํ•˜๊ณ , ๋ชจ๋ธ์„ ํ—ˆ๋ธŒ์— ์—…๋กœ๋“œํ•  ๋ฐฉ๋ฒ•์„ ์ œ๊ณตํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค. ๋ชจ๋‘ Keras callbacks๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ์ˆ˜ํ–‰๋ฉ๋‹ˆ๋‹ค.

Pass your `compute_metrics` function to [~transformers.KerasMetricCallback]:

```py
>>> from transformers.keras_callbacks import KerasMetricCallback

>>> metric_callback = KerasMetricCallback(metric_fn=compute_metrics, eval_dataset=tf_validation_set)
```

Specify where to push your model and tokenizer in the [~transformers.PushToHubCallback]:

```py
>>> from transformers.keras_callbacks import PushToHubCallback

>>> push_to_hub_callback = PushToHubCallback(
...     output_dir="my_awesome_wnut_model",
...     tokenizer=tokenizer,
... )
```

๊ทธ๋Ÿฐ ๋‹ค์Œ ์ฝœ๋ฐฑ์„ ํ•จ๊ป˜ ๋ฌถ์Šต๋‹ˆ๋‹ค:

```py
>>> callbacks = [metric_callback, push_to_hub_callback]
```

Finally, you're ready to start training your model! Call `fit` with your training and validation datasets, the number of epochs, and your callbacks to finetune the model:

```py
>>> model.fit(x=tf_train_set, validation_data=tf_validation_set, epochs=3, callbacks=callbacks)
```

ํ›ˆ๋ จ์ด ์™„๋ฃŒ๋˜๋ฉด, ๋ชจ๋ธ์ด ์ž๋™์œผ๋กœ ํ—ˆ๋ธŒ์— ์—…๋กœ๋“œ๋˜์–ด ๋ˆ„๊ตฌ๋‚˜ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค!

For a more in-depth example of how to finetune a model for token classification, take a look at the corresponding PyTorch notebook or TensorFlow notebook.

## Inference[[inference]]

Great, now that you've finetuned a model, you can use it for inference!

์ถ”๋ก ์„ ์ˆ˜ํ–‰ํ•˜๊ณ ์ž ํ•˜๋Š” ํ…์ŠคํŠธ๋ฅผ ๊ฐ€์ ธ์™€๋ด…์‹œ๋‹ค:

```py
>>> text = "The Golden State Warriors are an American professional basketball team based in San Francisco."
```

ํŒŒ์ธ ํŠœ๋‹๋œ ๋ชจ๋ธ๋กœ ์ถ”๋ก ์„ ์‹œ๋„ํ•˜๋Š” ๊ฐ€์žฅ ๊ฐ„๋‹จํ•œ ๋ฐฉ๋ฒ•์€ [pipeline]๋ฅผ ์‚ฌ์šฉํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค. ๋ชจ๋ธ๋กœ NER์˜ pipeline์„ ์ธ์Šคํ„ด์Šคํ™”ํ•˜๊ณ , ํ…์ŠคํŠธ๋ฅผ ์ „๋‹ฌํ•ด๋ณด์„ธ์š”:

```py
>>> from transformers import pipeline

>>> classifier = pipeline("ner", model="stevhliu/my_awesome_wnut_model")
>>> classifier(text)
[{'entity': 'B-location',
  'score': 0.42658573,
  'index': 2,
  'word': 'golden',
  'start': 4,
  'end': 10},
 {'entity': 'I-location',
  'score': 0.35856336,
  'index': 3,
  'word': 'state',
  'start': 11,
  'end': 16},
 {'entity': 'B-group',
  'score': 0.3064001,
  'index': 4,
  'word': 'warriors',
  'start': 17,
  'end': 25},
 {'entity': 'B-location',
  'score': 0.65523505,
  'index': 13,
  'word': 'san',
  'start': 80,
  'end': 83},
 {'entity': 'B-location',
  'score': 0.4668663,
  'index': 14,
  'word': 'francisco',
  'start': 84,
  'end': 93}]
```
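
If you'd rather get whole entity spans than one tag per subword, the pipeline also accepts an `aggregation_strategy` argument (shown here as an optional variant, not part of the original guide):

```py
>>> classifier = pipeline("ner", model="stevhliu/my_awesome_wnut_model", aggregation_strategy="simple")
>>> classifier(text)  # adjacent B-/I- tokens are merged into single entity spans
```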

You can also manually replicate the results of the `pipeline` if you'd like:

ํ…์ŠคํŠธ๋ฅผ ํ† ํฐํ™”ํ•˜๊ณ  PyTorch ํ…์„œ๋ฅผ ๋ฐ˜ํ™˜ํ•ฉ๋‹ˆ๋‹ค:
>>> from transformers import AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("stevhliu/my_awesome_wnut_model")
>>> inputs = tokenizer(text, return_tensors="pt")

์ž…๋ ฅ์„ ๋ชจ๋ธ์— ์ „๋‹ฌํ•˜๊ณ  logits์„ ๋ฐ˜ํ™˜ํ•ฉ๋‹ˆ๋‹ค:

```py
>>> import torch
>>> from transformers import AutoModelForTokenClassification

>>> model = AutoModelForTokenClassification.from_pretrained("stevhliu/my_awesome_wnut_model")
>>> with torch.no_grad():
...     logits = model(**inputs).logits
```

๊ฐ€์žฅ ๋†’์€ ํ™•๋ฅ ์„ ๊ฐ€์ง„ ํด๋ž˜์Šค๋ฅผ ๋ชจ๋ธ์˜ id2label ๋งคํ•‘์„ ์‚ฌ์šฉํ•˜์—ฌ ํ…์ŠคํŠธ ๋ ˆ์ด๋ธ”๋กœ ๋ณ€ํ™˜ํ•ฉ๋‹ˆ๋‹ค:

```py
>>> predictions = torch.argmax(logits, dim=2)
>>> predicted_token_class = [model.config.id2label[t.item()] for t in predictions[0]]
>>> predicted_token_class
['O',
 'O',
 'B-location',
 'I-location',
 'B-group',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'B-location',
 'B-location',
 'O',
 'O']
```
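
To line the predictions up with the tokens they describe (an illustrative extra, not part of the original guide), you can zip the converted tokens with the predicted classes:

```py
>>> tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
>>> # Keep only the tokens predicted to be part of an entity.
>>> [(token, label) for token, label in zip(tokens, predicted_token_class) if label != "O"]
[('golden', 'B-location'), ('state', 'I-location'), ('warriors', 'B-group'), ('san', 'B-location'), ('francisco', 'B-location')]
```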
ํ…์ŠคํŠธ๋ฅผ ํ† ํฐํ™”ํ•˜๊ณ  TensorFlow ํ…์„œ๋ฅผ ๋ฐ˜ํ™˜ํ•ฉ๋‹ˆ๋‹ค:
>>> from transformers import AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("stevhliu/my_awesome_wnut_model")
>>> inputs = tokenizer(text, return_tensors="tf")

์ž…๋ ฅ๊ฐ’์„ ๋ชจ๋ธ์— ์ „๋‹ฌํ•˜๊ณ  logits์„ ๋ฐ˜ํ™˜ํ•ฉ๋‹ˆ๋‹ค:

```py
>>> import tensorflow as tf
>>> from transformers import TFAutoModelForTokenClassification

>>> model = TFAutoModelForTokenClassification.from_pretrained("stevhliu/my_awesome_wnut_model")
>>> logits = model(**inputs).logits
```

๊ฐ€์žฅ ๋†’์€ ํ™•๋ฅ ์„ ๊ฐ€์ง„ ํด๋ž˜์Šค๋ฅผ ๋ชจ๋ธ์˜ id2label ๋งคํ•‘์„ ์‚ฌ์šฉํ•˜์—ฌ ํ…์ŠคํŠธ ๋ ˆ์ด๋ธ”๋กœ ๋ณ€ํ™˜ํ•ฉ๋‹ˆ๋‹ค:

```py
>>> predicted_token_class_ids = tf.math.argmax(logits, axis=-1)
>>> predicted_token_class = [model.config.id2label[t] for t in predicted_token_class_ids[0].numpy().tolist()]
>>> predicted_token_class
['O',
 'O',
 'B-location',
 'I-location',
 'B-group',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'B-location',
 'B-location',
 'O',
 'O']
```