<!--Copyright 2020 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->

# Use tokenizers from πŸ€— Tokenizers[[use-tokenizers-from-tokenizers]]

The [`PreTrainedTokenizerFast`] depends on the [πŸ€— Tokenizers](https://huggingface.co/docs/tokenizers) library. Tokenizers built with the
πŸ€— Tokenizers library can be loaded into πŸ€— Transformers very simply.

Before getting into the specifics, let's first build a dummy tokenizer in a few lines:

```python
>>> from tokenizers import Tokenizer
>>> from tokenizers.models import BPE
>>> from tokenizers.trainers import BpeTrainer
>>> from tokenizers.pre_tokenizers import Whitespace

>>> tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
>>> trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])

>>> tokenizer.pre_tokenizer = Whitespace()
>>> files = [...]
>>> tokenizer.train(files, trainer)
```

μš°λ¦¬κ°€ μ •μ˜ν•œ νŒŒμΌμ„ 톡해 이제 ν•™μŠ΅λœ ν† ν¬λ‚˜μ΄μ €λ₯Ό κ°–κ²Œ λ˜μ—ˆμŠ΅λ‹ˆλ‹€. 이 λŸ°νƒ€μž„μ—μ„œ 계속 μ‚¬μš©ν•˜κ±°λ‚˜ JSON 파일둜 μ €μž₯ν•˜μ—¬ λ‚˜μ€‘μ— μ‚¬μš©ν•  수 μžˆμŠ΅λ‹ˆλ‹€.

## ν† ν¬λ‚˜μ΄μ € κ°μ²΄λ‘œλΆ€ν„° 직접 뢈러였기[[loading-directly-from-the-tokenizer-object]]

Let's see how we can leverage this tokenizer object in the πŸ€— Transformers library.
The [`PreTrainedTokenizerFast`] class allows for easy instantiation by accepting the instantiated *tokenizer* object as an argument:

```python
>>> from transformers import PreTrainedTokenizerFast

>>> fast_tokenizer = PreTrainedTokenizerFast(tokenizer_object=tokenizer)
```

The `fast_tokenizer` object can now be used with all the methods shared by the πŸ€— Transformers tokenizers! Head to [the tokenizer page](main_classes/tokenizer) for more information.
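
For example (a minimal sketch reusing the dummy tokenizer trained above, so the exact ids depend on your training files), it can be called like any other πŸ€— Transformers tokenizer:

```python
>>> # `fast_tokenizer` wraps the tokenizer object instantiated above
>>> encoded = fast_tokenizer("Hello, how are you?")
>>> print(encoded["input_ids"])  # token ids from the learned vocabulary
>>> print(fast_tokenizer.decode(encoded["input_ids"]))  # back to a string
```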

## Loading from a JSON file[[loading-from-a-JSON-file]]

JSON νŒŒμΌμ—μ„œ ν† ν¬λ‚˜μ΄μ €λ₯Ό 뢈러였기 μœ„ν•΄, λ¨Όμ € ν† ν¬λ‚˜μ΄μ €λ₯Ό μ €μž₯ν•΄ λ³΄κ² μŠ΅λ‹ˆλ‹€:

```python
>>> tokenizer.save("tokenizer.json")
```

JSON νŒŒμΌμ„ μ €μž₯ν•œ κ²½λ‘œλŠ” `tokenizer_file` λ§€κ°œλ³€μˆ˜λ₯Ό μ‚¬μš©ν•˜μ—¬ [`PreTrainedTokenizerFast`] μ΄ˆκΈ°ν™” λ©”μ†Œλ“œμ— 전달할 수 μžˆμŠ΅λ‹ˆλ‹€:

```python
>>> from transformers import PreTrainedTokenizerFast

>>> fast_tokenizer = PreTrainedTokenizerFast(tokenizer_file="tokenizer.json")
```

The `fast_tokenizer` object can now be used with all the methods shared by the πŸ€— Transformers tokenizers! Head to [the tokenizer page](main_classes/tokenizer) for more information.
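
If you'd rather reload it later with `AutoTokenizer`, one option (a minimal sketch, with a hypothetical directory name) is to save it in the πŸ€— Transformers format instead of, or in addition to, the raw JSON file:

```python
>>> from transformers import AutoTokenizer

>>> fast_tokenizer.save_pretrained("my-tokenizer")  # "my-tokenizer" is a hypothetical output directory
>>> reloaded_tokenizer = AutoTokenizer.from_pretrained("my-tokenizer")
```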