<!--Copyright 2020 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.
-->
# Use tokenizers from 🤗 Tokenizers[[use-tokenizers-from-tokenizers]]

[`PreTrainedTokenizerFast`] depends on the [🤗 Tokenizers](https://huggingface.co/docs/tokenizers) library. Tokenizers obtained from the 🤗 Tokenizers library can be loaded very simply into 🤗 Transformers.

Before getting into the specifics, let's start by creating a dummy tokenizer in a few lines of code:
```python
>>> from tokenizers import Tokenizer
>>> from tokenizers.models import BPE
>>> from tokenizers.trainers import BpeTrainer
>>> from tokenizers.pre_tokenizers import Whitespace
>>> tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
>>> trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
>>> tokenizer.pre_tokenizer = Whitespace()
>>> files = [...]
>>> tokenizer.train(files, trainer)
```
We now have a tokenizer trained on the files we defined. We can either keep using it in this runtime, or save it to a JSON file for later reuse.
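For example, the trained tokenizer can be used right away in the same runtime. Below is a minimal sketch, assuming the training files contained real text; the sample sentence is purely illustrative:

```python
>>> # Encode a sentence with the freshly trained tokenizer (illustrative input)
>>> output = tokenizer.encode("Hello, y'all! How are you?")
>>> print(output.tokens)  # learned subword tokens
>>> print(output.ids)  # corresponding token ids
```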
## Loading directly from the tokenizer object[[loading-directly-from-the-tokenizer-object]]

Let's see how to leverage this tokenizer object in the 🤗 Transformers library. The [`PreTrainedTokenizerFast`] class allows for easy instantiation by accepting the instantiated *tokenizer* object as an argument:
```python
>>> from transformers import PreTrainedTokenizerFast
>>> fast_tokenizer = PreTrainedTokenizerFast(tokenizer_object=tokenizer)
```
The `fast_tokenizer` object can now be used with all the methods shared by the 🤗 Transformers tokenizers! Head to [the tokenizer page](main_classes/tokenizer) for more information.
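As a quick illustration of that shared interface (a sketch only; the sample sentence is made up), the wrapped tokenizer can be called directly like any other 🤗 Transformers tokenizer:

```python
>>> # Calling the fast tokenizer returns the usual encoding dictionary
>>> encoding = fast_tokenizer("Hello, y'all! How are you?")
>>> print(encoding["input_ids"])
>>> print(fast_tokenizer.convert_ids_to_tokens(encoding["input_ids"]))
```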
## Loading from a JSON file[[loading-from-a-JSON-file]]

In order to load a tokenizer from a JSON file, let's first start by saving our tokenizer:
```python
>>> tokenizer.save("tokenizer.json")
```
The path to which we saved this file can be passed to the [`PreTrainedTokenizerFast`] initialization method using the `tokenizer_file` parameter:
```python
>>> from transformers import PreTrainedTokenizerFast
>>> fast_tokenizer = PreTrainedTokenizerFast(tokenizer_file="tokenizer.json")
```
This `fast_tokenizer` object can likewise be used with all the methods shared by the 🤗 Transformers tokenizers! Head to [the tokenizer page](main_classes/tokenizer) for more information.
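As a further sketch (the directory name `my-tokenizer` is just an example), the resulting tokenizer can also be persisted in the 🤗 Transformers format with `save_pretrained` and loaded back later with `from_pretrained`:

```python
>>> # Save in the 🤗 Transformers format (writes tokenizer.json, tokenizer_config.json, ...)
>>> fast_tokenizer.save_pretrained("my-tokenizer")

>>> # Reload it later from that directory
>>> fast_tokenizer = PreTrainedTokenizerFast.from_pretrained("my-tokenizer")
```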