# Custom Tokenizer
## Examples
Example sentence: `This is a test sentence. On va voir comment elle est gérée .... 123 + 56 = 2567. Let's go! Imagine I have code    4 spaces.
 and a      backslash!! Eléonore est un prénom français. __name__ isInstance`

Encoded sentence: `['▁This', '▁is', '▁a', '▁test', '▁sentence', '.', '▁On', '▁va', '▁voir', '▁comment', '▁elle', '▁est', '▁g', 'érée', '▁....', '▁', '1', '2', '3', '▁+', '▁', '5', '6', '▁=', '▁', '2', '5', '6', '7', '.', "▁Let's", '▁go', '!', '▁Im', 'agine', '▁I', '▁have', '▁code', '▁', '▁', '▁', '▁', '4', '▁sp', 'aces', '.\n', '▁and', '▁a', '▁', '▁', '▁', '▁', '▁', '▁back', 'sl', 'ash', '!!', '▁E', 'lé', 'on', 'ore', '▁est', '▁un', '▁prénom', '▁français.', '▁__', 'name', '__', '▁is', 'Instance']`

Decoded sentence: `<s> This is a test sentence. On va voir comment elle est gérée .... 123 + 56 = 2567. Let's go! Imagine I have code    4 spaces.
 and a      backslash!! Eléonore est un prénom français. __name__ isInstance`

## Usage
```python
from transformers import LlamaTokenizerFast

# Load the tokenizer; replace '<tok_name>' with the repository name
tok = LlamaTokenizerFast.from_pretrained('<tok_name>')

tok.tokenize('This is a test sentence')
```
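As the example above shows, digits are split individually, runs of whitespace are preserved, and decoding prepends the `<s>` BOS token. A minimal round-trip sketch, continuing from the snippet above:

```python
# Encode, inspect the tokens, then decode back to text
ids = tok.encode("This is a test sentence. On va voir comment elle est gérée ....")
print(tok.convert_ids_to_tokens(ids))  # tokens carry the '▁' whitespace marker
print(tok.decode(ids))                 # output starts with the '<s>' BOS token
```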

## Dataset Stats
The tokenizer was trained on the dataset `manu/tok-corpus-shuffled`.

The dataset consists of French, English, and code samples.

More info on the dataset can be found [here](https://huggingface.co/datasets/manu/tok-corpus-shuffled).

For speed, the tokenizer was trained on a sample of the dataset: only the first samples were selected.

Sample size: 500

Size of sample: 0.0 GB
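The exact sampling code is not published; a hedged sketch of how the first 500 samples could be drawn with the `datasets` library (the `"text"` column name is an assumption):

```python
from datasets import load_dataset

# Stream the corpus and keep only the first 500 samples, matching the stats above
ds = load_dataset("manu/tok-corpus-shuffled", split="train", streaming=True)
sample = [example["text"] for example, _ in zip(ds, range(500))]
```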

## Tokenizer Configs
Built from scratch: True

Pretrained tokenizer: None

The tokenizer is trained with digit separation, whitespace preservation (for code), and byte fallback, among other settings.
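The exact training options are not published; a hypothetical from-scratch sketch with the `tokenizers` library illustrating the features named above (the vocabulary size and special tokens are assumptions; `sample` comes from the dataset sketch above):

```python
from tokenizers import Tokenizer, models, trainers, pre_tokenizers, decoders

# BPE model with byte fallback for out-of-vocabulary characters
tokenizer = Tokenizer(models.BPE(byte_fallback=True))
tokenizer.pre_tokenizer = pre_tokenizers.Sequence([
    pre_tokenizers.Metaspace(),                     # keep whitespace as '▁' markers (for code)
    pre_tokenizers.Digits(individual_digits=True),  # split numbers into single digits
])
tokenizer.decoder = decoders.Metaspace()

# vocab_size and special_tokens are assumed values, not the published config
trainer = trainers.BpeTrainer(vocab_size=32000, special_tokens=["<unk>", "<s>", "</s>"])
tokenizer.train_from_iterator(sample, trainer)
```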