Saiteja committed
Commit c952f8b · verified · Parent: b864e89

Upload folder using huggingface_hub
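For reference, an upload like this is typically done with `huggingface_hub`'s `upload_folder`; a minimal sketch, with a hypothetical folder path and repo id:

```python
from huggingface_hub import HfApi

api = HfApi()
# Upload the local folder containing README.md, examples.json and tokenizer.json.
# Both folder_path and repo_id below are hypothetical; substitute the real ones.
api.upload_folder(
    folder_path="./telugu-bpe-tokenizer",
    repo_id="Saiteja/telugu-bpe-tokenizer",
    repo_type="model",
)
```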

Files changed (3)
  1. README.md +38 -0
  2. examples.json +99 -0
  3. tokenizer.json +0 -0
README.md ADDED
---
language: te
tags:
- telugu
- tokenizer
- bpe
license: mit
---

# Telugu BPE Tokenizer

A Byte-Pair Encoding (BPE) tokenizer trained on Telugu text data from Wikipedia.

## Model Description

This tokenizer was trained on Telugu text data collected from Wikipedia articles. It uses Byte-Pair Encoding (BPE) to create subword tokens.

## Stats

- Vocabulary Size: 5000 tokens
- Compression Ratio: 1.26 (see the sketch below)
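The README does not define how the compression ratio is computed; a common convention is characters per token, and the sketch below computes it under that assumption:

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("tokenizer.json")

def compression_ratio(texts):
    # Assumed definition: total characters divided by total tokens.
    total_chars = sum(len(t) for t in texts)
    total_tokens = sum(len(tokenizer.encode(t).ids) for t in texts)
    return total_chars / total_tokens

sample = ["నమస్కారం", "తెలుగు భాష చాలా అందమైనది"]
print(round(compression_ratio(sample), 2))
```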
## Usage

```python
from tokenizers import Tokenizer

# Load the tokenizer
tokenizer = Tokenizer.from_file("tokenizer.json")

# Tokenize text
text = "నమస్కారం"
encoding = tokenizer.encode(text)
print(encoding.tokens)
```
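If the tokenizer is fetched from the Hub rather than a local file, `hf_hub_download` can resolve the path first; a minimal sketch, with a hypothetical repo id:

```python
from huggingface_hub import hf_hub_download
from tokenizers import Tokenizer

# Download tokenizer.json from the Hub (the repo_id is hypothetical).
path = hf_hub_download(repo_id="Saiteja/telugu-bpe-tokenizer", filename="tokenizer.json")
tokenizer = Tokenizer.from_file(path)
```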
## Training Data

The tokenizer was trained on Telugu text data collected from Wikipedia articles. The data includes a diverse range of topics and writing styles.
examples.json ADDED
[
  {
    "text": "నమస్కారం",
    "tokens": ["Ġనమ", "à°¸", "à±į", "à°ķ", "à°¾", "à°°", "à°Ĥ"],
    "ids": [438, 196, 177, 185, 179, 180, 181]
  },
  {
    "text": "తెలుగు భాష చాలా అందమైనది",
    "tokens": ["Ġà°¤", "à±Ĩ", "à°²", "à±ģ", "à°Ĺ", "à±ģ", "Ġà°Ń", "à°¾", "à°·", "Ġà°ļ", "à°¾", "à°²", "à°¾", "Ġà°ħ", "à°Ĥ", "దమ", "à±Ī", "నద", "à°¿"],
    "ids": [230, 204, 183, 182, 199, 182, 254, 179, 223, 225, 179, 183, 179, 211, 181, 946, 213, 447, 178]
  },
  {
    "text": "భారతదేశం నా దేశం",
    "tokens": ["Ġà°Ń", "à°¾", "రతద", "à±ĩ", "à°¶", "à°Ĥ", "Ġà°¨", "à°¾", "Ġà°¦", "à±ĩ", "à°¶", "à°Ĥ"],
    "ids": [254, 179, 524, 195, 217, 181, 206, 179, 215, 195, 217, 181]
  }
]
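The token strings above are not mojibake: they appear to be byte-level BPE units, where a GPT-2-style byte-to-unicode mapping renders Ġ as a leading space and surfaces Telugu characters as their mapped UTF-8 bytes (e.g. "à°¸"). A round-trip sketch, assuming that setup:

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("tokenizer.json")

# Encode the first example and decode it back to text.
encoding = tokenizer.encode("నమస్కారం")
print(encoding.tokens)                  # byte-level strings such as "Ġనమ", "à°¸", ...
print(tokenizer.decode(encoding.ids))   # should recover "నమస్కారం"
```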
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
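Although the diff is not rendered, the file can be inspected locally; a sketch assuming the standard `tokenizers` JSON layout:

```python
import json

with open("tokenizer.json", encoding="utf-8") as f:
    data = json.load(f)

# In the standard layout, the model type and vocabulary sit under "model".
print(data["model"]["type"])        # expected: "BPE"
print(len(data["model"]["vocab"]))  # expected: 5000, per the README
```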