Update README.md
README.md
@@ -11,6 +11,33 @@ license: cc-by-sa-4.0
The Micka Tokenizer is a subword tokenizer with a vocabulary size of 32,768 tokens, designed to support a wide array of natural language processing (NLP) tasks in both Slovenian and English.

It includes a set of special tokens for structured language tasks, ranging from standard NLP needs to chatbot and conversational models. Each special token is wrapped in Unicode delimiter characters (`⸢` at the start and `⸥` at the end), which reduces the likelihood that these token sequences will appear in regular text data.
- **Start delimiter**: `⸢` (Unicode: `U+2E22`)
- **End delimiter**: `⸥` (Unicode: `U+2E25`)
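
As a rough illustration of the delimiter scheme, the following Python sketch wraps a token name in the two delimiters and uses a regular expression to spot delimited sequences in a string. The helper names `wrap_special` and `find_special` are hypothetical and not part of the tokenizer, and the tokenizer's own matching logic may differ.

```python
import re
from typing import List

START = "\u2E22"  # ⸢ start delimiter
END = "\u2E25"    # ⸥ end delimiter

# Matches any non-empty run of characters enclosed in the two delimiters,
# for example ⸢PAD⸥ or ⸢/s⸥. This pattern is an illustrative assumption.
SPECIAL_RE = re.compile(
    re.escape(START) + "([^" + START + END + "]+)" + re.escape(END)
)


def wrap_special(name: str) -> str:
    """Build a special-token string such as ⸢PAD⸥ from its inner name."""
    return START + name + END


def find_special(text: str) -> List[str]:
    """Return the inner names of all delimited sequences found in `text`."""
    return SPECIAL_RE.findall(text)


if __name__ == "__main__":
    sample = wrap_special("s") + "Dober dan." + wrap_special("/s")
    print(find_special(sample))
```

Because neither delimiter is common in everyday Slovenian or English text, a plain string scan like this is unlikely to match ordinary input by accident.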

## Special Tokens

The tokenizer supports a range of special tokens that enable structured processing, padding, and segmentation in different tasks:

- **Padding, Masking, and Separation Tokens**:
  - **Padding** (`⸢PAD⸥`): Used to pad sequences to a uniform length.
  - **Masking** (`⸢MSK⸥`): Marks parts of a sequence that should be masked for certain tasks, such as masked language modeling.
  - **Separation** (`⸢|⸥`): A general-purpose separator token.

- **Sentence and Paragraph Structure Tokens**:
  - **Start of Sentence** (`⸢s⸥`) and **End of Sentence** (`⸢/s⸥`): Define sentence boundaries for better structuring and parsing within long text sequences.
  - **Start of Paragraph** (`⸢p⸥`) and **End of Paragraph** (`⸢/p⸥`): Define paragraph breaks, which can be helpful for document-based tasks or summarization tasks where paragraph structure is essential.

- **Out-of-Vocabulary / Unknown Token**:
  - **Unknown Token** (`⸢UNK⸥`): Represents out-of-vocabulary words or unknown tokens that the model encounters.

- **Chatbot and Conversational Tokens**:
  - **System Message Start** (`⸢SYS⸥`): Marks the beginning of a system message, useful in multi-turn dialogue systems.
  - **User Message Start** (`⸢USR⸥`): Designates the start of a user’s input, enabling clear differentiation in conversation logs.
  - **Agent Message Start** (`⸢AGT⸥`): Indicates the start of a response from the chatbot or conversational agent.
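
To make the roles of these tokens more concrete, here is a minimal Python sketch that pads a token list with `⸢PAD⸥` and assembles a conversation string from the three role tokens. The token strings come from the list above; everything else is an assumption for illustration: the helper names `pad_tokens` and `format_chat` are hypothetical, padding on the right is an arbitrary choice, and placing message text directly after the role token is a guess, since this README does not define a chat template.

```python
PAD = "\u2E22PAD\u2E25"  # ⸢PAD⸥

# Role tokens as listed in this README; the dictionary keys are hypothetical.
ROLE_TOKENS = {
    "system": "\u2E22SYS\u2E25",  # ⸢SYS⸥
    "user": "\u2E22USR\u2E25",    # ⸢USR⸥
    "agent": "\u2E22AGT\u2E25",   # ⸢AGT⸥
}


def pad_tokens(tokens, length):
    """Right-pad a list of token strings with PAD up to `length`.

    Lists already at or beyond `length` are returned unchanged (no truncation).
    """
    return tokens + [PAD] * max(0, length - len(tokens))


def format_chat(messages):
    """Concatenate (role, text) pairs, prefixing each text with its role token.

    Assumption: the text follows the role token directly, with no
    end-of-message marker.
    """
    return "".join(ROLE_TOKENS[role] + text for role, text in messages)


if __name__ == "__main__":
    print(pad_tokens(["Dober", "dan"], 4))
    print(format_chat([("system", "Be brief."), ("user", "Živjo!")]))
```

In practice the padding side and the exact conversation layout should follow whatever the model trained with this tokenizer expects.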

## Model Details

- **Developed by**: Marko Kokol