# Industry-Specific Tokens Tokenizer
This customized tokenizer is a necessary and practical component of the supply chain forecasting model, designed for the Enhanced Business Model for Collaborative Predictive Supply Chain. It prioritizes industry-specific tokens from a `vocab.json` file and falls back to Byte-Pair Encoding (BPE) for out-of-vocabulary (OOV) words. Dedicated tokens cover the various data and feature types expected in supply chain data, and the tokenizer handles the model's specific requirements, including the custom industry-specific vocabulary, BPE training, and preparation of data for a Transformer.
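
As a point of reference, the snippet below sketches one way such a `vocab.json` could be produced. It assumes the file maps token strings to integer IDs; every token name shown is hypothetical, and the real file's format and contents are defined by the project, not by this sketch.

```python
import json

# Illustrative only: the real vocab.json contains the project's actual SKUs,
# store identifiers, plant codes, and other industry-specific tokens.
special_tokens = ["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"]
industry_tokens = [
    "sku_12345", "sku_67890",        # hypothetical product SKUs
    "store_001", "store_002",        # hypothetical store identifiers
    "plant_tx_01",                   # hypothetical manufacturer plant code
    "promo_bogo", "promo_discount",  # hypothetical promotion types
]

# Token -> ID mapping; special tokens come first so their IDs stay stable.
vocab = {token: idx for idx, token in enumerate(special_tokens + industry_tokens)}

with open("vocab.json", "w", encoding="utf-8") as f:
    json.dump(vocab, f, indent=2)
```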

Key Features:

1. **Custom Vocabulary Prioritization:** The tokenizer initializes its vocabulary from a `vocab.json` file. This file contains predefined tokens for common entities in the supply chain domain (e.g., specific SKUs, store identifiers, manufacturer plant codes). These tokens are given precedence over tokens learned through BPE; a construction sketch follows this list.

2. **Byte-Pair Encoding (BPE) for OOV Handling:** To handle variations in product names, new SKUs, or other unseen words, the tokenizer incorporates BPE. The `train_bpe` method allows the tokenizer to learn subword units from a provided text corpus, enabling it to represent words not present in the initial `vocab.json`; a training sketch follows this list.

3. **Data Preprocessing for Transformers:** The `prepare_for_model` method is crucial for integrating the tokenizer with the model's data pipeline. It takes a Pandas DataFrame (containing features like timestamp, SKU, store ID, quantity, price, promotions) and transforms each row into a sequence of token IDs and an attention mask, ready for input to a Transformer model. This includes:
   * Constructing a formatted input string from DataFrame columns.
   * Tokenizing the string using the custom vocabulary and BPE.
   * Padding (or truncating) sequences to a specified `max_length`.
   * Creating an attention mask to indicate valid tokens vs. padding.

4. **Standard Tokenization Operations:** The tokenizer provides the standard methods expected of a tokenizer:
   * `encode`: Converts text to a list of tokens.
   * `encode_as_ids`: Converts text to a list of token IDs.
   * `decode`: Converts token IDs back to text.
   * `token_to_id`: Converts a token string to its ID.
   * `id_to_token`: Converts a token ID to its string representation.
   * `get_vocab_size`: Returns the size of the vocabulary.

5. **Saving and Loading:** The `save` and `from_pretrained` methods allow for easy persistence and reuse of the trained tokenizer, including both the tokenizer configuration (in Hugging Face's `tokenizer.json` format) and a copy of the `vocab.json`.

6. **Integration with the `tokenizers` Library:** The tokenizer is built using the `tokenizers` library from Hugging Face, ensuring efficiency and compatibility with other Hugging Face tools.

7. **Normalization and Pre-Tokenization:** Includes lowercasing and Unicode normalization, plus pre-tokenization (splitting on whitespace and on individual digits).

8. **Special Tokens:** Handles the special tokens (`[UNK]`, `[CLS]`, `[SEP]`, `[PAD]`, `[MASK]`) required by Transformer models.
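
The sketch below shows how a tokenizer with these properties could be assembled with the Hugging Face `tokenizers` library, covering the vocabulary priority, library integration, normalization, pre-tokenization, and special tokens described in items 1 and 6 through 8. It is illustrative only: the function name and the assumption that `vocab.json` maps token strings to IDs are not taken from the actual script.

```python
import json

from tokenizers import Tokenizer, normalizers, pre_tokenizers
from tokenizers.models import BPE
from tokenizers.normalizers import NFD, Lowercase
from tokenizers.pre_tokenizers import Digits, Whitespace

SPECIAL_TOKENS = ["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"]

def build_tokenizer(vocab_path: str = "vocab.json") -> Tokenizer:
    """Assemble a BPE tokenizer seeded with the industry-specific vocabulary."""
    tokenizer = Tokenizer(BPE(unk_token="[UNK]"))

    # Lowercasing + Unicode (NFD) normalization (item 7).
    tokenizer.normalizer = normalizers.Sequence([NFD(), Lowercase()])

    # Split on whitespace and on individual digits (item 7).
    tokenizer.pre_tokenizer = pre_tokenizers.Sequence(
        [Whitespace(), Digits(individual_digits=True)]
    )

    # Register the special tokens required by Transformer models (item 8).
    tokenizer.add_special_tokens(SPECIAL_TOKENS)

    # Add the industry-specific tokens so they are matched whole, before the
    # BPE model ever sees the text, giving them precedence (item 1). Only the
    # token strings are used here; the tokenizer assigns its own IDs.
    with open(vocab_path, encoding="utf-8") as f:
        vocab = json.load(f)
    tokenizer.add_tokens([tok for tok in vocab if tok not in SPECIAL_TOKENS])

    return tokenizer
```

With this setup, a string such as `sku_12345` that appears in `vocab.json` is matched as a single token before pre-tokenization runs, while unseen words fall through to the BPE model.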
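
A `train_bpe`-style method (item 2) maps naturally onto the library's `BpeTrainer`. The sketch below is an assumption about how such a method could look, not the script's exact signature.

```python
from tokenizers import Tokenizer
from tokenizers.trainers import BpeTrainer

SPECIAL_TOKENS = ["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"]

def train_bpe(tokenizer: Tokenizer, corpus_files: list, vocab_size: int = 30_000) -> None:
    """Learn BPE merges from a text corpus so that OOV words (new SKUs,
    product-name variants, etc.) can be represented as subword units."""
    trainer = BpeTrainer(
        vocab_size=vocab_size,
        min_frequency=2,            # ignore very rare symbol pairs
        special_tokens=SPECIAL_TOKENS,
    )
    # corpus_files is a list of plain-text file paths, e.g. historical order
    # lines or product descriptions (hypothetical corpus sources).
    tokenizer.train(corpus_files, trainer)

# Example usage (continuing from the construction sketch above):
# tokenizer = build_tokenizer("vocab.json")
# train_bpe(tokenizer, ["supply_chain_corpus.txt"])
# print(tokenizer.encode("new_sku_99999 restock at store_001").tokens)
```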

The Tokenizer2 script also includes a comprehensive example usage section demonstrating how to create, train, use, save, and load the tokenizer. This tokenizer is a critical component for bridging the gap between raw supply chain data and a Transformer-based forecasting model.
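
As a rough illustration of the standard operations in item 4, the calls below show how the underlying `tokenizers` objects behave once the tokenizer has been built and trained as sketched above; the wrapper's own `encode`, `encode_as_ids`, and `decode` methods would delegate to calls of this kind, though their exact signatures are defined by the script.

```python
# Assumes `tokenizer` is the trained Tokenizer from the sketches above.
encoding = tokenizer.encode("sku_12345 sold 42 units at store_001")

print(encoding.tokens)                 # token strings (custom vocab + BPE subwords)
print(encoding.ids)                    # the corresponding token IDs
print(tokenizer.decode(encoding.ids))  # token IDs back to text
print(tokenizer.token_to_id("[CLS]"))  # token string -> ID
print(tokenizer.id_to_token(0))        # ID -> token string
print(tokenizer.get_vocab_size())      # total vocabulary size
```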
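
Persistence (item 5) likewise maps onto the library's own save and load calls. A minimal sketch, assuming the wrapper stores its configuration as `tokenizer.json` and keeps a copy of `vocab.json` next to it:

```python
import shutil
from pathlib import Path

from tokenizers import Tokenizer

def save_tokenizer(tokenizer: Tokenizer, out_dir: str, vocab_path: str = "vocab.json") -> None:
    """Persist the tokenizer configuration plus a copy of the industry vocabulary."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    tokenizer.save(str(out / "tokenizer.json"))  # Hugging Face tokenizer.json format
    shutil.copy(vocab_path, out / "vocab.json")  # keep the source vocab alongside it

def load_tokenizer(out_dir: str) -> Tokenizer:
    """Reload a previously saved tokenizer (the `from_pretrained` analogue)."""
    return Tokenizer.from_file(str(Path(out_dir) / "tokenizer.json"))
```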

Key aspects and explanations:

* **Custom Vocabulary Handling:** The `vocab.json` gives priority to the industry-specific tokens.
* **Complete:** The code includes creating a dummy `vocab.json`, training the BPE model, encoding, decoding, saving, loading, and preparing a Pandas DataFrame for input.
* **BPE Training:** The `train_bpe` method is implemented using the `tokenizers` library's `BpeTrainer`, which lets the tokenizer learn merges for words *not* in the initial vocabulary. This is crucial for handling variations in product names, new SKUs, etc.
* **Normalization and Pre-tokenization:** Includes standard normalization (lowercasing, Unicode normalization) and pre-tokenization (splitting on whitespace and individual digits).
* **Special Tokens:** Correctly handles the special tokens (`[UNK]`, `[CLS]`, `[SEP]`, `[PAD]`, `[MASK]`) for Transformer models.
* **Encoding and Decoding:** Provides methods for encoding text into tokens and IDs, and for decoding IDs back into text.
* **`token_to_id` and `id_to_token`:** Added methods for converting between tokens and IDs.
* **`get_vocab_size`:** Added for completeness.
* **Saving and Loading:** Implements `save` and `from_pretrained` methods for persisting the tokenizer, including both the `tokenizer.json` (Hugging Face's configuration format) and a copy of the `vocab.json`.
* **`prepare_for_model` Method:** This is the *most important* addition (a sketch follows this list). The method takes a Pandas DataFrame as input and:
  * Constructs an input string for each row, combining the relevant features (timestamp, SKU, store ID, etc.). *This is where you define the input format for your model.*
  * Tokenizes the string.
  * Handles padding (or truncation) to the `max_length`.
  * Creates the attention mask.
  * Returns the `input_ids` (list of token ID sequences) and `attention_masks`, ready for use in a Transformer model.
* **Clear Example Usage:** The `if __name__ == "__main__":` block provides a comprehensive example showing how to use all the key methods.
* **Error Handling:** Includes a check for the existence of the `vocab.json` file.
* **Type Hinting:** Uses type hints for better code clarity and maintainability.
* **Pandas Integration:** The tokenizer is designed to work directly with Pandas DataFrames, which are commonly used for this type of data.
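
To make the `prepare_for_model` behaviour concrete, here is a minimal sketch of how such a method could be written on top of the tokenizer built above. The column names, the flattened input-string format, and the function signature are all illustrative assumptions; the actual script defines its own input format.

```python
from typing import List, Tuple

import pandas as pd
from tokenizers import Tokenizer

def prepare_for_model(
    tokenizer: Tokenizer, df: pd.DataFrame, max_length: int = 64
) -> Tuple[List[List[int]], List[List[int]]]:
    """Turn each DataFrame row into padded token IDs plus an attention mask."""
    pad_id = tokenizer.token_to_id("[PAD]")
    input_ids, attention_masks = [], []

    for _, row in df.iterrows():
        # Hypothetical input format: flatten the row's features into one string.
        text = (
            f"[CLS] {row['timestamp']} {row['sku']} {row['store_id']} "
            f"qty {row['quantity']} price {row['price']} promo {row['promotion']} [SEP]"
        )
        ids = tokenizer.encode(text).ids[:max_length]      # tokenize, then truncate
        mask = [1] * len(ids)                              # 1 marks a real token
        padding = [pad_id] * (max_length - len(ids))       # pad up to max_length
        input_ids.append(ids + padding)
        attention_masks.append(mask + [0] * len(padding))  # 0 marks padding
    return input_ids, attention_masks

# Example with a hypothetical two-row DataFrame:
# df = pd.DataFrame({
#     "timestamp": ["2024-01-01", "2024-01-02"],
#     "sku": ["sku_12345", "sku_67890"],
#     "store_id": ["store_001", "store_002"],
#     "quantity": [42, 7],
#     "price": [3.99, 12.50],
#     "promotion": ["none", "bogo"],
# })
# input_ids, attention_masks = prepare_for_model(tokenizer, df, max_length=32)
```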