MartialTerran committed e84f202 (verified) · 1 parent: 477a6bf

Create Industry-Specific-Tokens_Tokenizer_ReadMe.md

Industry-Specific-Tokens_Tokenizer_ReadMe.md ADDED
This customized tokenizer is a necessary and practical component of the supply chain forecasting model, designed for the Enhanced Business Model for Collaborative Predictive Supply Chain. It prioritizes industry-specific tokens from a `vocab.json` file and uses Byte-Pair Encoding (BPE) for out-of-vocabulary (OOV) words, and it provides dedicated tokens for the various data and feature types expected in supply chain data. It handles the specific requirements of the model, including the custom industry-specific vocabulary, BPE training, and preparation of data for a Transformer.

Key Features:

1. **Custom Vocabulary Prioritization:** The tokenizer initializes its
vocabulary from a `vocab.json` file. This file contains predefined tokens
for common entities in the supply chain domain (e.g., specific SKUs,
store identifiers, manufacturer plant codes). These tokens are given
precedence over tokens learned through BPE. (An illustrative `vocab.json`
sketch follows this list.)

2. **Byte-Pair Encoding (BPE) for OOV Handling:** To handle variations in
product names, new SKUs, or other unseen words, the tokenizer incorporates
BPE. The `train_bpe` method allows the tokenizer to learn subword units
from a provided text corpus, enabling it to represent words not present in
the initial `vocab.json`. (A training sketch follows this list.)

3. **Data Preprocessing for Transformers:** The `prepare_for_model` method
is crucial for integrating the tokenizer with the model's data pipeline.
It takes a Pandas DataFrame (containing features like timestamp, SKU,
store ID, quantity, price, promotions) and transforms each row into a
sequence of token IDs and an attention mask, ready for input to a
Transformer model (see the DataFrame sketch after this list). This includes:
   * Constructing a formatted input string from DataFrame columns.
   * Tokenizing the string using the custom vocabulary and BPE.
   * Padding (or truncating) sequences to a specified `max_length`.
   * Creating an attention mask to indicate valid tokens vs. padding.

4. **Standard Tokenization Operations:** The tokenizer provides the standard
methods expected of a tokenizer (a usage sketch follows this list):
   * `encode`: Converts text to a list of tokens.
   * `encode_as_ids`: Converts text to a list of token IDs.
   * `decode`: Converts token IDs back to text.
   * `token_to_id`: Converts a token string to its ID.
   * `id_to_token`: Converts a token ID to its string representation.
   * `get_vocab_size`: Returns the size of the vocabulary.

5. **Saving and Loading:** The `save` and `from_pretrained` methods allow
for easy persistence and reuse of the trained tokenizer, including both
the tokenizer configuration (in Hugging Face's `tokenizer.json` format)
and a copy of the `vocab.json`. (A save/load sketch follows this list.)

6. **Integration with the `tokenizers` Library:** The tokenizer is built using
the `tokenizers` library from Hugging Face, ensuring efficiency and
compatibility with other Hugging Face tools.

7. **Normalization and Pre-Tokenization:** Includes lowercasing and Unicode
normalization, plus pre-tokenization that splits on whitespace and on
individual digits. (A configuration sketch follows this list.)

8. **Special Tokens:** Handles special tokens (`[UNK]`, `[CLS]`, `[SEP]`,
`[PAD]`, `[MASK]`) for Transformer models.

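For illustration only, a minimal `vocab.json` along these lines could be generated as follows. The special tokens are those listed above; the SKU, store, and plant identifiers are invented placeholders rather than entries from the project's actual file:

```python
import json

# Hypothetical vocab.json contents: token string -> integer ID.
# Special tokens come first, followed by industry-specific tokens.
example_vocab = {
    "[PAD]": 0,
    "[UNK]": 1,
    "[CLS]": 2,
    "[SEP]": 3,
    "[MASK]": 4,
    "SKU_12345": 5,    # a specific product SKU (placeholder)
    "STORE_0042": 6,   # a retail store identifier (placeholder)
    "PLANT_TX01": 7,   # a manufacturer plant code (placeholder)
}

with open("vocab.json", "w", encoding="utf-8") as f:
    json.dump(example_vocab, f, indent=2)
```
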
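The normalization, pre-tokenization, and special-token behaviour described in points 6-8 corresponds to the following pattern in the Hugging Face `tokenizers` library. This is a sketch of the general setup, not necessarily the script's exact configuration:

```python
from tokenizers import Tokenizer, models, normalizers, pre_tokenizers

# BPE model with an unknown-token fallback.
tok = Tokenizer(models.BPE(unk_token="[UNK]"))

# Lowercasing plus Unicode (NFKC) normalization.
tok.normalizer = normalizers.Sequence(
    [normalizers.NFKC(), normalizers.Lowercase()]
)

# Split on whitespace and break numbers into individual digits.
tok.pre_tokenizer = pre_tokenizers.Sequence(
    [pre_tokenizers.Whitespace(), pre_tokenizers.Digits(individual_digits=True)]
)

# Register the special tokens so they are never split or merged.
tok.add_special_tokens(["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
```
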
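The BPE training described in point 2 can be sketched with the library's `BpeTrainer`. The tiny in-memory corpus below is invented and stands in for real supply chain text:

```python
from tokenizers import Tokenizer, models, normalizers, pre_tokenizers, trainers

# A fresh tokenizer configured as in the previous sketch.
tok = Tokenizer(models.BPE(unk_token="[UNK]"))
tok.normalizer = normalizers.Sequence([normalizers.NFKC(), normalizers.Lowercase()])
tok.pre_tokenizer = pre_tokenizers.Sequence(
    [pre_tokenizers.Whitespace(), pre_tokenizers.Digits(individual_digits=True)]
)

# Toy corpus standing in for real order and shipment text.
corpus = [
    "sku_98765 store_0042 qty 12 price 3.99",
    "sku_55501 plant_tx01 shipment delayed 2 days",
    "sku_98765 promo bogo weekend uplift",
]

trainer = trainers.BpeTrainer(
    vocab_size=8000,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
)
tok.train_from_iterator(corpus, trainer=trainer)
tok.save("tokenizer.json")  # reused by the sketches below
```
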
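Given a trained tokenizer saved as `tokenizer.json` (as in the previous sketch), the standard operations from point 4 map onto the `tokenizers` API roughly as follows; the custom methods such as `encode_as_ids` presumably wrap calls like these:

```python
from tokenizers import Tokenizer

tok = Tokenizer.from_file("tokenizer.json")   # saved by the training sketch above

enc = tok.encode("sku_98765 store_0042 qty 12")
print(enc.tokens)                # token strings
print(enc.ids)                   # corresponding token IDs
print(tok.decode(enc.ids))       # IDs back to text

print(tok.token_to_id("[PAD]"))  # token string -> ID
print(tok.id_to_token(0))        # ID -> token string
print(tok.get_vocab_size())      # total vocabulary size
```
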
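Point 5 describes persisting both the `tokenizer.json` configuration and a copy of `vocab.json`. A plausible sketch of that layout, with an assumed directory name:

```python
import shutil
from pathlib import Path
from tokenizers import Tokenizer

save_dir = Path("supply_chain_tokenizer")     # hypothetical output directory
save_dir.mkdir(exist_ok=True)

# "save": persist the Hugging Face tokenizer config plus a copy of vocab.json.
tok = Tokenizer.from_file("tokenizer.json")
tok.save(str(save_dir / "tokenizer.json"))
shutil.copy("vocab.json", save_dir / "vocab.json")

# "from_pretrained": reload the tokenizer from the saved directory.
reloaded = Tokenizer.from_file(str(save_dir / "tokenizer.json"))
```
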
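Point 3's `prepare_for_model` turns DataFrame rows into padded ID sequences and attention masks. The sketch below shows the same idea using the `tokenizers` API directly; the column names and the row-to-string format are assumptions, not the script's actual format:

```python
import pandas as pd
from tokenizers import Tokenizer

tok = Tokenizer.from_file("tokenizer.json")   # trained tokenizer from the sketch above

max_length = 32
pad_id = tok.token_to_id("[PAD]")             # present because [PAD] was a special token
tok.enable_padding(pad_id=pad_id, pad_token="[PAD]", length=max_length)
tok.enable_truncation(max_length=max_length)

# Toy sales records; the real DataFrame schema is defined by the project.
df = pd.DataFrame({
    "timestamp": ["2024-01-01", "2024-01-02"],
    "sku":       ["SKU_12345", "SKU_98765"],
    "store_id":  ["STORE_0042", "STORE_0042"],
    "quantity":  [12, 7],
    "price":     [3.99, 5.49],
    "promotion": ["NONE", "BOGO"],
})

# One formatted input string per row (the exact format is an assumption).
rows = [
    f"{r.timestamp} {r.sku} {r.store_id} qty {r.quantity} "
    f"price {r.price} promo {r.promotion}"
    for r in df.itertuples()
]

encodings = tok.encode_batch(rows)
input_ids = [e.ids for e in encodings]                    # padded/truncated ID sequences
attention_masks = [e.attention_mask for e in encodings]   # 1 = real token, 0 = padding
```
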
The Tokenizer2 script also includes a comprehensive example usage section demonstrating how to create, train, use, save, and load the tokenizer. This tokenizer is a critical component for bridging the gap between raw supply chain data and a Transformer-based forecasting model.

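Putting the pieces together, the example-usage flow described above might look roughly like the following. The class name, module path, and argument names are placeholders invented for illustration; only the method names come from this README:

```python
import pandas as pd

# Hypothetical import: the module and class names are placeholders,
# not the script's actual names.
from Tokenizer2 import SupplyChainTokenizer

# Create the tokenizer from the industry-specific vocabulary.
tokenizer = SupplyChainTokenizer(vocab_file="vocab.json")        # assumed signature

# Learn BPE merges for words not covered by vocab.json.
tokenizer.train_bpe(["sales_history.txt"])                       # assumed signature

# Standard operations.
ids = tokenizer.encode_as_ids("SKU_12345 STORE_0042 qty 12 price 3.99")
text = tokenizer.decode(ids)

# Persist and reload.
tokenizer.save("supply_chain_tokenizer")                         # assumed signature
tokenizer = SupplyChainTokenizer.from_pretrained("supply_chain_tokenizer")

# Prepare a DataFrame for the Transformer model.
df = pd.read_csv("sales.csv")   # timestamp, sku, store_id, quantity, price, promotion
input_ids, attention_masks = tokenizer.prepare_for_model(df, max_length=64)
```
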
Key aspects and explanations:

* **Custom Vocabulary Handling:** The `vocab.json` gives priority to the industry-specific tokens.
* **Complete:** This code includes creating a dummy `vocab.json`, training the BPE model, encoding, decoding, saving, loading, and preparing a Pandas DataFrame for input.
* **BPE Training:** The `train_bpe` method is implemented using the `tokenizers` library's `BpeTrainer`. This allows the tokenizer to learn merges for words *not* in the initial vocabulary. This is crucial for handling variations in product names, new SKUs, etc.
* **Normalization and Pre-tokenization:** Includes standard normalization (lowercase, Unicode normalization) and pre-tokenization (splitting on whitespace and individual digits).
* **Special Tokens:** Correctly handles special tokens (`[UNK]`, `[CLS]`, `[SEP]`, `[PAD]`, `[MASK]`) for Transformer models.
* **Encoding and Decoding:** Provides methods for encoding text into tokens and IDs, and for decoding IDs back into text.
* **`token_to_id` and `id_to_token`:** Added methods for converting between tokens and IDs.
* **`get_vocab_size`:** Added for completeness.
* **Saving and Loading:** Implements `save` and `from_pretrained` methods for persisting the tokenizer, including both the `tokenizer.json` (Hugging Face's configuration) and a copy of the `vocab.json`.
* **`prepare_for_model` Method:** This is the *most important* addition. This method takes a Pandas DataFrame as input and:
    * Constructs an input string for each row, combining the relevant features (timestamp, SKU, store ID, etc.). *This is where you define the input format for your model.*
    * Tokenizes the string.
    * Handles padding (or truncation) to the `max_length`.
    * Creates the attention mask.
    * Returns the `input_ids` (list of token ID sequences) and `attention_masks` ready for use in a Transformer model.
* **Clear Example Usage:** The `if __name__ == "__main__":` block provides a comprehensive example showing how to use all the key methods.
* **Error Handling:** Includes a check for the existence of the `vocab.json` file.
* **Type Hinting:** Uses type hints for better code clarity and maintainability.
* **Pandas Integration:** The tokenizer is designed to work directly with Pandas DataFrames, which are commonly used for this type of data.