---
title: TextTokenization
emoji: 🐠
colorFrom: indigo
colorTo: indigo
sdk: docker
pinned: false
license: mit
short_description: Text Tokenization using Byte-Pair Encoding (BPE)
---

# Tokenization

While training an LLM, the following steps are followed:

![LLM Training](./assets/docs/LLMprocess.png)

- After data collection and preprocessing, the data is tokenized, i.e. converted into discrete tokens.
- After tokenization, embeddings are generated by transforming the tokens into numerical vectors for processing by the model.
- Below are the ways in which the data is represented and why Unicode is used for tokenization.

[toc]

## ASCII

ASCII (American Standard Code for Information Interchange) is a character encoding standard that uses 7 bits to represent characters, allowing for 128 unique symbols, including English letters, digits, and control characters. ASCII is limited in its ability to represent characters from other languages, which is why Unicode was developed to cover a broader range of characters.

## Unicode

Unicode is a standardized system that defines a set of characters and their corresponding code points, allowing for the representation of text in multiple languages and scripts. As of now, Unicode encompasses roughly 150,000 characters across 161 scripts, including 3,790 emojis, which facilitates the encoding of diverse languages such as Hindi, Korean, and more.

The different versions of UTF (Unicode Transformation Format) represent Unicode characters as binary data. The most common encodings are:

### UTF-8

A variable-length encoding that uses 1 to 4 bytes per character. It is the most widely used encoding on the web and can represent all Unicode characters. The first 128 characters (which correspond to ASCII) are encoded in one byte, while additional characters require more bytes.

### UTF-16

This encoding typically uses 2 bytes for most characters but can use 4 bytes for less common characters. It is often used in environments where memory is less of a concern.

### UTF-32

A fixed-length encoding that uses 4 bytes for every character, making it straightforward but less efficient in terms of space compared to UTF-8 and UTF-16.

---

## Tokens

- There are different ways in which tokens can be created from the data.
- Character-level and word-level tokens are less commonly used in Large Language Models (LLMs) like GPT due to specific limitations that make them less efficient and less effective for most language modeling tasks compared to subword tokenization techniques. Here's why:

### 1. Character-Level Tokens

- **Advantages**:
  - Simple to implement.
  - Can handle any input text without encountering "unknown tokens", since every character is part of the vocabulary.
  - Good for tasks requiring fine-grained control, like poetry or transliteration.
- **Disadvantages**:
  - **Longer Sequences**: Representing text character by character results in significantly longer input sequences. For example, the word "language" requires 8 tokens instead of 1 or 2 with subword tokenization. Longer sequences increase computational costs and training time (see the example after this list).
  - **Loss of Semantics**: Characters individually don't carry much semantic meaning, so the model has to work harder to infer relationships and build contextual meaning over long sequences.
  - **Inefficiency**: LLMs have a fixed input size for each sequence (e.g., 2048 tokens for GPT-3). Using character-level tokens wastes a lot of capacity on redundant or trivial information.
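To make the sequence-length point concrete, here is a small Python sketch (an illustrative example only, not code from this repository) that counts character-level tokens and the underlying UTF-8 bytes for a mixed English/Hindi string:

```python
# Illustrative comparison of character-level tokens vs. the underlying UTF-8 bytes.
text = "language नमस्ते"            # an English word plus a Hindi (Devanagari) word

char_tokens = list(text)             # one token per Unicode character
utf8_bytes = text.encode("utf-8")    # ASCII chars -> 1 byte each, Devanagari -> 3 bytes each

print(len(char_tokens))              # 15 characters
print(len(utf8_bytes))               # 27 bytes (9 ASCII/space + 6 * 3 Devanagari)
```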
### 2. Word-Level Tokens

- **Advantages**:
  - More semantically meaningful than characters. Each token corresponds to a complete word, reducing sequence length.
  - Simpler vocabulary compared to subword tokenization.
- **Disadvantages**:
  - **Large Vocabulary**: Word-level tokenization leads to a very large vocabulary to cover all possible words in a language, especially for morphologically rich languages. This increases memory requirements and makes the model harder to train.
  - **Out-of-Vocabulary (OOV) Words**: Words unseen during training cannot be represented, leading to issues with generalization. For example, new words, names, or typos will not be handled well.
  - **Lack of Subword Information**: The model cannot exploit the shared structure of words (e.g., "run", "runner", "running"). This makes it less effective at generalizing patterns across related words.

### 3. Why Subword Tokens Work Better

Subword tokenization techniques, such as Byte Pair Encoding (BPE), WordPiece, or Unigram Language Modeling, provide a middle ground:

- **Balanced Vocabulary**: The vocabulary size is smaller than word-level tokenization but larger than character-level tokenization.
- **Handles Rare and OOV Words**: New words or typos can be broken into meaningful subwords, allowing the model to still understand and process them (e.g., "unhappiness" → "un", "happi", "ness").
- **Efficient Sequence Length**: Subwords reduce sequence length compared to characters, improving computational efficiency without losing much semantic information.
- **Reusability**: Common prefixes, suffixes, and roots (e.g., "ing", "pre", "ly") are tokenized consistently, which aids in learning and generalization.

### Summary

While character-level and word-level tokenization have their use cases, they are not ideal for LLMs due to inefficiency and limitations in vocabulary handling and semantic representation. Subword tokenization strikes the right balance by being computationally efficient, flexible, and effective for generalization.

---

## Regex

- In order to create tokens, regex is used to identify the patterns in the data.
- Regular expressions (regex) play a key role in subword tokenization processes like Byte Pair Encoding (BPE), WordPiece, and Unigram Language Modeling, as they help define and extract meaningful patterns from text. Here's why regex is commonly used in these processes:

### 1. Splitting and Preprocessing Text

Regex is highly efficient for text preprocessing, which is a crucial first step in subword tokenization. It is used to:

- Normalize Text: Remove special characters, extra spaces, or unwanted symbols.
- Split Text into Basic Units: Regex can split text into initial units, such as words, whitespace-separated tokens, or even characters, which serve as the foundation for creating subword vocabularies.
  - Example: Splitting "Hello, world!" into ["Hello", ",", "world", "!"].

### 2. Identifying Subword Patterns

Regex allows the tokenization algorithm to recognize subword units based on patterns (a short example follows this section):

- Breaking Words into Prefixes, Roots, and Suffixes: Regex can match patterns like "un-", "-ing", "-ly", etc., that are common subword components.
  - Example: Matching `re` or `ing` in "repeating" using regex patterns like `\bre` or `ing\b`.
- Handling Non-Alphanumeric Characters: Regex makes it easy to handle punctuation, symbols, or digits by matching them as separate tokens.
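A minimal Python sketch of the splitting and prefix/suffix matching described above, using the standard `re` module (production tokenizers typically use more elaborate patterns):

```python
import re

# Split "Hello, world!" into word and punctuation tokens.
print(re.findall(r"\w+|\S", "Hello, world!"))
# -> ['Hello', ',', 'world', '!']

# Match common subword components: the prefix "re" and the suffix "ing" in "repeating".
word = "repeating"
print(bool(re.match(r"re", word)))      # True: the word starts with "re"
print(bool(re.search(r"ing\b", word)))  # True: the word ends with "ing"
```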
### 3. Constructing Subword Vocabularies

During vocabulary construction, regex helps with:

- Counting Subword Frequencies: Regex can efficiently identify and count occurrences of subwords in a corpus, which is essential for frequency-based algorithms like BPE.
- Finding Merge Candidates: In BPE, regex identifies pairs of adjacent tokens (e.g., `lo` and `ve` in "love") to determine which pair should be merged into a single token.

### 4. Tokenizing New Text

When applying subword tokenization to new text, regex helps in:

- Matching Known Subword Units: Regex is used to break down words into subwords that exist in the pre-trained vocabulary.
  - Example: Tokenizing "unhappiness" into ["un", "happi", "ness"] using regex patterns to match vocabulary entries.
- Handling OOV Cases: Regex can break unknown words into smaller subunits that still make sense semantically or phonetically.

### 5. Efficiency and Flexibility

Regex is both:

- Fast: Regex libraries are optimized for text pattern matching, making them suitable for large-scale tokenization tasks.
- Flexible: Regex can be easily customized for different languages, tokenization rules, or specific needs (e.g., handling emojis, URLs, or hashtags).

## Regex for a New Language

- Designing a regex for tokenizing a new language requires careful consideration of linguistic, syntactic, and practical factors.
- Each language has unique characteristics such as writing systems, grammar rules, and punctuation usage that must be addressed. Below are the key factors to consider:

### 1. Writing System and Script

- Character Set: Identify the script used in the language (e.g., Latin, Cyrillic, Devanagari, Arabic, etc.).
  - Regex should include Unicode ranges for the characters in the language.
  - Example: For Hindi (Devanagari script), use `[\u0900-\u097F]` to match characters.
- Diacritics: Consider combining characters like accents or tone markers.
  - Example: In French, regex should account for é, è, ê, etc.

### 2. Word Boundaries

- Word Separation: Determine how words are separated. Most languages use spaces, but some (e.g., Chinese, Japanese, Thai) do not.
  - For space-separated languages: `\b` (word boundary) is useful.
  - For languages without spaces: Define rules for splitting text based on known word patterns or syllables.

### 3. Morphology

- Agglutinative or Inflected Forms: Languages like Turkish or Finnish have long words with multiple morphemes. Regex should consider splitting based on suffixes, prefixes, or infixes.
  - Example: Use patterns like `-` or `\w+` for handling hyphenated or compound words.
- Compound Words: German and Dutch often form compound words. You might need regex to separate components intelligently.

### 4. Special Characters

- Punctuation: Define how punctuation marks are handled (e.g., splitting them as separate tokens or keeping them attached to words).
  - Example: Tokenizing "Hello, world!" might involve a regex like `\w+|\S`.
- Numerals: Decide how to tokenize numbers, especially if they include decimal points, commas, or currency symbols.
  - Example: Use `\d+(\.\d+)?` to match integers and decimals.
- Currency, Dates, and Times: Handle specific patterns like $100, 2025-01-05, or 12:30 PM.

### 5. Language-Specific Rules

- Elisions and Clitics: Handle contractions or shortened forms.
  - Example: In French, "l'amour" should be split into ["l'", "amour"].
  - Regex: `\w+'|\w+` (see the sketch after this section).
- Honorifics and Titles: Account for prefixes like "Mr.", "Dr.", or equivalents in other languages.
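A few of the language-specific patterns mentioned above, shown as a minimal Python sketch (the exact patterns are illustrative choices, not prescriptions):

```python
import re

# Hindi (Devanagari) characters via the Unicode range [\u0900-\u097F].
print(re.findall(r"[\u0900-\u097F]+", "नमस्ते world"))      # -> ['नमस्ते']

# Integers and decimals (non-capturing group so findall returns whole matches).
print(re.findall(r"\d+(?:\.\d+)?", "Price: 100 or 99.95"))   # -> ['100', '99.95']

# French elision: keep the clitic "l'" as its own token.
print(re.findall(r"\w+'|\w+", "l'amour"))                    # -> ["l'", 'amour']
```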
### 6. Multilingual Considerations

- If the language frequently incorporates words or phrases from other languages (e.g., English borrowings in Japanese), the regex should accommodate mixed scripts or transliterations.
  - Example: Tokenizing "コンピュータcomputer" in Japanese should handle both scripts appropriately.

### 7. Whitespace and Line Breaks

- Whitespace Handling: Decide how to treat tabs, newlines, or multiple spaces.
  - Regex like `\s+` can be used to standardize whitespace.

### 8. Efficiency

- Avoid overly complex regex patterns that could slow down tokenization for large texts. Break down tasks into smaller regex components if necessary.

### 9. Non-Alphanumeric Symbols

- Consider language-specific symbols such as:
  - Emojis or emoticons.
  - Logograms or ideograms in Chinese.
  - Phonetic annotations (e.g., furigana in Japanese).

---

## Byte Pair Encoding (BPE) Implementation

- `dataset.txt`: Downloaded from [Link](https://ai4bharat.iitm.ac.in/datasets/sangraha)
- `byte_pair_encoding.py`: Implementation of BPE
- `tokenizer.json`: Saved tokens

```bash
Token length: 31617
Ids length: 2045
Compression ratio: 15.4606X
```

### Usage

```bash
$ python byte_pair_encoding.py
```
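For reference, the core BPE training loop can be sketched as follows. This is a simplified, self-contained illustration operating on raw UTF-8 bytes; the repository's `byte_pair_encoding.py` may differ in structure and details.

```python
def get_pair_counts(ids):
    """Count occurrences of each adjacent pair of token ids."""
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

def merge(ids, pair, new_id):
    """Replace every occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train_bpe(text, num_merges):
    """Learn `num_merges` merge rules starting from raw UTF-8 bytes."""
    ids = list(text.encode("utf-8"))     # base vocabulary: the 256 byte values
    merges = {}                          # (id, id) -> new token id
    for i in range(num_merges):
        counts = get_pair_counts(ids)
        if not counts:
            break
        pair = max(counts, key=counts.get)   # most frequent adjacent pair
        new_id = 256 + i
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id
    return ids, merges

if __name__ == "__main__":
    sample = "low lower lowest low low"    # toy corpus for illustration
    ids, merges = train_bpe(sample, num_merges=10)
    n_bytes = len(sample.encode("utf-8"))
    print(f"Bytes: {n_bytes}, ids after merges: {len(ids)}")
    print(f"Compression ratio: {n_bytes / len(ids):.4f}X")
```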