Darija Text Normalization Model

This repository contains a Sequence-to-Sequence LSTM model trained to normalize Darija text. The model converts noisy or informal Darija into a standardized format using character-level tokenization.

Model Details

  • Architecture: Encoder-Decoder LSTM (Sequence-to-Sequence)
  • Task: Text Normalization
  • Language: Darija (Moroccan Arabic)
  • Input Tokenizer: Character-level
  • Target Tokenizer: Character-level
  • Embedding Dimension: 50
  • Latent Dimension (LSTM Units): 128
  • Training Data: Darija Open Dataset (link)
  • Saved Model Format: Keras (.keras)
  • Tokenizers Format: JSON (.json), as produced by Keras Tokenizer.to_json() (see the sketch after this list)
  • Parameters Format: JSON (.json)
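
For context, here is a minimal sketch of how character-level Keras tokenizers of this kind are typically created and saved. The corpus strings, oov_token value, and output file names are illustrative placeholders, not the exact settings used to train this model:

import tensorflow as tf

# Illustrative training pairs; in practice these come from the Darija Open Dataset.
noisy_texts = ["kn-mchiw l-sou9"]                  # informal / noisy spelling
normalized_texts = ["<normalized counterpart>"]    # placeholder for the standardized spelling

# Character-level tokenizers with an explicit OOV token (assumed setting).
tokenizer_input = tf.keras.preprocessing.text.Tokenizer(char_level=True, oov_token="<unk>")
tokenizer_input.fit_on_texts(noisy_texts)

tokenizer_target = tf.keras.preprocessing.text.Tokenizer(char_level=True, oov_token="<unk>")
tokenizer_target.fit_on_texts(normalized_texts)

# Tokenizer.to_json() returns a JSON string that tokenizer_from_json() can reload later.
with open("my_tokenizer_input.json", "w", encoding="utf-8") as f:
    f.write(tokenizer_input.to_json())
with open("my_tokenizer_target.json", "w", encoding="utf-8") as f:
    f.write(tokenizer_target.to_json())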

Files in this Repository

  • darija-text-normalizer.keras: The trained Keras model.
  • tokenizer_input.json: JSON file for the input tokenizer configuration.
  • tokenizer_target.json: JSON file for the target tokenizer configuration.
  • model_parameters.json: Model parameters (such as max sequence lengths and vocabulary sizes).
  • config.yaml: YAML file with complete model configuration details.
  • README.md: This file, describing the model and its usage.

How It Works

The model uses an Encoder-Decoder LSTM architecture. The encoder reads the input text character by character and compresses it into a context vector (its final LSTM states). The decoder then uses this context to generate the normalized text one character at a time. Because tokenization is character-level, the model can cope with spelling variations and words it never saw during training.
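
For reference, below is a minimal sketch of how a training-time model with the dimensions listed above (embedding dimension 50, 128 LSTM units) can be assembled in Keras. The vocabulary sizes and exact layer arrangement are illustrative assumptions rather than an exact reconstruction of the shipped model:

import tensorflow as tf

INPUT_VOCAB_SIZE = 60    # illustrative; the real value is stored in model_parameters.json
TARGET_VOCAB_SIZE = 60   # illustrative
EMBEDDING_DIM = 50       # "Embedding Dimension" from the model details
LATENT_DIM = 128         # "Latent Dimension (LSTM Units)" from the model details

# Encoder: embeds the input characters and keeps only the final LSTM states.
encoder_inputs = tf.keras.layers.Input(shape=(None,))
encoder_embedded = tf.keras.layers.Embedding(INPUT_VOCAB_SIZE, EMBEDDING_DIM)(encoder_inputs)
_, state_h, state_c = tf.keras.layers.LSTM(LATENT_DIM, return_state=True)(encoder_embedded)

# Decoder: generates the normalized sequence, initialized with the encoder states.
decoder_inputs = tf.keras.layers.Input(shape=(None,))
decoder_embedded = tf.keras.layers.Embedding(TARGET_VOCAB_SIZE, EMBEDDING_DIM)(decoder_inputs)
decoder_outputs, _, _ = tf.keras.layers.LSTM(
    LATENT_DIM, return_sequences=True, return_state=True
)(decoder_embedded, initial_state=[state_h, state_c])
decoder_outputs = tf.keras.layers.Dense(TARGET_VOCAB_SIZE, activation="softmax")(decoder_outputs)

model = tf.keras.models.Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")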

Example Usage

Below is an example of how to load the model and tokenizers and perform text normalization:

import tensorflow as tf
from tensorflow.keras.preprocessing.text import tokenizer_from_json
import json
import numpy as np

# --- Load the Keras model ---
loaded_model = tf.keras.models.load_model("darija-text-normalizer.keras")

# Rebuild the inference-time encoder and decoder from the training graph.
# The lookups below assume the usual seq2seq layout (two LSTM layers, with the
# encoder LSTM created before the decoder LSTM, and a final Dense output layer);
# check loaded_model.summary() and adjust if your layout differs.
lstm_layers = [l for l in loaded_model.layers if isinstance(l, tf.keras.layers.LSTM)]
embedding_layers = [l for l in loaded_model.layers if isinstance(l, tf.keras.layers.Embedding)]
encoder_lstm_layer, decoder_lstm_layer = lstm_layers
decoder_embedding_layer = embedding_layers[-1]
decoder_dense_layer = loaded_model.layers[-1]

# The encoder LSTM returns [outputs, state_h, state_c]; the inference encoder
# only needs the two states that will initialize the decoder.
encoder_model = tf.keras.models.Model(loaded_model.input[0], encoder_lstm_layer.output[1:])

# Get the latent dimension from the decoder LSTM configuration
latent_dim_loaded = decoder_lstm_layer.get_config()['units']
decoder_state_input_h = tf.keras.layers.Input(shape=(latent_dim_loaded,))
decoder_state_input_c = tf.keras.layers.Input(shape=(latent_dim_loaded,))
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]

decoder_embedding_inf = decoder_embedding_layer(loaded_model.input[1])
decoder_outputs_inf, state_h_inf, state_c_inf = decoder_lstm_layer(
    decoder_embedding_inf, initial_state=decoder_states_inputs
)
decoder_states_inf = [state_h_inf, state_c_inf]
decoder_outputs_inf = decoder_dense_layer(decoder_outputs_inf)
decoder_model = tf.keras.models.Model(
    [loaded_model.input[1]] + decoder_states_inputs,
    [decoder_outputs_inf] + decoder_states_inf
)

# --- Load Tokenizers ---
# tokenizer_from_json expects the JSON string produced by Tokenizer.to_json();
# re-serialize with json.dumps if the file was stored as a plain JSON object.
with open("tokenizer_input.json", 'r', encoding='utf-8') as f:
    tokenizer_input_config = json.load(f)
tokenizer_input = tokenizer_from_json(
    tokenizer_input_config if isinstance(tokenizer_input_config, str) else json.dumps(tokenizer_input_config)
)

with open("tokenizer_target.json", 'r', encoding='utf-8') as f:
    tokenizer_target_config = json.load(f)
tokenizer_target = tokenizer_from_json(
    tokenizer_target_config if isinstance(tokenizer_target_config, str) else json.dumps(tokenizer_target_config)
)

# --- Load Model Parameters ---
with open("model_parameters.json", 'r', encoding='utf-8') as f:
    model_params = json.load(f)
    max_input_len = model_params['max_input_len']
    max_target_len = model_params['max_target_len']

def normalize_text(input_text, encoder_model, decoder_model, input_tokenizer, target_tokenizer, max_target_len, max_input_len):
    """Normalizes input Darija text using the trained encoder-decoder model."""
    # Encode and pad the input text, then run the encoder to get the initial decoder states.
    input_seq = input_tokenizer.texts_to_sequences([input_text])
    padded_input_seq = tf.keras.preprocessing.sequence.pad_sequences(input_seq, maxlen=max_input_len, padding='post')
    states_value = encoder_model.predict(padded_input_seq, verbose=0)

    # Seed decoding with the OOV token index (these tokenizers do not define a
    # dedicated start-of-sequence token); fall back to index 0 if it is absent.
    start_index = target_tokenizer.word_index.get(target_tokenizer.oov_token, 0)
    target_seq = np.array([start_index]).reshape(1, 1)

    decoded_sentence = ''
    stop_condition = False
    while not stop_condition:
        # Predict the next character from the previous character and the current LSTM states.
        output_tokens, h, c = decoder_model.predict([target_seq] + states_value, verbose=0)
        sampled_token_index = np.argmax(output_tokens[0, -1, :])
        sampled_char = target_tokenizer.index_word.get(sampled_token_index, '')

        if sampled_char and sampled_char != target_tokenizer.oov_token:
            decoded_sentence += sampled_char

        # Stop when the model predicts an unknown/OOV character or the maximum length is reached.
        if sampled_char == '' or sampled_char == target_tokenizer.oov_token or len(decoded_sentence) > max_target_len:
            stop_condition = True

        # Feed the sampled character and the updated states back in for the next step.
        target_seq = np.array([sampled_token_index]).reshape(1, 1)
        states_value = [h, c]

    return decoded_sentence

# --- Example ---
input_text = "kn-mchiw l-sou9"  # Example input (Darija for "We are going to the market")
print("Input text:", input_text)
print("Normalized text:", normalize_text(input_text, encoder_model, decoder_model, tokenizer_input, tokenizer_target, max_target_len, max_input_len))

Installation Requirements

Make sure to install the following packages:

pip install tensorflow numpy pyyaml huggingface_hub

Before running the upload script, please follow these steps to resolve the most common Hugging Face Hub push errors (a minimal upload sketch is shown after this checklist):

  1. Get your Hugging Face API Token:

    • Go to https://huggingface.co/settings/tokens and create (or copy) a token with "write" permissions.
  2. Enter your Hugging Face Username:

    • Replace HF_USERNAME = "YOUR_HF_USERNAME" in your upload script with your actual Hugging Face username (the username you use to log in to Hugging Face).
  3. (Recommended) Log in with the Hugging Face CLI:

    • Open your terminal or command prompt.
    • Run: huggingface-cli login
    • Enter your Hugging Face API token when prompted. This stores the token locally so that huggingface_hub (and Git) can authenticate with the Hub.
  4. Check your Internet Connection:

    • Ensure you have a stable internet connection.

After completing these steps, run the upload script again.
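
For reference, the push step itself can be as simple as the following huggingface_hub sketch; the repository name and local folder path are illustrative and should be replaced with your own:

from huggingface_hub import HfApi

HF_USERNAME = "YOUR_HF_USERNAME"                    # your Hugging Face username
REPO_ID = f"{HF_USERNAME}/darija-text-normalizer"   # illustrative repository name

api = HfApi()  # uses the token stored by `huggingface-cli login`
api.create_repo(repo_id=REPO_ID, repo_type="model", exist_ok=True)
api.upload_folder(
    folder_path=".",   # local directory containing the .keras, .json, and .yaml files
    repo_id=REPO_ID,
    repo_type="model",
)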

If you still encounter issues, double-check:

  • That you have correctly installed all required libraries (tensorflow, numpy, pyyaml, huggingface_hub).
  • That your API token is valid and has "write" permissions.
  • That your Hugging Face username is correctly entered.

If the error persists, it may be a temporary network issue or a problem on the Hugging Face Hub side, but the steps above resolve the most common causes of this error.
