Mehfil-e-Sukhan: Har Lafz Ek Mehfil

Roman Urdu Poetry Generation Model

A bidirectional LSTM neural network for generating Roman Urdu poetry, trained on a curated dataset of Urdu poetry in Latin script.

[Screenshots: Main Layout 1, Main Layout 2]

Overview

Mehfil-e-Sukhan (meaning "Poetry Gathering" in Urdu) is a natural language generation model specifically designed for Roman Urdu poetry creation. This repository contains the complete model implementation, including data preprocessing, tokenization, model architecture, training code, and inference utilities.

The model uses a Bidirectional LSTM architecture trained on a dataset of approximately 1,300 lines of Roman Urdu poetry to learn patterns, rhythms, and stylistic elements of Urdu poetry written in Latin script.

Repository Structure

The repository contains the following key files:

  • poetry_generation.ipynb: Complete notebook with data preparation, model definition, training code, and generation utilities
  • model_weights.pth: Trained model weights (243 MB)
  • urdu_sp.model: SentencePiece tokenizer model (429 KB)
  • urdu_sp.vocab: SentencePiece vocabulary file (181 KB)
  • all_texts.txt: Preprocessed dataset used for training (869 KB)
  • requirements.txt: Required Python packages
  • .gitattributes: Git LFS tracking for large files

Model Architecture

The poetry generation model uses a Bidirectional LSTM architecture:

  • Embedding Layer: 512-dimensional embeddings
  • BiLSTM Layers: 3 stacked bidirectional LSTM layers with 768 hidden units in each direction
  • Dropout: 0.2 dropout rate for regularization
  • Output Layer: Linear projection to vocabulary size (12,000 tokens)

This architecture was chosen to capture both preceding and following context in poetry lines, which is essential for maintaining coherence and style in the generated text.
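For reference, the architecture can be expressed as a compact PyTorch module. This is a minimal sketch based on the hyperparameters listed above; the authoritative definition lives in poetry_generation.ipynb:

import torch
import torch.nn as nn

class BiLSTMLanguageModel(nn.Module):
    def __init__(self, vocab_size, embed_dim=512, hidden_dim=768,
                 num_layers=3, dropout=0.2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=num_layers,
                            dropout=dropout, bidirectional=True,
                            batch_first=True)
        # Bidirectional: forward and backward states are concatenated,
        # so the output projection takes 2 * hidden_dim features.
        self.fc = nn.Linear(2 * hidden_dim, vocab_size)

    def forward(self, x, hidden=None):
        emb = self.embedding(x)                # (batch, seq, embed_dim)
        out, hidden = self.lstm(emb, hidden)   # (batch, seq, 2*hidden_dim)
        return self.fc(out), hidden            # logits over the vocabulary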

Dataset

The model is trained on the Roman Urdu Poetry dataset, which contains approximately 1,300 lines of Urdu poetry written in Latin script (Roman Urdu). The dataset includes works from various poets and covers a range of poetic styles and themes.

Dataset Source: Roman Urdu Poetry Dataset on Kaggle

Data Processing

Raw poetry lines undergo several preprocessing steps:

  1. Diacritic Removal: Unicode diacritics are normalized and removed
  2. Text Cleaning: Excessive punctuation, symbols, and repeated spaces are eliminated
  3. Tokenization: SentencePiece BPE (Byte Pair Encoding) tokenization with a vocabulary size of 12,000

The tokenization approach allows the model to handle out-of-vocabulary words by breaking them into subword units, which is particularly important for Roman Urdu where spelling variations are common.
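A minimal sketch of this pipeline, with illustrative function names (the actual preprocessing code is in poetry_generation.ipynb):

import re
import unicodedata
import sentencepiece as spm

def clean_line(line: str) -> str:
    # 1. Diacritic removal: decompose Unicode, then drop combining marks
    line = unicodedata.normalize("NFKD", line)
    line = "".join(ch for ch in line if not unicodedata.combining(ch))
    # 2. Text cleaning: strip stray symbols and collapse repeated spaces
    line = re.sub(r"[^a-zA-Z\s'-]", " ", line)
    return re.sub(r"\s+", " ", line).strip()

# 3. Train a SentencePiece BPE tokenizer with a 12,000-token vocabulary
spm.SentencePieceTrainer.train(
    input="all_texts.txt",
    model_prefix="urdu_sp",
    vocab_size=12000,
    model_type="bpe",
)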

Training Methodology

The model was trained with the following parameters:

  • Train/Validation/Test Split: 80% / 10% / 10%
  • Loss Function: Cross-Entropy with ignore_index for padding tokens
  • Optimizer: Adam with learning rate 1e-3 and weight decay 1e-5
  • Learning Rate Schedule: StepLR with step size 2 and gamma 0.5
  • Gradient Clipping: Maximum norm of 5.0
  • Epochs: 10 (sufficient for convergence on this dataset size)
  • Batch Size: 64

Training was performed on both CPU and GPU environments, with automatic device detection.
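A condensed sketch of a training loop matching these settings (model, train_loader, and PAD_ID are assumed to come from the notebook):

import torch
import torch.nn as nn

# Automatic device detection, as used in the notebook
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

criterion = nn.CrossEntropyLoss(ignore_index=PAD_ID)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-5)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=2, gamma=0.5)

for epoch in range(10):
    model.train()
    for inputs, targets in train_loader:
        inputs, targets = inputs.to(device), targets.to(device)
        optimizer.zero_grad()
        logits, _ = model(inputs)
        loss = criterion(logits.reshape(-1, logits.size(-1)),
                         targets.reshape(-1))
        loss.backward()
        # Clip gradients to a maximum norm of 5.0
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
        optimizer.step()
    scheduler.step()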

Text Generation Process

Poetry generation uses nucleus sampling (top-p) with adjustable parameters:

  • Temperature: Controls randomness in word selection (default: 1.2)
  • Top-p (nucleus) sampling: Limits token selection to the smallest set whose cumulative probability exceeds the threshold (default: 0.85)
  • Formatting: Automatically formats output with 6 words per line for aesthetic presentation

This sampling approach balances creativity and coherence in the generated text, allowing for controlled variation in the output.
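A minimal sketch of nucleus sampling and the line formatting described above (the notebook's generate_poetry_nucleus follows this general pattern, though details may differ):

import torch
import torch.nn.functional as F

def sample_next_token(logits, temperature=1.2, top_p=0.85):
    # logits: 1-D tensor of vocabulary scores for the next token.
    # Higher temperature flattens the distribution (more randomness).
    probs = F.softmax(logits / temperature, dim=-1)
    sorted_probs, sorted_ids = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # Keep the smallest set of tokens whose cumulative probability
    # exceeds top_p: drop a token if the mass *before* it already did.
    sorted_probs[cumulative - sorted_probs > top_p] = 0.0
    sorted_probs /= sorted_probs.sum(dim=-1, keepdim=True)
    idx = torch.multinomial(sorted_probs, num_samples=1)
    return sorted_ids.gather(-1, idx).item()

def format_poetry(words, words_per_line=6):
    # Group generated words into lines of six for presentation
    lines = [" ".join(words[i:i + words_per_line])
             for i in range(0, len(words), words_per_line)]
    return "\n".join(lines)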

[Demo Screenshot]

Results and Performance

The final model achieves a test loss of approximately 3.17 (a perplexity of roughly e^3.17 ≈ 24), which is reasonable considering the dataset size. The model demonstrates the ability to:

  • Generate contextually relevant continuations from a seed word
  • Maintain some aspects of Urdu poetic style in Roman script
  • Produce text with thematic consistency

The limited dataset size (1,300 lines) does result in some repetitiveness in longer generations, which could be improved with additional training data.

Usage

To use the model for generating poetry:

# Import required libraries (these are included in the notebook)
import torch
import sentencepiece as spm

# Load the SentencePiece model
sp = spm.SentencePieceProcessor()
sp.load("urdu_sp.model")

# Select a device (the notebook auto-detects CUDA)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the BiLSTM model (BiLSTMLanguageModel is defined in the notebook)
model = BiLSTMLanguageModel(vocab_size=sp.get_piece_size(),
                            embed_dim=512,
                            hidden_dim=768,
                            num_layers=3,
                            dropout=0.2).to(device)
model.load_state_dict(torch.load("model_weights.pth", map_location=device))
model.eval()

# Generate poetry (generate_poetry_nucleus is defined in the notebook)
start_word = "ishq"  # Example: "love"
generated_poetry = generate_poetry_nucleus(model, sp, start_word,
                                           num_words=12,
                                           temperature=1.2,
                                           top_p=0.85)
print(generated_poetry)

Interactive Demo

An interactive demo of this model is available as a Streamlit application, which provides a user-friendly interface to generate Roman Urdu poetry with adjustable parameters:

Mehfil-e-Sukhan Demo on HuggingFace Spaces

The Streamlit app, sketched below, allows users to:

  • Enter a starting word or phrase
  • Adjust the number of words to generate
  • Control the creativity (temperature) and focus (top-p) parameters
  • View the formatted poetry output in an elegant interface
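A minimal sketch of such an interface (widget labels and defaults are illustrative; model, sp, and generate_poetry_nucleus are loaded as in the Usage section):

import streamlit as st

st.title("Mehfil-e-Sukhan")

# User-facing generation controls
start_word = st.text_input("Starting word or phrase", value="ishq")
num_words = st.slider("Number of words", 6, 48, 12)
temperature = st.slider("Creativity (temperature)", 0.5, 2.0, 1.2)
top_p = st.slider("Focus (top-p)", 0.5, 1.0, 0.85)

if st.button("Generate"):
    poetry = generate_poetry_nucleus(model, sp, start_word,
                                     num_words=num_words,
                                     temperature=temperature,
                                     top_p=top_p)
    st.text(poetry)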

Installation

To set up this model locally:

  1. Clone the repository
  2. Install the required dependencies:
    pip install -r requirements.txt
    
  3. Open and run poetry_generation.ipynb to explore the complete implementation

The required packages include:

  • torch
  • sentencepiece
  • pandas
  • scikit-learn
  • numpy

Future Improvements

Potential enhancements for the model include:

  1. Expanded Dataset: Increasing the training data size to thousands of poetry lines for improved diversity and coherence
  2. Transformer Architecture: Replacing BiLSTM with a Transformer-based model for better long-range dependencies
  3. Style Control: Adding mechanisms to control specific poetic styles or meters
  4. Multi-Language Support: Extending the model to handle both Roman Urdu and Nastaliq script
  5. Fine-Tuning Options: Adding more parameters to control the generation style and themes

License

This project is licensed under the Apache 2.0 License - see the LICENSE file for details.

Contact
