Mehfil-e-Sukhan: Har Lafz Ek Mehfil

Roman Urdu Poetry Generation Model

A bidirectional LSTM neural network for generating Roman Urdu poetry, trained on a curated dataset of Urdu poetry in Latin script.

Overview
Repository Structure
Model Architecture
Dataset
Data Processing
Training Methodology
Text Generation Process
Results and Performance
Usage
Interactive Demo
Installation
Future Improvements
License
Contact

Overview

Mehfil-e-Sukhan (meaning "Poetry Gathering" in Urdu) is a natural language generation model specifically designed for Roman Urdu poetry creation. This repository contains the complete model implementation, including data preprocessing, tokenization, model architecture, training code, and inference utilities.

The model uses a Bidirectional LSTM architecture trained on a dataset of approximately 1,300 lines of Roman Urdu poetry to learn patterns, rhythms, and stylistic elements of Urdu poetry written in Latin script.

Repository Structure

The repository contains the following key files:

poetry_generation.ipynb: Complete notebook with data preparation, model definition, training code, and generation utilities
model_weights.pth: Trained model weights (243 MB)
urdu_sp.model: SentencePiece tokenizer model (429 KB)
urdu_sp.vocab: SentencePiece vocabulary file (181 KB)
all_texts.txt: Preprocessed dataset used for training (869 KB)
requirements.txt: Required Python packages
.gitattributes: Git LFS tracking for large files

Model Architecture

The poetry generation model uses a Bidirectional LSTM architecture:

Embedding Layer: 512-dimensional embeddings
BiLSTM Layers: 3 stacked bidirectional LSTM layers with 768 hidden units in each direction
Dropout: 0.2 dropout rate for regularization
Output Layer: Linear projection to vocabulary size (12,000 tokens)

This architecture was chosen to capture both preceding and following context in poetry lines, which is essential for maintaining coherence and style in the generated text.
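
As a rough consistency check, the layer sizes listed above can be turned into a parameter count. The sketch below assumes PyTorch's `nn.LSTM` parameter layout (four gates and two bias vectors per direction) and an output layer that reads the concatenated forward and backward states. Neither detail is stated in this README, and the helper names `lstm_direction_params` and `count_parameters` are illustrative only.

```python
def lstm_direction_params(input_size: int, hidden_size: int) -> int:
    """Parameters of one LSTM direction (assumed nn.LSTM-style layout)."""
    gates = 4  # input, forget, cell, and output gates
    weights = gates * hidden_size * (input_size + hidden_size)
    biases = 2 * gates * hidden_size  # assumption: separate bias_ih and bias_hh
    return weights + biases


def count_parameters(vocab_size: int = 12_000, embed_dim: int = 512,
                     hidden_dim: int = 768, num_layers: int = 3) -> int:
    """Total trainable parameters for the architecture described above."""
    total = vocab_size * embed_dim  # embedding table
    for layer in range(num_layers):
        # Layers after the first receive forward + backward outputs
        input_size = embed_dim if layer == 0 else 2 * hidden_dim
        total += 2 * lstm_direction_params(input_size, hidden_dim)
    # Assumption: the linear projection maps 2 * hidden_dim to the vocabulary
    total += 2 * hidden_dim * vocab_size + vocab_size
    return total


total = count_parameters()
print(f"{total:,} parameters, about {total * 4 / 1e6:.0f} MB as float32")
```

Under these assumptions the count comes to roughly 60.8 million parameters, or about 243 MB in 32-bit floats, which is in line with the listed size of model_weights.pth.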

Dataset

The model is trained on the Roman Urdu Poetry dataset, which contains approximately 1,300 lines of Urdu poetry written in Latin script (Roman Urdu). The dataset includes works from various poets and covers a range of poetic styles and themes.

Dataset Source: Roman Urdu Poetry Dataset on Kaggle

Data Processing

Raw poetry lines undergo several preprocessing steps:

Diacritic Removal: Unicode diacritics are normalized and removed
Text Cleaning: Excessive punctuation, symbols, and repeated spaces are eliminated
Tokenization: SentencePiece BPE (Byte Pair Encoding) tokenization with a vocabulary size of 12,000

The tokenization approach allows the model to handle out-of-vocabulary words by breaking them into subword units, which is particularly important for Roman Urdu where spelling variations are common.
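
The first two preprocessing steps can be sketched with the standard library alone. This is a minimal illustration, not the notebook's actual code: the function names, the punctuation allow-list, and the exact regular expressions are assumptions.

```python
import re
import unicodedata


def strip_diacritics(text: str) -> str:
    """Decompose characters (NFKD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def clean_line(text: str) -> str:
    """Remove diacritics, stray symbols, repeated punctuation, and extra spaces."""
    text = strip_diacritics(text)
    # Assumption: keep word characters, whitespace, and a small set of punctuation
    text = re.sub(r"[^\w\s'.,!?-]", " ", text)
    # Collapse runs of the same punctuation mark ("!!!" becomes "!")
    text = re.sub(r"([.,!?'-])\1+", r"\1", text)
    # Collapse repeated whitespace
    text = re.sub(r"\s+", " ", text)
    return text.strip()


print(clean_line("Dil-e-nādāñ   tujhe huā kyā hai!!!"))
```

The cleaned lines would then be written to a text file and passed to SentencePiece for BPE training with the 12,000-token vocabulary described above.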

Training Methodology

The model was trained with the following parameters:

Train/Validation/Test Split: 80% / 10% / 10%
Loss Function: Cross-Entropy with ignore_index for padding tokens
Optimizer: Adam with learning rate 1e-3 and weight decay 1e-5
Learning Rate Schedule: StepLR with step size 2 and gamma 0.5
Gradient Clipping: Maximum norm of 5.0
Epochs: 10 (sufficient for convergence on this dataset size)
Batch Size: 64
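
The 80% / 10% / 10% split listed above can be sketched as follows. The notebook may use a different helper (scikit-learn is among the dependencies); the function name `split_lines`, the fixed seed, and the placeholder data are assumptions made for illustration.

```python
import random


def split_lines(lines, train_frac=0.8, val_frac=0.1, seed=42):
    """Shuffle once, then cut into train / validation / test partitions."""
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains goes to the test set
    return train, val, test


# Placeholder data standing in for roughly 1,300 poetry lines
lines = [f"line {i}" for i in range(1300)]
train, val, test = split_lines(lines)
print(len(train), len(val), len(test))
```

With about 1,300 lines this leaves only around 130 lines each for validation and testing, so the reported metrics should be read with that small sample in mind.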

Training was performed on both CPU and GPU environments, with automatic device detection.
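
To make the learning rate schedule concrete, the sketch below computes the rate for each of the 10 epochs. It assumes the scheduler is stepped once per epoch, which is the usual pattern with StepLR but is not stated here; `step_lr` is a hypothetical helper, not part of the notebook.

```python
def step_lr(epoch: int, base_lr: float = 1e-3,
            step_size: int = 2, gamma: float = 0.5) -> float:
    """Learning rate in effect during `epoch` (0-indexed) under a StepLR-style decay."""
    return base_lr * gamma ** (epoch // step_size)


# Assumption: one scheduler step per epoch, 10 epochs in total
for epoch in range(10):
    print(f"epoch {epoch + 1:2d}: lr = {step_lr(epoch):.2e}")
```

Under that assumption the rate halves every two epochs, falling from 1e-3 in the first two epochs to 6.25e-5 in the last two.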

Text Generation Process

Poetry generation uses nucleus sampling (top-p) with adjustable parameters:

Temperature: Controls randomness in token selection; values above 1 flatten the distribution and increase variety (default: 1.2)
Top-p (nucleus) sampling: Limits token selection to the smallest set whose cumulative probability exceeds the threshold (default: 0.85)
Formatting: Automatically formats output with 6 words per line for aesthetic presentation

This sampling approach balances creativity and coherence in the generated text, allowing for controlled variation in the output.
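
The sampling and formatting steps described above can be illustrated without any deep learning library. This is a sketch of the general technique on a plain list of logits, not the notebook's `generate_poetry_nucleus`; the names `nucleus_sample` and `format_lines` are hypothetical.

```python
import math
import random


def nucleus_sample(logits, temperature=1.2, top_p=0.85, rng=random):
    """Sample one token index using temperature scaling and top-p filtering."""
    # Temperature-scaled softmax (subtract the max for numerical stability)
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the most likely tokens until their cumulative mass reaches top_p
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, cumulative = [], 0.0
    for i in ranked:
        nucleus.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # random.choices normalizes the weights, so no explicit renormalization is needed
    weights = [probs[i] for i in nucleus]
    return rng.choices(nucleus, weights=weights, k=1)[0]


def format_lines(words, words_per_line=6):
    """Join words into lines of at most `words_per_line` words."""
    chunks = (words[i:i + words_per_line] for i in range(0, len(words), words_per_line))
    return "\n".join(" ".join(chunk) for chunk in chunks)


# Toy logits over a 4-token vocabulary; index 0 dominates
print(nucleus_sample([5.0, 1.0, 0.0, -2.0]))
print(format_lines("ishq mein dil ka haal kya kahen hum".split()))
```

A lower `top_p` shrinks the candidate set and makes output more predictable, while a higher temperature spreads probability onto less likely tokens before the cutoff is applied.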

Results and Performance

The final model achieves a test loss of approximately 3.17, which is reasonable considering the dataset size. The model demonstrates the ability to:

Generate contextually relevant continuations from a seed word
Maintain some aspects of Urdu poetic style in Roman script
Produce text with thematic consistency

The limited dataset size (about 1,300 lines) leads to some repetitiveness in longer generations, which additional training data could reduce.
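
For context, the reported loss can be converted to perplexity and compared with a uniform guess over the vocabulary. The calculation assumes the test loss is a mean per-token cross-entropy in nats, which is the default behaviour of PyTorch's cross-entropy loss but is not confirmed here.

```python
import math

test_loss = 3.17     # reported test loss (assumed: mean per-token cross-entropy in nats)
vocab_size = 12_000  # SentencePiece vocabulary size

perplexity = math.exp(test_loss)
uniform_loss = math.log(vocab_size)  # loss of a model that guesses uniformly

print(f"perplexity: {perplexity:.1f}")
print(f"uniform-guess loss: {uniform_loss:.2f}")
```

Under that assumption the perplexity is about 23.8, against a uniform-guess loss of about 9.39, so the model is far better than chance even though the test set is small.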

Usage

To use the model for generating poetry:

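```python
# BiLSTMLanguageModel and generate_poetry_nucleus are defined in poetry_generation.ipynb
import torch
import sentencepiece as spm

# Use the GPU when available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the SentencePiece tokenizer
sp = spm.SentencePieceProcessor()
sp.load("urdu_sp.model")

# Rebuild the BiLSTM with the training hyperparameters, then load the weights
model = BiLSTMLanguageModel(vocab_size=sp.get_piece_size(),
                            embed_dim=512,
                            hidden_dim=768,
                            num_layers=3,
                            dropout=0.2)
model.load_state_dict(torch.load("model_weights.pth", map_location=device))
# Assumes the notebook's generation helper places its inputs on the same device
model.to(device)
model.eval()  # disable dropout for inference

# Generate poetry
start_word = "ishq"  # Example: "love"
generated_poetry = generate_poetry_nucleus(model, sp, start_word,
                                           num_words=12,
                                           temperature=1.2,
                                           top_p=0.85)
print(generated_poetry)
```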

Interactive Demo

An interactive demo of this model is available as a Streamlit application, which provides a user-friendly interface to generate Roman Urdu poetry with adjustable parameters:

Mehfil-e-Sukhan Demo on HuggingFace Spaces

The Streamlit app allows users to:

Enter a starting word or phrase
Adjust the number of words to generate
Control the creativity (temperature) and focus (top-p) parameters
View the formatted poetry output in an elegant interface

Installation

To set up this model locally:

Clone the repository
Install the required dependencies:
```
pip install -r requirements.txt
```
Open and run poetry_generation.ipynb to explore the complete implementation

The required packages include:

torch
sentencepiece
pandas
scikit-learn
numpy

Future Improvements

Potential enhancements for the model include:

Expanded Dataset: Growing the training data well beyond the current 1,300 or so lines for improved diversity and coherence
Transformer Architecture: Replacing BiLSTM with a Transformer-based model for better long-range dependencies
Style Control: Adding mechanisms to control specific poetic styles or meters
Multi-Script Support: Extending the model to handle both Roman Urdu and Urdu written in Nastaliq script
Fine-Tuning Options: Adding more parameters to control the generation style and themes

License

This project is licensed under the Apache 2.0 License - see the LICENSE file for details.
