---
language:
- multilingual
tags:
- audio
- text
- multimodal
- seamless
- subtitle-editing-time-prediction
library_name: transformers
base_model: facebook/hf-seamless-m4t-medium
license: cc-by-nc-4.0
---

# videoloc/seamless-basic

## Model Description

This is a **SeamlessBasic** model that processes audio and text inputs to predict **Time To Edit (TTE)** for subtitle segments. Given an audio segment and its corresponding subtitle text, the model predicts how much time (in seconds) would be required to edit/refine that subtitle segment.

The model is built on top of Meta's SeamlessM4T and fine-tuned on a multimodal dataset containing audio-subtitle pairs with editing time annotations across 5 languages: **English, French, Spanish, Italian, and German**.

### Key Features

- **Multimodal Processing**: Simultaneously processes audio (16kHz) and text inputs
- **Frozen Encoders**: Uses pre-trained SeamlessM4T encoders (frozen for stability)
- **TTE Prediction**: Predicts editing time required for subtitle segments
- **Direct Output**: Raw time values in seconds for immediate use

## Model Architecture

The model consists of the following components:

1. **Audio Processing**:
   - SeamlessM4T speech encoder (frozen) processes raw audio input
   - Audio projection layer maps speech encoder output to 1024 dimensions
   - Mean pooling over sequence length to get a fixed-size audio embedding

2. **Text Processing**:
   - SeamlessM4T text encoder (frozen) processes tokenized text input
   - Text projection layer maps text encoder output to 1024 dimensions
   - Mean pooling over sequence length to get a fixed-size text embedding

3. **Feature Fusion**:
   - Audio and text embeddings are concatenated (2048 total dimensions)
   - No additional cross-modal attention or complex fusion mechanisms

4. **Regression Head**:
   - Multi-layer perceptron: 2048 → 1024 → 512 → 256 → 1
   - ReLU activations and dropout for regularization
   - Single output for TTE prediction (regression, in seconds)
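For orientation, the fusion and regression head are simple enough to sketch in a few lines. The following PyTorch sketch mirrors the description above; the class name, dropout rate, and exact layer grouping are illustrative assumptions, and the actual implementation ships in `modeling_seamless_basic.py` in this repo:

```python
import torch
import torch.nn as nn

class FusionRegressionHeadSketch(nn.Module):
    """Illustrative sketch of the fusion + regression head described above.

    Assumes pooled 1024-dim audio and text embeddings (mean pooling over
    the frozen SeamlessM4T encoder outputs, after projection). The dropout
    rate is an assumption, not the trained value.
    """

    def __init__(self, hidden_size: int = 1024, dropout: float = 0.1):
        super().__init__()
        # MLP: 2048 -> 1024 -> 512 -> 256 -> 1, ReLU + dropout between layers
        self.mlp = nn.Sequential(
            nn.Linear(2 * hidden_size, 1024), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(1024, 512), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(512, 256), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(256, 1),
        )

    def forward(self, audio_emb: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # Simple concatenation fusion: (batch, 2048), then regress TTE in seconds
        fused = torch.cat([audio_emb, text_emb], dim=-1)
        return self.mlp(fused)

# Example with random pooled embeddings
head = FusionRegressionHeadSketch()
tte = head(torch.randn(4, 1024), torch.randn(4, 1024))  # shape: (4, 1)
```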
## Quick Start

### Installation

```bash
pip install transformers torch torchaudio huggingface_hub
```

### Basic Usage

```python
from huggingface_hub import hf_hub_download
import torch
import numpy as np
import importlib.util

# Load model - the custom architecture requires importing the model class
model_file = hf_hub_download(repo_id="videoloc/seamless-basic", filename="modeling_seamless_basic.py")
spec = importlib.util.spec_from_file_location("modeling_seamless_basic", model_file)
modeling_module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(modeling_module)

# Now load the model using the custom class (config is available for inspection)
config = modeling_module.SeamlessBasicConfig.from_pretrained("videoloc/seamless-basic")
model = modeling_module.HFSeamlessBasic.from_pretrained("videoloc/seamless-basic")

# Load the data collator (included in this repo)
collator_file = hf_hub_download(repo_id="videoloc/seamless-basic", filename="data_collator.py")
spec = importlib.util.spec_from_file_location("data_collator", collator_file)
collator_module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(collator_module)

# Initialize the data collator
data_collator = collator_module.DataCollatorSimpleSeamless(
    processor="facebook/hf-seamless-m4t-medium",
    max_audio_length_sec=8.0,
    max_text_length=256
)

# Prepare your data
your_data = [
    {
        'raw_audio': np.random.randn(16000 * 5),  # 5 seconds at 16kHz
        'raw_text': "Your subtitle text here",
        # Note: no translation features needed for the basic model
    }
]

# Process and run inference
batch = data_collator(your_data)
model.eval()

with torch.no_grad():
    outputs = model(**batch)
    tte_prediction = outputs.logits.item()

print(f"Predicted Time To Edit: {tte_prediction:.2f} seconds")
```

## Model Details

- **Base Model**: SeamlessM4T (facebook/hf-seamless-m4t-medium)
- **Audio Encoder**: Frozen SeamlessM4T speech encoder
- **Text Encoder**: Frozen SeamlessM4T text encoder
- **Hidden Size**: 1024
- **Audio Input**: 16kHz
- **Output**: Single regression value (TTE in seconds)
- **Task**: Subtitle editing time prediction

## Data Format

Your input data should be a list of dictionaries with:

- `raw_audio`: NumPy array of audio samples (16kHz sampling rate)
- `raw_text`: String of subtitle text
- `labels`: Target TTE values in seconds (optional, for training)

Example:

```python
data = [
    {
        'raw_audio': audio_samples,  # shape: (num_samples,) at 16kHz
        'raw_text': "Subtitle text content",
        'labels': 2.5  # optional TTE target value in seconds
    }
]
```
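If you are starting from audio files rather than raw arrays, something like the sketch below should produce compatible `raw_audio` entries. The file name and helper name are placeholders, not part of this repo; the collator resamples automatically, so the explicit resample here is only a safeguard.

```python
import torchaudio

def load_raw_audio(path: str, target_sr: int = 16000):
    """Load an audio file as a mono float NumPy array at 16kHz.

    Minimal sketch: `path` is a placeholder for your own file. The data
    collator also resamples internally, so resampling here is optional.
    """
    waveform, sr = torchaudio.load(path)  # (channels, num_samples)
    waveform = waveform.mean(dim=0)       # downmix to mono
    if sr != target_sr:
        waveform = torchaudio.functional.resample(waveform, sr, target_sr)
    return waveform.numpy()               # shape: (num_samples,)

data = [{
    'raw_audio': load_raw_audio("segment_0001.wav"),  # placeholder file name
    'raw_text': "Subtitle text content",
}]
```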
## Performance Metrics

- **Best Eval RMSE**: 33.34

## Training Details

- **Base Model**: facebook/hf-seamless-m4t-medium
- **Epochs**: 10
- **Batch Size (Train)**: 32
- **Batch Size (Eval)**: 64
- **Learning Rate**: 1.2e-4
- **LR Scheduler**: cosine_with_restarts
- **Warmup Ratio**: 0.05
- **Weight Decay**: 0.001
- **Optimizer**: AdamW (torch)
- **Max Grad Norm**: 1.0
- **FP16**: True
- **Early Stopping Patience**: 5
- **Audio Max Length**: 8.0 seconds
- **Text Max Length**: 256 tokens
- **Sample Rate**: 16kHz
- **Normalization**: None (raw values)
- **Dataset Split**: 80/20 train/test
- **Random Seed**: 42
- **Metric**: RMSE (lower is better)

## Training Configuration

The model was trained with the following specifications:

- **Dataset**: Multimodal audio-subtitle pairs with TTE annotations (5 languages: EN, FR, ES, IT, DE)
- **Train/Test Split**: 80/20 with random seed 42
- **Audio Processing**: 16kHz sampling, max 8.0 seconds, no offset
- **Text Processing**: Max 256 tokens
- **Normalization**: None (raw TTE values in seconds)
- **Caching**: Audio segments cached and compressed for efficiency

## Usage Notes

- This is the **basic** variant - processes only audio and text
- For translation-aware models, see `seamless-translation` and `seamless-langpairs`
- Model expects 16kHz audio input (automatically resampled by data collator)
- Text is processed with SeamlessM4T text encoder
- No feature normalization applied - outputs raw TTE predictions in seconds
- Optimized for subtitle editing time estimation tasks

## Limitations

- Designed for TTE prediction, not general audio-text matching
- Performance may vary on out-of-domain content or different editing workflows
- Requires specific data preprocessing (use included data collator)

## Related Models

- **[seamless-translation](https://huggingface.co/videoloc/seamless-translation)**: Adds translation awareness features
- **[seamless-langpairs](https://huggingface.co/videoloc/seamless-langpairs)**: Includes language pair embeddings for multilingual scenarios
- **[seamless-crossattention](https://huggingface.co/videoloc/seamless-crossattention)**: Advanced cross-modal attention mechanisms for sophisticated audio-text interactions
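## Evaluation Example

The reported RMSE was computed on the held-out 20% split. To measure the same metric on your own labeled data, a loop along the following lines works. This is a minimal sketch: it assumes `model` and `data_collator` are loaded as in the Quick Start, that your data follows the Data Format above with `labels` set, and that the collator returns a dict containing a `labels` tensor (popped here in case the model's forward does not accept it).

```python
import torch

def evaluate_rmse(model, data_collator, labeled_data, batch_size=64):
    """Compute RMSE of TTE predictions against ground-truth labels (seconds).

    Minimal sketch, not the original evaluation script. `labeled_data` is a
    list of dicts in the Data Format above, each including a `labels` value.
    """
    model.eval()
    squared_error, count = 0.0, 0
    for start in range(0, len(labeled_data), batch_size):
        items = labeled_data[start:start + batch_size]
        batch = data_collator(items)
        labels = batch.pop("labels")  # assumption: collator emits a labels tensor
        with torch.no_grad():
            preds = model(**batch).logits.squeeze(-1)
        squared_error += torch.sum((preds - labels) ** 2).item()
        count += len(items)
    return (squared_error / count) ** 0.5

# rmse = evaluate_rmse(model, data_collator, labeled_data)
```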