giuseppe-tanzi committed
Commit 988bd5d · verified · 1 Parent(s): 12d3a8b

Upload folder using huggingface_hub

README.md ADDED
---
language:
- multilingual
tags:
- audio
- text
- multimodal
- seamless
- subtitle-editing-time-prediction
- cross-attention
- attention-mechanism
library_name: transformers
base_model: facebook/hf-seamless-m4t-medium
---

# videoloc/seamless-crossattention

## Model Description

This is a **SeamlessCrossAttention** model that processes paired audio and text inputs with cross-modal attention to predict **Time To Edit (TTE)** for subtitle segments. Given an audio segment and its corresponding subtitle text, the model predicts how much time (in seconds) is required to edit/refine that subtitle segment, using cross-attention between the audio and text modalities.

The model extends the SeamlessM4T architecture with bidirectional cross-attention layers that let audio and text representations attend to each other, producing cross-modal embeddings that capture temporal and semantic relationships. It covers 5 languages: **English, French, Spanish, Italian, and German**.

### Key Features

- **Cross-Modal Attention**: Bidirectional attention between audio and text representations
- **Advanced Architecture**: Audio-to-text and text-to-audio attention mechanisms
- **Scalar Mixing**: Learnable combination of global and attended embeddings
- **Embedding Regularization**: Optional L2 regularization for embedding stability
- **Multimodal Processing**: Simultaneously processes audio (16kHz) and text inputs
- **Frozen Encoders**: Uses pre-trained SeamlessM4T encoders (frozen for stability)
- **TTE Prediction**: Predicts editing time required for subtitle segments
- **Direct Output**: Raw time values in seconds for immediate use

## Model Architecture

The model combines frozen SeamlessM4T encoders with trainable cross-modal attention:

1. **Audio Processing**:
   - SeamlessM4T speech encoder (frozen) processes raw audio input
   - Audio projection layer maps speech encoder output to 1024 dimensions
   - Layer normalization for stability

2. **Text Processing**:
   - SeamlessM4T text encoder (frozen) processes tokenized text input
   - Text projection layer maps text encoder output to 1024 dimensions
   - Layer normalization for stability

3. **Cross-Modal Attention**:
   - **Audio-to-Text Attention**: Each audio token attends to all text tokens
   - **Text-to-Audio Attention**: Each text token attends to all audio tokens
   - Multi-head attention (8 heads) with dropout for regularization
   - Bidirectional information flow between modalities

4. **Feature Fusion**:
   - Global (mean) pooling of the original audio and text embeddings
   - Global (mean) pooling of the cross-attended embeddings
   - Scalar mixing layer combines all four pooled embeddings with learnable weights (see the formula after this list)
   - Final embedding captures both global and cross-modal patterns

5. **Regression Head**:
   - Multi-layer perceptron: 1024 → 512 → 256 → 1
   - ReLU activations and dropout for regularization
   - Single output for TTE prediction (regression, in seconds)

6. **Optional Regularization**:
   - L2 regularization on embedding norms for training stability
   - Configurable regularization strength

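The scalar mixing step is a softmax-weighted sum of the four pooled embeddings, scaled by a learnable scalar; this mirrors the `ScalarMix` module in `modeling_seamless_crossattention.py` (included in this repo):

$$
e_{\text{final}} = \gamma \sum_{i=1}^{4} \mathrm{softmax}(w)_i \, e_i,
\qquad
e_i \in \{e_{\text{audio}},\; e_{\text{text}},\; e_{\text{audio}\rightarrow\text{text}},\; e_{\text{text}\rightarrow\text{audio}}\}
$$

where $w \in \mathbb{R}^4$ and $\gamma$ are learned parameters.
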
## Quick Start

### Installation
```bash
pip install transformers torch torchaudio huggingface_hub
```

### Basic Usage
```python
from transformers import AutoModel, AutoConfig
from huggingface_hub import hf_hub_download
import torch
import numpy as np
import importlib.util

# Load model - custom architecture requires importing the model class
model_files = hf_hub_download(repo_id="videoloc/seamless-crossattention", filename="modeling_seamless_crossattention.py")
spec = importlib.util.spec_from_file_location("modeling_seamless_crossattention", model_files)
modeling_module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(modeling_module)

# Now load the model using the custom class
config = modeling_module.SeamlessCrossAttentionConfig.from_pretrained("videoloc/seamless-crossattention")
model = modeling_module.HFSeamlessCrossAttention.from_pretrained("videoloc/seamless-crossattention")

# Load the data collator (included in this repo)
collator_file = hf_hub_download(repo_id="videoloc/seamless-crossattention", filename="data_collator.py")
spec = importlib.util.spec_from_file_location("data_collator", collator_file)
collator_module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(collator_module)

# Initialize data collator
data_collator = collator_module.DataCollatorSimpleSeamless(
    processor="facebook/hf-seamless-m4t-medium",
    max_audio_length_sec=8.0,
    max_text_length=256
)

# Prepare your data
your_data = [
    {
        'raw_audio': np.random.randn(16000 * 5),  # 5 seconds at 16kHz
        'raw_text': "Your subtitle text here",
        # Note: the cross-attention model doesn't require translation features
    }
]

# Process and run inference
batch = data_collator(your_data)
model.eval()
with torch.no_grad():
    outputs = model(**batch)
    tte_prediction = outputs.logits.item()

print(f"Predicted Time To Edit (TTE): {tte_prediction:.2f} seconds")
```
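
In practice you would replace the random-noise placeholder with real audio resampled to 16 kHz. A minimal sketch using `torchaudio`; the file path `my_clip.wav` is a hypothetical example:

```python
import numpy as np
import torchaudio

def load_audio_16k(path: str) -> np.ndarray:
    # Load the file and average channels down to mono
    waveform, sr = torchaudio.load(path)      # shape: (channels, num_samples)
    waveform = waveform.mean(dim=0)           # shape: (num_samples,)
    # Resample to the 16 kHz rate expected by the model
    if sr != 16000:
        waveform = torchaudio.functional.resample(waveform, orig_freq=sr, new_freq=16000)
    return waveform.numpy()

your_data = [{
    'raw_audio': load_audio_16k("my_clip.wav"),  # hypothetical path
    'raw_text': "Your subtitle text here",
}]
```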
126
+
127
+ ## Model Details
128
+
129
+ - **Base Model**: SeamlessM4T (facebook/hf-seamless-m4t-medium)
130
+ - **Audio Encoder**: Frozen SeamlessM4T speech encoder
131
+ - **Text Encoder**: Frozen SeamlessM4T text encoder
132
+ - **Hidden Size**: 1024
133
+ - **Attention Heads**: 8 (configurable)
134
+ - **Cross-Attention**: Bidirectional (audio↔text)
135
+ - **Scalar Mix**: 4 embeddings (audio global, text global, audio→text, text→audio)
136
+ - **Audio Input**: 16kHz
137
+ - **Output**: Single regression value (TTE in seconds)
138
+ - **Task**: Subtitle editing time prediction
139
+
140
+ ## Data Format
141
+
142
+ Your input data should be a list of dictionaries with:
143
+ - `raw_audio`: NumPy array of audio samples (16kHz sampling rate)
144
+ - `raw_text`: String of subtitle text
145
+ - `labels`: Target TTE values in seconds (optional, for training)
146
+
147
+ Example:
148
+ ```python
149
+ data = [
150
+ {
151
+ 'raw_audio': audio_samples, # shape: (num_samples,) at 16kHz
152
+ 'raw_text': "Subtitle text content",
153
+ 'labels': 2.5 # optional TTE target value in seconds
154
+ }
155
+ ]
156
+ ```
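
When `labels` are provided, the collator passes them through and the model returns an MSE training loss alongside the prediction. A minimal sketch, reusing `model`, `data_collator`, `np`, and `torch` from the Quick Start; the label value is illustrative:

```python
labeled_data = [{
    'raw_audio': np.random.randn(16000 * 4),  # placeholder 4-second clip at 16 kHz
    'raw_text': "Subtitle text content",
    'labels': 2.5,                            # illustrative TTE target in seconds
}]

batch = data_collator(labeled_data)
model.eval()
with torch.no_grad():
    outputs = model(**batch)                  # forward pass with labels attached

print(f"MSE loss: {outputs.loss.item():.4f}, prediction: {outputs.logits.item():.2f} s")
```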
157
+
158
+ ## Performance Metrics
159
+
160
+ - **Best Eval RMSE**: 33.34
161
+
162
+ ## Training Details
163
+
164
+ - **Base Model**: facebook/hf-seamless-m4t-medium
165
+ - **Model Type**: seamless_cross_attention
166
+ - **Epochs**: 10
167
+ - **Batch Size (Train)**: 32
168
+ - **Batch Size (Eval)**: 64
169
+ - **Learning Rate**: 1.2e-4
170
+ - **LR Scheduler**: cosine_with_restarts
171
+ - **Warmup Ratio**: 0.05
172
+ - **Weight Decay**: 0.001
173
+ - **Optimizer**: AdamW (torch)
174
+ - **Max Grad Norm**: 1.0
175
+ - **FP16**: True
176
+ - **Early Stopping Patience**: 5
177
+ - **Audio Max Length**: 8.0 seconds
178
+ - **Text Max Length**: 256 tokens
179
+ - **Sample Rate**: 16kHz
180
+ - **Cross-Attention**: 8-head multi-head attention
181
+ - **Scalar Mixing**: 4 embedding types
182
+ - **Embedding Regularization**: Optional L2
183
+ - **Normalization**: None (raw values)
184
+ - **Dataset Split**: 80/20 train/test
185
+ - **Random Seed**: 42
186
+ - **Metric**: RMSE (lower is better)
187
+
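For reference, one plausible way these hyperparameters map onto `transformers.TrainingArguments`; this is a hedged sketch rather than the original training script, and the output directory, evaluation cadence, and metric wiring are assumptions:

```python
from transformers import TrainingArguments, EarlyStoppingCallback

training_args = TrainingArguments(
    output_dir="seamless-crossattention-tte",  # placeholder
    num_train_epochs=10,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=64,
    learning_rate=1.2e-4,
    lr_scheduler_type="cosine_with_restarts",
    warmup_ratio=0.05,
    weight_decay=0.001,
    max_grad_norm=1.0,
    fp16=True,
    eval_strategy="epoch",                     # assumption: evaluate once per epoch
    save_strategy="epoch",
    load_best_model_at_end=True,               # required for early stopping
    metric_for_best_model="rmse",              # assumes compute_metrics reports "rmse"
    greater_is_better=False,                   # lower RMSE is better
    seed=42,
)

# Early stopping with patience 5, as listed above
early_stopping = EarlyStoppingCallback(early_stopping_patience=5)
```
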
188
+ ## Training Configuration
189
+
190
+ The model was trained with the following specifications:
191
+
192
+ - **Dataset**: Multimodal audio-subtitle pairs with TTE annotations (5 languages: EN, FR, ES, IT, DE)
193
+ - **Train/Test Split**: 80/20 with random seed 42
194
+ - **Audio Processing**: 16kHz sampling, max 8.0 seconds, no offset
195
+ - **Text Processing**: Max 256 tokens
196
+ - **Cross-Attention**: 8-head multi-head attention with dropout
197
+ - **Scalar Mixing**: Learnable combination of 4 embedding types
198
+ - **Normalization**: None (raw TTE values in seconds)
199
+ - **Caching**: Audio segments cached and compressed for efficiency
200
+
201
+ ## Usage Notes
202
+
203
+ - This is the **advanced cross-attention** variant with sophisticated attention mechanisms
204
+ - For simpler models, see `seamless-basic`, `seamless-translation`, or `seamless-langpairs`
205
+ - Model expects 16kHz audio input (automatically resampled by data collator)
206
+ - Cross-attention captures complex temporal and semantic relationships
207
+ - No feature normalization applied - outputs raw TTE predictions in seconds
208
+ - Optimized for detailed subtitle editing time estimation tasks
209
+
210
+ ## Architecture Advantages
211
+
212
+ - **Rich Cross-Modal Interactions**: Audio and text modalities directly attend to each other
213
+ - **Temporal Alignment**: Cross-attention naturally captures temporal relationships
214
+ - **Semantic Understanding**: Text-to-audio attention helps model understand content meaning
215
+ - **Flexible Combination**: Scalar mixing allows model to weight different embedding types
216
+ - **Regularization Options**: Optional embedding regularization for training stability
217
+
218
+ ## Limitations
219
+
220
+ - Higher computational complexity than basic models due to attention mechanisms
221
+ - Requires more training data to fully leverage cross-attention capabilities
222
+ - Designed for TTE prediction, not general audio-text matching
223
+ - Performance may vary on out-of-domain content or different editing workflows
224
+ - Requires specific data preprocessing (use included data collator)
225
+
226
+ ## Related Models
227
+
228
+ - **seamless-basic**: Basic audio+text model without attention mechanisms
229
+ - **seamless-translation**: Includes translation awareness but no cross-attention
230
+ - **seamless-langpairs**: Includes language pair embeddings but no cross-attention
__init__.py ADDED
"""
SeamlessCrossAttention model for HuggingFace Transformers
"""
from .modeling_seamless_crossattention import HFSeamlessCrossAttention, SeamlessCrossAttentionConfig

__all__ = ["HFSeamlessCrossAttention", "SeamlessCrossAttentionConfig"]
config.json ADDED
{
  "architectures": [
    "HFSeamlessCrossAttention"
  ],
  "dropout_prob": 0.1,
  "embedding_regularization": 0.0,
  "hidden_size": 1024,
  "model_type": "seamless_crossattention",
  "num_attention_heads": 8,
  "seamless_model_name": "facebook/hf-seamless-m4t-medium",
  "torch_dtype": "float32",
  "transformers_version": "4.50.2"
}
data_collator.py ADDED
import torch
import numpy as np
from transformers import AutoProcessor
from typing import Dict, List, Union
import logging

logger = logging.getLogger(__name__)

class DataCollatorSimpleSeamless:
    def __init__(
        self,
        processor: str,
        sample_rate: int = 16000,
        max_audio_length_sec: float = 8.0,
        max_text_length: int = 256,
        normalization_type: str = "none"
    ):
        """Initialize the data collator.

        Args:
            processor: Name or path of the SeamlessM4T processor to load.
            sample_rate: Audio sample rate in Hz.
            max_audio_length_sec: Maximum audio length in seconds (longer audio is truncated).
            max_text_length: Maximum text length in tokens.
            normalization_type: Normalization applied to labels. Options: "log1p", "none".
        """
        logger.info(f"Loading processor: {processor}")
        self.processor = AutoProcessor.from_pretrained(processor)

        self.sample_rate = sample_rate
        self.max_audio_sample_length = int(max_audio_length_sec * sample_rate)
        self.max_text_length = max_text_length
        self.normalization_type = normalization_type

    def __call__(self, batch: List[Dict[str, Union[np.ndarray, str, float]]]) -> Dict[str, torch.Tensor]:
        """Process a batch of raw features into model inputs."""
        # Extract raw data
        raw_audios = [item['raw_audio'] for item in batch]
        raw_texts = [item['raw_text'] for item in batch]

        raw_audios = [torch.tensor(audio) for audio in raw_audios]

        audio_inputs = self.processor(
            audios=raw_audios,
            sampling_rate=self.sample_rate,
            return_tensors="pt",
            padding="longest",
            truncation=True,
            max_length=self.max_audio_sample_length,
        )

        text_inputs = self.processor(
            text=raw_texts,
            return_tensors="pt",
            padding="longest",
            truncation=True,
            max_length=self.max_text_length,
        )

        # Extract translation features (defaults to 0 when absent)
        is_translation = torch.tensor([item.get('is_translation', 0) for item in batch], dtype=torch.float32)

        # Extract language pair features (defaults to 0 when absent)
        language_pair_id = torch.tensor([item.get('language_pair_id', 0) for item in batch], dtype=torch.long)

        if 'labels' in batch[0]:
            labels = [item['labels'] for item in batch]
            labels = torch.tensor(labels, dtype=torch.float32)

            # Apply normalization based on type
            if self.normalization_type == "log1p":
                labels = torch.log1p(labels)
            elif self.normalization_type == "none":
                pass
            else:
                raise ValueError(f"Unknown normalization type: {self.normalization_type}")
        else:
            labels = None

        return {
            'input_features': audio_inputs['input_features'],
            'audio_attention_mask': audio_inputs.get('attention_mask'),
            'input_ids': text_inputs['input_ids'],
            'text_attention_mask': text_inputs['attention_mask'],
            'is_translation': is_translation,
            'language_pair_id': language_pair_id,
            **({'labels': labels} if labels is not None else {})
        }
example_usage.py ADDED
#!/usr/bin/env python3
# Example usage for videoloc/seamless-crossattention

from transformers import AutoModel, AutoConfig
from huggingface_hub import hf_hub_download
import torch
import numpy as np
import importlib.util

def load_model_and_collator():
    # Load model - custom architecture requires importing the model class
    model_files = hf_hub_download(repo_id="videoloc/seamless-crossattention", filename="modeling_seamless_crossattention.py")
    spec = importlib.util.spec_from_file_location("modeling_seamless_crossattention", model_files)
    modeling_module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(modeling_module)

    # Now load the model using the custom class
    config = modeling_module.SeamlessCrossAttentionConfig.from_pretrained("videoloc/seamless-crossattention")
    model = modeling_module.HFSeamlessCrossAttention.from_pretrained("videoloc/seamless-crossattention")

    # Load data collator
    collator_file = hf_hub_download(repo_id="videoloc/seamless-crossattention", filename="data_collator.py")
    spec = importlib.util.spec_from_file_location("data_collator", collator_file)
    collator_module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(collator_module)

    data_collator = collator_module.DataCollatorSimpleSeamless(
        processor="facebook/hf-seamless-m4t-medium",
        max_audio_length_sec=8.0,
        max_text_length=256
    )

    return model, data_collator

def example_inference():
    model, collator = load_model_and_collator()

    # Example data: audio segment + subtitle text for cross-attention TTE prediction
    data = [{
        'raw_audio': np.random.randn(16000 * 3),  # 3 seconds at 16kHz
        'raw_text': "Example subtitle text with cross-modal attention for TTE prediction",
    }]

    batch = collator(data)
    model.eval()
    with torch.no_grad():
        outputs = model(**batch)
        tte_prediction = outputs.logits.item()

    print(f"Predicted Time To Edit (TTE): {tte_prediction:.2f} seconds")
    return tte_prediction

if __name__ == "__main__":
    example_inference()
modeling_seamless_crossattention.py ADDED
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import PreTrainedModel, PretrainedConfig
from transformers.modeling_outputs import SequenceClassifierOutput
from transformers import SeamlessM4TModel
import logging

logger = logging.getLogger(__name__)


class SeamlessCrossAttentionConfig(PretrainedConfig):
    """Configuration class for SeamlessCrossAttention model."""

    model_type = "seamless_crossattention"

    def __init__(
        self,
        seamless_model_name="facebook/hf-seamless-m4t-medium",
        hidden_size=1024,
        dropout_prob=0.1,
        num_attention_heads=8,
        embedding_regularization=0.0,
        **kwargs
    ):
        super().__init__(**kwargs)
        self.seamless_model_name = seamless_model_name
        self.hidden_size = hidden_size
        self.dropout_prob = dropout_prob
        self.num_attention_heads = num_attention_heads
        self.embedding_regularization = embedding_regularization


class ScalarMix(nn.Module):
    """Scalar mixing layer for combining multiple embeddings."""

    def __init__(self, num_inputs=4):
        super().__init__()
        self.weights = nn.Parameter(torch.ones(num_inputs))
        self.gamma = nn.Parameter(torch.tensor(1.0))

    def forward(self, *tensors):
        # Normalize weights with softmax
        weights = F.softmax(self.weights, dim=0)

        # Weighted sum
        weighted_sum = sum(w * t for w, t in zip(weights, tensors))

        # Scale by gamma
        return self.gamma * weighted_sum


class HFSeamlessCrossAttention(PreTrainedModel):
    """SeamlessM4T model with cross attention for HuggingFace Hub."""

    config_class = SeamlessCrossAttentionConfig
    supports_gradient_checkpointing = True

    def __init__(self, config):
        super().__init__(config)
        self.config = config

        # Load the underlying SeamlessM4T model
        self.seamless_model = SeamlessM4TModel.from_pretrained(config.seamless_model_name)
        self.seamless_model_speech_encoder = self.seamless_model.speech_encoder
        self.seamless_model_text_encoder = self.seamless_model.text_encoder

        # Freeze pre-trained encoders
        for param in self.seamless_model_speech_encoder.parameters():
            param.requires_grad = False
        for param in self.seamless_model_text_encoder.parameters():
            param.requires_grad = False

        # Projection layers
        self.audio_proj = nn.Linear(
            self.seamless_model_speech_encoder.config.hidden_size,
            config.hidden_size
        )
        self.text_proj = nn.Linear(
            self.seamless_model_text_encoder.config.hidden_size,
            config.hidden_size
        )

        # Layer norms
        self.audio_norm = nn.LayerNorm(config.hidden_size)
        self.text_norm = nn.LayerNorm(config.hidden_size)

        # Cross-attention layers
        self.audio_to_text_attention = nn.MultiheadAttention(
            embed_dim=config.hidden_size,
            num_heads=config.num_attention_heads,
            dropout=config.dropout_prob,
            batch_first=True
        )

        self.text_to_audio_attention = nn.MultiheadAttention(
            embed_dim=config.hidden_size,
            num_heads=config.num_attention_heads,
            dropout=config.dropout_prob,
            batch_first=True
        )

        # Scalar mix for combining embeddings
        self.scalar_mix = ScalarMix(num_inputs=4)

        # Regression head: plain MLP (1024 -> 512 -> 256 -> 1)
        self.fc = nn.Sequential(
            nn.Linear(config.hidden_size, 512),
            nn.ReLU(),
            nn.Dropout(config.dropout_prob),
            nn.Linear(512, 256),
            nn.ReLU(),
            nn.Dropout(config.dropout_prob),
            nn.Linear(256, 1)
        )

        # Initialize new layers
        self._initialize_new_layers()

    def _initialize_new_layers(self):
        """Initialize new layers with proper weights."""
        for module in [self.audio_proj, self.text_proj, self.fc]:
            if isinstance(module, nn.Linear):
                nn.init.xavier_uniform_(module.weight)
                nn.init.zeros_(module.bias)
            elif isinstance(module, nn.Sequential):
                for layer in module:
                    if isinstance(layer, nn.Linear):
                        nn.init.xavier_uniform_(layer.weight)
                        nn.init.zeros_(layer.bias)

    def forward(
        self,
        input_features,
        input_ids,
        text_attention_mask,
        audio_attention_mask=None,
        labels=None,
        **kwargs  # Accept additional features but ignore them
    ):
        # Create default audio attention mask if not provided
        if audio_attention_mask is None:
            audio_attention_mask = torch.ones(
                input_features.size(0), input_features.size(1),
                device=input_features.device
            )

        # 1. Encode audio
        audio_output = self.seamless_model_speech_encoder(
            input_features=input_features,
            attention_mask=audio_attention_mask
        )
        audio_hidden_states = audio_output.last_hidden_state  # [batch_size, audio_seq_len, hidden_size]

        # 2. Encode text
        text_output = self.seamless_model_text_encoder(
            input_ids=input_ids,
            attention_mask=text_attention_mask
        )
        text_hidden_states = text_output.last_hidden_state  # [batch_size, text_seq_len, hidden_size]

        # 3. Project to common dimension
        audio_projected = self.audio_proj(audio_hidden_states)  # [batch_size, audio_seq_len, hidden_size]
        text_projected = self.text_proj(text_hidden_states)  # [batch_size, text_seq_len, hidden_size]

        audio_projected = self.audio_norm(audio_projected)
        text_projected = self.text_norm(text_projected)

        # 4. Global pooling (mean) of original embeddings
        audio_global = audio_projected.mean(dim=1)  # [batch_size, hidden_size]
        text_global = text_projected.mean(dim=1)  # [batch_size, hidden_size]

        # 5. Cross-attention (no padding masks are passed; every token attends to all tokens)
        # Audio attends to text - each audio token attends to all text tokens
        audio_attended_to_text, _ = self.audio_to_text_attention(
            query=audio_projected,  # [batch_size, audio_seq_len, hidden_size]
            key=text_projected,  # [batch_size, text_seq_len, hidden_size]
            value=text_projected,  # [batch_size, text_seq_len, hidden_size]
        )

        # Text attends to audio - each text token attends to all audio tokens
        text_attended_to_audio, _ = self.text_to_audio_attention(
            query=text_projected,  # [batch_size, text_seq_len, hidden_size]
            key=audio_projected,  # [batch_size, audio_seq_len, hidden_size]
            value=audio_projected,  # [batch_size, audio_seq_len, hidden_size]
        )

        # 6. Global pooling (mean) of attended embeddings
        audio_attended_emb = audio_attended_to_text.mean(dim=1)  # [batch_size, hidden_size]
        text_attended_emb = text_attended_to_audio.mean(dim=1)  # [batch_size, hidden_size]

        # 7. Combine with scalar mix
        final_embedding = self.scalar_mix(
            audio_global,
            text_global,
            audio_attended_emb,
            text_attended_emb
        )

        # 8. Regression head
        logits = self.fc(final_embedding).squeeze(-1)

        # Compute loss if labels are provided
        loss = None
        if labels is not None:
            mse_loss = F.mse_loss(logits, labels)

            # Add embedding regularization if specified
            reg_loss = 0.0
            if self.config.embedding_regularization > 0:
                reg_loss = (
                    torch.norm(audio_global, p=2, dim=1).mean() +
                    torch.norm(text_global, p=2, dim=1).mean() +
                    torch.norm(audio_attended_emb, p=2, dim=1).mean() +
                    torch.norm(text_attended_emb, p=2, dim=1).mean()
                ) / 4.0

            loss = mse_loss + self.config.embedding_regularization * reg_loss

        return SequenceClassifierOutput(
            loss=loss,
            logits=logits,
            hidden_states=None,
            attentions=None
        )
pytorch_model.bin ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:29764a5e44028038b2251c4bc21f8bccaefc03f06bd4e796f77683b4e7914e51
size 4883154633
requirements.txt ADDED
transformers>=4.50.2
torch>=2.6.0
torchaudio>=2.6.0
huggingface_hub>=0.33.0
numpy>=2.2.3
sentencepiece>=0.2.0
accelerate>=1.5.2
soundfile>=0.13.1