Upload folder using huggingface_hub

- .gitattributes +1 -0
- .ipynb_checkpoints/Transformers-checkpoint.ipynb +0 -0
- README.md +299 -0
- README_HF.md +299 -0
- Transformers.ipynb +0 -0
- best_transformer_model.pth +3 -0
- m4_transformer_results.png +3 -0

.gitattributes CHANGED

@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+m4_transformer_results.png filter=lfs diff=lfs merge=lfs -text

.ipynb_checkpoints/Transformers-checkpoint.ipynb ADDED

The diff for this file is too large to render. See raw diff.

README.md ADDED

@@ -0,0 +1,299 @@

---
title: Transformers from Scratch - Complete Implementation
emoji: ๐ฎ
colorFrom: blue
colorTo: green
sdk: pytorch
app_file: Transformers.ipynb
pinned: false
license: mit
tags:
- deep-learning
- transformers
- attention
- pytorch
- nlp
- text-classification
- sentiment-analysis
- educational
- from-scratch
datasets:
- synthetic-movie-reviews
---

# Transformers from Scratch: Complete Implementation

A comprehensive PyTorch implementation of the Transformer architecture from "Attention Is All You Need", featuring detailed mathematical foundations, educational content, and practical text classification applications.

## Model Description

This repository contains a complete, from-scratch implementation of the Transformer architecture. The model demonstrates the core concepts behind modern NLP systems like BERT, GPT, and ChatGPT through a practical sentiment analysis task. This implementation serves as both a working model and an educational resource for understanding the attention mechanism.

### Architecture Details

- **Model Type**: Transformer Encoder for Text Classification
- **Framework**: PyTorch
- **Task**: Binary sentiment classification (positive/negative movie reviews)
- **Model Dimension**: 128
- **Attention Heads**: 8
- **Layers**: 4 Transformer blocks
- **Feed-Forward Dimension**: 256
- **Total Parameters**: ~200K
- **Vocabulary Size**: Dynamic (built from training data)

### Key Components

1. **Multi-Head Attention**: Core mechanism allowing parallel processing of sequences
2. **Positional Encoding**: Sine/cosine embeddings to inject position information
3. **Transformer Blocks**: Attention + feed-forward with residual connections
4. **Layer Normalization**: Stabilizes training and improves convergence
5. **Classification Head**: Global average pooling + linear layer for predictions

## Mathematical Foundation

### Scaled Dot-Product Attention

```
Attention(Q, K, V) = softmax(QK^T / √d_k)V
```
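
A minimal PyTorch sketch of this formula is shown below. It is illustrative rather than the notebook's exact code, and it assumes `Q`, `K`, and `V` are tensors whose trailing dimension is `d_k` (extra leading batch/head dimensions broadcast through).

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Compute softmax(QK^T / sqrt(d_k))V; returns the output and the attention weights."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)   # (..., seq_len, seq_len)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float('-inf'))
    weights = F.softmax(scores, dim=-1)                  # attention distribution per query
    return weights @ V, weights
```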

### Multi-Head Attention

```
MultiHead(Q, K, V) = Concat(head_1, ..., head_h)W^O
head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)
```
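
Building on the `scaled_dot_product_attention` helper sketched above, a compact multi-head attention module matching these equations could look as follows. This is a hedged sketch: the projection names (`w_q`, `w_k`, `w_v`, `w_o`) and the exact tensor reshaping are illustrative and may differ from the class defined in `Transformers.ipynb`.

```python
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        # Learned projections W^Q, W^K, W^V and the output projection W^O
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def forward(self, x, mask=None):
        batch, seq_len, _ = x.shape

        # Project, then split into heads: (batch, num_heads, seq_len, d_k)
        def split(t):
            return t.view(batch, seq_len, self.num_heads, self.d_k).transpose(1, 2)

        q, k, v = split(self.w_q(x)), split(self.w_k(x)), split(self.w_v(x))
        out, _ = scaled_dot_product_attention(q, k, v, mask)

        # Concatenate the heads back together and apply W^O
        out = out.transpose(1, 2).contiguous().view(batch, seq_len, -1)
        return self.w_o(out)
```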

### Positional Encoding

```
PE(pos, 2i)   = sin(pos/10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos/10000^(2i/d_model))
```
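
The `PositionalEncoding` module used later in the Quick Start snippet can be written directly from these formulas. The sketch below is a standard sine/cosine implementation and may differ in small details from the notebook's version.

```python
import math
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len):
        super().__init__()
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(max_len, dtype=torch.float).unsqueeze(1)   # (max_len, 1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float()
                             * (-math.log(10000.0) / d_model))             # 1 / 10000^(2i/d_model)
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        self.register_buffer('pe', pe.unsqueeze(0))                        # (1, max_len, d_model)

    def forward(self, x):
        # x: (batch, seq_len, d_model); add the encodings for the first seq_len positions
        return x + self.pe[:, :x.size(1)]
```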

## Training Details

- **Dataset**: Synthetic movie reviews (positive/negative sentiment)
- **Optimizer**: AdamW with weight decay (0.01)
- **Learning Rate**: 0.0001 with cosine annealing
- **Batch Size**: 16
- **Max Sequence Length**: 24 tokens
- **Training Epochs**: 30
- **Hardware**: Optimized for Apple M4 and CUDA GPUs
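
A minimal training loop using these settings (AdamW with weight decay 0.01, learning rate 1e-4 with cosine annealing, 30 epochs) might look like the sketch below. `model` and `train_loader` are assumed to be defined as in the notebook, and the device selection covers Apple Silicon (MPS), CUDA, and CPU.

```python
import torch
import torch.nn as nn

device = torch.device('mps' if torch.backends.mps.is_available()
                      else 'cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)                       # assumes `model` was built as shown in Quick Start

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=30)

for epoch in range(30):
    model.train()
    total_loss = 0.0
    for tokens, labels in train_loader:        # assumed DataLoader of (tokens, labels), batch_size=16
        tokens, labels = tokens.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(tokens), labels)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    scheduler.step()
    print(f"Epoch {epoch + 1:02d} | loss {total_loss / len(train_loader):.4f}")
```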

## Model Performance

### Metrics

- **Test Accuracy**: 85%+
- **Training Time**: ~10 minutes on Apple M4
- **Model Size**: 200K parameters
- **Convergence**: Stable training without overfitting

### Capabilities

- ✅ Binary sentiment classification
- ✅ Attention weight visualization
- ✅ Fast inference on modern hardware
- ✅ Educational transparency
- ✅ Easily extensible architecture

## Usage

### Quick Start

```python
import torch
import torch.nn as nn
import math

# Load the complete implementation (from notebook)
class TransformerClassifier(nn.Module):
    def __init__(self, vocab_size, d_model, num_heads, num_layers, d_ff, max_len, num_classes):
        super().__init__()
        self.d_model = d_model
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.pos_encoding = PositionalEncoding(d_model, max_len)

        self.transformer_blocks = nn.ModuleList([
            TransformerBlock(d_model, num_heads, d_ff)
            for _ in range(num_layers)
        ])

        self.norm = nn.LayerNorm(d_model)
        self.classifier = nn.Linear(d_model, num_classes)

    def forward(self, x):
        # Embedding + positional encoding
        x = self.embedding(x) * math.sqrt(self.d_model)
        x = self.pos_encoding(x)

        # Transformer blocks
        for transformer in self.transformer_blocks:
            x = transformer(x)

        # Classification
        x = self.norm(x)
        x = x.mean(dim=1)  # Global average pooling
        return self.classifier(x)

# Load trained model
model = TransformerClassifier(
    vocab_size=vocab_size,
    d_model=128,
    num_heads=8,
    num_layers=4,
    d_ff=256,
    max_len=24,
    num_classes=2
)
model.load_state_dict(torch.load('best_transformer_model.pth'))
model.eval()

# Example inference
def predict_sentiment(text, model, vocab_to_idx, max_length=24):
    tokens = tokenize_text(text, vocab_to_idx, max_length)
    with torch.no_grad():
        output = model(tokens.unsqueeze(0))
        prediction = torch.softmax(output, dim=1)
        return "Positive" if prediction[0][1] > 0.5 else "Negative"

# Test the model
result = predict_sentiment("This movie was absolutely fantastic!", model, vocab_to_idx)
print(f"Sentiment: {result}")
```
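
The snippet above assumes `PositionalEncoding` (sketched earlier), `TransformerBlock`, `tokenize_text`, `vocab_size`, and `vocab_to_idx` are already defined, as they are in `Transformers.ipynb`. For orientation, plausible minimal versions of the two remaining helpers are sketched below; they are illustrative and may differ from the notebook's exact implementations.

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Encoder block: multi-head attention + feed-forward, each with a residual connection and LayerNorm."""
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        self.attn = MultiHeadAttention(d_model, num_heads)   # illustrative module sketched earlier
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        x = self.norm1(x + self.dropout(self.attn(x)))
        x = self.norm2(x + self.dropout(self.ff(x)))
        return x

def tokenize_text(text, vocab_to_idx, max_length=24, pad_idx=0, unk_idx=1):
    """Whitespace-tokenize, map tokens to indices, and pad/truncate to max_length (indices are assumptions)."""
    ids = [vocab_to_idx.get(tok, unk_idx) for tok in text.lower().split()]
    ids = ids[:max_length] + [pad_idx] * max(0, max_length - len(ids))
    return torch.tensor(ids, dtype=torch.long)
```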

### Advanced Usage

```python
# Visualize attention weights
def visualize_attention(model, text, vocab_to_idx):
    # Extract the attention weights from each layer (this requires the
    # attention modules to return their softmax weights) and plot them
    # as heatmaps showing which tokens the model focuses on.
    pass

# Fine-tune on new data (assumes new_data_loader yields (tokens, labels) batches)
def fine_tune_model(model, new_data_loader, epochs=5):
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
    criterion = nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for tokens, labels in new_data_loader:
            optimizer.zero_grad()
            loss = criterion(model(tokens), labels)
            loss.backward()
            optimizer.step()
    return model
```

## Visualizations and Analysis

1. **Training Curves**: Loss and accuracy evolution over epochs
2. **Attention Heatmaps**: Visualize what the model pays attention to
3. **Performance Metrics**: Precision, recall, F1-score breakdowns
4. **Architecture Diagrams**: Component-wise model visualization
5. **Error Analysis**: Common failure cases and model limitations
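
As an example for item 2 in the list above, a heatmap can be rendered once an attention-weight matrix and its tokens have been extracted from the model. The helper below is an illustrative sketch using matplotlib; the function name and arguments are assumptions, not the notebook's API.

```python
import matplotlib.pyplot as plt

def plot_attention_heatmap(weights, tokens, title="Attention weights"):
    """weights: 2-D NumPy array of shape (seq_len, seq_len); tokens: list of token strings."""
    fig, ax = plt.subplots(figsize=(6, 5))
    im = ax.imshow(weights, cmap='viridis')
    ax.set_xticks(range(len(tokens)))
    ax.set_xticklabels(tokens, rotation=90)
    ax.set_yticks(range(len(tokens)))
    ax.set_yticklabels(tokens)
    ax.set_xlabel('Key position')
    ax.set_ylabel('Query position')
    ax.set_title(title)
    fig.colorbar(im, ax=ax)
    fig.tight_layout()
    plt.show()
```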

## Files and Outputs

- `Transformers.ipynb`: Complete implementation with educational content
- `best_transformer_model.pth`: Trained model weights
- `m4_transformer_results.png`: Training curves and performance metrics
- Architecture visualization and attention weight examples

## Educational Value

This implementation is designed as a comprehensive learning resource featuring:

### Mathematical Understanding

- **Complete Derivations**: From attention theory to implementation
- **Step-by-Step Breakdown**: Each component explained individually
- **Visual Mathematics**: Attention visualizations and formula explanations
- **Practical Examples**: Concrete numerical calculations

### Implementation Insights

- **Clean Code Architecture**: Modular, readable, and well-documented
- **Best Practices**: Modern PyTorch patterns and techniques
- **Performance Optimization**: Efficient training and inference
- **Debugging Techniques**: How to monitor and improve training

### Real-World Applications

- **End-to-End Pipeline**: From raw text to predictions
- **Production Considerations**: Model deployment and optimization
- **Extension Examples**: How to adapt for different tasks
- **Transfer Learning**: Building on pre-trained representations

## Applications

This Transformer implementation can be adapted for:

### Text Classification Tasks

- **Sentiment Analysis**: Movie reviews, product feedback, social media
- **Topic Classification**: News categorization, document organization
- **Spam Detection**: Email filtering, content moderation
- **Intent Recognition**: Chatbot understanding, voice assistants

### Sequence Processing

- **Named Entity Recognition**: Extract people, places, organizations
- **Part-of-Speech Tagging**: Grammatical analysis
- **Text Similarity**: Document matching, plagiarism detection
- **Feature Extraction**: Dense representations for downstream tasks

### Research and Development

- **Architecture Experiments**: Test new attention mechanisms
- **Ablation Studies**: Understand component contributions
- **Scaling Experiments**: Larger models and datasets
- **Novel Applications**: Domain-specific adaptations

## Comparison with Other Architectures

### Advantages over RNNs

- ✅ **Parallel Processing**: Much faster training and inference
- ✅ **Long-Range Dependencies**: Better handling of distant relationships
- ✅ **Scalability**: Efficient on modern hardware
- ✅ **Interpretability**: Attention weights provide insights

### Advantages over CNNs

- ✅ **Sequence Modeling**: Natural fit for text and time series
- ✅ **Variable Length**: Handle sequences of any length
- ✅ **Global Context**: Attend to entire sequence simultaneously
- ✅ **Position Awareness**: Explicit positional information

### Educational Benefits

- **Foundation Understanding**: Core concepts behind modern NLP
- **Mathematical Clarity**: Clean mathematical formulations
- **Implementation Practice**: Hands-on coding experience
- **Research Preparation**: Basis for advanced architectures

## Citation

If you use this implementation in your research or projects, please cite:

```bibtex
@misc{transformers_from_scratch_2024,
  title={Transformers from Scratch: Complete Implementation},
  author={Gruhesh Kurra},
  year={2024},
  url={https://huggingface.co/karthik-2905/TransformersFromScratch}
}
```

## Future Extensions

Planned improvements and research directions:

- **Encoder-Decoder Architecture**: Full sequence-to-sequence implementation
- **Pre-training Pipeline**: Large-scale language model training
- **Alternative Attention**: Sparse, local, and linear attention variants
- **Vision Transformers**: Adapt architecture for image tasks
- **Multimodal Transformers**: Text, image, and audio processing
- **Scientific Applications**: Protein sequences, molecular modeling

## License

This project is licensed under the MIT License - see the LICENSE file for details.

## Additional Resources

- **GitHub Repository**: [TransformersFromScratch](https://github.com/GruheshKurra/TransformersFromScratch)
- **Original Paper**: "Attention Is All You Need" by Vaswani et al.
- **Educational Content**: Complete mathematical derivations and examples
- **Performance Benchmarks**: Detailed analysis and comparisons

## Model Card Authors

**Gruhesh Kurra** - Implementation, documentation, and educational content

---

**Tags**: transformers, attention, pytorch, nlp, text-classification, educational

**Model Card Last Updated**: December 2024

README_HF.md ADDED

@@ -0,0 +1,299 @@

(Identical in content to README.md above.)

Transformers.ipynb ADDED

The diff for this file is too large to render. See raw diff.

best_transformer_model.pth ADDED

@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:5d2c6241ca2e72ed6e0587c6875485bb522b2b55d4fc6272c8686f139a379f20
+size 2215301

m4_transformer_results.png ADDED

(Binary image tracked with Git LFS: training curves and performance metrics.)