Tiny-AST Military Audio Classifier
State-of-the-art military audio classification model achieving 96.73% accuracy on the Military Audio Dataset (MAD).
Model Description
This model is a fine-tuned version of MIT/ast-finetuned-audioset-10-10-0.4593 on the Military Audio Dataset (MAD). It's designed for edge deployment on devices like Raspberry Pi 5 for military surveillance applications.
Key Features
- 96.73% accuracy on MAD dataset (7 military audio classes)
- Edge-optimized for Raspberry Pi deployment
- Fast inference (<200ms per sample)
- Efficient (16.5% of parameters fine-tuned)
- Robust to real-world military environments
Training Results
Progressive Training Performance:
- Phase 1 (Classifier only): 94.32% accuracy
- Phase 2 (Top 2 layers): 96.73% accuracy (best model)
- Phase 3 (Top 4 layers): 96.35% accuracy
- Phase 4 (Top 6 layers): 96.73% accuracy
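The phases above can be sketched as a helper that freezes everything except the classifier head and the top N encoder layers. This is a minimal sketch, not the author's training code: it assumes the parameter naming of the Hugging Face AST implementation (`...encoder.layer.<i>...` for the 12 encoder layers and `classifier...` for the head), and the function name and `PHASES` table are hypothetical. The final encoder layer norm stays frozen here; whether the original training unfroze it is not stated.

```python
import re

# Assumed naming (Hugging Face AST implementation):
#   audio_spectrogram_transformer.encoder.layer.<i>.*   with i = 0..11
#   classifier.*
LAYER_PATTERN = re.compile(r"encoder\.layer\.(\d+)\.")

# Phase schedule from the list above: (phase, number of unfrozen top layers).
PHASES = [(1, 0), (2, 2), (3, 4), (4, 6)]


def unfreeze_top_layers(named_parameters, num_top_layers, total_layers=12):
    """Leave only the classifier head and the top `num_top_layers`
    encoder layers trainable. Returns the names of trainable parameters."""
    first_trainable = total_layers - num_top_layers
    trainable = []
    for name, param in named_parameters:
        match = LAYER_PATTERN.search(name)
        if name.startswith("classifier."):
            param.requires_grad = True
        elif match and int(match.group(1)) >= first_trainable:
            param.requires_grad = True
        else:
            param.requires_grad = False
        if param.requires_grad:
            trainable.append(name)
    return trainable
```

With a loaded model, Phase 2 would correspond to calling `unfreeze_top_layers(model.named_parameters(), 2)` before building the optimizer, so that only trainable parameters are passed to it.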
Training Configuration:
- Method: Progressive unfreezing strategy
- Learning Rates: Conservative (1e-4 → 2e-5)
- Normalization: MAD-specific statistics (mean: -2.16, std: 2.85)
- Class Weighting: Balanced for imbalanced dataset
- Training Time: 40 minutes on RTX 3060
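The normalization entry above refers to dataset-level statistics of the log-Mel values. The sketch below shows one way such statistics could be computed and applied. It follows the AST convention of dividing by twice the standard deviation, which is what `ASTFeatureExtractor` does internally; whether the author's pipeline used exactly this form is an assumption, and both function names are hypothetical.

```python
import statistics

# Values reported in the training configuration above.
MAD_MEAN = -2.16
MAD_STD = 2.85


def dataset_stats(spectrograms):
    """Mean and population std over every value of every spectrogram.
    Each spectrogram is a list of frames, each frame a list of floats."""
    values = [v for spec in spectrograms for frame in spec for v in frame]
    return statistics.fmean(values), statistics.pstdev(values)


def normalize_spectrogram(spec, mean=MAD_MEAN, std=MAD_STD):
    """AST-style normalization: (x - mean) / (2 * std)."""
    return [[(v - mean) / (2.0 * std) for v in frame] for frame in spec]
```

With the Hugging Face extractor, the same statistics can be supplied through its `mean` and `std` constructor arguments instead of normalizing by hand.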
Model Classes
The model classifies 7 military audio categories:
| Class ID | Class Name | Training Samples | Test Samples |
|---|---|---|---|
| 0 | Communication | 774 | 207 |
| 1 | Footsteps | 1,293 | 280 |
| 2 | Gunshot | 773 | 104 |
| 3 | Shelling | 883 | 104 |
| 4 | Vehicle | 910 | 122 |
| 5 | Helicopter | 934 | 91 |
| 6 | Fighter | 862 | 129 |
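The Class Weighting entry under Training Configuration does not give the exact scheme. As an illustration only, the sketch below applies the common "balanced" heuristic (total samples divided by number of classes times class count, the same formula scikit-learn uses for `class_weight="balanced"`) to the training counts in the table above. The function name is hypothetical.

```python
# Training sample counts copied from the class table above.
TRAIN_COUNTS = {
    "Communication": 774,
    "Footsteps": 1293,
    "Gunshot": 773,
    "Shelling": 883,
    "Vehicle": 910,
    "Helicopter": 934,
    "Fighter": 862,
}


def balanced_class_weights(counts):
    """weight_c = total / (num_classes * count_c); rarer classes weigh more."""
    total = sum(counts.values())
    num_classes = len(counts)
    return {name: total / (num_classes * n) for name, n in counts.items()}


weights = balanced_class_weights(TRAIN_COUNTS)
for name, weight in weights.items():
    print(f"{name:<14} {weight:.3f}")
```

Under this heuristic Footsteps, the largest class, gets a weight below 1 and the smaller classes get weights above 1. The weights, in class-ID order, could then be passed to a weighted cross-entropy loss.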
Usage
Quick Start
from transformers import ASTForAudioClassification, ASTFeatureExtractor
import librosa
import torch
# Load model and feature extractor
model = ASTForAudioClassification.from_pretrained("Akashpaul123/tiny-ast-mad-military-audio-classifier")
feature_extractor = ASTFeatureExtractor.from_pretrained("Akashpaul123/tiny-ast-mad-military-audio-classifier")
# Load audio file (16kHz recommended)
audio, sr = librosa.load("military_audio.wav", sr=16000)
# Extract features
inputs = feature_extractor(audio, sampling_rate=16000, return_tensors="pt")
# Predict
with torch.no_grad():
    outputs = model(**inputs)
predicted_class = torch.argmax(outputs.logits, dim=-1).item()
# Class mapping
classes = ['Communication', 'Footsteps', 'Gunshot', 'Shelling', 'Vehicle', 'Helicopter', 'Fighter']
print(f"Predicted class: {classes[predicted_class]}")
Edge Deployment (Raspberry Pi 5)
import onnxruntime as ort
# Load ONNX model for edge inference
session = ort.InferenceSession("tiny_ast_mad_optimized.onnx")
# ... inference code
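The inference step elided above depends on how the ONNX file was exported, which is not documented here. The following is a hedged sketch that assumes the export has a single input and that its first output holds logits of shape (1, 7). `get_inputs()` and `run(None, feeds)` are standard `onnxruntime.InferenceSession` methods; the helper name `classify_with_onnx` is hypothetical.

```python
# Class order taken from the Quick Start example.
CLASSES = ['Communication', 'Footsteps', 'Gunshot', 'Shelling',
           'Vehicle', 'Helicopter', 'Fighter']


def classify_with_onnx(session, features):
    """Run one sample through an ONNX Runtime session and return its label.

    Assumptions: the model has exactly one input, and the first output
    contains logits with shape (1, len(CLASSES)).
    """
    input_name = session.get_inputs()[0].name
    logits = session.run(None, {input_name: features})[0][0]
    best = max(range(len(CLASSES)), key=lambda i: float(logits[i]))
    return CLASSES[best]
```

Here `features` would be the float32 array produced by the feature extractor, for example `feature_extractor(audio, sampling_rate=16000, return_tensors="np")["input_values"]`, and `session` the `ort.InferenceSession` created above.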
Training Details
Dataset
- Source: Military Audio Dataset (MAD)
- Total Samples: 7,466 audio files
- Duration: 2-8 seconds per sample
- Sample Rate: 16kHz
- Augmentation: Military-specific (time stretch, pitch shift, noise injection)
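The augmentation parameters are not listed. As an illustration of the noise injection item only, the sketch below mixes Gaussian noise into a signal at a chosen signal-to-noise ratio. The Gaussian noise model, the SNR parameterization, and the function name are assumptions rather than the author's settings. Time stretching and pitch shifting are available in librosa as `librosa.effects.time_stretch` and `librosa.effects.pitch_shift`.

```python
import math
import random


def add_noise(signal, snr_db, rng=None):
    """Return `signal` plus Gaussian noise scaled to the requested SNR (dB)."""
    rng = rng or random.Random(0)
    signal_power = sum(x * x for x in signal) / len(signal)
    noise = [rng.gauss(0.0, 1.0) for _ in signal]
    noise_power = sum(n * n for n in noise) / len(noise)
    # SNR(dB) = 10 * log10(signal_power / noise_power)
    target_noise_power = signal_power / (10 ** (snr_db / 10))
    scale = math.sqrt(target_noise_power / noise_power)
    return [x + scale * n for x, n in zip(signal, noise)]
```

Lower `snr_db` values produce noisier training samples; a range of values is typically sampled per example.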
Architecture
- Base Model: Audio Spectrogram Transformer (AST)
- Parameters: 86.2M total, 14.2M trainable (16.5%)
- Input: Log-Mel spectrograms (1024 x 128)
- Output: 7 military audio classes
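The trainable-parameter figure can be sanity-checked with a short calculation. The sketch assumes the AST base configuration (hidden size 768, feed-forward width 3072) and a classifier head made of a layer norm followed by one linear layer, as in the Hugging Face implementation. Under those assumptions, two encoder layers plus the head come out close to the 14.2M reported above, which is consistent with the Phase 2 (top 2 layers) model.

```python
HIDDEN = 768       # assumed AST-base hidden size
MLP = 3072         # assumed feed-forward width
NUM_CLASSES = 7
TOTAL_PARAMS = 86.2e6  # total reported above


def linear(n_in, n_out):
    """Weights plus biases of a dense layer."""
    return n_in * n_out + n_out


def layer_norm(n):
    """Scale plus shift of a layer norm."""
    return 2 * n


attention = 4 * linear(HIDDEN, HIDDEN)  # query, key, value, output projection
feed_forward = linear(HIDDEN, MLP) + linear(MLP, HIDDEN)
encoder_layer = attention + feed_forward + 2 * layer_norm(HIDDEN)

classifier_head = layer_norm(HIDDEN) + linear(HIDDEN, NUM_CLASSES)
trainable = 2 * encoder_layer + classifier_head

print(f"per encoder layer: {encoder_layer:,}")
print(f"trainable (top 2 layers + head): {trainable / 1e6:.2f}M")
print(f"share of total: {trainable / TOTAL_PARAMS:.1%}")
```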
Performance Metrics
- Accuracy: 96.73%
- F1-Macro: 96.84%
- F1-Weighted: 96.74%
- Precision: High across all classes
- Recall: Balanced performance
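For readers comparing the two F1 figures: macro-F1 averages per-class F1 scores equally, while weighted F1 averages them in proportion to each class's test support. The sketch below computes both from a confusion matrix. The 3-class matrix is a made-up toy example, not the model's actual results, and the function names are hypothetical.

```python
def per_class_f1(confusion):
    """confusion[i][j] = number of class-i samples predicted as class j."""
    k = len(confusion)
    scores = []
    for c in range(k):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp
        fp = sum(confusion[r][c] for r in range(k)) - tp
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return scores


def summarize(confusion):
    """Accuracy, macro-F1 and support-weighted F1 from a confusion matrix."""
    support = [sum(row) for row in confusion]
    total = sum(support)
    f1 = per_class_f1(confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return {
        "accuracy": correct / total,
        "f1_macro": sum(f1) / len(f1),
        "f1_weighted": sum(f * s for f, s in zip(f1, support)) / total,
    }


# Hypothetical confusion matrix for illustration only.
toy = [
    [8, 2, 0],
    [1, 9, 0],
    [0, 0, 80],
]
print(summarize(toy))
```

In the toy matrix the large, perfectly classified third class pulls weighted F1 above macro-F1; on an imbalanced test set the two can differ in either direction.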
Hardware Requirements
Training
- GPU: RTX 3060 (12GB VRAM) or similar
- RAM: 16GB+ recommended
- Storage: 50GB for dataset and models
Inference (Edge)
- Device: Raspberry Pi 5 or similar ARM device
- RAM: 2GB minimum
- Inference Time: <200ms per sample
- Power: <5W continuous operation
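The latency figure will vary with the device, runtime, and clip length, so it is worth measuring on the target hardware. Below is a minimal timing harness, assuming `predict` is any callable that runs one sample end to end (for example a wrapper around the ONNX session). The helper name and its defaults are hypothetical.

```python
import statistics
import time


def measure_latency_ms(predict, sample, warmup=3, runs=20):
    """Time `predict(sample)` and report latency statistics in milliseconds."""
    for _ in range(warmup):  # warm-up calls are not timed
        predict(sample)
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        predict(sample)
        timings.append((time.perf_counter() - start) * 1000.0)
    timings.sort()
    # Approximate 95th percentile by index into the sorted timings.
    p95_index = min(len(timings) - 1, int(0.95 * len(timings)))
    return {
        "median_ms": statistics.median(timings),
        "p95_ms": timings[p95_index],
        "max_ms": timings[-1],
    }
```

Comparing `median_ms` and `p95_ms` against the 200ms target gives a rough picture of both typical and worst-case behavior.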
Limitations and Considerations
- Domain-specific: Optimized for military audio contexts
- Language: Primarily English communication samples
- Environment: Trained on MAD dataset conditions
- Real-time: Designed for batch processing, not streaming
Citation
If you use this model in your research, please cite:
@misc{tiny-ast-mad-2024,
title={Tiny-AST Military Audio Classifier: Progressive Fine-tuning for Edge Deployment},
author={Paul, Akash},
year={2024},
howpublished={Hugging Face Model Hub},
url={https://huggingface.co/Akashpaul123/tiny-ast-mad-military-audio-classifier}
}
License
This model is licensed under the Apache 2.0 License.
Contact
- Author: Akash Paul
- GitHub: @akashpaul123
- Hugging Face: @akashpaul123
This model was trained as part of military audio surveillance research, with a focus on edge deployment and real-world robustness.
Evaluation results
- Accuracy on MAD Dataset (self-reported): 0.967
- F1-weighted on MAD Dataset (self-reported): 0.967