Tiny-AST Military Audio Classifier

๐ŸŽ–๏ธ State-of-the-art military audio classification model achieving 96.73% accuracy on the Military Audio Dataset (MAD).

Model Description

This model is a fine-tuned version of MIT/ast-finetuned-audioset-10-10-0.4593 on the Military Audio Dataset (MAD). It's designed for edge deployment on devices like Raspberry Pi 5 for military surveillance applications.

Key Features

  • 🎯 96.73% accuracy on MAD dataset (7 military audio classes)
  • 🚀 Edge-optimized for Raspberry Pi deployment
  • ⚡ Fast inference (<200ms per sample)
  • 🧠 Efficient (16.5% of parameters fine-tuned)
  • 🔊 Robust to real-world military environments

Training Results

Progressive Training Performance:

  • Phase 1 (Classifier only): 94.32% accuracy
  • Phase 2 (Top 2 layers): 96.73% accuracy ← Best Model
  • Phase 3 (Top 4 layers): 96.35% accuracy
  • Phase 4 (Top 6 layers): 96.73% accuracy

Training Configuration:

  • Method: Progressive unfreezing strategy (see the setup sketch after this list)
  • Learning Rates: Conservative (1e-4 → 2e-5)
  • Normalization: MAD-specific statistics (mean: -2.16, std: 2.85)
  • Class Weighting: Balanced for imbalanced dataset
  • Training Time: 40 minutes on RTX 3060
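
The training scripts are not included in this repository, but a minimal sketch of the progressive unfreezing setup described above might look as follows. It assumes the standard Hugging Face AST module layout (audio_spectrogram_transformer.encoder.layer) and plugs in the MAD normalization statistics listed above; the optimizer, learning-rate schedules, and class-weighted loss are omitted.

from transformers import ASTFeatureExtractor, ASTForAudioClassification

# MAD-specific normalization statistics (values from the configuration above)
feature_extractor = ASTFeatureExtractor.from_pretrained(
    "MIT/ast-finetuned-audioset-10-10-0.4593",
    mean=-2.16,
    std=2.85,
)

model = ASTForAudioClassification.from_pretrained(
    "MIT/ast-finetuned-audioset-10-10-0.4593",
    num_labels=7,
    ignore_mismatched_sizes=True,  # swap the 527-class AudioSet head for a 7-class head
)

# Phase 1: freeze the backbone and train only the classifier head
for param in model.audio_spectrogram_transformer.parameters():
    param.requires_grad = False

# Phases 2-4: progressively unfreeze the top N encoder layers (N = 2, 4, 6)
def unfreeze_top_layers(model, n_layers):
    for layer in model.audio_spectrogram_transformer.encoder.layer[-n_layers:]:
        for param in layer.parameters():
            param.requires_grad = True

unfreeze_top_layers(model, 2)  # phase 2, which produced the best checkpoint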

Model Classes

The model classifies 7 military audio categories:

Class ID   Class Name      Training Samples   Test Samples
0          Communication   774                207
1          Footsteps       1,293              280
2          Gunshot         773                104
3          Shelling        883                104
4          Vehicle         910                122
5          Helicopter      934                91
6          Fighter         862                129
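
The balanced class weighting mentioned in the training configuration can be illustrated from the training-sample counts in the table above. This is only a sketch of the common total / (n_classes * count) scheme; the exact weights used during training are not published here.

import torch

# Training-sample counts per class, in class-ID order (from the table above)
train_counts = torch.tensor([774, 1293, 773, 883, 910, 934, 862], dtype=torch.float)
num_classes = len(train_counts)

# Balanced weighting: rarer classes receive proportionally larger weights
class_weights = train_counts.sum() / (num_classes * train_counts)

# The weights would typically be passed to the training loss
loss_fn = torch.nn.CrossEntropyLoss(weight=class_weights)
print(class_weights)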

Usage

Quick Start

from transformers import ASTForAudioClassification, ASTFeatureExtractor
import librosa
import torch

# Load model and feature extractor
model = ASTForAudioClassification.from_pretrained("Akashpaul123/tiny-ast-mad-military-audio-classifier")
feature_extractor = ASTFeatureExtractor.from_pretrained("Akashpaul123/tiny-ast-mad-military-audio-classifier")

# Load audio file (16kHz recommended)
audio, sr = librosa.load("military_audio.wav", sr=16000)

# Extract features
inputs = feature_extractor(audio, sampling_rate=16000, return_tensors="pt")

# Predict
with torch.no_grad():
    outputs = model(**inputs)
    predicted_class = torch.argmax(outputs.logits, dim=-1).item()

# Class mapping
classes = ['Communication', 'Footsteps', 'Gunshot', 'Shelling', 'Vehicle', 'Helicopter', 'Fighter']
print(f"Predicted class: {classes[predicted_class]}")

Edge Deployment (Raspberry Pi 5)

import onnxruntime as ort

# Load ONNX model for edge inference
session = ort.InferenceSession("tiny_ast_mad_optimized.onnx")
# ... inference code
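
A hedged sketch of what the elided inference step could look like, assuming the ONNX graph was exported with the feature extractor's input_values as its single input; the input name is read from the session rather than hard-coded, and tiny_ast_mad_optimized.onnx is the file name referenced above.

import numpy as np
import librosa
import onnxruntime as ort
from transformers import ASTFeatureExtractor

feature_extractor = ASTFeatureExtractor.from_pretrained(
    "Akashpaul123/tiny-ast-mad-military-audio-classifier"
)
session = ort.InferenceSession("tiny_ast_mad_optimized.onnx")

# Same preprocessing as the PyTorch path, but returning NumPy arrays
audio, sr = librosa.load("military_audio.wav", sr=16000)
inputs = feature_extractor(audio, sampling_rate=16000, return_tensors="np")

# Use whatever input name the exported graph declares (export-dependent)
input_name = session.get_inputs()[0].name
logits = session.run(None, {input_name: inputs["input_values"].astype(np.float32)})[0]

classes = ['Communication', 'Footsteps', 'Gunshot', 'Shelling', 'Vehicle', 'Helicopter', 'Fighter']
print(classes[int(np.argmax(logits, axis=-1)[0])])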

Training Details

Dataset

  • Source: Military Audio Dataset (MAD)
  • Total Samples: 7,466 audio files
  • Duration: 2-8 seconds per sample
  • Sample Rate: 16kHz
  • Augmentation: Military-specific (time stretch, pitch shift, noise injection)
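
The exact augmentation pipeline is not released with the model, but a minimal illustration of the listed operations (time stretch, pitch shift, noise injection) using librosa could look like this; the ranges and noise level are placeholder values, not the ones used for MAD.

import numpy as np
import librosa

def augment(audio, sr=16000):
    # Time stretch: slight speed change without altering pitch
    audio = librosa.effects.time_stretch(audio, rate=np.random.uniform(0.9, 1.1))
    # Pitch shift: up to +/- 2 semitones
    audio = librosa.effects.pitch_shift(audio, sr=sr, n_steps=np.random.uniform(-2, 2))
    # Noise injection: low-level additive Gaussian noise
    audio = audio + 0.005 * np.random.randn(len(audio))
    return audio

audio, sr = librosa.load("military_audio.wav", sr=16000)
augmented = augment(audio, sr)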

Architecture

  • Base Model: Audio Spectrogram Transformer (AST)
  • Parameters: 86.2M total, 14.2M trainable (16.5%)
  • Input: Log-Mel spectrograms (1024 x 128)
  • Output: 7 military audio classes
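
The shapes and sizes above can be sanity-checked directly from the published checkpoint; a short sketch (the commented values are what the figures above imply):

import numpy as np
from transformers import ASTForAudioClassification, ASTFeatureExtractor

model = ASTForAudioClassification.from_pretrained(
    "Akashpaul123/tiny-ast-mad-military-audio-classifier"
)
feature_extractor = ASTFeatureExtractor.from_pretrained(
    "Akashpaul123/tiny-ast-mad-military-audio-classifier"
)

# One second of silence is enough to inspect the padded spectrogram shape
inputs = feature_extractor(np.zeros(16000), sampling_rate=16000, return_tensors="pt")
print(inputs["input_values"].shape)                       # expected: torch.Size([1, 1024, 128])
print(sum(p.numel() for p in model.parameters()) / 1e6)   # expected: ~86.2M total parameters
print(model.config.num_labels)                            # expected: 7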

Performance Metrics

  • Accuracy: 96.73%
  • F1-Macro: 96.84%
  • F1-Weighted: 96.74%
  • Precision: High across all classes
  • Recall: Balanced performance
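
The F1 scores above are presumably macro and weighted averages in the scikit-learn sense; a sketch of how they would be computed on the MAD test split (y_true and y_pred are placeholders for real labels and predictions):

from sklearn.metrics import accuracy_score, f1_score

# y_true / y_pred would come from running the model over the MAD test split
y_true = [0, 1, 2, 3, 4, 5, 6]   # placeholder ground-truth class IDs
y_pred = [0, 1, 2, 3, 4, 5, 6]   # placeholder predicted class IDs

print("Accuracy:     ", accuracy_score(y_true, y_pred))
print("F1 (macro):   ", f1_score(y_true, y_pred, average="macro"))
print("F1 (weighted):", f1_score(y_true, y_pred, average="weighted"))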

Hardware Requirements

Training

  • GPU: RTX 3060 (12GB VRAM) or similar
  • RAM: 16GB+ recommended
  • Storage: 50GB for dataset and models

Inference (Edge)

  • Device: Raspberry Pi 5 or similar ARM device
  • RAM: 2GB minimum
  • Inference Time: <200ms per sample
  • Power: <5W continuous operation

Limitations and Considerations

  • Domain-specific: Optimized for military audio contexts
  • Language: Primarily English communication samples
  • Environment: Trained on MAD dataset conditions
  • Real-time: Designed for batch processing, not streaming

Citation

If you use this model in your research, please cite:

@misc{tiny-ast-mad-2024,
  title={Tiny-AST Military Audio Classifier: Progressive Fine-tuning for Edge Deployment},
  author={Paul, Akash},
  year={2024},
  howpublished={Hugging Face Model Hub},
  url={https://huggingface.co/Akashpaul123/tiny-ast-mad-military-audio-classifier}
}

License

This model is licensed under the Apache 2.0 License.

Contact


Model trained as part of military audio surveillance research, with a focus on edge deployment and real-world robustness.
