---
license: cc-by-4.0
datasets:
- ilyassmoummad/Xeno-Canto-6s-16khz
pipeline_tag: feature-extraction
tags:
- Bioacoustics
- pytorch
---
# ProtoCLR
This repository contains a CvT-13 [Convolutional Vision Transformer](https://arxiv.org/abs/2103.15808) model trained from scratch on the [Xeno-Canto dataset](https://huggingface.co/datasets/ilyassmoummad/Xeno-Canto-6s-16khz), specifically on 6-second audio segments sampled at 16 kHz. The model is trained on Mel spectrograms of bird sounds using ProtoCLR [(Prototypical Contrastive Loss)](https://arxiv.org/abs/2409.08589) for 300 epochs and can be used as a feature extractor for bird audio classification and related tasks.
## Files
- `cvt.py`: Defines the CvT-13 model architecture.
- `protoclr.pth`: Pre-trained model weights for ProtoCLR.
- `config/`: Configuration files for CvT-13 setup.
- `melspectrogram.py`: Contains the `MelSpectrogramProcessor` class, which converts audio waveforms into Mel spectrograms, the input format expected by the model.
## Setup
1. **Clone this repository**:
Clone the repository and navigate into the project directory:
```bash
git clone https://huggingface.co/ilyassmoummad/ProtoCLR
cd ProtoCLR/
```
2. **Install dependencies**:
Ensure you have the required Python packages, including `torch` and any other dependencies listed in `requirements.txt`.
```bash
pip install -r requirements.txt
```
## Usage
1. **Prepare the Audio**:
To ensure compatibility with the model, apply the following preprocessing steps to your audio files (a code sketch illustrating them is given after this list):
- **Mono Channel (Mandatory)**:
If the audio has multiple channels, convert it to a single mono channel by averaging the channels.
- **Sample Rate (Mandatory)**:
Resample the audio to a consistent sample rate of 16 kHz.
- **Padding (Recommended)**:
For audio files shorter than 6 seconds, pad with zeros or repeat the audio until it reaches a length of 6 seconds.
- **Chunking (Recommended)**:
For audio files longer than 6 seconds, split them into 6-second chunks so that each chunk matches the model's expected input length.
2. **Process the Audio**:
Use the `MelSpectrogramProcessor` (from `melspectrogram.py`) to transform the prepared audio into a Mel spectrogram, the input format expected by the model, as demonstrated in the example code below.
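The following is a minimal sketch of the preprocessing steps above (mono conversion, resampling to 16 kHz, and zero-padding short clips to 6 seconds). It assumes `torchaudio` is available for loading and resampling; the `prepare_waveform` helper and its constants are illustrative and not part of this repository:
```python
import torch
import torchaudio  # assumed audio backend; any loader that yields a 1-D 16 kHz tensor works

TARGET_SR = 16000                # the model expects 16 kHz audio
SEGMENT_SAMPLES = 6 * TARGET_SR  # 6-second segments

def prepare_waveform(file_path):
    """Load an audio file and return a mono, 16 kHz waveform padded to at least 6 seconds."""
    waveform, sr = torchaudio.load(file_path)          # shape: (channels, samples)
    waveform = waveform.mean(dim=0)                    # mono: average the channels
    if sr != TARGET_SR:
        waveform = torchaudio.functional.resample(waveform, sr, TARGET_SR)
    if waveform.shape[-1] < SEGMENT_SAMPLES:           # zero-pad short clips to 6 seconds
        waveform = torch.nn.functional.pad(waveform, (0, SEGMENT_SAMPLES - waveform.shape[-1]))
    return waveform
```
Recordings longer than 6 seconds can additionally be split into 6-second chunks, as recommended above; a sketch of that is given after the example code.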
## Example Code
The following example demonstrates loading, processing, and running inference on an audio file:
```python
import torch
from cvt import cvt13  # Import the CvT-13 model architecture
from melspectrogram import MelSpectrogramProcessor  # Import the Mel spectrogram processor

# Initialize the preprocessor and model
preprocessor = MelSpectrogramProcessor()
model = cvt13()

# Load ONE set of pre-trained weights (this repository provides `protoclr.pth`;
# the other checkpoints correspond to the baselines reported below):
# Cross-Entropy:  model.load_state_dict(torch.load("ce.pth", map_location="cpu")['encoder'])
# SimCLR (self-supervised contrastive learning):  model.load_state_dict(torch.load("simclr.pth", map_location="cpu"))
# SupCon (supervised contrastive learning):  model.load_state_dict(torch.load("supcon.pth", map_location="cpu"))
# ProtoCLR (supervised contrastive learning using prototypes):
model.load_state_dict(torch.load("protoclr.pth", map_location="cpu"))

# Optional: move the model to a GPU for faster inference, e.g. model = model.to('cuda')
model.eval()

# Load and preprocess a sample audio waveform
def load_waveform(file_path):
    # Replace this with your specific audio loading function.
    # For example, using torchaudio to load, convert to mono, and resample to 16 kHz:
    import torchaudio
    waveform, sr = torchaudio.load(file_path)
    waveform = waveform.mean(dim=0)  # average the channels to obtain a mono signal
    if sr != 16000:
        waveform = torchaudio.functional.resample(waveform, sr, 16000)
    return waveform

waveform = load_waveform("path/to/audio.wav")  # Load your audio file here
# Ensure the waveform is sampled at 16 kHz, then pad/chunk as needed to a 6-second length
input_tensor = preprocessor.process(waveform).unsqueeze(0)  # Add batch dimension

# Run the model on the preprocessed audio
with torch.no_grad():
    output = model(input_tensor)
print("Model output shape:", output.shape)
```
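For recordings longer than 6 seconds, the chunking recommendation from the Usage section can be implemented by splitting the waveform into 6-second segments and embedding them as one batch. This is an illustrative sketch, not part of the repository: it assumes `waveform` is a mono 16 kHz tensor and reuses the `preprocessor` and `model` objects from the example above, and it assumes `MelSpectrogramProcessor.process` returns a per-clip tensor without a batch dimension (as suggested by the `unsqueeze(0)` above).
```python
import torch

SEGMENT_SAMPLES = 6 * 16000  # 6 seconds at 16 kHz

def chunk_waveform(waveform, segment_samples=SEGMENT_SAMPLES):
    """Split a mono waveform into 6-second chunks, zero-padding the final chunk."""
    n_chunks = (waveform.shape[-1] + segment_samples - 1) // segment_samples
    padded_length = n_chunks * segment_samples
    waveform = torch.nn.functional.pad(waveform, (0, padded_length - waveform.shape[-1]))
    return waveform.reshape(n_chunks, segment_samples)

chunks = chunk_waveform(waveform)                               # (n_chunks, 96000)
batch = torch.stack([preprocessor.process(c) for c in chunks])  # one Mel spectrogram per chunk
with torch.no_grad():
    embeddings = model(batch)                                   # one embedding per chunk
```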
## Model Performance Comparison
The following table reports classification accuracy (in %) on one-shot and five-shot bird sound classification tasks, evaluated on several [soundscape datasets](https://zenodo.org/records/13994373) (PER, NES, UHH, HSN, SSW, SNE).
| Model | Parameters | PER | NES | UHH | HSN | SSW | SNE | Mean |
|---------------------------|------------|-------------|-------------|-------------|-------------|-------------|-------------|-------|
| Random Guessing | - | 0.75 | 1.12 | 3.70 | 5.26 | 1.04 | 1.78 | 2.22 |
| | | | | | | | | |
| **1-Shot Classification** | | | | | | | | |
| BirdAVES-biox-base | 90M | 7.41±1.0 | 26.4±2.3 | 13.2±3.1 | 9.84±3.5 | 8.74±0.6 | 14.1±3.1 | 13.2 |
| BirdAVES-bioxn-large | 300M | 7.59±0.8 | 27.2±3.6 | 13.7±2.9 | 12.5±3.6 | 10.0±1.4 | 14.5±3.2 | 14.2 |
| BioLingual | 28M | 6.21±1.1 | 37.5±2.9 | 17.8±3.5 | 17.6±5.1 | 22.5±4.0 | 26.4±3.4 | 21.3 |
| Perch | 80M | 9.10±5.3 | 42.4±4.9 | 19.8±5.0 | 26.7±9.8 | 22.3±3.3 | 29.1±5.9 | 24.9 |
| CE (Ours) | 19M | 9.55±1.5 | 41.3±3.6 | 19.7±4.7 | 25.2±5.7 | 17.8±1.4 | 31.5±5.4 | 24.2 |
| SimCLR (Ours) | 19M | 7.85±1.1 | 31.2±2.4 | 14.9±2.9 | 19.0±3.8 | 10.6±1.1 | 24.0±4.1 | 17.9 |
| SupCon (Ours) | 19M | 8.53±1.1 | 39.8±6.0 | 18.8±3.0 | 20.4±6.9 | 12.6±1.6 | 23.2±3.1 | 20.5 |
| ProtoCLR (Ours) | 19M | 9.23±1.6 | 38.6±5.1 | 18.4±2.3 | 21.2±7.3 | 15.5±2.3 | 25.8±5.2 | 21.4 |
| | | | | | | | | |
| **5-Shot Classification** | | | | | | | | |
| BirdAVES-biox-base | 90M | 11.6±0.8 | 39.7±1.8 | 22.5±2.4 | 22.1±3.3 | 16.1±1.7 | 28.3±2.3 | 23.3 |
| BirdAVES-bioxn-large | 300M | 15.0±0.9 | 42.6±2.7 | 23.7±3.8 | 28.4±2.4 | 18.3±1.8 | 27.3±2.3 | 25.8 |
| BioLingual | 28M | 13.6±1.3 | 65.2±1.4 | 31.0±2.9 | 34.3±3.5 | 43.9±0.9 | 49.9±2.3 | 39.6 |
| Perch | 80M | 21.2±1.2 | 71.7±1.5 | 39.5±3.0 | 52.5±5.9 | 48.0±1.9 | 59.7±1.8 | 48.7 |
| CE (Ours) | 19M | 21.4±1.3 | 69.2±1.8 | 35.6±3.4 | 48.2±5.5 | 39.9±1.1 | 57.5±2.3 | 45.3 |
| SimCLR (Ours) | 19M | 15.4±1.0 | 54.0±1.8 | 23.0±2.3 | 32.8±4.0 | 22.0±1.2 | 40.7±2.4 | 31.3 |
| SupCon (Ours) | 19M | 17.2±1.3 | 64.6±2.4 | 34.1±2.9 | 42.5±2.9 | 30.8±0.8 | 48.1±2.4 | 39.5 |
| ProtoCLR (Ours) | 19M | 19.2±1.1 | 67.9±2.8 | 36.1±4.3 | 48.0±4.3 | 34.6±2.3 | 48.6±2.8 | 42.4 |
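The few-shot scores above are obtained by using each model as a frozen feature extractor. As a rough illustration of this style of evaluation (not the exact protocol of the paper), a nearest-prototype classifier can be built on top of the extracted embeddings: average the support embeddings of each class into a prototype and assign every query to the most similar prototype. The tensor names below (`support_emb`, `support_labels`, `query_emb`) are placeholders for embeddings produced as in the example code above.
```python
import torch
import torch.nn.functional as F

def nearest_prototype_predict(support_emb, support_labels, query_emb):
    """Nearest class-centroid classification in the embedding space."""
    classes = support_labels.unique()
    # One prototype per class: the mean of that class's support embeddings
    prototypes = torch.stack([support_emb[support_labels == c].mean(dim=0) for c in classes])
    # Cosine similarity between every query embedding and every prototype
    sims = F.normalize(query_emb, dim=-1) @ F.normalize(prototypes, dim=-1).T
    return classes[sims.argmax(dim=-1)]  # predicted label per query
```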
For additional details, please see the [pre-print on arXiv](https://arxiv.org/abs/2409.08589) and the [official GitHub repository](https://github.com/ilyassmoummad/ProtoCLR).
## Citation
If you use our model in your research, please cite the following paper:
```bibtex
@misc{moummad2024dirlbs,
      title={Domain-Invariant Representation Learning of Bird Sounds},
      author={Ilyass Moummad and Romain Serizel and Emmanouil Benetos and Nicolas Farrugia},
      year={2024},
      eprint={2409.08589},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2409.08589},
}
```