|
--- |
|
license: apache-2.0 |
|
language: |
|
- en |
|
--- |
|
[![arXiv](https://img.shields.io/badge/arXiv-2306.01533-brightgreen.svg?style=flat-square)](https://arxiv.org/abs/2306.01533) |
|
|
|
# Usage |
|
```python |
|
import torch |
|
from transformers import AutoModel, PreTrainedTokenizerFast |
|
import torchaudio |
|
|
|
|
|
device = torch.device("cuda" if torch.cuda.is_available() else "cpu") |
|
|
|
model = AutoModel.from_pretrained( |
|
"wsntxxn/cnn14rnn-tempgru-audiocaps-captioning", |
|
trust_remote_code=True |
|
).to(device) |
|
tokenizer = PreTrainedTokenizerFast.from_pretrained( |
|
"wsntxxn/audiocaps-simple-tokenizer" |
|
) |
|
|
|
wav, sr = torchaudio.load("/path/to/file.wav") |
|
wav = torchaudio.functional.resample(wav, sr, model.config.sample_rate) |
|
if wav.size(0) > 1: |
|
wav = wav.mean(0).unsqueeze(0) |
|
|
|
with torch.no_grad(): |
|
word_idxs = model( |
|
audio=wav, |
|
audio_length=[wav.size(1)], |
|
) |
|
|
|
caption = tokenizer.decode(word_idxs[0], skip_special_tokens=True) |
|
print(caption) |
|
``` |
|
This will make the description as specific as possible. |
|
|
|
You can also manually assign a temporal tag to control the specificity of temporal relationship description: |
|
```python |
|
with torch.no_grad(): |
|
word_idxs = model( |
|
audio=wav, |
|
audio_length=[wav.size(1)], |
|
temporal_tag=[2], # desribe "sequential" if there are sequential events, otherwise use the most complex relationship |
|
) |
|
``` |
|
The temporal tag is defined as: |
|
|Temporal Tag|Definition| |
|
|----:|-----:| |
|
|0|Only 1 Event| |
|
|1|Simultaneous Events| |
|
|2|Sequential Events| |
|
|3|More Complex Events| |
|
|
|
|
|
# Citation |
|
If you find the model useful, please cite this paper: |
|
```BibTeX |
|
@inproceedings{xie2023enhance, |
|
author = {Zeyu Xie and Xuenan Xu and Mengyue Wu and Kai Yu}, |
|
title = {Enhance Temporal Relations in Audio Captioning with Sound Event Detection}, |
|
year = 2023, |
|
booktitle = {Proc. INTERSPEECH}, |
|
pages = {4179--4183}, |
|
} |
|
``` |