Commit f405418 by wsntxxn · verified · 1 Parent(s): 3a72e8a

Update README.md

Files changed (1)
  1. README.md +64 -1
README.md CHANGED
@@ -3,4 +3,67 @@ license: apache-2.0
  language:
  - en
  ---
- [![arXiv](https://img.shields.io/badge/arXiv-2306.01533-brightgreen.svg?style=flat-square)](https://arxiv.org/abs/2306.01533)
+ [![arXiv](https://img.shields.io/badge/arXiv-2306.01533-brightgreen.svg?style=flat-square)](https://arxiv.org/abs/2306.01533)
+
+ # Usage
+ ```python
+ import torch
+ from transformers import AutoModel, PreTrainedTokenizerFast
+ import torchaudio
+
+
+ device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+
+ model = AutoModel.from_pretrained(
+     "wsntxxn/cnn14rnn-tempgru-audiocaps-captioning",
+     trust_remote_code=True
+ ).to(device)
+ tokenizer = PreTrainedTokenizerFast.from_pretrained(
+     "wsntxxn/audiocaps-simple-tokenizer"
+ )
+
+ wav, sr = torchaudio.load("/path/to/file.wav")
+ wav = torchaudio.functional.resample(wav, sr, model.config.sample_rate)
+ if wav.size(0) > 1:
+     wav = wav.mean(0).unsqueeze(0)
+
+ with torch.no_grad():
+     word_idxs = model(
+         audio=wav,
+         audio_length=[wav.size(1)],
+     )
+
+ caption = tokenizer.decode(word_idxs[0], skip_special_tokens=True)
+ print(caption)
+ ```
+ Without a temporal tag, the model makes the description as specific as possible.
+
+ You can also assign a temporal tag manually to control how specifically the caption describes temporal relationships:
+ ```python
+ with torch.no_grad():
+     word_idxs = model(
+         audio=wav,
+         audio_length=[wav.size(1)],
+         temporal_tag=[2],  # describe "sequential" if there are sequential events, otherwise use the most complex relationship
+     )
+ ```
+ The temporal tag is defined as:
+ |Temporal Tag|Definition|
+ |----:|-----:|
+ |0|Only 1 Event|
+ |1|Simultaneous Events|
+ |2|Sequential Events|
+ |3|More Complex Events|
+
+
+ # Citation
+ If you find the model useful, please cite this paper:
+ ```BibTeX
+ @inproceedings{xie2023enhance,
+     author = {Zeyu Xie and Xuenan Xu and Mengyue Wu and Kai Yu},
+     title = {Enhance Temporal Relations in Audio Captioning with Sound Event Detection},
+     year = 2023,
+     booktitle = {Proc. INTERSPEECH},
+     pages = {4179--4183},
+ }
+ ```
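
The integer values in the README's temporal tag table can be given names so that calls are easier to read. The sketch below is illustrative only: `TemporalTag` and `as_temporal_tag_arg` are hypothetical names that do not come from the model repository, and it assumes, based on the `temporal_tag=[2]` example in the README, that the argument is a list with one integer per audio clip.

```python
from enum import IntEnum


class TemporalTag(IntEnum):
    """Hypothetical names for the tag values listed in the README table."""
    SINGLE_EVENT = 0   # Only 1 Event
    SIMULTANEOUS = 1   # Simultaneous Events
    SEQUENTIAL = 2     # Sequential Events
    COMPLEX = 3        # More Complex Events


def as_temporal_tag_arg(tags):
    """Convert per-clip tags (names, ints, or enum members) to a list of plain ints."""
    result = []
    for tag in tags:
        if isinstance(tag, str):
            # Look up by name, case-insensitively; unknown names raise KeyError.
            result.append(int(TemporalTag[tag.upper()]))
        else:
            # Validate numeric values; anything outside 0-3 raises ValueError.
            result.append(int(TemporalTag(tag)))
    return result
```

With this helper, `as_temporal_tag_arg(["sequential"])` yields `[2]`, the same value passed as `temporal_tag` in the README example.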