rishitdagli committed on
Commit f9509e5 · verified · 1 Parent(s): 5a9dbea

Add model metadata

Files changed (1):
  1. README.md +114 -0

README.md ADDED
---
metrics:
- MFCC-DTW
- ZCR
- Chroma Score
- Spectral Score
model-index:
- name: SEE-2-SOUND
  results:
  - task:
      type: spatial-audio-generation
      name: Spatial Audio Generation
    dataset:
      type: rishitdagli/see-2-sound-eval
      name: SEE-2-SOUND Evaluation Dataset
    metrics:
    - type: MFCC-DTW
      value: 0.03 × 10^-3
      name: AViTAR Marginal Scene Guidance - Mel-Frequency Cepstral Coefficient - Dynamic Time Warping
    - type: ZCR
      value: 0.95
      name: AViTAR Marginal Scene Guidance - Zero Crossing Rate
    - type: Chroma
      value: 0.77
      name: Chroma Feature
    - type: Spectral Score
      value: 0.95
      name: AViTAR Marginal Scene Guidance - Spectral Score
    source:
      name: arXiv
      url: https://arxiv.org/abs/2406.06612
tags:
- vision
- audio
- spatial audio
- audio generation
- music
- art
---

<h2>SEE-2-SOUND🔊: Zero-Shot Spatial Environment-to-Spatial Sound</h2>

[**Rishit Dagli**](https://rishitdagli.com/)<sup>1</sup> · [**Shivesh Prakash**](https://shivesh777.github.io/)<sup>1</sup> · [**Rupert Wu**](https://www.cs.toronto.edu/~rupert/)<sup>1</sup> · [**Houman Khosravani**](https://scholar.google.ca/citations?user=qzhk98YAAAAJ&hl=en)<sup>1,2,3</sup>

<sup>1</sup>University of Toronto&emsp;&emsp;&emsp;&emsp;<sup>2</sup>Temerty Centre for Artificial Intelligence Research and Education in Medicine&emsp;&emsp;&emsp;&emsp;<sup>3</sup>Sunnybrook Research Institute

| | | | |
|---|---|---|---|
| [![Paper PDF](https://img.shields.io/badge/arXiv-See2Sound-red)](https://arxiv.org/abs/2406.06612) | [![Project Page](https://img.shields.io/badge/Project_Page-See2Sound-green)](https://see2sound.github.io) | [![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/rishitdagli/see-2-sound) | [![Hugging Face Paper](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Paper-yellow)](https://huggingface.co/papers/2406.06612) |

This work presents **SEE-2-SOUND**, a method to generate spatial audio from images, animated images, and videos to accompany the visual content. Check out our [website](https://see2sound.github.io) to view some results of this work.

![teaser](https://raw.githubusercontent.com/see2sound/see2sound/main/assets/teaser.png)

These checkpoints are meant to be used with our code: [SEE-2-SOUND](https://github.com/see2sound/see2sound).

## Installation

First, install the pip package and download these checkpoints (needs Git LFS):

```sh
pip install -e git+https://github.com/see2sound/see2sound.git#egg=see2sound

# Git LFS must be set up so the checkpoint files are actually fetched
git lfs install
git clone https://huggingface.co/rishitdagli/see-2-sound
cd see-2-sound
```
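
The checkpoint files are large, so it can be worth confirming that Git LFS actually fetched them rather than leaving small pointer files behind. This is an optional sanity check, not part of the official setup; it only assumes the `.pth` files referenced in the config below live somewhere under the cloned folder:

```py
from pathlib import Path

# Optional check: real checkpoints are hundreds of MB, while un-fetched
# Git LFS pointer files are only a few hundred bytes.
for ckpt in Path(".").rglob("*.pth"):
    size_mb = ckpt.stat().st_size / 1e6
    status = "OK" if size_mb > 1 else "looks like an LFS pointer, re-run `git lfs pull`"
    print(f"{ckpt}: {size_mb:.1f} MB ({status})")
```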

View the full installation instructions, as well as tips on dependencies, in the [repository README](https://github.com/see2sound/see2sound/tree/main?tab=readme-ov-file#installation).

## Running the Models

Now we can set up the configuration. Create a file called `config.yaml`:

```yaml
codi_encoder: 'codi/codi_encoder.pth'
codi_text: 'codi/codi_text.pth'
codi_audio: 'codi/codi_audio.pth'
codi_video: 'codi/codi_video.pth'

sam: 'sam/sam.pth'
# H, L or B in decreasing performance
sam_size: 'H'

depth: '/depth/depth.pth'
# L, B, or S in decreasing performance
depth_size: 'L'

download: False

# Change to True if your GPU has < 40 GB vRAM
low_mem: False
fp16: False
gpu: True
steps: 500
num_audios: 3
prompt: ''
verbose: True
```
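
If you prefer to adjust these settings programmatically (for example, enabling the low-memory path on a smaller GPU), the file is plain YAML and can be edited with PyYAML. This is just a convenience sketch, not part of the SEE-2-SOUND API; the key names are taken from the config above:

```py
import yaml  # PyYAML

# Load the config written above
with open("config.yaml") as f:
    config = yaml.safe_load(f)

# Example tweaks for a GPU with less than 40 GB of VRAM
config["low_mem"] = True
config["fp16"] = True

# Write the modified config back out
with open("config.yaml", "w") as f:
    yaml.safe_dump(config, f, default_flow_style=False, sort_keys=False)
```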

Now we can run inference:

```py
import see2sound

config_file_path = "config.yaml"

# Load and set up the models listed in the config
model = see2sound.See2Sound(config_path=config_file_path)
model.setup()

# Generate spatial audio for an image and save it to a file
model.run(path="test.png", output_path="test.wav")
```
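
Since `setup()` only needs to run once, the same loaded model can be reused across many inputs. Here is a minimal batching sketch, assuming a folder of PNG images called `examples/`; the folder name and output naming scheme are placeholders, not part of the repository:

```py
from pathlib import Path

# Reuses the `model` object created in the snippet above; setup() does not
# need to be called again for each input.
for image_path in Path("examples").glob("*.png"):
    model.run(path=str(image_path), output_path=f"{image_path.stem}_spatial.wav")
```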

## More Information

Feel free to take a look at the full [documentation](https://github.com/see2sound/see2sound/blob/main/README.md) for extra information and tips on running the model.