---
base_model:
  - facebook/wav2vec2-large-xlsr-53
language:
  - en
license:
  - cc-by-nc-4.0
pipeline_tag: audio-classification
tags:
  - audio
  - classification
  - audio-classification
  - Wav2Vec2
  - sentiment
  - earnings conference calls
  - transformers

widget:
- src: negative.mp3
  example_title: Negative speech sample
  output:
  - label: positive
    score: 0.00
  - label: neutral
    score: 0.01
  - label: negative
    score: 0.99

- src: neutral.mp3
  example_title: Neutral speech sample
  output:
  - label: positive
    score: 0.00
  - label: neutral
    score: 0.99
  - label: negative
    score: 0.00

- src: positive.mp3
  example_title: Positive speech sample
  output:
  - label: positive
    score: 0.94
  - label: neutral
    score: 0.06
  - label: negative
    score: 0.00

---

# FinVoc2Vec
We introduce FinVoc2Vec, a vocal tone classifier designed for real-world corporate disclosures. The model is trained in two stages. In the first stage, we apply a self-supervised pre-training procedure that adapts the base model to the acoustic characteristics of disclosure environments, using a sample of 500,000 unlabeled sentences of conference call speech. In the second stage, we apply a supervised fine-tuning procedure that enables the model to learn representations of human-labeled vocal tone. For this stage, we construct a speech corpus of 5,000 audio recordings of linguistically neutral sentences from conference calls and manually label each sentence with its perceived vocal tone: positive, negative, or neutral.

## Example using a demo dataset
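The snippet below loads the model and feature extractor from the Hub, runs a forward pass over the test split of the demo dataset, and collects the per-sample class probabilities:
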
```python
import torch
from datasets import load_dataset
from transformers import Wav2Vec2FeatureExtractor, AutoModel
import numpy as np

device = "cuda" if torch.cuda.is_available() else "cpu"

# load model and feature extractor
model = AutoModel.from_pretrained("waiv/FinVoc2Vec", trust_remote_code=True).to(device)
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("waiv/FinVoc2Vec")

# load dataset
demo_dataset = load_dataset("waiv/FinVoc2Vec_demo")

arrays = [demo['audio']['array'] for demo in demo_dataset['test']]

# extract features
features = feature_extractor(
    arrays,
    sampling_rate=feature_extractor.sampling_rate,
    padding=True,
    truncation=False)

inputs = torch.tensor(np.array(features['input_values']), dtype=torch.float32).to(device)
attention_mask = torch.tensor(np.array(features['attention_mask']), dtype=torch.long).to(device)

prob_dict = {}
with torch.no_grad():
    model_output = model(inputs, attention_mask=attention_mask)
    logits = model_output['logits'].to(torch.float32).to('cpu')
    probs = torch.nn.functional.softmax(logits, dim=1).numpy()
    
    label_to_id = model.config.label2id
    for i, sample_id in enumerate(demo_dataset['test']['id']):
        prob_dict[sample_id] = {'prob_negative': probs[i, label_to_id['negative']],
                                'prob_neutral': probs[i, label_to_id['neutral']],
                                'prob_positive': probs[i, label_to_id['positive']]}
```
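
If you only need a single predicted label per sample, a short follow-up (reusing `model`, `probs`, and `demo_dataset` from the snippet above) could look like this:

```python
# map each sample to its most probable vocal-tone label
id_to_label = model.config.id2label
for i, sample_id in enumerate(demo_dataset['test']['id']):
    pred = int(probs[i].argmax())
    print(f"{sample_id}: {id_to_label[pred]} (p={probs[i, pred]:.2f})")
```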

## Example using audio files
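This example sketches a batch-inference pipeline for local audio files: a padding collator, a preprocessing function that downmixes each file to mono and resamples it to 16 kHz, and a `DataLoader` loop that computes class probabilities per batch:
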
```python
import torch
from torch.utils.data import DataLoader
from datasets import load_dataset
from dataclasses import dataclass
from typing import Dict, List, Optional, Union
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Processor, AutoModel
import torchaudio

device = "cuda" if torch.cuda.is_available() else "cpu"

@dataclass
class DataCollatorWithPadding:

    processor: Union[Wav2Vec2Processor, Wav2Vec2FeatureExtractor]
    padding: Union[bool, str] = True
    max_length: Optional[int] = None
    pad_to_multiple_of: Optional[int] = None

    def __call__(self,
                 features: List[Dict[str, Union[List[int], torch.Tensor]]]
                 ) -> Dict[str, torch.Tensor]:
        
        input_features = [{"input_values": feature["input_values"]} for feature in features]

        # trunc and pad max lengths, get attention mask
        batch = self.processor.pad(
            input_features,
            padding=self.padding,
            max_length=self.max_length,
            pad_to_multiple_of=self.pad_to_multiple_of,
            return_tensors="pt")
        return batch


def preprocess_audio(batch: Dict,
                     feature_extractor: Wav2Vec2FeatureExtractor = None,
                     max_duration: Optional[float] = 20.0):
    
    target_sr = feature_extractor.sampling_rate # 16kHz
    audio_arrays = []

    for path in batch['path']:
        audio_array, sampling_rate = torchaudio.load(path)

        # downmix to mono if multiple channels exist
        if audio_array.shape[0] > 1:
            audio_array = torch.mean(audio_array, dim=0, keepdim=True)

        # resample audio to the feature extractor's sampling rate if needed
        if sampling_rate != target_sr:
            resampler = torchaudio.transforms.Resample(sampling_rate, target_sr)
            audio_array = resampler(audio_array)
        audio_arrays.append(audio_array.squeeze().numpy())
    
    # set params for feature extractor
    max_length = int(target_sr*max_duration) if max_duration is not None else None

    # use feature extractor to normalize inputs and trunc data
    result = feature_extractor(
        audio_arrays,
        sampling_rate=target_sr,
        max_length=max_length,
        truncation=bool(max_length))
    return result

# load model
model = AutoModel.from_pretrained("waiv/FinVoc2Vec", trust_remote_code=True).to(device)

# load feature extractor
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("waiv/FinVoc2Vec")

# load dataset
# NOTE: the dataset must contain a 'path' column with the path to each audio file
test_dataset = load_dataset(r'path/to/dataset')

# preprocess audio data
test_dataset = test_dataset.map(
    preprocess_audio,
    batch_size=1000,
    batched=True,
    num_proc=4,
    fn_kwargs={'feature_extractor': feature_extractor,
                'max_duration': 20.0})

data_collator = DataCollatorWithPadding(feature_extractor)

data_loader = DataLoader(
    test_dataset, 
    batch_size=16,
    shuffle=False,
    collate_fn=data_collator,
    num_workers=4)

label_to_id = model.config.label2id

with torch.no_grad():
    for batch in data_loader:

        # torch.Tensor.to is not in-place, so reassign the results
        inputs = batch['input_values'].to(device)
        attention_mask = batch['attention_mask'].to(device)

        model_output = model(inputs, attention_mask=attention_mask)

        logits = model_output['logits'].to(torch.float32).to('cpu')
        probs = torch.nn.functional.softmax(logits, dim=1).numpy()

        # probabilities for the current batch
        dict_probs = {'prob_negative': probs[:, label_to_id['negative']],
                      'prob_neutral': probs[:, label_to_id['neutral']],
                      'prob_positive': probs[:, label_to_id['positive']]}
```
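
Note that `dict_probs` is overwritten on every iteration, so after the loop it only holds the last batch. A minimal sketch (reusing `data_loader`, `model`, `device`, and `label_to_id` from above) that accumulates probabilities across all batches:

```python
import numpy as np

batch_probs = []
with torch.no_grad():
    for batch in data_loader:
        inputs = batch['input_values'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        logits = model(inputs, attention_mask=attention_mask)['logits']
        batch_probs.append(
            torch.nn.functional.softmax(logits.to(torch.float32), dim=1).cpu().numpy())

# one row per audio file, columns indexed via label_to_id
all_probs = np.concatenate(batch_probs, axis=0)
prob_negative = all_probs[:, label_to_id['negative']]
```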

## Register for autoclass
To register the model and configuration with your local auto classes, use the following code:
```python
from transformers import AutoConfig, AutoModel

# download model and config
finvoc2vec_config = AutoConfig.from_pretrained("waiv/FinVoc2Vec", trust_remote_code=True)
finvoc2vec_model = AutoModel.from_pretrained("waiv/FinVoc2Vec", trust_remote_code=True)

# register model and config for the auto classes
# (the classes come from the remote code, so reference them via type(...))
AutoConfig.register("finvoc2vec", type(finvoc2vec_config))
AutoModel.register(type(finvoc2vec_config), type(finvoc2vec_model))
```
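
`FinVoc2VecConfig` and `FinVoc2Vec` are defined in the remote code downloaded from the Hub rather than in `transformers` itself, which is why the snippet references them via `type(...)` on the loaded instances instead of importing them by name. Once registered, the `finvoc2vec` model type resolves through the local auto classes for the remainder of the session.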

## Further resources
Check the 🤗 Hugging Face [Wav2Vec2](https://huggingface.co/docs/transformers/model_doc/wav2vec2) model description for additional resources and configurations.

## License
- This model is a derivative work based on Wav2Vec2 (Apache-2.0)
- This model is licensed under the Creative Commons Attribution-NonCommercial 4.0 license (CC BY-NC 4.0)

## Paper
- [Listen Closely: Measuring Vocal Tone in Corporate Disclosures](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4307178)

## BibTeX
```
@article{ewertz2024,
  title={Listen Closely: Measuring Vocal Tone in Corporate Disclosures},
  author={Ewertz, Jonas and Knickrehm, Charlotte and Nienhaus, Martin and Reichmann, Doron},
  year={2024},
  note={Available at SSRN: \url{https://ssrn.com/abstract=4307178}}
}
```