KeiKinn committed
Commit d4be371 · 1 Parent(s): 74dd5ba

evaluation instruction
Files changed (7)
  1. .gitignore +142 -0
  2. README.md +78 -0
  3. eval.py +61 -0
  4. models_xin.py +68 -0
  5. requirements.txt +5 -0
  6. utils.py +100 -0
  7. wrapper.py +23 -0
.gitignore ADDED
@@ -0,0 +1,142 @@
+ # Users
+ *_.py
+ *.pth.tar
+ temp
+ slurm*
+ .envrc
+ __pycache__/*
+ outputs/*
+ templates/*
+ sample
+ .idea
+ .vscode
+ main.py
+ # Byte-compiled / optimized / DLL files
+ __pycache__/
+ *.py[cod]
+ *$py.class
+
+ # C extensions
+ *.so
+
+ # Distribution / packaging
+ .Python
+ build/
+ develop-eggs/
+ dist/
+ downloads/
+ eggs/
+ .eggs/
+ lib/
+ lib64/
+ parts/
+ sdist/
+ var/
+ wheels/
+ pip-wheel-metadata/
+ share/python-wheels/
+ *.egg-info/
+ .installed.cfg
+ *.egg
+ MANIFEST
+
+ # PyInstaller
+ # Usually these files are written by a python script from a template
+ # before PyInstaller builds the exe, so as to inject date/other infos into it.
+ *.manifest
+ *.spec
+
+ # Installer logs
+ pip-log.txt
+ pip-delete-this-directory.txt
+
+ # Unit test / coverage reports
+ htmlcov/
+ .tox/
+ .nox/
+ .coverage
+ .coverage.*
+ .cache
+ nosetests.xml
+ coverage.xml
+ *.cover
+ *.py,cover
+ .hypothesis/
+ .pytest_cache/
+
+ # Translations
+ *.mo
+ *.pot
+
+ # Django stuff:
+ *.log
+ local_settings.py
+ db.sqlite3
+ db.sqlite3-journal
+
+ # Flask stuff:
+ instance/
+ .webassets-cache
+
+ # Scrapy stuff:
+ .scrapy
+
+ # Sphinx documentation
+ docs/_build/
+
+ # PyBuilder
+ target/
+
+ # Jupyter Notebook
+ .ipynb_checkpoints
+
+ # IPython
+ profile_default/
+ ipython_config.py
+
+ # pyenv
+ .python-version
+
+ # pipenv
+ # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
+ # However, in case of collaboration, if having platform-specific dependencies or dependencies
+ # having no cross-platform support, pipenv may install dependencies that don't work, or not
+ # install all needed dependencies.
+ #Pipfile.lock
+
+ # PEP 582; used by e.g. github.com/David-OConnor/pyflow
+ __pypackages__/
+
+ # Celery stuff
+ celerybeat-schedule
+ celerybeat.pid
+
+ # SageMath parsed files
+ *.sage.py
+
+ # Environments
+ .env
+ .venv
+ env/
+ venv/
+ ENV/
+ env.bak/
+ venv.bak/
+
+ # Spyder project settings
+ .spyderproject
+ .spyproject
+
+ # Rope project settings
+ .ropeproject
+
+ # mkdocs documentation
+ /site
+
+ # mypy
+ .mypy_cache/
+ .dmypy.json
+ dmypy.json
+
+ # Pyre type checker
+ .pyre/
README.md CHANGED
@@ -2,6 +2,84 @@ This repo includes the official PyTorch checkpoint of *ParaCLAP – Towards a ge
 
 ## Abstract
 Contrastive language-audio pretraining (CLAP) has recently emerged as a method for making audio analysis more generalisable. Specifically, CLAP-style models are able to ‘answer’ a diverse set of language queries, extending the capabilities of audio models beyond a closed set of labels. However, CLAP relies on a large set of (audio, query) pairs for pretraining. While such sets are available for general audio tasks, like captioning or sound event detection, there are no datasets with matched audio and text queries for computational paralinguistic (CP) tasks. As a result, the community relies on generic CLAP models trained for general audio with limited success. In the present study, we explore training considerations for ParaCLAP, a CLAP-style model suited to CP, including a novel process for creating audio-language queries. We demonstrate its effectiveness on a set of computational paralinguistic tasks, where it is shown to surpass the performance of open-source state-of-the-art models.
+
+ ## Instructions
+ Before evaluation, we recommend cloning this repo from Hugging Face or [GitHub](https://github.com/KeiKinn/ParaCLAP), since the example below imports `models_xin.py` and `utils.py`.
+ ### Evaluation
+ ```python
+ import os
+ import torch
+ import librosa
+ from transformers import logging
+ from transformers import AutoTokenizer
+ from models_xin import CLAP
+ from utils import compute_similarity
+
+
+ if __name__ == '__main__':
+     logging.set_verbosity_error()
+     ckpt = torch.hub.load_state_dict_from_url(
+         url="https://huggingface.co/KeiKinn/paraclap/resolve/main/best.pth.tar?download=true",
+         map_location="cpu",
+         check_hash=True,
+     )
+
+     text_model = 'bert-base-uncased'
+     audio_model = 'audeering/wav2vec2-large-robust-12-ft-emotion-msp-dim'
+
+     device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
+
+     candidates = ['happy', 'sad', 'surprise', 'angry']  # feel free to adapt these to your needs
+     wavpath = '[Waveform path]'  # path to a single-channel waveform file
+
+     waveform, sample_rate = librosa.load(wavpath, sr=16000)
+     x = torch.Tensor(waveform)
+
+     tokenizer = AutoTokenizer.from_pretrained(text_model)
+
+     candidate_tokens = tokenizer.batch_encode_plus(
+         candidates,
+         padding=True,
+         truncation=True,
+         return_tensors='pt'
+     )
+
+     model = CLAP(
+         speech_name=audio_model,
+         text_name=text_model,
+         embedding_dim=768,
+     )
+
+     model.load_state_dict(ckpt)
+     model.to(device)
+     print('Checkpoint loaded')
+     model.eval()
+
+     with torch.no_grad():
+         z = model(
+             x.unsqueeze(0).to(device),
+             candidate_tokens.to(device)  # keep the token tensors on the same device as the model
+         )
+
+     similarity = compute_similarity(z[2], z[0], z[1])
+     prediction = similarity.T.argmax(dim=1)
+
+     result = candidates[prediction]
+ ```
+
+ ## Citation Info
+ ParaCLAP has been accepted for presentation at Interspeech 2024.
+ ```bibtex
+ @inproceedings{Jing24_PTA,
+   title = {ParaCLAP – Towards a general language-audio model for computational paralinguistic tasks},
+   author = {Xin Jing and Andreas Triantafyllopoulos and Björn Schuller},
+   year = {2024},
+   booktitle = {Interspeech 2024},
+   pages = {1155--1159},
+   doi = {10.21437/Interspeech.2024-1315},
+   issn = {2958-1796},
+ }
+ ```
 ---
 license: cc-by-nc-nd-4.0
 language:
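The README example keeps the model's raw outputs in a tuple, so the argument order in `compute_similarity(z[2], z[0], z[1])` is easy to misread: `CLAP.forward` returns `(text_emb, speech_emb, logit_scale.exp())`. Below is a minimal sketch of the same scoring step with explicit, hypothetical names (`score_candidates` is not part of the repo); it also exposes per-candidate probabilities.

```python
# Sketch only (not part of the commit): the scoring that eval.py performs, written
# with explicit names. text_emb, speech_emb and scale are the three values returned
# by CLAP.forward; candidates is the list of text queries.
import torch.nn.functional as F

def score_candidates(text_emb, speech_emb, scale, candidates):
    text_emb = F.normalize(text_emb, dim=-1)      # (n_text, d)
    speech_emb = F.normalize(speech_emb, dim=-1)  # (n_audio, d)
    logits = scale * speech_emb @ text_emb.T      # (n_audio, n_text)
    probs = logits.softmax(dim=-1)                # per-clip distribution over candidates
    best = logits.argmax(dim=-1)                  # index of the best candidate per clip
    return [candidates[i] for i in best.tolist()], probs
```

With the variables from the example above, `score_candidates(z[0], z[1], z[2], candidates)` should reproduce `result` for the single input clip.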
eval.py ADDED
@@ -0,0 +1,61 @@
+ import os
+ import torch
+ from transformers import logging
+ from transformers import AutoTokenizer
+ from wrapper import EvalWrapper
+ from models_xin import CLAP
+ from utils import compute_similarity
+ import librosa
+
+
+ if __name__ == '__main__':
+     logging.set_verbosity_error()
+     ckpt = torch.hub.load_state_dict_from_url(
+         url="https://huggingface.co/KeiKinn/paraclap/resolve/main/best.pth.tar?download=true",
+         map_location="cpu",
+         check_hash=True,
+     )
+
+     text_model = 'bert-base-uncased'
+     audio_model = 'audeering/wav2vec2-large-robust-12-ft-emotion-msp-dim'
+
+     device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
+
+     candidates = ['happy', 'sad', 'surprise', 'angry']  # feel free to adapt these to your needs
+     wavpath = '[Waveform path]'  # path to a single-channel waveform file
+
+     waveform, sample_rate = librosa.load(wavpath, sr=16000)
+     x = torch.Tensor(waveform)
+
+     tokenizer = AutoTokenizer.from_pretrained(text_model)
+
+     candidate_tokens = tokenizer.batch_encode_plus(
+         candidates,
+         padding=True,
+         truncation=True,
+         return_tensors='pt'
+     )
+
+     model = CLAP(
+         speech_name=audio_model,
+         text_name=text_model,
+         embedding_dim=768,
+     )
+
+     model.load_state_dict(ckpt)
+     model.to(device)
+     print('Checkpoint loaded')
+     model.eval()
+
+     with torch.no_grad():
+         z = model(
+             x.unsqueeze(0).to(device),
+             candidate_tokens.to(device)  # keep the tokens on the same device as the model
+         )
+
+     similarity = compute_similarity(z[2], z[0], z[1])
+     prediction = similarity.T.argmax(dim=1)
+
+     result = candidates[prediction]
+
+     print(result)
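eval.py scores a single file. A hedged sketch of how one might extend it to several clips is shown below; it is a hypothetical continuation that assumes the setup from the script above (`model`, `device`, `candidates`, `candidate_tokens`, `compute_similarity`, `librosa`, `torch`) has already run, and the wav paths are placeholders.

```python
# Hypothetical continuation of eval.py (not in the commit): score several clips
# against the same candidate set, tokenising the candidates only once.
wav_paths = ['clip_01.wav', 'clip_02.wav']  # placeholder paths
tokens = candidate_tokens.to(device)        # reuse the tokenised candidates for every clip

for path in wav_paths:
    waveform, _ = librosa.load(path, sr=16000)
    x = torch.Tensor(waveform).unsqueeze(0).to(device)
    with torch.no_grad():
        text_emb, speech_emb, scale = model(x, tokens)
    sim = compute_similarity(scale, speech_emb, text_emb)  # (1, n_candidates)
    print(path, candidates[sim.argmax(dim=1).item()])
```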
models_xin.py ADDED
@@ -0,0 +1,68 @@
+ import numpy as np
+ import torch
+ import torch.nn as nn
+ import torch.nn.functional as F
+ from transformers import (
+     AutoModel,
+     Wav2Vec2Model,
+ )
+
+ class Projection(torch.nn.Module):
+     def __init__(self, d_in: int, d_out: int, p: float = 0.5) -> None:
+         super().__init__()
+         self.linear1 = torch.nn.Linear(d_in, d_out, bias=False)
+         self.linear2 = torch.nn.Linear(d_out, d_out, bias=False)
+         self.layer_norm = torch.nn.LayerNorm(d_out)
+         self.drop = torch.nn.Dropout(p)
+
+     def forward(self, x: torch.Tensor) -> torch.Tensor:
+         embed1 = self.linear1(x)
+         embed2 = self.drop(self.linear2(F.gelu(embed1)))
+         embeds = self.layer_norm(embed1 + embed2)
+         return embeds
+
+
+ class SpeechEncoder(torch.nn.Module):
+     def __init__(self, model_name):
+         super().__init__()
+         self.model_name = model_name
+         self.base = Wav2Vec2Model.from_pretrained(self.model_name)
+         self.hidden_size = self.base.config.hidden_size
+
+     def forward(self, x):
+         x = self.base(x)['last_hidden_state']
+         x = x.mean(1)  # mean-pool over the time dimension
+         return x
+
+
+ class TextEncoder(torch.nn.Module):
+     def __init__(self, model_name: str) -> None:
+         super().__init__()
+         self.base = AutoModel.from_pretrained(model_name)
+
+     def forward(self, x):
+         out = self.base(**x)[0]
+         out = out[:, 0, :].detach()  # get CLS token output
+         return out
+
+
+ class CLAP(torch.nn.Module):
+     def __init__(self, speech_name: str, text_name: str, embedding_dim: int = 1024):
+         super().__init__()
+
+         self.audio_branch = SpeechEncoder(model_name=speech_name)
+
+         self.text_branch = TextEncoder(model_name=text_name)
+         self.audio_projection = Projection(self.audio_branch.hidden_size, embedding_dim)
+         self.text_projection = Projection(self.text_branch.base.config.hidden_size, embedding_dim)
+
+         self.logit_scale = torch.nn.Parameter(torch.ones([]) * np.log(1 / 0.07))  # learnable temperature, initialised as in CLIP
+
+     def forward(self, audio, text):
+         speech_emb = self.audio_branch(audio)
+         text_emb = self.text_branch(text)
+
+         speech_emb = self.audio_projection(speech_emb)
+         text_emb = self.text_projection(text_emb)
+
+         return text_emb, speech_emb, self.logit_scale.exp()
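As a quick sanity check on the architecture above, here is a hypothetical smoke test (not part of the commit) that assumes the same backbones used in eval.py; with `embedding_dim=768`, the forward pass should return 768-dimensional projections for both branches plus the exponentiated logit scale.

```python
# Hypothetical smoke test (not in the commit): verify the shapes CLAP.forward returns.
# Runs with randomly initialised projection layers, i.e. without loading the checkpoint.
import torch
from transformers import AutoTokenizer
from models_xin import CLAP

model = CLAP(
    speech_name='audeering/wav2vec2-large-robust-12-ft-emotion-msp-dim',
    text_name='bert-base-uncased',
    embedding_dim=768,
).eval()
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

audio = torch.zeros(1, 2 * 16000)  # two seconds of silence at 16 kHz, shape (batch, samples)
tokens = tokenizer(['this person is feeling happy.'], return_tensors='pt', padding=True)

with torch.no_grad():
    text_emb, speech_emb, scale = model(audio, tokens)

print(text_emb.shape, speech_emb.shape, scale.item())  # (1, 768), (1, 768), scalar temperature
```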
requirements.txt ADDED
@@ -0,0 +1,5 @@
+ audformat
+ audmetric
+ audtorch
+ torch
+ transformers==4.25.1
utils.py ADDED
@@ -0,0 +1,105 @@
+ import torch
+ import torch.nn.functional as F
+ import collections
+ import re
+
+ # Used by default_collate below (as defined in torch.utils.data._utils.collate)
+ np_str_obj_array_pattern = re.compile(r'[SaUO]')
+ default_collate_err_msg_format = "default_collate: batch must contain tensors, numpy arrays, numbers, dicts or lists; found {}"
+
+ def compute_similarity(logit_scale, audio_embeddings, text_embeddings):
+     r"""Compute similarity between text and audio embeddings"""
+     audio_embeddings = audio_embeddings/torch.norm(audio_embeddings, dim=-1, keepdim=True)
+     text_embeddings = text_embeddings/torch.norm(text_embeddings, dim=-1, keepdim=True)
+
+     similarity = logit_scale*text_embeddings @ audio_embeddings.T
+     return similarity.T
+
+ def compute_logit(logit_scale, audio_embeddings, text_embeddings):
+     logits_per_audio = logit_scale * audio_embeddings @ text_embeddings.T
+     logits_per_text = logit_scale * text_embeddings @ audio_embeddings.T
+     return logits_per_audio, logits_per_text
+
+ def laion_compute_similarity(logit_scale, audio_embeddings, text_embeddings):
+     r"""Compute similarity between text and audio embeddings"""
+     audio_embeddings = F.normalize(audio_embeddings, dim=-1)
+     text_embeddings = F.normalize(text_embeddings, dim=-1)
+
+     similarity = logit_scale*audio_embeddings @ text_embeddings.T
+     return similarity
+
+ def freeze_branch_parameters(named_parameters, branch_name, freeze_flag):
+     branch_parameters = [
+         p
+         for n, p in named_parameters
+         if branch_name in n
+     ]
+     if freeze_flag:
+         print(f"Freezing {branch_name.capitalize()} parameters.")
+         for param in branch_parameters:
+             param.requires_grad = False
+
+ def format_emotion(emotion):
+     if emotion == 'no_agreement':
+         return 'there is no clear emotion.'
+     else:
+         return f'this person is feeling {emotion}.'
+
+
+ def preprocess_text(text_queries, tokenizer):
+     r"""Load list of class labels and return tokenized text"""
+     token_keys = ['input_ids', 'token_type_ids', 'attention_mask']
+     tokenized_texts = []
+     for ttext in text_queries:
+         tok = tokenizer.encode_plus(
+             text=ttext, add_special_tokens=True, max_length=77, padding='max_length', return_tensors="pt")
+         for key in token_keys:
+             tok[key] = tok[key].reshape(-1).cuda()  # note: assumes a CUDA device is available
+         tokenized_texts.append(tok)
+     return default_collate(tokenized_texts)
+
+ def default_collate(batch):
+     r"""Puts each data field into a tensor with outer dimension batch size"""
+     elem = batch[0]
+     elem_type = type(elem)
+     if isinstance(elem, torch.Tensor):
+         out = None
+         if torch.utils.data.get_worker_info() is not None:
+             # If we're in a background process, concatenate directly into a
+             # shared memory tensor to avoid an extra copy
+             numel = sum([x.numel() for x in batch])
+             storage = elem.storage()._new_shared(numel)
+             out = elem.new(storage)
+         return torch.stack(batch, 0, out=out)
+     elif elem_type.__module__ == 'numpy' and elem_type.__name__ != 'str_' \
+             and elem_type.__name__ != 'string_':
+         if elem_type.__name__ == 'ndarray' or elem_type.__name__ == 'memmap':
+             # array of string classes and object
+             if np_str_obj_array_pattern.search(elem.dtype.str) is not None:
+                 raise TypeError(
+                     default_collate_err_msg_format.format(elem.dtype))
+
+             return default_collate([torch.as_tensor(b) for b in batch])
+         elif elem.shape == ():  # scalars
+             return torch.as_tensor(batch)
+     elif isinstance(elem, float):
+         return torch.tensor(batch, dtype=torch.float64)
+     elif isinstance(elem, int):
+         return torch.tensor(batch)
+     elif isinstance(elem, str):
+         return batch
+     elif isinstance(elem, collections.abc.Mapping):
+         return {key: default_collate([d[key] for d in batch]) for key in elem}
+     elif isinstance(elem, tuple) and hasattr(elem, '_fields'):  # namedtuple
+         return elem_type(*(default_collate(samples) for samples in zip(*batch)))
+     elif isinstance(elem, collections.abc.Sequence):
+         # check to make sure that the elements in batch have consistent size
+         it = iter(batch)
+         elem_size = len(next(it))
+         if not all(len(elem) == elem_size for elem in it):
+             raise RuntimeError(
+                 'each element in list of batch should be of equal size')
+         transposed = zip(*batch)
+         return [default_collate(samples) for samples in transposed]
+
+     raise TypeError(default_collate_err_msg_format.format(elem_type))
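For reference, a small illustration (not in the commit) of the shape convention `compute_similarity` uses: it takes the already-exponentiated scale plus unnormalised embeddings and returns an (n_audio, n_text) matrix.

```python
# Illustration only: compute_similarity normalises both inputs and returns (n_audio, n_text).
import torch
from utils import compute_similarity

audio_emb = torch.randn(2, 768)   # two clips
text_emb = torch.randn(4, 768)    # four candidate queries
scale = torch.tensor(1.0)

sim = compute_similarity(scale, audio_emb, text_emb)
print(sim.shape)           # torch.Size([2, 4])
print(sim.argmax(dim=1))   # best candidate index for each clip
```

Note that eval.py passes the model outputs in the order `(scale, text_emb, speech_emb)`; because both arguments are normalised symmetrically, that simply transposes the result, which is why eval.py applies `.T` before the `argmax`.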
wrapper.py ADDED
@@ -0,0 +1,23 @@
+ import os
+
+ class EvalWrapper:
+     def __init__(self, dataset_name):
+         self.name = dataset_name.lower()
+         self.evaluate_map = {
+             'iemocap': 'evaluation.evaluate_iemo',
+             'ravdess': 'evaluation.evaluate_ravdess',
+             'cremad-d': 'evaluation.evaluate_cremad',
+             'tess': 'evaluation.evaluate_tess',
+             'aibo': 'evaluation.evaluate_aibo'
+         }
+
+     def set_eval(self):
+         # Get the module path dynamically
+         module_path = self.evaluate_map.get(self.name)
+         if not module_path:
+             supported_datasets = ', '.join(self.evaluate_map.keys())
+             raise ValueError(f"Unsupported dataset name: {self.name}.\nSupported datasets are: {supported_datasets}")
+
+         # Import the evaluate function dynamically
+         evaluate = __import__(module_path, fromlist=['evaluate']).evaluate
+         return self.name, evaluate
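`EvalWrapper` only maps a dataset name to a dotted module path and imports it lazily; the `evaluation.*` modules it references are not included in this commit. A hedged usage sketch:

```python
# Illustration only: resolve a dataset name to its evaluation function.
# Unknown names raise ValueError with the supported list; since the `evaluation`
# package is not part of this commit, expect an ImportError unless it is
# available alongside these files.
from wrapper import EvalWrapper

try:
    name, evaluate = EvalWrapper('iemocap').set_eval()
except (ValueError, ImportError) as err:
    print(err)
```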