---
pipeline_tag: other
language:
- multilingual
- en
- de
- am
- fr
- bn
- uz
- pl
- es
- sw
license: apache-2.0
---

# PWESuite-metric_learner

This is a phonetic word embedding model based on PWESuite, as described in [PWESuite: Phonetic Word Embeddings and Tasks They Facilitate](https://aclanthology.org/2024.lrec-main.1168/).
The metric learner is trained so that distances in the embedding space mimic Panphon's phonetic distances.
The input representation is either orthography (`token_ort`), IPA (`token_ipa`), or Panphon pronunciation vectors (`panphon`), which yields three models.
These models have been trained on all languages jointly.

## Instructions

To use any of the three metric learner models, first set up the repository and download the checkpoints:
```bash
git clone https://github.com/zouharvi/pwesuite.git
cd pwesuite
mkdir -p computed/models
pip3 install -e .

# download the three models
wget https://huggingface.co/zouharvi/PWESuite-metric_learner/resolve/main/rnn_metric_learning_token_ort_all.pt -O computed/models/rnn_metric_learning_token_ort_all.pt
wget https://huggingface.co/zouharvi/PWESuite-metric_learner/resolve/main/rnn_metric_learning_token_ipa_all.pt -O computed/models/rnn_metric_learning_token_ipa_all.pt
wget https://huggingface.co/zouharvi/PWESuite-metric_learner/resolve/main/rnn_metric_learning_panphon_all.pt -O computed/models/rnn_metric_learning_panphon_all.pt
```

Then, in Python, you can run [this example script](https://github.com/zouharvi/pwesuite/blob/master/scripts/50-use_metric_learner.py):
```python
from models.metric_learning.model import RNNMetricLearner
from models.metric_learning.preprocessor import preprocess_dataset_foreign
from main.utils import load_multi_data
import torch
import tqdm
import math

# load the multilingual data and preprocess the first 10 entries as IPA tokens
data = load_multi_data(purpose_key="all")
data = preprocess_dataset_foreign(data[:10], features="token_ipa")

model = RNNMetricLearner(
    dimension=300,
    feature_size=data[0][0].shape[1],
)
model.load_state_dict(torch.load("computed/models/rnn_metric_learning_token_ipa_all.pt"))

# embed the data in batches
BATCH_SIZE = 32
data_out = []
for i in tqdm.tqdm(range(math.ceil(len(data) / BATCH_SIZE))):
    batch = [f for f, _ in data[i * BATCH_SIZE:(i + 1) * BATCH_SIZE]]
    data_out += list(
        model.forward(batch).detach().cpu().numpy()
    )

assert len(data) == len(data_out)
assert all([len(x) == 300 for x in data_out])
```

You can also run inference on all the data and evaluate the resulting embeddings:
```bash
mkdir -p computed/embd/
python3 ./models/metric_learning/apply.py -l all -mp computed/models/rnn_metric_learning_token_ipa_all.pt -o computed/embd/rnn_metric_learning_token_ipa_all.pkl --features token_ipa
python3 ./suite_evaluation/eval_all.py --embd computed/embd/rnn_metric_learning_token_ipa_all.pkl
```

This should produce output like:
```
human_similarity: 0.6054
correlation: 0.8995
retrieval: 0.9158
analogy: 0.1128
rhyme: 0.6375
cognate: 0.6513
JSON1!{"human_similarity": 0.6053864496119294, "correlation": 0.8995336394813026, "retrieval": 0.9157905555555556, "analogy": 0.1127777777777778, "rhyme": 0.6374601910828025, "cognate": 0.6512651265126512, "overall": 0.6370356233370033}
Score (overall): 0.6370
```

## Training

Training this model takes about an hour on a mid-tier GPU.
See [scripts/03-train_metric_learning.sh](https://github.com/zouharvi/pwesuite/blob/master/scripts/03-train_metric_learning.sh) for the specific training command.
Further description TODO.
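Until then, here is a rough, self-contained sketch of the underlying idea: an encoder is trained so that distances between its embeddings match precomputed phonetic distances between word pairs (in PWESuite, distances derived from Panphon). This is only an illustration, not the repository's actual training code; the `ToyMetricLearner` class, the offline `target_dist` tensor, and the choice of Euclidean distance with an MSE loss are assumptions made for this sketch (the real model is `RNNMetricLearner` in `models/metric_learning/model.py`).

```python
# Illustrative sketch only -- not the repository's training loop.
# Idea: learn an encoder whose embedding distances mimic precomputed
# phonetic distances (e.g. Panphon feature edit distances) between words.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMetricLearner(nn.Module):
    """Toy stand-in for an RNN encoder over per-phoneme feature vectors."""
    def __init__(self, feature_size, dimension=300):
        super().__init__()
        self.rnn = nn.LSTM(
            feature_size, dimension // 2,
            batch_first=True, bidirectional=True,
        )

    def forward(self, x):
        # x: (batch, seq_len, feature_size) -> (batch, dimension)
        out, _ = self.rnn(x)
        return out.mean(dim=1)

def train_step(model, optimizer, words_a, words_b, target_dist):
    # target_dist: the phonetic distances the embedding space should mimic,
    # assumed to be computed offline (e.g. with Panphon) for each word pair
    emb_a, emb_b = model(words_a), model(words_b)
    pred_dist = F.pairwise_distance(emb_a, emb_b)
    loss = F.mse_loss(pred_dist, target_dist)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```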
## Other

Cite as:
```bibtex
@inproceedings{zouhar-etal-2024-pwesuite,
    title = "{PWES}uite: Phonetic Word Embeddings and Tasks They Facilitate",
    author = "Zouhar, Vil{\'e}m and Chang, Kalvin and Cui, Chenxuan and Carlson, Nate B. and Robinson, Nathaniel Romney and Sachan, Mrinmaya and Mortensen, David R.",
    booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
    month = may,
    year = "2024",
    address = "Torino, Italia",
    publisher = "ELRA and ICCL",
    url = "https://aclanthology.org/2024.lrec-main.1168/",
    pages = "13344--13355",
}
```

Available also on arXiv: https://arxiv.org/abs/2304.02541