|
--- |
|
pipeline_tag: other |
|
language: |
|
- multilingual |
|
- en |
|
- de |
|
- am |
|
- fr |
|
- bn |
|
- uz |
|
- pl |
|
- es |
|
- sw |
|
license: apache-2.0 |
|
--- |
|
|
|
# PWESuite-metric_learner |
|
|
|
This is a phonetic word embedding model based on PWESuite, as described in [PWESuite: Phonetic Word Embeddings and Tasks They Facilitate](https://aclanthology.org/2024.lrec-main.1168/). |
|
The metric learner model is trained so that distances in the embedding space mimic Panphon's phonetic distances.
|
The input representation is based on orthography (token_ort), IPA (token_ipa), or Panphon pronunciation vectors (panphon), which yields three models.
|
These models have been trained on all languages jointly. |
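
As a minimal sketch of the training objective (assuming an MSE loss over Euclidean distances between pairs; see the repository for the exact formulation):

```python
import torch
import torch.nn.functional as F

def metric_learning_loss(emb_a, emb_b, panphon_dist):
    # emb_a, emb_b: (batch, 300) embeddings of word pairs
    # panphon_dist: (batch,) target Panphon phonetic distances
    pred_dist = torch.norm(emb_a - emb_b, dim=1)
    # push embedding-space distances towards the phonetic distances
    return F.mse_loss(pred_dist, panphon_dist)
```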
|
|
|
## Instructions |
|
|
|
To use any of the three metric learner models, first set up the repository and download the weights:
|
```bash |
|
git clone https://github.com/zouharvi/pwesuite.git
cd pwesuite
mkdir -p computed/models
pip3 install -e .

# download the three models
wget https://huggingface.co/zouharvi/PWESuite-metric_learner/resolve/main/rnn_metric_learning_token_ort_all.pt -O computed/models/rnn_metric_learning_token_ort_all.pt
wget https://huggingface.co/zouharvi/PWESuite-metric_learner/resolve/main/rnn_metric_learning_token_ipa_all.pt -O computed/models/rnn_metric_learning_token_ipa_all.pt
wget https://huggingface.co/zouharvi/PWESuite-metric_learner/resolve/main/rnn_metric_learning_panphon_all.pt -O computed/models/rnn_metric_learning_panphon_all.pt
|
``` |
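
If you prefer the `huggingface_hub` client over `wget`, the same files can be downloaded like this (a sketch; the repository ID and filenames match the commands above):

```python
from huggingface_hub import hf_hub_download

for features in ["token_ort", "token_ipa", "panphon"]:
    hf_hub_download(
        repo_id="zouharvi/PWESuite-metric_learner",
        filename=f"rnn_metric_learning_{features}_all.pt",
        local_dir="computed/models",
    )
```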
|
|
|
Then, in Python, you can run [this example script](https://github.com/zouharvi/pwesuite/blob/master/scripts/50-use_metric_learner.py): |
|
```python |
|
|
|
from models.metric_learning.model import RNNMetricLearner
from models.metric_learning.preprocessor import preprocess_dataset_foreign
from main.utils import load_multi_data
import torch
import tqdm
import math

# load the multilingual word data and featurize a small sample with IPA tokens
data = load_multi_data(purpose_key="all")
data = preprocess_dataset_foreign(data[:10], features="token_ipa")

# 300-dimensional embeddings; the input feature size is inferred from the data
model = RNNMetricLearner(
    dimension=300,
    feature_size=data[0][0].shape[1],
)
model.load_state_dict(torch.load("computed/models/rnn_metric_learning_token_ipa_all.pt"))

# some cheap parallelization: embed the words in batches
BATCH_SIZE = 32
data_out = []
for i in tqdm.tqdm(range(math.ceil(len(data) / BATCH_SIZE))):
    # each item in `data` is a pair whose first element holds the features
    batch = [f for f, _ in data[i * BATCH_SIZE:(i + 1) * BATCH_SIZE]]
    data_out += list(model.forward(batch).detach().cpu().numpy())

# every word gets one 300-dimensional embedding
assert len(data) == len(data_out)
assert all(len(x) == 300 for x in data_out)
|
``` |
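
The resulting vectors can be used like any other word embeddings: phonetically similar words end up close together, so nearest neighbours can be found by cosine similarity. A small sketch continuing from the script above (the printed indices refer back to items in `data`):

```python
import numpy as np

embd = np.array(data_out)
# normalize rows so that dot products are cosine similarities
embd = embd / np.linalg.norm(embd, axis=1, keepdims=True)

# indices of the items phonetically closest to the first word
sims = embd @ embd[0]
print(np.argsort(-sims)[:5])
```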
|
|
|
You can also run inference on all the data and evaluate it:
|
```bash |
|
mkdir -p computed/embd/ |
|
python3 ./models/metric_learning/apply.py -l all -mp computed/models/rnn_metric_learning_token_ipa_all.pt -o computed/embd/rnn_metric_learning_token_ipa_all.pkl --features token_ipa |
|
python3 ./suite_evaluation/eval_all.py --embd computed/embd/rnn_metric_learning_token_ipa_all.pkl |
|
``` |
|
|
|
This gives you output like:
|
``` |
|
human_similarity: 0.6054 |
|
correlation: 0.8995 |
|
retrieval: 0.9158 |
|
analogy: 0.1128 |
|
rhyme: 0.6375 |
|
cognate: 0.6513 |
|
JSON1!{"human_similarity": 0.6053864496119294, "correlation": 0.8995336394813026, "retrieval": 0.9157905555555556, "analogy": 0.1127777777777778, "rhyme": 0.6374601910828025, "cognate": 0.6512651265126512, "overall": 0.6370356233370033} |
|
Score (overall): 0.6370 |
|
``` |
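
The `JSON1!` line is intended to be machine-readable. If you capture the evaluation output to a file, the scores can be parsed like this (a sketch assuming the output was saved to `eval.log`):

```python
import json

with open("eval.log") as f:
    for line in f:
        if line.startswith("JSON1!"):
            scores = json.loads(line[len("JSON1!"):])
            print(scores["overall"])
```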
|
|
|
|
|
## Training |
|
|
|
Training this model takes about an hour on a mid-tier GPU. |
|
See [scripts/03-train_metric_learning.sh](https://github.com/zouharvi/pwesuite/blob/master/scripts/03-train_metric_learning.sh) for the specific training command. |
|
Further description TODO. |
|
|
|
## Other |
|
|
|
Cite as: |
|
``` |
|
@inproceedings{zouhar-etal-2024-pwesuite, |
|
title = "{PWES}uite: Phonetic Word Embeddings and Tasks They Facilitate", |
|
author = "Zouhar, Vil{\'e}m and |
|
Chang, Kalvin and |
|
Cui, Chenxuan and |
|
Carlson, Nate B. and |
|
Robinson, Nathaniel Romney and |
|
Sachan, Mrinmaya and |
|
Mortensen, David R.", |
|
booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)", |
|
month = may, |
|
year = "2024", |
|
address = "Torino, Italia", |
|
publisher = "ELRA and ICCL", |
|
url = "https://aclanthology.org/2024.lrec-main.1168/", |
|
pages = "13344--13355", |
|
} |
|
``` |
|
|
|
Also available on arXiv: https://arxiv.org/abs/2304.02541