---
pipeline_tag: other
language:
- multilingual
- en
- de
- am
- fr
- bn
- uz
- pl
- es
- sw
license: apache-2.0
---
# PWESuite-metric_learner
This is a phonetic word embedding model based on PWESuite, as described in [PWESuite: Phonetic Word Embeddings and Tasks They Facilitate](https://aclanthology.org/2024.lrec-main.1168/).
The metric learner is trained so that distances in its embedding space mimic Panphon's phonetic distances.
The input representation is either orthography (`token_ort`), IPA (`token_ipa`), or Panphon pronunciation vectors (`panphon`), which yields three models.
All three models were trained on all languages jointly.
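Concretely, the encoder is trained so that distances between word embeddings approximate Panphon's articulatory distances. A minimal sketch of such a metric-learning objective (the function name, signature, and MSE loss here are illustrative assumptions, not the repository's exact code):

```python
import torch.nn.functional as F

def metric_learning_loss(embd_a, embd_b, panphon_dist):
    # embd_a, embd_b: batches of word embeddings from the encoder
    # panphon_dist: precomputed Panphon distances for the word pairs
    # (all names here are hypothetical, for illustration only)
    pred_dist = F.pairwise_distance(embd_a, embd_b)
    # pull embedding-space distances towards the phonetic distances
    return F.mse_loss(pred_dist, panphon_dist)
```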
## Instructions
To use any of the three metric learner models, first set up the repository and download the checkpoints:
```bash
git clone https://github.com/zouharvi/pwesuite.git
cd pwesuite
mkdir -p computed/models
pip3 install -e .
# download the three models into computed/models/
wget https://huggingface.co/zouharvi/PWESuite-metric_learner/resolve/main/rnn_metric_learning_token_ort_all.pt -P computed/models/
wget https://huggingface.co/zouharvi/PWESuite-metric_learner/resolve/main/rnn_metric_learning_token_ipa_all.pt -P computed/models/
wget https://huggingface.co/zouharvi/PWESuite-metric_learner/resolve/main/rnn_metric_learning_panphon_all.pt -P computed/models/
```
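A broken download will only surface later when loading a checkpoint, so it can help to verify the files first. A quick optional check (since the models are loaded via `load_state_dict` below, each file should be a plain state dict):

```python
import torch

# load each checkpoint on CPU and report how many parameter tensors it holds
for name in [
    "rnn_metric_learning_token_ort_all.pt",
    "rnn_metric_learning_token_ipa_all.pt",
    "rnn_metric_learning_panphon_all.pt",
]:
    state = torch.load(f"computed/models/{name}", map_location="cpu")
    print(name, "->", len(state), "entries")
```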
Then, in Python, you can run [this example script](https://github.com/zouharvi/pwesuite/blob/master/scripts/50-use_metric_learner.py):
```python
from models.metric_learning.model import RNNMetricLearner
from models.metric_learning.preprocessor import preprocess_dataset_foreign
from main.utils import load_multi_data
import torch
import tqdm
import math

# load the multilingual dataset and preprocess a small sample
data = load_multi_data(purpose_key="all")
data = preprocess_dataset_foreign(data[:10], features="token_ipa")

model = RNNMetricLearner(
    dimension=300,
    feature_size=data[0][0].shape[1],
)
model.load_state_dict(torch.load("computed/models/rnn_metric_learning_token_ipa_all.pt"))

# batch the inputs for cheap parallelization
BATCH_SIZE = 32
data_out = []
for i in tqdm.tqdm(range(math.ceil(len(data) / BATCH_SIZE))):
    batch = [f for f, _ in data[i * BATCH_SIZE:(i + 1) * BATCH_SIZE]]
    data_out += list(model.forward(batch).detach().cpu().numpy())

# every word is embedded into a 300-dimensional vector
assert len(data) == len(data_out)
assert all(len(x) == 300 for x in data_out)
```
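The resulting vectors can be compared directly. For example, continuing from the script above, the cosine similarity between the first two embedded words:

```python
import numpy as np

# data_out holds one 300-dimensional numpy vector per word
a, b = data_out[0], data_out[1]
cos_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(f"cosine similarity: {cos_sim:.3f}")
```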
You can also run inference on all the data and evaluate it:
```bash
mkdir -p computed/embd/
python3 ./models/metric_learning/apply.py \
    -l all \
    -mp computed/models/rnn_metric_learning_token_ipa_all.pt \
    -o computed/embd/rnn_metric_learning_token_ipa_all.pkl \
    --features token_ipa
python3 ./suite_evaluation/eval_all.py \
    --embd computed/embd/rnn_metric_learning_token_ipa_all.pkl
```
This gives you an output like:
```
human_similarity: 0.6054
correlation: 0.8995
retrieval: 0.9158
analogy: 0.1128
rhyme: 0.6375
cognate: 0.6513
JSON1!{"human_similarity": 0.6053864496119294, "correlation": 0.8995336394813026, "retrieval": 0.9157905555555556, "analogy": 0.1127777777777778, "rhyme": 0.6374601910828025, "cognate": 0.6512651265126512, "overall": 0.6370356233370033}
Score (overall): 0.6370
```
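The `JSON1!` line is intended to be machine-readable. If you save the evaluation output to a file, the scores can be recovered with a few lines of Python (a small sketch; the `eval.log` filename is just an assumption):

```python
import json

# scan a saved evaluation log for the machine-readable score line
with open("eval.log") as f:
    for line in f:
        if line.startswith("JSON1!"):
            scores = json.loads(line[len("JSON1!"):])
            print(scores["overall"])  # 0.6370...
```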
## Training
Training this model takes about an hour on a mid-tier GPU.
See [scripts/03-train_metric_learning.sh](https://github.com/zouharvi/pwesuite/blob/master/scripts/03-train_metric_learning.sh) for the specific training command.
Further description TODO.
## Other
Cite as:
```
@inproceedings{zouhar-etal-2024-pwesuite,
    title = "{PWES}uite: Phonetic Word Embeddings and Tasks They Facilitate",
    author = "Zouhar, Vil{\'e}m and
      Chang, Kalvin and
      Cui, Chenxuan and
      Carlson, Nate B. and
      Robinson, Nathaniel Romney and
      Sachan, Mrinmaya and
      Mortensen, David R.",
    booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
    month = may,
    year = "2024",
    address = "Torino, Italia",
    publisher = "ELRA and ICCL",
    url = "https://aclanthology.org/2024.lrec-main.1168/",
    pages = "13344--13355",
}
```
Also available on arXiv: https://arxiv.org/abs/2304.02541