---
pipeline_tag: other
language:
- multilingual
- en
- de
- am
- fr
- bn
- uz
- pl
- es
- sw
license: apache-2.0
---

# PWESuite-metric_learner

This is a phonetic word embedding model based on PWESuite, as described in [PWESuite: Phonetic Word Embeddings and Tasks They Facilitate](https://aclanthology.org/2024.lrec-main.1168/).
The metric learner is trained so that distances in the embedding space mimic Panphon's phonetic distances.
The input representation is either orthography (`token_ort`), IPA (`token_ipa`), or Panphon pronunciation vectors (`panphon`), which yields three models.
These models have been trained on all languages jointly.

## Instructions

To use any of the three metric learner models, first set up the repository and download the checkpoints:
```bash
git clone https://github.com/zouharvi/pwesuite.git
cd pwesuite
mkdir -p computed/models
pip3 install -e .

# download the three models
wget https://huggingface.co/zouharvi/PWESuite-metric_learner/resolve/main/rnn_metric_learning_token_ort_all.pt -O computed/models/rnn_metric_learning_token_ort_all.pt
wget https://huggingface.co/zouharvi/PWESuite-metric_learner/resolve/main/rnn_metric_learning_token_ipa_all.pt -O computed/models/rnn_metric_learning_token_ipa_all.pt
wget https://huggingface.co/zouharvi/PWESuite-metric_learner/resolve/main/rnn_metric_learning_panphon_all.pt -O computed/models/rnn_metric_learning_panphon_all.pt
```

Then, in Python, you can run [this example script](https://github.com/zouharvi/pwesuite/blob/master/scripts/50-use_metric_learner.py):
```python
from models.metric_learning.model import RNNMetricLearner
from models.metric_learning.preprocessor import preprocess_dataset_foreign
from main.utils import load_multi_data
import torch
import tqdm
import math

# load the multilingual data and preprocess the first 10 entries as IPA tokens
data = load_multi_data(purpose_key="all")
data = preprocess_dataset_foreign(data[:10], features="token_ipa")

model = RNNMetricLearner(
    dimension=300,
    feature_size=data[0][0].shape[1],
)
model.load_state_dict(torch.load("computed/models/rnn_metric_learning_token_ipa_all.pt"))

# embed the data in batches
BATCH_SIZE = 32
data_out = []
for i in tqdm.tqdm(range(math.ceil(len(data) / BATCH_SIZE))):
    batch = [f for f, _ in data[i * BATCH_SIZE:(i + 1) * BATCH_SIZE]]
    data_out += list(
        model.forward(batch).detach().cpu().numpy()
    )

assert len(data) == len(data_out)
assert all([len(x) == 300 for x in data_out])
```

You can also run inference on all the data and evaluate the resulting embeddings:
```bash
mkdir -p computed/embd/
python3 ./models/metric_learning/apply.py -l all -mp computed/models/rnn_metric_learning_token_ipa_all.pt -o computed/embd/rnn_metric_learning_token_ipa_all.pkl --features token_ipa
python3 ./suite_evaluation/eval_all.py --embd computed/embd/rnn_metric_learning_token_ipa_all.pkl
```

This should produce output like:
```
human_similarity: 0.6054
correlation: 0.8995
retrieval: 0.9158
analogy: 0.1128
rhyme: 0.6375
cognate: 0.6513
JSON1!{"human_similarity": 0.6053864496119294, "correlation": 0.8995336394813026, "retrieval": 0.9157905555555556, "analogy": 0.1127777777777778, "rhyme": 0.6374601910828025, "cognate": 0.6512651265126512, "overall": 0.6370356233370033}
Score (overall): 0.6370
```

## Training

Training this model takes about an hour on a mid-tier GPU.
See [scripts/03-train_metric_learning.sh](https://github.com/zouharvi/pwesuite/blob/master/scripts/03-train_metric_learning.sh) for the specific training command.
Further description TODO.
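Until then, here is a rough, self-contained sketch of the underlying idea: an encoder is trained so that distances between its embeddings match precomputed phonetic distances between word pairs (in PWESuite, distances derived from Panphon). This is only an illustration, not the repository's actual training code; the `ToyMetricLearner` class, the offline `target_dist` tensor, and the choice of Euclidean distance with an MSE loss are assumptions made for this sketch (the real model is `RNNMetricLearner` in `models/metric_learning/model.py`).

```python
# Illustrative sketch only -- not the repository's training loop.
# Idea: learn an encoder whose embedding distances mimic precomputed
# phonetic distances (e.g. Panphon feature edit distances) between words.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMetricLearner(nn.Module):
    """Toy stand-in for an RNN encoder over per-phoneme feature vectors."""
    def __init__(self, feature_size, dimension=300):
        super().__init__()
        self.rnn = nn.LSTM(
            feature_size, dimension // 2,
            batch_first=True, bidirectional=True,
        )

    def forward(self, x):
        # x: (batch, seq_len, feature_size) -> (batch, dimension)
        out, _ = self.rnn(x)
        return out.mean(dim=1)

def train_step(model, optimizer, words_a, words_b, target_dist):
    # target_dist: the phonetic distances the embedding space should mimic,
    # assumed to be computed offline (e.g. with Panphon) for each word pair
    emb_a, emb_b = model(words_a), model(words_b)
    pred_dist = F.pairwise_distance(emb_a, emb_b)
    loss = F.mse_loss(pred_dist, target_dist)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```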
## Other

Cite as:
```bibtex
@inproceedings{zouhar-etal-2024-pwesuite,
    title = "{PWES}uite: Phonetic Word Embeddings and Tasks They Facilitate",
    author = "Zouhar, Vil{\'e}m and Chang, Kalvin and Cui, Chenxuan and Carlson, Nate B. and Robinson, Nathaniel Romney and Sachan, Mrinmaya and Mortensen, David R.",
    booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
    month = may,
    year = "2024",
    address = "Torino, Italia",
    publisher = "ELRA and ICCL",
    url = "https://aclanthology.org/2024.lrec-main.1168/",
    pages = "13344--13355",
}
```

Available also on arXiv: https://arxiv.org/abs/2304.02541