|
--- |
|
pipeline_tag: other |
|
language: |
|
- multilingual |
|
- en |
|
- de |
|
- am |
|
- fr |
|
- bn |
|
- uz |
|
- pl |
|
- es |
|
- sw |
|
license: apache-2.0 |
|
--- |
|
|
|
# PWESuite-metric_learner |
|
|
|
This is a phonetic word embedding model based on PWESuite, as described in [PWESuite: Phonetic Word Embeddings and Tasks They Facilitate](https://aclanthology.org/2024.lrec-main.1168/). |
|
The metric learner model is trained so that distances in the embedding space mimic Panphon's phonetic distances.
|
The input representation is based on orthography (token_ort), IPA (token_ipa), or Panphon pronunciation vectors (panphon), which yields three models.
|
These models have been trained on all languages jointly. |
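
As a minimal sketch of the training objective (assuming an MSE loss over Euclidean distances between pairs; see the repository for the exact formulation):

```python
import torch
import torch.nn.functional as F

def metric_learning_loss(emb_a, emb_b, panphon_dist):
    # emb_a, emb_b: (batch, 300) embeddings of word pairs
    # panphon_dist: (batch,) target Panphon phonetic distances
    pred_dist = torch.norm(emb_a - emb_b, dim=1)
    # push embedding-space distances towards the phonetic distances
    return F.mse_loss(pred_dist, panphon_dist)
```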
|
|
|
## Instructions |
|
|
|
To use any of the three metric learner models, first set up the repository and download the weights:
|
```bash |
|
git clone https://github.com/zouharvi/pwesuite.git
cd pwesuite
mkdir -p computed/models
pip3 install -e .

# download the three models
wget https://huggingface.co/zouharvi/PWESuite-metric_learner/resolve/main/rnn_metric_learning_token_ort_all.pt -O computed/models/rnn_metric_learning_token_ort_all.pt
wget https://huggingface.co/zouharvi/PWESuite-metric_learner/resolve/main/rnn_metric_learning_token_ipa_all.pt -O computed/models/rnn_metric_learning_token_ipa_all.pt
wget https://huggingface.co/zouharvi/PWESuite-metric_learner/resolve/main/rnn_metric_learning_panphon_all.pt -O computed/models/rnn_metric_learning_panphon_all.pt
|
``` |
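
If you prefer the `huggingface_hub` client over `wget`, the same files can be downloaded like this (a sketch; the repository ID and filenames match the commands above):

```python
from huggingface_hub import hf_hub_download

for features in ["token_ort", "token_ipa", "panphon"]:
    hf_hub_download(
        repo_id="zouharvi/PWESuite-metric_learner",
        filename=f"rnn_metric_learning_{features}_all.pt",
        local_dir="computed/models",
    )
```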
|
|
|
Then, in Python, you can run [this example script](https://github.com/zouharvi/pwesuite/blob/master/scripts/50-use_metric_learner.py): |
|
```python |
|
|
|
from models.metric_learning.model import RNNMetricLearner
from models.metric_learning.preprocessor import preprocess_dataset_foreign
from main.utils import load_multi_data
import torch
import tqdm
import math

# load the multilingual word data and featurize a small sample with IPA tokens
data = load_multi_data(purpose_key="all")
data = preprocess_dataset_foreign(data[:10], features="token_ipa")

# 300-dimensional embeddings; the input feature size is inferred from the data
model = RNNMetricLearner(
    dimension=300,
    feature_size=data[0][0].shape[1],
)
model.load_state_dict(torch.load("computed/models/rnn_metric_learning_token_ipa_all.pt"))

# some cheap parallelization: embed the words in batches
BATCH_SIZE = 32
data_out = []
for i in tqdm.tqdm(range(math.ceil(len(data) / BATCH_SIZE))):
    # each item in `data` is a pair whose first element holds the features
    batch = [f for f, _ in data[i * BATCH_SIZE:(i + 1) * BATCH_SIZE]]
    data_out += list(model.forward(batch).detach().cpu().numpy())

# every word gets one 300-dimensional embedding
assert len(data) == len(data_out)
assert all(len(x) == 300 for x in data_out)
|
``` |
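
The resulting vectors can be used like any other word embeddings: phonetically similar words end up close together, so nearest neighbours can be found by cosine similarity. A small sketch continuing from the script above (the printed indices refer back to items in `data`):

```python
import numpy as np

embd = np.array(data_out)
# normalize rows so that dot products are cosine similarities
embd = embd / np.linalg.norm(embd, axis=1, keepdims=True)

# indices of the items phonetically closest to the first word
sims = embd @ embd[0]
print(np.argsort(-sims)[:5])
```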
|
|
|
You can also run inference on all the data and evaluate it:
|
```bash |
|
mkdir -p computed/embd/ |
|
python3 ./models/metric_learning/apply.py -l all -mp computed/models/rnn_metric_learning_token_ipa_all.pt -o computed/embd/rnn_metric_learning_token_ipa_all.pkl --features token_ipa |
|
python3 ./suite_evaluation/eval_all.py --embd computed/embd/rnn_metric_learning_token_ipa_all.pkl |
|
``` |
|
|
|
This gives you output like:
|
``` |
|
human_similarity: 0.6054 |
|
correlation: 0.8995 |
|
retrieval: 0.9158 |
|
analogy: 0.1128 |
|
rhyme: 0.6375 |
|
cognate: 0.6513 |
|
JSON1!{"human_similarity": 0.6053864496119294, "correlation": 0.8995336394813026, "retrieval": 0.9157905555555556, "analogy": 0.1127777777777778, "rhyme": 0.6374601910828025, "cognate": 0.6512651265126512, "overall": 0.6370356233370033} |
|
Score (overall): 0.6370 |
|
``` |
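
The `JSON1!` line is intended to be machine-readable. If you capture the evaluation output to a file, the scores can be parsed like this (a sketch assuming the output was saved to `eval.log`):

```python
import json

with open("eval.log") as f:
    for line in f:
        if line.startswith("JSON1!"):
            scores = json.loads(line[len("JSON1!"):])
            print(scores["overall"])
```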
|
|
|
|
|
## Training |
|
|
|
Training this model takes about an hour on a mid-tier GPU. |
|
See [scripts/03-train_metric_learning.sh](https://github.com/zouharvi/pwesuite/blob/master/scripts/03-train_metric_learning.sh) for the specific training command. |
|
Further description TODO. |
|
|
|
## Other |
|
|
|
Cite as: |
|
``` |
|
@inproceedings{zouhar-etal-2024-pwesuite, |
|
title = "{PWES}uite: Phonetic Word Embeddings and Tasks They Facilitate", |
|
author = "Zouhar, Vil{\'e}m and |
|
Chang, Kalvin and |
|
Cui, Chenxuan and |
|
Carlson, Nate B. and |
|
Robinson, Nathaniel Romney and |
|
Sachan, Mrinmaya and |
|
Mortensen, David R.", |
|
booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)", |
|
month = may, |
|
year = "2024", |
|
address = "Torino, Italia", |
|
publisher = "ELRA and ICCL", |
|
url = "https://aclanthology.org/2024.lrec-main.1168/", |
|
pages = "13344--13355", |
|
} |
|
``` |
|
|
|
Also available on arXiv: https://arxiv.org/abs/2304.02541