hiepph committed
Commit · 3828ced
1 Parent(s): 44dbcc8
docs: Add text_tokenize.py example
examples/README.md +32 -24
examples/README.md CHANGED
@@ -1,31 +1,39 @@
 # torchMoji examples
 
-## Initialization
-[create_twitter_vocab.py](create_twitter_vocab.py)
-Create a new vocabulary from a tsv file.
-
-[tokenize_dataset.py](tokenize_dataset.py)
-Tokenize a given dataset using the prebuilt vocabulary.
-
-[vocab_extension.py](vocab_extension.py)
-Extend the given vocabulary using dataset-specific words.
-
-[dataset_split.py](dataset_split.py)
+## Initialization
+[create_twitter_vocab.py](create_twitter_vocab.py)
+Create a new vocabulary from a tsv file.
+
+[tokenize_dataset.py](tokenize_dataset.py)
+Tokenize a given dataset using the prebuilt vocabulary.
+
+[vocab_extension.py](vocab_extension.py)
+Extend the given vocabulary using dataset-specific words.
+
+[dataset_split.py](dataset_split.py)
 Split a given dataset into training, validation and testing.
-
-## Use pretrained model/architecture
-[score_texts_emojis.py](score_texts_emojis.py)
-Use torchMoji to score texts for emoji distribution.
 
-[encode_texts.py](encode_texts.py)
+## Use pretrained model/architecture
+[score_texts_emojis.py](score_texts_emojis.py)
+Use torchMoji to score texts for emoji distribution.
+
+[text_emojize.py](text_emojize.py)
+Use torchMoji to output emoji visualization from a single text input (mapped from `emoji_overview.png`)
+
+```sh
+python examples/text_emojize.py --text "I love mom's cooking\!"
+# => I love mom's cooking! π π π π ❤
+```
+
+[encode_texts.py](encode_texts.py)
 Use torchMoji to encode the text into 2304-dimensional feature vectors for further modeling/analysis.
 
 ## Transfer learning
-[finetune_youtube_last.py](finetune_youtube_last.py)
-Finetune the model on the SS-Youtube dataset using the 'last' method.
-
-[finetune_insults_chain-thaw.py](finetune_insults_chain-thaw.py)
-Finetune the model on the Kaggle insults dataset (from blog post) using the 'chain-thaw' method.
-
-[finetune_semeval_class-avg_f1.py](finetune_semeval_class-avg_f1.py)
-Finetune the model on the SemeEval emotion dataset using the 'full' method and evaluate using the class average F1 metric.
+[finetune_youtube_last.py](finetune_youtube_last.py)
+Finetune the model on the SS-Youtube dataset using the 'last' method.
+
+[finetune_insults_chain-thaw.py](finetune_insults_chain-thaw.py)
+Finetune the model on the Kaggle insults dataset (from blog post) using the 'chain-thaw' method.
+
+[finetune_semeval_class-avg_f1.py](finetune_semeval_class-avg_f1.py)
+Finetune the model on the SemeEval emotion dataset using the 'full' method and evaluate using the class average F1 metric.
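The `text_emojize.py` entry describes turning a score distribution into a short list of emojis. A minimal sketch of that selection step follows. It assumes the scores arrive as a flat list of per-class probabilities. The helper name `top_emoji_indices` and the sample values are hypothetical and are not taken from torchMoji, so treat this as an illustration of top-k selection rather than the script's actual code.

```python
# Hypothetical helper, not part of torchMoji: pick the k highest-scoring
# emoji class indices from a probability vector.
def top_emoji_indices(probs, k=5):
    """Return the indices of the k largest probabilities, best first."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    # Made-up scores for six emoji classes (illustration only).
    scores = [0.05, 0.40, 0.10, 0.25, 0.15, 0.05]
    print(top_emoji_indices(scores, k=3))  # prints [1, 3, 4]
```

Mapping each returned index to an emoji character would need a lookup table ordered like the model's output classes. That table is not shown in this diff, so it is left out here.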