gasmichel committed · Commit ecf231e · verified · 1 Parent(s): 6d63c56

Upload README.md with huggingface_hub

Files changed (1): README.md (+81 -3)
---
language:
- en
license: apache-2.0
---
# UAR Play

Literary Character Representations using [UAR Play](https://aclanthology.org/2024.findings-emnlp.744/), trained on fictional character utterances.

You can find the training and evaluation repository [here](https://github.com/deezer/character_embeddings_qa).

This model is based on the [LUAR implementation](https://aclanthology.org/2021.emnlp-main.70/). It uses `all-distilroberta-v1` as the base sentence encoder and was trained on the Play split of [DramaCV](https://huggingface.co/datasets/gasmichel/DramaCV), a dataset of drama plays collected from Project Gutenberg.

The model trained on the Scene split is available at [UAR_scene](https://huggingface.co/gasmichel/UAR_scene).
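
For reference, the training data can be browsed with the `datasets` library. A minimal loading sketch, assuming the Play split is exposed as a `play` config (the config name is an assumption; check the [DramaCV](https://huggingface.co/datasets/gasmichel/DramaCV) dataset card for the exact identifier):

```python
from datasets import load_dataset

# "play" is an assumed config name; verify it on the DramaCV dataset card.
drama_cv = load_dataset("gasmichel/DramaCV", "play")
print(drama_cv)  # inspect available splits and fields
```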

## Usage

```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gasmichel/UAR_Play")
model = AutoModel.from_pretrained("gasmichel/UAR_Play")

# `episodes` are embedded as collections of documents presumed to come from an author
# NOTE: make sure that `episode_length` is consistent across episodes
batch_size = 3
episode_length = 16
text = [
    ["Foo"] * episode_length,
    ["Bar"] * episode_length,
    ["Zoo"] * episode_length,
]
# flatten to a single list of (batch_size * episode_length) documents
text = [j for i in text for j in i]

tokenized_text = tokenizer(
    text,
    max_length=32,
    padding="max_length",
    truncation=True,
    return_tensors="pt",
)
# reshape inputs to (batch_size, episode_length, max_token_length)
tokenized_text["input_ids"] = tokenized_text["input_ids"].reshape(batch_size, episode_length, -1)
tokenized_text["attention_mask"] = tokenized_text["attention_mask"].reshape(batch_size, episode_length, -1)
print(tokenized_text["input_ids"].size())       # torch.Size([3, 16, 32])
print(tokenized_text["attention_mask"].size())  # torch.Size([3, 16, 32])

out = model(**tokenized_text)
print(out.size())  # torch.Size([3, 512])

# to get the Transformer attentions:
out, attentions = model(**tokenized_text, output_attentions=True)
print(attentions[0].size())  # torch.Size([48, 12, 32, 32])
```
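
Since the model returns one fixed-size embedding per episode, characters can be compared directly in embedding space. A minimal follow-up sketch (not part of the original card), continuing from the variables above:

```python
import torch
import torch.nn.functional as F

# Embed the three episodes and compare characters by cosine similarity.
with torch.no_grad():
    embeddings = model(**tokenized_text)  # (3, 512): one vector per episode
embeddings = F.normalize(embeddings, p=2, dim=-1)  # unit-normalize each row
similarity = embeddings @ embeddings.T  # (3, 3) pairwise cosine similarities
print(similarity)
```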

## Citing & Authors

If you find this model helpful, feel free to cite our [publication](https://aclanthology.org/2024.findings-emnlp.744/):

```
@inproceedings{michel-etal-2024-improving,
    title = "Improving Quotation Attribution with Fictional Character Embeddings",
    author = "Michel, Gaspard and
      Epure, Elena V. and
      Hennequin, Romain and
      Cerisara, Christophe",
    editor = "Al-Onaizan, Yaser and
      Bansal, Mohit and
      Chen, Yun-Nung",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2024",
    month = nov,
    year = "2024",
    address = "Miami, Florida, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.findings-emnlp.744",
    doi = "10.18653/v1/2024.findings-emnlp.744",
    pages = "12723--12735",
}
```

## License

UAR Play is distributed under the terms of the Apache License (Version 2.0).

All new contributions must be made under the Apache-2.0 license.