---
language:
- en
license: apache-2.0
---
# UAR Scene

Literary character representations using [UAR Scene](https://aclanthology.org/2024.findings-emnlp.744/), trained on fictional character utterances.

You can find the training and evaluation repository [here](https://github.com/deezer/character_embeddings_qa).

This model is based on the [LUAR implementation](https://aclanthology.org/2021.emnlp-main.70/). It uses `all-distilroberta-v1` as the base sentence encoder and was trained on the Scene split of [DramaCV](https://huggingface.co/datasets/gasmichel/DramaCV), a dataset of drama plays collected from Project Gutenberg.

The model trained on the Play split is available at [UAR Play](https://huggingface.co/gasmichel/UAR_Play).
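
To inspect the training data, the DramaCV dataset can be pulled with the `datasets` library. A minimal sketch, assuming the Scene split is exposed as a `"scene"` config (the config name is an assumption; check the dataset card for the exact names):

```python
from datasets import load_dataset

# "scene" config name is an assumption; see the DramaCV dataset card.
dramacv = load_dataset("gasmichel/DramaCV", "scene")
print(dramacv)
```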


## Usage

```python
from transformers import AutoModel, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("gasmichel/UAR_scene")
model = AutoModel.from_pretrained("gasmichel/UAR_scene")
# `episodes` are embedded as collections of documents presumed to come from the same author
# NOTE: make sure that `episode_length` is consistent across episodes
batch_size = 3
episode_length = 16
text = [
    ["Foo"] * episode_length,
    ["Bar"] * episode_length,
    ["Zoo"] * episode_length,
]
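# flatten the nested episodes into a single list of utterances for tokenization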
text = [j for i in text for j in i]
tokenized_text = tokenizer(
    text, 
    max_length=32,
    padding="max_length", 
    truncation=True,
    return_tensors="pt"
)
# inputs size: (batch_size, episode_length, max_token_length)
tokenized_text["input_ids"] = tokenized_text["input_ids"].reshape(batch_size, episode_length, -1)
tokenized_text["attention_mask"] = tokenized_text["attention_mask"].reshape(batch_size, episode_length, -1)
print(tokenized_text["input_ids"].size())       # torch.Size([3, 16, 32])
print(tokenized_text["attention_mask"].size())  # torch.Size([3, 16, 32])
out = model(**tokenized_text)
print(out.size())   # torch.Size([3, 512])
# to get the Transformer attentions:
out, attentions = model(**tokenized_text, output_attentions=True)
print(attentions[0].size())     # torch.Size([48, 12, 32, 32])
```
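
As a usage sketch (not part of the original card), the pooled embeddings returned above can be compared with cosine similarity, one common way to score how alike two character representations are; `out` refers to the `(batch_size, 512)` tensor from the block above:

```python
import torch.nn.functional as F

# `out` is the (batch_size, 512) embedding tensor from the snippet above.
# Pairwise cosine similarities between the three character episodes:
normalized = F.normalize(out, dim=-1)
similarity_matrix = normalized @ normalized.T
print(similarity_matrix)  # torch.Size([3, 3]); diagonal entries are 1.0
```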

## Citing & Authors

If you find this model helpful, feel free to cite our [publication](https://aclanthology.org/2024.findings-emnlp.744/).

```
@inproceedings{michel-etal-2024-improving,
    title = "Improving Quotation Attribution with Fictional Character Embeddings",
    author = "Michel, Gaspard  and
      Epure, Elena V.  and
      Hennequin, Romain  and
      Cerisara, Christophe",
    editor = "Al-Onaizan, Yaser  and
      Bansal, Mohit  and
      Chen, Yun-Nung",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2024",
    month = nov,
    year = "2024",
    address = "Miami, Florida, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.findings-emnlp.744",
    doi = "10.18653/v1/2024.findings-emnlp.744",
    pages = "12723--12735",
}
```

## License

UAR Scene is distributed under the terms of the Apache License (Version 2.0).

All new contributions must be made under the Apache-2.0 license.