---
license: mit
datasets:
- gustavecortal/DreamBank-annotated
language:
- en
pipeline_tag: text-generation
---

## Presentation

Oneirogen ([0.5B](https://huggingface.co/gustavecortal/oneirogen-0.5B), [1.5B](https://huggingface.co/gustavecortal/oneirogen-1.5B) and [7B](https://huggingface.co/gustavecortal/oneirogen-7B)) is a language model for dream generation based on [Qwen2](https://huggingface.co/Qwen/Qwen2-7B). It was trained on [DreamBank](https://dreambank.net/), a corpus of more than 27,000 dream narratives. 

Oneirogen was used to produce [The Android and The Machine](https://huggingface.co/datasets/gustavecortal/the-android-and-the-human), an English dataset composed of 10,000 real and 10,000 generated dreams.

Oneirogen can be used to generate novel dream narratives. It can also be used for dream analysis. For example, one could finetune this model on [Hall and Van de Castle annotations](https://dreams.ucsc.edu/Coding/) to predict characters and emotions in dream narratives. I introduced this task in this [paper](https://aclanthology.org/2024.lrec-main.1282/).

Generation examples are available on my [website](https://gustavecortal.com/project/oneirogen).

## Code for generation

```py
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, StoppingCriteria, StoppingCriteriaList

class CustomStoppingCriteria(StoppingCriteria):
    """Stop generation as soon as the stop token appears in the decoded output."""

    def __init__(self, stop_token, tokenizer):
        self.stop_token = stop_token
        self.tokenizer = tokenizer

    def __call__(self, input_ids, scores, **kwargs):
        decoded_output = self.tokenizer.decode(input_ids[0], skip_special_tokens=True)
        return self.stop_token in decoded_output

# Load the tokenizer first: the stopping criteria below needs it.
tokenizer = AutoTokenizer.from_pretrained("gustavecortal/oneirogen-0.5B")
model = AutoModelForCausalLM.from_pretrained("gustavecortal/oneirogen-0.5B", torch_dtype=torch.float16)
model.to("cuda")

stop_token = "END."  # The model was trained with this end-of-text marker.
stopping_criteria = StoppingCriteriaList([CustomStoppingCriteria(stop_token, tokenizer)])

text = "Dream:"  # The model was trained with this prefix.

inputs = tokenizer(text, return_tensors="pt").to("cuda")
outputs = model.generate(
    inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    max_new_tokens=256,
    do_sample=True,
    top_k=50,
    top_p=0.95,
    temperature=0.9,
    num_beams=1,
    repetition_penalty=1.11,
    stopping_criteria=stopping_criteria,
)
print(tokenizer.batch_decode(outputs, skip_special_tokens=False)[0])
```
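
The text printed by the snippet above still contains the `Dream:` prefix and, usually, the `END.` marker. A minimal sketch of a post-processing step is shown below; `clean_dream` is a hypothetical helper (not part of the model or of `transformers`), and it assumes the decoded string starts with the prefix.

```python
def clean_dream(generated: str, prefix: str = "Dream:", stop_token: str = "END.") -> str:
    """Strip the training prefix and cut the text at the first stop token.

    Hypothetical helper: assumes the decoded output begins with `prefix`
    and that `stop_token` marks the end of the dream narrative.
    """
    text = generated
    # Remove the prompt prefix if the output starts with it.
    if text.startswith(prefix):
        text = text[len(prefix):]
    # Keep only what precedes the first stop token, if one was generated.
    stop_index = text.find(stop_token)
    if stop_index != -1:
        text = text[:stop_index]
    return text.strip()


if __name__ == "__main__":
    example = "Dream: I was walking through my old school. END."
    print(clean_dream(example))
```

If generation hits `max_new_tokens` before `END.` appears, the helper returns the truncated narrative unchanged apart from the removed prefix.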

## Inspiration

An oneirogen, from the Greek _óneiros_ meaning "dream" and _gen_ "to create", is a substance or other stimulus which produces or enhances dreamlike states of consciousness.

This model resonates with a speech called _The Android and The Human_ given by science-fiction author Philip K. Dick:

> Our environment – and I mean our man-made world of machines, artificial constructs, computers, electronic systems, interlinking homeostatic components – all of this is in fact beginning more and more to possess what the earnest psychologists fear the primitive sees in his environment: animation. In a very real sense our environment is becoming alive, or at least quasi-alive, and in ways specifically and fundamentally analogous to ourselves... Rather than learning about ourselves by studying our constructs, perhaps we should make the attempt to comprehend what our constructs are up to by looking into what we ourselves are up to.

## Technical aspects

Oneirogen is a Qwen2 model finetuned on the DreamBank corpus using LoRA (low-rank adaptation). A notebook to replicate the training will soon be available.
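
To give a rough sense of why LoRA keeps finetuning cheap: instead of updating a full weight matrix of shape `d_out × d_in`, it trains two low-rank factors of shapes `d_out × r` and `r × d_in`. The sketch below compares the parameter counts. The dimensions and rank are illustrative assumptions, not the values used to train Oneirogen (those are not stated here), and `lora_trainable_params` is a hypothetical helper.

```python
def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Number of trainable parameters in the two LoRA factors
    B (d_out x rank) and A (rank x d_in) for one weight matrix."""
    return rank * (d_out + d_in)


if __name__ == "__main__":
    # Illustrative values only: not Oneirogen's actual configuration.
    d_out, d_in, rank = 1024, 1024, 8
    full = d_out * d_in                      # parameters updated by full finetuning
    lora = lora_trainable_params(d_out, d_in, rank)
    print(f"full: {full}, LoRA: {lora}, ratio: {lora / full:.4f}")
```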

This work was performed using HPC resources (Jean Zay supercomputer) from GENCI-IDRIS (Grant 20XX-AD011014205).

## Contact

Mail: [email protected]

X: [@gustavecortal](https://x.com/gustavecortal)

Website: [gustavecortal.com](https://gustavecortal.com)