---
language:
  - en
tags:
    - image-generation
    - text-to-image
    - vae
    - t5
    - conditional-generation
    - generative-modeling
    - image-synthesis
    - image-manipulation
    - design-prototyping
    - research
    - educational
license: mit
datasets:
  - blowing-up-groundhogs/font-square-v2
metrics:
  - FID
  - KID
  - HWD
  - CER
library_name: t5
---

# Emuru

**Emuru** is a conditional generative model that integrates a T5-based decoder with a Variational Autoencoder (VAE) for image generation conditioned on text and style images. Users combine textual prompts (a style text and a generation text) with a reference style image to synthesize new images.


## Model description

- **Architecture**:  
  Emuru uses a [T5ForConditionalGeneration](https://huggingface.co/docs/transformers/model_doc/t5) as its text decoder and an [AutoencoderKL](https://huggingface.co/docs/diffusers/api/models/autoencoderkl) as the VAE backbone. The T5 model encodes the textual prompts together with the latent tokens generated so far and predicts the next latent tokens. The VAE encodes the initial style image into this latent space and decodes the predicted latent tokens back into an image (see the illustrative sketch after this list).

- **Inputs**:  
  1. **Style Image**: A reference image, which Emuru encodes to capture its “style” or other visual characteristics.  
  2. **Style Text**: Text describing the style or context.  
  3. **Generation Text**: Text describing the content or object to generate.  

- **Outputs**:  
  1. A synthesized image that reflects the fused style and text descriptions.  

- **Tokenization**:  
  Text prompts are tokenized with [AutoTokenizer](https://huggingface.co/docs/transformers/main_classes/tokenizer); the T5 decoder's vocabulary and token embeddings are sized to match this tokenizer.

- **Usage scenarios**:  
  - Stylized text-to-image generation  
  - Image manipulation or design prototyping based on textual descriptions  
  - Research or educational demonstrations of T5-based generative modeling  

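For a concrete picture of the two building blocks, the sketch below instantiates a generic T5 decoder and a generic `AutoencoderKL` through the public `transformers` and `diffusers` APIs. The checkpoints used here (`google-t5/t5-small` and `stabilityai/sd-vae-ft-mse`) are public stand-ins chosen purely for illustration, not Emuru's weights; the released model wires these components together in its own remote code and is loaded as shown in the next section.

```python
# Illustrative only: generic versions of Emuru's two components.
# The checkpoints below are public stand-ins, NOT the Emuru weights.
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration
from diffusers import AutoencoderKL

# Text decoder: a T5 encoder-decoder that, in Emuru, predicts latent tokens
tokenizer = AutoTokenizer.from_pretrained("google-t5/t5-small")
t5 = T5ForConditionalGeneration.from_pretrained("google-t5/t5-small")
text_ids = tokenizer("EMURU", return_tensors="pt").input_ids

# VAE: encodes the style image into latents and decodes predicted latents back to pixels
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")

# Round-trip a dummy 64-pixel-high RGB strip through the VAE to inspect the latent shape
dummy = torch.randn(1, 3, 64, 256)
with torch.no_grad():
    latents = vae.encode(dummy).latent_dist.sample()
    recon = vae.decode(latents).sample
print(text_ids.shape, latents.shape, recon.shape)  # this VAE downsamples spatially by 8x
```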

## How to use

Below is a minimal usage example in Python. Load the model with `AutoModel.from_pretrained(...)` (passing `trust_remote_code=True`, since the model class is defined in the repository) and call `.generate(...)` or `.generate_batch(...)` to create images.

```python
import torch
from PIL import Image
from transformers import AutoModel
from huggingface_hub import hf_hub_download
from torchvision.transforms import functional as F

def load_image(img_path):
    img = Image.open(img_path).convert("RGB")
    # Resize to a fixed height of 64 pixels, preserving the aspect ratio
    img = img.resize((img.width * 64 // img.height, 64))
    img = F.to_tensor(img)
    img = F.normalize(img, [0.5], [0.5])  # Scale pixel values to [-1, 1]
    return img

# 1. Load the model (its class is defined in the repo, hence trust_remote_code=True)
model = AutoModel.from_pretrained("blowing-up-groundhogs/emuru", trust_remote_code=True)
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)  # Move to GPU when one is available

# 2. Prepare your inputs
style_text = 'THE JOLLY IS "U"'
gen_text = 'EMURU'
img_path = hf_hub_download(repo_id="blowing-up-groundhogs/emuru", filename="sample.png")
style_img = load_image(img_path)
style_img = style_img.to(device)

# 3. Generate an image
generated_pil_image = model.generate(
    style_text=style_text,
    gen_text=gen_text,
    style_img=style_img,
    max_new_tokens=64
)

# 4. Save the result
generated_pil_image.save("generated_image.png")
```
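`generate` returns a `PIL.Image`, so you can post-process or save it with the usual Pillow API. The `max_new_tokens` argument caps how many latent tokens the decoder emits; if a longer `gen_text` comes out truncated, try increasing it.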

### Batch Generation
You can also generate a batch of images if you have multiple style texts, generation texts, and style images:

```python
style_texts = ['THE JOLLY IS "U"', 'THE JOLLY IS "M"', 'THE JOLLY IS "R"']
gen_texts   = ['EMURU', 'EMURU', 'EMURU']
style_imgs  = torch.stack([style_img, style_img, style_img], dim=0)  # shape: (batch_size, C, H, W)
lengths     = [style_img.size(-1), style_img.size(-1), style_img.size(-1)]

output_images = model.generate_batch(
    style_texts=style_texts,
    gen_texts=gen_texts,
    style_imgs=style_imgs,
    lengths=lengths,
    max_new_tokens=64
)

# `output_images` is a list of PIL images
for idx, pil_img in enumerate(output_images):
    pil_img.save(f"batch_generated_image_{idx}.png")
```
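The example above reuses the same style image, so stacking works directly. With style images of different widths you would first pad them to a common width before calling `torch.stack`, and `lengths` would record each image's original (unpadded) width. The helper below is a sketch of that preparation step; the padding value and convention are assumptions, since the exact behaviour expected by `generate_batch` is not documented here.

```python
import torch.nn.functional as TF

def pad_and_stack(imgs, pad_value=1.0):
    # Pad each (C, H, W) tensor on the right up to the widest image, then stack.
    # pad_value=1.0 corresponds to white after the [-1, 1] normalization above;
    # this is an assumption about what the model expects for padded regions.
    max_w = max(img.size(-1) for img in imgs)
    lengths = [img.size(-1) for img in imgs]
    padded = [TF.pad(img, (0, max_w - img.size(-1)), value=pad_value) for img in imgs]
    return torch.stack(padded, dim=0), lengths

style_imgs, lengths = pad_and_stack([style_img, style_img, style_img])
```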


## Citation

If you use Emuru in your research or wish to refer to it, please cite:

```
Wait for it...
```