vittoriopippi committed · fa25a23
Parent(s): 6c352fc
Edit README.md

README.md CHANGED
Removed (previous README): the model-card front matter (tags: image-generation, text-to-image, conditional-generation, generative-modeling, image-synthesis, image-manipulation, design-prototyping, research, educational; license: mit; metrics: FID, KID, HWD, CER), along with the original installation instructions (`conda`, `pip install -r requirements.txt`), training instructions (`python train.py` with options such as `--is_cycle`, `--dataset`, `--resume`, `--wandb`, `--text-augment-strength`, `--file-suffix pa`, `--augment-ocr`), and generation instructions (`python generate.py [ACTION] --checkpoint files/vatrpp.pth` with the `text`, `authors`, and `fid` actions, plus `create_style_sample.py` for extracting style samples from a page image).

Added (new README):
# VATr++ (Hugging Face Version)

This is a re-upload of the **VATr++** styled handwritten text generation model to the Hugging Face Model Hub. The original code and more detailed documentation can be found in the [VATr-pp GitHub repository](https://github.com/EDM-Research/VATr-pp).

> **Note**: Please refer to the original repo for:
> - Full training instructions
> - In-depth code details
> - Extended usage and references

This Hugging Face version allows you to directly load the **VATr++** model with `AutoModel.from_pretrained(...)` and use it in your pipelines or scripts without manually handling checkpoints. The usage differs slightly from the original GitHub repository, primarily because we leverage Hugging Face’s `transformers` interface here.

---

## Installation

1. **Create a conda environment (recommended)**:
   ```bash
   conda create --name vatr python=3.9
   conda activate vatr
   ```

2. **Install PyTorch and CUDA (if available)**:
   ```bash
   conda install pytorch==1.13.1 torchvision==0.14.1 torchaudio==0.13.1 pytorch-cuda=11.7 -c pytorch -c nvidia
   ```

3. **Install additional requirements** (including `transformers`, `Pillow`, `numpy`, etc.):
   ```bash
   pip install transformers Pillow numpy
   ```

   *You may need to adjust or add libraries based on your specific environment needs.*
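
Before loading the model, it can be worth confirming that the expected packages and (optionally) CUDA are visible. The snippet below is only an illustrative sanity check, not part of VATr++ itself:

```python
# Illustrative environment check (not part of VATr++)
import torch
import torchvision
import transformers

print("PyTorch:", torch.__version__)          # around 1.13.1 if installed with the conda command above
print("torchvision:", torchvision.__version__)
print("transformers:", transformers.__version__)
print("CUDA available:", torch.cuda.is_available())  # True only with a CUDA build and a visible GPU
```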

---

## Loading the Model

#### **VATr++**

To load the **VATr++** version:

```python
from transformers import AutoModel

model_vatr_pp = AutoModel.from_pretrained(
    "blowing-up-groundhogs/vatrpp",
    trust_remote_code=True
)
```

#### **VATr (original)**

To load the **original VATr** model (instead of VATr++), specify the `subfolder` argument:

```python
model_vatr = AutoModel.from_pretrained(
    "blowing-up-groundhogs/vatrpp",
    subfolder="vatr",
    trust_remote_code=True
)
```
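
Since the model is exposed through `AutoModel`, it can usually be handled like a regular PyTorch module. As a minimal sketch, assuming the custom VATr++ code follows standard `torch.nn.Module` conventions for device placement:

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

model_vatr_pp.eval()        # disable dropout and other training-only behaviour
model_vatr_pp.to(device)    # assumption: the custom generation code honours the module's device
```

If device placement causes issues with the custom generation code, keeping the model on CPU (the default) matches the usage example below.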

---

## Usage (Inference Example)

Below is a **minimal** usage example that demonstrates how to:

1. Load the VATr++ model from the Hugging Face Hub.
2. Preprocess a style image (an image of handwriting).
3. Generate a new handwritten line of text in the style of the provided image.

> **Important**: This model requires `trust_remote_code=True` to properly load its custom generation logic.

```python
import numpy as np
from PIL import Image
import torch
from torchvision import transforms as T
from transformers import AutoModel

# 1. Load the model (VATr++)
model = AutoModel.from_pretrained("blowing-up-groundhogs/vatrpp", trust_remote_code=True)

# 2. Helper functions to load and process style images
def load_image(img, chunk_width=192):
    # Convert to grayscale and resize to height 32
    img = img.convert("L")
    img = img.resize((img.width * 32 // img.height, 32))
    arr = np.array(img)

    # Set up transforms: grayscale + tensor + normalize to roughly [-1, 1]
    transform = T.Compose([
        T.Grayscale(num_output_channels=1),
        T.ToTensor(),
        T.Normalize((0.5,), (0.5,))
    ])

    # Pad the (inverted) image to a fixed chunk width; padding stays white after re-inversion
    arr = 255 - arr
    height, width = arr.shape
    out = np.zeros((height, chunk_width), dtype="float32")
    out[:, :width] = arr[:, :chunk_width]
    out = 255 - out

    # Apply transforms
    out = transform(Image.fromarray(out.astype(np.uint8)))
    return out, width

def load_image_line(img, chunk_width=192, style_imgs_count=15):
    # Convert to grayscale and resize
    img = img.convert("L")
    img = img.resize((img.width * 32 // img.height, 32))
    arr = np.array(img)

    # Split into fixed-width chunks
    chunks = []
    for start in range(0, arr.shape[1], chunk_width):
        chunk = arr[:, start:start + chunk_width]
        chunks.append(chunk)

    # Transform each chunk
    transformed = []
    for c in chunks:
        t, _ = load_image(Image.fromarray(c), chunk_width)
        transformed.append(t)

    # If fewer than `style_imgs_count` chunks, repeat them
    while len(transformed) < style_imgs_count:
        transformed += transformed
    transformed = transformed[:style_imgs_count]

    # Combine into a single tensor
    return torch.cat(transformed, 0)

# 3. Load a style image of your handwriting (or any handwriting sample)
style_image_path = "path/to/your_style_image.png"
img = Image.open(style_image_path)
style_imgs = load_image_line(img)

# 4. Generate text in the style of `style_image_path`
generated_pil_image = model.generate(
    gen_text="This is a test",   # Text to generate
    style_imgs=style_imgs,       # Preprocessed style chunks
    align_words=True,            # Align words at baseline
    at_once=True,                # Generate line at once
)

# 5. Save the generated image
generated_pil_image.save("generated_output.png")
```
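
As a quick, purely illustrative check of the preprocessing above: `load_image_line` stacks `style_imgs_count` grayscale chunks of height 32 and width `chunk_width`, normalized to roughly [-1, 1]:

```python
# Expected with the default arguments: torch.Size([15, 32, 192])
print(style_imgs.shape)
print(style_imgs.min().item(), style_imgs.max().item())  # roughly -1.0 and 1.0 after normalization
```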

- **`style_imgs`**: A batch of fixed-width image chunks from your style reference. In practice, you can supply multiple small style samples or a single line image split into chunks (see the sketch after this list).
- **`gen_text`**: The text to render in the given style.
- **`align_words`** and **`at_once`**: Optional arguments that control how the text is laid out and generated.
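
If your style reference is a folder of per-writer word crops rather than a single line image (the original repository keeps such samples under a default of `files/style_samples/00`), you can build `style_imgs` with the `load_image` helper from the example above. The folder path and file layout below are placeholder assumptions, and each crop is assumed to be at most `chunk_width` pixels wide once resized to height 32:

```python
import os

style_folder = "path/to/style_samples/00"   # hypothetical folder of word-level crops from one writer

crops = []
for name in sorted(os.listdir(style_folder)):
    if not name.lower().endswith((".png", ".jpg", ".jpeg")):
        continue
    crop, _ = load_image(Image.open(os.path.join(style_folder, name)))  # one (1, 32, 192) tensor per crop
    crops.append(crop)
assert crops, "no style crops found in style_folder"

# Mirror load_image_line: repeat the samples until 15 chunks are available
while len(crops) < 15:
    crops += crops
style_imgs = torch.cat(crops[:15], 0)

# Reuse the same style tensor for several generations
for i, sentence in enumerate(["Hello world", "Styled handwriting", "VATr++ on the Hub"]):
    page = model.generate(gen_text=sentence, style_imgs=style_imgs, align_words=True, at_once=True)
    page.save(f"generated_{i}.png")
```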

---

## Original Repository

This model is built upon the code from [**EDM-Research/VATr-pp**](https://github.com/EDM-Research/VATr-pp), which is itself an improvement on the [VATr](https://github.com/aimagelab/VATr) project. If you need to:

- Train your own model from scratch
- Explore advanced features (like style cycle loss, punctuation modes, or advanced augmentation)
- Examine experimental details or replicate the original paper's setup

please visit the original GitHub repositories for comprehensive documentation and support files.

---
## License and Acknowledgments

- The original code and model are under the license found in [the GitHub repository](https://github.com/EDM-Research/VATr-pp).
- All credit goes to the original authors and maintainers for creating VATr++ and releasing it openly.
- This Hugging Face re-upload is merely intended to **simplify inference** and **model sharing**; no changes have been made to the core training code or conceptual pipeline.

---

**Enjoy generating styled handwritten text!** For any issues specific to this Hugging Face version, feel free to open an issue or pull request here. Otherwise, for deeper technical questions, please consult the original repository or its authors.