Support aspect-ratio–preserving output (avoid forced 832×480)

#9
by jiuface - opened

Summary
The demo currently always outputs a fixed-resolution video (LANDSCAPE_WIDTH = 832, LANDSCAPE_HEIGHT = 480). Even when a portrait or square image is provided, it is center-cropped and resized to 832×480 in resize_image[_landscape], so subjects get cropped out or look distorted. Could we add an option to preserve the input image’s aspect ratio and either (a) letterbox/pillarbox, or (b) adapt the generated H/W to the input ratio?

Where this happens

  • Fixed target dims:

    LANDSCAPE_WIDTH = 832
    LANDSCAPE_HEIGHT = 480
    
  • Pre-resize crops to target aspect:

    def resize_image_landscape(image):
        target_aspect = LANDSCAPE_WIDTH / LANDSCAPE_HEIGHT
        ...
        return image.resize((LANDSCAPE_WIDTH, LANDSCAPE_HEIGHT), Image.LANCZOS)
    
  • AOT warm-up compiled with fixed H/W:

    optimize_pipeline_(pipe,
        image=Image.new('RGB', (LANDSCAPE_WIDTH, LANDSCAPE_HEIGHT)),
        height=LANDSCAPE_HEIGHT,
        width=LANDSCAPE_WIDTH,
        num_frames=MAX_FRAMES_MODEL,
    )
    

Expected behavior

  • When I upload a portrait (e.g., 1080×1920) or square image, the generated video should keep that aspect ratio (within model constraints), with one of:

    1. Fit (letterbox/pillarbox): pad to nearest valid multiple (e.g., 1024×1824 → pad to 1024×1856 if needed).
    2. Fill (crop): current behavior, but as a user-selectable option.
  • Optionally, offer a few pre-set aspect buckets (1:1, 4:3, 3:2, 16:9, 9:16) and choose the nearest bucket to the input.

Actual behavior

  • All outputs are 832×480. Portrait/square inputs are center-cropped to a 16:9 slice and then scaled to 832×480.

Why this matters

  • Many inputs (selfies, mobile photos) are portrait or square. Forcing 16:9 often chops heads/feet or important context and looks worse in vertical feeds.

Proposed approach (backward-compatible)

  1. Expose an “Aspect Mode” control in the UI

    • aspect_mode = ["Fit (letterbox)", "Fill (crop)"], default to “Fit”.
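    • For example (a minimal sketch, assuming the demo’s UI is Gradio; the control name is illustrative):

    import gradio as gr

    # placed alongside the existing input controls in the demo's layout (assumed)
    aspect_mode = gr.Radio(
        choices=["Fit (letterbox)", "Fill (crop)"],
        value="Fit (letterbox)",
        label="Aspect Mode",
    )
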
  2. Compute target H/W dynamically while respecting model multiples

    MOD = 16  # or 32 if required by Wan kernels
    def compute_target_size(w, h, max_side=1024, mod=MOD):
        # scale so the longer side = max_side, preserve ratio
        if w >= h:
            new_w = max_side
            new_h = int(round(h * max_side / w))
        else:
            new_h = max_side
            new_w = int(round(w * max_side / h))
        # snap to model-friendly multiples
        new_w = max(mod, (new_w // mod) * mod)
        new_h = max(mod, (new_h // mod) * mod)
        return new_w, new_h
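
    # example (illustrative): compute_target_size(1080, 1920, max_side=832, mod=16) -> (464, 832),
    # i.e. a 1080x1920 portrait input keeps a portrait target instead of being forced to 832x480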
    
  3. Implement letterbox/pad for “Fit”

    from PIL import Image, ImageOps
    
    def letterbox(image, target_w, target_h, fill=(0, 0, 0)):
        # thumbnail() keeps the aspect ratio and only ever shrinks; smaller inputs
        # are left as-is and simply get more padding
        img = image.copy()
        img.thumbnail((target_w, target_h), Image.LANCZOS)
        # split any leftover space evenly between the two sides
        pad_w = target_w - img.width
        pad_h = target_h - img.height
        padding = (pad_w // 2, pad_h // 2, pad_w - pad_w // 2, pad_h - pad_h // 2)
        return ImageOps.expand(img, padding, fill=fill)
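
    # example: letterbox(img, 464, 832) on a 1080x1920 input scales it to roughly 464x825,
    # then pads ~7 black rows (split between top and bottom) to reach exactly 464x832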
    
  4. Warm-up/compile per chosen size (or opt into dynamic)

    • If keeping AOT export, cache optimize_pipeline_ by (width,height) so each distinct shape warms once.
    • Alternatively, switch to torch.compile(..., dynamic=True, mode="reduce-overhead") to tolerate multiple shapes (trade-off: tiny perf drop vs. flexibility).
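    • A per-shape warm-up cache could look like this (a minimal sketch; assumes optimize_pipeline_ takes the same arguments as in the warm-up above and is safe to call once per shape, names are illustrative):

    warmed_shapes = set()

    def ensure_warmed(pipe, width, height):
        # warm/compile each distinct (width, height) exactly once
        if (width, height) in warmed_shapes:
            return
        optimize_pipeline_(
            pipe,
            image=Image.new('RGB', (width, height)),
            height=height,
            width=width,
            num_frames=MAX_FRAMES_MODEL,
        )
        warmed_shapes.add((width, height))
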
  5. Use the computed size at call time

    in_w, in_h = input_image.size
    tgt_w, tgt_h = compute_target_size(in_w, in_h, max_side=832)  # reuse 832 as longest side
    if aspect_mode == "Fit (letterbox)":
        resized = letterbox(input_image, tgt_w, tgt_h)
    else:  # Fill (crop)
        resized = input_image.copy()
        # center-crop to tgt aspect, then resize to (tgt_w, tgt_h)
        in_aspect = in_w / in_h
        tgt_aspect = tgt_w / tgt_h
        if in_aspect > tgt_aspect:
            new_w = int(round(in_h * tgt_aspect))
            left = (in_w - new_w) // 2
            resized = resized.crop((left, 0, left + new_w, in_h))
        else:
            new_h = int(round(in_w / tgt_aspect))
            top = (in_h - new_h) // 2
            resized = resized.crop((0, top, in_w, top + new_h))
        resized = resized.resize((tgt_w, tgt_h), Image.LANCZOS)
    
    # ensure the same H/W are passed to the pipeline
    output = pipe(
        image=resized,
        height=resized.height,
        width=resized.width,
        ...
    )
    
  6. (Optional) Aspect buckets to curb recompiles

    • Snap to nearest among: [(832,832), (768,512), (832,480), (480,832)].
    • This caps the number of AOT warm-ups while still covering square/landscape/portrait.
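    • A minimal sketch of the nearest-bucket pick (the bucket list is just the example set above; the helper name is illustrative):

    import math

    BUCKETS = [(832, 832), (768, 512), (832, 480), (480, 832)]

    def nearest_bucket(w, h, buckets=BUCKETS):
        # compare aspect ratios in log space so wide and tall deviations are weighted symmetrically
        target = math.log(w / h)
        return min(buckets, key=lambda wh: abs(math.log(wh[0] / wh[1]) - target))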

Acceptance criteria

  • A new UI control allows choosing “Fit (letterbox)” vs. “Fill (crop)”.
  • When uploading a portrait image, output video keeps a portrait aspect without stretching (letterbox or adaptive H/W).
  • Pipeline is warmed/compiled for the selected shape (or uses dynamic compile) and does not crash when switching between portrait/square/landscape inputs.
  • Default behavior remains unchanged if users don’t touch the new option (or it defaults to “Fit”, whichever you prefer).

Environment

  • Space: zerogpu-aoti/wan2-2-fp8da-aoti-faster (latest main)
  • Uses Wan-AI/Wan2.2-I2V-A14B-Diffusers + bf16 transformers, FP8 quant, AOT compile
  • PyTorch nightly in the Space (2.8 per the setup block)

Notes
If AOT export requires static shapes, pre-warming a small set of aspect buckets is a practical compromise. I’m happy to help test or provide a PR if the approach above sounds reasonable.
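
For instance, the pre-warm could simply reuse the existing warm-up call over an illustrative bucket list:

    for w, h in [(832, 480), (480, 832), (832, 832)]:
        optimize_pipeline_(
            pipe,
            image=Image.new('RGB', (w, h)),
            height=h,
            width=w,
            num_frames=MAX_FRAMES_MODEL,
        )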

Thank you very much for this post! I was about to write a similar one but you did it 10 times better than I ever could. Great work! I absolutely agree, I would love to see it preserve the actual image format!
