Support aspect-ratio–preserving output (avoid forced 832×480)

#9
by jiuface - opened

Summary
The demo currently always outputs a fixed-resolution video (LANDSCAPE_WIDTH = 832, LANDSCAPE_HEIGHT = 480). Even when a portrait or square image is provided, it is center-cropped and resized to 832×480 in resize_image[_landscape], so subjects get cropped out or look distorted. Could we add an option to preserve the input image’s aspect ratio and either (a) letterbox/pillarbox, or (b) adapt the generated H/W to the input ratio?

Where this happens

  • Fixed target dims:

    LANDSCAPE_WIDTH = 832
    LANDSCAPE_HEIGHT = 480
    
  • Pre-resize crops to target aspect:

    def resize_image_landscape(image):
        target_aspect = LANDSCAPE_WIDTH / LANDSCAPE_HEIGHT
        ...
        return image.resize((LANDSCAPE_WIDTH, LANDSCAPE_HEIGHT), Image.LANCZOS)
    
  • AOT warm-up compiled with fixed H/W:

    optimize_pipeline_(pipe,
        image=Image.new('RGB', (LANDSCAPE_WIDTH, LANDSCAPE_HEIGHT)),
        height=LANDSCAPE_HEIGHT,
        width=LANDSCAPE_WIDTH,
        num_frames=MAX_FRAMES_MODEL,
    )
    

Expected behavior

  • When I upload a portrait (e.g., 1080×1920) or square image, the generated video should keep that aspect ratio (within model constraints), with one of:

    1. Fit (letterbox/pillarbox): pad to nearest valid multiple (e.g., 1024×1824 → pad to 1024×1856 if needed).
    2. Fill (crop): current behavior, but as a user-selectable option.
  • Optionally, offer a few pre-set aspect buckets (1:1, 4:3, 3:2, 16:9, 9:16) and choose the nearest bucket to the input.

Actual behavior

  • All outputs are 832×480. Portrait/square inputs are center-cropped to a 16:9 slice and then scaled to 832×480.

Why this matters

  • Many inputs (selfies, mobile photos) are portrait or square. Forcing 16:9 often chops heads/feet or important context and looks worse in vertical feeds.

Proposed approach (backward-compatible)

  1. Expose an “Aspect Mode” control in the UI

    • aspect_mode = ["Fit (letterbox)", "Fill (crop)"], default to “Fit”.
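    • For example (a minimal sketch, assuming the demo’s UI is Gradio; the control name is illustrative):

    import gradio as gr

    # placed alongside the existing input controls in the demo's layout (assumed)
    aspect_mode = gr.Radio(
        choices=["Fit (letterbox)", "Fill (crop)"],
        value="Fit (letterbox)",
        label="Aspect Mode",
    )
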
  2. Compute target H/W dynamically while respecting model multiples

    MOD = 16  # or 32 if required by Wan kernels
    def compute_target_size(w, h, max_side=1024, mod=MOD):
        # scale so the longer side = max_side, preserve ratio
        if w >= h:
            new_w = max_side
            new_h = int(round(h * max_side / w))
        else:
            new_h = max_side
            new_w = int(round(w * max_side / h))
        # snap to model-friendly multiples
        new_w = max(mod, (new_w // mod) * mod)
        new_h = max(mod, (new_h // mod) * mod)
        return new_w, new_h
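
    # example (illustrative): compute_target_size(1080, 1920, max_side=832, mod=16) -> (464, 832),
    # i.e. a 1080x1920 portrait input keeps a portrait target instead of being forced to 832x480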
    
  3. Implement letterbox/pad for “Fit”

    from PIL import Image, ImageOps
    
    def letterbox(image, target_w, target_h, fill=(0, 0, 0)):
        # thumbnail() keeps the aspect ratio and only ever shrinks; smaller inputs
        # are left as-is and simply get more padding
        img = image.copy()
        img.thumbnail((target_w, target_h), Image.LANCZOS)
        # split any leftover space evenly between the two sides
        pad_w = target_w - img.width
        pad_h = target_h - img.height
        padding = (pad_w // 2, pad_h // 2, pad_w - pad_w // 2, pad_h - pad_h // 2)
        return ImageOps.expand(img, padding, fill=fill)
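
    # example: letterbox(img, 464, 832) on a 1080x1920 input scales it to roughly 464x825,
    # then pads ~7 black rows (split between top and bottom) to reach exactly 464x832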
    
  4. Warm-up/compile per chosen size (or opt into dynamic)

    • If keeping AOT export, cache optimize_pipeline_ by (width,height) so each distinct shape warms once.
    • Alternatively, switch to torch.compile(..., dynamic=True, mode="reduce-overhead") to tolerate multiple shapes (trade-off: tiny perf drop vs. flexibility).
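    • A per-shape warm-up cache could look like this (a minimal sketch; assumes optimize_pipeline_ takes the same arguments as in the warm-up above and is safe to call once per shape, names are illustrative):

    warmed_shapes = set()

    def ensure_warmed(pipe, width, height):
        # warm/compile each distinct (width, height) exactly once
        if (width, height) in warmed_shapes:
            return
        optimize_pipeline_(
            pipe,
            image=Image.new('RGB', (width, height)),
            height=height,
            width=width,
            num_frames=MAX_FRAMES_MODEL,
        )
        warmed_shapes.add((width, height))
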
  5. Use the computed size at call time

    in_w, in_h = input_image.size
    tgt_w, tgt_h = compute_target_size(in_w, in_h, max_side=832)  # reuse 832 as longest side
    if aspect_mode == "Fit (letterbox)":
        resized = letterbox(input_image, tgt_w, tgt_h)
    else:  # Fill (crop)
        resized = input_image.copy()
        # center-crop to tgt aspect, then resize to (tgt_w, tgt_h)
        in_aspect = in_w / in_h
        tgt_aspect = tgt_w / tgt_h
        if in_aspect > tgt_aspect:
            new_w = int(round(in_h * tgt_aspect))
            left = (in_w - new_w) // 2
            resized = resized.crop((left, 0, left + new_w, in_h))
        else:
            new_h = int(round(in_w / tgt_aspect))
            top = (in_h - new_h) // 2
            resized = resized.crop((0, top, in_w, top + new_h))
        resized = resized.resize((tgt_w, tgt_h), Image.LANCZOS)
    
    # ensure the same H/W are passed to the pipeline
    output = pipe(
        image=resized,
        height=resized.height,
        width=resized.width,
        ...
    )
    
  6. (Optional) Aspect buckets to curb recompiles

    • Snap to nearest among: [(832,832), (768,512), (832,480), (480,832)].
    • This caps the number of AOT warm-ups while still covering square/landscape/portrait.
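    • A minimal sketch of the nearest-bucket pick (the bucket list is just the example set above; the helper name is illustrative):

    import math

    BUCKETS = [(832, 832), (768, 512), (832, 480), (480, 832)]

    def nearest_bucket(w, h, buckets=BUCKETS):
        # compare aspect ratios in log space so wide and tall deviations are weighted symmetrically
        target = math.log(w / h)
        return min(buckets, key=lambda wh: abs(math.log(wh[0] / wh[1]) - target))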

Acceptance criteria

  • A new UI control allows choosing “Fit (letterbox)” vs. “Fill (crop)”.
  • When uploading a portrait image, output video keeps a portrait aspect without stretching (letterbox or adaptive H/W).
  • Pipeline is warmed/compiled for the selected shape (or uses dynamic compile) and does not crash when switching between portrait/square/landscape inputs.
  • Default behavior remains unchanged if users don’t touch the new option (or it defaults to “Fit”, whichever you prefer).

Environment

  • Space: zerogpu-aoti/wan2-2-fp8da-aoti-faster (latest main)
  • Uses Wan-AI/Wan2.2-I2V-A14B-Diffusers + bf16 transformers, FP8 quant, AOT compile
  • PyTorch nightly in the Space (2.8 per the setup block)

Notes
If AOT export requires static shapes, pre-warming a small set of aspect buckets is a practical compromise. I’m happy to help test or provide a PR if the approach above sounds reasonable.
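
For instance, the pre-warm could simply reuse the existing warm-up call over an illustrative bucket list:

    for w, h in [(832, 480), (480, 832), (832, 832)]:
        optimize_pipeline_(
            pipe,
            image=Image.new('RGB', (w, h)),
            height=h,
            width=w,
            num_frames=MAX_FRAMES_MODEL,
        )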

Thank you very much for this post! I was about to write a similar one but you did it 10 times better than I ever could. Great work! I absolutely agree, I would love to see it preserve the actual image format!
