# Support aspect-ratio–preserving output (avoid forced 832×480)
## Summary

The demo currently always outputs a fixed-resolution video (`LANDSCAPE_WIDTH = 832`, `LANDSCAPE_HEIGHT = 480`). Even when a portrait or square image is provided, it is center-cropped and resized to 832×480 in `resize_image[_landscape]`. This crops subjects out of frame or leaves them looking distorted. Could we add an option to preserve the input image's aspect ratio and either (a) letterbox/pillarbox, or (b) adapt the generated H/W to the input ratio?
## Where this happens

Fixed target dims:

```python
LANDSCAPE_WIDTH = 832
LANDSCAPE_HEIGHT = 480
```

Pre-resize crops to the target aspect:

```python
def resize_image_landscape(image):
    target_aspect = LANDSCAPE_WIDTH / LANDSCAPE_HEIGHT
    ...
    return image.resize((LANDSCAPE_WIDTH, LANDSCAPE_HEIGHT), Image.LANCZOS)
```

AOT warm-up compiled with fixed H/W:

```python
optimize_pipeline_(
    pipe,
    image=Image.new("RGB", (LANDSCAPE_WIDTH, LANDSCAPE_HEIGHT)),
    height=LANDSCAPE_HEIGHT,
    width=LANDSCAPE_WIDTH,
    num_frames=MAX_FRAMES_MODEL,
)
```
## Expected behavior

When I upload a portrait (e.g., 1080×1920) or square image, the generated video should keep that aspect ratio (within model constraints), with one of:

- Fit (letterbox/pillarbox): pad to the nearest valid multiple (e.g., 1024×1824 → pad to 1024×1856 if needed).
- Fill (crop): the current behavior, but as a user-selectable option.

Optionally, offer a few preset aspect buckets (1:1, 4:3, 3:2, 16:9, 9:16) and choose the bucket nearest to the input (see the snapping sketch under "Aspect buckets" below).
## Actual behavior

- All outputs are 832×480. Portrait/square inputs are center-cropped to a 16:9 slice and then scaled to 832×480.
## Why this matters

- Many inputs (selfies, mobile photos) are portrait or square. Forcing 16:9 often chops off heads, feet, or important context, and the result looks worse in vertical feeds.
## Proposed approach (backward-compatible)

Expose an "Aspect Mode" control in the UI: `aspect_mode = ["Fit (letterbox)", "Fill (crop)"]`, defaulting to "Fit".
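If the demo's UI is Gradio (typical for ZeroGPU Spaces), the control could be a simple radio; this wiring is a sketch, not the Space's actual code:

```python
import gradio as gr

# Hypothetical component; name and placement are illustrative.
aspect_mode = gr.Radio(
    choices=["Fit (letterbox)", "Fill (crop)"],
    value="Fit (letterbox)",  # default to "Fit"
    label="Aspect Mode",
)
```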
Compute target H/W dynamically while respecting model multiples
MOD = 16 # or 32 if required by Wan kernels def compute_target_size(w, h, max_side=1024, mod=MOD): # scale so the longer side = max_side, preserve ratio if w >= h: new_w = max_side new_h = int(round(h * max_side / w)) else: new_h = max_side new_w = int(round(w * max_side / h)) # snap to model-friendly multiples new_w = max(mod, (new_w // mod) * mod) new_h = max(mod, (new_h // mod) * mod) return new_w, new_h
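As a quick sanity check on the helper above: a 1080×1920 portrait input maps exactly to 576×1024, so the 9:16 ratio is preserved and both dimensions are multiples of 16:

```python
# 1080x1920 scaled so the longer side is 1024 -> (576, 1024); no snapping needed.
assert compute_target_size(1080, 1920) == (576, 1024)
```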
Implement letterbox/pad for "Fit":

```python
from PIL import Image, ImageOps

def letterbox(image, target_w, target_h, fill=(0, 0, 0)):
    img = image.copy()
    # thumbnail() fits within the box while preserving aspect ratio
    # (it only downscales; smaller inputs are left as-is and padded)
    img.thumbnail((target_w, target_h), Image.LANCZOS)
    pad_w = target_w - img.width
    pad_h = target_h - img.height
    padding = (pad_w // 2, pad_h // 2, pad_w - pad_w // 2, pad_h - pad_h // 2)
    return ImageOps.expand(img, padding, fill=fill)
```
Warm-up/compile per chosen size (or opt into dynamic shapes):

- If keeping AOT export, cache `optimize_pipeline_` by `(width, height)` so each distinct shape warms up once, as sketched below.
- Alternatively, switch to `torch.compile(..., dynamic=True, mode="reduce-overhead")` to tolerate multiple shapes (trade-off: a tiny perf drop vs. flexibility).
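A minimal sketch of the per-shape cache, where `ensure_warm` is a hypothetical wrapper around the Space's existing `optimize_pipeline_`:

```python
_warmed_shapes = set()

def ensure_warm(pipe, width, height):
    # Run the AOT warm-up at most once per distinct (width, height).
    key = (width, height)
    if key in _warmed_shapes:
        return
    optimize_pipeline_(
        pipe,
        image=Image.new("RGB", (width, height)),
        height=height,
        width=width,
        num_frames=MAX_FRAMES_MODEL,
    )
    _warmed_shapes.add(key)
```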
Use the computed size at call time:

```python
in_w, in_h = input_image.size
tgt_w, tgt_h = compute_target_size(in_w, in_h, max_side=832)  # reuse 832 as the longest side

if aspect_mode == "Fit (letterbox)":
    resized = letterbox(input_image, tgt_w, tgt_h)
else:  # Fill (crop)
    # center-crop to the target aspect, then resize to (tgt_w, tgt_h)
    resized = input_image.copy()
    in_aspect = in_w / in_h
    tgt_aspect = tgt_w / tgt_h
    if in_aspect > tgt_aspect:
        new_w = int(round(in_h * tgt_aspect))
        left = (in_w - new_w) // 2
        resized = resized.crop((left, 0, left + new_w, in_h))
    else:
        new_h = int(round(in_w / tgt_aspect))
        top = (in_h - new_h) // 2
        resized = resized.crop((0, top, in_w, top + new_h))
    resized = resized.resize((tgt_w, tgt_h), Image.LANCZOS)

# ensure the same H/W are passed to the pipeline
output = pipe(
    image=resized,
    height=resized.height,
    width=resized.width,
    # ... remaining args unchanged
)
```
(Optional) Aspect buckets to curb recompiles:

- Snap to the nearest among `[(832, 832), (768, 512), (832, 480), (480, 832)]`, as sketched below.
- This caps the number of AOT warm-ups while still covering square/landscape/portrait.
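One way to snap (a sketch; `snap_to_bucket` is a hypothetical helper): compare aspect ratios in log space so that, e.g., 2:1 and 1:2 are equally far from 1:1:

```python
import math

BUCKETS = [(832, 832), (768, 512), (832, 480), (480, 832)]

def snap_to_bucket(w, h, buckets=BUCKETS):
    # Pick the bucket whose aspect ratio is closest to the input's,
    # measured as |log(input_ratio / bucket_ratio)|.
    in_ratio = w / h
    return min(buckets, key=lambda wh: abs(math.log(in_ratio / (wh[0] / wh[1]))))
```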
## Acceptance criteria

- A new UI control allows choosing "Fit (letterbox)" vs. "Fill (crop)".
- When uploading a portrait image, the output video keeps a portrait aspect without stretching (letterbox or adaptive H/W).
- The pipeline is warmed/compiled for the selected shape (or uses dynamic compile) and does not crash when switching between portrait/square/landscape inputs.
- Default behavior remains the same if users don't touch the new option (or it defaults to "Fit", whichever you prefer).
## Environment

- Space: `zerogpu-aoti/wan2-2-fp8da-aoti-faster` (latest main)
- Uses `Wan-AI/Wan2.2-I2V-A14B-Diffusers` + bf16 transformers, FP8 quant, AOT compile
- PyTorch nightly in the Space (2.8 per the setup block)
## Notes

If AOT export requires static shapes, pre-warming a small set of aspect buckets is a practical compromise (see the sketch below). I'm happy to help test or provide a PR if the approach above sounds reasonable.
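For instance, the bucket list could be warmed once at startup, reusing the hypothetical `ensure_warm` and `BUCKETS` from the sketches above:

```python
# A few extra warm-ups at startup, but no cold compiles at request time.
for bucket_w, bucket_h in BUCKETS:
    ensure_warm(pipe, bucket_w, bucket_h)
```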
---

Thank you very much for this post! I was about to write a similar one, but you did it 10 times better than I ever could. Great work! I absolutely agree: I would love to see it preserve the actual image format!