Control LoRA models for Wan2.1
Not exactly a ControlNet, but similar in function. These work like the Flux tools LoRAs. Example ComfyUI workflows are included.
1.3B T2V
Tile
Like the tile ControlNets for SD, this takes a blurred video as the control signal and generates a high-quality, detailed output. If the results look weird and overcooked, make sure the input is actually blurred. The provided workflow is set up with the correct blur size, but tweaking the radius up or down (~10-15px) biases the output toward larger or smaller details. It works best at 100% denoise strength, but it can also be used at lower denoise strengths for v2v.
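For reference, a minimal sketch of the blur preprocessing in PyTorch (not the exact workflow node; the frame layout and the kernel-size mapping from radius are assumptions):

```python
import torch
from torchvision.transforms import GaussianBlur

def blur_control_video(frames: torch.Tensor, radius: int = 12) -> torch.Tensor:
    # frames: (T, C, H, W) in [0, 1]; a larger radius biases toward larger details
    kernel_size = 2 * radius + 1  # odd kernel covering ~radius px on each side
    return GaussianBlur(kernel_size, sigma=radius / 2.0)(frames)
```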
Training: videos only, 9-13 frames at 624px area-equivalent resolution (total pixel count roughly equal to 624x624 per frame). 62k training steps, about 3 days on 1x3090.
More coming soon
Technical details
Why this instead of a ControlNet? A ControlNet adds significant inference cost and can be difficult to train well. There are many proposed alternatives and modifications, but most are complex to implement and not much easier to train. This method is extremely lightweight at inference (practically free if the LoRA is fused), relatively easy to train, and simple to implement. Concatenating the control features along the input channel dimension and training the whole model with a LoRA lets the control signal be integrated automatically in an optimized way, without any pesky architecture search.
Training method:
Like the Flux tools models: expand the diffusion model's input layer to add new channels, zero-initialize the new channels, and copy the weights and bias from the old input layer into the noise channels of the new layer. Then add a LoRA to the input layer + transformer blocks.
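A sketch of the layer expansion, assuming the input layer is a Conv3d patch embedding with 16 latent channels (as in Wan2.1); exact attribute names may differ:

```python
import torch
import torch.nn as nn

def expand_input_layer(old: nn.Conv3d, extra_channels: int = 16) -> nn.Conv3d:
    new = nn.Conv3d(
        old.in_channels + extra_channels, old.out_channels,
        old.kernel_size, old.stride, old.padding, bias=old.bias is not None,
    )
    with torch.no_grad():
        new.weight.zero_()                             # zero-init everything...
        new.weight[:, : old.in_channels] = old.weight  # ...then copy the noise channels
        if old.bias is not None:
            new.bias.copy_(old.bias)
    return new
```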
A small but significant optimization is to set a higher learning rate for the input layer's LoRA weights. This lets it catch on to the control signal much more quickly and avoids the typical ControlNet training issue of a slow warmup followed by sudden convergence as it escapes the zero init.
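Something like the following optimizer setup; the module names and the 10x ratio are illustrative assumptions, not the exact values used:

```python
# Split trainable LoRA params into input-layer vs. transformer-block groups
input_params = [p for n, p in model.named_parameters()
                if "patch_embedding" in n and p.requires_grad]
block_params = [p for n, p in model.named_parameters()
                if "patch_embedding" not in n and p.requires_grad]
optimizer = torch.optim.AdamW([
    {"params": block_params, "lr": 1e-4},
    {"params": input_params, "lr": 1e-3},  # e.g. 10x LR on the input-layer LoRA
])
```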
The control video is encoded through the VAE (scaled! Don't make the ip2p unscaled-latent mistake!) and then concatenated with the noisy latent input along the channel dimension.
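In diffusers-style pseudocode (the vae API and scaling attribute are assumptions; Wan's VAE may normalize latents differently):

```python
with torch.no_grad():
    control_latent = vae.encode(control_video).latent_dist.sample()
    control_latent = control_latent * vae.config.scaling_factor  # the "scaled" part
model_input = torch.cat([noisy_latent, control_latent], dim=1)   # channel dimension
```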
Inference:
For ComfyUI, a special "reshape_weight" key is included in the LoRA to automatically add the new control channels to the model, which is then conditioned with the InstructPixToPixConditioning node.
To implement it in your own code: expand the input layer the same way as in training, then load the pretrained LoRA. At each sampling step (for the positive and negative conditions alike), concatenate the control latent to the noisy latent along the channel dimension. That's it!
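A minimal sampling-loop sketch under those assumptions; model, scheduler, and the CFG wiring here are illustrative, not any specific library's API:

```python
latent = torch.randn_like(control_latent)          # noise init, same shape as control
for t in scheduler.timesteps:
    x = torch.cat([latent, control_latent], dim=1) # concat every step, every branch
    uncond = model(x, t, neg_cond)
    cond = model(x, t, pos_cond)
    noise_pred = uncond + cfg_scale * (cond - uncond)
    latent = scheduler.step(noise_pred, t, latent).prev_sample
```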