Abstract: Virtual Try-On (VTON) synthesizes realistic images of a person wearing a target garment, with broad applications in e-commerce and fashion. Diffusion-based dual-UNet methods achieve strong results but double the parameter count by dedicating a separate network to garment conditioning. Spatial concatenation offers a simpler single-network alternative, yet both UNet- and DiT-based instantiations report that full fine-tuning is ineffective, and the community has settled for attention-only training. We ask: why does full fine-tuning fail, and can this be resolved? Through what is, to our knowledge, the first visualization study of dual-UNet reference network behavior, we identify a unifying insight: garment conditioning must be decoupled from the denoising process. Spatial concatenation violates this by embedding the garment within the denoising target, causing three conflicts: guidance leakage, gradient competition, and train-test discrepancy. We derive three design principles to restore this decoupling and implement them purely as a recipe atop an unmodified standard architecture. The resulting model, DeCo-VTON (860M params), achieves the single-network state of the art, matching the dual-UNet state of the art at half the cost while being preferred in human evaluation.
From: Kihyun Na
[v1] Mon, 24 Nov 2025 05:19:44 UTC (35,093 KB)
[v2] Tue, 30 Jun 2026 04:34:08 UTC (22,578 KB)