DiT-VTON: Diffusion Transformer Framework for Unified Multi-Category Virtual Try-On and Virtual Try-All with Integrated Image Editing

¹Amazon   ²University of California, Los Angeles
Work done during an internship at Amazon.   *Equal contribution

The DiT-VTON model supports diverse use cases, enabling inpainting within user-specified editing regions with content guided by a reference image. The model can semantically infer and generate the expected objects and textures, perform local editing, and even identify specific body parts for virtual try-on tasks, including multi-garment try-on, showcasing its versatility in content-aware editing and synthesis. We have also pioneered the expansion of this research area beyond traditional garment virtual try-on to virtual try-all, extending its application to a wide range of product categories, including furniture, jewelry, shoes, and other wearables such as scarves, glasses, and handbags.

Abstract

The rapid growth of e-commerce has intensified the demand for Virtual Try-On (VTO) technologies, enabling customers to realistically visualize products overlaid on their own images. Despite recent advances, existing VTO models face challenges with fine-grained detail preservation, robustness to real-world imagery, efficient sampling, image editing capabilities, and generalization across diverse product categories. In this paper, we present DiT-VTON, a novel VTO framework that leverages an architecture based on a Diffusion Transformer (DiT), renowned for its performance on text-conditioned image generation (text-to-image), adapted here for the image-conditioned VTO task. We systematically explore multiple DiT configurations, including in-context token concatenation, channel concatenation, and ControlNet integration, to determine the best setup for VTO image conditioning. Our findings indicate that token concatenation combined with pose stitching yields the best performance. To enhance robustness, we train the model on an expanded dataset encompassing varied backgrounds, unstructured references, and non-garment categories, demonstrating the benefits of data scaling for VTO adaptability. DiT-VTON also redefines the VTO task beyond garment try-on, offering a versatile Virtual Try-All (VTA) solution capable of handling a wide range of product categories and supporting advanced image editing functionalities such as pose preservation, precise localized region editing and refinement, texture transfer, and object-level customization. Experimental results show that our model surpasses state-of-the-art methods on the VTO task on the public VITON-HD and DressCode datasets, achieving superior detail preservation and robustness without reliance on additional image condition encoders. It also surpasses state-of-the-art models with VTA and image editing capabilities on a varied dataset spanning thousands of product categories. As a result, DiT-VTON significantly advances VTO applicability in diverse real-world scenarios, enhancing both the realism and personalization of online shopping experiences.

Key Contributions

  • Virtual Try-All (VTA) Capability: Expands virtual try-on beyond garments to a wide range of product categories, including furniture, jewelry, shoes, and accessories.
  • Diffusion Transformer-based VTO: Leverages DiT for high-fidelity, image-conditioned virtual try-on with superior detail preservation.
  • Multi-Garment Try-On: Allows seamless try-on of multiple garments simultaneously while maintaining realistic composition.
  • Local Editing: Enables precise image customization, including logo refinement and pattern modification, for enhanced personalization.
  • Texture and Style Transfer: Accurately transfers styles and textures from reference images, enabling a more flexible and realistic try-on experience.
  • Advanced Pose Control and Pose-Oriented Generation: Supports pose-preserving transformations and pose-guided image generation without adding extra model components or parameters (a minimal sketch follows this list). Find more details in our research paper "Is Concatenation Really All You Need: Efficient Concatenation-Based Pose Conditioning and Pose Control for Virtual Try-On".
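
Below is a minimal sketch of the concatenation-based pose conditioning idea: the pose map is rendered as an image and stitched spatially next to the masked person image, so the pose signal flows through the existing image pathway. The stitch_pose helper and tensor shapes are illustrative assumptions, not the paper's exact preprocessing.

import torch

def stitch_pose(masked_person: torch.Tensor, pose_map: torch.Tensor) -> torch.Tensor:
    """Spatially stitch a pose map next to the masked person image.

    Both inputs are (B, C, H, W); the result is (B, C, H, 2W). Because the
    pose enters through the existing image channels, no new model components
    or parameters are needed.
    """
    assert masked_person.shape == pose_map.shape
    return torch.cat([masked_person, pose_map], dim=-1)  # concat along width

person = torch.randn(2, 3, 64, 48)      # toy batch of masked person images
pose = torch.randn(2, 3, 64, 48)        # matching pose renderings
print(stitch_pose(person, pose).shape)  # torch.Size([2, 3, 64, 96])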

Overall Pipeline


Illustration of the different DiT-VTON model configurations for effectively integrating image conditions.


We explore the optimal model configuration for integrating additional image conditions into the transformer blocks. (Left) Channel concatenation. Following the convention of UNet-based inpainting models, we concatenate the masked image I_e, the mask I_m, and the latent noise x_t along the channel dimension; the additional reference image I_r is concatenated with the masked image along the spatial dimension. (Middle) ControlNet. Control is added to the diffusion model by copying (part of) the main denoising backbone as a ControlNet that encodes the conditions; the encoded image representation is then fused back into the main backbone via cross-attention, summation, or adaptive normalization layers. (Right) Token concatenation. We patchify each latent image into tokens and directly concatenate all the image tokens together as the transformer input.
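
Below is a minimal sketch of the token-concatenation configuration, assuming a shared convolutional patch embedding; the latent shapes, patch size, and embedding width are illustrative assumptions, not the paper's exact values.

import torch
import torch.nn as nn

class Patchify(nn.Module):
    """Patchify a (B, C, H, W) latent into (B, N, D) tokens via a conv patch embed."""
    def __init__(self, in_ch: int = 4, patch: int = 2, dim: int = 256):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(x).flatten(2).transpose(1, 2)  # (B, N, D)

patchify = Patchify()
B, C, H, W = 2, 4, 32, 24           # illustrative latent-space shapes
x_t = torch.randn(B, C, H, W)       # noisy latent
i_e = torch.randn(B, C, H, W)       # masked (inpainting) image latent
i_r = torch.randn(B, C, H, W)       # reference image latent

# Concatenate all image tokens along the sequence dimension; self-attention
# inside the DiT blocks then lets the noisy tokens attend to the conditions.
tokens = torch.cat([patchify(x_t), patchify(i_e), patchify(i_r)], dim=1)
print(tokens.shape)                 # torch.Size([2, 576, 256])

In this setup the conditions enter purely through the token sequence, so no extra condition encoder or fusion module is required beyond the transformer itself.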

BibTeX

@article{li2024ditvton,
  author    = {Li, Qi and Qiu, Shuwen and Han, Julien and Koo, Kee Kiat and Bouyarmane, Karim},
  title     = {DiT-VTON: Diffusion Transformer Framework for Unified Multi-Category Virtual Try-On and Virtual Try-All with Integrated Image Editing},
  year      = {2024},
}