Despite the rapid advancement of video generation technologies, video virtual try-on (VVT) in unrestricted scenarios (e.g., challenging subject or camera motion, dynamic scenes, and diverse character styles) remains largely unexplored. Current approaches face three major limitations. First, they rely heavily on scarce paired garment-centric datasets, which significantly limits their applicability to arbitrary garments and complex video inputs. Second, substantial spatial misalignment arises when deforming entire garment images to spatiotemporally varying regions across video frames, which hinders model convergence because it conflicts with the priors of pretrained video models. Third, relying solely on front-view garment images can mislead the generation of video frames from markedly different viewpoints. To address these challenges, we propose Try-On Master, a stage-wise framework built upon Diffusion Transformers (DiTs) that decomposes the VVT task into three consecutive stages. In the first stage, a keyframe sampling strategy identifies frames exhibiting pronounced motion or viewpoint variations, providing diverse and informative cues for subsequent video generation. The second stage employs a multi-frame try-on model trained on large-scale person-to-person image data, enabling precise mapping of arbitrary garment types onto the selected keyframes. Finally, the third stage introduces a multi-modal guided video editing model equipped with a visual adapter, which leverages the spatially aligned keyframe try-on images from the previous stage, along with motion features and prompts, to synthesize visually coherent virtual try-on videos. This design makes full use of the priors of pretrained video models and of readily available unpaired human-centric videos. Extensive quantitative and qualitative experiments demonstrate that Try-On Master surpasses existing methods in preserving high-fidelity garment details and temporal stability in real-world scenarios.
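The three-stage decomposition can be pictured as a simple orchestration loop. The sketch below is an illustrative outline only, not the released Try-On Master code: the keyframe selector uses a plain frame-difference heuristic as a stand-in for the motion/viewpoint-aware sampling described above, and `MultiFrameTryOn` / `GuidedVideoEditor` are hypothetical interfaces standing in for the stage-2 and stage-3 models.

```python
# Illustrative sketch of the stage-wise VVT pipeline described above.
# The keyframe heuristic and the two model interfaces are assumptions,
# not the actual Try-On Master implementation.
from typing import List, Protocol
import numpy as np


def sample_keyframes(frames: List[np.ndarray], num_keyframes: int = 4) -> List[int]:
    """Stage 1 (stand-in): greedily pick frames with the largest appearance
    change relative to the previously selected keyframe, as a rough proxy
    for pronounced motion or viewpoint variation."""
    selected = [0]
    for _ in range(num_keyframes - 1):
        last = frames[selected[-1]].astype(np.float32)
        # Score every later frame by mean absolute difference to the last keyframe.
        scores = [
            (i, np.abs(frames[i].astype(np.float32) - last).mean())
            for i in range(selected[-1] + 1, len(frames))
        ]
        if not scores:
            break
        selected.append(max(scores, key=lambda s: s[1])[0])
    return selected


class MultiFrameTryOn(Protocol):
    """Hypothetical stage-2 interface: maps the garment onto each keyframe."""
    def __call__(self, keyframes: List[np.ndarray], garment: np.ndarray) -> List[np.ndarray]: ...


class GuidedVideoEditor(Protocol):
    """Hypothetical stage-3 interface: DiT-based video editing conditioned on
    try-on keyframes, motion features, and a text prompt via a visual adapter."""
    def __call__(self, video: List[np.ndarray], tryon_keyframes: List[np.ndarray],
                 keyframe_ids: List[int], prompt: str) -> List[np.ndarray]: ...


def try_on_video(video: List[np.ndarray], garment: np.ndarray, prompt: str,
                 tryon_model: MultiFrameTryOn, editor: GuidedVideoEditor) -> List[np.ndarray]:
    keyframe_ids = sample_keyframes(video)                                     # Stage 1
    tryon_keyframes = tryon_model([video[i] for i in keyframe_ids], garment)   # Stage 2
    return editor(video, tryon_keyframes, keyframe_ids, prompt)                # Stage 3
```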
* Try-On Master enables virtual try-on of complete outfits—including tops, bottoms, skirts, shoes, socks, and more. If a user uploads only a top, the model can automatically generate and match appropriate bottoms and footwear to complete the outfit. This capability is not available in previous methods.
* Try-On Master is capable of handling complex human motions, including runway walks and 360-degree rotations, with high fidelity in garment detail preservation and robust temporal consistency.
* Try-On Master handles virtual try-on in videos featuring subjects in complex static or dynamic environments.
* Try-On Master can preserve temporal consistency and high-fidelity garment details, even when the input video features challenging camera movements and prominent scene transitions.
* Try-On Master can generate realistic physical dynamics in garment-interaction scenarios, such as hands being inserted into pockets or contact with soft clothing materials.
* Even more interestingly, Try-On Master can outfit cartoon characters with real-world garments, including in highly demanding scenarios involving unrestricted subject poses, camera movement, and dynamic scenes.
The images and audio used in these demos are from public sources or generated by models, and are used solely to demonstrate the capabilities of this research work. If there are any concerns, please contact us (dongxin.1016@bytedance.com) and we will remove the relevant content promptly.