A Generalized Approach to Virtual Try On

Saurabh Kumar
OffNote Labs
Dec 22, 2021 · 11 min read


Virtual Try On (ViTON) is an immensely practical problem with numerous consumer applications. The literature is vast, with numerous solutions published in top Computer Vision conferences over the years. Yet these solutions have had limited impact in the consumer market: the outputs are often low resolution, not photo-realistic, or marred by texture artifacts. A few recent solutions are finally able to overcome these issues and generate practically useful try-ons.

Most of the published solutions rely on generative adversarial networks (GANs). However, their descriptions are overly complex, with several moving parts, which makes understanding individual solutions, and comparing them, very hard.

In this article, we propose a simple, modular, unified framework that encompasses many different ViTON approaches. We disentangle the components of the proposed solutions and bring out the core set of modules required to build a ViTON solution. We hope that this exposition will benefit and accelerate future research on this topic.

Virtual Try On (ViTON) — Definition and variations

More specifically, the ViTON problem is about pose and texture/style transfer in human body images; the task as a whole is sometimes called person image synthesis.

An example of virtual try-on image transfer. (from the TryOnGAN paper)
  • In pose transfer we want to generate an image of a person P in a new pose. We get two images as input: person P’s image and the image of another person Q in the desired pose. Our goal is to generate P’s image in Q’s pose.
  • In texture transfer, also called style transfer, we want to generate an image of a person in the same pose but with a different style for some body part or garment, e.g. we might want to change the style of the shirt or the skin color of the person. Here the inputs are the image of the person, an image containing the desired style, and the region of the image that is to be changed.
  • If a model can do both tasks, we can generate images of a person in various poses and garments from just a single image of the person plus reference pose and style images.

There are many approaches (ADGAN, PISE, TryOnGAN) to these transfer problems, but they differ in many fine-grained ways. It is hard to compare them and gain a holistic understanding.

So, we asked: is there a generic framework that subsumes all these approaches and highlights the key challenges?

Defining a unified framework

In this article we will try to define a common framework that encompasses most of the methods. We will explore different variations enabled by the framework and how they affect the results generated. We first present the key definitions involved in this setting.

Unified framework

Inputs. The inputs to our model are two images. Each contains a person in a particular pose wearing a set of garments. More formally, we define each image as having multiple attributes: identity, pose and garments.

  • Image A with person I₁ in pose P₁ wearing garments G₁ = {G₁₁, G₁₂, G₁₃}
  • Image B with person I₂ in pose P₂ wearing garments G₂ = {G₂₁, G₂₂, G₂₃}

Outputs. The output of our model depends on the task. Here are some of the tasks we may want to perform (a minimal sketch of these inputs and tasks in code follows the list):

  • Pose transfer: changing the pose of the person.
  • Identity transfer: changing the pose and identity of the person while keeping the garments the same. By identity we mean the combination of face, body shape and skin color, the things that define a person’s visual identity.
  • Garment transfer: changing the shape of a garment on the person while keeping the pose and identity the same.
  • Texture transfer: changing only the texture of the cloth. By texture we mean visual properties of the cloth such as shine or roughness.
  • A combination of any of the previous tasks.
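To make the setup concrete, here is a minimal sketch of how the inputs and a couple of the tasks might be represented in code. All class, field and function names are hypothetical; real methods represent these attributes as keypoints, segmentation maps or latent codes rather than plain arrays.

    from dataclasses import dataclass
    from typing import Dict

    import numpy as np

    # Hypothetical containers for the attributes defined above.
    @dataclass
    class PersonImage:
        identity: np.ndarray              # face / body-shape / skin descriptor
        pose: np.ndarray                  # e.g. 2D keypoints of shape (K, 2)
        garments: Dict[str, np.ndarray]   # e.g. {"shirt": ..., "trousers": ...}

    def pose_transfer(a: PersonImage, b: PersonImage) -> PersonImage:
        # Keep A's identity and garments, take B's pose.
        return PersonImage(identity=a.identity, pose=b.pose, garments=a.garments)

    def garment_transfer(a: PersonImage, b: PersonImage, part: str) -> PersonImage:
        # Keep A's identity and pose, take one garment from B.
        garments = dict(a.garments)
        garments[part] = b.garments[part]
        return PersonImage(identity=a.identity, pose=a.pose, garments=garments)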

Short aside about Image Generators and ViTON. Recent developments in generative modeling enable us to generate very realistic images, using progressive GAN architectures (ProGAN, StyleGAN, etc.). Although GANs are now able to generate photo-realistic images, the vanilla GAN models do not give us control over the features / attributes of the generated object. This problem is somewhat mitigated by conditional GANs, where along with the latent vector we also input a vector with object attribute information. This ensures that the generated image has the desired attributes.

Unfortunately, this approach does not scale either. There may be an arbitrarily large number of features we want to control, and some of them are tricky to encode explicitly. Further, it is impractical to retrain the model every time a new feature class is discovered or desired. Solving ViTON requires us to capture many features related to the human body (face, shape, pose) and the worn garments (texture, color, shape), which are hard to encode explicitly. This is what leads to complex ViTON solutions.
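As a refresher, here is a minimal, generic conditional-generator sketch (not any specific published model): the attribute vector is concatenated with the latent vector, so the generated image is conditioned on the requested attributes. All dimensions and layer choices are illustrative.

    import torch
    import torch.nn as nn

    class ConditionalGenerator(nn.Module):
        def __init__(self, z_dim=64, attr_dim=10, img_size=32):
            super().__init__()
            self.img_size = img_size
            self.net = nn.Sequential(
                nn.Linear(z_dim + attr_dim, 256),
                nn.ReLU(),
                nn.Linear(256, 3 * img_size * img_size),
                nn.Tanh(),
            )

        def forward(self, z, attrs):
            # Conditioning: the attribute vector is simply concatenated with z.
            x = torch.cat([z, attrs], dim=1)
            return self.net(x).view(-1, 3, self.img_size, self.img_size)

    g = ConditionalGenerator()
    z = torch.randn(1, 64)
    attrs = torch.zeros(1, 10)
    attrs[0, 3] = 1.0                      # e.g. a one-hot "garment color" attribute
    fake = g(z, attrs)                     # shape (1, 3, 32, 32)

The example also shows the limitation discussed above: every attribute we want to control must be baked into attr_dim up front.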

A Generic approach — How do we get the output from the inputs?

What are general steps to get the output from the input for any of these transfer problems?

At the architectural level, we find that for all these methods, the four important steps are encode, decompose, modify and compose.

Intuitively, we need to

  • encode the image into semantically meaningful encodings
  • separate the encodings for different attributes (pose, identity, garments)
  • modify the encodings as needed and compose the separated attribute encodings into a new image.

There are two variations of these steps, depending on whether we decompose the image directly or its encoding; a short schematic of both orderings follows the two options below.

  • We may first encode the whole image and then decompose the encoding into different parts corresponding to different attributes.

OR

  • We can first decompose the image into different parts and then encode the separate parts. For example, we could use segmentation maps to divide up an image, encode a masked image for each region, and use pose detection to get pose key points.
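Here is a schematic sketch of both orderings. Every function below is a stand-in (a real system would use CNN encoders, a human parser, a pose detector and a GAN generator); only the structure of the pipeline is the point.

    import numpy as np

    def encode(image):
        # Stand-in whole-image encoder (a CNN in a real system).
        return image.reshape(-1)[:128]

    def decompose_encoding(code):
        # Variation 1: split one entangled code into per-attribute codes.
        return {"pose": code[:32], "identity": code[32:64], "garments": code[64:]}

    def decompose_image(image):
        # Variation 2: split the image first (e.g. via segmentation / pose detection).
        h = image.shape[0] // 2
        return {"top": image[:h], "bottom": image[h:]}

    def modify(codes_a, codes_b, swap_keys):
        # Replace the selected attribute codes of A with those of B.
        out = dict(codes_a)
        for k in swap_keys:
            out[k] = codes_b[k]
        return out

    def compose(codes):
        # Stand-in generator; a real system uses a GAN here.
        return np.concatenate([v.reshape(-1) for v in codes.values()])

    img_a = np.random.rand(64, 64, 3)
    img_b = np.random.rand(64, 64, 3)

    # Variation 1: encode -> decompose -> modify -> compose
    codes_a = decompose_encoding(encode(img_a))
    codes_b = decompose_encoding(encode(img_b))
    out_1 = compose(modify(codes_a, codes_b, swap_keys=["pose"]))

    # Variation 2: decompose -> encode -> modify -> compose
    parts_a = {k: encode(v) for k, v in decompose_image(img_a).items()}
    parts_b = {k: encode(v) for k, v in decompose_image(img_b).items()}
    out_2 = compose(modify(parts_a, parts_b, swap_keys=["top"]))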

Now, let’s look deeper into each of the above four steps and investigate how to implement each step efficiently.

The two steps, decompose and encode, can be performed by a human image encoder:

  • Human images are structurally very similar to each other, so an encoder specialized for human images can exploit this shared structure.
  • A good encoder should detect the pose and segment the different semantic regions such as face, skin, shirts, trousers in a human image.
  • Moreover, if our encodings are vectors (tensors), we would like each attribute to be encoded separately, and to appear at a fixed relative position, say at the beginning of the vector (a small sketch of such a layout follows this list).
  • We call these disentangled encodings. The ideal encoder will create encodings where all attributes/features of interest are disentangled from each other.
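A tiny sketch of what such an ideal layout would look like, assuming a 128-dimensional code with hand-picked attribute slices (the attribute names and slice boundaries are, of course, hypothetical):

    import numpy as np

    # Each attribute occupies a fixed, known slice of the code, so editing one
    # attribute touches only that slice.
    LAYOUT = {"pose": slice(0, 32), "face": slice(32, 64),
              "shirt": slice(64, 96), "trousers": slice(96, 128)}

    code = np.random.rand(128)

    def set_attr(code, name, value):
        out = code.copy()
        out[LAYOUT[name]] = value
        return out

    # Changing the shirt slice leaves every other attribute's slice untouched.
    edited = set_attr(code, "shirt", np.zeros(32))
    assert np.allclose(edited[LAYOUT["pose"]], code[LAYOUT["pose"]])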

However, we find that encoders used in existing ViTON approaches generate mixed and entangled encodings. This has the following consequences:

  • Some part of the code controls multiple unrelated attributes. For example an element of the code vector might control both the hair color and shirt color, and any change to the element will cause multiple changes to the output image.
  • Multiple regions of the code control one attribute, e.g. several scattered elements of the vector jointly control the garment length.
  • Due to the entanglement of encodings we need to add an extra decompose step. The decomposer outputs multiple encodings for different attributes of the image. For example we can encode the global pose separately and compute region-wise encodings for shirts, trousers and face.

Modification. Now, given these encodings and a target task T, how do we accomplish T? Simply by modifying and composing the encodings of the two images. For example, if we want to change the shirt region, we modify the shirt-region encoding within the image encoding:

  • We can swap the shirt region encoding out with another shirt encoding.
  • We can also use interpolations between two shirt encodings to get any desired intermediate result.

Composition. The compose step takes in the modified encodings for the new image and generates the image; the generated image should reflect the modifications made in the modify step. Primarily, we use a generator model for this step, and the information about the image can be injected into the generator at multiple points. The main challenges with the modify and compose steps are:

  • Generating occluded or hidden regions. Some regions might not be visible in any of the original images. For example, a front image of a dress carries little information about what it looks like from the back. If we want to change the pose of the person to a back view, the composer has to fill in, essentially hallucinate, the missing details.
  • Warping and modifying visible regions. In real images a garment’s appearance is altered by folds and shadows. Even if the region to be generated is visible, the model must distinguish the actual garment from folds and shadows. For example, two front images taken from different angles will show slightly different folds and shadows. The composer should be able to go from one angle to the other while preserving the garment’s detail and adjusting shadows and folds accordingly.

Variations in the unified framework

Now that we have defined a unified framework, let us see how existing, state-of-the-art approaches fit into the framework.

Examples of previous papers

ADGAN.

ADGAN architecture
  • This method first separates the image into components. They use pre-trained models to generate pose key points and human segmentation maps. This is the decompose step.
  • Then they use a pre-trained VGG model and a trainable module to encode the RGB image into separate components. Using this they create a style code.
  • The style code can be modified by a simple swap operation.
  • Finally, they use a StyleGAN-type architecture to generate an image using the pose and style code as input. The StyleGAN generator is the composer in our framework (a schematic sketch of this pipeline follows the list).
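A schematic sketch of this ADGAN-style flow, with stand-in modules rather than the authors' code: a real system plugs in a keypoint detector, a human parser, the pretrained VGG encoder and a StyleGAN-like generator. The region names, dimensions and the toy encoder below are illustrative only.

    import torch
    import torch.nn as nn

    class TextureEncoder(nn.Module):
        # Stand-in for the per-region (VGG + trainable module) encoder.
        def __init__(self, style_dim=64):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, style_dim))

        def forward(self, region_rgb):
            return self.net(region_rgb)

    def build_style_code(image, seg_masks, encoder):
        # Decompose: mask each semantic region; encode: one style vector per region.
        codes = [encoder(image * m) for m in seg_masks]   # masks broadcast over RGB
        return torch.cat(codes, dim=1)                     # concatenated style code

    image = torch.rand(1, 3, 128, 128)
    seg_masks = [torch.rand(1, 1, 128, 128).round() for _ in range(3)]  # e.g. face/shirt/trousers
    style_code = build_style_code(image, seg_masks, TextureEncoder())   # shape (1, 192)
    # After a swap on the relevant slice of style_code, the pose keypoints and the
    # style code together drive the StyleGAN-type generator (the compose step).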

PoseWithStyle

Pose With Style architecture
  • This method is similar to ADGAN.
  • Instead of generating 2d points as pose representations, they generate a dense pose, which maps all pixels of an RGB image to the 3D surface of the human body.
  • The dense pose encoder gives a UV map, which is further segmented into body parts.
  • Then they use a VGG-like feature extractor to encode the different body regions.
  • Finally, they use a StyleGAN2 architecture with some modifications to generate an image from the pose and appearance features (a small sketch of building a partial UV texture map follows the list).
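To illustrate the UV-map idea, here is a small sketch of building a partial UV texture map from a dense-pose style output. The arrays u, v (per-pixel UV coordinates) and fg (foreground mask) are stand-ins for what a DensePose-like model would predict; in this sketch they are just random.

    import numpy as np

    H, W, TEX = 256, 192, 128
    image = np.random.rand(H, W, 3)
    u = np.random.rand(H, W)                 # per-pixel U coordinate in [0, 1]
    v = np.random.rand(H, W)                 # per-pixel V coordinate in [0, 1]
    fg = np.random.rand(H, W) > 0.5          # which pixels lie on the body

    texture = np.zeros((TEX, TEX, 3))
    filled = np.zeros((TEX, TEX), dtype=bool)

    ys, xs = np.nonzero(fg)
    tu = (u[ys, xs] * (TEX - 1)).astype(int)
    tv = (v[ys, xs] * (TEX - 1)).astype(int)
    texture[tv, tu] = image[ys, xs]          # scatter visible pixels into UV space
    filled[tv, tu] = True
    # The texture map is pose- and shape-independent: the same body point always
    # lands at the same UV location; regions not visible in the image stay empty.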

TryOnGAN.

The try on optimization setup in TryOnGAN
  • Encoder. Given a StyleGAN generator pretrained on a fashion dataset, they use latent inversion techniques to obtain the style encodings from an RGB image. That is, they first encode the entire image in the form of these style encodings.
  • The pose and style encodings are disentangled by architecture design: pose heatmaps are input at the head of the generator, while style encodings are input laterally to each style-block component.
  • Given the encodings of two images, we need to generate an image that has selected attributes from each. The paper claims that we can find the encoding of the new desired image by interpolating between the latent codes. They use a separate network that learns a linear interpolation between the encodings: its inputs are the two image encodings and a result image encoding, and it predicts linear coefficients that take the input encodings to the output encoding. The interpolation network must therefore have the implicit capability to separate an encoding into different semantic components, which it then combines according to the interpolation coefficients.
  • Finally, the resulting encodings are used to generate images with StyleGAN (a sketch of the per-layer interpolation follows the list).
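Here is a sketch of the layer-wise interpolation mechanics. In TryOnGAN the per-layer coefficients are predicted by the interpolator network; here they are picked by hand, and the number of layers, style dimension and choice of "garment layers" are illustrative.

    import torch

    num_layers, style_dim = 14, 512
    w_person = torch.randn(num_layers, style_dim)    # inverted code of the person image
    w_garment = torch.randn(num_layers, style_dim)   # inverted code of the garment image

    # One coefficient per layer: 0 keeps the person's code, 1 takes the garment's.
    alpha = torch.zeros(num_layers, 1)
    alpha[4:8] = 1.0          # hypothetically, these layers control garment appearance

    w_mixed = (1 - alpha) * w_person + alpha * w_garment
    # w_mixed is then fed to the pretrained StyleGAN generator (the compose step).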

Differences at the module level

In the previous section we saw how different papers fit into our universal framework. In this section we will list out some common techniques that are used to accomplish the steps in the universal framework.

Encode

Most methods separate (disentangle) pose from the other attributes using a separate pose encoder. Here we list some possible ways to encode pose; a small sketch of rendering keypoints into heatmaps follows the list.

Pose encoding

  • Two-dimensional points, where we get a 2D point for each key-point of the human body.
  • Dense pose maps, where we get a dense map for each body part [HumanGAN]. This representation can be seen as a combination of 2D pose key-points and a segmentation map.
  • Three-dimensional points that correspond to key-points in 3D space.
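As a small example of the first option, 2D keypoints are usually rendered into per-joint Gaussian heatmaps before being fed to a convolutional generator. The image size, keypoint locations and sigma below are made up.

    import numpy as np

    H, W, sigma = 256, 192, 4.0
    keypoints = np.array([[96, 40], [96, 80], [60, 120]])   # (x, y) for 3 joints

    ys, xs = np.mgrid[0:H, 0:W]
    heatmaps = np.stack([
        np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
        for x, y in keypoints
    ])                                                        # shape (3, H, W)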

For attributes other than pose, some methods first decompose, then encode, while others encode and then decompose (as discussed earlier). To decompose the image before encoding, the following methods are used:

  • Simple appearance encoding consists of computing a segmentation map for the image. Every region of the human body can then be encoded separately. But this method is not shape independent, and it only accounts for the parts of the body visible in the image, not the whole body.
  • UV texture map. We can instead generate a UV texture map. The map is highly warped, but it is shape independent and does not depend on which parts of the body are visible in the image: a given body part always ends up in the same place on the map, and if it is not visible in the image, that part of the map is simply empty.

For methods that first encode and then decompose, the following techniques are used (a sketch of optimization-based inversion follows the list):

  • Latent Inversion. Given a pretrained GAN, this technique finds an encoding for any image. It was used by the paper “TryOnGAN”.
  • VAE. VAEs produce a latent encoding as an intermediate step of generation. We can use this latent code as the encoding of our image. This technique was used by the paper “HumanGAN”.
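A minimal sketch of optimization-based latent inversion: find a latent code whose generated image matches the target. The toy generator and plain pixel loss below are stand-ins; TryOnGAN inverts into a StyleGAN generator pretrained on a fashion dataset, and such pipelines typically add a perceptual loss as well.

    import torch

    generator = torch.nn.Sequential(torch.nn.Linear(64, 3 * 32 * 32), torch.nn.Tanh())
    target = torch.rand(1, 3 * 32 * 32) * 2 - 1   # stand-in for the real image

    z = torch.zeros(1, 64, requires_grad=True)
    opt = torch.optim.Adam([z], lr=0.05)
    for _ in range(200):
        opt.zero_grad()
        loss = ((generator(z) - target) ** 2).mean()   # pixel reconstruction loss
        loss.backward()
        opt.step()
    # z now (approximately) encodes the target image and can be edited or mixed.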

Modify

  • Swap. In this operation we simply replace the code for an attribute with the desired code. For example, if we want a green shirt instead of a blue one, we replace the blue-shirt code with a green-shirt code.
  • Interpolate. This operation can be used when the desired result is a combination of the two source images; we can pick any intermediate code. E.g. if we want a color in between blue and green, we can average the blue-shirt code and the green-shirt code (a small sketch of both operations follows the list).
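A small sketch of both operations, assuming per-attribute code dictionaries with made-up names and sizes:

    import numpy as np

    codes_a = {"pose": np.random.rand(32), "shirt": np.random.rand(32)}
    codes_b = {"pose": np.random.rand(32), "shirt": np.random.rand(32)}

    # Swap: take the shirt code from B, keep everything else from A.
    swapped = {**codes_a, "shirt": codes_b["shirt"]}

    # Interpolate: a shirt "in between" A's and B's (t = 0.5 averages the codes).
    t = 0.5
    blended = {**codes_a,
               "shirt": (1 - t) * codes_a["shirt"] + t * codes_b["shirt"]}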

Compose

The compose step is performed by a generator network. The network should be able to take as input the modified encodings and generate a photo-realistic image.

  • The most popular model type used for this purpose is StyleGAN.
  • This is because of its state-of-the-art results (StyleGAN generates photo-realistic images across many object classes) and its intuitive structure, which is driven by per-layer latent (style) vectors.
  • Many methods control the generated image by modifying the latent vector or the interpolation coefficients; a simplified sketch of StyleGAN2-style weight modulation follows.
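For intuition, here is a simplified sketch of StyleGAN2-style weight modulation, which is how a style code injects attribute information at every layer of the generator. It is reduced to batch size 1 and omits noise inputs, biases and upsampling; the dimensions are illustrative.

    import torch
    import torch.nn.functional as F

    def modulated_conv2d(x, weight, style, eps=1e-8):
        # x: (1, in_ch, H, W); weight: (out_ch, in_ch, k, k); style: (in_ch,)
        w = weight * style.view(1, -1, 1, 1)                     # modulate
        demod = torch.rsqrt((w ** 2).sum(dim=[1, 2, 3]) + eps)   # demodulate
        w = w * demod.view(-1, 1, 1, 1)
        return F.conv2d(x, w, padding=weight.shape[-1] // 2)

    x = torch.randn(1, 8, 16, 16)
    weight = torch.randn(4, 8, 3, 3)
    style = torch.randn(8)                  # produced from the latent code by a small MLP
    y = modulated_conv2d(x, weight, style)  # shape (1, 4, 16, 16)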


Wrap Up

Virtual Try On is an immensely practical problem with numerous consumer applications. In this article we discussed how many different virtual try-on solutions fit into a general, modular framework:

  • Decompose, Encode, Modify, Compose.

We explored variations and differences for each of these modules. Generating disentangled, high-quality encodings appears to be the fundamental and most challenging problem; existing approaches try to solve it in different, creative ways.

Having a modular framework also enables us to improve each of the steps independently. For example, in the encode step we can explore novel ways to encode images, and the compose step can be improved by better preserving image details. There is also scope for improvement in finding new arrangements and combinations of the steps: for example, TryOnGAN combines the decompose and encode tasks to avoid generating decomposed encodings explicitly.

We hope this article helped you understand various steps and variations in virtual try on methods. We would love to know your thoughts and feedback.

The work was done as part of the AI Research Program at OffNote Labs (LinkedIn, Github, Youtube). Read our other articles on Medium, explore code on Github and apply if interested.



I am a Computer Science student at DTU, Delhi with a keen interest in AI and robotics.