U-GAT-IT Architecture

Paper: U-GAT-IT: Unsupervised Generative Attentional Networks with Adaptive Layer-Instance Normalization for Image-to-Image Translation

Architecture:

Generator:

Let x ∈ {Xs, Xt} represent a sample from the source and the target domain.
Our translation model Gs→t consists of an encoder Es, a decoder Gt, and an auxiliary classifier ηs
where ηs(x) represents the probability that x comes from Xs.
Let E_s^k(x) be the k-th activation map of the encoder.
Inspired by CAM, the auxiliary classifier is trained to learn the importance weight of the k-th feature map, w_s^k, by using global average pooling and global max pooling.
By exploiting the importance weights, we can calculate a set of domain-specific attention feature maps a_s(x) = w_s * E_s(x) = {w_s^k * E_s^k(x) | 1 ≤ k ≤ n}, where n is the number of encoded feature maps.
Then our translation model Gs→t becomes equal to Gt(a_s(x)).
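
The attention step above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the function name `cam_attention`, the use of one weight vector per pooling branch, and the concatenation of the two weighted feature sets are all assumptions made for the example.

```python
import numpy as np

def cam_attention(features, w_gap, w_gmp):
    """Weight encoder feature maps by auxiliary-classifier weights.

    features: array of shape (n, H, W), the maps E_s^k(x) for k = 1..n.
    w_gap, w_gmp: arrays of shape (n,), assumed to be the weights of two
    linear classifier heads fed by global average / global max pooling.
    """
    # Global pooling reduces each feature map to one scalar.
    gap = features.mean(axis=(1, 2))   # shape (n,)
    gmp = features.max(axis=(1, 2))    # shape (n,)

    # Auxiliary classifier logits, one per pooling branch.
    logit_gap = float(gap @ w_gap)
    logit_gmp = float(gmp @ w_gmp)

    # a_s(x): scale the k-th map by its importance weight w_s^k.
    att_gap = features * w_gap[:, None, None]
    att_gmp = features * w_gmp[:, None, None]

    # Assumption: the two branches are stacked along the channel axis.
    attention = np.concatenate([att_gap, att_gmp], axis=0)  # (2n, H, W)
    return attention, (logit_gap, logit_gmp)
```

The same weights serve two roles here: they produce the classifier logits from pooled features, and they rescale the unpooled feature maps that are passed on to the decoder.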
In the decoder Gt, we use residual blocks with AdaLIN, whose parameters are dynamically computed by a fully connected layer from the attention map.
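
A rough NumPy sketch of AdaLIN is given below. It assumes the usual formulation in which a gate rho, clipped to [0, 1], mixes an instance-normalized and a layer-normalized version of the input, and the result is scaled by gamma and shifted by beta. The helper `adalin_params`, its weight matrices, and the choice to average-pool the attention map before the fully connected layers are hypothetical and only illustrate how gamma and beta could be derived from the attention map.

```python
import numpy as np

def adalin(a, gamma, beta, rho, eps=1e-5):
    """Adaptive Layer-Instance Normalization on one sample.

    a: (C, H, W) activations; gamma, beta, rho: arrays of shape (C,).
    """
    # Instance norm: statistics per channel over the spatial dimensions.
    mu_i = a.mean(axis=(1, 2), keepdims=True)
    var_i = a.var(axis=(1, 2), keepdims=True)
    a_in = (a - mu_i) / np.sqrt(var_i + eps)

    # Layer norm: statistics over all channels and positions together.
    a_ln = (a - a.mean()) / np.sqrt(a.var() + eps)

    # rho = 1 gives pure instance norm, rho = 0 pure layer norm.
    rho = np.clip(rho, 0.0, 1.0)[:, None, None]
    mixed = rho * a_in + (1.0 - rho) * a_ln
    return gamma[:, None, None] * mixed + beta[:, None, None]

def adalin_params(attention, W_gamma, b_gamma, W_beta, b_beta):
    """Hypothetical fully connected layers producing gamma and beta."""
    pooled = attention.mean(axis=(1, 2))   # assumed pooling, shape (C_att,)
    gamma = W_gamma @ pooled + b_gamma
    beta = W_beta @ pooled + b_beta
    return gamma, beta
```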

Discriminator:

Let x ∈ {Xt, Gs→t(Xs)} represent a sample from the target domain and the translated source domain.
The discriminator Dt consists of an encoder E_Dt, a classifier C_Dt, and an auxiliary classifier ηDt.
Unlike the other translation models, both ηDt(x) and Dt(x) are now trained to discriminate whether x comes from Xt or Gs→t(Xs).
Given a sample x, Dt(x) exploits the attention feature maps a_Dt(x) = w_Dt * E_Dt(x) using the importance weight w_Dt on the encoded feature maps E_Dt(x) that is trained by η_Dt(x).
Then the discriminator Dt(x) becomes equal to C_Dt(a_Dt(x)).
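
The composition Dt(x) = C_Dt(a_Dt(x)) can be sketched as below. The callables `encode` and `classify` stand in for E_Dt and C_Dt and are placeholders; for brevity the sketch uses only a global-average-pooling branch for η_Dt, which is a simplification assumed for this example.

```python
import numpy as np

def discriminator_forward(x, encode, w_dt, classify):
    """Return (Dt(x), eta_Dt(x)) for one sample.

    encode:   placeholder for E_Dt, maps x to features of shape (n, H, W).
    w_dt:     importance weights of shape (n,), learned via eta_Dt.
    classify: placeholder for C_Dt, maps attention maps to a score.
    """
    feats = encode(x)                                   # E_Dt(x)
    # eta_Dt(x): auxiliary logit from pooled features (GAP branch only).
    eta_logit = float(feats.mean(axis=(1, 2)) @ w_dt)
    # a_Dt(x) = w_Dt * E_Dt(x): reweight each encoded feature map.
    attention = feats * w_dt[:, None, None]
    return classify(attention), eta_logit
```

Both returned values are trained on the same real-versus-translated decision, which is the point made above about ηDt(x) and Dt(x).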

Loss Function:

Adversarial Loss: An adversarial loss is employed to match the distribution of the translated images to the target image distribution.
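
The text above does not state which adversarial objective is used. As an illustration only, the sketch below assumes a least-squares GAN loss with target label 1 for real images and 0 for translated images; the function names are made up for this example.

```python
import numpy as np

def lsgan_d_loss(d_real, d_fake):
    """Discriminator side: push Dt(x) to 1 on real, 0 on translated."""
    return np.mean((d_real - 1.0) ** 2) + np.mean(d_fake ** 2)

def lsgan_g_loss(d_fake):
    """Generator side: push Dt(Gs→t(x)) towards the 'real' label 1."""
    return np.mean((d_fake - 1.0) ** 2)
```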

Cycle Loss: To alleviate the mode collapse problem, we apply a cycle consistency constraint to the generator. Given an image x ∈ Xs , after the sequential translations of x from Xs to Xt and from Xt to Xs, the image should be successfully translated back to the original domain.
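
A minimal sketch of the cycle constraint, assuming the reconstruction error is measured as a mean absolute (L1) difference. `g_st` and `g_ts` are placeholder callables for Gs→t and Gt→s.

```python
import numpy as np

def cycle_loss(x, g_st, g_ts):
    """Mean absolute error between x and its round trip Xs -> Xt -> Xs."""
    reconstructed = g_ts(g_st(x))
    return np.mean(np.abs(x - reconstructed))
```

The loss is zero only when the second generator undoes the first, which discourages a generator from mapping many inputs to the same output.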

Identity Loss: To ensure that the color distributions of the input image and output image are similar, we apply an identity consistency constraint to the generator. Given an image x ∈ Xt, after the translation of x using Gs→t, the image should not change.
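
The identity constraint can be sketched the same way, again assuming a mean absolute (L1) difference. The key detail is that Gs→t is fed a sample that is already in the target domain; `g_st` is a placeholder callable.

```python
import numpy as np

def identity_loss(x_t, g_st):
    """Penalize any change Gs→t makes to an image already in Xt."""
    return np.mean(np.abs(x_t - g_st(x_t)))
```

For example, a generator that shifts every pixel by 0.5 incurs a loss of 0.5, while one that leaves target images untouched incurs none.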

CAM Loss: By exploiting the information from the auxiliary classifiers ηs and η_Dt, given an image x ∈ {Xs, Xt}, Gs→t and Dt get to know where they need to improve or what makes the most difference between the two domains in the current state.
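
One way to express the two CAM losses is sketched below. The exact forms are assumptions for illustration: a binary cross-entropy on ηs (source images labelled 1, target images labelled 0) and a least-squares loss on η_Dt (real target images labelled 1, translated images labelled 0).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cam_loss_generator(eta_logits_src, eta_logits_tgt, eps=1e-12):
    """Cross-entropy for eta_s: high for x in Xs, low for x in Xt."""
    p_src = sigmoid(eta_logits_src)
    p_tgt = sigmoid(eta_logits_tgt)
    return -(np.mean(np.log(p_src + eps)) + np.mean(np.log(1.0 - p_tgt + eps)))

def cam_loss_discriminator(eta_real, eta_fake):
    """Least squares for eta_Dt: 1 on real Xt, 0 on Gs→t(Xs)."""
    return np.mean((eta_real - 1.0) ** 2) + np.mean(eta_fake ** 2)
```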

Full Objective: Finally, we jointly train the encoders, decoders, discriminators, and auxiliary classifiers to optimize the final objective:
\(\min_{G_{s \to t}, G_{t \to s}, \eta_s, \eta_t} \; \max_{D_s, D_t, \eta_{D_s}, \eta_{D_t}} \; \lambda_1 L_{gan} + \lambda_2 L_{cycle} + \lambda_3 L_{identity} + \lambda_4 L_{cam}\)
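
The weighted sum can be written as a small helper. The weights are not given in the text above; the defaults below (1, 10, 10, 1000) are an assumption believed to match the settings reported in the paper and should be checked against it before use.

```python
def full_objective(l_gan, l_cycle, l_identity, l_cam,
                   lambdas=(1.0, 10.0, 10.0, 1000.0)):
    """Combine the four loss terms with weights lambda_1..lambda_4.

    The default weights are assumed values, not taken from this document.
    """
    lam1, lam2, lam3, lam4 = lambdas
    return (lam1 * l_gan + lam2 * l_cycle
            + lam3 * l_identity + lam4 * l_cam)
```

In a typical training loop the min and max are handled by alternating updates: the generators and their auxiliary classifiers take a step that lowers this value, then the discriminators and their auxiliary classifiers take a step on their own side of the objective.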

Resource

Papers with code