SkinGAN Project using LaMa-Fourier

The goal of this project is to protect the privacy of customers by removing tattoos that could be used to identify them. This is an important issue when managing personal images, as is the case with iToBoS.

To do this, we use AI to replace tattoos with realistic, matching skin. Another interesting use case is the removal of acne. We chose a model called LaMa (Large Mask Inpainting) for its performance with large masks, its light weight, and its ability to generalize well to much higher resolutions.

A regular convolution at a given point only takes its local neighborhood into account, so a CNN needs many layers to integrate information across the whole image. With large masks and small kernels, the receptive field (RF) of a unit may lie entirely inside the masked region for several layers, so that unit sees no known pixels at all!

Existing convolutional models have slowly growing RFs, which wastes computational resources and model parameters. Hence, to correctly inpaint wide and large missing parts of an image, a good architecture should have units whose RF is as wide as possible, as early in the network as possible.
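The slow RF growth can be made concrete with the standard receptive-field recurrence. This is a minimal sketch, assuming plain (non-dilated) convolutions; the function names and the 3x3 kernel / 256-pixel figures are illustrative choices, not values taken from LaMa:

```python
def receptive_field(kernel_sizes, strides=None):
    """Receptive field (in pixels, along one axis) of a stack of conv layers."""
    strides = strides or [1] * len(kernel_sizes)
    rf, jump = 1, 1  # jump = spacing, in input pixels, between adjacent outputs
    for k, s in zip(kernel_sizes, strides):
        rf += (k - 1) * jump
        jump *= s
    return rf


def layers_to_cover(width, kernel_size=3):
    """Number of stride-1 layers needed before the RF spans `width` pixels."""
    n = 0
    while receptive_field([kernel_size] * n) < width:
        n += 1
    return n


# Each stride-1 3x3 layer adds only 2 pixels to the RF.
print(receptive_field([3, 3, 3]))   # 7
print(layers_to_cover(256))         # 128
```

Under these assumptions, a stack of stride-1 3x3 convolutions needs 128 layers before a single unit can see across a 256-pixel-wide image, which is why a mask of that size can swallow the RF for a long time.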

Fast Fourier Convolutions (FFC) allow LaMa to acquire global context and to use information from all parts of the image, right from the first layers, to inpaint masked regions. Before moving on, let's recall the concept of the Fast Fourier Transform (FFT). We use the FFT to transform an image to the frequency (also called Fourier) domain and the inverse transform (IFFT) to recover the image in the spatial domain. The reasoning behind FFC is the following:

  • convolution in the spatial domain = convolution across the neighbors of the pixel.
  • convolution in the Fourier domain = convolution across the neighboring frequencies.
  • every value in the Fourier domain represents information about the whole image.

Therefore, updating a single value in the Fourier domain affects the whole original image!
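The claim above is easy to check numerically. The following is a small NumPy sketch on a random 8x8 array standing in for an image (not LaMa code): it round-trips through the FFT and IFFT, then perturbs one frequency coefficient and counts how many pixels change.

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((8, 8))                 # toy grayscale "image"

spectrum = np.fft.fft2(image)              # spatial -> Fourier domain
restored = np.fft.ifft2(spectrum).real     # Fourier -> spatial domain

# Perturb one single frequency coefficient.
perturbed = spectrum.copy()
perturbed[1, 2] += 5.0

# The result is complex because the mirrored coefficient was not updated too,
# so we look at the magnitude of the change at each pixel.
delta = np.fft.ifft2(perturbed) - np.fft.ifft2(spectrum)
pixels_changed = int(np.count_nonzero(np.abs(delta) > 1e-9))

print(np.allclose(restored, image))        # True: IFFT undoes FFT
print(pixels_changed, "of", image.size)    # 64 of 64
```

Every pixel moves by the same magnitude (5/64 here), which illustrates why an operation applied in the Fourier domain has an image-wide receptive field.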

The Perceptual Loss (PL) is commonly used to measure the distance between the features that a pretrained base network computes for the predicted images and for the ground-truth images. To better capture the global structure, LaMa uses the High Receptive Field PL, whose base network has a fast-growing receptive field.
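One common form of perceptual loss is a mean squared distance between feature maps, averaged over layers. The sketch below assumes that form; `toy_features` is a hypothetical stand-in (multi-scale average pooling) for a real pretrained network, used only to keep the example self-contained:

```python
import numpy as np


def toy_features(image):
    """Hypothetical stand-in for a pretrained network: average-pooled maps at 3 scales."""
    feats = []
    for factor in (1, 2, 4):
        h, w = image.shape[0] // factor, image.shape[1] // factor
        cropped = image[:h * factor, :w * factor]
        feats.append(cropped.reshape(h, factor, w, factor).mean(axis=(1, 3)))
    return feats


def perceptual_loss(feature_fn, predicted, target):
    """Mean squared feature distance, averaged over the layers returned by feature_fn."""
    layer_losses = [
        np.mean((fp - ft) ** 2)
        for fp, ft in zip(feature_fn(predicted), feature_fn(target))
    ]
    return float(np.mean(layer_losses))


target = np.ones((8, 8))
print(perceptual_loss(toy_features, target, target))            # 0.0
print(perceptual_loss(toy_features, np.zeros((8, 8)), target))  # 1.0
```

In a real setup, `feature_fn` would return intermediate activations of a pretrained network; for the high receptive field variant, that network would be one whose RF grows quickly.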

LaMa is trained with an adversarial loss, where the discriminator works on a local patch level and receives “fake” labels only for areas that intersect with the masks. Additionally, a PL on the features of the discriminator is used.
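The labeling rule for generated images can be sketched as follows. This is an illustrative simplification, assuming non-overlapping square patches (a real patch discriminator has overlapping receptive fields); `patch_targets` and its parameters are hypothetical names, not LaMa's API:

```python
import numpy as np


def patch_targets(mask, patch_size):
    """Per-patch discriminator targets for an inpainted image.

    mask: 2D array, non-zero inside the inpainted region.
    Returns a grid with 0.0 ("fake") for patches that intersect the mask
    and 1.0 ("real") for patches made only of known pixels.
    """
    h, w = mask.shape
    gh, gw = h // patch_size, w // patch_size
    blocks = mask[:gh * patch_size, :gw * patch_size].reshape(
        gh, patch_size, gw, patch_size
    )
    touches_mask = blocks.max(axis=(1, 3)) > 0   # max-pool the mask per patch
    return np.where(touches_mask, 0.0, 1.0)


mask = np.zeros((8, 8))
mask[1:3, 5:7] = 1                               # small hole in the top-right patch
targets = patch_targets(mask, patch_size=4)
print(targets)                                   # [[1. 0.] [1. 1.]]
```

Only the top-right patch overlaps the hole, so it is the only one labeled "fake"; the rest of the generated image is treated as real.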

Base LaMa is trained with aggressively generated large masks that use samples from polygonal chains dilated by a high random width and from rectangles of arbitrary aspect ratios. The increased diversity of masks is beneficial for inpainting.
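A toy version of such a mask generator is sketched below in plain NumPy. It is not LaMa's generator: the function name, the default parameters, and the choice to dilate each segment by thresholding the pixel-to-segment distance are all assumptions made for illustration.

```python
import numpy as np


def random_large_mask(height, width, rng, n_segments=4, max_width=20, n_rects=2):
    """Toy large-mask generator: a thick polygonal chain plus random rectangles."""
    mask = np.zeros((height, width), dtype=bool)
    ys, xs = np.mgrid[0:height, 0:width]

    # Polygonal chain: consecutive random vertices joined by segments, each
    # "dilated" by marking every pixel within half_width of the segment.
    vertices = rng.uniform([0, 0], [height, width], size=(n_segments + 1, 2))
    for (y0, x0), (y1, x1) in zip(vertices[:-1], vertices[1:]):
        half_width = rng.uniform(1, max_width) / 2
        dy, dx = y1 - y0, x1 - x0
        length_sq = dy * dy + dx * dx + 1e-12
        # Project each pixel onto the segment, clamped to its endpoints.
        t = np.clip(((ys - y0) * dy + (xs - x0) * dx) / length_sq, 0.0, 1.0)
        dist = np.hypot(ys - (y0 + t * dy), xs - (x0 + t * dx))
        mask |= dist <= half_width

    # Rectangles with arbitrary aspect ratios (clipped at the image border).
    for _ in range(n_rects):
        top, left = rng.integers(0, height), rng.integers(0, width)
        rect_h = rng.integers(1, height // 2 + 1)
        rect_w = rng.integers(1, width // 2 + 1)
        mask[top:top + rect_h, left:left + rect_w] = True
    return mask


rng = np.random.default_rng(0)
mask = random_large_mask(64, 64, rng)
print(mask.shape, round(float(mask.mean()), 2))  # shape and masked fraction
```

Raising `max_width` or the rectangle sizes makes the masks more aggressive, which is the regime the paragraph above describes.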

Some results concerning the SkinGAN project: