5 Kaggle Data Sets for Training GANs

This post was originally published by Sadrach Pierre, Ph.D. at Towards Data Science

Generative adversarial networks (GANs) are a class of deep neural network models, introduced by Ian Goodfellow and colleagues in 2014, that are used to generate synthetic data. GANs have been applied to a wide variety of tasks, including enhancing astronomical images, upscaling the resolution of old video games, and, best known of all, ‘deepfakes’, which involve human image synthesis. In this post, I will walk through some interesting data sets that can be used to train GAN models. This catalogue of data can serve as a starting point for those interested in building GAN models.

Let’s get started!

To start, let’s briefly go over the concepts behind GAN models. A GAN is composed of two competing neural networks, a generator and a discriminator. For image data, the generator is typically a modified convolutional neural network that learns to produce synthetic data from random noise. The discriminator is a convolutional neural network that learns to distinguish between real and fake data. As training proceeds, the discriminator gets better at telling real data from fake, and the generator gets better at producing realistic data.
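To make the generator/discriminator competition concrete, here is a minimal sketch of the training loop. It is deliberately a toy and not a convolutional model: the "data" are single numbers, the generator and discriminator are each one linear unit, and the gradients are written out by hand so that only the Python standard library is needed. The real-data distribution (a Gaussian with mean 4), the function names, and the hyperparameters are all assumptions chosen for illustration.

```python
import math
import random


def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def train_toy_gan(steps=500, batch_size=32, lr=0.01, seed=0):
    """Train a 1-D toy GAN and return the learned parameters.

    Generator:      x_fake = a * z + b,  with z ~ N(0, 1)
    Discriminator:  D(x) = sigmoid(w * x + c)
    Real data (an assumption for this sketch): x ~ N(4, 0.5)
    """
    rng = random.Random(seed)
    a, b = 1.0, 0.0  # generator parameters
    w, c = 0.0, 0.0  # discriminator parameters

    for _ in range(steps):
        real = [rng.gauss(4.0, 0.5) for _ in range(batch_size)]
        noise = [rng.gauss(0.0, 1.0) for _ in range(batch_size)]
        fake = [a * z + b for z in noise]

        # Discriminator step: minimise -log D(real) - log(1 - D(fake)).
        grad_w = grad_c = 0.0
        for x_r, x_f in zip(real, fake):
            d_real = sigmoid(w * x_r + c)
            d_fake = sigmoid(w * x_f + c)
            grad_w += -(1.0 - d_real) * x_r + d_fake * x_f
            grad_c += -(1.0 - d_real) + d_fake
        w -= lr * grad_w / batch_size
        c -= lr * grad_c / batch_size

        # Generator step: minimise -log D(fake) (the non-saturating loss),
        # using the discriminator that was just updated.
        grad_a = grad_b = 0.0
        for z in noise:
            x_f = a * z + b
            d_fake = sigmoid(w * x_f + c)
            dloss_dx = -(1.0 - d_fake) * w  # d(-log D(x)) / dx
            grad_a += dloss_dx * z
            grad_b += dloss_dx
        a -= lr * grad_a / batch_size
        b -= lr * grad_b / batch_size

    return {"a": a, "b": b, "w": w, "c": c}
```

Calling `train_toy_gan()` returns the four learned parameters. In this sketch the generator's offset `b` is pushed toward the region the discriminator scores as real, which is the same feedback loop an image GAN uses at much larger scale. GAN training can oscillate rather than settle, so the final values depend on the step count and learning rate.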

Now, let’s get into some interesting data sets.

This data set contains 2,782 abstract art images scraped from wikiart.org. It can be used to train a GAN that generates synthetic images of abstract art. The data set contains images of real abstract art by Van Gogh, Dalí, Picasso, and more.

This data set contains images from screens for novel antibiotics using the roundworm C. elegans. The images show roundworms infected with a pathogen called Enterococcus faecalis. Some show infected roundworms that were not treated with the antibiotic ampicillin, and others show infected roundworms that were treated with it. For those interested in applying GANs to an interesting drug discovery problem, this is a great place to start!

This data set contains chest X-ray images that were clinically labeled by radiologists. There are 336 chest X-ray images from patients with tuberculosis and 326 images from healthy individuals. This is a great data source for those who are interested in getting their feet wet with using GANs for medical image data synthesis.
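The counts above imply a small and nearly balanced data set, which is worth knowing before training on it. A quick check of the arithmetic (the variable names are just for illustration):

```python
# Image counts as stated in the data set description above.
tb_images = 336
healthy_images = 326

total_images = tb_images + healthy_images
tb_fraction = tb_images / total_images

print(f"total images: {total_images}")           # total images: 662
print(f"tuberculosis share: {tb_fraction:.1%}")  # tuberculosis share: 50.8%
```

With only 662 images in total, this is on the small side for training a GAN from scratch.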

This data set contains synthetic images of human faces generated by GANs. The images were scraped from the website This Person Does Not Exist, which serves a new GAN-generated fake face each time you refresh the page. It is a great data set to start with for generating synthetic images with GANs.

This data set contains images of faces with glasses and images of faces without glasses. While these images were generated using GANs, they can also serve as training data for generating additional synthetic images.

CONCLUSIONS

To summarize, in this post we discussed five Kaggle data sets that can be used to train GAN models to generate synthetic images. These data sources should be a good starting point for getting your feet wet with GANs. If you are interested in some useful code to get you started using GANs, check out this Intro to GANs Kaggle notebook. I hope you found this post useful and interesting. Thank you for reading!
