Image Captioning using Deep Learning

This post was originally published at Medium [AI]

We will clean the text in the following ways:

  1. Convert all characters into lowercase.
  2. Perform basic decontractions, i.e. words like won’t and can’t will be converted to will not and cannot respectively.
  3. Remove punctuation from text. Note that full stop will not be removed because the findings contain multiple sentences, so we need the model to generate reports in a similar way by identifying sentences.
  4. Remove all numbers from the text.
  5. Remove all words of length 2 or less. For example, ‘is’ and ‘to’ are removed because they carry little information. The word ‘no’ is kept, however, since adding ‘no’ to a sentence changes its meaning entirely. Be careful with this kind of cleaning step: you need to decide which words to keep and which to drop.
  6. It was also found that some texts contain multiple full stops or spaces or ‘X’ repeated multiple times. Such characters are also removed.
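The six steps above could be sketched roughly as follows. This is a minimal illustration and not the code from the repository: the function name `clean_report`, the short list of handled contractions, and the rule that a run of two or more ‘x’ characters counts as a repeated ‘X’ are all assumptions.

```python
import re

def clean_report(text):
    """Apply the cleaning steps listed above to a single report (sketch)."""
    # Step 1: lowercase; also normalise curly apostrophes so step 2 can match.
    text = text.lower().replace("\u2019", "'")
    # Step 2: basic decontractions (only a few common patterns are handled here).
    text = text.replace("won't", "will not").replace("can't", "cannot")
    text = re.sub(r"n't\b", " not", text)
    # Step 6: drop runs of 'x' (assumed to be placeholder text such as 'XXXX').
    text = re.sub(r"x{2,}", " ", text)
    # Steps 3 and 4: drop punctuation and digits, but keep the full stop.
    text = re.sub(r"[^a-z.\s]", " ", text)
    # Treat every full stop as its own token so short words can be filtered safely.
    text = text.replace(".", " . ")
    kept = []
    for token in text.split():
        if token == ".":
            # Step 6: collapse repeated full stops and skip a leading one.
            if kept and kept[-1] != ".":
                kept.append(token)
        elif len(token) > 2 or token == "no":
            # Step 5: drop words of length <= 2, except 'no'.
            kept.append(token)
    return " ".join(kept).replace(" .", ".")

print(clean_report("The heart is normal in size. No pneumothorax is seen 2 XXXX.."))
# -> the heart normal size. no pneumothorax seen.
```

Note that the order of the steps matters: the decontraction has to run before punctuation is stripped, otherwise the apostrophes it relies on are already gone.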

The model we will develop will generate a report given a combination of two images, and the report will be generated one word at a time. The sequence of previously generated words will be provided as input. Therefore, we will need a ‘first word’ to kick-off the generation process and a ‘last word’ to signal the end of the report. We will use the strings ‘startseq’ and ‘endseq’ for this purpose. These strings are added to our findings. It is important to do this now because when we encode the text, we need these strings to be encoded correctly.

The major step in encoding text is tokenization followed by a consistent mapping from tokens to unique integer values. Tokenization separates a piece of text into smaller units called tokens, which can be either words or characters; in our case they are words. A model cannot work with raw text, so every report has to be broken down and encoded this way before training. Keras provides a built-in utility for this purpose.
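To make the idea concrete, here is a small pure-Python sketch of word-level tokenization and integer encoding. It is not the Keras utility itself; the helper names `build_word_index` and `texts_to_sequences` are made up for this illustration, and splitting on whitespace is an assumption. Index 0 is left unused so that it can serve as the padding value later.

```python
from collections import Counter

def build_word_index(texts):
    """Map every word to a unique integer, most frequent words first."""
    counts = Counter(word for text in texts for word in text.split())
    # Ties are broken alphabetically so the mapping is deterministic.
    ordered = sorted(counts, key=lambda word: (-counts[word], word))
    # Start at 1: index 0 is reserved for padding.
    return {word: index for index, word in enumerate(ordered, start=1)}

def texts_to_sequences(texts, word_index):
    """Replace each known word by its integer id; unknown words are skipped."""
    return [[word_index[w] for w in text.split() if w in word_index]
            for text in texts]

reports = [
    "startseq the lungs are clear endseq",
    "startseq the heart normal size endseq",
]
word_index = build_word_index(reports)
sequences = texts_to_sequences(reports, word_index)
print(sequences[0])  # -> [2, 3, 7, 4, 5, 1]
```

Because ‘startseq’ and ‘endseq’ were added before this step, they get their own ids just like any other word.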

The text is now properly cleaned and tokenized for future use. The full code for all this is available in my GitHub account, whose link is provided at the end of this story.

Images along with partial reports are the inputs to our model. We need to convert every image into a fixed sized vector which can then be fed as input to the model. We will use transfer learning for this purpose.

“In transfer learning, we first train a base network on a base dataset and task, and then we re-purpose the learned features, or transfer them, to a second target network to be trained on a target dataset and task. This process will tend to work if the features are general, meaning suitable to both base and target tasks, instead of specific to the base task.”

VGG16, VGG19 and InceptionV3 are common CNNs used for transfer learning. These are trained on datasets like ImageNet, whose images are completely different from chest X-rays. So logically, they don’t seem to be a good choice for our task. So which network should we use for our problem?

If you are unfamiliar, let me introduce you to CheXNet. CheXNet is a 121-layer convolutional neural network trained on ChestX-ray14, described by its authors as the largest publicly available chest X-ray dataset at the time, containing over 100,000 frontal-view X-ray images with 14 diseases. However, our purpose here is not to classify the images but just to get the bottleneck features for each image. Therefore the last classification layer of this network is not needed.

You can download the trained weights of CheXNet from here.

Last few layers of the CheXNet

If you forgot, we have 2 images as input to our model. So, here is how the bottleneck features are obtained:

Each image is resized to (224,224,3) and passed through CheXNet to obtain a feature vector of length 1024. Later, both feature vectors are concatenated to obtain a feature vector of length 2048. If you notice, we have added an average pooling layer as the last layer. There’s a specific reason for this. Since we are concatenating both images, the model might learn some order of concatenation, for example that image1 always comes after image2 or vice versa, but that isn’t the case here: we are not keeping any order while concatenating them. This problem is solved through pooling, which creates location invariance.

Obtaining Image Features

The code for this is as follows:

These features are stored in a dictionary in pickle format, which can be used for future purposes.

Consider a scenario where you have lots of data, so much that you cannot have all of it at once in the RAM. Purchasing more RAM is obviously not an option for everyone.

The solution can be to feed mini-batches of our data into the model dynamically. This is exactly what data generators do. They can generate the model input dynamically thus forming a pipeline from the storage to the RAM to load the data as and when it is required. Another advantage of this pipeline is, one can easily apply preprocessing routines on these mini-batches of data as they are prepared to feed into the model.
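The idea can be sketched with a plain Python generator. This is only an illustration of the concept and not the input pipeline used in this project; the name `batch_generator` and the toy dictionaries are assumptions.

```python
def batch_generator(ids, features, reports, batch_size):
    """Lazily yield (image_features, reports) mini-batches.

    Only one mini-batch is materialised at a time, so the whole dataset
    never has to be assembled in memory at once.
    """
    for start in range(0, len(ids), batch_size):
        batch_ids = ids[start:start + batch_size]
        yield ([features[i] for i in batch_ids],
               [reports[i] for i in batch_ids])

# Toy data standing in for the pickled bottleneck features and cleaned reports.
features = {"id1": [0.1, 0.2], "id2": [0.3, 0.4], "id3": [0.5, 0.6]}
reports = {
    "id1": "startseq the lungs are clear endseq",
    "id2": "startseq the heart normal size endseq",
    "id3": "startseq the mediastinum unremarkable endseq",
}
batches = list(batch_generator(list(features), features, reports, batch_size=2))
print(len(batches))  # -> 2 (one full batch of 2 and a final batch of 1)
```

Any per-batch preprocessing would go inside the loop, right before the `yield`.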

We will be using for our problem.

We will first divide our dataset into two parts, a train dataset and a validation dataset. While dividing, just make sure that you have enough data points for training and a decent amount for validation as well. The proportion that I chose allowed me to have 2560 data points in my train set and 1147 data points in the validation set.

Now it’s time for us to create the generator for our dataset.

Data Generator

Here we created two data generators, train_dataset for training and cv_dataset for validation. The create_dataset function takes the IDs (which are keys of the dictionary, for the bottleneck features created earlier) and the preprocessed reports, and creates the generator. The generator generates the BATCH_SIZE number of data points at a time.

As mentioned earlier the model that we are going to create will be a word by word model. The model takes as input the image features and the partial sequences to generate the next word in the sequence.

For example: Let the report corresponding to the ‘Image_features_1’ be — “startseq the cardiac silhouette and mediastinum size are within normal limits endseq”.

Then the input sequence would be split into 11 input-output pairs to train the model:

The model takes the Image and Partial Reports as input and outputs the Output Word
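The split into input-output pairs can be reproduced in a few lines. The helper name `make_pairs` is made up for this sketch; the image features, which accompany every pair, are left out to keep the example short.

```python
def make_pairs(report):
    """Split a report into (partial report, next word) training pairs."""
    words = report.split()
    # Every prefix of length i predicts the word at position i.
    return [(words[:i], words[i]) for i in range(1, len(words))]

report = ("startseq the cardiac silhouette and mediastinum size "
          "are within normal limits endseq")
pairs = make_pairs(report)
print(len(pairs))   # -> 11
print(pairs[0])     # -> (['startseq'], 'the')
print(pairs[-1][1]) # -> endseq
```

The report has 12 tokens including ‘startseq’ and ‘endseq’, which is why it yields 11 pairs.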

Note that we are NOT creating these input-output pairs through the generator. The generator only provides us with the BATCH_SIZE number of image features and their corresponding complete reports at a time. The input-output pairs are generated later during the training process, which will be explained in a short while.

A sequence-to-sequence model is a deep learning model that takes a sequence of items (in our case, features of an image) and outputs another sequence of items (reports).

As the encoder processes each item in the input sequence, it compiles the information it captures into a vector called the context. After processing the entire input sequence, the encoder sends the context over to the decoder, which begins producing the output sequence item by item.

The encoder in our case is a CNN which produces a context vector by taking in our image features. The decoder is a Recurrent Neural Network.

In his paper, Where to put the Image in an Image Caption Generator, Marc Tanti introduced several architectures, such as init-inject, par-inject, pre-inject and merge, which specify where an image should be injected while creating an image caption generator. We will use the merge architecture specified in his paper for our problem.


In the “Merge” architecture the RNN is not exposed to the image vector (or a vector derived from the image vector) at any point. Instead, the image is introduced into the language model after the prefix has been encoded by the RNN in its entirety. This is a late binding architecture and it does not modify the image representation with every time step.

Some important conclusions from his paper were used in our implemented architecture. They are:

  • RNN output needs to be regularized with dropout.
  • The image vector should not have a non-linear activation function or be regularized with dropout.
  • The image input vector must be normalized before being fed to the neural network, which was done while obtaining features from CheXNet.


A word embedding is a class of approaches for representing words and documents using a dense vector representation. Keras offers an Embedding layer that can be used for neural networks on text data. It can also use a word embedding learned elsewhere. It is common in the field of Natural Language Processing to learn, save, and make freely available word embeddings.

In our model, the embedding layer maps each word to a 300-dimensional representation using a pre-trained GloVe model. When using a pre-trained embedding, keep in mind that the weights of the layer should be frozen by setting the argument ‘trainable=False’ so that they don’t get updated during training.
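Before the embedding layer can be frozen, it needs a weight matrix whose rows line up with the tokenizer's word ids. A rough sketch of building such a matrix is shown below. The function name `build_embedding_matrix` and the tiny 3-dimensional vectors are assumptions made for illustration; in the real model the vectors would come from the GloVe file and have 300 dimensions.

```python
def build_embedding_matrix(word_index, vectors, dim):
    """Return a (vocab_size + 1) x dim matrix aligned with the word ids.

    Row 0 stays all zeros because id 0 is the padding value, and words
    missing from the pre-trained vectors also keep an all-zero row.
    """
    matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]
    for word, index in word_index.items():
        vector = vectors.get(word)
        if vector is not None:
            matrix[index] = list(vector)
    return matrix

# Toy stand-ins for the tokenizer's word index and the GloVe vectors.
word_index = {"heart": 1, "lungs": 2, "pneumothora": 3}
toy_vectors = {"heart": [0.1, 0.2, 0.3], "lungs": [0.4, 0.5, 0.6]}
matrix = build_embedding_matrix(word_index, toy_vectors, dim=3)
print(matrix[2])  # -> [0.4, 0.5, 0.6]
print(matrix[3])  # -> [0.0, 0.0, 0.0]  (word not found in the vectors)
```

This matrix is what gets loaded into the embedding layer, which is then kept fixed with ‘trainable=False’.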

Model Code:

Model Summary:

Summary of the model parameters


A masked loss function was created for this problem. For example:

If we have a sequence of tokens- [3],[10],[7],[0],[0],[0],[0],[0]

We only have 3 words in this sequence; the zeros correspond to padding, which is not actually part of the report. But the model will think that the zeros are also part of the sequence and will start learning them. When the model starts to correctly predict the zeros, the loss will decrease because, as far as the model is concerned, it is learning correctly. But for us the loss should only decrease if the model is predicting the actual words (non-zeros) correctly.

Therefore we should mask the zeros in the sequence so that the model doesn’t pay attention to them and only learns the needed words in the report.

Masked Loss

The output words are One-Hot-Encoded, therefore CategoricalCrossentropy will be our loss function.
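The effect of masking can be seen in a small framework-free sketch. It is not the loss function used for training: the name `masked_crossentropy` is made up, the targets are given as integer ids rather than one-hot vectors (picking the probability at that id is equivalent), and averaging over the non-padded positions only is an assumption about the reduction.

```python
import math

def masked_crossentropy(true_ids, predicted_probs, pad_id=0):
    """Mean cross-entropy over the positions whose target is not padding."""
    total, count = 0.0, 0
    for true_id, probs in zip(true_ids, predicted_probs):
        if true_id == pad_id:
            continue  # padded position: contributes nothing to the loss
        # Clamp to avoid log(0) when the model assigns zero probability.
        total += -math.log(max(probs[true_id], 1e-12))
        count += 1
    return total / count if count else 0.0

targets = [3, 1, 0, 0]            # two real words followed by padding
uniform = [[0.25] * 4] * 4        # the model is unsure everywhere
loss = masked_crossentropy(targets, uniform)

# Predicting the padding perfectly changes nothing, which is the whole point.
better_on_padding = [[0.25] * 4, [0.25] * 4, [1.0, 0.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0]]
same_loss = masked_crossentropy(targets, better_on_padding)
```

Here `loss` and `same_loss` are equal, so the model gets no credit for learning to predict zeros.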

Remember our data generators? Now it’s time to use them.

Here, the batches provided by the generator are not the actual batches of data that we use for training. Remember that they are not word by word input-output pairs. They just return the image and its corresponding whole report.

We will retrieve each batch from the generator and manually create input-output sequences from it, i.e. we will create our own custom batches of data for training. So here, the BATCH_SIZE logically turns out to be the number of image pairs the model will see in a single batch. We can vary it depending on our system capability. I found this method to be much faster than the traditional custom generators mentioned in other blogs.

Since we are creating our own batches of data for training, we will be using “train_on_batch” for training our model.

Training Steps

The convert function mentioned in the code converts the data from the generator to a word by word input-output pair representation. Then the partial reports were padded to the maximum length of the reports.
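A rough reconstruction of what such a conversion step does is given below. It is a sketch under assumptions, not the function from the repository: zeros are appended after the words (post-padding), the outputs are one-hot encoded as described earlier, and plain lists are used instead of tensors.

```python
def pad_sequence(sequence, max_len, pad_id=0):
    """Truncate or right-pad a list of word ids to exactly max_len entries."""
    sequence = sequence[:max_len]
    return sequence + [pad_id] * (max_len - len(sequence))

def one_hot(index, vocab_size):
    """Return a vocab_size-long vector with a single 1 at position index."""
    vector = [0] * vocab_size
    vector[index] = 1
    return vector

def convert(image_features, sequences, max_len, vocab_size):
    """Expand (image, full report) items into word-by-word training samples."""
    images, partial_reports, next_words = [], [], []
    for features, sequence in zip(image_features, sequences):
        for i in range(1, len(sequence)):
            images.append(features)  # the same image is repeated for every pair
            partial_reports.append(pad_sequence(sequence[:i], max_len))
            next_words.append(one_hot(sequence[i], vocab_size))
    return images, partial_reports, next_words

images, partial_reports, next_words = convert(
    image_features=[[0.5]], sequences=[[2, 3, 1]], max_len=4, vocab_size=4)
print(partial_reports)  # -> [[2, 0, 0, 0], [2, 3, 0, 0]]
print(next_words)       # -> [[0, 0, 0, 1], [0, 1, 0, 0]]
```

A generator batch of BATCH_SIZE reports therefore turns into a much larger training batch, one sample per word.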

Convert Function:

The Adam optimizer was used with a learning rate of 0.001. The model was trained for 40 epochs, but the best results were obtained at the 35th epoch. The results you get might vary due to the stochastic nature of training.

Tensorboard Showing the Loss Plot of the Model

NOTE: Above training has been implemented in Tensorflow 2.1.

Now that we have trained our model, it’s time to prepare our model to predict reports.

For this purpose we have to make some adjustments in our model. This will save us some time during testing.

First we will separate the encoder and decoder part from our model. The features predicted by the encoder will be used as the input to our decoder along with the partial reports.

Inference Setup

By doing this we will only need to predict the encoder features just once while we use that for our greedy search and beam search algorithms.

We will implement both these algorithms for generating text and will see which one works best.

Greedy search is an algorithmic paradigm that builds up a solution piece by piece, always choosing the next piece that offers the most obvious benefit.


  1. The encoder outputs the features of our image. The encoder’s job is finished here. We don’t need to attend to the encoder once we have the features we need.
  2. This feature vector along with the start token- ‘startseq’(our initial input sequence) is given as the first input to the decoder.
  3. The decoder predicts a probability distribution across the whole vocabulary and the word with the maximum probability will be chosen as the next word.
  4. This predicted word along with the previous input sequence will be our next input sequence to the decoder.
  5. Steps 3-4 are continued till we encounter the end token i.e ‘endseq’.
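The loop in steps 2 to 5 can be sketched as follows. The decoder is replaced by a hypothetical `predict_next` callable that returns a word-to-probability dictionary, and the toy lookup table below only stands in for a trained model; a `max_len` cap is added as a safeguard in case ‘endseq’ is never predicted.

```python
def greedy_search(predict_next, start_token, end_token, max_len):
    """Build a sequence by always appending the most probable next word."""
    sequence = [start_token]
    while len(sequence) < max_len:
        probabilities = predict_next(sequence)             # step 3
        word = max(probabilities, key=probabilities.get)   # arg-max over the vocabulary
        sequence.append(word)                              # step 4
        if word == end_token:                              # step 5
            break
    return sequence

# Toy "decoder": the next-word distribution depends only on the last word.
TOY_MODEL = {
    "startseq": {"the": 0.9, "lungs": 0.1},
    "the": {"lungs": 0.7, "heart": 0.3},
    "lungs": {"are": 0.8, "endseq": 0.2},
    "are": {"clear": 0.6, "normal": 0.4},
    "clear": {"endseq": 1.0},
}

def toy_predict_next(sequence):
    return TOY_MODEL[sequence[-1]]

result = greedy_search(toy_predict_next, "startseq", "endseq", max_len=10)
print(" ".join(result))  # -> startseq the lungs are clear endseq
```

In the real setup, `predict_next` would call the decoder with the precomputed image features and the padded partial report.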

Let’s check how our model performs when greedy search is used for report generation.

BLEU Score — Greedy Search :

The Bilingual Evaluation Understudy score, or BLEU for short, is a metric for evaluating a generated sentence against a reference sentence.

A perfect match results in a score of 1.0, whereas a perfect mismatch results in a score of 0.0. The approach works by counting matching n-grams in the candidate text to n-grams in the reference text, where 1-gram or uni-gram would be each token and a bi-gram comparison would be each word pair.
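To show how the n-gram counting works, here is a simplified single-reference BLEU in plain Python. It is a teaching sketch and not the function used to produce the scores in this post: it applies no smoothing, supports only one reference, and the names `modified_precision` and `bleu` are made up.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count every n-gram (as a tuple) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(candidate, reference, n):
    """Share of candidate n-grams that also occur in the reference.

    Each n-gram is credited at most as often as it occurs in the reference.
    """
    candidate_counts = ngrams(candidate, n)
    reference_counts = ngrams(reference, n)
    overlap = sum(min(count, reference_counts[gram])
                  for gram, count in candidate_counts.items())
    total = sum(candidate_counts.values())
    return overlap / total if total else 0.0

def bleu(candidate, reference, max_n=4):
    """Geometric mean of the 1..max_n precisions times a brevity penalty."""
    precisions = [modified_precision(candidate, reference, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # no smoothing: one empty n-gram order zeroes the score
    c, r = len(candidate), len(reference)
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(sum(math.log(p) for p in precisions) / max_n)

reference = "the lungs are clear bilaterally".split()
candidate = "the lungs are clear".split()
print(bleu(reference, reference))           # -> 1.0
print(bleu(candidate, reference, max_n=1))  # every word matches, but the candidate is shorter
```

The second call scores below 1.0 even though every unigram matches, because the brevity penalty punishes candidates that are shorter than the reference.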

BLEU scores after generating reports by Greedy Search

A perfect score is not possible in practice as a translation would have to match the reference exactly. This is not even possible by human translators. The number and quality of the references used to calculate the BLEU score means that comparing scores across datasets can be troublesome.

To learn more about BLEU, click here.

Beam search is an algorithm that expands upon the greedy search and returns a list of most likely output sequences. Each sequence will have a score associated with it. The sequence with the highest score is taken as the final result.

Instead of greedily choosing the most likely next step as the sequence is constructed, the beam search expands all possible next steps and keeps the k most likely, where k, known as the beam width, is a user-specified parameter and controls the number of beams or parallel searches through the sequence of probabilities.

A beam search with a beam width of 1 is nothing but greedy search. Common beam width values are 5–10, but values as high as 1000 or 2000 are used in research to squeeze the best performance out of a model. To read more about beam search, click here.

But keep in mind that with increasing beam width the time complexity also increases. Therefore these are much slower than greedy search.
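A compact sketch of the procedure is shown below. As with any toy example it rests on assumptions: the decoder is a hypothetical `predict_next` callable returning a word-to-probability dictionary, sequences are scored by the sum of log-probabilities without length normalisation, and the lookup table is invented to show a case where greedy search misses the best sequence.

```python
import math

def beam_search(predict_next, start_token, end_token, beam_width, max_len):
    """Return the beam_width best (sequence, log-probability) pairs."""
    beams = [([start_token], 0.0)]
    for _ in range(max_len - 1):
        candidates = []
        for sequence, score in beams:
            if sequence[-1] == end_token:
                candidates.append((sequence, score))  # finished: carry over unchanged
                continue
            for word, probability in predict_next(sequence).items():
                if probability > 0:
                    candidates.append((sequence + [word], score + math.log(probability)))
        # Keep only the k most likely sequences seen so far.
        beams = sorted(candidates, key=lambda item: item[1], reverse=True)[:beam_width]
        if all(sequence[-1] == end_token for sequence, _ in beams):
            break
    return beams

# Toy "decoder" in which the locally best first word leads to a worse report.
TOY_MODEL = {
    "startseq": {"no": 0.6, "lungs": 0.4},
    "no": {"effusion": 0.55, "pneumothora": 0.45},
    "lungs": {"clear": 1.0},
    "effusion": {"endseq": 1.0},
    "pneumothora": {"endseq": 1.0},
    "clear": {"endseq": 1.0},
}

def toy_predict_next(sequence):
    return TOY_MODEL[sequence[-1]]

greedy_like = beam_search(toy_predict_next, "startseq", "endseq", beam_width=1, max_len=10)
wider = beam_search(toy_predict_next, "startseq", "endseq", beam_width=2, max_len=10)
print(greedy_like[0][0])  # -> ['startseq', 'no', 'effusion', 'endseq']   (probability 0.33)
print(wider[0][0])        # -> ['startseq', 'lungs', 'clear', 'endseq']   (probability 0.40)
```

Each step calls the decoder once per surviving beam, which is where the extra running time comes from.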

Beam Search

A beam search doesn’t guarantee better results, but in most cases it gives better ones.

You can check your BLEU scores for beam search using the function given above. But keep in mind that it takes a while (a few hours) to evaluate them.

Now let’s see some predicted reports for our chest X-rays:

Image Pair 1

Original report for Image Pair 1 : “the heart normal size. the mediastinum unremarkable. the lungs are clear.”

Predicted report for Image Pair 1 : “the heart normal size. the mediastinum unremarkable. the lungs are clear.”

The model is predicting the exact same report for this example.

Image Pair 2

Original report for Image Pair 2 : “heart size and pulmonary vascularity within normal limits. no focal infiltrate pneumothora pleural effusion identified.”

Predicted report for Image Pair 2 : “the heart size and pulmonary vascularity appear within normal limits. the lungs are free focal airspace disease. no pleural effusion pneumothora seen.”

Though not identical, the predicted report is very similar to the original report.

Image Pair 3

Original report for Image Pair 3 : “lungs are hyperinflated but clear. no focal infiltrate effusion. heart and mediastinal contours within normal limits. calcified mediastinal identified.”

Predicted report for Image Pair 3 : “the heart size normal. the mediastinal contour within normal limits. the lungs are free any focal infiltrates. there are no nodules masses. no visible pneumothora. no visible pleural fluid. the are grossly normal. there no visible free intraperitoneal air under the diaphragm.”

Well you didn’t expect the model to work flawlessly, did you? No model is perfect, this one ain’t either. Although there are some details which are correctly identified from the image pair 3, there are a lot of extra details produced which may or may not be correct.

The model we created is in no way a perfect one, but it does generate decent reports for our images.

Let’s now look at an advanced model and see whether it improves the current performance or not!!

If you pay attention to the focused part, you can see a river surrounded by hills

The attention mechanism was proposed as an improvement to the encoder-decoder models. The context vector turned out to be a bottleneck for these types of models, making it challenging for them to deal with long sentences. A solution was proposed in Bahdanau et al., 2014 and Luong et al., 2015. These papers introduced and refined a technique called “Attention”, which greatly improved the quality of machine translation systems. Attention allows the model to focus on the relevant parts of the input sequence as needed. Later this idea was implemented for image captioning in the paper, Show, Attend and Tell: Neural Image Caption Generation with Visual Attention.

So, how do we model an attention mechanism for images?

In the case of text, we have a representation for every location of the input sequence. For images, however, we typically use the representation from one of the fully connected layers of a network, and this representation does not contain any location information (just think about it, the layers are fully connected). We need to look at specific portions (locations) of an image to describe what’s there. For example, to describe the size of a person’s heart from the X-ray, we need to look only at the heart area and not at the arms or any other part. So what should be the input to the attention mechanism?

Well, instead of the fully connected representation, we use the output from one of the convolution layers (transfer learning), which has spatial information.

Conv Layer Contains Spatial Information

For example, let the output of the last convolutional layer be a feature map of size (7*14*1024). Here, ‘7*14’ are the actual locations, which correspond to certain portions of the image, and 1024 is the number of channels. We are not paying attention to the channels but to the locations of the image. Therefore, here we have 7*14 = 98 such locations. We can think of it as 98 locations each having a 1024 dimensional representation.
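In code, going from a (7, 14, 1024) feature map to 98 location vectors is just a matter of flattening the two spatial axes. The sketch below uses nested lists and a made-up helper name, `flatten_locations`; with an array library the same thing would be a single reshape to (98, 1024).

```python
def flatten_locations(feature_map):
    """Turn a rows x cols x channels feature map into a list of location vectors."""
    # Row-major order: location (r, c) ends up at index r * cols + c.
    return [channels for row in feature_map for channels in row]

ROWS, COLS, CHANNELS = 7, 14, 1024
# Dummy feature map whose first channel encodes the (row, col) position.
feature_map = [[[float(r * COLS + c)] + [0.0] * (CHANNELS - 1)
                for c in range(COLS)]
               for r in range(ROWS)]

locations = flatten_locations(feature_map)
print(len(locations), len(locations[0]))  # -> 98 1024
```

Each of the 98 entries is then treated like one time step of an input sequence.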

Now we have 98 time steps with 1024 dimensional representations each. We need to now decide how the model should pay attention to these 98 time steps or locations. A simple way is to assign some weights to each location and get a weighted sum of all these 98 locations. If a particular time step is very important in predicting an output, that time step will have a higher weight. Let these weights be denoted as alphas.

We now know that the alphas determine the importance of a particular location: the higher the alpha, the higher the importance. But how do we find the values of alpha? No one is going to give us these values; the model itself should learn them from the data. To enable this we define a function:

e_jt = f_ATT(s_t-1, h_j)

This quantity captures the importance of the j_th input for decoding the t_th output. h_j is the j_th location representation and s_t-1 is the state of the decoder up to that point. We need these two quantities to determine e_jt. f_ATT is just a function which we will define later.

Across all the inputs, we now want these importance scores to sum to 1. It’s just like taking a probability distribution over which input is important by how much. The e_jt values are converted into a probability distribution by taking a softmax.

Converting e_jt into a probability distribution using softmax

Now we have our alphas! The alphas are the softmax of the e_jt values. Alpha_jt denotes the probability of focusing on the j_th input to produce the t_th output.

It’s time to define our function f_ATT. One among many other possible choices is the following:

V, U and W are the parameters which will be learned during the training to determine the value of e_jt.

We have the alphas and we have the inputs; now we just need to take the weighted sum to produce the new context vector, which will be fed to the decoder. In practice these models often work better than plain encoder-decoder models.
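For reference, the three steps can be collected in one place. The form of f_ATT written here is the standard additive (Bahdanau-style) choice, which is consistent with the parameters V, U and W named above; the exact notation is an assumption, and the sums run over the 98 image locations.

```latex
% Score: importance of location j when producing output t
e_{jt} = f_{ATT}(s_{t-1}, h_j) = V^{\top} \tanh\left(U s_{t-1} + W h_j\right)

% Attention weights: softmax of the scores over all locations
\alpha_{jt} = \frac{\exp(e_{jt})}{\sum_{k=1}^{98} \exp(e_{kt})}

% Context vector: weighted sum of the location representations
c_t = \sum_{j=1}^{98} \alpha_{jt} \, h_j
```

Since the alphas come out of a softmax, they are non-negative and sum to 1 for every output step t.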

Like the encoder-decoder model mentioned above, this model will also consist of 2 parts, an encoder and a decoder, but this time the decoder will have an extra attention component in it, i.e. it is an attention decoder. Let’s now write the above explained steps of attention in code for better understanding:
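A dependency-free sketch of a single attention step might look like this. It assumes the additive form of f_ATT, uses plain Python lists instead of tensors, and the tiny dimensions and hand-picked values of U, W and V are placeholders for what would be learned parameters.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]

def softmax(scores):
    """Turn arbitrary scores into weights that are positive and sum to 1."""
    highest = max(scores)  # subtracted for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def additive_attention(h, s_prev, U, W, V):
    """One attention step: scores e_jt, weights alpha_jt and the context vector."""
    projected_state = matvec(U, s_prev)
    scores = []
    for h_j in h:
        projected_location = matvec(W, h_j)
        hidden = [math.tanh(a + b) for a, b in zip(projected_state, projected_location)]
        scores.append(sum(v * x for v, x in zip(V, hidden)))   # e_jt
    alphas = softmax(scores)                                     # alpha_jt
    context = [sum(a * h_j[d] for a, h_j in zip(alphas, h))      # weighted sum
               for d in range(len(h[0]))]
    return alphas, context

# Three locations with 2-dimensional representations and a 1-dimensional decoder state.
h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
s_prev = [0.5]
U = [[0.0], [0.0]]
W = [[1.0, 0.0], [0.0, 1.0]]
V = [1.0, 1.0]
alphas, context = additive_attention(h, s_prev, U, W, V)
print(round(sum(alphas), 6))  # -> 1.0
```

With these values the third location gets the largest weight, because its representation produces the highest score.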

We don’t have to write these lines of code from scratch ourselves while building the model. The Keras library already has a built-in attention layer for this purpose. We will directly use the AdditiveAttention layer, also called Bahdanau attention. You can read more about the layer in its documentation.

The text input to this model will remain the same but as for the image features, this time we’ll be taking the features from the last conv layer of the CheXNet network.

Extracting Image Features for Attention

The final output shape after combining our 2 images will be (None, 7, 14, 1024). So the input to the encoder after reshaping will be (None, 98, 1024). Why reshaping? Well, this has been explained in the attention intro, if you have any doubts, make sure you read the explanation once more.


Attention Model

The model is similar to the encoder-decoder model we saw earlier but with the Attention Component and some minor updates. You can try your own changes if you want, they might produce better results.

Model Architecture:

Model Summary:

Summary of Model Parameters

The training steps will be exactly the same as those of our encoder-decoder model. We’ll be generating batches using the same ‘convert’ function, thus obtaining word by word input-output sequences, and training the model using train_on_batch. The attention model will require a little more memory and computing power than the encoder-decoder model. Therefore, you might have to decrease the batch size for this one. Please refer to the training section of the encoder-decoder model for the full process.

For attention too, the Adam optimizer was used, this time with a learning rate of 0.0001. The model was trained for 20 epochs. The results you get might vary due to the stochastic nature of training.

Tensorboard showing the loss plot

The code for everything can be accessed from my GitHub. Its link has been provided at the end of this blog.

Same as in enc-dec, we’ll be separating the encoder and decoder parts from the model.

This saves us some time during testing.

Now that we have built our model, let’s check whether the BLEU scores obtained are actually an improvement over the previous model:

BLEU Scores of Attention Model after Greedy Search

We can see that it has better performance than the encoder-decoder model with greedy search. Hence it’s definitely an improvement over the previous one.

Now let’s see some scores for beam search:

The BLEU scores are lower than those of greedy search, but they are not far off. It’s noticeable, though, that the scores increase with increasing beam_width. So there might be some value of beam_width at which the scores do cross the greedy values.

Below are some reports generated by the model using greedy search:

Image Pair 1

Original report for Image Pair 1: “heart size and pulmonary vascularity within normal limits. no focal infiltrate pneumothora pleural effusion identified.”

Predicted report for Image Pair 1: “the heart size and mediastinal contours are within normal limits. the lungs are clear. there no pneumothora pleural effusion. there are no acute bony findings.”

The predictions are very similar to the original report.

Image Pair 2

Original report for Image Pair 2: “the heart size and pulmonary vascularity appear within normal limits. the lungs are free focal airspace disease. no pleural effusion pneumothora seen.”

Predicted report for Image Pair 2: “the heart size and pulmonary vascularity appear within normal limits. the lungs are free focal airspace disease. no pleural effusion pneumothora seen.”

The predicted report is exactly the same!!

Image Pair 3

Original report for Image Pair 3: “the heart normal size. the mediastinum unremarkable. the lungs are clear.”

Predicted report for Image Pair 3: “the heart normal size. the mediastinum unremarkable. the lungs are clear.”

In this example too, the model is doing a really good job.

Image Pair 4

Original report for Image Pair 4: “the lungs are clear bilaterally. specifically no evidence focal consolidation pneumothora pleural effusion. cardio mediastinal silhouette unremarkable. visualized osseous structures the thora are without acute abnormality.”

Predicted report for Image Pair 4: “the heart size and mediastinal contours are within normal limits. the lungs are clear. there no pneumothora pleural effusion.”

You can see that this prediction is not really convincing.

“But the beam search for this example was predicting the exact same report even though it was producing lower BLEU scores for the whole test data combined!!!”

So, which one to choose? Well, it’s up to us. Just pick a method that generalizes well.

Here, even our attention model can’t predict every image accurately. As we can see from the example, this pair does not have a side-view image, and the original report contains some complex words which, as some EDA shows, do not occur that often. These might be some of the reasons we do not get a good prediction in some cases. Keep in mind that we are training this model on just 2560 data points. To learn more complex features, the model will need more data.

Now that we have come to the end of this project, let’s summarize what we’ve done:

  • We just saw an application of image captioning in the medical field. We understood the problem and the need for such an application.
  • We saw how to use data generators for the input pipeline.
  • Created an Encoder-Decoder model which gave us decent results.
  • Improved the base results by building an Attention model.
  • As we mentioned we didn’t have a big dataset for this task. A larger dataset will produce better results.
  • No major hyperparameter tuning was done for any of the models. Therefore, better hyperparameter tuning might produce better results.
  • Making use of slightly more advanced techniques, like transformers or BERT, might yield better results.