Automated Image Caption

Image captioning is an exciting application of deep learning that combines computer vision and natural language processing. A human can tell at a glance what is happening in an image, but for machines the task is quite complex, and it remained a major challenge until the arrival of deep learning models such as CNNs and LSTMs.

Image captioning has a wide range of applications, and the leading AI companies have built successful products on it. Microsoft’s ‘Caption-Bot’ is an excellent example, and Facebook added a similar feature a few years ago. Image captioning can also serve as a basis for image sentiment analysis. In effect, we can write a story about what is happening inside an image.

The crux of image captioning lies in extracting features from an image and then using them as input to a model that can generate a description for us. An LSTM is a natural choice for the generating part. A hybrid generative model that combines the strengths of a CNN and an LSTM should therefore be able to fulfill our task.

Before jumping into the problem, let us look at the dataset used for this task. We will use the Flickr-8k dataset, which contains captions for each of its images. Each image has multiple captions, so that much of the variation in how it can be described is covered. Also, the dataset does not contain any famous people or places, so the description of an image is learned from the objects in it alone, which helps the model generalize to any real-world image.
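
As a rough sketch of how the captions could be prepared, the snippet below groups several captions under each image and wraps them in start and end markers. It assumes a caption file in which every line looks like "<image name>#<caption index>", a tab, then the caption; the sample lines, the function names, and the startseq/endseq markers are illustrative assumptions, not part of the original article.

```python
import string
from collections import defaultdict

# Hypothetical sample in the assumed "<image>#<index><TAB><caption>" layout.
SAMPLE = (
    "img_001.jpg#0\tA giraffe stands near a tall tree .\n"
    "img_001.jpg#1\tTwo giraffes are eating leaves !\n"
    "img_002.jpg#0\tTwo dogs play on the road .\n"
)

def clean_caption(text):
    """Lowercase, strip punctuation, and keep only purely alphabetic tokens."""
    table = str.maketrans("", "", string.punctuation)
    words = text.lower().translate(table).split()
    return " ".join(w for w in words if w.isalpha())

def load_captions(raw):
    """Group cleaned captions by image name, adding start/end markers."""
    captions = defaultdict(list)
    for line in raw.strip().splitlines():
        image_tag, caption = line.split("\t", 1)
        image_id = image_tag.split("#")[0]  # drop the "#<index>" suffix
        captions[image_id].append(f"startseq {clean_caption(caption)} endseq")
    return dict(captions)

captions = load_captions(SAMPLE)
print(len(captions["img_001.jpg"]))
print(captions["img_002.jpg"][0])
```

The start and end markers give the decoder a fixed first input and a way to signal that the caption is finished.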

Conceptually, our model is inspired by the classic encoder-decoder model used for machine translation, with one difference: the encoder is a CNN rather than a variant of an RNN. We will use a pre-trained model such as VGG-16 or ResNet-50 to extract features from the images instead of training our own, because these models have been trained on the large ImageNet dataset and already capture complex features. It would also be cumbersome to build a model from scratch that learns comparable features. We take only the output of the penultimate layer, since we are interested in the features and not in the classification.

After extracting features with the pre-trained model, we feed them to the decoder, which uses LSTMs to generate words. The dataset already provides captions for each image. Our aim is to maximize the probability of the word that appears in the dataset caption, given the image features and the word generated at the previous time step as input. The following image should make this clear.
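
To make the decoder's training signal concrete, here is a minimal sketch of how one caption could be unrolled into (previous words, next word) pairs. The function names, the left-padding with 0, and the startseq/endseq markers are assumptions made for illustration; in a full model each pair would also be accompanied by the image feature vector.

```python
def build_vocab(captions):
    """Map every word to a positive integer id; 0 is reserved for padding."""
    words = sorted({w for c in captions for w in c.split()})
    return {w: i + 1 for i, w in enumerate(words)}

def make_training_pairs(caption, word_to_id, max_len):
    """Turn one caption into (left-padded prefix, next word id) pairs."""
    ids = [word_to_id[w] for w in caption.split()]
    pairs = []
    for t in range(1, len(ids)):
        prefix = ids[:t][-max_len:]  # words seen so far, truncated to max_len
        padded = [0] * (max_len - len(prefix)) + prefix
        pairs.append((padded, ids[t]))  # target is the word at step t
    return pairs

caption = "startseq giraffes standing in a field endseq"
vocab = build_vocab([caption])
pairs = make_training_pairs(caption, vocab, max_len=7)
for prefix, target in pairs:
    print(prefix, "->", target)
```

A caption of seven tokens yields six pairs: the first asks the model to predict 'giraffes' from the start marker alone, and the last asks it to predict the end marker from the whole sentence.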

As we can see, the image shows several giraffes standing together, and we have a caption for it as the ground truth. We are supposed to tune the parameters so that at time step 1 the probability of getting ‘Giraffes’ as the output is maximized. Mathematically, given the image vector and the words from the previous time steps, we have to maximize the probability of the correct word, P(S_t | I, S_1, S_2, ..., S_{t-1}), where I is the image vector produced by VGG-16 and S_1, S_2, ..., S_{t-1} are the words output by the LSTM at time steps 1, 2, ..., t-1 respectively.
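
Written out over a whole caption, this objective can be sketched as follows, assuming the usual maximum-likelihood formulation. The symbols \theta (the model parameters) and N (the caption length) are introduced here for illustration and do not appear in the original text.

```latex
% Chain rule: the probability of a caption S = (S_1, ..., S_N) given image I
\log P(S \mid I) = \sum_{t=1}^{N} \log P(S_t \mid I, S_1, \dots, S_{t-1})

% Training: choose parameters that maximize this over all (image, caption) pairs
\theta^{*} = \arg\max_{\theta} \sum_{(I, S)} \sum_{t=1}^{N}
             \log P(S_t \mid I, S_1, \dots, S_{t-1}; \theta)
```

Maximizing the sum of log-probabilities is equivalent to maximizing each per-step probability P(S_t | I, S_1, ..., S_{t-1}) of the ground-truth word.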

taj alagawani

You can easily use the API in your own application or website from the link below.
I have equipped a trained node with a Swagger UI for direct communication without authentication.
Please do not forget to cite the source in your work: "Image Caption, by Taj Noah 2020".

Open your terminal and paste this command after replacing your_image_url with the path to your image file:

curl -X POST "" -H "accept: application/json" -H "Content-Type: multipart/form-data" -F "image=@your_image_url;type=image/jpeg"

Or check my Swagger API on the IBM cluster from HERE.
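
To call the API from a Python application instead of the terminal, the curl request could be reproduced roughly as below using only the standard library. This is a sketch under assumptions: API_URL is left empty exactly as in the curl command and must be filled in with the real endpoint, the helper names are hypothetical, and the response is returned as raw text because its format is not documented here.

```python
import os
import uuid
import urllib.request

API_URL = ""  # assumption: left empty as in the curl command; set the real endpoint

def build_multipart(field, filename, data, content_type="image/jpeg"):
    """Encode one file as a multipart/form-data body; returns (body, header value)."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + data + tail, f"multipart/form-data; boundary={boundary}"

def request_caption(image_path, url=API_URL):
    """POST a local image under the form field 'image' and return the raw reply."""
    with open(image_path, "rb") as f:
        body, content_type = build_multipart(
            "image", os.path.basename(image_path), f.read()
        )
    req = urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"accept": "application/json", "Content-Type": content_type},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")

# Building the body needs no network access, so it can be inspected offline.
body, header = build_multipart("image", "photo.jpg", b"\xff\xd8fake-jpeg-bytes")
print(header.startswith("multipart/form-data; boundary="))
```

Once API_URL is set, request_caption("photo.jpg") would send the same form field and headers as the curl command.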