The process of generating text from images is called image captioning. It requires not only recognizing the objects and the scene but also analyzing their states and identifying the relationships among those objects. Image captioning therefore integrates the fields of computer vision and natural language processing. We introduce a novel image captioning model that is capable of recognizing human faces in a given image using a transformer. The proposed Faster R-CNN-Transformer architecture comprises feature extraction from images, extraction of semantic keywords from captions, and an encoder-decoder transformer.
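As a rough illustration of the pipeline described above, the following PyTorch sketch shows how region features from a Faster R-CNN backbone and semantic keyword embeddings could be fed jointly into an encoder-decoder transformer that predicts caption tokens. The class name, feature dimensions, and parameters (e.g. `CaptioningTransformer`, the 1024-d region features, `num_keywords`) are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch (assumed names and shapes, not the paper's implementation):
# Faster R-CNN region features and keyword embeddings form the encoder input;
# a standard encoder-decoder transformer generates the caption tokens.
import torch
import torch.nn as nn


class CaptioningTransformer(nn.Module):
    def __init__(self, vocab_size, d_model=512):
        super().__init__()
        # Project 1024-d region features (assumed detector output size)
        # into the transformer's model dimension.
        self.region_proj = nn.Linear(1024, d_model)
        self.keyword_embed = nn.Embedding(vocab_size, d_model)
        self.token_embed = nn.Embedding(vocab_size, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=8,
            num_encoder_layers=3, num_decoder_layers=3,
            batch_first=True,
        )
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, region_feats, keyword_ids, caption_ids):
        # region_feats: (B, num_regions, 1024) from the Faster R-CNN backbone
        # keyword_ids:  (B, num_keywords) semantic keywords from captions
        # caption_ids:  (B, T) ground-truth or previously generated tokens
        src = torch.cat(
            [self.region_proj(region_feats), self.keyword_embed(keyword_ids)],
            dim=1,
        )
        tgt = self.token_embed(caption_ids)
        # Causal mask so each position only attends to earlier caption tokens.
        tgt_mask = nn.Transformer.generate_square_subsequent_mask(caption_ids.size(1))
        hidden = self.transformer(src, tgt, tgt_mask=tgt_mask)
        return self.out(hidden)  # (B, T, vocab_size) next-token logits


# Example usage with random tensors standing in for detector output:
model = CaptioningTransformer(vocab_size=10000)
regions = torch.randn(2, 36, 1024)           # 36 detected regions per image
keywords = torch.randint(0, 10000, (2, 5))   # 5 semantic keywords per image
captions = torch.randint(0, 10000, (2, 12))  # partial caption tokens
logits = model(regions, keywords, captions)
```

In this sketch, face recognition would amount to supplying identity labels as part of the keyword vocabulary so that the decoder can emit person names in the generated caption; the exact mechanism used by the proposed model is described in the later sections.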