
International Research Journal of Engineering and Technology (IRJET)

e-ISSN: 2395-0056

Volume: 11 Issue: 03 | Mar 2024

p-ISSN: 2395-0072

www.irjet.net

Image Caption Generator using Deep Learning Algorithm – VGG16 and LSTM

B. Bhaskar Rao1, Kalki Chaitanya Lade2, Avinash Pasumarthy3, Parasuram Swamy Katreddy4, Garapati Chaitanya Nagendra Kumar5

1 Professor, Dept of CSE, GITAM (Deemed to be University), Visakhapatnam, Andhra Pradesh, India
2,3,4,5 Student, GITAM (Deemed to be University), Visakhapatnam, Andhra Pradesh, India

---------------------------------------------------------------------***--------------------------------------------------------------------

Abstract - This paper presents a deep learning Image Caption Generator (ICG) built on the VGG16 architecture. The aim of the study is to create a model that produces captions for input photographs that are both descriptive and relevant to the context. To close the semantic gap between textual descriptions and visual content, our approach leverages the rich feature representations learned by the pre-trained convolutional layers of VGG16. In the proposed model, the encoder is a convolutional neural network (CNN) based on VGG16 that extracts high-level features from images, and the decoder uses recurrent neural networks (RNNs) to generate captions from those features, so that the captions match the content and context of the input image. Attention mechanisms in the decoder let the model focus on the relevant image regions while generating each caption word, which makes the generated captions more diverse and informative. In addition, methods such as beam search and vocabulary augmentation are used to encourage coherence and diversity in the generated captions. Test results on reference datasets such as Flickr8k show that the proposed method produces informative captions for a variety of photos. Both qualitative and quantitative assessments indicate that the model generates linguistically and contextually relevant captions, highlighting its potential for uses such as image understanding, retrieval, and accessibility for people with visual impairments.

Key Words: Artificial Intelligence, Convolutional Neural Network, Caption Generator, VGG16, LSTM

1. INTRODUCTION

The Image Caption Generator creates insightful captions for images. It analyzes visual content and produces coherent written descriptions by using sophisticated neural network architectures. The model combines convolutional neural networks (CNNs) and recurrent neural networks (RNNs) to produce captions sequentially from information extracted from images. A sizable dataset of matched photos and captions is used to train the Image Caption Generator. During training, the model learns to link particular visual cues with relevant textual descriptions. As a result, it can generalize and create captions for new, unseen photos. After being trained, the Image Caption Generator can provide a relevant and appropriate caption for an input image. The captions produced by the model seek to highlight the important elements and details in the picture, offering a useful aid for automatic description and comprehension of images. To do this, the student uses a variety of datasets of matched photos and captions, which form the basis for training the deep learning model. During training, the algorithm picks up complex relationships and patterns between written descriptions and visual elements. Pre-trained CNNs facilitate effective feature extraction from images, while RNNs generate the caption sequentially, keeping it coherent and contextually appropriate. The project showcases the student's expertise in machine learning and deep learning techniques by implementing and optimizing cutting-edge neural network architectures. The resulting Image Caption Generator is promising for a number of uses, such as helping people with visual impairments access visual content, improving search engine indexing of content, and contributing to the wider artificial intelligence community.

Because image caption generators can automatically provide descriptive captions or written summaries for photographs, they have become indispensable tools for a variety of applications. These tools are essential for improving accessibility, since they offer descriptions for people who are blind or visually impaired and support others who might find visual content difficult to understand. Additionally, they make it easier for search engines and databases to index and retrieve content, because they allow images to be recognized and classified according to their content.

On social media, image caption generators automatically create descriptions for visual content, helping users share it more quickly and with less effort. These generators also speed up the production of relevant captions for photos used in articles, blogs, or presentations in content-creation workflows such as journalism or blogging. In educational settings, they aid accessibility and comprehension by describing the visuals used in learning materials. Additionally, in research and analysis contexts, image captioning technology can automate tasks such as sentiment analysis and image categorization. Image captions improve

Impact Factor value: 8.226

ISO 9001:2008 Certified Journal

Page 1135