International Research Journal of Engineering and Technology (IRJET)    e-ISSN: 2395-0056 | p-ISSN: 2395-0072
Volume: 11 Issue: 06 | Jun 2024    www.irjet.net
Text2Video: AI-driven Video Synthesis from Text Prompts

Shankar Tejasvi1, Merin Meleet1

1Department of Information Science and Engineering, RV College of Engineering, Bengaluru, Karnataka, India
Abstract - The emerging discipline of text-to-video synthesis combines computer vision and natural language understanding to create coherent, realistic videos from written descriptions. This research endeavours to bridge the fields of computer vision and natural language processing through a robust text-to-video production system. The system's main goal is to convert text prompts into visually appealing videos using pre-trained models and style transfer techniques, providing a fresh approach to content development. The method demonstrates flexibility and effectiveness by incorporating well-known libraries such as PyTorch, PyTorch Lightning, and OpenCV. Through rigorous experimentation, the work emphasises the potential of style transfer to boost the creative quality of visual outputs by producing videos with distinct styles. The outcomes illustrate how linguistic cues and artistic aesthetics can be combined successfully, as well as the system's implications for media production, entertainment, and communication. This study adds to the rapidly changing field of text-to-video synthesis and exemplifies the opportunities that arise from the fusion of artificial intelligence and multimedia content production.
Key Words: Text-to-Video, Pre-Trained Models, Style Transfer, Multimedia Content Creation, Natural Language Processing
1. INTRODUCTION

Natural language processing (NLP) and computer vision have recently come together to revolutionise the way multimedia material is produced. A fascinating area of this confluence is text-to-video creation, which involves building visual stories from written prompts. This developing topic has attracted significant attention due to its potential applications in a variety of industries, including entertainment, education, advertising, and communication. Text-to-video generation offers a cutting-edge method of information sharing by enabling the transformation of written descriptions into compelling visual content. This work explores the complexities of text-to-video production, with a focus on pre-trained models and style transfer methods. Its goal is to simplify the conversion of textual cues into dynamic video sequences by leveraging the strength of well-known frameworks such as PyTorch, PyTorch Lightning, and OpenCV. In this procedure, contextual information is extracted from the input text and then converted into visual components.
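The paper does not spell out the exact generation pipeline, but the procedure it describes (a pre-trained model turning a prompt into frames, with OpenCV assembling the frames into a video file) can be sketched roughly as below. The model name, the generate_video helper, and the frame and step counts are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only: the paper does not name its pre-trained generator,
# so a Hugging Face diffusers text-to-video checkpoint is assumed here.
import cv2
import numpy as np
import torch
from diffusers import DiffusionPipeline

def generate_video(prompt: str, out_path: str = "output.mp4", fps: int = 8) -> None:
    # Load an assumed pre-trained text-to-video model (hypothetical choice).
    pipe = DiffusionPipeline.from_pretrained(
        "damo-vilab/text-to-video-ms-1.7b", torch_dtype=torch.float16
    ).to("cuda")

    # Convert the text prompt into a sequence of frames.
    frames = pipe(prompt, num_inference_steps=25, num_frames=24).frames[0]

    # Write the frames to disk with OpenCV, matching the paper's tooling.
    h, w, _ = np.asarray(frames[0]).shape
    writer = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))
    for frame in frames:
        arr = np.asarray(frame)
        if arr.dtype != np.uint8:                      # float frames in [0, 1]
            arr = (arr * 255).clip(0, 255).astype(np.uint8)
        writer.write(cv2.cvtColor(arr, cv2.COLOR_RGB2BGR))
    writer.release()

generate_video("a sailboat drifting across a calm lake at sunset")
```

Settings such as num_frames and fps trade off clip length against runtime, and any text-to-video checkpoint exposing a diffusers-style interface could be substituted for the one assumed above.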
The main goal of the work is to investigate how linguistic and visual cues might be combined to produce videos that accurately convey textual material while also exhibiting stylistic detail. Style transfer, a key component of the system, enables existing visual styles to be applied to the produced videos, yielding visually striking results that exemplify creative aesthetics. The system aims to demonstrate the effectiveness of its methodology by producing videos in a variety of styles, thereby showcasing the possibilities for innovation and customisation. As artificial intelligence and multimedia continue to converge, this work contributes to the changing landscape of content creation by providing insights into the opportunities made possible by the interplay between language and visuals. The research highlights the game-changing potential of AI-driven multimedia synthesis by showcasing the capabilities of text-to-video production combined with style transfer.
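As a rough illustration of the style-transfer stage, the sketch below applies classic Gatys-style optimisation to a single frame using torchvision's pre-trained VGG19. The layer choices, loss weights, and the stylise_frame helper are assumptions made for illustration; the paper does not state which style-transfer method or feature layers it uses.

```python
# Per-frame neural style transfer (Gatys-style optimisation) with pre-trained VGG19.
# ImageNet input normalisation is omitted here for brevity.
import torch
import torch.nn.functional as F
from torchvision.models import vgg19, VGG19_Weights

device = "cuda" if torch.cuda.is_available() else "cpu"
vgg = vgg19(weights=VGG19_Weights.DEFAULT).features.to(device).eval()
LAYERS = {1: "style", 6: "style", 11: "style", 20: "style", 22: "content"}  # assumed layer picks

def features(x):
    """Collect Gram matrices (style) and raw activations (content) from VGG19."""
    style, content = [], []
    for i, layer in enumerate(vgg):
        x = layer(x)
        if LAYERS.get(i) == "style":
            b, c, h, w = x.shape
            f = x.view(c, h * w)
            style.append(f @ f.t() / (c * h * w))      # Gram matrix
        elif LAYERS.get(i) == "content":
            content.append(x)
    return style, content

def stylise_frame(content_img, style_img, steps=200, style_weight=1e6):
    """content_img / style_img: float tensors of shape (1, 3, H, W) in [0, 1]."""
    with torch.no_grad():
        style_grams, _ = features(style_img.to(device))
        _, content_feats = features(content_img.to(device))
    out = content_img.clone().to(device).requires_grad_(True)
    opt = torch.optim.Adam([out], lr=0.02)
    for _ in range(steps):
        opt.zero_grad()
        s, c = features(out)
        style_loss = sum(F.mse_loss(a, b) for a, b in zip(s, style_grams))
        content_loss = sum(F.mse_loss(a, b) for a, b in zip(c, content_feats))
        (style_weight * style_loss + content_loss).backward()
        opt.step()
        with torch.no_grad():
            out.clamp_(0, 1)                            # keep pixels in valid range
    return out.detach().cpu()
```

Applying such a routine to each generated frame independently is the simplest option; in practice, temporal-consistency terms are often added on top to reduce flicker between consecutive frames.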
2. LITERATURE REVIEW

The method for zero-shot image categorisation suggested in this study uses human gaze as auxiliary data. A data-collection paradigm built around a discriminating task is proposed in order to increase the information content of the gaze data. The paper also proposes three gaze-embedding algorithms that exploit spatial layout, location, duration, sequential ordering, and the user's concentration characteristics to extract discriminative descriptors from gaze data. The technique is implemented on the CUB-VW dataset, and several experiments are conducted to evaluate its effectiveness. The results show that human gaze discriminates between classes better than mouse-click data and expert-annotated characteristics. The authors acknowledge that although their approach is generalisable to other areas, finer-grained datasets would benefit from different data-collection methodologies. Overall, the suggested strategy provides a more precise and natural way to identify class membership in zero-shot learning contexts. [1]