
International Research Journal of Engineering and Technology (IRJET)

e-ISSN: 2395-0056

Volume: 10 Issue: 05 | May 2023

p-ISSN: 2395-0072

www.irjet.net

AN EFFICIENT MODEL FOR VIDEO PREDICTION

Nguyen Van Truong1, Trang Phung T. Thu2

1Thai Nguyen University of Education, Thai Nguyen, Viet Nam

2Thai Nguyen University, Thai Nguyen, Viet Nam

---------------------------------------------------------------------***---------------------------------------------------------------------

Abstract - Video prediction aims to generate future frames from given past frames. It is one of the fundamental tasks in computer vision and machine learning. It has attracted many researchers, and various methods have been proposed to address it. However, most of them focus on improving performance and ignore memory and computation cost. In this paper, we propose a lightweight yet efficient network for video prediction. Inspired by depthwise and pointwise convolution in the image domain, we introduce a 3D depthwise and pointwise convolutional neural network for video prediction. The experimental results show that our proposed framework outperforms state-of-the-art methods in terms of PSNR, SSIM and LPIPS on standard datasets such as KTH, KITTI and BAIR.

Index Terms - Video Prediction, Lightweight Model, Video Processing.

1. INTRODUCTION

Video prediction is one of the fundamental problems in computer vision. The goal of this task is to predict future frames from past video frames. The predicted future frames may be in the form of RGB images and/or optical flow. These future frames can be used for a variety of tasks such as action prediction, video encoding, video surveillance, autonomous driving, etc. In recent years, deep learning has significantly improved performance on the video prediction problem. Most of these methods use a convolutional neural network (CNN) model, a Long Short-Term Memory (LSTM) model, or a variant of them, e.g., the ConvLSTM model.

Fig. 1. Overview of video prediction

The video prediction task closely captures the fundamentals of predictive coding modeling, and it is considered an intermediate step between raw video data and decision making. The potential to extract meaningful descriptions of underlying patterns in video data makes the video prediction task a promising avenue for self-supervised representation learning [1]. Unlike still images, video provides complex transformations and patterns of movement across time. At a fine level of detail, if we focus on a small patch at the same spatial location over successive time steps, we can identify a series of locally similar visual distortions due to consistency over time. In contrast, by looking at the big picture, successive frames will be visually different but semantically consistent. This variation in the visual appearance of videos at different scales is mainly due to aberrations, variations in lighting conditions, and camera movement, among other factors. From this time-ordered visual signal, predictive models can extract representative space-time correlations describing movements in a video sequence.

... proposed to solve the problem, mainly based on CNN and LSTM networks, ... Figure 1 shows an overview of the proposed machine learning methods to solve the video prediction problem. In it, a network takes a video, i.e., a sequence of stacked frames, as input, and the output of the network is also a sequence of frames. However, the key difference between the network input and output is that the input frames display objects, including their shape, size, color, motion, etc., at the current time, while the output of the network consists of the predicted frames for the objects' future movements.

Some typical methods are as follows. Kwon et al. [2] proposed a model based on a retrospective cycle GAN to solve the problem. Straka et al. [3] introduced a new network architecture called PreCNet. Meanwhile, Byeon et al. [4] proposed the ContextVP network, which allows both temporal and spatial information to be learned across ConvLSTM layers. In this paper, we focus on the remaining problems that the above deep models have not solved and propose deep learning models that achieve strong results on standard datasets. Specifically, we propose a lightweight deep learning model based on 3D CNN to solve the video prediction problem effectively. Instead of conventional convolution blocks, we use depthwise convolution and pointwise convolution blocks to reduce computational cost and memory usage during training and testing. The test results show that our proposed method gives superior results compared to other state-of-the-art methods.
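
The saving from replacing a conventional 3D convolution with depthwise and pointwise blocks, as proposed in the introduction, can be estimated by counting weights. The sketch below is only an illustration of the general depthwise-separable factorization and is not taken from the paper: the function names, the channel widths (64 and 128) and the 3x3x3 kernel are hypothetical, and the layer configuration actually used by the authors may differ.

```python
def conv3d_params(c_in, c_out, kernel=(3, 3, 3)):
    """Weights in a standard 3D convolution (bias omitted).

    Every output channel has its own kt x kh x kw filter for every input channel.
    """
    kt, kh, kw = kernel
    return c_in * c_out * kt * kh * kw


def separable_conv3d_params(c_in, c_out, kernel=(3, 3, 3)):
    """Weights in a depthwise 3D convolution followed by a pointwise one (bias omitted)."""
    kt, kh, kw = kernel
    depthwise = c_in * kt * kh * kw  # one spatio-temporal filter per input channel
    pointwise = c_in * c_out         # 1x1x1 convolution that mixes channels
    return depthwise + pointwise


# Hypothetical layer sizes, chosen only for illustration.
c_in, c_out = 64, 128
standard = conv3d_params(c_in, c_out)
separable = separable_conv3d_params(c_in, c_out)
print(f"standard:  {standard} weights")
print(f"separable: {separable} weights")
print(f"reduction: {standard / separable:.1f}x")
```

In general, the separable form needs a fraction 1/c_out + 1/(kt*kh*kw) of the weights of the standard convolution, so the saving grows with the number of output channels and with the kernel volume. For the hypothetical sizes above this is roughly a 22-fold reduction.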

Impact Factor value: 8.226

The rest of the paper is organized as follows: Part 2 presents the work related to the video prediction problem in

ISO 9001:2008 Certified Journal

Page 1836