Abstract
The proliferation of display devices with diverse aspect ratios has drawn much research attention to video retargeting. Inconsistent retargeting can significantly degrade a video's spatial and temporal quality, particularly in extreme retargeting cases. Because no well-annotated datasets exist for video retargeting, deep learning-based techniques are rarely applied. This paper proposes a method that learns to retarget videos by detecting salient regions and shifting them to appropriate locations. First, we segment the salient objects using a unified Transformer model. We then warp and reposition these objects to the appropriate size and location in the frame with convolutional layers, using 1D convolutions to shift the salient objects within each scene. Additionally, we employ a frame interpolation technique to preserve temporal information. To train the network without annotations, we feed the retargeted frames into a variational auto-encoder that maps them back to the input frames. We further design perceptual and wavelet-based loss functions, so the entire network is trained in an unsupervised manner. Extensive qualitative and quantitative experiments on the DAVIS dataset demonstrate the superiority of the proposed method over existing image- and video-based methods.
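To make the shifting step concrete, the sketch below shows how a 1D convolution can translate image rows horizontally: a kernel that is a displaced delta implements a pure shift, whereas the learned kernels in our network would additionally deform content. The function name, parameters, and the fixed delta kernel are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def shift_rows_1d(frame: torch.Tensor, offset: int, width: int = 15) -> torch.Tensor:
    """Shift a (C, H, W) frame horizontally by `offset` pixels via 1D convolution.

    Hypothetical sketch: a learned shifting layer would predict the kernel
    instead of using this hand-built delta. Positive `offset` moves content
    to the right; `width` must be odd and larger than 2 * |offset|.
    """
    assert width % 2 == 1 and abs(offset) < width // 2
    c, h, w = frame.shape
    # A delta displaced from the kernel center by -offset; cross-correlating
    # with it (what F.conv1d computes) translates the signal by +offset.
    kernel = torch.zeros(1, 1, width)
    kernel[0, 0, width // 2 - offset] = 1.0
    rows = frame.reshape(c * h, 1, w)          # treat every image row as a 1D signal
    shifted = F.conv1d(rows, kernel, padding=width // 2)
    return shifted.reshape(c, h, w)

# Usage: move the salient content of a frame 5 pixels to the right.
frame = torch.rand(3, 64, 128)
moved = shift_rows_1d(frame, offset=5)
```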