Abstract
With the proliferation of display devices with heterogeneous aspect ratios, video retargeting has received considerable research attention. Inconsistent retargeting can significantly degrade a video's spatial and temporal quality, particularly in extreme retargeting cases. Since no well-annotated datasets exist for video retargeting, deep learning-based techniques are rarely utilized. This paper proposes a method that learns to retarget videos by detecting salient areas and shifting them to appropriate locations. First, we segment the salient objects using a unified Transformer model. Then, using convolutional layers and a 1D-convolution-based shifting strategy, we shift and warp the objects to a suitable size and location in the frame. We also apply a frame interpolation technique to preserve temporal information. To train the network, we feed the retargeted frames to a variational auto-encoder that maps them back to the input frames. In addition, we design perceptual and wavelet-based loss functions to train our model. Thus, the network is trained in an unsupervised manner. Extensive qualitative and quantitative experiments and ablation studies on the DAVIS dataset show the superiority of the proposed method over existing state-of-the-art methods.