Citation: Yang, X.; Yu, Y.; Wu, X. Double Linear Transformer for Background Music Generation from Videos. Appl. Sci. 2022, 12, 5050. https://doi.org/10.3390/app12105050
Academic Editors: Katia Lida Kermanidis, Phivos Mylonas and Manolis Maragoudakis
Received: 22 April 2022; Accepted: 13 May 2022; Published: 17 May 2022
Article
Double Linear Transformer for Background Music Generation
from Videos
Xueting Yang , Ying Yu * and Xiaoyu Wu
Faculty of Information and Communication Engineering, Communication University of China,
Beijing 100024, China; yangxueting@cuc.edu.cn (X.Y.); wuxiaoyu@cuc.edu.cn (X.W.)
* Correspondence: yuying@cuc.edu.cn; Tel.: +86-10-6577-9427
Abstract: Many music generation studies have achieved effective performance, but they rarely combine the generated music with a given video. We propose a model with two linear Transformers to generate background music for a given video. To enhance the melodic quality of the generated music, we first feed note-related and rhythm-related music features separately into the two Transformer networks, paying particular attention to both the connection and the independence of these music features. Then, to generate music that matches the given video, a state-of-the-art cross-modal inference method is used to establish the relationship between the visual modality and the sound modality. Subjective and objective experiments indicate that the generated background music matches the video well and is also melodious.
Keywords: video background music generation; music feature extraction; linear Transformer
1. Introduction
Music can effectively convey information and express emotions. Compared with a silent video, appropriate background music can make the video content easier to understand and accept. In daily life, however, creating a video soundtrack is often technical and time-consuming work: it requires selecting suitable pieces from a large amount of music and having people capable of using specific tools to edit the corresponding audio segments. Furthermore, existing methods cannot automatically customize appropriate background music for a given video. To address these problems, this paper proposes an automatic background music generation model with two jointly trained linear Transformers. This method ensures convenience of use as well as the uniqueness of the generated music. At the same time, after training on a large amount of data, it ensures both the rhythmicity of the generated music and a high degree of matching with the given video.
Tasks related to the automatic generation of video background music, such as music generation and video-audio matching, have already produced many excellent results. However, as far as we know, most existing works do not consider the association between the generated music and the video. Many works on music generation focus on music generation itself [1,2], and more recent studies have paid attention to controllable music generation [3–5], while seldom [6] combining music generation with videos. As a result, the generated music cannot meet the background requirements of a given video. Furthermore, since there is no paired video–background music dataset, the existing video background music generation methods [6] skillfully established a correspondence between video features and music elements, and then used the video features to change the music elements for different given videos. Although these approaches have achieved breakthrough results, they have paid less attention to the relationship and the independence of musical elements, which leads to weak melodiousness. In this article, the proposed model improves the extraction of musical elements with two jointly trained linear Transformers [7] and uses the above inference method to improve the rhythm of the generated music as well as its match with the given video.
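To make the dual-branch idea concrete, the following is a minimal PyTorch sketch of two jointly trained linear-attention Transformer branches, one consuming note-related tokens and one consuming rhythm-related tokens, whose outputs are fused before prediction. The feature map, layer sizes, token vocabularies, and the late-fusion step are illustrative assumptions only; the sketch omits visual conditioning and does not reproduce the authors' implementation.

# Minimal sketch of a dual-branch linear-attention Transformer (hypothetical sizes).
import torch
import torch.nn as nn
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    # Kernelized attention with phi(x) = elu(x) + 1, linear in sequence length
    # (non-causal form of Katharopoulos et al., 2020).
    q = F.elu(q) + 1                                   # (B, N, D)
    k = F.elu(k) + 1                                   # (B, N, D)
    kv = torch.einsum("bnd,bne->bde", k, v)            # (B, D, D)
    z = 1.0 / (torch.einsum("bnd,bd->bn", q, k.sum(dim=1)) + eps)
    return torch.einsum("bnd,bde,bn->bne", q, kv, z)   # (B, N, D)

class LinearTransformerBlock(nn.Module):
    def __init__(self, dim, hidden=512):
        super().__init__()
        self.q, self.k, self.v = nn.Linear(dim, dim), nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.ff = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x):
        x = self.norm1(x + linear_attention(self.q(x), self.k(x), self.v(x)))
        return self.norm2(x + self.ff(x))

class DualLinearTransformer(nn.Module):
    # Two jointly trained branches: note-related and rhythm-related tokens.
    def __init__(self, vocab_note, vocab_rhythm, dim=256, depth=4):
        super().__init__()
        self.emb_note = nn.Embedding(vocab_note, dim)
        self.emb_rhythm = nn.Embedding(vocab_rhythm, dim)
        self.note_branch = nn.Sequential(*[LinearTransformerBlock(dim) for _ in range(depth)])
        self.rhythm_branch = nn.Sequential(*[LinearTransformerBlock(dim) for _ in range(depth)])
        self.head_note = nn.Linear(2 * dim, vocab_note)
        self.head_rhythm = nn.Linear(2 * dim, vocab_rhythm)

    def forward(self, note_tokens, rhythm_tokens):
        n = self.note_branch(self.emb_note(note_tokens))
        r = self.rhythm_branch(self.emb_rhythm(rhythm_tokens))
        fused = torch.cat([n, r], dim=-1)              # share information across branches
        return self.head_note(fused), self.head_rhythm(fused)

if __name__ == "__main__":
    model = DualLinearTransformer(vocab_note=128, vocab_rhythm=64)
    notes = torch.randint(0, 128, (2, 100))            # batch of note-related tokens
    rhythms = torch.randint(0, 64, (2, 100))           # batch of rhythm-related tokens
    note_logits, rhythm_logits = model(notes, rhythms)
    print(note_logits.shape, rhythm_logits.shape)      # (2, 100, 128) (2, 100, 64)

In this sketch, the two branches are kept separate (independence of note and rhythm features) and are coupled only through the shared fusion and the joint training loss (their connection); how the actual model couples the branches follows the description in the remainder of the paper.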