https://doi.org/10.31449/inf.v48i8.5444                                                                                            Informatica 48 (2024) 63–78   63 
Evolutionary Deep Learning for Sequential Data Processing in Music 
Education 
Lin Jing 
School Quanzhou Normal University; Quanzhou Fujian, 362000, China 
E-mail: lj2771258@163.com 
 
Keywords: sequential data processing, evolutionary deep learning, music education applications  
 
Received: November 20, 2023 
In response to the shortcomings of insufficient music structure, this article proposes a structured model 
based on motivational phrases and phrases. Starting from the composition structure of motivational 
phrases, deep learning techniques are used to learn composition. In the music generation model, a Scratch 
music generation model that can generate Pianoroll format music is constructed by using a generative 
adversarial network based on emotions and time structures. And use convolutional neural networks in the 
generator and discriminator to improve training speed. The effectiveness and practicality of the two 
algorithm models were verified through multiple comparative experiments and algorithm effectiveness 
experiments. This method achieves structural feature extraction of music by designing feature extractors 
at different music granularities. By designing feature expression functions at multi-scale music 
granularity, the music structure embedded in the music itself is incorporated into the reward function. 
Use forward backward propagation method to update the parameters of the model, and use dropout 
technique to improve the model's ability to resist overfitting. The test results show that the model has 
specific generalization ability, with an accuracy rate of 90%, and high recall and accuracy of the model. 
The experimental results show that this method can achieve better music generation results than the 
reward function method based on manual rules and before and after relationships. Solved the problem of 
lacking knowledge of music theory to propose rules, and compensated for the pain of insufficient 
utilization of music structure information in network models based on context. 
Povzetek: Študija uvaja evolucijsko globoko učenje za glasbeno izobraževanje, ki uporablja GAN za 
generiranje glasbe in CNN za izboljšanje hitrosti učenja. 
 
1 Introduction 
Introducing music-disciplinary core literacy is an essential 
symbol of today's deepening music curriculum reform. 
Unlike the established music curriculum that focuses on 
the learning of music subject knowledge and music skills, 
the subject core literacy means that music teaching should 
shift from teaching content-based instruction to student 
development-based education [1]. In this context, 
highlighting students' subjectivity in the classroom and 
developing comprehensive music literacy has become an 
important development direction of music classroom 
teaching reform in the era of core literacy. For front-line 
teachers, how can they highlight the centrality of students 
in teaching, thus prompting classroom teaching changes 
and innovations? This is not only an important sign of 
classroom transformation but also a typical feature of 
reflecting a student-centred classroom. Therefore, deep 
learning around students' active participation and inquiry 
has become a meaningful way to transform classroom 
teaching today. Music education must move towards a 
new stage focusing on students' all-around development 
based on promoting their learning of basic music 
knowledge and skills [2]. Therefore, music teaching 
should go beyond the machine learning and training of 
specific knowledge and skills, emphasize the perception 
and experience of music, guide students to participate in  
 
activities such as music creation, music expression, and   
music understanding, and form a new type of learning 
based on independent inquiry and active construction. In 
a word, saying goodbye to the mechanical and passive 
knowledge of the past and moving towards in-depth and 
inquiry-based learning is an essential choice for music 
teaching in the new era to develop students' core literacy 
in the subject [3]. 
Promoting music curriculum reform and teaching 
innovation with deep learning is a crucial way to achieve 
the fundamental goal of moral education in music 
curricula in the current era of core literacy. To this end, 
this paper selects a new perspective of deep learning. It 
conducts a theoretical argument and practical exploration 
of how the high school music curriculum reflects a new 
ecology of student-centered classroom teaching. This 
paper is a theoretical demonstration and a practical 
exploration of how a new ecology of student-centered 
classroom teaching can be reflected in the high school 
music curriculum. To achieve this goal, the thesis starts 
from the perspective of the fun nature of the learning mode 
of the smart education platform. It introduces the game 
mode into the platform through deep reinforcement 
learning, which has received much attention in recent 
years, to increase students' interest in using the smart 
education platform without putting too much effort into 
the game, avoiding students' burnout in the face of the 
64   Informatica 48 (2024) 63-78                                                                                                                                                     L. Jing 
unchanging learning mode, and thus motivating them to 
engage in more active learning activities [4]. To a certain 
extent, deep reinforcement learning is like the human 
learning process, which can be summarized as follows: 
human beings interact with the real environment through 
different perceptual organs to obtain a large amount of 
state information, which is processed by the brain to 
extract practical information, produce corresponding 
decision-making behavior, and make a judgment on the 
merits of the decision, and complete learning through this 
process of trial and error. In contrast, deep reinforcement 
learning obtains data information by interacting with the 
environment in the simulation environment created for it 
and outputs action data after processing by neural 
networks. 
As the educational objectives of primary education in 
the new era are constantly updated and transformed, the 
classroom is shifting from a rigid one-way teaching style 
to a vibrant and conversational life classroom. The trend 
from teacher-oriented to student-oriented education is 
dismantling the dull classroom dominated by knowledge 
and lack of emotion and thought. A new type of dialogic 
classroom is quietly sprouting. Thus, achieving deep 
learning in high school music is the best way to solve the 
problem of superficiality in the current high school music 
classroom teaching process. This paper defines the unit 
concept of music discipline from the perspective of deep 
learning through literature analysis, case study analysis, 
and summary and induction methods. By reading and 
studying relevant literature, we familiarize ourselves with 
the status of research and learn from research experience 
to provide theoretical support for this paper. Secondly, 
through a case study of a junior high school music 
classroom in Changzhou, we explore the critical elements 
of "unit teaching" in music, develop ideas for unit teaching 
design, evaluate the effectiveness of research and future 
development trends, and try to explore teaching strategies 
and methods to improve the overall level of unit teaching 
in music classrooms. The proposed model has higher 
accuracy and stability in predicting, recognizing, and 
generating music sequences compared to the SOTA 
method. This is mainly due to the optimization of 
evolutionary deep learning algorithms, which enable the 
model to better learn and adapt to the complexity and 
diversity of music sequences. Although the SOTA method 
performs well on certain specific tasks, the proposed 
model exhibits stronger generalization ability in cross 
domain transfer. This is due to the consideration of more 
music features and contextual information in the design 
and training process of the model, which enables the 
model to better adapt to different music education and 
application scenarios. The proposed method has novelty 
in the following aspects: 
This article combines evolutionary deep learning 
algorithms and applies them to music education, which is 
an innovation of this method. By utilizing the optimization 
ability of evolutionary algorithms, the performance and 
generalization ability of deep learning models can be 
further improved. 
The proposed model has strong cross domain transfer 
ability and can adapt to different music education and 
application scenarios. This is due to the consideration of 
more music features and contextual information in the 
design and training process of the model. 
3. Compared to some SOTA methods, the proposed 
model has a more concise structure and fewer parameters. 
This not only reduces the complexity of the model, but 
also improves the training efficiency and generalization 
ability of the model. 
2 Related work 
Abd Elaziz researched the launch of the core literacy 
research project by the Organization for Economic 
Cooperation and Development in China, which led to 
some important insights on constructing the correct 
definition and definition of core literacy in China [5]. She 
argues that the basic premise of core literacy selection 
must align with society’s needs and personal visions, 
emphasizing harmonious communication between people 
and tools, people, and individuals, etc. The development 
of deep learning relies on the implementation of deep 
teaching in the classroom [6]. Deep teaching is a type of 
teaching that focuses on teaching knowledge to convey the 
meaning of knowledge and the values behind it [7]. The 
most important thing in achieving deep learning is 
constantly questioning whether core literacies are being 
implemented. According to Lu, authentic teaching is 
conducted when students understand why and how 
knowledge exists, how it is developed, and when the 
learning is integrated into their individual experiences [8]. 
Vrysis believes that deep learning requires attention to 
each student's real needs and non-intellectual factors such 
as interests, aspirations, ideals, ideologies, emotions, 
attitudes, and values in the development process [9]. 
However, the data-driven approach requires a large 
amount of data, and copyright issues limit the amount of 
data and make manual labeling efforts more inefficient. To 
address this problem, combining deep neural networks 
and re-refining salient features has proposed a method that 
has yielded promising results [10]. 
Deep reinforcement learning algorithms based on 
policy gradients have demonstrated the ability to solve 
problems for high-dimensional continuous problems, 
compensating for the shortcomings of value-based 
algorithms and significantly improving the applicability of 
deep reinforcement learning [11]. In addition to these 
algorithms, several other deep reinforcement learning 
algorithms are emerging, such as hierarchical 
reinforcement learning attempts to solve the problem of 
reward sparsity and reverse reinforcement learning 
algorithms, such as those for solving the problem of hard-
to-get rewards during interaction with the environment 
[12]. These algorithms are constantly being improved, 
allowing deep reinforcement learning techniques to play 
an essential role in an increasing number of areas. Many 
excellent algorithms are still being proposed and applied 
in various fields [13]. Deep reinforcement learning is still 
developing rapidly, and there are still many challenges to 
be overcome, such as how to accelerate the training 
process more effectively, how to make the trained model 
more general, how to set a more accurate and reasonable 
Evolutionary Deep Learning for Sequential Data Processing in…                                                    Informatica 48 (2024) 63–78    65                                        
reward function, and how to choose the current strategy 
according to the longer-term return. Still, the fantastic 
achievements of deep reinforcement learning at this stage 
prove that deep reinforcement learning has a very broad 
[14]. With the development of hardware and algorithms, 
deep reinforcement learning will be able to play a more 
significant value [15]. 
The processing of sequential data in music education 
requires a large amount of annotated data, including 
timing information of notes, pitch information, rhythm 
information, etc. However, currently there are certain 
difficulties in obtaining and organizing annotated data, 
which require a lot of manpower and time. Meanwhile, the 
quality of annotated data can also have an impact on the 
training and performance of the model. The complexity of 
music sequences is high, requiring models to have high 
representation and learning abilities. However, current 
music education models based on deep learning often 
suffer from high model complexity, leading to long 
training time, high computational resource consumption, 
and also prone to overfitting. The sequence data in music 
education has diversity and complexity, and models need 
to have good generalization ability. However, current 
music education models based on deep learning often have 
certain limitations in terms of generalization ability, 
making it difficult to adapt to various complex music 
education scenarios. 
The method studied in this article can achieve better 
music generation results than the reward function method 
based on manual rules and contextual relationships. 
Solved the problem of lacking knowledge of music theory 
to propose rules. Compensated for the pain of insufficient 
utilization of music structure information in network 
models based on the relationship between before and after.  
3 Evolutionary deep learning 
sequence data processing methods 
analysis 
Since the music relative loudness estimation task is 
derived from the music detection task by further 
classifying music events as foreground music events or 
background music events, it can be observed that the event 
categories of these two tasks naturally form a two-level 
hierarchical structure. Based on this observation, the joint 
task of music detection and music relative loudness 
estimation is highly relevant to the hierarchical 
classification problem [16]. Evolutionary algorithm is an 
optimization algorithm that draws inspiration from natural 
selection and genetic mechanisms in biological evolution. 
In deep learning, evolutionary algorithms can be used to 
optimize the parameters and structure of neural networks, 
thereby improving the performance of the model. 
Specifically, evolutionary algorithms can affect the 
training or structure of neural networks in the following 
ways. Evolutionary algorithms can be used to optimize 
parameters such as weights and biases in neural networks. 
In each iteration, the parameters of the neural network are 
evaluated based on the fitness function, and the parameters 
with higher fitness are selected for genetic operation, 
gradually optimizing the parameters of the neural 
network. Evolutionary algorithms can also be used to 
optimize the structure of neural networks, such as the 
connection method and number of layers of neurons. By 
simulating the genetic mechanisms involved in biological 
evolution, different network structures can be generated 
and evaluated under fitness functions. Select a network 
structure with better performance for genetic operations, 
in order to gradually optimize the structure of the neural 
network. For a segment on a time step, detecting events on 
both tasks can be constructed by classifying the segment 
into two hierarchical levels of event classes. The segments 
from the audio are time-series dependent, especially those 
that are temporally adjacent. This is because segments are 
short, but an event may last several seconds or minutes. 
This means that a series of adjacent segments in a period 
may belong to the same event class. Therefore, models 
need to be designed that can guarantee the continuity of an 
event and can model the temporal relationships between 
time steps. Recurrent neural networks have shown some 
advantages in modeling sequential data, so the same 
iterative structure as recurrent neural networks is used to 
improve the performance of continuous event detection in 
the study of this chapter. 
𝑇 𝑎𝑟 𝑔 𝑒 𝑡𝑄 =
𝛾 𝑚 𝑎𝑥 𝑄 ( 𝑠 ′
, 𝑎 , 𝜃 𝑖 )
𝑟 
(1) 
The input of the neural network of DQN is the 
observation, which is generally the state s. The deep neural 
network calculates the value function of each action under 
the input states, and then the c- greedy exploration strategy 
described in Section 1 is used to select one of the actions 
as the output. The process of updating the value function 
matrix by exploration can be described as follows: firstly, 
the observed value is obtained by observation, i.e., the 
current state s. The Agent brings the value of the value 
function Q (s, a) about each action and in state’s according 
to the Q value stored in the value function matrix and then 
selects an action a from the steps according to the 
exploration strategy used and executes a. The environment 
at the next moment after the action is performed will 
change because based on this, the DQN updates the 
parameters of the value function matrix according to the 
obtained reward r and conducts the next round of iterative 
training until a sufficiently good value function matrix is 
obtained, and the structure of the DQN is shown in Figure 
1.
66   Informatica 48 (2024) 63-78                                                                                                                                                     L. Jing 
 
Figure 1: Sequence data processing method 
However, applying deep learning directly to 
reinforcement learning is problematic, and two of these 
problems pose numerous difficulties in combining the 
two. Reinforcement learning Q-learning algorithms are 
updated iteratively by the payoffs at the current moment 
and the estimated value at the next moment, making the 
present Q strongly correlated with the future Q [17]. There 
are often discrepancies in the data, and this instability in 
the data leads to reduced validity of the data, which may 
fluctuate with each round of iterations as a result and will 
have an impact on subsequent iterations to the detriment 
of the algorithm’s convergence. In addition, this type of 
task may also have problems such as reward delay; in 
reinforcement learning, the reward generated by the action 
may be reflected in a reasonably long period, while the 
deep learning method input and output is generally a direct 
mapping, the training of reinforcement learning is 
relatively much more difficult. 
𝐿 𝑖 ( 𝜃 𝑖 ) = 𝐸 𝑠 , 𝑎 , 𝑟 , 𝑠 𝑖 [
𝛾 𝑚 𝑎𝑥 𝑄 ( 𝑠 ′
, 𝑎 , 𝜃 𝑖 )
𝑟𝑄 ( 𝑠 , 𝑎 , 𝜃 𝑖 )
2
] (2) 
where 𝜃 𝑖 −
 is the parameter of the Target Q-network for 
the ith iteration. It is the parameter of the Q-network after 
the ith iteration. The loss function of the DQN is a residual 
model that calculates the square of the difference between 
the actual value and the estimated value. Represents the 
estimated value, which is also used as the input to the 
neural network. 
Sample data for deep learning are usually 
independently and identically distributed. Still, in 
reinforcement learning, the states as training samples are 
a sequence and they are highly correlated with each other. 
The value function and action value estimates are 
constantly optimized and updated as training proceeds. As 
the value function changes, the output actions also change 
continuously, leading to a changing distribution of the 
training samples. This strong correlation of the 
reinforcement learning data samples is incompatible with 
the nature of the data samples required for training by deep 
learning. In addition, there is also the problem of 
inefficient data use. Supervised deep learning algorithms 
mostly need a large amount of data as support to achieve 
good results, while reinforcement learning algorithms face 
generally sparse data tasks; after each iteration, the sample 
data used this time is directly discarded, then more 
interaction with the environment is needed to obtain 
samples to prepare for subsequent training. 
𝜕 𝐿 𝑖 = 𝐸 𝑠 , 𝑎 , 𝑟 , 𝑠 𝑖 [
𝛾 𝑚 𝑎𝑥 𝑄 ( 𝑠 ′
, 𝑎 , 𝜃 𝑖 ) − 𝛥 𝜃 𝑖 𝑄 ( 𝑠 , 𝑎 , 𝜃 𝑖 )
𝑄 ( 𝑠 , 𝑎 , 𝜃 𝑖 )
] 
(3) 
Convex
optimization
Raw reads
De-multiplexing
UniquereadsMapping
Larger than the 
original
objective function
On-target filtering
Adjusted objective 
function
Mapped
Largeindels
Non-convex
GarbagePickerlAbsol
uteVar
Optimal solution
AbsoluteVar
Lagrangian dual in convex optimization
provides the superiority
Solving optimal
solution problems Local optimal solution
Global optimal
solution Analysisready reads
Discrete variable
SmallindelsGA
TK
CNVsGarbag
ePicker
Variant
annotatior
Variant calling
Assumptions
Enact Revisions Simplifications
Pre-processing
Evolutionary Deep Learning for Sequential Data Processing in…                                                    Informatica 48 (2024) 63–78    67                                        
The policy-based reinforcement learning algorithm 
can effectively solve the problems as mentioned earlier of 
value-based algorithms. In the task, the policy-based 
approach provides an approximate representation of the 
randomized policy by describing the policy r as a function 
containing parameter 0, i.e.: 
𝜋𝜃 ( 𝑎 | 𝑠 ) = 𝑝 ( 𝑎 | 𝑠 ) / 𝑝 ( 𝜃 ) (4) 
By representing the strategy as a continuous function, 
the optimal strategy can be found by the optimization 
method of continuous functions; the most common way is 
gradient descent. The performance of strategy r is 
measured by the expected return of the action trajectory, 
and it is used as the optimization objective: 
𝐽 ( 𝜃 ) = [ ∑ 𝑄 ( 𝑎 | 𝑠 )
𝑛 𝑎 𝜋 ( 𝑠 , 𝑎 ) ] (5) 
When using the value function-based approach for 
continuous tasks, the action space must be discretized, and 
the constant space must be simulated by other means, 
which leads to an increase in workload and a decrease in 
the accuracy of model training [18]. Using the policy 
gradient method, the optimal policy is obtained as a 
probability distribution or probability density function, 
which can handle both continuous and discrete state action 
space tasks, effectively compensating for the 
shortcomings of the value function-based method. 
Like ordinary attention, the attention mechanism 
attached to a convolutional neural network needs to assign 
different weights to the data from some dimension, among 
which the more common one is the channel-based 
attention mechanism, as shown in Figure 2. 
To extract the motives, we need from the massive 
music dataset, there is no relevant dataset in all music 
datasets, so for this paper, it is a process from 0 to 1 [19]. 
For this purpose, we need to manually annotate the 
existing dataset and then learn the process of motive 
extraction by the seq2seq model. 
𝐴 𝑐𝑐 𝑢 𝑟 𝑎𝑐 𝑦𝑗
1
𝐷 [ 𝑠 , 𝑠 ′ ]
𝑠 , 𝑠 ′
 
(6) 
The music XML file is selected as the source of music 
data. Since Mozart's music has strong melodic mobility 
and apparent motives, the music XML file of Mozart's 
music is selected as the source of the music data set in this 
section. Manual annotation extracts the desired music data 
from the music XML files. And the beginning of the 
motive and phrase, the beginning note, the end note, and 
the end note are marked manually. A motive and a phrase 
are used as a set of data. The format of the data set is such 
that one row contains one motive sequence and one phrase 
sequence. All the data are classified into training and test 
sets in the ratio of 8 to 2. 
4 Experimental design for music 
education application 
Considering the characteristics of different modules, in 
order to verify the modeling ability of the model, this 
study combined the modules and conducted experiments. 
Firstly, this article conducted experimental research on the 
weights of the cosine loss function. The quantizer of the 
dataset itself is a module for shoppers. During the 
experiment, it was found that when the weight was set to 
Ralph, the quantifier of the model was closer to the dataset 
in terms of lime green than other weights, indicating that 
the distribution of notes in the measure was closer to the 
music dataset 
The accuracy of a bidirectional LSTM model with 
single-layer and three-layer hidden layers was evaluated 
at 1000 iterations. Compared to accuracy, a bidirectional 
LSTM model with three hidden layers is more reasonable 
than a two-dimensional LSTM model with a single hidden 
layer. The selection of the number of hidden layers and 
corresponding nodes directly affects the reliability and 
accuracy of the model. Further analysis was conducted on 
the impact of different hidden layers and node numbers on 
the model. As a result of the analysis, a reasonable number 
of hidden layers and corresponding numbers of nodes 
were given to make the model optimal, i.e. minimizing the 
error. Before training, determine parameters such as small 
batch size, learning rate, and optimizer based on previous 
experience and relevant knowledge. In-depth learning 
goals are necessary to reflect the characteristics of deep 
learning. The plans are set to target the structural 
framework of knowledge learning that students already 
have and will have, as well as to develop students' 
transferability and implement the generation of students' 
overall literacy [20]. In this model, the mutual 
interpretation of teaching objectives and teaching 
evaluation serve as the two forward momenta of the spiral 
occurring in the process of deep learning in high school 
music, in which the teacher can adequately evaluate the 
learning phenomena reflected by students after each stage 
of learning. The results of their feedback evaluation are 
used to set the teaching objectives for the next session. 
Instructional goals that point to deep learning in music 
should be oriented to learning outcomes and dynamic and 
reverse instructional goal design, in which we can 
progressively define instructional purposes by focusing on 
core music disciplinary literacy and around challenging 
transferable learning tasks. This is because only studies 
with authentic contexts and challenges can achieve the 
educational goal of enabling students to respond to real-
world opportunities and difficulties rather than shallow 
verbal or written responses to limited prompts. For 
example, an authentic challenge in music is translating a 
complex set of instructions into a smooth and moving 
entire repertoire rather than just learning a bunch of notes 
[21]. Performing a particular piece of music (and 
appreciating others' performances) reflects a student's 
proficiency in the challenge; for example, in the Music and 
Drama module, the central authentic challenge is whether 
the student can perform or sing on stage with integrity and 
grace to role-play and form. 
68   Informatica 48 (2024) 63-78                                                                                                                                                     L. Jing 
First, music education is a crucial way to implement 
aesthetic education. Students learn music to enhance their 
experience and perception of beauty, gradually enhance 
their interest in music learning and their desire for 
beautiful things in the process of aesthetic experience, and 
subconsciously improve their aesthetic level and moral 
sentiment in the rich and vivid music learning, to realize 
the role of educating people with beauty and beautifying 
the soul. Secondly, music learning emphasizes practice 
because music is a creative art. Therefore, students can 
form direct experience when participating in various kinds 
of music practice activities, transform this experience into 
acquired skills, and gain different music learning 
experiences by participating in different contexts and 
forms of music activities, which can also enhance students' 
imagination and creative thinking, as shown in Figure 3. 
 
1 0 0 1 0 1 0 1 0 0 1 1 1 0 0
0 1 0 1 0 1 0 0 1 0 0 0 1 0 1
x
z
x1
z1
h0 h1 h2 h3 h4 h6
A A A A A A
x0 x1 x2 x3 x4 x6
A
X X1
Addit
ion
o
1
Sigm
oid
y
1
y
2
y
3
y
4
o
2
o
3
 
Figure 2: Deep learning sequence data processing framework 
Evolutionary Deep Learning for Sequential Data Processing in…                                                    Informatica 48 (2024) 63–78    69                                        
Taking advantage of the 
convergence
Forward and backward 
state metrics 
Lattice recursive 
computation
Unknown probabilities 
of the various possible 
initial values
Preset window length 
and slides
 Each window in turn 
for decoding
Processing of the sliding 
window algorithm
Initial values of the 
forward state metric
Backward state metric 
of each window
Improved decoding 
method 
Sliding window 
decoding algorithm 
divides
Long block of code into 
multiple windows
Sliding windows for 
decoding
Synchronize
Synchronize
Synchronize
Synchronize
Same before the recursive computation
State 
Backw
ard
Adva
nce
Recursi
ve 
Compu
tation 
Metrics
Initial 
values
Length
Const
raint
Wind
ows
Forwa
rd
Turbo 
A
Decode
r
Recursi
ve
 
Figure 3: Structure of music teaching model 
A sequence of notes is converted into a single solo 
thermal code and connected to the current structural 
feature, which is obtained from the previous music 
structural feature extractor and used as input to the reward 
model. The reward model consists of a single troubadour 
connecting a fully connected layer. Thus, the music 
generation process always incorporates thematic 
information into the note generation. A note and its 
probability are generated by the reward model based on 
the previous note sequence and the theme. The possibility 
of a note being predicted serves as the actual reward for 
the reward model [22]. All the 359 theme models contain 
a similar structure. By integrating different music 
structure information through theme weights, three music 
generation models capable of remembering additional 
music structure information are obtained, finally 
completing the music reward function model. 
Considering the nature of the frivolous naive model, 
using the maximum likelihood probability principle to 
select notes leads to a high repetition rate of the generated 
notes. To solve this problem, this study employs a 
Boltzmann sampling approach to generate music. The 
model outputs the probability corresponding to each note 
and treats it as a polynomial distribution. The size of each 
note is generated according to the generation probability, 
and the probability is used as the criterion for sampling, 
i.e., the probability of predicting that note is used as the 
probability when generating music. In this example, this 
study uses a random sequence as the model starter bar. The 
first note generated is the first output obtained by the 
model after the input from the starter bar, and then the 
whole music sequence is generated sequentially, as shown 
in Figure 4. 
In the process of practice, teachers can easily find that 
the teaching model is often misinterpreted as a means of 
pursuing efficiency by focusing on progress and 
conclusions, treating teaching as learning, progress as a 
task, and teaching materials as a curriculum [23]. 
Cooperative learning is often reduced to a classroom 
embellishment of finding answers together, and there is no 
effective communication, sharing, and division of labor, 
which is undoubtedly a waste of time and resources. 
Teachers' reflection is a focus before classroom practice 
activities through experience, sharing, and 
communication, to continue to develop their professional 
capacity and literacy. The breakthrough of reflection and 
the key to crossing the transformation barrier is the 
awareness of student learning, "How do students learn in? 
Teachers should keep exploring in depth from this thread 
what factors influence student learning and keep adjusting 
at the right time in this spiral process. To accomplish an 
excellent harmonic arrangement, one should know that the 
main reason for the harmonic arrangement process 
affecting the chord progression is the difference in style 
and what kind of chords are used to produce a specific 
effect. 
It demonstrates the development process of its basic 
idea - the basic motive - in a constantly changing 
environment. These different aspects of the basic motive - 
its change and development - are shaped by the 
environment that arises from considerations of diversity, 
structure, expressiveness, etc. Whereas the arrangement in 
a photographic book is chronological, the motivic type is 
not, and its order is governed by the requirements of 
comprehensibility and musical logic [24]. The phrase is a 
higher form of construction than the section. It not only 
states an idea but also develops it immediately. The phrase 
70   Informatica 48 (2024) 63-78                                                                                                                                                     L. Jing 
form is often used in the dominant themes of sonatas, 
symphonies, etc., but it is also applicable to smaller forms. 
The opening of the phrase already contains repetition; 
therefore, the subsequent section requires a more distantly 
varied motivic pattern. 
5 Analysis of results 
5.1 Performance analysis of evolutionary 
deep learning algorithm results 
Music generation is like the problem of text generation in 
natural language processing, and recurrent neural 
networks or long and short-term memory recurrent neural 
networks are often used to solve this sequential problem. 
However, since the music generated using these methods 
only solves a sequence generation problem, there is no 
way to control the emotional tone of the generated music. 
Generative adversarial networks perform well in solving 
image generation problems, where the generator and the 
discriminator reach a Nash equilibrium, and the high-
quality images generated by the generator will 
successfully "trick" the discriminator. Based on this idea, 
this paper designs a Scratch music generation model based 
on generative adversarial networks. During training, 
Scratch music with the same sentiment tone is used as 
training data for the discriminator, and the sentiment 
category is added to the feature vector for the generator to 
learn. When the discriminator cannot distinguish the raw 
training data from the Scratch music generated by the 
generator, the generated Scratch music will be labeled 
with the corresponding sentiment. 
Due to the temporal nature of music, solving the 
music generation problem using generative adversarial 
networks is more complex than solving the image 
generation problem. When composers create music, they 
often describe music as a multi-level hierarchical 
structure: the beat and note values and pitches are 
considered as the smallest repetitive structure, and notes, 
chords, etc combine the music measures. A certain 
number of music measures are integrated into phrases; the 
variations of phrases are combined into movements, and 
finally, multiple movements are incorporated into 
complete music. The quality of music is directly affected 
by the temporal dependence between musical measures, 
so it is essential to model the temporal structure in the 
generative adversarial network. Moreover, generative 
adversarial networks are suitable for generating 
continuous data; for example, the target output of the 
video generation task is a constant video, as shown in 
Figure 5.
 
 
Figure 4: Experimental results of the baseline model on the music detection task 
Evolutionary Deep Learning for Sequential Data Processing in…                                                    Informatica 48 (2024) 63–78    71                                        
Happy Anxious Sad Relaxed
5
10
15
20
25
30
35
Value
Type
 Happy
 Anxious
 Above Happy
 Below Happy
 Sad
 Relaxed
 Above Sad
 Below Sad
 
Figure 5: Graph of comparison of experimental results of music emotion recognition 
From the figure, we can see that the Scratch music set 
generated using the Scratch music generation model has 
little difference in the performance of the Scratch music 
sentiment recognition model, with Precision, Recall, and 
F1-score reaching 71.0%, 70.8%, and 70.8%, respectively. 
However, they are lower compared to the Scratch music 
dataset. This is because the generated music is different 
from the original Scratch music dataset, and the Scratch 
music recognition model is trained using the constructed 
Scratch music dataset, which naturally performs better on 
the Scratch music dataset. However, the Scratch music 
generation model performs better than the other two 
models, and the Precision value of the data generated by 
the Scratch music generation model in the music emotion 
recognition task is much larger than the probability value 
of the randomness of the four classification problems, 
which proves the rationality of the design of the emotion-
based GAN music generation model. The effectiveness of 
the Scratch music generation model based on sentiment 
design is illustrated. 
Considering the characteristics of different modules, 
to verify the modeling ability of the model, this study 
combined the modules and launched experiments. First, 
this study conducted experiments on the weights of the 
cosine loss function. The quantifiers of the data set itself 
are the shopper's module. In the experimental process, it 
was found that when the weights were taken as the Ralphs, 
the quantifiers of the model were closer to the data set in 
terms of lime green than the other weights, which 
indicated that the note distribution in the bars was closer 
to the music data set. At the same time, the before-and-
after similarity was also closer to the music data set, which 
meant that the music was closer to the music data set in 
terms of the before-and-after relationship, indicating that 
the overall before-and-after perception of the music was 
closer to the real music. This means that the music is closer 
to the music dataset in terms of the before-and-after 
relationship, indicating that the overall before-and-after 
perception of music is closer to the real music data. 
As can be seen from the results, when the temporal 
structure generator is used, the percentage of pitch 
differences between bars of 8, 16, and 24 degrees or more 
is significantly lower than that of the model without the 
temporal structure generator and the other models. This is 
due to the presence of the temporal structure generator, 
which makes the generated measures arranged in a 
particular order, thus making the generated measures more 
coherent, while the music generated by the models that use 
the music measure generator alone or do not consider note 
coherence is more independent, and the measures are not 
related to each other, resulting in excessive pitch 
differences, as shown in Figure 6.
72   Informatica 48 (2024) 63-78                                                                                                                                                     L. Jing 
 
Figure 6: Loss of the four network models 
The accuracy of the bidirectional LSTM model with 
single and three hidden layers is 73% and 90% at 1000 
iterations, respectively. Compared with the accuracy, the 
bidirectional LSTM model with three hidden layers is 
more reasonable than the bidirectional LSTM model with 
single hidden layers. The accuracy of BP, RNN, LSTM, 
and bidirectional LSTM models with single hidden layers 
are 40%, 47%, 68%, and 73%, respectively. Three models 
are tested with no higher than 70% accuracy, and the 
single-layer bidirectional LSTM model has the highest 
accuracy. 
As the choice of the number of hidden layers and the 
corresponding number of nodes directly affects the 
reliability and accuracy of the model. Further analysis is 
carried out regarding the degree of influence of different 
hidden layers and the number of nodes on the model. As a 
result of the analysis, a reasonable number of hidden 
layers and the corresponding number of nodes are given to 
make the model optimal, i.e., with minimum error. Before 
training, parameters such as mini-batch, learning rate, and 
optimizer are determined concerning previous experience 
and related knowledge. We can see that as the complexity 
of the model increases, the accuracy of the model also 
improves. The bidirectional LSTM model with three 
hidden layers has the highest accuracy, reaching 90%, 
while the bidirectional LSTM model with a single hidden 
layer has an accuracy of 73%. This indicates that 
increasing the number of hidden layers in the model can 
improve its performance. The confusion matrix can help 
us understand the prediction performance of the model on 
different categories. Through the confusion matrix, we can 
intuitively see the performance of the model on each 
category and which categories have the most accurate 
predictions. Accuracy is the proportion of samples 
predicted by the model to be true positive examples. 
Overfitting refers to the model performing well on training 
data but performing poorly on test data. This is usually due 
to the model being too complex, resulting in overfitting of 
the training data. To prevent overfitting, we can adopt 
some strategies, such as increasing the amount of data, 
using regularization, and reducing model complexity. 
Insufficient fitting refers to the poor performance of the 
model on both training and testing data. This is usually due 
to the model being too simple to capture the complex 
patterns of the data. To solve the problem of insufficient 
fitting, we can increase the complexity of the model or use 
a more complex model structure. 
5.2 Music teaching application results 
This process may involve the teacher's expertise, 
experience, knowledge, and understanding of the teaching 
environment and students. To assess the goals of deeper 
music instruction, teachers should first believe that all 
students can make appropriate progress, that students 
0 200 400 600 800 1000 1200 1400 1600 1800 2000
0.0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1.0
Loss
Epoches
 BLSTM
 LSTM
 BP
 RNN
Evolutionary Deep Learning for Sequential Data Processing in…                                                    Informatica 48 (2024) 63–78    73                                        
understand the criteria for assessment, and that learning-
oriented assessment activities will truly achieve deeper 
learning in music. Second, teachers should strive to design 
assessment criteria that are more operational and promote 
education in response to the necessarily demanding 
standards. Teachers design appropriate learning tasks and 
guide their completion in ways that, in turn, are extensions 
of classroom instruction rather than simple replications of 
curricular requirements, and learning goals and 
assessment tasks can maximize their pointing power if the 
process by which students complete the assessment is the 
same as the learning process. While the previous chapter's 
deep learning model for music reveals how deep learning 
develops in classroom instruction, a clear assessment of 
learning objectives for which guided assessment tasks is 
the initial point of the launch can save the time experience 
required to invest in the learning model. 
Whether written or performance audition, teachers' 
elaborate design of evaluation situations should be 
objective and diverse, combining listening experience 
analysis and discussion in depth as grades. In addition to 
the innovation of the evaluation process, the deep 
combination of vocal teaching and instrumental teaching 
and pointing to the core literacy of music discipline should 
have complied with the curriculum in the evaluation 
session. Different musical language, artistic emotion, and 
artistic, social value should be examined, and the deep 
evaluation of experiential ability, analytical ability 
creative ability and cultural understanding should be 
conducted in connection with the characteristics of other 
disciplines, as shown in Figure 7. 
Comparing the results of Experiment 1 and 
Experiment 2, it can be found that the addition of the pitch 
feature is not as significant as the rhythm for improving 
the effect. The reason for this is that the pitch feature 
already contains enough information about the structure of 
music; the pitch feature is the basic feature of music, in 
which the change of pitch represents the structure of 
music; the rhythm is also a critical auxiliary feature for 
music, which determines the properties of notes in time-
based on the determination of the structure of music, and 
together with the pitch, it constitutes the structural feature 
of music. The pitch, on the other hand, represents the 
keynote of the music and represents only the overall sonic 
height of the music, not the structure of the music, and 
therefore has little effect. 
The purpose of this part of the experiment is to verify 
the effect of the overall link, taking the MIREX05 dataset, 
which is an audio file and includes a variety of styles of 
music. This experiment first uses the method proposed in 
Chapter 3 to extract the main melody pitch and music 
rhythm and pitch, converting them to JSON format; after 
that, the method in Chapter 4 is used to calculate the 
similarity, and this model uses the BiLSTM with better 
effect plus the attention mechanism. 
In this part of the experiment, to verify the similarity 
of the music structure, the MIREX05 dataset was used 
while the songs were split into several short and fixed-
length music segments. Since the segments are short and 
have high similarity under the same song, this experiment 
slices each audio segment according to a length of 100. 
Therefore, in this part, the audio label in the dataset is set 
to the name of the song to which the clip belongs. In the 
experiment, the similarity Top1 and Top3 are used to 
measure the accuracy of the links. After inputting a 
sample, its similarity to other samples is calculated and 
ranked, where Top1 refers to the sample with the top 
accuracy rate, and Top3 refers to the sample with the top 
three accuracy rates. When the sample whose accuracy 
ranking meets the requirement has the same label as the 
calculated sample, the sample is considered to be correctly 
identified, as shown in Figure 8.
74   Informatica 48 (2024) 63-78                                                                                                                                                     L. Jing 
 
Figure 7: Experimental results of the model and similarity calculation formula 
0.54
0.9
0.96
0.64
0.6
0.82
0.49
0.81
0.49
0.4
0.45
0.87
0.84
0.66
0.84
0.98
0.64
0.7
0.93
0.45
0.84
0.88
0.85
0.81
0.55
0.76
0.45
0.47
0.76
0.63
1
2
3
4
5
6
7
8
9
10
0.0 0.5 1.0 1.5 2.0 2.5
Values
Experiment number
 SC
 TP
 AP
Evolutionary Deep Learning for Sequential Data Processing in…                                                    Informatica 48 (2024) 63–78    75                                        
 
Figure 8: Experimental sample clustering 
The experimental results show that the attention 
mechanism has a positive effect on improving the 
accuracy rate, while the accuracy rate performs better 
when the cosine similarity is used. For the contour 
coefficient, the attention mechanism has little effect. In 
contrast, the distance calculation formula has a greater 
effect, whereas the cosine similarity has the most apparent 
impact on the improvement of the contour coefficient. The 
second part verifies which musical features impact 
similarity detection more. The experiments show that 
rhythm has a greater impact on the experimental results 
when using pitch sequences as the base feature, and pitch 
impacts the results, but the effect is not as evident as the 
rhythm feature. 
Music generation models can be used as part of 
teaching tools to assist music education. For example, 
students can use these models to generate their own music 
works, or create based on specific emotional or stylistic 
requirements. In addition, these models can also be used 
to analyze students' music works and provide feedback on 
their emotions, styles, and other aspects. The use of music 
generation models can enhance students' learning 
experience, enhance their interest and motivation in 
learning. By creating their own music works, students can 
better understand the composition and creative process of 
music, thereby deepening their understanding and 
appreciation of music. In addition, these models can also 
help students understand how music with different 
emotions and styles is encoded, thereby enhancing their 
music analysis and comprehension abilities. Although 
music generation models may be new technologies for 
some educators, many existing educational and research 
institutions are actively promoting and applying these 
technologies. In addition, with the development of 
technology, these models have become increasingly easy 
to use and understand. Educators can learn and master 
these technologies by participating in relevant training 
courses or seminars, in order to apply them to their 
teaching. 
6 Conclusion 
This paper is based on the actual teaching practice of 
middle school music classroom, through theoretical study 
and front-line teaching practice, and based on our own 
professional teaching experience and classroom teaching 
reflection, we focus the center of the paper on the 
exploration of large units of teaching in the music 
discipline. The teaching mode in line with deep learning 
can permeate the core literacy of music subjects into each 
teaching module, effectively addressing the 
implementation of students' musical ability and literacy 
and better promoting a deep understanding of music 
culture. This is also mainly related to the characteristic 
connotation of deep learning and its alignment with the 
values of core literacy. Whether from a temporal or a 
0 20 40 60 80 100
0
20
40
60
80
100
120
140
 Train 1
 Train 2
 Train 3
 Train 4
 Train 5
Values
Percent
76   Informatica 48 (2024) 63-78                                                                                                                                                     L. Jing 
content perspective, deep learning is extended and 
deepened. It is always conducive to developing students' 
abilities and literacies for lifelong learning in music and 
reaching the essence and core of the music discipline. The 
comparison experiment in this paper uses many data sets. 
Because of the specificity of the training data, experiments 
involving different input data can only change data sets. 
Therefore, this paper adds comparative experiments for 
similarity detection under the overall process. In this 
paper, the multi-stage and multi-level double helix form 
of the teaching model is chosen precisely because it 
conforms to the real contextual, practical activity and 
perceptual-experiential tendency path, from the 
acceptance stage to the participation stage to the migration 
stage and the specific practice process of the model, in 
which students and teachers take each other as subjects, 
and the bottom-up and real-time dynamic design of 
teaching objectives and teaching evaluation oriented to 
learning outcomes, and through the spiral of both 
development and interpretation from the bottom up 
through the music learning behavior. 
Competing of interests 
The authors declare no competing of interests. 
Authorship contribution statement 
Lin Jing: Writing-Original draft preparation, 
Conceptualization, Supervision, Project administration. 
References 
[1] X. Wang, S. Zhao, J. Liu, and L. Wang, “College 
music teaching and ideological and political 
education integration mode based on deep 
learning,” Journal of Intelligent Systems, 31(1): 
466–476, 2022. https://doi.org/10.1515/jisys-
2022-0031 
[2] G. Taranto-Vera, P. Galindo-Villardón, J. 
Merchán-Sánchez-Jara, J. Salazar-Pozo, A. 
Moreno-Salazar, and V. Salazar-Villalva, 
“Algorithms and software for data mining and 
machine learning: a critical comparative view 
from a systematic review of the literature,” J 
Supercomput, 77: 11481–11513, 2021. 
https://doi.org/10.1007/s11227-021-03708-5 
[3] J. Liu, S. Snodgrass, A. Khalifa, S. Risi, G. N. 
Yannakakis, and J. Togelius, “Deep learning for 
procedural content generation,” Neural Comput 
Appl, 33(1): 19–37, 2021. 
https://doi.org/10.1007/s00521-020-05383-8 
[4] M. AlQuraishi and P. K. Sorger, “Differentiable 
biology: using deep learning for biophysics-based 
and data-driven modeling of molecular 
mechanisms,” Nat Methods, 18(10): 1169–1180, 
2021. https://doi.org/10.1038/s41592-021-01283-
4 
[5] M. Abd Elaziz et al., “Advanced metaheuristic 
optimization techniques in applications of deep 
neural networks: a review,” Neural Comput Appl, 
1–21, 2021. https://doi.org/10.1007/s00521-021-
05960-5 
[6] Q. Zhang, J. Lu, and Y. Jin, “Artificial intelligence 
in recommender systems,” Complex & Intelligent 
Systems, 7: 439–457, 2021. 
https://doi.org/10.1007/s40747-020-00212-w 
[7] A. Darwish, A. E. Hassanien, and S. Das, “A 
survey of swarm and evolutionary computing 
approaches for deep learning,” Artif Intell Rev, 53: 
1767–1812, 2020. 
https://doi.org/10.1007/s10462-019-09719-2 
[8] J. Lu et al., “Illustrating changes in time-series 
data with data video,” IEEE Comput Graph Appl, 
40(2): 18–31, 2020. 
https://doi.org/10.1109/MCG.2020.2968249 
[9] L. Vrysis, N. Tsipas, I. Thoidis, and C. Dimoulas, 
“1D/2D deep CNNs vs. temporal feature 
integration for general audio classification,” 
Journal of the Audio Engineering Society, 
68(1/2): 66–77, 2020. 
https://doi.org/10.17743/jaes.2019.0058 
[10] M. Abdel-Basset, H. Hawash, R. K. Chakrabortty, 
M. Ryan, M. Elhoseny, and H. Song, “ST-
DeepHAR: Deep learning model for human 
activity recognition in IoHT applications,” IEEE 
Internet Things J, 8(6): 4969–4979, 2020. 
https://doi.org/10.1109/JIOT.2020.3033430 
[11] S. H. Lim, S. Kim, B. Shim, and J. W. Choi, “Deep 
learning-based beam tracking for millimeter-wave 
communications under mobility,” IEEE 
Transactions on Communications, 69(11): 7458–
7469, 2021. 
https://doi.org/10.1109/TCOMM.2021.3107526 
[12] P. Gomathi, S. Baskar, P. M. Shakeel, and V. R. 
S. Dhulipala, “Identifying brain abnormalities 
from electroencephalogram using evolutionary 
gravitational neocognitron neural network,” 
Multimed Tools Appl, 79: 10609–10628, 2020. 
https://doi.org/10.1007/s11042-022-13850-8 
[13] Y. SATO, Y. HORAGUCHI, L. VANEL, and S. 
SHIOIRI, “Prediction of image preferences from 
spontaneous facial expressions,” Interdiscip Inf 
Sci, 28(1): 45–53, 2022. 
https://doi.org/10.4036/iis.2022.A.02 
[14] S. Bhaskaran and R. Marappan, “Analysis of 
collaborative, content & session based and multi-
criteria recommendation systems,” The 
Educational Review, USA, 6(8): 387–390, 2022. 
Doi: 10.26855/er.2022.08.009 
[15] V. A. Vuyyuru, G. A. Rao, and Y. V. S. Murthy, 
“A novel weather prediction model using a hybrid 
mechanism based on MLP and VAE with fire-fly 
optimization algorithm,” Evol Intell, 14: 1173–
1185, 2021. https://doi.org/10.1007/s12065-021-
00589-8 
[16] I. Santos, L. Castro, N. Rodriguez-Fernandez, A. 
Torrente-Patino, and A. Carballal, “Artificial 
Evolutionary Deep Learning for Sequential Data Processing in…                                                    Informatica 48 (2024) 63–78    77                                        
neural networks and deep learning in the visual 
arts: A review,” Neural Comput Appl, 33: 121–
157, 2021. https://doi.org/10.1007/s00521-020-
05565-4 
[17] M. Littmann et al., “Validity of machine learning 
in biology and medicine increased through 
collaborations across fields of expertise,” Nat 
Mach Intell, 2(1): 18–24, 2020. 
https://doi.org/10.1038/s42256-019-0139-8 
[18] E. Apostolidis, E. Adamantidou, A. I. Metsai, V. 
Mezaris, and I. Patras, “Video summarization 
using deep neural networks: A survey,” 
Proceedings of the IEEE, 109(11): 1838–1863, 
2021. 
https://doi.org/10.1109/JPROC.2021.3117472 
[19] H. Ghanei, F. Manavi, and A. Hamzeh, “A novel 
method for malware detection based on hardware 
events using deep neural networks,” Journal of 
Computer Virology and Hacking Techniques, 
17(4): 319–331, 2021. 
https://doi.org/10.1007/s11416-021-00386-y 
[20] J. Klinger, J. Mateos-Garcia, and K. 
Stathoulopoulos, “Deep learning, deep change? 
Mapping the evolution and geography of a general 
purpose technology,” Scientometrics, 126: 5589–
5621, 2021. https://doi.org/10.1007/s11192-021-
03936-9 
[21] I. A. Doush and A. Sawalha, “Automatic music 
composition using genetic algorithm and artificial 
neural networks,” Malaysian Journal of 
Computer Science, 33(1): 35–51, 2020. 
https://doi.org/10.22452/mjcs.vol33no1.3 
[22] L. Ma and B. Sun, “Machine learning and AI in 
marketing–Connecting computing power to 
human insights,” International Journal of 
Research in Marketing, 37(3): 481–504, 2020. 
https://doi.org/10.1016/j.ijresmar.2020.04.005 
[23] X. Wang, Y. Han, V. C. M. Leung, D. Niyato, X. 
Yan, and X. Chen, “Convergence of edge 
computing and deep learning: A comprehensive 
survey,” IEEE Communications Surveys & 
Tutorials, 22(2): 869–904, 2020. 
https://doi.org/10.1109/COMST.2020.2970550 
[24] C. Gresse von Wangenheim, J. C. R. Hauck, F. S. 
Pacheco, and M. F. Bertonceli Bueno, “Visual 
tools for teaching machine learning in K-12: A 
ten-year systematic mapping,” Educ Inf Technol 
(Dordr), 26(5), 5733–5778, 2021. 
https://doi.org/10.1007/s10639-021-10570-8 
  
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
78   Informatica 48 (2024) 63-78                                                                                                                                                     L. Jing