https://doi.org/10.31449/inf.v48i12.6014 Informatica 48 (2024) 107–112

Audio Feature Extraction: Research on Retrieval and Matching of Hummed Melodies

Liu Yang
Dezhou University, Dezhou, Shandong 253023, China
Email: yangl1979@outlook.com

Keywords: humming, music, retrieval and matching, dynamic time warping

Received: April 12, 2024

Humming clips facilitate more intuitive and user-friendly music retrieval. This paper combined the K-means clustering algorithm, the back-propagation neural network (BPNN) model, and the dynamic time warping (DTW) algorithm for humming-based music retrieval and matching. Initially, the K-means algorithm was used to narrow down the search scope. Then, the BPNN model was employed to extract abstract features of the music melody, and the DTW algorithm was used to match these abstract features. In the simulation experiment, the classification ability of the K-means algorithm was verified, and the combined method was compared with the DTW and BPNN+DTW retrieval algorithms. The results showed that the K-means algorithm had good music segment classification performance and that the proposed retrieval algorithm retrieved the target music more accurately and stably.

Povzetek: The study focuses on audio feature extraction to facilitate the retrieval and matching of melodies. The author uses a combination of the K-means algorithm, a back-propagation neural network (BPNN), and dynamic time warping (DTW) to improve the accuracy of melody retrieval and matching.

1 Introduction

Music, as a form of artistic expression, transcends cultural and linguistic boundaries, resonating deeply within people's hearts [1]. With the continuous progress in the field of music, the repertoire available on music platforms has grown significantly, giving users more choices. However, this abundance also makes selection harder. The core function of a music platform is its search feature. If users clearly know the name of the music they are looking for [2], they can quickly locate it by entering its name. If they know music-related keywords, they can also efficiently narrow the scope and reduce retrieval difficulty. These traditional retrieval methods require users to possess explicit information such as the title, lyrics, and author of the music [3]. When users do not know this information, retrieval becomes considerably more difficult. Retrieval based on a hummed melody is more intuitive: it only requires the user to have an impression of the melody and to be able to hum a portion of it. It requires neither knowledge of the music's key metadata nor extensive manual annotation of the music by the platform [4]. Xing et al. [5] proposed an emotion-driven cross-media retrieval system for Chinese folk images and music; its effectiveness was demonstrated experimentally. Deng et al. [6] proposed a music retrieval technique based on the time-varying dynamics of musical emotion and verified the advantages of the multi-dynamic texture model in predicting time-varying music emotions. Cheng et al. [7] put forward a user-information-aware music interest topic model; experiments showed that it significantly improved the search accuracy of existing text-based music retrieval approaches. The performance of the algorithms used in the above literature is presented in Table 1.
This paper combined the K-means clustering algorithm, the back-propagation neural network (BPNN) model, and the dynamic time warping (DTW) algorithm for humming-based music retrieval and matching. The K-means algorithm was first used to reduce the search scope, the BPNN model was employed to extract abstract features of the music melody, and the DTW algorithm was used to match the abstract features. Finally, simulation experiments were conducted.

Table 1: The performance of algorithms in relevant literature.
Literature        Algorithm   Accuracy
Xing et al. [5]   DE-SVM      89.75%
Deng et al. [6]   MDT         83.33%
Cheng et al. [7]  UIA-MIT     85.64%

2 Humming-based music retrieval algorithm

Using a hummed melody to search for music is more in line with people's everyday intuition and reduces the manual workload of music tagging. Retrieving music through humming involves three steps: audio preprocessing, melody feature extraction, and melody feature matching [8]. For melody feature extraction, pitch and duration are usually used as features. For melody feature matching, since users are not professional musicians, the hummed melody is improvised and uncertain: for the same piece, the user's hummed melody will differ from the standard melody in the database, which makes feature matching difficult [9]. This paper uses the clustering and BPNN algorithms to retrieve and match hummed melodies. The clustering algorithm divides the music melodies into categories to narrow the matching range, and the BPNN mines the hidden patterns in the melody to improve matching accuracy.

Figure 1: Humming-based retrieval and matching process using BPNN (training phase: input and preprocess the audio data from the music library, extract melody features, classify melodies with the clustering algorithm, train the BPNN on the labeled melody features, extract abstract melody features from the BPNN hidden layer, and build an abstract feature database; testing phase: input the humming audio, extract melody features, use the BPNN to classify the melody and extract its abstract features, and match against the abstract feature database within the same category).

The process in Figure 1 is divided into a training phase and a testing phase. The training phase proceeds as follows. ① The audio data in the music library is input and preprocessed, which involves procedures such as noise reduction and framing [10]. ② Melody features are extracted from the audio. In this paper, the fundamental frequency of each audio frame is used as the pitch, calculated as:

S(f_i) = \sum_{n=1}^{M} h_n A(n f_i), \qquad f_0 = \arg\max_{f_i} S(f_i),    (1)

where f_i is the i-th candidate fundamental frequency, f_{\max} is the global peak frequency in the spectrogram of a frame [11], S(f_i) is the confidence when f_i is taken as the fundamental frequency, M is the number of harmonics, h_n is the compression factor of the n-th harmonic, A(n f_i) is the amplitude of the n-th harmonic when f_i is taken as the fundamental frequency, and f_0 is the estimated fundamental frequency of the frame signal, i.e., the candidate frequency with the maximum confidence. The fundamental frequencies of the frames, arranged in time order, form the pitch sequence [12].
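To make equation (1) concrete, the following is a minimal Python sketch of frame-wise pitch extraction by harmonic summation. The candidate grid, the number of harmonics M = 5, and the compression factors h_n = 0.8^(n-1) are illustrative assumptions rather than values taken from the paper.

```python
import numpy as np

def frame_pitch(frame, sr, f_lo=80.0, f_hi=1000.0, M=5, h=0.8):
    """Estimate one frame's fundamental frequency by harmonic summation
    (equation 1): S(f_i) = sum_{n=1..M} h_n * A(n*f_i), f_0 = argmax S(f_i).
    The candidate grid, M and h are illustrative assumptions."""
    window = np.hanning(len(frame))
    spec = np.abs(np.fft.rfft(frame * window))           # amplitude spectrum A(f)
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)

    def amplitude(f):
        # amplitude at the FFT bin closest to frequency f
        return spec[np.argmin(np.abs(freqs - f))]

    candidates = np.arange(f_lo, f_hi, 1.0)               # candidate fundamentals f_i
    scores = [sum((h ** (n - 1)) * amplitude(n * fi)      # h_n compresses higher harmonics
                  for n in range(1, M + 1) if n * fi < sr / 2.0)
              for fi in candidates]
    return float(candidates[int(np.argmax(scores))])      # f_0: most confident candidate

def pitch_sequence(frames, sr):
    """Frame-wise pitches in time order form the melody feature vector."""
    return np.array([frame_pitch(f, sr) for f in frames])
```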
③ The clustering algorithm is used to classify the audio data. This paper uses the K-means clustering algorithm, whose steps are as follows. K audio items are randomly selected as cluster centers. The Euclidean distance between every remaining audio item and each cluster center is then calculated (the pitch sequence of an audio item is its melody feature vector, and the Euclidean distance is the distance between two such vectors). The remaining audio items are assigned to cluster centers according to the nearest-neighbor principle. The mean vector of all melody feature vectors in each cluster is then computed and used as the new cluster center of that cluster, and the nearest assignment step is repeated until the cluster centers converge to a stable state [13]. ④ The cluster to which an audio item belongs is used as its melody label, and the labeled audio data is used to train the BPNN model. The input of the BPNN model is the melody feature vector of the audio. The feature vector is propagated forward through the hidden layer, and the melody label is produced by the output layer. The forward calculation in the hidden layer is:

h = f\left( \sum_{i=1}^{n} \omega_i x_i - b \right),    (2)

where h is the hidden state output by the hidden layer, x_i is the i-th component of the input melody feature vector, b is the bias term, f(\cdot) is the activation function, and \omega_i is the weight. The melody label produced by the output layer is compared with the actual label, and the cross-entropy is used as the error. The parameters are adjusted by back-propagating this error until it converges [14]. ⑤ The hidden-state output of the last hidden layer of the trained BPNN model is used as the abstract feature representation of the melody, and an abstract melody feature database is constructed for the audio data.

The testing phase proceeds as follows. ① The humming audio data is input and preprocessed by noise reduction and framing. ② The pitch sequence of the data, i.e., the melody feature vector, is extracted according to equation (1). ③ The melody feature vector is fed into the trained BPNN model, which outputs the melody classification label and the abstract feature representation of the melody. ④ According to the classification label given by the BPNN model, the melodies of the same class are selected from the abstract melody feature database as the preliminary candidate set. The abstract feature representation of the humming melody is then matched against the abstract features in the candidate set. The DTW algorithm is used to compute the distance between the humming audio and each audio item in the candidate set [15]:

D = \min_{w(i)} \sum_{i=1}^{M} d\,[T(i), R(w(i))],    (3)

where w(i) is the frame-number matching path function between the humming audio and a candidate audio, i is the frame index of the candidate audio with a maximum of M frames, T(i) is the abstract melody feature of the i-th frame of the candidate audio, R(w(i)) is the abstract melody feature of the matched humming audio frame, d[T(i), R(w(i))] is the distance between them, and D is the distance between the humming audio and the candidate audio.
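As a concrete reference for equation (3), the sketch below computes the DTW distance between two abstract feature sequences, assuming the Euclidean distance as the frame-level distance d and the standard step pattern; it is an illustrative implementation, not the paper's code.

```python
import numpy as np

def dtw_distance(T, R):
    """Accumulated DTW distance of equation (3): D = min over warping paths w
    of sum_i d[T(i), R(w(i))].  T: candidate feature sequence (M frames),
    R: humming feature sequence.  Euclidean frame distance is assumed."""
    M, N = len(T), len(R)
    cost = np.full((M + 1, N + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            d = np.linalg.norm(np.atleast_1d(T[i - 1]) - np.atleast_1d(R[j - 1]))
            # extend the cheapest of the three allowed predecessor cells
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(cost[M, N])
```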
The candidate audios are sorted by D from smallest to largest, and the music corresponding to the top k candidates is returned as the retrieval and matching result.

3 Simulation experiment

3.1 Experimental data

The music datasets used in this simulation experiment were the MIR and MedleyDB datasets. The MIR dataset is a widely used music dataset containing various music types, such as pop, classical, and jazz, and it provides audio files and corresponding metadata. The MedleyDB dataset is an audio dataset for music analysis and information retrieval. It contains music of various styles and types, including pop, rock, jazz, and classical, and it also provides high-quality audio files and metadata. Non-repeated music was selected from these datasets, covering pop, classical, jazz, rock, and folk, with 100 pieces per type. Each piece was divided into several segments, yielding a total of 12,140 music segments, of which 70% were used as the training set and the remaining 30% as the test set. The music data had been filtered to remove noise and divided into frames.

3.2 Experimental setup

In the proposed music retrieval and matching algorithm, the K-means algorithm was used to cluster similar music clips, and the cluster-labeled music data was then used to train the BPNN model so that it could identify the type of a music clip and initially narrow the matching range. The hidden layer of the BPNN model provided the abstract feature representation of the music melody, and this representation was used to match music within the same-type subset of the music library obtained after narrowing the matching range. The relevant parameters of the algorithm, determined by an orthogonal experiment, are shown in Table 2. The K value of the K-means step was set to 5. The number of input-layer nodes of the BPNN model depended on the number of features in the input sample, i.e., the dimensionality of the melody feature vector. The number of output-layer nodes depended on the number of initial classes produced by the K-means algorithm. Sigmoid was used in the hidden layer because it fits non-linear patterns in the data well.

Table 2: The relevant parameters of the proposed algorithm.
Parameter                                            Setting
The K value of the K-means algorithm                 5
The number of iterations of the K-means algorithm    200
The activation function of the BPNN algorithm        sigmoid
The hidden layer of the BPNN algorithm               One layer, 256 nodes
The output layer of the BPNN algorithm               Five nodes
The number of training iterations of the BPNN        500

The proposed algorithm was also compared with two other retrieval and matching algorithms. The first was the DTW algorithm, which directly used the extracted pitch sequence as the melody feature to retrieve and match the humming audio; the k value representing the number of returned results was also set to 5. The second was the BPNN+DTW algorithm, which used the BPNN model to extract the abstract feature representation of the music melody and the DTW algorithm to retrieve and match music based on that abstract feature. The difference between this algorithm and the one proposed in this paper was the absence of a clustering algorithm for classifying the audio data used to train the BPNN model.
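For illustration, a minimal sketch of how the Table 2 configuration might be realized with scikit-learn is given below. Fixed-length melody feature vectors stored in a hypothetical file melody_features.npy are assumed, and one feature vector per segment is used for simplicity (the paper applies DTW to frame-wise abstract features); the choice of scikit-learn's KMeans and MLPClassifier is an implementation assumption, not the paper's code.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neural_network import MLPClassifier

# X: (n_segments, n_features) melody feature vectors of the training library;
# "melody_features.npy" is a hypothetical file name used for illustration.
X = np.load("melody_features.npy")

# Step 1: K-means pre-classification (K = 5, 200 iterations, as in Table 2).
kmeans = KMeans(n_clusters=5, max_iter=200, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

# Step 2: BPNN trained on the cluster labels (one sigmoid hidden layer with
# 256 nodes and 500 training iterations, as in Table 2; the five output nodes
# follow from the five cluster labels).
bpnn = MLPClassifier(hidden_layer_sizes=(256,), activation="logistic",
                     max_iter=500, random_state=0)
bpnn.fit(X, labels)

def hidden_features(model, X):
    """Sigmoid activations of the hidden layer, used as abstract melody features."""
    z = X @ model.coefs_[0] + model.intercepts_[0]
    return 1.0 / (1.0 + np.exp(-z))

abstract_db = hidden_features(bpnn, X)   # abstract feature database for matching
```

At retrieval time, the query's predicted label would restrict matching to one cluster, and its hidden-layer features would be compared against abstract_db with a DTW-style distance such as the one sketched above.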
3.3 Evaluation criteria

Before validating the retrieval and matching algorithm, the classification performance of the support vector machine (SVM) and K-means algorithms was compared. The evaluation criteria were the precision, recall rate, and F value:

P = \frac{TP}{TP + FP}, \quad R = \frac{TP}{TP + FN}, \quad F = \frac{2PR}{P + R},    (4)

where P is the precision, R is the recall rate, F is the comprehensive value of precision and recall, TP is the number of true positives, FP is the number of false positives, FN is the number of false negatives, and TN is the number of true negatives. Precision, recall rate, and F value are also commonly used to evaluate retrieval and matching algorithms. In addition, two further indicators are used for retrieval and matching: the mean reciprocal rank (MRR) and the hit rate (HR), calculated as:

MRR = \frac{1}{K} \sum_{i=1}^{K} rank_i^{-1}, \quad HR = \frac{1}{K} \sum_{i=1}^{K} hit_i,    (5)

where K is the number of humming queries, i.e., the number of retrievals performed during testing, rank_i is the position of the retrieval target in the returned list of the i-th retrieval, and hit_i indicates whether the target is in the returned list of the i-th retrieval (1 if it is, 0 otherwise).
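As a small illustration of equation (5), the sketch below computes MRR and HR from the returned result lists, assuming the conventional choice of a reciprocal rank of 0 when the target is not returned; it is an illustrative helper, not code from the paper.

```python
def mrr_and_hr(ranked_lists, targets):
    """MRR and HR over K humming queries (equation 5).  ranked_lists[i] is the
    returned result list for query i and targets[i] is its true target song.
    A reciprocal rank of 0 is assumed when the target is not returned."""
    K = len(targets)
    reciprocal_ranks, hits = [], []
    for returned, target in zip(ranked_lists, targets):
        if target in returned:
            reciprocal_ranks.append(1.0 / (returned.index(target) + 1))  # 1-based rank_i
            hits.append(1)                                               # hit_i = 1
        else:
            reciprocal_ranks.append(0.0)
            hits.append(0)                                               # hit_i = 0
    return sum(reciprocal_ranks) / K, sum(hits) / K

# Example: targets found at ranks 1 and 3 -> MRR = (1 + 1/3) / 2 = 2/3, HR = 1.0
# mrr, hr = mrr_and_hr([["a", "b", "c"], ["x", "y", "t"]], ["a", "t"])
```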
3.4 Experimental results

First, the classification performance of the K-means clustering algorithm was tested and compared with the SVM algorithm (Figure 2). The K-means clustering algorithm clearly classified the audio better. The precision of the SVM algorithm was 78.9%, its recall rate was 79.6%, and its F value was 79.25%; for the K-means algorithm, the precision was 98.7%, the recall rate was 97.6%, and the F value was 98.15%. The SVM algorithm classified data using the support vector plane: the class of a data point was determined by which side of the plane it fell on, and neither the distance between the data point and the plane nor the distance between data points mattered. The K-means algorithm was based on the distance between data points, i.e., their degree of similarity, so the data within a class were more similar to each other, which is consistent with intuition.

Figure 2: Classification performance of the SVM and K-means algorithms.

Table 3 shows partial retrieval results of the three retrieval algorithms. All three algorithms could retrieve the target music, but they returned different retrieval lists. In the list returned by the algorithm proposed in this paper, the target music ranked first and the remaining music belonged to the same genre. In the list returned by the BPNN+DTW algorithm, the target music was in the middle of the list, and a few of the remaining items belonged to different genres. In the list returned by the DTW algorithm, the target was at the bottom of the list, and the genres of the other music were mixed.

Table 3: Partial results of the three retrieval algorithms for two humming clips, "Daybreak" (pop) and "Five Hundred Miles" (ballad).
The search list of the DTW algorithm:
- "Daybreak" query: 1. The Butterfly Lovers (jazz); 2. The Same Smile (jazz); 3. Love is a Song (folk song); 4. The Sailing Ship (rock); 5. Daybreak (pop)
- "Five Hundred Miles" query: 1. Yuan Wu Qu (jazz); 2. Time is a Bulldozer (folk song); 3. In Tune (pop); 4. Sorry for You (ballad); 5. Five Hundred Miles (ballad)
The search list of the BPNN+DTW algorithm:
- "Daybreak" query: 1. A Stunning Sight (pop); 2. Godfather (pop); 3. Daybreak (pop); 4. Like is Not Love (jazz); 5. Winter Campus (ballad)
- "Five Hundred Miles" query: 1. The Sun (ballad); 2. The Good Wind Will Sing (ballad); 3. Five Hundred Miles (ballad); 4. Come on (pop); 5. The Pilot (jazz)
The search list of the K-means+BPNN+DTW algorithm:
- "Daybreak" query: 1. Daybreak (pop); 2. The Heart Goes Somewhere (pop); 3. In Tune (pop); 4. Manto Manor (pop); 5. Ode to Gallantry (pop)
- "Five Hundred Miles" query: 1. Five Hundred Miles (ballad); 2. Time for a Tear (ballad); 3. Little Flower (ballad); 4. A Leaf Knowing Autumn (ballad); 5. Coming as Promised (folk song)

The performance comparison of the three retrieval algorithms is presented in Figure 3. The HR and MRR of the retrieval algorithm using only DTW were 0.75 and 0.74; the HR and MRR of the BPNN+DTW algorithm were 0.84 and 0.86; the HR and MRR of the K-means+BPNN+DTW algorithm were 0.97 and 0.95. The retrieval and matching algorithm proposed in this paper thus performed best, followed by the BPNN+DTW algorithm, while the DTW algorithm performed worst.

Figure 3: Performance comparison of the three retrieval algorithms.

4 Discussion

The development of music has resulted in an ever-growing number of music tracks, which makes it more challenging to find the desired music in a vast collection. Traditional methods for retrieving music rely on textual information such as song titles, composers, and lyrics. However, this text-based approach requires users to remember specific details about the music, and not all users possess such musical knowledge. In contrast, using a hummed melody to search for music is simpler and aligns better with human habits. This article combined the K-means clustering algorithm, the BPNN model, and the DTW algorithm to retrieve hummed melodies, and simulation experiments were then conducted.

Compared with the SVM algorithm, the K-means clustering algorithm performed better in audio classification. When the SVM algorithm classified data, it relied on the support vector plane: the class of a data point was determined by which side of the plane it fell on, so neither the distance between a data point and the plane nor the "distance" between data points mattered. The K-means algorithm, on the other hand, classified data based on the "distance", i.e., the similarity, between data points, resulting in more similar data within each category. The comparison of the three retrieval algorithms likewise showed that the retrieval and matching algorithm proposed in this article was the best. Although the DTW algorithm solved the problem of unequal lengths between matched audios by non-linear warping, it still relied on the audio features themselves during comparison. The only feature used by the DTW algorithm was the pitch sequence, which cannot fully reflect all the characteristics of the audio. By combining the BPNN model with the DTW algorithm, the BPNN model was employed to extract further abstract features from the melody features, i.e., to uncover the hidden patterns contained in them, and the matching performance improved. In the combined retrieval and matching algorithm, when training the BPNN model, the K-means clustering algorithm was first used to preliminarily classify the audio data and assign labels to them, providing additional label information for the subsequent training of the BPNN model.
Therefore, the overall performance of this algorithm was superior. The novelty of this article lies in using the BPNN algorithm to extract deep features from the audio data while employing the K-means clustering algorithm to pre-classify the training data, which further enriches the representation of the audio data and improves the performance of music retrieval and matching.

5 Conclusion

In this paper, the K-means clustering algorithm, the BPNN model, and the DTW algorithm were combined for humming-based music retrieval and matching. The K-means algorithm was used to narrow down the retrieval scope, the BPNN model was employed to extract abstract features of the music melody, and the DTW algorithm was used to match the abstract features. In the simulation experiment, the classification ability of the K-means algorithm was verified, and the combined algorithm was then compared with the DTW and BPNN+DTW algorithms. The results are as follows. (1) The K-means clustering algorithm classified audio well. (2) Among the three algorithms, the retrieval algorithm proposed in this paper ranked the target music first and retrieved other music of the same genre. (3) The proposed retrieval and matching algorithm performed best, followed by the BPNN+DTW and DTW algorithms.

References

[1] Schindler A, Rauber A (2016) Harnessing Music-Related Visual Stereotypes for Music Information Retrieval. ACM Transactions on Intelligent Systems & Technology, 8, pp. 1-21. https://doi.org/10.1145/2926719
[2] Srinivasa M Y V, Koolagudi S G (2018) Content-Based Music Information Retrieval (CB-MIR) and Its Applications toward the Music Industry: A Review. ACM Computing Surveys (CSUR), 51, pp. 1-46. https://doi.org/10.1145/3177849
[3] Weissenberger L (2015) Toward a universal, meta-theoretical framework for music information classification and retrieval. Journal of Documentation, 71, pp. 917-937. https://doi.org/10.1108/JD-08-2013-0106
[4] Furner M, Islam M Z, Li C T (2021) Knowledge Discovery and Visualisation Framework using Machine Learning for Music Information Retrieval from Broadcast Radio Data. Expert Systems with Applications, 182, pp. 1-11. https://doi.org/10.1016/j.eswa.2021.115236
[5] Xing B, Zhang K, Sun S, Zhang L, Gao Z, Wang J, Chen S (2015) Emotion-driven Chinese folk music-image retrieval based on DE-SVM. Neurocomputing, 148, pp. 619-627. https://doi.org/10.1016/j.neucom.2014.08.007
[6] Deng J J, Leung C H C (2015) Dynamic Time Warping for Music Retrieval Using Time Series Modeling of Musical Emotions. IEEE Transactions on Affective Computing, 6, pp. 137-151. https://doi.org/10.1109/TAFFC.2015.2404352
[7] Cheng Z, Shen J, Nie L, Chua T S, Kankanhalli M (2017) Exploring User-Specific Information in Music Retrieval. ACM SIGIR Forum, 51, pp. 655-664. https://doi.org/10.1145/3077136.3080772
[8] Hu X, Lee J H, Bainbridge D, Choi K, Organisciak P, Downie J S (2017) The MIREX grand challenge: A framework of holistic user-experience evaluation in music information retrieval. Journal of the Association for Information Science & Technology, 68, pp. 97-112. https://doi.org/10.1002/asi.23618
[9] Jao P K, Yang Y H (2015) Music Annotation and Retrieval using Unlabeled Exemplars: Correlation and Sparse Codes. IEEE Signal Processing Letters, 22, pp. 1771-1775. https://doi.org/10.1109/LSP.2015.2433061
[10] Silva A C M D, Silva D F, Marcacini R M (2022) Multimodal representation learning over heterogeneous networks for tag-based music retrieval.
Expert Systems with Applications, 207, pp. 117969.1-117969.9.
[11] Silva D F, Yeh C, Zhu Y, Batista G E, Keogh E (2019) Fast Similarity Matrix Profile for Music Analysis and Exploration. IEEE Transactions on Multimedia, 21, pp. 29-38.
[12] Schwabe M, Heizmann M (2021) Influence of input data representations for time-dependent instrument recognition. Technisches Messen: Sensoren, Geräte, Systeme, 88.
[13] Fernández-Rubio G, Carlomagno F, Vuust P, Kringelbach M L, Bonetti L (2022) Associations between abstract working memory abilities and brain activity underlying long-term recognition of auditory sequences. PNAS Nexus, 1, pgac216. https://doi.org/10.1093/pnasnexus/pgac216
[14] Lei W H, Lee C Y (2016) Query By Singing/Humming System Using Segment-based Melody Matching for Music Retrieval. WSEAS Transactions on Systems, 15, pp. 157-167.
[15] Patil P, Mengade S, Kolpe P, Gawande V, Budhe K (2015) Song Search Engine Based on Querying by Singing/Humming. International Journal for Scientific Research & Development, 3, pp. 14-16.