https://doi.org/10.31449/inf.v45i7.3739 Informatica 45 (2021) 57–65

Wind Sounds Classification Using Different Audio Feature Extraction Techniques

Wala'a N. Jasim
Department of Pharmacognosy, College of Pharmacy, University of Basra, Iraq
E-mail: Walaa.jasim@uobasrah.edu.iq

Saba Abdual Wahid Saddam and Esra'a J. Harfash
Department of Computer Science, College of Computer Science and Information Technology, University of Basra, Iraq
E-mail: Saba.Saddam@uobasrah.edu.iq, esra.harfash@uobasrah.edu.iq

Keywords: audio signal, audio feature extraction, ZCR, FFT, LPC, PLP, CNN, CNN classification

Received: 10/12/2021

In this research, different audio feature extraction techniques are implemented and classification approaches are presented to classify seven types of wind. We applied feature extraction techniques such as the Zero Crossing Rate (ZCR), Fast Fourier Transform (FFT), Linear Predictive Coding (LPC), and Perceptual Linear Prediction (PLP). Some of these methods are known to work well for human voices, but we apply them here to characterize wind audio content. A CNN classification method is implemented to determine the class of the input wind sound signal. Experimental results show that each of these feature extraction methods gives different results, with the PLP features yielding the best classification accuracy.

1 Introduction

Processing an audio signal generally includes extracting the most important features from it, analyzing it, determining the presence of a specific pattern in the signal, and evaluating its behavior, as well as how a particular signal is related to other similar signals. Sound signals come in different types, such as speech, animal sounds, sounds of specific events in our lives, music, and environmental sounds. The processing of audio signals has therefore developed considerably over the past few years, especially with regard to analyzing audio signals, extracting their most important characteristics, and classifying them [1]. Any signal that represents a sound has a number of parameters such as amplitude, frequency, and bandwidth, and these qualities can be used in many audio signal processing tasks. Figure 1 shows a representation of an audio signal with its amplitude and time parameters [2]. Audio processing techniques involve extracting features from a wave signal file, followed by decision-making schemes to detect and classify the input sound. It is important to organize audio information into classes such as speech, music, or noise for faster and more precise access to the information [3]. The classification of audio content is therefore a significant and interesting problem.
It has two main parts: audio feature extraction and classification [4].

Audio feature extraction is an important foundation of current audio signal processing research and development. Audio features are pieces of information that can be derived from an audio signal and that represent its content. Features can be divided into groups, each defining a set of related features; although audio analysis problems differ somewhat in nature, they lean heavily on these audio feature groups. Low-level features, such as the zero-crossing rate, signal energy, and spectral centroid, are often calculated directly from the audio signal on a frame-by-frame basis [5] (a minimal sketch of such frame-wise features is given at the end of this introduction).

Figure 1: The time and frequency representation of sound signals.

Audio classification is one of the most widespread use cases and involves taking a sound and assigning it to one of several classes; for example, the task could be to identify the kind of sound or its source. The recent surge of attention to deep learning has led to many scientific and practical applications in different fields of signal processing, often outperforming traditional signal processing over a wide range of tasks. In the most recent wave, deep learning first drew interest in image processing, but has since been widely adopted in environmental sound, music, and speech processing, in addition to areas such as chemistry, genomics, quantum computing, drug discovery, recommendation systems, and natural language processing. As a result, techniques previously used in audio signal processing, such as Gaussian mixture models (GMM), non-negative matrix factorization, and hidden Markov models (HMM), are often outperformed by deep learning models in applications where enough data is available [6]. Many scientific problems and fields have witnessed great developments through deep learning, for example computer vision, natural language processing, and the sound domain, with tasks such as music recommendation and speech recognition [7]-[9]. Sound classification systems based on deep neural networks such as CNNs have achieved important improvements in recognition and classification capability. Nonetheless, their computational complexity and limited exploitation of global dependencies in long sequences restrict further improvements in their classification results [10].

In recent years, much research has been carried out on automatic sound classification and detection in outdoor environments. Some researchers have focused on environmental sounds, both natural and human-produced, while others have concentrated on the detection and classification of various animal species [11-15]. The objective of this paper is to introduce a wind sound detection and classification system focused on classifying several classes of wind sounds. Based on the time- and frequency-domain information contained in the signal, these features are used to analyze and classify the wind audio signal. The following classes of wind sounds are considered: soft, howling, ghost, blizzard, cold, desert, strong, and scary wind.
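As a concrete illustration of the frame-wise low-level features mentioned above, the following minimal sketch (ours, not the paper's code) computes the zero-crossing rate, short-time energy, and spectral centroid of a recording frame by frame. The file name wind.wav, the 1024-sample frame length, and the 512-sample hop are illustrative assumptions, and the librosa library is assumed to be available.

```python
# Minimal sketch (not the paper's code): frame-wise low-level features.
# "wind.wav" is a hypothetical example file; frame settings are illustrative.
import librosa
import numpy as np

y, sr = librosa.load("wind.wav", sr=None)      # keep the native sample rate
frame_length, hop_length = 1024, 512

# Zero-crossing rate per frame (fraction of sign changes within each frame).
zcr = librosa.feature.zero_crossing_rate(
    y, frame_length=frame_length, hop_length=hop_length)[0]

# Short-time energy per frame.
frames = librosa.util.frame(y, frame_length=frame_length, hop_length=hop_length)
energy = np.sum(frames ** 2, axis=0)

# Spectral centroid per frame (the "center of mass" of the spectrum).
centroid = librosa.feature.spectral_centroid(
    y=y, sr=sr, n_fft=frame_length, hop_length=hop_length)[0]

# Frame counts may differ slightly between routines due to padding conventions.
print(zcr.shape, energy.shape, centroid.shape)
```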
Several feature extraction techniques are implemented to extract the most important time- or frequency-domain features of the wind audio signal: ZCR, FFT, LPC, and PLP. A CNN model is used here to classify the wind sounds.

The rest of the paper is organized as follows. After this introduction, related work is reviewed in Section 2. Section 3 explains some sound feature extraction techniques. Section 4 explains audio deep learning models, and Section 5 presents the database, the main steps and techniques used to build the system, and the accuracy results. Finally, the conclusion is given in the last section.

2 Literature review

Nowadays, sound classification is a broad field of study that has attracted great interest from many researchers. With the improvement of deep CNNs and their effective use in computer vision (CV), language modeling, speech recognition, and related fields, it has been confirmed that CNN-based architectures outperform classical methods in various classification tasks. For this reason, they have been applied to automatic sound event recognition in recent years, as in the present paper, where we use a CNN to classify wind sounds.

Pablo Zinemanas et al. [7] proposed a new interpretable deep learning model for automatic sound classification that explains its predictions based on the similarity of the input to a set of learned prototypes in a latent space. Their model consists of two main components: an autoencoder and a classifier. The model input is a time-frequency representation of the audio signal. The autoencoder represents the input in a latent space of useful features learned during training; the encoded input is then used by the classifier to make a prediction. Their model achieves results comparable to state-of-the-art approaches on three different sound classification tasks covering music, environmental audio, and speech. Two automatic techniques were presented to prune the model. The model is open source and is accompanied by a web application for manual model editing, which allows a human-in-the-loop debugging approach.

Loris Nanni et al. [16] combined different clustering techniques with a Siamese neural network to produce a dissimilarity space that is then used to train an SVM for animal audio classification. They used freely available animal audio datasets consisting of bird and cat sounds, and an SVM classifies each spectrogram via its dissimilarity vector. Their proposed dissimilarity-space technique performed well on both classification tasks without ad-hoc optimization of the clustering approaches. Their results showed that stand-alone CNNs did not work better than the combination of CNN-based methods applied to animal audio classification.

Silvia Liberata Ullo et al. [17] presented a hybrid model for accurate and automatic environmental sound classification. They used optimal allocation sampling (OAS) to extract informative samples from each class. The samples acquired via OAS are turned into spectrograms containing a time-frequency-amplitude representation using the short-time Fourier transform (STFT).
They used pre-trained networks to extract multiple deep features and classified them by applying multi-classification methods such as Decision Tree (DT) {fine, medium, coarse kernel}, K-Nearest Neighbor (K-NN) {fine, cosine, medium, cubic, coarse, and weighted kernel}, SVM, Linear Discriminant Analysis (LDA), Bagged Tree, and Softmax classifiers. They used the ESC-10 dataset to evaluate the methodology. Their proposed method proved robust, promising, and effective compared with other techniques using the same dataset.

Md. Rayhan Ahmed et al. [18] presented a system that turns a short sound-event audio file into a spectrogram image and feeds that image to a convolutional neural network (CNN) for processing. The features produced from the image are used to classify different environmental sound events such as fire crackling, sea waves, dog barking, rain, and lightning. They used the log-mel spectrogram auditory feature to train a six-layer stacked CNN model. They evaluated the accuracy of their model in classifying environmental sounds on three datasets, achieving 92.9%, 91.7%, and 65.8% on the UrbanSound8K, ESC-10, and ESC-50 datasets, respectively. Their study compared the Adam and RAdam optimizers used to train the model to correctly classify environmental sounds with an image recognition architecture.

Diez Gaspon et al. [19] presented an automatic system for detecting and classifying sounds, particularly those generated by insects and birds among the other sounds that can be heard in a natural environment. They compared the performance of three different features: mel-frequency cepstral coefficients (MFCC), the log mel-filtered spectrogram (mel spectrogram), and the log spectrogram (STFT). They generated a sound dataset for the development of their system, recorded in three different natural parks and containing sounds of many insect and bird species together with background noise. Their system uses neural networks (NN) to detect and classify sound frames. Their experiments yielded good accuracy in the detection and classification of sound frames, with high results compared with other approaches.

Yu Su et al. [20] proposed two combined feature sets to allow a more universal representation of environmental sounds, and presented a four-layer CNN to improve environmental sound classification (ESC) with the suggested aggregated features. These features were the log-mel spectrogram, chroma, spectral contrast, and tonnetz. In their system, the log-mel spectrogram, chroma, spectral contrast, and tonnetz are aggregated to compose the LMC feature set, and MFCC is joined with spectral contrast, chroma, and tonnetz to compose the MC feature set. The CNNs trained with the various features are then fused using Dempster–Shafer evidence theory to form the TSCNN-DS model. Their results indicate that the combined features with the four-layer CNN were well suited to environmental sound classification and considerably outperformed other classical techniques; the TSCNN-DS model achieved a classification accuracy of 97.2%.

Aditya Khamparia et al. [21] proposed a system to classify environmental sounds based on spectrograms produced from these sounds using deep learning networks.
They used a CNN in both the feature extraction and classification stages, and utilized the spectrogram images of environmental sounds to train a tensor deep stacking network (TDSN) and a convolutional neural network (CNN). They applied two datasets in their experimental work: ESC-50 and ESC-10. Both systems were trained on these datasets; the CNN achieved accuracies of 77% on ESC-10 and 49% on ESC-50, while the TDSN trained on ESC-10 achieved 56%. From their experimental work, they concluded that their proposed system for sound classification using spectrogram images can be effectively used to develop sound recognition and classification systems.

Marielle Malfante et al. [22] addressed the environmental monitoring issue; specifically, their work focused on the use of passive acoustic monitoring systems to observe ocean life, in particular fish populations. In their study, they used 84 features in the feature extraction stage and a forward selection approach in the feature selection stage, in which features are ranked by importance according to their weight in the random forest (RF) model. They built discriminative models using support vector machines (SVM) and random forests (RF), which are among the most important supervised machine learning techniques. The features proposed to describe the recordings came from a comprehensive state of the art across the different domains in which acoustic signal classification is performed, including music, environmental sounds, and speech. In addition, their study extracted features from three representations of the data (time, frequency, and cepstral domains). Their proposed classification scheme was tested on real fish sounds recorded in different areas and obtained 96.9% correct classification.

Sunit Sivasankaran et al. [23] presented algorithms to classify environmental sounds in order to provide contextual information to devices such as hearing aids for better performance. They utilized signal sub-band energy to construct a signal-dependent dictionary and matching pursuit algorithms to obtain a sparse representation of the signal. The coefficients of the sparse vector were then used as weights to compute weighted features, which, together with MFCCs, were used as feature vectors for classification. Their experimental results showed that the proposed method achieved an accuracy of 95.6% when classifying 14 classes of environmental sounds using a GMM.

Siddharth Sigtia et al. [24] presented automatic environmental sound recognition (AESR) algorithms developed with explicit consideration of computational cost. In their experiments, mel-frequency cepstral coefficient (MFCC) features, which are widely used in environmental sound recognition and speech recognition, were extracted from the audio. They showed that an AESR algorithm can make the most of a limited amount of computing power by comparing sound classification performance as a function of computational cost. Their results showed that a DNN produced the best classification accuracy across a range of computational costs, while a GMM yielded reasonably good accuracy at a small cost, and an SVM stood between the two in terms of the trade-off between computational cost and accuracy.
3 Feature extraction techniques

Features are quantities whose values can be measured numerically using specific techniques. For example, a sound wave consists primarily of a sample rate and sample data, and several transformations can be performed on them to extract important and valuable features [25-27]. The accuracy of the system relies on the features and the classification methods. Extracting effective features is an important phase in the front-end module of a sound classification system. Each class of sound has features that distinguish it from the other types of sounds, but the sound signal of one class may change with time, and this change may affect any of the sound variables, such as amplitude or frequency. The following paragraphs explain some of the techniques used to extract features from a sound file; some are specialized in extracting features from the time domain and others from the frequency domain.

3.1 The zero crossing rate

The ZCR is the rate at which the sign of a signal changes (from positive to negative or the reverse) along the signal. Speech recognition and music processing often use this feature. Its value is high for percussive sounds such as those of minerals and rocks [28]. The ZCR is defined according to the following equation [29]:

$$\mathrm{ZCR}=\frac{1}{N-1}\sum_{n=1}^{N-1}\frac{1}{2}\left|\operatorname{sgn}\big(s(n)\big)-\operatorname{sgn}\big(s(n-1)\big)\right|$$

where sgn(·) is the sign function, i.e.

$$\operatorname{sgn}\big(s(n)\big)=\begin{cases}1, & s(n)\ge 0\\ -1, & s(n)<0\end{cases}$$

3.2 Discrete Fourier transform

Spectral features are important in digital audio processing. A spectrum can be represented mathematically using the Fourier transform of a signal, which converts the signal from the time domain to the frequency domain; that is, a spectrum is the frequency-domain representation of the input audio's time-domain signal [30]. Mathematically, the discrete Fourier transform (DFT) transforms a finite sequence of equally spaced samples of a function into a same-length sequence of equally spaced samples of the discrete-time Fourier transform (DTFT), which is a complex-valued function of frequency. The DFT transforms a sequence of N complex numbers $x_0, x_1, \ldots, x_{N-1}$ into another sequence of complex numbers $X_0, X_1, \ldots, X_{N-1}$, defined by [31]:

$$X_k=\sum_{n=0}^{N-1}x_n\,e^{-i 2\pi k n/N},\qquad k=0,1,\ldots,N-1$$

Since the DFT deals with a finite amount of data, it can be implemented on computers by numerical algorithms or even dedicated hardware. These implementations usually employ efficient fast Fourier transform (FFT) algorithms, and the terms "FFT" and "DFT" are often used interchangeably. Prior to its current usage, the "FFT" initialism may also have been used for the ambiguous term "finite Fourier transform" [32].

3.3 Linear Predictive Coding (LPC)

In audio signal processing and speech processing, LPC is a method used mostly to represent the spectral envelope of a digital speech signal in compressed form, using the information of a linear predictive model [33]. For periodic signals with period $N_P$, it is evident that $S(n)\approx S(n-N_P)$. However, that is not what linear prediction does; it estimates $S(n)$ from the $P$ ($P<N_P$) previous samples:

$$\hat{S}(n)=\sum_{k=1}^{P}a_k\,S(n-k)$$

where the coefficients $a_k$ are chosen to minimize the prediction error.
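To make the ZCR definition above concrete, here is a small sketch that implements the equation directly in NumPy; the 100 Hz sine test signal and the 8 kHz sample rate are illustrative assumptions.

```python
# Direct implementation of the ZCR equation above (a sketch, not the paper's code).
import numpy as np

def zero_crossing_rate(s: np.ndarray) -> float:
    """Average rate of sign changes between consecutive samples."""
    sgn = np.where(s >= 0, 1, -1)        # sgn(s(n)) = 1 if s(n) >= 0, else -1
    return np.sum(np.abs(sgn[1:] - sgn[:-1]) / 2) / (len(s) - 1)

# A 100 Hz sine sampled at 8 kHz crosses zero about 200 times per second,
# so the per-sample rate should be roughly 200 / 8000 = 0.025.
sr = 8000
t = np.arange(sr) / sr
s = np.sin(2 * np.pi * 100 * t)
print(zero_crossing_rate(s))             # ~0.025
```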
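The DFT definition above can be checked against a library FFT routine. The sketch below (our illustration, not the paper's code) evaluates the sum naively in O(N²) and compares it with numpy.fft.fft, which computes the same transform in O(N log N).

```python
# Naive DFT following the definition above, verified against NumPy's FFT.
import numpy as np

def dft(x: np.ndarray) -> np.ndarray:
    N = len(x)
    n = np.arange(N)
    k = n.reshape(-1, 1)
    # Matrix of twiddle factors e^{-i 2 pi k n / N}, applied to x.
    return np.exp(-2j * np.pi * k * n / N) @ x

x = np.random.randn(256)
assert np.allclose(dft(x), np.fft.fft(x))   # the FFT computes the identical DFT
```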
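To illustrate linear prediction, the sketch below estimates order-P coefficients with librosa.lpc (which uses Burg's method) and predicts each sample from its P predecessors, following the equation above. The order P = 8 and the noisy 220 Hz sine test signal are illustrative assumptions, not choices taken from the paper.

```python
# Illustrative LPC sketch (not the paper's code): fit order-P coefficients
# and predict each sample from its P predecessors.
import numpy as np
import scipy.signal
import librosa

P = 8                                     # illustrative prediction order
sr = 8000
t = np.arange(sr) / sr
s = np.sin(2 * np.pi * 220 * t) + 0.01 * np.random.randn(sr)  # toy periodic signal

a = librosa.lpc(s, order=P)               # error-filter coefficients, a[0] == 1
b = np.hstack([[0], -a[1:]])              # prediction coefficients a_k of the equation
s_hat = scipy.signal.lfilter(b, [1], s)   # S_hat(n) = sum_{k=1..P} a_k S(n-k)

print("mean squared prediction error:", np.mean((s - s_hat) ** 2))
```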