Informatica 42 (2018) 259–264

Research on Intelligent English Oral Training System in Mobile Network

Fen Zhu
Foreign Languages School, Luoyang Institute of Science and Technology, Luoyang, Henan, 471000, China
E-mail: fenzhu_lit@163.com

Technical Paper

Keywords: mobile network, Android system, spoken English, resonance peak, evaluation

Received: March 13, 2018

With the rapid development of mobile networks, mobile learning, as a new form of learning, is gradually being accepted. Based on the Android mobile platform, this paper designs a spoken English training system for mobile network devices, covering speech recognition, pronunciation scoring and function design. In view of the characteristics of the Android system, the Mel-frequency cepstral coefficient is selected as the feature parameter for speech recognition, and the dynamic time warping algorithm is introduced as the pattern matching algorithm, making speech recognition better suited to mobile Internet devices. In addition, the speech formant is used as the reference for oral scores, and a scoring method based on a single reference template is adopted. Finally, the spoken English training system is implemented in the Eclipse integrated development environment. The test results show that the success rate of voice input was over 98%, and that the scoring accuracy for monophthong, diphthong and polysyllabic words was 97.15%, 94.96% and 93.62% respectively, suggesting that the system can accurately capture and score learners' spoken English and assist English pronunciation.

Povzetek: Prispevek se ukvarja z mobilnim učenjem angleščine na sistemih z Androidom. (In English: the paper deals with mobile learning of English on Android systems.)

1 Introduction

With the deepening of economic globalization, communication between China and other countries has become increasingly frequent. English, the most widely used language in the world, has therefore gradually become an indispensable tool in daily life and work, and many English training institutions and learning tools have emerged. However, the traditional learning mode, i.e. the face-to-face teaching mode of training institutions, usually achieves poor results in spoken English, largely because of the great difference in pronunciation between English and Chinese. People who grow up in a Chinese-speaking environment unconsciously make pronunciation mistakes when learning spoken English. Moreover, China lacks English teachers who have accurate pronunciation and are able to coach pronunciation, and learners rarely have enough time or a suitable environment for spoken English practice.

With the rapid development of mobile information technology, mobile network terminals such as smartphones and tablet computers have reached almost every aspect of our life. Smartphone-based oral English training software is more convenient and practical than the traditional teaching mode and can effectively avoid its shortcomings. Mobile learning on mobile network devices has been studied extensively. An et al. [1] found that a teaching mode based on a computer corpus was more effective than the traditional teaching mode. Alamer et al. [2] designed and developed a lightweight language learning management system based on mobile Web technologies and APIs; the system allows language students to view and download learning content on their phones and complete interactive tasks designed by teachers. Milutinovic et al. [3]
proposed a mobile adaptive language learning model whose main goal was to improve the mobile language learning process with adaptive technology; the model was designed to take advantage of unique opportunities to deliver learning content in real learning situations. Taking the Android smartphone as the application platform, this study aimed to build an intelligent spoken English training system that can be used on mobile network devices.

2 Mobile learning

Mobile learning [5] refers to the use of portable mobile communication equipment and technology so that learners can study in their preferred way at any time and place. Compared with the classroom learning mode with fixed times, mobile learning is extensive, timely and interactive, giving learners a more relaxed and pleasant learning experience. In addition, the multimedia combination of audio, text, video, image and animation makes mobile learning more vivid. Mobile language learning gives learners more learning options, lets them make full use of fragmented time, and is efficient and flexible.

3 Intelligent spoken English training system design

3.1 Speech recognition

3.1.1 Speech signal preprocessing

(1) Speech signal digitization. A speech signal can be analyzed and processed by computer only after digital conversion. This system uses the headset of an Android phone as the voice input device and the audio recording facility of the Android system [6] to collect the underlying data. In accordance with the Nyquist sampling theorem, a sampling frequency of 7000 Hz is used to collect the speech signal.

(2) Pre-emphasis. To eliminate the influence of mouth and nose radiation, speech signals are usually pre-emphasized with a first-order high-pass filter [7]. Pre-emphasis boosts the high-frequency part of the speech, based on the difference between signal and noise properties, and thus improves the resolution of the high-frequency components. It is usually realized with a first-order FIR high-pass digital filter [16] whose transfer function is

$$H(z) = 1 - \varepsilon z^{-1}, \qquad (1)$$

where $\varepsilon$ is the pre-emphasis coefficient, set to 0.98 in this study. Let $s(n)$ be the speech signal at the n-th sampling point; the weighted signal is then

$$s_2(n) = s(n) - \varepsilon s(n-1), \qquad (2)$$

where $s_2(n)$ is the speech signal after pre-emphasis and $s(n-1)$ is the previous sample.

(3) Windowing. To keep the voice signal in each frame continuous and complete, each frame of speech is generally multiplied by a window function before processing [8]. This paper uses the Hamming window:

$$w(n) = \begin{cases} 0.54 - 0.46\cos\left(\dfrac{2\pi n}{N-1}\right), & 0 \le n \le N-1 \\ 0, & \text{otherwise.} \end{cases} \qquad (3)$$
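To make these preprocessing steps concrete, the following minimal Java sketch (Java being the system's implementation language, per Section 3.5) applies the pre-emphasis of Eqs. (1)-(2) and the Hamming window of Eq. (3) to one frame of samples. The class and method names are illustrative; framing and conversion of the 16-bit PCM samples to double are assumed to have been done beforehand.

```java
/** Illustrative sketch of the preprocessing of Eqs. (1)-(3). */
public final class Preprocess {

    /** Pre-emphasis, Eq. (2): s2(n) = s(n) - eps * s(n-1), with eps = 0.98 in this paper. */
    static double[] preEmphasize(double[] s, double eps) {
        double[] out = new double[s.length];
        out[0] = s[0];                        // first sample has no predecessor
        for (int n = 1; n < s.length; n++) {
            out[n] = s[n] - eps * s[n - 1];
        }
        return out;
    }

    /** Hamming window, Eq. (3): w(n) = 0.54 - 0.46 cos(2*pi*n / (N - 1)). */
    static double[] hammingWindow(int N) {
        double[] w = new double[N];
        for (int n = 0; n < N; n++) {
            w[n] = 0.54 - 0.46 * Math.cos(2.0 * Math.PI * n / (N - 1));
        }
        return w;
    }

    /** Multiplies one frame by the window, element by element. */
    static void applyWindow(double[] frame, double[] window) {
        for (int n = 0; n < frame.length; n++) {
            frame[n] *= window[n];
        }
    }
}
```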
(4) Endpoint detection. In line with the characteristics of the Android platform, this paper combines the short-time average energy [9] and the short-time zero-crossing rate to detect endpoints. Let the n-th frame of the speech signal be $x_n(h)$ and the frame length be N; the short-time energy is then

$$E_n = \sum_{h=0}^{N-1} x_n^2(h), \qquad 0 \le h \le N-1. \qquad (4)$$

The learner's voice can be distinguished from noise by the magnitude of the short-time energy, high-energy frames being treated as speech. However, this method is less stable under low SNR conditions, so the short-time zero-crossing rate is used as well. For the speech signal $x_n(h)$, the short-time zero-crossing rate is

$$Z_n = \frac{1}{2}\sum_{h=0}^{N-1}\left|\mathrm{sgn}[x_n(h)] - \mathrm{sgn}[x_n(h-1)]\right|, \qquad \mathrm{sgn}[x] = \begin{cases} 1, & x \ge 0 \\ -1, & x < 0, \end{cases} \qquad (5)$$

where $\mathrm{sgn}[\cdot]$ is the sign function. Because the energy of voiced sounds concentrates in the low frequency band and that of unvoiced sounds in the high frequency band, the speaker's zero-crossing rate is stable relative to the ambient noise and the speech segment can be clearly identified.
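A matching sketch of the endpoint test of Eqs. (4)-(5), under the same assumptions as above. The paper gives no numeric thresholds, so the three threshold parameters below are placeholders to be tuned on real recordings.

```java
/** Illustrative endpoint detection combining short-time energy, Eq. (4),
 *  with the zero-crossing rate, Eq. (5). */
public final class EndpointDetector {

    /** Short-time energy of one frame, Eq. (4). */
    static double energy(double[] frame) {
        double e = 0.0;
        for (double x : frame) {
            e += x * x;
        }
        return e;
    }

    /** Zero-crossing count of one frame; each sign change contributes
     *  |sgn(x(h)) - sgn(x(h-1))| / 2 = 1, matching Eq. (5). */
    static int zeroCrossings(double[] frame) {
        int z = 0;
        for (int h = 1; h < frame.length; h++) {
            if ((frame[h] >= 0) != (frame[h - 1] >= 0)) {
                z++;
            }
        }
        return z;
    }

    /** A frame is classified as speech if its energy is clearly high, or if it is
     *  moderately energetic and crosses zero more often than the noise floor.
     *  eHigh, eLow and zMin are illustrative tuning parameters. */
    static boolean isSpeech(double[] frame, double eHigh, double eLow, int zMin) {
        double e = energy(frame);
        return e > eHigh || (e > eLow && zeroCrossings(frame) > zMin);
    }
}
```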
3.1.2 Extraction of speech signal features

Feature extraction [17] is performed after preprocessing to highlight the data features used in pattern matching, improve the recognition rate, compress the information, and reduce the computation and storage load. Commonly used feature parameters include the Mel-frequency cepstral coefficient (MFCC), which has strong recognition performance and noise robustness; the linear predictive coefficient, which has a small computational load but only moderate effectiveness; and the accent sensitivity parameter, which performs well in recognizing the middle frequency band of signals. This system uses the MFCC [10] as the characteristic parameter for oral training. The Mel scale is related to frequency by

$$f_{mel} = 2595 \ln\left(1 + \frac{f}{700}\right), \qquad (6)$$

where f is the actual frequency of the signal. A Fourier transform [11] is applied to each preprocessed frame to obtain its spectrum, the spectrum is squared, a Mel band-pass filter bank is applied, the logarithm of every filter output is taken, and finally a discrete cosine transform (DCT) yields the MFCC:

$$C_n = \sqrt{\frac{2}{N}} \sum_{l=1}^{L} \log w(l) \cos\left[\frac{\pi n}{L}\left(l - \frac{1}{2}\right)\right], \qquad n = 1, 2, \ldots, p, \qquad (7)$$

where L is the number of filters, $w(l)$ is the output of the l-th triangular filter, N is the frame length, and p is the order of the parameters. The process is shown in Figure 1.

Figure 1: MFCC feature extraction process (voice input; pre-emphasis, framing and windowing; fast Fourier transform; absolute or squared value; Mel filtering; logarithm; discrete cosine transform; dynamic features).
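The following sketch shows the Mel mapping of Eq. (6) and the log-plus-DCT stage of Eq. (7) in the same illustrative Java style; the FFT and the triangular Mel filter bank are assumed to have been applied already, so the array passed in holds the L filter outputs of one frame.

```java
/** Illustrative sketch of the Mel scale, Eq. (6), and the cepstrum stage, Eq. (7). */
public final class Mfcc {

    /** Eq. (6): f_mel = 2595 * ln(1 + f / 700). */
    static double hzToMel(double f) {
        return 2595.0 * Math.log(1.0 + f / 700.0);
    }

    /** Eq. (7): C_n = sqrt(2/N) * sum_{l=1..L} log(w(l)) * cos(pi*n*(l - 0.5)/L)
     *  for n = 1..p, where w holds the L triangular filter outputs of one frame
     *  and frameLength is N. */
    static double[] cepstralCoefficients(double[] w, int p, int frameLength) {
        int L = w.length;
        double[] c = new double[p];
        for (int n = 1; n <= p; n++) {
            double sum = 0.0;
            for (int l = 1; l <= L; l++) {
                sum += Math.log(w[l - 1]) * Math.cos(Math.PI * n * (l - 0.5) / L);
            }
            c[n - 1] = Math.sqrt(2.0 / frameLength) * sum;
        }
        return c;
    }
}
```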
3.2 Speech signal pattern matching

This paper uses dynamic time warping (DTW) [12] to match the features of speech signals. Let the eigenvector sequence of the standard template be $B = \{B(1), B(2), \ldots, B(m), \ldots, B(M)\}$, where M is the total number of speech frames, m is the time-series index of a frame, and $B(m)$ is the eigenvector of the m-th frame. Likewise, let the eigenvector sequence of the test template be $T = \{T(1), T(2), \ldots, T(n), \ldots, T(Q)\}$, where Q is the number of frames, n is the frame index, and $T(n)$ is the eigenvector of the n-th frame. The similarity between the test template and the standard template is represented by a vector distance, with similarity decreasing as the distance grows. The Euclidean distance [13] between $T(n)$ and $B(m)$ is

$$d[T(n), B(m)] = \sqrt{\sum_{i=1}^{p} (t_i - b_i)^2}, \qquad (8)$$

where $t_i$ and $b_i$ are the i-th components of $T(n)$ and $B(m)$, respectively.

Dynamic time warping maps the time axis n of the test template onto the time axis $m = w(n)$ of the standard template so that the accumulated vector distance between the templates is minimized:

$$D = \min_{w} \sum_{n=1}^{N} d[T(n), B(w(n))]. \qquad (9)$$

DTW thus searches for a path through the grid of frame pairs that minimizes the accumulated frame distance along the path, subject to the following constraints.

Boundary condition:
$$w(1) = 1, \qquad w(N) = M. \qquad (10)$$

Continuity condition:
$$w(n+1) - w(n) \in \{0, 1, 2\}, \qquad n = 1, 2, \ldots, N-1. \qquad (11)$$

With these two conditions met and the accumulated frame distance minimized, the optimal path $m = w(n)$ is found by dynamic programming: starting from (1, 1), the accumulated distance is built up to (N, M) and the optimal matching path is recovered by backtracking. $D(N, M)$ is the template distance along the matching path, and the minimum matching distance $\min D(N, M)$ is taken as the criterion of similarity between templates.

3.3 Pronunciation scoring

First, the average frame matching distance is calculated:

$$\bar{d} = \frac{D(N, M)}{N}, \qquad (12)$$

where $D(N, M)$ is the total matching distance of the test template and N is the frame length of the test template. Using the average frame matching distance eliminates the effect of speech length. For scoring, this paper proposes a method based on a single reference template. The pronunciation score ranges from 0 to 100 and is computed as

$$score = \frac{100}{1 + e\,\bar{d}^{\,f}}, \qquad (13)$$

where $\bar{d}$ is the average frame matching distance, and e and f are scoring parameters obtained from the experience of spoken English teachers and the observed matching distances.
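The DTW recursion of Eqs. (9)-(11) and the score mapping of Eqs. (12)-(13) can be sketched as follows; t and b hold the MFCC sequences of the test and standard templates, and e and f are the empirically tuned scoring parameters. This is an illustration of the constraints above, not the paper's own code.

```java
/** Illustrative DTW matching, Eqs. (8)-(11), and scoring, Eqs. (12)-(13). */
public final class DtwScorer {

    /** Euclidean frame distance, Eq. (8). */
    static double dist(double[] t, double[] b) {
        double s = 0.0;
        for (int i = 0; i < t.length; i++) {
            s += (t[i] - b[i]) * (t[i] - b[i]);
        }
        return Math.sqrt(s);
    }

    /** Accumulated distance D(N, M) under the boundary condition of Eq. (10)
     *  and the step set w(n+1) - w(n) in {0, 1, 2} of Eq. (11). */
    static double dtwDistance(double[][] t, double[][] b) {
        int N = t.length, M = b.length;
        double[][] D = new double[N][M];
        for (double[] row : D) {
            java.util.Arrays.fill(row, Double.POSITIVE_INFINITY);
        }
        D[0][0] = dist(t[0], b[0]);                             // w(1) = 1
        for (int n = 1; n < N; n++) {
            for (int m = 0; m < M; m++) {
                double best = D[n - 1][m];                      // step 0
                if (m >= 1) best = Math.min(best, D[n - 1][m - 1]); // step 1
                if (m >= 2) best = Math.min(best, D[n - 1][m - 2]); // step 2
                D[n][m] = best + dist(t[n], b[m]);
            }
        }
        return D[N - 1][M - 1];                                 // w(N) = M
    }

    /** Eqs. (12)-(13): average frame distance mapped to a 0-100 score. */
    static double score(double[][] t, double[][] b, double e, double f) {
        double dBar = dtwDistance(t, b) / t.length;             // Eq. (12)
        return 100.0 / (1.0 + e * Math.pow(dBar, f));           // Eq. (13)
    }
}
```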
3.4 Scoring parameter selection

In this study, the formant was taken as the criterion for evaluating a learner's spoken pronunciation: pronunciation quality is judged by comparing the formants of the test template with those of the standard template. A formant is a region of the speech spectrum where energy is concentrated, and it reflects the physical characteristics of the resonant cavity. When vowels and consonants are produced in the oral cavity, the harmonic vibrations of the sound are regulated by the vocal tract and irregularly strengthened or attenuated; the strongly enhanced regions form the formants. In the spectrum of vowels, the first three formants play the key role in sound quality, and the first two are particularly sensitive to tongue height: the higher the first formant, the lower the tongue position. The second and third formants are also related to tongue position, but less prominently. Therefore, the first formant is chosen as the basis for judging pronunciation quality. In this paper, the formant is extracted by the linear prediction method [14]: regarding the vocal tract as a resonant cavity, the formant is the resonant frequency of the cavity wall.

3.5 Function and interface design

The oral English training system based on the Android smartphone platform provides effective feedback on learners' oral English pronunciation through animation, audio, video and images. The function design of the system is as follows.

First, the system should offer standard pronunciation audios and videos to guide learners and introduce the key points of English pronunciation and tongue position in the form of pictures and text. Before the system is built, spoken materials such as phonetic symbols, words and sentences need to be collected, and separate folders of pictures, videos and texts are established for system access. The AudioTrack class is used for speech signal playback and the VideoView class of the Android SDK for video playback.

Second, the system should prompt the learner to read words and phrases, record and play back the voice signals, create a cache folder, and store the recordings in MP3 format. The AudioRecord class is used to record voice signals, with the sampling frequency set to 8000 Hz, a mono channel and 16-bit samples (see the first sketch below).

Then, the system uses speech recognition and the related algorithms to score the learner's spoken pronunciation and establishes a spoken appraisal folder. The SharedPreferences component of the Android system is used to store the learner's rating results (see the second sketch below).
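A minimal sketch of the recording setup just described: AudioRecord at 8000 Hz, mono, 16-bit PCM. The class name and buffer sizing are illustrative, and the RECORD_AUDIO permission is assumed to have been granted.

```java
import android.media.AudioFormat;
import android.media.AudioRecord;
import android.media.MediaRecorder;

/** Illustrative recording setup: 8000 Hz, mono, 16-bit PCM, as in Section 3.5. */
public final class Recorder {

    static final int SAMPLE_RATE = 8000;   // Hz, as specified above

    static AudioRecord createRecorder() {
        int minBuf = AudioRecord.getMinBufferSize(
                SAMPLE_RATE,
                AudioFormat.CHANNEL_IN_MONO,
                AudioFormat.ENCODING_PCM_16BIT);
        return new AudioRecord(
                MediaRecorder.AudioSource.MIC,
                SAMPLE_RATE,
                AudioFormat.CHANNEL_IN_MONO,
                AudioFormat.ENCODING_PCM_16BIT,
                minBuf * 2);               // extra headroom against overruns
    }

    /** Reads one buffer of 16-bit samples; called repeatedly while recording. */
    static int readFrame(AudioRecord recorder, short[] buffer) {
        return recorder.read(buffer, 0, buffer.length);
    }
}
```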
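Score persistence with SharedPreferences could then look like the sketch below; the preference file name and the word-keyed scheme are assumptions for illustration, not details given in the paper.

```java
import android.content.Context;
import android.content.SharedPreferences;

/** Illustrative storage of rating results via SharedPreferences. */
public final class ScoreStore {

    static final String PREFS = "oral_scores";   // hypothetical preference file name

    /** Saves the latest score for a practiced word. */
    static void saveScore(Context ctx, String word, float score) {
        SharedPreferences sp = ctx.getSharedPreferences(PREFS, Context.MODE_PRIVATE);
        sp.edit().putFloat(word, score).apply();
    }

    /** Returns the stored score, or -1 if the word has not been scored yet. */
    static float loadScore(Context ctx, String word) {
        return ctx.getSharedPreferences(PREFS, Context.MODE_PRIVATE)
                  .getFloat(word, -1f);
    }
}
```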
Finally, the system should be able to compare the learner's pronunciation with the standard pronunciation and use AChartEngine to display a comparison chart of the formant trajectories, so that learners can spot their problems more intuitively. In addition, the system should give advice on pronunciation based on the relationship between tongue shape, mouth shape and the signal formants.

Interface design: the main interface includes four oral training options (vowels, consonants, words and sentences), among which learners can choose freely. A formant comparison chart and historical score items are also placed on the main interface for easy viewing, together with help and exit options. The training score interface includes pronunciation demonstration, pronunciation following, pronunciation comparison, pronunciation evaluation, the main menu, oral demonstrations (animation, audio and video, pictures and other forms) and the corresponding text descriptions.

The system was developed mainly in the Eclipse integrated development environment. The specific development and operating environment was as follows: PC operating system Windows 7 (32-bit); development components Java JDK 8.0, Eclipse 4.5 (Mars) [15] and Android SDK 4.0; hardware environment Honor Play 6X (RAM 3 GB, ROM 32 GB, Android 6.0); programming language Java. Figure 2 shows the interface.

Figure 2: Oral training system main interface and rating interface.

4 System test results

This study invited three experienced English teachers as judges and 10 college students as subjects for the oral English training system. Scoring was based on tongue position, mouth shape, pronunciation completeness and pronunciation clarity, with 25 points for each item. The average of the scores given by the three teachers was taken as the final score.

4.1 Speech input test

First, the subjects' speech was recorded and the recognition rate of the speech input system was tested. Following the system instructions, the subjects read after the system 20 monophthong words, 15 diphthong words and 15 polysyllabic words. The three teachers judged whether each speech input succeeded; the results are shown in Table 1.

Table 1: Speech input success rate of the system.

Word type      Monophthong   Diphthong   Polysyllabic
Total number   20            15          15
Success rate   100%          100%        96%

4.2 Scoring accuracy test

Based on the speech input, scores were given by the system and by the three teachers. Let the system score of the i-th word be $x_i$ and the teachers' score be $y_i$; the similarity of the two scores is $\delta_i = 1 - |x_i - y_i| / y_i$. The scoring accuracy over n samples is then $\bar{\delta} = (\delta_1 + \delta_2 + \cdots + \delta_n)/n$. The results are shown in Table 2.

Table 2: System scoring accuracy results.

Word type      Monophthong   Diphthong   Polysyllabic
Total number   20            15          15
Accuracy       97.15%        94.96%      93.62%

As shown in Table 2, the scoring accuracy reached 97.15% for monophthong words, 94.96% for diphthong words and 93.62% for polysyllabic words. As the number of vowels in a word increases, the pronunciation becomes more complicated, which affects the scoring; nevertheless, the average scoring accuracy remained above 90%.

5 Conclusion

With the rapid development of mobile networks and the upgrading of mobile network equipment, the concept of mobile learning has gradually been integrated into our life. This paper focused on mobile learning and designed an oral English training system that can be used on Android smartphones. First, we designed the speech signal preprocessing, feature extraction and pattern matching of the system's speech recognition. In view of the characteristics of the Android system, the dynamic time warping algorithm, which has a small computational load, was introduced as the pattern matching algorithm for speech signals. Then, according to the pronunciation characteristics of spoken English, we selected the pronunciation formant as the reference for system scoring and adopted the single-reference-template scoring method. Afterwards, the system functions were designed around pronunciation demonstration, pronunciation imitation and pronunciation evaluation. Finally, the system was tested; the results showed a high success rate in recognizing spoken word pronunciation and a high accuracy in scoring spoken English. In general, the designed system initially meets the speech input and scoring accuracy needs of mobile spoken English training and points out issues that deserve attention in pronunciation, which is helpful for spoken English training.

6 References

[1] An L L, Wu Y N, Liu Z, Liu R S (2012). An Application of Mispronunciation Detecting Network for Computer Assisted Language Learning System. Journal of Electronics & Information Technology, 34(9), pp. 2085-2090.
[2] Alamer R A, Al-Otaibi H M, Al-Khalifa H S (2015). L3MS: A Lightweight Language Learning Management System Using Mobile Web Technologies. 2015 IEEE 15th International Conference on Advanced Learning Technologies (ICALT), IEEE, Hualien, Taiwan, pp. 326-327.
[3] Milutinovic M, Bojovic Z, Labus A, Bogdanovic B, Despotovic-Zrakic M (2016). Ontology-based generated learning objects for mobile language learning. Computer Science & Information Systems, pp. 4-4.
[4] Troussas C, Virvou M, Alepis E (2014). Multifactorial user models for personalized mobile-assisted language learning. Frontiers in Artificial Intelligence & Applications, 262, pp. 275-282.
[5] Sharples M, Arnedillo-Sánchez I, Milrad M, Vavoula G (2014). Mobile Learning. In R. Keith Sawyer (ed.), The Cambridge Handbook of the Learning Sciences, pp. 501-521.
[6] Hu Y, Azim T, Neamtiu I (2015). Versatile yet lightweight record-and-replay for Android. ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications, ACM, New York, USA, pp. 349-366.
[7] Deepa D, Shanmugam A (2011). Enhancement of noisy speech signal based on variance and modified gain function with PDE preprocessing technique for digital hearing aid. Journal of Scientific & Industrial Research, 70(5), pp. 332-337.
[8] Takagi T, Seiyama N, Miyasaka E (2015). A method for pitch extraction of speech signals using autocorrelation functions through multiple window lengths. Electronics & Communications in Japan, 83(2), pp. 67-79.
[9] Sahoo T R, Patra S (2014). Silence Removal and Endpoint Detection of Speech Signal for Text Independent Speaker Identification. International Journal of Image, Graphics & Signal Processing, 6(6), pp. 27-35.
[10] Valentini-Botinhao C, Yamagishi J, King S (2012). Mel cepstral coefficient modification based on the Glimpse Proportion measure for improving the intelligibility of HMM-generated synthetic speech in noise. Proc. Interspeech, pp. 631-634.
[11] Rathore P S, Boyat A, Joshi B K (2013). Speech signal analysis using Fourier-Bessel Expansion and Hilbert Transform Separation Algorithm. IEEE International Conference on Signal Processing, Computing and Control, IEEE, Solan, India, pp. 1-4.
[12] Dhingra S, Nijhawan G, Pandit P (2013). Isolated speech recognition using MFCC and DTW. International Journal of Advanced Research in Electrical, Electronics & Instrumentation Engineering, 2(8), pp. 4085-4092.
[13] Lang F Y, Li X G (2012). Multi-Sensors Information Fusion Based on Momentis Method and Euclid Distance. Advanced Materials Research, 383-390, pp. 5447-5452.
[14] Yusnita M A, Paulraj M P, Yaacob S, Bakar S A, Saidatul A (2011). Malaysian English accents identification using LPC and formant analysis. IEEE International Conference on Control System, Computing and Engineering, IEEE, Penang, Malaysia, pp. 472-476.
[15] Wang L, Groves P, Ziebart M (2013). Urban Positioning on a Smartphone: Real-time Shadow Matching Using GNSS and 3D City Models. Proceedings of the 26th International Technical Meeting of The Satellite Division of the Institute of Navigation, Nashville Convention Center, pp. 1606-1619.
[16] Thakral S, Goswami D, Sharma R, Prasanna C K, Joshi A M (2016). Design and implementation of a high speed digital FIR filter using unfolding. IEEE Power India International Conference, pp. 1-4.
[17] Han Z, Wang J (2016). Dynamic feature extraction for speech signal based on MUSIC. Control and Decision Conference, pp. 3770-3773.