https://doi.org/10.31449/inf.v47i10.5391
Informatica 47 (2023) 65–70

A Study of Correction Training for English Pronunciation Errors Through Deep Learning

Guannan Li
Department of Foreign Languages, Huihua College, Hebei Normal University, Hebei 050000, China
E-mail: ncz939@126.com

Keywords: deep learning, spoken English, pronunciation error correction, Mel-frequency cepstral coefficient, facial feature

Received: October 31, 2023

In the process of globalization, English has become an essential skill. This article provides a brief introduction to the recognition of English pronunciation errors based on deep learning. In the recognition process, the audio features of pronunciation were combined with the video features of lip movements during pronunciation to improve error detection performance. Subsequently, simulation experiments were conducted on the error detection algorithm, and a case analysis was performed on 100 freshmen from Huihua College at Hebei Normal University to verify the effectiveness of the algorithm in correcting pronunciation. The results showed that the long short-term memory (LSTM) algorithm based on audio and video converged the fastest during training and had the smallest loss function. Additionally, it achieved the highest accuracy in phoneme recognition and pronunciation error detection, while being less affected by noise interference. After the pronunciation error detection algorithm proposed in this article was used for oral correction training, students' pronunciation improved significantly.

Povzetek (Slovenian abstract): The article examines the recognition of English pronunciation errors with deep learning, which improves students' pronunciation.

1 Related works

Some studies related to English pronunciation correction are summarized in Table 1. The studies listed in Table 1 have all focused on methods for recognizing English speech. Some emphasized improving speech quality to enhance recognition accuracy, while others applied speech recognition algorithms to English pronunciation practice and verified their auxiliary role. This article also approaches the topic from the perspective of speech recognition and applies the algorithms to English pronunciation correction. The principle behind the correction is to use a speech recognition algorithm to convert speech into text and then compare it with the actual text of the recognized speech in order to identify errors. For speech recognition, this article mainly employs the long short-term memory (LSTM) algorithm and introduces more intuitive lip feature points to improve accuracy.

Table 1: A summary of related works

Main authors | Research content | Research results
Gang [4] | Based on artificial emotion recognition and Gaussian mixture models, they analyzed and filtered the various types of noise that affect speech quality in order to enhance students' English speech recognition abilities. | The results demonstrated that the model they constructed performed well.
Sidgi et al. [5] | They studied the effectiveness of the ASR EyeSpeak software in improving the pronunciation of Iraqi English learners. | The results showed that the software could significantly improve students' English pronunciation.
Dai [6] | They designed an intelligent system based on speech recognition technology to correct students' English pronunciation errors. | Comparative experiments verified the practical application value of the system.
2 Introduction

Pronunciation errors have always been one of the challenges that students face, because they not only affect communication effectiveness [1] but can also reduce learners' interest and confidence in learning English [2]. Traditional methods for correcting spoken English rely mainly on teacher guidance and repetitive practice; however, this kind of training is limited by time and location, so it cannot meet students' personalized needs for autonomous learning. In addition, correcting oral mistakes is a time-consuming and tedious task for teachers. The development of deep learning technology has brought new solutions to the field of speech processing [3]. Through training on extensive data, deep learning technology enables voice recognition and analysis, facilitating oral correction. This article provides a brief introduction to the process of recognizing pronunciation errors in spoken English based on deep learning. In this recognition process, audio features of pronunciation were combined with video features of lip movements during pronunciation to improve error identification performance. Subsequently, simulation experiments were conducted on the error detection algorithm, and a case study was performed on 100 freshmen from Huihua College at Hebei Normal University to verify the effectiveness of the algorithm in correcting pronunciation.

3 Deep learning-based English pronunciation error recognition

During English oral training, students typically mimic the pronunciation of standard texts. However, they frequently find it difficult to pronounce certain sounds accurately during practice, and because guidance from teachers is limited by time and location, it is hard for them to adjust and improve their pronunciation on their own, leading to low training effectiveness. The advent of deep learning technology offers a novel approach to correcting pronunciation [7]. In the field of audio, the smallest unit is the phoneme, which is represented by phonetic symbols in English pronunciation. Therefore, when deep learning techniques are used to correct pronunciation, the phoneme sequence of a student's imitated pronunciation is recognized and compared with the phoneme sequence of the standard text, thereby identifying and correcting pronunciation errors.

Figure 1: The recognition flow of deep learning-based English pronunciation errors (input speech data and mouth-shape video → preprocessing → extraction of speech and video features → phoneme classification of the two feature streams → decision fusion of the speech and video phoneme probabilities → output of the phoneme sequence → comparison of the predicted phoneme sequence with the standard phoneme sequence).

Figure 1 illustrates the process of English pronunciation error recognition based on deep learning. In this recognition process, not only audio features but also mouth-shape video features are used [8] to enhance the accuracy of phoneme sequence recognition. The specific steps are as follows. ① The English pronunciation audio data of students and the corresponding synchronized video data of the pronunciation are input. ② The audio and video data are preprocessed [9]. ③ Features are extracted from the audio and video data. Mel-frequency cepstral coefficients (MFCC) are employed to extract features from the audio data.
Firstly, a fast Fourier transform (FFT) is performed on the audio signal [10], and then the MFCC features are extracted from it. The corresponding formulas are:

$$
\begin{cases}
Y(k) = \sum\limits_{n=0}^{N-1} y(n)\, e^{-j 2\pi k n / N} \\
P(k) = \dfrac{1}{N} \left| Y(k) \right|^2 \\
S(m) = \ln\!\left( \sum\limits_{k=0}^{N-1} P(k)\, H_m(k) \right), \quad m = 1, 2, \cdots, M \\
c(l) = \sum\limits_{m=1}^{M} S(m) \cos\!\left( \dfrac{\pi l (2m-1)}{2M} \right), \quad l = 1, 2, 3, \cdots, L
\end{cases}
\tag{1}
$$

where $Y(k)$ stands for the frequency-domain signal after the FFT [11], $y(n)$ stands for the original time-domain signal, $k$ is the index of a frequency sampling point, $n$ is the index of a time sampling point of the time-domain signal, $P(k)$ is the instantaneous energy of $Y(k)$, $H_m(k)$ is the frequency response of the $m$-th triangular filter in a bank of $M$ filters, $S(m)$ is the energy spectral function of the frequency-domain signal after filtering, and $c(l)$ is the $L$-order MFCC feature parameter.

The Dlib algorithm is used for feature extraction from the video data; it uses gradient-boosted regression trees to extract and recognize 68 feature points in face images. The corresponding formula is:

$$
\begin{cases}
S_t = \left( x_1^t, x_2^t, \cdots, x_{68}^t \right) \\
S_{t+1} = S_t + r_t\!\left( I, S_t \right)
\end{cases}
\tag{2}
$$

where $S_t$ represents the current set of facial feature points, $x_i^t$ is the $i$-th feature point in the current facial image, $I$ is the input face image, $r_t(\cdot)$ is the $t$-th cascade regressor, which computes the residual between the current facial key points and the real face and updates the current facial key points accordingly, and $S_{t+1}$ represents the set of facial feature points after the update by $r_t(\cdot)$. In the recognition process, pronunciation is identified through lip movements; therefore, only the 20 feature points in the lip area are needed, which avoids interference from the other feature points. These 20 feature points are also normalized [12].

④ Both the audio and the video features are used for phoneme classification, and an LSTM is employed to recognize the two types of features. Compared with ordinary neural network structures, an LSTM uses not only the current input data but also the previous state data, making it better suited to sequence problems, and phoneme recognition is a sequence problem. The calculations within the hidden layer of the LSTM are:

$$
\begin{cases}
f_t = \sigma\!\left( b_f + u_f x_t + \omega_f h_{t-1} \right) \\
s_t = f_t\, s_{t-1} + g_t\, \sigma\!\left( b + u x_t + \omega h_{t-1} \right) \\
g_t = \sigma\!\left( b_g + u_g x_t + \omega_g h_{t-1} \right) \\
h_t = \tanh(s_t)\, q_t \\
q_t = \sigma\!\left( b_q + u_q x_t + \omega_q h_{t-1} \right)
\end{cases}
\tag{3}
$$

where $f_t$, $s_t$, $g_t$, and $q_t$ are the outputs of the forget gate, the circulating (cell) state, the input gate, and the output gate [13], $h_t$ is the hidden state in the calculation process, $\omega$, $\omega_f$, $\omega_g$, and $\omega_q$ are the weights of the recurrent, forget-gate, input-gate, and output-gate units for the hidden state $h_{t-1}$ at the previous moment, $u$, $u_f$, $u_g$, and $u_q$ are the corresponding weights for the current input data $x_t$, and $b$, $b_f$, $b_g$, and $b_q$ are the corresponding bias terms.

⑤ After forward computation is performed separately on the audio features and the video features with the LSTM, a phoneme probability distribution sequence is obtained for each modality.
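To make steps ③–⑤ concrete, the sketch below shows one way the two feature streams and the two LSTM branches could be implemented in Python. It is an illustration under assumptions rather than the paper's implementation: librosa and dlib stand in for the unspecified MFCC and landmark tooling, the layer sizes follow Table 2 in Section 4.2, the normalization of the lip points is one plausible choice, and the lip feature dimension here is 40 (20 points × 2 coordinates), a simplification of the 46 dimensions reported in Table 2.

```python
# A minimal sketch (not the authors' released code) of steps ③-⑤:
# MFCC features from the audio, normalized lip landmarks from the video
# frames, and one LSTM phoneme classifier per modality.
import numpy as np
import librosa          # assumed here for MFCC extraction, eq. (1)
import cv2
import dlib             # 68-point landmark model, eq. (2)
import torch
import torch.nn as nn

def audio_features(wav_path, sr=16000):
    """39-dim MFCC features: 13 static + delta + delta-delta coefficients."""
    y, _ = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    feats = np.vstack([mfcc,
                       librosa.feature.delta(mfcc),
                       librosa.feature.delta(mfcc, order=2)])
    return feats.T                                   # (frames, 39)

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def lip_features(frame_bgr):
    """20 lip points (dlib indices 48-67), normalized for position and scale."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = detector(gray)
    if not faces:
        return None
    shape = predictor(gray, faces[0])
    pts = np.array([[shape.part(i).x, shape.part(i).y]
                    for i in range(48, 68)], dtype=np.float32)
    pts -= pts.mean(axis=0)                          # remove head position
    pts /= np.abs(pts).max() + 1e-8                  # remove face scale
    return pts.flatten()                             # (40,)

class PhonemeBranch(nn.Module):
    """One modality branch: LSTM over the feature sequence, then a per-frame
    phoneme probability distribution (sizes follow Table 2)."""
    def __init__(self, in_dim, hidden=200, layers=3, n_phonemes=50):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden, num_layers=layers, batch_first=True)
        self.proj = nn.Linear(hidden, n_phonemes)
    def forward(self, x):                            # x: (batch, frames, in_dim)
        h, _ = self.lstm(x)                          # gated updates as in eq. (3)
        return torch.softmax(self.proj(h), dim=-1)   # (batch, frames, n_phonemes)

audio_branch = PhonemeBranch(in_dim=39)
video_branch = PhonemeBranch(in_dim=40)              # 20 lip points x 2 coords
```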
Then the phoneme probability distribution sequences from the audio and the video are weighted and summed. The weight allocation between the two is usually fixed on the basis of empirical knowledge, but this paper adopts a gating mechanism that adjusts the weights adaptively. Finally, the phoneme sequence with the highest probability is obtained from the combined phoneme probability distribution sequence [14]. ⑥ The predicted phoneme sequence is compared with the standard phoneme sequence corresponding to the input audio and video in order to detect inconsistencies and provide correction suggestions.
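A sketch of this fusion and comparison follows. The paper does not specify the form of the gating network, the decoding rule, or how the two sequences are aligned, so the per-frame gate, the CTC-style greedy decoding over an assumed <blank> symbol (consistent with the blank symbols in the output layer of Table 2), and the position-wise comparison here are all illustrative assumptions.

```python
# A minimal sketch (assumptions noted above) of the gated decision fusion
# and the comparison with the standard phoneme sequence (steps ⑤-⑥).
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Predict a per-frame weight from both distributions and blend them."""
    def __init__(self, n_phonemes=50):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * n_phonemes, 1), nn.Sigmoid())
    def forward(self, p_audio, p_video):     # each: (batch, frames, n_phonemes)
        w = self.gate(torch.cat([p_audio, p_video], dim=-1))  # (batch, frames, 1)
        return w * p_audio + (1.0 - w) * p_video              # fused distribution

def decode(p_fused, id2phoneme, blank="<blank>"):
    """Greedy decoding: take the most probable phoneme per frame, then
    collapse repeats and drop blanks (CTC-style). Assumes batch size 1."""
    ids = p_fused.argmax(dim=-1).squeeze(0).tolist()
    pred, prev = [], None
    for i in ids:
        if i != prev and id2phoneme[i] != blank:
            pred.append(id2phoneme[i])
        prev = i
    return pred

def find_errors(predicted, standard):
    """Naive position-wise comparison; a real system would align the two
    sequences with edit distance before reporting mismatches."""
    return [(pos, s, p)
            for pos, (s, p) in enumerate(zip(standard, predicted)) if s != p]
```

Because the gate is trained jointly with the two branches, the network itself decides, frame by frame, how much to trust each modality, which is one way to realize the adaptive weighting described above.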
4 Case study

4.1 Experimental environment

The article first examined the algorithm designed to identify English pronunciation errors and then evaluated its efficacy in correction training. The recognition algorithm was tested on a laboratory server.

4.2 Algorithm test setup

The audio and video dataset used for the simulation experiments was self-built. The data were collected from 100 sophomore students randomly selected from Huihua College of Hebei Normal University, including 52 male and 48 female students. The pronunciation of twenty sentences by these participants was recorded at a sampling rate of 16 kHz. While the pronunciations were being collected, the facial movements of the participants were simultaneously recorded at a frame rate of 60 fps. Consent was obtained from the subjects for the collection of the audio and facial video data, the purpose of the data was explained to them, and they were assured that the data would not be used for any other purpose.

The relevant parameter settings of the recognition algorithm used in this article are shown in Table 2. The number of nodes in the input layer of the LSTM depended on the dimensionality of the input data, while the number of hidden-layer nodes and the type of activation function were determined through orthogonal experiments. The number of output-layer nodes depended on the phoneme labels as well as on the number of blank and termination symbols in the speech dataset. In addition, to further validate the proposed recognition algorithm, comparative experiments were conducted with two other algorithms, one using only the lip video for recognition and the other using only the audio. The two algorithms differed from the proposed one only in the recognition features used, with all three using an LSTM as the main component; their parameters depended on the input feature dimensions only in the number of input-layer nodes, while all other parameters were consistent with Table 2.

Table 2: Settings of the relevant parameters of the proposed recognition algorithm

Number of MFCC feature dimensions | 39
Number of lip feature dimensions | 46
Number of nodes in the input layer of the LSTM | 85
Number of hidden layers | 3
Number of nodes in the hidden layer | 200
Activation function of the hidden layer | Sigmoid
Number of nodes in the output layer | 50
Maximum number of training iterations | 300

In addition, to test the robustness of the proposed algorithm against noise, white noise was added to the audio files, and speech recognition was then performed on the noisy audio.

4.3 Test of the correction effectiveness of the algorithms

A total of 100 freshmen from Huihua College of Hebei Normal University were randomly selected and divided into a control group and an experimental group. Both groups underwent a pronunciation test before receiving correction training, with a maximum score of 10 points. Afterwards, both groups received two weeks of correction training, followed by another pronunciation test [15]. The traditional teaching method comprised ① conventional classroom teaching, in which students read after the teacher, and ② group work, in which students communicated in English on specific topics. The improved teaching method added, on top of these two activities, homework in which students practiced oral skills using the algorithm proposed in this article; the algorithm identified pronunciation errors and allowed students to adjust their pronunciation against the standard pronunciation. Statistical analysis of the two groups' test scores before and after correction training was conducted in SPSS using independent-samples t-tests, with a P-value below 0.05 indicating a significant difference.
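For readers who want to reproduce this kind of analysis without SPSS, the same independent-samples t-test can be run in Python; the sketch below uses scipy, and the score lists are hypothetical placeholders, not the study's data.

```python
# A minimal sketch of the significance test used above, with scipy standing
# in for SPSS. The score lists are hypothetical, not the study's data.
from scipy import stats

control_after = [5, 4, 6, 3, 5, 7, 4, 5, 6, 4]        # hypothetical scores
experimental_after = [8, 7, 9, 6, 8, 7, 9, 8, 6, 7]    # hypothetical scores

t_stat, p_value = stats.ttest_ind(control_after, experimental_after)
print(f"t = {t_stat:.3f}, P = {p_value:.3f}")
if p_value < 0.05:
    print("The difference between the groups is significant at the 0.05 level.")
```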
4.4 Test results

The convergence curves of the three pronunciation error recognition algorithms are shown in Figure 2. All three algorithms converged as the number of iterations increased. The video-based LSTM algorithm stabilized after approximately 160 iterations, the audio-based LSTM algorithm converged after about 120 iterations, and the audio-video-based LSTM algorithm reached stability after approximately 80 iterations. At convergence, the video-based LSTM algorithm had the highest loss value, followed by the audio-based LSTM algorithm, while the audio-video-based LSTM algorithm had the lowest.

Figure 2: Convergence curves of the three pronunciation error recognition algorithms.

The accuracy of the three algorithms in phoneme recognition and pronunciation error detection is shown in Figure 3. The LSTM algorithm based on audio-video data had the highest accuracy, followed by the LSTM algorithm based solely on audio, with the LSTM algorithm relying solely on video the lowest. Furthermore, the phoneme recognition accuracy of each algorithm surpassed its pronunciation error detection accuracy. This is because, when pronunciation errors are detected, the phoneme sequence of the pronunciation is recognized first and the recognized sequence is then compared with the standard sequence, so recognition errors carry over into the comparison and reduce the accuracy.

Figure 3: Phoneme recognition accuracy and pronunciation error detection accuracy of the three pronunciation error recognition algorithms.

To test the noise resistance of the recognition algorithms, white noise was added to the audio files before recognition; the results are shown in Figure 4. Even with white noise added to the audio files, the audio-video-based LSTM algorithm achieved the highest phoneme recognition accuracy, followed by the audio-based LSTM algorithm, with the video-based LSTM algorithm lowest. Comparing each algorithm before and after the addition of white noise, the audio-video-based LSTM algorithm showed only a slight and insignificant decrease in phoneme recognition accuracy under white noise interference, whereas the other two algorithms showed a larger decrease, especially the video-based LSTM algorithm.

Figure 4: The phoneme recognition accuracy of the three recognition algorithms under noise interference.

The distributions of the control and experimental groups' test scores before and after English oral training are shown in Figure 5. Before the oral training, there was minimal disparity between the two groups' score distributions, with the majority of scores concentrated in the range of five to six points. After the oral training, however, a noticeable distinction emerged between the two groups: the control group showed little change from its pre-training performance, whereas in the experimental group the score range into which most subjects fell rose to eight points.

Figure 5: The distribution of oral test scores of the control and experimental groups before and after oral training.

The descriptive statistics of the two groups' scores before and after oral training are presented in Table 3. Before the training, the P-value for the two groups' mean ± standard deviation was 0.784, i.e., the difference was not significant. After the oral training, the two groups differed significantly, with the experimental group's mean score significantly higher (P = 0.011). In addition, comparing performance before and after training within each group, the control group's P-value was 0.698 (not significant), while the experimental group's was 0.014 (significant). It was therefore concluded that the audio-video-based LSTM algorithm effectively assisted students in correcting pronunciation errors during oral practice.

Table 3: Descriptive statistics (mean ± standard deviation) of the control and experimental groups before and after oral training

Group | Before oral training | After oral training | P-value
Control group | 5.00 ± 2.33 | 4.80 ± 2.09 | 0.698
Experimental group | 4.96 ± 2.37 | 7.50 ± 1.63 | 0.014
P-value | 0.784 | 0.011 |

5 Discussion

In the process of language learning, pronunciation is a crucial aspect, and for many non-native speakers English pronunciation can be quite challenging. Deep learning technology offers new possibilities for addressing this issue: by modeling and analyzing large-scale speech data, it enables more accurate and rapid speech recognition and correction of the errors present. This article primarily employs an LSTM for speech recognition and introduces lip movement features during pronunciation to enhance the algorithm's accuracy. The performance of the algorithm was then tested, and the algorithm was applied to oral training to examine its auxiliary role, as shown in the previous section. Among the video-only, audio-only, and audio-video-based algorithms, the audio-video-based algorithm converged fastest during training and had the smallest error when stable; likewise, it demonstrated the best recognition performance in the other tests.
When the proposed algorithm was applied to oral training, the experimental group that used it showed a significant improvement in oral scores after training, whereas the control group, which used the traditional oral training method, did not. Analyzing the reasons behind these results, the algorithms based on video alone and on audio alone rely solely on lip shape features and MFCC features, respectively. Different pronunciations exhibit distinct lip shape characteristics; in practical applications, however, slight deviations in lip shape can occur during continuous speech, which reduces the accuracy of these features. MFCC features are characteristics of the audio itself and reflect the properties of the sound more directly than lip shape features do; therefore, in both training and practical testing, the audio-based algorithm outperformed the video-only speech recognition algorithm. The audio-video-based algorithm combined the lip shape features with the MFCC features, resulting in superior performance in both training and practical testing compared with the other two algorithms. When applied to spoken language training, it could identify errors in user pronunciation more accurately and provide targeted corrections. Moreover, with this speech recognition algorithm, users can practice anytime and anywhere on their own, which is more convenient than traditional methods of oral practice.

6 Conclusion

The article provides a brief introduction to the process of recognizing English pronunciation errors based on deep learning. In this recognition process, audio features of pronunciation were combined with video features of lip movements during pronunciation to improve error identification performance. Subsequently, simulation experiments were conducted on the error detection algorithm, and a case study was performed on 100 freshmen from Huihua College at Hebei Normal University to verify the effectiveness of the algorithm in correcting oral English.

(1) Compared with the LSTM algorithms based solely on video or audio, the LSTM algorithm based on audio and video converged faster and had the smallest loss function at convergence.

(2) In both phoneme recognition accuracy and pronunciation error detection accuracy, the LSTM algorithm based on audio-video achieved the highest values, followed by the algorithm based solely on audio, while the algorithm based solely on video had the lowest.

(3) Although the phoneme recognition accuracy of the audio-video-based LSTM algorithm was slightly reduced under white noise interference, the reduction was not significant; the other two algorithms showed a large decrease in accuracy, especially the video-based LSTM algorithm.

(4) Before the oral training, there was no significant difference between the control and experimental groups in the distribution of scores or in their average score, highest/lowest scores, and standard deviation. After the oral training, the control group showed no noticeable change, while the experimental group's score distribution shifted towards higher ranges; its average and lowest scores improved, and its standard deviation decreased.
The contribution of this article lies in introducing lip shape features on top of the MFCC features used for speech recognition, which enhances the accuracy of the algorithm and provides an effective auxiliary tool for English pronunciation correction.

References

[1] Igarashi K, Wilson I (2020). Improving Japanese English pronunciation with speech recognition and feedback system. SHS Web of Conferences, 77, pp. 1-5. https://doi.org/10.1051/shsconf/20207702003
[2] Li D S (2020). English Speech Recognition and Multidimensional Pronunciation Evaluation. Education Research Frontier, 10, pp. 184-188.
[3] Kleynhans N, Hartman W, van Niekerk D, van Heerden C J, Schwartz R, Tsakalidis S, Davel M (2016). Code-switched English pronunciation modeling for Swahili spoken term detection. Procedia Computer Science, 81, pp. 128-135. https://doi.org/10.1016/j.procs.2016.04.040
[4] Gang Z (2021). Quality evaluation of English pronunciation based on artificial emotion recognition and Gaussian mixture model. Journal of Intelligent & Fuzzy Systems: Applications in Engineering and Technology, 40, pp. 7085-7095. https://doi.org/10.3233/JIFS-189538
[5] Sidgi L F S, Shaari A J (2017). The Effect of Automatic Speech Recognition EyeSpeak Software on Iraqi Students' English Pronunciation: A Pilot Study. Advances in Language & Literary Studies, 8, pp. 48-54. https://doi.org/10.7575/aiac.alls.v.8n.2p.48
[6] Dai M (2021). Intelligent Correction System of Students' English Pronunciation Errors Based on Speech Recognition Technology. WSEAS Transactions on Advances in Engineering Education, pp. 192-198. https://doi.org/10.1142/S0219649222400135
[7] Lim D Y, Kim S G, Chong K T (2018). Development of a Real-time Lip Recognition for Improving English Pronunciation using Deep Learning. Journal of Institute of Control, Robotics and Systems, 24, pp. 327-333. https://doi.org/10.5302/J.ICROS.2018.18.8003
[8] Liu X, Xu M, Li M, Han M, Chen Z, Mo Y, Chen X, Liu M (2019). Improving English pronunciation via automatic speech recognition technology. International Journal of Innovation and Learning, 25, pp. 126-140. https://doi.org/10.1504/IJIL.2019.097674
[9] Giantari K, Sabarudin S, Zahrida Z (2020). Pronunciation Recognition of -ed Ending Words by the Students of English Education Study Program of the University of Bengkulu. Journal of English Education and Teaching, 4, pp. 278-293. https://doi.org/10.33369/jeet.4.2.278-293
[10] Zhao L, Liu Y, Chen L, Zhang J, Jonathan K G (2019). English oral evaluation algorithm based on fuzzy measure and speech recognition. Journal of Intelligent & Fuzzy Systems: Applications in Engineering and Technology, 37, pp. 241-248. https://doi.org/10.3233/jifs-179081
[11] Wu H, Sangaiah A K (2021). Oral English Speech Recognition Based on Enhanced Temporal Convolutional Network. Intelligent Automation and Soft Computing, 28, pp. 121-132. https://doi.org/10.32604/iasc.2021.016457
[12] Evers K, Chen S (2021). Effects of Automatic Speech Recognition Software on Pronunciation for Adults With Different Learning Styles. Journal of Educational Computing Research, 59, pp. 669-685. https://doi.org/10.1177/0735633120972011
[13] Cao Q, Hao H (2021). Optimization of Intelligent English Pronunciation Training System Based on Android Platform. Complexity, 2021, pp. 1-11. https://doi.org/10.1155/2021/5537101
[14] Zhan W, Chen Y (2020). Application of machine learning and image target recognition in English learning task. Journal of Intelligent and Fuzzy Systems, 39, pp. 5499-5510. https://doi.org/10.3233/JIFS-189032
[15] Peng S (2018). Research on Interactive English Speech Recognition Algorithm in Multimedia Cooperative Teaching. International English Education Research, pp. 79-82. https://doi.org/10.1109/ICITBS.2018.00095