https://doi.org/10.31449/inf.v48i10.5919 Informatica 48 (2024) 19–34

The Application of Action Recognition Based on MPP-YOLOv3 Algorithm in Posture Correction

Zhongwei Wang 1*, Shujuan Dong 2
1 Department of Public Studies, Henan Vocational College of Nursing, Anyang 455000, China
2 Information Engineering Institute, Yellow River Conservancy Technical Institute, Kaifeng 475004, China
Email of corresponding author: wzw202312@126.com, 13707617998@126.com

Keywords: YOLOv3, key points, posture recognition, posture correction, lightweight

Received: March 15, 2024

Posture recognition, as a research hotspot, has been widely applied. A recognition model based on bone key point detection is proposed for the posture correction application module. Firstly, the lightweight You Only Look Once v3 Tiny network was chosen as the infrastructure, and the OpenPose algorithm, following the bottom-up strategy, was chosen to implement posture recognition. To reduce the computational burden of the model, the Media Pipe Blaze Pose algorithm was introduced for improvement. At the same time, by refining more bone key points, the accuracy of the model was improved. The experimental outcomes revealed that the Cross-View recognition accuracy on the NTU60 RGB+D dataset and the NTU120 RGB+D dataset was 94.7% and 82.7%, respectively. Compared to the graph Transformer network and the semantic posture recognition model, the Cross-Subject metric improved by an average of 3.5%. Therefore, the proposed model shows better robustness in the field of posture recognition and can help complete posture correction more efficiently.

Povzetek: A new system for posture recognition and correction is presented. It uses MPP-YOLOv3, OpenPose, and Media Pipe Blaze Pose.

1 Introduction

Computer vision has gradually integrated into people's daily lives, and posture recognition has always been one of the research hotspots in this field. Posture recognition is applied in multiple realms, including home monitoring, posture correction, and rehabilitation training [1-2]. The demand for posture correction has gradually increased, mostly to correct improper movement postures. This is especially important in professional sports training, physical therapy, and personal fitness scenarios, where accurately identifying and correcting incorrect exercise postures can greatly improve exercise efficiency, reduce the risk of injury, and promote physical health. Deep learning is one of the commonly used methods in this field and plays an important role in the design of motion posture correction systems [3]. Traditional motion recognition techniques often rely on wearable sensors or complex labeling systems, which are limited, costly, and may interfere with the natural movements of athletes. Deep learning network models provide a non-invasive alternative, directly recognizing human posture by analyzing image or video data obtained from cameras. In the field of deep learning, Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and other architectures are widely used for feature extraction and modeling of visual data, thereby achieving recognition and analysis of human actions. Posture recognition consists of two components: detecting skeletal key points and connecting those key points [4-6]. However, in practical applications, models need to learn data from different actions and perspectives, which inevitably imposes a greater computational burden on them.
Therefore, the research aims to further improve the computational efficiency of the model while ensuring its accuracy. Based on this, a You Only Look Once version 3 (YOLOv3) posture recognition network was studied and designed, which achieves a balance between recognition accuracy and speed through lightweight design and bone key point refinement. It is based on Media Pipe Blaze Pose and is therefore known as Media Pipe Blaze Pose YOLOv3 (MPP-YOLOv3). The research has four parts and two technical innovations. The first innovation is the use of lightweight YOLOv3 Tiny to build the basic framework, and the second is the implementation of the OpenPose algorithm to enhance the system and optimize model calculations. Additionally, the model extracts more skeletal details to capture motion details and improve technical reliability. As a result, the research model accurately recognizes the posture and movements of the human body and provides precise posture correction suggestions. The first part reviews the status of posture recognition research, the second part designs the deep learning network model, the third part conducts experimental analysis on the performance of the designed model, and the last part summarizes the experimental findings.

2 Related works

The application of posture recognition is very extensive and has received widespread academic attention. Zhou et al. proposed a new semantic posture recognition method that classifies actions in videos by learning multiple visual pose models and pose dictionaries associated with body parts. The researchers identified hidden poses in video frames and mapped them to actions in semantic instructions. The experimental outcomes indicated that their solution was effective on multiple datasets [7]. Liu et al. proposed a Transformer network for skeleton-based human posture recognition. They utilized multi-head self-attention and temporal kernel attention to capture high-order dependencies of joints in the skeleton and enhance the temporal correlation of actions. The experimental results indicated that their model outperformed the baseline models [8]. Alemu et al. proposed a method for generating human actions using an auxiliary conditional generative neural network. The aim was to overcome the limitations of single-view action generation by creating samples from new perspectives and expanding the view range. Additionally, they introduced a view domain generalization model to improve posture recognition performance from different perspectives. Tests on multiple RGB+D skeleton datasets showed that their method effectively improves the accuracy of posture recognition [9]. Fang et al. proposed a spatiotemporal slow-fast graph convolutional network (STSF-GCN), which effectively captured the spatiotemporal joint relationships of long and short distances in skeleton data by designing specific adjacency matrices. They used fast and slow paths to process action information at different time scales. Tests on multiple datasets showed that STSF-GCN achieves leading recognition performance at lower computational costs [10]. Coskun et al. designed a minimal transfer learning method. By independently training local visual cues and using a meta-learning-based framework, the action classification model was transferred with only a few samples. The experimental outcomes indicated that their solution was effective [11]. Li et al.
proposed a posture recognition method supported by wavelet transform, which enhanced the sensitivity and discriminative ability of graph convolutional networks to local movements through an innovative tri-attention module. By aggregating global statistical information and integrating multidimensional features, the perception of significant changes was strengthened. The experimental outcomes indicated that their solution achieved comparable performance on multiple datasets [12]. Hao et al. proposed the use of hypergraph neural networks to enhance human motion recognition based on machine vision. By constructing hypergraphs, attention mechanisms, and residual modules to obtain discriminative features, the three-stream fusion architecture further improved recognition accuracy. Their method achieved optimal performance on two benchmark datasets [13]. Table 1 summarizes the reviewed literature.

Table 1: Main contents of literature review
Reference [7] | Method: a new semantic pose recognition method | Dataset and results: classification with a pose dictionary dataset proves the effectiveness of the technique | Limitation: high computational complexity
Reference [8] | Method: a skeleton-based Transformer network for human pose recognition | Dataset and results: outperforms the baseline model on a general human image dataset | Limitation: does not take into account changes in complex motion data
Reference [9] | Method: a human action generation method supported by auxiliary conditional generative neural networks | Dataset and results: effectively improves pose recognition accuracy on RGB+D skeleton datasets | Limitation: requires complex parameter calculations
Reference [10] | Method: a spatiotemporal slow-fast graph convolutional network | Dataset and results: STSF-GCN shows excellent recognition performance on both self-made and general skeleton data | Limitation: susceptible to noise interference
Reference [11] | Method: a minimal transfer learning method | Dataset and results: classification performance is tested with only a few samples, and the scheme proves reliable | Limitation: poor stability
Reference [12] | Method: a pose recognition method supported by wavelet transform | Dataset and results: excellent performance on NTU-RGB+D skeleton data | Limitation: performs poorly with fewer samples

Current posture recognition models primarily rely on extracting bone key points to improve recognition accuracy, but often neglect the improvement of detection speed. Therefore, the study proposes a posture recognition model based on lightweight YOLOv3 Tiny, which not only enhances the refinement of key points but also improves the detection efficiency of the model.

3 Design of a posture recognition system based on YOLOv3 Tiny network architecture and MPP optimization

A deep learning based action recognition network is proposed for the application field of posture correction. Firstly, the lightweight YOLOv3 Tiny network is chosen as the infrastructure and optimized using the OpenPose algorithm. Subsequently, Media Pipe Pose is introduced to achieve depthwise separable convolution and a lightweight design, improving the accuracy of the network model.
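To make the overall workflow concrete, the following is a minimal sketch of how these components can be chained. The three component functions are illustrative placeholders standing in for the YOLOv3 Tiny detector, the OpenPose/Blaze Pose key point estimator, and the posture classifier described in Sections 3.1 and 3.2; they are assumptions for illustration, not the implementation used in this work.

```python
# Illustrative data-flow sketch for the system in Section 3; the component
# functions are hypothetical placeholders, not the authors' implementation.
import cv2  # assumed available for video capture


def detect_persons(frame):
    """Placeholder for the lightweight YOLOv3 Tiny person detector (Section 3.1)."""
    raise NotImplementedError


def estimate_keypoints(frame, box):
    """Placeholder for the key point estimator (OpenPose / Blaze Pose, Sections 3.1-3.2)."""
    raise NotImplementedError


def classify_posture(keypoints):
    """Placeholder for the posture classifier (standing, squatting, bending, ...)."""
    raise NotImplementedError


def run_pipeline(video_path):
    """Yield a (posture label, key points) pair per detected person, frame by frame."""
    cap = cv2.VideoCapture(video_path)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        for box in detect_persons(frame):               # 1) locate each person
            keypoints = estimate_keypoints(frame, box)   # 2) extract skeletal key points
            yield classify_posture(keypoints), keypoints  # 3) feed correction logic
    cap.release()
```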
3.1 Design of action recognition module based on improved YOLOv3 Tiny network architecture

With the growth of artificial intelligence, machine vision has been applied in various realms such as sports, medicine, and industry. Deep learning has been applied to posture correction, and human action recognition algorithms have been developed on this basis. CNNs are a commonly used and mature technology in deep learning. The research selects the representative YOLO algorithm, which partitions the image feature map into grids, generates corresponding prior boxes, and finally performs target recognition as a regression task. Based on the performance comparison of various versions of the YOLO algorithm, the lightweight YOLOv3 Tiny algorithm is chosen for research, which has better flexibility and agility. However, its lightweight characteristics can also lead to a decrease in model accuracy. Therefore, the study further strengthens recognition accuracy by reducing the number of predicted classes to lower the network's operating parameters, as shown in Figure 1.

Figure 1: YOLOv3-Tiny network infrastructure

The prerequisite for achieving posture correction is action recognition, which involves identifying key points in the human body and making optimal connections between them. Deep learning-based recognition techniques include two forms: two-dimensional and three-dimensional. The running process of the two is roughly the same, but 3D recognition requires mapping the key points into 3D space. According to the number of targets, recognition falls into two categories: single-person and multi-person recognition [14-15]. The study investigates the more complex scenario of recognizing multiple individuals, for which the common detection strategies are the bottom-up and top-down approaches. In the top-down approach, each target is detected first and its action is then estimated; while the recognition accuracy is high, this does not meet the real-time requirements for multiple targets. In the bottom-up approach, human key points are first detected and grouped, and then matched for recognition; this approach is more suitable for recognizing multi-target actions. The study adopts the OpenPose algorithm, which is commonly used in bottom-up strategies. The input image is first passed through the Visual Geometry Group 19 (VGG19) network to extract features and generate the corresponding feature maps. These are then fed into two branches: the upper branch estimates the Part Confidence Maps (PCMs), and the lower branch estimates the Part Affinity Fields (PAFs). Among them, PAF is a 2D limb annotation technique that preserves the position and direction information of limb segments, as shown in Figure 2.

Figure 2: OpenPose module combination diagram

PAFs encode the associations between key points, providing support for subsequent key point matching. The joint confidence maps $S_j$ and PAF maps $L_c$ are defined in formula (1).
    * * *2 , 1,2,..., , 1,2,..., wh j wh c S R j J L R c C       (1) In formula (1), * wh represents the input image size, and / JC represent the total number of human keypoints and bone connections, respectively. The confidence plots l S and PAFs l L for the production cost period are shown in formula (2). ( ) ( ) ' 1 1 ' 1 1 , , , 2 , , , 2 l l l l l l S F S L t L F S L t   −− −−  =     =     (2) In formula (2), '' /  represent the stage inference of the confidence map and PAFs map, F represents the feature map, and represents the confidence map and PAFs map of the previous stage, respectively. The study introduces L2 loss functions to calculate the losses of each branch separately, aiming to ensure the correct expansion of the training direction. Subsequently, the study aimed to address the phenomenon of missing labels in annotated samples and optimized the loss function through mask operation. The loss function of the entire model is the sum of two branches, and the improved loss function / tt SL ff corresponding to the upper branch of PCMs and the lower branch of PAFs is shown in formula (3) [16]. ( ) ( ) ( ) ( ) ( ) ( ) 2 * 2 1 2 * 2 1 J tt S j j jp C tt L c c cp f W p S p S p f W p L p L p = =  =  −     =  −     (3) In formula (3), t represents the corresponding number of stages, ( ) ( ) / tt jc S p L p represent the pixel PCMs and PAFs of each stage, ( ) ( ) ** / jc S p L p represent the confidence map and partial affinity domain of the annotated points, and ( ) Wp represents the binary mask function of the pixel points p . ( ) Wp value of 1 or 0 corresponds to the cases where pixels are labeled and unlabeled, respectively. There are a total of 18 key points in the human body, as Figure 3. The Application of Action Recognition Based on MPP-YOLOv3… Informatica 48 (2024) 19–34 23 14 14 15 15 16 16 17 17 0 0 1 1 2 2 3 3 4 4 5 5 6 6 7 7 8 8 9 9 10 10 11 11 12 12 13 13 Upper body Upper body Head Head Lower body Lower body 14/15:Right / Left eyes 14/15:Right / Left eyes 0: Nose 0: Nose 16/17:Right / Left ears 16/17:Right / Left ears 2/5:Right / Left shoulders 2/5:Right / Left shoulders 1: Neck 1: Neck 3/6:Right / Left elbows 3/6:Right / Left elbows 4/7:Right / Left wrists 4/7:Right / Left wrists 8/11:Right / Left hips 8/11:Right / Left hips 9/12:Right / Left knees 9/12:Right / Left knees 4/7:Right / Left ankles 4/7:Right / Left ankles Figure 3: Key points of human bones The human body is considered as three parts: the head, upper body, and lower body. Except for the nose and neck, all other key points are symmetrical on both sides. When there is no occlusion in the human body image, there is a unique maximum value among its corresponding PCMs. When there are k human targets in the image, there will be k peaks in the corresponding keypoint PCMs, represented as ( ) * , jk Sp . By using the max method, multi-objective PCMs aggregation ( ) * j Sp can be achieved, as formula (4). ( ) ( ) ** , 2 , 2 2 max max exp j j k k jk k S p S p px  =   −   =−      (4) In formula (4),  represents the standard deviation, p represents the position of ( ) * , jk Sp , and , jk x represents the pixel position of k people in the corresponding joint point j . After the key points are extracted, they need to be grouped and connected to ultimately achieve human pose recognition. 
When there are multiple targets in the image, a fully connected matching scheme leads to redundant connections, while the midpoint detection method leads to errors in connecting multiple targets. Therefore, the study uses the values of the two-dimensional vector field PAFs, i.e. the part affinity domain, to connect key points, as shown in formula (5) [17].

$L_{c,k}^*(p) = \begin{cases} v, & \text{if } p \text{ lies on limb } c \text{ of person } k \\ 0, & \text{otherwise} \end{cases} \quad (5)$

In formula (5), $L_{c,k}^*(p)$ represents the PAF value at pixel $p$ for limb $c$ of the corresponding target $k$, and $v$ represents the unit vector along the limb. The calculation of the unit vector is given in formula (6).

$v = \frac{x_{j_2,k} - x_{j_1,k}}{\left\| x_{j_2,k} - x_{j_1,k} \right\|_2} \quad (6)$

In formula (6), $j_1$ and $j_2$ represent the two key points forming the limb, and $x_{j_1,k}$ and $x_{j_2,k}$ represent the coordinates of those key points for target $k$.

3.2 Design of posture correction system based on Media Pipe Blaze Pose

To reduce the computational burden of the model, the study introduces the Depthwise Separable Convolution (DSC) strategy for improvement. DSC improves on standard convolution by separating the spatial dimension from the channel (depth) dimension. This reduces the number of parameters required for convolution calculation and improves the computational efficiency of the model [18]. Although the recognition accuracy of this network may slightly decrease, the decrease in accuracy is negligible when the parameters are significantly reduced. The DSC strategy decomposes a standard convolution into a depthwise convolution and a 1 × 1 (pointwise) convolution, handling the spatial dimension and the channel dimension, respectively. Between each depthwise convolutional layer and 1 × 1 convolutional layer, a BN layer and a ReLU layer are set, consistent with standard convolution. The DSC decomposition structure is given in Figure 4.

Figure 4: Depthwise separable convolution structure

In Figure 4, the convolution kernel size is $D_K \times D_K$, $D_F \times D_F \times M$ denotes the input feature map size, $D_F \times D_F \times N$ denotes the output feature map size, and $N$ is the number of convolution kernels. The required computational cost $C_S$ of standard convolution is given in formula (7) [19-20].

$C_S = D_K \cdot D_K \cdot M \cdot D_F \cdot D_F \cdot N \quad (7)$

The DSC computation $C_D$ divides the standard convolution computation into two modules, depthwise convolution and 1 × 1 convolution, as shown in formula (8).

$C_D = D_K \cdot D_K \cdot M \cdot D_F \cdot D_F + M \cdot N \cdot D_F \cdot D_F \quad (8)$

By comparing the standard convolution computation with the DSC computation, the corresponding computational ratio of the two can be obtained, as shown in formula (9).

$P_C = \frac{C_D}{C_S} = \frac{1}{N} + \frac{1}{D_K^2} \quad (9)$

From formula (9), the computation ratio is related to the number and size of the convolution kernels. In practical networks, the number of convolution kernels $N$ is usually large, so the $1/N$ term can be ignored. In the network used in this study, the kernel size is set to 3 × 3, i.e. $D_K = 3$. Therefore, the ratio of DSC computation to standard convolution computation is approximately 1:9, which shows that DSC can greatly cut the model's parameters and computational burden.
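A minimal PyTorch sketch of the depthwise separable block described above is given below, following the BN + ReLU placement of Figure 4. The channel sizes and the toy input are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """A standard 3x3 convolution factored into a depthwise 3x3 convolution
    followed by a 1x1 pointwise convolution, each with BN + ReLU (Figure 4)."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.depthwise = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, kernel_size=3, stride=stride,
                      padding=1, groups=in_ch, bias=False),  # spatial filtering per channel
            nn.BatchNorm2d(in_ch),
            nn.ReLU(inplace=True),
        )
        self.pointwise = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False),  # channel mixing
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

# With D_K = 3 and a large output channel count N, formula (9) gives a cost
# ratio of roughly 1/N + 1/9, i.e. about one ninth of a standard convolution.
block = DepthwiseSeparableConv(64, 128)
y = block(torch.randn(1, 64, 56, 56))   # -> torch.Size([1, 128, 56, 56])
```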
Based on this, the cross-platform graphics framework Media Pipe is introduced and applied to target posture recognition. This framework achieves end-to-end acceleration with a built-in fast ML inference and processing pipeline, enabling it to run on devices ranging from mobile phones to workstations and servers. In addition, the Media Pipe framework can build multiple input channels simultaneously, such as video and sensor streams. The Media Pipe infrastructure is shown in Figure 5.

Figure 5: Media Pipe infrastructure (camera → frame selection → object detection → detection tracking → detection merging → detection annotation → display, executed on compute nodes)

In addition, the model also needs a module for real-time inference, so the lightweight CNN Blaze Pose, specifically designed for mobile device applications, is introduced. When using the Blaze Pose network for inference, 33 bone key points can be extracted through the detector-tracker module. The tracker predicts the position of the key points, whether a target is present in the image frame, and the pose region of interest of the frame. If no human target is detected in the current frame, detection and tracking prediction are repeated in the next frame. This working mode means that the dependency between adjacent frames is strong. Embedding the Blaze Pose network into the Media Pipe framework yields the Media Pipe Pose model. The Blaze Pose network consists of a stacked encoder-decoder heatmap network together with a regression encoder network. Training applies heatmap and offset losses, and the corresponding output layers are removed during inference. Skip connections are used between the different stages to balance high- and low-level features, while the gradient from the regression encoder is stopped at these connections. This strategy effectively improves prediction accuracy and coordinate regression accuracy. The number of human bone key points in the Blaze Pose network is increased to 33, with the head, hand, and foot key points refined, as shown in Figure 6.

Figure 6: Detailed analysis of key points (head refinement: eye corners and lips; hand refinement: palms and fingers; foot refinement: toes and heels)

In Figure 6, the newly added outer and inner corners of the left and right eyes are key points 3/6 and 1/4, respectively, and the left and right sides of the lips are 9/10. Among the newly added hand key points, the left and right little fingers are 17/18, the left and right index fingers are 19/20, and the left and right thumbs are 21/22. For the feet, 29/30 mark the left and right heels and 31/32 the left and right toes. The remaining key points in the model are consistent with the original model.
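As an illustration of how the 33 Blaze Pose key points can be obtained in practice, the following sketch uses the publicly available MediaPipe Pose Python package; treating that package as the deployment environment is an assumption, and the exact interface may vary by version. Points with low visibility are filled with (0, 0), one possible way to implement the occlusion handling described below.

```python
# Sketch using the MediaPipe Pose Python package (assumed environment; not the
# authors' code) to read the 33 Blaze Pose key points, with coordinates already
# normalized to the (0, 1) range relative to the image width and height.
import cv2
import mediapipe as mp

mp_pose = mp.solutions.pose

def extract_keypoints(image_bgr, visibility_thresh=0.5):
    with mp_pose.Pose(static_image_mode=True, model_complexity=1) as pose:
        results = pose.process(cv2.cvtColor(image_bgr, cv2.COLOR_BGR2RGB))
    if results.pose_landmarks is None:           # no person detected in this frame
        return None
    keypoints = []
    for lm in results.pose_landmarks.landmark:    # 33 landmarks
        if lm.visibility < visibility_thresh:     # occluded point -> filled with (0, 0)
            keypoints.append((0.0, 0.0))
        else:
            keypoints.append((lm.x, lm.y))        # x, y already normalized to (0, 1)
    return keypoints

kps = extract_keypoints(cv2.imread("sample_pose.jpg"))  # hypothetical input image
```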
The human body posture can be divided into 8 types: standing, walking, squatting, waving, bending, kicking, side leg pressing, and sitting. After obtaining the posture data, normalization is required, mapping all key points to the (0,1) interval. For the horizontal axis of the two-dimensional coordinate system, the normalized horizontal coordinate is shown in formula (10).

$X_1 = \frac{X}{W} \quad (10)$

In formula (10), $X$ and $X_1$ represent the horizontal coordinates of a key point before and after normalization, respectively, and $W$ denotes the width of the camera display. Correspondingly, the vertical coordinates in the two-dimensional coordinate system are normalized as shown in formula (11).

$Y_1 = \frac{Y}{H} \quad (11)$

In formula (11), $Y$ and $Y_1$ represent the vertical coordinates of a key point before and after normalization, and $H$ represents the height of the camera display. However, some key points in the image may be occluded. To keep the dimensionality of the feature vector unchanged, the coordinates of such points are filled with (0,0).

4 Performance and practical application analysis of action recognition network based on MPP-YOLOv3 Tiny

To assess the reliability of the proposed deep learning network for action recognition, the study conducts experiments on the model's performance, including its training and validation error, detection speed, and weight file size. Subsequently, practical application analysis is conducted to examine the recognition performance for different actions and to verify the model's superiority in action recognition.

4.1 Performance verification of action recognition model based on MPP-YOLOv3 Tiny network

The study first investigates the performance of each module in the MPP-YOLOv3 Tiny network. The specific experimental environment and parameter settings are shown in Table 2.

Table 2: Experimental environment and parameter settings
Operating system: Ubuntu 18.04
GPU: NVIDIA Quadro M2200
CPU: Intel Xeon E3-1505M v6 @ 3.00 GHz
RAM: 16 GB
CUDA: 10.1
Programming language: Python 3.6
Epochs: 220
Batch size: 64
Learning rate: 0.0001
Optimizer: Adam

The study selects the CityPersons public dataset for experimentation, which contains a total of 3475 images, of which 500 form the test set. The accuracy and loss curves of the model during training and validation are shown in Figure 7.

Figure 7: Model training and validation performance analysis ((a) model training performance; (b) model validation performance)

In Figure 7 (a), as the number of iterations increased, the training curve rapidly improved and eventually stabilized.
As the model approached 40 iterations, the accuracy curve became smoother, and it ultimately reached 93.01% at the 230th iteration. The loss value during training began to flatten out around the 20th iteration, at which point the training loss had decreased to 0.085; at the 230th iteration, the training loss of the model is only 0.0462. In Figure 7 (b), the validation curves quickly flatten out around the 22nd iteration. As the number of iterations increases, the validation accuracy gradually improves, reaching 96.97% at the 230th iteration. The validation loss curve also rapidly decreases to a steady state after around the 15th iteration; at the 230th iteration, the validation loss is only 0.0417. Overall, the model performs well in both training and validation. Further experiments compare the designed MPP-YOLOv3 Tiny model with the pre-optimization OpenPose algorithm model and the OpenPose-VGG19 algorithm model, as shown in Figure 8.

Figure 8: Comparison of model performance before and after optimization ((a) detection speed and number of key points; (b) weight file size and accuracy)

Figure 8 (a) shows the comparison of detection speed and number of recognized key points before and after optimization of the model. The MPP-YOLOv3 Tiny network designed in this research can recognize up to 33 human bone key points, while the two models before optimization can only recognize 18. Additional key points may improve the recognition accuracy of human body movements in the image, but may also result in slower recognition speed. Therefore, the detection speed of the research design model is 11 FPS, which is 36.36% lower than the 7 FPS of the OpenPose algorithm model; compared to the OpenPose-VGG19 algorithm, however, the detection speed is still 42.11% higher. In Figure 8 (b), the accuracy of the research design model reaches 98.17%, while the accuracy of the two models before optimization is below 95%; the accuracy of the MPP-YOLOv3 Tiny network is improved by an average of 5.01%. The weight file of the research design model is 1.6 M, which is 99.22% smaller than that of the OpenPose algorithm and 79.94% smaller than that of the OpenPose-VGG19 algorithm. In summary, although the detection speed of the research design model is slightly lower than that of the OpenPose algorithm, its overall performance is the best. Therefore, the optimization of the initial action recognition model is effective and reliable.

4.2 Practical analysis of posture correction model based on MPP-YOLOv3 Tiny network

To verify recognition performance on the 8 common actions, a pose dataset is constructed using Media Pipe Pose. In addition, the study also introduces running, hugging, computer operation, and falling, movements that are easily confused with other movements, for further comparison. The total number of images is 17980, and the experiment randomly divides them into a training set and a validation set in an 8:2 ratio. The experimental outcomes are given in Figure 9.
Figure 9: Accuracy analysis of different action recognition ((a) common posture recognition results; (b) easily confused posture recognition results)

Figure 9 (a) visualizes the model's recognition of common actions. The recognition accuracy for the side leg pressing and hand waving movements is 100%, and the recognition accuracy for the other movements is also above 90%. The recognition accuracy of the walking movement is 91.1%, because walking is easily confused with standing and kicking, with recognition errors of 4.4% for each. The recognition error between squatting and sitting is 2.2%, and the recognition error between kicking and side leg pressing is also 2.2%. Bending, squatting, walking, and sitting movements are easily confused, but the errors are all within 2.2%. Overall, the model achieves an average recognition accuracy of 96.7% for common actions. In Figure 9 (b), four movements including running have been added, while the sitting movement has been removed. The model achieves 100% recognition accuracy for five movements: standing, squatting, waving, kicking, and operating a computer. Except for the bending motion, the recognition accuracy for all other movements also reaches over 92%. The recognition accuracy for bending is only 86.7%, because bending and falling are easily confused, with a recognition error of 12.3%. The study then applies the MPP-YOLOv3 Tiny action recognition network to posture correction. Taking the squat as an example, the model's recognition of incorrect squatting movements is shown in Figure 10.

Figure 10: Recognition results of squat posture correction for internal knee buckle, excessive forward leaning of the torso, trunk roll, and overly erect torso ((a) number of successful identifications per test; (b) success rate)

Figure 10 (a) shows that the model's recognition performance for the internal knee buckle is the weakest, with a recognition accuracy of only 87.4%; this is mainly due to recognition errors arising from the small motion amplitude of this error pattern. The recognition accuracy for excessive forward leaning of the torso is relatively high, reaching 92.6%; the main factor affecting its recognition is the occlusion caused by the forward-leaning upper body. The recognition rate for trunk roll and an overly erect torso is 96.5%. Figure 10 (b) indicates that at least 19 of the 20 identifications succeed in each experiment, primarily because these erroneous movements involve significant, easily observed changes.
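The correction rules themselves are not listed in detail; in practice, error patterns such as those in Figure 10 can be flagged with simple geometric checks on the detected key points. The following sketch shows one possible formulation for two of the squat errors above; the angle and ratio thresholds and the use of the Blaze Pose indices from Figure 6 are illustrative assumptions, not the values used in this work.

```python
import numpy as np

# Blaze Pose indices (Figure 6 layout): 11/12 shoulders, 23/24 hips,
# 25/26 knees, 27/28 ankles -- assumed for illustration.
L_SHO, R_SHO, L_HIP, R_HIP, L_KNEE, R_KNEE, L_ANK, R_ANK = 11, 12, 23, 24, 25, 26, 27, 28

def check_squat(kps, lean_thresh_deg=35.0, knee_ratio=0.8):
    """kps: (33, 2) array of normalized (x, y) key points for one frame.
    Returns a list of detected squat errors; thresholds are illustrative."""
    errors = []
    hip_mid = (kps[L_HIP] + kps[R_HIP]) / 2
    sho_mid = (kps[L_SHO] + kps[R_SHO]) / 2
    # Torso lean: angle of the hip -> shoulder segment away from the vertical axis.
    dx, dy = sho_mid - hip_mid
    lean = np.degrees(np.arctan2(abs(dx), abs(dy)))
    if lean > lean_thresh_deg:
        errors.append("excessive forward leaning of the torso")
    # Knee buckle: knees noticeably closer together than the ankles.
    knee_width = abs(kps[L_KNEE][0] - kps[R_KNEE][0])
    ankle_width = abs(kps[L_ANK][0] - kps[R_ANK][0])
    if ankle_width > 0 and knee_width / ankle_width < knee_ratio:
        errors.append("internal knee buckle")
    return errors
```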
Next, the study compares the proposed model with the semantic action recognition model based on a pose lexicon proposed in reference [7] and the action recognition model based on a graph Transformer network proposed in reference [8]. The experimental datasets are the NTU60 RGB+D dataset and the NTU120 RGB+D dataset, with the latter being an extension of the former. The experimental outcomes are given in Table 3.

Table 3: Comparative analysis of the performance of each model
Model | Index | NTU60 RGB+D | NTU120 RGB+D
Zhou et al. [7] | CV | 93.6% | 82.6%
Zhou et al. [7] | CS | 86.1% | 78.8%
Liu et al. [8] | CV | 92.5% | 82.1%
Liu et al. [8] | CS | 85.2% | 74.8%
Ours | CV | 94.7% | 82.7%
Ours | CS | 89.3% | 80.6%

The indicators in Table 3 represent the recognition accuracy of each model from the Cross View (CV) and Cross Subject (CS) perspectives, respectively. All models perform better on the NTU60 RGB+D dataset. On this dataset, the research design model is still 1.7% better than the other two models in terms of the CV index, and 3.2% better in terms of the CS index. On the NTU120 RGB+D dataset, the recognition accuracy of every model drops below 90%; here the CV index of the research design model is 0.35% higher than that of the other models, and the CS index is 3.8% higher. In summary, the MPP-YOLOv3 Tiny network proposed in the study can better achieve posture correction through action recognition. Additionally, eight rounds of testing are conducted to compare the performance of the models in terms of recall, precision, and F1 value, as depicted in Figure 11.

Figure 11: Test results of recall, precision, and F1 value for different models ((a) recall; (b) precision; (c) F1)

Figures 11 (a) to 11 (c) display the results of the recall, precision, and F1 value tests, respectively. The research design model maintains a recall rate of 0.95 or higher across the 8 rounds of testing, outperforming the other models. The model proposed by Zhou et al. [7] performs poorly, with a recall rate below 0.85 in the third round of testing and poor stability. When comparing precision, the designed model consistently performs well, remaining at 0.93 or above, while the other two models fluctuate around 0.85 with poor overall stability. In terms of the F1 value, the model proposed by Liu et al. [8] shows significant fluctuations in the 4th and 7th rounds and performs worse than the designed model overall; the F1 values of the designed model are all above 0.94 with excellent stability. To test the recognition performance of the designed model in real environments, conventional action data are collected as a self-built simple-scene training dataset with a total of 1565 entries, and 1756 complex-scene samples, such as squatting, kicking, and occluded actions, are selected as a self-built complex-scene dataset. These two datasets are used to produce recognition results for real-world scenarios, as shown in Figure 12.
Figure 12: Comparison of recognition effects of different models in real scenarios ((a) simple scenario; (b) complex scenes)

Figure 12 (a) presents the results of action recognition in the simple scenario. The designed model converges after 80 iterations, with an accuracy of 95.25%. The models proposed by Zhou et al. [7] and Liu et al. [8] both converge after 100 iterations, with overall recognition accuracies of 90.25% and 88.21%, respectively. In the complex scenario, the recognition accuracy of all three models decreases significantly. The model proposed by Zhou et al. [7] exhibits poor overall stability and lower recognition accuracy than the model of Liu et al. [8] in the later stage; the main reason may be its inability to extract the multivariate features of complex scenes, which affects the accuracy of later recognition. In comparison, the designed model outperforms the other two, with a recognition accuracy of 90.95% at convergence. In summary, the MPP-YOLOv3 algorithm shows better practical application results. This is due to its lightweight design and enhanced feature extraction of the upper and lower limbs, which ensure the accuracy and stability of the technology. In complex recognition scenarios, comparable models that extract fewer features suffer a significant drop in accuracy, whereas the research model optimizes the recognition process through its lightweight design and the DSC strategy to maintain recognition accuracy.

4.3 Discussion

The application of action recognition technology in correcting posture holds significant research value. This technology allows for real-time monitoring and analysis of human body posture, aiding in the evaluation and correction of poor posture, and ultimately improving overall health and motor skills. Reference [10] proposed an action recognition method based on spatiotemporal slow-fast graph convolutional networks. However, this method is susceptible to external noise during data collection and action recognition, which can decrease the quality of the collected data and affect the final recognition effect. Reference [11] proposed an action recognition technique based on minimal transfer learning. This technology exhibits high stability and recognition accuracy early in training; however, as training progresses, the stability gradually decreases and cannot meet the needs of practical applications. Reference [12] proposed a motion recognition technique based on wavelet transform. This technology relies on a large amount of collected data, and its overall recognition accuracy decreases significantly with fewer data samples, limiting its feasibility and accuracy in practical applications. Therefore, this study examined the application of action recognition in posture correction using the MPP-YOLOv3 algorithm. The study compared and analyzed the techniques proposed in references [10], [11], and [12], and found that these methods have limitations, whereas the research method has advantages. Firstly, it can effectively overcome external noise during data collection and action recognition: the study reduces the number of predicted classifications in order to lower the network's operating parameters and improve data quality and action recognition performance. Secondly, the research method exhibited high stability and recognition accuracy during the training process.
This is especially true with the introduction of the Blaze Pose network, which significantly improves the inference process and makes the research technology more suitable for practical applications. Additionally, compared to reference [12], the research method demonstrates lower dependence on data samples, maintaining high recognition accuracy even with fewer samples. In summary, the action recognition method based on the MPP-YOLOv3 algorithm shows better application effects in posture correction, higher recognition accuracy, and stronger stability. However, further experiments and research are needed to validate and improve the research method, in order to enhance its accuracy and stability and to promote its widespread application in practical settings.

5 Conclusion

To achieve posture correction and apply it to various fields such as sports and medicine, the research proposes using deep learning for action recognition. First, a lightweight network infrastructure based on YOLOv3 Tiny was built. Then, the model's accuracy was improved by refining bone key points through the MPP module. In the experimental analysis on the CityPersons public dataset, the results showed that the curves of model accuracy and loss tended to stabilize with increasing iterations, reaching a training accuracy of 93.01% at 230 iterations; the validation accuracy and loss were 96.97% and 0.0417, respectively. Although its detection speed was slightly lower at 11 FPS, compared to the pre-optimization OpenPose and OpenPose-VGG19 algorithms the accuracy improved by 3.2% and the model size was reduced by 83.1%. In practical application analysis, the MPP-YOLOv3 Tiny network achieved a recognition accuracy of 96.7% for common actions, and could accurately recognize easily confused actions such as running, hugging, operating a computer, and falling, demonstrating good generalization ability. When applied to posture correction for squatting movements, the success rate of identifying the internal knee buckle was 87.4%, and the average success rate of identifying the other erroneous movements reached 94.7%. This indicates that the model has practical value in posture correction. Compared to the action recognition model proposed by Zhou et al. [7] and the model proposed by Liu et al. [8], on the NTU60 RGB+D and NTU120 RGB+D datasets the average CV index improved by 1.02% and the average CS index improved by 3.5%. In summary, the action recognition model based on the MPP-YOLOv3 Tiny network proposed in the study has significant application value in posture correction. However, the detection speed of the designed action recognition model still needs to be improved; in the future, further efforts should be made to enhance the recognition speed of this module while maintaining its lightweight design and accuracy.

References
[1] Y. Bai, Q. Zou, X. Chen, L. Li, Z. Ding, and L. Chen, “Extreme low-resolution action recognition with confident spatial-temporal attention transfer,” International Journal of Computer Vision, vol. 131, no. 6, pp. 1550-1565, 2023. https://doi.org/10.1007/s11263-023-01771-4
[2] K. Bhosle, and V. Musande, “Evaluation of deep learning CNN model for recognition of devanagari digit,” Artificial Intelligence and Applications, vol. 1, no. 2, pp. 114-118, 2023. https://doi.org/10.47852/bonviewAIA3202441
[3] H. Liu, Y. Chen, W. Zhao, and S. Zhang, “Human pose recognition via adaptive distribution encoding for action perception in the self-regulated learning process,” Infrared Physics and Technology, vol. 114, no. 1, pp. 103660-103669, 2021. https://doi.org/10.1016/j.infrared.2021.103660
[4] R. Kumar, and S. Kumar, “Survey on artificial intelligence-based human action recognition in video sequences,” Optical Engineering, vol. 62, no. 2, pp. 23102-23123, 2023. https://doi.org/10.1117/1.OE.62.2.023102
[5] C. Bian, W. Feng, L. Wan, and S. Wang, “Structural knowledge distillation for efficient skeleton-based action recognition,” IEEE Transactions on Image Processing, vol. 30, no. 1, pp. 2963-2976, 2021. https://doi.org/10.1109/TIP.2021.3056895
[6] A. Kumar, S. Majee, and S. Jain, “CDM: A coupled deformable model for image segmentation with speckle noise and severe intensity inhomogeneity,” Chaos, Solitons & Fractals, vol. 173, no. 3, pp. 104385-104396, 2023. https://doi.org/10.1016/j.chaos.2023.113551
[7] L. Zhou, and T. Jiang, “Learning body part-based pose lexicons for semantic action recognition,” IET Computer Vision, vol. 17, no. 2, pp. 135-155, 2023. https://doi.org/10.1049/cvi2.12143
[8] Y. Liu, H. Zhang, D. Xu, and K. He, “Graph transformer network with temporal kernel attention for skeleton-based action recognition,” Knowledge-Based Systems, vol. 240, no. 3, pp. 108146, 2022. https://doi.org/10.1016/j.knosys.2022.108146
[9] G. Alemu, Y. Ji, Y. Yang, L. L. Gao, and H. T. Shen, “Arbitrary-view human action recognition via novel-view action generation,” Pattern Recognition, vol. 118, no. 10, pp. 108043-108052, 2021. https://doi.org/10.1016/j.patcog.2021.108043
[10] Z. Fang, X. Zhang, T. Cao, Y. Zheng, and M. Sun, “Spatial-temporal slowfast graph convolutional network for skeleton-based action recognition,” IET Computer Vision, vol. 16, no. 3, pp. 205-217, 2021. https://doi.org/10.1049/cvi2.12080
[11] H. Coskun, M. Z. Zia, B. Tekin, F. Bogo, N. Navab, F. Tombari, and H. S. Sawhney, “Domain-specific priors and meta learning for few-shot first-person action recognition,” IEEE Transactions on Software Engineering, vol. 45, no. 6, pp. 6659-6673, 2021. https://doi.org/10.48550/arXiv.1907.09382
[12] X. Li, W. Zhai, and Y. Cao, “A tri-attention enhanced graph convolutional network for skeleton-based action recognition,” IET Computer Vision, vol. 15, no. 2, pp. 110-121, 2021. https://doi.org/10.1049/cvi2.12017
[13] X. Hao, J. Li, Y. Guo, T. Jiang, and M. Yu, “Hypergraph neural network for skeleton-based action recognition,” IEEE Transactions on Image Processing, vol. 30, no. 1, pp. 2263-2275, 2021. https://doi.org/10.1109/TIP.2021.3051495
[14] T. Le, N. Huynh-Duc, C. T. Nguyen, and M. T. Tran, “Motion embedded images: an approach to capture spatial and temporal features for action recognition,” Informatica, vol. 47, no. 3, pp. 327-329, 2023. https://doi.org/10.31449/inf.v47i3.4755
[15] P. Climent-Pérez, and F. Florez-Revuelta, “Improved action recognition with separable spatio-temporal attention using alternative skeletal and video pre-processing,” Sensors, vol. 21, no. 3, pp. 1005, 2021. https://doi.org/10.3390/s21031005
[16] Z. Liu, J. Cheng, L. Liu, Z. Ren, Q. Zhang, and C. Song, “Dual-stream cross-modality fusion transformer for RGB-D action recognition,” Knowledge-Based Systems, vol. 255, no. 11, pp. 109741-109752, 2022. https://doi.org/10.1016/j.knosys.2022.109741
[17] M. Gutoski, A. E. Lazzaretti, and H. S. Lopes, “Deep metric learning for open-set human action recognition in videos,” Neural Computing & Applications, vol. 33, no. 4, pp. 1207-1220, 2021.
[18] M. Smith, and R. Toumi, “Using video recognition to identify tropical cyclone positions,” Geophysical Research Letters, vol. 48, no. 7, pp. 1-9, 2021. https://doi.org/10.1029/2020GL091912
[19] D. Lee, D. Wang, Y. Yang, L. Deng, and G. Li, “QTTNet: Quantized tensor train neural networks for 3D object and video recognition,” Neural Networks, vol. 141, no. 5, pp. 420-432, 2021. https://doi.org/10.1016/j.neunet.2021.05.034
[20] P. Preethi, and H. R. Mamatha, “Region-based convolutional neural network for segmenting text in epigraphical images,” Artificial Intelligence and Applications, vol. 1, no. 2, pp. 119-127, 2023. https://doi.org/10.47852/bonviewAIA2202293