https://doi.org/10.31449/inf.v48i10.5918 Informatica 48 (2024) 35–50 35 
Sports Action Detection and Counting Algorithm Based on Pose 
Estimation and Its Application in Physical Education Teaching 
Zhengyuan Song¹, Zhonghai Chen²*
¹ College of Physical Education, Chongqing Technology and Business University, Chongqing 400067, China
² Physical Education Institute, Yanching Institute of Technology, Sanhe 065201, China
E-mail: sszhyy163@163.com
* Corresponding author
Keywords: pose estimation, sports, action recognition, counting, ViBE 
Received: March 15, 2024  
Accurate motion detection and counting are important for improving training effectiveness and 
preventing sports injuries in physical education teaching and training. Traditional analysis of sports movements relies mainly on coaches' observation and athletes' subjective feelings. This method is not only time-consuming and labor-intensive, but also susceptible to personal experience and judgment bias. To address this issue, a 3D skeletal keypoint detection algorithm based on the pose estimation model Visual Background Extractor (ViBE) was proposed. Each repeated action was segmented and scored with a penalty function by analyzing the continuous action information in the time series. A lightweight version of the method was also proposed based on OpenPose. Performance tests showed that the detection accuracy in the playground and outdoor settings was the same, 96%, the highest among all tested environments. Although the height, gender, and
clothing of the participants varied, these factors did not significantly affect the performance of the 
algorithm from an individual performance perspective, with accuracy ranging from 92% to 94%. 
These experiments confirm that the proposed motion detection and counting model based on pose 
estimation has high robustness and reliability under different environmental conditions. 
Povzetek (Summary): The algorithm for detecting and counting sports actions is based on pose estimation and uses the Visual Background Extractor model to improve training and prevent injuries.
1 Introduction 
Accurate action detection and counting are important for 
improving training effectiveness and preventing sports 
injuries in physical education teaching and training. As 
technology advances, action detection and counting 
algorithms based on pose estimation are research hotspots. 
These algorithms can provide real-time feedback on the 
quality of exercise execution by analyzing the posture 
and movements of athletes. These algorithms can help 
coaches and athletes optimize training methods and 
improve athletic performance [1]. Traditional analysis of sports movements relies mainly on coaches' observation and athletes' subjective feelings. This method is not only time-consuming and labor-intensive, but also susceptible to personal experience and judgment bias [2]. With advances in computer vision and machine learning, pose estimation-based algorithms offer new possibilities for the automatic detection and analysis of sports actions. By analyzing video or sensor data, these algorithms can accurately identify and count specific movements, such as jumping, weightlifting, and yoga postures, in real time [3]. However, many challenges remain when applying these algorithms to practical physical education teaching and training. Examples include ensuring that the algorithm maintains high accuracy and robustness across different environments and conditions, handling the recognition and counting of complex actions, and adapting the algorithm to individual differences between athletes. In addition, how to effectively integrate the output of the algorithms into teaching and training to improve teaching quality and athletes' learning efficiency is an important research topic. In view of this, the research explores sports action detection and counting algorithms based on pose estimation and studies their application in physical education teaching. By combining computer vision technology and machine learning methods, this study develops an algorithm that can accurately recognize and count multiple sports actions.
The study consists of five parts. The first part introduces physical education teaching, sports training, and action detection and counting algorithms. The second part constructs an action detection model and counting algorithm based on pose estimation. The third part tests and analyzes the performance of the model and algorithm. The fourth part discusses the results and points out the expectations and challenges in practical applications. The final part summarizes the above content.
 
 
2 Related works 
Research on motion recognition has grown with the spread of image algorithms and miniaturized wearable devices. Bingzhu et al. used different feature analyses and combined the feature signals of lying and sitting positions to improve the performance of robots in classifying lower limb movements. A trained motion decoder was obtained through surface electromyography (sEMG) feature extraction and pattern recognition. Control commands were sent to the robot to drive the lower limbs for the corresponding rehabilitation training. The experiments confirmed the effectiveness of control methods based on sEMG signals [4]. Pengyun et al. argued that human motion recognition based on ultra-wideband through-wall radar is limited in terms of samples and viewing angles. Therefore, they proposed a multi-radar cooperative human motion recognition model based on a transfer-learned ResNeXt network and ensemble learning. The experiments confirmed that the ResNeXt network with ensemble learning achieved higher recognition accuracy than fusion models based on single-view radar [5]. A neural network is a mathematical model that mimics the structure and function of biological neural networks. Neural networks have simple decision-making and judgment abilities similar to those of humans and perform well in image and speech recognition. According to their connection patterns, neural networks are divided into Convolutional Neural Networks (CNNs), Generative Adversarial Networks (GANs), recurrent neural networks, and others. With the development of artificial intelligence, the applications of neural networks have become increasingly widespread.
Nemani et al. investigated the application of deep learning to predicting the remaining life of bearings. They determined the bearing fault threshold based on ISO standards and proposed a two-stage Long Short-Term Memory (LSTM) model for extracting the fault feature signals of bearings. Gaussian layers were embedded in the LSTM for parameter optimization. The experiments confirmed that this model predicted bearing life with good accuracy [6]. Chen et al. proposed a novel Xception-LSTM that integrates a spatiotemporal attention mechanism with ConvLSTM to improve the accuracy of fake face detection. ConvLSTM was introduced to account for frame structure information while modeling temporal information. The experimental results confirmed that this algorithm outperformed existing algorithms [7].
Kaadoud et al. used an internal state clustering 
algorithm in LSTM to address the transparency and 
interpretability in machine learning algorithms and 
understand the results from simple clues and rules. They 
studied the hidden states of spatial extraction knowledge 
and established and validated automatic sequences for 
extraction based on basic syntax. These experiments 
confirmed that sequences extracted from the original 
syntax had high recognition rates [8]. Le et al. used 
LSTM to study the channel access of vehicles in wireless 
tram networks, transforming the access control into a 
non-Markov problem. LSTM was integrated into 
Q-learning networks, and a vehicle connecting method 
was put forward using deep recursion in Q-learning 
networks. Simulation experiments confirmed that this 
algorithm had higher stability and efficiency compared to 
the benchmark scheme [9]. Wu et al. combined recurrent 
neural networks and LSTM to construct a system for 
predicting dynamic gestures through joint coordinate 
features. This model achieved the highest accuracy of 
99.31%, indicating the superior recognition performance 
[10]. Jia et al. proposed a YOLOv5-based method for detecting motorcycle riders' helmets. Soft-NMS replaced NMS in the YOLOv5 detector. The method achieved 97.7% mAP, a 92.7% F1 score, and 63 Frames Per Second (FPS), outperforming other state-of-the-art detection methods [11]. Li and Ye constructed a model for predicting wireless network traffic by integrating LSTM and recurrent neural networks. The experiments confirmed that this model had good prediction accuracy and training speed, met the needs of wireless network traffic prediction, and had good application prospects [12]. Dudi and Rajesh proposed a CNN-based plant leaf classification model to identify plant types. They improved classification accuracy and validated the effectiveness of the model by introducing a hybrid whale optimization algorithm based on shark smell optimization [13]. In recent years, the combination of sports and computer-assisted teaching has become a hot research field. McDonough et al. designed a controlled experiment to improve the intervention effect of a school dance sports game education model and to enhance the enjoyment and self-efficacy of urban minority students. Urban minority students reported higher enjoyment in the group exercise mode, which proved to be an effective dance sports game intervention mode [14]. Liu et al.
investigated the effectiveness of the Small Private Online 
Course (SPOC) teaching model and conducted 
experiments using embryology courses as an example. 
These experiments confirmed that SPOC teaching 
improved students' average professional grades and 
enhanced their learning motivation. This indicated that 
the SPOC teaching model was scientifically reasonable 
and could be promoted and applied in medical courses 
[15]. The literature is summarized in Appendix 1.
In summary, current research has further improved 
the effectiveness of action recognition and training 
through image recognition and wearable devices. The 
advancement of action detection technology is achieved 
through models such as neural networks. Meanwhile, the 
application of different educational intervention modes 
improves the effectiveness of physical education teaching. 
However, the current method lacks specificity and 
efficiency in identifying actions in sports testing. As a 
result, current methods cannot fully meet the needs of physical education teaching. To address this issue, a motion pose estimation algorithm based on bilateral filtering and ViBE is developed, and a penalty function is used for scoring. Finally, the algorithm is applied to physical education teaching. The aim is to enhance the effectiveness of sports detection in physical education teaching, improve the efficiency of students' sports learning, correct sports posture, and reduce sports injuries, thus promoting the development of physical education teaching.
3 Construction of sports action 
detection and counting algorithm 
based on pose estimation 
Current research treats sports action detection as an image classification problem. This approach has limitations in recognizing repetitive actions and cannot accurately determine the frame interval of a single action in the video. To deal with this challenge, a 3D skeletal keypoint detection algorithm based on the pose estimation model Visual Background Extractor (ViBE) is proposed. Each repeated action is segmented and scored using a penalty function by analyzing the continuous action information in the time series.
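The idea of dividing repeated actions along the time series can be illustrated with a minimal sketch. It assumes a per-frame body angle has already been computed, and it uses two hypothetical thresholds (`low` and `high`) that are not taken from the paper: a repetition is opened when the angle rises above `low`, and it is kept only if the angle reaches `high` before falling back to `low`.

```python
from typing import List, Tuple


def segment_repetitions(angles: List[float], low: float, high: float) -> List[Tuple[int, int]]:
    """Split a per-frame angle series into (start_frame, end_frame) intervals.

    `low` and `high` are illustrative thresholds, not values from the paper.
    """
    intervals: List[Tuple[int, int]] = []
    start = None          # frame index where the current candidate began
    reached_top = False   # whether the candidate has reached the `high` angle
    for frame, angle in enumerate(angles):
        if start is None:
            if angle > low:
                start = frame
                reached_top = angle >= high
        else:
            if angle >= high:
                reached_top = True
            if angle <= low:
                # The body is back at rest: keep the interval only if it was a full repetition.
                if reached_top:
                    intervals.append((start, frame))
                start = None
                reached_top = False
    return intervals


if __name__ == "__main__":
    series = [5, 20, 50, 80, 50, 20, 5, 30, 40, 5, 20, 85, 10]
    print(segment_repetitions(series, low=10, high=70))
```

In this toy series the middle bump (frames 7 to 9) never reaches the upper threshold, so it is discarded as an incomplete repetition rather than counted.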
3.1 Construction of motion pose estimation 
algorithm based on bilateral filtering and 
ViBE 
There is a significant difference between action recognition and action detection with counting. Action recognition mainly focuses on recognizing specific postures. Action detection and counting must both recognize posture and accurately delimit the beginning and end of each action, which requires higher accuracy. For example, in repetitive movements such as sit-ups, algorithmic inaccuracies may lead to incorrect division of the action process and to repeated counting [16]. In terms of timing, a motion is a continuous process composed of behaviors across multiple image frames. Therefore, whether a motion is qualified must be determined cumulatively over the whole process, and temporal information cannot be ignored. Another challenge is identifying and screening out non-conforming actions within a continuous process. For example, if external force is used to push off the ground during a sit-up, unqualified behavior appears in some of the frames that make up that repetition. Although this behavior may exist in only a few frames, the algorithm should accurately recognize it and filter out the whole repetition. Physical exercises also differ from conventional behaviors such as walking, running, and jumping. During exercises such as sit-ups, the human body lies almost flat on the ground, and the pixel range it occupies in the picture is very small. There is also considerable body overlap and occlusion during the exercise, which poses a great challenge for pose estimation algorithms. The research uses pose estimation technology as the algorithm framework and is designed for sports detection and counting scenarios. Figure 1 shows the overall algorithm process.
[Figure 1 flowchart blocks: Video input; Image compression; Key point extraction; Possible action interval classification; Parameterisation; Human skeletal network rendering; Exercise movement evaluation and counting; Output results and rendered video]
 
Figure 1: Overall algorithm process 
 
The proposed algorithm is based on pose estimation 
technology, which is designed for the detection and 
counting of sports movements. Firstly, the input video is 
preprocessed, including encoding format adjustment, 
image denoising, and scaling, to ensure efficient 
subsequent processing. Then a pose estimation model is 
used to extract the coordinates of key points (such as 
elbows, wrists, etc.). Afterwards, the skeleton keypoint 
coordinates of each frame are used to calculate parameters such as body angle, arm angle, knee Euclidean distance, and the angle change curve. The possible frame range of each repetition is then divided based on the main parameters. A penalty function is used to evaluate the quality of actions and select qualified actions. Here, the penalty function evaluates undesirable or unreasonable behavior and generates corresponding penalty scores. It is applied to each candidate action frame set. The penalty weight is the parameter that adjusts the influence of the penalty function: the higher its value, the more severely undesirable behavior is penalized. The threshold parameter defines when the penalty function is triggered, that is, the penalty is applied when the error exceeds this threshold. Finally, a human skeleton network is generated using the Skinned Multi-Person Linear (SMPL) model to visualize the actions [17-18]. The images must be preprocessed in advance to ensure a uniform input image size and good keypoint detection performance in the 3D model. The study first adjusts the encoding format of the sports videos and then uses bilateral filtering to remove noise and interference from the images [19-22]. Figure 2 is the schematic diagram of bilateral filtering.
 
[Figure 2 diagram: input image and filtered output image]
 
Figure 2: Schematic diagram of bilateral filtering 
 
As illustrated in Figure 2, bilateral filtering is an image filtering technique that considers both spatial proximity and pixel value similarity, so it can preserve edges while smoothing the image. The filter processes each pixel by weighting the pixels within its neighborhood according to spatial distance and pixel value difference. Therefore, bilateral filtering can effectively remove noise without blurring edges and is widely used in fields such as image denoising, texture smoothing, and detail enhancement. The calculation of bilateral filtering is represented by equation (1).
$$I_{filtered}(x, y) = \frac{1}{W_p} \sum_{(i, j) \in S} I(i, j)\, w_{i,j}(x, y) \tag{1}$$
In equation (1), $I(i, j)$ represents the image to be filtered. $(x, y)$ refers to the current pixel position. $S$ denotes the filter window, whose extent is set by the filter size. $W_p$ is the normalization factor. $w_{i,j}(x, y)$ is the weight of pixel $(i, j)$, represented by equation (2).
$$w_{i,j}(x, y) = w_d(x, y)\, w_r(I(i, j), I(x, y)) \tag{2}$$
In equation (2), $w_d(x, y)$ represents the spatial domain weight. $w_r(I(i, j), I(x, y))$ is the pixel domain weight. The definition of $w_d(x, y)$ is represented by equation (3).
$$w_d(x, y) = \exp\left(-\frac{(i - x)^2 + (j - y)^2}{2\sigma_d^2}\right) \tag{3}$$
In equation (3), $\sigma_d^2$ represents the square of the standard deviation of spatial distance. The calculation of $w_r(I(i, j), I(x, y))$ is represented by equation (4).
$$w_r(I(i, j), I(x, y)) = \exp\left(-\frac{(I(i, j) - I(x, y))^2}{2\sigma_r^2}\right) \tag{4}$$
In equation (4), $\sigma_r^2$ represents the square of the standard deviation of the pixel value similarity.
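Equations (1) to (4) can be made concrete with a minimal, unoptimized sketch. It is written in pure Python for readability and makes two assumptions that the text does not specify: the window $S$ is a square of half-width `radius`, and the window is truncated at the image border. The normalization factor $W_p$ is taken as the sum of the weights in the window. The function name and parameters are illustrative only.

```python
import math
from typing import List

Image = List[List[float]]  # grayscale image stored as rows of intensities


def bilateral_filter(image: Image, radius: int, sigma_d: float, sigma_r: float) -> Image:
    """Apply equations (1)-(4) to every pixel of a grayscale image."""
    height, width = len(image), len(image[0])
    result = [[0.0] * width for _ in range(height)]
    for x in range(height):
        for y in range(width):
            weighted_sum = 0.0
            norm = 0.0  # W_p: sum of all weights inside the window
            for i in range(max(0, x - radius), min(height, x + radius + 1)):
                for j in range(max(0, y - radius), min(width, y + radius + 1)):
                    # Spatial domain weight, equation (3).
                    w_d = math.exp(-((i - x) ** 2 + (j - y) ** 2) / (2.0 * sigma_d ** 2))
                    # Pixel domain weight, equation (4).
                    w_r = math.exp(-((image[i][j] - image[x][y]) ** 2) / (2.0 * sigma_r ** 2))
                    weight = w_d * w_r  # equation (2)
                    weighted_sum += image[i][j] * weight
                    norm += weight
            result[x][y] = weighted_sum / norm  # equation (1)
    return result


if __name__ == "__main__":
    step = [[0.0, 0.0, 0.0, 100.0, 100.0, 100.0] for _ in range(4)]
    filtered = bilateral_filter(step, radius=1, sigma_d=1.0, sigma_r=5.0)
    print([round(v, 3) for v in filtered[0]])
```

With a small `sigma_r`, pixels on the other side of the step receive a negligible pixel domain weight, so the edge survives the smoothing. This is the edge-preserving behavior described above.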
Afterwards, the image is scaled and its resolution adjusted using bilinear interpolation, which changes the image size while preserving as much of the original image information as possible [23-25]. The basic idea of bilinear interpolation is to perform linear interpolation in two directions in turn to obtain the pixel values of the new image. During scaling, new pixels are interpolated from the original pixels, which helps maintain image clarity. The two linear interpolations are represented by equation (5).
$$\begin{aligned} f(x, y_1) &\approx \frac{x_2 - x}{x_2 - x_1} f(Q_{11}) + \frac{x - x_1}{x_2 - x_1} f(Q_{21}) \\ f(x, y_2) &\approx \frac{x_2 - x}{x_2 - x_1} f(Q_{12}) + \frac{x - x_1}{x_2 - x_1} f(Q_{22}) \end{aligned} \tag{5}$$
In equation (5), the value of the unknown function $f$ is sought at point $P = (x, y)$. The function $f$ is known at the four points $Q_{11} = (x_1, y_1)$, $Q_{12} = (x_1, y_2)$, $Q_{21} = (x_2, y_1)$, and $Q_{22} = (x_2, y_2)$. Equation (5) interpolates linearly in the $x$ direction. The subsequent linear interpolation in the $y$ direction is represented by equation (6).
$$f(x, y) \approx \frac{1}{(x_2 - x_1)(y_2 - y_1)} \begin{bmatrix} x_2 - x & x - x_1 \end{bmatrix} \begin{bmatrix} f(Q_{11}) & f(Q_{12}) \\ f(Q_{21}) & f(Q_{22}) \end{bmatrix} \begin{bmatrix} y_2 - y \\ y - y_1 \end{bmatrix} \tag{6}$$
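Equations (5) and (6) can be checked numerically with a short sketch. The function names and the argument order are illustrative assumptions. `q11`, `q12`, `q21`, and `q22` stand for the known values $f(Q_{11})$, $f(Q_{12})$, $f(Q_{21})$, and $f(Q_{22})$. The second function interpolates along $y$ first, so the two results can be compared.

```python
def bilinear(x: float, y: float, x1: float, y1: float, x2: float, y2: float,
             q11: float, q12: float, q21: float, q22: float) -> float:
    """Interpolate along x first (equation (5)), then along y (equation (6))."""
    f_x_y1 = (x2 - x) / (x2 - x1) * q11 + (x - x1) / (x2 - x1) * q21
    f_x_y2 = (x2 - x) / (x2 - x1) * q12 + (x - x1) / (x2 - x1) * q22
    return (y2 - y) / (y2 - y1) * f_x_y1 + (y - y1) / (y2 - y1) * f_x_y2


def bilinear_y_first(x: float, y: float, x1: float, y1: float, x2: float, y2: float,
                     q11: float, q12: float, q21: float, q22: float) -> float:
    """Same interpolation with the order of the two directions swapped."""
    f_x1_y = (y2 - y) / (y2 - y1) * q11 + (y - y1) / (y2 - y1) * q12
    f_x2_y = (y2 - y) / (y2 - y1) * q21 + (y - y1) / (y2 - y1) * q22
    return (x2 - x) / (x2 - x1) * f_x1_y + (x - x1) / (x2 - x1) * f_x2_y


if __name__ == "__main__":
    # Corner values sampled from f(x, y) = 2x + 3y + 1 on the cell [0, 2] x [0, 4].
    args = (0.5, 1.0, 0.0, 0.0, 2.0, 4.0, 1.0, 13.0, 5.0, 17.0)
    print(bilinear(*args), bilinear_y_first(*args))
```

Because the sampled function is linear in both coordinates, the interpolated value matches the function value exactly, which makes it a convenient sanity check.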
In equation (6), the order of the two linear interpolations can be interchanged. The positions of the skeletal joints are the key to recognizing body activity. Because the relative connections between joints are fixed in the human body, joint positions reflect the body's movement state well. At present, there are two main types of pose estimation methods: 2D and 3D pose estimation models. The 2D pose estimation model OpenPose requires less computation than the 3D pose estimation model ViBE; for a one-minute sports video, it is about twice as fast. However, the 2D model ignores the depth information of the image throughout the activity. Given the specific body poses in most sports videos, this lack of depth information reduces the accuracy of skeletal keypoint feature extraction when the camera angle or the relative position of camera and body changes. The main reason is that, in most physical education teaching and exercise scenarios, the image acquisition device is placed at the side of the body, so the two sides of the body may occlude each other. Without depth information, the side of the body closer to the camera easily occludes the farther side during skeletal keypoint extraction, which affects the accuracy of pose estimation. Therefore, the study uses human pose estimation models to extract the key points of skeletal joints: the 3D pose estimation model is based on ViBE, and the 2D pose estimation model used on lightweight devices is based on OpenPose. Figure 3 is a schematic diagram of ViBE construction.
[Figure 3 diagram: the 8-neighborhood space of pixel point $x$, with values $V_1(x), \dots, V_8(x)$ surrounding $V(x)$; random probability selection from this neighborhood fills the background model set of pixel $x$, $M(x) = \{V_1, V_2, V_3, \dots, V_{N-1}, V_N\}$]
 
Figure 3: Schematic diagram of ViBE model construction 
 
As shown in Figure 3, ViBE is an algorithm for background extraction in videos. First, the background model is initialized from the first frame of the video: for each pixel, multiple sample values are randomly selected from its neighborhood. When processing subsequent frames, ViBE compares each pixel value with the samples in the background model and classifies the pixel as foreground or background based on similarity. The algorithm continually updates the background model by replacing sample values, which lets it adapt to changes in the scene. The advantages of ViBE lie in its efficiency and adaptability to dynamic scenes, but false positives may occur under extreme changes [26-28]. The ViBE network is trained on the movements used in sports evaluation. Taking sit-ups as an example, 100 sets of sit-up video data are extracted from the HMDB51 dataset for training.
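The background-model steps described above (neighborhood-based initialization, similarity test, random sample replacement) can be sketched as follows. This covers only the background extraction step, not the pose estimation network. The class name and all parameter values (`num_samples`, `radius`, `min_matches`, `subsampling`) are assumptions chosen for illustration, since the paper does not state them.

```python
import random
from typing import List, Tuple

Frame = List[List[int]]  # grayscale frame stored as rows of pixel intensities

# Offsets of the 8-neighborhood around a pixel.
NEIGHBOURS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]


class ViBeSketch:
    """Illustrative ViBE-style background model (parameter values are assumptions)."""

    def __init__(self, first_frame: Frame, num_samples: int = 20, radius: int = 20,
                 min_matches: int = 2, subsampling: int = 16, seed: int = 0) -> None:
        self.rng = random.Random(seed)
        self.height, self.width = len(first_frame), len(first_frame[0])
        self.radius = radius
        self.min_matches = min_matches
        self.subsampling = subsampling
        # M(x) = {V_1, ..., V_N}: samples drawn from random neighbours in the first frame.
        self.samples = []
        for r in range(self.height):
            row = []
            for c in range(self.width):
                model = []
                for _ in range(num_samples):
                    nr, nc = self._random_neighbour(r, c)
                    model.append(first_frame[nr][nc])
                row.append(model)
            self.samples.append(row)

    def _random_neighbour(self, r: int, c: int) -> Tuple[int, int]:
        """Pick a random 8-neighbour, clamped to the frame border."""
        dr, dc = self.rng.choice(NEIGHBOURS)
        return min(max(r + dr, 0), self.height - 1), min(max(c + dc, 0), self.width - 1)

    def apply(self, frame: Frame) -> List[List[bool]]:
        """Return a mask (True = foreground) and update the background model."""
        mask = [[False] * self.width for _ in range(self.height)]
        for r in range(self.height):
            for c in range(self.width):
                value = frame[r][c]
                matches = sum(1 for s in self.samples[r][c] if abs(s - value) < self.radius)
                if matches < self.min_matches:
                    mask[r][c] = True  # too few similar samples: foreground
                    continue
                # Background pixel: occasionally refresh its own model ...
                if self.rng.randrange(self.subsampling) == 0:
                    model = self.samples[r][c]
                    model[self.rng.randrange(len(model))] = value
                # ... and occasionally propagate the value into a neighbour's model.
                if self.rng.randrange(self.subsampling) == 0:
                    nr, nc = self._random_neighbour(r, c)
                    model = self.samples[nr][nc]
                    model[self.rng.randrange(len(model))] = value
        return mask
```

A typical use would build the model from the first frame of a clip and call `apply` on each subsequent frame to obtain a foreground mask for the moving person.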
 
 
 
 
3.2 Construction of detection and counting 
module for sports projects based on skeletal 
key points 
A motion detection and counting module is designed for 
the detection and counting of sports projects after 
obtaining the image coordinates of bone key points. 
Figure 4 shows the algorithm flow of sports action 
detection and counting module. 
 
[Figure 4 flowchart blocks: Extracting key points of the skeleton; Preprocessing and coordinate transformation; Parameterisation; Segmentation of possible action frame intervals; Interval range optimisation; Calculate single-frame image penalty scores; Calculate the action interval penalty value; Reject action intervals that contain offending actions; branch "Action interval traversal not completed"; branch "Action interval traversal complete"; Calculate the result and output]
 
Figure 4: Algorithm flow of sports action detection and counting module 
 
As shown in Figure 4, the motion detection and counting algorithm first captures the coordinates of skeletal key points through the pose estimation network. These coordinates are then transformed and standardized so that images at different scales are processed uniformly. Next, the action-related parameters of each frame are calculated from the keypoint coordinates. The possible intervals of the actions are then divided and optimized to improve their accuracy. Because non-compliant actions may occur, a penalty function is applied to evaluate each action interval: the penalty scores of the frames in an interval are accumulated, and intervals that exceed the set threshold are eliminated. In the end, the action count and action improvement suggestions based on the penalty scores are output.
After the 3D joint point data are obtained, the joint coordinates are stored in a matrix. Taking sit-ups as an example, the movement mainly pivots around the hips, so the focus is on the hip coordinates in each frame. If a frame lacks hip coordinates, it is treated as abnormal data and removed, which reduces jitter during rendering and improves the accuracy of action detection and counting. This yields a cleaned coordinate matrix in the image pixel coordinate system. Because differences in camera parameters and shooting angles may affect parameter calculations, this article converts the coordinate matrix into a normalized matrix based on human joints. Specifically, the coordinate system is re-centered with the hips as the pivot point, so that the algorithm can describe motion without being limited by shooting conditions. This process involves calculating the position and direction of each joint and describing them in a unified coordinate system. The normalization allows data collected from different locations to be compared and analyzed. In addition, this choice of coordinate system makes it easier to classify and recognize different movements, thereby improving the performance of the system.
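The cleaning and normalization step can be sketched as below. Dropping frames without hip coordinates and re-centering on the hip follow the description above. Dividing by the hip-to-neck distance to remove scale is an added assumption, as are the joint names and the function name.

```python
import math
from typing import Dict, List, Optional, Tuple

Point = Tuple[float, float, float]
Pose = Dict[str, Optional[Point]]  # joint name -> 3D coordinate, or None if undetected


def normalise_poses(poses: List[Pose], pivot: str = "hip", reference: str = "neck") -> List[Pose]:
    """Drop frames without a pivot joint and express joints relative to the pivot.

    Scaling by the pivot-to-reference distance is an illustrative assumption.
    """
    normalised: List[Pose] = []
    for pose in poses:
        origin = pose.get(pivot)
        ref = pose.get(reference)
        if origin is None or ref is None:
            continue  # abnormal frame: the pivot (or the scale reference) is missing
        scale = math.dist(origin, ref)
        if scale == 0.0:
            continue  # degenerate detection, cannot be scaled
        normalised.append({
            name: None if point is None
            else tuple((a - o) / scale for a, o in zip(point, origin))
            for name, point in pose.items()
        })
    return normalised


if __name__ == "__main__":
    frames = [
        {"hip": (2.0, 2.0, 0.0), "neck": (2.0, 4.0, 0.0), "knee": (4.0, 2.0, 0.0)},
        {"hip": None, "neck": (0.0, 1.0, 0.0), "knee": None},
    ]
    print(normalise_poses(frames))
```

After this step the hip sits at the origin in every kept frame, so angles and distances computed later do not depend on where the person appears in the image.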
Suitable calculation parameters are selected for different sports movements. Rotation is detected by calculating cosine similarity, represented by equation (7).
$$\theta_i = \cos^{-1}\left(\frac{t_i \cdot y_i}{\lVert t_i \rVert\, \lVert y_i \rVert}\right) \tag{7}$$
In equation (7), $t_i$ and $y_i$ represent the vectors connecting the joint points. Euclidean distance is chosen to detect distance, represented by equation (8).
$$d(t_i, y_i) = \sqrt{\sum_{j=1}^{3} (t_{i,j} - y_{i,j})^2} \tag{8}$$
In equation (8), $t_{i,j}$ and $y_{i,j}$ are the coordinates of the two points along axis $j$, respectively. Figure 5 shows several key joint points and angles in this study.
 
[Figure 5 diagram labels: P0 nose; P1 neck; P2 hip; P3 knee; P4 ankle; P5 wrist; P6 elbow; P7 shoulder]
Figure 5: Joint key points and angles 
 
In Figure 5, $\theta_{body} = \angle p_1 p_2 p_4$, $\theta_{knee} = \angle p_3 p_2 p_4$, $\theta_{ankle} = \angle p_4 p_2 p_4'$, and $\theta_{wrist} = \angle p_5 p_2 p_1$ represent the angles between the torso, knees and ground, ankles and ground, and wrists and upper body during exercise, respectively. The compliance of the movement is determined based on these angles during exercise.
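The angle and distance measures in equations (7) and (8) can be sketched as follows. The helper names are illustrative, and the angle at a joint is computed from the two vectors that leave that joint, which is one straightforward reading of the notation in Figure 5.

```python
import math
from typing import Tuple

Point = Tuple[float, float, float]


def angle_between(t: Point, y: Point) -> float:
    """Angle in degrees between two vectors, following equation (7)."""
    norm_t = math.sqrt(sum(a * a for a in t))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_t == 0.0 or norm_y == 0.0:
        raise ValueError("angle is undefined for a zero-length vector")
    cosine = sum(a * b for a, b in zip(t, y)) / (norm_t * norm_y)
    # Clamp to [-1, 1] to guard against rounding error before acos.
    return math.degrees(math.acos(max(-1.0, min(1.0, cosine))))


def euclidean(t: Point, y: Point) -> float:
    """Euclidean distance between two 3D points, following equation (8)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(t, y)))


def joint_angle(a: Point, vertex: Point, b: Point) -> float:
    """Angle at `vertex` formed by the points a and b, for example neck-hip-ankle."""
    to_a = tuple(p - v for p, v in zip(a, vertex))
    to_b = tuple(p - v for p, v in zip(b, vertex))
    return angle_between(to_a, to_b)


if __name__ == "__main__":
    neck, hip, ankle = (0.0, 1.0, 0.0), (0.0, 0.0, 0.0), (1.0, 0.0, 0.0)
    print(joint_angle(neck, hip, ankle))   # torso upright relative to the legs
    print(euclidean(hip, (3.0, 4.0, 0.0)))
```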
After the key parameters are calculated for each frame, the effective frames in the video are divided into multiple candidate event frame sets. Each frame set contains a portion of the frames pertaining to the target action event. Within action events, it is also necessary to screen and count illegal actions. Taking sit-ups as an example, using other parts of the body to assist in completing the action is a violation, so the loss values and adjustment actions of those parts need to be calculated. A penalty factor is set for each loss value to form a penalty function, with the weighting coefficients set according to each joint's participation in sit-ups. The penalty value is represented by equation (9).
$$S = \sum_{t = s_i}^{e_i} \left( \alpha\, loss_{wrist} + \beta\, loss_{knee} + \gamma\, loss_{ankle} \right) \tag{9}$$
In equation (9), $loss_{wrist}$, $loss_{knee}$, and $loss_{ankle}$ represent the loss values of the wrist, knee, and ankle, respectively. $e_i$ and $s_i$ mean the end frame and start frame of the action, respectively. $\alpha$, $\beta$, and $\gamma$ are penalty factors. The loss values of various body parts are represented by equation (10).
$$loss_j = (\theta_j - threshold)^2 \tag{10}$$
In equation (10), $\theta_j$ represents the current parameter value of each part. $threshold$ refers to the threshold. Table 1 shows the threshold values and penalty factors for various body parts parameters.
 
Table 1: Parameter thresholds and penalty factors for each part
Body part    Penalty threshold    Penalty factor
Hand         35°                  0.5
Knee         15°                  0.2
Ankle        1.24                 0.4
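A minimal sketch of the interval penalty in equations (9) and (10), using the thresholds and penalty factors of Table 1, is given below. Three points are assumptions rather than statements from the paper: the "Hand" row of Table 1 is mapped to the wrist term of equation (9), a part contributes a loss only when its parameter exceeds its threshold, and the rejection limit `max_penalty` is a hypothetical value.

```python
from typing import Dict, List

# Thresholds and penalty factors from Table 1 ("Hand" mapped to the wrist term, an assumption).
THRESHOLDS: Dict[str, float] = {"wrist": 35.0, "knee": 15.0, "ankle": 1.24}
FACTORS: Dict[str, float] = {"wrist": 0.5, "knee": 0.2, "ankle": 0.4}

FrameParams = Dict[str, float]  # per-frame parameter value of each body part


def part_loss(value: float, threshold: float) -> float:
    """Squared deviation as in equation (10), applied only above the threshold (assumption)."""
    return (value - threshold) ** 2 if value > threshold else 0.0


def interval_penalty(frames: List[FrameParams]) -> float:
    """Accumulate the weighted losses over all frames of one action interval, equation (9)."""
    total = 0.0
    for params in frames:
        for part, factor in FACTORS.items():
            total += factor * part_loss(params[part], THRESHOLDS[part])
    return total


def count_valid(intervals: List[List[FrameParams]], max_penalty: float) -> int:
    """Count the intervals whose penalty stays within the hypothetical limit."""
    return sum(1 for frames in intervals if interval_penalty(frames) <= max_penalty)


if __name__ == "__main__":
    good = {"wrist": 20.0, "knee": 10.0, "ankle": 1.0}
    bad = {"wrist": 37.0, "knee": 15.0, "ankle": 1.24}
    print(interval_penalty([good, good]), interval_penalty([bad, bad]))
    print(count_valid([[good, good], [bad, bad]], max_penalty=3.0))
```

Because the penalty is summed over the frames of an interval, a violation that lasts only a few frames can still push the whole repetition over the limit, which matches the requirement that such repetitions be filtered out.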
 
3.3 Lightweight network construction based 
on OpenPose 
The study considers uploading videos to the cloud for processing, where the computing power of servers is used to process multiple action detection tasks in parallel. However, this consumes bandwidth, and cloud services continuously occupy server resources. An end-to-end cloud integration strategy is therefore adopted to cope with short-term, highly concurrent pose estimation tasks. To deal with the high memory consumption and slow computing speed of the framework, the research reduces the model size and accelerates inference, while pre-training and other heavy tasks are placed on the cloud. The research optimizes the algorithms and develops lightweight motion action detection and counting models to fit the limited computing power of edge devices. The pose estimation
algorithm is optimized based on the OpenPose backbone network [29-30]. OpenPose is a bottom-up pose estimation method that first finds all the key points of all people in an image and then matches and connects these points to link the joint points of the same person. In the inference stage, the network outputs two tensors for an input image: the heat maps of the key points and the corresponding connection relationships between key points. These output heat maps are only one eighth of the size of the original image, as shown in Figure 6.
[Figure 6 diagram blocks: Backbone; Initial stage; Refinement stage 1; Keypoint heatmaps; Part affinity fields; Resize feature maps; Keypoints extraction; Keypoints grouping]
 
Figure 6: OpenPose process 
 
As shown in Figure 6, OpenPose is a deep learning-based multi-person pose estimation framework that first uses a CNN to simultaneously predict the confidence maps of body key points and the part affinity fields between body parts from the input image. Next, the confidence maps are processed using Non-Maximum Suppression (NMS) to identify keypoint positions. Finally, bipartite graph matching and a greedy algorithm are applied to pair and associate the obtained key points to construct the human pose. OpenPose can process video streams in real time and recognize and track the poses of multiple people. Its lightweight version, Lightweight OpenPose, is selected as the baseline model. The parameter count of Lightweight OpenPose is only 15% of that of the original model, but its performance is almost the same.
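The non-maximum suppression step on a keypoint confidence map can be sketched as a search for local maxima. This is a simplified illustration, not the Lightweight OpenPose implementation: the function name and the confidence threshold are assumptions, and peak positions are mapped back to image coordinates by multiplying with a stride of 8, reflecting the one-eighth heat map size mentioned earlier, instead of resizing the feature maps first.

```python
from typing import List, Tuple

Heatmap = List[List[float]]  # confidence map for one keypoint type


def extract_peaks(heatmap: Heatmap, threshold: float = 0.1,
                  stride: int = 8) -> List[Tuple[int, int, float]]:
    """Return (x, y, score) for every local maximum above the confidence threshold."""
    rows, cols = len(heatmap), len(heatmap[0])
    peaks: List[Tuple[int, int, float]] = []
    for r in range(rows):
        for c in range(cols):
            score = heatmap[r][c]
            if score <= threshold:
                continue
            is_peak = True
            # Suppress the cell if any of its 8 neighbours is at least as confident.
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    if dr == 0 and dc == 0:
                        continue
                    nr, nc = r + dr, c + dc
                    if 0 <= nr < rows and 0 <= nc < cols and heatmap[nr][nc] >= score:
                        is_peak = False
            if is_peak:
                peaks.append((c * stride, r * stride, score))
    return peaks


if __name__ == "__main__":
    grid = [[0.0] * 5 for _ in range(5)]
    grid[1][2], grid[1][3], grid[3][0] = 0.9, 0.5, 0.8
    print(extract_peaks(grid, threshold=0.3))
```

Each surviving peak is a keypoint candidate. In a multi-person scene, one heat map can yield several candidates, which the grouping step then assigns to individual people.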
The INT8 quantization method is chosen for quantization 
considering the hardware characteristics and resource 
limitations of the devices to enable deep learning models 
to be deployed on low-power devices and quantify them. 
INT8 quantization is to convert model parameters from 
floating-point numbers to 8-bit integers to reduce model 
size and maintain accuracy. 
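The idea behind INT8 quantization can be illustrated with a small sketch. It assumes symmetric per-tensor quantization with a single scale factor, which is only one of several possible schemes and is not necessarily the one used in this study; the helper names `quantize_int8` and `dequantize` and the random weight tensor are hypothetical.

```python
import numpy as np


def quantize_int8(weights):
    """Symmetric per-tensor quantization: float32 -> int8 plus one scale."""
    max_abs = float(np.max(np.abs(weights)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale


def dequantize(q, scale):
    """Recover approximate float32 values from the int8 representation."""
    return q.astype(np.float32) * scale


# Hypothetical weight tensor standing in for one layer of the model.
rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.05, size=(64, 64)).astype(np.float32)

q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

size_ratio = weights.nbytes / q.nbytes  # 4 bytes vs. 1 byte per weight
max_error = float(np.max(np.abs(weights - restored)))  # about scale / 2 at most
```

Storing each weight in one byte instead of four is what shrinks the parameter storage to roughly a quarter, while the rounding error per weight stays within about half a quantization step.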
4 Performance testing of sports action detection and counting algorithms based on pose estimation
The research aims to use the pose estimation algorithm for motion action detection and counting. Therefore, motion scene videos from various venues, scales, and angles were selected, including offline recordings and videos collected from the network. Data augmentation was applied to expand the training samples and improve the model's generalization ability.
4.1 Tests of sports action detection and counting algorithm output accuracy
A test dataset consisting of 40 videos was constructed to verify the performance of the proposed algorithm in motion action detection and counting. Twenty videos of this dataset were from the HMDB51 dataset. These videos were taken from movies and were characterized by complex and variable scenes. In these complex environments, the average detection accuracy on the HMDB51 videos reached 74%. The experiment also included 20 videos recorded by laboratory personnel. The 20 laboratory personnel wore different colored clothing to ensure recognizability in different environments. These videos were recorded under five different environmental conditions, each lasting 10 minutes. It was ensured that nobody other than the tester interfered with the recording, so that the only occlusion came from the tester's own limbs. Figure 7 shows the output results of the action detection and counting algorithm.
Figure 7: Output results of action detection and calculation algorithms (four video frames with overlaid count results 18, 18, 19, and 19)
 
Figure 7 shows the recognition and counting of sit-ups in video frames. The pink mesh in the figure follows the changes in human movement. The count changed from 18 to 19 after one action was completed, which illustrates that the proposed action detection model registers each completed repetition. The testers kept the same clothing in the different testing environments and were not obstructed by other people. Figure 8 shows the average counting accuracy results.
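How a counter advances by one per completed repetition can be sketched with a simple two-state machine. This is an illustrative stand-in and not the penalty-function method of this study: it assumes a hypothetical per-frame trunk angle in degrees (small when sitting up, large when lying down) and hypothetical thresholds of 60 and 120 degrees.

```python
def count_repetitions(angles, up_threshold=60.0, down_threshold=120.0):
    """Count sit-ups from a per-frame trunk angle series (degrees).

    A repetition is counted when the angle first drops below `up_threshold`
    (sitting up) and then rises above `down_threshold` again (lying down).
    The gap between the two thresholds suppresses jitter near one threshold.
    """
    count = 0
    state = "down"
    counts_per_frame = []
    for angle in angles:
        if state == "down" and angle < up_threshold:
            state = "up"
        elif state == "up" and angle > down_threshold:
            state = "down"
            count += 1
        counts_per_frame.append(count)
    return count, counts_per_frame


# Made-up angle series: two full repetitions and one incomplete attempt.
angles = [150, 130, 90, 55, 50, 80, 125, 150, 100, 58, 70, 110, 130, 62, 118]
total, running = count_repetitions(angles)
```

The final attempt in the toy series only reaches 62 degrees, so it is not counted; a rule of this kind is one simple way to reject incomplete movements.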
 
Figure 8: Average count accuracy results (x-axis: environments, namely the playground, the dormitory, the laboratory environment, outdoor space, and the gym; y-axis: average accuracy (%), 0 to 100; series: daylight, evening, night)
 
With the testers wearing fixed-color clothing and no other people obstructing them, the average accuracy of the action detection algorithm varied across locations. Specifically, the detection accuracy on the playground and in the outdoor space was the same, at 96%, which was the highest among all environments. The gym followed closely at 94%, and the laboratory environment reached 92%. The accuracy in the dormitory environment was the lowest, at 88%. The accuracy therefore stayed between 88% and 96% across the five environments, which reflected the robustness and reliability of the algorithm under different environmental conditions. Table 2 shows the average counting accuracy of different testers in multiple scenarios.
 
 
Table 2: Average count accuracy of different testers in multiple scenarios 
Person ID 1 2 3 4 5 6 7 8 9 10 
ACC/% 93 92 92 94 92 93 93 94 92 93 
Person ID 11 12 13 14 15 16 17 18 19 20 
ACC/% 94 92 93 94 92 93 94 92 93 94 
 
These experiments confirmed that the designed sports action recognition and counting algorithm was effective. Changes in the environmental background caused fluctuations in accuracy, and the algorithm achieved higher recognition accuracy in simple and open scenes. In terms of individual performance, the participants differed in height, gender, and clothing. However, these factors did not significantly affect the performance of the algorithm, with accuracy ranging from 92% to 94%. The highest accuracy, 94%, was reached by participants 4, 8, 11, 14, 17, and 20, and the lowest, 92%, by participants 2, 3, 5, 9, 12, 15, and 18. Figure 9 shows the test results of the calculation speed and memory usage.
Figure 9: Performance evaluation results. (a) Software operating CPU usage (x-axis: time, 0:00:00 to 0:05:00; y-axis: CPU share /%; series: User, System). (b) Computation rate for different page sizes (x-axis: time, 0:00:00 to 0:05:00; y-axis: memory usage /GB; series: 224*224, 512*512, 800*800).
 
Figure 9 (a) shows the CPU usage of the proposed model during runtime. The CPU usage of the server increased as the system continued to run, showing an overall upward trend with significant fluctuations. When new processes joined, the CPU usage rose momentarily and then fell back to normal levels. The highest CPU usage was 21.7%, and the lowest occurred at the beginning of the process. The CPU usage stayed at around 20% or below during the script runtime, so the server resource ratio was within a controllable range. The trend of system CPU usage over time was similar to that of server resources, but slightly higher. The highest system CPU usage was 27.6% during the entire runtime, and the lowest occurred at the beginning of the process. The system CPU usage remained below 30% throughout. Figure 9 (b) compares the processing speed and resource consumption of the model for video inputs of different sizes, namely 224*224, 512*512, and 800*800. The processing time changed with the input size, while the processing efficiency of the system remained basically consistent. The model took 24.6 ms to process inputs of size 224*224, 98.4 ms for 512*512, and 274 ms for 800*800. As the input size increased, memory usage gradually increased. Figure 10 shows the results of the ablation test.
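One way to read the statement that efficiency stays basically consistent is to compare the number of pixels processed per millisecond at each size. The short calculation below uses only the three timings reported above; treating the sizes as pixel resolutions and using pixels per millisecond as the efficiency measure are assumptions made for illustration.

```python
# Reported processing times (ms) for each input size.
timings_ms = {(224, 224): 24.6, (512, 512): 98.4, (800, 800): 274.0}

throughput = {
    size: (size[0] * size[1]) / ms  # pixels processed per millisecond
    for size, ms in timings_ms.items()
}
# Ratio between the best and worst per-pixel throughput.
spread = max(throughput.values()) / min(throughput.values())

for size, value in throughput.items():
    print(f"{size[0]}*{size[1]}: {value:.0f} pixels/ms")
print(f"max/min throughput ratio: {spread:.2f}")
```

Under this reading, the pixel count grows by a factor of more than twelve between the smallest and largest input, while the per-pixel throughput changes far less.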
 
 
Figure 10: Results of ablation test (metrics: PCKh@0.5 accuracy (%), MPJPE (mm), and average accuracy (%); model variants: full model, without action detection & counting, without frame interval optimization, remove action detection, remove counting module)
 
In the ablation experiment, the performance of the pose estimation model ViBE was evaluated by removing the action detection and counting modules step by step and by disabling the optimization of the image frame intervals. On the Human3.6M dataset, the unoptimized ViBE achieved a joint accuracy of 94.1% under the PCKh@0.5 metric, slightly lower than the original model's 95.3%. It showed an average error of 49.4 mm under the mean per-joint position error (MPJPE) metric, which was better than the original model's 51.1 mm. On the other hand, the model's best average accuracy in action detection and counting without the frame interval optimization step was 78%, 16 percentage points lower than the 94% obtained when the step was included. This gap points to errors caused by image noise, object occlusion, or keypoint jitter during sit-up movements, and emphasizes the importance of frame interval optimization in improving accuracy.
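For readers unfamiliar with the two pose metrics, the sketch below shows how they are commonly computed. It follows the standard definitions (MPJPE as the mean Euclidean joint error, PCKh@0.5 as the share of joints within half the head segment length) and is not taken from the evaluation code of this study; the toy coordinates and head sizes are made up.

```python
import numpy as np


def mpjpe(pred, gt):
    """Mean per-joint position error: mean Euclidean distance over all joints."""
    return float(np.mean(np.linalg.norm(pred - gt, axis=-1)))


def pckh(pred, gt, head_size, alpha=0.5):
    """Fraction of joints whose error is within alpha * head segment length."""
    dist = np.linalg.norm(pred - gt, axis=-1)  # shape: (frames, joints)
    return float(np.mean(dist <= alpha * head_size[:, None]))


# Toy data: 2 frames, 3 joints, coordinates in millimetres (made-up values).
gt = np.zeros((2, 3, 3))
pred = gt.copy()
pred[0, 0] = [30.0, 40.0, 0.0]   # 50 mm error
pred[1, 2] = [0.0, 0.0, 200.0]   # 200 mm error
head_size = np.array([200.0, 200.0])  # per-frame head segment length (mm)

error_mm = mpjpe(pred, gt)            # (50 + 200) / 6 joints
accuracy = pckh(pred, gt, head_size)  # 5 of 6 joints lie within 100 mm
```

A lower MPJPE and a higher PCKh both indicate better pose estimates, which is why the two numbers can move in different directions when a model variant is changed.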
On-site videos of students from different grades performing different sports actions in actual physical education classes were collected and classified according to lighting conditions to further verify the robustness and reliability of the proposed algorithm. Subsequently, a fault injection testing method was adopted to test the ability of the proposed algorithm to detect violations during the movement, including supporting the body with the hands, lifting the feet off the ground, unilateral body roll-up, and standing up to simulate sit-ups. The results are shown in Table 3.
 
Table 3: Test results of the proposed algorithm in actual physical education teaching
Category                        ACC/%   Time/s
Violation actions
  Hand support to the ground    95.15   2.16
  Feet off the ground           93.32   2.27
  Unilateral body mass          94.30   1.31
  Stand up simulation sit ups   91.28   2.32
Lighting conditions
  Low                           90.29   3.95
  Medium                        94.34   2.44
  High                          96.31   1.29
 
From Table 3, the motion recognition accuracy of the proposed algorithm stayed above 90% under all lighting conditions, with a maximum of 96.31% and a minimum recognition time of only 1.29 seconds. For violations of sports actions, the fastest recognition time was 1.31 seconds and the highest accuracy reached 95.15%. This result verified that the proposed algorithm effectively identified and filtered out rule-violating actions, with good robustness and reliability.
4.2 Test of lightweight sports action detection and counting algorithm output accuracy
The action detection video with the primer pointing 
upwards was captured through a monocular RGB camera 
connected to the Jetson Nano device. All inference and acceleration experiments were conducted on the Jetson Nano 2GB device developed by NVIDIA, using the default training parameters of Lightweight OpenPose and the COCO dataset as the training basis. The training was completed on a server with a 16-core Intel CPU and 64GB of memory running Ubuntu 18.04, with the network input resolution set to 368×368. The model was trained in three stages, and its accuracy was verified at the end of each stage. Figure 11 shows the comparison of Lightweight OpenPose accuracy and computational complexity.
 
Stage                GFLOPs   GFLOPs total   AP/%
Backbone             37.8     37.8           -
Conv4_3              2.5      40.3           -
Conv4_4              0.6      40.9           -
Initial stage        2.2      43.1           35.5
Refinement stage 1   18.6     61.7           43.4
Refinement stage 2   18.6     80.3           46.2
Refinement stage 3   18.6     98.9           47.4
Refinement stage 4   18.6     117.5          48.1
Refinement stage 5   18.6     136.1          48.6

Figure 11: Comparison of OpenPose accuracy and computational complexity
 
The average precision (AP) of Lightweight OpenPose increased from 35.5% in the initial stage to 43.4% when one refinement stage was introduced. However, the AP only rose to 48.6% with five refinement stages, indicating a small additional performance gain. Although this improvement was modest, the computational load more than tripled: the total giga floating-point operations (GFLOPs) increased from 43.1 after the initial stage to 136.1 after the fifth refinement stage. Therefore, this study used only one refinement stage for pose estimation and tested its performance. The purpose is to reduce the use of computing resources while maintaining reasonable accuracy, thereby optimizing the efficiency of the model. This emphasizes the importance of balancing performance and computational efficiency in designing deep learning models. Table 4 shows the selected training accuracy and comparison results for Lightweight OpenPose.
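The diminishing return of additional refinement stages can be made explicit with a short calculation on the values plotted in Figure 11. The pairing of the intermediate AP values with refinement stages 2 to 4 is assumed from the order in which they appear in the figure; only the values for the initial stage, one refinement stage, and five refinement stages are stated in the text.

```python
stages = ["initial", "refine 1", "refine 2", "refine 3", "refine 4", "refine 5"]
ap_percent = [35.5, 43.4, 46.2, 47.4, 48.1, 48.6]        # AP after each stage
gflops_total = [43.1, 61.7, 80.3, 98.9, 117.5, 136.1]    # cumulative GFLOPs

marginal = []
for i in range(1, len(stages)):
    ap_gain = ap_percent[i] - ap_percent[i - 1]
    cost = gflops_total[i] - gflops_total[i - 1]
    # (stage name, AP points gained, AP points gained per extra GFLOP)
    marginal.append((stages[i], round(ap_gain, 1), round(ap_gain / cost, 3)))

for name, gain, per_gflop in marginal:
    print(f"{name}: +{gain} AP points, {per_gflop} AP points per extra GFLOP")
```

Each refinement stage costs the same 18.6 GFLOPs, but the AP gain per stage shrinks steadily, which supports stopping after the first refinement stage.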
Table 4: Training accuracy and comparative testing of Lightweight OpenPose
Step   APval    Epochs   Batch_size
1      0.3964   260      64
2      0.4196   260      64
3      0.4286   260      64

Model                  Average FPS   Best FPS
OpenPose               1.32          2.2
Lightweight OpenPose   9.24          10.23

Model                  File size/MB   ACC/%   FPS/(frame·s^-1)
OpenPose               200            96.13   1.52
Lightweight OpenPose   8.2            95.26   17.04
 
In Table 4, the training accuracy and comparative test results confirmed that the inference speed of Lightweight OpenPose deployed on Jetson Nano was about 9 times faster than that of the original model. The accuracy of Lightweight OpenPose was consistent with official data, with only a 6% decrease compared to OpenPose. Through pruning, modifying convolutional layers, quantizing to Int8, and using TensorRT acceleration, the inference frame rate of the model increased by nearly 15 times while the accuracy decreased by less than 1%. This demonstrated the lightweight nature of the pose estimation model and its efficiency in real-time motion action detection and counting on edge devices. Lightweight OpenPose uses MobileNet V1 in place of the original backbone network. The processing time, recall, and accuracy of the motion action detection model before and after quantization were measured on a local server. Meanwhile, the study also tested the inference time, FPS, and GPU utilization of Lightweight OpenPose without Int8 quantization before and after TensorRT acceleration to demonstrate the effectiveness of inference acceleration, as shown in Figure 12.
 
Figure 12: Inference acceleration test. (a) Quantifying the performance of pre- and post-sport action detection models (TensorFlow vs. TFlite_int8; rate (ms), recall, and accuracy on one axis, FPS (frame·s^-1) on the other). (b) Performance change before and after model application inference acceleration (pre-acceleration vs. post-acceleration; inference time (ms) and GPU utilization (%) on one axis, FPS (frame·s^-1) on the other).
 
In Figure 12, Int8 quantization significantly improved processing speed with an accuracy loss of less than 1% under the same model and input conditions. This study further validated the performance of the unquantized Lightweight OpenPose after applying TensorRT acceleration. The inference time was shortened and the frame rate increased by nearly 3 frames per second, while the GPU utilization remained unchanged. Adding Int8 quantization further shortened the inference time by about 18 milliseconds and increased the frame rate by about 4 frames per second, verifying the effectiveness of Int8 quantization in inference acceleration.
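The relation between a latency reduction in milliseconds and a frame rate gain can be checked with a short calculation. The sketch assumes that frames are processed one at a time, so that FPS is simply 1000 divided by the per-frame inference time; the 80 ms baseline is a hypothetical value chosen for illustration and is not a measurement from this study.

```python
def fps_from_latency(latency_ms):
    """Frames per second when each frame takes `latency_ms` to process."""
    return 1000.0 / latency_ms


baseline_ms = 80.0               # hypothetical inference time before Int8
after_ms = baseline_ms - 18.0    # about 18 ms shorter, as reported above

fps_gain = fps_from_latency(after_ms) - fps_from_latency(baseline_ms)
print(f"{fps_from_latency(baseline_ms):.1f} -> {fps_from_latency(after_ms):.1f} FPS "
      f"(+{fps_gain:.1f})")
```

Because FPS is the reciprocal of latency, the same 18 ms saving yields a larger frame rate gain when the starting latency is lower.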
5 Discussion 
With the rapid development of computer vision applications, the use of this technology in online sports competitions and exercise to improve the quality of physical education teaching has received widespread attention. However, current computer vision-based motion recognition frameworks focus on classifying different actions during the motion, and there is still a lack of research and practical application for further operations such as counting and filtering inappropriate actions. This research explored motion detection frameworks and counting algorithms based on pose estimation and applied them to sports detection and recognition. Firstly, a motion detection framework and counting algorithm based on pose estimation were proposed, which improved the recognition accuracy of motion in sports evaluation scenarios. Secondly, a motion detection system combining end-to-end cloud technology was designed, with university sports testing as the practical application scenario. The results showed that although participants had different heights, genders, and clothing, the accuracy of the designed sports action recognition and counting algorithm varied only between 92% and 94%, demonstrating good effectiveness. Compared with the results obtained on three datasets in reference [7], this method had stronger stability and wider applicability.
This is because the proposed method obtains a normalized matrix based on human joints. Actions can therefore be described more accurately in a unified coordinate system, making classification and recognition of different movements easier. In the training accuracy and comparative tests, the inference speed of the Lightweight OpenPose model increased by about 9 times. After using TensorRT acceleration, the inference frame rate increased by nearly 15 times with an accuracy decrease of less than 1%. Compared with references [5] and [11], the proposed model was smaller and achieved better accuracy and efficiency. The reason is that the study used INT8 quantization to convert the floating-point parameters of the deep learning model into fixed-point numbers. The storage space and computational complexity of the model are thereby reduced, and the accuracy remains stable while the model size is reduced to about one quarter.
Overall, the study successfully implemented a cloud-based training network and deployed lightweight models on edge devices. Real-time inference can therefore be achieved by optimizing the network model and accelerating inference, further improving the efficiency of motion detection and counting tasks. In terms of practical application, the proposed technology can be integrated into existing physical education teaching environments. More accurate motion detection and counting can be achieved based on sports motion data collected by sensors, cameras, and similar devices. Personalized guidance and feedback can thus be provided to students, and movements can be adjusted in time to achieve better results, improving physical education teaching. The proposed lightweight motion detection and counting algorithm improved the frame rate of the model on edge devices to a certain extent, but it still cannot reach the FPS of cloud-based inference. Likewise, although the method saves computational resources, there is still room for improvement in detection performance. Continuous optimization is therefore required to adapt to different sports movements and scenarios when this method is applied to a wider range of physical education teaching. How to improve the recognition of fast or complex movements should also be explored.
6 Conclusion 
Traditional sports action analysis mainly relies on the observation of coaches and the subjective feelings of athletes. This method is not only time-consuming and labor-intensive, but also easily influenced by personal experience and judgment bias. Therefore, a pose estimation-based algorithm was proposed to automatically detect, analyze, and count motion actions. The performance tests confirmed that the detection accuracy on the playground and in the outdoor space was the same, at 96%, which was the highest among all environments. On the Human3.6M dataset, the unoptimized ViBE achieved a joint accuracy of 94.1% under the PCKh@0.5 metric, slightly lower than the original model's 95.3%. It showed an average error of 49.4 mm under the mean per-joint position error (MPJPE) metric, which was better than the original model's 51.1 mm. The best average accuracy in action detection and counting without the frame interval optimization step was 78%, 16 percentage points lower than the 94% obtained when the step was included. The training accuracy and comparative testing of Lightweight OpenPose confirmed that the inference speed of Lightweight OpenPose deployed on Jetson Nano was about 9 times faster than that of the original model. Its accuracy was consistent with official data, with only a 6% decrease compared to OpenPose. After using TensorRT acceleration, the inference frame rate of the model increased by nearly 15 times while the accuracy decreased by less than 1%. For the unquantized model, TensorRT acceleration shortened the inference time and increased the frame rate by nearly 3 frames per second, while the GPU utilization remained unchanged. Adding Int8 quantization further shortened the inference time by about 18 milliseconds and increased the frame rate by about 4 frames per second, verifying the effectiveness of Int8 quantization in inference acceleration. These experiments confirmed that the proposed pose estimation-based motion action detection and counting model had high robustness and reliability under different environmental conditions. However, the research still has some shortcomings. Although the proposed pose estimation motion detection framework is effective in handling repetitive fitness movements, the detection and counting modules are tailored to specific movements and are not sufficiently universal. In the future, more universal modules need to be developed to expand the applicability of the framework.
References 
[1] J. Ge, J. Shi, Z. Zhou, Z. Wang, and Q. Qian, "A 
grasping posture estimation method based on 3D 
detection network," Computers and Electrical 
Engineering, vol. 132, no. 10, pp. 96-108, 2022. 
https://doi.org/10.1016/j.compeleceng.2022.107896. 
[2] N. M. Ghahjaverestan, M. M. Kabir, S. Saha, B. 
Gavrilovic, K. Zhu, B. Taati, H. Alshaer, and A. 
Yadollahi, "Relative tidal volume and respiratory 
airflow estimation using tracheal sound and 
movement during sleep," Journal of Sleep Research, 
vol. 30, no. 4, pp. 79-80, 2021. 
https://doi.org/10.1111/jsr.13279. 
[3] X. Li, S. Liu, Y. Chang, S. Li, Y. Fan, and H. Yu, "A 
human joint torque estimation method for elbow 
exoskeleton control," International Journal of 
Humanoid Robotics, vol. 17, no. 3, pp. 39-56, 2020. 
https://doi.org/10.1142/S0219843619500397. 
[4] B. Wang, C. Ou, N. Xie, L. Wang, T. Yu, G. Fan, and 
J. Chu, "Lower limb motion recognition based on 
surface electromyography signals and its 
experimental verification on a novel multi-posture 
lower limb rehabilitation robot," Computers and 
Electrical Engineering, vol. 101, pp. 110-129, 2022. 
https://doi.org/10.1016/j.compeleceng.2022.108067. 
[5] P. Chen, S. Guo, H. Li, X. Wang, G. Cui, C. Jiang, 
and L. Kong, "Through-wall human motion 
recognition based on transfer learning and ensemble 
learning," IEEE Geoscience and Remote Sensing 
Letters, vol. 19, pp. 191-196, 2022. 
https://doi.org/10.1109/LGRS.2021.3070374. 
[6] V. P. Nemani, H. Lu, A. Thelen, C. Hu, and A. T. 
Zimmerman, "Ensembles of probabilistic LSTM 
predictors and correctors for bearing prognostics 
using industrial standards," Neurocomputing, vol. 
491, no. 6, pp. 575-596, 2022. 
https://doi.org/10.1016/j.neucom.2021.12.035. 
[7] B. Chen, T. Li, and W. Ding, "Detecting deepfake 
videos based on spatiotemporal attention and 
convolutional LSTM," Information Sciences, vol. 60, 
no. 1, pp. 58-70, 2022. 
https://doi.org/10.1016/j.ins.2021.12.062. 
[8] I. C. Kaadoud, N. P. Rougier, and F. Alexandre, 
"Knowledge extraction from the learning of 
sequences in a long short-term memory (LSTM) 
architecture," Knowledge-Based Systems, vol. 235, 
pp. 657-675, 2022. 
https://doi.org/10.1016/j.knosys.2021.107657. 
[9] T. D. Le and G. Kaddoum, "LSTM-based channel 
access scheme for vehicles in cognitive vehicular 
networks with multi-agent settings," IEEE 
Transactions on Vehicular Technology, vol. 70, no. 
9, pp. 9132-9143, 2021. 
https://doi.org/10.1109/TVT.2021.3100591. 
[10] B. Wu, J. Zhong, and C. Yang, "A visual-based 
gesture prediction framework applied in social 
robots," IEEE/CAA Journal of Automatica Sinica, 
vol. 9, no. 3, pp. 510-519, 2022. 
https://doi.org/10.1109/JAS.2021.1004243. 
[11] W. Jia, S. Xu, Z. Liang, Y. Zhao, H. Min, S. Li, and 
Y. Yu, "Real-time automatic helmet detection of 
motorcyclists in urban traffic using improved 
YOLOv5 detector," IET Image Processing, vol. 15, 
no. 14, pp. 3623-3637, 2021. 
https://doi.org/10.1049/ipr2.12295. 
[12] L. Li and T. Ye, "Research on throughput prediction 
of 5G network based on LSTM," Intelligent and 
Converged Networks, vol. 3, no. 2, pp. 217-227, 
2022. https://doi.org/10.23919/ICN.2022.0006. 
[13] B. Dudi and V. Rajesh, "Optimized threshold-based 
convolutional neural network for plant leaf 
classification: a challenge towards untrained data," 
Journal of Combinatorial Optimization, vol. 43, no. 
2, pp. 312-349, 2022. 
https://doi.org/10.1007/s10878-021-00770-w. 
[14] D. Mcdonough, W. Liu, X. Su, and Z. Gao, 
"Small-groups versus full-class exergaming on 
urban minority adolescents' physical activity, 
enjoyment, and self-efficacy," Journal of Physical 
Activity and Health, vol. 18, no. 2, pp. 192-198, 
2021. https://doi.org/10.1123/jpah.2020-0348. 
[15] S. Liu, Y. Guo, H. Liu, A. Hao, X. Zhang, and H. 
Liu, "Blended learning model via small private 
online course improves active learning and 
academic performance of embryology," Clinical 
Anatomy, vol. 35, no. 2, pp. 211-221, 2022. 
https://doi.org/10.1002/ca.23818. 
[16] M. Madadi, H. Bertiche, and S. Escalera, "SMPLR: 
Deep learning-based SMPL reverse for 3D human 
pose and shape recovery," Pattern Recognition, vol. 
106, no. 7, pp. 72-78, 2020. 
https://doi.org/10.1016/j.patcog.2020.107472. 
[17] M. Hasanvand, M. Nooshyar, E. Moharamkhani, and 
A. Selyari, "Machine learning methodology for 
identifying vehicles using image processing," Artificial Intelligence and Applications,
vol. 1, no. 3, pp. 170-178, 2023. 
https://doi.org/10.47852/bonviewAIA3202833. 
[18] P. Preethi and H. R. Mamatha, "Region-based 
convolutional neural network for segmenting text in 
epigraphical images," Artificial Intelligence and 
Applications, vol. 1, no. 2, pp. 119-127, 2023. 
https://doi.org/10.47852/bonviewAIA2202293. 
[19] B. Liu, B. Li, J. Cao, W. Wang, and X. Liu, 
"Adaptive and propagated mesh filtering," 
Computer-Aided Design, vol. 154, no. 10, pp. 22-34, 
2023. https://doi.org/10.1016/j.cad.2022.103422. 
[20] Riya, B. Gupta, and S. S. Lamba, "Structure-aware 
adaptive bilateral texture filtering," Digital Signal 
Processing, vol. 123, no. 3, pp. 86-99, 2022. 
https://doi.org/10.1016/j.dsp.2022.103386. 
[21] I. Gonzalez-Perez, P. L. Guirao-Saura, and A. 
Fuentes-Aznar, "Application of the bilateral filter 
for the reconstruction of spiral bevel gear tooth 
surfaces from point clouds," Journal of Mechanical 
Design, vol. 143, no. 5, pp. 24-34, 2021. 
https://doi.org/10.1115/1.4048219. 
[22] C. Karam, K. Sugimoto, and K. Hirakawa, 
"Color-compressive bilateral filter and nonlocal 
means for high-dimensional images," Journal of 
Electronic Imaging, vol. 30, no. 2, pp. 23-44, 2021. 
https://doi.org/10.1117/1.JEI.30.2.023001. 
[23] M. Redmann and I. P. Duff, "Model order reduction 
for bilinear systems with non-zero initial states - 
different approaches with error bounds," 
International Journal of Control, vol. 96, no. 4/6, pp. 
1491-1504, 2023. 
https://doi.org/10.1080/00207179.2022.2053209. 
[24] X. Zhao, C. Huang, X. Yu, S. Zou, and F. Qing, "An
arbitrary Lagrangian-Eulerian discontinuous 
Galerkin method for two-dimensional compressible 
flows on adaptive quadrilateral meshes," 
International Journal for Numerical Methods in 
Fluids, vol. 95, no. 5, pp. 796-819, 2023. 
https://doi.org/10.1002/fld.5172. 
[25] H. Wang, G. Xu, X. Pan, Z. Liu, N. Tang, R. Lan, 
and X. Luo, "Attention-inception-based U-Net for 
retinal vessel segmentation with advanced residual," 
Computers & Electrical Engineering, vol. 98, no. 3, 
pp. 92-110, 2022. 
https://doi.org/10.1016/j.compeleceng.2021.107670. 
[26] A. M. Vukicevic, I. Macuzic, N. Mijailovic, A. 
Peulic, and M. Radović, "Assessment of the 
handcart pushing and pulling safety by using Deep 
Learning 3D pose estimation and IoT force sensors," 
Expert Systems with Applications, vol. 183, no. 1, pp.
53-71, 2021. 
https://doi.org/10.1016/j.eswa.2021.115371. 
[27] B. Li, Z. Xu, J. Zhang, X. Wang, and X. Fan, 
"Background modeling based on statistical 
clustering partitioning," Mathematical Problems in 
Engineering, vol. 2021, no. 2, pp. 1-28, 2021. 
https://doi.org/10.1155/2021/2346438. 
[28] J. Coll-Font, O. Afacan, J. S. Chow, R. S. Lee, S. K. 
Warfield, and S. Kurugol, "Modeling dynamic 
radial contrast enhanced MRI with linear time 
invariant systems for motion correction in 
quantitative assessment of kidney function," 
Medical Image Analysis, vol. 67, no. 4, pp. 33-45, 
2021. https://doi.org/10.1016/j.media.2020.101880. 
[29] Y. Wang, "A study on the recognition of typical 
movement characteristics of ethic folk dances based 
on movement data," Informatica, vol. 48, no. 5, 
2024. https://doi.org/10.31449/inf.v48i5.540. 
[30] H. Jiang and S. B. Tsai, "An empirical study on 
sports combination training action recognition based 
on SMO algorithm optimization model and artificial 
intelligence," Mathematical Problems in 
Engineering, vol. 2021, no. 31, pp. 83-94, 2021. 
https://doi.org/10.1155/2021/7217383. 
Appendix
Appendix 1 Literature summary
Reference | Main content | Results
[4] | Feature extraction and pattern recognition through sEMG | Good control effect
[5] | ResNeXt network based on set learning | Higher recognition accuracy than fusion models based on single view radar
[6] | A two-stage LSTM model | The accuracy of bearing life prediction exceeded 95%
[7] | New Xception-LSTM algorithm | Excellent algorithm performance on three common datasets
[8] | Using internal state clustering algorithm to improve LSTM extraction algorithm | Extracting sequences from the original syntax had a higher recognition rate
[9] | Q-learning network vehicle connection algorithm based on deep recursion | Higher stability and efficiency
[10] | Combining recurrent neural networks and LSTM networks | Achieved the highest accuracy of 99.31%
[11] | Using Soft-NMS instead of NMS to fuse YOLOv5 detector | 97.7% mAP, 92.7% F1 score, and 63 frames per second
[12] | A model for predicting wireless traffic in a predictive network that integrates LSTM algorithm and recurrent neural network | Good prediction accuracy and algorithm training speed
[13] | Introducing a hybrid whale optimization algorithm based on shark odor to optimize plant leaf classification models | Improved classification accuracy
[14] | Designing a control experiment to improve the intervention effect of school dance sports game education mode | Proved the effectiveness of group exercise mode
[15] | Using embryology course as an example to study the effectiveness of SPOC teaching model | Improved the average professional grades of students and enhanced their learning enthusiasm