https://doi.org/10.31449/inf.v48i8.5943  Informatica 48 (2024) 1–18

Improved C3D Network Model Construction and its Posture Recognition Study in Swimming Sports

Xiaozhi Peng 1, Yang Li 2*
1 Department of Public Course Teaching, Ningbo Polytechnic, Ningbo 315800, China
2 Department of Physical Education, Hanyang University, Seoul 04763, Korea
Email of corresponding author: lee972023@163.com

Keywords: C3D networks, swimming action recognition, global average pooling, residual networks, deep learning

Received: March 21, 2024

To solve the problem of low recognition accuracy caused by the large number of parameters of the C3D network, the study proposes a pose recognition model based on an improved C3D network. The C3D network is improved by replacing the fully connected layer with global average pooling; an attention residual network is then designed on the basis of the improved C3D, and an attention staged residual network model is constructed by introducing a spatio-temporal channel attention mechanism. Comparative validation showed that the improved C3D network increased accuracy by 13.49% over the C3D network on the HMDB51 dataset. In the model comparison, the proposed model, whose area under the receiver operating characteristic curve reached 0.98, improved accuracy over two well-known networks by an average of 14.34%. For the recognition of the postures of all swimming categories in the homemade swimming sports dataset, the proposed model improved accuracy over the popular network by an average of 14.50%. The findings show that the number of parameters of the improved C3D network proposed in the study has been successfully reduced, and that the attention residual network model built on it has superior application value in sports pose recognition, offering advantages in fine-grainedness and recognition accuracy.

Povzetek: The article presents an improved C3D network model for posture recognition in swimming sports. The improvements include replacing the fully connected layers with global average pooling and using an attention residual network, which reduces the number of parameters and increases the accuracy of the model. Experimental results show that the improved C3D network and the ASRNM model achieve high accuracy and robustness compared with existing methods.

1 Introduction

The application of human posture recognition (HPR) technology is expanding across diverse sectors and settings, particularly in sports, owing to the swift advancement of computer vision technology and associated hardware [1]. Conducting research on HPR based on visual assistive technology is therefore of considerable practical value. Current research on gesture recognition mainly uses deep learning (DL) algorithms, including 2-dimensional (2D) convolution-based dual-stream networks, 3-dimensional (3D) convolution-based neural networks, and recurrent convolutional networks [2-3]. Among them, 3D convolutional networks can extract spatio-temporal (ST) features directly without feature fusion, so researchers domestically and internationally widely use them for HPR technology design [4]. Convolutional 3-dimensional (C3D) networks, the most commonly used 3D networks, add a time dimension compared with 2D convolution [5-6].
However, the excessive number of parameters and the relatively simple network structure lead to poor recognition accuracy, which makes it difficult to meet the demand for high accuracy in HPR in current sports. Therefore, the study designs a novel network structure based on the C3D network. An attention staged residual network model (ASRNM) is constructed on top of an improved C3D network: the fully connected layer (FCL) is first replaced with global average pooling (GAP) and the activation function is replaced with the Gaussian error linear unit (GELU), and the resulting model is then experimentally verified on HPR for swimming motion. The study is divided into four main sections. The first section summarizes domestic and foreign research on vision-based HPR and its drawbacks. The second part studies and designs the swimming posture (S-Pos) recognition model based on the improved C3D network. The third part experiments with and analyzes the proposed improved C3D network and S-Pos recognition model. The fourth part summarizes the experimental results and indicates future research directions.

As an important computer vision technology, HPR provides rich information about body movement by recognizing and analyzing human posture. Researchers domestically and internationally have made notable advancements in HPR technology, which is currently used extensively in industries such as human-computer interaction, intelligent robotics, virtual reality, and medical diagnosis [7]. To realize higher-precision 3D human posture reconstruction, Verma and Rajeev proposed a deep architecture model combining a traditional 2D network and a 3D network. They also developed a stack-hourglass network for 2D keypoint heat map prediction, which performed comparably to state-of-the-art techniques on the MPII and Human3.6M datasets [8]. To address the complexity of convolutional neural network structures and the issue of deep features providing only global information, Sahoo et al. suggested a two-stage residual convolutional network for learning features from color gesture photos. Using a multi-class support vector machine classifier with a linear kernel for gesture pose detection avoided the need for a specific preprocessing step [9]. In an effort to lessen worker load and increase motion detection accuracy for construction workers, Chen et al. suggested an inherited sensor fusion method for danger prevention. A multi-sensor construction site motion identification system was further created using a selective depth detection method based on ordinary depth optimization. The accuracy and effectiveness of body motion detection specific to construction sites was enhanced by merging various signal types to rectify and evaluate worker motion [10]. The ability to teleoperate robots can be enhanced by recognizing and reproducing human-like behaviors; however, center-of-mass dynamic balancing remains difficult to achieve. To address the issue of variable time-series length, Balmik et al. developed a robot-oriented adaptive balancing technique that computes the robot joint angles using pitch and roll control algorithms and uses a proposed 7-layer one-dimensional convolutional neural network to recognize human actions [11].
As an optimization of the 3D convolutional network, the C3D network brings new ideas to HPR research, and experts and scholars apply it widely in HPR, which promotes the value of HPR technology in real life to a certain extent. For 2D skeleton data, Weng et al. suggested a new 3D graph convolutional network model with an ST attention mechanism. The C3D network successfully extracted the ST aspects of the skeleton descriptors, which included joint coordinates, frame differences, and angles, enabling the precise identification and categorization of persons crossing the street [12]. Current skeleton-based recognition techniques require many labeled datasets and substantial labor, as they mostly learn representations based on human-created criteria. To address this, Yu et al. presented an adaptive skeleton-based neural network that uses a data-driven methodology to automatically learn the best ST representation. By encasing a C3D network in a unique attention model, this method effectively allowed memory blocks to learn long-term associations and short-term frame dependencies [13]. A human aberrant behavior recognition system based on dual-channel C3D and DL was developed by Jiang et al. to tightly regulate construction order and work efficiency and to respond quickly to emergencies at infrastructure sites. An improved model was used to integrate this system with a convolutional neural network, yielding validation identification rates of 98.01% for particular angles, 97.27% for horizontal angles, and 95.68% for vertical angles [14]. For the challenging problem of recognizing complicated student behavior in videos, Jisi and Yin suggested a new feature fusion network for student behavior detection in education. The method combined a spatial affine transform network and a C3D network with a weighted sum method for ST feature fusion, which resulted in superior recognition accuracy over other state-of-the-art algorithms on a wide range of datasets [15]. The above related work is summarized in Table 1.
Table 1: Summary of related work

Methodology | Data sets | Results | Reference
Reconstruction of 3D poses based on early and late fusion strategies with an enhanced stack-hourglass network | MPII and Human3.6M datasets | Performance comparable to state-of-the-art methods | [8]
Reducing the number of CNN layers and fusing global and local information from different layers | HaGRID | Overcomes the need for a specific pre-processing step | [9]
Image optimization using selective depth detection and a sensor-based construction site motion recognition system | Customized dataset (motion data of 5 adult males aged 20-30 years) | Improved accuracy and efficiency of detecting construction site-specific body movements | [10]
NAO adaptive balancing technique based on a 7-layer 1D-CNN model and neural networks | NAO behavior recognition dataset | 95% recognition accuracy compared with Hidden Markov models | [11]
A novel 3D graph convolutional network model with a spatio-temporal attention mechanism | Homemade dataset (ZCP's crosswalk pedestrian dataset); NTU RGB+D dataset | Outperforms 2D-CNN in recognition results | [12]
Neural network based on an adaptive skeleton that automatically learns the optimal spatio-temporal representation through a data-driven approach | MSR-Action-3D, SBU Kinect Interaction, NTU RGB-D, NW-UCLA, and UWA3D datasets | State-of-the-art performance on five challenging benchmarks | [13]
DL and dual-channel C3D based human abnormal behavior recognition system | Self-made dataset | Abnormality recognition rate over 95% | [14]
Fusion of spatio-temporal features through a combination of spatial affine transform networks and C3D networks with a weighted sum approach | HMDB51 dataset; UCF101 dataset; real student behavior data | Student behavior recognition effectively improved and superior to other algorithms | [15]

Combined with Table 1, it is evident that scholars both domestically and internationally have conducted extensive research on DL-based human gesture recognition technology. However, as the number of video frames and image pixels continues to increase, current human gesture recognition requires more advanced image features. Meanwhile, the conventional C3D network has a large number of parameters, hindering the effective extraction of deep features from large datasets. Therefore, this study proposes a deep learning recognition model for sports gesture recognition based on an enhanced C3D network. To increase the recognition accuracy of the network model under the enormous number of parameters, the study replaces the FCL with GAP and further improves the C3D network by replacing the activation function.

2 Swimming posture recognition model construction based on improved C3D network

In order to improve the accuracy of HPR in swimming movement, the study proposes an improved C3D network and further designs an S-Pos recognition model based on it. Firstly, the FCL and the activation function are replaced on the basis of the C3D network. Secondly, the improved convolutional network is extended into a fully pre-activated residual structure network, and an ST channel attention mechanism is introduced to construct the S-Pos recognition model.
2.1 Improvement of 3D convolution-based C3D network

With the rapid development of swimming sports worldwide, swimming is loved and welcomed by more and more people. How to accurately identify and evaluate swimming movements has become a research hotspot in sports monitoring related to the instruction and training of swimming. DL algorithms have become a commonly used method in S-Pos recognition research. The C3D network in DL can extract ST features directly, which effectively circumvents the defect of the dual-stream network, which consumes a large amount of resources to extract temporal features separately [16-17]. However, the huge number of parameters makes it difficult for the convolutional network to extract ST features completely, and the effectiveness of feature extraction is limited by the small number of convolutional layers. Therefore, to address the low accuracy of the C3D network for HPR, a novel C3D network is proposed in the study.

The C3D network extracts ST features more efficiently than 2D convolution. Traditional 2D convolution processes video frames while ignoring the relationship between frames in a sequence, whereas the 3D convolutional feature map (FM) contains not only the information between pixels within a single video frame but also the correlation between the motion data of consecutive frames [18-20]. A comparison of the two convolution operations is shown in Figure 1.

Figure 1: Comparison of the two convolutional network operations ((a) 2D convolution operation; (b) 3D convolution operation)

The C3D network, as a classical 3D convolution network, preserves both the temporal and spatial information of the video action when performing the convolution operation. Its main convolution formula is shown in equation (1).

$v_{ij}^{abc} = f\left( \sum_{n} \sum_{x=0}^{X_i-1} \sum_{y=0}^{Y_i-1} \sum_{z=0}^{Z_i-1} \omega_{ijn}^{xyz}\, v_{(i-1)n}^{(a+x)(b+y)(c+z)} + b_{ij} \right)$ (1)

In equation (1), $v_{ij}^{abc}$ denotes the convolution result of the $j$-th convolution kernel (CK) of layer $i$ at position $(a,b,c)$; $a$, $b$, and $c$ denote the spatial 3D coordinates, and $f(\cdot)$ denotes the convolution function. $X_i$ denotes the width of the CK in layer $i$, $Y_i$ its height, and $Z_i$ its depth. $\omega_{ijn}^{xyz}$ denotes the weight of the convolution of this layer with the $n$-th FM of the previous layer at position $(x,y,z)$, $v_{(i-1)n}^{(a+x)(b+y)(c+z)}$ denotes the corresponding input value of the previous layer, and $b_{ij}$ denotes the bias.

The structure of the C3D network is relatively simple, consisting mainly of FCLs, 3D convolutions, and max pooling. Since a time dimension is added to the 3D convolution, it requires a larger number of parameters than 2D convolutional layers (CLs). Stacking multiple 3D CLs therefore drives the total number of parameters of the network up correspondingly. At the same time, the training speed of the network depends on the distribution of the data transmitted through the CLs, but in the C3D network the CLs perform no data normalization, so the traditional C3D network is not as effective for recognition in HPR [21-22].
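As an illustration of the temporal dimension that equation (1) adds, the following minimal sketch (assuming PyTorch, which the paper does not name) contrasts a 2D and a 3D convolution on a 16-frame clip: the 3D kernel slides across neighbouring frames, so the output keeps a temporal axis and encodes motion.

```python
import torch
import torch.nn as nn

# A 16-frame RGB clip: (batch, channels, frames, height, width)
clip = torch.randn(1, 3, 16, 112, 112)

# 2D convolution processes one frame at a time and cannot correlate neighbouring frames.
conv2d = nn.Conv2d(3, 64, kernel_size=3, padding=1)
frame_features = conv2d(clip[:, :, 0])        # -> (1, 64, 112, 112), a single frame

# 3D convolution slides a 3x3x3 kernel over frames as well (cf. equation (1)),
# so the output retains the temporal axis and captures inter-frame motion.
conv3d = nn.Conv3d(3, 64, kernel_size=(3, 3, 3), padding=1)
clip_features = conv3d(clip)                  # -> (1, 64, 16, 112, 112)

print(frame_features.shape, clip_features.shape)
```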
Figure 2 depicts the precise structure of the C3D network.

Figure 2: C3D network structure diagram (input 3@16@112×112; Conv1a 64 → Conv2a 128 → Conv3a/3b 256 → Conv4a/4b 512 → Conv5a/5b 512, all with 3×3×3 kernels and interleaved 3D max pooling, followed by FC6 4096 → FC7 4096 → Softmax)

In Figure 2, the stochastic gradient descent technique optimizes the training of the entire network, and an FCL is used at the end of the network. For the classifier to classify the data, the extracted features must be mapped to the label space by the FCL. The FCL carries out feature purification, so the length of the one-dimensional feature vector input to the FCL is a multiple of the number of neurons. Because every node in the FCL is connected to every node in the preceding layer, the C3D network has an excessive number of parameters and is not suitable for porting to embedded devices. Therefore, the study proposes to replace the FCL with GAP, which reduces the parameter computation by synthesizing the feature information of the FM through a weighted average. Figure 3 displays the FCL schematic.

Figure 3: Fully connected layer diagram (n FMs of size [w,h] are flattened to [1, n×w×h], passed through two FCLs of size [1, 4096], and mapped to [1, Classes])

GAP itself requires no trainable parameters and works by purifying the output features extracted from the CL [23]. Firstly, the images within each feature channel are sampled equally so that each channel yields an output feature image of size 1×1×1. Secondly, the output feature images are transferred to the classification connection layer according to their corresponding feature channel. By collecting the spatial information of the feature image through average sampling, the FM produced by GAP largely retains the spatial information of the feature image. In addition, transmitting the feature image to the classification layer according to the corresponding channel after GAP processing effectively strengthens the mapping between the feature image and the classification and makes it easier to interpret the FM as a category confidence map. The GAP schematic is shown in Figure 4.

Figure 4: Schematic diagram of GAP (n FMs of size [w,h] are pooled to [1, n×1×1] and mapped to [1, Classes])

Obviously, replacing the FCL alone is not enough to improve the C3D network, so the study further introduces a 3D point convolution layer and batch normalization (BN) to enhance the network's ability to combine features. On this basis, all activation functions in the network are replaced with GELU functions. The 3D CL is responsible for ST feature extraction in the C3D network; it has an additional temporal dimension compared with 2D convolution, and different features are extracted depending on the parameters of the convolution kernel. Each CK therefore corresponds to an output FM determined by the input FM and the CK, calculated as shown in equation (2).

$X_{out} = (X_{in} + 2P - F)/S + 1, \quad Y_{out} = (Y_{in} + 2P - F)/S + 1, \quad Z_{out} = J$ (2)

In equation (2), $X_{out}$, $Y_{out}$, and $Z_{out}$ denote the width, height, and depth of the output FM, respectively. $X_{in}$ and $Y_{in}$ denote the width and height of the input FM, while $P$ denotes the pixel padding value at the FM edge. $F$, $S$, and $J$ denote the CK size, the stride, and the number of CKs, respectively.
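The parameter saving from the GAP replacement can be checked with a small sketch (an illustrative assumption in PyTorch, not the authors' released code; the Conv5 output size is hypothetical): the flatten-plus-FC head of C3D carries tens of millions of weights, while the GAP head adds only the final classification layer.

```python
import torch
import torch.nn as nn

num_classes = 51
features = torch.randn(1, 512, 2, 4, 4)   # hypothetical Conv5 output: 512 channels, 2x4x4

# Original C3D-style head: flatten -> FC6(4096) -> FC7(4096) -> classifier
fc_head = nn.Sequential(
    nn.Flatten(),
    nn.Linear(512 * 2 * 4 * 4, 4096), nn.ReLU(),
    nn.Linear(4096, 4096), nn.ReLU(),
    nn.Linear(4096, num_classes),
)

# GAP head: average each channel to 1x1x1, then a single classification layer
gap_head = nn.Sequential(
    nn.AdaptiveAvgPool3d(1),
    nn.Flatten(),
    nn.Linear(512, num_classes),
)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(fc_head), count(gap_head))                     # roughly 84M vs 26K parameters
print(fc_head(features).shape, gap_head(features).shape)   # both (1, 51)
```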
During the convolution operation on an image, the FMs are locally connected in the spatial dimension, fully connected in depth, and the weights of neurons at the same depth are shared [24-25]. Therefore, in order to make the C3D network structurally lighter, the study constructs asymmetric 3D CLs by merging and splitting convolution kernels. The merging and splitting of CKs is illustrated in Figure 5.

Figure 5: Merging and splitting of convolution kernels ((a) merging of convolution kernels; (b) asymmetric splitting of convolution kernels)

The study enlarges the convolution kernels of all CLs other than the first in the original C3D network from 3×3×3, combining three such CLs into a CL with a 3×7×7 kernel. The feature extraction capability of the CL is improved by enlarging the region of the input that affects a given unit of the network. Finally, the CL with the 3×7×7 kernel is asymmetrically split into two asymmetric 3D CLs with kernels of 3×1×7 and 3×7×1. The weight parameters of the CL are thereby decreased, which enhances the use of the image's spatial information and reduces overfitting during training. Based on the resulting non-stacked 3D CLs, the study further introduces 3D point CLs for cross-channel information fusion and transfers the fused feature information to the next group of asymmetric 3D CLs.

Considering that the parameters change continuously during training and that a change of data distribution in one layer affects the distribution in subsequent layers, the study uses BN to process the input data of the 3D CLs. BN is treated as a network layer with trainable parameters placed between CLs; when the network trains on mini-batches of data, BN normalizes each mini-batch to variance 1 and mean 0. The normalized input data is expressed in equation (3).

$\hat{x}_i = \dfrac{x_i - u_B}{\sqrt{\sigma_B^2 + \varepsilon}}$ (3)

In equation (3), $x_i$ denotes the input data and $B$ denotes the mini-batch of $m$ data entries. $\sigma_B^2$ denotes the variance, $\varepsilon$ denotes a tiny constant that keeps the denominator from being zero, and $u_B$ denotes the mean. Equation (4) gives the formula for the mean of the image feature data.

$u_B = \dfrac{1}{m}\sum_{i=1}^{m} x_i$ (4)

The formula for the variance of the image feature data is shown in equation (5).

$\sigma_B^2 = \dfrac{1}{m}\sum_{i=1}^{m}(x_i - u_B)^2$ (5)

The output normalized result after BN processing is expressed in equation (6).

$y_i = \gamma \hat{x}_i + \beta$ (6)

In equation (6), $y_i$ denotes the output of $x_i$ after BN processing, $\gamma$ denotes the learnable scaling parameter, and $\beta$ denotes the learnable translation parameter.
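A hedged sketch of one such block (layer names, ordering, and channel counts are illustrative assumptions, not the authors' code) shows how the 3×7×7 receptive field is factored into 3×1×7 and 3×7×1 asymmetric convolutions followed by a 1×1×1 point convolution, with BN after each layer and the GELU activation the study adopts below.

```python
import torch
import torch.nn as nn

class AsymmetricConv3DBlock(nn.Module):
    """Illustrative block: 3x7x7 receptive field factored into 3x1x7 + 3x7x1,
    then a 1x1x1 point convolution for cross-channel fusion, BN after each conv."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv_a = nn.Conv3d(in_ch, out_ch, kernel_size=(3, 1, 7), padding=(1, 0, 3))
        self.bn_a = nn.BatchNorm3d(out_ch)
        self.conv_b = nn.Conv3d(out_ch, out_ch, kernel_size=(3, 7, 1), padding=(1, 3, 0))
        self.bn_b = nn.BatchNorm3d(out_ch)
        self.point = nn.Conv3d(out_ch, out_ch, kernel_size=1)   # 1x1x1 point convolution
        self.bn_p = nn.BatchNorm3d(out_ch)
        self.act = nn.GELU()   # activation adopted by the improved network (see below)

    def forward(self, x):
        x = self.act(self.bn_a(self.conv_a(x)))
        x = self.act(self.bn_b(self.conv_b(x)))
        return self.act(self.bn_p(self.point(x)))

block = AsymmetricConv3DBlock(64, 128)
out = block(torch.randn(1, 64, 16, 56, 56))
print(out.shape)   # (1, 128, 16, 56, 56): padding preserves temporal and spatial sizes
```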
The most widely used activation function in neural networks is the rectified linear unit (ReLU), which can handle difficult nonlinear problems and compensates for the limited expressiveness of linear functions [26]. Its formula is shown in equation (7).

$y = \begin{cases} 0, & x \le 0 \\ x, & x > 0 \end{cases} = \max(0, x)$ (7)

In equation (7), $y$ denotes the output. However, the ReLU function ignores the link between activation and regularization of the data. The study therefore uses the GELU function, which incorporates regularization, as the activation function of the improved C3D network [27]. Its expression is shown in equation (8).

$y = \dfrac{1}{2}\, x \left[ 1 + \operatorname{erf}\!\left( x / \sqrt{2} \right) \right]$ (8)

In equation (8), $\operatorname{erf}(\cdot)$ denotes the Gaussian error function. The GELU function is an activation function that compresses a stochastic process, combining nonlinear activation with data regularization to achieve a stochastic regularization effect. Combining the above, the overall architecture of the improved C3D network proposed in the study is shown in Figure 6.

Figure 6: Overall structure of the improved C3D network (Conv1a 64, 3×3×3 with BN and 3DMaxpool 1×2×2, followed by four groups of asymmetric CLs with 128, 256, 512, and 512 channels using 3×1×7 and 3×7×1 kernels, 1×1×1 point CLs, BN, and 3DMaxpool 2×2×2, then GAP, a 51-way FC layer, and a Softmax classifier)

Firstly, the input video is segmented into video frame images after data preprocessing and fed into a 3D CL with 3×3×3 convolution kernels for comprehensive feature extraction. Next, the extracted data is normalized in mini-batches using BN, after which redundant information is removed by the 3D pooling layer. Then, ST feature information is extracted by the asymmetric 3D CLs and fed into the 3D point CL. Finally, all feature data pass through the GAP and classification connection layer to compute the discriminative output values, and the final classification result is output as probabilities by the Softmax classifier.

2.2 ASRNM based on improved C3D network

In order to extract deeper ST features from feature images, the study further designs the ASRNM for S-Pos recognition based on the proposed improved C3D network. The improved C3D network is first extended into a fully pre-activated residual (FPR) network combined with attention, and a staged residual (SR) structure is used for network optimization. Unlike the original residual structure, which can only realize identity mapping connections across residual blocks, the FPR structure combines regularization with an activation function as a pre-activation step before the information enters the convolutional weight layers. Based on the fully pre-activated residual C3D (FPR-C3D) network formed by extending FPR onto the C3D backbone, the study replaces the max pooling of the network with soft pooling (SP). SP obtains the weight of each FM activation value by Softmax exponential normalization, and the final SP output is obtained by the weighted summation of the activation values within the pooling kernel [28]. The weight of an activation value is expressed in equation (9).

$W_g = \dfrac{\exp(\alpha_g)}{\sum_{h \in R}\exp(\alpha_h)}$ (9)

In equation (9), $W_g$ denotes the weight assigned to each activation value within the pooling kernel; $\alpha_g$ and $\alpha_h$ denote activation values within the pooling kernel of the activation FM, $g$ and $h$ denote index numbers within the pooling kernel range, and $R$ denotes the pooling kernel range. Equation (10) gives the final output of SP.

$\tilde{\alpha} = \sum_{g \in R} W_g\, \alpha_g$ (10)

In equation (10), $\tilde{\alpha}$ denotes the output after the final SP.
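Soft pooling as described by equations (9)-(10) can be sketched in a few lines (an illustrative reimplementation under the assumption of PyTorch, not the authors' code): each value in a pooling window is weighted by its Softmax score before summation, so stronger activations dominate without discarding the rest.

```python
import torch
import torch.nn.functional as F

def soft_pool3d(x, kernel_size=(2, 2, 2), stride=None):
    """SoftPool over 3D feature maps: w_g = exp(a_g) / sum_h exp(a_h) within each
    pooling window, output = sum_g w_g * a_g  (equations (9)-(10))."""
    stride = stride or kernel_size
    weights = torch.exp(x)
    # avg_pool of (weights * x) divided by avg_pool of weights equals the
    # Softmax-weighted sum, since the 1/N averaging factors cancel.
    num = F.avg_pool3d(weights * x, kernel_size, stride=stride)
    den = F.avg_pool3d(weights, kernel_size, stride=stride)
    return num / (den + 1e-12)

feature_map = torch.randn(1, 64, 16, 56, 56)
pooled = soft_pool3d(feature_map, kernel_size=(2, 2, 2))
print(pooled.shape)   # (1, 64, 8, 28, 28)
```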
Meanwhile, considering that increasing the network depth weakens the effect of BN on small batches, the study uses group normalization (GN) to perform the regularization operation on the individual 3D CLs. Data normalization is achieved by computing, within each grouped channel, the variance and mean used to normalize the features. The normalization formulas are shown in equation (11).

$u_g = \dfrac{1}{m}\sum_{k \in Q_g} x_k, \quad \sigma_g^2 = \dfrac{1}{m}\sum_{k \in Q_g}(x_k - u_g)^2, \quad \hat{x}_g = \dfrac{x_g - u_g}{\sqrt{\sigma_g^2 + \varepsilon}}, \quad y_g = \gamma \hat{x}_g + \beta$ (11)

In equation (11), $Q_g$ denotes the set of pixels over which the mean and variance of the data are computed. The expression for $Q_g$ is shown in equation (12).

$Q_g = \left\{ k \;\middle|\; k_D = g_D,\ \left\lfloor \dfrac{k_C}{C/T} \right\rfloor = \left\lfloor \dfrac{g_C}{C/T} \right\rfloor \right\}$ (12)

In equation (12), $C$ and $k$ stand for the channel dimension and the index of the input data, respectively. $D$ denotes the batch dimension, and $k_D = g_D$ indicates that the mean and variance of each input are computed along the $(C, X, Y)$ axes. $T$ denotes the number of groups, $\lfloor \cdot \rfloor$ denotes rounding down, and $\lfloor k_C/(C/T) \rfloor = \lfloor g_C/(C/T) \rfloor$ denotes that both indexed data are in the same channel group. GN regularization computes over the input data in a way that circumvents BN's dependence on batch memory consumption, which helps improve the accuracy of the network model for HPR.

However, before video clips can be classified and identified by the FPR-C3D network for the purpose of recognizing human posture, they must be processed into time-series video frames. Furthermore, the effectiveness of the attention module influences the network's recognition effect [29]. Therefore, the study proposes an improved convolutional block attention module (ICBAM) based on the convolutional block attention module (CBAM), extended to the ST domain by adding the temporal dimension, as shown in Figure 7.

Figure 7: ICBAM principle (an input feature map passes sequentially through channel attention, spatial attention, and temporal attention)

In Figure 7, ICBAM first takes the FMs extracted by 3D convolution and obtains the identity (ID) channel attention FM through the channel attention module. After adaptive feature refinement, the channel attention FM is obtained by multiplying the ID channel attention FM by the original FM element by element, as shown in equation (13).

$P' = V_L(P) \otimes P$ (13)

In equation (13), $P$ denotes the input FM and $V_L(P)$ denotes the ID channel attention FM; $V_L$ denotes the channel attention module and $\otimes$ denotes element-by-element multiplication. $P'$ denotes the FM obtained by multiplying $V_L(P)$ and $P$ element by element. $P'$ is then passed through the spatial attention module to obtain the 2D spatial attention FM $V_o(P')$, which is multiplied with $P'$ element by element to obtain the new adaptively refined channel attention FM $P''$, as shown in equation (14).

$P'' = V_o(P') \otimes P'$ (14)

$P''$ is then passed through the temporal attention module $V_\tau$ in order to distinguish the key video frames. The final FM $P'''$ based on temporal channel attention is shown in equation (15).

$P''' = V_\tau(P'') \otimes P''$ (15)

Combined with ICBAM, the proposed FPR-C3D network is further optimized into a full-domain activated residual attention network (FPR-ICBAM-C3D).
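The channel-spatial-temporal attention chain of equations (13)-(15) can be sketched as follows (a minimal PyTorch reimplementation under the assumption of CBAM-style average/max pooling descriptors; the module internals are illustrative, not taken from the paper):

```python
import torch
import torch.nn as nn

class ICBAM(nn.Module):
    """Sketch of the spatio-temporal attention chain of equations (13)-(15):
    channel, spatial, then temporal attention, each applied by element-wise
    multiplication with the feature map it refines."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.channel_mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels))
        self.spatial_conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)
        self.temporal_conv = nn.Conv1d(2, 1, kernel_size=3, padding=1)

    def forward(self, x):                                   # x: (B, C, T, H, W)
        b, c, t, h, w = x.shape
        # (13) channel attention V_L(P) from average- and max-pooled descriptors
        avg = x.mean(dim=(2, 3, 4))
        mx = x.amax(dim=(2, 3, 4))
        ch = torch.sigmoid(self.channel_mlp(avg) + self.channel_mlp(mx))
        x = x * ch.view(b, c, 1, 1, 1)                      # P' = V_L(P) * P
        # (14) spatial attention V_o(P') over the (H, W) plane
        sp = torch.cat([x.mean(dim=(1, 2)).unsqueeze(1),
                        x.amax(dim=(1, 2)).unsqueeze(1)], dim=1)          # (B, 2, H, W)
        x = x * torch.sigmoid(self.spatial_conv(sp)).view(b, 1, 1, h, w)  # P''
        # (15) temporal attention over the frame axis to weight key frames
        tp = torch.stack([x.mean(dim=(1, 3, 4)), x.amax(dim=(1, 3, 4))], dim=1)  # (B, 2, T)
        x = x * torch.sigmoid(self.temporal_conv(tp)).view(b, 1, t, 1, 1)        # P'''
        return x

attn = ICBAM(channels=128)
print(attn(torch.randn(2, 128, 16, 28, 28)).shape)          # (2, 128, 16, 28, 28)
```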
However, FPR alone is not fully effective at solving the network degradation problem when the number of network layers is very large, and it only normalizes the residual branches, so the data in the convolutional weight layers is not normalized. Therefore, the study introduces SR without a point CL for network optimization. SR enables faster and more efficient transfer of information through the network and drives parameter learning and training in parallel, which optimizes the deep network [30-31]. The study therefore finally proposes the ASRNM model shown in Figure 8.

Figure 8: ASRNM network model (Conv1a 64, 3×3×3 with GN and 3D SoftPool 1×2×2, followed by four staged-residual groups of asymmetric CLs with 128, 256, 512, and 512 channels using 3×1×7 and 3×7×1 kernels, GN, ICBAM, and 3D SoftPool 2×2×2, then GAP, a 51-way FC layer, and a Softmax classifier)

The ASRNM model starts with feature extraction and normalization of the input data by the first 3×3×3 CL and GN layer, followed by the first stage of SR processing. Within the residual block of the initial stage, SP downsampling is performed, followed by feature extraction by asymmetric 3D CLs, GN regularization of the data, and information extraction of key frames by ICBAM. After the initial-stage residual block processes the data, it enters the end residual block of the SR. Based on the data processed by the whole SR, the above operations are repeated in the next SR stage until all SRs have been passed. Finally, the extracted feature information passes through GAP, the classification connection layer, and Softmax to output the final recognition result.

3 Experimental analysis of swimming posture recognition model based on improved C3D network

To validate the effectiveness of the proposed S-Pos recognition technique, the proposed improved C3D network is first compared on public datasets. On this basis, additional performance validation of the ASRNM model is carried out in order to assess its efficacy against currently used network models on a sports dataset. Finally, swimming pose recognition experiments are conducted on a swimming sports dataset.

3.1 Improved C3D network validation

To experimentally validate the performance of the improved C3D network, the study used the adaptive moment estimation (Adam) algorithm as the training optimizer and set the number of iterations to 50, the initial learning rate to 0.00001, the number of normalization groups to 32, the weight decay parameter to 5×10⁻⁴, and the batch size during training to 8. The HMDB51 dataset and the Sports-1M dataset are preprocessed under the above parameters, and the improved C3D network model is then compared against the traditional C3D model.
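The reported optimization settings map onto a training loop such as the following minimal PyTorch sketch; the tiny stand-in model and synthetic clips are illustrative placeholders only, since the authors' code and data pipeline are not published.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Hyperparameters reported in the paper
EPOCHS, LR, WEIGHT_DECAY, BATCH_SIZE = 50, 1e-5, 5e-4, 8

# Stand-in model and synthetic clips purely to make the sketch executable;
# the actual improved C3D architecture is the one described in Section 2.1.
model = nn.Sequential(nn.Conv3d(3, 8, 3, padding=1), nn.GELU(),
                      nn.AdaptiveAvgPool3d(1), nn.Flatten(), nn.Linear(8, 51))
dataset = TensorDataset(torch.randn(16, 3, 16, 112, 112), torch.randint(0, 51, (16,)))
loader = DataLoader(dataset, batch_size=BATCH_SIZE, shuffle=True)

optimizer = torch.optim.Adam(model.parameters(), lr=LR, weight_decay=WEIGHT_DECAY)
criterion = nn.CrossEntropyLoss()

for epoch in range(EPOCHS):
    for clips, labels in loader:        # clips: (B, 3, 16, 112, 112)
        optimizer.zero_grad()
        loss = criterion(model(clips), labels)
        loss.backward()
        optimizer.step()
```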
The HMDB51 dataset is mainly derived from movie clips and short videos uploaded online, with a total of 6,766 video clips, most of which suffer from camera jitter, poor shooting angles, and low-quality frames; using it therefore better demonstrates the reliability of the network model proposed in the study. The Sports-1M dataset is a collection of sports video clips classified into 487 action categories, totaling 1,100,000 clips, and is a useful tool for validating the effectiveness of the proposed network model. Figure 9 displays the accuracy and F1-score curves for both datasets.

Figure 9: Comparison of accuracy and F1 score on the HMDB51 and Sports-1M datasets ((a) accuracy on HMDB51; (b) F1-score on HMDB51; (c) accuracy on Sports-1M; (d) F1-score on Sports-1M; each panel plots C3D and improved C3D train/test curves over 50 epochs)

In Figure 9(a), both the testing and training accuracy of the improved C3D network are higher than those of the traditional C3D network. Compared with the unimproved C3D, the improved C3D increased training accuracy by 15.15% after 50 iterations, while test accuracy improved by 13.49%. Figure 9(b) compares the F1 values of the two convolutional networks; in both testing and training, the improved C3D network's F1 values outperform the C3D network. This suggests that the improved C3D network proposed in the study can enhance model performance. After 10 iterations, the improved C3D network levels off earlier than the C3D network, which indicates that it finds the optimal solution more quickly. The superiority of the improved C3D network is also evident when comparing the accuracy and F1 values of both networks on the Sports-1M dataset, as shown in Figures 9(c) and 9(d). The study also compares the receiver operating characteristic (ROC) and precision-recall (PR) curves of the C3D network and the improved C3D network; the comparison results are displayed in Figure 10.

Figure 10: Comparison of ROC and PR curves ((a) ROC on HMDB51: C3D area = 0.85, improved C3D area = 0.89; (b) PR on HMDB51: C3D mAP = 0.61, improved C3D mAP = 0.68; (c) ROC on Sports-1M: C3D area = 0.67, improved C3D area = 0.78; (d) PR on Sports-1M: C3D mAP = 0.54, improved C3D mAP = 0.62)

Compared with C3D, the improved C3D network model's area under the ROC curve in Figure 10(a) rises by 4.71%, indicating an improvement in the model's accuracy.
Better model accuracy is indicated by a greater mean average precision (mAP), i.e., the area under the PR curve. The mAP value of the improved C3D network model in Figure 10(b) is 0.68, an 11.48% increase over C3D. Figure 10(c) shows the ROC curves of the two network models on Sports-1M; the improved C3D network model has a larger area under the curve than C3D. Figure 10(d) shows that the improved C3D network model has a 14.81% higher mAP value than C3D. The ROC curve is less affected by the sample distribution, while the PR curve more accurately represents model performance when the sample distribution is broad. This suggests that the improved C3D network can successfully address the shortcoming of the conventional C3D network, whose classification effect is inadequate because of its excessive number of parameters. Finally, the study further compares the training time overhead of the improved C3D network and the traditional C3D network on the two datasets, as shown in Figure 11.

Figure 11: Comparison of the training time overhead of the two network models ((a) HMDB51 dataset; (b) Sports-1M dataset; training time in seconds plotted against sample size for C3D and improved C3D)

Figure 11(a) shows that the training time overhead of the improved C3D network model is significantly lower than that of the C3D network model on the HMDB51 dataset. As the number of samples increases, the time overhead of both network models increases, but the increase is smaller for the improved C3D network model. Figure 11(b) shows that the time overhead of the improved C3D network model is slightly higher than that of C3D when the sample size is 2000. This may be because the improved C3D network takes some time to adapt to the computation of the samples after the reduction of the number of parameters. However, as the number of samples increases, the growth in the time of the improved C3D network model slows down. By reducing the number of parameters of the C3D network and using GELU as the activation function, the generalization of the model is facilitated. This results in a reduction in the model's time overhead, leading to faster and more accurate recognition of the swimming action.

3.2 Verification of ASRNM based on improved C3D network

The improved C3D network demonstrated good classification performance on the HMDB51 dataset, but considering the one-sidedness of a single dataset and the requirements of S-Pos identification, the study used the Kinetics-400 dataset and the UCF101 dataset to build a sports action (SA) dataset for performance validation of the FPR-C3D network and the ASRNM model. The SA dataset consists of 5,302 video clips in total, divided into 43 categories of 108 clips each. The proposed network model is evaluated in terms of accuracy and F1 value against currently used networks on the SA dataset; the comparison results are displayed in Figure 12.
Figure 12: Comparison of accuracy and F1 score on the SA dataset ((a) accuracy; (b) F1-score; each panel plots train/test curves of FPR-C3D, Res3D, R(2+1)D-18, and ASRNM over 50 epochs)

In Figure 12(a), the proposed ASRNM model shows the fastest and largest increase in accuracy during training, and its accuracy after 50 iterations is the highest among all compared models. The FPR-C3D network model has the lowest accuracy among the four methods, which confirms the need for optimizing the convolutional network with the SR residual structure. Res3D and R(2+1)D-18, the more popular networks, had higher accuracy than the FPR-C3D network, but the training accuracy of the ASRNM model increased by 24.37% and 4.31% over these two popular networks, respectively. The F1 values of the four models are compared in Figure 12(b), which shows that the ASRNM model remains superior in terms of both speed and magnitude of improvement. The F1 curves of the four models on the SA dataset rise quickly, and their F1 values tend to stabilize after 10 iterations. After 50 training iterations, the ASRNM model increased the F1 value by 38.67% over the FPR-C3D network. The accuracy and F1 curves also show that the ASRNM model fluctuates less, which indicates its superior generalization. To further demonstrate the superiority of the proposed model, the AUC, mAP, number of model parameters (Params), and floating-point operations (FLOPs) of the models are compared; the precise findings are displayed in Table 2.

Table 2: Comparison of experimental results of different models on the SA dataset

Model | AUC | mAP | Params (×10⁶) | FLOPs (×10⁹)
C3D | 0.92 | 0.64 | 78.21 | 38.66
Improved C3D | 0.94 | 0.69 | 26.98 | 40.85
FPR-C3D | 0.95 | 0.72 | 47.95 | 45.34
ASRNM | 0.98 | 0.88 | 47.95 | 45.37
Res3D | 0.97 | 0.79 | 33.20 | 37.54
R(2+1)D-18 | 0.96 | 0.86 | 33.31 | 38.75

Comparing the AUC and mAP values of the models shows that the ASRNM model performs best on the SA dataset: its mAP value rises by 2.33%-37.50% over the other approaches, and its AUC value reaches 0.98. This suggests that, on the SA dataset, the ASRNM model performs better overall than the Res3D and R(2+1)D-18 networks. The ASRNM model exceeds these two models in computational complexity and parameter count due to the inclusion of the ST channel attention mechanism, which increases the number of parameters and, in turn, the computational complexity and FLOPs value. However, compared with the traditional C3D network, the parameter count of the ASRNM model is reduced by 38.69% while the computational complexity increases by only 17.36%.
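Parameter counts of the kind reported in Table 2 can be read directly from a model's learnable tensors, while FLOPs are typically estimated by tracing the model with a profiling utility; the following sketch (assuming PyTorch and a small stand-in 3D model, neither of which corresponds to the paper's networks) illustrates the procedure.

```python
import torch
import torch.nn as nn

# Small stand-in 3D model purely for illustration (not any of the paper's networks).
model = nn.Sequential(
    nn.Conv3d(3, 64, kernel_size=3, padding=1), nn.GELU(),
    nn.Conv3d(64, 128, kernel_size=(3, 1, 7), padding=(1, 0, 3)), nn.GELU(),
    nn.AdaptiveAvgPool3d(1), nn.Flatten(), nn.Linear(128, 43),
)

# Params: sum of the elements of every learnable tensor, reported in Table 2 as x10^6.
params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Params: {params / 1e6:.2f} M")

# FLOPs are usually estimated by running a profiler (e.g. thop or fvcore)
# over a forward pass on a dummy clip of the training input size:
dummy_clip = torch.randn(1, 3, 16, 112, 112)
_ = model(dummy_clip)   # the forward pass such a profiler would trace
```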
Furthermore, the study examines how the ASRNM model's loss value changes over 1000 iterations on the HMDB51 and SA datasets; the precise findings are displayed in Figure 13.

Figure 13: Loss curves of the ASRNM model ((a) HMDB51 dataset; (b) SA dataset; train and test loss plotted over 1000 epochs)

On the HMDB51 dataset, the ASRNM model converges after about 900 iterations, as shown in Figure 13(a), with the loss value approaching 0. On the SA dataset, the loss value is close to 0.01 after 800 iterations and continues to approach 0 as the number of iterations rises. On the whole, the initial loss value of the ASRNM model is relatively low, which indicates that its classification performance is good and that it has a superior classification effect in sports pose recognition. Taken together, this shows that the study's ASRNM model has a strong classification capability for sports gesture identification. The study therefore produced a visual representation of the prediction outcomes on the SA dataset based on the confusion matrix of the ASRNM model, as shown in Figure 14.

Figure 14: Visualization of prediction results ((a) butterfly stroke recognition, predicted probability 0.8839; (b) breaststroke recognition, predicted probability 0.9864)

Figures 14(a) and (b) show the recognition of the butterfly stroke and the breaststroke, respectively. Even when there is reference blurring within the image or video blurring, the ASRNM model detects the stroke correctly.

3.3 Experimental validation of gesture recognition in swimming

Based on the above validation results, the study further compares the fine-grained recognition effects of the ASRNM model and the R(2+1)D-18 network on the swimming motion (Swim) dataset. The Swim dataset is constructed from the four swimming motions in the SA dataset, namely breaststroke, butterfly, backstroke, and freestyle, with 494 video clips, and is used for fine-grained recognition of swimming motions. It allows the reliability and accuracy of the model's classification in environments with varying light and water refraction to be verified, as well as the recognition accuracy of gestures with similar movements. The recognition results of the two algorithms on the Swim dataset are shown in Figure 15.

Figure 15: Confusion matrices on the Swim dataset ((a) R(2+1)D-18; (b) ASRNM; each panel is a normalized 4×4 confusion matrix over backstroke, breaststroke, butterfly stroke, and front crawl)

Figure 15(a) displays the confusion matrix produced by the R(2+1)D-18 network on the Swim dataset, and Figure 15(b) displays the confusion matrix produced by the ASRNM model on the same dataset. The comparison shows that the ASRNM model increases the accuracy of pose recognition for the four swimming categories by an average of 14.50% over the R(2+1)D-18 network. This suggests that the study's ASRNM model can recognize swimming poses with greater fine-grainedness.
In the meantime, the study further examines the two approaches' loss values on the Swim dataset; the comparative results are displayed in Figure 16.

Figure 16: Comparison of loss values of the two methods on the Swim dataset ((a) loss curve of the R(2+1)D-18 model; (b) loss curve of the ASRNM model; train and test loss plotted over 1000 epochs)

The loss curve of the R(2+1)D-18 network over 1000 iterations on the Swim dataset is shown in Figure 16(a). Comparing it with the ASRNM model's loss curve in Figure 16(b) reveals that, at 1000 iterations, the R(2+1)D-18 network's loss value is close to 0.03, whereas the ASRNM model's loss value is already close to 0 at roughly 300 iterations. This suggests that the proposed ASRNM model performs better and needs fewer iterations.

4 Discussion

The development and application of artificial intelligence have led to the emergence of HPR, an interdisciplinary field that automatically extracts human feature poses from video images via DL technology. However, the traditional C3D network model requires a large number of parameters in the process of feature extraction, which can lead to a decline in recognition accuracy. Therefore, this study proposes an improved C3D network model. The validation of the model's performance indicates that the accuracy and F1 value of the improved C3D network proposed in the study are significantly higher than those of the original C3D network, a finding consistent with the results reported in literature [8] and [15]. The accuracy and F1 value of the model varied depending on the validation dataset. When compared with the validation results of literature [15] on the HMDB51 dataset, the accuracy of the proposed improved C3D network model was lower; this may be because reducing the number of parameters in the network model affected the recognition performance to some extent. However, the validation results on the Sports-1M dataset further confirm the feasibility of the improved C3D network model for recognizing swimming sport poses.

The practical value of gesture recognition technology in sports applications is significant. In this study, an S-Pos recognition model was constructed based on the improved C3D network, combined with the SR structure and the ST channel attention mechanism. The recognition model achieved a high mAP value of 0.88 and an AUC of 0.98 on the homemade SA dataset. While previous HPR studies achieved up to 95% accuracy on some benchmark datasets, their mAP values were not analyzed further. When comparing AUC, however, the study showed that the ASRNM model is superior for recognizing sports actions. Additionally, the confusion matrix results against the R(2+1)D-18 network on gesture recognition of swimming actions further affirmed the value of the ASRNM model in sports applications such as swimming. However, recognizing SA poses can be limited by factors such as video quality, illumination, and the distance between the athlete and the camera. The ASRNM model is constructed based on the improved C3D network.
The model is then validated from different perspectives using the HMDB51 dataset, the Sports-1M dataset, and the SA dataset. The HMDB51 dataset comprises 6,849 videos with a resolution of around 320×240 and includes 51 types of actions, such as general facial actions, human body actions, and general body actions. Its performance validation confirms the effectiveness of the proposed model for recognizing human gestures in low-quality videos. The Sports-1M dataset consists mainly of videos from YouTube, which vary in quality, shooting background, lighting, and camera distance. On it, the proposed improved C3D network shows an average improvement of 25% compared with the original C3D network, which suggests that the improved C3D network has potential applications in sports. The proposed ASRNM model still demonstrates superior performance on the homemade SA sports dataset and in recognizing swimming actions.

5 Conclusion

The paper proposes a pose recognition model based on the improved C3D network in an effort to address the unsatisfactory recognition performance of the network under a large number of parameters. Firstly, the FCL and the activation function are replaced to improve the C3D network; the improved C3D network is further extended into the FPR-C3D network, and the S-Pos recognition model is constructed using the ST channel attention mechanism and the SR structure. The validation of the improved C3D network revealed that its training and testing accuracies on the HMDB51 dataset increased by 15.15% and 13.49%, respectively, compared with the traditional C3D network. The accuracy of the ASRNM model on the SA dataset increased by 24.37% and 4.31% over the two popular networks, respectively, according to the comparison of the performance of the various models. Its AUC value was as high as 0.98 and its mAP value increased by 2.33%-37.50% over several methods. The confusion matrix results on the Swim dataset revealed that the ASRNM model increased the accuracy of pose recognition for the four swimming categories by an average of 14.50% over the R(2+1)D-18 network. These findings demonstrate that the parameter count of the improved C3D network has been successfully reduced. Additionally, the ASRNM model, which is based on the improved C3D network, is much lighter than the traditional C3D model and performs better in terms of accuracy and fine-grainedness in sports pose recognition.

6 Limitations and future work

Experimental validation on various datasets confirms the effectiveness of the improved C3D network model and the ASRNM model for SA recognition, such as swimming. However, a shortcoming of the study is the relatively low validation accuracy and F1 scores on large datasets, although the models are still superior to the C3D network. Additionally, the ASRNM model proposed in the study could not be pre-trained on large datasets due to hardware limitations, so the number of networks that could be effectively compared was limited. This is a crucial aspect to improve and optimize in the next step of the study. To improve the accuracy of human pose recognition, further reducing the number of network parameters and using more powerful computing equipment to pre-train the recognition model will be considered. The next step of the research will be to design a visualization system for swimming sport recognition based on the ASRNM model.
Conducting research on the recognition of various sports movements is not only conducive to the development of sports, but also helps to promote the intelligence, science, and rationality of physical exercise. The use of DL and other technologies for movement recognition research is significant for sports, as the resulting data can be used to plan athlete training and recovery. Additionally, applying intelligent technology to recognize human movement postures can help correct improper movement postures among the public, promoting the healthy development of national sports and the goal of improving national fitness.

References
[1] M. Estiri, J. H. Dahooie, and E. K. Zavadskas, “Providing a framework for evaluating the quality of health care services using the healthqual model and multi-attribute decision-making under imperfect knowledge of data,” Informatica, vol. 34, no. 1, pp. 85-120, 2023. https://doi.org/10.15388/23-INFOR512.
[2] H. Tang, L. Ding, S. Wu, B. Ren, N. Sebe, and P. Rota, “Deep unsupervised key frame extraction for efficient video classification,” ACM Transactions on Multimedia Computing Communications and Applications, vol. 19, no. 3, pp. 1-17, 2023. https://doi.org/10.1145/3571735.
[3] S. Salimian, S. M. Mousavi, and Z. Turskis, “Transportation mode selection for organ transplant networks by a new multi-criteria group decision model under interval-valued intuitionistic fuzzy uncertainty,” Informatica, vol. 34, no. 2, pp. 337-355, 2023. https://doi.org/10.15388/23-INFOR513.
[4] C. Pham, L. Nguyen, A. Nguyen, N. Nguyen, and V. T. Nguyen, “Combining skeleton and accelerometer data for human fine-grained activity recognition and abnormal behaviour detection with deep temporal convolutional networks,” Multimedia Tools and Applications, vol. 80, no. 19, pp. 28919-28940, 2021. https://doi.org/10.1007/s11042-021-11058-w.
[5] T. Huang, X. Ben, C. Gong, B. Zhang, R. Yan, and Q. Wu, “Enhanced spatial-temporal salience for cross-view gait recognition,” T-CSVT, vol. 32, no. 10, pp. 6967-6980, 2022. https://doi.org/10.1109/TCSVT.2022.3175959.
[6] T. C. Koh, C. K. Yeo, X. Jing, and S. Sivadas, “Towards efficient video-based action recognition: context-aware memory attention network,” SN Applied Sciences, vol. 5, no. 12, pp. 1-12, 2023. https://doi.org/10.1007/s42452-023-05568-5.
[7] C. Zheng, W. Wu, C. Chen, T. Yang, S. Zhu, J. Shen, N. Kehtarnavaz, and M. Shah, “Deep learning-based human pose estimation: A survey,” ACM Computing Surveys, vol. 56, no. 1, pp. 1-37, 2023. https://doi.org/10.1145/3603618.
[8] P. Verma and S. Rajeev, “Two-stage multi-view deep network for 3D human pose reconstruction using images and its 2D joint heatmaps through enhanced stack-hourglass approach,” The Visual Computer, vol. 38, no. 7, pp. 2417-2430, 2022. https://doi.org/10.1007/s00371-021-02120-7.
[9] J. P. Sahoo, S. P. Sahoo, S. Ari, and S. K. Patra, “RBI-2RCNN: Residual block intensity feature using a two-stage residual convolutional neural network for static hand gesture recognition,” Signal Image and Video Processing, vol. 16, no. 8, pp. 2019-2027, 2022. https://doi.org/10.1007/s11760-022-02163-w.
[10] T. S. Chen, N. Yabuki, and T. Fukuda, “Motion recognition method for construction workers using selective depth inspection and optimal inertial measurement unit sensors,” CivilEng, vol. 4, no. 1, pp. 204-223, 2023. https://doi.org/10.3390/civileng4010013.
[11] A. Balmik, A. Paikaray, M. Jha, and A. Nandy, “Motion recognition using deep convolutional neural network for Kinect-based NAO teleoperation,” Robotica, vol. 40, no. 9, pp. 3222-3253, 2022. https://doi.org/10.1017/S0263574722000169.
[12] L. Weng, W. Lou, X. Shen, and F. Gao, “A 3D graph convolutional networks model for 2D skeleton-based human action recognition,” IET Image Processing, vol. 17, no. 3, pp. 773-783, 2023. https://doi.org/10.1049/ipr2.12671.
[13] J. Yu, H. Gao, Y. Chen, D. Zhou, J. Liu, and Z. Ju, “Adaptive spatiotemporal representation learning for skeleton-based human action recognition,” IEEE Transactions on Cognitive and Developmental Systems, vol. 14, no. 4, pp. 1654-1665, 2021. https://doi.org/10.1109/TCDS.2021.3131253.
[14] L. Jiang, B. Zou, S. Liu, W. Yang, M. Wang, and E. Huang, “Recognition of abnormal human behavior in dual-channel convolutional 3D construction site based on deep learning,” Neural Computing and Applications, vol. 35, no. 12, pp. 8733-8745, 2023. https://doi.org/10.1007/s00521-022-07881-3.
[15] A. Jisi and S. Yin, “A new feature fusion network for student behavior recognition in education,” JASE, vol. 24, no. 2, pp. 133-140, 2021. https://doi.org/10.6180/jase.202104_24(2).0002.
[16] A. Jan and G. M. Khan, “Real-world malicious event recognition in CCTV recording using Quasi-3D network,” Journal of Ambient Intelligence and Humanized Computing, vol. 14, no. 8, pp. 10457-10472, 2023. https://doi.org/10.1007/s12652-022-03702-6.
[17] Z. Chen, M. Liang, Z. Xue, and W. Yu, “STRAN: Student expression recognition based on spatio-temporal residual attention network in classroom teaching videos,” Applied Intelligence, vol. 53, no. 21, pp. 25310-25329, 2023. https://doi.org/10.1007/s10489-023-04858-0.
[18] H. Wu, J. Luo, X. Lu, and Y. Zeng, “3D transfer learning network for classification of Alzheimer’s disease with MRI,” International Journal of Machine Learning and Cybernetics, vol. 13, no. 7, pp. 1997-2011, 2022. https://doi.org/10.1007/s13042-021-01501-7.
[19] X. Hong, T. Zhang, Z. Cui, and J. Yang, “Variational gridded graph convolution network for node classification,” JAS, vol. 8, no. 10, pp. 1697-1708, 2021. https://doi.org/10.1109/JAS.2021.1004201.
[20] S. Kumawat, M. Verma, Y. Nakashima, and S. Raman, “Depthwise spatio-temporal STFT convolutional neural networks for human action recognition,” TPAMI, vol. 44, no. 9, pp. 4839-4851, 2021. https://doi.org/10.1109/TPAMI.2021.3076522.
[21] S. Liu, Y. Ren, L. Li, X. Sun, Y. Song, and C. C. Hung, “Micro-expression recognition based on SqueezeNet and C3D,” Multimedia Systems, vol. 28, no. 6, pp. 2227-2236, 2022. https://doi.org/10.1007/s00530-022-00949-z.
[22] J. Guo, Y. Liu, Q. Yang, Y. Wang, and S. Fang, “GPS-based citywide traffic congestion forecasting using CNN-RNN and C3D hybrid model,” Transportmetrica A: Transport Science, vol. 17, no. 2, pp. 190-211, 2021. https://doi.org/10.1080/23249935.2020.1745927.
[23] H. Gao, Y. Liu, and S. Ji, “Topology-aware graph pooling networks,” TPAMI, vol. 43, no. 12, pp. 4512-4518, 2021. https://doi.org/10.1109/TPAMI.2021.3062794.
[24] S. M. S. Abdullah and A. M. Abdulazeez, “Facial expression recognition based on deep learning convolution neural network: A review,” JSCDM, vol. 2, no. 1, pp. 53-65, 2021. https://doi.org/10.30880/jscdm.2021.02.01.006.
[25] B. Gülmez, “A novel deep neural network model based Xception and genetic algorithm for detection of COVID-19 from X-ray images,” Annals of Operations Research, vol. 328, no. 1, pp. 617-641, 2023.
https://doi.org/10.1007/s10479-022-05151-y.
[26] I. Jahan, M. F. Ahmed, M. O. Ali, and Y. M. Jang, “Self-gated rectified linear unit for performance improvement of deep neural networks,” ICT Express, vol. 9, no. 3, pp. 320-325, 2023. https://doi.org/10.1016/j.icte.2021.12.012.
[27] Y. Xie, A. N. J. Raj, Z. Hu, S. Huang, Z. Fan, and M. Joler, “A twofold lookup table architecture for efficient approximation of activation functions,” IEEE Transactions on Very Large-Scale Integration (VLSI) Systems, vol. 28, no. 12, pp. 2540-2550, 2020. https://doi.org/10.1109/TVLSI.2020.3015391.
[28] Y. Wang, D. J. Tan, N. Navab, and F. Tombari, “Softpool++: An encoder–decoder network for point cloud completion,” IJCV, vol. 130, no. 5, pp. 1145-1164, 2022. https://doi.org/10.1007/s11263-022-01588-7.
[29] S. Liu, X. Wang, L. Zhao, B. Li, W. Hu, J. Yu, and Y. D. Zhang, “3DCANN: A spatio-temporal convolution attention neural network for EEG emotion recognition,” IEEE Journal of Biomedical and Health Informatics, vol. 26, no. 11, pp. 5321-5331, 2021. https://doi.org/10.1109/JBHI.2021.3083525.
[30] T. Ge and O. Darcy, “Study on the design of interactive distance multimedia teaching system based on VR technology,” International Journal of Continuing Engineering Education and Life Long Learning, vol. 32, no. 1, pp. 65-77, 2022. https://doi.org/10.1504/IJCEELL.2022.121221.
[31] T. V. Henriksen, N. Tarazona, A. Frydendahl, T. Reinert, F. Gimeno-Valiente, J. A. Carbonell-Asins, and C. L. Andersen, “Circulating tumor DNA in stage III colorectal cancer, beyond minimal residual disease detection, toward assessment of adjuvant therapy efficacy and clinical behavior of recurrences,” Clinical Cancer Research, vol. 28, no. 3, pp. 507-517, 2022. https://doi.org/10.1158/1078-0432.CCR-21-2404.