https://doi.org/10.31449/inf.v48i13.6100 Informatica 48 (2024) 97–110

3D-CNN-based Action Recognition Algorithm for Basketball Players

Zhilei Cui
College of Physical Education, Taiyuan University of Technology, Taiyuan 030024, China
E-mail: cuizhilei@tyut.edu.cn

Keywords: Basketball technical action; Target detection; Action recognition; Single shot multibox detector; 3D-CNN

Received: April 21, 2024

The development of artificial intelligence has produced numerous methods for human action recognition. Basketball has distinctive technical action features, so recognizing and classifying its technical actions is highly feasible. However, existing action recognition methods struggle to make effective use of continuous frames, resulting in poor accuracy for basketball technical action recognition. To improve recognition performance, the study proposes a continuous-frame action recognition approach based on the single shot multibox detector (SSD) algorithm and a 3D convolutional neural network. The experimental results revealed that the single shot multibox detector algorithm accurately recognizes the human body in the image and labels its confidence level. In basketball action recognition, the loss values of the original frame were 6.0 and 6.8 on the training and validation sets, respectively, and the loss values of the crop frame were 5.1 and 5.9 on the training and validation sets, respectively. The 3D convolutional neural network achieved its highest classification accuracy, about 88.3%, for the stop jump shot action in the original frame, and the crop frame reached an average recognition rate of about 90.3%. The recognition accuracy of the original frame and crop frame increased with the number of epochs and reached a stable state at epoch 30. The shared change-of-direction features of the Eurostep, crossover, and Sham God actions led to misjudgments in both the original frame and the crop frame. The accuracy of the original frame on the training and validation sets was 0.91 and 0.81, respectively, and the accuracy of the crop frame on the training and validation sets was about 0.92 and 0.81, respectively. After fusing the original frame features and the crop frame features, the average recognition rate was about 94.6%, significantly higher than that of single-resolution recognition. In addition, as the frame input increased, the F1-score gradually increased while the error rate gradually decreased. When the frame input was 7, the F1-score and the error rate were 0.79 and 0.19, respectively. When the frame input was 16, the F1-score and the error rate were 0.92 and 0.05, respectively. These results show that the continuous-frame action recognition method based on the single shot multibox detector algorithm and the 3D convolutional neural network can accurately recognize the technical actions in basketball video.

Povzetek: A basketball action recognition algorithm based on a 3D-CNN and a multibox frame detection algorithm is presented. The results show that the method significantly improves the accuracy of technical action recognition.

1 Introduction

Basketball, the second most popular sport globally, boasts over 2.7 billion fans. With the rise of short-video platforms, high-quality videos of all kinds provide people with rich content, and basketball technique videos and game highlights are popular among basketball fans.
Recognizing the technical moves in these videos can promote the sport of basketball. In the field of video action recognition (AR), improved dense trajectories (iDT) has the best recognition performance among traditional methods, but its recognition speed is too slow. The emergence of deep learning has provided new ideas for video AR; common deep learning AR methods include single-frame video image-based recognition, the two-stream convolutional neural network (TSCNN), the long short-term memory (LSTM) network, and the 3D convolutional neural network (3D-CNN) [1-2]. In recognition methods based on single-frame video images, the selected frames can hardly represent the whole video, and it is difficult to match the extracted features with the action features. The spatial convolution of TSCNN is performed only on a single frame and cannot learn pixel-level correspondence between spatial and temporal features. Although LSTM handles timing problems well, it suffers from long training times and high consumption of computational resources. 3D-CNN, on the other hand, can effectively capture temporal and spatial information and can handle volumetric data, and thus performs well in video classification and AR [3-5]. However, basketball technical actions (BTAs) are complex, and 3D-CNN also suffers from long training times. Therefore, to increase training speed and recognition accuracy while reducing the consumption of computing resources, the study proposes a basketball video target detection (TD) and AR model based on the single shot multibox detector (SSD) algorithm and 3D-CNN. The innovation of the study is to combine SSD-based TD with a dual-resolution 3D-CNN that takes low-resolution image input for AR, which provides a new idea for BTA recognition. The study is divided into six chapters. The introduction provides background information and explains the importance of the study. Chapter 2 is a literature review describing the SSD algorithm (SSDA) and related research on 3D-CNN. Chapter 3 is the study methodology, which investigates the SSDA and the dual-resolution 3D-CNN architectures. Chapter 4 presents the experimental results, analyzing the performance of the research methods in TD and AR. Chapter 5 is a discussion analyzing the advantages of the proposed method. Chapter 6 concludes and summarizes the research results of this paper.

2 Related works

The SSDA combines the regression concept with the anchor frame mechanism, removing the candidate region generation and subsequent pixel or feature resampling phases of two-stage algorithms; all calculations are contained in a single network. As a result, it offers the benefits of simple training and fast speed and is extensively employed in many industries. For the problem of video face detection, Liu et al. suggested a face detection approach based on the SSDA and Res-Net. This method uses kernel correlation filtering to track successive n frames and Res-Net as the core network of the SSDA. Experimental data showed that the approach achieves real-time detection and increases the precision of video face detection [6]. To solve the vehicle recognition problem in intelligent transportation, Zhao et al. suggested a detection approach based on the SSDA and a feature pyramid augmentation strategy.
The feature pyramid augmentation strategy enhanced the SSD's feature extraction (FE) capabilities, and a cascaded detection mechanism enhanced its localization capabilities. According to experimental results, this method's detection time is shorter than that of previous approaches [7]. Liu et al. proposed a detection model based on the SSDA and a depthwise-separable fusion hierarchical feature model for the pedestrian detection problem. The model effectively reduced model complexity through depthwise-separable convolution and realized feature fusion enhancement using a hierarchical structure. Experimental results on the INRIA dataset showed that the improved model's missed detection rate was only 9.68% [8]. Li et al. proposed a detection model based on a multi-block SSDA for the small-TD problem of on-site monitoring by railroad drones. After segmenting the image into overlapping blocks, the model feeds the blocks individually to the SSD and uses non-maximum suppression and filtering algorithms to remove overlapping frames from the sub-layers. Experimental findings showed the multi-block SSD model's overall accuracy to be 96.6%, 9.2% better than the conventional SSDA [9]. For the problem of electromagnetic luminescence surface defect detection, Xu et al. suggested an enhanced SSD technique based on feature fusion, which employs the concept of the feature pyramid network. According to experimental data, the enhanced SSDA outperforms previous algorithms in electromagnetic luminescence defect detection [10]. The 3D-CNN is a popular choice in multi-channel image processing because, in contrast to the 2D-CNN, it can capture discriminative features along both spatial and temporal dimensions: it generates multiple information channels from adjacent video frames, performs convolution and downsampling in each channel independently, and obtains the final feature representations by combining the information from the video channels. Xu and Zhang proposed a 3D-CNN hand pose estimation method. The method used depth images to reconstruct the 3D spatial structure of the hand and converted the hand model to a voxel grid via 3D-CNN, which in turn realized hand pose estimation. Experimental results revealed that the improved 3D-CNN model achieved an average accuracy of 87.98% with a mean absolute error of only 8.82 mm [11]. Rehman et al. proposed a 3D-CNN-based tumor classification model for the problem of brain tumor detection and automated classification. This model extracted brain tumor features with a 3D-CNN and verified and classified the features with a feed-forward neural network. The outcomes revealed that this 3D-CNN model could classify brain tumors with an accuracy of up to 98.32% [12]. For the problem of concentration estimation of gas mixtures, Pareek et al. suggested a concentration estimation network based on a 3D-CNN and a restricted Boltzmann machine. The network processed sensor array data using the 3D-CNN and used the restricted Boltzmann machine for end-to-end gas concentration estimation. The results indicated that the network was able to accurately predict the concentration of the gas mixture [13]. Chaddad et al. proposed a prediction model based on a Gaussian mixture model and 3D-CNN for the problem of predicting the survivability of pancreatic ductal adenocarcinoma patients.
The model utilized a Gaussian mixture model to model the distribution of learned features obtained from preoperative computed tomography, followed by FE and learning via 3D-CNN, and finally a robust random-forest-based classifier to predict survival outcomes. The experimental results revealed that the ROC of this model was 0.72, much higher than that of other models [14]. For the purpose of classifying hyperspectral images, Roy et al. presented a classification model based on a hybrid spectral convolutional neural network. The model utilized a 3D-CNN for joint spatial-spectral feature representation and further learned more advanced spatial representations through a 2D-CNN. According to experimental findings, the hybrid spectral convolutional neural network successfully decreased model complexity while increasing the classification accuracy of hyperspectral images [15]. A summary of the related literature is shown in Table 1.

Table 1: Summary of related literature
Author | Method | Index
Liu et al. [6] | A face detection method based on the SSD algorithm and Res-Net | Accuracy is over 92%
Zhao et al. [7] | Vehicle detection method based on the SSD algorithm and a feature pyramid enhancement strategy | 80.6% accuracy and 14 ms detection time
Liu et al. [8] | Pedestrian detection model based on the SSD algorithm and a depthwise-separable fusion hierarchical feature model | The missed detection rate is 9.68%
Li et al. [9] | A small object detection model based on the multi-block SSD algorithm | The accuracy rate is 96.6%
Xu et al. [10] | Improved SSD algorithm based on feature fusion | Accuracy is over 90%
Xu and Zhang [11] | A 3D-CNN hand pose estimation method based on an end-to-end hierarchical model and physical constraints | The average accuracy is 87.98%, and the mean absolute error is only 8.82 mm
Rehman et al. [12] | Tumor classification model based on 3D-CNN | The highest accuracy rate is up to 98.32%
Pareek et al. [13] | Concentration estimation network based on a 3D-CNN and a restricted Boltzmann machine | Accuracy is over 91%
Chaddad et al. [14] | Predictive model of viability in patients with pancreatic ductal adenocarcinoma based on a Gaussian mixture model and 3D-CNN | The ROC is 0.72
Roy et al. [15] | Hyperspectral image classification model based on a hybrid spectral convolutional neural network | The accuracy rate is nearly 99%

In summary, both the SSDA and the 3D-CNN are widely used in image processing. However, faced with the complexity of BTAs and the strong correlation between consecutive frames of basketball video, existing image processing techniques struggle to accurately recognize and classify BTAs. Based on this, the study proposes a BTA recognition algorithm based on the SSDA and a dual-resolution 3D-CNN, in order to realize the accurate recognition and classification of BTAs.

3 Action recognition algorithm for basketball video based on SSDA and 3D-CNN

Basketball sports videos can accurately reflect the technical movements of basketball and have certain advantages in teaching. However, due to the wide variety of BTAs, it is difficult to classify them with traditional AR techniques, and methods that utilize continuous image frames are lacking. Therefore, in order to better recognize and classify BTAs in sports videos, the study proposes a video cropping and AR method based on SSD and 3D-CNN.
3.1 Video cropping algorithm based on SSDA

Before recognizing video actions, the video needs to be cropped to select the motion region and reduce the size of the image frames. As a lightweight TD algorithm, the SSD method addresses the YOLO algorithm's weakness in detecting small targets while retaining high detection accuracy and fast operation. Although both the SSDA and the YOLO algorithm perform detection through a CNN, the SSDA performs detection in the intermediate layers instead of after the fully connected layer. The SSDA first generates prediction frames (PFs) on the input image (IM), then labels the locations based on the PFs and feature maps (FMs) and outputs the classification categories. Lastly, the non-maximum suppression technique is used to eliminate redundant PFs and PFs that do not meet the confidence expectation [16-17]. Figure 1 depicts the SSDA's framework.

Figure 1: The framework of the SSDA (VGG-16 through conv4_3 (38×38×512) and FC7 (19×19×1024), followed by convolutional layers of 10×10×512, 5×5×256, and 3×3×256 and a 1×1×256 average pooling layer; six detector-and-classifier branches feed a normalization and fast non-maximum suppression stage)

In Figure 1, the base network of the SSDA is a modified VGG-16. Firstly, the fully connected layers FC6 and FC7 of VGG-16 are replaced with convolutional layers (CLs) of 3×3 and 1×1, the Dropout and FC8 layers are deleted, and the pooling layer Pool 5 is changed to 3×3 with a stride of 1. At the same time, to obtain a denser score mapping, the atrous algorithm is introduced, and more CLs are added to VGG-16 to increase the FMs. The additional network is a CNN with gradually decreasing scales, which can be used to detect targets of different scales [18-19]. In the SSDA, VGG-16 and the additional CLs are responsible for the FE: first, the size of the IM is 300×300×3; then the convolutional kernel (CK) of Conv 1 is initialized and two convolutional operations are performed, with a 3×3 CK and the ReLU activation function. Since the convolution operation loses image boundary information, the Padding parameter is set to "same" in order to preserve the boundary features, which pads the image boundary with zeros [20]. The resulting features are subjected to maximum pooling after the convolution. Conv 2 to Conv 5 then perform the same operations; note that Conv 3 to Conv 5 each require three convolution computations. Taking padding=1 as an example, the padding operation is shown in Figure 2.

Figure 2: Padding operation diagram (a 4×4 input bordered with zeros to give a 6×6 output)

As Figure 2 illustrates, after many convolutions the size of the output picture keeps decreasing. To prevent the picture from shrinking after convolution, padding is applied to the periphery of the picture. When padding=1, the padding is one pixel wide, its value is 0, and the output size changes from the original 4×4 to 6×6.
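As a concrete illustration, the following is a minimal numpy sketch of the zero-padding step in Figure 2; the array values are illustrative, and in practice this behavior comes from Keras's built-in "same" padding rather than explicit code like this:

```python
import numpy as np

# Zero-padding as in Figure 2: a 4x4 feature map is bordered with one ring
# of zeros (padding=1), so a subsequent 3x3 convolution with stride 1 keeps
# the original spatial size instead of shrinking it.
x = np.arange(1, 17).reshape(4, 4)                 # 4x4 input feature map
x_padded = np.pad(x, pad_width=1, mode="constant", constant_values=0)

print(x_padded.shape)  # (6, 6): the 4x4 map grows to 6x6, as in Figure 2
```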
Since the SSDA changes FC6 and FC7 of VGG-16 into CLs and changes the stride of the pooling layer Pool 5, the atrous algorithm, which performs dilated convolution, is introduced into VGG-16 to cope with this change; its two CKs for the convolution computation are 1×1 and 3×3, respectively. Figure 3 displays the schematic diagram of the atrous convolution.

Figure 3: A schematic representation of the atrous convolution

From Figure 3, it can be observed that, where a conventional convolution has a 3×3 receptive field, the receptive field can be enlarged by performing the atrous convolution operation on its FM; the receptive field then measures 7×7. Continuing the atrous convolution enlarges the receptive field to 15×15. After the atrous convolution, two convolution calculations are carried out by Conv 6 with CKs of 1×1 and 3×3, respectively, the second with a stride of two. The same operation is then carried out by Conv 7 to Conv 9; it is worth noting that Conv 7 to Conv 9 all perform the filling operation at each step. After the above operations, six FMs are obtained, and the detection results are derived from these FMs. The detection value, obtained with a 3×3 convolution calculation, includes the bounding box's position and the FMs' confidence. While the number of CKs needed for the bounding box location is four times the number of a priori frames, the number of CKs needed for classification in the prediction stage is the product of the number of a priori frames of the FM and the number of classification categories [21-22]. Each target of the FM corresponds to a number of default frames; for the targets in the PFs, a convolution operation is performed to obtain confidence and bounding box location information, and the true values are matched with the default frames to obtain negative and positive samples. The intersection over union (IoU) of the predicted and true frames is given in Equation (1).

IoU = \frac{A \cap B}{A \cup B} (1)

In Equation (1), A and B denote the PF and the true frame, respectively. As the number of CNN layers increases, the abstraction level of the extracted image features increases gradually, whereas, since neurons are locally aware and locally connected, the underlying FM retains a large amount of detail information. When the multi-scale FMs of the SSDA are matched with the PFs, they correspond to distinct receptive fields, which improves the SSDA's recognition capability. The PF scale calculation formula is shown in Equation (2).

s_k = s_{min} + \frac{s_{max} - s_{min}}{m - 1}(k - 1), \quad k \in [1, m] (2)

In Equation (2), s_k denotes the ratio of the a priori frame size relative to the image, and s_{max} and s_{min} denote the maximum and minimum values of the ratio, generally 0.9 and 0.2, respectively. m denotes the number of FMs and k indexes the FMs. It is worth noting that the a priori box scale obeys a linear increment rule, i.e., as the size of the feature image decreases, the a priori box scale increases linearly. Equation (3) displays the default box (DB) height calculation formula.

h_k^{a_r} = \frac{s_k}{\sqrt{a_r}} (3)

In Equation (3), h_k^{a_r} is the height of the DB, and a_r represents the aspect ratio of the prediction box (PB), with value set \{1, 2, 3, \frac{1}{2}, \frac{1}{3}\}. The formula for calculating the width of the DB is shown in Equation (4).

w_k^{a_r} = s_k \sqrt{a_r} (4)

In Equation (4), w_k^{a_r} represents the width of the DB. The formula for calculating the center of the PB is shown in Equation (5).

\left( \frac{i + 0.5}{f_k}, \frac{j + 0.5}{f_k} \right) (5)

In Equation (5), f_k denotes the size of the FM. Since more DBs increase the training time and computational complexity, some of the CLs drop the DBs with aspect ratios of 3 and 1/3. Moreover, the SSDA also eliminates the DBs where the recognized object lies outside the box and the PBs whose confidence is less than the set threshold; this operation is realized by the non-maximum suppression algorithm.
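To make Equations (1) to (5) concrete, the following is a minimal Python sketch of the default-box geometry and the IoU computation under the standard SSD settings assumed in the text (s_min = 0.2, s_max = 0.9); the function names are illustrative, not from the paper:

```python
import numpy as np

def box_scales(m=6, s_min=0.2, s_max=0.9):
    """Linearly increasing a priori box scales for m feature maps, Eq. (2)."""
    return [s_min + (s_max - s_min) * (k - 1) / (m - 1) for k in range(1, m + 1)]

def default_box_sizes(s_k, aspect_ratios=(1.0, 2.0, 3.0, 1/2, 1/3)):
    """Width and height of each default box at scale s_k, Eqs. (3)-(4)."""
    return [(s_k * np.sqrt(a), s_k / np.sqrt(a)) for a in aspect_ratios]

def iou(box_a, box_b):
    """Intersection over union of two boxes (x1, y1, x2, y2), Eq. (1)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(box_scales())                      # ~[0.2, 0.34, 0.48, 0.62, 0.76, 0.9]
print(iou((0, 0, 2, 2), (1, 1, 3, 3)))   # 1/7 ~ 0.143
```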
3.2 3D-CNN-based video action recognition algorithm

After the video is cropped by the SSDA, the motion region can be selected, which reduces the difficulty of subsequent AR. When performing BTA recognition, the input is a sequence of video frames; therefore, not only the spatial representation of the motion but also its temporal order needs to be considered. Compared with the traditional 2D-CNN, the 3D-CNN adds a temporal dimension, which lets it deal effectively with the temporal order of actions. The 3D convolution is shown in Figure 4.

Figure 4: 3D convolution schematic diagram (convolution across the time axis)

In Figure 4, a 3D convolution can produce multiple FMs by sharing a CK, and each FM contains time-dimension information. The CKs in the time dimension are encoded with different colors, and the same colors share weights. FE is achieved by applying the same 3D CK to the overlapping 3D cubes in the input video. The 3D convolution formula is given in Equation (6).

v_{lij}^{xyz} = f\left( \sum_{m} \sum_{h=0}^{H_i - 1} \sum_{w=0}^{W_i - 1} \sum_{s=0}^{S_i - 1} k_{ijm}^{hws} V_{(l-1)m}^{(x+h)(y+w)(z+s)} + b_{ij} \right) (6)

In Equation (6), i and j denote the feature blocks in the previous layer and the CKs in the current layer, respectively. l denotes the layer and v_{lij}^{xyz} denotes the 3D convolution result. H_i and W_i denote the height and width of the CK, respectively, and m denotes the index of the FM connected to the current layer. k_{ijm}^{hws} denotes the value of the CK at (h, w, s), and S_i denotes the size of the CK in the temporal dimension. b_{ij} denotes the offset, and V_{(l-1)m}^{(x+h)(y+w)(z+s)} denotes the convolution result of the previous layer. Since the performance of a CNN is affected by factors such as hyperparameters and algorithmic architecture, the study adopts a dual-resolution 3D-CNN in order to reduce the training time while ensuring the training quality. The dual-resolution 3D-CNN processes the same image at two different resolutions and then fuses the resulting features to improve the AR capability. However, since an image is composed of numerous pixels, inputting it directly yields a large amount of data and a longer training time [23-24]. The color information of the image is of little use in AR, so the image is grayscaled to reduce the amount of data and also to avoid interference from the background, people's clothing, and lighting. The 3D-CNN architecture is shown in Figure 5.

Figure 5: Architecture of the 3D-CNN (Input → Conv1 (32) → Conv2 (64) → Conv3 (128) → Conv4 (256) → Conv5 (256), each followed by pooling, then FC6, FC7, and Softmax)

In Figure 5, the original frame (OF) 3D-CNN for AR consists of five CLs plus FC6 and FC7. The IM yields the final result through Softmax after five consecutive convolution and pooling operations followed by the two fully connected layers. The CK size and stride of the CLs are (3,3,3) and (1,1,1), respectively; the pooling window and stride of the first layer are (2,2,1), and the remaining pooling windows and strides are (2,2,2). The crop frame (CF) 3D-CNN differs from the OF 3D-CNN in that it consists of four CLs plus FC5 and FC6, and all of its pooling windows and strides are (2,2,2).
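As an illustration of this architecture, the following is a minimal Keras sketch of the OF 3D-CNN under the settings stated above and in Section 4 (16 grayscale frames of 112×112, 1024-unit fully connected layers, six action classes). The axis ordering of the (2,2,1) pooling window, the "same" padding, and the compile settings are assumptions; this is a reconstruction for illustration, not the authors' released code.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_of_3dcnn(num_classes=6):
    # Filter counts follow Figure 5 (32-64-128-256-256); the first pooling
    # window is assumed to pool only spatially, matching the paper's (2,2,1).
    model = keras.Sequential([
        keras.Input(shape=(16, 112, 112, 1)),         # (frames, H, W, channels)
        layers.Conv3D(32, (3, 3, 3), padding="same", activation="relu"),
        layers.MaxPooling3D(pool_size=(1, 2, 2)),     # spatial-only first pool
        layers.Conv3D(64, (3, 3, 3), padding="same", activation="relu"),
        layers.MaxPooling3D(pool_size=(2, 2, 2)),
        layers.Conv3D(128, (3, 3, 3), padding="same", activation="relu"),
        layers.MaxPooling3D(pool_size=(2, 2, 2)),
        layers.Conv3D(256, (3, 3, 3), padding="same", activation="relu"),
        layers.MaxPooling3D(pool_size=(2, 2, 2)),
        layers.Conv3D(256, (3, 3, 3), padding="same", activation="relu"),
        layers.MaxPooling3D(pool_size=(2, 2, 2)),
        layers.Flatten(),
        layers.Dense(1024, activation="relu"),        # FC6
        layers.Dense(1024, activation="relu"),        # FC7
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

build_of_3dcnn().summary()
```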
In addition, to further improve training efficiency, 2D weight parameters are used to initialize the 3D convolutional weight parameters [25-26]. Because of the high background similarity between consecutive frame images, the 2D weight matrix is used to initialize the 3D weight matrix, as calculated in Equation (7).

W_t^{3D} = \frac{W^{2D}}{T} (7)

In Equation (7), W_t^{3D} and W^{2D} denote the 3D weight matrix and the 2D weight matrix, respectively, and T denotes the size of the temporal dimension. In addition, to obtain different 3D weight matrices, their scaling needs to be initialized, as calculated in Equation (8).

W_t^{3D} = \alpha_t W^{2D}, \quad \left( \alpha_t \ge 0, \ \sum_{t=1}^{T} \alpha_t = 1 \right) (8)

In Equation (8), \alpha_t denotes a random constant. Negative weight initialization is also required to set the values of the sub-matrices of the 3D weight matrix. The negative weight initialization formula is shown in Equation (9).

W_t^{3D} = \alpha_t W^{2D}, \quad \alpha_t = \begin{cases} 2 - \frac{1}{T}, & t = 1 \\ -\frac{1}{T}, & 2 \le t \le T \end{cases} (9)

After the OF image and CF image are processed by the dual-resolution 3D-CNN, the corresponding weight files are obtained and used as the model parameters for frame sequence prediction to obtain the feature vectors. The feature vectors then need to be fused to facilitate the classification and recognition of features. The feature fusion formula is shown in Equation (10).

Y_{p,q,d} = X_{p,q,d}^{a} + X_{p,q,d}^{b} (10)

In Equation (10), Y denotes the final feature representation, X_{p,q,d}^{a} denotes the feature of the OF image and X_{p,q,d}^{b} denotes the feature of the CF image; p denotes the feature image height, q denotes the feature image width, and d indexes the feature channels. The final feature representation is used as input to an SVM to perform action classification and recognition. The SVM is chosen for AR because it can effectively deal with high-dimensional and linearly inseparable data, and it is based on Vapnik-Chervonenkis dimension theory and structural risk minimization [27-28]. The schematic diagram of the SVM finding the minimum-risk boundary is shown in Figure 6.

Figure 6: Schematic diagram of SVM searching for the minimum-risk boundary (the optimal interface maximizes the classification distance)

In Figure 6, numerous straight lines in the two-dimensional data space can classify the data. However, when a classification line lies too close to a certain sample, its sensitivity to noise is high, resulting in weak generalization, so that line's classification effect is not optimal. The SVM finds the optimal dividing line by maximizing the distance between the training samples and the classification line. Figure 7 displays the schematic diagram of how the SVM classifies linearly inseparable data.

Figure 7: Schematic diagram of linearly inseparable data classification (mapping to a higher-dimensional space)

When faced with linearly inseparable data, as in Figure 7, the SVM uses a kernel function to map the low-dimensional linearly inseparable data into a high-dimensional space, where the data becomes linearly separable. The optimal classification hypersurface is then obtained in the high-dimensional space by a linearly separable method [29-30].
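The fusion-and-classification step can be sketched as follows, with synthetic stand-ins for the 3D-CNN feature vectors; the 1024-dimensional feature size follows Section 4, while the RBF kernel choice and the scikit-learn usage are assumptions for illustration:

```python
import numpy as np
from sklearn.svm import SVC

# Feature vectors from the OF and CF 3D-CNNs are fused by element-wise
# addition as in Equation (10), and the fused features train an SVM that
# labels the six technical actions. The arrays below are synthetic.
rng = np.random.default_rng(0)
n_clips, feat_dim = 200, 1024                 # one 1024-d vector per clip
X_of = rng.normal(size=(n_clips, feat_dim))   # original-frame features
X_cf = rng.normal(size=(n_clips, feat_dim))   # crop-frame features
y = rng.integers(0, 6, size=n_clips)          # 6 technical action classes

X_fused = X_of + X_cf                         # Equation (10): Y = X^a + X^b

clf = SVC(kernel="rbf")                       # kernel maps to a separable space
clf.fit(X_fused[:150], y[:150])
print("held-out accuracy:", clf.score(X_fused[150:], y[150:]))
```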
In addition, when training the 3D-CNN model, a suitable loss function is needed to realize the weight update. The update strategy is mini-batch, and the loss function is shown in Equation (11).

loss = -\frac{1}{u} \sum_{i=1}^{u} \left[ x_i \log z_i + (1 - x_i) \log(1 - z_i) \right] (11)

In Equation (11), u denotes the size of the mini-batch, and x_i and z_i denote the predicted and true values of the i-th sample in each batch, respectively. The gradient descent formula is shown in Equation (12).

w_t = w_{t-1} - \eta \nabla_w J(w) (12)

In Equation (12), w_t denotes the weights, \eta denotes the learning rate, and J(w) denotes the cost function. Additionally, the study uses the Adam optimizer to shorten the training period while guaranteeing the training rate. The weight calculation formula is then shown in Equation (13).

w_t = w_{t-1} - \eta \frac{\hat{c}}{\sqrt{\hat{e}} + \epsilon}, \quad c = \beta_1 c + (1 - \beta_1) \nabla_w J(w) (13)

In Equation (13), \hat{c} denotes the correction of the momentum vector, \hat{e} denotes the correction of the gradient-squared accumulation vector, and \epsilon is a small constant that avoids division by zero; c denotes the momentum vector and \beta_1 denotes the decaying momentum hyperparameter. The gradient-squared accumulation vector formula is shown in Equation (14).

e = \beta_2 e + (1 - \beta_2) \left( \nabla_w J(w) \right)^2 (14)

In Equation (14), e denotes the gradient-squared accumulation vector and \beta_2 denotes the scaling decay hyperparameter. The formulas for the correction of the momentum vector and the gradient-squared accumulation vector are given in Equation (15).

\hat{c} = \frac{c}{1 - \beta_1^t}, \quad \hat{e} = \frac{e}{1 - \beta_2^t} (15)

Since the initial values of the momentum vector and the gradient-squared accumulation vector are both 0, the correction improves both during the initial phase of training.

4 Video action recognition result analysis

Basketball instructional videos and highlight reels are popular among basketball enthusiasts and are widely studied and imitated. To enhance the technical action learning methods available to basketball enthusiasts, the study suggests a video AR approach that utilizes 3D-CNN and SSD. To assess the recognition performance of the approach, the study evaluates the CF generation of the SSDA and the AR performance of the 3D-CNN. The SSDA tests use the COCO and VOC datasets: COCO contains 328,000 images with 2,500,000 labeled instances across 91 categories, 82 of which have more than 5,000 labeled examples each, and the VOC dataset has twenty-one categories, including bicycles, dogs, and cats. Since the SSDA's goal here is to recognize people in images, irrelevant images are ignored. The 3D-CNN test is performed on a homemade BTA dataset, which contains 1800 videos (300 videos for each of the crossover, feint, spin move, Eurostep, Sham God, and stop jump shot movements), with each video lasting from 0.5 s to 1.5 s. For each category of movement in the BTA dataset, fifty videos are selected as the validation set, and the remaining videos are used as the training set.
In the experiments, the deep learning framework is Keras, the programming language is Python 3.7, the batch size and number of epochs are 32 and 50, respectively, and the image input is 16 frames. The SSDA employs a confidence threshold of 0.9, a minimum ratio of the prediction box to the original size of 0.2, a maximum ratio of 0.9, and an IoU threshold of 0.5. The original images have a resolution of 112×112 pixels, the cropped frames have a resolution of 64×64 pixels, and the number of output units in the fully connected layer is 1024. The recognition results of the SSDA on the COCO dataset are shown in Figure 8.

Figure 8: Identification results of the SSDA on the COCO dataset (persons and non-human objects such as bats, mitts, frisbees, and cars are boxed and labeled with confidence scores)

In Figure 8, the SSDA accurately recognizes the people in all the COCO images and marks their confidence levels. Meanwhile, the SSDA marks non-human objects with other colors and shows the category they belong to. It can be seen that the SSDA accurately recognizes the human bodies in the images and marks the non-human objects, so that only human bodies are retained for recognition. The result of CF generation based on the SSDA is shown in Figure 9.

Figure 9: The crop frame generation results based on the SSDA (the detected person is labeled with a confidence of 1.00)

In Figure 9, the human body in the picture is identified after the SSDA performs TD on the OF, with a confidence level of 1.00. After OpenCV obtains the four coordinates of the detection frame, the OF can be clipped to generate the CF. As an example, the CF generation result for the stop jump shot in BTA is shown in Figure 10.

Figure 10: Cropped frame of the stop jump shot

In Figure 10, the OF is processed by the SSDA to obtain the human body in the detection frame of each consecutive frame image of the video; to avoid interference from spectators and other players, the confidence threshold is adjusted up to 0.95 to obtain the CF image of the target human body. Since the CF sizes are inconsistent, the CFs are first preprocessed to a consistent size before the subsequent AR with the dual-resolution 3D-CNN. The OF and CF loss values of the dual-resolution 3D-CNN are shown in Figure 11.

Figure 11: Loss values of the original frame (a) and the crop frame (b) on the training and validation sets over 50 epochs

From Figure 11(a), the loss value of the OF on the training set starts to converge at epoch 19, with a loss value of about 6.0; on the validation set, the loss value of the OF starts to converge at epoch 21, with a loss value of about 6.8. From Figure 11(b), the loss value of the CF on both the training and validation sets starts to converge at epoch 20, with loss values of 5.1 and 5.9, respectively. These results show that the 3D-CNN converges better on the CF, although the convergence speeds of the OF and CF are similar. The training confusion matrices of the OF and CF are shown in Figure 12.
Figure 12: Training confusion matrices of the original frame (a) and the crop frame (b) over the six actions (crossover, Eurostep, feint, Sham God, spin move, and stop jump shot)

In Figure 12(a), among the ARs of the OF, the highest classification accuracy is achieved for the stop jump shot action, with an average recognition rate of about 88.3%, though it can still be misclassified as a spin move or feint. In addition, there are misclassification cases in the recognition of the Eurostep, crossover, and Sham God movements, which may be because all three movements are characterized by a change of direction. In Figure 12(b), in the AR of the CF, there are still misjudgment cases in the recognition of the Eurostep, crossover, and Sham God actions, but their misjudgment probabilities have decreased. In addition, the classification accuracy for the stop jump shot action is still the highest, with an average recognition rate of about 90.3%. It can be seen that the accuracy of AR improves after the SSDA recognizes the target and interfering factors are removed. The recognition accuracies of the OF and CF are shown in Figure 13.

Figure 13: Recognition accuracy of the original frame (a) and the crop frame (b) on the training and validation sets over 50 epochs

In Figure 13(a), the recognition accuracy (RA) of the OF on both the training set and the validation set rises with the number of epochs, and both reach a stable state at epoch 30, at which point the accuracies are about 0.91 and 0.81, respectively; however, the RA of the validation set fluctuates considerably in the epoch interval from 10 to 20. In Figure 13(b), the RA of the CF training and validation sets follows basically the same trend with epoch and also reaches a stable state at epoch 30, at which point the RAs are about 0.92 and 0.81, respectively. After the training of the OF and CF, the obtained weight files are used as parameters for the recognition of consecutive frames, and the OF and CF features are fused and then classified by the SVM to realize continuous-frame AR. The confusion matrix for feature fusion is shown in Figure 14.
Figure 14: Confusion matrix of feature fusion over the six actions

In Figure 14, compared with the single-resolution recognition of the OF and CF, the RA of each BTA after feature fusion has increased: the stop jump shot is basically no longer misjudged as a spin move, and the misjudgment probabilities of the Eurostep, crossover, and Sham God actions have decreased significantly. The average recognition rate after dual-resolution feature fusion is about 94.6%, significantly higher than single-resolution recognition. These results show that dual-resolution recognition has obvious advantages over single-resolution recognition. Figure 15 displays the dual-resolution 3D-CNN's recognition performance under various frame inputs.

Figure 15: Recognition accuracy and precision (a) and F1-score and error rate (b) under different frame inputs (7, 10, 13, 16, and 19 frames)

In Figure 15(a), the RA and precision gradually increase with the frame input. When the frame input is 7, the accuracy and precision are 81.4% and 79.6%, respectively. When the frame input is 10, the accuracy and precision are 85.6% and 83.8%, respectively. When the frame input is 16, the accuracy and precision increase to 93.8% and 91.5%, respectively. In Figure 15(b), as the frame input increases, the F1-score gradually rises while the error rate gradually decreases. When the frame input is 7, the F1-score and the error rate are 0.79 and 0.19, respectively. When the frame input is 10, the F1-score and the error rate are 0.82 and 0.14, respectively. When the frame input is 16, the F1-score and the error rate are 0.92 and 0.05, respectively. AR performance improves as the frame input grows because, with a small frame count, the corresponding timing information is difficult to recognize, so recognition performance on continuous frames is poor. However, a high frame count also increases training time and training volume, which shows that an appropriate frame input has a large impact on AR performance for continuous frames. The F1-score and recall of the dual-resolution 3D-CNN with different frame inputs are shown in Figure 16.

Figure 16: The F1-score and recall of the dual-resolution 3D-CNN with different frame inputs

Figure 16 illustrates that the F1-score and recall of the dual-resolution 3D-CNN increase with the number of input frames. The F1-score and recall are 0.70 and 0.80, respectively, when the frame input is 13, and 0.81 and 0.85, respectively, when the frame input is 14.
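For reference, the metrics reported in Figures 15 and 16 can be derived from a confusion matrix as in the following sketch; the 3×3 matrix is synthetic, purely for illustration, and the macro-averaging is an assumption about how the per-class values are combined:

```python
import numpy as np

# Per-class precision and recall are averaged over classes, F1 is their
# harmonic mean, and the error rate is one minus the overall accuracy.
cm = np.array([[90, 6, 4],
               [5, 88, 7],
               [3, 5, 92]])                  # rows: true class, cols: predicted

tp = np.diag(cm).astype(float)
precision = (tp / cm.sum(axis=0)).mean()     # column sums: predicted totals
recall = (tp / cm.sum(axis=1)).mean()        # row sums: true totals
f1 = 2 * precision * recall / (precision + recall)
error_rate = 1 - tp.sum() / cm.sum()

print(f"precision={precision:.3f} recall={recall:.3f} "
      f"F1={f1:.3f} error rate={error_rate:.3f}")
```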
The results of the ablation experiments of the BTA recognition model based on the SSDA and dual-resolution 3D-CNN are presented in Table 2.

Table 2: Results of ablation experiments of the basketball technical action recognition model
SSD | Dual-resolution 3D-CNN | Accuracy
× | × | 80.5%
× | √ | 84.3%
√ | × | 81.7%
√ | √ | 89.2%

Table 2 indicates that, without the cropped frames generated by the SSDA, the average RA of BTA is 80.5% for the ordinary 3D-CNN and 84.3% for the dual-resolution 3D-CNN. With the cropped frames generated by the SSDA, the average RA is 81.7% for the ordinary 3D-CNN and 89.2% for the dual-resolution 3D-CNN. These results demonstrate that the SSDA and the dual-resolution 3D-CNN are both effective in enhancing the RA of actions.

5 Discussion

To improve the efficiency of video analysis, a video cropping and action recognition method based on the SSDA and 3D-CNN was proposed. The average recognition rate of this algorithm was about 94.6%, and the RA and precision gradually improved with increasing frame input. A 3D-CNN pose estimation method based on an end-to-end hierarchical model and physical constraints was proposed by Xu and Zhang [11]. This method enabled gesture estimation by reconstructing the 3D spatial structure of hands using depth images and converting the hand model into a voxel mesh via 3D-CNN. Nevertheless, although the method can effectively utilize depth images, it struggles to accurately identify objects in complex environments. In comparison with the research of Xu and Zhang, the proposed method exhibited a higher degree of RA. The reason for the enhanced RA is that the research team opted to use the SSD object detection algorithm to generate a basketball technical action crop-frame dataset. The SSD detection algorithm processes the original frame images of BTA videos, thereby identifying the human body within the detection box of each frame image. To prevent background elements from interfering with the identification of the human subject, the confidence threshold is raised to 0.95, allowing the final human image to be captured. A hyperspectral image classification model based on a hybrid spectral convolutional neural network was proposed by Roy et al. [15]. The model employs a 3D-CNN for joint spatial-spectral feature representation, subsequently acquiring a more advanced spatial representation through a 2D-CNN. However, the model fixes the input resolution and cannot process images of disparate resolutions. The dual-resolution 3D-CNN model is capable of handling continuous technical action frames at different input resolutions and fusing features at different resolutions to enhance recognition capability.

6 Conclusion

Compared with other sports, basketball has distinctive technical movement characteristics. Therefore, in basketball teaching, coaches often teach technical movements through video analysis. In addition, in basketball leagues, analyzing the technical characteristics of opposing athletes through game videos can also support the formulation of tactics. In view of this, in order to improve the efficiency of video analysis, the study proposes a video cropping and AR method based on the SSDA and 3D-CNN. The experimental results revealed that the SSDA accurately recognizes the human body in the image and labels its confidence level, while non-human objects are marked with other colors and their detections are discarded.
In basketball AR, the loss values of the OF on the training and validation sets start to converge at epochs 19 and 21, with loss values of 6.0 and 6.8, respectively. The loss values of the CF on both the training and validation sets start to converge at epoch 20, with loss values of 5.1 and 5.9, respectively. The 3D-CNN achieved its highest classification accuracy on the OF, about 88.3%, for the stop jump shot action. In addition, because the Eurostep, crossover, and Sham God actions all share change-of-direction features, misclassifications occurred among them. The recognition of the CF by the 3D-CNN was similar to that of the OF, but its average recognition rate was 90.3%, higher than that of the OF. The RA of the OF and CF both increased with epoch and reached a stable state at epoch 30. The accuracies of the OF training and validation sets were 0.91 and 0.81, respectively, and the RAs of the CF training and validation sets were about 0.92 and 0.81, respectively, with little difference between the two. In addition, after fusing the OF features and CF features, the average recognition rate was about 94.6%, significantly higher than single-resolution recognition. Furthermore, with increasing frame input, the RA and precision gradually increased. When the frame input was 7, the accuracy and precision were 81.4% and 79.6%, respectively. When the frame input was 16, the accuracy and precision increased to 93.8% and 91.5%, respectively. These results reveal that BTA recognition performance on continuous frames is significantly improved after the SSDA recognizes and removes interfering factors and feature fusion is performed. Although the study achieved some results in BTA recognition on continuous frames, the recognition performance of the proposed AR method on occluded players remains unverified, because the basketball technique videos used are all unoccluded. In view of this, future research will focus on how to improve the AR accuracy for occluded players. In addition, how to apply the proposed AR algorithm to short-video platforms or the web is also a worthwhile research direction.

References
[1] Z. Guo, Y. Hou, R. Xiao, C. Li, and W. Li, “Motion saliency based hierarchical attention network for action recognition,” Multimedia Tools and Applications, vol. 82, no. 3, pp. 4533-4550, 2022. https://doi.org/10.1007/s11042-022-13441-7
[2] G. V. Reddy, K. Deepika, L. Malliga, D. Hemanand, C. Senthilkumar, and S. Gopalakrishnan, “Human action recognition using difference of gaussian and difference of wavelet,” Big Data Mining and Analytics, vol. 6, no. 3, pp. 336-346, 2023. https://doi.org/10.26599/BDMA.2022.9020040
[3] L. Liu, L. Yang, W. Chen, and X. Gao, “Dual-view 3D human pose estimation without camera parameters for action recognition,” IET Image Processing, vol. 15, no. 14, pp. 3433-3440, 2021. https://doi.org/10.1049/ipr2.12277
[4] G. Zhang, Y. Rao, C. Wang, W. Zhou, and X. Ji, “A deep learning method for video-based action recognition,” IET Image Processing, vol. 15, no. 14, pp. 3498-3511, 2021. https://doi.org/10.1049/ipr2.12303
[5] Q. Men, E. S. L. Ho, H. P. H. Shum, and H. Leung, “Focalized contrastive view-invariant learning for self-supervised skeleton-based action recognition,” Neurocomputing, vol. 537, no. 7, pp. 198-209, 2023. https://doi.org/10.1016/j.neucom.2023.03.070
[6] Y. Liu, R. Liu, S. Wang, D. Yan, B. Peng, and T.
Zhang, “Video face detection based on improved ssd model and target tracking algorithm,” Journal of Web Engineering, vol. 21, no. 2, pp. 545-567, 2022. https://doi.org/10.13052/jwe1540-9589.21218
[7] M. Zhao, Y. Zhong, D. Sun, and Y. Chen, “Accurate and efficient vehicle detection framework based on ssd algorithm,” IET Image Processing, vol. 15, no. 13, pp. 3094-3104, 2021. https://doi.org/10.1049/ipr2.12297
[8] D. Liu, S. Gao, W. Chi, and D. Fan, “Pedestrian detection algorithm based on improved ssd,” International Journal of Computer Applications in Technology, vol. 65, no. 1, pp. 25-35, 2021. https://doi.org/10.1504/ijcat.2021.199965999996
[9] Y. Li, D. Han, H. Li, X. Zhang, B. Zhang, and Z. Xiao, “Multi-block SSD based on small object detection for UAV railway scene surveillance,” Chinese Journal of Aeronautics, vol. 33, no. 6, pp. 1747-1755, 2020. https://doi.org/10.1016/j.cja.2020.02.024
[10] Z. Xu, Z. Wu, and W. Fan, “Improved SSD-assisted algorithm for surface defect detection of electromagnetic luminescence,” Proceedings of the Institution of Mechanical Engineers, vol. 235, no. 5, pp. 761-768, 2021. https://doi.org/10.1177/1748006X2199538
[11] Z. Xu, and W. Zhang, “3D CNN hand pose estimation with end-to-end hierarchical model and physical constraints from depth images,” Neural Network World, vol. 33, no. 1, pp. 35-48, 2023. https://doi.org/10.14311/NNW.2023.33.003
[12] A. Rehman, M. A. Khan, T. Saba, Z. Mehmood, and N. Ayesha, “Microscopic brain tumor detection and classification using 3D CNN and feature selection architecture,” Microscopy Research and Technique, vol. 84, no. 1, pp. 133-149, 2020. https://doi.org/10.1002/jemt.23597
[13] V. Pareek, S. Chaudhury, and S. Singh, “Hybrid 3DCNN-RBM network for gas mixture concentration estimation with sensor array,” IEEE Sensors Journal, vol. 21, no. 21, pp. 24263-24273, 2021. https://doi.org/10.1109/JSEN.2021.3105414
[14] A. Chaddad, P. Sargos, and C. Desrosiers, “Modeling texture in deep 3D CNN for survival analysis,” IEEE Journal of Biomedical and Health Informatics, vol. 25, no. 7, pp. 2454-2462, 2020. https://doi.org/10.1109/JBHI.2020.3025901
[15] S. K. Roy, G. Krishna, S. R. Dubey, and B. B. Chaudhuri, “HybridSN: Exploring 3D-2D CNN feature hierarchy for hyperspectral image classification,” IEEE Geoscience and Remote Sensing Letters, vol. 17, no. 2, pp. 277-281, 2020. https://doi.org/10.1109/LGRS.2019.2918719
[16] W. Chen, Y. Qiao, and Y. Li, “Inception-SSD: An improved single shot detector for vehicle detection,” Journal of Ambient Intelligence and Humanized Computing, vol. 13, no. 1, pp. 5047-5053, 2022. https://doi.org/10.1007/s12652-020-02085-w
[17] Z. Lyu, D. Zhang, and J. Luo, “A GPU-free real-time object detection method for apron surveillance video based on quantized MobileNet-SSD,” IET Image Processing, vol. 16, no. 8, pp. 2196-2209, 2022. https://doi.org/10.1049/ipr2.12483
[18] Q. Huang, Y. Zhang, Y. Huang, C. Mi, Z. Zhang, and W. Mi, “Two-stage container keyhole location algorithm based on optimized SSD and adaptive threshold,” Journal of Computational Methods in Sciences and Engineering, vol. 22, no. 5, pp. 1559-1571, 2022. https://doi.org/10.3233/JCM-226135
[19] W. A. Okaishi, A. Zaarane, I. Slimani, I. Atouf, and M. Benrabh, “A vehicular queue length measurement system in real-time based on SSD network,” Transport and Telecommunication, vol. 22, no. 1, pp. 29-38, 2021. https://doi.org/10.2478/ttj-2021-0003
[20] Y. Pan, M. Lin, Z. Wu, H. Zhang, and Z.
Xu, “Caching-aware garbage collection to improve performance and lifetime for nand flash ssds,” IEEE Transactions on Consumer Electronics, vol. 67, no. 2, pp. 141-148, 2021. https://doi.org/10.1109/TCE.2021.3067604
[21] T. E. Trueman, A. K. Jayaraman, S. Jasmine, G. Ananthakrishnan, and P. Narayanasamy, “A multi-channel convolutional neural network for multilabel sentiment classification using Abilify oral user reviews,” Informatica, vol. 47, no. 1, pp. 109-113, 2023. https://doi.org/10.31449/inf.v47i1.3510
[22] L. Yao, and Z. Ge, “Cooperative deep dynamic feature extraction and variable time-delay estimation for industrial quality prediction,” IEEE Transactions on Industrial Informatics, vol. 17, no. 6, pp. 3782-3792, 2021. https://doi.org/10.1109/TII.2020.3021047
[23] K. Bhosle, and V. Musande, “Evaluation of deep learning CNN model for recognition of devanagari digit,” Artificial Intelligence and Applications, vol. 1, no. 2, pp. 114-118, 2023. https://doi.org/10.47852/bonviewAIA3202441
[24] R. A. A. Salvador, and P. C. Naval, “Towards a feasible hand gesture recognition system as sterile non-contact interface in the operating room with 3D convolutional neural network,” Informatica, vol. 46, no. 1, pp. 1-12, 2022. https://doi.org/10.31449/inf.v46i1.3442
[25] Y. Li, S. Yang, Y. Zheng, and H. Lu, “Improved point-voxel region convolutional neural network: 3d object detectors for autonomous driving,” IEEE Transactions on Intelligent Transportation Systems, vol. 23, no. 7, pp. 9311-9317, 2022. https://doi.org/10.1109/TITS.2021.3071790
[26] K. Zhu, W. Lu, J. Liu, X. Luo, and X. Zhao, “A lightweight 3d convolutional neural network for deepfake detection,” International Journal of Intelligent Systems, vol. 36, no. 9, pp. 4990-5004, 2021. https://doi.org/10.1002/int.22499
[27] D. Kim, M. E. Lipford, H. He, Q. Ding, V. Ivanovic, S. N. Lockhart, S. Craft, C. T. Whitlow, and Y. Jung, “Parametric cerebral blood flow and arterial transit time mapping using a 3D convolutional neural network,” Magnetic Resonance in Medicine, vol. 90, no. 2, pp. 583-595, 2023. https://doi.org/10.1002/mrm.29674
[28] R. Opfer, J. Krueger, L. Spies, A. C. Ostwaldt, H. H. Kitzler, S. Schippling, and R. Buchert, “Automatic segmentation of the thalamus using a massively trained 3d convolutional neural network: higher sensitivity for the detection of reduced thalamus volume by improved inter-scanner stability,” European Radiology, vol. 33, no. 3, pp. 1852-1861, 2022. https://doi.org/10.1007/s00330-022-09170-y
[29] L. He, B. Ding, H. Wang, and T. Zhang, “An optimal 3d convolutional neural network based lipreading method,” IET Image Processing, vol. 16, no. 1, pp. 113-122, 2021. https://doi.org/10.1049/ipr2.12337
[30] B. Masoudi, S. Daneshvar, and S. N. Razavi, “Multi-modal neuroimaging feature fusion via 3d convolutional neural network architecture for schizophrenia diagnosis,” Intelligent Data Analysis, vol. 25, no. 3, pp. 527-540, 2021. https://doi.org/10.3233/IDA-205113