https://doi.org/10.31449/inf.v48i12.5968
Informatica 48 (2024) 137–152

A Face Recognition Method for Sports Video Based on Feature Fusion and Residual Recurrent Neural Network

Xu Yan
Physical Education Department, North China Electric Power University, Baoding 071003, China
E-mail: yanxu1986ncepu63@163.com

Keywords: face recognition, feature fusion, residual recurrent neural network, reconstruction of ternary confrontation, sports

Received: April 1, 2024

Face recognition technology has become part of people's daily life and work, and it is also widely applied to sports videos. A video face recognition technology based on feature fusion and a residual recurrent neural network is proposed to address the image pose deviation caused by non-cooperative situations. Because low-resolution facial images lack a large amount of high-frequency data, a ternary adversarial reconstruction network is first proposed. It achieves correct image matching through the spatial distance between images, improving the robustness of the model. Facial recognition in video sequences requires more precise extraction of key features. Therefore, this study introduced a residual recurrent neural network to optimize it, and designed feature fusion and recognition network modules to compensate for and extract relevant information from preceding and following frames. Finally, performance verification analysis was conducted on the proposed model, indicating that the recognition accuracy of the recognition system reached 98.3%. In summary, the constructed residual recurrent neural network based on the ternary adversarial reconstruction network framework can effectively achieve video-oriented facial recognition.

Povzetek: A method for recognizing faces in sports videos is presented, based on feature fusion and a recurrent neural network.

1 Introduction

Face recognition is widely used in fields such as security, payment, and intelligent devices.
Traditional portrait recognition for still images and matching facial features is no longer sufficient to meet people's daily needs. More research is focused on non-cooperative dynamic video face recognition scenarios. In real conditions, images often have features such as changes in lighting angle, changes in posture, and masking [1]. In the context of these more challenging circumstances, it is imperative that devices are capable of facial recognition with enhanced speed and precision. The development of modern equipment has a strong demand for face recognition in non-cooperative scenes. For example, high-speed trains or buses in transportation, as well as border controls, require control of visitor access. Face recognition for videos can ensure the security of these places, providing great help for law enforcement personnel. It not only ensures the timeliness of recognition, but also provides early prediction for suspicious behavior. In the commercial field, ordinary people can use facial recognition technology to achieve non-contact payment, improve efficiency, and also provide traffic management for commercial venues. When identifying users, the automatically read customer data can provide a basis for behavioral analysis of the model to achieve personalized recommendations and optimize user experience [2]. In summary, enhancing the recognition performance of realistic image scenes such as posture deviation is the key to the development of dynamic facial recognition technology. This study proposes a Residual Recurrent Neural Network (R-RNN) based on a ternary adversarial reconstruction network framework to solve the problems of face recognition in non-cooperative scenes. The content includes four parts. The first part introduces the current development status of video facial recognition technology. The second part designs and analyzes the framework of the ternary adversarial reconstruction network and the R-RNN. 
The third part verifies the performance of the recognition model through simulation experiments. The fourth part further summarizes the experimental data.

2 Related works

Facial reconstruction is the key to achieving face recognition in non-cooperative scenes. Deep learning is widely used in the field of visual perception due to its excellent feature extraction performance. Wang et al. believed that the convolutional operators in deep Convolutional Neural Networks (CNN) need to be further improved to overcome the limitations of "local" kernels. Therefore, they proposed a "non-local" model of Speckle Converter (SpT) UNet, with a Pearson correlation coefficient of 0.989 [3]. Theoharis believed that facial reconstruction is an important branch of computer vision. Compared to 2D data, 3D facial data could better avoid the impact of lighting and pose deviation. A recognition model combining it with CNN was proposed, and the reliability of the model was verified through simulation testing using the Florence dataset [4]. Tewari et al. utilized a deep CNN to achieve automatic encoding and achieved 3D facial reconstruction in non-cooperative scenes. The input data of its decoder adopted well-defined code vectors. The encoder extracted useful semantic parameters from a single input image. This facial reconstruction model based on deep CNN had good performance [5]. Dib et al. applied differentiable ray tracing to facial reconstruction and simulated it under different light intensities using a digital analog light table. Their reconstruction optimization equation facilitated reconstruction in complex scenarios, such as self-shadowing, and could also estimate parameters such as diffuse reflection. Finally, the performance analysis of the model in real scenarios verified its effectiveness [6]. Super resolution and pose correction are two important branches of facial reconstruction.
Dastmalchi and Aghaeinia proposed a deep CNN based on a pixel loss function for discriminating high-resolution facial images. The Generative Adversarial Network (GAN) was introduced to solve the problem of model over-smoothing, and achieved an accuracy of 86.1% on the LFW dataset [7]. Nagar et al. proposed the use of position blocks for facial super-resolution optimization to address the impact of Gaussian impulse noise on low-resolution images, because ordinary facial super-resolution methods are highly susceptible to noise. Its principal component analysis could analyze the matrix of pixel noise details and eliminate impulse noise. Residual learning was used to update the training set and weaken Gaussian noise, and the effectiveness of this method was verified [8]. Teng et al. proposed alternating improvement algorithms to address the issue of insufficient accuracy in deep learning facial reconstruction, especially for low-resolution images. This algorithm improved network performance through alternating training of dual convolutional networks, which are used for facial reconstruction and attribute correction, respectively. Finally, the reliability of this method was demonstrated on the CelebA dataset [9]. Sharma used GAN for facial recognition to enhance the performance of super-resolution images, and experiments have shown that its error rate was only 0.001% [10]. In conclusion, deep learning can be employed for video facial recognition. However, the existing state-of-the-art (SOTA) methods in the literature still exhibit deficiencies in their ability to cope with the aforementioned complex situations, particularly in terms of recognition performance in situations involving posture deviation and non-cooperative scenarios. Meanwhile, SOTA models often perform well under laboratory conditions, but in the real world, their robustness is insufficient due to the variability of poses and the unpredictability of non-cooperative scenarios.
The paper selects the ternary GAN as the fundamental framework for facial reconstruction, which directly addresses the challenges in this field. This approach facilitates the advancement of facial recognition technology in non-cooperative scenarios, particularly in enhancing recognition accuracy, optimizing model generalizability, and accelerating real-time processing capabilities. The study also utilizes an R-RNN based on feature fusion as a facial recognition model. The R-RNN optimizes the feature fusion process by combining the advantages of triplet loss and Recurrent Neural Networks (RNN), which enhances the robustness of the model to occlusion and lighting changes. Table 1 shows the summary of the related works.

Table 1: Summary of the related work

| Researchers | Key contributions | Models or techniques used | Main results or performance indicators |
| Wang et al. [3] | Propose a "non-local" model for SpT UNet | Deep CNN | Achieved a Pearson correlation coefficient of 0.989 |
| Theoharis [4] | Propose a 3D facial data recognition model combined with CNN | 3D facial data and CNN | Model reliability validated through simulation tests |
| Tewari et al. [5] | Realizing 3D facial reconstruction in non-cooperative scenarios | Automatic encoding of deep CNN | Model demonstrated good expressiveness |
| Dib et al. [6] | Applying Differentiable Ray Tracing to Facial Reconstruction | Differentiable Rendering | Achieved reconstruction under complex conditions like self-shadowing |
| Dastmalchi and Aghaeinia [7] | Propose a deep CNN based on pixel loss function | Deep CNN + GAN | Achieved an accuracy of 86.1% in the LFW dataset |
| Nagar et al. [8] | Propose using position blocks for facial super-resolution optimization | Principal Component Analysis (PCA) + Residual Learning | Mitigated the impact of Gaussian impulse noise on low-resolution images |
| Teng et al. [9] | Propose alternating improvement algorithms to enhance the accuracy of facial reconstruction | Dual Convolutional Networks with alternating training | Method's reliability validated in the CelebA dataset |
| Sharma [10] | Using GAN for Facial Recognition | GAN | Experiments showed an error rate of only 0.001% |
| The research of this article | Propose a Face Residual Recurrent Neural Network (FR-RNN) model based on the TL-GAN framework to optimize face recognition in non-cooperative scenarios | TL-GAN + FR-RNN + TensorFlow | The TL-GAN+FR-RNN model has an accuracy of up to 98.3% in facial recognition tasks and performs best on the IJB-A dataset, with an accuracy of 96.3% |

3 A face recognition method for sports video based on feature fusion and R-RNN

The basic framework of a recognition network model based on ternary adversarial reconstruction is studied, aiming to solve the problem of face recognition in non-cooperative states. This model uses a GAN as the basic architecture, and introduces a ternary adversarial reconstruction recognition network for construction. Finally, the GAN is used for training. The proposed video facial recognition technology based on the feature fusion R-RNN includes two modules: feature fusion and facial recognition. This study also utilizes R-RNN to improve the accuracy of feature fusion [11, 12].

3.1 Construction of a basic framework for reconstructing and identifying network models based on ternary confrontation

Facial recognition technology based on video clips is widely used. In the field of sports, this technology can be used to achieve facial recognition and violation detection functions. However, facial recognition in non-cooperative states often experiences a sudden decrease in recognition accuracy due to issues such as posture deviation and clarity. Therefore, the design of a facial reconstruction recognition system that addresses the aforementioned defects is necessary and has research prospects.
This study is based on the GAN architecture and introduces the concept of feature mapping to achieve multi-pose facial correction. This type of network can reduce its dependence on supervised learning and also improve computational accuracy. The basic structure is shown in Figure 1.

Figure 1: The basic structure of GAN (components: real image, randomly matched image, sampling, generator, discriminator, correct/mistake decision, loss, backward transfer update)

The basic principle of GAN is the mutual confrontation between the generator and discriminator during training. When facing low-resolution side face images, the lack of high-frequency data often leads to feature similarity issues after facial correction. Therefore, the triplet loss theory is introduced to construct a Triplet Loss Constrained GAN for Reconstruction and Recognition (TL-GAN). The principle of distance measurement is shown in Figure 2.

Figure 2: Principle of distance measurement in image recognition (labels: draw in, zoom out, target class, interference class 1/2)

The basic principle of the model is to achieve accurate matching of the same face through spatial distance in high-dimensional space. It is divided into three major modules, namely the low-resolution correction module, the super-resolution module, and the discrimination module. The first two are combined to form the generative network. Low-resolution pose correction uses convolutional and deconvolution networks. The convolutional layer and adaptive attention module respectively achieve feature extraction and detail data acquisition. The input-output relationship of the reconstruction network formed by the encoder-decoder combination is shown in equation (1).

$$I^{HR} = F_{dec}\left(F_{enc}\left(I^{LR}\right)\right) \tag{1}$$

In equation (1), $I^{LR}$ represents the network input value. $I^{HR}$ represents the reconstructed network output value. $F_{dec}$ and $F_{enc}$ are the decoder and encoder, respectively. The input-output relationship of the recognition network is shown in equation (2).

$$\text{Identity discrimination} = F_{class}\left(I^{HR}\right) \tag{2}$$

In equation (2), $F_{class}$ represents the classification recognition network. Ordinary triplet losses have uncertainty. The difference between positive and negative examples can cause defects such as under-fitting in the model. The shortest distance ternary loss function $L_{triple}$ is introduced for optimization, as shown in equation (3).

$$L_{triple} = \sum_{i=1}^{N} \left[ \frac{\left\| F_{enc}(I_i^{LR}) - F_{enc}(I_i^{LR+}) \right\|_2^2 - \left\| F_{enc}(I_i^{LR}) - F_{enc}(I_i^{LR-}) \right\|_2^2}{\left\| F_{enc}(I_i^{LR+}) - F_{enc}(I_i^{LR-}) \right\|_2} \right] \tag{3}$$

In equation (3), $i = 1, 2, \ldots, N$ represents the serial number of the portrait. The corresponding triplet is $(I_i^{LR}, I_i^{LR+}, I_i^{LR-})$, which consists of the low-resolution profile image and the low-resolution frontal image of the portrait, as well as the low-resolution frontal image of another person. The vector features of an image are mapped through the encoder. The function uses the distance between vectors for similarity recognition. Triplets are constructed at random, which may leave the negative target $I_i^{LR-}$ too far from the other images and ultimately slow down the convergence of the model. The method of selecting $I_i^{LR-}$ in this study is shown in equation (4).

$$I_i^{LR-} = \arg\min \left\| F_{enc}(I_i^{LR}) - F_{enc}(I_i^{LR+}) \right\|_2 \tag{4}$$

That is, the negative sample is chosen as the nearest neighbor in the encoded feature space. After selecting faces with a focus on distinguishing similar features, the training speed and fitting degree of the model can be improved to a certain extent. The training process of the model is shown in Figure 3.

Figure 3: Training flow of ternary adversarial network (components: nearest neighbor binary selection, face recognition model, encoder, decoder, triplet loss, reconstructed face image, pixel loss, real face image, discriminant network, WGAN-GP loss)

The triplet image enters the feature space through an encoding network.
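The nearest-neighbor selection of the negative sample can be illustrated with a minimal sketch on toy feature vectors. It assumes that the candidate from another identity closest to the frontal (positive) encoding is chosen, and it uses the common hinge-style triplet term with a hypothetical `margin` rather than the exact form of equation (3); the function names and the 2-D vectors are illustrative only.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def nearest_negative(reference, candidates):
    """Return the other-identity feature vector closest to the reference."""
    return min(candidates, key=lambda c: sq_dist(reference, c))


def triplet_term(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet term (an assumption, not the paper's equation (3))."""
    return max(0.0, sq_dist(anchor, positive) - sq_dist(anchor, negative) + margin)


# Toy 2-D "encodings": profile image (anchor), frontal image of the same
# person (positive), and frontal images of other people (candidates).
anchor = [0.0, 0.0]
positive = [0.1, 0.0]
candidates = [[2.0, 2.0], [0.3, 0.1], [1.0, -1.0]]

hard_negative = nearest_negative(positive, candidates)
loss_hard = triplet_term(anchor, positive, hard_negative)
loss_easy = triplet_term(anchor, positive, candidates[0])
print(hard_negative, loss_hard, loss_easy)
```

In this toy case the distant candidate yields a zero loss term and therefore no training signal, whereas the nearest candidate yields a positive term, which is the motivation for replacing random negatives with nearest neighbors.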
The model utilizes the shortest distance triplet loss to constrain its spatial distance, and also introduces pixel loss to further enhance the model's perception of the front face. GAN is the final training platform. The discriminator structure is WGAN-GP. In summary, the target loss of the model is the weighted sum of multiple losses, as shown in equation (5).

$$L_{SR} = \alpha L_{pixel} + \beta L_{triple} + \gamma L_{WGAN\text{-}GP} \tag{5}$$

In equation (5), $L_{pixel}$ represents pixel loss. $L_{WGAN\text{-}GP}$ represents WGAN-GP loss. $\alpha$, $\beta$, and $\gamma$ respectively represent the weights of the corresponding losses. Usually, the weight values of the latter two losses are higher. With continued training, the objective function of an ordinary GAN can cause the gradient to vanish. The earth mover's (Wasserstein) distance can be applied to the feature vectors obtained after feature extraction to compute the adversarial loss, thereby facilitating continuous optimization of the generator and discriminator. This, in turn, stabilizes the network structure. When the discriminator compares images, the WGAN-GP loss uses the distance between the data to determine similarity, as shown in equation (6).

$$L_{WGAN\text{-}GP} = D\left(I^{HR}\right) - D\left(I^{HR+}\right) + \lambda \left( \left\| \nabla D\left(I^{HR}\right) \right\|_2 - 1 \right)^2 \tag{6}$$

In equation (6), $D$ represents the output data of the discriminator. $I^{HR}$ represents the reconstruction of the front face image. $I^{HR+}$ represents a true facial image. $\lambda$ is a value preset by the user. Pixel loss can be used to constrain surface similarity, as shown in equation (7).

$$L_{pixel} = \frac{1}{3m^2} \sum_{r=1}^{3} \sum_{i=1}^{m} \sum_{j=1}^{m} \left( I_{i,j,r}^{HR} - I_{i,j,r}^{HR+} \right)^2 \tag{7}$$

In equation (7), $m$ is the number of samples in the training set. $i$, $j$, and $r$ are different categories.

3.2 Optimization and construction of video facial recognition technology based on feature fusion R-RNN

Although the above framework can achieve optimized facial reconstruction, the extraction of key features in videos is not precise enough.
Therefore, the FR-RNN is introduced to optimize it. The model can be roughly divided into two modules: feature fusion and face recognition. The difference between video super-resolution and ordinary image super-resolution lies in the strong correlation between the preceding and following frames of the former, so feature fusion is necessary. Feature fusion essentially involves supplementing information from images that are interrelated, so as to enhance their data expression capabilities. The feature fusion technology based on deep learning is superior to traditional fusion technologies, including the fusion between feature maps [13]. The pooling layer is one of the manifestations of feature map fusion. The self-attention mechanism is one of the manifestations of fusion between feature maps, as shown in Figure 4.

Figure 4: Basic framework of pooling layer and attention mechanism feature fusion. (a) Pooling layer feature fusion framework (example values 5.0, 2.0, 4.0, 3.0 and 0.5, 0.2, 0.4, 0.3; average pooling: 3.5, max pooling: 5.0, random pooling: ?). (b) Basic framework of attention mechanism model (input, Softmax, output).

In Figure 4 (a), when the image passes through the pooling layer, average pooling selects the average feature value. Max pooling selects the maximum value. Random pooling selects a value according to a probability matrix. Figure 4 (b) shows the connection implemented in the channel. The attention mechanism achieves important data extraction by training the adaptive weight matrix. However, this method only targets the connection of a single image and is not suitable for feature map fusion in videos. Common feature fusion techniques for video image super-resolution include 2D/3D convolution and RNN. The difference between 2D and 3D convolutional fusion is mainly reflected in the dimension of feature fusion. The former is directly connected and fused at the channel.
The latter takes video as input and connects it in both spatial and temporal dimensions. RNN needs to calculate the context correlation of video images, and the resulting hidden states can be connected to the current frame [14, 15]. The feature description performance of ordinary images in the network has been significantly improved. The most important task at present is to optimize the feature fusion of video frame images so that facial features can be extracted more accurately. To enhance the model's ability to perform feature fusion, this study analyzes and compares the three fusion technologies, as shown in Figure 5.

Figure 5: Different fusion technology frameworks. (a) 2D convolution fusion (concatenation C, Conv2D, output $R_t$). (b) 3D convolution fusion. (c) RNN fusion (inputs with $R_{t-1}$ and $h_{t-1}$, concatenation C, Conv2D and ReLU layers, outputs $R_t$ and $h_t$).

In Figure 5, 2D convolutional fusion is shown in equation (8).

$$R_t = W_{net2D}\left\{ \mathrm{Concatenate}\left[ I_{t-T}, \ldots, I_{t+T} \right] \right\} \tag{8}$$

In equation (8), $\mathrm{Concatenate}[I_{t-T}, \ldots, I_{t+T}]$ represents the cascading operation of the sequence image. The dimensional data is represented as $NC \times H \times W$. $N = 2T + 1$ represents the length of the video sequence, which is the number of input image channels. $H \times W$ represents the length and width dimensions of the image. The 3D convolutional fusion is shown in equation (9).

$$R_t = W_{net3D}\left\{ \mathrm{Concatenate}\left[ I_{t-T}, \ldots, I_{t+T} \right] \right\} \tag{9}$$

In equation (9), the convolutional kernel becomes three-dimensional, and its motion on the spatio-temporal axis is achieved by inputting a video sequence, facilitating its extraction of spatio-temporal feature data. RNN utilizes 2D convolutional encoding to achieve fusion and obtain the output of the current frame and the hidden state of the subsequent frames.
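The channel bookkeeping behind equation (8) can be made concrete with a minimal sketch that counts how many input channels the first convolution sees. It assumes RGB frames (3 channels each) and, for the recurrent case, a cell fed with the previous frame, the current frame, the previous output, and the previous hidden state; `out_channels` and `hidden_channels` are hypothetical values chosen only for illustration.

```python
def fusion_2d_input_channels(T, channels_per_frame=3):
    """Channels after concatenating the N = 2T + 1 frames of equation (8)."""
    n_frames = 2 * T + 1
    return n_frames * channels_per_frame


def recurrent_input_channels(channels_per_frame=3, out_channels=3,
                             hidden_channels=16):
    """Channels seen by a recurrent cell fed two frames, the previous output,
    and the previous hidden state. out_channels and hidden_channels are
    illustrative assumptions, not values from the paper."""
    return 2 * channels_per_frame + out_channels + hidden_channels


for T in (1, 3, 10):
    print(T, fusion_2d_input_channels(T), recurrent_input_channels())
```

Under these assumptions the concatenated input grows linearly with the temporal radius `T`, while the recurrent input stays fixed regardless of sequence length.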
The first two fusion technologies can better achieve feature fusion when faced with a small number of sequences, but the increase in sequence length will ultimately lead to computational difficulties. In contrast, the input of RNN is only the preceding and following frame data, and the fusion is achieved by utilizing the hidden states of the two. This recursive method is more suitable for recognizing video images with longer sequences. Therefore, this technology is the most suitable for feature fusion. FR-RNN can further mitigate the potential gradient vanishing defects in feature fusion. Its basic structure is shown in Figure 6.

Figure 6: Overall input value of FR-RNN (data entry: frames $I_{t-3}, \ldots, I_{t+3}$ and hidden states; feature fusion: concatenation C followed by 2D convolution fusion and ReLU activation layers, outputting the frame hidden state and the output; face recognition: prediction matrix, average method, identity identification)

In Figure 6, the overall input value of FR-RNN is a video image sequence, with dimensions of $m' \times m'$. After feature fusion, the recognition network can obtain the discriminative data of the current frame, as shown in equation (10).

$$R_t = W_{conv2D}\left\{ \hat{x}_K \right\} \tag{10}$$

In equation (10), $R_t$ represents the output value of the current frame. $\hat{x}_k$ represents the encoding of the channel connection, and its specific calculation is shown in equation (11).

$$\begin{cases} \hat{x}_k = \hat{x}_{k-1} + \phi\left(\hat{x}_{k-1}\right), & k \in [1, K] \\ \hat{x}_0 = W_{conv2D}\left\{ \left[ I_{t-1}, I_t, o_{t-1}, h_{t-1} \right] \right\} \\ h_t = W_{conv2D}\left\{ \hat{x}_K \right\} \end{cases} \tag{11}$$

In equation (11), $\hat{x}_k$ represents the channel connection encoding for the next frame. $K$ represents the number of standard residual blocks.
$\hat{x}_0$ represents the connection of the four parameters on the channel, namely the hidden state of the previous frame, the output of the previous frame, the input of the current frame, and the input of the previous frame. $h_t$ represents the hidden state of the current frame. $\phi(\hat{x}_{k-1})$ represents the final residual block output. The prediction matrix obtained through feature fusion and recognition is shown in equation (12).

$$IP_t = F_{class}\left(R_t\right) \tag{12}$$

In equation (12), $IP_t$ represents the final prediction matrix. $F_{class}$ represents the recognition network. Among them, Light Convolutional Neural Networks (LightCNN) and Visual Geometry Group Face (VGG-Face) are two common facial recognition network models. Subsequently, the averaging method is used to process the prediction matrices obtained above, and the final output of the recognition model can be obtained, as shown in equation (13).

$$\text{Identity discrimination} = \frac{1}{N} \sum_{t=1}^{N} IP_t \tag{13}$$

The training of FR-RNN includes feature fusion for each frame and training for recognition modules. By introducing cross entropy to construct a loss function and calculating the deviation between the predicted label vector and its true value, equation (14) can be obtained.

$$Loss = -\frac{1}{n} \sum_{x} \left[ y \log \hat{y} + (1 - y) \log\left(1 - \hat{y}\right) \right] \tag{14}$$

In equation (14), $x$ represents the sample. $n$ represents the number of samples. $\hat{y}$ represents the predicted label vector. $y$ represents the actual label vector. Video facial recognition is a classification problem. Softmax regression is introduced to process it, as shown in equation (15).

$$L_{softmax} = -\frac{1}{m_1} \sum_{i=1}^{m_1} \log \frac{e^{W_{y_i}^T x_i + b_{y_i}}}{\sum_{j=1}^{n_1} e^{W_j^T x_i + b_j}} \tag{15}$$

In equation (15), $m_1$ represents the batch size. $n_1$ represents the number of classes. $y_i$ represents the specific category. $x_i$ represents the $i$-th depth feature under the corresponding category. $W_j$ represents the $j$-th column of the weight. $b$ represents the bias. $e^{W_j^T x_i + b_j}$ represents the fully connected layer output.
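The averaging of equation (13) and the softmax loss of equation (15) can be sketched on toy data as follows. This is a minimal illustration that takes the fully connected layer outputs (logits) as given; the three-frame, three-class scores and the helper names are hypothetical and not taken from the experiments.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one list of class scores."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_loss(batch_logits, labels):
    """Mean negative log-probability of the true class, as in equation (15)."""
    losses = [-math.log(softmax(z)[y]) for z, y in zip(batch_logits, labels)]
    return sum(losses) / len(losses)


def average_predictions(frame_predictions):
    """Average per-frame class probabilities over N frames, as in equation (13)."""
    n = len(frame_predictions)
    return [sum(column) / n for column in zip(*frame_predictions)]


# Hypothetical fully connected outputs for three frames and three identities.
frame_logits = [[2.0, 0.5, 0.1], [1.5, 1.0, 0.2], [0.3, 0.4, 0.2]]
frame_probs = [softmax(z) for z in frame_logits]
video_probs = average_predictions(frame_probs)
identity = max(range(len(video_probs)), key=video_probs.__getitem__)
loss = softmax_loss(frame_logits, [0, 0, 0])
print(identity, video_probs, loss)
```

In this toy case the third frame alone would favor identity 1, but averaging over the sequence still selects identity 0, which illustrates why the per-frame prediction matrices are averaged before the identity is decided.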
The increase in the output value weight of the fully connected layer can reduce model losses. Overall, the architecture of the FR-RNN model is based on GANs, which include generators and discriminators for generating and discriminating images and are trained through the adversarial process between them. To enhance the model's ability for facial image reconstruction, especially when processing low-resolution images, TL-GAN is introduced in the study, which utilizes the triplet loss theory to improve the model performance. The FR-RNN model further integrates the R-RNN, which consists of two modules: feature fusion and face recognition. Its main responsibility is to optimize the accuracy of feature fusion to better process keyframe features in video sequences. Following standard GAN practice, the model adopts convolutional layers, pooling layers, and ReLU activation functions. In terms of selecting the loss function, the FR-RNN model combines pixel loss, WGAN-GP loss, and triplet loss, among which WGAN-GP loss is used in particular to improve the stability of the training process and avoid the problem of gradient vanishing during optimization.

4 Performance simulation analysis of feature fusion and FR-RNN model in video face recognition

The performance simulation experiment of the video facial recognition model is divided into two parts: the analysis of each module and the overall analysis. The performance analysis of the model itself includes four aspects: Structural Similarity (SSIM), recognition accuracy, rank N, and model size. The comparative analysis of the models covers the recognition accuracy of each model [16-18].

4.1 Performance verification analysis of TL-GAN framework and FR-RNN module

Table 2 shows the parameters of the experimental environment.

Table 2: Experimental environment and parameter settings

| Experimental environment | Parameter setting |
| Graphics card | GTX1080Ti |
| Operating system | Linux |
| Deep learning framework | TensorFlow |
| Pre-treatment method | Double- and three-wire interpolation |
| Image size | 32×32 |
| Loss weights $\alpha$/$\beta$/$\gamma$ (pixel/triplet/WGAN-GP) | 0.01/0.1/0.1 |
| Learning rate | 0.001 |
| Weight attenuation | 0.01 |
| Batch | 20 |
| Stochastic gradient optimization | Adam ($\beta_1$ = 0.9, $\beta_2$ = 0.999) |
| Preconditioning | Double trilinear interpolation method |
| Data sets | Multi-PIE / IJB-A |

The study uses data augmentation techniques such as rotation, scaling, cropping, and color transformation. The batch size used during the training process is 20, and the Adam optimizer is used with parameters $\beta_1$ = 0.9 and $\beta_2$ = 0.999. Then, performance analysis is conducted on the image-based ternary adversarial reconstruction recognition network. The 250 participants in Session01 are selected from 6 different angles under the same lighting and facial expressions, and allocated in a 4:1 ratio as the training and testing sets, respectively. The data preprocessing steps include using the double trilinear interpolation method to process images to improve image quality and prepare for subsequent facial recognition analysis. All images are uniformly adjusted to a size of 32×32 pixels, and this standardization helps to accelerate model training and reduce computational resource consumption. A comprehensive data distribution strategy has been devised for the Multi-PIE dataset with the objective of ensuring the diversity of images in terms of angles and lighting conditions. This approach aims to simulate the various challenges that face recognition may encounter in the real world. The paper compares the accuracy of the proposed FR-RNN algorithm with SOTA and traditional RNN algorithms, as shown in Figure 7.

Figure 7: Comparison of accuracy of different algorithms. (a) The first group of experiments. (b) Second set of experiments. (Axes: iteration time (s) versus accuracy; curves: FR-RNN, RNN, SOTA.)

As shown in Figure 7 (a), in the first set of tests, the accuracy of the proposed FR-RNN algorithm reaches 94.3%, while the accuracy of the SOTA algorithm is 89.7%, and the accuracy of the traditional RNN algorithm is 82.5%. In Figure 7 (b), in the second set of tests, the accuracy of the proposed FR-RNN, SOTA, and traditional RNN algorithms reaches 94.5%, 89.4%, and 81.2%, respectively. This indicates that the FR-RNN algorithm is more effective in handling video facial recognition tasks. This study introduces the commonly used Two Path Generative Adversarial Network (TP-GAN), Face Normalization Model (FNM), LightCNN, and VGG-Face as controls. SSIM and recognition accuracy are used as evaluation indicators for model performance. Table 3 shows the experimental results.

Table 3: Performance comparison of portrait reconstruction model

| Index | Model | ±15° | ±30° | ±45° | ±60° | ±75° | ±90° |
| SSIM | TL-GAN | 0.7105 | 0.6643 | 0.6512 | 0.6327 | 0.6249 | 0.6078 |
| SSIM | TP-GAN | 0.6987 | 0.6541 | 0.6276 | 0.6048 | 0.5809 | 0.5699 |
| SSIM | FNM | 0.6847 | 0.6362 | 0.6009 | 0.5806 | 0.5462 | 0.4621 |
| Accuracy rate /% | LightCNN | 87.76 | 85.73 | 69.42 | 30.64 | 10.34 | 2.13 |
| Accuracy rate /% | VGG-Face | 89.81 | 87.76 | 71.45 | 32.74 | 12.42 | 4.15 |
| Accuracy rate /% | TL-GAN+LightCNN | 98.16 | 95.92 | 93.88 | 91.84 | 85.72 | 71.44 |
| Accuracy rate /% | TL-GAN+VGG-Face | 98.16 | 96.95 | 94.91 | 93.86 | 87.73 | 74.77 |
| Accuracy rate /% | TP-GAN+LightCNN | 88.71 | 88.08 | 85.42 | 77.73 | 67.45 | 54.68 |
| Accuracy rate /% | FNM+LightCNN | 94.62 | 92.51 | 89.77 | 85.31 | 77.25 | 61.21 |

Due to the large amount of data, the study only conducts SSIM performance comparison analysis on the TP-GAN, FNM, and TL-GAN models. As the angle decreases, the facial reconstruction ability of each model improves correspondingly. FNM has the lowest SSIM.
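The SSIM rows of Table 3 can be summarized with a short calculation. The sketch below assumes that a percentage gap between two models is expressed relative to the TL-GAN value, which is the convention under which the figures quoted for the SSIM comparison are reproduced; the helper names are illustrative.

```python
# SSIM values from Table 3, ordered by angle: ±15, ±30, ±45, ±60, ±75, ±90 degrees.
ssim = {
    "TL-GAN": [0.7105, 0.6643, 0.6512, 0.6327, 0.6249, 0.6078],
    "TP-GAN": [0.6987, 0.6541, 0.6276, 0.6048, 0.5809, 0.5699],
    "FNM": [0.6847, 0.6362, 0.6009, 0.5806, 0.5462, 0.4621],
}


def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


def relative_gap(reference, other):
    """Gap between two scores as a fraction of the reference score (assumed convention)."""
    return (reference - other) / reference


avg = {name: mean(values) for name, values in ssim.items()}
gap_avg_fnm = relative_gap(avg["TL-GAN"], avg["FNM"])
gap_90_tp = relative_gap(ssim["TL-GAN"][-1], ssim["TP-GAN"][-1])
gap_90_fnm = relative_gap(ssim["TL-GAN"][-1], ssim["FNM"][-1])
spread_tl = ssim["TL-GAN"][0] - ssim["TL-GAN"][-1]

print(round(avg["TL-GAN"], 4), round(spread_tl, 4))
print(round(100 * gap_avg_fnm, 2), round(100 * gap_90_tp, 2), round(100 * gap_90_fnm, 2))
```

The same two helpers can be applied to the accuracy rows of Table 3 to obtain per-model averages across the six angles.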
The average SSIM value of TL-GAN is 0.6486, which is 9.79% higher than the average SSIM value of FNM. At ±90°, the SSIM of TL-GAN is 6.23% and 23.97% higher compared to TP-GAN and FNM, respectively (both gaps expressed relative to the TL-GAN value). The facial reconstruction image of TP-GAN is relatively clear, but there may be artifacts in the image. The FNM facial reconstruction image has relatively more detailed features, but as the angle increases, its facial correction performance clearly decreases, so the model has higher requirements for lighting. TL-GAN can maintain relatively stable facial correction performance, with SSIM values differing by only 0.1027 between the ±90° and ±15° conditions. Therefore, this model can better extract correct detail features and achieve more accurate facial correction. In the accuracy verification experiment, LightCNN and VGG-Face are combined with the three reconstruction models. Experiments have shown that the recognition performance of LightCNN and VGG-Face alone is much lower than that of the other models. At ±90° in particular, their recognition accuracy is lower than 5%. Their average recognition accuracy is 47.59% and 49.64%, respectively. The average recognition accuracy of TL-GAN+LightCNN, TL-GAN+VGG-Face, TP-GAN+LightCNN, and FNM+LightCNN is 89.49%, 91.06%, 76.89%, and 83.41%, respectively. Therefore, the combinations that include TL-GAN have the best performance, reaching over 89%. On average, the recognition accuracy of TL-GAN+VGG-Face is 1.57, 14.17, and 7.65 percentage points higher than that of the TL-GAN+LightCNN, TP-GAN+LightCNN, and FNM+LightCNN models, respectively. This study selects the IJB-A, YTC, and YTF datasets as experimental samples to validate the performance of the FR-RNN model. IJB-A contains high-quality frontal data and corresponding video sequences, which can be used to simulate sports videos.
Because its low-resolution, multi-pose data are very similar to actual video surveillance footage, it is a better choice for recognition verification. The YTC and YTF datasets lack high-quality frontal images. Therefore, this study uses the FaceChoose algorithm to select high-quality images and skip the rest. This study first fixes the number of residual blocks K to 5 and conducts experiments on each model separately. The experimental results are shown in Figure 8.

Figure 8: Comparison of performance of each recognition model when K=5. Panels: (a) The recognition rate performance of each model (rank-N (%) versus number of experiments); (b) Size of each model (model size (M)). Models: F-2DCNN, F-3DCNN, FR-RNN, F-RNN.

The above models all use the same training dataset and test set sequence length. Feature extraction is unified as a residual block module. The data ratio for training and testing is set to 8:2. Figure 8 compares the rank-N recognition rate and model size of each model. The rank-N recognition rate is the proportion of test samples for which the correct result appears among the top N recognition results of the model. The size of the model indirectly reflects its parameter count and computational speed. In Figure 8 (a), with K=5, F-3DCNN has the highest recognition accuracy, with an average of 98.2%, which is 3.6% and 2.7% higher than F-2DCNN and F-RNN, respectively. This indicates that F-3DCNN is relatively accurate. The average rank-N value of FR-RNN is 98.0%, which is only 0.2% lower than F-3DCNN, so the difference in recognition accuracy between the two is not significant. In Figure 8 (b), the average size of F-3DCNN is 11.7M, while the average sizes of the F-2DCNN, F-RNN, and FR-RNN models are 4.4M, 4.2M, and 4.2M, respectively.
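The rank-N recognition rate described above can be illustrated with a short Python sketch. It assumes the usual closed-set identification reading of the metric: a probe counts as correct if its true identity is among the N gallery identities with the highest similarity scores. The function name `rank_n_rate` and the toy score matrix are hypothetical and are not taken from the experiments in this paper.

```python
def rank_n_rate(scores, true_ids, n):
    """Fraction of probes whose true identity is among the n best-scoring gallery identities.

    scores[p][g] is the similarity between probe p and gallery identity g;
    true_ids[p] is the index of the correct gallery identity for probe p.
    """
    hits = 0
    for probe_scores, true_id in zip(scores, true_ids):
        # Gallery indices sorted from the highest to the lowest similarity.
        ranked = sorted(
            range(len(probe_scores)),
            key=lambda g: probe_scores[g],
            reverse=True,
        )
        if true_id in ranked[:n]:
            hits += 1
    return hits / len(true_ids)


# Toy data (made up for illustration): 4 probes scored against 3 gallery identities.
scores = [
    [0.9, 0.1, 0.0],  # true identity 0 is ranked first
    [0.2, 0.5, 0.3],  # true identity 1 is ranked first
    [0.6, 0.3, 0.1],  # true identity 1 is ranked second
    [0.5, 0.4, 0.1],  # true identity 2 is ranked third
]
true_ids = [0, 1, 1, 2]

rank1 = rank_n_rate(scores, true_ids, 1)
rank2 = rank_n_rate(scores, true_ids, 2)
rank3 = rank_n_rate(scores, true_ids, 3)
print(rank1, rank2, rank3)
```

By construction the rate cannot decrease as N grows, which is why rank-1 is the strictest setting.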
The much larger size of F-3DCNN indicates that, although it has the highest recognition accuracy, its operating speed is much lower than that of the other models. Based on the two indicators, FR-RNN has good comprehensive recognition performance. The experiment then resets the K value to 10, and the experimental results are shown in Figure 9.

Figure 9: Comparison of performance of each recognition model when K=10. Panels: (a) The recognition rate performance of each model (rank-1 (%) versus number of experiments); (b) Size of each model (model size (M)). Models: F-2DCNN, F-3DCNN, FR-RNN.

In Figure 9, there is no experimental data for F-RNN, as the model experienced gradient vanishing during training. This also indirectly confirms the effectiveness of residual connections in improving model performance. In Figure 9 (a), as the number of residual blocks increases, the average recognition rate of FR-RNN becomes higher than that of F-3DCNN. Its average rank-N value is 98.5%, which is 3.7% and 0.4% higher than the F-2DCNN and F-3DCNN models, respectively. In Figure 9 (b), the size of FR-RNN has increased to some extent, but it is still the smallest of the three. The sizes of FR-RNN, F-2DCNN, and F-3DCNN are 5.7M, 5.9M, and 11.6M, respectively. This is because the F-2DCNN and F-3DCNN models rely on a larger number of parameters to process long sequence data, whereas FR-RNN fuses features recursively. Its model size therefore depends only weakly on the sequence length, which makes it easier to handle real-time and video sequence data while avoiding gradient vanishing in the hidden states and remaining stable as the sequence length changes.

4.2 Performance verification and comparative analysis of recognition models based on TL-GAN framework and FR-RNN module

This study further analyzes the stability of the overall model.
By setting different sequence lengths and numbers of residual blocks, the study checks whether the resulting differences in accuracy are too large, as a way to determine the stability of the model. Table 4 shows the experimental results.

Table 4: Stability verification of the overall model

| Parameter | Value | Recognition accuracy |
|---|---|---|
| Frame number | 5 | 95.3% |
| Frame number | 10 | 96.9% |
| Frame number | 15 | 97.1% |
| Frame number | 20 | 97.5% |
| Frame number | 25 | 97.9% |
| Number of residual blocks | 3 | 92.5% |
| Number of residual blocks | 4 | 92.9% |
| Number of residual blocks | 5 | 94.6% |
| Number of residual blocks | 6 | 94.7% |
| Number of residual blocks | 7 | 95.2% |
| Number of residual blocks | 8 | 95.3% |
| Number of residual blocks | 9 | 95.1% |
| Number of residual blocks | 10 | 95.0% |

The frame-number results in Table 4 show that the proposed TL-GAN+FR-RNN model handles long sequence data well: as the number of frames increases, its accuracy also improves. The difference in recognition accuracy between 5 and 25 frames is 2.6%. However, once the number of frames reaches a certain level, the gain in accuracy shrinks: the difference between 20 and 25 frames is only 0.4%. The experiments show that TL-GAN+FR-RNN can achieve high-precision recognition and better feature fusion when the number of frames changes. The residual-block results in Table 4 show that an appropriate increase in the number of residual blocks has a positive impact on recognition performance, and that the accuracy levels off once a certain number is exceeded. The recognition accuracy with 3 and 8 residual blocks differs by 2.8%. When the number of residual blocks is between 7 and 10, the recognition accuracy remains stable in the range [95.0%, 95.3%]. Taking 95% as the baseline, the average deviation is 0.15%. Therefore, changing the number of residual blocks does not affect the stability of the overall model, and stacking residual blocks can improve the accuracy of the model to a certain extent.
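The stability figures quoted from Table 4 can be reproduced with a few lines of Python. The sketch below is only an arithmetic check on the tabulated accuracies; the variable names are illustrative, and "average deviation" is assumed to mean the mean absolute deviation from the 95% baseline.

```python
# Recognition accuracy (%) copied from Table 4.
accuracy_by_frames = {5: 95.3, 10: 96.9, 15: 97.1, 20: 97.5, 25: 97.9}
accuracy_by_blocks = {3: 92.5, 4: 92.9, 5: 94.6, 6: 94.7, 7: 95.2, 8: 95.3, 9: 95.1, 10: 95.0}

# Gain from lengthening the sequence: overall, and over the last step only.
frame_gain_total = accuracy_by_frames[25] - accuracy_by_frames[5]
frame_gain_last = accuracy_by_frames[25] - accuracy_by_frames[20]

# Gain from stacking residual blocks, from 3 blocks up to the best setting (8 blocks).
block_gain = accuracy_by_blocks[8] - accuracy_by_blocks[3]

# Plateau for 7 to 10 residual blocks: range and mean absolute deviation from 95%.
plateau = [accuracy_by_blocks[k] for k in range(7, 11)]
plateau_range = (min(plateau), max(plateau))
baseline = 95.0  # baseline used in the text
mean_abs_dev = sum(abs(a - baseline) for a in plateau) / len(plateau)

print(f"frames 5 -> 25: +{frame_gain_total:.1f}")
print(f"frames 20 -> 25: +{frame_gain_last:.1f}")
print(f"blocks 3 -> 8: +{block_gain:.1f}")
print(f"plateau range: {plateau_range}, mean |deviation| = {mean_abs_dev:.2f}")
```

The shrinking gain between the last two frame settings and the narrow plateau range are the two quantities that the stability argument rests on.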
Neural Aggregation Network (NAN), Attention Deep Reinforcement Learning (ADRL), and Template Depth Reconstruction Model (TDRM) are introduced and compared with the proposed algorithm, as shown in Figure 10.

Figure 10: Comparison of recognition accuracy of various face recognition models. Panels: (a) Model performance in IJB-A dataset; (b) Model performance in YTC/YTF dataset. Vertical axis: recognition accuracy (%).

In Figure 10, on the IJB-A dataset, the TL-GAN+FR-RNN model with 10 residual blocks has the highest accuracy, 96.3%. Following the order in the figure, it is 2.1%, 2.7%, and 0.5% higher than the other models, respectively. Gradient vanishing occurred in ADRL. On the YTC data, the recognition rates of the models do not differ significantly, with a mean of 98%, and the proposed method is 0.3% lower than the NAN model. On the YTF dataset, the difference in recognition rates among the models is also small. The accuracy of the TL-GAN+FR-RNN model reaches 96.2%, which is 1% higher than the TDRM model. In summary, the TL-GAN+FR-RNN model can output more features for hidden states, which improves the recognition rate.

5 Discussion

The FR-RNN model based on the TL-GAN framework proposed in this study has demonstrated superior performance in video facial recognition tasks, especially in non-cooperative scenarios. Simulation analysis shows that the recognition accuracy of the model on the IJB-A dataset reaches 96.3%, which is outstanding in current research and surpasses the SOTA methods in existing literature. For example, although the SpT UNet model proposed by Wang et al. [3] achieved a Pearson correlation coefficient of 0.989, it was not as accurate in facial recognition as the proposed model.
Although the 3D facial recognition model of Theoharis [4] has been validated for reliability through simulation testing, it has limitations in processing video sequence data. Although the deep CNN proposed by Dastmalchi and Aghaeinia [7] achieved an accuracy of 86.1% on the LFW dataset, the proposed model demonstrated higher performance in more complex video face recognition tasks. The performance differences may be mainly attributed to several key factors. Firstly, the TL-GAN framework effectively combines the advantages of triplet loss and GAN to better handle pose deviation and lighting changes. Secondly, the FR-RNN model optimizes feature fusion and enhances feature expression ability through its residual recurrent mechanism. Finally, the training strategy adopted, including pixel loss and WGAN-GP loss, helps to improve the robustness and accuracy of the model. The proposed model provides novel contributions to the field of facial recognition, particularly in optimizing facial recognition in non-cooperative scenarios, processing long sequence data, and considering real-time performance and computational efficiency. These characteristics render the model not only innovative in theory but also potentially valuable in practical applications, particularly in face recognition tasks that require the processing of complex scenes and long sequence data.

6 Conclusion

To further improve facial recognition technology for videos, this study proposed an FR-RNN model based on the TL-GAN framework. The purpose was to solve the problem of low recognition accuracy caused by pose deviation and lighting in non-cooperative scenarios. The simulation analysis of the model showed that in the experiment of the TL-GAN framework, the average SSIM was 0.6486, which was 9.79% higher than the FNM model.
In the experiment on the FR-RNN model, when K=5, the mean rank-N value of the FR-RNN model was 98.0%, which was only 0.3% lower than the highest recognition rate of the F-3DCNN model. Its model size was lower than 7.5M, so its overall performance was the best. In the verification of the overall model stability, when the number of residual blocks was between 7 and 10, the recognition accuracy remained stable in the range [95.0%, 95.3%]. Taking 95% as the baseline, the average deviation was 0.15%. In the comparison between TL-GAN+FR-RNN and other models on the IJB-A dataset, the accuracy of TL-GAN+FR-RNN using 10 residual blocks was 96.3%, which was 2.7% higher than the ADRL model. Its accuracy also remained at a high level on the other datasets, giving it the best overall performance. However, the study still has some shortcomings: model complexity needs to be reduced and computational speed improved. In addition, future work needs to extend the model to capture multiple faces in order to adapt to actual scene requirements.

References

[1] D. Tang, and J. Hao, “A deep map transfer learning method for face recognition in an unrestricted smart city environment,” Sustainable Energy Technologies and Assessments, vol. 52, no. 8, pp. 102207-102215, 2020. https://doi.org/10.1016/j.seta.2022.102207
[2] F. Zhang, N. Liu, L. Chang, F. Duan, and X. Deng, “Edge-guided single facial depth map super-resolution using CNN,” IET Image Processing, vol. 14, no. 17, pp. 4708-4716, 2021. https://doi.org/10.1049/iet-ipr.2019.1623
[3] Y. Wang, H. Wang, and M. Gu, “High performance "non-local" generic face reconstruction model using the lightweight Speckle-Transformer (SpT) UNet,” Advances in Optoelectronics, vol. 6, no. 2, pp. 220049-220058, 2023. https://doi.org/10.29026/oea.2023.220049
[4] T. Theoharis, “Robust 3D face reconstruction using one/two facial images,” Journal of Imaging, vol. 7, no. 9, pp. 169-176, 2021. https://doi.org/10.3390/jimaging7090169
[5] A. Tewari, M. Zollhofer, F. Bernard, P. Garrido, H. Kim, P. Perez, and C. Theobalt, “High-fidelity monocular face reconstruction based on an unsupervised model-based face autoencoder,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 42, no. 2, pp. 357-370, 2020. https://doi.org/10.1109/TPAMI.2018.2876842
[6] A. Dib, G. Bharaj, J. Ahn, C. Thébault, P. H. Gosselin, M. Romeo, and L. Chevallier, “Practical face reconstruction via differentiable ray tracing,” Computer Graphics Forum, vol. 40, no. 2, pp. 153-164, 2021. https://doi.org/10.1111/cgf.142622
[7] H. Dastmalchi, and H. Aghaeinia, “Super-resolution of very low-resolution face images with a wavelet integrated, identity preserving, adversarial network,” Signal Processing. Image Communication: A Publication of the European Association for Signal Processing, vol. 1, no. 107, pp. 116755-116767, 2022. https://doi.org/10.1016/j.image.2022.116755
[8] S. Nagar, A. Jain, P. K. Singh, and B. AK, “Mixed-noise robust face super-resolution through residual-learning based error suppressed nearest neighbor representation,” Information Sciences, vol. 1, no. 546, pp. 121-145, 2021. https://doi.org/10.1016/j.ins.2020.08.002
[9] Z. Teng, X. Yu, and C. Wu, “Iterative attribute augmentation network for face image super resolution,” Electronics Letters, vol. 57, no. 22, pp. 854-856, 2021. https://doi.org/10.1049/ell2.12285
[10] R. J. N. Sharma, “An improved technique for face age progression and enhanced super-resolution with generative adversarial networks,” Wireless Personal Communications: An International Journal, vol. 114, no. 3, pp. 2215-2233, 2020. https://doi.org/10.1007/s11277-020-07473-1
[11] F. Zhang, J. Zhao, L. Wang, and F. Duan, “3D face model super-resolution based on radial curve estimation,” Applied Sciences, vol. 10, no. 3, pp. 1047-1047, 2020. https://doi.org/10.3390/app10031047
[12] A. B. Deshmukh, and N. U. Rani, “Optimization-driven kernel and deep convolutional neural network for multi-view face video super resolution,” International Journal of Digital Crime and Forensics, vol. 12, no. 3, pp. 77-95, 2020. https://doi.org/10.4018/IJDCF.2020070106
[13] X. Wang, Y. Guo, B. Deng, and J. Zhang, “Lightweight photometric stereo for facial details recovery,” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, vol. 13, no. 15, pp. 740-749, 2020. https://doi.org/10.1109/CVPR42600.2020.00082
[14] H. M. R. Afzal, S. Luo, M. K. Afzal, G. Chaudhary, M. Khari, and S. Kumar, “3D face reconstruction from single 2D image using distinctive features,” IEEE Access, vol. 8, no. 1, pp. 180681-180689, 2020. https://doi.org/10.1109/ACCESS.2020.3028106
[15] M. Sari, A. Moussaouı, and A. Hadid, “Automated facial expression recognition using deep learning techniques: an overview,” International Journal of Informatics and Applied Mathematics, vol. 3, no. 1, pp. 39-53, 2020.
[16] P. Su, “Immersive online biometric authentication algorithm for online guiding based on face recognition and cloud-based mobile edge computing,” Distributed and Parallel Databases, vol. 41, no. 1, pp. 133-154, 2023. https://doi.org/10.1007/s10619-021-07351-0
[17] G. Veselov, A. Tselykh, A. Sharma, and R. Huang, “Applications of artificial intelligence in evolution of smart cities and societies,” Informatica, vol. 45, no. 5, 2021. https://doi.org/10.31449/inf.v45i5.360
[18] Y. Yang, and X. Song, “Research on face intelligent perception technology integrating deep learning under different illumination intensities,” Journal of Computational and Cognitive Engineering, vol. 1, no. 1, pp. 32-36, 2022. https://doi.org/10.14016/j.cnki.1001-9227.2023.04.049