https://doi.org/10.31449/inf.v48i14.5939
Informatica 48 (2024) 157–170

Image Semantic Quality Evaluation Model for Human-Machine Hybrid Intelligence: A Gradient-based Uncertainty Calculation Method

Ziyan Yue 1, Senyang Lu 2, Hong Lu 3*
1 Key Laboratory of Educational Informatization for Nationalities (YNNU), Ministry of Education, Yunnan Normal University, Kunming 650500, China
2 Faculty of Art and Communication, Kunming University of Science and Technology, Kunming 650500, China
3 Academy of Fine Arts, Nanjing Xiaozhuang University, Nanjing 211171, China
E-mail: luhongeys@126.com
* Corresponding author

Keywords: gradient, uncertainty calculation, human-machine hybrid, semantic distortion, image quality evaluation

Received: March 20, 2024

With the advancement of human-machine hybrid intelligence technology, images play an increasingly important role in interaction, and accurately evaluating their semantic quality becomes crucial. Traditional evaluation models, however, are limited in this environment, and new methods are needed to improve evaluation accuracy. To this end, an evaluation model based on a gradient-based uncertainty calculation method is proposed. The study analyzes semantic distortion perception at two levels. First, overall recognition ability is analyzed through the average recognition accuracy on the dataset. Second, recognition ability is analyzed through the confidence on a single sample. Experiments showed that machines tolerated distortion better than humans but were weaker in generalization and stability. The proposed method performed well on the more complex CIFAR-100 dataset, achieving the lowest FPR at 95% TPR of 5.28% and the lowest detection error rate of 3.65%. In addition, the accuracy of the proposed framework reached 68.03%, which was significantly better than 59.83% for humans and 40.16% for machines, indicating that it can effectively combine the advantages of different decision-makers. This study is expected to provide new ideas for image quality evaluation and to improve the application performance and user experience of images in multiple fields.

Povzetek: An image semantic quality evaluation model based on a gradient uncertainty calculation method is proposed, with the aim of improving human-machine interaction.

1 Introduction

With the rapid development of social media, e-commerce, and digital media, people interact with a large amount of image content every day, including uploading, sharing, searching, and shopping. It is necessary to accurately evaluate the quality and semantic content of images to filter out junk images, improve the relevance of search results, and recommend related products, thereby providing a better user experience [1]. Traditional image quality assessment (IQA) methods mainly focus on pixel-level image quality, such as noise, blur, and distortion [2]. However, these methods often fail to capture the semantic content of the image and cannot determine whether the image meets the user's needs or contains important information. The gradient-based uncertainty calculation method, a commonly used technique in deep learning and machine learning, can be used to evaluate the uncertainty of models [3-4]. On this basis, a subjective dataset is created to evaluate semantic distortion in surveillance scenarios, and confidence measures are used to analyze the recognition ability of humans and machines on single samples.
Finally, the above content is applied to human-machine joint decision-making, and a decision-making framework is designed. The research aims to develop more accurate and intelligent methods to improve the performance of applications such as understanding, searching, and retrieving image content. The innovation of the research lies in providing a new method suitable for human-machine collaborative decision-making and in introducing gradient uncertainty calculation to estimate image quality more accurately. The research consists of five parts. Part 1 introduces the research background, problems, and solutions of the image semantic quality evaluation model for human-machine hybrid intelligence. Part 2 reviews the current research status of image semantic quality evaluation models based on human-machine hybrid intelligence and summarizes the existing difficulties and methodological shortcomings. Part 3 establishes a human-machine hybrid intelligent image semantic quality evaluation model based on gradient uncertainty. Part 4 evaluates the performance of the model through comparative experiments and efficiency validation. Part 5 summarizes the research methods and discusses their shortcomings as well as future research directions.

The application of human-machine hybrid intelligent systems is becoming increasingly prominent in computer vision. The evaluation of image semantic quality involves multiple fields such as image processing, computer vision, and deep learning. Ensuring the evaluation accuracy of image semantic quality is important for system performance in numerous applications such as autonomous driving, facial recognition, and safety monitoring. Traditional image quality evaluation methods often fail to meet the requirements of human-machine hybrid intelligent systems because they neglect the uncertainty of deep learning models, which may lead to misleading results in practical applications. Therefore, improving the performance of image semantic quality evaluation models for human-machine hybrid intelligence is an urgent issue. Some scholars have conducted a series of studies on this topic. Sara et al. proposed the structural similarity index and feature similarity index methods to measure the structural and feature similarity between the restored object and the original object based on perceptual comparison. Experiments showed that these methods provided perceptual and saliency-based errors that are more easily understood [5]. Jang et al. proposed an automatic crack evaluation technique based on deep learning, aiming to achieve high-quality crack evaluation by using semantic segmentation to process images. The experimental results showed that the method achieved a high accuracy of 90.92% and a high recall of 97.47% [6]. Fu et al. proposed an evaluation method that combined rules and semantic logic based on deep learning semantic evaluation, aiming to provide evaluation regularity and semantic decodability. The experimental results showed that this method could automatically evaluate regularity and semantics and exhibited higher validity [7]. Liu et al. proposed a video reconstruction and semantic quality evaluation method based on the characteristics of upstream streaming media, aiming to further improve the accuracy of semantic evaluation.
Experiments showed that block compressed sensing required fewer sensing and storage resources at the front end, achieving a lightweight observation matrix and supporting block-by-block or parallel transmission [8]. On the other hand, gradient-based uncertainty calculation methods are widely developed and applied in science, engineering, and machine learning. Giraud et al. proposed a workflow for integrating geological modeling uncertainty information so that geological uncertainty information can be used as local constraints. The experiment showed that this method significantly reduced the uncertainty of interpretation [9]. Ouziala et al. proposed a method for detecting small-scale faults involving parameter uncertainty, aiming to ensure optimal detection performance by optimizing thresholds. The experiment showed that this method improved the sensitivity of residuals to small faults and ensured optimal early detection [10]. Puzyrev proposed a deterministic gradient-based method aimed at solving least squares optimization problems in high-dimensional parameter spaces. The experiment showed that the method exhibited excellent performance in accuracy, generalization ability, and training cost [11]. Pevey et al. proposed a gradient optimization design method for nuclear reactor core components based on continuous and discrete material neutronics objectives, aiming to make full use of gradient information for design optimization. The experiment indicated that the adjoint gradient calculation method has potential application prospects in nuclear system design optimization [12]. In summary, image semantic quality evaluation for human-machine hybrids has seen some development. However, problems remain, including low model generalization ability, high computational complexity, and a lack of deep semantic understanding. On the other hand, gradient-based uncertainty calculation methods can be used to estimate the uncertainty of model output, thereby improving the reliability and predictive ability of the model. Therefore, this study proposes a gradient-based uncertainty calculation oriented human-machine hybrid intelligent image semantic evaluation model. This model is expected to improve evaluation accuracy, promote human-machine collaboration, and enhance the interpretability of intelligent systems. The literature review classification is shown in Table 1.

Table 1: Literature review classification

Author | Method | Achieved goals | Disadvantage
Sara et al. [5] | Structural similarity index method and feature similarity index method | Measure structural and feature similarities between restored objects and original objects based on perceptual comparisons | From a representation perspective, SSIM and FSIM are normalized, but MSE and PSNR are not.
Jang et al. [6] | Automatic crack assessment technology based on deep learning | Use semantic segmentation technology to process images to achieve high-quality crack assessment | Detection time is longer.
Fu et al. [7] | Evaluation method combining deep learning semantics and semantic logic | Provide evaluation regularity and semantic decodability | Pattern complexity increases decoding time.
Liu et al. [8] | Video reconstruction and semantic quality evaluation method based on the characteristics of upstream streaming media | Improve the accuracy of semantic evaluation | Detection perception accuracy is closely related to video quality.
Giraud et al. [9] | Workflow for integrating uncertainty information in geological modeling | Address issues where geological uncertainty information is used for local constraints | As with all geological modeling, the model cannot account for geological units or faults that are not sampled by in-situ geological measurements, which can lead to biases in the final model.
Ouziala et al. [10] | A method for detecting micro-level faults involving parameter uncertainty | Ensure optimal detection performance by optimizing thresholds | The accuracy of residuals in detecting minor faults needs to be improved.
Puzyrev [11] | Deterministic gradient-based methods | Solve least squares optimization problems in high-dimensional parameter spaces | The inversion region has a significant impact on the results.
Pevey et al. [12] | A gradient optimization design method for nuclear reactor core components | Leverage gradient information for design optimization | Gradient-informed designs must scale as the dimensions of the design space increase.

2 Building an image semantic quality evaluation model based on gradient uncertainty

A gradient-based uncertainty prediction method is proposed for out-of-distribution detection. In addition, a human-machine joint decision-making framework is designed, which combines the advantages of humans and machines in perceiving semantic distortion to improve decision accuracy.

2.1 Semantic distortion perception analysis based on datasets

With the advancement of deep learning technology, machines are increasingly capable of semantic analysis of images, including tasks such as object detection, localization, and recognition. Nonetheless, distortion in images can negatively affect these analysis tasks; this specific effect is known as semantic distortion. Semantic distortion differs from traditional image quality distortion and is not well captured by conventional image quality evaluation indicators, so research on this issue is particularly urgent. Semantic distortion is defined with respect to specific image semantic analysis tasks and needs to be explored in a specific application context [13]. Distortion can damage image quality, so IQA methods are needed for evaluation [14]. IQA methods are divided into subjective and objective categories. Subjective methods require human observers and are accurate but time-consuming and subject to interference. The performance of objective methods is usually measured by their similarity to subjective scores: the higher the similarity, the better the model performance. Objective image quality evaluation can be divided into three categories: full reference, reduced reference, and no reference [15]. The widely used objective quality evaluation methods currently include the mean square error (MSE) and the peak signal-to-noise ratio (PSNR). MSE measures the average pixel-level error of an image, and both are calculated as shown in equation (1).

$MSE = \frac{1}{HW}\sum_{j=1}^{H}\sum_{i=1}^{W}\left(I_{ref}(i,j) - I_{sts}(i,j)\right)^{2},\qquad PSNR = 10\log_{10}\frac{D^{2}}{MSE}$ (1)

In equation (1), $I_{ref}$ represents the reference image, $I_{sts}$ represents the distorted image, and $D$ represents the dynamic range of pixel values; MSE and PSNR denote the mean square error and the peak signal-to-noise ratio, respectively.
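To make equation (1) concrete, the following minimal NumPy sketch computes MSE and PSNR for an 8-bit grayscale image. The function name, the default dynamic range of 255, and the synthetic example data are illustrative assumptions rather than details taken from the paper.

```python
import numpy as np

def mse_psnr(ref: np.ndarray, dst: np.ndarray, dynamic_range: float = 255.0):
    """Compute MSE and PSNR between a reference and a distorted image, as in equation (1)."""
    ref = ref.astype(np.float64)
    dst = dst.astype(np.float64)
    mse = np.mean((ref - dst) ** 2)                       # average squared pixel error over H*W
    psnr = 10.0 * np.log10(dynamic_range ** 2 / mse) if mse > 0 else float("inf")
    return mse, psnr

# Example: a mildly noisy 8-bit grayscale image (synthetic data for illustration only)
rng = np.random.default_rng(0)
reference = rng.integers(0, 256, size=(64, 64)).astype(np.float64)
distorted = np.clip(reference + rng.normal(0, 5, size=(64, 64)), 0, 255)
print(mse_psnr(reference, distorted))
```

For color images the same computation is typically applied over all channels before averaging.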
Researchers have tried to improve the confidence scores of deep learning models by performing uncertainty prediction, so that misleading high-confidence predictions on inputs that do not conform to the training data distribution can be addressed [16]. This is achieved by first classifying uncertainty and then dealing with it in a targeted manner. Figure 1 shows the classification of uncertainty.

Figure 1: Classification of uncertainty (overall uncertainty is divided into cognitive and random uncertainty; random uncertainty is further divided into heteroscedastic and homoscedastic uncertainty)

In Figure 1, cognitive uncertainty is caused by uncertainty in the model parameter space. It is usually caused by underfitting or dataset shift and can be reduced by increasing the training data. After determining the type of uncertainty, this study needs to consider perception differences among different populations, cultural backgrounds, and contexts, and the semantic distortion perception in the dataset is further analyzed. When studying human semantic distortion perception, the first step is to select a reference image, then apply distortion processing to the image, followed by subjective experimental design and, finally, elimination of outliers. In this study, FaceNet is used for the face subset test, with the triplet loss used to reduce intra-class spacing. The corresponding loss function is illustrated in Figure 2.

Figure 2: Schematic diagram of the triplet loss function in FaceNet (learning pulls the anchor toward positive samples and pushes it away from negative samples)

In Figure 2, the distance between the anchor sample and any negative sample should be greater than the distance between the anchor sample and any positive sample of the same identity. The specific constraint is shown in equation (2).

$\left\| f(a_i^{a}) - f(a_i^{p}) \right\|_2^2 + \alpha < \left\| f(a_i^{a}) - f(a_i^{n}) \right\|_2^2, \quad \forall \left( f(a_i^{a}), f(a_i^{p}), f(a_i^{n}) \right) \in \Gamma$ (2)

In equation (2), $\alpha$ represents the ideal margin maintained between positive and negative samples, $a_i^{a}$ represents the anchor sample, $a_i^{p}$ represents the positive sample, and $a_i^{n}$ represents the negative sample. $\Gamma$ represents the set of all possible triplets in the training set. The loss function is calculated as shown in equation (3).

$\mathcal{L} = \sum_{i}^{N}\left[\, \left\| f(a_i^{a}) - f(a_i^{p}) \right\|_2^2 - \left\| f(a_i^{a}) - f(a_i^{n}) \right\|_2^2 + \alpha \,\right]_{+}$ (3)

In equation (3), $\mathcal{L}$ represents the value of the loss function.
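As a concrete reading of equations (2)-(3), the sketch below implements a hinge-style triplet loss over batches of embeddings in PyTorch. The margin value, helper names, and toy data are assumptions for illustration and are not settings reported for the experiments here.

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor_emb, positive_emb, negative_emb, margin: float = 0.2):
    """Triplet loss of equations (2)-(3): the anchor-positive distance should be
    smaller than the anchor-negative distance by at least `margin`; only
    violations contribute to the loss."""
    d_pos = (anchor_emb - positive_emb).pow(2).sum(dim=1)   # ||f(a^a) - f(a^p)||^2
    d_neg = (anchor_emb - negative_emb).pow(2).sum(dim=1)   # ||f(a^a) - f(a^n)||^2
    return F.relu(d_pos - d_neg + margin).mean()

# Toy usage with unit-normalized 128-dimensional embeddings (random data for illustration)
emb = lambda x: F.normalize(x, dim=1)
a, p, n = (emb(torch.randn(8, 128)) for _ in range(3))
print(triplet_loss(a, p, n).item())
```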
After the face subset test, the Omni-Scale Network (OSNet) is used to conduct the pedestrian subset test. The network is evaluated on the Market-1501 dataset, where the test accuracy reaches 93%. OSNet is a full-scale learning structure for person re-identification, containing a residual module composed of multiple convolution streams, each responsible for feature detection at a certain scale. In addition, a unified aggregation gate is introduced in the network to dynamically fuse multi-scale features with input-dependent channel weights. Its residual formulation is shown in equation (4).

$y = x + \tilde{x}, \quad \text{s.t. } \tilde{x} = F(x)$ (4)

In equation (4), $x$ represents the given input, $F$ represents the mapping function, and $\tilde{x}$ represents the learned residual. Figure 3 compares the standard 3×3 convolution with the Lite 3×3 convolution used in OSNet after introducing the aggregation gate.

Figure 3: Comparison between the standard 3×3 convolution and the Lite 3×3 convolution in OSNet ((a) standard 3×3 convolution: Conv 3×3, BatchNorm, ReLU; (b) Lite 3×3 convolution: Conv 1×1, depthwise Conv 3×3, BatchNorm, ReLU)

In Figure 3, unlike the standard 3×3 convolution layer, the Lite 3×3 convolution layer uses depthwise separable convolution to reduce the number of network parameters. The scale of each feature is represented by an index to achieve multi-scale feature learning. All convolution streams are dynamically fused through a unified aggregation gate, and the weights of different scales are dynamically adjusted according to the input sample. The residual that the network needs to learn is shown in equation (5).

$\tilde{x} = \sum_{t=1}^{T} G\left(F_t(x)\right) \odot F_t(x)$ (5)

In equation (5), $\odot$ represents the Hadamard product and $G(F_t(x))$ represents the channel weight coefficients. After pedestrian recognition, the study further conducts license plate detection and recognition.

2.2 Analysis of single-sample semantic distortion perception based on the gradient uncertainty calculation method

In Section 2.1, semantic distortion is evaluated by computing the recognition accuracy on a specific dataset. Although this method is simple to operate, it has obvious limitations. For example, accuracy is a statistical result over a large number of samples and cannot reveal subtle differences between individual samples, and accuracy results are easily affected by the characteristics of the selected dataset. The study therefore introduces confidence as a new metric for measuring semantic distortion. Confidence represents the probability that a prediction is correct. It not only reflects the strength of recognition ability but can also be computed directly from the model's prediction on the current input sample, which makes in-depth analysis of a single sample possible [17]. At the human level, confidence is defined as the proportion of individuals in the subject population who correctly identify the sample. At the machine level, confidence is related to the uncertainty predictions of the deep learning model. Therefore, the experiment proposes a new gradient-based uncertainty prediction method specifically for out-of-distribution detection. In addition, the research proposes a joint decision-making framework that integrates human and machine perception, aiming to exploit their complementary advantages on different samples to improve overall decision accuracy. Human confidence is calculated as shown in equation (6).

$U_H = \frac{1}{|\Omega_H|}\sum_{i \in \Omega_H} \mathbb{1}\left(h_i = y\right)$ (6)

In equation (6), $\Omega_H$ represents the human subject population, $y$ represents the label corresponding to the sample, and $h_i$ represents the recognition result of the $i$-th human subject. This study uses a deep learning model to predict human confidence for different data types. The model is adjusted to output a scalar value representing human confidence. Training uses stochastic gradient descent with a set of hyperparameters, and data augmentation operations that may affect human confidence are avoided during data processing.
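The human confidence of equation (6) can be computed directly from subject responses. The following minimal sketch is one possible reading; the function name and the example answers are hypothetical.

```python
import numpy as np

def human_confidence(subject_answers, true_label):
    """Equation (6): fraction of human subjects whose answer matches the ground-truth label."""
    answers = np.asarray(subject_answers)
    return float(np.mean(answers == true_label))

# Example: 20 subjects labelling one distorted face image (identity IDs are hypothetical)
answers = ["id_03"] * 14 + ["id_07"] * 6
print(human_confidence(answers, "id_03"))   # -> 0.7
```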
The study uses human confidence on the subjective dataset as the true score and the output of the model as the predicted value. The main evaluation indicator of model performance is the correlation between the true and predicted values, usually measured by the Spearman rank-order correlation coefficient (SROCC) and the Pearson linear correlation coefficient (PLCC). SROCC is a nonlinear correlation coefficient used to measure the correlation between two variables and is computed from the ranks of the data, as shown in equation (7).

$SROCC = 1 - \frac{6\sum_{i=1}^{N}\left( rank(x_i) - rank(y_i) \right)^2}{N\left(N^2 - 1\right)}$ (7)

In equation (7), $rank(x_i)$ represents the rank of $x_i$ in the whole sequence. PLCC is a widely used linear correlation coefficient used to measure the linear relationship between two sets of data, as shown in equation (8).

$PLCC = \frac{\sum_{i=1}^{N}\left(x_i - \bar{x}\right)\left(y_i - \bar{y}\right)}{\sqrt{\sum_{i=1}^{N}\left(x_i - \bar{x}\right)^2 \sum_{i=1}^{N}\left(y_i - \bar{y}\right)^2}}$ (8)

In equation (8), $x_i$ represents the true score of the $i$-th image and $y_i$ represents the test value of the $i$-th image; $\bar{x}$ and $\bar{y}$ represent the corresponding averages. In practical applications, the test scores must first be fitted nonlinearly to the true values. In this experiment, a logistic regression model is used for fitting, as shown in equation (9).

$y(x) = \beta_1\left(0.5 - \frac{1}{1 + e^{\beta_2\left(x - \beta_3\right)}}\right) + \beta_4 x + \beta_5$ (9)

In equation (9), $x$ represents the predicted score before fitting, $y$ represents the predicted score after fitting, and the fitting parameters are $\{\beta_i \mid i = 1, 2, 3, 4, 5\}$.
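The correlation indicators of equations (7)-(9) can be computed with standard SciPy routines. The sketch below first fits the five-parameter logistic mapping of equation (9) and then reports SROCC and PLCC; the initial parameter guesses and the synthetic scores are assumptions for illustration.

```python
import numpy as np
from scipy.stats import spearmanr, pearsonr
from scipy.optimize import curve_fit

def logistic_5(x, b1, b2, b3, b4, b5):
    """Five-parameter logistic mapping of equation (9), applied before computing PLCC."""
    return b1 * (0.5 - 1.0 / (1.0 + np.exp(b2 * (x - b3)))) + b4 * x + b5

def srocc_plcc(predicted, subjective):
    """SROCC on raw scores (equation (7)); PLCC after nonlinear fitting (equations (8)-(9))."""
    predicted = np.asarray(predicted, dtype=float)
    subjective = np.asarray(subjective, dtype=float)
    srocc = spearmanr(predicted, subjective)[0]
    # Starting values for the fit are a common heuristic, not values taken from the paper.
    p0 = [np.ptp(subjective), 1.0, np.mean(predicted), 0.0, np.mean(subjective)]
    params, _ = curve_fit(logistic_5, predicted, subjective, p0=p0, maxfev=10000)
    plcc = pearsonr(logistic_5(predicted, *params), subjective)[0]
    return srocc, plcc

# Toy usage with synthetic predicted scores and subjective scores
rng = np.random.default_rng(1)
pred = rng.uniform(0, 1, 50)
subj = 0.8 * pred + rng.normal(0, 0.05, 50)
print(srocc_plcc(pred, subj))
```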
A gradient-based uncertainty prediction method is proposed to address the high complexity and impracticality of existing prediction methods in machine semantic distortion perception. The study first observes the feature sparsity of out-of-distribution samples and then uses the Jacobian matrix to analyze the relationship between feature sparsity and the gradient norm. The network output is obtained from the network input and the linear layers, as shown in equation (10).

$F(a) = W_L\,\varphi_{L-1}\left(W_{L-1}\,\varphi_{L-2}\left(\cdots \varphi_{1}\left(W_{1} a\right)\right)\right)$ (10)

In equation (10), $a$ represents the network input, $F(a)$ represents the output, $\{W_i \mid i = 1, \ldots, L\}$ represent the linear layers (with $W_i \in \mathbb{R}^{d_i \times d_{i-1}}$), and $\{\varphi_i \mid i = 1, \ldots, L-1\}$ represent the nonlinear layers. Combining the relationship between the gradient norm and the nonlinear layers to analyze feature sparsity, the rectified linear unit layer is given in equation (11).

$\varphi_i(z) = ReLU(z) = \max(0, z), \quad i = 1, \ldots, L-1$ (11)

In equation (11), $ReLU$ represents the activation layer. The derivative of the nonlinear layer in the Jacobian matrix is expressed in equation (12).

$\frac{\partial h_i}{\partial h_{i-1}} = diag\left(\mathbb{1}\left(W_i h_{i-1} > 0\right)\right) W_i, \quad i = 1, \ldots, L-1$ (12)

In equation (12), during backpropagation the number of zero entries in the Jacobian matrix is positively correlated with the sparsity of the ReLU layer output, while the gradient norm is negatively correlated with it [18]. Usually, backpropagation requires label data to calculate the gradient of the loss function, but no labels are available during the testing phase. To address this issue, the study introduces a loss function for gradient retrieval in the unlabeled case. Firstly, the output is perturbed by a small amplitude $\varepsilon$; then a new loss function is introduced, as shown in equation (13).

$\tilde{F} = (1 + \varepsilon)F, \qquad \mathcal{L} = \frac{1}{W}\left\| \tilde{F} - F \right\| = \frac{\varepsilon}{W}\left\| F \right\|$ (13)

In equation (13), $W$ represents the total number of channels corresponding to the output shape, and the gradient of $\mathcal{L}$ is obtained by backpropagation. This study designs the loss function to adapt to the input sample, especially when dealing with out-of-distribution samples: the perturbation in the loss function directly affects the magnitude of the backpropagated gradient. Previous studies have shown that the predicted probability distribution reflects uncertainty to some extent [19]. Therefore, the perturbation amplitude $\varepsilon$ is not a uniform value for all samples but is related to the network output $F(x)$, as shown in equation (14).

$\varepsilon = A\left(F(x)\right)$ (14)

In equation (14), $A(F(x))$ represents a scalar statistic of the predicted probability distribution. The method is based on the correlation between the input gradient and out-of-distribution behavior: the larger the gradient, the more the sample deviates from the training distribution, so the gradient is used to measure the out-of-distribution probability [20]. The gradient is calculated by backpropagation, as shown in equation (15).

$U(x) = E_W\left[\left\| \nabla_W \mathcal{L}(x) \right\|\right]$ (15)

In equation (15), $E_W$ denotes averaging over the weight coefficients. This study focuses on the size of the gradient norm. However, the mutual cancellation of positive and negative gradients in backpropagation may reduce the gradient norm, which weakens its ability to distinguish between in-distribution and out-of-distribution samples [21]. To address this issue, the study adopts a backpropagation optimization strategy based on guided backpropagation, which separates positive and negative gradients by truncating the gradient flow. A schematic is shown in Figure 4.

Figure 4: Schematic diagram of the guided backpropagation method ((a) forward pass; (b) backward pass with standard backpropagation; (c) backward pass with guided backpropagation)

In Figure 4, the core idea of guided backpropagation is to truncate the negatively activated gradients in the ReLU layer, setting them to zero during backpropagation so that they no longer affect gradient propagation.
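A minimal sketch of the label-free gradient uncertainty idea behind equations (13)-(15) is given below, assuming a PyTorch classifier: the output is perturbed by an amplitude derived from the predicted distribution, the resulting discrepancy is backpropagated, and the average gradient norm over the weights serves as the out-of-distribution score. The perturbation statistic, the loss form, and the toy network are assumptions for illustration, and the guided-backpropagation truncation of negative gradients described above is omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def gradient_uncertainty(model: nn.Module, x: torch.Tensor) -> torch.Tensor:
    """Label-free gradient uncertainty in the spirit of equations (13)-(15):
    larger scores are treated as more out-of-distribution."""
    model.zero_grad()
    logits = model(x)                                  # F(x)
    probs = F.softmax(logits, dim=1)
    eps = 1.0 - probs.max(dim=1).values.mean()         # epsilon = A(F(x)); assumed statistic
    loss = eps * logits.abs().mean()                   # discrepancy between F and (1+eps)F, up to scale
    loss.backward()
    grad_norms = [p.grad.norm() for p in model.parameters() if p.grad is not None]
    return torch.stack(grad_norms).mean()              # U(x): average gradient norm over the weights

# Toy usage with a small classifier (architecture and input are illustrative only)
net = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 64), nn.ReLU(), nn.Linear(64, 10))
score = gradient_uncertainty(net, torch.randn(4, 3, 32, 32))
print(float(score))
```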
Deep learning technology is developing rapidly in different fields, but due to model uncertainty, decisions in practical applications are not always reliable, especially in situations where high accuracy is required. Therefore, this study proposes a new human-machine joint decision-making framework, as shown in Figure 5.

Figure 5: Schematic diagram of the human-machine interactive decision-making process (input data pass through the DNN model, which either outputs a prediction or rejects and forwards the sample to the external decision maker DM)

In Figure 5, the proposed human-machine hybrid decision-making method includes a deep neural network (DNN) and an external decision maker (DM), usually a human. The decision flow is a cascade divided into two steps. First, the DNN model either gives a prediction result or chooses to reject. If it does not reject, the system outputs the model's prediction. Second, if it rejects, the system redirects the input to the external decision maker for a second judgment and outputs the external decision maker's prediction. Although this modeling method is simple, it can still describe a large part of decision-making systems that include multiple decision makers. To better control the decision-making process, the study proposes a human-machine hybrid decision-making framework, which requires the confidences of the human and of the deep neural network model to be calibrated so that they can be compared. The framework and confidence alignment are shown in Figure 6.

Figure 6: The proposed decision-making scheme and confidence alignment diagram ((a) the proposed decision-making pipeline, in which data are routed to the human or the machine according to aligned confidences and a decision rule; (b) confidence alignment, in which confidence is calibrated against accuracy for both human and machine)

In Figure 6, the confidence scores of the human and of the deep neural network model are calculated separately for each input sample and adjusted to the same measurement scale. Decision rules are then designed to generate the final decision from these two confidence scores, ensuring that the confidence and accuracy distributions of the human and the deep neural network model are as consistent as possible on a specific dataset. A minimal sketch of the cascade decision rule is given below.
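The cascade of Figure 5 can be expressed as a small decision routine, sketched below under the assumption that both confidences have already been calibrated to a common scale as in Figure 6; the threshold, the stub decision makers, and all names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class HybridDecision:
    source: str       # "machine" or "human"
    prediction: str

def hybrid_decide(sample,
                  machine_predict: Callable,   # returns (label, calibrated confidence)
                  human_predict: Callable,     # returns (label, calibrated confidence)
                  reject_threshold: float = 0.8) -> HybridDecision:
    """Cascade decision of Figure 5: accept the model's answer when its calibrated
    confidence clears the threshold, otherwise defer the sample to the human DM."""
    label, conf = machine_predict(sample)
    if conf >= reject_threshold:
        return HybridDecision("machine", label)
    label, _ = human_predict(sample)
    return HybridDecision("human", label)

# Toy usage with stubbed-in decision makers (labels and threshold are illustrative)
machine = lambda s: ("plate_A123", 0.55)
human = lambda s: ("plate_B456", 0.90)
print(hybrid_decide("blurred_plate.png", machine, human))
```

Raising the threshold routes more samples to the human decision maker, trading machine throughput against accuracy.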
3 Performance verification of the human-machine hybrid intelligent image semantic quality evaluation model based on the gradient uncertainty calculation method

The aim of this study is to create a comprehensive surveillance-scene dataset that includes faces, pedestrians, and license plates. Three distortion methods are considered: JPEG compression, BPG compression, and motion blur. Participants are required to identify target objects in distorted images, and abnormal data are excluded to ensure the reliability of the collected data.

3.1 Verification of semantic distortion perception performance based on gradient uncertainty and datasets

This study conducted facial and pedestrian recognition tasks. Participants selected images from a template library that matched the faces or pedestrians in distorted images. For facial recognition, the images were divided into 10 groups covering different hairstyles and genders to exclude other interfering factors. For pedestrian recognition, the images were divided into 8 groups based on clothing color. For the license plate recognition task, a single-choice experiment was used, requiring participants to input the complete license plate number, including the province abbreviation, letters, and digits; recognition was counted as correct only when the input matched the actual license plate exactly. Figure 7 shows how the average recognition accuracy of human subjects and deep neural network models changes as distortion increases under different tasks and distortion types.

Figure 7: Average recognition accuracy of human subjects and DNN models versus distortion level ((a) BPG, accuracy versus QP; (b) JPEG, accuracy versus quality level; (c) motion blur, accuracy versus kernel size; curves shown for the face, person, and license plate tasks)

Figure 7(a) shows the results for images processed by BPG compression. As the distortion value increased, accuracy dropped gradually from high to low: at the beginning of the experiment accuracy reached 1.0, and when the distortion value exceeded 40 accuracy fell below 0.5 and approached 0. Figure 7(b) shows the results for JPEG compression. As image quality decreased, accuracy shrank from the initial 1.0 and approached, without reaching, 0.0. Figure 7(c) shows the results for motion blur. As the kernel size increased, the accuracy on the images fluctuated, and when the kernel size reached 40 the accuracy began to level off. It is worth noting that under all three processing methods the DNN model performed better than humans in the facial and pedestrian recognition tasks, especially in cases of severe image distortion; in the license plate recognition task, the advantage of the DNN model was relatively small. This indicates that the DNN model is more robust in dealing with image distortion. Figure 8 shows the differences in recognition accuracy among different human participants.

Figure 8: Human individual recognition accuracy curves and subjective recognition threshold distribution histograms ((a), (b) face, BPG; (c), (d) person, BPG; (e), (f) license plate, BPG)

Figures 8(a), 8(c), and 8(e) show the average, maximum, and minimum recognition accuracy of human individuals as a function of QP; Figures 8(b), 8(d), and 8(f) show histograms of the distribution of subjective recognition thresholds for human individuals. Under the same level of distortion, the recognition accuracy of different individuals varied greatly, indicating that image distortion affects different individuals differently. In summary, machines are more robust to image distortion than humans, but they may be relatively weak in terms of generalization and stability. In addition, there are significant differences between human individuals.

3.2 Verification of semantic distortion perception performance based on gradient uncertainty and a single sample

The relationship between out-of-distribution samples and gradients was analyzed to validate the design based on gradient uncertainty.
Firstly, starting from the sparsity of the ReLU output features, the study investigated the correlation between feature sparsity and out-of-distribution samples. Subsequently, the connection between feature sparsity and the gradient was established through the network Jacobian matrix. The experiment used CIFAR-10 and CIFAR-100 as in-distribution datasets, and TinyImageNet, LSUN, and iSUN as out-of-distribution datasets. The evaluation adopted indicators such as FPR at 95% TPR, detection error, and AUROC. Table 2 shows the performance comparison between the proposed uncertainty prediction method and current high-complexity methods.

Table 2: Performance comparison of the proposed uncertainty prediction method with current high-complexity methods

Method | Knowledge | Complexity | Detection error | AUROC | FPR@95%TPR
ODIN | OOD Val | Black-box | 24 | 91 | 44
Mahalanobis | IND Val | White-box | 7 | 97 | 13
Ours | None | White-box | 3 | 96 | 5
Margin-based ensemble | IND Train + OOD Val | Retraining | 8 | 97 | 16
Outlier exposure | OOD Val | Retraining | / | 83 | 57

Beyond Table 2, the study further compared the proposed method with the more complex DenseNet model, using CIFAR-100 as the in-distribution dataset and LSUN (r) as the out-of-distribution dataset. On the more challenging CIFAR-100 dataset, the proposed method achieved the lowest FPR at 95% TPR and the lowest detection error rate, performing best among all methods. The experiment used confidence interval analysis to quantify the uncertainty of the evaluation results and enhance their credibility. Specifically, a 95% confidence interval was calculated for each performance indicator, ensuring the consistency and reliability of the evaluation results; the performance of the model remained robust under different experimental conditions. The method was applied to different datasets for training and testing, which effectively enhanced the generalization ability of the model and the repeatability of the experiments. Figure 9 compares the complexity of the proposed method with that of ODIN, Mahalanobis, Softmax, and other methods.

Figure 9: Complexity comparison between the proposed method and other methods (methods are placed by required knowledge, from none through IND/OOD validation data to retraining, and by complexity, from black-box to white-box to retraining; application difficulty increases accordingly)

According to Figure 9, the proposed method achieved similar or even better performance while significantly reducing complexity compared with state-of-the-art methods. A sketch of how the detection indicators in Table 2 can be computed is given below.
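As a reference for the indicators in Table 2, the following sketch computes FPR at 95% TPR, the detection error at that operating point, and AUROC from in-distribution and out-of-distribution uncertainty scores using scikit-learn; the score convention (larger means more out-of-distribution) and the synthetic data are assumptions.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

def ood_metrics(scores_in: np.ndarray, scores_out: np.ndarray):
    """Metrics of Table 2: FPR at 95% TPR, detection error at that operating point, AUROC."""
    y_true = np.concatenate([np.zeros_like(scores_in), np.ones_like(scores_out)])   # 1 = OOD
    y_score = np.concatenate([scores_in, scores_out])
    fpr, tpr, _ = roc_curve(y_true, y_score)
    idx = np.argmin(np.abs(tpr - 0.95))                  # operating point closest to TPR = 95%
    fpr_at_95tpr = fpr[idx]
    detection_error = 0.5 * (1.0 - tpr[idx]) + 0.5 * fpr_at_95tpr
    return fpr_at_95tpr, detection_error, roc_auc_score(y_true, y_score)

# Toy usage: OOD samples receive somewhat larger uncertainty scores than in-distribution ones
rng = np.random.default_rng(2)
print(ood_metrics(rng.normal(0.0, 1.0, 2000), rng.normal(2.5, 1.0, 2000)))
```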
In addition, the study further used the constructed semantic dataset to simulate human predictions: the predicted results of 20 randomly selected human subjects were used as the human predictions in actual scenarios. To eliminate the impact of randomness, each image was tested multiple times and the average was taken as the final result. Table 3 shows the optimal performance comparison between the proposed framework and the rejection learning framework.

Table 3: Optimal performance comparison of the proposed framework and the rejection learning framework

Task | Framework | Effectiveness of human decision | Accuracy (%) | Human decision rate (%)
License | Human | / | 24.2 | /
License | Ours | 54 | 86.6 | 19
License | Rejection learning | 49 | 78.3 | 5
License | Machine | / | 75.8 | /
Person | Human | / | 59.8 | /
Person | Ours | 34 | 68.0 | 81
Person | Rejection learning | 19 | 59.8 | 100
Person | Machine | / | 40.1 | /

In Table 3, when humans were superior to machines, the optimal accuracy of the proposed framework reached 68.03%, significantly higher than the 59.83% accuracy of humans and the 40.16% accuracy of machines. This indicates that the framework can effectively combine the advantages of different decision-makers. In contrast, the optimal accuracy of the rejection learning framework was 59.84%, only slightly higher than the 59.83% accuracy of humans, indicating that its performance limit does not exceed that of a single human or machine decision-maker and cannot reflect the advantages of human-machine collaboration. This study further conducted an accuracy analysis. Figure 10 shows the accuracy of the proposed framework and the rejection learning framework under different thresholds.

Figure 10: Accuracy of the proposed framework and the rejection learning framework under different thresholds, plotted against the human decision rate ((a) humans superior to machines; (b) machines superior to humans)

Figure 10(a) shows the case where humans are superior to machines. The optimal accuracy of the proposed framework was significantly higher than both human and machine accuracy, and at the same proportion of human judgments its accuracy was always better than that of the rejection learning framework. Figure 10(b) shows the case where machines outperform humans. Since human accuracy was lower than that of the proposed model, the performance of the framework inevitably showed an overall downward trend as the proportion of human decisions increased. However, the accuracy of the proposed framework was always better than that of the rejection learning framework, and the framework still showed significant performance improvements over the machine alone, a level that the rejection learning framework could not achieve. Finally, the model was applied to the semantic quality evaluation of a specific set of images to analyze the impact of different human factors on model accuracy. The results are shown in Figure 11.

Figure 11: Impact of human factors (accuracy versus decision rate for gender, age, marital status, education level, number of children born, and the combined human-machine decision-making factors)

In Figure 11, as the decision rate increased, the accuracy of the model under each factor grew at a different rate, and when the decision rate reached 1.0 the accuracy under every factor reached its highest value. Among the statistical populations, education level had the greatest impact on the accuracy of image semantic quality evaluation, while the number of children born had the smallest impact. This may be because education level directly affects people's understanding and discrimination of different things and changes their views on them. The impact of the human-machine decision-making factors on accuracy overlapped with that of the human factors, and the final model accuracy remained below 0.70. However, the human-machine decision-making factors were closely connected with the human factors, and the accuracy obtained with them was higher than that under human factors alone, which can also be seen directly from the accuracy values. Therefore, human-machine decision-making factors can be used to evaluate the semantic quality of images. A sketch of how such accuracy-versus-decision-rate curves can be traced is given below.
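The accuracy-versus-human-decision-rate curves of Figures 10 and 11 can be traced by sweeping the rejection threshold of the hybrid framework. The sketch below is one minimal way to do this; the synthetic correctness indicators and the threshold grid are illustrative assumptions.

```python
import numpy as np

def accuracy_vs_human_rate(machine_conf, machine_correct, human_correct, thresholds):
    """Sweep the rejection threshold of the hybrid framework: samples whose machine
    confidence falls below the threshold are routed to the human decision maker."""
    machine_conf = np.asarray(machine_conf)
    machine_correct = np.asarray(machine_correct, dtype=bool)
    human_correct = np.asarray(human_correct, dtype=bool)
    curve = []
    for t in thresholds:
        to_human = machine_conf < t
        correct = np.where(to_human, human_correct, machine_correct)
        curve.append((to_human.mean(), correct.mean()))   # (human decision rate, accuracy)
    return curve

# Toy usage with synthetic correctness indicators (distributions are illustrative only)
rng = np.random.default_rng(3)
conf = rng.uniform(0, 1, 1000)
machine_ok = rng.uniform(0, 1, 1000) < conf               # machine tends to be right when confident
human_ok = rng.uniform(0, 1, 1000) < 0.6
for rate, acc in accuracy_vs_human_rate(conf, machine_ok, human_ok, [0.0, 0.25, 0.5, 0.75, 1.0]):
    print(f"human decision rate={rate:.2f}, accuracy={acc:.2f}")
```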
4 Discussion and conclusion

4.1 Discussion

The image semantic quality evaluation model based on the proposed gradient-based uncertainty calculation method was tested in different scenarios, and its performance was compared with that of different methods. The proposed method outperformed traditional methods in key indicators such as accuracy and distortion perception, and it showed stronger robustness, especially in the presence of complex backgrounds and noise interference. When humans were better than machines, the best accuracy of the proposed framework reached 68.03%, significantly higher than the human accuracy of 59.83% and the machine accuracy of 40.16%, showing that the framework could effectively combine the advantages of different decision-makers. In contrast, the best accuracy of the rejection learning framework was 59.84%, only slightly higher than the human accuracy of 59.83%; its performance upper limit did not exceed that of a single human or machine decision-maker and could not reflect the advantages of human-machine collaboration. For comparison, Chan et al. proposed an automatic IQA model based on a hybrid deep neural network; although its average correlation reached 0.57, its accuracy on images was not as good as that of the proposed model [22]. In addition, although the NR-IQA algorithm based on the HCL framework proposed by Wang et al. had strong generalization capability, it still faced certain challenges in extracting distorted image information [23]. The performance difference may mainly stem from the fact that, by introducing gradient-based uncertainty calculation, the proposed model handles blurred and discontinuous regions in the image better and thus improves evaluation accuracy. In addition, the computational efficiency of the proposed model was optimized: compared with traditional methods, it reduced computation time while maintaining high accuracy, which is of practical significance in real-time application scenarios. The proposed model makes new contributions to image semantic quality assessment, especially regarding the rationale for using confidence as a measure of semantic distortion. The resulting model is applied in a human-machine joint decision-making framework and achieves superior performance.

4.2 Conclusion

With the explosive growth of digital media content, automated image quality evaluation is important for content management tasks such as automatic sorting, filtering, and recommendation. Current semantic evaluation models, however, suffer from weak noise resistance and insufficient diversity and universality. To further improve the accuracy of image semantic quality evaluation, this study selected surveillance scenarios and three common objects, namely faces, pedestrians, and license plates, and selected three common distortion types, JPEG compression, BPG compression, and motion blur, to test the accuracy of human recognition of these objects. On this basis, a subjective perception database of semantic distortion was built.
In addition to recognition accuracy, this study also introduced confidence to measure the recognition ability of humans and deep neural network models on individual samples, providing a deeper analysis of the perceptual effects of semantic distortion. The experimental results showed that machines were more robust to distortion than humans but performed worse in generalization and stability. The study adopted fine-grained semantic object classification, in which local detail features are more crucial, which explains why machines were more robust than humans. The research method achieved an optimal accuracy of 68.03%, significantly higher than the human accuracy of 59.83% and the machine accuracy of 40.16%, indicating that the proposed method can effectively combine the advantages of different decision-makers. This study may lack in-depth research on and evaluation of users' subjective experience; understanding users' expectations and preferences for image quality can provide important references for model improvement. Future research can enhance the design of user surveys and subjective evaluations.

Funding

The research is supported by the Key Project of Science and Technology Research of the Ministry of Education: Research on Intelligent Network Teaching Model Based on Ontology and Agent (No. 210210), the National Natural Science Foundation of China project "Research on ontology correction" (No. 60903131), and the 2023 Innovation and Entrepreneurship Research Fund Project of Yunnan Normal University "Research on the Construction of Normal College Students' Intelligent Educational Literacy Evaluation Index System for Improving Employment Competence".

References

[1] H. T. Wang, L. Wang, F. Q. Lai, and J. Y. Zhang, "Investigation of image segmentation effect on the accuracy of reconstructed digital core models of coquina carbonate," Applied Geophysics, vol. 17, no. 4, pp. 501-512, 2020. https://doi.org/10.1007/s11770-020-0846-2
[2] S. Zhao, P. Wang, Q. Cao, H. Song, and W. Li, "Weakly supervised salient object detection based on image semantics," Journal of Computer-Aided Design & Computer Graphics, vol. 33, no. 2, pp. 270-277, 2021. https://doi.org/10.3724/SP.J.1089.2021.18318
[3] A. Williams, "Human-centric functional computing as an approach to human-like computation," Artificial Intelligence and Applications, vol. 1, no. 2, pp. 118-137, 2023. https://doi.org/10.47852/bonviewAIA2202331
[4] F. Ecer, and E. Aycin, "Novel comprehensive MEREC weighting-based score aggregation model for measuring innovation performance: The case of G7 countries," Informatica, vol. 34, no. 1, pp. 53-83, 2023. https://doi.org/10.15388/22-INFOR494
[5] U. Sara, M. Akter, and M. S. Uddin, "Image quality assessment through FSIM, SSIM, MSE and PSNR-A comparative study," Journal of Computer and Communications, vol. 7, no. 3, pp. 8-18, 2019. https://doi.org/10.4236/JCC.2019.73002
[6] K. Jang, Y. K. An, B. Kim, and S. Cho, "Automated crack evaluation of a high-rise bridge pier using a ring-type climbing robot," Computer-Aided Civil and Infrastructure Engineering, vol. 36, no. 1, pp. 14-29, 2021. https://doi.org/10.1111/mice.12550
[7] K. Fu, Y. Zhang, and X. Lin, "The automatic evaluation of regularity and semantic decodability in wallpaper decorative patterns," Perception, vol. 48, no. 8, pp. 731-751, 2019. https://doi.org/10.1177/0301006619862142
[8] H. Liu, R. Huang, and H. Yuan, "Survey on compressive sensing video stream for uplink streaming media," Journal of Image and Graphics, vol. 26, no. 7, pp. 1545-1557, 2021.
https://doi.org/10.11834/jig.200487
[9] J. Giraud, M. Lindsay, V. Ogarko, M. Jessell, and E. Pakyuz-Charrier, "Integration of geoscientific uncertainty into geophysical inversion by means of local gradient regularization," Solid Earth, vol. 10, no. 1, pp. 193-210, 2019. https://doi.org/10.5194/se-10-193-2019
[10] M. Ouziala, Y. Touati, S. Berrezouane, D. Benazzouz, and B. Ouldbouamama, "Optimized fault detection using bond graph in linear fractional transformation form," Proceedings of the Institution of Mechanical Engineers, Part I: Journal of Systems and Control Engineering, vol. 235, no. 8, pp. 1460-1471, 2021. https://doi.org/10.1177/0959651820985617
[11] V. Puzyrev, "Deep learning electromagnetic inversion with convolutional neural networks," Geophysical Journal International, vol. 218, no. 2, pp. 817-832, 2019. https://doi.org/10.1093/gji/ggz204
[12] J. Pevey, B. Hiscox, A. Williams, O. Chvala, V. Sobes, and J. W. Hines, "Gradient-informed design optimization of select nuclear systems," Nuclear Science and Engineering: The Journal of the American Nuclear Society, vol. 196, no. 12, pp. 1559-1571, 2022. https://doi.org/10.1080/00295639.2021.1987133
[13] X. Li, S. Li, S. Liu, and D. He, "A malicious webpage detection algorithm based on image semantics," Traitement du Signal, vol. 37, no. 1, pp. 113-118, 2020. https://doi.org/10.18280/ts.370115
[14] J. Wu, J. Zeng, W. Dong, G. Shi, and W. Lin, "Blind image quality assessment with hierarchy: Degradation from local structure to deep semantics," Journal of Visual Communication and Image Representation, vol. 58, pp. 353-362, 2019. https://doi.org/10.1016/j.jvcir.2018.12.005
[15] Z. Jin, D. Yu, Z. Yuan, and L. Yu, "MCIBI++: soft mining contextual information beyond image for semantic segmentation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 5, pp. 5988-6005, 2023. https://doi.org/10.48550/arXiv.2209.04471
[16] F. Liu, M. Huang, W. Pedrycz, and H. Zhao, "Group decision making based on flexibility degree of fuzzy numbers under a confidence level," IEEE Transactions on Fuzzy Systems, vol. 29, no. 6, pp. 1640-1653, 2021. https://doi.org/10.1109/TFUZZ.2020.2983663
[17] Y. Fang, B. Luo, and T. Zhao, "ST-SIGMA: Spatio-temporal semantics and interaction graph aggregation for multi-agent perception and trajectory forecasting," CAAI Transactions on Intelligence Technology, vol. 7, no. 4, pp. 744-757, 2022. https://doi.org/10.1049/cit2.12145
[18] X. Xia, X. He, and L. Feng, "Semantic translation of face image with limited pixels for simulated prosthetic vision," Information Sciences, vol. 609, no. 2, pp. 507-532, 2022. https://doi.org/10.1016/j.ins.2022.07.094
[19] K. Liu, Z. Ye, H. Guo, D. Cao, L. Chen, and F. Y. Wang, "FISS GAN: A generative adversarial network for foggy image semantic segmentation," IEEE/CAA Journal of Automatica Sinica, vol. 8, no. 8, pp. 1428-1439, 2021. https://doi.org/10.1109/JAS.2021.1004057
[20] K. Shimoyama, and S. Kawai, "A kriging-based dynamic adaptive sampling method for uncertainty quantification," Transactions of the Japan Society for Aeronautical and Space Sciences, vol. 62, no. 3, pp. 137-150, 2019. https://doi.org/10.2322/tjsass.62.137
[21] V. Puzyrev, "Deep learning electromagnetic inversion with convolutional neural networks," Geophysical Journal International, vol. 218, no. 2, pp. 817-832, 2019. https://doi.org/10.1093/gji/ggz204
[22] K. Y. Chan, H. K.
Lam, and H. Jiang, “A genetic programming-based convolutional neural network for image quality evaluations,” Neural Computing and Applications, vol. 34, no. 18, pp. 15409-15427, 2022. https://doi.org/10.1007/s00521-022-07218-0 [23] J. Wang, Z. Chen, C. Yuan, B. Li, W. Ma, and W. Hu, “Hierarchical curriculum learning for no-reference image quality assessment,” International Journal of Computer Vision, vol. 131, no. 11, pp. 3074-3093, 2023. https://doi.org/10.1007/s11263-023-01851-5