https://doi.org/10.31449/inf.v48i9.5876  Informatica 48 (2024) 155–162

Research on the Recognition of Psychological Emotions in Adults Using Multimodal Fusion

Hui Zhao
School of Education, Science and Music, Luoyang Institute of Science and Technology, Luoyang, Henan 471027, China
Email: zhaoh1977@outlook.com

Keywords: multimodal fusion, adult, psychological emotion, depression, speech, gait

Received: March 8, 2024

Recognition of psychological emotions in adults plays a pivotal role in various fields. In this study, 110 healthy people and 110 depressed people were recruited. Gait, facial, and speech feature data were collected and screened, and the retained features were used as multimodal features. Subsequently, attention-bidirectional long short-term memory (BiLSTM) was developed as a method for recognizing adult psychological emotions. Better recognition results were achieved through multimodal fusion than with unimodal approaches. The accuracy of the attention-BiLSTM method was 0.8182, and the F1 value was 0.8125, demonstrating higher recognition accuracy than methods such as the recurrent neural network and LSTM. These results verify the reliability of multimodal fusion and the attention-BiLSTM method for recognizing psychological emotions in adults, and they can be applied in practice.

Povzetek: The research uses multimodal fusion with an attention-BiLSTM method to recognize the psychological emotions of adults by analyzing gait, facial features, and speech.

1 Introduction

Emotions reflect the physiological and psychological states resulting from an individual's thoughts, behaviors, etc., and are related to their needs and desires. Human daily communication relies heavily on the transmission of emotions. Healthy psychological emotions are crucial for normal life activities, yet in a fast-paced and evolving society, adults often grapple with complex employment, emotional, and interpersonal communication pressures, leading to psychological disturbances. Depression, characterized by persistent low mood, can escalate to self-injury and suicidal behavior [1]. Early detection and intervention are especially critical in such cases. Presently, depression diagnosis relies heavily on the subjective judgment of professional doctors through methods such as interviews and questionnaires. However, these approaches are inefficient, and some patients resist medical treatment. The advent of artificial intelligence has introduced new avenues for recognizing psychological conditions such as depression [2]. The analysis of related works is shown in Table 1.

Table 1 shows that current research mostly focuses on utilizing a single indicator for recognizing psychological emotions, such as text, facial expressions, or speech. There is relatively little research that studies two or more indicators simultaneously. In this paper, multimodal fusion was proposed for depression recognition based on multiple indicators, integrating gait, facial, and speech feature data for adult psychological emotion recognition. The proposed method was validated through experiments on real data. This paper provides a novel and reliable approach to emotion recognition research and further affirms the effectiveness of multimodal fusion.

Table 1: A summary of related works
Study | Key indicator | Recognition method | Result
Wani et al. [3] | Text social data | Support vector machine | It achieved better results than existing techniques on the stance sentiment emotion corpus and Aman datasets.
Hossain et al. [4] | Speech and visual signals | A convolutional neural network (CNN) and an extreme learning machine | The method was evaluated on three datasets and achieved success.
Alamgir et al. [5] | Facial features | A bidirectional Elman neural network | It achieved 98.57% and 98.75% accuracy on the JAFFE and CK+ datasets.
Tsouvalas et al. [6] | Speech features | Federated learning | The approach improved the recognition rate by 8.67% on average using only 10% labeled data in the experiment on IEMOCAP.
2 Analysis of psychological emotion features in adults

2.1 Research subjects

To obtain multimodal characteristics, 220 subjects were recruited for the experiment. The subjects in the depression group were all from a mental health center in Luoyang, and the subjects in the healthy group were recruited from society. All subjects understood the purpose and process of the study. Approval was obtained from the subjects' guardians, and the subjects signed the informed consent form. The collected experimental data were used only for the research in this paper, and the privacy and image rights of the subjects were strictly protected.

The screening conditions for subjects were as follows.
(1) Depression group: depression diagnosed by a specialized psychiatrist, with a Patient Health Questionnaire-9 (PHQ-9) score ≥ 10.
(2) Healthy group: no history of psychiatric illness and a PHQ-9 score < 5.

For the depression group, the exclusion conditions were as follows.
(1) Persons with extreme suicidal tendencies or co-occurring mental disorders.
(2) Pregnant or lactating women.
(3) Persons with visual, motor, or cognitive impairments.
(4) Persons with serious cardiovascular diseases.
(5) Persons with severe alcohol or drug dependence.

For the healthy group, the exclusion conditions were as follows.
(1) Persons with a family history of mental disorders within three generations on either the paternal or maternal side.
(2) Pregnant or lactating women.
(3) Persons with visual, motor, or cognitive impairments.
(4) Persons with serious cardiovascular diseases.
(5) Persons with severe alcohol or drug dependence.

The subjects are summarized in Table 2.

Table 2: Basic information of the subjects
 | Depression group (n=110) | Healthy group (n=110)
Number of males | 47 | 51
Number of females | 61 | 57
Age range | 20–50 | 20–50
Average age | 27.25 | 26.88
Average PHQ-9 score | 13.21 | 2.03

2.2 Gait data collection and analysis

A study has demonstrated differences in gait between individuals with depression and those without [7]. Gait data can be collected without contact, minimizing interference with the subject's psychological state, and gait is difficult to fake, making it a reliable cue for recognizing psychological emotions. This study used a Kinect camera for gait data acquisition, which captures the 3D coordinates of 25 skeletal joint points of the human body, as listed in Table 3.

Table 3: Kinect-generated joint points
0. Spine base, 1. Spine mid, 2. Neck, 3. Head, 4. Shoulder left, 5. Elbow left, 6. Wrist left, 7. Hand left, 8. Shoulder right, 9. Elbow right, 10. Wrist right, 11. Hand right, 12. Hip left, 13. Knee left, 14. Ankle left, 15. Foot left, 16. Hip right, 17. Knee right, 18. Ankle right, 19. Foot right, 20. Spine shoulder, 21. Hand tip left, 22. Thumb left, 23. Hand tip right, 24. Thumb right

Gait data were collected by having the subjects walk in an open 6 × 1 m area for about 2 minutes. The walking process was captured using the Kinect camera.
The collected data were segmented, and only the portion in which the subject faced the Kinect camera was retained; at least one complete gait cycle was kept. The Kinect coordinate system was then transformed, using 0. Spine base as the origin. The coordinates of the $i$-th joint point at the $t$-th frame were transformed to:

$x_i^{t\prime} = x_i^t - x_0^t,\quad y_i^{t\prime} = y_i^t - y_0^t,\quad z_i^{t\prime} = z_i^t - z_0^t.$

To reduce the effect of body-size differences, the joint data were normalized using the distance between 0. Spine base and 16. Hip right as a body-size reference. The coordinates of the $i$-th joint point at the $t$-th frame were further transformed to:

$x_i^{t\prime} = \frac{x_i^t}{|x_{16}^t - x_0^t|},\quad y_i^{t\prime} = \frac{y_i^t}{|y_{16}^t - y_0^t|},\quad z_i^{t\prime} = \frac{z_i^t}{|z_{16}^t - z_0^t|}.$

For the processed gait data, the following features were selected in line with the current study.

Step speed: the change of 3. Head along the Z-axis was used to calculate the individual step speed.

Stride length: the straight-line distance along the Z-axis between two consecutive landings of the ipsilateral heel was used to calculate the individual stride length.

Step width: the distance between the feet along the X-axis when both feet were supported on the ground was used to calculate the individual step width.

Swing arm amplitude: the gap between the maximum distances along the Z-axis reached by the left hand and the right hand within one gait cycle was used to calculate the swing arm amplitude.

Head tilt angle: the angle between the line connecting 2. Neck and 20. Spine shoulder and the vertical direction of the body was used as the head tilt angle.

Amplitude of joint motion: the maximum angular difference of the shoulder, elbow, hip, and knee movements in the sagittal plane during a gait cycle was used to calculate the amplitude of joint motion.

The collected data were statistically analyzed, and the results are displayed in Table 4.

Table 4: Statistical analysis of gait characteristics
Feature | Depression group | Healthy group | p value
Step speed (m/s) | 1.25±0.13 | 1.41±0.05 | 0.000
Stride length (m) | 0.99±0.21 | 1.12±0.16 | 0.000
Step width (mm) | 146.21±26.77 | 143.25±23.58 | 0.256
Swing arm amplitude (mm) | 335.26±115.24 | 426.25±87.26 | 0.000
Head tilt angle (°) | 1.33±2.16 | -0.25±2.77 | 0.007
Amplitude of shoulder joint motion (°) | 30.55±9.87 | 35.12±5.67 | 0.032
Amplitude of elbow joint motion (°) | 55.21±12.26 | 63.61±12.26 | 0.017
Amplitude of hip joint motion (°) | 45.77±7.35 | 46.58±8.25 | 0.352
Amplitude of knee joint motion (°) | 49.27±7.15 | 50.33±8.25 | 0.413

As shown in Table 4, step width and the amplitudes of hip and knee joint motion did not differ significantly between the depression and healthy groups. These three features were therefore excluded, and the subsequent recognition process focused on the remaining six features.
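As an illustration of the preprocessing described above, the following is a minimal sketch (not the authors' code) of the translation and body-size normalization, assuming each recorded walking sequence is stored as a T × 25 × 3 array of joint coordinates indexed as in Table 3; the normalization is applied to the translated coordinates.

```python
import numpy as np

SPINE_BASE, HIP_RIGHT = 0, 16  # joint indices from Table 3

def normalize_skeleton(frames):
    """Translate and scale Kinect skeleton data as described in Section 2.2.

    frames: array of shape (T, 25, 3) with the (x, y, z) coordinates of the
    25 joints over T frames (an assumed storage layout).
    """
    frames = np.asarray(frames, dtype=float)
    # Step 1: move the origin of every frame to 0. Spine base.
    translated = frames - frames[:, SPINE_BASE:SPINE_BASE + 1, :]
    # Step 2: divide each axis by the per-axis spine-base-to-hip-right
    # distance to reduce the effect of body-size differences.
    scale = np.abs(translated[:, HIP_RIGHT:HIP_RIGHT + 1, :])
    scale = np.where(scale < 1e-8, 1.0, scale)  # guard against division by zero
    return translated / scale
```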
2.3 Facial data collection and analysis

Facial and speech data acquisition was conducted by professional psychiatrists, who scored the subjects with the PHQ-9 scale [8] through interviews. The scoring process was recorded with a Canon 700D camera. The facial recordings of different subjects were aligned in length through frame extraction, giving a uniform video length of 5 minutes. Facial feature extraction was performed with the open-source tool OpenFace [9], which extracts facial action units (AUs) as features. OpenFace provides a total of 17 AUs, as listed in Table 5.

Table 5: AUs and the corresponding facial movements
AU01 | Lift the inner corner of the eyebrow
AU02 | Lift the outer corner of the eyebrow
AU04 | Gather the eyebrows and press them down
AU05 | Lift the upper eyelid
AU06 | Lift the cheek
AU07 | Tighten the eyelid
AU09 | Wrinkle the nose
AU10 | Lift the upper lip
AU12 | Pull the corners of the mouth to tilt upward
AU14 | Tighten the cheek muscle
AU15 | Pull down the corners of the mouth
AU17 | Lift the chin
AU20 | Pull up the corners of the mouth
AU23 | Tighten the lips
AU25 | Separate the lips
AU26 | Suck in the lips
AU45 | Blink

OpenFace uses the detected AU intensity to reflect the activity intensity of each AU, where a larger value indicates more intense movement. The collected data are summarized in Table 6.

Table 6: Statistical analysis of facial features
AU | Depression group | Healthy group | p value
AU01 | 0.18±0.01 | 0.19±0.01 | 0.121
AU02 | 0.15±0.02 | 0.14±0.01 | 0.252
AU04 | 0.42±0.02 | 0.41±0.01 | 0.215
AU05 | 0.15±0.01 | 0.13±0.01 | 0.236
AU06 | 0.22±0.11 | 0.77±0.12 | 0.000
AU07 | 0.52±0.21 | 0.79±0.22 | 0.012
AU09 | 0.15±0.01 | 0.13±0.02 | 0.325
AU10 | 0.41±0.22 | 0.83±0.21 | 0.000
AU12 | 0.27±0.21 | 0.88±0.33 | 0.000
AU14 | 0.01±0.11 | 0.55±0.22 | 0.000
AU15 | 0.23±0.01 | 0.22±0.01 | 0.521
AU17 | 0.57±0.07 | 0.55±0.08 | 0.214
AU20 | 0.16±0.01 | 0.15±0.01 | 0.362
AU23 | 0.18±0.01 | 0.16±0.01 | 0.412
AU25 | 0.58±0.12 | 0.61±0.08 | 0.251
AU26 | 0.51±0.02 | 0.48±0.03 | 0.236
AU45 | 0.28±0.01 | 0.25±0.01 | 0.274

As shown in Table 6, significant differences between the depression and healthy groups were observed for AU06, AU07, AU10, AU12, and AU14 (p < 0.05), whereas the differences in the remaining AUs were small. Consequently, those features were excluded, and only the five AUs with significant differences were considered in the further analysis.

2.4 Speech data collection and analysis

The speech data were captured with a Roland R-26 portable recorder at a sampling frequency of 44.1 kHz, dual-channel, 16-bit. The recorded audio was labeled and then converted to mono with Cool Edit. Segments containing interference, such as long pauses and coughs, were cut out, and the remaining audio was pre-processed. First, the signal was passed through a pre-emphasis filter:

$H(z) = 1 - \mu z^{-1},$

where $\mu$ is the pre-emphasis coefficient, set to 0.97. The signal was then divided into frames and windowed with a Hamming window, using a frame length of 25 ms and a frame shift of 10 ms. Finally, Mel-frequency cepstral coefficients (MFCCs) were extracted as speech features with the openSMILE toolkit [10], including (1) 12 MFCC parameters and the logarithmic energy parameter, (2) 13 first-order differences of the MFCCs, and (3) 13 second-order differences of the MFCCs. The final number of speech features was 39.
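The paper relies on openSMILE for the actual feature extraction; purely as an illustration of the pre-processing just described, the pre-emphasis and framing steps might look like the sketch below, assuming the mono signal has been loaded into a NumPy array at 44.1 kHz.

```python
import numpy as np

def preemphasize_and_frame(signal, sr=44100, mu=0.97, frame_ms=25, shift_ms=10):
    """Apply the pre-emphasis filter H(z) = 1 - mu * z^{-1} and split the
    signal into Hamming-windowed frames (25 ms length, 10 ms shift)."""
    signal = np.asarray(signal, dtype=float)
    # Pre-emphasis: y[n] = x[n] - mu * x[n-1]
    emphasized = np.append(signal[0], signal[1:] - mu * signal[:-1])
    frame_len = int(sr * frame_ms / 1000)    # 1102 samples at 44.1 kHz
    frame_shift = int(sr * shift_ms / 1000)  # 441 samples
    if len(emphasized) < frame_len:
        return np.empty((0, frame_len))
    window = np.hamming(frame_len)
    n_frames = 1 + (len(emphasized) - frame_len) // frame_shift
    frames = np.stack([
        emphasized[i * frame_shift: i * frame_shift + frame_len] * window
        for i in range(n_frames)
    ])
    return frames  # MFCCs and their deltas would then be computed per frame
```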
3 Recognition method based on multimodal fusion

Multimodal fusion combines inputs from different modalities to improve recognition accuracy. This study used three modal features (gait, face, and speech) for the recognition of adult psychological emotions: the six gait features, five facial features, and 39 speech features were concatenated to obtain a 50-dimensional feature vector. To extract deeper representations from these features, the bidirectional long short-term memory (BiLSTM) neural network was chosen for recognition. LSTM alleviates the gradient vanishing and explosion problems of the recurrent neural network (RNN) [11] and performs well in processing signals, text, etc. [12].

LSTM achieves selective forgetting of information through the forget gate $f_t$:

$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f),$

where $h_{t-1}$ is the output of the previous unit and $x_t$ is the input of the current unit. The input gate $i_t$ achieves selective recording of information:

$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i),$
$\tilde{C}_t = \tanh(W_c \cdot [h_{t-1}, x_t] + b_c),$
$C_t = f_t \cdot C_{t-1} + i_t \cdot \tilde{C}_t,$

where $\tilde{C}_t$ is the vector of candidate values and $C_t$ is the memory cell state. The output gate $o_t$ and $C_t$ together determine the final output $h_t$ of the LSTM:

$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o),$
$h_t = o_t \cdot \tanh(C_t),$

where $W$ and $b$ are the weight and bias of each gate, $\sigma$ is the sigmoid function, and $\tanh$ is the hyperbolic tangent function.

BiLSTM learns bidirectional features with a forward LSTM and a backward LSTM. At time $t$, the output $h_t$ of BiLSTM combines the forward state $\overrightarrow{h}_t$ and the backward state $\overleftarrow{h}_t$:

$h_t = [\overrightarrow{h}_t, \overleftarrow{h}_t].$

After learning higher-level features through BiLSTM, an attention mechanism [13] is combined with BiLSTM to focus on the more valuable features. The calculation process is:

$Q_h = \overrightarrow{h}_t + \overleftarrow{h}_t,$
$K = W_k \cdot Q_h,$
$V = W_h \cdot Q_h,$
$\mathrm{Attention}(Q_h, H) = \sum_{i=1}^{n} \mathrm{softmax}(Q_h \cdot K_i^T) \cdot V_i,$

where $Q_h$ is the query vector of the BiLSTM hidden layer, $K$ is the key vector, $V$ is the value vector, $W_k$ and $W_h$ are random matrices, $H$ is the vector sequence of the BiLSTM hidden layer, and $K_i^T$ is the transpose of the $i$-th key vector.

Finally, the output of the attention-BiLSTM method is passed through a linear layer, and the probability distribution, i.e., the final recognition result for adult psychological emotions, is obtained through the softmax function:

$\hat{p}(y|S) = \mathrm{softmax}(h \cdot W_h + b_h),$
$\hat{y} = \arg\max_y \hat{p}(y|S),$

where $\hat{p}(y|S)$ is the probability distribution over the labels of the different classes in class set $y$ and $\hat{y}$ is the predicted label. The attention-BiLSTM-based multimodal fusion method is presented in Figure 1.

Figure 1: The attention-BiLSTM-based multimodal fusion method (pipeline: research subjects → Kinect camera, OpenFace, and openSMILE → gait, facial, and speech features → BiLSTM → attention → linear layer → softmax → recognition result).

As shown in Figure 1, when recognizing the psychological emotions of adults, the previously extracted gait, facial, and speech features were fused as the input to the BiLSTM. Important information was then weighted by an attention layer, and finally the recognition result was obtained through the output mapping of the linear and softmax layers.
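A minimal Keras sketch of this architecture is given below. It is not the authors' implementation: the sequence length and hidden size are illustrative assumptions, and the built-in scaled dot-product Attention layer stands in for the Q/K/V formulation above. The 50-dimensional fused feature sequence enters a BiLSTM, the attention layer weights the hidden states, and a dense layer with softmax yields the healthy/depression probabilities.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_attention_bilstm(seq_len=100, feat_dim=50, hidden=64, n_classes=2):
    """Sketch of an attention-BiLSTM classifier for the fused
    gait + facial + speech features (shapes are assumptions)."""
    inputs = layers.Input(shape=(seq_len, feat_dim))
    # BiLSTM returns the hidden state at every time step.
    h = layers.Bidirectional(
        layers.LSTM(hidden, return_sequences=True))(inputs)  # (batch, seq_len, 2*hidden)
    # Scaled dot-product self-attention over the BiLSTM hidden states,
    # a stand-in for the attention formulation in the text.
    att = layers.Attention(use_scale=True)([h, h])
    pooled = layers.GlobalAveragePooling1D()(att)  # summarize the weighted sequence
    pooled = layers.Dropout(0.1)(pooled)
    outputs = layers.Dense(n_classes, activation="softmax")(pooled)  # linear + softmax
    model = models.Model(inputs, outputs)
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```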
4 Results and analysis

4.1 Experimental setup

The data were split into a training set and a test set at a ratio of 7:3. The experiments were conducted on a Linux operating system using the Python programming language and the TensorFlow deep learning framework. During training, the attention-BiLSTM model used a batch size of 4, a learning rate of 0.001, the Adam optimization algorithm, a dropout rate of 0.1, and 100 epochs. The results were evaluated based on the confusion matrix outlined in Table 7.

Table 7: Confusion matrix (rows: true label; columns: recognition label)
 | Healthy | Depression
Healthy | TN | FP
Depression | FN | TP

The following indicators were used to assess the effectiveness of the attention-BiLSTM approach in recognizing psychological emotions in adults:
(1) Accuracy = (TP + TN) / (TP + FP + FN + TN),
(2) Precision = TP / (TP + FP),
(3) Recall = TP / (TP + FN),
(4) F1 = (2 × P × R) / (P + R), where P is precision and R is recall.
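For concreteness, a short sketch of computing these indicators from true and predicted labels is given below, assuming the labels are encoded as 0 for healthy and 1 for depression (an assumed encoding, matching Table 7). Under the setup above, a model such as the one sketched at the end of Section 3 would first be trained with model.fit(X_train, y_train, batch_size=4, epochs=100) to produce the predictions.

```python
import numpy as np

def evaluate(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for the binary
    healthy (0) / depression (1) labels, following Table 7."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    tn = int(np.sum((y_pred == 0) & (y_true == 0)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```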
4.2 Analysis of results

The recognition performance of the proposed method was first evaluated on a binary classification dataset from the UCI repository: the heart disease dataset (https://archive.ics.uci.edu/dataset/45/heart+disease). This dataset consists of 13 attributes and contains a total of 3,205 samples. The method was compared with other basic recognition approaches (Table 8).

Table 8: Comparison of recognition results on the heart disease dataset
Method | Accuracy | Precision | Recall | F1 value
Logistic regression (LR) | 0.9573 | 0.9722 | 0.8974 | 0.9333
Naive Bayes (NB) | 0.9715 | 0.9561 | 0.9715 | 0.9561
Support vector machine | 0.9630 | 0.9829 | 0.9127 | 0.9465
Random forest | 0.9744 | 0.9621 | 0.9695 | 0.9658
Attention-BiLSTM | 0.9846 | 0.9736 | 0.9855 | 0.9795

Table 8 shows that, compared with LR, NB, and the other methods, the recognition performance of the attention-BiLSTM model was superior. Its accuracy on the heart disease dataset reached 0.9846, its precision was 0.9736, its recall was 0.9855, and its F1 value was 0.9795, all the highest among the compared methods. This result demonstrates the strong recognition capability of the attention-BiLSTM method.

The impact of multimodal fusion on the recognition of psychological emotions in adults was then analyzed using the same recognition method. T-tests were conducted in SPSS 22.0 between the results of the other modality combinations and the results of the multimodal fusion (gait + facial expression + speech) proposed in this paper, with a significance level of 0.05. The comparison across different modalities is presented in Table 9.

Table 9: Comparison of recognition results using different modalities
Modality | Accuracy | Precision | Recall | F1 value
Gait | 0.7273* | 0.7143* | 0.7576* | 0.7353*
Face | 0.7424* | 0.7500* | 0.7273* | 0.7385*
Speech | 0.7121* | 0.7059* | 0.7273* | 0.7164*
Gait + face | 0.7727* | 0.8000* | 0.7273* | 0.7619*
Gait + speech | 0.7576* | 0.7429* | 0.7879* | 0.7647*
Face + speech | 0.7576* | 0.7742* | 0.7273* | 0.7500*
Gait + face + speech | 0.8182 | 0.8387 | 0.7879 | 0.8125
Note: * indicates a statistically significant difference from the gait + face + speech result (p < 0.05).

Table 9 shows that among the unimodal approaches, facial features performed relatively best. In terms of accuracy, facial features showed a 1.51% improvement over gait and a 3.03% improvement over speech; in terms of the F1 value, they showed a 0.32% improvement over gait and a 2.21% improvement over speech. These outcomes suggest that, among gait, facial, and speech features, facial features contain the information that best supports the recognition of psychological emotions in adults and can effectively distinguish between depressed and healthy individuals. In multimodal recognition, the pairwise combinations of modalities increased the accuracy compared with the individual modalities: 0.7727 for gait + face, 0.7576 for gait + speech, and 0.7576 for face + speech. Fusing the three modalities of gait, face, and speech ultimately yielded an accuracy of 0.8182, an improvement of 4.55% to 6.06% over the pairwise combinations. The F1 value reached 0.8125, an improvement of 4.78% to 6.25% over the pairwise combinations. The statistical analysis shows a significant difference between the recognition results of the multimodal fusion (gait + facial + speech) and those of the other modality combinations. These outcomes suggest that integrating the three modalities provides more comprehensive information for adult psychological emotion recognition, leading to the best recognition results.

The generalization ability of the attention-BiLSTM approach was then analyzed, with the model inputs unified as the multimodal fusion of gait, face, and speech. T-tests were also conducted on the results. The comparison is shown in Table 10.

Table 10: Comparison of different LSTM models
Method | Accuracy | Precision | Recall | F1 value
RNN [14] | 0.6818* | 0.6875* | 0.6667* | 0.6769*
LSTM [15] | 0.7121* | 0.7188* | 0.6970* | 0.7077*
BiLSTM | 0.7727* | 0.7818* | 0.7576* | 0.7692*
Attention-LSTM [16] | 0.8030 | 0.8125 | 0.7879 | 0.8000
Attention-BiLSTM | 0.8182 | 0.8387 | 0.7879 | 0.8125

Table 10 reveals that the traditional RNN method performed poorly in psychological emotion recognition, with all indicators below 0.7. The LSTM method achieved an accuracy of 0.7121, a 3.03% improvement over the RNN method, and an F1 value of 0.7077, a 3.08% improvement over the RNN method, verifying the effectiveness of LSTM relative to RNN. Comparing the LSTM method with the BiLSTM method, the latter showed a 6.06% improvement in accuracy and a 6.15% improvement in F1 value, indicating the enhancement brought by bidirectional learning. Furthermore, compared with the LSTM method, the attention-LSTM method showed a 9.09% improvement in accuracy and a 9.23% improvement in F1 value, showing the positive impact of the attention mechanism on recognition. Finally, the attention-BiLSTM approach outperformed the attention-LSTM approach, with a 1.52% improvement in accuracy and a 1.25% improvement in F1 value. These results verify the reliability of the method proposed in this paper for recognizing psychological emotions in adults. The statistical analysis shows significant differences between the recognition results of attention-BiLSTM and those of the RNN, LSTM, and BiLSTM methods. Comparing the attention-BiLSTM method with the attention-LSTM method, although there was no significant difference in accuracy and recall, there was a significant difference in precision and F1 value, further supporting the superiority of the proposed method.

5 Discussion

With the advancement of technology, methods such as machine learning and deep learning have been applied increasingly to recognizing psychological emotions in adults, and the analysis of emotional features has become more profound. However, most current studies on psychological emotion recognition focus on single-modal features, such as text, speech, facial expressions, or gait patterns, with limited research on fused multimodal features. Nevertheless, the features of different modalities may complement each other, and integrating them can achieve better recognition results.
Therefore, this study focused on the recognition of psychological emotions in adults under multimodal fusion by selecting gait, facial, and speech feature data. An attention-BiLSTM method was designed to distinguish between the healthy and depressed groups. The attention-BiLSTM model demonstrated good recognition performance on a binary benchmark dataset, outperforming methods such as NB and the support vector machine. Furthermore, on the collected dataset, the fusion of the gait, facial, and speech modalities discriminated between the depressed and healthy groups better than the single-modal approaches. The F1 value reached 0.8125, and this result differed significantly from the other single-modal and bimodal recognition outcomes. The features of the three modalities complement each other and provide more comprehensive information; therefore, the attention-BiLSTM model also learned the features more effectively and improved its ability to distinguish depressed from healthy individuals. Comparing the different LSTM models, the introduction of attention led to a significant difference between the recognition results of the attention-BiLSTM and BiLSTM methods (p < 0.05). There was also a significant difference in precision and F1 value between the attention-LSTM and attention-BiLSTM methods, further confirming the reliability of the attention-BiLSTM method.

Although some achievements have been made in this research on adult emotion recognition, there are still shortcomings. For example, the methods used for extracting the gait, facial, and speech features are relatively simple and leave room for further optimization, and the feature fusion method is a direct concatenation, which deserves further discussion. In future work, more in-depth research will be conducted on feature extraction and fusion, and more comprehensive and extensive datasets will be collected to further validate the reliability of the proposed method.

6 Conclusion

In this study, gait, facial, and speech feature data from depressed and healthy groups were collected for the recognition of adult psychological emotions, and an attention-BiLSTM method was developed for recognition. Experimental analysis revealed that multimodal fusion yielded a better recognition effect than unimodal approaches. Furthermore, compared with the RNN method and similar methods, the attention-BiLSTM approach proposed in this paper exhibited higher accuracy in adult psychological emotion recognition. These findings suggest the potential for further promotion and practical application of the attention-BiLSTM method.

References
[1] Gaur M, Alambo A, Sain JP, Kursuncu U, Thirunarayan K, Kavuluru R, Sheth A, Welton R, Pathak J (2019). Knowledge-aware Assessment of Severity of Suicide Risk for Early Intervention. The World Wide Web Conference, pp. 514-525. https://doi.org/10.1145/3308558.3313698.
[2] Zhong Y, Sun L, Ge C, Fan H (2021). HOG-ESRs Face Emotion Recognition Algorithm Based on HOG Feature and ESRs Method. Symmetry, 13(2), pp. 1-18. https://doi.org/10.3390/sym13020228.
[3] Wani AH, Hashmy R (2023). A supervised multinomial classification framework for emotion recognition in textual social data. International Journal of Advanced Intelligence Paradigms, 24(1/2), pp. 173-189. https://doi.org/10.1504/IJAIP.2018.10027081.
[4] Hossain MS, Muhammad G (2019). An Audio-Visual Emotion Recognition System Using Deep Learning Fusion for a Cognitive Wireless Framework. IEEE Wireless Communications, 26(3), pp. 62-68. https://doi.org/10.1109/MWC.2019.1800419.
[5] Alamgir FM, Alam MS (2022). A Novel Deep Learning-Based Bidirectional Elman Neural Network for Facial Emotion Recognition. International Journal of Pattern Recognition and Artificial Intelligence, 36(10), pp. 1-37. https://doi.org/10.1142/S0218001422520164.
[6] Tsouvalas V, Ozcelebi T, Meratnia N (2022). Privacy-preserving Speech Emotion Recognition through Semi-Supervised Federated Learning. 2022 IEEE International Conference on Pervasive Computing and Communications Workshops and other Affiliated Events (PerCom Workshops), Pisa, Italy, pp. 359-364. https://doi.org/10.1109/PerComWorkshops53856.2022.9767445.
[7] Murri MB, Triolo F, Coni A, Tacconi C, Nerozzi E, Escelsior A, Respino M, Neviani F, Bertolotti M, Bertakis K, Chiari L, Zanetidou S, Amore M (2020). Instrumental assessment of balance and gait in depression: A systematic review. Psychiatry Research, 284, pp. 112687. https://doi.org/10.1016/j.psychres.2019.112687.
[8] Inegbenosun HE, Tlasek-Wolfson M (2021). QIM21-086: Implementation of Depression Screening With the Patient Health Questionnaire-9 (PHQ-9) at a Radiation Oncology Department. Journal of the National Comprehensive Cancer Network: JNCCN, 19(3.5), pp. QIM21-086. https://doi.org/10.6004/jnccn.2020.7713.
[9] Baltrusaitis T, Zadeh A, Lim YC, Morency LP (2018). OpenFace 2.0: Facial Behavior Analysis Toolkit. IEEE International Conference on Automatic Face & Gesture Recognition, pp. 59-66. https://doi.org/10.1109/FG.2018.00019.
[10] Eyben F, Wöllmer M, Schuller B (2010). openSMILE: the Munich versatile and fast open-source audio feature extractor. ACM International Conference on Multimedia, pp. 1459-1462. https://doi.org/10.1145/1873951.1874246.
[11] Balaji E, Brindha D, Elumalai VK, Vikrama R (2021). Automatic and non-invasive Parkinson's disease diagnosis and severity rating using LSTM network. Applied Soft Computing, 108(4), pp. 1-14. https://doi.org/10.1016/j.asoc.2021.107463.
[12] Khademi Z, Ebrahimi F, Kordy HM (2022). A transfer learning-based CNN and LSTM hybrid deep learning model to classify motor imagery EEG signals. Computers in Biology and Medicine, 143, pp. 105288. https://doi.org/10.1016/j.compbiomed.2022.105288.
[13] Yang G, Liu S, Li Y, He L (2023). Short-term prediction method of blood glucose based on temporal multi-head attention mechanism for diabetic patients. Biomedical Signal Processing and Control, 82, pp. 1-12. https://doi.org/10.1016/j.bspc.2022.104552.
[14] Liu Y, Zhang Q, Song L, Chen Y (2019). Attention-based recurrent neural networks for accurate short-term and long-term dissolved oxygen prediction. Computers and Electronics in Agriculture, 165, pp. 104964. https://doi.org/10.1016/j.compag.2019.104964.
[15] Shang S, Luo Q, Zhao J, Xue R, Sun W, Bao N (2021). LSTM-CNN network for human activity recognition using WiFi CSI data. Journal of Physics: Conference Series, 1883(1), pp. 1-9. https://doi.org/10.1088/1742-6596/1883/1/012139.
[16] Muhammad K, Mustaqeem, Ullah A, Imran AS, Sajjad M, Kiran MS, Sannino G, de Albuquerque VHC (2021). Human action recognition using attention-based LSTM network with dilated CNN features. Future Generation Computer Systems, 125, pp. 820-830. https://doi.org/10.1016/j.future.2021.06.045.