https://doi.org/10.31449/inf.v48i10.5961 Informatica 48 (2024) 195 –208 195 
Human-Computer Interaction Based on ASGCN Displacement 
Graph Neural Networks 
Yiping Yang¹*, Jijun Liu², Liang Zhao¹, Yuchen Yin¹
¹ China State Shipbuilding Corporation Limited No.723 Research Institute, Yangzhou, 225001, China
² School of Electronic Information and Communications, Huazhong University of Science and Technology, Wuhan, 430074, China
E-mail: yyp1982to723@163.com
* Corresponding author
Keywords: ASGCN algorithm, Human-computer interaction, Long and short-term memory algorithm, Joint features, 
Action recognition, Encoder 
Received: March 29, 2024 
Intelligent terminal devices have become a popular theme for research in recent years, but the 
development of intelligent terminals cannot be separated from high-quality human-computer 
interaction models. Behavioral action recognition is one of the main ways to realize human-computer
interaction, but current action recognition models still suffer from obvious time delay and low
recognition accuracy. In light of this, the study built an intelligent human action capture and
recognition model using an action structured graph convolutional network in conjunction with an 
encoder-decoder architecture, long and short-term memory algorithms, and controlled experiments to 
assess the model's performance. The outcomes indicated that the loss of the proposed model after 
convergence on the test dataset was 0.56%, while the average accuracy was 95.39%, and both 
performances outperformed the control experiment. In the meantime, the suggested model's average 
F1 score was 89.79%, which was 11.13% and 3.82% higher than that of the experiment's control 
model. The suggested model exhibits some improvement in the accuracy and F1 score of action 
recognition, according to the experimental findings. Therefore, the research of the suggested behavior 
recognition model has practical value. Additionally, in the real scene behavior recognition detection 
experiments, the proposed model validates the viability of the model with higher accuracy and reduced 
delay. 
Povzetek (Summary): The paper presents an improved model for recognizing human actions using the ASGCN and
LSTM algorithms for more accurate and faster human-computer interaction.
1 Introduction 
Human-computer interaction (HCI) usually relies on 
gesture recognition, speech recognition and action 
recognition (AR), etc. Speech recognition is very mature 
in current development, and there are quite a number of 
intelligent models that can realize the needs of daily HCI 
[1]. However, to realize more intelligent HCI, it is
necessary to improve the algorithm's ability to capture and
understand combinations of action features. The current
mainstream motion capture algorithms include a series of
machine learning algorithms such as convolutional neural
network (CNN), graph convolution network (GCN), deep
neural network (DNN), and so on, among which the
GCN algorithm performs better on image and video
processing [2-3].
However, the traditional GCN algorithm still has obvious 
shortcomings. Shallow GCN cannot transfer labels from a 
limited amount of training data to the whole graph 
structure, and the semi-supervised performance is poor. 
Deep GCN will have excessive smoothing problems, and 
it is difficult to distinguish the features of the nodes [4]. 
An abstract idea of a deep learning model is the 
encoder-decoder (ED) architecture. An ED structure may 
compress a lot of data, which cuts down on processing 
time and space while increasing transmission and storage 
efficiency [5]. The advantages of the ED architecture are 
especially obvious when processing large files such as 
images, videos, and audio. In view of this, the
study selects the action structured graph convolutional
network (ASGCN) to be optimized and used in the
construction of the HCI model, and an ED architecture
built on long short-term memory (LSTM) is used to
optimize the ASGCN model. The innovation of the study
is that it introduces an LSTM-based encoder structure 
that is utilized to capture specific movements of the 
human body. The article is structured into four sections. 
Related work, the first section, concentrates on the 
theoretical analysis that came before the research. The 
second part is the methodology, which performs HCI 
model construction through advanced techniques. The
proposed model is put through performance tests
in the third section, known as "model testing," in order to
confirm its advanced nature. The fourth part is the
conclusion, which summarizes the research results and 
proposes future improvement directions. 
2 Related works 
HCI is important for the development of smart devices,
so domestic and international researchers have explored
how to realize intelligent HCI. Chowdary et al. used
deep learning techniques to recognize human emotions, 
thus promoting the intelligence of the model and HCI. 
The method eliminated the original fully connected layer 
of ConvNets and added a new fully connected layer with 
weights based on the number of instructions in the task. 
The study's findings demonstrated that the suggested 
emotion identification model can identify emotions with 
an average accuracy (AverA) of 96% [6]. Liu et al. 
conducted a related study on head pose estimation and 
optimized the technique for application in HCI. Liu et al. 
solved the problem of neighboring pose information 
processing and mislabeling gap in head pose estimation. 
The model was evaluated using an open-source dataset, 
and the study found that the suggested model performed 
noticeably better than other cutting-edge techniques, 
leading to improved outcomes for the optimization 
approach [7]. Zhang et al. constructed a glove-based HCI 
system using friction electric nanogenerators in order to 
realize the intelligence of wearable devices. The system
also used the friction electric nanogenerators to extract
and analyze multidimensional signal
features for gesture visualization and manipulator control
functions. The study applied the proposed model to five
object classification and recognition tasks. According to 
the experimental findings, the model performed the five 
tasks with an AverA of 98.7% [8]. Zhang et al. proposed 
a gesture recognition system called WiGesID in their 
study as they found that gesture recognition technology 
can advance HCI to some extent. The system employed 
Wi-Fi sensing and radar sensing techniques to enhance 
the security of the gesture recognition system and 
computer vision techniques to realize the dynamic 
patterns of gesture recognition. The findings indicated 
that the proposed system exhibits superior performance in 
cross-domain sensing, with enhanced recognition 
accuracy compared to state-of-the-art models [9]. 
GCN is a neural network designed to process images,
but as the demand for image processing increases, the
traditional GCN struggles to meet current needs, so
many researchers have improved and optimized
it. Bessadok et al. provided a medical image
recognition method based on learning depth graph neural 
network (GNN) structure. The method incorporated DNN 
and GNN. They used the method for the recognition of a 
comprehensive roadmap of neuronal activity in the 
human brain. According to the testing data, the suggested 
approach performs better and can obtain recognition 
accuracy of above 90% [10]. Wu et al. proposed a 
GCN-based natural language processing model, a 
taxonomy that systematically organizes existing GNN 
research on natural language processing along three axes. 
In addition, the method introduced ED techniques to 
achieve global encoding of input data. It was 
experimentally concluded that the proposed model 
possesses high accuracy and recall in natural language 
processing and classification, thus the proposed model is 
feasible [11]. Zhu et al. presented a GCN and DNN-based 
picture analysis model to address the significant 
unsupervised graph problem. The model implemented the 
recovery of cluster structure by DNN improved GCN 
pooling method and constructed an unsupervised pooling 
method inspired by the modularity metric of clustering 
quality. After multiple sets of controlled trials, the 
suggested model's overall performance was shown to be 
superior to the mainstream state-of-the-art at the time 
[12]. As a result, the proposed model is considered 
state-of-the-art. Zhu et al. found that GCN only focuses 
on the homogeneity of image nodes and ignores the 
heterogeneity among different image nodes in practical 
applications. In order to solve this problem, the
researchers proposed a new graph convolution framework
that contains an interpretable compatibility matrix for
modeling the level of heterogeneity or homogeneity in a graph.
Experimentally, it was concluded that the new framework
significantly reduces the dependence on
training samples, while the accuracy
was improved [13]. Kiningham et al. proposed a
GNN accelerator architecture for low-latency inference
design, aiming to address the shortcomings of GCN's low 
efficiency for image processing. The architecture 
combined arithmetic-intensive vertex-centered operations 
with memory-intensive edge-centered operations and 
introduced a high-performance matrix multiplication 
engine. Experiments concluded that the proposed 
framework effectively reduces the sample latency and 
ensemble average [14]. 
In summary, many researchers have explored the
application of the GCN algorithm in various fields, but
this method still has obvious shortcomings for
AR tasks. Therefore, the study selects the ASGCN
algorithm as the core algorithm on the basis of GCN and
introduces other advanced technologies to improve it, so
as to construct a more complete AR model to realize
intelligent HCI.
3 Intelligent human-computer interaction model based on optimized ASGCN algorithm
This section first discusses in depth the construction of
a human AR model based on an LSTM encoder and the
ASGCN algorithm, aiming to further improve the effect of
HCI in daily life. To realize intelligent HCI, the study fuses
LSTM with ED and adds it to the feature fusion module
(FFM) of the HCI model in order to achieve feature
improvement and accurate fusion.
 
 
3.1 Construction of human motion capture 
and recognition model based on ASGCN 
The study uses the GCN algorithm to create the human 
motion capture model because of the GCN model's 
impressive performance in a variety of domains and the 
quick growth of artificial intelligence technologies [15]. 
Given the specific needs of this research, the study 
chooses ASGCN algorithm as the core algorithm of 
human motion capture model. ASGCN is an improved 
algorithm of spatio-temporal graph convolutional 
networks (STGCN), and ASGCN adds the extraction of 
human body joint features (JFs) in the spatial domain on 
the basis of STGCN to improve the accuracy and stability 
of AR [16-17]. The ASGCN algorithm stacks
behavior-actions together to form a fused graph
convolution module, thus learning spatial and temporal
feature sequences and performing AR; this
fused module lets GCN overcome its difficulty with
dynamic processing. Furthermore, in terms of
flexibility and scalability, the enhanced ASGCN 
algorithm outperforms the STGCN algorithm. Figure 1 
depicts the ASGCN algorithm's recognition structure. 
 
[Figure 1 diagram labels: input action, actional links, structural links, time, feature response of ASGCN]
Figure 1: Schematic diagram of ASGCN algorithm recognition structure
 
Equation (1) displays the mathematical expression for the 
convolution computation used in ASGCN's JF extraction 
for the human body, which is based on a convolution 
kernel (CK). 
$$f_{out}(x) = \sum_{h=1}^{K} \sum_{w=1}^{K} f_{in}\big(s(x, h, w)\big) \cdot \mathbf{w}(h, w) \qquad (1)$$
In Equation (1), $K$ denotes the CK size and $x$
denotes a point in the acquisition region. $(h, w)$
denotes the height and width of the sampling region, and
$\mathbf{w}(h, w)$ denotes the weights of the sampling region.
$s(x, h, w)$ is a sampling function whose computational
expression is shown in Equation (2).
$$s(x, h, w) = x + \tilde{s}(h, w) \qquad (2)$$
In Equation (2), $\tilde{s}(h, w)$ denotes the pixels in the
neighborhood of point $x$. However, the skeletal model
of the human body is an irregular image, so different
weights need to be assigned for different skeletal joints in
order to correctly analyze the behavior of the human body
[18]. Therefore, after redefining the weights, the
convolution calculation expression of ASGCN is shown
in Equation (3).
$$f_{out}(v_i) = \sum_{v_j \in N(v_i)} \frac{1}{Z_i(v_j)} f_{in}(v_j) \cdot \mathbf{w}\big(l_i(v_j)\big) \qquad (3)$$
In Equation (3), $v_i$ denotes a point in the sampling
area, and $v_j$ denotes a sampling point adjacent to $v_i$.
$Z_i(v_j) = |\{v_k \mid l_i(v_k) = l_i(v_j)\}|$ is a
normalization term, which is used to balance the
weights of different collection points. $l_i$ denotes the set
of mapping relationships from different collection points
to neighboring collection points. As the calculation above
shows, supplying a large number of samples at once
results in a huge amount of model computation, so
partitioning the features to be extracted into subsets
addresses the issue that the model would otherwise have
to extract many JFs simultaneously. Figure 2 illustrates the
commonly used subset division method in the ASGCN
algorithm.
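The normalized, subset-weighted aggregation of Equation (3) can be illustrated with a minimal NumPy sketch. This is only an illustration under stated assumptions, not the study's implementation: the function name `graph_conv`, the dense `adjacency` matrix, the integer `label` matrix (holding the subset index $l_i(v_j)$), and the per-subset `weights` tensor are all hypothetical choices made for the example.

```python
import numpy as np

def graph_conv(f_in, adjacency, label, weights):
    """Sketch of Eq. (3): f_out(v_i) = sum_j (1/Z_i(v_j)) * f_in(v_j) . w(l_i(v_j)).

    f_in:      (V, C_in) input joint features
    adjacency: (V, V) 0/1 matrix, adjacency[i, j] = 1 if v_j is in N(v_i)
    label:     (V, V) integer matrix, label[i, j] = subset index l_i(v_j) (assumed layout)
    weights:   (num_subsets, C_in, C_out) one weight matrix per subset (assumed layout)
    """
    num_nodes = f_in.shape[0]
    f_out = np.zeros((num_nodes, weights.shape[2]))
    for i in range(num_nodes):
        neigh = np.flatnonzero(adjacency[i])
        for j in neigh:
            # Z_i(v_j): number of neighbours of v_i that fall in the same subset as v_j
            z = np.sum(label[i, neigh] == label[i, j])
            f_out[i] += (f_in[j] @ weights[label[i, j]]) / z
    return f_out

# Tiny 3-joint chain with self-loops; subset 0 = the joint itself, subset 1 = its neighbours.
adjacency = np.array([[1, 1, 0], [1, 1, 1], [0, 1, 1]])
label = np.where(np.eye(3, dtype=bool), 0, 1)
weights = np.stack([np.eye(2), np.eye(2)])
f_in = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
f_out = graph_conv(f_in, adjacency, label, weights)
```

With identity weights, the middle joint receives its own feature plus the average of its two neighbours, which shows how $Z_i(v_j)$ balances subsets of different sizes.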
 
 
Figure 2: Schematic diagram of subset partitioning method 
 
The subset division only reduces the number of
features the model extracts at the same time; it
does not fundamentally solve the drawbacks of the
ASGCN model, such as long memory access time
and a large number of learning parameters. To address these
drawbacks, the study adopts the displacement operation
(DO) to simplify the learning parameters of GCN and
improve its computational efficiency and memory access
efficiency. The computational expression of the DO is
shown in Equation (4).
$$O_{a,b,m} = \sum_{i,j} K_{i,j,m} \, I_{a+i,\, b+j,\, m} \qquad (4)$$
In Equation (4), $O$ denotes the output tensor and
$K$ denotes the CK. $m$ denotes the channel index shared
by the output and input, and $I$ denotes the input
tensor. $a, b$ index the position in the output tensor, and
$i, j$ are the dimension indices used to address the input tensor.
In GCN, the CK is the core component for extracting human
body features, since it aggregates the information in
the image. The study uses the DO mainly to shrink the CK,
so as to reduce the amount of computation and the number
of learning parameters. The DO method introduces unit values at
specific CK index positions, allowing the model to focus
on local features rather than global information. This
significantly reduces the size of the convolution kernel
while maintaining the effectiveness of feature extraction.
This method simplifies the model parameters, reduces the
computational burden, and promotes memory access
efficiency, thereby making DO an effective tool for
reducing the CK size and simplifying the model structure.
The CK after reduction is shown in Equation (5).
$$K_{i,j,m} = \begin{cases} 1, & i = i_m \text{ and } j = j_m \\ 0, & \text{otherwise} \end{cases} \qquad (5)$$
In Equation (5), $(i_m, j_m)$ is the index, indicating that
the value of the CK at $(i_m, j_m)$ is 1 and the value of the
CK at the rest of the locations is 0. The flow of the GCN
convolution after the introduction of the DO is shown in
Figure 3.
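To make the relation between Equations (4) and (5) concrete, the following NumPy sketch applies a per-channel convolution with a one-hot kernel and shows that it reduces to a pure shift of each channel. It is a sketch under assumptions: the helper names `depthwise_conv` and `shift_kernel`, the channels-last layout, zero-based kernel indices, and the "valid" output region are illustrative choices, not details given in the text.

```python
import numpy as np

def depthwise_conv(inp, kernel):
    """Eq. (4) on the valid region: O[a, b, m] = sum_{i,j} K[i, j, m] * I[a+i, b+j, m]."""
    kh, kw, _ = kernel.shape
    h, w, c = inp.shape
    out = np.zeros((h - kh + 1, w - kw + 1, c))
    for i in range(kh):
        for j in range(kw):
            # kernel[i, j] has shape (c,), so each channel is scaled independently
            out += kernel[i, j] * inp[i:i + out.shape[0], j:j + out.shape[1], :]
    return out

def shift_kernel(offsets, size):
    """Eq. (5): K[i_m, j_m, m] = 1 for the chosen offset of channel m, 0 elsewhere."""
    kernel = np.zeros((size, size, len(offsets)))
    for m, (i_m, j_m) in enumerate(offsets):
        kernel[i_m, j_m, m] = 1.0
    return kernel

inp = np.arange(4 * 4 * 2, dtype=float).reshape(4, 4, 2)
offsets = [(0, 0), (2, 1)]            # hypothetical per-channel shift offsets
kernel = shift_kernel(offsets, size=3)
out = depthwise_conv(inp, kernel)
```

Because every channel of the kernel contains a single 1, the convolution needs no learned weights: channel `m` of the output is simply channel `m` of the input read at offset `(i_m, j_m)`.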
 
[Figure 3 diagram labels: Shift, Conv, M, M × N, D_F]
Figure 3: GCN convolution process after introducing displacement operation
 
In Figure 3, the left side of symbol  represents 
the displacement convolution module, the right side of 
convolution symbol represents FFM, and  is used to 
connect DO and convolution channels. In the 
displacement convolution module, the DO operation 
represents the primary step of the overall process. Its 
function is to enrich the representation of features and to 
capture broader neighborhood information, thereby 
enhancing the model's perception of multi-scale 
connectivity between vertices. The input of the module 
comprises multiple layers of features, each representing a 
distinct subset of features. Through the process of DO, 
these feature subsets are able to refine their positions in 
order to enhance the model's adaptability to local 
structures. Subsequently, FFM employs the inverse 
operation of subset partitioning rules to achieve the 
recombination of these feature subsets. This process is 
not merely a restoration of existing features. Rather, it 
entails the acquisition of more nuanced and varied feature 
representations through the precise regulation of feature 
recombination. This approach enables the model to 
enhance its representation capabilities while 
simultaneously reducing its computational complexity 
[19]. To reflect the process of feature fusion, a 
mathematical expression for human action features is 
introduced in the study, as shown in Equation (6). 
 
$$f_v = f_{(v,\, :c)} \,\|\, f_{(E_v^1,\, c:2c)} \,\|\, f_{(E_v^2,\, 2c:3c)} \,\|\, \cdots \,\|\, f_{(E_v^n,\, nc:)} \qquad (6)$$
In Equation (6), $c$ denotes the channel serial
number and $E$ denotes the set of neighboring
acquisition points. $n$ denotes the nodes of the
neighboring collection points, $v$ denotes the currently
calculated collection points, and $\omega$ denotes the weight
of each neighboring collection point. The weighted sum
operation performed on each collection point and its 
neighborhood endows the model with the flexibility to 
identify heterogeneous connections between vertices, 
thereby improving its accuracy in identifying diverse 
human motion features and enhancing its ability to 
represent complex human motion patterns during the 
recognition process. In the context of graph convolution, 
the weight allocation of each adjacent collection point 
serves to quantify the importance of adjacent points, 
thereby ensuring that the heterogeneity of the graph 
structure is taken into account during feature aggregation. 
Therefore, the computational expression of $\omega$ is shown
in Equation (7).
$$\omega_c = H\big(D_k(x)\big) \qquad (7)$$
In Equation (7), $H(\cdot)$ denotes the activation
function (AF) of the displacement convolution module,
and the AF selected for the study is the ReLU function.
$\omega_c$ denotes the proportion of the $c$ th channel to the
total data, and $D_k(x)$ denotes the parameter mapping
relationship of the input sample $x$. $x$ denotes the input
sample and $k$ denotes the number of parameters. The
model's speed of feature extraction and recognition can 
be somewhat increased by the DO and subset division, 
but it still lacks feature fusion and AR accuracy, therefore 
other sophisticated techniques must be included by the 
study in order to fully optimize the model. 
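The channel-wise concatenation in Equation (6) can be sketched as follows. This is an illustrative sketch only: the function name `shifted_feature`, the `(V, C)` feature layout, and the ordered list `neighbor_ids` standing in for the neighbor sets are assumptions, and the per-neighbor weights described in the text are left out so that the slicing pattern stays visible.

```python
import numpy as np

def shifted_feature(features, v, neighbor_ids, c):
    """Sketch of Eq. (6): take channels [:c] from node v, then channels
    [k*c:(k+1)*c] from the k-th neighbour, with the last neighbour
    supplying all remaining channels [n*c:]."""
    parts = [features[v, :c]]
    n = len(neighbor_ids)
    for k, u in enumerate(neighbor_ids, start=1):
        stop = None if k == n else (k + 1) * c
        parts.append(features[u, k * c:stop])
    return np.concatenate(parts)

features = np.arange(3 * 6).reshape(3, 6)   # 3 joints, 6 channels each
f_v = shifted_feature(features, v=0, neighbor_ids=[1, 2], c=2)
```

The result has the same number of channels as the input, but each channel group now originates from a different joint, which is how a displacement can mix neighborhood information without extra multiplications.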
3.2 Construction of action recognition model based on ED improvement
The ASGCN model needs to partition the collected features
into subsets during the construction process. Although the
model has an FFM to splice the extracted features, it lacks a
correction module. The correction module can check the
features spliced by the ASGCN model and, when wrong
splicing is found, return the wrong features to the fusion
module to be spliced again [20]. Adding the correction
module can increase the accuracy of the AR model and
can also substantially increase the
efficiency of feature fusion. In this study, the subset
division rule is used as the input to the encoder, which in 
turn yields the corresponding coding sequence for the 
entire subset. The study's encoder is based on the LSTM 
method, which computes the positional characteristics of 
various subsets according to their weights at each 
encoding step while accounting for the subset's global 
location. LSTM has a hidden state (HS) and a memory state
(MS) that record historical data, so this algorithm
can better record the connection between different subsets.
The HS and MS of LSTM are realized by three
important structures, the input gate (IG), forgetting gate
(FG) and output gate (OG), which are defined as shown
in Equation (8).
$$\begin{cases} f_t = \sigma(W_f [h_{t-1}; x_t]) \\ i_t = \sigma(W_i [h_{t-1}; x_t]) \\ o_t = \sigma(W_o [h_{t-1}; x_t]) \end{cases} \qquad (8)$$
In Equation (8), $\sigma$ denotes the Sigmoid activation function,
and $f_t, i_t, o_t$ denote the FG, IG and OG, respectively.
$t$ denotes the moment, and the three parameters
$W_f, W_i, W_o$ denote the overall weight matrices of the FG,
IG and OG, respectively. $x_t$ denotes the input data at
moment $t$, and $h_{t-1}$ denotes the HS at the previous
moment. Computing the MS and HS also
requires the candidate MS, whose
calculation expression is shown in Equation (9).
$$\tilde{C}_t = \sigma(W_c [h_{t-1}; x_t]) \qquad (9)$$
In Equation (9), $\tilde{C}_t$ is the candidate MS and $h_{t-1}$ is
the HS at the previous moment. $W_c$ denotes the weight matrix of the candidate
MS. The candidate MS is the current moment MS,
including the new candidate information and parameters
added in the LSTM at this time [21]. After obtaining the
candidate MS, it is also necessary to calculate the MS and
HS of the LSTM, and its calculation expression is shown
in Equation (10).
$$\begin{cases} h_t = o_t \odot \tanh(C_t) \\ C_t = i_t \odot \tilde{C}_t + f_t \odot C_{t-1} \end{cases} \qquad (10)$$
In Equation (10), $\odot$ denotes element-wise multiplication,
that is, a multiplication operation between each dimension of the same feature.
$C_t$ is the memorized state and $h_t$ is the HS. In LSTM
state computation, the AFs used are all Sigmoid functions.
LSTM is based on FGs, IGs and OGs to form a
hidden unit which is the core of the LSTM algorithm [22].
The structure of the hidden unit is shown in Figure 4.
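A single LSTM update following Equations (8) to (10) can be sketched in NumPy as below. This is a minimal sketch under assumptions: the function names, the bias-free weight matrices, and the vector shapes are illustrative, and the candidate MS uses the Sigmoid function because the text states that all AFs in the state computation are Sigmoid (many standard LSTM formulations use tanh for the candidate instead).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W_f, W_i, W_o, W_c):
    """One LSTM step as described by Eqs. (8)-(10), without bias terms."""
    hx = np.concatenate([h_prev, x_t])     # [h_{t-1}; x_t]
    f_t = sigmoid(W_f @ hx)                # forgetting gate, Eq. (8)
    i_t = sigmoid(W_i @ hx)                # input gate, Eq. (8)
    o_t = sigmoid(W_o @ hx)                # output gate, Eq. (8)
    c_cand = sigmoid(W_c @ hx)             # candidate MS, Eq. (9) as stated in the text
    c_t = i_t * c_cand + f_t * c_prev      # memory state, Eq. (10)
    h_t = o_t * np.tanh(c_t)               # hidden state, Eq. (10)
    return h_t, c_t

hidden, inputs = 2, 3
zero_w = np.zeros((hidden, hidden + inputs))   # all-zero weights make every gate equal 0.5
h_t, c_t = lstm_step(np.ones(inputs), np.zeros(hidden), np.ones(hidden),
                     zero_w, zero_w, zero_w, zero_w)
```

With all-zero weights every gate and the candidate evaluate to 0.5, so the new memory state is `0.5 * 0.5 + 0.5 * 1 = 0.75`, which makes the roles of the IG and FG in Equation (10) easy to check by hand.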
 
[Figure 4 diagram labels: hidden unit t (input x_t, output h_t) and hidden unit t+1 (input x_{t+1}, output h_{t+1}), each with σ and tanh activations]
Figure 4: LSTM hidden unit structure diagram
 
The ED architecture with the introduction of LSTM 
algorithm can record the position information of the 
whole subset by memorizing the state. In addition, in the 
encoder, the expression for positional feature calculation 
is shown in Equation (11). 
$$S_t = \sum_{s} \alpha_{ts} \, l_s \qquad (11)$$
In Equation (11), $\alpha_{ts}$ denotes the weight of the
subset at $s$ when the time step of the encoder is $t$. $l_s$
denotes the features of this subset, where
$\alpha_{ts}$ is defined as shown in Equation (12).
$$\alpha_{ts} = \frac{\exp\big(F(q_t, l_s)\big)}{\sum_{s'} \exp\big(F(q_t, l_{s'})\big)} \qquad (12)$$
In Equation (12), $q_t$ denotes the state of the
encoder in the hidden layer of the LSTM algorithm at
time step $t$. $F(q_t, l_s)$ denotes the correlation
between $q_t$ and $l_s$. The calculation of correlation
mainly takes two forms, multiplication and addition.
In order to avoid excessive model computation, the study
adopts the additive form for the correlation
calculation of $q_t$ and $l_s$, whose computational
expression is shown in Equation (13).
$$F(q_t, l_s) = v_a^{t} \tanh(W_{a1} q_t + W_{a2} l_s) \qquad (13)$$
In Equation (13), $\tanh(\cdot)$ denotes the hyperbolic
tangent function. $v_a^{t}$ denotes the output sequence of the
OG of the LSTM algorithm at the moment $t$. $W_{a1}$ and $W_{a2}$
denote the weight matrices of $q_t$ and $l_s$, respectively. The structure
of ED is shown in Figure 5.
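The additive correlation and the normalized weights of Equations (11) to (13) can be sketched as follows. This is a hedged illustration, not the study's code: the function name `additive_attention`, the array shapes, and the use of a plain vector `v_a` for the scoring term are assumptions, and the maximum is subtracted before exponentiation purely for numerical stability.

```python
import numpy as np

def additive_attention(q_t, l, v_a, W_a1, W_a2):
    """Sketch of Eqs. (11)-(13).

    q_t:  (d_q,)     encoder state at time step t
    l:    (S, d_l)   features of the S subsets
    v_a:  (d_a,)     scoring vector (assumed shape)
    W_a1: (d_a, d_q) weight matrix applied to q_t
    W_a2: (d_a, d_l) weight matrix applied to each l_s
    """
    scores = np.tanh(q_t @ W_a1.T + l @ W_a2.T) @ v_a   # F(q_t, l_s), Eq. (13)
    scores = scores - scores.max()                      # stability only; softmax is unchanged
    alpha = np.exp(scores) / np.exp(scores).sum()       # weights, Eq. (12)
    s_t = alpha @ l                                     # positional feature, Eq. (11)
    return alpha, s_t

rng = np.random.default_rng(0)
l = rng.normal(size=(4, 3))
q_t = rng.normal(size=5)
alpha, s_t = additive_attention(q_t, l, rng.normal(size=6),
                                rng.normal(size=(6, 5)), rng.normal(size=(6, 3)))
```

The additive form needs only two matrix-vector products and one dot product per subset, which is consistent with the text's stated motivation of keeping the correlation computation light.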
 
[Figure 5 diagram labels: input X1 to X5, initial state H0, encoder cells, feature, dense layers, target]
Figure 5: Schematic diagram of encoder-decoder architecture
 
Once the computation of the encoder is completed,
the decoder applies the subset feature fusion method, and
its result can then be checked against the output of the
encoder. The study incorporates the
decoder into the FFM so that the decoder becomes a
submodule of the FFM; after this operation, the FFM
has the ability to check. During the encoder and decoder
training process, some parameters of the hidden layer and
Softmax classifier of the LSTM algorithm can be
migrated to the decoder for training, and the encoder
obtains the trained parameters by inverse operation.
Parameter migration training can enhance both the ED
model's capacity to generalize its parameters and
its computing efficiency. The structure of the AR
model fusing the encoder and decoder is shown in Figure
6.
 
[Figure 6 diagram labels: input; ASGCN blocks (ASGC, T-CN); encoder; decoder; feature fusion; global average pooling; action classification (classifier, class probability); recognizer; recognition result]
Figure 6: Schematic diagram of the action recognition model structure integrating encoder and decoder
 
The study's construction of an intelligent HCI model
is now essentially complete, and for the convenience of
subsequent experiments, the study refers to the proposed
model as the L-ASGCN model.
4 L-ASGCN model performance testing and analysis
ASGCN, STGCN, and GCN are used as control models
for controlled experiments in the study in order to verify
the performance of the intelligent HCI model that is
suggested. The equipment used for the experiment is
a computer with an Intel Xeon w9-3495X CPU, 16 GB of
RAM, and an RTX 2080 Ti graphics card. The
experiments mainly used NTU RGB-D dataset, 
Interaction Action RGB-D dataset and Kinetics dataset. 
The Kinetics dataset comprises a diverse range of Internet 
videos, encompassing a multitude of daily activity 
scenarios. Its diversity and scale render it the benchmark 
dataset for behavior recognition algorithms. The 
Interaction Action RGB-D dataset is designed to record 
the interaction behavior between two individuals, 
providing detailed multimodal data on human actions and 
interaction scenarios. The NTU RGB-D dataset 
represents a comprehensive behavior recognition dataset 
that encompasses a diverse array of human activities, 
thereby providing a wealth of human behavior 
recognition scenarios for deep learning models. The 
selected dataset facilitates the deep feature learning of the 
model, particularly in the context of bone and joint data, 
thereby enhancing the accuracy of human pose estimation. 
In terms of evaluation indicators, accuracy, loss rate, 
recall rate, and F1 score are key measures of model 
performance. The accuracy of a model reflects its ability 
to predict correctly. The loss represents the accumulation 
of prediction errors during the training process. The recall 
measures the model's ability to recognize positive 
instances. The F1 score is the harmonic mean of
precision and recall, which can be used to evaluate the
overall performance of the model.
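As a small worked example of the F1 score in its usual definition (the harmonic mean of precision and recall), the sketch below computes it from raw counts. The function name `f1_score` and the example counts are hypothetical and are not taken from the experiments.

```python
def f1_score(tp, fp, fn):
    """F1 from true positives, false positives and false negatives."""
    if tp == 0:
        return 0.0                      # no correct positive predictions
    precision = tp / (tp + fp)          # share of predicted positives that are correct
    recall = tp / (tp + fn)             # share of actual positives that are found
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts: 6 correct detections, 2 false alarms, 6 misses.
example_f1 = f1_score(tp=6, fp=2, fn=6)   # precision 0.75, recall 0.5
```

Because the harmonic mean is dominated by the smaller of the two terms, a model with high precision but low recall (or the reverse) still receives a modest F1 score.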
 
[Figure 7 panels: (a) Comparison chart of loss rates for different models, x-axis Epoch, y-axis Loss /%; (b) Comparison chart of accuracy of different models, x-axis Epoch, y-axis Accuracy /%; legend: ASGCN, L-ASGCN]
Figure 7: Comparison of loss and accuracy of different models
 
Accuracy and loss rate are important indicators of
model performance. The study used the NTU RGB-D dataset
to train the STGCN model and the L-ASGCN model for 30
min each, and then used the Interaction Action
RGB-D dataset as the input to conduct the comparison
experiment of accuracy and loss rate. Figure 7(a) shows
the evolution of the loss rates for the STGCN and
L-ASGCN models. The curves show that both models'
loss rates drop as the number of
iterations increases. However, the loss rate curve of the
STGCN model shows obvious oscillations before
convergence, while that of the L-ASGCN model does
not. In addition, the loss rate
of the STGCN model after convergence is about 0.91%,
while the loss of the L-ASGCN model after convergence
is only 0.56%, so the proposed model has some
advantages in loss rate. Figure 7(b) compares the
accuracy rates of the STGCN model
and the L-ASGCN model. The results presented in Figure
7(b) suggest that, initially, the L-ASGCN model's
accuracy is not as high as the STGCN model's. However,
after a few iterations, the accuracy of the
suggested model grows significantly. When the two
models converge, the AverA of the proposed model is
95.39%, which is 1.41% greater than the AverA of the
STGCN model. Additionally, the suggested model's
accuracy curve is smoother than that of the control
model.
 
[Figure 8 panels: (a) STGCN model confusion matrix; (b) L-ASGCN model confusion matrix; axes: true labels, predicted labels; classes: Raise hands flat, Wave, Punch, Lift your left hand, Jump, Lift your right foot, Squat down; color scale 0.0 to 1.0; diagonal entries as extracted: (a) 90, 90, 83, 90, 86, 87, 80; (b) 90, 90, 90, 88, 90, 80, 87]
Figure 8: Comparison diagram of confusion matrices for different models
 
The confusion matrices (ConM) of the STGCN and
L-ASGCN models are compared in Figure 8. Figure
8(a) shows the ConM the STGCN model produced
using the Interaction Action RGB-D dataset, and Figure
8(b) shows the ConM produced by the suggested model
using this dataset. The average score obtained by the
proposed model is about 87.86, and the average score of
the STGCN model is 86.57. These findings show that the
proposed model offers some improvement, because its AR
effect on the same dataset is superior to that of the control
model.
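The two average scores can be reproduced with a short calculation, assuming that the "average score" is the mean of the diagonal entries of each ConM. This interpretation is an assumption made for the sketch; the diagonal values below are read from the extracted Figure 8 data, and the variable and function names are illustrative.

```python
# Diagonal (correctly recognized) entries read from the extracted Figure 8 data.
stgcn_diagonal = [90, 90, 83, 90, 86, 87, 80]
l_asgcn_diagonal = [90, 90, 90, 88, 90, 80, 87]

def mean_diagonal(diagonal):
    """Average of the per-class diagonal entries of a confusion matrix."""
    return sum(diagonal) / len(diagonal)

stgcn_score = round(mean_diagonal(stgcn_diagonal), 2)      # 606 / 7
l_asgcn_score = round(mean_diagonal(l_asgcn_diagonal), 2)  # 615 / 7
```

Under this assumption the results match the values quoted in the text, 86.57 for STGCN and 87.86 for the proposed model, a difference of about 1.29.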
 
 
 
 
 
 
Table 1: Comparison of output times of different models on the Kinetics dataset for each module
Model name   Feature extraction   Feature recombination   Action recognition   Total
             time (s)             time (s)                time (s)             time (s)
L-ASGCN      3.1                  0.7                     0.3                  4.1
ASGCN        3.5                  1.1                     0.4                  5.0
STGCN        3.7                  1.1                     0.6                  5.4
GCN          4.2                  1.7                     0.8                  6.7
 
Table 1 compares the per-module output time and the
total elapsed time of each model on the Kinetics dataset.
The total elapsed time is lowest for the L-ASGCN model
(4.1 s), followed by the ASGCN model (5.0 s), the STGCN
model (5.4 s), and the GCN model (6.7 s). At the module
level, the feature recombination time of the ASGCN
model and the STGCN model is identical (1.1 s). This is
because the improvements introduced in the ASGCN
model do not target feature recombination, so its feature
recombination module takes the same time as that of the
STGCN model. According to the experimental data, the
proposed model is the fastest in every module and in
total elapsed time, which demonstrates its superior
computational and feature-processing capabilities.
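The totals and the ranking in Table 1 can be checked with a few lines of arithmetic. The sketch below copies the per-module times from Table 1; the variable names and the "time saved relative to each model" measure are illustrative additions, not quantities reported by the study.

```python
# Per-module times in seconds from Table 1 (Kinetics dataset):
# (feature extraction, feature recombination, action recognition).
module_times = {
    "L-ASGCN": (3.1, 0.7, 0.3),
    "ASGCN": (3.5, 1.1, 0.4),
    "STGCN": (3.7, 1.1, 0.6),
    "GCN": (4.2, 1.7, 0.8),
}

# Total elapsed time per model, rounded to the table's precision.
totals = {name: round(sum(times), 1) for name, times in module_times.items()}

# Models ordered from fastest to slowest.
ranking = sorted(totals, key=totals.get)

# Share of each model's total time that L-ASGCN saves, in percent.
baseline = totals["L-ASGCN"]
time_saved = {
    name: round(100 * (total - baseline) / total, 1)
    for name, total in totals.items()
}

print(totals)
print(ranking)
print(time_saved)
```

The computed totals agree with the "Total time (s)" column, and the largest relative saving is against the plain GCN baseline.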
 
Table 2: Comparison of actual scene recognition performance between L-ASGCN model and ASGCN model

| Behavior | Accuracy (%), L-ASGCN | Accuracy (%), ASGCN | Recognition time (s), L-ASGCN | Recognition time (s), ASGCN |
|---|---|---|---|---|
| Raise hands flat | 94.5 | 90.2 | 1.13 | 1.51 |
| Lift left hand | 95.9 | 87.3 | 1.21 | 1.39 |
| Lift right hand | 94.7 | 86.9 | 1.26 | 1.40 |
| Cross hands | 91.2 | 91.1 | 0.91 | 1.11 |
| Lift left foot | 92.3 | 88.6 | 1.22 | 1.29 |
| Lift right foot | 92.1 | 88.4 | 1.22 | 1.31 |
| Squat down | 90.1 | 90.8 | 1.08 | 1.33 |
| Punching | 89.6 | 88.1 | 1.39 | 1.45 |
| Jump | 94.9 | 90.2 | 1.21 | 1.30 |
| Wave | 87.2 | 89.9 | 1.29 | 1.26 |
 
To test the model's effectiveness in real-life use, the
study randomly selected 10 volunteers for testing. Each
action was performed 30 times during the test, and the
volunteers' actions were fed to the experimental models
in video form. As Table 2 shows, the L-ASGCN model
achieves its highest recognition rate on the hand-raising
actions (95.9% for lifting the left hand) and its lowest
on waving (87.2%), with all recognition rates at roughly
90%. Compared with the ASGCN model, the L-ASGCN
model is more accurate on every action except squatting
down and waving, and it is faster on every action except
waving, which supports the AR efficiency of the
proposed model.
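The per-action comparison can be summarized numerically. The following sketch uses the values from Table 2; the averages it prints are derived here for illustration and are not figures reported by the study, and all names are illustrative.

```python
# Values from Table 2:
# (L-ASGCN accuracy %, ASGCN accuracy %, L-ASGCN time s, ASGCN time s).
table2 = {
    "Raise hands flat": (94.5, 90.2, 1.13, 1.51),
    "Lift left hand": (95.9, 87.3, 1.21, 1.39),
    "Lift right hand": (94.7, 86.9, 1.26, 1.40),
    "Cross hands": (91.2, 91.1, 0.91, 1.11),
    "Lift left foot": (92.3, 88.6, 1.22, 1.29),
    "Lift right foot": (92.1, 88.4, 1.22, 1.31),
    "Squat down": (90.1, 90.8, 1.08, 1.33),
    "Punching": (89.6, 88.1, 1.39, 1.45),
    "Jump": (94.9, 90.2, 1.21, 1.30),
    "Wave": (87.2, 89.9, 1.29, 1.26),
}


def column_mean(index):
    """Average one column of Table 2 over all ten actions."""
    return sum(row[index] for row in table2.values()) / len(table2)


mean_accuracy = {"L-ASGCN": column_mean(0), "ASGCN": column_mean(1)}
mean_time = {"L-ASGCN": column_mean(2), "ASGCN": column_mean(3)}

# Actions on which the ASGCN baseline is still ahead.
asgcn_more_accurate = [name for name, row in table2.items() if row[1] > row[0]]
asgcn_faster = [name for name, row in table2.items() if row[3] < row[2]]

print(mean_accuracy, mean_time)
print(asgcn_more_accurate, asgcn_faster)
```

Listing the exceptions explicitly makes it clear that the advantage of L-ASGCN holds for most actions rather than for all of them.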
 
[Figure 9 image omitted. Panels: (a) experimental results on the Kinetics dataset; (b) experimental results on the Interaction Action RGB-D dataset. Axes: image sequence number and recall rate (%). Curves: L-ASGCN, ASGCN, STGCN.]
 
Figure 9: Comparison of the change in recall of each model on different datasets 
 
 
Figure 9 compares the recall of the L-ASGCN, ASGCN,
and STGCN models on the Kinetics dataset and the
Interaction Action RGB-D dataset. Figure 9(a) shows the
results of the three models on the Kinetics dataset, where
the average recall of the L-ASGCN, ASGCN, and STGCN
models is 93.06%, 91.71%, and 87.86%, respectively.
Figure 9(b) shows the trend of the recall of each model
on the Interaction Action RGB-D dataset, where the
average recall of the L-ASGCN, ASGCN, and STGCN
models is 89.46%, 87.92%, and 86.11%, respectively.
 
[Figure 10 image omitted. Panels: (a) F1 score against training time (min); (b) F1 score average (%) of each model. Models: L-ASGCN, ASGCN, STGCN.]
 
Figure 10: Schematic of F1 scores vs. average F1 scores for each model 
 
Figure 10 shows a comparison of the F1 scores and 
average F1 scores of three models, L-ASGCN, ASGCN, 
and STGCN, on the Interaction Action RGB-D dataset. 
Figure 10(a) illustrates the relationship between the F1 
scores of the three models and training time. The F1 
scores of each model increase as the training time 
increases. Among the experimental models, the F1 score 
of the L-ASGCN model is the highest. Figure 10(b) 
shows the average F1 scores of each model after 
repeating the experiment three times. The L-ASGCN 
model achieved an average F1 score of 89.79%, followed 
by the ASGCN model with 85.97%, and the STGCN 
model with 78.66%. 
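For reference, the F1 score combines precision and recall as their harmonic mean. The sketch below computes a macro-averaged F1 score from a confusion matrix. It assumes macro averaging over classes, which the text does not state, and it uses a hypothetical two-class matrix rather than data from the study.

```python
def macro_f1(conf_matrix):
    """Macro-averaged F1: the unweighted mean of the per-class F1 scores.

    Rows of conf_matrix are true labels, columns are predicted labels.
    """
    size = len(conf_matrix)
    f1_scores = []
    for k in range(size):
        true_pos = conf_matrix[k][k]
        false_neg = sum(conf_matrix[k]) - true_pos
        false_pos = sum(conf_matrix[r][k] for r in range(size)) - true_pos
        predicted = true_pos + false_pos
        actual = true_pos + false_neg
        precision = true_pos / predicted if predicted else 0.0
        recall = true_pos / actual if actual else 0.0
        denom = precision + recall
        f1_scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(f1_scores) / size


# Hypothetical matrix: 10 samples per class, 3 misclassified in total.
toy_matrix = [
    [8, 2],
    [1, 9],
]
print(round(macro_f1(toy_matrix), 4))
```

Because F1 penalizes both missed actions and false detections, it can rank models differently from accuracy or recall alone.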
5 Discussion 
In the field of human behavior recognition, GCN has
emerged as an effective data analysis tool. To achieve
more precise and efficient action recognition and to
promote the intelligent development of HCI technology,
this study constructed an L-ASGCN model. After
convergence, the L-ASGCN model had a loss of 0.56%,
an overall recognition rate of approximately 90%, an
average recall of 93.06% on the Kinetics dataset, and an
average F1 score of 89.79%. The model outperformed
STGCN and ASGCN on multiple performance indicators.
The primary reason is that L-ASGCN streamlines the
convolutional kernel of the model through displacement
operations, which reduces the overall computational
complexity and improves operational efficiency.
Compared with the studies by Ahmad et al. [21] and
Tong et al. [22], the improved performance of the
L-ASGCN algorithm in processing human motion data
represents an advance in the application of GCN to
motion recognition. In particular, with regard to recall
and model efficiency, L-ASGCN offers a more refined
feature extraction and recombination mechanism than the
dynamic virtual network embedding algorithm explored
by Zhang et al. [23], because it not only processes
features but also suppresses the performance loss caused
by excessive computation. Nevertheless, the L-ASGCN
algorithm has limitations. Chief among these are that the
model cannot capture global physical dependencies
between joints and that motion capture is based on fixed
skeletons. Future work should aim to improve the
generalization ability of the model across different types
of actions and to optimize real-time action recognition.
In conclusion, the L-ASGCN model has considerable
potential in human behavior recognition tasks and can be
applied in medical rehabilitation, intelligent monitoring,
and interactive media. It also lays a foundation for more
efficient real-time action recognition in dynamic
environments.
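To illustrate why a displacement (shift) operation can be cheaper than a learned spatial kernel, the sketch below shows one generic shift-style graph operation. It is an assumption made for illustration, not the L-ASGCN implementation: each channel is read from a different joint, and the shifted features are then mixed by a per-joint linear map. The function names, the shift pattern, and the input values are all hypothetical.

```python
def nonlocal_joint_shift(features):
    """Shift channel c of joint v so that it reads from joint (v + c) mod V.

    features: list of V joints, each a list of C channel values.
    The shift itself has no trainable parameters.
    """
    num_joints = len(features)
    num_channels = len(features[0])
    return [
        [features[(v + c) % num_joints][c] for c in range(num_channels)]
        for v in range(num_joints)
    ]


def pointwise_mix(features, weights):
    """Apply the same C_in x C_out linear map to every joint."""
    num_out = len(weights[0])
    return [
        [sum(joint[c] * weights[c][o] for c in range(len(joint)))
         for o in range(num_out)]
        for joint in features
    ]


# Hypothetical input: 4 joints with 3 channels each.
x = [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11]]
shifted = nonlocal_joint_shift(x)

# An identity weight matrix leaves the shifted features unchanged.
identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
mixed = pointwise_mix(shifted, identity)
print(shifted)
```

In this sketch, information is exchanged between joints by the parameter-free shift, so the only learned weights are those of the pointwise map. This is the general sense in which shift-style designs reduce the cost of the spatial kernel.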
6 Conclusion 
As intelligent systems become mainstream, HCI is an
important part of the development of intelligent terminal
devices. A good HCI model helps an intelligent terminal
understand human commands and thus serve its users
better. In view of this, the study fuses the ASGCN
algorithm with an LSTM-based ED module to construct
an intelligent AR model. The results indicate that the
L-ASGCN model achieved a loss of only 0.56% after
convergence on the test dataset, with an
AverA of 95.39%. The study also tested the recall of the
proposed model: the average recall of the L-ASGCN
model was 93.06%, which is 1.35 percentage points
higher than that of the ASGCN model and 5.20
percentage points higher than that of the STGCN model.
In the F1 experiments, the L-ASGCN model achieved an
average F1 score of 89.79%, while the ASGCN and
STGCN models achieved 85.97% and 78.66%,
respectively. These results indicate that the proposed
model has higher overall performance. The study also
tested the proposed model for AR in real-life scenarios,
where it reached an overall recognition rate of around
90% for routine actions, indicating its feasibility. The
study also has shortcomings: the proposed model
captures only the local physical dependence between
joints, and motion capture is based on a fixed skeleton,
so the model needs further optimization to address
these limitations.
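The margins quoted above are differences between percentages, so they are most naturally read as percentage points. The sketch below recomputes them from the reported averages and also shows the corresponding relative gains, which are an illustrative addition and not figures reported by the study.

```python
# Reported averages (%): recall on the Kinetics dataset and F1 score.
recall = {"L-ASGCN": 93.06, "ASGCN": 91.71, "STGCN": 87.86}
f1_score = {"L-ASGCN": 89.79, "ASGCN": 85.97, "STGCN": 78.66}


def gaps(scores, reference="L-ASGCN"):
    """Return (percentage-point gap, relative gain in %) versus each model."""
    ref = scores[reference]
    return {
        name: (round(ref - value, 2), round(100 * (ref - value) / value, 2))
        for name, value in scores.items()
        if name != reference
    }


recall_gaps = gaps(recall)
f1_gaps = gaps(f1_score)
print(recall_gaps)
print(f1_gaps)
```

The percentage-point gaps reproduce the 1.35 and 5.20 recall margins and the 3.82 and 11.13 F1 margins stated in the paper.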
 
References
[1] Stephan Diederich, Alfred Benedikt Brendel, Stefan 
Morana, and Lutz Kolbe. On the design of and 
interaction with conversational agents: An 
organizing and assessing review of human-computer 
interaction research. Journal of the Association for 
Information Systems, 23(1): 96-138, 2022. 
https://doi.org/10.17705/1jais.00724 
[2] Barbara Rita Barricelli, Daniela Fogli. Digital twins
in human-computer interaction: A systematic review.
International Journal of Human–Computer
Interaction, 40(2): 79-97, 2024.
https://doi.org/10.1080/10447318.2022.2118189 
[3] Li Xiaofei, Jiang Miao, Du Yiming, Ding Xin, Xiao 
Chao, Wang Yanyan, Yang Yanyu, Zhuo Yizhi, 
Zheng Kang, Liu Xianglan, Chen Lin, Gong Yi, 
Tian Xingyou, Zhang Xian. Self-healing liquid 
metal hydrogel for human–computer interaction and
infrared camouflage. Materials Horizons, 10(8): 
2945-2957, 2023. 
https://doi.org/10.1039/d3mh00341h 
[4] Rajdeep Ghosh, Souvik Phadikar, Nabamita Deb, 
Nidul Sinha, Pranesh Das, Ebrahim Ghaderpour. 
Automatic eyeblink and muscular artifact detection 
and removal from EEG signals using k-nearest 
neighbor classifier and long short-term memory 
networks. IEEE Sensors Journal, 23(5): 5422-5436, 
2023. https://doi.org/10.1109/JSEN.2023.3237383 
[5] Anitha Rani Inturi, V. M. Manikandan, Vignesh 
Garrapally. A novel vision-based fall detection 
scheme using keypoints of human skeleton with 
long short-term memory network. Arabian Journal 
for Science and Engineering, 48(2): 1143-1155, 
2023. https://doi.org/10.1007/s13369-022-06684-x 
[6] M. Kalpana Chowdary, Tu N. Nguyen, D. Jude
Hemanth. Deep learning-based facial emotion
recognition for human–computer interaction
applications. Neural Computing and Applications, 
35(32): 23311-23328, 2023. 
https://doi.org/10.1007/s00521-021-06012-8 
[7] Liu, Hai, Liu, Tingting, Zhang, Zhaoli, Sangaiah, 
Arun Kumar, Yang, Bing, Li, Youfu. Arhpe: 
Asymmetric relation-aware representation learning 
for head pose estimation in industrial 
human–computer interaction. IEEE Transactions on
Industrial Informatics, 18(10): 7107-7117, 2022. 
https://doi.org/10.1109/TII.2022.3143605 
[8] Zhang, Hao, Zhang, Dongzhi, Wang, Zihu, Xi, 
Guangshuai, Mao, Ruiyuan, Ma, Yanhua, Wang, 
Dongyue, Tang, Mingcong, Xu, Zhenyuan, Luan, 
Huixin. Ultrastretchable, self-healing conductive 
hydrogel-based triboelectric nanogenerators for 
human–computer interaction. ACS Applied
Materials and Interfaces, 15(4): 5128-5138, 2023. 
https://doi.org/10.1021/acsami.2c17904 
[9] Zhang, Ronghui, Jiang, Chunxiao, Wu, Sheng, Zhou, 
Quan, Jing, Xiaojun, Mu, Junsheng. Wi-Fi sensing 
for joint gesture recognition and human 
identification from few samples in human-computer 
interaction. IEEE Journal on Selected Areas in 
Communications, 40(7): 2193-2205, 2022. 
https://doi.org/10.1109/JSAC.2022.3155526 
[10] Alaa Bessadok, Mohamed Ali Mahjoub, Islem Rekik. 
Graph neural networks in network neuroscience. 
IEEE Transactions on Pattern Analysis and Machine 
Intelligence, 45(5): 5833-5848, 2022. 
https://doi.org/10.48550/arXiv.2106.03535 
[11] Lingfei Wu, Yu Chen, Kai Shen, Xiaojie Guo, 
Hanning Gao, Shucheng Li, Jian Pei, Bo Long. 
Graph neural networks for natural language 
processing: A survey. Foundations and Trends® in
Machine Learning, 16(2): 119-328, 2023.
https://doi.org/10.48550/arXiv.2106.06090 
[12] Qian Wang, Youfa Liu. Energy Levels Based Graph 
Neural Networks for Heterophily. Journal of Physics 
Conference Series. 1948(1): 012042. 
https://doi.org/10.1088/1742-6596/1948/1/012042 
[13] Qian Wang, Youfa Liu. Energy Levels Based Graph 
Neural Networks for Heterophily. Journal of Physics 
Conference Series. 1948(1): 012042. 
https://doi.org/10.1088/1742-6596/1948/1/012042 
[14] Kevin Kiningham, Philip Levis, Christopher Ré. 
GRIP: A graph neural network accelerator 
architecture. IEEE Transactions on Computers, 
72(4): 914-925, 2022. 
https://doi.org/10.1109/TC.2022.3197083 
[15] Adem Aylin, Çakıt Erman, Dağdeviren Metin. Selection
of suitable distance education platforms based on
human–computer interaction criteria under fuzzy
environment. Neural Computing and Applications, 
34(10): 7919-7931, 2022. 
https://doi.org/10.1007/s00521-022-06935-w 
[16] Milani Alireza Sadeghi, Cecil-Xavier Aaron, Gupta 
Avinash, Cecil J, Kennison Shelia. A systematic
review of human–computer interaction (HCI)
research in medical and other engineering fields.
International Journal of Human–Computer
Interaction, 40(3): 515-536, 2024. 
https://doi.org/10.1080/10447318.2022.2116530 
[17] Xie Yaochen, Xu Zhao, Zhang Jingtun, Wang 
Zhengyang, Ji Shuiwang. Self-supervised learning 
of graph neural networks: A unified review. IEEE 
Transactions on Pattern Analysis and Machine 
Intelligence, 45(2): 2412-2429, 2022. 
https://doi.org/10.48550/arXiv.2102.10757 
[18] He Jinbao, Yang Jie. Network security situational 
level prediction based on a double-feedback Elman 
model. Informatica, 46(1): 87-93, 2022. 
https://doi.org/10.31449/inf.v46i1.3775 
[19] Utkin Lev V, Zhuk Kirill D. Improvement of the 
deep forest classifier by a set of neural networks. 
Informatica, 44(1):1-13, 2020. 
https://doi.org/10.31449/inf.v44i1.2740 
[20] Yuan Hao, Yu Haiyang, Gui Shurui, Ji Shuiwang. 
Explainability in graph neural networks: A 
taxonomic survey. IEEE Transactions on Pattern 
Analysis and Machine Intelligence, 45(5): 
5782-5799, 2022. 
https://doi.org/10.1109/TPAMI.2022.3204236 
[21] Ahmad Tasweer, Jin Lianwen, Zhang Xin, Lai 
Songxuan, Tang Guozhi, Lin Luojun. Graph 
convolutional neural network for human action 
recognition: A comprehensive survey. IEEE 
Transactions on Artificial Intelligence, (2): 128-145, 
2021. https://doi.org/10.1109/TAI.2021.3076974 
[22] Tong Houjie, Qiu Robert C, Zhang Dongxia, Yang 
Haosen, Ding Qi, Shi Xin. Detection and 
classification of transmission line transient faults 
based on graph convolutional neural network. CSEE 
Journal of Power and Energy Systems, 7(3): 
456-471, 2021. 
https://doi.org/10.17775/CSEEJPES.2020.04970 
[23] Zhang Peiying, Wang Chao, Kumar Neeraj, Zhang 
Weishan, Liu Lei. Dynamic virtual network 
embedding algorithm based on graph convolution 
neural network and reinforcement learning. IEEE 
Internet of Things Journal, 9(12): 9389-9398, 2021. 
https://doi.org/10.48550/arXiv.2202.02140 
 
 
 
 
 
 
 
Appendix 
 
Table A summarizes the content of the above research. 
Table A: Summary of related work 
| Research contents | Researchers | Key findings | Potential shortcomings |
|---|---|---|---|
| Human-computer interaction | Chowdary et al. [6] | Using deep learning techniques to recognize emotions and promote intelligent human-computer interaction | Not taking into account the differences in emotional expression across different cultural backgrounds |
| Human-computer interaction | Liu et al. [7] | Solving the problems of adjacent pose information processing and the mislabeling gap in head pose estimation | Robustness in ever-changing environments may need to be improved |
| Human-computer interaction | Zhang et al. [8] | Constructing a glove-based human-machine interaction system using a triboelectric nanogenerator; extracting and analyzing multidimensional signal features to achieve gesture visualization and robotic arm control | More data support may be required for the recognition of complex gestures |
| Human-computer interaction | Zhang et al. [9] | Dynamic mode of gesture recognition using Wi-Fi and radar sensing technology | System complexity and power consumption may be obstacles in practical applications |
| GCN research | Bessadok et al. [10] | Medical image recognition based on learning deep graph neural networks; combining DNN and GNN to identify human brain neuronal activity | Unknown generalization ability for large-scale image datasets |
| GCN research | Wu et al. [11] | A natural language processing model based on GCN; introducing ED technology to achieve global encoding of input data | Adjustments may be needed to adapt to different natural language processing tasks |
| GCN research | Zhu et al. [12] | Unsupervised image analysis based on GCN and DNN; improving the GCN pooling method through DNN to achieve cluster structure recovery | Further validation is needed to evaluate the effectiveness of unsupervised methods on diverse datasets |
| GCN research | Zhu et al. [13] | Developing a new graph convolutional framework to address the homophily assumption problem of GCN; introducing interpretable compatibility matrices to model heterophily in graphs | Insufficient scalability of the framework |
| GCN research | Kiningham et al. [14] | GNN accelerator architecture designed for low-latency inference; combining vertex and edge operations and introducing a high-performance matrix multiplication engine | Further consideration is needed for real-time performance and computational resource consumption |