https://doi.org/10.31449/inf.v48i12.6179 Informatica 48 (2024) 123–136

Optimization of Video Stability Technology by Integrating Tiny-Res-PWNet Model

Wenji Zhong 1, Liping Wu 2*
1 College of Information Engineering, Guangxi Vocational College of Water Resources and Electric Power, Nanning 530023, China
2 Big Data Academy, Guangxi Vocational and Technical College, Nanning 530226, China
E-mail: wlp20232023@126.com
* Corresponding author

Keywords: image stabilization technology, residual module, warping field, structural similarity, stability

Received: May 11, 2024

This study proposes an improved pixel-wise stabilization network to achieve high-quality video stabilization. Fourier spectrum constraints and local motion constraints are added to the output structure of the stabilization network, and the affine matrix parameters are optimized so that the transformation parameters output by the network are more consistent, which reduces the difficulty of learning the overall jitter pattern of the image. In addition, the encoder convolutional layers are replaced with residual modules, and feature fusion is used to combine feature information extracted from different network layers, yielding an optimized and lightweight pixel-wise stable network model. The results show that the stability evaluation index of the improved pixel-wise stable network increased by about 3.7% compared with the original model. The parameter count of the encoder fused with the residual module was reduced by 12.1% compared with the original encoder, the model size was reduced by 11.7%, floating-point operations were reduced by 13.2%, and the running frame rate increased by 5.6%. The lightweight pixel-wise stable network reached a frame rate of 131.2 on high-performance hardware, far higher than the 83.1 of the original model. These outcomes show that the proposed network is an effective optimization of video stabilization technology and can be applied to many real-time video processing scenarios, helping to raise the technical level and application effectiveness of this field.

Povzetek: Razvito je izboljšano omrežje za stabilizacijo videa na ravni slikovnih točk, ki vključuje Fourierjeve spektralne omejitve in lokalne omejitve gibanja. Rezultati kažejo, da izboljšan model dosega 3,7 % boljšo stabilnost in zmanjšano velikost modela za 11,7 %, kar predstavlja pomemben doprinos k optimizaciji video stabilizacijske tehnologije.

1 Introduction

With the rapid development of digital media technology, video has become one of the most important ways for people to record and share their lives [1]. However, because camera devices are now so widespread and convenient, many people inevitably encounter shaking when shooting videos, which lowers video quality and degrades the viewing experience [2]. Video stabilization technology has therefore become an important means of solving this problem. Over the past few decades, many video stabilization methods have been proposed and have achieved certain results [3]. Early methods were mainly based on traditional image processing techniques, such as motion estimation and image block matching. However, these methods are often limited by high computational complexity and poor robustness and perform poorly in practical applications [4].
Recently, with the rapid development of deep learning technology, video stabilization methods based on Convolutional Neural Networks (CNNs) have made significant breakthroughs [5]. Among them, the Pixel-Wise Stable Network (PWSNet) model offers good stability and real-time performance. However, the PWSNet model still has some problems in practical applications. In addition, the stability and robustness of network models when handling videos under different scenes and lighting conditions need to be further optimized. Therefore, this study aims to optimize and improve the PWSNet model to provide an efficient and accurate video stabilization method and give users a better viewing experience. The study consists of four parts. The first is a summary of the relevant research. The second presents the optimization and improvement methods for video stabilization technology, which are verified in the third part. The fourth is a summary of the entire study.

2 Related works

Video stabilization technology is used to suppress jitter and vibration in videos. The Son team proposed an effective recurrent video deblurring network. It improved the motion estimation accuracy between blurred frames by effectively aggregating information from multiple video frames, and it addressed motion estimation errors by using pixel volumes containing candidate sharp pixels. Experiments showed that, compared with traditional deep learning methods, this method improved efficiency by 13% [6]. The Chen team proposed an end-to-end training method that integrated deep learning and state space models for estimating and predicting the state space model of physical systems. The results showed that this method could leverage the relative advantages of deep neural networks and was effective in estimation and prediction for many physically challenging tasks [7]. Zhou et al. proposed a unified motion correction and denoising adversarial network for generating motion-compensated low-noise images from low-dose gated PET data. Experiments showed that the network could directly generate accurate motion estimates from low-dose gated images and produce high-quality motion-compensated low-noise reconstructions [8]. Wang's team proposed a real-time dynamic vision system to achieve accurate pose estimation of cameras in indoor dynamic environments. The system used geometry-based and template-based motion removal modules to process dynamic feature points and found complete dynamic regions with the help of depth image clustering. The outcomes showed that the effectiveness of the system could reach 80% [9]. The Asad team proposed a spatial and temporal feature learning method based on equidistant video sequence frames. This method combined the multi-level features of two consecutive frames extracted from the top and bottom layers of a CNN to take motion information into account. Experiments showed that the accuracy could reach 85% [10]. Recently, deep learning technology has made significant progress in video stabilization. Deep learning models can automatically learn feature representations of images and videos from large amounts of video data and can better capture complex motion patterns. Shahbazi et al. used a recurrent neural network-based Long Short-Term Memory (LSTM) architecture to incorporate motion features into a single-object tracker, and a new motion model was trained to predict the position of the target in each frame.
The results indicated that the motion model had low computational cost while matching baseline tracking performance [11]. The Iraei group proposed a new deep learning algorithm that estimated blur kernels through a CNN; objects were then tracked with particle filters using the probability distribution of the motion information obtained from kernel estimation. Experiments showed that, compared with existing techniques, this method could improve tracking accuracy by 10% [12]. The Liu team proposed a fault detection method based on deep, highly sensitive features of video images. This method selected deep, highly sensitive features carrying a large amount of fault information as the features to be detected, used the Euclidean distance for fault detection, and applied a moving average window function to reduce sudden noise interference. Experiments showed that the detection efficiency of this method could reach 90% [13]. Chen and his team proposed a video-based action recognition network that used a channel attention mechanism in residual units to learn the action features of each view. The results showed that the accuracy of this method could reach 91% [14]. Liu et al. proposed a dynamic spatiotemporal network to integrate spatiotemporal information. Under the guidance of coarse saliency maps, features and decoders were refined through spatial attention to obtain the final saliency map. Experiments showed that the accuracy of this method in extracting motion features could reach 90% [15]. A summary of the related works is given in Table 1.

Table 1: Summary of related works

Field | Researchers | Research content | Research result | Index
Video stabilization technology | Son et al. [6] | An effective recurrent video deblurring network | Enhanced motion estimation accuracy | Computational efficiency improved by 13%
Video stabilization technology | Chen et al. [7] | An end-to-end training method | Demonstrated effectiveness in estimation and prediction | Prediction efficiency reaches 85%
Video stabilization technology | Zhou et al. [8] | A unified motion correction and denoising adversarial network | Produced high-quality motion-compensated low-noise reconstructions | Accuracy can reach 90%
Video stabilization technology | Wang et al. [9] | A real-time dynamic vision system | Found complete dynamic regions | Efficiency reaches 80%
Video stabilization technology | Asad et al. [10] | A spatial and temporal feature learning method | Combined multi-level features | Accuracy can reach 85%
Deep learning | Shahbazi et al. [11] | A recurrent neural network-based LSTM | Matched baseline tracking performance | Computing cost reduced by 10%
Deep learning | Iraei et al. [12] | A target tracking algorithm based on CNN and particle filters | Tracked objects through particle filters | Tracking accuracy improved by 10%
Deep learning | Liu et al. [13] | A fault detection method | Reduced sudden noise interference | Detection efficiency can reach 90%
Deep learning | Chen et al. [14] | A video-based action recognition network | Learned the action features of each view | Precision up to 91%
Deep learning | Liu et al. [15] | A dynamic spatiotemporal network | Obtained the final saliency map | Accuracy can reach 90%

In summary, video stabilization technology based on deep learning has made significant progress. However, challenges remain in dealing with complex scenes and multivariate problems. In addition, most existing methods need a large amount of pre-training data and computational resources, so real-time performance and efficiency remain key challenges in real-world application scenarios.
Therefore, this study proposes an optimization method for video stabilization technology that integrates the Tiny-Res-PWNet model. The goal is to achieve higher-quality, more stable, and more efficient image stabilization and to perform well in different application scenarios.

3 Design of optimization method for video stabilization technology integrating Tiny-Res-PWNet model

This chapter presents the design of the optimization method for video stabilization technology, including the improvements and optimizations of the PWSNet model. Faster network running speed and better image stabilization are achieved with less resource consumption. This is done by replacing the encoder convolutional layers with residual modules, extracting feature information from different network layers through feature fusion, and introducing Batch Normalization (BN) layers and bottleneck residual modules. Meanwhile, the PWSNet model is lightweighted to obtain the smaller Tiny-Res-PWNet image stabilization network model.

3.1 Design of optimization method for video stabilization effect

With the continuous development of mobile devices and camera technology, users have an increasing demand for video stabilization [16]. To this end, the output structure of the image stabilization network is improved by adding Fourier spectrum constraints and local motion constraints. Meanwhile, the affine matrix parameters are optimized so that the transformation parameters output by the network are more consistent, which reduces the difficulty of learning the overall jitter pattern of the image. PWSNet is a deep learning network for image processing and computer vision tasks whose purpose is to enhance performance by learning pixel-level stability [17]. The pixel warping field is the motion model of PWSNet; it directly maps the relationships between all pixels of the stable and unstable frames [18]. In the pixel warping field, the relationship between the warping matrices and the corresponding pixels is shown in equation (1).

\begin{cases} T_x(i,j) = x_0 \\ T_y(i,j) = y_0 \end{cases} \quad (1)

In equation (1), the two warping matrices are $T_x$ and $T_y$, with source pixel coordinates $(x_0, y_0)$ and target pixel position $(i, j)$. The warp matrices record the horizontal and vertical coordinates of the source pixel, and the pixel values in the stable frame $\hat{I}$ come from the unstable frame $I$. The pixel warping field is shown in Figure 1.

Figure 1: Pixel warping field (mapping between the stable frame $\hat{I}$ and the unstable frame $I$)

The single-stage structure of PWSNet is similar to the encoder-decoder framework of the U-net architecture [19]. The encoder consists of multiple convolutional layers that gradually downsample to generate smaller feature maps while increasing the number of channels to enhance learning ability. The decoder structure is similar to the encoder, generating larger feature maps through upsampling convolutional layers while reducing the number of channels. The convolutional layers of the decoder and encoder are connected through skip connections. At each stage, PWSNet generates two equally sized pixel warping fields to record the relationship between the source and target pixels and thereby achieve image stability.
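As a concrete illustration of how such a warping field can be applied, the following is a minimal PyTorch sketch (not the authors' released code) that resamples an unstable frame according to the two warp maps of equation (1) using bilinear interpolation. The function name, tensor layout, and the assumption that the maps store absolute source-pixel coordinates are ours.

import torch
import torch.nn.functional as F

def warp_frame(unstable: torch.Tensor, tx: torch.Tensor, ty: torch.Tensor) -> torch.Tensor:
    """unstable: (B, C, H, W) unstable frame I; tx, ty: (B, H, W) source coordinates x0, y0."""
    _, _, h, w = unstable.shape
    # grid_sample expects sampling coordinates normalised to [-1, 1]
    gx = 2.0 * tx / (w - 1) - 1.0
    gy = 2.0 * ty / (h - 1) - 1.0
    grid = torch.stack((gx, gy), dim=-1)  # (B, H, W, 2), (x, y) order
    # Each output pixel (i, j) of the stable frame is sampled from
    # (T_x(i, j), T_y(i, j)) in the unstable frame, as in equation (1)
    return F.grid_sample(unstable, grid, mode="bilinear", align_corners=True)

# Example: an identity warp returns (approximately) the input frame
frame = torch.rand(1, 3, 360, 640)
xs = torch.arange(640.0).expand(360, 640).unsqueeze(0)               # T_x(i, j) = j
ys = torch.arange(360.0).view(-1, 1).expand(360, 640).unsqueeze(0)   # T_y(i, j) = i
stable = warp_frame(frame, xs, ys)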
When PWSNet generates the pixel warping field, the horizontal and vertical positions to which each pixel moves are given by equation (2).

\begin{cases} T_x^0(i,j) = \sum_{n=1,2,3} H_t(1,n)\, A(i,j,n) \\ T_y^0(i,j) = \sum_{n=1,2,3} H_t(2,n)\, A(i,j,n) \end{cases} \quad (2)

In equation (2), the horizontal and vertical movement positions of a pixel are $T_x^0$ and $T_y^0$, respectively, $H_t$ is the affine transformation matrix, and $A$ is the constant matrix. The overall training architecture of PWSNet is a dual-branch neural network with shared parameters. This architecture ensures the temporal and spatial consistency of consecutive stable frames. In the testing phase, no constraint on the network is needed and only one branch is used to generate stable videos, which reduces the consumption of computing resources. The overall framework of PWSNet training is shown in Figure 2.

Figure 2: Overall framework of PWSNet training (cascaded encoder-decoder networks with shared weights map unstable frame sequences to warping maps and generated stable frames)

In the frequency domain, Fourier spectrum constraints can be used to enhance the spatial smoothness of the warping field. The frequency-domain loss function is calculated as shown in equation (3).

L_{frequency} = \left\| \hat{G} \odot F(W_x) \right\|_2^2 + \left\| \hat{G} \odot F(W_y) \right\|_2^2 \quad (3)

In equation (3), $L_{frequency}$ is the frequency-domain loss, $\hat{G}$ is the weighted filtering function, and $F$ is the two-dimensional Fourier transform. The pixel warping matrices in the x and y directions are $W_x$ and $W_y$, respectively, and $F(W)$ is the spectrum of a warping matrix after the Fourier transform. The weighted filtering function is calculated as shown in equation (4).

\hat{G} = \frac{\max(G) - G}{\max(G)} \quad (4)

In equation (4), $G$ is a two-dimensional Gaussian distribution function with mean 0 and variance 10. The global affine transformation is given by equation (5).

\begin{bmatrix} X' \\ Y' \\ 1 \end{bmatrix} = \begin{bmatrix} m_{11} & m_{12} & b_1 \\ m_{21} & m_{22} & b_2 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} X \\ Y \\ 1 \end{bmatrix} \quad (5)

In equation (5), $[X, Y]$ are the coordinates of a grid vertex before the affine transformation and $[X', Y']$ are the coordinates after the transformation. The global affine transformation parameters are $m_{11}$, $m_{12}$, $m_{21}$, $m_{22}$, $b_1$, and $b_2$. The global vector is defined in equation (6).

M_d = \begin{bmatrix} X' \\ Y' \end{bmatrix} - \begin{bmatrix} X \\ Y \end{bmatrix} \quad (6)

In equation (6), $M_d$ is the global vector. The intensity of local motion is measured by the L2 norm of the sparse motion field, and the local motion loss is calculated as shown in equation (7).

L_{motion} = \sum_{n=1}^{N} \left\| M_{n,g} - M_d \right\|_2 \quad (7)

In equation (7), $L_{motion}$ is the local motion loss, $N$ is the number of grid vertices in a frame, $n$ indexes the grid vertices, and $M_{n,g}$ is the vector of the $n$-th grid vertex. The affine transformation matrix is shown in equation (8).

H_t = \begin{bmatrix} m_0 & m_1 & m_2 \\ m_3 & m_4 & m_5 \end{bmatrix} \quad (8)

In equation (8), $m_0$-$m_5$ represent the scaling and rotation relationships before and after the transformation. To overcome systematic errors in the affine transformation, an improved method is proposed that retains only the scaling, rotation, and translation transformations, which reduces the coupling between the four parameters representing rotation and scaling.
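To make the two added constraints concrete, the following minimal sketch (assuming a PyTorch implementation; function names and tensor shapes are illustrative) computes a frequency-domain loss in the spirit of equations (3)-(4) and a local motion loss in the spirit of equations (6)-(7). The Gaussian uses the mean 0 and variance 10 stated above; all other details are our assumptions rather than the paper's exact code.

import torch

def gaussian_weight(h: int, w: int, sigma2: float = 10.0) -> torch.Tensor:
    """Weighted filter G_hat = (max(G) - G) / max(G), with G a centred 2-D Gaussian."""
    ys = torch.arange(h).float() - h / 2
    xs = torch.arange(w).float() - w / 2
    g = torch.exp(-(ys[:, None] ** 2 + xs[None, :] ** 2) / (2 * sigma2))
    return (g.max() - g) / g.max()

def frequency_loss(wx: torch.Tensor, wy: torch.Tensor) -> torch.Tensor:
    """L_frequency: penalise high-frequency energy of the warping maps W_x, W_y (B, H, W)."""
    g_hat = gaussian_weight(wx.shape[-2], wx.shape[-1]).to(wx.device)
    # Shift the spectrum so low frequencies sit at the centre, where g_hat is small
    fx = torch.fft.fftshift(torch.fft.fft2(wx), dim=(-2, -1)).abs()
    fy = torch.fft.fftshift(torch.fft.fft2(wy), dim=(-2, -1)).abs()
    return ((g_hat * fx) ** 2).sum() + ((g_hat * fy) ** 2).sum()

def local_motion_loss(grid_vectors: torch.Tensor, global_vector: torch.Tensor) -> torch.Tensor:
    """L_motion = sum_n ||M_{n,g} - M_d||_2 over the N grid-vertex vectors (N, 2)."""
    return (grid_vectors - global_vector).norm(dim=-1).sum()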
The improved network encoder generation structure is shown in Figure 3.

Figure 3: Improved network encoder generation structure (the encoder feature map produces output parameters through a convolutional layer, and the parameters are filled into the affine matrix)

The improved network encoder generation structure first obtains transformation parameters through a 1*1 convolution on the input feature map. The transformation parameters are then filled into the affine transformation matrix to generate a feature map with a shape of 1*1*512, which facilitates the subsequent calculation of the generated parameters. The entire process is equivalent to learning a 1*1*512 weight, similar to the operation of a fully connected layer. The improved affine transformation is calculated as shown in equation (9).

H_t = \begin{bmatrix} S\cos\theta & -S\sin\theta & \delta_x \\ S\sin\theta & S\cos\theta & \delta_y \end{bmatrix} \quad (9)

In equation (9), $S$ is the scaling factor, $\theta$ is the rotation angle, and $\delta_x$ and $\delta_y$ are the displacements in the x and y directions.

3.2 Design of Tiny-Res-PWNet model

A lightweight image stabilization network, Tiny-Res-PWNet, is proposed to solve the problems of complex network structure, high computational complexity, and low output frame rate in PWSNet. Tiny-Res-PWNet replaces the encoder convolutional layers with residual modules to improve the training convergence and fitting ability of the network. By using feature fusion to extract feature information from different network layers, the network achieves faster running speed and better image stability with less resource consumption [20]. In Tiny-Res-PWNet, the residual module replaces the encoder convolutional layer. The residual module introduces shortcut connections, allowing the network to directly learn the residual mapping between the input and output features; this improves the training convergence speed of the network and makes it easier to fit complex nonlinear relationships. Tiny-Res-PWNet extracts feature information from different network layers through feature fusion, combining feature maps from different levels to obtain richer and more comprehensive feature representations. In Tiny-Res-PWNet, feature fusion is achieved through skip connections, which fuse the feature maps of the encoder with those of the decoder so that the network can use more information for prediction. In addition, the PWSNet model is lightweighted to obtain the Tiny-Res-PWNet image stabilization network model with faster running speed, higher output frame rate, and a smaller model structure. The output feature map of the residual network is given in equation (10).

H(x_l) = x_l + F(x_l) \quad (10)

In equation (10), $x_l$ is the input feature map of layer $l$, $F(x_l)$ is the residual mapping learned by the layer, and $H(x_l)$ is the output feature map. The disadvantage of traditional residual networks is that, as the network depth increases, the computational load increases and the effectiveness gradually weakens [21]. The bottleneck residual module reduces the number of channels with a 1*1 convolution kernel so that convolutions are performed in a relatively low-dimensional space, thereby improving computational efficiency and effectiveness. The bottleneck residual module consists of a 1*1 convolutional layer, a 3*3 convolutional layer, and a 1*1 convolutional layer; it is used to extract features and to alleviate the vanishing gradient problem in deep networks.
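A minimal PyTorch sketch of a bottleneck residual block of this kind (1*1, 3*3, 1*1 convolutions with BN, ReLU, and a skip connection) is given below. The channel sizes, the placement of BN, and the optional 1*1 projection used when the channel count or resolution changes are illustrative assumptions rather than the exact Tiny-Res-PWNet configuration.

import torch
import torch.nn as nn

class BottleneckResidual(nn.Module):
    def __init__(self, in_ch: int, mid_ch: int, out_ch: int, stride: int = 1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, kernel_size=1, bias=False),
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, kernel_size=3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, out_ch, kernel_size=1, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        # Extra 1*1 convolution only when the channel count (or resolution) changes,
        # as described for the encoder structure
        self.skip = (nn.Identity() if in_ch == out_ch and stride == 1
                     else nn.Conv2d(in_ch, out_ch, kernel_size=1, stride=stride, bias=False))
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # H(x) = x + F(x), equation (10): the block learns the residual mapping F
        return self.act(self.body(x) + self.skip(x))

# Example: halve the spatial size while increasing channels, as in the encoder
block = BottleneckResidual(in_ch=64, mid_ch=32, out_ch=128, stride=2)
y = block(torch.rand(1, 64, 128, 128))   # -> (1, 128, 64, 64)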
To optimize the training structure, this study introduces BN layers to normalize the data distribution. This keeps the input values of the activation functions in regions with larger gradients, thereby avoiding vanishing gradients and reducing training time. The BN layer is usually placed after a convolutional layer. In addition, the encoder structure is improved to enhance the feature extraction capability of the network and its operational efficiency: the ordinary 3*3 convolutional layer is replaced with a bottleneck combination layer, and the input and output are connected through a skip connection to form a bottleneck residual structure. The optimized single-layer structure of the encoder is shown in Figure 4.

Figure 4: Optimized single-layer structure of the encoder (1*1, 3*3, and 1*1 convolutions with ReLU activations and a skip connection from input features to output features)

To learn multi-scale and multi-dimensional features in images, the bottleneck residual blocks of the encoder halve the output feature size layer by layer, while the number of output feature channels first increases and then remains unchanged. When a channel transformation is required, an additional 1*1 convolutional layer is introduced for the channel change. This study applies depthwise separable convolution to the input layer, the warping-field output layer, the intermediate layers of the encoder bottleneck residual modules, and the transposed convolution layers of the decoder. In addition, the computational load is reduced by reducing the number of network layers. The structure of the lightweight Tiny-Res-PWNet model is shown in Figure 5.

Figure 5: Tiny-Res-PWNet model structure (input structure, encoder structure, decoder structure, output structure, and transform output structure)

Tiny-Res-PWNet uses three activation functions: ReLU, Leaky ReLU, and Tanh. Most convolutional layers use ReLU, the output layer uses Tanh, and the convolutional layers connected to the output layer use Leaky ReLU. The ReLU activation function is shown in equation (11).

g_1(x) = \max(0, x), \qquad g_1'(x) = \begin{cases} 1, & x > 0 \\ 0, & x \le 0 \end{cases} \quad (11)

In equation (11), $g_1(x)$ is the ReLU activation function and $g_1'(x)$ is its derivative. The Leaky ReLU activation function is shown in equation (12).

g_2(x) = \max(ax, x), \qquad g_2'(x) = \begin{cases} 1, & x > 0 \\ a, & x \le 0 \end{cases} \quad (12)

In equation (12), $g_2(x)$ is the Leaky ReLU activation function, $g_2'(x)$ is its derivative, and $a$ is a constant. The Tanh activation function is shown in equation (13).

g_3(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}, \qquad g_3'(x) = 1 - g_3(x)^2 \quad (13)

In equation (13), $g_3(x)$ is the Tanh activation function and $g_3'(x)$ is its derivative. The content loss function is shown in equation (14).

L_{MSE} = \frac{1}{WH} \sum_{i=1}^{W} \sum_{j=1}^{H} \left( \hat{I}(i,j) - I(i,j) \right)^2 \quad (14)

In equation (14), $L_{MSE}$ is the content loss function, $I$ is the true stable frame, $\hat{I}$ is the generated stable frame, and $W$ and $H$ are the pixel width and height of the frame. The feature loss function is shown in equation (15).

L_{fea} = \frac{1}{n_f} \sum_{i=1}^{n_f} \left\| P_i - \varphi(P_i') \right\|_2 \quad (15)

In equation (15), $L_{fea}$ is the feature loss function, $n_f$ is the number of matched feature points, $P_i$ is the coordinate of a feature point in the real stable frame, $P_i'$ is the coordinate of the corresponding feature point in the unstable frame, and $\varphi(P_i')$ is its coordinate after transformation by the warping field.
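Before moving to the experiments, the lightweight design described above can be made concrete. The following is a minimal PyTorch sketch of a depthwise separable convolution of the kind applied to the input layer, the warping-field output layer, the bottleneck intermediate layers, and the decoder layers; the specific layer sizes and the BN/ReLU placement are illustrative assumptions.

import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """A 3*3 depthwise convolution per channel followed by a 1*1 pointwise convolution."""
    def __init__(self, in_ch: int, out_ch: int, stride: int = 1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, stride=stride,
                                   padding=1, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(self.bn(self.pointwise(self.depthwise(x))))

# A standard 3*3 convolution with the same shapes uses in_ch*out_ch*9 weights,
# while the separable version uses in_ch*9 + in_ch*out_ch, which is what drives
# the parameter and FLOP reductions reported in Section 4
layer = DepthwiseSeparableConv(in_ch=64, out_ch=128)
y = layer(torch.rand(1, 64, 64, 64))     # -> (1, 128, 64, 64)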
4 Video stabilization technology optimization method integrating Tiny-Res-PWNet model

This chapter analyzes the application of the proposed optimization method for video stabilization technology that integrates the Tiny-Res-PWNet model. By setting up the experimental environment and adjusting the training parameters, the performance of different image stabilization networks was compared, and the improved PWSNet model was validated in terms of video structural similarity, fidelity, and stability.

4.1 Application analysis of optimization methods for video stabilization effect

The study used two experimental environments, a high-performance one and a moderate-performance one. The high-performance environment used an Intel i9-12900K CPU, an Nvidia RTX 3090 GPU with 32 GB of graphics memory, 64 GB of RAM, the Python 3.9 programming language, and the PyTorch 1.7 deep learning framework. The moderate-performance environment used an Intel i3-10100F CPU, an Nvidia GTX 1650S GPU with 8 GB of graphics memory, and 16 GB of RAM. The experimental images were uniformly sized to 640*360, and the pixel warping field size was 256*256. Weights were initialized from a normal distribution, and training used the Adam optimizer with a batch size of 16 and an initial learning rate of 0.001. The experiments used the GoPro dataset, which contains 3214 blurred images of size 1280×720, of which 2103 are training images and 1111 are test images. The GoPro dataset consists of one-to-one corresponding real blurred images and ground-truth images, both captured by high-speed cameras. To verify the performance of the improved PWSNet model, the study compared it with the Optical Flow model, the Block Matching model, and the original PWSNet model. The structural similarity evaluation results of the different image stabilization networks are shown in Figure 6.

Figure 6: Structural similarity evaluation results of different image stabilization networks ((a) Test 1; (b) Test 2)

In Figure 6 (a), the structural similarity of the original video is less than 0.5, and the improved PWSNet model raises the video similarity evaluation index by about 41.8%. Compared with the Optical Flow model, the Block Matching model, and PWSNet, the similarity evaluation index of the improved PWSNet model is higher by about 7.3%, 7.9%, and 2.7%, respectively. In Figure 6 (b), the structural similarity of the original video is greater than 0.5, and the improved PWSNet model raises the video similarity evaluation index by approximately 52.8%. Compared with the Optical Flow model, the Block Matching model, and the original PWSNet model, the similarity evaluation index of the improved PWSNet model is higher by approximately 14.1%, 17.3%, and 4.3%, respectively. These outcomes show that the improved PWSNet model evaluates video structural similarity with excellent accuracy and reliability and better preserves the structural similarity between videos.
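For reference, a per-video structural-similarity score of the kind reported in Figure 6 could be computed as in the sketch below. The paper does not spell out its exact metric implementation, so this assumes the standard SSIM (scikit-image's structural_similarity, recent versions) between corresponding stabilized and reference frames, averaged over the video.

import numpy as np
from skimage.metrics import structural_similarity

def video_ssim(stabilized: np.ndarray, reference: np.ndarray) -> float:
    """stabilized, reference: (T, H, W, 3) uint8 frame stacks of equal length."""
    scores = [
        structural_similarity(s, r, channel_axis=-1, data_range=255)
        for s, r in zip(stabilized, reference)
    ]
    return float(np.mean(scores))

# Example with random frames (real use would load decoded video frames)
frames_a = np.random.randint(0, 256, (5, 360, 640, 3), dtype=np.uint8)
frames_b = np.random.randint(0, 256, (5, 360, 640, 3), dtype=np.uint8)
print(video_ssim(frames_a, frames_b))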
The fidelity and stability evaluation results of the different image stabilization networks are shown in Figure 7.

Figure 7: Evaluation results of fidelity (a) and stability (b) for different image stabilization networks

Figure 7 (a) shows the fidelity evaluation results; the average fidelity of the improved PWSNet model is 0.87. Compared with the Optical Flow model, the Block Matching model, and the original PWSNet model, the improved PWSNet model improves the fidelity evaluation metric by approximately 2.1%, 9.9%, and 1.9%, respectively. Figure 7 (b) shows the stability evaluation results; the average stability of the improved PWSNet model is 0.78. Compared with the original PWSNet model, the stability evaluation index of the improved PWSNet model is higher by about 3.7%. These outcomes show that the improved PWSNet model achieves essential improvements in both fidelity and stability, indicating that it can better maintain image quality and stability in video processing and provides a more reliable solution for video processing tasks. The stabilization results for different types of low-quality videos are shown in Figure 8.

Figure 8: Stabilization results for different types of low-quality videos ((a) strong light video; (b) nighttime video; (c) blurred video)

Figure 8 (a) compares the image stabilization results for strong light videos. The evaluation indicators of the improved PWSNet model for structural similarity, fidelity, and stability are 0.82, 0.84, and 0.79, respectively. Compared with the Optical Flow model and the Block Matching model, the improved PWSNet model improves the structural similarity, fidelity, and stability indicators by about 16.9%, 15.1%, and 16.5%. Figure 8 (b) compares the image stabilization results for nighttime videos. The evaluation indicators of the improved PWSNet model for structural similarity, fidelity, and stability are 0.83, 0.82, and 0.77, respectively. Relative to the Optical Flow model, the improved PWSNet model improves the structural similarity, fidelity, and stability indicators by about 4.2%, 7.8%, and 13.1%. Figure 8 (c) compares the image stabilization results for blurred videos. The evaluation indicators of the improved PWSNet model for structural similarity, fidelity, and stability are 0.78, 0.80, and 0.76, respectively. Compared with the Optical Flow model and the Block Matching model, the improved PWSNet model improves the structural similarity, fidelity, and stability indicators by approximately 25.1%, 20.9%, and 13.2%. Based on the above data, the improved PWSNet model shows significant improvement in image stabilization results for different types of low-quality videos.
Whether for strong light, nighttime, or blurry videos, the improved PWSNet model better maintains the structural similarity of the videos and improves their fidelity and stability. A comparison of screenshots before and after video stabilization is shown in Figure 9. Figure 9 (a) shows the original video image, and Figure 9 (b) shows the video image after stabilization. After stabilization, the clarity of the image is improved; noise reduction is applied during stabilization, which makes the video cleaner.

Figure 9: Screenshot comparison before (a) and after (b) video stabilization

4.2 Application analysis of Tiny-Res-PWNet model

The Res-PWNet model introduces residual modules as a structural improvement on the original PWSNet. The impact of the residual modules was compared and analyzed in different types of video scenes. The image stabilization performance of Res-PWNet in different types of video scenes is compared in Table 2.

Table 2: Comparison of Res-PWNet image stabilization performance in different types of video scenes

Video type | Algorithm | Structural similarity | Fidelity | Stability
Routine | Optical Flow | 0.802 | 0.852 | 0.813
Routine | PWSNet | 0.773 | 0.832 | 0.819
Routine | Res-PWNet | 0.814 | 0.845 | 0.842
High speed | Optical Flow | 0.792 | 0.782 | 0.806
High speed | PWSNet | 0.761 | 0.771 | 0.791
High speed | Res-PWNet | 0.806 | 0.811 | 0.824
High density | Optical Flow | 0.752 | 0.782 | 0.705
High density | PWSNet | 0.746 | 0.809 | 0.748
High density | Res-PWNet | 0.761 | 0.817 | 0.726
High light intensity | Optical Flow | 0.742 | 0.801 | 0.731
High light intensity | PWSNet | 0.732 | 0.801 | 0.725
High light intensity | Res-PWNet | 0.756 | 0.809 | 0.765

Table 2 shows that Res-PWNet exhibits excellent image stabilization performance in different types of video scenes, with high structural similarity, fidelity, and stability, and that it outperforms PWSNet across scene types. In conventional scenes, the structural similarity score of Res-PWNet is 0.814, higher than PWSNet's 0.773, and its fidelity score is 0.845, higher than PWSNet's 0.832. In both high-speed and high-density scenes, Res-PWNet scores higher than PWSNet in structural similarity and fidelity. In high light intensity scenes, the structural similarity and fidelity scores of Res-PWNet are slightly higher than those of PWSNet. By introducing residual modules, Res-PWNet can better capture motion information in videos and make more accurate predictions and compensations. The network complexity after introducing the residual modules is compared in Figure 10.

Figure 10: Comparison of network module complexity after introducing residual modules ((a) encoder: parameter quantity /×10^6, model size /MB, floating-point operations /×10^9, convolutional layers; (b) whole network: parameter quantity /×10^6, model size /MB, floating-point operations /×10^9, running frame rate)

Figure 10 (a) compares the complexity of the encoder module.
Compared with the PWSNet encoder, the Res-PWNet encoder reduces the number of parameters by 53.7%, the model size by 54.7%, and the floating-point operations by 66.9%, and it reduces the convolutional layer depth threefold. Figure 10 (b) compares the overall complexity of the networks. Compared with PWSNet, Res-PWNet reduces the overall parameter count by 12.1%, the model size by 11.7%, and the floating-point operations by 13.2%, while the running frame rate increases from 80.1 to 84.6, an increase of 5.6%. These results show that introducing residual modules markedly decreases the parameters, model size, and floating-point computational complexity of the network, which improves its performance and stability. In addition, introducing residual modules significantly decreases the overall complexity of the network and raises the running frame rate, thereby improving its real-time performance and stability. Videos from an aerial video dataset were used to test the image stabilization performance of Tiny-Res-PWNet in airborne video stabilization scenarios. The image stabilization performance test results of Tiny-Res-PWNet are shown in Figure 11.

Figure 11: Tiny-Res-PWNet image stabilization performance test results ((a) routine videos; (b) blurred videos; structural similarity, fidelity, and stability for Optical Flow, Res-PWNet, and Tiny-Res-PWNet)

Figure 11 (a) shows the stabilization results for conventional clear videos, and Figure 11 (b) shows the results for severely blurred videos. The traditional optical flow model stabilizes well but handles complex-scene videos poorly. Tiny-Res-PWNet performs better when handling complex scenes and has high robustness. Compared with Res-PWNet, Tiny-Res-PWNet can better handle various complex scenes while maintaining image stabilization performance. The running speed of the Tiny-Res-PWNet model is compared in Table 3.

Table 3: Comparison results of running speed of Tiny-Res-PWNet model

Project | PWSNet | Res-PWNet | Tiny-Res-PWNet
Parameter quantity /×10^6 | 49.2 | 42.8 | 6.9
Model size /MB | 186.5 | 164.2 | 27.5
Floating-point operations /×10^9 | 129.4 | 114.9 | 12.3
High-performance running frame rate | 78.3 | 83.1 | 131.2
Low-performance running frame rate | 3.8 | 3.9 | 7.9

In Table 3, the Tiny-Res-PWNet model is significantly smaller in terms of parameter count, model size, and floating-point operations. Its high-performance frame rate reaches 131.2, far higher than the 78.3 of PWSNet and the 83.1 of Res-PWNet. In terms of the low-performance frame rate, Tiny-Res-PWNet also performs well, reaching 7.9, higher than PWSNet at 3.8 and Res-PWNet at 3.9. Therefore, the Tiny-Res-PWNet model has a fast running speed and a small model size, making it suitable for resource-limited environments.

5 Discussion

The study proposed a video stability optimization method that integrates the Tiny-Res-PWNet model. Compared with existing advanced methods, the proposed improved PWSNet model achieved significant improvements in video structural similarity, fidelity, and stability.
Especially when dealing with different types of low-quality videos, such as strong light, nighttime, and blurry videos, the improved PWSNet model performed particularly well. These gains are attributed to the introduction of residual modules, which enable the network to better capture motion information in videos while keeping the model simple, achieving more accurate prediction and compensation. In addition, the Tiny-Res-PWNet model has lower complexity in terms of parameter count, model size, and floating-point operations, which gives it higher performance and stability in practical applications. Compared with the Res-PWNet and PWSNet models, Tiny-Res-PWNet is more robust and runs faster when dealing with complex scenes. These advantages enable the proposed method to achieve good results in aerial video stabilization scenarios. A comprehensive comparison shows that the proposed video stability optimization method outperforms existing methods in multiple aspects. These advantages mainly stem from the introduction of residual modules and the optimization of the network structure, which allow the model to maintain high performance while possessing stronger robustness and generalization ability. In addition, by reducing model size and computational complexity, the Tiny-Res-PWNet model is particularly suitable for resource-constrained environments.

6 Conclusion

To improve the performance and stability of video stabilization technology, the PWSNet model was optimized and a new network model, Tiny-Res-PWNet, was proposed. In the Tiny-Res-PWNet model, ResNet was first used as the basic network structure to strengthen the depth and expressive power of the network. Then, based on ResNet, the pixel-level weighting mechanism of PWSNet was introduced to improve the effectiveness of video stabilization. In addition, techniques such as batch normalization and residual connections were used to accelerate network training and improve convergence, further enhancing the performance and stability of the Tiny-Res-PWNet model. The results showed that the improved PWSNet model improved the video similarity evaluation index by approximately 41.8%. Compared with the Optical Flow model, the Block Matching model, and the original PWSNet model, the similarity evaluation index of the improved PWSNet model increased by about 7.3%, 7.9%, and 2.7%, respectively. The improved PWSNet model enhanced video stabilization performance, and the Tiny-Res-PWNet model reduced computational complexity while maintaining high frame rates. This study provides a new method for video stabilization technology that can be applied in various practical scenarios and has great potential. The limitations of this study mainly lie in the limited hardware and software of the experimental environment, which may have some impact on the results. Future research can explore more types of network structures and optimization algorithms to broaden the applicability of the methods and improve their reliability and effectiveness. At the same time, the optimization of hardware devices and software tools should be studied in depth in combination with practical application scenarios to reduce their limitations on the research results.

References

[1] K. A. Mills and A. Brown, "Immersive virtual reality (VR) for digital media making: transmediation is key," Learning, Media and Technology, vol. 47, no. 2, pp. 179-200, 2022. https://doi.org/10.1080/17439884.2021.1952428
[2] X. Jin, F. Jiang, L. Li, and T. Zhong, "Plenoptic 2.0 intra coding using imaging principle," IEEE Transactions on Broadcasting, vol. 68, no. 1, pp. 110-122, 2022. https://doi.org/10.1109/TBC.2021.3108058
[3] G. Du, K. Wang, S. Lian, and K. Zhao, "Vision-based robotic grasping from object localization, object pose estimation to grasp estimation for parallel grippers: a review," Artificial Intelligence Review, vol. 54, no. 3, pp. 1677-1734, 2021. https://doi.org/10.1007/s10462-020-09888-5
[4] S. Shimada, V. Golyanik, W. Xu, P. Pérez, and C. Theobalt, "Neural monocular 3D human motion capture with physical awareness," ACM Transactions on Graphics (TOG), vol. 40, no. 4, pp. 1-15, 2021. https://doi.org/10.1145/3450626.3459825
[5] M. Poggi, F. Tosi, K. Batsos, P. Mordohai, and S. Mattoccia, "On the synergies between machine learning and binocular stereo for depth estimation from images: a survey," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 9, pp. 5314-5334, 2021. https://doi.org/10.1109/tpami.2021.3070917
[6] H. Son, J. Lee, J. Lee, S. Cho, and S. Lee, "Recurrent video deblurring with blur-invariant motion estimation and pixel volumes," ACM Transactions on Graphics (TOG), vol. 40, no. 5, pp. 1-18, 2021. https://doi.org/10.1145/3453720
[7] C. Chen, C. X. Lu, B. Wang, N. Trigoni, and A. Markham, "DynaNet: Neural Kalman dynamical model for motion estimation and prediction," IEEE Transactions on Neural Networks and Learning Systems, vol. 32, no. 12, pp. 5479-5491, 2021. https://doi.org/10.1109/TNNLS.2021.3112460
[8] B. Zhou, Y. J. Tsai, X. Chen, J. S. Duncan, and C. Liu, "MDPET: a unified motion correction and denoising adversarial network for low-dose gated PET," IEEE Transactions on Medical Imaging, vol. 40, no. 11, pp. 3154-3164, 2021. https://doi.org/10.1109/TMI.2021.3076191
[9] K. Wang, X. Yao, N. Ma, and X. Jing, "Real-time motion removal based on point correlations for RGB-D SLAM in indoor dynamic environments," Neural Computing and Applications, vol. 35, no. 12, pp. 8707-8722, 2023. https://doi.org/10.1007/s00521-022-07879-x
[10] M. Asad, J. Yang, J. He, P. Shamsolmoali, and X. He, "Multi-frame feature-fusion-based model for violence detection," The Visual Computer, vol. 37, no. 1, pp. 1415-1431, 2021. https://doi.org/10.1007/s00371-020-01878-6
[11] M. Shahbazi, M. H. Bayat, and B. Tarvirdizadeh, "A motion model based on recurrent neural networks for visual object tracking," Image and Vision Computing, vol. 126, no. 1, pp. 104533-104544, 2022. https://doi.org/10.1016/j.imavis.2022.104533
[12] I. Iraei and K. Faez, "A motion parameters estimating method based on deep learning for visual blurred object tracking," IET Image Processing, vol. 15, no. 10, pp. 2213-2226, 2021. https://doi.org/10.1049/ipr2.12189
[13] B. Liu, Y. Chai, Y. Liu, C. Huang, Y. Wang, and Q. Tang, "Industrial process fault detection based on deep highly-sensitive feature capture," Journal of Process Control, vol. 102, no. 1, pp. 54-65, 2021. https://doi.org/10.1016/j.jprocont.2021.04.003
[14] B. Chen, H. Tang, Z. Zhang, G. Tong, and B. Li, "Video-based action recognition using spurious-3D residual attention networks," IET Image Processing, vol. 16, no. 11, pp. 3097-3111, 2022. https://doi.org/10.1049/ipr2.12541
[15] J. Liu, J. Wang, W. Wang, and Y. Su, "DS-Net: Dynamic spatiotemporal network for video salient object detection," Digital Signal Processing, vol. 130, no. 1, pp. 103700-103711, 2022. https://doi.org/10.1016/j.dsp.2022.103700
[16] C. Z. Dong and F. N. Catbas, "A review of computer vision-based structural health monitoring at local and global levels," Structural Health Monitoring, vol. 20, no. 2, pp. 692-743, 2021. https://doi.org/10.1177/1475921720935585
[17] I. Salman and J. Vomlel, "Learning the structure of Bayesian networks from incomplete data using a mixture model," Informatica, vol. 47, no. 1, pp. 83-96, 2023. https://doi.org/10.31449/inf.v47i1.4497
[18] M. Majd and R. Safabakhsh, "A motion-aware ConvLSTM network for action recognition," Applied Intelligence, vol. 49, no. 7, pp. 2515-2521, 2019. https://doi.org/10.1007/s10489-018-1395-8
[19] C. Hong, "Basketball video image segmentation using neutrosophic Fuzzy C-means clustering algorithm," Informatica, vol. 48, no. 9, pp. 145-154, 2024. https://doi.org/10.31449/inf.v48i9.5929
[20] S. Choudhuri, S. Adeniye, and A. Sen, "Distribution alignment using complement entropy objective and adaptive consensus-based label refinement for partial domain adaptation," Artificial Intelligence and Applications, vol. 1, no. 1, pp. 43-51, 2023. https://doi.org/10.47852/bonviewAIA2202524
[21] J. Purohit and R. Dave, "Leveraging deep learning techniques to obtain efficacious segmentation results," Archives of Advanced Engineering Science, vol. 1, no. 1, pp. 11-26, 2023. https://doi.org/10.47852/bonviewAAES32021220