https://doi.org/10.31449/inf.v48i8.5700                                           Informatica 48 (2024) 35–48  35 
Real-time Target Detection System in Scenic Landscape Based on 
Improved YOLOv4 Algorithm 
Cheng Pan¹, Haiyan Zhao², Meijiao Sun³*
¹Scientific Research Office, Nanchang Vocational University, Nanchang 330500, China
²Office of Academic Affairs, Nanchang Vocational University, Nanchang 330500, China
³School of Economics and Management, Nanchang Vocational University, Nanchang 330500, China
E-mail: summeijiao@163.com
*Corresponding author
Keywords: YOLOv4; Image; Target detection; Landscape; Real-time; Deep learning 
Received: February 6, 2024 
With the rapid development of computer vision technology, real-time target detection systems are increasingly used in scenic landscape management and services. To enhance the precision and efficiency of real-time target detection in scenic landscapes, this research builds an optimized real-time target detection system on the fourth version of the You Only Look Once (YOLOv4) algorithm. First, adaptive spatial feature fusion is introduced to enhance YOLOv4. Then, the optimized algorithm is combined with the OpenCV library, the Python os library, and other hardware and software to design a real-time image recognition system for scenic landscapes. The results indicate that the optimized algorithm has better recognition performance, with reported precision, recall, and F1 values of 0.96, 0.97, and 0.98, respectively. The recognition system built on the optimized algorithm performed well in practical application. It ran stably under four natural landscapes (sunrise, sea of clouds, maple forest, and stone monument), with stability scores of 0.92, 0.93, 0.92, and 0.94, respectively, and its running times under these landscapes were 2.3 s, 0.8 s, 2.9 s, and 1.2 s. In conclusion, the proposed target detection algorithm performs well, and applying it in the detection system can provide technical support for the management and intelligent detection of scenic landscape images.
Povzetek (summary): The study presents an improved real-time target detection system for scenic landscape areas, based on the YOLOv4 algorithm, with adaptive spatial feature fusion and the use of the OpenCV and Python os libraries.
1 Introduction 
Under the present wave of digitization, intelligent 
management of tourist sites has become crucial in 
improving guest experience and ensuring safety [1-2]. 
The real-time Target Detection (TD) system plays a vital 
role by accurately identifying and locating various objects 
in the scenic area, such as tourists, natural landscapes, 
historical sites, and more. Furthermore, it provides 
technical support for image data management in the 
scenic area. In the field of TD, You Only Look Once 
(YOLO) and its derived network structures have achieved 
better detection results. Among them, the You Only Look 
Once Version 4 (YOLOv4) algorithm has received wide 
attention for its efficient detection speed and good 
accuracy [3-4]. Real-time TD of scenic landscapes is 
challenging due to various factors, including light 
variations, occlusion problems, complex backgrounds, 
and diverse target types. Therefore, relying solely on the 
traditional YOLOv4 for building recognition models is no 
longer sufficient to achieve high-precision TD tasks [5]. 
To address the aforementioned issues, this study 
optimizes the YOLOv4 algorithm and introduces a 
real-time landscape TD system
suitable for scenic locations. Building a real-time TD 
system for scenic spots enhances theoretical research on 
intelligent monitoring system applications in actual 
scenic spots and provides a practical technical solution. 
This solution is of great practical significance for 
promoting the development of intelligent tourism. This 
study is comprised of five sections. The initial section 
provides a concise overview of the study, while the 
subsequent section critically evaluates and summarizes 
prior research. The third section presents the research 
methodology, while the fourth section assesses algorithm 
performance. The fifth and final section offers a 
comprehensive summary of the entire study. 
2 Related Works 
Landscape image TD is a method in computer vision and 
deep learning designed to identify and locate specific 
features or objects within landscape photos or video 
streams. A multitude of experts have researched this field, 
combining different models and algorithms to advance 
the technique. Jahani et al. utilized three machine
learning techniques (support vector machine, radial basis
function neural network, and multilayer perceptron) to
simulate and evaluate landscape images of deciduous 
forests in northern Iran. By analyzing 13 landscape 
features, it was found that the multilayer perceptron 
model performed optimally in assessing the aesthetic 
quality of forest landscapes with a coefficient of 
determination of 0.878. In addition, the study identified 
the significant effect of factors such as tree species 
diversity and canopy density on landscape quality 
through the aforementioned models [6]. Peng et al. 
developed an image style conversion framework based on 
a recurrent generative adversarial network model, and 
used the framework to realize the transformation of 
landscape photos to Chinese landscape painting style. 
With the contour enhancement technique and edge 
detection operator, the conversion effect outperformed 
the traditional generative adversarial network model in 
both edge sensitivity and structural similarity. The 
outcomes demonstrated that the framework can enhance 
the landscape painting effect of landscape photographs 
with a comprehensive similarity score as high as 0.92. In 
the comparative analysis, the method outperformed 
several existing reference models in terms of visual 
quality [7]. Zhou et al. improved the traditional codebook 
modeling algorithm and proposed an improved codebook 
modeling algorithm. In addition to examining the origins 
and uses of the background modeling approach in motion 
TD, the paper also highlighted the shortcomings and 
suitability of the conventional approach, providing a 
framework for future studies on motion TD based on 
complicated background modeling. Furthermore, this
study employed a combination of deep learning
algorithms to examine the properties of fast-motion videos
and improve feature recognition. According
to the study's findings, the updated algorithm can analyze
high-speed videos efficiently and improve feature
detection in motion video frames [8]. Kikuchi et al. designed a
method capable of performing real-time detection and 
virtual removal of existing buildings from a video stream, 
aiming to more intuitively demonstrate a future scene 
without these buildings. The results showed that the 
method was able to accurately perform real-time 
detection and building removal at 5.71 frames/sec when 
the complementary field of view was no more than 15%, 
which can effectively help users visualize the future 
environment on-site while reducing time and cost 
consumption [9]. 
As the computational speed and detection efficiency 
of the YOLO algorithm continue to improve, it has 
become increasingly prevalent in numerous real-time 
application scenarios, including video surveillance, 
autonomous driving, drone monitoring, and various 
industrial vision systems. Along with advancements in 
artificial intelligence technology, experts have conducted 
vast research on the YOLO algorithm’s performance. To 
address the problem of degraded performance of deep 
learning techniques for cross-domain object detection in 
the presence of insufficient labeled data, Li et al. 
proposed a step-by-step domain-adaptive YOLO 
framework. The framework creatively constructed an 
auxiliary domain to narrow the gap between the source 
and target domains, and then utilized the newly 
developed domain-adapted YOLO algorithm for the 
cross-domain object detection task. Experimental results
showed that the proposed detection framework
significantly improved cross-domain detection
performance [10]. Lee and Hwang explored the
service performance of the YOLO algorithm in real-time 
object detection in resource-constrained AI embedded 
systems. To address the poor performance of YOLO in 
webcam object detection, a novel YOLO architecture 
with adaptive frame control was proposed in the paper to 
effectively address these issues. The results showed that 
the proposed adaptive frame control YOLO scheme can 
reduce the service delay while maintaining the high 
accuracy and convenience of YOLO, overcoming the 
real-time processing limitations of pure YOLO systems 
[11]. Aiming at the real-time monitoring and assisted 
driving requirements in self-driving vehicles, Liang et al. 
proposed an edge-cloud cooperative object detection 
system called Edge YOLO. With the use of compressed 
feature fusion and pruned feature extraction networks, 
Edge YOLO developed a lightweight framework that 
may significantly increase multi-scale prediction 
efficiency while lowering the system’s reliance on CPU 
resources. The research results showed that Edge YOLO 
had high reliability and detection accuracy on 
COCO2017 and KITTI datasets [12]. 
 
 
Table 1: Summary of related work

| Researchers | Year | Technology methods | Key findings | Limitations | Literature number |
|---|---|---|---|---|---|
| Jahani et al. | 2023 | Machine learning techniques | Using machine learning to assess the aesthetic quality of forest landscapes | Limited to forested landscapes in specific areas, not covering a wider range of landscape types | [6] |
| Peng et al. | 2022 | Recurrent generative adversarial networks | Realising the transformation of landscape photos to Chinese landscape painting style | Conversion effects are dependent on specific styles with limited ability to generalise | [7] |
| Zhou et al. | 2023 | Modified codebook modelling algorithms | Improving feature recognition of fast-motion video sequences | Needs further optimisation to deal with extreme lighting and complex backgrounds | [8] |
| Kikuchi et al. | 2022 | Semantic segmentation and GAN | Capable of detecting and virtually removing existing buildings from video streams in real-time | Only applicable to specific landscape assessment scenarios | [9] |
| Li et al. | 2022 | Progressive domain adaptation YOLO framework | Significantly improves cross-domain object detection performance | Requires ancillary domain data, limited applicability | [10] |
| Lee and Hwang | 2022 | Adaptive frame control YOLO architecture | Improved service performance of YOLO in real-time object detection | Primarily for resource-constrained environments, may not be applicable to all scenarios | [11] |
| Liang et al. | 2022 | Edge YOLO system | Enabling efficient real-time object detection in autonomous driving | Dependent on edge-cloud co-operation, high implementation costs | [12] |
 
In summary, Table 1 shows that numerous 
experts have conducted studies on image detection and 
YOLO algorithm performance testing. Additionally, 
many experts have applied the YOLO network model 
to image detection and achieved superior research 
results. The ongoing development of the tourism 
industry has led to the application of various 
intelligent technologies in tourism management. To 
enhance real-time detection and retrieval of landscape 
images in scenic spots, this study will optimize the 
YOLO network and develop a real-time TD system for 
the said images. By doing so, tourists can acquire 
landscape images swiftly and locate attractions 
efficiently, providing significant technical assistance 
to intelligent tourism management. 
 
3 Real-time TD system for
landscape images of scenic spots
based on YOLOv4-ASFF algorithm
 
In today's digital era, real-time TD systems significantly improve both the management of tourist attractions and the experience of tourists. In this study, the traditional YOLOv4 TD algorithm is first optimized by introducing Adaptive Spatial Feature Fusion (ASFF) to improve its detection accuracy. On this basis, a real-time TD system for scenic landscape images is designed around the resulting algorithm, You Only Look Once version 4 with adaptive spatial feature fusion (YOLOv4-ASFF). The system is intended to detect landscape images in real time and thereby improve image data management in the scenic area management system.
3.1 TD Algorithm design based on 
improved YOLOv4 
The traditional YOLOv4 is a popular real-time TD 
algorithm commonly used for computer vision tasks. 
Compared to the previous three versions of YOLO 
algorithm, YOLOv4 brings several improvements and 
advantages [13]. Figure 1 depicts the conventional 
YOLOv4 network configuration.
[Figure: network diagram. Recoverable labels: input layer, CSPDarknet-53 backbone, SPP, CBL blocks (CBL, CBL*3, CBL*5), Concat, Upsampling, and Conv outputs y1, y2, y3; output sizes 13×13×255, 26×26×255, and 52×52×255.]
Figure 1: Traditional YOLOv4 network structure
 
The classic YOLOv4 network topology shown in Figure 1 has six key components: the input layer, the backbone network, the additional modules, the head network, the anchors, and the loss function. The input layer primarily prepares the incoming image data. The backbone network is mainly used for feature extraction; Cross Stage Partial (CSP) networks are frequently used in it to increase learning capacity while lowering the computational cost of the model. The additional module is located behind the backbone network and enlarges the receptive field through pooling operations, allowing the network to recognize features at different scales and increasing the model's adaptability to the input size. The head network is dedicated to the final TD task and contains a target size prediction layer, a target category prediction layer, and a target frame prediction layer. In YOLOv4, anchors are reference boxes against which the actual target boxes are predicted. The dimensions of the anchors are obtained by clustering the bounding box dimensions in the training dataset. Finally, the loss function guides training so that the network's predictions approach the expected outputs over repeated training iterations. In the YOLOv4 network structure, the loss function is mainly divided into two parts: classification loss and location regression loss. The common classification loss functions are the cross-entropy loss function and the Softmax loss function. Equation (1) is the mathematical expression of the cross-entropy loss function, which is frequently employed in classification loss [14].
$$L(\hat{y}) = -\frac{1}{N}\sum_{i=1}^{N}\left[ y_i \log(\hat{y}_i) + (1 - y_i)\log(1 - \hat{y}_i) \right] \tag{1}$$
In equation (1), $\hat{y}_i$ denotes the probability that sample $i$ is predicted to be a positive class, and $y_i$ denotes the label of sample $i$, where positive category labels are denoted by 1 and negative category labels are denoted by 0. $N$ denotes the number of all labels. $L(\hat{y})$ denotes the cross-entropy loss function. The Softmax loss function is calculated as shown in equation (2).
$$\mathrm{Softmax}(x) = -\log\frac{e^{x_i}}{\sum_{j=1}^{C} e^{x_j}} \tag{2}$$
In equation (2), $C$ denotes the number of categories, $x_i$ denotes the output of the correct category, $x_j$ denotes the output for category $j$, and $\mathrm{Softmax}(x)$ denotes the Softmax loss function. Smooth L1 is a location regression loss function, and its expression is given in equation (3).
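The two classification losses in equations (1) and (2) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the clamping constant `eps`, and the sample values are hypothetical, and the max-subtraction in the Softmax loss is a standard numerical-stability step that the text does not mention.

```python
import math

def binary_cross_entropy(labels, probs, eps=1e-12):
    """Equation (1): mean binary cross-entropy over N samples."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0); eps is an assumption
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(labels)

def softmax_loss(outputs, correct_index):
    """Equation (2): negative log of the softmax probability of the correct category."""
    m = max(outputs)  # subtract the maximum for numerical stability
    log_sum = m + math.log(sum(math.exp(x - m) for x in outputs))
    return log_sum - outputs[correct_index]

# Hypothetical labels, predicted probabilities, and raw category outputs
bce = binary_cross_entropy([1, 0, 1], [0.9, 0.2, 0.7])
sm = softmax_loss([2.0, 0.5, -1.0], correct_index=0)
print(round(bce, 4), round(sm, 4))
```

Both losses shrink toward 0 as the prediction for the correct class becomes more confident, which is the behavior training relies on.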
$$\mathrm{Smooth}_{L1}(x) = 0.5x^2 \tag{3}$$
Smooth L1 uses the expression in equation (3) when $|x| < 1$, where $x$ denotes the positional regression value. When $|x| \geq 1$, equation (3) becomes equation (4).
$$\mathrm{Smooth}_{L1}(x) = |x| - 0.5 \tag{4}$$
Based on equation (3) and equation (4), the border regression loss for real TD is obtained as shown in equation (5).
$$L_{loc}(t^u, v) = \sum_{i \in \{x, y, w, h\}} \mathrm{Smooth}_{L1}(t_i^u - v_i) \tag{5}$$
In equation (5), $L_{loc}(t^u, v)$ denotes the loss value of the border regression task for the actual TD. $t^u$ and $v$ denote the predicted and actual coordinates, respectively, and their specific expressions are shown in equation (6). $i$ indexes the box coordinates $\{x, y, w, h\}$.
 
$$\begin{cases} t^u = (t_x^u, t_y^u, t_w^u, t_h^u) \\ v = (v_x, v_y, v_w, v_h) \end{cases} \tag{6}$$
In equation (6), $(t_x^u, t_y^u, t_w^u, t_h^u)$ and $(v_x, v_y, v_w, v_h)$ denote the predicted coordinates and the true coordinates, respectively, in line with equation (5). Regressing the four coordinates of equation (6) as a whole, rather than one by one, gives the loss in equation (7).
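The coordinate-wise regression loss of equations (3) to (6) can be sketched as follows. This is an illustrative example only: the dictionary layout for the boxes, the function names, and the numbers are assumptions made for the sketch, not details taken from the study.

```python
def smooth_l1(x):
    """Equations (3) and (4): quadratic for |x| < 1, linear otherwise."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5

def loc_loss(t_u, v):
    """Equation (5): sum of Smooth L1 over the x, y, w, h components."""
    return sum(smooth_l1(t_u[k] - v[k]) for k in ("x", "y", "w", "h"))

# Hypothetical predicted (t_u) and true (v) coordinates, as in equation (6)
pred = {"x": 0.5, "y": 0.2, "w": 2.0, "h": 1.0}
true = {"x": 0.0, "y": 0.2, "w": 0.5, "h": 1.0}
print(loc_loss(pred, true))
```

The two branches meet at $|x| = 1$ (both give 0.5), so the loss is continuous; small errors are penalized quadratically and large errors only linearly, which limits the influence of outliers.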
 
$$\mathrm{IoULoss} = -\ln\frac{\mathrm{Intersection}(box_{gt}, box_{pre})}{\mathrm{Union}(box_{gt}, box_{pre})} \tag{7}$$
In equation (7), $box_{gt}$ and $box_{pre}$ denote the true and predicted frames, respectively. When the real and predicted boxes do not overlap, the intersection over union (IoU) in equation (7) equals 0 regardless of how far apart the boxes are, so it cannot reflect the distance between the predicted and real values. Based on this, a distance measurement is introduced as shown in equation (8).
 
$$\mathrm{GIoULoss} = \mathrm{IoU} - \frac{A_c - U}{A_c} \tag{8}$$
In equation (8), IoU denotes the intersection over union of the two boxes, that is, the ratio inside the logarithm of equation (7). $A_c$ denotes the area of the smallest enclosing region that contains both boxes. $U$ denotes the area of the union of the two boxes. Building on equation (8), the distance between the centroids of the two boxes is further considered to obtain the DIoULoss loss function in equation (9).
 
$$\mathrm{DIoULoss} = \mathrm{IoU} - \frac{\rho^2(c_{pre}, c_{gt})}{d^2} \tag{9}$$
In equation (9), $c_{pre}$ and $c_{gt}$ denote the centroid positions of the prediction frame and the real frame, respectively. $\rho$ denotes the Euclidean distance between the two centroids. $d$ denotes the diagonal length of the smallest region enclosing the prediction frame and the real frame. A penalty factor is added to equation (9) to obtain equation (10).
 
$$\mathrm{CIoULoss} = \mathrm{IoU} - \frac{\rho^2(c_{pre}, c_{gt})}{d^2} - \alpha v \tag{10}$$
In equation (10), $\alpha v$ denotes the penalty factor, where $\alpha$ and $v$ denote the weight function and the aspect ratio measurement parameter, respectively. The specific equation for the weight function is shown in equation (11).
 
$$\alpha = \frac{v}{(1 - \mathrm{IoU}) + v} \tag{11}$$
In equation (11), IoU denotes the intersection over union of the two boxes. The equation for the aspect ratio measurement parameter is shown in equation (12).
 
$$v = \frac{4}{\pi^2}\left(\arctan\frac{w_{gt}}{h_{gt}} - \arctan\frac{w_{pre}}{h_{pre}}\right)^2 \tag{12}$$
In equation (12), $w_{gt}$ and $h_{gt}$ denote the width and height of the true frame, respectively, and $w_{pre}$ and $h_{pre}$ denote the width and height of the predicted frame. With these terms, equation (10) gives the loss function of YOLOv4.
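To make the chain from equation (7) to equation (12) concrete, the sketch below evaluates the IoU, GIoU, DIoU, and CIoU expressions for two axis-aligned boxes. It is a minimal example under stated assumptions: boxes are given as `(x1, y1, x2, y2)` tuples with positive area, all function names are hypothetical, and the expressions follow equations (8) to (12) as written. Training code commonly minimizes one minus these quantities; that usage is an assumption here, not something the text states.

```python
import math

def box_terms(gt, pre):
    """Return IoU, union area, and the smallest enclosing box of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(gt[2], pre[2]) - max(gt[0], pre[0]))
    ih = max(0.0, min(gt[3], pre[3]) - max(gt[1], pre[1]))
    inter = iw * ih
    union = (gt[2] - gt[0]) * (gt[3] - gt[1]) + (pre[2] - pre[0]) * (pre[3] - pre[1]) - inter
    enclose = (min(gt[0], pre[0]), min(gt[1], pre[1]), max(gt[2], pre[2]), max(gt[3], pre[3]))
    return inter / union, union, enclose

def giou_term(gt, pre):
    """Equation (8): IoU minus the share of the enclosing area not covered by the union."""
    iou, union, e = box_terms(gt, pre)
    a_c = (e[2] - e[0]) * (e[3] - e[1])
    return iou - (a_c - union) / a_c

def center_penalty(gt, pre):
    """rho^2 / d^2 from equation (9): squared center distance over squared enclosing diagonal."""
    _, _, e = box_terms(gt, pre)
    dx = (gt[0] + gt[2]) / 2 - (pre[0] + pre[2]) / 2
    dy = (gt[1] + gt[3]) / 2 - (pre[1] + pre[3]) / 2
    d2 = (e[2] - e[0]) ** 2 + (e[3] - e[1]) ** 2
    return (dx * dx + dy * dy) / d2

def diou_term(gt, pre):
    """Equation (9)."""
    return box_terms(gt, pre)[0] - center_penalty(gt, pre)

def ciou_term(gt, pre):
    """Equations (10) to (12): DIoU expression minus the aspect-ratio penalty alpha * v."""
    iou = box_terms(gt, pre)[0]
    ratio_gt = math.atan((gt[2] - gt[0]) / (gt[3] - gt[1]))
    ratio_pre = math.atan((pre[2] - pre[0]) / (pre[3] - pre[1]))
    v = (4 / math.pi ** 2) * (ratio_gt - ratio_pre) ** 2
    alpha = v / ((1 - iou) + v) if v > 0 else 0.0  # v = 0 means identical aspect ratios
    return iou - center_penalty(gt, pre) - alpha * v

gt_box, pre_box = (0, 0, 2, 2), (1, 1, 3, 3)  # hypothetical boxes
print(giou_term(gt_box, pre_box), diou_term(gt_box, pre_box), ciou_term(gt_box, pre_box))
```

For two non-overlapping boxes the IoU is 0 in every case, while `giou_term` keeps decreasing as the boxes move apart, which is the motivation given for equation (8).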
In the scenic landscape TD problem, distant buildings may appear very small while nearby sculptures may appear very large, so the traditional YOLOv4 struggles to detect real-time landscape targets reliably. In addition, shifting lighting conditions, dynamic target objects, and complex terrain backgrounds can affect its detection. The study therefore incorporates ASFF to enhance the stability and detection precision of the conventional YOLOv4. ASFF enables
effective information exchange at the feature layer and 
enhances the model's recognition ability for targets at 
different scales by intelligently adjusting weights 
between multi-scale feature maps. The ASFF mechanism 
dynamically adjusts fusion weights by learning weight 
parameters between different feature maps, allowing the 
algorithm to adaptively strengthen the response to 
important features. In addition, ASFF can simultaneously 
suppress background noise, significantly improving 
YOLOv4's ability to detect small targets in complex 
landscapes and accurately recognize large targets. The 
incorporation of ASFF into the YOLOv4 model enhances 
the feature extraction process and improves the model's 
recognition accuracy in various and challenging 
landscapes. This is particularly true when dealing with 
scenes that have strong lighting variations and significant 
differences in target size. Figure 2 depicts the ASFF's 
organizational structure.
[Figure: ASFF diagram. Recoverable labels: feature levels 1, 2, and 3 with strides 32, 16, and 8; feature fusion; modules ASFF-1, ASFF-2, and ASFF-3, each producing an output.]
Figure 2: ASFF structure diagram
 
In Figure 2, ASFF is mainly composed of multi-scale feature maps, adaptive weight learning, and a spatial fusion mechanism. With ASFF, feature maps at different scales can exchange information more effectively, which improves small-target detection while preserving high recognition accuracy for large targets. In ASFF, the features from three layers are adaptively fused to enrich the information, and the fusion process is shown in equation (13).
 
$$y_{ij}^{l} = \alpha_{ij}^{l} \cdot x_{ij}^{1 \to l} + \beta_{ij}^{l} \cdot x_{ij}^{2 \to l} + \gamma_{ij}^{l} \cdot x_{ij}^{3 \to l} \tag{13}$$
In equation (13), $\alpha_{ij}^{l}$, $\beta_{ij}^{l}$, and $\gamma_{ij}^{l}$ denote the weight parameters of the first, second, and third layers, respectively. $x_{ij}^{1 \to l}$, $x_{ij}^{2 \to l}$, and $x_{ij}^{3 \to l}$ denote the features of the first, second, and third layers, respectively. $y_{ij}^{l}$ denotes the fused features. The structure of YOLOv4-ASFF is obtained by adding ASFF into YOLOv4, as shown in Figure 3.
[Figure: YOLOv4-ASFF diagram. Recoverable labels: input layer; backbone blocks CBM, CSP1, CSP2, CSP8, CSP8, CSP4; CBL and SPP blocks; ASFF1, ASFF2, and ASFF3; an ASFF head with CBL+Conv branches; outputs y1, y2, y3; output sizes 13×13×255, 26×26×255, and 52×52×255.]
Figure 3: YOLOv4-ASFF structure diagram
 
In Figure 3, the optimized YOLOv4-ASFF is mainly composed of the input layer, the feature fusion layer, and the decoupled head module. Unlike the traditional path aggregation network, YOLOv4-ASFF adopts the ASFF network, which weights and fuses the three layers of features output from the backbone network. This further enriches the feature information of the scenic landscape image while avoiding the introduction of too many parameters. To further increase detection accuracy, the model also adopts a decoupled detection head so that bounding-box regression and classification are optimized separately.
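The fusion rule of equation (13) can be illustrated at a single spatial position $(i, j)$. The sketch below is a minimal example rather than the authors' implementation: it assumes the three feature vectors have already been resized to the same level, and it normalizes the three weights with a softmax so that they sum to 1, which is how ASFF is commonly formulated but is not stated in the text above. The function names `fusion_weights` and `asff_fuse` and the sample values are hypothetical.

```python
import math

def fusion_weights(logits):
    """Softmax over three per-position logits so that alpha + beta + gamma = 1 (assumed normalization)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def asff_fuse(x1, x2, x3, logits):
    """Equation (13) at one position (i, j): weighted sum of three same-length feature vectors."""
    alpha, beta, gamma = fusion_weights(logits)
    return [alpha * a + beta * b + gamma * c for a, b, c in zip(x1, x2, x3)]

# Hypothetical feature vectors from the three layers, already at the same level l
x1, x2, x3 = [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]
fused = asff_fuse(x1, x2, x3, logits=[0.0, 0.0, 0.0])  # equal logits give equal weights
print(fused)
```

Because the weights are learned per position, a location dominated by a small target can lean on the fine-scale layer while a location inside a large target leans on the coarse-scale layer.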
3.2 Design of Real-time Image Recognition 
System for Scenic Landscapes 
In addition to optimizing YOLOv4 to improve the 
detection accuracy of the target, it is necessary to further 
combine various types of hardware and software to build 
a complete scenic landscape image recognition system. 
The designed real-time image recognition system for 
scenic landscapes can not only support real-time image 
recognition, but also process static images, so as to 
provide tourists with instant and rich scenic area 
information, enhance the tourists’ experience, and also 
support the digital management of scenic areas. By 
building a real-time image recognition system for scenic 
landscapes, the cultural value and natural beauty of scenic 
spots can be better demonstrated, and at the same time 
provide a scientific basis for the protection and 
management of scenic resources [15]. Traditional scenic landscape recognition systems suffer from insufficient recognition accuracy, slow processing speed, poor generalization ability, limited real-time monitoring ability, and poor user interactivity. Building the real-time image recognition system for scenic landscapes on the optimized YOLOv4-ASFF algorithm can effectively mitigate these shortcomings. The structure of the scenic landscape image recognition system designed in this research is shown in Figure 4.
[Figure: block diagram. The Scenic Spot Landscape Image Recognition System branches into four modules: model building, data processing module, identification module, and output module.]
Figure 4: Structural diagram of scenic landscape image recognition system
 
In Figure 4, the designed landscape image recognition system for scenic spots mainly consists of four modules: model construction, data acquisition and input, model recognition, and recognition result output. The core of the model construction module is to create an accurate landscape recognition model. First, a large amount of scenic landscape image data is collected. These images need to cover the variety of landscapes in the scenic area, such as natural landscapes, buildings, and sculptures. Next, the images are labeled with an annotation tool, recording the categories and locations of the objects. Then, the YOLOv4-ASFF algorithm is trained on the labeled data, with the model's parameters adjusted during training to increase recognition speed and accuracy. Once training is over, the model is saved for later use. The data acquisition and input module mainly uses the OpenCV library to acquire real-time video streams from cameras set up in scenic spots. For non-real-time image recognition, an interface allows users to upload image files, and Python's os library is used to process file paths and read image data. The model recognition module first imports the previously trained YOLOv4-ASFF model. When image data is received from the data acquisition module, the model detects and recognizes the landscapes in the image. The recognition process involves extracting image features, running inference with the model, and deriving category and location information for each landscape in the image. The recognition result output module visualizes the recognition results on the user interface, for example by marking each recognized landscape with a bounding box and displaying the category name next to it. At the same time, the recognition results, including the images, the recognized landscape information, and the associated confidence level, are stored in a local folder for further analysis or archiving. Figure 5 depicts the workflow of the landscape image recognition system.
[Figure: flowchart. Recoverable nodes: Start; collect landscape image data set; label the dataset; build recognition model; decision "Are images of landscapes included?"; calculate confidence; decision "Is the calculated value greater than the expected value?"; save image; generate image; save in unrecognized folder; End. Decision branches are labeled Y and N.]
Figure 5: Flowchart of scenic landscape image recognition
In Figure 5, a complete landscape image dataset is first built and labeled. The optimized YOLOv4-ASFF algorithm is then used to build a recognition model that detects the input landscape images. If the detected content includes the target landscape, the model's confidence is computed and compared with the expected confidence value. If the detected content does not contain the target landscape, the image is saved on the computer and labeled as unrecognized. When the actual confidence value is greater than the expected value, the image is saved in the corresponding landscape folder, which completes the image recognition process. Repeating these steps allows the YOLOv4-ASFF algorithm to be tuned so that its output approaches the preset value, until all TD tasks are completed. The final recognition content is displayed on the UI visualization page, as shown in Figure 6.
[Figure: screenshots of three steps. Step 1: open the identification website. Step 2: select pictures for testing. Step 3: display the visualization results.]
Figure 6: Visual page diagram
 
In Figure 6, the user obtains the visualized recognition result of an image through the designed scenic landscape image recognition system. The user first opens the recognition website in a browser and then selects an image to detect. When the recognition system completes the detection, the results are displayed on the UI visualization page.
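The decision logic of the workflow in Figure 5 can be sketched with Python's os library, which the system uses for file paths. This is a hedged illustration, not the system's actual code: the detection format (a list of `(label, confidence)` pairs), the function name `route_image`, the folder names, and the threshold of 0.5 are all hypothetical. The text does not say what happens when a landscape is detected with confidence below the expected value; the sketch assumes such images also go to the unrecognized folder.

```python
import os
import shutil
import tempfile

def route_image(image_path, detections, out_root, expected_conf=0.5):
    """Copy an image into a per-landscape folder or into 'unrecognized'.

    detections: list of (label, confidence) pairs from a detector (hypothetical format).
    expected_conf: preset confidence value; 0.5 is an assumed placeholder.
    """
    best = max(detections, key=lambda d: d[1], default=None)
    if best is None or best[1] <= expected_conf:
        folder = "unrecognized"  # no landscape found, or confidence too low (assumption)
    else:
        folder = best[0]
    target_dir = os.path.join(out_root, folder)
    os.makedirs(target_dir, exist_ok=True)
    target = os.path.join(target_dir, os.path.basename(image_path))
    shutil.copy(image_path, target)
    return target

# Demonstration with a placeholder file instead of a real camera frame
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "frame_001.jpg")
    with open(src, "wb") as handle:
        handle.write(b"placeholder bytes, not a real JPEG")
    results = os.path.join(tmp, "results")
    saved_rel = os.path.relpath(route_image(src, [("sunrise", 0.91), ("maple_forest", 0.40)], results), tmp)
    low_rel = os.path.relpath(route_image(src, [("sunrise", 0.30)], results), tmp)
    missed_rel = os.path.relpath(route_image(src, [], results), tmp)
    print(saved_rel, low_rel, missed_rel)
```

In a deployed system the detections would come from the trained YOLOv4-ASFF model and the frames from an OpenCV video stream; both are left out here so the sketch stays self-contained.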
 
4 Evaluation of performance and 
application effect of scenic landscape 
image detection algorithm based on 
improved YOLOv4-ASFF 
To assess the performance and application effect of the algorithm and recognition system designed in this research, three different detection algorithms were selected for comparison. The YOLOv4-ASFF algorithm's detection accuracy and stability in TD were found to be better than those of the compared algorithms. In addition, applying the YOLOv4-ASFF algorithm in the recognition system also achieved better recognition results.
4.1 Performance test of scenic landscape 
image detection algorithm 
To comprehensively evaluate the performance of the improved YOLOv4-ASFF algorithm in real-world application scenarios, this study selects and processes two datasets: the publicly available Cityscapes dataset and a self-built scenic landscape image dataset collected specifically for this research. During data pre-processing, data quality is first ensured through a series of standardized steps, including image resizing, contrast enhancement, and denoising, in order to account for the various environmental factors that may be encountered in TD in scenic landscapes. The Cityscapes dataset was selected for its rich urban street view images and accurate pixel-level annotation, in order to test the algorithm's ability to detect targets in complex urban environments. The self-built scenic landscape image dataset, on the other hand, covers a wide range of natural landscapes and reflects the specific application scenarios of scenic landscape TD, which supports the practicality and applicability of the experimental results. After preprocessing, the two datasets are randomly partitioned into training and validation sets in a 9:1 ratio. Performance is evaluated with precision, recall, and F1 score, chosen for their wide use in the field of TD. Precision is the proportion of samples predicted as positive that are actually positive, while recall is the proportion of all positive samples that the model recognizes. The F1 score is the harmonic mean of precision and recall and provides a combined performance measure. These metrics are used to evaluate the YOLOv4-ASFF algorithm under various conditions and to compare it with other algorithms. Table 2 displays the makeup of the two datasets.
Table 2: Data set information 
| Data set composition | Cityscapes | Scenic spot landscape image data set |
| --- | --- | --- |
| Category | 50 different street scenes | 8 different natural landscapes |
| Number of samples | 3100 | 6500 |
| Source | Images taken in different urban street view environments | Images taken at a tourist attraction in the city |
| Annotation information | 19 categories annotated for image detection | Image detection |
| Data format | JPEG | JPEG |
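
The random 9:1 partition applied to the sample counts in Table 2 can be sketched as follows. This is a minimal illustration and not the authors' code: the function name `split_dataset`, the fixed seed, and the use of integer IDs in place of image files are all assumptions.

```python
import random

def split_dataset(sample_ids, train_fraction=0.9, seed=0):
    """Randomly partition sample identifiers into training and validation lists."""
    ids = list(sample_ids)
    rng = random.Random(seed)  # fixed seed (an assumption) keeps the split reproducible
    rng.shuffle(ids)
    n_train = round(len(ids) * train_fraction)
    return ids[:n_train], ids[n_train:]

# Sample counts come from Table 2; the integer IDs are placeholders for image files.
cityscapes_train, cityscapes_val = split_dataset(range(3100))
scenic_train, scenic_val = split_dataset(range(6500))

print(len(cityscapes_train), len(cityscapes_val))  # 2790 310
print(len(scenic_train), len(scenic_val))          # 5850 650
```

Splitting each dataset separately, as assumed here, keeps the 9:1 ratio within each source rather than only across the pooled images.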
 
Table 2 gives the specific information of the two datasets, including the number of samples, categories, and annotation information. The hardware and software environment of the experiment is shown in Table 3.
 
Table 3: Experimental environment 
| Environment | Setup | Parameter configuration |
| --- | --- | --- |
| Hardware environment | CPU | AMD Ryzen7 4800H |
| Hardware environment | GPU | NVIDIA GeForce RTX2060, 6GB RAM |
| Software environment | Programming system | PyTorch |
| Software environment | Operating system | Windows 10, 64-bit |
 
Table 3 gives the hardware and software environment for this experiment. The loss functions of YOLO, Single Shot MultiBox Detector (SSD), YOLOv4-ASFF, and You Only Look Once version 5 (YOLOv5) were tested on the datasets in Table 2 under the experimental environment in Table 3, and the resulting curves are shown in Figure 7. 
[Figure 7 panels: (a) Loss function curve of YOLO algorithm; (b) Loss function curve of SSD algorithm; (c) Loss function curve of YOLOv4-ASFF algorithm; (d) Loss function curve of YOLOv5 algorithm. Axes: Epoch (x), Loss (y). Curves: Train loss, Actual loss.]
 
Figure 7: Loss function curve changes of each algorithm 
 
Figure 7 shows the loss function curves of the four detection algorithms. Figures 7(a), (b), (c), and (d) show the loss function curves of YOLO, SSD, YOLOv4-ASFF, and YOLOv5, respectively. Overall, the training loss curve of YOLOv4-ASFF overlaps well with the actual loss curve, and it begins to stabilize at epoch 7. By contrast, YOLO, SSD, and YOLOv5 need 17, 12, and 9 epochs, respectively, to reach a stable training loss value. 
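
The paper does not state how the stabilization epoch is read off the curves. One possible criterion, given here only as an assumption, is the first epoch after which the relative change in loss stays below a tolerance for several consecutive epochs. The function name, the tolerance, the window length, and the loss values below are all hypothetical and are not taken from Figure 7.

```python
def stabilization_epoch(losses, tol=0.02, window=3):
    """Return the first 1-based epoch from which the loss changes by at most
    `tol` (relative) over `window` consecutive steps, or None if never."""
    for start in range(len(losses) - window):
        segment = losses[start:start + window + 1]
        stable = all(
            abs(segment[i + 1] - segment[i]) <= tol * max(abs(segment[i]), 1e-12)
            for i in range(window)
        )
        if stable:
            return start + 1
    return None

# Synthetic loss values for illustration only (not data from the paper).
example_losses = [100, 60, 35, 20, 12, 8, 5.0, 4.95, 4.92, 4.90, 4.89]
print(stabilization_epoch(example_losses))  # 7
```

A stricter tolerance or a longer window would report a later epoch, so any comparison between algorithms should use the same settings for all curves.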
 
 
[Figure 8 panels: (a) Detection accuracy of four algorithms; (b) Detection recall rate of four algorithms; (c) Detection F1 value of four algorithms. Axes: Number of samples (x); Detection accuracy, Recall, and F1 value (y). Legend: YOLO, SSD, YOLOv4-ASFF, YOLOv5.]
 
Figure 8: Detection accuracy, recall rate and F1 value changes of each algorithm 
 
 
The precision, recall, and F1 values of the four detection algorithms are shown in Figure 8. In Figure 8(a), when the number of samples is 500, the detection precision values of YOLO, SSD, YOLOv4-ASFF, and YOLOv5 are 0.69, 0.78, 0.96, and 0.91, respectively. In Figure 8(b), when the number of samples is 500, the detection recall values of the four algorithms are 0.71, 0.77, 0.97, and 0.90, respectively. In Figure 8(c), when the number of samples is 500, the detection F1 values of the four algorithms are 0.70, 0.78, 0.98, and 0.90, respectively. 
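
For reference, the three metrics reported above can be computed from counts of true positives, false positives, and false negatives, as in the minimal sketch below. The function name and the example counts are illustrative assumptions. How a detection is matched to a ground-truth target (for example, by an overlap threshold) is not specified in this section and is left out.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from detection counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

# Hypothetical counts: 90 correct detections, 10 false alarms, 30 missed targets.
p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
print(f"precision={p:.3f} recall={r:.3f} F1={f1:.3f}")
# precision=0.900 recall=0.750 F1=0.818
```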
The frames per second (FPS) of the four detection algorithms are shown in Figure 9. In Figure 9(a), YOLO, SSD, YOLOv4-ASFF, and YOLOv5 eventually reach frame rates of 22, 25, 29, and 34 fps on the training dataset, respectively. In Figure 9(b), the four algorithms eventually reach frame rates of 23, 25, 30, and 35 fps on the validation dataset, respectively. Compared with YOLO and SSD, YOLOv4-ASFF and YOLOv5 reach stable frame rates faster, indicating that these two algorithms detect images more efficiently. 
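
A frame rate such as those above is typically obtained by timing a detector over a sequence of frames. The sketch below shows one way to do this with the Python standard library. It is an assumption about the measurement procedure, and `dummy_detect` is a hypothetical stand-in for a real detector, not any of the four algorithms.

```python
import time

def measure_fps(detect, frames, warmup=2):
    """Time `detect` over `frames` and return the average frames per second."""
    for frame in frames[:warmup]:  # warm-up calls are excluded from timing
        detect(frame)
    start = time.perf_counter()
    for frame in frames:
        detect(frame)
    elapsed = time.perf_counter() - start
    return len(frames) / elapsed if elapsed > 0 else float("inf")

def dummy_detect(frame):
    """Hypothetical detector: sleeps about 5 ms to mimic inference time."""
    time.sleep(0.005)
    return []

frames = [None] * 20  # placeholder frames; a real test would pass decoded images
fps = measure_fps(dummy_detect, frames)
print(f"{fps:.1f} frames per second")
```

Repeating the measurement several times, as the "Number of experiments" axis in Figure 9 suggests, helps separate steady-state throughput from start-up effects.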
 
[Figure 9 panels: (a) FPS changes of different algorithms in the training set; (b) Changes in FPS of different algorithms in the validation set. Axes: Number of experiments (x), FPS (y). Legend: YOLO, SSD, YOLOv4-ASFF, YOLOv5.]
 
Figure 9: Changes in frames per second for each algorithm 
 
4.2 Scenic landscape recognition system 
application effect analysis 
In addition to verifying that the YOLOv4-ASFF algorithm has a performance advantage in image detection, the study used each of the above detection algorithms to build a real-time recognition system for scenic landscape images and compared the image recognition effect of each system in practical applications (see Figure 10).
[Figure 10 panels: (a) Operational stability of different recognition systems; (b) Running time of different recognition systems. Axes: Landscape type (maple forest, sea of clouds, sunrise, stone monument) (x); Operational stability in (a) and Operational hours in (b) (y). Legend: YOLO, SSD, YOLOv5, YOLOv4-ASFF.]
 
Figure 10: Operation stability and identification time of each system 
 
Figures 10(a) and 10(b) show the operational stability and recognition time of the four recognition systems, respectively. Four natural landscapes, namely sunrise, sea of clouds, maple forest, and stone monument, were selected as test objects. In Figure 10(a), the recognition system built on the YOLOv4-ASFF algorithm has an operational stability of 0.92, 0.93, 0.92, and 0.94 under the four natural landscapes, respectively, which is much higher than that of the recognition systems built on the other three algorithms. In Figure 10(b), the recognition time of the system built on the YOLOv4-ASFF algorithm under the four natural landscapes is 2.3 s, 0.8 s, 2.9 s, and 1.2 s, respectively, which is much lower than that of the recognition systems built on the other three algorithms. 
Figure 11 shows how the traditional YOLO recognition system and the recognition system built on the YOLOv4-ASFF algorithm each recognize three natural landscape images. As Figures 11(a) and 11(b) show, the optimized recognition system better recognizes the details in the natural landscape images, including the sunrise, inscription text, and pedestrians. The recognition system constructed using the YOLOv4-ASFF algorithm therefore gives better results in practical application.
[Figure 11 panels: (a) Optimized recognition effect of landscape recognition system; (b) Recognition effect of traditional landscape recognition system.]
 
Figure 11: Actual landscape recognition results of the two recognition systems 
 
5 Discussion 
To enhance the accuracy of detecting landscape images in scenic areas, this study optimized the traditional YOLOv4 TD algorithm by introducing ASFF. The resulting algorithm, YOLOv4-ASFF, was used to build a real-time TD system for scenic landscape images. This approach improves the TD accuracy of YOLOv4, and the system's real-time detection capability confirms its ability to detect and manage landscape images in scenic areas. In the comparative analysis, the YOLOv4-ASFF proposed in this study demonstrates significant performance advantages over existing related work. Through the introduction of ASFF, the optimized YOLOv4 algorithm not only achieved higher precision, recall, and F1 values, but also provided a significant improvement in real-time performance in complex landscape environments. YOLOv4-ASFF achieved precision, recall, and F1 scores of 0.96, 0.97, and 0.98, respectively. In comparison, Jahani A et al. achieved a coefficient of determination of 0.878 in assessing the aesthetic quality of forest landscapes using machine learning techniques, and Peng X et al. achieved a composite similarity score of 0.92 in transforming landscape photos based on recurrent generative adversarial network models. Although these studies performed well in their respective domains, their reported scores are lower than the precision and recall of the algorithm proposed in this study for the complex task of detecting scenic landscapes. Additionally, this study demonstrated strong performance in the frame rate test, achieving processing speeds of up to 30 fps without compromising detection accuracy, which is critical for real-time surveillance systems. Compared to the other two approaches in the related work, YOLOv4-ASFF performed better mainly because the ASFF mechanism and the efficient network architecture significantly improve the algorithm's ability to detect multi-scale targets, especially in complex scenic environments, which enables more accurate identification and localization of targets of different sizes. In addition, optimizing the YOLOv4 algorithm improved not only the accuracy of TD but also the processing speed, enabling the algorithm to meet the dual requirements of speed and accuracy for real-time TD systems. 
In this study, three comparison models (YOLO, SSD, and YOLOv5) were introduced to test the performance of YOLOv4-ASFF. The YOLOv4-ASFF-based recognition system for scenic landscape images achieved an operational stability of 0.92, 0.93, 0.92, and 0.94 under four natural landscapes, namely sunrise, sea of clouds, maple forest, and stone monument, and a recognition time of 2.3 s, 0.8 s, 2.9 s, and 1.2 s, respectively, surpassing the performance of the other three models. The following analysis explains why YOLOv4-ASFF outperforms the other models in these usage scenarios. YOLOv4-ASFF extends the YOLOv4 framework with ASFF, which significantly improves the model's ability to recognize targets of different sizes. The ASFF mechanism dynamically adjusts the feature fusion weights according to target size, which is especially important for multi-scale TD in scenic landscapes. Furthermore, YOLOv4-ASFF uses an efficient backbone network and feature fusion techniques, including CSPNet and PANet, to improve both detection speed and accuracy. In comparison, SSD is less accurate on small targets, especially in complex landscape environments, because it detects directly on feature maps at different scales. YOLOv5, while improved in speed and accuracy, lacks the adaptive feature fusion mechanism of YOLOv4-ASFF and does not perform as well for highly complex backgrounds and multi-scale targets. The YOLOv4-ASFF architecture therefore provides a significant performance advantage in real-time TD in scenic landscapes, particularly for multi-scale TD tasks under changing light and complex background conditions. 
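
The adaptive weighting described above can be illustrated with a small sketch. It follows the general ASFF idea that, at each spatial position, the feature levels are combined with weights that are normalized by a softmax so that they sum to one. Several simplifications are assumptions of this sketch and not details from the paper: each level is a single-channel map already resized to a common resolution, and the weight logits are passed in directly, whereas in a trained network they would be predicted from the features. The names `softmax` and `asff_fuse` are hypothetical.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def asff_fuse(features, logits):
    """Fuse feature levels position by position.

    features: list of levels, each an H x W nested list (same shape for all levels)
    logits:   list of levels, each an H x W nested list of unnormalized weights
    """
    height, width = len(features[0]), len(features[0][0])
    fused = [[0.0] * width for _ in range(height)]
    for i in range(height):
        for j in range(width):
            # Weights across levels sum to 1 at every position.
            weights = softmax([level[i][j] for level in logits])
            fused[i][j] = sum(w * level[i][j] for w, level in zip(weights, features))
    return fused

# Three hypothetical 1 x 2 feature levels with constant values 1, 2, and 3.
features = [[[1.0, 1.0]], [[2.0, 2.0]], [[3.0, 3.0]]]
# Equal logits at position 0; position 1 strongly favors the third level.
logits = [[[0.0, 0.0]], [[0.0, 0.0]], [[0.0, 10.0]]]
fused = asff_fuse(features, logits)
print(fused)  # position 0 is the plain average; position 1 is close to 3.0
```

Because the weights vary by position, a region containing a small target can draw mostly on one level while a region containing a large target draws on another.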
In summary, this study has improved the YOLOv4 
algorithm, achieving breakthrough performance in 
real-time TD in scenic landscapes. It also outperforms 
existing related work in key performance metrics, such as 
precision, recall, F1 value, and frame rate. This result 
demonstrates the superiority of the improved algorithm 
and provides a new direction for subsequent research on 
real-time TD in complex environments. It has important 
academic value and practical application potential. 
6 Conclusion 
To enhance the detection effectiveness of scenic landscape images, this study used the improved YOLOv4 algorithm to build and evaluate a real-time TD system for scenic landscapes. A comparison of the loss function curves of the YOLO, SSD, YOLOv4-ASFF, and YOLOv5 algorithms showed that YOLOv4-ASFF performed best: its training loss began to stabilize at the 7th epoch, while YOLO, SSD, and YOLOv5 required 17, 12, and 9 epochs, respectively. When the sample size was 500, YOLOv4-ASFF achieved precision, recall, and F1 values of 0.96, 0.97, and 0.98, respectively, outperforming the other algorithms significantly. Furthermore, YOLOv4-ASFF and YOLOv5 performed well in the frame rate test, achieving 29 fps and 34 fps, respectively, whereas YOLO and SSD reached lower frame rates of 22 fps and 25 fps, respectively. In the analysis of the scenic landscape recognition system, the system built on the YOLOv4-ASFF algorithm demonstrated high operational stability of 0.92, 0.93, 0.92, and 0.94 for four natural landscapes: sunrise, sea of clouds, maple forest, and stone monument. Moreover, the recognition time for these landscapes was low, ranging from 0.8 s to 2.9 s, with an average of 1.6 s. Furthermore, the enhanced recognition system detects landscape image details with greater accuracy than the conventional recognition system. To summarize, the TD algorithm developed in this research performs better and yields improved outcomes in real-world scenarios. However, this study still has some limitations: recognition performance under extreme lighting and complex backgrounds needs improvement, and the current test does not cover all possible types of natural landscapes. Future research should extend the tests to more landscape types. 
7 Future work 
The optimization model YOLOv4-ASFF, designed in this 
research, has achieved significant results in real-time TD 
in scenic landscapes. However, the significance of this 
research extends beyond the field of intelligent 
monitoring of scenic landscapes. Future work will 
explore the algorithm's potential in other areas, such as 
intelligent transport systems, unmanned surveillance 
security, automated agricultural monitoring, and rapid 
response to natural disasters. These areas require efficient 
and accurate real-time TD techniques. The challenge of 
balancing computational efficiency, real-time 
performance, and resource consumption of algorithms is 
particularly relevant for practical deployments. This is 
especially true in resource-constrained environments, 
such as the use of UAVs for on-site monitoring during 
natural disasters. Ensuring algorithm performance while 
reducing energy consumption is a major challenge. In 
addition, future research should focus on improving the 
model's generalization ability in diverse environmental 
conditions and complex backgrounds through in-depth 
adaptive improvements. Furthermore, future research is 
planned to investigate the utilization of multimodal data 
sources, such as infrared and radar fusion techniques, to 
further improve the model's ability to detect targets in 
extreme weather conditions and low-light environments. 
Additionally, the latest advances in deep learning, such as 
self-supervised learning and meta-learning strategies, can 
be combined with model training methods that require 
only a small amount of labeled data. This approach can 
help reduce the cost of large-scale data labeling and 
improve the adaptability of models. Finally, with the 
importance of AI ethics and privacy protection in mind, 
future research will focus on ensuring the interpretability 
and fairness of algorithms to promote the sustainability 
and social responsibility of the technology. In summary, 
the improved YOLOv4 algorithm and related 
technologies will play an important role in a wider range 
of fields, promoting the development of intelligent 
monitoring technologies and bringing positive impacts in 
practical applications. 
References 
[1] Syamimi Abdul Khalil, Shuzlina Abdul Rahman, 
Sofianita Mutalib, and Nurin Mirza Afiqah Andrie 
Dazlee. Object detection for autonomous vehicles 
with sensor-based technology using YOLO. 
International journal of intelligent systems and 
applications in engineering, 10(1):129-134, 2022. 
https://doi.org/10.18201/ijisae.2022.276 
[2] Muhammed Enes Atik, Z. Duran, and Roni 
ÖZGÜNLÜK. Comparison of YOLO versions for 
object detection from aerial images. International 
journal of environment and geoinformatics, 
9(2):87-93, 2022. 
https://doi.org/10.30897/ijegeo.1010741 
[3] Yunyun Song, Zhengyu Xie, Xinwei Wang, and 
Yingquan Zou. MS-YOLO: object detection based 
on YOLOv5 optimized fusion millimeter-wave 
radar and machine vision. IEEE Sensors journal, 
22(15):15435-15447, 2022. 
https://doi.org/10.1109/JSEN.2022.3167251 
[4] Xin Shen, Xudong Sun, Huibing Wang, and Xianping 
Fu. Multi-dimensional, multi-functional and 
multi-level attention in YOLO for underwater object 
detection. Neural computing and applications, 
35(27):19935-19960, 2023. 
https://doi.org/10.1007/s00521-023-08781-w 
[5] Sugiarto Wibowo, and Indar Sugiarto. Hand symbol 
classification for human-computer interaction using 
the fifth version of YOLO object detection. 
CommIT (communication and information 
technology) journal, 17(1):43-50, 2023. 
https://doi.org/10.21512/commit.v17i1.8520 
[6] Ali Jahani, Maryam Saffariha, and Pegah Barzegar. 
Landscape aesthetic quality assessment of forest 
lands: an application of machine learning approach. 
Soft computing, 27(10):6671-6686, 2023. 
https://doi.org/10.1007/s00500-022-07642-3 
[7] Xianlin Peng, Shenglin Peng, Qiyao Hu, Jinye Peng, 
Jiaxin Wang, Xinyu Liu, and Jianping Fan. 
Contour-enhanced CycleGAN framework for style 
transfer from scenery photos to Chinese landscape 
paintings. Neural computing and applications, 
34(20):18075-18096, 2022. 
https://doi.org/10.1007/s00521-022-07432-w 
[8] Kai Zhou, Zhendong Zhang, Rui Yuan and Enqing 
Chen. A deep learning algorithm for fast motion 
video sequences based on improved codebook 
model. Neural computing and applications, 
35(6):4353-4368, 2023. 
https://doi.org/10.1007/s00521-022-07079-7 
[9] Takuya Kikuchi, Tomohiro Fukuda, and Nobuyoshi 
Yabuki. Diminished reality using semantic 
segmentation and generative adversarial network for 
landscape assessment: evaluation of image 
inpainting according to colour vision. Journal of 
computational design and engineering, 
9(5):1633-1649, 2022. 
https://doi.org/10.1093/jcde/qwac067 
[10] Guofa Li, Zefeng Ji, Xingda Qu, Rui Zhou, and 
Dongpu Cao. Cross-domain object detection for 
autonomous driving: A stepwise domain adaptative 
YOLO approach. IEEE Transactions on intelligent 
vehicles, 7(3):603-615, 2022. 
https://doi.org/10.1109/TIV.2022.3165353 
[11] Jeonghun Lee, and Kwang-il Hwang. YOLO with 
adaptive frame control for real-time object detection 
applications. Multimedia tools and applications, 
81(25):36375-36396, 2022. 
https://doi.org/10.1007/s11042-021-11480-0 
[12] Siyuan Liang, Hao Wu, Li Zhen, Qiaozhi Hua, 
Mohammad Mehedi Hassan, and Keping Yu. Edge 
YOLO: real-time intelligent object detection system 
based on edge-cloud cooperation in autonomous 
vehicles. IEEE Transactions on intelligent 
transportation systems, 23(12):25345-25360, 2022. 
https://doi.org/10.1109/TITS.2022.3158253 
[13] Raidah Salim Khudeyer, and Noor Mohammed 
Almoosawi. Fake Image Detection Using Deep 
Learning. Informatica, 47(7):115-120, 2023. 
https://doi.org/10.31449/inf.v47i7.4741 
[14] Xiaojian Wang, Xiaoye Sun, and Zixuan Wang. 
Construction of visual evaluation system for 
building block night scene lighting based on 
multi-target recognition and data processing. IET 
Circuits, devices and systems, 17(3):149-159, 2023. 
https://doi.org/10.1049/cds2.12154 
[15] Mehdi Gheisari, Hooman Hamidpour, Yang Liu, and 
Peyman Saedi. Data mining techniques for web 
mining: a survey. Artificial intelligence and 
applications, 1(1):3-10, 2022. 
https://doi.org/10.47852/bonviewAIA2202290