https://doi.org/10.31449/inf.v48i11.6033
Informatica 48 (2024) 59–70

Method for Top View Pedestrian Flow Detection Based on Small Target Tracking

Ming Li 1,2*, Hui Dong 3, Fei Zhang 2, Xiaoxiao Liu 2
1 School of Computer, Electronic and Information, Guangxi University, Nanning 530004, China
2 Intelligent Manufacturing Department, Zaozhuang Vocational College, Zaozhuang 277000, China
3 Medical School, Zaozhuang Vocational College, Zaozhuang 277000, China
E-mail: liming860311@163.com

Keywords: target tracking, vision transformer, deep sort, pedestrian flow detection

Received: April 17, 2024

In public spaces, monitoring pedestrian flow can effectively prevent crowding and stampede incidents and thus improve public safety. To improve the accuracy and efficiency of small target tracking in pedestrian flow detection from a top-down perspective, this study integrates the Vision Transformer architecture and the Deep SORT tracking algorithm into an improved YOLOv5 model. The method aims to achieve efficient detection in complex pedestrian flow scenarios by enhancing the recognition and tracking accuracy of small targets. In the experiments, the improved YOLOv5-V-D model converged quickly, after 61 iterations, and ran efficiently with an average delay of only 7.2 ms, which is 3.9 ms and 6.4 ms lower than CenterNet and RetinaNet, respectively. Furthermore, the model predicted pedestrian flow accurately, with a prediction accuracy of 98.72%, leading the comparison models by 20.59 and 28.61 percentage points, respectively. In summary, the improved YOLOv5 model not only provides faster detection but also significantly improves the accuracy of pedestrian flow detection. This advancement offers an effective solution for monitoring high-density crowds, establishing a solid foundation for future real-time monitoring systems and significantly enhancing public safety.

Povzetek: A method for detecting pedestrians from a top view, based on small target tracking, is proposed. It improves pedestrian detection using an improved YOLOv5 model, the Vision Transformer, and the Deep SORT algorithm.

1 Introduction

The rapid development of Artificial Intelligence (AI) technology has led to the increasing application of real-time video surveillance systems in many fields. Such systems are of great significance for maintaining public order, optimizing traffic management, and formulating emergency evacuation plans [1-2]. However, when detecting pedestrian flow from a top view angle, traditional monitoring systems are often limited by resolution and complex backgrounds, making accurate identification and tracking difficult [3]. In response to this challenge, this study fully utilizes the advantages of the YOLOv5 network and the Vision Transformer (ViT) to construct a more adaptive Small Target Detection (STD) method and thereby improve the accuracy of Pedestrian Flow Detection (PFD) from a Top-down Perspective (TP-PFD). The method is based on YOLOv5 and relies on its lightweight, efficient design to ensure the speed and sensitivity of the model when processing real-time video streams [4]. To further improve small target recognition, ViT is introduced. This technology effectively captures global information in images through a Self-Attention Mechanism (SAM), which helps to distinguish pedestrian flow from the background in complex scenes [5].
By integrating the fine-grained feature recognition advantages of ViT, the accuracy of STD can be enhanced. In addition, an enhanced data preprocessing module further improves the model's adaptability to small target shapes. The innovation of this study lies in the combination of YOLOv5 and ViT, which remedies YOLOv5's shortcomings in STD and improves feature utilization. The proposed method aims to improve the robustness and accuracy of TP-PFD, thereby better serving application scenarios such as public safety monitoring, crowd management, and business analysis. It is hoped that this paper provides a new technological path for the relevant fields and has a positive driving effect on PFD and analysis from a top-down perspective. As a result, while ensuring public safety, it can promote the development of smart city construction. The study is divided into four parts. The first part summarizes the fields of Small Target Tracking (STT) and PFD. Part 2 presents the implementation of the proposed improvement. Part 3 covers the validation and testing of the research method. Part 4 summarizes the entire text.

2 Related works

STT detection is an important research direction in computer vision that focuses on detecting and tracking small objects in images. In application scenarios such as surveillance videos, drone images, and satellite images, small targets are very challenging to detect and track because of their small pixel size, lack of detailed information, and susceptibility to environmental noise and occlusion. With the improvement of computing power and advances in deep learning, STT detection algorithms are moving toward greater accuracy, real-time performance, and robustness. Shi et al. proposed a sea-surface small-target feature detector based on dispersion relative entropy. This method was superior to existing single-feature detectors in suppressing clutter and improving detection performance, and was comparable to three-feature detectors [6]. Yang et al. proposed an improved helmet detection algorithm based on YOLO V4 to address the susceptibility of existing helmet detection algorithms to occlusion. This algorithm significantly improved the detection accuracy of small and occluded targets and optimized the convergence speed and regression accuracy of model training [7]. Zhi et al. proposed a framework called attention context region detection for the precise recognition of small and medium-sized traffic signs in intelligent transportation systems. The framework utilized attentive contextual features and combined target and environmental information through point convolutional layers, achieving state-of-the-art results in small traffic sign detection [8]. Bommes et al. proposed a ResNet-34 Convolutional Neural Network (CNN) based on unsupervised domain adaptation for the automatic detection of module faults in photovoltaic (PV) systems. It combined a supervised contrastive loss with a k-means cluster classifier for anomaly detection in small target images. In nine combinations of four source and target datasets containing 2.92 million infrared images, its classification accuracy for normal and abnormal images reached 79.4% and 77.1%, respectively, and it could reliably detect unknown types of anomalies [9]. Qin et al. proposed a dense sampling and detail enhancement network to address the insufficient performance of existing object detection algorithms in STD.
It improved feature map resolution and expanded receptive fields through dense sampling modules. On the Minico2021 and VisDrone datasets, this method improved by approximately 4.6% and 4.2%, respectively, over the advanced DetectoRS algorithm [10]. PFD refers to the use of various sensors or video devices to estimate the number of people in a specific area. This field focuses on accurately counting each individual in a crowd and is a key sub-field of computer vision and image processing. PFD technology has extensive applications in fields such as business analysis, public safety, urban planning, and traffic management. Advances in machine learning and the growth of computing resources are continuously improving the accuracy and efficiency of PFD. Minegishi et al. proposed a pedestrian flow simulator based on actual physical parameters to address the challenge of corridor fire evacuation. When the density exceeded 2.35 people per square meter, pedestrians exhibited stagnant behavior, with speed increasing in direct proportion to spacing, while the specific flow rate increased linearly with density, and density was inversely proportional to speed [11]. Yang et al. proposed a deep learning detection method based on the Single Shot MultiBox Detector (SSD) for precise target recognition and localization in smart city applications. The algorithm optimized the network structure through VGG16, achieving a maximum mAP of 77% and an accuracy of 96.31% [12]. Song et al. proposed a progressive refinement network for pedestrian detection under complex occlusion. The method effectively improved the accuracy and domain adaptability of occluded pedestrian detection [13]. Yang et al. proposed an SSD-based deep learning detection method to address the large pedestrian traffic of crowded subway stations. This method outperformed other mainstream detection methods [14]. Zhang et al. proposed an asymmetric multi-stage network to address the challenge of small-scale pedestrian detection. It used rectangular anchor boxes and asymmetric convolutional kernels to handle the asymmetry of the pedestrian body, improved detection performance through three-stage gradual feature selection, and demonstrated excellent performance in benchmark tests [15]. A summary of the related works is given in Table 1.
Table 1: Related works summary table

Research | Major technology | Application scenario | State-of-the-art gap addressed
Shi et al. | Sea-surface small-target feature detector based on dispersion relative entropy | Marine surveillance video | Insufficient capture of small-target details
Yang and Wang | Improved helmet detection algorithm based on YOLO V4 | Security monitoring | Limited detection accuracy for occluded targets
Zhi et al. | Attention context region detection framework | Intelligent transportation systems | Small traffic sign detection
Bommes and Hoffmann | ResNet-34 network based on unsupervised domain adaptation | Photovoltaic system module fault detection | Accuracy of anomaly detection
Qin et al. | Dense sampling and detail enhancement network | Drone images | STD performance
Minegishi et al. | Flow simulator based on physical parameters | Fire evacuation | Dynamic flow simulation
Yang et al. | Deep learning detection method based on a single-shot detector | Smart city | Target recognition and localization
Song et al. | Progressive refinement network | Complex occlusion conditions | Occluded pedestrian detection
Zhang et al. | Asymmetric multi-stage network | Small-scale pedestrian detection | Pedestrian body asymmetry
This study | YOLOv5 and ViT | Public space monitoring | Small target recognition and tracking

In summary, the current literature on STT and PFD has demonstrated the significant advantages of deep-learning-based models. Nevertheless, research on PFD from overhead angles is still lacking. Challenges remain in extremely dense crowd scenarios, especially in improving the recognition rate of small targets while maintaining high detection speed. Therefore, this study proposes a TP-PFD method based on STT. It employs an enhanced YOLOv5 architecture that integrates the strengths of ViT. This integration leverages the lightweight nature of ViT to enhance the STD capabilities of the native network, ultimately aiming to achieve accurate TP-PFD.

3 Construction of STT algorithm for TP-PFD

To improve the accuracy of TP-PFD, this study constructs an improved algorithm by integrating the Token-to-Token ViT (T2T-ViT) architecture into the YOLOv5 framework. The algorithm aims to optimize YOLOv5's recognition of small targets and to enhance the model's resolution and tracking accuracy for individuals in crowded scenes. The introduction of T2T-ViT is intended to enrich feature expression through its SAM, thereby improving the robustness of the model to small targets against complex backgrounds. Through this improvement, the accuracy and stability of TP-PFD can be improved while maintaining real-time performance.

3.1 Construction of YOLOv5 algorithm for PFD

In fields such as surveillance video analysis, intelligent transportation systems, and public safety, PFD is a key technology used to estimate the number of people in specific areas, monitor pedestrian density, and track individual movements [16]. Observing the crowd from a top-down angle reduces occlusion and helps with more accurate flow counting and behavior analysis. However, from a top-down perspective, the human body usually occupies only a small region of the image, and interaction and occlusion between individuals can reduce detection accuracy. Dynamic lighting changes in the scene also pose challenges for PFD.
YOLOv5 is an extremely fast object detection model that can meet the needs of real-time monitoring, and it performs well on STD [17]. Meanwhile, thanks to its excellent customizability, YOLOv5 can be extended and optimized for specific needs. Its relatively small number of parameters also makes it easy to deploy [18]. Therefore, this study chooses YOLOv5 as the base model for TP-PFD. The model combines speed and accuracy to provide reliable detection performance in challenging scenarios. Figure 1 shows the process of applying the YOLOv5 model to TP-PFD in this study.

Figure 1: Flow framework of the YOLOv5 model under TP-PFD (video input → real-time target detection → annotation, learning, and analysis → YOLOv5 training → saved model → pedestrian flow analysis output)

The Transformer architecture is a deep learning model proposed in 2017. It was originally designed to solve sequence problems in natural language processing. Transformer quickly became the mainstream model in natural language processing owing to its efficiency and strong performance, and it gradually expanded to computer vision and other fields. One of its key innovations is that it is entirely based on the SAM. This mechanism allows the model, while processing each element, to dynamically attend to the other elements in the sequence, thereby capturing their relationships. The SAM computes attention scores across positions in the sequence, letting the model focus on the relevant parts of the input, which is very useful for understanding long-range dependencies. The Transformer model can run multiple SAMs in parallel, with each head focusing on a different subspace of the input, which improves the model's ability to capture different types of information [19]. ViT is the first work to apply the Transformer directly to image classification; it divides an image into multiple small patches (tokens) and then processes them with a standard Transformer. Although ViT performs well on large-scale datasets, its performance is weaker on small-scale images or datasets. This is partly because it loses significant information in these situations, especially in its ability to capture fine-grained features. Figure 2 shows the network structure of ViT.

Figure 2: ViT network structure diagram (image input → linear projection → Transformer encoder → multilayer perceptron → prediction)

To overcome the limitations of ViT on small-scale images or datasets, an innovative T2T-ViT mechanism is introduced. This mechanism effectively aggregates locally relevant feature information. Specifically, T2T-ViT recursively merges adjacent image blocks (tokens) into higher-level tokens to construct a hierarchical representation. This not only improves the model's ability to represent small targets and complex textures, but also reduces the model's complexity and computational cost. In traditional ViT design, an image is divided into a series of fixed-size blocks, each of which is flattened and linearly projected into a token. These tokens are then fed into the Transformer for processing. However, this approach usually ignores local structural information between blocks, especially when image details are fine.
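For concreteness, the fixed-size tokenization described above can be sketched as follows. This is a minimal PyTorch illustration, not code from this study; the 224×224 input, 16×16 patches, and 768-dimensional embedding are typical ViT defaults assumed here. Because the stride equals the patch size, neighbouring patches share no pixels, which is precisely the loss of local structure noted above.

```python
# Minimal sketch of ViT-style tokenization (illustrative; patch size,
# channel count, and embedding width are assumptions, not this paper's values).
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into fixed-size, non-overlapping patches and
    linearly project each flattened patch into a token."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A convolution with stride equal to its kernel size is equivalent to
        # "flatten each patch, then apply a shared linear projection".
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                      # x: (B, 3, H, W)
        x = self.proj(x)                       # (B, D, H/16, W/16)
        return x.flatten(2).transpose(1, 2)    # (B, N, D) token sequence

tokens = PatchEmbedding()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 196, 768])
```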
To address this issue, T2T-ViT proposes a token-to-token conversion mechanism. In this process, the model first performs a Soft Split (SS) on the image, meaning that the segmented blocks are allowed to overlap so as to retain more local information. Next, the model recursively merges adjacent tokens into new tokens, similar to the way features are aggregated layer by layer in a neural network. After each merge, the model obtains a more global and abstract image representation while reducing the number of tokens to be processed. In this way, T2T-ViT can maintain and refine the image representation at each step, allowing the model to capture global information while also attending to local details. This makes T2T-ViT more effective for delicate image features, and it outperforms ViT especially in small target recognition and fine-grained classification tasks. Figure 3 shows the T2T module.

Figure 3: Diagram of the T2T module (tokens $A_i$ are reconstructed into $A'_i$, converted into $C_i$, and unfolded into $A_{i+1}$)

The reconstruction process in Figure 3 can be represented as formula (1):

$A'_i = \mathrm{MLP}(\mathrm{MSA}(A_i))$ (1)

In formula (1), $A'_i$ represents the reconstructed output, $\mathrm{MLP}$ represents a multi-layer perceptron module, $\mathrm{MSA}$ represents the multi-head self-attention mechanism, and $A_i$ represents the initial input. The conversion process can be described as formula (2):

$C_i = \mathrm{Reshape}(A'_i)$ (2)

In formula (2), $C_i$ represents the conversion output and $\mathrm{Reshape}$ represents the overall conversion operation of the module. The final output is given by formula (3):

$A_{i+1} = \mathrm{SS}(C_i)$ (3)

In formula (3), $A_{i+1}$ represents the output after the soft split and $\mathrm{SS}$ represents the SS operation. T2T-ViT also reduces the burden of self-attention computation in the Transformer through this structured merging. Because the number of tokens decreases after merging, the computational complexity of self-attention also decreases. This design not only improves the model's expressive power but also makes it more suitable for environments with limited computing resources. In summary, T2T-ViT enhances the model's ability to capture image details through its token-to-token conversion process and improves computational efficiency by reducing the number of tokens. As a result, the Transformer architecture becomes more powerful and efficient in image-related tasks.
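As an illustration of formulas (1)–(3), the following minimal PyTorch sketch performs one token-to-token step: token reconstruction with multi-head attention and an MLP, a reshape back to a spatial map, and an overlapping soft split via nn.Unfold. The embedding width, window size, and stride are assumptions chosen for illustration; a full T2T-ViT adds further components such as per-stage re-projection.

```python
# Minimal sketch of one T2T step (formulas (1)-(3)): reconstruct tokens with
# attention + MLP, reshape back to a feature map, then soft-split with
# overlapping windows via nn.Unfold. All dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class T2TStep(nn.Module):
    def __init__(self, dim=64, heads=1):
        super().__init__()
        self.msa = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        # Soft split: kernel 3, stride 2, padding 1 -> adjacent windows overlap.
        self.soft_split = nn.Unfold(kernel_size=3, stride=2, padding=1)

    def forward(self, tokens, h, w):              # tokens: (B, h*w, dim)
        a, _ = self.msa(tokens, tokens, tokens)   # MSA(A_i)
        a = self.mlp(a)                           # A'_i = MLP(MSA(A_i)), formula (1)
        b, n, d = a.shape
        c = a.transpose(1, 2).reshape(b, d, h, w) # C_i = Reshape(A'_i), formula (2)
        out = self.soft_split(c)                  # overlapping 3x3 windows
        return out.transpose(1, 2)                # A_{i+1}: (B, new_len, 9*dim), formula (3)

step = T2TStep()
nxt = step(torch.randn(2, 14 * 14, 64), 14, 14)
print(nxt.shape)  # torch.Size([2, 49, 576])
```

Note how one step shrinks the sequence from 196 to 49 tokens while widening each token; this is the token-count reduction that the text credits for the lower self-attention cost.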
3.2 Improvement of the PFD YOLOv5 algorithm

When applying the YOLOv5 algorithm to PFD, performance optimization is the focus of attention. The limitations of YOLOv5 in detection accuracy and speed are analyzed in depth to explore targeted improvements. These improvements aim to enhance YOLOv5's detection ability in complex pedestrian flow scenarios through structural adjustments and algorithmic optimization. Because pedestrian targets in TP-PFD are highly similar to one another, this study introduces prior knowledge to make targeted improvements to the network. Specifically, an attention mechanism is added to the T2T module, performing another conversion after the image is converted. The module architecture is shown in Figure 4.

Figure 4: Improved T2T module for TP-PFD (unfolded input → channel-wise mean and max maps → splice → 1×1 convolution)

Figure 4 shows the T2T module improvement that accounts for the similarity of targets in TP-PFD. Taking the average value can be expressed as formula (4):

$A_{mean} = \mathrm{Mean}(A_i)$ (4)

In formula (4), $\mathrm{Mean}$ represents averaging, $A_{mean}$ represents the output of the $\mathrm{Mean}$ operation, and $A_i$ represents the input. Taking the maximum value is formula (5):

$A_{max} = \mathrm{Max}(A_i)$ (5)

In formula (5), $A_{max}$ represents the maximum output value and $\mathrm{Max}$ represents taking the maximum. The attention mechanism is given by formula (6):

$Attention = \mathrm{Conv}_{1\times 1}(\mathrm{Cat}(A_{max}, A_{mean}))$ (6)

In formula (6), $Attention$ represents the output of the attention mechanism, $\mathrm{Conv}_{1\times 1}$ represents a 1×1 convolution used mainly for dimensionality reduction, and $\mathrm{Cat}$ denotes concatenation, joining $A_{mean}$ with $A_{max}$. The recombined output after introducing the attention mechanism is then fed back into the module (a brief code sketch of this computation is given below, after Figure 6). The purpose of this further optimization is to enhance the feature learning performance of the model in TP-PFD. The improved T2T module is shown in Figure 5.

Figure 5: Improved T2T module for PFD at overhead angles (image input → unfold → normalization → multi-head attention → normalization → multilayer perceptron, with point-by-point additions, stacking T2T Transformer and improved T2T blocks)

By combining the improved T2T-ViT module, this study constructs a backbone network with a structure similar to a CNN, as shown in Figure 6. It has a deep, narrow structure like a CNN, where S denotes the number of stacked modules. The objective is to enhance the model's learning and recognition of pedestrian features under overhead angles by optimizing the token representation of the images, thus overcoming the similarity of pedestrian targets under this perspective.

Figure 6: Backbone network structure of the top-view pedestrian flow detection network (input c×h×w; linear embedding and four improved T2T stages, stacked ×2, ×2, ×2, and ×6, producing pyramid features P2: 96×h/4×w/4, P3: 192×h/8×w/8, P4: 384×h/16×w/16, and P5: 768×h/32×w/32)
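Before turning to the tracking component, formulas (4)–(6) can be made concrete with a minimal PyTorch sketch of the mean/max attention. The sigmoid gating used to recombine the attention map with the input feature map is an assumption, since the text does not spell out that final step.

```python
# Minimal sketch of the mean/max attention of formulas (4)-(6): channel-wise
# mean and max maps are concatenated ("spliced") and reduced by a 1x1
# convolution. The sigmoid gating at the end is an assumed recombination step.
import torch
import torch.nn as nn

class MeanMaxAttention(nn.Module):
    def __init__(self):
        super().__init__()
        # 1x1 convolution reduces the two stacked maps to one attention map.
        self.conv1x1 = nn.Conv2d(2, 1, kernel_size=1)

    def forward(self, x):                        # x: (B, C, H, W)
        a_mean = x.mean(dim=1, keepdim=True)     # formula (4): Mean(A_i)
        a_max, _ = x.max(dim=1, keepdim=True)    # formula (5): Max(A_i)
        attn = self.conv1x1(torch.cat([a_max, a_mean], dim=1))  # formula (6)
        return x * torch.sigmoid(attn)           # recombined output (assumed gating)

feat = torch.randn(1, 64, 28, 28)
print(MeanMaxAttention()(feat).shape)  # torch.Size([1, 64, 28, 28])
```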
In the study, the Deep SORT tracking algorithm is further adopted to achieve accurate tracking of pedestrian flow. This algorithm incorporates Kalman Filtering (KF) [20]. KF achieves optimal prediction of the future state from the current and historical states. The system state equation of KF is formula (7):

$X_k = aX_{k-1} + bU_k + W_k$ (7)

In formula (7), $X_k$ and $X_{k-1}$ represent the system state at times $k$ and $k-1$, $a$ and $b$ represent the corresponding system transition matrices, $U_k$ represents the system control matrix at time $k$, and $W_k$ represents the process noise. The system observation equation is formula (8):

$Z_k = hX_k + V_k$ (8)

In formula (8), $Z_k$ represents the observation at time $k$, $h$ represents the system observation matrix, and $V_k$ represents the observation noise. For convenience of calculation, the process noise is assumed to be white noise that does not change with the system state. This condition can be expressed as formula (9):

$E[W_k]=0,\ \mathrm{Cov}(W_k,W_i)=\begin{cases} Q, & k=i \\ 0, & k\neq i \end{cases}; \quad E[V_k]=0,\ \mathrm{Cov}(V_k,V_j)=\begin{cases} R, & k=j \\ 0, & k\neq j \end{cases}$ (9)

In addition to formula (9), the condition in formula (10) must be met:

$\mathrm{Cov}(W_k, V_k) = 0$ (10)

In formulas (9) and (10), $Q$ and $R$ represent the covariance matrices of the corresponding noise. The state prediction equation of KF is formula (11):

$X_{k|k-1} = aX_{k-1|k-1} + bU_{k-1}$ (11)

In formula (11), $X_{k|k-1}$ represents the system state at time $k$ predicted from time $k-1$, and $X_{k-1|k-1}$ represents the optimal estimate at time $k-1$. The covariance prediction is formula (12):

$P_{k|k-1} = aP_{k-1|k-1}a^T + Q$ (12)

In formula (12), $P_{k|k-1}$ and $P_{k-1|k-1}$ are the covariance matrices corresponding to $X_{k|k-1}$ and $X_{k-1|k-1}$, and $a^T$ represents the transpose of the state transition matrix. The optimal estimate is expressed as formula (13):

$X_{k|k} = X_{k|k-1} + K_k(Z_k - hX_{k|k-1})$ (13)

In formula (13), $X_{k|k}$ represents the optimal estimated system state at time $k$, $Z_k$ represents the observation at time $k$, $K_k$ represents the Kalman gain matrix at time $k$, and $h$ represents the observation matrix. $K_k$ is calculated by formula (14):

$K_k = P_{k|k-1}h^T\left(hP_{k|k-1}h^T + R\right)^{-1}$ (14)

In formula (14), $R$ represents the covariance matrix of the observation noise. The covariance update is formula (15):

$P_{k|k} = (I - K_k h)P_{k|k-1}$ (15)

In formula (15), $P_{k|k}$ represents the updated covariance of the system state at time $k$. This study uses YOLOv5 for object detection and combines it with the Deep SORT tracking algorithm to achieve object detection and tracking in TP-PFD. The pseudo-code of the research method is shown in Figure 7.

```plaintext
Algorithm: YOLOv5-V-D Training and Evaluation

Procedure: Train_YOLOv5_V_D(TrainingData, Hyperparameters)
Input:  TrainingData    - Data for training
        Hyperparameters - Parameters for training, e.g. learning rate, batch size
Begin
  1. Initialize YOLOv5 model.
  2. Integrate T2T-ViT module for small target detection.
  3. For each epoch, do:
     a. Loop over batches in TrainingData.
     b. Perform forward pass, calculate loss, and backward pass.
     c. Update model weights.
  4. Apply learning rate decay.
  5. Save the best model based on validation performance.
End Procedure

Procedure: Evaluate_YOLOv5_V_D(TestFrames, TrainedModel)
Input:  TestFrames   - Data for testing
        TrainedModel - The saved model from training
Begin
  1. For each frame, do:
     a. Get model predictions.
     b. Apply non-maximum suppression.
     c. Track individuals using Deep SORT.
  2. Calculate metrics (accuracy, F1 score, recall) for TestFrames.
End Procedure

Procedure: RealWorld_Tracking(VideoClips, TrainedModel)
Input:  VideoClips   - Real-world video data
        TrainedModel - The trained model
Begin
  1. For each clip, do:
     a. Extract frames.
     b. Apply Evaluate_YOLOv5_V_D.
     c. Aggregate and report flow prediction accuracy.
End Procedure
```

Figure 7: Method pseudo-code
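To make the filtering equations concrete, the following is a minimal NumPy sketch of the predict/update cycle of formulas (11)–(15). The one-dimensional constant-velocity model and the noise covariances are illustrative assumptions; the Deep SORT tracker applies the same cycle to a richer bounding-box state and additionally matches detections to tracks using motion and appearance cues.

```python
# Minimal NumPy sketch of the Kalman predict/update cycle in formulas
# (11)-(15). The constant-velocity model and covariances below are
# illustrative assumptions, not the configuration used in the paper.
import numpy as np

def kalman_step(x_prev, P_prev, z, a, b, u, h, Q, R):
    # Formula (11): state prediction  X_{k|k-1} = a X_{k-1|k-1} + b U_{k-1}
    x_pred = a @ x_prev + b @ u
    # Formula (12): covariance prediction  P_{k|k-1} = a P_{k-1|k-1} a^T + Q
    P_pred = a @ P_prev @ a.T + Q
    # Formula (14): Kalman gain  K_k = P_{k|k-1} h^T (h P_{k|k-1} h^T + R)^{-1}
    K = P_pred @ h.T @ np.linalg.inv(h @ P_pred @ h.T + R)
    # Formula (13): optimal estimate  X_{k|k} = X_{k|k-1} + K (Z_k - h X_{k|k-1})
    x_new = x_pred + K @ (z - h @ x_pred)
    # Formula (15): covariance update  P_{k|k} = (I - K h) P_{k|k-1}
    P_new = (np.eye(P_pred.shape[0]) - K @ h) @ P_pred
    return x_new, P_new

# Constant-velocity example: state = [position, velocity], observe position only.
dt = 1.0
a = np.array([[1.0, dt], [0.0, 1.0]])   # state transition matrix
b = np.zeros((2, 1)); u = np.zeros(1)   # no control input
h = np.array([[1.0, 0.0]])              # observe position only
Q = 0.01 * np.eye(2); R = np.array([[0.5]])
x, P = np.zeros(2), np.eye(2)
x, P = kalman_step(x, P, z=np.array([1.2]), a=a, b=b, u=u, h=h, Q=Q, R=R)
print(x)  # filtered [position, velocity]
```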
4 Performance testing of the TP-PFD model

To test the practicality and usability of the proposed TP-PFD model, this study selects the UCF_CC_50 and NWPU-Crowd datasets to validate the method. The UCF_CC_50 dataset contains 50 high-resolution images obtained from different scenes; each image contains between 94 and 4,543 people, making it suitable for crowd counting studies. NWPU-Crowd includes 5,109 images and 2,133,238 person annotations, including some images taken from a top view, and is suitable for crowd counting and localization. This study randomly selects 80% of each dataset for training and uses the remaining 20% for testing. The CenterNet (CN) and RetinaNet (RN) models are selected for comparison with the research method (YOLOv5-ViT-T2T-Deep-SORT, abbreviated YOLOv5-V-D). To avoid limiting the research through device performance, a cloud server platform is rented for the experiments. Table 2 lists the software and hardware details as well as the training parameters.

Table 2: Hardware and software details and training parameter settings

Hardware | Software | Training parameter
Supplier: Microsoft Azure | Linux: Ubuntu Server 20.04 LTS | Learning Rate: 0.001
Type: Standard NC6 | TensorFlow: 2.9 | Batch Size: 64
CPU: Intel Xeon E5-2690 v3 | PyTorch: 1.7 | Optimizer: Default
- | CUDA Toolkit: 11.0.194 | Weight Initialization: Xavier
RAM: 56 GB | cuDNN: 7.2.1 | Activation Function: Leaky ReLU
GPU: NVIDIA Tesla K80 | Python: 3.8 | Loss Function: Cross-Entropy Loss
Storage: Azure Blob Storage | OpenCV: 4.2 | Learning Rate Decay: 0.1
Storage: Azure Managed Disk | Jupyter Notebook/Lab | Epoch: 150

First, the F1 and Recall values of the three models are tested; the results are shown in Figure 8. The improved YOLOv5-V-D converges faster, reaching its optimal F1 and Recall values after about 61 iterations. In Figure 8 (a), its optimal F1 value is 0.952, which is 0.015 and 0.037 higher than CN and RN, respectively. In Figure 8 (b), its optimal Recall value is 0.947, which is 0.021 and 0.041 higher than CN and RN, respectively.

Figure 8: F1 (a) and Recall (b) of the three models over 150 training epochs

Figure 9 compares the precision-recall curves of the three models. The area under the curve of the research method is larger than those of the other two models, demonstrating its superior performance relative to CN and RN. The improved YOLOv5-V-D has better basic performance indicators and stronger practicality.

Figure 9: Precision-recall curves of the three models

The detection delay of the three models is tested on both datasets to examine practical usability; the results are shown in Figure 10. In Figure 10 (a), the optimal delay of YOLOv5-V-D is 2.4 ms, leading CN and RN by 3.5 ms and 7.2 ms, respectively. The delays in Figure 10 (b) are higher, presumably because of the greater difficulty of the dataset; there, the optimal delay of YOLOv5-V-D is 12.0 ms, which is 4.3 ms and 5.6 ms ahead of CN and RN, respectively. Overall, the improved YOLOv5-V-D has better latency on both datasets, with an average delay of 7.2 ms, leading CN and RN by 3.9 ms and 6.4 ms, respectively.

Figure 10: Delay test results of the three algorithms on (a) UCF_CC_50 and (b) NWPU-Crowd

Figure 11 examines the parameter counts of the three models.
In Figure 11 (a), on the UCF_CC_50 dataset, YOLOv5-V-D performs better than the comparison models. The parameter counts in Figure 11 (b) increase with the difficulty of the dataset, but YOLOv5-V-D still outperforms the comparison models. Overall, the improved YOLOv5-V-D has the best parameter efficiency. It achieves a lower system load and lower computing cost, which makes it easier to embed and deploy on low-performance platforms.

Figure 11: Parameter test results of the three models on (a) UCF_CC_50 and (b) NWPU-Crowd

The PFD results of the three algorithms are tested on actual top-down footage. To reduce the impact of chance errors on the results, five videos are selected for testing, as shown in Table 3. In Table 3, the improved YOLOv5-V-D predicts pedestrian flow most accurately. Its flow prediction accuracy reaches 98.72%, leading CN and RN by 20.59% and 28.61%, respectively.

Table 3: Actual flow detection test of the three models

Video clip | Model | Actual result (upflow/downflow) | Forecast result (upflow/downflow) | Accuracy (%)
Clip 1 | YOLOv5-V-D | 9/12 | 9/12 | 100.00
Clip 1 | CN | 9/12 | 7/11 | 84.72
Clip 1 | RN | 9/12 | 5/9 | 65.28
Clip 2 | YOLOv5-V-D | 11/15 | 11/15 | 100.00
Clip 2 | CN | 11/15 | 14/9 | 69.29
Clip 2 | RN | 11/15 | 11/5 | 66.67
Clip 3 | YOLOv5-V-D | 19/14 | 20/14 | 97.50
Clip 3 | CN | 19/14 | 24/15 | 86.25
Clip 3 | RN | 19/14 | 14/19 | 73.68
Clip 4 | YOLOv5-V-D | 21/17 | 20/17 | 97.62
Clip 4 | CN | 21/17 | 27/21 | 79.37
Clip 4 | RN | 21/17 | 19/29 | 75.55
Clip 5 | YOLOv5-V-D | 29/33 | 29/32 | 98.48
Clip 5 | CN | 29/33 | 21/23 | 71.06
Clip 5 | RN | 29/33 | 46/25 | 69.40

In summary, the improved YOLOv5-V-D trains efficiently and outperforms both the CN and RN models. It has lower latency and fewer parameters, and it is more accurate in actual PFD. Testing only on the UCF_CC_50 and NWPU-Crowd datasets, however, leaves the operational performance in real-world environments unclear and the limitations of the study under-specified. To evaluate the generalization ability of the model more fully, the ShanghaiTech and QNRF datasets, which contain extreme crowd densities and complex dynamic behavior, are introduced for testing. On the ShanghaiTech dataset, the model's detection accuracy declines only minimally, by 0.39%, and its performance remains robust even in high-density scenarios. Data augmentation is then used to simulate a variety of environmental conditions, including random brightness adjustment (variation range ±20%), contrast change (variation range ±15%), noise addition (Gaussian noise and salt-and-pepper noise), and image transformations that simulate different viewing angles. Under these additional conditions, the detection accuracy of the research method fluctuates within 1.50%. Under severe weather conditions, including rain, fog, snow, and dust, the detection accuracy drops by nearly 3.00%. To assess the stability of the model over long tracking periods, it is tested on a continuous video stream. The results show that the YOLOv5-V-D model maintains a tracking accuracy above 95% over video lasting up to 2 hours, demonstrating its long-term stability. Together, these results indicate good generalization performance.
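For reference, the environmental-condition augmentations described above could be assembled as follows. Since the study does not specify its implementation, this torchvision-based pipeline with custom noise transforms is an assumed reconstruction of the stated settings (±20% brightness, ±15% contrast, Gaussian and salt-and-pepper noise, and perspective changes).

```python
# Illustrative sketch of the augmentations described above, using torchvision
# plus two small custom tensor transforms. The exact pipeline used in the
# study is not specified, so noise levels and the perspective setting are
# assumptions; only the brightness/contrast ranges come from the text.
import torch
from torchvision import transforms

class GaussianNoise:
    def __init__(self, std=0.05):
        self.std = std
    def __call__(self, img):               # img: float tensor in [0, 1]
        return (img + torch.randn_like(img) * self.std).clamp(0.0, 1.0)

class SaltPepperNoise:
    def __init__(self, prob=0.01):
        self.prob = prob
    def __call__(self, img):
        mask = torch.rand_like(img)
        img = img.clone()
        img[mask < self.prob / 2] = 0.0            # pepper
        img[mask > 1.0 - self.prob / 2] = 1.0      # salt
        return img

augment = transforms.Compose([
    transforms.ColorJitter(brightness=0.2, contrast=0.15),       # ±20% / ±15%
    transforms.RandomPerspective(distortion_scale=0.3, p=0.5),   # viewing angle
    transforms.ToTensor(),                                       # PIL -> tensor
    GaussianNoise(std=0.05),
    SaltPepperNoise(prob=0.01),
])
# Usage: tensor = augment(pil_image), applied per training image.
```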
5 Discussion

This paper proposes a PFD method based on a top-down view and STT. By integrating the ViT architecture and the Deep SORT tracking algorithm, the YOLOv5 model is improved with the aim of increasing the accuracy and efficiency of small target tracking under a top-down view. The improved YOLOv5-V-D model outperforms the existing CenterNet and RetinaNet models on several performance metrics. Specifically, the F1 and Recall values of YOLOv5-V-D reach 0.952 and 0.947, respectively, significant improvements over the CenterNet and RetinaNet models. In addition, the average delay of YOLOv5-V-D is 7.2 ms, which also represents better real-time performance than other existing studies. The introduction of the T2T-ViT architecture enables the effective capture of global information in images through the SAM. This facilitates the distinction between pedestrian flow and background in complex scenes, thereby enhancing the recognition accuracy of small targets. The integration of the T2T-ViT architecture is one of the core innovations of the study. The T2T-ViT model employs a token-to-token mechanism to recursively merge adjacent image blocks, constructing a hierarchical representation. This approach not only enhances the model's capacity to identify subtle targets and intricate textures but also reduces its complexity and computational cost. The integration of the T2T-ViT architecture significantly improves the model's performance on complex pedestrian dynamics. On actual overhead footage, the YOLOv5-V-D model achieves a high pedestrian flow prediction accuracy of 98.72%, indicating that it can effectively handle interaction and occlusion between pedestrians as well as the challenges of dynamic lighting changes. Experiments on various test sets show that the model's performance deteriorates in high-density pedestrian flow, in poor lighting, and under severe lighting changes. Analysis shows that the main reasons for this degradation include, but are not limited to, insufficient representation of small targets in images, leading to inadequate feature extraction, and interference from complex backgrounds. The stability and accuracy of the tracking algorithm are also challenged by rapid motion and occlusion. Integrating infrared or depth sensor data could supplement the visual information and enhance the robustness of the model to occlusion and illumination changes. More sophisticated image enhancement techniques, such as adaptive contrast adjustment and noise reduction, could also be employed to improve image quality.

6 Conclusion

The detection and tracking of pedestrian flow is of great significance for public safety and space management. PFD from an overhead angle demands extremely strong STD capability from the model. This study aimed to improve the accuracy and efficiency of detection methods to achieve real-time monitoring in various application scenarios. By adopting the improved YOLOv5-V-D model, significant progress was made in target tracking and PFD performance. To address the shortcomings of classical models in small target recognition, the ViT-T2T framework and the Deep SORT tracking algorithm were introduced to enhance the model's sensitivity to small targets and its detection speed. After improvement, the model converged after approximately 61 iterations, demonstrating excellent performance.
The F1 and Recall values reached 0.952 and 0.947, respectively, surpassing the CN and RN models. The model also performed excellently in terms of latency, with an average delay of only 7.2 ms, significantly better than the comparison models. The practical application value of this research was reflected in the pedestrian flow prediction accuracy of 98.72%, significantly higher than that of the CN and RN models. This result confirms the effectiveness of the improved model in handling complex dynamic scenes, especially its potential application in high-density pedestrian environments. Despite these achievements, the limitations of the method in extreme environments and unusual behavior patterns should be recognized. This study did not further improve the robustness of the model; future work should address this area to enhance the model's applicability and anti-interference ability.

References

[1] X. Xu, Q. Wu, L. Qi, W. Dou, S. B. Tsai, and M. Z. A. Bhuiyan, "Trust-aware service offloading for video surveillance in edge computing enabled internet of vehicles," IEEE Transactions on Intelligent Transportation Systems, vol. 22, no. 3, pp. 1787-1796, 2021. https://doi.org/10.1109/TITS.2020.2995622
[2] Y. Zhang, J. Zhang, and R. Tao, "Key frame extraction of surveillance video based on fractional fourier transform," Journal of Beijing Institute of Technology, vol. 30, no. 3, pp. 311-321, 2021. https://doi.org/10.15918/j.jbit1004-0579.2021.058
[3] J. Qiu, L. Wang, Y. Hu, and Y. Wang, "Effective object proposals: size prediction for pedestrian detection in surveillance videos," Electronics Letters, vol. 56, no. 14, pp. 706-709, 2020. https://doi.org/10.1049/el.2020.0850
[4] D. Xi, Y. Qin, and S. Wang, "YDRSNet: an integrated Yolov5-Deeplabv3+ real-time segmentation network for gear pitting measurement," Journal of Intelligent Manufacturing, vol. 34, no. 4, pp. 1585-1599, 2023. https://doi.org/10.1007/s10845-021-01876-y
[5] Z. Zhao, F. N. Khan, Z. A. H. Qasem, B. Deng, Q. Li, Z. Liu, and H. Y. Fu, "Convolutional-neural-network-based versus vision-transformer-based SNR estimation for visible light communication networks," Optics Letters, vol. 48, no. 6, pp. 1419-1422, 2023. https://doi.org/10.1364/OL.485321
[6] S. Shi, L. Jiang, D. Cao, and Y. Zhang, "Sea-surface small target detection using entropy features with dual-domain clutter suppression," Remote Sensing Letters, vol. 13, no. 10/12, pp. 1142-1152, 2022. https://doi.org/10.1080/2150704X.2022.2127129
[7] B. Yang and J. Wang, "An improved helmet detection algorithm based on YOLO V4," International Journal of Foundations of Computer Science, vol. 33, no. 6/7, pp. 887-902, 2022. https://doi.org/10.1142/s0129054122420205
[8] G. L. Zhi, D. U. Juan, T. Feng, and Z. W. Jia, "Traffic sign recognition using an attentive context region-based detection framework," Chinese Journal of Electronics, vol. 30, no. 6, pp. 1080-1086, 2021. https://doi.org/10.1049/cje.2021.08.005
[9] L. Bommes, M. Hoffmann, C. Buerhop-Lutz, T. Pickel, J. Hauch, and C. Brabec, "Anomaly detection in IR images of PV modules using supervised contrastive learning," Progress in Photovoltaics, vol. 30, no. 6, pp. 597-614, 2022. https://doi.org/10.1002/pip.3518
[10] H. Qin, Y. Wu, F. Dong, and S. Sun, "Dense sampling and detail enhancement network: Improved small object detection based on dense sampling and detail enhancement," IET Computer Vision, vol. 16, no. 4, pp. 307-31, 2022. https://doi.org/10.1049/cvi2.12089
[11] Y. Minegishi, Y. Ohmiya, T. Sano, and M. Tange, "Analysis and modeling of pedestrian flow in a confined corridor focusing on the headway distance and velocity of pedestrians," Fire Technology, vol. 58, no. 2, pp. 709-735, 2022. https://doi.org/10.1007/s10694-021-01173-3
[12] J. Yang, W. Y. He, T. Zhang, C. Zhang, and B. F. Nan, "Research on subway pedestrian detection algorithms based on SSD model," IET Intelligent Transport Systems, vol. 14, no. 11, pp. 1491-1496, 2020. https://doi.org/10.1049/iet-its.2019.0806
[13] X. Song, B. Chen, P. Li, B. Wang, and H. Zhang, "PRNet++: learning towards generalized occluded pedestrian detection via progressive refinement network," Neurocomputing, vol. 482, no. 14, pp. 98-115, 2022. https://doi.org/10.1016/j.neucom.2022.01.056
[14] J. Yang, W. Y. He, T. Zhang, C. L. Zhang, L. Zeng, and B. F. Nan, "Research on subway pedestrian detection algorithms based on SSD model," IET Intelligent Transport Systems, vol. 14, no. 11, pp. 1491-1496, 2020. https://doi.org/10.1049/iet-its.2019.0806
[15] S. Zhang, X. Yang, Y. Liu, and C. Xu, "Asymmetric multi-stage CNNs for small-scale pedestrian detection," Neurocomputing, vol. 409, no. 7, pp. 12-26, 2020. https://doi.org/10.1016/j.neucom.2020.05.019
[16] A. Ali, "A framework for air pollution monitoring in smart cities by using IoT and smart sensors," Informatica, vol. 46, no. 5, pp. 129-138, 2022. https://doi.org/10.31449/inf.v46i5.4003
[17] J. Guo, X. Zhang, Y. Dong, Z. Xue, and B. Huang, "Terrain classification using mars raw images based on deep learning algorithms with application to wheeled planetary rovers," Journal of Terramechanics, vol. 108, no. 8, pp. 33-38, 2023. https://doi.org/10.1016/j.jterra.2023.04.002
[18] Q. Zhang, Y. Wang, L. Song, M. Han, and H. Song, "Using an improved YOLOv5s network for the automatic detection of silicon on wheat straw epidermis of micrographs," Journal of Field Robotics, vol. 40, no. 1, pp. 130-143, 2023. https://doi.org/10.1002/rob.22120
[19] A. Ali, "Remote monitoring of lab experiments to enhance collaboration between universities," Informatica, vol. 46, no. 2, pp. 169-177, 2022. https://doi.org/10.31449/inf.v46ix.xxxx
[20] S. Yang, Z. Chen, X. Ma, X. Zong, and Z. Feng, "Real-time high-precision pedestrian tracking: a detection-tracking-correction strategy based on improved SSD and Cascade R-CNN," Journal of Real-Time Image Processing, vol. 19, no. 2, pp. 287-302, 2022. https://doi.org/10.1007/s11554-021-01183-y