https://doi.org/10.31449/inf.v46i4.3884 Informatica 46 (2022) 567–574 567 
A Novel Method for Multiple Object Detection on Road Using 
Improved YOLOv2 Model 
P. Gunasekaran¹, A. Azhagu Jaisudhan Pazhani² and T. Ajith Bosco Raj³
 
¹,² Department of ECE, Ramco Institute of Technology, Rajapalayam, Tamilnadu, India.
³ Department of ECE, PSN College of Engineering and Technology, Tirunelveli, India
E-mail: mailtogunasekar@gmail.com, alagujaisudhan@gmail.com, ajithboscoraj@gmail.com 
 
Keywords: AI, Object, Vehicle, Convolutional, VOC, COCO, KITTI, YOLO 
 
Received: December 30, 2021 
 
Object detection is a branch of machine vision and image processing that deals with detecting instances
of semantic objects of a certain class. One of the most significant applications of object detection in
intelligent transportation systems is vehicle detection, which aims to extract vehicle-type information
from photographs or videos of vehicles. Fully convolutional networks (FCN) are employed in advanced
driver assistance systems (ADAS) for high-performance, fast object detection. A novel vehicle detection
model employing YOLOv2 is presented to tackle the difficulties of existing vehicle detection methods,
such as the absence of vehicle-type recognition, low detection accuracy and slow speed. The detection
model is trained using the VOC and COCO datasets, and the detection performance is evaluated
quantitatively using KITTI training images. In addition, the performance of the YOLOv2 model is
compared to that of prior models.
 
    Povzetek (summary): A new method for detecting multiple objects on the road using the YOLOv2 model has been developed.
 
1 Introduction 
Moving object detection is a computer vision technique
that deals with recognizing instances of semantic objects
of a particular class (such as humans or automobiles) in
a digital image or video. It is connected to computer
vision, image processing, and neural networks. Vehicle
detection and pedestrian detection are two well-studied
fields. In machine vision, moving object detection has a
variety of applications, including image retrieval and
video surveillance.
While new research datasets have increased the
number of training and testing instances to get closer to
real-world situations, detectors' capacity to process large
datasets in an acceptable period of time has become a
significant concern in addition to accuracy. It is not just
the number of classes that matters, but also the number
of training examples.
Detecting moving objects in a video clip entails
finding them in the frame. Every tracking technique
requires object detection, either in every frame or when
the object first appears in the video. Various background
subtraction approaches from the literature have been
simulated for moving object detection. Background
subtraction uses the difference between the current image
and a reference background that is updated over time.
Background subtraction that works well should be able to
deal with changing lighting conditions, background
clutter, shadows, camouflage, and bootstrapping, and
should segment the foreground in real time.
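The background subtraction idea described above can be sketched in a few lines. This is a minimal illustration only, assuming grayscale frames stored as nested lists and a running-average background model; the function names, the update rate `alpha` and the `threshold` value are hypothetical choices, not taken from the cited literature.

```python
def update_background(background, frame, alpha=0.05):
    """Blend the current frame into the reference background (running average)."""
    return [
        [(1 - alpha) * b + alpha * f for b, f in zip(b_row, f_row)]
        for b_row, f_row in zip(background, frame)
    ]


def foreground_mask(background, frame, threshold=25):
    """Mark pixels whose absolute difference from the background exceeds the threshold."""
    return [
        [1 if abs(f - b) > threshold else 0 for b, f in zip(b_row, f_row)]
        for b_row, f_row in zip(background, frame)
    ]


# Toy 2x3 example: one pixel changes strongly, one changes only slightly.
background = [[10, 10, 10], [10, 10, 10]]
frame = [[10, 200, 10], [10, 10, 12]]
mask = foreground_mask(background, frame)
background = update_background(background, frame)
```

A small `alpha` makes the reference background adapt slowly, which helps with gradual lighting changes but reacts slowly when a parked object becomes part of the scene.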
The tracking of moving objects in video images has
sparked a lot of interest in machine vision. Surveillance
systems, navigation systems, and object identification all
start with object tracking. Object tracking is extremely
important in real-time environments because it supports
an improved sense of safety through visual information,
security and surveillance to recognize people, analysis of
customer shopping behavior in retail spaces, video
abstraction to obtain automatic annotation of videos,
generation of object-based summaries, traffic
management to analyze flow, and the design of futuristic
video effects.
Huieun Kim et al. presented "On-road object
detection using Deep Neural Network" [4], which
advocated SSD as a faster object detection method than
R-CNN, by 41 frames per second. The model is built on
SSD (pre-trained on Pascal VOC images) and fine-tuned
with the KITTI dataset, which is made up of on-road
object classes. This work proposes an on-road object
detection method based on SSD that overcomes the
difficulties of detecting on-road objects using a camera in
real time and allows for robust object detection. It creates
appearance features from input images using
convolutional layers and learns object positions in 2D
image coordinates by calculating a loss on the object box
position (IoU) during the training step. SSD, on the other
hand, has the disadvantage of overlooking small objects
due to its grid methodology.
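Intersection over union (IoU), mentioned above as the basis of the box position loss, measures how well a predicted box overlaps a ground truth box. The following is a minimal sketch, assuming boxes are given as `(x_min, y_min, x_max, y_max)` tuples; the function name and box format are illustrative assumptions, not the implementation used in [4].

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    # Width and height of the overlap are clamped at zero for disjoint boxes.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


overlap = iou((0, 0, 2, 2), (1, 1, 3, 3))   # intersection 1, union 7
identical = iou((0, 0, 2, 2), (0, 0, 2, 2))
disjoint = iou((0, 0, 1, 1), (5, 5, 6, 6))
```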
The most representative FCN-based object
detection approaches are region-based fully
convolutional networks (R-FCN), the single shot multibox
detector (SSD), and you only look once (YOLO). To
obtain good detection performance, these approaches
need a large amount of labeled training data. Most deep
learning-based detection algorithms train the
classification model using millions of ImageNet
classification images and fine-tune it with detection
training data such as tens of thousands of PASCAL VOC
and COCO images [13]. The detection approaches based
on deep learning, on the other hand, involve a high level
of computational complexity to train the detection models.
An FCN-based object detection approach that
enhances performance in a road environment was
suggested in the publication "High Performance and Fast
Object Detection in Road Environments" [9]. Although
the SSD input size is smaller than that of YOLO, its
processing time is significantly longer. The
classification-specific layer design and the number of
default boxes account for the performance disparity. In
the classification-specific layers, the VGG-16 model used
in SSD requires around four times more processing
resources than the Darknet-19 model used in YOLO.
Hui-Lee Ooi et al. used an object detector to evaluate
multiple object tracking (MOT) in urban traffic scenes
with road users of various sizes, whereas earlier work in
this area used background subtraction or optical flow to
extract the objects of interest regardless of size. The work
involves a review of a common object detector model for
tracking in urban traffic scenes, as well as the addition
of label information to describe the objects in the scene.
Due to the diversity of objects prevalent in urban scenes,
the label information should be a valuable cue to
differentiate and associate the objects of interest across
frames, resulting in more precise trajectories. This is
documented in the paper “Multiple Object Tracking in
Urban Traffic Scenes with a Multiclass Object Detector”
[5].
Due to its efficiency and accuracy, a deep learning
object detection model from the Region-based Fully
Convolutional Network (R-FCN) framework is used to
recognize the road users in each frame. This detector was
selected because it was the top-performing technique in
the MIO-TCD localization challenge. The pre-trained
model is refined further using the MIO-TCD dataset to
provide labels for the various road users seen in traffic
scenes, each of which falls into one of eleven classes.
The work "Survey of Pedestrian Detection for
Advanced Driver Assistance Systems" [3] focuses on one
form of ADAS in particular, pedestrian protection
systems (PPSs). This study focuses on pedestrians since,
according to accident data, 70% of persons involved in
car-to-pedestrian collisions were in front of the vehicle,
with 90% of them moving. As a result, PPSs frequently
employ forward-facing sensors. The foreground
segmentation algorithm detects moving people. The
INRIA Person dataset, which is now fairly popular for
evaluating general human classification but comprises a
significant number of samples derived from high-
resolution images, was employed in this model.
The work "The Object Detection Based on Deep
Learning" [15] provides an overview of object detection
and discusses the relationship and differences between
the conventional and deep learning methods. The study
focuses on the framework design, model working
principles (YOLO, SSD), real-time model performance
analysis, and detection accuracy.
In the work "Integrated Real-Time Object Detection
for Self-Driving Vehicles" [11], the authors propose
combining Fast R-CNN with YOLO to obtain real-time
performance with around half the YOLO localization
error. The ImageNet 2012 dataset was used to pre-train
the model. This may lower the likelihood. However,
the model has trouble identifying small objects that are
close together.
To identify the best result in terms of accuracy and
speed, the work "Comparative Study of Object Detection
Algorithms" [12] focuses on three distinct models,
namely SSD, Faster R-CNN, and R-CNN. These models
are trained on the COCO dataset and their performance
metrics are evaluated. The tests are run on the same
hardware and include a variety of model combinations.
Mendes et al. proposed a method in which, when an
object centroid passes over a region of interest (ROI), the
ROI ID and region type (in/out) are saved into the object
properties [8]. This ID is used throughout the vehicle's
lifetime over subsequent frames to determine its route. At
this moment, the object ID is stored in a result set to
prevent duplicate counting. This is necessary because the
ROI is a polygon (not a single line) and an object will
pass over the region in multiple sequential frames.
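The duplicate-free counting idea can be illustrated with a point-in-polygon test and a set of already counted IDs. This is a minimal sketch under assumed data structures (tracks as `(object_id, centroid)` pairs and the ROI as a vertex list); the function names are hypothetical and this is not the implementation described in [8].

```python
def point_in_polygon(point, polygon):
    """Ray-casting test: True if point (x, y) lies inside the polygon vertex list."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def count_crossings(tracks, roi):
    """Count each object ID once, even if its centroid stays in the ROI for several frames."""
    counted_ids = set()
    for object_id, centroid in tracks:
        if object_id not in counted_ids and point_in_polygon(centroid, roi):
            counted_ids.add(object_id)
    return len(counted_ids)


roi = [(0, 0), (10, 0), (10, 4), (0, 4)]
# Object 1 is inside the ROI in two consecutive frames; object 2 never enters it.
tracks = [(1, (5, -2)), (1, (5, 1)), (1, (5, 3)), (2, (20, 2)), (3, (2, 2))]
total = count_crossings(tracks, roi)
```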
For improving school bus routing and scheduling,
see the study "Improving efficiency of school bus routing
using AI based on bio-inspired computing" [2]. The
accuracy of solutions to the school bus routing problem
(SBRP) may be enhanced further by utilizing a genetic
algorithm that deals with data preparation, routing, and
bus stop selection. The author of the work "Moving
Object Tracking in Video" [18] proposes a technique for
isolating moving objects in video sequences, followed by
a rule-based tracking system. The preliminary test
results show that the algorithm works even in difficult
conditions such as a new track, a halted track, a track
collision, and so on.
Azhagu Jaisudhan Pazhani et al. proposed the
FrRNet-ERoI framework, which combines Faster R-CNN
with enhanced ROI pooling. It is a pipeline process that
outputs the detected objects for a given test image. The
network comprises two sections, namely a region
proposal network and Fast R-CNN [16].
The work "Moving Object Tracking in Video Using
Matlab" [1] discusses a tracking approach without
background extraction. When the background is removed
from a video frame, small moving objects in that frame
form a blob after thresholding, which causes confusion
during tracking because the blob is of no use. In
"Video-Based People Tracking" [7], the authors cover
video tracking in computer vision, including design
criteria and a study of solutions ranging from simple
window tracking to tracking complicated, deformable
objects by learning shape and dynamics models.
Markus Schreiber et al. proposed sequential
processing of the GNSS raw data, i.e. each measurement
is processed as a single measurement. Given n pseudo 
range measurements at one time, n estimation steps are 
carried out successively [10]. The alternative would be to 
take a measurement vector including all measured values 
present at one time and process them in a single 
measurement step. 
 
2 Object detection models based
on region proposal
 
The extraction of region candidates and the construction 
of deep neural networks are the two key tasks in the deep 
learning object identification based on region proposal. 
 
2.1 R-CNN 
 
One of the earliest models to employ convolutional
neural networks for object detection was the R-CNN
model. R-CNN's purpose is to take in an image and
correctly identify the main objects in it. R-CNN
does exactly what it sounds like it should: it proposes
a large number of boxes in the image and checks whether
any of them corresponds to an object. R-CNN uses a
procedure called Selective Search to generate these
bounding boxes, or region proposals.
The Regions of Interest (RoI) are created first. The
RoIs are category-agnostic bounding boxes with a
high probability of containing an object of interest.
Selective Search is the approach employed in the
original study to generate them; however, other region
generation methods can be used instead. The features of
each region proposal are then extracted using a
convolutional network. The sub-image inside the bounding
box is warped to match the CNN's input size before being
sent to the network. After the network has extracted
features from the input, the features are passed to
support vector machines (SVM), which perform the final
classification. The approach is trained in stages,
starting with the convolutional network. The SVMs are
fitted to the CNN features once the CNN has been
trained. Finally, the bounding-box regressors are
trained.
The R-CNN approach is significant since it was the
first feasible solution for object detection with CNNs.
Because it was the first, it has a number of flaws that
succeeding systems have addressed. R-CNN's three key
issues are as follows. First, as previously said, training
is divided into many stages. Second, training is
expensive: features are extracted from every region
proposal and stored on disk for both SVM and
bounding-box regressor training, which takes days to
compute and hundreds of gigabytes of storage. Third, and
probably most importantly, object detection is slow,
taking about a minute per image even when using a GPU.
This is because the CNN forward computation is done
independently for each object proposal, even if the
proposals come from the same image or overlap.
2.2 Fast R-CNN 
 
Girshick's Fast R-CNN, published in 2015, is a more
efficient solution for object detection. Instead of running
the forward pass of the CNN separately for each RoI, the
fundamental idea is to run it once for the whole
image [14].
The technique takes an image and a set of regions of
interest computed from it as input. As in R-CNN, the RoIs
are created using an external method. A CNN with
several convolutional and max pooling layers is used
to process the image. These layers produce a
convolutional feature map, which is fed into a RoI
pooling layer. The feature map is used to derive a fixed-
length feature vector for each RoI. The feature vectors
are then fed into fully connected layers, which branch
into two output layers: a softmax layer that
generates probability estimates for the object classes,
and a real-valued layer that generates bounding box
coordinates by regression.
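To make the role of the RoI pooling layer concrete, the sketch below max-pools an arbitrary rectangular region of a feature map into a fixed-size grid, so that regions of different sizes all yield feature vectors of the same length. It is a simplified single-channel illustration; the function name, the end-exclusive `(row0, col0, row1, col1)` RoI format and the 2 x 2 output size are assumptions made for the example.

```python
def roi_max_pool(feature_map, roi, out_h=2, out_w=2):
    """Max-pool the region roi = (row0, col0, row1, col1), end-exclusive, to out_h x out_w."""
    row0, col0, row1, col1 = roi
    height, width = row1 - row0, col1 - col0
    pooled = []
    for i in range(out_h):
        # Split the RoI into roughly equal bins; each bin covers at least one cell.
        r_start = row0 + (i * height) // out_h
        r_end = max(row0 + ((i + 1) * height) // out_h, r_start + 1)
        row = []
        for j in range(out_w):
            c_start = col0 + (j * width) // out_w
            c_end = max(col0 + ((j + 1) * width) // out_w, c_start + 1)
            row.append(max(
                feature_map[r][c]
                for r in range(r_start, r_end)
                for c in range(c_start, c_end)
            ))
        pooled.append(row)
    return pooled


fm = [
    [1, 2, 3, 4, 5, 6],
    [7, 8, 9, 10, 11, 12],
    [13, 14, 15, 16, 17, 18],
    [19, 20, 21, 22, 23, 24],
]
whole = roi_max_pool(fm, (0, 0, 4, 6))   # a 4x6 region
small = roi_max_pool(fm, (1, 1, 4, 4))   # a 3x3 region, same output size
```

Both calls return a 2 x 2 grid, which is what allows the subsequent fully connected layers to accept RoIs of any size.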
The region proposer remained a bottleneck in Fast
R-CNN that needed to be addressed. To detect the
locations of objects, the first step is to create a set of
candidate bounding boxes, or regions of interest, to test.
In Fast R-CNN these proposals were generated using
Selective Search, a relatively slow method that was found
to be the bottleneck of the entire process.
 
2.3 Faster R-CNN 
 
Faster R-CNN found a way to make the region
proposal phase nearly free. Its insight is that region
proposals depend on image features that have already
been computed during the CNN's forward pass (the first
step of classification). A single CNN is employed in this
model to perform both region proposal and
classification. Only one CNN has to be trained, and
region proposals can be made at almost no cost. Faster
R-CNN creates the Region Proposal Network (RPN) by
adding a fully convolutional network on top of the CNN's
features.
A Faster R-CNN network is trained by alternating
between training for RoI generation and training for
detection. Two distinct networks are first trained. These
networks are then merged and fine-tuned. Certain layers
are fixed during fine-tuning, while others are trained in
turn.
A single image is fed into the trained network. The
image's feature maps are generated by the shared fully
convolutional layers. The RPN receives these feature
maps and generates region proposals, which are sent
into the final detection layers together with the feature
maps. These layers contain a RoI pooling layer and
produce the final classifications. Region proposals are
practically free to compute thanks to the shared
convolutional layers.
The use of a CNN to compute region proposals has
the extra benefit of being GPU-friendly. Traditional RoI
generation methods run on a CPU, for example
Selective Search. All of the object detection methods
presented so far use regions to identify objects. The
network does not look at the entire image at once, but
instead focuses on different parts of it sequentially.
Two issues arise as a result:
 
• To extract all of the objects, the algorithm must
pass over a single image many times.
• Because several systems operate one after
another, the performance of the later systems
depends on the performance of the earlier
ones.
2.4 R-FCN 
 
Thanks to this performance boost, Faster R-CNN was an
order of magnitude quicker than its predecessor, Fast
R-CNN. However, the region-specific component still
had to be applied many times per image. R-FCN resolved
this issue: the computation required per image was
drastically reduced by cropping features from the last
layer of features prior to prediction, rather than cropping
features from the same layer where the region proposals
are predicted. When utilising ResNet-101 as the feature
extractor, the approach is quicker than Faster R-CNN
while attaining comparable accuracy. Its cropping
mechanism is position-sensitive, which is intended to
respect translational invariance.
 
2.5 SSD (based on regression)
 
The Single Shot MultiBox Detector (SSD) goes even
further in terms of integrated detection. There is no
resampling of image segments, and the approach does
not generate any region proposals. It produces object
detections using a single pass of a convolutional network.
The approach starts with a default set of bounding boxes,
similar to a sliding window method. The object
predictions made for these boxes include offset
parameters that indicate how much the correct bounding
box surrounding the object differs from the default box.
To cope with different scales, the classifier uses
feature maps from several convolutional layers (i.e.
larger and smaller feature maps) as input. Because the
approach creates a dense set of bounding boxes, the
classifier is followed by a non-maximum suppression
stage, which removes most boxes below a particular
confidence level.
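A common way to implement such a suppression stage is greedy non-maximum suppression: discard low-confidence boxes, then keep the highest-scoring box and drop any remaining box that overlaps it too much. The sketch below assumes detections are `(score, box)` pairs with boxes in `(x_min, y_min, x_max, y_max)` form; the function names and the two threshold values are illustrative assumptions, not SSD's published settings.

```python
def box_iou(a, b):
    """IoU of two (x_min, y_min, x_max, y_max) boxes."""
    w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = w * h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(detections, score_threshold=0.5, iou_threshold=0.45):
    """detections: list of (score, box). Returns kept detections, highest score first."""
    candidates = sorted(
        (d for d in detections if d[0] >= score_threshold),
        key=lambda d: d[0],
        reverse=True,
    )
    kept = []
    for score, box in candidates:
        # Keep a box only if it does not overlap any already kept box too strongly.
        if all(box_iou(box, kept_box) <= iou_threshold for _, kept_box in kept):
            kept.append((score, box))
    return kept


detections = [
    (0.9, (0, 0, 10, 10)),
    (0.8, (1, 1, 11, 11)),     # overlaps the first box heavily
    (0.3, (50, 50, 60, 60)),   # below the confidence threshold
    (0.7, (20, 20, 30, 30)),
]
kept = non_max_suppression(detections)
```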
 
 
2.6 YOLO 
 
Training begins with a classifier network such as
VGG16. Then, for object detection, the fully connected
layers are replaced with a convolution layer and the
network is retrained end-to-end. YOLO uses 224 × 224
images to train the classifier, followed by 448 × 448
images for object detection [6]. YOLOv2 trains the
classifier using 224 × 224 images at first, but then
fine-tunes it with 448 × 448 images for a considerably
shorter time. This simplifies detector training while also
increasing mAP by 4%.
3 Proposed method for multiple 
object detection  
 
3.1 YOLOv2 model 
 
YOLOv2 is a more advanced version of the original 
YOLO. YOLO9000 is based on YOLOv2; however, it is 
trained on a combined dataset that combines the COCO 
detection dataset with ImageNet's top 9000 classes. 
 
3.2 YOLOv2 Improvement 
 
To improve the accuracy and speed of YOLO prediction, 
a number of changes are made, including: 
 
3.3 Image resolution matters 
 
The detection performance is improved by fine-tuning
the base model with high-resolution images.
 
3.4 Convolutional anchor box detection 
 
Rather than using fully connected layers to predict
bounding box positions over the whole feature map,
YOLOv2 employs convolutional layers to predict
anchor box locations, similar to Faster R-CNN. Class
probability prediction and spatial location prediction are
decoupled. Overall, the modification results in a
modest reduction in mAP while increasing recall.
 
3.5 K-means clustering of box dimensions

Unlike Faster R-CNN, which employs hand-picked
anchor box sizes, YOLOv2 runs k-means clustering on
the training data to discover good priors for the anchor
box dimensions. The distance metric is built around IoU
scores:

$$d(x, c_i) = 1 - \mathrm{IoU}(x, c_i), \quad i = 1, \dots, k$$

where x is a ground truth box candidate and c_i is one of
the k centroids. The elbow method may be used to
choose the best number of centroids (anchor boxes) k.
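The clustering step can be sketched as ordinary k-means in which the Euclidean distance is replaced by 1 − IoU. The example below is a minimal illustration that assumes boxes are `(width, height)` pairs compared as if they shared a common corner, uses the first k boxes as deterministic initial centroids, and updates centroids with the mean; these choices and the function names are assumptions for the example, not necessarily those of the authors.

```python
def wh_iou(box, centroid):
    """IoU of two (width, height) boxes aligned at a common corner."""
    inter = min(box[0], centroid[0]) * min(box[1], centroid[1])
    return inter / (box[0] * box[1] + centroid[0] * centroid[1] - inter)


def kmeans_anchors(boxes, k, iterations=100):
    """Cluster (width, height) pairs using d = 1 - IoU; returns k centroids."""
    centroids = list(boxes[:k])  # deterministic init, for illustration only
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for box in boxes:
            distances = [1 - wh_iou(box, c) for c in centroids]
            clusters[distances.index(min(distances))].append(box)
        # New centroid = mean width and height of its cluster (unchanged if empty).
        new_centroids = [
            (sum(b[0] for b in cl) / len(cl), sum(b[1] for b in cl) / len(cl)) if cl else c
            for cl, c in zip(clusters, centroids)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids


boxes = [(10, 12), (100, 90), (11, 10), (95, 105), (9, 11), (105, 100)]
anchors = sorted(kmeans_anchors(boxes, k=2))
```

Using IoU rather than Euclidean distance keeps large boxes from dominating the clustering, since the error is measured relative to box size.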
 
3.6 Direct location prediction 
 
YOLOv2 formulates the bounding box prediction in such
a way that the predicted box centre cannot drift far from
the grid cell that predicts it. Model training may become
unstable if the box location prediction can place the box
in any part of the image, as in the region proposal
network.
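The constraint can be illustrated with the parameterization given in the YOLO9000 paper [6], where a sigmoid bounds the centre offset to the predicting grid cell and an exponential scales the anchor (prior) box. The sketch below works in grid-cell units; the function and argument names are chosen for the example.

```python
import math


def sigmoid(t):
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-t))


def decode_box(t_x, t_y, t_w, t_h, cell_x, cell_y, prior_w, prior_h):
    """Map raw network outputs to a box centre and size in grid-cell units."""
    b_x = cell_x + sigmoid(t_x)    # centre stays within its grid cell
    b_y = cell_y + sigmoid(t_y)
    b_w = prior_w * math.exp(t_w)  # size is a scaling of the anchor box
    b_h = prior_h * math.exp(t_h)
    return b_x, b_y, b_w, b_h


# Zero outputs place the centre in the middle of cell (3, 5) with the prior's size.
box = decode_box(0.0, 0.0, 0.0, 0.0, cell_x=3, cell_y=5, prior_w=2.0, prior_h=4.0)
# Even extreme raw outputs cannot push the centre outside the cell.
far = decode_box(50.0, -50.0, 0.0, 0.0, cell_x=3, cell_y=5, prior_w=2.0, prior_h=4.0)
```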
 
3.7 Add fine-grained features 
 
A passthrough layer is added to YOLOv2 to carry fine-
grained features from an earlier layer to the final
output layer. This passthrough layer uses an approach
similar to ResNet's identity mappings to bring
higher-resolution features from preceding layers forward.
This results in a 1% improvement in performance.
 
3.8 Multi-scale training 
 
Every 10 batches, a new input size is randomly picked
so that the model is trained to be robust to input images
of various sizes. The newly sampled size is a multiple of
32, since the convolution layers of YOLOv2 downsample
the input dimension by a factor of 32.
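A size schedule of this kind might look as follows. The sketch only generates the sequence of input sizes and does not train anything; the size range of 320 to 608 pixels and the starting size of 416 are assumptions borrowed from commonly reported YOLOv2 settings rather than values stated in this paper, and the function names are hypothetical.

```python
import random


def pick_input_size(rng, min_size=320, max_size=608, stride=32):
    """Sample a square input size that is a multiple of the network stride."""
    return rng.choice(range(min_size, max_size + 1, stride))


def size_schedule(num_batches, rng, interval=10, initial=416):
    """Input size used for each batch, resampled every `interval` batches."""
    sizes, current = [], initial
    for batch in range(num_batches):
        if batch > 0 and batch % interval == 0:
            current = pick_input_size(rng)
        sizes.append(current)
    return sizes


schedule = size_schedule(30, random.Random(0))
```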
 
3.9 Architecture of YOLOv2 model 
 
Between the input and output, the architecture shown
in Figure 3.1 has multiple hidden layers. The
convolution layer, ReLU, pooling layer, and fully
connected layer are all part of the hidden layers. Finally,
the softmax layer is used to produce the output
probabilities.
 
 
 
Figure 3.1: Architecture of YOLOv2 model 
 
 
3.10 Convolution layer  
 
The convolution layer is the basic building block of a
convolutional neural network. In deeper convolution
layers, the filters compute dot products on the outputs of
the preceding convolution layers, so they combine the
smaller learned features, such as edges, into larger
patterns. In general, the convolution layer is made up of
many filters that extract features from the input
image.
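The dot product performed by a single filter can be sketched as follows. This is a minimal, unoptimized example of a "valid" 2D convolution (cross-correlation, as is usual in CNNs) on nested lists; the function name and the hand-written vertical-edge kernel are illustrative, whereas in a trained network the kernel values are learned.

```python
def conv2d(image, kernel):
    """Valid 2D convolution (cross-correlation, as in CNNs) of nested lists."""
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = len(image) - k_h + 1
    out_w = len(image[0]) - k_w + 1
    return [
        [
            # Dot product between the kernel and the image patch at (i, j).
            sum(
                image[i + di][j + dj] * kernel[di][dj]
                for di in range(k_h)
                for dj in range(k_w)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


# The image has a dark left half and a bright right half.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
vertical_edge = [[-1, 1], [-1, 1]]
response = conv2d(image, vertical_edge)
```

The filter responds only at the column where the intensity changes, which is the sense in which a filter "extracts" an edge feature.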
3.11 ReLU Layer 
The rectified linear unit (ReLU) is not a separate stage
of a convolutional neural network but a supplementary
step applied after the convolution operation. The goal of
applying the rectifier function is to increase the
non-linearity of the feature maps. Convolution is a linear
operation, so the rectifier is used to break up the
linearity it imposes on the image. Figure 3.2 illustrates
how an image changes as it passes through the
convolution and rectification steps.
 
Figure 3.2: ReLU layer 
3.12 Pooling layer 
A CNN's pooling layer is another component. Its purpose 
is to gradually shrink the representation's spatial 
dimension in order to minimize the number of 
parameters and computations in the network. 
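The rectifier from the preceding subsection and the pooling step can both be written in a few lines. The sketch below assumes a single-channel feature map stored as nested lists and uses 2 x 2 max pooling with stride 2, a common choice that is assumed here rather than stated in the text.

```python
def relu(feature_map):
    """Element-wise max(0, x): negative activations are set to zero."""
    return [[max(0, v) for v in row] for row in feature_map]


def max_pool2d(feature_map, size=2, stride=2):
    """Max pooling over size x size windows; shrinks each spatial dimension."""
    out_h = (len(feature_map) - size) // stride + 1
    out_w = (len(feature_map[0]) - size) // stride + 1
    return [
        [
            max(
                feature_map[i * stride + di][j * stride + dj]
                for di in range(size)
                for dj in range(size)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


activations = relu([
    [-1, 2, -3, 4],
    [5, -6, 7, -8],
    [-9, 10, -11, 12],
    [13, -14, 15, -16],
])
pooled = max_pool2d(activations)  # 4x4 input becomes 2x2
```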
3.13 Fully connected layer  
In fully connected layers, each neuron in one layer is
connected to every neuron in the next layer. In principle,
it works in the same way as a standard multilayer
perceptron neural network. The flattened matrix is passed
through the fully connected layer to classify the image.
3.14 Softmax 
The softmax function in mathematics normalizes an 
unnormalized vector into a probability distribution. In 
neural networks, it's frequently used to translate non-
normalized output to a probability distribution across 
expected output classes. 
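A minimal sketch of the softmax function is shown below. Subtracting the maximum score before exponentiation is a standard numerical-stability trick and does not change the result; the function name and the example scores are illustrative.

```python
import math


def softmax(logits):
    """Normalize raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])
stable = softmax([1000.0, 1000.0])  # would overflow without the shift
```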
 
4 Results and discussion 
 
Python was used to implement the proposed work. The
model is trained with data from the PASCAL VOC and
COCO datasets. The convolution layer, pooling layer,
and activation layers such as ReLU and softmax are used
to build the model at first. All of the hidden layers are
employed to extract the features from the input image.
Finally, the fully connected layer is added to output the
prediction probability.
There are numerous models for object detection,
including Faster R-CNN, SSD, and YOLO. These
models are implemented and their performance is
evaluated in this work. The proposed YOLOv2 model is
built, and its performance is evaluated using
various input images.
The YOLOv2 model was designed to identify
pedestrians in a road environment. It has also been
extended to detect a variety of objects such as bicycles,
cars, buses, motorcycles, and trucks. Each detected
object is shown with its bounding box and prediction
probability.
 
4.1 YOLOv2 
 
Figure 4.1: Input and prediction output for the YOLOv2
model (panels: input, output)
 
 
Figure 4.2: Output for person detection in a video 
 
 
Figure 4.3: Output for multiple object detection in a 
video 
 
Figure 4.2 shows the YOLOv2 model detecting
only pedestrians in a video, whereas Figure 4.3
shows the model detecting multiple objects in the
input video, such as a bus, bicycle, car, and
person.
 
Table 1: Performance result of various models (probability of detection, %)

Model          Bus     Person  Bicycle  Car
Faster R-CNN   99.5    76.0    81.9     99.4
SSD            98.0    73.7    79.7     71.7
YOLO           100     96.3    93.1     99.8
YOLOv2         98.5    99.6    97.0     98.5
Table 1 shows the detection probability of each
model for the bus, person, bicycle, and car classes.
Figure 4.4 presents a comparison of the models.
 
Figure 4.4: Comparison study of probability of detection 
for the various models 
 
5 Conclusion 
 
Currently, many deep learning frameworks,
including TensorFlow, provide multiple versions of pre-
trained object detection models. The goal of this
work is to detect multiple objects in a stable environment.
Using YOLOv2, high accuracy in object detection and
tracking is achieved. YOLOv2 takes an efficient
approach, first predicting the image regions that contain
the relevant information and then classifying them using
a CNN. It looks at the image only once, which increases
the speed of object detection. The pre-trained object
detection model is used to detect the objects in the video.
The detection probability for the various object classes is
used to assess the detection model's performance. On the
Pascal VOC detection dataset, the YOLOv2 model yields
detection probabilities of 98.5 percent (bus), 99.6 percent
(person), 97 percent (bicycle), and 86.4 percent (car),
whereas competing systems, such as the enhanced
version of Faster R-CNN and SSD, obtain lower results.
 
References 
 
[1] Bhavana C. Bendale, et al., (2012). “Moving Object 
Tracking in Video Using MATLAB”, International 
Journal of Electronics Communication and Soft 
Computing Science and Engineering ISSN: 2277-
9477, Vol 2, Issue 1. 
[2] Gawande P.V, Lokhande S.V, (2018). “Improving 
efficiency of school bus routing using AI based on 
bio inspired computing: A Survey”, International 
Research Journal of Engineering and Technology 
(IRJET) e-ISSN: 2395-0056 ,p-ISSN: 2395-0072, 
Volume: 05 Issue: 03. 
[3] Gerónimo, D., et al., (2010). “Survey of Pedestrian 
Detection for Advanced Driver Assistance 
Systems”, IEEE Transactions on Pattern Analysis 
and Machine Intelligence, 
doi:10.1109/tpami.2009.122. 
[4] Huieun Kim, et al., (2016). “On-road object 
detection using Deep Neural Network”, IEEE 
International Conference on Consumer Electronics-
Asia (ICCE-Asia), DOI: 10.1109/ICCE-
Asia.2016.7804765. 
[5] Hui-Lee Ooi, et al., (2018). “Multiple Object
Tracking in Urban Traffic Scenes with a Multiclass
Object Detector”, published at the 13th International
Symposium on Visual Computing (ISVC), Cornell
University, arXiv:1809.02073 [cs.CV].
[6] J. Redmon and A. Farhadi. (2017). “YOLO9000: 
Better, Faster, Stronger” Proceedings of the IEEE 
Conference on Computer Vision and Pattern 
Recognition. 
[7] Marcus A. Brubaker, et al., (2010). “Video-Based
People Tracking”, Handbook of Ambient Intelligence
and Smart Environments, pp. 57-87.
[8] Mendes, et al., (2015). “Vehicle Tracking and 
Origin-Destination Counting System for Urban 
Environment” in proceedings of the International 
Conference on Computer Vision Theory and 
Applications. 
[9] Minsung Kang, Young-Chul Lim, (2017). “High 
Performance and Fast Object Detection in Road 
Environments”, Seventh International Conference 
on Image Processing Theory, Tools and 
Applications (IPTA), DOI:10.1109/IPTA.8310148. 
[10] M. Schreiber, et al., (2016). “Vehicle 
localization with tightly coupled GNSS and visual 
odometry”, in Proc. IEEE Intelligent Vehicles 
Symposium. 
[11] Naghavi, S. H., et al., (2017). “Integrated real-
time object detection for self-driving vehicles” 10th 
Iranian Conference on Machine Vision and Image 
Processing (MVIP). doi:10.1109/iranianmvip.2017. 
834234. 
[12] Nikhil Yadav, et al., (2017). “Comparative 
Study of Object Detection Algorithms”, 
International Research Journal of Engineering and 
Technology (IRJET), e-ISSN: 2395-0056, p-ISSN: 
2395-0072, Vol 4. 
[13] O. Russakovsky, et al., (2015). “ImageNet
Large Scale Visual Recognition Challenge”,
International Journal of Computer Vision, vol. 115,
no. 3, pp. 211–252.
[14] S. Ren, K. He, et al., (2015). “Faster R-CNN:
Towards Real-Time Object Detection with Region
Proposal Networks”, NIPS, pp. 1–10.
[15] Tang, C., et al., (2017). “The Object Detection 
Based on Deep Learning”, 4th International 
Conference on Information Science and Control 
Engineering (ICISCE), 
DOI:10.1109/icisce.2017.156. 
[16] Azhagu Jaisudhan Pazhani, et al., (2021).
“Object detection in satellite images by faster R-CNN
incorporated with enhanced ROI pooling (FrRNet-
ERoI) framework”, Earth Science Informatics,
https://doi.org/10.1007/s12145-021-00746-8.