Informatica 41 (2017) 87–97 87
Hidden-layer Ensemble Fusion of MLP Neural Networks for Pedestrian
Detection
Kyaw Kyaw Htike
School of Information Technology, UCSI University, Kuala Lumpur, Malaysia
E-mail: ali.kyaw@gmail.com
Keywords: pedestrian detection, neural networks, fusion, ensembles, multi-layer perceptron
Received: January 6, 2017
Being able to detect pedestrians is a crucial task for intelligent agents especially for autonomous vehicles,
robots navigating in cities, machine vision, automatic traffic control in smart cities, and public safety and
security. Various sophisticated pedestrian detection systems have been presented in literature and most
of the state-of-the-art systems have two main components: feature extraction and classification. Over the
past decade, the majority of the attention has been paid to feature extraction. In this paper, we show that
much can be gained from a high-performing classification algorithm: by changing only the classification
component of the detection pipeline while holding the feature extraction mechanism constant, we reduce
pedestrian detection error (in terms of log-average miss rate) by over 40%. To be specific,
we propose a novel algorithm for generating a compact and efficient ensemble of Multi-layer Perceptron
neural networks that is well-suited for pedestrian detection both in terms of detection accuracy and speed.
We demonstrate the efficacy of our proposed method by comparing with several state-of-the-art pedestrian
detection algorithms.
Summary (Povzetek): A new neural network algorithm is developed for detecting pedestrians in traffic, e.g. for
an autonomous vehicle.
1 Introduction
Pedestrian detection is an important problem in Artificial
Intelligence and computer vision. Many pedestrian detec-
tion systems have been proposed with different feature ex-
traction and classification methods. One of the earliest gen-
eral machine learning based pedestrian detection systems
was proposed by Papageorgiou and Poggio [1]. They use
Haar wavelet coefficients measuring (at multiple scales)
differences in intensity levels of pixels as the feature repre-
sentation, and a Support Vector Machine (SVM) with the
quadratic kernel as the classifier.
A year later, this motivated a seminal work on object de-
tection, namely the Viola-Jones face detector [2]. Viola and
Jones use the idea of integral images to speed up the extrac-
tion of rectangular Haar basis functions to use as features
which then serve as input to an AdaBoost classifier. The
classifier, which is applied to an image in a sliding window
fashion, is specifically designed to be an attentional
cascade so that most non-face examples can be rejected
with relatively few feature extraction steps. This results
in a fast face detector. Although Haar basis functions are
well-suited for detecting frontal faces, it was not clear
whether they are also good for detecting other categories of
objects. Indeed, one of the critical cues human beings use to
classify or detect objects is shape information (for which
the building blocks are image gradients or edges); the set
of Haar basis functions does not exploit this. In fact, soon
after [2] came out, features that make effective use of image
gradient information were proposed in [3].
Leibe et al. [4] propose an Implicit Shape Model which
learns, for each cluster of local image patches, a vote dis-
tribution on the centroid of the object class. At test time,
the Generalized Hough Transform [5] is used to detect objects;
to be specific, each local image patch votes with the learnt
distributions for the centroid in the Hough voting space and
then local maxima (or peaks) in the voting space corre-
spond to object centroids in the test image. For each of
these local maxima, local image patches that contributed
(i.e. voted for) are identified through a process known as
backprojection. From this, a rough segmentation (and con-
sequently, bounding boxes) of the objects can be obtained.
Their method requires segmentation of the training images
which can be labor-intensive and costly, and identifying the
local maxima and performing backprojection can be highly
sensitive to many types of noise. Moreover, due to the use
of image patches, the system is not robust to illumination
and various image geometric transformations.
In 2005, a groundbreaking work on object detection was
published by Dalal and Triggs [3] who propose Histograms
of Oriented Gradients (HOG) as features for pedestrian de-
tection. In HOG, a histogram of image gradients is con-
structed in each small local region (termed as a cell) in
an image and these histograms are concatenated to form
a high-dimensional feature vector. They carried out a large
number of experiments to explore and evaluate the design
space of the HOG feature extraction process. This includes
highlighting the importance of local contrast normaliza-
tion whereby each histogram corresponding to a cell is re-
dundantly normalized with respect to other nearby cells. It
was shown that with HOG features, a linear SVM was suf-
ficient to obtain state-of-the-art results and outperformed
several techniques such as generalized Haar wavelets [1],
PCA-SIFT [6] and Shape Contexts [7].
Moreover, HOG is still highly influential to this day; the
majority of current state-of-the-art research is based on ei-
ther variations of HOG or building upon ideas outlined in
HOG [8, 9, 10, 11]. For instance, the well-known part-based
object detection system, Deformable Part Models
(DPM) [12], uses HOG features as building blocks. Many
systems have extended DPMs (e.g. [13, 14, 15]). Schwartz
et al. [16] use Partial Least Squares to reduce high dimen-
sional feature vectors that combine edge, texture and color
cues, and Quadratic Discriminant Analysis as the classifier.
Dollar et al. [17] propose Integral Channel Features
(ICF) that can be considered as a generalization of HOG
by including not only gradient when computing local his-
tograms, but also other “channels” of information such as
sums of grayscale and color pixel values, sums of outputs
obtained by convolving the image with linear filters (such
as Difference of Gaussian filters) and sums of outputs of
nonlinear transformation of the image (e.g. gradient mag-
nitude). An attempt was made in [18] to speed up features
such as ICF by approximating some of the scales of the fea-
ture pyramid by interpolating from corresponding nearby
scales during multi-scale sliding window object detection.
Benenson et al. [19] propose an alternative to speed up
object detection by training a separate model for each of
the nearby scales of the image pyramid. Although this in-
creases the object detection speed at test time, it also con-
siderably lengthens the training time.
Benenson et al. [20] extend HOG [3] by automatically
selecting the HOG cell sizes and locations using boosting.
Their approach is very expensive to train. However, their
work shows that the plain HOG is powerful and with the
right sizes and placements of HOG cells, with a single rigid
component, it is possible to outperform various feature ex-
traction methods researchers have proposed over the years
after the invention of HOG.
For these reasons, in this work, we adopt HOG [3] as the
feature extraction method. Furthermore, we focus on the
classification component of the object detection pipeline.
In order to isolate the performance of the classifier, we hold
the feature extraction mechanism constant for all the
experiments of our proposed method. For the classification
component of the pedestrian detection pipeline, we pro-
pose a powerful and efficient non-linear classifier ensemble
that significantly increases the pedestrian detection perfor-
mance (i.e. accuracy) while at the same time reducing the
computational complexity at test time which is especially
important for a task such as pedestrian detection. Although
the proposed algorithm could also be potentially applied to
other object detection tasks and general machine learning
classification problems, we focus on pedestrian detection
in this paper.
To the best of our knowledge, neural networks, or more
accurately, Multi-layer Perceptrons (MLPs) have not been
used for pedestrian detection. This observation is also true
for ensembles of MLPs (EoMLPs): there has not been
any major work using EoMLPs for object detection, let
alone pedestrian detection. Although there can be many
reasons behind this dearth of application of EoMLPs in
pedestrian detection, one possible reason could be due to
the highly computationally expensive nature of EoMLPs at
test time and due to the popularity of linear Support Vector
Machines (SVMs) and other regularized linear classifiers.
We show in this paper that given the same feature extraction
mechanism, using our proposed non-linear classifier gives
much better pedestrian detection performance than the tra-
ditional classifiers commonly used for pedestrian detection.
Works such as [21], which apply EoMLPs to real-world
problems, exist (although they are quite rare); however, they
are for general pattern tasks rather than pedestrian
detection or even object detection. Literature on EoMLPs
has been published in the machine learning and Artificial
Intelligence community; a few well-known examples are
[22, 23, 24, 25]. They demonstrate the effectiveness of
EoMLPs compared to individual MLPs on some standard
statistical benchmarks; none of them has been applied to
any object detection or even computer vision problems.
Furthermore, although there are many different ensem-
ble methods of numerous types of classifiers such as bag-
ging [26], boosting [27], Random subspace [28], Random
Forest [29] and Rulefit [30], the focus of this paper is on
EoMLPs and pedestrian detection. Moreover, our proposed
algorithm differs from [22, 23, 24, 25] in numerous ways,
including being suitable for pedestrian detection
problems, and our method outperforms state-of-the-art
EoMLPs, as will be shown in the experimental results in
Section 4.
We term our novel algorithm Hidden-layer Ensemble
Fusion of Multi-layer Perceptrons (HLEF-MLP).
2 Contribution
The contribution that we make in this paper is five-fold:
1. We propose a novel way, for the purpose of pedestrian
detection, of training multiple individual MLPs in an
ensemble and fusing the members of the ensemble at
the hidden-feature level rather than at the output score
level as done in existing EoMLPs [22, 23, 24, 25].
This has the benefit of being able to correct the mis-
takes of the ensemble in a major way since the final
classifier still has access to hidden level features rather
than only output scores or classification labels.
2. We use L1-regularization in a stacking fashion to
jointly and efficiently select individual hidden feature
units of all the members in the ensemble which has the
effect of fusing the members of the ensemble. This re-
sults in only a few active projection units at test time,
which can be implemented as fast matrix projections
on modern hardware for efficient pedestrian (or ob-
ject) detection.
3. In HLEF-MLP, the decisions of the members are com-
bined or fused in a systematic way using the given su-
pervision labels such that the combination is optimal
for the classification task. This is in contrast to the ad hoc,
class-agnostic nature of most fusion schemes (which
tend to use techniques such as averaging).
4. We show that given the same feature extraction mech-
anism, HLEF-MLP gives much better pedestrian de-
tection performance than the state-of-the-art classi-
fiers commonly used for pedestrian detection.
5. HLEF-MLP is stable and robust to initialization con-
ditions and local optima of individual member MLPs
in the ensemble. Therefore, we do not need to be so
careful about training each MLP for two main reasons:
firstly, we are not relying solely on one MLP, and sec-
ondly, we are able to, during fusion, correct mistakes
of the individual MLPs at the hidden features level as
mentioned previously. In fact, each MLP falling in its
own local optima should be seen as achieving diver-
sity in the ensemble which is a desirable goal that can
be obtained for “free”. At the L1-regularized fusion
step, the optimization function is convex, therefore it
is guaranteed to achieve the global optimum.
3 Method
3.1 Overview
Let $D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$ be the labelled
training dataset, where $N$ is the number of training
data points, $x_i \in \mathbb{R}^k$ is the k-dimensional feature vector
corresponding to the i-th training data point, and $y_i \in \{1, 0\}$ is
the supervision label associated with $x_i$.
HLEF-MLP consists of two stages. D is split into $D = \{D_1, D_2\}$, where
$$D_1 = \{(x_1, y_1), \ldots, (x_{N/2}, y_{N/2})\}$$
and
$$D_2 = \{(x_{N/2+1}, y_{N/2+1}), \ldots, (x_N, y_N)\}$$
In the first stage which we call “ensemble learning”, we
train each individual MLP on D1 and in the second stage
termed “sparse fusion optimization”, we use D2 to auto-
matically learn how to fuse the decisions of the members
of the ensemble. We now describe each stage.
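The split described above can be sketched in a few lines. This is a minimal illustration, assuming the dataset is held as a Python list of (feature vector, label) pairs; the function name `split_dataset` and the toy data are hypothetical, and whether the data are shuffled before splitting is not stated in the text.

```python
def split_dataset(data):
    """Split a list of (x, y) pairs into the first half D1 and second half D2."""
    half = len(data) // 2
    return data[:half], data[half:]

# Toy data: six labelled points with 2-dimensional features (illustrative only).
D = [([0.1, 0.2], 1), ([0.3, 0.1], 0), ([0.5, 0.9], 1),
     ([0.7, 0.4], 0), ([0.2, 0.8], 1), ([0.9, 0.6], 0)]

D1, D2 = split_dataset(D)   # D1 trains the MLPs, D2 trains the fusion step
```

Keeping D2 disjoint from D1 means the fusion weights are learnt on data that the individual MLPs have not seen during their own training.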
3.2 Ensemble learning
The inputs to the first stage are D1 (obtained as de-
scribed in Section 3.1) and the number of members
M in the ensemble. The ensemble can be written as
[f1(x), f2(x), . . . , fM (x)] where each member fj(x) of
the ensemble can be formulated as:
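$$f_j(\mathbf{x}) = \operatorname{logsig}(\mathbf{w}_j \tanh(\mathbf{W}_j \mathbf{x} + \mathbf{b}_j) + b_j) \tag{1}$$
where $\mathbf{x}$ is the input feature vector, and $\mathbf{W}_j$ and $\mathbf{b}_j$ are the
matrix and vector respectively corresponding to an affine
transformation of $\mathbf{x}$; $\mathbf{w}_j$ is the output weight vector and the scalar $b_j$ is the output bias.
Each column of $\mathbf{W}_j$ corresponds to the
weights of the incoming connections to a hidden neuron. Therefore,
the number of hidden neurons in $f_j(\mathbf{x})$ is equal to the
number of columns in $\mathbf{W}_j$, and the number of rows of $\mathbf{W}_j$
is k (recall that $\mathbf{x} \in \mathbb{R}^k$). Therefore, $\mathbf{W}_j \in \mathbb{R}^{k \times h}$, where h
is the number of neurons in the hidden layer.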
The architecture of this first stage of the ensemble is il-
lustrated in Figure 1. In the figure, each ensemble member
(i.e. a MLP neural network) is shown as a vertical chain of
blocks of mathematical operations for a total of M chains.
The j-th ensemble member (chain) corresponds to the function fj
as defined in Equation 1 and it can be seen that the block
chain diagram follows exactly the sequence of operations
defined in Equation 1.
For example, for the first ensemble member f1, the in-
put data vector x is first matrix-multiplied with W1 (first
block), and the resulting vector is then added with the vec-
tor b1 in the second block. The function tanh(·) is then
applied (third block). This is followed by matrix multipli-
cation of the output of the previous block with w1 in the
fourth block. In the last block, a scalar addition with b1 is
performed.
The functions tanh(·) and logsig(·) (visualized in Fig-
ure 2) are non-linear activation functions (applied after re-
spective affine transformations) independently acting on
each dimension of the vector obtained after the affine trans-
formation and are defined as follows:
$$\tanh(a) = \frac{e^{a} - e^{-a}}{e^{a} + e^{-a}} \tag{2}$$
$$\operatorname{logsig}(a) = \frac{e^{a}}{1 + e^{a}} \tag{3}$$
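To make Equations 1 to 3 concrete, the following is a minimal sketch of the forward pass of a single ensemble member in plain Python. It assumes that W is stored as k rows by h columns (one column per hidden neuron, as described above) and that x is multiplied against those columns; the helper names and the toy weights are hypothetical, and a practical implementation would use an optimized matrix library.

```python
import math

def logsig(a):
    """Logistic sigmoid e^a / (1 + e^a), arranged to avoid overflow."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)

def hidden(x, W, b):
    """tanh of the affine map of x; W has k rows and h columns."""
    k, h = len(W), len(b)
    return [math.tanh(sum(x[r] * W[r][c] for r in range(k)) + b[c])
            for c in range(h)]

def mlp_forward(x, W, b, w, b_out):
    """Equation 1: logsig(w . tanh(W x + b) + b_out)."""
    z = hidden(x, W, b)
    return logsig(sum(wc * zc for wc, zc in zip(w, z)) + b_out)

# Toy member with k = 2 inputs and h = 3 hidden neurons (illustrative values).
W = [[0.5, -0.2, 0.1],
     [0.3, 0.8, -0.5]]
b = [0.0, 0.1, -0.1]
w = [1.0, -1.0, 0.5]
score = mlp_forward([1.0, 2.0], W, b, w, 0.0)   # a value strictly between 0 and 1
```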
The latent parameters of the members of the first stage
of the ensemble need to be learnt from the training data D1
(which is a set of pairs of input feature vectors and output
supervision labels) and this training (i.e. learning) process
is depicted in Figure 3.
We set all Wj to be the same size (i.e. each MLP fj(x)
in the ensemble has the same number of hidden neurons).
For each member MLP fj(x) in the ensemble, the follow-
ing loss function can be constructed:
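$$
\begin{aligned}
L(D_1, \mathbf{W}_j, \mathbf{b}_j, \mathbf{w}_j, b_j) = -\frac{2}{N} \sum_{i=1}^{N/2} \Big[ & y_i \log \operatorname{logsig}(\mathbf{w}_j \tanh(\mathbf{W}_j \mathbf{x}_i + \mathbf{b}_j) + b_j) \\
& + (1 - y_i) \log\big(1 - \operatorname{logsig}(\mathbf{w}_j \tanh(\mathbf{W}_j \mathbf{x}_i + \mathbf{b}_j) + b_j)\big) \Big]
\end{aligned}
\tag{4}
$$
where $\{\mathbf{x}_i, y_i\}_{i=1}^{N/2}$ are pairs of input feature vectors and
output supervision labels from D1.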
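As an illustration of the per-member loss, the sketch below evaluates it from already-computed network outputs. It reads Equation 4 as the standard binary cross-entropy averaged over the N/2 points of D1, which is an interpretation on our part; the function name `cross_entropy` and the clipping constant `eps` are hypothetical additions for numerical safety.

```python
import math

def cross_entropy(probs, labels, eps=1e-12):
    """Mean binary cross-entropy between outputs in (0, 1) and labels in {0, 1}."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)          # keep log() finite
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(labels)                  # len(labels) plays the role of N/2

labels = [1, 0, 1, 0]
good = cross_entropy([0.9, 0.1, 0.8, 0.2], labels)   # confident and correct
bad = cross_entropy([0.4, 0.6, 0.3, 0.7], labels)    # mostly wrong
```

A member whose outputs agree with the labels receives a lower loss, which is what the optimizer drives towards during the ensemble learning stage.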
Figure 1: Architecture of the first stage of the ensemble.
Figure 2: Non-linear activation functions; on the left is the
curve for tanh(a) and on the right is logsig(a), where a is
the input signal to the activation function. [Both panels plot a from -10 to 10 on the x-axis; tanh(a) ranges from -1 to 1 and logsig(a) from 0 to 1 on the y-axis.]
The loss function given in Equation 4 is smooth and
differentiable, and we optimize it using L-BFGS [31],
since it is an efficient quasi-Newton method that does
not require setting sensitive hyperparameters such as
learning rate, momentum and mini-batch size. After the
optimization, all the weights of each network:
$$\{\mathbf{W}_1, \mathbf{W}_2, \ldots, \mathbf{W}_M, \mathbf{b}_1, \mathbf{b}_2, \ldots, \mathbf{b}_M, \mathbf{w}_1, \ldots, \mathbf{w}_M, b_1, b_2, \ldots, b_M\} \tag{5}$$
are obtained.
3.3 Sparse fusion optimization
In the second stage which is sparse fusion optimization,
function gj(x) is used to project each data point x as given
below:
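$$g_j(\mathbf{x}) = \tanh(\mathbf{W}_j \mathbf{x} + \mathbf{b}_j) \tag{6}$$
Each data point x ∈ D2 is projected using $\{g_j(\mathbf{x})\}_{j=1}^{M}$.
That is, a new training dataset D̂2 is constructed as follows:
$$
\hat{D}_2 =
\begin{bmatrix}
g_1(\mathbf{x}_{N/2+1}) & g_2(\mathbf{x}_{N/2+1}) & \cdots & g_M(\mathbf{x}_{N/2+1}) \\
g_1(\mathbf{x}_{N/2+2}) & g_2(\mathbf{x}_{N/2+2}) & \cdots & g_M(\mathbf{x}_{N/2+2}) \\
\vdots & \vdots & \ddots & \vdots \\
g_1(\mathbf{x}_N) & g_2(\mathbf{x}_N) & \cdots & g_M(\mathbf{x}_N)
\end{bmatrix}
\tag{7}
$$
where each row corresponds to one new data point in D̂2.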
The architecture of this second stage of the ensemble is
depicted in Figure 4.
It is important to note that the projection is done using
the function g(·), and not f(·). In other words, this can
be interpreted as projecting to the hidden layers of the in-
dividual MLP neural networks (which are members of the
ensemble) and then learning to fuse in this hidden layer,
hence the name of our proposed algorithm being Hidden-
layer Ensemble Fusion of Multi-layer Perceptrons (HLEF-
MLP).
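The construction of the projected dataset can be sketched as follows: each point of D2 is passed through the hidden layer of every member, and the resulting activations are concatenated into one row. This is a minimal illustration in plain Python, assuming each member is represented by a hypothetical (W, b) pair with W stored as k rows by h columns; the names `hidden`, `project` and the toy weights are not from the paper.

```python
import math

def hidden(x, W, b):
    """g_j(x) = tanh(W x + b); W has k rows and h columns."""
    return [math.tanh(sum(x[r] * W[r][c] for r in range(len(W))) + b[c])
            for c in range(len(b))]

def project(x, members):
    """Concatenate the hidden activations of all members into one row."""
    row = []
    for W, b in members:
        row.extend(hidden(x, W, b))
    return row

# Two toy members (M = 2), each with k = 2 inputs and h = 2 hidden neurons.
members = [
    ([[0.5, -0.3], [0.2, 0.7]], [0.0, 0.1]),
    ([[-0.4, 0.6], [0.9, -0.1]], [0.2, -0.2]),
]
X2 = [[1.0, 0.5], [0.0, -1.0], [0.3, 0.3]]      # feature vectors from D2
D2_hat = [project(x, members) for x in X2]       # one row per point, M*h columns
```

Each row has M times h entries, so the fusion classifier receives every hidden unit of every member as a separate input feature.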
This improves over the traditional (state-of-the-art) ensemble
techniques in terms of both generating a much more
compact ensemble (making it applicable to
very time-critical applications such as pedestrian detection)
and the final pedestrian detection performance (measured
by log-average miss rate).
After constructing the projected dataset D̂2, an L1-regularized
logistic regression is then trained on D̂2 by optimizing:
Figure 3: Independent ensemble member learning process.
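$$\mathbf{m}_{\text{trained}} = \arg\min_{\mathbf{m}} \sum_{i=N/2+1}^{N} f_L(\mathbf{m}, [g_1(\mathbf{x}_i), g_2(\mathbf{x}_i), \ldots, g_M(\mathbf{x}_i)], y_i) + \beta f_R(\mathbf{m}) \tag{8}$$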
where $\mathbf{m}_{\text{trained}} \in \mathbb{R}^{Mh}$ is the trained classifier and is in fact
a vector of weights, one per hidden unit across all M members. Moreover, fL is the loss function, fR
is the regularization term that encourages m to take small values
(hence, in a way, favoring simpler models), and β balances
the regularization term and the loss term. A higher value of β
would result in a smoother solution (and a looser fit to the
training data D̂2), whereas a lower β would result in lower training
error. The loss function is given by:
$$f_L(\mathbf{m}, [g_1(\mathbf{x}_i), \ldots, g_M(\mathbf{x}_i)], y_i) = \log\big(1 + \exp(-y_i \mathbf{m}^T [g_1(\mathbf{x}_i), \ldots, g_M(\mathbf{x}_i)])\big) \tag{9}$$
In order to encourage sparse solutions, we use L1-
regularization for which the regularization term is given by:
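$$f_R(\mathbf{m}) = \lVert \mathbf{m} \rVert_1 = [1, \ldots, 1]^T \lvert \mathbf{m} \rvert \tag{10}$$
where $\lvert \mathbf{m} \rvert$ denotes the element-wise absolute value of m.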
This effectively sets many of the components of the weight
vector m to zero, resulting in a very compact ensemble
(speeding up the pedestrian detection), while at the same
time, retaining or even improving the ensemble perfor-
mance for pedestrian detection. Figure 5 illustrates the
learning process for the sparse fusion of the members in
the ensemble at the hidden layer.
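The sparsifying effect of the L1 term can be illustrated with a small solver. The sketch below minimizes the objective of Equation 8 by proximal gradient descent (soft-thresholding); this is a stand-in technique chosen for brevity, not necessarily the solver used in the paper. It also assumes that the {0, 1} labels are mapped to {-1, +1} so that the logistic loss of Equation 9 applies directly. The function names, step size, iteration count and toy data are all hypothetical.

```python
import math

def sigmoid(a):
    """Numerically stable logistic function."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)

def soft_threshold(v, t):
    """Proximal operator of t * |v|: shrink v towards zero by t."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0

def fit_l1_logistic(Z, y, beta, step=0.1, iters=500):
    """Minimize sum_i log(1 + exp(-s_i m.z_i)) + beta * ||m||_1."""
    d = len(Z[0])
    m = [0.0] * d
    for _ in range(iters):
        grad = [0.0] * d
        for z, yi in zip(Z, y):
            s = 2 * yi - 1                              # map {0, 1} to {-1, +1}
            margin = s * sum(mj * zj for mj, zj in zip(m, z))
            coeff = -s * sigmoid(-margin)               # derivative of the logistic loss
            for j in range(d):
                grad[j] += coeff * z[j]
        # Gradient step on the smooth loss, then shrinkage for the L1 term.
        m = [soft_threshold(mj - step * gj, step * beta)
             for mj, gj in zip(m, grad)]
    return m

# Toy projected data: column 0 separates the classes; columns 1 and 2 are
# constructed to carry no label information.
Z = [[0.8, 0.5, 0.2], [0.8, -0.5, 0.2], [0.8, 0.3, -0.2], [0.8, -0.3, -0.2],
     [-0.8, 0.5, 0.2], [-0.8, -0.5, 0.2], [-0.8, 0.3, -0.2], [-0.8, -0.3, -0.2]]
y = [1, 1, 1, 1, 0, 0, 0, 0]
m = fit_l1_logistic(Z, y, beta=1.0)
```

On this toy problem only the informative column keeps a non-zero weight; the two uninformative columns stay exactly at zero, which is the behaviour that allows the corresponding hidden units to be dropped at test time.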
3.4 Prediction at test time
As illustrated in Figure 6, at test time, given a test feature
vector x, the prediction score spred is obtained by:
$$s_{\text{pred}} = \frac{1}{1 + \exp\big(-(\mathbf{m}_{\text{trained}})^T [g_1(\mathbf{x}), \ldots, g_M(\mathbf{x})]\big)} \tag{11}$$
where {g1, g2, . . . , gM} are the hidden-layer-projection
functions (parts of the members of the ensemble) whose
latent parameters have been obtained by the training pro-
cess described in Section 3.2 and mtrained is the set of latent
sparse weights that have been obtained as explained in Sec-
tion 3.3.
Equation 11 can also be equivalently written as:
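$$s_{\text{pred}} = \frac{1}{1 + \exp(a)}, \quad \text{where } a = -(\mathbf{m}_{\text{trained}})^T [\tanh(\mathbf{W}_1 \mathbf{x} + \mathbf{b}_1), \ldots, \tanh(\mathbf{W}_M \mathbf{x} + \mathbf{b}_M)] \tag{12}$$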
Since mtrained is a sparse vector (i.e. a vector where
the majority of the elements are zeros), at test time, en-
tire columns of the matrices {W1, . . . ,WM} correspond-
ing to the positions of the zero vector elements in mtrained
can be omitted. This greatly speeds up the pedestrian
detection process, while improving the performance (log-
average miss rate), due to the sparse fusion technique at the
hidden layers as described in Sections 3.2 and 3.3.
Figure 4: Architecture of the second stage of the ensemble (fusion).
4 Results and discussion
4.1 Dataset
We use the INRIA Person dataset [3] for evaluating our al-
gorithms. In the dataset, for training, there are 614 images
containing pedestrians for which ground truth is available;
after cropping and flipping each pedestrian, the total number
of pedestrians comes to 2474. There are also 1218
large images that do not contain any pedestrians; from these,
data corresponding to non-pedestrian class can be gener-
ated by a combination of initial sampling and hard negative
mining as is common in object detection literature [32, 9].
4.2 Ensemble hyper-parameters
We set the size of the ensemble M (see Section 3.2) to 100
and the number of hidden neurons h in each hidden layer
to 10 for all the experiments involving both our proposed
method and baselines on various ensemble variations.
4.3 Experiment setup
We perform seven different experiments in order to com-
pare our proposed algorithm with state-of-the-art pedes-
trian detection systems. The experiments are described be-
low.
4.3.1 VJ
The Viola-Jones detector of [2] applied to pedestrian detec-
tion.
4.3.2 HOG
Pedestrian detector of the seminal work of Dalal and
Triggs [3], using Histogram of Oriented Gradients (HOG)
for feature extraction and linear SVM as the classifier.
4.3.3 HikSvm
The system proposed by Maji et al. [33] who use a variation
of HOG (concatenation of histograms of oriented gradients
for a few different cell sizes) and SVM with an approxima-
tion of the intersection kernel.
4.3.4 Pls
The pedestrian detection system proposed by Schwartz et
al. [16] using Partial Least Squares (PLS) analysis to re-
duce the dimensions of very high dimensional features ob-
tained by concatenating different sets of features obtained
from edge, texture and color information derived from the
original image. Then Quadratic Discriminant Analysis is
used as the classifier.
4.3.5 OursMLPEnsHidFuse
Our proposed approach in this paper as described in Sec-
tion 3. At test time, due to L1-regularized fusion at the hid-
den layer level, we can expect that the resulting ensemble
will be very small, which will be suitable for time-critical
applications such as pedestrian detection.
Figure 5: Learning to fuse the members in the ensemble.
Figure 6: Using the fused ensemble at test time.
4.3.6 OursMLPEnsScoreFuse
A variation of our proposed method
OursMLPEnsHidFuse; instead of fusing at the
level of hidden features, L1-regularized fusion takes
place at the score level. This is also somewhat similar to
classifier stacking [34] in the literature; however, most stacking
ensembles in the literature use simple unregularized
classifiers, whereas OursMLPEnsScoreFuse is able to
select the most efficient members of the ensemble jointly
using supervised label information.
In terms of efficiency at test time, we can anticipate it
to be worse than OursMLPEnsHidFuse because the
supervised L1-regularized selection is performed at the
score level, and at test time there is a need to perform multiple
matrix projections (one for each member of the ensemble)
and then combine them.
4.3.7 OursMLPEns
This is our implementation of the state-of-the-art MLP
fusion by bagging. Here, unlike OursMLPEnsHidFuse
and OursMLPEnsScoreFuse, there is no separation of the
training dataset D into D1 and D2; independent training of the
members of the ensemble takes place on the entire training
data D. Moreover, there is no sparse fusion optimization.
We can expect that compared to OursMLPEns-
HidFuse and OursMLPEnsScoreFuse, this will have
the worst efficiency at test time because the entire ensemble
needs to be evaluated which is highly expensive.
4.4 Evaluation criteria
To evaluate the pedestrian detections from the aforementioned
experiments, first there needs to be a measure of
what a correct pedestrian detection is. Although this can be
defined in many ways, we use the PASCAL 50% overlap
criterion [35] which is the most widely-used one in state-
of-the-art detection literature. Let a detected bounding box
be rd and the corresponding groundtruth bounding box be
rg . In PASCAL 50% overlap criterion, in order to establish
whether rd is a correct detection, the overlap ratio, αo, de-
fined in Equation 13 is computed and then the detection rd
is deemed to be correct if αo > 0.5.
αo =
area(rg ∩ rd)
area(rg ∪ rd)
(13)
The αo can be understood as the ratio of the intersection
of the detected bounding box with ground truth bounding
box and the union of the detected box with ground truth
box.
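The overlap criterion of Equation 13 is straightforward to compute for axis-aligned boxes. The sketch below assumes boxes are given as (x1, y1, x2, y2) corner coordinates with x1 < x2 and y1 < y2; this representation and the function names are assumptions made for illustration.

```python
def area(r):
    """Area of a box given as (x1, y1, x2, y2)."""
    return (r[2] - r[0]) * (r[3] - r[1])

def overlap_ratio(rg, rd):
    """Intersection area divided by union area (Equation 13)."""
    iw = min(rg[2], rd[2]) - max(rg[0], rd[0])
    ih = min(rg[3], rd[3]) - max(rg[1], rd[1])
    inter = max(iw, 0.0) * max(ih, 0.0)      # zero when the boxes do not overlap
    union = area(rg) + area(rd) - inter
    return inter / union

def is_correct_detection(rg, rd, threshold=0.5):
    """PASCAL criterion: a detection is correct if the overlap ratio exceeds 0.5."""
    return overlap_ratio(rg, rd) > threshold

ground_truth = (0, 0, 10, 10)
close = is_correct_detection(ground_truth, (2, 0, 12, 10))    # overlap 80/120
loose = is_correct_detection(ground_truth, (5, 0, 15, 10))    # overlap 50/150
```

A detection shifted by a fifth of the box width still passes the criterion, whereas one shifted by half the width does not.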
Graphs are an important tool to visualize the perfor-
mance of pedestrian detectors since they show the perfor-
mance at various levels of detection thresholds. To this end,
we plot curves where miss rate is on the y-axis and false
positives per image (FPPI) is on the x-axis.
Plotting miss rate against FPPI has been shown empirically in state-of-the-art
pedestrian detection (e.g. by Benenson et al. [9]) to be
a more accurate way of measuring the detection performance
than the previously used false positives per window
(FPPW).
Figure 7: ROC curves for all experiments.
Figure 8: Comparison of log-average miss rate (lower is
better). [Bar chart; x-axis: Experiment (VJ, HOG, HikSvm, Pls, OursMLPEnsScoreFuse, OursMLPEnsHidFuse, OursMLPEns); y-axis: log-average miss rate, 0 to 80.]
In a miss rate versus FPPI plot, lower curves denote
higher detection performance. Moreover, to summarize the
performance of a detector with a single value, log-average
miss rate is calculated which is approximately equal to the
area under the miss rate versus FPPI curve.
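The text does not give a formula for the log-average miss rate. A minimal sketch follows, assuming the convention commonly used in pedestrian detection benchmarks: the miss rate is sampled at nine FPPI values evenly spaced in log-space between 10^-2 and 10^0, and the samples are averaged in the log domain (a geometric mean). Whether the paper uses exactly these reference points is an assumption, as are the function name and the fallback used when the curve starts above a reference point.

```python
import math

def log_average_miss_rate(fppi, miss_rate, n_points=9):
    """Geometric mean of the miss rate at log-spaced FPPI reference points.

    fppi must be sorted in ascending order, with miss_rate aligned to it.
    """
    refs = [10 ** (-2 + 2 * i / (n_points - 1)) for i in range(n_points)]
    logs = []
    for ref in refs:
        # Miss rate at the last curve point whose FPPI does not exceed ref;
        # fall back to the first point if the curve starts above ref.
        mr = miss_rate[0]
        for f, r in zip(fppi, miss_rate):
            if f <= ref:
                mr = r
        logs.append(math.log(max(mr, 1e-10)))    # guard against log(0)
    return math.exp(sum(logs) / n_points)

# Two toy curves sampled at the same FPPI values (illustrative numbers only).
fppi = [0.01, 0.1, 1.0]
weaker = log_average_miss_rate(fppi, [0.8, 0.5, 0.2])
stronger = log_average_miss_rate(fppi, [0.6, 0.3, 0.1])
```

Averaging in the log domain gives equal weight to each decade of FPPI, so a detector is not rewarded only for its behaviour at high false-positive rates.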
4.5 Results
The miss rate versus FPPI plots for all the experiments are
shown in Figure 7. In the figure, the curves are ordered in
terms of log-average miss rate (LAMR) from worst to best.
To better focus on the summarized performance, we also
plot the LAMRs of each experiment in terms of a bar graph
as illustrated in Figure 8. Finally, we depict in Figure 9
the comparison of average detection time taken by different
algorithms for a 640×480 image. We also show in Table 1
the raw detection time values for easier comparison.
Figure 9: Comparison of average detection time for a 640×480
image. [Bar chart; x-axis: Experiment (VJ, HOG, HikSvm, Pls, OursMLPEnsScoreFuse, OursMLPEnsHidFuse, OursMLPEns); y-axis: mean time taken (secs), 0 to 100.]
Experiment             Mean time (secs)
VJ                     2.23
HOG                    4.18
HikSvm                 5.41
Pls                    55.56
OursMLPEnsScoreFuse    32.72
OursMLPEnsHidFuse      8.49
OursMLPEns             92.45
Table 1: Comparison in terms of detection time.
From Figures 7, 8 and 9, the following observations can
be made:
– VJ has the worst performance among all. This is to be
expected since the Haar features that it uses do not
capture the object shape information well. Moreover,
the seminal work HOG greatly improves over VJ for
generic object detection.
– Although HikSvm requires complex modification in
the feature extraction and the classifier compared to
HOG, it only results in a modest detection performance
gain (i.e. decrease in LAMR from 46% to 43%). Sim-
ilar observation can be made for Pls although the im-
provement is relatively better.
– Our proposed algorithm OursMLPEnsHidFuse has
the best performance among them, tied for first
place with OursMLPEns. Given that OursMLPEnsHidFuse
uses the standard HOG features,
we can see that just by using our proposed algorithm,
the LAMR goes down from 46% to 27%. This is a
significant improvement (corresponding to over 40%
reduction in LAMR). In comparison, HikSvm, despite
proposing both new feature extraction and classification
algorithms, only managed a reduction in LAMR of
less than 10% (from 46% to 43%).
– OursMLPEnsHidFuse has slightly higher detection
performance than OursMLPEnsScoreFuse,
whereas in terms of speed at test time, OursMLPEnsHidFuse
is almost 4 times faster. This shows
the efficacy of our proposed algorithm OursMLPEnsHidFuse
and indicates that fusion at the level
of the hidden layers results not only in better detection
performance but also in very compact matrix projections
(and hence much faster speed), due to the L1-regularized
sparse fusion learning process having access
to the hidden layer features.
– Despite OursMLPEnsHidFuse and OursMLPEns
being tied at the first place in terms of detection per-
formance, OursMLPEnsHidFuse is significantly
(over 10 times) faster than OursMLPEns. This shows
that our proposed OursMLPEnsHidFuse has the
same detection accuracy as the very expensive state-
of-the-art MLP bagging which is not practical for de-
tection purposes. In other words, the novel method
that we have presented in this paper OursMLPEns-
HidFuse gives the same (top) performance as a MLP
bagging-ensemble but with 10 times faster speed for
pedestrian detection.
– OursMLPEnsScoreFuse is faster than
OursMLPEns but has a worse (higher) LAMR than
OursMLPEnsHidFuse. This again shows that our proposed
OursMLPEnsHidFuse not only results in a much
smaller ensemble model (and consequently, faster
detection speed), but is also more effective in bringing
out the strength of the ensemble compared to
OursMLPEnsScoreFuse.
– OursMLPEnsHidFuse has better detection accuracy
than Pls while being significantly faster, using
only standard HOG features. This shows that we have not
yet saturated the performance of HOG features. There
is still a lot of improvement that can be made in the
classification algorithm for pedestrian detection systems.
– Our experiments show for the first time that using
the proposed algorithm OursMLPEnsHidFuse, it is
possible to apply ensemble techniques for efficient ob-
ject detection purposes.
4.5.1 Analysis of trained ensemble model complexity
Although the results and the discussion about detec-
tion performance and speeds in the previous section
should be sufficient, for completeness, we theorize
and analyze the model complexities and sizes for the
three ensemble methods described in the previous sec-
tion: OursMLPEns, OursMLPEnsScoreFuse and
OursMLPEnsHidFuse. It is to be noted that as men-
tioned previously, there are M = 100 MLPs in the ensem-
ble.
For OursMLPEns, given a test feature vector x, there
are 100 different matrix projections with each matrix hav-
ing k rows and h columns (which is equal to the number of
dimensions of the feature vector and the number of hidden
neurons in each MLP respectively). After each projection,
the tanh(·) function must be applied, followed by a dot product
between the resulting vector and a linear weight vector,
which will produce a single score (for each member of the
ensemble). Then the logsig(·) function needs to be per-
formed on this score. Then 100 such scores will be aver-
aged to give the final ensemble score. Therefore the ensem-
ble complexity is quite high, especially for a highly compu-
tationally expensive task such as sliding window pedestrian
detection.
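The per-window computation described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names are hypothetical, biases are omitted, and the operation count only tallies multiply-accumulates in the projections and dot products.

```python
import numpy as np

def tansig(z):
    # Hyperbolic tangent sigmoid (mathematically equal to tanh).
    return np.tanh(z)

def logsig(z):
    # Logistic sigmoid.
    return 1.0 / (1.0 + np.exp(-z))

def bagged_ensemble_score(x, hidden_weights, output_weights):
    """Average the scores of M single-hidden-layer MLPs.

    x:              (k,) feature vector
    hidden_weights: (M, k, h) one k-by-h projection matrix per member
    output_weights: (M, h) one linear weight vector per member
    Biases are left out for brevity (an assumption of this sketch).
    """
    scores = []
    for W, v in zip(hidden_weights, output_weights):
        hidden = tansig(x @ W)             # projection, then tansig
        scores.append(logsig(hidden @ v))  # dot product, then logsig
    return float(np.mean(scores))

def bagged_ensemble_macs(M, k, h):
    # M projections of k*h multiply-accumulates plus M dot products of length h.
    return M * (k * h + h)
```

Under this count the cost grows linearly with the number of members M, which is what makes a 100-member ensemble expensive in a sliding-window setting.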
For OursMLPEnsScoreFuse, after training the L1-
regularized linear classifier on scores at the end of the
model fusion step, 33 networks are found to have nonzero
weights. This means that, theoretically, it should be
100/33 ≈ 3 times faster than OursMLPEns. This theoretical
observation tallies with the experimental results presented
in Table 1.
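The speed-up argument rests on the fact that a member whose fused weight is zero never needs to be evaluated. A small sketch of that reasoning is given below; the helper names are hypothetical, and the fused output is shown as a plain linear combination because any final squashing function is not specified in this section.

```python
import numpy as np

def active_members(score_weights):
    # Indices of ensemble members whose fused weight is nonzero.
    return np.flatnonzero(score_weights)

def fused_score(member_scores, score_weights, bias=0.0):
    # Linear combination of member scores. Members with zero weight
    # contribute nothing, so they can be skipped at test time.
    idx = active_members(score_weights)
    return float(member_scores[idx] @ score_weights[idx] + bias)

def theoretical_speedup(num_members, num_active):
    # Cost is taken to be proportional to the number of members evaluated.
    return num_members / num_active
```

With 100 members and 33 nonzero weights, `theoretical_speedup(100, 33)` evaluates to 100/33, which rounds to 3.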
With regard to OursMLPEnsHidFuse, it was found
that performing L1-regularized classifier optimization on
hidden features resulted in a highly sparse weight vector
m_trained, with only 159 nonzero weight components.
Therefore, at test time, only a single matrix projection is
needed, with a matrix having k rows and 159 columns.
After that, the tansig(·) function is applied, followed by a
dot product with a linear weight vector and finally a
logsig(·) function. This means that the effective model
complexity is much lower than that of either OursMLPEns
or OursMLPEnsScoreFuse.
This again matches the empirical evidence.
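To make the pruning step concrete, the following sketch keeps only the hidden units whose fused weight is nonzero and scores a window with the resulting single projection. It assumes that the hidden-layer matrices of all members are stacked column-wise into one matrix; the function names, the omitted hidden biases and the operation count are illustrative assumptions rather than details taken from the paper.

```python
import numpy as np

def select_hidden_units(W_all, m_trained):
    """Keep the hidden units with nonzero fused weight.

    W_all:     (k, M*h) hidden projections of all members, stacked column-wise
    m_trained: (M*h,) sparse weight vector from the L1-regularized fit
    """
    keep = np.flatnonzero(m_trained)
    return W_all[:, keep], m_trained[keep]

def hidfuse_score(x, W_selected, m_selected, bias=0.0):
    # One projection, tansig, one dot product, then logsig.
    hidden = np.tanh(x @ W_selected)
    return float(1.0 / (1.0 + np.exp(-(hidden @ m_selected + bias))))

def hidfuse_macs(k, num_units):
    # One k-by-num_units projection plus one dot product of length num_units.
    return k * num_units + num_units
```

Because pruned units have zero weight, the score computed from the reduced matrix equals the score computed from the full stacked matrix. If an M-member ensemble is counted as M·(k·h + h) multiply-accumulates, its ratio to the 159-unit fused model is M·h/159, which depends on the number of hidden neurons h per network.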
5 Conclusion and future work
In this paper, we propose a novel algorithm to train a com-
pact ensemble of Multi-layer Perceptron neural networks
for pedestrian detection. The proposed algorithm integrates
the members of the ensemble at the hidden feature layer
level, resulting in a very small ensemble size (allowing
for fast pedestrian detection) while at the same time out-
performing existing neural network ensembling techniques
and other pedestrian detection systems. We obtain very
encouraging state-of-the-art results, and show for the first
time that, using our proposed algorithm, it is indeed
possible to apply neural network ensemble techniques to the
task of pedestrian detection, something that was
previously thought to be too inefficient due to the very large
model size of ensembles.
There are several interesting directions for research based
upon the work in this paper. Firstly, there is an opportunity
to apply the method presented in this paper to other object
detection tasks and applications such as face and vehicle
detection, and other general pattern recognition problems.
Secondly, it will be highly beneficial to explore the effect
of different feature extraction mechanisms on the resulting
ensembles in terms of both detection performance and effi-
ciency.
References
[1] Constantine Papageorgiou and Tomaso Poggio. A
trainable system for object detection. International
Journal of Computer Vision, 38(1):15–33, 2000.
[2] Paul Viola and Michael Jones. Rapid object detection
using a boosted cascade of simple features. In Con-
ference on Computer Vision and Pattern Recognition,
volume 1, pages I-511–I-518. IEEE, 2001.
[3] Navneet Dalal and Bill Triggs. Histograms of ori-
ented gradients for human detection. In Conference
on Computer Vision and Pattern Recognition, pages
886–893. IEEE, 2005.
[4] Bastian Leibe, Ales Leonardis, and Bernt Schiele.
Combined object categorization and segmentation
with an Implicit Shape Model. In ECCV Workshop
on Statistical Learning in Computer Vision, volume 2,
pages 7–14, 2004.
[5] Dana H Ballard. Generalizing the Hough transform
to detect arbitrary shapes. Pattern Recognition,
13(2):111–122, 1981.
[6] Krystian Mikolajczyk and Cordelia Schmid. A performance
evaluation of local descriptors. IEEE Transactions on
Pattern Analysis and Machine Intelligence,
27(10):1615–1630, 2005.
[7] Serge Belongie, Jitendra Malik, and Jan Puzicha.
Matching shapes. In International Conference on
Computer Vision, volume 1, pages 454–461. IEEE,
2001.
[8] Kyaw Kyaw Htike and David Hogg. Adapting pedes-
trian detectors to new domains: a comprehensive re-
view. Engineering Applications of Artificial Intelli-
gence, 50:142–158, April 2016.
[9] Rodrigo Benenson, Mohamed Omran, Jan Hosang,
and Bernt Schiele. Ten Years of Pedestrian Detection,
What Have We Learned?, pages 613–627. Springer
International Publishing, Cham, 2015.
[10] Kyaw Kyaw Htike and David Hogg. Unsupervised
Detector Adaptation by Joint Dataset Feature Learn-
ing, pages 270–277. Springer International Publish-
ing, Cham, 2014.
[11] Kyaw Kyaw Htike and David Hogg. Weakly su-
pervised pedestrian detector training by unsupervised
prior learning and cue fusion in videos. In Interna-
tional Conference on Image Processing, pages 2338–
2342. IEEE, October 2014.
[12] Pedro Felzenszwalb, David McAllester, and Deva Ra-
manan. A discriminatively trained, multiscale, de-
formable part model. In Conference on Computer Vi-
sion and Pattern Recognition, pages 1–8. IEEE, June
2008.
[13] Pedro F Felzenszwalb, Ross B Girshick, David
McAllester, and Deva Ramanan. Object detec-
tion with discriminatively trained part-based models.
IEEE Transactions on Pattern Analysis and Machine
Intelligence, 32(9):1627–1645, 2010.
[14] Pedro F Felzenszwalb, Ross B Girshick, and David
McAllester. Cascade object detection with de-
formable part models. In Conference on Computer
Vision and Pattern Recognition, pages 2241–2248.
IEEE, 2010.
[15] Ross B Girshick, Pedro F Felzenszwalb, and David A
McAllester. Object detection with grammar models.
In Advances in Neural Information Processing Sys-
tems, pages 442–450, 2011.
[16] William Robson Schwartz, Aniruddha Kembhavi,
David Harwood, and Larry S Davis. Human detection
using Partial Least Squares analysis. In International
Conference on Computer Vision, pages 24–31. IEEE,
2009.
[17] Piotr Dollár, Zhuowen Tu, Pietro Perona, and Serge
Belongie. Integral channel features. In British Ma-
chine Vision Conference, pages 91.1–91.11. BMVA
Press, 2009.
[18] Piotr Dollár, Serge Belongie, and Pietro Perona. The
fastest pedestrian detector in the West. In British Ma-
chine Vision Conference, volume 2, page 7, 2010.
[19] Rodrigo Benenson, Markus Mathias, Radu Timofte,
and Luc Van Gool. Pedestrian detection at 100 frames
per second. In Conference on Computer Vision and
Pattern Recognition, pages 2903–2910. IEEE, 2012.
[20] Rodrigo Benenson, Markus Mathias, Tinne Tuytelaars,
and Luc Van Gool. Seeking the strongest rigid detector.
In Conference on Computer Vision and Pattern
Recognition, pages 3666–3673. IEEE, 2013.
[21] E Filippi, M Costa, and E Pasero. Multi-layer percep-
tron ensembles for increased performance and fault-
tolerance in pattern recognition tasks. In Interna-
tional Conference on Neural Networks: IEEE World
Congress on Computational Intelligence, volume 5,
pages 2901–2906. IEEE, 1994.
[22] Pablo M Granitto, Pablo F Verdes, and H Alejan-
dro Ceccatto. Neural network ensembles: evalua-
tion of aggregation algorithms. Artificial Intelligence,
163(2):139–162, 2005.
[23] Lars Kai Hansen and Peter Salamon. Neural network
ensembles. IEEE Transactions on Pattern Analysis
and Machine Intelligence, 12(10):993–1001, 1990.
[24] Anders Krogh, Jesper Vedelsby, et al. Neural net-
work ensembles, cross validation, and active learning.
Advances in Neural Information Processing Systems,
7:231–238, 1995.
[25] Zhi-Hua Zhou, Jianxin Wu, and Wei Tang. Ensem-
bling neural networks: many could be better than all.
Artificial Intelligence, 137(1):239–263, 2002.
[26] Leo Breiman. Bagging predictors. Machine Learning,
24(2):123–140, 1996.
[27] Jerome Friedman, Trevor Hastie, and Robert Tibshi-
rani. Additive logistic regression: a statistical view of
Boosting. Annals of Statistics, 28(2):337–407, 2000.
[28] Tin Kam Ho. The random subspace method for
constructing decision forests. IEEE Transactions on
Pattern Analysis and Machine Intelligence,
20(8):832–844, 1998.
[29] Leo Breiman. Random forests. Machine Learning,
45(1):5–32, 2001.
[30] Jerome H Friedman and Bogdan E Popescu. Predic-
tive learning via rule ensembles. The Annals of Ap-
plied Statistics, pages 916–954, 2008.
[31] Dong C. Liu and Jorge Nocedal. On the
limited memory BFGS method for large scale opti-
mization. Mathematical Programming, 45:503–528,
1989.
[32] Piotr Dollár, Christian Wojek, Bernt Schiele, and
Pietro Perona. Pedestrian detection: An evaluation
of the state of the art. IEEE Transactions on Pat-
tern Analysis and Machine Intelligence, 34(4):743–
761, 2012.
[33] Subhransu Maji, Alexander C Berg, and Jitendra Ma-
lik. Classification using intersection kernel support
vector machines is efficient. In Conference on Com-
puter Vision and Pattern Recognition, pages 1–8.
IEEE, 2008.
[34] Sašo Džeroski and Bernard Ženko. Is combining
classifiers with stacking better than selecting the best one?
Machine Learning, 54(3):255–273, 2004.
[35] Mark Everingham, Luc J. Van Gool, Christopher K. I.
Williams, John M. Winn, and Andrew Zisserman.
The Pascal Visual Object Classes (VOC) challenge.
International Journal of Computer Vision, 88(2):303–
338, 2010.