ERK'2019, Portorož

Generation of 2D Ear Dataset with Annotated View Angles as a Basis for Angle-Aware Ear Recognition

Anja Hrovatič¹, Kihoon Kwon², Diego Sušanj³, Peter Peer¹, Žiga Emeršič¹
¹ Faculty of Computer and Information Science, University of Ljubljana, Večna pot 113, SI-1000 Ljubljana, EU
² School of Electronics Engineering, IT College, Kyungpook National University, 80 Daehak-ro, Buk-gu, Daegu 41566, Korea
³ University of Rijeka, Faculty of Engineering, Vukovarska 58, 51000 Rijeka, Croatia
E-mail: ah7651@student.uni-lj.si, kwons149@naver.com, dsusanj@riteh.hr, {peter.peer, ziga.emersic}@fri.uni-lj.si

Abstract

Ear recognition has seen steady development in recent years. Despite numerous novel approaches, ranging from traditional methods based on local feature extraction to deep learning, certain issues remain unsolved. As pointed out in recent studies, one of the most prominent is the problem of ear alignment, and traditional approaches have proved unsuccessful in tackling it. However, in order to train deep neural networks to estimate pose angles and thus facilitate ear alignment, a dataset with annotated angles is needed. In this work we present a 2D RGB dataset, derived from the UND-J2 3D dataset, with corresponding angle-annotated 2D images, as a basis for convolutional neural network training.

1 Introduction

The popularity of ear recognition has been, and still is, increasing in recent years. There have even been ear recognition competitions [1, 2], and many new deep-learning-based approaches have been proposed [1, 3, 4]. However, as emphasized in some of the recent surveys [5, 6], one of the most prominent issues still remains unaddressed: the problem of ear alignment. In real-life scenarios images of ears are taken in different positions, which causes difficulties during the recognition stage.

Ear alignment has been addressed only partially [7, 8], either only in-plane (where only head pitch is considered) or with limited success for arbitrary rotations. We argue that the most promising approach is to use a convolutional neural network (CNN) to estimate at what angle the observed ear image was taken. Once the angle is known, we can either use a simple projective transformation to normalize the ear image, or train multiple models: one for severe angles from one side, one for fully profile ear images, and one for severe angles from the other side.

However, in order to achieve that, a new dataset with specifically annotated angles is needed. In this paper we focus on generating such a dataset of 2D RGB and corresponding depth images of subjects' left profile faces, with applied roll, pitch and yaw angles and a small positional perturbation in the form of random movement or a defined translation. This dataset will serve as a basis for future research and development of alignment-aware ear recognition.

We based our approach on the University of Notre Dame collection J2 database (UND-J2). The dataset consists of 3D point cloud data and corresponding 2D RGB images. The 3D range scans were used to generate colored point clouds in the PLY and PCD file formats, which we then used for the transformation of the 3D point coordinates. Next, we applied roll, pitch and yaw rotations and added a translation to include small random movements.
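As an illustration of this rotation-and-translation step, the following minimal sketch applies roll, pitch and yaw rotations and a small translation to a point cloud with NumPy. The function names, the example angles, the translation range and the mean-centering are our own illustrative choices that mirror the description in Section 3.2 and equation (4); this is not the exact code used to build the dataset.

```python
import numpy as np

def rotation_matrix(roll, pitch, yaw):
    """Combined rotation R = R_z(yaw) @ R_y(pitch) @ R_x(roll); angles in radians."""
    cr, sr = np.cos(roll), np.sin(roll)
    cp, sp = np.cos(pitch), np.sin(pitch)
    cy, sy = np.cos(yaw), np.sin(yaw)
    R_x = np.array([[1, 0, 0], [0, cr, -sr], [0, sr, cr]])
    R_y = np.array([[cp, 0, sp], [0, 1, 0], [-sp, 0, cp]])
    R_z = np.array([[cy, -sy, 0], [sy, cy, 0], [0, 0, 1]])
    return R_z @ R_y @ R_x

def transform_point_cloud(points, roll_deg, pitch_deg, yaw_deg, translation):
    """Rotate an (N, 3) point cloud about its centroid and add a small translation."""
    centroid = points.mean(axis=0)                  # reference point: mean of X, Y, Z
    R = rotation_matrix(*np.radians([roll_deg, pitch_deg, yaw_deg]))
    return (points - centroid) @ R.T + translation  # rotate centered points, then translate

# Example: 20 degrees of pitch plus a small random shift (illustrative values only).
points = np.random.rand(1000, 3)                    # stand-in for one UND-J2 range scan
shift = np.random.uniform(-5.0, 5.0, size=3)
transformed = transform_point_cloud(points, 0.0, 20.0, 0.0, shift)
```

In the actual pipeline the (N, 3) array would hold the points of one UND-J2 range scan, with the corresponding RGB values carried alongside so that color survives the transformation.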
The transformed point coordinates were then projected from the three-dimensional world onto a two-dimensional image plane using the pinhole camera model and the camera's intrinsic parameters. The projected coordinates (u, v) were used to generate 2D RGB images, while depth images were generated from the Z coordinate of the projected 3D vector.

Images generated in this way can be used to fill in missing examples and to enlarge databases with a wider range of pose variation. The need for this can be seen in the UND-J2 dataset itself: for some subjects there are only a couple of example images, which may present a problem for ear detection or recognition approaches that require a larger amount of images.

The paper is structured as follows. In Section 2 we present the related work in this field. Section 3 presents our approach, from preparing the UND-J2 dataset, through applying rotation, translation and perspective projection, to mapping the projected points to 2D RGB and depth images. We provide some final remarks in Section 4.

2 Related work

Preparation and generation of databases is a vital process in computer vision: it is needed for the evaluation of unsupervised approaches and is key for supervised methods, which depend on large databases due to their learning nature. In the field of ear biometrics several databases already exist, such as the IITD Ear Dataset [9] and the WPUT Ear Dataset [10]. Many of them contain data captured in controlled conditions and therefore do not present a challenging task for ear detection algorithms. There are, however, some databases that include examples with different postures and angles, such as the USTB dataset, which provides images taken at different angles, and the YSU database, which includes images with poses from 0 to 90 degrees [11].

The first database consisting of ear images in the wild, which includes images of famous people taken from the internet, was presented in [12]. It was shown to be the most challenging dataset so far, owing to varied illumination conditions, angles and the presence of hair and ear accessories.

None of these datasets, however, contain annotated angles. To tackle this issue we propose a new dataset. We used 3D point cloud data captured with a 3D range scanner. A challenge of 3D sensors and 3D point cloud data is that missing data can occur. In [13] this problem was addressed by applying a median filter to the original 3D point cloud with holes and to the rotated output data.

3 Preparation of a generated 2D image dataset of ears

In the sections below we present the techniques used to obtain and generate a 2D RGB and depth image database of ears with applied roll, pitch and yaw rotation and small random movement or defined translation. First, we present the UND-J2 database (available at https://cvrl.nd.edu/projects/data/#nd-collection-j2), used for generating the new dataset, and describe how we used it in our work. Then we describe the techniques used for the transformation of the data and the generation of the images.

3.1 Preparation of the UND-J2 dataset

The basis of our work was the UND-J2 database. The dataset consists of 2436 left profile face 3D range scans of 415 subjects, taken with a Minolta Vivid 910 3D laser scanner, together with 2413 corresponding 2D RGB images of 640 × 480 pixels. The data was captured in controlled, laboratory-like conditions, but under different lighting conditions and poses.
Some images include occlusions such as earrings and hair.

First, we performed a data correspondence check between the 3D point cloud data and the 2D RGB images in the database. The check yielded 404 different subjects with at least two example sets each.

Based on the 3D point cloud data and the corresponding RGB images, we merged the RGB components from the 2D images with the 3D point cloud data and examined the visual correspondence as well. The visual correspondence examination resulted in 3 examples whose 2D RGB image did not match the 3D point cloud data, 75 examples with large offsets between the RGB images and the 3D point cloud data, and 221 examples in which the RGB images and the 3D point cloud had only a small offset. These offsets are a result of the capturing process: the corresponding 3D and 2D data were taken at nearly, but not exactly, the same time, so subjects could move in between, causing small or large offsets in the data. Even though using examples with large offsets results in an imperfect projection from the 3D world onto the 2D plane, we decided to use the given examples, generate as large a database as possible, and evaluate the obtained results.

Since the 3D point cloud data in the UND-J2 database are unstructured and in a raw format, we transformed the data into the Polygon (PLY) and Point Cloud Data (PCD) file formats. We generated the PLY and PCD files from the 3D points together with the RGB components taken from the corresponding 2D RGB images, creating colored 3D point clouds in both formats. Converting the files into the PLY and PCD formats enabled more efficient further transformation of the data, which is described in more detail in the following sections.

3.2 3D rotation and projection

In this section we present the approach used to map 3D coordinates to a 2D plane, given roll, pitch and yaw rotation angles and a translation in the x, y and z directions, in order to create the generated 2D RGB and depth image database.

We performed a perspective projection of the colored 3D point cloud data using the pinhole camera model [14], illustrated in Figure 1. The pinhole camera model models perspective projection and is the basic camera model used in computer vision [14]. The mapping of the pinhole camera, from 3D points in the world to 2D points in an image plane, is described with a camera matrix, also known as the camera projection matrix [15], given in equation (5).

Figure 1: Illustration of the pinhole camera model, where F_c represents the focal length defined in the X (F_x), Y (F_y) and Z (F_z) directions and (u, v) are the coordinates of the projected point P.

We performed the transformation of the colored 3D point cloud as follows. First, we computed the mean of the X, Y and Z coordinates in the 3D world. This mean serves as a reference point by which each 3D point (X, Y, Z) is translated, aligning the camera and world coordinate systems. Next, we applied rotation about the X, Y and Z axes, where X corresponds to roll, Y to pitch and Z to yaw, followed by a translation in the X, Y and Z directions. The overall rotation is computed by matrix multiplication [16] of the matrices shown in equations (1), (2) and (3), and the whole transformation of a 3D point is computed as shown in equation (4).
R_x = \begin{bmatrix} 1 & 0 & 0 \\ 0 & \cos\alpha & -\sin\alpha \\ 0 & \sin\alpha & \cos\alpha \end{bmatrix}    (1)

R_y = \begin{bmatrix} \cos\beta & 0 & \sin\beta \\ 0 & 1 & 0 \\ -\sin\beta & 0 & \cos\beta \end{bmatrix}    (2)

R_z = \begin{bmatrix} \cos\gamma & -\sin\gamma & 0 \\ \sin\gamma & \cos\gamma & 0 \\ 0 & 0 & 1 \end{bmatrix}    (3)

\begin{bmatrix} x_{rt} \\ y_{rt} \\ z_{rt} \end{bmatrix} = R_z R_y R_x \begin{bmatrix} X \\ Y \\ Z \end{bmatrix} + \begin{bmatrix} x_t \\ y_t \\ z_t \end{bmatrix}    (4)

Here \alpha, \beta and \gamma denote the roll, pitch and yaw angles, and (x_t, y_t, z_t) is the applied translation.

After the transformation we project the 3D point by multiplying it with the camera's intrinsic parameters, given by the camera projection matrix; the projection is shown in equation (5). The camera's intrinsic parameters are the focal lengths f_x and f_y, expressed in pixels [16]; in our case we defined f_x as 640 and f_y as 480. The parameters c_x and c_y represent the principal point, which corresponds to the image center. We additionally applied proportional scaling with the parameters s_x and s_y, which define the scale factors in the X and Y directions. The final projection of points onto the 2D image plane is computed with equations (6) and (7), where the X and Y coordinates of the projected point are divided by its Z value, giving the final projected u and v coordinates in the 2D image plane.

\begin{bmatrix} x_p \\ y_p \\ z_p \end{bmatrix} = \begin{bmatrix} f_x / s_x & 0 & c_x & 0 \\ 0 & f_y / s_y & c_y & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix} \begin{bmatrix} x_{rt} \\ y_{rt} \\ z_{rt} \\ 1 \end{bmatrix}    (5)

u = x_p / z_p    (6)

v = y_p / z_p    (7)

3.3 Generating 2D RGB and depth images

The perspective projection described in the previous section gives us the u, v and z coordinates of each projected 3D point. The coordinates u and v correspond to points in the 2D image plane, while z is the depth of the transformed point before the final projection into the 2D plane.

We generated 2D RGB images (Figures 2 and 3) based on the color information taken from the PLY and PCD files described in Section 3.1. The images are obtained by writing the RGB components at the projected u and v coordinates of a 2D image.

Figure 2: Generated RGB images with different roll and pitch angles: (a) RGB image with a 20° pitch angle; (b) RGB image with a 15° roll angle.

Figure 3: Generated RGB images with different roll, pitch and yaw angles: (a) RGB image with 5° roll, 25° pitch and 10° yaw; (b) RGB image with 25° roll, 15° pitch and 35° yaw.

Based on the transformed and projected z coordinate, we generated depth images, also known as depth maps. A depth map contains information about the distance of object surfaces in the image from the camera viewpoint. We produced depth maps in which luminance encodes the distance to the camera: as can be seen in Figures 4 and 5, surfaces closer to the camera are brighter and surfaces farther from the camera are darker.

Figure 4: Generated depth images with different roll, pitch and yaw angles: (a) depth image with 5° roll, 10° pitch and 20° yaw; (b) depth image with a 20° roll angle.

Figure 5: Generated depth images with different roll, pitch and yaw angles: (a) depth image with a 15° pitch angle; (b) depth image with 10° roll, 25° pitch and 30° yaw.

The applied transformations resulted in holes, i.e. missing data, in the generated RGB and depth images. There are two reasons for this: first, holes in the original 3D point cloud, caused by the 3D sensor due to illumination conditions or the subject's oily skin; and second, the rotation applied to the 3D point cloud. Rotation distorts the intervals between neighbouring points in the X and Y directions, especially when the point of view is changed significantly [13]; this mostly concerns the pitch and yaw rotations applied to the point clouds.
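To make equations (5)–(7) and the image-generation step concrete, the sketch below projects transformed points with the stated intrinsics (f_x = 640, f_y = 480, principal point at the image center) and writes their colors and depths into 640 × 480 RGB and depth images. This is only a sketch assuming NumPy; the variable names, the z > 0 filter and the depth normalization are our own illustrative choices, not the exact implementation used for the dataset.

```python
import numpy as np

def project_points(points, fx=640.0, fy=480.0, cx=320.0, cy=240.0, sx=1.0, sy=1.0):
    """Pinhole projection of transformed (N, 3) points to pixel coordinates,
    following equations (5)-(7): u = (fx/sx) * x/z + cx, v = (fy/sy) * y/z + cy."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    u = (fx / sx) * x / z + cx
    v = (fy / sy) * y / z + cy
    return u, v, z

def render_rgb_and_depth(points, colors, width=640, height=480):
    """Rasterize projected points into an RGB image and a depth map in which
    nearer surfaces appear brighter (no z-buffering: later points simply
    overwrite earlier ones in this sketch)."""
    u, v, z = project_points(points)
    cols, rows = np.round(u).astype(int), np.round(v).astype(int)
    valid = (z > 0) & (cols >= 0) & (cols < width) & (rows >= 0) & (rows < height)
    rgb = np.zeros((height, width, 3), dtype=np.uint8)
    depth = np.zeros((height, width), dtype=np.float32)
    rgb[rows[valid], cols[valid]] = colors[valid]
    # Illustrative depth encoding only: invert z so nearer points get larger values.
    depth[rows[valid], cols[valid]] = z[valid].max() - z[valid]
    return rgb, depth
```

Pixels that receive no projected point remain black, which is exactly the source of the holes discussed above and the motivation for the hole-filling step described next.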
Since the holes in the generated images would pose a problem in further use of the database, we decided to apply a hole-filling process to the projected points: a flood-fill operation on the background pixels of the input binary image.

4 Conclusion

In this paper we not only present a novel dataset derived from UND-J2, but also describe the process of preparing and generating it: a new database of 2D RGB and depth images, based on the 3D point cloud data and corresponding 2D RGB images from the UND-J2 dataset. We applied a perspective transformation, projecting points from the 3D world onto the 2D image plane using the pinhole camera model, and mapped the projected coordinates to 2D RGB and depth images.

Some problems were encountered, such as missing data, in the form of holes, in the 3D point clouds, and imperfect correspondence between the 3D point clouds and the 2D RGB images, which causes offsets and affects the data reconstruction process. Another important factor is correctly defining the camera's intrinsic parameters used in the camera projection matrix, which plays a vital role in the accuracy of the projection.

The final application of the dataset is CNN-based angle estimation. This enables either separate-model-based prediction or serves as a basis for deep-neural-network-based ear alignment. We hope that our work will serve as a good basis for solving the problem of ear alignment, an issue that after many years still remains unsolved.

References

[1] Ž. Emeršič, D. Štepec, V. Štruc, P. Peer, A. George, A. Ahmad, E. Omar, T. E. Boult, R. Safdari, Y. Zhou, S. Zafeiriou, D. Yaman, F. I. Eyiokur, and H. K. Ekenel, "The Unconstrained Ear Recognition Challenge," in International Joint Conference on Biometrics. IEEE/IAPR, 2017, pp. 715–724.

[2] Ž. Emeršič, A. Kumar S. V., B. S. Harish, et al., "The Unconstrained Ear Recognition Challenge 2019," in International Conference on Biometrics. IAPR, 2019.

[3] Y. Zhang, Z. Mu, L. Yuan, and C. Yu, "Ear verification under uncontrolled conditions with convolutional neural networks," IET Biometrics, vol. 7, no. 3, pp. 185–198, Jan. 2018.

[4] E. E. Hansley, M. P. Segundo, and S. Sarkar, "Employing fusion of learned and handcrafted features for unconstrained ear recognition," IET Biometrics, vol. 7, no. 3, pp. 215–223, Jan. 2018.

[5] A. Pflug and C. Busch, "Ear biometrics: A survey of detection, feature extraction and recognition methods," IET Biometrics, vol. 1, no. 2, pp. 114–129, 2012.

[6] Ž. Emeršič, V. Štruc, and P. Peer, "Ear Recognition: More Than a Survey," Neurocomputing, vol. 255, pp. 26–39, 2017.

[7] A. Pflug and C. Busch, "Segmentation and normalization of human ears using cascaded pose regression," in Secure IT Systems. Springer, 2014, pp. 261–272.

[8] M. Ribič, Ž. Emeršič, V. Štruc, and P. Peer, "Influence of Alignment on Ear Recognition: Case Study on AWE Dataset," in International Electrotechnical and Computer Science Conference, vol. 25-B. IEEE, 2016, pp. 131–134.

[9] "Automated human identification using ear imaging," http://www4.comp.polyu.edu.hk/~csajaykr/IITD/Database-Ear.htm.

[10] D. Frejlichowski and N.
Tyszkiewicz, "The West Pomeranian University of Technology Ear Database – A Tool for Testing Biometric Algorithms," in Image Analysis and Recognition, ser. Lecture Notes in Computer Science, A. Campilho and M. Kamel, Eds. Springer Berlin Heidelberg, 2010, vol. 6112, pp. 227–234.

[11] T. V. Kandgaonkar, R. S. Mente, A. R. Shinde, and S. D. Raut, "Ear biometrics: A survey on ear image databases and techniques for ear detection and recognition," IBMRD's Journal of Management & Research, vol. 4, no. 1, pp. 88–103, 2015.

[12] Ž. Emeršič and P. Peer, "Ear biometric database in the wild," in 2015 4th International Work Conference on Bioinspired Intelligence (IWOBI). IEEE, 2015, pp. 27–32.

[13] P. Yan and K. W. Bowyer, "Ear biometrics using 2D and 3D images," in 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05) - Workshops. IEEE, 2005, pp. 121–121.

[14] P. Sturm, "Pinhole camera model," in Computer Vision: A Reference Guide, pp. 610–613, 2014.

[15] M. Pharr and G. Humphreys, Physically Based Rendering: From Theory to Implementation, 2004.

[16] F. Dunn and I. Parberry, 3D Math Primer for Graphics and Game Development, 2nd ed., pp. 217–275, 2011.