https://doi.org/10.31449/inf.v46i1.3442 Informatica 46 (2022) 1–12

Towards a Feasible Hand Gesture Recognition System as Sterile Non-contact Interface in the Operating Room with 3D Convolutional Neural Network

Roy Amante A. Salvador and Prospero C. Naval, Jr.
E-mail: rasalvador1@up.edu.ph, pcnaval@up.edu.ph
Computer Vision and Machine Intelligence Group, Department of Computer Science, College of Engineering, University of the Philippines, Diliman, Quezon City, Philippines

Keywords: hand gesture recognition system, computer vision, 3D convolutional neural network, operating room

Received: February 16, 2021

Operating surgeons are constrained when interacting with computer systems because such systems traditionally rely on hand-held devices such as the keyboard and mouse. Studies have previously proposed and shown that hand gestures are an efficient, touchless way of interfacing with such systems while maintaining a sterile field. In this paper, we propose a Deep Computer Vision-based Hand Gesture Recognition framework to facilitate this interaction. We trained a 3D Convolutional Neural Network on a very large-scale dataset to classify hand gestures robustly. This network became the core component of a prototype application for intraoperative navigation of a patient's medical images. A usability evaluation with surgeons demonstrates that the application would work, and a hand gesture lexicon germane to medical image navigation was defined. By completing one cycle of usability engineering, we demonstrate the feasibility of using the proposed framework inside the Operating Room.

Povzetek: This paper seeks to demonstrate the feasibility of using a deep computer vision system for recognizing hand gestures in the operating room.

1 Introduction

Hand Gesture Recognition (HGR) systems for interfacing are now more interesting and relevant than ever. The development and deployment of contactless technology can be part of community preparedness and response during disease outbreaks. With the ongoing Coronavirus Disease (COVID-19) pandemic, it behooves us to observe guidelines such as frequent hand sanitation, social distancing, and the wearing of personal protective attire to reduce the spread of pathogens and contamination. This aseptic setting is even more strictly enforced in the Operating Room (OR).

Figure 1: A typical operating room layout [1] depicting personnel and objects which should remain sterile during operations.

During operations, surgeons cannot physically touch unsterile objects due to safety and health regulations, yet they may need to control a medical system, for example to navigate to the patient's medical images, view them for reference, and manipulate them on the screen. A simple hand gesture, like swiping in the air, would be more convenient for the surgeon and still a compliant way of interfacing with the system. Such a touchless system, driven by modalities like hand gestures, helps maintain sterility. Figure 1 shows a visualization of the Operating Room setting in terms of sterility. Wipfli et al. [2] compared gesture-based control against assistant-based control guided by instructions from the surgeon for manipulating images in a surgical setting; the former received significantly higher ratings for efficiency and surgeon satisfaction.

Building a Hand Gesture Recognition System is non-trivial. It is challenging to develop a quick and reliable method of detecting and recognizing the human hand in dynamic environments such as the operating room.
For example, the hand may vary in size, color, illumination, position, and orientation with respect to the camera. Many local and global invariant features have been manually designed and engineered to cope with these variabilities, such as Histograms of Gradients (HOG) [3], Contour Description [4], Hu Invariant Moments [5], Fourier Descriptors [6], and the Karhunen-Loeve Transform [7]. In line with the no-free-lunch theorem, none of these features is better than the others all of the time, but there are factors and situations under which each performs relatively better.

Deep Learning is a branch of machine learning based on a set of techniques that attempt to model high-level data abstractions using multiple processing layers composed of complex structures and transformations. It has been applied and proven useful in fields including, but not limited to, computer vision, natural language processing, speech recognition, and knowledge representation. As a subset of Representation Learning [8], Deep Learning has the advantage of learning features automatically from data, whereas rule-based and classical machine learning approaches need manual engineering and extraction of hand-designed features. It can also eliminate the need for pre-processing steps such as hand tracking and region segmentation, making the processing pipeline simpler and more straightforward (raw data in, prediction out).

We approach the issue of maintaining sterility in the Operating Room by proposing a real-time Computer Vision and Deep Learning-based Hand Gesture Recognition framework. The main contributions of our work are the following:

1. Demonstration of the feasibility of the proposed framework inside the Operating Room by designing and building a basic Medical Image Navigation prototype that is positively evaluated by surgeons.

2. Definition of a Hand Gesture Lexicon that is suitable for a Medical Image Navigation application and consequently also appropriate for use inside the operating room.

3. A new Dynamic Hand Gesture dataset collected during the development of the prototype application. It is medium-sized, containing more hand gesture samples than the Sheffield Kinect Gesture (SKIG) [9] dataset.

2 Hand gesture recognition systems in the operating room

Pioneering work in this arena heavily applied traditional computer vision techniques for image preprocessing, hand detection, and hand tracking, and used finite-state machines for gesture classification [10, 11]. Some of these systems had poor usability and caused fatigue for their users [12]. A classical machine learning approach was taken by Achacon et al. [13]. Their system, called REALISM, included only a few gesture classes. They first performed hand detection with Haar-like features and a cascade classifier, then employed Principal Component Analysis and Euclidean distance matching against samples of the classes to perform classification. Jacob et al. [14] defined a set of gestures for navigating medical images in consultation with veterinary surgeons. They used the 3D trajectory (3Dt) of the hand as the feature and a Hidden Markov Model (HMM) as the classifier. The computation of the hand trajectory, however, relied on the skeletal tracking feature of the Microsoft Kinect. They also used the skeletal information to compute head and torso orientation for determining intentional gestures. Several other HGR systems in the operating room [1, 15, 16, 17] have likewise used and depended on the Microsoft Kinect device.
Another popular device is the Leap Motion Controller (LMC), which uses proprietary drivers to process and format data into frames of objects such as hands and fingers [18]. Park et al. [19] developed a message hooking program called GestureHook to convert gestures into mouse and keyboard functions for their medical system. Sa-nguannarm et al. [20] applied rule-based classification on the 3D hand movement (trajectory) using the points returned by the LMC device.

With their ability to project holograms that can be accessed interactively with hand gestures, mixed reality headsets are now making their way into the operating room to support surgical procedures. A study by Galati et al. [21] found that they can increase the surgeon's productivity, but highlighted battery autonomy and the weight of the device, which can cause physical stress and discomfort, as points for improvement. Furthermore, these devices cost significantly more than the other capture devices.

3 Proposed framework

We propose a Hand Gesture Recognition System using Deep Computer Vision to act as the surgeon's interface to a medical image navigation application. To do this successfully, we aimed at training a deep network and developing a framework which classifies static and dynamic gestures:

1. With high accuracy - Classification performance must be invariant to the user and background. It must be at least comparable in performance with baseline systems.

2. In real time - The system must be able to complete the processing pipeline in at most 250 milliseconds. The average reaction time for visual stimuli ranges from 250 to 350 milliseconds [22]; taking longer than this range would introduce a visually noticeable lag.

3.1 Data capture

The Microsoft Kinect was first explored for capturing the hand gestures as it provides a depth modality. With depth information, one can easily determine which objects or pixels are in the foreground and which are in the background. Moreover, the depth sensor of the Kinect is an infrared camera, so lighting conditions, the user's skin color and clothing, and the background were assumed to have little to no impact on system performance. Since the introduction of the first-generation Microsoft Kinect in 2010, many recognition systems and studies have used the device [23, 24, 25]. The simpler, cheaper, and more accessible alternative is to use a single regular camera to capture the user's gesture. Many laptops have webcams, and these are much more widely available, especially in developing countries. A single webcam also works in the proposed framework.

3.2 Network architecture

We determined that separate networks for static hand postures and dynamic gestures are unnecessary, since static hand gestures can also be seen as dynamic: even though a static hand posture may not move in space, it always moves forward in time. We therefore chose a 3D Convolutional Neural Network, as such networks are well suited for systems involving spatiotemporal feature learning [26, 27, 28]. Specifically, we use a slightly modified version of the C3D model, which was originally applied to action recognition, action similarity labeling, and scene classification [29]. It resembles a VGG-11 [30] architecture, replacing all 2D convolutional layers with 3D convolutional layers, each followed by Batch Normalization. The input to the network is 16 fused RGB and depth frames, sized 96 x 96.
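As an illustration of this design, the sketch below re-expresses the modified C3D in PyTorch; the actual model was implemented in Lasagne, and the filter counts (which Figure 2 only indicates graphically) follow the original C3D [29], so the layer widths, class name, and example class count are assumptions for illustration rather than the authors' code.

```python
# Minimal sketch of the modified C3D: eight 3x3x3 3D convolutions, each
# followed by Batch Normalization, five 3D max-pooling layers (the first one
# pooling spatially only), and two fully connected layers of 4096 neurons.
# Filter counts follow the original C3D and are assumptions here.
import torch
import torch.nn as nn

def conv_bn(in_ch, out_ch):
    # 3x3x3 3D convolution, stride 1, followed by Batch Normalization
    return nn.Sequential(
        nn.Conv3d(in_ch, out_ch, kernel_size=3, stride=1, padding=1),
        nn.BatchNorm3d(out_ch),
        nn.ReLU(inplace=True),
    )

class ModifiedC3D(nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        self.features = nn.Sequential(
            conv_bn(4, 64),                         # Conv1: 4 channels = RGB + depth
            nn.MaxPool3d((1, 2, 2)),                # Pool1: spatial pooling only
            conv_bn(64, 128),                       # Conv2
            nn.MaxPool3d(2),                        # Pool2
            conv_bn(128, 256), conv_bn(256, 256),   # Conv3a, Conv3b
            nn.MaxPool3d(2),                        # Pool3
            conv_bn(256, 512), conv_bn(512, 512),   # Conv4a, Conv4b
            nn.MaxPool3d(2),                        # Pool4
            conv_bn(512, 512), conv_bn(512, 512),   # Conv5a, Conv5b
            nn.MaxPool3d(2),                        # Pool5
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(512 * 1 * 3 * 3, 4096), nn.ReLU(inplace=True),  # FC6
            nn.Linear(4096, 4096), nn.ReLU(inplace=True),             # FC7
            nn.Linear(4096, num_classes),           # softmax is applied by the loss
        )

    def forward(self, x):
        # x: (batch, 4, 16, 96, 96) -- 16 fused RGB-D frames of size 96 x 96
        return self.classifier(self.features(x))

logits = ModifiedC3D(num_classes=27)(torch.randn(1, 4, 16, 96, 96))
```

With a 16 × 96 × 96 input and this pooling scheme, the feature extractor ends in a 512 × 1 × 3 × 3 volume, which is flattened before the fully connected layers.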
The sizing of the input and network depended on the capacity of the GPU (an NVIDIA GTX 970M graphics card) of the demo laptop machine used. Prediction of one sample took around 60 ms on battery and 40 ms when plugged in on that machine. We also verified the network on the SKIG [9] dataset using RGB only, and both RGB and depth (RGBD), with three-fold cross-validation; the results can be seen in Table 1. There may be other network architectures that are arguably better, but our purpose is only to predict robustly and quickly according to our defined guidelines for accuracy and real-time processing speed.

3.3 System actions / functionalities

The system actions include the ten functionalities listed by Jacob et al. [14]. That set was suggested by surgeons who were asked to give the most common functions they perform with medical images during surgeries: actions for navigation, brightness, orientation, and zoom-level manipulation. Furthermore, we added auxiliary functionalities for usability, guided by consultation with a surgeon: locking/unlocking the system, panning, and animation of the images in a medical series. The Lock and Unlock actions signal to the system the user's intent to use it; locking the system reduces the possibility of recognizing unintended hand gestures performed by the user. The Panning action is a supporting functionality when zooming into regions of interest. The series animation functionalities not only enable viewing of a medical series but also help with navigating to a specific image, as a series can contain hundreds or thousands of images/slices.

3.4 Medical image navigation interface

Developed with OpenCV [31], the interface is designed as a Medical Image/DICOM viewer-like application. Digital Imaging and Communications in Medicine (DICOM) is the international standard format for storing medical imaging and related information. In consultation with radiologists and surgeons, we were given and chose anonymized medical images taken from a real case of appendicitis. We extracted and organized its images for the application to display as sample patient data.

3.5 Processing pipeline

As a real-time system, the network continuously gives its prediction for every new frame it receives from the camera, but we need to treat the user's gestures as discrete actions. We expect the user to perform each intentional gesture for about one second. For a System Action to be triggered, the prediction of the particular gesture class must be sustained for a set duration, i.e., the gesture must be predicted continuously by the network over consecutive timesteps. If this condition is not met, we treat the gesture as unintended and no action is triggered. The workflow for executing System Actions is summarized by the diagram in Figure 4. On our demo laptop with 16 GB of memory, an Intel Core i7 processor, and an NVIDIA GTX 970M graphics card, the application logic and rendering took around 50 ms on battery and around 25 ms on mains power. Combining this with the network's prediction time, one cycle takes around 110 ms on battery and 65 ms on mains power. This means the system runs at roughly 8 to 15 fps, and the suitable duration threshold (in consecutive predictions) lies within this range.
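A minimal sketch of this sustain-then-trigger behaviour is given below; the class, the unlock gesture label, and the threshold of 10 consecutive predictions are illustrative assumptions rather than the actual application code.

```python
# Sketch of the duration-threshold logic of Section 3.5: a system action fires
# only after the same gesture class has been predicted for enough consecutive
# inferences, and only the unlock gesture is honoured while the system is
# locked. Names and the threshold value are assumptions for illustration.
class GestureDebouncer:
    def __init__(self, sustain_frames=10, no_action="No System Action", unlock="Thumb Up"):
        # At roughly 8-15 fps, ~10 consecutive predictions correspond to a ~1 s gesture.
        self.sustain_frames = sustain_frames
        self.no_action = no_action
        self.unlock = unlock
        self.current = None   # gesture currently being sustained
        self.count = 0        # consecutive predictions of that gesture

    def update(self, predicted_class, locked):
        """Feed one per-inference prediction; return a gesture to act on, or None."""
        if predicted_class == self.current:
            self.count += 1
        else:
            self.current, self.count = predicted_class, 1

        if self.count != self.sustain_frames:     # fire exactly once per sustained gesture
            return None
        if self.current == self.no_action:
            return None
        if locked and self.current != self.unlock:
            return None                            # while locked, only unlocking is allowed
        return self.current                        # mapped to a system action upstream
```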
Figure 2: Modified C3D [29]. The network contains eight 3D convolutional layers (Convxx) with a kernel size of 3 × 3 × 3 and a stride of 1 in all dimensions. The number of filters is indicated in the corresponding boxes. Each convolutional layer is followed by Batch Normalization (BN). There are five 3D max-pooling layers (Poolx), each with a pooling kernel size and stride of 2 × 2 × 2, except for Pool1, which is 1 × 2 × 2. The series of convolutional, batch normalization, and pooling layers acts as the feature extraction phase, producing 3D feature maps. The network is closed off by the classification phase: two fully connected layers (FC) with 4096 neurons each and an output softmax layer whose size is the number of gesture classes.

Table 1: Accuracy of the Modified C3D model on the SKIG dataset. We first validated our chosen network architecture (Modified C3D) on the Sheffield Kinect Gesture (SKIG) [9] dataset. The data are divided into three folds, each containing all samples from two subjects: Fold A (subjects 1 and 2), Fold B (subjects 3 and 4), and Fold C (subjects 5 and 6). For both the RGB-only and RGB-with-depth (RGBD) modalities, the accuracy exceeds our benchmark accuracy of around 93%.

       | Fold A | Fold B | Fold C | Avg   | Stdev
RGB    | 92.22% | 94.17% | 95.00% | 93.8% | 1.43%
RGBD   | 93.06% | 93.61% | 98.33% | 95.0% | 2.90%

Figure 3: The Medical Image Navigation Application Interface. It is designed as a DICOM viewer-like application. The status (Locked or Unlocked) is displayed at the top-left corner. On the left-hand side is the list of available DICOM series for viewing. The center panel displays the current image/slice of interest in the current series. The video stream and a real-time visualization of our system's prediction are shown on the right-hand side of the application.

4 Training our hand gesture recognition network

In this section, we discuss the efforts carried out in training the network used in our prototype application. Due to the size of the datasets, all network training was performed on a Google Cloud instance with 24 GB of memory and a P100 GPU. We used the Lasagne [32] deep learning library to train the networks using Stochastic Gradient Descent (SGD) with Nesterov momentum, with the learning rate annealed from 1e-2 to 1e-6 by a factor of 0.1 whenever performance on the validation set did not improve. Moreover, we employed heavy data augmentation by manipulating the frames of the video samples: scaling inward and outward, random brightness and contrast, Gaussian noise, and random and multiple frame sampling of the training video clips.

4.1 Collecting and using our dataset

We first naively collected our own dataset with the Microsoft Kinect camera. Some example RGB frames with their corresponding depth frames can be seen in Figure 5. Samples from four users were incrementally used for training and samples from the remaining user were used for validation. Adding more users to the training set increased classification performance; however, with four users the quality of the network was still very poor. Since collecting and annotating new data is a long and tedious process, we decided to look for a public dataset that could help boost performance via transfer learning.

4.2 Leveraging a very large-scale hand gesture dataset

The 20BN-JESTER [33] dataset is a very large-scale hand gesture recognition dataset containing more than a hundred thousand densely labeled clips performed by a large number of crowd workers in front of a webcam or laptop camera.
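The optimization schedule described at the start of this section (SGD with Nesterov momentum and a learning rate annealed from 1e-2 to 1e-6 by a factor of 0.1 on validation plateaus) can be sketched as follows; this is a PyTorch re-expression for illustration, whereas the actual training used Lasagne, and the momentum and patience values are assumptions.

```python
# Sketch of the training schedule: SGD with Nesterov momentum, learning rate
# annealed by 0.1 whenever validation performance stops improving, bounded
# between 1e-2 and 1e-6. Momentum and patience values are assumed.
import torch

model = ModifiedC3D(num_classes=27)          # from the sketch in Section 3.2
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2,
                            momentum=0.9, nesterov=True)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="max", factor=0.1, patience=3, min_lr=1e-6)

# After each training epoch (with the heavy data augmentation applied to the
# video frames), step the scheduler with the measured validation accuracy:
val_accuracy = 0.0                           # placeholder for the measured accuracy
scheduler.step(val_accuracy)
```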
With transfer learning in mind, the selected network architecture was trained on the Jester dataset, applying all previously mentioned regularization techniques. Since our chosen network takes 4 channels as input and the Jester dataset only has RGB, a blank white image is fused in place of the depth frame, because white represents background in the Microsoft Kinect's depth images. Samples in the dataset have varying lengths, so at prediction time 16 frames sampled at equal intervals are extracted to represent the entire clip. Our trained network scored an overall accuracy of 94.52% on the official Jester validation set and around this value on the test set.

Figure 4: Overview of the Processing Pipeline. The camera continuously sends video frames; the system collects the last 16 frames, resizes them, and organizes them as input to our deep neural network. Every time the network performs an inference, the predicted gesture class is mapped to a system action. We check whether the current system action has been sustained long enough for an actual gesture to have taken place. We further reduce the triggering of system actions by unintended gestures by checking the application status (Locked or Unlocked).

Figure 5: Our Dynamic Hand Gesture Dataset. A Microsoft Kinect is used to capture the gestures of 5 users in the RGB and depth modalities. There are 16 gesture classes, 10 of which were taken from Jacob et al. [14], and a no-system-action class. Sequences are recorded against 3 different backgrounds (blue, green, and pink) and at 3 different scales (user distance of 1, 2, or 3 steps from the camera), performed with both the left and the right hand unless the gesture requires two hands. For each take, the users were asked to dress differently and wear different hairstyles to increase variability. The dataset includes 1400+ gesture samples.

4.3 Fine-tuning

We initialized the network with the weights of our Jester-trained network and performed fine-tuning with our own dataset. A sharp increase in performance compared to training from scratch is seen; however, the resulting System Action Accuracy is still quite poor at 67.58%. This might be because, in the Jester dataset, gestures are performed directly in front of the camera so the hand is a dominant part of the frame, whereas in our dataset gestures are performed standing at some distance from the Microsoft Kinect camera, making the hand relatively small compared to everything else in the frame. Another factor could be that the Jester dataset only has the RGB modality; had it also included depth information, performance might have been boosted further.

4.4 Utilizing the Jester-trained network

At this point, we had trained a robust network capable of making predictions for different users against different types of complex backgrounds, and we assumed it would also work for a surgeon inside the Operating Room. Instead of forcing the use of our own gesture lexicon, we used the gestures in the Jester dataset and mapped the most intuitively matching ones to our System Actions. Table 2 details the mapping and Figure 6 its performance. Table 3 summarizes the performance of training with the SKIG dataset, our dataset, and the Jester dataset. Compared to working with our own dataset, utilizing the Jester-trained network meets the criterion for robust gesture classification.
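The input preparation described in Section 4.2 can be sketched as follows: 16 frames sampled at equal intervals, resized to 96 × 96, and a blank white frame standing in for the missing depth channel. The function name and normalization are assumptions for illustration.

```python
# Sketch of clip preparation for the Jester-trained network: equal-interval
# sampling of 16 frames, resizing to 96x96, and a blank white image in place
# of the depth channel (white denotes background in Kinect depth images).
import numpy as np
import cv2

def prepare_clip(frames, num_frames=16, size=96):
    """frames: list of HxWx3 images -> array of shape (4, num_frames, size, size)."""
    idx = np.linspace(0, len(frames) - 1, num_frames).astype(int)        # equal intervals
    rgb = np.stack([cv2.resize(frames[i], (size, size)) for i in idx])   # (T, H, W, 3)
    depth = np.full((num_frames, size, size, 1), 255, dtype=np.uint8)    # blank white "depth"
    clip = np.concatenate([rgb, depth], axis=-1).astype(np.float32) / 255.0
    return clip.transpose(3, 0, 1, 2)                                    # (C, T, H, W)
```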
5 Evaluation

5.1 Baseline systems

We evaluated against prior work on HGR systems for OR usage. The baseline systems include REALISM [13], 3Dt+HMM without and with contextual cues (WC) [14], GestureHook [19], binarized (b-) and raw depth+2D-CNN [34], 3Dt+rule-based (RB) classification [20], and IR+CapsNet [35]. They are listed in Table 4. We note that we compare the systems as a whole, in which the number of action classes, the types of gesture, and the capture device are packaged and designed together as a system. We addressed the collective weaknesses of these systems, which include: the need for manual feature engineering or pre-processing procedures to achieve feasible recognition performance [13, 14, 34, 20]; constraints on the hand gesture lexicon to either static or movement hand gestures only, being unable to process both due to methodology [13, 14, 34, 35]; and reliance on the capabilities of capture devices that are more expensive and may not be as readily available in developing countries as an ordinary webcam [14, 19, 34, 20, 35].

Table 2: Jester Mapped Actions. Each gesture in the Jester dataset is intuitively mapped to one of our system actions.

Our System Action        | Jester Gesture
Previous Series          | Swiping Up
Next Series              | Swiping Down
Previous Image           | Swiping Left
Next Image               | Swiping Right
Browse Up                | Sliding Two Fingers Up
Browse Down              | Sliding Two Fingers Down
Browse Left              | Sliding Two Fingers Left
Browse Right             | Sliding Two Fingers Right
Play Series In Reverse   | Rolling Hand Backward
Play Series              | Rolling Hand Forward
Stop Playing Series      | Stop Sign; Pushing Hand Away
Increase Brightness      | Pushing Two Fingers Away
Decrease Brightness      | Pulling Two Fingers In
Rotate Counter-clockwise | Turning Hand Counterclockwise
Rotate Clockwise         | Turning Hand Clockwise
Zoom In                  | Zooming In With Full Hand; Zooming In With Two Fingers
Zoom Out                 | Zooming Out With Full Hand; Zooming Out With Two Fingers
Thumbs Up (Unlock)       | Thumb Up
Thumbs Down (Lock)       | Thumb Down
No System Action         | No gesture; Doing other things; Pulling Hand In; Drumming Fingers; Shaking Hand

Figure 6: Jester Mapped Actions Performance. The confusion matrix of the proposed mapping.

5.2 Test setup in the operating room

Given a brief timeframe, we attempted to validate our system by testing it in an actual operating room. We recorded and annotated a sequence of gestures to see how the system performs in real time. Figure 7 shows the confidence of our trained network over time, i.e., the output of the final softmax layer. Visually, the parts with sustained prediction confidence at 1.0 coincide with the ground truth. Quantitatively, continuous recognition on the test sequence yields a System Action Accuracy (SAA) of 94.95%, a No Action Precision (NAP) of 96.54%, and a No Action Recall (NAR) of 63.90%. The low NAR is attributed to confidence spikes for other gestures that correspond to unintended gestures; these generally do not trigger system actions since we employ a set duration threshold to mitigate them.
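For clarity, the sketch below shows one plausible way of computing these three metrics from paired ground-truth and predicted labels, following the metric descriptions given with Table 3; it is an assumed reading for illustration, not the authors' evaluation code.

```python
# Illustrative computation of the three reported metrics, treating "no action"
# as one class among the predictions. The exact definitions are an assumed
# reading of the descriptions accompanying Table 3.
NO_ACTION = "No System Action"

def hgr_metrics(y_true, y_pred):
    action_idx = [i for i, t in enumerate(y_true) if t != NO_ACTION]
    # SAA: accuracy over samples whose ground truth is a system action.
    saa = sum(y_pred[i] == y_true[i] for i in action_idx) / max(len(action_idx), 1)
    pred_no = [i for i, p in enumerate(y_pred) if p == NO_ACTION]
    true_no = [i for i, t in enumerate(y_true) if t == NO_ACTION]
    # NAP: of everything predicted as "no action", how much truly was no action.
    nap = sum(y_true[i] == NO_ACTION for i in pred_no) / max(len(pred_no), 1)
    # NAR: of all true "no action" samples, how many were predicted as such.
    nar = sum(y_pred[i] == NO_ACTION for i in true_no) / max(len(true_no), 1)
    return saa, nap, nar
```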
Table 3: Hand Gesture Recognition Results. We quantified the robustness of the trained network with the following metrics: System Action Accuracy (SAA) tells us how good the network is at identifying system action classes; No Action Precision (NAP) shows how resilient the network is against classifying intended gestures as unintended (i.e., as no system action); and No Action Recall (NAR) denotes how resilient the network is against classifying unintended gestures as intended.

                                            | SAA    | NAP    | NAR
SKIG Dataset (RGB)                          | 93.80% | -      | -
SKIG Dataset (RGBD)                         | 95.00% | -      | -
Own Dataset (1 user)                        | 0.96%  | 79.18% | 97.37%
Own Dataset (2 users)                       | 23.58% | 83.59% | 75.38%
Own Dataset (3 users)                       | 29.87% | 86.15% | 82.05%
Own Dataset (4 users)                       | 43.57% | 90.80% | 88.93%
Jester Validation Set                       | 94.52% | -      | -
Jester Test Set                             | 94.26% | -      | -
Jester pre-training + Own Dataset (4 users) | 67.58% | 94.65% | 91.44%
Jester Mapped Actions                       | 94.48% | 96.62% | 97.80%

Table 4: Comparison with Baseline Systems. We relate our work to the baselines through their reported System Action Accuracy (SAA), along with the type of gestures used (static, dynamic, or both) and the number of System Actions/commands.

                    | Capture Device | Gesture Type | Number of System Actions | System Action Accuracy (SAA)
REALISM [13]        | webcam         | static       | 5                        | < 80%
3Dt+HMM [14]        | Kinect         | dynamic      | 10                       | 93.60%
3Dt+HMM WC [14]     | Kinect         | dynamic      | 10                       | 92.58%
GestureHook [19]    | LMC            | dynamic      | 8                        | 92%
depth+2D-CNN [34]   | ToF            | static       | 10                       | 94.86%
b-depth+2D-CNN [34] | ToF            | static       | 10                       | 92.07%
3Dt+RB [20]         | LMC            | dynamic      | 10                       | 95.83%
IR+CapsNet [35]     | LMC            | static       | 5                        | 86.46%
Ours                | webcam         | both         | 19                       | 94.48%

5.3 Usability evaluation with surgeons

We created an online survey form for the evaluation of the hand gestures as well as the overall usability of the application. There were 11 survey respondents from 7 institutions with varying experience, ranging from fellows in post-residency training to attending physicians from different specialties such as General Surgery, Pediatric Surgery, Surgical Oncology, and Surgical Endoscopy. To avoid biased feedback, our main consulting surgeon did not participate. Participants were first introduced to the purpose of the research and then walked through the application by being shown images of the setup and video clip recordings of each system action being triggered in the system by its corresponding hand gesture. We asked them to choose from Strongly Disagree, Disagree, Neutral, Agree, and Strongly Agree for each criterion, and we converted their answers numerically from 1 to 5, respectively, for analysis. At the end of the form, they were also asked to evaluate and provide their overall thoughts on the usability of the system.

5.4 Hand gesture lexicon

For the hand gestures to be feasible, they should be positively rated on the following criteria [12]: Intuitiveness (the gesture intuitively matches the system action performed), Ease of Use (the gesture is easy and comfortable to perform), and Memorability/Ease of Remembrance. Additionally, we included Appropriateness for use inside the OR. Table 5 shows the detailed ratings of the surgeons in our evaluation survey. All of the hand gestures received a mean rating greater than 3 (Neutral), and the majority received more than 4 (Agree). Most surgeons did not provide any suggestions for improvement, implying they were content with the gestures. With this, we can say that we have a workable initial set of hand gestures for the application just by using the ones in the Jester dataset.

A few of the hand gestures, particularly those for brightness and orientation manipulation and for playing the series, received relatively lower ratings across all criteria. Some of the comments mention a preference for gestures involving smaller movements.
Some suggested highly subjective gestures and additional functionalities without agreement from other respondents. To determine the final set of proposed gestures for a Medical Image Navigation application in the operating room, we proceed as follows:

1. If the gesture received consistently lower scores across all criteria, we replace it with:
(a) the improved gesture suggested by at least two surgeons in agreement in the evaluation; or else
(b) the improved gesture suggested by at least one surgeon in agreement with Jacob et al. [14].

2. If there are no suggestions or if the gesture received satisfactory ratings, we keep the gesture.

The resulting improved hand gesture lexicon can be found in Table 6. We note that gestures can have different meanings across cultures, hence some of them might only be suitable in our local setting.

Figure 7: Test inside the operating room. (a) Visualization of the continuous hand gesture recognition during the test. (b) The system hardware (a laptop with a webcam in this case) is placed in a convenient location adhering to sterility rules. To consult the patient's medical images, the surgeon moves in front of the camera and gestures to the system.

5.5 Overall usability

For the overall usability of the application, we measured the following criteria: Usefulness, Efficiency, Learnability, and Satisfaction. Figure 8 depicts the ratings given by the survey respondents. Feedback from surgeons is strongly positive, with a mean score between 4 and 5 (Agree to Strongly Agree) for all of our criteria. At 95% probability, the confidence intervals of the mean rating using the t-distribution are 4.36 ± 0.62 for Usefulness, 4.09 ± 0.82 for Efficiency, 4.27 ± 0.56 for Learnability, and 4.36 ± 0.54 for Satisfaction. Only one surgeon disagreed with the usefulness of the application and the efficiency it would bring to productivity. The consensus is that the system is easily learnable and quite useful. The majority did not leave further comments, but some suggested additional functionalities such as contrast management, measuring, incorporating voice commands, and using additional monitor screens to display the interface and video feed.

Figure 8: Evaluation survey overall usability rating distribution. Most of the respondents were generally satisfied, agreeing with the system's usefulness, efficiency, and learnability, suggesting high viability of the application.

5.6 Limitations

Due to restrictions on physical meetings brought about by ongoing local policies regarding community quarantines, evaluating the system with multiple surgeons in an OR session has been a roadblock. However, we believe that the results on the Jester evaluation set (a highly variable dataset containing 14,000+ samples), coupled with our brief test in an operating room, translate to a generally feasible hand gesture recognition performance.

For the framework to be seamlessly applied to any application in the operating room, or to any other domain, there should be a capability for defining custom gestures [1]. With the current approach, it is difficult to achieve satisfactory performance for new gestures without acquiring a large enough number of samples, as shown in Table 3. An extension of this work that could mitigate this issue is to integrate a One- or Few-Shot Learning mechanism for hand gestures. Lu et al. [36] implemented such a mechanism by using a pre-trained network for feature extraction and then employing a distance measure.
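As a rough illustration of that direction, the sketch below classifies a new clip by comparing its embedding from a pre-trained network against averaged class prototypes built from a handful of examples; the embedding callable, the cosine distance, and the prototype averaging are assumptions for illustration, not the method of [36].

```python
# Rough sketch of a one/few-shot extension: embed clips with a pre-trained
# network used as a feature extractor and assign a new clip to the nearest
# class prototype. Distance choice and averaging are assumptions.
import numpy as np

def cosine_distance(a, b):
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

def build_prototypes(embed, support_clips):
    """support_clips: dict gesture_name -> list of clips (a few examples each)."""
    return {name: np.mean([embed(clip) for clip in clips], axis=0)
            for name, clips in support_clips.items()}

def classify(embed, prototypes, clip):
    feats = embed(clip)  # embed: clip -> feature vector from the pre-trained network
    return min(prototypes, key=lambda name: cosine_distance(feats, prototypes[name]))
```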
Table 5: Mapped Jester gestures evaluation results. Participants were asked to choose among strongly disagree (1), disagree (2), neutral (3), agree (4), and strongly agree (5) for each criterion for each gesture. The values displayed are the confidence intervals of the mean hand gesture ratings using the t-distribution at 95% probability.

System Action            | Hand Gesture                | Appropriateness | Intuitiveness | Ease of Use | Memorability
Unlock                   | Thumb Up                    | 4.00 ± 0.74 | 3.54 ± 0.70 | 4.09 ± 0.36 | 4.27 ± 0.31
Lock                     | Thumb Down                  | 3.91 ± 0.76 | 3.91 ± 0.76 | 4.18 ± 0.27 | 4.27 ± 0.31
Previous Series          | Swipe Up                    | 4.09 ± 0.36 | 4.09 ± 0.36 | 3.91 ± 0.56 | 4.09 ± 0.36
Next Series              | Swipe Down                  | 4.00 ± 0.42 | 4.00 ± 0.42 | 3.91 ± 0.47 | 4.00 ± 0.42
Previous Image           | Swipe Left                  | 4.36 ± 0.34 | 4.27 ± 0.43 | 4.27 ± 0.43 | 4.36 ± 0.34
Next Image               | Swipe Right                 | 4.36 ± 0.34 | 4.18 ± 0.50 | 4.09 ± 0.56 | 4.36 ± 0.34
Play Series              | Roll Hand Forward           | 3.73 ± 0.68 | 3.73 ± 0.61 | 3.45 ± 0.35 | 3.64 ± 0.69
Play Series In Reverse   | Roll Hand Backward          | 3.91 ± 0.56 | 3.64 ± 0.62 | 3.45 ± 0.76 | 3.64 ± 0.69
Stop Series              | Palm Facing Screen          | 4.45 ± 0.35 | 4.18 ± 0.72 | 4.36 ± 0.45 | 4.27 ± 0.53
Pan Up                   | Slide Two Fingers Up        | 4.18 ± 0.27 | 3.91 ± 0.47 | 4.18 ± 0.27 | 4.09 ± 0.36
Pan Down                 | Slide Two Fingers Down      | 4.18 ± 0.27 | 4.00 ± 0.42 | 4.09 ± 0.36 | 4.09 ± 0.36
Pan Left                 | Slide Two Fingers Left      | 4.27 ± 0.31 | 4.18 ± 0.41 | 4.27 ± 0.31 | 4.27 ± 0.85
Pan Right                | Slide Two Fingers Right     | 4.27 ± 0.31 | 4.18 ± 0.41 | 4.27 ± 0.31 | 4.27 ± 0.31
Increase Brightness      | Push Two Fingers Away       | 3.82 ± 0.72 | 3.27 ± 0.74 | 3.73 ± 0.75 | 3.55 ± 0.87
Decrease Brightness      | Pull Two Fingers In         | 3.82 ± 0.72 | 3.45 ± 0.72 | 3.73 ± 0.82 | 3.45 ± 0.92
Rotate Counter-clockwise | Turn Hand Counter-clockwise | 4.18 ± 0.27 | 3.73 ± 0.61 | 3.91 ± 0.63 | 3.64 ± 0.75
Rotate Clockwise         | Turn Hand Clockwise         | 4.00 ± 0.52 | 3.55 ± 0.82 | 3.73 ± 0.61 | 3.36 ± 0.69
Zoom In                  | Open Two Fingers / Hand     | 4.45 ± 0.35 | 4.45 ± 0.35 | 4.36 ± 0.45 | 4.45 ± 0.35
Zoom Out                 | Close Two Fingers / Hand    | 4.45 ± 0.35 | 4.45 ± 0.35 | 4.36 ± 0.45 | 4.45 ± 0.35

6 Conclusion

In this paper, we demonstrated that a straightforward Deep Learning and Computer Vision-based framework is a viable solution for maintaining sterility in the operating room. We implemented an end-to-end, real-time, robust Hand Gesture Recognition System for use inside the operating room, in the form of a Medical Image Navigation application that is not dependent on the capture device and is positively evaluated by surgeons. General feedback from our local surgeons shows receptiveness and willingness to adopt this technology. Furthermore, we defined a set of suitable hand gestures for the application. This set, coupled with the framework, can serve as a foundation for building and deploying hand gesture-controlled applications in our operating rooms as well as in other, more lenient settings requiring sterility maintenance.

Acknowledgement

We would like to thank the following medical professionals who helped us with this work: Dr. Alvin Caballes of the University of the Philippines Philippine General Hospital (UP PGH), Department of Surgery, our primary consulting surgeon during the creation and evaluation of the prototype application; Dr. Cecilia Amicis and the Radiology Department of the Veterans Memorial Medical Center (VMMC), who provided anonymized medical images from a real case study; and Dr. Jose Abellera and the Surgery Department of the National Children's Hospital (NCH), who helped and allowed us to perform a brief test setup in one of their operating rooms.
Table 6: Suggested hand gesture lexicon for a medical image navigation application. An initial set of gestures was defined by mapping the System Actions to the Jester dataset [33] gestures. This was refined based on the results of our evaluation survey coupled with the ethnographic study conducted by Jacob et al. [14]. It was raised that if functionalities for confirmation were added, the thumb up and thumb down gestures would be more appropriate for answering yes and no, respectively.

System Action            | Hand Gesture             | Details
Unlock                   | Thumb Up                 | Thumb up, other fingers tucked in.
Lock                     | Thumb Down               | Thumb down, other fingers tucked in.
Go To Previous Series    | Swipe Up                 | Palm facing upward. Smaller movement of four fingers pivoting upward.
Go To Next Series        | Swipe Down               | Palm facing downward. Smaller movement of four fingers pivoting downward.
Go To Previous Image     | Swipe Left               | Palm facing camera or side. Move hand to the left.
Go To Next Image         | Swipe Right              | Palm facing camera or side. Move hand to the right.
Play Series              | Swipe Down (Side View)   | Side view of the hand facing the screen. Palm facing downward. Smaller movement of four fingers pivoting downward.
Play Series In Reverse   | Swipe Up (Side View)     | Side view of the hand facing the screen. Palm facing upward. Smaller movement of four fingers pivoting upward.
Stop Playing Series      | Stop Sign                | Open hand, palm facing the screen.
Pan Up                   | Slide Two Fingers Up     | Point index and middle finger. Move hand or two fingers upward.
Pan Down                 | Slide Two Fingers Down   | Point index and middle finger. Move hand or two fingers downward.
Pan Left                 | Slide Two Fingers Left   | Point index and middle finger. Move hand or two fingers towards the left.
Pan Right                | Slide Two Fingers Right  | Point index and middle finger. Move hand or two fingers towards the right.
Increase Brightness      | Push Hand Away           | Palm facing camera, move open hand towards the camera. Taken from [14].
Decrease Brightness      | Pull Hand In             | Back of hand facing camera, move open hand away from the camera. Taken from [14].
Rotate Counter-clockwise | Swipe Counter-clockwise  | Palm facing camera, wave hand to the left (counter-clockwise). Taken from [14].
Rotate Clockwise         | Swipe Clockwise          | Palm facing camera, wave hand to the right (clockwise). Taken from [14].
Zoom In                  | Open Two Fingers         | Index and thumb touching initially, other fingers tucked in. Move index and thumb away from each other.
Zoom Out                 | Close Two Fingers        | Index and thumb away from each other initially, other fingers tucked in. Move index and thumb towards each other until they touch.

References

[1] A. Bigdelou, Operating Room Specific Domain Model for Usability Evaluations and HCI Design. PhD thesis, Technical University of Munich, 2012.

[2] R. Wipfli, V. Dubois-Ferrière, S. Budry, P. Hoffmeyer, and C. Lovis, "Gesture-Controlled Image Management for Operating Room: A Randomized Crossover Study to Compare Interaction Using Gestures, Mouse, and Third Person Relaying," PLoS ONE, vol. 11, no. 4, p. e0153596, 2016. https://doi.org/10.1371/journal.pone.0153596.

[3] W. T. Freeman and M. Roth, "Orientation histograms for hand gesture recognition," 1995.

[4] C.-C. Chang, I.-Y. Chen, and Y.-S. Huang, "Hand pose recognition using curvature scale space," vol. 2, pp. 386-389, 2002. https://doi.org/10.1109/ICPR.2002.1048320.

[5] P. Premaratne, Human Computer Interaction Using Hand Gestures. Springer Science+Business Media Singapore, 2014.
https://doi.org/10.1007/978-3-642-14831-6_51.

[6] S. Conseil, S. Bourennane, and L. Martin, "Comparison of Fourier descriptors and Hu moments for hand posture recognition," in 2007 15th European Signal Processing Conference, pp. 1960-1964, 2007. https://doi.org/10.5281/zenodo.40606.

[7] J. Singha and K. Das, "Hand gesture recognition based on Karhunen-Loeve transform," Mobile and Embedded Technology International Conference (MECON), 2013.

[8] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016. https://www.deeplearningbook.org.

[9] L. Liu and L. Shao, "Learning discriminative representations from RGB-D video data," in Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence, IJCAI '13, pp. 1493-1500, AAAI Press, 2013.

[10] C. Grätzel, T. Fong, S. Grange, and C. Baur, "A non-contact mouse for surgeon-computer interaction," Technology and Health Care: Official Journal of the European Society for Engineering and Medicine, vol. 12, pp. 245-257, 2004. https://doi.org/10.3233/THC-2004-12304.

[11] J. Wachs, H. Stern, Y. Edan, M. Gillam, C. Feied, M. Smith, and J. Handler, "Gestix: A doctor-computer sterile gesture interface for dynamic environments," in Soft Computing in Industrial Applications (A. Saad, K. Dahal, M. Sarfraz, and R. Roy, eds.), Berlin, Heidelberg: Springer Berlin Heidelberg, pp. 30-39, 2007. https://doi.org/10.1007/978-3-540-70706-6_3.

[12] A. Hurstel and D. Bechmann, "Approach for intuitive and touchless interaction in the operating room," J, vol. 2, pp. 50-64, 2019. https://doi.org/10.3390/j2010005.

[13] M. Achacon, David Louis Jr., D. M. Carlos, M. Kaye Puyaoan, C. T. Clarin, and P. Naval, "REALISM: Real-time hand gesture interface for surgeons and medical experts," 2010.

[14] M. G. Jacob, J. P. Wachs, and R. A. Packer, "Hand-gesture-based sterile interface for the operating room using contextual cues for the navigation of radiological images," J Am Med Inform Assoc, vol. 20, pp. e183-e186, 2013. https://doi.org/10.1136/amiajnl-2012-001212.

[15] G. C. Ruppert, L. O. Reis, P. H. Amorim, T. F. de Moraes, and J. V. da Silva, "Touchless gesture user interface for interactive image visualization in urological surgery," World J Urol, vol. 30, pp. 687-691, 2012. https://doi.org/10.1007/s00345-012-0879-0.

[16] M. Strickland, J. Tremaine, G. Brigley, and C. Law, "Using a depth-sensing infrared camera system to access and manipulate medical imaging from within the sterile operating field," Can J Surg, vol. 56, pp. 1-6, 2013. https://doi.org/10.1503/cjs.035311.

[17] J. H. Tan, C. Chao, M. Zawaideh, A. C. Roberts, and T. B. Kinney, "Informatics in Radiology: developing a touchless user interface for intraoperative image control during interventional radiology procedures," Radiographics, vol. 33, no. 2, pp. 61-70, 2013. https://doi.org/10.1148/rg.332125101.

[18] B. Pavaloiu, "Leap motion technology in learning," pp. 1025-1031, 2017. https://doi.org/10.15405/epsbs.2017.05.02.126.

[19] B. J. Park, T. Jang, J. W. Choi, and N. Kim, "Gesture-Controlled Interface for Contactless Control of Various Computer Programs with a Hooking-Based Keyboard and Mouse-Mapping Technique in the Operating Room," Computational and Mathematical Methods in Medicine, vol. 2016, p. 5170379, 2016. https://doi.org/10.1155/2016/5170379.

[20] P. Sa-nguannarm, T. Charoenpong, C. Chianrabutra, and K.
Kiatsoontorn, "A method of 3D hand movement recognition by a leap motion sensor for controlling medical image in an operating room," in 2019 First International Symposium on Instrumentation, Control, Artificial Intelligence, and Robotics (ICA-SYMP), pp. 17-20, 2019. https://doi.org/10.1109/ICA-SYMP.2019.8645985.

[21] R. Galati, M. Simone, G. Barile, R. De Luca, C. Cartanese, and G. Grassi, "Experimental Setup Employed in the Operating Room Based on Virtual and Mixed Reality: Analysis of Pros and Cons in Open Abdomen Surgery," J Healthc Eng, vol. 2020, p. 8851964, 2020. https://doi.org/10.1155/2020/8851964.

[22] J. Shelton and G. Kumar, "Comparison between auditory and visual simple reaction times," Neuroscience & Medicine, vol. 1, pp. 30-32, 2010. https://doi.org/10.4236/nm.2010.11004.

[23] Y. Li, "Multi-scenario gesture recognition using Kinect," in Proceedings of the 2012 17th International Conference on Computer Games: AI, Animation, Mobile, Interactive Multimedia, Educational & Serious Games (CGAMES), pp. 126-130, IEEE Computer Society, 2012. https://doi.org/10.1109/CGames.2012.6314563.

[24] A. Tang, K. Lu, Y. Wang, J. Huang, and H. Li, "A real-time hand posture recognition system using deep neural networks," ACM Trans. Intell. Syst. Technol., vol. 6, pp. 21:1-21:23, 2015. https://doi.org/10.1145/2735952.

[25] C. Yang, Y. Jang, J. Beh, D. Han, and H. Ko, "Gesture recognition using depth-based hand tracking for contactless controller application," in 2012 IEEE International Conference on Consumer Electronics (ICCE), pp. 297-298, 2012. https://doi.org/10.1109/ICCE.2012.6161876.

[26] J. Li, S. Zhang, and T. Huang, "Multi-scale 3D convolution network for video based person re-identification," Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 8618-8625, 2019. https://doi.org/10.48550/arXiv.1811.07468.

[27] D. Cheng, S. Xiang, C. Shang, Y. Zhang, F. Yang, and L. Zhang, "Spatio-temporal attention-based neural network for credit card fraud detection," Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 362-369, 2020. https://doi.org/10.1109/ICIP.2019.8803152.

[28] P. Pandey, A. P. Prathosh, M. Kohli, and J. Pritchard, "Guided weak supervision for action recognition with scarce data to assess skills of children with autism," Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 463-470, 2020. https://doi.org/10.48550/arXiv.1911.04140.

[29] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, "Learning spatiotemporal features with 3D convolutional networks," in Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), pp. 4489-4497, IEEE Computer Society, 2015. https://doi.org/10.1109/ICCV.2015.510.

[30] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," CoRR, vol. abs/1409.1556, 2014.

[31] G. Bradski, "The OpenCV Library," Dr. Dobb's Journal of Software Tools, 2000.

[32] S. Dieleman, J. Schlüter, C. Raffel, E. Olson, S. K. Sønderby, D. Nouri, et al., "Lasagne: First release," Aug. 2015. https://doi.org/10.5281/zenodo.27878.

[33] J. Materzynska, G. Berger, I. Bax, and R. Memisevic, "The Jester dataset: A large-scale video dataset of human gestures," in 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), pp. 2874-2882, 2019.
https://doi.org/10.1109/ICCVW.2019.00349.

[34] E. Nasr-Esfahani, N. Karimi, S. M. R. Soroushmehr, M. H. Jafari, M. A. Khorsandi, S. Samavi, and K. Najarian, "Hand gesture recognition for contactless device control in operating rooms," CoRR, vol. abs/1611.04138, 2016. https://doi.org/10.48550/arXiv.1611.04138.

[35] A.-r. Lee, Y. Cho, S. Jin, and N. Kim, "Enhancement of surgical hand gesture recognition using a capsule network for a contactless interface in the operating room," Computer Methods and Programs in Biomedicine, vol. 190, p. 105385, 2020. https://doi.org/10.1016/j.cmpb.2020.105385.

[36] Z. Lu, S. Qin, X. Li, L. Li, and D. Zhang, "One-shot learning hand gesture recognition based on modified 3D convolutional neural networks," Machine Vision and Applications, vol. 30, 2019. https://doi.org/10.1007/s00138-019-01043-7.