ERK'2021, Portorož, 386-389

Designing a Machine Learning based Non-intrusive Load Monitoring Classifier

Leo Ogrizek 1, Blaz Bertalanic 1,2, Gregor Cerar 1, Marko Meza 2, Carolina Fortuna 1
1 Jozef Stefan Institute, Ljubljana, Slovenia
2 Faculty of Electrical Engineering, University of Ljubljana, Slovenia
E-mail: lo7909@student.uni-lj.si

Abstract. Non-intrusive load monitoring (NILM) provides users with detailed information about the electricity consumption of their appliances and gives energy providers better insight into the usage patterns of their clients. It can also be used to improve elderly care, support legal services and optimise energy consumption. While there is plenty of work on NILM appliance classification, in this paper we investigate the design trade-offs in the process of developing a machine learning based classification model from the perspective of feature engineering, model selection and optimisation. Our work shows that well engineered features have a greater impact on model performance than the selection of the machine learning technique. According to the results, the improvement in f1 score between non-engineered and the proposed engineered features is up to 42 percentage points, while the improvement between the worst non-optimised model and the best optimised one is 19 percentage points.

1 Introduction

Non-intrusive load monitoring (NILM) is the process of estimating the electricity consumption of individual appliances in a household from their combined electricity consumption. By providing detailed information to the user, NILM can help reduce energy usage by 5-15% [1]. It is also beneficial for energy providers, as the additional information allows them to tune electricity production and better adjust their pricing plans.

Data about appliance usage has additional applications outside the energy sector. It provides insight into the daily activities of residents, which can be used in various fields including health and medical care (discovering sleep disorders, remotely monitoring the elderly), legal services (monitoring curfews) and commerce (customer profiling). The benefit of measuring electricity consumption at a single point rather than on every device is ease of deployment and lower installation and maintenance costs.

For applications such as elderly care and curfew monitoring, events triggered by the residents, for example a toaster being used, need to be automatically detected and distinguished from automatic events, such as air conditioning being activated by a thermostat. When unusual, possibly anomalous patterns in the usage are detected, such as decreased home activity, an alert needs to be sent to the caretakers, allowing them to intervene sooner. This is possible through the use of NILM classification [2]. In a similar manner, when anomalous behaviour is detected from people under house arrest, an alert can be dispatched to the appropriate authorities automatically.

NILM can also be used for predicting trends in energy consumption. Having a view of the future allows energy producers to prepare and tune production in accordance with the projected requirements. Accurate consumption predictions reduce excess energy production and, in turn, reduce the negative effects on the environment. NILM by itself can reduce power consumption by increasing awareness; some NILM-powered systems can also turn off unneeded devices [3, 1].
Within small communities, NILM allows for a better implementation of smart energy distribution systems and peer-to-peer energy trading by providing a local producer with information about whether energy is needed locally or can be returned to the grid [4].

Machine learning (ML) techniques have been shown to be suitable for addressing all NILM related problems [5]. This is realized by formulating a NILM classification problem to be solved and selecting high quality data that the ML techniques can use to discriminate between the target classes. The data, the ML techniques, and the training and evaluation process result in a model able to distinguish the target classes based on previously learned patterns in the data.

In this paper we aim to understand the design trade-offs for developing a NILM classifier. The contributions of this paper are as follows:

• We systematically analyze the performance of feature engineering, model selection and optimization for NILM on the UK-DALE dataset.

• We propose a best feature set constructed to capture the shape of the time series and we show that it performs 42 percentage points better than the baseline raw time series.

The paper is organized as follows. Section 2 presents related work, Section 3 formulates the problem and identifies the dataset, Section 4 discusses the feature selection process, while Section 5 elaborates on the model selection. Finally, Section 6 concludes the paper.

2 Related Work

In the last decade a large body of work on NILM classification has been published. The traditional way of developing such classifiers used various signal processing techniques. For instance, in [6] the authors classify appliances based on transient effects by applying FFT transforms to higher frequency measurements.

As the performance of ML techniques improved, they have been increasingly considered as potential techniques for realising NILM. In [5], the authors provide an overview of the field by analysing suitable features and ML techniques for different types of data, i.e. low frequency, high frequency, etc. They also provide an overview of model selection. In [7] the authors develop a KNN classifier that is trained using U/I trajectories as features. In [8], the authors compare four algorithms for disaggregating appliances using multi-label classification.

More recently, authors have been using deep learning to solve the NILM classification problem. In [9] low frequency data is used to detect on/off events. Classifications are made based on these events using a deep neural network classifier trained on the average power consumption and the minimum and maximum time an appliance is powered on. In [10] the authors use transient power signals to train a convolutional neural network to classify devices, while in [11] a recurrent neural network is trained on denoised data.

Our work complements [5] by providing a quantitative consideration of the design trade-offs when designing a NILM classifier using classical explainable machine learning models.

3 Problem statement

We define our NILM classification problem as follows. Given an input time series T representing energy measurements from households, there is a function f that maps the time series to a set of target classes C representing different household appliances, as in equation (1):

C = f(T)    (1)

where the set of target classes is C = {computer monitor, laptop, television, washer dryer, microwave, boiler, toaster, kettle, fridge}.
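As a minimal illustration of this formulation, the sketch below fixes the target class set and the interface of the mapping in equation (1); the names used (CLASSES, classify) are our own placeholders and not part of the original implementation.

```python
# Illustrative framing of the classification task in equation (1);
# CLASSES and classify are placeholders, not code from the paper.
from typing import Sequence

CLASSES = ["computer monitor", "laptop", "television", "washer dryer",
           "microwave", "boiler", "toaster", "kettle", "fridge"]

def classify(power_series: Sequence[float]) -> str:
    """Map one disaggregated power time series T to a single class in C.

    A trained model (e.g. an SVM over engineered features, as in
    Sections 4 and 5) would implement this mapping; here only the
    interface is fixed.
    """
    raise NotImplementedError
```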
The assumption is that disaggregation into individual time series has already been performed and that the classification is done on the resulting time series. The classifier is realized using classical machine learning techniques and the UK-DALE dataset to develop a model able to discriminate between the classes. We consider the following diverse but explainable set of techniques: SVM, KNN, Random forest (RF), MLP and Logistic regression (LR) from the scikit-learn library, version 0.22.2. Deep learning methods were not used due to the limited amount of data in the UK-DALE dataset.

3.1 Dataset summary

The UK-DALE (Domestic Appliance-Level Electricity) dataset [12] contains 180000 measurements of power usage per device, taken at 6 second intervals (0.1667 Hz), for a total of 300 hours of measurements per device. The devices in the dataset are common household appliances, already listed in Section 3. The data is split into 1 hour long segments, so each dataset sample contains a time series with 600 data points, as depicted in Figure 1. The measurements are disaggregated, taken with a smart plug on every device [12].

Figure 1: Samples for each appliance, showing power in relation to time over a 1 hour interval.

3.2 Methodology

UK-DALE contains univariate time series that, if used raw, may lead to suboptimal model performance. A common practice in the ML community is to construct synthetic features from the raw time series, e.g. statistical summaries, feature interactions, etc. As some of the considered techniques, such as SVM and logistic regression, are sensitive to scaling, we used standard scaling to bring all features to comparable scales.

During the model development process, we use 5-fold cross validation realized with the scikit-learn K-fold implementation. We use each selected ML technique with its default parameter configuration as a baseline and then provide an optimized version using the scikit-learn implementation of grid search. For the SVM, grid search optimized the kernel and the C value; for K-nearest-neighbours it optimized the number of neighbours; for the MLP it optimized the solver, the initial learning rate and the learning rate schedule; for the Random forest it optimized the number of estimators, and for logistic regression the regularization parameter.

The models were evaluated using standard classification metrics: precision = TP / (TP + FP), recall = TP / (TP + FN) and f1 = 2 · precision · recall / (precision + recall) for every class, together with their mean scores for every fold, where TP, FP and FN stand for true positives, false positives and false negatives. The final results presented in this paper are the mean scores over all 5 folds.
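A minimal sketch of how this methodology could be assembled with scikit-learn is shown below. The feature matrix X, the labels y, the nesting of the grid search inside the cross-validation loop and the candidate parameter values are our own assumptions, intended only to illustrate the workflow rather than reproduce the exact experimental setup.

```python
# Minimal sketch of the evaluation methodology (standard scaling, grid search,
# 5-fold cross validation) with scikit-learn. The parameter grid is a placeholder.
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def evaluate_svm(X, y):
    # Scaling is part of the pipeline so it is fitted on the training folds only.
    model = make_pipeline(StandardScaler(), SVC())

    # Grid search over the kernel and the C value (illustrative candidate values).
    grid = GridSearchCV(
        model,
        param_grid={"svc__kernel": ["rbf", "linear"],
                    "svc__C": [1, 100, 10000, 38000]},
        scoring="f1_macro",
        cv=5,
    )

    # 5-fold cross validation; macro-averaged precision, recall and f1 are
    # computed per fold and then averaged over the folds.
    scores = cross_validate(
        grid, X, y,
        cv=KFold(n_splits=5, shuffle=True, random_state=0),
        scoring=("precision_macro", "recall_macro", "f1_macro"),
    )
    return {name: float(np.mean(vals))
            for name, vals in scores.items() if name.startswith("test_")}
```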
4 Feature selection

By manually inspecting the appliances in the dataset, we noticed that they differ in overall power consumption and in how constant their energy usage is (see Figure 1). For feature engineering we therefore chose the features that best capture the shape of the time series. The first created features were the mathematical moments (and the squares of their values), the minimum, maximum, peak-to-peak, sum and median. These features alone do not describe the energy usage spikes very well, so a second series was created from every sample, containing the derivatives over all the intervals between measurements. The number of times this derivative changes sign, referred to as nturns, describes how many power spikes a device has during the measurement period. The sum of the absolute values of all the derivatives, referred to as d_abs, describes the intensity of these spikes. The features were calculated for each sample separately to avoid data leakage between the training and testing datasets.

Table 1: Comparison of feature sets using SVM.

Feature set        Precision  Recall  f1
raw                0.486      0.460   0.421
min, max, median   0.746      0.664   0.661
mean, std          0.714      0.646   0.642
all                0.851      0.837   0.835
best               0.851      0.835   0.834

We performed a guided evaluation of the features described in this section; the best results, the raw time series baseline and the standard statistics are summarized in Table 1. Using SVM to explore the performance of various feature vectors, our experiments showed that the most useful features are those that describe the shape of the time series. Inputting the raw data, which makes every timestamp a feature, yields an f1 score of 0.421. The results are better than random because of the difference in the magnitude of the values between different classes.

Using the minimum, maximum and median value of the time series results in an f1 score of 0.661. These results are significantly better because the features describe the shape of the data rather than assuming that a particular timestamp is relevant to the end result. With synthetic features, shifting the series in time would still produce the same feature values, unlike when the raw data is used. Because measurements can start and finish at any time, this makes for a better classifier. For the same reason, using the mean and standard deviation also yields better results than the raw data, with an f1 score of 0.642.

Using all the synthetic features results in an f1 score of 0.835. This is considerably better because the features describe more properties of the time series. However, some of the features are redundant and can be dropped to improve computation time. The final feature set determined to be the best is [mean, skew, kurtosis, peak-to-peak, variance, nturns, d_abs, standard deviation]. This means the baseline feature vector of length 600 was reduced to a feature vector of length 8. These features give almost identical results to using all the synthetic features, with an f1 score of 0.834, but at a lower computational cost.

Table 2: Per class performance, SVM, best feature set.

Class             Precision  Recall  f1
computer monitor  0.882      0.720   0.780
laptop computer   0.897      0.800   0.838
television        0.958      0.927   0.941
washer dryer      0.952      0.700   0.804
microwave         0.679      0.710   0.687
boiler            0.945      0.937   0.940
toaster           0.705      0.940   0.806
kettle            0.684      0.806   0.739
fridge            0.961      0.980   0.970

A per class breakdown of the best results, shown in Table 2, reveals that the fridge, boiler and television are classified best, while the microwave and kettle are classified poorly. The former have very distinct energy consumption patterns: the fridge has low consumption, is usually turned off and can be distinguished from small devices by its large and long lasting energy consumption spikes; the boiler has low consumption when not in use but consumes a lot of power when turned on; the television uses a medium amount of energy with much more emphasised variations in consumption. The latter share properties with each other and with other low consumption devices such as the toaster and the monitor: all are powered on for short periods of time and all have highly variable energy usage. Certain samples contain devices that are powered off throughout the whole sample; such samples cannot be distinguished from one another because they all share the same characteristic of no consumption.
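To make the feature engineering concrete, the sketch below shows one way the eight-feature vector could be computed per sample with numpy and scipy. The function name, the handling of flat segments when counting sign changes, and the feature ordering are our own choices rather than details taken from the paper.

```python
# Sketch of the best feature set: eight shape-describing statistics computed
# per sample. extract_features and its internals are our own placeholders.
import numpy as np
from scipy.stats import kurtosis, skew

def extract_features(sample):
    """Reduce one power time series to the 8-dimensional feature vector
    [mean, skew, kurtosis, peak-to-peak, variance, nturns, d_abs, std]."""
    x = np.asarray(sample, dtype=float)
    diff = np.diff(x)                      # derivative over each 6 s interval

    # nturns: number of sign changes of the derivative (number of power spikes);
    # flat segments (zero derivative) are skipped here by assumption.
    signs = np.sign(diff)
    signs = signs[signs != 0]
    nturns = int(np.count_nonzero(signs[1:] != signs[:-1]))

    # d_abs: sum of absolute derivatives (intensity of the spikes).
    d_abs = float(np.sum(np.abs(diff)))

    return np.array([
        x.mean(),
        skew(x),
        kurtosis(x),
        np.ptp(x),                         # peak-to-peak
        x.var(),
        nturns,
        d_abs,
        x.std(),
    ])
```

Computed per sample, this keeps the training and testing folds independent, in line with the data-leakage precaution described above.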
5 Model selection and optimisation

As discussed in Section 3.2 and summarized in Table 3, the model performance depends both on the selected ML technique and on its parameters. It can be seen from the table that the best performing models are based on MLP and SVM.

Table 3: Impact of algorithm optimisation.

Algorithm      Precision  Recall  f1
SVM default    0.704      0.672   0.647
SVM optimised  0.851      0.835   0.834
KNN default    0.784      0.776   0.772
KNN optimised  0.787      0.776   0.776
MLP default    0.786      0.769   0.766
MLP optimised  0.849      0.830   0.828
LR default     0.742      0.691   0.674
LR optimised   0.788      0.770   0.768
RF default     0.822      0.815   0.813
RF optimised   0.822      0.815   0.813

For the SVM, the default radial kernel proved to be optimal; however, the default C value of 1 was far too low, resulting in a separating hyperplane with a larger margin that misclassifies more points. After optimisation, the best C value was determined to be around 38000. This resulted in a significant increase in accuracy, with the f1 score going from 0.647 to 0.834, as can be seen from the first two rows of the table.

For K-nearest-neighbours, grid search found that considering the 4 closest neighbours is better than the default 5; however, this resulted in only a very small improvement in f1, from 0.772 to 0.776, as can be seen from rows 3 and 4 of the table.

The MLP initially had convergence issues. The first candidates for optimisation were thus the initial learning rate and the learning rate schedule; different solvers were also tested. After optimisation, the parameters were changed from the default adam solver with an initial learning rate of 0.001 and a constant learning rate to the sgd solver with an initial learning rate of 0.45 and an adaptive learning rate. We used one hidden layer with 100 neurons. Optimisation resulted in a large increase in f1, from 0.766 to 0.828, as can be seen from rows 5 and 6 of the table.

Like the SVM, logistic regression also suffered from a low C value. After optimisation, it was changed from the default of 1 to 20000, and the solver was changed from lbfgs to liblinear. This resulted in a large increase in f1, from 0.674 to 0.768, as can be seen from rows 7 and 8 of the table.

For the random forest, the number of estimators was changed from 100 to 500, but this only resulted in a change in f1 from 0.813 to 0.815, as can be seen from rows 9 and 10 of the table.
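For illustration, the tuned parameter spaces described above could be expressed as scikit-learn grid-search dictionaries as sketched below. The candidate values are our own guesses placed around the reported optima and defaults; they are not the exact grids used in the experiments.

```python
# Illustrative grid-search parameter spaces for the five techniques; the
# candidate values are placeholders chosen around the reported optima.
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

search_spaces = {
    "svm": (SVC(), {"kernel": ["rbf", "linear", "poly"],
                    "C": [1, 100, 10000, 38000]}),
    "knn": (KNeighborsClassifier(), {"n_neighbors": [3, 4, 5, 7, 9]}),
    "mlp": (MLPClassifier(hidden_layer_sizes=(100,), max_iter=1000),
            {"solver": ["adam", "sgd"],
             "learning_rate_init": [0.001, 0.1, 0.45],
             "learning_rate": ["constant", "adaptive"]}),
    "rf": (RandomForestClassifier(), {"n_estimators": [100, 300, 500]}),
    "lr": (LogisticRegression(max_iter=1000),
           {"C": [1, 100, 20000], "solver": ["lbfgs", "liblinear"]}),
}

# Each estimator is tuned with the same 5-fold grid search as in Section 3.2.
searches = {name: GridSearchCV(estimator, grid, scoring="f1_macro", cv=5)
            for name, (estimator, grid) in search_spaces.items()}
```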
6 Conclusions

In this paper we showed the trade-offs in developing an accurate appliance classification system and proposed a new feature set for improved classifier performance.

First, we showed that using statistical and custom quantities able to capture the shape of the raw time series can improve the f1 score by up to 42 percentage points compared to the raw time series baseline. Our experiments showed that the best performing features were the mean, skew, kurtosis, peak-to-peak, variance, the number of times the derivative changes sign over the observed interval (nturns), the sum of the absolute values of the derivative over every interval within the time series (d_abs) and the standard deviation.

Second, we also showed that the choice of the machine learning technique and of the optimal parameters is important. The best performing model, the optimised SVM, has an f1 score 0.187 higher than the worst performing non-optimised SVM. However, the optimised SVM, MLP and random forest models all work well, with f1 scores between 0.81 and 0.84 for the considered problem.

References

[1] S. Darby, "The effectiveness of feedback on energy consumption," p. 24, 2006.
[2] J. Alcalá, J. Ureña, and A. Hernández, "Activity supervision tool using Non-Intrusive Load Monitoring Systems," Sep. 2015, pp. 1–4, ISSN: 1946-0759.
[3] I. Abubakar, S. N. Khalid, M. W. Mustafa, H. Shareef, and M. Mustapha, "Application of load monitoring in appliances' energy management – A review," Renewable and Sustainable Energy Reviews, vol. 67, pp. 235–245, Jan. 2017.
[4] "Smart home energy management systems survey," Nov. 2014, pp. 167–173.
[5] A. Zoha, A. Gluhak, M. A. Imran, and S. Rajasegarar, "Non-intrusive load monitoring approaches for disaggregated energy sensing: A survey," Sensors, vol. 12, no. 12, pp. 16838–16866, 2012.
[6] S. R. Shaw, S. B. Leeb, L. K. Norford, and R. W. Cox, "Nonintrusive load monitoring and diagnostics in power systems," IEEE Transactions on Instrumentation and Measurement, vol. 57, no. 7, pp. 1445–1454, 2008.
[7] A. Kelati, H. Gaber, J. Plosila, and H. Tenhunen, "Implementation of non-intrusive appliances load monitoring (NIALM) on k-nearest neighbors (k-NN) classifier," AIMS Electronics and Electrical Engineering, vol. 4, no. 3, pp. 326–344, 2020.
[8] D. Li and S. Dick, "Whole-house Non-Intrusive Appliance Load Monitoring via multi-label classification," in 2016 International Joint Conference on Neural Networks (IJCNN), Jul. 2016, pp. 2749–2755, ISSN: 2161-4407.
[9] M. A. Devlin and B. P. Hayes, "Non-intrusive load monitoring and classification of activities of daily living using residential smart meter data," IEEE Transactions on Consumer Electronics, vol. 65, no. 3, pp. 339–348, 2019.
[10] D. d. Paiva Penha and A. R. Garcez Castro, "Home Appliance Identification for NILM Systems Based on Deep Neural Networks," IJAIA, vol. 9, no. 2, pp. 69–80, Mar. 2018. [Online]. Available: http://aircconline.com/ijaia/V9N2/9218ijaia06.pdf
[11] J. Kim, T.-T.-H. Le, and H. Kim, "Nonintrusive Load Monitoring Based on Advanced Deep Learning and Novel Signature," Computational Intelligence and Neuroscience, vol. 2017, p. e4216281, Oct. 2017, publisher: Hindawi. [Online]. Available: https://www.hindawi.com/journals/cin/2017/4216281/
[12] J. Kelly and W. Knottenbelt, "The UK-DALE dataset, domestic appliance-level electricity demand and whole-house demand from five UK homes," Scientific Data, vol. 2, no. 1, pp. 1–14, 2015.