https://doi.org/10.31449/inf.v45i4.3283 — Informatica 45 (2021) 507–516

Clustering of Variables for Enhanced Interpretability of Predictive Models

Evelyne Vigneau, StatSC, Oniris, Inrae, 44322 Nantes Cedex 03, France. E-mail: evelyne.vigneau@oniris-nantes.fr

Keywords: clustering of variables, linear model, boosting, high-dimensional data, dimensionality reduction, authentication

Received: August 18, 2020

A new strategy is proposed for building easy-to-interpret predictive models in the context of high-dimensional datasets with a large number of highly correlated explanatory variables. The strategy is based on a first step of variable clustering using the CLustering of Variables around Latent Variables (CLV) method. The hierarchical clustering dendrogram is then explored in order to sequentially select the explanatory variables in a group-wise fashion. For the model fitting implementation, the dendrogram is used as the base-learner in an L2-boosting procedure. The proposed approach, named lmCLV, is illustrated on a toy-simulated example, for which the clusters and the predictive equation are known, and on a real case study dealing with the authentication of orange juices based on 1H NMR spectroscopic analysis. In both illustrative examples, this procedure was shown to have predictive efficiency similar to other methods, with additional interpretability capacity. It is available in the R package ClustVarLV.

Povzetek: The paper describes a new approach for enhanced interpretability of predictive models.

1 Introduction

In the context of high-dimensional data with a large number of variables, p, and a small number of observations, n, such as microarray, metabolomic and volatolomic data (among the large variety of -omic data collected using high-throughput fingerprinting technologies), most recent statistical modelling strategies aim to achieve both efficient prediction and enhanced interpretability. For 20 years, the question of variable selection has driven a great deal of work, from discrete processes in which variables are either retained or removed, to lasso regularization processes in which several model coefficients may be shrunk towards zero [1, 2]. Discussions of regularization strategies remain an active topic because of the high dimensionality of current data, either with a model-based approach as in [3] or in combination with a filtering step as in [4].

However, as the Lasso tends to arbitrarily select one predictive variable among a group of highly correlated relevant variables, the sparsity principle of the Lasso has been embedded into strategies for the selection of grouped predictors. The Elastic Net penalty [5] has been proven to be an efficient compromise between the Lasso (L1) penalty and the Ridge (L2) penalty. The Fused Lasso [6] and the Group Lasso [7] are designed for explanatory variables that are naturally ordered (as in vibrational spectroscopy) or arranged into groups (as for design factors). The Sparse Group Lasso [8, 9] yields sparsity at both the group and the individual feature levels, in order to select groups and predictors within a group. The Octagonal Shrinkage and Clustering Algorithm for Regression (OSCAR) [10] is another approach, which combines an L1-type penalty and a pairwise L∞-type penalty on the model's coefficients. OSCAR thus makes it possible to simultaneously select variables and perform supervised clustering in the context of linear regression.
The CLERE methodology [11] is also based on the clustering of the regression coefficients, using a Gaussian latent mixture model. Several of the penalized least squares approaches have Bayesian analogues, also developed for variable selection and grouping [12, 13].

The model we propose here stems from another family of approaches. Emphasis is put on reducing the dimensionality of the data when a large number of highly correlated explanatory variables (p ≫ n) has been collected. We consider dimension reduction approaches that define latent components as linear combinations of the explanatory variables, taking their correlation structure into account. It is also desirable that these latent components be easy to interpret, for instance by forcing several component loadings to be exactly zero. This goal has also been addressed with approaches designed to constrain Principal Component [14, 15, 16] or PLS regression [17] loadings to be shrunk to zero. Another strategy is to first perform a cluster analysis in order to construct interpretable components [18, 19].

Our proposal is also to first perform a clustering of the explanatory variables, and then to fit a linear model between a response variable Y and the p explanatory variables X_j (j = 1, …, p) in a group-wise fashion. The algorithm adopted herein is easy to implement and is oriented towards the interpretability of the latent components introduced sequentially into the model. In the case study considered herein, each selected latent component can be assumed to be associated with a compound, identified from its chemical spectrum in NMR (Nuclear Magnetic Resonance). With metabolomic data, it is possible to imagine that the latent components associated with subsets of metabolites may be related to specific biological pathways. Within the context of DNA microarray data, similar studies have been undertaken [20, 21], in which hierarchical cluster analysis makes it possible to identify supergenes, obtained by averaging genes within the clusters. These supergenes are thereafter used to fit regression models, thereby attaining both concise interpretation and accuracy. One of the main differences in this study, compared with these previous research works [20, 21], is that the representatives of the clusters of variables, or latent variables, are not necessarily averages of the observed variables, but can be components that best reflect the variability within each cluster of variables. The second difference lies in the procedure adopted for the progressive construction of the linear regression model.

The clustering of the explanatory variables is based herein on the CLV (CLustering of Variables around Latent Variables) approach [22, 23], implemented in the ClustVarLV R package [24]. In summary, the CLV approach consists of clustering highly correlated variables together into clusters (or groups of variables), while exhibiting within each cluster a latent variable (or latent component) representative of this cluster. It turns out that each latent component is defined as a linear combination of only the variables belonging to the corresponding cluster. From this point of view, CLV components are sparse components in the space of observations, and aim to best reflect the variance-covariance structure of the explanatory variables. Using the same approach, but defining a slightly different criterion, [18] also proposed a clustering of variables method for producing interpretable principal components.
[25] investigated an approach for the clustering of variables based on the identification of a mixture of bipolar Watson components defined on the hypersphere. Herein, the central feature of the clustering procedure is a bottom-up aggregation of the explanatory variables involving the CLV criterion.

The dendrogram obtained is then explored in order to identify and select the groups' latent components that are predictive of the response variable Y. In contrast with a previous study [26], the hierarchical clustering, which is the most time-consuming step, is performed only once, and the model fitting stage has been modified accordingly. More precisely, an L2-boosting procedure in which the base-learner is the CLV hierarchical algorithm has been considered. It consists of iteratively selecting a group of explanatory variables. The residuals of the response variable are then regressed on the latent component associated with the selected group, and the predicted response is updated with a shrunken version of this local predictor.

The proposed methodology, combining variable clustering and iterative linear model fitting, designated lmCLV, is described in Section 2. Section 3 includes a simple simulated dataset, in order to illustrate the behavior of the procedure in a known context, as well as a real case study dealing with the authentication of orange juice based on 1H NMR spectroscopy.

2 Methodology

2.1 Notation

We consider the high-dimensional linear model:

$y = X\beta + \varepsilon,$  (1)

where y = [y_i] is an n × 1 random response vector; X = [x_1 | … | x_p] = [x_ij] is an n × p matrix of quantitative explanatory variables (with i = 1, …, n and j = 1, …, p); β is a p × 1 vector of regression coefficients; and ε is the n × 1 random vector of residuals, with zero mean and constant variance. We consider contexts where p > n or p ≫ n and where the explanatory variables may be arranged into groups of highly correlated variables. Both the explanatory variables matrix and the response vector are assumed to be column-centered. In addition, the user may choose whether or not to standardize the variables to unit variance.

2.2 lmCLV's outlines

lmCLV combines two main methods:

1. The CLV method, which is performed first. The similarities between the variables, evaluated here on the basis of their covariance or correlation coefficients, are depicted by a tree diagram which is used as the learner for the model fitting method. The clustering of variables using the CLV method is detailed in Sect. 2.3.

2. The L2-boosting procedure, which provides an efficient iterative model fitting method. The outline of the L2-boosting algorithm is given in Sect. 2.4, and the way the CLV base-learner is used in the course of this procedure is explained in Sect. 2.5.

2.3 Clustering of the variables

The clustering of the explanatory variables using the CLV method for directional groups [22, 24] aims to both identify clusters of variables and define a latent component associated with each cluster. For a given number of clusters, K, the aim of the CLV method is to seek a partition G_K = {G_1, …, G_K} of the variables into K disjoint clusters, and a matrix C_K = [c_1 | … | c_K] of K latent variables, each associated with one cluster, so as to maximize the internal coherence within the clusters.
When the agreement between the variables is assessed using their covariance or correlation coefficients, regardless of the sign of these coefficients, the clusters we are looking for are named directional groups. In this case, the aim is to define the partition, G_K, and the group latent variables matrix, C_K, so as to reach the optimal value T of the CLV criterion:

$T := \max_{(G_K,\, C_K)} \sum_{k=1}^{K} \sum_{j=1}^{p} \delta_{kj} \, \mathrm{cov}(x_j, c_k)^2,$  (2)

under the constraints $\|c_k\| = 1$ (k = 1, …, K). In eq. (2), δ_kj = 1 if the j-th variable belongs to the group G_k, and δ_kj = 0 otherwise. In other terms, the δ_kj are the entries of a binary membership matrix of the p variables, which has only one nonzero element per row (each variable belongs to exactly one cluster).

It is easy to show [22] that when the partition G_K is fixed, the latent variable of a cluster G_k (k = 1, …, K) is defined as the first standardized principal component of the data matrix formed by the variables belonging to this cluster. The latent variable c_k associated with the cluster G_k can therefore be expressed as a linear combination of the variables of this group:

$c_k = \sum_{j \,:\, \delta_{kj} = 1} v_j \, x_j.$  (3)

For various numbers of clusters, from K = p, in which each variable forms a cluster by itself, to K = 1, where a single cluster contains all the variables, an ascending hierarchical clustering algorithm is also proposed with respect to the CLV criterion, with the aggregating rules detailed in [22]. The strategy proposed herein consists of first constructing the whole hierarchy of the p explanatory variables and then exploring the resulting dendrogram repeatedly.

2.4 L2-boosting procedure for model fitting

The second feature of the strategy is a boosting procedure, which can also be viewed as a functional gradient descent (FGD) algorithm [27, 28], performed on the residuals of an iteratively updated, shrunken linear model for the prediction of the response variable y. The outline of the algorithm is as follows:

1. Set m = 0 and initialize ŷ^(0), for instance by choosing ŷ^(0) = ȳ 1_n, with 1_n a vector of ones of length n and ȳ = (1/n) Σ_{i=1}^{n} y_i;

2. Increase m by 1 and compute the residuals e_i^(m) = y_i − ŷ_i^(m−1) for i = 1, …, n;

3. Apply the base-learner procedure to the current residuals. The aim at this step is to identify the "best" CLV latent component, denoted c^(m). This base-learner procedure is described in Sect. 2.5;

4. Update the predictive function, i.e. ŷ^(m) = ŷ^(m−1) + ν β^(m) c^(m), where 0 < ν ≤ 1 is a shrinkage parameter and β^(m) is the Ordinary Least Squares (OLS) coefficient estimate of the linear regression of the residuals e^(m) = [e_1^(m), …, e_n^(m)] on c^(m);

5. Return to step 2, until m = M (M being a large predetermined integer).

This procedure depends on two parameters: the stopping iteration parameter, M, and the shrinkage parameter, ν, which can be determined via cross-validation or another information criterion, as previously suggested in [28]. In practice, because our base-learner procedure returns one-dimensional components associated with sequential group-wise selections of the explanatory variables, we have often observed (as will be shown in Sect. 3) that a relatively high value of ν (greater than 0.5) generally performs better. Moreover, as the predictive ability of the model appeared to be relatively stable, large values of M (say 50 or more) are not necessarily useful.
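As a minimal illustration of steps 1–5 above, the following R sketch implements the generic boosting loop with a placeholder base-learner. The function base_learner() stands for the CLV-based selection procedure of Sect. 2.5; its name and signature are assumptions of this sketch, not part of the ClustVarLV API.

```r
# Generic L2-boosting loop of Sect. 2.4 (sketch).
# 'base_learner(X, e)' is a placeholder for the procedure of Sect. 2.5: it is
# expected to return a length-n latent component built from a cluster of variables.
l2_boost <- function(y, X, base_learner, nu = 0.7, M = 20) {
  n <- length(y)
  y_hat <- rep(mean(y), n)                  # step 1: initialize with the mean response
  path <- vector("list", M)
  for (m in seq_len(M)) {
    e <- y - y_hat                          # step 2: current residuals
    c_m <- base_learner(X, e)               # step 3: "best" latent component
    beta_m <- sum(c_m * e) / sum(c_m^2)     # step 4: OLS slope of e on c_m (centered data)
    y_hat <- y_hat + nu * beta_m * c_m      # shrunken update of the prediction
    path[[m]] <- list(component = c_m, beta = beta_m)
  }
  list(fitted = y_hat, path = path)         # step 5: stop after M iterations
}
```

Cross-validation over the pair (ν, M) can simply be wrapped around this loop, as done in the applications of Sect. 3.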
2.5 Base-learner procedure

The third step of the algorithm presented in Sect. 2.4 is the core of the proposed strategy. At each iteration m of the algorithm, the aim is to select a cluster of explanatory variables and its associated representative, i.e. the group's latent component c^(m). This choice is guided by the CLV hierarchical clustering of the p explanatory predictors of X = [x_1 | … | x_p] (Sect. 2.3).

For each size of partition (from p clusters to one cluster), we first aim to identify the CLV latent component which has the largest correlation (in absolute value) with the current residuals e^(m). For q = 1, …, p, that is for each partition G_{p−q+1}, we define:

$c_q := \underset{c_k,\; k \in \{1,\dots,p-q+1\}}{\arg\max} \; \left| \mathrm{cor}(c_k, e^{(m)}) \right|.$  (4)

The next step consists of choosing a specific level q between 1 and p, that is a latent component c_q and its associated group of predictors G_q, in such a way that G_q is as large as possible while satisfying a unidimensionality criterion. The unidimensionality condition is assessed herein using the modified Kaiser-Guttman (KG) rule [29]. If we denote by q* this specific level:

$q^{*} := \underset{q \in \{1,\dots,p\}}{\arg\max} \; \left\{ |G_q| \; : \; \lambda_1 > L \ \text{and} \ \lambda_2 \leq L \right\},$  (5)

where |G_q| denotes the cardinality of G_q, λ_1 and λ_2 are respectively the first and second largest eigenvalues of the correlation matrix of the explanatory variables belonging to G_q, and the threshold L is defined according to [29] by:

$L = 1 + 2\sqrt{\frac{|G_q| - 1}{p - 1}}.$

The latent component c^(m) := c_{q*} and the group of explanatory variables G_{q*} are returned to the main algorithm (Sect. 2.4), which continues at step 4.

Finally, it can be noticed that at each step the base-learner returns a latent component which is itself a linear combination of a subset of the explanatory variables (eq. 3). It is therefore easy to reformulate the prediction function as a linear combination of the p predictors, and to obtain an estimate of the coefficients vector β of eq. (1).
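To make the selection rule of eqs. (4)–(5) concrete, the sketch below scans a list of candidate partitions (one per level of a precomputed dendrogram) and applies the modified KG rule. The representation of the hierarchy as a plain list of index vectors, and the helper name select_clv_component(), are assumptions of this illustration rather than the ClustVarLV data structures.

```r
# Base-learner sketch for one boosting iteration (Sect. 2.5).
# 'partitions' is assumed to be a list with one element per level of the CLV
# dendrogram, each element being a list of integer vectors (column indices of X).
select_clv_component <- function(X, e, partitions) {
  best <- NULL
  for (groups in partitions) {
    # latent component of each cluster: first (standardized) principal component
    comps <- lapply(groups, function(idx) {
      pc <- prcomp(X[, idx, drop = FALSE], center = TRUE, scale. = FALSE)
      scale(pc$x[, 1])[, 1]
    })
    # eq. (4): component most correlated (in absolute value) with the residuals
    k <- which.max(vapply(comps, function(cc) abs(cor(cc, e)), numeric(1)))
    idx <- groups[[k]]
    # eq. (5): modified Kaiser-Guttman rule on the selected group
    ev <- eigen(cor(X[, idx, drop = FALSE]), symmetric = TRUE, only.values = TRUE)$values
    L  <- 1 + 2 * sqrt((length(idx) - 1) / (ncol(X) - 1))
    ok <- length(idx) == 1 || (ev[1] > L && ev[2] <= L)
    if (ok && (is.null(best) || length(idx) > length(best$idx))) {
      best <- list(component = comps[[k]], idx = idx)
    }
  }
  best$component
}
```

Wrapping this function in a closure that fixes `partitions` yields a base_learner(X, e) usable in the loop sketched after Sect. 2.4.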
3 Applications

3.1 A toy-simulated example

The lmCLV procedure is illustrated in this section using a simple example based on a simulated model, as in [26]. We considered herein p = 70 explanatory variables supposed to be measured on n = 100 observations. Moreover, these variables were assumed to be structured into five groups (G1 to G5) of various sizes: G1 was the largest group, consisting of 35 variables, G2 was the smallest group, with 5 variables, whereas G3, G4 and G5 contained 10 variables each.

Each group of variables was generated around a prototypical variable. The five prototypes were centered and standardized random variables with a known covariance structure. In practice, n realizations of a vector (Z_1, …, Z_5)^t were generated from a centered multivariate normal distribution with covariance matrix Σ:

$\Sigma = \begin{pmatrix} 1 & 0.5 & 0.5 & 0.5 & 0.1 \\ 0.5 & 1 & 0.5 & 0.1 & 0.1 \\ 0.5 & 0.5 & 1 & 0.1 & 0.1 \\ 0.5 & 0.1 & 0.1 & 1 & 0.1 \\ 0.1 & 0.1 & 0.1 & 0.1 & 1 \end{pmatrix}.$  (6)

Let us denote by Z the n × 5 generated prototypical matrix. Then, the variables of each group were randomly simulated as follows:

$x_j = \omega_j \, Z \, \Omega_j^{t} + \varepsilon_j, \qquad j \in \{1, \dots, p\},$  (7)

where Ω is a p × 5 binary matrix defining the allocation of the explanatory variables into the 5 groups, Ω_j denotes its j-th row, and each error vector ε_j was drawn from a multivariate normal distribution N(0_n, I_n), 0_n representing the n-dimensional null vector and I_n the n × n identity matrix. The column-wise marginal sums of Ω were [35, 5, 10, 10, 10]. In eq. (7), ω_j ∈ {+1, −1} was used to randomly create positive or negative correlations between each simulated variable x_j in a group and the associated prototypical variable, in order to generate directional groups of variables. Finally, the response variable y (∈ R^n) was generated with the following model:

$y = Z b + \varepsilon, \quad \text{with} \quad b = [1, 5, 3, 0, 0]^{t},$  (8)

where ε was drawn from a multivariate normal distribution N(0_n, I_n). The response y was thus mainly related to the variables of the smallest group, G2, moderately related to the variables in G3, and only weakly related to the most numerous variables, those in G1. According to the parameters of the simulation, the expected correlation coefficients between y and the five prototypical variables are 0.69, 0.96, 0.82, 0.18 and 0.12, respectively. This toy-simulated example was designed to represent a simplified, but realistic, case study.
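For readers who wish to reproduce a dataset with this structure, the following R sketch generates Z, X and y as in eqs. (6)–(8), using MASS::mvrnorm for the multivariate normal draws, and then indicates how the CLV hierarchy of the simulated variables could be built. The commented call to CLV() reflects the ClustVarLV interface as the author's papers describe it (directional groups), but its exact arguments should be checked against the package version at hand.

```r
# Simulate the toy example of Sect. 3.1 (eqs. 6-8).
library(MASS)   # for mvrnorm()
set.seed(123)
n <- 100; p <- 70
Sigma <- rbind(c(1.0, 0.5, 0.5, 0.5, 0.1),
               c(0.5, 1.0, 0.5, 0.1, 0.1),
               c(0.5, 0.5, 1.0, 0.1, 0.1),
               c(0.5, 0.1, 0.1, 1.0, 0.1),
               c(0.1, 0.1, 0.1, 0.1, 1.0))
Z <- scale(mvrnorm(n, mu = rep(0, 5), Sigma = Sigma))   # prototypical variables
sizes <- c(35, 5, 10, 10, 10)                           # group sizes (eq. 7, matrix Omega)
grp   <- rep(1:5, times = sizes)                        # group membership of each x_j
omega <- sample(c(-1, 1), p, replace = TRUE)            # random signs (directional groups)
X <- sapply(seq_len(p), function(j) omega[j] * Z[, grp[j]] + rnorm(n))
y <- as.vector(Z %*% c(1, 5, 3, 0, 0) + rnorm(n))       # eq. (8)

# CLV hierarchy of the simulated variables (assumed ClustVarLV interface):
# library(ClustVarLV)
# res_clv <- CLV(X, method = "directional", sX = TRUE)
# plot(res_clv)   # dendrogram used as the base-learner in lmCLV
```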
The choice of the shrinkage parameter ν is discussed on the basis of the Root Mean Squared Error criterion evaluated using a five-fold cross-validation procedure, denoted RMSE_CV. This criterion was assessed at each iteration m, from 0 (null model) up to M = 20. Figure 1 shows that as the parameter ν becomes smaller, the decrease in prediction errors becomes slower. In this example, a value of ν = 0.7 or 0.8 led to a low level of errors (RMSE_CV = 2.29) after only three iterations of the algorithm. In the following, ν was fixed to 0.7.

Figure 1: Evolution of the Root Mean Squared Error in cross-validation with lmCLV, according to the number of iterations and for various values of the shrinkage parameter ν (from 1 down to 0.1).

At the first iteration, and for each fold of the cross-validation procedure, the variables numbered 36 to 40 (G2) were in the cluster associated with the selected CLV latent component (for one of the five folds variable 33 was also added, and for another fold variable 46 was added). At the second iteration, the G3 variables (41 to 50) were always retained (except for one fold, for which variable 48 was lacking). Finally, at the third iteration, as expected, all the variables associated with G1, numbered 1 to 35, were in the selected cluster of variables.

For the comparison of lmCLV with two usual regression methods based on latent components, namely Principal Components Regression (PCR) and Partial Least Squares Regression (PLSR), the same dataset was considered, using the same five consecutive segments for cross-validation. The evolution of RMSE_CV according to the number of latent components included is shown in Figure 2(b). The left-hand panel (Figure 2(a)) shows the evolution of the errors in prediction, more precisely the RMSE, evaluated on the calibration set formed by the whole dataset. As can be observed in Figure 2(a), PLSR and, to a weaker extent, PCR are prone to over-fitting, the RMSE value reaching 0.94 when p = 70 components are included (i.e. the OLS solution). On the contrary, the RMSE with lmCLV remained relatively stable with increasing numbers of iterations. As a matter of fact, the higher the number of iterations, the more often the same groups of variables, and thus the same latent components, are likely to be included in the model. However, the loadings of these repeated latent components become increasingly smaller, which has no impact on the overall quality of the model. With three latent components, extracted during three iterations of the algorithm, lmCLV made it possible to obtain values of RMSE = 2.23 and RMSE_CV = 2.29. Regarding the RMSE_CV criterion, in this example, lmCLV performed a little better than PLSR and PCR. As with lmCLV, the PLSR solution with three components could be retained. However, except for the first PLS component, which was quite well correlated (r = 0.94) with the third latent component retained with lmCLV (itself well correlated with the third prototypical variable), the other components were not pairwise well correlated. In fact, the PLS components are pairwise uncorrelated, which is not the case for the components from CLV.

Figure 2: Evolution of the Root Mean Squared Errors for PCR, PLSR and lmCLV, according to the number of components/iterations: (a) RMSE on the whole calibration set; (b) RMSE_CV.

The component-wise gradient boosting, or L2-boosting, approach [28, 30] has also been considered for comparison purposes, using the glmboost function included in the R package mboost. We used the same cross-validation structure as previously and tested step-length values ν = 0.01 and 0.1 to 1 in steps of 0.1, with a maximal number of 1000 iterations. The best combination of the shrinkage parameter and the number of iterations was chosen according to the RMSE_CV criterion. For this example, the best combination was ν = 0.3 with a stopping iteration number m_stop = 30, leading to an RMSE_CV value of 2.45, which is similar to the optimal values observed with the previous methods. With these parameters, 23 variables were selected: seven from G1, the five variables from G2, six variables from G3, and two variables from each of the last two groups.

3.2 Orange juice authentication case study

In this case study, a 1H NMR spectroscopic profiling approach was investigated to discriminate between authentic and adulterated juices [31]. Here, we considered the adulteration of orange juice (Citrus sinensis) with clementine juice (Citrus reticulata). Supplementation or substitution with a cheaper or easier-to-find similar fruit is one of the types of fraud conducted within the fruit juice industry. For the experiment, twenty pure orange juices and ten pure clementine juices were selected from the Eurofins database. They are deemed to be representative of the variability of the fruit juices available on the market. From these juices, 120 blends were prepared by mixing one of the twenty orange juices and one of the ten clementine juices in known proportions. The proportion of clementine juice in a mix was 10%, 20%, …, or 60%. The experimental design is described in more detail in [31]. The database was completed with the twenty authentic orange juices, for a total of 140 juice samples, and these were analyzed on a 1H Nuclear Magnetic Resonance (NMR) spectroscopy platform. The NMR variables are associated with chemical shifts, given in ppm. In the following, two spectral regions were simultaneously considered: the region from 6 to 9 ppm and the region from 0.5 to 2.3 ppm. The first spectral region mostly includes chemical shifts associated with aromatic components, and the second one amino-acid-specific spectral shifts (among others).
Between these two regions lies the part of the 1H NMR spectrum typical of orange juice sugars [32], which does not make it possible to discriminate between Citrus sinensis and Citrus reticulata juices. After a preprocessing binning step, 300 chemical shift variables were collected between 6 and 9 ppm, and 180 variables between 0.5 and 2.3 ppm. The data matrix is therefore composed of n = 140 observations and p = 480 variables. The log-transformed data are available in the R package ClustVarLV (dataset authen_NMR). The means of the log-transformed signals for both spectral regions are shown in Figure 3.

Figure 3: Mean spectrum of the log-transformed NMR signals according to the spectral region. Two subplots are shown because of the difference in order of magnitude between the two regions.

In the following, a pareto scaling was applied to each variable. This scaling, which consists of dividing each variable by the square root of its standard deviation, was shown in this case study [31] to be preferable to the usual scaling by the standard deviation. Moreover, a splitting of the observations into ten segments was defined for cross-validation purposes. A proportional stratified allocation rule was adopted, in such a way that each segment contained two observations of each of the seven experimental levels, ranging from 0 to 60% of co-fruit added to pure orange juices.

Figure 4 shows the evolution of the errors in prediction, evaluated by means of the RMSE_CV criterion, when the shrinkage parameter ν was set to 0.2, 0.5 or 0.8. On this basis, we chose a value of the shrinkage parameter ν = 0.5 for a more detailed analysis of the model.

Figure 4: Evolution of the Root Mean Squared Error in cross-validation with lmCLV, according to the number of iterations and the value of the shrinkage parameter ν, in the orange juice authentication case study.

Besides the predictive purpose, one of the key features of lmCLV is to identify the groups of explanatory variables, and their associated latent variables, that are step by step selected and involved in the model. The introduction sequence of the CLV latent variables provides an interesting insight. However, it should be noted that the smaller the shrinkage parameter, the more often the same group of explanatory variables, through its latent variable, will appear. Therefore, some of the lmCLV algorithm's outputs are provided for each exhibited group, in order of its first occurrence, rather than by iteration. Thus, the OLS coefficients β^(m) associated with a selected latent component c^(m) (see step 4 in Sect. 2.4) are aggregated over all the iterations m at which this specific latent component is selected, up to the predefined maximum number of iterations M. The same applies to the coefficients of each of the (pre-processed) explanatory variables belonging to a selected group. Finally, a group importance criterion has been introduced. This importance is assessed as the sum, over all its occurrences, of the decrease in residual variance obtained by introducing the shrunken OLS estimate of the associated latent component.

Figure 5: Relative Group Importance for the orange juice authentication case study.
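A compact way of applying the pareto scaling and the stratified ten-fold split described above is sketched below in R. The object names are illustrative assumptions about how the data would be organized (X for the 140 × 480 log-transformed matrix, level for the factor giving the seven experimental levels), not the documented structure of the authen_NMR dataset.

```r
# Pareto scaling: centre each column and divide it by the square root of its
# standard deviation (X is assumed to be the 140 x 480 log-transformed matrix).
pareto_scale <- function(X) {
  Xc <- scale(X, center = TRUE, scale = FALSE)
  sweep(Xc, 2, sqrt(apply(X, 2, sd)), "/")
}

# Proportional stratified 10-fold allocation: two observations of each of the
# seven experimental levels (0, 10, ..., 60 % of clementine juice) per segment.
# 'level' is assumed to be a factor with 20 observations per level.
stratified_folds <- function(level, n_folds = 10) {
  fold <- integer(length(level))
  for (lev in unique(level)) {
    idx <- which(level == lev)
    fold[idx] <- sample(rep(seq_len(n_folds), length.out = length(idx)))
  }
  fold
}
```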
The Group Importance values are expressed relative to the total variance of the response variable y. In our case study, using the whole dataset, with the lmCLV parameters ν = 0.5 and M = 25, the group importance estimates are depicted in Figure 5. This reveals the presence of a group of spectral variables (G1) of high importance, as well as two other important groups (G2 and G3). Each group is numbered according to its first occurrence, which corresponds rather well with its importance in the model.

The first group involved nine spectral variables, at 7.52, 7.51, 7.50, 6.77, 2.10, 2.08, 2.07, 2.06 and 2.04 ppm. The spectral range between 7.50 and 7.52 ppm, combined with the signal at 6.77 ppm and around 2 ppm, is particularly interesting and could be associated with 4-amino-3-methylbenzoic acid (see for instance the Spectral Database for Organic Compounds [33]). This information is quite stable: the spectral variables at 7.52, 7.51, 6.77 and 2.08 ppm had been selected in the first group in each of the ten folds of the CV procedure. The same variables have also been noted in [31], but their presumed association with the same compound could not be so clearly highlighted. The second group of spectral variables consisted of ten variables (7.15, 7.10-7.09, 1.10-1.05 and 0.96 ppm). The CV procedure showed that the range between 1.09 and 1.05 ppm and the peak at 7.10-7.09 ppm were simultaneously retrieved 7 times out of 10 as the second or third group. Lastly, the third group consisted of nine spectral variables, and specifically a range between 1.65 and 1.60 ppm. The relationship between these first three group components and the proportion of clementine juice in a mix is shown in Figure 6.

Figure 6: Relationship between the first three group components and the proportion of clementine juice added to a mix.

In Table 1, the predictive ability of lmCLV with ν = 0.5 is compared with those of various alternatives among the most popular statistical learning approaches [1, 27]: (i) Sparse PLS Regression [17], which is very widely used in chemometrics; (ii) Elastic-Net [5], a penalized regression method allowing sparsity and a grouping effect in variable selection; (iii) L2-boosting [28, 30], a simple boosting approach for the squared error loss function which is based on a regression model, unlike the base-learner involved in lmCLV; and (iv) Random Forest [34], one of the most popular and intuitive machine learning methods for high-dimensional predictive problems. The RMSE_CV values were evaluated using the same 10-fold cross-validation division as above. For Sparse PLS Regression (sPLSR), the R package spls was used, with parameter η ∈ {0, 0.1, …, 0.9} and K, the number of components, varying from 1 to 20. Elastic-Net (E-Net) was adjusted using the R package glmnet, considering the Gaussian family model with the mixing parameter α fixed to 0.5. The λ parameter was chosen as the mean of the optimal values determined for each CV fold. For the L2-boosting approach (L2-boost), in the R package mboost, values of the ν parameter from 0 (0.01 precisely) to 1, in steps of 0.1, with a maximal number of 1000 iterations, were explored. Random Forest was considered because this machine learning approach often proves to be effective in high-dimensional classification or regression problems and provides an interesting variable importance ranking. The R package randomForest was used with parameter ntree = 5000, and default values for the two other parameters, mtry (i.e. 160) and nodesize (i.e. 5).
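The alternative models listed above can be fitted along the following lines. This is a sketch under the assumption that X_p is the pareto-scaled 140 × 480 matrix and y the percentage of clementine juice; the tuning values shown are those reported in Table 1, rather than the result of a fresh cross-validation.

```r
# Sketch of the comparison methods of Table 1 (parameter values as reported there).
library(spls)          # Sparse PLS regression
library(glmnet)        # Elastic-Net
library(mboost)        # component-wise L2-boosting
library(randomForest)  # Random Forest

fit_spls  <- spls(X_p, y, K = 5, eta = 0.3)                             # sPLSR
fit_enet  <- glmnet(X_p, y, family = "gaussian", alpha = 0.5,
                    lambda = 0.08)                                      # E-Net
fit_boost <- glmboost(x = X_p, y = y,
                      control = boost_control(mstop = 209, nu = 0.2))   # L2-boost
fit_rf    <- randomForest(x = X_p, y = y, ntree = 5000,
                          mtry = 160, nodesize = 5)                     # RdF
```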
Table 1: Root Mean Squared Error in cross-validated prediction of the percentage of clementine juice added to the orange juices, according to various methods.

method(a)   parameters setting     RMSE_CV   # var. selected
lmCLV       ν = 0.5, M = 25        6.366     145
sPLSR       η = 0.3, K = 5         6.427     150
E-Net       α = 0.5, λ = 0.08      6.335     90
L2-boost    ν = 0.2, M = 209       6.742     75
RdF         (5000; 160; 5)(b)      9.992     146(c)

(a) All the methods were applied using available R packages. Their parameters were determined on the basis of cross-validation with the same sample division.
(b) ntree = 5000, mtry = 160, nodesize = 5.
(c) Variables with a standardized variable importance value greater than 3 were selected.

As shown in Table 1, lmCLV has an expected prediction performance similar to Sparse PLS Regression, Elastic-Net and the L2-boosting approach. Quite surprisingly, in this case study, Random Forest did not perform very well. In fact, one can observe that with Random Forest the prediction of the percentage of co-fruit added was overestimated or underestimated at the extremes of the experimental scale, i.e. for 0-10% or 50-60%.

Figure 7: Location of the spectral variables involved in the models according to the method considered, in the orange juice authentication case study.

The number of variables involved in the model fitted on the whole dataset, using the predetermined model parameters, is indicated in the last column of Table 1. As shown in Figure 7, while Sparse PLS Regression has the tendency to identify fewer but larger spectral ranges, the L2-boosting approach retained a relatively small number of variables associated with narrow ranges. For lmCLV, 20 groups, and their latent variables, were identified. Since several small spectral ranges were merged within the same group, the total number of spectral variables involved was rather high. However, the order of extraction and the grouping effect, which are a specificity of lmCLV, cannot be revealed in Figure 7. Globally, 19 spectral variables were retained by all five methods considered herein. We systematically found the variables at 7.51-7.52 ppm, but not the other variables that belonged to the first group extracted with lmCLV. The variables between 7.09 and 7.12 ppm, as well as the variable at 1.07 ppm, are also present, as in the second group extracted with lmCLV. We finally noted that the areas at 6.96-6.98 ppm, 1.56-1.57 ppm and 0.86-0.88 ppm were selected by all five methods.

4 Conclusion

In this study, we introduced a strategy for linear model fitting based on the hypothesis that high-dimensional datasets often include highly correlated variables having a similar effect on the response variable. This is specifically the case for modern scanning instruments such as those used in spectroscopy (infrared, near-infrared, Raman, NMR, ...),
which are able to collect a large number of sequential spectral variables, several of them being representative of one feature or signal. Omic data, and specifically metabolomic data, contain a large quantity of measured elements that are components of the same metabolic pathways and that constitute the biological information that is sought. The basis of the approach is then to cluster the explanatory variables first, as other authors have already proposed [19, 18]. In lmCLV, the clustering stage is based on the CLV method and consists of identifying unidimensional latent components which represent clusters of variables. These latent components play the role of predictors in the second stage, which results in introducing the explanatory variables in a group-wise fashion.

In contrast with [19] for the CRL approach (Cluster Representative Lasso), and with [20] or [21], in which each cluster representative is defined as the mean variable of a group of variables, the CLV latent components are defined at the same time as the clusters, and are derived from the eigendecomposition of the within-cluster covariance matrix. In addition, for the construction of the regression model itself, we have adopted an L2-boosting approach rather than a Lasso approach. This makes it possible to build a simple and efficient algorithm in which the CLV dendrogram constitutes the base-learner.

Compared to our previous study [26], the algorithm proposed here requires far fewer computing resources. In the previous study, the clustering stage was performed on the data matrix combining the residuals of the response variable and the explanatory variables, and was repeated at each iteration. However, the clustering of several hundred variables requires the largest part of the computation time. Using the new version of lmCLV, for the case study on orange juice authentication (Sect. 3.2), which involved 480 explanatory variables, the whole procedure (ν = 0.5, M = 100) took 1 min 48 s on one 3.4 GHz processor, including 1 min 25 s for the clustering stage alone. For the implementation of this algorithm, a function lmCLV will be available in version 2.0.2 of the R package ClustVarLV.

In both examples presented in this study, lmCLV was shown to have predictive efficiency similar to other methods and, importantly, provided additional interpretability capacity. The use of latent components that are easy to interpret, because each of them is associated with a group of collinear variables, is consistent with the development of modern simplifying approaches to modelling. Moreover, compared to Lasso-based methodologies, dimensionality reduction based on variable clustering makes it easier to identify directions of interest for prediction and makes it possible to highlight functional links between explanatory variables where they exist.

Acknowledgements

The author is grateful to Freddy Thomas (Authenticity unit, Eurofins Analytics France, Nantes, France) for the opportunity to present the case study on orange juice authentication.

References

[1] T. Hastie, R. Tibshirani and J. Friedman (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction (Second ed.). Springer Series in Statistics. New York: Springer.

[2] R. Tibshirani (1996). Regression Shrinkage and Selection Via the Lasso. J. Roy. Stat. Soc. B, 58, 267–288. https://doi.org/10.1111/j.2517-6161.1996.tb02080.x

[3] G. Celeux, C. Maugis-Rabusseau and M. Sedki (2019).
Variable selection in model-based clustering and discriminant analysis with a regularization approach. Adv. Data Anal. Classif., 13, 259–278. https://doi.org/10.1007/s11634-018-0322-5

[4] Z. Y. Algamal and M. H. Lee (2019). A two-stage sparse logistic regression for optimal gene selection in high-dimensional microarray data classification. Adv. Data Anal. Classif., 13, 753–771. https://doi.org/10.1007/s11634-018-0334-1

[5] H. Zou and T. Hastie (2005). Regularization and variable selection via the elastic net. J. Roy. Stat. Soc. B, 67, 301–320. https://doi.org/10.1111/j.1467-9868.2005.00503.x

[6] R. Tibshirani, M. Saunders, S. Rosset, J. Zhu and K. Knight (2005). Sparsity and smoothness via the fused lasso. J. Roy. Stat. Soc. B, 67, 91–108. https://doi.org/10.1111/j.1467-9868.2005.00490.x

[7] M. Yuan and Y. Lin (2006). Model selection and estimation in regression with grouped variables. J. Roy. Stat. Soc. B, 68, 49–67. https://doi.org/10.1111/j.1467-9868.2005.00532.x

[8] J. Friedman, T. Hastie and R. Tibshirani (2010). A note on the group lasso and a sparse group lasso. Technical report, Statistics Department, Stanford University.

[9] B. Zeng, X. M. Wen and L. Zhu (2017). A link-free sparse group variable selection method for single-index model. J. Appl. Stat., 44, 2388–2400. https://doi.org/10.1080/02664763.2016.1254731

[10] H. D. Bondell and B. J. Reich (2008). Simultaneous regression shrinkage, variable selection and clustering of predictors with OSCAR. Biometrics, 64, 115–123. https://doi.org/10.1111/j.1541-0420.2007.00843.x

[11] L. Yengo, J. Jacques, C. Biernacki and M. Canouil (2016). Variable clustering in high-dimensional linear regression: the R package CLERE. The R Journal, 8, 92–106. https://doi.org/10.32614/RJ-2016-006

[12] S. M. Curtis and S. K. Ghosh (2011). A Bayesian approach to multicollinearity and the simultaneous selection and clustering of predictors in linear regression. J. Stat. Theory Pract., 5, 715–735. https://doi.org/10.1080/15598608.2011.10483741

[13] S. Chakraborty and A. C. Lozano (2019). A graph Laplacian prior for Bayesian variable selection and grouping. Comput. Stat. Data An., 136, 72–91. https://doi.org/10.1016/j.csda.2019.01.003

[14] I. T. Jolliffe, N. T. Trendafilov and M. Uddin (2003). A modified Principal Component technique based on the lasso. J. Comput. Graph. Stat., 12, 531–547. https://doi.org/10.1198/1061860032148

[15] H. A. Chipman and H. Gu (2005). Interpretable dimension reduction. J. Appl. Stat., 32, 969–987. https://doi.org/10.1080/02664760500168648

[16] T. F. Cox and D. S. Arnold (2018). Simple components. J. Appl. Stat., 45, 83–99. https://doi.org/10.1080/02664763.2016.1268104

[17] H. Chun and S. Keles (2010). Sparse partial least squares for simultaneous dimension reduction and variable selection. J. Roy. Stat. Soc. B, 72, 3–25. https://doi.org/10.1111/j.1467-9868.2009.00723.x

[18] D. G. Enki, N. T. Trendafilov and I. T. Jolliffe (2013). A clustering approach to interpretable principal components. J. Appl. Stat., 40, 583–599. https://doi.org/10.1080/02664763.2012.749846

[19] P. Bühlmann, P. Rütimann, S. van de Geer and C.-H. Zhang (2013). Correlated variables in regression: clustering and sparse estimation. J. Stat. Plan. Infer., 143, 1835–1858. https://doi.org/10.1016/j.jspi.2013.05.019

[20] T. Hastie, R. Tibshirani, D. Botstein and P. Brown (2001). Supervised harvesting of expression trees. Genome Biol., 2, 1–12.
https://doi.org/10.1186/gb-2001-2-1-research0003

[21] M. Y. Park, T. Hastie and R. Tibshirani (2007). Averaged gene expressions for regression. Biostatistics, 8, 212–227. https://doi.org/10.1093/biostatistics/kxl002

[22] E. Vigneau and E. Qannari (2003). Clustering of variables around latent components. Comm. Stat. - Simul. Comput., 32, 1131–1150. https://doi.org/10.1081/SAC-120023882

[23] E. Vigneau and M. Chen (2016). Dimensionality reduction by clustering of variables while setting aside atypical variables. Electron. J. Appl. Stat. An., 9, 134–153. https://doi.org/10.1285/i20705948v9n1p134

[24] E. Vigneau, M. Chen and E. M. Qannari (2015). ClustVarLV: An R package for the clustering of variables around latent variables. The R Journal, 7, 134–148. https://doi.org/10.32614/RJ-2015-026

[25] A. Figueiredo and P. Gomes (2015). Clustering of variables based on Watson distribution on hypersphere: A comparison of algorithms. Comm. Stat. - Simul. Comput., 44, 2622–2635. https://doi.org/10.1080/03610918.2014.901353

[26] M. Chen and E. Vigneau (2016). Supervised clustering of variables. Adv. Data Anal. Classif., 10, 85–101. https://doi.org/10.1007/s11634-014-0191-5

[27] B. Efron and T. Hastie (2016). Computer Age Statistical Inference: Algorithms, Evidence and Data Science. New York: Cambridge University Press. https://doi.org/10.1017/cbo9781316576533

[28] P. Bühlmann and T. Hothorn (2007). Boosting algorithms: Regularization, prediction and model fitting. Stat. Sci., 22, 477–505. https://doi.org/10.1214/07-STS242

[29] D. Karlis, G. Saporta and A. Spinakis (2003). A simple rule for the selection of principal components. Comm. Stat. - Theor. M., 32, 643–666. https://doi.org/10.1081/STA-120018556

[30] B. Hofner, A. Mayr, N. Robinzonov and M. Schmid (2014). Model-based boosting in R: A hands-on tutorial using the R package mboost. Comput. Stat., 29, 3–35. https://doi.org/10.1007/s00180-012-0382-5

[31] E. Vigneau and F. Thomas (2012). Model calibration and feature selection for orange juice authentication by 1H NMR spectroscopy. Chemometr. Intell. Lab. Syst., 117, 22–30. https://doi.org/10.1016/j.chemolab.2011.05.006

[32] P. Rinke, S. Moitrier, E. Humpfer, S. Keller, M. Moertter, M. Godejohann, G. Hoffmann, H. Schaefer and M. Spraul (2007). A 1H-NMR technique for high throughput screening in quality and authenticity control of fruit juice and fruit juice raw materials - SGF-profiling. Fruit Process., 1, 10–18.

[33] SDBSWeb (2020). Spectral Database for Organic Compounds. https://sdbs.db.aist.go.jp (National Institute of Advanced Industrial Science and Technology).

[34] L. Breiman (2001). Random forests. Mach. Learn., 45, 5–32. https://doi.org/10.1023/A:1010933404324