23rd European Young Statisticians Meeting
11.–15. September 2023, Ljubljana, Slovenia

Proceedings of the 23rd European Young Statisticians Meeting – Short papers

Published by: Statistical Society of Slovenia, Litostrojska cesta 54, 1000 Ljubljana, Slovenia
Organizer: Statistical Society of Slovenia
Editor: Andrej Srakar
Place of publication: Ljubljana, Slovenia
Year of publication: 2025
ISBN: 978-961-94283-6-8

The cataloguing record (CIP) was prepared by the National and University Library in Ljubljana.
COBISS.SI-ID 213542659
ISBN 978-961-94283-6-8 (PDF)

The 23rd European Young Statisticians Meeting was held online and at the Institute for Economic Research (IER) in Ljubljana. We thank the Institute for its support in organising the event.

Preface

European Young Statisticians Meetings are organized every two years under the auspices of the European Regional Committee of the Bernoulli Society for Mathematical Statistics and Probability. The aim is to provide a scientific forum for the next generation of European researchers in probability theory and statistics. It represents an excellent opportunity to promote new collaborations and international cooperation. Participants are less than 30 years old or have 2 to 8 years of research experience, and are invited on the basis of their scientific achievements, in a uniformly distributed way across Europe (at most 2 participants per country). The International Organizing Committee (IOC) is responsible for their selection. There were twenty-five European countries participating at the 23rd EYSM.

The scientific part of the Conference was organized as follows:
• six eminent scientists from the field of mathematical statistics and probability gave 60-minute keynote lectures
• forty-seven invited young scientists gave 20-minute lectures.

The topics presented included, but were not limited to:
- Applied statistics in economics, finance, biology, etc.
- Applied stochastic models
- Bayesian statistics and computation
- Causal inference
- Central limit theorem and asymptotics
- Dependent data
- Design of experiments
- Directional data analysis
- Distributional topics
- Econometrics
- Functional data analysis
- High-dimensional statistics
- Nonparametric statistics
- Random matrix theory
- Robust statistics
- Schramm-Loewner evolution
- Spatial data
- Statistical learning
- Stochastic differential equations
- Stochastic processes

More information about the Conference, such as the scientific program, abstracts of all given lectures, and the list of participants together with their affiliations and contact information, is available in the Book of Abstracts and at the Conference website https://sites.google.com/view/eysm2023.

These Proceedings contain short papers that went through the peer review process organized by the IOC, in the way that the IOC representatives personally acted as referees or proposed reviewers for the papers of participants they invited.

We would like to thank the European Regional Committee of the Bernoulli Society for giving us the opportunity to organize this lovely event. We are also thankful to the members of the International Organizing Committee for selecting prominent young scientists to attend this conference, as well as to the reviewers of the papers published in the conference proceedings. We also appreciate very much the help of the administrative staff of the Statistical Society of Slovenia (Mrs.
Jerneja Čuk) and the Statistical Office of the Republic of Slovenia.

Last, but not least, we thank all Keynote Speakers and Young Participants for providing an excellent scientific program and the great atmosphere that made this event special. The meeting gave invited young scientists the opportunity to present their recent research results, exchange experience, gain new knowledge and establish contacts, and we hope it will be a driving force for their future academic achievements.

Ljubljana, October 2025
Local Organizing Committee

Organizers

23rd European Young Statisticians Meeting

Co-Organized by
Statistical Society of Slovenia
Statistical Office of the Republic of Slovenia (SORS)

Under the auspices of
Bernoulli Society for Mathematical Statistics and Probability

International Organizing Committee
Alexandre Jacquemain, Belgium
Hrvoje Planinić, Croatia
Jan Vávra, Czechia
Niko Lietzen, Finland
Anne van Delft, Germany/United States
Christina Parpoula, Greece
Federico Camerlenghi, Italy
Aliaksandr Hubin, Norway
Łukasz Rajkowski, Poland
Marija Cuparić, Serbia
Andrej Srakar, Slovenia
Jakob Peterlin, Slovenia
Elena Castilla González, Spain
Dario Azzimonti, Switzerland
Chengchun Shi, UK (England)
Laura Boyle, UK (Northern Ireland)

Local Organizing Committee
Andrej Srakar, Chair, Institute for Economic Research and University of Ljubljana
Damjan Manevski, Institute for Biostatistics and Medical Informatics, University of Ljubljana, Faculty of Medicine
Jakob Peterlin, Institute for Biostatistics and Medical Informatics, University of Ljubljana, Faculty of Medicine

Keynote Speakers
Nina Holden, New York University
Daniela Witten, University of Washington
Davy Paindaveine, Université Libre de Bruxelles
Aad van der Vaart, Delft University of Technology
Mihael Perman, University of Ljubljana and University of Primorska
Vladimir Batagelj, University of Ljubljana

Conference Structure: keynote lectures, invited lectures.
Conference Language: English

Contents

Preface ................................................................................................ I
Organizers ............................................................................................. V

Papers

On the Estimation of Anisotropic Covariance Functions on Compact Two-Point Homogeneous Spaces
Alessia Caponera ....................................................................................... 3

On the theory of space-time models for counts
Ana Martins, Manuel G. Scotto, Christian H. Weiß and Sónia Gouveia ..................................... 9

Joint Species Spatial Modelling of Deer Count Data
Aoife K Hurley, Ruth F Carden and James Sweeney ....................................................... 15

Nonparametric Simultaneous Confidence Bands: The Case of Known Input Distributions
Bálint Horváth and Balázs Csanád Csáji ................................................................ 21

A primer on experimental design for practitioners with efficient augmented designs
Carlos de la Calle-Arroyo, Jesús López-Fidalgo and Licesio J. Rodríguez-Aragón ........................
27 Parameter estimation in stochastic heat equations driven by fractional and mixed fractional Brownian motion Diana Avetisian and Kostiantyn Ralchenko ..................................................................................... 33 Stochastic modeling of age-specific fertility rates Erika Lettrichová ............................................................................................................................. 39 On extreme quantile estimation with approximated Lp-norms Jaakko Pere .................................................................................................................................... 45 Using Spatial Point Process Models to Understand Cell Loss in Glaucoma Jonathan Henderson, Benjamin M Davis and Hannah J Mitchell ................................................... 51 Subsampling MCMC for Bayesian Variable Selection and Model Averaging in BGNLM Jon Lachmann and Aliaksandr Hubin ............................................................................................. 57 Exploring Nonlinear Associations between Features: A Tool for Capturing Complex Relationships Kimon Ntotsis, Andreas Artemiou and Alex Karagrigoriou .............................................................. 63 Precision of Individual Shapley Value Explanations Lars Henry Berge Olsen.................................................................................................................. 69 Resampling methods in conditional independence testing Małgorzata Łazęcka ........................................................................................................................ 75 Some notes about Poisson–Laguerre tessellation with unbounded weights Martina Petráková and Zbynĕk Pawlas .......................................................................................... 81 VIII 23rd European Young Statisticians Meeting Kalman Filter and Identifiability of the Observation Model Matej Benko and Libor Žák ............................................................................................................. 87 Asymptotic Properties of Parameter Estimators for Mixed Fractional Brownian Motion with Trend Mykyta Yakovliev and Kostiantyn Ralchenko .................................................................................. 93 Ordering Awad-Tsallis Quantile Entropy and Applications to Some Stochastic Models Răzvan-Cornel Sfetcu ..................................................................................................................... 99 Approximate Post-Selection Inference for Group Lasso with Application to NHANES Sarah Pirenne and Gerda Claeskens .............................................................................................. 105 Some Results on Random Mixed SAT Problems Andreas Basse-O’Connor, Tobias Overgaard, and Mette Skjøtt ..................................................... 113 Fast Adaptation with Bradley-Terry Preference Models in Text-To-Image Classification and Generation Víctor Gallego ................................................................................................................................. 119 Author index ......................................................................................................................................... 
127

Papers

On the Estimation of Anisotropic Covariance Functions on Compact Two-Point Homogeneous Spaces

Alessia Caponera∗1
1 University of Milano-Bicocca
∗ alessia.caponera@unimib.it

Abstract: In this paper, the asymptotic theory presented in [1] for spline-type anisotropic covariance estimators on the 2-dimensional sphere is generalized to the case of connected and compact two-point homogeneous spaces.

Keywords: functional data analysis, sparse sampling, representer theorem, compact two-point homogeneous space, Sobolev space

AMS subject classification: Primary 62G08; secondary 62M.

1 Introduction

Let M^d denote a connected and compact two-point homogeneous Riemannian manifold of dimension d. According to [5], M^d belongs essentially to one of the following categories: the unit (hyper)sphere S^d, d = 1, 2, 3, ...; the real projective spaces P^d(R), d = 2, 3, ...; the complex projective spaces P^d(C), d = 4, 6, ...; the quaternionic projective spaces P^d(H), d = 8, 12, ...; and the Cayley projective plane P^d(Cay), d = 16.

In this paper, we extend the asymptotic theory presented in [1] for spline-type anisotropic covariance estimators on the 2-dimensional sphere to the case of connected and compact two-point homogeneous spaces. Specifically, we have a zero-mean second-order random field X = {X(u), u ∈ M^d} that is a random element of L²(M^d) with covariance function C(u, v) = E[X(u)X(v)], for u, v ∈ M^d. We consider also X_1, ..., X_n independent replicates of X, and for the i-th replicate we make noisy measurements at r_i random locations on M^d, so that

W_ij = X_i(U_ij) + ε_ij,   i = 1, ..., n,  j = 1, ..., r_i,

where the U_ij's are independently drawn from a common distribution on M^d, and the ε_ij's are independent and identically distributed measurement errors of mean 0 and variance 0 < σ² < ∞. Furthermore, the X_i's, the measurement locations, and the measurement errors are assumed to be mutually independent. We address the functional data analysis problem of estimating their second-order moment structure.

In Section 2, we provide an introduction to several key concepts, including the Laplace-Beltrami operator, addition formula, and Sobolev spaces, defined within the context of compact two-point homogeneous spaces. These tools are then used to carry out the asymptotic theory for the spline-type covariance estimator presented in Section 3.

2 Mathematical Background

Here, the space M^d is endowed with the usual Riemannian (geodesic) distance ρ, and we assume that ρ(u, v) ∈ [0, π], for every u, v ∈ M^d. We also consider a normalized Riemannian measure, denoted by dν, so that ∫_{M^d} dν = 1.

Let Δ_{M^d} be the Laplace–Beltrami operator on M^d. It is well known that the spectrum of Δ_{M^d} is purely discrete, the eigenvalues being

λ_ℓ = λ_ℓ(M^d) = −ℓ(ℓ + α + β + 1),   ℓ ∈ Λ,

where

Λ = Λ(M^d) = N_0 if M^d ≠ P^d(R),   and   Λ = Λ(M^d) = {n ∈ N_0 : n even} if M^d = P^d(R),

and the parameters α, β are reported in Table 1. Now, consider the space of real-valued square-integrable functions over M^d, written L²(M^d), endowed with the standard inner product

⟨f, g⟩_{L²(M^d)} = ∫_{M^d} f(u) g(u) dν(u),   f, g ∈ L²(M^d).

Table 1
M^d          S^d             P^d(R)          P^d(C)          P^d(H)           P^d(Cay)
Dimension    d = 1, 2, ...   d = 2, 3, ...   d = 4, 6, ...   d = 8, 12, ...   d = 16
α            (d − 2)/2       (d − 2)/2       (d − 2)/2       (d − 2)/2        7
β            (d − 2)/2       (d − 2)/2       0               1                3
For every ℓ ∈ Λ, the eigenfunctions corresponding to the same eigenvalue λ_ℓ form a finite dimensional vector space, denoted by Y_ℓ = Y_ℓ(M^d), with dimension dim(Y_ℓ) that is uniquely identified through

dim(Y_ℓ) = [ (2ℓ + α + β + 1) Γ(β + 1) Γ(ℓ + α + β + 1) Γ(ℓ + α + 1) ] / [ Γ(α + 1) Γ(α + β + 2) Γ(ℓ + 1) Γ(ℓ + β + 1) ].

Then,

L²(M^d) = ⊕_{ℓ ∈ Λ} Y_ℓ.

Moreover, given an orthonormal basis of Y_ℓ, say {Y_{ℓ,m}, m = 1, ..., dim(Y_ℓ)}, any function f ∈ L²(M^d) can be expanded as

f = Σ_{ℓ ∈ Λ} Σ_{m=1}^{dim(Y_ℓ)} ⟨f, Y_{ℓ,m}⟩_{L²(M^d)} Y_{ℓ,m}.

In order to deal with covariance functions, we will also consider functions g ∈ L²(M^d × M^d), for which it holds

g = Σ_{ℓ,ℓ' ∈ Λ} Σ_{m=1}^{dim(Y_ℓ)} Σ_{m'=1}^{dim(Y_{ℓ'})} ⟨g, Y_{ℓ,m} ⊗ Y_{ℓ',m'}⟩_{L²(M^d × M^d)} Y_{ℓ,m} ⊗ Y_{ℓ',m'}.

A crucial result of [2, Theorem 3.2] ensures that also for general M^d an addition formula holds, that is, for any ℓ ∈ Λ,

Σ_{m=1}^{dim(Y_ℓ)} Y_{ℓ,m}(u) Y_{ℓ,m}(v) = κ_ℓ P_ℓ^{(α,β)}(cos ρ(u, v)),

where

κ_ℓ = κ_ℓ(M^d) = κ dim(Y_ℓ) / P_ℓ^{(α,β)}(1),

with κ = 1/2 if M^d = P^d(R) and κ = 1 otherwise. The functions P_ℓ^{(α,β)} : [−1, 1] → R, ℓ ∈ N_0, are the well-known Jacobi polynomials. Note that when M^d = S² they reduce to the Legendre polynomials. In general, the reason for the exceptional behaviour of the real projective spaces is discussed for instance in [3]. See also [4] and the references therein.

2.1 Sobolev Spaces on M^d and M^d × M^d

We now give the definition of Sobolev spaces on M^d and M^d × M^d, which will be essential later to introduce the smoothness properties of the random field {X(u), u ∈ M^d} and its covariance function.

Definition 1. Let p ≥ 0. The Sobolev space of order p on M^d is defined as

H_p = H_p(M^d) = { f ∈ L²(M^d) : ‖f‖_{H_p} < ∞ },

where

‖f‖²_{H_p} = Σ_{ℓ ∈ Λ} Σ_{m=1}^{dim(Y_ℓ)} (1 − λ_ℓ)^p ⟨f, Y_{ℓ,m}⟩²_{L²(M^d)}.

The tensorial Sobolev space of order p on M^d × M^d is

ℍ_p = ℍ_p(M^d × M^d) = H_p ⊗ H_p ⊂ L²(M^d × M^d).

The parameter p in Definition 1 quantifies the regularity: the higher p, the smoother the functions f ∈ H_p and g ∈ ℍ_p. Indeed, if we consider the Sobolev operator D = (Id − Δ_{M^d})^{p/2}, p ≥ 0, we can readily see that

‖f‖_{H_p} = ‖D f‖_{L²(M^d)},   ‖g‖_{ℍ_p} = ‖(D ⊗ D) g‖_{L²(M^d × M^d)}.

In the next section, we will only consider Sobolev spaces H_p and ℍ_p that are Reproducing Kernel Hilbert Spaces (RKHS). It turns out that H_p is a RKHS ⇐⇒ ℍ_p is a RKHS ⇐⇒ p > d/2. See [1, Proposition 1] and [2, Corollary 3.4].

3 Covariance Estimation

In this section, we present the estimation of the covariance function C and the asymptotic theory behind it. We assume that {X(u), u ∈ M^d} is a random element of H_q, for some q > d/2. For a given p ≥ q and η > 0, we can define the estimator Ĉ_η of the covariance function as the solution of the following minimization problem

min_{g ∈ ℍ_p}  Σ_{i=1}^n 1/(n r_i (r_i − 1)) Σ_{1 ≤ j ≠ k ≤ r_i} ( W_ij W_ik − g(U_ij, U_ik) )²  +  η ‖g‖²_{ℍ_p}.    (1)

It is possible to prove that the solution is unique and it is given by

Ĉ_η = Σ_{i=1}^n 1/(r_i (r_i − 1)) Σ_{1 ≤ j ≠ k ≤ r_i} β^{(η)}_{ijk} ψ_{D∗D}(cos ρ(·, U_ij)) ⊗ ψ_{D∗D}(cos ρ(·, U_ik)),

for some random {β^{(η)}_{ijk}, 1 ≤ i ≤ n, 1 ≤ j ≠ k ≤ r_i}. The function ψ_{D∗D} is the zonal Green's kernel of D∗D, defined as

ψ_{D∗D}(cos ρ(u, v)) = Σ_{ℓ ∈ Λ} κ_ℓ / (1 − λ_ℓ)^p  P_ℓ^{(α,β)}(cos ρ(u, v)),   u, v ∈ M^d.

However, the penalty (and hence the solution) can be expressed in terms of any admissible operator D with spectral growth order p, which induces an equivalent norm – see [1, Section 2.2].
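For intuition, the series defining ψ_{D∗D} can be evaluated numerically in the special case M^d = S², where the Jacobi polynomials reduce to Legendre polynomials, dim(Y_ℓ) = 2ℓ + 1, P_ℓ(1) = 1 (so κ_ℓ = 2ℓ + 1) and λ_ℓ = −ℓ(ℓ + 1). The following R sketch is purely illustrative and not part of the paper; the truncation level L is arbitrary.

# Truncated zonal Green's kernel on the sphere S^2:
#   psi(t) = sum_{l=0}^{L} (2l + 1) / (1 + l(l + 1))^p * P_l(t),  t = cos(rho),
# with P_l the Legendre polynomials, evaluated via the three-term recurrence.
zonal_kernel_s2 <- function(t, p, L = 200) {
  P_prev <- rep(1, length(t))                 # P_0(t)
  P_curr <- t                                 # P_1(t)
  out <- P_prev + (3 / 3^p) * P_curr          # l = 0 and l = 1 terms
  for (l in 1:(L - 1)) {
    P_next <- ((2 * l + 1) * t * P_curr - l * P_prev) / (l + 1)
    out <- out + (2 * l + 3) / (1 + (l + 1) * (l + 2))^p * P_next
    P_prev <- P_curr
    P_curr <- P_next
  }
  out
}

rho <- seq(0, pi, length.out = 5)
zonal_kernel_s2(cos(rho), p = 2)   # concentrates near rho = 0 for p > d/2 = 1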
Theorem 1 below establishes a (uniform) rate of convergence for the estimator Ĉ_η, under a suitable condition on the decay of the penalty parameter η. Such rate is expressed in terms of the number of replicates n and the average number of measurement locations r, defined as the harmonic mean of the r_i's. The result can thus be interpreted in both the dense and sparse sampling regimes. In a dense design, r = r(n) is required to diverge with n and to be larger than some order of n; on the other hand, in a sparse design, the sampling frequency r is bounded and can be arbitrarily small (as small as two). In the following, we define the class of probability measures for which our rate is achieved.

Definition 2. Consider p > d, d/2 < q ≤ p. Let Π₂(p, q) be the collection of probability measures for H_q-valued processes, such that for any X with probability law P_X ∈ Π₂(p, q),

E‖X‖⁴_{H_q} ≤ L,   ‖C‖_{ℍ_p} ≤ K,

for some constants L, K > 0. The constants may depend on the manifold M^d.

Theorem 1. Assume that E[ε⁴₁₁] < ∞ and that the U_ij, i = 1, ..., n, j = 1, ..., r_i, are independent copies of U ∼ Unif(M^d). Let p > d and d/2 < q ≤ p, and consider the covariance estimator obtained as the minimizer of Equation (1). If η ≍ (nr / log n)^{−2p/(2p+d)}, then

lim_{D→∞} lim sup_{n→∞} sup_{P_X ∈ Π₂(p,q)} P( ‖Ĉ_η − C‖²_{L²(M^d × M^d)} > D [ (log n / (nr))^{2p/(2p+d)} + 1/n ] ) = 0.

The proof follows the same lines of arguments as [1, Theorem 4].

Acknowledgements: The author wishes to thank Maurizia Rossi for insightful discussions on compact two-point homogeneous spaces.

References
[1] A. Caponera, J. Fageot, M. Simeoni, V. M. Panaretos. Functional estimation of anisotropic covariance and autocovariance operators on the sphere. Electron. J. Stat., 16(2):5080–5148, 2022.
[2] E. Giné M. The addition formula for the eigenfunctions of the Laplacian. Adv. Math., 18(1):102–107, 1975.
[3] A. Kushpel. The Lebesgue constants on projective spaces. Turkish J. Math., 45(2):856–863, 2021.
[4] A. Kushpel, S. A. Tozoni. Entropy and widths of multiplier operators on two-point homogeneous spaces. Constr. Approx., 35:137–180, 2012.
[5] H.-C. Wang. Two-point homogeneous spaces. Ann. Math., 55(1):177–191, 1952.

On the theory of space-time models for counts

Ana Martins∗1, Manuel G. Scotto2, Christian H. Weiß3, and Sónia Gouveia1,4
1 Institute of Electronics and Informatics Engineering of Aveiro (IEETA) and Department of Electronics, Telecommunications and Informatics (DETI), University of Aveiro, Aveiro, Portugal.
2 Center for Computational and Stochastic Mathematics (CEMAT) and Department of Mathematics, IST, University of Lisbon, Lisbon, Portugal.
3 Department of Mathematics and Statistics, Helmut Schmidt University, Hamburg, Germany.
4 Intelligent Systems Associate Laboratory (LASI), Portugal.

Abstract: This work introduces a new class of statistical models for the analysis of space-time series of counts. The new class is addressed as space-time integer-valued ARMA (STINARMA), and holds clear connections with existing classes of models based on the ARMA framework. The STINARMA class constitutes an integer counterpart of the STARMA class of Pfeifer and Deutsch (1980) based on the binomial thinning operator of Steutel and Van Harn (1979). Also, the STINARMA class of models can be seen as a spatio-temporal extension (STARMA-type) of the popular integer-valued ARMA (INARMA) class.
This paper starts by recalling the formulation of the models conveyed in the STARMA and INARMA classes, and shows how the novel STINARMA class is constructed.

Keywords: integer-valued models; INARMA; binomial thinning operator; spatio-temporal models; STINARMA.

AMS subject classification: 62M10, 62H11, 62H12.

∗ Corresponding author: a.r.martins@ua.pt, sonia.gouveia@ua.pt

1 Introduction

Time series of count data, i.e. non-negative integer values, are common in many fields, such as finance or health [4]. One class playing an important role for count data is the INteger AutoRegressive and Moving Average (INARMA) class, driven by the well-known ARMA models. The INARMA models preserve the integer nature of the data by assigning a discrete distribution to the innovation process and by replacing the multiplication with the binomial thinning operator (BTO, hereafter) [12]. The INARMA developments were initiated by theoretical contributions on the INAR(1) [10, 1] and the INMA(1) [2] processes. More recent literature introduced multivariate INARMA contributions, namely the MINAR(1) and a generalised MINAR(p) process [8]. Other multivariate extensions have been sparsely proposed over the last decades, which explains why space-time count models are still quite underdeveloped and limited to a few applications with autoregressive approaches [3, 7].

Aiming at contributing to the development of space-time models of counts, this work proposes a new class of multivariate models that simultaneously accounts for temporal and spatial data dependency. This class is addressed as Space-Time INARMA (STINARMA) and borrows ideas from the INARMA and STARMA models. Many applications of this new framework are foreseen, e.g. the joint modelling of health data at different locations [9]. This paper starts by recalling the STARMA and INARMA frameworks, which are used to construct the novel STINARMA class.

2 Background on INARMA and STARMA models

The INARMA models are known as an integer counterpart of the ARMA models since they are formulated in a similar fashion, but they also show two essential differences resulting in a rather different model. One is the use of a discrete distribution for the innovation process; the other is the replacement of the multiplication by the '◦' BTO, defined as

ψ ◦ X := Σ_{i=1}^X u_i,   X > 0,    (1)

and 0 otherwise [12]. In this definition, ψ ◦ X is a non-negative discrete random variable (r.v.), where X ∈ N_0 is a discrete r.v. and the counting series u_i, i = 1, ..., X, are independent and identically distributed (i.i.d.) Bernoulli-distributed r.v.'s with probability of success ψ ∈ [0, 1), which are also independent of X. The boundary cases are 0 ◦ X = 0 and 1 ◦ X = X.

Let us consider Y_t ∈ N_0 to be a univariate process of counts. This process is said to be an INARMA(p, q) process if it can be represented by the recursion

Y_t = Σ_{i=1}^p α_i ◦ Y_{t−i} + Σ_{j=1}^q β_j ◦ ε_{t−j} + ε_t,   t ∈ Z,    (2)

where p and q are the AR and MA orders, respectively [5]. The i.i.d. innovation process ε_t ∈ N_0 exhibits finite mean and finite variance, and is independent of Y_u for u < t. All BTOs are performed independently of each other and of ε_t, for each t. These operations use the model parameters α_i, β_j ∈ [0, 1) for i = 1, ..., p and j = 1, ..., q, such that α_p, β_q ≠ 0.
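For intuition, here is a minimal R sketch (not from the paper): since the counting series are i.i.d. Bernoulli(ψ), the thinning ψ ◦ X is simply a Binomial(X, ψ) draw, which can be used directly to simulate an INAR(1) path with Poisson innovations; all names and parameter values below are illustrative.

# Binomial thinning (1): psi o X is a Binomial(X, psi) draw, and 0 when X = 0.
thin <- function(psi, x) {
  if (x == 0) return(0)
  rbinom(1, size = x, prob = psi)
}

# INAR(1): Y_t = alpha o Y_{t-1} + eps_t, with eps_t ~ Poisson(lambda).
simulate_inar1 <- function(n, alpha, lambda, y0 = 0) {
  y <- numeric(n)
  y_prev <- y0
  for (t in seq_len(n)) {
    y[t] <- thin(alpha, y_prev) + rpois(1, lambda)
    y_prev <- y[t]
  }
  y
}

set.seed(1)
y <- simulate_inar1(n = 500, alpha = 0.6, lambda = 2)
mean(y)   # should be close to lambda / (1 - alpha) = 5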
Given that ε_t and each BTO parcel are non-negative discrete variables, the INARMA model (2) results in a non-negative discrete-valued process.

Now consider Z_t = (Z_{1,t}, ..., Z_{S,t})^⊤ ∈ R^S to be a multivariate real-valued process defined at several locations s = 1, ..., S. Then, (Z_t) is a STARMA(p_{f_1,...,f_p}, q_{m_1,...,m_q}) process if it can be written as

Z_t = Σ_{i=1}^p Σ_{ℓ=0}^{f_i} φ_{iℓ} W^{(ℓ)} Z_{t−i} + Σ_{j=1}^q Σ_{ℓ=0}^{m_j} θ_{jℓ} W^{(ℓ)} ε_{t−j} + ε_t,   t ∈ Z,    (3)

where p and q are again the AR and MA orders, and the sets f_i and m_j are the spatial orders of the i-th AR and the j-th MA terms [11]. The model parameters are the scalars φ_{iℓ} and θ_{jℓ}. The matrix W^{(ℓ)} incorporates the spatial information in the STARMA model (3) through the entries

w^{(ℓ)}_{ij} := 1/n^{(ℓ)}_i  if i and j are ℓ-th order neighbours,  and  0 otherwise,    (4)

where n^{(ℓ)}_i is the number of ℓ-th order neighbours of location i. A weighted average of the spatial information is achieved by row-normalised weights, i.e., Σ_{j=1}^S w^{(ℓ)}_{ij} = 1, a property satisfied by (4). Definition (4) shows that sites i and j cannot simultaneously be ℓ-th and k-th order neighbours for ℓ ≠ k. Also, it follows that W^{(0)} = I_S, where I_S is the (S × S) identity matrix, as each location is its (only) zero-order neighbour. For ℓ > 0, w^{(ℓ)}_{ii} = 0, since each location is not a neighbour of order ℓ ≠ 0 of itself. Finally, W^{(ℓ)} is not necessarily symmetric for ℓ > 0. The random error ε_t = (ε_{1,t}, ..., ε_{S,t})^⊤ ∈ R^S is such that ε_{s,t} ∼ N(0, σ²), i.i.d. for s = 1, ..., S, with covariance matrix

Cov(ε_t, ε_{t+h}) = G = σ² I_S  if h = 0,  and  O_S  if h ≠ 0,    (5)

where O_S denotes the (S × S) zero matrix. Additionally, the random error is independent of Z_u with u < t, such that Cov(Z_t, ε_{t+h}) = O_S for h > 0.

3 The new class of STINARMA models

The new STINARMA class is now constructed by borrowing some of the ideas of the INARMA and STARMA models. Consider Y_t = (Y_{1,t}, ..., Y_{S,t})^⊤ ∈ N_0^S to be a multivariate process of counts at locations s = 1, ..., S. Then, (Y_t) is said to be a STINARMA process if it admits the representation

Y_t = Σ_{i=1}^p Σ_{ℓ=0}^{f_i} α_{iℓ} W^{(ℓ)} ◦ Y_{t−i} + Σ_{j=1}^q Σ_{ℓ=0}^{m_j} β_{jℓ} W^{(ℓ)} ◦ ε_{t−j} + ε_t,   t ∈ Z,    (6)

where the temporal (p, q) and the spatial (f_i, m_j) orders are defined as in (3), being referred to as STINARMA(p_{f_1,...,f_p}, q_{m_1,...,m_q}). The formulation in (6) highlights that Y_t is a sum of several time-lagged BTOs introducing the temporal dependence to the multivariate process. Moreover, the spatial dependence is introduced through the multiplication of α_{iℓ}, β_{jℓ} ∈ [0, 1) by W^{(ℓ)}, which is endowed into a matricial BTO [6, 8]. The BTO between a matrix Ψ with (fixed) entries in [0, 1) and a random vector X is defined as

Ψ ◦ X := ( Σ_{j=1}^S ψ_{1j} ◦ X_j , ... , Σ_{j=1}^S ψ_{Sj} ◦ X_j )^⊤,    (7)

i.e. the i-th component of Ψ ◦ X is Σ_{j=1}^S ψ_{ij} ◦ X_j, where each ψ_{ij} ◦ X_j is performed according to (1) for a given i and j and all thinning operations are performed independently.
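As an illustration (not part of the paper), the row-normalised weights in (4) and the matricial thinning in (7) can be sketched in R as follows; the adjacency structure and parameter values are made up for the example.

# Row-normalised first-order weight matrix (4): w_ij = 1/n_i for neighbours.
make_W <- function(adj) {                      # adj: 0/1 adjacency matrix
  sweep(adj, 1, pmax(rowSums(adj), 1), "/")
}

# Matricial BTO (7): i-th entry is sum_j psi_ij o X_j, with each psi_ij o X_j
# an independent Binomial(X_j, psi_ij) draw.
mat_thin <- function(Psi, x) {
  S <- length(x)
  out <- numeric(S)
  for (i in seq_len(S)) {
    out[i] <- sum(rbinom(S, size = x, prob = Psi[i, ]))
  }
  out
}

adj <- rbind(c(0, 1, 0), c(1, 0, 1), c(0, 1, 0))   # three sites on a line
W1  <- make_W(adj)                                 # rows sum to 1
x   <- c(4, 7, 2)
mat_thin(0.5 * W1, x)                              # one draw of (alpha * W) o x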
The STINARMA process differs from the continuous STARMA process (3) in the BTO (instead of the multiplication) and the non-negative discrete innovation process (instead of a continuously distributed one), which guarantees the non-negativeness and discreteness of the STINARMA process. The innovation process ε_t ∈ N_0^S in the STINARMA process (6) is assumed to be i.i.d. in time, with µ_ε := E(ε_t) = (µ_{ε_1}, ..., µ_{ε_S})^⊤, where µ_{ε_s} := E(ε_{s,t}) < ∞ and σ²_{ε_s} := V(ε_{s,t}) < ∞, for each component s = 1, ..., S. Moreover, ε_t is assumed to follow a (trivial) discrete and non-negative multivariate distribution, i.e., with mutually independent components. Under the mutual independence assumption, the covariance is defined as in (5), but now G = diag(σ²_{ε_1}, ..., σ²_{ε_S}) with zero off-diagonals. To sum up, the multivariate process ε_t is assumed to be i.i.d. in time and independent in space, but not necessarily identically distributed in space, which differs from G = σ² I_S in the continuous STARMA.

With regards to the neighbourhood matrix W^{(ℓ)}, a more general definition of spatial weights is considered for the STINARMA:

w^{(ℓ)}_{sn} := κ^{(ℓ)}_{sn}  if s and n are ℓ-th order neighbours,  and  0 otherwise,    (8)

where κ^{(ℓ)}_{sn} quantifies the influence of location n on s, provided that these are ℓ-th order neighbours. Clearly, (4) and (8) are equal if κ^{(ℓ)}_{sn} = 1/n^{(ℓ)}_s. Still, other metrics may be more advantageous to consider. E.g., the inverse of the Euclidean distance will attribute a lower weight to neighbours further apart and a higher weight to closer neighbours. This type of weighting scheme may also be used to consider all locations as first-order neighbours, decreasing model complexity (in terms of estimation) while retaining information on the entire spatial system. Also, considering all locations as first-order neighbours disregards the hierarchical ordering of neighbours, which avoids the subjective and cumbersome task of such a definition. Lastly, the BTOs imply that all entries of α_{iℓ} W^{(ℓ)} and β_{jℓ} W^{(ℓ)} must range between 0 and 1. Since α_{iℓ}, β_{jℓ} ∈ [0, 1), it suffices that w^{(ℓ)}_{sn} ∈ [0, 1], which holds for row-normalised W^{(ℓ)} matrices.

4 Conclusions and ongoing work

The new STINARMA class is an intuitive modelling framework to deal with temporally and spatially dependent count data by borrowing ideas from the INARMA and STARMA models. Theoretical results on first- and second-order moments, e.g., space-time autocovariance and autocorrelation functions, have been derived for the full STINARMA as well as for the STINAR and STINMA subclasses. Parametric estimation methods, assuming a given distribution for the innovations, have been developed. These are based on the method of moments, conditional least squares or conditional maximum likelihood, with the finite-sample performance being evaluated through simulation studies.

Acknowledgements: This work was partially funded by the Foundation for Science and Technology, FCT, through national (MEC) and European structural (FEDER) funds, in the scope of the research projects UIDB/00127/2020 (IEETA/UA) and UIDB/04621/2020 (CEMAT/UL). AM acknowledges a PhD grant (SFRH/BD/143973/2019) from the FCT.

References
[1] M. Al-Osh and A. A. Alzaid. First-order integer-valued autoregressive (INAR(1)) process.
Journal of Time Series Analysis, 8(3):261–275, 1987.
[2] M. Al-Osh and A. A. Alzaid. Integer-valued moving average (INMA) process. Statistical Papers, 29(1):281–300, 1988.
[3] S. Aldor-Noiman, L. D. Brown, E. B. Fox, and R. A. Stine. Spatio-temporal low count processes with application to violent crime events. Statistica Sinica, 26(4):1587–1610, 2016.
[4] R. A. Davis et al. Count time series: A methodological review. Journal of the American Statistical Association, 116(535):1533–1547, 2021.
[5] J. Dion, G. Gauthier, and A. Latour. Branching processes with immigration and integer-valued time series. Serdica Mathematical Journal, 21(2):123–136, 1995.
[6] J. Franke and T. Subba Rao. Multivariate first-order integer-valued autoregressions. Technical report, University of Kaiserslautern, 1993.
[7] N. Huda, U. Mukhaiyar, and U. Pasaribu. The approximation of GSTAR model for discrete cases through INAR model. In Journal of Physics: Conference Series, volume 1722, page 012100. IOP Publishing, 2021.
[8] A. Latour. The multivariate GINAR(p) process. Advances in Applied Probability, 29(1):228–248, 1997.
[9] A. Martins, M. Scotto, R. Deus, A. Monteiro, and S. Gouveia. Association between respiratory hospital admissions and air quality in Portugal: A count time series approach. PLOS ONE, 16(7):e0253455, 2021.
[10] E. McKenzie. Some simple models for discrete variate time series. JAWRA Journal of the American Water Resources Association, 21(4):645–650, 1985.
[11] P. E. Pfeifer and S. J. Deutsch. A three-stage iterative procedure for space-time modeling. Technometrics, 22(1):35–47, 1980.
[12] F. W. Steutel and K. van Harn. Discrete analogues of self-decomposability and stability. The Annals of Probability, pages 893–899, 1979.

Joint Species Spatial Modelling of Deer Count Data

Aoife K. Hurley∗1, Ruth F. Carden2, and James Sweeney1
1 Department of Mathematics and Statistics, University of Limerick, Ireland
2 School of Archaeology, University College Dublin, Ireland

Abstract: Accurately predicting the populations and spatial distribution of wild species is important from both a management and a conservation perspective. Data on wild species can be collected in various ways, including tracking devices and motion-activated cameras, as well as by conducting site surveys. Site survey data are generally collected at a fine spatial resolution (along a predefined path or within a 10 km² square), as collection is limited by human senses (vision and hearing) and by the costs of collecting the data itself. In conjunction with the survey data, other informative data, such as land cover covariates or cull data from a hunting season, may be available, though at a different spatial resolution, creating issues of spatial misalignment in the data sources. In the Republic of Ireland, there are four main species of deer in the wild, with the main three being Fallow, Red and Sika. Previous studies have looked at the distribution and abundance of these species in the Republic of Ireland [1, 5], but they account neither for the correlation between these species nor for the effect of different land coverings on their spatial distribution. We aim to jointly model the three species of deer using both observational and cull data, allowing the borrowing or pooling of information across the species, to estimate the impacts of some land coverings, and to create species distribution maps across the Republic of Ireland.
Keywords: Joint species modelling, Spatial statistics, Spatial modelling

AMS subject classification: 62H11

∗ Corresponding author: Aoife.Hurley@ul.ie

1 Introduction

Multi-species data have many complexities, including imperfect detection, spatial autocorrelation, between-species correlation, and preferential sampling. Due to the computational intensity required to model multiple species jointly, many studies have looked at individual species in isolation [5].

Multiple studies have been conducted regarding wild deer in Ireland, including: modelling the distribution and range expansion of wild deer in Ireland [1]; abundance of wild deer in Ireland [5]; and the overlap between wild deer species and endangered or vulnerable plant species [6]. More recently, national newspapers in the Republic of Ireland have written articles on wild deer populations and noted that "The overall size of the deer population in Ireland is unknown because no census of numbers has ever been conducted" [4]. This study aims to estimate the population of three wild deer species in Ireland, and to investigate the impact of various land coverings on the deer species.

2 Modelling

We will adopt subscript i to indicate observation, c = 1, 2, ..., 26 for county, g = 1, 2, ..., 732 for grid square, and species will be indicated by s = 1, 2, 3. In our model, we employ an N-mixture model [8], while including a multivariate intrinsic conditional autoregressive model (MICAR). We let NC_{s,c} represent the estimated population of species s in county c and N_s be the estimated population at a grid square:

N_{s,g} ∼ NegBin(p_{s,g}, r_{s,g}),   NC_{s,c} = N_s L_c,    (1)

where L is an index matrix of dimension g × c, with a 1 indicating which county a grid square belongs to. The parameters p_{s,g} and r_{s,g} are modelled as

p_{s,g} = θ_s / (θ_s + λ_{s,g}),   r_{s,g} = λ_{s,g} p_{s,g} / (1 − p_{s,g}),

where θ_s controls the level of overdispersion. To estimate these populations, we have the observation or sighting data (y_{s,i}) at the 10 km² grid level and the bagged number of deer at the county level (b_{s,c}), which we model as

y_{s,i} ∼ Binomial(d_{s,i∈g}, N_{s,i∈g}),   b_{s,c} ∼ Poisson(κ_{s,c} NC_{s,c}),    (2)

where κ_{s,c} is the associated cull rate for county c, d is an s × g matrix of detection probabilities, and i ∈ g indicates the grid square where observation i happened. Instead of having a detection probability for each observation, d_{i∈g} translates to having a detection probability for each square. This detection probability is modelled as

logit(d_{s,g}) = δ_s^⊤ G_g,

with G being a matrix of covariates. In this application G represents four groups of Corine land covering categories, and the groups are informed by expert opinion. We perform a log transformation on λ and model it using linear regression as follows:

log(λ_{s,g}) = X_g β_s + ϕ_{s,g},

where X_g contains the 14 different Corine land covering proportions for grid square g [2]. We impose the MICAR prior on ϕ, which can be written as

u_s ∼ MVN(0, τ_s (D − A)^{−1}),   ϕ = Σ ⊗ u.    (3)

In Equation (3), the spatially varying precision parameter is given by τ_s, D is a diagonal matrix that has the number of neighbours of each grid square along the diagonal, the matrix A is the adjacency matrix (1 if two squares neighbour each other and 0 otherwise), and Σ = W′W, with W being the upper triangular Cholesky decomposition of the variance-covariance matrix. Lawson has a tutorial that uses the R package nimble to implement this MICAR model [3].
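A quick numerical check of the overdispersion parameterisation above (illustrative, not from the paper; it assumes the standard size/probability form used by R's rnbinom): with p = θ/(θ + λ), the size works out to r = λp/(1 − p) = θ, so the counts have mean λ and variance λ + λ²/θ, and θ indeed controls the overdispersion.

# Illustrative values only (not estimates from the paper).
theta  <- 2
lambda <- 10
p <- theta / (theta + lambda)
r <- lambda * p / (1 - p)          # algebraically equal to theta

set.seed(42)
draws <- rnbinom(1e5, size = r, prob = p)
c(mean = mean(draws), var = var(draws))
c(mean = lambda, var = lambda + lambda^2 / theta)   # theoretical values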
For identifiability of the intercept, we impose a sum-to-zero constraint on u_s. To alleviate the effects of confounding between variables, we implement a Bayesian LASSO penalty [7] on the β_s using β_s ∼ Laplace(1, 1). We impose prior distributions on the remaining parameters of the models:

κ_{s,c} ∼ Normal(µ_{s,c}, σ²_{s,c}),   τ_s ∼ Gamma(10, 2),   θ_s ∼ Exp(0.5).

The values of µ_{s,c} and σ²_{s,c} for each species were informed through expert elicitation.

3 Wild deer populations in the Republic of Ireland

We have data across two distinct spatial levels. At the 10 km² level we have CORINE land cover variables as well as sightings data. This sightings data includes the number of deer seen within a square, as well as the species observed. At the county level, we have the number of bagged deer by species. In this study, we fix the cull percentages κ_s. Each cull rate, κ_{c,s}, was set to be the mean of the informed intervals. The cull percentages are set as constants for identifiability of the detection probability, amongst others. We plan to sample cull percentages from these intervals in the future to account for additional uncertainty.

3.1 Covariate estimates

Figure 1 shows the estimated β_s, with the associated 95% credible intervals. Many intervals contain zero, except for the species-specific intercepts, pastures for Fallow deer and peat bogs for Red deer. These positive associations can be interpreted as land types where these deer are likely to reside.

Figure 1: Covariate estimates

3.2 Spatial surfaces

Figure 2 illustrates the mean value of the fitted spatial surfaces alongside the associated standard deviations for each species of wild deer. In the maps of standard deviations, we mark each grid square that has had at least one observation in our data set with a triangle.

Figure 2: Estimated spatial surfaces for Fallow, Red and Sika deer

When we compare these surface estimates to the abundance maps created by Morera-Pujol et al. [5], we can see similar patterns occurring for each species.

3.3 Population estimates

From expert opinion, we would expect Red deer to have the smallest population, whereas Fallow and Sika would have quite large, somewhat similarly sized populations. Table 1 shows the estimated median total population for each species, along with the lower and upper 95% credible interval, at the end of the 2017/2018 hunting season.

Table 1: Estimated total population for the three wild deer species
         FALLOW     RED       SIKA
2.5%     188,747    46,318    245,965
50%      191,821    47,740    249,610
97.5%    195,171    49,317    253,575

4 Discussion

We model three species of wild deer jointly and estimate their populations, which is the first time this has been done for these species in the Republic of Ireland. The above results have the cull parameter, κ, set to an informed constant. Rerunning with cull rates of different values, and sampling them, will add additional variability to our results.

Acknowledgements: This work has emanated from research conducted with the financial support of Science Foundation Ireland under Grant Number 18/CRT/6049.

References
[1] R. F. Carden, C. M. Carlin, F. Marnell, D. Mcelholm, J. Hetherington, and M. P. Gammell. Distribution and range expansion of deer in Ireland. Mammal Review, 41(4):313–325, 2011.
[2] EU Copernicus. Copernicus Land Monitoring Service CORINE Land Cover User Manual. European Environment Agency (EEA), 2021.
[3] A. B. Lawson. NIMBLE for Bayesian Disease Mapping.
Spatial and Spatio-temporal Epidemiology, 33:100323, 2020.
[4] H. McGee. Calls for larger deer culls after record 55,000 shot dead last year. The Irish Times, May 2023.
[5] V. Morera-Pujol, P. S. Mostert, K. J. Murphy, T. Burkitt, B. Coad, B. J. McMahon, M. Nieuwenhuis, K. Morelle, A. I. Ward, and S. Ciuti. Bayesian species distribution models integrate presence-only and presence–absence data to predict deer distribution and relative abundance. Ecography, 2023(2):e06451, 2023.
[6] J. O'Mahony, A. Vanmechelen, and P. Holloway. Quantifying the distribution and potential biotic interactions between deer and flora using species distribution modelling. Annals of GIS, pages 1–16, July 2023.
[7] T. Park and G. Casella. The Bayesian lasso. Journal of the American Statistical Association, 103(482):681–686, 2008.
[8] J. A. Royle. N-mixture models for estimating population size from spatially replicated counts. Biometrics, 60(1):108–115, 2004.

Nonparametric Simultaneous Confidence Bands: The Case of Known Input Distributions

Bálint Horváth∗1 and Balázs Csanád Csáji2
1 HUN-REN Institute for Computer Science and Control, Budapest, Hungary and Institute of Mathematics, Budapest University of Technology and Economics (BME)
2 HUN-REN Institute for Computer Science and Control, Budapest, Hungary and Institute of Mathematics, Eötvös Loránd University (ELTE), Hungary

Abstract: In this paper, we construct nonparametric, nonasymptotic, and simultaneous confidence bands for band-limited regression functions based on the theory of Paley-Wiener kernels. We work with a sample of independent and identically distributed (i.i.d.) input-output pairs, the measurement noises are assumed to have a joint distribution that is invariant with respect to transformations from a compact matrix group (e.g., permutations), and we also assume that the distribution of the inputs is a priori known. The task is divided into two steps: first, we study the case when the outputs are noise-free; then the problem is generalized for measurement noises. The algorithms provide nonasymptotic guarantees for the inclusion of the true regression function in the confidence band, simultaneously for all possible inputs. Finally, we demonstrate our results via numerical experiments.

Keywords: nonasymptotic confidence bands, convex optimization, reproducing kernel Hilbert spaces, Paley-Wiener spaces, randomized algorithms

AMS subject classification: 46E22, 62G15, 68Q32

Acknowledgements: This research was supported by the European Union within the framework of the Artificial Intelligence National Laboratory, RRF-2.3.1-21-2022-00004; and by the TKP2021-NKTA-01 grant of the National Research, Development and Innovation Office, Hungary.

∗ Corresponding author: horvath.balint@sztaki.hu

1 Introduction

Constructing confidence bands for the regression function from a finite sample of input-output data is a core problem in statistics and machine learning [1]. In a parametric setting, such region estimates are typically induced by confidence sets in the parameter space; however, in a nonparametric setting this indirect approach is often infeasible, which calls for direct constructions.

The problem comes with a fairly standard setting. We are given a finite i.i.d. sample of input-output pairs,
(x_1, y_1), ..., (x_n, y_n), having an unknown joint distribution P_{X,Y}, where x_k ∈ R^m, y_k ∈ R and E[y_k²] < ∞. We assume that y_k = f∗(x_k) + ε_k, for k ∈ [n] := {1, ..., n}, where {ε_k} represent the (measurement) noise terms on the true regression function f∗, with E[ε_k] = 0.

Our primary goal is the following: we are looking for a function I : R^m → R × R, such that I(x) = (I_1(x), I_2(x)) specifies the endpoints of an interval estimate for f∗(x), where x ∈ R^m. The aim is to construct I with the property

ν(I) := P( I_1(x) ≤ f∗(x) ≤ I_2(x), for P_X-a.e. x ∈ R^m ) ≥ 1 − α,

where α ∈ (0, 1) is a (user-chosen) risk probability. Since the confidence band I depends on the sample, typically we have P = P_{X,Y}^{⊗n}. We call ν(I) the reliability of the confidence band. Next, we define Paley-Wiener spaces [2].

Definition 1. A Paley-Wiener space H is a subspace of L²(R^m), where for each φ ∈ H the support of the Fourier transform of φ is included in a given hypercube [−η, η]^m, where η > 0 is a hyper-parameter.

Paley-Wiener spaces are Reproducing Kernel Hilbert Spaces (RKHSs) with the following reproducing kernel function. For all u, v ∈ R^m:

k(u, v) := π^{−m} Π_{j=1}^m sin(η(u_j − v_j)) / (u_j − v_j),

where, for convenience, sin(η · 0)/0 is defined to be η. Henceforth, we work with the Paley-Wiener kernel defined above and denote our RKHS by H. Our fundamental assumptions are as follows:

A1. The sample (x_1, y_1), ..., (x_n, y_n) ∈ R^m × R is i.i.d., and E[y_1²] < ∞.
A2. For k ∈ [n], the variables {x_k} and {ε_k} are independent, E[ε_k] = 0, and there is a compact matrix group G ⊆ R^{n×n}, such that ∀ G ∈ G: Gε =_d ε.
A3. The probability distribution of the inputs, {x_k}, is a priori known, it is absolutely continuous, and its density, h∗, satisfies h∗(x) > 0, ∀ x ∈ R^m.
A4. The regression function f∗ is from a Paley-Wiener space and there is a universal constant ρ > 0, such that for all x ∈ R^m, f∗²(x) ≤ ρ h∗(x).

The third part of A2 can be easily satisfied, e.g., by the group of permutation matrices [3], as {ε_k} are i.i.d.; A4 ensures that the observations are informative.

2 Noise-Free Outputs

We start by studying a simplified problem: when the true regression function, f∗, is observed perfectly at random inputs, that is, ∀ k ∈ [n]: y_k = f∗(x_k). Since A1 and A3 guarantee that the inputs {x_k} are almost surely distinct, the element of H which interpolates every output y_k at the corresponding input x_k and which has the smallest possible kernel norm, that is,

f̄ := arg min { ‖f‖_H : f ∈ H  and  ∀ k ∈ [n]: f(x_k) = y_k },

exists and it takes the following form for all possible inputs x ∈ R^m:

f̄(x) = Σ_{k=1}^n α̂_k k(x, x_k),

where the weights are α̂ = K^{−1} y with y = (y_1, ..., y_n)^⊤ and α̂ := (α̂_1, ..., α̂_n)^⊤, and K_{i,j} = k(x_i, x_j) is the kernel or Gram matrix. Note that under A1, A3 and A4, the Gram matrix is almost surely invertible.

The main idea behind our approach is as follows. First, we need to estimate how "smooth" the function is, which is measured by the squared norm ‖f∗‖²_H.

Lemma 1. Assuming A1, A3, A4 and that y_k = f∗(x_k) for k ∈ [n], for any risk probability α ∈ (0, 1), we have P( ‖f∗‖²_H ≤ κ ) ≥ 1 − α, with

κ = (1/n) Σ_{k=1}^n y_k² / h∗(x_k) + ρ √( ln α / (−2n) ).

This statement can be proved analogously to the similar Lemma 1 in [3].
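A minimal R sketch of these objects (illustrative only, with made-up scalar data; a toy for the computations rather than a setup satisfying every assumption): the Paley-Wiener kernel, the Gram matrix, the minimum norm interpolant weights α̂ = K^{−1}y, and the bound κ from Lemma 1 for inputs drawn uniformly on [−1, 1], whose density is h∗ = 1/2 there.

# Paley-Wiener kernel on R (m = 1): k(u, v) = sin(eta*(u - v)) / (pi*(u - v)),
# with value eta/pi on the diagonal.
pw_kernel <- function(u, v, eta) {
  d <- outer(u, v, "-")
  out <- sin(eta * d) / (pi * d)
  out[d == 0] <- eta / pi
  out
}

set.seed(1)
n   <- 30
eta <- 30
x   <- runif(n, -1, 1)                  # inputs with known distribution
h   <- function(x) rep(1/2, length(x))  # their density h* on [-1, 1]
f0  <- function(x) sin(5 * x)           # a stand-in "true" function
y   <- f0(x)                            # noise-free outputs

K     <- pw_kernel(x, x, eta)           # Gram matrix
alpha <- solve(K, y)                    # minimum norm interpolant weights

risk  <- 0.05                           # risk probability alpha of Lemma 1
rho   <- 2                              # assumed constant with f0^2 <= rho * h
kappa <- mean(y^2 / h(x)) + rho * sqrt(log(risk) / (-2 * n))
kappa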
To test a "candidate" input-output pair (x_0, y_0) ∈ R^m × R, we can compute the minimum norm needed to interpolate the original {(x_k, y_k)}, k ∈ [n], combined with (x_0, y_0) ∈ R^m × R. The minimum norm interpolant of (x_0, y_0), (x_1, y_1), ..., (x_n, y_n) is now f̃(x) = Σ_{k=0}^n α̃_k k(x, x_k), where the weights are α̃ = K_0^{−1} ỹ with ỹ = (y_0, y_1, ..., y_n)^⊤ and α̃ = (α̃_0, ..., α̃_n)^⊤, and K_0(i+1, j+1) = k(x_i, x_j) is the extended kernel matrix. Since H is an RKHS, we have

‖f̃‖²_H = α̃^⊤ K_0 α̃ = ỹ^⊤ K_0^{−1} K_0 K_0^{−1} ỹ = ỹ^⊤ K_0^{−1} ỹ.

For a candidate (x_0, y_0) input-output pair, we first calculate the norm square of the minimum norm interpolant of {(x_0, y_0)} ∪ {(x_k, y_k)}. Then, if this norm square is less than or equal to our estimate (denoted by κ), we include (x_0, y_0) in our confidence band; otherwise, (x_0, y_0) is not included in the band.

To obtain the interval endpoints for a given query input x_0, we have to calculate the highest and lowest y_0 values which can be interpolated with a function from H having at most norm square κ. This leads to the problems

min / max  y_0
subject to  (y_0, y^⊤) K_0^{−1} (y_0, y^⊤)^⊤ ≤ κ,    (1)

where "min / max" means that we have to solve the problem as a minimization and also as a maximization (separately). The problems in (1) are convex and their solutions can be calculated analytically [3]; a small numerical sketch is given after Lemma 2 below. The optimal values, denoted by y_min and y_max, respectively, determine the endpoints of the confidence interval for f∗(x_0), that is, I_1(x_0) = y_min and I_2(x_0) = y_max. If there is no solution, we return I(x_0) = ∅. We conclude with the following theorem.

Theorem 1. Assume A1, A3, A4 and that y_k = f∗(x_k) for all k. Then, for any risk probability α ∈ (0, 1) and for any finite sample size n, the constructed confidence band is guaranteed to have the reliability ν(I) ≥ 1 − α.

Proof sketch. According to Lemma 1, P( ‖f∗‖²_H ≤ κ ) ≥ 1 − α. If ‖f∗‖²_H ≤ κ, then for all x_0, the value f∗(x_0) is in the confidence band, since f∗ interpolates the sample extended with (x_0, f∗(x_0)) and its norm square is ≤ κ, thus the minimum norm interpolant of this extended sample inherits this norm bound.

3 Noisy Outputs

Next, we provide our solution for the case when the outputs are affected by measurement noises which satisfy A2. A new problem is that we do not observe the true function values at the sample inputs. However, we can apply the KGP method [5] to construct an ellipsoid Z with the guarantee

P( (f∗(x_1), ..., f∗(x_d))^⊤ ∈ Z ) ≥ 1 − β,

for a given β ∈ (0, 1) and d ≤ n [4]. In order to get an estimate of ‖f∗‖²_H, we can solve the following problem

maximize  (1/d) Σ_{k=1}^d z_k² / h∗(x_k)
subject to  z ∈ Z.    (2)

This problem is not convex, but due to strong duality, we can solve its convex dual instead. The construction is analogous to the one in [4, Section 6.1].

Lemma 2. Under A1, A2, A3, A4 and for any α, β ∈ (0, 1) risk probabilities,

P( (f∗(x_1), ..., f∗(x_d))^⊤ ∈ Z  ∧  ‖f∗‖²_H ≤ τ ) ≥ 1 − α − β,

where τ := ξ + ρ √( ln(α) / (−2d) ) and ξ is the optimal value of problem (2).
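As promised above, here is a hedged numerical sketch (not the authors' code) of the endpoint computation in the noise-free problem (1); the noisy problem (3) below has the same structure with the additional ellipsoid constraint. Writing Q = K_0^{−1}, the constraint is a quadratic inequality in y_0 whose roots give y_min and y_max. The helper pw_kernel and the toy quantities x, y, kappa and eta are reused from the earlier sketch.

# Endpoints of the noise-free interval from problem (1): the constraint
# (y0, y^T) K0^{-1} (y0, y^T)^T <= kappa is a quadratic inequality in y0.
band_endpoints <- function(x0, x, y, kappa, eta) {
  K0 <- pw_kernel(c(x0, x), c(x0, x), eta)   # extended Gram matrix
  Q  <- solve(K0)
  a  <- Q[1, 1]
  b  <- drop(Q[1, -1] %*% y)
  cc <- drop(t(y) %*% Q[-1, -1] %*% y) - kappa
  disc <- b^2 - a * cc
  if (disc < 0) return(c(NA, NA))            # infeasible: I(x0) is empty
  c(ymin = (-b - sqrt(disc)) / a, ymax = (-b + sqrt(disc)) / a)
}

band_endpoints(x0 = 0.1, x = x, y = y, kappa = kappa, eta = eta)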
Given an ellipsoid Z which contains, with high probability, the true outputs of f∗ at the sample inputs {x_k}_{k=1}^d, we can construct a confidence interval for f∗(x_0) at any query input x_0 by computing the maximum and the minimum potential output z_0 ∈ R at x_0 for which there is an interpolant that interpolates the sample {(x_0, z_0)} ∪ {(x_k, z_k)}_{k=1}^d, for some z ∈ Z, and has a norm square ≤ τ, i.e., our upper bound for ‖f∗‖²_H. This leads to the (convex) problems

min / max  z_0
subject to  (z_0, z_1, ..., z_d) K_0^{−1} (z_0, z_1, ..., z_d)^⊤ ≤ τ,    (3)
            (z_1, ..., z_d) ∈ Z.

Let z_min and z_max be the optimal values of (3). The confidence interval for f∗(x_0) is given by [z_min, z_max]. We return I(x_0) = ∅ if (3) is infeasible.

Theorem 2. Assume that A1, A2, A3 and A4 are satisfied. Then, for any risk probabilities α, β ∈ (0, 1) and for any finite sample size n, the constructed confidence band is guaranteed to have the reliability ν(I) ≥ 1 − α − β.

Proof sketch. The core idea of the proof is very similar to that of Theorem 1. According to Lemma 2, the event A that both (f∗(x_1), ..., f∗(x_d))^⊤ ∈ Z and ‖f∗‖²_H ≤ τ happen has probability at least 1 − α − β. Conditioning on event A, for every query input point x_0 such that K_0 is invertible, which holds a.e., we can guarantee that there is a z ∈ Z, namely z = (f∗(x_1), ..., f∗(x_d))^⊤, and a z_0, namely z_0 = f∗(x_0), such that the minimum norm interpolant of {(x_0, z_0)} ∪ {(x_k, z_k)}_{k=1}^d has norm square ≤ τ, since f∗ itself is an interpolant of this dataset. Thus, we have that z_min ≤ z_0 ≤ z_max for z_0 = f∗(x_0). This property is always guaranteed, hence, we simultaneously have for P_X-a.e. x_0 ∈ R^m that z_min(x_0) ≤ f∗(x_0) ≤ z_max(x_0), under A.

4 Numerical Experiments

The methods were also tested and implemented numerically. The Paley-Wiener RKHS was used with parameter η = 30, and the data-generating function was created as follows: 20 random input points {x̄_k}_{k=1}^{20} were generated with uniform distribution on [−1, 1]; then f∗(x) = Σ_{k=1}^{20} w_k k(x, x̄_k), where each w_k had a uniform distribution on [−1, 1]. The function was normalized in case its maximum value exceeded 1.

A sample with n = 300 random noisy observations from f∗ was generated. The inputs {x_k} followed a Laplace distribution with location µ = 0 and scale b = 0.3 parameters, while the measurement noise {ε_k} had the following
distribution: first, we experimented with a non-symmetric case, where ε ∼ exp(λ) − 1/λ with λ = 0.3; then we implemented an experiment with a symmetrically distributed noise, namely ε had a Laplace distribution with parameters µ = 0 and b = 0.3. Both of these statistical setups satisfy A2. Figure 1 demonstrates that the proposed approach leads to feasible and informative simultaneous (nonparametric, nonasymptotic) confidence bands.

[Figure 1 consists of two panels, "Simultaneous confidence bands (exponential noise)" and "Simultaneous confidence bands (Laplace noise)", each showing all data points, the selected data points, the original function, and the 50% and 90% confidence bands.]

Figure 1: Nonparametric, nonasymptotic, simultaneous confidence bands with Laplace distributed inputs and symmetric and non-symmetric noises.

References
[1] Knafl, G., Sacks, J., Ylvisaker, D. (1985). Confidence bands for regression functions. Journal of the American Statistical Association, 80, 683-691.
[2] Yang, J., Sindhwani, V., Avron, H., Mahoney, M. (2014). Quasi-Monte Carlo feature maps for shift-invariant kernels. In International Conference on Machine Learning (pp. 485-493). PMLR.
[3] Csáji, B. Cs., Horváth, B. (2022). Nonparametric, Nonasymptotic Confidence Bands With Paley-Wiener Kernels for Band-Limited Functions. IEEE Control Systems Letters, IEEE, 6, 3355-3360.
[4] Csáji, B. Cs., Horváth, B. (2023). Improving Kernel-Based Nonasymptotic Simultaneous Confidence Bands. 22nd IFAC World Congress, Yokohama, Japan, July 9-14, 2023.
[5] Csáji, B. Cs., Kis, K. B. (2019). Distribution-Free Uncertainty Quantification for Kernel Methods by Gradient Perturbations. Machine Learning, Springer, 108(8-9), 1677-1699.

A primer on experimental design for practitioners with efficient augmented designs

Carlos de la Calle-Arroyo∗1, Jesús López-Fidalgo1, and Licesio J. Rodríguez-Aragón2
1 Instituto de Ciencia de los Datos e Inteligencia Artificial, Universidad de Navarra
2 Instituto de Matemática Aplicada, Universidad de Castilla-La Mancha

Abstract: Experimental design plays a critical role in scientific research and allows practitioners to efficiently allocate resources and maximise the information gained from experiments. This work serves as a primer on experimental design for practitioners. We introduce a workflow that encompasses the calculation of optimal designs, design augmentation, and rounding using the R package optedr. By following this approach, researchers can optimise their experimental designs, leading to more reliable and statistically robust results while efficiently utilising available resources.

Keywords: Optimal Design, D-optimality, Augmented Designs, General Equivalence Theorem, R package.

AMS subject classification: 62K05.

1 Introduction

The sub-field of optimal experimental design in statistics has a long history and goes as far back as [7], who seems to be the first to consider how to optimally collect observations in a regression model in a controlled experiment. The chosen design criterion depends on the goal of the study, which typically is estimating the model parameters as accurately as possible, a function thereof, or minimising the prediction variance.

∗ Corresponding author: cdelacallea@unav.es
Thus, the aim of this paper is to illustrate a practical design workflow for the Michaelis-Menten model with the help of the R package optedr, available on CRAN. The workflow can be summarised as follows: first, optimal designs are calculated, taking into account the probability distribution of the response; second, the generated design is augmented with the methodology described in [1]; third, the efficiency of the resulting design is checked; and last, the approximate design is rounded to an exact design, ready to be used by the practitioner.

2 Optimal design example workflow: Michaelis-Menten model

The Michaelis-Menten (MM) model is a non-linear model that describes the kinetics of enzymatic reactions and has been frequently studied from the perspective of optimal design [5, 2]. It relates the reaction rate y to the concentration of a substrate, x, and its expression is given by:

y(x) = η(x; θ) + ε = m · x / (k + x) + ε,   ε ∼ N(0, σ²(x)).   (1)

Here, θ = (m, k), where m represents the maximal velocity rate, and k is the MM constant, which characterises the action of the enzyme, as it is the value at which there is a half-maximal response.

2.1 Calculation of Optimal Designs: opt_des

In this example, we calculate locally optimal designs for the MM model (Equation 1) in the design space x ∈ [0.125, 64], using the initial parameter estimates θ_0 = (m, k)^T = (10, 1) ([3]), through the function opt_des. It is common for many biological phenomena to exhibit heteroscedastic behaviour, with non-constant variances for the model errors. [3] propose the calculation of D-optimal designs for the MM model when the variance of the response is a function of the mean response (expressed through a weight function). They suggest that var(y(x)) = d · exp(c · mx/(k + x)). Therefore, we proceed to calculate the D-optimal design for the heteroscedastic model using the values c = −2 and d = 5 for the constants in the variance expression.

R> c <- -2; d <- 5; m <- 10; k <- 1
R> weight_MM <- function(x) sqrt(d*exp(c*m*x/(k + x))^(-1))

In our case, we pass the weight function as a parameter, weight_fun = 1/var(η(x, θ)), as defined by [4], to obtain the design for the heteroscedastic case.

R> MM_Dopt_het <- opt_des("D-Optimality", y ~ m*x/(k+x),
+    par_values = c(10,1), parameters = c("m", "k"),
+    design_space = c(0.125, 64), weight_fun = weight_MM)
R> summary(MM_Dopt_het)

Model: y ~ m * x/(k + x)
and weight function: function(x) sqrt(d*exp(c*m*x/(k + x))^(-1))
Optimal design for D-Optimality:
  Point      Weight
1 8.500127   0.499995
2 64.000000  0.500005
Minimum efficiency (Atwood): 99.9989211778552%
Criterion value: 3.479517e-09

The optimal design is supported at the upper bound of the design space, as has been proved analytically, and at an interior point which depends on the initial values of the unknown parameters. It is frequent for D-optimal designs to have one or several support points on the boundary of the design space.

2.2 Design Augmentation: augment_design

Next, we demonstrate how to augment a D-optimal design to include an additional proportion of observations. In this example, we augment the optimal design for the heteroscedastic case, supported at the points 8.5 and 64 with half of the observations at each point. We propose allowing the experimenter to select 30% of the points of the design (parameter alpha), resulting in an augmented D-design that guarantees a minimum D-efficiency.
R> aug_MM_Dhet <- augment_design(criterion = "D-Optimality",
+    init_design = MM_Dopt_het$optdes, alpha = 0.3,
+    model = y ~ m*x/(k+x), parameters = c("m", "k"),
+    par_values = c(10,1), design_space = c(0.125, 64),
+    weight_fun = weight_MM, calc_optimal_design = T)

Figure 1: Candidate points region to D-augment the MM design with a choice of alpha = 0.3 and delta = 0.87. The range of possible delta values is shown in purple.

The user selects a lower bound for the D-efficiency of the new augmented design, δ, from a range of efficiencies (in our case δ = 0.87 ∈ (0.71, 0.95)). Subsequently, the candidate regions R are shown (Figure 1), and we can introduce as many points as we want as long as they belong to the proposed regions. Each point should be accompanied by a weight. In our case, we have added the points 15, 20, 40, and 50, all with the same weight, 7.5%, making a total of 30% of new points in the design.

R> aug_MM_Dhet
  Point      Weight
1 8.500127   0.3499965
2 64.000000  0.3500035
3 15.000000  0.0750000
4 20.000000  0.0750000
5 40.000000  0.0750000
6 50.000000  0.0750000

When it comes to expanding a design, it might seem that the most desirable option is always to choose the highest possible value of δ, as this guarantees the highest minimum efficiency for our design. While this is true, there is an opportunity cost associated with this choice: the larger the chosen δ, the smaller the region of candidate points R will be, and the less flexibility the experimenter will have when expanding their design.

2.2.1 Efficient rounding of a design: efficient_round

One of the disadvantages of approximate designs is that they cannot be implemented directly in practice; it is necessary to transform the approximate design into an exact design. The function efficient_round proposes a rounding following the methodology of [6]. In this case, we present the rounding result for n = 20.

R> efficient_round(aug_MM_Dhet, n = 20)
  Point      Weight
1 8.500127   6
2 64.000000  6
3 15.000000  2
4 20.000000  2
5 40.000000  2
6 50.000000  2

It is evident that any rounding strategy affects the design and, therefore, its efficiency. With an efficiency of 88.4%, this exact design is now ready to be used in the experimental setting. In order to further illustrate how the choice of α and δ affects the resulting region, as well as the rounding effect, Table 1 compares the extension of the candidate points region and the efficiency of the approximate and rounded exact designs (n = 17). The efficiencies have been computed with the function design_efficiency.

                              α = 0.2   α = 0.3   α = 0.4   aug_MM_Dhet
Region Size                   60.65     53.31     23.69     53.31
Design Efficiency             87.04%    88.44%    92.39%    91.67%
Rounded Design Efficiency     88.60%    88.46%    91.86%    87.46%

Table 1: Regions and efficiencies of augmented designs for different choices of α and δ = 0.87.

It should be noted that, aside from the example carried out throughout this work, the points for the augmented designs in Table 1 have been chosen as the endpoints of the candidate points regions, and that for α = 0.1 or α = 0.5 the value δ = 0.87 was, respectively, too low or too high and therefore not feasible. Decreasing/increasing δ has an analogous effect to decreasing/increasing α, and their choice can be balanced to the experimenters' liking.
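As an independent back-of-the-envelope check of what D-efficiency means here, the information matrices involved can also be computed by hand. The sketch below is only illustrative and does not use optedr; it assumes the standard convention in which each design point contributes its weight times the inverse variance times the outer product of the gradient of η with respect to (m, k) (cf. [4]), so its output need not reproduce Table 1 exactly.

# D-efficiency of the augmented design relative to the D-optimal one,
# computed "by hand" for the heteroscedastic MM model (illustration only).
m <- 10; k <- 1; c <- -2; d <- 5

grad_eta <- function(x) c(x / (k + x), -m * x / (k + x)^2)  # d(eta)/d(m, k)
lambda   <- function(x) 1 / (d * exp(c * m * x / (k + x)))  # inverse variance

info_matrix <- function(points, weights) {
  M <- matrix(0, 2, 2)
  for (i in seq_along(points)) {
    f <- grad_eta(points[i])
    M <- M + weights[i] * lambda(points[i]) * (f %o% f)     # rank-one contribution
  }
  M
}

M_opt <- info_matrix(c(8.500127, 64), c(0.5, 0.5))          # design from opt_des
M_aug <- info_matrix(c(8.500127, 64, 15, 20, 40, 50),       # augmented design
                     c(0.35, 0.35, rep(0.075, 4)))

(det(M_aug) / det(M_opt))^(1 / 2)   # D-efficiency with p = 2 parameters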
3 Summary

The previous section has highlighted how design of experiments can be used in an experimental workflow. It has also briefly summarised the main features of the optedr package through an example with the Michaelis-Menten model. We hope that the presented ideas highlight the importance of proper statistical care before carrying out an experiment, and provide practitioners with a tool that offers an end-to-end framework for the design process.

Acknowledgements: This work was sponsored by Ministerio de Ciencia e Innovación PID2020-113443RB-C21 and by Consejería de Educación, Cultura y Deportes of Junta de Comunidades de Castilla-La Mancha and Fondo Europeo de Desarrollo Regional SBPLY/17/180501/000380.

References

[1] C. de la Calle-Arroyo, M. Amo-Salas, J. López-Fidalgo, L. J. Rodríguez-Aragón and W. K. Wong. A methodology to D-augment experimental designs. Chemometrics and Intelligent Laboratory Systems, 237:14822, 2023.

[2] H. Dette, V. B. Melas and W. K. Wong. Optimal Design for Goodness-of-Fit of the Michaelis-Menten Enzyme Kinetic Function. Journal of the American Statistical Association, 100(472):1370–1381, 2005.

[3] H. Dette and W. K. Wong. Design Issues for the Michaelis-Menten Model. Biometrics, 55:925–929, 1999.

[4] V. V. Fedorov and S. L. Leonov. Optimal Design for Nonlinear Response Models. Taylor & Francis, 2013.

[5] J. López-Fidalgo and W. K. Wong. Design Issues for the Michaelis-Menten Model. Journal of Theoretical Biology, 215(1):1–11, 2002.

[6] F. Pukelsheim and S. Rieder. Efficient Rounding of Approximate Designs. Biometrika, 79(4):763–770, 1992.

[7] K. Smith. On the Standard Deviations of Adjusted and Interpolated Values of an Observed Polynomial Function and its Constants and the Guidance they give Towards a Proper Choice of the Distribution of Observations. Biometrika, 12(1):1–85, 1918.

Parameter estimation in stochastic heat equations driven by fractional and mixed fractional Brownian motion

Diana Avetisian∗1 and Kostiantyn Ralchenko1

1 Department of Probability Theory, Statistics and Actuarial Mathematics, Taras Shevchenko National University of Kyiv, Ukraine

Abstract: The paper is devoted to a stochastic heat equation driven by mixed fractional Brownian noise. We investigate the covariance structure, stationarity, upper bounds and asymptotic behavior of the solution. We construct a strongly consistent estimator for the Hurst index H and prove its asymptotic normality for H < 3/4, based on discrete-time observations of the solution. Then, assuming the parameter H to be known, we deal with the joint estimation of the coefficients in front of the Wiener process and in front of the fractional Brownian motion.

Keywords: Stochastic partial differential equation, fractional Brownian motion, strong consistency, asymptotic normality

AMS subject classification: 60G22, 60H15, 62F10, 62F12

1 Introduction

The paper is devoted to parameter estimation in a stochastic heat equation of the following form:

∂u/∂t (t, x) − (1/2) ∂²u/∂x² (t, x) = σ Ḃ^H_x + κ Ẇ_x,   t > 0, x ∈ R,   (1)
u(0, x) = 0,   x ∈ R.   (2)

∗ Corresponding author: diana.avetisian2017@gmail.com

We consider the problem of estimating the unknown parameters H, σ, κ in equation (1) from discrete observations of its solution u(t, x). The solution u(t, x) is observed at equidistant spatial points for several fixed time instants. We start by proving stationarity and ergodicity of the solution u(t, x), as a function of the spatial variable x, by analyzing the behavior of the covariance function.
Based on these results we construct a strongly consistent estimator of H (assuming that the parameters σ and κ are unknown). The asymptotic normality of this estimator is proved for any 1 1 3 H (0 ( ∈ , ) ∪ , ). Then we 2 2 4 consider the problem of estimating the couple of parameters (σ, κ) when the value of H is known. We prove strong consistency of the estimator and investigate its asymptotic normality. It is worth to mention that the stochastic heat equation with fractional Brownian noise H B : x ∂u 1 ˙ − ∆ u ( H t, x ) = σ B , t > 0, x x ∂t ∈ R, 2 (3) u(0, x) = 0, x ∈ R. is a particular case of the previous equation when κ = 0. It was studied in detail in our previous papers [1, 2]. In [1] we constructed a strongly consistent estimator of the diffusion parameter σ assuming that Hurst parameter H is known. Further, in [2] we dealt with strongly consistent estimators of two unknown parameters, namely the diffusion parameter σ and the Hurst parameter H . We also proved joint asymptotic normality of the estimators in the case 3 H (0 ). ∈ , 4 2 Properties of solutions Assume that H B = H B , x is a two-sided fractional Brownian motion x ∈ R with Hurst index H (0 1), while is a Wiener process, ∈ , W = { W x , x ∈ R } independent of H B. Let G be Green’s function of the heat equation, that is G 2 1 x √ exp − , if t > 0, πt 2 ( 2 t, x ) = t δ 0 ( x ) , if t = 0 . We define a solution to SPDE (1) in a mild sense as follows. Definition 1 ([3], Definition 1). The random field {u(t, x), t ≥ 0, x ∈ R} defined by u(t, x) = σ G(t ) H − s, x ( − y dB ds + κ Gt s, x y) dW (4) y − − y ds 0 t t R 0 R 23rd European Young Statisticians Meeting 35 is called a solution to SPDE (1)–(2). Remark 1. As shown in [1], both stochastic integrals in (4) exist as pathwise Riemann–Stieltjes integrals. This fact follows from the H¨older regularity of the integrands and integrators. Namely, Green’s function is obviously Lipshitz continuous, while sample paths of fractional Brownian motion are H¨older continuous up to order H. Such regularity guarantees the existence of the first integral in (4). The second integral is also well defined, since the integrand is square integrable. The next proposition summarizes basic properties of the solution u(t, x). These properties, especially stationarity and ergodicity, enable us to construct and investigate statistical estimators for H, κ and σ. Proposition 1 ([3], Proposition 1). Let u = ( { ut, x), t [0 ∈, T ], x be a ∈ R } solution to (1) defined by (4). Then the following properties hold. 1. For all t, s [0 ] ∈ , T and x, z , ∈ R cov u( t, z ) , u( s, x + z ) = cov u( t, 0) , u( s, x ) = 1 t s 2 3 κ 2 2 H − 1 √ ( q + r ) − 2 σ H y + | | 2 π 2 0 0 R × (sign (y 2 ) − x y )( y ) exp − x − dy dq dr (5) 2( q + r ) Consequently, for fixed distinct points t , . . . , t 1n [0 ] ∈ , T, the stochas-u ( t 1 ,x ) tic process , x ∈ ... R, is a multivariate stationary Gaussian u ( t process. n ,x ) 2. The variance of u(t, x) is given by E 2 2 2 1 u ( t, x ) = σ v t ( H ) + κ v t , t > 0, x (6) ∈ R , 2 where H 2+1 1 (2 H 1)Γ( + +1 − H ) 2 v t ( H ) = H c H t , c H = √ , (7) π ( H + 1) Γ denotes the gamma function. 3. For all t, s [0 ∈, T ] and x > 0, the covariance function admits the following upper bound: 2 2 cov u 2 ( H − 2 t, 0) − 1 , u ( s, x ) ≤ C H ts σ x + κ x , (8) where CH is a positive constant depending only on H. 36 23rd European Young Statisticians Meeting 4. 
For all t, s [0 ] and ∈ , T x , ∈ R cov 2   1  σ 2 H ) H +1 H +1 H +1 Γ( H + ) ( t + s − t − s 2 u ( t, x ) , u ( s, x ) = √ π ( H + 1)   3 3 3 3 2 κ 2 2 ( t + s ) 2 − t 2 − s 2 + √ . (9) 3 π 5. For a fixed t > 0, the random process ( ) { u t, x, x is ergodic. ∈ R } 2.1 Parameter estimation Let us consider the following statistical problem. It is supposed that for fixed t1, . . . , tn and fixed step δ > 0, the random field u given by (4) is observed at the points xk = kδ, k = 1, . . . , N . So the observations have the following form: { ( uti, kδ), i = 1, . . . , n, k = 1, . . . , N } . Our aim is to estimate the unknown parameters H , σ and κ. We shall do this it two steps. We start with construction of a strongly consistent estimator of H that does not depend on κ and σ. Also, we shall establish its asymptotic normality. Then assuming that H is known, we shall estimate the parameters σ and κ. In what follows we assume that H 1 = , because otherwise the model is ̸ 2 non-identifiable. The parameters σ and κ are assumed to be positive. 2.2 Estimation of H In order to estimate H without knowledge of σ and κ, it suffices to observe u(ti, xk) only at three time instants t 1 < t2 < t3. We define estimator of H as H (   − 3 / 2 t V ( − 3 t ) / 2 t V 1) 3 N 3 ( ) − t 2 N 2  N = − f , (10) − 3 / 2 − 3 / 2 t V t 2 2 N ( ) − t V ( 1 N t 1 ) where (−1) f denotes the inverse function of f  H −1/2 H−1/2 t  t 3 − 2 1 H − 1 / 2 H − 1 / 2 , if H = ̸ , 2 ( H ) := t t 2 − 1 (11)  log t 3 − log t 2 1 if H = . log t 2 − log t 1 2 23rd European Young Statisticians Meeting 37 Theorem 1 ([3], Theorem 2). 1 1 1. For any H ∈ (0 , ) ∪ ( , 1) , H is a strongly 2 2 N consistent estimator of the parameter H as N → ∞. 2. For 1 1 3 H ∈ (0 , ) ∪ ( , ) , the estimator H is asymptotically normal: 2 2 4 N √ d 2 N H N − H − → N (0 , ς) as N → ∞, (12) where 1 3 ς 2 = r 2 t 4 2itj H ( )L L , D i j σ c H i,j=1 L tH 1 1 1 1 1 1 − H − H − H − H − H − 2 2 2 2 2 2 t t 3 − t 2 1 − t t 3 2 − 1 1 = , L 2 = , L 3 = , 3 / 2 3 / 2 3 / 2 t t t 1 2 3 1 1 1 1 H − H 2 − H 2 − H 2 − 2 D = t 2 − t t log t t 1 3 3 − t log 2 2 − 1 1 1 1 H − H − H − H − 2 2 2 t t t 3 − log 2 t 2 2 2 − t log t . 1 1 2.3 Estimation of σ and κ Now we assume that the Hurst index H is known and investigate the estimation of the coefficients σ and κ. We consider the following estimators: 2 t − 3/2 3 2 −1−H −1− V ( − / t t ( t ) t V ( t )H 1 N 1 ) − V 2 N 2 2 1 N 1 − t V 2N (t2) σ N = , κ N = . H − 1 / 2 H − 1 / 2 1 / 2 − H 1 / 2 − H c H t t c 1 1 − t 1 − t 2 2 2 (13) strongly consistent estimator of the parameter Theorem 2 ([3], Theorem 3). 1. For any H 1 1 2 2 ∈ (0 , ) ∪ ( , 1) , ( σ , κ ) 2 2 N N is a 2 2 ( σ , κ ) as N → ∞ . 2. For H 1 1 3 2 2 ∈ (0 , ) ∪ ( , ) , the estimator ( σ , κ ) is asymptotically normal: 2 2 4 N N √ 2 2 σ N − σ d N N 2 2 − → N (0 , Σ) as → ∞, (14) κ κ N − 38 23rd European Young Statisticians Meeting where the asymptotic covariance matrix Σ consists of the following elements: Σ t−3 3 − ( r t 1 t 1 t r H 1 ( H ) + 1 t 2 ( )) + 11 = t(r H 2 t r H 1 t 2 ( ) + t 2 t 2 ()) , 2 2 H − 1 1 2 1 c t 2( ) H H − t 1 t 2 H 1 − 2 + − t 2 Σ t− 5 5 H 2 − − H ) + 2 − ( r 1 1 t ( H 12 = Σ 21 = t r t r r 1 1 t H 2 ( H )) + t ( ( 2 t 1 t 2 ) +t H 2 t 2 ( )) 1 1 1 1 , H − H H 2 2 − 2 − H − c H c 1 2 2 − t t t 1 2 − t 1 2 2 Σ t−2−H 2 ( − r ( H r ( H )) +−H t 1 1 t 1 ) + t r 2 ( 2t1t2 (H) + r 22 = t t 1t2t2 (H )) . 1 2 1 − 2 H 2 − H 1 + − 2 c 1 t 2( t 1 t ) H 1 − 2 t 2 2 References [1] D. Avetisian, K. 
Ralchenko, Ergodic properties of the solution to a fractional stochastic heat equation, with an application to diffusion parameter estimation, Modern Stochastics: Theory and Applications, 2020, Vol. 7, No. 3, 339–356.

[2] D. A. Avetisian, K. V. Ralchenko, Estimation of the Hurst and diffusion parameters in fractional stochastic heat equation, Theory of Probability and Mathematical Statistics, 2021, Vol. 104, 61–76.

[3] D. Avetisian, K. Ralchenko, Parameter estimation in mixed fractional stochastic heat equation, Modern Stochastics: Theory and Applications, 2023, Vol. 10, No. 2, 175–195.

Stochastic modeling of age-specific fertility rates

Erika Lettrichová∗1

1 Comenius University; Faculty of Mathematics, Physics and Informatics; Department of Applied Mathematics and Statistics

Abstract: The objective of our research was to develop precise and efficient mathematical models for describing the stochastic progression of age-specific fertility rates. We aimed to establish functional formulations for these models that not only demonstrated accuracy and efficiency, but also displayed robustness across diverse human populations and time periods. By achieving this, we sought to identify a single model type that could be universally applicable across different populations and time frames, with only the estimated parameters varying. The effectiveness of the models was evaluated using the modified Bayesian information criterion, while the quality of the models was assessed based on the characteristics of the estimated standardized model residuals.

Keywords: stochastic models, age-specific fertility rate, Lee's model

AMS subject classification: 62M20, 62P25, 91B70, 91D20

1 Introduction

Age-specific fertility rate is a relative analytical demographic indicator, measuring the number of live births among women of specific age groups relative to the average number of women within those age groups. Fertility plays a pivotal role in driving changes in human population dynamics, making it a crucial natural mechanism to study.

∗ Corresponding author: erika.lettrichova@fmph.uniba.sk

1.1 Fertility models

Among the various approaches available for modeling fertility, we have chosen to utilize stochastic models based on Lee's model [2]. Lee's model belongs to the category of atypical regression models, as it does not involve direct regressors on the right-hand side. Instead, the model incorporates age-specific parameters a_x, b_x and period-specific parameters f_t to capture the dynamics of fertility:

ln f_{x,t} = a_x + b_x f_t + ε_{x,t}.   (1)

The unknown age-specific parameters a_x and b_x in Lee's model represent the fertility rate at different ages, while the period-specific parameters f_t capture the effect of the calendar year on fertility rates. The term ε_{x,t} represents an error describing the effects that the model failed to capture; it is assumed that ε_{x,t} has white noise properties. In our research, Lee's model (1) is denoted as the M1A model. To enhance the model's performance, we explored different combinations of age, period, and cohort parameters. The most efficient models were formally labelled M6B and M7B (see equations (2) and (3), respectively); details of all estimated model variants can be found in [3].

ln f_{x,t} = a_x + b_x f_t + j_{t−x} + d (x − x_p) h_t + ε_{x,t},   (2)
ln f_{x,t} = a_x + b_x f_t + j_{t−x} + g_x h_t + ε_{x,t}.   (3)
In the M6B model, the time-specific parameter h_t is modified by multiplying it with the term d (x − x_p), where d represents a scalar parameter specific to the model, x is an independent regression variable that ranges linearly across all observed ages, and x_p denotes the average value of the monitored ages used for computing the specific fertility rate. In contrast, the M7B model multiplies the time-specific parameter h_t with the age-specific parameter g_x.

1.2 Parameter estimation

In this part, we describe the method used for estimating the models' parameters. Let x ∈ χ and t ∈ τ, where χ = {x_1, x_2, …, x_k}, k ∈ N, represents the set of women's ages, and τ = {t_1, t_2, …, t_n}, n ∈ N, denotes the set of tracked calendar years. For each combination of x and t, we define FP(x, t) as the average size of the female population and LB(x, t) as the average number of live births to women who are x years old in calendar year t. It is important to note that we assume LB(x, t), for all x ∈ χ, t ∈ τ, to follow a Poisson probability distribution:

LB(x, t) ∼ Po(FP(x, t) f(x, t)),   ∀ x ∈ χ, t ∈ τ,   (4)

where f(x, t) represents the specific fertility rate at age x in calendar year t. We denote the set of parameters estimated for each model by Φ; thus, the specific fertility rate f(x, t) can be expressed as f(x, t; Φ), as each model depends on its specific parameters. Under the Poisson assumption (4), the logarithm of the likelihood function (LogLik) for each model can be represented as:

ℓ(Φ; LB, FP) = Σ_{t∈τ} Σ_{x∈χ} ( LB(x, t) ln[FP(x, t) f(x, t; Φ)] − FP(x, t) f(x, t; Φ) − ln[LB(x, t)!] ),

where LB and FP denote the observation matrices (LB(x, t))_{x∈χ, t∈τ} and (FP(x, t))_{x∈χ, t∈τ}, respectively. To estimate the parameters, we employ stochastic optimization techniques, specifically maximization of the LogLik. The entire parameter estimation process was performed in the R software [4] using the optim function. We employed a two-step optimization approach: in the first step we used the Nelder-Mead algorithm, and in the second step the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm. We decided on the two-step method because of the large number of estimated parameters, for which the Nelder-Mead algorithm tends to be more successful in converging to a semi-optimal point. We subsequently used the parameters obtained in the first optimization as starting values for the BFGS algorithm, so that the algorithm was more likely to converge to the optimal point with the selected numerical accuracy.

1.3 Quantitative comparison of models

In order to compare the models based on the maximization of their LogLik, we employed a modified Bayesian Information Criterion (BIC). The modified BIC allows model comparison by taking into account both the goodness-of-fit and the complexity of the model: it considers not only the LogLik value but also penalizes models with a higher number of estimated parameters.

BIC = ℓ(Φ̂) − (1/2) ν ln(n).   (5)

In the definition of the modified BIC, Φ refers to the vector of parameters of a model, Φ̂ represents the estimated vector of parameters obtained through the maximum likelihood method, ℓ(Φ̂) denotes the maximum value of the LogLik, n represents the total number of observations, and ν indicates the number of parameters estimated in the model.
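To make the estimation procedure concrete, the following R sketch fits the simplest M1A specification by the two-step optim approach described above. It is a minimal, self-contained illustration on simulated counts (not the Human Fertility Database series), and it omits the identifiability constraints that are usually imposed on the parameters b_x and f_t.

# Toy two-step ML fit of the M1A model ln f(x,t) = a_x + b_x * f_t
# with LB(x,t) ~ Poisson(FP(x,t) * f(x,t)); data are simulated here.
set.seed(1)
ages <- 18:42; years <- 1960:1979
K <- length(ages); N <- length(years)

FP     <- matrix(50000, K, N)                          # exposures
true_f <- outer(0.6 * dnorm(ages, 27, 5), rep(1, N))   # a plausible fertility curve
LB     <- matrix(rpois(K * N, FP * true_f), K, N)      # simulated live births

unpack <- function(phi) list(a = phi[1:K], b = phi[K + 1:K], f = phi[2 * K + 1:N])

negloglik <- function(phi) {
  p   <- unpack(phi)
  fxt <- exp(outer(p$a, rep(1, N)) + outer(p$b, p$f))  # f(x,t; Phi)
  -sum(LB * log(FP * fxt) - FP * fxt - lfactorial(LB)) # minus the Poisson LogLik
}

start <- c(log(rowMeans(LB / FP)), rep(0.1, K), rep(0, N))
step1 <- optim(start, negloglik, method = "Nelder-Mead", control = list(maxit = 5000))
step2 <- optim(step1$par, negloglik, method = "BFGS")  # refine from the step-1 solution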
1.4 Residuals estimation

To assess the quality of the employed models, we examined the characteristics of their residuals. These residuals were obtained by dividing the time period into data windows, each of which was shifted by one year in subsequent calculations. In each data window, the first step involved estimating the parameters of the models. The initial data window consisted of the years t_1, …, t_n, where n represents the number of years in the data window and remains constant during scrolling. Based on the data from this window, we estimated the parameters of the models. These estimated parameters were then used to calculate modeled specific fertility rates f̄ for ages x from the set χ = {x_1, …, x_k}, where k corresponds to the number of observed ages. We denote the calculated specific fertility rates by f̄(x, t_n) and the actual values of the specific fertility rates by f̃(x, t_n). Subsequently, we evaluated the residuals through the relationship

ε̂(x, t_n) = ( f̃(x, t_n) − f̄(x, t_n) ) / sqrt( f̄(x, t_n) / FP(x, t_n) ).   (6)

We performed the described procedure for all data windows, resulting in a matrix of standardized residuals for the specific fertility rate.

1.5 Qualitative assessment of the suitability of models

We regarded the normality of the model as an additional characteristic and tested it by examining whether the matrix of estimated residuals follows a normal distribution. If the hypothesis of normality is not rejected, we may employ procedures and techniques that rely on the assumption of normality. For assessing the normality of the estimated residuals, we employed the Jarque-Bera test (JB), the adjusted Jarque-Bera test (AJB), the Shapiro-Wilk test (ShW), and the one-sample Kolmogorov-Smirnov test for general normality (KS norm). To visually assess the independence of the estimated residuals, we employed heat maps. A heat map provides a graphical representation of the residuals, with a random pattern resembling white noise indicating independence.

2 Results for the Slovak population

We chose to examine the suitability of the models for stochastic modeling of the specific fertility rate in several populations. In this article, we specifically focus on presenting the results for the Slovak population; for the complete results, refer to [3]. To ensure a fair comparison of the models, we established a consistent age and time range for the data used in our analysis. We set the age range to 18–42 years. As the time period, we selected the range 1960–2019 based on data availability in [1]. We divided the selected time period into 20-year data windows, the first covering the years 1960–1979, with each subsequent window shifted by 1 year. For each data window, we estimated the model parameters and calculated the LogLik. Table 1 presents the median values of the LogLik and of the modified BIC.

Model   LogLik median   Number of parameters   Median of the modified BIC
M1A     −4415.49        78                     −4626.79
M6B     −2554.32        150                    −2973.81
M7B     −2532.77        179                    −3109.65

Table 1: Quantitative comparison of models.

The maximum median value of the LogLik was reached when applying the M7B model. However, it should be noted that this model involved the estimation of 179 parameters (25 × 3 age-specific parameters, 20 × 2 time-specific parameters and 64 cohort parameters), which adversely affected its BIC value.
Taking into account the BIC criterion, we determined that the M6B model was the most suitable for the Slovak population within the group of models we examined. The p-values of the normality tests indicate that for models M6B and M7B we do not reject the hypothesis of normality, see Table 2. This suggests that the residuals from these models exhibit a distribution consistent with a normal distribution.

Model   JB        AJB       ShW       KS norm
M1A     < 0.0001  < 0.0001  < 0.0001  < 0.0001
M6B     0.5435    0.5060    0.2018    0.2783
M7B     0.0145    0.0120    0.0037    0.1218

Table 2: Qualitative comparison of models.

In Figure 1, we present the heat map of the estimated residuals for the M1A model, revealing significant clusters. Conversely, in Figure 2, the heat map of the M6B model displays a random pattern similar to white noise.

[Figures 1 and 2 here: heat maps of the standardized residuals, age versus year (1980–2015), for the M1A and the M6B model.]

Figure 1: Heat map M1A. Figure 2: Heat map M6B.

3 Conclusion

In this paper, a method for developing stochastic models of age-specific fertility rates was presented, and the two most suitable models for the Slovak population were introduced. For the other investigated populations (Czech, Hungarian, Austrian, and Dutch), the models labeled M6B and M7B also emerged as the most suitable, similar to the Slovak population. Therefore, we believe that, within the models we examined, we have managed to find the most suitable model, which is also sufficiently robust across different populations and time periods.

Acknowledgements: This research was supported by the Grant for PhD students of the Comenius University Bratislava No. UK/228/2023.

References

[1] HFD [online]. Human Fertility Database. URL: https://www.humanfertility.org/.

[2] R. D. Lee. Modeling and forecasting the time series of US fertility: Age distribution, range, and ultimate level. International Journal of Forecasting, 9(2): 187–202, 1993.

[3] E. Lettrichová. Stochastic modeling of age-specific fertility rates in selected countries. Master's thesis. Faculty of Mathematics, Physics and Informatics, Comenius University Bratislava, Slovakia, 2022.

[4] R Core Team. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria, 2021. URL: https://www.R-project.org/.

On extreme quantile estimation with approximated L^p-norms

Jaakko Pere∗1

1 Aalto University School of Science, Department of Mathematics and Systems Analysis

Abstract: Often in real-life scenarios, when a heavy-tailed model is assumed, only approximations of the original heavy-tailed observations are available. Then the approximation error has to be taken into account in the asymptotics. Here we wish to estimate extreme quantiles corresponding to the L^p-norm of a random function when only discretizations of the realized sample paths are observed. We review an asymptotic normality result for a well-known extreme quantile estimator when approximated observations are used in the estimation. Then we consider an example class of stochastic processes and discuss the behavior of the approximation error in L^p-norms due to discretization under this class of processes.

Keywords: Extreme quantile estimation, Regular variation, Functional data analysis, Hill estimator

AMS subject classification: 62G32, 60G70

1 Introduction

Extreme value theory provides a framework for statistical inference on rare events [1].
Often the approach is to assume a semiparametric model such that sufficient regularity for the tail can be assumed. Applications of extreme value theory, such as extreme quantile estimators, are then constructed by extrapolating outside the sample by means of the assumed semiparametric model. One example of such a model is the heavy-tailed model, which is applied in multiple fields, such as finance and telecommunications [5].

∗ Corresponding author: jaakko.pere@aalto.fi

Next, we proceed by defining a heavy-tailed random variable. Let X be a random variable with distribution F_X. Define a quantile-type function U_X by U_X(t) = F_X^←(1 − 1/t), where f^← denotes the left-continuous inverse of a nondecreasing function f. We say that the random variable X is heavy-tailed with extreme value index γ > 0, and denote X ∈ RV_γ, if the corresponding function U_X is regularly varying with index γ > 0, that is, if

lim_{t→∞} U_X(tx) / U_X(t) = x^γ   for all x > 0.

However, in applications, heavy-tailed random variables are often not observed directly, and only approximations of them are available. Then the approximation error has to be taken into account in the asymptotic results. We consider the aforementioned setting in a functional data analysis context [4]. Let p ∈ [1, ∞] and let Y ∈ L^p[0, 1] be a random function such that ‖Y‖_p ∈ RV_γ. Let Y_1, …, Y_n be i.i.d. copies of Y and denote X_i = ‖Y_i‖_p. In functional data analysis, the sample paths of Y are rarely observed as a whole, and thus the norms X_i have to be approximated. We assume that only the discretizations Y_i(j/m), j ∈ {0, …, m − 1}, are observed; then natural approximations of the norms X_i are

X̂_i = ( (1/m) Σ_{j=0}^{m−1} |Y_i(j/m)|^p )^{1/p},   p ∈ [1, ∞),
X̂_i = max_{j ∈ {0, …, m−1}} |Y_i(j/m)|,   p = ∞.   (1)

We wish to estimate the extreme quantile x_q = F_{‖Y‖_p}^←(1 − q) corresponding to a very small probability q when the approximations X̂_i are used in the estimation. More precisely, in the estimation of x_q we use a well-known extreme quantile estimator for heavy-tailed distributions with the approximated observations X̂_i, where the extreme value index γ > 0 is estimated with the Hill estimator. For a review of the Hill estimator, of extreme quantile estimation in both the heavy-tailed and a more general setting, and for a discussion of quantile estimation in less extreme regions, see [1] and the references therein. In [3], the effect of replacing the norms X_i by the approximations X̂_i in the asymptotic results for the Hill estimator and the related extreme quantile estimator was considered. Indeed, in Section 2, under the described setup, we review an asymptotic normality result for the extreme quantile estimator when approximated norms are used in the estimation. It turns out that, in order to have asymptotic normality, the approximation error max_i |X̂_i − X_i| cannot grow too fast as n → ∞. Then, in Section 3, we give a class of stochastic processes that satisfy all the assumptions required for the asymptotic normality of the extreme quantile estimator. The processes are of the form Y(t) = R B(t), where the tail distribution function of R satisfies a particular expansion and B is a fractional Brownian motion independent of R. These processes are special cases of more general examples considered in [3].
Also, the behavior of the approximation error max ˆ i | X i is studied for the specific class of stochastic − X i | when we set in order to have asymptotic normality for the extreme quantile estimator processes. Particularly, we describe the rate of the discretization level m = mn R = 1 (ν) (ν) / X /ν . Here X is a chi-squared distributed random variable with ν ∈ N∗ degrees of freedom. Lastly, proofs are presented in Section 4. 2 Asymptotic normality of the extreme quantile estimator under approximated p L-norms First, we give the definition of the extreme quantile estimator. Notice that in the heavy-tailed setup, the extreme value index γ > 0 determines the heaviness of the tail. That is, larger γ implies a heavier right tail. Since the extreme value index determines how to extrapolate outside the sample, the estimation of γ is of paramount importance in the applications of extreme value theory. Definition 1 (Extreme quantile estimator). Let Z Z 1 ,n denote ≤ · · · ≤ n,n Extreme quantile estimator ˆ xq computed with the sample Z1, . . . , Zn is defined ˆ as ˆ order statistics corresponding to a sample Z , . . . , Z 1n. Let k 1 . ∈ { , . . . , n 1 − } x γn 1 1 q = k − Z n − k,n ( k/nq ) where ˆ γ n = log Zn log Z is the − i,n n − k,n k i =0 − Hill estimator. Regular variation alone is not sufficient for asymptotic normality of the extreme quantile estimator ˆ xq but a second-order condition is required. Below we give the definition of second-order regular variation. Definition 2 (Second-order regular variation). We say that random variable X satisfies second-order regular variation condition and denote X ∈ 2RVγ,ρ if U the corresponding tail quantile function satisfies lim ( tx ) γ U t x →∞ U − /A ( t ) = ( t ) xγ xρ− 1 , for all x > 0, where γ > 0, ρ 0, for some A that is a positive or ≤ negative function with lim ρt →∞ A(t) = 0. For ρ = 0 the right-hand side is interpreted as γ x log x. Notice that in asymptotic results related to extreme quantile estimation, it is commonplace to assume that the probability q corresponding to the quantile 48 23rd European Young Statisticians Meeting xq depends on n and decays to zero fast, as n . This guarantees that the → ∞ quantile xq is extreme asymptotically speaking. Below we present asymptotic normality result given in [3, Theorem 1], in the setting of this article. Theorem 1. Let p [1 ] and let p 1] be random function such that ∈ , ∞ Y ∈ L [0 , the extreme quantile estimator of Definition 1 computed with the sample n n . Assume also that = , and log( ) = → ∞ q q n nq = o ( k ) nq o( k), as ˆ max | X → ∞ . Denote − | C n = i X i i . Let ← x q = F (1 − q ) and let x ˆ be U X ( n/k ) q X √ ˆ and let ˆ ˆ X , . . . , X k/n 1 n be as in Equation (1) . Assume that k = k n , 0, → ∞ → √ as X = 2 0. Let ∥ Y for ∥ p ∈ RV γ,ρ γ > 0 and ρ < X , . . . , X 1n be i.i.d. copies of X is the auxiliary function from Definition 2. Denote X ˆ 1 , . . . , Xn. Furthermore, assume that limn kA →∞(n/k) = λ ∈ R, where A √ P d n = k/ ( nq ) . If kC n 0 → √ ˆ d then x k q 1 Λ , as is normally distributed with → n , where Λ log d n x − → ∞ q 2 mean λ/ (1 and variance − ρ ) γ . 3 Example Now we are ready to consider under which conditions the asymptotic normality result of Theorem 1 holds for a particular class of stochastic processes. Definition 3. Let be a positive random variable satisfying 1 ( R − Fx) = R − 1 /γ − 1 /γ + ρ/γ c 1 x + c 2 x (1 + o (1)), as x , with and → ∞ γ > ρ < 0, where 0 and c > c 1 B 2 = 0. 
Let be the fractional Brownian motion with Hurst index ̸ (0 H , 1) such that B and are independent. Define stochastic process ∈ R Y by Y ( t ) = ( R B t ), t [0 1]. ∈ , The structure of the stochastic process given by Definition 3 will guarantee that the corresponding p L-norm satisfies second-order regular variation for any p [1 ∈, ]. Also, the approximation error due to discretization has a ∞ convenient representation. Theorem 2. Let p [1 ∈, ] and let be the stochastic process of Definition ∞ Y 3. Let ˆ ˆ Y 1 , . . . , Y n be i.i.d. copies of Y . Denote X i = ∥ Y i ∥ p and let X 1 , . . . , Xn be as in Equation (1). Suppose that k = kn , 0 → ∞ k/n, as → n . Then → ∞ 1. 2 ∥ Y with auxiliary function ∥ p ) = ∈ RV γ,ρ A ( t O( ρ t), as t , and → ∞ 2. max ˆ H | X − X − 1 ε C i n = i i | = γ O k, for any ε > 0. U P X ( n/k ) m Below we give an example of a random variable that satisfies the expansion R required in Definition 3. 23rd European Young Statisticians Meeting 49 Lemma 1. Let 1 ( =ν) R √ , where X has chi-squared distribution with ( ν ) X /ν ν ∗ 1 + degrees of freedom. Then 1 ( ) = − /γ − 1 ∈ N /γρ/γ − F x c x + c R 1 2 x (1+ o(1)) with γ = 1/ν and ρ = 2 . − γ , where c > 1 0 and c 2 = 0 ̸ By the above lemma there is a relationship between the first-order parameter γ and the second-order parameter ρ for the chi-squared distribution, which simplifies the analysis in this example. However, generally, the parameters γ and ρ do not need to be related. Consider a case where Yi are i.i.d. copies of the process Y of Definition 3 when R is set as in Lemma 1. Now we can study how the discretization level m has to be chosen in order to have asymptotic normality for the with the condition estimator ˆ xq when approximations ˆ Xi are used in the computation. Firstly, √ the condition lim ρ n kA ( n/k ) = λ = O ( ) implies the →∞ ∈ R together with A t 2 2 ) restriction − ρ/ (1 − ρ 4 k = γ/ (1+4 γ ) O n = O n . Combining this restriction √ P kC n 0 gives that → m = m n has to be chosen such that 4 2+2 1 H − ε γ γ lim 1+4 γ n n = 0, for an arbitrary small ε > 0 in order to have →∞ m asymptotic normality for the extreme quantile estimator ˆ x q computed with approximations ˆ X i. That is, as γ increases, heaviness of the tail has to be compensated with a larger discretization level m, or with more regularity (larger value of H ). 4 Proofs Proof of Theorem 2. The proof is similar to the proof of [3, Theorem 3], and thus, we present only the sketch of the proof. Let ε > 0 and p [1 ∈, ]. ∞ By applying Kolmogorov continuity theorem and Garsia-Rodemich-Rumsey inequality [3, Lemma 2] we get that | ( ) ) (0 ) B t − B ( η s | ≤ M | t (2) − s | ∀ η ∈ , H , where 1/γ+ Eε M < and is a measurable map of . Notice that ∞ M B 1 Equation (2) implies /γ + ε 1 /γ E ∥ B ∥ p < ∞ . Also, clearly E ∥ B ∥ p ∈ (0 , ∞ ). Then by Breiman’s theorem [2, Proposition 1.3.9] we have 1 − F x ∥ Y ( ) ∥ ∼ p 1 /γ E ∥ B ∥ p (1 − F ( x )), as x . That is, R → ∞ 1 − 1 + 1 /γ − /γρ/γ − F x c x 1 ∥ Y ( x ) = ˜ c + ˜ o 2 (1 +(1)), x ∥ → ∞, (3) p with γ > 0 and ρ < 0, where ˜ c 1 > 0 and ˜ c 2 = 0. Now notice that since ̸ 1 ρ − F Y = ( t ∥ Y satisfies Expansion (3), then with A ∥ p ∥ ∥ p ∈ 2 RV γ,ρ O), e.g., 50 23rd European Young Statisticians Meeting see [1, Section 3.2] for details. This completes the proof of Claim 1. By Equation (2) we have that |Y (t) η − Y ( s ) | ≤ V | t − s | for all η ∈ (0, H ), where V = RM . The fact V ∈ 2RVγ,ρ can be proven similarly as the second-order regular variation of ∥Y ∥p. 
Then Claim 2 follows by combining [3, Theorem 2] and [3, Remark 2]. Proof of Lemma 1. Let Γ be the gamma function and recall that F (ν) X (x) =    ℓ + ν/ 2 − x/ 2 ∞ 1 x e . Now ℓ =0 Γ( ν/ 2+ ℓ +1) 2   1   P ν  > x = ( F (ν) X 2 ν X ) x /ν   =  1+  ( ν/ 2 ν/ 2) ( ν/ 2) ν/ 2  − ν − ν 2  x + − x (1 + o(1)), x   → ∞.  Γ( ν/ 2 + 1) Γ( ν/ 2 + 2)        = c 1 = c 2 Thus, 1 −1 − − F ( x ) = /γ1/γ+ c +ρ/γ c R 1 x 2 x(1 + o(1)), as x → ∞, with γ = 1/ν and ρ = −2γ, where c1 > 0 and c2 ̸= 0. Acknowledgements: Jaakko Pere gratefully acknowledges support from the Vilho, Yrj¨o and Kalle V¨ais¨al¨a Foundation. References [1] L. De Haan and A. Ferreira. Extreme Value Theory: An Introduction. Springer, 2007. [2] T. Mikosch. Regular variation, subexponentiality and their applications in probability theory. Eurandom, 1999. [3] J. Pere, B. Avelin, V. Garino, P. Ilmonen, and L. Viitasaari. Hill estimator and extreme quantile estimator for functionals of approximated stochastic processes. arXiv:2307.03581, 2023. [4] J. O. Ramsay and B. W. Silverman. Functional data analysis. Springer series in statistics. Springer, New York, 2005. [5] S. I. Resnick. Heavy-tail phenomena: probabilistic and statistical modeling. Springer Science & Business Media, 2007. 23rd European Young Statisticians Meeting 51 Using Spatial Point Process Models to Understand Cell Loss in Glaucoma Jonathan ∗1 2 1 Henderson , Benjamin M Davis , and Hannah J Mitchell 1 Mathematical Sciences Research Centre, School of Mathematics and Physics, Queen’s University Belfast, Belfast, Northern Ireland 2 CLF OCTOPUS Imaging Facility, STFC Rutherford Appleton Laboratories, Oxfordshire, UK Abstract: Glaucoma is a leading cause of irreversible blindness, primarily characterized by the progressive degeneration of retinal ganglion cells (RGCs). Analyzing the spatial distribution of RGCs as a point pattern allows the application of spatial point process techniques to gain insights into this disease. This study uses two well established rodent models of glaucoma and focuses on modeling RGC behavior using the log-Gaussian Cox process, employing the Integrated Nested Laplace Approximation (INLA) for accurate modeling. Additionally, an investigation is carried out to determine whether factors such as cell size (marks) influence the regions of the retina most susceptible to cell loss. Understanding these spatial relationships can provide valuable insights into the mechanisms of glaucoma progression and potentially inform future diagnostic and treatment strategies. Keywords: Point pattern analysis, log Gaussian Cox process, Neuro-degenerative diseases, Integrated Nested Laplace Approximation. AMS subject classification: 62H11 ∗ Corresponding author: jhenderson27@qub.ac.uk 52 23rd European Young Statisticians Meeting 1 Introduction Glaucoma is a neurodegenerative disease that affects millions of people world-wide and is a leading cause of irreversible blindness [1]. The progressive vision loss experienced by glaucoma patients results from the loss of retinal ganglion cells (RGCs). While no cure for the disease exists, treatments are available to prevent further disease progression, they are most effective when adminis-tered early in the disease process. Unfortunately, patients often don’t exhibit symptoms until approximately 60% of their RGCs are lost. 
Furthermore, the diagnostic process can be time-consuming, taking at least 6–12 months for a specialist to analyze the retina and obtain sufficient data to establish degenerative glaucoma progression using a combination of functional and structural endpoints [1]. The ability to visualise RGC point patterns with adaptive optics cSLO can provide much richer information about the health of the retina than current structural approaches, which are typically dependent on retinal layer thickness measures. These can be problematic, as retinal layer thickening caused by inflammatory processes can mask thinning due to neuronal cell loss. We hypothesize that a spatial statistical evaluation of RGC point patterns will enable more rapid and accurate detection of disease-associated changes in the RGC population, and so allow more rapid diagnosis of glaucoma and more rapid and effective therapeutic intervention.

The dataset contains the locations and sizes of RGCs within retinal wholemounts of rats subjected to various procedures that simulate the cell loss observed in glaucoma patients. Specifically, two procedures were used: partial optic nerve transection (pONT) to represent severe cases and ocular hypertension (OHT) to illustrate milder cases of glaucoma [3]. To analyze disease progression, rats in the pONT model were dissected at 3, 7, 21, and 56 days post-model induction, while those in the OHT model had retinas dissected at 7, 21, 56, and 84 days. A control group of rats underwent no procedures. Various models can be employed depending on the spatial arrangement of the points within the study region, with summary functions like Ripley's K function [6] helping to determine the most appropriate model. Previous studies have shown that RGCs within retinal wholemounts tend to aggregate [2]. For aggregated point processes, the log-Gaussian Cox process can be used to model point intensity [5]. This study utilizes the Integrated Nested Laplace Approximation (INLA) [7] for model fitting, chosen for its ability to accommodate marks or covariates without excessive computational time.

2 Model Fitting Procedure

In order to determine how the disease affects the change in intensity of the pattern of cells, the log-Gaussian Cox process can be used.

Definition 1. A log-Gaussian Cox process is a hierarchical Poisson process with a random intensity Λ(s) = exp(Z(s)), where Z(s) = {Z(s) : s ∈ R^d} denotes a Gaussian random field [4].

The standard approach for parameter estimation in this model is based on Markov chain Monte Carlo (MCMC) simulations. However, this approach can be computationally cumbersome, particularly if covariates or marks are introduced into the model. Rue et al. [7] present an alternative method for parameter estimation in complex point process models, of which the log-Gaussian Cox process is a special case. In general, for the class of latent Gaussian models, the response variable y_i is characterized by the mean ϕ_i, which is linked to a predictor through a link function g(·) such that

η_i = g(ϕ_i) = β_0 + Σ_α β_α Z_{αi} + Σ_γ f_γ(C_{γi}),   (1)

where β_0 denotes the intercept, while {β_α} denotes the linear effects of covariates Z_α and {f_γ} denotes the non-linear effects of covariates C_γ, respectively.

3 Main Result

Considering the varying shapes of retinal wholemounts among subjects, the analysis focuses on samples extracted from these wholemounts.
Sample locations are determined based on their proximity to the optic nerve head (ONH), which is centrally located within the retina. In each retina, four samples are collected, with one taken from each quadrant of the wholemount. Figure 1 shows a visual representation of the wholemount, highlighting the positions and sizes of RGCs and indicating the specific subsamples subjected to analysis 54 23rd European Young Statisticians Meeting Figure 1: A) A typical retinal wholemount containing the locations of RGCs. The red boxes indicating samples which were used to fit the model which are dependent on the distance to the ONH (green dot). B) An example of a sample pattern taken containing location and size of RGCs By fitting the log Gaussian Cox process using INLA a joint model was created for the replicated samples within each dissection group and a map of the intensity was obtained. Figure 2 shows the results obtained for the OHT group which represents milder cases of the disease, and therefore is close to what would be observed through the early stages of the disease process. Figure 2: Intensity plots of the OHT model at the different dissection times. 23rd European Young Statisticians Meeting 55 The results illustrate a progressive decrease in cell intensity across the various groups, with the center of the study region being the most susceptible to cell loss. This is particularly evident in the region where smaller cells are primarily located, as depicted in Figure 3. Figure 3: Density of cells which had a size below the mean. 4 Discussion Utilizing INLA, we observed a progressive decrease in cell intensity across various dissection times, with a more pronounced decline between days 21 and 56. Additionally, our findings corroborate the observation made by Davis et al. [2] that areas with smaller cells are more susceptible to cell loss. Acknowledgements: I would like to express my gratitude to Benjamin Davis for providing the dataset for this study and for his valuable assistance and guidance in understanding the biology. Additionally, I extend my thanks to Hannah Mitchell for her support and guidance throughout the project. References [1] Davis, B. M., Crawley, L., Pahlitzsch, M., Javaid, F., and Cordeiro, M. F. (2016). Glaucoma: the retina and beyond. Acta neuropathologica. 132, 807-826. [2] Davis, B.M., Guo, L., Ravindran, N., Shamsher, E., Baekelandt, V., Mitchell, H., Bharath, A.A., De Groef, L. and Cordeiro, M.F. (2020). 56 23rd European Young Statisticians Meeting Dynamic changes in cell size and corresponding cell fate after optic nerve injury. Scientific Reports, 10(1), 21683. [3] Davis, B. M., Salinas-Navarro, M., Cordeiro, M. F., Moons, L., and De Groef, L. (2017). Characterizing microglia activation: a spatial statistics approach to maximize information extraction. Scientific Reports, 7(1), 1576. [4] Illian, J.B., Sørbye, S.H., Rue, H. and Hendrichsen, D.K. (2012). Using INLA to fit a complex point process model with temporally varying effects-a case study. Journal of Environmental Statistics, 3(7). [5] Møller, J., Syversveen, A.R. and Waagepetersen, R.P. (1998). Log gaussian cox processes. Scandinavian journal of statistics, 25(3), 451-482. [6] Ripley, B. D. (1976). The second-order analysis of stationary point pro- cesses. Journal of applied probability, 13(2), 255-266. [7] Rue, H., Martino, S., and Chopin, N. (2009). Approximate Bayesian in- ference for latent Gaussian models by using integrated nested Laplace approximations. 
Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71(2), 319–392.

Subsampling MCMC for Bayesian Variable Selection and Model Averaging in BGNLM

Jon Lachmann∗1 and Aliaksandr Hubin2

1 Stockholm University, Indicio Technologies AB
2 NMBU, University of Oslo, OUC

Abstract: Bayesian Generalized Nonlinear Models (BGNLM) offer a flexible nonlinear alternative to GLM while still providing better interpretability than machine learning techniques such as neural networks. In BGNLM, the methods of Bayesian Variable Selection and Model Averaging are applied in an extended GLM setting. Models are fitted to data using MCMC within a genetic framework by an algorithm called GMJMCMC. In this paper, we combine GMJMCMC with a novel algorithm called S-IRLS-SGD for estimating the marginal likelihoods in BGLM/BGNLM by subsampling from the data. This allows GMJMCMC to be applied to tall data.

Keywords: BMA, MJMCMC, Subsampling, Laplace Approximation

AMS subject classification: 62-02, 62-09, 62F07, 92D20, 90C27, 90C59

∗ Corresponding author: jon@lachmann.nu

1 Introduction

In [3], the class of Bayesian Generalized Nonlinear Models (BGNLM) is proposed to provide flexible (potentially deep Bayesian) modelling of data with high interpretability. More specifically, BGNLM models the relationship between one dependent variable y = (y_1, ..., y_n) and m independent variables x = (x_1, ..., x_n)^T, where x_i = (x_{i1}, ..., x_{im}), with observation count n. This is done through an extended Generalized Linear Model (GLM). In a general sense, the class of models is defined as

y | µ, ϕ ∼ f(y | µ, ϕ),   (1)
h(µ) = β_0 + Σ_{j=1}^{q} γ_j β_j F_j(x, α_j).   (2)

In Equation (1), f(·|µ, ϕ) is a density/mass function of a distribution from the exponential family with mean and dispersion parameters µ and ϕ, respectively. In (2), h(µ) is a link function between the mean µ and the features F_j. The features are nonlinear transformations of the covariates, whose definition is discussed in detail in [3]. For now, we will just note that a single feature can contain both bare covariates and other features, resulting in a recursive structure that allows for high flexibility. In Equation (2), each feature F_j(x, α_j) has two outer parameters: the binary parameter γ_j, which is 1 if feature F_j is included and 0 otherwise, and the coefficient β_j, which works as in a regular linear model, specifying the influence of feature F_j in this particular model. Each feature also has a number of inner parameters α_j that define its structure. Moreover, by introducing some hard and soft constraints on feature complexity, defined in [3], the set F of q possible features is forced to be finite. To describe the feature space F of BGNLM, the following properties of each feature are introduced in [3]: depth, width and operations count. In [3] they are used to define three hard constraints which are applied to limit the model space and make sure that it is finite. These constraints are: the maximum depth of a feature, D; the maximum width of a feature, L; and the maximum total number of features in a given model, Q. Further, in [3], a model prior which penalises more complex models is defined. Each model is identified by the vector γ indicating the features that it involves, i.e. m = (γ_1, ..., γ_q).
Using some parsimonious complexity measure c(Fj (·, αj )) ≥ 0 of a feature Fj (·, αj ), the model prior is then defined as p γ c(F (·,α )) ( q m) ∝ u j j j , (3) j =1 Here, u ∈ (0, 1) will give us a larger negative number for models including more numerous and complex features. This will cause the less complicated of two models that are equally good at describing the data to be favoured. We shall follow [3] and use the operations count (oc) of a feature as the complexity measure. It simply counts the number of algebraic manipulations (+, −,∗, and nonlinear transformations) in the feature: x1 has oc of 0 but sin(x1 ∗ x2 ) of 2. 23rd European Young Statisticians Meeting 59 2 Genetically Modified MJMCMC (GMJMCMC) MCMC or MJMCMC [1] working on the marginal space of models cannot be directly applied to explore the feature space from Equation (2). First, the model space of size 2q increases exponentially with the number of features q. Second, q is growing exponentially with the depth of the features. Hence, an alternative solution is required. In [2], this problem is solved in the context of Bayesian logic regressions by introducing an algorithm called Genetically Modified MJMCMC (GMJMCMC), which adaptively embeds MJMCMC into a genetic programming framework. It was further adopted in [3] for BGNLM. Beginning with an initial population of features S0 consisting of the bare covariates, MJMCMC is applied to explore the model space spanned by this population. A new population is then generated by randomly filtering features that have low marginal inclusion probability. Then, replacements are generated by applying the transformations corresponding to the feature generating process [3] on the remaining features. This process is repeated a set number of times T , referred to as the number of populations of that particular run. Further, in [2, 3] embarrassing parallelisation of the algorithm is suggested. Also, recurrence of the algorithm is shown under practical assumptions. Proposed subsampling technique Markov Chain Monte Carlo (MCMC) algorithms such as Metropolis-Hastings are very useful to sample from complex posterior distributions, but as the amount of data involved increases, so does computational time. To address this issue subsampling MCMC methods start to appear. Since our interest in this paper is within Bayesian model selec- tion and model averaging, a specifically interesting for us paper is [5], which develops and confirms numerically and theoretically a subsampling-based stochastic optimisation technique for computing the marginal likelihoods and is coupled with MCMC. In [5], it is proven that any converging stochastic optimisation can be used in theory, yielding a time inhomogeneous MCMC (TIMCMC) with the desired limiting distribution, yet recommend a stochastic gradient descent with subsampling iterated reweighted least squares initialisa-tion (S-IRLS-SDG) as a working practical solution. S-IRLS-SDG effectively uses a subsampling IRLS algorithm for fast convergence to the proximity of the local optima, which is further obtained by a standard SGD approach. Since GMJMCMC is a recurrent algorithm, the convergence result from [5] automatically extends to it. In this paper, we suggest using S-IRLS-SDG available at https://github. com/jonlachmann/irls.sgd as the TIMCMC within the populations of 60 23rd European Young Statisticians Meeting GMJMCMC resulting in a new subsampling-based GMJMCMC algorithm. Our implementation is available as an R-package at https://github.com/ jonlachmann/GMJMCMC. 
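Before turning to the experiments, the following toy sketch conveys the general flavour of estimating GLM parameters from subsamples of a tall data set. It is emphatically not the S-IRLS-SGD algorithm of [5] or the package implementation linked above; it is a plain mini-batch stochastic gradient ascent on a logistic-regression log-likelihood, with all names chosen by us.

# Toy mini-batch SGD for a logistic regression on tall data (illustration only).
set.seed(42)
n <- 100000; p <- 5
X <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))
beta_true <- c(-1, 2, -1.5, 0.5, 1)
y <- rbinom(n, 1, plogis(as.vector(X %*% beta_true)))

batch_frac <- 0.0075        # fraction of observations used per iteration
beta <- rep(0, p); lr <- 0.5
for (iter in 1:2000) {
  idx   <- sample(n, ceiling(batch_frac * n))                  # random subsample
  Xb    <- X[idx, , drop = FALSE]; yb <- y[idx]
  p_hat <- plogis(as.vector(Xb %*% beta))
  grad  <- as.vector(crossprod(Xb, yb - p_hat)) / length(idx)  # average score
  beta  <- beta + lr * grad / sqrt(iter)                       # decaying step size
}
round(cbind(truth = beta_true, estimate = beta), 2)

In the actual method, such subsampled optimisation is used when computing the approximate marginal likelihoods inside GMJMCMC, as analysed in [5].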
Further, we provide experiments showing this to be a working solution for GMJMCMC and BGNLMs.

3 Experiments

Reproducibility study

As an example to demonstrate the possibility of using BGNLM for inference, in [3] it was applied to a subset of the Open Exoplanet Catalogue data [6]. This dataset contains information about planets that are not in our solar system and comprises 11 variables, outlined in [3]. Kepler's third law [4] relates the square of a planet's orbital period P² to the cube of the semi-major axis of its orbit a³. The law corresponds to a ≈ K(P²M_h)^{1/3}, where M_h is the mass of the star the planet is orbiting; the gravitational constant G and 4π² of Kepler's original formulation are replaced with the constant K. By setting a to be the dependent variable and the remaining variables as independent, [3] were able to apply BGNLM to recover the law. Noting that the correlations between (P²M_h)^{1/3}, (P²R_h)^{1/3} and (P²T_h)^{1/3} (with R_h being the radius of the star and T_h its temperature) were almost 1, all three variants were considered as true positives (TP). Following [3], we consider every feature with a posterior marginal probability above 0.25 to be a discovered feature. Every time we merge multiple runs, we count how many of the discovered features are not among the defined true positives, and this count gives the false positives (FP). A merged result that includes at least one of the true features with a posterior marginal probability above 0.25 is considered to contain exactly one true positive (TP). The final measure, the false discovery rate (FDR), is the proportion of false positives among all discovered features. From these definitions, and K runs, we calculate the power, the expected number of false positives and the FDR as in [3].

To demonstrate the efficiency of the new implementation of GMJMCMC, we replicated the results from [3] in this experiment, using the same set of nonlinear transformations and parameters as in the original experiment. Using the measures mentioned above, the results are presented in the left panel of Table 1. We see that for 64-thread runs we recover the true feature 95% of the time; as expected, the power is lower when fewer threads are used. Our results are almost on par with those presented in [3], yet there are still some tuning parameters that might need to be adjusted more carefully.

Table 1: Power and false discovery rate (FDR) for the Kepler's third law experiment (left) and the simulated tall data experiment (right).

Kepler's third law                     Simulated tall data
Threads  Power        FDR              Threads  Power  FDR
1        0.06 (0.14)  0.97 (0.86)      1        0.42   0.89
16       0.56 (0.84)  0.69 (0.18)      32       0.67   0.78
64       0.95 (1.00)  0.13 (0.01)      128      0.76   0.74

Simulation study with tall data (n ≫ m)

Next, to check how subsampling influences the convergence of GMJMCMC, we test how GMJMCMC equipped with S-IRLS-SGD performs in terms of recovering the true features of the data-generating process. Example 7 in the appendix of [3] used a set of data with complex interactions. It consisted of 50 binary covariates x ∼ Bernoulli(0.5) and one dependent variable y ∼ N(x^T β, 1), with n = 1,000. The coefficient vector was set to β = (1, 1.5, 1.5, 6.6, 3.5, 9, 7, 7, 7) for the data-generative features and 0 otherwise, and the expected value of y was specified as

E(y | x, β) = β_0 + β_1 x_7 + β_2 x_8 + β_3 x_18 x_21 + β_4 x_2 x_9 + β_5 x_12 x_20 x_37 + β_6 x_1 x_3 x_27 + β_7 x_4 x_10 x_17 x_30 + β_8 x_11 x_13 x_19 x_50.
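A minimal Python sketch of this data-generating process, written from the description above rather than from the authors' code, is given below; the mapping of the nine listed coefficients to β_0, …, β_8 follows our reading of the text, and the noise scale is illustrative (it was increased in the n = 100,000 run to match the signal strength).

```python
# Hedged sketch of Example 7's data-generating process as described above.
import numpy as np

rng = np.random.default_rng(0)
n = 1_000                                      # n = 100,000 in the tall-data experiment
X = rng.integers(0, 2, size=(n, 51))           # columns 1..50 are the covariates; column 0 unused
beta = [1, 1.5, 1.5, 6.6, 3.5, 9, 7, 7, 7]     # our reading: beta_0, ..., beta_8

mean = (beta[0]
        + beta[1] * X[:, 7] + beta[2] * X[:, 8]
        + beta[3] * X[:, 18] * X[:, 21] + beta[4] * X[:, 2] * X[:, 9]
        + beta[5] * X[:, 12] * X[:, 20] * X[:, 37]
        + beta[6] * X[:, 1] * X[:, 3] * X[:, 27]
        + beta[7] * X[:, 4] * X[:, 10] * X[:, 17] * X[:, 30]
        + beta[8] * X[:, 11] * X[:, 13] * X[:, 19] * X[:, 50])
y = rng.normal(loc=mean, scale=1.0)            # unit variance, as in the n = 1,000 example
print(X.shape, y[:5])
```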
To assess the performance of our implementation of GMJMCMC combined with S-IRLS-SGD, we generated a tall data set with n = 100,000, with the signal strength adjusted to be equivalent to that of [3] by increasing the variance of the observations. We chose to use 0.75% of the observations per S-IRLS-SGD iteration and ran GMJMCMC on the data with n = 100,000. For every population, MJMCMC was allowed to explore the model space for 1,000 iterations. The runs on 1, 32, and 128 threads were each repeated 100 times. The right panel of Table 1 presents the results from the experiment with n = 100,000 observations using S-IRLS-SGD with 0.75% of the observations at each iteration and 1,000 MJMCMC iterations per population. We observe improvements in both power and FDR as the computational effort increases.

Abalone shell predictions

The Abalone dataset, available at https://archive.ics.uci.edu/ml/datasets/Abalone, is a long-standing reference for predicting abalone age from physical measurements. Following [3], we divided the 4,177 observations into a set of 3,177 for training and 1,000 for testing. We chose to use 10% of the observations per S-IRLS-SGD iteration. We ran GMJMCMC with 32 threads on the Abalone data with the other tuning parameters equivalent to [3] and u = exp(−2), corresponding to the AIC penalty. The results are summarised in Table 2 and show almost identical performance compared to the full-sample analysis.

Table 2: Prediction performance in terms of root mean square error (RMSE), mean absolute error (MAE), and Pearson's correlation (CORR).

Model       RMSE                      MAE                       CORR
BGNLM SUB   1.9506 (1.9343, 2.0121)   1.4443 (1.4239, 1.4653)   0.7854 (0.7738, 0.7892)
BGNLM       1.9573 (1.9334, 1.9903)   1.4467 (1.4221, 1.4750)   0.7831 (0.7740, 0.7895)

4 Conclusion

We developed a new implementation of GMJMCMC empowered with subsampling through S-IRLS-SGD. The resulting approach is available as an R package at https://github.com/jonlachmann/GMJMCMC. Its appearance on CRAN will hopefully help spark interest in BGNLM and GMJMCMC and get more researchers involved in developing these techniques further, even for tall data sets. The reduced computational complexity (detailed time comparisons are available in [5]) makes larger data sets feasible for BGNLM at little or no cost in inferential and predictive performance, as shown in the examples.

References

[1] A. Hubin and G. Storvik. Mode jumping MCMC for Bayesian variable selection in GLMM. Comp. Stat. and Data Anal., 127:281–297, Nov 2018.
[2] A. Hubin, G. Storvik, and F. Frommlet. A Novel Algorithmic Approach to Bayesian Logic Regression. Bayes. Anal., 15(1):263–333, 2020.
[3] A. Hubin, G. Storvik, and F. Frommlet. Flexible Bayesian nonlinear model configuration. J. Artif. Intell. Res., 72:901–942, 2021.
[4] J. Kepler. Astronomia nova. Vögelin, 1609.
[5] J. Lachmann, G. Storvik, F. Frommlet, and A. Hubin. A subsampling approach for Bayesian model selection. Int. J. Approx. Reason., 151:33–63, 2022.
[6] H. Rein. A proposal for community driven and decentralized astronomical databases and the Open Exoplanet Catalogue, 2012.
23rd European Young Statisticians Meeting 63 Exploring Nonlinear Associations between Features: A Tool for Capturing Complex Relationships Kimon Ntotsis∗1 2 3 , Andreas Artemiou , and Alex Karagrigoriou 1 NIHR Leicester Biomedical Research Centre-Respiratory, United Kingdom & Department of Respiratory Sciences, University of Leicester, United Kingdom 2 Department of Information Technologies, University of Limassol, Cyprus 3 Laboratory of Statistics and Data Analysis, University of the Aegean, Greece & Graphic Era Deemed to be University, India Abstract: When evaluating associations between features, many techniques traditionally rely on linear associations. However, real-life data often exhibits associations that deviate from linearity, necessitating the development of new approaches to assess these non-linear associations. This study introduces an association coefficient as a novel tool for detecting associations between features, utilizing kernel functions. The methodology is investigated in com-parison with already existing and frequently utilized coefficients. The analysis results revealed that the proposed coefficient consistently outperforms exist-ing coefficients in terms of accuracy and performance in many cases. The methodology demonstrated optimal or satisfactory accuracy rates regardless of the underlying data distribution and sample size, without favouring linear or non-linear associations. These findings contribute to the advancement of statistical modelling by highlighting the necessity of capturing complex relationships for inferential, descriptive or predictive purposes. Keywords: Association, kernel function, measures of association. AMS subject classification: 62H20, 62R07, 46E22 1 Overview of Association Measures Feature selection is widely used in model evaluation, aiming to choose relevant features for analysis. It addresses the curse of dimensionality and reduces ∗ Corresponding author: kn170@leicester.ac.uk 64 23rd European Young Statisticians Meeting execution times while assuming the presence of redundant or uninformative features. However, most feature selection methods focus on linear associations i.e., Pearson’s, Spearman’s, and Kendall’s correlation coefficient. To overcome this limitation, alternative non-linear association measures have been pro-posed, such as Mutual Information (MI), Distance and Hoeffding’s association coefficients. Also, kernel functions offer a simple yet efficient solution by mapping data to a higher-dimensional feature space. The Kernel Mutual Information (KMI), which utilizes a linear kernel function, has been shown to achieve enhanced performance with low bias and variance. This allows for efficient and effective feature selection, as shown by Gretton et al. [1]-[2]. Ntotsis et al. [4] propose the Kernel Association Coefficient (KAC) as a novel coefficient capable of detecting associations, both linear and non-linear. KAC exhibits a high level of accuracy in measuring associations since it utilizes the eigenvalues derived from kernel-based matrices, which is designed for recognizing linear and non-linear patterns in data. While other kernels can also be employed, their investigation suggests that linear and polynomial kernels are the most suitable with a standardized 2-degree polynomial kernel maximizing the accuracy of KAC. The definition of the polynomial kernel is given below while for further details see Ntotsis et al. [4]. Definition ⊺ 1 (Polynomial kernel). If ⃗ Z k = ( z 1 k , z 2 k , . . . 
, z_mk)^⊺ and Z⃗_n = (z_1n, z_2n, ..., z_mn)^⊺ are two vectors in R^m from a matrix Z = [Z⃗_1 ... Z⃗_p] ∈ R^{m×p}, k, n ∈ {1, 2, ..., p}, then the polynomial kernel is defined as

PolyKer(α, c, d) = (α z_ik^⊺ z_in + c)^d,

where α, c and d ≥ 0 are tuning parameters.

2 Kernel Association Coefficient

In this section, the methodology for the establishment of the Kernel Association Coefficient (KAC) is presented. The KAC is an indicator capable of recognising linear and non-linear data transformations, such as quadratic, cubic, trigonometric, etc., at a significantly high level. It is considered a measure of association between continuous features with non-zero elements. Let Z⃗_k = (z_1k, z_2k, ..., z_mk)^⊺ ∈ R*^{m×1}, 1 ≤ k ≤ p, be a vector from the non-zero matrix

Z = (z_ij)_{i=1,...,m; j=1,...,p} = [Z⃗_1 ... Z⃗_p],

where f_d(z_ik, z_jk) is the associated d-degree kernel of the reproducing kernel Hilbert space F_{Z⃗_k} and R* is the set of non-zero real numbers, i.e., R − {0}. The centered Grammian matrix is a symmetric positive semi-definite matrix and takes the form

G_c^{Z⃗_k} := C_m G^{Z⃗_k} C_m = [ f_d^c(z_ik, z_jk) ]_{i=1,...,m; j=1,...,m},   (1)

where f_d^c(z_ik, z_jk) is the centered element of G^{Z⃗_k} and C_m = I_m − (1/m) 1_m 1_m^⊺.

We now define the pivot matrix P(·, ·) ∈ R^{m×m}, on which the spectral decomposition is conducted for the evaluation of the association between the vectors. If Z⃗_k = (z_1k, z_2k, ..., z_mk)^⊺ ∈ R*^{m×1} and Z⃗_n = (z_1n, z_2n, ..., z_mn)^⊺ ∈ R*^{m×1}, with 1 ≤ k, n ≤ p, are two vectors of the matrix Z, then

P(Z⃗_k, Z⃗_n) = −(1/2) log | I − min(w_{Z⃗_k}, w_{Z⃗_n})^{−2} G_c^{Z⃗_k} G_c^{Z⃗_n} |,   (2)

where w_{Z⃗_j} is the minimum of the column sums of G_c^{Z⃗_j}, j = 1, ..., p.

Then, KAC(Z) is defined as a symmetric and positive semi-definite matrix containing the normalised first eigenvalue λ_1^{Z⃗_k, Z⃗_n} of the pivot function P(Z⃗_k, Z⃗_n) given in Equation (2). Each element of the matrix contains the association between two features Z⃗_k and Z⃗_n and can be expressed as

KAC(Z⃗_k, Z⃗_n) = λ_1^{Z⃗_k, Z⃗_n} / ( λ_1^{Z⃗_k, Z⃗_k} λ_1^{Z⃗_n, Z⃗_n} ) ∈ [0, 1],   (3)

with values close to one being indicative of strong association. KAC is related to a significant degree to the KMI, utilising a similar underlying mathematical foundation as the latter but for a different purpose and in a different context. Specifically, KAC is designed to capture associations between features and builds upon the foundation of KMI (Equation (2)) to increase the interpretability and applicability of the measure. In this sense, KAC can be considered a modification and extension of KMI.

3 Numerical applications

3.1 Simulation Case Study

In this study, an application was developed to analyse data generated from a Student's t-distribution with ν degrees of freedom (T ∼ t_ν). The procedure was applied selectively to data generated from various continuous distributions, considering different combinations of sample sizes and replications for 10 different models. Since these variations did not yield contradictory conclusions, only the T ∼ t_ν case is presented below. The effectiveness of KAC is compared with Pearson's, Spearman's, Kendall's, Distance and Hoeffding's association coefficients.
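To make the first ingredient concrete, the following Python sketch (our own illustrative code, not the authors' implementation) builds the degree-2 polynomial kernel Grammian of a single feature and its centered version from Equation (1); the pivot matrix of Equation (2) and the normalisation in Equation (3) are then computed from such centered Grammians.

```python
# Minimal sketch of the centered Grammian G_c = C_m G C_m of Eq. (1).
import numpy as np

def polynomial_gram(z, alpha=1.0, c=1.0, d=2):
    """Gram matrix G with entries (alpha * z_i * z_j + c)^d."""
    z = np.asarray(z, dtype=float).reshape(-1, 1)
    return (alpha * z @ z.T + c) ** d

def centered_gram(z, **kernel_kwargs):
    """Centered Grammian G_c = C_m G C_m with C_m = I_m - (1/m) 1 1^T."""
    G = polynomial_gram(z, **kernel_kwargs)
    m = G.shape[0]
    C = np.eye(m) - np.ones((m, m)) / m
    return C @ G @ C

rng = np.random.default_rng(0)
y = rng.standard_t(df=5, size=100)      # a t-distributed feature, as in Section 3.1
Gc = centered_gram(1.5 * y)             # centered Grammian of the feature 1.5 * Y
print(Gc.shape, np.allclose(Gc, Gc.T))  # a symmetric m x m matrix
```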
The dataset used in the analysis consisted of univariate modelling with five explanatory features (X1 − X5 ), where X1 and X2 were linear and non-linear transformations of the response variable Y . Table 1 provides the percentage of occurrences where the association between Y and each X j is found to be statistically significant (association exceeds 0.5). X Model KAC Pearson Spearman X 1 % | X 2 % | X 3 % | X 4 % | X 5 % 2 1 = 1 . 5 ∗ Y & X 2 = Y 100% | 100% | 10% | 0% | 7% 100% | 20% | 0% | 0% | 0% 100% | 0% | 0% |0% | 0% 2 X 1 = 0 . 5 ∗ Y & X 2 = Y 100% | 100% | 34% | 15% | 27% 100% | 0% | 0% | 0% | 0% 100% | 0% | 0% | 0% | 0% X 3 1 = 2 ∗ Y & X 2 = Y + 2 t ν 100% | 70% | 20% | 21% | 19% 100% | 100% | 0% | 0% | 0% 100% | 100% | 0% | 0% | 0% 3 X 1 = Y & X 2 = 5 ∗ Y 100% | 80% | 0% | 0% | 0% 100% | 100% | 0% | 0% | 0% 100% | 100% | 0% | 0% | 0% 3 X 1 = sin ( Y ) & X 2 = Y 100% | 100% | 0% | 0% | 0% 100% | 100% | 0% | 0% | 0% 100% | 100% | 0% | 0% | 0% X1 = sin(Y ) & X 2 = tan(Y )2 100% | 75% | 0% | 0% | 0% 100% | 0% | 0% | 0% | 0% 100% | 0% | 0% | 0% | 0% X1 = 0.5 ∗ Y + tν & X2 = tan(Y )2 100% | 55% | 0% | 0% | 0% 25% | 0% | 0% | 0% | 0% 18% | 0% | 0% | 0% | 0% 3 X 1 = Y / 2 & X 2 = 0 . 5 ∗ Y 100% | 70% | 4% | 7% | 13% 100% | 100% | 0% | 0% | 0% 100% | 100% | 0% | 0% | 0% X 1 = Y /2 & 3 X 2 = 0 . 5 ∗ Y + (5 2 ∗ Y)/2 100% | 100% | 10% | 8% | 5% 100% | 13% | 0% | 0% | 0% 100% | 0% | 0% | 0% | 0% X X 1 = 0.5 ∗ Y & X2 = tan(Y ) 100% | 100% | 0% | 0% | 0% 100% | 0% | 0% | 0% | 0% 100% | 40% | 0% | 0% | 0% 3 1 = 1 . 5 ∗ Y & X 2 = cos ( Y ) 100% | 0% | 17% | 14% | 15% 100% | 0% | 0% | 0% | 0% 100% | 0% | 0% | 0% | 0% Model Kendall Distance Hoeffding X 1 % | X 2 % | X 3 % | X 4 % | X 5 % X1 = 1.5 2 ∗ Y & X 2 = Y 100% | 0% | 0% | 0% | 0% 100% | 100% | 0% | 0% | 0% 100% | 0% | 0% | 0% | 0% X X1 = 0.5 2 ∗ Y & X 2 = Y 100% | 0% | 0% | 0% | 0% 100% | 100% | 0% | 0% | 0% 100% | 0% | 0% | 0% | 0% 3 2 1 = 2 ∗ Y & X 2 = Y + t ν 100% | 85% | 0% | 0% | 0% 100% | 100% | 0% | 0% | 0% 100% | 0% v 0% | 0% | 0% 3 X 1 = Y & X 2 = 5 ∗ Y 100% | 100% | 0% | 0% | 0% 100% | 75% | 26% | 21% | 34% 100% | 100% | 0% | 0% | 0% X X X1 = sin(Y ) & X 2 = 3 Y 100% | 100% | 0% | 0% | 0% 100% | 100% | 0% | 0% | 0% 100% | 100% | 0% | 0% | 0% 2 1 = sin ( Y ) & X 2 = tan ( Y ) 100% | 0% | 0% | 0% | 0% 100% | 0% | 0% | 0% | 0% 100% | 0% | 0% | 0% | 0% 2 1 = 0 . 5 ∗ Y + t ν & X 2 = tan ( Y ) 0% | 0% | 0% | 0% | 0% 20% | 0% | 0% | 0% | 0% 0% | 0% | 0% | 0% | 0% X X1 = Y /2 & 3 X 2 = 0 . 5 ∗ Y 100% | 100% | 0% | 0% | 0% 100% | 100% | 15% | 19% | 23% 100% | 100% | 0% | 0% | 0% 3 2 1 = Y / 2 & X 2 = 0 . 5 ∗ Y + (5 ∗ Y ) / 2 100% | 0% | 0% | 0% | 0% 100% | 100% | 0% | 0% | 0% 100% | 0% | 0% | 0% | 0% X 1 = 0.5 ∗ Y & X2 = tan(Y ) 100% | 75% | 0% | 0% | 0% 100% | 30% | 0% | 0% | 0% 100% | 50% | 0% | 0% | 0% X 1 = 1.5 ∗ Y & X 2 = cos(Y )3 100% | 0% | 0% | 0% | 0% 100% | 90% | 0% | 0% | 0% 100% | 0% | 0% | 0% | 0% Table 1: Comparison table of coefficients performance. Percentage of times that the association of Y with each X j is statistically significant. Table 1 shows the significance frequency of each Xj in the models. The KAC coefficient has an average Type I error rate of around 7%, incorrectly identifying X3, X4, and X5 as significant features. The KAC coefficient falls within an acceptable range for Type I error rates, while the other coefficients have a Type I error rate around 0%. In terms of true positive outcomes, the KAC coefficient accurately identifies X1 and X 2 as significant features with an 89% accuracy rate. 
The Distance coefficient follows with 83% accuracy, while the remaining coefficients have accuracy rates below 65%. Additional analysis reveals that both the KAC and Distance coefficients have an average accuracy of approximately 63% in 23rd European Young Statisticians Meeting 67 identifying as significant only the association between X1 , X2 , and Y , while the other coefficients never exceed 40% (Ntotsis et al. [4]). 3.2 Real Case Studies In this section, the performance of KAC is assessed on real-case data using a dataset previously published by Ntotsis et al. [3]. The dataset includes several economic growth indicators, such as Balance of Trade to GDP (BoT ), Government Debt to GDP (GovDebt), Gross Domestic Product Growth Rate (GDPGR), Inflation Rate (InfR), Interest Rate (IntR), and Unemployment Rate (U nemR). These indicators can exhibit both linear and non-linear associations based on events that impact a country’s economic stability. For example, long-term events like economic crises or wars may linearly decrease an indicator’s value over time, while temporary events like tourism seasons can temporarily increase the indicators. Following these temporary events, the indicators tend to decline, potentially resembling trigonometric functions between tourism seasons. The significance of the continuous regressors in Table 2 is ranked, with statistically significant features shown in bold. In this study, we aim to explore the potential non-linear associations between these features using the GDP GR as the response variable. Note that all variables are highly correlated to each other based on relevant history due to the fact that express different aspects of an economy. Elastic Information Criterion Dataset KAC Int R GovDebt Bot InfR UnemR Pearson GovDebt Bot InfR UnemR IntR Spearman GovDebt Bot IntR InfR UnemR Kendall Int R GovDebt Bot UnemR InfR Distance Int R GovDebt UnemR Bot InfR Hoeffding GovDebt Bot IntR InfR UnemR Table 2: Association rank of GDPGR with Xj , j = 1, .., 5 According to the findings presented in Table 2, the KAC demonstrates the highest level of accuracy. It effectively identifies the significance of all regressors, distinguishing their contribution to the analysis. In contrast, the other coefficients fail to accurately capture the significance of the regressors. Among them, Kendall and Distance coefficients perform relatively well by identifying the significance of three regressors, but they do not fully capture the degree of association with the remaining two. Notably, KAC, Distance, and Kendall coefficients are the only ones capable of capturing the non-linear 68 23rd European Young Statisticians Meeting association that exists between GDP GR and IntR. 4 Conclusion This study extends the evaluation of the Kernel Association Coefficient (KAC), a novel coefficient for detecting linear and non-linear associations between features with the utilization of the eigendecomposition of a kernel-based function (Ntotsis et al. [4]). The results demonstrate that the KAC achieves optimal or sufficient accuracy rates across different models and is sui table for data with various distributions. It is unbiased towards linear or non-linear associations and unaffected by sample size. The KAC e ffectively highlights the significance of key features and their impact on the modeling process, making it a promising tool for various applications. However, the KAC is sensitive to zero-valued data and has a notable computational cost for extremely large datasets. 
Despite these limitations, it remains a valuable coefficient for association and an informative tool for feature selection and data analysis in real-world problems. Researchers should consider these limitations to better understand the capabilities of the KAC and develop effective strategies f or its application in t heir work. References [1] A. Gretton, R. Herbrich, and A. Smola. The Kernel Mutual information. IEEE International Conference on Acoustics, Speech, and Signal Processing Proceedings, 4, IV-880–IV-883, 2003. [2] A. Gretton, R. Herbrich, A. Smola, O. Bousquet, and B. Sch¨olkopf. Kernel Methods for Measuring Independence. The Journal of Machine Learning Research, 6, 2075–2129, 2005. [3] K. Ntotsis, A. Artemiou, A. Karagrigoriou. Interdependency Pat- tern Recognition in Econometrics: A Penalized Regularization Antidote. Econometrics, 9, 44, 2022. [4] K. Ntotsis, A. Artemiou, A. Karagrigoriou. Kernel Association Coefficient for capturing nonlinear association between features. submitted, 2023. 23rd European Young Statisticians Meeting 69 Precision of Individual Shapley Value Explanations Lars ∗1 Henry Berge Olsen 1Department of Mathematics, University of Oslo, Norway Abstract: Shapley values are extensively used in explainable artificial intel-ligence (XAI) as a framework to explain predictions made by complex machine learning (ML) models. In this work, we focus on conditional Shapley values for predictive models fitted to tabular data and explain t he prediction f ( ∗ x) for a single observation ∗ x at the time. Numerous Shapley value estimation methods have been proposed and empirically compared on an average basis in the XAI literature. However, less focus has been devoted to analyzing the precision of the Shapley value explanations on an individual basis. We extend our work in [6] by demonstrating and discussing that the explanations are systematically less precise for observations on the outer region of the training data distribution for all used estimation methods. This is expected from a statistical point of view, but to the best of our knowledge, it has not been systematically addressed in the Shapley value literature. This is crucial knowledge for Shapley values practitioners, who should be more careful in applying these observations’ corresponding Shapley value explanations. Keywords: Shapley values, explainable artificial i n telligence, prediction explanation, feature dependence. AMS subject classification: 6 2D10, 6 2E17, 6 2G05, 6 2G07, 6 8T01, 91A12. 1 Introduction Complex ML models often obtain accurate predictions for supervised learning problems in numerous fields but at the cost of interpretability. Not under-standing the input’s influence on the MLm odel’s output i s a significant ∗ lholsen@math.uio.no 70 23rd European Young Statisticians Meeting drawback; hence, the XAI field has proposed several types of post hoc explana-tion frameworks [5]. One of the most commonly used explanation frameworks is Shapley values, a promising explanation methodology with desirable and unique theoretical properties and a solid mathematical foundation [1, 4, 8]. Shapley values originated in cooperative game theory as a solution concept of how to fairly divide a payout of a game between the players based on their contribution, but it was reintroduced as an explanation framework in XAI by [4, 9]. It is most commonly used as a model-agnostic explanation framework with local explanations. 
Model-agnostic means that Shapley values do not rely on model internals and can explain predictions made by any predictive model f. Local explanation means that Shapley values explain the local model behavior for a single prediction f(x*) by providing feature importance scores, and not the global model behavior across all data instances. We focus on conditional Shapley values, which take feature dependencies into consideration, in contrast to interventional Shapley values. A disadvantage of the conditional Shapley values, compared to the interventional counterpart, is that they require the estimation/modeling of non-trivial conditional expectations. There is an ongoing debate about when to use the two versions [2], and [6] provides an overview of possible estimation methods. Throughout this article, we refer to conditional Shapley values when we discuss Shapley values.

We focus on the supervised learning setting where we aim to explain predictions made by a model f(x) trained on X = {x^[i], y^[i]}_{i=1}^{N_train}, where x^[i] is an M-dimensional feature vector, y^[i] is a univariate response, and N_train is the number of training observations. More specifically, we want to explain individual predictions f(x*) for specific observations x* using Shapley values. We demonstrate and discuss that the explanations will be less precise for test observations in the outer region of the training distribution. Thus, Shapley value practitioners should be more careful when using these explanations.

2 Shapley Values: Theory and Estimation Methods

Shapley values are a solution concept of how to fairly divide the payout of a cooperative game v : P(M) → R among the M players based on four fairness axioms [8]. Here M = {1, 2, ..., M} denotes the set of all players, P(M) is the power set, that is, the set of all subsets of M, and v(S) maps a subset of players S ∈ P(M), also called a coalition, to a real number representing their contribution in the game. In XAI, we treat the features x* as the players, the predictive model f (indirectly) as the game, and the prediction f(x*) as the payout to be fairly distributed onto the features. Furthermore, we call v(S) the contribution function. See [1, 4] for why the four fairness axioms give Shapley values desirable properties in the model explanation setting.

Figure 1: Schematic overview of the paradigms, method classes, and estimation methods used to compute the conditional Shapley value explanations.

The Shapley values ϕ_j = ϕ_j(v) assigned to each feature j, for j = 1, ..., M, are given by

ϕ_j = Σ_{S ⊆ M\{j}} [ |S|! (M − |S| − 1)! / M! ] ( v(S ∪ {j}) − v(S) ),

where |S| is the number of features in coalition S. Each Shapley value is a weighted average of the feature's marginal contribution to each coalition S. A common choice for the contribution function in XAI, see [1, 3, 4, 6], is v(S) = v(S; x*) = E[ f(x) | x_S = x*_S ] = E[ f(x_S̄, x_S) | x_S = x*_S ], where x_S = {x_j : j ∈ S} denotes the features in S and x_S̄ = {x_j : j ∈ S̄} denotes the features outside the coalition S, that is, S̄ = M\S. One of the desirable Shapley value properties is that ϕ*_j describes the importance of the jth feature in the prediction f(x*) = ϕ_0 + Σ_{j=1}^M ϕ*_j, where ϕ_0 = E[f(x)] ≈ ȳ_train. That is, the sum of the ϕ* explains the difference between the prediction f(x*) and the global average prediction. Note that the Shapley values can be negative.
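A small Python sketch of the Shapley value formula above is given below. It is a generic exact computation by subset enumeration, not one of the estimation methods discussed in this paper; the contribution function v and the toy game are our own illustrative choices.

```python
# Exact Shapley values by enumerating coalitions (illustrative, O(2^M) per feature).
from itertools import combinations
from math import factorial

def shapley_values(v, M):
    """Exact Shapley values phi_1, ..., phi_M for a game v on M players;
    v maps a frozenset of player indices to a real number."""
    phi = [0.0] * M
    for j in range(M):
        others = [k for k in range(M) if k != j]
        for size in range(len(others) + 1):
            for S in combinations(others, size):
                weight = factorial(len(S)) * factorial(M - len(S) - 1) / factorial(M)
                phi[j] += weight * (v(frozenset(S) | {j}) - v(frozenset(S)))
    return phi

# Toy game with M = 3: the payout is the sum of the values of the present features,
# so the Shapley values recover each feature's own contribution exactly.
x_star = [2.0, -1.0, 0.5]
v = lambda S: sum(x_star[j] for j in S)
print(shapley_values(v, 3))   # [2.0, -1.0, 0.5]
```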
In Figure 1, we provide a schematic overview of the methods used to estimate v(S) by ˆ v(S). We group the methods into six method classes based on their characteristics in accordance with [6]. That is, if the methods (implicitly) assume feature independence, use empirical estimates, parametric assumptions, generative methods, or separate/surrogate regression models. where ) = f (x , x ), K k =1 S S K is the number of Monte Carlo samples, f is the predictive model, ( We further group the six method classes into two paradigms. The first one uses Monte Carlo integration, i.e., ˆ 1 ( ) K k ∗ v ( S and k) ∗ x ∼ p ( x | x = x ) are the generated Monte Carlo samples, for S S S S k = 1 , 2 , . . . , K . These samples must closely follow the (generally unknown) true conditional distribution of the data to yield accurate Shapley values. Thus, any regression model error (MSE) loss function, i.e., The second paradigm uses that v(S) is the minimizer of the mean squared 2 v ( S ) = arg min x , c E ( f ( x c S ) − ) ∗ | x x S =. S S g x S ( S ), fitted to the MSE loss function, will 72 23rd European Young Statisticians Meeting approximate v( ) and yield an alternative estimator ˆ( ) = S v S g (x ). The S S accuracy of ˆ v( ) depends on, e.g., the form of the predictive model S f (x) and the flexibility of the regression model g (x ). We can either train a separate S S regression model g (x ) for each ( ) or a single surrogate regression S S S ∈ P M model g(˜ x ) which approximates the contribution function v( SS) for all S ∈ P ( ) simultaneously. In [6], we thoroughly discuss the notation, method M (implementation) details, and how the methods estimate the true conditional distribution/expectation, and we provide extensive recommendations for which method (or methods) to use in different situations. 3 Simulation Study: Results and Discussion We focus on the gam more interactions experiment in [6], but we obtain Σ [ i] ] M − l | i jl = | j ρ for ρ = 0 . 5. While the response y = β 0 + β j x) + j =1 j [ i ] [ i ] [ i ] [ ] [ i ] 2 2 γ g , x g 1 ( x ) + i γ g ( x , x ) + ε , where ( x 1 2 2 3 j , x k ) = x j x k + x j x + x k x , 4 k j [ ] β = 1 0 { . 0 , . 2 , 0 8 − . , 1 . 0 , 0 . 5 , 0 8 0 7 0 6 , = 0 8 i − . , . 6 , 0 0 − . , 1 − . } γ { . , − . , and } ε ∼ (0 X = [ ] i ] [ i Ntrain [ withi] { x , y } N train = 1000. Here x 8 (0 Σ), where =1 ∼ N , i [ cos( similar results for the other experiments in [6]. The training data set is N ( ) is a GAM with splines for , 1). The corresponding predictive model f x the nonlinear terms and tensor product smooths for the nonlinear interaction terms. We create Ntest = 250 test observations by the same data-generating procedure and explain the corresponding predictions made by f . A common evaluation criterion in XAI is the mean absolute error (MAE) between the true and estimated Shapley values averaged over all test ob- servations and features, see [1, 6, 7]. That is, MAE = MAEϕ (method q) = N 1 N test 1M [ ] [i] ϕ test i=1 M j=1 | j,true ( i ˆ x ) − ϕj,q(x ) . The true Shapley values are | generally unknown, but we can compute them with arbitrary precision in our setup as we know the true data-generating process. The MAE is suit-able to measure the average precision of a method, but it tells us nothing about the spread in the errors for the different test observations. Uncovering systematized patterns in the errors is of high interest in the Shapley value explanation setting as the explanations are used on an individual basis. 
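The MAE criterion described above, together with the per-observation errors used to look for systematic patterns, can be sketched as follows (our own code, with assumed array shapes of (N_test, M) for both the true and the estimated Shapley values).

```python
# Hedged sketch of the MAE evaluation criterion and its per-observation version.
import numpy as np

def mae_per_observation(phi_true, phi_est):
    """Mean absolute Shapley error for each test observation (average over features)."""
    return np.mean(np.abs(phi_true - phi_est), axis=1)

def mae(phi_true, phi_est):
    """Overall MAE: average over both test observations and features."""
    return float(np.mean(mae_per_observation(phi_true, phi_est)))

rng = np.random.default_rng(1)
phi_true = rng.normal(size=(250, 8))                      # e.g. N_test = 250, M = 8
phi_est = phi_true + rng.normal(scale=0.1, size=(250, 8)) # a noisy estimate
per_obs = mae_per_observation(phi_true, phi_est)
print(mae(phi_true, phi_est), per_obs.argmax())           # overall MAE, worst observation
```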
In Figure 2, we see that the parametric methods, which are all able to model the Gaussian distribution, obtain the lowest MAE and, therefore, the most precise Shapley value explanations. However, we see several outliers for most methods, that is, values greater than the upper quartile plus 1.5 times the interquartile range. In Figure 3, we see that it is the same test observations that yield large errors for all methods. Furthermore, we see a clear pattern in the errors when we color-encode the test observations based on their Euclidean distance to the empirical center of the training data distribution. Note that using the empirical mean as the center does not apply to multimodal data.

Figure 2: Left: Boxplot of the MAE between the true and estimated Shapley values in the gam more interactions experiment, ordered based on overall MAE. Right: Histograms of the explained predictions f(x*) with color indicating the corresponding MAE using the Gaussian method or Euclidean distance.

Figure 3: Plots of the MAE for each test observation for different pairs of methods.

The predictions with the largest Shapley value explanation errors correspond to the test observations furthest away from the center of the training data. This is not surprising from a statistical point of view, as the estimation methods have little to no training data to learn the feature dependence structures in these outer regions. Making a correct parametric assumption about the data reduces the errors compared to the more flexible methods, but the errors are still larger compared to test observations closer to the training data center.

The left histogram in Figure 2 illustrates that observations x* with a large |f(x*) − ϕ_0| yield larger mean absolute errors. This tendency is natural, as the magnitude of the Shapley values ϕ* has to increase since f(x*) = ϕ_0 + Σ_{j=1}^M ϕ*_j. This means that the scale of the Shapley values changes based on x*, as we illustrate in Figure 4 for three test observations with predicted responses below, close to, and above ϕ_0. Thus, the relative Shapley value errors can be larger for observations with predictions close to ϕ_0, but this has a minimal impact on the overall MAE due to its absolute and not relative scale. Scaling the MAE by 1/|f(x^[i]) − ϕ_0|, for i = 1, 2, ..., N_test, leads to problems, as the MAE will blow up to infinity for f(x^[i]) ≈ ϕ_0 and is not defined for f(x^[i]) = ϕ_0.

Figure 4: Plots of the true and estimated Shapley values for three test observations.

In Figure 4, we see that most methods find the true influential features. Thus, if the Shapley values are only used to roughly uncover the important features and rank them, then the practitioner can be less careful than if the estimated Shapley values are used directly in further analysis. For the latter, the practitioner should consider the location of x*.

4 Conclusion

We have demonstrated and discussed that Shapley value explanations are less precise for test observations in the outer regions of the training data, w.r.t. the MAE evaluation criterion, and argued that practitioners should be more careful when applying these observations' Shapley value explanations.

Acknowledgements: I want to thank my supervisors, Ingrid Glad, Martin Jullum, and Kjersti Aas, for their guidance. This work was supported by The Norwegian Research Council 237718 through the research center BigInsight.
References [1] Aas K., Jullum M., Løland A., Explaining individual predictions when features are dependent: More accurate approximations to Shapley values, Artificial Intelligence, 298 (2021). [2] Chen H., Covert I., and Lundberg S., Lee S., Algorithms to estimate Shapley value feature attributions, arXiv preprint arXiv:2207.07605 (2022). [3] Covert I., Lundberg S., Lee S., Explaining by removing: A unified framework for model explanation, Journal of Machine Learning Research, 22:209 (2021), 1-90. [4] Lundberg S., Lee S., A unified approach to interpreting model predictions, Advances in neural information processing systems (2017), 4765-4774. [5] Molnar C., A Guide for Making Black Box Models Explainable (2nd ed.), christophm.github. io/interpretable-ml-book , 2022. [6] Olsen L. H. B., Glad I. K., Jullum M., Aas K., A Comparative Study of Methods for Estimating Conditional Shapley Values and When to Use Them, arXiv:2305.09536 (2023). [7] Olsen L. H. B., Glad I. K., Jullum M., Aas K., Using Shapley Values and Variational Autoencoders to Explain Predictive Models with Dependent Mixed Features, Journal of Machine Learning Research, 23:213 (2022), 1-51. [8] Shapley L. S., A value for n-person games, Contributions to the Theory of Games, 2:28 (1953), 307-317. [9] Strumbelj E. and Kononenko I., An efficient explanation of individual classifications using game theory, Journal of Machine Learning Research, 11 (2010), 1–18. 23rd European Young Statisticians Meeting 75 Resampling methods in conditional independence testing Ma�lgorzata ∗1,2 �Laze¸cka 1 2 Faculty of Mathematics, Informatics and Mechanics University of WarsawInstitute of Computer Science Polish Academy of Sciences Abstract: This article presents the properties of four resampling scenarios in the context of conditional independence testing. We use an information-theoretic measure called conditional mutual information as the test statistic. We present two viewpoints: one is based on the asymptotic behavior of the conditional mutual information estimated on a resampled sample. The other uses resampled p-values and is based on exchangeability of statistics computed on a sample and resampled samples under the null hypothesis. Keywords: conditional independence, conditional mutual information, re-sampling, asymptotic distribution, permutation AMS subject classification: 62E20, 62F40, 62H15 1 Introduction Testing for conditional independence is much more challenging than for un-conditional independence. In the discrete case, conditional independence of two variables X and Y given the third one Z (denoted as X ⊥⊥ Y |Z) holds if for every layer Z = z the two variables are independent. Thus the curse of dimensionality is a significant obstacle when the dimensionality of the conditioning variable is high. In the context of feature selection, we consider Y as the target variable, Z as a vector of already selected variables, and X ∗ Corresponding author: m.lazecka@uw.edu.pl 76 23rd European Young Statisticians Meeting as a candidate that we may consider including in the model if it provides additional information about Y (i.e. is not independent of Y ) in the presence of Z. Thus conditional independence testing of X and Y given Z serves in this problem as a tool. In this article we provide a summary of the results from the PhD thesis [2] and the article [3]. First, we introduce the necessary notation and definitions. Then, both asymptotic and non-asymptotic results related to resampling methods are presented. 
Finally, we show some applications of the theoretical results in conditional independence testing. For detailed proofs of the presented theorems and further references, we refer to [2, 3].

1.1 Preliminaries

We consider a discrete-valued triple (X, Y, Z), where X ∈ X, Y ∈ Y, Z ∈ Z. Assume that P(X = x, Y = y, Z = z) = p(x, y, z) > 0 holds for any (x, y, z) ∈ X × Y × Z. Conditional mutual information is defined as

I(X, Y | Z) = Σ_{x,y,z} p(x, y, z) log [ p(x, y, z) p(z) / ( p(x, z) p(y, z) ) ] = D( p(x, y, z) || p_ci(x, y, z) ),   (1)

where p_ci(x, y, z) = p(x|z) p(y|z) p(z) and D denotes the Kullback–Leibler divergence. It follows from the properties of the Kullback–Leibler divergence that I(X, Y | Z) ≥ 0 and I(X, Y | Z) = 0 if and only if X is independent of Y given Z. These properties show that, in this sense, conditional mutual information is a measure of conditional dependence. In the following, we will denote the conditional mutual information as a function of the vector of probabilities (p(x, y, z))_{x,y,z}, namely CMI(p) = I(X, Y | Z).

Let (X_i, Y_i, Z_i)_{i=1}^n be independent random variables with common discrete distribution P(X_i = x, Y_i = y, Z_i = z) = p(x, y, z). The maximum likelihood estimator of the vector of probabilities (p(x, y, z))_{x,y,z} is the vector of fractions (p̂(x, y, z))_{x,y,z} = (n(x, y, z)/n)_{x,y,z}, where n(x, y, z) = Σ_{i=1}^n I(X_i = x, Y_i = y, Z_i = z). We estimate conditional mutual information using a plug-in estimator, namely

CMI(p̂) = Σ_{x,y,z} p̂(x, y, z) log [ p̂(x, y, z) p̂(z) / ( p̂(x, z) p̂(y, z) ) ].

The asymptotic behavior of CMI(p̂) is described by the following lemma.

Lemma 1. If X ⊥⊥ Y | Z we have that 2n CMI(p̂) →_d χ²_{(|X|−1)(|Y|−1)|Z|}.

The G² test, which is also known as the likelihood-ratio test of conditional independence, employs 2n CMI(p̂) as a test statistic. The G² test also relies on the fact that the test statistic follows an approximate chi-squared distribution under conditional independence, as stated in Lemma 1. In the next section, we focus on the properties of conditional mutual information computed using the resampled sample, denoted CMI(p̂*).

2 Resampling methods

We introduce four resampling scenarios mimicking the situation in which X is conditionally independent of Y given Z, while the joint distributions of (X, Z) and (Y, Z) remain unchanged. We denote the resampled sample as (X*_i, Y*_i, Z*_i)_{i=1}^n. We consider the following resampling scenarios:

• Bootstrap CI (BCI) – we sample new observations (X_i, Y_i, Z_i) from the distribution p̂(x|z) p̂(y|z) p̂(z) independently for i = 1, 2, ..., n,
• Conditional Randomisation (CR) – we assume that the conditional distribution p(x|z) is known, and then we copy (Y_i, Z_i) and sample X*_i from the distribution p(x|Z_i) independently for i = 1, 2, ..., n (see [1]),
• Bootstrap X (BX) – similarly as in CR, we copy (Y_i, Z_i) and sample X*_i from p̂(x|Z_i),
• Conditional Permutation (CP) – we copy (Y_i, Z_i)_{i=1}^n and then, for each subset of observations J = {j : Z_j = z, j = 1, 2, ..., n}, we permute the values of (X_j)_{j∈J}. This process is repeated for all z ∈ Z (see [4]).

Similarly to p̂, we define p̂* as p̂*(x, y, z) = (1/n) Σ_{i=1}^n I(X*_i = x, Y*_i = y, Z*_i = z), and the plug-in estimator of conditional mutual information is CMI(p̂*).

2.1 Asymptotic approach

We derive the asymptotic distributions of the vectors of estimated probabilities obtained from a resampled sample, i.e.
(p̂*(x, y, z))_{x,y,z}. Next, we use the obtained asymptotic distribution of (p̂*(x, y, z))_{x,y,z} to show the asymptotic behavior of the plug-in estimator CMI(p̂*).

Lemma 2. For almost all sequences (X_i, Y_i, Z_i)_{i=1}^∞ and conditionally on (X_i, Y_i, Z_i)_{i=1}^∞, we have that

√n ( p̂*(x, y, z) − p̂(x|z) p̂(y|z) p̂(z) )_{x,y,z} →_d N(0, Σ),   (2)

where Σ depends on the resampling scenario and

Σ^(CP) ≤ Σ^(BX) = Σ^(BCI) ≤ Σ^(CR),   (3)

where A ≤ B means that B − A is positive semi-definite. Moreover, for the CR scenario p̂(x|z) = p(x|z) in (2).

The proof of Lemma 2 for all the resampling scenarios except CP is straightforward and relies on the multivariate Berry–Esseen theorem. This is because, given the sample, the observations in the resampled sample are independent of each other for BCI, CR, and BX, which is not the case for the CP scenario. Thus the proof for the CP scenario is substantially different and relies on the convergence of a (generalised) multivariate hypergeometric distribution to a multivariate normal distribution, which is obtained in [3]. Using Lemma 2 and the delta method we can prove the following theorem.

Theorem 1. For almost all sequences (X_i, Y_i, Z_i)_{i=1}^∞, and conditionally on (X_i, Y_i, Z_i)_{i=1}^∞, we have that 2n CMI(p̂*) →_d χ²_{(|X|−1)(|Y|−1)|Z|}.

Theorem 1 states that under the null hypothesis there is a consistency between the distribution of CMI(p̂*) for each of the considered resampling scenarios and the result for CMI(p̂) (see Lemma 1). The proof of Theorem 1 uses a Taylor expansion of CMI, as a function of probabilities, at p̂_ci, multiplied by 2n, namely

2n CMI(p̂*) = 2n CMI(p̂_ci) + 2n (p̂* − p̂_ci)′ D_CMI(p̂_ci) + √n (p̂* − p̂_ci)′ H_CMI(p̂_ci) √n (p̂* − p̂_ci) + o_p̂( n ||p̂* − p̂_ci||² ),

where D_CMI and H_CMI denote the gradient and Hessian of CMI, respectively. The third term of the above equation plays a crucial role in obtaining the asymptotic distribution, as all other terms vanish as n → ∞. Although the asymptotic distribution of √n (p̂* − p̂_ci) differs across the considered resampling scenarios (see Lemma 2 and inequality (3)), the asymptotic distribution of CMI(p̂*) is the same for each of them.

2.2 Non-asymptotic approach

In addition to the asymptotic approach, we also present a finite sample method. We independently construct B resampled samples (X*_{n,b}, Y*_{n,b}, Z*_{n,b}) = (X*_i, Y*_i, Z*_i)_{i=1}^n using the CR or CP scenario for b = 1, 2, ..., B and calculate the values of a statistic T*_b = T(X*_{n,b}, Y*_{n,b}, Z*_{n,b}). In Theorem 2, we establish the validity of resampled p-values defined as (1 + Σ_{b=1}^B I(T ≤ T*_b)) / (1 + B), where T = T(X_n, Y_n, Z_n) and (X_n, Y_n, Z_n) = (X_i, Y_i, Z_i)_{i=1}^n.

Theorem 2. If the null hypothesis H_0 : X ⊥⊥ Y | Z holds, then

P( (1 + Σ_{b=1}^B I(T ≤ T*_b)) / (1 + B) ≤ α ) ≤ α,

where T*_b for b = 1, 2, ..., B is obtained from the CR or CP resampling scenario.

The proof is based on the exchangeability of T, T*_1, ..., T*_B, which holds for CR and CP (see [1, 2]), but not for bootstrap-based methods. In this article, we choose CMI as the test statistic T, with T*_b = CMI*_b = CMI(p̂*_b).
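The finite-sample procedure above is easy to sketch numerically. The following Python code (our own illustrative code, not the author's) combines the plug-in CMI statistic, conditional permutation (CP) resampling of X within each stratum of Z, and the resampled p-value (1 + #{T ≤ T*_b})/(1 + B); the toy data-generating mechanism at the end is an assumption for demonstration only.

```python
# Hedged sketch: plug-in CMI, CP resampling, and the resampled p-value.
import numpy as np
from collections import Counter

def cmi_plugin(x, y, z):
    """Plug-in estimator CMI(p_hat) for discrete samples x, y, z."""
    n = len(x)
    nxyz = Counter(zip(x, y, z)); nxz = Counter(zip(x, z))
    nyz = Counter(zip(y, z)); nz = Counter(z)
    return sum((k / n) * np.log(k * nz[c] / (nxz[(a, c)] * nyz[(b, c)]))
               for (a, b, c), k in nxyz.items())

def cp_resample(x, z, rng):
    """Conditional permutation: permute x separately within each level of z."""
    x_new = np.array(x, copy=True)
    for level in np.unique(z):
        idx = np.flatnonzero(z == level)
        x_new[idx] = x_new[rng.permutation(idx)]
    return x_new

def cp_test(x, y, z, B=50, rng=None):
    rng = rng or np.random.default_rng()
    t = cmi_plugin(x, y, z)
    t_star = [cmi_plugin(cp_resample(x, z, rng), y, z) for _ in range(B)]
    return (1 + sum(tb >= t for tb in t_star)) / (1 + B)   # resampled p-value

rng = np.random.default_rng(0)
z = rng.integers(0, 2, 300); y = rng.integers(0, 2, 300)
x = (y + (rng.random(300) < 0.2)) % 2     # X equals Y flipped with prob. 0.2: dependence given Z
print(cp_test(x, y, z, B=99, rng=rng))    # a small p-value is expected here
```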
3 Conditional independence testing

We consider three tests for testing H_0 : X ⊥⊥ Y | Z: asymptotic – a test based on Lemma 1 that uses the chi-squared distribution; exact – a test based on Theorem 2; df estimation – a test that also uses a chi-squared distribution, but adjusts the number of degrees of freedom based on the resampled samples, namely d̂ = (1/B) Σ_{b=1}^B 2n CMI(p̂*_b), where B = 50. While Theorem 2 does not apply to BCI and BX, we can still rely on their asymptotic guarantees (cf. Theorem 1), and because of that we include BCI and BX in the simulations.

3.1 Example

We show the performance of the tests in practice using an example¹. We generate conditionally dependent data N = 500 times from the joint distribution given by

p(x, y, z_1, z_2, z_3, z_4) = p(y) p(x|y) ∏_{i=1}^4 p(z_i|y),

where Y ∼ Bern(0.5), X | Y = y ∼ Bern(Φ(y − 0.5)), Z_i | Y = y ∼ Bern(Φ(γ(y − 0.5))), Φ is the CDF of a standard normal distribution, and γ = 0.5. We also sample data from the distribution p_ci obtained from the joint distribution, to assess type I error rates (N = 500 times). In the first panel of Figure 1, we observe that the testing procedures which rely only on asymptotic guarantees exceed the significance level (α = 0.05) for small sample sizes. In the middle and right panels, we present procedures for which the specified significance level is not exceeded. We note a slight advantage in terms of power when using df estimation compared to exact, while no notable differences are observed between the procedures based on CR and CP.

¹ Code available on GitHub: github.com/lazeckam/EYSM example

Figure 1: Attained significance level (left and middle panels) and power (right panel) of the considered testing procedures. frac is defined as the average number of observations per one value of the triple (x, y, z), namely frac = n/2⁶.

4 Conclusions

In conclusion, resampling-based methods serve as a promising alternative to the standard asymptotic test of conditional independence, with a wider applicability range in terms of small sample sizes.

Acknowledgements: I would like to thank my supervisor, prof. J. Mielniczuk.

References

[1] E. Candès, Y. Fan, L. Janson, and J. Lv. Panning for gold: model-X knockoffs for high dimensional controlled variable selection. J. R. Stat. Soc., B: Stat. Methodol., 80:551–577, 2018.
[2] M. Łazęcka. Properties of information-theoretic measures of conditional dependence. PhD thesis. 2022.
[3] M. Łazęcka, B. Kołodziejek, and J. Mielniczuk. Analysis of conditional randomisation and permutation schemes with application to conditional independence testing. In review. 2023.
[4] I. Tsamardinos and G. Borboudakis. Permutation Testing Improves Bayesian Network Learning. In Lecture Notes in Computer Science, volume 6323 LNAI, pages 322–337. 2010.
23rd European Young Statisticians Meeting 81 Some notes about Poisson–Laguerre tessellation with unbounded weights Martina ∗ 1 1 Petr´akova´ and Zbynˇek Pawlas 1 Department of Probability and Mathematical Statistics, Faculty of Mathematics and Physics, Charles University Abstract: The object of our research is the Poisson–Laguerre tessellation, particularly in the case where the random weights of the generating points are not uniformly bounded. In this paper, we present some properties of this random tessellation derived using the concept of tempered configurations. Keywords: Poisson–Laguerre tessellation, tempered configurations AMS subject classification: 60G55, 60D05 1 Introduction The Poisson–Laguerre tessellation is a Laguerre diagram – a weighted ge-neralisation of the well-known Voronoi diagram – with Poisson marked point process as its random generator (see [3] for theoretical background). One of the standard assumptions made when working with this stochastic model is to bound the random weights of the generating points uniformly. Our goal is to weaken this assumption. To do that, we consider the tempered configurations defined in [4] to deal with finite but possibly unbounded interactions in the Gibbs models. In this paper, we consider the tempered configurations in the Poisson case and use them to study the distance between two neighbouring generators in the Poisson–Laguerre tessellation. In Section 2 we present the main notation. Section 3 is devoted to the tempered configurations and Section 4 to the Poisson–Laguerre tessellation. ∗ Corresponding author: petrakova@karlin.mff.cuni.cz 82 23rd European Young Statisticians Meeting 2 Main notation Let d 2, 0 and 0 be fixed. Let ≥ δ > t > λ denote the Lebesgue measure on , -algebra on R d d the Borel d and d with B σ R B ( x, r ) the closed ball in R where A marked point process in Rd ×R+ is a random element in the space (M, M), d centre x and radius r. Also denote R+ = (0, ∞) and vd = λ(B(0, 1)). M is the set of all locally finite counting measures on such that R × R + their projections to is a suitable R d are still locally finite and M σ-algebra. Since we only consider simple marked point processes in this paper, we can identify the process with its support (see [5] for the theoretical background). We denote γB = γ ( ∩B ), × R d B +, and ) the number of points of ∈ B γ ( A γ in d A , d A ( ), . Let x denotes the Dirac ∈ B R × R + γ , then ∈ M ∈ R × R + δ x measure in x. Let Q be a probability measure on . The Poisson marked point process R + n the random variables ) n ∈ N t { η ( d A i are independent for ( } A i =1 i ) ∈ B R × R + t pairwise disjoint, and η t with intensity measure Λ = tλ ⊗ Q is a marked point process such that for η (A1 ) has Poisson distribution with parameter Λ(A1 ). The measure Q is called the mark distribution. Throughout this paper, M denotes a generic random variable with distribution Q. We will consider one of the two following assumptions: (M1): E d+δ 2( + ) M < ∞ or (M2): E d δ M < ∞. Apart from that, we pose no assumptions on Q. In particular, we do not want to assume that there exists K (0 ) such that ∈ , ∞ Q((0, K)) = 1. 3 Tempered configurations Definition 1. The set of tempered configurations is the set = , Mtemp l Ml ∈ N d where = : (1 + +δ M l { γ m) d ∈ M l k, k . ( x,m ) ≤ ∈ γ · ∀ ∈ N } B (0 ,k ) and τ (γ ) = min Then { l : ) ∈ N t γ ∈ M l } . f k ( η, k and ∈ N , τ ( t η) are well-+ defined random variables. Denote θ = t (1 + E d δ M ). 
If (M1) holds, then E v d + 2 f k ( t η ) = t v d If (M2) holds, then This · θ. var ( t δ f k ( η )) = d d · E (1 + M ) . k Let 1 d k and and denote ∈ N γ ∈ M temp f k ( γ ) = d (1 ++δ m) k ( x,m ) ∈ γ B (0 ,k ) can be easily shown by using the Campbell’s formula (see Th. 3.1.3 in [5]). In the following lemma, we show that under the assumption (M1), we can restrict ourselves to work with tempered configurations. Let us remark that the same claim was proved in [4] for the Gibbs processes under a stronger assumption on Q. However, since we only work with Poisson marked point process, it is enough to assume (M1). 23rd European Young Statisticians Meeting 83 Lemma 1. If (M1) holds, then P( t η ∈ Mtemp) = 1. Proof. Since t a. s. t η is ergodic, Cor. 12.2.V from [1] yields that f k ( η ) −−−−→ vd · θ. k →∞ t Therefore, l ( t t η ) = sup f ( η ) = 1. k k ( η ) is a. s. finite and P ∈ ∈ M temp N Under the stronger assumption (M2), we can prove the following for t τ ( η). Lemma 2. If (M2) holds, then Eτ ( t η)ε < for any 0 ∞ < ε < 2 . Proof. Denote t t 1 l ( η ) = sup f η ζ ( d < k k ( ) and ) = ∈ N k ∈ N d ∞. Then for any k a > θ , P t f ( t η f ( η f ( t η ) sup k ) k ) > a ≤ P > a ≤ P k − θ > a − θ k v ∈ N d v d v d k ∈ N k ∈ N ≤ var t +δ 2 f k ( η ) t E (1 + d M ) = . 2 d v ( a ) 2 a − θ ) 2 k ( d − θ v d k ∈ N k ∈ N Therefore, we have that P ( t E(1+ d +δ 2 l ( η ) M > a ) tv ≤ d ) ζ ( 2 ·(d) for a > θv a − θv d )d. + 2 Denote b = d δ tv d E (1 + M ) ζ ( d ). Altogether we obtain E ∞ ∞ t l ( η ) ε = P t ε t 1 l ( η ) > a d a = P l ( η ) ε − > a εa da 0 0 ∞ ε − 1 bεa ( + 1) ε ≤ θv d + d a < ∞ . ( a θv ) 2 θv d +1 − d Since ( ⌈ l ( γ ) = ⌉ τγ), we have that τ ( t η) ( t ≤ l η)+1 and the proof is finished. 4 Poisson–Laguerre tessellation Let x = ( d x, m ) . We define the power distance of a point to ∈ R × R d + z ∈ R 2 2 a weighted point x as ρ ( z, x) = ∥ x − z ∥ − m . Let γ ∈ M and x ∈ γ , then the Laguerre cell of x is L (x , γ ) = d x) { z ∈ R : ρ ( z, ≤ ρ ( z, y) y and ∀ ∈ γ } , the Laguerre diagram of γ is L ( γ ) = (x { L , γ ) : L (x , γ ) = x ̸ ∅ , ∈ γ } . Let γ and x d , x . Define the set of neighbours of the ∈ M ∈ R × R / + ∈ γ point x in L ( γ + δ γ L x ) as N(x , γ ) = y : (x + { ∈ , γ δ x ) ∩ L (y , γ + δ x ) = ̸ ∅} and let R(x , γ ) = max : ( ) {∥ x ) − y N(x be the maximum distance ∥ y, n ∈ , γ } between x = ( x, m ) and its neighbour. Also denote the closed halfspace (x H ≤ , y) = { z ∈ R d : ρ ( z, x) ≤ ρ ( z, y) . } 84 23rd European Young Statisticians Meeting Definition 2. The Poisson–Laguerre tessellation is the random Laguerre diagram t t L ( η ), whose generator is the Poisson marked point process η. A tessellation of Rd is a set of convex compact cells Ci, i ∈ N, with non- empty disjoint interiors, satisfying ∪i C ∈ Ni = Rd. The Poisson–Laguerre tessellation was studied in [3], where it was proved that our assumption (M1) is sufficient for L( t η) to a. s. be a normal tessellation (see Th. 4.1). The following lemma was shown in [2] (see Prop. 3.1) with the assumption of bounded weights. However, it is valid even in the unbounded case. Lemma 3. Let x = (0, m) d ∈ R × R+. Then there exist a random variable D t (x , η) and constants A(t, m), B(t) > 0 such that i) t L (x t , η + δ x ) ⊂ B (0 , D (x , η )) holds a. s. , ii) P(D(x t , η) ≥ s) ≤ A(t, m) · exp( d − B ( t ) · s) for any s > 0. Proof. Divide Rd into J disjoint cones Kj , j ∈ {1, . . . , J }, with a vertex at 3 the origin such that ⟨ x, y ⟩ ≥ ∥ x ∥ ∥ y ∥ for all x, y ∈ K , j ∈ { 1 , . . . , J . 
4 j } Let x j = ( x j , m j ) = argmin {∥ y ∥ : ( y, n ) t ∈ η c for j 1 , . . . , J , K j ∩ B (0 , 2 m ) } ∈ { } then clearly t L (x , η + J δ x ) ⊂ ∩ H j (x , x ). Let the random variable D (x t , η ) =1 ≤ j 4 be defined as D (x t , η ) = max 1 ≤ j ≤ J ∥ x ∥ . If z 3 j ∈ H ≤ (x , x j ) ∩ K j , then we 3 2 2 2 have ⟨ z, x j ⟩ ≥ ∥ x ∥ ∥ ∥ and 2 ⟨ z, x + 4 j z j ⟩ ≤ ∥ x j ∥ m < 2 ∥ x j ∥ . Therefore 4 ⟨ z,x j ⟩ 4 ∥ z ∥ ≤ ∥ x and hence , ≤ (x 3 j ∥ x ≤ ∥ H x t j ) ∩ K j ⊂ B (0 , D (x , η )). This j ∥ 3 implies that i) holds for our choice of D (x t , η ).   8 3 Let s ∈ ( m, ∞ ) and denote C ( s ) = K 3 j j ∩ B (0 , s ) \ B (0 , 2 m ) , then 4   J   J    4  3 P ( t D (x , η )  ≥ s ) = P ∥ x s  ∥ ≥ 3 j ∥ ≥ ≤ P ∥ x j s 4 j =1 j =1 =    d  (3 / 4) d d d s − 2 m t J · P η ( C j ( s ) × R + ) = 0 = J · exp − tv d · . J Setting  2d d  (3 4)d A ( t, m ) = m tv / J · exp tv d · and B ( t ) = d finishes the proof. J J We now derive an upper bound for the distance to the farthest neighbour. Lemma 4. Let γ ∈ Mtemp , x = (0, m) ∈ Rd × R+ and D = D(x, γ) be such that L(x, γ + δx) ⊂ B(0, D). Then R   max(d,δ)+δ   2 2 δ 2 d d + (x , γ ) ≤ D + 2 δ τ ( γ ) . 23rd European Young Statisticians Meeting 85 Proof. Denote for simplicity τ = τ (γ) and let y = (y, n) (x ∈ N, γ ) be any neighbour of x in L(γ + δ δ x ). Then there exists z + ∈ L (x , γ L x )(y ∩, γ + δx). Therefore, i) 2 2 2 2 2 2 2 ∥ y − z = ∥ − n ∥ z 2 ∥ − m ⇐⇒ ∥ y ∥ − ∥ z ∥ ∥ y cos α = ∥ n , − m ii) ∥z∥ ≤ D, where α is the angle between y and z. From the fact that γ ∈ Mτ , we have iii) d+ δ n (1 + d+δ d ≤ u ) ( x,u ) ∈ γ ≤ τ · ⌈∥ y ∥⌉. B (0 , ⌈∥ y ∥⌉ ) Altogether we get that ∥ 2 ii) i) iii) 2 2 2 2 y 2 ∥ − D ∥ y ∥ ≤ ∥ y 2 (1) ∥ − ∥ z ∥ ∥ y cos d ∥ α ≤ n ≤ τ d + δ ⌈∥ y ∥⌉ d + δ . There are two possibilities: a) ∥y 1 and ∥ ≤ ⌈∥y = 1 or b) 1 and ∥⌉ ∥ y ∥ > ⌈∥ + 1 2 y . Assume now that b) holds. Then (1) implies that ∥⌉ ≤ ∥ y ∥ ≤ ∥ y ∥ ∥ 2 2 2d 2 d−δ d y 2 ∥ − D ∥ y 2 d + ∥ ≤ δ τ = 2 d ∥ d + + y ∥ d + δ ⇒ ∥ δ y 2 ∥ − D ≤ τ ∥ y ∥ d δ . (2) Define a function d 2 − δ + h ( a ) = a 2 1. Then is − D − ca d + d δ δ , where d c = 2 τ > h 2 − δ continuous, h (1) < 0, lim ′ − a h ( a ) = and h ( ) = 1 d δ c →∞ ∞ a − a d + δ . Therefore, d + δ the equation h ( a ) = 0 has a unique solution (denote it by a 0 ) on (1 , ) and ∞ ( h a ) 0 implies ≤ a ≤ a 0 . 2 δ d − δ − If δ < d , then + δ h ( d a 0 ) = 0 implies a = 2 d + δ Da + c 2 D + c , since 0 0 ≤ d − δ + d a > δ a 0 1. Similarly for , we obtain ≥ d D 0 = 2 D + δ ca 2 + 0 ≤ c . Now it is enough to notice that (2) can be rewritten as h ( 0. ∥ y ) ∥ ≤ Altogether we have shown that if ( y, n ) (x ∈ N , γ ), then we have ∥ y 1 or ∥ ≤ 2 d + δ 2 δ 2 ∥ y ∥ ≤ 2 D + 2 d τ d + δ (for δ < d ) and d ∥ y ∥ ≤ 1 or + ∥ y ∥ ≤ 2 D + 2 τ d δ (for 2 d δ ≥ d ). Since 2 D + 2 + τ dδ > 1, the proof is complete. All of the properties round up to the following theorem. Theorem 1. 2 If (M2) holds and d δ > 1 , then E R (x t , η ) < for x . ∞ ∈ R × R + Proof. The stationarity of t D η implies that R (( x, m ) t , η ) = R((0, m) t , η ) for any 2 d x . Therefore, it is enough to show that ((0 ) for ∈ R E t R , m ) , η < ∞ any m ∈ R+. For simplicity, denote t D = D ((0 t , m ) , η ) , R = R ((0 , m ) , η) and τ t = τ ( η ). Thanks to the estimate from ii) in Lemma 3, it is easy to see that 86 23rd European Young Statisticians Meeting all moments of D are finite. If δ < d, then Lemma 4, Jensen’s inequality + 2 and Lemma 2 imply that E d δ 2 R ≤ a · E D δ + b · E τ δ < ∞ for some constants 2 a, b > 0, since < 2. 
If δ d , then analogously Lemma 4 and Lemma 2 δ ≥ 2 2 imply E R ≤ 2 E D + E 2 d τ d + δ < ∞ . 5 Conclusion In this paper we showed, using the concept of tempered configurations, that the distance to the farthest neighbour in the Poisson–Laguerre tessellation has a finite second moment, even without the assumption of bounded weights. The need to study this characteristic arose while deriving the central limit theorem for the functionals of the Poisson–Laguerre tessellation using the Malliavin–Stein method. In the next step of our research we intend to use the results from this paper to prove the desired CLT. Acknowledgements: This work was supported by the Czech Science Foun-dation, project no. 22-15763S. References [1] D. Daley and D. Vere-Jones. An Introduction to the Theory of Point Processes: Volume II: General Theory and Structure, Second Edition. Springer, New York, 2008. [2] D. Flimmel, Z. Pawlas and J. Yukich. Limit theory for unbiased and consistent estimators of statistics of random tessellations. J. Appl. Probab. 57(2):679–702, 2020. [3] C. Lautensack and S. Zuyev. Random Laguerre tessellations. Adv. Appl. Probab. 40(3):630–650, 2008. [4] S. Rœlly and A. Zass. Marked Gibbs point processes with unbounded interaction: an existence result. J. Stat. Phys. 179(4):972—996, 2020. [5] R. Schneider and W. Weil: Stochastic and Integral Geometry. Probability and its Applications. Springer-Verlag, Berlin, 2008. 23rd European Young Statisticians Meeting 87 Kalman Filter and Identifiability of the Observation Model Matej ∗1 2 Benko and Libor Zˇ´ak 1,2 Institute of Mathematics, Faculty of Mechanical Engineering, Brno University of Technology, Technick´a 2896/2, 616 69 Brno, Czech Republic Abstract: Kalman filter has become a commonly used approach in many technological applications to filter uncertainty from noisy measurements. In this paper, we describe the Kalman filter as an optimal unbiased estimator of the hidden state in the linear model. Additionally, a discussion of the model estimation is provided. In the end, as an original contribution, a discussion of the observation model is done. Keywords: Hidden Markov model, Kalman filtering, Bayesian estimation, Parameter identifiability, Observation model AMS subject classiication: 62M05 1 Introduction Kalman 1 filter (KF) is the successor to the so-called Wiener filter, which was developed as part of the army project to track targets from radar measurements during 2 World War II. Kalman incorporated the dynamic model into the Wiener filter and obtained the best (in terms of minimizing squared error) ∗ Corresponding author: Matej.Benko@vutbr.cz 1 Norbert Wiener (1894 – 1964), American mathematician and philosopher, originator of cybernetics. 2 Rudolf Emil K´alm´an (1930 – 2016), Hungarian-American electrical engineer, mathemati- cian, and inventor. 88 23rd European Young Statisticians Meeting PPP M. Benko and L. ˇZ´ak unbiased Bayesian3 estimator of the hidden state in the state-space linear model. 2 Solution of the Wiener Problem Definition 1 (Linear state-space system [3]). Let n X 0 , X 1 , . . . , X k , . . . ∈ R be sequence of state vectors, m Z , Z , . . . , Z , . . . 1 2 k sequence of measure-∈ R ment vectors, and u0 , u1, . . . , uk , . . . ∈ Rp be the sequence of control input vectors. We call the linear state-space system a following system of stochastic difference equations: X k = Fk X k−1 + Gkuk + ΓkV k, (1) Z k = Hk Xk + W k ; (2) where Fk is state transition model, Hk( R) is observation model and Gk −1 is the control-input model. 
Γk transforms the process noise into the state space, V k is the process noise with variance matrix Qk and W k is the measurement noise with variance matrix Rk. The initial state X0 ∼ N(x̄0, P0) is given.

Remark 1. The linear state-space model of Definition 1 describes the relations between the states of the hidden Markov model depicted in Figure 1.

Figure 1: Hidden Markov model described by the linear state-space system (states X_{k−1}, X_k, X_{k+1} with observations Z_{k−1}, Z_k, Z_{k+1}).

Proposition 1. The state vector X_k has a normal distribution X_k ∼ N(x̄_k, P_k).

We denote the set of realized measurements z_1, . . . , z_k up to the k-th time epoch as Z_k := {z_1, . . . , z_k}.

Note. KF estimates the parameters of the normal distribution of X_k from the set of all measurements up to the k-th time step.

3 Thomas Bayes (1701–1761), English statistician, philosopher and Presbyterian minister.

Since KF is a Bayesian estimator, we introduce the following notation. At time step k, h(x_k | Z_{k−1}) symbolizes the prior distribution of the hidden state, with mean x̄_{k|k−1} and variance matrix P_{k|k−1}. Similarly, h(x_k | Z_k) symbolizes the posterior distribution of the hidden state, with mean x̄_{k|k} and variance matrix P_{k|k}. At this point, we are able to present KF.

Theorem 1 (Kalman filter). The following equations are the solution of the so-called Wiener problem – optimal Bayesian estimation of the state in the linear state-space model:

Prediction:
x̄_{k|k−1} = F_k x̄_{k−1|k−1} + G_k u_k,
P_{k|k−1} = F_k P_{k−1|k−1} F_k^⊤ + Γ_k Q_k Γ_k^⊤.

Innovation:
ỹ_k = z_k − H_k x̄_{k|k−1},
S_k = H_k P_{k|k−1} H_k^⊤ + R_k.

Optimal Kalman gain:
K_k = P_{k|k−1} H_k^⊤ S_k^{−1}.

Update:
x̄_{k|k} = x̄_{k|k−1} + K_k ỹ_k,
P_{k|k} = (I_n − K_k H_k) P_{k|k−1}.

Remark 2. It is worth commenting on the stages of the Kalman filter. They are depicted in Figure 2. The first stage, called prediction, computes the prior distribution (red) from the previous posterior distribution (blue) and the dynamic state model (1). It is worth mentioning that the volume of the prior variance matrix P_{k|k−1} is higher than that of the posterior in the previous time step, P_{k−1|k−1}, as the process noise (model uncertainty) has been added to the system. Then, the observation z_k (cyan) is obtained with uncertainty described by the measurement noise variance matrix R_k. The mean of the posterior distribution is a convex combination of the measurement and the prior distribution. The precise position is determined by the Kalman gain K_k. In simple Bayesian estimation (e.g. estimation of normal distribution parameters or regression coefficients), the posterior distribution is determined by the Bayes formula. In this case, the posterior distribution is not uniquely determined, due to the presence of some degrees of freedom arising from the linear state-space model. Generally, the dimension of the state space does not have to be equal to the dimension of the measurement space. The Kalman gain is chosen as the matrix which minimizes the squared error of the hidden state estimation.

Figure 2: Scheme of the KF at time step k.

3 Likelihood of the State Model

Since the Kalman filter was initially used to track (maneuvering) objects, the question of estimating the maneuver type arose.
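Before continuing with the likelihood discussion, the following is a minimal numerical sketch of the prediction–innovation–update cycle of Theorem 1 for a one-dimensional constant-position model. The matrices F_k, H_k, Q_k, R_k and the simulated data are illustrative choices of ours, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative scalar model: X_k = X_{k-1} + V_k, Z_k = X_k + W_k
F, G, Gamma, H = 1.0, 0.0, 1.0, 1.0
Q, R = 0.1, 1.0             # process and measurement noise variances
x_post, P_post = 0.0, 10.0  # initial mean and variance of X_0

# Simulate a short trajectory and noisy measurements
x_true = np.cumsum(rng.normal(0.0, np.sqrt(Q), 50))
z = x_true + rng.normal(0.0, np.sqrt(R), 50)

for z_k in z:
    # Prediction
    x_prior = F * x_post + G * 0.0
    P_prior = F * P_post * F + Gamma * Q * Gamma
    # Innovation
    y_tilde = z_k - H * x_prior
    S = H * P_prior * H + R
    # Optimal Kalman gain
    K = P_prior * H / S
    # Update
    x_post = x_prior + K * y_tilde
    P_post = (1.0 - K * H) * P_prior

print(x_post, P_post)  # filtered mean and variance at the last time step
```

In the multivariate case the scalar products above become the matrix products of Theorem 1, with S_k^{-1} applied through a linear solve rather than an explicit inverse.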
The well-used technique is called the Interacting Multiple Model (IMM) algorithm [1, 2]. The estimation of process noise and measurement noise is discussed in [4]. In the mentioned sources, the Kalman filter likelihood is introduced. The likelihood of the Kalman filter, in fact, describes the probability that a partic-ular linear state-space model Ms is valid according to the given observation. Since the Bayesian nature of the problem, we are only able to to obtain likelihood conditioned on the previous measurements such that f (z k |Ms, Zk−1 ) = N(˜yk; 0, Sk ). (3) This can be understood in the following way. If the model suits the real behavior of the system well, the prior state projected into the measurement space should be close to the observed measurement. The tolerance of this ”distance” is covered by the innovation variance matrix Sk. 23rd European Young Statisticians Meeting 91 Kalman Filter and Identifiability of the Observation Model PPP In the reviewed literature, it is not mentioned, but for the estimation of the observation model Hk from (2), this likelihood shape is not valid. We discuss this in the next section. 4 Likelihood of the Observation Model Motivation. Motivation for the determining the likelihood of the observation model came from, e.g., the problem of estimating battery state of charge from noisy measurement of terminal voltage and current using Kalman filtering [5]. Temperature and battery aging are important parameters since they influence battery behavior. They determine the observation model. One of the authors’ research topics is to estimate the temperature and battery aging from behavior of the model. Conjecture 1. Likelihood function of the observation model is f (z k N(˜y ; 0 |M o , ) = Z k, − 1 k Rk ). (4) Let us compare (3) and (4). The innovation variance matrix S k contains the variance of prior information at the time step k (which in fact mainly carries the noise in the dynamic state system). In the case of observation model estimation, only information of measure noise is necessary to compute the likelihood. Information about noise in the dynamical state system is redundant. In Figure 3 the values of the two mentioned likelihood functions are depicted with respect to the values of the observation model H k in a one-dimensional case at a time step k. We can see in the (a) case that maximum of the likelihood function depends on the noise in the system (prior variance of the state), which is not proper behavior of the likelihood function. In the (b) case, with replacing the matrix Sk with the measurement variance matrix Rk we suppressed this phenomenon and the likelihood function behaved properly. 5 Conclusion In the paper, the Kalman filter has been introduced and described. Comments on its behavior and properties have been made. Then a discussion of the linear state-space model estimation has been provided. We have emphasized and demonstrated the observed phenomenon by the authors that the generally used likelihood of the Kalman filter is not appropriate for the observation model estimation. This is motivated by long-term effort of the authors to estimate the temperature and agging of the battery from the Kalman filter behavior used to estimate the battery state of charge from current and terminal voltage. 92 23rd European Young Statisticians Meeting PPP M. Benko and L. ˇZ´ak Pk |k−1 = 10 Rk = 0.5 ) 0.06 Pk |k−1 = 20 Rk = 1.0 ) 1 1 0 . 4 R k = 3.0 P k − | k − 1 = 30 − k k R Z k = 5.0 Z P , 0 . 
04 k | k − 1 = 40 , k k true H k true H H k H | | k k 0 . 2 z z ( ( 0 . 02 f f 0.00 0.0 0 1 2 3 4 5 0 1 2 3 4 5 Hk Hk (a) likelihood of state model (b) likelihood of observation model Figure 3: Comparison of likelihood function values for both state model and observation model with respect to values of observation model H k in one-dimensional case. Acknowledgements: This topic is part of the Doctoral Study of Matej Benko under the supervision of Libor ˇZ´ak at Brno University of Technology. References [1] Y. Bar-Shalom, S. Challa, and H. A. Blom. Imm estimator versus opti- mal estimator for hybrid systems. IEEE Transactions on Aerospace and Electronic Systems, 41(3):986–991, 2005. [2] Y. Bar-Shalom, X. Li, and T. Kirubarajan. Estimation with Applications to Tracking and Navigation: Theory, Algorithms and Software. 01 2004. [3] C. Chui and G. Chen. Kalman filtering, with real-time applications. Kalman Filtering: With Real-Time Applications, 01 2009. [4] J. Dunik, O. Straka, O. Kost, and J. Havl´ık. Noise covariance matrices in state-space models: A survey and comparison of estimation methods—part i. International Journal of Adaptive Control and Signal Processing, 31, 05 2017. [5] M. Mastali, J. Vazquez-Arenas, R. Fraser, M. Fowler, S. Afshar, and M. Stevens. Battery state of the charge estimation using kalman filtering. Journal of Power Sources, 239:294–307, 2013. 23rd European Young Statisticians Meeting 93 Asymptotic Properties of Parameter Estimators for Mixed Fractional Brownian Motion with Trend Mykyta ∗ 1 1 Yakovliev and Kostiantyn Ralchenko 1 Taras Shevchenko National University of Kyiv form Xt = θt + σWt + H κB , driven by a standard Brownian motion W and t H a fractional Brownian motion Abstract: We investigate the mixed fractional Brownian motion of the B with Hurst parameter H . We consider strongly consistent estimators of unknown model parameters (θ, H, σ, κ) based on the equidistant observations of a trajectory. Joint asymptotic normality of these estimators is proved for H 1 ∈ (0 , ). 2 Keywords: Brownian motion, Wiener process, mixed model, asymptotic distribution. AMS subject classification: 60G22, 62F12. 1 Introduction The standard Brownian motion is traditionally applied to model many real-world processes that evolve over time. At the same time, it cannot be used for accurate modelling of some time series with properties of long-range dependence, self-similarity and complex correlation structures [2, 3]. Alternatively, a fractional Brownian motion can be used which has correlated increments that imply short-range (H < 1/2) or long-range (H > 1/2) dependence. Recently, the so-called mixed fractional Brownian motion, i.e. a combination of the Wiener process and a fractional Brownian motion, has ∗ Corresponding author: mykyta.yakovliev@gmail.com 94 23rd European Young Statisticians Meeting attracted significant interest. This process was introduced by [1] with the aim to consider models of financial markets that are simultaneously arbitrage-free and have a memory. In the present paper, we consider the following mixed fractional Brownian motion with a linear trend: X H t t t θt = + σW + κB , t , ≥ 0 (1) where H W is a Wiener process, B is a fractional Brownian motion with Hurst index H , H B is independent of W . We aim to estimate unknown parameters (θ, H, σ, κ) based on observed {Xkh, k = 0, 1, 2, . . . }, h > 0. 
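For readers who want to experiment with model (1), a sample path on the observation grid can be simulated by combining Brownian increments with fractional Gaussian noise generated from its covariance function. The following Python sketch is our own illustration with arbitrary parameter values; it is not code from the paper.

```python
import numpy as np

def fgn_cov(n, H):
    """Covariance matrix of unit-grid fractional Gaussian noise increments."""
    k = np.arange(n)
    lags = np.abs(k[:, None] - k[None, :]).astype(float)
    return 0.5 * ((lags + 1) ** (2 * H) - 2 * lags ** (2 * H) + np.abs(lags - 1) ** (2 * H))

def simulate_mixed_fbm(n, h, theta, H, sigma, kappa, rng):
    """Simulate X_{kh} = theta*k*h + sigma*W_{kh} + kappa*B^H_{kh}, k = 0, ..., n."""
    # Brownian increments over steps of length h
    dW = rng.normal(0.0, np.sqrt(h), n)
    # fGn increments on the unit grid, rescaled by h^H (self-similarity of B^H)
    L = np.linalg.cholesky(fgn_cov(n, H) + 1e-12 * np.eye(n))
    dB = h ** H * (L @ rng.normal(size=n))
    t = h * np.arange(n + 1)
    X = np.concatenate(([0.0], np.cumsum(theta * h + sigma * dW + kappa * dB)))
    return t, X

rng = np.random.default_rng(1)
t, X = simulate_mixed_fbm(n=500, h=0.01, theta=1.0, H=0.3, sigma=1.0, kappa=0.5, rng=rng)
```

The statistics introduced in the next section can then be evaluated directly from the simulated increments of X.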
Following [4], we introduce the next four statistics X N − 1 1 ϕ N h N ( X := = X k+1)h − , N kh N k =0 1 N −1 ξN := 2 X , ( k +1) h − N k Xkh =0 (2) η N := X X X , ( k +1) h − X kh ( k +2) h − ( k +1) h N 1 N −1 k=0 ζN := X X X X ( k +2) h − kh , ( k +4) h − ( k +2) h N 1 N −1 k=0 and consider the following estimators for the parameters (θ, H, σ2 2 , κ): κ θ ˆ ϕ 2 1 ζ N N − 4 ϕ N H N = , N = log , 2+ 2 h 2 η N − ϕ N (3) 2 η 2 N − ϕ 1 N 2 2 2 2 ˆ N = H , σ ˆ ξ N = N − ϕ N − κ ˆ h N , N 2 H 2 H − 1 h h N (2 N − 1) with log x, if x > 0, 2 log x = 2+ 0 , if x ≤ 0 . The strong consistency of these estimators was proved in [4]. The goal of the present paper is to establish their joint asymptotic normality. 23rd European Young Statisticians Meeting 95 2 Main Results We will start with a study of the asymptotic properties of the vector τN = (ϕ N , ξN , ηN , ζN ). (4) According to [4, Lemma 3.2], for any H ∈ (0, 1), τN (E E → ϕ ,ξ , η ζ 1 1 E , 1 E τ 1 ) :=0 a. s., as N → ∞, where E ϕ 2 2 2 2 2H N = θh, E ξ N = θ h + σ h + κ h , E 2 2 2 2 2 H H − 1 2 2 2 2 H 2 H 2 1 η N = θ h + κ h 2 1 , E − ζ N = 4 θ h + κ h 2 2 H − 1 − . Further, let us denote the increment ∆Xk := X X ( k θh +1) h − kh = + σ∆Wk + κ∆ H B = θh + ∆Z kk , where ∆Zk := σ∆ H W k + κ ∆ B . It is well known that ∆ k and ∆ k { X are } { Z k } stationary Gaussian sequences with the following autocovariance function ρ ˜(i) = cov(∆X0 , ∆Xi) = cov(∆Z0, ∆Zi) = 2 2 σ cov (∆ H W 0 , ∆ H W i ) + κ cov ∆ B , ∆ B (5) 0 i = σ 2 2 2 h 1 κ { i +H h ρ(i), i =0 } ∈ Z, where 1 2 ρ 2 2H (i) := | i + 1 H 2 H + 1 (6) | 2 − |i| |i − | denotes the autocovariance function of the stationary sequence H { B k +1 − B H, k , which is known as a fractional Gaussian noise. The following ≥ 0 } statement provides some important properties of the sequences ˜ k ρ (i) and ρ(i), i , defined by (5) and (6) respectively. ∈ Z Lemma 1. Let H 1 (0 ). Then the following statements hold. ∈ , 2 1. ρ 2 2 H − ( i ) ∼ H (2 H − 1) | i | as |i| → ∞. 2. 2 ∞ ρ (i) < . i = ∞ −∞ 3. ∞ ρ ( i ) < and i = ∞ −∞ +∞ ρ ˜(i) < ∞. (7) i= −∞ 96 23rd European Young Statisticians Meeting 4. For all α, β ∈ Z, + ∞  ρ ˜ (i + α)˜ ρ(i + β) < ∞. (8) i=−∞ The proofs of these and the following statements are provided in [5]. Now we are ready to state the main result of this section. Theorem 1. 1 Let H ∈ (0 , ). The vector τ defined by (4) is asymptotically normal, namely 2 N   ϕN − E √ √  ξ − ξ  N ϕ N (  N E τN − N d  ⃗  τ0 ) = N − → N ( 0 , Σ) as N → ∞  η  N − E η N ζN − EζN with the asymptotic covariance matrix Σ , which can be presented explicitly as   Σ  Σ   11 Σ =   Σ  Σ  Σ  Σ   12 22 23 24   , 12 13 14 Σ Σ     Σ   13 Σ Σ   23 33 Σ 34  Σ  Σ  Σ   14 24 34 Σ44 where Σ +∞ +∞ +∞    2 2 2  11 = ρ ˜ (  i ) , Σ 22 = 2 ρ ˜ ( i ) + 4 θ h ρ ˜ (i), i= −∞ i=−∞ i=−∞ Σ  ∞ +  ∞  2 2  33 = ρ ˜ ( i ) ρ ˜ ( i ) + ˜ ρ ( i + 2) + 4 θ h ρ ˜(i), i= −∞ i=−∞ +∞   Σ 44 = ρ ˜(i) 6˜ ρ (i) + 8˜ ρ(i + 1) + 3˜ ρ (i + 2) + 4˜ ρ (i + 3) i= −∞ + 6˜  +∞  2 2 ρ ( i + 4) + 4˜ ρ ( i + 5) + ˜ ρ ( i + 6) + 64 θ h ρ ˜(i), i=−∞ Σ  ∞ +∞  2 2  23 = 2 ρ ˜ ( i )˜ ρ ( i + 1) + 4 θ h ρ ˜(i), i =−∞ i=−∞ Σ +∞ +  ∞   2 2  24 = 2 ρ ˜ ( i ) ρ ˜ ( i + 1) + 2˜ ρ ( i + 2) + ˜ ρ ( i + 3) + 16 θ h ρ ˜(i), i =−∞ i=−∞ 23rd European Young Statisticians Meeting 97  +∞   Σ 34 = ρ ˜ (i) ρ ˜ (i) + 2˜ ρ(i + 1) + 2˜ ρ (i + 2) + 2˜ ρ(i + 3) + ˜ ρ(i + 4) i=−∞ + 4 2 + ∞  2 θ h ρ ˜ (i), i= −∞ Σ = Σ  = 2θh ρ ˜ (i), Σ  = 8θh ρ ˜ (i). 12 13 14  + ∞ +∞   i =−∞ i=−∞ Remark 1 1 . 
The condition H < is essential since otherwise, the series in 2 Theorem 1 do not converge. Let us introduce the notation ϑ 2 2 2 2  = ( θ, H, κ , σ ) , ϑ N = (ˆ θ N , H  N , ˆ κ , σ ˆ ), N (9) N N ∈ N . Now, we can provide the main result on asymptotic properties of our estima- tors (3). Theorem 2. 1 Let H (0 ). The estimator ∈ , ϑ  is asymptotically normal, 2 N namely √    ˆ θ N − θ  √  H   d   N − H 0 N ϑ  N =   ⃗ − ϑ N 0 Σ − → N , , as N → ∞  2 2  κ ˆ κ N − 2 2 σ ˆ − σ N with the asymptotic covariance matrix 0 Σ that can be found by the formula Σ0 ⊤ = ′ g ( ′ τ  0 ) Σ ( g ( τ 0 )) , where Σ  is defined in Theorem 1 and g ′  1  0 0 0 h 2  θ ( l − 2) − 1 1 2 2 H − 1 0 2 2 k  2 H h l ( l +2) 2 H  κ h l log 2 κ h l ( l +2) log 2  ( τ 0 ) =  − 4 θ 2( l +2)( l − 1)+ cl ( l − 2) 2 (2+ c ) − 2 ( +1)+2  2 H − 1 2 0 l +2 l c ,  2 2 h l ( l +2) H 2 H 2 h l h l ( l +2)  2 4 θ ( l +4 l − 4) 1 4( l +1) 2 2 2 l h − 2 hl hl where 2H c = log h and l = 2 2. 2 − 98 23rd European Young Statisticians Meeting References [1] P. Cheridito. Mixed fractional Brownian motion. Bernoulli, 7(6):913–934, 2001. [2] Q. Dai and K. J. Singleton. Specification analysis of affine term structure models. J. Finance, 55:1943–1978, 2000. [3] Z. Ding, C. W. J. Granger, and R. F. Engle. A long memory property of stock market returns and a new model. Journal of Empirical Finance, 1(1):83–106, 1993. [4] A. Kukush, S. Lohvinenko, Y. Mishura, and K. Ralchenko. Two approaches to consistent estimation of parameters of mixed fractional Brownian motion with trend. Stat. Inference Stoch. Process., 25(1):159–187, 2022. [5] K. Ralchenko and M. Yakovliev. Asymptotic normality of parameter estimators for mixed fractional Brownian motion with trend. Austrian Journal of Statistics, 2023. (in press). 23rd European Young Statisticians Meeting 99 Ordering Awad-Tsallis Quantile Entropy and Applications to Some Stochastic Models R˘azvan-Cornel ∗1 Sfetcu 1University of Bucharest, Faculty of Mathematics and Computer Science Abstract: We define a stochastic order for Awad-Tsallis residual entropy and prove the preservation of this order in some stochastic models, such as the proportional hazard rate model, proportional reversed hazard rate model and proportional odds model. Keywords: Awad-Tsallis entropy, proportional hazard rate model, propor-tional reversed hazard rate model, proportional odds model AMS subject classiication: 60E15, 60K10, 62N05, 90B25, 94A17 1 Introduction and Preliminaries The notion of Shannon entropy has many generalizations (Awad-Shannon entropy, Tsallis entropy, R´enyi entropy, Varma entropy, Kaniadakis entropy, relative entropy etc.), these generalizations having applications in many areas, of which we mention: microchannels (see [2], [6]), Markov chains (see [3], [4],[5]) and model selection (see [13], [14]). Awad-Shannon entropy was introduced in [1] and Tsallis entropy, in [15]. In this paper we deal with Awad-Tsallis entropy and some of its generalizations. We assume that all expectations and all ratios are well defined. x Let α ∈ R \ {1}. Tsallis logarithm is given by log − T x α −1 1 ( x ) = , if α T − 1 > 0 and log (0) = 0. From now on we assume that α > 0. Let X ∗ Corresponding author: razvan.sfetcu@fmi.unibuc.ro 100 23rd European Young Statisticians Meeting be a nonnegative random variable with absolutely continuous cumulative distribution function def F , survival function F = 1 − and probability X X X F density function fX (X represents a living thing or the lifetime of a device). 
Let def δ = sup f (x). We assume in the whole paper that this supremum is X X X A 1 f ( − T at time in (0 x , ∞) for any density function. We define Awad-Tsallis residual entropy of t ≥ Z being a nonnegative random variable identically distributed like X . 0, t by H (t) = logT X Z ) E Z[Z > t] for any X − F X ( t ) δ X F X ( t ) The quantile function of X is given by Q def −1 X X u ( ) = F (u) = inf [0, ) F ( ) u for any [0 { , x ∈ ∞ | ≥ } ∈ 1]. X x u Differentiating with respect to u the both sides of the equality FX (QX (u)) = u ′ , we have ′ F ( Q X ( u )) Q (u) = 1. We denote ′ q X ( u ) = Q (u) and get X X X fX (QX (u))qX (u) = 1. Sunoj and Sankaran [12] introduced a quantile version of Shannon residual entropy, this idea being generalized for R´enyi residual entropy in [7] and [16], for Varma residual entropy in [10] and for Awad-Varma residual entropy in [9]. In this paper we work with a quantile version of Awad-Tsallis residual entropy, considering ΨA−T ( −T u ) = A H (QX (u)) for any u [0 1]. X X ∈ , In the whole paper U is a random variable uniformly distributed on [0, 1]. After some computations, we have, for any u ∈ [0, 1], Ψ 1 f X ( Z ) A − T ( T u ) = Z X X Z > Q − E log [ 1 (u)] = − u δ X (1 ) − u 1 f ( Q ( )) log T X X U [ − E U u < U < 1] . 1 ) − u δ X (1 − u The following lemma is very useful in this paper. Lemma 1. (see [7]) Let h : [0, 1] [0 be such that × , ) ∞ → R E U h ( u, U ) [ u < U < 1] ≥ 0 for any u ∈ [0 , 1] and g : [0 , ∞ ) → [0 , ∞ ) an increasing function. Then [ E U h ( u, U ) g ( U ) u < U < 1] 0 for any ≥ u [0 1]. ∈ , 2 Results Definition 1. We say that X is smaller than Y in Awad-Tsallis quantile entropy order (and write A T X ) if Ψ A − T ≤ A Y ( u ) Ψ − − T ( ) u [0 X ≤ u 1]. Y ∀ ∈ , 23rd European Young Statisticians Meeting 101 The following theorem is the main result of this paper. Theorem 1. The following assertions are equivalent: 0 1. X . ≤ A − T Y α − 1 δ f ( Z ) 2 − 1 T Y X . E Z f Y F ( F X ( Z )) log Y [ Z > t ] ≥ − 1 δ X f Y F ( F X ( Z )) Y for any t 0 . ≥ Proof. According to Definition 1, we have X ≤A Y − T if and only if − − 1 f ( Q ( U )) E U log T X X [ u < U < 1] ≤ 1 − u δ X (1 − u ) 1 f ( Q ( U log T Y Y )) [ E U u < U < 1] for any u [0 ∈, 1]. 1 − u δ Y (1 − u ) Considering QX (U ) = Z in the preceding inequality we have the equivalences (for any u ∈ [0, 1]): E E X f log T X ( Z ) − 1 ( u )] ≤ A − T Y ≥ ⇐⇒ E Z [ X Z > F (1 δ X − u ) ( Q Y F X ( ))) E T f ( Z Z log Y − 1 ⇐⇒ [ Z > F X ( u )] δ Y (1 − u ) f log Z T X ( Z ) f ( Q Z − log T Y Y ( F X ( ))) − 1 ( u )] ≥ 0 ⇐⇒ [ > F δ X (1 − u ) δ Y (1 − u ) X Z α − 1 δ ) Z f ( Y f ( Z − 1 Y F ( T 1 F X ( Z ))) log X − ( u )] Y ≥ 0. − 1 [ X Z > F δ X f Y ( F ( F X ( Z ))) Y Putting − 1 F(u) = t in these equivalences we get the conclusion. X Applications to some stochastic models In the sequel we present the preservation of Awad-Tsallis quantile entropy order in some stochastic models. 1. Proportional hazard rate model (see [11]) Let θ > 0. We take X (θ) and Y (θ) two absolutely continuous nonnegative random variables with survival functions (F X )θ , respectively (F Y )θ. Theorem 2. 1. If 0 < θ ≤ 1, X ≤A−T Y and δX δ δ ( θ ) Y ≤X δY (θ), then X (θ) ≤A−T Y (θ). 2. If θ 1, ≥ X (θ) ≤A Y − T (θ) and δX δ δ δ Y Y ( θ ) , then ≤ X X ( θ ) Y ≤ A − T . 102 23rd European Young Statisticians Meeting Proof. 
θ−1 − For any x 0 we have ) = ( ) ), 1 ≥ f x θ F x f x F F X ( θ ) ( X X ( ( θ ) ( x Y ( X ) = θ ) − 1 1 θ − 1 F ( − − 1 F X ( x )) and f F F x Y Y ( θ ) ( ) ( ) = θ F ( x ) Y ( θ ) X θ X f Y F ( F X ( x )) , Y ) f X ( θ ( x ) f hence = X ( x ) . Then, for any t 0, ≥ − 1 − 1 )) f F F x Y ( θ ) ( θ ) ( ) f Y F ( F X ( x Y Y ( X θ ) E α−1 δY Z) ( θ ) fX ( θ )( − 1 T Z ) log Z fY F ( θ ) FX( )( Y ( θ θ ) [Z > t] = − 1 ) δX F Z ( θ ) fY ( θ ) Y ( )( ( θ FX ) θ E α θ − 1 α−1 ( δYZ) ( θ ) fX − 1 T θ ( ) ( Z ) log Z F X Z fY F [Z > t] . FX Y − 1 δX F ( ) ( θ ) fY FX Z Y sume that 0 < θ 1, and ≤ X ≤ A Y δ − T δ X ( θ )Y ≤ δXδY (θ). It follows α θ − 1 that the function [0 We will prove only the case 1., the case 2. being analogous. Let us as- , ) ) ∞ ∋ x −→ θ F X ( x is nonnegative increas- ing, 1α−1 δ f (Z) − EZ fY F (FX (Z )) logT Y X Y [Z > t] ≥ − 1 δ X f Y F ( F X ( Z )) Y δ 0 for any Y (θ f )X (x) δ TY fX (x) t 0 and log T ≥ log ≥ − 1 − 1 δ X f F ( F ( x )) ( θ ) Y X δ X f Y F ( F X ( x )) Y Y for any x 0. Using Lemma 1, we conclude that ( ≥ X θ ) ≤ A Y θ ). − T ( 2. Proportional reversed hazard rate model (see [11]) For any θ > 0 let X (θ) and Y (θ) be two absolutely continuous nonnegative random variables with distribution functions ( θ F X ) and (FY )θ . The proof of the next theorem is analogous with the proof of Theorem 2 and is omitted. Theorem 3. 1. If θ 1, ≥ X ≤A Y and δ δ δ − T X X θ ( θ δ , then) ) Y ≤ X Y ( θ ( ) ≤A−T Y (θ). 2. If 0 < θ 1, ( ) ) and ≤ X θ ≤ A Y ( θ δ δ Y − T X Y , then X ( θ ) ≤ δ X ( θ δ ) Y ≤ A . − T 3. Proportional odds model (see [8]) X θF (x) Y p and p defined by the survival functions F X p ( x ) = X , 1 Let θ > 0 and the absolutely continuous nonnegative random variables respectively θF ) Y ( x F Y x p ( ) = for any x 0. ≥ 1 − (1 ) ) − θ F X ( x − (1 − θ)F Y (x) Theorem 4. 1. If θ ≥ 1, X ≤A Y δ − T andXp δY ≤ δX δYp, then Xp ≤A Y − Tp. 2. If 0 < θ ≤ 1, Xp ≤A−T Yp and δXδYp ≤ δXp δY , then X ≤A−T Y . 23rd European Young Statisticians Meeting 103 Proof. θu For any θ > 0, let h : [0 , 1] → R , h( u) = . If θ ≥ 1, then 1 h − (1 − θ) is an increasing concave function on [0, 1]. If 0 < θ ≤ 1, then h is an increasing u convex function on [0 , 1]. We have F Xp ( x ) = h F X ( x ) , F Y p ( x) = h F Y ( x), f 1 X p ( x ) = ′ ′ − h F X ( x ) f X ( x ), f Y p ( x ) = h F Y ( x ) f Y ( x ) and F F X ( x ) = Y p p F − f 1X (p (x) fX (x) F X ( x )), hence = for any Y x ≥ 0. − 1 − 1 f ( x )) f Y F ( F X Y Y p F F X ( x ) Y p p Then, for any t ≥ 0, Z F Z log FXp Z > t Yp [ = − 1 δXp fYp F ( FXp Z ) Yp E α ( ) − 1 δYp fXp Z − 1 T ( ) ] fYp E ′ α α ( ) − 1 δYp fX Z − 1 T h Z Z ( ) ( F X fY F FX Z ) log [Z > t] . Y − 1 ) δXp fY F ( FX Z Y E , ′ ∞ ) ∋ x −→ h F X (x) is nonnegative increasing, X Z − 1 α − 1 δ T Y f ( ) Z f Y F ( F X ( Z )) log Y [ Z > t ] ≥ − 1 δ X f Y F ( F X ( Z )) Y δ the function [0 We assume that θ ≥ 1, X ≤A and δ − T YXp δY ≤ δXδYp . Then α We will prove only the case 1., the case 2. being analogous. 0 for any T Y X f p (x) t ≥ 0 and log ≥ − 1 δ X p f Y F ( F X ( x )) Y ( x ) log T δ Y f X for any x ≥ 0, hence, by Lemma 1, we − 1 δ X f Y F ( F X ( x )) Y get X p ≤ A Y − T p . Acknowledgements: We are very much indebted to the anonymous referees and to the editors for their most valuable comments and suggestions which improved the quality of the paper. References [1] A. M. Awad. A statistical information measure. Dirasat, 12: 7–20, 1987. [2] M. M. Awad. A review of entropy generation in microchannels. Advances in Mechanical Engineering, 7: 1–32, 2015. 
104 23rd European Young Statisticians Meeting [3] V. S. Barbu, A. Karagrigoriou and V. Preda. Entropy, divergence rates and weighted divergence rates for Markov chains. I: The alpha-gamma and beta-gamma case. Proc. Rom. Acad., Ser. A, Math. Phys. Tech. Sci. Inf. Sci., 18: 293–301, 2017. [4] V. S. Barbu, A. Karagrigoriou and V. Preda, Entropy and divergence rates for Markov chains. II: The weighted case. Proc. Rom. Acad., Ser. A, Math. Phys. Tech. Sci. Inf. Sci., 19: 3–10, 2018. [5] V. S. Barbu, A. Karagrigoriou and V. Preda. Entropy and divergence rates for Markov chains. III: The Cressie and Read case and applications. Proc. Rom. Acad., Ser. A, Math. Phys. Tech. Sci. Inf. Sci., 19: 413–421, 2018. [6] W. H. Mah, Y. M. Hung and M. Guo. Entropy generation of viscous dissipative nanofluid flow in microchannels. Int. J. Heat Mass. Tran., 55: 4169–4182, 2012. [7] A. K. Nanda, P. G. Sankaran and S. M. Sunoj. R´enyi’s residual entropy: A quantile approach. Statist. Probab. Lett., 85: 114–121, 2014. [8] J. Navarro, Y. del Aguila and M. Asadi. Some new results on the cumulative residual entropy. J. Statist. Plann. Inference, 140: 310–322, 2010. [9] R.-C. Sfetcu, S.-C. Sfetcu and V. Preda. Ordering Awad–Varma entropy and applications to some stochastic models. Mathematics, 9: 280, 2021. [10] S.-C. Sfetcu. Varma quantile entropy order. An. ¸St. Univ. Ovidius Constant¸a, 29: 249–264, 2021. [11] M. Shaked and J. G. Shanthikumar. Stochastic Orders. Springer Science Business Media LLC, 2007. [12] S. M. Sunoj, P. G. Sankaran. Quantile based entropy function. Statist. Probab. Lett., 82: 1049–1053, 2012. [13] A. Toma. Model selection criteria using divergences. Entropy, 16: 2686–2698, 2014. [14] A. Toma, A. Karagrigoriou and P. Trentou. Robust model selection criteria based on pseudodistances. Entropy, 22: 304, 2020. [15] C. Tsallis. Possible generalization of Boltzmann-Gibbs statistics. J. Stat. Phys., 52: 479–487, 1988. [16] L. Yan and D. -t. Kang. Some new results on the R´enyi quantile entropy ordering. Statistical Methodology, 33: 55–70, 2016. 23rd European Young Statisticians Meeting 105 Approximate Post-Selection Inference for Group Lasso with Application to NHANES Sarah ∗ 1 Pirenne and Gerda Claeskens 1ORStat and Leuven Statistics Research Center, KU Leuven Abstract: The motivation behind this work stems from the need for a method of conducting post-selection inference for data that include grouped variables, such as the National Health and Nutrition Examination Survey (NHANES). In model selection, data are mined for strong associations to include in a statistical model. Statistical inference for the selected model must take this model selection into account, since the model is itself already a function of the data. We use a selective inference approach, which conditions hypothesis tests on the subspace of the data which leads to the same model selection as the observed data. Much research has been done in selective inference for polyhedral constrained selection events. For more complicated selection events, methods for selective inference are lacking. We develop an algorithm for approximate selective inference for parameters after selection events which do not necessarily have to be characterizable as a polyhedron. We apply this algorithm for conducting inference post-selection by the group lasso and illustrate our proposed method using a NHANES dataset. 
Keywords: post-selection inference, selective inference, group lasso, para-metric programming AMS subject classification: 6 2J07, 62F03 ∗ sarah.pirenne@kuleuven.be 106 23rd European Young Statisticians Meeting 1 Motivating example and problem setting Figure 1: 95% confidence intervals for regression coefficients selected by the group lasso for estimating Body Mass Index (BMI) from the National Health and Nutrition Examination Survey (NHANES) [1]: selective inference (SI), naive inference and 50-50% data split (DS) The National Health and Nutrition Examination Survey (NHANES) gathers data about the health, nutrition and socio-economic features of United States residents through a combination of survey questions and physical examinations. We compiled a dataset of size (n, p) = (5985, 50) from the data collection of 2017 to March 2020. We are interested in building a parsimonious model with low variability, yet not underfitting, for Body Mass Index (BMI) based on a set of 24 potential covariates. Since these can be both continuous (e.g. age) or (grouped) categorical covariates with multiple levels (e.g. race, education. . . ), we select variables with the group lasso, which is well-suited in this setting. 23rd European Young Statisticians Meeting 107 Using ⊤ for the transpose of a vector or matrix, the full coefficient vector is β = ( ⊤ ⊤ ⊤ p β , . . . , β ), where the subvector β[ ] has p components, for [1] [ J ] ∈ R j j j ∈ {1, . . . , J }. The group lasso estimator is defined in [13] as β ˆ J 1 2 √ = arg min ∥ Y − Xβ ∥ + λ p, 2 2 j ∥ β [ j ] ∥ 2 (1) ∈ R p β j =1 where q 2 1 λ ≥ 0 is a user-specified regularization parameter and ∥ u ∥ 2 = (/2 u ) j =1 j denotes the ℓ 2 -norm of a vector q u ∈ R . The ℓ 2 -norm in (1) encourages group- wise sparsity of coefficients, unlike group ridge penalization. Note that a group may consist of one variable. Then, the ℓ 1 -norm or the standard lasso penalty is retrieved. Note also that extensions to group (adaptive) elastic net are possible [5], but we restrict our attention to the seminal group lasso for simplicity of the exposition. 1 In applying (1) with λ fixed to √ , we selected the 15 covariates mentioned n in Figure 1, of which five were categorical covariates consisting of different levels (milk consumption, subjective diet, diabetes, educational level and race). We denote the indices of the active set of nonzero coefficients A ( Y ) = { j ∈ 1 { , . . . , p } : ˆ β j ̸ = 0 } and use A ( Y ) as a subscript for indicating restriction to the selection. Now, practitioners are often interested in inferring the significance of the selected coefficients only. This is a typical scenario for inference post-selection, with hypotheses generated by the data. 2 Brief literature overview We follow a selective inference (SI) approach which conditions statistics for hypothesis testing on the event that the data led to the observed selection. Due to its specialization to particular model selection methods, SI yields more powerful inference [3]. Unlike the lasso [7], the group lasso does not induce a polyhedral constrained selection event. This means that the subspace of the data which would lead to the same selected variables cannot be exactly described as a set of linear inequalities. This work is motivated by the need for a powerful method of conducting post-selection inference for regression coefficients when data contain grouped variables. 
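As a side note, the group lasso selection in (1) can itself be computed with a short proximal-gradient routine whose proximal step is block soft-thresholding. The following is a hedged sketch on simulated data (our own illustration, not the solver used for the NHANES analysis); the step size and the number of iterations are arbitrary choices.

```python
import numpy as np

def group_lasso_fit(X, Y, groups, lam, n_iter=2000):
    """Proximal gradient for 0.5*||Y - X b||_2^2 + lam * sum_j sqrt(p_j) * ||b_[j]||_2."""
    n, p = X.shape
    beta = np.zeros(p)
    step = 1.0 / np.linalg.norm(X, 2) ** 2  # 1 / Lipschitz constant of the smooth part
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - Y)
        z = beta - step * grad
        for g in np.unique(groups):
            idx = np.where(groups == g)[0]
            thr = step * lam * np.sqrt(len(idx))
            norm = np.linalg.norm(z[idx])
            # block soft-thresholding: shrink the whole group towards zero
            beta[idx] = 0.0 if norm <= thr else (1 - thr / norm) * z[idx]
    return beta

rng = np.random.default_rng(0)
n, p = 200, 12
X = rng.normal(size=(n, p))
groups = np.repeat(np.arange(4), 3)               # four groups of three coefficients
beta_true = np.concatenate([np.ones(3), np.zeros(9)])
Y = X @ beta_true + rng.normal(size=n)
beta_hat = group_lasso_fit(X, Y, groups, lam=1.0 / np.sqrt(n))
active = np.unique(groups[np.abs(beta_hat) > 1e-8])  # selected groups, the analogue of A(Y)
```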
Other approaches based on SI are limited to testing the significance of the entire group [9], or use Bayesian methods [11]. SI for the randomized group lasso estimator is obtained in [6]. Our inspiration is twofold. First, Liu et al. [8] suggest to conduct a parametrized line search to reduce the problem to one dimension. Second, 108 23rd European Young Statisticians Meeting Duy and Takeuchi [2] execute the latter for the lasso, and exploit its conve-nient property of linear inequality constraints to make the above-mentioned polyhedral approach to SI more powerful using parametric programming (PP). We obtain approximate SI for the group lasso by conducting a parametrized line search and using approximate PP to determine an approximation of the selection event. The proposed algorithm (see [12]) can easily be extended to different model selection methods, including other non-polyhedral selection events. 3 Method The response n Y ∈ R is assumed to follow a multivariate normal distribution: Y ⊤ = ( Y , . . . , Y 1 n ∼ N ) (µ, Σ), (2) where Σ = 2 2 σ I n is the variance covariance matrix. In practice, σ is estimated with the variance of the residuals from the full model (see [7] for a justification of this when n > p). We model µ as a linear function of the columns of X n×p ∈ R , but we do not assume that this is the true data generating mechanism. The linear model may thus be misspecified. This makes the β ∗ ⊤ −1 = ( X X A Y ) ) ⊤ A ( Y ( X µ. For the j th selected coefficient, we write ) A ( Y ) A ( Y ) 1 target of inference the least-false parameter value in the selected linear model β ∗ ⊤ = ⊤ η µ where η j = X X ( ) ( X ( ))− |A (Y )| e ∈ ) j e j A Y with R a A ( Y ) ,j j A Y A ( Y basis vector with one in the j th row. For testing the hypothesis of interest H ⊤ ∗ 0 ,j j A( η : µ β = = 0, (3) Y ),j the test statistic ⊤ obs ⊥ η Y Y ) = ( y ) ⊥ is considered, j |{A ( A , P ( Y ) = P ( obs y ) η j η j } 1 where ⊥ P η ( − obs Y ) = ( I n Σ ⊤ ⊤ η j η Σ η j ) ) Y and is an observed response j − ( η y j j vector sampled from (2). The test statistic is thus conditioned on the event that the selection with a random response Y is the same as the observed selection event, and on the projection of a random response Y onto the orthogonal complement of η j being the same as observed. Following Lee et al. [7], p-values are obtained through the pivotal quantity: F Z ⊤ obs obs ⊤ ⊤ ( η Y ) ( ⊥ Y ) = ( ⊥ y ) , P ( Y ) = P η µ,η Σ η j j |{A A ( y) j}, (4) η j η j j where Z F ⊤ ⊤ denotes the cumulative distribution function of the truncated η µ,η Σ η j j j Gaussian distribution truncated to ⊤ ⊤ Z , with mean η µ j and variance η η j Σj . The p-values in (4) are uniformly distributed under the null hypothesis, 23rd European Young Statisticians Meeting 109 since the probability integral transform is used. This entails that the test obtains type I error rate control at a specified significance level α given the selection event. For example, for a two-sided test for (3) we calculate pj = 2 min Z ⊤ Z ⊤ { F 0 ⊤ ( η Y ) , 1 ,η Σ j − F ⊤ ( ηY . The remaining task is computing η j 0 ,η Σ j ) } η j j j the truncation Z , which is not straightforward for complicated non-polyhedral selection events like the group lasso. Liu et al. [8] suggest to reduce the problem to a one-dimensional line search by parametrizing the response vector by a parameter z ∈ R: Y ( ⊤ 1 ⊤ z ) = ⊥ P ( ⊤ Y ) + Σ ⊤ η ( η Σ − η z, z = η Y µ, η η j j j j ) · ( η Ση j ∼ N jj ). 
(5) j The response is thus simply orthogonally decomposed in terms of ηj, in which the test statistic ⊤ η Y is replaced by z. This simplifies the computation of j obs the truncation in (4) to finding the values of z for which A(Y (z)) = A(y ). Therefore, we solve the objective function in (1) with the parametrized response in (5) for a grid of z-values. A proper grid can be determined based on the fact that the test statistic is known to follow a Gaussian distribution. The pivotal quantity thus becomes simplified to Z F ⊤ ⊤ (Z ) for a random variable η µ,η Σ η j j j Z |{Z ∈ Z}. While Duy and Takeuchi [2] use PP to carry out these steps for the lasso, the group lasso has nonlinear optimality conditions. In [12] we propose an approximate PP algorithm that computes the conditioning on selection events for a grid of ⊤ z ∈ [ − 20 ⊤ η η , η j Σ j 20 η ϵ j Σ j ] with a small step size (see [4] and [10] for approximation guarantees of parametrized programming problems). 4 Analyzing NHANES data In Figure 1, three different 95% confidence intervals for the selected coefficients from group lasso are plotted: the proposed method, a 50 − 50% data split and inference using the full dataset yet ignoring the selection (“Naive”). First, the effects of the category “undergraduate” of educational level (with “graduate” as reference) and smoking, lose their statistical significance on a 95% level when conditioning on the model selection, compared to naive inference. Second, for the category “fair” of subjective health of the diet (with “poor” as reference) and borderline diabetes (with no diabetes as reference) intervals from a data split did not detect a significant effect, whereas SI did. Due to the smaller sample size, the intervals after data splitting are in most cases wider than the intervals from SI, and thus less powerful. 110 23rd European Young Statisticians Meeting 5 Discussion The approximate parametric programming algorithm provides an attractive method for situations where polyhedral results do not apply. The advantages of this method are its computational efficiency and its increased power because it conditions on the selected model only. Extensions to selective inference for other selection methods are worth a further study. Acknowledgements: G.Claeskens acknowledges the support of the Research Foundation Flanders and KU Leuven Research Fund C1-project C16/20/002. References [1] Centers for Disease Control and Prevention (CDC) - National Center for Health Statistics (NCHS). National Health and Nutrition Examination Survey (NHANES), 2021. https://wwwn.cdc.gov/nchs/nhanes/. [2] V. N. L. Duy and I. Takeuchi. More powerful conditional selective inference for generalized lasso by parametric programming. Journal of Machine Learning Research, 23(300):1–37, 2022. [3] W. Fithian, D. Sun, and J. Taylor. Optimal inference after model selection. https://doi.org/10.48550/arXiv.1410.2597 , 2014. [4] J. Giesen, M. Jaggi, and S. Laue. Approximating parameterized convex optimization problems. ACM transactions on algorithms, 9(1):1–17, 2012. [5] J. Hu, J. Huang, and F. Qiu. A group adaptive elastic-net approach for variable selection in high-dimensional linear regression. Science China. Mathematics, 61(1):173–188, 2018. [6] Y. Huang, S. Pirenne, S. Panigrahi and G. Claeskens. Selective in- ference using randomized group lasso estimators for general models. https://arxiv.org/abs/2306.13829, 2023. [7] J. D. Lee, D. L. Sun, Y. Sun, and J. E. Taylor. Exact post-selection inference, with application to the lasso. 
The Annals of Statistics, 44(3):907–927, 2016.

[8] K. Liu, J. Markovic, and R. Tibshirani. More powerful post-selection inference, with application to the lasso. https://doi.org/10.48550/arXiv.1801.09037, 2018.

[9] J. R. Loftus and J. E. Taylor. Selective inference in regression models with groups of variables. https://doi.org/10.48550/arXiv.1511.01478, 2015.

[10] E. Ndiaye, T. Le, O. Fercoq, J. Salmon, and I. Takeuchi. Safe grid search with optimal complexity. Proceedings of the 36th International Conference on Machine Learning, 97, 2019.

[11] S. Panigrahi, P. MacDonald, and D. Kessler. Approximate post-selective inference for regression with the group lasso. Journal of Machine Learning Research, 24(79):1–49, 2023.

[12] S. Pirenne and G. Claeskens. Parametric programming-based approximate selective inference for adaptive lasso, adaptive elastic net and group lasso. Journal of Statistical Computation and Simulation, 94(11):2412–2435, 2024.

[13] M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B, Statistical Methodology, 68(1):49–67, 2006.

Some Results on Random Mixed SAT Problems

Andreas Basse-O'Connor 1, Tobias Overgaard*1, and Mette Skjøtt 1

1 Department of Mathematics, Aarhus University

Abstract: In this short paper we present a survey of some results concerning random SAT problems. To elaborate, the Boolean Satisfiability (SAT) Problem refers to the problem of determining whether a given set of m Boolean constraints over n variables can be simultaneously satisfied, i.e. all evaluate to 1 under some interpretation of the variables in {0, 1}. If we choose the m constraints i.i.d. uniformly at random among the set of disjunctive clauses of length k, then the problem is known as the random k-SAT problem. It is conjectured that this problem undergoes a structural phase transition: taking m = αn for α > 0, it is believed that the probability of there existing a satisfying assignment tends in the large n limit to 1 if α < α_c(k), and to 0 if α > α_c(k), for some critical value α_c(k) depending on k. We review some of the progress made towards proving this and consider similar conjectures and results for the more general case where the clauses are chosen with varying lengths, i.e. for the so-called random mixed SAT problems.

Keywords: Satisfiability, random formulas, phase transition.

AMS subject classification: Primary: 60K35; secondary: 82B44, 68R99.

1 Introduction

The present paper presents a survey of conjectures and results on the phase transitions for the random SAT problems with a focus on mixed formulas. The Boolean Satisfiability (SAT) Problem lies at the heart of the famous P vs. NP Millennium Prize Problem.
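To make the random k-SAT model described in the abstract concrete, here is a small toy sketch of ours (not an algorithm from the paper): it draws m = round(αn) clauses of length k uniformly at random and tests satisfiability by brute force, which is feasible only for very small n.

```python
import itertools
import random

def random_k_sat(n, k, alpha, rng):
    """Draw m = round(alpha*n) clauses; a clause is k distinct variables with random signs."""
    m = round(alpha * n)
    return [
        [(v, rng.random() < 0.5) for v in rng.sample(range(n), k)]
        for _ in range(m)
    ]

def satisfiable(n, clauses):
    """Brute-force check: does some assignment in {0,1}^n satisfy every (disjunctive) clause?"""
    for assignment in itertools.product([False, True], repeat=n):
        if all(any(assignment[v] == sign for v, sign in clause) for clause in clauses):
            return True
    return False

rng = random.Random(0)
n, k = 10, 3
for alpha in (2.0, 6.0):  # the conjectured 3-SAT threshold is believed to be roughly 4.27
    hits = sum(satisfiable(n, random_k_sat(n, k, alpha, rng)) for _ in range(10))
    print(alpha, hits / 10)  # empirical probability of satisfiability below vs. above the threshold
```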
Indeed, the existence of a polynomial

*Corresponding author: tlo@math.au.dk

Fast Adaptation with Bradley-Terry Preference Models in Text-To-Image Classification and Generation

Víctor Gallego ∗1

1 Komorebi AI Technologies

∗ Corresponding author: victor.gallego@komorebi.ai

Abstract: Recently, large multimodal models, such as CLIP [5] and Stable Diffusion [6], have experienced tremendous success in both foundations and applications. However, as these models increase in parameter size and computational requirements, it becomes more challenging for users to personalize them for specific tasks or preferences. In this work, we address the problem of adapting the previous models towards sets of particular human preferences, aligning the retrieved or generated images with the preferences of the user. We leverage the Bradley-Terry preference model [1] to develop a fast adaptation method that efficiently fine-tunes the original model, with few examples and with minimal computing resources. Extensive evidence of the capabilities of this framework is provided through experiments in different domains related to multimodal text and image understanding, including preference prediction as a reward model, and generation tasks.

Keywords: preference models, multimodal, text-to-image, image generation

AMS subject classification: 62H30, 62H35

1 Background

Contrastive models. Given a textual prompt, such as a sentence in English t, and an image I, both data points can be processed by a contrastive model such as CLIP [5] to compute a latent representation of the inputs. That is, the text t is fed through an encoder model (typically a transformer-based neural model), arriving at a text embedding x = f_θ1(t), and the same for the image I, arriving at an image embedding y = f_θ2(I). Both embeddings lie in a shared high-dimensional space, x, y ∈ R^d, with d = 768 or 1024, depending on model size. In this shared space, we can compute the similarity s(x, y) between the text and the image, using the dot product (assume the embeddings are ℓ2-normalized, and x, y are column vectors):

s(x, y) = x^⊤ y.   (1)

Using this metric, we can do several tasks, such as retrieving the closest image to a given textual prompt, or classifying images based on a set of texts, etc. Large-scale models such as CLIP have been massively pretrained over vast amounts of (text, image) pairs crawled from the internet. As users interact with them, it is of great interest to develop techniques to further fine-tune these models, personalizing them towards their user preferences, using as little data as possible. Thus, there is growing interest in creating datasets of human preferences, such as SAC [4] or DiffusionDB [7], and models further adapted to human preference, such as PickScore [3].

Bradley-Terry models. Assume a pair of items i, j sampled from a population; the Bradley-Terry (BT) model [1] estimates the probability that the pairwise comparison i ≻ j, which indicates a strong preference of i over j, is true as:

P(i ≻ j) = exp s_i / (exp s_i + exp s_j),   (2)

with s_i being a latent variable representing the strength of item i.
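As a small illustration of how (1) and (2) fit together, the sketch below computes the text–image similarities and the Bradley–Terry probability of preferring the first image; the embeddings here are random stand-ins rather than outputs of an actual CLIP encoder, and all names are our own.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 768  # embedding dimension of a CLIP-L-sized model

def normalize(v):
    return v / np.linalg.norm(v)

# Random stand-ins for f_theta1(t), f_theta2(I_1), f_theta2(I_2)
x = normalize(rng.normal(size=d))    # text embedding
y1 = normalize(rng.normal(size=d))   # image embedding 1
y2 = normalize(rng.normal(size=d))   # image embedding 2

def similarity(x, y):
    return float(x @ y)              # s(x, y) = x^T y, as in (1)

def bt_probability(s1, s2):
    """Bradley-Terry probability of preferring item 1, as in (2), with similarities as strengths."""
    return float(np.exp(s1) / (np.exp(s1) + np.exp(s2)))

p1 = bt_probability(similarity(x, y1), similarity(x, y2))
loss = -np.log(p1)                   # negative log-likelihood if y1 is the preferred image
```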
2 The CLIP-BT framework We now introduce our main contribution: using the BT model to fine-tune a contrastive model like CLIP in an efficient way. Given a textual embedding x and two image embeddings y1 and y2 , the probability ˆ p 1 of choosing image y1 versus y2 can be computed using the Bradley-Terry model with the similarity scores as the strength variables, ˆ exp s(x,y1) p 1 . = Assuming that we exp s(x,y1)+exp s(x,y2 ) do not allow ties; that is, for any pair of images, we can sort the indices to have y1 , we can define the loss function as the negative log-likelihood: ≻ y 2 L log ˆ pref ( x, y , y p . 1 2 ) = (3) − 1 23rd European Young Statisticians Meeting 121 The case of having pairs preference data, N N of { y i 1 ≻ y i 2 } can be straight-i =1 forwardly extended by simply averaging the previous loss over each pair. Using standard automatic differentiation l ibraries, w e c an c ompute the gradients of the previous loss with respect to the encoder parameters, θ 1 , θ2, to optimize the model towards the preferences of the user. However, for many applications we can instead take the gradient with respect to the textual prompt x, involving much less computational burden. Thanks to the linearity of (1), after a few simplifications we arrive at ∇xLpref (x, y1, y2) = (ˆ p 1 − 1)(y1 − y2). (4) That is, given a textual query x, and a pair of user preferences y1 ≻ y2, we compute a refined prompt ′ x with just one iteration of gradient descent, i.e. x′ = x − ϵ∇xLpref (x, y1, y2), with ϵ > 0 being a learning rate. Intuitively, this update rule steers the embedding x towards the preferred image embedding y1 , and far from the negative one, y 2. Note that with this approach, we do not need to modify the underlying encoder models, thus this is a fast preference adaptation method, that just requires modifying the embedding d x ∈ R. Also notice that the previous gradient only involves subtracting the two image embeddings, and computing the probability ˆ p 1, so it can be computed efficiently, compared to fine-tuning the parameters of the encoder models. These are the main benefits of our approach. What to do when we do not have pairwise preference data? If we do not have a set {yi1 ≻ yi2} of both positive and negative samples, but rather loss. But we can still adapt the embedding x towards that set, by maximizing the similarity (1) with one iteration of gradient descent: just a set of highly-preferred images {yi1 }, we cannot use the previous BT x′ 1 = x + ϵ y . N i1 This is a baseline that we will use through the experiments. 3 Experiments 3.1 Classification task as a preference model DifussionDB is a recent large-scale dataset consisting in triplets (x, y1 , y2), in which users voted whether y1 ≻ y2 or the opposite. We randomly split 2000 samples as an evaluation set, and experiment with different training sets of sizes from 1 to 50 preference pairs. As the textual query, we take the embedding x corresponding to the text t = An amazing and high-quality image, with 122 23rd European Young Statisticians Meeting the objective of predicting, for each pair, the preferred image in terms of quality. We experiment with CLIP models of different sizes: CLIP-L (430M parameters), CLIP-H (1B parameters), and CLIP-G (2.5B parameters). Results for training set of size N = 50 are shown in Figure 1 left. For each pair of testing images, we compute the preferred image by the model; and as the accuracy metric, we consider a match if that predicted preferred image matches the preferred one in the ground truth. 
Accuracy std’s are computed over 10 different random training sets. For each model size, we report accuracies with the original model (original), a fast adaptation based on just using the positive samples of each training pair (positive), and our fast adaptation gradient from (4) (BT). We also include the results of a recent state-of-the-art model, PickScore [3], which was fine-tuned on a different preference dataset. Next, in Fig. 1 right we show the mean accuracies while varying the size of the training set from 0 (no adaptation) to 50. Config. Accuracy CLIP-L-original 0.548 CLIP-L-positive 0.695 ± 0.003 CLIP-L-BT 0.754 ± 0.007 CLIP-H-original 0.644 CLIP-H-positive 0.742 ± 0.007 CLIP-H-BT 0.788 ± 0.003 CLIP-G-original 0.624 CLIP-G-positive 0.731 ± 0.007 CLIP-G-BT 0.772 ± 0.004 PickScore [3] 0.738 Figure 1: Preference classification results The previous results suggest that just by using a small set of preference data, we can achieve superior results that the fully fine-tuned model PickScore, which is a CLIP model trained over a similar dataset of highly-scored images. Also, the BT loss clearly improves on our baseline that only uses positive samples, which is expected as we are giving it more learning signal thanks to the negative samples. 3.2 Text-to-image generation Stable Diffusion [6] is a state-of-the-art model for text-to-image generation. Internally, it uses the CLIP model to compute a latent representation from a given text prompt, which is later used by a diffusion process to generate an 23rd European Young Statisticians Meeting 123 image y. We are interested in performing an adaptation of the model towards specific aesthetics preferred by the user, specified as a set of image preferences. To obtain such a set of image preferences, the SAC dataset contains images with human rankings [4]. A recent work of us already tackled the problem of steering the model towards a set of images, but without negative preferences, just using images with high scores [2]. Here, we also use images with low scores, and form a preference dataset by sampling pairs from the previous two groups, N , with { y i 1 ≻ y i 2 } i =1 {yi1 being the set of highly-scored images, and } { the other one, with lowest scores. We use the loss from y i 2 the also function } (3), and adapt the CLIP model for between T = 5 and T = 7 optimization steps of gradient optimization. The benefit is that we do not need to retrain the full generative model, as the optimization is done at inference time, only adding a small computing time overhead. For a fixed set of 20 textual prompts, we generate several image samples, with the following three variants to the underlying CLIP model: CLIP-L-original, CLIP-L-positive (adapted only with the positive set of images [2]); and CLIP-L-BT (adapted using both positive and negative samples, and loss from (3)). We evaluate the quality of each variant with a human evaluation using two annotators. For each prompt, we generate three images using the three previous variants, and each annotator voted their most preferred generation. Results are shown in Fig. 2 left (Win rate), confirming the superiority of the optimization via the BT loss. Lastly, Figure 2 right shows a random pair of samples for each variant. Textual prompt Generated samples Config Win rate A pirate ship, sepia coloring, CLIP-L-original 0.267 hyper-detailed, dusk, CLIP-L-positive 4k octane render 0.273 CLIP-L-BT 0.460 A still life of flowers, volumetric lighting Figure 2: Results for the text-to-image generation experiment. 
Left: human evaluation results. Right: Stable Diffusion generations. For each prompt, each generation is sampled using CLIP-L-original, CLIP-L-positive, and CLIP-L-BT, respectively.

4 Conclusions and further work

We have proposed a fast adaptation, gradient-based method, which leverages the linearity in the CLIP scoring function and the BT model. Since we do not need to store modified weights for the CLIP model, the technique introduced here is scalable to regimes of many different preference profiles. The applicability of the technique was demonstrated with experiments in preference classification and improving text-to-image generation towards user preferences. An interesting line of work would be to adapt more general preference models, such as the Plackett-Luce variant, to account for more than two preference levels, e.g., y1 ≻ y2 ≻ y3.

Acknowledgements: The author acknowledges support of the Torres-Quevedo postdoctoral grant PTQ2021-011758.

References

[1] R. A. Bradley and M. E. Terry. Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika, 39(3/4):324–345, 1952.

[2] V. Gallego. Personalizing text-to-image generation via aesthetic gradients. arXiv preprint arXiv:2209.12330, 2022.

[3] Y. Kirstain, A. Polyak, U. Singer, S. Matiana, J. Penna, and O. Levy. Pick-a-pic: An open dataset of user preferences for text-to-image generation. arXiv preprint arXiv:2305.01569, 2023.

[4] J. D. Pressman, K. Crowson, and S. C. Contributors. Simulacra aesthetic captions. Technical Report Version 1.0, Stability AI, 2022. URL https://github.com/JD-P/simulacra-aesthetic-captions.

[5] A. Radford et al. Learning transferable visual models from natural language supervision. In Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 8748–8763. PMLR, 18–24 Jul 2021.

[6] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer. High-resolution image synthesis with latent diffusion models, 2022.

[7] Z. J. Wang, E. Montoya, D. Munechika, H. Yang, B. Hoover, and D. H. Chau. DiffusionDB: A large-scale prompt gallery dataset for text-to-image generative models. arXiv:2210.14896 [cs], 2022.

Author index

Artemiou Andreas, 63
Avetisian Diana, 33
Basse-O'Connor Andreas, 111
Benko Matej, 87
Calle-Arroyo Carlos de la, 27
Caponera Alessia, 3
Carden Ruth F, 15
Claeskens Gerda, 105
Csáji Balázs Csanád, 21
Davis Benjamin M, 51
Gallego Victor, 117
Gouveia Sónia, 9
Henderson Jonathan, 51
Horváth Bálint, 21
Hubin Aliaksandr, 57
Hurley Aoife K, 15
Karagrigoriou Alex, 63
Lachmann Jon, 57
Łazęcka Małgorzata, 75
Lettrichová Erika, 39
Lopez-Fidalgo Jesús, 27
Martins Ana, 9
Mitchell Hannah J, 51
Ntotsis Kimon, 63
Olsen Lars Henry Berge, 69
Overgaard Tobias L., 111
Pawlas Zbyněk, 81
Petráková Martina, 81
Pirenne Sarah, 105
Ralchenko Kostiantyn, 33, 93
Rodríguez-Aragón Licesio J., 27
Skjøtt Mette, 111
Scotto Manuel G., 9
Sfetcu Razvan-Cornel, 99
Sweeney James, 15
Weiß Christian H., 9
Yakovliev Mykyta, 93
Žák Libor, 87