Metodološki zvezki, Vol. 17, No. 1, 2020, 1–17

Estimating Bayes factors from minimal summary statistics in repeated measures analysis of variance designs

Thomas J. Faulkenberry 1

Abstract

In this paper, I develop a formula for estimating Bayes factors directly from minimal summary statistics produced in repeated measures analysis of variance designs. The formula, which requires knowing only the F-statistic, the number of subjects, and the number of repeated measurements per subject, is based on the BIC approximation of the Bayes factor, a common default method for Bayesian computation with linear models. In addition to providing computational examples, I report a simulation study in which I demonstrate that the formula compares favorably to a recently developed, more complex method that accounts for correlation between repeated measurements. The minimal BIC method provides a simple way for researchers to estimate Bayes factors from a minimal set of summary statistics, giving users a powerful index for estimating the evidential value of not only their own data, but also the data reported in published studies.

1 Introduction

In this paper, I discuss how to apply the BIC approximation (Kass and Raftery, 1995; Wagenmakers, 2007; Masson, 2011; Nathoo and Masson, 2016) to compute Bayes factors for repeated measures experiments using only minimal summary statistics from the analysis of variance (e.g., Ly et al., 2018; Faulkenberry, 2018). Critically, I develop a formula (Equation 3.1) that works for repeated measures experiments. Further, I investigate its performance against a method of Nathoo and Masson (2016) which accounts for varying levels of correlation between repeated measurements. Among several "default prior" solutions to computing Bayes factors for common experimental designs (Rouder et al., 2009, 2012), each of which requires raw data for computation, the proposed formula stands out for providing the user with a simple expression for the Bayes factor that can be computed even when only the summary statistics are known. Thus, equipped with only a hand calculator, one can immediately estimate a Bayes factor for many results reported in published papers (even null effects), providing a meta-analytic tool that can be quite useful when trying to establish the evidential value of a collection of published results.

1 Department of Psychological Sciences, Tarleton State University, Stephenville, TX, USA; faulkenberry@tarleton.edu

2 Background

To begin, let us consider the elementary case of a one-factor independent groups design. Consider a set of data y_ij, on which we impose the linear model

\[
y_{ij} = \mu + \alpha_j + \varepsilon_{ij}; \qquad i = 1,\dots,n; \; j = 1,\dots,k,
\]

where μ represents the grand mean, α_j represents the treatment effect associated with group j, and ε_ij ∼ N(0, σ_ε²). In all, we have N = nk independent observations. To proceed with hypothesis testing, we define two competing models:

\[
H_0: \alpha_j = 0 \text{ for } j = 1,\dots,k
\]
\[
H_1: \alpha_j \neq 0 \text{ for some } j
\]

Classically, model selection is performed using the analysis of variance (ANOVA), introduced in the 1920s by Sir Ronald Fisher (Fisher, 1925). Roughly, ANOVA works by partitioning the total variance in the data y into two sources – the variance between the treatment groups, and the residual variance that is left over after accounting for this treatment variability. Then, one calculates an F statistic, defined as the ratio of the between-groups variance to the residual variance. Inference is then performed by quantifying the likelihood of the observed data y under the null hypothesis H_0. Specifically, this is done by computing the probability of obtaining the observed F statistic (or greater) under H_0. If this probability, called the p-value, is small, this indicates that the data y are rare under H_0, so the researcher may reject H_0 in favor of the alternative hypothesis H_1.
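For readers who want to see this step numerically, the classical p-value is simply an upper tail probability of the F distribution. The short R sketch below illustrates the calculation; the particular value of F and the degrees of freedom are hypothetical, chosen only for illustration.

```r
# Classical inference from an ANOVA F statistic: the p-value is the
# probability of observing an F at least this large when H0 is true.
# All values here are hypothetical.
F_obs <- 3.10   # observed F statistic
df1   <- 2      # numerator (treatment) degrees of freedom
df2   <- 57     # denominator (residual) degrees of freedom

p_value <- pf(F_obs, df1, df2, lower.tail = FALSE)
p_value   # approximately 0.053
```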
Though this is a classic procedure, it comes with some well-known problems. First, the p-value is not equivalent to the posterior probability p(H_0 | y). Despite this distinction, many researchers incorrectly believe that a p-value directly indexes the probability that H_0 is true (Gigerenzer, 2004), and thus take a small p-value to represent evidence for H_1. However, Berger and Sellke (1987) demonstrated that p-values systematically overestimate this evidence. For example, with a t-test performed on a sample of size 100, a p-value of 0.05 transforms to p(H_0 | y) = 0.52 – rather than reflecting evidence for H_1, this small p-value reflects data that slightly prefer H_0. Second, the "evidence" provided for H_1 via the p-value is only indirect, as the p-value only measures the predictive adequacy of H_0; the p-value procedure makes no such measurement of predictive adequacy for H_1.

For these reasons, I will consider a Bayesian approach to the problem of model selection. The approach I will describe in this paper is to compute the Bayes factor (Kass and Raftery, 1995), denoted BF_01, for H_0 over H_1. In general, the Bayes factor is defined as the ratio of marginal likelihoods for H_0 and H_1, respectively. That is,

\[
BF_{01} = \frac{p(y \mid H_0)}{p(y \mid H_1)}. \qquad (2.1)
\]

This ratio is immediately useful in two ways. First, it indexes the relative likelihood of observing data y under H_0 compared to H_1, so BF_01 > 1 is taken as evidence for H_0 over H_1. Similarly, BF_01 < 1 is taken as evidence for H_1. Second, the Bayes factor indicates the extent to which the prior odds for H_0 over H_1 are updated after observing data. Said differently, the ratio of posterior probabilities for H_0 and H_1 can be found by multiplying the ratio of prior probabilities by BF_01 (a fact which follows easily from Bayes' theorem):

\[
\frac{p(H_0 \mid y)}{p(H_1 \mid y)} = BF_{01} \cdot \frac{p(H_0)}{p(H_1)}. \qquad (2.2)
\]

One interesting consequence of Equation 2.2 is that we can use the Bayes factor to compute the posterior probability of H_0 as a function of the prior model probabilities. To see this, consider the following. If we solve Equation 2.2 for the posterior probability p(H_0 | y) and then use Bayes' theorem, we see

\[
\begin{aligned}
p(H_0 \mid y) &= BF_{01} \cdot \frac{p(H_0)}{p(H_1)} \cdot p(H_1 \mid y) \\
              &= \frac{BF_{01} \cdot p(H_0) \cdot p(y \mid H_1) \cdot p(H_1)}{p(H_1) \cdot p(y)} \\
              &= \frac{BF_{01} \cdot p(H_0) \cdot p(y \mid H_1)}{p(y \mid H_0) \cdot p(H_0) + p(y \mid H_1) \cdot p(H_1)}.
\end{aligned}
\]

Dividing both numerator and denominator by the marginal likelihood p(y | H_1) gives us

\[
p(H_0 \mid y) = \frac{BF_{01} \cdot p(H_0)}{BF_{01} \cdot p(H_0) + p(H_1)}.
\]

By Equation 2.1, we have BF_10 = 1/BF_01. It can then be shown similarly that

\[
p(H_1 \mid y) = \frac{BF_{10} \cdot p(H_1)}{BF_{10} \cdot p(H_1) + p(H_0)}.
\]

In practice, researchers often assume both models are a priori equally likely, and thus set both p(H_0) = p(H_1) = 0.5. In this case, we obtain the simplified forms

\[
p(H_0 \mid y) = \frac{BF_{01}}{BF_{01} + 1}, \qquad p(H_1 \mid y) = \frac{BF_{10}}{BF_{10} + 1}. \qquad (2.3)
\]

Though there are many simple quantities that can be derived from the Bayes factor, the actual computation of BF_01 can be quite difficult, as the marginal likelihoods in Equation 2.1 each require integrating over a prior distribution of model parameters. This often results in integrals that do not admit closed form solutions, requiring approximate techniques to estimate the Bayes factor. In Faulkenberry (2018), it was shown that for an independent groups design, one can use the F-ratio and degrees of freedom from an analysis of variance to compute an approximation of BF_01 that is based on a unit information prior (Wagenmakers, 2007; Masson, 2011). Specifically,

\[
BF_{01} \approx \sqrt{N^{df_1} \left( 1 + \frac{F \, df_1}{df_2} \right)^{-N}}, \qquad (2.4)
\]

where F(df_1, df_2) is the F-ratio from a standard analysis of variance applied to these data.
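Equation 2.4 is simple enough to evaluate with a hand calculator, but for convenience it can also be wrapped in a few lines of R. The sketch below is mine (the function name bf01_between is not part of any package); the call shown anticipates, and can be checked against, the worked example presented next.

```r
# Minimal BIC Bayes factor for a one-factor independent-groups design
# (Equation 2.4). Inputs: the ANOVA F statistic, its degrees of freedom,
# and the total number of independent observations N.
bf01_between <- function(Fstat, df1, df2, N) {
  sqrt(N^df1 * (1 + Fstat * df1 / df2)^(-N))
}

bf01_between(Fstat = 2.76, df1 = 3, df2 = 96, N = 100)
# approximately 15.98
```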
As an example, consider a hypothetical dataset containing k = 4 groups of n = 25 observations each (for a total of N = 100 independent observations). Suppose that an ANOVA produces F(3,96) = 2.76, p = 0.046. This result would be considered "statistically significant" by conventional null hypothesis standards, and traditional practice would dictate that we reject H_0 in favor of H_1. But is this result really evidential for H_1? Applying Equation 2.4 shows:

\[
BF_{01} \approx \sqrt{N^{df_1} \left( 1 + \frac{F \, df_1}{df_2} \right)^{-N}}
        = \sqrt{100^{3} \left( 1 + \frac{2.76 \cdot 3}{96} \right)^{-100}}
        = 15.98.
\]

This result indicates quite the opposite: by definition of the Bayes factor, this implies that the observed data are almost 16 times more likely under H_0 than H_1. Note that the appearance of such contradictory conclusions from two different testing frameworks is actually a classic result known as Lindley's paradox (Lindley, 1957).

3 The BIC approximation for repeated measures

Against this background, the goal now is to extend Equation 2.4 to the case where we have an experimental design with repeated measurements. For context, consider an experiment where k measurements are taken from each of n experimental subjects. We then have a total of N = nk observations, but they are no longer independent measurements. Assume a linear mixed model structure on the observations:

\[
y_{ij} = \mu + \alpha_j + \pi_i + \varepsilon_{ij}; \qquad i = 1,\dots,n; \; j = 1,\dots,k,
\]

where μ represents the grand mean, α_j represents the treatment effect associated with group j, π_i represents the effect of subject i, and ε_ij ∼ N(0, σ_ε²). Due to the correlated structure of these data, we have n(k−1) independent observations. We will define models H_0 and H_1 as above. Also, we will denote the sums of squares terms in the model in the usual way, where

\[
SSA = n \sum_{j=1}^{k} (\bar{y}_{\cdot j} - \bar{y}_{\cdot\cdot})^2, \qquad
SSB = k \sum_{i=1}^{n} (\bar{y}_{i \cdot} - \bar{y}_{\cdot\cdot})^2
\]

represent the sums of squares corresponding to the treatment effect and the subject effect, respectively,

\[
SST = \sum_{i=1}^{n} \sum_{j=1}^{k} (y_{ij} - \bar{y}_{\cdot\cdot})^2
\]

represents the total sum of squares, and

\[
SSR = SST - SSA - SSB
\]

represents the residual sum of squares left over after accounting for both treatment and subject effects. From here, we can compute the F-statistic for the treatment effect in our design as

\[
F = \frac{SSA}{SSR} \cdot \frac{df_{\text{residual}}}{df_{\text{treatment}}}
  = \frac{SSA}{SSR} \cdot \frac{(n-1)(k-1)}{k-1}
  = \frac{SSA}{SSR} \cdot (n-1).
\]

We will now show that this F statistic can be used to estimate BF_01. To this end, note the following. Prior work of Wagenmakers (2007) has shown that BF_01 can be approximated as

\[
BF_{01} \approx \exp(\Delta BIC_{10} / 2),
\]

where

\[
\Delta BIC_{10} = N \ln\!\left( \frac{SSE_1}{SSE_0} \right) + (\kappa_1 - \kappa_0)\ln(N).
\]

Here, N is equal to the number of independent observations; as noted above, this is equal to n(k−1) for our repeated measures design. SSE_1 represents the variability left unexplained by H_1; for our design, this is equal to the residual sum of squares, SSR. SSE_0 represents the variability left unexplained by H_0; for our design, this is equal to the sum of the treatment sum of squares and the residual sum of squares, SSA + SSR. Finally, κ_1 − κ_0 is equal to the difference in the number of parameters between H_1 and H_0; this is equal to k − 1 for our design.

We are now ready to derive a formula for BF_01. First, we will re-express ΔBIC_10 in terms of F:

\[
\begin{aligned}
\Delta BIC_{10} &= N \ln\!\left( \frac{SSE_1}{SSE_0} \right) + (\kappa_1 - \kappa_0)\ln(N) \\
  &= n(k-1)\ln\!\left( \frac{SSR}{SSR + SSA} \right) + (k-1)\ln\!\big( n(k-1) \big) \\
  &= n(k-1)\ln\!\left( \frac{1}{1 + \frac{SSA}{SSR}} \right) + (k-1)\ln\!\big( n(k-1) \big) \\
  &= n(k-1)\ln\!\left( \frac{n-1}{n-1 + \frac{SSA}{SSR}\cdot(n-1)} \right) + (k-1)\ln\!\big( n(k-1) \big) \\
  &= n(k-1)\ln\!\left( \frac{n-1}{n-1+F} \right) + (k-1)\ln\!\big( n(k-1) \big).
\end{aligned}
\]

Thus, we can write

\[
\begin{aligned}
BF_{01} &\approx \exp(\Delta BIC_{10} / 2) \\
  &= \exp\!\left[ \frac{n(k-1)}{2}\ln\!\left( \frac{n-1}{n-1+F} \right) + \frac{k-1}{2}\ln\!\big( n(k-1) \big) \right] \\
  &= \left( \frac{n-1}{n-1+F} \right)^{\frac{n(k-1)}{2}} \cdot \big( n(k-1) \big)^{\frac{k-1}{2}} \\
  &= \sqrt{ \big( n(k-1) \big)^{k-1} \cdot \left( \frac{n-1}{n-1+F} \right)^{n(k-1)} } \\
  &= \sqrt{ (nk-n)^{k-1} \cdot \left( \frac{n-1}{n-1+F} \right)^{nk-n} }.
\end{aligned}
\]

If we invert the term containing F and divide n − 1 into the resulting numerator, we get the following formula:

\[
BF_{01} \approx \sqrt{ (nk-n)^{k-1} \cdot \left( 1 + \frac{F}{n-1} \right)^{n-nk} }, \qquad (3.1)
\]

where n equals the number of subjects and k equals the number of repeated measurements per subject.
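As with Equation 2.4, Equation 3.1 reduces to a one-line computation. The R sketch below (the function names bf01_rm and prob_h0 are mine) returns the Bayes factor from F, n, and k, and converts it to a posterior probability via Equation 2.3 under equal prior odds; the call shown reproduces the example worked out next.

```r
# Minimal BIC Bayes factor for a one-factor repeated measures design
# (Equation 3.1). Inputs: the F statistic for the treatment effect,
# the number of subjects n, and the number of conditions k.
bf01_rm <- function(Fstat, n, k) {
  sqrt((n * k - n)^(k - 1) * (1 + Fstat / (n - 1))^(n - n * k))
}

# Posterior probability of H0 under equal prior odds (Equation 2.3)
prob_h0 <- function(bf01) bf01 / (bf01 + 1)

bf <- bf01_rm(Fstat = 1.336, n = 23, k = 2)
bf            # approximately 2.435
prob_h0(bf)   # approximately 0.709
```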
I will now give an example of using Equation 3.1 to compute a Bayes factor. The example below is based on data from Faulkenberry et al. (2018). In this experiment, subjects were presented with pairs of single digit numerals and asked to choose the numeral that was presented in the larger font size. For each of n = 23 subjects, response times were recorded in k = 2 conditions – congruent trials and incongruent trials. Congruent trials were defined as those in which the physically larger digit was also the numerically larger digit (e.g., a physically large 8 paired with a physically small 2). Incongruent trials were defined such that the physically larger digit was numerically smaller (e.g., a physically large 2 paired with a physically small 8). Faulkenberry et al. (2018) then fit each subject's distribution of response times to a parametric model (a shifted Wald model; see Anders et al., 2016; Faulkenberry, 2017, for details), allowing them to investigate the effects of congruity on shape, scale, and location of the response time distributions. Specifically, they predicted that the leading edge, or shift, of the distributions would not differ between congruent and incongruent trials, thus providing support against an early encoding-based explanation of the observed size-congruity effect (Santens and Verguts, 2011; Faulkenberry et al., 2016; Sobel et al., 2016, 2017). The shift parameter was calculated for both of the k = 2 congruity conditions for each of the n = 23 subjects. The resulting ANOVA summary table is presented in Table 1.

Table 1: ANOVA summary table for shift parameter data of Faulkenberry et al. (2018)

  Source       SS       df   MS     F       p
  Subjects     103984   22   4727
  Treatment    739      1    739    1.336   0.260
  Residual     12176    22   553
  Total        116399   45

Applying the minimal BIC method from Equation 3.1 gives us the following:

\[
\begin{aligned}
BF_{01} &\approx \sqrt{ (nk-n)^{k-1} \cdot \left( 1 + \frac{F}{n-1} \right)^{n-nk} } \\
  &= \sqrt{ (23 \cdot 2 - 23)^{2-1} \left( 1 + \frac{1.336}{23-1} \right)^{23 - 23 \cdot 2} } \\
  &= \sqrt{ 23^{1} \left( 1 + \frac{1.336}{22} \right)^{-23} } \\
  &= 2.435.
\end{aligned}
\]

This Bayes factor tells us that the observed data are approximately 2.4 times more likely under H_0 than H_1. Assuming equal prior model odds, we use Equation 2.3 to convert the Bayes factor to a posterior model probability, giving positive evidence for H_0:

\[
p(H_0 \mid y) = \frac{BF_{01}}{BF_{01} + 1} = \frac{2.435}{2.435 + 1} = 0.709.
\]

4 Accounting for correlation between repeated measurements

In a recent paper, Nathoo and Masson (2016) took a slightly different approach to calculating Bayes factors for repeated measures designs, investigating the role of effective sample size in repeated measures designs (Jones, 2011). For single-factor repeated measures designs, effective sample size is defined as

\[
n_{\text{eff}} = \frac{nk}{1 + \rho(k-1)},
\]

where ρ is the intraclass correlation,

\[
\rho = \frac{\sigma_\pi^2}{\sigma_\pi^2 + \sigma_\varepsilon^2}.
\]

Thus, ρ = 0 implies n_eff = nk, whereas ρ = 1 implies n_eff = n. Though ρ is unknown, Nathoo and Masson (2016) developed a method to estimate it from SS values in the ANOVA, leading to the following:

\[
\Delta BIC_{10} = n(k-1)\ln\!\left( \frac{SST - SSA - SSB}{SST - SSB} \right)
  + (k+2)\ln\!\left( \frac{n(SST - SSA)}{SSB} \right)
  - 3\ln\!\left( \frac{n \, SST}{SSB} \right).
\]

This estimate provides a better account of the correlation between repeated measurements, but the benefit comes at the price of added complexity, and it is not clear how to reduce this formula to a simple expression involving only F as we do with Equation 3.1. This leads to the natural question: how well does the minimal BIC method from Equation 3.1 match up with the more complex approach of Nathoo and Masson (2016)?
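Although the Nathoo and Masson (2016) expression cannot be reduced to a function of F alone, it is easy to evaluate when the sums of squares are available. The R sketch below (the function name bf01_nm is mine) implements the expression exactly as written above and can be checked against the calculation that follows.

```r
# BIC Bayes factor for a one-factor repeated measures design using the
# Nathoo and Masson (2016) expression for Delta BIC_10, which requires
# the treatment (SSA), subject (SSB), and total (SST) sums of squares.
bf01_nm <- function(SSA, SSB, SST, n, k) {
  delta_bic <- n * (k - 1) * log((SST - SSA - SSB) / (SST - SSB)) +
    (k + 2) * log(n * (SST - SSA) / SSB) -
    3 * log(n * SST / SSB)
  exp(delta_bic / 2)
}

# Values from Table 1 (Faulkenberry et al., 2018)
bf01_nm(SSA = 739, SSB = 103984, SST = 116399, n = 23, k = 2)
# approximately 2.474
```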
As a first step toward answering this question, let us revisit the example presented above. We can apply the Nathoo and Masson formula to the ANOVA summary in Table 1:

\[
\begin{aligned}
\Delta BIC_{10} &= 23(2-1)\ln\!\left( \frac{116399 - 739 - 103984}{116399 - 103984} \right)
  + (2+2)\ln\!\left( \frac{23(116399 - 739)}{103984} \right)
  - 3\ln\!\left( \frac{23(116399)}{103984} \right) \\
  &= 23\ln(0.9405) + 4\ln(25.583) - 3\ln(25.746) \\
  &= 1.812.
\end{aligned}
\]

This equates to a Bayes factor of

\[
BF_{01} = \exp(\Delta BIC_{10}/2) = \exp(1.812/2) = 2.474
\]

and a posterior model probability of p(H_0 | y) = 2.474/(2.474 + 1) = 0.712. Clearly, these computations are quite similar to the ones we performed with Equation 3.1, with both methods indicating positive evidence for H_0 over H_1.

5 Simulation study

The computations in the previous section reflect two preliminary facts. First, the method of Nathoo and Masson (2016) yields Bayes factors and posterior model probabilities that take into account an estimate of the correlation between repeated measurements. This is a highly principled approach which the minimal BIC method of Equation 3.1 does not take. However, as we can see with both computations, the general conclusion remains the same regardless of whether we use the minimal BIC method or the method of Nathoo and Masson. Given that Equation 3.1 is (1) easy to use and (2) requires only three inputs (the number of subjects n, the number of repeated measurement conditions k, and the F statistic), it is natural to ask whether the minimal BIC method produces results that are sufficient for day-to-day work, with the risk of being conservative outweighed by its simplicity. To answer this question, I conducted a Monte Carlo simulation (the simulation script, written in R, and the resulting simulated datasets can be downloaded from https://git.io/Jfekh) to systematically investigate the relationship between Equation 3.1 and the Nathoo and Masson method across a wide variety of randomly generated datasets.

In this simulation, I randomly generated datasets that reflected the repeated measures designs that we have discussed throughout this paper. Specifically, data were generated from the linear mixed model

\[
y_{ij} = \mu + \alpha_j + \pi_i + \varepsilon_{ij}; \qquad i = 1,\dots,n; \; j = 1,\dots,k,
\]

where μ represents a grand mean, α_j represents a treatment effect, and π_i represents a subject effect. For convenience, I set k = 3, though similar results were obtained with other values of k (not reported here). Also, I assumed π_i ∼ N(0, σ_π²) and ε_ij ∼ N(0, σ_ε²). I then systematically varied three components of the model:

1. The number of subjects n was set to either n = 20, n = 50, or n = 80;

2. The intraclass correlation ρ between treatment conditions was set to be either ρ = 0.2 or ρ = 0.8;

3. The size of the treatment effect was manipulated to be either null, small, or medium. Specifically, these effects were defined as follows. Let μ_j = μ + α_j (i.e., the condition mean for treatment j). Then we define effect size as

\[
\delta = \frac{\max(\mu_j) - \min(\mu_j)}{\sqrt{\sigma_\pi^2 + \sigma_\varepsilon^2}},
\]

and correspondingly, we set δ to one of three values: δ = 0 (null effect), δ = 0.2 (small effect), and δ = 0.5 (medium effect). Also note that since we can write the intraclass correlation as

\[
\rho = \frac{\sigma_\pi^2}{\sigma_\pi^2 + \sigma_\varepsilon^2},
\]

it follows directly that we can alternatively parameterize effect size as

\[
\delta = \frac{\sqrt{\rho}\,\big( \max(\mu_j) - \min(\mu_j) \big)}{\sigma_\pi}.
\]

Using this expression, I was able to set the marginal variance σ_π² + σ_ε² to be constant across the varying values of the simulation parameters.
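To make this design concrete, the R sketch below shows a single simulation iteration under the setup just described. It is a simplified stand-in for, not a copy of, the simulation script linked above: it generates one dataset for a chosen n, k, ρ, and δ, computes the repeated measures F statistic from the sums of squares, and evaluates both Bayes factors using the bf01_rm and bf01_nm functions sketched earlier.

```r
set.seed(1)
n <- 50; k <- 3; rho <- 0.2; delta <- 0.5   # one cell of the simulation design

# Fix the marginal variance sigma_pi^2 + sigma_eps^2 at 1 (and mu at 0),
# so that sigma_pi^2 = rho and the condition means span a range of delta.
sigma_pi  <- sqrt(rho)
sigma_eps <- sqrt(1 - rho)
alpha     <- seq(-delta / 2, delta / 2, length.out = k)   # treatment effects

subj <- rep(1:n, each = k)
cond <- rep(1:k, times = n)
y    <- alpha[cond] + rnorm(n, 0, sigma_pi)[subj] + rnorm(n * k, 0, sigma_eps)

# Sums of squares for the one-factor repeated measures ANOVA
grand <- mean(y)
SSA <- n * sum((tapply(y, cond, mean) - grand)^2)   # treatment
SSB <- k * sum((tapply(y, subj, mean) - grand)^2)   # subjects
SST <- sum((y - grand)^2)                           # total
SSR <- SST - SSA - SSB                              # residual

Fstat <- (SSA / (k - 1)) / (SSR / ((n - 1) * (k - 1)))

bf01_rm(Fstat, n, k)           # minimal BIC method (Equation 3.1)
bf01_nm(SSA, SSB, SST, n, k)   # Nathoo and Masson (2016) method
```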
For each combination of number of observations (n = 20, 50, 80), effect size (δ = 0, 0.2, 0.5), and intraclass correlation (ρ = 0.2, 0.8), I generated 1000 simulated datasets. For each of the datasets, I performed a repeated measures analysis of variance and, using the F statistic and the relevant values of n and k, extracted two Bayes factors for H_0: one based on the minimal BIC method of Equation 3.1 and one based on the method of Nathoo and Masson (2016), which accounts for correlation between repeated measurements. These Bayes factors were then converted to posterior probabilities via Equation 2.3. To compare the performance of both methods in the simulation, I considered four analyses for each simulated dataset: (1) a visualization of the distribution of posterior probabilities p(H_0 | y); (2) a calculation of the proportion of simulated trials for which the correct model was chosen (i.e., model choice accuracy); (3) a calculation of the proportion of simulated trials for which both methods chose the same model (i.e., model choice consistency); and (4) a calculation of the correlation between the posterior probabilities from both methods.

First, let us visualize the distribution of posterior probabilities p(H_0 | y). To this end, I constructed boxplots of the posterior probabilities, which can be seen in Figure 1. The primary message of Figure 1 is clear: Equation 3.1, which implements the minimal BIC method developed in this paper, appears to produce a distribution of posterior probabilities similar to those produced by the method of Nathoo and Masson (2016). Moreover, this consistency extends across a variety of reasonably common empirical situations. In the cases where H_0 was true (the first row of Figure 1), both Equation 3.1 and the Nathoo and Masson (2016) method produce posterior probabilities for H_0 that are reasonably large. For both methods, the variation of these estimates decreases as the number of observations increases. When the intraclass correlation is small (ρ = 0.2), the estimates from Equation 3.1 and the Nathoo and Masson (2016) method are virtually identical. When the intraclass correlation is large (ρ = 0.8), the Nathoo and Masson (2016) method introduces slightly more variability in the posterior probability estimates. In all, these results indicate that Equation 3.1 is slightly more favorable when H_0 is true.

For small effects (row 2 of Figure 1), the performance of both methods depended heavily on the correlation between repeated measurements. For small intraclass correlation (ρ = 0.2), both methods were quite supportive of H_0, even though H_1 was the true model. This reflects the conservative nature of the BIC approximation (Wagenmakers, 2007); since the unit information prior is uninformative and puts reasonable mass on a large range of possible effect sizes, the predictive updating value for any positive effect (i.e., BF_10) will be smaller than would be the case if the prior were more concentrated on smaller effects. As a result, the posterior probability for H_1 is smaller as well. Regardless, the minimal BIC method (Equation 3.1) and the Nathoo and Masson (2016) method produce a similar range of posterior probabilities. The picture is different when the intraclass correlation is large (ρ = 0.8); both methods produce a wide range of posterior probabilities, though they are again highly comparable.
It is worth pointing out that the posterior probability estimates all improve with increasing numbers of observations; but this should not be surprising, given that the BIC approximation underlying both the minimal BIC method and the Nathoo and Masson (2016) method is a large-sample approximation technique.

For medium effects (row 3 of Figure 1), we see much of the same message that we've already discussed. Both Equation 3.1 and the Nathoo and Masson (2016) method produce similar posterior probability values for H_0. Both methods improve with increasing sample size, and at least for medium-size effects, the computations are quite reliable for high values of correlation between repeated measurements.

Figure 1: Results from our simulation. Each boxplot depicts the distribution of the posterior probability p(H_0 | y) for 1000 Monte Carlo simulations. White boxes represent posterior probabilities derived from Bayes factors that were computed using the minimal BIC method of Equation 3.1. Gray boxes represent posterior probabilities that come from the method of Nathoo and Masson (2016), which accounts for correlation between repeated measurements.

Though the distributions of posterior probabilities appear largely the same, it is not clear to what extent the two methods provide the user with an accurate inference. Since the data are simulated, it is possible to define a "correct" model in each case – for simulated datasets where δ = 0, the correct model is H_0, whereas when δ = 0.2 or δ = 0.5, the correct model is H_1. To compare the performance of both methods, I calculated model choice accuracy, defined as the proportion of simulated datasets for which the correct model was chosen. Model choice was defined by considering H_0 to be chosen whenever BF_01 > 1 and H_1 to be chosen whenever BF_01 < 1. The results are displayed in Table 2.

Table 2: Model choice accuracy for the minimal BIC method and the Nathoo and Masson (2016) method, calculated as the proportion of simulated datasets for which the correct model was chosen

                       Correlation = 0.2               Correlation = 0.8
                  Minimal BIC  Nathoo & Masson    Minimal BIC  Nathoo & Masson
  Null effect
    n = 20           0.969          0.968            0.979          0.954
    n = 50           0.989          0.988            0.991          0.981
    n = 80           0.992          0.992            0.992          0.985
  Small effect
    n = 20           0.068          0.072            0.148          0.218
    n = 50           0.058          0.056            0.307          0.374
    n = 80           0.062          0.062            0.485          0.550
  Medium effect
    n = 20           0.259          0.266            0.867          0.910
    n = 50           0.526          0.530            0.997          0.999
    n = 80           0.760          0.756            1.000          1.000

Let us consider Table 2 in three sections. First, for data that were simulated from a null model, it is clear that the accuracy of both methods is excellent, with model choice accuracies all above 95%. Further, the minimal BIC method performs at least as well as the Nathoo and Masson (2016) method across all sample sizes and correlation conditions. However, the overall performance of both methods becomes more questionable for small effects. Model choice accuracies are no better than 5–7% (regardless of sample size) for datasets with small correlation (ρ = 0.2) between repeated measurements. The situation improves a bit when this correlation increases to 0.8, though accuracy never gets better than 55%. Across all the small-effect datasets, the Nathoo and Masson method is slightly more accurate in choosing the correct model. This pattern continues for datasets which were simulated to have a medium effect, though overall accuracy is much better in this case.

Overall, this pattern of results permits two conclusions. First, the BIC method (upon which both methods are based) tends to be conservative (Wagenmakers, 2007), so the tendency to select the null model in the presence of small effects is unsurprising. Second, though performance was variable in the presence of small and medium effects, the differences in model choice accuracies between the minimal BIC method and the Nathoo and Masson (2016) method were small. Thus, any performance penalty that is exhibited for the minimal BIC method is shared by the Nathoo and Masson method as well, reflecting not a limitation of the minimal BIC method, but a limitation of the BIC method in general.
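Both the accuracy measure just discussed and the consistency measure defined in the next paragraph reduce to simple proportions. The R sketch below illustrates the decision rule and both calculations, using short hypothetical vectors of Bayes factors in place of the actual simulation output.

```r
# Hypothetical Bayes factors from the two methods for a small batch of
# simulated datasets (stand-ins for the real simulation output),
# all generated under H1.
bf_minimal <- c(0.45, 1.20, 0.08, 2.50, 0.60)
bf_nm      <- c(0.40, 1.35, 0.07, 2.10, 0.55)
true_model <- "H1"

# Decision rule: choose H0 whenever BF01 > 1, otherwise choose H1
choice_minimal <- ifelse(bf_minimal > 1, "H0", "H1")
choice_nm      <- ifelse(bf_nm > 1, "H0", "H1")

mean(choice_minimal == true_model)   # model choice accuracy, minimal BIC
mean(choice_nm == true_model)        # model choice accuracy, Nathoo & Masson
mean(choice_minimal == choice_nm)    # model choice consistency
```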
To further validate this claim, I calculated model choice consistency, defined as the proportion of simulated datasets for which both methods chose the same model. As can be seen in Table 3, both the minimal BIC method and the Nathoo and Masson method choose the same model in a large proportion of the simulated datasets, regardless of effect size, sample size, or correlation between repeated measurements.

Table 3: Model choice consistency for the minimal BIC method and the Nathoo and Masson (2016) method, calculated as the proportion of simulated datasets for which both methods chose the same model

                      Null effect   Small effect   Medium effect
  Correlation = 0.2
    n = 20               0.997          0.994           0.977
    n = 50               0.999          0.994           0.984
    n = 80               1.000          0.998           0.994
  Correlation = 0.8
    n = 20               0.975          0.930           0.957
    n = 50               0.990          0.933           0.998
    n = 80               0.993          0.935           1.000

As a final investigation, I calculated the correlations between the posterior probabilities that were produced by both methods. These correlations can be seen in Table 4 and Figure 2 – note that the figure only shows scatterplots for the n = 50 condition, though the n = 20 and n = 80 conditions produce similar plots. Table 4 shows very high correlations between the posterior probability calculations. As can be seen in Figure 2, the relationship is linear when repeated measurements are assumed to have a small correlation, but nonlinear in the presence of highly correlated repeated measurements. For highly correlated measurements, the curvature of the scatterplot indicates that for a given simulated dataset, the posterior probability (for H_0) calculated by the minimal BIC method will tend to be greater than the posterior probability calculated by the Nathoo and Masson (2016) method. Again, this is hardly surprising, as the Nathoo and Masson method is designed to better take into account the correlation between repeated measurements. One should note that this correction is advantageous for datasets generated from a positive-effects model, but disadvantageous for datasets generated from a null model.

Table 4: Correlations between the posterior probabilities p(H_0 | y) calculated by the minimal BIC method and the Nathoo and Masson (2016) method

                    Correlation = 0.2   Correlation = 0.8
  Null effect
    n = 20               0.993               0.987
    n = 50               0.997               0.990
    n = 80               0.998               0.988
  Small effect
    n = 20               0.994               0.989
    n = 50               0.998               0.991
    n = 80               0.999               0.991
  Medium effect
    n = 20               0.995               0.990
    n = 50               0.999               0.995
    n = 80               0.999               0.999

In all, the performance of the minimal BIC method is quite comparable to that of the Nathoo and Masson (2016) method. Though the Nathoo and Masson method is designed to better account for the correlation between repeated measurements, this advantage comes at a cost of increased complexity. On the other hand, the minimal BIC method introduced in this paper requires the user to know only the F-statistic, the number of subjects, and the number of repeated measures conditions. Thus, the small performance penalties for the minimal BIC method are far outweighed by its computational simplicity.

Figure 2: Scatterplots demonstrating the relationship between posterior probabilities calculated by the minimal BIC method (on the horizontal axis) and the Nathoo and Masson (2016) method (on the vertical axis). Sample size is n = 50 for all plots.
6 Conclusion

In this paper, I have proposed a formula for estimating Bayes factors from repeated measures ANOVA designs. These ideas extend previous work of Faulkenberry (2018), who presented such formulas for between-subjects designs. Such formulas are advantageous for researchers in a wide variety of empirical disciplines, as they provide an easy-to-use method for estimating Bayes factors from a minimal set of summary statistics. This gives the user a powerful index for estimating evidential value from a set of experiments, even in cases where the only data available are the summary statistics published in a paper. I think this provides a welcome addition to the collection of tools for doing Bayesian computation with summary statistics (e.g., Ly et al., 2018; Faulkenberry, 2019).

Further, I demonstrated that the minimal BIC method performs similarly to a more complex formula of Nathoo and Masson (2016), who were able to explicitly estimate and account for the correlation between repeated measurements. Though the Nathoo and Masson (2016) approach is certainly more principled than a "one-size-fits-all" approach, it does require knowledge of the various sums-of-squares components from the repeated measures ANOVA, and though I have tried, I have not found an obvious way to recover the Nathoo and Masson (2016) estimates from the F statistic alone. As such, the Nathoo and Masson approach cannot be used without access to the raw data – or at least the various SS components, which are rarely reported in empirical papers. Thus, given the similar performance compared to the Nathoo and Masson (2016) method, the new minimal BIC method stands at an advantage, not only for its computational simplicity, but also for its power in producing maximal information given minimal input.

References

[1] Anders, R., Alario, F.X., and Van Maanen, L. (2016): The shifted Wald distribution for response time data analysis. Psychological Methods, 21(3), 309–327.

[2] Berger, J.O. and Sellke, T. (1987): Testing a point null hypothesis: The irreconcilability of p values and evidence. Journal of the American Statistical Association, 82(397), 112.

[3] Faulkenberry, T.J. (2017): A single-boundary accumulator model of response times in an addition verification task. Frontiers in Psychology, 8, 01225.

[4] Faulkenberry, T.J. (2018): Computing Bayes factors to measure evidence from experiments: An extension of the BIC approximation. Biometrical Letters, 55(1), 31–43.

[5] Faulkenberry, T.J. (2019): Estimating evidential value from analysis of variance summaries: A comment on Ly et al. (2018). Advances in Methods and Practices in Psychological Science, 2(4), 406–409.

[6] Faulkenberry, T.J., Cruise, A., Lavro, D., and Shaki, S. (2016): Response trajectories capture the continuous dynamics of the size congruity effect. Acta Psychologica, 163, 114–123.

[7] Faulkenberry, T.J., Vick, A.D., and Bowman, K.A. (2018): A shifted Wald decomposition of the numerical size-congruity effect: Support for a late interaction account. Polish Psychological Bulletin, 49(4), 391–397.

[8] Fisher, R.A. (1925): Statistical Methods for Research Workers. Edinburgh: Oliver & Boyd.

[9] Gigerenzer, G. (2004): Mindless statistics. The Journal of Socio-Economics, 33(5), 587–606.
[10] Jones, R.H. (2011): Bayesian information criterion for longitudinal and clustered data. Statistics in Medicine, 30(25), 3050–3056.

[11] Kass, R.E. and Raftery, A.E. (1995): Bayes factors. Journal of the American Statistical Association, 90(430), 773–795.

[12] Lindley, D.V. (1957): A statistical paradox. Biometrika, 44(1–2), 187–192.

[13] Ly, A., Raj, A., Etz, A., Marsman, M., Gronau, Q.F., and Wagenmakers, E.-J. (2018): Bayesian reanalyses from summary statistics: A guide for academic consumers. Advances in Methods and Practices in Psychological Science, 1(3), 367–374.

[14] Masson, M.E.J. (2011): A tutorial on a practical Bayesian alternative to null-hypothesis significance testing. Behavior Research Methods, 43(3), 679–690.

[15] Nathoo, F.S. and Masson, M.E.J. (2016): Bayesian alternatives to null-hypothesis significance testing for repeated-measures designs. Journal of Mathematical Psychology, 72, 144–157.

[16] Rouder, J.N., Morey, R.D., Speckman, P.L., and Province, J.M. (2012): Default Bayes factors for ANOVA designs. Journal of Mathematical Psychology, 56(5), 356–374.

[17] Rouder, J.N., Speckman, P.L., Sun, D., Morey, R.D., and Iverson, G. (2009): Bayesian t tests for accepting and rejecting the null hypothesis. Psychonomic Bulletin & Review, 16(2), 225–237.

[18] Santens, S. and Verguts, T. (2011): The size congruity effect: Is bigger always more? Cognition, 118(1), 94–110.

[19] Sobel, K.V., Puri, A.M., and Faulkenberry, T.J. (2016): Bottom-up and top-down attentional contributions to the size congruity effect. Attention, Perception, & Psychophysics, 78(5), 1324–1336.

[20] Sobel, K.V., Puri, A.M., Faulkenberry, T.J., and Dague, T.D. (2017): Visual search for conjunctions of physical and numerical size shows that they are processed independently. Journal of Experimental Psychology: Human Perception and Performance, 43(3), 444–453.

[21] Wagenmakers, E.-J. (2007): A practical solution to the pervasive problems of p values. Psychonomic Bulletin & Review, 14(5), 779–804.