Metodološki zvezki, Vol. 14, No. 2, 2017, 1–17
The Impact of an Extreme Observation in a Paired Samples Design
Ben Derrick¹, Antonia Broad, Deirdre Toher, Paul White
Abstract
The effect of systematically altering the value of a single observation within a
paired differences design is considered. A paradox is observed for the paired samples
t-test, where increasing the value of an observation in the direction of the true mean
difference results in a higher p-value. Using simulation, deviations from robustness
of the paired samples t-test are demonstrated and contrasted with Yuen’s paired
samples t-test and the Wilcoxon signed rank sum test.
1 Introduction
The paired samples t-test is logically and numerically equivalent to the one sample t-test
performed on paired differences, and it is one of the most well-established and commonly
performed statistical tests. Zimmerman (1997) demonstrated that the type I error rate
of the paired samples t-test remains close to the nominal significance level for varying
correlation and sample sizes under normality. Under less idealised conditions, Posten
(1979), Herrendörfer et al. (1983), Rasch and Guiard (2004), and Fradette et al. (2003)
found that the paired samples t-test maintains type I error robustness for a range of
non-normal distributions. However, Blair and Higgins (1985) found the Wilcoxon signed
rank sum test to also be type I error robust and to have some power advantages over the
paired samples t-test for a range of non-normal distributions. Chaffin and Rhiel (1993)
demonstrated that the tails of the sampling distribution of the paired samples test statistic
are skewness dependent, particularly with relatively small sample sizes.
Zumbo and Jennings (2002), using a novel contamination model, determined the effect
of outliers on the validity and power of the paired samples t-test. They found the
paired samples t-test to have robust validity for symmetric contamination, but with
increasing inflation of the type I error rate with increasing asymmetric contamination. This
is coupled with degradation in power in the presence of outliers when the true effect is
small and sample sizes are small. In their work the number of outliers in the sample is
considered to be a random variable.
One of the assumptions of the paired samples t-test is that the differences between the
two samples are normally distributed, or alternatively and in a practical sense, that the
mean difference has a distribution which can reasonably be approximated by a normal
distribution. A closely related assumption is that there are no large outliers in the
differences. When performing the paired samples t-test, there may be competition between the
magnitude of the mean difference and the standard deviation of the differences. In
particular, extreme observations within a dataset can distort the balance between these two
elements of the test. To illustrate this, consider the example data in Table 1.
¹ Engineering, Design and Mathematics, University of the West of England, Bristol, United Kingdom;
ben.derrick@uwe.ac.uk
Table 1: Example data for six units within a paired design
Pair   Sample 1   Sample 2   Difference
1      30         22         8
2      28         18         10
3      45         45         0
4      57         54         3
5      38         32         6
6      37         37 − X     X
For each of the first five pairs, the Sample 1 observation is greater than or equal to the
Sample 2 observation. For the sixth pair, let the difference between the Sample 1 observation
and the Sample 2 observation be denoted as X. Intuition might suggest that a positive value
of X may contribute towards an overall significant difference in means being observed.
If this were the case, a large positive value of X should seemingly contribute towards
a significant effect. In the following, the value of X is systematically altered in order
to demonstrate its impact on the paired samples t-test. The observation X will “march”
through the dataset and will be colloquially referred to as a marching observation. Table 2
shows the results of a two-sided paired samples t-test for negative values of X through to
large positive values of X.
Table 2: Paired samples t-test on five degrees of freedom for increasing values of X
  X      t    p-value     X      t    p-value     X      t    p-value
 −3   1.984   0.104      11   3.670   0.014*     25   2.425   0.060
 −1   2.406   0.061      13   3.461   0.018*     27   2.319   0.068
  1   2.870   0.035*     15   3.240   0.023*     29   2.226   0.077
  3   3.321   0.021*     17   3.033   0.029*     31   2.145   0.085
  5   3.671   0.014*     19   2.848   0.036*     33   2.073   0.093
  7   3.840   0.012*     21   2.687   0.043*     35   2.009   0.101
  9   3.820   0.012*     23   2.546   0.052      37   1.953   0.108
* Null hypothesis of equal means rejected at the 5% significance level.
The values of X for which the null hypothesis of equal means is rejected at the 5%
significance level are highlighted in Table 2. For low values of X it can be seen that as the
value of X increases, the p-value decreases. In this example, as the value of X increases
beyond approximately 8, the p-value increases. As the value of the observed difference
in the sixth pair increases (and hence as the mean difference increases), the p-value also
increases. Observing an extreme value of X in the direction of the seemingly observed
effect can increase the sample variance to such an extent that it impedes the test from
giving a significant result. The extreme observation paradox is the contrariwise p-value
increase as the value of an extreme observation increases in the direction of the overall
effect.
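The t values in Table 2 can be recomputed from the differences in Table 1. The following is a minimal Python sketch, not the authors' code (the paper reports using R); the function names `paired_t` and `t_for_marching_value` are illustrative assumptions. It shows the t statistic rising and then falling as X marches upwards.

```python
import math

# Differences for the first five pairs in Table 1; X is the sixth difference.
FIXED_DIFFERENCES = [8, 10, 0, 3, 6]


def paired_t(differences):
    """One-sample t statistic on paired differences (H0: mean difference = 0)."""
    n = len(differences)
    mean = sum(differences) / n
    variance = sum((d - mean) ** 2 for d in differences) / (n - 1)
    return mean / math.sqrt(variance / n)


def t_for_marching_value(x):
    """t statistic for the Table 1 data when the sixth difference equals x."""
    return paired_t(FIXED_DIFFERENCES + [x])


# Same grid of X values as Table 2: -3, -1, 1, ..., 37.
for x in range(-3, 38, 2):
    print(f"X = {x:3d}   t = {t_for_marching_value(x):.3f}")
```

Because the degrees of freedom are fixed at five, the two-sided p-value is a decreasing function of |t|, so the turning point in t corresponds to the turning point in the p-value.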
As the absolute value of the marching observation increases, the assumptions of the
paired samples t-test are increasingly violated. When the sample size is small or the
assumptions of the paired samples t-test are violated, researchers often choose to perform
the Wilcoxon signed rank sum test. Aguinis et al. (2013) summarise a comprehensive list
of techniques for dealing with outliers and state that non-parametric tests give results that
are robust in the presence of outliers. However, Zimmerman (2011) indicates that rank
based methods do not necessarily eliminate the influence of outliers. Another alternative
approach when outliers are present is to use Yuen’s paired samples t-test. In this test, the
principles of trimmed means outlined by Yuen (1974) are applied to the paired differences
(Wilcox, 2005).
In this paper, simulation is used to explore the scenarios in which the extreme
observation paradox is observed in a paired samples design. We are particularly interested in
isolating those situations when two-sided hypothesis testing is undertaken (e.g. see
Ringwalt et al., 2011), when sample sizes are relatively small (i.e. when outliers may have
a greater effect on the paired samples t-test). The concept of a systematically marching
observation, similar to the demonstration in Table 2, is used to investigate the effects of
an aberrant observation. In the simulation design, this aberrant observation is a forced
additional observation not fitting with the simulated data, and is not due to inherent
variability. Simulations are performed firstly for an aberrant observation in the direction of
the effect suggested by the rest of the sample, and secondly where an aberrant observation is
in the opposing direction of the effect suggested by the rest of the sample. Thus situations
where the sign of the marching observation is concordant or discordant with the mean
of the other observations are considered. For comparative purposes, the paired samples
t-test, the Wilcoxon signed rank sum test, and Yuen’s paired samples t-test are included.
Null hypothesis significance testing is most frequently performed with a nil-null
hypothesis specifying that no difference between groups is present, and a two directional
alternative (Levine et al., 2008). Therefore the impact of an extreme observation for a
two-sided test is the main emphasis of this paper. However, one-sided tests retain some
practical utility, and the simulations are extended to a one-sided test.
We hypothesise that the seemingly paradoxical behaviour exhibited in Table 2 will be
a feature of the paired samples t-test in general. In contrast, we hypothesise that Yuen’s
paired samples t-test and the Wilcoxon signed rank sum test will be robust to a single
aberrant observation.
In order to gain insight, we firstly investigate the mathematical limiting forms of each
of the three test statistics under consideration as a single marching observation becomes
increasingly large compared with the rest of the sample, and then proceed to a simulation
investigation.
2 An Unbounded Marching Observation
For development purposes consider a random sample $X_1, X_2, \ldots, X_{n-1}, X_n$, and let
$X_{(1)} < X_{(2)} < \cdots < X_{(n-1)} < X_{(n)}$ denote the order statistics. Further, let
$X_k = Y_k$ for $k = 1, 2, \ldots, n-1$, let $Y_{(1)} < Y_{(2)} < \cdots < Y_{(n-1)}$ be the
corresponding order statistics, and let $X_n = \xi$ be the marching observation. In this
notation, $Y_k$ $(k = 1, \ldots, n-1)$ denotes the observations prior to the inclusion of the
marching observation.
The following analytical exposition investigates the behaviour of the one sample t-test,
Yuen’s paired samples t-test, and the Wilcoxon signed rank sum test, as the marching
observation $X_n = \xi$ becomes relatively large compared with the rest of the sample.
2.1 The t-test
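Consider the single sample t-test statistic on the paired differences, used to test
$H_0: \mu_X = 0$, defined by
$$T := \frac{\bar{X}}{\hat{\sigma}_{X+} / \sqrt{n}}$$
where
$$\bar{X} := \frac{X_1 + X_2 + \cdots + X_n}{n}$$
and
$$\hat{\sigma}_{X+} := \sqrt{\frac{(X_1 - \bar{X})^2 + (X_2 - \bar{X})^2 + \cdots + (X_n - \bar{X})^2}{n-1}}.$$
Writing $\bar{Y}$ for the mean of $Y_1, Y_2, \ldots, Y_{n-1}$, observe that
$$\bar{X} = \frac{(n-1)\bar{Y} + \xi}{n}, \quad \bar{X} - \bar{Y} = \frac{\xi - \bar{Y}}{n}, \quad \text{and} \quad \xi - \bar{X} = \frac{(n-1)(\xi - \bar{Y})}{n}.$$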
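Thus
$$\hat{\sigma}_{X+} = \sqrt{\frac{(Y_1 - \bar{X})^2 + (Y_2 - \bar{X})^2 + \cdots + (Y_{n-1} - \bar{X})^2 + (\xi - \bar{X})^2}{n-1}}.$$
Note that
$$\begin{aligned}
\sum_{j=1}^{n-1} (Y_j - \bar{X})^2 &= \sum_{j=1}^{n-1} (Y_j - \bar{Y} + \bar{Y} - \bar{X})^2 \\
&= \sum_{j=1}^{n-1} (Y_j - \bar{Y})^2 + (n-1)(\bar{Y} - \bar{X})^2 + 2(\bar{Y} - \bar{X}) \sum_{j=1}^{n-1} (Y_j - \bar{Y}) \\
&= (Y_1 - \bar{Y})^2 + (Y_2 - \bar{Y})^2 + \cdots + (Y_{n-1} - \bar{Y})^2 + (n-1)(\bar{Y} - \bar{X})^2 + 0.
\end{aligned}$$
Hence
$$\hat{\sigma}_{X+} = \sqrt{\frac{(Y_1 - \bar{Y})^2 + (Y_2 - \bar{Y})^2 + \cdots + (Y_{n-1} - \bar{Y})^2 + (\xi - \bar{X})^2}{n-1} + (\bar{X} - \bar{Y})^2}.$$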
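For the $n-1$ values, define
$$\hat{\sigma}_Y := \sqrt{\frac{(Y_1 - \bar{Y})^2 + (Y_2 - \bar{Y})^2 + \cdots + (Y_{n-1} - \bar{Y})^2}{n-1}}.$$
Note that $\hat{\sigma}_Y$ does not have the “+” symbol, i.e. the marching observation is not
included. An alternative definition for $\hat{\sigma}_Y$ could have $n-2$ in the denominator.
With the definition above,
$$\begin{aligned}
\hat{\sigma}_{X+} &= \sqrt{\hat{\sigma}_Y^2 + \frac{(\xi - \bar{X})^2}{n-1} + (\bar{X} - \bar{Y})^2} \\
&= \sqrt{\hat{\sigma}_Y^2 + \frac{(n-1)^2}{n^2} \, \frac{(\xi - \bar{Y})^2}{n-1} + \frac{(\xi - \bar{Y})^2}{n^2}} \\
&= \sqrt{\hat{\sigma}_Y^2 + \frac{(\xi - \bar{Y})^2}{n}}
\end{aligned}$$
and hence
$$T = \frac{(n-1)\bar{Y} + \xi}{\sqrt{n \hat{\sigma}_Y^2 + (\xi - \bar{Y})^2}}.$$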
It can be seen that as $\xi \to \infty$, $T \to 1$, and similarly as $\xi \to -\infty$, $T \to -1$.
Accordingly, for any value of significance level likely to be encountered in practice the
results $\xi \to \pm\infty$, $T \to \pm 1$ indicate that the null hypothesis would not be rejected
under the stated conditions.
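As a numerical check on this limiting form, the sketch below compares the closed-form expression for T with the t statistic computed directly from the full sample, and evaluates T for very large |ξ|. It is an illustrative Python sketch only; the function names are assumptions, and the example values are the first five differences from Table 1.

```python
import math


def t_direct(y, xi):
    """One-sample t statistic for the sample y plus the marching observation xi."""
    x = list(y) + [xi]
    n = len(x)
    mean = sum(x) / n
    variance = sum((v - mean) ** 2 for v in x) / (n - 1)
    return mean / math.sqrt(variance / n)


def t_closed_form(y, xi):
    """T = ((n-1)*Ybar + xi) / sqrt(n*sigma_Y^2 + (xi - Ybar)^2).

    sigma_Y^2 uses the divisor n-1 on the n-1 values of y, matching the
    definition used in the derivation (not the usual n-2 divisor).
    """
    n = len(y) + 1
    y_bar = sum(y) / (n - 1)
    sigma2_y = sum((v - y_bar) ** 2 for v in y) / (n - 1)
    return ((n - 1) * y_bar + xi) / math.sqrt(n * sigma2_y + (xi - y_bar) ** 2)


y = [8, 10, 0, 3, 6]  # first five differences from Table 1
for xi in (7, 37, 1e3, 1e6, -1e6):
    print(f"xi = {xi:>10}: direct = {t_direct(y, xi):.6f}, "
          f"closed form = {t_closed_form(y, xi):.6f}")
```

Since the two-sided 5% critical value of Student's t exceeds 1 for every number of degrees of freedom, a statistic that tends to ±1 cannot reach significance.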
2.2 Yuen’s Paired Samples t-test
Let $\gamma$ denote the per tail proportion of trimming, let $e := \lfloor \gamma n \rfloor$ and let
$f := n - 2e$. Define the trimmed sample $X_{t1}, X_{t2}, \ldots, X_{t(f-1)}, X_{tf}$ as
$X_{tk} := X_{(k+e)}$ $(k = 1, 2, \ldots, f)$ and define the winsorised sample
$X_{w1}, X_{w2}, \ldots, X_{wn}$ as
$$X_{wk} := \begin{cases}
X_{(e+1)} & k = 1, 2, \ldots, e \\
X_{(k)} & k = e+1, e+2, \ldots, n-e \\
X_{(n-e)} & k = n-e+1, n-e+2, \ldots, n
\end{cases}$$
Let $\bar{X}_t = \sum_{k=1}^{f} X_{tk} / f$ and $\bar{X}_w = \sum_{k=1}^{n} X_{wk} / n$ define the
trimmed mean and winsorised mean respectively, and let
$\hat{\sigma}_{Xw+}^2 = \sum_{k=1}^{n} (X_{wk} - \bar{X}_w)^2 / (n-1)$ denote the winsorised
variance. In this notation, Yuen’s test statistic is given by
$$T_Y := \frac{\bar{X}_t}{\hat{\sigma}_{Xw+}} \sqrt{n} (1 - 2\gamma).$$
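For $\xi < Y_{(e)}$,
$$\bar{X}_t = \frac{Y_{(e)} + Y_{(e+1)} + Y_{(e+2)} + \cdots + Y_{(n-e-1)}}{f},$$
$$\bar{X}_w = \frac{e Y_{(e)} + Y_{(e)} + Y_{(e+1)} + Y_{(e+2)} + \cdots + Y_{(n-e-1)} + e Y_{(n-e-1)}}{n}$$
and
$$\hat{\sigma}_{Xw+}^2 = \frac{e (Y_{(e)} - \bar{X}_w)^2 + \sum_{k=e}^{n-e-1} (Y_{(k)} - \bar{X}_w)^2 + e (Y_{(n-e-1)} - \bar{X}_w)^2}{n-1}.$$
For fixed values $Y_1, Y_2, \ldots, Y_{n-1}$, as $\xi \to -\infty$,
$T_Y = \sqrt{n}(1 - 2\gamma) \bar{X}_t / \hat{\sigma}_{Xw+}$ stabilises to some limiting value.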
Similarly, for $\xi > Y_{(n-e)}$,
$$\bar{X}_t = \frac{Y_{(e+1)} + Y_{(e+2)} + Y_{(e+3)} + \cdots + Y_{(n-e)}}{f},$$
$$\bar{X}_w = \frac{e Y_{(e+1)} + Y_{(e+1)} + Y_{(e+2)} + Y_{(e+3)} + \cdots + Y_{(n-e)} + e Y_{(n-e)}}{n}$$
and
$$\hat{\sigma}_{Xw+}^2 = \frac{e (Y_{(e+1)} - \bar{X}_w)^2 + \sum_{k=e+1}^{n-e} (Y_{(k)} - \bar{X}_w)^2 + e (Y_{(n-e)} - \bar{X}_w)^2}{n-1}.$$
For fixed values $Y_1, Y_2, \ldots, Y_{n-1}$, as $\xi \to +\infty$,
$T_Y = \sqrt{n}(1 - 2\gamma) \bar{X}_t / \hat{\sigma}_{Xw+}$ stabilises to
some limiting value. Moreover, for a sufficiently large sample, the limit values for both
directions of the marching observation should be close to each other. Hence the properties
displayed as $\xi \to -\infty$ or $\xi \to +\infty$ are consistent with $T_Y$ being a robust test
statistic.
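The stabilisation of Yuen's statistic can be illustrated numerically. The sketch below is a plain Python reading of the definitions in this section, not the PairedData implementation; the function name `yuen_one_sample_t` and the nine base values are hypothetical choices made for illustration.

```python
import math


def yuen_one_sample_t(x, gamma=0.1):
    """Yuen's trimmed-mean t statistic for H0: trimmed mean = 0.

    gamma is the per tail trimming proportion; e = floor(gamma * n) values are
    trimmed from each tail, and the winsorised variance uses the divisor n - 1.
    """
    n = len(x)
    e = math.floor(gamma * n)
    xs = sorted(x)
    trimmed = xs[e:n - e]                      # X_(e+1), ..., X_(n-e)
    trimmed_mean = sum(trimmed) / len(trimmed)
    # Winsorise: replace the e smallest by X_(e+1) and the e largest by X_(n-e).
    winsorised = [xs[e]] * e + trimmed + [xs[n - e - 1]] * e
    w_mean = sum(winsorised) / n
    w_var = sum((v - w_mean) ** 2 for v in winsorised) / (n - 1)
    return math.sqrt(n) * (1 - 2 * gamma) * trimmed_mean / math.sqrt(w_var)


# Hypothetical base sample of n - 1 = 9 differences; xi is the marching observation.
base = [8.0, 10.0, 0.5, 3.0, 6.0, 4.0, 7.0, 2.0, 5.0]
for xi in (-5000.0, -50.0, 50.0, 5000.0):
    print(f"xi = {xi:8.1f}   T_Y = {yuen_one_sample_t(base + [xi]):.4f}")
```

Once ξ lies beyond the trimming point it is both trimmed and winsorised away, so its magnitude no longer enters the statistic; only the side on which it falls matters.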
2.3 The Wilcoxon signed rank sum test
Assuming no ties and no zero observations, then the test statistic for the Wilcoxon signed
rank sum test, $W$, is defined as
$$W = R_{X_1} \operatorname{sgn}(X_1) + R_{X_2} \operatorname{sgn}(X_2) + \cdots + R_{X_n} \operatorname{sgn}(X_n)$$
where $R_{X_k}$ is the rank of $|X_k|$ among $|X_1|, |X_2|, \ldots, |X_n|$. If
$X_1, X_2, \ldots, X_n$ are independent and follow the same symmetric continuous distribution,
then $W$ follows a distribution with mean 0 and variance $n(n+1)(2n+1)/6$.
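Denote by $R_{Y_k}$ the rank of $|Y_k|$ among $|Y_1|, |Y_2|, \ldots, |Y_{n-1}|$. For
$$|\xi| > \max\{|Y_1|, |Y_2|, \ldots, |Y_{n-1}|\},$$
$$W = R_{Y_1} \operatorname{sgn}(Y_1) + R_{Y_2} \operatorname{sgn}(Y_2) + \cdots + R_{Y_{n-1}} \operatorname{sgn}(Y_{n-1}) + n \operatorname{sgn}(\xi).$$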
Hence under the stated conditions, for fixed values $Y_1, Y_2, \ldots, Y_{n-1}$, the Wilcoxon
signed rank sum statistic stabilises to some situation dependent limit value as
$\xi \to +\infty$, and to some situation dependent limit value as $\xi \to -\infty$. The
difference between these two values is $n - (-n) = 2n$, and the standardised values differ by
$\sqrt{24n / \{(n+1)(2n+1)\}}$.
These are close to each other for sufficiently large $n$.
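The two limit values and their standardised gap can be checked with a short sketch. This is illustrative Python with hypothetical data and function names, assuming no ties and no zeros as in the definition of W.

```python
import math


def signed_rank_sum(x):
    """W = sum over k of rank(|x_k|) * sgn(x_k), assuming no ties and no zeros."""
    order = sorted(range(len(x)), key=lambda k: abs(x[k]))
    w = 0
    for rank, k in enumerate(order, start=1):
        w += rank if x[k] > 0 else -rank
    return w


def null_variance(n):
    """Variance of W under the null hypothesis: n(n+1)(2n+1)/6."""
    return n * (n + 1) * (2 * n + 1) / 6


def standardised_gap(n):
    """Gap between the two standardised limit values: sqrt(24n / ((n+1)(2n+1)))."""
    return math.sqrt(24 * n / ((n + 1) * (2 * n + 1)))


# Hypothetical n - 1 = 9 differences with distinct absolute values.
y = [8, 10, -1, 3, 6, -4, 7, 2, 5]
n = len(y) + 1
print("W, large positive xi:", signed_rank_sum(y + [100]))
print("W, large negative xi:", signed_rank_sum(y + [-100]))
for size in (10, 15, 20, 25):
    print(f"n = {size}: standardised gap = {standardised_gap(size):.3f}")
```

The gap shrinks roughly like the square root of 12/n, so for the sample sizes used in the simulations the two limits are still visibly different.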
3 Simulation Methodology
The approach is to generate sample data meeting the assumptions of the paired samples
t-test, and to then include an additional observation in the sample. This additional
observation systematically changes in its observed value. The paired samples t-test, the
Wilcoxon signed rank sum test, and Yuen’s paired samples t-test are performed for a
two-sided nil-null hypothesis. Under a two-sided nil-null hypothesis, the paired samples t-test
is used to test for a distribution mean difference of zero, and Yuen’s paired samples t-test is
used to test whether the trimmed mean of the distribution is equal to zero. Historically, the
derivation of the Wilcoxon rank sum distribution has been made for continuous random variables
under a null hypothesis of no distributional differences, and the test is sensitive to changes
in central location (Gibbons and Chakraborti, 2011).
Within the simulation, the differences are generated rather than the paired observations
themselves. Specifically, $n-1$ random normal deviates $x_1, x_2, \ldots, x_{n-1}$ are generated
using the Box-Muller (1958) transformation, where $n$ represents the sample size of the
paired differences. Under $H_0$, the $n-1$ random normal deviates have a population mean
of zero ($\mu = 0$) and a standard deviation of one ($\sigma = 1$).
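For readers unfamiliar with the Box-Muller transformation, the following is a minimal Python sketch of the basic form, which maps two independent uniform variates to two independent standard normal deviates. It is not the authors' R code; the function names and the use of Python's `random.Random` as the uniform source are assumptions made for illustration.

```python
import math
import random


def box_muller(rng):
    """Return two independent N(0, 1) deviates from two uniform variates."""
    u1 = 1.0 - rng.random()  # shift to (0, 1] so that log(u1) is always defined
    u2 = rng.random()
    radius = math.sqrt(-2.0 * math.log(u1))
    angle = 2.0 * math.pi * u2
    return radius * math.cos(angle), radius * math.sin(angle)


def normal_deviates(count, rng):
    """Generate `count` standard normal deviates, two at a time."""
    deviates = []
    while len(deviates) < count:
        deviates.extend(box_muller(rng))
    return deviates[:count]


rng = random.Random(1958)  # arbitrary seed, for reproducibility only
sample = normal_deviates(9, rng)  # n - 1 = 9 differences for a sample size of n = 10
print([round(v, 3) for v in sample])
```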
To isolate the phenomenon and behaviour of interest, if
$\bar{x}_{n-1} = \sum_{i=1}^{n-1} x_i / (n-1) < 0$
then $x_1, x_2, \ldots, x_{n-1}$ are multiplied by $-1$ to ensure a non-negative sample mean. (This
change of sign does not affect the validity of a two-sided test of a nil-null hypothesis for
these data.)
Under $H_1$, a constant $d$ is added to each of the $n-1$ deviates.
The simulations are performed under normality so that the data fulfil the assumptions of
the test with the exception of an aberrant observation.
An additional observation, $x_n$, is added to the $n-1$ observations to give a total sample
size of $n$. For any simulated sample, the value of $x_n$ is systematically varied from −8 to 8
in increments of 0.1. It is this value, $x_n$, which is referred to as the ‘marching observation’.
The values of $x_n$ approximately range between ±8 standard deviations from the mean and
would therefore cover limits likely encountered in a practical environment. Note that the
condition of $\bar{x}_{n-1} > 0$ is to ensure that the concordance of effects
($\bar{x}_{n-1} > 0$, $x_n > 0$) or discordance of effects ($\bar{x}_{n-1} > 0$, $x_n < 0$) can be
established.
A summary of the values of $n$, $x_n$ and $d$ used in the full factorial simulation design is
given in Table 3. The simulation is run 10000 times for each combination of sample size
and mean difference.
In a second set of simulations, the impact of the marching observation is similarly
assessed, removing the condition that the mean sample difference is positive, and
performing a one-sided test. This is done as per the parameter combinations in Table 3 using
upper tail critical values.
Table 3: Summary of simulation design
Sample size             10, 15, 20, 25
Marching observation    −8 : 8 (0.1)
Mean difference         0, 0.5
Significance level      5%
Number of iterations    10000
Programming language    R version 3.1.3
For the paired samples t-test and the Wilcoxon signed rank sum test, the default
stats package in R is used. Yuen’s paired samples t-test is performed using the R package
PairedData as outlined by Wilcox (2005). 10% trimming per tail is performed.
The proportion of the 10000 iterations where the null hypothesis is rejected is
calculated at the nominal significance level of 5%. This gives the Null Hypothesis Rejection
Rate (NHRR). Note that the terminology NHRR is used and not type I error rate,
because the inclusion of the marching observation would strictly invalidate the underpinning
assumptions of the resultant test. The effect of gradually increasing the marching
observation is to gradually violate the assumption of the nil-null hypothesis.
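The simulation procedure for the paired samples t-test can be sketched as follows. This is a simplified Python illustration rather than the authors' R code, under several stated assumptions: only n = 10 is handled, with the two-sided 5% critical value for 9 degrees of freedom hard-coded as 2.262; `random.gauss` stands in for the Box-Muller generator; fewer iterations are used by default; and d is added before the sign flip, an ordering the text does not specify. The function names are hypothetical.

```python
import math
import random

T_CRIT_DF9 = 2.262  # two-sided 5% critical value of Student's t on 9 df (n = 10)


def t_statistic(x):
    """One-sample t statistic on the differences x (H0: mean = 0)."""
    n = len(x)
    mean = sum(x) / n
    variance = sum((v - mean) ** 2 for v in x) / (n - 1)
    return mean / math.sqrt(variance / n)


def nhrr_paired_t(marching_values, n=10, d=0.0, iterations=4000, seed=2017):
    """Null Hypothesis Rejection Rate of the two-sided t-test for each x_n.

    Assumption: only n = 10 is supported because the critical value is fixed.
    """
    rng = random.Random(seed)
    rejections = {x_n: 0 for x_n in marching_values}
    for _ in range(iterations):
        # n - 1 normal differences with mean d and standard deviation 1.
        base = [rng.gauss(d, 1.0) for _ in range(n - 1)]
        # Flip signs if needed so that the mean of the n - 1 values is non-negative.
        if sum(base) < 0:
            base = [-v for v in base]
        # March x_n through the same simulated sample.
        for x_n in marching_values:
            if abs(t_statistic(base + [x_n])) > T_CRIT_DF9:
                rejections[x_n] += 1
    return {x_n: count / iterations for x_n, count in rejections.items()}


rates = nhrr_paired_t([-8.0, 0.0, 1.5, 8.0])
for x_n, rate in rates.items():
    print(f"x_n = {x_n:5.1f}   NHRR = {rate:.3f}")
```

Reusing each simulated sample for every value of the marching observation mirrors the idea of marching a single observation through otherwise fixed data.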
The research question being asked is “How is the performance of the paired samples
t-test, Yuen’s paired samples t-test, and the Wilcoxon signed rank sum test affected by
the presence of an aberrant observation?”
4 Results
The Null Hypothesis Rejection Rate (NHRR) is assessed for each of the three statistical
tests under consideration for a two-sided test, firstly when $d = 0$ and secondly in the
presence of a systematic effect size ($d = 0.5$).
Figure 1 gives the NHRR of the paired samples t-test when $d = 0$, using the nominal
significance level of 5%.
Figure 1: NHRR of the paired samples t-test, d = 0, two-sided
Figure 1 shows that when the value of $x_n = d = 0$, the NHRR is approximately equal
to the nominal type I error rate of 5%. For positive sample means, as the value of $x_n$ starts
to increase above zero, the paired samples t-test has an increasingly higher NHRR until
a turning point is reached, with a subsequent return to the nominal type I error rate.
Extreme and increasingly larger values of the marching observation, $x_n$, in the direction
of the sample effect result in a progressively lower NHRR, with values noticeably lower
than the nominal type I error rate. These effects are replicated in all four sample sizes, but
the effects are marginally less noticeable with increasing sample size. Figure 1 also shows
that a large value for the marching observation in the opposite direction to the mean of
the first $n-1$ observations effectively results in a zero value for the NHRR. This effect
is consistent with the asymptotic behaviour given in Section 2 and the findings alluded to
in the example given in Table 2.
Figure 2 gives the NHRR of Yuen’s paired samples t-test and Figure 3 gives the NHRR
of the Wilcoxon signed rank sum test, both when $d = 0$.
Figure 2: NHRR of Yuen’s paired samples t-test, d = 0, two-sided.
Figure 2 and Figure 3 show that when $x_n > 0$ and $\bar{x}_{n-1} > 0$, both Yuen’s paired
samples t-test and the Wilcoxon signed rank sum test result in the null hypothesis being
rejected more frequently than the nominal significance level. Conversely, when $x_n < 0$
and $\bar{x}_{n-1} > 0$, both Yuen’s paired samples t-test and the Wilcoxon signed rank sum
test have a NHRR lower than the nominal significance level. These findings are entirely
consistent with expectation for a robust test given the design of the simulation.
For the Wilcoxon signed rank sum test, due to the use of rank values, the test is
not greatly affected by the magnitude of the extreme observation. Similarly due to the
trimming, Yuen’s paired samples t-test is not greatly affected by the magnitude of the
extreme observation. The phenomenon of a turning point when $x_n > 0$ is not observed
for either the Wilcoxon signed rank sum test or Yuen’s paired samples t-test.
Figure 3: NHRR of the Wilcoxon signed rank sum test, d = 0, two-sided.
Figure 4 gives indicative power of the paired samples t-test, where $d = 0.5$. For a
sample of $n = 10$ independent Normal deviates with $\mu = 0.5$ and $\sigma = 1$, the power
of the paired samples t-test for testing $H_0: \mu = 0$ is 0.293. Under the same
conditions, the power of the paired samples t-test for $n = 15$, 20 and 25 is 0.438, 0.565,
and 0.670 respectively. These reference lines are added to the graphics for comparative
purposes.
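The reference power for n = 10 can be approximated by Monte Carlo simulation. The sketch below is illustrative Python, not the method used to obtain the quoted values; it assumes normal differences with mean d = 0.5 and standard deviation 1, no marching observation, and a hard-coded two-sided 5% critical value of 2.262 for 9 degrees of freedom. The function name, seed and iteration count are arbitrary choices.

```python
import math
import random

T_CRIT_DF9 = 2.262  # two-sided 5% critical value of Student's t on 9 df (n = 10)


def estimated_power(d=0.5, iterations=20000, seed=14):
    """Monte Carlo power of the two-sided one-sample t-test with n = 10."""
    n = 10  # fixed, because the critical value above is specific to 9 df
    rng = random.Random(seed)
    rejections = 0
    for _ in range(iterations):
        x = [rng.gauss(d, 1.0) for _ in range(n)]
        mean = sum(x) / n
        variance = sum((v - mean) ** 2 for v in x) / (n - 1)
        if abs(mean / math.sqrt(variance / n)) > T_CRIT_DF9:
            rejections += 1
    return rejections / iterations


print(f"Estimated power, n = 10, d = 0.5: {estimated_power():.3f}")
```

With 20000 iterations the Monte Carlo standard error is roughly 0.003, so the estimate should land close to the quoted reference value of 0.293.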
Figure 4: NHRR of the paired samples t-test, d = 0.5, two-sided
Figure 4 shows that for $x_n > d = 0.5$, increases in $x_n$ are initially associated with an
increase in power. This power increase relative to the expected power for each of the
sample sizes is clear to see but might not be of great practical consequence. In addition, there
is a noticeable turning point at which the power decreases as $x_n$ further increases. For
larger sample sizes, the paired samples t-test is relatively more robust to the presence of
an extreme observation. For smaller sample sizes, the power reduction when an extreme
observation is present is exacerbated. When the marching observation is in the opposite
direction to the true effect, an increasingly large negative difference eliminates the effect
under the stated conditions.
Figure 5 gives the NHRR of Yuen’s paired samples t-test and Figure 6 gives the NHRR
of the Wilcoxon signed rank sum test, both when $d = 0.5$. Under the same normality
conditions, for $n = 10$, 15, 20 and 25, the corresponding power for the Wilcoxon signed
rank sum test is 0.279, 0.419, 0.543, and 0.648 respectively, and the corresponding power
for the Yuen paired samples t-test is 0.263, 0.356, 0.528, and 0.613 respectively. These
reference lines are added to the graphic for comparative purposes.
Figure 5: NHRR of Yuen’s paired samples t-test, d = 0.5, two-sided
Figures 5 and 6 show that for $x_n > d = 0.5$, increases in $x_n$ are associated with
an increase in power relative to the expected power for each of the sample sizes, but
the increase might not be of great practical consequence. For small samples, when the
marching observation is in the opposite direction to the true effect, an increasingly large
negative marching observation reduces the effect and this is seen in the reduced power.
The second simulation set-up is now considered. The condition that the sample mean
differences are positive is removed, and a one-sided test using the upper tail of the
distribution is performed. Figure 7 shows the impact of the marching observation for each of
the three tests when the null hypothesis is true.
Figure 7 demonstrates that the patterns observed and identifiable conclusions for the
two-sided tests are the same under these conditions. In fact, the impact of the marching
observation in the second simulation set-up is qualitatively similar to the first simulation
set-up. For brevity, the remaining graphics under this condition are not displayed.
Figure 6: NHRR of the Wilcoxon test, d = 0.5, two-sided.
Figure 7: NHRR for each of the three tests when n = 15, d = 0, one-sided
5 Discussion
We have used a systematically increasing marching observation to demonstrate the
impact on the Null Hypothesis Rejection Rate (NHRR) for the paired samples t-test, Yuen’s
paired samples t-test, and the Wilcoxon signed rank sum test. This systematic approach,
similar to one-factor at a time experimentation, would lend itself to other similar
investigations e.g. two independent samples design, or to other single sample tests such as the
single sample variance test, or be extended to investigations involving multiple marching
observations. In practice, $x_n$ and the condition $\bar{x}_{n-1} > 0$ may be independent and
the condition $\bar{x}_{n-1} > 0$ is imposed to separate potential different behaviours of the
test statistics.
The mathematical exposition in Section 2 indicates that for a two-sided paired samples
t-test, a large observation either concordant or discordant with the rest of the sample will
lead to a non-rejection of the null hypothesis. With the paired samples t-test the inclusion
of a very large positive observation $x_n$ into a sample with $\bar{x}_{n-1} > 0$ may in fact
severely reduce the probability of rejecting the null hypothesis.
Simulations comprising normal deviates and in testing a nil-null hypothesis of no
location effects have been performed. Stipulation of the condition $\bar{x}_{n-1} > 0$ does not
invalidate the two-sided test procedure. However, the inclusion of a single, but often large
discrepant observation, does imply that the nil-null hypothesis is not strictly true, hence
our use of the terminology of the NHRR (the null hypothesis rejection rate), rather than
using the terminology type I error rate.
For small sample sizes there is a paradox when performing the paired samples t-
test that more extreme values of the marching observation in the direction of the sample
mean difference result in a greater p-value than a less extreme value of the marching
observation.
Under a location shift model, the inclusion of a genuinely large positive observation $x_n$
into a sample with $\bar{x}_{n-1} > 0$ should lead to an increase in statistical power in a
two-sided test of the nil-null hypothesis. This effect is observed with Yuen’s paired samples
t-test and with the Wilcoxon signed rank sum test, but it is not consistently observed with the
paired samples t-test.
Under a location shift model, the inclusion of a large negative observation $x_n$ into a
sample with $\bar{x}_{n-1} > 0$ should lead to a relative decrease in statistical power. This
effect is observed with Yuen’s paired samples t-test and with the Wilcoxon signed rank sum test,
but the effect is most evident, and is sample size dependent, for the paired samples t-test.
In summary, Yuen’s paired samples t-test and the Wilcoxon signed rank sum test
broadly display properties consistent with being robust statistical tests in the presence of
a large outlier. In contrast the paired samples t-test displays behaviour strongly dependent
on the magnitude of the outlier. Specifically, for small sample sizes the more extreme the
values of the marching observation in the direction of the sample mean difference the
greater the p-value compared to a less extreme value of the marching observation.
Zumbo and Jennings (2002), using their novel contamination model, concluded that
the paired samples t-test had an inflated type I error rate with increasing asymmetric
contamination. However, our marching observation simulations indicate that the effect of
a single outlier on this test is dependent on sample size, magnitude and direction of the
outlier, and could lead to increases and decreases in the NHRR. It should be noted that the
simulations of Zumbo and Jennings (2002) consisted of situations in which the underlying
distributions were contaminated with outliers and simultaneously a true null hypothesis is
maintained. In contrast our simulations are based on the fulfilment of correct assumptions
prior to the inclusion of the marching observation.
Our simulations demonstrate the seemingly paradoxical effect of large outliers on
the performance of the paired samples t-test, and although we concur with Zimmerman
(2011) that rank based methods do not necessarily eliminate the influence of outliers, the
simulations indicate that Yuen’s paired samples t-test and the Wilcoxon signed rank sum
test have robust behaviour in the presence of a single outlying observation.
In the preparation of this paper, methods for outlier detection in the conditions above
were attempted, but we were unable to identify a suitable method. With reference to
paired samples, Preece (1982) states that formal procedures for the detection and rejection
of outliers are of negligible use for small sample sizes. Further debate and investigation
into outlier detection methods offers an area for further research.
Acknowledgements
The authors thank the reviewers, and the editor, for their generous and insightful
comments. Their valuable contributions have resulted in a significantly improved manuscript.
References
[1] Aguinis, H., Gottfredson, R. K., and Joo, H. (2013): Best-practice recommendations
for defining, identifying, and handling outliers. Organizational Research Methods,
16(2), 1–32.
[2] Blair, R. C., and Higgins, J. J. (1985): A comparison of the power of the paired
samples rank transform statistic to that of Wilcoxon’s signed ranks statistic. Journal
of Educational Statistics, 10(4), 368–383.
[3] Box, G. E., and Muller, M. E. (1958): A note on the generation of random normal
deviates. The Annals of Mathematical Statistics, 29(2), 610–611.
[4] Chaffin, W. W., and Rhiel, G. S. (1993): The effect of skewness and kurtosis on the
one-sample t test and the impact of knowledge of the population standard deviation.
Journal of Computation and Simulation, 46, 79–90.
[5] Fradette, K., Keselman, H. J., Lix, L., Algina, J., and Wilcox, R. (2003):
Conventional and robust paired and independent samples t-tests: Type I error and power
rates. Journal of Modern Applied Statistical Methods, 2(2), 481–496.
[6] Gibbons, J. D., and Chakraborti, S. (2011): Nonparametric statistical inference. In:
M. Lovric (Ed): International Encyclopedia of Statistical Science, 977–979. Berlin:
Springer.
[7] Herrendörfer, G., Rasch, D., and Feige, K. D. (1983): Robustness of statistical
methods II. Methods of the one-sample problem. Biometrical Journal, 25, 327–343.
[8] Levine, T. R., Weber, R., Hullett, C., Park, H. S., and Lindsey, L. L. M. (2008): A
critical assessment of null hypothesis significance testing in quantitative
communication research. Human Communication Research, 34(2), 171–187.
[9] Posten, H. O. (1979): The robustness of the one-sample t-test over the Pearson
system. Journal of Statistical Computation and Simulation, 6, 133–149.
[10] Preece, D. A. (1982): T is for trouble (and textbooks): A critique of some examples
of the paired-samples t-test. The Statistician, 31(2), 169–195.
[11] R Core Team (2014): R: A language and environment for statistical computing.
Vienna: R Foundation for Statistical Computing. https://www.r-project.org/.
[12] Rasch, D., and Guiard, V. (2004): The robustness of parametric statistical methods.
Psychology Science, 46, 175–208.
[13] Ringwalt, C., Paschall, M. J., Gorman, D., Derzon, J., and Kinlaw, A. (2011): The
use of one- versus two-tailed tests to evaluate prevention programs. Evaluation & the
Health Professions, 34(2), 135–150.
[14] Wilcox, R. R. (2005): Introduction to robust estimation and hypothesis testing. San
Diego, CA: Academic Press.
[15] Yuen, K. K. (1974): The two-sample trimmed t for unequal population variances.
Biometrika, 61, 165–170.
[16] Zimmerman, D. W. (1997): A note on the interpretation of the paired samples.
Journal of Educational and Behavioral Statistics, 22(3), 349–360.
[17] Zimmerman, D. W. (2011): Inheritance of properties of normal and non-normal
distributions after transformation of scores to ranks. Psicologica, 32(1), 65–85.
[18] Zumbo, B. D., and Jennings, M. J. (2002): The robustness of validity and efficiency
of the related samples t-test in the presence of outliers. Psicologica, 23(2), 415–450.