Metodološki zvezki, Vol. 14, No. 2, 2017, 1–17

The Impact of an Extreme Observation in a Paired Samples Design

Ben Derrick 1, Antonia Broad, Deirdre Toher, Paul White

Abstract

The effect of systematically altering the value of a single observation within a paired differences design is considered. A paradox is observed for the paired samples t-test, where increasing the value of an observation in the direction of the true mean difference results in a higher p-value. Using simulation, deviations from robustness of the paired samples t-test are demonstrated and contrasted with Yuen's paired samples t-test and the Wilcoxon signed rank sum test.

1 Introduction

The paired samples t-test is logically and numerically equivalent to the one sample t-test performed on paired differences, and it is one of the most well-established and commonly performed statistical tests. Zimmerman (1997) demonstrated that the type I error rate of the paired samples t-test remains close to the nominal significance level for varying correlation and sample sizes under normality. Under less idealised conditions, Posten (1979), Herrendörfer et al. (1983), Rasch and Guiard (2004), and Fradette et al. (2003) found that the paired samples t-test maintains type I error robustness for a range of non-normal distributions. However, Blair and Higgins (1985) found the Wilcoxon signed rank sum test to also be type I error robust and to have some power advantages over the paired samples t-test for a range of non-normal distributions. Chaffin and Rhiel (1993) demonstrated that the tails of the sampling distribution of the paired samples test statistic are skewness dependent, particularly with relatively small sample sizes.

Zumbo and Jennings (2002), using a novel contamination model, determined the effect of outliers on the validity and power of the paired samples t-test. They found the paired samples t-test to have robust validity for symmetric contamination, but with increasing inflation of the type I error rate with increasing asymmetric contamination.
This is coupled with degradation in power in the presence of outliers when the true effect is small and sample sizes are small. In their work the number of outliers in the sample is considered to be a random variable.

One of the assumptions of the paired samples t-test is that the differences between the two samples are normally distributed, or alternatively and in a practical sense, that the mean difference has a distribution which can reasonably be approximated by a normal distribution. A closely related assumption is that there are no large outliers in the differences.

1 Engineering, Design and Mathematics, University of the West of England, Bristol, United Kingdom; ben.derrick@uwe.ac.uk

When performing the paired samples t-test, there may be competition between the magnitude of the mean difference and the standard deviation of the differences. In particular, extreme observations within a dataset can distort the balance between these two elements of the test. To illustrate this, consider the example data in Table 1.

Table 1: Example data for six units within a paired design

Pair   Sample 1   Sample 2   Difference
1      30         22         8
2      28         18         10
3      45         45         0
4      57         54         3
5      38         32         6
6      37         37 − X     X

For each of the first five pairs, the Sample 1 observation is greater than or equal to the Sample 2 observation. For the sixth pair, let the difference between the Sample 1 observation and the Sample 2 observation be denoted as X. Intuition might suggest that a positive value of X may contribute towards an overall significant difference in means being observed. If this were the case, a large positive value of X should seemingly contribute towards a significant effect. In the following, the value of X is systematically altered in order to demonstrate its impact on the paired samples t-test. The observation X will “march” through the dataset and will be colloquially referred to as a marching observation. Table 2 shows the results of a two-sided paired samples t-test for negative values of X through to large positive values of X.
Table 2: Paired samples t-test on five degrees of freedom for increasing values of X (* denotes p < 0.05)

  X     t      p-value  |   X     t      p-value  |   X     t      p-value
 −3   1.984    0.104    |  11   3.670    0.014*   |  25   2.425    0.060
 −1   2.406    0.061    |  13   3.461    0.018*   |  27   2.319    0.068
  1   2.870    0.035*   |  15   3.240    0.023*   |  29   2.226    0.077
  3   3.321    0.021*   |  17   3.033    0.029*   |  31   2.145    0.085
  5   3.671    0.014*   |  19   2.848    0.036*   |  33   2.073    0.093
  7   3.840    0.012*   |  21   2.687    0.043*   |  35   2.009    0.101
  9   3.820    0.012*   |  23   2.546    0.052    |  37   1.953    0.108

The values of X for which the null hypothesis of equal means is rejected at the 5% significance level are marked with an asterisk in Table 2. For low values of X it can be seen that as the value of X increases, the p-value decreases. In this example, as the value of X increases beyond approximately 8, the p-value increases. As the value of the observed difference in the sixth pair increases (and hence as the mean difference increases), the p-value also increases. Observing an extreme value of X in the direction of the seemingly observed effect can increase the sample variance to such an extent that it prevents the test from giving a significant result. The extreme observation paradox is the contrariwise p-value increase as the value of an extreme observation increases in the direction of the overall effect.

As the absolute value of the marching observation increases, the assumptions of the paired samples t-test are increasingly violated. When the sample size is small or the assumptions of the paired samples t-test are violated, researchers often choose to perform the Wilcoxon signed rank sum test. Aguinis et al. (2013) summarise a comprehensive list of techniques for dealing with outliers and state that non-parametric tests give results that are robust in the presence of outliers. However, Zimmerman (2011) indicates that rank based methods do not necessarily eliminate the influence of outliers. Another alternative approach when outliers are present is to use Yuen's paired samples t-test.
In this test, the principles of trimmed means outlined by Yuen (1974) are applied to the paired differences (Wilcox, 2005).

In this paper, simulation is used to explore the scenarios in which the extreme observation paradox is observed in a paired samples design. We are particularly interested in isolating those situations when two-sided hypothesis testing is undertaken (e.g. see Ringwalt et al., 2011), when sample sizes are relatively small (i.e. when outliers may have a greater effect on the paired samples t-test). The concept of a systematically marching observation, similar to the demonstration in Table 2, is used to investigate the effects of an aberrant observation. In the simulation design, this aberrant observation is a forced additional observation not fitting with the simulated data, and is not due to inherent variability. Simulations are performed for an aberrant observation in the direction of the effect suggested by the rest of the sample, and secondly where an aberrant observation is in the opposing direction of the effect suggested by the rest of the sample. Thus situations where the sign of the marching observation is concordant or discordant with the mean of the other observations are considered. For comparative purposes, the paired samples t-test, the Wilcoxon signed rank sum test, and Yuen's paired samples t-test are included.

Null hypothesis significance testing is most frequently performed with a nil-null hypothesis specifying that no difference between groups is present, and a two directional alternative (Levine et al., 2008). Therefore the impact of an extreme observation for a two-sided test is the main emphasis of this paper. However, one-sided tests retain some practical utility, and the simulations are extended to a one-sided test.

We hypothesise that the seemingly paradoxical behaviour exhibited in Table 2 will be a feature of the paired samples t-test in general. In contrast, we hypothesise that Yuen's paired samples t-test and the Wilcoxon signed rank sum test will be robust to a single aberrant observation.
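The behaviour in Table 2 can be checked with a few lines of code. The following is a minimal sketch in Python (not the R code used for the paper), assuming only the five fixed differences of Table 1 and the tabulated two-sided 5% critical value of Student's t on five degrees of freedom (2.571). The function and variable names are illustrative. The closed-form function uses the expression for T given in Section 2.1, so the two functions should agree up to floating-point error.

```python
import math
import statistics

# Paired differences for the first five pairs of Table 1.
BASE_DIFFS = [8, 10, 0, 3, 6]

# Tabulated two-sided 5% critical value of Student's t on 5 degrees of freedom.
T_CRIT_DF5 = 2.571


def t_direct(diffs):
    """One-sample t statistic for H0: mean difference = 0."""
    n = len(diffs)
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))


def t_closed_form(base, xi):
    """Section 2.1 expression for T: base differences plus marching value xi."""
    n = len(base) + 1
    y_bar = statistics.mean(base)
    # The paper's sigma_Y^2 divides by n - 1 (the number of base values).
    sigma_y_sq = sum((y - y_bar) ** 2 for y in base) / (n - 1)
    return ((n - 1) * y_bar + xi) / math.sqrt(n * sigma_y_sq + (xi - y_bar) ** 2)


def march(values):
    """Return (X, t, rejected) for each marching value X."""
    rows = []
    for x in values:
        t = t_direct(BASE_DIFFS + [x])
        rows.append((x, t, abs(t) > T_CRIT_DF5))
    return rows


rows = march(range(-3, 38, 2))
for x, t, rejected in rows:
    print(f"X={x:>3}  t={t:.3f}  {'reject' if rejected else 'retain'}")
```

Comparing |t| with a tabulated critical value avoids the need for a t-distribution function, which the Python standard library does not provide; the reject/retain decisions match the p-values below and above 0.05 in Table 2.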
In order to gain insight, we firstly investigate the mathematical limiting forms of each of the three test statistics under consideration as a single marching observation becomes increasingly large compared with the rest of the sample, and then proceed to a simulation investigation.

2 An Unbounded Marching Observation

For development purposes consider a random sample $X_1, X_2, \ldots, X_{n-1}, X_n$, and let $X_{(1)} < X_{(2)} < \cdots < X_{(n-1)} < X_{(n)}$ denote the order statistics. Further, let $X_k = Y_k$ for $k = 1, 2, \ldots, n-1$, let $Y_{(1)} < Y_{(2)} < \cdots < Y_{(n-1)}$ be the corresponding order statistics, and let $X_n = \xi$ be the marching observation. In this notation, $Y_k$ $(k = 1, \ldots, n-1)$ denotes the observations prior to the inclusion of the marching observation.

The following analytical exposition investigates the behaviour of the one sample t-test, Yuen's paired samples t-test, and the Wilcoxon signed rank sum test, as the marching observation $X_n = \xi$ becomes relatively large compared with the rest of the sample.

2.1 The t-test

Consider the single sample t-test test statistic on the paired differences, used to test $H_0: \mu_X = 0$, defined by

\[ T := \frac{\bar{X}}{\hat{\sigma}_{X+} / \sqrt{n}} \]

where

\[ \bar{X} := \frac{X_1 + X_2 + \cdots + X_n}{n} \quad \text{and} \quad \hat{\sigma}_{X+} := \sqrt{\frac{(X_1 - \bar{X})^2 + (X_2 - \bar{X})^2 + \cdots + (X_n - \bar{X})^2}{n-1}}. \]

Observe that

\[ \bar{X} = \frac{(n-1)\bar{Y} + \xi}{n}, \qquad \bar{X} - \bar{Y} = \frac{\xi - \bar{Y}}{n}, \qquad \text{and} \qquad \xi - \bar{X} = \frac{(n-1)(\xi - \bar{Y})}{n}. \]

Thus

\[ \hat{\sigma}_{X+} = \sqrt{\frac{(Y_1 - \bar{X})^2 + (Y_2 - \bar{X})^2 + \cdots + (Y_{n-1} - \bar{X})^2 + (\xi - \bar{X})^2}{n-1}}. \]

Note that

\[
\begin{aligned}
\sum_{j=1}^{n-1} (Y_j - \bar{X})^2 &= \sum_{j=1}^{n-1} (Y_j - \bar{Y} + \bar{Y} - \bar{X})^2 \\
&= \sum_{j=1}^{n-1} (Y_j - \bar{Y})^2 + (n-1)(\bar{Y} - \bar{X})^2 + 2(\bar{Y} - \bar{X}) \sum_{j=1}^{n-1} (Y_j - \bar{Y}) \\
&= (Y_1 - \bar{Y})^2 + (Y_2 - \bar{Y})^2 + \cdots + (Y_{n-1} - \bar{Y})^2 + (n-1)(\bar{Y} - \bar{X})^2 + 0.
\end{aligned}
\]

Hence

\[ \hat{\sigma}_{X+} = \sqrt{\frac{(Y_1 - \bar{Y})^2 + (Y_2 - \bar{Y})^2 + \cdots + (Y_{n-1} - \bar{Y})^2 + (\xi - \bar{X})^2}{n-1} + (\bar{X} - \bar{Y})^2}. \]

For the $n-1$ values, define

\[ \hat{\sigma}_Y := \sqrt{\frac{(Y_1 - \bar{Y})^2 + (Y_2 - \bar{Y})^2 + \cdots + (Y_{n-1} - \bar{Y})^2}{n-1}}. \]

Note that $\hat{\sigma}_Y$ does not have the “+” symbol, i.e. the marching observation is not included.
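An alternative definition for $\hat{\sigma}_Y$ could have $n-2$ in the denominator. With the definition used here,

\[
\begin{aligned}
\hat{\sigma}_{X+} &= \sqrt{\hat{\sigma}_Y^2 + \frac{(\xi - \bar{X})^2}{n-1} + (\bar{X} - \bar{Y})^2} \\
&= \sqrt{\hat{\sigma}_Y^2 + \frac{(n-1)^2}{n^2} \frac{(\xi - \bar{Y})^2}{n-1} + \frac{(\xi - \bar{Y})^2}{n^2}} \\
&= \sqrt{\hat{\sigma}_Y^2 + \frac{(\xi - \bar{Y})^2}{n}}
\end{aligned}
\]

and hence

\[ T = \frac{(n-1)\bar{Y} + \xi}{\sqrt{n \hat{\sigma}_Y^2 + (\xi - \bar{Y})^2}}. \]

It can be seen that as $\xi \to \infty$, $T \to 1$, and similarly as $\xi \to -\infty$, $T \to -1$. Accordingly, for any significance level likely to be encountered in practice, the results $T \to \pm 1$ as $\xi \to \pm\infty$ indicate that the null hypothesis would not be rejected under the stated conditions.

2.2 Yuen's Paired Samples t-test

Let $\gamma$ denote the per tail proportion of trimming, let $e := \lfloor \gamma n \rfloor$ and let $f := n - 2e$. Define the trimmed sample $X_{t1}, X_{t2}, \ldots, X_{tf}$ as $X_{tk} := X_{(k+e)}$ $(k = 1, 2, \ldots, f)$ and define the winsorised sample $X_{w1}, X_{w2}, \ldots, X_{wn}$ as

\[
X_{wk} :=
\begin{cases}
X_{(e+1)} & k = 1, 2, \ldots, e \\
X_{(k)} & k = e+1, e+2, \ldots, n-e \\
X_{(n-e)} & k = n-e+1, n-e+2, \ldots, n
\end{cases}
\]

Let $\bar{X}_t = \sum_{k=1}^{f} X_{tk} / f$ and $\bar{X}_w = \sum_{k=1}^{n} X_{wk} / n$ define the trimmed mean and winsorised mean respectively, and let $\hat{\sigma}^2_{Xw+} = \sum_{k=1}^{n} (X_{wk} - \bar{X}_w)^2 / (n-1)$ denote the winsorised variance. In this notation, Yuen's test statistic is given by

\[ T_Y := \frac{\bar{X}_t}{\hat{\sigma}_{Xw+}} \sqrt{n}\,(1 - 2\gamma). \]

For $\xi > Y_{(n-e)}$,

\[ \bar{X}_t = \frac{Y_{(e+1)} + Y_{(e+2)} + \cdots + Y_{(n-e)}}{f}, \]

\[ \bar{X}_w = \frac{e Y_{(e+1)} + Y_{(e+1)} + Y_{(e+2)} + \cdots + Y_{(n-e)} + e Y_{(n-e)}}{n} \]

and

\[ \hat{\sigma}^2_{Xw+} = \frac{e (Y_{(e+1)} - \bar{X}_w)^2 + \sum_{k=e+1}^{n-e} (Y_{(k)} - \bar{X}_w)^2 + e (Y_{(n-e)} - \bar{X}_w)^2}{n-1}. \]

None of these expressions depends on $\xi$. Hence, for fixed values $Y_1, Y_2, \ldots, Y_{n-1}$, as $\xi \to \infty$ the statistic $T_Y$ stabilises to some limiting value, and by the analogous argument for sufficiently small $\xi$ it also stabilises as $\xi \to -\infty$. Moreover, for a sufficiently large sample, the limit values for both directions of the marching observation should be close to each other. Hence the properties displayed as $\xi \to \infty$ or $\xi \to -\infty$ are consistent with $T_Y$ being a robust test statistic.

2.3 The Wilcoxon signed rank sum test

Assuming no ties and no zero observations, the test statistic for the Wilcoxon signed rank sum test, $W$, is defined as

\[ W = R_{X_1} \operatorname{sgn}(X_1) + R_{X_2} \operatorname{sgn}(X_2) + \cdots + R_{X_n} \operatorname{sgn}(X_n) \]

where $R_{X_k}$ is the rank of $|X_k|$ among $|X_1|, |X_2|, \ldots, |X_n|$.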
If $X_1, X_2, \ldots, X_n$ are independent and follow the same continuous distribution, symmetric about zero, then $W$ follows a distribution with mean 0 and variance $n(n+1)(2n+1)/6$.

Denote by $R_{Y_k}$ the rank of $|Y_k|$ among $|Y_1|, |Y_2|, \ldots, |Y_{n-1}|$. For $|\xi| > \max\{|Y_1|, |Y_2|, \ldots, |Y_{n-1}|\}$,

\[ W = R_{Y_1} \operatorname{sgn}(Y_1) + R_{Y_2} \operatorname{sgn}(Y_2) + \cdots + R_{Y_{n-1}} \operatorname{sgn}(Y_{n-1}) + n \operatorname{sgn}(\xi). \]

Hence under the stated conditions, for fixed values $Y_1, Y_2, \ldots, Y_{n-1}$, the Wilcoxon signed rank sum statistic stabilises to some situation dependent limit value as $\xi \to +\infty$, and to some situation dependent limit value as $\xi \to -\infty$. The difference between these two values is $n - (-n) = 2n$, and the standardised values differ by $\sqrt{24n / \{(n+1)(2n+1)\}}$. These are close to each other for sufficiently large $n$.

3 Simulation Methodology

The approach is to generate sample data meeting the assumptions of the paired samples t-test, and to then include an additional observation in the sample. This additional observation systematically changes in its observed value. The paired samples t-test, the Wilcoxon signed rank sum test, and Yuen's paired samples t-test are performed for a two-sided nil-null hypothesis. Under a two-sided nil-null hypothesis, the paired samples t-test is used to test a distribution mean difference of zero, and Yuen's paired samples t-test is used to test a distribution trimmed mean of zero. Historically, the derivation of the Wilcoxon rank sum distribution has been made for continuous random variables under a null hypothesis of no distributional differences, and is sensitive to changes in central location (Gibbons and Chakraborti, 2011).

Within the simulation, the differences are generated rather than the paired observations themselves. Specifically, $n-1$ random normal deviates $x_1, x_2, \ldots, x_{n-1}$ are generated using the Box-Muller (1958) transformation, where $n$ represents the sample size of the paired differences. Under $H_0$, the $n-1$ random normal deviates have a population mean of zero ($\mu = 0$) and a standard deviation of one ($\sigma = 1$).
To isolate the phenomenon and behaviour of interest, if $\bar{x}_{n-1} = \sum_{i=1}^{n-1} x_i / (n-1) < 0$ then $x_1, x_2, \ldots, x_{n-1}$ are multiplied by $-1$ to ensure a non-negative sample mean. (This change of sign does not affect the validity of a two-sided test of a nil-null hypothesis for these data.)

Under $H_1$, a constant $d$ is added to each of the $n-1$ deviates. The simulations are performed under normality so that the data fulfil the assumptions of the test with the exception of an aberrant observation.

An additional observation, $x_n$, is added to the $n-1$ observations to give a total sample size of $n$. For any simulated sample, the value of $x_n$ is systematically varied from $-8$ to $8$ in increments of 0.1. It is this value, $x_n$, which is referred to as the ‘marching observation’. The values of $x_n$ approximately range between $\pm 8$ standard deviations from the mean and would therefore cover limits likely encountered in a practical environment. Note that the condition $\bar{x}_{n-1} > 0$ is to ensure that the concordance of effects ($\bar{x}_{n-1} > 0$, $x_n > 0$) or discordance of effects ($\bar{x}_{n-1} > 0$, $x_n < 0$) can be established.

A summary of the values of $n$, $x_n$ and $d$ used in the full factorial simulation design is given in Table 3. The simulation is run 10000 times for each combination of sample size and mean difference.

In a second set of simulations, the impact of the marching observation is similarly assessed, removing the condition that the mean sample difference is positive, and performing a one-sided test. This is done as per the parameter combinations in Table 3 using upper tail critical values.

Table 3: Summary of simulation design

Sample size              10, 15, 20, 25
Marching observation     −8:8 (0.1)
Mean difference          0, 0.5
Significance level       5%
Number of iterations     10000
Programming language     R version 3.1.3

For the paired samples t-test and the Wilcoxon signed rank sum test, the default stats package in R is used. Yuen's paired samples t-test is performed using the R package PairedData as outlined by Wilcox (2005). 10% trimming per tail is performed.
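The simulation design for the paired samples t-test under $H_0$ can be sketched as follows. This is an illustrative Python sketch rather than the R code used for the paper, and it makes several assumptions: only the two-sided paired samples t-test with $d = 0$ and $n = 10$ is covered, the rejection decision uses the tabulated critical value of Student's t on nine degrees of freedom (2.262) instead of a p-value, only 2000 iterations are run, and the function names and the seed are hypothetical.

```python
import math
import random

# Tabulated two-sided 5% critical value of Student's t on 9 degrees of freedom.
T_CRIT_DF9 = 2.262


def box_muller(rng):
    """Return one standard normal deviate via the Box-Muller transformation."""
    u1 = 1.0 - rng.random()  # lies in (0, 1], so log(u1) is defined
    u2 = rng.random()
    return math.sqrt(-2.0 * math.log(u1)) * math.cos(2.0 * math.pi * u2)


def t_statistic(diffs):
    """One-sample t statistic for H0: mean difference = 0."""
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((x - mean) ** 2 for x in diffs) / (n - 1)
    return mean / math.sqrt(var / n)


def nhrr(x_n, n=10, iterations=2000, seed=1):
    """Estimate the NHRR of the two-sided paired t-test under H0 (d = 0)."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(iterations):
        base = [box_muller(rng) for _ in range(n - 1)]
        if sum(base) < 0:  # flip signs so the n - 1 deviates have mean >= 0
            base = [-x for x in base]
        # Append the marching observation and test at the 5% level.
        if abs(t_statistic(base + [x_n])) > T_CRIT_DF9:
            rejections += 1
    return rejections / iterations


for x_n in (-8.0, 0.0, 2.0, 8.0):
    print(f"x_n={x_n:+.1f}  NHRR={nhrr(x_n):.3f}")
```

Because the same seed is reused for every value of $x_n$, each marching value is applied to the same set of simulated samples, so differences in the estimated NHRR reflect the marching observation rather than sampling noise. The full design in Table 3 would loop over all sample sizes and the grid from −8 to 8 in steps of 0.1.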
The proportion of the 10000 iterations where the null hypothesis is rejected is calculated at the nominal significance level of 5%. This gives the Null Hypothesis Rejection Rate (NHRR). Note that the terminology NHRR is used and not type I error rate, because the inclusion of the marching observation would strictly invalidate the underpinning assumptions of the resultant test. The effect of gradually increasing the marching observation is to gradually violate the assumption of the nil-null hypothesis.

The research question being asked is “How is the performance of the paired samples t-test, Yuen's paired samples t-test, and the Wilcoxon signed rank sum test affected by the presence of an aberrant observation?”

4 Results

The Null Hypothesis Rejection Rate (NHRR) is assessed for each of the three statistical tests under consideration for a two-sided test, firstly when $d = 0$ and secondly in the presence of a systematic effect size ($d = 0.5$).

Figure 1 gives the NHRR of the paired samples t-test when $d = 0$, using the nominal significance level of 5%.

Figure 1: NHRR of the paired samples t-test, d = 0, two-sided

Figure 1 shows that when $x_n = d = 0$, the NHRR is approximately equal to the nominal type I error rate of 5%. For positive sample means, as the value of $x_n$ starts to increase above zero, the paired samples t-test has an increasingly higher NHRR until a turning point is reached, with a subsequent return to the nominal type I error rate. Extreme and increasingly larger values of the marching observation, $x_n$, in the direction of the sample effect result in a progressively lower NHRR, with values noticeably lower than the nominal type I error rate. These effects are replicated in all four sample sizes, but the effects are marginally less noticeable with increasing sample size. Figure 1 also shows that a large value for the marching observation in the opposite direction to the mean of the first $n-1$ observations effectively results in a zero value for the NHRR.
This effect is consistent with the asymptotic behaviour given in Section 2 and the findings alluded to in the example given in Table 2.

Figure 2 gives the NHRR of Yuen's paired samples t-test and Figure 3 gives the NHRR of the Wilcoxon signed rank sum test, both when $d = 0$.

Figure 2: NHRR of Yuen's paired samples t-test, d = 0, two-sided

Figure 2 and Figure 3 show that when $x_n > 0$ and $\bar{x}_{n-1} > 0$, both Yuen's paired samples t-test and the Wilcoxon signed rank sum test result in the null hypothesis being rejected more frequently than the nominal significance level. Conversely, when $x_n < 0$ and $\bar{x}_{n-1} > 0$, both Yuen's paired samples t-test and the Wilcoxon signed rank sum test have a NHRR lower than the nominal significance level. These findings are entirely consistent with expectation for a robust test given the design of the simulation.

For the Wilcoxon signed rank sum test, due to the use of rank values, the test is not greatly affected by the magnitude of the extreme observation. Similarly, due to the trimming, Yuen's paired samples t-test is not greatly affected by the magnitude of the extreme observation. The phenomenon of a turning point when $x_n > 0$ is not observed for either the Wilcoxon signed rank sum test or Yuen's paired samples t-test.

Figure 3: NHRR of the Wilcoxon signed rank sum test, d = 0, two-sided

Figure 4 gives indicative power of the paired samples t-test, where $d = 0.5$. For a sample of $n = 10$ independent Normal deviates with $\mu = 0.5$ and $\sigma = 1$, the power of the paired samples t-test for testing $H_0: \mu = 0$ is 0.293. Under the same conditions, the power of the paired samples t-test for $n = 15$, 20 and 25 is 0.438, 0.565, and 0.670 respectively. These reference lines are added to the graphics for comparative purposes.

Figure 4: NHRR of the paired samples t-test, d = 0.5, two-sided

Figure 4 shows that for $x_n > d = 0.5$, increases in $x_n$ are initially associated with an increase in power.
This power increase relative to the expected power for each of the sample sizes is clear to see but might not be of great practical consequence. In addition, there is a noticeable turning point at which the power decreases as $x_n$ further increases. For larger sample sizes, the paired samples t-test is relatively more robust to the presence of an extreme observation. For smaller sample sizes, the power reduction when an extreme observation is present is exacerbated. When the marching observation is in the opposite direction to the true effect, an increasingly large negative difference eliminates the effect under the stated conditions.

Figure 5 gives the NHRR of Yuen's paired samples t-test and Figure 6 gives the NHRR of the Wilcoxon signed rank sum test, both when $d = 0.5$. Under the same normality conditions, for $n = 10$, 15, 20 and 25, the corresponding power for the Wilcoxon signed rank sum test is 0.279, 0.419, 0.543, and 0.648 respectively, and the corresponding power for the Yuen paired samples t-test is 0.263, 0.356, 0.528, and 0.613 respectively. These reference lines are added to the graphics for comparative purposes.

Figure 5: NHRR of Yuen's paired samples t-test, d = 0.5, two-sided

Figures 5 and 6 show that for $x_n > d = 0.5$, increases in $x_n$ are associated with an increase in power relative to the expected power for each of the sample sizes, but the increase might not be of great practical consequence. For small samples, when the marching observation is in the opposite direction to the true effect, an increasingly large negative marching observation reduces the effect and this is seen in the reduced power.

The second simulation set-up is now considered. The condition that the sample mean differences are positive is removed, and a one-sided test using the upper tail of the distribution is performed. Figure 7 shows the impact of the marching observation for each of the three tests when the null hypothesis is true.

Figure 7 demonstrates that the patterns observed and identifiable conclusions for the two-sided tests are the same under these conditions.
In fact, the impact of the marching observation in the second simulation set-up is qualitatively similar to the first simulation set-up. For brevity, the remaining graphics under this condition are not displayed.

Figure 6: NHRR of the Wilcoxon signed rank sum test, d = 0.5, two-sided

Figure 7: NHRR for each of the three tests when n = 15, d = 0, one-sided

5 Discussion

We have used a systematically increasing marching observation to demonstrate the impact on the Null Hypothesis Rejection Rate (NHRR) for the paired samples t-test, Yuen's paired samples t-test, and the Wilcoxon signed rank sum test. This systematic approach, similar to one-factor-at-a-time experimentation, would lend itself to other similar investigations, e.g. a two independent samples design, or to other single sample tests such as the single sample variance test, or could be extended to investigations involving multiple marching observations. In practice, $x_n$ and the condition $\bar{x}_{n-1} > 0$ may be independent, and the condition $\bar{x}_{n-1} > 0$ is imposed to separate potentially different behaviours of the test statistics.

The mathematical exposition in Section 2 indicates that for a two-sided paired samples t-test, a large observation either concordant or discordant with the rest of the sample will lead to a non-rejection of the null hypothesis. With the paired samples t-test the inclusion of a very large positive observation $x_n$ into a sample with $\bar{x}_{n-1} > 0$ may in fact severely reduce the probability of rejecting the null hypothesis.

Simulations comprising normal deviates and testing a nil-null hypothesis of no location effects have been performed. Stipulation of the condition $\bar{x}_{n-1} > 0$ does not invalidate the two-sided test procedure. However, the inclusion of a single, but often large, discrepant observation does imply that the nil-null hypothesis is not strictly true, hence our use of the terminology of the NHRR (the null hypothesis rejection rate), rather than the terminology type I error rate.
For small sample sizes there is a paradox when performing the paired samples t-test: more extreme values of the marching observation in the direction of the sample mean difference result in a greater p-value than a less extreme value of the marching observation.

Under a location shift model, the inclusion of a genuinely large positive observation $x_n$ into a sample with $\bar{x}_{n-1} > 0$ should lead to an increase in statistical power in a two-sided test of the nil-null hypothesis. This effect is observed with Yuen's paired samples t-test and with the Wilcoxon signed rank sum test, but it is not consistently observed with the paired samples t-test.

Under a location shift model, the inclusion of a large negative observation $x_n$ into a sample with $\bar{x}_{n-1} > 0$ should lead to a relative decrease in statistical power. This effect is observed with Yuen's paired samples t-test and with the Wilcoxon signed rank sum test, but the effect is most evident, and is sample size dependent, for the paired samples t-test.

In summary, Yuen's paired samples t-test and the Wilcoxon signed rank sum test broadly display properties consistent with being robust statistical tests in the presence of a large outlier. In contrast, the paired samples t-test displays behaviour strongly dependent on the magnitude of the outlier. Specifically, for small sample sizes, the more extreme the values of the marching observation in the direction of the sample mean difference, the greater the p-value compared to a less extreme value of the marching observation.
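The location of the turning point for the paired samples t-test can be made explicit. The following worked step is not part of the original exposition; it is a sketch obtained by differentiating the expression for $T$ in Section 2.1 with respect to $\xi$, assuming $\bar{Y} > 0$ and the definition of $\hat{\sigma}_Y$ with $n-1$ in the denominator. The symbol $\xi^{*}$ is introduced here for illustration.

```latex
\begin{align*}
T(\xi) &= \frac{(n-1)\bar{Y} + \xi}{\sqrt{n\hat{\sigma}_Y^2 + (\xi - \bar{Y})^2}}, \\
\frac{dT}{d\xi} = 0
  &\iff n\hat{\sigma}_Y^2 + (\xi - \bar{Y})^2
        = \bigl((n-1)\bar{Y} + \xi\bigr)(\xi - \bar{Y}) \\
  &\iff n\hat{\sigma}_Y^2 + n\bar{Y}^2 = n\bar{Y}\,\xi \\
  &\iff \xi = \xi^{*} := \bar{Y} + \frac{\hat{\sigma}_Y^2}{\bar{Y}}, \\
T(\xi^{*}) &= \sqrt{1 + \frac{n\bar{Y}^2}{\hat{\sigma}_Y^2}}.
\end{align*}
```

For the data in Table 1, $\bar{Y} = 5.4$ and $\hat{\sigma}_Y^2 = 63.2 / 5 = 12.64$, giving $\xi^{*} \approx 7.74$ and $T(\xi^{*}) \approx 3.85$. This agrees with the turning point at approximately 8 noted for Table 2. Under these assumptions, $T$ increases for $\xi < \xi^{*}$ and decreases towards 1 for $\xi > \xi^{*}$.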
Zumbo and Jennings (2002), using their novel contamination model, concluded that the paired samples t-test had an inflated type I error rate with increasing asymmetric contamination. However, our marching observation simulations indicate that the effect of a single outlier on this test is dependent on sample size, magnitude and direction of the outlier, and could lead to increases and decreases in the NHRR. It should be noted that the simulations of Zumbo and Jennings (2002) consisted of situations in which the underlying distributions were contaminated with outliers and simultaneously a true null hypothesis is maintained. In contrast, our simulations are based on the fulfilment of correct assumptions prior to the inclusion of the marching observation.

Our simulations demonstrate the seemingly paradoxical effect of large outliers on the performance of the paired samples t-test, and although we concur with Zimmerman (2011) that rank based methods do not necessarily eliminate the influence of outliers, the simulations indicate that Yuen's paired samples t-test and the Wilcoxon signed rank sum test have robust behaviour in the presence of a single outlying observation.

In the preparation of this paper, methods for outlier detection in the conditions above were attempted, but we were unable to identify a suitable method. With reference to paired samples, Preece (1982) states that formal procedures for the detection and rejection of outliers are of negligible use for small sample sizes. Further debate and investigation into outlier detection methods offers an area for further research.

Acknowledgements

The authors thank the reviewers, and the editor, for their generous and insightful comments. Their valuable contributions have resulted in a significantly improved manuscript.

References

[1] Aguinis, H., Gottfredson, R. K., and Joo, H. (2013): Best-practice recommendations for defining, identifying, and handling outliers. Organizational Research Methods, 16(2), 1–32.

[2] Blair, R. C., and Higgins, J. J.
(1985): A comparison of the power of the paired samples rank transform statistic to that of Wilcoxon's signed ranks statistic. Journal of Educational Statistics, 10(4), 368–383.

[3] Box, G. E., and Muller, M. E. (1958): A note on the generation of random normal deviates. The Annals of Mathematical Statistics, 29(2), 610–611.

[4] Chaffin, W. W., and Rhiel, G. S. (1993): The effect of skewness and kurtosis on the one-sample t test and the impact of knowledge of the population standard deviation. Journal of Computation and Simulation, 46, 79–90.

[5] Fradette, K., Keselman, H. J., Lix, L., Algina, J., and Wilcox, R. (2003): Conventional and robust paired and independent samples t-tests: Type I error and power rates. Journal of Modern Applied Statistical Methods, 2(2), 481–496.

[6] Gibbons, J. D., and Chakraborti, S. (2011): Nonparametric statistical inference. In: M. Lovric (Ed): International Encyclopedia of Statistical Science, 977–979. Berlin: Springer.

[7] Herrendörfer, G., Rasch, D., and Feige, K. D. (1983): Robustness of statistical methods II. Methods of the one-sample problem. Biometrical Journal, 25, 327–343.

[8] Levine, T. R., Weber, R., Hullett, C., Park, H. S., and Lindsey, L. L. M. (2008): A critical assessment of null hypothesis significance testing in quantitative communication research. Human Communication Research, 34(2), 171–187.

[9] Posten, H. O. (1979): The robustness of the one-sample t-test over the Pearson system. Journal of Statistical Computation and Simulation, 6, 133–149.

[10] Preece, D. A. (1982): T is for trouble (and textbooks): A critique of some examples of the paired-samples t-test. The Statistician, 31(2), 169–195.

[11] R Core Team (2014): R: A language and environment for statistical computing. Vienna: R Foundation for Statistical Computing. https://www.r-project.org/.

[12] Rasch, D., and Guiard, V. (2004): The robustness of parametric statistical methods. Psychology Science, 46, 175–208.

[13] Ringwalt, C., Paschall, M. J., Gorman, D., Derzon, J., and Kinlaw, A.
(2010): The use of one- versus two-tailed tests to evaluate prevention programs. Evaluation & the Health Professions, 34(2), 135–150.

[14] Wilcox, R. R. (2005): Introduction to robust estimation and hypothesis testing. San Diego, CA: Academic Press.

[15] Yuen, K. K. (1974): The two-sample trimmed t for unequal population variances. Biometrika, 61, 165–170.

[16] Zimmerman, D. W. (1997): A note on the interpretation of the paired samples. Journal of Educational and Behavioral Statistics, 22(3), 349–360.

[17] Zimmerman, D. W. (2011): Inheritance of properties of normal and non-normal distributions after transformation of scores to ranks. Psicologica, 32(1), 65–85.

[18] Zumbo, B. D., and Jennings, M. J. (2002): The robustness of validity and efficiency of the related samples t-test in the presence of outliers. Psicologica, 23(2), 415–450.