Nikola Dobrić
University of Klagenfurt, Austria

2018, Vol. 15 (2), 9–24
revije.ff.uni-lj.si/elope
doi: 10.4312/elope.15.2.9-24
UDC: 37.091.26:808.1(091)

Reliability, Validity, and Writing Assessment: A Timeline

ABSTRACT
Looking at the issue of validity and test validation, the historical and theoretical progression has been well described both when it comes to educational assessment in general and language assessment in particular. A clear progression can be seen starting in the 1920s and culminating in the late 1980s/early 1990s (with minor notable developments since), and it is an advancement motivated and driven almost solely by new theoretical and practical considerations. Securing validity and validation with regard to writing assessment in particular, however, took a more winding route and was primarily shaped by a power struggle between externally administered standardized testing (and the supporting administrative bodies) on one side, and the practicing teachers of writing at higher education institutions on the other. The paper at hand outlines this evolution, gives a timeline of the events and major developments that have fueled it, and explores the cutting edge of the field today.

Keywords: language testing; writing assessment; historical development; overview; validation; validity; standardization

1 Introduction – Indirect Assessment of Writing
If we leave aside the writing examinations one reads of starting in Ancient China almost two millennia ago (as part of the rigorous assessment of civil servants underway during the Han Dynasty in the 2nd century AD) as too far removed contextually and in time, and somewhat historically inaccessible, we can say that the earliest more contemporary records dealing with the assessment of writing in Western countries emerged in the mid-19th century. At the time, the turn was being made from the Western tradition of open debate (stemming originally from the Ancient Greek philosophical tradition) towards a more standardized model of uniform examinations (itself conceptually influenced by the previously mentioned Chinese tradition). In essence, universities across the Western world usually favored oral examinations, a tradition stemming from the Socratic approach to higher education.
However, because of the displacement created by the Industrial Revolution, school-age children were suddenly being put behind school desks en masse (a result of new compulsory education laws), a development which demanded a faster and more resource-friendly way to test larger numbers of students. One of the first actual records of written examinations entering a national dialogue is Horace Mann's call for written tests to replace oral examinations in Boston schools in 1840 (Huot, O'Neill, and Moore 2010, 496).1 In the 1840s oral tests were the standard in American schools, and the move was to introduce written tests in order to facilitate objectivity and reduce bias. Additionally, Mann also wished to build a transparent (standardized) system according to which one could compare the quality of different schools and teachers. As a result of this and similar calls, written tests were relatively quickly introduced across the United States, such as Harvard's Exams in Writing in 1874. With this a new writing component was added to the Harvard University entrance exam, which included a short, speeded essay rated by the resident teachers and based on 'such works of standard authors as shall be announced from time to time' (Hobbs and Berlin 2001, 252). However, not long after the launch of the first essay-based writing exams at American universities came critical voices. The criticism focused, on the one hand, on the unreliability and subjectivity of rating the essays (ironically enough, the same criticism that had motivated the switch away from oral examinations in the first place) and, on the other, on their lack of practicability and usability in a situation of increased numbers of students to process. As early as 1880 there were publications highlighting the problems of unreliability inherent in the early essay-based writing examinations (Huddleston 1952). For example, the study published by Hopkins in 1921 demonstrated that the score a student received as part of the College Board exam statistically depended more on the rater and on the administration conditions (when and where the exam took place) than on the actual writing performance, a situation also found in many European countries administering writing tasks as part of high-stakes examinations (Nikolov 2009). The Certificate of Proficiency in English (CPE), instituted in London in 1913 by Cambridge Assessment (an organization that went on to become the leading international English language examining institution in the world), was, with its emphasis on essay (composition) writing, to face similar criticism. These very real issues were further emphasized not just by continuous academic scrutiny (in a number of studies, such as Starch and Elliott (1912), Hopkins (1921), Sheppard (1929), and others), but also by both demographic and political developments impacting higher education in the USA at the time.

1 The history of the development of standardized testing and its influence on writing assessment is best documented within the American framework, and hence this brief historical overview will mostly follow the story in that context, with the note that the development in other Western countries took a similar form, with the exception of the UK, where the emphasis on direct methods of writing assessment and a focus on (context) validity was always more present.
Demographically, the numbers of pupils and students in the US rose exponentially, because starting from 1852 the individual states began adopting the recently approved universal (and compulsory) public education laws. Politically, this fueled the foundation of the College Entrance Examination Board (CEEB) in 1899 as a not-for-profit organization aiming to expand access to higher education. And while this did mark a very important development, certain actions regarding the organization of assessment within higher education (such as the centralization and outsourcing of language testing) created a massive power struggle and by extension had a great impact on how the validation of writing assessment developed in the subsequent century. Backed by academic publications pointing out the severe problems of the unreliability of essay-based assessment of writing, weighed down by the increased numbers of students in need of processing, under pressure from high school representatives for a more standardized arrangement of entrance examinations into universities, and somewhat swayed by internal power issues, the CEEB decided to replace direct writing assessment (essays), then perceived as unreliable, with a much more reliable (though, of course, much less valid) indirect method. Unlike in the US, the socio-economic context in the UK did not demand such an increase in processing power, which made it possible to maintain a much greater focus on context validity and to retain direct assessment of writing as the predominant method (for the CPE, unwaveringly, for more than a century now) (Weir, Vidakovic, and Galaczi 2013). Nevertheless, the assessment of writing in higher education institutions in the USA (and many other countries) was from that point on conducted almost solely in the form of multiple-choice tests and, to a large extent, mostly outsourced to private institutions as well. This was to remain so up until the 1960s, as the issue of unreliability came up periodically and repeatedly in academic publications as the major factor in favor of indirect methods (such as Traxler and Anderson (1935), Stalnaker (1936), Coward (1952), and others). Fortunately, this emphasis on continuously pointing out the unreliability of direct writing assessment was to backfire in 1961 with a notable publication by Diederich, French, and Carlton. Their study dealt with the problem of readers (raters) not being able to agree when rating writing performances, identifying that as the major source of error when it comes to score stability (reliability) and, ultimately, the fairness of direct writing examinations. In brief, they asked 53 raters to score 300 writing performances on a 9-point scale, with the result that 94% of the papers received at least seven different scores (and inter-rater agreement peaking at only 0.31). Looking from today's perspective, we can easily see the flaws in the study, as the readers were bound to disagree given that they worked without any guidance, common training, or shared point of reference (as was pointed out even at the time in Braddock, Lloyd-Jones and Schoer (1963)). However, the results of the study regarding inter-rater (dis)agreement are not the reason why this early work is considered so relevant. Its importance actually stems from the fact that it got the academic community thinking about what criteria raters actually focus on when rating written work.
This is because the study, apart from looking at agreement, also collected some eleven thousand introspective reports from the raters, revolving around the features they were looking at when rating the performances (coming up, via factor analysis, with ideas, form, flavor, mechanics, and wording as the most salient features). This particular aspect of looking beyond inter-rater agreement as an expression of reliability, into issues such as the features of writing performance that influence rating, rater scores, and rater training and experience, was to establish a new baseline for researching writing assessment that was to mark the next stage of the development of the field (Huot, O'Neill, and Moore 2010, 502). Studies moving towards establishing proof that direct writing assessment can produce reasonably reliable scores (such as Godshalk, Swineford, and Coffman (1966) and their outline of holistic scoring, for example) further opened the door to a generally increased focus on direct writing assessment (but not, as we shall see, to a much-needed shift of focus from reliability to validity).

2 Direct Assessment of Writing – Focus on Reliability
Despite these developments and motions, the CEEB fought back all the way up to the late 1980s to preserve multiple-choice tests as the foundation of standardized (and usually externally managed) assessment of writing at universities and colleges in the USA (such as the well-known TOEFL examinations). The struggle was, however, futile, as the push in the other direction (towards direct examination of writing proficiency) was once more coming from both political and academic quarters (Breland 1983, 2). Politically, one of the issues was that the overwhelming focus on indirect assessment of writing was extremely worrying, because it effectively led to hardly any writing being taught within the American educational system by the mid-1970s (Applebee 1981). Another political issue was the pressure coming from the English Departments at higher education institutions across the country. Rankled both by the feeling of disempowerment caused by the prescribed external testing services and by the sense that multiple-choice testing of writing did not really conform to their experience, as teachers of English and writing, of what writing ability represents, these departments took a firm stand to change things. Within the US, the first major battle was fought in 1971 by the English Departments at California State University, whose staff successfully rejected the institutionalization of an externally administered multiple-choice test for their first-year English equivalency exam (White 2001, 308), following this up with the creation of their own locally administered examination. Academically, with authors such as Paul Diederich changing their stance towards direct writing assessment, and with similar voices coming collectively from other publications (such as Brown (1978), Coffman (1971), Cooper (1977), Cooper and Odell (1977), and others), the move towards the abandonment of the indirect method of assessing writing ability was a clear one. With a wealth of information on where the sources of reliability problems in rating written performance were to be found, the two main ones being rater inconsistency and sampling bias, the majority of studies in this period revolved around attempts to address these.
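To make the kind of rater inconsistency reported in studies like Diederich, French, and Carlton (1961) concrete, the sketch below simulates a script-by-rater score matrix and computes the two headline statistics mentioned above: the share of scripts receiving at least seven distinct scores on a 9-point scale and the mean pairwise inter-rater correlation. The data are simulated purely for illustration; the noise level is an assumption and the output will not reproduce the original ETS figures.

```python
# Illustrative simulation (not the original 1961 data) of unguided raters
# scoring a set of scripts on a 9-point scale.
import itertools
import random
import statistics

random.seed(1)

N_SCRIPTS, N_RATERS = 300, 53  # as in the 1961 study; everything else is invented

# Each script has a "true" quality, but every rater adds large idiosyncratic
# noise (no shared rubric, training, or point of reference).
true_quality = [random.uniform(1, 9) for _ in range(N_SCRIPTS)]
scores = [
    [min(9, max(1, round(q + random.gauss(0, 2.5)))) for _ in range(N_RATERS)]
    for q in true_quality
]

# Statistic 1: share of scripts receiving at least seven distinct scores.
spread = sum(len(set(row)) >= 7 for row in scores) / N_SCRIPTS

# Statistic 2: mean pairwise inter-rater (Pearson) correlation.
def pearson(x, y):
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / (len(x) * statistics.pstdev(x) * statistics.pstdev(y))

by_rater = list(zip(*scores))  # transpose: one score list per rater
mean_r = statistics.mean(
    pearson(by_rater[i], by_rater[j])
    for i, j in itertools.combinations(range(N_RATERS), 2)
)

print(f"Scripts with >= 7 distinct scores: {spread:.0%}")
print(f"Mean pairwise inter-rater correlation: {mean_r:.2f}")
```

With unguided, noisy raters the simulation produces the same qualitative picture the early studies complained about: a wide score spread per script and low pairwise correlations.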
With a better understanding of the rating process and what it entailed, there was a rise in research on scoring (such as that relating to analytic scoring with Diederich (1974); holistic scoring with Godshalk, Swineford, and Coffman (1966) or Cooper (1977); primary-trait scoring with Mullis (1980); or syntactic scoring with Hunt (1977); writing scales in Jacobs et al. (1981), Huot (1990) or Ericsson and Simon (1993); rater training and agreement with McIntyre (1993) or Weigle (1994); and other facets of score stability). The fact that the socio-economic and academic conditions were favorable for direct assessment of writing to be promoted meant that by the mid-1980s most universities (and their language departments) were already implementing locally administered direct tests of writing ability in much the same way. This was a very positive step, as it allowed tests to be developed by faculty who worked together within the same program, for their own purposes, with their own goals, and within a shared system and shared curriculum, while still maintaining reliability at an acceptable level. However, as we mentioned, while the switch to performance-based testing of writing was a very important development, what did not change (but should have) was the focus, which needed to shift from the practicability of testing and reliability issues to the much more important questions of aspects of validity unrelated to scoring (mostly present within the UK tradition, though).

3 Direct Assessment of Writing – Focus on Validity
Validity in terms of assessment indicates that the results obtained from the given measurement procedure objectively reflect the phenomenon the said procedure is intended to measure, and that the measurement at hand has not been obtained due to any measurement-irrelevant factors or chance. In this respect validity depends on the degree to which quantified measurements of presumptive behavior or ability are blurred by other factors (Sigott 2004, 44). Different from reliability, which can be seen as the quality of the data collected, validity is the quality of the inferences (and decisions) we can make following the measurement (Chan 2014, 9). In this sense validity also depends on the degree to which the aspects of the said behavior or ability which the test is supposed to measure are covered by the given test (Sigott 2004, 44). Validation is the process in which we gather and evaluate evidence to support the said appropriateness, meaningfulness, and usefulness of the inferences and decisions we make based on measurement scores (Zumbo 2007; 2009). Unfortunately, one of the products of the described power struggle between the (external) testing providers (who were seen as one of the actors advocating the indirect writing examinations) and the public bodies (such as the CEEB) on one side and the researchers and practicing teachers of writing (who felt unheard and disempowered) on the other, was that writing assessment did not catch up with the theoretical developments revolving around validity that had actually been under way in educational and language testing since the 1950s. Although there were voices calling for more insight into the content of the tests (the likes of Wiseman (1949) or Wiseman (1956)), most of the research on the assessment of writing taking place between 1960 and 1990 was into the reliability of the direct methodology of writing assessment (the same as during the previous stage of development regarding indirect methodology).
The focus did not really shift towards validity until the early 1990s. The earliest accounts of validity considered within the framework of language testing can be traced to authors such as Lado (1961) and Davies (1968).2 Their account of validity focused on the face value of the test (the so-called validity by assumption), on the content of the test, on control of extraneous factors, on conditions required to answer test items, and on empirical insight (D'Este 2012, 63). At the same time, Campbell and Fiske (1959) described the validity of language tests as having a convergent dimension (measures that should be related are found to be related) and a discriminant dimension (measures that should not be related are found not to be related). Campbell and Stanley (1966), as one more early account, featured the concepts of internal and external validity. It is only decades later that we see perhaps the most influential account of validity within language testing, that of Bachman (1990). It was written as a reaction to the unified theory of validation famously put forward by Messick in 1989, which Bachman embraced, but went on to identify three factors that support this overall validity (D'Este 2012, 67): content relevance and content coverage (what is known as content validity); criterion relatedness (i.e. criterion validity); and meaningfulness of construct (construct validity). Content validity here refers to the domain specifications which underlie the test, criterion relatedness refers to a meaningful relationship between test scores and other indicative criteria, while construct validity relates to the extent to which performance on the test is consistent with the predictions we make on the basis of the theory of abilities (Bachman 1990, 246–69). In addition, Bachman also imported the consequential (or ethical) basis of validity from Messick and other authors, which refers to the fact that tests have not been designed to be used in an academic vacuum but rather have real-life applications and are influenced by society as a whole. The final important notion argued by Bachman was the inclusion of the concept of reliability within validity itself. His argument was that when it comes to language testing of any kind (and especially so in the case of writing assessment), it is not easy to distinguish between the effects of different test methods or between traits and test methods (1990, 239). Stemming from Bachman's account, the 1990s and early 2000s saw several related discussions and models of validity in language assessment, such as Kane (1992; 2001), Sigott (1994), Alderson, Clapham and Wall (1995), Miller and Linn (2000), and Weir (2005).

2 While the earliest considerations of validity within psychological and educational settings trace back further, with beginnings in the 1920s (and authors such as Scott (1917), Thorndike (1918), Ruch (1929), or Tyler (1934)), the major movement started in the 1950s (with the work related to the first Standards for Educational and Psychological Testing of 1955).
If we take as an example the most recent and most language-assessment-related account, Weir's socio-cognitive model of evidence-based test validation (2005, 17–37), it presents overall test validity as an interplay between five different types (discussed in more detail further on), all influenced by test-taker characteristics: cognitive validity, context validity, scoring validity, consequential validity, and criterion validity. Cognitive validity has as its emphasis the determination of the cognitive processes which are to be used as a model for designing test items. Context validity follows the idea of moving away from a sole focus on linguistic representation and including the social and cultural dimension within which the writing performance has been produced (Weir 2005). Scoring validity, in line with Bachman's already discussed elaborations on reliability belonging within validity, focuses on all aspects of the assessment that could have an impact on the scores. Finally, criterion-related validity in this account keeps the focus as traditionally established, and revolves around a comparison to any reliable external measurements.

4 Direct Assessment of Writing – Discussion
What we can see if we dissect all of the accounts presented above, from Bachman onwards, is that the contemporary view of validity and validation in language testing understands validity as a unified concept (which includes reliability), and as an aspect of a test's use to be regarded as a whole. However, we can also see a practical need for a focus on particular individual aspects (types) of validity that should be covered. These individual aspects reflect sources of evidence for validity that we can tap and steps to be taken within any validation effort (regardless of the theoretical preference in the terminology marking the given aspects, something clearly evident in terminological disagreements such as construct validity vs. validity, the division into content, criterion, and consequential validity, and so on). Applying this kind of practice-driven (authenticity-driven) model to direct writing assessment, we can start by understanding it as being strong or weak (McNamara 1996). Direct writing assessment in a strong sense incorporates tasks which represent real-world performance and are judged according to real-world criteria, the focus being on the successful fulfilment of the task and not (only) on language proficiency. Direct writing assessment in a weak sense puts the focus on language use, where the task does indeed resemble a real-world purpose, but the goal is only to extract a writing performance for the purpose of ascertaining language competence. It is a distinction that influences everything from the construct to the scoring, and it is safe to say that in most cases writing assessment in higher educational settings takes the form of the latter (Tsai 2002, 1). Once a choice is made in this direction, further conceptualization (and, retroactively, understanding) of the construct (including all other elements of validity as well) comes essentially at one of three stages (Bachman and Palmer 1996; Weigle 2002): the design stage, the operationalization stage, and test administration.
The design stage is extremely important for ensuring validity and it should, according to McNamara (1996, 43), revolve around sampling the task from the communicative situation the test is to be a proxy of – this would include consulting expert informants, examining the available literature, analyzing and categorizing communicative tasks, collecting and examining relevant texts, and deciding on a broad test method (Tsai 2002, 2). In the operationalization stage the information gathered in the design stage is transformed into concrete test specifications and detailed procedures the test takers and readers (raters) are to follow. These specifications should generally include the description of the test content, criteria for correctness, and a sample of tasks (Douglas 2000), and comprise scoring rubrics, writing prompts, rating scales, and similar (Weigle 2002). Finally, the test administration phase focuses on the actual tests being administered, where the validation revolves around the data that stem from the elicited writing performances (such as the scores and the washback information), and the stage is firmly grounded in different statistical methods. This division into practical stages of test development and administration is useful, as it points to the different steps which should be taken at different stages in order to ensure the validity of a particular interpretation and use of a particular test, and to the fact that a useful way to observe the process of validation is to imagine it as a priori (before the test administration) and a posteriori (after it has been administered). Additionally, it is useful to observe validity in terms of the types of evidence that support it, along the lines of Bachman's division into evidential and consequential bases of validity (1990, 248).3 Finally, the interplay between validity and reliability is also worth discussing, together with how Bachman (following other similar voices such as Loevinger (1957), Messick (1989) or Cronbach (1990)) argues that the two are in fact complementary aspects of identifying, estimating, and interpreting different sources of variance in test scores (1990, 239).

3 The evidential basis of validity, according to Bachman, is grounded in evidence that supports the relationship between test scores and their interpretations and subsequent use (1990, 248). The consequential (or ethical) basis of validity refers to the fact that tests have not been designed for use in an academic vacuum, but rather have real-life applications (Bachman 1990, 280).

4.1 A Priori Validation
Unlike the subsequent stage of a posteriori validation, which is largely grounded in quantitative data and statistical methodology (and which is usually feared and avoided by practicing teachers), a priori validation efforts are usually conducted on a regular basis by most practitioners. The reason is that a priori validation takes place during the test compilation phase, and most commonly includes the very logical, common-sense steps anyone seriously compiling an assessment tool would think of (though it can also incorporate some statistics-based operations). A priori efforts basically focus on the commitment of the test designer to make sure that he or she understands the behavior or performance which is the target of the assessment, that the content of the test is relevant to the real-life behavior or performance at hand, and that the criteria of success (correctness) are clear both to the test-taker and the test-rater. This stage of validation is hence much less opaque to non-experts and, as indicated, covers the stages of test design and test operationalization. Starting always with the understanding of the behavior or performance of interest, the first step in test design (and a priori validation) is the consideration of construct validity in the narrow sense.
As we noted previously, a construct can be seen as a definition of people's attributes that are assumed to be reflected in their performance (Cronbach and Meehl 1955, 283). This means that to properly measure the 'extent' of someone's writing ability we need to understand the nature of that ability as it is, in our minds, and we need to be able to describe it in a sufficiently clear manner. Hence, construct validation is related to the basic question of what is the nature of that something that an individual possesses or displays that is the object of our measurement (Messick 1975, 957). The part that takes place a priori (sometimes referred to as the logical part of construct validation) involves first defining the construct (the ability) theoretically, or rather scouring the existing literature for descriptive accounts of writing ability. Accounts such as these try to capture the inner workings of the cognitive processes at the basis of what we could call 'an ability to write' (and often perhaps the 'ability to write in a foreign/second language' as well). Sometimes this is quite challenging, as writing assessment as a discipline has not really excelled at writing about writing (Huot 1990), but rather at the more practical work on assessment methodology (such as writing scales and rater training procedures). That is why there is a feeling that theoretical accounts of writing competence as a phenomenon (unlike accounts of language competence in general) are somewhat lacking, and that more work still needs to be done in this direction (Yancey 1999). Nonetheless, finding (or setting up) a comprehensive and meaningful theoretical account of the construct in the design stage allows us to perform many actions which can assure subsequent high levels of validity. Understanding the construct well allows us to identify the sources for the most meaningful sampling of test content, and to identify which test method would best elicit behavior (performance) which would most resemble the construct at hand and the relevant real-life performance. These decisions are then carried over to the operationalization stage, where the logical (theoretical) part of construct validation provides us with the means of linking the actual ability we are measuring with the content of the test and with the scores that end up quantitatively representing our measurement. Adjacent to this understanding of the ability itself is the insight into the metacognitive strategies (or strategic competences, as termed in Bachman and Palmer (1996)) seen as mediating between the trait and the context, and comprising facets such as goal setting, assessment of the needs to achieve the purpose, planning, monitoring, and organization of language and topic – all contributing, in fact, to what Weir refers to as cognitive validity (2005, 86).
In terms of writing assessment, Shaw and Weir (2007, 34) recognize six different aspects of cognition behind writing ability:
• macro-planning: gathering of ideas and identification of major constraints such as genre, readership, and goals;
• organization: ordering the ideas and identifying relationships between them;
• micro-planning: focusing on individual parts of the text and considering issues such as the goal of the paragraph in question, including both its alignment with the rest of the text and its ideas, and the sentence and content structure within the paragraph itself;
• translation: the content previously held in a propositional form is transferred into text;
• monitoring: focusing on the surface linguistic representation of the text, on the content and the argumentation presented in it, and on its alignment with the planned intentions and ideas; and
• revising: results from the monitoring activity and involves fixing the issues found to be unsatisfactory.

Alongside the cognitive aspects behind writing as a performance, Weir also emphasized the need to consider the test-taker actively in terms of personal individualities. He distinguished between three classes of test-taker characteristics (Shaw and Weir 2007, 5):
• physical/physiological characteristics, which include any special needs on the side of the test-taker, such as those stemming from dyslexia or eyesight impairment;
• psychological characteristics, including test-taker motivation, personality type, learning styles, and more; and
• experiential characteristics, incorporating factors such as the degree of test-taker familiarity with the test format or content.

These individual characteristics can then be viewed as systematic, if they affect a test-taker's performance consistently (such as dyslexia or personality traits), or unsystematic, when they have a random, perhaps one-off effect (for example, motivation or test format familiarity). Once we have a reasonable grasp of the nature of writing ability (and a complete grasp in theoretical terms is quite hard to achieve when it comes to the social sciences and humanities), the test-takers' characteristics at play, and the cognitive processes taking place in our minds while writing, then while still in the design stage we need to tackle the context surrounding the test. Related to the social and situational background (Weir 2005, 57), this revolves around the task setting and conditions necessary to ensure the appropriateness of the test content, clarity and conformity to the construct, the intended use, and the intended stakeholders (Shaw and Weir 2008, 63):
• setting = task: includes the expected task format (consideration of the genre), the purpose (expected form and communicative function or the nature of information in the text), clear knowledge of the criteria (task instructions), the weighting (focusing on the relative contribution of the different parts/aspects of the test), the text length, the constraints (i.e.
speededness, additional resources, and more), and the expected writer-reader relationship (addressee information);
• setting = administration: incorporates the physical conditions (venue, background noise, lighting, and more), uniformity of administration (same specifications for all test takers), and the security involved (controlled access); and
• linguistic demands = text input and output: the focus is on the lexical resources (vocabulary), structural resources (morpho-syntax), the discourse mode (considerations of genre, rhetorical task, and pattern of exposition), the functional resources (referring to the fulfillment of the intended communicative purpose of the writing), and the content knowledge (background and subject matter knowledge).

Also important to mention here is that, unlike most of the other a priori validation efforts, ascertaining the validity of contextual (content) features such as the appropriateness of the lexical resources, structural resources, or the discourse mode can be undertaken empirically (quantitatively) as well. The last a priori piece in the jigsaw puzzle that is the validation (and ultimately the degree of validity) of a particular writing assessment is in the operationalization stage, where we define instructions on how to, essentially, obtain reliable scores. Called scoring validity by Weir (2005, 117), its a priori considerations are listed by Shaw and Weir as follows (2007, 146):
• criteria and type of the rating scale: focuses in essence on the marking scheme and the different approaches to assigning a number to the measurement (i.e. primary trait scoring vs. holistic scoring vs. analytic scoring);
• rater characteristics: identified as one of the biggest causes of score variability when it comes to writing assessment; the rater is observed both in terms of the rater-candidate and the rater-item interactions (again dividable into physical/physiological, psychological, and experiential characteristics);
• rating process: the different prescribed rating procedures (such as, for example, the principled and pragmatic two-scan approaches, the read-through approach, or the provisional mark approach);
• rating conditions: includes the setting (on site or at home, for example), the medium of the writing performance (handwritten vs. electronic), time constraints (deadlines), and scaffolding (ways in which examiners are advised); and
• rater training: perhaps the most crucial aspect in achieving higher levels of inter-rater agreement (lowering the influence of severity and other effects that stem from the rater).

After having completed all of the listed actions, and ultimately having arrived at a version of the test that in its content and specification fully reflects the theoretical description of the behavior (ability) we are interested in measuring, it is time to administer the test and then see how much the results (scores) obtained do in fact conform to the expectations of validity we have built up in the design and operationalization stages.

4.2 A Posteriori Validation
One of the first aspects of validity that most research into test interpretation employs after the test has been administered is generally termed criterion validity. As indicated, it revolves around correlations with other comparable measures of the same ability.
This is one of the go-to procedures in smaller-scale validation efforts, as it is simpler to set up in comparison to other procedures – all one needs is an established, comparable existing (or future) measurement which is already considered valid, and against which the obtained measurements (scores) can be compared. Traditionally, we find two kinds of criterion-based validity: concurrent and predictive (Bachman 1990, 248). Concurrent criterion relatedness involves one of several commonly employed procedures: examining differences in test performance among individuals at different levels of proficiency, examining correlations among different measurements of the given ability, correlations with existing standardized tests, comparisons with teacher ratings, with informant data, with self-assessment, and with different versions of the same test (Weir 2005, 209). Predictive criterion relatedness focuses on demonstrating a link between test scores and some future performance, where the test scores predict the criterion behavior of interest – for example, having a writing test serve as a predictor of place allocations in writing courses of different levels, or linking predictive criterion validity to comparison with external performance benchmarks such as the CEFR (Weir 2005, 209). In particular, for the purposes of writing assessment, Shaw and Weir list the following three procedures (2008, 230):
• cross-test comparability: the procedure encompasses the effort of comparing different and yet related language proficiency measures (such as comparing the ratings of two different tests and ascertaining the levels of their relatedness);
• comparison with different versions of the test: relies on the existence of a relationship between two or more different versions of the test involving the same test takers and other comparable conditions; and
• comparison with external standards: using standardized and reputable existing measurements (such as national graduation (Matura) exams or well-known British Council or ESOL examinations) as benchmarks for making comparisons to.

The statistical methodology for pursuing further aspects of a posteriori validity then becomes increasingly more complex than just looking at the correlation coefficients one would normally find in criterion-related validity. This is the reason why we mostly see these methods in more serious validation efforts, starting with the focus on scoring validity in an empirical sense. A posteriori analysis of the obtained scores looks into rating scales, inter- and intra-rater agreement, moderation of scores, and then beyond into use and consequences (Weir 2005), and as such involves complex handling of quantitative data. A posteriori analysis of scoring validity involves focusing on two major aspects (Shaw and Weir 2008, 190) that are also strongly grounded in statistical methodology:
• post-exam adjustments: these include statistical methods (such as scaling or Rasch measurements) aimed at artificially correcting scores based on the attested interplay of different features surrounding the test (severity and leniency, central tendency, and more); and
• grading and awarding: the final process is the one of assigning cut-off scores and issuing certificates (if applicable).

However, one can argue that this part of empirical validation deals much more with the consistency of measurement (and hence reliability) than with what we would normally understand as validity, even if we look at the two as merged.
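To make the correlation-based end of this methodology concrete, the sketch below computes a concurrent criterion correlation between a locally administered writing test and an external benchmark taken by the same candidates – roughly the 'comparison with external standards' procedure listed above. The scores are invented purely for illustration, and the helper is a plain Pearson correlation rather than any prescribed instrument.

```python
# Minimal sketch of a concurrent criterion-relatedness check. All scores are
# invented; in practice they would come from the same group of candidates
# taking both the local writing exam and an external benchmark examination.
from statistics import mean, pstdev

local_scores = [62, 71, 55, 80, 90, 47, 68, 74, 59, 83]  # local writing exam (0-100)
benchmark    = [5, 6, 4, 7, 8, 4, 6, 6, 5, 7]            # external band scores (1-9)

def pearson(x, y):
    """Pearson correlation between two equally long score lists."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (
        len(x) * pstdev(x) * pstdev(y)
    )

print(f"Concurrent criterion correlation: r = {pearson(local_scores, benchmark):.2f}")
```

Even so, such correlational checks remain evidence about how scores relate to other scores rather than about what those scores mean.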
Hence, most a posteriori efforts at validation in the strictest sense revolve around construct validation in its narrow meaning. Empirical construct validation focuses on finding empirical evidence (in terms of correlations and experimentation) which goes towards confirming or disproving a particular interpretation of the obtained test scores (Bachman 1990, 260). It functions in terms of two very broad approaches to gathering validity evidence – correlational (and other psychometric) methods and experimental approaches. The correlational approaches can be seen as either exploratory or confirmatory. The exploratory approach concentrates on identifying the traits that influence test performance by examining the correlations among a large set of measures.4 The confirmatory approach focuses on a particular hypothesis about the traits and attempts to confirm or reject it. On the other hand, the classical experimental design for testing hypotheses (which has until recently received less attention) involves attempts at manipulating the variables involved in a testing situation, such as, for example, pre-test/post-test experimentation, and similar (Bachman 1990, 258–66).

4 Related psychometric approaches also include factor analysis, multitrait-multimethod investigation, and more.

Finally, looking beyond the test and the scores, we have the social effect that test use may have, usually observed within the framework of consequential validity. This involves us considering the ethical basis of validity as incorporating deliberations which are neither scientific nor technical, and which focus on the influence of a particular (educational) system on the interpretation of a test, as well as on the washback effect that test use has on that particular system in reverse (1990, 279). In practice, this means observing (see Bachman (1990, 280–84) and Weir (2005, 210)):
• the rights of the test-takers (secrecy, confidentiality, privacy);
• the values inherent in test developers and raters;
• the values inherent in the particular social system;
• background knowledge; and
• influence on teaching and learning.

5 Direct Assessment of Writing – The Cutting Edge and Outlook
If we review the entire discussion presented in the previous two sections, looking both at the general push and pull within the development of writing assessment and at the issues of reliability and validity within language testing in general, and assessing writing in particular, we can with some degree of confidence suggest a global outline of what validity in terms of direct assessment of writing entails, and how it can be supported:
• validity entails the application of scientific rigor to the process of interpretation of test scores by focusing on both rational argument and empirical evidence;
• even though it can be observed as a conceptual whole, validity is in essence composed of different aspects, and this division is extremely useful when it comes to the practical (empirical) conduct of validation efforts;
• reliability is to be considered an integral part of overall validity, and especially so when it comes to writing assessment;
• writing as a language skill deserves special consideration when it comes to its theoretical description within the larger concept of language competence; and
• equal attention should be given to the a priori and a posteriori efforts of ensuring validity.
If we were to translate this general overview, emerging from the overall discussion, into the practical steps needed to assure the highest possible degree of validity, we can distill them into a validation checklist, as presented in Figure 1 below (largely revolving around, and adapted from, Shaw and Weir (2007)). Following a list such as this one can help any practitioner easily pinpoint most of the steps they need to take when setting up the assessment, and can/should take after they have administered the test, in order to ensure the highest level of validity in the context of the notoriously problematic phenomenon that is the assessment of writing. In the end, when it comes to the further development of securing validity within assessing writing ability, it is hard to foresee any new theoretical movements in the future. The core deliberations regarding the nature of validity in educational (and psychological) testing took place between the 1950s and the 1990s, culminating in the extensive account presented by Messick in 1989. Similarly, Bachman's (1990) translation of Messick's (1989) influential account into the framework of language testing basically set the ultimate theoretical underpinnings of validity for this particular field, and even added the somewhat novel consideration of incorporating reliability within validity. However, Bachman's account cannot be seen as bringing anything conceptually new with regard to Messick (perhaps only adding the said notion of reliability being merged with validity), nor can any other account that came after his. The real contribution he and the other authors discussed in this paper made was in furthering the practice of validation and fine-tuning the outline of the sources of evidence and procedures one should focus on to address different aspects of unified validity. This is the direction the development of validity and validation research in language assessment, and in particular that of writing, will most likely follow in the future, with conceptual novelty highly unlikely in this context (apart from any possible changes related to our understanding of the construct itself). One of the areas that will see particular development is the validation of scoring, both in an a priori and an a posteriori sense. A priori, the development will most likely focus on rating scales and their application. For instance, one of the more interesting directions the advance of rating scales is taking is the move away from the vague 'can do' and 'has got' descriptors often found in rating scales to the actual (linguistic) features being weighted and ticked off as present (or not) in relation to ratings. This represents the first step in actually turning away from the purely 'judging' perspective predominant in rating writing performances towards a more 'counting' oriented one. Likewise, a posteriori, the shift to a more 'counting' oriented approach would mean the introduction of, and reliance on, new methods of score adjustment, quality assurance (via agreement studies), computer-aided processes on a wider scale, eye-tracking investigations, and more.

Figure 1. A summary of the prototypical validity effort of any writing assessment.

References
Alderson, J. Charles, Caroline Clapham, and Dianne Wall. 1995. Language Test Construction and Evaluation. Cambridge: Cambridge University Press.
American Psychological Association. 1955. Standards for Educational and Psychological Testing and Manuals. Washington, DC: Author.
Applebee, Arthur N. 1981. Writing in the Secondary School: English and the Content Areas. Urbana, IL: National Council of Teachers of English.
Bachman, Lyle F. 1990. Fundamental Considerations in Language Testing. Oxford: Oxford University Press.
Bachman, Lyle F., and Adrian S. Palmer. 1996. Language Testing in Practice: Designing and Developing Useful Language Tests. Oxford: Oxford University Press.
Breland, Hunter M. 1983. The Direct Assessment of Writing Skill: A Measurement Review. College Board Report No. 83–6. New York: Educational Testing Service.
Brown, Rexford. 1978. "What We Know Now and How We Could Know More about Writing Ability in America." Journal of Basic Writing 1 (4): 1–6.
Campbell, Donald T., and Donald W. Fiske. 1959. "Convergent and Discriminant Validation by the Multitrait-Multimethod Matrix." Psychological Bulletin 56 (2): 81–105. https://doi.org/10.1037/h0046016.
Campbell, Donald T., and Julian C. Stanley. 1966. Experimental and Quasi-Experimental Designs for Research. Boston: Cengage Learning.
Chan, Eric K. H. 2014. "Standards and Guidelines for Validation Practices: Development and Evaluation of Measurement Instruments." In Validity and Validation in Social, Behavioral, and Health Sciences, edited by Bruno D. Zumbo and Eric K. H. Chan, 9–24. New York: Springer.
Coffman, William E. 1971. "On the Reliability of Ratings of Essay Examinations in English." Research in the Teaching of English 5: 24–36.
Cooper, Charles R. 1977. "Holistic Evaluation of Writing." In Evaluating Writing: Describing, Measuring, Judging, edited by Charles R. Cooper and Lee Odell, 3–31. Urbana: National Council of Teachers of English.
Cooper, Charles R., and Lee Odell, eds. 1977. Evaluating Writing: Describing, Measuring, Judging. Urbana: National Council of Teachers of English.
Coward, Ann F. 1952. "A Comparison of Two Methods of Grading English Compositions." Journal of Educational Research 46 (2): 81–93. https://doi.org/10.1080/00220671.1952.10882003.
Cronbach, Lee J. 1990. Essentials of Psychological Testing. 5th ed. New York: Harper & Row.
Cronbach, Lee J., and Paul E. Meehl. 1955. "Construct Validity in Psychological Tests." Psychological Bulletin 52: 281–302. https://doi.org/10.1037/h0040957.
Davies, Alan, ed. 1968. Language Testing Symposium: A Psycholinguistic Perspective. London: Oxford University Press.
Davies, Alan, and Catherine Elder. 2005. "Validity and Validation in Language Testing." In Handbook of Research in Second Language Teaching and Learning, edited by Eli Hinkel, 795–813. Mahwah, New Jersey: Lawrence Erlbaum Associates.
D'Este, Claudia. 2012. "New Views of Validity in Language Testing." EL.LE 1 (1): 61–76. https://doi.org/10.14277/2280-6792/5p.
Diederich, Paul B., John Winslow French, and Sydell T. Carlton. 1961. "Factors in Judgments of Writing Ability." ETS Research Bulletin 1961 (15): i–93. https://doi.org/10.1002/j.2333-8504.1961.tb00286.x.
Diederich, Paul B. 1974. Measuring Growth in English. Urbana: National Council of Teachers of English.
Douglas, Dan. 2000. Assessing Language for Specific Purposes. Cambridge: Cambridge University Press.
Ericsson, K. Anders, and Herbert A. Simon. 1993. Protocol Analysis: Verbal Reports as Data. Cambridge, MA: MIT Press.
Godshalk, Fred I., Frances Swineford, and William Eugene Coffman. 1966. The Measurement of Writing Ability. New York: College Entrance Examination Board.
Hobbs, Catherine L., and James A. Berlin. 2001. "A Century of Writing Instruction in School and College English." In A Short History of Writing Instruction: From Ancient Greece to Modern America, edited by James J. Murphy, 247–89. Mahwah: Lawrence Erlbaum Associates.
Hopkins, Levi Thomas. 1921. The Marking System of the College Entrance Examination Board. Cambridge: The Graduate School of Education, Harvard University.
Huddleston, Edith M. 1952. "Measurement of Writing Ability at the College-Entrance Level: Objective vs. Subjective Testing Techniques." ETS Research Bulletin Series 1952 (2): 165–213. https://doi.org/10.1002/j.2333-8504.1952.tb00925.x.
Hunt, Kellogg W. 1977. "Early Blooming and Late Blooming Syntactic Structures." In Evaluating Writing: Describing, Measuring, Judging, edited by Charles Cooper and Lee Odell, 91–106. Urbana: National Council of Teachers of English.
Huot, Brian. 1990. "Reliability, Validity, and Holistic Scoring: What We Know and What We Need to Know." College Composition and Communication 41 (2): 201–13.
Huot, Brian, Peggy O'Neill, and Cindy Moore. 2010. "A Usable Past for Writing Assessment." College English 72 (5): 495–517.
Jacobs, Holly L., Stephen A. Zinkgraf, Deanna R. Wormuth, V. Faye Hartfiel, and Jane B. Hughey. 1981. Testing ESL Composition: A Practical Approach. Rowley, MA: Newbury House.
Kane, Michael T. 1992. "An Argument-Based Approach to Validity." Psychological Bulletin 112 (3): 527–35. https://doi.org/10.1037/0033-2909.112.3.527.
—. 2001. "Current Concerns in Validity Theory." Journal of Educational Measurement 38 (4): 319–42. https://doi.org/10.1111/j.1745-3984.2001.tb01130.x.
Lado, Robert. 1961. Language Testing: The Construction and Use of Foreign Language Tests: A Teacher's Book. New York: McGraw-Hill.
Loevinger, Jane. 1957. "Objective Tests as Instruments of Psychological Theory." Psychological Reports 1957 (3): 635–94. https://doi.org/10.2466%2Fpr0.1957.3.3.635.
McIntyre, Philip N. 1993. "The Importance and Effectiveness of Moderation Training on the Reliability of Teachers' Assessments of ESL Writing Samples." MA Thesis, Department of Applied Linguistics, University of Melbourne.
McNamara, Tim F. 1996. Measuring Second Language Performance. Harlow: Addison Wesley Longman.
Messick, Samuel. 1975. "The Standard Problem: Meaning and Values in Measurement and Evaluation." American Psychologist 30 (10): 955–66.
—. 1989. "Validity." In Educational Measurement, edited by Robert Linn, 13–103. Washington, DC: American Council on Education and National Council on Measurement in Education.
Miller, David M., and Robert L. Linn. 2000. "Validation of Performance-Based Assessments." Language Testing 24 (4): 367–78. https://doi.org/10.1177%2F01466210022031813.
Mullis, Ina V. S. 1980. Using the Primary Trait System for Evaluating Writing (Report No. 10-W-51). Denver: National Assessment of Educational Progress, Education Commission of the States.
Nikolov, Marianne, ed. 2009. Early Learning of Modern Foreign Languages: Processes and Outcomes. Clevedon: Multilingual Matters.
Ruch, Giles Murrel. 1929. The Objective or New-Type Examination: An Introduction to Educational Measurement. Chicago: Scott, Foresman and Co.
Scott, W. 1917. "A Fourth Method of Checking Results in Vocational Selection." Journal of Applied Psychology 1: 61–66. https://doi.org/10.1037/h0073494.
Shaw, Stuart D., and Cyril J. Weir. 2007. Examining Writing: Research and Practice in Assessing Second Language Writing. Cambridge: Cambridge University Press.
Sheppard, Everett M. 1929. "The Effect of Quality of Penmanship on Grades." Journal of Educational Research 19 (2): 102–5.
Sigott, Günter. 1994. "Language Test Validity: An Overview and Appraisal." AAA: Arbeiten aus Anglistik und Amerikanistik 19 (2): 287–94.
—. 2004. Towards Identifying the C-Test Construct. Frankfurt am Main: Peter Lang.
Stalnaker, John M. 1936. "The Problem of the English Examination." Educational Record 17: 140–43.
Starch, Daniel, and Edward C. Elliott. 1912. "Reliability of the Grading of High-School Work in English." School Review 20 (7): 442–57.
Thorndike, Edward L. 1918. "The Nature, Purposes and General Methods of Measurements of Educational Products." In The Measurement of Educational Products – National Society for the Study of Education Yearbook, edited by Guy Montrose Whipple, 16–24. Chicago: National Society for the Study of Education.
Traxler, Arthur E., and Harold A. Anderson. 1935. "Reliability of an Essay Test in English." School Review 43 (7): 534–40.
Tsai, Constance Hui Ling. 2002. "Issues of Validity in the Assessment of Writing Performance." Teachers College, Columbia University Working Papers in TESOL & Applied Linguistics 4 (2): 1–3.
Tyler, Ralph W. 1934. Constructing Achievement Tests. Columbus: Bureau of Educational Research, Ohio State University.
Weigle, Sara C. 1994. "Effects of Training on Raters of ESL Compositions." Language Testing 11 (2): 197–223. https://doi.org/10.1177%2F026553229401100206.
—. 2002. Assessing Writing. Cambridge, UK: Cambridge University Press.
Weir, Cyril J. 2005. Language Testing and Validation: An Evidence-Based Approach. Houndmills, UK: Palgrave.
Weir, Cyril J., Ivana Vidakovic, and Evelina D. Galaczi. 2013. Measured Constructs: A History of Cambridge English Language Examinations 1913–2012. Cambridge: UCLES/Cambridge University Press.
White, Edward M. 2001. "The Opening of the Modern Era of Writing Assessment: A Narrative." College English 63 (3): 306–20. https://doi.org/10.2307/378995.
Wiseman, Stephen. 1949. "The Marking of English Compositions in Grammar School Selection." British Journal of Educational Psychology 19: 200–209. https://doi.org/10.1111/j.2044-8279.1949.tb01622.x.
—. 1956. "Symposium: The Use of Essays in Selection at 11+." British Journal of Educational Psychology 26: 172–79. https://doi.org/10.1111/j.2044-8279.1957.tb01390.x.
Yancey, Kathleen Blake. 1999. "Looking Back as We Look Forward: Historicizing Writing Assessment." College Composition and Communication 50 (3): 483–503. https://doi.org/10.2307/358862.
Zumbo, Bruno D. 2007. "Validity: Foundational Issues and Statistical Methodology." In Handbook of Statistics 26: Psychometrics, edited by C. R. Rao and S. Sinharay, 45–79. Amsterdam: Elsevier Science.
—. 2009. "Validity as Contextualized and Pragmatic Explanation, and Its Implications for Validation Practice." In The Concept of Validity: Revisions, New Directions and Applications, edited by Robert W. Lissitz, 65–82. Charlotte: Information Age Publishing.