Further supplementary memorandum from
the Qualifications and Curriculum Authority
At the Select Committee's 15 May oral evidence
session with QCA, you asked if I would provide a more detailed
commentary on Professor Tymms's submission to you about the key
stage 2 English tests. I am now pleased to enclose an analysis
of the points that Professor Tymms raised.
For ease of reference, the enclosed analysis
seeks to summarise each of the questions contained within Professor
Tymms's letter and then provides our comments and response. Where
there is a need to refer to particular key stage subjects and
their tests to support a point, the analysis commonly refers to
key stage 2 English and mathematics, given that it is in relation
to these two subjects that concerns are most frequently voiced.
I should like to reinforce some of the points
that I made at the Select Committee hearing.
There are no absolutely perfect techniques to
measure pupils' attainments. The key issues, as we perceive them,
are whether QCA's tests are fit for the purpose of measuring pupils'
attainment against the national curriculum, and whether they and
the procedures for marking and level setting are as high in quality
as they can be. QCA seeks to use all available techniques to address
these aims: the professional judgement of subject specialists,
pre-tests, anchor tests, test scrutiny, statistical comparisons
and script scrutiny. Through this work we have gained extensive
experience of dealing with the major technical issues involved,
and we are investigating key problems such as pre-test effects.
Most importantly, QCA processes do not peg standards in one year
solely to the previous year; we reference tightly to an absolute
standard set by the national curriculum and carried through all
of our tests since 1996.
There is a second general issue to do with the
use of the tests to measure how well pupils have been taught the
national curriculum. As I explained, we know of three distinct
methods of national testing. Professor Tymms argues for tests
which seek to measure underlying ability. Professor Dylan Wiliam
argues that the country should set much more store by teachers'
own judgements about their pupils, believing that these assessments
provide the most reliable record of what
an individual pupil knows and can do. Then there is the QCA system
of measuring pupils' attainments against the standards of the
national curriculum subjects that they have been taught by their
teachers. These are three very different systems. The procedures
that might work well for tests of underlying ability, such as
those used by Professor Tymms, are not appropriate for national
curriculum tests.
I hope that this more detailed analysis is helpful
to you and Select Committee colleagues. You will also find enclosed,
for reference, copies of our Standards Reports, which we send
to schools each year to provide detailed feedback on test performance
in the previous May.
COMMENTARY BY QCA ON THE
APRIL 2002 SUBMISSION
Professor Tymms's submission raises a number of questions
about QCA's national curriculum testing arrangements, in particular
those for testing English in the national curriculum at the end
of key stage 2. This paper summarises the questions raised and
provides comment and response from QCA.
Does QCA believe that standards have risen at
the end of key stage 2 over the last seven years?
Yes. QCA is confident that standards have risen.
Pupils are now taught the skills, knowledge and understanding
set out in the national curriculum better than in 1995 and 1996.
Our evidence to support this statement comes
from a number of sources.
HMI's annual reports, including the
most recent report by Her Majesty's Chief Inspector of Schools
Standards and Quality in Education 2000/01 (February 2002), have for
several years now reported improvements in the quality of teaching
and reductions in the proportions of unsatisfactory lessons. In
meetings with QCA subject officers, HMI have referred to evidence
of improved teaching of reading in key stages 1 and 2, of improvements
in the teaching of Shakespeare, and of improvements in areas of science
and mathematics, drawn from their surveys of subject teaching.
Each year QCA conducts its own monitoring
activities, as well as carrying out a detailed analysis of pupils'
test scripts, drawn from a national sample of schools. This latter
analysis in particular shows that pupils' performance in English,
mathematics and science has improved since 1996. It also
shows, for example, that pupils are now clearly better at responding
to questions in the key stage 2 English reading test which ask
about authorial intentions; they are now much better able to answer
mental arithmetic questions at key stage 2; they are also better
at showing their working, at presenting graphical information;
and in key stage 2 science tests, they can now respond very well
to questions that ask them about science experiments involving
two variables. QCA, therefore, has detailed information about
precisely where pupils' improvements have come in each subject.
QCA provides this sort of detailed information about test performance
each year to the literacy and numeracy strategy teams and to schools
through our Standards Reports.
Each year, QCA carries out an analysis
of writing on a sample of scripts at each national curriculum
level. At each level, against criteria that remain the same each
year, the analysis examines the features of the writing that are
characteristic of the sample scripts. The analysis has shown that
patterns of performance for each level are similar each year.
This is further evidence to demonstrate that the expectations
(ie the standards) for achievement of the levels have remained constant.
Do the results reflected in those [key stage
2] tests give a general indication that the basic skills of pupils
finishing primary schools are higher than they were six or seven years ago?
Yes. The evidence referred to in response to
the previous question demonstrates the improvements in pupils'
skills and their ability to show, through the tests, the knowledge,
skills and understanding that are required by the national curriculum.
The annual Standards Reports provide considerable detail to schools
on test performance in relation to particular aspects of the required curriculum.
Is it the case that in setting the cut-scores
for the levels at the end of key stage 2 every year QCA attempts
to keep the standard to that of the previous year?
QCA does seek to maintain standards from year
to year. However, it is not accurate to say that our procedures
simply link standards in one year to the previous year. Each summer,
QCA convenes committees to take final decisions about the number
of marks needed for the award of each level covered by the tests.
To ensure that those decisions are fair and professional, and maintain
standards over time, QCA:
has statistical procedures which
anchor the standard of the new year's tests against the standards
used every year since 1996;
uses archive scripts from 1996 and
subsequent years to ensure that the assessment experts make their
recommendations drawing on the standard of pupils' work from a
number of years;
draws on the expertise of test development
teams (with their own considerable experience, and research and
statistical experts), which are largely stable and constant from
year to year;
invites an external expert each year
to offer independent advice on procedures, the statistics and
the decisions being made. In the past QCA has invited Professor
Tom Christie of the University of Manchester, Professor Dylan Wiliam
of King's College and Dr Gordon Stobart of the Institute of Education.
This year Mr Alf Massey from UCLES in Cambridge has been invited
as the independent expert.
In addition, there is a wide range of measurement
techniques available to QCA and others engaged in the process
of setting and maintaining standards for tests. The organisation
is continually investigating and conducting research with the
aim of improving procedures wherever possible.
Why have QCA's own insiders been critical of
the standard-setting procedures used by QCA (Quinlan and Scharaschkin)?
There is a range of views amongst test developers
and educational statisticians about what maintaining test standards
involves and how best to achieve it. Part of QCA's responsibility
is to be aware of that range of views and to ensure the organisation
designs procedures having fully discussed the issues.
The work carried out by Quinlan and Scharaschkin
as QCA researchers is an example of internal research that aims
to take a critical look at existing procedures. Their work examined
and commented on various refinements to arrangements, which have
now been implemented. In the main their work was designed to examine
what more QCA should be doing to ensure that procedures are beyond
reproach. They made a number of recommendations, the most significant
of which were:
The statistical methods used to obtain
a set of test results early each summer for a representative sample
of pupils should be improved. This has been done.
The use of statistics in level setting
should be enhanced. This has been put in place.
The conduct of key meetings should
be better codified and documented, in terms of who attends,
what documentation is produced, and how advice is provided. This
has been done and QCA is now drawing up a detailed Code of Practice
to cover this area of our work.
QCA does not rely solely on its own internal
experts to provide advice on procedures. The independent review
of A level standards, Maintaining GCE A level standards (January
2002), carried out by Professor Eva Baker, is a recent example
of an external review. Similarly, QCA also commissions research
on the national curriculum tests. Much of this work takes the
form of focused investigations, but the organisation also commissions
longer-term research into the issue of maintaining standards.
The outcomes of this independent research work are reviewed by
a group of independent researchers who form QCA's Advisory Group
for Research into Assessment and Qualifications. Some of the research
has identified changes that could be introduced but all of the
work has concluded that QCA has a robust testing system.
A further point can be made about QCA's internal
research. The Rose Panel, in their report Weighing the Baby (July
1999), drew attention to the fact that QCA's test developers see
their role as including a responsibility for improving the quality
of the tests where there is evidence to suggest this is needed
and where it is clear what should be done. Since 1996, QCA has
sought to make improvements to all of the tests, including the
key stage 2 English tests. For example, QCA has made changes to
make lines of questioning clearer for pupils, and the layout of
questions and artwork have been improved. All of these changes
have been made in order to improve the tests for pupils. It is
important to ensure that the tests are good enough to enable all
pupils to show what they know and can do. This work has been most
apparent in key stage 2 English and these tests are now much more
reliable in enabling pupils to demonstrate their learning.
Question 3 (c)
Does QCA agree that its procedures are bound
to lead to drift?
No, for the reasons outlined above and summarised below:
QCA's procedures provide an anchor
over several years, not just to the previous year.
QCA evaluates and monitors the tests
and the procedures associated with them.
QCA draws on the advice of leading
experts to assure the quality of its procedures and to provide
long-term strategic advice.
Is QCA unable to maintain standards over time?
Based on the responses to the earlier questions,
QCA is able to and does maintain standards successfully.
In summary the procedures in place to assure
standards over time include:
pre-tests, with anchor tests designed
to compare each new test against a common underpinning standard,
measured year after year by the anchor test;
script scrutiny meetings in which
the most senior and most experienced markers review scripts from
previous years (we generally select archive scripts from 3 years,
at random), and in which the markers seek to track the standard
shown in the archive scripts over a number of years, into the
new year's test;
meetings in which we ask teachers to review the difficulty of
the tests and to evaluate their views using the national curriculum
level descriptions;
statistical reviews of national results,
drawn from a national sample of schools.
The evidence from each of these sources is considered
annually by the level setting committee, which includes QCA, the
test development teams, the marking teams and external observers,
including an invited independent expert.
If the cut-off score were shifted by one mark
on the test, how much difference would this make to the proportion
of pupils getting a level 4 across the country?
As Professor Tymms indicates, the decisions
made by QCA are significant for pupils and their schools. At the
level 4 boundary for key stage 2 English and mathematics, the
proportion of pupils affected if a threshold had been moved
by one mark in either direction in 2001 was between one and
two per cent.
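To illustrate the arithmetic, the sketch below (in Python, using a simulated score distribution and an invented threshold of 50 marks, not QCA data) shows that the proportion of pupils affected by a one-mark shift is simply the share of the cohort whose total score falls on that single mark.

    # Illustrative sketch only: the scores below are simulated, not QCA
    # data. Moving a cut-score by one mark changes the proportion of
    # pupils awarded the level by the share of the cohort sitting on
    # that single mark.
    import numpy as np

    rng = np.random.default_rng(0)
    # Simulated total scores out of 100 for a large cohort of pupils.
    scores = np.clip(rng.normal(loc=58, scale=15, size=600_000).round(), 0, 100)

    def proportion_at_or_above(cut: int) -> float:
        return float((scores >= cut).mean())

    cut = 50  # hypothetical level 4 threshold
    for c in (cut - 1, cut, cut + 1):
        print(f"cut-score {c}: {proportion_at_or_above(c):.1%} at or above")
    # The difference between adjacent cut-scores is the percentage of
    # pupils on that exact mark: of the order of 1-2 per cent here,
    # consistent with the figure quoted above.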
Approximately what range of marks is considered
during the discussion when cut-off scores for level 4 are being set?
It is possible for the range of marks considered
to vary from year to year and from one subject to another. QCA's
internal procedures do not prescribe a pre-specified range that
should be considered each year when setting cut-off scores. However,
the issue is more complex than Professor Tymms's submission suggests.
For each pre-test that QCA conducts, there are
a number of different statistical measures that can be used to
equate the standard in the new test to the previous standard,
the anchor test and the previous years' tests. The reliability
of each of these measures depends on the nature of each of the
tests, as well as on the number of pupils involved in the pre-test.
However, with only very rare exceptions, each statistical measure
agrees with the others to within one or two marks. There
is also clear advice from the test developers' statisticians about
which of the statistical measures should be used. There is not
a process of splitting the difference.
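One member of this family of measures is simple linear (mean and standard deviation) equating. The sketch below, in Python with simulated pre-test scores and an invented cut-score, illustrates the textbook technique rather than QCA's actual procedure.

    # Minimal sketch of mean/sigma (linear) equating, one of several
    # statistical measures of the kind described above. Scores are
    # simulated and the cut-score is invented.
    import numpy as np

    rng = np.random.default_rng(1)
    old_form = rng.normal(55, 14, 2_000)  # pre-test scores on the old form
    new_form = rng.normal(52, 13, 2_000)  # pre-test scores on the new form

    def linear_equate(x, from_scores, to_scores):
        """Map a score from one form's scale onto another's so that
        standardised (z) scores match."""
        z = (x - from_scores.mean()) / from_scores.std()
        return to_scores.mean() + z * to_scores.std()

    # If level 4 needed 50 marks on the old form, the equivalent mark
    # on the new form is:
    print(f"equivalent new-form cut-score: {linear_equate(50, old_form, new_form):.1f}")
    # Other methods (equipercentile, IRT-based) would typically land
    # within a mark or two of this figure, as noted above.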
In the script scrutiny process there is a prescribed
mark-range that the markers are required to investigate. This
varies for good reasons from subject to subject but is always
at least nine marks. The variation occurs as a result of the nature
of the tests and their tiering arrangements.
Markers are required to determine within that range which scripts
are definitely below the standard of the level, which are above it,
and which are on the standard from previous years. This process typically
enables the markers to identify a provisional range of two to
three marks around a potential cut-off score; they are then required
to discuss and evaluate their differences of view (each marker
will be clear in their own views which mark best reflects the
standard of previous years, but there is generally some disagreement
initially around the table about which mark this should be) and
make a single recommendation to QCA.
The 'Angoff' teachers' judgemental process also
derives a single mark recommendation.
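In the standard Angoff procedure, each teacher estimates, question by question, the marks that a pupil on the borderline of the level would be expected to earn; each teacher's estimates are summed, and the panel's sums are combined into one recommended mark. A minimal sketch in Python, with invented judgments:

    # Minimal sketch of an Angoff-style calculation with invented
    # judgments, not QCA data. Each row holds one teacher's estimates
    # of the marks a borderline level 4 pupil would earn per question.
    judgments = [
        [0.9, 0.7, 0.5, 1.6, 0.4],  # teacher A
        [0.8, 0.6, 0.6, 1.4, 0.5],  # teacher B
        [1.0, 0.7, 0.4, 1.5, 0.3],  # teacher C
    ]

    # Each teacher's implied cut-score is the sum of their estimates;
    # the panel recommendation is the average of those sums.
    per_teacher = [sum(row) for row in judgments]
    recommendation = sum(per_teacher) / len(per_teacher)
    print(f"implied cut-scores per teacher: {per_teacher}")
    print(f"panel recommendation: {recommendation:.1f} marks")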
Taking these three procedures together, it is
usual for there to be some limited difference in the marks that
each process generates, which is why QCA operates several processes.
In discussing the differences, which are typically of one or two
marks, the level setting committee will sometimes identify one
indicator as an outlier; that indicator is then treated as unreliable
and not considered further.
The level setting committee does not take a
decision based on the average of the indicators arising from the
different processes. The meeting weighs the evidence carefully
and generally agrees which of the indicators is based on the most
reliable evidence.
Where the standard setting meeting considers
that the range of marks under consideration is too wide, or is otherwise
reluctant to make a decision, the meeting can request further
analysis (a re-run of the script scrutiny) before reconvening.
This has only happened on two or three occasions across all of
the testing system. A decision has then been taken at a reconvened
meeting, having considered further evidence.
Question 7 (a)
What is QCA's estimate of the margin of error
on the decisions made about the proportions of pupils achieving
level 4 or above at key stage 2?
Professor Tymms's submission suggests that the
margin of error is to be counted in relation to whole marks, ie that
QCA might equally well have selected a higher or lower mark for
the level threshold.
This is not QCA's view. Many of the measures
that are considered in the level setting meetings are expressed
to one decimal place. This enables QCA to be more confident than
Professor Tymms's submission suggests, and certainly QCA would
not suggest that the margin of error could be as high
as 0.5 per cent.
Question 7 (b)
If we saw, say, a 2 per cent rise in Level 4s
in maths, how confident would [QCA] be that the rise represented
a real increase in standards and was not within the margin of error?
It would be unwise for anyone to claim a rise
of 2 per cent as clear and unequivocal evidence of an increase
in standards, unless there was a similar level of increase for
a number of years consecutively. A rise in a single year of more
than 2 per cent, or a rise of 1-2 per cent each year for three
or more years, would, in QCA's view, provide unequivocal evidence
of a real increase in standards of performance.
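As a back-of-envelope illustration of this reasoning (a Python sketch using the 1-2 per cent per mark sensitivity quoted earlier and an assumed uncertainty of one mark in the threshold; neither figure is a formal QCA estimate of the margin):

    # Back-of-envelope sketch only. The per-mark sensitivity is the
    # figure quoted earlier in this commentary; the one-mark threshold
    # uncertainty is assumed purely for illustration.
    per_mark_effect = 0.02           # upper end of the 1-2% per mark quoted
    threshold_uncertainty_marks = 1  # assumed

    margin = per_mark_effect * threshold_uncertainty_marks
    single_year_rise = 0.02
    sustained_rise = 3 * 0.015       # 1.5% a year for three years

    print(f"margin attributable to cut-score placement: {margin:.1%}")
    print(f"2% in a single year clears the margin: {single_year_rise > margin}")
    print(f"sustained rise of {sustained_rise:.1%} clears it: {sustained_rise > margin}")
    # A single-year rise of exactly 2% sits at the edge of this margin,
    # while a sustained rise compounds well beyond it, which is why
    # consecutive years of increase carry more weight.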
Question 8 (a)
How comfortable does QCA feel with next year's
test being trialled the previous year?
QCA is confident that this is the best possible
way of pre-testing the tests. One alternative would be not to
pre-test: this would mean that untried questions were placed in
front of pupils, and the system of external marking could not be
run in the way, or to the tight timescale, that it currently is.
Another alternative would be to pre-test in another country,
but to be reliable the same curriculum would have to apply, together
with broadly the same major educational initiatives. A third option
would be to find small groups of pupils to take both the old test
and the new test under pre-test conditions, ie low stakes
conditions, with no revision or preparation effects. In practical
terms QCA has rejected this option as it would be impossible to
determine whether pupils had prior sight of the old test (they
are all in the public domain and used by teachers during the year).
QCA has concluded that the system of pre-testing
currently used is as good as it can be.
Question 8 (b)
Does QCA make a correction for age difference?
No. The national curriculum tests do not seek
to relate test performance to age, other than at key stage 1.
Nor in practice are the age differences material (they are a matter
of weeks, not terms).
Moreover, the pre-test results are not the final
data that inform the decisions about where the level thresholds
need to be set. They provide the first evidence which is then
considered over a 12-month period alongside the other measures
described elsewhere in this commentary.
Question 8 (c)
How does QCA deal with the fact that the next
year's test acts as a dress rehearsal for the actual end of key stage 2 tests?
QCA's advice to all schools involved in either
the first or second pre-test is to consider providing pupils with
the opportunity to practise a previous year's test, so that pupils
may be familiar with the requirements of their tests. A pre-test
in this sense is therefore just good practice for schools.
QCA also measures something termed the 'pre-test
effect'. The effect has been measured accurately in past years, and the
historical data are then used to estimate the effect on the new
test. Should that estimate prove inaccurate, this would be clear
at the level setting meetings when actual test performance statistics
are made available. In terms of QCA procedures these factors do
not affect final decisions.
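A minimal sketch, in Python, of the kind of adjustment described (all figures are invented, and QCA's actual estimation draws on more evidence than this):

    # Minimal sketch of applying a historically measured 'pre-test
    # effect'. All figures are invented for illustration.
    # Past gaps between low-stakes pre-test means and the subsequent
    # live-test means, in marks:
    historical_effects = [2.1, 1.8, 2.4, 2.0]

    estimated_effect = sum(historical_effects) / len(historical_effects)

    pretest_mean = 54.0  # mean score observed in this year's pre-test
    predicted_live_mean = pretest_mean + estimated_effect
    print(f"estimated pre-test effect: {estimated_effect:.1f} marks")
    print(f"predicted live-test mean: {predicted_live_mean:.1f} marks")
    # If the live results later diverge from this prediction, the
    # divergence is visible at the level setting meeting, as described.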
Question 9 (a)
Has QCA used the same anchor test for each of
the past seven years?
In respect of key stage 2 English, there are
two components to the test: a Reading Test and a Writing Test.
Each is worth 50 marks out of a total of 100 marks. The Writing
Test is absolutely anchored to 1996. This is achieved because
the form of the mark scheme for the test has remained stable since
1995. The mark scheme uses identical criteria, year on year, adapted
to the range of tasks set in the test.
For the Reading Test QCA uses a short reading
test as an anchor. This was developed in 1995 and has been used
in all pre-test administrations since. Pupils involved in a pre-test
take their actual statutory key stage 2 test, the anchor test,
and either the new reading test or the new writing test.
For key stage 2 mathematics the anchor test
procedures need to be rather different. Here the approach is to
select six of the most technically robust items used in previous
years' tests and to 'seed' those six items in every pre-test booklet.
(A 'robust' item, in this respect, is one that performs very consistently,
showing a good match with pupils' total scores on the tests.)
This enables QCA's statisticians to use a sound statistical procedure
(the most common is a highly technical procedure known as item
response theory (IRT) analysis) to compare all new items against
the standards of all previous years' tests. Where there is evidence
to suggest the need, items are retired from the bank of six anchor
items and replaced with more robust items. In practice, one item
is retired every other year, giving a largely stable base to the
anchor test over time.
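To give a flavour of what such an analysis involves, the sketch below (in Python) is a deliberately simplified stand-in for a full IRT calibration: item difficulty is approximated by the log-odds of an incorrect response, and the new pre-test is placed on the historical scale by aligning the mean difficulty of the six common anchor items. All proportions are invented.

    # Highly simplified sketch of anchor-item linking, not QCA's actual
    # analysis. Difficulty is approximated by the log-odds of failure,
    # a crude stand-in for a full Rasch/IRT fit.
    import math

    def crude_difficulty(p_correct: float) -> float:
        """Log-odds of an incorrect response: higher means harder."""
        return math.log((1 - p_correct) / p_correct)

    # Proportions correct on the six anchor items (invented figures):
    anchor_hist = [0.80, 0.72, 0.65, 0.58, 0.50, 0.41]  # historical calibration
    anchor_new = [0.83, 0.76, 0.69, 0.61, 0.55, 0.45]   # this year's pre-test

    d_hist = [crude_difficulty(p) for p in anchor_hist]
    d_new = [crude_difficulty(p) for p in anchor_new]

    # Mean/mean linking constant: the offset between this year's scale
    # and the historical scale, judged on the common items alone.
    shift = sum(d_hist) / len(d_hist) - sum(d_new) / len(d_new)

    # A new item's difficulty expressed on the historically anchored scale:
    linked = crude_difficulty(0.62) + shift
    print(f"linking shift: {shift:+.2f} logits")
    print(f"new item difficulty on the anchored scale: {linked:+.2f}")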
Question 9 (b)
Is QCA prepared to share the anchor test data?
As explained earlier in this commentary, the
anchor test is just one piece of evidence considered when setting
level thresholds.
Data from the anchor test are used in different
ways by each test development agency, depending on the subject.
For example, the Rasch model of statistical equating uses the
data from the anchor test as an integral part of the statistical
analysis process. However, Rasch is not appropriate for all subjects
and is not used by all test developers. Item response theory (IRT),
used in mathematics for example, uses the anchor test data in
a different way from the Rasch process. Providing the anchor test
data out of context, therefore, would not be helpful.
In English at key stage 2, the raw data emerging
from the anchor test are subject to statistical methods that equate
outcomes on the anchor test with outcomes on the live national
curriculum test. In 2002, there was a statistically strong correlation
of 0.82 between the data from the anchor test and the data from
the live national curriculum test.
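For reference, a coefficient of this kind is typically the standard product-moment (Pearson) correlation between pupils' anchor test scores and their live test scores. A minimal sketch in Python, with simulated paired scores rather than the 2002 data:

    # Minimal sketch of computing such a correlation on simulated
    # paired scores; not the 2002 data.
    import numpy as np

    rng = np.random.default_rng(2)
    anchor = rng.normal(20, 5, 1_000)              # anchor test scores
    live = 2.0 * anchor + rng.normal(0, 6, 1_000)  # related live-test scores

    r = np.corrcoef(anchor, live)[0, 1]
    print(f"Pearson correlation: {r:.2f}")
    # A coefficient around 0.8 indicates a strong, though not perfect,
    # linear relationship between anchor and live performance.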
QCA, June 2002
1 Angoff describes generally a set of procedures applicable
in education and other professions through which the views of
the professionals (in this case teachers) about standards are elicited
and combined into a single recommended mark.
2 For example, mathematics at key stage 3 has papers targeting levels
3-5; 4-6; 5-7 and 6-8 of the national curriculum in mathematics.