In this blog, Amy Finch, our Head of Strategic Evaluation, sets out Ofsted’s work on evaluating the reliability of inspection.
Last week, Ofqual’s research on marking consistency (originally published back in November) captured the imagination of the press, with some unhelpfully lurid headlines. Ofqual was making the important point that its research “considers the implications of there not being a single, right mark for every answer given in every subject”.
Our continuing struggle to accept the inherent imprecision of most kinds of judgement shows that reliability really is a difficult subject.
However, the impossible quest for perfection in making judgements should not deter us from seeking to celebrate good performance or improve where things are not quite right.
There are many ways of measuring reliability. Most involve highly complex statistics, and all require careful interpretation to understand the results. Will Hazell at TES has written an insightful piece on Ofqual’s research, revealing the nuance behind the study and expressing caution over simple abstractions from it.
Generally speaking, reliability means the extent to which 2 assessments of something are the same. At Ofsted, we take reliability and its sister concept, validity, very seriously. Validity means the extent to which our inspections and judgements assess what they’re supposed to. Both reliability and validity matter.
But there is very often a trade-off between reliability and validity.
Let’s consider an example. We might make inspections more reliable, in the sense of increasing the probability that 2 different inspection teams would return the same judgement on each aspect, if each inspection consisted of asking a standard list of questions, each to be answered with a simple ‘yes’ or ‘no’. But most people would not think that the inspection equivalent of a multiple-choice test was the best way to bring human intelligence into assessing the quality of education. So in practice, we have to balance reliability and validity, making sure that both are at acceptable levels, while recognising that there are no perfect answers.
Making sure that our frameworks are reliable and valid is therefore one of our strategic aims.
We’ve published several research studies on the subject over the last 2 years, such as ‘Do two inspectors inspecting the same school make consistent decisions?’ and ‘How valid and reliable is the use of lesson observation in supporting judgements on the quality of education?’.
From September, we’ll be implementing a new inspection framework that we believe has greater validity, with a clearer underlying concept of education quality, grounded in the evidence on educational effectiveness and with the curriculum at its heart. As we start to monitor and evaluate this framework, we will be looking at reliability as well as validity.
Reliability and inspection
To really understand reliability in the inspection context, we should also disentangle a couple of concepts.
First, differences in inspection practice do not necessarily result in a lack of reliability! While our inspection handbook sets out our overall approach, there are many reasons why inspections can feel different to providers. Whether we’re inspecting schools, colleges or local authorities, providers vary, and their strengths and weaknesses are not the same, so the inspections themselves will differ. But that does not necessarily mean that inspectors come to different judgements.
Of course, there can be no such thing as ‘perfect reliability’ for complex, real-world judgements. Both assessments and judgements are estimations of performance, and like all estimations they carry some error, both statistical and human.
Expertise and professional dialogue
Ofsted inspections have been criticised as unreliable by design because they involve human subjectivity. But that view fails to recognise the great benefits of human inspection based on expertise and professional dialogue, as well as the limitations of the other ways in which we can attempt to assess the quality of education.
Exam data gives the appearance of precision but is in fact not perfectly valid or reliable, for many reasons. For one thing, exams can generally only test pupils on a small sample of what they know about a subject, which may or may not be representative of everything they know. And of course, putting too much weight on exam data can lead to undesirable behaviours in schools.
While the high-stakes uses of inspection make it hard to accept anything less than perfect agreement on judgements, an inspection system that was ‘perfectly’ reliable would not be looking at the right things.
The things that are hard to measure are often the things that matter most.
With this in mind, our approach is to improve validity while making sure that we have sufficiently reliable judgements. To do this, we look at ‘inter-rater reliability’ – essentially the degree of agreement among ‘raters’ (in our case, inspectors).
There are 2 layers to this:
- Inspection judgement inter-rater reliability – do 2 inspectors come to the same judgement of a provider?
- Inspection method inter-rater reliability – do 2 inspectors come to the same conclusions on indicators, such as for lesson observation and book scrutiny?
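To make the idea of inter-rater reliability concrete, here is a minimal sketch of two common ways of quantifying it: simple percent agreement, and Cohen’s kappa, which corrects for the agreement two raters would reach by chance. The blog does not say which statistic Ofsted uses, and the judgement data below are entirely hypothetical.

```python
# Illustrative sketch only: the statistics and data here are not Ofsted's.
# Two common measures of inter-rater reliability for categorical judgements.

def percent_agreement(a, b):
    """Proportion of cases where the two raters gave the same judgement."""
    assert len(a) == len(b) and len(a) > 0
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Agreement corrected for the agreement expected by chance alone."""
    n = len(a)
    observed = percent_agreement(a, b)
    labels = set(a) | set(b)
    # Chance agreement: sum over labels of the product of each rater's
    # marginal frequency for that label.
    expected = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    return (observed - expected) / (1 - expected)

# Hypothetical judgements from two independent inspectors of 6 providers
lead =   ["good", "good", "good", "not good", "good", "good"]
method = ["good", "good", "not good", "not good", "good", "good"]

print(percent_agreement(lead, method))  # 5/6, about 0.83
print(cohens_kappa(lead, method))       # 4/7, about 0.57
```

The gap between the two numbers is the point of kappa: when most providers are judged ‘good’, two raters will agree fairly often by chance alone, so raw percent agreement can flatter the result.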
In our first judgement reliability study, pairs of inspectors independently carried out live short inspections of good schools. The ‘lead inspector’ was responsible for deciding whether the school remained good, while a ‘methodology inspector’, who visited on the same day but worked entirely independently on site, came to separate conclusions.
The results were very positive. In 22 of the 24 short inspections (over 90%), the inspectors agreed with each other about whether the school remained good. In one of the other two, the inspectors came to the same conclusion after a converted full inspection. The final judgements were different in just one of the 24 schools. Students of human judgement reliability in other contexts will recognise that this is a high level of agreement.
More recently, for our review of how we inspect local authority (LA) children’s services (ILACS), separate teams of inspectors reviewed anonymised evidence from live LA inspections and came to an independent judgement. We are due to publish the results of this work later this year. A similar methodology has been used to examine the reliability of inspections by the Care Quality Commission.
For method reliability, we asked inspectors to make independent judgements using indicators for lesson observation and book scrutiny activities. These indicators are not being used on live inspections but are another way of assessing reliability.
We published these studies in June. They show a substantial level of agreement for lesson observation indicators in secondary schools and a reasonable level in primary schools. Reliability in book scrutiny was generally lower, though the study was conducted off-site, with no opportunity to talk to pupils or teachers about the work, and involved minimal training of the inspectors.
The relationship between layers
While the 2 layers of reliability are different, we know that there’s a relationship between them.
In collecting evidence ahead of evaluating the EIF, we found that judgement reliability improved significantly when book scrutiny and lesson observation reliability also improved. This suggests that there is a strong relationship between evaluation during individual inspection activities and the evaluation of the school more generally.
This is exactly what we’d want to see.
Evaluating the EIF
As we come to monitor and evaluate the EIF, we’ll be deciding how best to measure reliability. There are a number of considerations here.
The most important is deciding which form of reliability to measure. We’ve already looked at inter-rater reliability – that is, whether 2 inspectors using the same inspection techniques on the same day come to the same conclusions.
An argument could be made for examining ‘intra-rater reliability’ to test whether a single inspector comes to the same judgement on 2 different days. However, there are obvious practical difficulties in this, such as the length of time needed for an inspector to forget their previous judgement.
This leads us to the second major consideration. Reliability studies involving live inspections are resource intensive and, of course, rely on the willing participation of schools and other providers. Most studies involve either doubling an inspection team or carrying out a second inspection-like activity.
We were able to recruit schools for previous studies quite easily (though they tended to be good or outstanding). Yet schools being inspected for the first time under a new framework may, understandably, be less willing. The additional burden this puts on providers, as well as our own cost of doubling inspection resource, are important considerations.
In evaluating our new education inspection framework, we will need to strike the right balance between resourcing reliability studies and reaching valid and useful conclusions. It’s important that we’re able to show that inspection works. But it’s equally important that we do not unnecessarily burden the sectors we inspect. We will be coming to some firm conclusions on this over the coming months.
Amy Finch is our Head of Strategic Evaluation. Follow Amy on Twitter.