The nuance of reliability studies

In this blog, Amy Finch, our Head of Strategic Evaluation, sets out Ofsted’s work on evaluating the reliability of inspection.

Last week, Ofqual’s research on marking consistency (originally published back in November) captured the imagination of the press, with some unhelpfully lurid headlines. Ofqual was making the important point that its research “considers the implications of there not being a single, right mark for every answer given in every subject”.

Our continuing struggle to accept the inherent imprecision of most kinds of judgement shows that reliability really is a difficult subject.

However, the impossible quest for perfection in making judgements should not deter us from seeking to celebrate good performance or improve where things are not quite right.

Interpreting reliability

There are many ways of measuring reliability. Most involve highly complex statistics, and all require careful interpretation to understand the results. Will Hazell at TES has written an insightful piece on Ofqual’s research, revealing the nuance behind the study and cautioning against drawing simplistic conclusions from it.

Generally speaking, reliability means the extent to which 2 assessments of something are the same. At Ofsted, we take reliability and its sister concept, validity, very seriously. Validity means the extent to which our inspections and judgements assess what they’re supposed to. Both reliability and validity matter.

But there is very often a trade-off between reliability and validity.

Let’s consider an example. We might make inspections more reliable, in the sense of increasing the probability that 2 different inspection teams would return the same judgement on each aspect, if each inspection consisted of asking a standard list of questions, each to be answered with a simple ‘yes’ or ‘no’. But most people would not think that the inspection equivalent of a multiple-choice test was the best way to bring human intelligence into assessing the quality of education. So in practice, we have to balance reliability and validity, making sure that both are at acceptable levels, while recognising that there are no perfect answers.

Our frameworks

Making sure that our frameworks are reliable and valid is therefore one of our strategic aims.

We’ve published several research studies on the subject over the last 2 years, such as ‘Do two inspectors inspecting the same school make consistent decisions?’ and ‘How valid and reliable is the use of lesson observation in supporting judgements on the quality of education?’.

From September, we’ll be implementing a new inspection framework that we believe has greater validity, with a clearer underlying concept of education quality, grounded in the evidence on educational effectiveness and with the curriculum at its heart. As we start to monitor and evaluate this framework, we will be looking at reliability as well as validity.

Reliability and inspection

To really understand reliability in the inspection context, we should also disentangle a couple of concepts.

First, differences in inspection practice do not necessarily result in a lack of reliability! While our inspection handbook sets out our overall approach, there are many reasons why inspections can feel different to providers. The schools, colleges and local authorities we inspect all vary, and their strengths and weaknesses are not the same. But that does not necessarily mean that inspectors come to different judgements.

Of course, there can be no such thing as ‘perfect reliability’ for complex, real-world judgements. Both assessments and judgements are estimations of performance, and like all estimations they carry some error, both statistical and human.

Expertise and professional dialogue

Ofsted inspections have been criticised as unreliable by design because they involve human subjectivity. But that view fails to recognise the great benefits of human inspection based on expertise and professional dialogue, as well as the limitations of the other ways in which we can attempt to assess the quality of education.

Exam data gives the appearance of precision but is in fact not perfectly valid or reliable, for many reasons. For one thing, exams can generally only test pupils on a small sample of what they know about a subject, which may or may not be representative of everything they know. And of course, putting too much weight on exam data can lead to undesirable behaviours in schools.

Judgement reliability

While the high-stakes uses of inspection make it hard to accept anything less than perfect agreement on judgements, an inspection system that was ‘perfectly’ reliable would not be looking at the right things.

The things that are hard to measure are often the things that matter most.

With this in mind, our approach is to improve validity while making sure that we have sufficiently reliable judgements. To do this, we look at ‘inter-rater reliability’ – essentially the degree of agreement among ‘raters’ (in our case, inspectors).

There are 2 layers to this:

Inspection judgement inter-rater reliability – do 2 inspectors come to the same judgement of a provider?
Inspection method inter-rater reliability – do 2 inspectors come to the same conclusions on indicators, such as for lesson observation and book scrutiny?
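
As a rough illustration of how agreement between 2 raters can be quantified, the sketch below computes simple percent agreement and Cohen's kappa, a widely used statistic that adjusts for the agreement expected by chance. This is a generic illustration rather than a description of the statistics used in our studies: the function names and the example judgements are hypothetical.

```python
from collections import Counter


def percent_agreement(rater_a, rater_b):
    """Share of cases on which the 2 raters gave the same judgement."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must judge the same, non-empty set of cases")
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return matches / len(rater_a)


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: observed agreement adjusted for chance agreement."""
    n = len(rater_a)
    observed = percent_agreement(rater_a, rater_b)
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    # Chance agreement: probability both raters pick the same category
    # if each judged independently at their own observed rates.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical category throughout
    return (observed - expected) / (1 - expected)


# Hypothetical judgements for 10 providers (made-up data, not study results).
inspector_1 = ["good"] * 6 + ["requires improvement"] * 4
inspector_2 = (["good"] * 5 + ["requires improvement"] * 4 + ["good"])

print(f"Percent agreement: {percent_agreement(inspector_1, inspector_2):.0%}")
print(f"Cohen's kappa: {cohens_kappa(inspector_1, inspector_2):.2f}")
```

In this made-up example the 2 inspectors agree on 8 of 10 providers, but kappa is lower than the raw agreement rate because some of that agreement would be expected by chance alone.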

In our first judgement reliability study, pairs of inspectors independently carried out live short inspections of good schools. The ‘lead inspector’ was responsible for deciding whether the school remained good, while a ‘methodology inspector’, who visited on the same day but worked entirely independently on site, came to separate conclusions.

The results were very positive. In 22 of the 24 short inspections (over 90%), the inspectors agreed with each other about whether the school remained good. In one of the other 2, the inspectors came to the same conclusion after a converted full inspection. The final judgements were different in just one of the 24 schools. Students of human judgement reliability in other contexts will recognise that this is a high level of agreement.
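
The percentages behind these figures can be checked directly. The short sketch below uses only the counts reported above; the variable names are illustrative.

```python
total_schools = 24
agreed_in_short_inspection = 22
agreed_after_conversion = 1  # one of the other 2 schools

# Agreement during the short inspections themselves.
short_rate = agreed_in_short_inspection / total_schools

# Agreement on final judgements, including the converted full inspection.
final_rate = (agreed_in_short_inspection + agreed_after_conversion) / total_schools

print(f"Agreement in short inspections: {short_rate:.1%}")  # 91.7%
print(f"Agreement on final judgements: {final_rate:.1%}")   # 95.8%
```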

More recently, for our review of how we inspect local authority (LA) children’s services (ILACS), separate teams of inspectors reviewed anonymised evidence from live LA inspections and came to an independent judgement. We are due to publish the results of this work later this year. A similar methodology has been used to examine the reliability of inspections by the Care Quality Commission.

Method reliability

For method reliability, we asked inspectors to make independent judgements using indicators for lesson observation and book scrutiny activities. These indicators are not being used on live inspections but are another way of assessing reliability.

We published these studies in June. They show a substantial level of agreement for lesson observation indicators in secondary and a reasonable level in primary. Reliability in book scrutiny was generally lower, though the study was conducted off-site with no ability to talk to pupils or teachers about the work and involved minimal training of the inspectors.

The relationship between layers

While the 2 layers of reliability are different, we know that there’s a relationship between them.

In collecting evidence ahead of evaluating the education inspection framework (EIF), we found that judgement reliability improved significantly when book scrutiny and lesson observation reliability also improved. This suggests that there is a strong relationship between evaluation during individual inspection activities and the evaluation of the school more generally.

This is exactly what we’d want to see.

Evaluating the EIF

As we come to monitor and evaluate the EIF, we’ll be deciding how best to measure reliability. There are a number of considerations here.

The most important is deciding which form of reliability to measure. We’ve already looked at inter-rater reliability – that is, whether 2 inspectors using the same inspection techniques on the same day come to the same conclusions.

An argument could be made for examining ‘intra-rater reliability’ to test whether a single inspector comes to the same judgement on 2 different days. However, there are obvious practical difficulties in this, such as the length of time needed for an inspector to forget their previous judgement.

This leads us to the second major consideration. Reliability studies involving live inspections are resource intensive and, of course, rely on the willing participation of schools and other providers. Most studies involve either doubling an inspection team or carrying out a second inspection-like activity.

We were able to recruit schools for previous studies quite easily (though they tended to be good or outstanding). Yet schools being inspected for the first time under a new framework may, understandably, be less willing. The additional burden this puts on providers and our own costs of doubling inspection resource are both important considerations.

In evaluating our new education inspection framework, we will need to strike the right balance between resourcing reliability studies and reaching valid and useful conclusions. It’s important that we’re able to show that inspection works. But it’s equally important that we do not unnecessarily burden the sectors we inspect. We will be coming to some firm conclusions on this over the coming months.

Amy Finch is our Head of Strategic Evaluation.

1 comment

Comment by Terry Pearson posted on 21 August 2019

It is reassuring to know that Ofsted recognises the need to continually examine the trustworthiness of both inspection methodology and inspection outcomes. It is also encouraging to know that Ofsted acknowledge that imperfection and imprecision will always be a feature of inspection judgements.

Nevertheless, it is rather worrying to see Ofsted promoting the studies they have carried out and published over the last two years as these are flawed in ways that prevent the findings from them being used effectively as indicators of the reliability of inspection judgements or inspection methodology.

In this blog Ofsted emphasise that consistency can be increased when inspectors are required to make categorical responses so it is somewhat ironic that in order to facilitate quantitative analysis of the lesson observation study inspectors were required to record a score on a five-point scale that, most importantly, WOULD NOT be used during inspections. The scale and its associated indicators were produced “strictly for the purpose of the research study”. This means that the levels of consistency attained during the study for scoring decisions cannot be simply transferred to inspection situations where the indicators and scales are not being used. It is not possible therefore to use the outcomes of the study to predict with any degree of confidence what level of consistency could be expected during inspections. It is worth noting that indicators and scoring scales were also developed for the workbook scrutiny study.

The earlier Ofsted study titled ‘Do two inspectors inspecting the same school make consistent decisions?’ reported a number of limitations but failed to alert readers to other substantial shortcomings. Consequently, the findings from that study provide no assurance that under normal inspection conditions a different sample of inspectors would produce similar levels of decision consistency when carrying out short inspections. Most importantly perhaps the report of the study does not provide meaningful information for identifying the next steps in the process of developing a credible measure for the reliability of Ofsted school inspections. Further details of the shortcomings of that piece of research are available on this link: https://www.researchgate.net/publication/327894743_A_review_of_Ofsted's_test_of_the_reliability_of_short_inspections

However, Ofsted’s commitment to providing evidence of the trustworthiness of inspection decisions and methodology needs to be welcomed. Ofsted really does need to figure out how best to measure reliability. Beforehand though, it needs to think more deeply about what reliability means in the context of its inspections. A fundamental aspect of the EIF is that it recognises the great benefits of human inspection based on expertise and professional dialogue. Unlike the marking of examination scripts Ofsted inspection depends heavily on thoughtful human interaction in which expert judgement is sought that may lead inspectors to different but legitimate conclusions. It seems wholly inappropriate to use a measure of consistency as a primary indicator of the trustworthiness of those decisions. What’s more, striving for a balance between discretionary expert judgement and consistency may well result in a ‘neither one thing nor the other’ situation which would be most unhelpful for Ofsted and the range of inspection stakeholders.

On numerous occasions, over more than one quarter of a century, Ofsted has attempted without success to furnish credible evidence of the reliability of inspection judgements. Ofsted must develop better ways of testing and reporting reliability or call upon the support of others who are capable of doing it for them.
