
Laurie Clarke


How the A-level results algorithm was fatally flawed

A-level students across the country have been left with a bitter taste in their mouths after receiving a grade lower than the one their teachers predicted for them. Results released on Thursday showed that 39 per cent of teacher recommendations were downgraded – and that pupils in disadvantaged areas were hit the hardest.

The algorithm used to calculate the results is under scrutiny, and people are crying bias. What flaws in the algorithm produced an outcome many feel is unfair?


The algorithm was fed a few different strands of raw data. The first was the teacher’s predicted grade for each student, based on their performance in class and in mock exams. But this was deemed insufficient on its own, so teachers were also asked to rank their students from highest to lowest in terms of their expected grade.

A report released by Ofqual on Thursday provides a few reasons for this, including the assertion that people are better at making relative than absolute judgments, and that teachers are apparently more accurate when ranking students in order than when predicting their future attainment.

But statisticians have identified problems with this approach. Firstly, in larger cohorts – such as big sixth form classes in popular subjects – it is harder to accurately rank each student in order of ability. Secondly, the model demanded that pupils of the same ability be placed in some sort of order, meaning unlucky ones might be pushed below a grade boundary because of a superficial quirk of the system.

“Ranking is known to be a non-robust measure,” says Richard Wilkinson, professor of statistics at Nottingham University. But despite this, the rankings were considered “fixed” (essentially sacrosanct) in Ofqual’s statistical model. This is in contrast to teachers’ predicted grades, which were treated as uncertain and were processed statistically to try to account for this.

“That seems to be a big inconsistency – you treat kind of one set of data, the rankings from teachers, as fixed and accurate, whereas the predicted grades are considered subject to uncertainty,” says Guy Nason, chair in statistics at Imperial College.

When Ofqual tested the model, it didn’t examine the potential pitfalls of the ranking measure. Instead, it tested for accuracy by using data from 2016, 2017 and 2018 to predict the grades of 2019. But in that test it used the actual rankings of students in the 2019 exams. By contrast, it relied on teachers’ estimates to predict grades for 2020.

Even without this consideration, testing found the accuracy of the model to be fairly low – in the range of 50 to 60 per cent. “One in three might be misclassified anyway, if not higher, and that’s when you use the correct rankings in the actual exams in 2019,” says Wilkinson. “With teacher predicted rankings, that could actually go down significantly.”
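To make the accuracy figure concrete, here is a minimal sketch of one way such a back-test could be scored: the share of students whose predicted grade exactly matches the grade they actually received. The function name and the sample grades are hypothetical, and this is not a description of how Ofqual computed its figures.

```python
from typing import Sequence


def exact_match_rate(predicted: Sequence[str], actual: Sequence[str]) -> float:
    """Share of students whose predicted grade equals their actual grade."""
    if len(predicted) != len(actual) or len(actual) == 0:
        raise ValueError("need two non-empty grade lists of equal length")
    # Count positions where the two grade lists agree exactly.
    hits = sum(1 for p, a in zip(predicted, actual) if p == a)
    return hits / len(actual)


# Hypothetical example: the model's 2019 predictions vs. real 2019 results.
model_grades = ["A", "B", "C", "C"]
exam_grades = ["A", "B", "B", "C"]
rate = exact_match_rate(model_grades, exam_grades)
```

On this measure, an accuracy of 50 to 60 per cent means that four or five students in every ten would receive a grade different from the one they would have earned in the exam.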

This points to a potentially serious flaw in the model, yet the ranking measure wasn’t its only weakness. After the raw data was fed into the model, standardisation was carried out to account for inconsistencies in how different schools might have predicted grades for students, such as more pessimistic or optimistic predictions.

Ofqual’s Direct Centre Performance (DCP) model attempted to iron out these disparities in scoring method as well as “maintain[ing] overall qualification standards” and “produc[ing] overall outcomes broadly in line with those of previous years”.

To achieve this, the standardisation model combined information about individual students with historical performance data from each school or college. It determined “the most likely distribution of grades for each centre based on the previous performance of the centre” and the prior attainment profile of 2020’s students. It then used teachers’ rankings to place pupils along this expected grade distribution.
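The final step, placing ranked pupils along an expected grade distribution, can be illustrated with a minimal sketch. It assumes the distribution is given as a share of pupils per grade and that the ranking runs from strongest to weakest; the function name, the rounding rule and the example shares are all hypothetical rather than taken from Ofqual’s model.

```python
from typing import Dict, List

GRADES = ["A*", "A", "B", "C", "D", "E", "U"]  # best to worst


def allocate_by_rank(ranked_students: List[str],
                     grade_shares: Dict[str, float]) -> Dict[str, str]:
    """Hand out grades down a ranked list according to target shares."""
    n = len(ranked_students)
    result: Dict[str, str] = {}
    cumulative = 0.0
    start = 0
    for grade in GRADES:
        cumulative += grade_shares.get(grade, 0.0)
        # Rank position where this grade's bin ends (rounding is an assumption).
        end = min(n, round(cumulative * n))
        for student in ranked_students[start:end]:
            result[student] = grade
        start = max(start, end)
    # Anyone left over because of rounding receives the lowest grade.
    for student in ranked_students[start:]:
        result[student] = GRADES[-1]
    return result


# Hypothetical centre: ten pupils ranked s1 (strongest) to s10.
students = [f"s{i}" for i in range(1, 11)]
shares = {"A*": 0.1, "A": 0.2, "B": 0.4, "C": 0.3}
grades = allocate_by_rank(students, shares)
```

In this toy example the pupil ranked third gets an A and the pupil ranked fourth gets a B, however close the two are in ability. The teacher’s own predicted grade for each pupil plays no part in the allocation.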

In choosing a model, Ofqual decided that standardisation would place more weight on the available statistical data (such as the school’s historical performance) than on teachers’ estimated grades, because relying on the latter would ostensibly lead to an intolerable level of grade inflation and greater unfairness between schools.

Ofqual also decided against a standardisation model that privileged students’ past attainment, on the basis that it would give teachers’ predicted grades more weight than their rankings.

But evidently this approach led to students in disadvantaged areas being disproportionately marked down. The proportion of A*s and As awarded at independent schools rose by 4.7 percentage points – more than double the rise at state comprehensive schools. How might the statistical model have contributed to disadvantaged students receiving lower grades?

One of the reasons is that for smaller cohorts, of fewer than 15 students in a particular subject at a particular school or college, the model put more weight on teachers’ predicted grades than on those calculated by the algorithm. This is because no statistical model can reliably predict grades for very small groups of students.

This means that the grades of pupils at larger schools, where more people take a given subject, were far more influenced by the school’s historical performance and their teacher’s ranking than by their predicted grades. The inverse was true for pupils at smaller schools with smaller class sizes – for them, their teachers’ predicted grades were far more decisive.
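The cohort-size effect can be sketched as a weighting function. The article only establishes that cohorts of fewer than 15 leaned more heavily on teacher predictions; the lower threshold of five, the linear taper and the function name below are assumptions made purely for illustration.

```python
def teacher_weight(cohort_size: int, small: int = 5, large: int = 15) -> float:
    """Weight given to the teacher's predicted grade (1.0 = fully decisive).

    `small` and the linear taper are hypothetical; only the threshold of 15
    comes from the text.
    """
    if cohort_size <= small:
        return 1.0  # tiny cohort: teacher prediction alone
    if cohort_size >= large:
        return 0.0  # large cohort: statistical model alone
    # In between, the teacher's weight falls off linearly with cohort size.
    return (large - cohort_size) / (large - small)


# Hypothetical cohorts: a 4-pupil classics set vs. a 60-pupil maths cohort.
classics_weight = teacher_weight(4)
maths_weight = teacher_weight(60)
```

Under these assumptions a four-pupil classics set keeps its teacher’s predictions in full, while a 60-strong maths cohort is graded entirely from the school’s historical distribution and the ranking.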

“What you have found, is that the algorithm has systematically privileged people who took, for example, classics at an independent school, and systematically underprivileged people in those larger entry subjects at that same school,” says Cori Crider, director of Foxglove, a group that challenges “digital injustices”.

This effect was compounded by the rigidity of the ranking scores. “For example, it says that for one centre, you need ten A*s and so the top ten students from the rankings go into that bin,” says Nason. “My problem is, if there is a lot of uncertainty about the ranking, how do you know you’re dropping the right students into the right bins?”
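Nason’s concern can be explored with a small simulation. The sketch below assumes each pupil has a “true” ability score and that the teacher perceives it with some random error; it then counts how many of the genuinely top pupils still land in the top bin. The function name, the noise model and the numbers are all hypothetical.

```python
import random
from typing import List


def top_k_overlap(abilities: List[float], k: int,
                  noise_sd: float, seed: int = 0) -> int:
    """Count how many of the true top-k pupils appear in a noisy top-k."""
    rng = random.Random(seed)
    n = len(abilities)
    # Ranking by true ability, strongest first.
    true_order = sorted(range(n), key=lambda i: abilities[i], reverse=True)
    # The teacher's view: true ability plus Gaussian error (an assumption).
    perceived = [a + rng.gauss(0.0, noise_sd) for a in abilities]
    noisy_order = sorted(range(n), key=lambda i: perceived[i], reverse=True)
    return len(set(true_order[:k]) & set(noisy_order[:k]))


# Hypothetical cohort of 30 pupils with evenly spaced abilities.
cohort = [float(score) for score in range(30)]
perfect = top_k_overlap(cohort, k=10, noise_sd=0.0)  # no ranking error
noisy = top_k_overlap(cohort, k=10, noise_sd=5.0)    # uncertain ranking
```

With no ranking error, all ten of the strongest pupils fill the ten A* places. Once error is introduced the overlap can fall, and because the bin size is fixed, every pupil wrongly placed in it displaces one who belonged there.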

By the Ofqual report’s own admission, teachers’ estimated grades tend to be inflated compared with what pupils achieve (teachers said they supplied predictions based on how the pupil would perform “on a good day”). This explains why pupils at smaller schools, which are more likely to be independent schools, received inflated grades, while those at the institutions with the largest numbers of students – such as sixth form colleges – were downgraded relative to their teachers’ predictions.

“If you’re at a high performing school, it’s much more likely under-performers benefitted from over-prediction of their grades,” says Wilkinson. “At low-performing schools, high-performers have been shifted down: they have all been shifted towards the average of the school performance over the previous three years.”

The overall results show record highs for As and A*s, but that doesn’t mean that the right students received them. “They’ve gone for an approach which somehow maintains the integrity of the overall system, but I think there’s a lot of individual unfairness […],” says Nason.

Unless Ofqual takes action to address the potential discrimination, Foxglove plans to challenge the body in court, on behalf of Curtis Parfitt-Ford, an A-level student at Elthorne Park High School.

“The statute that set up Ofqual says that they exist to measure individual achievement,” says Crider. “That’s not what this did.” She says Foxglove believes Ofqual should set up a free individual appeal route for students to challenge grades they think are unfair. At present, the body is charging students to appeal their grades – something that appears particularly unfair given that it’s disadvantaged students who were disproportionately affected.

Crider says that since agreeing to take on Parfitt-Ford’s case, Foxglove has received hundreds of emails from other distressed students.

Ofqual’s report reads: “To understand the impact of potential advantage or disadvantage across different demographic and socio-economic groups we have also performed an equalities analysis of calculated grades. The analyses show no evidence that this year’s process of awarding grades has introduced bias.”