As the world rushes to apply AI to its work practices, its use is becoming apparent both in the production of research ‘products’ for assessment (outputs, proposals, CVs) and in the actual assessment of those products and their producers.
This all comes at a time when the research sector is seeking to reform the way it assesses research, both to mitigate some of the problematic outcomes of publication-dominant forms of assessment (such as the rise in paper mills, authorship sales, citation cartels, and a lack of incentives to engage with open research practices) and to prioritise peer review over solely quantitative forms of assessment.
Where assessment reform and AI tools meet
There are two main issues that arise at the intersection of assessment reform and AI. The first is the extent to which our current assessment regime is driving the misuse of generative AI to produce highly prized outputs that look scholarly but aren’t. And the second is the extent to which AI might legitimately be used in research assessment going forward.
On the first issue, we are on well-trodden ground.
The narrow, publication-dominant methods of assessment used to evaluate research and researchers are driving many poor behaviours. One such behaviour is the pursuit of questionable research practices – such as publication and citation bias. Worse still is research misconduct – such as fabrication, falsification and plagiarism.
The system rewards publication in and of itself above the content and quality of the research, to the point that it is now rewarding mere approximations of publications. It should therefore come as no surprise that bad actors will be financially motivated to use any means at their disposal to produce publications, including AI.
In this case, our main problem is not AI, but rather publication-dominant research assessment. We can address this problem by broadening the range of contributions we value and taking a more qualitative approach to assessment. By doing this, we will at least disincentivise polluting the so-called ‘scholarly record’ (curated, peer-reviewed content) with fakes and frauds.
AI in research outputs versus assessment
Assuming we were successful in disincentivising the use of AI to generate value-less publications in any reformed assessment regime, the question remains as to whether its use might be incentivised for other aspects of assessment.
This is because broadening how we value research, and moving to more qualitative (read ‘narrative’) forms of assessment, will lead to more work, not less, for both assessors and the assessed.
And if there is one thing we know GenAI is good at, it’s generating narratives at speed. GenAI might even help to level the playing field for those whose first language is not the language of assessment, making their papers clearer and easier to read.
Most guidelines state that if the right safety precautions are followed – if the human retains editorial control, is transparent about their use of AI, and doesn’t enter sensitive information into a large language model – it is perfectly legitimate to submit the resulting content for assessment.
Where the guidelines are more cautious is around the use of AI to do the assessing.
The European Research Area (ERA) guidelines on the responsible use of AI in research are clear that we should “refrain from using GenAI tools in peer reviews and evaluations”. But that’s not to say that researchers aren’t experimenting.
Mike Thelwall’s team has shown only limited success in using ChatGPT to replicate human peer review scores, and many researchers believe they’ve been on the receiving end of a new, over-thorough, less aggressive Reviewer Two that is probably an AI.
But given that human peer review is already a highly contested exercise (when does Reviewer One ever agree with Reviewer Two?), we must ask the question: if ChatGPT can’t replicate human peer review scores, does that say more about the AI or about the humans?
We have to question whether the human scores are the ‘correct’ ones, and whether we are doing machine learning a disservice by expecting it simply to replicate human scores, only faster. One might argue that the real power of AI lies in seeing what we can’t: finding patterns we cannot and identifying potential that we cannot.
The dual value of peer review
Perhaps we must first ask, is the scholarly process itself purely about generating and (through research assessment) verifying new discoveries? Or is there something valuable in the act of discovery and verification: the acquisition and deployment of skills, knowledge and understanding, which is fundamental to being human?
We have to ask if the process of collaborating with other humans in the pursuit of new knowledge is just about this new knowledge, or whether the business of building connections and interfacing with others is essential to human well-being, to civil society, and to geopolitical security.
The recognition of fellow humans – through peer review and assessment – is more than just a verification of our results and our contributions; it is something critical to our welfare and motivation: an acknowledgement that, human to human, I see you and I value you. Would any researcher be happy knowing their contribution had been assessed by automation alone?
It comes down to whether we value only the outcome or the process.
And if we continuously outsource that process to technology, and generate outcomes that might provide answers, but that we don’t actually understand or trust, we risk losing all human connection to the research process. The skills, knowledge and understanding we accumulate through performing assessments are surely critical to research and researcher development.
Proceeding with the right amount of caution
There is no justification for condemning AI outright. It is being used (and its accuracy then verified by humans) to solve many of society’s previously unsolved problems.
However, when it comes to matters of judgement, where humans may not agree on the ‘right answer’ – or even that there is a right answer – we need to be far more cautious about the role of AI. Research assessment is in this category.
There are many parallels between the role of metrics and the role of AI in research assessment. There is significant agreement that metrics shouldn’t be making our assessments for us without human oversight. And assessment reformers are clear that referring to appropriate indicators can often lead to a better assessment, but human judgement should take priority.
This logic offers us a blueprint for approaching AI: human judgement first, and technology in support; or AI-augmented human assessment.
In forbidding the use of AI in assessment altogether, the ERA guidelines took an understandably cautious initial stance. However, properly contained, the judicious involvement of AI in assessment can be our friend, not our enemy. It largely comes down to the type of research assessment we are talking about, and the role we allow AI to play.
The use of AI to provide a first draft of written submissions, or to summarise, identify inconsistencies, or provide a view on the content of those submissions could lead to fairer, more robust, qualitative evaluations.
However, we should not rely on AI to do the imaginative work of assessment reform – rethinking what ‘quality’ looks like – nor should we outsource human decision-making to AI altogether. As we look to reform research assessment, we should simply be open to the possibilities offered by new technologies to support human judgements.
Dr Elizabeth Gadd is head of research culture and assessment at Loughborough University, United Kingdom. She chairs the International Network of Research Management Societies (INORMS) Research Evaluation Group and is a vice chair of the Coalition on Advancing Research Assessment (CoARA). She co-authored the UKRI-commissioned report, “Harnessing the Metric Tide: Indicators, infrastructures and priorities for UK Research Assessment”.
Professor Nick Jennings is vice-chancellor and president of Loughborough University, UK. He was previously the vice-provost for research and enterprise at Imperial College London, the UK Government’s first chief scientific advisor for national security, and the UK’s first Regius Professor of computer science. His research is in the areas of AI, autonomous systems, cyber-security and agent-based computing.