What are we going to do about marking now?
Raelene Tifflin, Curtin University
Assessment design has never been perfect, and concerns about validity are hardly new. Scholars like Boud and Falchikov (2007) spent decades pointing out that assessment is integral to student learning. But generative AI (GenAI) has brought a sense of urgency to how we assess learning, especially when it comes to marking and evaluation.
As marking season approaches, many educators will be allowing students to use GenAI in assessments (sometimes referred to as ‘open’ assessments). Yet this raises the question: are we marking human or AI work? And if the assessment design allows for GenAI use, how can educators truly detect that a student has learned something, rather than simply used AI to meet the assessment submission guidelines?
The artefact is no longer a reliable witness
If you want to see how significant this problem really is, watch Professor Danny Liu and Dr Benjamin Miller’s 2024 short video series showing GenAI producing the kinds of assessed work we all recognise. The outputs are unnervingly convincing. GenAI can produce a reflective journal that reads as a personal account, an essay that handles disciplinary vocabulary competently, and a video pitch with clear structure, all without a student needing to engage with the material.
The issue for markers is not just that these outputs exist, but that they arrive looking exactly like the work we have trained ourselves to reward. A well-organised argument, appropriate use of sources, and fluent academic prose are the familiar features we scan for when we mark, and they are precisely what GenAI is good at producing. The artefact no longer tells us whether a student engaged with the material. It tells us that something was assembled, and nothing more.
This matters because marking in higher education has never simply been measurement. Sadler’s work on assessment standards, developed over decades (Sadler, 2013), reminds us that grading student work is an act of human judgement, shaped by tacit knowledge, disciplinary socialisation, and a marker’s own sense of quality. Boud and colleagues make a related point: what we are really assessing are the contextual features we have learned to associate with the learning we expect (Boud et al., 2018). When those features can be generated on demand, the judgement call gets significantly harder.
The open assessment problem
The sector’s current response has largely been to split assessments into two lanes: closed, supervised tasks where AI cannot easily intrude, and open tasks where AI use is permitted or at least monitored (Liu & Bridgeman, 2023). That’s a reasonable first move, but it sidesteps the harder question.
In open assessments, we are still reading the essay, watching the video, listening to the presentation, and making judgements about a product that can now be assembled without the learning it was supposed to evidence. The underlying issue is that we have long relied on the product to stand in for the process, and that substitution no longer holds.
So what should we be looking for?
If the product can no longer carry the weight, we need to get sharper about the signals we are trying to detect. At Curtin, our Assessment 2030 team has been working through this using three capacities: conceptual grasp, evaluative judgement, and contextual transfer. None of these appears in student work by accident. They must be built into assessment design and marking criteria from the start.
Conceptual grasp
This is the capacity to know something well enough to use it, not merely name it. Bruner (1976) showed that this kind of understanding is built through repeated, deepening engagement with ideas over time. Marzano and Kendall (2007) distinguish usefully between recalling information and demonstrating how a concept’s parts function together. Marton and Säljö’s (1976) foundational work on deep versus surface learning reinforces the point. Criteria need to capture how ideas connect, not simply whether the right terms appear.
Evaluative judgement
Evaluative judgement is the capacity to make reasoned quality judgements about one’s own work and the work of others. Boud and colleagues argue it must be taught and assessed deliberately, not assumed to develop on its own (Tai et al., 2018). Henderson, Ryan and Phillips (2019) add that criteria need to be visible and discussable so students can internalise what quality looks like, rather than performing to a rubric they do not understand. Criteria need to test whether students can calibrate their own sense of quality against disciplinary standards, not just whether they can produce a polished surface.
Contextual transfer
This is the capacity to take what you know into an unfamiliar situation, adapt, and return with something more than you started with. Perkins and Salomon (1988) argue that the real test of learning is performing well in entirely new environments, not reproducing what was taught. Bearman and colleagues push this further, arguing that authentic assessment must prepare students for unpredictable futures rather than replicating current realities (Bearman et al., 2024; Nieminen et al., 2023). Criteria need to reward students for working in the unfamiliar, not just demonstrating what they have already practised.
This is a cultural problem, not a technical one
All of this points somewhere deeper than policy settings or platform choices. Ellis and Lodge (2024) put it plainly: we need to stop looking for evidence of cheating and start looking for evidence of learning. That is a cultural shift, and it requires more than updated rubrics.
Take your current marking criteria and ask yourself honestly: which of these could be satisfied by a polished AI output with minimal student thinking behind it? Then ask what you would need to change to make conceptual grasp, evaluative judgement, and contextual transfer visible. You do not necessarily need to throw out every existing task. But you do need to ask different questions of the work you receive.
GenAI has not broken assessment. It has forced open a set of problems we have been tolerating for years, and given us a reason to finally do something about them.
Raelene Tifflin is a Senior Lecturer, Learning Futures, at Curtin University.
