We were surprised to read this week that, according to a new study, in a three-party version of the Turing test OpenAI's GPT-4.5 model, when adopting a persona, was judged to be the human 73% of the time.
Just to refresh the memory … in the Turing test a human evaluator judges a text transcript of a conversation between a human and a machine, testing the machine's ability to exhibit intelligent behavior equivalent to that of a human. The machine passes the test if the evaluator cannot reliably tell the two apart.
Why were we surprised? Because we thought (wrongly, it seems) that LLMs like ChatGPT had passed this particular test some time ago. That got us thinking about where we are with AI, and with tests and benchmarks more generally. About three months ago OpenAI shared some performance indicators for its latest o3 model (reviewed by Ethan Mollick in his excellent Substack ‘One Useful Thing’[1]). As Mollick noted, all of these results suggest that barriers to AI performance we previously considered insurmountable may in fact fall quite quickly.

On the GPQA test – “a challenging dataset of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry” – PhDs in the corresponding domains reach 65%, but o3 achieved 87%, beating human experts for the first time. The FrontierMath(s!) test was created in collaboration with over 70 mathematicians and spans the full spectrum of modern mathematics; each problem demands hours of work from expert mathematicians. GPT-4 and Gemini solve fewer than 2% of the problems. o3 got 25% right.
Then we have the ARC-AGI-1 test. In 2019, François Chollet published "On the Measure of Intelligence", which introduced the "Abstraction and Reasoning Corpus for Artificial General Intelligence" (ARC-AGI) benchmark as a measure of intelligence. o3 scored between 75.7% and 87.5% (depending on how much compute it was allowed). This led to some excitement about how close OpenAI was to achieving AGI, especially since, at the beginning of 2025, Sam Altman claimed on his blog that “we are now confident we know how to build AGI as we have traditionally understood it”.
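To give a sense of what ARC-AGI actually measures: each task presents a handful of demonstration grid pairs and asks the solver to infer the underlying rule and apply it to a new input. Below is a minimal, hypothetical sketch in Python, assuming the publicly documented ARC task format ("train" and "test" pairs, with grids as lists of lists of integers encoding colours); the toy rule here is our own invention for illustration, not a real benchmark task.

```python
# A minimal sketch of an ARC-style task, assuming the publicly documented
# ARC format: "train" holds demonstration input/output grid pairs, "test"
# holds inputs whose outputs the solver must predict. Grids are lists of
# lists of integers (each integer encodes a colour). The rule below is a
# toy example of our own, not an actual benchmark task.
toy_task = {
    "train": [
        {"input": [[0, 1], [1, 0]], "output": [[1, 0], [0, 1]]},
        {"input": [[2, 0], [0, 2]], "output": [[0, 2], [2, 0]]},
    ],
    "test": [
        {"input": [[3, 0], [0, 3]]},  # hidden answer: [[0, 3], [3, 0]]
    ],
}

def solve(grid):
    """Hypothetical solver for this toy task: mirror each row left to right."""
    return [list(reversed(row)) for row in grid]

# Scoring is all-or-nothing: the predicted grid must match the hidden output exactly.
prediction = solve(toy_task["test"][0]["input"])
print(prediction)  # [[0, 3], [3, 0]]
```

The appeal of the format is that a typical human can infer such a rule from a couple of examples; the question the benchmark asks is whether a machine can too, which is what made o3's scores so striking.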
So how useful are these tests for Artificial General Intelligence? How will we know when we have achieved it (assuming it’s not an extinction-level event for humanity)? And does it matter?
To have a recognized test for AGI, surely we must all agree on what we are testing for! We need a definition of AGI, just so that we know what we are looking for. This would seem a key first step, BUT it turns out it isn’t as easy as you might think. There isn’t a generally accepted, agreed-upon definition of AGI. Every AI researcher is free to come up with their own version of AGI and claim that they have cracked it. So, when Sam Altman claims that OpenAI “knows how to build AGI as we have traditionally understood it”, we have to understand their definition.
But just to make this even more head-scratching … OpenAI and Sam Altman seem to have different definitions. Here is where we are in danger of heading down a rabbit hole! Fortunately, Lance Eliot has published a really helpful explanation in a Forbes magazine article[2] entitled ‘Sam Altman moves the cheese when it comes to attaining AGI’.
Eliot gives us three definitions: his own, OpenAI’s and Sam Altman’s.
His own definition: “AGI is defined as an AI system that exhibits intelligent behavior of both a narrow and general manner on par with that of humans, in all respects.”
OpenAI’s definition: “AGI is defined as highly autonomous systems that outperform humans at most economically valuable work.”
And Sam Altman’s definition: “AGI is defined as a system that can tackle increasingly complex problems, at human level, in many fields.”
As Eliot points out, ‘highly autonomous’ in the OpenAI definition is a bit troubling, as it implies the AGI system is not fully autonomous. It would seem to us that intelligence on a par with, or exceeding, human capacity would need to be fully autonomous. Does it still need some form of human intervention? And what does ‘most economically valuable work’ mean? Is there some economically valuable work it could fail at and still achieve AGI status? Does thinking that has no economic value (maybe a PPE degree!?) disqualify it from achieving AGI?
As AGI definitions go, Altman’s also leaves a bit to be desired. One could argue that ‘tackling increasingly complex problems, at human level, in many fields’ is where we already are with the current crop of AI models, so it is not a good, differentiating definition. And this would seem to be the problem: we can define AGI any way we like and claim victory.
François Chollet – he of ARC-AGI – gave his own take on how we might then test for AGI … “you’ll know AGI is here when the exercise of creating tasks that are easy for regular humans but hard for AI becomes simply impossible”. Thomas Dietterich at Oregon State University suggests that to qualify as intelligence on a level with human cognition, AGI will also need to demonstrate “episodic memory, planning, logical reasoning and, most importantly, meta-cognition”.
Perhaps we just have to fall back on US Supreme Court Justice Potter Stewart, who, when describing another threshold test (in that case, the definition of ‘obscenity’), famously set the “I know it when I see it” standard.
But then, does it matter that nobody can agree on the definition of AGI? Does it matter that OpenAI’s o3 scored 25% on FrontierMath or 87.5% on the ARC-AGI-1 test? On some levels, perhaps many levels, this could be really important; on another level, it is largely irrelevant.
Are we there yet? Whether or not we achieve AGI, the journey towards it alone will be hugely impactful, delivering new capabilities that once again redefine the human-technology relationship. Studying the impact of the latest series of (non-AGI) models on knowledge work, and looking ahead at their evolution … therein lies the power to completely transform our landscape. AI may develop many of the cognitive abilities associated with the human mind yet still not achieve AGI status (whatever that might be). Even without achieving AGI, the new models may well have capabilities that allow them to do even more knowledge work, and displace even more knowledge workers, than they currently do.
It may walk like a duck and quack like a duck and still not be a duck, but it is so ‘duck-like’ that in reality it makes no difference. So AGI might be a bit of a red herring! (Apologies for the various animal-related metaphors.) AI doesn’t need the intelligence of the smartest humans to completely reshape whole sectors of the economy. You don’t need to be a savant to perform well in whatever job you have; most of us are ‘smart enough’. We don’t need to achieve AGI for artificial intelligence that is ‘just smart enough’ to displace millions of jobs.
[2] https://www.forbes.com/sites/lanceeliot/2025/02/11/sam-altman-moves-the-cheese-when-it-comes-to-attaining-agi/