François Chollet on ARC-AGI-3: The Future of AI Reasoning

François Chollet discusses ARC-AGI-3, a new benchmark for AI reasoning, highlighting current AI's limitations and the path toward general intelligence.

The ONLY benchmark that AI can't solve (humans ace it) — Matthew Berman on YouTube

François Chollet, a prominent AI researcher and the creator of the deep learning library Keras, has been at the forefront of developing benchmarks to measure progress toward artificial general intelligence (AGI). His work has culminated in the Abstraction and Reasoning Corpus (ARC), a series of increasingly demanding tasks designed to test AI systems' ability to generalize and reason from minimal information. In a recent discussion, Chollet presented the latest iteration, ARC-AGI-3, emphasizing its unique approach to evaluating AI's reasoning capabilities and the significant gap that still exists between current AI and human-level intelligence.

François Chollet on ARC-AGI-3: The Future of AI Reasoning - Matthew Berman

Who is François Chollet?

François Chollet is a pivotal figure in the AI research community, most notably as the creator of Keras, a widely used open-source deep learning library that has democratized access to powerful AI tools. His research centers on understanding and replicating human intelligence, particularly reasoning and generalization. The ARC benchmark series reflects this focus: its tasks are intuitive for humans but difficult for current AI systems, making it a more accurate measure of AGI progress than benchmarks that reward memorization.

Understanding the ARC Benchmarks

The ARC benchmark suite, introduced in 2019, is built upon a foundation of grid-based reasoning problems. Each task presents a small set of input-output examples, requiring the AI agent to infer the underlying rules and apply them to solve a new, unseen problem. The key innovation of ARC lies in its design to assess general intelligence rather than specialized pattern recognition. Chollet's goal was to move beyond the limitations of existing benchmarks that could be 'gamed' by AI systems that excel at narrow tasks but fail to exhibit true understanding or adaptability.
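The task format described above can be sketched in a few lines of Python. The grids below and the candidate rule (a horizontal flip) are invented for illustration and are far simpler than real ARC tasks, but they show the basic structure: a few input-output example pairs, a rule inferred from them, and a new input the rule must be applied to.

```python
# Toy illustration of an ARC-style task. Grids are lists of lists of
# small integers (ARC uses color codes 0-9). The "rule" here, a
# horizontal flip, is a made-up example, not an actual ARC task.

def flip_horizontal(grid):
    """Candidate rule: mirror each row left-to-right."""
    return [row[::-1] for row in grid]

# Training pairs: a solver must infer the rule from these examples alone.
train_pairs = [
    ([[1, 0], [2, 3]], [[0, 1], [3, 2]]),
    ([[5, 5, 0], [0, 4, 4]], [[0, 5, 5], [4, 4, 0]]),
]

# Verify that the candidate rule explains every training example...
assert all(flip_horizontal(inp) == out for inp, out in train_pairs)

# ...then apply it to the unseen test input.
test_input = [[7, 0, 0], [0, 7, 0]]
print(flip_horizontal(test_input))  # [[0, 0, 7], [0, 7, 0]]
```

The hard part, of course, is the step this sketch skips: searching the space of possible rules and settling on one from only two or three examples, which is exactly the generalization ability ARC is designed to measure.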

The first version, ARC-AGI-1, focused on basic reasoning and generalization. The second, ARC-AGI-2, raised the difficulty by requiring agents to interpret symbols as carrying meaning beyond their visual patterns, testing their ability to recognize connections and transformations. Even advanced AI models struggled significantly with ARC-AGI-2, falling far short of human scores. This showed that while AI could perform complex transformations, it often lacked the deeper semantic understanding that humans apply naturally.

ARC-AGI-3: A New Level of Challenge

ARC-AGI-3 represents a further escalation in difficulty, designed to be even more challenging for current AI systems. Chollet notes that while humans can consistently solve ARC-AGI-3 tasks, AI models have shown minimal success, scoring below 1%. This stark contrast underscores the ongoing challenges in achieving true AGI. The benchmark requires a sophisticated combination of pattern recognition, logical deduction, and abstract reasoning, pushing the boundaries of what current AI architectures can achieve.

Chollet demonstrates the nature of these tasks through examples, showing how humans can infer rules from a few examples and apply them to novel situations. The game-like interface of ARC-AGI-3, presented on a retro handheld device, provides an interactive way to engage with these problems. Participants are given a limited number of turns to solve each puzzle, demanding efficient and insightful reasoning.
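Unlike the earlier static puzzles, ARC-AGI-3 is interactive: an agent takes actions, observes the result, and must solve the game within a turn budget. The sketch below illustrates that evaluation loop in miniature. The `ToyEnvironment` class and its guess-a-number "game" are entirely invented stand-ins, not the real ARC-AGI-3 API, which is not described in detail here.

```python
# Hypothetical sketch of a turn-limited, interactive evaluation loop in
# the spirit of ARC-AGI-3. The environment and action space are invented
# for illustration; the real benchmark uses grid-based games.
import random

class ToyEnvironment:
    """Stand-in puzzle: find a hidden number within a turn budget."""
    def __init__(self, max_turns=10):
        self.target = random.randint(0, 9)
        self.max_turns = max_turns
        self.turns_used = 0

    def step(self, action):
        """Take one action; report (solved, out_of_turns)."""
        self.turns_used += 1
        solved = (action == self.target)
        out_of_turns = self.turns_used >= self.max_turns
        return solved, out_of_turns

def run_agent(env):
    # A trivial agent that enumerates actions. A real ARC-AGI-3 agent
    # must instead infer the game's rules from its observations, since
    # brute force does not fit inside the turn budget.
    for action in range(10):
        solved, done = env.step(action)
        if solved:
            return True
        if done:
            return False
    return False

print(run_agent(ToyEnvironment()))  # True: 10 turns suffice here
```

The turn limit is the key design choice this loop captures: it rewards agents that form a hypothesis about the game's rules quickly rather than those that explore exhaustively.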

The Cost-Performance Landscape of AI Models

The discussion also touched upon the performance of various AI models on the ARC benchmarks, presented through leaderboards that plot score against cost per task. Chollet highlights that even state-of-the-art models from OpenAI and Google, while showing promise, still fall significantly short of human performance on these reasoning tasks: top models have reached roughly 70-80% on the earlier ARC benchmark, whereas human test-takers solve essentially all of the tasks. The costs associated with these high-scoring models are also considerable, running to several dollars per task in some cases.

The leaderboards for ARC-AGI-3 make the gap even starker. Even the latest frontier Gemini and GPT models, despite their advanced capabilities, score very low percentages, indicating how far they are from meeting the benchmark's reasoning requirements. This cost-performance analysis provides a crucial perspective on the current state of AI development: computational power and model size have increased enormously, yet true general intelligence and robust reasoning remain elusive.

The Future of AI Reasoning and the ARC Prize

Chollet's work on ARC-AGI-3 is not just about creating a difficult benchmark; it is about driving progress in AI reasoning. The ARC Prize Foundation aims to advance open-source AGI research through competitions and prizes, with ARC-AGI-3 as a key component of this initiative. The foundation is offering $2,000,000 in prizes, encouraging researchers and developers worldwide to tackle these problems. The ARC-AGI competition invites participants to build AI agents that can play ARC-AGI-3 games, with the grand prize reserved for the best open-source solution.

The benchmark serves as a critical tool for identifying the limitations of current AI and guiding future research. Chollet's insights suggest that current AI, while capable of impressive feats, still lacks the intuitive understanding, adaptability, and robust reasoning abilities that define human intelligence. The ongoing development and testing on benchmarks like ARC-AGI-3 are essential steps towards bridging this gap and ultimately achieving AGI.

Microsoft 365 Copilot: A Practical Application of AI

The video also included a sponsored segment for Microsoft 365 Copilot, illustrating how AI is being integrated into everyday productivity tools. The presenter shared his experience as a founder, describing how AI transformed his workflow from manual, time-consuming tasks to streamlined processes. He demonstrated how Copilot can summarize documents, review contracts, and integrate data across applications like Word and Excel, saving significant time and freeing attention for creative and strategic work. This practical application underscores the tangible benefits AI can already bring to businesses and individuals.