AI Research

50 articles in this category

Google DeepMind Tackles AI Evaluation Challenges

Google DeepMind Tackles AI Evaluation Challenges

Google DeepMind's Nicholas Kang and Michael Aaron discuss the challenges in current AI evaluation and Kaggle's innovative solutions like Hackathons, Agent Exams, and Game Arena.

about 23 hours ago
Omar Sanseviero on Google's AI Strategy

Omar Sanseviero on Google's AI Strategy

Omar Sanseviero from Google DeepMind discusses Google's AI strategy, focusing on efficient models, multimodality, and open innovation in AI.

1 day ago
Graph Neural Networks Explained: GNN Basics & Models

Graph Neural Networks Explained: GNN Basics & Models

Explore the essentials of Graph Neural Networks (GNNs), from their basic principles to key models like GCNs, GraphSAGE, GATs, GINs, and Transformers.

1 day ago
DeepMind's Scale: How Agents Run at Google

DeepMind's Scale: How Agents Run at Google

Google DeepMind's KP Sawhney and Ian Ballantyne reveal how they run AI agents at scale, discussing the architecture, tools, and challenges involved in managing complex automated tasks.

2 days ago
Google DeepMind on Building with Gen Media Stack

Google DeepMind on Building with Gen Media Stack

Google DeepMind's Paige and Guillaume showcase building generative media pipelines with Google's Gen Media Stack.

3 days ago
MOSS: Source-Level Self-Rewriting for Agents

MOSS: Source-Level Self-Rewriting for Agents

MOSS enables AI agents to self-rewrite their source code, achieving significant performance gains and overcoming limitations of text-based evolution.

4 days ago
DeltaBox: Millisecond C/R for AI Agents

DeltaBox: Millisecond C/R for AI Agents

DeltaBox revolutionizes AI agent performance by introducing millisecond-level checkpoint/rollback via OS-level change-based state management.

4 days ago
MARL: The Scaffolding for Real-World AI

MARL: The Scaffolding for Real-World AI

Multi-agent reinforcement learning in drone racing surpasses human pilots and drastically cuts collisions, paving the way for safer real-world AI co-existence.

4 days ago
Google DeepMind Taps AI for Asia-Pacific Climate Crisis

Google DeepMind Taps AI for Asia-Pacific Climate Crisis

Google DeepMind launches an 'AI for the Planet' accelerator in Asia Pacific to help organizations tackle environmental risks with advanced AI.

5 days ago
Attractors Unlock Scalable Reasoning

Attractors Unlock Scalable Reasoning

Equilibrium Reasoners (EqR) leverage learned attractor landscapes to achieve scalable, adaptive test-time compute allocation, dramatically boosting accuracy on complex reasoning tasks.

5 days ago
Jure Leskovec on Relational Foundation Models

Jure Leskovec on Relational Foundation Models

Jure Leskovec, AI researcher and Stanford professor, discusses Relational Foundation Models, a new AI approach for understanding complex enterprise data and its applications.

5 days ago
DeepWeb-Bench: Beyond Frontier LLM Claims

DeepWeb-Bench: Beyond Frontier LLM Claims

DeepWeb-Bench benchmark exposes derivation and calibration as major LLM failure points, revealing domain specialization and the inadequacy of current evaluations.

5 days ago
Agent JIT Compilation for Web Automation

Agent JIT Compilation for Web Automation

Agent just-in-time compilation revolutionizes web automation by compiling tasks into efficient code, yielding significant speed and accuracy gains.

5 days ago
Microsoft's small AI agents get smarter

Microsoft's small AI agents get smarter

Microsoft Research unveils MagenticLite, an AI system using smaller models for efficient browser and file system tasks, pushing agentic AI capabilities on user hardware.

5 days ago
OpenAI and Chip Ganassi Racing Tease Future Collaboration

OpenAI and Chip Ganassi Racing Tease Future Collaboration

OpenAI and Chip Ganassi Racing are joining forces for a research and development initiative, as teased in their new video 'R&D: Part 1'.

5 days ago
Chatbots Fail News Accuracy, Forum AI Study Reveals

Chatbots Fail News Accuracy, Forum AI Study Reveals

A Forum AI study reveals major chatbots struggle with news accuracy, showing high failure rates on election-related prompts and reliance on biased sources.

6 days ago
Architecting LLM Agents: The SDB Primitive

Architecting LLM Agents: The SDB Primitive

Architecting reliable production LLM agents hinges on the Stochastic-Deterministic Boundary (SDB) and a catalog of runtime patterns.

6 days ago
AI Solves Erdős Breakthrough: OpenAI Researchers Detail Breakthrough

AI Solves Erdős Breakthrough: OpenAI Researchers Detail Breakthrough

OpenAI researchers reveal how AI has solved the complex Erdős Unit Distance Problem, a breakthrough with implications for mathematics and science.

6 days ago
Foundation Models Unlock Time Series Scaling

Foundation Models Unlock Time Series Scaling

Toto 2.0 foundation models demonstrate remarkable scaling, achieving state-of-the-art forecasting performance across multiple benchmarks with a unified training approach.

6 days ago
GeoX: Self-Play for Geospatial Reasoning AI

GeoX: Self-Play for Geospatial Reasoning AI

GeoX, a novel self-play framework, achieves state-of-the-art geospatial reasoning AI performance without costly human annotations, by generating and solving problems through executable programs.

6 days ago
AI Models Now Predict the Future, Almost

AI Models Now Predict the Future, Almost

Fine-tuning LLMs for forecasting tasks boosts their accuracy, with specialized models now rivaling top human predictors and enhancing ensemble predictions.

6 days ago
Google's Cormac Brick on Tiny LLMs for On-Device Agents

Google's Cormac Brick on Tiny LLMs for On-Device Agents

Google's Cormac Brick discusses the fine-tuning of Tiny LLMs for on-device agents, highlighting the benefits of LiteRT-LM and Gemma 4 for edge AI applications.

6 days ago
SpatioRoute VLM: Dynamic Prompting for Video QA

SpatioRoute VLM: Dynamic Prompting for Video QA

SpatioRoute VLM revolutionizes zero-shot spatial video question answering with dynamic prompt routing, achieving SOTA without fine-tuning or 3D sensors.

7 days ago
TaskGround: Bridging Scene Context and Action

TaskGround: Bridging Scene Context and Action

TaskGround revolutionizes household AI by enabling compact models to interpret complex scenes, infer task structures, and act effectively, drastically improving performance and reducing costs.

7 days ago
LLM Protocols Revolutionize MARL State Recovery

LLM Protocols Revolutionize MARL State Recovery

LLM-driven Multi-Agent Communication (LMAC) uses LLM reasoning to create adaptive protocols, significantly improving state reconstruction and performance in MARL.

7 days ago
Code as the Agent Harness

Code as the Agent Harness

Code is evolving into the foundational 'harness' for AI agents, enabling more executable, verifiable, and stateful systems across diverse applications.

7 days ago
Active Exploration Unlocks Spatial AI

Active Exploration Unlocks Spatial AI

New benchmark ESI-BENCH reveals active exploration is key to embodied spatial intelligence, exposing AI's 'action blindness' and metacognitive gaps.

7 days ago
Unlocking LLM Recall: Data Composition is Key

Unlocking LLM Recall: Data Composition is Key

New research reveals a sigmoid scaling law for LLM factual recall, driven by model size and training data composition, explaining up to 94% of performance variance.

7 days ago
Google's AI Speeds Up Aging Research

Google's AI Speeds Up Aging Research

Google DeepMind's Co-Scientist AI is revolutionizing cellular aging research by rapidly identifying genetic targets and analyzing experimental data, slashing research timelines.

7 days ago
Sam Altman Wins as Jury Sides With OpenAI Mission

Sam Altman Wins as Jury Sides With OpenAI Mission

A jury has dismissed Elon Musk's lawsuit against OpenAI, ruling he sued too late. This decision validates OpenAI's for-profit mission and removes a major legal obstacle.

8 days ago
GenMedia: DeepMind's Vernade on AI's Creative Future

GenMedia: DeepMind's Vernade on AI's Creative Future

Google DeepMind's Guillaume Vernade discusses the evolving potential of generative media (GenMedia) and its role in augmenting human creativity.

8 days ago
Anthropic Explains Long-Running AI Agents

Anthropic Explains Long-Running AI Agents

Anthropic's Ash Prabaker and Andrew Wilson discuss building AI agents that can operate for hours without losing focus or their objectives.

8 days ago
Shodh-MoE: Unlocking Universal SciML

Shodh-MoE: Unlocking Universal SciML

Shodh-MoE's sparse activation architecture resolves multi-physics interference in SciML, enabling universal foundation models with guaranteed physical properties.

11 days ago
Unified Embodied AI: Pelican-Unified 1.0

Unified Embodied AI: Pelican-Unified 1.0

Pelican-Unified 1.0, the first unified embodied foundation model, achieves SOTA performance by integrating VLM, reasoning, and generation, proving unification enhances rather than compromises specialist strengths.

11 days ago
Viverra: Verifying AI-Generated Code

Viverra: Verifying AI-Generated Code

Viverra tackles the trust deficit in AI-generated code by automatically producing formally verified annotations, enhancing developer comprehension and productivity.

11 days ago
AI Delegation: Reliability Concerns Emerge

AI Delegation: Reliability Concerns Emerge

New Microsoft Research highlights how AI can degrade document fidelity in long, delegated tasks, stressing the need for better verification and orchestration.

11 days ago
WARDEN: Tackling Low-Resource Language AI

WARDEN: Tackling Low-Resource Language AI

WARDEN pioneers a modular AI system for low-resource languages, using phoneme transfer and LLM-guided dictionaries to transcribe and translate Wardaman with minimal data.

12 days ago
GRIP-VLM: RL for Efficient Vision-Language Models

GRIP-VLM: RL for Efficient Vision-Language Models

GRIP-VLM employs Reinforcement Learning for discrete Vision-Language Model pruning, achieving superior efficiency and adaptability.

12 days ago
LLMs Tame Software Requirements

LLMs Tame Software Requirements

VERIMED leverages LLMs and SMT solvers to formally audit natural-language software requirements, turning ambiguity into testable signals and boosting verified accuracy.

12 days ago
Real-Time Agentic AI Unlocked

Real-Time Agentic AI Unlocked

New methods like Asynchronous I/O and Speculative Tool Calling slash latency for agentic AI, enabling real-time interactions on both cloud and edge devices.

12 days ago
Beyond Model Capability: The Harness for SE Agents

Beyond Model Capability: The Harness for SE Agents

Autonomous software engineering agents' reliability hinges on a novel 'AI Harness' system, not just model capability, enabling verifiably correct changes.

12 days ago
LMPath: Semantics Supercharge UAV Search

LMPath: Semantics Supercharge UAV Search

LMPath integrates language and vision models to create semantically-aware exploration priors for UAVs, dramatically improving search mission efficiency over traditional geometric methods.

12 days ago
OpenAI Podcast: Image Generation's Renaissance

OpenAI Podcast: Image Generation's Renaissance

OpenAI researchers Kenji Hata and Adele Li discuss the 'renaissance' in AI image generation, highlighting new models, user creativity, and future possibilities.

12 days ago
Mind the Gap in Agent Observability

Mind the Gap in Agent Observability

Microsoft's Amy Boyd and Nitya Narasimhan discuss the critical 'gap' in AI agent observability and the need for better tools.

12 days ago
Agentic AI Fails: Loops, Planning & Unsafe Tool Use

Agentic AI Fails: Loops, Planning & Unsafe Tool Use

An IBM Advisory AI Engineer breaks down why agentic AI systems fail, focusing on infinite loops, planning errors, and unsafe tool use, and offers mitigation strategies.

12 days ago
MoE LLMs Confront Real-World Hardware Noise

MoE LLMs Confront Real-World Hardware Noise

Hardware noise in CIM systems degrades MoE LLM performance. ROMER, a new calibration framework, significantly improves accuracy by restoring load balance and stabilizing routing.

13 days ago
Auditing LLM Agent Skill Integrity

Auditing LLM Agent Skill Integrity

A new framework, Behavioral Integrity Verification (BIV), reveals 80% of LLM agent skills have implementation gaps, primarily due to oversight, and achieves 0.946 F1 for malicious skill detection.

13 days ago
Hybrid Agents Master GUI-Tool Orchestration

Hybrid Agents Master GUI-Tool Orchestration

ToolCUA agent overcomes hybrid action space uncertainty with a novel staged training pipeline, achieving state-of-the-art performance in GUI-Tool orchestration.

13 days ago
Beyond RGB: Grounding Vision-Language on Raw Sensor Data

Beyond RGB: Grounding Vision-Language on Raw Sensor Data

PRISM-VL advances vision-language models by grounding them in raw camera measurements, not just RGB, significantly improving performance on challenging visual tasks.

13 days ago
AlphaGRPO: Reasoning-Enhanced Multimodal Generation

AlphaGRPO: Reasoning-Enhanced Multimodal Generation

AlphaGRPO framework enhances multimodal generation via GRPO and DVReward, enabling reasoning and self-correction without cold-start, validated across benchmarks.

13 days ago