#LLM Evaluation

3 articles with this tag

LLMs Fail Esoteric Code Tasks

Frontier LLMs show a dramatic capability gap on a new benchmark using esoteric programming languages, revealing a reliance on memorization over reasoning.

21 days ago

Artificial Intelligence

Balyasny's AI Engine

Balyasny Asset Management built a powerful AI research engine using OpenAI models, slashing analysis times and boosting investment team confidence.

26 days ago

Technology

Context-Aware Guardrails Tested

Mozilla.ai tested context-aware guardrails for LLMs in a humanitarian setting, revealing crucial multilingual performance disparities and the need for robust, domain-specific safety policies.

about 2 months ago