Frontier Language Model Intelligence, Over Time

Artificial Analysis has released its Frontier Language Model Intelligence, Over Time chart, which tracks the intelligence of leading AI models over time. Its evaluations cover agentic tasks, coding, reasoning, and knowledge. The site pairs them with measurements of cost, speed, and price to provide independent benchmarks for model selection.

The site presents itself as independent analysis of AI, intended to help users understand the AI landscape and choose the best model and provider for their use case. Its highlights include:

- a personalized model recommender, which gives recommendations based on a user's priorities for intelligence, speed, and cost;
- an agent explorer, which compares AI agents for general work, coding, customer support, and other uses across capabilities, pricing, and platform support;
- premium plans, which offer expanded benchmark data, custom visualizations, and industry reports.

The Intelligence section shows the intelligence of leading AI models, based on Artificial Analysis's independent evaluations. It has three charts:

- the Artificial Analysis Intelligence Index;
- the same index split into open-weights and proprietary models;
- Intelligence vs. Cost to Run the Artificial Analysis Intelligence Index.

A Data Playground lets users create their own charts and tables comparing models and providers, save groups of models, and export data. The page also includes the Frontier Language Model Intelligence, Over Time chart.

The Artificial Analysis Coding Agent Index reports performance, cost, and execution time for leading coding agents on end-to-end software engineering tasks.

The Image & Video Leaderboards list top models from the Image Arena and Video Arena leaderboards, with 95% confidence intervals, including a Text to Image Leaderboard.
The intelligence evaluations cover:

- agentic real-world work tasks, scored as (Elo - 500) / 2000;
- agentic coding and terminal use;
- agentic tool use;
- long-context reasoning;
- knowledge, scored as 1 - hallucination rate;
- reasoning and knowledge;
- scientific reasoning;
- coding;
- instruction following;
- physics reasoning;
- long-horizon agentic tasks;
- Kubernetes incident root-cause analysis (ITBench-AA, new);
- visual reasoning.

AA-Omniscience is a knowledge and hallucination benchmark. It rewards accuracy and penalizes bad guesses, giving a comprehensive view of which models produce factually reliable outputs across different domains. Results are summarized in the AA-Omniscience Index.

GDPval-AA evaluates AI models on real-world, economically valuable tasks across a wide range of occupations. Results are reported on the GDPval-AA Leaderboard.

ITBench-AA, a new evaluation, tests AI agents on Kubernetes incident root-cause analysis using offline incident snapshots. It reports average precision at full recall.

The Artificial Analysis Openness Index assesses how open models are, based on their availability and transparency across different components. It is shown broken down by component and plotted against the Artificial Analysis Intelligence Index.

Further sections cover token usage, cost, speed, and price:

- Output Tokens shows the output tokens that leading AI models use to run the Artificial Analysis Intelligence Index, based on independent evaluations.
- Cost Efficiency shows what it costs to run the Artificial Analysis Intelligence Index, based on independent evaluations.
- Speed & Latency compares first-party API performance, including output speed.
- Price, recently updated, shows the price of leading AI models for cache hits, input, and output.

Other sections include:

- Hardware Benchmarking (new), a comprehensive benchmarking of GPUs for language model inference;
- the Video Arena & Leaderboard, which compares leading text-to-video and image-to-video models;
- the Image Arena & Leaderboard, which compares leading image generation and image editing models;
- the Speech Arena & Leaderboard, which compares leading text-to-speech models.