FunnyBench – Can AI Models Tell Funny Jokes?

A new benchmark called FunnyBench tests AI models' ability to tell funny jokes by asking each model to generate ten jokes and letting users vote on whether each one is funny. The live leaderboard ranks models by a Bayesian score computed from user votes, and the model behind each joke is revealed only after the vote is cast. The benchmark aims to evaluate humor generation in AI, a challenging aspect of natural language processing.

FunnyBench asks a simple question: can AI models tell funny jokes? Each model was given the same prompt, “tell me a joke”, ten times. Read a joke and decide whether it is funny. Your votes drive a live leaderboard, and the model is revealed after you vote. We asked each model multiple times to encourage variety, but some still repeated the same joke.

Current leaders (funniest joke so far, funniest model so far): no votes yet.

Live leaderboard columns: Returned model, Provider, Bayesian score, Votes, Funny%.

Details

Jokes were generated through OpenRouter from its model catalog using the exact prompt “tell me a joke”. Generation used temperature 1 where supported, a 120-second timeout, provider fallback disabled, and required parameters enabled. For each call, the returned model, provider, and text were stored. Token counts and cost are stored internally but not displayed, to reduce noise.

The leaderboard uses a Bayesian score: each model starts near the overall average and moves as votes come in, which makes early rankings less jumpy than a raw funny percentage. The leaderboard also shows both the model requested from OpenRouter and the returned model that actually ran, so the benchmark is explicit about what was tested.

For reasoning models, the lowest available reasoning setting was used. Reasoning traces were intentionally not captured because they are not part of the joke shown to voters.

The run excluded models not primarily meant for text, OpenRouter/router/front aliases, search or custom-tool variants, floating “latest” aliases, models with no available price, duplicate free aliases, and any model that failed five calls in a row. Invalid outputs (empty or oversized) were also discarded. The published set keeps ten valid jokes per retained model.
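The page does not publish the formula behind the Bayesian score, so the following is only a minimal sketch of one common way to get the behavior described above: shrinking each model's funny rate toward the overall average. The names `ModelVotes`, `bayesian_scores`, and `prior_strength`, the default of 10 pseudo-votes, and the 0.5 fallback when there are no votes at all are assumptions made for illustration, not details taken from FunnyBench.

```python
from dataclasses import dataclass


@dataclass
class ModelVotes:
    """Vote tally for one model (hypothetical structure)."""
    name: str
    funny: int  # number of "funny" votes
    total: int  # number of votes overall


def bayesian_scores(models, prior_strength=10.0):
    """Shrink each model's funny rate toward the overall average.

    prior_strength is an assumed number of pseudo-votes placed at the
    overall average; larger values make early rankings move more slowly.
    """
    total_votes = sum(m.total for m in models)
    total_funny = sum(m.funny for m in models)
    # Assumed fallback: with no votes anywhere, start every model at 0.5.
    overall = total_funny / total_votes if total_votes else 0.5
    return {
        m.name: (m.funny + prior_strength * overall) / (m.total + prior_strength)
        for m in models
    }


if __name__ == "__main__":
    tallies = [
        ModelVotes("A", funny=2, total=2),     # raw 100%, but only 2 votes
        ModelVotes("B", funny=95, total=100),  # raw 95% over 100 votes
        ModelVotes("C", funny=10, total=100),
        ModelVotes("D", funny=0, total=0),     # no votes: sits at the average
    ]
    scores = bayesian_scores(tallies)
    for name in sorted(scores, key=scores.get, reverse=True):
        print(f"{name}: {scores[name]:.3f}")
```

In this sketch, model A has a perfect raw percentage from two votes but still ranks below model B, whose 95% rests on a hundred votes, and a model with no votes sits exactly at the overall average. That is the "less jumpy" property the page describes, under the stated assumptions.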