WeirdBench is an unconventional LLM benchmarking site.
The site reads benchmark scores from Neon Postgres. Benchmark execution happens locally.
- Benchmark definitions live in lib/benchmarks.ts.
- The website reads leaderboard data from Neon through lib/benchmark-store.ts.
- Benchmark runner scripts execute locally, use your local env vars, and write scores into the database.
- Scores are cached in Postgres by (benchmark_id, model_id), so an existing model score is never recomputed unless you explicitly delete it.
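
The caching rule above can be sketched as a lookup keyed on the (benchmark_id, model_id) pair. This is a minimal illustration only: the `getOrComputeScore` function, the `CachedResult` type, and the in-memory `Map` standing in for the Postgres table are all hypothetical names, and a real store would make asynchronous database calls instead.

```typescript
// Hypothetical sketch: an in-memory Map stands in for the Postgres score table.
type CachedResult = { score: number; cached: boolean };

// Build one lookup key from the (benchmark_id, model_id) pair.
function cacheKey(benchmarkId: string, modelId: string): string {
  return JSON.stringify([benchmarkId, modelId]);
}

// Return the stored score if one exists; otherwise compute it once and store it.
function getOrComputeScore(
  cache: Map<string, number>,
  benchmarkId: string,
  modelId: string,
  compute: () => number
): CachedResult {
  const key = cacheKey(benchmarkId, modelId);
  const existing = cache.get(key);
  if (existing !== undefined) {
    // A stored score is never recomputed; deleting the entry forces a rerun.
    return { score: existing, cached: true };
  }
  const score = compute();
  cache.set(key, score);
  return { score, cached: false };
}
```

Under this sketch, deleting the entry for a pair is the only way to make `compute` run again for that model.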
- nutrition-prediction
  - Source: The Nutrition Prediction Benchmark
  - Task: predict calories, protein, carbs, and fat from ingredient lists for a fixed 50-dish Nutrition5k sample
  - Scoring: 0.6 * accuracy + 0.4 * correlation, where accuracy = 100 / (1 + avg_mape_percent)
  - Ranking: higher is better
- semantic-diversity
  - Source: The Semantic Diversity Benchmark
  - Task: generate exactly 20 English words that are maximally semantically unrelated
  - Scoring: average pairwise semantic similarity
  - Ranking: lower is better
- orthographic-diversity
  - Task: output 20 real English words that are maximally different in spelling under deterministic validity and overlap penalties
  - Scoring: average pairwise Levenshtein distance minus penalties
  - Ranking: higher is better
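
Two of the scoring rules above are plain arithmetic and can be sketched directly. The function names below are hypothetical, not taken from lib/benchmarks.ts. The sketch assumes `correlation` is passed in on whatever scale the project uses, and it computes only the raw average pairwise Levenshtein distance, since the orthographic-diversity penalties are not specified here.

```typescript
// accuracy = 100 / (1 + avg_mape_percent), as stated in the scoring rule.
function accuracyFromMape(avgMapePercent: number): number {
  return 100 / (1 + avgMapePercent);
}

// 0.6 * accuracy + 0.4 * correlation.
// Assumption: `correlation` is already on the scale the project expects.
function nutritionScore(avgMapePercent: number, correlation: number): number {
  return 0.6 * accuracyFromMape(avgMapePercent) + 0.4 * correlation;
}

// Classic Levenshtein edit distance using two rolling rows.
function levenshtein(a: string, b: string): number {
  let prev: number[] = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr.push(Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost));
    }
    prev = curr;
  }
  return prev[b.length];
}

// Mean distance over all unordered word pairs (penalties not included).
function averagePairwiseLevenshtein(words: readonly string[]): number {
  let total = 0;
  let pairs = 0;
  for (let i = 0; i < words.length; i++) {
    for (let j = i + 1; j < words.length; j++) {
      total += levenshtein(words[i], words[j]);
      pairs += 1;
    }
  }
  return pairs === 0 ? 0 : total / pairs;
}
```

For instance, an average MAPE of 4 percent gives an accuracy of 20, so the accuracy term contributes 12 points before the correlation term is added.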
Expected in .env.local:
DATABASE_URL
OPENROUTER_API_KEY
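
A script that depends on these two variables can fail fast when one is missing. The check below is a hypothetical helper, not part of the repository; it takes the environment as a plain object so it can be exercised without touching real secrets.

```typescript
// The two variables this README expects in .env.local.
const REQUIRED_ENV_VARS = ["DATABASE_URL", "OPENROUTER_API_KEY"] as const;

// Return the names of required variables that are unset or blank.
function missingEnvVars(env: Record<string, string | undefined>): string[] {
  return REQUIRED_ENV_VARS.filter((name) => (env[name] ?? "").trim() === "");
}
```

In a Node script, one option is to call `missingEnvVars(process.env)` at startup and exit with a clear message if the returned list is not empty.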
pnpm install
pnpm dev
pnpm db:init
Running pnpm db:init is required before running the app or benchmark scripts against a fresh database. Runtime code does not auto-create tables.
pnpm benchmark:nutrition-prediction <model-id>
Examples:
pnpm benchmark:nutrition-prediction openai/gpt-oss-120b
pnpm benchmark:nutrition-prediction anthropic/claude-opus-4.1 openai/gpt-oss-120b
Behavior:
- Runs locally.
- Uses OPENROUTER_API_KEY.
- Fetches Nutrition5k metadata from the public Google bucket and deterministically samples the same 50 dishes each time.
- Writes the score to Neon using DATABASE_URL.
- Returns cached data immediately if that model already has a stored score.
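
One way to get the same 50 dishes on every run is a seeded shuffle over a sorted list of dish IDs. The sketch below uses the mulberry32 generator with a Fisher-Yates shuffle; this technique is an assumption chosen for illustration, and the runner's actual sampling method, seed, and ID format may differ.

```typescript
// mulberry32: a small seeded PRNG returning floats in [0, 1).
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Pick `count` IDs so that the same inputs and seed always give the same sample.
function deterministicSample(ids: readonly string[], count: number, seed: number): string[] {
  // Sort a copy first so the result does not depend on the input order.
  const pool = ids.slice().sort();
  const rand = mulberry32(seed);
  // Fisher-Yates shuffle driven by the seeded generator.
  for (let i = pool.length - 1; i > 0; i--) {
    const j = Math.floor(rand() * (i + 1));
    const tmp = pool[i];
    pool[i] = pool[j];
    pool[j] = tmp;
  }
  return pool.slice(0, Math.min(count, pool.length));
}
```

Sorting before shuffling matters here: if the metadata file ever lists dishes in a different order, the sample stays the same as long as the set of IDs and the seed are unchanged.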
pnpm benchmark:semantic-diversity <model-id>
Examples:
pnpm benchmark:semantic-diversity google/gemini-2.5-pro
pnpm benchmark:semantic-diversity anthropic/claude-opus-4.1
pnpm benchmark:semantic-diversity openai/gpt-5
pnpm benchmark:semantic-diversity google/gemini-2.5-pro,anthropic/claude-opus-4.1,openai/gpt-5
pnpm benchmark:semantic-diversity google/gemini-2.5-pro anthropic/claude-opus-4.1 openai/gpt-5
Behavior:
- Runs locally.
- Uses OPENROUTER_API_KEY.
- Writes the score to Neon using DATABASE_URL.
- Returns cached data immediately if that model already has a stored score.
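
The examples above pass several model IDs either as one comma-separated argument or as separate arguments. A parser that accepts both forms could look like the sketch below; `parseModelIds` is a hypothetical name, and dropping duplicate IDs is an assumption rather than documented behavior.

```typescript
// Accept model IDs as separate arguments, comma-separated lists, or a mix of both.
function parseModelIds(args: readonly string[]): string[] {
  const seen = new Set<string>();
  for (const arg of args) {
    for (const part of arg.split(",")) {
      const id = part.trim();
      // Skip empty pieces such as those left by a trailing comma.
      if (id !== "") seen.add(id);
    }
  }
  // A Set keeps insertion order, so IDs come back in the order first given.
  return Array.from(seen);
}
```

With this approach both multi-model invocation styles resolve to the same ordered list of model IDs.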
pnpm dev
pnpm lint
pnpm build
pnpm db:init
pnpm benchmark:nutrition-prediction <model-id>
pnpm benchmark:semantic-diversity <model-id>