| Authors | Title | Venue | Link |
| --- | --- | --- | --- |
| Jiatong Li, et al. | PertEval: Unveiling Real Knowledge Capacity of LLMs with Knowledge-Invariant Perturbations | NeurIPS 2024 | https://arxiv.org/abs/2405.19740 |
| Jingnan Zheng, et al. | ALI-Agent: Assessing LLMs' Alignment with Human Values via Agent-based Evaluation | NeurIPS 2024 | https://arxiv.org/abs/2405.14125 |
| Jinhao Duan, et al. | GTBench: Uncovering the Strategic Reasoning Capabilities of LLMs via Game-Theoretic Evaluations | NeurIPS 2024 | https://arxiv.org/abs/2402.12348 |
| Felipe Maia Polo, et al. | Efficient multi-prompt evaluation of LLMs | NeurIPS 2024 | https://arxiv.org/abs/2405.17202 |
| Fan Lin, et al. | IDGen: Item Discrimination Induced Prompt Generation for LLM Evaluation | NeurIPS 2024 | https://arxiv.org/abs/2409.18892 |
| Jinjie Ni, et al. | MixEval: Fast and Dynamic Human Preference Approximation with LLM Benchmark Mixtures | NeurIPS 2024 | https://nips.cc/virtual/2024/poster/96545 |
| Percy Liang, et al. | Holistic Evaluation of Language Models | TMLR | https://arxiv.org/abs/2211.09110 |
| Felipe Maia Polo, et al. | tinyBenchmarks: evaluating LLMs with fewer examples | ICML 2024 | https://openreview.net/forum?id=qAml3FpfhG |
| Miltiadis Allamanis, et al. | Unsupervised Evaluation of Code LLMs with Round-Trip Correctness | ICML 2024 | https://icml.cc/virtual/2024/poster/33761 |
| Wei-Lin Chiang, et al. | Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference | ICML 2024 | https://arxiv.org/abs/2403.04132 |
| Yonatan Oren, et al. | Proving Test Set Contamination in Black-Box Language Models | ICLR 2024 | https://arxiv.org/abs/2310.17623 |
| Kaijie Zhu, et al. | DyVal: Dynamic Evaluation of Large Language Models for Reasoning Tasks | ICLR 2024 | https://arxiv.org/abs/2309.17167 |
| Seonghyeon Ye, et al. | FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets | ICLR 2024 | https://openreview.net/forum?id=CYmF38ysDa |
| Shahriar Golchin, et al. | Time Travel in LLMs: Tracing Data Contamination in Large Language Models | ICLR 2024 | https://openreview.net/forum?id=2Rwq6c3tvr |
| Gati Aher, et al. | Using Large Language Models to Simulate Multiple Humans and Replicate Human Subject Studies | ICML 2023 | https://proceedings.mlr.press/v202/aher23a/aher23a.pdf |