📚 A curated list of papers & technical articles on AI Quality & Safety
Updated Apr 14, 2025
Ship evals before you ship features.
Eval framework. Define correct, test against it, get results.
Open-source AI model evaluation and benchmarking framework for LLMs (OpenAI, Ollama, Claude, Gemini)
Diagnose your AI agents in production. Extract policies from prompts, evaluate traces, generate diagnostic reports.
A framework-agnostic metric for measuring AI code generation quality. Sealed-envelope testing protocol + reference validators.
AI model health monitor for LLM apps – runtime checks for drift, hallucination risk, latency, and JSON/format quality on any OpenAI, Anthropic, or local client.
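朱雀 Suzaku: AI generation quality module. Sycophancy suppression, constructive challenge, output adaptation, context anchoring, consistency guarding. Designed based on LDRIT.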
Python SDK for IvyCheck
Universal skill enhancement layer for Claude Code. Sees what your skill was trying to do, grades the gap, drives the rewrite.
Evaluate your LLM apps with one function call. Hallucination detection, RAG scoring, and agent evals for OpenAI, Anthropic, and more. 14 evaluators, pytest plugin, composite trust scores.
Open-source AI agent security testing framework. Test for prompt injection, data leakage, and privilege escalation before production.
The definitive CI/CD platform for AI Quality.
A 5-layer adversarial quality gate for Claude Code. Catches factual errors, score inflation, and buried conclusions before your AI output ships.
🚀 Professional-grade AI Agent Evaluation Platform. Multi-provider LLM-as-a-Judge (OpenAI, Anthropic, Gemini), automated testing, A/B benchmarking, and safety auditing.
Adversarial AI review with structured verdicts — C/I/M defect taxonomy, numerical pass floor, single- and cross-model audit modes.
Define, measure, and enforce code correctness with Eval-Driven Development, ensuring every probabilistic system ships with automated proof of quality.
Production-grade LLM evaluation pipeline for a RAG chatbot: DeepEval + RAGAS + Garak + CI/CD | Financial domain | 7 metrics | Adversarial testing
AI Agent Ops framework for Claude Code — independent evaluator, adversarial review, and pre-commit quality gate for AI-generated code.
A cognitive immune system for AI agents. Self-evolving critics detect thinking pattern biases through context isolation. Inspired by Karpathy's autoresearch.