BasicJudge is a production-ready LLM-as-a-judge tool for objective evaluation and critical assessment. It provides structured, transparent evaluation with constructive skepticism, delivering clear, simple and actionable feedback using established evaluation practices.
from abstractcore.processing import BasicJudge
# Initialize with default model (Ollama qwen3:4b-instruct-2507-q4_K_M)
judge = BasicJudge()
# Evaluate content against default criteria
result = judge.evaluate("This code is well-structured and solves the problem elegantly.")
# Access enhanced assessment results
print(f"Judge's Summary: {result['judge_summary']}")
print(f"Source: {result['source_reference']}")
print(f"Overall score: {result['overall_score']}/5")
print(f"Strengths: {result['strengths']}")
print(f"Recommendations: {result['actionable_feedback']}")
# Include detailed criteria explanations (optional)
result_with_criteria = judge.evaluate(
"Code review content",
context="code review",
include_criteria=True # Add detailed criteria explanations
)
print(f"Criteria Details: {result_with_criteria['evaluation_criteria_details']}")# Install AbstractCore. The default Ollama path works with the core install.
pip install abstractcore
# Optional turnkey local-runtime installs:
pip install "abstractcore[all-apple]" # Apple Silicon: HF/GGUF + MLX + features + server
pip install "abstractcore[all-gpu]" # NVIDIA GPU: HF/GGUF + vLLM + features + server
# Default model requires Ollama (free, runs locally)
# 1. Install Ollama: https://ollama.com/
# 2. Download model: ollama pull qwen3:4b-instruct-2507-q4_K_M
# 3. Start Ollama service
# Alternative: Use cloud providers
pip install "abstractcore[remote]"Default Model: qwen3:4b-instruct-2507-q4_K_M
- Size: ~4GB model
- RAM: ~8GB required
- Temperature: 0.1 (low for consistent evaluation)
- Setup:
ollama pull qwen3:4b-instruct-2507-q4_K_M
For Optimal Evaluation Quality:
- qwen3-coder:30b: Good for detailed assessment (requires 32GB RAM)
- gpt-oss:120b: Highest quality evaluation (requires 120GB RAM)
For Production: Cloud providers (OpenAI GPT-4o-mini, Claude) offer the most reliable and consistent evaluation.
BasicJudge implements LLM-as-a-judge practices with structured assessment and chain-of-thought reasoning.
The system evaluates content across nine standard quality dimensions, each mapping to a JudgmentCriteria flag (see the sketch after this list):
- Clarity: How clear, understandable, and well-explained is the content?
- Simplicity: Is it appropriately simple vs unnecessarily complex for its purpose?
- Actionability: Does it provide actionable insights, recommendations, or next steps?
- Soundness: Is the reasoning logical, well-founded, and free of errors?
- Innovation: Does it show creativity, novel thinking, or fresh approaches?
- Effectiveness: Does it actually solve the intended problem or achieve its purpose?
- Relevance: Is it relevant and appropriate to the context and requirements?
- Completeness: Does it address all important aspects comprehensively?
- Coherence: Is the flow logical, consistent, and well-structured?
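Each dimension corresponds to a boolean flag on JudgmentCriteria, matching the configuration examples later in this guide; a minimal sketch of the mapping:

from abstractcore.processing import JudgmentCriteria

# One flag per standard quality dimension
criteria = JudgmentCriteria(
    is_clear=True,       # Clarity
    is_simple=True,      # Simplicity
    is_actionable=True,  # Actionability
    is_sound=True,       # Soundness
    is_innovative=True,  # Innovation
    is_working=True,     # Effectiveness
    is_relevant=True,    # Relevance
    is_complete=True,    # Completeness
    is_coherent=True     # Coherence
)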
Uses a 1-5 scale with clear definitions:
- Score 5: Exceptional - Exceeds expectations in this dimension
- Score 4: Good - Meets expectations well with minor room for improvement
- Score 3: Adequate - Meets basic expectations but has notable areas for improvement
- Score 2: Poor - Falls short of expectations with significant issues
- Score 1: Very Poor - Fails to meet basic standards in this dimension
Context-Aware Scoring (v2.6.3+):
- Rigorous evaluation: Avoids grade inflation, most adequate responses score 2-3, not 3-4
- Task-appropriate criteria: Innovation scored 1-2 for routine tasks (e.g., basic arithmetic), 4-5 for breakthrough thinking
- Criterion applicability: If a criterion doesn't meaningfully apply to the task, scores 1-2, not 3
- Examples:
- Routine calculations: innovation 1-2, soundness 4-5 (if correct)
- Creative explanations: innovation 3-4 if novel approach shown
- Complex problem-solving: innovation 4-5 if breakthrough thinking demonstrated
Each evaluation returns a structured assessment with the following fields (see the access sketch after this list):
- Judge's summary (experiential note from judge's perspective about the assessment task and key findings)
- Source reference (clear indication of what was evaluated)
- Predefined criterion scores (1-5 for each enabled standard criterion: clarity, simplicity, etc.)
- Custom criterion scores (1-5 for each user-defined criterion - v2.6.3+)
- Overall score (calculated average considering all enabled criteria)
- Strengths (specific positive aspects identified)
- Weaknesses (areas for improvement)
- Actionable feedback (specific implementable recommendations)
- Chain-of-thought reasoning (transparent evaluation process)
- Evaluation criteria details (optional detailed explanation when include_criteria=True)
New in v2.6.3: Complete score visibility - all predefined and custom criterion scores are now included in assessment results (previously only overall_score and custom_scores were visible).
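A minimal sketch of reading these fields from a single assessment. The per-criterion key names are an assumption based on the "<criterion>_score" pattern (e.g. completeness_score) used elsewhere in this guide; verify the exact keys for your version.

from abstractcore.processing import BasicJudge

judge = BasicJudge()
result = judge.evaluate("Content to assess...", context="documentation review")

print(result["judge_summary"])        # Experiential note about the assessment
print(result["source_reference"])     # What was evaluated
print(result["overall_score"])        # Calculated 1-5 average over enabled criteria
print(result["strengths"])            # Specific positive aspects
print(result["weaknesses"])           # Areas for improvement
print(result["actionable_feedback"])  # Implementable recommendations

# Per-criterion scores (assumed "<criterion>_score" keys, e.g. completeness_score)
print(result.get("clarity_score"))
print(result.get("soundness_score"))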
class BasicJudge:
    def __init__(
        self,
        llm: Optional[AbstractCoreInterface] = None,
        temperature: float = 0.1  # Low temperature for consistent evaluation
    )

    def evaluate(
        self,
        content: str,
        context: Optional[str] = None,
        criteria: Optional[JudgmentCriteria] = None,
        focus: Optional[str] = None,
        reference: Optional[str] = None,
        include_criteria: bool = False,
        custom_criteria: Optional[Dict[str, str]] = None  # New in v2.6.3
    ) -> dict

    def evaluate_files(
        self,
        file_paths: Union[str, List[str]],
        context: Optional[str] = None,
        criteria: Optional[JudgmentCriteria] = None,
        focus: Optional[str] = None,
        reference: Optional[str] = None,
        include_criteria: bool = False,
        max_file_size: int = 1000000,
        exclude_global: bool = False
    ) -> Union[dict, List[dict]]

evaluate() method:
- content (str): The content to evaluate
- context (str, optional): Description of what is being evaluated (e.g., "code review", "documentation assessment")
- criteria (JudgmentCriteria, optional): Object specifying which standard criteria to use
- focus (str, optional): Specific areas to focus evaluation on (e.g., "technical accuracy, performance")
- reference (str, optional): Reference content for comparison-based evaluation
- include_criteria (bool, optional): Include detailed explanation of evaluation criteria in assessment (default: False)
- custom_criteria (Dict[str, str], optional): Custom domain-specific criteria as name->description mapping (default: None) [New in v2.6.3]
  - Example: {"logical_coherence": "Are the results logically consistent?", "domain_accuracy": "Is the domain knowledge correct?"}
  - Each custom criterion receives an individual 1-5 score in the custom_scores dict
  - Context-aware scoring applies (rigorous evaluation, task-appropriate expectations)
evaluate_files() method:
- file_paths (str or List[str]): Single file path or list of file paths to evaluate sequentially
- context (str, optional): Description of evaluation context (default: "file content evaluation")
- criteria (JudgmentCriteria, optional): Object specifying which standard criteria to use
- focus (str, optional): Specific areas to focus evaluation on (e.g., "technical accuracy, performance")
- reference (str, optional): Reference content for comparison-based evaluation
- include_criteria (bool, optional): Include detailed explanation of evaluation criteria in assessment (default: False)
- max_file_size (int, optional): Maximum file size in bytes to prevent context overflow (default: 1MB)
- exclude_global (bool, optional): Skip global assessment for multiple files (default: False)
Returns:
- evaluate(): Single assessment dictionary
- evaluate_files(): Single dict if one file
- evaluate_files(): {"global": global_assessment, "files": [individual_assessments]} if multiple files (default)
- evaluate_files(): [individual_assessments] if multiple files and exclude_global=True
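A minimal sketch for handling each of these return shapes:

from abstractcore.processing import BasicJudge

judge = BasicJudge()
result = judge.evaluate_files(["src/main.py", "src/utils.py"], context="code review")

if isinstance(result, dict) and "files" in result:
    # Multiple files (default): global assessment plus per-file assessments
    print(f"Global: {result['global']['overall_score']}/5")
    assessments = result["files"]
elif isinstance(result, list):
    # Multiple files with exclude_global=True: list of per-file assessments
    assessments = result
else:
    # Single file: one assessment dictionary
    assessments = [result]

for assessment in assessments:
    print(f"{assessment['source_reference']}: {assessment['overall_score']}/5")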
from abstractcore.processing import JudgmentCriteria
# Enable specific criteria only
criteria = JudgmentCriteria(
is_clear=True, # Evaluate clarity
is_simple=True, # Evaluate simplicity
is_actionable=True, # Evaluate actionability
is_sound=False, # Skip soundness evaluation
is_innovative=False, # Skip innovation evaluation
is_working=True, # Evaluate effectiveness
is_relevant=True, # Evaluate relevance
is_complete=True, # Evaluate completeness
is_coherent=True # Evaluate coherence
)

from abstractcore import create_llm
from abstractcore.processing import BasicJudge, create_judge
# RECOMMENDED: Use cloud providers for optimal evaluation quality
llm = create_llm("openai", model="gpt-4o-mini", temperature=0.1)
judge = BasicJudge(llm)
# OR use create_judge helper
judge = create_judge("anthropic", model="claude-haiku-4-5", temperature=0.05)
# LOCAL MODELS: Work well for basic evaluation
judge = create_judge("ollama", model="qwen3-coder:30b", temperature=0.1)BasicJudge can evaluate multiple files sequentially to avoid context overflow:
from abstractcore.processing import BasicJudge, JudgmentCriteria
judge = BasicJudge()
# Evaluate single file
result = judge.evaluate_files("document.py", context="code review")
print(f"File assessment: {result['overall_score']}/5")
# Evaluate multiple files sequentially (exclude_global=True returns a plain list of assessments)
files = ["src/main.py", "src/utils.py", "tests/test_main.py"]
results = judge.evaluate_files(files, context="code review",
                               criteria=JudgmentCriteria(is_clear=True, is_sound=True),
                               exclude_global=True)
for i, result in enumerate(results):
    file_name = files[i].split('/')[-1]
    print(f"{file_name}: {result['overall_score']}/5")
    print(f"  Judge Summary: {result['judge_summary']}")
    print(f"  Key Issues: {result['weaknesses']}")
# Configure file size limit (default 1MB)
large_files = ["big_doc.md", "large_code.py"]
try:
    results = judge.evaluate_files(large_files, max_file_size=2000000)  # 2MB limit
except ValueError as e:
    print(f"File too large: {e}")

When evaluating multiple files, BasicJudge automatically generates a global assessment that synthesizes all individual evaluations:
from abstractcore.processing import BasicJudge
judge = BasicJudge()
# Evaluate multiple files - returns global + individual assessments
result = judge.evaluate_files(
["src/main.py", "src/utils.py", "tests/test_main.py"],
context="Python code review"
)
# Access global assessment (appears first)
global_assessment = result['global']
print(f"Global Score: {global_assessment['overall_score']}/5")
print(f"Global Summary: {global_assessment['judge_summary']}")
# Access individual file assessments
individual_assessments = result['files']
for assessment in individual_assessments:
print(f"File: {assessment['source_reference']}")
print(f"Score: {assessment['overall_score']}/5")
# Optional: Get original format (list of assessments only)
results = judge.evaluate_files(
["file1.py", "file2.py"],
exclude_global=True # Skip global assessment
)
# Returns: [assessment1, assessment2] (original behavior)

Global Assessment Features:
- Synthesis: Combines patterns across all individual file evaluations
- Score Distribution: Shows how many files scored at each level (1-5); see the sketch after this list
- Pattern Analysis: Identifies common strengths and weaknesses
- Aggregate Scoring: Provides overall quality assessment
- Appears First: Global assessment is shown before individual file results
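The global assessment reports this synthesis directly; as a client-side illustration, the same score distribution can also be recomputed from the individual assessments:

from collections import Counter
from abstractcore.processing import BasicJudge

judge = BasicJudge()
result = judge.evaluate_files(
    ["src/main.py", "src/utils.py", "tests/test_main.py"],
    context="Python code review"
)

# Count how many files landed at each 1-5 level
distribution = Counter(a["overall_score"] for a in result["files"])
for score in range(5, 0, -1):
    print(f"Score {score}: {distribution.get(score, 0)} file(s)")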
CLI Global Assessment:
# Default: Includes global assessment
judge file1.py file2.py file3.py --context "code review"
# Skip global assessment (original behavior)
judge file1.py file2.py file3.py --context "code review" --exclude-globalThe judge CLI provides comprehensive evaluation capabilities for files and direct text input.
# Simple usage (after installing AbstractCore)
judge "This code is well-structured and efficient."
# Evaluate single file with context
judge document.py --context "code review"
# Multiple files with specific criteria
judge file1.py file2.py file3.py --context "code review" --criteria clarity,soundness
# Custom output format and file
judge proposal.txt --format plain --output assessment.txt

# Method 1: Direct command (recommended after installation)
judge document.txt --context "code review"
# Method 2: Via Python module (always works)
python -m abstractcore.apps.judge document.txt --context "code review"# Simple command (after package installation)
judge "This code is well-structured and efficient."
# Evaluate single file
judge document.py --context "code review"
# Evaluate multiple files sequentially (avoids context overflow)
judge file1.py file2.py file3.py --context "code review"
# Specify output format
judge content.md --format plain
# Save to file
judge proposal.txt --output assessment.json
# Multiple files with wildcard patterns
judge src/*.py --context "Python code review" --format json --output review.json

# Focus on specific criteria
judge doc.py --criteria clarity,soundness,effectiveness
# Focus on specific evaluation areas
judge api_docs.md --focus "technical accuracy,examples,error handling"
# Comparison-based evaluation
judge draft.md --reference ideal_solution.md
# Custom provider and model
judge content.txt --provider openai --model gpt-4o-mini
# Include detailed criteria explanations
judge content.txt --include-criteria --format plain
# Verbose output with progress
judge large_doc.md --verbose

| Parameter | Description | Choices/Default |
|---|---|---|
| content | Content to evaluate: single text string, single file path, or multiple file paths | Required (one or more arguments) |
| --context | Evaluation context description | Free text |
| --criteria | Comma-separated standard criteria | clarity, simplicity, actionability, soundness, innovation, effectiveness, relevance, completeness, coherence |
| --focus | Specific focus areas for evaluation | Free text (comma-separated) |
| --reference | Reference content for comparison | File path or text |
| --include-criteria | Include detailed criteria explanations in assessment | Flag |
| --exclude-global | Skip global assessment for multiple files | Flag (default: False, global assessment included) |
| --format | Output format | json (default), plain, yaml |
| --output | Output file path | Console if not provided |
| --provider | LLM provider | ollama, openai, anthropic, etc. |
| --model | LLM model | Provider-specific model name |
| --temperature | Evaluation temperature | 0.0-2.0 (default: 0.1) |
| --verbose | Detailed progress | Flag |
JSON Format (default):
python -m abstractcore.apps.judge content.txt --format json
# Output: Structured JSON with scores, feedback, and reasoning

Plain Text Format:
python -m abstractcore.apps.judge content.txt --format plain
# Output: Human-readable assessment report

Filtered Criteria:
python -m abstractcore.apps.judge code.py --criteria clarity,soundness,effectiveness
# Output: Only evaluates specified criteria

Enhanced Assessment with Criteria Details:
python -m abstractcore.apps.judge content.txt --include-criteria --format plain
# Output: Includes judge's summary, source reference, and detailed criteria explanations

The --focus parameter dramatically changes evaluation outcomes by treating specified areas as PRIMARY FOCUS AREAS. Here are real examples showing the impact:
Key Difference:
- Without focus: Judge evaluates general quality (clarity, coherence) → High score
- With focus: Judge prioritizes specified areas → Low score when focus areas are missing
Command:
judge README.md --focus "technicalities, architectural diagrams and data flow, explanations of technical choices and comparison with SOTA approaches"

Results:
- Overall Score: 3/5 (down from 5/5 without focus)
- Judge Summary: "However, it critically lacks architectural diagrams and technical comparisons to SOTA approaches—core requirements..."
- Weaknesses: Directly address focus areas:
- "No architectural diagrams or data flow visualizations"
- "Lacks technical comparisons with SOTA approaches like LangChain, LlamaIndex"
- "No explanation of how tool calling is unified across providers"
Key Insight: Focus areas become the primary evaluation targets. Even high-quality documentation gets lower scores when it lacks the specified focus areas.
Fun Fact: We used our own judge to evaluate our README.md with focus on "architectural diagrams and SOTA comparisons" and got a humbling 3/5 score. Turns out any documentation can be improved! 😅
# --criteria: HOW to evaluate (evaluation methods)
judge doc.txt --criteria "clarity,soundness,effectiveness"
# --focus: WHAT to focus on (evaluation subjects)
judge doc.txt --focus "performance benchmarks,security analysis"
# Combined: Evaluate specific areas using specific criteria
judge doc.txt --criteria "clarity,completeness" --focus "API documentation,error handling"Pro Tip: Use --focus when you want to evaluate specific content areas. Use --criteria when you want to change evaluation dimensions.
Input:
def calculate_total(items):
    total = 0
    for item in items:
        total += item.price
    return total

Command:
judge "def calculate_total..." --context "code review" --criteria clarity,soundness,effectiveness --format plainExpected Assessment:
- Clarity: 4/5 - Clear function purpose and implementation
- Soundness: 3/5 - Missing error handling for None values
- Effectiveness: 4/5 - Solves the problem efficiently
- Actionable Feedback: Add input validation, consider using sum() built-in
Python API:
from abstractcore.processing import BasicJudge, JudgmentCriteria
judge = BasicJudge()
doc_content = """
# API Documentation
This API provides user management functionality.
Available endpoints: /users, /users/{id}
"""
# Focus on documentation-specific criteria
criteria = JudgmentCriteria(
is_clear=True,
is_complete=True,
is_actionable=True,
is_innovative=False, # Not relevant for docs
is_working=False # Not applicable
)
assessment = judge.evaluate(
content=doc_content,
context="API documentation review",
criteria=criteria,
focus="examples, error handling, API completeness"
)
print(f"Completeness: {assessment['completeness_score']}/5")
print(f"Recommendations: {assessment['actionable_feedback']}")Evaluate an entire codebase:
# Review all Python files in a project
judge src/*.py tests/*.py \
--context="Python project review" \
--criteria clarity,soundness,effectiveness \
--format json \
--output project_review.json \
--verbose

Expected Output:
- List of assessments for each file
- Individual scores and feedback per file
- Consistent evaluation criteria across all files
- Identification of problematic files requiring attention
Python API for Multiple Files:
from abstractcore.processing import BasicJudge, JudgmentCriteria
import glob
judge = BasicJudge()
# Get all Python files in project
python_files = glob.glob("src/**/*.py", recursive=True)
# Evaluate all files (default return: {"global": ..., "files": [...]} for multiple files)
results = judge.evaluate_files(
    file_paths=python_files,
    context="code quality review",
    criteria=JudgmentCriteria(is_clear=True, is_sound=True, is_working=True)
)

# Analyze the individual file assessments
file_assessments = results["files"]
problematic_files = [r for r in file_assessments if r['overall_score'] < 3]
high_quality_files = [r for r in file_assessments if r['overall_score'] >= 4]
print(f"Files needing attention: {len(problematic_files)}")
print(f"High-quality files: {len(high_quality_files)}")Command:
judge research_paper.pdf \
--context="academic paper review" \
--criteria clarity,soundness,innovation,completeness \
--reference conference_guidelines.txt \
--format json \
--output review_assessment.json \
--verbose

For Critical Assessments (RECOMMENDED):
# Best quality for important evaluations
judge = create_judge("openai", model="gpt-4o-mini", temperature=0.1)
# Alternative: High-quality Claude
judge = create_judge("anthropic", model="claude-haiku-4-5", temperature=0.05)For High-Volume Evaluation (Local):
# Good balance of quality and speed
judge = create_judge("ollama", model="qwen3-coder:30b", temperature=0.1)
# Fastest option (basic evaluation)
judge = create_judge("ollama", model="qwen3:4b-instruct-2507-q4_K_M", temperature=0.1)For Code Reviews:
criteria = JudgmentCriteria(
is_clear=True,
is_simple=True,
is_sound=True,
is_working=True,
is_innovative=False # Usually not the focus
)

For Documentation:
criteria = JudgmentCriteria(
is_clear=True,
is_complete=True,
is_actionable=True,
is_relevant=True,
is_coherent=True,
is_innovative=False, # Not typically relevant
is_sound=False # Different meaning for docs
)

For Creative Content:
criteria = JudgmentCriteria(
is_clear=True,
is_innovative=True,
is_coherent=True,
is_working=False, # Not applicable
is_sound=False # Different context
)

Be Specific:
- "code review for production deployment"
- "user-facing API documentation"
- "academic research proposal"
- "general review"
Match Context to Criteria:
- Code reviews: focus on soundness, clarity, effectiveness
- Documentation: focus on completeness, clarity, actionability
- Creative work: focus on innovation, coherence, clarity
Custom criteria enable domain-specific evaluation with individual scores per criterion:
from abstractcore.processing import BasicJudge
judge = BasicJudge()
# Data Analysis Evaluation
assessment = judge.evaluate(
content="Statistical analysis report...",
context="data analysis review",
custom_criteria={
"logical_coherence": "Are the results logically consistent throughout?",
"result_plausibility": "Are the findings plausible given the data?",
"assumption_validity": "Were statistical assumptions properly validated?"
}
)
# Access custom scores
print(assessment['custom_scores'])
# {'logical_coherence': 5, 'result_plausibility': 4, 'assumption_validity': 3}
# Code Review with Custom Criteria
assessment = judge.evaluate(
content="Pull request code...",
context="code review",
custom_criteria={
"follows_style_guide": "Does the code follow team style conventions?",
"has_tests": "Are there comprehensive unit tests?",
"handles_edge_cases": "Are edge cases and error conditions handled?"
}
)
# Medical Diagnosis Evaluation
assessment = judge.evaluate(
content="Diagnostic reasoning...",
context="medical diagnosis review",
custom_criteria={
"safety": "Are patient safety considerations addressed?",
"evidence_based": "Is reasoning grounded in medical evidence?",
"risk_assessment": "Are patient risks properly evaluated?"
}
)

Custom Criteria Best Practices:
- Use clear, specific questions as descriptions
- Each criterion gets an individual 1-5 score
- Context-aware scoring applies (task-appropriate expectations)
- Combine with predefined criteria for comprehensive evaluation (see the sketch below)
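A minimal sketch combining predefined criteria with custom criteria in a single call (the custom criterion names are illustrative):

from abstractcore.processing import BasicJudge, JudgmentCriteria

judge = BasicJudge()
assessment = judge.evaluate(
    content="Pull request code...",
    context="code review",
    criteria=JudgmentCriteria(is_clear=True, is_sound=True, is_working=True),
    custom_criteria={
        "follows_style_guide": "Does the code follow team style conventions?",
        "has_tests": "Are there comprehensive unit tests?"
    }
)
print(assessment["overall_score"])   # Average over all enabled criteria
print(assessment["custom_scores"])   # Individual 1-5 score per custom criterion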
# Compare against ideal solution
judge student_solution.py \
--reference expert_solution.py \
--context="programming assignment grading"
# Compare against standards
judge company_policy.md \
--reference industry_standards.md \
--context="policy compliance review"5 (Exceptional): Content exceeds expectations and demonstrates mastery 4 (Good): Content meets expectations well with minor improvements possible 3 (Adequate): Content meets basic standards but has notable gaps 2 (Poor): Content falls short with significant issues requiring attention 1 (Very Poor): Content fails to meet basic standards
The judge provides three types of feedback:
- Strengths: What works well (build upon these)
- Weaknesses: What needs improvement (prioritize addressing these)
- Actionable Recommendations: Specific steps to improve (implement these)
Each assessment includes transparent reasoning showing:
- How each criterion was evaluated
- Evidence supporting the scores
- Calculation of the overall score
- Justification for feedback and recommendations
from abstractcore.processing import BasicJudge, JudgmentCriteria

def evaluate_article(article_content):
    judge = BasicJudge()
    assessment = judge.evaluate(
        content=article_content,
        context="blog article review",
        criteria=JudgmentCriteria(
            is_clear=True,
            is_actionable=True,
            is_relevant=True,
            is_coherent=True
        )
    )
    return {
        'quality_score': assessment['overall_score'],
        'ready_to_publish': assessment['overall_score'] >= 4,
        'improvements_needed': assessment['actionable_feedback']
    }

def automated_code_review(code_diff, context="code review"):
    judge = BasicJudge()
    assessment = judge.evaluate(
        content=code_diff,
        context=context,
        focus="code conventions, test coverage, error handling"
    )
    return {
        'approval_recommended': assessment['overall_score'] >= 4,
        'concerns': assessment['weaknesses'],
        'required_changes': assessment['actionable_feedback']
    }

def grade_assignment(student_submission, rubric_reference):
    judge = BasicJudge()
    assessment = judge.evaluate(
        content=student_submission,
        context="academic assignment grading",
        reference=rubric_reference,
        criteria=JudgmentCriteria(
            is_clear=True,
            is_sound=True,
            is_complete=True,
            is_coherent=True
        )
    )
    return {
        'grade': assessment['overall_score'],
        'feedback': assessment['actionable_feedback'],
        'strengths': assessment['strengths']
    }

Judging is an LLM call, so latency and cost vary by provider/model, input size, and retry behavior (for example, structured output validation).
Practical guidance (see the sketch after this list):
- Prefer smaller/faster models for routine scoring.
- Keep inputs short (or summarize first) for lower latency.
- Use low temperature for more consistent scores.
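A minimal sketch putting this guidance together; the 8,000-character budget is an arbitrary assumption, not an AbstractCore setting, so tune it for your model's context window:

from abstractcore.processing import create_judge

MAX_CHARS = 8000  # assumed input budget for trimming long content

# Smaller local model with low temperature for routine, consistent scoring
fast_judge = create_judge("ollama", model="qwen3:4b-instruct-2507-q4_K_M", temperature=0.1)

def quick_score(content: str, context: str = "routine quality check") -> int:
    # Trim long inputs before evaluation to keep latency and cost down
    assessment = fast_judge.evaluate(content[:MAX_CHARS], context=context)
    return assessment["overall_score"]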
BasicJudge follows common best practices for LLM-based evaluation:
- JSON format for easy parsing and integration
- Consistent schema across all evaluations
- Rich metadata for comprehensive analysis
- Step-by-step reasoning for transparency
- Evidence-based scoring with clear justification
- Explicit calculation of overall scores
- Consistent evaluation across multiple runs
- Reduced randomness in scoring decisions
- Reliable comparative assessments
- Graceful failure with fallback assessments
- Retry mechanisms for transient failures
- Clear error messages for debugging
- Domain-specific evaluation with relevant criteria
- Custom criteria support for specialized needs
- Flexible assessment scope based on context
BasicJudge is designed for production use with built-in error handling, retry logic, and efficient evaluation of content from short snippets to comprehensive documents.
The judge supports flexible timeout configuration for different evaluation scenarios:
# Runs as long as needed - recommended for complex evaluations
python -m abstractcore.apps.judge document.txt

# Set specific timeout (useful for production environments)
python -m abstractcore.apps.judge document.txt --timeout 300 # 5 minutes
python -m abstractcore.apps.judge document.txt --timeout 900 # 15 minutes
# Explicit unlimited timeout
python -m abstractcore.apps.judge document.txt --timeout none

from abstractcore.processing import BasicJudge
# Unlimited timeout (default)
judge = BasicJudge()
# Custom timeout
judge = BasicJudge(timeout=300) # 5 minutes
# Explicit unlimited timeout
judge = BasicJudge(timeout=None)

When to Use Timeouts:
- Production environments: Set reasonable timeouts (300-900 seconds) to prevent hanging
- Large documents: Use unlimited timeout for comprehensive evaluations
- Multiple files: Consider longer timeouts when evaluating many files
- Complex criteria: Detailed evaluations may need more time
"Failed to initialize default Ollama model"
# Install Ollama and download model
curl -fsSL https://ollama.com/install.sh | sh
ollama pull qwen3:4b-instruct-2507-q4_K_M
ollama serve

Inconsistent evaluation results
- Use lower temperature: --temperature=0.05
- Try a more capable model: --provider openai --model gpt-4o-mini
- Ensure context description is specific and consistent
Low-quality assessments
- Use more capable models (GPT-4, Claude)
- Provide specific evaluation context
- Focus criteria on relevant dimensions only
JSON parsing errors
- Automatic retry handles most cases
- If persistent, try a more capable model
- Check input content with the --verbose flag
"Temperature must be between 0.0 and 2.0"
- Adjust the --temperature parameter to a valid range
- Recommended: 0.1 for consistency, up to 0.3 for slight variation
"Provider/model required together"
- Both --provider and --model must be specified together
"Unknown criterion"
- Check spelling of criteria names
- Use available standard criteria or custom criteria
BasicJudge provides reliable, transparent evaluation suitable for critical assessment across various domains, from code review to content evaluation to academic grading.