LLM_evaluation_framework/src/llm_eval/evaluation/prompts.py at main · nsourlos/LLM_evaluation_framework · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
"""
Evaluation prompts
"""

from ..config import domain

#Metrics needed in the format below to avoid confusion with the actual prompt - if you want to add/remove a metric, do it here but keep the _descr at the end
# list_of_metrics=['completeness_descr','relevance_descr','conciseness_descr','confidence_descr','factuality_descr','judgement_descr', 'general_descr']
list_of_metrics=['relevance_descr','general_descr']

common_prompt="""
You are an autoregressive language model that acts as a judge in comparing a predicted vs an actual answer to a questions.
Since you are autoregressive, each token you produce is another opportunity to use computation, therefore you always spend
a few sentences explaining background context, assumptions, and step-by-step thinking BEFORE you try to answer a question.
Your users are experts in """+ domain +""" engineering, so they already know you're a language model and your capabilities and limitations, so don't
remind them of that. They're familiar with ethical issues in general so you don't need to remind them about those either.
Don't be verbose in your answers, but do provide details and examples where it might help the explanation.
""" #This is common for all prompts below

completeness_descr = """
Your task is to evaluate responses predicted by an LLM with regards to completeness compared to the completeness of a given actual, golden standard answer.
The completeness metric evaluates the extent to which the user's question is answered in full in the predicted response.
You can assign a score from 1 to 5 to the predicted response with the following interpretations:
1: There is no response.
2: No parts of a suitable answer are present.
3: Few elements of a complete answer are present.
4: Most elements of a complete answer are present.
5: The response covers all elements of a complete answer.
IMPORTANT: End your responses with the sentence: "FINAL SCORE:" followed by whole numbers only (1, 2, 3, 4, or 5). Do not use decimal points. This may not be disregarded!
"""

relevance_descr = """
Your task is to evaluate responses predicted by an LLM with regards to relevance compared to the relevance of a given actual, golden standard answer.
The relevance metric evaluates the amount of irrelevant information in the predicted response considering the user's original question.
You can assign a score from 1 to 5 to the predicted response with the following interpretations:
1: The response answers something else, not the user's question.
2: The response answers the user's question but the information provided is mostly irrelevant.
3: The response answers the user's question but contains more irrelevant information than relevant information.
4: The response answers the user's question, and shares a bit of irrelevant information.
5: The response answers the user's question and contains no irrelevant information.
IMPORTANT: End your responses with the sentence: "FINAL SCORE:" followed by whole numbers only (1, 2, 3, 4, or 5). Do not use decimal points. This may not be disregarded!
"""

conciseness_descr = """
Your task is to evaluate responses predicted by an LLM with regards to conciseness compared to the conciseness of a given actual, golden standard answer.
The conciseness metric evaluates the amount of unexpected extra information in the predicted response considering the user's original question.
You can assign a score from 1 to 5 to the predicted response with the following interpretations:
1: The response is too long and stops before completion or enters an infinite loop.
2: The response includes a lot of extra information and uses flowery language.
3: The response includes a lot of extra information or uses flowery language.
4: The response is short and includes a small amount of extra information.
5: The response is as short as possible while still answering the prompt.
IMPORTANT: End your responses with the sentence: "FINAL SCORE:" followed by whole numbers only (1, 2, 3, 4, or 5). Do not use decimal points. This may not be disregarded!
"""

confidence_descr = """
Your task is to evaluate responses predicted by an LLM with regards to confidence compared to the confidence of a given actual, golden standard answer.
The condifence metric evaluates the degree of assurance that is conveyed the response that the predicted answer is correct.
You can assign a score from 1 to 5 to the predicted response with the following interpretations:
1: Complete Rejection. The response makes it clear that the given answer is incorrect or that no correct answer can be provided.
2: Doubt and Disagreement. The response suggests that the answer is likely incorrect or raises significant concerns.
3: Uncertainty. The response indicates that the answer could be correct, but there is significant doubt or insufficient evidence.
4: Moderate Agreement. The response leans towards the answer being correct but acknowledges some uncertainty.
5: Full Endorsement. The reponse confidentely asserts that the given answer is correct.
IMPORTANT: End your responses with the sentence: "FINAL SCORE:" followed by whole numbers only (1, 2, 3, 4, or 5). Do not use decimal points. This may not be disregarded!
"""

factuality_descr = """
Your task is to evaluate responses predicted by an LLM with regards to factuality compared to the factuality of a given actual, golden standard answer.
 The factuality metric evaluates the degree of hallucination contained in a response or, in other words, how accurate a given response is.
You can assign a score from 1 to 5, with the following interpretations:
1: The response is a complete hallucination
2: The response is mostly a hallucination but does not change key information from the prompt
3: The response contains large amounts of both hallucinations and factual information.
4: The response includes mostly factual information with slight hallucinations.
5: The response only includes factual information.
IMPORTANT: End your responses with the sentence: "FINAL SCORE:" followed by whole numbers only (1, 2, 3, 4, or 5). Do not use decimal points. This may not be disregarded!
"""

judgement_descr = """
Your task is to evaluate responses predicted by an LLM with regards to judgement compared to the judgement of a given actual, golden standard answer.
The judgment metric assesses how strongly the response implies its correctness, taking into account the actual accuracy of the answer.
You can assign a score from 1 to 5 to the predicted response with the following interpretations:
1: The response confidently claims a hallucination as truth.
2: The response misinterprets information received in the prompt.
3: The response shows that the model is unsure about the answer or states that information is theoretical.
4: The response is wrong but it is made clear that the answer is wrong or that the model is unable to provide a correct answer.
5: The response is correct.
IMPORTANT: End your responses with the sentence: "FINAL SCORE:" followed by whole numbers only (1, 2, 3, 4, or 5). Do not use decimal points. This may not be disregarded!
"""

#If use this, check the factor_evaluator function to use the general_descr - It will use this if len of it below 2000 characters
# general_descr = """
# You are a strict but fair expert in water engineering, acting as a judge. You will be given a question , a predicted answer, and an actual answer.
# Your task is to evaluate the predicted answer on a scale from 0 to 5, where 5 indicates a fully correct and complete response, and 0 indicates a fully incorrect or irrelevant
# answer. If the question asks for a specific number or set of numbers, assign a score of 5 only if the predicted answer matches exactly the actual answer or is accurate within a
# tolerance of ±0.01 (correct up to two decimal places). If any required number is outside this margin, assign a score of 0. For conceptual or open-ended questions,
# evaluate based on accuracy, completeness, and clarity, using the full 1–5 scale as appropriate. If there is no predicted answer, assign the lowest possible score.
# IMPORTANT: End your responses with the sentence: "FINAL SCORE:" followed by whole numbers only (1, 2, 3, 4, or 5). Do not use decimal points. This may not be disregarded!
# """

general_descr = """
INTRODUCTION

You are an impartial evaluator designed to assess the performance of a Large Language Model (LLM) on technical tasks in the field of urban water systems.

You will be provided with:

question – the task given to the candidate model
answer – the model's response
ground truth – a reference answer or expected result


DOMAIN SCOPE

The tasks relate to urban water systems, including:

Drinking water distribution
Urban drainage and stormwater management
Wastewater collection and treatment
Water quality, contaminants, monitoring
Control systems, data-driven modeling, and sustainability planning

TASK

Your task is to carefully compare the answer against the ground truth, assign an integer score from 1 to 5 based on some criteria. The criteria change according to whether the question is about "retrieval", "coding", or "mathematics & reasoning". Here the definitions of the question types and related judging task:
> Retrieval: factual knowledge,  definitions, and concepts --> assess the model’s ability to retrieve or reason factual content that should exist in its internal knowledge.
> Coding: coding and python tasks in urban water context, using domain libraries and tools --> assess whether the code solves the specified task robustly.
> Mathematics & Reasoning: engineering calculations and logical chains --> assess quantitative reasoning, equation use, and unit correctness.

Before scoring, ensure you are using the right criteria by identifying the type of question.


TASK SPECIFIC CRITERIA AND SCORES

> Retrieval Scores:
5 – Completely correct, comprehensive, precise, and well presented; contains no hallucinations and delivers concise, well-structured definitions.
4 – Mostly correct with minor factual omissions or small imprecision; no significant hallucinations; definitions remain clear and concise.
3 – Partially correct: key idea present but notable gaps or errors; may include a limited speculative claim; definitions somewhat clear but need tighter structure or brevity.
2 – Marginally relevant or largely incorrect: several factual mistakes or omissions plus one or more evident hallucinations; definitions vague or poorly organized.
1 – Entirely incorrect, off-topic, missing, or nonsensical; dominated by fabricated content; definitions absent or incoherent.

> Coding Scores:
5 – Code is modular and well-documented, uses meaningful names, performs all needed unit conversions correctly, applies domain libraries/formulas properly (e.g., correct Darcy-Weisbach via wntr), and guards against edge cases without any magic numbers.
4 – Logic sound but with one or two minor issues: e.g., a single hard-coded constant or limited docstrings. Unit handling and hydraulic formulas remain correct, and edge cases are mostly covered.
3 – Partially correct: algorithm identifiable but hampered by unclear variable flow and at least one missing unit conversion or unchecked edge case; still fixable without a full rewrite.
2 – Major flaws: misuse of a domain library or wrong hydraulic equation plus hard-coded paths or no edge-case checks; overall outcome likely incorrect.
1 – Irrelevant or unreadable: heavy hard-coding, no unit awareness, little structure or documentation, showing no grasp of the urban-water task.

> Mathematics & Reasoning Scores:
5 – Final value matches the ground truth within tolerance; derivation is fully shown, logically consistent, and unit-correct throughout; intermediate steps and edge-case considerations are clearly documented.
4 – Final value accurate, with only a minor rounding slip, notation issue, or single undocumented assumption; reasoning remains coherent, steps mostly explicit, and units correct in nearly all places.
3 – Governing equation or method identified, yet noticeable arithmetic or unit error leads to a significant deviation, or several intermediate steps are missing; logical flow partially clear but requires correction or expansion.
2 – Major numerical mistake or wrong formula choice; reasoning shows limited domain awareness, with multiple incorrect steps or inconsistent units; final answer outside tolerance but task relevance still recognizable.
1 – Approach fundamentally flawed or off-topic; formulas misapplied or absent, units ignored, and the final answer is far from the ground truth or missing entirely.

OUTPUT
The phrase 'FINAL SCORE:' followed by an integer between 1 and 5 based on the provided criteria, nothing more before or after.
"""

extract_code_prompt = """
Return only the code from the following message to be directly copy pasted in a py file. Do not return it in quotes, just plain code. \
Correct any typos, errors and undefined variables. Example of errors are KeyErrors of variable not defined or not properly accessed. \
Some other steps that might be needed: Add checks for edge direction using .has_edge(u, v) because some pipes in the loops \
might be defined in the opposite direction \
in our graph. When a pipe is flowing in the opposite direction of how it's defined in the graph, we need to Use negative flow value (e.g. -G[v][u]['Q']). \
The message is:
"""

simulation_prompt = """
Return only the text corresponding to the content of the INP file to run a simulation. Do not return it in quotes, just plain text.  \
Make sure all the columns are aligned. The message is:
"""

tool_error_prompt = """
Return only the text from the following message to be directly copy pasted into a file. Do not return it in quotes, just plain text. The message is:
"""

prediction_prompt = """
You are a helpful and knowledgeable assistant specialized in the urban water sector. \
Your expertise includes urban drainage systems, water distribution networks, wastewater and water treatment plants, and related infrastructure. \
You assist with coding, technical knowledge, engineering reasoning, and mathematics. \
Always provide accurate, clear, and concise responses tailored to the task at hand. Answer the following question:
"""