Commit 74423d6

Integrate Langfuse observability into VQA evaluation and update tracing, replacing Opik as the default (#34)

* Integrate Langfuse observability into agentic VQA evaluation, replacing Opik references and updating tracing mechanisms
* Refactor observability integration: remove Opik references, streamline Langfuse integration, and tidy up code formatting
* Update README and notebooks from Opik to Langfuse
* Update the MEP directory path in dashboard.py for consistency with the new structure
* Rename Opik references to Langfuse in the agentic VQA evaluation agents and update tracing parameters for observability
* Add integration-test instructions to the README for API key validation
* Require Langfuse version 4 and add a fallback for missing attribute propagation
* Fix the integration-test command path in the README for consistency
* Add Google GenAI and OpenAI instrumentation support in the Langfuse integration
1 parent fceb864 commit 74423d6

File tree

28 files changed

+928
-1067
lines changed


README.md

Lines changed: 12 additions & 0 deletions
```diff
@@ -72,6 +72,18 @@ recent research, with fully reproducible notebooks and evaluation pipelines.
    uv run jupyter lab
    ```
 
+5. Run integration tests to validate that your API keys are set up correctly:
+
+   ```bash
+   uv run --env-file .env pytest -sv tests/test_integration.py
+   ```
+
+   > **Note:** If your `.env` file is incomplete or needs to be updated, you can re-run onboarding manually from inside your Coder workspace (from the repo root):
+   >
+   > ```bash
+   > onboard --bootcamp-name "llm-interpretability-bootcamp" --output-dir "." --test-script "./aieng-llm-interp/tests/test_integration.py" --env-example "./.env.example" --test-marker "integration_test" --force
+   > ```
+
 ## License
 
 This project is licensed under the terms of the [LICENSE](LICENSE.md) file in the root directory.
```
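The integration-test step added above points at `tests/test_integration.py`, which this diff does not show. As a rough sketch of the kind of check such a test might perform (the key names and the placeholder convention are assumptions, mirroring the `startswith("your_")` check used elsewhere in this commit, not the repo's actual tests):

```python
import os

# Hypothetical key list -- the repo's actual tests/test_integration.py
# may validate a different set of environment variables.
REQUIRED_KEYS = ["GEMINI_API_KEY", "OPENAI_API_KEY"]


def missing_keys(env) -> list:
    """Return required keys that are unset or still placeholders like 'your_...'."""
    return [
        k for k in REQUIRED_KEYS
        if not env.get(k, "") or env.get(k, "").startswith("your_")
    ]


if __name__ == "__main__":
    bad = missing_keys(os.environ)
    print("ok" if not bad else f"missing or placeholder keys: {bad}")
```

Running it under `uv run --env-file .env` would surface any variable that was never filled in.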

implementations/agentic_vqa_eval/README.md

Lines changed: 132 additions & 185 deletions
Large diffs are not rendered by default.

implementations/agentic_vqa_eval/analysis.ipynb

Lines changed: 6 additions & 2 deletions
```diff
@@ -4,7 +4,7 @@
    "cell_type": "markdown",
    "id": "7fb27b941602401d91542211134fc71a",
    "metadata": {},
-   "source": "# ChartQAPro Evaluation — Analysis Walkthrough\n\nThis notebook walks through the full evaluation artifact stack produced by the\nagentic ChartQAPro framework. By the end you will be able to:\n\n- Load and inspect **MEPs** (Model Evaluation Packets) directly\n- Plot **accuracy by question type** from `metrics.jsonl`\n- Visualise the **verifier revision rate** and its effect on accuracy\n- Chart the **failure taxonomy** breakdown from `taxonomy.jsonl`\n- Browse individual samples — question, plan, vision answer, verifier verdict, chart image\n\n**Prerequisites:** Run these commands first:\n```bash\nuv run --env-file .env -m agentic_chartqapro_eval.runner.run_generate_meps --n 25 --config gemini_gemini\nuv run --env-file .env -m agentic_chartqapro_eval.eval.eval_outputs --mep_dir meps/gemini_gemini/chartqapro/test --out output/metrics.jsonl\nuv run --env-file .env -m agentic_chartqapro_eval.eval.error_taxonomy --mep_dir meps/gemini_gemini/chartqapro/test --metrics_file output/metrics.jsonl --out output/taxonomy.jsonl\n```"
+   "source": "# ChartQAPro Evaluation — Analysis Walkthrough\n\nThis notebook walks through the full evaluation artifact stack produced by the\nagentic ChartQAPro framework. By the end you will be able to:\n\n- Load and inspect **MEPs** (Model Evaluation Packets) directly\n- Plot **accuracy by question type** from `metrics.jsonl`\n- Visualise the **verifier revision rate** and its effect on accuracy\n- Chart the **failure taxonomy** breakdown from `taxonomy.jsonl`\n- Browse individual samples — question, plan, vision answer, verifier verdict, chart image\n\n**Prerequisites:** Run these commands first (from any directory in the repo):\n```bash\nuv run --env-file \"$(git rev-parse --show-toplevel)/.env\" --directory \"$(git rev-parse --show-toplevel)/implementations/agentic_vqa_eval\" -m agentic_chartqapro_eval.runner.run_generate_meps --n 25 --config gemini_gemini\nuv run --env-file \"$(git rev-parse --show-toplevel)/.env\" --directory \"$(git rev-parse --show-toplevel)/implementations/agentic_vqa_eval\" -m agentic_chartqapro_eval.eval.eval_outputs --mep_dir meps/gemini_gemini/chartqapro/test --out output/metrics.jsonl\nuv run --env-file \"$(git rev-parse --show-toplevel)/.env\" --directory \"$(git rev-parse --show-toplevel)/implementations/agentic_vqa_eval\" -m agentic_chartqapro_eval.eval.error_taxonomy --mep_dir meps/gemini_gemini/chartqapro/test --metrics_file output/metrics.jsonl --out output/taxonomy.jsonl\n```"
   },
   {
    "cell_type": "markdown",
@@ -272,7 +272,11 @@
     "\n",
     "    wrong = tax_df[tax_df[\"failure_type\"] != \"correct\"]\n",
     "    print(f\"\\nTotal wrong: {len(wrong)} / {len(tax_df)}\")\n",
-    "    print(f\"Most common failure: {counts[counts.index != 'correct'].idxmax()}\")"
+    "    failure_counts = counts[counts.index != \"correct\"]\n",
+    "    if failure_counts.empty:\n",
+    "        print(\"Most common failure: none (all samples correct)\")\n",
+    "    else:\n",
+    "        print(f\"Most common failure: {failure_counts.idxmax()}\")"
     ]
    },
    {
```
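The guard added in the second hunk avoids calling `idxmax()` on an empty pandas selection, which raises `ValueError` when every sample is classified `correct`. The same pattern in plain Python, with `collections.Counter` standing in for the notebook's value counts (illustrative only, not the notebook's code):

```python
from collections import Counter


def most_common_failure(failure_types):
    """Return the most frequent non-'correct' failure type, or None if all correct."""
    counts = Counter(t for t in failure_types if t != "correct")
    if not counts:  # every sample was 'correct' -- nothing to rank
        return None
    return counts.most_common(1)[0][0]


print(most_common_failure(["correct", "ocr_error", "ocr_error", "plan_error"]))  # ocr_error
print(most_common_failure(["correct", "correct"]))  # None
```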

implementations/agentic_vqa_eval/run_pipeline.ipynb

Lines changed: 45 additions & 93 deletions
```diff
@@ -16,7 +16,7 @@
     "|---|---|\n",
     "| 1 — Configuration | All tunable parameters in one place |\n",
     "| 2 — Environment | Check API keys, install path, imports |\n",
-    "| 2.5 — Opik health check | Verify Opik stack is reachable and API-responsive before running |\n",
+    "| 2.5 — Langfuse health check | Verify Langfuse credentials are configured before running |\n",
     "| 3 — Load dataset | Pull samples from HuggingFace |\n",
     "| 4 — Instantiate agents | Build Planner, OCR, Vision, Verifier |\n",
     "| 5 — Run pipeline | Generate MEPs (Plan → OCR → Vision → Verify) |\n",
@@ -119,7 +119,7 @@
     "    val = os.environ.get(var, \"\")\n",
     "    needed = needed_for in CONFIG\n",
     "    if val and not val.startswith(\"your_\"):\n",
-    "        print(f\"  ok {var} ({val[:12]}...)\")\n",
+    "        print(f\"  ok {var} ({val[:3]}...)\")\n",
     "    elif needed:\n",
     "        print(f\"  MISSING {var} <- required for {CONFIG}\")\n",
     "        missing.append(var)\n",
@@ -141,8 +141,8 @@
     "from agentic_chartqapro_eval.eval.eval_outputs import evaluate_mep  # noqa: E402\n",
     "from agentic_chartqapro_eval.eval.eval_traces import evaluate_trace  # noqa: E402\n",
     "from agentic_chartqapro_eval.eval.summarize import summarize, write_csv  # noqa: E402\n",
+    "from agentic_chartqapro_eval.langfuse_integration.client import get_client  # noqa: E402\n",
     "from agentic_chartqapro_eval.mep.writer import iter_meps  # noqa: E402\n",
-    "from agentic_chartqapro_eval.opik_integration.client import get_client  # noqa: E402\n",
     "from agentic_chartqapro_eval.runner.run_generate_meps import (  # noqa: E402\n",
     "    BACKEND_CONFIGS,\n",
     "    process_sample,\n",
@@ -159,19 +159,17 @@
    "id": "cell-opik-hdr",
    "metadata": {},
    "source": [
-    "## 2.5 — Opik Health Check\n",
+    "## 2.5 — Langfuse Health Check\n",
     "\n",
-    "Verifies that the self-hosted Opik stack is **fully operational** before the pipeline runs.\n",
-    "Three checks are run in sequence:\n",
+    "Verifies that Langfuse credentials are configured before the pipeline runs.\n",
     "\n",
     "| Check | What it tests |\n",
     "|---|---|\n",
-    "| HTTP reachable | TCP connection to `OPIK_URL_OVERRIDE` succeeds within 5 s |\n",
-    "| Client init | `opik.Opik()` initialises without error |\n",
-    "| API read test | A lightweight `search_traces` call returns a valid response |\n",
+    "| Env vars present | `LANGFUSE_PUBLIC_KEY` and `LANGFUSE_SECRET_KEY` are set in `.env` |\n",
+    "| Client init | `Langfuse()` initialises without error |\n",
     "\n",
-    "If `OPIK_URL_OVERRIDE` is not set the cell prints a skip notice and continues — Opik is optional.\n",
-    "If any check fails the pipeline still runs; only tracing is affected."
+    "If the keys are absent the cell prints a skip notice and continues — Langfuse is optional.\n",
+    "The pipeline produces identical MEPs with or without it; tracing is purely additive."
    ]
   },
   {
@@ -181,107 +179,61 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "import urllib.error\n",
-    "import urllib.request\n",
-    "\n",
-    "# Force re-initialisation so re-running this cell after starting Docker works correctly\n",
-    "from agentic_chartqapro_eval.opik_integration.client import reset_client\n",
+    "from agentic_chartqapro_eval.langfuse_integration.client import reset_client\n",
     "\n",
     "\n",
+    "# Force re-initialisation so re-running this cell picks up any .env changes\n",
     "reset_client()\n",
     "\n",
-    "OPIK_URL = os.environ.get(\"OPIK_URL_OVERRIDE\", \"\")\n",
+    "lf_public = os.environ.get(\"LANGFUSE_PUBLIC_KEY\", \"\")\n",
+    "lf_secret = os.environ.get(\"LANGFUSE_SECRET_KEY\", \"\")\n",
     "\n",
-    "if not OPIK_URL:\n",
-    "    print(\"[skip] OPIK_URL_OVERRIDE is not set.\")\n",
-    "    print(\"       Opik tracing is disabled. Pipeline will run fine without it.\")\n",
+    "if not lf_public or not lf_secret:\n",
+    "    print(\"[skip] LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY are not set.\")\n",
+    "    print(\"       Langfuse tracing is disabled. Pipeline will run fine without it.\")\n",
     "    print()\n",
-    "    print(\"To enable Opik tracing:\")\n",
-    "    print(\"  1. Start the Docker stack:\")\n",
-    "    print(\"     cd /path/to/opik/deployment/docker-compose\")\n",
-    "    print(\"     docker compose --profile opik up -d\")\n",
-    "    print(\"  2. Add to .env: OPIK_URL_OVERRIDE=http://localhost:5173/api\")\n",
-    "    print(\"  3. Re-run this cell.\")\n",
+    "    print(\"To enable Langfuse tracing, add to .env:\")\n",
+    "    print(\"  LANGFUSE_PUBLIC_KEY=pk-lf-...\")\n",
+    "    print(\"  LANGFUSE_SECRET_KEY=sk-lf-...\")\n",
+    "    print(\"  # LANGFUSE_HOST=https://cloud.langfuse.com (default; change for self-hosted)\")\n",
     "else:\n",
     "    results = {}\n",
     "\n",
-    "    # -- Check 1: HTTP reachability (any response = server is up) --\n",
-    "    try:\n",
-    "        with urllib.request.urlopen(OPIK_URL, timeout=5) as r:\n",
-    "            results[\"http\"] = (\"ok\", f\"HTTP {r.status}\")\n",
-    "    except urllib.error.HTTPError as e:\n",
-    "        # HTTPError means server responded -- it is up, just returned a non-200\n",
-    "        results[\"http\"] = (\"ok\", f\"HTTP {e.code} (server responded)\")\n",
-    "    except Exception as e:\n",
-    "        results[\"http\"] = (\"fail\", str(e))\n",
+    "    # -- Check 1: Env vars present --\n",
+    "    results[\"env\"] = (\"ok\", f\"pk={lf_public[:3]}...\")\n",
     "\n",
-    "    # -- Check 2: Opik Python client initialises --\n",
-    "    _opik_hc = None\n",
+    "    # -- Check 2: Client initialises --\n",
     "    try:\n",
-    "        from agentic_chartqapro_eval.opik_integration.client import get_client\n",
-    "\n",
-    "        _opik_hc = get_client()\n",
-    "        if _opik_hc is not None:\n",
-    "            results[\"client\"] = (\"ok\", \"opik.Opik() ready\")\n",
+    "        _lf_hc = get_client()\n",
+    "        if _lf_hc is not None:\n",
+    "            results[\"client\"] = (\"ok\", \"Langfuse() ready\")\n",
     "        else:\n",
    "            results[\"client\"] = (\"fail\", \"get_client() returned None\")\n",
     "    except Exception as e:\n",
     "        results[\"client\"] = (\"fail\", str(e))\n",
     "\n",
-    "    # -- Check 3: API actually responds to a lightweight read --\n",
-    "    if results.get(\"client\", (\"\",))[0] == \"ok\" and _opik_hc is not None:\n",
-    "        try:\n",
-    "            traces = _opik_hc.search_traces(max_results=1)\n",
-    "            results[\"api\"] = (\"ok\", f\"search_traces returned {len(traces)} result(s)\")\n",
-    "        except Exception as e:\n",
-    "            err_str = str(e)\n",
-    "            hint = \"\"\n",
-    "            if \"readonly\" in err_str.lower() or \"500\" in err_str:\n",
-    "                hint = \" [ClickHouse replica may be read-only -- run SYSTEM RESTORE REPLICA]\"\n",
-    "            results[\"api\"] = (\"fail\", err_str[:120] + hint)\n",
-    "    else:\n",
-    "        results[\"api\"] = (\"skip\", \"client unavailable\")\n",
-    "\n",
     "    # -- Report --\n",
-    "    print(f\"Opik URL : {OPIK_URL}\")\n",
-    "    print()\n",
     "    labels = [\n",
-    "        (\"http\", \"HTTP reachable \"),\n",
-    "        (\"client\", \"Client init    \"),\n",
-    "        (\"api\", \"API read test  \"),\n",
+    "        (\"env\", \"Env vars present\"),\n",
+    "        (\"client\", \"Client init     \"),\n",
     "    ]\n",
     "    all_ok = True\n",
     "    for key, label in labels:\n",
     "        status, detail = results.get(key, (\"skip\", \"\"))\n",
-    "        if status == \"ok\":\n",
-    "            marker = \"✓ OK  \"\n",
-    "        elif status == \"skip\":\n",
-    "            marker = \"⊘ skip\"\n",
-    "        else:\n",
-    "            marker = \"✗ FAIL\"\n",
+    "        marker = \"✓ OK  \" if status == \"ok\" else (\"⊘ skip\" if status == \"skip\" else \"✗ FAIL\")\n",
+    "        if status not in (\"ok\", \"skip\"):\n",
    "            all_ok = False\n",
     "        print(f\"  {marker} {label} {detail}\")\n",
     "\n",
     "    print()\n",
     "    if all_ok:\n",
-    "        dashboard_url = OPIK_URL.rstrip(\"/\").removesuffix(\"/api\")\n",
-    "        print(\"Opik is fully operational.\")\n",
-    "        print(f\"Dashboard : {dashboard_url}\")\n",
+    "        lf_host = os.environ.get(\"LANGFUSE_HOST\") or os.environ.get(\"LANGFUSE_BASE_URL\") or \"https://cloud.langfuse.com\"\n",
+    "        print(\"Langfuse is configured.\")\n",
+    "        print(f\"Host : {lf_host}\")\n",
     "        print(\"Traces and scores will be recorded automatically during the pipeline run.\")\n",
     "    else:\n",
-    "        print(\"⚠ WARNING: One or more Opik checks failed.\")\n",
-    "        print(\"The pipeline will still run; Opik tracing may not work correctly.\")\n",
-    "        if results.get(\"http\", (\"\",))[0] == \"fail\":\n",
-    "            print()\n",
-    "            print(\"  Docker stack appears to be down. To start it:\")\n",
-    "            print(\"    cd /path/to/opik/deployment/docker-compose\")\n",
-    "            print(\"    docker compose --profile opik up -d\")\n",
-    "        if results.get(\"api\", (\"\",))[0] == \"fail\":\n",
-    "            print()\n",
-    "            print(\"  API is reachable but not responding correctly.\")\n",
-    "            print(\"  Check ClickHouse replica state:\")\n",
-    "            print(\"    docker exec opik-clickhouse-1 clickhouse-client --query \\\\\")\n",
-    "            print(\"    \\\"SELECT database,table,is_readonly FROM system.replicas WHERE database='opik'\\\"\")"
+    "        print(\"⚠ WARNING: Langfuse client failed to initialise.\")\n",
+    "        print(\"The pipeline will still run; tracing will be skipped.\")"
    ]
   },
   {
@@ -376,10 +328,10 @@
     "else:\n",
     "    print(\"OcrReaderTool : disabled (USE_OCR=False)\")\n",
     "\n",
-    "# Opik observability (no-op if OPIK_URL_OVERRIDE not set)\n",
-    "opik_client = get_client()\n",
-    "opik_status = \"enabled\" if opik_client else \"not configured\"\n",
-    "print(f\"Opik : {opik_status}\")"
+    "# Langfuse observability (no-op if keys not set)\n",
+    "lf_client = get_client()\n",
+    "lf_status = \"enabled\" if lf_client else \"not configured\"\n",
+    "print(f\"Langfuse : {lf_status}\")"
    ]
   },
   {
@@ -421,7 +373,7 @@
     "    config,\n",
     "    RUN_ID,\n",
     "    OUT_DIR,\n",
-    "    opik_client=opik_client,\n",
+    "    lf_client=lf_client,\n",
     "    verifier_agent=verifier,\n",
     "    ocr_tool=ocr,\n",
     ")\n",
@@ -459,8 +411,8 @@
     "## 6 — Inspect First MEP\n",
     "\n",
     "MEPs are self-contained JSON files. Every field you see here is what the agent actually\n",
-    "produced — no post-processing. The `opik_trace_id` links this MEP back to the live trace\n",
-    "in the Opik dashboard if Opik is configured."
+    "produced — no post-processing. The `lf_trace_id` links this MEP back to the live trace\n",
+    "in the Langfuse dashboard if Langfuse is configured."
    ]
   },
   {
@@ -501,8 +453,8 @@
     "    print(\"Timestamps (ms):\")\n",
     "    for k in [\"planner_ms\", \"ocr_ms\", \"vision_ms\", \"verifier_ms\"]:\n",
     "        print(f\"  {k:<16} {ts.get(k, 0):.0f}\")\n",
-    "    if mep.get(\"opik_trace_id\"):\n",
-    "        print(f\"Opik trace ID: {mep['opik_trace_id']}\")\n",
+    "    if mep.get(\"lf_trace_id\"):\n",
+    "        print(f\"Langfuse trace ID: {mep['lf_trace_id']}\")\n",
     "    print(\"=\" * 64)\n",
     "\n",
     "    img_path = s.get(\"image_ref\", {}).get(\"path\", \"\")\n",
@@ -609,7 +561,7 @@
     "    config,\n",
     "    RUN_ID_NO_OCR,\n",
     "    OUT_DIR_NO_OCR,\n",
-    "    opik_client=opik_client,\n",
+    "    lf_client=lf_client,\n",
     "    verifier_agent=verifier,\n",
     "    ocr_tool=None,  # <-- OCR disabled\n",
     ")\n",
```
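The notebook relies on `langfuse_integration.client.get_client()` returning `None` whenever tracing is unavailable, and on `reset_client()` dropping the cache so a re-run picks up `.env` changes. A minimal sketch of how such a client module could implement that contract (names and behaviour are assumptions; the repo's actual module may differ):

```python
import os

_client = None
_initialised = False


def get_client():
    """Return a cached Langfuse client, or None when keys or the SDK are absent."""
    global _client, _initialised
    if _initialised:
        return _client
    _initialised = True
    public = os.environ.get("LANGFUSE_PUBLIC_KEY", "")
    secret = os.environ.get("LANGFUSE_SECRET_KEY", "")
    if not public or not secret:
        return None  # tracing is optional; the pipeline runs without it
    try:
        # Imported lazily so the langfuse package stays an optional dependency
        from langfuse import Langfuse

        _client = Langfuse(
            public_key=public,
            secret_key=secret,
            host=os.environ.get("LANGFUSE_HOST", "https://cloud.langfuse.com"),
        )
    except Exception:
        _client = None  # init failure degrades to no tracing
    return _client


def reset_client():
    """Drop the cached client so re-running a notebook cell picks up .env changes."""
    global _client, _initialised
    _client, _initialised = None, False
```

With this shape, the health-check cell's `get_client() is None` path and its `reset_client()` call both behave as the notebook expects.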

implementations/agentic_vqa_eval/src/agentic_chartqapro_eval/agents/planner_agent.py

Lines changed: 4 additions & 4 deletions
```diff
@@ -11,7 +11,7 @@
 from crewai import LLM, Agent, Crew, Task
 
 from ..datasets.perceived_sample import PerceivedSample
-from ..opik_integration.tracing import close_span, open_llm_span
+from ..langfuse_integration.tracing import close_span, open_llm_span
 from ..utils.json_strict import parse_strict
 
 
@@ -137,7 +137,7 @@ def __init__(
         self.api_key = api_key
         self._llm = _build_llm(backend, model, api_key)
 
-    def run(self, sample: PerceivedSample, opik_trace: Any = None) -> Tuple[str, dict, bool, str]:
+    def run(self, sample: PerceivedSample, lf_trace: Any = None) -> Tuple[str, dict, bool, str]:
         """
         Execute the planning phase for a new question.
@@ -148,7 +148,7 @@ def run(self, sample: PerceivedSample, opik_trace: Any = None) -> Tuple[str, dic
         ----------
         sample : PerceivedSample
             The question and context to plan for.
-        opik_trace : Any, optional
+        lf_trace : Any, optional
             Observability object for logging.
 
         Returns
@@ -165,7 +165,7 @@ def run(self, sample: PerceivedSample, opik_trace: Any = None) -> Tuple[str, dic
         prompt = build_planner_prompt(sample)
 
         span = open_llm_span(
-            opik_trace,
+            lf_trace,
             name="planner",
             input_data={"prompt": prompt},
             model=self.model,
```
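Both agents thread an optional `lf_trace` through `open_llm_span`/`close_span`. A common way to keep tracing purely additive is for these helpers to no-op when the trace is `None`; here is a hypothetical sketch of that pattern (the repo's `langfuse_integration.tracing` signatures may differ, and `trace.generation(...)` assumes a Langfuse-style trace object):

```python
def open_llm_span(trace, name, input_data=None, model=None):
    """Open a generation span on the trace, or return None when tracing is off."""
    if trace is None:
        return None  # tracing disabled -- callers get a harmless placeholder
    return trace.generation(name=name, input=input_data, model=model)


def close_span(span, output=None):
    """Close a span if one was opened; safe to call with None."""
    if span is None:
        return
    span.end(output=output)


# With tracing disabled, every call degrades to a no-op:
span = open_llm_span(None, name="planner", input_data={"prompt": "..."})
close_span(span)  # span is None, so nothing is recorded
```

This is why the pipeline produces identical MEPs with or without Langfuse configured: the agent code calls the same two helpers either way.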

implementations/agentic_vqa_eval/src/agentic_chartqapro_eval/agents/verifier_agent.py

Lines changed: 4 additions & 4 deletions
```diff
@@ -23,7 +23,7 @@
 from openai import OpenAI
 from PIL import Image
 
-from ..opik_integration.tracing import close_span, open_llm_span
+from ..langfuse_integration.tracing import close_span, open_llm_span
 from ..utils.json_strict import parse_strict
 
 
@@ -203,7 +203,7 @@ def run(
         sample,  # PerceivedSample
         plan: dict,
         vision_parsed: dict,
-        opik_trace: Any = None,
+        lf_trace: Any = None,
     ) -> Tuple[str, dict, bool, str]:
         """
         Critically audit a draft answer using a single VLM call.
@@ -216,7 +216,7 @@ def run(
             The inspection plan used by the previous agent.
         vision_parsed : dict
             The draft answer and explanation to audit.
-        opik_trace : Any, optional
+        lf_trace : Any, optional
             Tracing object for observability.
 
         Returns
@@ -250,7 +250,7 @@ def run(
         )
 
         span = open_llm_span(
-            opik_trace,
+            lf_trace,
             name="verifier",
             input_data={"prompt": prompt, "draft_answer": draft_answer},
             model=self.model,
```
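Both agents also parse model replies with `parse_strict` from `utils.json_strict`, which this diff does not show. One plausible sketch of such a strict parser: tolerate a markdown code fence around the reply, then require a JSON object (hypothetical; the repo's implementation may differ):

```python
import json


def parse_strict(text: str) -> dict:
    """Parse an LLM reply as a JSON object, tolerating a markdown code fence."""
    cleaned = text.strip()
    if cleaned.startswith("```"):
        # Drop the opening fence line (with its optional language tag) ...
        cleaned = cleaned.split("\n", 1)[1] if "\n" in cleaned else ""
        cleaned = cleaned.rstrip()
        # ... and the closing fence, if present.
        if cleaned.endswith("```"):
            cleaned = cleaned[:-3].rstrip()
    obj = json.loads(cleaned)  # raises on malformed JSON -- strict by design
    if not isinstance(obj, dict):
        raise ValueError(f"expected a JSON object, got {type(obj).__name__}")
    return obj


print(parse_strict('```json\n{"verdict": "revise"}\n```'))
```

Failing loudly on malformed output lets the caller record the sample as a parse failure instead of silently mis-scoring it.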
