199 changes: 199 additions & 0 deletions openspec/changes/add-langfuse-export/design.md
# Design: Langfuse Export for Observability

## Context

During evaluation runs, AgentV produces `output_messages` arrays containing tool calls, assistant responses, and timestamps. This data is valuable for debugging and monitoring, but it currently stays within AgentV's result files.

Industry frameworks (LangWatch, Mastra, Google ADK, Azure SDK) have adopted OpenTelemetry semantic conventions for LLM observability. Langfuse is an open-source platform that accepts traces in a compatible format.

**Stakeholders:**
- AgentV users who need to debug agent behavior
- Teams integrating AgentV into existing LLMOps workflows
- Developers comparing agent configurations across runs

## Goals / Non-Goals

**Goals:**
- Export `output_messages` to Langfuse as structured traces
- Follow OpenTelemetry GenAI semantic conventions where applicable
- Provide opt-in content capture for privacy-sensitive environments
- Keep export logic decoupled from core evaluation flow

**Non-Goals:**
- Full OpenTelemetry SDK integration (deferred)
- Real-time streaming of traces during execution
- Bi-directional sync with Langfuse (import traces)
- Support for other observability platforms in this change (extensible design only)

## Decisions

### Decision 1: Use Langfuse SDK directly (not OTEL SDK)

**What:** Import `langfuse` npm package and use its native trace/span API.

**Why:**
- Langfuse SDK handles authentication, batching, and flushing automatically
- Avoids complexity of OTEL collector setup
- Direct mapping to Langfuse concepts (traces, generations, spans)
- Can add OTEL exporter later as separate capability

**Alternatives considered:**
- Full OTEL SDK + OTLP exporter: More portable but requires collector infrastructure
- Custom HTTP calls: Fragile, no batching, reinvents SDK features
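
A minimal sketch of what direct SDK usage looks like under this decision; the `Langfuse` constructor options and `flushAsync()` are assumed from the current JS SDK surface:

```typescript
import { Langfuse } from "langfuse";

// The SDK batches events in memory and sends them in the background;
// flushAsync() drains the queue on demand.
const langfuse = new Langfuse({
  publicKey: process.env.LANGFUSE_PUBLIC_KEY,
  secretKey: process.env.LANGFUSE_SECRET_KEY,
  baseUrl: process.env.LANGFUSE_HOST, // undefined falls back to Langfuse Cloud
});

await langfuse.flushAsync();
```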

### Decision 2: Map OutputMessage to Langfuse structure

**Mapping:**

| AgentV Concept | Langfuse Concept | Notes |
|----------------|------------------|-------|
| Eval case | Trace | One trace per completed eval case |
| `eval_id` | `trace.name` | Identifies the test case |
| `target` | `trace.metadata.target` | Which provider was used |
| Assistant message with content | Generation | LLM response |
| Tool call | Span (type: "tool") | Individual tool invocation |
| `score` | Score | Attached to trace |

**Langfuse Trace Structure:**
```
Trace: eval_id="case-001"
├── Generation: "assistant response"
│ ├── input: [user messages]
│ ├── output: "response text"
│ └── usage: { input_tokens, output_tokens }
├── Span: tool="search" (type: tool)
│ ├── input: { query: "..." }
│ └── output: "results..."
├── Span: tool="read_file" (type: tool)
│ └── ...
└── Score: name="eval_score", value=0.85
```
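
A conversion sketch of this mapping; the AgentV field names are illustrative, and the `trace`/`generation`/`span`/`score` calls assume the Langfuse JS client API:

```typescript
import { Langfuse } from "langfuse";

// Minimal shapes for illustration only; the real types live in @agentv/core.
interface ToolCall { id?: string; tool: string; input?: unknown; output?: unknown }
interface OutputMessage { role: string; content?: string; toolCalls?: ToolCall[] }
interface EvaluationResult { evalId: string; target: string; score: number }

function exportCase(langfuse: Langfuse, result: EvaluationResult, messages: OutputMessage[]): void {
  // Eval case -> Trace, with eval_id as the trace name.
  const trace = langfuse.trace({
    name: result.evalId,
    metadata: { target: result.target },
  });

  for (const message of messages) {
    // Assistant message with content -> Generation.
    if (message.role === "assistant" && message.content) {
      trace.generation({ name: "assistant response", output: message.content });
    }
    // Tool call -> Span; the "tool" type is carried as metadata here.
    for (const call of message.toolCalls ?? []) {
      trace.span({
        name: call.tool,
        input: call.input,
        output: call.output,
        metadata: { type: "tool", toolCallId: call.id },
      });
    }
  }

  // Evaluation score attached to the trace.
  trace.score({ name: "eval_score", value: result.score });
}
```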

### Decision 3: Attribute naming follows GenAI conventions

Use `gen_ai.*` prefixed attributes where applicable:

```typescript
// Generation attributes
'gen_ai.request.model': target.model,
'gen_ai.usage.input_tokens': usage?.input_tokens,
'gen_ai.usage.output_tokens': usage?.output_tokens,

// Tool span attributes
'gen_ai.tool.name': toolCall.tool,
'gen_ai.tool.call.id': toolCall.id,

// Trace metadata
'agentv.eval_id': evalCase.id,
'agentv.target': target.name,
'agentv.dataset': evalCase.dataset,
```

### Decision 4: Privacy-first content capture

**Default:** Do not capture message content or tool inputs/outputs.

**Opt-in:** Set `LANGFUSE_CAPTURE_CONTENT=true` to include:
- User message content
- Assistant response content
- Tool call inputs and outputs

**Rationale:** Traces may contain PII, secrets, or proprietary data. This follows the Azure SDK and Google ADK pattern of opt-in content capture.
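
A minimal redaction sketch, assuming capture is resolved once from `LANGFUSE_CAPTURE_CONTENT` and using the placeholder values from the spec:

```typescript
const captureContent = process.env.LANGFUSE_CAPTURE_CONTENT === "true";

// Message content: pass through when capture is on, placeholder otherwise.
function redactContent(content: string): string {
  return captureContent ? content : "[content hidden]";
}

// Tool inputs collapse to an empty object, outputs to a placeholder string.
function redactToolInput(input: unknown): unknown {
  return captureContent ? input : {};
}

function redactToolOutput(output: unknown): unknown {
  return captureContent ? output : "[output hidden]";
}
```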

### Decision 5: Flush strategy

**Approach:** Flush traces after each eval case completes (not batched across cases).

**Why:**
- Ensures traces are visible in Langfuse promptly
- Avoids data loss if process crashes
- Trade-off: Slightly higher network overhead (acceptable for eval workloads)

**Configuration:** No user-facing config in v1. Can add `--langfuse-batch` later if needed.
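
A sketch of the per-case loop this implies; `cases` and `runCase` are hypothetical names for the existing orchestration:

```typescript
for (const evalCase of cases) {
  const result = await runCase(evalCase);

  if (exporter) {
    // Export and drain the queue before the next case, per this decision.
    await exporter.export(result, result.outputMessages);
    await exporter.flush();
  }
}
```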

## Data Flow

```
┌─────────────────────────────────────────────────────────────────┐
│ agentv run │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────────┐ │
│ │ Provider │───▶│ Orchestrator │───▶│ EvaluationResult │ │
│ │ Response │ │ │ │ + outputMessages │ │
│ └──────────────┘ └──────────────┘ └────────┬─────────┘ │
│ │ │
│ ┌────────────────────────▼──────┐ │
│ │ LangfuseExporter │ │
│ │ (if --langfuse enabled) │ │
│ └────────────────┬──────────────┘ │
│ │ │
└────────────────────────────────────────────┼────────────────────┘
┌─────────────────┐
│ Langfuse │
│ Platform │
└─────────────────┘
```

## API Surface

### CLI

```bash
# Enable Langfuse export
agentv run eval.yaml --langfuse

# With custom host (self-hosted Langfuse)
LANGFUSE_HOST=https://langfuse.mycompany.com agentv run eval.yaml --langfuse

# With content capture
LANGFUSE_CAPTURE_CONTENT=true agentv run eval.yaml --langfuse
```

### Environment Variables

| Variable | Required | Description |
|----------|----------|-------------|
| `LANGFUSE_PUBLIC_KEY` | Yes (if --langfuse) | Langfuse project public key |
| `LANGFUSE_SECRET_KEY` | Yes (if --langfuse) | Langfuse project secret key |
| `LANGFUSE_HOST` | No | Custom Langfuse host (default: cloud) |
| `LANGFUSE_CAPTURE_CONTENT` | No | Enable content capture (default: false) |

### Programmatic API

```typescript
import { LangfuseExporter } from '@agentv/core/observability';

const exporter = new LangfuseExporter({
publicKey: process.env.LANGFUSE_PUBLIC_KEY,
secretKey: process.env.LANGFUSE_SECRET_KEY,
host: process.env.LANGFUSE_HOST,
captureContent: process.env.LANGFUSE_CAPTURE_CONTENT === 'true',
});

// Export a single result
await exporter.export(evaluationResult, outputMessages);

// Flush pending traces
await exporter.flush();
```
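
The exporter could sit behind a small interface (one possible shape of the `TraceExporter` abstraction mentioned in tasks.md), keeping the run loop independent of Langfuse:

```typescript
export interface TraceExporter {
  /** Convert one evaluation result plus its messages into platform traces. */
  export(result: EvaluationResult, messages: OutputMessage[]): Promise<void>;
  /** Drain any pending traces; called after each case and at shutdown. */
  flush(): Promise<void>;
}
```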

## Risks / Trade-offs

| Risk | Mitigation |
|------|------------|
| Langfuse SDK version churn | Pin to stable version, document upgrade path |
| Network failures during export | Log warning, don't fail evaluation; traces are optional |
| Large traces with many tool calls | Langfuse handles batching internally; monitor payload sizes |
| Content capture leaking secrets | Default to off; document clearly in CLI help |

## Migration Plan

**No migration required.** This is a new optional feature. Existing users are unaffected unless they enable `--langfuse`.

## Open Questions

1. Should we support `--langfuse-session-id` to group multiple eval runs? (Defer to user feedback)
2. Should token usage be estimated if provider doesn't return it? (Defer - not all providers report usage)
3. Should we add a `--dry-run-langfuse` to preview traces without sending? (Nice to have, not v1)
32 changes: 32 additions & 0 deletions openspec/changes/add-langfuse-export/proposal.md
# Change: Add Langfuse Export for Observability

## Why

AgentV captures rich execution traces via `output_messages` (tool calls, assistant responses, timestamps) but has no way to export this data to observability platforms. Users need to debug agent behavior, monitor performance, and integrate with existing LLMOps tooling.

Langfuse is an open-source LLM observability platform that supports OpenTelemetry-compatible trace ingestion. By exporting AgentV traces to Langfuse, users can:
- Visualize agent execution flows
- Debug tool call sequences
- Track token usage and latency across evaluations
- Compare agent behavior across different configurations

## What Changes

- **Add `langfuse` export option**: Convert `output_messages` to OpenTelemetry-compatible spans and send them to Langfuse
  - New `--langfuse` CLI flag enables export during `agentv run`
  - Supports `LANGFUSE_PUBLIC_KEY`, `LANGFUSE_SECRET_KEY`, and `LANGFUSE_HOST` environment variables
  - Maps `OutputMessage` and `ToolCall` to the Langfuse trace/span format
  - Uses `gen_ai.*` semantic conventions for LLM attributes
  - Optional content capture controlled by `LANGFUSE_CAPTURE_CONTENT` (default: false for privacy)

- **Add new `observability` capability spec**: Defines trace export behavior and provider contracts

## Impact

- Affected specs: New `observability` capability (does not modify existing specs)
- Affected code:
- `packages/core/src/observability/` (new directory)
- `packages/core/src/observability/langfuse-exporter.ts` (new file)
- `packages/core/src/observability/types.ts` (new file)
- `apps/cli/src/index.ts` (add `--langfuse` flag to run command)
- `packages/core/package.json` (add `langfuse` dependency)
104 changes: 104 additions & 0 deletions openspec/changes/add-langfuse-export/specs/observability/spec.md
# Spec: Observability Capability

## Purpose

Defines trace export functionality for sending AgentV evaluation data to external observability platforms. Enables debugging, monitoring, and analysis of agent execution through industry-standard tooling.

## ADDED Requirements

### Requirement: Langfuse Trace Export

The system SHALL support exporting evaluation traces to Langfuse when enabled via CLI flag.

#### Scenario: Export enabled with valid credentials

- **WHEN** the user runs `agentv run eval.yaml --langfuse`
- **AND** `LANGFUSE_PUBLIC_KEY` and `LANGFUSE_SECRET_KEY` environment variables are set
- **THEN** the system creates a Langfuse trace for each completed eval case
- **AND** the trace includes the `eval_id` as the trace name
- **AND** the trace includes metadata for `target`, `dataset`, and `score`

#### Scenario: Export disabled by default

- **WHEN** the user runs `agentv run eval.yaml` without `--langfuse` flag
- **THEN** no traces are sent to Langfuse
- **AND** the evaluation proceeds normally without observability overhead

#### Scenario: Missing credentials with flag enabled

- **WHEN** the user runs `agentv run eval.yaml --langfuse`
- **AND** `LANGFUSE_PUBLIC_KEY` or `LANGFUSE_SECRET_KEY` is not set
- **THEN** the system emits a warning message
- **AND** evaluation proceeds without Langfuse export
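
A non-normative sketch of this guard (function and message wording are illustrative):

```typescript
function maybeCreateExporter(): LangfuseExporter | undefined {
  const { LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY } = process.env;
  if (!LANGFUSE_PUBLIC_KEY || !LANGFUSE_SECRET_KEY) {
    // Warn and disable export instead of failing the evaluation run.
    console.warn(
      "--langfuse is set but LANGFUSE_PUBLIC_KEY/LANGFUSE_SECRET_KEY are missing; skipping export."
    );
    return undefined;
  }
  return new LangfuseExporter({
    publicKey: LANGFUSE_PUBLIC_KEY,
    secretKey: LANGFUSE_SECRET_KEY,
  });
}
```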

### Requirement: OutputMessage to Trace Mapping

The system SHALL convert `output_messages` to Langfuse-compatible trace structure.

#### Scenario: Assistant message becomes Generation

- **WHEN** an `OutputMessage` has `role: "assistant"` and `content`
- **THEN** a Langfuse Generation is created with the content as output
- **AND** the Generation includes `gen_ai.request.model` if available from target

#### Scenario: Tool call becomes Span

- **WHEN** an `OutputMessage` contains a `toolCalls` array
- **THEN** each `ToolCall` becomes a Langfuse Span with `type: "tool"`
- **AND** the Span includes `gen_ai.tool.name` attribute set to the tool name
- **AND** the Span includes `gen_ai.tool.call.id` if the tool call has an `id`

#### Scenario: Evaluation score attached to trace

- **WHEN** an `EvaluationResult` is exported
- **THEN** the trace includes a Langfuse Score with `name: "eval_score"` and `value` set to the result score
- **AND** the Score includes `comment` with the evaluation reasoning if available

### Requirement: Privacy-Controlled Content Capture

The system SHALL respect privacy settings when exporting trace content.

#### Scenario: Content capture disabled (default)

- **WHEN** `LANGFUSE_CAPTURE_CONTENT` is not set or set to `"false"`
- **THEN** message content is replaced with placeholder text `"[content hidden]"`
- **AND** tool call inputs are replaced with `{}`
- **AND** tool call outputs are replaced with `"[output hidden]"`

#### Scenario: Content capture enabled

- **WHEN** `LANGFUSE_CAPTURE_CONTENT` is set to `"true"`
- **THEN** full message content is included in Generations
- **AND** full tool call inputs and outputs are included in Spans

### Requirement: Custom Langfuse Host

The system SHALL support self-hosted Langfuse instances.

#### Scenario: Custom host configuration

- **WHEN** `LANGFUSE_HOST` environment variable is set
- **THEN** the exporter sends traces to the specified host URL
- **AND** authentication uses the same `LANGFUSE_PUBLIC_KEY` and `LANGFUSE_SECRET_KEY`

#### Scenario: Default to cloud host

- **WHEN** `LANGFUSE_HOST` is not set
- **THEN** the exporter uses the default Langfuse cloud endpoint

### Requirement: Graceful Export Failures

The system SHALL handle export errors without disrupting evaluation.

#### Scenario: Network error during export

- **WHEN** sending a trace to Langfuse fails due to network error
- **THEN** the system logs a warning with the error details
- **AND** the evaluation result is still written to the output file
- **AND** subsequent eval cases continue to attempt export

#### Scenario: Flush at evaluation end

- **WHEN** all eval cases have completed
- **THEN** the system flushes any pending traces to Langfuse
- **AND** waits for flush to complete before exiting (with timeout)
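
A non-normative sketch of the bounded flush (the timeout value is illustrative):

```typescript
async function flushWithTimeout(exporter: TraceExporter, timeoutMs = 5_000): Promise<void> {
  // Resolve either when the flush finishes or when the timeout elapses,
  // so a slow or unreachable Langfuse host cannot block process exit.
  await Promise.race([
    exporter.flush(),
    new Promise<void>((resolve) => setTimeout(resolve, timeoutMs)),
  ]);
}
```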
51 changes: 51 additions & 0 deletions openspec/changes/add-langfuse-export/tasks.md
# Tasks: Add Langfuse Export

## 1. Core Implementation

- [ ] 1.1 Create `packages/core/src/observability/` directory structure
- [ ] 1.2 Define `TraceExporter` interface in `types.ts`
- [ ] 1.3 Implement `LangfuseExporter` class with trace/span conversion
- [ ] 1.4 Add `langfuse` dependency to `packages/core/package.json`
- [ ] 1.5 Export observability module from `packages/core/src/index.ts`

## 2. OutputMessage to Langfuse Mapping

- [ ] 2.1 Implement `convertToLangfuseTrace()` function
- [ ] 2.2 Map `OutputMessage` with content to Langfuse Generation
- [ ] 2.3 Map `ToolCall` to Langfuse Span (type: tool)
- [ ] 2.4 Attach evaluation score to trace
- [ ] 2.5 Add `gen_ai.*` semantic convention attributes

## 3. Privacy Controls

- [ ] 3.1 Implement content filtering based on `LANGFUSE_CAPTURE_CONTENT`
- [ ] 3.2 Strip message content when capture disabled
- [ ] 3.3 Strip tool inputs/outputs when capture disabled
- [ ] 3.4 Document privacy behavior in code comments

## 4. CLI Integration

- [ ] 4.1 Add `--langfuse` flag to `run` command in `apps/cli/src/index.ts`
- [ ] 4.2 Validate required environment variables when flag is set
- [ ] 4.3 Initialize `LangfuseExporter` when enabled
- [ ] 4.4 Call exporter after each `EvaluationResult` is produced
- [ ] 4.5 Flush exporter after all eval cases complete

## 5. Error Handling

- [ ] 5.1 Catch and log Langfuse SDK errors without failing evaluation
- [ ] 5.2 Warn on missing credentials when `--langfuse` is used
- [ ] 5.3 Handle network timeouts gracefully

## 6. Testing

- [ ] 6.1 Unit tests for `convertToLangfuseTrace()` mapping
- [ ] 6.2 Unit tests for content filtering logic
- [ ] 6.3 Integration test with mock Langfuse server (optional)
- [ ] 6.4 Add example in `examples/` directory

## 7. Documentation

- [ ] 7.1 Add CLI help text for `--langfuse` flag
- [ ] 7.2 Document environment variables in README or docs
- [ ] 7.3 Add usage example to CLI `--help` output