199 changes: 199 additions & 0 deletions openspec/changes/add-langfuse-export/design.md
# Design: Langfuse Export for Observability

## Context

During evaluation runs, AgentV produces `output_messages` arrays containing tool calls, assistant responses, and timestamps. This data is valuable for debugging and monitoring, but it currently stays within AgentV's result files.

Industry frameworks (LangWatch, Mastra, Google ADK, Azure SDK) have adopted OpenTelemetry semantic conventions for LLM observability. Langfuse is an open-source platform that accepts traces in a compatible format.

**Stakeholders:**
- AgentV users who need to debug agent behavior
- Teams integrating AgentV into existing LLMOps workflows
- Developers comparing agent configurations across runs

## Goals / Non-Goals

**Goals:**
- Export `output_messages` to Langfuse as structured traces
- Follow OpenTelemetry GenAI semantic conventions where applicable
- Provide opt-in content capture for privacy-sensitive environments
- Keep export logic decoupled from core evaluation flow

**Non-Goals:**
- Full OpenTelemetry SDK integration (deferred)
- Real-time streaming of traces during execution
- Bi-directional sync with Langfuse (import traces)
- Support for other observability platforms in this change (extensible design only)

## Decisions

### Decision 1: Use Langfuse SDK directly (not OTEL SDK)

**What:** Import `langfuse` npm package and use its native trace/span API.

**Why:**
- Langfuse SDK handles authentication, batching, and flushing automatically
- Avoids complexity of OTEL collector setup
- Direct mapping to Langfuse concepts (traces, generations, spans)
- Can add OTEL exporter later as separate capability

**Alternatives considered:**
- Full OTEL SDK + OTLP exporter: More portable but requires collector infrastructure
- Custom HTTP calls: Fragile, no batching, reinvents SDK features
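
A minimal sketch of what direct SDK usage looks like under this decision; the `Langfuse` constructor options and `flushAsync()` are assumed from the current JS SDK surface:

```typescript
import { Langfuse } from "langfuse";

// The SDK batches events in memory and sends them in the background;
// flushAsync() drains the queue on demand.
const langfuse = new Langfuse({
  publicKey: process.env.LANGFUSE_PUBLIC_KEY,
  secretKey: process.env.LANGFUSE_SECRET_KEY,
  baseUrl: process.env.LANGFUSE_HOST, // undefined falls back to Langfuse Cloud
});

await langfuse.flushAsync();
```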

### Decision 2: Map OutputMessage to Langfuse structure

**Mapping:**

| AgentV Concept | Langfuse Concept | Notes |
|----------------|------------------|-------|
| Eval case | Trace | One trace per completed eval case |
| `eval_id` | `trace.name` | Identifies the test case |
| `target` | `trace.metadata.target` | Which provider was used |
| Assistant message with content | Generation | LLM response |
| Tool call | Span (type: "tool") | Individual tool invocation |
| `score` | Score | Attached to trace |

**Langfuse Trace Structure:**
```
Trace: eval_id="case-001"
├── Generation: "assistant response"
│ ├── input: [user messages]
│ ├── output: "response text"
│ └── usage: { input_tokens, output_tokens }
├── Span: tool="search" (type: tool)
│ ├── input: { query: "..." }
│ └── output: "results..."
├── Span: tool="read_file" (type: tool)
│ └── ...
└── Score: name="eval_score", value=0.85
```
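
A conversion sketch of this mapping; the AgentV field names are illustrative, and the `trace`/`generation`/`span`/`score` calls assume the Langfuse JS client API:

```typescript
import { Langfuse } from "langfuse";

// Minimal shapes for illustration only; the real types live in @agentv/core.
interface ToolCall { id?: string; tool: string; input?: unknown; output?: unknown }
interface OutputMessage { role: string; content?: string; toolCalls?: ToolCall[] }
interface EvaluationResult { evalId: string; target: string; score: number }

function exportCase(langfuse: Langfuse, result: EvaluationResult, messages: OutputMessage[]): void {
  // Eval case -> Trace, with eval_id as the trace name.
  const trace = langfuse.trace({
    name: result.evalId,
    metadata: { target: result.target },
  });

  for (const message of messages) {
    // Assistant message with content -> Generation.
    if (message.role === "assistant" && message.content) {
      trace.generation({ name: "assistant response", output: message.content });
    }
    // Tool call -> Span; the "tool" type is carried as metadata here.
    for (const call of message.toolCalls ?? []) {
      trace.span({
        name: call.tool,
        input: call.input,
        output: call.output,
        metadata: { type: "tool", toolCallId: call.id },
      });
    }
  }

  // Evaluation score attached to the trace.
  trace.score({ name: "eval_score", value: result.score });
}
```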

### Decision 3: Attribute naming follows GenAI conventions

Use `gen_ai.*` prefixed attributes where applicable:

```typescript
// Generation attributes
'gen_ai.request.model': target.model,
'gen_ai.usage.input_tokens': usage?.input_tokens,
'gen_ai.usage.output_tokens': usage?.output_tokens,

// Tool span attributes
'gen_ai.tool.name': toolCall.tool,
'gen_ai.tool.call.id': toolCall.id,

// Trace metadata
'agentv.eval_id': evalCase.id,
'agentv.target': target.name,
'agentv.dataset': evalCase.dataset,
```

### Decision 4: Privacy-first content capture

**Default:** Do not capture message content or tool inputs/outputs.

**Opt-in:** Set `LANGFUSE_CAPTURE_CONTENT=true` to include:
- User message content
- Assistant response content
- Tool call inputs and outputs

**Rationale:** Traces may contain PII, secrets, or proprietary data. This follows the Azure SDK and Google ADK pattern of opt-in content capture.
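
A minimal redaction sketch, assuming capture is resolved once from `LANGFUSE_CAPTURE_CONTENT` and using the placeholder values from the spec:

```typescript
const captureContent = process.env.LANGFUSE_CAPTURE_CONTENT === "true";

// Message content: pass through when capture is on, placeholder otherwise.
function redactContent(content: string): string {
  return captureContent ? content : "[content hidden]";
}

// Tool inputs collapse to an empty object, outputs to a placeholder string.
function redactToolInput(input: unknown): unknown {
  return captureContent ? input : {};
}

function redactToolOutput(output: unknown): unknown {
  return captureContent ? output : "[output hidden]";
}
```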

### Decision 5: Flush strategy

**Approach:** Flush traces after each eval case completes (not batched across cases).

**Why:**
- Ensures traces are visible in Langfuse promptly
- Avoids data loss if process crashes
- Trade-off: Slightly higher network overhead (acceptable for eval workloads)

**Configuration:** No user-facing config in v1. Can add `--langfuse-batch` later if needed.
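
A sketch of the per-case loop this implies; `cases` and `runCase` are hypothetical names for the existing orchestration:

```typescript
for (const evalCase of cases) {
  const result = await runCase(evalCase);

  if (exporter) {
    // Export and drain the queue before the next case, per this decision.
    await exporter.export(result, result.outputMessages);
    await exporter.flush();
  }
}
```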

## Data Flow

```
┌─────────────────────────────────────────────────────────────────┐
│ agentv run │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────────┐ │
│ │ Provider │───▶│ Orchestrator │───▶│ EvaluationResult │ │
│ │ Response │ │ │ │ + outputMessages │ │
│ └──────────────┘ └──────────────┘ └────────┬─────────┘ │
│ │ │
│ ┌────────────────────────▼──────┐ │
│ │ LangfuseExporter │ │
│ │ (if --langfuse enabled) │ │
│ └────────────────┬──────────────┘ │
│ │ │
└────────────────────────────────────────────┼────────────────────┘
┌─────────────────┐
│ Langfuse │
│ Platform │
└─────────────────┘
```

## API Surface

### CLI

```bash
# Enable Langfuse export
agentv run eval.yaml --langfuse

# With custom host (self-hosted Langfuse)
LANGFUSE_HOST=https://langfuse.mycompany.com agentv run eval.yaml --langfuse

# With content capture
LANGFUSE_CAPTURE_CONTENT=true agentv run eval.yaml --langfuse
```

### Environment Variables

| Variable | Required | Description |
|----------|----------|-------------|
| `LANGFUSE_PUBLIC_KEY` | Yes (if --langfuse) | Langfuse project public key |
| `LANGFUSE_SECRET_KEY` | Yes (if --langfuse) | Langfuse project secret key |
| `LANGFUSE_HOST` | No | Custom Langfuse host (default: cloud) |
| `LANGFUSE_CAPTURE_CONTENT` | No | Enable content capture (default: false) |

### Programmatic API

```typescript
import { LangfuseExporter } from '@agentv/core/observability';

const exporter = new LangfuseExporter({
publicKey: process.env.LANGFUSE_PUBLIC_KEY,
secretKey: process.env.LANGFUSE_SECRET_KEY,
host: process.env.LANGFUSE_HOST,
captureContent: process.env.LANGFUSE_CAPTURE_CONTENT === 'true',
});

// Export a single result
await exporter.export(evaluationResult, outputMessages);

// Flush pending traces
await exporter.flush();
```
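
The exporter could sit behind a small interface (one possible shape of the `TraceExporter` abstraction mentioned in tasks.md), keeping the run loop independent of Langfuse:

```typescript
export interface TraceExporter {
  /** Convert one evaluation result plus its messages into platform traces. */
  export(result: EvaluationResult, messages: OutputMessage[]): Promise<void>;
  /** Drain any pending traces; called after each case and at shutdown. */
  flush(): Promise<void>;
}
```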

## Risks / Trade-offs

| Risk | Mitigation |
|------|------------|
| Langfuse SDK version churn | Pin to stable version, document upgrade path |
| Network failures during export | Log warning, don't fail evaluation; traces are optional |
| Large traces with many tool calls | Langfuse handles batching internally; monitor payload sizes |
| Content capture leaking secrets | Default to off; document clearly in CLI help |

## Migration Plan

**No migration required.** This is a new optional feature. Existing users are unaffected unless they enable `--langfuse`.

## Open Questions

1. Should we support `--langfuse-session-id` to group multiple eval runs? (Defer to user feedback)
2. Should token usage be estimated if provider doesn't return it? (Defer - not all providers report usage)
3. Should we add a `--dry-run-langfuse` to preview traces without sending? (Nice to have, not v1)
32 changes: 32 additions & 0 deletions openspec/changes/add-langfuse-export/proposal.md
# Change: Add Langfuse Export for Observability

## Why

AgentV captures rich execution traces via `output_messages` (tool calls, assistant responses, timestamps) but has no way to export this data to observability platforms. Users need to debug agent behavior, monitor performance, and integrate with existing LLMOps tooling.

Langfuse is an open-source LLM observability platform that supports OpenTelemetry-compatible trace ingestion. By exporting AgentV traces to Langfuse, users can:
- Visualize agent execution flows
- Debug tool call sequences
- Track token usage and latency across evaluations
- Compare agent behavior across different configurations

## What Changes

- **Add `langfuse` export option**: Convert `output_messages` to OpenTelemetry-compatible spans and send them to Langfuse
  - New `--langfuse` CLI flag enables export during `agentv run`
  - Supports `LANGFUSE_PUBLIC_KEY`, `LANGFUSE_SECRET_KEY`, and `LANGFUSE_HOST` environment variables
  - Maps `OutputMessage` and `ToolCall` to the Langfuse trace/span format
  - Uses `gen_ai.*` semantic conventions for LLM attributes
  - Optional content capture controlled by `LANGFUSE_CAPTURE_CONTENT` (default: false for privacy)

- **Add new `observability` capability spec**: Defines trace export behavior and provider contracts

## Impact

- Affected specs: New `observability` capability (does not modify existing specs)
- Affected code:
- `packages/core/src/observability/` (new directory)
- `packages/core/src/observability/langfuse-exporter.ts` (new file)
- `packages/core/src/observability/types.ts` (new file)
- `apps/cli/src/index.ts` (add `--langfuse` flag to run command)
- `packages/core/package.json` (add `langfuse` dependency)
104 changes: 104 additions & 0 deletions openspec/changes/add-langfuse-export/specs/observability/spec.md
# Spec: Observability Capability

## Purpose

Defines trace export functionality for sending AgentV evaluation data to external observability platforms. Enables debugging, monitoring, and analysis of agent execution through industry-standard tooling.

## ADDED Requirements

### Requirement: Langfuse Trace Export

The system SHALL support exporting evaluation traces to Langfuse when enabled via CLI flag.

#### Scenario: Export enabled with valid credentials

- **WHEN** the user runs `agentv run eval.yaml --langfuse`
- **AND** `LANGFUSE_PUBLIC_KEY` and `LANGFUSE_SECRET_KEY` environment variables are set
- **THEN** the system creates a Langfuse trace for each completed eval case
- **AND** the trace includes the `eval_id` as the trace name
- **AND** the trace includes metadata for `target`, `dataset`, and `score`

#### Scenario: Export disabled by default

- **WHEN** the user runs `agentv run eval.yaml` without `--langfuse` flag
- **THEN** no traces are sent to Langfuse
- **AND** the evaluation proceeds normally without observability overhead

#### Scenario: Missing credentials with flag enabled

- **WHEN** the user runs `agentv run eval.yaml --langfuse`
- **AND** `LANGFUSE_PUBLIC_KEY` or `LANGFUSE_SECRET_KEY` is not set
- **THEN** the system emits a warning message
- **AND** evaluation proceeds without Langfuse export
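
A non-normative sketch of this guard (function and message wording are illustrative):

```typescript
function maybeCreateExporter(): LangfuseExporter | undefined {
  const { LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY } = process.env;
  if (!LANGFUSE_PUBLIC_KEY || !LANGFUSE_SECRET_KEY) {
    // Warn and disable export instead of failing the evaluation run.
    console.warn(
      "--langfuse is set but LANGFUSE_PUBLIC_KEY/LANGFUSE_SECRET_KEY are missing; skipping export."
    );
    return undefined;
  }
  return new LangfuseExporter({
    publicKey: LANGFUSE_PUBLIC_KEY,
    secretKey: LANGFUSE_SECRET_KEY,
  });
}
```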

### Requirement: OutputMessage to Trace Mapping

The system SHALL convert `output_messages` to Langfuse-compatible trace structure.

#### Scenario: Assistant message becomes Generation

- **WHEN** an `OutputMessage` has `role: "assistant"` and `content`
- **THEN** a Langfuse Generation is created with the content as output
- **AND** the Generation includes `gen_ai.request.model` if available from target

#### Scenario: Tool call becomes Span

- **WHEN** an `OutputMessage` contains a `toolCalls` array
- **THEN** each `ToolCall` becomes a Langfuse Span with `type: "tool"`
- **AND** the Span includes `gen_ai.tool.name` attribute set to the tool name
- **AND** the Span includes `gen_ai.tool.call.id` if the tool call has an `id`

#### Scenario: Evaluation score attached to trace

- **WHEN** an `EvaluationResult` is exported
- **THEN** the trace includes a Langfuse Score with `name: "eval_score"` and `value` set to the result score
- **AND** the Score includes `comment` with the evaluation reasoning if available

### Requirement: Privacy-Controlled Content Capture

The system SHALL respect privacy settings when exporting trace content.

#### Scenario: Content capture disabled (default)

- **WHEN** `LANGFUSE_CAPTURE_CONTENT` is not set or set to `"false"`
- **THEN** message content is replaced with placeholder text `"[content hidden]"`
- **AND** tool call inputs are replaced with `{}`
- **AND** tool call outputs are replaced with `"[output hidden]"`

#### Scenario: Content capture enabled

- **WHEN** `LANGFUSE_CAPTURE_CONTENT` is set to `"true"`
- **THEN** full message content is included in Generations
- **AND** full tool call inputs and outputs are included in Spans

### Requirement: Custom Langfuse Host

The system SHALL support self-hosted Langfuse instances.

#### Scenario: Custom host configuration

- **WHEN** `LANGFUSE_HOST` environment variable is set
- **THEN** the exporter sends traces to the specified host URL
- **AND** authentication uses the same `LANGFUSE_PUBLIC_KEY` and `LANGFUSE_SECRET_KEY`

#### Scenario: Default to cloud host

- **WHEN** `LANGFUSE_HOST` is not set
- **THEN** the exporter uses the default Langfuse cloud endpoint

### Requirement: Graceful Export Failures

The system SHALL handle export errors without disrupting evaluation.

#### Scenario: Network error during export

- **WHEN** sending a trace to Langfuse fails due to network error
- **THEN** the system logs a warning with the error details
- **AND** the evaluation result is still written to the output file
- **AND** subsequent eval cases continue to attempt export

#### Scenario: Flush at evaluation end

- **WHEN** all eval cases have completed
- **THEN** the system flushes any pending traces to Langfuse
- **AND** waits for flush to complete before exiting (with timeout)
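
A non-normative sketch of the bounded flush (the timeout value is illustrative):

```typescript
async function flushWithTimeout(exporter: TraceExporter, timeoutMs = 5_000): Promise<void> {
  // Resolve either when the flush finishes or when the timeout elapses,
  // so a slow or unreachable Langfuse host cannot block process exit.
  await Promise.race([
    exporter.flush(),
    new Promise<void>((resolve) => setTimeout(resolve, timeoutMs)),
  ]);
}
```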
51 changes: 51 additions & 0 deletions openspec/changes/add-langfuse-export/tasks.md
# Tasks: Add Langfuse Export

## 1. Core Implementation

- [ ] 1.1 Create `packages/core/src/observability/` directory structure
- [ ] 1.2 Define `TraceExporter` interface in `types.ts`
- [ ] 1.3 Implement `LangfuseExporter` class with trace/span conversion
- [ ] 1.4 Add `langfuse` dependency to `packages/core/package.json`
- [ ] 1.5 Export observability module from `packages/core/src/index.ts`

## 2. OutputMessage to Langfuse Mapping

- [ ] 2.1 Implement `convertToLangfuseTrace()` function
- [ ] 2.2 Map `OutputMessage` with content to Langfuse Generation
- [ ] 2.3 Map `ToolCall` to Langfuse Span (type: tool)
- [ ] 2.4 Attach evaluation score to trace
- [ ] 2.5 Add `gen_ai.*` semantic convention attributes

## 3. Privacy Controls

- [ ] 3.1 Implement content filtering based on `LANGFUSE_CAPTURE_CONTENT`
- [ ] 3.2 Strip message content when capture disabled
- [ ] 3.3 Strip tool inputs/outputs when capture disabled
- [ ] 3.4 Document privacy behavior in code comments

## 4. CLI Integration

- [ ] 4.1 Add `--langfuse` flag to `run` command in `apps/cli/src/index.ts`
- [ ] 4.2 Validate required environment variables when flag is set
- [ ] 4.3 Initialize `LangfuseExporter` when enabled
- [ ] 4.4 Call exporter after each `EvaluationResult` is produced
- [ ] 4.5 Flush exporter after all eval cases complete

## 5. Error Handling

- [ ] 5.1 Catch and log Langfuse SDK errors without failing evaluation
- [ ] 5.2 Warn on missing credentials when `--langfuse` is used
- [ ] 5.3 Handle network timeouts gracefully

## 6. Testing

- [ ] 6.1 Unit tests for `convertToLangfuseTrace()` mapping
- [ ] 6.2 Unit tests for content filtering logic
- [ ] 6.3 Integration test with mock Langfuse server (optional)
- [ ] 6.4 Add example in `examples/` directory

## 7. Documentation

- [ ] 7.1 Add CLI help text for `--langfuse` flag
- [ ] 7.2 Document environment variables in README or docs
- [ ] 7.3 Add usage example to CLI `--help` output