feat: move agentic loop and session management to backend #46

Merged
alexinthesky merged 1 commit into main from vk/4e22-improve-code-str on Feb 14, 2026


@alexinthesky commented on Feb 12, 2026

Note

High Risk
Introduces new backend chat execution, persistence, and SSE streaming paths and changes RBAC enforcement semantics, which can impact authorization and run/session correctness across orgs and deployments.

Overview
Moves the AI chat agentic loop into the Go backend via a new /api/agent/* API, including SSE event streaming, run tracking (in-memory or Redis), cancellation, and token-aware context trimming. The backend now calls grafana-llm-app’s OpenAI-compatible endpoint using a Grafana service-account token and executes MCP tool calls with execution-time RBAC enforcement.
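
The token-aware context trimming mentioned above could look roughly like the following Go sketch. It is illustrative only: the `Message` type, the `trimToBudget` helper, and the four-characters-per-token estimate are all assumptions, since the PR description does not show the actual implementation in `pkg/agent/`.

```go
package main

import "fmt"

// Message is a hypothetical chat message type; the real type is not shown in the PR.
type Message struct {
	Role    string
	Content string
}

// estimateTokens is a rough heuristic (about 4 characters per token).
// This is an assumption, not the backend's actual tokenizer.
func estimateTokens(m Message) int {
	return len(m.Content)/4 + 1
}

// trimToBudget always keeps the leading system messages, then keeps the
// newest remaining messages that still fit in the token budget.
func trimToBudget(msgs []Message, budget int) []Message {
	var system []Message
	rest := msgs
	for len(rest) > 0 && rest[0].Role == "system" {
		budget -= estimateTokens(rest[0])
		system = append(system, rest[0])
		rest = rest[1:]
	}
	// Walk backwards from the newest message and stop at the first one
	// that no longer fits, so the kept history stays contiguous.
	start := len(rest)
	for i := len(rest) - 1; i >= 0; i-- {
		cost := estimateTokens(rest[i])
		if cost > budget {
			break
		}
		budget -= cost
		start = i
	}
	return append(system, rest[start:]...)
}

func main() {
	msgs := []Message{
		{"system", "You are a helpful assistant."},
		{"user", "first question, quite old"},
		{"assistant", "first answer, quite old"},
		{"user", "latest question"},
	}
	for _, m := range trimToBudget(msgs, 15) {
		fmt.Println(m.Role+":", m.Content)
	}
}
```

Dropping the oldest turns first keeps the most recent exchange intact, which is usually what the model needs to continue a tool-calling loop. A real implementation would likely also keep tool-call and tool-result messages paired.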

Reworks RBAC from hardcoded tool-name lists to MCP ToolAnnotations (readOnlyHint) propagated from MCP servers, and adds proxy support for tool lookup plus per-server header injection + dial timeouts. Adds backend session CRUD/current-session tracking and persists assistant/tool-call results from streamed events.
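
As a minimal sketch of annotation-based RBAC, the check below decides on the tool's `readOnlyHint` rather than on its name. The `Tool` and `ToolAnnotations` types, the `canExecute` helper, the tool names, and the policy itself (Admin and Editor may run anything, every other role only read-only tools) are assumptions for illustration; the PR description does not spell out the exact rules.

```go
package main

import "fmt"

// ToolAnnotations mirrors the MCP annotation named in the PR.
// ReadOnlyHint is a pointer because the hint is optional in MCP.
type ToolAnnotations struct {
	ReadOnlyHint *bool
}

// Tool is a hypothetical, trimmed-down tool descriptor.
type Tool struct {
	Name        string
	Annotations *ToolAnnotations
}

// canExecute applies an assumed policy: Admin and Editor may run any tool,
// every other role may run only tools explicitly marked read-only.
// A missing annotation is treated as "may write", the conservative default.
func canExecute(role string, tool Tool) bool {
	if role == "Admin" || role == "Editor" {
		return true
	}
	a := tool.Annotations
	return a != nil && a.ReadOnlyHint != nil && *a.ReadOnlyHint
}

func main() {
	readOnly := true
	query := Tool{Name: "query_metrics", Annotations: &ToolAnnotations{ReadOnlyHint: &readOnly}}
	remove := Tool{Name: "delete_dashboard"} // no annotations: treated as a write

	fmt.Println("Viewer query_metrics:", canExecute("Viewer", query))
	fmt.Println("Viewer delete_dashboard:", canExecute("Viewer", remove))
	fmt.Println("Editor delete_dashboard:", canExecute("Editor", remove))
}
```

Because the decision depends on metadata supplied by each MCP server, a newly added tool needs no change to a hardcoded list, but the backend then trusts the server to annotate its tools accurately.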

Updates local/dev/test plumbing: adds a server:full multi-org compose setup and optional Prometheus datasource provisioning env vars, switches the E2E secret from OPENAI_API_KEY to LLM_API_KEY, and serializes Playwright workers. Also adds a stop/queue UX in ChatInput to handle overlapping sends.

Written by Cursor Bugbot for commit 9d5540c. This will update automatically on new commits.

@alexinthesky changed the title from "Improve code structure and maintainability. (vibe-kanban)" to "refactor: move agentic loop and session management to backend" on Feb 12, 2026
alexinthesky added a commit that referenced this pull request Feb 13, 2026
…ition

**Issue 1: Playwright using 2 workers despite config setting workers: 1**

Root cause: Top-level `fullyParallel: true` in playwright.config.ts was
overriding project-level `workers: 1` settings for chromium-llm-tests.

Fix: Remove top-level `fullyParallel: true` to respect per-project
configuration. Projects can now properly control their own parallelism:
- chromium-session-tests: workers: 1, fullyParallel: false
- chromium-llm-tests: workers: 1, fullyParallel: false
- chromium: workers: 6, fullyParallel: true (default)

**Issue 2: Agent runs being canceled mid-stream (500 error after ~500ms)**

Root cause: When a new session is created during `sendMessage()`, the
session ID changes from null to the new ID. This triggers the reconnect
useEffect cleanup, which aborts the AbortController *while the SSE stream
is still active*, causing:
```
rpc error: code = Canceled desc = context canceled
```

Flow that caused the bug:
1. User sends message (sessionId is null)
2. Backend creates session, returns sessionId
3. Frontend calls `setCurrentSessionIdDirect(result.sessionId)`
4. sessionManager.currentSessionId changes (null → new ID)
5. Reconnect useEffect dependency triggers cleanup
6. Cleanup aborts AbortController
7. Active SSE stream in `reconnectToAgentRun` gets canceled
8. Send button never re-enables, and E2E tests time out

Fix: In reconnect effect cleanup, only abort if `!isGenerating`. This
prevents aborting the controller while a message is actively being
streamed. The abort will still happen when truly needed (user navigates
away, explicitly stops generation, or session changes while idle).

**Impact:**
- E2E tests will now properly run LLM tests with a single worker
- Agent runs will complete successfully without mid-stream cancellation
- Send button will re-enable after responses complete

**Note on LLM_API_KEY:**
The secret is correctly configured and LLM calls are succeeding. The test
failures were caused by the SSE abort race condition, not by a missing API key.

Fixes #46

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
alexinthesky added a commit that referenced this pull request Feb 13, 2026
- Fix double close on channel in RunBroadcaster unsubscribe
- Fix broadcaster race condition in AppendEvent
- Replace hardcoded rgba() colors with semantic Tailwind classes
- Remove dead code for cookie-based LLM auth (Grafana 12 strips cookies)
- Add unit tests for ReasoningIndicator component
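
One common way to avoid a double close on unsubscribe is to close a subscriber channel only while it is still registered, under the same mutex that guards publishing. The sketch below illustrates that pattern under stated assumptions: the `Broadcaster` type and its methods are hypothetical and are not the PR's actual `RunBroadcaster` code.

```go
package main

import (
	"fmt"
	"sync"
)

// Broadcaster is a hypothetical fan-out helper, not the PR's RunBroadcaster.
type Broadcaster struct {
	mu   sync.Mutex
	subs map[int]chan string
	next int
}

func NewBroadcaster() *Broadcaster {
	return &Broadcaster{subs: make(map[int]chan string)}
}

// Subscribe returns a channel and an unsubscribe func that is safe to call twice.
func (b *Broadcaster) Subscribe() (<-chan string, func()) {
	b.mu.Lock()
	defer b.mu.Unlock()
	id := b.next
	b.next++
	ch := make(chan string, 16)
	b.subs[id] = ch
	return ch, func() { b.remove(id) }
}

// remove closes the channel only if it is still registered, so a second
// unsubscribe (or an unsubscribe after Close) is a no-op instead of a panic.
func (b *Broadcaster) remove(id int) {
	b.mu.Lock()
	defer b.mu.Unlock()
	if ch, ok := b.subs[id]; ok {
		delete(b.subs, id)
		close(ch)
	}
}

// Publish sends while holding the lock, so it can never race with a close.
// The send is non-blocking: a subscriber with a full buffer misses the event.
func (b *Broadcaster) Publish(ev string) {
	b.mu.Lock()
	defer b.mu.Unlock()
	for _, ch := range b.subs {
		select {
		case ch <- ev:
		default:
		}
	}
}

// Close closes every subscriber channel that is still registered.
func (b *Broadcaster) Close() {
	b.mu.Lock()
	defer b.mu.Unlock()
	for id, ch := range b.subs {
		delete(b.subs, id)
		close(ch)
	}
}

func main() {
	b := NewBroadcaster()
	ch, unsubscribe := b.Subscribe()
	b.Publish("token")
	unsubscribe()
	unsubscribe() // second call is harmless
	b.Close()     // channel already removed, nothing to close again
	for ev := range ch {
		fmt.Println("received:", ev)
	}
}
```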

All tests passing:
- Frontend unit tests: 552 passed
- Backend Go tests: all passed
- TypeScript type checking: passed
- ESLint: passed (warnings in unmodified code only)

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
alexinthesky added a commit that referenced this pull request Feb 13, 2026
Fixes two medium-severity issues identified by Cursor Bugbot:

1. Cancelled runs with no events marked as failed
   - Check if run is still cancellable when no events produced
   - Set status to "cancelled" if run was cancelled early
   - Prevents status mismatch between endpoint response and stored status

2. Redis SubscribeAndSnapshot duplicate-event race condition
   - Add sequence numbers to SSEEvent for ordering
   - Implement atomic sequence tracking in RunStore (in-memory counter)
   - Implement atomic sequence tracking in RedisRunStore (Redis INCR)
   - Add client-side deduplication in agentClient (skip sequence <= lastSeen)
   - Prevents duplicate events when reconnecting to in-progress runs
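
The sequence-number rule in item 2 can be sketched as follows. In the PR the deduplication lives in the frontend `agentClient`; this Go version only illustrates the same "skip sequence <= lastSeen" logic next to an in-memory atomic counter. The `seqCounter` and `dedupe` names and the `SSEEvent` fields are assumptions.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// SSEEvent carries a sequence number for ordering; the field names are assumed.
type SSEEvent struct {
	Seq  int64
	Data string
}

// seqCounter is an in-memory counter. Per the commit message, a Redis-backed
// store would use INCR for the same purpose.
type seqCounter struct {
	n int64
}

// next atomically returns the next sequence number, starting at 1.
func (c *seqCounter) next() int64 {
	return atomic.AddInt64(&c.n, 1)
}

// dedupe drops every event whose Seq is <= lastSeen and returns the
// surviving events together with the updated lastSeen value.
func dedupe(events []SSEEvent, lastSeen int64) ([]SSEEvent, int64) {
	var out []SSEEvent
	for _, ev := range events {
		if ev.Seq <= lastSeen {
			continue // already delivered, for example via the snapshot
		}
		out = append(out, ev)
		lastSeen = ev.Seq
	}
	return out, lastSeen
}

func main() {
	var c seqCounter
	snapshot := []SSEEvent{{c.next(), "start"}, {c.next(), "token"}}
	// The live stream overlaps the snapshot by one event on reconnect.
	live := []SSEEvent{snapshot[1], {c.next(), "done"}}

	replayed, last := dedupe(snapshot, 0)
	fresh, last := dedupe(live, last)
	fmt.Println(len(replayed), len(fresh), last)
}
```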

Low-severity issues deferred to follow-up issue #48:
- Empty assistant message on cancelled runs
- GetBroadcaster dead code removal

All tests passing:
- Frontend unit tests: 552 passed
- Backend Go tests: all passed
- TypeScript type checking: passed
- ESLint: passed (warnings in unmodified code only)

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
… test consolidation

Move the agentic loop from the frontend to a Go backend implementation with
SSE streaming, detached execution (survives browser tab close), and automatic
reconnection. Includes multi-org support, RBAC enforcement at tool execution,
token-aware context window management, and session persistence.

Key changes:
- Backend agentic loop (pkg/agent/) with LLM client, tool execution, context window
- Detached execution with atomic subscribe+snapshot for reconnection
- Session management moved from frontend to backend (in-memory + Redis stores)
- Frontend simplified to thin HTTP client (backendSessionClient.ts)
- Reasoning/thinking content dropped when tool calls are present
- E2E tests consolidated from 17 files to 8 with all 36 tests passing (0 skipped)
- All LLM-dependent tests use consistent wait patterns and fast prompts

Co-Authored-By: Claude Opus 4.6 <[email protected]>
@alexinthesky force-pushed the vk/4e22-improve-code-str branch from 26bbfd5 to 9d5540c on February 13, 2026, 16:48
cursor[bot]

This comment was marked as off-topic.

@alexinthesky changed the title from "refactor: move agentic loop and session management to backend" to "feat: move agentic loop and session management to backend" on Feb 14, 2026
@alexinthesky merged commit c5fbe29 into main on Feb 14, 2026
16 checks passed
@alexinthesky deleted the vk/4e22-improve-code-str branch on February 14, 2026, 11:13