Skip to content

feat(ui): KG-scoped data source onboarding (k-extract flow)#737

Open
aredenba-rh wants to merge 79 commits into
mainfrom
feature/manage-knowledge-graph
Open

feat(ui): KG-scoped data source onboarding (k-extract flow)#737
aredenba-rh wants to merge 79 commits into
mainfrom
feature/manage-knowledge-graph

Conversation

@aredenba-rh
Copy link
Copy Markdown
Collaborator

Summary

  • Adds full-page data source onboarding at /knowledge-graphs/{kgId}/data-sources/new (URLs → configure → sequential initial sync → summary), modeled after k-extract designer/new.
  • Adds ongoing operations page at /knowledge-graphs/{kgId}/data-sources (phase1 equivalent) for sync, commits, diff, and maintenance focus.
  • KG manage workspace routes Data Sources to onboarding when dataSourceCount === 0, otherwise to the operations page.
  • Post–KG-create toast navigates to the new onboarding route.

Closes #736

Test plan

  • Create a KG → Manage → Data Sources → lands on /data-sources/new
  • Add GitHub URL(s), configure branch/token, connect → run Start initial sync → see progress and summary
  • Open data sources → operations page with cards, sync history, commit refs
  • Return to manage → Data Sources again → operations page (not wizard)
  • Maintain step → ?focus=maintain filters to maintenance-ready sources
  • Global sidebar /data-sources unchanged

Made with Cursor

aredenba-rh and others added 30 commits May 26, 2026 12:58
* chore(skills): add subagent delivery execution protocol

Add a reusable subagent skill that standardizes issue-based branching,
TDD execution, PR structure, and merge/conflict handling into
feature/manage-knowledge-graph.

Co-authored-by: Cursor <cursoragent@cursor.com>

* feat(management): add knowledge graph workspace mode lifecycle

Implement schema_bootstrap as the default workspace mode and persist
irreversible transition state to extraction_operations across domain,
repository, API responses, and migration coverage.

Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
…681)

Add a workspace-status API projection with mode, readiness flags,
transition eligibility, and session pointers, including service and
route authorization coverage for manage workspace rendering.

Co-authored-by: Cursor <cursoragent@cursor.com>
…#682)

Enforce workspace readiness checks for minimum entity/relationship type
coverage and prepopulated type instance presence, and project blocking
reasons so validate/transition workflows can render actionable feedback.

Co-authored-by: Cursor <cursoragent@cursor.com>
Expose authorized validate and transition commands for knowledge graph
workspaces, persist session pointers, and create an extraction-mode
session identifier when moving from bootstrap to extraction operations.

Co-authored-by: Cursor <cursoragent@cursor.com>
Add durable run-level mutation metadata storage and lifecycle persistence
for session/scope identity, timestamps, token-cost totals, and
operation-count summaries linked to each sync run.

Co-authored-by: Cursor <cursoragent@cursor.com>
Emit operation-class counts and token/cost totals from mutation-log
application results into MutationsApplied payloads so downstream sync
lifecycle persistence can finalize run-level metadata.

Co-authored-by: Cursor <cursoragent@cursor.com>
#686)

Scaffold extraction application/presentation package structure and add
pytest-archon rules enforcing DDD layer boundaries plus cross-context
isolation so subsequent extraction features stay architecturally clean.

Co-authored-by: Cursor <cursoragent@cursor.com>
Implement per-user/per-knowledge-graph/per-mode extraction session
lifecycle behaviors with clear-chat reset semantics and archived-session
retention backed by repository ports and unit coverage.

Co-authored-by: Cursor <cursoragent@cursor.com>
Resolve mode-specific extraction skill templates from global defaults and
apply deterministic knowledge-graph override merges so session prompts are
stable, customizable, and repeatable.

Co-authored-by: Cursor <cursoragent@cursor.com>
)

Persist extraction agent sessions and expose scoped APIs for active/list/clear-chat so reset creates a fresh session while preserving archived history and runtime context audit records.

Co-authored-by: Cursor <cursoragent@cursor.com>
Persist clone-head, last-extraction baseline, and tracked-branch head
commit references for data sources and expose them in management API
responses for downstream ingestion and UI commit-status workflows.

Co-authored-by: Cursor <cursoragent@cursor.com>
Prepare Git-backed ingestion context by loading data-source commit references,
refreshing tracked branch head, and passing baseline commit plus resolved
credentials into the ingestion pipeline before packaging begins.

Co-authored-by: Cursor <cursoragent@cursor.com>
# Conflicts:
#	src/api/ingestion/application/services/ingestion_service.py
#	src/api/ingestion/infrastructure/event_handler.py
#	src/api/ingestion/ports/services.py
#	src/api/tests/unit/ingestion/infrastructure/test_ingestion_event_handler.py
Skip heavy extraction when tracked branch head equals the last extraction
baseline by emitting a completed lifecycle event and recording an explicit
no-change audit log entry on the sync run.

Co-authored-by: Cursor <cursoragent@cursor.com>
Expose a data-source diff summary API that compares the last extraction
baseline to tracked branch head and returns aggregate counts plus a
large-list-safe changed-file preview for maintenance decisions.

Co-authored-by: Cursor <cursoragent@cursor.com>
Show commit-based diff counts immediately on each data source card and
render the changed-file list as collapsed-by-default with explicit
expand/collapse controls for large-diff safe browsing.

Co-authored-by: Cursor <cursoragent@cursor.com>
…695)

Add explicit data-source actions to refresh tracked/clone commit references and adopt tracked head as the current extraction baseline. This lets the UI surface per-source changed-file counts with user-controlled commit context updates for maintenance decisioning.

Co-authored-by: Cursor <cursoragent@cursor.com>
Strengthen subagent delivery guidance with a parallel execution model, required context packs, and a blocker-question escalation flow so multiple agents can pause and ask focused questions without serializing delivery.

Co-authored-by: Cursor <cursoragent@cursor.com>
…678) (#697)

Add structured mode-specific agent configuration (system prompt, hierarchy, guardrails, and skill pack defaults) and wire session initialization to resolve and persist the configuration per knowledge graph scope.

Co-authored-by: Cursor <cursoragent@cursor.com>
) (#698)

Seed schema bootstrap sessions with a capabilities-intake prompt that offers first-pass or guided co-design paths, and persist the selected path/capability summary in session runtime context so the conversation remains continuous across requests.

Co-authored-by: Cursor <cursoragent@cursor.com>
…679) (#699)

Build a filesystem runtime context for extraction workloads by materializing ingestion package resources, reconstructing repository files, and exposing a deterministic skills directory path; wire it through extraction event handling and local/deployed container configuration.

Co-authored-by: Cursor <cursoragent@cursor.com>
#700)

Enhance schema browser rows to display prepopulated type indicators and live per-type instance counts with lazy query-backed loading, while extending shared type contracts and tests to cover the new inspector metadata behavior.

Co-authored-by: Cursor <cursoragent@cursor.com>
…671) (#701)

Add manage-authorized run-control operations (start, pause, halt, reset_running, reset_failed, reset_completed, reset_all) over data source sync runs, expose them via dedicated management routes, and verify behavior with unit tests for both service transitions and HTTP contract responses.

Co-authored-by: Cursor <cursoragent@cursor.com>
Expose sync-run token/cost metadata in management API responses and add an extraction telemetry dashboard in the data-sources workspace with active worker counts, status buckets, recent job events, and 24h cost trend indicators backed by auto-refreshing sync data.

Co-authored-by: Cursor <cursoragent@cursor.com>
Add knowledge-graph scoped maintenance schedule APIs with timezone-aware cron evaluation and persisted run outcomes, then expose the controls and history in the data-sources operations UI.

Co-authored-by: Cursor <cursoragent@cursor.com>
…704)

Extend the mutations console with a conversation-assisted draft flow and live entity/relationship inspector that highlights edited fields during the active session and resets highlights after apply/refresh.

Co-authored-by: Cursor <cursoragent@cursor.com>
Replace legacy row actions with Manage, Query, and Delete, remove inline edit controls from the list surface, and align structural tests to the new action contract.

Co-authored-by: Cursor <cursoragent@cursor.com>
)

Implement a dedicated manage workspace route that loads workspace status projection, shows readiness and session pointers, and provides Validate and transition-to-extraction controls.

Co-authored-by: Cursor <cursoragent@cursor.com>
Extend the manage workspace page with an always-visible extraction conversation panel, clear-chat reset action, and a tabbed lower operations area for extraction jobs, manual mutations, and run/log navigation.

Co-authored-by: Cursor <cursoragent@cursor.com>
Remove the local get_management_settings import that caused UnboundLocalError
and left sync runs stuck in ingesting after lifecycle updated.

Co-authored-by: Cursor <cursoragent@cursor.com>
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 20

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
deploy/apps/kartograph/base/api-deployment.yaml (1)

167-182: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add a startupProbe for the API container.

The deployment has liveness/readiness probes but no startup probe, which can cause premature restarts during cold start or migration-heavy boots.

Suggested patch
           readinessProbe:
             httpGet:
               path: /health
               port: 8000
             initialDelaySeconds: 5
             periodSeconds: 10
             timeoutSeconds: 5
             failureThreshold: 3
+          startupProbe:
+            httpGet:
+              path: /health
+              port: 8000
+            periodSeconds: 10
+            timeoutSeconds: 5
+            failureThreshold: 30

As per coding guidelines "deploy/**: Health probes (liveness, readiness, startup) configured (per health-endpoints.md)".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@deploy/apps/kartograph/base/api-deployment.yaml` around lines 167 - 182, Add
a startupProbe block for the API container alongside the existing livenessProbe
and readinessProbe to avoid premature restarts during cold starts; locate the
container spec that currently defines livenessProbe and readinessProbe and add a
startupProbe (httpGet path: /health, port: 8000) with higher initialDelaySeconds
(e.g., 60), longer periodSeconds and timeoutSeconds appropriate for cold starts,
and a reasonable failureThreshold so Kubernetes waits for the app to become
ready before applying liveness checks.
src/dev-ui/app/pages/data-sources/index.vue (1)

675-721: 🧹 Nitpick | 🔵 Trivial | 💤 Low value

N+1 API call pattern in data loading.

loadDataSources performs sequential API calls: one for all KGs, then for each KG fetches data sources, then for each data source fetches sync runs and diff summary. With many data sources this becomes slow.

Consider adding a backend endpoint that returns data sources with their sync runs and diff summaries in a single request, or use Promise.all for parallel fetching within each KG.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/dev-ui/app/pages/data-sources/index.vue` around lines 675 - 721,
loadDataSources suffers from an N+1 API pattern: it sequentially fetches
knowledge graphs (kgs), then per-KG data sources, and then per-data-source
sync_runs and diff_summary (uses apiFetch, kgs, sources, ds), making it slow at
scale; fix it by parallelizing network calls (use Promise.all to fetch all KG
data-sources in parallel and within each KG use Promise.all to fetch each
ds.sync_runs and ds.diff_summary concurrently) or, better, add a backend
endpoint that returns data sources with sync_runs and diff_summary in one
response and update loadDataSources to call that endpoint and assign
dataSources.value accordingly.
♻️ Duplicate comments (14)
src/api/management/domain/events/data_source.py (1)

103-103: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Constrain pipeline_mode at the event boundary.

pipeline_mode: str permits invalid values and weakens the sync lifecycle contract. Use a constrained domain type (Literal["full", "ingest_only"] or the domain value object) and validate on construction.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/api/management/domain/events/data_source.py` at line 103, Change the
loose pipeline_mode: str to a constrained domain type and validate at
construction: replace the annotation pipeline_mode: str with pipeline_mode:
Literal["full", "ingest_only"] (import Literal from typing) or use your domain
value object type, and add validation in the constructor/validator (e.g.,
__post_init__ for dataclasses or a pydantic validator) to reject invalid values;
ensure references to pipeline_mode across the class/module use the new type.
src/api/infrastructure/canonical_schema/ontology_projection.py (1)

22-46: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Handle unexpected entity_type values explicitly.

Rows with unknown entity_type are currently ignored, which can silently drop schema definitions and hide data corruption.

Suggested fix
     for stored in stored_definitions:
         if stored.entity_type == "node":
             node_types.append(
                 NodeTypeDefinition(
@@
         elif stored.entity_type == "edge":
             edge_types.append(
                 EdgeTypeDefinition(
@@
                 )
             )
+        else:
+            raise ValueError(
+                f"Unexpected entity_type '{stored.entity_type}' for label '{stored.label}'"
+            )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/api/infrastructure/canonical_schema/ontology_projection.py` around lines
22 - 46, The loop over stored_definitions currently only handles "node" and
"edge" and silently skips unknown stored.entity_type values; update the loop in
ontology_projection.py (the block that builds NodeTypeDefinition and
EdgeTypeDefinition) to add an explicit else branch that raises a clear exception
(e.g., ValueError) or logs and raises, including the unexpected
stored.entity_type and identifying info like stored.label (and/or stored.id) so
schema corruption is surfaced immediately rather than silently ignored. Ensure
the error message references stored.entity_type, stored.label and that the new
branch is used in the same loop that constructs NodeTypeDefinition and
EdgeTypeDefinition.
src/api/infrastructure/migrations/versions/f5b6c7d8e9f0_add_workspace_session_pointer_columns.py (1)

24-46: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Add referential integrity for the new session pointer columns.

These persisted relationship pointers are unconstrained, so dangling IDs can accumulate and degrade workspace status correctness. Add FKs (or a follow-up migration in this PR if table-ordering requires it) and drop them before column drops in downgrade.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@src/api/infrastructure/migrations/versions/f5b6c7d8e9f0_add_workspace_session_pointer_columns.py`
around lines 24 - 46, The new session pointer columns on knowledge_graphs
(active_schema_bootstrap_session_id, active_extraction_operations_session_id,
most_recent_completed_session_id) lack foreign key constraints; update the
upgrade() to add FK constraints referencing the sessions table (or create a
separate migration if ordering requires) and update downgrade() to drop those
FKs before dropping the columns; specifically modify upgrade() to call
op.create_foreign_key(...) for each column (naming keys like
fk_knowledge_graphs_active_schema_bootstrap_session_id) and modify downgrade()
to call op.drop_constraint(...) for each FK prior to op.drop_column(...).
compose.dev.yaml (1)

8-9: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Make Docker-engine access opt-in instead of the default dev mode.

Line 8 and Line 15 force privileged host Docker access for every developer run. Keep these settings in an explicit opt-in profile/override to reduce accidental privilege exposure.

Suggested change in this file
     environment:
       UV_CACHE_DIR: /tmp/uv-cache
-      KARTOGRAPH_EXTRACTION_RUNTIME_BACKEND: container
-      KARTOGRAPH_EXTRACTION_RUNTIME_CONTAINER_ENGINE: auto
@@
-      # Allow API process to launch sibling extraction runtime containers locally
-      - /var/run/docker.sock:/var/run/docker.sock

Also applies to: 14-15

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@compose.dev.yaml` around lines 8 - 9, The compose file currently sets
KARTOGRAPH_EXTRACTION_RUNTIME_BACKEND=container and
KARTOGRAPH_EXTRACTION_RUNTIME_CONTAINER_ENGINE=auto by default; change this so
Docker/host-engine access is opt-in by removing these variables from the main
service environment and placing them into an explicit opt-in profile or override
(e.g., a "docker-host" compose profile or separate override file) that
developers can enable when needed; update the service to reference that
profile/override and document the opt-in usage so default dev runs no longer
grant privileged host Docker access, and ensure the variables
KARTOGRAPH_EXTRACTION_RUNTIME_BACKEND and
KARTOGRAPH_EXTRACTION_RUNTIME_CONTAINER_ENGINE are only defined in that opt-in
scope.
src/api/main.py (1)

190-196: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Guard GitHub tracked-head lookup failures to avoid aborting ingestion.

raise_for_status() is still unhandled here, so transient GitHub/network failures can fail the whole sync instead of degrading to “no tracked head enrichment”. This should return None on request/status/parse errors and continue ingestion.

Suggested fix
+from urllib.parse import quote
...
-        url = f"https://api.github.com/repos/{owner}/{repo}/branches/{branch}"
-        async with httpx.AsyncClient(timeout=20.0) as client:
-            response = await client.get(url, headers=headers)
-            response.raise_for_status()
-            payload = response.json()
+        url = f"https://api.github.com/repos/{owner}/{repo}/branches/{quote(branch, safe='')}"
+        try:
+            async with httpx.AsyncClient(timeout=20.0) as client:
+                response = await client.get(url, headers=headers)
+                response.raise_for_status()
+                payload = response.json()
+        except (httpx.RequestError, httpx.HTTPStatusError, ValueError):
+            return None
         sha = payload.get("commit", {}).get("sha")
         return str(sha) if sha else None
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/api/main.py` around lines 190 - 196, The branch SHA lookup code that
builds url and calls httpx.AsyncClient (the block using url, client.get,
response.raise_for_status, payload and sha) must be made resilient: wrap the
async request, raise_for_status call and payload parsing in a try/except that
catches httpx.RequestError, httpx.HTTPStatusError and ValueError (or json parse
errors) and returns None on any of those failures so transient/network/GitHub
errors don't abort ingestion; keep the existing logic that extracts sha from
payload.get("commit", {}).get("sha") and return str(sha) if present, otherwise
None.
src/api/extraction/application/agent_session_service.py (1)

58-97: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Make active-session creation atomic to prevent duplicate active sessions.

This is still a TOCTOU path: two concurrent requests can both pass Line 58 and each create/save a new active session for the same (user_id, knowledge_graph_id, mode) scope.

Enforce uniqueness in storage (active-session unique key/partial unique index) and handle save conflicts by re-reading and returning the existing active session.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/api/extraction/application/agent_session_service.py` around lines 58 -
97, The current create flow (uses _repository.find_active_by_scope and then
_repository.save of ExtractionAgentSession) has a TOCTOU race that can create
duplicate active sessions; add a durable uniqueness constraint in storage (a
partial/unique index on active sessions for the (user_id, knowledge_graph_id,
mode) scope) and change the save path to handle conflicts: when saving the new
ExtractionAgentSession if _repository.save raises a
unique-constraint/duplicate-key error, catch that error, re-call
_repository.find_active_by_scope(user_id, knowledge_graph_id, mode) and return
the found existing session instead of propagating the error; keep the existing
runtime_context/bootstrap logic unchanged and ensure the repository layer
surfaces a clear unique-constraint exception to catch.
src/api/tests/unit/ingestion/infrastructure/test_ingestion_event_handler.py (1)

317-335: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Replace PAT-shaped test literals to avoid secret-scan failures.

Line 317, Line 329, and Line 335 still use a GitHub PAT-shaped string (ghp_...). This can trip secret scanners in CI even in test code.

Proposed minimal fix
-                    "github auth failed for token ghp_1234567890abcdef1234567890abcdef1234"
+                    "github auth failed for token github_token_test_value_redacted"
@@
-                "token": "ghp_1234567890abcdef1234567890abcdef1234"
+                "token": "github_token_test_value_redacted"
@@
-        assert "ghp_1234567890abcdef1234567890abcdef1234" not in event["payload"][
+        assert "github_token_test_value_redacted" not in event["payload"][
             "error"
         ]
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/api/tests/unit/ingestion/infrastructure/test_ingestion_event_handler.py`
around lines 317 - 335, Tests contain GitHub PAT-shaped literals (e.g.,
"ghp_...") which trigger secret scanners; update the test to use a non-PAT
placeholder token (e.g., "token-placeholder" or "ghp_PLACEHOLDER") wherever the
PAT appears: in the runtime_credentials passed to IngestionEventHandler.handle,
in the expected error string from _LeakyService, and in assertions that inspect
outbox.appended/event payloads; ensure references to IngestionEventHandler,
_LeakyService, _sync_started_payload, handler.handle, and outbox.appended/event
remain unchanged so behavior is preserved while eliminating real-looking secret
patterns.
src/dev-ui/app/pages/graph/mutations.vue (1)

181-190: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Prevent stale inspector state from overlapping async loads.

loadLiveInspector() can overlap (KG change, submission success reload, manual refresh). Older requests can resolve later and overwrite newer KG state.

Proposed fix
+const latestInspectorLoadId = ref(0)
+
 async function loadLiveInspector() {
+  const loadId = ++latestInspectorLoadId.value
   if (!selectedKnowledgeGraphId.value) {
     inspectorEntities.value = []
     inspectorRelationships.value = []
+    inspectorError.value = null
     return
   }
@@
-    inspectorEntities.value = entitiesResult.rows.map((row) => ({
+    if (loadId !== latestInspectorLoadId.value) return
+    inspectorEntities.value = entitiesResult.rows.map((row) => ({
@@
-    inspectorRelationships.value = relationshipsResult.rows.map((row) => ({
+    inspectorRelationships.value = relationshipsResult.rows.map((row) => ({
@@
   } catch (err) {
+    if (loadId !== latestInspectorLoadId.value) return
     inspectorError.value = err instanceof Error ? err.message : 'Failed to load graph inspector'
   } finally {
-    inspectorLoading.value = false
+    if (loadId === latestInspectorLoadId.value) {
+      inspectorLoading.value = false
+    }
   }
 }

Also applies to: 418-467

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/dev-ui/app/pages/graph/mutations.vue` around lines 181 - 190, The async
loadLiveInspector calls can overlap and older requests may overwrite newer
state; make loadLiveInspector cancel-safe or sequencing-safe by introducing a
request token/sequence counter that increments before each invocation (from the
watchers on selectedKnowledgeGraphId and the submission.status watcher and any
manual refresh), pass/close over that token into the async workflow inside
loadLiveInspector, and before applying any results (state updates, DOM changes,
calls like resetSessionEditHighlights) verify the token matches the latest;
alternatively support an AbortSignal that the watchers set/abort so only the
latest load applies—update loadLiveInspector, its callers (the watchers shown
around loadLiveInspector and any manual refresh handlers), and any related code
paths in the other affected block (the section referenced at 418-467) to use
this token/abort mechanism.
src/api/management/application/services/data_source_service.py (1)

607-669: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Validate pipeline_mode before emitting request_sync.

Line 608 accepts any string. Invalid values can silently flow downstream and break sync lifecycle behavior.

Proposed fix
     async def trigger_sync(
         self,
         user_id: str,
         ds_id: str,
         *,
         pipeline_mode: str = "full",
     ) -> DataSourceSyncRun:
+        allowed_pipeline_modes = {"full", "ingest_only"}
+        if pipeline_mode not in allowed_pipeline_modes:
+            raise ValueError(
+                f"Unsupported pipeline_mode '{pipeline_mode}'. "
+                "Expected one of: full, ingest_only"
+            )
+
         has_manage = await self._check_permission(
             user_id=user_id,
             resource_type=ResourceType.DATA_SOURCE,
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/api/management/application/services/data_source_service.py` around lines
607 - 669, The pipeline_mode parameter is not validated before calling
ds.request_sync, allowing arbitrary strings to propagate; add validation in the
trigger method (the function accepting pipeline_mode: str) to only allow the
known modes (e.g., "full" and "ingest_only") and raise a ValueError with a clear
message for invalid values; perform this check immediately before constructing
the DataSourceSyncRun / before calling ds.request_sync so ds.request_sync and
downstream lifecycle handlers always receive a valid pipeline_mode.
src/dev-ui/app/pages/knowledge-graphs/[kgId]/data-sources/index.vue (3)

142-157: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Parallelize per-source enrichment to cut page latency.

The current loop fetches sync-runs and diff-summary serially per source, so total load time scales linearly with source count.

As per coding guidelines **: Focus on major issues impacting performance, readability, maintainability and security.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/dev-ui/app/pages/knowledge-graphs/`[kgId]/data-sources/index.vue around
lines 142 - 157, The per-source enrichment loop (for (const ds of sources)) is
performing serial network calls via apiFetch for ds.sync_runs and
ds.diff_summary, causing high latency; change to run both fetches concurrently
per source (e.g., create two promises for
`/management/data-sources/${ds.id}/sync-runs` and
`/management/data-sources/${ds.id}/diff-summary` and await them together), and
then run enrichment across all sources in parallel using Promise.all over
sources so each source's pair of fetches executes concurrently; ensure errors
are handled per-promise so failed sync_runs become [] and failed diff_summary
become null while preserving assignment to ds.sync_runs and ds.diff_summary.

334-338: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Reload page state when kgId route param changes.

State is refreshed on tenant change, but not on param-only KG navigation, which can leave stale data from the previous KG.

As per coding guidelines **: Focus on major issues impacting performance, readability, maintainability and security.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/dev-ui/app/pages/knowledge-graphs/`[kgId]/data-sources/index.vue around
lines 334 - 338, The watcher currently triggers reloads only when tenantVersion
changes (watch(tenantVersion, ...)) so navigating between knowledge graphs by
route param (kgId) leaves stale state; update the logic to also watch the route
param (kgId) or the route object and invoke the same sequence (clear
dataSources.value, await loadKnowledgeGraph(), await ensureEntryRoute()) when
kgId changes—modify the watcher or add a separate watch referencing the
route/route.params.kgId so loadKnowledgeGraph() and ensureEntryRoute() run on
param-only navigation.

114-121: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Prevent overlapping polling executions.

setInterval with an async body can enqueue a new tick before the prior loadDataSources() finishes, causing concurrent refreshes and state races.

Suggested fix
 const pollInterval = ref<ReturnType<typeof setInterval> | null>(null)
+const pollInFlight = ref(false)

 function startPolling() {
   if (pollInterval.value !== null) return
   pollInterval.value = setInterval(async () => {
-    await loadDataSources()
-    if (!hasAnyActiveSync(dataSources.value)) {
-      stopPolling()
+    if (pollInFlight.value) return
+    pollInFlight.value = true
+    try {
+      await loadDataSources()
+      if (!hasAnyActiveSync(dataSources.value)) {
+        stopPolling()
+      }
+    } finally {
+      pollInFlight.value = false
     }
   }, 3000)
 }

As per coding guidelines **: Focus on major issues impacting performance, readability, maintainability and security.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/dev-ui/app/pages/knowledge-graphs/`[kgId]/data-sources/index.vue around
lines 114 - 121, The current startPolling implementation uses setInterval with
an async callback which can start a new tick before the previous
loadDataSources() completes, causing overlapping executions and race conditions;
change to a self-scheduling pattern (or guard) so the next poll is scheduled
only after loadDataSources() finishes: in startPolling, replace the setInterval
approach by either (a) using an async loop that awaits loadDataSources() then
awaits a timeout before looping, or (b) keep the timer but add an in-flight
boolean flag (e.g., isPollingInFlight) checked at the top of the callback to
skip if already running; ensure pollInterval (the timer id) and stopPolling
still clear the timer and reset flags, and keep the existing check that calls
stopPolling when hasAnyActiveSync(dataSources.value) is false.
src/api/management/domain/value_objects.py (1)

436-437: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Parse prepopulated explicitly instead of using bool(...) coercion.

bool(data.get("prepopulated", False)) will treat "false" as True, which can corrupt readiness/transition logic derived from ontology flags.

Suggested fix
+def _parse_bool(value: Any) -> bool:
+    if isinstance(value, bool):
+        return value
+    if isinstance(value, str):
+        return value.strip().lower() in {"true", "1", "yes", "y"}
+    if isinstance(value, (int, float)):
+        return value != 0
+    return False
+
 `@classmethod`
 def from_dict(cls, data: dict[str, Any]) -> NodeTypeDefinition:
     """Deserialize from a plain dict."""
     return cls(
         label=data["label"],
         description=data.get("description", ""),
         required_properties=tuple(data.get("required_properties", [])),
         optional_properties=tuple(data.get("optional_properties", [])),
-        prepopulated=bool(data.get("prepopulated", False)),
+        prepopulated=_parse_bool(data.get("prepopulated", False)),
         prepopulated_instance_count=int(data.get("prepopulated_instance_count", 0)),
     )
#!/bin/bash
set -euo pipefail

rg -n "prepopulated=bool\\(data.get\\(\"prepopulated\"" src/api/management/domain/value_objects.py

python3 - <<'PY'
print("bool('false') =>", bool("false"))  # demonstrates coercion hazard
PY

As per coding guidelines **: Focus on major issues impacting performance, readability, maintainability and security.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/api/management/domain/value_objects.py` around lines 436 - 437, The
assignment using prepopulated=bool(data.get("prepopulated", False)) mis-parses
string values like "false" as True; update the parsing in the
constructor/serializer that sets prepopulated (and keep
prepopulated_instance_count handling) to explicitly handle booleans and common
string forms: if the incoming value is a bool use it directly, if it's a string
compare lowercased values ("true","1","yes") => True and ("false","0","no","")
=> False, otherwise fall back to the default False; locate the code that creates
the value object (the place setting prepopulated and prepopulated_instance_count
in value_objects.py) and replace the bool(...) coercion with this explicit
parse-to-bool logic while preserving the int conversion for
prepopulated_instance_count.
src/api/management/application/services/knowledge_graph_service.py (1)

274-308: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Make maintenance launch atomic to avoid partial sync starts.

If an exception occurs mid-loop, earlier DataSourceSyncRun/DataSource writes still commit at Line 308, but the run is recorded as LAUNCH_FAILED. This leaves partial launches persisted with contradictory run history.

Suggested fix
-        try:
+        try:
             for data_source in changed_sources:
                 ...
                 await self._sync_run_repo.save(sync_run)
                 data_source.request_sync(sync_run_id=sync_run_id, requested_by=user_id)
                 await self._ds_repo.save(data_source)
-
             run = KnowledgeGraphMaintenanceRunRecord(...)
-        except Exception as exc:
+            kg.append_maintenance_run(run)
+            await self._kg_repo.save(kg)
+            await self._session.commit()
+            return run
+        except Exception as exc:
+            await self._session.rollback()
             run = KnowledgeGraphMaintenanceRunRecord(...)
-
-        kg.append_maintenance_run(run)
-        await self._kg_repo.save(kg)
-        await self._session.commit()
-        return run
+            kg.append_maintenance_run(run)
+            await self._kg_repo.save(kg)
+            await self._session.commit()
+            return run

As per coding guidelines **: Focus on major issues impacting performance, readability, maintainability and security.

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@deploy/apps/kartograph/base/api-deployment.yaml`:
- Around line 165-166: Remove the unused emptyDir overlay and related
configuration for extraction skills: delete the volumeMount that mounts name:
extraction-skills at mountPath: /app/skills and the corresponding volume
definition (emptyDir) and unset the KARTOGRAPH_EXTRACTION_SKILLS_DIR env var (or
point it to a default directory inside the image) in the deployment YAML;
mention FilesystemExtractionRuntimeContextBuilder and
src/api/extraction/application/skill_resolution_service.py as justification so
reviewers can verify that no filesystem skills are loaded and confirm
src/api/Dockerfile has no skills copied, and if you intend to support
image-bundled skills later, add an init/container step to populate /app/skills
instead of keeping an emptyDir overlay.

In `@specs/extraction/agent-sessions.spec.md`:
- Around line 39-47: The requirement "Session Archival and Retention" currently
mandates indefinite storage; change it to require configurable retention
policies with safe defaults (e.g., 90 days) and explicit deletion/erasure
procedures, and update the "Historical session visibility" scenario to state
that archived sessions and run records are available within the configured
retention window and after deletion are permanently erased; include
configuration knobs (retention TTL, archival export, and purge-on-request),
describe retention policy enforcement (background TTL job or lifecycle rules)
and ensure records still include last-updated timestamps and run-level metrics
while noting they may be removed after the configured period or upon
user/administrator erasure requests.

In `@specs/index.spec.md`:
- Around line 61-68: Add the required blank-line separation around the new "###
[Extraction](extraction/) — Agent-Orchestrated Mutation Production" heading so
it is preceded by a blank line and followed by a blank line before the table;
specifically, insert an empty line above the "### [Extraction](extraction/) —
Agent-Orchestrated Mutation Production" line and another empty line between that
heading and the table header line starting with "| Spec | Scope |" to satisfy
markdownlint MD022.

In `@specs/ingestion/sync-lifecycle.spec.md`:
- Around line 73-94: The markdown block introducing "### Requirement:
Commit-Baseline-Aware Ingestion" and the subsequent "#### Scenario: ..."
headings lacks blank lines and triggers markdownlint MD022; add a single blank
line before and after the "### Requirement: Commit-Baseline-Aware Ingestion"
heading and likewise ensure there is a blank line both above and below each
"#### Scenario: ..." heading (e.g., the "Baseline at extraction start", "Branch
head refresh for ingestion readiness", and "No-new-commit outcome" scenarios) so
each heading is isolated per markdownlint rules.

In `@specs/ui/experience.spec.md`:
- Line 70: The inserted markdown headings (e.g., "#### Scenario: URL-first
provider detection" and the other headings at the noted spots) need a blank line
before and after each heading to satisfy markdownlint MD022; update each heading
occurrence (including the ones at 77, 130, 136, 141, 516, 519, 524, 529, 532,
537) by inserting one empty line above and one empty line below the heading so
there is a clear blank line boundary around every "#### ..." heading and rerun
the linter to confirm MD022 is resolved.

In `@specs/ui/kg-manage-experience.spec.md`:
- Around line 85-94: The spec must explicitly define routing based on the data
source count: add a clear rule that when dataSourceCount === 0 the Data Sources
entry from the KG manage workspace routes to
/knowledge-graphs/{kgId}/data-sources/new (first-time onboarding), and when
dataSourceCount > 0 it routes to /knowledge-graphs/{kgId}/data-sources
(returning view); reference the variables/paths dataSourceCount,
/knowledge-graphs/{kgId}/data-sources/new, and
/knowledge-graphs/{kgId}/data-sources in the Scenario text so UI and QA know the
exact conditional routing.

In `@src/api/extraction/infrastructure/repositories/agent_session_repository.py`:
- Around line 21-45: The current save() logic reads then conditionally inserts
ExtractionAgentSessionModel which can race on concurrent calls; replace the
SELECT + conditional INSERT with a single PostgreSQL upsert using SQLAlchemy's
insert(...).on_conflict_do_update targeting the primary key (id) so inserts
become atomic and idempotent; build an INSERT for ExtractionAgentSessionModel
with all fields from the input session (id, user_id, knowledge_graph_id, mode,
message_history, runtime_context, created_at, updated_at, archived_at) and use
on_conflict_do_update to set message_history, runtime_context, updated_at,
archived_at (and any other mutable fields) from excluded values, execute via
self._session.execute(...) and then await self._session.flush() to preserve
current behavior.

In
`@src/api/infrastructure/canonical_schema/graph_canonical_schema_repository.py`:
- Around line 85-91: The current _apply_operations loop only upserts type
definitions from repo.get_all() via self._store.upsert_type_definition, so any
types removed by the mutation are never deleted from storage; update
_apply_operations to compute the set of (label, entity_type) keys present in
repo.get_all() (and metadata_by_key) and compare it with the existing persisted
keys for the kg_id, then call the store deletion API (e.g.,
self._store.delete_type_definition or self._store.delete_type_definitions) for
any keys that exist in storage but not in the in-memory repo; keep existing
upsert logic for present types and ensure deletions are applied before/after
upserts consistently to avoid transient conflicts.

In `@src/api/ingestion/infrastructure/event_handler.py`:
- Line 140: The credentials precedence currently uses a falsy-or expression
("runtime_credentials or payload.get('credentials')") which treats an explicitly
provided empty runtime_credentials as absent and reverts to payload credentials;
update the assignment so payload.get("credentials") is used only when
runtime_credentials is exactly None (e.g., check runtime_credentials is None and
otherwise use runtime_credentials) to avoid re-enabling payload-sourced secrets;
modify the call site that passes credentials (the parameter named credentials
where runtime_credentials and payload.get("credentials") are referenced) so it
performs an explicit None check rather than relying on truthiness.

In `@src/api/management/infrastructure/git_commit_reference_service.py`:
- Around line 78-79: The branch value is inserted into the GitHub branches URL
raw (variable `branch` used when building `url`), which breaks endpoints for
branch names containing slashes; URL-encode `branch` (e.g., using
urllib.parse.quote) before interpolating into the `url =
f"https://api.github.com/repos/{owner}/{repo}/branches/{branch}"` expression in
git_commit_reference_service.py so the request targets the correct GitHub API
path while keeping the rest of the logic (including
`self._http_client`/httpx.AsyncClient usage) unchanged.

In `@src/api/shared_kernel/container_runtime/ports.py`:
- Around line 19-21: ContainerRunSpec is declared frozen but its env and labels
fields use mutable dicts, allowing mutation after construction; update
ContainerRunSpec so env and labels are truly immutable by changing their
annotations to Mapping[str, str] (or similar immutable type) and in the
dataclass __post_init__ wrap/replace the provided dicts with an immutable
mapping (e.g., types.MappingProxyType(dict(self.env)) or construct
frozendict-like structures) to defensively copy input and prevent later
mutation; reference the ContainerRunSpec class and its env and labels fields and
add the necessary imports (typing.Mapping and types.MappingProxyType) when
implementing the change.

In `@src/api/tests/integration/extraction/conftest.py`:
- Around line 18-24: The subprocess.run probe for the container engine (the call
using subprocess.run([engine, "info"], capture_output=True, text=True,
check=False)) can hang; add a short timeout argument (e.g. timeout=5) and catch
subprocess.TimeoutExpired around that call, treating a timeout as engine
unavailable (return False). Keep existing capture_output/text/check settings and
ensure any TimeoutExpired or other exceptions result in a False return so the
test harness treats the engine as unavailable.

In `@src/api/tests/integration/extraction/test_container_workload_runtime.py`:
- Around line 121-124: Remove the unnecessary wall-clock sleep by deleting the
call to time.sleep(3) before invoking manager.cleanup_expired; the test already
forces expiry by passing now=lease.last_activity_at + timedelta(seconds=3) to
manager.cleanup_expired, so keep the cleanup_expired(...) invocation (and its
now argument using lease.last_activity_at) and remove the time.sleep(3) line to
avoid adding runtime to the integration test.

In `@src/api/tests/unit/extraction/application/test_session_history_service.py`:
- Around line 66-76: The fake metrics reader's find_metrics_by_session_ids
currently ignores the knowledge_graph_id and returns metrics solely by
session_id, which masks KG-scoping bugs; update find_metrics_by_session_ids to
respect knowledge_graph_id by storing/looking up metrics scoped by KG (e.g.,
nested dict self._metrics[knowledge_graph_id][session_id] or keys that include
KG) and return only metrics for the provided knowledge_graph_id and the
requested session_ids (or empty lists if not present), referencing the
find_metrics_by_session_ids method and the _metrics store so tests fail when the
service passes the wrong KG.

In
`@src/api/tests/unit/extraction/infrastructure/test_workload_runtime_factory.py`:
- Around line 46-51: The test
test_outbox_extraction_handler_uses_runtime_factory_wiring should stop asserting
on raw source text and instead assert wiring by behavior: import the module that
performs the runtime wiring (the module that imports or calls
create_ephemeral_extraction_worker_launcher), use monkeypatch or
unittest.mock.patch to replace create_ephemeral_extraction_worker_launcher with
a sentinel/callable and also patch InMemoryEphemeralExtractionWorkerLauncher to
a different sentinel, then trigger the import/constructor path the test
exercises and assert the patched factory callable was used (e.g., called or
injected) and the InMemoryEphemeralExtractionWorkerLauncher sentinel was not;
target the exact symbols create_ephemeral_extraction_worker_launcher and
InMemoryEphemeralExtractionWorkerLauncher and the test function
test_outbox_extraction_handler_uses_runtime_factory_wiring to implement this
behavioral assertion.

In `@src/dev-ui/app/pages/data-sources/index.vue`:
- Around line 394-412: The detectGithubSourceDetails function currently calls
the public GitHub API unauthenticated; modify it to add an Authorization header
when a user-provided token is present on the PendingSourceDraft (e.g.,
entry.token or entry.githubToken) so requests become authenticated and use
higher rate limits; build headers like { Authorization: `token ${entry.token}`,
Accept: 'application/vnd.github.v3+json' } when token exists and fall back to no
Authorization otherwise, keeping the existing try/catch and early returns.
- Line 32: The file is missing an import for detectAdapterFromUrl from
'`@/utils/dataSourceWizard`' used in index.vue; update the existing import line
that currently imports DetectedAdapterId to also import detectAdapterFromUrl so
the calls to detectAdapterFromUrl (used around the functions and template
references) resolve correctly — e.g., modify the import statement that
references DetectedAdapterId to include detectAdapterFromUrl from
'`@/utils/dataSourceWizard`'.

In `@src/dev-ui/app/pages/graph/mutations.vue`:
- Around line 581-582: The code synchronously calls parseContent(body) and
markEditedFromOperations right before submit, which blocks the UI for large
payloads; instead, avoid the main-thread parse by removing parseContent(body) on
the large-file path and either use the worker-provided metadata
(operations/highlights) or skip calling markEditedFromOperations when in
large-file mode. Concretely: replace the direct parsedLarge = parseContent(body)
+ markEditedFromOperations(parsedLarge.operations) with logic that (1) if
workerMetadata?.operations exists use that to call markEditedFromOperations,
else (2) only call parseContent/markEditedFromOperations when a small-file flag
(e.g., !isLargeFile or fileSize < LARGE_THRESHOLD) is true; reference
functions/vars parseContent, markEditedFromOperations, parsedLarge, body, and
workerMetadata/isLargeFile in the change.

In `@src/dev-ui/app/pages/graph/schema.vue`:
- Around line 208-210: The catch block in schema.vue is caching a zero count on
failures (instanceCountCache.set(cacheKey, 0)), which makes transient metadata
lookup errors persist; remove that write from the catch and only call
instanceCountCache.set(cacheKey, value) after a successful count retrieval
(i.e., move the cache set into the try/after-success path for the function that
computes the count using cacheKey), or alternatively set a distinct error
marker/short TTL instead of 0 so failures aren't interpreted as a real zero
count; reference instanceCountCache and cacheKey to locate and change the
behavior.

In `@src/dev-ui/app/tests/data-sources.test.ts`:
- Around line 3142-3175: The test suite "Data-sources-focused layout -
structural verification" currently reads the global page via the source variable
(readFileSync(resolve(__dirname, '../pages/data-sources/index.vue'), 'utf-8'))
and asserts on global copy/layout; change the test to target the KG-scoped pages
or narrow assertions: either update the file path used by readFileSync to the KG
pages (e.g., '../pages/knowledge-graphs/[kgId]/data-sources/index.vue' and
'../pages/knowledge-graphs/[kgId]/data-sources/new.vue') and keep the specific
assertions that belong to those KG-scoped pages, or keep the global page path
but remove/relax the assertions that reference global sidebar content (lines
asserting 'Data source catalog', 'Paste your source URLs', Select markup, etc.)
so they only check non-breaking invariants; ensure you update the describe name
to reflect KG-scoped verification if you move files and adjust any test strings
referencing 'Paste your source URLs', 'Detected:', '<Select
v-model="selectedKnowledgeGraphId">', and remove checks for 'Active
workers'/'Estimated cost trend' that are global.

---

Outside diff comments:
In `@deploy/apps/kartograph/base/api-deployment.yaml`:
- Around line 167-182: Add a startupProbe block for the API container alongside
the existing livenessProbe and readinessProbe to avoid premature restarts during
cold starts; locate the container spec that currently defines livenessProbe and
readinessProbe and add a startupProbe (httpGet path: /health, port: 8000) with
higher initialDelaySeconds (e.g., 60), longer periodSeconds and timeoutSeconds
appropriate for cold starts, and a reasonable failureThreshold so Kubernetes
waits for the app to become ready before applying liveness checks.

In `@src/dev-ui/app/pages/data-sources/index.vue`:
- Around line 675-721: loadDataSources suffers from an N+1 API pattern: it
sequentially fetches knowledge graphs (kgs), then per-KG data sources, and then
per-data-source sync_runs and diff_summary (uses apiFetch, kgs, sources, ds),
making it slow at scale; fix it by parallelizing network calls (use Promise.all
to fetch all KG data-sources in parallel and within each KG use Promise.all to
fetch each ds.sync_runs and ds.diff_summary concurrently) or, better, add a
backend endpoint that returns data sources with sync_runs and diff_summary in
one response and update loadDataSources to call that endpoint and assign
dataSources.value accordingly.

---

Duplicate comments:
In `@compose.dev.yaml`:
- Around line 8-9: The compose file currently sets
KARTOGRAPH_EXTRACTION_RUNTIME_BACKEND=container and
KARTOGRAPH_EXTRACTION_RUNTIME_CONTAINER_ENGINE=auto by default; change this so
Docker/host-engine access is opt-in by removing these variables from the main
service environment and placing them into an explicit opt-in profile or override
(e.g., a "docker-host" compose profile or separate override file) that
developers can enable when needed; update the service to reference that
profile/override and document the opt-in usage so default dev runs no longer
grant privileged host Docker access, and ensure the variables
KARTOGRAPH_EXTRACTION_RUNTIME_BACKEND and
KARTOGRAPH_EXTRACTION_RUNTIME_CONTAINER_ENGINE are only defined in that opt-in
scope.

In `@src/api/extraction/application/agent_session_service.py`:
- Around line 58-97: The current create flow (uses
_repository.find_active_by_scope and then _repository.save of
ExtractionAgentSession) has a TOCTOU race that can create duplicate active
sessions; add a durable uniqueness constraint in storage (a partial/unique index
on active sessions for the (user_id, knowledge_graph_id, mode) scope) and change
the save path to handle conflicts: when saving the new ExtractionAgentSession if
_repository.save raises a unique-constraint/duplicate-key error, catch that
error, re-call _repository.find_active_by_scope(user_id, knowledge_graph_id,
mode) and return the found existing session instead of propagating the error;
keep the existing runtime_context/bootstrap logic unchanged and ensure the
repository layer surfaces a clear unique-constraint exception to catch.

In `@src/api/infrastructure/canonical_schema/ontology_projection.py`:
- Around line 22-46: The loop over stored_definitions currently only handles
"node" and "edge" and silently skips unknown stored.entity_type values; update
the loop in ontology_projection.py (the block that builds NodeTypeDefinition and
EdgeTypeDefinition) to add an explicit else branch that raises a clear exception
(e.g., ValueError) or logs and raises, including the unexpected
stored.entity_type and identifying info like stored.label (and/or stored.id) so
schema corruption is surfaced immediately rather than silently ignored. Ensure
the error message references stored.entity_type, stored.label and that the new
branch is used in the same loop that constructs NodeTypeDefinition and
EdgeTypeDefinition.

In
`@src/api/infrastructure/migrations/versions/f5b6c7d8e9f0_add_workspace_session_pointer_columns.py`:
- Around line 24-46: The new session pointer columns on knowledge_graphs
(active_schema_bootstrap_session_id, active_extraction_operations_session_id,
most_recent_completed_session_id) lack foreign key constraints; update the
upgrade() to add FK constraints referencing the sessions table (or create a
separate migration if ordering requires) and update downgrade() to drop those
FKs before dropping the columns; specifically modify upgrade() to call
op.create_foreign_key(...) for each column (naming keys like
fk_knowledge_graphs_active_schema_bootstrap_session_id) and modify downgrade()
to call op.drop_constraint(...) for each FK prior to op.drop_column(...).

In `@src/api/main.py`:
- Around line 190-196: The branch SHA lookup code that builds url and calls
httpx.AsyncClient (the block using url, client.get, response.raise_for_status,
payload and sha) must be made resilient: wrap the async request,
raise_for_status call and payload parsing in a try/except that catches
httpx.RequestError, httpx.HTTPStatusError and ValueError (or json parse errors)
and returns None on any of those failures so transient/network/GitHub errors
don't abort ingestion; keep the existing logic that extracts sha from
payload.get("commit", {}).get("sha") and return str(sha) if present, otherwise
None.

In `@src/api/management/application/services/data_source_service.py`:
- Around line 607-669: The pipeline_mode parameter is not validated before
calling ds.request_sync, allowing arbitrary strings to propagate; add validation
in the trigger method (the function accepting pipeline_mode: str) to only allow
the known modes (e.g., "full" and "ingest_only") and raise a ValueError with a
clear message for invalid values; perform this check immediately before
constructing the DataSourceSyncRun / before calling ds.request_sync so
ds.request_sync and downstream lifecycle handlers always receive a valid
pipeline_mode.

In `@src/api/management/domain/events/data_source.py`:
- Line 103: Change the loose pipeline_mode: str to a constrained domain type and
validate at construction: replace the annotation pipeline_mode: str with
pipeline_mode: Literal["full", "ingest_only"] (import Literal from typing) or
use your domain value object type, and add validation in the
constructor/validator (e.g., __post_init__ for dataclasses or a pydantic
validator) to reject invalid values; ensure references to pipeline_mode across
the class/module use the new type.

In `@src/api/management/domain/value_objects.py`:
- Around line 436-437: The assignment using
prepopulated=bool(data.get("prepopulated", False)) mis-parses string values like
"false" as True; update the parsing in the constructor/serializer that sets
prepopulated (and keep prepopulated_instance_count handling) to explicitly
handle booleans and common string forms: if the incoming value is a bool use it
directly, if it's a string compare lowercased values ("true","1","yes") => True
and ("false","0","no","") => False, otherwise fall back to the default False;
locate the code that creates the value object (the place setting prepopulated
and prepopulated_instance_count in value_objects.py) and replace the bool(...)
coercion with this explicit parse-to-bool logic while preserving the int
conversion for prepopulated_instance_count.

In `@src/api/tests/unit/ingestion/infrastructure/test_ingestion_event_handler.py`:
- Around line 317-335: Tests contain GitHub PAT-shaped literals (e.g.,
"ghp_...") which trigger secret scanners; update the test to use a non-PAT
placeholder token (e.g., "token-placeholder" or "ghp_PLACEHOLDER") wherever the
PAT appears: in the runtime_credentials passed to IngestionEventHandler.handle,
in the expected error string from _LeakyService, and in assertions that inspect
outbox.appended/event payloads; ensure references to IngestionEventHandler,
_LeakyService, _sync_started_payload, handler.handle, and outbox.appended/event
remain unchanged so behavior is preserved while eliminating real-looking secret
patterns.

In `@src/dev-ui/app/pages/graph/mutations.vue`:
- Around line 181-190: The async loadLiveInspector calls can overlap and older
requests may overwrite newer state; make loadLiveInspector cancel-safe or
sequencing-safe by introducing a request token/sequence counter that increments
before each invocation (from the watchers on selectedKnowledgeGraphId and the
submission.status watcher and any manual refresh), pass/close over that token
into the async workflow inside loadLiveInspector, and before applying any
results (state updates, DOM changes, calls like resetSessionEditHighlights)
verify the token matches the latest; alternatively support an AbortSignal that
the watchers set/abort so only the latest load applies—update loadLiveInspector,
its callers (the watchers shown around loadLiveInspector and any manual refresh
handlers), and any related code paths in the other affected block (the section
referenced at 418-467) to use this token/abort mechanism.

In `@src/dev-ui/app/pages/knowledge-graphs/`[kgId]/data-sources/index.vue:
- Around line 142-157: The per-source enrichment loop (for (const ds of
sources)) is performing serial network calls via apiFetch for ds.sync_runs and
ds.diff_summary, causing high latency; change to run both fetches concurrently
per source (e.g., create two promises for
`/management/data-sources/${ds.id}/sync-runs` and
`/management/data-sources/${ds.id}/diff-summary` and await them together), and
then run enrichment across all sources in parallel using Promise.all over
sources so each source's pair of fetches executes concurrently; ensure errors
are handled per-promise so failed sync_runs become [] and failed diff_summary
become null while preserving assignment to ds.sync_runs and ds.diff_summary.
- Around line 334-338: The watcher currently triggers reloads only when
tenantVersion changes (watch(tenantVersion, ...)) so navigating between
knowledge graphs by route param (kgId) leaves stale state; update the logic to
also watch the route param (kgId) or the route object and invoke the same
sequence (clear dataSources.value, await loadKnowledgeGraph(), await
ensureEntryRoute()) when kgId changes—modify the watcher or add a separate watch
referencing the route/route.params.kgId so loadKnowledgeGraph() and
ensureEntryRoute() run on param-only navigation.
- Around line 114-121: The current startPolling implementation uses setInterval
with an async callback which can start a new tick before the previous
loadDataSources() completes, causing overlapping executions and race conditions;
change to a self-scheduling pattern (or guard) so the next poll is scheduled
only after loadDataSources() finishes: in startPolling, replace the setInterval
approach by either (a) using an async loop that awaits loadDataSources() then
awaits a timeout before looping, or (b) keep the timer but add an in-flight
boolean flag (e.g., isPollingInFlight) checked at the top of the callback to
skip if already running; ensure pollInterval (the timer id) and stopPolling
still clear the timer and reset flags, and keep the existing check that calls
stopPolling when hasAnyActiveSync(dataSources.value) is false.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

Comment on lines +165 to +166
- name: extraction-skills
mountPath: /app/skills
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== Consumers of skills path =="
rg -n -C2 'KARTOGRAPH_EXTRACTION_SKILLS_DIR|/app/skills|extraction-skills|skills/' src deploy compose.yaml compose.dev.yaml

echo "== Any manifest step that writes into /app/skills =="
rg -n -C3 '/app/skills|extraction-skills|initContainers:|cp .*skills|tar .*skills' deploy

echo "== Dockerfile checks for bundled skills =="
mapfile -t dockerfiles < <(fd -i 'dockerfile')
if ((${`#dockerfiles`[@]})); then
  rg -n -C2 'COPY .*skills|/app/skills' "${dockerfiles[@]}"
else
  echo "No Dockerfile found"
fi

Repository: openshift-hyperfleet/kartograph

Length of output: 10003


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== Dockerfiles mentioning skills or /app/skills =="
mapfile -t dockerfiles < <(fd -i 'Dockerfile' . || true)
echo "Found ${`#dockerfiles`[@]} Dockerfiles"
if ((${`#dockerfiles`[@]})); then
  rg -n -C2 '/app/skills|COPY .*skills|extraction-skills' "${dockerfiles[@]}" || true
fi

echo "== Any code that downloads/generates skills into skills_dir =="
rg -n -C2 'KARTOGRAPH_EXTRACTION_SKILLS_DIR|skills_dir|/app/skills|extraction-skills' src || true

echo "== Any entrypoint/cmd scripts or scripts referencing skills path =="
rg -n -C2 'skills' .github . || true

echo "== Inspect api-deployment volume/extraction-skills definition block =="
# show around emptyDir definition and corresponding volumeMounts
rg -n -C3 'name: extraction-skills|emptyDir|mountPath: /app/skills' deploy/apps/kartograph/base/api-deployment.yaml

echo "== Are there initContainers that copy/tar into /app/skills? =="
# narrow to initContainers section
python - <<'PY'
import re, pathlib
p = pathlib.Path("deploy/apps/kartograph/base/api-deployment.yaml")
text = p.read_text()
m = re.search(r'initContainers:\n(?s:.*?)containers:', text)
print("initContainers block found:", bool(m))
if m:
    blk = m.group(0)
    # print only relevant lines
    for i,line in enumerate(blk.splitlines(), start=1):
        if '/app/skills' in line or 'skills' in line:
            print(line)
PY

Repository: openshift-hyperfleet/kartograph

Length of output: 201


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== Dockerfiles mentioning skills or /app/skills =="
mapfile -t dockerfiles < <(fd -i 'Dockerfile' . || true)
echo "Found ${`#dockerfiles`[@]} Dockerfiles"
if ((${`#dockerfiles`[@]})); then
  rg -n -C2 '/app/skills|COPY .*skills|extraction-skills' "${dockerfiles[@]}" || true
fi

echo "== Any code that downloads/generates skills into skills_dir =="
rg -n -C2 'KARTOGRAPH_EXTRACTION_SKILLS_DIR|skills_dir|/app/skills|extraction-skills' src || true

echo "== Any entrypoint/cmd scripts or scripts referencing skills path =="
rg -n -C2 'skills' .github . || true

echo "== Inspect api-deployment volume/extraction-skills definition block =="
rg -n -C3 'name: extraction-skills|emptyDir|mountPath: /app/skills' deploy/apps/kartograph/base/api-deployment.yaml

echo "== Are there initContainers that copy/tar into /app/skills? =="
python - <<'PY'
import re, pathlib
p = pathlib.Path("deploy/apps/kartograph/base/api-deployment.yaml")
text = p.read_text()
m = re.search(r'initContainers:\n(?s:.*?)containers:', text)
print("initContainers block found:", bool(m))
if m:
    blk = m.group(0)
    for line in blk.splitlines():
        if '/app/skills' in line or 'skills' in line:
            print(line)
PY

Repository: openshift-hyperfleet/kartograph

Length of output: 201


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== Dockerfiles present (no process substitution) =="
mapfile -t dockerfiles < <(fd -i "dockerfile" . -t f || true)
echo "Found ${`#dockerfiles`[@]} dockerfiles"
printf '%s\n' "${dockerfiles[@]}" | head -n 50

if ((${`#dockerfiles`[@]})); then
  rg -n -C2 '/app/skills|COPY .*skills|extraction-skills' "${dockerfiles[@]}" || true
fi

echo "== Show api-deployment volumeMount + emptyDir for extraction-skills =="
sed -n '130,220p' deploy/apps/kartograph/base/api-deployment.yaml

echo "== How app uses KARTOGRAPH_EXTRACTION_SKILLS_DIR (main + callers) =="
rg -n 'KARTOGRAPH_EXTRACTION_SKILLS_DIR|_EXTRACTION_SKILLS_DIR|/app/skills' src/api src || true

echo "== Where skills_dir is used and whether it triggers download/generation =="
rg -n 'skills_dir' src/api src || true

Repository: openshift-hyperfleet/kartograph

Length of output: 202


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== Find Dockerfiles (no process substitution) =="
find . -maxdepth 6 -iname "Dockerfile*" -type f 2>/dev/null | head -n 200 || true

echo "== Search for any build-copy/bundle of skills =="
rg -n -C2 '/app/skills|extraction-skills|COPY .*skills|ADD .*skills|KARTOGRAPH_EXTRACTION_SKILLS_DIR' . || true

echo "== api-deployment.yaml: extraction-skills definitions =="
rg -n -C3 'name: extraction-skills|emptyDir|mountPath: /app/skills' deploy/apps/kartograph/base/api-deployment.yaml || true

echo "== api-deployment.yaml: look for any init/lifecycle copying into /app/skills =="
rg -n -C3 'initContainers:|lifecycle:|postStart:|command:|args:|cp |tar |curl |wget |aws s3 |gsutil |/app/skills' deploy/apps/kartograph/base/api-deployment.yaml || true

echo "== Where KARTOGRAPH_EXTRACTION_SKILLS_DIR / _EXTRACTION_SKILLS_DIR is used in code =="
rg -n 'KARTOGRAPH_EXTRACTION_SKILLS_DIR|_EXTRACTION_SKILLS_DIR|skills_dir' src/api || true

echo "== Show src/api/main.py around _EXTRACTION_SKILLS_DIR definition =="
sed -n '1,120p' src/api/main.py

echo "== Show src/api/main.py where skills dir is passed/used (small windows) =="
rg -n '_EXTRACTION_SKILLS_DIR|skills_dir' src/api/main.py || true

Repository: openshift-hyperfleet/kartograph

Length of output: 12527


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== src/api/Dockerfile (look for skills bundling) =="
sed -n '1,220p' src/api/Dockerfile

echo "== Fakes Dockerfile (if any skills fixtures) =="
sed -n '1,220p' src/api/tests/fakes/Dockerfile || true

echo "== Find all usages of skills_dir in src/api (focus on extraction) =="
rg -n 'skills_dir' src/api

echo "== runtime_context_builder: how/when it creates/uses skills dir =="
sed -n '1,140p' src/api/extraction/infrastructure/runtime_context_builder.py

echo "== extraction infrastructure/services using skills dir =="
rg -n 'skills_dir' src/api/extraction -S || true

echo "== Any code that populates/copies/extracts into skills_dir =="
rg -n '(mkdir|download|fetch|curl|wget|tar|unzip|cp |extract).*skills_dir|skills_dir.*(download|fetch|curl|wget|tar|unzip|cp)' src/api -S || true

echo "== Any 'skills' artifact download logic (broader search) =="
rg -n '(skills).*handle|download|fetch|s3|httpx|get|tar|unzip|extract|bundle' src/api -S || true

Repository: openshift-hyperfleet/kartograph

Length of output: 50388


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== api-deployment.yaml: show volumes + extraction-skills block (with line numbers) =="
nl -ba deploy/apps/kartograph/base/api-deployment.yaml | sed -n '130,220p'
nl -ba deploy/apps/kartograph/base/api-deployment.yaml | sed -n '220,320p'

echo "== Where runtime.skills_dir is consumed =="
rg -n 'runtime\.skills_dir|skills_dir=' src/api --glob '*.py' || true

echo "== Search for any 'skills' file IO (skills dir usage) =="
rg -n '(skills_dir|_skills_dir).*?(open|read|write|glob|rglob|listdir|iterdir|rmtree|copy|download|fetch|curl|wget|tar|unzip|extract|httpx|get|s3|gsutil)' src/api --glob '*.py' || true

echo "== Search for downloading/creating skills at runtime =="
rg -n '(download|fetch|get|curl|wget|s3|gsutil).*skills|skills.*(download|fetch|get|curl|wget|s3|gsutil)|skill.*(download|fetch|get|curl|wget|s3|gsutil)' src/api --glob '*.py' || true

echo "== Search for any hardcoded fallback to /app/skills beyond env =="
rg -n '/app/skills' src/api --glob '*.py' || true

Repository: openshift-hyperfleet/kartograph

Length of output: 222


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== api-deployment.yaml: show volumes + mounts around extraction-skills =="
cat -n deploy/apps/kartograph/base/api-deployment.yaml | sed -n '140,220p'

echo "== api-deployment.yaml: also show the trailing section where volumes are declared =="
cat -n deploy/apps/kartograph/base/api-deployment.yaml | sed -n '220,320p'

echo "== Trace all code references to skills_dir / _skills_dir =="
rg -n --glob '*.py' 'skills[_-]?dir|_skills_dir' src/api || true

echo "== Trace any code that lists/opens/reads from the skills directory =="
rg -n --glob '*.py' '(skills_dir|_skills_dir).*?(glob|rglob|listdir|iterdir|open|read|write|copy|download|fetch|get|curl|wget|tar|unzip|extract|httpx|get|s3|gsutil)' src/api || true

echo "== Trace any mention of 'skills' artifacts downloaded/generated from job payloads =="
rg -n --glob '*.py' '(skills).*?(download|fetch|get|curl|wget|tar|unzip|extract|httpx|get|s3|gsutil|bundle)' src/api || true

Repository: openshift-hyperfleet/kartograph

Length of output: 4044


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== Search for 'skills' usage within src/api/extraction =="
rg -n -S 'skills' src/api/extraction || true

echo "== Search for 'skill' usage within src/api/extraction =="
rg -n -S 'skill' src/api/extraction || true

echo "== Search for 'skills' usage in entire src/api (limit to likely files) =="
rg -n -S 'skills' src/api --glob '*extraction*' --glob '*.py' || true

echo "== Search for skills_dir passed into other services (call sites) =="
rg -n --glob '*.py' 'skills_dir=' -S src/api || true

echo "== Search for any filesystem operations mentioning '_skills_dir' =="
rg -n --glob '*.py' '_skills_dir' -S src/api || true

Repository: openshift-hyperfleet/kartograph

Length of output: 11961


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== runtime_context_builder.py =="
cat -n src/api/extraction/infrastructure/runtime_context_builder.py

echo "== extraction/ports/services.py =="
cat -n src/api/extraction/ports/services.py

echo "== skill_resolution_service.py =="
cat -n src/api/extraction/application/skill_resolution_service.py

echo "== agent_session_service.py (skills usage) =="
cat -n src/api/extraction/application/agent_session_service.py

Repository: openshift-hyperfleet/kartograph

Length of output: 19263


🏁 Script executed:

#!/bin/bash
set -euo pipefail

rg -n --glob '*.py' 'runtime_context\.skills_dir|\.skills_dir' src/api || true

Repository: openshift-hyperfleet/kartograph

Length of output: 349


🏁 Script executed:

#!/bin/bash
set -euo pipefail

rg -n --glob '*.py' 'skills_dir' src/api --glob '!src/api/tests/**' || true

Repository: openshift-hyperfleet/kartograph

Length of output: 675


🏁 Script executed:

#!/bin/bash
set -euo pipefail
rg -n '/app/skills|extraction-skills|COPY .*skills|ADD .*skills|skills/' src/api/Dockerfile src/api/tests/fakes/Dockerfile src/dev-ui/Dockerfile || true

Repository: openshift-hyperfleet/kartograph

Length of output: 57


🏁 Script executed:

#!/bin/bash
set -euo pipefail
sed -n '300,420p' src/api/main.py

Repository: openshift-hyperfleet/kartograph

Length of output: 5125


emptyDir over /app/skills is currently unnecessary (no filesystem skills are read, and no skills are bundled into the image).

File: deploy/apps/kartograph/base/api-deployment.yaml
Lines: 165-166

            - name: extraction-skills
              mountPath: /app/skills
  • The manifest mounts extraction-skills as emptyDir: {} over /app/skills and sets KARTOGRAPH_EXTRACTION_SKILLS_DIR=/app/skills.
  • The app passes this path into FilesystemExtractionRuntimeContextBuilder, which only does mkdir(..., exist_ok=True) on the directory and does not read any files from it.
  • Extraction “skills” come from in-code prompt templates + KG overrides (src/api/extraction/application/skill_resolution_service.py), not from files under /app/skills.
  • src/api/Dockerfile contains no COPY/ADD of skills or /app/skills into the image.

Also applies to: 197-198
Optional: if future extraction work expects image-bundled filesystem skills, add an init/lifecycle population step (or remove the overlay when not needed).

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@deploy/apps/kartograph/base/api-deployment.yaml` around lines 165 - 166,
Remove the unused emptyDir overlay and related configuration for extraction
skills: delete the volumeMount that mounts name: extraction-skills at mountPath:
/app/skills and the corresponding volume definition (emptyDir) and unset the
KARTOGRAPH_EXTRACTION_SKILLS_DIR env var (or point it to a default directory
inside the image) in the deployment YAML; mention
FilesystemExtractionRuntimeContextBuilder and
src/api/extraction/application/skill_resolution_service.py as justification so
reviewers can verify that no filesystem skills are loaded and confirm
src/api/Dockerfile has no skills copied, and if you intend to support
image-bundled skills later, add an init/container step to populate /app/skills
instead of keeping an emptyDir overlay.

Comment on lines +39 to +47
### Requirement: Session Archival and Retention
The system SHALL retain completed session and run records indefinitely.

#### Scenario: Historical session visibility
- GIVEN prior sessions and mutation runs
- WHEN users or administrators query session history
- THEN archived sessions and associated run records remain available
- AND each record includes last-updated timestamps and run-level metrics

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Avoid mandating indefinite retention without policy controls.

Requiring session/run records to be stored indefinitely creates a compliance/privacy risk for potentially sensitive conversational content. The spec should define bounded or configurable retention plus deletion/erasure behavior.

🧰 Tools
🪛 markdownlint-cli2 (0.22.1)

[warning] 39-39: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below

(MD022, blanks-around-headings)


[warning] 42-42: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below

(MD022, blanks-around-headings)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@specs/extraction/agent-sessions.spec.md` around lines 39 - 47, The
requirement "Session Archival and Retention" currently mandates indefinite
storage; change it to require configurable retention policies with safe defaults
(e.g., 90 days) and explicit deletion/erasure procedures, and update the
"Historical session visibility" scenario to state that archived sessions and run
records are available within the configured retention window and after deletion
are permanently erased; include configuration knobs (retention TTL, archival
export, and purge-on-request), describe retention policy enforcement (background
TTL job or lifecycle rules) and ensure records still include last-updated
timestamps and run-level metrics while noting they may be removed after the
configured period or upon user/administrator erasure requests.

Comment thread specs/index.spec.md
Comment on lines +61 to +68
### [Extraction](extraction/) — Agent-Orchestrated Mutation Production
AI-assisted schema and extraction workflows that emit MutationLogs for Graph application.

| Spec | Scope |
|------|-------|
| [Operations](extraction/operations.spec.md) | Mode-specific agent operations and mutation-log production |
| [Agent Sessions](extraction/agent-sessions.spec.md) | Session lifecycle, reset behavior, and session metrics |

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Fix heading spacing in the new Extraction section.

The new ###/| Spec | Scope | block is missing required blank-line separation, which triggers markdownlint (MD022) and can fail docs CI.

🧰 Tools
🪛 markdownlint-cli2 (0.22.1)

[warning] 61-61: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below

(MD022, blanks-around-headings)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@specs/index.spec.md` around lines 61 - 68, Add the required blank-line
separation around the new "### [Extraction](extraction/) — Agent-Orchestrated
Mutation Production" heading so it is preceded by a blank line and followed by a
blank line before the table; specifically, insert an empty line above the "###
[Extraction](extraction/) — Agent-Orchestrated Mutation Production" line and
another empty line between that heading and the table header line starting with
"| Spec | Scope |" to satisfy markdownlint MD022.

Comment on lines +73 to +94
### Requirement: Commit-Baseline-Aware Ingestion
The system SHALL maintain commit-aware ingestion context for Git-backed sources.

#### Scenario: Baseline at extraction start
- GIVEN a Git-backed data source with a local clone
- WHEN a sync run starts
- THEN the run baseline is set to `commit_during_last_extraction`
- AND incremental extraction compares current source state against that baseline

#### Scenario: Branch head refresh for ingestion readiness
- GIVEN a Git-backed data source with a tracked branch
- WHEN sync orchestration prepares ingestion context
- THEN the latest tracked branch HEAD is resolved and stored as `tracked_branch_head_commit`
- AND ingestion context for that run is prepared from the corresponding latest files

#### Scenario: No-new-commit outcome
- GIVEN `tracked_branch_head_commit` equals `commit_during_last_extraction`
- WHEN a sync run is requested
- THEN the system may short-circuit heavy extraction work
- AND a sync run record is still created for auditability
- AND run status and logs indicate no source changes were detected

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add blank lines around the new requirement/scenario headings.

The inserted headings in this block trigger markdownlint MD022 and may break documentation lint checks.

🧰 Tools
🪛 markdownlint-cli2 (0.22.1)

[warning] 73-73: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below

(MD022, blanks-around-headings)


[warning] 76-76: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below

(MD022, blanks-around-headings)


[warning] 82-82: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below

(MD022, blanks-around-headings)


[warning] 88-88: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below

(MD022, blanks-around-headings)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@specs/ingestion/sync-lifecycle.spec.md` around lines 73 - 94, The markdown
block introducing "### Requirement: Commit-Baseline-Aware Ingestion" and the
subsequent "#### Scenario: ..." headings lacks blank lines and triggers
markdownlint MD022; add a single blank line before and after the "###
Requirement: Commit-Baseline-Aware Ingestion" heading and likewise ensure there
is a blank line both above and below each "#### Scenario: ..." heading (e.g.,
the "Baseline at extraction start", "Branch head refresh for ingestion
readiness", and "No-new-commit outcome" scenarios) so each heading is isolated
per markdownlint rules.

The system SHALL provide a guided flow for connecting external data sources to a knowledge graph.

#### Scenario: Adapter type selection
#### Scenario: URL-first provider detection
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add required blank lines around inserted headings.

These heading insertions currently violate markdownlint MD022 (blanks-around-headings) and should be normalized for lint/readability consistency.

Also applies to: 77-77, 130-130, 136-136, 141-141, 516-516, 519-519, 524-524, 529-529, 532-532, 537-537

🧰 Tools
🪛 markdownlint-cli2 (0.22.1)

[warning] 70-70: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below

(MD022, blanks-around-headings)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@specs/ui/experience.spec.md` at line 70, The inserted markdown headings
(e.g., "#### Scenario: URL-first provider detection" and the other headings at
the noted spots) need a blank line before and after each heading to satisfy
markdownlint MD022; update each heading occurrence (including the ones at 77,
130, 136, 141, 516, 519, 524, 529, 532, 537) by inserting one empty line above
and one empty line below the heading so there is a clear blank line boundary
around every "#### ..." heading and rerun the linter to confirm MD022 is
resolved.

buildDataSourceCreationUrl,
buildDataSourceCreationBody,
} from '@/utils/dataSourceWizard'
import type { DetectedAdapterId } from '@/utils/dataSourceWizard'
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check if detectAdapterFromUrl is exported from dataSourceWizard
rg -n "export.*detectAdapterFromUrl|function detectAdapterFromUrl|const detectAdapterFromUrl" --type ts --type vue

Repository: openshift-hyperfleet/kartograph

Length of output: 103


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# 1) Locate export
echo "== Export definitions =="
rg -n "export (function|const|default)?\\s*detectAdapterFromUrl|function detectAdapterFromUrl\\b|const detectAdapterFromUrl\\b" \
  src/dev-ui/app/utils/dataSourceWizard.ts || true

# 2) Locate usages in index.vue (including template)
echo "== Usages in index.vue =="
rg -n "detectAdapterFromUrl" src/dev-ui/app/pages/data-sources/index.vue || true

# 3) Show import block from dataSourceWizard in index.vue
echo "== Import from dataSourceWizard in index.vue =="
rg -n "from '`@/utils/dataSourceWizard`'" src/dev-ui/app/pages/data-sources/index.vue || true

# 4) Print the top of index.vue to inspect imports around the found line(s)
echo "== index.vue import section (top 120 lines) =="
sed -n '1,120p' src/dev-ui/app/pages/data-sources/index.vue

Repository: openshift-hyperfleet/kartograph

Length of output: 3455


Add missing detectAdapterFromUrl import in index.vue

src/dev-ui/app/utils/dataSourceWizard.ts exports detectAdapterFromUrl, but src/dev-ui/app/pages/data-sources/index.vue calls it (e.g., ~353 and template ~1468-1469) without importing it from @/utils/dataSourceWizard. Include detectAdapterFromUrl in the existing import block.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/dev-ui/app/pages/data-sources/index.vue` at line 32, The file is missing
an import for detectAdapterFromUrl from '`@/utils/dataSourceWizard`' used in
index.vue; update the existing import line that currently imports
DetectedAdapterId to also import detectAdapterFromUrl so the calls to
detectAdapterFromUrl (used around the functions and template references) resolve
correctly — e.g., modify the import statement that references DetectedAdapterId
to include detectAdapterFromUrl from '`@/utils/dataSourceWizard`'.

Comment on lines +394 to 412
async function detectGithubSourceDetails(entry: PendingSourceDraft) {
if (entry.detectedAdapterId !== 'github') return
try {
const parsed = new URL(entry.url)
const [owner, repoRaw] = parsed.pathname.split('/').filter(Boolean)
const repo = repoRaw?.replace(/\.git$/, '')
if (!owner || !repo) return
const response = await fetch(`https://api.github.com/repos/${owner}/${repo}`)
if (!response.ok) return
const payload = await response.json() as { default_branch?: string; name?: string }
if (!entry.branch.trim() && payload.default_branch) {
entry.branch = payload.default_branch
}
if (!entry.name.trim() && payload.name) {
entry.name = payload.name
}
} catch {
// Best effort only.
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial | 💤 Low value

GitHub API calls are unauthenticated and subject to rate limits.

detectGithubSourceDetails calls the public GitHub API without authentication. The unauthenticated rate limit is 60 requests/hour per IP. For bulk onboarding scenarios this could hit the limit quickly.

Consider using the user-provided token (if available) for these enrichment calls, or document this limitation.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/dev-ui/app/pages/data-sources/index.vue` around lines 394 - 412, The
detectGithubSourceDetails function currently calls the public GitHub API
unauthenticated; modify it to add an Authorization header when a user-provided
token is present on the PendingSourceDraft (e.g., entry.token or
entry.githubToken) so requests become authenticated and use higher rate limits;
build headers like { Authorization: `token ${entry.token}`, Accept:
'application/vnd.github.v3+json' } when token exists and fall back to no
Authorization otherwise, keeping the existing try/catch and early returns.

Comment on lines +581 to +582
const parsedLarge = parseContent(body)
markEditedFromOperations(parsedLarge.operations)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Avoid synchronous parse on the large-file submit path.

Line 581 re-parses the whole body on the main thread right before submit. That negates the large-file fast path and can freeze the UI for big payloads.

Use worker-derived metadata for highlights, or skip markEditedFromOperations in large-file mode.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/dev-ui/app/pages/graph/mutations.vue` around lines 581 - 582, The code
synchronously calls parseContent(body) and markEditedFromOperations right before
submit, which blocks the UI for large payloads; instead, avoid the main-thread
parse by removing parseContent(body) on the large-file path and either use the
worker-provided metadata (operations/highlights) or skip calling
markEditedFromOperations when in large-file mode. Concretely: replace the direct
parsedLarge = parseContent(body) +
markEditedFromOperations(parsedLarge.operations) with logic that (1) if
workerMetadata?.operations exists use that to call markEditedFromOperations,
else (2) only call parseContent/markEditedFromOperations when a small-file flag
(e.g., !isLargeFile or fileSize < LARGE_THRESHOLD) is true; reference
functions/vars parseContent, markEditedFromOperations, parsedLarge, body, and
workerMetadata/isLargeFile in the change.

Comment on lines +208 to +210
} catch {
// Avoid noisy toasts from metadata enrichment failures.
instanceCountCache.set(cacheKey, 0)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Do not cache 0 when count lookup fails.

Line 208-Line 210 turns transient metadata query failures into persistent false counts until cache reset.

Proposed fix
-  } catch {
-    // Avoid noisy toasts from metadata enrichment failures.
-    instanceCountCache.set(cacheKey, 0)
+  } catch {
+    // Avoid noisy toasts from metadata enrichment failures.
+    // Do not persist a false "0" count on transient failures.
+    instanceCountCache.delete(cacheKey)
   } finally {
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/dev-ui/app/pages/graph/schema.vue` around lines 208 - 210, The catch
block in schema.vue is caching a zero count on failures
(instanceCountCache.set(cacheKey, 0)), which makes transient metadata lookup
errors persist; remove that write from the catch and only call
instanceCountCache.set(cacheKey, value) after a successful count retrieval
(i.e., move the cache set into the try/after-success path for the function that
computes the count using cacheKey), or alternatively set a distinct error
marker/short TTL instead of 0 so failures aren't interpreted as a real zero
count; reference instanceCountCache and cacheKey to locate and change the
behavior.

Comment on lines +3142 to +3175
describe('Data-sources-focused layout - structural verification', () => {
const { readFileSync } = require('fs')
const { resolve } = require('path')
const source = readFileSync(
resolve(__dirname, '../pages/data-sources/index.vue'),
'utf-8',
)

it('keeps data-source catalog guidance and removes telemetry dashboard copy', () => {
expect(source).toContain('Data source catalog')
expect(source).not.toContain('Active workers')
expect(source).not.toContain('Estimated cost trend')
})

it('removes scheduled maintenance orchestration from this page', () => {
expect(source).not.toContain('Scheduled maintenance orchestration')
expect(source).not.toContain('maintenance-runs/trigger')
})

it('renders URL-first onboarding with provider detection and coming soon messaging', () => {
expect(source).toContain('Paste your source URLs')
expect(source).toContain('Add another')
expect(source).toContain('Detected:')
expect(source).toContain('onboarding is coming soon, sorry.')
expect(source).toContain('Add to project')
})

it('uses shadcn Select for knowledge graph dropdown styling consistency', () => {
expect(source).toContain('<Select v-model="selectedKnowledgeGraphId">')
expect(source).toContain('SelectTrigger')
expect(source).toContain('SelectContent')
expect(source).not.toContain('<select')
})
})
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

These assertions target the global /data-sources page, which is out of scope for this PR.

The new structural checks enforce copy/layout changes on pages/data-sources/index.vue, but this PR’s objective says the global sidebar /data-sources should remain unchanged. This creates requirement drift and brittle coupling to the wrong surface.

Please move these checks to KG-scoped pages (pages/knowledge-graphs/[kgId]/data-sources/new.vue and .../index.vue) or narrow them to non-breaking invariants.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/dev-ui/app/tests/data-sources.test.ts` around lines 3142 - 3175, The test
suite "Data-sources-focused layout - structural verification" currently reads
the global page via the source variable (readFileSync(resolve(__dirname,
'../pages/data-sources/index.vue'), 'utf-8')) and asserts on global copy/layout;
change the test to target the KG-scoped pages or narrow assertions: either
update the file path used by readFileSync to the KG pages (e.g.,
'../pages/knowledge-graphs/[kgId]/data-sources/index.vue' and
'../pages/knowledge-graphs/[kgId]/data-sources/new.vue') and keep the specific
assertions that belong to those KG-scoped pages, or keep the global page path
but remove/relax the assertions that reference global sidebar content (lines
asserting 'Data source catalog', 'Paste your source URLs', Select markup, etc.)
so they only check non-breaking invariants; ensure you update the describe name
to reflect KG-scoped verification if you move files and adjust any test strings
referencing 'Paste your source URLs', 'Detected:', '<Select
v-model="selectedKnowledgeGraphId">', and remove checks for 'Active
workers'/'Estimated cost trend' that are global.

Refresh commits updates tracked head only; advance extraction baseline on
successful sync. Parallelize wizard ingest prep and add phase1-style overview table.

Co-authored-by: Cursor <cursoragent@cursor.com>
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/dev-ui/app/pages/knowledge-graphs/[kgId]/data-sources/new.vue (1)

335-371: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

This changes the onboarding flow from sequential to parallel.

runParallelIngestionPrep() now starts every ingest-only sync at once, and the copy tells users the sources are prepared in parallel. The PR contract for this page is a sequential initial sync before the summary step, so this changes both load behavior and the user-visible progress model.

Suggested fix
-async function runParallelIngestionPrep() {
+async function runSequentialIngestionPrep() {
   const queue = createdSources.value.filter(
     (s) => s.syncStatus === 'idle' || s.syncStatus === 'failed' || s.syncStatus === 'queued',
   )
   if (queue.length === 0) {
     toast.error('No sources need preparation')
     return
   }

   syncRunActive.value = true
   syncRunTotal.value = queue.length
   syncCompletedInRun.value = 0
   readyForStats.value = false
   syncStepLabel.value = `0 / ${queue.length}`
-  syncActiveName.value = `${queue.length} source${queue.length === 1 ? '' : 's'}`
+  syncActiveName.value = null

   for (const target of queue) {
-    target.syncStatus = 'pending'
-    target.syncError = null
-  }
-
-  try {
-    await Promise.allSettled(
-      queue.map(async (target) => {
-        try {
-          await apiFetch(`/management/data-sources/${target.id}/sync`, {
-            method: 'POST',
-            body: { mode: 'ingest_only' },
-          })
-        } catch (err: unknown) {
-          target.syncStatus = 'failed'
-          target.syncError = err instanceof Error ? err.message : 'Preparation failed'
-        }
-      }),
-    )
-
-    await pollUntilAllTerminal(queue)
+    target.syncStatus = 'pending'
+    target.syncError = null
+    syncActiveName.value = target.name
+    try {
+      await apiFetch(`/management/data-sources/${target.id}/sync`, {
+        method: 'POST',
+        body: { mode: 'ingest_only' },
+      })
+      await pollUntilAllTerminal([target])
+    } catch (err: unknown) {
+      target.syncStatus = 'failed'
+      target.syncError = err instanceof Error ? err.message : 'Preparation failed'
+    }
+    syncCompletedInRun.value += 1
+    syncStepLabel.value = `${syncCompletedInRun.value} / ${queue.length}`
+  }

Also applies to: 575-577

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/dev-ui/app/pages/knowledge-graphs/`[kgId]/data-sources/new.vue around
lines 335 - 371, runParallelIngestionPrep currently fires all ingest-only syncs
in parallel (Promise.allSettled), which breaks the required sequential
onboarding flow; change it to run each source sequentially: keep the initial
queue creation and status initialization (createdSources, syncRunActive,
syncRunTotal, etc.), then replace the Promise.allSettled(...) block with a
for...of loop over queue where for each target you set
target.syncStatus='pending', await
apiFetch(`/management/data-sources/${target.id}/sync`, ...) inside a try/catch
to set target.syncStatus/'syncError' on failure, and after a successful apiFetch
await pollUntilAllTerminal([target]) (or call pollUntilAllTerminal for that
single target) before moving to the next, so each source completes its initial
sync sequentially before proceeding to the summary step.
♻️ Duplicate comments (2)
src/api/management/application/services/data_source_service.py (1)

603-604: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Validate pipeline_mode before emitting sync events.

Line 603 accepts arbitrary strings and forwards them at Line 663. Add an allowlist guard (full, ingest_only) to prevent invalid modes from entering lifecycle processing.

Proposed fix
     async def trigger_sync(
         self,
         user_id: str,
         ds_id: str,
         *,
         pipeline_mode: str = "full",
     ) -> DataSourceSyncRun:
@@
+        allowed_pipeline_modes = {"full", "ingest_only"}
+        if pipeline_mode not in allowed_pipeline_modes:
+            raise ValueError(
+                f"Unsupported pipeline_mode '{pipeline_mode}'. "
+                "Expected one of: full, ingest_only"
+            )
+
         has_manage = await self._check_permission(

As per coding guidelines, "Focus on major issues impacting performance, readability, maintainability and security."

Also applies to: 660-664

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/api/management/application/services/data_source_service.py` around lines
603 - 604, The function accepting pipeline_mode (returning DataSourceSyncRun)
currently forwards arbitrary strings to the sync lifecycle; add an allowlist
guard that only permits "full" and "ingest_only" before any event emission
(where pipeline_mode is forwarded), e.g. validate pipeline_mode at the start of
the method and either raise a ValueError or normalize to a default ("full") and
log the rejection for invalid values so no invalid mode reaches the
lifecycle/event emission code.
src/dev-ui/app/tests/data-sources.test.ts (1)

3142-3175: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

These structural assertions target the global /data-sources page, which is out of this PR’s scope.

This block enforces copy/layout expectations on pages/data-sources/index.vue even though the objective is KG-scoped onboarding/operations and to keep global /data-sources unchanged. Move these checks to KG-scoped pages or reduce them to stable invariants.

As per coding guidelines, "Validate changes against HyperFleet architecture standards from the linked architecture repository."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/dev-ui/app/tests/data-sources.test.ts` around lines 3142 - 3175, The test
suite describe('Data-sources-focused layout - structural verification')
currently asserts global /data-sources page content (via the source variable)
which is outside this PR’s KG-scoped changes; update the tests to either (a)
move these assertions into KG-scoped page tests (targeting the KG
onboarding/component files) or (b) narrow each it(...) to only assert stable,
KG-specific invariants (e.g., onboarding copy or Select component usage) and
remove checks for telemetry/dashboard/maintenance copy; also rename the describe
to reflect the KG scope and keep references to the unique symbols used here (the
describe name, the source variable, and the specific it blocks like the Select
assertion) so maintainers can locate and adjust the exact assertions.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/api/management/presentation/data_sources/models.py`:
- Around line 202-205: The DataSource model's connection_config field is being
returned raw and may leak secrets; update serialization to return only an
allowlisted, non-secret subset per adapter instead of
dict(ds.connection_config). Implement a sanitizer function (e.g.,
sanitize_connection_config(adapter_type, connection_config)) and call it
wherever connection_config is mapped to responses (references: connection_config
field in the model and the mapping sites around the areas noted: the blocks at
~234-235, ~479-482, and ~518-519) so that only explicit non-sensitive keys are
included in the response; ensure the sanitizer defaults to an empty/explicitly
filtered dict for unknown adapters.

In `@src/dev-ui/app/pages/knowledge-graphs/`[kgId]/data-sources/index.vue:
- Around line 328-335: The Prepare action can trigger duplicate POST /sync calls
because triggerSync(dsId, ...) remains clickable while a request is in flight or
when the data source is already syncing; fix this by adding a per-data-source
in-flight guard inside triggerSync (e.g., a Set or Map keyed by dsId) and by
short-circuiting when the data source's status indicates syncing before sending
the request; set the guard before the apiFetch call and clear it in finally so
the button is re-enabled after loadDataSources(); reference the existing
triggerSync(dsId: string, mode: 'full' | 'ingest_only') function to implement
the guard and ensure the UI Prepare button checks this in-flight state (and/or
the data-source's syncing status) to prevent duplicate submits.
- Around line 189-202: detectDefaultBranch currently fails for SSH-style URLs
and unauthenticated private repos; update it to first detect and parse SSH
git@github.com:owner/repo(.git) style URLs (e.g. via regex) to extract owner and
repo when new URL() throws, and modify the function signature to accept an
optional GitHub token param (e.g. detectDefaultBranch(url: string, token?:
string)): when a token is provided use it in the Authorization header for the
GitHub API request so private repos return their actual default_branch; only
return 'main' after trying both SSH parsing and an authenticated API call, and
apply the same change to the duplicate helper used elsewhere (the other
occurrence mentioned).

In `@src/dev-ui/app/pages/knowledge-graphs/`[kgId]/data-sources/new.vue:
- Around line 317-325: The pollUntilAllTerminal function currently uses
Promise.all over rows.map(refreshSourceSyncStatus) which aborts the entire poll
on the first transient fetch error; change it to run per-row calls that never
throw (e.g., use Promise.allSettled or wrap each refreshSourceSyncStatus(row)
with a .catch(() => null)) so a single fetch failure doesn't reject the whole
iteration, still update syncCompletedInRun.value and syncStepLabel.value based
on successfully refreshed rows, and keep the same timeout and delay behavior;
reference pollUntilAllTerminal and refreshSourceSyncStatus to locate and modify
the call site.

In `@src/dev-ui/app/tests/kg-data-sources-phase1.test.ts`:
- Around line 45-51: The test is asserting parallel ingestion prep but the
onboarding flow requires sequential initial sync; update the assertions in KG
wizard test to expect sequential behavior by asserting presence of
runSequentialIngestionPrep (and any sequential helpers used) and absence of
runParallelIngestionPrep/Promise.allSettled/pollUntilAllTerminal as appropriate
— locate the expectations around runParallelIngestionPrep, Promise.allSettled,
pollUntilAllTerminal and runSequentialIngestionPrep in the test and flip them so
the test requires runSequentialIngestionPrep and does not permit the parallel
helpers.

---

Outside diff comments:
In `@src/dev-ui/app/pages/knowledge-graphs/`[kgId]/data-sources/new.vue:
- Around line 335-371: runParallelIngestionPrep currently fires all ingest-only
syncs in parallel (Promise.allSettled), which breaks the required sequential
onboarding flow; change it to run each source sequentially: keep the initial
queue creation and status initialization (createdSources, syncRunActive,
syncRunTotal, etc.), then replace the Promise.allSettled(...) block with a
for...of loop over queue where for each target you set
target.syncStatus='pending', await
apiFetch(`/management/data-sources/${target.id}/sync`, ...) inside a try/catch
to set target.syncStatus/'syncError' on failure, and after a successful apiFetch
await pollUntilAllTerminal([target]) (or call pollUntilAllTerminal for that
single target) before moving to the next, so each source completes its initial
sync sequentially before proceeding to the summary step.

---

Duplicate comments:
In `@src/api/management/application/services/data_source_service.py`:
- Around line 603-604: The function accepting pipeline_mode (returning
DataSourceSyncRun) currently forwards arbitrary strings to the sync lifecycle;
add an allowlist guard that only permits "full" and "ingest_only" before any
event emission (where pipeline_mode is forwarded), e.g. validate pipeline_mode
at the start of the method and either raise a ValueError or normalize to a
default ("full") and log the rejection for invalid values so no invalid mode
reaches the lifecycle/event emission code.

In `@src/dev-ui/app/tests/data-sources.test.ts`:
- Around line 3142-3175: The test suite describe('Data-sources-focused layout -
structural verification') currently asserts global /data-sources page content
(via the source variable) which is outside this PR’s KG-scoped changes; update
the tests to either (a) move these assertions into KG-scoped page tests
(targeting the KG onboarding/component files) or (b) narrow each it(...) to only
assert stable, KG-specific invariants (e.g., onboarding copy or Select component
usage) and remove checks for telemetry/dashboard/maintenance copy; also rename
the describe to reflect the KG scope and keep references to the unique symbols
used here (the describe name, the source variable, and the specific it blocks
like the Select assertion) so maintainers can locate and adjust the exact
assertions.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 88a9a1dd-fac9-477e-a506-927cf69e1eed

📥 Commits

Reviewing files that changed from the base of the PR and between 781fa38 and 1c21207.

📒 Files selected for processing (15)
  • src/api/management/application/services/data_source_service.py
  • src/api/management/domain/aggregates/data_source.py
  • src/api/management/infrastructure/sync_lifecycle_handler.py
  • src/api/management/presentation/data_sources/models.py
  • src/api/management/presentation/data_sources/routes.py
  • src/api/tests/unit/management/application/test_data_source_service.py
  • src/api/tests/unit/management/infrastructure/test_sync_lifecycle_handler.py
  • src/api/tests/unit/management/presentation/test_data_sources_routes.py
  • src/dev-ui/app/pages/data-sources/index.vue
  • src/dev-ui/app/pages/knowledge-graphs/[kgId]/data-sources/index.vue
  • src/dev-ui/app/pages/knowledge-graphs/[kgId]/data-sources/new.vue
  • src/dev-ui/app/tests/data-sources.test.ts
  • src/dev-ui/app/tests/kg-data-sources-phase1.test.ts
  • src/dev-ui/app/tests/knowledge-graph-manage-workspace.test.ts
  • src/dev-ui/app/utils/kgDataSourcesCommits.ts
💤 Files with no reviewable changes (1)
  • src/api/management/presentation/data_sources/routes.py

Comment on lines +202 to +205
connection_config: dict[str, str] = Field(
default_factory=dict,
description="Adapter connection configuration (non-secret)",
)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Avoid returning raw connection_config in API responses without sanitization.

Serializing dict(ds.connection_config) directly can expose sensitive values if any credential-like fields are ever written into config. Return an allowlisted non-secret subset per adapter before mapping to response models.

Proposed hardening
+def _public_connection_config(adapter_type: str, cfg: dict[str, str]) -> dict[str, str]:
+    allowlist: dict[str, set[str]] = {
+        "github": {"repo_url", "owner", "repo", "branch"},
+    }
+    allowed = allowlist.get(adapter_type, set())
+    return {k: v for k, v in cfg.items() if k in allowed}
@@
-            connection_config=dict(ds.connection_config),
+            connection_config=_public_connection_config(
+                ds.adapter_type.value, dict(ds.connection_config)
+            ),
@@
-            connection_config=dict(ds.connection_config),
+            connection_config=_public_connection_config(
+                ds.adapter_type.value, dict(ds.connection_config)
+            ),

As per coding guidelines, "Focus on major issues impacting performance, readability, maintainability and security."

Also applies to: 234-235, 479-482, 518-519

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/api/management/presentation/data_sources/models.py` around lines 202 -
205, The DataSource model's connection_config field is being returned raw and
may leak secrets; update serialization to return only an allowlisted, non-secret
subset per adapter instead of dict(ds.connection_config). Implement a sanitizer
function (e.g., sanitize_connection_config(adapter_type, connection_config)) and
call it wherever connection_config is mapped to responses (references:
connection_config field in the model and the mapping sites around the areas
noted: the blocks at ~234-235, ~479-482, and ~518-519) so that only explicit
non-sensitive keys are included in the response; ensure the sanitizer defaults
to an empty/explicitly filtered dict for unknown adapters.

Comment on lines +189 to +202
async function detectDefaultBranch(url: string): Promise<string> {
try {
const parsed = new URL(url)
const [owner, repoRaw] = parsed.pathname.split('/').filter(Boolean)
const repo = repoRaw?.replace(/\.git$/, '')
if (!owner || !repo) return 'main'
const response = await fetch(`https://api.github.com/repos/${owner}/${repo}`)
if (!response.ok) return 'main'
const payload = (await response.json()) as { default_branch?: string }
return payload.default_branch ?? 'main'
} catch {
return 'main'
}
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Default-branch detection is wrong for private and SSH repos.

This helper falls back to main for every private repo and every git@... URL: the GitHub lookup is unauthenticated, and new URL('git@github.com:...') throws. Since this flow explicitly accepts both tokens and SSH URLs, sources with a non-main default branch will be created with the wrong tracked branch and can fail on first prep/sync.

Suggested fix
+function parseGitHubRepo(url: string): { owner: string; repo: string } | null {
+  if (url.startsWith('git@github.com:')) {
+    const path = url.slice('git@github.com:'.length).replace(/\.git$/, '')
+    const [owner, repo] = path.split('/')
+    return owner && repo ? { owner, repo } : null
+  }
+
+  const parsed = new URL(url)
+  if (parsed.hostname !== 'github.com') return null
+  const [owner, repoRaw] = parsed.pathname.split('/').filter(Boolean)
+  const repo = repoRaw?.replace(/\.git$/, '')
+  return owner && repo ? { owner, repo } : null
+}
+
-async function detectDefaultBranch(url: string): Promise<string> {
+async function detectDefaultBranch(url: string, token?: string): Promise<string> {
   try {
-    const parsed = new URL(url)
-    const [owner, repoRaw] = parsed.pathname.split('/').filter(Boolean)
-    const repo = repoRaw?.replace(/\.git$/, '')
-    if (!owner || !repo) return 'main'
-    const response = await fetch(`https://api.github.com/repos/${owner}/${repo}`)
+    const repoRef = parseGitHubRepo(url)
+    if (!repoRef) return 'main'
+    const response = await fetch(
+      `https://api.github.com/repos/${repoRef.owner}/${repoRef.repo}`,
+      {
+        headers: token ? { Authorization: `Bearer ${token}` } : undefined,
+      },
+    )
     if (!response.ok) return 'main'
     const payload = (await response.json()) as { default_branch?: string }
     return payload.default_branch ?? 'main'
   } catch {
     return 'main'
   }
 }
...
-      const branch = await detectDefaultBranch(url)
+      const branch = await detectDefaultBranch(url, addToken.value.trim() || undefined)

Also applies to: 225-240

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/dev-ui/app/pages/knowledge-graphs/`[kgId]/data-sources/index.vue around
lines 189 - 202, detectDefaultBranch currently fails for SSH-style URLs and
unauthenticated private repos; update it to first detect and parse SSH
git@github.com:owner/repo(.git) style URLs (e.g. via regex) to extract owner and
repo when new URL() throws, and modify the function signature to accept an
optional GitHub token param (e.g. detectDefaultBranch(url: string, token?:
string)): when a token is provided use it in the Authorization header for the
GitHub API request so private repos return their actual default_branch; only
return 'main' after trying both SSH parsing and an authenticated API call, and
apply the same change to the duplicate helper used elsewhere (the other
occurrence mentioned).

Comment thread src/dev-ui/app/pages/knowledge-graphs/[kgId]/data-sources/index.vue Outdated
Comment on lines +317 to +325
async function pollUntilAllTerminal(rows: CreatedSourceRow[], timeoutMs = 600_000) {
const started = Date.now()
while (Date.now() - started < timeoutMs) {
await Promise.all(rows.map((row) => refreshSourceSyncStatus(row)))
const finished = rows.filter((row) => isSyncTerminal(row.syncStatus as SyncRunStatus)).length
syncCompletedInRun.value = finished
syncStepLabel.value = `${finished} / ${rows.length}`
if (finished === rows.length) return
await new Promise((resolve) => setTimeout(resolve, 1500))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

A single status-fetch failure aborts the entire prep run.

Promise.all(rows.map(refreshSourceSyncStatus)) rejects on the first transient fetch error. Because the caller only has a finally, one polling failure stops the whole run with the remaining sources left in non-terminal states and no actionable error for the user.

Suggested fix
-    await Promise.all(rows.map((row) => refreshSourceSyncStatus(row)))
+    const results = await Promise.allSettled(
+      rows.map((row) => refreshSourceSyncStatus(row)),
+    )
+    for (const [index, result] of results.entries()) {
+      if (result.status === 'rejected' && !isSyncTerminal(rows[index]!.syncStatus as SyncRunStatus)) {
+        rows[index]!.syncStatus = 'failed'
+        rows[index]!.syncError = 'Failed to refresh sync status'
+      }
+    }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
async function pollUntilAllTerminal(rows: CreatedSourceRow[], timeoutMs = 600_000) {
const started = Date.now()
while (Date.now() - started < timeoutMs) {
await Promise.all(rows.map((row) => refreshSourceSyncStatus(row)))
const finished = rows.filter((row) => isSyncTerminal(row.syncStatus as SyncRunStatus)).length
syncCompletedInRun.value = finished
syncStepLabel.value = `${finished} / ${rows.length}`
if (finished === rows.length) return
await new Promise((resolve) => setTimeout(resolve, 1500))
async function pollUntilAllTerminal(rows: CreatedSourceRow[], timeoutMs = 600_000) {
const started = Date.now()
while (Date.now() - started < timeoutMs) {
const results = await Promise.allSettled(
rows.map((row) => refreshSourceSyncStatus(row)),
)
for (const [index, result] of results.entries()) {
if (result.status === 'rejected' && !isSyncTerminal(rows[index]!.syncStatus as SyncRunStatus)) {
rows[index]!.syncStatus = 'failed'
rows[index]!.syncError = 'Failed to refresh sync status'
}
}
const finished = rows.filter((row) => isSyncTerminal(row.syncStatus as SyncRunStatus)).length
syncCompletedInRun.value = finished
syncStepLabel.value = `${finished} / ${rows.length}`
if (finished === rows.length) return
await new Promise((resolve) => setTimeout(resolve, 1500))
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/dev-ui/app/pages/knowledge-graphs/`[kgId]/data-sources/new.vue around
lines 317 - 325, The pollUntilAllTerminal function currently uses Promise.all
over rows.map(refreshSourceSyncStatus) which aborts the entire poll on the first
transient fetch error; change it to run per-row calls that never throw (e.g.,
use Promise.allSettled or wrap each refreshSourceSyncStatus(row) with a
.catch(() => null)) so a single fetch failure doesn't reject the whole
iteration, still update syncCompletedInRun.value and syncStepLabel.value based
on successfully refreshed rows, and keep the same timeout and delay behavior;
reference pollUntilAllTerminal and refreshSourceSyncStatus to locate and modify
the call site.

Comment on lines +45 to +51
describe('KG wizard parallel ingestion prep', () => {
it('prepares sources in parallel', () => {
expect(newVue).toContain('runParallelIngestionPrep')
expect(newVue).toContain('Promise.allSettled')
expect(newVue).toContain('pollUntilAllTerminal')
expect(newVue).not.toContain('runSequentialIngestionPrep')
})
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Test encodes parallel prep while onboarding requirement is sequential initial sync.

Line 47–50 asserts parallel preparation (runParallelIngestionPrep / Promise.allSettled) and rejects sequential prep. That conflicts with the PR objective/spec flow (“sequential initial sync → summary”), so this test currently protects the wrong behavior.

As per coding guidelines, "Validate changes against HyperFleet architecture standards from the linked architecture repository."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/dev-ui/app/tests/kg-data-sources-phase1.test.ts` around lines 45 - 51,
The test is asserting parallel ingestion prep but the onboarding flow requires
sequential initial sync; update the assertions in KG wizard test to expect
sequential behavior by asserting presence of runSequentialIngestionPrep (and any
sequential helpers used) and absence of
runParallelIngestionPrep/Promise.allSettled/pollUntilAllTerminal as appropriate
— locate the expectations around runParallelIngestionPrep, Promise.allSettled,
pollUntilAllTerminal and runSequentialIngestionPrep in the test and flip them so
the test requires runSequentialIngestionPrep and does not permit the parallel
helpers.

Track last prepared commit and file count during ingest-only runs so the
overview can show branch freshness, file totals, and bulk prepare/check flows.

Co-authored-by: Cursor <cursoragent@cursor.com>
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/api/management/domain/aggregates/data_source.py (1)

316-346: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Enforce allowed pipeline_mode values at the aggregate boundary.

pipeline_mode is currently free-form and gets emitted as-is in SyncStarted. Invalid values can leak into downstream sync orchestration and cause undefined behavior.

Suggested fix
     def request_sync(
         self,
         sync_run_id: str,
         *,
         requested_by: str | None = None,
         pipeline_mode: str = "full",
     ) -> None:
@@
         if self._deleted:
             raise AggregateDeletedError("Cannot request sync on a deleted data source")
+        if pipeline_mode not in {"full", "ingest_only"}:
+            raise ValueError("pipeline_mode must be 'full' or 'ingest_only'")
         self._pending_events.append(
             SyncStarted(

As per coding guidelines, "Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity. Validate changes against HyperFleet architecture standards from the linked architecture repository."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/api/management/domain/aggregates/data_source.py` around lines 316 - 346,
In the method that appends the SyncStarted event (where
self._pending_events.append(SyncStarted(...)) is called), validate the
pipeline_mode argument against the allowed values (e.g., "full" and
"ingest_only") before creating the SyncStarted event; if the value is invalid,
raise a clear domain error (e.g., ValueError or a domain-specific exception)
with a message stating the allowed values, and only pass the validated
pipeline_mode into SyncStarted to prevent invalid modes leaking downstream.
♻️ Duplicate comments (1)
src/api/management/presentation/data_sources/models.py (1)

240-243: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Sanitize connection_config before serializing API responses.

Both mappers still return raw connection_config via dict(ds.connection_config), which can leak secrets if sensitive keys are ever present. Please apply an adapter-specific allowlist sanitizer before returning this field.

Proposed hardening
+def _public_connection_config(
+    adapter_type: str, cfg: dict[str, str]
+) -> dict[str, str]:
+    allowlist: dict[str, set[str]] = {
+        "github": {"repo_url", "owner", "repo", "branch"},
+    }
+    allowed = allowlist.get(adapter_type, set())
+    return {k: v for k, v in cfg.items() if k in allowed}
@@
-            connection_config=dict(ds.connection_config),
+            connection_config=_public_connection_config(
+                ds.adapter_type.value, dict(ds.connection_config)
+            ),
@@
-            connection_config=dict(ds.connection_config),
+            connection_config=_public_connection_config(
+                ds.adapter_type.value, dict(ds.connection_config)
+            ),

As per coding guidelines, "Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity."

Also applies to: 532-535

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/api/management/presentation/data_sources/models.py` around lines 240 -
243, The mapper is returning raw connection_config via
dict(ds.connection_config), which can leak secrets; replace those returns with a
sanitized version by applying an adapter-specific allowlist sanitizer (e.g.,
sanitize_connection_config(adapter=ds.adapter or ds.type,
config=dict(ds.connection_config))) and use that sanitized dict in the API model
construction; add or reuse a small helper (sanitize_connection_config) that
takes adapter identifier and raw config, filters to an explicit allowlist per
adapter, and call it in both places where dict(ds.connection_config) is used so
only non-sensitive keys are serialized.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/api/ingestion/ports/services.py`:
- Line 8: Remove the unused import JobPackageId from the top of
src/api/ingestion/ports/services.py to satisfy linting (Ruff F401); update the
import list so only needed symbols remain (the interface IIngestionService.run
now returns IngestionRunResult, so ensure IngestionRunResult is the referenced
return type and no JobPackageId references remain in the file or in
IIngestionService); run the linter to confirm the unused-import error is
resolved.

In `@src/api/management/domain/aggregates/data_source.py`:
- Around line 398-401: The guards currently only update
self.last_prepared_commit and self.last_prepared_file_count when values are
truthy, allowing stale metadata to persist; change the logic in the DataSource
aggregate so these fields are updated even when the latest prepare returns
None/falsey — either remove the conditional guards and assign prepared_commit
and prepared_file_count unconditionally to self.last_prepared_commit and
self.last_prepared_file_count, or at minimum change the commit check to "is not
None" so a deliberate empty/zero value clears the previous value.

In `@src/dev-ui/app/pages/knowledge-graphs/`[kgId]/data-sources/index.vue:
- Around line 362-387: prepareAllDataSources currently sets preparingAll to
prevent re-clicks but does not mark individual sources as in-flight, so
per-source triggers (e.g., triggerSync) can start duplicate syncs; to fix,
introduce and use a Set like syncingIds, add each ds.id to syncingIds before
calling apiFetch in prepareAllDataSources and remove it in the settled handlers
(or finally), check syncingIds in the per-source triggerSync and disable
per-source actions when an id is present, and ensure UI binding uses syncingIds
(and preparingAll) so both bulk and single actions coordinate; update references
prepareAllDataSources, preparingAll, sourcesNeedingPrepare, triggerSync,
syncingIds, loadDataSources, and startPolling accordingly.

---

Outside diff comments:
In `@src/api/management/domain/aggregates/data_source.py`:
- Around line 316-346: In the method that appends the SyncStarted event (where
self._pending_events.append(SyncStarted(...)) is called), validate the
pipeline_mode argument against the allowed values (e.g., "full" and
"ingest_only") before creating the SyncStarted event; if the value is invalid,
raise a clear domain error (e.g., ValueError or a domain-specific exception)
with a message stating the allowed values, and only pass the validated
pipeline_mode into SyncStarted to prevent invalid modes leaking downstream.

---

Duplicate comments:
In `@src/api/management/presentation/data_sources/models.py`:
- Around line 240-243: The mapper is returning raw connection_config via
dict(ds.connection_config), which can leak secrets; replace those returns with a
sanitized version by applying an adapter-specific allowlist sanitizer (e.g.,
sanitize_connection_config(adapter=ds.adapter or ds.type,
config=dict(ds.connection_config))) and use that sanitized dict in the API model
construction; add or reuse a small helper (sanitize_connection_config) that
takes adapter identifier and raw config, filters to an explicit allowlist per
adapter, and call it in both places where dict(ds.connection_config) is used so
only non-sensitive keys are serialized.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 18cf79c4-e906-4311-8567-7ff47570c447

📥 Commits

Reviewing files that changed from the base of the PR and between 1c21207 and e0001b8.

📒 Files selected for processing (19)
  • src/api/infrastructure/migrations/versions/g9h0i1j2k3l4_add_prepared_fields_to_data_sources.py
  • src/api/ingestion/application/services/ingestion_service.py
  • src/api/ingestion/application/value_objects.py
  • src/api/ingestion/infrastructure/event_handler.py
  • src/api/ingestion/ports/services.py
  • src/api/main.py
  • src/api/management/domain/aggregates/data_source.py
  • src/api/management/infrastructure/models/data_source.py
  • src/api/management/infrastructure/repositories/data_source_repository.py
  • src/api/management/infrastructure/sync_lifecycle_handler.py
  • src/api/management/presentation/data_sources/models.py
  • src/api/tests/unit/ingestion/application/test_ingestion_service.py
  • src/api/tests/unit/ingestion/infrastructure/test_ingestion_event_handler.py
  • src/api/tests/unit/management/infrastructure/test_sync_lifecycle_handler.py
  • src/api/tests/unit/management/test_data_source.py
  • src/api/tests/unit/test_sessioned_ingestion_handler.py
  • src/dev-ui/app/pages/knowledge-graphs/[kgId]/data-sources/index.vue
  • src/dev-ui/app/tests/kg-data-sources-phase1.test.ts
  • src/dev-ui/app/utils/kgDataSourcesCommits.ts

from typing import Protocol

from ingestion.application.value_objects import IngestionRunResult
from shared_kernel.job_package.value_objects import JobPackageId
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Remove stale JobPackageId import to unblock CI.

IIngestionService.run now returns IngestionRunResult, and this import is unused, which is currently failing Ruff (F401) in pipeline.

Suggested fix
-from shared_kernel.job_package.value_objects import JobPackageId

As per coding guidelines, "Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity. Validate changes against HyperFleet architecture standards from the linked architecture repository."

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
from shared_kernel.job_package.value_objects import JobPackageId
🧰 Tools
🪛 GitHub Actions: Tests / 0_Test Python 3.13.txt

[error] 8-8: ruff (F401) imported but unused: shared_kernel.job_package.value_objects.JobPackageId.

🪛 GitHub Actions: Tests / Test Python 3.13

[error] 8-8: Ruff (F401) unused import: shared_kernel.job_package.value_objects.JobPackageId imported but unused. Remove unused import.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/api/ingestion/ports/services.py` at line 8, Remove the unused import
JobPackageId from the top of src/api/ingestion/ports/services.py to satisfy
linting (Ruff F401); update the import list so only needed symbols remain (the
interface IIngestionService.run now returns IngestionRunResult, so ensure
IngestionRunResult is the referenced return type and no JobPackageId references
remain in the file or in IIngestionService); run the linter to confirm the
unused-import error is resolved.

Comment on lines +398 to +401
if prepared_commit:
self.last_prepared_commit = prepared_commit
if prepared_file_count is not None:
self.last_prepared_file_count = prepared_file_count
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Avoid stale prepared metadata when latest prepare omits values.

The current guards only write when values are truthy/non-None, so old last_prepared_commit / last_prepared_file_count can persist across newer runs and misrepresent freshness/no-change state.

Suggested fix
-        if prepared_commit:
-            self.last_prepared_commit = prepared_commit
-        if prepared_file_count is not None:
-            self.last_prepared_file_count = prepared_file_count
+        self.last_prepared_commit = prepared_commit
+        self.last_prepared_file_count = prepared_file_count

As per coding guidelines, "Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity. Validate changes against HyperFleet architecture standards from the linked architecture repository."

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if prepared_commit:
self.last_prepared_commit = prepared_commit
if prepared_file_count is not None:
self.last_prepared_file_count = prepared_file_count
self.last_prepared_commit = prepared_commit
self.last_prepared_file_count = prepared_file_count
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/api/management/domain/aggregates/data_source.py` around lines 398 - 401,
The guards currently only update self.last_prepared_commit and
self.last_prepared_file_count when values are truthy, allowing stale metadata to
persist; change the logic in the DataSource aggregate so these fields are
updated even when the latest prepare returns None/falsey — either remove the
conditional guards and assign prepared_commit and prepared_file_count
unconditionally to self.last_prepared_commit and self.last_prepared_file_count,
or at minimum change the commit check to "is not None" so a deliberate
empty/zero value clears the previous value.

Comment on lines +362 to +387
async function prepareAllDataSources() {
const queue = sourcesNeedingPrepare.value
if (queue.length === 0) {
toast.error('No data sources need preparation')
return
}

preparingAll.value = true
try {
await Promise.allSettled(
queue.map((ds) =>
apiFetch(`/management/data-sources/${ds.id}/sync`, {
method: 'POST',
body: { mode: 'ingest_only' },
}),
),
)
toast.success(`Preparing ${queue.length} data source${queue.length === 1 ? '' : 's'}`)
await loadDataSources()
if (hasAnyActiveSync(dataSources.value)) startPolling()
} catch {
toast.error('Failed to start preparation')
} finally {
preparingAll.value = false
}
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Bulk prepare action lacks in-flight guard for individual sources.

While preparingAll prevents re-clicking the bulk button, the individual triggerSync calls in Promise.allSettled don't coordinate with single-source sync triggers. If a user clicks "Prepare data sources" and then quickly clicks a per-source action before the bulk completes, duplicate syncs can be enqueued.

Suggested approach

Track in-flight sync IDs in a Set and check before triggering:

+const syncingIds = ref<Set<string>>(new Set())
+
 async function prepareAllDataSources() {
   const queue = sourcesNeedingPrepare.value
   if (queue.length === 0) {
     toast.error('No data sources need preparation')
     return
   }

   preparingAll.value = true
+  queue.forEach((ds) => syncingIds.value.add(ds.id))
   try {
     await Promise.allSettled(
       queue.map((ds) =>
         apiFetch(`/management/data-sources/${ds.id}/sync`, {
           method: 'POST',
           body: { mode: 'ingest_only' },
         }),
       ),
     )
     // ...
   } finally {
     preparingAll.value = false
+    queue.forEach((ds) => syncingIds.value.delete(ds.id))
   }
 }

Then use syncingIds to disable per-source actions.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/dev-ui/app/pages/knowledge-graphs/`[kgId]/data-sources/index.vue around
lines 362 - 387, prepareAllDataSources currently sets preparingAll to prevent
re-clicks but does not mark individual sources as in-flight, so per-source
triggers (e.g., triggerSync) can start duplicate syncs; to fix, introduce and
use a Set like syncingIds, add each ds.id to syncingIds before calling apiFetch
in prepareAllDataSources and remove it in the settled handlers (or finally),
check syncingIds in the per-source triggerSync and disable per-source actions
when an id is present, and ensure UI binding uses syncingIds (and preparingAll)
so both bulk and single actions coordinate; update references
prepareAllDataSources, preparingAll, sourcesNeedingPrepare, triggerSync,
syncingIds, loadDataSources, and startPolling accordingly.

Rework the manage overview into a phased workspace hub and add unpulled-commit
tracking on data sources so ingestion status matches a git-pull mental model.

Co-authored-by: Cursor <cursoragent@cursor.com>
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
src/api/management/domain/aggregates/data_source.py (1)

311-346: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Constrain pipeline_mode at the domain boundary.

Ingestion logic branches on exact string equality (pipeline_mode == "ingest_only"), but both DataSource.request_sync(..., pipeline_mode: str = "full") and SyncStarted.pipeline_mode: str = "full" still accept unconstrained str, so invalid values can silently fall back to the "full" path. HTTP is already constrained (TriggerSyncRequest.mode: Literal["full","ingest_only"]), but non-HTTP/internal callers (or crafted outbox payloads) bypass that.

Proposed direction
-from typing import TYPE_CHECKING
+from typing import TYPE_CHECKING, Literal
@@
-        pipeline_mode: str = "full",
+        pipeline_mode: Literal["full", "ingest_only"] = "full",

Tighten pipeline_mode consistently at the aggregate/event (and ideally DataSourceService.trigger_sync, which is also pipeline_mode: str) to the existing SyncPipelineMode = Literal["full", "ingest_only"].

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/api/management/domain/aggregates/data_source.py` around lines 311 - 346,
The pipeline_mode parameter is currently an unconstrained str in
DataSource.request_sync and carried onto the SyncStarted event, allowing invalid
values to slip through; change the type annotations for DataSource.request_sync,
SyncStarted.pipeline_mode, and DataSourceService.trigger_sync to use a shared
SyncPipelineMode = Literal["full","ingest_only"] type and add a runtime check in
DataSource.request_sync (and/or in SyncStarted construction) to validate the
passed pipeline_mode, raising a ValueError or similar if it's not one of the
allowed literals so callers and outbox payloads cannot silently default to the
"full" branch.
src/dev-ui/app/pages/knowledge-graphs/[kgId]/data-sources/index.vue (1)

355-371: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Refresh the full KG here and inspect the settled results before showing success.

Under ?focus=maintain, this only refreshes visibleDataSources, so hidden sources never get their branch refs updated and newly stale repos can stay hidden. On top of that, Promise.allSettled() never throws, so this code can still toast “Up to date” when every refresh request failed. Refresh dataSources.value, check how many calls actually succeeded, and use that same source list for the disabled state at Line 639.

Proposed fix
 async function checkAllCommitRefs() {
-  if (visibleDataSources.value.length === 0) return
+  if (dataSources.value.length === 0) return
   checkingAllCommits.value = true
   try {
-    await Promise.allSettled(
-      visibleDataSources.value.map((ds) =>
+    const results = await Promise.allSettled(
+      dataSources.value.map((ds) =>
         apiFetch(`/management/data-sources/${ds.id}/commit-refs/refresh`, { method: 'POST' }),
       ),
     )
+    const refreshedCount = results.filter((result) => result.status === 'fulfilled').length
+    if (refreshedCount === 0) {
+      toast.error('Failed to check for new commits')
+      return
+    }
+
     await loadDataSources()
     const unpulled = visibleDataSources.value.filter((ds) => hasUnpulledCommits(ds))
     if (unpulled.length === 0) {
       toast.success('Up to date with remote branches')
     } else {
@@
-                  :disabled="checkingAllCommits || visibleDataSources.length === 0"
+                  :disabled="checkingAllCommits || dataSources.length === 0"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/dev-ui/app/pages/knowledge-graphs/`[kgId]/data-sources/index.vue around
lines 355 - 371, Replace the refresh of only visibleDataSources with a refresh
over dataSources.value, await Promise.allSettled on apiFetch calls for each ds
in dataSources.value, then inspect the settled results to build a set of
successfully refreshed source IDs (check PromiseResult.status === 'fulfilled'),
call loadDataSources() to reload the full list, compute unpulled by filtering
the reloaded dataSources.value by (1) membership in the succeeded-ID set and (2)
hasUnpulledCommits(ds), and finally show toasts based on the actual succeeded
set (if none succeeded show an error toast, otherwise show “Up to date” only
when succeeded sources have no unpulled commits). Also ensure the same
succeeded-ID set is used for any disabled-state logic that referenced the
previous visibleDataSources.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/dev-ui/app/pages/knowledge-graphs/`[kgId]/manage.vue:
- Around line 451-484: The loop in loadOverviewMetrics iterates dataSources
serially and should be parallelized: replace the for-await loop over dataSources
with a single Promise.allSettled over dataSources.map(ds =>
apiFetch(`/management/data-sources/${ds.id}/sync-runs`) returning the runs
alongside the original ds), then iterate the settled results to compute each row
using the same logic (call latestSyncRun on successful results,
resolvePrepStatusLabel, mark statusVariant 'success' for 'ingested'/'completed',
treat rejected promises or missing runs with the default 'not prepared'), update
prepared using isIngestionPreparedAtHead(ds) (still checked per ds), and push
WorkspaceHubSourceRow objects into rows; ensure errors from rejected promises do
not throw and that behavior for default statuses remains identical to existing
status/statusVariant initialization.

---

Outside diff comments:
In `@src/api/management/domain/aggregates/data_source.py`:
- Around line 311-346: The pipeline_mode parameter is currently an unconstrained
str in DataSource.request_sync and carried onto the SyncStarted event, allowing
invalid values to slip through; change the type annotations for
DataSource.request_sync, SyncStarted.pipeline_mode, and
DataSourceService.trigger_sync to use a shared SyncPipelineMode =
Literal["full","ingest_only"] type and add a runtime check in
DataSource.request_sync (and/or in SyncStarted construction) to validate the
passed pipeline_mode, raising a ValueError or similar if it's not one of the
allowed literals so callers and outbox payloads cannot silently default to the
"full" branch.

In `@src/dev-ui/app/pages/knowledge-graphs/`[kgId]/data-sources/index.vue:
- Around line 355-371: Replace the refresh of only visibleDataSources with a
refresh over dataSources.value, await Promise.allSettled on apiFetch calls for
each ds in dataSources.value, then inspect the settled results to build a set of
successfully refreshed source IDs (check PromiseResult.status === 'fulfilled'),
call loadDataSources() to reload the full list, compute unpulled by filtering
the reloaded dataSources.value by (1) membership in the succeeded-ID set and (2)
hasUnpulledCommits(ds), and finally show toasts based on the actual succeeded
set (if none succeeded show an error toast, otherwise show “Up to date” only
when succeeded sources have no unpulled commits). Also ensure the same
succeeded-ID set is used for any disabled-state logic that referenced the
previous visibleDataSources.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 1328d988-457b-4301-b5c6-98291775ffd6

📥 Commits

Reviewing files that changed from the base of the PR and between e0001b8 and 3012df5.

📒 Files selected for processing (15)
  • src/api/main.py
  • src/api/management/domain/aggregates/data_source.py
  • src/api/management/domain/commit_pull_state.py
  • src/api/management/presentation/data_sources/models.py
  • src/api/management/presentation/data_sources/routes.py
  • src/api/tests/unit/management/domain/test_commit_pull_state.py
  • src/api/tests/unit/management/infrastructure/test_sync_lifecycle_handler.py
  • src/api/tests/unit/management/test_data_source.py
  • src/dev-ui/app/pages/knowledge-graphs/[kgId]/data-sources/index.vue
  • src/dev-ui/app/pages/knowledge-graphs/[kgId]/manage.vue
  • src/dev-ui/app/tests/kg-data-sources-phase1.test.ts
  • src/dev-ui/app/tests/kg-manage-workspace-hub.test.ts
  • src/dev-ui/app/tests/knowledge-graph-manage-workspace.test.ts
  • src/dev-ui/app/utils/kgDataSourcesCommits.ts
  • src/dev-ui/app/utils/kgManageWorkspaceHub.ts

Comment thread src/dev-ui/app/pages/knowledge-graphs/[kgId]/manage.vue
aredenba-rh and others added 3 commits May 28, 2026 22:37
* feat(ui): align graph management step with k-extract phase2 layout

Rework the design chat, schema/session panels, and mode switcher with locked
extraction modes until the workspace transitions to extraction operations.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(ui): rename graph management chat title to Graph Management Assistant

Co-authored-by: Cursor <cursoragent@cursor.com>

* docs(extraction): specify sticky session chat turns and runtime

Document Graph Management chat as NDJSON streaming turns inside sticky
Claude Agent SDK containers with JobPackage gating and UI mode skills.

Closes #738

Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
…#744)

Introduce sticky-session-aware chat orchestration with JobPackage gating,
UI-mode skill overlays, and a tracer-bullet deterministic agent. Closes #739.
Closes #740.

Co-authored-by: Cursor <cursoragent@cursor.com>
Stream NDJSON chat turns with thinking/wait activity lines and reload session
history after each turn. Closes #741.

Co-authored-by: Cursor <cursoragent@cursor.com>
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/api/extraction/application/chat_turn_service.py`:
- Line 136: The code unconditionally sets
session.runtime_context["job_package"]["phase"] =
SessionJobPackagePhase.READY.value which can clobber intentional non-wait modes;
change this to only set the phase when a JobPackage actually exists and when its
current phase indicates it should transition to READY (e.g., phase is
missing/falsy or equals an expected prior state) — check
session.runtime_context.get("job_package") and the existing
job_package.get("phase") before assigning, and skip assignment when the current
phase represents a non-wait or NOT_REQUIRED mode so downstream
behavior/telemetry is preserved.

In `@src/api/extraction/dependencies.py`:
- Around line 62-77: The block in get_extraction_chat_turn_service duplicates
construction of ExtractionAgentSessionService (using
ExtractionAgentSessionRepository,
ExtractionSkillResolutionService/ExtractionSkillOverrideRepository, and
ExtractionSessionRunMetricsReader) already built earlier; refactor by creating
the ExtractionAgentSessionService once and reusing that instance instead of
reconstructing it (e.g., extract the shared construction into a single local
variable or helper and reference that variable for session_service), keeping the
same dependencies: ExtractionAgentSessionService,
ExtractionAgentSessionRepository(session),
ExtractionSkillResolutionService(override_repository=ExtractionSkillOverrideRepository()),
ExtractionSessionRunMetricsReader(session) and the sticky_runtime_manager
parameter.

In `@src/api/extraction/presentation/routes.py`:
- Around line 182-192: Wrap the async iteration in event_stream() in a
try/except so any exception from service.stream_chat_turn(...) is caught; on
exception yield a final NDJSON terminal event (e.g.
json.dumps({"type":"done","error": str(e)}) + "\n") before returning to ensure
the client receives a terminal "done" event, and optionally log the error with
the same context; keep existing yields for normal events and still return
StreamingResponse(event_stream(), media_type="application/x-ndjson").

In `@src/api/tests/unit/extraction/application/test_chat_turn_service.py`:
- Line 11: Remove the unused import ExtractionSkillResolutionService from the
test file; locate the import statement "from
extraction.application.skill_resolution_service import
ExtractionSkillResolutionService" in test_chat_turn_service.py and delete it (or
replace it with any actually used symbol from that module if needed) so the
unused-import Ruff (F401) error is resolved.

In `@src/dev-ui/app/components/extraction/SharedConversationPanel.vue`:
- Around line 124-145: The renderAssistantHtml function currently injects
untrusted URLs directly into href via the link-replacement regex, enabling
javascript: or quote-based attribute XSS when the output is used with v-html;
fix by validating and sanitizing the captured URL in the link replacement step
inside renderAssistantHtml: only allow safe schemes (e.g., http, https, mailto,
and relative paths), reject or replace unsafe ones with a safe placeholder
(e.g., '#'), HTML-escape or URL-encode the href value, and add safe attributes
like rel="noopener noreferrer" and target="_blank" as appropriate; update the
/\[([^\]]+)\]\(([^)]+)\)/g replacement to call a sanitizer helper (e.g.,
sanitizeUrl or isSafeScheme) before returning the anchor HTML so v-html never
receives an unsafe href.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 0010ff04-feed-412b-a464-a0c859169a97

📥 Commits

Reviewing files that changed from the base of the PR and between 3012df5 and 8dfb95c.

📒 Files selected for processing (30)
  • specs/extraction/agent-sessions.spec.md
  • specs/extraction/chat-turns.spec.md
  • specs/extraction/operations.spec.md
  • specs/extraction/sticky-session-runtime.spec.md
  • specs/index.spec.md
  • specs/nfr/workload-execution.spec.md
  • src/api/extraction/application/__init__.py
  • src/api/extraction/application/agent_session_service.py
  • src/api/extraction/application/chat_turn_service.py
  • src/api/extraction/application/job_package_gate.py
  • src/api/extraction/application/skill_resolution_service.py
  • src/api/extraction/dependencies.py
  • src/api/extraction/domain/value_objects.py
  • src/api/extraction/infrastructure/deterministic_chat_agent.py
  • src/api/extraction/infrastructure/ingestion_readiness_reader.py
  • src/api/extraction/ports/chat_agent.py
  • src/api/extraction/ports/ingestion_readiness.py
  • src/api/extraction/presentation/models.py
  • src/api/extraction/presentation/routes.py
  • src/api/tests/unit/extraction/application/test_chat_turn_service.py
  • src/api/tests/unit/extraction/application/test_job_package_gate.py
  • src/dev-ui/app/components/extraction/SharedConversationPanel.vue
  • src/dev-ui/app/pages/knowledge-graphs/[kgId]/manage.vue
  • src/dev-ui/app/tests/kg-extraction-chat.test.ts
  • src/dev-ui/app/tests/kg-graph-management-artifacts.test.ts
  • src/dev-ui/app/tests/kg-graph-management-modes.test.ts
  • src/dev-ui/app/tests/knowledge-graph-manage-workspace.test.ts
  • src/dev-ui/app/utils/kgExtractionChat.ts
  • src/dev-ui/app/utils/kgGraphManagement.ts
  • src/dev-ui/app/utils/kgGraphManagementArtifacts.ts

Comment thread src/api/extraction/application/chat_turn_service.py Outdated
Comment thread src/api/extraction/dependencies.py
Comment on lines +182 to +192
async def event_stream():
async for event in service.stream_chat_turn(
user_id=current_user.user_id.value,
knowledge_graph_id=knowledge_graph_id,
mode=mode,
ui_mode=request.graph_management_ui_mode,
message=request.message,
):
yield json.dumps(event) + "\n"

return StreamingResponse(event_stream(), media_type="application/x-ndjson")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Wrap stream generation failures and emit a terminal error event.

event_stream() has no exception handling. If service.stream_chat_turn(...) raises, clients receive a truncated NDJSON stream with no terminal event, which is brittle for UI state handling. Add a guarded error done event before ending the stream.

Suggested fix
 async def event_stream():
-    async for event in service.stream_chat_turn(
-        user_id=current_user.user_id.value,
-        knowledge_graph_id=knowledge_graph_id,
-        mode=mode,
-        ui_mode=request.graph_management_ui_mode,
-        message=request.message,
-    ):
-        yield json.dumps(event) + "\n"
+    try:
+        async for event in service.stream_chat_turn(
+            user_id=current_user.user_id.value,
+            knowledge_graph_id=knowledge_graph_id,
+            mode=mode,
+            ui_mode=request.graph_management_ui_mode,
+            message=request.message,
+        ):
+            yield json.dumps(event) + "\n"
+    except Exception:
+        yield json.dumps(
+            {
+                "type": "done",
+                "ok": False,
+                "error": {"code": "STREAM_FAILED", "message": "Chat stream failed."},
+            }
+        ) + "\n"
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
async def event_stream():
async for event in service.stream_chat_turn(
user_id=current_user.user_id.value,
knowledge_graph_id=knowledge_graph_id,
mode=mode,
ui_mode=request.graph_management_ui_mode,
message=request.message,
):
yield json.dumps(event) + "\n"
return StreamingResponse(event_stream(), media_type="application/x-ndjson")
async def event_stream():
try:
async for event in service.stream_chat_turn(
user_id=current_user.user_id.value,
knowledge_graph_id=knowledge_graph_id,
mode=mode,
ui_mode=request.graph_management_ui_mode,
message=request.message,
):
yield json.dumps(event) + "\n"
except Exception:
yield json.dumps(
{
"type": "done",
"ok": False,
"error": {"code": "STREAM_FAILED", "message": "Chat stream failed."},
}
) + "\n"
return StreamingResponse(event_stream(), media_type="application/x-ndjson")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/api/extraction/presentation/routes.py` around lines 182 - 192, Wrap the
async iteration in event_stream() in a try/except so any exception from
service.stream_chat_turn(...) is caught; on exception yield a final NDJSON
terminal event (e.g. json.dumps({"type":"done","error": str(e)}) + "\n") before
returning to ensure the client receives a terminal "done" event, and optionally
log the error with the same context; keep existing yields for normal events and
still return StreamingResponse(event_stream(),
media_type="application/x-ndjson").

Comment thread src/api/tests/unit/extraction/application/test_chat_turn_service.py Outdated
Comment on lines +124 to +145
function renderAssistantHtml(text: string): string {
let s = text.replace(/&/g, '&amp;').replace(/</g, '&lt;').replace(/>/g, '&gt;')
s = s.replace(/\*\*([^*]+)\*\*/g, '<strong class="font-semibold text-foreground">$1</strong>')
s = s.replace(
/`([^`]+)`/g,
'<code class="rounded bg-muted px-1 py-0.5 text-xs font-mono text-foreground">$1</code>',
)
s = s.replace(
/^> (.+)$/gm,
'<p class="my-2 border-l-2 border-amber-500/60 pl-3 text-sm text-muted-foreground italic">$1</p>',
)
s = s.replace(
/\[([^\]]+)\]\(([^)]+)\)/g,
'<a class="text-primary font-medium underline underline-offset-2 hover:text-primary/90" href="$2">$1</a>',
)
s = s.replace(/## (.+)$/gm, '<h3 class="text-base font-semibold mt-3 mb-1 text-foreground">$1</h3>')
s = s.replace(/### (.+)$/gm, '<h4 class="text-sm font-semibold mt-2 text-foreground">$1</h4>')
s = s.replace(/^---$/gm, '<hr class="my-3 border-border" />')
s = s.replace(/\n\n+/g, '<br /><br />')
s = s.replace(/\n/g, '<br />')
return s
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Sanitize markdown links before v-html rendering (XSS risk)

Line 137 interpolates untrusted URL text directly into href, and Line 293 renders the result with v-html. This allows unsafe schemes (for example javascript:) and quote-based attribute injection.

Proposed fix
+function sanitizeHref(raw: string): string | null {
+  const value = raw.trim()
+  if (value.startsWith('/')) return value.replace(/"/g, '%22')
+  try {
+    const parsed = new URL(value)
+    if (!['http:', 'https:', 'mailto:'].includes(parsed.protocol)) return null
+    return parsed.href.replace(/"/g, '%22')
+  } catch {
+    return null
+  }
+}
+
 function renderAssistantHtml(text: string): string {
-  let s = text.replace(/&/g, '&amp;').replace(/</g, '&lt;').replace(/>/g, '&gt;')
+  let s = text
+    .replace(/&/g, '&amp;')
+    .replace(/</g, '&lt;')
+    .replace(/>/g, '&gt;')
+    .replace(/"/g, '&quot;')
+    .replace(/'/g, '&`#39`;')
@@
-  s = s.replace(
-    /\[([^\]]+)\]\(([^)]+)\)/g,
-    '<a class="text-primary font-medium underline underline-offset-2 hover:text-primary/90" href="$2">$1</a>',
-  )
+  s = s.replace(/\[([^\]]+)\]\(([^)]+)\)/g, (_, label: string, rawHref: string) => {
+    const href = sanitizeHref(rawHref)
+    if (!href) return label
+    return `<a class="text-primary font-medium underline underline-offset-2 hover:text-primary/90" href="${href}" target="_blank" rel="noopener noreferrer">${label}</a>`
+  })

As per coding guidelines **: Focus on major issues impacting performance, readability, maintainability and security.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
function renderAssistantHtml(text: string): string {
let s = text.replace(/&/g, '&amp;').replace(/</g, '&lt;').replace(/>/g, '&gt;')
s = s.replace(/\*\*([^*]+)\*\*/g, '<strong class="font-semibold text-foreground">$1</strong>')
s = s.replace(
/`([^`]+)`/g,
'<code class="rounded bg-muted px-1 py-0.5 text-xs font-mono text-foreground">$1</code>',
)
s = s.replace(
/^> (.+)$/gm,
'<p class="my-2 border-l-2 border-amber-500/60 pl-3 text-sm text-muted-foreground italic">$1</p>',
)
s = s.replace(
/\[([^\]]+)\]\(([^)]+)\)/g,
'<a class="text-primary font-medium underline underline-offset-2 hover:text-primary/90" href="$2">$1</a>',
)
s = s.replace(/## (.+)$/gm, '<h3 class="text-base font-semibold mt-3 mb-1 text-foreground">$1</h3>')
s = s.replace(/### (.+)$/gm, '<h4 class="text-sm font-semibold mt-2 text-foreground">$1</h4>')
s = s.replace(/^---$/gm, '<hr class="my-3 border-border" />')
s = s.replace(/\n\n+/g, '<br /><br />')
s = s.replace(/\n/g, '<br />')
return s
}
function sanitizeHref(raw: string): string | null {
const value = raw.trim()
if (value.startsWith('/')) return value.replace(/"/g, '%22')
try {
const parsed = new URL(value)
if (!['http:', 'https:', 'mailto:'].includes(parsed.protocol)) return null
return parsed.href.replace(/"/g, '%22')
} catch {
return null
}
}
function renderAssistantHtml(text: string): string {
let s = text
.replace(/&/g, '&amp;')
.replace(/</g, '&lt;')
.replace(/>/g, '&gt;')
.replace(/"/g, '&quot;')
.replace(/'/g, '&`#39`;')
s = s.replace(/\*\*([^*]+)\*\*/g, '<strong class="font-semibold text-foreground">$1</strong>')
s = s.replace(
/`([^`]+)`/g,
'<code class="rounded bg-muted px-1 py-0.5 text-xs font-mono text-foreground">$1</code>',
)
s = s.replace(
/^> (.+)$/gm,
'<p class="my-2 border-l-2 border-amber-500/60 pl-3 text-sm text-muted-foreground italic">$1</p>',
)
s = s.replace(/\[([^\]]+)\]\(([^)]+)\)/g, (_, label: string, rawHref: string) => {
const href = sanitizeHref(rawHref)
if (!href) return label
return `<a class="text-primary font-medium underline underline-offset-2 hover:text-primary/90" href="${href}" target="_blank" rel="noopener noreferrer">${label}</a>`
})
s = s.replace(/## (.+)$/gm, '<h3 class="text-base font-semibold mt-3 mb-1 text-foreground">$1</h3>')
s = s.replace(/### (.+)$/gm, '<h4 class="text-sm font-semibold mt-2 text-foreground">$1</h4>')
s = s.replace(/^---$/gm, '<hr class="my-3 border-border" />')
s = s.replace(/\n\n+/g, '<br /><br />')
s = s.replace(/\n/g, '<br />')
return s
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/dev-ui/app/components/extraction/SharedConversationPanel.vue` around
lines 124 - 145, The renderAssistantHtml function currently injects untrusted
URLs directly into href via the link-replacement regex, enabling javascript: or
quote-based attribute XSS when the output is used with v-html; fix by validating
and sanitizing the captured URL in the link replacement step inside
renderAssistantHtml: only allow safe schemes (e.g., http, https, mailto, and
relative paths), reject or replace unsafe ones with a safe placeholder (e.g.,
'#'), HTML-escape or URL-encode the href value, and add safe attributes like
rel="noopener noreferrer" and target="_blank" as appropriate; update the
/\[([^\]]+)\]\(([^)]+)\)/g replacement to call a sanitizer helper (e.g.,
sanitizeUrl or isSafeScheme) before returning the anchor HTML so v-html never
receives an unsafe href.

…746)

Ship kartograph-agent-runtime container image with NDJSON turn API, mount
skills and JobPackage workspaces, inject chat-scoped workload tokens, and
delegate graph-management chat turns to the remote runtime when container
backend is enabled. Closes #742.

Co-authored-by: Cursor <cursoragent@cursor.com>
):
yield json.dumps(event) + "\n"

return StreamingResponse(event_stream(), media_type="application/x-ndjson")
aredenba-rh and others added 16 commits May 29, 2026 01:47
Align sticky Claude Agent SDK containers with k-extract Vertex auth and warm
the graph-management assistant on UI entry with streamed readiness progress.

Co-authored-by: Cursor <cursoragent@cursor.com>
Prevent JIT provisioning conflicts when Keycloak re-imports the realm and
Postgres still holds rows keyed by the previous SSO subject.

Co-authored-by: Cursor <cursoragent@cursor.com>
Mount gcloud credentials at /gcloud/config and run sticky containers as the
host UID so Claude Agent SDK can reach Vertex AI, while keeping the API root
for Docker-out-of-Docker in dev.

Co-authored-by: Cursor <cursoragent@cursor.com>
Drop the redundant branch tip column from the KG data sources table.

Co-authored-by: Cursor <cursoragent@cursor.com>
Load prepared archives even in schema-design mode, refresh the workspace
on chat reuse, point Claude SDK at /workspace, and remove sibling sticky
and worker containers during make down.

Co-authored-by: Cursor <cursoragent@cursor.com>
Incremental prepares were overwriting last_prepared_file_count with the
number of changed files, so the data sources table showed the wrong
"Files on branch" value after subsequent prepares.

Co-authored-by: Cursor <cursoragent@cursor.com>
Background refreshes no longer toggle the page-level loading gate, so
prepare polling updates status in place with a subtle updating indicator.

Co-authored-by: Cursor <cursoragent@cursor.com>
Graph Management and other manage steps no longer stretch edge-to-edge
on wide screens, matching the data sources workspace layout.

Co-authored-by: Cursor <cursoragent@cursor.com>
Expose job_package_available on data source listings, rematerialize when
the ZIP is gone, and skip ingest-only no-changes short-circuit without it.

Co-authored-by: Cursor <cursoragent@cursor.com>
Skip workspace rematerialization when the container is healthy and JobPackage
IDs match, report 503 until the agent workspace is ready, and only save user
messages after the assistant turn completes or fails.

Co-authored-by: Cursor <cursoragent@cursor.com>
…ct layout

Split the combined schema nav/detail card into a sticky left navigator and
right detail column to match k-extract's Design Artifacts pattern.

Co-authored-by: Cursor <cursoragent@cursor.com>
Surface tool use, reasoning, task progress, and compose previews as NDJSON
thinking events so the Graph Management Assistant panel updates while Vertex
work is in flight.

Co-authored-by: Cursor <cursoragent@cursor.com>
…kspaces

Ensure ingest-only prepares full-branch JobPackages and only materialize packages that contain repository content so Graph Management sessions can reliably read repo files. Add workspace source indexing plus prompt/thinking updates so the agent reports accurate available files and tools.

Co-authored-by: Cursor <cursoragent@cursor.com>
Process SyncStarted outbox events with bounded concurrency and fetch GitHub blobs in parallel to reduce ingestion-context preparation time for multi-source batches.

Co-authored-by: Cursor <cursoragent@cursor.com>
…ement

Expose separate schema-entities and schema-relationships rail items with readiness-driven status and detail panels so designers can track type coverage before transitioning.

Co-authored-by: Cursor <cursoragent@cursor.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

KG-scoped data source onboarding (k-extract-style full-page flow)

2 participants