
ChatWithCrewFlow.__init__ makes blocking LLM call at module import, crashes containers on any LLM hiccup #5510

@jpr5

Description


Summary

ChatWithCrewFlow.__init__ in ag_ui_crewai.crews triggers synchronous, blocking LLM calls at module import time via crewai.cli.crew_chat.generate_crew_chat_inputs, which in turn calls LLM-backed description generators such as generate_input_description_with_ai (visible in the stack trace below).

For users deploying CrewAI behind a FastAPI server via ag_ui_crewai.endpoint.add_crewai_crew_fastapi_endpoint (the recommended integration for AG-UI / CopilotKit), these LLM calls fire during module import — BEFORE uvicorn binds to its HTTP port.
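The failure mode is easy to reproduce without any of the real packages. Below is a minimal, dependency-free sketch (all names are stand-ins, not the real CrewAI API) showing why an eager LLM call in `__init__` means the process dies before a server could ever bind its port:

```python
class FlakyLLM:
    """Stand-in for a chat LLM whose call() can fail transiently."""
    def __init__(self, fail):
        self.fail = fail

    def call(self, messages):
        if self.fail:
            raise ConnectionError("transient provider error")
        return "ok"


class EagerChatFlow:
    """Mirrors the problematic pattern: LLM call at construction time."""
    def __init__(self, llm):
        # Runs during endpoint registration, i.e. at import/startup time.
        self.crew_chat_inputs = llm.call(messages=["describe the crew inputs"])


def start_server(llm):
    flow = EagerChatFlow(llm)  # any provider hiccup raises right here
    return "listening"         # never reached when the LLM call fails
```

With `FlakyLLM(fail=True)`, `start_server` raises before returning "listening", which is exactly the container-crash-before-port-bind behavior described below.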

Failure mode

ANY LLM provider hiccup during container startup causes the Python process to crash before the HTTP server is listening:

  • OpenAI 500 / 503 / rate-limit
  • Network blip, DNS failure, slow cold-start on a mock/proxy server
  • Invalid credentials (even transient)
  • Litellm APIError, Timeout, or APIConnectionError

In orchestrated environments (Railway, Kubernetes, AWS ECS, Fly.io) the platform's readiness/health check fails because no process ever binds the port. The platform then marks the deploy failed and rolls back to the previous image, making the service effectively unresponsive to LLM-layer instability.

We hit this on our Railway-hosted CopilotKit showcase when our LLM mock (aimock) returned a transient schema error. The mock error was recoverable — the issue is that it shouldn't have been able to crash the entire container before the HTTP server was ready.

Actual stack trace we observed

File "/app/agent_server.py", line 27, in <module>
    add_crewai_crew_fastapi_endpoint(app, LatestAiDevelopment(), "/")
File ".../ag_ui_crewai/endpoint.py", line 250, in add_crewai_crew_fastapi_endpoint
    add_crewai_flow_fastapi_endpoint(app, ChatWithCrewFlow(crew=crew), path)
File ".../ag_ui_crewai/crews.py", line 56, in __init__
    self.crew_chat_inputs = crew_chat.generate_crew_chat_inputs(...)
File ".../crewai/cli/crew_chat.py", line 387, in generate_crew_chat_inputs
    description = generate_input_description_with_ai(input_name, crew, chat_llm)
File ".../crewai/cli/crew_chat.py", line 481, in generate_input_description_with_ai
    response = chat_llm.call(messages=[...])
File ".../crewai/llm.py", line 956, in call
    return self._handle_non_streaming_response(...)
...
APIError: <connection failure>

Container exits with code 1, never binds a port, orchestrator's health check fails, deploy rolls back.

Why this is a CrewAI concern, not just an ag-ui-crewai concern

While ChatWithCrewFlow lives in ag-ui-crewai, the two functions that block are part of CrewAI's public crewai.cli.crew_chat module. CrewAI is asking users to consume these helpers at import/init time without any of the standard production-server defenses:

  • No timeout
  • No retry/fallback
  • No try/except with a graceful default
  • No opt-out

Any consumer that instantiates a chat flow with them in a serving context inherits this fragility.

Suggested fixes (any or all)

  1. Lazy init at first request. Have ChatWithCrewFlow.__init__ store the crew and LLM but defer generate_crew_chat_inputs until the first actual chat turn. (A similar fix has already landed on ag-ui-protocol/ag-ui main for add_crewai_crew_fastapi_endpoint — deferring ChatWithCrewFlow construction to first-request. But the underlying functions in CrewAI still have no defenses.)
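A minimal sketch of what fix 1 could look like (names and structure are illustrative, not the actual ChatWithCrewFlow implementation): store the LLM, and only generate descriptions on first access, so a provider failure surfaces as a failed request rather than a dead process.

```python
class CountingLLM:
    """Stub LLM that records how many times call() runs."""
    def __init__(self):
        self.calls = 0

    def call(self, messages):
        self.calls += 1
        return "generated description"


class LazyChatFlow:
    """Sketch of fix 1: defer the LLM call to first use."""
    def __init__(self, llm):
        self.llm = llm
        self._inputs = None  # nothing generated yet; import stays cheap

    @property
    def crew_chat_inputs(self):
        if self._inputs is None:
            # First chat turn pays the cost; a failure here becomes a
            # per-request 5xx instead of killing the whole process.
            self._inputs = self.llm.call(messages=["describe the crew inputs"])
        return self._inputs
```

Construction performs zero LLM calls; repeated accesses after the first reuse the cached result.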

  2. Try/except with a static fallback inside the generator functions. If the LLM call fails for any reason, fall back to a generic string like "Input value for the crew's tasks and agents." or "A CrewAI crew.". These descriptions are only surfaced in the CrewAI chat UI — shipping a generic default on LLM failure is strictly better than crashing the process.
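Sketched as a standalone helper (the real generator functions take more arguments; this is just the shape of the defense), fix 2 is a one-line try/except around the call:

```python
FALLBACK_INPUT_DESCRIPTION = "Input value for the crew's tasks and agents."


def describe_input_with_fallback(input_name, chat_llm):
    """Sketch of fix 2: any LLM failure yields a generic static
    description instead of propagating and crashing startup."""
    try:
        return chat_llm.call(messages=[f"Describe the input {input_name!r}"])
    except Exception:
        return FALLBACK_INPUT_DESCRIPTION


class FailingLLM:
    """Stub that simulates a hung/broken provider."""
    def call(self, messages):
        raise TimeoutError("provider hung")
```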

  3. Make AI-generated descriptions opt-in. Accept a kwarg generate_descriptions: bool = True (default preserves current behavior), but let production users pass False to skip the LLM calls entirely.
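The opt-in kwarg could look roughly like this (a hypothetical signature, not the current generate_crew_chat_inputs API):

```python
STATIC_DESCRIPTION = "Input value for the crew's tasks and agents."


def generate_chat_inputs(input_names, chat_llm, generate_descriptions=True):
    """Sketch of fix 3: default preserves current behavior; passing
    generate_descriptions=False performs zero LLM calls."""
    if not generate_descriptions:
        return {name: STATIC_DESCRIPTION for name in input_names}
    return {name: chat_llm.call(messages=[f"Describe {name}"])
            for name in input_names}
```

Production callers get a fully deterministic, network-free startup path by passing `generate_descriptions=False`.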

  4. Timeout + bounded retry. At minimum, enforce a short timeout (e.g., 10s) on chat_llm.call in these two functions so a hung LLM can't indefinitely block process startup.
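One dependency-free way to sketch fix 4 (using stdlib threads to impose a hard deadline; a real implementation might instead use litellm/crewai's own timeout parameters if available):

```python
import time
from concurrent.futures import ThreadPoolExecutor


def call_with_timeout_and_retry(fn, attempts=2, timeout_s=10.0, backoff_s=0.5):
    """Sketch of fix 4: run fn with a hard per-attempt timeout and a
    small bounded retry, so a hung provider can't stall startup forever."""
    last_exc = None
    for attempt in range(attempts):
        # Fresh single-worker pool per attempt so a hung call from a
        # previous attempt can't block the next one.
        pool = ThreadPoolExecutor(max_workers=1)
        try:
            return pool.submit(fn).result(timeout=timeout_s)
        except Exception as exc:  # includes futures.TimeoutError
            last_exc = exc
            time.sleep(backoff_s * attempt)  # 0 before the first retry
        finally:
            pool.shutdown(wait=False)
    raise last_exc
```

Note the caveat of the thread-based approach: a timed-out call's thread keeps running in the background; the deadline only bounds how long startup waits for it.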

Our workaround

We're shipping a defensive monkey-patch in our showcase that replaces both functions with static-string stubs before ag_ui_crewai is imported. PR: CopilotKit/CopilotKit#3974

This is fragile (depends on private function names) and we'd much prefer an upstream fix so every AG-UI / CopilotKit / direct-CrewAI production deployment doesn't inherit this footgun.
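The pattern behind the workaround, sketched self-contained (a SimpleNamespace stands in for crewai.cli.crew_chat; with the real package you would assign onto the module object itself, which is exactly the private-name fragility noted above):

```python
import types

# Stand-in for crewai.cli.crew_chat.
crew_chat = types.SimpleNamespace()


def _blocking_generator(input_name, crew, chat_llm):
    # In the real module this performs a network LLM call at import time.
    raise ConnectionError("simulated blocking LLM call")


crew_chat.generate_input_description_with_ai = _blocking_generator


def _static_description(input_name, crew, chat_llm):
    return "Input value for the crew's tasks and agents."


# The patch: must be applied BEFORE importing ag_ui_crewai, so the eager
# __init__ resolves to the static stub instead of the network call.
crew_chat.generate_input_description_with_ai = _static_description
```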

Environment

  • crewai>=0.130.0
  • ag-ui-crewai==0.1.5 (latest released; main already has the deferred-construction fix in endpoint.py but no release yet)
  • Python 3.12
