The application follows a Multi-Agent Supervisor Architecture powered by LangGraph and LangChain. A Streamlit frontend hosts the app and manages session state and visualization rendering.
graph TD
User[User/Upload] -->|CSV| Streamlit[Streamlit App]
Streamlit -->|Trigger| Supervisor[Supervisor Agent]
Supervisor -->|Route| Cleaner[Cleaning Agent]
Supervisor -->|Route| Cluster[Clustering Agent]
Supervisor -->|Route| Visualizer[Visualization Agent]
Cleaner -->|Clean/EDA| Supervisor
Cluster -->|K-Means| Supervisor
Visualizer -->|Plot| Supervisor
Supervisor -->|Finish| Streamlit
Streamlit -->|Render| UI[Trace/Reports/Plots]
- LLM: `gpt-4o-mini`
- Routing Logic: Uses a structured system prompt to decide the `next_node` based on the conversation history.
- Robustness: Implements regex-based parsing (`detected_agent`) to handle "chatty" LLM responses, ensuring strict routing even if the LLM wraps the agent name in natural language.
- Checklist Enforcement: Explicitly programmed to verify that `Clustering` and `Visualization` have occurred before returning `FINISH`.
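
The robustness and checklist rules above can be illustrated with a minimal sketch. The route names, the `detect_agent` helper, and the fallback for replies that contain no route name are assumptions made for illustration; the actual supervisor may parse and label things differently.

```python
import re

# Hypothetical route names; the real graph may label its nodes differently.
_ROUTE_RE = re.compile(r"\b(Cleaning|Clustering|Visualization|FINISH)\b")
_REQUIRED_BEFORE_FINISH = ("Clustering", "Visualization")


def detect_agent(llm_text: str, completed: set[str]) -> str:
    """Pull a route name out of a possibly chatty LLM reply."""
    match = _ROUTE_RE.search(llm_text)
    # Assumption: a reply with no recognizable route is treated as FINISH,
    # which the checklist below may still override.
    choice = match.group(1) if match else "FINISH"
    if choice == "FINISH":
        # Checklist enforcement: refuse to finish until both steps have run.
        for step in _REQUIRED_BEFORE_FINISH:
            if step not in completed:
                return step
    return choice
```

Taking the first match is a simplification: a reply such as "Cleaning is done, go to Clustering" would route to `Cleaning`, so a stricter prompt or an anchored pattern may be preferable in practice.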
All specialist agents (Cleaning, Clustering, Visualization) are ReAct agents built with `create_react_agent`.
- Context Injection: The `run_specialist` wrapper injects a specific `SystemMessage` containing the absolute path of the active CSV file before every invocation.
- Recursion Management: Agents operate with a local recursion limit of 50 to prevent infinite loops.
- Hand-off Protocol: Agents are instructed to "Finish their turn" and report "Task Complete" rather than asking the user for input, maintaining the autonomous chain.
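
As a rough sketch of the context-injection step, the helper below prepends a system note to the message list. It uses plain `(role, content)` tuples as a stand-in for `SystemMessage` objects so it runs without LangChain installed; the name `with_path_context` and the exact wording of the note are hypothetical.

```python
import os


def with_path_context(messages: list[tuple[str, str]], df_path: str) -> list[tuple[str, str]]:
    """Return a new message list that starts with a note naming the active CSV."""
    note = (
        "system",
        f"The active dataset is located at: {os.path.abspath(df_path)}. "
        "Finish your turn and report 'Task Complete' instead of asking the user for input.",
    )
    # Build a new list so the caller's history is left untouched.
    return [note, *messages]


# A wrapper in the spirit of run_specialist could then call something like:
#   agent.invoke({"messages": with_path_context(history, df_path)},
#                config={"recursion_limit": 50})
# where `agent` and `history` are placeholders for the real objects.
```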
- `clean_data`:
  - Null Imputation: Median (numeric) / Mode (categorical).
  - Safe Outlier Removal: Uses IQR but guarantees retention of at least 10% of data or 5 rows.
  - Smart Paths: Handles absolute paths and prevents redundant filename suffixes (e.g., `_cleaned_cleaned`).
  - Column Dropping: Accepts a `drop_columns` list to act on EDA findings.
- `perform_eda`:
  - Correlation Check: Identifies pairs with >0.85 correlation.
  - Suggestion Engine: Returns a specific list of `SUGGESTED DROPS` to the agent.
- `perform_clustering`:
  - Pipeline: `StandardScaler` -> `OneHotEncoder` -> `KMeans` -> `PCA` (2 components).
  - Output: Saves `_clustered.csv`.
- `generate_visualization`:
  - Validation: Checks for PCA columns.
  - Signalling: Returns a success message that triggers the Streamlit UI to render the chart.
- `AgentState`: A `TypedDict` tracking:
  - `messages`: List of `BaseMessage` (User/AI/Tool).
  - `next_node`: The next agent to call.
  - `df_path`: The absolute path to the currently active CSV file.
- `MemorySaver`: Uses in-memory checkpointing to maintain graph state across steps within a single session.
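
For orientation, a state schema with these three fields might look like the sketch below. `Any` stands in for `BaseMessage` so the snippet runs without LangChain, the append-style reducer on `messages` is an assumption about how the graph accumulates history, and `merge_update` is a toy helper that only mimics that behaviour.

```python
import operator
from typing import Annotated, Any, TypedDict


class AgentState(TypedDict):
    # Stand-in for a list of BaseMessage objects; the operator.add
    # reducer (append on update) is an assumption.
    messages: Annotated[list[Any], operator.add]
    next_node: str  # the next agent to call
    df_path: str    # absolute path to the currently active CSV file


def merge_update(state: AgentState, update: dict) -> AgentState:
    """Toy illustration: append to messages, overwrite every other key."""
    merged = dict(state)
    for key, value in update.items():
        merged[key] = state[key] + value if key == "messages" else value
    return AgentState(**merged)
```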
- Streaming Trace: Uses `trace_placeholder` to render real-time tool calls and arguments.
- Agent Reports: Filters the main chat stream to display nicely formatted Markdown reports (e.g., `### 🧹 Cleaning Report`) from agent summaries.
- Dynamic File Handling:
  - Uses `os.path.abspath` for all file saves.
  - Updates `st.session_state.df_path` based on agent outputs.
- Visualization:
  - Detects `_clustered.csv` to render PCA scatter plots.
  - Detects signals generated to render EDA heatmaps.
- Export: Provides a `st.download_button` for the final processed dataset.
- Python 3.10+
- `uv` package manager (recommended)
- OpenAI API Key
# Install dependencies
uv pip install -r requirements.txt
# OR manual install
pip install streamlit langchain langgraph pandas plotly scikit-learn python-dotenv langchain-openai
# Create .env file with OPENAI_API_KEY
uv run streamlit run app.py

- RecursionError: The graph is configured with `recursion_limit=100`. If the limit is hit, restart the app or simplify the data.
- "File not found": All internal logic uses absolute paths (`/users/.../temp_data/file.csv`). Ensure the app has write permissions to the `temp_data` folder.
- Empty Dataset: The cleaning tool has guardrails. If the data is still empty, the source file may be fully corrupt or contain only nulls.
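
To check the write-permission point quickly, a small probe like the one below can be run against the data folder. The helper name `ensure_writable_dir` is hypothetical and is not part of the app; it raises an `OSError` (for example `PermissionError`) if the folder cannot be created or written to.

```python
import os
import tempfile


def ensure_writable_dir(path: str) -> str:
    """Create the folder if needed and confirm a file can be written there."""
    abs_dir = os.path.abspath(path)
    os.makedirs(abs_dir, exist_ok=True)
    # Writing a throwaway file is a more direct probe than os.access alone.
    with tempfile.NamedTemporaryFile(dir=abs_dir):
        pass
    return abs_dir
```

Calling `ensure_writable_dir("temp_data")` from the project root before launching the app should surface permission problems early.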