Talk to your documents. Understand your data.
hArI is an AI-powered document intelligence system that lets you upload PDFs, CSVs, and Excel files — then have a natural conversation with their contents. It uses a RAG (Retrieval-Augmented Generation) pipeline for PDF semantic search and LLM-driven pandas code execution for structured data analysis.
- PDF Chat — Semantic search over PDF content using ChromaDB + sentence-transformers
- CSV / Excel Analysis — Natural language queries converted to pandas operations via LLM
- Mixed File Mode — Upload both types together; hArI automatically routes each query to the right engine
- Streaming Responses — Token-by-token streaming with live cursor effect (ChatGPT-style)
- Conversation Memory — Maintains context across turns; summarizes older messages when the context window fills up
- Persistent Vector Store — PDF embeddings stored across sessions; no re-processing on reload
- Modular Architecture — Clean separation of UI, state, handlers, and core logic
| Layer | Technology |
|---|---|
| Language | Python 3.10+ |
| UI | Streamlit |
| RAG Pipeline | ChromaDB (cosine similarity) |
| Embeddings | sentence-transformers (all-MiniLM-L6-v2) |
| LLM | Groq API (llama-4-scout-17b, compound-beta-mini) |
| Data Analysis | Pandas + NumPy |
| PDF Parsing | PyMuPDF |
hArI/
├── app.py # Entry point — page config + render orchestration
├── config.py # All config constants (models, chunking, RAG, memory, limits)
│
├── core/
│ ├── __init__.py
│ ├── file_processor.py # File loading, type detection, PDF/CSV/Excel parsing
│ ├── memory.py # Conversation context + summarization logic
│ ├── query_intent.py # Query classifier — routes to PDF (RAG) or CSV engine
│ ├── responser.py # Prompt builder + streaming Groq response handler
│ └── utils.py # Shared utilities (strip_thinking, Groq client helpers)
│
├── rag/
│ ├── __init__.py
│ ├── embedder.py # Splits docs into chunks, generates embeddings
│ ├── retriever.py # Embeds user query, retrieves top-k chunks from ChromaDB
│ └── vector_store.py # ChromaDB client, collection management, deduplication
│
├── ui/
│ ├── __init__.py
│ ├── styles.py # Global CSS injection (dark theme, purple accents)
│ ├── state.py # Session state init + clear_chat()
│ ├── handlers.py # Ingest, query, reset, remove file pipelines
│ └── components.py # All Streamlit render functions
│
├── prompts/
│ └── system_prompt.md # AI identity + mode-specific rules (PDF, CSV, General)
│
├── data/
│ └── chroma_store/ # Persistent ChromaDB vector storage
│
├── .env # API keys (not committed)
├── pyproject.toml
└── uv.lock
- Python 3.10+
- uv (recommended) or pip
- A Groq API key
# Clone the repo
git clone https://github.com/yourusername/hArI.git
cd hArI
# Install dependencies using uv
uv sync
# Or using pip
pip install -r requirements.txtCreate a .env file in the project root:
GROQ_API_KEY=your_groq_api_key_herestreamlit run app.pyUpload PDF
│
▼
PyMuPDF extracts text
│
▼
Text split into chunks (embedder.py)
│
▼
Embeddings generated (all-MiniLM-L6-v2)
│
▼
Stored in ChromaDB (vector_store.py)
│
▼
User query → embed → retrieve top-k chunks (retriever.py)
│
▼
Chunks + memory context injected into prompt (responser.py)
│
▼
Groq streams answer token-by-token → rendered live in UI
Upload CSV / Excel
│
▼
Pandas DataFrame created (file_processor.py)
│
▼
Schema + sample rows extracted as metadata
│
▼
User query + schema → Groq generates pandas code
│
▼
Code executed safely → result passed back to LLM
│
▼
Groq streams formatted answer → rendered live in UI
When both file types are present, query_intent.py uses the LLM to classify whether each query is best answered by the PDF RAG engine or the CSV analysis engine — then routes accordingly.
All tunable parameters live in config.py:
| Parameter | Description |
|---|---|
CHUNK_SIZE |
Token size per text chunk for PDF splitting |
CHUNK_OVERLAP |
Overlap between consecutive chunks |
TOP_K_RESULTS |
Number of chunks retrieved per query |
MEMORY_BUFFER_SIZE |
Max messages before summarization triggers |
MAX_FILE_SIZE_MB |
Upload size limit per file |
EMBEDDING_MODEL |
Sentence-transformer model name |
LLM_MODEL |
Groq model identifier |
ANALYSIS_MODEL |
Groq model used for pandas code generation |
SCORE_THRESHOLD |
Minimum cosine similarity score for chunk retrieval |
- PDF support only (no
.docx,.txtcurrently) - CSV/Excel analysis depends on LLM-generated pandas code — complex queries may occasionally fail
- Streaming LLM responses in UI
- Modular ui/ folder architecture
- ChromaDB cosine similarity threshold filtering
- Add
.docxand.txtsupport - Multi-collection support (separate namespaces per session)
- Export chat history
- Confidence score display on retrieved chunks
MIT License. See LICENSE for details.
Live Demo → hArI
Built by Harsh Bhanushali