PyThreadMind

A Python toolkit for managing conversational threads using TF-IDF and semantic similarity. It groups chat messages into threads and retrieves the most relevant threads for a new user prompt.

Warning: Both TF-IDF and semantic implementations are experimental and may not perfectly separate or rank threads in all cases.
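
To make the retrieval idea concrete, here is a minimal, standard-library-only sketch of ranking threads against a new prompt with TF-IDF and cosine similarity. It is not the code in this repository: the function names (`tokenize`, `tfidf_vectors`, `cosine`, `rank_threads`), the smoothed IDF formula, and the sample threads are all assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Lowercase, split on whitespace, and trim basic trailing/leading punctuation.
    words = (w.strip(".,!?").lower() for w in text.split())
    return [w for w in words if w]


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document (smoothed IDF, an assumption)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({
            term: (count / len(toks)) * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_threads(threads, prompt):
    """threads maps a (hypothetical) thread name to its list of messages.

    Returns (score, name) pairs, most similar thread first.
    """
    names = list(threads)
    docs = [" ".join(threads[name]) for name in names] + [prompt]
    vectors = tfidf_vectors(docs)
    prompt_vec = vectors[-1]
    scored = [(cosine(vec, prompt_vec), name) for name, vec in zip(names, vectors)]
    return sorted(scored, reverse=True)


threads = {
    "cooking": ["How do I bake bread?", "Use flour, yeast and water."],
    "python": ["How do I read a file in Python?", "Use open() with a context manager."],
}
for score, name in rank_threads(threads, "What is the best way to open a file in Python?"):
    print(f"{name}: {score:.3f}")
```

The semantic manager presumably replaces the word-overlap vectors with embeddings, but the ranking step (score every thread against the prompt, sort descending) would look much the same.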

I have mostly moved to the semantic implementation and given up on TF-IDF.


Dataset

Sample data lives in src/data/context.json. It was extracted from the public OASST1 (OpenAssistant) v1.0 dataset, a large archive of user–assistant chat transcripts. A custom extraction script (not included) filtered and formatted the portion you see here.

Requirements

  • Python 3.7–3.11 (breaks on 3.12+)
  • POSIX shell (bash/zsh) or Windows PowerShell
  • Virtual environment tool (venv, conda)
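
If you are unsure whether your interpreter falls inside the supported range, a small check like the one below can be run before creating the virtual environment. This is an illustrative sketch, not a script shipped with the package; the names `SUPPORTED_MIN`, `SUPPORTED_MAX`, `is_supported`, and `describe` are hypothetical.

```python
import sys

# Range taken from the Requirements list above.
SUPPORTED_MIN = (3, 7)
SUPPORTED_MAX = (3, 11)


def is_supported(version_info=None):
    """Return True if the (major, minor) version is within the supported range."""
    major_minor = tuple((version_info or sys.version_info)[:2])
    return SUPPORTED_MIN <= major_minor <= SUPPORTED_MAX


def describe(version_info=None):
    """Return a human-readable verdict for the given (or current) interpreter."""
    info = tuple((version_info or sys.version_info)[:2])
    verdict = "supported" if is_supported(info) else "not supported"
    return f"Python {info[0]}.{info[1]} is {verdict}"


print(describe())
```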

Installation

# Clone repo (if not already)
git clone https://github.com/acoliver/pythreadmind.git
cd pythreadmind

# Create & activate venv
python3.11 -m venv venv
source venv/bin/activate  # or `venv\Scripts\activate` on Windows

# Install deps & package
pip install --upgrade pip
pip install -r requirements.txt
pip install -e .

# Download NLTK data (stopwords)
python -c "import nltk; nltk.download('stopwords')"

Note: requirements.txt includes spaCy's en-core-web-sm model, so the model is installed automatically along with the other dependencies.

Usage

TF-IDF Thread Manager Demo (not recommended)

python src/threadmind/tfidf_manager.py --test

Semantic Thread Manager in REPL

python
>>> from threadmind.semantic_manager import SemanticThreadManager
>>> from datetime import datetime
>>> mgr = SemanticThreadManager()
>>> threads = mgr.threads_for_prompt('user', 'Your query here', datetime.now())
>>> print(threads)

Running Tests (This runs the semantic one)

Integration & unit tests live in test/test_thread_manager.py. To run:

pytest test/test_thread_manager.py

This test will:

  • Feed the sample context into both TF-IDF and semantic managers
  • Verify thread grouping & message retrieval
  • Export threads_debug.csv
  • Print analysis of the longest threads

Expect some warnings or failures; development is ongoing.

Limitations & Future Work

  • Semantic manager may over-group or under-split threads
  • TF-IDF manager’s topic drift penalty can be too aggressive
  • Extraction script for context.json is not packaged here
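
The grouping limitations above largely come down to how sensitive thread assignment is to a similarity cutoff. The toy sketch below illustrates that sensitivity with a greedy grouper. It uses Jaccard word overlap as a stand-in for the real similarity measures, and `jaccard`, `group_messages`, and the threshold values are assumptions for illustration, not this project's algorithm.

```python
def jaccard(a, b):
    """Word-set overlap between two messages (stand-in similarity measure)."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def group_messages(messages, threshold):
    """Greedily attach each message to its most similar thread, or start a new one."""
    threads = []
    for msg in messages:
        best_idx, best_score = None, 0.0
        for idx, thread in enumerate(threads):
            # A thread's score is its best match against any message it already holds.
            score = max(jaccard(msg, existing) for existing in thread)
            if score > best_score:
                best_idx, best_score = idx, score
        if best_idx is not None and best_score >= threshold:
            threads[best_idx].append(msg)
        else:
            threads.append([msg])
    return threads


messages = [
    "how do i bake bread",
    "how long should the bread bake",
    "how do i read a file in python",
    "read a file in python with open",
]
for threshold in (0.2, 0.35, 0.6):
    print(threshold, len(group_messages(messages, threshold)))
```

With these four messages, a threshold of 0.2 merges everything into one thread (the shared "how do i" is enough to pull the Python question into the baking thread), 0.35 recovers the two intended topics, and 0.6 leaves every message in its own thread.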

Questions / Issues

If anything is unclear (dataset provenance, Python version, missing scripts), please open an issue or reach out. Happy threading!