Semantic code search powered by vector embeddings and LanceDB
Find similar functions, understand code patterns, navigate dependencies, and identify refactoring opportunities using AI-powered semantic search.
Codesearch indexes your codebase and enables intelligent queries like:
- Find Similar Functions: "Show me functions similar to this one" → Discover patterns and duplicates
- Understand Patterns: "What patterns are used for error handling?" → Learn how your code solves problems
- Navigate Dependencies: "What calls this function?" "What does it call?" → Trace code relationships
- Refactor Insights: "Find code that does the same thing differently" → Identify consolidation opportunities
# Clone the repository
git clone https://github.com/jpequegn/codesearch.git
cd codesearch
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install in development mode
pip install -e ".[dev]"

# 1. Index a repository
codesearch index /path/to/your/repo
# 2. Find functions similar to a specific one
codesearch find-similar my_function_name
# 3. Search semantically by description
codesearch pattern "function that validates email addresses"
# 4. Explore call dependencies
codesearch dependencies my_function_name
# 5. Find duplicate or similar code patterns
codesearch refactor-dupes --threshold 0.85

Set environment variables to customize behavior:
# Database location (default: ~/.codesearch)
export CODESEARCH_DB_PATH=~/.codesearch
# Default programming language for searches (python, typescript, go)
export CODESEARCH_LANGUAGE=python
# Output format (table for terminal, json for scripts)
export CODESEARCH_OUTPUT_FORMAT=table

Discover code patterns in your project:
# Index your codebase
codesearch index ~/my-project
# Find all error handling functions
codesearch pattern "function that handles errors or exceptions"
# Find validation functions
codesearch pattern "validates input"

Find code duplicates for refactoring:
# Index your codebase
codesearch index ~/my-project
# Find similar implementations (duplication candidates)
codesearch refactor-dupes --threshold 0.90
# Examine specific function
codesearch find-similar database_query

Explore code dependencies:
# Index your codebase
codesearch index ~/my-project
# See what calls a function
codesearch dependencies main
# Understand call graph
codesearch dependencies api_handler

For detailed command documentation, see docs/CLI.md.
- pattern <query> - Search for code matching a natural language description
- find-similar <entity_name> - Find functions similar to a given function
- dependencies <entity_name> - Show functions that call a given function
- index <path> - Index a repository or directory
- refactor-dupes [--threshold] - Find potential code duplicates

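
To make the idea behind the dependencies command concrete, here is a minimal sketch of how caller/callee relationships can be recovered from Python source with the standard-library `ast` module. It is an illustration under stated assumptions, not codesearch's actual parser: the `SOURCE` snippet and the `build_call_graph` and `callers_of` helpers are hypothetical, and only direct calls by plain name are tracked (method calls such as `text.split()` are ignored).

```python
import ast

# Hypothetical source text to analyze; a real indexer would read files from disk.
SOURCE = '''
def load(path):
    return open(path).read()

def parse(text):
    return text.split()

def main(path):
    return parse(load(path))
'''


def build_call_graph(source: str) -> dict[str, set[str]]:
    """Map each top-level function name to the plain names it calls directly."""
    tree = ast.parse(source)
    graph: dict[str, set[str]] = {}
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            callees = graph.setdefault(node.name, set())
            for inner in ast.walk(node):
                # Only calls of the form name(...) are recorded in this sketch.
                if isinstance(inner, ast.Call) and isinstance(inner.func, ast.Name):
                    callees.add(inner.func.id)
    return graph


def callers_of(graph: dict[str, set[str]], target: str) -> list[str]:
    """Return the functions whose bodies call `target`, sorted by name."""
    return sorted(name for name, callees in graph.items() if target in callees)


graph = build_call_graph(SOURCE)
print(callers_of(graph, "parse"))  # who calls parse?
print(sorted(graph["main"]))       # what does main call?
```

Answering "what calls this function?" is then a reverse lookup over the same mapping that answers "what does it call?".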
- 🔍 Semantic Search: Find code by meaning, not just keywords
- 📊 Call Graph Analysis: Understand function relationships and dependencies
- 🎯 Pattern Discovery: Identify code patterns and anti-patterns
- 🔄 Multi-Language Support: Python, TypeScript, Go (extensible)
- 📈 Incremental Indexing: Re-index only changed files
- 💾 Multiple Repos: Index and search across multiple codebases
- ⚡ Fast Queries: Vector similarity search with LanceDB
- 🖥️ CLI Interface: Easy command-line access
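
The similarity-based features above can be pictured with a small sketch. It assumes cosine similarity over embedding vectors, which is a common choice but is not confirmed here as the metric codesearch uses. The three-dimensional vectors, the function names, and the `most_similar` helper are all hypothetical, and LanceDB itself is not involved.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings; real models emit hundreds of dimensions.
EMBEDDINGS = {
    "validate_email": [0.9, 0.1, 0.0],
    "check_email_format": [0.8, 0.2, 0.1],
    "open_db_connection": [0.0, 0.1, 0.9],
}


def most_similar(name, embeddings, threshold=0.85):
    """Rank other entities by similarity to `name`, keeping scores >= threshold."""
    query = embeddings[name]
    scored = [
        (other, cosine_similarity(query, vector))
        for other, vector in embeddings.items()
        if other != name
    ]
    kept = [(other, score) for other, score in scored if score >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


for other, score in most_similar("validate_email", EMBEDDINGS):
    print(f"{other}: {score:.3f}")
```

Under this reading, a threshold such as 0.85 acts as a cutoff: raising it returns fewer, closer matches, and lowering it returns more candidates with weaker resemblance.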
See docs/ARCHITECTURE.md for comprehensive design documentation including system diagrams, data flow, and technology stack.
Core Components:
- Code Parser & Indexer (codesearch/parsers/, codesearch/indexing/)
  - Extracts functions, classes, and call relationships from source code
  - Supports Python, TypeScript, and Go (extensible)
  - Identifies entity metadata: names, signatures, docstrings, locations
- Embedding Pipeline (codesearch/embeddings/)
  - Generates semantic embeddings using pre-trained transformer models
  - Supports multiple models: MiniLM (384-dim), MPNet (768-dim)
  - Batch processing for performance
- LanceDB Database (codesearch/lancedb/)
  - Vector database for semantic search
  - Tables: code_entities, code_relationships, search_metadata
  - Sub-second search on large codebases
- CLI Query Interface (codesearch/cli/)
  - 5 user-facing commands: pattern, find-similar, dependencies, index, refactor-dupes
  - Multiple output formats: terminal table, JSON
  - Configuration via environment variables
- Data Ingestion System (codesearch/indexing/)
  - Orchestrates repository indexing
  - Hash-based deduplication for efficiency
  - Incremental updates (only changed files)
  - Multi-repository support with audit trails

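
The hash-based, incremental behavior described for the ingestion system can be sketched as follows. This is one plausible approach, not the project's confirmed implementation: the choice of SHA-256, the `file_digest` and `changed_files` helpers, and the idea of keeping a path-to-digest mapping between runs are all assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def changed_files(
    root: Path, previous: dict[str, str], pattern: str = "*.py"
) -> tuple[list[str], dict[str, str]]:
    """Compare current digests under `root` with those from the previous run.

    Returns the relative paths that are new or modified, plus the fresh
    path-to-digest mapping to store for the next run.
    """
    current: dict[str, str] = {}
    changed: list[str] = []
    for path in sorted(root.rglob(pattern)):
        if not path.is_file():
            continue
        key = str(path.relative_to(root))
        current[key] = file_digest(path)
        if previous.get(key) != current[key]:
            changed.append(key)
    return changed, current
```

On a first run `previous` is empty, so every matching file is reported. On later runs only files whose contents differ need to be parsed and embedded again, which is where the savings would come from.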
For detailed installation instructions, see docs/INSTALLATION.md.
Install from source:
git clone https://github.com/jpequegn/codesearch.git
cd codesearch
pip install -e ".[dev]"

# Create virtual environment
python -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
# Upgrade pip
pip install --upgrade pip
# Install with dev dependencies
pip install -e ".[dev]"

# Run all tests with coverage
pytest --cov=codesearch --cov-report=html
# Format code
black codesearch/ tests/
isort codesearch/ tests/
# Type checking
mypy codesearch/
# Lint with ruff
ruff check codesearch/ tests/

codesearch/
├── cli/ # Command-line interface
├── query/ # Query infrastructure
├── indexing/ # Data ingestion
├── embeddings/ # Embedding generation
├── lancedb/ # Database layer
├── parsers/ # Code parsing
├── caching/ # Caching system
└── models.py # Data models
tests/
├── cli/ # CLI tests
├── integration/ # Integration tests
└── conftest.py # Shared fixtures
docs/
├── ARCHITECTURE.md # System design
├── CLI.md # Command reference
├── API.md # Python API
├── INSTALLATION.md # Setup guide
└── TROUBLESHOOTING.md # Troubleshooting
- CONTRIBUTING.md - Development guidelines
- docs/INSTALLATION.md - Installation guide
- docs/ARCHITECTURE.md - System design
- docs/TROUBLESHOOTING.md - Troubleshooting
📊 Component Status:
- ✅ Component 5.1-5.5: Core features (indexing, caching, error handling, testing, documentation)
- ✅ Component 5.6: Project documentation (README, architecture, CLI, API, troubleshooting)
- ✅ Component 5.7: Project setup & deployment (this release)
- 🔄 Component #9: LanceDB schema (in progress)
- 🔄 Component #10: Data ingestion (in progress)
- 🔄 Component #11: Query infrastructure (in progress)
See GitHub Issues for current work.
MIT License - See LICENSE file for details
Contributions welcome! Please see CONTRIBUTING.md for guidelines.