Gruszkoland/embedding-ab-test

Embedding A/B Test Framework

A comprehensive Python framework for A/B testing OpenAI embedding models with advanced reranking and adaptive weight optimization for hybrid search.

🎯 Overview

This framework enables data-driven decision-making for embedding model selection by:

  • A/B Testing: Systematically compare text-embedding-3-small vs text-embedding-3-large
  • Cross-Encoder Reranking: Fine-grained relevance scoring for top-K results
  • Learned Weight Optimization: Adaptive weight adjustment via gradient descent
  • Cost-Benefit Analysis: Detailed ROI calculations for model selection

✨ Features

1. Benchmark Dataset (benchmark_dataset.py)

  • 25+ diverse claims across 8 domains:
    • Science, Misconceptions, History, Technology
    • Health, Geography, Social, Recent Events
  • 3 difficulty levels (easy, medium, hard)
  • 4 verification statuses (verified, misinformation, partially_true, needs_context)
  • Domain-specific claim categorization

Example:

from src.benchmark_dataset import EmbeddingBenchmarkDataset

claims = EmbeddingBenchmarkDataset.get_all_claims()
science_claims = EmbeddingBenchmarkDataset.get_by_domain("science")
stats = EmbeddingBenchmarkDataset.get_statistics()
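
To make the dataset shape above concrete, here is a minimal sketch of how such claims could be represented and summarized. The `BenchmarkClaim` class, its field names, and the three sample labels are assumptions for illustration only; the actual schema in benchmark_dataset.py may differ.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkClaim:
    """One benchmark entry; field names are illustrative, not the framework's schema."""
    text: str
    domain: str       # e.g. "science", "misconceptions", "health"
    difficulty: str   # "easy", "medium", or "hard"
    status: str       # "verified", "misinformation", "partially_true", "needs_context"


# Tiny illustrative sample; the labels are assumptions made for this demo.
CLAIMS = [
    BenchmarkClaim("The Earth orbits the Sun", "science", "easy", "verified"),
    BenchmarkClaim("Lightning never strikes twice", "misconceptions", "easy", "misinformation"),
    BenchmarkClaim("Vaccines cause autism", "health", "medium", "misinformation"),
]


def get_by_domain(claims, domain):
    """Return only the claims that belong to the given domain."""
    return [c for c in claims if c.domain == domain]


def get_statistics(claims):
    """Count claims overall and per domain, difficulty, and status."""
    return {
        "total": len(claims),
        "by_domain": dict(Counter(c.domain for c in claims)),
        "by_difficulty": dict(Counter(c.difficulty for c in claims)),
        "by_status": dict(Counter(c.status for c in claims)),
    }


stats = get_statistics(CLAIMS)
print(stats["by_status"])
```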

2. A/B Test Framework (ab_test_embeddings.py)

  • Parallel model evaluation
  • Metrics collection:
    • Accuracy: KB hit rate, vector accuracy
    • Latency: Embedding time, percentiles
    • Cost: Token count, USD estimation
  • Intelligent recommendation logic
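
As a rough illustration of the accuracy and latency metrics listed above, the sketch below computes a KB hit rate and latency percentiles using only the standard library. The function names and the choice of p50/p95/p99 are assumptions; the README does not say which percentiles the framework reports.

```python
import statistics


def kb_hit_rate(hits: list[bool]) -> float:
    """Fraction of claims for which the knowledge base returned a match."""
    return sum(hits) / len(hits) if hits else 0.0


def latency_percentiles(latencies_ms: list[float]) -> dict[str, float]:
    """Return p50/p95/p99 from observed latencies (needs at least two samples)."""
    # quantiles(n=100) yields 99 cut points; index k-1 is the k-th percentile.
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


# Synthetic latencies of 1..101 ms, purely for demonstration.
sample_latencies = [float(ms) for ms in range(1, 102)]
percentiles = latency_percentiles(sample_latencies)
hit_rate = kb_hit_rate([True, False, True, True])
print(percentiles, hit_rate)
```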

Pricing (2024 OpenAI):

  • text-embedding-3-small: $0.02 per 1M tokens (1536 dims)
  • text-embedding-3-large: $0.13 per 1M tokens (3072 dims), 6.5x the price of the small model (550% more expensive)
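
The price gap can be checked with simple arithmetic. The sketch below uses the per-token prices listed above; the workload of 100k queries per month at 500 tokens per query is a hypothetical figure chosen for illustration, and `monthly_cost_usd` is not part of the framework.

```python
# Prices per 1M tokens, taken from the pricing list above.
PRICE_PER_MILLION_USD = {
    "text-embedding-3-small": 0.02,
    "text-embedding-3-large": 0.13,
}


def monthly_cost_usd(model: str, queries_per_month: int, tokens_per_query: int) -> float:
    """Estimate monthly embedding spend for an assumed workload."""
    total_tokens = queries_per_month * tokens_per_query
    return total_tokens / 1_000_000 * PRICE_PER_MILLION_USD[model]


# Hypothetical workload: 100k queries/month at 500 tokens each.
small = monthly_cost_usd("text-embedding-3-small", 100_000, 500)
large = monthly_cost_usd("text-embedding-3-large", 100_000, 500)
price_ratio = (
    PRICE_PER_MILLION_USD["text-embedding-3-large"]
    / PRICE_PER_MILLION_USD["text-embedding-3-small"]
)

print(f"small: ${small:.2f}, large: ${large:.2f}, ratio: {price_ratio:.1f}x")
```

The ratio depends only on the two prices, while the absolute monthly difference scales linearly with the token volume you plug in.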

Example:

from src.ab_test_embeddings import EmbeddingABTestFramework

framework = EmbeddingABTestFramework(
    embedding_manager=embedding_manager,
    hybrid_search=search_engine,
    neo4j_manager=neo4j
)

result = await framework.run_ab_test(benchmark_claims, test_id="test_001")
print(framework.generate_report(result))

3. Cross-Encoder Reranking (cross_encoder_reranker.py)

  • 3 model tiers:
    • Fast: ms-marco-MiniLM-L-6-v2 (5-10ms, nDCG=0.39)
    • Balanced: ms-marco-TinyBERT-L-2-v2 (10-20ms, nDCG=0.50)
    • Accurate: ms-marco-MiniLM-L-12-v2 (20-50ms, nDCG=0.47)
  • Hybrid score + cross-encoder combination
  • Automatic relevance labeling

Example:

from src.cross_encoder_reranker import CrossEncoderReranker

reranker = CrossEncoderReranker(
    model_name="cross-encoder/ms-marco-MiniLM-L-6-v2",
    device="cpu",
    batch_size=32
)

reranked = await reranker.rerank_async(
    query="climate change effects",
    candidates=hybrid_results,
    top_k=5,
    cross_encoder_weight=0.4
)
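
One plausible reading of `cross_encoder_weight` is a linear blend between the hybrid score and a normalized cross-encoder score. The sketch below illustrates that idea; the sigmoid normalization, the linear interpolation, and both function names are assumptions, not the confirmed behavior of `CrossEncoderReranker`.

```python
import math


def blend_scores(hybrid_score: float, cross_encoder_logit: float,
                 cross_encoder_weight: float = 0.4) -> float:
    """Linearly interpolate a hybrid score with a sigmoid-squashed cross-encoder logit."""
    ce_score = 1.0 / (1.0 + math.exp(-cross_encoder_logit))  # map logit into (0, 1)
    return (1.0 - cross_encoder_weight) * hybrid_score + cross_encoder_weight * ce_score


def rerank(candidates: list[tuple[str, float, float]],
           top_k: int = 5) -> list[tuple[str, float]]:
    """candidates holds (text, hybrid_score, cross_encoder_logit) triples.

    Returns the top_k (text, blended_score) pairs, best first.
    """
    scored = [(text, blend_scores(hybrid, logit)) for text, hybrid, logit in candidates]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:top_k]


# Synthetic candidates: "a" wins on hybrid score, "b" wins on the cross-encoder.
ranked = rerank([("a", 0.9, -5.0), ("b", 0.5, 5.0)], top_k=2)
print(ranked)
```

With a weight of 0.4, a strong cross-encoder signal can overturn a moderate hybrid-score lead, which is the point of the second-stage reranker.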

4. Learned Weight Optimizer (learned_weight_optimizer.py)

  • Gradient descent with momentum
  • Per-claim-type weight sets:
    • Factual (0.4, 0.3, 0.3)
    • Misconception (0.2, 0.5, 0.3)
    • Contextual (0.3, 0.3, 0.4)
    • Opinion, Recent events
  • History tracking and performance metrics
  • JSON export

Example:

from src.learned_weight_optimizer import LearnedWeightOptimizer

optimizer = LearnedWeightOptimizer(
    learning_rate=0.01,
    momentum=0.9,
    iterations=100
)

# Add evaluations
optimizer.add_evaluation(
    claim="The Earth orbits the Sun",
    claim_type="factual",
    kb_hit=True,
    hybrid_score=0.85,
    exact_score=0.9,
    semantic_score=0.8,
    vector_score=0.8,
    final_confidence=0.85
)

# Optimize
weights = optimizer.optimize(global_only=False)
optimizer.export_weights_json("weights.json")
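
For readers unfamiliar with gradient descent with momentum, here is a minimal self-contained sketch of the technique applied to three search weights. It assumes the weights apply to the exact, semantic, and vector scores (the README does not state the order), uses a mean squared error objective against a 0/1 target, and clips and renormalizes the weights after each step. All of these are illustrative choices, not the documented internals of `LearnedWeightOptimizer`.

```python
def optimize_weights(samples, learning_rate=0.01, momentum=0.9, iterations=100):
    """Fit three non-negative weights that sum to 1.

    samples: list of ((exact, semantic, vector), target) pairs, target in [0, 1].
    """
    weights = [1 / 3, 1 / 3, 1 / 3]
    velocity = [0.0, 0.0, 0.0]
    for _ in range(iterations):
        # Gradient of the mean squared error between weighted score and target.
        grad = [0.0, 0.0, 0.0]
        for scores, target in samples:
            error = sum(w * s for w, s in zip(weights, scores)) - target
            for i, score in enumerate(scores):
                grad[i] += 2 * error * score / len(samples)
        # Momentum update: velocity accumulates a decaying sum of past gradients.
        velocity = [momentum * v - learning_rate * g for v, g in zip(velocity, grad)]
        weights = [max(w + v, 0.0) for w, v in zip(weights, velocity)]
        # Project back onto "non-negative, sums to 1"; fall back to uniform if all clipped.
        total = sum(weights)
        weights = [w / total for w in weights] if total > 0 else [1 / 3, 1 / 3, 1 / 3]
    return weights


# Synthetic data in which the exact-match score predicts the target perfectly.
samples = [((1.0, 0.0, 0.5), 1.0), ((0.0, 1.0, 0.5), 0.0)]
learned = optimize_weights(samples)
print([round(w, 3) for w in learned])
```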

5. Claim Type Classifier (claim_type_classifier.py)

  • Automatic LLM-based classification into 5 types:
    • Factual: Objective, verifiable statements
    • Misconception: Common false beliefs or debunked claims
    • Contextual: True but needs nuance or context
    • Opinion: Subjective viewpoints
    • Recent: Time-sensitive or recent events
  • Few-shot prompt engineering with examples
  • Confidence scores and reasoning
  • Built-in caching to reduce API calls
  • Automatic weight selection based on claim type

Example:

from src.claim_type_classifier import ClaimTypeClassifier, ClaimTypeClassifierWithWeightSelection

# Initialize classifier
classifier = ClaimTypeClassifier(
    llm_manager=llm_manager,
    model="gpt-4",
    cache_file="classifications_cache.json"
)

# Classify single claim
classification = await classifier.classify("Vaccines cause autism")
print(f"Type: {classification.predicted_type}, Confidence: {classification.confidence}")

# Batch classification
claims = [
    "The Earth orbits the Sun",
    "Lightning never strikes twice",
    "AI will replace all jobs"
]
results = await classifier.classify_batch(claims)

# Automatic weight selection
selector = ClaimTypeClassifierWithWeightSelection(classifier, weight_optimizer)
results, classification = await selector.search_with_automatic_weights(
    claim="Social media is bad",
    search_engine=hybrid_search_engine
)
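
The built-in caching mentioned above could work along the following lines: classification results are stored in a JSON file keyed by a hash of the normalized claim text, so repeated claims skip the LLM call. `ClassificationCache`, its methods, and the stored values are hypothetical; how `ClaimTypeClassifier` actually uses `cache_file` is not shown in this README.

```python
import hashlib
import json
import tempfile
from pathlib import Path


class ClassificationCache:
    """JSON-file cache mapping a normalized claim to its classification result."""

    def __init__(self, cache_file: str):
        self.path = Path(cache_file)
        # Load previous entries if the cache file already exists.
        self.entries = json.loads(self.path.read_text()) if self.path.exists() else {}

    @staticmethod
    def _key(claim: str) -> str:
        # Normalize whitespace and case so trivially different inputs share an entry.
        return hashlib.sha256(claim.strip().lower().encode("utf-8")).hexdigest()

    def get(self, claim: str):
        """Return the cached result, or None on a cache miss."""
        return self.entries.get(self._key(claim))

    def put(self, claim: str, result: dict) -> None:
        """Store a result and persist the whole cache to disk."""
        self.entries[self._key(claim)] = result
        self.path.write_text(json.dumps(self.entries, indent=2))


# Demo in a temporary directory; the stored values are made up for illustration.
with tempfile.TemporaryDirectory() as tmp:
    cache_path = str(Path(tmp) / "classifications_cache.json")
    ClassificationCache(cache_path).put(
        "Vaccines cause autism", {"predicted_type": "misconception", "confidence": 0.97}
    )
    cached = ClassificationCache(cache_path).get("  vaccines cause autism ")
    missing = ClassificationCache(cache_path).get("The Earth orbits the Sun")

print(cached, missing)
```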

📊 Use Cases

1. Model Selection Decision

# Run comprehensive A/B test
result = await framework.run_ab_test(claims)

# Decision:
# If large_accuracy > small_accuracy + 2% AND cost_diff < $100/month
#   → Use large model
# Else
#   → Use small model (cost-optimized)
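
The decision rule in the comments above can be written as a small function. This is a sketch of that rule only; `recommend_model` and its parameter names are hypothetical and do not reflect the framework's actual recommendation logic, which may weigh further factors such as latency.

```python
def recommend_model(small_accuracy: float, large_accuracy: float,
                    monthly_cost_diff_usd: float,
                    min_accuracy_gain: float = 0.02,
                    max_cost_diff_usd: float = 100.0) -> str:
    """Pick "large" only if it is clearly more accurate and the extra cost is acceptable.

    Accuracies are fractions in [0, 1]; the cost difference is in USD per month.
    """
    accuracy_gain = large_accuracy - small_accuracy
    if accuracy_gain > min_accuracy_gain and monthly_cost_diff_usd < max_cost_diff_usd:
        return "large"
    return "small"  # cost-optimized default


# A 3-point gain for $75/month clears both thresholds.
print(recommend_model(0.72, 0.75, 75.0))
# A 1-point gain does not justify the larger model.
print(recommend_model(0.72, 0.73, 75.0))
```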

2. Dynamic Reranking

# First: Hybrid search (fast, high recall)
hybrid_results = await search_engine.hybrid_search(claim)

# Second: Cross-encoder reranking (precision)
final_results = await reranker.rerank_async(
    query=claim,
    candidates=hybrid_results,
    top_k=5
)

3. Adaptive Weighting with Auto-Classification

# Initialize classifier with weight optimizer integration
classifier = ClaimTypeClassifier(llm_manager, cache_file="cache.json")
selector = ClaimTypeClassifierWithWeightSelection(classifier, optimizer)

# Automatic classification and weight selection (no manual type needed)
results, classification = await selector.search_with_automatic_weights(
    claim="Vaccines cause autism",
    search_engine=search_engine
)

# OR use HybridSearchEngine directly with auto-classification
results, classification = await search_engine.hybrid_search_with_auto_classification(
    claim="Lightning never strikes twice",
    limit=5
)

# Results will be automatically ranked using weights optimized for "misconception" type
print(f"Classified as: {classification.predicted_type}")
print(f"Confidence: {classification.confidence}")
for result in results:
    print(f"  - {result.claim}: {result.hybrid_score:.2f}")

📈 Expected Impact

  • Accuracy: +2-3% improvement with large model on hard claims
  • Latency: +10-20ms for cross-encoder reranking
  • Cost: +$66-100/month for large model at 100k queries/month
  • Recall: Better capture of relevant facts via reranking
  • Precision: Improved relevance labels for top-K results

🚀 Installation

# Clone repository
git clone https://github.com/yourusername/embedding-ab-test-framework
cd embedding-ab-test-framework

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

Dependencies

openai>=1.0.0
sentence-transformers>=2.2.0
numpy>=1.21.0
pytest>=7.0.0
pytest-asyncio>=0.21.0

🧪 Testing

# Run all tests
pytest tests/

# Run specific test file
pytest tests/test_advanced_embeddings.py -v

# Run with coverage
pytest --cov=src tests/

Architecture

embedding-ab-test-framework/
├── src/
│   ├── benchmark_dataset.py          # 25+ benchmark claims
│   ├── ab_test_embeddings.py         # A/B testing framework
│   ├── cross_encoder_reranker.py     # Reranking module
│   ├── learned_weight_optimizer.py   # Weight optimization
│   └── __init__.py
├── tests/
│   ├── test_advanced_embeddings.py   # Comprehensive test suite
│   └── __init__.py
├── docs/
│   └── API.md                        # Detailed API documentation
├── requirements.txt
├── README.md
└── LICENSE

🔌 Integration

With Guardian G4 Knowledge Base

# Initialize framework
framework = EmbeddingABTestFramework(
    embedding_manager=guardian.embedding_manager,
    hybrid_search=guardian.vector_search_engine,
    neo4j_manager=guardian.neo4j
)

# Run A/B test on G4 KB
result = await framework.run_ab_test(
    EmbeddingBenchmarkDataset.get_all_claims(),
    test_id="guardian_g4_selection"
)

With Custom Hybrid Search

# Add to existing HybridSearchEngine
from src.cross_encoder_reranker import CrossEncoderReranker
from src.learned_weight_optimizer import LearnedWeightOptimizer

search_engine.cross_encoder = CrossEncoderReranker()
search_engine.weight_optimizer = LearnedWeightOptimizer()

# Use enhanced methods
results = await search_engine.hybrid_search_with_reranking(claim)
results = await search_engine.hybrid_search_with_learned_weights(claim, "factual")

📊 Example Output

╔════════════════════════════════════════════════════════════════╗
║          EMBEDDING MODEL A/B TEST REPORT                       ║
╚════════════════════════════════════════════════════════════════╝

Test ID: test_001
Timestamp: 2026-04-15T18:30:00
Dataset Size: 25 claims

┌─ ACCURACY METRICS ─────────────────────────────────────────────┐
│                                                                │
│  Model          │  KB Hit Rate  │  Top-1 Ranking  │  Avg Score │
│  ──────────────────────────────────────────────────────────────│
│  Small (3-small)│  72.0%        │  0.8234         │  0.7812    │
│  Large (3-large)│  75.0%        │  0.8456         │  0.8145    │
│  Difference     │  +3.0%        │  +0.0222        │  +0.0333   │
│                                                                │
└────────────────────────────────────────────────────────────────┘

╔════════════════════════════════════════════════════════════════╗
║                    🎯 RECOMMENDATION                           ║
╠════════════════════════════════════════════════════════════════╣
║                                                                ║
║  Winner: LARGE                                                 ║
║                                                                ║
║  Rationale:                                                    ║
║  Large model offers 3.0% accuracy improvement for $75/month    ║
║                                                                ║
╚════════════════════════════════════════════════════════════════╝

📚 Documentation

🤝 Contributing

Contributions welcome! Please:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/name)
  3. Add tests for new functionality
  4. Submit a pull request

📄 License

MIT License - see LICENSE file

🔗 Related Projects

  • Guardian G4 KB - Knowledge base system using this framework
  • ADRION Ecosystem - 369-dimensional decision orchestration
  • Arbitrage Engine - Multi-model fact verification

📞 Support

For issues, questions, or suggestions:

  • Open a GitHub issue
  • Check FAQ in docs/
  • Review test examples

Last Updated: May 2026
Version: 1.0.0
Maintainer: Adrian Hadjiman
