Embedding A/B Test Framework

A Python framework for A/B testing OpenAI embedding models, with cross-encoder reranking and learned weight optimization for hybrid search.

🎯 Overview

This framework supports data-driven embedding model selection through:

A/B Testing: Systematically compare text-embedding-3-small vs text-embedding-3-large
Cross-Encoder Reranking: Fine-grained relevance scoring for top-K results
Learned Weight Optimization: Adaptive weight adjustment via gradient descent
Cost-Benefit Analysis: Detailed ROI calculations for model selection

✨ Features

1. Benchmark Dataset (`benchmark_dataset.py`)

25+ diverse claims across 8 domains:
- Science, Misconceptions, History, Technology
- Health, Geography, Social, Recent Events
3 difficulty levels (easy, medium, hard)
4 verification statuses (verified, misinformation, partially_true, needs_context)
Domain-specific claim categorization

Example:

from src.benchmark_dataset import EmbeddingBenchmarkDataset

claims = EmbeddingBenchmarkDataset.get_all_claims()
science_claims = EmbeddingBenchmarkDataset.get_by_domain("science")
stats = EmbeddingBenchmarkDataset.get_statistics()
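To make the dataset shape concrete, here is a minimal sketch of how a claim record and the domain filter might look. The `BenchmarkClaim` dataclass, its field names, and the labels on the three sample claims are assumptions for illustration; the actual structure in `benchmark_dataset.py` may differ.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkClaim:
    # Hypothetical record layout; field names are assumptions.
    text: str
    domain: str      # e.g. "science", "health"
    difficulty: str  # "easy", "medium", or "hard"
    status: str      # "verified", "misinformation", "partially_true", "needs_context"


# Illustrative entries only, not the real benchmark data.
CLAIMS = [
    BenchmarkClaim("The Earth orbits the Sun", "science", "easy", "verified"),
    BenchmarkClaim("Lightning never strikes twice", "misconceptions", "easy", "misinformation"),
    BenchmarkClaim("Vaccines cause autism", "health", "medium", "misinformation"),
]


def get_by_domain(claims, domain):
    """Return the claims whose domain matches (case-insensitive)."""
    return [c for c in claims if c.domain == domain.lower()]


def get_statistics(claims):
    """Count claims per domain, difficulty level, and verification status."""
    return {
        "total": len(claims),
        "by_domain": dict(Counter(c.domain for c in claims)),
        "by_difficulty": dict(Counter(c.difficulty for c in claims)),
        "by_status": dict(Counter(c.status for c in claims)),
    }
```

Keeping the record immutable (`frozen=True`) makes it safe to share the same claim list across both models in an A/B run.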

2. A/B Test Framework (`ab_test_embeddings.py`)

Parallel model evaluation
Metrics collection:
- Accuracy: KB hit rate, vector accuracy
- Latency: Embedding time, percentiles
- Cost: Token count, USD estimation
Intelligent recommendation logic
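
The accuracy and latency metrics listed above can be computed from raw per-claim measurements. The sketch below uses only the standard library; the function names and the choice of p50/p95/p99 are assumptions, not the framework's actual API.

```python
import statistics


def latency_percentiles(latencies_ms):
    """Return p50, p95, and p99 of a list of latencies in milliseconds.

    statistics.quantiles(n=100) yields 99 cut points, so index 49 is the
    50th percentile, index 94 the 95th, and index 98 the 99th.
    Requires at least two measurements.
    """
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


def kb_hit_rate(hits):
    """Fraction of claims for which the knowledge base returned a match."""
    return sum(hits) / len(hits) if hits else 0.0
```

Reporting tail percentiles alongside the median matters here because a reranking stage tends to add latency unevenly across queries.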

Pricing (2024 OpenAI):

text-embedding-3-small: $0.02 per 1M tokens (1536 dims)
text-embedding-3-large: $0.13 per 1M tokens (3072 dims), 6.5x the price of small (550% more expensive)
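
As a worked example of the pricing above, the following sketch estimates embedding cost from a token count. The prices come from the list above; the function names and the `avg_tokens_per_query` parameter are assumptions, and the estimate covers query embedding only (re-embedding a knowledge base would be extra).

```python
# Prices in USD per 1M tokens, taken from the pricing list above.
PRICE_PER_1M_TOKENS_USD = {
    "text-embedding-3-small": 0.02,
    "text-embedding-3-large": 0.13,
}


def embedding_cost_usd(model, total_tokens):
    """Cost of embedding `total_tokens` tokens with the given model."""
    return total_tokens / 1_000_000 * PRICE_PER_1M_TOKENS_USD[model]


def monthly_cost_difference_usd(queries_per_month, avg_tokens_per_query):
    """Extra monthly cost of the large model over the small one.

    `avg_tokens_per_query` is a workload-specific assumption; measure it
    on real traffic before relying on the estimate.
    """
    total_tokens = queries_per_month * avg_tokens_per_query
    return (
        embedding_cost_usd("text-embedding-3-large", total_tokens)
        - embedding_cost_usd("text-embedding-3-small", total_tokens)
    )
```

Because both prices scale linearly with tokens, the ratio between the two models stays at 6.5 regardless of volume; only the absolute difference grows.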

Example:

from src.ab_test_embeddings import EmbeddingABTestFramework

framework = EmbeddingABTestFramework(
    embedding_manager=embedding_manager,
    hybrid_search=search_engine,
    neo4j_manager=neo4j
)

result = await framework.run_ab_test(benchmark_claims, test_id="test_001")
print(framework.generate_report(result))

3. Cross-Encoder Reranking (`cross_encoder_reranker.py`)

3 model tiers:
- Fast: ms-marco-MiniLM-L-6-v2 (5-10ms, nDCG=0.39)
- Balanced: ms-marco-TinyBERT-L-2-v2 (10-20ms, nDCG=0.50)
- Accurate: ms-marco-MiniLM-L-12-v2 (20-50ms, nDCG=0.47)
Hybrid score + cross-encoder combination
Automatic relevance labeling

Example:

from src.cross_encoder_reranker import CrossEncoderReranker

reranker = CrossEncoderReranker(
    model_name="cross-encoder/ms-marco-MiniLM-L-6-v2",
    device="cpu",
    batch_size=32
)

reranked = await reranker.rerank_async(
    query="climate change effects",
    candidates=hybrid_results,
    top_k=5,
    cross_encoder_weight=0.4
)
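
One plausible way to combine the hybrid score with the cross-encoder score is a linear blend controlled by `cross_encoder_weight`. The sketch below is an assumption about how such a blend could work, not the reranker's confirmed formula: it squashes the raw cross-encoder logit with a sigmoid so both scores lie in [0, 1], then mixes them.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def combine_scores(hybrid_score, cross_encoder_logit, cross_encoder_weight=0.4):
    """Blend a hybrid score in [0, 1] with a raw cross-encoder logit.

    Assumed formula: (1 - w) * hybrid + w * sigmoid(logit).
    """
    ce_score = sigmoid(cross_encoder_logit)
    return (1.0 - cross_encoder_weight) * hybrid_score + cross_encoder_weight * ce_score


def rerank(candidates, top_k=5, cross_encoder_weight=0.4):
    """Rerank (text, hybrid_score, cross_encoder_logit) tuples by blended score."""
    scored = [
        (text, combine_scores(hybrid, logit, cross_encoder_weight))
        for text, hybrid, logit in candidates
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]
```

With a weight of 0.4 the hybrid score still dominates, so the cross-encoder mostly reorders candidates that the first stage scored similarly.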

4. Learned Weight Optimizer (`learned_weight_optimizer.py`)

Gradient descent with momentum
Per-claim-type weight sets:
- Factual (0.4, 0.3, 0.3)
- Misconception (0.2, 0.5, 0.3)
- Contextual (0.3, 0.3, 0.4)
- Opinion, Recent events
History tracking and performance metrics
JSON export

Example:

from src.learned_weight_optimizer import LearnedWeightOptimizer

optimizer = LearnedWeightOptimizer(
    learning_rate=0.01,
    momentum=0.9,
    iterations=100
)

# Add evaluations
optimizer.add_evaluation(
    claim="The Earth orbits the Sun",
    claim_type="factual",
    kb_hit=True,
    hybrid_score=0.85,
    exact_score=0.9,
    semantic_score=0.8,
    vector_score=0.8,
    final_confidence=0.85
)

# Optimize
weights = optimizer.optimize(global_only=False)
optimizer.export_weights_json("weights.json")
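
To illustrate gradient descent with momentum on a weight triple, here is a minimal sketch. It assumes the three weights apply to the (exact, semantic, vector) scores, that the training target is 1.0 for a KB hit and 0.0 otherwise, and that weights are clipped to be non-negative and renormalized to sum to 1 after each step. None of these choices is confirmed by the source; the real `LearnedWeightOptimizer` may use a different loss or constraint.

```python
def optimize_weights(evaluations, learning_rate=0.01, momentum=0.9, iterations=100):
    """Fit three score weights by gradient descent with momentum.

    evaluations: list of ((exact, semantic, vector), target) pairs,
    where target is 1.0 for a KB hit and 0.0 otherwise (an assumption).
    Minimizes mean squared error between the weighted score and the target.
    """
    weights = [1 / 3, 1 / 3, 1 / 3]
    velocity = [0.0, 0.0, 0.0]
    if not evaluations:
        return weights
    n = len(evaluations)
    for _ in range(iterations):
        # Gradient of the mean squared error with respect to each weight.
        grad = [0.0, 0.0, 0.0]
        for scores, target in evaluations:
            error = sum(w * s for w, s in zip(weights, scores)) - target
            for i, s in enumerate(scores):
                grad[i] += 2.0 * error * s / n
        # Momentum update: velocity accumulates past gradients.
        velocity = [momentum * v - learning_rate * g for v, g in zip(velocity, grad)]
        # Project back onto non-negative weights that sum to 1.
        weights = [max(w + v, 0.0) for w, v in zip(weights, velocity)]
        total = sum(weights)
        weights = [w / total for w in weights] if total > 0 else [1 / 3] * 3
    return weights
```

Running one such fit per claim type, on only that type's evaluations, would yield per-type weight sets like those listed above.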

5. Claim Type Classifier (`claim_type_classifier.py`)

Automatic LLM-based classification into 5 types:
- Factual: Objective, verifiable statements
- Misconception: Common false beliefs or debunked claims
- Contextual: True but needs nuance or context
- Opinion: Subjective viewpoints
- Recent: Time-sensitive or recent events
Few-shot prompt engineering with examples
Confidence scores and reasoning
Built-in caching to reduce API calls
Automatic weight selection based on claim type

Example:

from src.claim_type_classifier import ClaimTypeClassifier, ClaimTypeClassifierWithWeightSelection

# Initialize classifier
classifier = ClaimTypeClassifier(
    llm_manager=llm_manager,
    model="gpt-4",
    cache_file="classifications_cache.json"
)

# Classify single claim
classification = await classifier.classify("Vaccines cause autism")
print(f"Type: {classification.predicted_type}, Confidence: {classification.confidence}")

# Batch classification
claims = [
    "The Earth orbits the Sun",
    "Lightning never strikes twice",
    "AI will replace all jobs"
]
results = await classifier.classify_batch(claims)

# Automatic weight selection
# `optimizer` is the LearnedWeightOptimizer and `search_engine` the hybrid search engine from the earlier examples
selector = ClaimTypeClassifierWithWeightSelection(classifier, optimizer)
results, classification = await selector.search_with_automatic_weights(
    claim="Social media is bad",
    search_engine=search_engine
)
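
The caching and weight-selection behaviour described above can be sketched as follows. This is a synchronous, simplified illustration: `CachedClassifier`, `weights_for`, and the equal-weight fallback are hypothetical names and choices, and the `classify_fn` callable stands in for the LLM call. The three weight triples are the ones listed in the Learned Weight Optimizer section.

```python
import json
from pathlib import Path

# Weight triples from the Learned Weight Optimizer section.
DEFAULT_WEIGHTS = {
    "factual": (0.4, 0.3, 0.3),
    "misconception": (0.2, 0.5, 0.3),
    "contextual": (0.3, 0.3, 0.4),
}
# Assumed fallback for types without a listed triple (opinion, recent).
FALLBACK_WEIGHTS = (1 / 3, 1 / 3, 1 / 3)


class CachedClassifier:
    """Wraps a classification function with a JSON file cache."""

    def __init__(self, classify_fn, cache_file):
        self._classify_fn = classify_fn  # stand-in for the LLM call
        self._path = Path(cache_file)
        self._cache = json.loads(self._path.read_text()) if self._path.exists() else {}
        self.calls = 0  # number of uncached classifications performed

    def classify(self, claim):
        key = claim.strip().lower()  # normalize so trivial variants share an entry
        if key not in self._cache:
            self.calls += 1
            self._cache[key] = self._classify_fn(claim)
            self._path.write_text(json.dumps(self._cache, indent=2))
        return self._cache[key]


def weights_for(claim_type):
    """Pick the weight triple for a claim type, with an equal-weight fallback."""
    return DEFAULT_WEIGHTS.get(claim_type, FALLBACK_WEIGHTS)
```

Persisting the cache to disk means repeated benchmark runs over the same claims trigger no further classification calls.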

📊 Use Cases

1. Model Selection Decision

# Run comprehensive A/B test
result = await framework.run_ab_test(claims)

# Decision:
# If large_accuracy > small_accuracy + 2% AND cost_diff < $100/month
#   → Use large model
# Else
#   → Use small model (cost-optimized)
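
The decision rule in the comments above can be written as a small function. This sketch treats accuracies as fractions and reads "2%" as two percentage points (0.02); both are assumptions, as is the function name.

```python
def recommend_model(
    small_accuracy,
    large_accuracy,
    monthly_cost_diff_usd,
    min_accuracy_gain=0.02,
    max_cost_diff_usd=100.0,
):
    """Return "large" only if the accuracy gain and cost both clear their thresholds.

    Accuracies are fractions in [0, 1]; min_accuracy_gain=0.02 is read as
    two percentage points (an assumption about the rule's intent).
    """
    gain = large_accuracy - small_accuracy
    if gain > min_accuracy_gain and monthly_cost_diff_usd < max_cost_diff_usd:
        return "large"
    return "small"
```

Exposing both thresholds as parameters lets a team tune the rule to its own budget instead of hard-coding the 2% and $100 values.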

2. Dynamic Reranking

# First: Hybrid search (fast, high recall)
hybrid_results = await search_engine.hybrid_search(claim)

# Second: Cross-encoder reranking (precision)
final_results = await reranker.rerank_async(
    query=claim,
    candidates=hybrid_results,
    top_k=5
)

3. Adaptive Weighting with Auto-Classification

# Initialize classifier with weight optimizer integration
classifier = ClaimTypeClassifier(llm_manager, cache_file="cache.json")
selector = ClaimTypeClassifierWithWeightSelection(classifier, optimizer)

# Automatic classification and weight selection (no manual type needed)
results, classification = await selector.search_with_automatic_weights(
    claim="Vaccines cause autism",
    search_engine=search_engine
)

# OR use HybridSearchEngine directly with auto-classification
results, classification = await search_engine.hybrid_search_with_auto_classification(
    claim="Lightning never strikes twice",
    limit=5
)

# Results will be automatically ranked using weights optimized for "misconception" type
print(f"Classified as: {classification.predicted_type}")
print(f"Confidence: {classification.confidence}")
for result in results:
    print(f"  - {result.claim}: {result.hybrid_score:.2f}")

📈 Expected Impact

Accuracy: +2-3% improvement with large model on hard claims
Latency: +10-20ms for cross-encoder reranking
Cost: +$66-100/month for large model at 100k queries/month
Recall: Better capture of relevant facts via reranking
Precision: Improved relevance labels for top-K results

🚀 Installation

# Clone repository
git clone https://github.com/yourusername/embedding-ab-test-framework
cd embedding-ab-test-framework

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

Dependencies

openai>=1.0.0
sentence-transformers>=2.2.0
numpy>=1.21.0
pytest>=7.0.0
pytest-asyncio>=0.21.0

🧪 Testing

# Run all tests
pytest tests/

# Run specific test file
pytest tests/test_advanced_embeddings.py -v

# Run with coverage
pytest --cov=src tests/

📝 Architecture

embedding-ab-test-framework/
├── src/
│   ├── benchmark_dataset.py          # 25+ benchmark claims
│   ├── ab_test_embeddings.py         # A/B testing framework
│   ├── cross_encoder_reranker.py     # Reranking module
│   ├── learned_weight_optimizer.py   # Weight optimization
│   ├── claim_type_classifier.py      # Claim type classification
│   └── __init__.py
├── tests/
│   ├── test_advanced_embeddings.py   # Comprehensive test suite
│   └── __init__.py
├── docs/
│   └── API.md                        # Detailed API documentation
├── requirements.txt
├── README.md
└── LICENSE

🔌 Integration

With Guardian G4 Knowledge Base

# Initialize framework
framework = EmbeddingABTestFramework(
    embedding_manager=guardian.embedding_manager,
    hybrid_search=guardian.vector_search_engine,
    neo4j_manager=guardian.neo4j
)

# Run A/B test on G4 KB
result = await framework.run_ab_test(
    EmbeddingBenchmarkDataset.get_all_claims(),
    test_id="guardian_g4_selection"
)

With Custom Hybrid Search

# Add to existing HybridSearchEngine
from src.cross_encoder_reranker import CrossEncoderReranker
from src.learned_weight_optimizer import LearnedWeightOptimizer

search_engine.cross_encoder = CrossEncoderReranker()
search_engine.weight_optimizer = LearnedWeightOptimizer()

# Use enhanced methods
results = await search_engine.hybrid_search_with_reranking(claim)
results = await search_engine.hybrid_search_with_learned_weights(claim, "factual")

📊 Example Output

╔═════════════════════════════════════════════════════════════════╗
║          EMBEDDING MODEL A/B TEST REPORT                        ║
╚═════════════════════════════════════════════════════════════════╝

Test ID: test_001
Timestamp: 2026-04-15T18:30:00
Dataset Size: 25 claims

┌─ ACCURACY METRICS ──────────────────────────────────────────────┐
│                                                                  │
│  Model          │  KB Hit Rate  │  Top-1 Ranking  │  Avg Score │
│  ───────────────────────────────────────────────────────────────│
│  Small (3-small)│  72.0%        │  0.8234         │  0.7812    │
│  Large (3-large)│  75.0%        │  0.8456         │  0.8145    │
│  Difference     │  +3.0%        │  +0.0222        │  +0.0333   │
│                                                                  │
└──────────────────────────────────────────────────────────────────┘

╔═════════════════════════════════════════════════════════════════╗
║                    🎯 RECOMMENDATION                             ║
╠═════════════════════════════════════════════════════════════════╣
║                                                                  ║
║  Winner: LARGE                                                   ║
║                                                                  ║
║  Rationale:                                                      ║
║  Large model offers 3.0% accuracy improvement for $75/month     ║
║                                                                  ║
╚═════════════════════════════════════════════════════════════════╝

📚 Documentation

API Reference — Detailed method documentation
Examples — Code examples and notebooks
Benchmarks — Performance comparisons

🤝 Contributing

Contributions welcome! Please:

Fork the repository
Create a feature branch (git checkout -b feature/name)
Add tests for new functionality
Submit a pull request

📄 License

MIT License - see LICENSE file

🔗 Related Projects

Guardian G4 KB — Knowledge base system using this framework
ADRION Ecosystem — 369-dimensional decision orchestration
Arbitrage Engine — Multi-model fact verification

📞 Support

For issues, questions, or suggestions:

Open a GitHub issue
Check FAQ in docs/
Review test examples

Last Updated: May 2026
Version: 1.0.0
Maintainer: Adrian Hadjiman
