A comprehensive Python framework for A/B testing OpenAI embedding models with advanced reranking and adaptive weight optimization for hybrid search.
This framework enables data-driven embedding model selection through:
- A/B Testing: Systematically compare `text-embedding-3-small` vs `text-embedding-3-large`
- Cross-Encoder Reranking: Fine-grained relevance scoring for top-K results
- Learned Weight Optimization: Adaptive weight adjustment via gradient descent
- Cost-Benefit Analysis: Detailed ROI calculations for model selection
- 25+ diverse claims across 8 domains:
- Science, Misconceptions, History, Technology
- Health, Geography, Social, Recent Events
- 3 difficulty levels (easy, medium, hard)
- 4 verification statuses (verified, misinformation, partially_true, needs_context)
- Domain-specific claim categorization
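
To make the dataset shape concrete, here is a minimal sketch of how claims with a domain, difficulty, and verification status could be stored and filtered. The `BenchmarkClaim` dataclass, its field names, and the three sample rows are assumptions for illustration only; the actual `EmbeddingBenchmarkDataset` internals are not shown in this README.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkClaim:
    """Hypothetical record layout; the real dataset may use different fields."""
    text: str
    domain: str       # e.g. "science", "history", "health"
    difficulty: str   # "easy", "medium", or "hard"
    status: str       # "verified", "misinformation", "partially_true", "needs_context"


# Illustrative sample only, not the framework's 25+ claims.
SAMPLE_CLAIMS = [
    BenchmarkClaim("The Earth orbits the Sun", "science", "easy", "verified"),
    BenchmarkClaim("Lightning never strikes twice", "misconceptions", "easy", "misinformation"),
    BenchmarkClaim("Vaccines cause autism", "health", "medium", "misinformation"),
]


def get_by_domain(claims, domain):
    """Return the claims whose domain matches, ignoring case."""
    wanted = domain.lower()
    return [c for c in claims if c.domain == wanted]


def get_statistics(claims):
    """Count claims overall and per domain, difficulty, and status."""
    return {
        "total": len(claims),
        "by_domain": dict(Counter(c.domain for c in claims)),
        "by_difficulty": dict(Counter(c.difficulty for c in claims)),
        "by_status": dict(Counter(c.status for c in claims)),
    }
```

Keeping the records immutable (`frozen=True`) is one way to make sure a benchmark run cannot accidentally modify the reference data between the two model evaluations.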
Example:
from src.benchmark_dataset import EmbeddingBenchmarkDataset
claims = EmbeddingBenchmarkDataset.get_all_claims()
science_claims = EmbeddingBenchmarkDataset.get_by_domain("science")
stats = EmbeddingBenchmarkDataset.get_statistics()

- Parallel model evaluation
- Metrics collection:
- Accuracy: KB hit rate, vector accuracy
- Latency: Embedding time, percentiles
- Cost: Token count, USD estimation
- Intelligent recommendation logic
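
The accuracy and latency metrics listed above can be computed with the standard library alone. The sketch below is an assumption about how such metrics might be aggregated, not the framework's own code; `kb_hit_rate` and `latency_percentiles` are hypothetical helper names.

```python
import statistics


def kb_hit_rate(hits):
    """Fraction of claims for which the knowledge base returned a match."""
    return sum(hits) / len(hits) if hits else 0.0


def latency_percentiles(latencies_ms):
    """Return p50/p95/p99 of embedding latencies (needs at least two samples)."""
    # quantiles(n=100) yields 99 cut points; index i-1 is the i-th percentile.
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}
```

Reporting percentiles alongside the mean matters here because a few slow API calls can dominate the average while leaving the median unchanged.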
Pricing (2024 OpenAI):
- `text-embedding-3-small`: $0.02 per 1M tokens (1536 dims)
- `text-embedding-3-large`: $0.13 per 1M tokens (3072 dims), 6.5x the price of the small model (550% more expensive)
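
As a rough illustration of the USD estimation, the prices above translate into a monthly figure once a token volume is fixed. The workload numbers below (100,000 queries at 500 tokens each) are placeholders chosen for the example, and `monthly_cost_usd` is a hypothetical helper rather than part of the framework.

```python
# Prices per 1M tokens, as listed in this README.
PRICE_PER_1M_TOKENS_USD = {
    "text-embedding-3-small": 0.02,
    "text-embedding-3-large": 0.13,
}


def monthly_cost_usd(model, queries_per_month, avg_tokens_per_query):
    """Estimate embedding spend for one month of traffic."""
    total_tokens = queries_per_month * avg_tokens_per_query
    return total_tokens / 1_000_000 * PRICE_PER_1M_TOKENS_USD[model]


# Placeholder workload: replace with measured token counts.
small_cost = monthly_cost_usd("text-embedding-3-small", 100_000, 500)
large_cost = monthly_cost_usd("text-embedding-3-large", 100_000, 500)
cost_diff = large_cost - small_cost
```

The absolute difference scales linearly with token volume, so the same 6.5x price ratio can mean a negligible or a significant monthly amount depending on how much text is embedded, including any re-indexing of the knowledge base.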
Example:
from src.ab_test_embeddings import EmbeddingABTestFramework
framework = EmbeddingABTestFramework(
    embedding_manager=embedding_manager,
    hybrid_search=search_engine,
    neo4j_manager=neo4j
)
result = await framework.run_ab_test(benchmark_claims, test_id="test_001")
print(framework.generate_report(result))

- 3 model tiers:
- Fast: `ms-marco-MiniLM-L-6-v2` (5-10ms, nDCG=0.39)
- Balanced: `ms-marco-TinyBERT-L-2-v2` (10-20ms, nDCG=0.50)
- Accurate: `ms-marco-MiniLM-L-12-v2` (20-50ms, nDCG=0.47)
- Hybrid score + cross-encoder combination
- Automatic relevance labeling
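
One plausible way to combine a hybrid score with a cross-encoder score is a weighted average, sketched below. This is an assumption about the combination rule, not a description of `CrossEncoderReranker` itself: the sigmoid normalization of the raw logit, the label thresholds, and the helper names are all hypothetical.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function mapping a logit to (0, 1)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def combine_scores(hybrid_score, cross_encoder_logit, cross_encoder_weight=0.4):
    """Weighted average of the hybrid score and the normalized cross-encoder score."""
    ce_score = sigmoid(cross_encoder_logit)
    return (1.0 - cross_encoder_weight) * hybrid_score + cross_encoder_weight * ce_score


def relevance_label(score):
    """Map a combined score to a coarse label (thresholds are illustrative)."""
    if score >= 0.75:
        return "highly_relevant"
    if score >= 0.5:
        return "relevant"
    return "marginal"


def rerank(candidates, top_k=5, cross_encoder_weight=0.4):
    """candidates: iterable of (text, hybrid_score, cross_encoder_logit) tuples."""
    scored = [
        (text, combine_scores(hybrid, logit, cross_encoder_weight))
        for text, hybrid, logit in candidates
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [(text, score, relevance_label(score)) for text, score in scored[:top_k]]
```

With `cross_encoder_weight=0.4`, the cross-encoder can reorder candidates whose hybrid scores are close, while a large hybrid-score gap still mostly decides the ranking.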
Example:
from src.cross_encoder_reranker import CrossEncoderReranker
reranker = CrossEncoderReranker(
    model_name="cross-encoder/ms-marco-MiniLM-L-6-v2",
    device="cpu",
    batch_size=32
)
reranked = await reranker.rerank_async(
    query="climate change effects",
    candidates=hybrid_results,
    top_k=5,
    cross_encoder_weight=0.4
)

- Gradient descent with momentum
- Per-claim-type weight sets:
- Factual (0.4, 0.3, 0.3)
- Misconception (0.2, 0.5, 0.3)
- Contextual (0.3, 0.3, 0.4)
- Opinion, Recent events
- History tracking and performance metrics
- JSON export
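
The sketch below shows gradient descent with momentum applied to a three-component weight vector, in the spirit of the features listed above. It assumes the weights multiply (exact, semantic, vector) scores and are fitted to a target confidence with a mean squared error loss, then clipped and renormalized to sum to 1. The loss, the projection step, and the function name `optimize_weights` are assumptions; `LearnedWeightOptimizer` may work differently.

```python
def optimize_weights(samples, learning_rate=0.01, momentum=0.9,
                     iterations=100, init=(0.4, 0.3, 0.3)):
    """samples: list of ((exact, semantic, vector), target_confidence) pairs."""
    weights = list(init)
    velocity = [0.0, 0.0, 0.0]
    for _ in range(iterations):
        # Gradient of the mean squared error with respect to each weight.
        grad = [0.0, 0.0, 0.0]
        for scores, target in samples:
            error = sum(w * s for w, s in zip(weights, scores)) - target
            for k in range(3):
                grad[k] += 2.0 * error * scores[k] / len(samples)
        # Momentum update: velocity accumulates a decaying sum of past gradients.
        for k in range(3):
            velocity[k] = momentum * velocity[k] - learning_rate * grad[k]
            weights[k] += velocity[k]
        # Keep weights non-negative and summing to 1 (assumed constraint).
        weights = [max(w, 0.0) for w in weights]
        total = sum(weights)
        weights = [w / total for w in weights] if total > 0 else [1 / 3] * 3
    return tuple(weights)


# Scores taken from the usage example in this README.
samples = [((0.9, 0.8, 0.8), 0.85)]
learned = optimize_weights(samples)
```

Per-claim-type weight sets would follow by running the same routine once per claim type, each on its own subset of evaluations and starting from that type's initial weights.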
Example:
from src.learned_weight_optimizer import LearnedWeightOptimizer
optimizer = LearnedWeightOptimizer(
    learning_rate=0.01,
    momentum=0.9,
    iterations=100
)
# Add evaluations
optimizer.add_evaluation(
    claim="The Earth orbits the Sun",
    claim_type="factual",
    kb_hit=True,
    hybrid_score=0.85,
    exact_score=0.9,
    semantic_score=0.8,
    vector_score=0.8,
    final_confidence=0.85
)
# Optimize
weights = optimizer.optimize(global_only=False)
optimizer.export_weights_json("weights.json")

- Automatic LLM-based classification into 5 types:
- Factual: Objective, verifiable statements
- Misconception: Common false beliefs or debunked claims
- Contextual: True but needs nuance or context
- Opinion: Subjective viewpoints
- Recent: Time-sensitive or recent events
- Few-shot prompt engineering with examples
- Confidence scores and reasoning
- Built-in caching to reduce API calls
- Automatic weight selection based on claim type
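
A file-backed cache is one straightforward way to avoid repeated LLM calls for the same claim. The following sketch is an assumption about how such a cache could look; the `ClassificationCache` class, its key normalization, and the stored fields are hypothetical and not taken from `ClaimTypeClassifier`.

```python
import hashlib
import json
from pathlib import Path


class ClassificationCache:
    """Hypothetical JSON-file cache keyed by a hash of the normalized claim."""

    def __init__(self, cache_file):
        self.path = Path(cache_file)
        if self.path.exists():
            self.entries = json.loads(self.path.read_text(encoding="utf-8"))
        else:
            self.entries = {}

    @staticmethod
    def _key(claim):
        # Normalize case and whitespace so trivial variants share one entry.
        normalized = " ".join(claim.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def get(self, claim):
        """Return the cached result dict, or None on a cache miss."""
        return self.entries.get(self._key(claim))

    def put(self, claim, result):
        """Store a result and persist the whole cache to disk."""
        self.entries[self._key(claim)] = result
        self.path.write_text(json.dumps(self.entries, indent=2), encoding="utf-8")
```

A classifier would call `get` before contacting the LLM and `put` after a successful classification. Rewriting the whole file on every `put` is simple but only suitable for small caches.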
Example:
from src.claim_type_classifier import ClaimTypeClassifier, ClaimTypeClassifierWithWeightSelection
# Initialize classifier
classifier = ClaimTypeClassifier(
    llm_manager=llm_manager,
    model="gpt-4",
    cache_file="classifications_cache.json"
)
# Classify single claim
classification = await classifier.classify("Vaccines cause autism")
print(f"Type: {classification.predicted_type}, Confidence: {classification.confidence}")
# Batch classification
claims = [
    "The Earth orbits the Sun",
    "Lightning never strikes twice",
    "AI will replace all jobs"
]
results = await classifier.classify_batch(claims)
# Automatic weight selection
selector = ClaimTypeClassifierWithWeightSelection(classifier, weight_optimizer)
results, classification = await selector.search_with_automatic_weights(
    claim="Social media is bad",
    search_engine=hybrid_search_engine
)

# Run comprehensive A/B test
result = await framework.run_ab_test(claims)
# Decision:
# If large_accuracy > small_accuracy + 2% AND cost_diff < $100/month
#     → Use large model
# Else
#     → Use small model (cost-optimized)

# First: Hybrid search (fast, high recall)
hybrid_results = await search_engine.hybrid_search(claim)
# Second: Cross-encoder reranking (precision)
final_results = await reranker.rerank_async(
    query=claim,
    candidates=hybrid_results,
    top_k=5
)

# Initialize classifier with weight optimizer integration
classifier = ClaimTypeClassifier(llm_manager, cache_file="cache.json")
selector = ClaimTypeClassifierWithWeightSelection(classifier, optimizer)
# Automatic classification and weight selection (no manual type needed)
results, classification = await selector.search_with_automatic_weights(
    claim="Vaccines cause autism",
    search_engine=search_engine
)
# OR use HybridSearchEngine directly with auto-classification
results, classification = await search_engine.hybrid_search_with_auto_classification(
    claim="Lightning never strikes twice",
    limit=5
)
# Results will be automatically ranked using weights optimized for "misconception" type
print(f"Classified as: {classification.predicted_type}")
print(f"Confidence: {classification.confidence}")
for result in results:
    print(f"  - {result.claim}: {result.hybrid_score:.2f}")

- Accuracy: +2-3% improvement with large model on hard claims
- Latency: +10-20ms for cross-encoder reranking
- Cost: +$66-100/month for large model at 100k queries/month
- Recall: Better capture of relevant facts via reranking
- Precision: Improved relevance labels for top-K results
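
The model selection rule stated in this README (choose the large model only when its accuracy gain exceeds 2 percentage points and the extra cost stays below $100/month) can be written as a small function. The name `recommend_model` and its parameters are hypothetical; the thresholds are the ones quoted in the decision comment.

```python
def recommend_model(small_accuracy, large_accuracy, monthly_cost_diff_usd,
                    min_accuracy_gain=0.02, max_monthly_cost_diff_usd=100.0):
    """Return "large" only if the accuracy gain justifies the extra spend.

    Accuracies are fractions in [0, 1]; the cost difference is large minus small.
    """
    accuracy_gain = large_accuracy - small_accuracy
    if (accuracy_gain > min_accuracy_gain
            and monthly_cost_diff_usd < max_monthly_cost_diff_usd):
        return "large"
    return "small"  # cost-optimized default


# Figures from the sample report in this README: 72.0% vs 75.0%, $75/month.
choice = recommend_model(0.72, 0.75, 75.0)
```

Both conditions must hold, so a large accuracy gain alone does not trigger the switch if the extra cost is at or above the ceiling.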
# Clone repository
git clone https://github.com/yourusername/embedding-ab-test-framework
cd embedding-ab-test-framework
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt

openai>=1.0.0
sentence-transformers>=2.2.0
numpy>=1.21.0
pytest>=7.0.0
pytest-asyncio>=0.21.0
# Run all tests
pytest tests/
# Run specific test file
pytest tests/test_advanced_embeddings.py -v
# Run with coverage
pytest --cov=src tests/

embedding-ab-test-framework/
├── src/
│   ├── benchmark_dataset.py          # 25+ benchmark claims
│   ├── ab_test_embeddings.py         # A/B testing framework
│   ├── cross_encoder_reranker.py     # Reranking module
│   ├── learned_weight_optimizer.py   # Weight optimization
│   └── __init__.py
├── tests/
│   ├── test_advanced_embeddings.py   # Comprehensive test suite
│   └── __init__.py
├── docs/
│   └── API.md                        # Detailed API documentation
├── requirements.txt
├── README.md
└── LICENSE

# Initialize framework
framework = EmbeddingABTestFramework(
    embedding_manager=guardian.embedding_manager,
    hybrid_search=guardian.vector_search_engine,
    neo4j_manager=guardian.neo4j
)
# Run A/B test on G4 KB
result = await framework.run_ab_test(
    EmbeddingBenchmarkDataset.get_all_claims(),
    test_id="guardian_g4_selection"
)

# Add to existing HybridSearchEngine
from src.cross_encoder_reranker import CrossEncoderReranker
from src.learned_weight_optimizer import LearnedWeightOptimizer
search_engine.cross_encoder = CrossEncoderReranker()
search_engine.weight_optimizer = LearnedWeightOptimizer()
# Use enhanced methods
results = await search_engine.hybrid_search_with_reranking(claim)
results = await search_engine.hybrid_search_with_learned_weights(claim, "factual")

╔══════════════════════════════════════════════════════════════════╗
║                 EMBEDDING MODEL A/B TEST REPORT                  ║
╚══════════════════════════════════════════════════════════════════╝

Test ID: test_001
Timestamp: 2026-04-15T18:30:00
Dataset Size: 25 claims

┌─ ACCURACY METRICS ───────────────────────────────────────────────┐
│                                                                  │
│ Model            │ KB Hit Rate │ Top-1 Ranking │ Avg Score       │
│ ──────────────────────────────────────────────────────────────── │
│ Small (3-small)  │ 72.0%       │ 0.8234        │ 0.7812          │
│ Large (3-large)  │ 75.0%       │ 0.8456        │ 0.8145          │
│ Difference       │ +3.0%       │ +0.0222       │ +0.0333         │
│                                                                  │
└──────────────────────────────────────────────────────────────────┘

╔══════════════════════════════════════════════════════════════════╗
║ 🎯 RECOMMENDATION                                                ║
╠══════════════════════════════════════════════════════════════════╣
║                                                                  ║
║ Winner: LARGE                                                    ║
║                                                                  ║
║ Rationale:                                                       ║
║ Large model offers 3.0% accuracy improvement for $75/month       ║
║                                                                  ║
╚══════════════════════════════════════════════════════════════════╝

- API Reference: Detailed method documentation
- Examples: Code examples and notebooks
- Benchmarks: Performance comparisons
Contributions welcome! Please:
- Fork repository
- Create feature branch (`git checkout -b feature/name`)
- Add tests for new functionality
- Submit pull request
MIT License - see LICENSE file
- Guardian G4 KB: Knowledge base system using this framework
- ADRION Ecosystem: 369-dimensional decision orchestration
- Arbitrage Engine: Multi-model fact verification
For issues, questions, or suggestions:
- Open a GitHub issue
- Check FAQ in docs/
- Review test examples
Last Updated: May 2026
Version: 1.0.0
Maintainer: Adrian Hadjiman