Skip to content

jparkerweb/embedding-utils

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

21 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

embedding-utils

Vector math, similarity search, ANN indexing, clustering, async pipelines, evaluation metrics, and multi-provider embedding generation -- zero dependencies, full TypeScript, one import.

embedding-utils

Build semantic search, RAG pipelines, recommendation engines, duplicate detection, and document clustering without pulling in heavy ML frameworks or vector databases.


Table of Contents


Why embedding-utils?

  • Zero production dependencies -- nothing to audit, nothing to break
  • Provider-agnostic -- swap between local ONNX, OpenAI, Cohere, and Google Vertex with one line
  • Complete toolkit -- math, search, clustering, quantization, caching, and serialization in a single package
  • TypeScript-first -- strict types, full inference, no @types package needed
  • Production-ready -- HNSW approximate search, async pipelines with rate limiting, and retrieval evaluation metrics
  • Tree-shakeable -- dual ESM + CJS build; import only what you use

Use this when you need embedding operations without a full ML framework, want to switch providers without refactoring, or need vector search and clustering without a vector database.


Installation

Requires Node.js 18+. Supports both ESM and CommonJS.

npm install embedding-utils

For local ONNX inference (no API key needed), also install the optional peer dependency:

npm install @huggingface/transformers

Quick Start

With a cloud provider

import { createOpenAICompatibleProvider, topK } from 'embedding-utils';

// 1. Create a provider
const provider = createOpenAICompatibleProvider({
  apiKey: process.env.OPENAI_API_KEY!,
  model: 'text-embedding-3-small',
});

// 2. Embed your documents
const { embeddings } = await provider.embed([
  'The cat sat on the mat',
  'A dog played in the park',
  'Vectors are mathematical objects',
]);
// embeddings: Float32Array[] (v0.3+)

// 3. Search: find the closest matches to a query
const { embeddings: [query] } = await provider.embed('pets and animals');
const results = topK(query, embeddings, 2);
// => [{ index: 0, score: 0.87, ... }, { index: 1, score: 0.82, ... }]

With local inference (no API key)

import { createLocalProvider, topK } from 'embedding-utils';

const provider = createLocalProvider(); // uses Xenova/all-MiniLM-L12-v2 by default
const { embeddings } = await provider.embed(['hello world', 'goodbye world']);

Providers

All providers implement the same EmbeddingProvider interface -- swap providers without changing application code:

interface EmbeddingProvider {
  embed(input: string | string[], options?: EmbedOptions): Promise<EmbeddingResult>;
  readonly name: string;
  readonly dimensions: number | null;
}

Provider comparison

Provider Factory Auth Batch Limit Notes
Local (ONNX) createLocalProvider() None -- Offline, auto-downloads models, built-in LRU cache
OpenAI createOpenAICompatibleProvider(config) API key 2048 Supports dimensions for output truncation
Cohere createCohereProvider(config) API key 96 Input type mapping (document/query) for best results
Google Vertex createGoogleVertexProvider(config) Access token 5 Token can be a string or async refresh function
Voyage createProvider('voyage', config) API key 2048 OpenAI-compatible wrapper
Mistral createProvider('mistral', config) API key 2048 OpenAI-compatible wrapper
Jina createProvider('jina', config) API key 2048 OpenAI-compatible wrapper
OpenRouter createProvider('openrouter', config) API key 2048 OpenAI-compatible wrapper
Together createProvider('together', config) API key 2048 OpenAI-compatible wrapper
Fireworks createProvider('fireworks', config) API key 2048 OpenAI-compatible wrapper
Nomic createProvider('nomic', config) API key 2048 OpenAI-compatible wrapper
Mixedbread createProvider('mixedbread', config) API key 2048 OpenAI-compatible wrapper

Any OpenAI-compatible endpoint works via createOpenAICompatibleProvider with a custom baseUrl -- including Ollama, LM Studio, and Azure OpenAI.

Provider examples

import {
  createLocalProvider,
  createOpenAICompatibleProvider,
  createCohereProvider,
  createGoogleVertexProvider,
  createProvider,
} from 'embedding-utils';

// Local ONNX (requires @huggingface/transformers)
const local = createLocalProvider({ model: 'Xenova/all-MiniLM-L6-v2' });

// OpenAI
const openai = createOpenAICompatibleProvider({
  apiKey: 'sk-...',
  model: 'text-embedding-3-small',
  dimensions: 256, // optional output dimension truncation
});

// Cohere
const cohere = createCohereProvider({ apiKey: 'co-...' });
const result = await cohere.embed('hello', { inputType: 'query' }); // optimized for search

// Google Vertex AI (supports token refresh functions)
const vertex = createGoogleVertexProvider({
  projectId: 'my-project',
  accessToken: async () => getAccessToken(), // or a static 'ya29...' string
});

// Voyage, Mistral, Jina, OpenRouter via factory aliases
const voyage = createProvider('voyage', { apiKey: 'pa-...', model: 'voyage-3' });
const mistral = createProvider('mistral', { apiKey: 'sk-...', model: 'mistral-embed' });
const jina = createProvider('jina', { apiKey: 'jina-...', model: 'jina-embeddings-v3' });

// Together AI
const together = createProvider('together', {
  apiKey: 'tog-...',
  model: 'togethercomputer/m2-bert-80M-8k-retrieval',
});

// Fireworks AI
const fireworks = createProvider('fireworks', {
  apiKey: 'fw-...',
  model: 'nomic-ai/nomic-embed-text-v1.5',
});

// Nomic
const nomic = createProvider('nomic', {
  apiKey: 'nk-...',
  model: 'nomic-embed-text-v1.5',
});

// Mixedbread
const mixedbread = createProvider('mixedbread', {
  apiKey: 'mb-...',
  model: 'mixedbread-ai/mxbai-embed-large-v1',
});

// All providers return the same shape
const { embeddings, model, dimensions, usage } = await openai.embed(['hello', 'world']);

Retry and batching

All cloud providers include automatic retry with exponential backoff (429 and 5xx errors) and auto-batching for large inputs. Configure via the retry option:

const provider = createOpenAICompatibleProvider({
  apiKey: 'sk-...',
  model: 'text-embedding-3-small',
  retry: { maxRetries: 5, baseDelay: 500, maxDelay: 60000 },
});

API Reference

Vector Math

Core operations on embedding vectors. These are the low-level building blocks for comparing, combining, and transforming vectors. All functions validate inputs and throw ValidationError on empty vectors or dimension mismatches.

Function Description Example
cosineSimilarity(a, b) Cosine similarity (-1 to 1) cosineSimilarity([1, 0], [0, 1]) => 0
dotProduct(a, b) Dot product dotProduct([1, 2, 3], [4, 5, 6]) => 32
euclideanDistance(a, b) L2 distance euclideanDistance([0, 0], [3, 4]) => 5
manhattanDistance(a, b) L1 distance manhattanDistance([0, 0], [3, 4]) => 7
normalize(v) Unit vector normalize([3, 4]) => Float32Array [0.6, 0.8]
magnitude(v) Vector length magnitude([3, 4]) => 5
add(a, b) Element-wise sum add([1, 2], [3, 4]) => [4, 6]
subtract(a, b) Element-wise difference subtract([5, 3], [1, 2]) => [4, 1]
scale(v, s) Scalar multiply scale([1, 2, 3], 2) => [2, 4, 6]
truncateDimensions(v, n) Matryoshka truncation + auto-normalize truncateDimensions([1, 2, 3, 4], 2) => Float32Array [0.45, 0.89]

truncateDimensions also accepts batches and preserves the input shape. In v0.3, it auto-normalizes after truncation so output vectors have L2 norm β‰ˆ 1.0, ready for cosine similarity without manual normalization.

// v0.3: truncateDimensions auto-normalizes for Matryoshka embeddings
const truncated = truncateDimensions(embedding1536, 256);
// truncated is Float32Array with L2 norm β‰ˆ 1.0
// Ready for cosine similarity β€” no manual normalize() needed

When to use each

  • cosineSimilarity -- The go-to for checking if two pieces of text are about the same thing. Use it in recommendation engines ("users who liked X also liked Y") or to verify if a generated answer is on-topic relative to the source document.
  • dotProduct -- Some models (like OpenAI's text-embedding-3-* with dimensions truncation) are optimized for dot product ranking. Check your model's docs -- if it says "use dot product," use this instead of cosine.
  • euclideanDistance -- Useful for anomaly detection: if a new data point's embedding is far from the cluster center, flag it as an outlier. Also commonly used in k-means style clustering.
  • manhattanDistance -- Similar to euclidean but cheaper to compute and more robust to noisy dimensions. A good choice when embeddings are high-dimensional or you're running on constrained hardware.
  • normalize -- Call this before storing embeddings in a database that only supports dot product search (e.g., some Postgres pgvector configs). After normalization, dot product equals cosine similarity.
  • magnitude -- Sanity-check that a model is returning valid vectors. A magnitude of 0 or near-infinity means something went wrong. Also useful to verify whether embeddings are already normalized (magnitude β‰ˆ 1.0).
  • add / subtract -- Perform analogy-style vector arithmetic ("king βˆ’ man + woman β‰ˆ queen"). In practice, subtract the embedding of "negative sentiment" from a mixed-tone corpus to bias search toward positive content.
  • scale -- Weight a particular embedding before combining it with others. For example, scale a title embedding by 2x before averaging with the body embedding, since titles carry more semantic signal.
  • truncateDimensions -- OpenAI's Matryoshka-trained models let you drop from 1536 dims to 256 with minimal quality loss, saving 80%+ storage. In v0.3, truncation auto-normalizes the result so it's ready for cosine similarity without a separate normalize() call.

Search

Find, rank, filter, and deduplicate embeddings by similarity. This is where most applications start -- "given a query, find the best matches." All search functions support four metrics via { metric: 'cosine' | 'dot' | 'euclidean' | 'manhattan' } and optional labels for tracking source data.

import { topK, topKMulti, aboveThreshold, deduplicate, rankBySimilarity, similarityMatrix } from 'embedding-utils';

const query = [1, 0, 0];
const corpus = [[1, 0, 0], [0.9, 0.1, 0], [0, 1, 0], [0, 0, 1]];
const labels = ['doc-a', 'doc-b', 'doc-c', 'doc-d'];

Top-K search

Use topK for semantic search and RAG pipelines. A user types "how do I reset my password" into your support chatbot -- embed their question, then topK against your knowledge base to find the most relevant help articles to feed to the LLM.

Use topKMulti for batch search -- e.g., 50 customer questions came in overnight and you want to find matching FAQ articles for all of them in one call instead of looping.

const results = topK(query, corpus, 2, { labels });
// => [{ index: 0, score: 1.0, label: 'doc-a', embedding: [...] },
//     { index: 1, score: 0.94, label: 'doc-b', embedding: [...] }]

// Batch: search multiple queries at once
const batchResults = topKMulti([query, [0, 1, 0]], corpus, 2);
// => SearchResult[][] (one array per query)

Threshold filtering and deduplication

Use aboveThreshold for intent matching where you only want confident matches. A voice assistant matching speech to known commands should only act on matches above 0.85 -- anything below is "I don't understand." Unlike topK, this returns a variable number of results (could be zero).

Use deduplicate for content deduplication. You scraped 10,000 product listings and many are reposts with slightly different wording. Deduplicate at 0.95 similarity to collapse near-identical listings, keeping the first occurrence.

// Find all embeddings above a similarity threshold
const matches = aboveThreshold(query, corpus, 0.8, { labels });
// => SearchResult[] sorted by descending score

// Remove near-duplicates (keeps first occurrence)
const unique = deduplicate(corpus, 0.95, { labels });
// => { embeddings: [...], indices: [0, 2, 3], labels: ['doc-a', 'doc-c', 'doc-d'] }

Ranking and similarity matrix

Use rankBySimilarity for recommendation feeds. A user just read an article -- rank your entire library by relevance to build a "more like this" feed. Unlike topK, this returns scores for every item.

Use similarityMatrix for content audits and overlap visualization. You have 100 support articles and want to find redundant documentation -- the NxN matrix powers a heatmap showing which articles cover the same ground.

// Rank entire corpus by similarity
const ranked = rankBySimilarity(query, corpus);
// => SearchResult[] for all items, sorted descending

// Compute NxN pairwise similarity matrix
const matrix = similarityMatrix(corpus);
// => number[][] (symmetric, diagonal = 1.0 for cosine)

Filtered search

Pass a filter callback to topK to dynamically include/exclude items without modifying the corpus.

// Only search documents from a specific category
const results = topK(query, corpus, 5, {
  labels: docLabels,
  filter: (index, label) => label?.startsWith('category-a') ?? false,
});

Pairwise similarity

Compare two parallel lists of embeddings element-wise. Useful for evaluating translation quality, paraphrase detection, or before/after comparison.

import { pairwiseSimilarity } from 'embedding-utils';
const scores = pairwiseSimilarity(originalEmbeddings, translatedEmbeddings);
// => [0.95, 0.88, 0.92, ...] β€” one score per pair

MMR search (Maximal Marginal Relevance)

Select results that are both relevant to the query and diverse from each other. Prevents redundant results in RAG pipelines.

import { mmrSearch } from 'embedding-utils';
const results = mmrSearch(query, corpus, 5, {
  lambda: 0.7,  // 0 = max diversity, 1 = max relevance
});

Re-ranking

Two-stage retrieval: fast initial search, then precise re-ranking with a different metric or weighted combination.

import { rerankResults } from 'embedding-utils';
const initial = topK(query, corpus, 20);
const reranked = rerankResults(initial, query, {
  weights: { original: 0.3, rerank: 0.7 },
});

SearchIndex

Stateful in-memory search index with CRUD operations. Suitable for corpora up to ~100k embeddings.

import { SearchIndex } from 'embedding-utils';

const index = new SearchIndex({ metric: 'cosine' });
index.add('doc-1', embedding1, { category: 'science' });
index.add('doc-2', embedding2, { category: 'history' });

const results = index.search(queryEmbedding, { topK: 5 });
index.remove('doc-1');

HNSW Search (Approximate Nearest Neighbor)

When your dataset grows beyond ~10k embeddings, brute-force search gets slow. HNSW (Hierarchical Navigable Small World) is a graph-based index that finds approximate nearest neighbors in milliseconds instead of seconds, with >95% accuracy.

When to use HNSW vs SearchIndex:

  • SearchIndex: <10k embeddings, need 100% accuracy, simple setup
  • HNSWIndex: 10k--1M+ embeddings, need fast queries, ok with ~95-99% accuracy

Basic example -- create, add, search:

import { HNSWIndex } from 'embedding-utils';

// Create an index (defaults: M=16, efConstruction=200)
const index = new HNSWIndex({ metric: 'cosine' });

// Add embeddings with IDs and optional metadata
index.add('doc-1', embedding1, { title: 'Introduction to ML' });
index.add('doc-2', embedding2, { title: 'Deep Learning Basics' });
index.add('doc-3', embedding3, { title: 'Cooking with Pasta' });

// Search β€” returns top matches in milliseconds
const results = index.search(queryEmbedding, { topK: 2 });
// => [{ id: 'doc-1', score: 0.94, ... }, { id: 'doc-2', score: 0.89, ... }]

Batch add:

// Add many items at once
index.addBatch([
  { id: 'doc-1', vector: emb1, metadata: { source: 'wiki' } },
  { id: 'doc-2', vector: emb2, metadata: { source: 'arxiv' } },
  // ... thousands more
]);

Filtered search -- filters run after graph traversal:

// Only return results matching a filter
const results = index.search(query, {
  topK: 5,
  filter: (item) => item.metadata?.source === 'arxiv',
});

Tuning accuracy vs speed -- efSearch controls the trade-off:

// Default efSearch=50 is a good balance
// Increase for higher accuracy (slower), decrease for speed (less accurate)
const precise = index.search(query, { topK: 10, efSearch: 200 }); // ~99% recall
const fast = index.search(query, { topK: 10, efSearch: 20 });    // ~90% recall, 3x faster

Persistence -- save to disk and reload:

import { writeFileSync, readFileSync } from 'fs';

// Save index to disk
const bytes = index.serialize();
writeFileSync('my-index.hnsw', bytes);

// Load index later (even in a different process)
const loaded = HNSWIndex.deserialize(readFileSync('my-index.hnsw'));
const results = loaded.search(query, { topK: 5 }); // same results as original

Configuration reference:

Parameter Default Effect
M 16 Connections per node. Higher = better recall, more memory
efConstruction 200 Build-time beam width. Higher = better graph quality, slower insert
efSearch 50 Query-time beam width. Higher = better recall, slower search
metric 'cosine' Distance metric: 'cosine', 'dot', 'euclidean', 'manhattan'

Hybrid Search (RRF + Score Normalization)

Real-world search often combines multiple signals -- keyword search, semantic search, metadata filters. Reciprocal Rank Fusion (RRF) merges ranked lists from different sources into a single ranking without needing to calibrate scores across systems.

RRF example -- combine semantic and keyword search:

import { fuseRankedLists } from 'embedding-utils';

// Results from semantic search (embedding similarity)
const semanticResults = [
  { id: 'doc-3', score: 0.95 },
  { id: 'doc-1', score: 0.87 },
  { id: 'doc-7', score: 0.82 },
];

// Results from keyword search (BM25 or similar)
const keywordResults = [
  { id: 'doc-1', score: 12.5 },
  { id: 'doc-3', score: 8.2 },
  { id: 'doc-5', score: 6.1 },
];

// Fuse them β€” scores don't need to be on the same scale!
const fused = fuseRankedLists([semanticResults, keywordResults]);
// doc-3 and doc-1 rank highest because they appear in BOTH lists

// Custom k parameter (default 60) β€” lower k gives more weight to top positions
const aggressive = fuseRankedLists([semanticResults, keywordResults], { k: 10 });

Score normalization -- when you need comparable scores:

import { normalizeScores } from 'embedding-utils';

const rawScores = [0.2, 0.8, 0.5, 0.95, 0.1];

// Min-max: scale to [0, 1] range
normalizeScores(rawScores, 'min-max');

// Z-score: mean=0, std=1 (good for outlier detection)
normalizeScores(rawScores, 'z-score');

// Sigmoid: squash to (0, 1) (good for probability-like scores)
normalizeScores(rawScores, 'sigmoid');

When to use which normalization:

Method Output Range Best For
min-max [0, 1] UI display, threshold comparison
z-score (-inf, +inf) Outlier detection, statistical analysis
sigmoid (0, 1) Probability-like confidence scores

Clustering

Group similar items together without predefined categories. Useful for discovering natural topic groupings in unlabeled data -- e.g., auto-tagging thousands of support tickets into topics like "billing issues," "shipping delays," and "app crashes" without defining those categories upfront.

Uses greedy agglomerative clustering: iterate embeddings, assign each to the most similar existing cluster (or create a new one), filter by size, merge if needed.

import { clusterEmbeddings, getPreset, cohesionScore, silhouetteScore, assignToCluster, mergeClusters } from 'embedding-utils';

Basic clustering

const clusters = clusterEmbeddings(embeddings, {
  similarityThreshold: 0.85,
  minClusterSize: 3,
  maxClusters: 10,
  metric: 'cosine',
});
// => [{ centroid, members, labels?, size, cohesion }]

Presets

Three built-in presets so you don't have to guess threshold values. Use 'high-precision' for tight groups (e.g., near-duplicate detection), 'balanced' for general topic discovery, and 'performance' for fast broad groupings on large datasets.

Preset Threshold Min Size Max Clusters Best for
high-precision 0.95 3 10 Tight, highly cohesive groups
balanced 0.85 5 5 General-purpose clustering
performance 0.75 10 3 Fast, broad groupings
const clusters = clusterEmbeddings(embeddings, getPreset('balanced'));

Quality metrics

Use cohesionScore to check if an individual cluster is internally consistent -- a low score (e.g., 0.4) means the cluster is a grab-bag of unrelated items and you should tighten the threshold.

Use silhouetteScore to evaluate overall clustering quality. A score near 1.0 means clusters are well-separated; near 0 means items could belong to any cluster. Use this to tune your similarityThreshold.

// Internal cluster quality (0-1, higher = tighter)
const cohesion = cohesionScore(clusters[0]);

// Global clustering quality (-1 to 1, higher = better separated)
const quality = silhouetteScore(clusters);

Operations

Use assignToCluster for live classification -- when a new support ticket or document arrives, classify it into an existing cluster without re-clustering everything. Returns -1 if it doesn't fit any group (an outlier / new topic).

Use mergeClusters for manual refinement -- after auto-clustering, you notice "password reset" and "login issues" ended up as separate clusters but your team treats them as one. Merge them programmatically.

// Classify a new embedding into existing clusters
const { clusterIndex, similarity } = assignToCluster(newEmbedding, clusters, { threshold: 0.8 });
// clusterIndex = -1 if below threshold (outlier)

// Merge two clusters
const merged = mergeClusters(clusters[0], clusters[1]);

Clustering enhancements

centroidCohesion -- Average similarity of all members to the cluster centroid. Faster than pairwise cohesionScore (O(n) vs O(nΒ²)), good for per-cluster quality monitoring.

import { centroidCohesion } from 'embedding-utils';
const quality = centroidCohesion(cluster, 'cosine'); // 0-1, higher = tighter

legacy preset -- Disables clustering by returning all embeddings in a single cluster. Use for backwards compatibility with pipelines that predated clustering.

Preset Threshold Min Size Max Clusters Best for
high-precision 0.95 3 10 Tight, highly cohesive groups
balanced 0.85 5 5 General-purpose clustering
performance 0.75 10 3 Fast, broad groupings
legacy 0 1 1 Clustering disabled (single cluster)

assignmentStrategy -- Controls how embeddings are assigned during clustering. 'centroid' (default) compares to the cluster centroid. 'average-similarity' computes average similarity to all current members, which is more accurate but slower.

clusterEmbeddings(embeddings, { assignmentStrategy: 'average-similarity' });

shuffle + shuffleSeed -- Greedy clustering is order-sensitive. Enable shuffle: true for order-independent results with a deterministic seed.

clusterEmbeddings(embeddings, { shuffle: true, shuffleSeed: 42 });

findOptimalK + silhouetteByK -- Automatically determine the best number of clusters using silhouette analysis or the elbow method.

import { findOptimalK, silhouetteByK } from 'embedding-utils';

const bestK = findOptimalK(embeddings, { minK: 2, maxK: 8 });
const scores = silhouetteByK(embeddings); // [{ k: 2, silhouette: 0.6 }, ...]

clusterStats + detectOutliers -- Detailed per-cluster statistics and outlier detection.

import { clusterStats, detectOutliers } from 'embedding-utils';

const stats = clusterStats(cluster); // { minSimilarity, maxSimilarity, meanSimilarity, ... }
const outlierIndices = detectOutliers(cluster, { threshold: 2 }); // member indices > 2Οƒ from mean

centroidDrift -- Measure how much a cluster centroid has shifted between two snapshots.

import { centroidDrift } from 'embedding-utils';
const drift = centroidDrift(oldCentroid, newCentroid, 'cosine'); // 0 = no change

HDBSCAN Clustering

The existing clusterEmbeddings requires you to set a similarity threshold and max cluster count. HDBSCAN automatically discovers the right number of clusters AND identifies outliers -- no tuning needed. It handles clusters of different densities (e.g., a tight group of 5 similar docs alongside a loose group of 50).

When to use HDBSCAN vs clusterEmbeddings:

  • clusterEmbeddings: You know roughly how many clusters to expect, want fast results, need deterministic output
  • hdbscan: You don't know how many clusters exist, need automatic outlier detection, have varying-density data

Basic example:

import { hdbscan } from 'embedding-utils';

const result = hdbscan(embeddings, { minClusterSize: 5 });

console.log(`Found ${result.clusters.length} clusters`);
console.log(`${result.noise.members.length} noise points (outliers)`);

// Each cluster has the familiar Cluster type
result.clusters.forEach((cluster, i) => {
  console.log(`  Cluster ${i}: ${cluster.size} items`);
});

// Labels array maps each input to its cluster (-1 = noise)
console.log(result.labels); // => [0, 0, 1, -1, 1, 0, -1, ...]

With labels for tracking -- trace back to source data:

const docNames = ['intro.md', 'setup.md', 'api.md', 'faq.md', 'changelog.md'];

const result = hdbscan(embeddings, {
  minClusterSize: 2,
  labels: docNames,
});

// Labels are preserved in clusters and noise
result.clusters.forEach((cluster, i) => {
  console.log(`Cluster ${i}: ${cluster.labels?.join(', ')}`);
});
// => Cluster 0: intro.md, setup.md
// => Cluster 1: api.md, faq.md

console.log(`Outliers: ${result.noise.labels?.join(', ')}`);
// => Outliers: changelog.md

Tuning -- minClusterSize and metric:

// Larger minClusterSize = fewer, bigger clusters (more noise points)
const broad = hdbscan(embeddings, { minClusterSize: 10 });

// Smaller minClusterSize = more, smaller clusters (fewer noise points)
const granular = hdbscan(embeddings, { minClusterSize: 3 });

// Different metric changes cluster shapes
const cosineResult = hdbscan(embeddings, { metric: 'cosine' });
const euclideanResult = hdbscan(embeddings, { metric: 'euclidean' });

Aggregation

Combine multiple embeddings into a single representative vector. Common use case: a 20-page PDF gets split into 40 chunks, each with its own embedding. Aggregate them into one "document embedding" for coarse-grained search (find relevant documents first, then drill into chunks).

import { averageEmbeddings, weightedAverage, incrementalAverage, centroid, maxPooling, minPooling, combineEmbeddings } from 'embedding-utils';

averageEmbeddings -- Simple element-wise mean. Use when all inputs carry equal weight, such as averaging chunk embeddings to represent an entire document.

averageEmbeddings([[1, 0], [0, 1]]); // => [0.5, 0.5]

weightedAverage -- Give more importance to certain embeddings. Useful for hybrid representations: embed a product's title, description, and reviews separately, then weight the title 3x and description 2x to emphasize what the product is over what people say.

weightedAverage([[1, 0], [0, 1]], [3, 1]); // => [0.75, 0.25]

incrementalAverage -- Streaming / running average without storing all prior embeddings. Use when processing a firehose of data (e.g., tweets, logs, sensor readings) and you need a running mean with only 1 vector in memory.

let avg = [1, 2, 3];
avg = incrementalAverage(avg, [4, 5, 6], 1); // count = embeddings already averaged
avg = incrementalAverage(avg, [7, 8, 9], 2);
// Numerically equivalent to averageEmbeddings([[1,2,3], [4,5,6], [7,8,9]])

batchIncrementalAverage -- Incrementally compute the average of embedding batches without reprocessing all previous data. Use when new data arrives in batches and you want to update a running centroid without re-averaging from scratch.

import { batchIncrementalAverage } from 'embedding-utils';

let centroidAvg = firstBatchAvg;  // average of first 10 embeddings
let count = 10;

// New batch of 5 embeddings arrives
centroidAvg = batchIncrementalAverage(centroidAvg, newBatch, count);
count += newBatch.length;
// Mathematically equivalent to averageEmbeddings([...allOldEmbeddings, ...newBatch])

Use batchIncrementalAverage over averageEmbeddings when you don't want to store or reprocess all historical embeddings. If you're updating one embedding at a time, use incrementalAverage instead.

maxPooling / minPooling -- Element-wise max or min across vectors. Max pooling captures the "strongest signal" across chunks -- if any chunk mentions "urgent," that signal is preserved. Min pooling captures the weakest, useful for detecting what's missing across a set.

maxPooling([[1, 0, 3], [2, 5, 1]]); // => [2, 5, 3]
minPooling([[1, 0, 3], [2, 5, 1]]); // => [1, 0, 1]

combineEmbeddings -- Embed multiple texts and aggregate in one call. A convenient one-liner for creating a single embedding from multi-field entities (e.g., combine a user's bio, interests, and recent posts into one "user profile" vector).

const combined = await combineEmbeddings(['hello', 'world'], provider);
const pooled = await combineEmbeddings(['hello', 'world'], provider, { aggregate: maxPooling });

Evaluation Metrics

How do you know if your search is actually good? These metrics compare your search results against ground-truth labels to give you a number. Use them to benchmark different embedding models, tune search parameters, or set up regression tests that alert you when search quality degrades.

Complete example -- evaluate a search system:

import { recallAtK, ndcg, mrr, meanAveragePrecision } from 'embedding-utils';

// Your search returned these document IDs (ranked by relevance)
const retrieved = ['doc-3', 'doc-1', 'doc-7', 'doc-4', 'doc-2'];

// Ground truth: these are the actually relevant documents
const relevant = ['doc-1', 'doc-2', 'doc-5'];

// Recall@K: what fraction of relevant docs did you find in top K?
recallAtK(retrieved, relevant, 3);  // => 0.333 (found 1 of 3 relevant in top 3)
recallAtK(retrieved, relevant, 5);  // => 0.667 (found 2 of 3 relevant in top 5)

// MRR: how high is the FIRST relevant result?
mrr(retrieved, relevant);  // => 0.5 (first relevant doc "doc-1" is at position 2 β†’ 1/2)

// MAP: average precision across all relevant results
meanAveragePrecision(retrieved, relevant);  // => 0.417

// NDCG: normalized discounted cumulative gain (accounts for graded relevance)
const relevanceScores = { 'doc-1': 3, 'doc-2': 2, 'doc-5': 1 }; // higher = more relevant
ndcg(retrieved, relevanceScores, 5);  // => 0.63

Practical use case -- comparing two embedding models:

// Compare OpenAI vs Cohere for your specific use case
const testQueries = [...]; // your evaluation set
const groundTruth = [...]; // manually labeled relevant docs per query

for (const model of [openaiProvider, cohereProvider]) {
  let totalRecall = 0;
  for (let i = 0; i < testQueries.length; i++) {
    const results = await searchWith(model, testQueries[i], { topK: 10 });
    totalRecall += recallAtK(results.map(r => r.id), groundTruth[i], 10);
  }
  console.log(`${model.name} avg recall@10: ${(totalRecall / testQueries.length).toFixed(3)}`);
}

Metrics cheat sheet:

Metric Question It Answers Range Use When
recallAtK "Did I find all the relevant docs?" 0--1 RAG (need high recall to feed LLM)
mrr "How quickly do I find the first relevant doc?" 0--1 Single-answer search (FAQ, QA)
meanAveragePrecision "Are relevant docs ranked above irrelevant ones?" 0--1 General ranking quality
ndcg "Are the MOST relevant docs ranked highest?" 0--1 Graded relevance (some docs more relevant than others)

Async Pipeline

Embedding thousands of documents? The async pipeline handles batching, concurrency, rate limiting, and progress tracking -- so you don't have to write retry loops and sleep timers. It works with any provider and can resume from where it left off if interrupted.

Basic example -- embed a large dataset:

import { createEmbeddingPipeline, createProvider } from 'embedding-utils';

const provider = createProvider('openai', {
  apiKey: process.env.OPENAI_API_KEY!,
  model: 'text-embedding-3-small',
});

const pipeline = createEmbeddingPipeline(provider, {
  batchSize: 100,        // send 100 texts per API call
  concurrency: 3,        // up to 3 API calls in flight at once
  rateLimit: {
    requestsPerMinute: 500,
    tokensPerMinute: 1_000_000,
  },
});

const texts = loadDocuments(); // your 50,000 documents
const embeddings = await pipeline.embed(texts);
// => Float32Array[] β€” one embedding per input text

Progress tracking:

const pipeline = createEmbeddingPipeline(provider, {
  batchSize: 100,
  onProgress: ({ completed, total, elapsed }) => {
    const pct = ((completed / total) * 100).toFixed(1);
    const rate = (completed / (elapsed / 1000)).toFixed(0);
    console.log(`${pct}% complete (${completed}/${total}) β€” ${rate} docs/sec`);
  },
});

Checkpoint/resume -- survive crashes on long jobs:

import { writeFileSync, readFileSync, existsSync } from 'fs';

const pipeline = createEmbeddingPipeline(provider, {
  batchSize: 100,
  checkpoint: {
    interval: 5, // save progress every 5 batches
    save: async (state) => {
      writeFileSync('checkpoint.json', JSON.stringify({
        completedIds: [...state.completedIds],
        totalProcessed: state.totalProcessed,
        timestamp: state.timestamp,
      }));
    },
    load: async () => {
      if (!existsSync('checkpoint.json')) return null;
      const data = JSON.parse(readFileSync('checkpoint.json', 'utf-8'));
      return { ...data, completedIds: new Set(data.completedIds) };
    },
  },
});

// If the process crashes and restarts, it picks up where it left off
const embeddings = await pipeline.embed(texts);

Quantization

Reduce embedding precision for storage and memory efficiency. Essential when scaling to millions of vectors -- 10M embeddings at 1536 dims takes ~57 GB as float32, but only ~14 GB as int8, fitting on a single machine with less than 1% accuracy loss.

Type Bits Size vs float32 Input Range Precision Loss Best for
fp16 16 50% Any Negligible Default choice, wide value ranges
int8 8 25% [-1, 1] ~0.8% Normalized embeddings
uint8 8 25% [0, 1] ~0.4% Positive embeddings
binary 1 3% Any High (sign only) Extreme compression, candidate filtering

quantize / dequantize -- Convert between full-precision and compressed representations. Use int8 or uint8 for general storage savings. Use binary for a two-stage search pipeline: quickly find the top 1,000 candidates with binary similarity, then re-rank those candidates with full-precision vectors. This is how production systems handle billion-scale data.

getQuantizationInfo -- Inspect a quantization type's range and precision before choosing. Helps you decide between int8 (needs normalized [-1, 1] input) vs uint8 (needs [0, 1] input).

import { quantize, dequantize, getQuantizationInfo } from 'embedding-utils';

const embedding = [0.5, -0.3, 0.8, 0.1];

// Quantize to int8 (75% memory reduction)
const quantized = quantize(embedding, 'int8'); // => Int8Array [64, -38, 102, 13]
const restored = dequantize(quantized, 'int8'); // => [0.504, -0.299, 0.803, 0.102]

// Binary quantization (97% reduction, preserves sign only)
const binary = quantize(embedding, 'binary'); // => Uint8Array (packed bits)
const signs = dequantize(binary, 'binary');   // => [1, -1, 1, 1]

// Inspect any quantization type
getQuantizationInfo('int8');
// => { bits: 8, range: [-128, 127], description: 'Signed 8-bit integer...' }

Calibrated Quantization

Standard int8 quantization assumes embedding values fall in [-1, 1]. But many models produce values in a narrower range (e.g., [-0.3, 0.4]). Calibrated quantization learns the actual range from your data, using the full int8 resolution for better accuracy.

import { calibrate, quantize, dequantize } from 'embedding-utils';

// Step 1: Learn the value distribution from a sample of your embeddings
const calibration = calibrate(sampleEmbeddings); // ~1000 embeddings is enough

// Step 2: Quantize with calibration (tighter mapping = less precision loss)
const quantized = quantize(embedding, 'int8', { calibration });
const restored = dequantize(quantized, 'int8', { calibration });

// Compare: calibrated quantization has ~40% less error than uncalibrated
// Uncalibrated: maps [-1, 1] β†’ [0, 255] β€” wastes resolution on unused range
// Calibrated: maps [actual_min, actual_max] β†’ [0, 255] β€” uses full resolution

Hamming Distance

After binary quantization (1-bit), you can compare embeddings with Hamming distance -- just count differing bits. This is ~100x faster than cosine similarity and is used for the first stage of billion-scale search: quickly narrow down candidates, then re-rank with full-precision vectors.

import { quantize, hammingDistance, hammingSimilarity } from 'embedding-utils';

// Binary quantize: 1536 floats β†’ 192 bytes (97% smaller)
const binaryA = quantize(embeddingA, 'binary');
const binaryB = quantize(embeddingB, 'binary');

// Fast bitwise comparison
const distance = hammingDistance(binaryA, binaryB); // number of differing bits
const similarity = hammingSimilarity(binaryA, binaryB, 1536); // 0-1 scale

// Two-stage search pattern:
// 1. Binary search (fast, approximate) to get top 1000 candidates
// 2. Full-precision re-rank on those 1000

Dimensionality Reduction (Random Projection)

Need to reduce 1536-dim embeddings to 256 dims but your model doesn't support Matryoshka truncation? Random projection (Johnson-Lindenstrauss lemma) preserves pairwise distances while shrinking vectors. It's mathematically guaranteed to work -- no training needed.

When to use truncateDimensions vs createRandomProjection:

  • truncateDimensions: Your model was trained with Matryoshka (e.g., OpenAI text-embedding-3-*) -- use this, it's free
  • createRandomProjection: Your model wasn't trained with Matryoshka -- use this for a mathematically-sound reduction

Basic example:

import { createRandomProjection } from 'embedding-utils';

// Create a projector: 1536 dims β†’ 256 dims
const projector = createRandomProjection(1536, 256, { seed: 42 });

// Project a single vector
const reduced = projector.project(embedding); // Float32Array (256 dims)

// Project a batch (more efficient)
const allReduced = projector.projectBatch(embeddings); // Float32Array[] (256 dims each)

// Pairwise distances are approximately preserved (JL guarantee)
// cosineSimilarity(reduced_a, reduced_b) β‰ˆ cosineSimilarity(original_a, original_b)

Deterministic with seed:

// Same seed = same projection matrix = reproducible results
const p1 = createRandomProjection(1536, 256, { seed: 42 });
const p2 = createRandomProjection(1536, 256, { seed: 42 });
// p1.project(vec) and p2.project(vec) produce identical output

Markdown-Aware Chunking

Standard text chunking splits blindly by token count. Markdown-aware chunking respects document structure -- it never splits a code block in half, keeps lists together, and tracks which heading each chunk belongs to. Essential for RAG on documentation, READMEs, and knowledge bases.

Basic example:

import { chunkByStructure } from 'embedding-utils';

const markdown = `
# Getting Started

Install the package and set up your first project.

## Installation

\`\`\`bash
npm install my-package
\`\`\`

## Configuration

Create a config file with these settings:

| Key | Value | Required |
|-----|-------|----------|
| apiKey | string | yes |
| timeout | number | no |
`;

const chunks = chunkByStructure(markdown, { maxTokens: 200 });

chunks.forEach(chunk => {
  console.log(`[${chunk.metadata.type}] ${chunk.metadata.headings.join(' > ')}`);
  console.log(chunk.text.substring(0, 80) + '...');
  console.log();
});
// [heading] Getting Started
// # Getting Started...
//
// [paragraph] Getting Started
// Install the package and set up your first project....
//
// [code] Getting Started > Installation
// ```bash\nnpm install my-package\n```...
//
// [table] Getting Started > Configuration
// | Key | Value | Required |...

Metadata tracking -- heading breadcrumbs enable scoped search:

// Each chunk knows its heading context β€” use it for metadata filtering
const chunks = chunkByStructure(docsMarkdown, { maxTokens: 512 });

for (const chunk of chunks) {
  await index.add(generateId(), chunkEmbedding, {
    headings: chunk.metadata.headings, // ['API Reference', 'Authentication']
    type: chunk.metadata.type,         // 'paragraph', 'code', 'table', etc.
    offset: chunk.metadata.offset,     // character position in original doc
  });
}

// Later: search only within a specific section
const results = index.search(query, {
  topK: 5,
  filter: (item) => item.metadata?.headings?.includes('Authentication'),
});

Storage

Serialize embeddings for persistence and cache frequently used results. Use these to avoid re-calling expensive embedding APIs -- pre-compute once, store, and reload on startup.

Serialization

Three formats with different trade-offs:

Format Output Type Relative Size Use Case
json string Largest Human-readable, debugging, APIs
binary Uint8Array Smallest File storage, binary protocols
base64 string ~33% > binary Text-safe transmission (HTTP, databases)

Use binary for writing embeddings to disk files -- smallest size, fastest I/O. Use base64 when you need to store in a Postgres TEXT column or send over a JSON API. Use json for debugging and inspection where you want to see the actual numbers.

import { serialize, deserialize } from 'embedding-utils';

const embeddings = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]];

const json = serialize(embeddings, 'json');     // JSON string
const binary = serialize(embeddings, 'binary'); // Uint8Array
const base64 = serialize(embeddings, 'base64'); // base64 string

const restored = deserialize(json, 'json');     // number[][] (identical to original)

LRU Cache

In-memory cache with configurable size limits and TTL. Use this to avoid redundant API calls -- users in your app often search similar queries, and caching means "how to reset password" doesn't hit the OpenAI API again for 60 seconds.

Implements the CacheProvider interface for pluggability. The async interface means you can later swap in Redis or SQLite without changing application code.

import { createLRUCache } from 'embedding-utils';

const cache = createLRUCache({ maxSize: 1000, ttl: 60000 }); // 60s TTL

await cache.set('doc-1', [[0.1, 0.2, 0.3]]);
const cached = await cache.get('doc-1'); // => [[0.1, 0.2, 0.3]]
await cache.has('doc-1');                // => true
await cache.delete('doc-1');
await cache.clear();

Model Management

Manage locally-downloaded ONNX models for the local provider. Only relevant if you're using createLocalProvider() for offline / API-key-free embedding.

  • downloadModel -- Pre-download a model so it's ready at runtime. Include this in your Docker build or CI setup to bake the model into the image and avoid a cold-start download on first request.
  • listModels -- See which models are cached locally. Useful for admin dashboards showing disk usage.
  • deleteModel -- Remove old model versions after upgrading to a newer one, or clean up models you no longer use.
  • setModelPath -- Point the model cache to a custom directory. In a shared server environment, use a shared NFS mount so multiple workers don't each download their own copy.
  • getModelInfo -- Check a model's dimensions (384 vs 768) and size (22M vs 109M) before downloading, to pick the right trade-off between inference speed and embedding quality.
import { downloadModel, listModels, deleteModel, setModelPath, getModelInfo, MODEL_REGISTRY } from 'embedding-utils';

// Browse built-in model registry
getModelInfo('Xenova/all-MiniLM-L12-v2');
// => { id, dimensions: 384, size: '33M', description: '...' }

// Download, list, delete models
const path = await downloadModel('Xenova/all-MiniLM-L6-v2');
const models = await listModels();
await deleteModel('Xenova/all-MiniLM-L6-v2');

// Override the default cache directory
setModelPath('/path/to/models');

Built-in model registry

Model Dimensions Size Description
Xenova/all-MiniLM-L12-v2 384 33M All-round English embedding model
Xenova/all-MiniLM-L6-v2 384 22M Lightweight English embedding model
Xenova/bge-small-en-v1.5 384 33M BGE small English (asymmetric prefixes)
Xenova/bge-base-en-v1.5 768 109M BGE base English (asymmetric prefixes)

High-Level APIs

EmbeddingStore

A batteries-included store that combines a provider, optional cache, and search index. Add texts, search by text or embedding -- the store handles embedding generation and caching automatically.

import { createEmbeddingStore, createOpenAICompatibleProvider } from 'embedding-utils';

const store = createEmbeddingStore({
  provider: createOpenAICompatibleProvider({ apiKey: 'sk-...', model: 'text-embedding-3-small' }),
  cache: { maxSize: 500 },
  metric: 'cosine',
});

await store.add('doc-1', 'The cat sat on the mat');
await store.add('doc-2', 'A dog played in the park');

const results = await store.search('animals', { topK: 2 });
// => [{ id: 'doc-1', score: 0.87, ... }, { id: 'doc-2', score: 0.82, ... }]

IncrementalClusterer

Online clustering that assigns embeddings as they arrive. No need to re-cluster the entire dataset when new data comes in.

import { IncrementalClusterer } from 'embedding-utils';

const clusterer = new IncrementalClusterer({ similarityThreshold: 0.85 });
clusterer.addEmbedding([1, 0, 0], 'doc-a');
clusterer.addEmbedding([0.98, 0.02, 0], 'doc-b');
clusterer.addEmbedding([0, 1, 0], 'doc-c');

const clusters = clusterer.getClusters();
clusterer.rebalance(); // re-optimize clusters with full re-clustering

Text Chunking

Split text into chunks for embedding. Useful for processing long documents that exceed model token limits.

import { chunkByTokenCount, chunkBySentence } from 'embedding-utils';

// Split by approximate token count (~1.3 tokens/word)
const chunks = chunkByTokenCount(longText, 512, { overlap: 50 });

// Split on sentence boundaries (uses Intl.Segmenter)
const sentences = chunkBySentence(longText, { maxTokens: 512 });

Getting Started (end-to-end)

import { createEmbeddingStore, createOpenAICompatibleProvider } from 'embedding-utils';

const store = createEmbeddingStore({
  provider: createOpenAICompatibleProvider({ apiKey: 'sk-...', model: 'text-embedding-3-small' }),
});

// Add documents
await store.addBatch([
  { id: '1', text: 'Machine learning basics' },
  { id: '2', text: 'Cooking pasta recipes' },
  { id: '3', text: 'Neural network architectures' },
]);

// Search
const results = await store.search('AI and deep learning', { topK: 2 });
console.log(results.map(r => r.id)); // => ['3', '1']

Caveats & Best Practices

  • Embedding dimension consistency -- Always use the same model and dimensions for embeddings you plan to compare. Mixing embeddings from different models produces meaningless similarity scores.

  • Normalization -- Cosine similarity assumes unit vectors for best results. Use normalize() if your provider doesn't auto-normalize, especially before storing in databases that only support dot product search.

  • Cache key considerations -- Cache keys include model and dimensions. Switching models invalidates cache entries. Use a consistent provider config within a cache lifetime.

  • Clustering order sensitivity -- Greedy agglomerative clustering is order-dependent. Use shuffle: true with a shuffleSeed for reproducible, order-independent results.

  • Text chunking accuracy -- Token count is estimated at ~1.3 tokens/word. This is a heuristic, not a tokenizer -- for precise token budgets, use your model's actual tokenizer.

  • SearchIndex scalability -- SearchIndex uses brute-force linear scan, suitable for up to ~100k embeddings. For larger corpora, use a dedicated vector database.


API Reference (all exports)

Math

Function Description
cosineSimilarity(a, b) Cosine similarity (-1 to 1)
cosineDistance(a, b) 1 - cosineSimilarity
dotProduct(a, b) Dot product
euclideanDistance(a, b) L2 distance
manhattanDistance(a, b) L1 distance
normalize(v) Unit vector
magnitude(v) Vector length
add(a, b) Element-wise sum
subtract(a, b) Element-wise difference
scale(v, s) Scalar multiply
isNormalized(v) Check if magnitude β‰ˆ 1
truncateDimensions(v, n) Matryoshka truncation + auto-normalize
validateDimensions(embeddings) Verify matching dimensions
createRandomProjection(source, target, opts?) Johnson-Lindenstrauss dimensionality reduction

Search

Function / Class Description
topK(query, corpus, k, opts?) Top-K similarity search
topKMulti(queries, corpus, k, opts?) Batch top-K search
aboveThreshold(query, corpus, t, opts?) Threshold-filtered search
deduplicate(corpus, threshold, opts?) Remove near-duplicates
rankBySimilarity(query, corpus, opts?) Rank entire corpus
similarityMatrix(corpus, opts?) NxN pairwise matrix
pairwiseSimilarity(a, b, metric?) Element-wise pair comparison
rerankResults(results, query, opts?) Re-score search results
mmrSearch(query, corpus, k, opts?) Diverse search (MMR)
SearchIndex Stateful CRUD + search index
HNSWIndex Approximate nearest neighbor index (HNSW graph)
fuseRankedLists(lists, opts?) Reciprocal Rank Fusion
normalizeScores(scores, method) Score normalization (min-max, z-score, sigmoid)

Clustering

Function / Class Description
clusterEmbeddings(embeddings, config?) Greedy agglomerative clustering
getPreset(name) Get preset config
cohesionScore(cluster) Pairwise cohesion (O(nΒ²))
centroidCohesion(cluster, metric?) Centroid-based cohesion (O(n))
silhouetteScore(clusters) Global clustering quality
assignToCluster(embedding, clusters, opts?) Classify into existing cluster
mergeClusters(a, b) Merge two clusters
findOptimalK(embeddings, opts?) Find optimal cluster count
silhouetteByK(embeddings, opts?) Silhouette scores per K
clusterStats(cluster, metric?) Detailed cluster statistics
detectOutliers(cluster, opts?) Find outlier members
centroidDrift(old, new, metric?) Measure centroid shift
IncrementalClusterer Online incremental clustering
hdbscan(embeddings, opts?) Density-based clustering with auto cluster count

Aggregation

Function Description
averageEmbeddings(embeddings) Element-wise mean
weightedAverage(embeddings, weights) Weighted mean
incrementalAverage(avg, new, count) Streaming average
batchIncrementalAverage(avg, batch, count) Batch streaming average
centroid(embeddings) Alias for averageEmbeddings
maxPooling(embeddings) Element-wise max
minPooling(embeddings) Element-wise min
combineEmbeddings(texts, provider, opts?) Embed + aggregate

Eval

Function Description
recallAtK(retrieved, relevant, k?) Recall at K
ndcg(retrieved, relevance, k?) Normalized Discounted Cumulative Gain
mrr(retrieved, relevant) Mean Reciprocal Rank
meanAveragePrecision(retrieved, relevant) Mean Average Precision

Pipeline

Function Description
createEmbeddingPipeline(provider, opts?) Async embedding pipeline with batching and rate limiting

Providers

Function Description
createLocalProvider(config?) Local ONNX provider
createOpenAICompatibleProvider(config) OpenAI / compatible APIs
createCohereProvider(config) Cohere API
createGoogleVertexProvider(config) Google Vertex AI
createProvider(type, config) Factory for named providers
withCache(provider, opts?) Add caching middleware
retryWithBackoff(fn, config?) Retry utility
autoBatch(inputs, batchSize, fn) Auto-batching utility

Storage

Function Description
serialize(embeddings, format) Serialize to JSON/binary/base64
deserialize(data, format) Deserialize embeddings
createLRUCache(opts?) In-memory LRU cache
warmCache(cache, entries) Pre-populate cache
estimateMemorySavings(embeddings, type) Quantization savings estimate

Quantization

Function Description
quantize(embedding, type) Reduce precision
dequantize(data, type) Restore precision
getQuantizationInfo(type) Inspect type details
calibrate(embeddings) Learn per-dimension value ranges for quantization
hammingDistance(a, b) Hamming distance for binary vectors
hammingSimilarity(a, b, dims) Hamming similarity (0-1 scale)

Models

Function Description
downloadModel(id) Download ONNX model
listModels() List cached models
deleteModel(id) Remove cached model
setModelPath(path) Set cache directory
getModelInfo(id) Get model metadata
registerModel(info) Add custom model to registry
getRecommendedModel(criteria?) Get model recommendation

Text

Function Description
chunkByTokenCount(text, maxTokens, opts?) Split by token count
chunkBySentence(text, opts?) Split on sentence boundaries
chunkByStructure(text, opts?) Markdown-aware chunking with heading metadata
getTokenizerInfo(model) Get model tokenizer info

Store

Function / Type Description
createEmbeddingStore(config) Create high-level store
EmbeddingStore Store interface type

Common Patterns

Semantic search pipeline

import { createOpenAICompatibleProvider, topK } from 'embedding-utils';

const provider = createOpenAICompatibleProvider({
  apiKey: process.env.OPENAI_API_KEY!,
  model: 'text-embedding-3-small',
});

// Index your documents once
const documents = ['Climate change overview', 'Quantum computing basics', 'History of jazz'];
const { embeddings: corpus } = await provider.embed(documents);

// Search at query time
const { embeddings: [query] } = await provider.embed('environmental science');
const results = topK(query, corpus, 2, { labels: documents });
console.log(results[0].label); // => 'Climate change overview'

Duplicate detection

import { deduplicate } from 'embedding-utils';

const { embeddings: unique, indices } = deduplicate(allEmbeddings, 0.95);
console.log(`Removed ${allEmbeddings.length - unique.length} duplicates`);

Document clustering with quality check

import { clusterEmbeddings, getPreset, silhouetteScore } from 'embedding-utils';

const clusters = clusterEmbeddings(embeddings, getPreset('balanced'));
const quality = silhouetteScore(clusters);
console.log(`${clusters.length} clusters, silhouette score: ${quality.toFixed(2)}`);

clusters.forEach((c, i) => console.log(`  Cluster ${i}: ${c.size} items, cohesion ${c.cohesion.toFixed(2)}`));

Memory-efficient storage

import { quantize, dequantize, serialize, deserialize } from 'embedding-utils';

// Quantize before storing (75% memory reduction)
const quantized = embeddings.map(e => quantize(e, 'int8'));

// Or serialize for disk/network
const binary = serialize(embeddings, 'binary');
// Restore later
const restored = deserialize(binary, 'binary');

Streaming average (no memory accumulation)

import { incrementalAverage } from 'embedding-utils';

let avg = firstEmbedding;
let count = 1;

for await (const embedding of embeddingStream) {
  avg = incrementalAverage(avg, embedding, count);
  count++;
}
// avg is the running mean -- only 1 vector in memory at a time

TypeScript

Full strict-mode type definitions ship with the package -- no @types install needed.

import type {
  Vector,
  EmbeddingProvider,
  EmbeddingResult,
  EmbedOptions,
  SearchResult,
  Cluster,
  ClusteringConfig,
  SimilarityMetric,
  ProviderType,
  CacheProvider,
  CacheOptions,
  RetryConfig,
  ModelInfo,
  LocalProviderConfig,
  OpenAICompatibleConfig,
  CohereConfig,
  GoogleVertexConfig,
  HNSWOptions,
  HNSWSearchOptions,
  HDBSCANOptions,
  HDBSCANResult,
  PipelineOptions,
  CheckpointState,
  QuantizationCalibration,
  RandomProjector,
  StructuredChunk,
  RankedItem,
  NormalizationMethod,
} from 'embedding-utils';

Types are fully inferred -- you rarely need explicit annotations:

const results = topK(query, corpus, 5);
//    ^? SearchResult[] -- fully typed

results[0].score;     // number
results[0].embedding; // Float32Array
results[0].label;     // string | undefined

Config types catch mistakes at compile time:

topK(query, corpus, 5, { metric: 'cosine' });   // OK
topK(query, corpus, 5, { metric: 'invalid' });   // Type error

Error Handling

Four error classes cover all failure modes, all extending EmbeddingError:

Error Thrown When Key Fields
ValidationError Invalid inputs (empty vectors, dimension mismatch, bad config) message
ProviderError Cloud API failures (auth, rate limits, server errors) provider, status?
ModelNotFoundError Local model not found or @huggingface/transformers not installed message
EmbeddingError Base class for all above message
import { cosineSimilarity, ValidationError, ProviderError } from 'embedding-utils';

try {
  cosineSimilarity([1, 2], [1, 2, 3]); // dimension mismatch
} catch (error) {
  if (error instanceof ValidationError) {
    console.error(error.message); // => 'Dimension mismatch: 2 vs 3'
  }
}

try {
  await provider.embed('hello');
} catch (error) {
  if (error instanceof ProviderError) {
    console.error(`${error.provider} failed (HTTP ${error.status}): ${error.message}`);
  }
}

Migration Guide (v0.2 β†’ v0.3)

Breaking Change Summary

All vector-returning functions now return Float32Array instead of number[]. This reduces memory usage by ~50% and improves computation speed, but requires awareness of a few behavioral differences.

What Changed

Operation v0.2 v0.3
provider.embed(texts) { embeddings: number[][] } { embeddings: Float32Array[] }
topK(query, corpus, k) results[0].embedding: number[] results[0].embedding: Float32Array
normalize(v) number[] Float32Array
clusterEmbeddings(...) cluster.centroid: number[] cluster.centroid: Float32Array
deserialize(data, fmt) number[][] Float32Array[]

Common Migration Patterns

// Converting Float32Array back to number[] (if needed for legacy code)
const arr: number[] = [...float32Array];
// or
const arr: number[] = Array.from(float32Array);

// JSON serialization (Float32Array serializes differently!)
// v0.2: JSON.stringify([0.1, 0.2]) => '[0.1,0.2]'
// v0.3: JSON.stringify(new Float32Array([0.1, 0.2])) => '{"0":0.1,"1":0.2}'
// Fix: convert first
JSON.stringify([...embedding]); // => '[0.1,0.2]'

// Array.isArray check
// v0.2: Array.isArray(embedding) => true
// v0.3: Array.isArray(embedding) => false
// Fix: use the new Vector type guard
import { isVector } from 'embedding-utils';
isVector(embedding); // => true for both number[] and Float32Array

// Spread into arrays
// v0.2: const combined = [...embA, ...embB]  => number[]
// v0.3: const combined = [...embA, ...embB]  => number[] (spread converts Float32Array to number[])
// This still works! Spread automatically converts.

// .map() returns Float32Array, not Array
// v0.2: embedding.map(x => x * 2) => number[]
// v0.3: embedding.map(x => x * 2) => Float32Array  (because Float32Array.map returns Float32Array)
// Usually fine, but if you need Array methods like .includes(), convert first:
const arr = [...embedding].filter(x => x > 0);

Input is NOT Breaking

All functions now accept BOTH number[] and Float32Array as input. Your existing number[] inputs still work without any changes:

// All functions accept BOTH number[] and Float32Array as input
cosineSimilarity([1, 2, 3], [4, 5, 6]);              // still works
cosineSimilarity(new Float32Array([1, 2, 3]), [4, 5, 6]); // also works

License

Apache-2.0

About

πŸ“š Collection of JavaScript functions used for common embedding realated tasks

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages