Overview

Text embeddings convert text into numerical vectors that capture semantic meaning. Similar texts have similar vectors, enabling semantic search and clustering.

Embedding Models Comparison

Model                    Dimensions   Speed    Quality
all-MiniLM-L6-v2         384          Fast     Good
all-mpnet-base-v2        768          Medium   Better
e5-large-v2              1024         Slow     Excellent
text-embedding-3-small   1536         API      Excellent
nomic-embed-text         768          Fast     Very good
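
The smaller models trade some quality for speed. A quick way to see the difference in practice is to load two of the models from the table and compare the dimensionality of the vectors they produce; a minimal sketch with sentence-transformers (model names as listed above):

from sentence_transformers import SentenceTransformer

# Load two models from the table and confirm the embedding size each one returns.
for name in ["all-MiniLM-L6-v2", "all-mpnet-base-v2"]:
    model = SentenceTransformer(name)
    vec = model.encode("A quick dimensionality check")
    print(name, vec.shape)  # (384,) and (768,) respectively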

Sentence Transformers

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

sentences = [
    "Machine learning is a subset of AI",
    "Deep learning uses neural networks",
    "The weather is nice today"
]

embeddings = model.encode(sentences)
print(embeddings.shape)  # (3, 384)
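
For larger corpora, encode() also accepts batching and normalization options; the parameter values below are illustrative, not prescriptive:

# Batch the forward passes and L2-normalize the output vectors so that
# cosine similarity reduces to a plain dot product.
embeddings = model.encode(
    sentences,
    batch_size=64,               # illustrative; tune to your hardware
    normalize_embeddings=True,   # unit-length vectors
    show_progress_bar=True,
)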

Semantic Similarity

from sklearn.metrics.pairwise import cosine_similarity

query = "What is artificial intelligence?"
query_embedding = model.encode([query])

similarities = cosine_similarity(query_embedding, embeddings)[0]
# e.g. [0.82, 0.75, 0.12]: the first two sentences are close to the query, the third is not
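
To turn raw scores into a ranking, sort the indices from highest to lowest; a short sketch continuing from the arrays above:

# Rank the sentences by similarity to the query (highest first).
ranking = similarities.argsort()[::-1]
for idx in ranking:
    print(f"{similarities[idx]:.2f}  {sentences[idx]}")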

Hugging Face Transformers

from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
model = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

def get_embedding(text):
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # Mean pooling over token embeddings, ignoring padding via the attention mask
    mask = inputs["attention_mask"].unsqueeze(-1).float()
    summed = (outputs.last_hidden_state * mask).sum(dim=1)
    return summed / mask.sum(dim=1)
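
A quick usage check; the shapes assume the MiniLM model loaded above, and the input sentences are just examples:

# Encode a small batch; the result is a (2, 384) tensor for this model.
vectors = get_embedding(["Machine learning is a subset of AI",
                         "Deep learning uses neural networks"])
print(vectors.shape)  # torch.Size([2, 384])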

OpenAI Embeddings

from openai import OpenAI

client = OpenAI()

response = client.embeddings.create(
    model="text-embedding-3-small",
    input="Machine learning is fascinating"
)

embedding = response.data[0].embedding  # 1536 dimensions
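
The endpoint also accepts a list of inputs in one call, and the text-embedding-3 models support requesting shorter vectors through the dimensions parameter; a sketch (texts and dimension count are illustrative):

# Embed several texts in a single request and ask for shorter vectors.
response = client.embeddings.create(
    model="text-embedding-3-small",
    input=["Machine learning is fascinating", "The weather is nice today"],
    dimensions=512,  # optional truncation supported by text-embedding-3 models
)
vectors = [item.embedding for item in response.data]
print(len(vectors), len(vectors[0]))  # 2 512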

Local with Ollama

import requests

response = requests.post('http://localhost:11434/api/embeddings', json={
    'model': 'nomic-embed-text',
    'prompt': 'Machine learning is fascinating'
})

embedding = response.json()['embedding']
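
The returned vector is a plain Python list, so comparing two texts locally only needs a cosine similarity you compute yourself; the helper below is a hypothetical wrapper around the endpoint shown above:

import math

def ollama_embed(text):
    # Hypothetical helper wrapping the Ollama embeddings endpoint.
    r = requests.post('http://localhost:11434/api/embeddings', json={
        'model': 'nomic-embed-text',
        'prompt': text,
    })
    return r.json()['embedding']

a = ollama_embed("Machine learning is a subset of AI")
b = ollama_embed("Deep learning uses neural networks")

# Cosine similarity without external dependencies.
dot = sum(x * y for x, y in zip(a, b))
norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
print(dot / norm)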

Use Cases

Semantic Search

# Index documents
doc_embeddings = model.encode(documents)

# Search
query_embedding = model.encode([query])
similarities = cosine_similarity(query_embedding, doc_embeddings)[0]
top_indices = similarities.argsort()[-5:][::-1]
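
Putting the two steps together, a minimal search helper over an in-memory corpus; it reuses the SentenceTransformer model and cosine_similarity import from earlier, and documents is a placeholder list of strings:

def search(query, documents, doc_embeddings, top_k=5):
    # Embed the query and rank documents by cosine similarity.
    query_embedding = model.encode([query])
    scores = cosine_similarity(query_embedding, doc_embeddings)[0]
    best = scores.argsort()[-top_k:][::-1]
    return [(documents[i], float(scores[i])) for i in best]

for doc, score in search("What is artificial intelligence?", documents, doc_embeddings):
    print(f"{score:.2f}  {doc}")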

Clustering

from sklearn.cluster import KMeans

embeddings = model.encode(texts)
kmeans = KMeans(n_clusters=5)
clusters = kmeans.fit_predict(embeddings)
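
To inspect what each cluster contains, group the original texts by their assigned label; a small sketch continuing from the arrays above:

from collections import defaultdict

# Group the input texts by the cluster label KMeans assigned them.
grouped = defaultdict(list)
for text, label in zip(texts, clusters):
    grouped[label].append(text)

for label, members in sorted(grouped.items()):
    print(f"Cluster {label}: {members[:3]}")  # show a few examples per cluster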

Classification

from sklearn.linear_model import LogisticRegression

embeddings = model.encode(texts)
classifier = LogisticRegression()
classifier.fit(embeddings, labels)
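
Prediction on unseen text reuses the same encoder; the example sentence is arbitrary and the predicted labels are whatever you trained on:

# Classify new texts by embedding them with the same model used for training.
new_texts = ["Neural networks learn hierarchical features"]
new_embeddings = model.encode(new_texts)
predictions = classifier.predict(new_embeddings)
print(predictions)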

Best Practices

  1. Normalize embeddings: unit-length vectors let cosine similarity be computed as a simple dot product
  2. Batch processing: encode in batches rather than one text at a time for throughput
  3. Cache embeddings: don’t recompute vectors for text you have already embedded (see the sketch after this list)
  4. Match the training domain: prefer domain-specific models when available
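
A small sketch combining the first three practices, assuming the SentenceTransformer model from earlier; the cache here is a plain in-memory dict, so swap in whatever store you actually use:

import numpy as np

_cache = {}  # illustrative in-memory cache keyed by the raw text

def embed_cached(texts, batch_size=64):
    # Encode only the texts we have not seen before, in batches,
    # and L2-normalize so cosine similarity is a dot product.
    missing = [t for t in texts if t not in _cache]
    if missing:
        vecs = model.encode(missing, batch_size=batch_size, normalize_embeddings=True)
        _cache.update(dict(zip(missing, vecs)))
    return np.array([_cache[t] for t in texts])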

Key Resources