## Overview
Text embeddings convert text into numerical vectors that capture semantic meaning. Similar texts have similar vectors, enabling semantic search and clustering.
## Embedding Models Comparison
| Model | Dimensions | Speed | Quality |
|---|---|---|---|
| all-MiniLM-L6-v2 | 384 | Fast | Good |
| all-mpnet-base-v2 | 768 | Medium | Better |
| e5-large-v2 | 1024 | Slow | Excellent |
| text-embedding-3-small | 1536 | API (hosted) | Excellent |
| nomic-embed-text | 768 | Fast | Very good |
## Sentence Transformers

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

sentences = [
    "Machine learning is a subset of AI",
    "Deep learning uses neural networks",
    "The weather is nice today"
]

embeddings = model.encode(sentences)
print(embeddings.shape)  # (3, 384)
```
## Semantic Similarity

```python
from sklearn.metrics.pairwise import cosine_similarity

query = "What is artificial intelligence?"
query_embedding = model.encode([query])

similarities = cosine_similarity(query_embedding, embeddings)[0]
# [0.82, 0.75, 0.12] - first two are similar, third is not
```
## Hugging Face Transformers

```python
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
model = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

def get_embedding(text):
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # Mean pooling over token embeddings; for batched input, weight by the
    # attention mask so padding tokens are ignored (see the sketch below)
    return outputs.last_hidden_state.mean(dim=1)
```
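When encoding several texts at once, padding tokens can skew a plain mean. Here is a minimal sketch of mask-weighted mean pooling, reusing the `tokenizer` and `model` defined above; the helper name is hypothetical, not part of the library:

```python
def get_embeddings_batch(texts):
    # Mean pooling weighted by the attention mask, so padding tokens
    # do not dilute the average
    inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    mask = inputs["attention_mask"].unsqueeze(-1).float()    # (batch, seq_len, 1)
    summed = (outputs.last_hidden_state * mask).sum(dim=1)   # sum over real tokens
    counts = mask.sum(dim=1).clamp(min=1e-9)                 # number of real tokens
    return summed / counts                                   # (batch, 384)

vecs = get_embeddings_batch(["Machine learning is a subset of AI",
                             "The weather is nice today"])
print(torch.nn.functional.cosine_similarity(vecs[0], vecs[1], dim=0))
```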
## OpenAI Embeddings

```python
from openai import OpenAI

client = OpenAI()

response = client.embeddings.create(
    model="text-embedding-3-small",
    input="Machine learning is fascinating"
)
embedding = response.data[0].embedding  # 1536 dimensions
```
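The endpoint also accepts a list of strings, so several texts can be embedded in a single request:

```python
texts = ["Machine learning is fascinating", "The weather is nice today"]
response = client.embeddings.create(model="text-embedding-3-small", input=texts)
vectors = [item.embedding for item in response.data]  # one 1536-dim vector per input text
```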
## Local with Ollama

```python
import requests

response = requests.post('http://localhost:11434/api/embeddings', json={
    'model': 'nomic-embed-text',
    'prompt': 'Machine learning is fascinating'
})
embedding = response.json()['embedding']
```
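To sanity-check the local setup, you can compare two embeddings directly. A small sketch with a helper function (the helper is an assumption for illustration, not part of the Ollama API):

```python
import numpy as np

def ollama_embed(text, model="nomic-embed-text"):
    # Request one embedding from the local Ollama server
    resp = requests.post('http://localhost:11434/api/embeddings',
                         json={'model': model, 'prompt': text})
    resp.raise_for_status()
    return np.array(resp.json()['embedding'])

a = ollama_embed("Machine learning is fascinating")
b = ollama_embed("Deep learning uses neural networks")
print(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))  # cosine similarity
```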
## Use Cases
### Semantic Search

```python
# Index documents (a list of strings)
doc_embeddings = model.encode(documents)

# Search
query_embedding = model.encode([query])
similarities = cosine_similarity(query_embedding, doc_embeddings)[0]
top_indices = similarities.argsort()[-5:][::-1]  # top 5 matches, best first
```
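To turn the indices back into results, look up the matching documents (assuming `documents` is the list of strings that was indexed):

```python
for i in top_indices:
    print(f"{similarities[i]:.3f}  {documents[i]}")
```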
### Clustering

```python
from sklearn.cluster import KMeans

embeddings = model.encode(texts)
kmeans = KMeans(n_clusters=5)
clusters = kmeans.fit_predict(embeddings)
```
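A quick way to inspect the result is to group the original texts by their assigned cluster:

```python
from collections import defaultdict

groups = defaultdict(list)
for text, cluster_id in zip(texts, clusters):
    groups[cluster_id].append(text)

for cluster_id, members in sorted(groups.items()):
    print(cluster_id, members[:3])  # a few representative texts per cluster
```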
### Classification

```python
from sklearn.linear_model import LogisticRegression

embeddings = model.encode(texts)
classifier = LogisticRegression()
classifier.fit(embeddings, labels)
```
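New texts are classified by encoding them with the same embedding model and passing the vectors to the trained classifier:

```python
new_embeddings = model.encode(["Is this document about machine learning?"])
print(classifier.predict(new_embeddings))        # predicted label
print(classifier.predict_proba(new_embeddings))  # class probabilities
```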
## Best Practices

- **Normalize embeddings**: unit-length vectors make cosine similarity a plain dot product (see the sketch below)
- **Batch processing**: encode many texts per call instead of one at a time
- **Cache embeddings**: don't recompute vectors for text you have already embedded
- **Match the training domain**: prefer domain-specific models (e.g. legal, biomedical, code) when available
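A minimal sketch combining the first three practices, assuming the SentenceTransformer `model` from above; the in-memory dict cache is an illustration, not a production store:

```python
import numpy as np

embedding_cache = {}

def embed_cached(texts, batch_size=64):
    # Encode only texts we have not seen before, in batches, with unit-length output
    missing = [t for t in texts if t not in embedding_cache]
    if missing:
        vectors = model.encode(missing, batch_size=batch_size,
                               normalize_embeddings=True)
        embedding_cache.update(zip(missing, vectors))
    return np.stack([embedding_cache[t] for t in texts])

# With normalized vectors, cosine similarity reduces to a dot product
a, b = embed_cached(["Machine learning is a subset of AI",
                     "Deep learning uses neural networks"])
print(float(a @ b))
```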