Semantic Search: The Foundation of Modern AI
How vector embeddings and similarity search power RAG, recommendations, and AI memory - the technical foundations and practical implementation.
Before RAG. Before agents. Before modern AI.
There was semantic search.
Everything else is built on top.
What It Is
Traditional search: match keywords.
Query: "refund policy"
Matches: documents containing "refund" and "policy"
Semantic search: match meaning.
Query: "refund policy"
Matches: documents about returns, money back,
cancellation, even if those exact words aren't used
The difference is understanding vs. pattern matching.
How It Works
Step 1: Embed Everything
Convert text to vectors (lists of numbers).
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')
embedding = model.encode("refund policy")
# → [0.023, -0.089, 0.156, ..., 0.044] (384 numbers)
Similar meanings → similar vectors.
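"Similar" here usually means cosine similarity: the angle between two vectors, ignoring their length. A minimal sketch with hand-made toy vectors (real embeddings have hundreds of dimensions, but the math is identical):

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of magnitudes:
    # 1.0 = same direction, 0.0 = unrelated, -1.0 = opposite.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional "embeddings" standing in for model output
refund = [0.9, 0.1, 0.0, 0.2]
returns = [0.8, 0.2, 0.1, 0.3]   # close in meaning -> close in space
weather = [0.0, 0.1, 0.9, 0.1]   # unrelated topic

cosine_similarity(refund, returns)  # high, close to 1
cosine_similarity(refund, weather)  # low, close to 0
```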
Step 2: Store in a Vector Database
Index for fast retrieval.
import chromadb

client = chromadb.Client()
collection = client.create_collection("docs")

documents = ["Return items within 30 days...", "Full refund available..."]
collection.add(
    documents=documents,
    embeddings=[embed(doc) for doc in documents],  # embed() wraps model.encode from Step 1
    ids=["doc1", "doc2"]
)
Step 3: Query by Similarity
Find nearest neighbors.
results = collection.query(
    query_embeddings=[embed("money back guarantee")],
    n_results=5
)
# Returns documents about refunds, returns, guarantees
Even if those exact words weren't in the query.
The Embedding Space
Embeddings capture relationships.
king - man + woman ≈ queen
paris - france + italy ≈ rome
Similar concepts cluster together. Different concepts are far apart.
This is why semantic search works - it's navigating a space where meaning has geometry.
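The analogy arithmetic above can be sketched with hand-made 3-d vectors where each dimension is an interpretable feature; real models learn hundreds of opaque dimensions, but the arithmetic works the same way:

```python
# Toy vectors; dimensions loosely read as [royalty, maleness, femaleness].
vocab = {
    "king":  [1.0, 1.0, 0.0],
    "queen": [1.0, 0.0, 1.0],
    "man":   [0.0, 1.0, 0.0],
    "woman": [0.0, 0.0, 1.0],
}

def nearest(target, vocab):
    # Return the word whose vector is closest (squared Euclidean) to target
    def dist(v):
        return sum((a - b) ** 2 for a, b in zip(v, target))
    return min(vocab, key=lambda w: dist(vocab[w]))

# king - man + woman, computed componentwise
target = [k - m + w for k, m, w in
          zip(vocab["king"], vocab["man"], vocab["woman"])]
nearest(target, vocab)  # → "queen"
```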
Choosing an Embedding Model
Dimensions
More dimensions = more nuance, more compute, more storage.
- 384 dims: fast, good for most use cases
- 768 dims: better quality, reasonable cost
- 1536+ dims: best quality, highest cost
Match to your quality needs.
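Dimensions also translate directly into storage. A back-of-envelope calculation, assuming float32 (4 bytes per number) and counting raw vectors only, not index overhead:

```python
def index_size_gb(n_vectors, dims, bytes_per_float=4):
    # Raw vector storage only; index structures add overhead on top
    return n_vectors * dims * bytes_per_float / 1e9

index_size_gb(1_000_000, 384)    # ~1.5 GB
index_size_gb(1_000_000, 1536)   # ~6.1 GB
```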
Training Data
Models trained on similar data work better.
- General text → use general models
- Code → use code-trained models
- Medical/legal → use domain-specific models
Popular Options
| Model | Dims | Speed | Quality |
|---|---|---|---|
| all-MiniLM-L6-v2 | 384 | ⚡⚡⚡ | ⭐⭐ |
| all-mpnet-base-v2 | 768 | ⚡⚡ | ⭐⭐⭐ |
| text-embedding-3-small | 1536 | ⚡⚡ | ⭐⭐⭐ |
| text-embedding-3-large | 3072 | ⚡ | ⭐⭐⭐⭐ |
Benchmark on YOUR data. General rankings don't always hold.
Vector Database Options
Purpose-Built
- Pinecone: Managed, scalable, expensive
- Weaviate: Open source, feature-rich
- Qdrant: Open source, Rust-based, fast
- Milvus: Open source, distributed
Add-ons to Existing DBs
- pgvector: PostgreSQL extension
- Elasticsearch: Has vector search now
- MongoDB: Atlas Vector Search
If you're already running Postgres, pgvector is often enough.
Practical Patterns
Hybrid Search
Vector search alone misses exact matches.
def hybrid_search(query, k=10):
    # Pull extra candidates from each backend, then merge and rerank
    vector_results = vector_db.search(embed(query), k*2)
    keyword_results = keyword_db.search(query, k*2)
    combined = merge(vector_results, keyword_results)
    return rerank(combined, query)[:k]
"Invoice #12345" - you want exact match, not semantically similar invoices.
Chunking Strategy
Documents → chunks → embeddings.
- Too small: loses context. "It" has no referent.
- Too large: dilutes relevance. Buries the answer.
# Overlapping chunks preserve context
# Overlapping chunks preserve context
chunks = chunk(document,
    size=500,     # tokens per chunk
    overlap=100   # tokens shared between adjacent chunks
)
500-1000 tokens with 10-20% overlap works for most cases.
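A minimal chunker along these lines, splitting on words rather than tokens (a rough stand-in; real pipelines count model tokens):

```python
def chunk_words(text, size=500, overlap=100):
    # Emit windows of `size` words; each window shares `overlap`
    # words with the previous one, so context at the seams survives.
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # last window already reached the end
    return chunks

doc = " ".join(f"w{i}" for i in range(1200))
parts = chunk_words(doc, size=500, overlap=100)
# 3 chunks: words 0-499, 400-899, 800-1199
```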
Metadata Filtering
Narrow search before similarity.
results = collection.query(
    query_embeddings=[embed(query)],
    where={"department": "engineering", "year": {"$gte": 2024}},
    n_results=10
)
Filter first, then find similar. Faster and more relevant.
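What "filter first" does under the hood can be sketched with a brute-force pass over an in-memory list (real vector DBs use indexed filters, but the order of operations is the same; the field names here are illustrative):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

docs = [
    {"id": "d1", "dept": "engineering", "year": 2024, "vec": [0.9, 0.1]},
    {"id": "d2", "dept": "engineering", "year": 2023, "vec": [0.9, 0.2]},  # fails year filter
    {"id": "d3", "dept": "sales",       "year": 2024, "vec": [0.9, 0.1]},  # fails dept filter
    {"id": "d4", "dept": "engineering", "year": 2025, "vec": [0.1, 0.9]},
]

def filtered_search(query_vec, docs, dept, min_year, k=2):
    # Apply the metadata predicate first, then rank only the survivors
    pool = [d for d in docs if d["dept"] == dept and d["year"] >= min_year]
    pool.sort(key=lambda d: cosine(query_vec, d["vec"]), reverse=True)
    return [d["id"] for d in pool[:k]]

filtered_search([1.0, 0.0], docs, dept="engineering", min_year=2024)  # → ["d1", "d4"]
```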
Common Failures
Cold Start
New documents aren't searchable until embedded.
Fix: Embed on write, not on read. Keep index fresh.
Embedding Drift
Model updates change the embedding space.
Old embeddings + new query embedding = bad results.
Fix: Re-embed everything when changing models.
Semantic Mismatch
User's language ≠ document's language.
"How do I get my money back?" vs. formal policy documents.
Fix: Query expansion. Rephrase queries before searching.
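The simplest form of expansion is a hand-maintained synonym table; production systems often ask an LLM to paraphrase instead. A toy sketch (the phrase table is illustrative):

```python
# Hand-maintained phrase table; in production this is often an LLM step
EXPANSIONS = {
    "money back": ["refund", "reimbursement"],
    "cancel": ["cancellation", "terminate"],
}

def expand_query(query):
    # Return the original query plus variants with known phrases swapped in;
    # search all variants and merge the results.
    variants = [query]
    for phrase, synonyms in EXPANSIONS.items():
        if phrase in query.lower():
            variants.extend(query.lower().replace(phrase, s) for s in synonyms)
    return variants

expand_query("How do I get my money back?")
# → original query plus "refund" and "reimbursement" variants
```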
Beyond Search
Semantic search enables:
- RAG: Find relevant context for LLM prompts
- Recommendations: Find similar items
- Deduplication: Find near-duplicate content
- Clustering: Group similar documents
- Anomaly detection: Find outliers
The foundation is the same. Applications vary.
Performance at Scale
Approximate Nearest Neighbor (ANN)
Exact search is O(n). ANN is O(log n).
# HNSW index - fast approximate search
collection.create_index(
    index_type="HNSW",
    metric="cosine",
    m=16,                 # graph connectivity
    ef_construction=200   # build-time quality
)
Slightly less accurate. Much faster.
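For contrast, here is the exact O(n) baseline that ANN indexes approximate: score every stored vector against the query and keep the top k. Fine at thousands of vectors; this is what becomes the bottleneck at millions.

```python
import heapq
import math

def exact_knn(query, vectors, k=5):
    # Brute force: compute cosine similarity against every stored
    # vector (O(n)) and keep the k best. HNSW approximates this.
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb)
    return heapq.nlargest(k, vectors, key=lambda item: cos(query, item[1]))

store = [("d1", [1.0, 0.0]), ("d2", [0.7, 0.7]), ("d3", [0.0, 1.0])]
exact_knn([1.0, 0.1], store, k=2)  # d1 then d2; d3 points the other way
```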
Sharding
Millions of vectors → split across machines.
Shard 1: docs from A-M
Shard 2: docs from N-Z
# Query both, merge results
Most vector DBs handle this automatically.
Further Reading
Technical Resources
- Sentence Transformers - Embedding models
- Understanding HNSW
- Vector Database Comparison (2024)
Related Posts
- RAG Reality Check - Search in practice
- Context Engineering - Using search results
LLMs get the attention. Semantic search does the work. Master the foundation and everything built on top makes more sense.