
Semantic Search: The Foundation of Modern AI

January 17, 2026
#semantic-search #embeddings #vector-databases #fundamentals #rag

How vector embeddings and similarity search power RAG, recommendations, and AI memory - the technical foundations and practical implementation.

Before RAG. Before agents. Before modern AI.

There was semantic search.

Everything else is built on top.

What It Is#

Traditional search: match keywords.

Query: "refund policy"
Matches: documents containing "refund" and "policy"

Semantic search: match meaning.

Query: "refund policy"
Matches: documents about returns, money back, 
         cancellation, even if those exact words aren't used

The difference is understanding vs. pattern matching.

How It Works#

Step 1: Embed Everything#

Convert text to vectors (lists of numbers).

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
embedding = model.encode("refund policy")
# → [0.023, -0.089, 0.156, ..., 0.044]  (384 numbers)

Similar meanings → similar vectors.
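
"Similar" is usually measured with cosine similarity: the dot product of two vectors divided by their lengths. A minimal sketch with toy 3-dimensional vectors (invented for illustration; real embeddings have hundreds of dimensions):

```python
import math

def cosine_similarity(a, b):
    # dot(a, b) / (|a| * |b|): 1.0 = same direction, ~0.0 = unrelated
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" - not real model output
refund  = [0.9, 0.1, 0.0]
returns = [0.8, 0.2, 0.1]
weather = [0.0, 0.1, 0.9]

print(cosine_similarity(refund, returns))  # close to 1: similar meaning
print(cosine_similarity(refund, weather))  # close to 0: unrelated
```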

Step 2: Store in a Vector Database#

Index for fast retrieval.

import chromadb

client = chromadb.Client()
collection = client.create_collection("docs")

docs = ["Return items within 30 days...", "Full refund available..."]
collection.add(
    documents=docs,
    embeddings=[embed(doc) for doc in docs],  # embed() wraps the model from Step 1
    ids=["doc1", "doc2"]
)

Step 3: Query by Similarity#

Find nearest neighbors.

results = collection.query(
    query_embeddings=[embed("money back guarantee")],
    n_results=5
)
# Returns documents about refunds, returns, guarantees

Even if those exact words weren't in the query.
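
Under the hood, `query` is nearest-neighbor ranking. A brute-force sketch (assuming unit-length vectors, so the dot product equals cosine similarity):

```python
import heapq

def top_k(query_vec, index, k=5):
    # index: list of (doc_id, unit_vector); brute-force O(n) scan
    scored = (
        (sum(q * v for q, v in zip(query_vec, vec)), doc_id)
        for doc_id, vec in index
    )
    return heapq.nlargest(k, scored)  # highest similarity first

index = [
    ("doc1", [1.0, 0.0]),
    ("doc2", [0.6, 0.8]),
    ("doc3", [0.0, 1.0]),
]
print(top_k([1.0, 0.0], index, k=2))  # → [(1.0, 'doc1'), (0.6, 'doc2')]
```

Vector databases do the same ranking, just with an index that avoids scanning every vector.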

The Embedding Space#

Embeddings capture relationships.

king - man + woman ≈ queen
paris - france + italy ≈ rome

Similar concepts cluster together. Different concepts are far apart.

This is why semantic search works - it's navigating a space where meaning has geometry.
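
The analogies above are literal vector arithmetic. A toy sketch with invented 2-dimensional vectors (real models learn these directions from data; nothing here is a real embedding):

```python
# Invented axes: dimension 0 ~ "royalty", dimension 1 ~ "maleness"
king  = [0.9, 0.8]
man   = [0.1, 0.8]
woman = [0.1, 0.2]
queen = [0.9, 0.2]

result = [k - m + w for k, m, w in zip(king, man, woman)]
print(result)  # ≈ [0.9, 0.2], i.e. queen's vector
```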

Choosing an Embedding Model#

Dimensions#

More dimensions = more nuance, more compute, more storage.

  • 384 dims: fast, good for most use cases
  • 768 dims: better quality, reasonable cost
  • 1536+ dims: best quality, highest cost

Match to your quality needs.
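
Storage cost is easy to estimate up front: vectors are usually stored as float32, so the raw index is docs × dims × 4 bytes. A quick sketch (raw vectors only; ANN index structures add overhead on top):

```python
def index_size_gb(num_docs, dims, bytes_per_dim=4):
    # float32 = 4 bytes per dimension
    return num_docs * dims * bytes_per_dim / 1024**3

print(round(index_size_gb(1_000_000, 384), 2))   # ≈ 1.43 GB
print(round(index_size_gb(1_000_000, 3072), 2))  # ≈ 11.44 GB
```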

Training Data#

Models trained on similar data work better.

  • General text → use general models
  • Code → use code-trained models
  • Medical/legal → use domain-specific models

Popular Options#

Model                  | Dims | Speed | Quality
-----------------------|------|-------|--------
all-MiniLM-L6-v2       | 384  | ⚡⚡⚡   | ⭐⭐
all-mpnet-base-v2      | 768  | ⚡⚡    | ⭐⭐⭐
text-embedding-3-small | 1536 | ⚡⚡    | ⭐⭐⭐
text-embedding-3-large | 3072 | ⚡     | ⭐⭐⭐⭐

Benchmark on YOUR data. General rankings don't always hold.

Vector Database Options#

Purpose-Built#

  • Pinecone: Managed, scalable, expensive
  • Weaviate: Open source, feature-rich
  • Qdrant: Open source, Rust-based, fast
  • Milvus: Open source, distributed

Add-ons to Existing DBs#

  • pgvector: PostgreSQL extension
  • Elasticsearch: Has vector search now
  • MongoDB: Atlas Vector Search

If you're already running Postgres, pgvector is often enough.

Practical Patterns#

Hybrid Search#

Vector search alone misses exact matches.

def hybrid_search(query, k=10):
    # vector_db, keyword_db, merge, rerank are illustrative stand-ins
    vector_results = vector_db.search(embed(query), k*2)
    keyword_results = keyword_db.search(query, k*2)

    # Combine and rerank
    combined = merge(vector_results, keyword_results)
    return rerank(combined, query)[:k]

"Invoice #12345" - you want exact match, not semantically similar invoices.
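
One common way to implement the merge-and-rerank step is reciprocal rank fusion (RRF), which needs only the two ranked ID lists. A minimal sketch (`k=60` is the conventional default constant):

```python
def rrf_merge(vector_ids, keyword_ids, k=60):
    # Reciprocal rank fusion: score(doc) = sum over lists of 1 / (k + rank)
    scores = {}
    for ranked in (vector_ids, keyword_ids):
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

merged = rrf_merge(
    vector_ids=["doc3", "doc1", "doc4"],  # semantic hits
    keyword_ids=["doc1", "doc2"],         # exact-match hits
)
print(merged)  # doc1 ranks first: it appears in both lists
```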

Chunking Strategy#

Documents → chunks → embeddings.

Too small: loses context ("it" has no referent).
Too large: dilutes relevance, buries the answer.

# Overlapping chunks preserve context
chunks = chunk(document, 
    size=500,      # tokens per chunk
    overlap=100    # tokens shared between adjacent chunks
)

500-1000 tokens with 10-20% overlap works for most cases.
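
The `chunk` helper above is hypothetical; a minimal word-based version looks like this (production code would count tokens with the embedding model's tokenizer, not words):

```python
def chunk(words, size=500, overlap=100):
    # Slide a window of `size` words, stepping by size - overlap
    step = size - overlap
    return [words[i:i + size] for i in range(0, max(len(words) - overlap, 1), step)]

doc = ["w%d" % i for i in range(1200)]
chunks = chunk(doc, size=500, overlap=100)
print(len(chunks))  # 3 chunks
print(chunks[1][0] == chunks[0][400])  # True: chunk 2 starts inside chunk 1
```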

Metadata Filtering#

Narrow search before similarity.

results = collection.query(
    query_embeddings=[embed(query)],
    where={"$and": [
        {"department": "engineering"},
        {"year": {"$gte": 2024}},
    ]},
    n_results=10
)

Filter first, then find similar. Faster and more relevant.

Common Failures#

Cold Start#

New documents aren't searchable until embedded.

Fix: Embed on write, not on read. Keep index fresh.

Embedding Drift#

Model updates change the embedding space.

Old embeddings + new query embedding = bad results.

Fix: Re-embed everything when changing models.

Semantic Mismatch#

User's language ≠ document's language.

"How do I get my money back?" vs. formal policy documents.

Fix: Query expansion. Rephrase queries before searching.
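
In practice query expansion usually means asking an LLM to reword the query; a toy rule-based sketch shows the shape (the synonym map is invented for illustration):

```python
SYNONYMS = {"money back": "refund", "send back": "return"}  # toy mapping

def expand_query(query):
    # Keep the original query, add a reworded variant per matched phrase
    variants = [query]
    for informal, formal in SYNONYMS.items():
        if informal in query.lower():
            variants.append(query.lower().replace(informal, formal))
    return variants

print(expand_query("How do I get my money back?"))
# → ['How do I get my money back?', 'how do i get my refund?']
```

Embed every variant, search with each, and merge the results: recall goes up because at least one phrasing matches the documents' register.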

Beyond Search#

Semantic search enables:

  • RAG: Find relevant context for LLM prompts
  • Recommendations: Find similar items
  • Deduplication: Find near-duplicate content
  • Clustering: Group similar documents
  • Anomaly detection: Find outliers

The foundation is the same. Applications vary.

Performance at Scale#

Approximate Nearest Neighbor (ANN)#

Exact search is O(n). ANN is O(log n).

# HNSW index - fast approximate search
# (illustrative call; the exact index API differs per vector DB)
collection.create_index(
    index_type="HNSW",
    metric="cosine",
    m=16,                # graph connectivity
    ef_construction=200  # build-time quality
)

Slightly less accurate. Much faster.

Sharding#

Millions of vectors → split across machines.

Shard 1: docs from A-M
Shard 2: docs from N-Z
# Query both, merge results

Most vector DBs handle this automatically.
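
The scatter-gather above fits in a few lines, assuming every shard supports the same top-k search (`search_shard` here is a stand-in for a per-machine index):

```python
import heapq

def search_shard(shard, query_vec, k):
    # shard: list of (doc_id, unit_vector); returns top-k (score, doc_id) pairs
    scored = ((sum(q * v for q, v in zip(query_vec, vec)), doc_id)
              for doc_id, vec in shard)
    return heapq.nlargest(k, scored)

def sharded_query(query_vec, shards, k=5):
    # Scatter the query to every shard, then merge the partial top-k lists
    partials = [search_shard(s, query_vec, k) for s in shards]
    return heapq.nlargest(k, (hit for part in partials for hit in part))

shard_a = [("a1", [1.0, 0.0]), ("a2", [0.0, 1.0])]
shard_b = [("b1", [0.8, 0.6]), ("b2", [0.5, 0.5])]
print(sharded_query([1.0, 0.0], [shard_a, shard_b], k=2))
# → [(1.0, 'a1'), (0.8, 'b1')]
```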

LLMs get the attention. Semantic search does the work. Master the foundation and everything built on top makes more sense.

