SQLServerCentral Article

You Probably Don't Need a Vector Database


Introduction

Conversations about RAG almost always hit the same question early: "What vector database are you using?"

Pinecone. Weaviate. Chroma. Qdrant. New options appear every few months. An entire layer of infrastructure has grown up around one core assumption: that documents must be converted into high-dimensional numerical arrays and loaded into a specialized index before a language model can do anything useful with them.

Over the past couple of years I've built a range of AI tooling: incident response systems, test generation pipelines, engineering knowledge bases. On most of those projects I kept gravitating toward something far more straightforward: BM25. That wasn't out of ignorance about vector databases. I'd already tried them, and that's exactly why I moved on.

What went wrong with vectors

The case for vector search sounds airtight. Semantic similarity lets you catch synonyms, paraphrases, and conceptual overlap. A search for "connection timeout" can surface a document that only mentions "socket refused," with no shared wording required. That's a real capability.

The operational cost is where things get complicated.

You're committing to an embedding model, and whatever model you choose at index time must be the same one you use at query time. Swapping it means reindexing everything. The index itself has to live somewhere: self-hosting adds infrastructure overhead, while cloud hosting introduces latency, cost, and data leaving your network. Debugging is the part nobody talks about. When retrieval quality degrades, the culprit could be chunking logic, embedding model drift, a misconfigured similarity threshold, or a stale index, and none of these failure modes surface cleanly.

For internal tooling in data-sensitive enterprise environments, that's a heavy stack to maintain for what is frequently a straightforward retrieval task.

What BM25 actually is

BM25 (Best Match 25) is a term-based ranking algorithm with roots in the 1990s. It is the default relevance scoring function in Lucene and Elasticsearch. The mechanics are simple: for a given query, each document receives a score based on how frequently query terms appear in it, adjusted for document length and how rare those terms are across the whole corpus.

No embeddings. No neural network. No external services. The rank-bm25 package used here is a small Python library (its only dependency is NumPy) that runs entirely on your machine, and it's fast.
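To make the scoring mechanics concrete, here is a minimal sketch of BM25 in plain Python. It is an illustration, not the rank-bm25 implementation: the function name `bm25_scores` is hypothetical, `k1=1.5` and `b=0.75` are commonly used defaults assumed here, and the IDF uses the smoothed `log(1 + ...)` form, so absolute scores may differ from what a given library reports.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, tokenized_docs, k1=1.5, b=0.75):
    """Score every document against the query (illustrative sketch)."""
    n_docs = len(tokenized_docs)
    avg_len = sum(len(doc) for doc in tokenized_docs) / n_docs

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for doc in tokenized_docs:
        doc_freq.update(set(doc))

    scores = []
    for doc in tokenized_docs:
        term_freq = Counter(doc)
        # Longer-than-average documents are penalized, shorter ones boosted.
        length_norm = 1 - b + b * len(doc) / avg_len
        score = 0.0
        for term in query_tokens:
            tf = term_freq[term]
            if tf == 0:
                continue  # a term the document lacks contributes nothing
            # Rare terms (low document frequency) get a higher weight.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term frequency saturates: repeated hits add less and less.
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


docs = [
    "tls connection pool exhausted on edge node".split(),
    "origin server returned 503 during handshake".split(),
    "cache miss rate exceeded configured threshold".split(),
]
scores = bm25_scores("tls connection failure".split(), docs)
print(scores)
```

Only the first document shares any terms with the query, so it is the only one with a non-zero score.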

Here is a BM25 quick test using Python.

"""
BM25 Quick Test
Run: python test_bm25.py
Requires: pip install rank-bm25
"""

from rank_bm25 import BM25Okapi

corpus = [
    "TLS connection pool exhausted on edge node",
    "Origin server returned 503 during handshake",
    "Cache miss rate exceeded configured threshold",
]

tokenized = [doc.lower().split() for doc in corpus]
bm25 = BM25Okapi(tokenized)

query = "TLS connection failure"
scores = bm25.get_scores(query.lower().split())

print(f"Query: '{query}'\n")
for i, (doc, score) in enumerate(zip(corpus, scores)):
    marker = " <- highest" if score == max(scores) else ""
    print(f"[{i}] score={score:.4f}{marker}")
    print(f"     {doc}")

-----------------------------Output--------------------------------
python test_bm25.py
Query: 'TLS connection failure'

[0] score=0.9754 <- highest
     TLS connection pool exhausted on edge node
[1] score=0.0000
     Origin server returned 503 during handshake
[2] score=0.0000
     Cache miss rate exceeded configured threshold
-------------------------------------------------------------------

Engineering teams querying technical documentation, runbooks, incident logs, SQL schemas, or API references are working in precise, domain-specific language. They write the same way they search. BM25 is well-suited to that pattern.
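One caveat about the quick test: `lower().split()` keeps punctuation attached, so "failover?" in a question and "failover" in a runbook count as different terms. A minimal sketch of a slightly stricter tokenizer follows. The `tokenize` helper and its regular expression are assumptions made for illustration, not part of rank-bm25, and the same function would need to be applied to both documents and queries.

```python
import re

# Runs of letters/digits, optionally joined by an apostrophe, underscore,
# or hyphen, so "what's" and "non-200" each stay a single token.
TOKEN_PATTERN = re.compile(r"[a-z0-9]+(?:['_-][a-z0-9]+)*")


def tokenize(text: str) -> list[str]:
    """Lowercase the text and drop surrounding punctuation."""
    return TOKEN_PATTERN.findall(text.lower())


question = "What's the recovery procedure for an origin failover?"
print(question.lower().split()[-1])  # whitespace split keeps the "?"
print(tokenize(question)[-1])        # regex tokenizer drops it
```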

Where it falls short

BM25 struggles with conversational queries, abstract questions, and anything requiring true semantic reasoning. A query like "what broke last Tuesday?" won't retrieve the right incident report unless the report happens to contain those same words.

Cross-language retrieval is off the table, and any user phrasing that shares little vocabulary with the underlying documents will produce poor results.
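A quick way to see this failure mode is to check which query terms actually occur in a document. If none do, the BM25 score is zero no matter how relevant the document is. The `shared_terms` helper here is a hypothetical diagnostic, and whitespace tokenization is assumed to match the quick test.

```python
def shared_terms(query: str, document: str) -> set[str]:
    """Return the query terms that literally occur in the document."""
    return set(query.lower().split()) & set(document.lower().split())


doc = "Origin server returned 503 during handshake"

# Same vocabulary as the document: BM25 has something to score.
print(shared_terms("origin 503 handshake", doc))

# Same meaning, different words: no overlap, so the score would be zero.
print(shared_terms("upstream unavailable while negotiating", doc))
```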

Vector search genuinely outperforms BM25 in those situations. The error isn't using vector search. It's treating those situations as typical rather than as edge cases, especially for internal tooling.

The setup that's worked for me

Here's what a minimal but production-ready vectorless RAG pipeline looks like, pairing BM25 retrieval with a local Ollama model for air-gapped or privacy-sensitive environments:

"""
Full VectorlessRAG Test
Run: python test_rag.py
Requires:
  - pip install rank-bm25 requests
  - Ollama running locally (https://ollama.com)
  - ollama pull llama3.1  (or any other model)
"""

import requests
from rank_bm25 import BM25Okapi
from pathlib import Path
import sys

# -- Config -----------------------------------------------------------------
OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL      = "llama3.1"
DOCS_FOLDER = "./runbooks"
CHUNK_SIZE  = 400
OVERLAP     = 50
TOP_K       = 4
# ---------------------------------------------------------------------------


def check_ollama():
    """Verify Ollama is reachable before doing anything else."""
    try:
        requests.get("http://localhost:11434", timeout=3)
        print("[ok] Ollama is running\n")
    except (requests.exceptions.ConnectionError, requests.exceptions.Timeout):
        print("[error] Ollama not reachable. Start it with: ollama serve")
        print(f"  Then pull a model:  ollama pull {MODEL}")
        sys.exit(1)


def load_and_chunk(folder: str, chunk_size: int = CHUNK_SIZE) -> list[str]:
    """Split every .md file under `folder` into overlapping word-window chunks."""
    chunks = []
    paths = list(Path(folder).rglob("*.md"))
    if not paths:
        print(f"[warn] No .md files found in '{folder}'. Creating sample docs...")
        create_sample_docs(folder)
        paths = list(Path(folder).rglob("*.md"))
    for path in paths:
        print(f"  Loading: {path}")
        words = path.read_text().split()
        # Step by chunk_size - OVERLAP so consecutive chunks share OVERLAP words.
        for i in range(0, len(words), chunk_size - OVERLAP):
            chunk = " ".join(words[i:i + chunk_size])
            if chunk.strip():
                chunks.append(chunk)
    print(f"\n[ok] Loaded {len(chunks)} chunks from {len(paths)} files\n")
    return chunks


def create_sample_docs(folder: str):
    """Create sample runbook docs if the folder is empty."""
    Path(folder).mkdir(parents=True, exist_ok=True)
    docs = {
        "tls.md": (
            "TLS connection pool exhausted on edge node. "
            "This alert fires when the number of active TLS connections exceeds the configured pool limit. "
            "Recovery procedure: restart the connection manager service and flush idle connections. "
            "Verify certificate expiry dates are not within the next 7 days. "
            "Check if a recent deployment increased connection concurrency. "
            "Monitor the TLS error rate for at least 5 minutes after restart before closing. "
            "Escalate to the network team if the pool exhaustion recurs within 30 minutes."
        ),
        "failover.md": (
            "Origin server failover procedure. "
            "This runbook covers the steps to failover traffic away from an unhealthy origin server. "
            "Step 1: Check the origin health endpoint and confirm it is returning non-200 responses. "
            "Step 2: Reroute traffic to the secondary origin using the load balancer config. "
            "Step 3: Notify the on-call engineer via the incident channel. "
            "Step 4: Verify response codes return to 200 across all edge nodes before closing the incident. "
            "Step 5: File a post-mortem ticket if the origin was down for more than 10 minutes. "
            "Do not failback to the primary origin until root cause is confirmed."
        ),
        "cache.md": (
            "Cache miss rate exceeded threshold alert. "
            "This alert fires when the cache miss rate rises above 40 percent over a 5 minute window. "
            "Check TTL configuration and purge stale keys if needed. "
            "High cache miss rates can indicate a recent deployment, a config change, or a cache eviction spike. "
            "Review the deployment log for changes in the last 2 hours. "
            "If TTLs look correct, check memory pressure on the cache nodes. "
            "Purge and repopulate the cache only after confirming the root cause to avoid repeat misses."
        ),
    }
    for filename, content in docs.items():
        Path(folder, filename).write_text(content)
    print(f"[ok] Created sample docs in '{folder}'\n")


class VectorlessRAG:
    def __init__(self, chunks: list[str]):
        self.chunks = chunks
        self.bm25 = BM25Okapi([c.lower().split() for c in chunks])

    def retrieve(self, query: str, top_k: int = TOP_K) -> list[str]:
        scores = self.bm25.get_scores(query.lower().split())
        # Rank chunk indices by BM25 score, highest first, and keep the top_k.
        top_idx = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
        print("  Top chunks retrieved:")
        for rank, idx in enumerate(top_idx):
            print(f"    [{rank+1}] score={scores[idx]:.4f} | {self.chunks[idx][:80]}...")
        print()
        return [self.chunks[i] for i in top_idx]

    def ask(self, question: str) -> str:
        print(f"Question: {question}")
        print(f"Retrieving context...\n")
        context = "\n\n---\n\n".join(self.retrieve(question))
        payload = {
            "model": MODEL,
            "prompt": (
                f"Answer based only on the context below. "
                f"Be concise.\n\n{context}\n\nQuestion: {question}"
            ),
            "stream": False,
        }
        print(f"Calling Ollama ({MODEL})...")
        response = requests.post(OLLAMA_URL, json=payload, timeout=60)
        response.raise_for_status()
        return response.json()["response"]


if __name__ == "__main__":
    check_ollama()

    chunks = load_and_chunk(DOCS_FOLDER)
    rag = VectorlessRAG(chunks)

    test_questions = [
        "What's the recovery procedure for an origin failover?",
        "How do I fix TLS connection issues?",
        "What causes high cache miss rates?",
    ]

    for question in test_questions:
        print("-" * 60)
        answer = rag.ask(question)
        print(f"Answer:\n{answer}\n")

-----------------------------------Output------------------------------------------------------
python3 test_rag.py
[ok] Ollama is running

  Loading: runbooks/tls.md
  Loading: runbooks/failover.md
  Loading: runbooks/cache.md

[ok] Loaded 3 chunks from 3 files

------------------------------------------------------------
Question: What's the recovery procedure for an origin failover?
Retrieving context...

  Top chunks retrieved:
    [1] score=1.7191 | Origin server failover procedure. This runbook covers the steps to failover traf...
    [2] score=0.7737 | TLS connection pool exhausted on edge node. This alert fires when the number of ...
    [3] score=0.2293 | Cache miss rate exceeded threshold alert. This alert fires when the cache miss r...

Calling Ollama (llama3.1)...
Answer:
Reroute traffic to the secondary origin using the load balancer config. Verify response codes return to 200 across all edge nodes before closing the incident. File a post-mortem ticket if the origin was down for more than 10 minutes. Do not failback to the primary origin until root cause is confirmed.

------------------------------------------------------------
Question: How do I fix TLS connection issues?
Retrieving context...

  Top chunks retrieved:
    [1] score=1.7474 | TLS connection pool exhausted on edge node. This alert fires when the number of ...
    [2] score=0.4836 | Origin server failover procedure. This runbook covers the steps to failover traf...
    [3] score=0.0000 | Cache miss rate exceeded threshold alert. This alert fires when the cache miss r...

Calling Ollama (llama3.1)...
Answer:
Restart the connection manager service, flush idle connections, and verify certificate expiry dates are not within the next 7 days.

------------------------------------------------------------
Question: What causes high cache miss rates?
Retrieving context...

  Top chunks retrieved:
    [1] score=2.3975 | Cache miss rate exceeded threshold alert. This alert fires when the cache miss r...
    [2] score=0.0000 | TLS connection pool exhausted on edge node. This alert fires when the number of ...
    [3] score=0.0000 | Origin server failover procedure. This runbook covers the steps to failover traf...

Calling Ollama (llama3.1)...
Answer:
A recent deployment, a config change, or a cache eviction spike.
---------------------------------------------------------------------------------------------

No external services. No embedding costs. Indexing a few hundred documents takes milliseconds. The whole thing fits in a single file you can hand to any engineer on your team without a setup guide.
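The word-window chunking that the `CHUNK_SIZE` and `OVERLAP` settings describe can be sketched in isolation. This is a simplified illustration with a hypothetical `chunk_words` helper, not the script's exact function. One assumed design choice is that it stops once a window reaches the end of the text, so no tiny trailing chunk is emitted.

```python
def chunk_words(words: list[str], chunk_size: int, overlap: int) -> list[list[str]]:
    """Split a word list into windows that share `overlap` words."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break  # this window already covers the tail of the text
    return chunks


# Ten "words", windows of four, one word shared between neighbours.
for chunk in chunk_words(list("abcdefghij"), chunk_size=4, overlap=1):
    print(chunk)
```

The overlap means a sentence that straddles a chunk boundary still appears intact in at least one chunk, at the cost of indexing some words twice.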

When to upgrade

None of this is a permanent argument against vector search. It's an argument for starting simple and adding complexity only when you've hit the actual wall.

If your retrieval quality is good enough that your LLM's answers are useful, you're done. If you start seeing consistent misses where users are querying in natural language that doesn't match your document terminology, that's the moment to introduce embeddings. By that point you'll have real examples of where BM25 failed, which makes evaluating vector alternatives much more concrete than benchmarking against synthetic queries.
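One way to collect those real examples is to keep a small list of queries paired with the document each should retrieve, then measure how often the expected document shows up in the top results. The sketch here assumes a hypothetical `hit_rate_at_k` helper and a `retrieve` callable that returns ranked document IDs. The toy retriever and the labeled queries are invented for illustration.

```python
def hit_rate_at_k(retrieve, labeled_queries, k=4):
    """Return (hit rate, list of missed queries) for a retriever."""
    hits = 0
    misses = []
    for query, expected_id in labeled_queries:
        if expected_id in retrieve(query)[:k]:
            hits += 1
        else:
            misses.append(query)  # keep the failures for later comparison
    rate = hits / len(labeled_queries) if labeled_queries else 0.0
    return rate, misses


# Toy stand-in for a real retriever: rank documents by shared word count.
DOCS = {
    "tls.md": "tls connection pool exhausted on edge node",
    "cache.md": "cache miss rate exceeded configured threshold",
}


def toy_retrieve(query):
    terms = set(query.lower().split())
    overlap = {doc_id: len(terms & set(text.split())) for doc_id, text in DOCS.items()}
    ranked = sorted(overlap, key=overlap.get, reverse=True)
    return [doc_id for doc_id in ranked if overlap[doc_id] > 0]


labeled = [
    ("tls pool exhausted", "tls.md"),
    ("cache miss rate", "cache.md"),
    ("handshake keeps dropping", "tls.md"),  # no shared vocabulary
]
rate, misses = hit_rate_at_k(toy_retrieve, labeled, k=1)
print(f"hit rate: {rate:.2f}")
print("misses:", misses)
```

The list of misses is the useful part: it is the set of queries to replay against any embedding-based alternative.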

The AI tooling space has a habit of front-loading infrastructure. You set up Pinecone before you've written a single line of retrieval logic. You tune your embedding dimensions before you know what your users will actually ask.

BM25's value isn't theoretical. It's the retrieval layer you can actually build, debug, and put in front of users in the time a cloud vector service needs just to get you onboarded.


Kumar Abhishek is an Engineering Manager specializing in quality engineering and AI-powered developer tooling. Connect with me on LinkedIn: https://www.linkedin.com/in/kr0abhishek/
