As senior engineers, we constantly strive to build systems that are not just functional but intelligent, intuitive, and highly performant. In the realm of information retrieval, the evolution from keyword-based search to semantic search marks a pivotal leap towards truly understanding user intent. Gone are the days when a precise keyword match was enough; today, users expect search engines to grasp the meaning behind their queries, even if the exact words aren’t present in the documents.

This article delves deep into building a robust, scalable semantic search system leveraging the powerful combination of Elasticsearch or OpenSearch as our vector database and search engine, coupled with the cutting-edge capabilities of Transformer models for generating rich semantic embeddings. We’ll explore the ‘why,’ the ‘how,’ and the ‘what’ of this architecture, providing practical code examples, discussing scaling strategies, and outlining real-world applications.

If you’ve ever wondered how to move beyond simple keyword matching and enable your applications to understand the nuances of language, you’re in the right place. Let’s embark on this journey to empower your search with true intelligence.

The Paradigm Shift: From Keywords to Concepts

For decades, traditional search engines have relied heavily on lexical matching. When you typed “best laptop for coding,” a keyword search engine would look for documents containing those exact words or their stemming variations. This approach, while effective for many use cases, suffers from several inherent limitations:

Synonymy: A user searching for “automobile” might miss documents talking about “cars.”
Polysemy: A query for “bank” could refer to a financial institution or a river bank, leading to irrelevant results.
Lack of Context: Keyword search struggles to understand the relationships between words or the overall meaning of a sentence.
Vocabulary Mismatch: If a document describes a concept using different phrasing than the query, it won’t be retrieved, even if semantically relevant.

Semantic search, on the other hand, aims to overcome these hurdles by understanding the *meaning* or *intent* behind the query and the content. It doesn’t just match words; it matches concepts. This is where the magic of modern Natural Language Processing (NLP) and, specifically, Transformer models comes into play.

Enter Transformers: Encoding Meaning into Vectors

At the heart of semantic search lies the concept of converting text into numerical representations called “embeddings” or “vector embeddings.” These embeddings are high-dimensional vectors where semantically similar pieces of text (words, sentences, paragraphs, or even entire documents) are mapped to points that are close to each other in a multi-dimensional vector space.

Transformer models, such as BERT, RoBERTa, ELECTRA, and their derivatives (like Sentence-BERT), have revolutionized NLP by achieving unprecedented performance in understanding context and generating these powerful embeddings. Older word embedding techniques (like Word2Vec or GloVe) assign each word a single fixed vector regardless of the sentence it appears in. Transformers instead process entire sequences of text, allowing them to grasp the full meaning of a sentence and produce embeddings that reflect that meaning.

How do they work (at a high level)?

Transformers use a mechanism called “attention” to weigh the importance of different words in a sentence relative to each other. This allows them to capture long-range dependencies and contextual relationships. When used for embedding generation, the final layer’s output (often after some pooling operation like mean pooling for sentence embeddings) becomes the dense vector representation of the input text.
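
To make the pooling step concrete, here is a minimal sketch of masked mean pooling in plain Python. The token vectors and attention mask are made-up toy values, and `mean_pool` is a hypothetical helper name, not part of any library. Real models do this on tensors, but the arithmetic is the same.

```python
from typing import List


def mean_pool(token_vectors: List[List[float]], attention_mask: List[int]) -> List[float]:
    """Average the token vectors whose mask is 1, ignoring padding tokens."""
    if len(token_vectors) != len(attention_mask):
        raise ValueError("one mask entry is required per token")
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention mask removed every token")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Toy input: three real tokens and one padding token (mask 0), 2-dimensional vectors.
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [9.0, 9.0]]
mask = [1, 1, 1, 0]
sentence_vector = mean_pool(tokens, mask)
print(sentence_vector)
```

The padding token is excluded from the average, which is why the mask matters when sentences in a batch have different lengths.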

For semantic search, we typically use models specifically fine-tuned for sentence similarity tasks, like Sentence-BERT (SBERT). SBERT models are designed to produce embeddings that are highly discriminative for semantic similarity, meaning that the cosine similarity between two sentence embeddings directly correlates with their semantic relatedness.
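
As a small illustration of the similarity measure mentioned above, the following sketch computes cosine similarity in plain Python. The three-dimensional vectors are invented for the example (real SBERT embeddings have hundreds of dimensions), and `cosine_similarity` is a helper written here, not a library call.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# Made-up "embeddings": the first two point in a similar direction, the third does not.
car = [0.9, 0.1, 0.0]
automobile = [0.8, 0.2, 0.1]
banana = [0.0, 0.2, 0.9]

print(cosine_similarity(car, automobile))
print(cosine_similarity(car, banana))
```

With these toy values, the related pair scores much higher than the unrelated pair, which is the property a similarity-tuned embedding model is trained to produce.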

[Image: Semantic Search at Scale with ES, OS, and Transformers]

Architecture Overview: Building a Scalable Semantic Search System

To build a robust semantic search system at scale, we need an architecture that efficiently handles two main phases: Indexing (converting documents into searchable embeddings) and Searching (converting queries into embeddings and finding similar documents).

The Indexing Pipeline (Data Ingestion)

The indexing pipeline is responsible for taking raw data, processing it, generating embeddings, and storing it in our search engine.

Components:

Data Source: This could be anything from databases, file systems, web scrapes, APIs, or event streams.
ETL/Data Processing Layer: Before generating embeddings, documents often need cleaning, normalization, and chunking. Large documents might need to be split into smaller, semantically coherent chunks to improve embedding quality and search precision. This layer could be implemented using Apache Flink, Spark, Kafka Streams, or simple Python scripts for smaller scales.
Transformer Embedding Service: This is a crucial component responsible for taking text chunks and generating their corresponding dense vector embeddings. This service might run on GPUs for faster inference, especially at scale. Frameworks like Hugging Face Transformers, ONNX Runtime, or NVIDIA TensorRT can be used here.
Elasticsearch/OpenSearch Cluster: Our chosen search engine. It will store the original text, metadata, and, most importantly, the dense vector embeddings. It acts as both a traditional inverted index for keyword search and a vector database for semantic search.

Flow:


+-----------------+     +-----------------------+     +-------------------------------+     +----------------------------------+
|   Data Source   | --> | ETL/Data Processing   | --> | Transformer Embedding Service | --> | Elasticsearch/OpenSearch Cluster |
| (DBs, Files,    |     | (Clean, Normalize,    |     | (Text -> Dense Vector         |     | (Store Text, Metadata, Embeddings)|
|   APIs, Streams)|     |   Chunk)              |     |   Embeddings)                 |     |                                  |
+-----------------+     +-----------------------+     +-------------------------------+     +----------------------------------+

In this flow, documents are ingested, pre-processed, fed to the Transformer service to get their vector representation, and then stored alongside other metadata in Elasticsearch/OpenSearch. This entire process can be asynchronous, often driven by message queues (e.g., Kafka) for robustness and scalability.
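
The chunking step in the processing layer can be sketched as a sliding window over words. This is only one possible strategy, shown under simple assumptions: splitting on whitespace, a fixed window size, and a fixed overlap. The function name `chunk_words` and its default values are illustrative choices, and a production pipeline would more likely split on sentence or paragraph boundaries.

```python
from typing import List


def chunk_words(text: str, max_words: int = 50, overlap: int = 10) -> List[str]:
    """Split text into word windows of at most max_words, overlapping by `overlap` words."""
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be in the range [0, max_words)")
    words = text.split()
    step = max_words - overlap
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reached the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(12))
for chunk in chunk_words(sample, max_words=5, overlap=2):
    print(chunk)
```

The overlap keeps a sentence that straddles a boundary at least partly present in both neighboring chunks, at the cost of indexing some text twice.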

The Search Pipeline (Query Execution)

The search pipeline handles user queries, converts them into embeddings, and performs a similarity search against the indexed document embeddings.

Components:

User Interface/Application: The front-end where users submit their queries.
Query Processing/API Gateway: This layer receives user queries, performs any necessary pre-processing (e.g., spell correction, query expansion), and forwards them to the embedding service.
Transformer Embedding Service: The *same* service used during indexing. It takes the user’s natural language query and converts it into a dense vector embedding. Consistency is key here; using the same model for both indexing and querying ensures that query and document embeddings exist in the same vector space.
Elasticsearch/OpenSearch Cluster: Receives the query embedding and performs a vector similarity search (kNN search) against the indexed document embeddings. It can also perform traditional keyword searches simultaneously.
Result Re-ranking/Post-processing: After retrieving initial results, this layer might apply additional logic for re-ranking (e.g., based on freshness, popularity, user personalization) or filtering.
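
As one possible shape for the re-ranking layer, the sketch below blends the engine’s relevance score with a freshness signal. Everything here is an assumption made for illustration: the hit fields (`score`, `age_days`), the exponential half-life decay, and the 80/20 weighting are not prescribed by Elasticsearch or OpenSearch, and the scores are assumed to be on a 0 to 1 scale already.

```python
from typing import Dict, List


def rerank_with_freshness(
    hits: List[Dict],
    half_life_days: float = 30.0,
    freshness_weight: float = 0.2,
) -> List[Dict]:
    """Sort hits by a weighted blend of relevance score and exponential freshness decay."""

    def final_score(hit: Dict) -> float:
        # Freshness is 1.0 for a brand-new document and halves every half_life_days.
        freshness = 0.5 ** (hit["age_days"] / half_life_days)
        return (1 - freshness_weight) * hit["score"] + freshness_weight * freshness

    return sorted(hits, key=final_score, reverse=True)


# Made-up hits: the older document is slightly more relevant, the newer one is fresh.
hits = [
    {"id": "old", "score": 0.80, "age_days": 365},
    {"id": "new", "score": 0.78, "age_days": 1},
]
reranked = rerank_with_freshness(hits)
print([hit["id"] for hit in reranked])
```

Keeping this logic outside the search engine makes it easy to experiment with weights without re-indexing.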

Flow:


+------------------+     +-----------------------+     +-------------------------------+     +----------------------------------+     +-------------------------+
| User Interface/  | --> | Query Processing/     | --> | Transformer Embedding Service | --> | Elasticsearch/OpenSearch Cluster | --> | Result Re-ranking/      |
|   Application    |     |   API Gateway         |     | (Query Text -> Dense Vector   |     | (Vector Similarity Search        |     |   Post-processing       |
|                  |     |                       |     |   Embedding)                  |     |   + Keyword Search)              |     |                         |
+------------------+     +-----------------------+     +-------------------------------+     +----------------------------------+     +-------------------------+

A typical semantic search query flow involves the user entering text, which is then embedded, and this embedding is used to query Elasticsearch/OpenSearch. The search engine returns documents whose embeddings are closest to the query embedding in the vector space.
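
Conceptually, the vector search step is a nearest-neighbor lookup. The sketch below does it by brute force over a handful of made-up two-dimensional vectors, which is what an exact kNN search computes. The search engine’s approximate kNN aims to return (nearly) the same top results without scanning every vector. The names `exact_knn` and `cosine` are helpers defined here for illustration.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def exact_knn(
    query_vector: Sequence[float],
    doc_vectors: Dict[str, List[float]],
    k: int,
) -> List[Tuple[str, float]]:
    """Score every document against the query and return the k most similar."""
    scored = [(doc_id, cosine(query_vector, vec)) for doc_id, vec in doc_vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy index: invented vectors standing in for document embeddings.
doc_vectors = {
    "energy": [0.9, 0.1],
    "diet": [0.1, 0.9],
    "climate": [0.8, 0.3],
}
top_hits = exact_knn([1.0, 0.2], doc_vectors, k=2)
print(top_hits)
```

Brute force costs one similarity computation per document per query, which is why large indexes rely on approximate structures such as HNSW.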

The Power of Hybrid Search

While semantic search is incredibly powerful, relying solely on vector similarity can sometimes miss highly relevant documents that happen to use precise keywords, especially for niche terms or proper nouns. Conversely, traditional keyword search often fails for conceptual queries.

The optimal approach, especially at scale, is to combine both methods: Hybrid Search. This involves performing both a semantic search (using vector similarity) and a lexical search (using the inverted index) simultaneously, then intelligently combining their results.

Elasticsearch and OpenSearch offer mechanisms to facilitate hybrid search, such as:

Boolean Queries: Combining `knn` queries with traditional `match` or `term` queries in a single boolean query.
Reciprocal Rank Fusion (RRF): A technique for combining ranked lists from different search methods (e.g., semantic and lexical) into a single ranked list. Because RRF works on rank positions rather than raw scores, it avoids having to normalize vector-similarity and BM25 scores onto a common scale, and it balances the strengths of both approaches.

Hybrid search ensures that you get the best of both worlds: the conceptual understanding of semantic search and the precision of keyword matching.

Deep Dive into Implementation: Code Examples

Let’s get our hands dirty with some code examples. We’ll focus on setting up Elasticsearch/OpenSearch, generating embeddings with a Transformer model, and performing both indexing and searching.

For this example, we’ll use Python, the `sentence-transformers` library, and the `elasticsearch` client.

Step 1: Choose and Load a Transformer Model

We’ll use a pre-trained Sentence-BERT model. For practical applications, consider models like `all-MiniLM-L6-v2` for speed and efficiency, or larger models like `all-mpnet-base-v2` for better accuracy if latency allows.


from sentence_transformers import SentenceTransformer

# Load a pre-trained Sentence-BERT model
# 'all-MiniLM-L6-v2' is a good balance of speed and performance
# For higher accuracy, consider 'all-mpnet-base-v2'
model_name = 'all-MiniLM-L6-v2'
embedding_model = SentenceTransformer(model_name)

# The dimension of the embeddings produced by this model
# For all-MiniLM-L6-v2, it's 384. For all-mpnet-base-v2, it's 768.
EMBEDDING_DIM = embedding_model.get_sentence_embedding_dimension()
print(f"Loaded model: {model_name}, embedding dimension: {EMBEDDING_DIM}")

# Example: generate an embedding
text_example = "What is the capital of France?"
embedding_example = embedding_model.encode(text_example)
print(f"Embedding shape: {embedding_example.shape}")

Step 2: Set up Elasticsearch/OpenSearch Index Mapping

We need an index that can store our embeddings. In Elasticsearch, the `dense_vector` field type is designed for this purpose; OpenSearch uses the `knn_vector` field type from its k-NN plugin instead.


from elasticsearch import Elasticsearch

# Connect to your Elasticsearch instance
# For OpenSearch, use the OpenSearch client from the opensearch-py package
# instead of Elasticsearch(), and adjust authentication/port as needed.
es = Elasticsearch(
    hosts=["http://localhost:9200"],
    # For self-signed certificates or specific auth, adjust accordingly
    # basic_auth=("user", "password"),
    # verify_certs=False
)

index_name = "semantic_documents"

# Define the index mapping (Elasticsearch 8.x syntax)
# The 'dims' parameter must match the dimension of your embeddings.
# Setting 'index' to True builds an HNSW (Hierarchical Navigable Small World) graph
# for efficient approximate nearest neighbor search.
# The 'similarity' parameter (l2_norm, cosine, dot_product) defines the distance metric.
index_mapping = {
    "properties": {
        "text": {"type": "text"},
        "title": {"type": "text", "fields": {"keyword": {"type": "keyword"}}},
        "embedding": {
            "type": "dense_vector",
            "dims": EMBEDDING_DIM,
            "index": True, # Ensure the vector is indexed for kNN search
            "similarity": "cosine" # Use cosine similarity for distance calculation
        }
    }
}
# Elasticsearch needs no index-level kNN setting here;
# "index.knn" is a setting of the OpenSearch k-NN plugin.

# Create the index with the specified mapping
# (this demo deletes any existing index with the same name first)
if es.indices.exists(index=index_name):
    es.indices.delete(index=index_name)
    print(f"Deleted existing index: {index_name}")

es.indices.create(index=index_name, mappings=index_mapping)
print(f"Created index: {index_name} with mapping.")

Note on OpenSearch: Elasticsearch 8.x and later support kNN search natively through the `dense_vector` field. OpenSearch instead provides vector search through its k-NN plugin, which uses a `knn_vector` field type and requires the `index.knn` setting to be enabled on the index. The mapping therefore looks different for OpenSearch, with the kNN method defined within the `knn_vector` field itself.


// OpenSearch k-NN plugin index definition (example, might vary slightly by version)
{
  "settings": {
    "index": {"knn": true} // Enable k-NN for this index
  },
  "mappings": {
    "properties": {
      "text": {"type": "text"},
      "embedding": {
        "type": "knn_vector",
        "dimension": 384, // Must match EMBEDDING_DIM
        "method": {
          "name": "hnsw",
          "space_type": "l2", // or "cosinesimil"; supported values depend on engine and version
          "engine": "faiss", // or "nmslib"
          "parameters": {
            "ef_construction": 100,
            "m": 16
          }
        }
      }
    }
  }
}

For simplicity, I’ll stick to the Elasticsearch 8+ native `dense_vector` field for the rest of the examples. OpenSearch does not provide a `dense_vector` field type, so if you’re on OpenSearch, use the `knn_vector` mapping above and adapt the queries to the k-NN plugin’s syntax.

Step 3: Index Documents with Embeddings

Now, let’s create some sample documents, generate their embeddings, and index them.


import uuid

documents = [
    {"title": "The Future of AI", "text": "Artificial intelligence is rapidly transforming industries, from healthcare to finance. Machine learning algorithms are at its core."},
    {"title": "Renewable Energy Solutions", "text": "Solar panels and wind turbines are key components of sustainable energy. Green technologies are crucial for combating climate change."},
    {"title": "Healthy Eating Habits", "text": "A balanced diet, rich in fruits and vegetables, is essential for maintaining good health. Avoid processed foods for better well-being."},
    {"title": "Deep Learning Architectures", "text": "Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) are fundamental in deep learning. Transformers are the latest breakthrough."},
    {"title": "Space Exploration Missions", "text": "NASA's Mars rovers and SpaceX's Starship are pushing the boundaries of human exploration beyond Earth."},
    {"title": "The Impact of Climate Change", "text": "Rising global temperatures and extreme weather events are consequences of climate change. Urgent action is needed to reduce carbon emissions."}
]

# Index each document
for doc in documents:
    # Generate embedding for the document text
    doc_embedding = embedding_model.encode(doc["text"]).tolist() # Convert numpy array to list for JSON serialization
    
    # Prepare the document for indexing
    doc_to_index = {
        "title": doc["title"],
        "text": doc["text"],
        "embedding": doc_embedding
    }
    
    # Index the document
    response = es.index(index=index_name, id=str(uuid.uuid4()), document=doc_to_index)
    print(f"Indexed document: {doc['title']} with ID {response['_id']}")

es.indices.refresh(index=index_name) # Ensure all documents are searchable
print("All documents indexed and refreshed.")
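
Indexing one document per request is fine for a demo, but larger volumes are usually sent through the bulk API. The sketch below only builds the bulk actions, so it runs without a cluster. `build_bulk_actions` and `fake_embed` are hypothetical names introduced here, and `fake_embed` is a stand-in for the real model call. The commented lines at the end show how the actions would be passed to `elasticsearch.helpers.bulk`, assuming the `es` client, `embedding_model`, `documents`, and `index_name` from the earlier steps.

```python
import uuid
from typing import Callable, Dict, Iterable, Iterator, List


def build_bulk_actions(
    docs: Iterable[Dict[str, str]],
    embed: Callable[[str], List[float]],
    index_name: str,
) -> Iterator[Dict]:
    """Yield one bulk 'index' action per document, with its embedding attached."""
    for doc in docs:
        yield {
            "_index": index_name,
            "_id": str(uuid.uuid4()),
            "_source": {
                "title": doc["title"],
                "text": doc["text"],
                "embedding": embed(doc["text"]),
            },
        }


def fake_embed(text: str) -> List[float]:
    """Stand-in embedder (assumption): replace with the real model in practice."""
    return [float(len(text)), 0.0, 1.0]


sample_docs = [
    {"title": "Doc A", "text": "AI"},
    {"title": "Doc B", "text": "Solar"},
]
actions = list(build_bulk_actions(sample_docs, fake_embed, "semantic_documents"))
print(len(actions), "actions prepared")

# With a live cluster and the real model, the call would look like this:
# from elasticsearch import helpers
# helpers.bulk(es, build_bulk_actions(
#     documents, lambda text: embedding_model.encode(text).tolist(), index_name))
```

Because the function is a generator, documents are embedded lazily as the bulk helper consumes them, which keeps memory use flat for large inputs.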

Step 4: Perform Semantic Search (kNN Search)

Now, we can query our index using semantic similarity. We’ll take a user query, generate its embedding, and then use Elasticsearch’s kNN search to find the most similar documents. (The request syntax below is Elasticsearch’s; OpenSearch’s k-NN plugin uses a different query shape.)


def semantic_search(query_text, k=2):
    query_embedding = embedding_model.encode(query_text).tolist()

    search_body = {
        "knn": {
            "field": "embedding",
            "query_vector": query_embedding,
            "k": k, # Number of nearest neighbors to return
            "num_candidates": 100 # Number of approximate nearest neighbors to consider
        }
    }

    response = es.search(index=index_name, body=search_body, source=["title", "text"])
    
    print(f"\nSemantic Search Results for: '{query_text}'")
    for hit in response['hits']['hits']:
        print(f"  Score: {hit['_score']:.4f}, Title: {hit['_source']['title']}")
        print(f"    Text: {hit['_source']['text'][:100]}...") # Print first 100 chars
    return response

# Test semantic queries
semantic_search("What are some ways to protect our planet?", k=2)
semantic_search("Tell me about recent advancements in machine learning.", k=2)
semantic_search("How to eat healthy?", k=2)

Notice how the queries retrieve documents based on their meaning, not just exact keyword matches. For “What are some ways to protect our planet?”, it might return “Renewable Energy Solutions” and “The Impact of Climate Change” because they are semantically related, even if the word “planet” wasn’t explicitly mentioned in the document text.

Step 5: Perform Hybrid Search (Semantic + Lexical)

Let’s combine semantic search with traditional keyword search using a boolean query. This allows us to ensure that if a document has exact keyword matches, it still gets a chance to be highly ranked, while also benefiting from semantic relevance.


def hybrid_search(query_text, k=2, num_candidates=100):
    query_embedding = embedding_model.encode(query_text).tolist()

    search_body = {
        "query": {
            "bool": {
                "should": [
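                    { # Semantic search component (the `knn` query clause needs a recent 8.x release)
                        "knn": {
                            "field": "embedding",
                            "query_vector": query_embedding,
                            # `k` is left out: "size" below caps the results, and
                            # older 8.x releases reject `k` inside a knn query clause
                            "num_candidates": num_candidates,
                            "boost": 0.7 # Give semantic search a boost, adjust as needed
                        }
                    },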
                    { # Lexical search component
                        "match": {
                            "text": {
                                "query": query_text,
                                "boost": 0.3 # Give lexical search a boost, adjust as needed
                            }
                        }
                    }
                ]
            }
        },
        "size": k # Return top 'k' results after combining
    }

    response = es.search(index=index_name, body=search_body, source=["title", "text"])

    print(f"\nHybrid Search Results for: '{query_text}'")
    for hit in response['hits']['hits']:
        print(f"  Score: {hit['_score']:.4f}, Title: {hit['_source']['title']}")
        print(f"    Text: {hit['_source']['text'][:100]}...")
    return response

# Test hybrid queries
hybrid_search("space travel latest news", k=2)
hybrid_search("AI breakthroughs in medicine", k=2)
hybrid_search("how to eat healthy, good diet", k=2)

The `boost` parameter in the boolean query is crucial for tuning the balance between semantic and lexical relevance. You might need to experiment with these values based on your specific use case and data.
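
To see why the boosts need tuning, it helps to look at the arithmetic. The sketch below assumes the engine sums the boosted scores of the matching `should` clauses, and uses made-up scores: a cosine-based kNN score that stays between 0 and 1, and a BM25 score that has no fixed upper bound. `combined_score` is a helper written for this illustration only.

```python
def combined_score(
    knn_score: float,
    bm25_score: float,
    knn_boost: float = 0.7,
    bm25_boost: float = 0.3,
) -> float:
    """Weighted sum of the two clause scores, mirroring boosted `should` clauses."""
    return knn_boost * knn_score + bm25_boost * bm25_score


# Made-up scores: doc_a is a strong semantic match with no keyword overlap,
# doc_b is a weaker semantic match with a high BM25 score.
doc_a = combined_score(knn_score=0.95, bm25_score=0.0)
doc_b = combined_score(knn_score=0.60, bm25_score=8.0)

print(f"doc_a: {doc_a:.3f}")
print(f"doc_b: {doc_b:.3f}")
```

Even with the smaller boost, the lexical clause dominates in this example because its raw scores live on a larger scale. That is the main reason the weights have to be tuned against real queries, or replaced by a rank-based method such as RRF.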

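For more sophisticated hybrid approaches, especially when dealing with many results from different sources, explore Reciprocal Rank Fusion (RRF). Recent Elasticsearch releases include built-in RRF support, and recent OpenSearch releases offer hybrid queries whose scores are combined in a search pipeline; availability and syntax depend on your version and license, so check the documentation for your cluster. (The `rank_eval` API, despite its name, evaluates ranking quality against labeled judgments; it does not fuse result lists.) Where built-in support is unavailable, RRF is commonly done post-query by combining results from separate semantic and keyword queries in your application layer.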
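
An application-layer version of RRF is short enough to sketch in full. Each document receives 1 / (rank_constant + rank) from every list it appears in, and the sums decide the final order. The constant 60 is a commonly used default, the document IDs are invented, and ties are broken alphabetically here purely to make the output deterministic.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(
    rankings: Sequence[Sequence[str]],
    rank_constant: int = 60,
) -> List[str]:
    """Fuse several ranked lists of document IDs into one list using RRF."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (rank_constant + rank)
    # Highest fused score first; alphabetical order breaks exact ties.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists from a semantic query and a keyword query.
semantic_hits = ["d1", "d2", "d3"]
lexical_hits = ["d3", "d1", "d4"]

fused = reciprocal_rank_fusion([semantic_hits, lexical_hits])
print(fused)
```

Documents that appear in both lists rise to the top, while documents found by only one method still make it into the fused list.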

Scaling Semantic Search: Challenges and Solutions

Building a prototype is one thing; scaling it to handle millions or billions of documents and thousands of queries per second is another. Semantic search introduces unique scaling challenges.

Scaling Model Inference (Embedding Generation)

Generating embeddings for large volumes of text (during indexing) and for every user query (during search) can be computationally intensive, especially with large Transformer models.

Batching: Process multiple texts simultaneously during inference. Modern NLP libraries and hardware are optimized for batch operations.
Hardware Acceleration: Utilize GPUs (e.g., NVIDIA A100/H100) for significantly faster inference. For smaller models or lower throughput, modern CPUs with AVX512 support can also be effective.
Optimized Runtimes: Convert models to formats like ONNX or use specialized runtimes like NVIDIA TensorRT or Intel OpenVINO. This can drastically reduce inference latency and increase throughput.
Dedicated Microservice: Decouple the embedding generation into a dedicated microservice. This allows independent scaling and resource allocation for inference. Use frameworks like FastAPI or Flask with a Gunicorn/Uvicorn server for deployment.
Model Quantization/Distillation: For deployment on resource-constrained environments or to reduce latency, consider using smaller, distilled versions of models (e.g., DistilBERT) or quantizing models to lower precision (e.g., int8).
Caching: Cache embeddings for frequently queried terms or documents, especially if your query pattern has high overlap.
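
Two of the techniques above, batching and caching, can be sketched with the standard library alone. `embed_uncached` is a stand-in for the real model call and its output is meaningless; `batched` and `embed_query` are hypothetical helper names. With `sentence-transformers`, batching is also available directly, since `encode` accepts a list of texts and a `batch_size` argument.

```python
from functools import lru_cache
from typing import Iterator, List, Tuple


def batched(items: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


call_count = 0  # counts how often the "expensive" model call actually runs


def embed_uncached(text: str) -> Tuple[float, ...]:
    """Stand-in for a real model call (assumption); returns a dummy vector."""
    global call_count
    call_count += 1
    return (float(len(text)), float(text.count(" ")))


@lru_cache(maxsize=10_000)
def embed_query(text: str) -> Tuple[float, ...]:
    """Cache query embeddings so repeated queries skip the model call."""
    return embed_uncached(text)


batches = list(batched(["a", "b", "c", "d", "e"], batch_size=2))
print(batches)

embed_query("how to eat healthy")
embed_query("how to eat healthy")  # served from the cache
print("model calls:", call_count)
```

The cache returns tuples rather than lists so that cached values cannot be mutated by callers. In a multi-process deployment, a shared cache such as Redis would play the same role.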

Scaling Elasticsearch/OpenSearch for Vector Search

Storing and searching billions of high-dimensional vectors efficiently requires careful planning and tuning of your Elasticsearch/OpenSearch cluster.

Hardware:
- CPU: kNN search is CPU-intensive. Nodes performing kNN search should have powerful CPUs with many cores.
- RAM: HNSW (Hierarchical Navigable Small World) graphs, often used for kNN, are memory-intensive. Ensure your nodes have ample RAM to keep the HNSW graph in memory for optimal performance.
- SSD: Fast NVMe SSDs are crucial for overall cluster performance, even if the primary vector index fits in RAM.
Sharding and Replicas:
- Distribute your vector index across multiple shards and nodes. This allows for parallel processing of kNN queries.
- Use replicas for high availability and to serve read requests, further distributing the load.
kNN Configuration (HNSW):
- `num_candidates` (or `ef_search` in OpenSearch k-NN plugin): Controls the trade-off between search accuracy and speed. Higher values increase accuracy but decrease speed. Tune this based on your latency and recall requirements.
- `m` (for indexing): Controls the number of connections in the HNSW graph. Higher `m` improves recall but increases index size and build time.
- `ef_construction` (for indexing): The size of the candidate list used while building the HNSW graph. Higher values improve graph quality and recall but slow down indexing.
- Space Type: Choose the appropriate similarity metric (`cosine`, `l2`, `dot_product`) that matches how your embeddings were trained.
Dedicated kNN Nodes: For very large clusters, consider dedicating specific nodes to handle kNN queries, separating them from nodes primarily handling keyword searches or data ingestion.
Indexing Performance: Indexing millions of vectors can be slow. Use bulk indexing APIs, optimize refresh intervals, and ensure your embedding service can keep up with the data ingestion rate.
Monitoring: Continuously monitor CPU, memory, and disk I/O on your data nodes. Pay attention to kNN specific metrics like query latency and recall.
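
For a first sizing pass on RAM, a back-of-the-envelope estimate is often enough. The sketch below uses the rule of thumb given in the OpenSearch k-NN documentation for HNSW, roughly 1.1 * (4 * dimension + 8 * M) bytes per vector. It assumes float32 vectors, ignores replicas and everything else on the node, and should be treated as a rough guide for Elasticsearch as well. The function name is a hypothetical helper.

```python
def estimate_hnsw_memory_bytes(num_vectors: int, dims: int, m: int = 16) -> float:
    """Rough HNSW memory estimate: 1.1 * (4 * dims + 8 * m) bytes per vector."""
    if num_vectors < 0 or dims <= 0 or m <= 0:
        raise ValueError("num_vectors must be >= 0, dims and m must be positive")
    bytes_per_vector = 1.1 * (4 * dims + 8 * m)
    return num_vectors * bytes_per_vector


# One million 384-dimensional vectors (the all-MiniLM-L6-v2 size) with m=16.
estimate = estimate_hnsw_memory_bytes(num_vectors=1_000_000, dims=384, m=16)
print(f"{estimate / 1024**3:.2f} GiB per copy of the index")
```

Multiply the result by the number of copies (primary plus replicas) and leave headroom for the JVM heap and the operating system’s file cache.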

Data Volume and Updates

Chunking Strategy: For very long documents, chunking them into smaller, semantically coherent paragraphs or sentences is crucial. Each chunk gets its own embedding. This improves search relevance but increases the number of documents to index.
Delta Updates: Instead of re-indexing entire documents on minor changes, aim for delta updates where only modified parts are re-embedded and re-indexed.
Garbage Collection: Regularly clean up stale or deleted embeddings.
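
Delta updates and cleanup can both be driven by content hashes. In the sketch below, each chunk’s text is hashed, and only chunks whose hash changed are re-embedded, while chunks that disappeared are marked for deletion. The chunk ID scheme (`doc1#0`), the helper names, and the idea of storing the hash next to each chunk are assumptions made for illustration.

```python
import hashlib
from typing import Dict, List


def content_hash(text: str) -> str:
    """Stable fingerprint of a chunk's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_delta_update(
    stored_hashes: Dict[str, str],
    current_chunks: Dict[str, str],
) -> Dict[str, List[str]]:
    """Compare stored hashes with current text and decide what to embed or delete."""
    to_embed = sorted(
        chunk_id
        for chunk_id, text in current_chunks.items()
        if stored_hashes.get(chunk_id) != content_hash(text)
    )
    to_delete = sorted(chunk_id for chunk_id in stored_hashes if chunk_id not in current_chunks)
    return {"embed": to_embed, "delete": to_delete}


# Hashes as they might be stored alongside previously indexed chunks.
stored = {
    "doc1#0": content_hash("intro"),
    "doc1#1": content_hash("old body"),
    "doc1#2": content_hash("appendix"),
}
# The document after an edit: one chunk changed, one removed, one added.
current = {"doc1#0": "intro", "doc1#1": "new body", "doc1#3": "new section"}

plan = plan_delta_update(stored, current)
print(plan)
```

Only the changed and new chunks go back through the embedding service, which is usually the most expensive part of the pipeline.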

Real-World Scenarios and Use Cases

Semantic search is a game-changer across various industries and applications.

1. E-commerce Product Search

Problem: Customers use natural language, often vague, to describe products (“comfortable shoes for long walks,” “gift for a tech enthusiast”). Keyword search struggles with descriptive, non-specific queries.
Solution: Embed product descriptions, reviews, and specifications. A query like “cozy blanket for movie nights” will find blankets with terms like “soft,” “warm,” “snuggle,” even if “cozy” isn’t explicitly mentioned. Hybrid search ensures direct searches like “iPhone 15 Pro Max” still yield precise results.

2. Customer Support and Knowledge Bases

Problem: Customers ask questions in myriad ways (“My internet is slow,” “Why is my connection bad?,” “Trouble with network speed”). Support agents need to quickly find relevant articles, FAQs, or troubleshooting guides.
Solution: Embed all knowledge base articles. A semantic search can match the user’s natural language query to the most relevant solution, regardless of keyword overlap. This reduces resolution time and improves customer satisfaction.

3. Internal Document Search and Legal Discovery

Problem: Employees spend hours sifting through internal documents, reports, or legal precedents using keyword searches that often miss crucial information.
Solution: Index all internal documentation, emails, and reports with semantic embeddings. A lawyer searching for “precedents involving intellectual property disputes in software” can find relevant cases even if the exact legal terminology varies across documents. This significantly boosts productivity and compliance.

4. Content Recommendation and Discovery

Problem: Recommending articles, videos, or news based on simple tags or explicit keywords leads to narrow, repetitive suggestions.
Solution: Embed content summaries or full text. Users who enjoy an article about “sustainable farming practices” can be recommended others about “organic agriculture,” “eco-friendly food production,” or “environmental conservation,” leading to richer discovery.

5. Q&A Systems and Chatbots

Problem: Basic chatbots struggle to answer nuanced questions or understand context outside of predefined rules.
Solution: Integrate semantic search into the chatbot’s retrieval mechanism. When a user asks a question, the bot embeds it and searches a knowledge base for the most semantically similar question-answer pair or relevant document to generate a response.

Advanced Considerations and Future Directions

Fine-tuning Your Transformer Model

While

Tags: elasticsearch, hybrid search, khadervali, kNN, opensearch, semantic search, transformers, vector search

Written by

Khader Vali

Senior Software Engineer specializing in cloud architecture, real-time systems, and enterprise-scale applications.
