Searching for relevant information in vast repositories of unstructured text can be a challenge. This article explains a Python-based approach to building an efficient document search system using FAISS (Facebook AI Similarity Search) as the vector database together with sentence embeddings, which is useful in applications like chatbots, document retrieval, and natural language understanding.
In this guide, we will break down how to use FAISS in combination with the sentence-transformers library to create a semantic search solution that can effectively locate related documents based on a user query. For example, this could be used in a customer support system to find the most relevant past tickets or knowledge base articles in response to a user's question.
For the complete Python source code, please visit Utsavv/VectorDBUsingFAISS. This repository provides a comprehensive guide to using Facebook AI Similarity Search (FAISS) for efficient vector database management.
I am going to make a few assumptions. This article focuses on explaining and implementing embeddings and vector databases, so I assume the reader has a basic understanding of Python, the concept of RAG (Retrieval-Augmented Generation), and LLMs (Large Language Models).
Introduction to Embeddings and Vector DB
Embeddings are like special codes that turn words into numbers. Think of words as different puzzle pieces, and embeddings are like a map that shows where each piece fits best. When words mean almost the same thing, their embeddings are like pieces that fit together snugly. This helps computers understand not just what words say, but what they really mean when we use them in sentences.
For example, let's take the sentence 'The cat chased the mouse.' Each word in this sentence, like 'cat' and 'mouse,' gets transformed into a set of numbers that describe its meaning. These numbers help a computer quickly find sentences with similar meanings, like 'The dog chased the rat,' even if the words are different.
Vector databases store these embeddings in an efficient way. Once the sentences in our documents have been translated into numbers, they are organized in a specialized index so the computer can quickly retrieve entries whose meanings are close to a given query, such as finding 'The dog chased the rat' when searching for 'The cat chased the mouse,' even though different words are used.
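To make this concrete, here is a minimal sketch (not part of the article's main pipeline) that uses the same sentence-transformers library, assuming the all-MiniLM-L6-v2 model introduced later, to compare two related sentences against an unrelated one:

```python
from sentence_transformers import SentenceTransformer, util

# Minimal sketch: semantically similar sentences end up with similar embeddings.
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

emb1 = model.encode("The cat chased the mouse.", convert_to_tensor=True)
emb2 = model.encode("The dog chased the rat.", convert_to_tensor=True)
emb3 = model.encode("Quarterly revenue grew by five percent.", convert_to_tensor=True)

# Cosine similarity is higher for the two related sentences than for the unrelated one.
print(util.cos_sim(emb1, emb2).item())  # relatively high
print(util.cos_sim(emb1, emb3).item())  # noticeably lower
```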
Implementation Objective
In production applications, documentation is often extensive, and finding information on a specific topic can be challenging because it is scattered across various documents. This article will demonstrate how a user's question is searched within a text file and how the vector database retrieves the closest possible matches. Searching a vector DB is incredibly powerful for applications like Q&A systems, recommendations, or any context where finding relevant information quickly is important.
To mimic this scenario, I have created a documentation text file. This article will show you how to search for information within this file. Although a simple text file is used here, the same approach can be applied to PDFs as well.
To make this example more realistic, I used the SAP rule engine documentation available at SAP Help Portal and compiled it into a single documentation text file. The text file used in this demonstration is attached to the article and can also be found in the GitHub repository.
High Level Flow
The system works in the following steps:
- Load text documents.
- Convert documents into vector embeddings using a Sentence Transformer model.
- Store these embeddings in a FAISS index for efficient similarity search.
- Query the index with user input to retrieve the most relevant documents.
Overview of the Components
Our solution is composed of the following components:
- Sentence Transformers for Embeddings: We use a pre-trained model from the sentence-transformers library to convert textual documents into numerical representations (embeddings).
- FAISS for Similarity Search: FAISS, developed by Facebook AI, is used to index these embeddings and perform fast similarity searches on them. This is particularly useful when dealing with large numbers of documents.
Setup
To get started, let's set up the Python environment. Here's the list of dependencies you'll need to install:
```bash
conda install pytorch::faiss-cpu
conda install conda-forge::sentence-transformers
conda install numpy
```
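If you prefer pip over conda, equivalent packages are also published on PyPI (faiss-cpu, sentence-transformers, numpy); a minimal alternative setup would look like this:

```bash
pip install faiss-cpu sentence-transformers numpy
```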
Importing the Required Libraries
To begin with, we will import the following libraries.
- faiss: The core library used for similarity search. FAISS enables efficient searching through large vector spaces.
- os: This module is used to interact with the file system, such as listing files in a directory.
- numpy: Used for handling vector operations and converting embeddings to numerical arrays.
- sentence_transformers: Provides pre-trained models to convert sentences into dense vector embeddings. These embeddings are used to determine semantic similarity between sentences.
```python
import faiss
import os
import numpy as np
from sentence_transformers import SentenceTransformer
```
Defining the Embedding Model and Document Loader Class
The EmbeddingModel class takes care of the foundational steps: loading the text, obtaining the model, and preparing the text for the next stages of intelligent search using FAISS. By managing both document handling and model initialization, this class makes the rest of the process (generating embeddings, setting up the FAISS index, and performing vector searches) smoother and more efficient. It also exposes a get_model class method so that other parts of the code, such as the search function later on, can obtain the embedding model without constructing a full EmbeddingModel instance.
The EmbeddingModel class initializes with two main attributes:
- document_path: Path to the documents that need to be processed.
- model_name: Specifies which pre-trained model to use. Here, it uses the all-MiniLM-L6-v2 model from sentence transformers.
self.model loads the pre-trained embedding model using the get_model method.
```python
class EmbeddingModel:
    def __init__(self, document_path, model_name='sentence-transformers/all-MiniLM-L6-v2'):
        self.document_path = document_path
        self.model_name = model_name
        self.model = self.get_model(model_name)
```
Loading Text Documents
The load_texts method reads every file in the document directory and collects the contents into a list of texts that can be used later. Each file becomes one entry in the list, which keeps the data manageable and ready for the next steps, like generating embeddings and performing intelligent searches or analyses.
```python
def load_texts(self):
    texts = []
    for filename in os.listdir(self.document_path):
        with open(os.path.join(self.document_path, filename), 'r', encoding='utf-8') as file:
            texts.append(file.read())
    return texts
```
Model Loading Method
get_model is a class method used to load the specified pre-trained embedding model. It uses SentenceTransformer from the sentence_transformers library to get the model instance. The embedding model turns text into numerical vectors, which are crucial for similarity search.
```python
@classmethod
def get_model(cls, model_name='sentence-transformers/all-MiniLM-L6-v2'):
    return SentenceTransformer(model_name)
```
Embedding Generation
generate_embeddings takes in a list of texts and converts each text into a dense vector representation.
The encode function from the SentenceTransformer model converts text into numerical embeddings. Setting convert_to_numpy=True makes the embeddings easy to use with FAISS and NumPy.
```python
def generate_embeddings(self, texts):
    return self.model.encode(texts, convert_to_numpy=True)
```
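One detail worth keeping in mind: all-MiniLM-L6-v2 produces 384-dimensional vectors, and that dimensionality is what gets passed to FAISS later as embedding_dim. A quick sanity check, assuming doc_texts holds the texts returned by load_texts, might look like this:

```python
# Assuming doc_texts is the list returned by embedding_model.load_texts()
embeddings_np = embedding_model.generate_embeddings(doc_texts)
print(embeddings_np.shape)  # (number_of_documents, 384) for all-MiniLM-L6-v2
```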
Creating and Training the FAISS Index
The create_faiss_index method is a crucial step when building an efficient search engine for large volumes of text data. In simple terms, it helps us organize and store the embeddings generated from our text in a way that allows for fast and effective searching.
Here’s how it works:
- Prepare the Embeddings: After the text is converted into embeddings (numerical representations), these embeddings need to be grouped together in a format that makes searching easy and quick. The create_faiss_index method takes care of this by using a tool called FAISS, which is designed for fast similarity searches.
- Set Up the Index: The method creates an index with FAISS that allows us to perform vector-based searches. Essentially, it takes the embeddings and organizes them into a structure that makes it easy to find similar items quickly. The method uses a clustering approach to make searches more efficient, especially when dealing with a large number of embeddings.
- Train and Add Embeddings: For FAISS to work effectively, the index needs to be trained. This training helps the index understand how the embeddings are distributed. Once trained, the embeddings are added to the index, making them ready for future searches.
The create_faiss_index method is like creating a special kind of "map" of all our text embeddings. This map helps us quickly find which pieces of text are similar to each other, which is incredibly useful for building things like search engines or recommendation systems. By using FAISS, we ensure that even with a massive amount of data, our searches remain fast and efficient. The arguments and calls used in this method are explained below:
- embedding_dim is the dimensionality of the embedding vectors.
- nlist is the number of clusters for partitioning the dataset during search.
- IndexFlatL2 is a simple index that computes L2 distances.
- IndexIVFFlat is used for faster searching by clustering the embeddings.
- train(embeddings_np) prepares the FAISS index to handle the vector space represented by the embeddings.
- add(embeddings_np) adds all vectors to the index for similarity search.
```python
def create_faiss_index(self, embeddings_np, embedding_dim, nlist=10):
    quantizer = faiss.IndexFlatL2(embedding_dim)
    index = faiss.IndexIVFFlat(quantizer, embedding_dim, nlist, faiss.METRIC_L2)
    index.train(embeddings_np)
    index.add(embeddings_np)
    return index
```
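One practical caveat: FAISS k-means training expects at least as many training vectors as clusters, and it warns when the training set is small relative to nlist. If your document set is tiny, a defensive variant (my own addition, not part of the original code) is to clamp nlist before building the index:

```python
import faiss
import numpy as np

def create_faiss_index_safe(embeddings_np: np.ndarray, embedding_dim: int, nlist: int = 10):
    # Hypothetical variant of create_faiss_index: keep the number of clusters
    # no larger than the number of available vectors before training.
    nlist = max(1, min(nlist, len(embeddings_np)))
    quantizer = faiss.IndexFlatL2(embedding_dim)
    index = faiss.IndexIVFFlat(quantizer, embedding_dim, nlist, faiss.METRIC_L2)
    index.train(embeddings_np)
    index.add(embeddings_np)
    return index
```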
Saving the FAISS Index
This method saves the trained FAISS index to a file (faiss_index.bin) for later use, which can speed up future searches.
```python
def save_index(self, index, index_path='faiss_index.bin'):
    faiss.write_index(index, index_path)
```
Loading or Creating FAISS Index Dynamically
This is a singleton class that ensures only one instance of the FAISS index is loaded. The get_index method checks if a saved index exists and loads it. If an index does not exist, it creates and trains a new one, then adds embeddings.
```python
class FaissIndex:
    _index_instance = None

    @classmethod
    def get_index(cls, index_path='faiss_index.bin', embeddings_np=None,
                  embedding_dim=None, nlist=10):
        if cls._index_instance is None:
            if os.path.exists(index_path):
                cls._index_instance = faiss.read_index(index_path)
                print("FAISS index loaded successfully.")
            else:
                # Building a fresh index requires the embeddings and their dimensionality,
                # so they are passed in as parameters when no saved index exists.
                quantizer = faiss.IndexFlatL2(embedding_dim)
                cls._index_instance = faiss.IndexIVFFlat(quantizer, embedding_dim, nlist)
                if not cls._index_instance.is_trained:
                    cls._index_instance.train(embeddings_np)
                cls._index_instance.add(embeddings_np)
                print("FAISS index created and loaded successfully.")
        return cls._index_instance
```
Defining the Search Function
The search_faiss_index method is the final piece of the puzzle that makes our intelligent search system complete. In simple terms, it allows us to query our FAISS index to find the most relevant pieces of text for a given search input.
Here’s how the method works:
- Load the Text, Model and Index: Before performing a search, the method ensures that both the embedding model and the FAISS index are loaded with provided text. The model is used to convert the search query into an embedding, while the FAISS index is used to search for similar embeddings.
- Handle the Query: The input query is first checked to ensure it’s not empty. If it’s valid, the query is then converted into a numerical representation (an embedding) using the loaded model. This embedding represents the search query in the same way our text documents were represented, which allows for a fair comparison.
- Perform the Search: Once the query is embedded, the method uses FAISS to search the index for similar embeddings. The FAISS index returns the closest matches, which correspond to the pieces of text most similar to the query. index.search(query_embedding, k) finds the k most similar entries in the FAISS index; the method then retrieves the corresponding documents based on the similarity scores and returns a combined result of relevant documents, or an appropriate message if no matches are found.
- Retrieve and Present Results: After finding the closest matches, the method retrieves the actual text associated with those embeddings. These texts are then presented as the search results, providing relevant information based on the input query.
Given below are important segments of this function.
Prepare Your Documents
The first step in building an intelligent search system is preparing your documents. Gather all the text files you want to include in the search and place them in a designated folder. This folder will serve as the source for generating embeddings.
The embedding model can be initialized as follows:
```python
embedding_model = EmbeddingModel(document_path='path/to/documents')
```
Generate Embeddings
Once the documents are in place, use the `EmbeddingModel` to load the text data and generate embeddings. These embeddings represent the textual content in a numerical format, making them ready for indexing.
```python
doc_texts = embedding_model.load_texts()
embeddings_np = embedding_model.generate_embeddings(doc_texts)
```
Create and Train the FAISS Index
Pass the generated embeddings to the `create_faiss_index()` method. This step constructs a FAISS index that organizes the embeddings efficiently, enabling quick and accurate searches.
```python
faiss_index = embedding_model.create_faiss_index(embeddings_np, embedding_dim=embeddings_np.shape[1])
```
Save the FAISS Index
The following call saves the index:
```python
embedding_model.save_index(faiss_index)
```
In simple terms, the search_faiss_index method is like asking a question and letting the system find the best answers for you. It takes the search query, finds the most similar pieces of information from the indexed text, and returns them as the result. The complete function is given below.
```python
def search_faiss_index(query):
    # Load model and index
    model = EmbeddingModel.get_model()
    index = FaissIndex.get_index('faiss_index.bin')

    # Handle empty query case
    if not query.strip():
        return "Query is empty. Please provide a valid query."

    # Encode the query
    query_embedding = model.encode([query]).astype('float32')

    # Ensure k is not greater than the number of indexed embeddings
    k = 3  # Number of nearest neighbors to retrieve
    k = min(k, index.ntotal)

    # Search in the index
    distances, indices = index.search(query_embedding, k)

    # Prepare the context from the retrieved texts.
    # doc_texts is the list of documents loaded earlier via load_texts();
    # it must be available in this scope (e.g., as a module-level variable).
    retrieved_texts = []
    for idx in indices[0]:
        if 0 <= idx < len(doc_texts):
            retrieved_texts.append(doc_texts[idx])

    # Join the retrieved texts to create the answer
    answer = "\n".join(retrieved_texts) if retrieved_texts else "No relevant information found."
    return answer
```
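An optional tuning knob worth knowing about: because this is an IVF index, only a subset of clusters is visited per query. FAISS exposes this through the nprobe attribute, and raising it trades speed for recall. A short, illustrative example (the value 5 is just an illustration):

```python
# Assuming 'index' is the IVF index returned by FaissIndex.get_index():
# nprobe controls how many clusters are scanned per query (default is 1).
index.nprobe = 5  # higher values improve recall at the cost of search speed
```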
Running the Script
With the index in place, use the `search_faiss_index(query)` method to find the most relevant documents based on a user-provided query.
This example demonstrates the use of the `search_faiss_index` function to retrieve relevant information from a FAISS index for a set of predefined queries. The script begins by defining a list of queries, each focusing on a specific aspect of working with a Rule Engine. These queries address practical scenarios, such as simplifying multiple conditions for complex promotions, understanding the use of the "Container" condition and its advantages, and exploring how the "Group" condition enhances flexibility in managing rules.
For each query, the script prints the query text to provide context, followed by executing the `search_faiss_index` function to retrieve relevant results. The results are then displayed in a clear and readable format, offering insights based on the indexed documents. If no relevant information is found, the script gracefully informs the user with an appropriate message.
For instance, a query like "How can you manage and simplify multiple conditions together in the Rule Engine for complex scenarios like promotions?" might yield results highlighting the grouping capabilities of the Rule Engine, which allow modular management of logic. Similarly, a query about the "Container" condition could reveal its utility in packaging reusable rules for complex scenarios, making logic reusable across various promotions. Lastly, a query on the "Group" condition could emphasize its role in combining rules dynamically based on runtime parameters.
```python
# Example usage
if __name__ == "__main__":
    # Queries to search
    queries = [
        "How can you manage and simplify multiple conditions together in the Rule Engine for complex scenarios like promotions?",
        "When might you use the 'Container' condition in the Rule Engine, and what advantage does it provide?",
        "How does the 'Group' condition enhance flexibility when working with conditions in the Rule Engine?"
    ]

    # Print results for each query
    for query in queries:
        print(f"\n\n\nQuery: {query}")
        search_result = search_faiss_index(query)
        print(f"\n\tResults:{search_result}")
```
Output
```
Query: How can you manage and simplify multiple conditions together in the Rule Engine for complex scenarios like promotions?
FAISS index loaded successfully.

Results:The 'Container' condition in the Rule Engine allows users to group multiple conditions together. Actions can then be created that reference the entire container, making it easier to manage complex rules. This is especially useful in scenarios like partner-product promotions where multiple conditions need to be evaluated together.
Yes, in the Rule Engine, you can change the default logical operator between conditions from AND to OR by using the 'Group' condition. This allows for greater flexibility in specifying how different conditions are combined within a rule.
The 'Group' condition in the Rule Engine allows users to change the logical operator between conditions from the default AND to OR. This provides more flexibility in how conditions are evaluated within a rule, allowing users to create more sophisticated and varied decision-making criteria.

Query: When might you use the 'Container' condition in the Rule Engine, and what advantage does it provide?

Results:The 'Container' condition in the Rule Engine allows users to group multiple conditions together. Actions can then be created that reference the entire container, making it easier to manage complex rules. This is especially useful in scenarios like partner-product promotions where multiple conditions need to be evaluated together.
The Rule Engine includes several 'Out of the Box' conditions by default, such as 'Rule executed,' 'Group,' and 'Container.' The 'Rule executed' condition allows for the creation of dependencies between rules. The 'Group' condition helps in changing logical operators between rules from AND to OR, and the 'Container' condition allows grouping other conditions to reference them collectively.
The 'Group' condition in the Rule Engine allows users to change the logical operator between conditions from the default AND to OR. This provides more flexibility in how conditions are evaluated within a rule, allowing users to create more sophisticated and varied decision-making criteria.

Query: How does the 'Group' condition enhance flexibility when working with conditions in the Rule Engine?

Results:The 'Group' condition in the Rule Engine allows users to change the logical operator between conditions from the default AND to OR. This provides more flexibility in how conditions are evaluated within a rule, allowing users to create more sophisticated and varied decision-making criteria.
The Rule Engine includes several 'Out of the Box' conditions by default, such as 'Rule executed,' 'Group,' and 'Container.' The 'Rule executed' condition allows for the creation of dependencies between rules. The 'Group' condition helps in changing logical operators between rules from AND to OR, and the 'Container' condition allows grouping other conditions to reference them collectively.
The 'Container' condition in the Rule Engine allows users to group multiple conditions together. Actions can then be created that reference the entire container, making it easier to manage complex rules. This is especially useful in scenarios like partner-product promotions where multiple conditions need to be evaluated together.
```
Advantages and Disadvantages of FAISS
Understanding the advantages and disadvantages of FAISS is crucial for determining if it is the right solution for your needs, especially when considering factors like scalability, performance, and ease of implementation.
- Advantages: FAISS is highly optimized for both CPU and GPU, making it capable of handling extremely large datasets efficiently. It supports multiple index types, which provides flexibility for different use cases. Its scalability is particularly suitable for enterprise-level solutions.
- Disadvantages: Compared to other solutions, FAISS may require more configuration and tuning to achieve optimal results, and its memory consumption can be relatively high, especially for large datasets. Additionally, setting up GPU acceleration can be complex for some users.
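To illustrate the GPU point, with the faiss-gpu build installed a CPU index can typically be moved onto the available GPUs. The following is a rough sketch under that assumption, not part of this article's pipeline:

```python
import faiss
import numpy as np

# Assumes the faiss-gpu build is installed and at least one GPU is available.
cpu_index = faiss.read_index('faiss_index.bin')
gpu_index = faiss.index_cpu_to_all_gpus(cpu_index)  # replicate the index across all GPUs

# Illustrative query: a random 384-dimensional vector (matching all-MiniLM-L6-v2).
query_vec = np.random.rand(1, 384).astype('float32')
distances, indices = gpu_index.search(query_vec, 3)
```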
Different Types of Indexes Supported by FAISS
For simplicity, I chose straightforward index types: IndexIVFFlat built on a plain IndexFlatL2 quantizer. However, there are other indexing options available, which you can select based on the specific requirements of your use case.
| Type of Index | Explanation | Advantages | Disadvantages |
|---|---|---|---|
| IndexFlatL2 | A flat (brute-force) index that computes exact distances. | Simple to implement and exact results. | Not scalable for large datasets due to linear-time queries. |
| IndexIVFFlat | Inverted file with flat vectors, good for large datasets. | Faster search times on large datasets. | Requires training and may reduce accuracy slightly. |
| IndexIVFPQ | Combines IVF with Product Quantization for compression. | Reduced memory usage and faster searches. | More complex configuration and can affect recall. |
| HNSW | Hierarchical Navigable Small World graph for quick searches. | High accuracy and fast retrieval. | Memory intensive, and building the index can be slow. |
| IndexPQ | Uses Product Quantization without IVF; ideal when memory usage is a primary concern. Offers good search performance. | Low memory consumption. | Slower compared to other indexing methods with IVF. |
To summarize, IndexFlatL2 is best for smaller datasets due to its simplicity, while IndexIVFFlat and IndexIVFPQ are more suitable for medium to large datasets, providing a good balance between speed and memory usage. HNSW is ideal for scenarios requiring high accuracy and fast retrieval, whereas IndexPQ is useful when minimizing memory consumption is the primary concern.
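To make the table concrete, here is a hedged sketch of how these index types are typically constructed with the FAISS Python API; the parameter values (nlist, m, nbits, and the HNSW neighbor count) are illustrative rather than tuned recommendations:

```python
import faiss

d = 384          # embedding dimensionality (all-MiniLM-L6-v2)
nlist = 100      # number of IVF clusters (illustrative)
m, nbits = 8, 8  # product quantization: 8 sub-vectors, 8 bits each (illustrative)

# Exact, brute-force search
flat_index = faiss.IndexFlatL2(d)

# IVF with flat storage (requires training on representative vectors)
quantizer = faiss.IndexFlatL2(d)
ivf_flat = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_L2)

# IVF + product quantization for compressed storage (also requires training)
pq_quantizer = faiss.IndexFlatL2(d)
ivf_pq = faiss.IndexIVFPQ(pq_quantizer, d, nlist, m, nbits)

# HNSW graph index; 32 is the number of graph neighbors per node (illustrative)
hnsw = faiss.IndexHNSWFlat(d, 32)

# Product quantization without IVF
pq = faiss.IndexPQ(d, m, nbits)
```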
Conclusion
This solution provides a scalable approach to searching large volumes of text efficiently by combining sentence embeddings and FAISS. It highlights the power of semantic search over simple keyword matching by considering the meaning of the query in finding related documents.
Using FAISS and sentence transformers together allows us to handle large datasets with good performance, providing relevant results to user queries. Such a setup is especially beneficial for applications involving document retrieval, chatbots, and any solution requiring similarity-based matching.
I encourage readers to experiment with different embedding models and FAISS index types to optimize the solution for specific use cases. Additionally, feel free to adjust the batch sizes or index parameters to best suit the characteristics of your dataset. For more details, I suggest going through facebookresearch/faiss: A library for efficient similarity search and clustering of dense vectors.
FAISS is not the only vector DB option. Other options beyond FAISS include Annoy (by Spotify), ScaNN (by Google), and HNSWlib. Each of these libraries offers unique benefits. For instance, Annoy is known for its simplicity and speed, while ScaNN is optimized for Google-scale workloads, and HNSWlib provides excellent accuracy due to its hierarchical navigable small world graphs.
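As a point of comparison, here is a minimal, illustrative HNSWlib sketch; the parameters (ef_construction, M, ef) are common starting points rather than tuned values, and the random data stands in for real embeddings:

```python
import hnswlib
import numpy as np

dim = 384                                            # matches all-MiniLM-L6-v2 embeddings
data = np.random.rand(1000, dim).astype('float32')   # stand-in for real document embeddings

# Build an HNSW index over L2 distance
index = hnswlib.Index(space='l2', dim=dim)
index.init_index(max_elements=len(data), ef_construction=200, M=16)
index.add_items(data)

index.set_ef(50)  # query-time accuracy/speed trade-off
labels, distances = index.knn_query(data[:1], k=3)
```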