Vectors vs. Graphs: Which Database to Choose for Building RAG…

Retrieval-Augmented Generation (RAG) optimizes how a Large Language Model (LLM) operates: besides the user's text and the prompt template, the model receives search results from an authoritative knowledge database that holds specific information beyond its training data, and it receives them before generating a response.

In this post, we will explore two types of knowledge databases to implement our RAG applications: vector databases and graph databases.

Deciding which database to use is crucial for optimizing the input information to the LLM. Although the objective of both remains the same (retrieving contextually relevant information for the user's query), each serves a unique purpose, and understanding when and how to use them can provide significant benefits in fields such as Generative AI. Besides understanding what these databases are, we will emphasize their advantages and disadvantages and highlight some use cases where they can be most effective.

Vector Databases

What Are Vector Databases?

Vectors (or embeddings in this context) are numerical representations of information generated by pre-trained embedding models. In the Generative AI domain, these models are used to transform semantic information into numerical vectors that can be processed by LLMs or by algorithms for comparison and retrieval from vector databases. Some examples of these models are: text-multilingual-embedding-002 (by Google), NV-Embed-v2 (by NVIDIA), Multilingual E5 (by Microsoft), and text-embedding-3-large (by OpenAI).

These vectors aim to encapsulate the semantic meaning of data (such as text, audio, videos, or images) in a multidimensional space. In this space, the dimensions of each vector capture various important characteristics of the data (like a word, an image, or a phrase), and all these characteristics together place the vector in a specific position in a high-dimensional space. The relationship between vectors in this space helps us understand how similar or different the data points are.
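
To make the idea of "closeness" in vector space concrete, here is a minimal sketch using cosine similarity, one common way to compare embeddings. The three-dimensional vectors are hand-made, hypothetical values; real embedding models produce hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings" (not produced by any real model).
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
invoice = [0.0, 0.1, 0.9]

print(cosine_similarity(cat, kitten))   # close to 1: similar meaning
print(cosine_similarity(cat, invoice))  # close to 0: unrelated
```

Vectors pointing in nearly the same direction score close to 1, while unrelated ones score close to 0.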

Generating vectors (embeddings) in a vector database

Today, data-driven applications handle complex and multidimensional data such as images, audio, videos, or text. These data types cannot be efficiently managed by traditional databases and search engines.

Vector databases are specialized databases designed to store, search, and manage high-dimensional data vectors. They excel at similarity search, which aims to identify elements that are close to each other in the vector space based on their numerical representations. Similarity search is crucial when comparing data points and finding the most similar ones according to certain criteria.

Apart from being stored in the database, vectors must be indexed to optimize retrieval. Vector indexing uses specialized algorithms to organize high-dimensional vectors so that searches can be performed efficiently. This organization is not random: similar vectors are grouped together, so a query only needs to examine a small part of the collection. Notable indexing techniques include the Inverted File index (IVF) and Hierarchical Navigable Small World (HNSW) graphs.
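
As a rough illustration of the IVF idea, the sketch below assigns each vector to the inverted list of its nearest centroid and, at query time, scans only the closest list or lists. It is a toy under stated assumptions: the centroids are hand-picked here (real systems usually learn them, typically with k-means), and the names build_ivf, ivf_search, and nprobe are illustrative rather than any product's API.

```python
import math
from collections import defaultdict

def build_ivf(vectors, centroids):
    """Assign each vector id to the inverted list of its nearest centroid."""
    lists = defaultdict(list)
    for vec_id, vec in vectors.items():
        nearest = min(range(len(centroids)), key=lambda c: math.dist(vec, centroids[c]))
        lists[nearest].append(vec_id)
    return lists

def ivf_search(query, vectors, centroids, lists, k=2, nprobe=1):
    """Scan only the nprobe inverted lists whose centroids are closest to the query."""
    probe = sorted(range(len(centroids)), key=lambda c: math.dist(query, centroids[c]))[:nprobe]
    candidates = [vec_id for c in probe for vec_id in lists.get(c, [])]
    return sorted(candidates, key=lambda vec_id: math.dist(query, vectors[vec_id]))[:k]

# Hypothetical 2-dimensional vectors forming two obvious clusters.
vectors = {
    "a": (0.1, 0.2), "b": (0.2, 0.1), "c": (0.15, 0.25),
    "d": (0.9, 0.8), "e": (0.8, 0.9),
}
centroids = [(0.15, 0.2), (0.85, 0.85)]  # assumed, hand-picked centroids

lists = build_ivf(vectors, centroids)
result = ivf_search((0.12, 0.22), vectors, centroids, lists, k=2)
print(result)  # ['a', 'c']
```

With nprobe=1 only one cluster is scanned, which is faster but can miss neighbors that fall in an adjacent cluster. Raising nprobe trades speed for recall.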

Examples of vector databases include:

Open source: Milvus, Chroma, Weaviate, Qdrant, PostgreSQL (with the pgvector extension), Cassandra…
Commercial: Pinecone, Vertex AI Vector Search, Azure AI Search, Amazon Kendra, Amazon OpenSearch Service, SingleStore…

Within these databases, we can identify 3 key components:

Vector storage (embeddings).
Indexing structures for these vectors (IVF or HNSW).
Similarity searches.

What Are the Advantages and Disadvantages of These Databases?

Advantages

Ideal for unstructured data. They work best with unstructured data such as text, images, audio, or videos, but also support structured data.
Efficient similarity search. High precision in retrieving semantically similar data, improving the quality of generated responses.
Excellent support for real-time applications. High-speed performance for real-time applications with fast similarity search functions.
High scalability. Can handle billions of vectors efficiently, making them suitable for large-scale applications.
Lower operational costs. They tend to cost less to operate than graph databases because they manage high-dimensional data efficiently.
Ease of use. They generally have a gentler learning curve than graph databases.

Disadvantages

Limited contextual relationships. They primarily focus on similarity matching, lacking the ability to understand complex relationships beyond proximity in vector space (and hence semantic similarity).
High memory and storage requirements. Especially when dealing with large datasets.
Less interpretable. Vectors are generated using specific models, and vector databases are less interpretable for humans due to high-dimensional numerical representations. Understanding relationships or reasoning behind retrieved information is challenging.
May return irrelevant results. Since vector databases rely solely on similarity search, they may retrieve results that are close in vector space but irrelevant or imprecise for the actual query.

What Are Some Use Cases Where These Databases Apply?

Natural Language Processing (NLP)

In NLP, words or sentences can be represented as vectors through embeddings. Vector databases directly store these representations, which reflect the meaning of the text and its connections to other text. This makes them particularly useful for tasks such as semantic search, recommendations, text categorization, machine translation, and sentiment analysis, where similarity of meaning matters more than literal word matching.

Real-life Example

In a customer support chatbot, the user might ask: "How can I change or reset my password?". The query is transformed into vectors through embeddings, and the vector database identifies semantically similar vectors from the index containing the IT documentation. This way, it provides a relevant response.

Recommendation Systems

Recommendation systems are algorithms that suggest content to users based on their preferences and previous interactions. These systems represent user preferences and content as vectors, capturing important features such as behavior or content attributes. Vector databases compare the user's preference vector with content vectors, finding the most similar ones. This allows for personalized recommendations by suggesting semantically similar content.

Real-life Example

Consider a streaming platform that tracks what a user consumes. If most of the series they watch are science fiction, it will recommend content from this genre or similar ones, as well as content consumed by other users with similar tastes.
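
A minimal sketch of this idea, assuming the user's preference vector is simply the average of the vectors of the titles they watched (one of several possible choices). The catalog, its titles, and the meaning of the three dimensions are all hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def mean_vector(vectors):
    """Component-wise average of a list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

# Hypothetical content vectors; dimensions loosely mean (sci-fi, romance, documentary).
catalog = {
    "Space Saga": [0.9, 0.1, 0.0],
    "Robot Dawn": [0.8, 0.0, 0.2],
    "Love in Paris": [0.1, 0.9, 0.0],
    "Galaxy Patrol": [0.85, 0.05, 0.1],
    "Ocean Life": [0.0, 0.1, 0.9],
}
watched = ["Space Saga", "Robot Dawn"]

# Build the user profile, then rank unseen titles by similarity to it.
profile = mean_vector([catalog[title] for title in watched])
unseen = [title for title in catalog if title not in watched]
ranked = sorted(unseen, key=lambda title: cosine(profile, catalog[title]), reverse=True)
print(ranked[0])  # Galaxy Patrol
```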

Image and Video Recognition

When data is unstructured, vector databases are naturally suited for tasks like similarity search within visual data.

Real-life Example

On e-commerce platforms, customers may want to find a pair of sneakers they saw online. By uploading a photo of the sneakers, the e-commerce platform can quickly retrieve similar products from its vast inventory. Vector databases can compare the vector of the uploaded image with those of the stored product images.

Biometrics

From facial recognition systems to fingerprint databases, biometric data is high-dimensional and requires efficient similarity search capabilities.

Real-life Example

At international airports, facial recognition systems are used for security purposes. Each person’s face is captured and converted into a vector. When someone approaches the security checkpoint, their face is compared against a vector database of known criminals or persons of interest, enabling rapid threat detection.

Drug Discovery

In the pharmaceutical field, molecules and genes can be represented as vectors. Vector databases can be used to search for similar compounds or genetic patterns.

Real-life Example

In a pharmaceutical research lab, chemical compounds are represented as high-dimensional vectors. When a promising compound is identified to treat a specific disease, the vector database can find other compounds with similar structures or properties, potentially leading to more efficient drug discovery processes.

Graph Databases

What Are Graph Databases?

A graph is a set of objects, known as nodes, connected to one another through a set of links known as edges.

Graph databases are a type of NoSQL database that stores data as nodes (entities), edges (relationships), and properties. Entities can represent objects, concepts, or real-world ideas, while relationships describe how those entities are connected. This structure allows for intuitive data modeling and complex queries based on relationships. It differs from both of the following:

The table-based format used by relational databases.
The high-dimensional space used by vector databases.

Some examples of graph databases include:

Open Source: Neo4j, OrientDB, ArangoDB, NebulaGraph, Memgraph, JanusGraph, Dgraph…
Commercial: GraphDB, Amazon Neptune, Azure Cosmos DB for Apache Gremlin, TigerGraph, InfiniteGraph…

Within these databases, we can identify 5 key components:

Nodes: Represent entities such as documents, concepts, people, etc.
Edges: Represent relationships between entities.
Graph: Nodes and edges are created from the data.
Traversal Algorithms: Use efficient graph traversal techniques to explore and retrieve connected data through relationships.
Complex Queries: Queries can explore paths and connections within the graph, allowing for information retrieval based on relationships rather than isolated data points.
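
The components above can be sketched with a tiny in-memory property graph. This is only an illustration of the concepts, not how any particular graph database is implemented. The PropertyGraph class, its method names, and the sample data are hypothetical.

```python
from collections import defaultdict, deque

class PropertyGraph:
    def __init__(self):
        self.nodes = {}                     # node id -> properties
        self.out_edges = defaultdict(list)  # node id -> [(relationship, target id)]

    def add_node(self, node_id, **props):
        self.nodes[node_id] = props

    def add_edge(self, source, relationship, target):
        self.out_edges[source].append((relationship, target))

    def neighbors_within(self, start, max_hops):
        """Breadth-first traversal returning {node id: hop count} up to max_hops."""
        seen = {start: 0}
        queue = deque([start])
        while queue:
            current = queue.popleft()
            if seen[current] == max_hops:
                continue  # do not expand beyond the hop limit
            for _, target in self.out_edges[current]:
                if target not in seen:
                    seen[target] = seen[current] + 1
                    queue.append(target)
        del seen[start]
        return seen

graph = PropertyGraph()
graph.add_node("alice", kind="Person")
graph.add_node("bob", kind="Person")
graph.add_node("acme", kind="Company")
graph.add_node("madrid", kind="City")
graph.add_edge("alice", "WORKS_AT", "acme")
graph.add_edge("alice", "KNOWS", "bob")
graph.add_edge("acme", "LOCATED_IN", "madrid")

print(graph.neighbors_within("alice", 1))  # {'acme': 1, 'bob': 1}
print(graph.neighbors_within("alice", 2))  # {'acme': 1, 'bob': 1, 'madrid': 2}
```

The query "what is connected to alice within two hops" is answered by following edges rather than by comparing isolated records.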

What Are the Advantages and Disadvantages of These Databases?

Advantages

Contextual Understanding. By representing information in the graph through a structured hierarchy (entities and relationships), they provide more comprehensive and contextual information retrieval.
Optimized for Complex Queries. Enable advanced complex queries: shortest path, cycle detection, or cluster identification.
Easier Information Updates. Allow updating information easily by modifying nodes, relationships, or properties.
Richer Context. By retrieving data based on complex relationships, they improve the depth and relevance of the generated content.
Transparency. Provide a more transparent representation of the knowledge used to generate responses. This transparency is key to explaining the reasoning behind the generated outcome.
Native Relationship Handling. Efficiently handle connections between data points without requiring expensive join operations like relational databases.
Good Performance. They perform well on relationship-intensive workloads.
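
As an example of the kind of relationship query mentioned above, here is a minimal breadth-first search for the shortest path in an unweighted graph. Graph databases expose this through their query languages. The plain adjacency dictionary and the node names below are hypothetical stand-ins.

```python
from collections import deque

def shortest_path(adjacency, start, goal):
    """BFS shortest path in an unweighted directed graph; returns a node list or None."""
    previous = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            # Walk the predecessor links back to the start.
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbor in adjacency.get(node, []):
            if neighbor not in previous:
                previous[neighbor] = node
                queue.append(neighbor)
    return None

adjacency = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D", "E"],
    "D": ["F"],
    "E": ["F"],
}
print(shortest_path(adjacency, "A", "F"))  # ['A', 'B', 'D', 'F']
```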

Disadvantages

Scalability. Although graph databases can scale, they may struggle with performance on extremely large datasets.
Complexity. Building a graph database can be a complex process and requires sophisticated entity extraction and relationship modeling techniques. Specific domain knowledge (in addition to technical knowledge) is required, and it takes time for proper design.
Higher Learning Curve. Query languages for graph databases (such as Cypher or Gremlin) can differ from standard SQL and take time to learn.
Limited Support for Unstructured Data. Graph databases are more limited in terms of the types of data they can process. They work better with structured data since relationships between entities are easier to determine.
Higher Operational Costs. They tend to incur significantly higher costs due to the rapid expansion of graph size as more data points are added.
Limitations in Real-Time Applications. They are limited when handling real-time data streams. They are usually designed to process data and update the graph in batch format.
Lack of Standardization. There are no widely adopted standards for representing and querying these databases, which can lead to interoperability issues and vendor lock-in. The adoption or development of standards can facilitate interoperability and reduce vendor dependency.

What Are Some Use Cases Where These Databases Apply?

Access Control Systems

Graph databases can help manage complex permission structures in access control systems.

Real-life Example

Companies can organize access to specific resources based on role, user type, or user privileges. With graph databases, access rights can be modeled according to these criteria, ensuring that the right people access the appropriate resources.
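
A minimal sketch of how such a permission check becomes a graph walk: user to roles, roles to inherited roles, roles to resources. The three dictionaries stand in for MEMBER_OF, INHERITS, and CAN_ACCESS edges. Those relationship names and all the data are assumptions made for illustration.

```python
def can_access(user, resource, member_of, inherits, grants):
    """Walk user -> roles -> inherited roles and look for a grant on the resource."""
    pending = list(member_of.get(user, []))
    visited = set()
    while pending:
        role = pending.pop()
        if role in visited:
            continue  # guards against cycles in the role hierarchy
        visited.add(role)
        if resource in grants.get(role, set()):
            return True
        pending.extend(inherits.get(role, []))
    return False

# Hypothetical edges of the access graph.
member_of = {"ana": ["engineer"], "luis": ["intern"]}
inherits = {"engineer": ["employee"], "intern": ["guest"]}
grants = {"employee": {"wiki"}, "engineer": {"source-code"}, "guest": {"lobby-wifi"}}

print(can_access("ana", "wiki", member_of, inherits, grants))          # True
print(can_access("luis", "source-code", member_of, inherits, grants))  # False
```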

Networks

Graph databases significantly improve network management because they can model complex network relationships intuitively and at scale.

Real-life Example

A data center can use these databases to model its network. In case of an unexpected failure of a critical server within the data center, the database will be able to quickly identify and highlight all direct physical connections to the failed server. Additionally, the database could trace which clients or external systems depend on this server.
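
The dependency tracing described above can be sketched as a traversal over reversed "depends on" edges. The component names and the impacted_by helper are hypothetical. A graph database would express the same question as a path query.

```python
from collections import defaultdict, deque

def impacted_by(failed, depends_on):
    """Return every component that directly or transitively depends on `failed`."""
    # Reverse the edges: for each component, who depends on it?
    dependents = defaultdict(set)
    for component, dependencies in depends_on.items():
        for dependency in dependencies:
            dependents[dependency].add(component)

    impacted, queue = set(), deque([failed])
    while queue:
        current = queue.popleft()
        for component in dependents[current]:
            if component not in impacted:
                impacted.add(component)
                queue.append(component)
    return impacted

# Hypothetical data center: each key depends on the listed components.
depends_on = {
    "web-frontend": ["api-server"],
    "api-server": ["db-server"],
    "reporting": ["db-server"],
    "client-portal": ["web-frontend"],
    "mail-relay": [],
}
print(sorted(impacted_by("db-server", depends_on)))
# ['api-server', 'client-portal', 'reporting', 'web-frontend']
```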

Fraud Detection and Prevention

In sectors like banking and insurance, graph databases can improve fraud detection by recognizing complex relationships between fraudulent activities that might not be visible with standard tools.

Real-life Example

A bank using a graph database can detect fraud by identifying unusual patterns in transaction data. For example, if several accounts are used to carry out a series of small transactions, which on their own might not trigger alerts, the graph database can connect these accounts to reveal a broader fraudulent scheme.
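
One simplified way to picture this: treat transfers as edges, group accounts into connected components, and flag groups whose transfers are each below the alert threshold but large in total. The thresholds, account names, and function names are assumptions. Real fraud detection uses far richer signals.

```python
from collections import defaultdict

def connected_groups(transfers):
    """Group accounts into connected components, treating transfers as undirected edges."""
    adjacency = defaultdict(set)
    for source, target, _ in transfers:
        adjacency[source].add(target)
        adjacency[target].add(source)
    seen, groups = set(), []
    for account in adjacency:
        if account in seen:
            continue
        stack, group = [account], set()
        while stack:
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(adjacency[node] - group)
        seen |= group
        groups.append(group)
    return groups

def suspicious_groups(transfers, per_transfer_limit, group_limit):
    """Flag groups whose transfers are individually small but large in total."""
    flagged = []
    for group in connected_groups(transfers):
        amounts = [amount for source, _, amount in transfers if source in group]
        if all(a < per_transfer_limit for a in amounts) and sum(amounts) >= group_limit:
            flagged.append(group)
    return flagged

# Hypothetical transfers: (source account, target account, amount).
transfers = [
    ("acc1", "acc2", 900), ("acc2", "acc3", 950),
    ("acc3", "acc4", 900), ("acc4", "acc1", 980),
    ("acc7", "acc8", 120),
]
flagged = suspicious_groups(transfers, per_transfer_limit=1000, group_limit=3000)
print(flagged)  # one group containing acc1, acc2, acc3, and acc4
```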

Knowledge Management and Document Retrieval

For companies handling large amounts of information, knowledge management is key to maintaining efficiency. Graph databases can help organize and retrieve documents in a way that makes knowledge more accessible, precise, and tailored to specific queries. By understanding the relationships between multiple documents, companies can quickly extract the most relevant information.

Real-life Example

In a law firm managing thousands of legal documents, these databases can retrieve precedents or related legal references based on a specific query. Instead of manually searching through countless files, the most relevant documents can be extracted based on their relationships and content.

Healthcare

Graph databases can help medical professionals quickly find relevant information within large volumes of medical data to answer complex questions about symptoms, treatments, and patient outcomes.

Real-life Example

In a hospital setting, graph databases can store information about patients, their medical histories, and treatments. They enable queries to identify patterns and trends in patient data, such as the effectiveness of certain treatments for specific conditions or recommending personalized treatment plans for patients.

Social Networks

Graph databases are excellent solutions for performing social network analysis. They can be used to represent people and their connections or relationships. They allow identifying influential people within a network, recommending new connections, or finding different groups of people and communities for profiling.

Real-life Example

Finding key people, identifying groups of people and communities, or locating important content through common neighbors are examples of using a graph database applied to a social network.

What Are the Differences Between Applying RAG with Vector Databases (Standard RAG) and Applying RAG with Graph Databases (Graph RAG)?

Characteristics	Vector Databases (Standard RAG)	Graph Databases (Graph RAG)
Data Models and Structures	Uses vectors (embeddings) to represent data points in a high-dimensional space. Works well with unstructured data such as text, images, video, etc.	Uses nodes and edges to represent entities and relationships. Works well with structured and related data.
Query Methods	Uses similarity search algorithms to find the closest vector to a given query vector.	Uses graph traversal algorithms to explore relationships.
Scalability and Performance	Optimized for large-scale and high-dimensional data. Performance may vary based on vector dimension.	Can scale with complex and interconnected data. Performance can be affected by the complexity of relationships between data entities.
Indexing Techniques	Uses techniques like IVF or HNSW for efficient similarity search.	Uses indexes like adjacency lists or B-trees for fast graph traversal.
Working Methodology	Focuses on measuring similarity or distance between data points in a multidimensional space.	Used to understand and analyze relationships between entities and leverage connections between nodes.
Interpretability	Less interpretable for humans due to high-dimensional numerical representations. It's difficult to understand relationships or reasoning behind retrieved information.	Knowledge representation interpretable by humans. The graph structure and labeled relationships clarify connections between entities.
Complexity and Learning Curve	Depending on the database, may require specific knowledge, but generally easier to use.	Its construction can be a complex process. Requires domain-specific knowledge and time for proper design. Query languages (e.g., Cypher or Gremlin) can be different from standard SQL and take time to learn.
Inferential Reasoning	More limited. It is based on vector similarity and may overlook relationships or implicit inferences. Can identify similar information but not complex relationships.	Allows inferential reasoning by traversing the graph structure and leveraging relationships between entities. It discovers implicit connections and derives new knowledge.

In Standard RAG, an embedding model converts text, images, audio, or video into vectors (embeddings) that capture their semantic meaning, and the vector database stores and indexes them. This allows for quick similarity searches. When a question is asked, the system embeds the query and finds the text segments whose vectors are most similar to it. These segments are then passed to the Large Language Model (LLM) as context to help it answer the question.
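
The Standard RAG flow can be sketched end to end as follows. The important assumption is that toy_embed is a stand-in bag-of-words counter, which is lexical rather than semantic. A real application would call an embedding model and a vector database here, and would send the final prompt to an LLM. All function names and documents are hypothetical.

```python
import math
import re
from collections import Counter

def toy_embed(text):
    """Stand-in for an embedding model: a bag-of-words count vector (lexical, not semantic)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, chunks, k=1):
    """Return the k chunks most similar to the query."""
    query_vec = toy_embed(query)
    return sorted(chunks, key=lambda c: cosine(query_vec, toy_embed(c)), reverse=True)[:k]

def build_prompt(query, context):
    """Assemble the text that would be sent to the LLM."""
    joined = "\n".join(f"- {chunk}" for chunk in context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {query}"

# Hypothetical IT documentation chunks.
chunks = [
    "To reset your password, open Settings and choose Reset password.",
    "Invoices are emailed on the first day of each month.",
    "VPN access requires a hardware token.",
]
query = "How can I change or reset my password?"
top = retrieve(query, chunks, k=1)
prompt = build_prompt(query, top)
print(prompt)
```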

In Graph RAG, graph databases are useful for analyzing relationships and connections. A Large Language Model (LLM) is typically used to extract entities and their relationships from the source data, which makes the initial setup costly. This creates a structured graph of nodes (entities) and edges (relationships) that is stored in the graph database. When a question is asked, the system searches the graph for the relevant parts. Then, the LLM uses this information together with the query to provide a detailed response.
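
A simplified sketch of the query-time half of Graph RAG. We assume the costly extraction step has already produced the triples below (they are hand-written, hypothetical data). Entity linking is reduced to whole-word name matching, and the resulting prompt would be sent to an LLM.

```python
import re

def find_entities(question, triples):
    """Naive entity linking: keep graph entities whose name appears as whole words."""
    entities = {s for s, _, _ in triples} | {o for _, _, o in triples}
    lowered = question.lower()
    return {e for e in entities if re.search(rf"\b{re.escape(e.lower())}\b", lowered)}

def retrieve_subgraph(seeds, triples, hops=2):
    """Collect triples reachable from the seed entities within `hops` steps."""
    frontier, selected = set(seeds), []
    for _ in range(hops):
        reached = set()
        for triple in triples:
            subject, _, obj = triple
            if (subject in frontier or obj in frontier) and triple not in selected:
                selected.append(triple)
                reached.update((subject, obj))
        frontier = reached
    return selected

def build_prompt(question, facts):
    """Turn the retrieved triples into context for the LLM."""
    lines = "\n".join(f"{s} {p} {o}" for s, p, o in facts)
    return f"Known facts:\n{lines}\n\nQuestion: {question}"

# Hypothetical triples, assumed to come from an earlier LLM extraction step.
triples = [
    ("Ana", "MANAGES", "Atlas"),
    ("Atlas", "DEPENDS_ON", "Billing API"),
    ("Billing API", "OWNED_BY", "Payments Team"),
    ("Luis", "MANAGES", "Orion"),
]
question = "Who manages Atlas and what does it depend on?"
seeds = find_entities(question, triples)
facts = retrieve_subgraph(seeds, triples, hops=2)
prompt = build_prompt(question, facts)
print(prompt)
```

The second hop brings in the owner of the Billing API, a fact that is related to the question only through a chain of relationships.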

Conclusions

In this post, we have explained what vector and graph databases are, their advantages and disadvantages, and detailed some real-world use cases where these databases could fit more easily. Each of them has its strengths for handling different types of data. Understanding what each type of database does best will help us choose the most suitable one for our needs.

In the world of generative AI, RAG is the process of optimizing the response returned by an LLM to a question. This requires consulting an authoritative knowledge database beyond its training data sources before generating a response. Vector and graph databases are two ways of implementing these external knowledge databases.

In RAG, we can distinguish two main development flows for our applications: Standard RAG (which works with vector databases) or Graph RAG (which works with graph databases). Depending on the requirements of our use case, and also on our types of data, we can choose which database best suits the problem and, therefore, what steps to follow to implement our RAG application. A quick way to determine which one to choose is the following:

When our RAG relies on retrieving semantically similar data, we should choose a vector database.
When our RAG needs to model complex relationships between entities, then we should choose a graph database.
When our RAG needs a hybrid approach, being able to combine both approaches in the same solution, then we will use both databases. For example, using a vector database to retrieve relevant documents and a graph database to understand relationships between entities within those documents.
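
To illustrate the hybrid option, here is a minimal sketch that first picks the closest document by vector distance and then pulls graph facts about the entities mentioned in it. The document vectors, the document-to-entity mapping, and the triples are all hypothetical, precomputed inputs. In practice they would come from an embedding model, an entity extractor, and a graph database.

```python
import math

def top_documents(query_vec, doc_vecs, k=1):
    """Vector step: ids of the k documents closest to the query vector."""
    return sorted(doc_vecs, key=lambda doc_id: math.dist(query_vec, doc_vecs[doc_id]))[:k]

def related_facts(entities, triples):
    """Graph step: triples that touch any of the given entities."""
    return [t for t in triples if t[0] in entities or t[2] in entities]

# Hypothetical precomputed data.
doc_vecs = {"doc-atlas": (0.9, 0.1), "doc-orion": (0.1, 0.9)}
doc_entities = {"doc-atlas": {"Atlas"}, "doc-orion": {"Orion"}}
triples = [
    ("Atlas", "DEPENDS_ON", "Billing API"),
    ("Orion", "DEPENDS_ON", "Search API"),
]
query_vec = (0.8, 0.2)

docs = top_documents(query_vec, doc_vecs, k=1)
entities = set().union(*(doc_entities[doc_id] for doc_id in docs))
facts = related_facts(entities, triples)
print(docs, facts)  # ['doc-atlas'] [('Atlas', 'DEPENDS_ON', 'Billing API')]
```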