
Fast, efficient Object Storage for Generative AI Infrastructures

The case of RAG

Improving your LLM

Large Language Models (LLMs) are undeniably changing business models and enhancing productivity by bringing instant chatbot services to the enterprise. Areas of adoption include text generation and summarization, question answering, sentiment analysis, translation, and Named Entity Recognition (NER), which identifies and classifies names, locations, and other entities within a text. LLMs are trained on vast volumes of data to generate accurate output, and those trained on larger datasets tend to perform more accurately. Despite their data-intensive training, LLMs still face challenges with output accuracy around specialized terminology and contextual relevance. At this point, leaders typically need to make a strategic decision between fine-tuning their LLM or implementing a RAG approach to complement it. But how do you make that decision?

Should you fine-tune your LLM or implement a RAG approach?

Whether you’ve trained your LLM in-house or are implementing one off the shelf, it may require further training to generate more suitable and accurate output, especially for specialized use cases. LLMs are used in various industries for targeted or general purposes, and the quality of their output can be assessed with different metrics, including the Exact Match (EM) score, which measures how closely the output matches the ground truth. There are several reasons why you may want to improve this score: high hallucination rates (when the LLM’s output is factually incorrect, nonsensical, or disconnected from the input prompt), a lack of contextual knowledge, or outdated output. To address these issues, there are two main solutions: you can either fine-tune your LLM, or implement a Retrieval-Augmented Generation (RAG) system. Fine-tuning is ideal for achieving high accuracy and relevance in specialized domains, while a RAG system ensures up-to-date and factually correct information. But how do you know which one you need?

Let’s illustrate both solutions: a great use case for fine-tuning is the legal industry, where professionals require domain-specific accuracy, contextually appropriate and relevant content, and a consistent style and tone (e.g. legal language). Conversely, a RAG system is ideal for the customer service industry, where professionals need to access the latest information from several data sources to answer diverse customer inquiries accurately and efficiently. In some cases, implementing both solutions makes the most sense. For example, healthcare providers can benefit from fine-tuning their LLM for medical terminology and context, while simultaneously using a RAG system to ensure they are providing the most up-to-date information from the latest medical research and guidelines.

In this post, we’ll explore the RAG approach and how it impacts the supporting data infrastructure. If you’d like to dive into the “What and How” of fine-tuning an LLM, please check out our post about it here.

Diving into how RAG works

Retrieval-Augmented Generation (RAG) was first described by Facebook AI Research in the 2020 paper Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. It aims to incorporate real-time, relevant information from external databases or documents beyond the LLM's training data.

The RAG process starts with a user asking a question, which triggers the RAG system to look through several documents and collect relevant information, which is then passed on to the LLM for response preparation. The LLM delivers the answer to the user. But how does it work technically? Simply put, a RAG system analyzes the question asked and then crawls the organization’s document corpus to find documents that could help answer it. The RAG system retrieves relevant information from these documents and pushes it to the LLM, which then prepares and delivers the answer to the user.
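The retrieve-then-generate flow above can be sketched in a few lines. This is a minimal, illustrative sketch: the toy corpus, the word-overlap retriever, and the generate() stub are stand-ins for a real retrieval model and LLM call, not part of any specific framework.

```python
# Minimal sketch of the RAG flow: retrieve relevant documents for a
# query, then pass them to the LLM as context for response preparation.

corpus = {
    "doc1": "Our refund policy allows returns within 30 days of purchase.",
    "doc2": "Shipping to Europe takes 3-5 business days.",
    "doc3": "Premium support is available for enterprise plans.",
}

def retrieve(query: str, k: int = 2) -> list[str]:
    """Score documents by word overlap with the query (a stand-in for a real retriever)."""
    q_words = set(query.lower().split())
    scored = sorted(
        corpus.items(),
        key=lambda item: len(q_words & set(item[1].lower().split())),
        reverse=True,
    )
    return [text for _, text in scored[:k]]

def generate(query: str, context: list[str]) -> str:
    """Stand-in for the LLM call: in practice this prompt is sent to the model."""
    return f"Answer '{query}' using context: {' | '.join(context)}"

query = "refund policy returns"
answer = generate(query, retrieve(query))
```

In a production system, retrieve() would be backed by one of the retrieval solutions discussed below, and generate() by an actual LLM API call.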

Implementing RAG

Implementing a RAG system for your LLM involves more than just a simple plug-in. Since RAG leverages external data sources, it is essential to start this journey with a data inventory for a comprehensive overview of the data. Then, users should choose the retrieval model that best fits their application. Finally, there should be adequate data infrastructure to support the machine learning applications.

Data inventory

RAG retrieves information from external sources, meaning sources that your LLM has not seen during training or fine-tuning. First, you need to make an inventory of all the data you have and determine what you’d like the RAG system to access. This often involves collaborating with different departments or business units to gather and consolidate all the data sources in scope. Depending on your motivations for implementing RAG, an end-to-end view of the data in a process helps different units access up-to-date data at all times. For example, if a company implements a RAG system to improve customer support, departments such as sales, support, and product development can all work from the most up-to-date information, leading to more accurate responses and better decision-making.

Best-fitting data retrieval solution

When implementing a RAG system, you will need a data retrieval solution: once the user sends a query (a question), the system looks up whether the data retrieval solution has something in store (from the data inventory) that can help answer it. This means you’ll have to implement a new tool: a full-text search engine, a vector database, or a graph database. Each fits different types of queries and use cases, which makes it possible to use one or several of them together.
Full-text search

Once implemented, full-text search solutions, like Elasticsearch, perform word-matching from the user’s query to the documents in storage. They perform lexical search by indexing large volumes of text to allow fast retrieval of documents containing specific keywords or phrases. They are best for applications where keyword-based queries are needed, such as customer support databases where users need to find articles or documents containing specific terms quickly.
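The indexing behind lexical search can be illustrated with a small inverted index. This is a simplified sketch of the principle, not how Elasticsearch is implemented; the toy documents are invented for illustration.

```python
# Sketch of an inverted index, the core data structure of full-text
# search: it maps each term to the set of documents containing it,
# so keyword lookups avoid scanning every document.

from collections import defaultdict

docs = {
    1: "reset your password from the account settings page",
    2: "contact support to reset a forgotten password",
    3: "billing questions are handled by the finance team",
}

index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term].add(doc_id)

def search(query: str) -> set[int]:
    """Return documents containing every query term (boolean AND)."""
    term_sets = [index[t] for t in query.lower().split()]
    return set.intersection(*term_sets) if term_sets else set()

search("reset password")  # matches documents 1 and 2
```

Real engines add relevance scoring, stemming, and stop-word handling on top of this structure, which is also what drives the index-related storage overhead discussed later in this post.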
Vector databases

The second option for data retrieval is a vector database, such as Qdrant or Milvus. These databases store and index data as high-dimensional vectors. Simply put, this technique encodes information based on its meaning. When two words are related, they are encoded closely together, whereas unrelated words are encoded further apart. This allows for semantic similarity search using nearest neighbor search. For example, in a financial service provider’s vector database, the words "instrument" and "finance" would be encoded close to each other (e.g., financial instruments). However, "finance" and "symphony" would be encoded further apart because they have entirely different meanings and belong to different contexts. Vector databases are ideal for applications involving semantic search or recommendation systems, where finding similar items based on context or content is crucial. For instance, in a content recommendation engine, vector databases can quickly find and suggest similar articles or videos.
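The nearest-neighbor lookup described above can be sketched with cosine similarity. The 3-dimensional vectors below are toy stand-ins (real embeddings typically have hundreds of dimensions), chosen so that "instrument" and "finance" sit close together while "symphony" sits apart, mirroring the example in the text.

```python
# Sketch of semantic similarity search in a vector database:
# embeddings are compared by cosine similarity, so related words
# score close to 1.0 and unrelated words score lower.

import math

embeddings = {
    "instrument": [0.9, 0.8, 0.1],
    "finance":    [0.8, 0.9, 0.2],
    "symphony":   [0.1, 0.2, 0.9],
}

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(word: str) -> str:
    """Return the semantically closest other word by cosine similarity."""
    return max((w for w in embeddings if w != word),
               key=lambda w: cosine(embeddings[word], embeddings[w]))

nearest("finance")  # "instrument": related words sit close in vector space
```

Vector databases like Qdrant or Milvus perform this comparison at scale using approximate nearest-neighbor indexes rather than the exhaustive scan shown here.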
Graph databases
The third option for data retrieval is a graph database, such as Neo4j or ArangoDB. These databases store and index data as nodes and edges, where nodes represent entities (such as people, products, or events) and edges represent the relationships between these entities. Simply put, this technique models data based on its connections. When two entities are related, they are connected by edges, whereas unrelated entities have no direct connections. This allows for efficient traversal and querying of complex relationships. For example, in fraud detection, a graph database would connect transactions (nodes) to associated entities like accounts, locations, and timestamps (other nodes) through edges, allowing for the identification of suspicious patterns and connections. Graph databases are ideal for applications where understanding and leveraging relationships between data points is crucial.
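The fraud-detection example above can be sketched as a graph traversal. The nodes, edges, and entity names below are invented for illustration, and a plain adjacency list stands in for a real graph database.

```python
# Sketch of relationship traversal in a graph model: transactions and
# accounts are nodes, edges link related entities, and a breadth-first
# traversal surfaces everything reachable from a suspicious start node.

from collections import deque

# Adjacency list: each edge connects a transaction to an account or device.
edges = {
    "txn_1": ["account_a", "device_x"],
    "txn_2": ["account_b", "device_x"],   # shares a device with txn_1
    "account_a": ["txn_1"],
    "account_b": ["txn_2"],
    "device_x": ["txn_1", "txn_2"],
}

def connected(start: str) -> set[str]:
    """Breadth-first traversal: every entity reachable from `start`."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in edges.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen

connected("txn_1")  # the shared device links txn_1 to account_b via txn_2
```

Graph databases like Neo4j express this kind of traversal declaratively (e.g. in Cypher) and index the relationships so it stays fast on large graphs.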
As mentioned earlier, data retrieval models need to know your data, meaning they need to encode it, which comes with data storage requirements. Their implementation necessitates storing the connections they make with your data, leading to increased data volumes on your foundation storage layer. We’ve seen that full-text search tools increase data volumes by 2-5 times due to the indexing required for fast retrieval. Vector databases can raise storage requirements by around 10 times due to high-dimensional vector storage. Finally, graph databases impact data volumes based on the complexity and number of connections in your data. Starting a RAG implementation before thinking about your data infrastructure will create data silos and inefficiencies, resulting in significant bottlenecks.

Adopt a data-infrastructure-first strategy

At this point, the data inventory is done and you have a clear idea of which data retrieval tool you’d like to implement. All that’s left is to ensure that your data infrastructure can support your operations. In this section, we focus on implications RAG systems have on our data infrastructure. If you’re interested in the impact LLM training has on your data architecture, we have more details in this post.

The data inventory should give you a clear idea of data type, data volume, data location, and how dispersed these locations are. At this point, you’re likely to be looking for a data architecture that is scalable, keeps your data in a central repository, and can manage different data formats and various average file sizes while maintaining decent performance to facilitate LLM-RAG operations. This type of data architecture is a data lake or a lakehouse. A data lake architecture allows users to store any type of data (structured, semi-structured and unstructured), while a lakehouse combines the flexibility of a data lake with the high-performance querying known from data warehouses. Both architecture types enable organizations to store large data volumes with different data types and formats, facilitating data storage and access for businesses. They have become best practice for data architecture, especially as they facilitate data centralization and provide a single source of truth. Data lake and lakehouse architectures require scalable, high-performance object storage, whose benefits are described in further detail here.

Implementing RAG is largely about implementing the retriever model(s) you wish to use. Each retriever comes with its own pipeline, parts of which can be automated while others require manual work:

Full-text search
- Content preparation (automatable): extract raw content from upstream sources, then clean and normalize the text (e.g., removing stop words, stemming).
- Index creation: build inverted indexes. (An inverted index maps terms to their locations in the dataset, allowing for fast keyword-based searches.)
- Data ingestion: import the data into the search store.
- Query development: write and optimize search queries to retrieve relevant documents quickly. This requires manual development and optimization.

Vector databases
- Content preparation (automatable): extract raw content from upstream sources, divide it into smaller chunks, and summarize, blend, and recontextualize the content chunks.
- Encoding (may require manual oversight or configuration): encode the words in each content chunk into tokens, then vectorize the tokenized content into word embeddings. (A token is a single unit of text, such as a word or a subword; word embeddings are dense vector representations of words, capturing their meanings and relationships based on their context within large datasets.)

Graph databases
- Content preparation (automatable): extract raw content from upstream sources.
- Data modeling: define nodes, edges, and properties. (Nodes represent entities, edges represent relationships, and properties are attributes of nodes and edges.) This requires manual design and input.
- Data ingestion: import the modeled data into the graph database and create indexes for efficient querying.
- Query development: write and optimize queries to traverse and analyze the graph. This requires manual development and optimization.
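The chunking step in the vector pipeline can be sketched simply. The chunk size, overlap, and sample text below are illustrative; production values depend on the embedding model's input limit.

```python
# Sketch of content chunking before tokenization and embedding: long
# content is divided into overlapping word chunks so each piece fits the
# embedding model's input window while preserving context at boundaries.

def chunk(text: str, size: int = 8, overlap: int = 2) -> list[str]:
    """Split text into chunks of `size` words, each sharing `overlap` words with the previous one."""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size])
            for i in range(0, max(len(words) - overlap, 1), step)]

document = ("Retrieval-Augmented Generation incorporates real-time relevant "
            "information from external documents beyond the training data of "
            "the large language model itself")
chunks = chunk(document)  # 3 overlapping chunks from 19 words
```

Each resulting chunk would then be tokenized and vectorized into an embedding, which is what drives the roughly 10x storage overhead mentioned above.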

RAGs on a Modern Data Lake with UltiHash

Each time a user sends a query, the retriever model(s) implemented will read from the object storage layer to retrieve relevant information to feed the LLM, and the end-user application will write objects such as logs and the generated content back to the foundation storage layer for analysis and audit purposes.

Implementing RAG increases the read and write operations to and from your object storage foundation layer, inflating data volumes and increasing the need for parallel operations to avoid bottlenecks. Without parallel processing, the system would require one operation to be completed before proceeding to the next, significantly hindering performance. Additionally, this raises the requirements on I/O speed.
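The benefit of parallel reads can be sketched with a thread pool. The fetch_object() function below is a placeholder simulating an S3-style GET request with artificial latency, not a real storage client.

```python
# Sketch of parallel object reads: instead of fetching objects one after
# another, a thread pool issues the requests concurrently, so total
# latency approaches that of a single read rather than their sum.

from concurrent.futures import ThreadPoolExecutor
import time

def fetch_object(key: str) -> str:
    """Placeholder for a storage read; sleeps to simulate I/O latency."""
    time.sleep(0.1)
    return f"contents of {key}"

keys = [f"doc-{i}" for i in range(8)]

with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(fetch_object, keys))
# Eight 0.1 s reads complete in roughly 0.1 s instead of 0.8 s sequentially.
```

Because the work is I/O-bound, threads are sufficient here; the same pattern applies when a retriever fans out requests across many stored objects.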

In the past, users seeking storage solutions faced a binary choice: expensive storage that allows fast data retrieval, or affordable storage with longer retrieval times. Increasing data volumes lead to higher hardware and cloud costs, increased maintenance, and overall surging resource consumption. UltiHash Object Storage provides scalable, high-performance data storage designed for AI/ML workloads. It has built-in resource-efficiency features powered by lightweight data deduplication algorithms that generate space savings without compromising performance.
UltiHash is the primary underlying storage foundation for AIRAs (AI-ready architectures), and container-native for cloud and on-premises applications. It is designed to handle peta- to exabyte-scale data volumes while maintaining high speed and being resource-efficient.

Resource-efficient scalable storage

UltiHash provides resource-efficient, scalable storage with fine-granular sub-object deduplication across the entire storage pool. The result? Storage volumes do not grow linearly with your total data.

Lightweight, CPU-optimized deduplication

UltiHash achieves performant and efficient operations thanks to an architecture designed to handle high IOPS and a lightweight deduplication algorithm that keeps CPU time to a minimum.

Flexible + interoperable via S3 API

UltiHash offers high interoperability through its native S3-compatible API. It integrates with processing engines (Flink, PySpark), ETL tools (Airflow, AWS Glue), open table formats (Delta Lake, Iceberg) and ML tools (SageMaker). If you’re using a tool we don’t support yet, let us know and we’ll look into it!

Any questions?

What is UltiHash?

UltiHash is a lean storage foundation for data-intensive applications. It is powered by deduplication algorithms and streamlined storage techniques, and leverages past data integrations to generate significant space savings while delivering high-speed access. UltiHash enhances your data management by turning large datasets and data growth into an advantage for your infrastructure rather than a burden.

What does UltiHash offer?

UltiHash facilitates data growth within the same existing storage capacity. UltiHash deduplicates within and across datasets from terabytes to exabytes: users store only what they truly need. It’s fast, efficient, and works at a byte level, making it agnostic to data format or type. With UltiHash, the trade-off between high costs and low performance is a thing of the past.

What is object storage?

Object storage is a data storage solution suitable for storing all data types (structured, semi-structured and unstructured) as objects. Each object includes the data itself, its metadata, and a unique identifier, allowing for easy retrieval and management. Unlike traditional file or block storage, object storage is highly scalable, making it ideal for managing large amounts of unstructured data.

How does data deduplication work in UltiHash?

Data is analysed on a byte level and dynamically split into fragments, which allows the system to separate fragments that are unique from those that contain duplicates. UltiHash matches duplicates within and across datasets, leveraging the entirety of the data. Fragments that are unique and were not matched across the dataset or past integrations are added to UltiHash, while duplicates are referenced against existing fragments. This is our process to keep your storage footprint growth sustainable.
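The general principle of fragment-level deduplication can be sketched in a few lines. This is an illustrative simplification (fixed-size fragments, an in-memory dict as the store), not UltiHash's actual algorithm, which splits fragments dynamically.

```python
# Sketch of byte-level deduplication: data is split into fragments, each
# fragment is hashed, and only fragments with unseen hashes are stored.
# Duplicates are recorded as references to already-stored fragments.

import hashlib

store: dict[str, bytes] = {}  # fragment hash -> fragment bytes

def ingest(data: bytes, fragment_size: int = 4) -> list[str]:
    """Split data into fragments, store only the unique ones, return the recipe."""
    recipe = []
    for i in range(0, len(data), fragment_size):
        fragment = data[i:i + fragment_size]
        digest = hashlib.sha256(fragment).hexdigest()
        if digest not in store:          # duplicate fragments are not stored again
            store[digest] = fragment
        recipe.append(digest)            # the recipe reconstructs the original data
    return recipe

ingest(b"AAAABBBBAAAA")  # "AAAA" appears twice but is stored only once
len(store)               # 2 unique fragments despite 3 fragments in the input
```

The returned recipe of hashes is enough to reassemble the original object from the fragment store, which is why deduplicated data remains fully retrievable.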

What is unique about UltiHash?

UltiHash efficiently stores your desired data volume, providing significant space savings, high speed and the flexibility to scale up seamlessly. Increase your data volumes within the existing storage capacity, without compromising on speed.

Can UltiHash be integrated in existing cloud environments?

Absolutely - UltiHash can be integrated into existing cloud environments, such as those that leverage EBS. UltiHash was designed to be deployed in the cloud, and we can suggest specific machine configurations for optimal performance. The cloud environment remains in the hands of the administrator, who can configure it as preferred.

What API does UltiHash provide to connect to my other applications?

UltiHash provides an S3-compatible API. The decision for our API to be S3 compatible was made with its utility in mind - any S3 compatible application qualifies as a native integration. We want our users to have smooth and seamless integration.

How does UltiHash ensure data security and privacy?

The user is in full control of the data. UltiHash is a foundation layer that slides into an existing IT system. The infrastructure, and data stored, are the sole property of the user: UltiHash merely configures the infrastructure as code.

Is UltiHash suitable for both large and small scale enterprises?

UltiHash was designed to empower both small and large data-driven organisations to innovate at a high pace.

What type of data can be stored in UltiHash?

The data integrated through UltiHash is read on a byte level; in other words, UltiHash processes are not impacted by the type or format of data integrated and works with structured, semi-structured and unstructured data.

What are the pricing models for UltiHash services?

UltiHash currently charges a fixed fee of $6 per TB per month - whether on-premises or in the cloud.

Need more answers?