Data lakehouses combine scalability and querying capabilities to power innovation.
As data is increasingly seen as the main catalyst for innovation, attention has shifted to a core problem in data-driven organisations: data is scattered across different applications and processing engines.
The solution? A centralised system that can host high volumes of data in different types and formats, sitting somewhere between the flexibility and ease of adoption of data lakes and the advanced querying capabilities of data warehouses. Lakehouses are this new paradigm: they provide a single source of truth where users can power vastly different use cases with both raw and processed data, while maintaining a standardised approach to data management across the enterprise.
This centralised approach to data architecture improves data integrity, time to insights and model training time, all considerations that are mission-critical for organisations that rely on machine learning or business intelligence operations.
Recap
Data warehouses were the first data architecture innovation that allowed companies to enhance their analytics operations. This architecture typically has compute and data storage tightly integrated for optimized performance. Data warehouses are designed for efficient analytics, optimized for structured data and SQL queries.
Then came data lakes, designed to handle unstructured data. Over the years, companies began gathering more diverse data types (PDFs, images, videos, audio files, etc.) from which they could derive insights. Data lakes disaggregate compute and storage, using scalable, cloud-native object storage for data. This separation allows businesses to choose the most suitable compute resources for their specific performance requirements.
At this point, there were two solutions addressing different challenges. Data warehouses excel at fast querying and managing structured data with schemas, while data lakes are excellent for storing various data formats and providing scalability at low cost. Companies needed a solution that combined the advantages of both.
The need to balance these performance tradeoffs, together with the growing use of diverse data types in AI/ML and analytics, led to the emergence of a new data architecture: the Data Lakehouse. This paradigm combines the best features of data warehouses and data lakes, addressing the changing requirements of data-driven organizations. A Lakehouse offers the best of both worlds: efficient querying and schema management combined with scalability and flexibility. It also enables advanced data management features such as ACID transactions and metadata management, allowing for robust and reliable data operations across various data types and formats.
Let’s bring a little bit of context to this challenge. From the companies starting their ML journey to those managing petabyte-scale architectures, both share the same fate: storing the same data many times, in many locations and many formats. This practice consumes enormous amounts of storage and compute resources, results in team silos, and slows down overall productivity.
Imagine a car manufacturer working on cutting-edge self-driving technology.
They've installed edge devices on cars that continuously send data to the cloud. These edge devices collect video and LIDAR data from the front and rear cameras, as well as engine temperature readings and humidity data from a moisture detector. This manufacturer needs a storage solution that can store and process different data types and scale to train its ML models.
A Lakehouse architecture is ideal for this scenario. It allows for efficient storage of diverse data formats from the edge devices on the car, while also facilitating efficient querying, multi-part analysis, and machine learning. These capabilities are crucial for deriving actionable insights and advancing the development of self-driving technology. Additionally, the Lakehouse's advanced analytics capabilities enable real-time data analysis, vehicle performance monitoring, predictive maintenance, and overall enhancement of safety and efficiency for self-driving vehicles.
Open Table Formats: The Lakehouse Architecture
Open table formats, such as Hudi, Iceberg, and Delta Lake, facilitate the management of data within a Lakehouse. These open-source projects were first developed at Uber (Hudi), Netflix (Iceberg), and Databricks (Delta Lake), providing a robust framework for organizing large datasets, ensuring data consistency, and enabling efficient data retrieval. Open table formats are the key component for a lakehouse to support features like ACID (Atomicity, Consistency, Isolation, and Durability) transactions, time travel, and schema evolution, which are essential for maintaining data integrity and supporting complex data operations. By standardizing how data is stored and accessed, these formats help orchestrate various ETL processes, making it easier to manage and query data at rest. This results in improved performance, scalability, and flexibility, which are key to optimizing the data infrastructure of a lakehouse.
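To make this concrete, here is a minimal sketch (using PySpark with the delta-spark package) of what an open table format adds on top of plain files: an ACID upsert and time travel back to an earlier table version. The bucket path, column names and sample values are hypothetical, and the same ideas apply to Hudi and Iceberg with their own APIs.

```python
# Minimal sketch: ACID upsert and time travel with Delta Lake on object storage.
# Paths, columns and values are hypothetical placeholders.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("lakehouse-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

path = "s3a://lakehouse/telemetry"  # hypothetical bucket/prefix

# Initial load: write a Delta table (version 0).
spark.createDataFrame(
    [(1, "engine_temp", 92.5)], ["sensor_id", "metric", "value"]
).write.format("delta").mode("overwrite").save(path)

# ACID upsert: merge new readings into the table (creates version 1).
updates = spark.createDataFrame(
    [(1, "engine_temp", 95.1), (2, "humidity", 0.41)],
    ["sensor_id", "metric", "value"],
)
(DeltaTable.forPath(spark, path).alias("t")
 .merge(updates.alias("u"), "t.sensor_id = u.sensor_id AND t.metric = u.metric")
 .whenMatchedUpdateAll()
 .whenNotMatchedInsertAll()
 .execute())

# Time travel: read the table as it was before the merge.
v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)
```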
Choosing Your Open Table Format (OTF)
The open table format decision is highly strategic: it involves critical tradeoffs around transaction coordination, metadata storage, and data ingestion strategies that significantly affect performance. Navigating these tradeoffs is essential to efficiently executing diverse real-world workloads. For example, Hudi supports features like Merge-On-Read (MoR) writers, advanced indexing subsystems, and asynchronous table services, while Iceberg offers features like partition evolution. Let’s dig into the details.
Metadata Management
Hudi, Iceberg and Delta Lake take different approaches to metadata management.
Hudi and Delta Lake use a tabular format for metadata storage. All metadata is maintained at a single level, making it simpler to understand and manage. This format often allows for faster writes because metadata changes are appended to a log. However, scalability can become an issue as the table grows because scanning the entire metadata table can be slow. Managing and compacting large amounts of metadata in a single table can also be challenging.
Iceberg uses a hierarchical format for metadata management. This approach scales better with large datasets because the metadata is divided into manageable layers. The upper levels act as an index, quickly narrowing down the search area and improving query performance. Reduced metadata scanning means that only a subset of metadata files needs to be read for queries, enhancing efficiency. However, this format introduces complexity due to its multi-level structure and can result in slower write operations, as updates may require changes across multiple layers of metadata.
Delta Lake and Hudi plan queries as distributed batch jobs, scanning all the metadata tables a query requires. In contrast, Iceberg queries are planned by a single node using the upper level of the manifest hierarchy, which acts like an index and minimizes the number of reads needed from the lower levels. As a result, the tabular metadata format incurs more read operations, and therefore more requests to the object storage layer, than the hierarchical format.
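The hierarchy is easy to see from the client side. Below is a small, hedged sketch using PyIceberg to walk an existing Iceberg table's metadata: the top-level metadata file, then each snapshot and the manifest list it points to. The catalog name and table identifier are hypothetical, and the catalog configuration depends entirely on your setup.

```python
# Sketch of Iceberg's hierarchical metadata: each snapshot points to a manifest
# list, which in turn points to manifests that index the data files.
# Catalog and table names are hypothetical; catalog config depends on your setup.
from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")              # e.g. a REST or Glue catalog
table = catalog.load_table("telemetry.sensor_readings")

print(table.metadata_location)                 # top-level metadata JSON file
for snapshot in table.snapshots():
    # The manifest list is the "upper level" a single node uses to plan a query.
    print(snapshot.snapshot_id, snapshot.manifest_list)
```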
Writes, inserts and upserts
There are two ways to update data in a lakehouse: either it is copied to a new data file during the write operation, which is called Copy-on-Write (CoW), or data changes are written to separate delta files and merged with the original data file during read operations, referred to as Merge-on-Read (MoR).
The Copy-on-Write (CoW) strategy identifies the files containing records that need to be updated and eagerly rewrites them to new files with the updated data, incurring high write amplification* but no read amplification*. This strategy is employed by all three formats.
*Reading/writing more data than necessary due to inefficiencies, resulting in increased I/O operations.
The Merge-On-Read (MoR) strategy, used by Iceberg and Hudi, does not rewrite files. Instead, it writes out information about record-level changes in additional files and defers the reconciliation until read-query time. This results in fewer writes to the object storage but potentially more read operations to filter out tombstoned records during queries. In Iceberg, tombstone files are created to mark deleted or updated records, while the original data files remain unchanged. This approach means a lower write load but higher read complexity, as the system needs to read and filter tombstone files. Hudi, on the other hand, stores changes (inserts, deletes, updates) in separate Avro files, which are then merged with the original Parquet files during queries. This results in increased write operations due to storing change logs and sorting, but query time involves reconciling these changes, which can be read-intensive.
In summary, both approaches balance the load on object storage differently, and the right choice depends on the application’s requirements (performance, read/write patterns, etc.).
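In practice, the CoW/MoR choice is usually a per-table configuration. The sketch below shows how this might look: Iceberg exposes row-level operation modes as table properties, while Hudi sets the strategy through the table type at write time. It assumes an existing SparkSession `spark` and DataFrame `df` (as in the earlier sketch); table names, keys and paths are hypothetical.

```python
# Sketch of choosing CoW vs MoR per table. Names and paths are hypothetical.

# Iceberg (Spark SQL): row-level operation modes are table properties.
spark.sql("""
    ALTER TABLE lakehouse.telemetry SET TBLPROPERTIES (
        'write.delete.mode' = 'merge-on-read',
        'write.update.mode' = 'merge-on-read',
        'write.merge.mode'  = 'copy-on-write'
    )
""")

# Hudi (PySpark writer): the table type decides the strategy for the whole table.
(df.write.format("hudi")
   .option("hoodie.table.name", "telemetry")
   .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")  # or COPY_ON_WRITE
   .option("hoodie.datasource.write.recordkey.field", "sensor_id")
   .option("hoodie.datasource.write.precombine.field", "ts")
   .mode("append")
   .save("s3a://lakehouse/hudi/telemetry"))
```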
Query performance
Benchmarking studies show that query performance varies significantly among different open table formats (OTFs). For example, Delta Lake generally outperforms Hudi and Iceberg in query execution times, primarily due to more-efficient data reading. Delta Lake benefits from larger file sizes, which enhance columnar compression and reduce overhead for large table scans, resulting in faster query times. In contrast, Hudi's smaller file sizes and Iceberg's custom-built Parquet reader can lead to slower performance.
For small queries, the performance bottleneck often lies in metadata operations. Hudi tends to perform better in these cases due to its query plan caching.
Impact on storage
Both Copy-on-Write (CoW) and Merge-on-Read (MoR) modes have significant trade-offs regarding their read/write interactions with the storage layer. MoR, used by Hudi and Iceberg, tends to generate higher I/O volume and calls, leading to increased read query latency compared to CoW. Hudi's MoR is faster for merging data but results in slower queries post-merge, while CoW has slower initial writes due to data organization and file balancing. Each OTF presents unique trade-offs, requiring careful consideration of workload characteristics to optimize query performance and storage interactions.
The interaction of OTFs with the storage layer is crucial, especially on cloud-based object storage where I/O requests are billed on a pay-as-you-go basis (unlike UltiHash). This means it's important to manage not only storage utilization but also total API operations, data transfers, and peak request rates. Concurrent read/write sessions also impact query performance. Combining maintenance operations with read queries on the same cluster can improve resource utilization without affecting read latency, and using multiple compute clusters concurrently can further reduce execution time by leveraging the decoupling of compute and storage in cloud environments. Overall, balancing read and write operations and using storage resources efficiently is critical, but also time- and resource-intensive. UltiHash is a new high-performance, resource-efficient object storage designed for lakehouses, enabling data growth without compromising speed or sustainability.
How to set up a data lake or lakehouse architecture in 9 steps
Whether driven by advanced analytics, AI/ML operations, or data integration and unified data management, adopting a data lake or lakehouse architecture has become best practice. This approach improves decision-making through better insights, supports complex queries across large datasets, and integrates with BI tools like Tableau, Power BI, and Looker. Additionally, data lakes and lakehouses enhance model training by providing access to vast amounts of historical and real-time data, scalability to handle big data workloads, and integrations with machine learning frameworks like TensorFlow, PyTorch, and Scikit-learn. Finally, centralized data storage allows for easy access and management, reduces data silos, enhances data quality and consistency, and supports various data types and formats.
Here are 9 steps to set up your data lake or lakehouse architecture:
1
Define Objectives and Requirements
Start by outlining the goals for the lakehouse. Determining the types of data that will be handled and the analytical tasks it will support will help you shape the overall design and technology choices in your solution.
Clear objectives ensure that the storage solution and infrastructure are aligned with business needs, optimizing both cost and performance.
2
Select a Scalable Storage Solution
A storage solution capable of handling large volumes of diverse data is crucial. Object storage is the best practice for this architecture due to its scalability and its ability to store unstructured, semi-structured and structured data.
UltiHash Object Storage combines efficient deduplication with high performance, significantly reducing storage costs and improving access times.
3
Set Up Data Ingestion Pipelines
Developing pipelines to bring data into the lakehouse from various sources is essential. This includes batch processing tools like Apache Nifi for periodic data loads and streaming tools like Apache Kafka for real-time data ingestion. These pipelines ensure continuous and reliable data flow into the lakehouse.
Efficient data ingestion pipelines reduce latency and ensure timely availability of data in object storage for downstream processing.
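As a rough illustration of such a pipeline, the sketch below uses Spark Structured Streaming to read events from a Kafka topic and land them in object storage as a Delta table. The broker address, topic name, bucket paths and column casts are hypothetical placeholders.

```python
# Sketch of a streaming ingestion pipeline: Kafka topic -> object storage (Delta).
# Broker address, topic name and paths are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ingest-telemetry").getOrCreate()

raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")
    .option("subscribe", "vehicle-telemetry")
    .option("startingOffsets", "latest")
    .load()
)

events = raw.selectExpr("CAST(key AS STRING) AS sensor_id",
                        "CAST(value AS STRING) AS payload",
                        "timestamp")

query = (
    events.writeStream.format("delta")
    .option("checkpointLocation", "s3a://lakehouse/_checkpoints/vehicle-telemetry")
    .outputMode("append")
    .start("s3a://lakehouse/raw/vehicle-telemetry")
)
query.awaitTermination()
```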
4
Implement Data Processing and Transformation
Data processing frameworks such as Apache Spark or Databricks are used to clean, transform, and enrich the data. This step ensures that raw data is converted into a structured format suitable for analysis and storage, and prepares the data for machine learning operations by ensuring it is clean, consistent, and ready for model training and deployment.
Effective data processing minimizes storage bloat and optimizes the structure of data stored in object storage, enhancing query performance.
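A hedged sketch of such a transformation step is shown below: raw JSON payloads are parsed, typed, deduplicated and written back in an analysis-ready, partitioned layout with PySpark. The schema, column names and paths are hypothetical.

```python
# Sketch of a clean/transform step: raw payloads -> typed, deduplicated table.
# Schema, column names and paths are hypothetical.
from pyspark.sql import SparkSession, functions as F, types as T

spark = SparkSession.builder.appName("transform-telemetry").getOrCreate()

schema = T.StructType([
    T.StructField("metric", T.StringType()),
    T.StructField("value", T.DoubleType()),
])

raw = spark.read.format("delta").load("s3a://lakehouse/raw/vehicle-telemetry")

curated = (
    raw.withColumn("data", F.from_json("payload", schema))
       .select("sensor_id", "timestamp", "data.metric", "data.value")
       .dropDuplicates(["sensor_id", "timestamp", "metric"])
       .filter(F.col("value").isNotNull())
       .withColumn("date", F.to_date("timestamp"))
)

(curated.write.format("delta")
        .mode("overwrite")
        .partitionBy("date")
        .save("s3a://lakehouse/curated/vehicle-telemetry"))
```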
5
Integrate Metadata Management
Implementing a metadata management system, like Apache Atlas or AWS Glue Data Catalog, provides data discoverability, governance, and lineage tracking. This helps maintain data quality and compliance with regulatory standards and ensures the integrity and reliability of data used in machine learning operations.
Proper metadata management ensures efficient organization of data in object storage, facilitating quick access and compliance.
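For instance, with the AWS Glue Data Catalog this can be as simple as the boto3 sketch below, which registers a logical database for the curated layer and lists the tables (and their storage locations) that crawlers or engines have catalogued. Database name, region and prefixes are hypothetical.

```python
# Sketch of registering a database and inspecting tables in the AWS Glue Data Catalog.
# Database name and region are hypothetical; credentials come from the environment.
import boto3

glue = boto3.client("glue", region_name="eu-central-1")

# Create a logical database for the curated layer (idempotence not handled here).
glue.create_database(DatabaseInput={
    "Name": "lakehouse_curated",
    "Description": "Curated vehicle telemetry tables",
})

# A crawler (or an engine like Spark) typically registers tables; here we just list them.
for table in glue.get_tables(DatabaseName="lakehouse_curated")["TableList"]:
    print(table["Name"], table.get("StorageDescriptor", {}).get("Location"))
```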
6
Deploy Analytics and Query Engines
Deploying query engines such as Presto, Trino, or Amazon Athena facilitates efficient data querying and analytics. These tools support complex queries and integration with BI tools like Tableau, Power BI, and Looker.
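As an example of what this looks like from the analyst side, here is a small sketch querying the lakehouse through Trino's Python client. The host, catalog, schema and table names are hypothetical.

```python
# Sketch of querying the lakehouse through Trino from Python.
# Host, catalog, schema and table names are hypothetical.
import trino

conn = trino.dbapi.connect(
    host="trino.internal", port=8080, user="analyst",
    catalog="delta", schema="curated",
)
cur = conn.cursor()
cur.execute("""
    SELECT metric, avg(value) AS avg_value
    FROM vehicle_telemetry
    WHERE date >= DATE '2024-01-01'
    GROUP BY metric
""")
for metric, avg_value in cur.fetchall():
    print(metric, avg_value)
```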
7
Optimize for Performance
Optimization techniques, including partitioning, indexing, and caching, improve data access speeds and query performance. These techniques help in managing large datasets efficiently.
Performance optimizations reduce I/O operations on object storage, lowering costs and speeding up data processing.
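A brief sketch of these techniques in PySpark is shown below: filtering on a partition column to prune data, caching a hot subset for iterative analysis, and compacting small files (using Delta Lake's OPTIMIZE/ZORDER where available). Paths and column names are hypothetical.

```python
# Sketch of common optimizations: partition pruning, caching, and file compaction.
# Paths and column names are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("optimize").getOrCreate()
path = "s3a://lakehouse/curated/vehicle-telemetry"  # partitioned by `date`

# Partition pruning: filtering on the partition column avoids scanning the full table.
recent = spark.read.format("delta").load(path).filter(F.col("date") >= "2024-01-01")

# Cache a frequently re-used subset in memory for iterative analysis.
recent.cache()
recent.groupBy("metric").avg("value").show()

# Compact small files and co-locate rows on a frequently filtered column.
spark.sql(f"OPTIMIZE delta.`{path}` ZORDER BY (sensor_id)")
```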
8
Set Up Monitoring and Logging
Monitoring and logging tools like Prometheus, Grafana, and AWS CloudWatch are used to track system performance, data ingestion, and processing tasks. These tools help in identifying and resolving issues promptly.
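To give a flavour of how an ingestion job might be instrumented, the sketch below exposes throughput, error and lag metrics with the prometheus_client library so Prometheus can scrape them and Grafana can chart them. The metric names, scrape port and simulated batch values are hypothetical.

```python
# Sketch of exposing ingestion metrics for Prometheus to scrape (and Grafana to chart).
# Metric names, port and the simulated batch are hypothetical placeholders.
import random
import time

from prometheus_client import Counter, Gauge, start_http_server

INGESTED = Counter("lakehouse_records_ingested_total", "Records written to the lakehouse")
FAILED = Counter("lakehouse_records_failed_total", "Records that failed validation")
LAG = Gauge("lakehouse_ingest_lag_seconds", "Seconds behind the newest source event")

start_http_server(9108)  # Prometheus scrapes http://host:9108/metrics

while True:
    batch = random.randint(50, 200)   # placeholder for a real ingestion batch
    INGESTED.inc(batch)
    if random.random() < 0.1:
        FAILED.inc()
    LAG.set(random.uniform(0, 30))
    time.sleep(5)
```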
9
Plan for Maintenance and Scalability
Regular maintenance tasks, such as updating software and scaling resources, ensure the lakehouse remains efficient and reliable. Planning for future data growth and scaling needs is critical to sustaining long-term operations.
Scalable storage solutions allow for seamless expansion, ensuring that the infrastructure can handle increasing data volumes without degradation in performance.
UltiHash is the storage backbone for your lakehouse.
AI/ML models that deliver high quality and rapid time-to-insight demand significant growth in both computing resources and storage capacity. In this context, scalability and performance translate into more storage capacity and more compute power. However, throwing ever more resources and money at the problem simply isn’t sustainable: what the current market overlooks is the need for resource-consumption control. Users suffer from a lack of price predictability due to expensive storage solutions and sky-high I/OPS and data egress fees.
Lakehouses require stable, scalable, high-performance data storage. More specifically, because lakehouses store several data types and formats, they take a multi-engine approach. This requires organising and managing both the ingestion process and the data at rest in order to achieve consistency across the data infrastructure. Open table formats such as Iceberg or Delta Lake facilitate this process, orchestrating different ETL tools and enabling organisations to use different compute engines depending on the exact needs of their divisions or products.
UltiHash offers lakehouses a storage backbone that can handle scalability in a sustainable manner while maintaining high speed and interoperability.
While cloud migrations have been a major topic over the last 5 years, this doesn’t imply that on-premises infrastructures are outdated. On the contrary, enterprises typically harness the power of both approaches together in a hybrid environment, often keeping a Lakehouse in the cloud to enable use cases across the organisation while keeping older data or training sets on-premises to bring cost-efficiency to the MLOps lifecycle.
Software-defined solutions allow for high compatibility with the underlying infrastructure, regardless of whether it is on-premises, cloud, hybrid or multi-cloud.
UltiHash decided to go the extra mile by adopting an Infrastructure as Code (IaC) approach, enabling users to set up UltiHash seamlessly anywhere within their infrastructure.
GET EARLY ACCESS
Resource efficiency and performance?
UltiHash bridges the gap.
Resource-efficient scalable storage
We offer fine-granular sub-object deduplication across the entire storage pool, enabling optimal scalability. The result? Storage volumes do not grow linearly with your total data.
Lightweight, CPU-optimized deduplication
We maintain high performance through an architecture designed to handle high IOPS. Moreover, our lightweight deduplication algorithm keeps CPU time to a minimum, delivering fast reads to optimize time-to-model and time-to-insights.
Flexible + interoperable via S3 API
UltiHash offers high interoperability through its native S3-compatible API. UltiHash supports processing engines (Flink, PySpark), ETL tools (Airflow), open table formats (Delta Lake, Iceberg) and query engines (Presto, Hive). If you’re using a tool we don’t support yet, let us know and we’ll look into it!
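Because the API is S3-compatible, existing S3 clients can point at it directly. Below is a hedged sketch using boto3; the endpoint URL, credentials and bucket name are hypothetical placeholders and are not taken from UltiHash documentation.

```python
# Sketch of talking to an S3-compatible store (such as UltiHash) via boto3.
# The endpoint URL, credentials and bucket name are hypothetical placeholders.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://ultihash.example.internal",  # your own endpoint
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

s3.create_bucket(Bucket="lakehouse")
s3.put_object(Bucket="lakehouse", Key="raw/telemetry/sample.json",
              Body=b'{"value": 92.5}')

obj = s3.get_object(Bucket="lakehouse", Key="raw/telemetry/sample.json")
print(obj["Body"].read())
```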