USE CASE

Fast, efficient storage for your data lakehouse

Data lakehouses blend scalability and query capabilities to power innovation.

As value creation became increasingly data-driven, a problem emerged: the overall data pool (raw, processed, unstructured, semi-structured or structured data) ended up scattered across different applications and processing engines.

The solution? A centralized system that can host high volumes of data of different types and formats, something that sits between a data lake and a data warehouse. Lakehouses were born from this merger, taking the best of each: the storage scalability of a data lake and the query performance of a data warehouse, combined in a single source of truth for data. Users can store both raw and processed data while benefitting from a built-in query engine.

This blended approach improved data integrity, time-to-insight and modelling time. Faster and better operations give businesses a new type of competitive advantage, which is why lakehouses have become the go-to for applications such as machine learning or business intelligence.

Let’s bring a little bit of context to this challenge. Companies starting their ML journey and those managing petabyte-scale architectures share the same fate: storing the same data many times, in different locations and formats. This practice consumes enormous amounts of storage and compute resources, creates team silos, and slows down overall productivity.

Imagine a car manufacturer, working on developing a cutting-edge self-driving technology.

They've installed edge devices on cars that continuously send data to the cloud. These edge devices are sending video and LIDAR data from the front and rear cameras, as well as engine temperature readings and humidity data from a moisture detector. This manufacturer needs a storage solution that can store and process different data types and scale to train its ML models.

A lakehouse is the preferred solution in this case. The data lake part of the lakehouse stores the different data types and formats sent from the edge devices on the car, while the data warehouse part facilitates efficient querying, multi-part analysis and ML - all crucial for deriving actionable insights and advancing the development of the self-driving technology.

UltiHash is the storage backbone for your lakehouse.

Lakehouses require stable, scalable data storage that delivers high performance. Achieving the fastest time-to-insight and high-quality ML models implies continuous data growth and increasing computing needs. In practice, scalability and performance translate to more space, more power, more compute. However, the pursuit of ‘more and better for more achievements’ is not sustainable: what the current market overlooks is the need to control resource consumption.

Because lakehouses store several data types and formats, they follow a multi-engine approach. The ingestion process and the data at rest both have to be organized and managed to achieve consistency across the data infrastructure. Open table formats such as Apache Iceberg or Delta Lake facilitate this: they coordinate how engines and ETL tools read and write tables, and allow for management of data at rest - with, for example, schema evolution.

UltiHash offers lakehouses a storage backbone that can handle scalability in a sustainable manner while maintaining high speed and interoperability.

Migrating to the cloud has been a major topic over the past few years, but that does not imply that on-premises is outdated. Each company has specific architecture preferences - and they can change quite fast, e.g. deciding to migrate their lakehouse to the cloud while keeping archives on-premises for more cost-efficient MLOps.

Software-defined solutions allow for high compatibility with the underlying infrastructure setup, whether it is on-premises, cloud, hybrid or multi-cloud.

UltiHash decided to go the extra mile by adopting an Infrastructure as Code (IaC) approach, enabling users to set up UltiHash seamlessly anywhere within their infrastructure.

Resource efficiency and performance?

UltiHash bridges the gap.

Resource-efficient scalable storage

We offer fine-grained sub-object deduplication across the entire storage pool. Duplicate fragments are stored only once, which lets capacity scale efficiently. The result? Storage volumes do not grow linearly with your total data.

Lightweight, CPU-optimized deduplication

We maintain high performance through an architecture designed to handle high IOPS. Moreover, our lightweight deduplication algorithm keeps CPU time to a minimum, delivering fast reads to optimize time-to-model and time-to-insight.

Flexible + interoperable via S3 API

UltiHash offers high interoperability through its native S3-compatible API. UltiHash supports processing engines (Flink, PySpark), ETL tools (Airflow), open table formats (Delta Lake, Iceberg) and querying engines (Presto, Hive). If you’re using a tool we do not support yet, let us know and we’ll look into it!

Any questions?

What is UltiHash?

UltiHash is the neat foundation for data-intensive applications. It is powered by deduplication algorithms and streamlined storage techniques. It builds on previously integrated data to generate significant space savings while delivering high-speed access. UltiHash enhances your data management by making large datasets and data growth work in synergy with your infrastructure.

What does UltiHash offer?

UltiHash facilitates data growth within the same existing storage capacity. UltiHash deduplicates within and across datasets, from terabytes to exabytes: users store only what they truly need. It’s fast, efficient, and works at the byte level, making it agnostic to data format or type. With UltiHash, having to choose between high costs and low performance is a thing of the past.

What is object storage?

Object storage is a data storage solution suitable for storing all data types (structured, semi-structured and unstructured) as objects. Each object includes the data itself, its metadata, and a unique identifier, allowing for easy retrieval and management. Unlike traditional file or block storage, object storage is highly scalable, making it ideal for managing large amounts of unstructured data.
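
To make the object model concrete, here is a minimal Python sketch of the three parts described above: data, metadata and a unique identifier, held in a flat namespace. The names StoredObject and FlatObjectStore are hypothetical and exist only for illustration; they are not part of UltiHash or of any S3 API.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class StoredObject:
    """Illustrative object: payload, free-form metadata, unique identifier."""
    data: bytes
    metadata: dict = field(default_factory=dict)
    object_id: str = field(default_factory=lambda: uuid.uuid4().hex)


class FlatObjectStore:
    """Hypothetical in-memory store: a flat map from identifier to object."""

    def __init__(self):
        self._objects = {}

    def put(self, data: bytes, **metadata) -> str:
        # Wrap the payload with its metadata and a generated identifier.
        obj = StoredObject(data=data, metadata=metadata)
        self._objects[obj.object_id] = obj
        return obj.object_id

    def get(self, object_id: str) -> StoredObject:
        # Retrieval needs only the identifier, not a directory path.
        return self._objects[object_id]
```

Because every object is addressed by its identifier rather than by a position in a directory tree or on a block device, the namespace stays flat, which is one reason object stores tend to scale out easily.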

How does data deduplication work in UltiHash?

Data is analysed at the byte level and dynamically split into fragments, which allows the system to separate unique fragments from duplicates. UltiHash matches duplicates within and across datasets, leveraging the entirety of the data. Fragments that are unique, with no match in the dataset or in previously integrated data, are added to UltiHash, while matches are mapped to an existing fragment. This is how we keep your storage footprint growth sustainable.
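
The general idea of fragment-level deduplication can be sketched in a few lines of Python. This is a simplified illustration, not UltiHash's actual algorithm: it assumes a basic content-defined splitter (a gear-style rolling hash) and SHA-256 fingerprints, and every name, table and parameter in it (split_fragments, FragmentStore, GEAR, the mask and size limits) is hypothetical.

```python
import hashlib
import random

# Fixed pseudo-random table: one 32-bit value per byte value (illustrative).
_rng = random.Random(0)
GEAR = [_rng.getrandbits(32) for _ in range(256)]


def split_fragments(data: bytes, mask: int = 0x3F,
                    min_size: int = 16, max_size: int = 1024):
    """Yield variable-size fragments whose boundaries depend on content."""
    start, h = 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        size = i - start + 1
        # Cut when the rolling hash hits the mask, or at the size limit.
        if (size >= min_size and (h & mask) == 0) or size >= max_size:
            yield data[start:i + 1]
            start, h = i + 1, 0
    if start < len(data):
        yield data[start:]


class FragmentStore:
    """Hypothetical store that keeps each distinct fragment only once."""

    def __init__(self):
        self.fragments = {}  # fingerprint -> fragment bytes
        self.objects = {}    # object name -> ordered list of fingerprints

    def put(self, name: str, data: bytes) -> int:
        """Store an object and return the number of new bytes written."""
        new_bytes, recipe = 0, []
        for fragment in split_fragments(data):
            digest = hashlib.sha256(fragment).hexdigest()
            if digest not in self.fragments:   # unique fragment: add it
                self.fragments[digest] = fragment
                new_bytes += len(fragment)
            recipe.append(digest)              # match: reference only
        self.objects[name] = recipe
        return new_bytes

    def get(self, name: str) -> bytes:
        """Rebuild an object from its ordered fragment references."""
        return b"".join(self.fragments[d] for d in self.objects[name])
```

In this sketch, storing the same payload under two names writes new bytes only the first time; the second put returns 0 because every fragment is already present. Since the fragment pool is shared by all objects, matches are found across datasets as well as within one.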

What is unique about UltiHash?

UltiHash efficiently stores your desired data volume, providing significant space savings, high speed and the flexibility to scale up seamlessly. Increase your data volumes within the existing storage capacity, without compromising on speed.

Can UltiHash be integrated in existing cloud environments?

Absolutely - UltiHash can be integrated into existing cloud environments, such as those that leverage EBS. UltiHash was designed to be deployed in the cloud, and we can suggest specific machine configurations for optimal performance. The cloud environment remains in the hands of the administrator, who can configure it as preferred.

What API does UltiHash provide, and how does it connect to my other applications?

UltiHash provides an S3-compatible API. We chose S3 compatibility for its utility: any S3-compatible application qualifies as a native integration. We want our users to have smooth and seamless integration.
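
As an illustration of what S3 compatibility means in practice, the sketch below points a standard S3 client (boto3, a third-party Python package) at a custom endpoint. The endpoint URL, credentials, bucket and key are placeholders assumed for the example, not documented UltiHash values; only the boto3 calls themselves are standard.

```python
def make_client(endpoint_url: str, access_key: str, secret_key: str):
    """Build a standard S3 client aimed at an S3-compatible endpoint."""
    import boto3  # third-party package, imported lazily

    return boto3.client(
        "s3",
        endpoint_url=endpoint_url,            # placeholder: your deployment
        aws_access_key_id=access_key,         # placeholder credentials
        aws_secret_access_key=secret_key,
    )


def upload_and_read_back(client, bucket: str, key: str, payload: bytes) -> bytes:
    """Write one object and read it back using plain S3 calls."""
    client.create_bucket(Bucket=bucket)
    client.put_object(Bucket=bucket, Key=key, Body=payload)
    return client.get_object(Bucket=bucket, Key=key)["Body"].read()


# Example usage (hypothetical endpoint and credentials):
# client = make_client("http://localhost:8080", "ACCESS_KEY", "SECRET_KEY")
# upload_and_read_back(client, "lidar-raw", "car-42/frame-0001.bin", b"...")
```

Nothing in the calling code is specific to a storage vendor, so tools that already speak S3 should only need the endpoint and credentials changed.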

How does UltiHash ensure data security and privacy?

The user is in full control of the data. UltiHash is a foundation layer that slides into an existing IT system. The infrastructure and the data stored on it are the sole property of the user: UltiHash merely configures the infrastructure as code.

Is UltiHash suitable for both large and small scale enterprises?

UltiHash was designed to empower small and large data-driven organisations to pursue their thirst to innovate at a high pace.

What type of data can be stored in UltiHash?

Data integrated through UltiHash is read at the byte level. In other words, UltiHash's processes are not affected by the type or format of the data, and it works with structured, semi-structured and unstructured data.

What are the pricing models for UltiHash services?

UltiHash currently charges a fixed fee of $6 per TB per month - whether on-premises or in the cloud.
