Why object storage is the future of AI/ML storage

Tom Lüdersdorf, CEO
Amazon's recent launch highlights how object storage is the way forward for large datasets

The recent announcement of Amazon's S3 Express One Zone tier represents a major moment for the data industry. It’s confirmation that high-speed object storage is the way forward for data-intensive use cases in 2024 and beyond - especially in AI and ML applications.

Over the last few years, object storage has steadily grown in stature. Initially, it was considered simply a ‘cheap and deep’ storage solution. But more recently it has proven itself ready for high-performance workloads, and emerged as the ideal solution for AI and ML applications thanks to its scalability, durability, and cost-effectiveness. Leaders across industries as varied as autonomous vehicle development, medical imaging, climatology and urban infrastructure analytics are turning to object storage as the cornerstone of their tech stacks.

Now, in addition to its now-ubiquitous standard S3 object storage, which launched nearly 20 years ago, Amazon is finally offering a high-performance tier. S3 Express One Zone claims up to 10x faster data access, with lower latency and a design that facilitates the expansive parallel computation needed for AI/ML. That higher performance comes at a much higher price, however. As such, it is not meant to replace standard S3, but rather to serve as temporary storage for co-located compute.

So, as the largest cloud storage provider tacitly acknowledges that object storage is the future for AI/ML, let’s reflect. What is it about object storage that makes it so well-suited to these use cases?

Let’s start with the basics: growth. As the unstructured datasets on which ML is performed become ever larger and more complex, so too does the underlying infrastructure. Traditional file and block storage systems typically scale vertically, and so have inherent limits on their capacity to grow. For example: because network-attached storage (NAS) is designed for smaller-scale file sharing, it struggles with performance as file counts grow, since the hierarchy of a file-based system makes each request take longer to fulfill as path traversals lengthen. Distributed file systems (DFS) address this by spreading data across multiple servers, and so can handle larger data volumes better than NAS - but a DFS is still a file-based solution at its core, so throughput on each node is limited in the same way as with NAS. Ultimately, DFS faces the same complexities in scaling efficiently - multiplied by network latency.

Not so with object storage, where data is stored in a flat structure, grouped into broad buckets, and organized using detailed metadata at the object level instead of labyrinthine hierarchies. This design means items can be accessed rapidly, even within extremely large datasets. Object stores can scale horizontally almost without limit, accommodating the sizable datasets required in AI and ML - to petabyte scale and beyond.
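The contrast above can be sketched in a few lines of Python. This is a conceptual model only, not a real storage engine: it shows how a flat keyspace resolves any object in a single lookup by its full key, whereas a hierarchical file system resolves each path component in turn. The keys and contents are illustrative placeholders.

```python
# Conceptual sketch (not a real storage engine): an object store addresses
# data by its full key in one flat namespace.
flat_store = {
    "datasets/images/frame-00042.png": b"<image bytes>",
    "datasets/labels/frame-00042.json": b'{"label": "pedestrian"}',
}

# One lookup by full key, regardless of how many objects the bucket holds.
# The slashes are just a naming convention, not a directory tree.
obj = flat_store["datasets/labels/frame-00042.json"]
print(obj)

# A hierarchical file system instead resolves each path component in turn,
# so deeper trees mean more resolution steps per request.
path_components = "datasets/labels/frame-00042.json".split("/")
print(len(path_components))  # three tree lookups vs. one flat-store lookup
```

The point is not the dictionary itself but the access pattern: lookup cost stays flat as the keyspace grows, rather than scaling with directory depth.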

As well as enabling rapid data access within large datasets, object storage’s metadata implementation can itself be a boon for AI/ML applications. Object storage supports customizable and expandable metadata schemas, which can make it easier for these systems to locate, identify, and use the data they need. File and block storage, by contrast, support only very basic metadata, such as creation timestamps and access rights.
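To make the metadata point concrete, here is a minimal sketch, again a simplified model rather than any particular product’s API: each object carries its own attribute dictionary, so a pipeline can select training data by attributes without scanning file contents or encoding meaning into directory names. The keys and attribute names below are hypothetical examples from the medical-imaging case.

```python
# Simplified model of per-object metadata in a flat keyspace.
# Keys and attributes are illustrative, not a real schema.
objects = {
    "scans/ct-0001.dcm": {"modality": "CT", "anonymized": "true", "body-part": "chest"},
    "scans/mr-0417.dcm": {"modality": "MR", "anonymized": "true", "body-part": "brain"},
    "scans/ct-0093.dcm": {"modality": "CT", "anonymized": "false", "body-part": "chest"},
}

# Select only anonymized chest CT scans for a training run,
# using metadata alone - no object payload is ever read.
selected = [
    key
    for key, meta in objects.items()
    if meta["modality"] == "CT"
    and meta["anonymized"] == "true"
    and meta["body-part"] == "chest"
]
print(selected)  # ['scans/ct-0001.dcm']
```

In a real object store, the same idea applies: user-defined metadata travels with each object, so selection logic can live in the pipeline rather than in a brittle folder hierarchy.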

Then there’s the question of money: AI/ML can require petabytes of storage if not more, especially in enterprise environments, and object storage provides a more cost-effective option than traditional solutions like NAS - in particular for the semi-structured data most commonly found in training datasets. In addition, object storage supports data-reduction techniques that can significantly shrink the disk footprint and overall cost (such as UltiHash’s cluster-wide deduplication technology). Moreover, object storage systems are often software-defined and can be hosted on standard servers, offering a lower total cost of operation than traditional proprietary storage solutions.

Finally, compatibility and integration are paramount given the automated pipelines and multiplicity of data types involved in AI/ML. Object storage solutions are accessed via RESTful interfaces - notably the de facto standard that has formed around Amazon S3’s API (which UltiHash supports). Strong precedents exist for all manner of operations, including authentication and access, facilitating the easy integration of software in both on-premises and public cloud environments.

All these factors come together to make high-performance object storage a clear winner when it comes to modern AI/ML storage. It’s heartening to see this acknowledged in Amazon’s latest launch, and I’m really excited to see what breakthroughs will come from these data-intensive use cases.