Learn how UltiHash manages access natively, with policy-based control for different workflows
A unified storage layer for mixed AI data types
When we talk about data storage in AI, the conversation often revolves around training datasets. They’re large, they’re essential, and they often dominate the storage footprint. But they’re far from the only thing that matters.
Under the hood, AI pipelines generate and rely on a wide variety of data, from model checkpoints and user-submitted documents to retrieval embeddings, inference logs, and monitoring traces. All of it ends up in the same storage layer, but not all of it should be treated the same way. Some data is immutable, some append-only, some short-lived. The way it's used depends on the job: ingestion writes new documents, retrieval services read them during inference, and model checkpoints get versioned and updated by training jobs. For example, training datasets should be read-only to avoid accidental overwrites, while inference logs need continuous write access to capture outputs in real time. Without clear access boundaries, these workflows start interfering with each other.
Operational complexity demands structured access
Let’s say you’re running a RAG (retrieval-augmented generation) setup in production.
The ingestion team handles the document pipeline: uploading files, chunking them, generating embeddings. Their jobs need write access to specific buckets, but there’s no reason they should be able to read all stored content. Internal documents, customer uploads, and production embeddings might be irrelevant, or sensitive.
Meanwhile, the inference team runs the retrieval layer that serves user-facing results. Their jobs need fast, read-only access to those same embeddings, but they shouldn’t be allowed to write or delete anything, especially in production. One bad write could corrupt the index or surface broken data.
On top of that, the ops team might manage model fine-tuning. Their pipelines generate new checkpoints and may swap out model weights. Those writes should be tightly scoped: you want automation to update what’s needed while avoiding accidental overwrites of critical files.
Different teams, different workflows, different risks: without access separation, the line between dev, staging, and prod starts to blur. You either over-permission everything and risk mistakes, or you end up manually approving every request and becoming a bottleneck.
Identity and policy management in UltiHash, built for your AI workloads
UltiHash offers integrated identity and policy management, enabling you to define users, assign access keys, and enforce permissions directly within your storage cluster. This functionality is consistent across both Serverless and self-hosted deployments.
User authentication uses AWS Signature Version 4, a standard familiar to many developers. You can create users and manage their credentials using AWS-compatible tools like the AWS CLI. For example, to create a new user named foo, you would use the commands listed in the documentation here: docs.ultihash.io. The essentials are below:
# You can configure access to an UltiHash cluster in your $HOME/.aws/config file.
# We will create a profile named uh to be used with UltiHash:
[profile uh]
endpoint_url = https://my-uh-cluster.my-company.io
region = my-region
# We can now run aws commands using the profile's parameters:
$> aws --profile=uh ...
# By adding our credentials to $HOME/.aws/credentials we can also authenticate ourselves to the cluster:
[uh]
aws_access_key_id = FANCY-ROOT-KEY
aws_secret_access_key = SECRET
# Create your IAM user like this:
$> aws --profile=uh iam create-user --user-name 'foo'
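From here, a typical next step is to give foo its own access key and attach a scoped policy. The commands below follow the standard AWS CLI IAM interface; whether UltiHash accepts each call in your deployment is an assumption worth verifying against docs.ultihash.io, and policy.json stands in for a policy document you would write yourself:

```shell
# Give foo its own credentials (prints an access key id and secret --
# store them, they are only shown once in standard IAM behavior):
$> aws --profile=uh iam create-access-key --user-name 'foo'

# Attach an inline policy scoped to what foo's jobs actually need.
# policy.json is a JSON policy document you provide:
$> aws --profile=uh iam put-user-policy --user-name 'foo' \
     --policy-name 'foo-scoped-access' \
     --policy-document file://policy.json
```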
Policies in UltiHash adhere to a subset of AWS's IAM policy syntax, allowing you to assign permissions at both the user and bucket levels. These policies are written in JSON and can specify actions such as read, write, delete, and list, scoped to specific buckets or object paths. For instance, you might grant the ingestion team write access to a raw data bucket while restricting the retrieval service to read-only access on production documents. (You can see more details here: docs.ultihash.io.)
Access control in UltiHash follows the AWS IAM model, supporting users, access keys, and JSON policies, so teams can manage permissions with tools and formats they already know.
A root user account is created during the initial deployment, with credentials stored as a Kubernetes secret. This account has unrestricted access and should be reserved for cluster administration tasks. Regular users should operate under defined policies to maintain security and operational integrity.
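To bootstrap administration, you can read the root credentials back out of that Kubernetes secret. The secret name, namespace, and data keys below are placeholders — they depend on how your cluster was deployed — so substitute the ones from your own installation:

```shell
# Names here are placeholder assumptions; check your deployment for the real ones.
kubectl get secret ultihash-root-credentials --namespace ultihash \
  --output jsonpath='{.data.access_key_id}' | base64 --decode
```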
By managing access within UltiHash itself, you avoid the complexity of external identity systems, ensuring that each part of your AI pipeline interacts with storage as intended. This approach simplifies access control, reduces the risk of misconfigurations, and keeps your workflows secure and efficient.