Avoiding storage bottlenecks in AI infra is crucial. UltiHash lets you see the impact of storage on the rest of your stack.
Why observability matters
In AI and data-intensive applications, understanding system behavior is crucial. When a pipeline behaves unexpectedly, a data job silently fails, or GPU utilization drops, you need to know why, and you need to move fast. Observability gives you that visibility: it's about understanding how your system actually behaves in practice and spotting issues before they turn into blockers, across the whole stack from compute to storage, so you can troubleshoot with confidence instead of guesswork. Through OpenTelemetry, an open-source observability framework, UltiHash lets you see the impact of storage on the rest of your stack and processes, making it easier to stay in control as things scale.
What to watch for in storage observability
When you’re building AI systems, issues rarely come from just one place. Maybe training is slower than expected. Maybe your LLM responses start to lag under pressure. Or maybe your GPU usage is lower than it should be. Without observability, it’s hard to tell what’s actually happening, and even harder to fix it before it costs you time or budget.
Storage isn’t isolated from the rest of your stack. It directly affects how efficiently your models process data, and indirectly affects how long you keep expensive compute running. Observability gives you the full picture, helping you understand not just whether something is wrong, but why, and where to look next. When metrics, logs, and traces are exported to your monitoring platform (e.g. Prometheus or Datadog), you can configure alerts, track performance trends, and investigate bugs without guesswork.
Take a training pipeline. You’re running jobs on expensive GPUs, but your dashboard shows they're only active part of the time. It’s not always obvious where the slowdown is coming from. It could be preprocessing or data loading, but it could also be the way storage is accessed mid-run. With observability in place, you can trace it: maybe the system is pulling too many small files in sequence, or streaming from a bucket that’s overloaded. You can track throughput and latency per bucket, spot I/O bottlenecks, and set alerts if performance drops below a threshold. These patterns are hard to catch without metrics. Once you see them, you can adjust the pipeline: parallelize reads, optimize file structure, or rebalance your workloads.
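As a rough sketch, here’s what such an alert could look like as a Prometheus alerting rule. The metric name (ultihash_read_bytes_total), its bucket label, and the 100 MB/s threshold are placeholders rather than documented UltiHash metrics, so swap in whatever your deployment actually exports:

# Sketch: alert when per-bucket read throughput stays low.
# Metric name, label, and threshold below are hypothetical placeholders.
groups:
  - name: storage-throughput
    rules:
      - alert: BucketReadThroughputLow
        # average read throughput per bucket over the last 5 minutes
        expr: sum by (bucket) (rate(ultihash_read_bytes_total[5m])) < 100 * 1024 * 1024
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Read throughput on bucket {{ $labels.bucket }} has been below 100 MB/s for 10 minutes"

The 10-minute hold-off keeps short dips (say, between epochs) from paging anyone; only sustained drops fire.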
Let’s take the case of retrieval-augmented generation (RAG). Things can work smoothly during dev, but as usage grows, with more users, more requests, and longer documents, the system starts to strain. Retrieval gets slower and responses get choppier. Observability helps you see what’s breaking down under load. For instance, you might notice spikes in latency when chunked documents or embeddings are pulled from object storage. If you export access logs and latency metrics, you can catch those spikes early and act before user experience degrades. That’s not an obvious connection if you’re only looking at the LLM layer. But once it’s surfaced, you can tune how and when you load from storage, cache more intelligently, or redesign part of the pipeline to keep pace.
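One way to make those spikes easy to catch is to precompute a tail-latency series you can graph and alert on. Here’s a minimal sketch as a Prometheus recording rule, assuming a request-duration histogram is exported; the metric name ultihash_request_duration_seconds_bucket and its operation and bucket labels are assumptions, not guaranteed names:

groups:
  - name: storage-latency
    rules:
      # 99th-percentile GET latency per storage bucket over the last 5 minutes
      - record: storage:get_latency_seconds:p99
        expr: |
          histogram_quantile(
            0.99,
            sum by (le, bucket) (rate(ultihash_request_duration_seconds_bucket{operation="GET"}[5m]))
          )

Plot that series next to request volume at the LLM layer and it becomes obvious whether choppy responses line up with storage-side latency.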
In GenAI search interfaces or semantic engines, object storage often holds the unstructured data behind the scenes: vector files, audio transcripts, image metadata. These are read during retrieval, often in real time. If your graph-based engine is returning results slowly, it’s tempting to look at the model. But sometimes the delay is upstream: the system is stalling while it reads from storage, especially under concurrent queries or when file access isn’t optimized. With logs and traces in place, you can pinpoint where in the retrieval path reads slow down, how often retries are triggered, or whether certain objects consistently take longer to fetch. The cause could be file size, read concurrency, or simply the way your data is structured. Once you know, you can act, whether that’s restructuring your data, batching queries differently, or tuning how files are accessed.
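If traces flow through an OpenTelemetry Collector, one option for surfacing exactly these slow reads is tail-based sampling: keep every trace that crossed a latency threshold and sample the routine fast ones away. Here is a sketch of the Collector-side configuration, assuming the contrib distribution (which ships the tail_sampling processor) and a 500 ms threshold chosen arbitrarily:

processors:
  tail_sampling:
    decision_wait: 10s            # wait for all spans of a trace before deciding
    policies:
      - name: keep-slow-storage-reads
        type: latency
        latency:
          threshold_ms: 500       # keep any trace slower than 500 ms

service:
  pipelines:
    traces:
      receivers:
        - otlp
      processors:
        - tail_sampling
      exporters:
        - otlp                    # placeholder: point this at your traces backend

This keeps your tracing backend lean while the slow fetches and retry storms you actually want to inspect are always retained.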
Setting up OpenTelemetry with UltiHash
Now that we’ve established why observability matters, here’s how to set it up with UltiHash: enable OpenTelemetry and configure the telemetry settings in the UltiHash Helm values, specifying the OpenTelemetry Collector endpoint so metrics and traces are exported to your preferred observability backend.
- Export metrics and logs to external systems like Prometheus and Loki.
- Configure the OpenTelemetry Collector to export data to your monitoring platform (e.g. Prometheus), as in the example below; refer to the OpenTelemetry documentation for exporter details.
- Set up monitoring early to ensure you can track system performance and diagnose issues as they arise.
collector:
  config:
    exporters:
      prometheus/mycompany:
        # address where the Collector exposes metrics for Prometheus to scrape
        endpoint: "1.2.3.4:1234"
    service:
      pipelines:
        metrics:
          receivers:
            - otlp
          exporters:
            - prometheus/mycompany
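The snippet above only wires up metrics. If you also want logs in Loki, as mentioned earlier, the same values block can carry a logs pipeline. Here’s a sketch, assuming a Collector distribution that includes the loki exporter and a Loki instance at the placeholder URL below:

collector:
  config:
    exporters:
      loki/mycompany:
        # placeholder push URL - point this at your own Loki instance
        endpoint: "http://loki.mycompany.internal:3100/loki/api/v1/push"
    service:
      pipelines:
        logs:
          receivers:
            - otlp
          exporters:
            - loki/mycompany

One thing worth noting: the prometheus exporter in the metrics example doesn’t push anything. It exposes metrics at the configured address (here 1.2.3.4:1234) for your Prometheus server to scrape, so make sure a scrape job points at it.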
Conclusion
UltiHash is the storage layer, but in modern AI pipelines, storage doesn’t operate in isolation. It directly shapes how training runs perform, how fast retrieval pipelines respond, and how reliably systems scale. With OpenTelemetry integrated, UltiHash slots into your existing observability stack (Prometheus, Datadog, or whatever you’re using) and gives you visibility into how storage impacts the end-to-end process. From monitoring throughput to spotting read delays, it turns storage from a blind spot into something you can track, understand, and optimize.
For detailed setup instructions, refer to the UltiHash documentation.