I've been deeply involved in data platform development for quite some time, and one thing that experience has taught me is that effective risk management is crucial for building a reliable, sustainable platform that brings real value to a business. Below are a few key principles drawn from that experience, many of which I’m applying to the development of UltiHash’s high-performance object storage layer.
It’s crucial to layer defenses. A system is only ever as strong as its weakest point, so simply saying ‘let’s just put a firewall around the whole thing’ is not the answer: there should instead be security mechanisms designed into every component of the system.
That said, don’t go overboard: it’s crucial to find a balance between usability and security. Identify what parts of the system really are critical and need extra protection against intrusion or data breaches.
You should also empower your team with regular security training and audits, including risk assessments and incident response plans.
Stay informed and adaptable to the changing world of data regulations. It’s also important to hold yourself accountable with regular compliance audits (internal and external), ensuring that nothing slips under the radar.
If you would normally store personal data, consider not doing so unless it is strictly necessary. Either anonymize it or, even better, apply aggregation techniques so that individual records are not stored at all; choose whichever is most appropriate for your business use case.
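To make the distinction concrete, here is a minimal sketch contrasting the two approaches. The record fields, the salt value, and the function names are illustrative assumptions, not a prescribed schema:

```python
# Sketch: anonymization (salted hash replaces the identifier) vs.
# aggregation (only group totals survive, no individual records).
import hashlib
from collections import defaultdict

SALT = "rotate-me-regularly"  # hypothetical salt; store and rotate it securely

def anonymize(record: dict) -> dict:
    """Replace the direct identifier with a salted hash."""
    digest = hashlib.sha256((SALT + record["user_id"]).encode()).hexdigest()
    return {"user_hash": digest, "country": record["country"], "spend": record["spend"]}

def aggregate(records: list[dict]) -> dict:
    """Keep only per-country totals, so no individual record is retained."""
    totals: dict[str, float] = defaultdict(float)
    for r in records:
        totals[r["country"]] += r["spend"]
    return dict(totals)

records = [
    {"user_id": "u1", "country": "DE", "spend": 10.0},
    {"user_id": "u2", "country": "DE", "spend": 5.0},
    {"user_id": "u3", "country": "FR", "spend": 7.5},
]
print(aggregate(records))  # {'DE': 15.0, 'FR': 7.5}
```

Aggregation is the stronger option here: a salted hash can still be linked back to a person if the salt leaks, while a per-country total cannot.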
Personnel should have a clear plan of action for a data breach, so they can react quickly and effectively. They should also be actively encouraged to inform the relevant people when something goes wrong, rather than keeping silent out of fear of reprisals.
The usability of the data platform is paramount, and an important part of that is data quality and integrity.
Consider automating data quality monitoring with an off-the-shelf solution. Choose wisely when selecting the points in your data journey where you introduce these tools, since they can be computation-hungry. If your resources are restricted, do what you can: minimal data quality checks are better than none at all.
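Even a "minimal" check can be genuinely lightweight. Here is a sketch of the kind of batch-level validation I have in mind, applied at a single point in the pipeline; the field names and rules are illustrative assumptions:

```python
# Sketch: cheap per-batch data quality checks (completeness and validity).
def check_batch(rows: list[dict]) -> list[str]:
    """Return a list of human-readable problems; an empty list means the batch passed."""
    problems = []
    for i, row in enumerate(rows):
        if row.get("order_id") is None:
            problems.append(f"row {i}: missing order_id")
        if not isinstance(row.get("amount"), (int, float)) or row["amount"] < 0:
            problems.append(f"row {i}: invalid amount {row.get('amount')!r}")
    return problems

batch = [
    {"order_id": "A1", "amount": 19.99},
    {"order_id": None, "amount": -3},
]
for problem in check_batch(batch):
    print(problem)
```

A check like this costs one pass over the batch; the expensive part of off-the-shelf tools is usually profiling and cross-dataset comparison, which is why placement matters.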
If monitoring surfaces a problem, the response should involve proper root cause analysis and a fix applied at the source of the issue. Although it may be quicker in the moment, ‘solving’ the symptoms of a problem downstream from its source can cause greater headaches in the future.
Make sure you apply best practices in data platform design and keep your data immutable between the different stages of processing. Snapshots and replication can be used as well. Techniques like these can form part of a contingency plan to ensure corrupted data can be restored.
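One simple way to get immutability between stages is to have each stage write to a fresh, run-scoped path and never modify earlier outputs, so any stage can be re-read or recomputed after corruption. A minimal sketch, with a layout that is purely an illustrative assumption:

```python
# Sketch: immutable stage outputs via append-only, timestamped paths.
import datetime
import json
import pathlib
import tempfile

def write_stage_output(root: pathlib.Path, stage: str, payload: dict) -> pathlib.Path:
    """Write payload to <root>/<stage>/<run-id>.json, never overwriting history."""
    run_id = datetime.datetime.now(datetime.timezone.utc).strftime("%Y%m%dT%H%M%S%f")
    path = root / stage / f"{run_id}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(payload))
    return path

root = pathlib.Path(tempfile.mkdtemp())
raw = write_stage_output(root, "raw", {"rows": 3})
clean = write_stage_output(root, "clean", {"rows": 3, "dropped": 0})
print(raw.exists(), clean.exists())  # True True
```

The same idea scales up to object-store prefixes or snapshot-capable storage: the key property is that downstream stages only ever read, never rewrite, upstream outputs.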
In a world with this many moving parts, downtime is inevitable - but being caught off guard isn't. Make sure you have a robust disaster recovery plan: apply defensive pessimism by planning for the worst-case scenario, and work backwards from that point. Up-to-date first-response documentation is a good way to make sure everyone knows how to act.
Consider fire drills with chaos engineering tools that randomly break parts of your system, to stress-test your response in a real-world scenario.
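In production you would reach for a dedicated chaos engineering tool, but the core idea fits in a few lines: inject failures into a call path and verify that your retry or fallback logic actually holds up. Everything below is an illustrative toy, not a real framework:

```python
# Sketch: a toy chaos "fire drill" - randomly inject failures into a
# dependency call and check that the retry logic survives them.
import random

def flaky(fn, failure_rate: float, rng: random.Random):
    """Wrap fn so it sometimes raises, simulating a broken dependency."""
    def wrapped(*args, **kwargs):
        if rng.random() < failure_rate:
            raise ConnectionError("injected failure")
        return fn(*args, **kwargs)
    return wrapped

def fetch(key):  # stand-in for a real storage call
    return f"value-for-{key}"

def fetch_with_retry(fn, key, attempts=5):
    for _ in range(attempts):
        try:
            return fn(key)
        except ConnectionError:
            continue  # in real code: back off, log, alert
    raise RuntimeError("all retries exhausted")

chaotic_fetch = flaky(fetch, failure_rate=0.3, rng=random.Random(42))
print(fetch_with_retry(chaotic_fetch, "user:1"))
```

The drill is the interesting part: turn the failure rate up until something breaks, and you learn where your real recovery gaps are before an outage does.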
Make sure you have clear communication protocols to minimize downtime impact. The last thing you want to lose during a high-impact incident is time, because an engineer has to work out who to contact first and what to say.
Once the emergency has passed, it’s essential to do a post-mortem analysis, so you can learn from the mistake and apply fixes to your processes and systems to avoid it in the future.
Much of the above depends on building a culture of trust and confidence in the company, where mistakes are not punished but instead prized as learning experiences. Never tire of emphasising to everyone the importance of speaking openly about mistakes and learning from them.