Late on Friday many years ago, a nightmare unfolded. A talented but relatively inexperienced engineer was grappling with a pressing task after everyone else had gone home. After a long week, he was tired - and so made a small but significant mistake.
He wanted to make some dataset transformations, but there was no space. So he decided to do some spring cleaning. We were running on a shoestring budget, so our on-premises Hadoop cluster setup was rather tiny. Like a classic startup, we prioritized product over role-based access (something senior staff should definitely have put in place sooner). So staff could do whatever they wanted with the data.
He selected a specific area of storage, and ran the command 'sudo rm -rf *’ - recursively deleting everything. To his horror, he quickly realised that the wrong pane had been in focus - he had accidentally wiped the entire source dataset, provided by a client. Over 100 GB of data, pivotal in demonstrating our value proposition, was gone.
An easy mistake. But, embarrassed, he left without leaving a note. On Monday, our ETL pipelines all ‘turned red’, since the source data was unavailable. The mistake was eventually confessed, and of course, instead of wasting time blaming the engineer for an easy error, we banded together to remedy the data loss.
Imagine our panic when we realized standard data recovery tools would be of little use because we didn’t have a free hard drive for recovery! Luckily, the source dataset was structured and contained a few obvious patterns. Reference information had been wiped from the file system table, but we were able to ‘grep’ out the majority of the records in the end.
Many lessons could be drawn from this, but let’s focus on two main ones.
First: be wary of doing anything in production on a Friday - it carries inherent risks, as fatigue and the lack of immediate support can turn minor oversights into major crises.
Second, most importantly: all companies must foster an environment where mistakes are acknowledged as part of learning, where trust and support overrule blame and fear. Staff should feel empowered to admit errors and extract learnings from them: both soft skills and also the hard skills needed for emergency fixes. Leaders must model this, making sure teams are not just prepared to succeed but supported when they falter. Instead of arguing, they should join forces to fix things.
So: beware of Friday decisions. You may find the universe punishes you when you least expect it. Even if your change worked smooth on dev and staging environments, there is always a chance it won’t work on production. If you do it anyway, and things do go wrong, just make sure you inform your team.
And that unlucky team member? Learning from his early mistake, he went on to become a highly-respected senior data engineer in the same company.