Runbook - Building best practices as a first principle
Moe Majeed (mmajeed@netflare.dev) | Fri Aug 05 2022
As a Software Engineer, I used to find going oncall to be the most stressful part of the job. It’s an incredibly difficult role, as I wasn’t just responsible for my own contributions, but also for those of all my team members and for everything that was developed before I even joined the team. I also found that it didn’t really get easier over time, as our system was constantly evolving as we built new features, so I was always playing catch-up.
The worst feeling is when you’re oncall and you get paged for a system that you’re not familiar with. The fact that you get paged means that this is a time-sensitive issue, so there is no room to dive deep and learn about the system; you need to act fast. Most teams I have worked with had some best practices to help with these situations. A common solution is to use a note-taking tool, such as a self-hosted wiki or Notion, to write up an article that explains how to debug the problem and perhaps provides step-by-step instructions for fixing it. While this did alleviate some of the pain, it wasn’t a perfect solution: the bottleneck shifted to searching for these articles among all your other notes, assuming you were lucky and one had been written at all.
At Netflare, we wanted to build these best practices as first-principle features instead of treating them as an afterthought. We therefore added a Runbook system to our line of products. It allows you to document your routine procedures and then execute them through easy-to-follow, step-by-step instructions.
What elevates Netflare Runbook over basic note-taking apps is that we built the best practices directly into the system. We created a dedicated space where you can build a library of procedures so that, during an emergency, you can quickly scan your library for exactly what you need. When you start the execution, we help you stay focused on the procedure through step-by-step instructions so that you don’t make a mistake, forget a step or get sidetracked. When you do encounter a problem with the execution, we help you roll back the change so that you can safely revert the steps that have already been started.
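To make the execute-then-roll-back idea concrete, here is a minimal sketch in Python. It is not how Netflare Runbook is implemented; the names `Step` and `execute_runbook` are hypothetical, and the sketch assumes every step comes with its own undo action. Steps run in order, and if one fails, every step that was started is reverted in reverse order.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    """One instruction in a procedure (hypothetical structure)."""
    name: str
    run: Callable[[], None]   # performs the step
    undo: Callable[[], None]  # reverts the step


def execute_runbook(steps: List[Step]) -> bool:
    """Run steps in order; on failure, undo every started step in reverse.

    Returns True if all steps succeeded, False if a rollback happened.
    """
    started: List[Step] = []
    for step in steps:
        # Record the step before running it: a step that fails halfway
        # may still have made changes that need to be reverted.
        started.append(step)
        try:
            step.run()
        except Exception:
            for done in reversed(started):
                done.undo()
            return False
    return True
```

Undoing in reverse order matters because later steps often depend on earlier ones, so they should be reverted first.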
As part of building a unified system, we also integrated Runbook with our other products. You can link Alarms to Runbooks, so that you're not wasting time searching through your documents for the appropriate solution when an alarm is triggered. We also built an integration with our Ticketing system, so that stakeholders can follow the progress of the execution directly in the ticket instead of asking you for updates.
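As a rough illustration of the Alarm-to-Runbook relationship, the sketch below keeps a simple lookup from alarm name to runbook name. The class `RunbookLinks`, its methods, and the alarm and runbook names are all hypothetical and only show the idea; they are not the Netflare API.

```python
from typing import Dict, Optional


class RunbookLinks:
    """Hypothetical registry that links alarm names to runbook names."""

    def __init__(self) -> None:
        self._by_alarm: Dict[str, str] = {}

    def link(self, alarm_name: str, runbook_name: str) -> None:
        # Set up the relationship ahead of time, not during the incident.
        self._by_alarm[alarm_name] = runbook_name

    def runbook_for(self, alarm_name: str) -> Optional[str]:
        # Returns None when no runbook has been linked to the alarm.
        return self._by_alarm.get(alarm_name)


links = RunbookLinks()
links.link("db-cpu-high", "Fail over primary database")
```

When the `db-cpu-high` alarm fires, `links.runbook_for("db-cpu-high")` gives the oncall engineer the linked procedure directly, with no searching involved.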
I’m very passionate about bringing you best practices built on first principles. If you have any best practices you found helpful during your oncall, I would love to hear about them. In the meantime, check out our Runbook.