How To Survive On-Call Without Losing Your Mind
The Stoic Developer's Playbook for Handling Production Chaos
It’s Friday at 6 PM, you’re literally putting on your coat to leave for dinner with friends, and your phone buzzes. Not just any buzz: that specific, gut-wrenching buzz of your monitoring app. Every customer-facing API just started timing out, and guess what? You’re on call.
Your heart rate spikes, your mind races through a dozen worst-case scenarios, and suddenly you’re that person frantically typing on their laptop.
Sound familiar?
Being on call transforms you from a code writer into a guardian of live systems that real people depend on. It’s one of those parts of the job that nobody really talks about in coding bootcamps or computer science classes. But here’s the thing: this responsibility doesn’t have to turn you into an anxious mess.
The Reality Check: Shit’s Gonna Break
Let’s start with the uncomfortable truth: your systems are going to fail. That perfectly tested feature you deployed last week? There’s an edge case nobody thought of, and it’s going to blow up at 2 AM on a Sunday. The database that’s been rock solid for months? It’ll decide to have a meltdown right when traffic spikes during your biggest sale of the year.
This isn’t pessimism, it’s reality. And once you accept that failure is inevitable, you can stop wasting energy trying to prevent every possible issue and start focusing on what actually matters: how quickly and effectively you respond when things go sideways.
Think about it like this: you wouldn’t drive a car without wearing a seat belt, right? You’re not planning to crash, but you’re prepared for the possibility. On-call duty works the same way.
Mental Preparation: Building Your Stoic Armor
Here’s where that ancient philosophy stuff actually becomes useful in your day job. The Stoics had a practice called negative visualization: imagining bad scenarios ahead of time so you’re mentally prepared when they happen.
It sounds morbid, but it works.
Before your next big release, spend some time thinking through the worst-case scenarios. What if the database crashes? What if your internet goes out and you can’t access the systems? What if that third-party API you depend on decides to take an unscheduled vacation? What if multiple things fail at once?
I’m not saying you should spiral into anxiety about every possible failure. But having thought through these scenarios means when they actually happen, your brain doesn’t freeze up. You’ve already mentally rehearsed the response.
And here’s a crucial mindset shift: production issues don’t reflect your worth as a developer. That voice in your head saying “I’m terrible at this” when systems go down? Tell it to shut up. Systems fail because systems are complex, not because you suck at your job.
Every incident is a learning opportunity and a chance for your systems to evolve. Yes, it’s stressful as hell when it’s happening, but it’s as much a part of the job as coding itself.
Technical Preparation: Your Safety Net
All the mental preparation in the world won’t help if you don’t have the technical foundation to back it up. This is where you channel that anxiety into something productive.
Build Bulletproof Runbooks
Your runbook isn’t just a document, it’s your lifeline when you’re half-asleep and trying to figure out why everything’s on fire. Here’s what should be in there:
Step-by-step troubleshooting guides for common issues
Quick commands to check system health
Contact information for escalations
Links to dashboards and monitoring tools
Recovery procedures for different failure scenarios
That one weird workaround that saved your ass six months ago
Make these runbooks so detailed that a sleep-deprived version of yourself can follow them. Because at some point, you will be that sleep-deprived version of yourself.
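One way to make "quick commands to check system health" friendly to a half-asleep reader is to wrap them in a single script that prints one line per check. Here’s a minimal sketch of that idea. The check names, the `disk_has_headroom` helper, and the 10% free-space threshold are all made up for illustration; swap in whatever your own runbook actually calls for.

```python
import shutil
from typing import Callable, Dict, List, Tuple


def run_checks(checks: Dict[str, Callable[[], bool]]) -> List[Tuple[str, str]]:
    """Run every check and return (name, status) pairs in the order given."""
    results = []
    for name, check in checks.items():
        try:
            status = "OK" if check() else "FAIL"
        except Exception as exc:
            # A check that blows up should not hide the results of the others.
            status = f"ERROR ({exc})"
        results.append((name, status))
    return results


def disk_has_headroom(path: str = "/", min_free_fraction: float = 0.10) -> bool:
    """Hypothetical check: is at least 10% of the disk still free?"""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total >= min_free_fraction


if __name__ == "__main__":
    # Hypothetical check list; a real runbook would add its own entries here.
    checks = {
        "disk space": disk_has_headroom,
    }
    for name, status in run_checks(checks):
        print(f"{name:<12} {status}")
```

The point of catching exceptions per check is that one broken probe still leaves you with a readable summary of everything else, which is exactly what you want at 2 AM.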
Set Up Smart Alerts
Your monitoring should be like a really good friend: there when you need them, but not constantly bothering you with bullshit. You want alerts for things that actually matter: customer-impacting issues, resource exhaustion, security breaches.
Skip the alerts for every tiny blip that auto-resolves in thirty seconds. Those just train you to ignore notifications, which is the last thing you want when a real crisis hits.
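To make the "ignore blips that auto-resolve" idea concrete, here’s a rough sketch of a debounced alert: it only pages once a check has been failing continuously for a hold period. The `DebouncedAlert` class and its `observe` method are hypothetical names for illustration, not any monitoring tool’s API, and the 30-second default is simply borrowed from the example above. Most real monitoring systems have their own built-in way to express this, so treat it as a picture of the logic rather than something to deploy.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DebouncedAlert:
    """Page only when a check has failed continuously for `hold_seconds`."""

    hold_seconds: float = 30.0  # assumed threshold, not a recommendation
    _failing_since: Optional[float] = None
    _fired: bool = False

    def observe(self, healthy: bool, now: float) -> bool:
        """Record one check result; return True only when it is time to page."""
        if healthy:
            # The blip resolved on its own: reset and stay quiet.
            self._failing_since = None
            self._fired = False
            return False
        if self._failing_since is None:
            self._failing_since = now
        sustained = now - self._failing_since >= self.hold_seconds
        if sustained and not self._fired:
            self._fired = True  # page once per sustained failure, not every tick
            return True
        return False


# A short failure that recovers at t=10 never pages; the one starting at t=20 does.
alert = DebouncedAlert(hold_seconds=30)
for t, healthy in [(0, False), (10, True), (20, False), (40, False), (50, False)]:
    if alert.observe(healthy, now=t):
        print(f"page on-call at t={t}s")  # prints: page on-call at t=50s
```

The trade-off is deliberate: you accept finding out about a real outage up to thirty seconds later in exchange for a pager you still trust.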
Practice Your Incident Response
Run through your incident response procedures regularly, not just when things are actually broken. It’s like a fire drill: boring when you’re doing it, invaluable when you actually need it.
When Things Go Sideways: The Art of Calm Chaos
So it’s happening. The alerts are firing, customers are complaining, and your manager is asking for updates every five minutes. This is where all that preparation pays off.
First, breathe. Seriously. Take three seconds to center yourself. Panic doesn’t fix production issues – clear thinking does.
Follow your runbooks, even if you think you remember what to do. When you’re stressed, your memory plays tricks on you. Trust the documented process over your adrenaline-fueled recollections.
Communicate early and often. Send that first “we’re aware of the issue and investigating” message as soon as you can. People can handle problems, but they hate being left in the dark.
And remember, you’re not trying to be a hero. If something is outside your expertise or if the issue is bigger than you can handle solo, escalate. There’s no shame in getting help.
Learning Without the Blame Game
Here’s where a lot of teams screw up. After an incident, they want to find someone to blame. Don’t be that team.
Run blameless postmortems focused on one question: how do we prevent this from happening again? Or if we can’t prevent it, how do we detect and recover from it faster?
Maybe you need better alerts. Maybe you need more resilient architecture. Maybe you need better documentation for that weird third-party integration that always seems to break at the worst possible moment.
Document everything you learn and integrate it into your processes. That incident that kept you up until 3 AM? Make sure it teaches you something valuable that improves your systems going forward.
Setting Boundaries: You’re On Call, Not On Chains
When you’re on call, yes, you need to be responsive. Real customers depend on these systems. Your company’s reputation (and everyone’s paycheck) relies on keeping things running.
But here’s the thing – when you’re not on call, you need to let that shit go. Don’t carry the stress of production issues when it’s someone else’s turn to hold the pager. You can’t be effective if you never mentally disconnect.
This means having proper coverage, clear escalation procedures, and team members who actually know what they’re doing. If your on-call setup requires you to be mentally present 24/7, that’s not sustainable, it’s a recipe for burnout.
The Payoff: Becoming Unshakeable
Here’s what happens when you approach on-call duty with this mindset: you become that developer everyone wants on their team. Not because you never have incidents, but because when incidents happen, you handle them like a pro.
You’re the person who stays calm while others are panicking. You’re the one with the detailed runbooks and the comprehensive monitoring. You’re the developer who learns from every incident and makes the systems stronger.
And here’s the weird part: once you stop fearing on-call duty and start seeing it as just another part of the job, it becomes way less stressful. It’s like that quote about fear: “The cave you fear to enter holds the treasure you seek.”
Your Next Steps
So what’s the biggest obstacle you’ve faced during on-call duty? Was it the technical challenge, the time pressure, or just that awful feeling in your gut when everything’s broken and people are looking to you for answers?
Take some time this week to think about your own on-call preparedness. Update those runbooks. Review your monitoring setup. Practice that negative visualization exercise: imagine the worst-case scenario and walk through how you’d handle it.
Because at the end of the day, the person who can make you a confident, effective on-call engineer is the same person who can turn you into an anxious mess: you.
The systems will fail. The alerts will fire at terrible times. But you’ll be ready.
Quote of the Day:
“The world breaks everyone, and afterward, many are strong at the broken places.” - Ernest Hemingway