Software bugs are a common occurrence in the world of technology. If you have ever written more than 10 lines of code, chances are you have accidentally caused one.
Now, I want you to think about the last time you received a message from a customer or colleague freaking out about a bug. How did it end? Was the bug so urgent that you had to stop what you were doing?
In this article, we will challenge conventional thinking about bugs and explore how much each bug is costing you and your team.
Hey there👋, my name is Krste, and I'm a full-stack developer who has been working in startups for over 7 years. Recently, I co-founded Bugpilot, a bug monitoring platform that alerts software teams about critical user-facing bugs.
100% bug-free software. Myth or Reality?
The goal of achieving 100% bug-free software has always been a tricky one. The idea of creating a bug-less app may seem ideal, but is it actually possible? The truth is that software bugs are inevitable due to complex business logic, human errors, and technology’s continuous updates. Even with careful testing and quality checks, it's difficult to completely eliminate them.
Thanks to new developer tools and improved processes, teams are now shipping new features at a lightning pace to stay relevant and competitive. And with any change to the codebase, there is always a possibility that something will break. (Having to support tons of devices and multiple browsers doesn’t help at all.)
Of course, there are cases where bugs must be close to zero. Certain industries, such as banking, healthcare, and aerospace, must ensure that mistakes are kept to a minimum, as they could potentially cost lives. That explains why most software in these industries is written in technologies from decades ago, and why people are now hesitant to touch it. 😃
But in the end, we have to ask ourselves, "Is it all worth it?" For most of us who are building marketing tools and CRMs, I believe that fixing most bugs costs more than living with them. And here's why:
NOT all bugs are made equal
Imagine you're working on an e-commerce website, and a bug breaks the checkout process, making it impossible for customers to complete their purchases. It is painfully obvious that this bug needs immediate attention, since it directly impacts core functionality and potentially results in lost sales and dissatisfied customers. In the same way, we can't treat a bug on the signup page that prevents new users from joining our new ride-sharing app the same as an issue with updating a profile photo. These black-and-white examples show that we need not only to detect bugs, but also a way to understand their impact.
However, in reality, things are not black-and-white. Most situations fall within the gray area. The question then becomes: how do we prioritize?
“wE hAvE tO fIx iT nOw oR tHe cUsToMeR wIlL LeAVe”
What I've noticed is that we tend to add an "extra" sense of urgency to bugs that our customers find. You've probably been in a similar situation when one of your bigger customers is having a minor issue, and your whole team is getting flooded with messages as if the world is on fire. 🔥
“ACME Corp. is having a problem. They are an important customer. We must fix the bug NOW!” – your boss
Based on my experience, customer-facing roles such as customer success, support, and sales tend to prioritize issues depending on how the customer reacts and how it affects their reputation. This is why everything may seem like a big deal to them, which is understandable. Nobody wants to make a bad impression or have a negative reputation, and that is exactly what bugs can cause.
The more users are affected, the more important it is
Sometimes, production issues are dealt with as they occur. You might be using a tool such as Sentry or Bugsnag to monitor errors. When a "critical" error is found, it is quickly assigned to a developer while everyone impatiently awaits an update in Slack. Typically, these tools prioritize errors based on how frequently they occur and how many users are affected.
However, the priority of most bugs is determined by either the product owner or lead developer. These roles typically have an understanding of the business logic, underlying technology, current workload, and upcoming priorities, and plan accordingly.
Their prioritization process might look something like this:
Is it critical or not?
This is where it gets a bit tricky. Critical doesn't have a clear definition. If it's a core function that is down, then it's pretty obvious. However, as you saw in the example above, what if an important customer is affected? Or a certain number of users are impacted?
How much time does it take to fix it?
Next, they will ask how much time it takes to fix the issue. If it only takes a couple of hours, they will find ways to squeeze it into this week's work.
To fix (NOW) or not to fix, that is the question.
Developers often face a dilemma when dealing with an "urgent" bug. They may either rush to finish their current task and then fix the bug, possibly introducing new bugs in the process, or they may switch their attention to the bug, abandoning their current task altogether.
There are two significant problems with these scenarios:
The team prioritizes incorrectly, resulting in fixing less relevant bugs.
Adding unnecessary urgency to non-critical bugs causes the team to switch focus.
Before addressing the next bug, it's important to consider its cost. So, how much does a bug actually cost?
How expensive are bugs?
Have you ever heard the phrase "There's no such thing as a free lunch"?
When we talk about cost, usually the first thing we think about as devs is infrastructure cost. However, when it comes to bugs, I'm talking about attention.
If you studied computer engineering or you’ve ever read about how operating systems work, you’ve probably heard of context switching.
In computing, a context switch is the process of storing the state of a process or thread, so that it can be restored and resume execution at a later point, and then restoring a different, previously saved, state. – Wikipedia
The most common examples of context switching are related to:
Multitasking: Context switching is what allows the operating system to take one process off the CPU so that another process can run. The old process's state is saved so that its execution can later resume from the same point.
Interrupts: When a hardware interrupt arrives (for example, a disk signaling that the requested data is ready), the CPU saves the current context and switches to the interrupt handler, then restores the saved context once the interrupt has been handled.
Where have I heard these words before 🤔?
“I am a master at multitasking.”
Multitasking can take place when someone tries to perform two tasks simultaneously, switch from one task to another, or perform two or more tasks in rapid succession.
Doing your laundry while talking with a friend will probably work out all right. The tricky part comes when we are doing mentally complex tasks, like writing code. Psychologists have conducted multiple experiments on task-switching to determine its costs. They measured the time it takes for people to complete tasks and the time cost of switching between them.
The results were the following:
“Although switch costs may be relatively small, sometimes just a few tenths of a second per switch, they can add up to large amounts when people switch repeatedly back and forth between tasks. … shifting between tasks can cost as much as 40 percent of someone's productive time.”
But don’t feel too bad about it. Even CPUs are bad at context switching.
“Do you have a minute?”
Picture this: you're locked in, fully immersed, and nothing can stop you from finishing this new, shiny feature. The outside world doesn't even exist - it's just you and your code. But then, out of nowhere, you hear a ring... a new notification on your phone. It's your friend asking you out for drinks this weekend. And just like that, poof, the next 20 minutes of your life are gone.
Or maybe you just received a Slack message from your colleague, asking if you have a minute to help with a specific issue.
Is the second situation more acceptable than the first?
A programmer takes between 10 and 15 minutes to start editing code after resuming work from an interruption.
When interrupted during an edit of a method, a programmer resumed work in less than a minute only 10% of the time.
A programmer is likely to get just one uninterrupted 2-hour session in a day.
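To put rough numbers on this, here is a minimal Python sketch. It takes the 10-15 minute resume time quoted above (using the 12.5-minute midpoint) and multiplies it by a number of interruptions per day. The function names, the interruption counts, and the 8-hour workday are assumptions made for illustration, not measurements.

```python
WORKDAY_MINUTES = 8 * 60  # assumption: an 8-hour workday


def minutes_lost_per_day(interruptions: int, resume_minutes: float = 12.5) -> float:
    """Time spent just getting back into the code.

    Does not include the time taken by the interruption itself.
    resume_minutes defaults to the midpoint of the 10-15 minute range.
    """
    return interruptions * resume_minutes


def share_of_workday(lost_minutes: float) -> float:
    """Fraction of the (assumed) workday that the lost minutes represent."""
    return lost_minutes / WORKDAY_MINUTES


# Hypothetical interruption counts, purely to show how the cost grows.
for n in (2, 4, 6):
    lost = minutes_lost_per_day(n)
    print(f"{n} interruptions/day: {lost:.0f} min lost, "
          f"{share_of_workday(lost):.1%} of the day")
```

Even under these modest assumptions, a handful of "do you have a minute?" messages eats a noticeable slice of the day before any actual bug fixing has started.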
In the real world, we are always dealing with limited resources, whether it’s money, time, or attention.
Fixing bugs has a price.
“No” is no to one thing. “Yes” is no to a lot of things. – Jason Fried
Saying “yes” to a bug means saying “no” to a feature. Multitasking may seem efficient on the surface, but switching back and forth between tasks actually takes more time and introduces more bugs in the end. Understanding the hidden costs of context switching helps you make better choices about which bugs are worth dropping everything for, and which ones aren’t (most aren’t).
Should we stop caring about bugs?
Well, no. However, we should at least be aware of the cost of fixing bugs and ask ourselves, "Is it worth it?"
It can be challenging to sit calmly and accept bugs. We all have an inner desire to produce high-quality work and build something we can be proud of. Unfortunately, those pesky bugs often get in the way. As the book "Peopleware" explains, this is why it's tough to stop caring about quality.
We all tend to tie our self-esteem strongly to the quality of the product we
produce—not the quantity of product, but the quality. (For some reason, there
is little satisfaction in turning out huge amounts of mediocre stuff, although
that may be just what’s required for a given situation.) – Peopleware
Could we prevent them from happening?
In certain industries, as we mentioned, you don't have a choice. There, you must take all necessary steps to prevent mistakes from occurring and correct them as quickly as possible.
But for most apps, is the extra effort worth it?
To help you deal with bugs in a more productive manner, we are building Bugpilot - a bug monitoring tool that notifies you when critical user-facing issues occur.
At the end of the day, even if we had a magic wand that fixes all bugs, our users would still find a way to misuse our app. 😅
We have limited resources. Saying “yes” to a bug means saying “no” to a feature, a new integration, an optimization, or a needed refactor. Think about what brings the most value to your team, company, and customers.
Multitasking is counterproductive, especially with challenging coding tasks. It may seem efficient on the surface, but switching back and forth between tasks actually takes more time and introduces more bugs in the end. Understanding the hidden costs helps you choose strategies that avoid it as much as possible.
If these arguments aren’t convincing to you personally, I get it. But try thinking from a business point of view: scale these costs to your whole team, department, or company over the course of a year.
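As a back-of-the-envelope illustration of that scaling, the Python sketch below multiplies the resume time per interruption across a team and a year. Every input here (team size, interruptions per day, 230 working days, the hourly cost) is a hypothetical placeholder; only the 10-15 minute resume time comes from the figures quoted earlier in the article.

```python
def yearly_hours_lost(team_size: int,
                      interruptions_per_dev_per_day: float,
                      resume_minutes: float = 12.5,
                      working_days: int = 230) -> float:
    """Hours per year the whole team spends getting back into interrupted work.

    resume_minutes is the midpoint of the 10-15 minute range;
    working_days is an assumed number of working days per year.
    """
    minutes = (team_size * interruptions_per_dev_per_day
               * resume_minutes * working_days)
    return minutes / 60


# Hypothetical team: 8 developers, 3 bug-related interruptions each per day.
hours = yearly_hours_lost(team_size=8, interruptions_per_dev_per_day=3)
hourly_cost = 60  # hypothetical fully loaded cost per developer hour

print(f"Hours lost per year: {hours:.0f}")
print(f"Approximate cost per year: {hours * hourly_cost:.0f}")
```

Plug in your own team's numbers; the point is less the exact total than how quickly a per-interruption cost of a few minutes compounds.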
In the second part of this 2-part series, I’d like to talk about how to deal with these situations and about the principle of ‘aggregate marginal gains’: the idea that if you improve by just 1% consistently, those small gains will add up to remarkable improvement.
20% of bugs cause 80% of the problems
Another way to think about this is to ask who the bug is impacting. Is it a user who is currently trialing, or a paying customer who has been with us for months? One of them has more patience; the other is waiting for an opportunity to leave. Also consider the number of users impacted and whether it's a core feature or not.
Have dedicated time for bugs
We dedicate around 20% of our time to bugs; usually, that is one full day per week. We heard from some of our customers that they do weekly or bi-weekly rotations, with one dev handling all bugs for the duration of the period. There are multiple versions of this approach that you could try.
Here are some examples:
Basecamp’s method is 2 weeks of fixing non-urgent bugs after shipping new code.
At Bugpilot, we dedicate 1 day per week to bugs.
These approaches are for handling non-critical production bugs, and they reduce context switching.
Priority = Risk × Severity × Time to fix
Time to fix: requires knowledge of the codebase, the technologies used, the business logic, and so on, so we are not going to get into it here.
Severity = number of users × number of times it happened × how recently it last happened.
Risk = type of users, core vs. secondary function, and whether the bug is blocking or has a workaround.
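The formula above could be sketched in Python roughly as follows. This is one possible reading, not Bugpilot's actual scoring: the `Bug` fields, the weights in `risk`, the recency decay, and the function names are all hypothetical. In particular, the article does not say how "time to fix" enters the product, so the sketch assumes it acts as an ease-of-fix factor where quicker fixes score higher.

```python
from dataclasses import dataclass


@dataclass
class Bug:
    users_affected: int
    occurrences: int
    days_since_last_seen: float
    paying_customers: bool  # type of users
    core_feature: bool      # core vs. secondary function
    blocking: bool          # True if there is no workaround
    hours_to_fix: float     # rough estimate from someone who knows the code


def severity(bug: Bug) -> float:
    # Assumed recency factor: 1.0 for a bug seen today, shrinking as it goes quiet.
    recency = 1.0 / (1.0 + bug.days_since_last_seen)
    return bug.users_affected * bug.occurrences * recency


def risk(bug: Bug) -> float:
    # Illustrative weights only; tune them for your own product.
    score = 1.0
    if bug.paying_customers:
        score *= 2.0
    if bug.core_feature:
        score *= 3.0
    if bug.blocking:
        score *= 2.0
    return score


def ease_of_fix(bug: Bug) -> float:
    # Assumption: the "time to fix" term rewards quick fixes.
    return 1.0 / max(bug.hours_to_fix, 1.0)


def priority(bug: Bug) -> float:
    return risk(bug) * severity(bug) * ease_of_fix(bug)


checkout = Bug(users_affected=120, occurrences=300, days_since_last_seen=0,
               paying_customers=True, core_feature=True, blocking=True,
               hours_to_fix=4)
avatar = Bug(users_affected=5, occurrences=8, days_since_last_seen=6,
             paying_customers=False, core_feature=False, blocking=False,
             hours_to_fix=2)

# Highest priority first.
for name, bug in sorted({"checkout": checkout, "avatar": avatar}.items(),
                        key=lambda item: priority(item[1]), reverse=True):
    print(f"{name}: {priority(bug):.1f}")
```

The absolute scores mean little on their own; the useful part is the ordering, which puts a broken checkout far ahead of a profile photo glitch without anyone having to argue about it in Slack.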
Of course, there are and will be bugs that are worth dropping everything for. We use our own tool, Bugpilot, to tell us which ones those are. It takes multiple inputs, such as:
the number and type of users impacted,
whether the bug is blocking or non-blocking.
We make it easy for our users to report any problem. For a small team like ours, this avoids the usual routine: going through support, asking 5-10 questions, waiting for a reply, requesting screenshots, trying to understand what the user meant, replicating the issue, and creating a bug ticket.
We avoid back and forth. Bug fixing is around 90% understanding what happened, and 10% writing the fix. Thanks to Bugpilot, we always have all the details, like the screen recording, logs, network requests, user and environment info, and more. The best part: we can go back even to issues that happened weeks ago. Everything will be there.
No more manually creating Jira tickets. If collecting the details is a nightmare, writing the Jira ticket is even worse, especially if English is not your mother tongue. If you are a dev reading this, you have probably opened a ticket and been like “What da fuq?”.
Aggregate marginal gains
It’s called the principle of ‘aggregate marginal gains’, and it is the idea that if you improve by just 1% consistently, those small gains will add up to remarkable improvement. We see this everywhere in our lives. Saving small amounts of money over time can build big sums with the power of compound interest.
It's a fancy way of saying: improve by 1% per day, or by 1% in many areas, and in the end you’ll improve as a whole.
Dedicated time, or a dedicated person, drastically reduces the context switching that happens when we are building new features.
Productivity takes a dive with all those small interruptions: asking for details, and then waiting for a reply.
Intro to problem
Bugs are everywhere, and bug-free software doesn’t exist, but that doesn’t mean we should stop caring about bugs.
Why do we jump on every bug if we know that bugs can be prioritized and we don’t have to fix them all?
Fixing every (or most) bugs means fewer improvements and features shipped.
Context switching is an expensive operation ⇒
You need time to change priority, … and time to get back to the first task at hand.
Interruptions ⇒ you lose focus and need time to get back.
Add to the mix meetings, notifications, office distractions, complexity of task etc.
Get automatic notifications when coding errors occur, failed network requests happen, or users rage-click. Bugpilot provides you with user session replay and all the technical info needed to reproduce and fix bugs quickly.