The Biased Programmer

I wrote recently about cognitive biases and Thinking, Fast and Slow, Daniel Kahneman's book on them. Having had some time to digest it, some of its implications have become clearer to me, in particular as they relate to my role as a software developer.

The work on biases sheds light on some of the common mistakes we make in our industry. The nature of our work makes software developers particularly susceptible to scale insensitivity, the planning fallacy, and a failure to take the outside view. These can have severe consequences: a developer who underestimates the impact of a low-probability flaw in a distributed system can cause widespread outages, costing their company time and money. Even extremely low-probability events will happen frequently in such systems because of the sheer volume of traffic we must design for.
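To make that arithmetic concrete, here is a minimal sketch; the traffic figures are hypothetical, chosen only for illustration. A flaw that fires on one request in a million is routine, not rare, at ten thousand requests per second:

```python
# Rough illustration of scale insensitivity: a "one in a million"
# flaw is an everyday event at service scale.
# All numbers below are invented for illustration.

def p_at_least_once(p_per_request: float, n_requests: int) -> float:
    """Probability that a flaw with per-request probability p
    triggers at least once across n independent requests."""
    return 1 - (1 - p_per_request) ** n_requests

p = 1e-6                              # "one in a million" flaw
requests_per_day = 10_000 * 86_400    # ~10k requests/second, all day

print(f"Expected occurrences per day: {p * requests_per_day:.0f}")
print(f"P(at least one per day):      {p_at_least_once(p, requests_per_day):.6f}")
```

Under those assumptions the flaw triggers hundreds of times a day, and the probability of getting through a day without seeing it is effectively zero.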

Some research has been done on methods for improving the quality of software artifacts and the reliability of software delivery schedules, which are notorious for running over budget and behind schedule. That research is only slowly penetrating industry, however. Most software projects still work off of back-of-the-envelope estimates, which are frequently discounted and rarely reliable to any meaningful degree.
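One of the better-studied correctives is reference-class forecasting, essentially the outside view mentioned above: anchor a new estimate on how comparable past projects actually turned out, rather than on the plan itself. A minimal sketch, with invented numbers:

```python
# Hypothetical sketch of an outside-view adjustment to an estimate:
# scale the team's raw estimate by how similar past projects
# actually turned out. The history below is invented for illustration.

from statistics import median

def outside_view_estimate(raw_estimate_weeks: float,
                          past_overrun_ratios: list[float]) -> float:
    """Adjust a raw estimate by the median actual/estimated ratio
    observed on comparable past projects (the reference class)."""
    return raw_estimate_weeks * median(past_overrun_ratios)

# Actual duration divided by original estimate, for past projects.
history = [1.4, 2.1, 1.8, 1.2, 2.5]

print(outside_view_estimate(6, history))  # a 6-week estimate becomes 10.8 weeks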

Even once a product has been delivered, there are often large gaps to close in the reliability and quality of the system. This is caused partly by developers failing to fully grasp the problem domain they are working in, and partly by a failure of imagination in designing against possible failure modes. One of the techniques Kahneman describes for avoiding these traps is the "pre-mortem", in which a team imagines that its plan has gone horribly wrong and proposes plausible scenarios for how, which can then be accounted for. NASA is famous for this kind of contingency planning - the extreme nature of their domain justifies the expense. Most projects in the commercial space, particularly those targeting consumers and those with tight schedules or budgets, will hesitate to spend time on such exercises, both because it doesn't seem like a good use of time ("You could be coding instead!") and because the consequences of failure are much lower - there is no equivalent in the consumer space to watching Ariane 5 disintegrate because of a guidance software error.

Even though they are less visible, the consequences are still there: lost customers and revenue, wasted developer time, and increased project cost and schedule overruns caused by unanticipated failure modes. It's the programming equivalent of the hidden cost of bad roads, where potholes and unmaintained pavement quietly take their toll on everyone who uses them. In the worst case, systems become so unmaintainable and unreliable that they must be discarded entirely. Programmers working in such systems may not even realize it until they switch teams (or jobs), in part because What You See Is All There Is.

So what is to be done? Like any engineering problem, the answer depends heavily on context. Teams must balance tradeoffs to manage their exposure to risk while remaining competitive. For all but the most critical systems there will be a point at which it is unreasonable to spend more time mitigating the chance of failure. However, we have to remember that even lower-impact issues are still a drain on our time and resources, and that a small upfront investment can sometimes spare us recurring tolls in maintaining our systems over the long term.