If You Leave an Exception Handler Untested, You're Part of the Problem
Announcements
When I decided to create this newsletter, I honestly rationalised it by this being a basically a newsletter to my friends and acquaintances. They are spared of my venting and ranting by just subscribing to this newsletter. And that's what it seems like it will become. That's fine!
I don't feel qualified to teach programming, especially to you, because you're probably better programmers than I am. But I think I can make you better programmers nevertheless. Strangest thing!
But anyhoo.
If You Leave an Exception Handler Untested, You're Part of the Problem
If you tolerate this, then your children will be next. -- Manic Street Preachers.
A common complaint among conscientious programmers is code like this:
try:
all sorts of stuff
except whatever:
pass
First, like always, context matters. It might be a perfectly reasonable thing to do. Say, for a one-off script. Or for countless other scenarios. But when the complaint is made, it is usually in a context where such code is not appropriate.
So there's plenty of advice about how to do proper error handling. Large part of that is basically a reiteration of "it depends". But we can do that. It is usually not easy, but it is manageable. Go on, do that. Usually it involves thinking hard about what you want actually to happen at these unfortunate circumstances.
But that's not what I'm interested in right now or equipped to give advice about.
So fix it!
First, if you complain about it, but actually do nothing about it ... well, now you're part of the problem too. Maybe the last programmer looking at the code was complaining about it to their colleagues, but did nothing about it. Maybe that was even the right decision. (It is a whole other discussion if you should fix every little thing that you feel is wrong in the code.)
But the point is that if you don't do it, now, for whatever reasons you have, right now, I think it is fair to say that you are, in this sense, part of the problem. Don't get me wrong, I am not that guy who goes around fixing empty exception handlers all day long! There are occasions when I've fixed similar issues. There are occasions when I've left it as it is.
A side note: I think leaving a TODO comment in the source code is largely pointless in these cases. # TODO proper error handling
, "Yes, thanks, that's sort of obvious." But a comment explaining why it is left like it is might be a reasonable thing to do. That's also a whole other discussion, which I hope to talk about in some later newsletter!
But still. Every argument you have for not fixing it right now is an argument every previous programmer looking at the code might have had. Or even the original programmer! So either fix it or accept that you are every bit as responsible for it. That's not the end of the world! I think it's healthy to feel shared responsibility and not blame others for everything.
Better yet, test it!
But the real point of this article is a better suggestion: test what happens when an exception is thrown! It might be a simple manual test that triggers a condition. See what happens. If you have an all-encompassing catch-all exception handler, see what happens when a null pointer exception is thrown. Don't simulate it in your mind, actually do it.
You might get surprised at what happens. Or even better, you might realise that when the exception is thrown, nothing good is going to happen. Follow what happens next and think hard if it means that at some point ignoring the exception will lead to trouble. Again, it's not easy but often worthwhile.
Now, the solution is not to write a unit test that just triggers the error and asserts that the exception handler is empty or does nothing. The idea is that an empty exception handler is a programming error. Not doing anything might be a programming error, depending what you're doing. Doing something might be an error, if it's the wrong thing.
The real meat of this newsletter
All of this was trying to be a sneaky way to motivate you to read the paper Simple Testing Can Prevent Most Critical Failures (by Ding Yuan, Yu Luo, Xin Zhuang, Guilherme Renna Rodrigues, Xu Zhao, Yongle Zhang, Pranay U. Jain, Michael Stumm).
Now, it is a study about failures in large distributed systems (Cassandra, HBase, Hadoop Distributed File System (HDFS), Hadoop MapReduce, and Redis) and I hear you already complaining violently with something like "Well, I don't work on large scale distributed systems, I only make web forms!" And, sure, it's a fair point, but I think reading the paper and trying to incorporate the ideas suggested in it will nevertheless make you a better programmer and your programs more resilient to errors.
Some quotes:
We found the majority of catastrophic failures could easily have been prevented by performing simple testing on error handling code – the last line of defense – even without an understanding of the software design.
and
almost all (92%) of the catastrophic system failures are the result of incorrect handling of non-fatal errors explicitly signaled in software.
and
In fact, in 35% of the catastrophic failures, the faults in the error handling code fall into three trivial patterns: (i) the error handler is simply empty or only contains a log printing statement, (ii) the error handler aborts the cluster on an overly-general exception, and (iii) the error handler contains expressions like “FIXME” or “TODO” in the comments.
An example from HBase, the code that caused a fatal error:
try {
split(..);
} catch (Exception ex) {
LOG.error("split failed..");
}
where the failure is caused by a flaky file system returning a NullPointerException. Triggering the error and asserting the failed state the application is in was relatively easy in a unit test. The fix is to retry the split.
An example from MapReduce, where the exception handler is just simply:
catch (IOException e) {
// TODO
LOG("Error event from RM: shutting down..");
}
which again was easily reproducible in a unit test (meaning: a unit test that exercises the code and asserts that the end result is not what is wanted from the code and the system to be in after the exception handler).
The patch to fix the fatal error that took down the cluster of 4000 nodes in production was this:
catch (IOException e) {
+ // This can happen if RM has been restarted. Must clean up.
+ eventHandler.handle(..);
}
It is besides the point if this is the right fix and would require intimate knowledge of the system in question anyhow. And I'm not pointing fingers either! These systems are for sure more extensively tested than everything I've ever worked with directly.
The point is that a test that triggers an error and therefore the error handler and finally an assertion that examines the system state would have detected the error. Then you can fix the error without having to wait for error reports from your customers.
Of course, in large distributed systems the catastrophic failures also involve unfortunate timings of concurrently happening actions and a cascade of things, but what the paper demonstrates that these failures would have been detected and prevented by simply testing the error handling code.
So, if you're saying that, well, "I don't need to test my code's exception handlers, because I don't work with large distributed systems", I will counter that by saying that you probably do work with distributed systems (a load balancer, a frontend, an API/backend and a database is a distributed system). Second: if you don't test the exception handler, you really can't say what happens when that exception is thrown. Go on, test that exception handler, show that nothing bad happens and prove me wrong! (I'm trying to trick you into testing...)
Phew.
Final words, sort of.
Yes, the title was a provocation of sorts. But also sort of true. But don't take it like I'm saying you're a bad person for not testing exception handlers or that I'm somehow better. I'm not. Let's just become a little better, one newsletter at a time!
Thanks to Risto and Juuso for reading drafts of this. (Apologies to Duukkis, because I sent you a paper copy of a draft.)