When I Debugged a NodeJS Problem That Froze Our Production
Hello to all subscribers (3) of this newsletter. Today I'm going to talk about a production problem I debugged some time ago. I'm not sure if there's a moral to the story. Feels like there isn't one. I'm writing it down so that I can remember it when I retire from the industry.
But first, a short explanation of why this newsletter exists
Very short explanation: a friend asked me to do it.
Slightly less short explanation:
I'm infinitely and endlessly curious about all things related to programming and software development. That said, I feel like I only have average smarts when it comes to brainy stuff. Someone might say this is false modesty. They might be right. But that's how I feel. The brainy programming stuff makes my head hurt. (Not literally.)
However: I've worked in this industry for so long that I usually no longer give a shit if I make a fool of myself. I'm not the smartest person out there, but I am effective, therefore my ego isn't too vulnerable to the idea of telling the world how not-smart I might be.
And it turns out the most interesting writing comes from just trying to be honest. Don't get me wrong: being honest in public is hard. It's one of those things where saying "this is hard" doesn't feel hard at all, but then you never get around to actually doing the thing, because it really is hard. More about this in a separate blog post.
Big hairy production problems, yay
Sometimes you run into a big hairy production problem that you have absolutely no stinkin' idea how to even start solving. A problem that you know you cannot just ignore or work around. The feeling I have is usually mixed: partly a deep dread and anxiety and partly a strange excitement, because we all love to solve problems. Though the older I get, the less excited I am about solving hairy problems. (The less hair I seem to have too, maybe these are related.)
The symptom was that our NodeJS application, from time to time, completely froze for a good while. Did nothing, nada. Well, not completely true: it consumed 100% of the CPU while apparently doing nothing.
Sometimes it was for a second, sometimes even longer. We had several instances of the NodeJS server running, so it wasn't like the whole service stopped working for a period. But it was still completely unacceptable and basically made the user experience unbearable for the unlucky ones who happened to hit the server that was frozen.
"Don't block the event loop." Well, sir, I don't.
NodeJS being single-threaded, asynchronous server software, the obvious suspect was any code that blocks execution: some long-running operation that blocks the event loop. (Don't do that.)
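For the uninitiated, this is the classic shape of the mistake: a long synchronous computation that hogs the one JavaScript thread so nothing else gets served. A generic illustration, not code from our application:

```js
// A generic illustration of blocking the event loop -- not our actual code.
// While this loop runs, the single JavaScript thread can't serve any other
// request, fire timers, or handle I/O callbacks. The whole process "freezes".
const http = require('http');

http.createServer((req, res) => {
  let total = 0;
  for (let i = 0; i < 5e9; i++) {   // seconds of pure synchronous CPU work
    total += i;
  }
  res.end(String(total));
}).listen(3000);
```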
But none was found. We tried to isolate the problem, and with meticulous tracing we could pinpoint the problematic code area more accurately. But any effort to write a simplified version of the suspicious areas yielded nothing. The problem only occurred in production. Yay.
It's not a great feeling to have a showstopper production problem and no idea how to solve or even circumvent it after working on it for days already. What do I do if I can never fix the problem? Admit defeat and just retire from the industry out of shame?
I don't remember how long the whole process of debugging and solving this problem took. It was certainly days, might have been a couple of weeks. Felt like months.
But every morning I'd wake up with fresh ideas about how to tackle the problem and feel excited for a while, and then, as the workday drew to a close, the frustration of not getting any closer to a solution would creep back in. Adding insult to injury, there was no nice, specific error message that would have made the problem easy to Google. I found nothing even similar.
Take a deep breath. It doesn't help, but you have to breathe anyway.
The first thing you need to do is accept that this is going to take a while, and clear out everything else you had planned to do. It's like a mini project, with no promise of ever being able to finish it. If it takes a month, then it takes a month. (It probably won't.)
What I did was isolate as much as I could. The thing was that a very specific part of the application triggered the problem. So I ran that part in a local environment with a setup where I could generate as many requests as production saw in normal usage. I always have some Gatling project lying around for this purpose. Gatling is a programmer-friendly performance/stress testing tool, Virtanen recommends.
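If you don't have a Gatling project lying around, even a crude throwaway script gets you surprisingly far. Something roughly like this (a made-up sketch, not our actual setup; it assumes Node 18+ for the global fetch, and the URL and numbers are invented):

```js
// Throwaway load generator: hammer the suspicious endpoint with a fixed
// number of concurrent request loops and log anything suspiciously slow.
// URL, concurrency and duration are made up for illustration.
const URL = 'http://localhost:3000/suspicious-endpoint';
const CONCURRENCY = 50;
const DURATION_MS = 60_000;

async function worker() {
  const deadline = Date.now() + DURATION_MS;
  while (Date.now() < deadline) {
    const started = Date.now();
    // Read the body so the connection is properly finished; ignore failures.
    await fetch(URL).then((res) => res.arrayBuffer()).catch(() => {});
    const elapsed = Date.now() - started;
    if (elapsed > 1000) {
      console.log(`slow response: ${elapsed} ms`); // the "freeze" shows up here
    }
  }
}

// Run all workers in parallel; each keeps one request in flight at a time.
Promise.all(Array.from({ length: CONCURRENCY }, worker));
```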
And sure enough, in the end, I was able to reproduce it. That feeling is awesome. It's like winning a big heckin' competition if you are the competitive type. Or, I guess, shooting dope in your veins if that's your thing. You just know you're going to crack this depressing conundrum even though it's only a reproduction at this point.
A side note
[I noticed that I sometimes talk about 'we' and sometimes 'I', which is confusing. I wasn't the only one debugging the problem, so it's 'we' in that sense. But I guess I spent the most time with the problem, and I felt some personal responsibility for it as a whole. So there's that.]
Flamegraphs are awesome
A simple V8 engine profiler run didn't help, because it turned out to be a case of many things happening simultaneously. When only one request was being processed, nothing froze. Everything happened fast. But with enough concurrent requests, the thing that usually took a millisecond on average suddenly took hundreds of milliseconds on average.
Flamegraphs (by Brendan Gregg), or in this case flamebearer, to the rescue. The tool basically shows where your application spends its CPU cycles: it breaks the time down to the final function call in your application code's call chain and shows graphically where it is spent. Brendan explains it better; go read it there.
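(If you want to try this at home: flamebearer eats the V8 log that node --prof spits out, if I remember right. You can also grab a CPU profile from inside the process with NodeJS's built-in inspector module and open it in Chrome DevTools or speedscope as a flame chart. Roughly like this; the 30-second window and file name are arbitrary:)

```js
// Capture a CPU profile from inside a NodeJS process using the built-in
// inspector module. The resulting .cpuprofile file can be opened in
// Chrome DevTools (or speedscope) and viewed as a flame chart.
const fs = require('fs');
const inspector = require('inspector');

const session = new inspector.Session();
session.connect();

session.post('Profiler.enable', () => {
  session.post('Profiler.start', () => {
    // ...let the application handle (concurrent!) traffic for a while...
    setTimeout(() => {
      session.post('Profiler.stop', (err, result) => {
        if (!err) {
          fs.writeFileSync('app.cpuprofile', JSON.stringify(result.profile));
        }
        session.disconnect();
      });
    }, 30_000);
  });
});
```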
Ah.
Now I can actually Google the problem. Not a lot of similar results, but at least there's someone who faced a similar problem. Their situation was slightly different, but at least there's something I can work with.
The depressing non-conclusion. But life goes on. I guess.
The library in question apparently has (or had) a complicated and involved bug related to large error objects with cyclical references. I'm still not sure what actually triggers the problem. The library is able to work with the cyclical references in the error object when there isn't too much concurrent activity, and it is fine in 99.99% of cases. But it was pretty easy to find a way to work around it: passing a simpler error object that was stringified beforehand mitigated the problem.
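The workaround looked roughly like this (a sketch reconstructed from memory, not the exact code; the library call name is invented):

```js
// Roughly the shape of the workaround (names invented for illustration).
// Instead of handing the library a big Error object full of cyclical
// references, flatten it into a small, plain error first.
const util = require('util');

function toSafeError(err) {
  // util.inspect handles cyclical references and caps the depth, so the
  // result is a plain, bounded string instead of a deep object graph.
  const details = util.inspect(err, { depth: 3, breakLength: 120 });
  const safe = new Error(err.message);
  safe.stack = err.stack;
  safe.details = details;
  return safe;
}

// Before: someLibrary.handleError(originalError)               // froze under load
// After:  someLibrary.handleError(toSafeError(originalError))  // fine
```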
So the programmer in me said that I should keep investigating. I wanted to know the actual technical details of the problem. But the pragmatist in me said that the problem no longer exists in our service and I should just move on, because I'd already spent too much time on it.
So that's what I did. I settled for the workaround. The problem never reoccurred.
The moral of the story
I said there's no moral to the story, but I guess there's always something. Maybe the moral is that in real life a problem like this doesn't always get a nice, beautiful ending that resolves everything.
You might say that working around the problem was resolution enough. I don't feel that way. Maybe it wasn't a defeat, but it certainly wasn't a win. Let's call it a tie.
Jarno
PS. Thanks to Risto K. for comments on the draft. I will try to incorporate those comments in later editions. :)
PPS. If you have comments or suggestions for improvement, just hit reply or email me directly at jajvirta@gmail.com