Random Thoughts of a Happy Programmer

September 22, 2014

For Loops in Node

As of late, I’ve been spending a fair amount of time writing Node.js code. While I’m not a huge Node.js fan (yay Python + Go!), I find myself liking some parts of the platform quite a lot.

Over the past few months I’ve been working on a really awesome authentication library for Node.js, express-stormpath, and I’ve learned quite a lot about Node along the way.

Today I’d like to share a short, personal story with you, about my frustrating experience trying to do something simple.

The Story

Here’s how it started: two weeks ago I was writing a web scraper for thepiratebay. My idea was simple: I wanted to get a JSON dump of all torrent information available, so that I could later use it for some simple data analysis.

After taking a look at the site, I realized that the simplest way to scrape all the existing torrents would be to just loop through all integers, querying each one sequentially – this is because TPB allows you to access torrents via their integer ID (which is always increasing):

  • http://thepiratebay.se/torrent/1
  • http://thepiratebay.se/torrent/2
  • http://thepiratebay.se/torrent/3
  • http://thepiratebay.se/torrent/…

The rules are simple: if you get a 404, skip it – if you get a 200, the torrent exists and can be scraped!

So, I sat down and wrote a first version that looked something like this:

var request = require('request');

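for (var i = 0; i < 10000000; i++) {
  // Torrent pages live under /torrent/<id>.
  request('http://thepiratebay.se/torrent/' + i, function(err, res, body) {
    if (err || res.statusCode !== 200) return; // 404 (or an error): skip it
    // 200: the torrent exists, so `body` can be scraped here
  });
}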

This is some pretty basic stuff:

  • Iterate through numbers? CHECK!
  • Make HTTP requests? CHECK!

But to my dismay, after running for a few minutes I noticed that this small program was eating all the RAM on my laptop! But why?!

I knew that Node.js can’t do anything else while synchronous code (e.g. a for loop) is running – but I figured that since I was making async requests from within the loop, things would continue to work normally.

I was wrong.
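
To see why the loop piles up work, here’s a minimal sketch. It assumes nothing about request itself: fakeRequest is a made-up stand-in that simply finishes on a later turn of the event loop. The point is that a for loop runs to completion before Node gets a chance to run any callback, so every request the loop starts is sitting in memory at the same time.

```javascript
// Count how many "requests" are queued vs. finished while the loop runs.
var queued = 0;
var finished = 0;

// fakeRequest is a hypothetical stand-in for request(): it completes
// asynchronously, on a later turn of the event loop.
function fakeRequest(id, callback) {
  queued++;
  setImmediate(function () {
    finished++;
    callback(null, id);
  });
}

for (var i = 0; i < 1000; i++) {
  fakeRequest(i, function () {});
}

// Still in the same synchronous run: everything is queued,
// but no callback has had a chance to run yet.
console.log('queued:', queued, 'finished:', finished);
// prints "queued: 1000 finished: 0"
```

Scale the 1000 up to ten million real HTTP requests and the queue alone can plausibly eat all available RAM.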

So, being confused about what was happening, I decided to dig a bit deeper. I narrowed my case down to a simpler test:

for (var i = 0; i < 10000000; i++) {
  console.log('hi:', i);
}

But alas, the same problem. The program simply runs for a few minutes, then crashes as it uses all the RAM on my computer. Bummer.

So then I started Googling around to find potential solutions. Surely this must be a common issue?

Unfortunately, I didn’t see much discussion about this, and all the relevant Stack Overflow threads proposed solutions that didn’t require looping at all (not an option in my case).

Next, I turned to async – the really popular flow control library for Node. After looking through the docs, I realized there was something that was seemingly perfect for this! The forever construct!

So I then tried the following:

var async = require('async');

var i = 0;
async.forever(
  function(next) {
    console.log('hi:', i);
    i++;
    next();
  },
  function(err) {
    console.log('All done!');
  }
);

But again – the same issue. After a few thousand loops: crash.
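
One plausible explanation for this particular crash is that next() was being called synchronously. This is an assumption about how the installed version of async behaved, not something checked against its source. If forever simply calls the task again from inside next(), every iteration adds frames to the call stack until it overflows. The sketch below uses naiveForever, a hypothetical stand-in written for illustration, to show the effect.

```javascript
// naiveForever is a hypothetical stand-in, NOT async's real source:
// it calls the task again directly from inside next().
function naiveForever(task) {
  function next() {
    task(next);
  }
  next();
}

// Calling next() synchronously nests deeper on every iteration,
// so the stack eventually overflows with a RangeError.
function countUntilOverflow() {
  var i = 0;
  try {
    naiveForever(function (next) {
      i++;
      next();
    });
  } catch (err) {
    return { iterations: i, error: err.name };
  }
}

var result = countUntilOverflow();
console.log(result.error, 'after', result.iterations, 'iterations');
```

In a loop like this, replacing next() with setImmediate(next) inside the task lets the stack unwind between iterations, at the cost of one event-loop turn per iteration.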

After writing quite a few different iterations of this simple program, and losing a significant amount of sleep (I can’t really sleep well knowing I don’t understand something – grr), my coworker Robert proposed a working solution:

var Abstraction = function() {
  this.index = -1;
};

Abstraction.prototype.getIndex = function getIndex() {
  this.index++;
  return this.index;
};

Abstraction.prototype.isDoneTest = function isDoneTest() {
  return this.index > 10000000;
};

var list = new Abstraction();

function iterator() {
  var i = list.getIndex();
  console.log(i);
  if (list.isDoneTest()) {
    clearInterval(interval);
  }
}

var interval = setInterval(iterator, 1);

Brilliant! I didn’t even think of setInterval for some reason.

Anyhow, after a lot of discussion, we both agreed that using setInterval is essentially the only way we could find to solve this problem.

After thinking about this some more, I decided to write a small abstraction layer to handle this – so I created lupus.

lupus provides simple (albeit basic) asynchronous looping for Node.js:

var lupus = require('lupus');

lupus(0, 10000000, function(n) {
  console.log("We're on:", n);
}, function() {
  console.log('All done!');
});

Whatever you end up writing inside of the loop (blocking or not) – lupus doesn’t care.
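
For the curious, here’s a rough sketch of how a helper with this shape could be built. This is not lupus’s actual implementation: asyncLoop, the BATCH size, and the half-open range (start included, stop excluded) are all assumptions made for illustration, and it swaps setInterval for setImmediate so there’s no timer delay between batches.

```javascript
// asyncLoop is a hypothetical helper, not lupus's actual implementation.
// It calls each(n) for start <= n < stop, then calls done().
function asyncLoop(start, stop, each, done) {
  var n = start;
  var BATCH = 1000; // iterations to run before yielding (arbitrary choice)

  function runBatch() {
    var end = Math.min(n + BATCH, stop);
    for (; n < end; n++) {
      each(n);
    }
    if (n < stop) {
      // Yield to the event loop so pending I/O callbacks can run.
      setImmediate(runBatch);
    } else if (done) {
      done();
    }
  }

  // Start on a later turn so the caller always sees async behavior.
  setImmediate(runBatch);
}

var total = 0;
asyncLoop(0, 10000, function (n) {
  total += n;
}, function () {
  console.log('All done! total =', total);
});
```

Running a small batch per turn keeps the event loop responsive without paying a scheduling cost on every single iteration.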

The Moral

Performing asynchronous for loops in Node.js turned out to be quite a lot harder than I expected. I find it odd that it’s so easy to crash my programs with the simplest of looping examples.

Oh well! Live and learn!
