17 May 2013
Crawling pages using PhantomJS
I used PhantomJS in my CORS Planner project to crawl module codes from the NUS website. It is really simple to adapt the examples provided on the PhantomJS site. My code looks like this:
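As an illustration, here is a minimal sketch of a single-page crawl. It is not necessarily the script used in CORS Planner: the URL, the helper name extractModuleCodes, and the pattern for what a module code looks like are all assumptions made for the example.

```javascript
// Sketch only: the URL and the module-code pattern are placeholders.
var START_URL = 'http://example.com/modules'; // hypothetical address

// Pull module-code-like tokens (e.g. "CS1010", "MA1101R") out of page text.
// The pattern is an assumption about how codes are formatted.
function extractModuleCodes(text) {
  var matches = text.match(/\b[A-Z]{2,3}\d{4}[A-Z]?\b/g) || [];
  var seen = {};
  var codes = [];
  for (var i = 0; i < matches.length; i++) {
    if (!seen[matches[i]]) {      // drop duplicates, keep first-seen order
      seen[matches[i]] = true;
      codes.push(matches[i]);
    }
  }
  return codes;
}

// This part only runs inside PhantomJS, where the global `phantom` exists.
if (typeof phantom !== 'undefined') {
  var page = require('webpage').create();
  page.open(START_URL, function (status) {
    if (status !== 'success') {
      console.log('Failed to load ' + START_URL);
      phantom.exit(1);
      return;
    }
    // evaluate() runs the function inside the page and returns its result.
    var text = page.evaluate(function () {
      return document.body.innerText;
    });
    console.log(extractModuleCodes(text).join('\n'));
    phantom.exit();
  });
}
```

Keeping the text parsing in a plain function means it can be checked without loading a real page.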
Last month, I added support for NTU modules in CORS Planner. However, the NTU modules are spread across 339+ different webpages, and crawling them one after another is far too slow. A better way is to run several PhantomJS webpages at the same time, with each one completing part of the crawl. The code: (thread is the number of webpages, not a real thread; the webpages run asynchronously)
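The idea of several webpages sharing the work can be sketched roughly as follows. This is one possible arrangement, not the original script: the function name crawlAll, the placeholder URLs, and the choice of a shared queue (each free slot takes the next unvisited URL) are assumptions.

```javascript
// Run `worker(url, next)` over every URL with at most `threads` in flight.
// `worker` must call `next(result)` exactly once when its page is finished.
function crawlAll(urls, threads, worker, done) {
  var results = [];
  var nextIndex = 0;   // next URL to hand out
  var finished = 0;    // URLs fully processed

  if (urls.length === 0) { done(results); return; }

  function startNext() {
    if (nextIndex >= urls.length) return;   // nothing left for this slot
    var i = nextIndex++;
    worker(urls[i], function (result) {
      results[i] = result;                  // keep results in input order
      finished++;
      if (finished === urls.length) done(results);
      else startNext();                     // reuse the slot for the next URL
    });
  }

  var slots = Math.max(1, Math.min(threads, urls.length));
  for (var s = 0; s < slots; s++) startNext();
}

// This part only runs inside PhantomJS.
if (typeof phantom !== 'undefined') {
  var webpage = require('webpage');
  var urls = ['http://example.com/a', 'http://example.com/b']; // placeholders
  crawlAll(urls, 2, function (url, next) {
    var page = webpage.create();            // one page object per URL
    page.open(url, function (status) {
      var title = status === 'success'
        ? page.evaluate(function () { return document.title; })
        : null;
      page.close();                         // release the page's memory
      next(title);
    });
  }, function (results) {
    console.log(JSON.stringify(results));
    phantom.exit();
  });
}
```

A shared queue keeps every slot busy until the list is exhausted, whereas fixed slices per webpage can leave some slots idle if their pages happen to load faster.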
Run $ phantomjs crawl-phantomjs.js -t # in a command window, where # is the number of threads (webpages) you want.
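Reading the -t value inside the script might look like the sketch below. The helper name parseThreads and the fallback of 1 are assumptions; the arguments come from the args list of the PhantomJS system module, whose first entry is the script name.

```javascript
// Return the positive integer that follows "-t" in `args`,
// or `fallback` if the flag is missing or its value is not a valid number.
function parseThreads(args, fallback) {
  for (var i = 0; i < args.length - 1; i++) {
    if (args[i] === '-t') {
      var n = parseInt(args[i + 1], 10);
      return n > 0 ? n : fallback;   // NaN and non-positive values fall back
    }
  }
  return fallback;
}

// This part only runs inside PhantomJS.
if (typeof phantom !== 'undefined') {
  var system = require('system');
  var threads = parseThreads(system.args, 1);
  console.log('Using ' + threads + ' webpage(s)');
  phantom.exit();
}
```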
The performance comparison for different numbers of pages:
Note: sometimes a webpage stops responding if your network is slow.
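One way to keep a stalled webpage from holding up the whole crawl is to give each load a time limit. The sketch below is a suggestion rather than part of the original script; the helper name withTimeout, the 30-second limit, and the placeholder URL are assumptions.

```javascript
// Call `task(finish)`; if `finish` is not called within `ms` milliseconds,
// report `null` instead. `done` is invoked exactly once either way.
function withTimeout(task, ms, done) {
  var settled = false;
  var timer = setTimeout(function () {
    if (settled) return;
    settled = true;
    done(null);              // treat a stalled page as a failed page
  }, ms);
  task(function (result) {
    if (settled) return;     // too late: the timeout already fired
    settled = true;
    clearTimeout(timer);
    done(result);
  });
}

// This part only runs inside PhantomJS.
if (typeof phantom !== 'undefined') {
  var page = require('webpage').create();
  withTimeout(function (finish) {
    page.open('http://example.com/', function (status) { finish(status); });
  }, 30000, function (status) {
    console.log(status === null ? 'timed out' : status);
    phantom.exit();
  });
}
```

A page that times out can then be logged or queued for another attempt instead of blocking the remaining pages.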