17 May 2013
On PhantomJS, JavaScript, and CorsPlanner
Crawling pages using PhantomJS
I used PhantomJS in my CORS Planner project to crawl module codes from the NUS website. It is really simple to get started by adapting the examples provided on the PhantomJS site. My code looks like this:
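A minimal sketch of this kind of single-page crawl, not the actual CORS Planner code, might look like the following. The function name `extractModuleCodes`, the URL, and the assumed shape of a module code (a few capital letters, four digits, an optional trailing letter) are all placeholders for illustration.

```javascript
// Sketch only: extractModuleCodes and the URL below are hypothetical.
// Assumed code shape: 2-4 capital letters, 4 digits, optional letter
// (for example CS1010 or MA1101R).
function extractModuleCodes(text) {
    var matches = text.match(/\b[A-Z]{2,4}\d{4}[A-Z]?\b/g) || [];
    var seen = {};
    // Keep the first occurrence of each code, preserving order.
    return matches.filter(function (code) {
        if (seen[code]) { return false; }
        seen[code] = true;
        return true;
    });
}

// This part runs only under PhantomJS, where the global `phantom` exists.
if (typeof phantom !== 'undefined') {
    var page = require('webpage').create();
    page.open('http://example.com/modules', function (status) {
        if (status === 'success') {
            // evaluate() runs inside the page and returns a serialisable value.
            var text = page.evaluate(function () {
                return document.body.innerText;
            });
            console.log(extractModuleCodes(text).join('\n'));
        } else {
            console.log('Failed to load page');
        }
        phantom.exit();
    });
}
```

Keeping the extraction in a plain function means it can be checked on sample text without loading a real page.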
Last month, I added support for NTU modules in CORS Planner. However, NTU modules are spread across 339+ different webpages, and crawling them one after another is far too slow. A better way is to run several PhantomJS webpages at the same time, with each one completing part of the crawl. The code (here, "thread" means the number of webpage instances, not a real thread; the webpages run asynchronously):
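One way to sketch this idea, without claiming it matches the original script, is a small pool in which a fixed number of workers pull URLs from a shared queue. The names `crawlAll` and `openPage` and the example URLs are hypothetical; `openPage(url, done)` is assumed to call `done(result)` once a page has been processed.

```javascript
// Sketch only: crawlAll and openPage are hypothetical names.
function crawlAll(urls, threads, openPage, onFinish) {
    var queue = urls.slice();   // pending URLs, shared by all workers
    var results = [];
    var active = 0;             // pages currently being crawled
    var finished = false;

    function finish() {
        if (finished) { return; }   // report completion only once
        finished = true;
        onFinish(results);
    }

    function next() {
        if (queue.length === 0) {
            if (active === 0) { finish(); }
            return;
        }
        var url = queue.shift();
        active += 1;
        openPage(url, function (result) {
            results.push(result);
            active -= 1;
            next();             // this worker picks up the next URL
        });
    }

    var n = Math.min(threads, queue.length);
    if (n === 0) { finish(); return; }
    for (var i = 0; i < n; i += 1) { next(); }
}

// This part runs only under PhantomJS, where the global `phantom` exists.
if (typeof phantom !== 'undefined') {
    var urls = ['http://example.com/page1', 'http://example.com/page2']; // placeholders
    crawlAll(urls, 2, function (url, done) {
        var page = require('webpage').create();
        page.open(url, function (status) {
            page.close();       // release the page before taking more work
            done({ url: url, status: status });
        });
    }, function (results) {
        console.log(results.length + ' pages crawled');
        phantom.exit();
    });
}
```

Because each worker takes a new URL only when its previous page has finished, at most `threads` pages are open at any moment. Results arrive in completion order, which may differ from the order of `urls`.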
Run $ phantomjs crawl-phantomjs.js -t # in a command window, where # is the number of threads you want.
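How the script reads the -t option is not shown here. As a rough sketch, a PhantomJS script can get its command-line arguments from the `system` module and scan them for the flag. The helper name `parseThreads` and the fallback of one thread are assumptions.

```javascript
// Sketch only: parseThreads is a hypothetical helper.
// args is an array such as ['crawl-phantomjs.js', '-t', '9'].
function parseThreads(args, fallback) {
    var i = args.indexOf('-t');
    if (i === -1 || i + 1 >= args.length) { return fallback; }
    var n = parseInt(args[i + 1], 10);
    // Reject non-numeric or non-positive values.
    return (isNaN(n) || n < 1) ? fallback : n;
}

// Under PhantomJS, the arguments come from the system module.
if (typeof phantom !== 'undefined') {
    var threads = parseThreads(require('system').args, 1);
    console.log('Using ' + threads + ' thread(s)');
    phantom.exit();
}
```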
The performance comparison for different numbers of threads (concurrent webpages):
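| Threads | Pages Crawled | Time Taken | Avg Time/Page |
|---------|---------------|------------|---------------|
| 1       | 30            | 152.92s    | 5.09s         |
| 9       | 40            | 52.38s     | 1.31s         |
| 19      | 40            | 39.95s     | 0.99s         |
| 29      | 40            | 32.66s     | 0.82s         |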
Note: a webpage sometimes stops loading if your network is slow.
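One possible guard against a stalled page, offered as a sketch rather than a tested fix, is to wrap the page-opening function with a timer so the crawl moves on if a page never reports back. The helper `withTimeout` is hypothetical and not part of PhantomJS; it assumes a function `openPage(url, done)` that calls `done(result)` when the page has been processed.

```javascript
// Sketch only: withTimeout is a hypothetical helper, not a PhantomJS API.
// The wrapped function calls done exactly once: with the real result,
// or with a timeout marker after `ms` milliseconds.
function withTimeout(openPage, ms) {
    return function (url, done) {
        var settled = false;
        var timer = setTimeout(function () {
            if (settled) { return; }
            settled = true;
            done({ url: url, timedOut: true });
        }, ms);
        openPage(url, function (result) {
            if (settled) { return; }   // result arrived after the timeout; ignore it
            settled = true;
            clearTimeout(timer);
            done(result);
        });
    };
}
```

A caller could then retry or log the URLs that come back with `timedOut: true` instead of waiting on them indefinitely.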