David Boureau

I create web applications

Web scraping as a service

08 August, 2017

Scraped website is here

Final code is here

Today we’ll study how to create a web service that is able to scrape the web. Some requirements could be:

The stack relies on big classics:

For the latter: two years ago I tried SpookyJS, but I found it quite complicated. Spooky is no longer very active, and the excellent NightmareJS makes a good replacement: it lets us drive a headless browser in an intuitive way, without tricks.

The website we will scrape

We will scrape one of the most technologically advanced AI applications ever: ask-the-dude (see it here).

We can ask the dude any question :

Ask any question

The dude takes time to answer.

Take time

The dude always replies.

Answer is yes

The answer may vary, depending on the question.

Answer is no

If you forget the trailing “?”, you will get no answer. In that case, the dude offers to display a random quote instead.

Answer is error

Pretty incredible, isn’t it ?

Some bad news

Unfortunately, the ignoble coder didn’t release any API for “the dude”.

Which means you can’t access “the dude” programmatically. The only way to get an API is to create a web service that scrapes “the dude”.

The API we will create

Quickstart

$> node --version
v6.9.5
$> git --version
git version 2.7.2
$> git clone git@github.com:bdavidxyz/web-scraping-as-a-service.git
$> cd web-scraping-as-a-service
$> npm install
$> npm start

You can open it at http://localhost:5000/. A welcome message should be displayed if everything installed correctly.

Good ! Now our service is ready to be tested.

var a = $.ajax({
 type: "POST",
 url: "http://localhost:5000/ask",
 data: {question:'Do you like butter ?'},
 success: function(e){console.log(e);}
});

“OK” should be printed. Wait a few seconds, then:

var a = $.ajax({
 type: "GET",
 url: "http://localhost:5000/get-answer-for?q=Do you like butter ?",
 success: function(e){console.log(e);}
});

You should get the answer to the question. Now try to get an answer to a question you never asked:

var a = $.ajax({
 type: "GET",
 url: "http://localhost:5000/get-answer-for?q=WTF ?",
 success: function(e){console.log(e);}
});
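Both GET calls above put the raw question, spaces and question mark included, straight into the query string. Browsers usually tolerate this, but other HTTP clients may not. A minimal sketch of building the URL with the question encoded; `answerUrl` is a hypothetical helper name, and it assumes the service reads the `q` parameter after standard URL decoding:

```javascript
// Build the GET URL with the question safely encoded.
// The base URL and the `q` parameter are taken from the calls above;
// `answerUrl` itself is a made-up helper for this illustration.
function answerUrl(question) {
  return 'http://localhost:5000/get-answer-for?q=' + encodeURIComponent(question);
}

console.log(answerUrl('Do you like butter ?'));
// prints "http://localhost:5000/get-answer-for?q=Do%20you%20like%20butter%20%3F"
```

The resulting string can be passed as the `url` option of the `$.ajax` calls shown above.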

You can also print all questions:

var a = $.ajax({
 type: "GET",
 url: "http://localhost:5000/all-questions",
 success: function(e){console.log(e);}
});

Code

Ask a question

The relevant part is here : https://github.com/bdavidxyz/web-scraping-as-a-service/blob/master/index.js#L27-L72

Notice that NightmareJS is pretty intuitive: you can chain basic instructions very easily. However, the famous JS pyramid nightmare (ahem) cannot be completely avoided: once you start to evaluate anything on the page, the result of this evaluation is wrapped in a promise.

You have to be very careful about these 3 things :

Notice that in this example, you can achieve conditional browsing : based on the result of a first evaluation, you can reuse the nightmare instance and scrape the web page again.
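The shape of this conditional browsing can be sketched without a real browser. This is not the code from the repository: `page`, `submitQuestion` and `readRandomQuote` are hypothetical names standing in for a headless browser session whose methods return promises, and the fake session only exists so the sketch runs on its own.

```javascript
// Conditional browsing with promises: the result of a first scrape
// decides whether the same session is used for a second scrape.
// `page` is a hypothetical stand-in for a headless browser session.
function askTheDude(page, question) {
  return page.submitQuestion(question).then(function (answer) {
    if (answer !== null) {
      // First evaluation succeeded: nothing more to scrape.
      return { question: question, answer: answer };
    }
    // No answer (for example a missing "?"): reuse the session
    // and scrape a random quote instead.
    return page.readRandomQuote().then(function (quote) {
      return { question: question, answer: quote };
    });
  });
}

// A fake session, so the sketch runs without any browser installed.
const fakePage = {
  submitQuestion: function (q) {
    return Promise.resolve(q.trim().endsWith('?') ? 'yes' : null);
  },
  readRandomQuote: function () {
    return Promise.resolve('a random quote');
  }
};

askTheDude(fakePage, 'Do you like butter').then(function (result) {
  console.log(result.answer); // the fallback quote from the fake session
});
```

Returning the inner promise from `then` keeps the chain flat: the caller sees a single promise, whichever branch was taken.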

Other parts

Great ! Who can do more can do less.

The two other endpoints, /get-answer-for?q= and /all-questions, don’t use NightmareJS. They are simple, self-describing ExpressJS endpoints.
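To give an idea of how little these two endpoints need, here is a rough sketch. It is not the repository's code: it uses Node's built-in `http` module instead of ExpressJS so that it runs without any install, it assumes answers live in an in-memory `Map`, and the 404 response for an unknown question is a guess. The path /get-answer-for follows the Quickstart calls.

```javascript
const http = require('http');

// In-memory store, question -> answer (an assumption for this sketch).
const answers = new Map();

// Pure routing function: easy to read and to test without a socket.
function handle(method, rawUrl) {
  const url = new URL(rawUrl, 'http://localhost:5000');
  if (method === 'GET' && url.pathname === '/get-answer-for') {
    const q = url.searchParams.get('q'); // already URL-decoded
    if (answers.has(q)) {
      return { status: 200, body: answers.get(q) };
    }
    return { status: 404, body: 'No answer recorded for this question' };
  }
  if (method === 'GET' && url.pathname === '/all-questions') {
    return { status: 200, body: JSON.stringify(Array.from(answers.keys())) };
  }
  return { status: 404, body: 'Not found' };
}

const server = http.createServer(function (req, res) {
  const result = handle(req.method, req.url);
  res.writeHead(result.status, { 'Content-Type': 'text/plain' });
  res.end(result.body);
});

// server.listen(5000); // uncomment to actually serve requests
```

Keeping the routing in a plain function, separate from the server wiring, makes the read-only endpoints trivial to check by hand.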

Concluding thoughts

We have now :

Possible improvements :