Web scraping as a service

08 August, 2017

Today we’ll see how to build a web service that scrapes the web. Some requirements could be:

Ability to extract data from DOM and return it,
Ability to make some conditional browsing,
Ability to create a job, with params,
Ability to retrieve the result of this time-consuming job.

The stack relies on big classics:

NodeJS
Git
NightmareJS

For the latter: two years ago I tried SpookyJS, but I found it quite complicated. Spooky is no longer very active, and the excellent NightmareJS is a good replacement: it lets us drive a headless browser in a simple, intuitive way.

The website we will scrape

We will scrape one of the most technologically advanced AI applications ever: ask-the-dude (see it here).

We can ask the dude any question:

[Image: Ask any question]

The dude takes time to answer.

[Image: Take time]

The dude always replies.

[Image: Answer is yes]

The answer may vary, depending on the question.

[Image: Answer is no]

If you forget the trailing “?”, you get no answer, but in that case the dude offers to display a random quote instead.

[Image: Answer is error]

Pretty incredible, isn’t it?

Some bad news

Unfortunately, the ignoble coder didn’t release any API for “the dude”.

This means you can’t access “the dude” programmatically. The only way to get an API is to create a web service that scrapes “the dude”.

The API we will create

POST /ask: asks the dude a question. The question is in the body of the request. The API replies “OK” to acknowledge the question, but cannot answer immediately (remember, the dude may take time to think about your question and reply properly).
GET /get-answer-to?q=: use this endpoint a few seconds after the POST to get the answer to the question that matches the q parameter.
GET /all-questions: displays all questions already asked, with their associated answers.
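
The “acknowledge now, answer later” contract of POST /ask could be sketched roughly as below. This is only an illustration under assumptions, not the repository's code: `answers`, `makeAskHandler` and `scrape` are hypothetical names, and the handler assumes an Express-style `req`/`res` pair with an already parsed request body.

```javascript
// Hypothetical in-memory store: question -> { status, answer }.
const answers = new Map();

// Builds an Express-style handler. `scrape` is any function that takes a
// question and returns a promise of the answer (an assumption for this sketch).
function makeAskHandler(scrape) {
  return function ask(req, res) {
    const question = req.body.question;
    answers.set(question, { status: 'pending', answer: null });

    // Start the slow job, but do not wait for it to finish.
    scrape(question)
      .then((answer) => answers.set(question, { status: 'done', answer: answer }))
      .catch((err) => answers.set(question, { status: 'failed', answer: String(err) }));

    // Acknowledge immediately; the client comes back later for the result.
    res.send('OK');
  };
}
```

The point of the design is that the HTTP response never waits for the scraping job: the job only updates the store, and the read endpoints look the result up afterwards.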

Quickstart

$> node --version
v6.9.5
$> git --version
git version 2.7.2
$> git clone git@github.com:bdavidxyz/web-scraping-as-a-service.git
$> cd web-scraping-as-a-service
$> npm install
$> npm start

You can open it at http://localhost:5000/; a welcome message should be displayed if everything installed correctly.

Good! Now our service is ready to be tested.

Open Chrome
Open http://code.jquery.com/ (we will use some simple jQuery code to test our service)
Open the console from there (Cmd+Alt+I on Mac, F12 on Windows)
Copy/paste the following code snippets

var a = $.ajax({
 type: "POST",
 url: "http://localhost:5000/ask",
 data: {question:'Do you like butter ?'},
 success: function(e){console.log(e);}
});

“OK” should be printed. Wait a few seconds, then run:

var a = $.ajax({
 type: "GET",
 url: "http://localhost:5000/get-answer-for?q=Do you like butter ?",
 success: function(e){console.log(e);}
});

You should get the answer to the question. Now try to get an answer to a question you never asked:

var a = $.ajax({
 type: "GET",
 url: "http://localhost:5000/get-answer-for?q=WTF ?",
 success: function(e){console.log(e);}
});

You can also print all questions:

var a = $.ajax({
 type: "GET",
 url: "http://localhost:5000/all-questions",
 success: function(e){console.log(e);}
});
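
The manual “wait a few seconds” step between the POST and the GET can be automated with a small polling helper. The sketch below is an assumption-laden illustration: `pollForAnswer`, `fetchAnswer`, `intervalMs` and `maxAttempts` are hypothetical names, and `fetchAnswer` is assumed to be a caller-supplied function (for instance a wrapper around the GET request) that resolves to `null` while no answer is available.

```javascript
// Resolves after `ms` milliseconds.
function sleep(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Calls fetchAnswer() until it yields a non-null value or attempts run out.
function pollForAnswer(fetchAnswer, options) {
  const intervalMs = (options && options.intervalMs) || 1000;
  const maxAttempts = (options && options.maxAttempts) || 10;

  function attempt(n) {
    return Promise.resolve(fetchAnswer()).then((answer) => {
      if (answer !== null && answer !== undefined) {
        return answer; // the job is finished
      }
      if (n >= maxAttempts) {
        throw new Error('No answer after ' + maxAttempts + ' attempts');
      }
      // Not ready yet: wait, then try again.
      return sleep(intervalMs).then(() => attempt(n + 1));
    });
  }

  return attempt(1);
}
```

Capping the number of attempts keeps a lost or failed job from being polled forever.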

Code

Ask a question

The relevant part is here: https://github.com/bdavidxyz/web-scraping-as-a-service/blob/master/index.js#L27-L72

NightmareJS is pretty intuitive: you can chain basic instructions very easily. However, the famous JS pyramid nightmare (ahem) cannot be completely avoided: once you evaluate anything on the page, the result of that evaluation is wrapped in a promise.

You have to be very careful about these three things:

Don’t forget the “return” keyword in front of the nightmare instance inside the promise callback, or the promises will not be chained.
Don’t forget to re-instantiate NightmareJS on every request.
Don’t forget to end your nightmare instance once you have used it.

Notice that this example achieves conditional browsing: based on the result of a first evaluation, you can reuse the nightmare instance and scrape the web page again.
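
The three rules and the conditional browsing can be pictured with the sketch below. It is not the code from the repository: the URL, the CSS selectors, the “empty text means no answer” convention and the `askTheDude` / `createBrowser` names are all hypothetical placeholders. Only the NightmareJS calls themselves (`goto`, `type`, `click`, `wait`, `evaluate`, `end`, `then`) are real.

```javascript
// Hypothetical placeholders: the real URL and selectors live in the repository.
const DUDE_URL = 'http://localhost:3000/ask-the-dude';
const NO_ANSWER = '';

// createBrowser must return a fresh NightmareJS instance on every call,
// e.g. () => require('nightmare')({ show: false }).
function askTheDude(createBrowser, question) {
  const nightmare = createBrowser(); // rule 2: one instance per request

  return nightmare
    .goto(DUDE_URL)
    .type('#question', question)
    .click('#ask')
    .wait('#result')
    .evaluate(() => document.querySelector('#result').textContent.trim())
    .then((answer) => {
      if (answer !== NO_ANSWER) {
        return answer;
      }
      // Conditional browsing: no answer, so reuse the instance to get a quote.
      // Rule 1: return the chain, otherwise the next then() will not wait for it.
      return nightmare
        .click('#random-quote')
        .wait('#quote')
        .evaluate(() => document.querySelector('#quote').textContent.trim());
    })
    .then(
      // Rule 3: always end the instance, on success and on failure.
      (result) => nightmare.end().then(() => result),
      (err) => nightmare.end().then(() => { throw err; })
    );
}
```

Passing a factory instead of a shared instance makes it hard to forget rule 2, and it also lets the function be exercised with a fake browser object.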

Other parts

Great! If you can do the hard part, you can do the easy part.

The two other endpoints, /get-answer-to?q= and /all-questions, don’t use NightmareJS. They are simple, self-describing ExpressJS endpoints.
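
For illustration, such read-only endpoints might look like the sketch below. It assumes a hypothetical in-memory `store` filled by the /ask endpoint and hypothetical handler names, and the 404 response for an unknown question is a choice made for this sketch. The Express calls used (`req.query`, `res.status`, `res.send`, `res.json`) are the standard ones.

```javascript
// Hypothetical in-memory store filled by the /ask endpoint: question -> answer.
const store = new Map();

// GET /get-answer-to?q=...  (Express exposes query parameters on req.query)
function getAnswerTo(req, res) {
  const question = req.query.q;
  if (!store.has(question)) {
    return res.status(404).send('No answer for this question (yet)');
  }
  return res.send(store.get(question));
}

// GET /all-questions
function allQuestions(req, res) {
  const list = [];
  store.forEach((answer, question) => list.push({ question: question, answer: answer }));
  res.json(list);
}

// Wiring, assuming an Express app (not executed in this sketch):
// const app = require('express')();
// app.get('/get-answer-to', getAnswerTo);
// app.get('/all-questions', allQuestions);
```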

Concluding thoughts

We now have:

A completely free way to scrape any website,
A way to APIfy any website that is not accessible by any means other than a web browser,
A way to create background jobs,
A way to do some conditional browsing.

Possible improvements:

You can deploy the service to a platform like Heroku to make your API accessible from anywhere.
I admit that this doesn’t scale very well. For intensive tasks you may want a more serious background job solution, and polling an API is not a very good idea either. But that’s enough for today :).