Use Node.js and Puppeteer for web crawling

A short introductory tutorial on web scraping

Web scraping is the task of downloading web pages and extracting specific information from them.

I recently built a small project using an Arduino board with an LCD display. With Johnny-Five, we can program the Arduino using Node.js. I wanted to get the temperature measured at the top of a mountain and display it on the Arduino board.

I used Puppeteer to do the scraping. Puppeteer is an excellent tool created by Google. It is a Node library that we can use to control headless Chrome instances.

This means that we basically use Chrome, but programmatically.

Puppeteer has many practical uses, including automated testing, taking screenshots, and creating server-side rendered versions of single-page applications.

First, install it using:

npm install puppeteer

Then require it in your Node.js file:

const puppeteer = require('puppeteer');

Then we can use the launch() method to create a browser instance:

(async () => {
  const browser = await puppeteer.launch()
})()

We use await, so we must wrap this method call in an async function, which we call immediately.

Next, we can use the newPage() method on the browser object to get a page object:

(async () => {
  const browser = await puppeteer.launch()
  const page = await browser.newPage()
})()

Next, we call the goto() method on the page object to load the page:

(async () => {
  const browser = await puppeteer.launch()
  const page = await browser.newPage()
  await page.goto('https://website.com')
})()

Finally, we can get the page content by calling the evaluate() method of page. This method takes a callback function, where we can add the code needed to retrieve the page elements we are interested in. The function is executed in the context of the page, so we can access document and all the browser APIs. Whatever we return from the callback becomes the result of the evaluate() call.

We can use the Selectors API to retrieve data from the page.

(async () => {
  const browser = await puppeteer.launch()
  const page = await browser.newPage()
  await page.goto('https://website.com')
  const result = await page.evaluate(() => {
    //...
  })
})()
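As a rough sketch of what such a callback could look like, the function below reads the text of the first element matching a CSS selector. The name readText and the 'h1' selector are only illustrative and not part of Puppeteer. It relies on two things: document.querySelector() returns null when nothing matches, and page.evaluate() forwards any extra arguments to the callback.

```javascript
// Hypothetical helper: meant to run inside the page via page.evaluate(),
// where `document` refers to the loaded page's DOM.
function readText(selector) {
  const element = document.querySelector(selector)
  // querySelector() returns null when no element matches,
  // so guard before reading innerText.
  return element ? element.innerText.trim() : null
}

// Intended use inside the async function shown above (assumes `page` exists):
//   const heading = await page.evaluate(readText, 'h1')
```

Returning null instead of throwing lets the Node.js side decide what to do when the page layout changes and the selector no longer matches.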

Let's solve the specific problem I had. This is the page of the weather station, located at the top of the mountain at 3315 m: http://www.meteocentrale.ch/it/europa/svizzera/meteo-corvatsch/details/S067910/

I want to get that -9°C text. Using the browser inspector, I can see that it has a column-4 class attached. This is not an ideal class name, because it carries no meaning and may change if they decide to add a new column, but it is what we have to work with.

This is the complete code so far:

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch()
  const page = await browser.newPage()
  await page.goto('http://www.meteocentrale.ch/it/europa/svizzera/meteo-corvatsch/details/S067910/')

  // This callback runs inside the page, so document is available
  const result = await page.evaluate(() => {
    let temperature = document.querySelector('.column-4').innerText
    return {
      temperature
    }
  })

  console.log(result)

  // Close the browser so the Node.js process can exit
  await browser.close()
})()

If we run this code, result will have the following value:

{
  temperature: '-9°C'
}

Or whatever the current temperature is.
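The scraped value is a string, which is fine for printing on an LCD as-is. If a number is needed instead (for example to compare readings), a small parser along these lines might help. This is only a sketch: parseTemperature is a hypothetical helper, and it assumes the text contains a single signed number followed by a unit, as in '-9°C'.

```javascript
// Hypothetical helper: turns text such as '-9°C' into the number -9.
// Returns null when no number can be found in the text.
function parseTemperature(text) {
  // Optional sign (ASCII or Unicode minus), digits, optional decimal part
  // written with either a dot or a comma.
  const match = /[-+−]?\d+(?:[.,]\d+)?/.exec(text)
  if (!match) return null
  // Normalise the Unicode minus and the decimal comma before converting.
  return Number(match[0].replace('−', '-').replace(',', '.'))
}

console.log(parseTemperature('-9°C')) // -9
```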