
Scraping with Puppeteer: A Practical Example

If you’re looking to scrape websites for data, Puppeteer is a powerful tool to consider. In this blog post, we’ll walk you through the process of creating a JavaScript job board that aggregates remote jobs for JavaScript developers using Puppeteer.

Steps to Complete the Project

To complete this project, you’ll need to follow these steps:

  1. Create a Node.js scraper using Puppeteer to fetch jobs from the remoteok.io website.
  2. Store the jobs into a database.
  3. Create a Node.js application to display the jobs on your own website.

Before we dive in, a quick note: although we’re using remoteok.io as an example in this post, we don’t recommend scraping it. The website has an official API that you should consider using instead. We’re using it here only to show how Puppeteer works against a well-known website and to demonstrate its practical applications.

Let’s get started!

Creating a Scraper for JavaScript Jobs

Our goal is to scrape JavaScript jobs from remoteok.io. On this website, JavaScript jobs are listed under the “JavaScript” tag. At the time of writing, all JavaScript jobs can be found on this page: https://remoteok.io/remote-javascript-jobs.

It’s important to note that websites can change at any time, which means our scraping application may stop working if the website’s structure changes. Unlike APIs, scraping applications require more maintenance. However, in some cases scraping is the only option, which makes it a valuable tool to have.

To get started, create a new folder for your project. Inside the folder, run npm init -y to initialize a new Node.js project. Then, install Puppeteer using npm install puppeteer.

Next, create an app.js file and require the Puppeteer library at the top:

const puppeteer = require('puppeteer');

Now, let’s use the launch() method to create a browser instance:

(async () => {
  const browser = await puppeteer.launch({ headless: false });
})();

By passing the { headless: false } configuration object, we can see Chrome while Puppeteer is running, which is helpful for debugging purposes.
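While debugging, you can also pass a slowMo option to launch(), which delays every Puppeteer operation by the given number of milliseconds. Here is a minimal sketch (the 100 ms value is just an example):

const puppeteer = require('puppeteer');

(async () => {
  // slowMo slows each Puppeteer operation down by 100 ms, so you can follow
  // what the visible browser is doing step by step.
  const browser = await puppeteer.launch({ headless: false, slowMo: 100 });
  await browser.close();
})();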

Next, we’ll use the newPage() method to get the page object, and then call the goto() method to load the JavaScript jobs page:

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({ headless: false });
  const page = await browser.newPage();
  await page.goto('https://remoteok.io/remote-javascript-jobs');
})();

To see what’s happening, run node app.js from the terminal. This will start a Chromium instance and load the page specified:

Chromium instance loading the website

Getting the Jobs from the Page

Now, let’s figure out how to extract the job details from the page. Puppeteer provides a page.evaluate() function that allows us to execute JavaScript code within the context of the page.

Inside the page.evaluate() callback function, we have access to the document object, which points to the page’s DOM. However, keep in mind that any output from this function will be printed to the browser console, not the Node.js terminal.
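To see the difference, here is a minimal sketch (run inside the same async function as the snippets above, using the page object we already have): the console.log() inside evaluate() appears in the browser’s devtools console, while the value we return comes back to Node.js:

const title = await page.evaluate(() => {
  console.log(document.title); // printed in the browser's console, not the terminal
  return document.title;       // returned to the Node.js side
});
console.log(title); // printed in the Node.js terminal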

Instead of printing to the console, we can return an object from this function to access it as the value returned by page.evaluate():

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({ headless: false });
  const page = await browser.newPage();
  await page.goto('https://remoteok.io/remote-javascript-jobs');

  /* Run JavaScript inside the page */
  const data = await page.evaluate(() => {
    const list = [];
    const items = document.querySelectorAll('tr.job');

    for (const item of items) {
      list.push({
        company: item.querySelector('.company h3').innerHTML,
        position: item.querySelector('.company h2').innerHTML,
        link: 'https://remoteok.io' + item.getAttribute('data-href'),
      });
    }

    return list;
  });

  console.log(data);
  await browser.close();
})();

In this example code, we create an empty array called list and use querySelectorAll() to find each job element on the page. We then extract the company name, position, and link for each job and push them into the list array.

To determine the correct selectors to use, you can inspect the page source using the browser devtools:

Inspecting the page source


Running this code will return an array of objects, each containing the job details:

Array of job details

Storing Jobs in a Database

Now that we have the job data, let’s store it in a local database. For this example, we’ll use MongoDB.

First, install the MongoDB package by running npm install mongodb in the terminal.

Next, in the app.js file, add the code needed to connect to MongoDB and initialize the collection:

const puppeteer = require('puppeteer');
const mongo = require('mongodb').MongoClient;

const url = 'mongodb://localhost:27017';
let db, jobs;

mongo.connect(
  url,
  {
    useNewUrlParser: true,
    useUnifiedTopology: true,
  },
  (err, client) => {
    if (err) {
      console.error(err);
      return;
    }
    db = client.db('jobs');
    jobs = db.collection('jobs');
    /* Rest of the code */
  }
);

Inside the connection callback function, we can now use the jobs collection to store our job data:

const puppeteer = require('puppeteer');
const mongo = require('mongodb').MongoClient;

const url = 'mongodb://localhost:27017';
let db, jobs;

mongo.connect(
  url,
  {
    useNewUrlParser: true,
    useUnifiedTopology: true,
  },
  (err, client) => {
    if (err) {
      console.error(err);
      return;
    }
    db = client.db('jobs');
    jobs = db.collection('jobs');
    (async () => {
      const browser = await puppeteer.launch({ headless: false });
      const page = await browser.newPage();
      await page.goto('https://remoteok.io/remote-javascript-jobs');

      /* Run JavaScript inside the page */
      const data = await page.evaluate(() => {
        const list = [];
        const items = document.querySelectorAll('tr.job');

        for (const item of items) {
          list.push({
            company: item.querySelector('.company h3').innerHTML,
            position: item.querySelector('.company h2').innerHTML,
            link: 'https://remoteok.io' + item.getAttribute('data-href'),
          });
        }

        return list;
      });

      console.log(data);
      jobs.deleteMany({});
      jobs.insertMany(data);
      await browser.close();
    })();
  }
);

We added the following code at the end of the function:

jobs.deleteMany({});
jobs.insertMany(data);

This clears the MongoDB collection and then inserts the job data.
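Both methods return promises when no callback is passed. Since this all runs inside an async function, a slightly safer variant (a sketch, not part of the original code) is to await them, so the writes are guaranteed to finish before the browser and the database connection are closed:

// Sketch: await the database writes so they complete before cleanup.
await jobs.deleteMany({});
await jobs.insertMany(data);
await browser.close();
client.close(); // also close the MongoDB connection so the process can exit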

By running node app.js again, you will see the data being stored in the MongoDB database. You can inspect the database content using the terminal or an app like TablePlus.

MongoDB database content

Now, you can set up a cron job or any other automation to run this application periodically and update the database with fresh job data.
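If you’d rather keep the scheduling in Node.js itself, one possible approach (a sketch, assuming you wrap the scraping logic above in a function and install the node-cron package) looks like this; scrapeJobs and ./scraper are hypothetical names for that wrapper:

const cron = require('node-cron'); // npm install node-cron

// Hypothetical wrapper around the Puppeteer + MongoDB code above.
const scrapeJobs = require('./scraper');

// Run the scraper every day at 08:00.
cron.schedule('0 8 * * *', () => {
  scrapeJobs().catch((err) => console.error(err));
});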

Creating the Node.js App to Visualize the Jobs

Finally, let’s create a Node.js application to visualize the job data stored in the database. We’ll use Express as the web framework and Pug as the server-side template engine.

First, create a new folder for your application and run npm init -y inside it. Then, install Express, MongoDB, and Pug using npm install express mongodb pug.

In your app.js file, initialize Express and set the view engine to Pug:

const express = require('express');
const path = require('path');

const app = express();
app.set('view engine', 'pug');
app.set('views', path.join(__dirname, '.'));

app.get('/', (req, res) => {
  /*...*/
});

app.listen(3000, () => console.log('Server ready'));

Next, initialize MongoDB and retrieve the job data from the database:

const express = require('express');
const path = require('path');

const app = express();
app.set('view engine', 'pug');
app.set('views', path.join(__dirname, '.'));

const mongo = require('mongodb').MongoClient;

const url = 'mongodb://localhost:27017';
let db, jobsCollection, jobs;

mongo.connect(
  url,
  {
    useNewUrlParser: true,
    useUnifiedTopology: true,
  },
  (err, client) => {
    if (err) {
      console.error(err);
      return;
    }
    db = client.db('jobs');
    jobsCollection = db.collection('jobs');
    jobsCollection.find({}).toArray((err, data) => {
      jobs = data;
    });
  }
);

app.get('/', (req, res) => {
  /*...*/
});

app.listen(3000, () => console.log('Server ready'));

We’ve added the necessary code to connect to the MongoDB database and get the job data using the find() method.

Finally, we’ll render a Pug template when the user visits the root URL ("/"):

app.get('/', (req, res) => {
  res.render('index', {
    jobs,
  });
});
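One caveat with this approach: the jobs variable is loaded only once, when the server starts. An alternative sketch (not part of the original code) is to query the collection on every request, so the page always reflects the latest database contents:

app.get('/', (req, res) => {
  // Query the collection on each request so newly scraped jobs show up
  // without restarting the server.
  jobsCollection.find({}).toArray((err, data) => {
    if (err) {
      console.error(err);
      return res.sendStatus(500);
    }
    res.render('index', { jobs: data });
  });
});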

In the same folder as your app.js file, create an index.pug file. This file will iterate through the jobs array and display the job details:

html
  body
    each job in jobs
      p
        | #{job.company}
        br
        a(href=`${job.link}`) #{job.position}

Start the application by running node app.js, and you’ll see the jobs displayed on the webpage:

Job details displayed on the webpage

And there you have it! You’ve successfully created a JavaScript job board that scrapes data using Puppeteer, stores it in a MongoDB database, and displays the jobs on a webpage.

Tags: Puppeteer, Web Scraping, JavaScript, Node.js, MongoDB, Express, Pug