Scraping with Puppeteer: A Practical Example
If you’re looking to scrape websites for data, Puppeteer is a powerful tool to consider. In this blog post, we’ll walk you through using Puppeteer to build a JavaScript job board that aggregates remote jobs for JavaScript developers.
Steps to Complete the Project
To complete this project, you’ll need to follow these steps:
- Create a Node.js scraper using Puppeteer to fetch jobs from the remoteok.io website.
- Store the jobs into a database.
- Create a Node.js application to display the jobs on your own website.
Before we dive in, we would like to mention that while we are using remoteok.io as an example in this blog post, we do not recommend scraping it. This website has an official API that you should consider using instead. We are using it here to showcase how Puppeteer works with a well-known website and to demonstrate its practical applications.
Let’s get started!
Creating a Scraper for JavaScript Jobs
Our goal is to scrape JavaScript jobs from remoteok.io. On this website, JavaScript jobs are listed under the “JavaScript” tag. At the time of writing, all JavaScript jobs can be found on this page: https://remoteok.io/remote-javascript-jobs.
It’s important to note that websites can change at any time, which means our scraping application may stop working if the website structure changes. Unlike APIs, scraping applications require more maintenance. However, in some cases, scraping is the only option for specific tasks, making it a valuable tool in certain scenarios.
To get started, create a new folder for your project. Inside the folder, run `npm init -y` to initialize a new Node.js project. Then, install Puppeteer with `npm install puppeteer`.
Next, create an `app.js` file and require the Puppeteer library at the top:

```js
const puppeteer = require("puppeteer");
```
Now, let’s use the `launch()` method to create a browser instance:

```js
(async () => {
  const browser = await puppeteer.launch({ headless: false });
})();
```
By passing the `{ headless: false }` configuration object, we can see Chrome while Puppeteer is running, which is helpful for debugging.
Next, we’ll use the `newPage()` method to get the `page` object, and then call the `goto()` method to load the JavaScript jobs page.
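Putting the calls together so far (`headless: false` stays on so you can watch the browser work):

```js
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({ headless: false });
  const page = await browser.newPage();
  await page.goto('https://remoteok.io/remote-javascript-jobs');
})();
```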
To see what’s happening, run `node app.js` from the terminal. This will start a Chromium instance and load the specified page.
Getting the Jobs from the Page
Now, let’s figure out how to extract the job details from the page. Puppeteer provides a `page.evaluate()` function that allows us to execute JavaScript code within the context of the page.

Inside the `page.evaluate()` callback function, we have access to the `document` object, which points to the page’s DOM. However, keep in mind that anything this function prints goes to the browser console, not the Node.js terminal.

Instead of printing to the console, we can return a value from the callback, and it becomes the value returned by `page.evaluate()` itself.
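Here’s a sketch of that extraction step. The `.job`, `.company`, and `.position` selectors are illustrative rather than the site’s actual markup; use devtools to find the current ones:

```js
const data = await page.evaluate(() => {
  const list = [];

  // Illustrative selector: one element per job listing.
  const items = document.querySelectorAll('.job');

  for (const item of items) {
    list.push({
      company: item.querySelector('.company').innerText,
      position: item.querySelector('.position').innerText,
      link: item.querySelector('a').href,
    });
  }

  return list;
});
```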
In this example code, we create an empty array called `list` and use `querySelectorAll()` to find each job element on the page. We then extract the company name, position, and link for each job and push them into the `list` array.
To determine the correct selectors to use, inspect the page source using the browser devtools.
Here’s the complete code for extracting the job details.
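A complete sketch of the scraper, with the same caveat about the selectors:

```js
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({ headless: false });
  const page = await browser.newPage();
  await page.goto('https://remoteok.io/remote-javascript-jobs');

  const data = await page.evaluate(() => {
    const list = [];
    // Illustrative selectors; inspect the live page to confirm.
    const items = document.querySelectorAll('.job');

    for (const item of items) {
      list.push({
        company: item.querySelector('.company').innerText,
        position: item.querySelector('.position').innerText,
        link: item.querySelector('a').href,
      });
    }

    return list;
  });

  console.log(data);
  await browser.close();
})();
```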
Running this code will return an array of objects, each containing the job details.
Storing Jobs in a Database
Now that we have the job data, let’s store it in a local database. For this example, we’ll use MongoDB.
First, install the MongoDB package by running `npm install mongodb` in the terminal.
Next, in the `app.js` file, add the necessary code to connect to the MongoDB database and initialize the `jobs` collection.
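A minimal sketch, assuming a local MongoDB instance and a database named `jobboard` (both are assumptions; adjust them to your setup):

```js
const puppeteer = require('puppeteer');
const { MongoClient } = require('mongodb');

(async () => {
  // Assumed connection string for a local MongoDB instance.
  const client = new MongoClient('mongodb://localhost:27017');
  await client.connect();

  const jobs = client.db('jobboard').collection('jobs');

  // ...the scraping code from the previous section goes here...
})();
```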
With the connection established, we can now use the `jobs` collection to store our job data.
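Here’s the whole flow in one sketch, with the connection details and selectors still illustrative:

```js
const puppeteer = require('puppeteer');
const { MongoClient } = require('mongodb');

(async () => {
  // Assumed local connection string and database name.
  const client = new MongoClient('mongodb://localhost:27017');
  await client.connect();
  const jobs = client.db('jobboard').collection('jobs');

  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://remoteok.io/remote-javascript-jobs');

  const data = await page.evaluate(() => {
    const list = [];
    // Illustrative selectors; inspect the live page to confirm.
    const items = document.querySelectorAll('.job');
    for (const item of items) {
      list.push({
        company: item.querySelector('.company').innerText,
        position: item.querySelector('.position').innerText,
        link: item.querySelector('a').href,
      });
    }
    return list;
  });

  await jobs.deleteMany({});
  await jobs.insertMany(data);

  await browser.close();
  await client.close();
})();
```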
We added the following code at the end of the function:

```js
await jobs.deleteMany({});
await jobs.insertMany(data);
```

This clears the `jobs` collection and then inserts the fresh job data.
By running `node app.js` again, you will see the data being stored in the MongoDB database. You can inspect the database content using the terminal or an app like TablePlus.
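For a quick check from the terminal, assuming the `mongosh` shell and the database name used above:

```
$ mongosh
> use jobboard
> db.jobs.find()
```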
Now, you can set up a cron job or any other automation to run this application periodically and update the database with fresh job data.
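For example, a crontab entry along these lines (the path is a placeholder, and cron may need the full path to `node`) would refresh the data every hour:

```
0 * * * * cd /path/to/scraper && node app.js
```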
Creating the Node.js App to Visualize the Jobs
Finally, let’s create a Node.js application to visualize the job data stored in the database. We’ll use Express as the web framework and Pug as the server-side template engine.
First, create a new folder for your application and run `npm init -y` inside it. Then, install Express, MongoDB, and Pug using `npm install express mongodb pug`.
In your `app.js` file, initialize Express and set the view engine to Pug.
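A minimal sketch; the `views` setting points Express at the folder where we’ll create the template shortly:

```js
const express = require('express');

const app = express();
app.set('view engine', 'pug');
// Look for templates next to app.js (we create index.pug there below).
app.set('views', __dirname);
```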
Next, initialize MongoDB and retrieve the job data from the database.
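One way to sketch this, reusing the assumed local connection string and `jobboard` database from the scraper; `getJobs()` is a hypothetical helper introduced here for clarity:

```js
const express = require('express');
const { MongoClient } = require('mongodb');

const app = express();
app.set('view engine', 'pug');
app.set('views', __dirname);

// Assumed connection details; match whatever the scraper used.
const client = new MongoClient('mongodb://localhost:27017');

// Hypothetical helper: connects (calling connect() twice is safe)
// and loads every stored job document.
async function getJobs() {
  await client.connect();
  return client.db('jobboard').collection('jobs').find({}).toArray();
}
```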
We’ve added the necessary code to connect to the MongoDB database and get the job data using the `find()` method.
Finally, we’ll render a Pug template when the user visits the root URL (“/”).
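Using the hypothetical `getJobs()` helper from above (the port is arbitrary):

```js
app.get('/', async (req, res) => {
  const jobs = await getJobs();
  // Pass the jobs array through to the Pug template.
  res.render('index', { jobs });
});

app.listen(3000, () => console.log('Listening on http://localhost:3000'));
```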
In the same folder as your `app.js` file, create an `index.pug` file. This template will iterate through the `jobs` array and display the job details.
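A minimal template sketch, with field names matching the objects the scraper stored:

```pug
html
  body
    h1 Remote JavaScript jobs
    ul
      each job in jobs
        li
          a(href=job.link) #{job.position} at #{job.company}
```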
Start the application by running `node app.js`, and you’ll see the jobs displayed on the webpage.
And there you have it! You’ve successfully created a JavaScript job board that scrapes data using Puppeteer, stores it in a MongoDB database, and displays the jobs on a webpage.
Tags: Puppeteer, Web Scraping, JavaScript, Node.js, MongoDB, Express, Pug