In this blog post, we will discuss how to build a small web crawler using the Go programming language. The purpose of this web crawler is to check if your website has duplicate page titles, which can be detrimental to your SEO efforts.
To start off, we will use the golang.org/x/net/html package, which is not part of the standard library but is maintained by the Go team. It provides us with the necessary tools to parse HTML content.
To use this package, you can install it by executing the following command:
go get golang.org/x/net/html
From this package, we will specifically use the html.Parse() function and the html.Node struct.
Let’s dive into the code. The first program accepts a URL as a command line argument and recursively collects the unique links it finds, starting from that page. It then outputs each visited URL and the corresponding page title.
Here is the main() function:
package main

import (
    "fmt"
    "net/http"
    "os"
    "strings"

    "golang.org/x/net/html"
)

func main() {
    // Require the URL to crawl as the first command line argument.
    if len(os.Args) < 2 || os.Args[1] == "" {
        fmt.Println("Usage: `webcrawler <url>`")
        os.Exit(1)
    }
    url := os.Args[1]

    // visited maps each crawled URL to its page title.
    visited := map[string]string{}
    analyze(url, url, &visited)
    for k, v := range visited {
        fmt.Printf("%s -> %s\n", k, v)
    }
}
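Assuming you build the program as a binary called webcrawler (as the usage string suggests), a run could look like this, with the URL being just a placeholder:

webcrawler https://example.com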
The analyze() function is where the magic happens. It takes a URL and a base URL as parameters and recursively scans the page, following all the links and filling the visited map.
Inside analyze(), we use the parse() function to fetch and parse the HTML content of a URL. We then use the pageTitle() function to extract the page title from the parsed content.

Next, we use the pageLinks() function to recursively scan all the page nodes and extract the unique links. We iterate over these links and check if they have already been visited. If not, we call analyze() with the link’s URL.
func analyze(url, baseurl string, visited *map[string]string) {
    page, err := parse(url)
    if err != nil {
        fmt.Printf("Error getting page %s %s\n", url, err)
        return
    }
    title := pageTitle(page)
    (*visited)[url] = title

    links := pageLinks(nil, page)
    for _, link := range links {
        if (*visited)[link] == "" && strings.HasPrefix(link, baseurl) {
            analyze(link, baseurl, visited)
        }
    }
}
The pageTitle() function is responsible for finding the page title in an HTML node structure. It searches for the <title> tag recursively and returns its value.
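A minimal sketch of what pageTitle() could look like, assuming it receives the root *html.Node returned by parse():

// Sketch: walk the node tree depth-first and return the text content of
// the first <title> element found, or an empty string if there is none.
func pageTitle(n *html.Node) string {
    if n.Type == html.ElementNode && n.Data == "title" && n.FirstChild != nil {
        return n.FirstChild.Data
    }
    for c := n.FirstChild; c != nil; c = c.NextSibling {
        if title := pageTitle(c); title != "" {
            return title
        }
    }
    return ""
}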
The pageLinks() function is similar to pageTitle(), but it extracts all the links from the HTML content. It returns a list of unique links found on the page.
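A possible sketch of pageLinks(), matching the pageLinks(nil, page) call used in analyze(); the contains() helper is a hypothetical convenience added here for deduplication:

// Sketch: walk the node tree and collect the href attribute of every <a>
// element, skipping values that were already collected.
func pageLinks(links []string, n *html.Node) []string {
    if n.Type == html.ElementNode && n.Data == "a" {
        for _, a := range n.Attr {
            if a.Key == "href" && !contains(links, a.Val) {
                links = append(links, a.Val)
            }
        }
    }
    for c := n.FirstChild; c != nil; c = c.NextSibling {
        links = pageLinks(links, c)
    }
    return links
}

// contains reports whether value is already present in the slice.
func contains(s []string, value string) bool {
    for _, v := range s {
        if v == value {
            return true
        }
    }
    return false
}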
The parse() function uses the http package from the standard library to fetch the contents of a URL. It then uses the html.Parse() function from the golang.org/x/net/html package to parse the response body and returns an html.Node reference.
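A rough sketch of how parse() could work, using http.Get and html.Parse:

// Sketch: fetch the URL over HTTP and hand the response body to
// html.Parse, returning the root node of the parsed document.
func parse(url string) (*html.Node, error) {
    resp, err := http.Get(url)
    if err != nil {
        return nil, fmt.Errorf("cannot get page %s: %v", url, err)
    }
    defer resp.Body.Close()

    node, err := html.Parse(resp.Body)
    if err != nil {
        return nil, fmt.Errorf("cannot parse page %s: %v", url, err)
    }
    return node, nil
}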
Now let’s move on to detecting duplicate titles. We will modify the main() function to accept a flag called -dup which, when set, will check for duplicate titles.
import (
    "flag"
    //...
)

func main() {
    var url string
    var dup bool
    flag.StringVar(&url, "url", "", "the url to parse")
    flag.BoolVar(&dup, "dup", false, "if set, check for duplicates")
    flag.Parse()
    if url == "" {
        flag.PrintDefaults()
        os.Exit(1)
    }

    visited := map[string]string{}
    analyze(url, url, &visited)
    for link, title := range visited {
        fmt.Printf("%s -> %s\n", link, title)
    }

    if dup {
        checkDuplicates(&visited)
    }
}
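With these flags in place, an invocation could look like this (again with a placeholder URL):

webcrawler -url=https://example.com -dup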
The checkDuplicates() function scans the visited map for pages with duplicate titles and writes a report. It builds its own uniques map, using the page titles as keys and the corresponding URLs as values. While scanning, it checks whether a title is already present in uniques; if it is, the current page shares its title with a previously seen one and is reported as a duplicate. If no duplicates are found, a short message says so.
func checkDuplicates(visited *map[string]string) {
    found := false
    uniques := map[string]string{}
    fmt.Printf("\nChecking duplicates..\n")
    for link, title := range *visited {
        if uniques[title] == "" {
            uniques[title] = link
        } else {
            found = true
            fmt.Printf("Duplicate title \"%s\" in %s but already found in %s\n", title, link, uniques[title])
        }
    }
    if !found {
        fmt.Println("No duplicates were found 😇")
    }
}
In conclusion, a small web crawler written in Go is a handy tool for detecting duplicate page titles on your website. By following the steps outlined in this blog post, you can easily implement a crawler that helps improve your website’s SEO.
Tags: web crawler, Go programming language, duplicate titles, SEO