Use Go to build a web crawler to detect duplicate titles

In this article I'll write a small web crawler. I wasn't sure if my website had nice page titles site-wide, or whether any titles were duplicated, so I wrote this little utility to find out.

I'll start by writing a command that accepts a starting page from the command line, and follows any link that has the original URL as a base.

Later I'll add an optional flag to detect if the site has duplicate titles, something that might be useful for SEO purposes.

Introducing golang.org/x/net/html

The golang.org/x packages are maintained by the Go team, but for various reasons they are not part of the standard library.

Maybe they are too specific to be used by the majority of Go developers, or they are still under development or experimentation, so they cannot be included in the stdlib, which must honor the Go 1.0 promise of no backwards incompatible changes: once something enters the stdlib, it is "final".

One of these packages is golang.org/x/net/html.

To install it, execute

go get golang.org/x/net...

In this article I will specifically use the html.Parse() function and the html.Node struct:

package html

type Node struct {
	Type NodeType
	Data string
	Attr []Attribute

	FirstChild, NextSibling *Node
}

type NodeType int32

const (
	ErrorNode NodeType = iota
	TextNode
	DocumentNode
	ElementNode
	CommentNode
	DoctypeNode
)

type Attribute struct {
	Key, Val string
}

func Parse(r io.Reader) (*Node, error)
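
To get a feel for this API before building the crawler, here is a minimal, self-contained sketch (the HTML snippet is hard-coded, purely for illustration) that parses a string and prints the name of every element node in the tree:

package main

import (
	"fmt"
	"strings"

	"golang.org/x/net/html"
)

func main() {
	// parse a hard-coded snippet; html.Parse fills in missing tags like <body>
	doc, err := html.Parse(strings.NewReader("<html><head><title>Hi</title></head></html>"))
	if err != nil {
		panic(err)
	}
	// walk the tree depth-first, printing element node names
	var walk func(n *html.Node)
	walk = func(n *html.Node) {
		if n.Type == html.ElementNode {
			fmt.Println(n.Data)
		}
		for c := n.FirstChild; c != nil; c = c.NextSibling {
			walk(c)
		}
	}
	walk(doc) // prints: html, head, title, body
}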

The first program below accepts a URL and computes the unique links found, giving an output like this:

http://localhost:1313/go-filesystem-structure/ -> Filesystem Structure of a Go project
http://localhost:1313/golang-measure-time/ -> Measuring execution time in a Go program
http://localhost:1313/go-tutorial-fortune/ -> Go CLI tutorial: fortune clone
http://localhost:1313/go-tutorial-lolcat/ -> Build a Command Line app with Go: lolcat

Let's start from main(), as it gives a high-level overview of what the program does. It:

  1. gets the url from the CLI arguments using os.Args[1]
  2. instantiates visited, a map with string keys and string values, where we will store the URL and title of each page of the site
  3. calls analyze(). url is passed twice, because the function is recursive and the second parameter serves as the base URL for the recursive calls
  4. iterates over the visited map, which was passed by reference to analyze() and now has all its values filled, and prints them

package main

import (
	"fmt"
	"net/http"
	"os"
	"strings"

	"golang.org/x/net/html"
)

func main() {
	// guard against a missing CLI argument (os.Args[1] would panic otherwise)
	if len(os.Args) < 2 {
		fmt.Println("Usage: webcrawler <url>")
		os.Exit(1)
	}
	url := os.Args[1]
	visited := map[string]string{}
	analyze(url, url, &visited)
	for k, v := range visited {
		fmt.Printf("%s -> %s\n", k, v)
	}
}
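
To try it out (assuming the whole program, including the functions shown below, is saved as main.go, and that a local site is listening on localhost:1313 as in the sample output above):

go run main.go http://localhost:1313/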

Simple enough? Let's dive into analyze(). First off, it calls parse(), which, given a string pointing to a URL, fetches and parses it, returning an html.Node pointer and an error.

func parse(url string) (*html.Node, error)

After checking that this succeeded, analyze() gets the page title using pageTitle(), which, given a reference to an html.Node, scans it until it finds the title tag, and returns its value.

func pageTitle(n *html.Node) string

Once we have the page title, we can add it to the visited map.

Next, we get all the page links by calling pageLinks(), which, given the starting page node, recursively scans all the page nodes and returns a list of unique links found (no duplicates).

func pageLinks(links []string, n *html.Node) []string

Once we have the links slice, we iterate over it and perform a couple of checks: visited must not already contain the link, which means we haven't visited the page yet, and the link must have baseurl as a prefix. If both assertions hold, we can call analyze() with the link URL.

// analyze given a url and a baseurl, recursively scans the page
// following all the links and fills the `visited` map
func analyze(url, baseurl string, visited *map[string]string) {
	page, err := parse(url)
	if err != nil {
		fmt.Printf("Error getting page %s %s\n", url, err)
		return
	}
	title := pageTitle(page)
	(*visited)[url] = title

	// recursively find links
	links := pageLinks(nil, page)
	for _, link := range links {
		if (*visited)[link] == "" && strings.HasPrefix(link, baseurl) {
			analyze(link, baseurl, visited)
		}
	}
}
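
One caveat worth noting: pageLinks() returns the href values exactly as they appear in the markup, so relative links (e.g. /go-tutorial-fortune/) won't have the baseurl prefix and will be skipped. If your site uses relative URLs, a helper like this hypothetical resolveLink() (my addition, built on the stdlib net/url package) could normalize each link before the check. It lives in its own function on purpose: inside analyze() the url parameter would shadow the net/url package name.

// resolveLink resolves a possibly-relative href against the URL of the
// page it was found on, using net/url (add "net/url" to the imports).
// On any parse error it returns href unchanged.
func resolveLink(href, pageURL string) string {
	base, err := url.Parse(pageURL)
	if err != nil {
		return href
	}
	ref, err := url.Parse(href)
	if err != nil {
		return href
	}
	return base.ResolveReference(ref).String()
}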

pageTitle() uses the golang.org/x/net/html API we introduced above. In the first iteration, n is the <html> node. We're looking for the title tag, and the first iteration never satisfies that check, so we loop over the children of the node and their siblings, calling pageTitle() recursively on each new node.

Eventually we reach the <title> tag: an html.Node instance with Type equal to html.ElementNode (see above) and Data equal to title, and we return its content by accessing its FirstChild.Data property (note that if the tag were empty, FirstChild would be nil; real pages virtually always have a non-empty title).

// pageTitle given a reference to a html.Node, scans it until it
// finds the title tag, and returns its value
func pageTitle(n *html.Node) string {
	var title string
	if n.Type == html.ElementNode && n.Data == "title" {
		return n.FirstChild.Data
	}
	for c := n.FirstChild; c != nil; c = c.NextSibling {
		title = pageTitle(c)
		if title != "" {
			break
		}
	}
	return title
}

pageLinks() is not much different from pageTitle(), except that it doesn't stop when it finds the first match; it looks for every link, which is why we pass the links slice as a parameter to this recursive function. Links are discovered by checking that the html.Node has Type html.ElementNode, that Data equals a, and that it has an Attr with Key href, otherwise it might just be an anchor.

// pageLinks will recursively scan a `html.Node` and will return
// a list of links found, with no duplicates
func pageLinks(links []string, n *html.Node) []string {
	if n.Type == html.ElementNode && n.Data == "a" {
		for _, a := range n.Attr {
			if a.Key == "href" {
				if !sliceContains(links, a.Val) {
					links = append(links, a.Val)
				}
			}
		}
	}
	for c := n.FirstChild; c != nil; c = c.NextSibling {
		links = pageLinks(links, c)
	}
	return links
}

sliceContains() is a utility function called by pageLinks() to check for uniqueness in the slice.

// sliceContains returns true if `slice` contains `value`
func sliceContains(slice []string, value string) bool {
	for _, v := range slice {
		if v == value {
			return true
		}
	}
	return false
}
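
As a design note, the linear scan in sliceContains() costs O(n) per link, which is perfectly fine for a small blog. For a larger site, a map-backed set would be a natural swap; here is a sketch of a hypothetical alternative (not part of the original program), where the caller initializes seen once with seen := map[string]bool{}:

// appendUnique appends value to links only if it hasn't been seen yet,
// using a map for O(1) membership checks instead of scanning the slice.
func appendUnique(links []string, seen map[string]bool, value string) []string {
	if !seen[value] {
		seen[value] = true
		links = append(links, value)
	}
	return links
}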

The last function is parse(). It uses the http stdlib functionality to fetch the contents of the URL (http.Get()), then uses the golang.org/x/net/html html.Parse() API to parse the response body of the HTTP request, returning an html.Node reference.

// parse given a string pointing to a URL will fetch and parse it
// returning an html.Node pointer
func parse(url string) (*html.Node, error) {
	r, err := http.Get(url)
	if err != nil {
		return nil, fmt.Errorf("cannot get page %s: %v", url, err)
	}
	// close the response body once parsing is done
	defer r.Body.Close()
	b, err := html.Parse(r.Body)
	if err != nil {
		return nil, fmt.Errorf("cannot parse page %s: %v", url, err)
	}
	return b, nil
}
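
If you want to exercise parse() and pageTitle() in isolation, a quick sketch using the stdlib net/http/httptest package (my addition, not part of the article's program) can serve a fixed page. Saved as main_test.go next to the crawler, it runs with go test:

package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"testing"
)

func TestParseAndPageTitle(t *testing.T) {
	// a throwaway server that returns a page with a known title
	ts := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, `<html><head><title>Test page</title></head><body></body></html>`)
	}))
	defer ts.Close()

	page, err := parse(ts.URL)
	if err != nil {
		t.Fatal(err)
	}
	if got := pageTitle(page); got != "Test page" {
		t.Errorf("pageTitle() = %q, want %q", got, "Test page")
	}
}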

Detect duplicate titles

Since I want to use a command line flag to check for duplicates, I'm going to slightly change how the URL is passed to the program: instead of using os.Args, I'll pass the URL with a flag as well.

This is the modified main() function, which performs flag parsing before the usual preparation for the analyze() execution and the printing of values. In addition, at the end there is a check on the dup boolean flag: if it is true, the program runs checkDuplicates().

import (
	"flag"
//...
)

func main() {
	var url string
	var dup bool
	flag.StringVar(&url, "url", "", "the url to parse")
	flag.BoolVar(&dup, "dup", false, "if set, check for duplicates")
	flag.Parse()

	if url == "" {
		flag.PrintDefaults()
		os.Exit(1)
	}

	visited := map[string]string{}
	analyze(url, url, &visited)
	for link, title := range visited {
		fmt.Printf("%s -> %s\n", link, title)
	}

	if dup {
		checkDuplicates(&visited)
	}
}
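
With flags, the invocation becomes something like this (assuming the binary was built with go build -o webcrawler):

webcrawler -url=http://localhost:1313/ -dup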

checkDuplicates() takes the map of url -> title and iterates over it to build its own uniques map, this time keyed by page title. If uniques[title] is empty, the title has not been seen before; otherwise the title already exists, and we can print the first page found with that title via uniques[title].

// checkDuplicates scans the visited map for pages with duplicate titles
// and writes a report
func checkDuplicates(visited *map[string]string) {
	found := false
	uniques := map[string]string{}
	fmt.Printf("\nChecking duplicates..\n")
	for link, title := range *visited {
		if uniques[title] == "" {
			uniques[title] = link
		} else {
			found = true
			fmt.Printf("Duplicate title \"%s\" in %s but already found in %s\n", title, link, uniques[title])
		}
	}
	if !found {
		fmt.Println("No duplicates were found 😇")
	}
}

Credits

The Go Programming Language book by Donovan and Kernighan uses a web crawler as an example throughout the book, changing it in different chapters to introduce new concepts. The code provided in this article takes inspiration from it.
