In this article I'm going to write a small web crawler. I wasn't sure whether my website had nice page titles site-wide, or whether any titles were duplicated, so I wrote this little utility to find out.

I'll start by writing a command that accepts a start page from the command line, and then follows any link that has the original URL as its base.

Later I'll add an optional flag to detect whether the site has duplicate titles, something that might be useful for SEO purposes.
Introducing golang.org/x/net/html

The golang.org/x packages are maintained by the Go team, but for various reasons they are not part of the standard library.

Maybe they are too specific to be used by the majority of Go developers, or maybe they are still under development or experimentation, so they cannot be included in the stdlib, which must honor the Go 1.0 promise of no backwards incompatible changes: once something enters the stdlib, it's "final".

One of these packages is golang.org/x/net/html.
To install it, execute
go get golang.org/x/net/html
In this article I'll use in particular the html.Parse() function, and the html.Node struct:
```go
package html

type Node struct {
	Type                    NodeType
	Data                    string
	Attr                    []Attribute
	FirstChild, NextSibling *Node
}

type NodeType uint32

const (
	ErrorNode NodeType = iota
	TextNode
	DocumentNode
	ElementNode
	CommentNode
	DoctypeNode
)

type Attribute struct {
	Key, Val string
}

func Parse(r io.Reader) (*Node, error)
```
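To get a feel for the API before diving into the crawler, here's a minimal, self-contained sketch (my own, not part of the program below) that parses an HTML string and prints the element tree:

```go
package main

import (
	"fmt"
	"strings"

	"golang.org/x/net/html"
)

// printTree walks the parsed tree depth-first and prints every
// element node, indented by depth.
func printTree(n *html.Node, depth int) {
	if n.Type == html.ElementNode {
		fmt.Printf("%s<%s>\n", strings.Repeat("  ", depth), n.Data)
	}
	for c := n.FirstChild; c != nil; c = c.NextSibling {
		printTree(c, depth+1)
	}
}

func main() {
	doc, err := html.Parse(strings.NewReader(
		`<html><head><title>Hi</title></head><body><a href="/x">link</a></body></html>`))
	if err != nil {
		panic(err)
	}
	printTree(doc, 0)
}
```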
List website links and page titles
The first program below accepts a URL and crawls it, computing the unique links found, giving output like the following:
http://localhost:1313/go-filesystem-structure/ -> Filesystem Structure of a Go project
http://localhost:1313/golang-measure-time/ -> Measuring execution time in a Go program
http://localhost:1313/go-tutorial-fortune/ -> Go CLI tutorial: fortune clone
http://localhost:1313/go-tutorial-lolcat/ -> Build a Command Line app with Go: lolcat
Let's start with main(), since it gives a high-level overview of what the program does:

- get the `url` from the CLI arguments using `os.Args[1]`
- instantiate `visited`, a map with string keys and string values, where we'll store the URL and title of every page of the site
- call `analyze()`. `url` is passed twice, because the function is recursive and the second parameter serves as the base URL for the recursive calls
- iterate over the `visited` map, which was passed by reference to `analyze()` and now has all its values filled, and print them

```go
package main

import (
	"fmt"
	"net/http"
	"os"
	"strings"

	"golang.org/x/net/html"
)

func main() {
	// guard against a missing CLI argument before indexing os.Args
	if len(os.Args) < 2 || os.Args[1] == "" {
		fmt.Println("Usage: webcrawler <url>")
		os.Exit(1)
	}
	url := os.Args[1]
	visited := map[string]string{}
	analyze(url, url, &visited)
	for k, v := range visited {
		fmt.Printf("%s -> %s\n", k, v)
	}
}
```
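Assuming the code is saved in a file named main.go (the article doesn't name the file, so that's my assumption), the crawler can be run with `go run main.go http://localhost:1313`.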
Simple enough? Let's dive into analyze(). First of all, it calls parse(), which given a string pointing to a URL will fetch and parse it, returning an html.Node pointer and an error.

```go
func parse(url string) (*html.Node, error)
```

After checking for success, analyze() fetches the page title using pageTitle(), which given a reference to an html.Node, scans it until it finds the title tag, and returns its value.

```go
func pageTitle(n *html.Node) string
```

Once we have the page title, we can add it to the visited map.

Next, we get all the page links by calling pageLinks(), which given the starting page node, will recursively scan all the page nodes and return a list of unique links found (no duplicates).

```go
func pageLinks(links []string, n *html.Node) []string
```

Once we have the links slice, we iterate over it and perform a couple of checks: visited must not already contain the page, which means we haven't visited it yet, and the link must have baseurl as a prefix. If both assertions hold, we can call analyze() with the link URL.
```go
// analyze given a url and a baseurl, recursively scans the page
// following all the links and fills the `visited` map
func analyze(url, baseurl string, visited *map[string]string) {
	page, err := parse(url)
	if err != nil {
		fmt.Printf("Error getting page %s %s\n", url, err)
		return
	}
	title := pageTitle(page)
	(*visited)[url] = title

	// recursively find links
	links := pageLinks(nil, page)
	for _, link := range links {
		if (*visited)[link] == "" && strings.HasPrefix(link, baseurl) {
			analyze(link, baseurl, visited)
		}
	}
}
```
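A side note on the &visited parameter: Go maps are reference types, so entries added inside analyze() would be visible to the caller even if visited were passed by value; the pointer (and the (*visited) dereferences it forces) is only strictly required if the callee might reassign the map itself.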
pageTitle() uses the golang.org/x/net/html API we introduced above. At the first iteration, n is the <html> node. We're looking for the title tag. The first iteration will never satisfy this, so we loop first over the children of <html>, then over their siblings, calling pageTitle() recursively with each new node.

Eventually we reach the <title> tag: an html.Node instance whose Type equals html.ElementNode (see above) and whose Data equals title, and we return its content by accessing its FirstChild.Data property.
```go
// pageTitle given a reference to a html.Node, scans it until it
// finds the title tag, and returns its value
func pageTitle(n *html.Node) string {
	var title string
	if n.Type == html.ElementNode && n.Data == "title" {
		return n.FirstChild.Data
	}
	for c := n.FirstChild; c != nil; c = c.NextSibling {
		title = pageTitle(c)
		if title != "" {
			break
		}
	}
	return title
}
```
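One caveat worth noting: if a page contains an empty <title></title>, FirstChild is nil and the return statement above panics. A defensive variant of the check (my addition, not part of the original program) could read:

```go
// guard against an empty <title></title>, whose FirstChild is nil
if n.Type == html.ElementNode && n.Data == "title" {
	if n.FirstChild != nil {
		return n.FirstChild.Data
	}
	return ""
}
```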
pageLinks() is not much different from pageTitle(), except that it doesn't stop at the first item found: it looks up every link, which is why we pass the links slice as a parameter to this recursive function. Links are discovered by checking that the html.Node has Type html.ElementNode, that Data is a, and that it has an Attr with Key href, as otherwise it could just be an anchor.
```go
// pageLinks will recursively scan a `html.Node` and will return
// a list of links found, with no duplicates
func pageLinks(links []string, n *html.Node) []string {
	if n.Type == html.ElementNode && n.Data == "a" {
		for _, a := range n.Attr {
			if a.Key == "href" {
				if !sliceContains(links, a.Val) {
					links = append(links, a.Val)
				}
			}
		}
	}
	for c := n.FirstChild; c != nil; c = c.NextSibling {
		links = pageLinks(links, c)
	}
	return links
}
```
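Note that pageLinks() returns href values exactly as they appear in the markup, so relative links like /about will fail the strings.HasPrefix(link, baseurl) check in analyze() and won't be followed. If you need to handle them too, one option is to resolve every href against the URL of the page it was found on, using the net/url package. The resolveLink function below is a sketch of my own, not part of the original program:

```go
import "net/url"

// resolveLink resolves a possibly-relative href against the URL of
// the page it was found on, returning an absolute URL. On parse
// errors it falls back to returning the href unchanged.
func resolveLink(href, pageURL string) string {
	base, err := url.Parse(pageURL)
	if err != nil {
		return href
	}
	ref, err := url.Parse(href)
	if err != nil {
		return href
	}
	return base.ResolveReference(ref).String()
}
```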
sliceContains() is a utility function called by pageLinks() to check for uniqueness in the slice.
```go
// sliceContains returns true if `slice` contains `value`
func sliceContains(slice []string, value string) bool {
	for _, v := range slice {
		if v == value {
			return true
		}
	}
	return false
}
```
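Since sliceContains() scans the whole slice on every call, pageLinks() is quadratic in the number of links. That's fine for a small site, but a map used as a set makes the membership check constant time. Here's a sketch of that alternative (again, my own variant rather than the article's code):

```go
// pageLinkSet is a set-based variant of pageLinks: using a map keyed
// by the href makes duplicate detection O(1) per link.
func pageLinkSet(links map[string]bool, n *html.Node) map[string]bool {
	if links == nil {
		links = map[string]bool{}
	}
	if n.Type == html.ElementNode && n.Data == "a" {
		for _, a := range n.Attr {
			if a.Key == "href" {
				links[a.Val] = true
			}
		}
	}
	for c := n.FirstChild; c != nil; c = c.NextSibling {
		links = pageLinkSet(links, c)
	}
	return links
}
```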
The last function is parse(). It uses the http stdlib functionality to fetch the contents of the URL (http.Get()) and then uses the golang.org/x/net/html html.Parse() API to parse the response body of the HTTP request, returning an html.Node reference.
```go
// parse given a string pointing to a URL will fetch and parse it
// returning an html.Node pointer
func parse(url string) (*html.Node, error) {
	r, err := http.Get(url)
	if err != nil {
		return nil, fmt.Errorf("cannot get page %s: %v", url, err)
	}
	// release the response body once parsing is done
	defer r.Body.Close()
	b, err := html.Parse(r.Body)
	if err != nil {
		return nil, fmt.Errorf("cannot parse page %s: %v", url, err)
	}
	return b, err
}
```
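One thing worth knowing about http.Get(): it only returns an error for transport-level failures, so a 404 page is still fetched and parsed "successfully". If you want to skip error pages, a guard like this could go right after the http.Get call (my addition, assuming the defer r.Body.Close() above is in place):

```go
// skip non-2xx responses instead of parsing their error pages
if r.StatusCode < 200 || r.StatusCode > 299 {
	return nil, fmt.Errorf("got status %d for page %s", r.StatusCode, url)
}
```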
Detect duplicate titles
Since I want to use a command line flag to check for duplicates, I'm going to slightly change how the URL is passed to the program: instead of using os.Args, I'll pass the URL via a flag as well.

This is the modified main() function, with flag parsing done before the usual preparation of the analyze() execution and printing of the values. In addition, at the end there is a check on the dup boolean flag: if it's true, checkDuplicates() is run.
```go
import (
	"flag"
	//...
)

func main() {
	var url string
	var dup bool
	flag.StringVar(&url, "url", "", "the url to parse")
	flag.BoolVar(&dup, "dup", false, "if set, check for duplicates")
	flag.Parse()

	if url == "" {
		flag.PrintDefaults()
		os.Exit(1)
	}

	visited := map[string]string{}
	analyze(url, url, &visited)
	for link, title := range visited {
		fmt.Printf("%s -> %s\n", link, title)
	}

	if dup {
		checkDuplicates(&visited)
	}
}
```
checkDuplicates() takes the map of url -> title and iterates over it to build its own uniques map, this time keyed by page title, so we can check uniques[title] == "" to determine whether a title already exists, and we can access the first page entered with that title by printing uniques[title].
```go
// checkDuplicates scans the visited map for pages with duplicate titles
// and writes a report
func checkDuplicates(visited *map[string]string) {
	found := false
	uniques := map[string]string{}
	fmt.Printf("\nChecking duplicates...\n")
	for link, title := range *visited {
		if uniques[title] == "" {
			uniques[title] = link
		} else {
			found = true
			fmt.Printf("Duplicate title \"%s\" in %s but already found in %s\n", title, link, uniques[title])
		}
	}

	if !found {
		fmt.Println("No duplicates were found 😇")
	}
}
```
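With the flag version in place, the program is invoked with something like `webcrawler -url=http://localhost:1313 -dup`, or `go run main.go -url=http://localhost:1313 -dup` when running from source (the file name is my assumption), and the duplicates report is printed after the usual list of pages.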
Credits

The Go Programming Language book by Donovan and Kernighan uses a web crawler as an example throughout the book, changing it in different chapters to introduce new concepts. The code provided in this article takes inspiration from the book.