Crawl a Site for 404s or Other Non-200 HTTP Responses
Why would you want to find 404s on a site?
A 404 HTTP response is the status code a web server sends back when the resource you requested is missing. Usually that resource is a web page that doesn’t exist.
On your own site:
- check for typos in links
- check that external sites you link to are still up
- check for internal pages misbehaving
- check for missing site assets, e.g. images
On other sites:
- find candidates for broken link building SEO
How do you find 404s on a site?
This method should work on Mac and Linux terminal shells and uses wget.
Open a terminal and type the following:
$ wget --spider -r -p http://www.site-to-target.com 2>&1 | grep -B 2 ' 404 '
If anything on the site returns a 404, then the output for each will look like:
--2016-03-22 07:59:01--  http://www.site-to-target.com/missing-page
Reusing existing connection to www.site-to-target.com:80.
HTTP request sent, awaiting response... 404 Not Found
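The title also promises other non-200 responses, and the same grep approach can be widened to cover them. The following is a minimal sketch, not a definitive recipe: the function name http_errors_only is made up for illustration, and it assumes wget keeps printing the status on a line containing "awaiting response... " as in the sample output above. It keeps any 4xx or 5xx response along with the two lines before it.

```shell
# Hypothetical helper: read wget output on stdin and keep every
# 4xx or 5xx response, plus the two lines before it (timestamp/URL
# and connection line), mirroring the grep -B 2 used above.
http_errors_only() {
  grep -B 2 -E 'awaiting response\.\.\. [45][0-9]{2} '
}

# Usage (needs network access, so it is left commented out here):
#   wget --spider -r -p http://www.site-to-target.com 2>&1 | http_errors_only
```

Redirects (3xx) are deliberately left out of this pattern; change [45] to [345] if you want to see those too.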
Instead of grepping for the 404s, you can use wget’s output flag (-o) to write the output to a file that you can examine for whatever HTTP response you see fit:
$ wget --spider -r -o ~/site-responses.log -p http://www.site-to-target.com
This saves the crawl responses to a file called site-responses.log in your home folder. Because -o sends wget's messages to the file, no 2>&1 redirect is needed.
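Once the log file exists, one way to examine it is to pair each status code with the URL it belongs to. The sketch below is only an illustration: non_200_urls is a hypothetical function name, and it assumes the log has the same layout as the sample output shown earlier (a line starting with "--" and a timestamp that ends in the URL, followed later by an "awaiting response... " line).

```shell
# Hypothetical helper: list "STATUS URL" for every response in a
# wget log whose status code is not 200.
non_200_urls() {
  awk '
    # A line such as "--2016-03-22 07:59:01--  http://..." ends with the URL.
    /^--[0-9]/ { url = $NF }

    # A line such as "HTTP request sent, awaiting response... 404 Not Found".
    /awaiting response\.\.\. [0-9]/ {
      line = $0
      sub(/.*awaiting response\.\.\. /, "", line)   # keep "404 Not Found"
      split(line, parts, " ")                       # parts[1] is the code
      if (parts[1] != "200") print parts[1], url
    }
  ' "$1" | sort -u   # drop duplicates if a URL was logged more than once
}

# Usage:
#   non_200_urls ~/site-responses.log
```

Each output line is a status code followed by the URL that returned it, so the result can be piped into grep again to narrow it to a single code such as 404.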