Crawl a Site for 404s or Other Non-200 HTTP Responses

Why would you want to find 404s on a site?

A 404 is the HTTP status code a web server sends back when the thing you asked for is missing. Usually that means you requested a web page that doesn't exist.

On your own site:

On other sites:

How do you find 404s on a site?

This method uses wget and should work in macOS and Linux terminal shells. (wget is not preinstalled on macOS, so you may need to install it first.)

Open a terminal and type the following:


$ wget --spider -r -p http://www.site-to-target.com 2>&1 | grep -B 2 ' 404 '

If anything on the site returns a 404, the output for each missing URL will look like this:


--2016-03-22 07:59:01--  http://www.site-to-target.com/missing-page
Reusing existing connection to www.site-to-target.com:80.
HTTP request sent, awaiting response... 404 Not Found
--

Instead of grepping for the 404s, you can use wget's -o (output file) flag to write the log to a file, which you can then search for whatever HTTP response you see fit:


$ wget --spider -r -o ~/site-responses.log -p http://www.site-to-target.com

# Saves the crawl responses to a file called site-responses.log in your home folder.
# No 2>&1 is needed here, because -o already sends wget's messages to the log file.
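
Once the log exists, a small shell function can pull out every URL that came back with something other than 200. This is only a sketch: list_non_200 is a made-up name, and it assumes the log looks like the sample output shown earlier (English wget messages, the URL on the line that starts with "--" and a timestamp, and the status code on the "awaiting response..." line). Other wget versions or locales may format the log differently.

```shell
# list_non_200 LOGFILE
# Hypothetical helper: prints "STATUS URL" for every response in a wget log
# whose status code is not 200. Assumes the log format shown in this post.
list_non_200() {
  awk '
    # Request line, e.g. "--2016-03-22 07:59:01--  http://host/page"
    /^--[0-9]/ { url = $NF }
    # Status line, e.g. "HTTP request sent, awaiting response... 404 Not Found"
    # Field 6 is the numeric status code.
    /awaiting response\.\.\. [0-9]/ {
      if ($6 != "200") print $6, url
    }
  ' "$1"
}
```

After defining the function in your shell, run it as: list_non_200 ~/site-responses.log. Redirects (301, 302) show up in the list as well, since they are also non-200 responses.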

I found this little tip in a great article called A Technical Guide to SEO. Hat-tip to Mattias Geniar.
