Saturday, September 8, 2012

A Simple Way to Hunt Down Broken Links on Your Website

As you already know, we recently launched our new VoIP termination website, and we try to keep it updated, good-looking and in good style! With all the hard work, there are times when you make simple link changes to your website and don't want to run a complete site automation test with tools such as Selenium.

We wanted a way to automate testing for broken links after deploying a new version of the site (we use Capistrano for deployments), so we needed a simple web crawler to automate the tests - wget:

wget --mirror --no-directories --delete-after http://example.com/ 2> ./examplecom-wgetlog.txt
This tells wget to mirror the entire site (deleting each file after download) and redirect stderr, where wget writes its log, to examplecom-wgetlog.txt.
Once completed, you can then do the following to print out the errors:

cat examplecom-wgetlog.txt | grep -B4 ERROR
This will show you output like:
--2012-08-28 12:21:04--
Connecting to||:80... connected.
HTTP request sent, awaiting response... 404 Not Found
2012-08-28 12:21:04 ERROR 404: Not Found.
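The useful bit for automation is grep's exit status: it exits 0 when it finds a match and non-zero when it does not, so a script can branch on it directly. A minimal sketch (the file name here is made up for the demo):

```shell
# grep exits 0 on a match; the if branches on that exit status.
printf 'HTTP request sent, awaiting response... 404 Not Found\n' > /tmp/demo-wgetlog.txt
if grep -q "404" /tmp/demo-wgetlog.txt; then
    echo "errors found - would send the report by mail"
fi
rm -f /tmp/demo-wgetlog.txt
```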
And now, for the fun part: we made a simple cron job that monitors the deployed file's modification timestamp (mtime). If it has changed, a script runs the wget test, and if the test finds errors (grep returns 0), it mails us the report so we can fix things immediately.
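The cron side can be a single crontab line; the path and interval below are placeholders, not what we actually run:

```shell
# Hypothetical crontab entry: check every 10 minutes whether the site was
# redeployed (the script itself exits early if the mtime is unchanged).
*/10 * * * * /usr/local/bin/wget-linkcheck.sh
```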

#!/bin/sh
# Simple script to run wget in order to catch 404 and other annoying stuff.

CMDLINE='wget --verbose --mirror --no-directories --delete-after domain.example'

# The paths and mail address below are placeholders; adjust them for your own setup.
FILETOTRACK=/var/www/domain.example/index.php
TRACKFILE=/var/tmp/wget-linkcheck.mtime
LOGFILE=/var/tmp/wget-linkcheck.log
TMPFILE=/var/tmp/wget-linkcheck.mail
MAILTO=webmaster@domain.example

# TRACKFILE stores the mtime of index.php from when the last test was made.

# We wanna make sure nobody but us can mess with these files.
umask 077
if [ ! -e "$TRACKFILE" ]; then
    /usr/bin/env touch "$TRACKFILE"
    /usr/bin/env chmod 600 "$TRACKFILE"
    # Seed the track file with an older timestamp so the first run always tests.
    /usr/bin/env expr `/usr/bin/env stat -c %Y "$FILETOTRACK"` - 100 > "$TRACKFILE"
fi

TRACKDATA=`/usr/bin/env cat "$TRACKFILE"`
if [ `/usr/bin/env stat -c %Y "$FILETOTRACK"` = "$TRACKDATA" ]; then
    # Our track data matches the timestamp of the deployed file, no need to run the test again.
    echo "$TRACKFILE timestamp matches $FILETOTRACK, no work is necessary."
    exit 0
fi

/usr/bin/env rm -f "$LOGFILE"
if [ "$?" != "0" ]; then
    /usr/bin/env echo "Fatal: unable to delete $LOGFILE"
    exit 1
fi

/usr/bin/env touch "$LOGFILE"
/usr/bin/env chmod 600 "$LOGFILE"
/usr/bin/env echo "Running: $CMDLINE"
/usr/bin/env $CMDLINE 2> "$LOGFILE"

if /usr/bin/env grep -B3 "ERROR" "$LOGFILE"; then
    # Wget found errors, mail the report.
    /usr/bin/env touch "$TMPFILE"
    /usr/bin/env chmod 600 "$TMPFILE"
    /usr/bin/env echo "Found errors using wget broken links check:" > "$TMPFILE"
    /usr/bin/env grep -B3 "ERROR" "$LOGFILE" >> "$TMPFILE"
    /usr/bin/env echo "" >> "$TMPFILE"
    /usr/bin/env echo "" >> "$TMPFILE"
    /usr/bin/env cat "$LOGFILE" >> "$TMPFILE"
    /usr/bin/env cat "$TMPFILE" | mail -s "wget error log" "$MAILTO"
    /usr/bin/env rm -f "$LOGFILE"
    /usr/bin/env rm -f "$TMPFILE"
    /usr/bin/env stat -c %Y "$FILETOTRACK" > "$TRACKFILE"
    exit 1
else
    # Wget found no errors, exit successfully.
    /usr/bin/env rm -f "$LOGFILE"
    /usr/bin/env stat -c %Y "$FILETOTRACK" > "$TRACKFILE"
    exit 0
fi

There are fancier and perhaps more efficient ways of accomplishing this task, but given the way we maintain the CommPeak website, this is a suitable approach for this particular scenario.
