Tue Sep 22 13:06:45 PDT 2015

How to Check Links - Using a Simple Bash Script

If your web site has a page of external links, it is useful to have an automated way to check that they are still valid, because you cannot tell when a link will change or be taken over by another site. The alternative to automation is a tedious session of pointing, clicking and 'back'-ing. Of course, there are heavy duty link checking programs which can automate the task for you, but even they tend to be tedious enough to use that the link checking might never get done. Your site would then start to exhibit decaying links, a sure sign of neglect and carelessness.

The following script assumes that you have a file called 'links.txt' which contains your external links, one http:// URL per line. For each URL it pulls down just a single page from the site, which it stores in the current directory under the host name, for example www.site.com.tmp. If a site is off line, or a page has moved on a given site, you will see evidence of this in the output of the script: a line containing "404" and "Not Found", or a suspiciously small byte count.

Rather than use 'lynx' or 'curl' to download the target page, the script uses 'telnet' and sends the individual request lines to the http server itself. This is done because it is a lot more educational than simply using 'lynx' or 'curl', and having some understanding of http is a good thing! Note that this only works for plain http on port 80; https links cannot be checked this way.

The script took some inspiration from ancient variants which have existed on the web for many years, but is modified to keep temporary files to a minimum: a single scratch file, telnet.tmp, which is copied to a per-site file after each download. If you have any problems with it, please don't hesitate to let me know. At some point I may add a retry to the script, if the current sleep times prove unreliable for downloading pages. However, in its current form the script seems to work fine for my links.txt page. It gives me the confidence to extend the page, knowing that I will be able to keep it up to date despite the ever changing web.

Here is the script.
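
Before reading it, it may help to see in isolation the request that gets fed to telnet. The sketch below is not part of the script: show_request is a hypothetical helper and www.example.com is a placeholder host. It splits the URL with shell parameter expansion rather than the sed and awk pipeline used in the script, then prints the request lines (without the User-Agent header, to keep it short).

```shell
#!/bin/sh
# Hypothetical helper: split an http:// URL into host and path,
# then print a minimal HTTP/1.0 request for it.
show_request() {
  url=$1
  rest=${url#http://}          # strip the scheme
  host=${rest%%/*}             # everything before the first slash
  case $rest in
    */*) path=/${rest#*/} ;;   # everything after the first slash
    *)   path=/ ;;             # bare host: ask for the root page
  esac
  printf 'GET %s HTTP/1.0\n' "$path"
  printf 'Host: %s\n' "$host"
  printf '\n'                  # a blank line ends the request headers
}

show_request "http://www.example.com/docs/index.html"
```

On the wire each of these lines ends in CR LF. telnet normally sends a plain newline as CR LF, which is why simple echo commands are enough in the full script that follows.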

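#!/bin/sh

# Read one URL per line from links.txt and fetch each page over telnet.
while read -r siteline
do
  # Only plain http:// links can be checked this way (port 80, no TLS).
  # Blank lines and anything else are skipped.
  case $siteline in
    http://*) ;;
    *) continue ;;
  esac
  nohttp=`echo "$siteline" | sed 's|^http://||'`
  site=`echo "$nohttp" | sed 's|/.*$||'`
  # Everything after the first slash is the path; a bare host gives "/".
  item=`echo "$nohttp" | sed 's|/| |' | awk '{ print "/"$2 }'`
  echo "Checking $site $item"
  # sleep 3 gives telnet time to connect before the request is sent;
  # sleep 5 keeps the pipe open while the response arrives.
  (echo "open $site 80"; sleep 3; echo "GET $item HTTP/1.0"; \
   printf '%s' "User-Agent: Mozilla/5.0 "; \
   printf '%s' "(Windows; U; Windows NT 5.1; en-US; rv:1.8.1.4)"; \
   echo " Gecko/20070515 Firefox/2.0.0.4"; \
   echo "Host: $site"; echo; echo; sleep 5) | telnet > telnet.tmp
  # Print any line that reports a 404 Not Found.
  grep "404" telnet.tmp | grep "Not Found"
  cp telnet.tmp "$site.tmp"
  echo "This site produced" `wc -c telnet.tmp | awk '{print $1}'` "bytes"
done < links.txt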
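
Once the script has run, the saved .tmp files can be summarised without opening each one. The following is a minimal sketch rather than part of the script above: status_of is a hypothetical helper, and it assumes the server's status line is the first line in the file that begins with "HTTP/", coming after whatever connection messages telnet printed.

```shell
#!/bin/sh
# Hypothetical helper: print the HTTP status line saved in one .tmp file.
status_of() {
  # Take the first line starting with "HTTP/" and strip the trailing CR.
  line=$(sed -n '/^HTTP\//{p;q;}' "$1" | tr -d '\r')
  if [ -n "$line" ]; then
    printf '%s: %s\n' "$1" "$line"
  else
    printf '%s: no HTTP response found\n' "$1"
  fi
}

for f in *.tmp
do
  [ -f "$f" ] || continue              # no .tmp files: the glob stays literal
  [ "$f" = telnet.tmp ] && continue    # scratch file, a copy of the last site
  status_of "$f"
done
```

A site that was off line leaves a file with no status line at all, so it shows up as "no HTTP response found", while moved pages show up with their 3xx or 404 status.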

Posted by ZFS | File under: bash