Sun May 31 17:03:35 PDT 2015

Using sed to Remove the Tags from HTML Files

sed is a wonderful tool! There are books, man pages, and plenty of good online resources for sed users. However, useful as the tool is, it is hard to master, so I thought I would provide just a little information here.

Say you want to extract the plain text from an html file for additional processing - how should you do that? Many programs can read html (e.g. aspell, used elsewhere here) and sometimes use the html tags to set font and formatting information. But what if you just want to count characters or words in an html file - how do you proceed? The first step is a quick web search for 'sed one liners', where one finds that the required command line is:

sed -e :a -e 's/<[^>]*>//g;/</N;//ba'

or, to do this with a specific file and save the result:

sed -e :a -e 's/<[^>]*>//g;/</N;//ba' filename.html > filename.txt
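Since the goal was to count words or characters, here is a minimal sketch of piping the stripped output straight into wc. The file page.html and its contents are made up for illustration, and I am assuming a standard sed and wc are on the PATH.

```shell
# page.html is a hypothetical sample file, created here only for illustration.
printf '%s\n' \
  '<html><body>' \
  '<h1>Title</h1>' \
  '<p>Some <i>plain</i> text.</p>' \
  '</body></html>' > page.html

# Strip the tags, then count the words and characters that remain.
words=$(sed -e :a -e 's/<[^>]*>//g;/</N;//ba' page.html | wc -w)
chars=$(sed -e :a -e 's/<[^>]*>//g;/</N;//ba' page.html | wc -c)

# Unquoted expansion drops any padding some wc implementations add.
echo "words:" $words
echo "chars:" $chars
```

Note that lines which held nothing but tags come out as empty lines rather than disappearing, so the character count still includes their newlines.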

But how does this command work? It builds a sed program from two script fragments, each introduced by -e. The first, :a, defines a branch label named 'a' at the beginning of the sed program. The second fragment holds three commands separated by semicolons. The substitution s/<[^>]*>//g says - find a left angle bracket, followed by zero or more characters that are not a right angle bracket ([^>]*), followed by a right angle bracket, and replace every such match (the g flag) with nothing. This takes care of <tag> html tags on one line - but what of tags which span lines? Any line that still contains a left angle bracket after the substitution matches the address /</, so the N command appends the next line of input to the sed pattern space. The empty regular expression in //ba reuses the last regular expression used, here /</, so if the pattern space still contains a left angle bracket, the b command branches back to the beginning of the sed script (remember the 'a' label?) and the substitution runs again on the joined lines. Because [^>] also matches the newline that N inserted, the tag that was split across lines is now replaced with nothing as well. Simple, elegant and compact!
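To watch the multi-line case in action, here is a small sketch. The file split.html is a made-up example whose second tag is deliberately broken across two lines. It assumes a sed in which [^>] matches an embedded newline, which is the case for GNU sed.

```shell
# split.html is a hypothetical sample; the <a ...> tag spans two lines.
printf '%s\n' \
  '<p>Hello <b>world</b></p>' \
  '<a href="x"' \
  'class="y">link</a> text' > split.html

# Line 1: the substitution removes every tag, no '<' remains, so N and b are skipped.
# Line 2: no complete tag is found, '<' remains, N pulls in line 3, and the
#         branch to :a reruns the substitution on the two joined lines.
sed -e :a -e 's/<[^>]*>//g;/</N;//ba' split.html > split.txt

cat split.txt
# Hello world
# link text
```

The two input lines that shared a tag come out as a single line, because the newline between them sat inside the tag and was deleted along with it.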


Posted by ZFS | File under: bash