
using sed to parse a file.

March 5th, 2005 Posted in Bash


sed -n 's/.*href="\([^"]*\)".*/\1/p'

-n suppresses sed’s default behavior of printing every input line.

's/…/…/p' is a command. s is search and replace: s/pattern/replacement/. The trailing p is a flag that prints the line whenever a substitution was made. Since we used -n to suppress normal printing, sed prints only the lines where the replacement happened. In most cases the replacement text will be static, but you can also use \1 through \9 to insert the text matched by the regular expressions within escaped parentheses.
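To see -n, the p flag, and the \1 style backreferences in isolation, here is a small sketch on made-up input. The function names and sample words are only for illustration and are not part of the original command.

```shell
# Print only the lines where a substitution happened.
# Without -n, sed would also echo the unchanged "apple" line.
only_changed() {
  sed -n 's/an/AN/p'
}

# Swap two words by reusing the captured groups \1 and \2.
swap_words() {
  sed 's/\([a-z]*\) \([a-z]*\)/\2 \1/'
}

printf 'apple\nbanana\n' | only_changed   # prints: bANana
printf 'hello world\n' | swap_words       # prints: world hello
```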

The pattern in this case is: .*href="\([^"]*\)".*
.*href=" matches everything from the beginning of the line up to and including href="
\([^"]*\) captures every character up to the next quote (the URL itself).
".* matches the closing quote and the rest of the line.

Using \1 as the replacement text causes the URL, and only the URL, to be printed. If a line doesn’t match the pattern, sed moves on silently to the next line.
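The whole command can be tried on a couple of hand-written lines before pointing it at a real page. This is a minimal sketch: the extract_href wrapper and the example.com URL are placeholders invented for the demonstration.

```shell
# Wrap the sed command so it can be reused in a pipeline.
extract_href() {
  sed -n 's/.*href="\([^"]*\)".*/\1/p'
}

# The first line contains a link; the second does not and is skipped silently.
printf '%s\n' \
  '<a href="http://example.com/a.mp3">episode</a>' \
  'no link here' | extract_href
# prints: http://example.com/a.mp3
```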

This method only catches one URL per line. Because the leading .* is greedy, it is the last href on the line that gets captured, and any earlier ones are ignored. I will attempt to address that in a later article.
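One possible way to pick up every URL on a line, offered as a sketch and not necessarily what the later article does: isolate each href="..." attribute with grep -o first, then strip the wrapper with sed. The -o flag exists in GNU and BSD grep, but its availability on your system is an assumption; the extract_all_hrefs name and the sample HTML are made up.

```shell
# grep -o prints each match on its own line, so every href gets
# its own line before sed trims the leading href=" and trailing quote.
extract_all_hrefs() {
  grep -o 'href="[^"]*"' | sed 's/^href="//; s/"$//'
}

line='<a href="one.html">1</a> <a href="two.html">2</a>'
printf '%s\n' "$line" | extract_all_hrefs
# prints:
# one.html
# two.html
```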

Here’s a simple example:


# wget -q http://lr2.com/ -O - | sed -n 's/.*href="\([^"]*\)".*/\1/p'

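Notice we escaped the parens. That is for sed’s benefit, not bash’s: in sed’s default (basic) regular expressions, \( and \) mark a group that \1 can refer to, while unescaped parens match literal parenthesis characters. Bash never sees them, because the single quotes stop it from interpreting anything inside the expression.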
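The difference between escaped and bare parens in sed’s default regular expressions can be checked with a short sketch. The function names and input strings here are invented for the example.

```shell
# Basic regex (sed's default): bare parens are literal characters.
literal_parens() {
  sed 's/(x)/[x]/'
}

# Basic regex: \( \) form a capture group that \1 can reuse.
capture_group() {
  sed 's/\(ab\)c/\1/'
}

printf 'f(x)\n' | literal_parens   # prints: f[x]
printf 'abc\n' | capture_group     # prints: ab
```

With sed -E (extended regular expressions, supported by GNU and BSD sed) the roles are reversed: bare parens group and escaped ones are literal.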

Tomorrow I’ll start to build this into a smarter parser that can be used to harvest both web pages and XML feeds for MP3 links. The end goal will be a simple script to fetch podcasts, add them to my library, and automatically dump them on my iPod if it’s connected. Along the way I expect to learn a few more bash tricks.
