
parsing some xml

March 7th, 2005 | Posted in Bash

Last night we looked at parsing urls out of the feeds. Tonight I’m going to look at parsing a little more information: specifically the channel title, item title, item enclosure, and item pubDate.


reading a list of feeds

March 6th, 2005 | Posted in Bash

Tonight I’m going to start off the script by reading a list of feeds and fetching them for parsing.

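#!/bin/bash
# Directory holding the podcast library, and the list of feed URLs inside it.
BASEDIR="/mnt/usb0/mp3/podCast"
FEEDS="${BASEDIR}/feeds.lst"

# Outer loop: one feed URL per line, skipping comments (; or #) and blank lines.
while read -r URL; do
    # Inner loop: fetch the feed and print the contents of each <link> element.
    while read -r LINE; do
        echo "$LINE" | sed -n 's/.*<link>\([^<]*\)<\/link>.*/\1/p'
    done < <(wget -q -O - "$URL")
done < <(grep -v -e '^[;#]' -e '^$' "$FEEDS")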

podcast.001.sh

We’re using grep to filter out lines starting with ; and #, as well as blank lines. We could get fancy and validate the URL, but this will suffice for now.
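Here's a small sketch of what that grep filter does, run against a throwaway list. The file contents and example.com URLs are made up for illustration, since the real feeds.lst isn't shown.

```shell
#!/bin/bash
# Hypothetical sample list written to a temp file; not the real feeds.lst.
FEEDS="$(mktemp)"
cat > "$FEEDS" <<'EOF'
# my podcasts
http://example.com/feed1.xml

; disabled for now
http://example.com/feed2.xml
EOF

# Same filter as in the script: -v inverts the match, so lines starting
# with ; or # (first -e) and empty lines (second -e) are dropped.
filtered="$(grep -v -e '^[;#]' -e '^$' "$FEEDS")"
echo "$filtered"

rm -f "$FEEDS"
```

Only the two feed URLs should come out the other side.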

If all we really wanted was a list of mp3 URLs, we could pipe wget directly through the sed command, but I have plans to parse out more than just the mp3. To keep our files organized and minimize network traffic, I plan to also parse out the feed title, the show title, and the pubDate. We’ll delve into the parser more tomorrow. For now, good night, and happy bashing.
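For comparison, the direct pipe might look something like the sketch below. The extract_links helper name and the sample feed are assumptions for illustration, the pattern assumes the closing tag is matched as </link>, and the wget line is left as a comment so the example runs without a network connection.

```shell
#!/bin/bash
# Hypothetical helper: reads a feed on stdin, prints the text of each <link> element.
extract_links() {
    sed -n 's/.*<link>\([^<]*\)<\/link>.*/\1/p'
}

# With a real feed this would be:
#   wget -q -O - "$URL" | extract_links
# Here a made-up feed stands in for the download.
links="$(printf '%s\n' \
    '<channel>' \
    '<title>Example</title>' \
    '<link>http://example.com/</link>' \
    '<item><link>http://example.com/ep1.mp3</link></item>' \
    '</channel>' | extract_links)"
echo "$links"
```

This drops the inner while loop entirely, which is fine as long as nothing else needs to be pulled out of each line.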

using sed to parse a file.

March 5th, 2005 | Posted in Bash


sed -n 's/.*href="\([^"]*\)".*/\1/p'

-n suppresses sed's default behavior of printing every line it reads.

's/…/…/p' is a command. s is search and replace: s/pattern/replacement/. The trailing p is a flag that prints the line whenever the substitution succeeds. Since we used -n to suppress normal printing, sed prints only the lines where a replacement was made, and because our pattern covers the whole line, that output is just the replacement text. In most cases the replacement text will be static, but you can also use \1 through \9 to insert the text matched by the corresponding parenthesized group in the pattern.
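A quick sketch of how -n, the trailing p, and the numbered groups interact. The key=value input is made up purely for illustration.

```shell
#!/bin/bash
# Without -n, sed prints every line, so the non-matching line passes through unchanged.
without_n="$(printf 'alpha=1\nno match\n' | sed 's/\(.*\)=\(.*\)/\2=\1/')"

# With -n plus the trailing p, only lines where the substitution succeeded are printed.
# \1 and \2 refer to the first and second parenthesized groups, swapped here.
with_n="$(printf 'alpha=1\nno match\n' | sed -n 's/\(.*\)=\(.*\)/\2=\1/p')"

echo "without -n:"; echo "$without_n"
echo "with -n and p:"; echo "$with_n"
```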

The pattern in this case is: .*href="\([^"]*\)".*
.*href=" matches everything from the beginning of the line up to and including href="
\([^"]*\) captures a run of characters that are not quotes (the url itself).
".* matches the closing quote and the rest of the line.

Using \1 as the replacement text causes the url, and only the url to be printed. If the line doesn’t contain a matching pattern, sed continues on silently to the next line.
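To see that without touching the network, here's a sketch that feeds the same command a few made-up lines of HTML (the example.com URLs are placeholders):

```shell
#!/bin/bash
# Hypothetical input: one line with no link, two lines with one link each.
html='<p>no links here</p>
<a href="http://example.com/show1.mp3">Show 1</a>
<img src="x.png"> <a href="http://example.com/show2.mp3">Show 2</a>'

# Lines without href="..." produce no output; matching lines are reduced to the url.
urls="$(printf '%s\n' "$html" | sed -n 's/.*href="\([^"]*\)".*/\1/p')"
echo "$urls"
```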

This method only catches one url per line. Because the leading .* is greedy, that is the last href on the line, and any earlier ones are ignored. I will attempt to address that in a later article.
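One possible way to pick up every url on a line, not necessarily the approach a later article takes, is to let grep -o split out each href="..." match first and then strip the wrapper with sed. This assumes a grep that supports -o (GNU and BSD grep both do), and the sample line is made up.

```shell
#!/bin/bash
# Hypothetical line containing two links.
line='<a href="http://example.com/a.mp3">A</a> <a href="http://example.com/b.mp3">B</a>'

# grep -o prints each match on its own line; sed then removes the
# leading href=" and the trailing quote from each one.
all_urls="$(printf '%s\n' "$line" | grep -o 'href="[^"]*"' | sed 's/^href="//; s/"$//')"
echo "$all_urls"
```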

Here’s a simple example:


# wget -q http://lr2.com/ -O - |sed -n 's/.*href="\([^"]*\)".*/\1/p'

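Notice the backslashes before the parens. They are there for sed, not for bash: in sed's basic regular expressions, \( and \) mark a capture group, while bare parens match literal parenthesis characters. The single quotes around the whole expression already keep bash from interpreting anything inside it.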

Tomorrow I’ll start to build this into a smarter parser that can be used to harvest both web pages and xml feeds for mp3 links. The end goal will be a simple script to fetch podcasts, add them to my library, and automatically dump them on my iPod if it’s connected. Along the way I expect to learn a few more bash tricks.