parsing some xml
March 7th, 2005 Posted in Bash
Last night we looked at parsing URLs out of the feeds. Tonight I'm going to look at parsing a little more information: specifically the channel title, item title, item enclosure, and item pubDate.
#!/bin/bash
BASEDIR="/mnt/usb0/mp3/podCast"
FEEDS="${BASEDIR}/feeds.lst"

while read -r URL ; do
    CHANNEL=""
    while read -r LINE ; do
        # Name (and any attributes) of the first tag on this line.
        TAG=$(echo "${LINE}" | sed -n 's/<\([^>]*\).*/\1/p')
        # Bit of a hack to find the title of this channel.
        # We will assume that the first title we see is the channel title, future ones belong to the episodes.
        if [[ "title" = "${TAG}" ]] ; then
            if [[ "" = "${CHANNEL}" ]] ; then
                CHANNEL=$(echo "${LINE}" | sed -n 's/<title>\([^<]*\)<\/title>/\1/p')
            else
                TITLE=$(echo "${LINE}" | sed -n 's/<title>\([^<]*\)<\/title>/\1/p')
            fi
        elif [[ "link" = "${TAG}" ]] ; then
            # I found a few feeds that use link instead of enclosure.
            LINK=$(echo "${LINE}" | sed -n 's/.*<link>\([^<]*\)<\/link>/\1/p')
        elif [[ "pubDate" = "${TAG}" ]] ; then
            # pubDate. We'll use this in the id3tag (year, and comment) if the mp3 is missing them.
            DATE=$(echo "${LINE}" | sed -n 's/.*<pubDate>\([^<]*\)<\/pubDate>/\1/p')
        elif [[ "${TAG}" =~ ^enclosure ]] ; then
            # The pattern is left unquoted: newer bash versions treat a quoted pattern as a literal string.
            # This should be the mp3 url, if it's blank, we'll try LINK if it ends in mp3.
            MP3=$(echo "${LINE}" | sed -n 's/.*<enclosure url=["'\'']\([^"'\'']*\)["'\''].*/\1/p')
        elif [[ "${TAG}" = "/item" ]] ; then
            # Time to do something with the data we have.
            echo "Channel: ${CHANNEL}"
            echo "Title: ${TITLE}"
            echo "Link: ${LINK}"
            echo "Date: ${DATE}"
            echo "MP3: ${MP3}"
            echo ""
            # Clear the per-item fields so stale values don't leak into the next item.
            TITLE="" ; LINK="" ; DATE="" ; MP3=""
        fi
    done < <(wget -q -O - "${URL}")
done < <(grep -v -e '^[;#]' -e '^$' "${FEEDS}")
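To see what the sed expressions pull out without hitting the network, they can be run against a few sample lines. This is only a rough sketch: the function names `tag_of` and `enclosure_url` and the sample feed lines are made up for illustration, and it assumes, as the script does, that each tag sits on its own line.

```shell
#!/bin/bash
# Hypothetical helpers wrapping two of the sed expressions from the script.

# Print the name (plus any attributes) of the first tag on a line.
tag_of() {
    printf '%s\n' "$1" | sed -n 's/<\([^>]*\).*/\1/p'
}

# Print the url attribute of an <enclosure> tag, single or double quoted.
enclosure_url() {
    printf '%s\n' "$1" | sed -n 's/.*<enclosure url=["'\'']\([^"'\'']*\)["'\''].*/\1/p'
}

# Made-up sample lines, one tag per line.
tag_of '<title>Example Show</title>'    # prints: title
tag_of '</item>'                        # prints: /item
enclosure_url '<enclosure url="http://example.com/ep1.mp3" length="123" type="audio/mpeg" />'
# prints: http://example.com/ep1.mp3
```

A feed that puts several tags on one line would not go through this cleanly, since only the first tag on each line is inspected.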
Tomorrow I will pick the script apart and describe some of the parts, and maybe evolve it a bit. I'm sure there is a more efficient way to do some of the parsing. I will also start commenting the script and add caching for the XML files.
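One direction that might cut down on the parsing cost is bash's built-in regex matching, which avoids starting a sed process for every line. The sketch below is an assumption about how that could look, not the final version of the script; `element_text` is a hypothetical helper name and the sample lines are made up.

```shell
#!/bin/bash
# Hypothetical helper: print the text between <name> and </name> on a line,
# using [[ =~ ]] and BASH_REMATCH instead of piping through sed.
element_text() {
    local name=$1 line=$2
    # Keep the pattern in a variable so it is used unquoted as a regex.
    local re="<${name}>([^<]*)</${name}>"
    if [[ $line =~ $re ]] ; then
        printf '%s\n' "${BASH_REMATCH[1]}"
    fi
}

element_text title '<title>Example Show</title>'
# prints: Example Show
element_text pubDate '<pubDate>Mon, 07 Mar 2005 20:00:00 GMT</pubDate>'
# prints: Mon, 07 Mar 2005 20:00:00 GMT
```

Like the sed version, this only works when the opening and closing tags are on the same line, and it prints nothing when the element is not found.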