HPR4404: Kevie nerd snipes Ken by grepping xml

 
More Command line fun: downloading a podcast

In the show hpr4398 :: Command line fun: downloading a podcast, Kevie walked us through a command to download a podcast.

He used some techniques here that I hadn't used before, and it's always great to see how other people approach the problem.

Let's have a look at the script and walk through what it does, then we'll have a look at some "traps for young players" as the EEVBlog is fond of saying.

Analysis of the Script

wget `curl https://tuxjam.otherside.network/feed/podcast/ | grep -o 'https*://[^"]*ogg' | head -1` 

It chains four different commands together to "Save the latest file from a feed".

Let's break it down so we can have checkpoints between each step.

I often do this when writing a complex one-liner - first do it as steps, and then combine it.

  1. The curl command gets https://tuxjam.otherside.network/feed/podcast/ .

To do this ourselves we will call curl https://tuxjam.otherside.network/feed/podcast/ --output tuxjam.xml, as curl otherwise writes the feed to standard output rather than to a file.

This gives us an XML file, and we can confirm it's well-formed XML with the xmllint command.

$ xmllint --format tuxjam.xml >/dev/null
$ echo $?
0

Here the output of the command is ignored by redirecting it to /dev/null. Then we check the error code the last command returned. As it's 0, it completed successfully.
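As a quick sketch of how that exit code can be used (my own example, not part of Kevie's script), we could refuse to go any further when the download isn't well-formed XML:

# A sketch: only proceed when the downloaded file parses as XML
if xmllint --format tuxjam.xml > /dev/null 2>&1
then
    echo "tuxjam.xml is well-formed XML"
else
    echo "tuxjam.xml is not well-formed XML" >&2
fi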

  2. Kevie then passes the output to the grep search command with the option -o, looking for any string starting with http, optionally followed by an s, then ://, then any run of characters that aren't a double quote, ending in ogg.
-o, --only-matching Print only the matched (non-empty) parts of a matching line, with each such part on a separate output line 

We can do the same ourselves. I was not aware that grep defaulted to regular expressions, as I tend to add --perl-regexp to enable them explicitly.

grep --only-matching 'https*://[^"]*ogg' tuxjam.xml

http    matches the characters http literally (case sensitive)
s*      matches the character s literally, between zero and unlimited times, as many times as possible, giving back as needed [greedy]
:       matches the character : literally
//      matches the two forward slashes literally
[^"]*   matches a single character not present in the list (a " character), between zero and unlimited times, as many times as possible, giving back as needed [greedy]
ogg     matches the characters ogg literally (case sensitive)

When we run this ourselves we get the following:

$ grep --only-matching 'https*://[^"]*ogg' tuxjam.xml
https://archive.org/download/tuxjam-121/tuxjam_121.ogg
https://archive.org/download/tuxjam-120/TuxJam_120.ogg
https://archive.org/download/tux-jam-119/TuxJam_119.ogg
https://archive.org/download/tuxjam_118/tuxjam_118.ogg
https://archive.org/download/tux-jam-117-uncut/TuxJam_117.ogg
https://tuxjam.otherside.network/tuxjam-115-ogg
https://archive.org/download/tuxjam_116/tuxjam_116.ogg
https://tuxjam.otherside.network/tuxjam-115-ogg
https://tuxjam.otherside.network/tuxjam-115-ogg
https://tuxjam.otherside.network/tuxjam-115-ogg
https://ogg
http://tuxjam.otherside.network/wp-content/uploads/sites/5/2024/10/tuxjam_115_OggCamp2024.ogg
https://ogg
https://archive.org/download/tuxjam_114/tuxjam_114.ogg
https://archive.org/download/tuxjam_113/tuxjam_113.ogg
https://archive.org/download/tuxjam_112/tuxjam_112.ogg
  3. The head -1 command returns only the first line, therefore https://archive.org/download/tuxjam-121/tuxjam_121.ogg
  4. Finally that line is used as the input to the wget command.
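Putting those checkpoints into practice, here is the same pipeline as separate steps (a sketch; the intermediate file names are my own choice):

# Step 1: fetch the feed to a file we can inspect
curl --silent https://tuxjam.otherside.network/feed/podcast/ --output tuxjam.xml
# Step 2: extract candidate URLs ending in "ogg"
grep --only-matching 'https*://[^"]*ogg' tuxjam.xml > urls.txt
# Step 3: keep only the first one
head -1 urls.txt > latest.txt
# Step 4: download it
wget --input-file=latest.txt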

Problems with the approach

Relying on grep with structured data like XML or JSON can lead to problems.

When we looked at the output of the command in step 2, some of the results gave https://ogg.

When we run the same command without the --only-matching argument we see which lines were matched.

$ grep 'https*://[^"]*ogg' tuxjam.xml

This episode may not be live as in TuxJam 115 from Oggcamp but your friendly foursome of Al, Dave (thelovebug), Kevie and Andrew (mcnalu) are very much alive to treats of Free and Open Source Software and Creative Commons tunes.

https://tuxjam.otherside.network/tuxjam-115-oggcamp-2024/ https://tuxjam.otherside.network/tuxjam-115-oggcamp-2024/#respond https://tuxjam.otherside.network/tuxjam-115-oggcamp-2024/feed/

With the group meeting up together for the first time in person, it was decided that a live recording would be an appropriate venture. With the quartet squashed around a table and a group of adoring fans crowded into a room at the Pendulum Hotel in Manchester, the discussion turns to TuxJam reviews that become regularly used applications, what we enjoyed about OggCamp 2024 and for the third section the gang put their reputation on the line and allow open questions from the sea of dedicated fans.

  • OggCamp 2024 on Saturday 12 and Sunday 13 October 2024, Manchester UK.
  • Two of the hits are not enclosures at all; they are references in the text to OggCamp ("what we enjoyed about OggCamp 2024").

Normally grep returns matches line by line, and if the XML is minimised it can miss entries in a file that comes across as one big line.

I did this myself using:

xmllint --noblanks tuxjam.xml > tuxjam-min.xml

I then edited it and replaced the newlines with spaces. I have to say that the --only-matching argument does a great job at pulling out the matches.
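If you want to reproduce that without hand-editing, a one-liner along these lines should do the same job (a sketch using tr, which the show notes didn't use):

# Minimise, then squash the whole document onto one line
xmllint --noblanks tuxjam.xml | tr '\n' ' ' > tuxjam-min.xml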

That said, the results were not perfect either.

$ grep --only-matching 'https*://[^"]*ogg' tuxjam-min.xml
https://archive.org/download/tuxjam-121/tuxjam_121.ogg
https://archive.org/download/tuxjam-120/TuxJam_120.ogg
https://archive.org/download/tux-jam-119/TuxJam_119.ogg
https://archive.org/download/tuxjam_118/tuxjam_118.ogg
https://archive.org/download/tux-jam-117-uncut/TuxJam_117.ogg
https://tuxjam.otherside.network/tuxjam-115-ogg
https://archive.org/download/tuxjam_116/tuxjam_116.ogg
https://tuxjam.otherside.network/tuxjam-115-ogg
https://tuxjam.otherside.network/?p=1029https://tuxjam.otherside.network/tuxjam-115-oggcamp-2024/#respondhttps://tuxjam.otherside.network/tuxjam-115-ogg
https://ogg
http://tuxjam.otherside.network/wp-content/uploads/sites/5/2024/10/tuxjam_115_OggCamp2024.ogg
https://ogg
https://archive.org/download/tuxjam_114/tuxjam_114.ogg
https://archive.org/download/tuxjam_113/tuxjam_113.ogg
https://archive.org/download/tuxjam_112/tuxjam_112.ogg

You could fix it by modifying the grep arguments to add additional searches looking for enclosure, as sketched below. The problem with that approach is that you'll be forever and a day chasing issues when someone changes something.
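For example, you could anchor the match to the enclosure tag itself; a hypothetical variant of the same grep:

# A fragile "fix": only match URLs that appear inside an enclosure tag
grep --only-matching '<enclosure url="https*://[^"]*ogg"' tuxjam.xml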

So the approach is officially "Grand", but it's very likely to break if you're not babysitting it.

Suggested Applications

I recommend never parsing structured documents, like XML or JSON, with grep.

You should use dedicated parsers that understand the document markup, and can intelligently address parts of it.

I recommend:

Of course anyone that looks at my code on the HPR Gitea will know this is a case of "do what I say, not what I do."

Never parse XML with grep, where the only possible exception is to check whether a string is in a file in the first place.

grep --max-count=1 --files-with-matches

That's justified by the fact that grep is going to be faster than having to parse and build an XML Document Object Model when you don't have to.
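A sketch of that exception in use (my example, not from the show):

# Cheap pre-check before doing any real parsing: does the file
# mention "enclosure" at all? Stop reading at the first match.
grep --quiet --max-count=1 'enclosure' tuxjam.xml && echo "possibly a podcast feed"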

Some Tips

Always refer to examples and the specification

A specification is just a set of rules that tell you how the document is formatted.

There is a danger in just looking at example files and not reading the specifications. I had a situation once where a software developer raised a bug because the files didn't begin with ken-test- followed by a uuid. They were surprised when the supplied files did not follow this convention as per the examples. Suffice it to say that bug was rejected.

For us there are the rules from the RSS specification itself, but as it's an XML file there are also the XML Specifications. While the RSS spec is short, the XML one is not, so people tend to use dedicated libraries to parse XML. Using a dedicated tool like xmlstarlet will allow us to mostly ignore the details of XML.

RSS is a dialect of XML. All RSS files must conform to the XML 1.0 specification, as published on the World Wide Web Consortium (W3C) website.

The first line of the tuxjam feed shows it's an XML file.

<?xml version="1.0" encoding="UTF-8"?>

The specification goes on to say "At the top level, a RSS document is a <rss> element, with a mandatory attribute called version, that specifies the version of RSS that the document conforms to. If it conforms to this specification, the version attribute must be 2.0." And sure enough the second line shows that it's an RSS file.

<rss version="2.0" ...>
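Since every RSS file must therefore be well-formed XML, a tool such as xmlstarlet (covered below) can check this for us. A quick sketch; the exact output wording may vary by version:

$ xmlstarlet val tuxjam.xml
tuxjam.xml - valid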

Use the best tool for the job

You wouldn't grep an Excel file, would you? So why would you grep an XML file?

We could go on all day, but I want to get across the idea that there is structure in the file. As XML is everywhere you should have a tool to process it. More than likely xmlstarlet is in all the distro repos, so just install it.

The help looks like this:

$ xmlstarlet --help
XMLStarlet Toolkit: Command line utilities for XML
Usage: xmlstarlet [<options>] <command> [<cmd-options>]
where <command> is one of:
  ed    (or edit)      - Edit/Update XML document(s)
  sel   (or select)    - Select data or query XML document(s) (XPATH, etc)
  tr    (or transform) - Transform XML document(s) using XSLT
  val   (or validate)  - Validate XML document(s) (well-formed/DTD/XSD/RelaxNG)
  fo    (or format)    - Format XML document(s)
  el    (or elements)  - Display element structure of XML document
  c14n  (or canonic)   - XML canonicalization
  ls    (or list)      - List directory as XML
  esc   (or escape)    - Escape special XML characters
  unesc (or unescape)  - Unescape special XML characters
  pyx   (or xmln)      - Convert XML into PYX format (based on ESIS - ISO 8879)
  p2x   (or depyx)     - Convert PYX into XML
<options> are:
  -q or --quiet        - no error output
  --doc-namespace      - extract namespace bindings from input doc (default)
  --no-doc-namespace   - don't extract namespace bindings from input doc
  --version            - show version
  --help               - show help
Wherever file name mentioned in command help it is assumed that URL can be used instead as well.
Type: xmlstarlet <command> --help for command help
XMLStarlet is a command line toolkit to query/edit/check/transform XML documents (for more information see http://xmlstar.sourceforge.net/)

You can get more help on a given topic by calling the xmlstarlet command like this:

$ xmlstarlet el --help
XMLStarlet Toolkit: Display element structure of XML document
Usage: xmlstarlet el [<options>] <xml-file>
where
  <xml-file> - input XML document file name (stdin is used if missing)
  <options> is one of:
  -a    - show attributes as well
  -v    - show attributes and their values
  -u    - print out sorted unique lines
  -d<n> - print out sorted unique lines up to depth <n>
XMLStarlet is a command line toolkit to query/edit/check/transform XML documents (for more information see http://xmlstar.sourceforge.net/)

To prove that it's a structured document we can run the command xmlstarlet el -u (show me the unique elements):

$ xmlstarlet el -u tuxjam.xml
rss
rss/channel
rss/channel/atom:link
rss/channel/copyright
rss/channel/description
rss/channel/generator
rss/channel/image
rss/channel/image/link
rss/channel/image/title
rss/channel/image/url
rss/channel/item
rss/channel/item/category
rss/channel/item/comments
rss/channel/item/content:encoded
rss/channel/item/description
rss/channel/item/enclosure
rss/channel/item/guid
rss/channel/item/itunes:author
rss/channel/item/itunes:duration
rss/channel/item/itunes:episodeType
rss/channel/item/itunes:explicit
rss/channel/item/itunes:image
rss/channel/item/itunes:subtitle
rss/channel/item/itunes:summary
rss/channel/item/link
rss/channel/item/pubDate
rss/channel/item/slash:comments
rss/channel/item/title
rss/channel/item/wfw:commentRss
rss/channel/itunes:author
rss/channel/itunes:category
rss/channel/itunes:explicit
rss/channel/itunes:image
rss/channel/itunes:owner
rss/channel/itunes:owner/itunes:name
rss/channel/itunes:subtitle
rss/channel/itunes:summary
rss/channel/itunes:type
rss/channel/language
rss/channel/lastBuildDate
rss/channel/link
rss/channel/podcast:guid
rss/channel/podcast:license
rss/channel/podcast:location
rss/channel/podcast:medium
rss/channel/podcast:podping
rss/channel/rawvoice:frequency
rss/channel/rawvoice:location
rss/channel/sy:updateFrequency
rss/channel/sy:updatePeriod
rss/channel/title

That is the xpath representation of the XML structure. It's very similar to a unix filesystem tree. There is one rss branch, off that is one channel branch, and that can have many item branches.
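To see that shape in miniature, here is a toy feed of my own construction and what xmlstarlet el -u reports for it:

# A hypothetical minimal feed, just to show the tree shape
cat > mini.xml <<'EOF'
<rss version="2.0">
  <channel>
    <title>Example</title>
    <item>
      <enclosure url="https://example.org/ep1.ogg" length="1" type="audio/ogg"/>
    </item>
    <item>
      <enclosure url="https://example.org/ep2.ogg" length="1" type="audio/ogg"/>
    </item>
  </channel>
</rss>
EOF
xmlstarlet el -u mini.xml
# expected output:
# rss
# rss/channel
# rss/channel/item
# rss/channel/item/enclosure
# rss/channel/title

Note that the two item branches collapse into one unique path.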

    "Save the latest file from a feed"

    The ask here is to "Save the latest file from a feed".

    The solution Kevie gave gets the "first entry in the feed", which is correct for his feed but is not safe.

    However let's see how we could replace grep with xmlstarlet .

    The definition of enclosure is:

     is an optional sub-element of . It has three required attributes. url says where the enclosure is located, length says how big it is in bytes, and type says what its type is, a standard MIME type. The url must be an http url. 

    The location of the files must be in rss/channel/item/enclosure or it's not a Podcast feed.

    In each enclosure there has to be a xml attribute called url which points to the media.

xmlstarlet has the select command to select locations.

$ xmlstarlet sel --help
XMLStarlet Toolkit: Select from XML document(s)
Usage: xmlstarlet sel <global-options> {<template>} [ <xml-file> ... ]
where
  <global-options> - global options for selecting
  <xml-file> - input XML document file name/uri (stdin is used if missing)
  <template> - template for querying XML document with following syntax:
<global-options> are:
  -Q or --quiet             - do not write anything to standard output.
  -C or --comp              - display generated XSLT
  -R or --root              - print root element
  -T or --text              - output is text (default is XML)
  -I or --indent            - indent output
  -D or --xml-decl          - do not omit xml declaration line
  -B or --noblanks          - remove insignificant spaces from XML tree
  -E or --encode <encoding> - output in the given encoding (utf-8, unicode...)
  -N <name>=<value>         - predefine namespaces (name without 'xmlns:')
                              ex: xsql=urn:oracle-xsql
                              Multiple -N options are allowed.
  --net                     - allow fetch DTDs or entities over network
  --help                    - display help
Syntax for templates: -t|--template <options>
where <options> are:
  -c or --copy-of <xpath>   - print copy of XPATH expression
  -v or --value-of <xpath>  - print value of XPATH expression
  -o or --output <string>   - output string literal
  -n or --nl                - print new line
  -f or --inp-name          - print input file name (or URL)
  -m or --match <xpath>     - match XPATH expression
  --var <name> <value> --break or
  --var <name>=<value>      - declare a variable (referenced by $name)
  -i or --if <test-xpath>   - check condition
  --elif <test-xpath>       - check condition if previous conditions failed
  --else                    - check if previous conditions failed
  -e or --elem <name>       - print out element
  -a or --attr <name>       - add attribute
  -b or --break             - break nesting
  -s or --sort op xpath     - sort in order (used after -m) where
  op is X:Y:Z,
      X is A - for order="ascending"
      X is D - for order="descending"
      Y is N - for data-type="numeric"
      Y is T - for data-type="text"
      Z is U - for case-order="upper-first"
      Z is L - for case-order="lower-first"

Options we will need are:

-T or --text               - output is text (default is XML)
-t or --template <options> - template for querying the XML document
-m or --match <xpath>      - match XPATH expression
-v or --value-of <xpath>   - print value of XPATH expression
-n or --nl                 - print new line

So putting it together we will get:

$ xmlstarlet sel --text --template --match 'rss/channel/item' --value-of 'enclosure/@url' --nl tuxjam.xml
https://archive.org/download/tuxjam-121/tuxjam_121.ogg
https://archive.org/download/tuxjam-120/TuxJam_120.ogg
https://archive.org/download/tux-jam-119/TuxJam_119.ogg
https://archive.org/download/tuxjam_118/tuxjam_118.ogg
https://archive.org/download/tux-jam-117-uncut/TuxJam_117.ogg
https://archive.org/download/tuxjam_116/tuxjam_116.ogg
http://tuxjam.otherside.network/wp-content/uploads/sites/5/2024/10/tuxjam_115_OggCamp2024.ogg
https://archive.org/download/tuxjam_114/tuxjam_114.ogg
https://archive.org/download/tuxjam_113/tuxjam_113.ogg
https://archive.org/download/tuxjam_112/tuxjam_112.ogg

We match all the rss/channel/item elements and print each one's enclosure/@url value, simple enough.
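As an aside, because --value-of on a node set returns the value of only the first matching node, the head -1 could arguably be dropped entirely; a sketch (my variation, not from the show):

$ xmlstarlet sel --text --template --value-of 'rss/channel/item/enclosure/@url' --nl tuxjam.xml
https://archive.org/download/tuxjam-121/tuxjam_121.ogg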

We could replace the grep in Kevie's script,

wget `curl https://tuxjam.otherside.network/feed/podcast/ | grep -o 'https*://[^"]*ogg' | head -1`

with,

wget "$( curl --silent https://tuxjam.otherside.network/feed/podcast/ | xmlstarlet sel --text --template --match 'rss/channel/item' --value-of 'enclosure/@url' --nl - | head -1 )"

This is guaranteed to give the "first entry in the feed". Some additions are the use of $() instead of backticks, as it's easier to nest them, and I think it's clearer that something is being executed. I also added --silent to curl to suppress the progress bar, and replaced the file name with a dash (-) to tell xmlstarlet that the input will be piped from standard in.
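A quick illustration of why $() nests more happily than backticks (my own toy example; with backticks the inner pair would need escaping):

# Nested command substitution: the inner $() runs first
echo "outer: $( echo "inner: $( hostname )" )"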

How about the latest feed?

There is nothing to stop someone producing an RSS feed where the latest entries are at the end, or even sorted alphabetically, or randomly. They are all valid use cases and are allowed under the Specification. So how would we find the "latest podcast"?

While defined as optional, the item's pubDate, found at rss/channel/item/pubDate, is almost always defined in podcast feeds.

<pubDate> is an optional sub-element of <item>. Its value is a date, indicating when the item was published. If it's a date in the future, aggregators may choose to not display the item until that date. Sun, 19 May 2002 15:21:36 GMT

Unfortunately they picked a completely stupid email date format, as opposed to a sane format like rfc3339, a subset of iso8601.

So if you want to get the "latest podcast" you need to parse and then convert the pubDate to a sortable format like iso8601 or Unix epoch.

I'll do the first part here; the second is its own (series) of shows. The Problem with Time & Timezones - Computerphile

So could it be as simple as replacing the xpath enclosure/@url with pubDate? Yes, yes it is.

$ xmlstarlet sel --text --template --match 'rss/channel/item' --value-of 'pubDate' --nl tuxjam.xml
Fri, 23 May 2025 17:54:17 +0000
Fri, 28 Feb 2025 15:48:57 +0000
Mon, 03 Feb 2025 09:57:26 +0000
Sat, 21 Dec 2024 17:08:26 +0000
Thu, 05 Dec 2024 12:57:52 +0000
Sat, 30 Nov 2024 20:31:20 +0000
Wed, 30 Oct 2024 08:10:58 +0000
Mon, 26 Aug 2024 15:11:51 +0000
Fri, 05 Jul 2024 18:15:44 +0000
Sat, 08 Jun 2024 09:21:50 +0000

But we will need both the enclosure/@url and pubDate, and we can get both using the concat option. For more information on this see the XmlStarlet Command Line XML Toolkit User's Guide.

$ xmlstarlet sel --text --template --match 'rss/channel/item' --value-of 'concat(pubDate, ";", enclosure/@url)' --nl tuxjam.xml
Fri, 23 May 2025 17:54:17 +0000;https://archive.org/download/tuxjam-121/tuxjam_121.ogg
Fri, 28 Feb 2025 15:48:57 +0000;https://archive.org/download/tuxjam-120/TuxJam_120.ogg
Mon, 03 Feb 2025 09:57:26 +0000;https://archive.org/download/tux-jam-119/TuxJam_119.ogg
Sat, 21 Dec 2024 17:08:26 +0000;https://archive.org/download/tuxjam_118/tuxjam_118.ogg
Thu, 05 Dec 2024 12:57:52 +0000;https://archive.org/download/tux-jam-117-uncut/TuxJam_117.ogg
Sat, 30 Nov 2024 20:31:20 +0000;https://archive.org/download/tuxjam_116/tuxjam_116.ogg
Wed, 30 Oct 2024 08:10:58 +0000;http://tuxjam.otherside.network/wp-content/uploads/sites/5/2024/10/tuxjam_115_OggCamp2024.ogg
Mon, 26 Aug 2024 15:11:51 +0000;https://archive.org/download/tuxjam_114/tuxjam_114.ogg
Fri, 05 Jul 2024 18:15:44 +0000;https://archive.org/download/tuxjam_113/tuxjam_113.ogg
Sat, 08 Jun 2024 09:21:50 +0000;https://archive.org/download/tuxjam_112/tuxjam_112.ogg

I use the ; as the delimiter.
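To see the delimiter at work, here is one of the lines above split with awk (a quick check of my own):

$ echo 'Fri, 23 May 2025 17:54:17 +0000;https://archive.org/download/tuxjam-121/tuxjam_121.ogg' | awk -F ';' '{print $2}'
https://archive.org/download/tuxjam-121/tuxjam_121.ogg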

To tackle the date we will need to use the date command with the following options:

-d, --date=STRING          display time described by STRING, not 'now'
-u, --utc, --universal     print or set Coordinated Universal Time (UTC)
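A quick check of the conversion, using the newest pubDate from the feed above:

$ date --date='Fri, 23 May 2025 17:54:17 +0000' --universal +%Y-%m-%dT%H:%M:%S
2025-05-23T17:54:17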

Then once it's in a sane format we can use sort to order them with the newest on top.

-n, --numeric-sort         compare according to string numerical value; see full documentation for supported strings
-r, --reverse              reverse the result of comparisons

Putting that into a script:

#!/bin/bash
# (c) CC-0 Ken Fallon 2025

podcast="https://tuxjam.otherside.network/feed/podcast/"

wget --quiet $( curl --silent "${podcast}" | \
  xmlstarlet sel --text --template --match 'rss/channel/item' --value-of 'concat(pubDate, ";", enclosure/@url)' --nl - | \
  while read item
  do
    pubDate="$( echo ${item} | awk -F ';' '{print $1}' )"
    pubDate="$( \date -d "${pubDate}" --universal +%Y-%m-%dT%H:%M:%S )"
    url="$( echo ${item} | awk -F ';' '{print $2}' )"
    echo -e "${pubDate}\t${url}"
  done | \
  sort --numeric-sort --reverse | \
  head -1 | \
  awk '{print $NF}' )

Admittedly it's a smidgen longer than Kevie's in its one-liner format.

wget "$( curl --silent "https://tuxjam.otherside.network/feed/podcast/" | xmlstarlet sel --text --template --match 'rss/channel/item' --value-of 'concat(pubDate, ";", enclosure/@url)' --nl - | while read item; do pubDate="$( echo ${item} | awk -F ';' '{print $1}' )"; pubDate="$( \date -d "${pubDate}" --universal +%Y-%m-%dT%H:%M:%S )"; url="$( echo ${item} | awk -F ';' '{print $2}' )"; echo -e "${pubDate}\t${url}"; done | sort --numeric-sort --reverse | head -1 | awk '{print $NF}' )"

Provide feedback on this episode.
