it.gen.nz

Writings on technology and society from Wellington, New Zealand

Sunday, September 21, 2008

A little programming project

As you know, I post my radio speaking notes as blog entries. At Miraz’s suggestion, I load these entries in advance and set them to publish automagically while I am on air. WordPress is clever like that.

Sometime after that, Radio New Zealand puts my radio slot online as sound files in ogg and mp3. Thanks, guys. But, I don’t know in advance what the file names are going to be so I can’t link them directly from my post. In practice I generally link to the download page for the whole of Nine to Noon and leave it at that.

Recently, Hamish wrote to me and suggested that I link my sound files directly from my post. I told him that I was far too lazy, but it has set me thinking – surely I can get a computer to do this.

This is the first in an ongoing series of posts about a little programming project I’ve started to automate the process of adding links to the sound files as they become available. I’ll collect the program together in a page on the site.

If you have any interest, read on and see just how easy free and open source tools make it to throw together something like this.

For now, I’m going to concentrate on the code necessary to read the addresses of the sound files from Radio New Zealand’s site and upload them to my blog. In a later post, I’ll talk about how to set this up as an automatic job that runs at the right time of the week – that is, just after I finish my radio slot.

I decided to write this in Python. That’s not because I’m religious about Python, but because it’s the scripting language I know least badly. Python is open source, it’s GPL compatible, and it runs on the Mac I’ll develop it on as well as on the GNU/Linux server that the program will probably end up running on. And there’s good documentation available on the Net. A primer is here.

As an aside, I want to point out that I’m self-taught as far as Python is concerned. My style is, probably, not the best. Sorry. Any real Pythonistas who might read this – please be gentle. I’m trying to convey a sense of how easy it is to work this stuff out, not convert people to professional status in one hit.

Python’s model is to keep all non-core language features in modules. These are chunks of Python code that you explicitly import into your program. This approach keeps the language clean and simple because you don’t have an enormous amount of other material you have to learn – just the modules you need, when you need them. And it makes the language extensible because it allows others to write modules which you can install. In this project I wound up using several modules that ship with Python and one that is separate.
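To make that concrete, here is a tiny sketch using os.path, one of the modules that ships with Python. The file name is made up purely for illustration. The point to notice is that, once a module is imported, its functions are reached through the module's name; leave the prefix off and Python will complain that the name is not defined.

```python
# Import a module that ships with Python.
import os.path

# Hypothetical file name, invented for this example.
path = "/audio/ntn-20080918-New_Technology.ogg"

# Functions inside the module are reached through the module's name.
base = os.path.basename(path)             # the part after the last slash
stem, extension = os.path.splitext(base)  # split off the file extension

print(extension)
```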

To get started, I call up python in a terminal window:

taniwha:~ colinjackson$ python
Python 2.5.1 (r251:54863, Jan 17 2008, 19:35:16) 
[GCC 4.0.1 (Apple Inc. build 5465)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> 

We are now in the Python interpreter. Anything we type from now on will be treated as a Python statement and run immediately, if possible. This is a great way of figuring out what has to go into a simple program. As I get statements to work, I copy them into a file in a text editor. Any editor will do – TextEdit is fine. I use BBEdit.

There are three basic steps to the program: 1) get the web page with links to the sound files from Radio New Zealand; 2) extract the links to the sound files; and 3) edit them into the last entry on my blog.

We get the links to the sound files by downloading the web page that they live on and parsing it for the links. Python provides a module called urllib which we can use for this. Let’s get it:

import urllib

Let’s save the address of the web page at Radio New Zealand:

rnzurl = "http://www.radionz.co.nz/national/programmes/ninetonoon"

(This web page only has links to my sound files on it on Thursdays, and only then after I have done my slot.)

Reading the documentation, we find that urllib has a function called urlopen which returns a file handle – it lets you treat a web page as a file. And since this is a file, we can use the read() method on it to get the contents as a Python variable.

page = urllib.urlopen(rnzurl).read()

Now, in page we have a string of characters which is the HTML source of the web page. Maybe it has the sound file links in, maybe not.

(I haven’t covered for the possibility that the website was down or our Net access broken. Let’s keep it simple for now.)
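For the curious, covering those failures need not take much. What follows is a rough sketch rather than the code I'm actually running: fetch_page and its opener parameter are names I've made up for illustration, and the import at the top is written so that it also works on Python 3, where urlopen lives in urllib.request.

```python
try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib import urlopen          # Python 2, as used in this post

def fetch_page(url, opener=urlopen):
    """Return the raw contents of the page, or None if it can't be fetched.

    fetch_page and opener are illustrative names, not part of urllib.
    """
    try:
        handle = opener(url)
        try:
            return handle.read()
        finally:
            handle.close()
    except IOError:
        # Raised when the site is down or our Net access is broken.
        return None
```

The calling code can then test for None and try again later instead of falling over. One wrinkle: on Python 3, read() hands back bytes rather than a string, so the result would need decoding before the regex step.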

There are lots of ways to find out whether the links we want are in the file, and to extract them. The way I used is with regular expressions, or regexes. Think of a regex as a pattern that you look for in a longer string.

Regexes are immensely powerful but have a fairly steep learning curve. I know just enough of them to achieve what I want, usually by poring over the relevant documentation.

We need a pattern to match the links to my sound files. The sound file names always contain the string New_Technology, and they are embedded in an HTML link statement.

A bit of trial and error on the Python command line got me to this pattern:

'"http.*?New_Tech.*?"'

The . matches any character and the * following it repeats that any number of times. The ? after the * specifies that we are looking for the shortest string that matches. The "http introduces a link and the trailing double quote closes it. Don’t worry if the detail isn’t clear; it’s the principle that matters.
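If you want to see what that ? buys us, here is a small experiment. The HTML is made up for illustration; the real addresses on Radio New Zealand's page will look different. The first pattern leaves the ? out, the second is the one above.

```python
import re

# Made-up HTML for illustration; the real page's addresses will differ.
sample = ('<a href="http://example.org/New_Technology.ogg">ogg</a> '
          '<a href="http://example.org/New_Technology.mp3">mp3</a>')

# Greedy: .* grabs as much as it can, so both links come back
# glued together as one long match.
greedy = re.findall(r'"http.*New_Tech.*"', sample)

# Non-greedy: .*? stops as soon as it can, so each link is
# matched on its own.
lazy = re.findall(r'"http.*?New_Tech.*?"', sample)

print(len(greedy))
print(lazy)
```

The greedy version finds one match running from the first link's opening quote to the second link's closing quote. The non-greedy version finds the two links separately, which is what we are after. Bear in mind that even the non-greedy pattern starts at the first "http it meets, so an unrelated link earlier on the same line could get swept into the match.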

Python implements regexes in (of course) a module, called re. Once it is imported, the line we need looks like this:

import re
links = re.findall(r'"http.*?New_Tech.*?"',page)

The findall method returns, as you might expect, all matching strings, so that way I get the links to both the sound files. They are presented as a Python list.

We should check that there are two links, and we will in the proper program, but for now let’s just plough on. We need to assemble these two links into the correct style to go on the blog:

linktext = ' <a href='+links[0]+'>ogg</a> or <a href='+links[1]+'>mp3</a>'

You’ll notice that the first element of a list is numbered 0 and we use square brackets to pick list members.
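The check for exactly two links could look something like the sketch below. The function name build_linktext and the sample page are my own inventions for illustration, and the sketch assumes the ogg link comes before the mp3 link on the page.

```python
import re

def build_linktext(page):
    """Return the link text for the blog, or None unless exactly two links are found.

    build_linktext is an illustrative name; it assumes the ogg link
    appears before the mp3 link in the page.
    """
    links = re.findall(r'"http.*?New_Tech.*?"', page)
    if len(links) != 2:
        return None
    # Each match still carries its surrounding double quotes, so it
    # can go straight after href= without adding any more.
    return ' <a href=' + links[0] + '>ogg</a> or <a href=' + links[1] + '>mp3</a>'

# Made-up page fragment for illustration.
sample = ('<a href="http://example.org/New_Technology.ogg">ogg</a> '
          '<a href="http://example.org/New_Technology.mp3">mp3</a>')

print(build_linktext(sample))
print(build_linktext("no sound files yet"))
```

Returning None when the links aren't there gives the calling code something simple to test before it touches the blog.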

Now we need to use Python to edit these links into the last entry on the blog. For this, we’ll use a handy Python module called wordpresslib. It’s not part of the standard Python distribution; I found it by Googling, downloaded it and manually installed it into the site-packages directory in the Python library on my computer. Mental note: I’ll need to do that on any other computer I want to run this on, like my GNU/Linux server.

Initially I was getting strange HTTP errors – 404s and 301s for blogs I know are there – until I found this page which showed me I had to point to a page /xmlrpc.php under my blog. Here’s the code to set up access to the most recent blog posting:

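import wordpresslib
blogaddr = "https://it.gen.nz/xmlrpc.php"
# password is a variable holding the blog password; it must be set before this line
blog = wordpresslib.WordPressClient(blogaddr,"colin",password)
blog.selectBlog(0)
post = blog.getLastPost()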

The Python object post is an instance of a class in wordpresslib. It has an attribute description which contains the text of the post body until the “more”. I always end these with “download the audio” until I edit links to the sound files in.

Our friend the re module comes in handy to split the post before the “download the audio”:

frags = re.split(r'download the audio',post.description)

We only care about the first element of frags – add the link text and a closing full stop, then put it back into the post.

post.description = frags[0]+'download the audio as' + linktext + "."
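You can try the split-and-rebuild step on an ordinary string without going anywhere near the blog. In this sketch the function name add_links, the post text and the links are all made up for illustration.

```python
import re

def add_links(description, linktext):
    """Cut the text at 'download the audio' and put the phrase back with links.

    add_links is an illustrative name, not part of wordpresslib.
    """
    frags = re.split(r'download the audio', description)
    return frags[0] + 'download the audio as' + linktext + '.'

# Made-up post text and link text for illustration.
before = 'Notes from this week. You can download the audio'
linktext = (' <a href="http://example.org/a.ogg">ogg</a>'
            ' or <a href="http://example.org/a.mp3">mp3</a>')

once = add_links(before, linktext)
twice = add_links(once, linktext)

print(once)
print(once == twice)
```

Because everything from “download the audio” onwards is thrown away and rebuilt, running the step a second time gives the same text rather than a second set of links. The flip side is that if the phrase is missing from the post, the links simply get tacked onto the end.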

Now, write it back to the blog:

blog.editPost(post.id,post,1)

and we’re done. Success! And it only took a dozen or so statements.

There’s some more detail to do – although not a lot – waiting for the links to become available on the web page, error checking and the like. And it needs to be put onto a server and set off by a cron job that will run it every Thursday morning. But, for now, I’ll wrap this up and post the code to the finished program later.

posted by colin at 10:00 pm  

9 Comments

  1. Python is a great choice. Having started learning it myself about three years ago, after doing Perl for many years, I must say it’s a refreshing change–a good, clean, compact core with an amazing number of libraries built on top of that.

    Just one thought: shouldn’t you use https:// instead of http:// in the XML-RPC to your WordPress server? Security and all that…

    Comment by Lawrence D'Oliveiro — 22 September 2008 @ 9:27 pm

  2. Hey Colin,

    thanks so much for the link love. :-)

    Thanks too for the beginnings of this practical intro to Python – it’s something I’ve long been curious about.

    I have a problem. I get as far as the 4th line in your instructions above (the line starting page=) and get an error:

    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    NameError: name 'urlopen' is not defined

    My version of Python is the same as yours (except that yours says 19:35:16 and mine says 19:35:17). Any idea why I might be getting an error?

    I’ve copied and pasted (or dragged and dropped) from your instructions…

    Comment by Miraz Jordan — 23 September 2008 @ 4:47 pm

  3. Hi Miraz

    You have spotted my deliberate mistake :-)

    The ‘urlopen’ needs to be prefixed by a ‘urllib.’ so Python knows where to find it.

    The line in question should read:

    page = urllib.urlopen(rnzurl).read()

    I’ll change the posting so it’s correct.

    Anyone else find errors, please point them out.

    Colin

    Comment by colin — 23 September 2008 @ 5:54 pm

  4. Hi Colin,

    still no joy. :-(

    The following lines numbered here for ease of reading, but not numbered in the Terminal entries. I do this:

    1] python
    2] import urllib
    3] rnzrurl = “http://www.radionz.co.nz/national/programmes/ninetonoon”
    4] page = urllib.urlopen(rnzurl).read()

    and get this:

    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    NameError: name 'rnzurl' is not defined

    At least the error message is now different from the first one I was seeing. :-)

    Any further thoughts?

    Cheers,

    Miraz

    Comment by Miraz Jordan — 25 September 2008 @ 6:38 am

  5. “rnzrurl” vs “rnzurl”

    Comment by Lawrence D'Oliveiro — 25 September 2008 @ 6:50 pm

  6. I’ve done a new post on this one – please continue the comments there.

    Comment by colin — 26 September 2008 @ 7:50 am

  7. Thanks Lawrence. I’d looked for typos, but hadn’t spotted that one.

    Comment by Miraz Jordan — 26 September 2008 @ 8:21 am

  8. […] few weeks ago I blogged about writing a little program to make my life easier. (The entries are here and here.) In summary this program automates the messy but easy administrative task of editing […]

    Pingback by it.gen.nz » A little programming project - part 3 — 19 October 2008 @ 6:16 pm

  9. Nise site,

    Comment by name — 29 July 2009 @ 9:31 pm
