Thursday, October 29, 2009

Archiving my blog posts




Last time I talked about looking at the Google Data API to archive my blog posts. Of course, there is a download link on the blog under the Settings tab, but I didn't like the look of the XML it gives me. However, I realized that (naturally enough) I get exactly the same data through the API. So I stopped with Google Data and started trying to parse what I have. I found Mark Pilgrim's book very useful here, not only because it's so clearly written but also because it's directly on point.

I use ElementTree to parse the XML. According to the Atom specification, the root element is a "feed", and ElementTree qualifies its name with the Atom "namespace." This is its "tag":

{http://www.w3.org/2005/Atom}feed
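
To see where that tag comes from, here is a minimal sketch in Python 3. The sample string is a made-up stand-in for the exported file, not the real download; the only thing it shares with the real data is the Atom namespace declaration.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature feed; the real export is far larger.
sample = '<feed xmlns="http://www.w3.org/2005/Atom"><title>demo</title></feed>'

root = ET.fromstring(sample)

# ElementTree folds the namespace URI into the tag, in curly braces.
root_tag = root.tag
print(root_tag)
```

For a file on disk, ET.parse(filename).getroot() would give the same kind of element.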

The child elements of the feed are varied. These are their tags (minus the namespace):

• id
• updated
• title
• link (x 4)
• author
• generator
• entry (many)
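
A list like the one above can be produced by stripping the namespace from each child's tag and counting. This is only a sketch: local_name is a helper name I made up, and the sample feed is a toy, not my actual export.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def local_name(tag):
    """Return the tag without its '{namespace}' prefix, if it has one."""
    return tag.rsplit("}", 1)[-1]

# Hypothetical miniature feed with a few of the children listed above.
sample = """<feed xmlns="http://www.w3.org/2005/Atom">
  <id>x</id><updated>y</updated><title>t</title>
  <link/><link/><entry/><entry/><entry/>
</feed>"""

root = ET.fromstring(sample)

# Iterating over an element yields its direct children.
counts = Counter(local_name(child.tag) for child in root)
for name, n in counts.items():
    print(name, n)
```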

You might think that once you got down to an entry, it would be one of the blog posts. But nope. It's still metadata... Each of these entries has a child (id) whose text value is a long prefix followed by BLOG_PUBLISHING_MODE or BLOG_NAME and so on.

The first authentic post is entry number 58. I distinguish the real entries from metadata by testing whether the last character of the id is a digit, as in:

tag:blogger.com,1999:blog-8953369623923024563.post-799567705527714853
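
The digit test might look something like the sketch below. The function name is_post is made up, and so is the prefix on the metadata id in the sample (I only kept the BLOG_NAME ending); the post id is the one shown above.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

def is_post(entry):
    """True when the text of the entry's id child ends in a digit."""
    id_text = (entry.findtext(ATOM + "id") or "").strip()
    # Slicing with [-1:] is safe even when the id is empty.
    return id_text[-1:].isdigit()

# Hypothetical feed: one metadata entry (invented prefix), one real post.
sample = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry><id>tag:example.invalid,1999:blog-1.settings.BLOG_NAME</id></entry>
  <entry><id>tag:blogger.com,1999:blog-8953369623923024563.post-799567705527714853</id></entry>
</feed>"""

feed = ET.fromstring(sample)
entries = feed.findall(ATOM + "entry")
posts = [e for e in entries if is_post(e)]

# Index of the first real post among all entries.
first_post_index = next(i for i, e in enumerate(entries) if is_post(e))
print(len(entries), len(posts), first_post_index)
```

The test is a heuristic that happens to fit this data, nothing more.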

An item, whether it's a post, a metadata entry, or a child of an entry, may have attributes, which ElementTree stores in a Python dictionary. It may also have a text value, or not. I wrote a script to look through all this stuff (bloggerScript.py).
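
This is not bloggerScript.py itself, just a minimal sketch of the same idea: for each child of an entry, report the tag, the (truncated) text, and the attribute dictionary. The function name describe, the width of 50, and the t:/k:/v: labels are my choices for the sketch; the sample entry is a toy built from two values that appear in the output further down.

```python
import xml.etree.ElementTree as ET

def describe(elem, width=50):
    """Return report lines for one element: tag, text, then attributes."""
    lines = [elem.tag.rsplit("}", 1)[-1]]      # tag minus the namespace
    text = (elem.text or "").strip()           # elem.text may be None
    if len(text) > width:
        text = text[:width] + "..."
    lines.append("t: " + text)
    if not elem.attrib:                        # attrib is a plain dict
        lines.append("{}")
    for key, value in elem.attrib.items():
        lines.append("k: " + key)
        lines.append("v: " + value)
    return lines

# Hypothetical miniature entry.
sample = ('<entry xmlns="http://www.w3.org/2005/Atom">'
          '<title type="text">Heat Mapper rises from the ashes</title>'
          '<published>2009-09-08T02:34:00.000-07:00</published>'
          '</entry>')

entry = ET.fromstring(sample)
report = [describe(child) for child in entry]
for lines in report:
    print("\n".join(lines))
```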

This post is long enough that I think I'll quit here and talk about extracting the URL for each of my images another time. Here is the output for one element:


item 100
id:
tag:blogger.com,1999:blog-8953369623923024563.post-2518535825745482316
title:
Heat Mapper rises from the ashes
tag:
entry
children:
id
t: tag:blogger.com,1999:blog-8953369623923024563.post...
{}
published
t: 2009-09-08T02:34:00.000-07:00
{}
updated
t: 2009-10-01T07:18:10.585-07:00
{}
category
t:
k: term
v: http://schemas.google.com/blogger/2008/kind#post
k: scheme
v: http://schemas.google.com/g/2005#kind
category
t:
k: term
v: Instant Cocoa
k: scheme
v: http://www.blogger.com/atom/ns#
title
t: Heat Mapper rises from the ashes
k: type
v: text
content
t: <a onblur="try {parent.deselectBloggerImageGracefu...
k: type
v: html
link
t:
k: href
v: http://telliott99.blogspot.com/feeds/2518535825745...
k: type
v: application/atom+xml
k: rel
v: replies
k: title
v: Post Comments
link
t:
k: href
v: https://www.blogger.com/comment.g?blogID=895336962...
k: type
v: text/html
k: rel
v: replies
k: title
v: 0 Comments
link
t:
k: href
v: http://www.blogger.com/feeds/8953369623923024563/p...
k: type
v: application/atom+xml
k: rel
v: edit
link
t:
k: href
v: http://www.blogger.com/feeds/8953369623923024563/p...
k: type
v: application/atom+xml
k: rel
v: self
link
t:
k: href
v: http://telliott99.blogspot.com/2009/09/heat-mapper...
k: type
v: text/html
k: rel
v: alternate
k: title
v: Heat Mapper rises from the ashes
author
t:
{}
{http://search.yahoo.com/mrss/}thumbnail
t:
k: url
v: http://1.bp.blogspot.com/_39NGKVWYg3o/SqYr9tfuDJI/...
k: width
v: 72
k: height
v: 72
{http://purl.org/syndication/thread/1.0}total
t: 0
{}
links:
http://www.blogger.com/feeds/8953369623923024563/posts/default/2518535825745482316
content:
k: type
v: html
<a onblur="try {parent.deselectBloggerImageGracefu
images:
https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEitZvwkD8rqQHLVk3Bohd0JaAuLTwvcW6JHvrYscyHzEdp6aFBF1nAk4jojHc9OIJRwrnlk0FIBdNhoczY9mFqQ5rnzzebZzPZUtLYpyDa683dLD1owrqYzevH-CmbAd2drSuwkxNzLFGbb/s1600-h/Screen+shot+2009-09-08+at+6.02.38+AM.png