Going Static Part 2

RSS Secrets

3 November 2018

For the second in my series on migrating this blog to Eleventy, I'm going to take you through creating my RSS feed, and what I learned about RSS and Atom. Regular readers will know I'm a big fan of RSS, but I must admit I didn't really understand how it works until I decided to write my own RSS file. RSS is both simpler and weirder than I realised. Eleventy actually has an RSS plugin, but I decided to roll my own. This was partially because I would have had to fiddle with some code anyway, since the plugin is written in the Liquid templating language instead of Handlebars, but mostly because I wanted to actually understand how an RSS file is constructed and how it works. This post outlines some of the things I learned, but the usual caveat applies: I'm not an expert, and I've probably got some things wrong. If there are any serious mistakes, feel free to let me know on Mastodon or Twitter.

A (very) brief history

The first thing we need to get out of the way is that the term 'RSS' is often used to refer to all the various flavours of RSS and the Atom Protocol, which was an attempt to modernise and replace RSS. Confusing things more, at one point there was effectively a fork in the development of RSS, with competing groups releasing different interpretations of RSS with, in one case, competing version numbering. I'm going to write a bit about Atom, but primarily this post is about the RSS protocol.

RSS developed in a fairly free-form way compared to most more recent web protocols. It was originally created at Netscape in 1999, bringing together ideas that had been floating around for a few years, but referring to no broader standards body. Ultimately this looseness is what led to the creation of Atom, but RSS 2.0 is still in very widespread use. Being a librarian, I'm persuaded by the argument that Atom is the better choice - with clear rules, every element properly explained, full namespacing, and adherence to other standards - but I ended up using RSS 2.0 for my feed. The reason for this was simple: my existing feed from when I was using Ghost is RSS 2.0 and used a permalink of /rss, so I wanted to ensure I didn't break anything currently using my feed.

But what exactly is RSS?

The key to understanding the nature of RSS (or Atom for that matter) is to understand that the acronym 'RSS' has stood for three different things over time, and you need to know all of them to get the full picture. Originally it was referred to as RDF Site Summary. The creators of version 0.91 called it Rich Site Summary. Most people, however, know the third term1 - Really Simple Syndication. If you understand that it is all of these things at the same time, then you can understand how RSS works. An RSS file is, essentially, just a summary of a website. It summarises the site using XML and the Resource Description Framework, enabling a rich ecosystem of independent software applications to parse each RSS file in a standardised way. And by providing the site summary in RDF via an XML file at a permanent URI, web content that is updated frequently can be syndicated. This last one is the thing that confused me for a while. 'Syndication' suggests that the content is pushed out somehow, but that's not how RSS works. An RSS file is simply a static XML document stored at a permanent address. The way RSS feeds are 'syndicated' is that applications periodically send a request to the file's URL, and check to see if it has changed in a particular way since it was last checked.
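To make the 'pull' model concrete, here is a rough sketch of a single polling step as a feed reader might perform it. The names pollOnce and fetchText are invented for illustration, the example.com address is a placeholder, and the regular expression is a shortcut that a real reader would replace with a proper XML parser.

```javascript
// Hypothetical sketch of one polling step in a feed reader.
// fetchText is an assumed function that returns the feed XML as a string;
// it is injected so the example can run without a network connection.
function pollOnce(fetchText, url, lastSeenBuildDate) {
  const xml = fetchText(url);
  // Shortcut for illustration only: real readers parse the XML properly.
  const match = xml.match(/<lastBuildDate>\s*([^<]*?)\s*<\/lastBuildDate>/);
  const buildDate = match ? match[1] : null;
  return {
    changed: buildDate !== lastSeenBuildDate,
    buildDate: buildDate,
  };
}

// A stand-in for an HTTP request, returning a static document.
const fakeFetch = () =>
  "<rss version=\"2.0\"><channel><lastBuildDate>Sat, 03 Nov 2018 00:00:00 GMT</lastBuildDate></channel></rss>";

// First poll: nothing has been seen yet, so the feed counts as changed.
const first = pollOnce(fakeFetch, "https://example.com/rss", null);
// Second poll: the build date is the same, so nothing needs processing.
const second = pollOnce(fakeFetch, "https://example.com/rss", first.buildDate);
console.log(first.changed, second.changed);
```

Nothing is ever pushed: the reader asks, compares, and decides for itself whether there is anything new.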

Writing an RSS feed

As I noted above, an RSS feed is simply an XML file. The RSS 2.0 specification outlines the requirements:

  • conforms to the XML 1.0 specification
  • has a top-level rss element specifying a version of 2.0
  • has a single channel element with compulsory sub-elements of title, link and description
  • within the channel element has one or more item elements
  • within each item element there must be at least one of title or description

And that's it. A very simple RSS file might look like this:

<rss version="2.0">
  <channel>
    <title>Information Flaneur</title>
    <link>https://www.hughrundle.net</link>
    <description>A blog about libraries, computer programming, and the impending end of humanity.</description>
    <item>
      <title>Going Static Part 1 - Messing with your head</title>
      <link>https://www.hughrundle.net/going-static-part-1/</link>
    </item>
  </channel>
</rss>

This is a totally valid RSS feed, but it's unlikely you will ever see something that only uses the bare minimum elements, and if you do so yourself, most RSS readers are likely to be fairly unhappy about it. Let's look at what else we can add:

Channel

  • language
  • copyright
  • managingEditor
  • webMaster
  • pubDate
  • lastBuildDate
  • category
  • generator
  • docs
  • cloud
  • ttl
  • image
  • rating
  • textInput
  • skipHours
  • skipDays

Item

  • description
  • author
  • category
  • comments
  • enclosure
  • guid
  • pubDate
  • source

These are all optional, which is just as well because many of them simply serve as an amusing reminder of how different the World Wide Web was in 2002. We're going to concentrate on a few key elements:

Channel

  • description: "Phrase or sentence describing the channel."
  • lastBuildDate: "The last time the content of the channel changed."
  • ttl: "ttl stands for time to live. It's a number of minutes that indicates how long a channel can be cached before refreshing from the source. This makes it possible for RSS sources to be managed by a file-sharing network such as Gnutella."2 (LOL)

Description should be a self-evidently useful piece of metadata. The lastBuildDate and ttl elements serve similar purposes to each other, in that they can be used by RSS reader software to process RSS feeds more efficiently by only processing or checking feeds when they are likely to have actually changed.
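As a rough illustration of how a reader might use ttl, the sketch below decides whether a cached copy of a feed is still fresh. The function name isCacheFresh and the timestamps are assumptions made up for this example; the spec only says that ttl is a number of minutes.

```javascript
// Hypothetical helper: decide whether a cached copy of a feed is still
// within its ttl (a number of minutes, per the RSS 2.0 spec).
function isCacheFresh(lastFetchedMs, ttlMinutes, nowMs) {
  const ageMs = nowMs - lastFetchedMs;
  return ageMs < ttlMinutes * 60 * 1000;
}

const fetchedAt = Date.UTC(2018, 10, 3, 9, 0, 0); // 3 November 2018, 09:00 UTC
const halfHourLater = fetchedAt + 30 * 60 * 1000;
const twoHoursLater = fetchedAt + 120 * 60 * 1000;

console.log(isCacheFresh(fetchedAt, 60, halfHourLater)); // true
console.log(isCacheFresh(fetchedAt, 60, twoHoursLater)); // false
```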

Item

  • description: "The item synopsis."
  • category: "Includes the item in one or more categories."
  • enclosure: "Describes a media object that is attached to the item."
  • pubDate: "Indicates when the item was published."
  • guid: "A string that uniquely identifies the item."

The item description should be a synopsis or precis of the content. However, because there is no provision in the RSS spec to include the full content of an article or other item, some feeds place the entire content in description. This is generally considered to be an error, and there are better ways to address this problem, as we will see shortly.

The category element can be used multiple times. So if you have several tags, you'd probably add a separate 'category' element for each tag.

An enclosure element can be used to 'attach' (or 'enclose', like putting it in an envelope) a media file to a feed item, and this single innovation is the basis of podcasting - it's how every podcast makes its way onto listening devices worldwide. Next time you encounter someone pontificating that RSS is dead but podcasting is the future, don't forget to laugh in their face.
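For the curious, an enclosure takes three required attributes: url, length (the file size in bytes) and type (a MIME type). The sketch below builds one from a plain object. The helper names, the example.com URL and the file size are all made up for illustration.

```javascript
// Escape the characters that are unsafe inside an XML attribute value.
function escapeAttr(value) {
  return String(value)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Hypothetical helper: build an <enclosure> element from its three
// required attributes (url, length in bytes, and MIME type).
function enclosureTag(media) {
  return '<enclosure url="' + escapeAttr(media.url) +
    '" length="' + escapeAttr(media.length) +
    '" type="' + escapeAttr(media.type) + '" />';
}

const exampleEnclosure = enclosureTag({
  url: "https://example.com/audio/episode-1.mp3", // made-up URL
  length: 12216320,                               // made-up file size
  type: "audio/mpeg",
});
console.log(exampleEnclosure);
```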

The last two elements - pubDate and guid - can both be used by parsers to work out whether or not an item is new to the feed, but guid is more reliable. The specification for guid is a bit weird, because whilst it should contain a 'global unique identifier', there are no rules at all about the syntax it should have. Often a guid will be the URL of the item, so there is an optional attribute isPermaLink which defaults to 'true'. However, many blogging systems assign a true unique number so that the item can have a stable identifier if the URL changes - in which case isPermaLink will be set to 'false'. The point of the guid is, of course, to help RSS parsers (readers) to identify whether they have already processed the item (e.g. added it to a reading list, queued it in a podcasting app, etc). We'll look more closely at this in a moment.
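A minimal sketch of how a reader might use guid for this, assuming it keeps a set of identifiers it has already processed. The function name findNewItems and the shape of the item objects are assumptions for illustration, not part of any specification.

```javascript
// Hypothetical reader logic: keep a Set of guids already processed and
// return only the items whose guid has not been seen before.
function findNewItems(seenGuids, items) {
  const fresh = [];
  for (const item of items) {
    if (!seenGuids.has(item.guid)) {
      seenGuids.add(item.guid); // remember it for the next poll
      fresh.push(item);
    }
  }
  return fresh;
}

// One guid is an opaque ID, the other is a permalink: both work as keys.
const seen = new Set(["5bb04e002c9b9a0603b3acaf"]);
const feedItems = [
  { guid: "5bb04e002c9b9a0603b3acaf", title: "The machine in Ghost" },
  { guid: "https://www.hughrundle.net/going-static-part-1/", title: "Going Static Part 1" },
];

const newItems = findNewItems(seen, feedItems);
console.log(newItems.map((item) => item.title));
```

Because the comparison is on the guid alone, changing an item's guid makes it look brand new to every reader that has already seen it.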

Automating

I can't say I was paying a lot of (any) attention to the RSS specification in the late 1990s and early 2000s - we can't all be child prodigies, after all. But part of the reason for the RSS fork appears to have been a division between those who wanted RSS feeds to be easy for website authors to create 'by hand', and those who wanted them to have more features and be easier for machines to parse. It seems that in the beginning, the intention really was for RSS files to be hand-coded and altered each time a new item was published. This seems completely bonkers to me now, but if you remember what the Web was like 19 years ago, it does make some sense. In a way, the arguments among those laying a claim to the RSS specification reflected broader shifts in how the Web was imagined. The fact that most people writing on the web in 2018 would think it was crazy to manually update their RSS feed, and have no idea how to do it, perhaps says more about what happened to the Web subsequently than whether it was a good idea originally.

In any case, I don't want to manually update an XML file every time I publish a blog post: I want it to happen automatically. Luckily, Eleventy and Handlebars can help me to take care of that. Let's have a look at how it's done. You hopefully remember from my last post that Eleventy is a static site generator software program, and Handlebars is a JavaScript templating language that allows us to use a placeholder in a template, then use that template to generate actual content. So we can write an RSS template like this:

<rss version="2.0">
  <channel>
    <title>
      {{site.title}}
    </title>
    <description>
      {{site.description}}
    </description>
    <link>{{site.root}}</link>
    <generator>Eleventy</generator>
    <lastBuildDate> {{latestDate collections.post}} </lastBuildDate>
    <ttl>60</ttl>
    {{#each collections.rssPosts}}
      <item>
        <title>
          {{data.title}}{{#if data.subtitle}} - {{data.subtitle}}{{/if}}
        </title>
        <link>
          {{data.site.root}}{{this.url}}
        </link>
        {{#if data.guid}}
          <guid isPermaLink="false">{{data.guid}}</guid>
        {{else}}
          <guid isPermaLink="true">{{data.site.root}}{{this.url}}</guid>
        {{/if}}
        {{#if data.tags}}
          {{#each data.tags}}
            <category domain="https://www.hughrundle.net/tag">
              {{this}}
            </category>
          {{/each}}
        {{/if}}
        <pubDate>{{utc date}}</pubDate>
        <description>
          {{#if data.summary}}
            {{data.summary}}
          {{else}}
            {{site.description}}
          {{/if}}
        </description>
      </item>
    {{/each}}
  </channel>
</rss>

We saw all these elements earlier; all we're doing is pulling in the relevant data. I outlined how site works in Part 1, so let's not dwell on that. There are, however, a couple of things that may look a bit weird. Firstly, there's the last build date:

<lastBuildDate> {{latestDate collections.post}} </lastBuildDate>

I stole this from the official Eleventy RSS plugin. What we're doing here is using a 'filter', or what would normally be called a 'helper' in Handlebars. It's just a JavaScript function that takes an argument and returns something. In this case, we want to look at all the pages with a 'post' tag (collections.post) and find the most recent publication date:

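eleventyConfig.addFilter("latestDate", function(posts) {
  // Track the most recent publication date seen so far
  let value = 0;
  for (let i = 0; i < posts.length; i++) {
    value = posts[i].date > value ? posts[i].date : value;
  }
  // toUTCString() gives the RFC 822 style date that RSS expects
  return new Date(value).toUTCString();
});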

We return it as a UTC string because the RSS specification requires all dates to be RFC 822 compliant. We do the same thing for each item's pubDate except in that case we just want to deal with a single date as the argument so it's simply:

// fix dates to UTC for RSS
eleventyConfig.addHandlebarsHelper("utc", function(pubDate, options) {
  let utcDate = new Date(pubDate).toUTCString();
  return utcDate
});
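To see what these two helpers actually emit, here is a sketch that rewrites them as plain functions so they can be run in Node without Eleventy. The posts array is made up for the example; only the date property matters.

```javascript
// Plain-function versions of the two date helpers, so they can be run
// without Eleventy. In a real config they would be registered with
// eleventyConfig as shown above.
function latestDate(posts) {
  let latest = 0;
  for (const post of posts) {
    if (post.date > latest) {
      latest = post.date;
    }
  }
  return new Date(latest).toUTCString();
}

function utc(pubDate) {
  return new Date(pubDate).toUTCString();
}

// Made-up collection: only the date property matters here.
const posts = [
  { date: new Date("2018-09-30T08:28:48.000Z") },
  { date: new Date("2018-11-03T00:00:00.000Z") },
];

console.log(utc("2018-09-30T08:28:48.000Z")); // Sun, 30 Sep 2018 08:28:48 GMT
console.log(latestDate(posts));               // Sat, 03 Nov 2018 00:00:00 GMT
```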

The other slightly complicated bit is the guid - and that's only because I migrated from Ghost. Ghost uses its own unique ID for each item. So for example my last post published with Ghost had this:

<guid isPermaLink="false">5bb04e002c9b9a0603b3acaf</guid>

This is fine, but I don't want to be creating my own unique IDs for every article when I could just use a permalink.3 The point of a guid is to make sure RSS readers don't retrieve items twice, so you shouldn't just change them when you migrate to a new system. To resolve this problem, I made sure that my migration script picked up the guid and put it into the front matter for all the posts that came out of Ghost:

layout: post-migrated
title: "The machine in Ghost"
author: hugh
tags: ['ghost','GLAM blog club','post']
date: 2018-09-30T08:28:48.000Z
permalink: 2018/09/30/the-machine-in-ghost/index.html
guid: 5bb04e002c9b9a0603b3acaf

With that in the data for each old post, I could then add an 'if/else' statement to my RSS feed:

{{#if data.guid}}
  <guid isPermaLink="false">{{data.guid}}</guid>
{{else}}
  <guid isPermaLink="true">{{data.site.root}}{{this.url}}</guid>
{{/if}}

Problem solved!
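The same decision can be sketched in plain JavaScript, which makes it easy to check both branches. The function names guidFor and guidTag, and the shape of the post objects, are assumptions for illustration rather than anything Eleventy provides.

```javascript
// Hypothetical plain-JavaScript version of the template's if/else:
// use the migrated Ghost guid when there is one, else the permalink.
function guidFor(post, siteRoot) {
  if (post.data.guid) {
    return { value: post.data.guid, isPermaLink: false };
  }
  return { value: siteRoot + post.url, isPermaLink: true };
}

function guidTag(post, siteRoot) {
  const guid = guidFor(post, siteRoot);
  return '<guid isPermaLink="' + guid.isPermaLink + '">' + guid.value + "</guid>";
}

// A post migrated from Ghost carries its old guid in the front matter...
const migrated = {
  url: "/2018/09/30/the-machine-in-ghost/",
  data: { guid: "5bb04e002c9b9a0603b3acaf" },
};
// ...while a post written after the migration has none.
const native = { url: "/going-static-part-1/", data: {} };

console.log(guidTag(migrated, "https://www.hughrundle.net"));
console.log(guidTag(native, "https://www.hughrundle.net"));
```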

We saw category before, but you may have noticed I've added something. In addition to simply listing each category, it's possible to link to a taxonomy by using the domain attribute. For example, you might have a blog about Australian wildlife and want to restrict yourself to the Atlas of Living Australia taxonomy. In that case, you would have a domain linking to the root of the taxonomy, and put the entry (the part after the last forward slash) as the category value. For example, here's one of my favourite Australian birds:

<category domain="https://bie.ala.org.au/species">urn:lsid:biodiversity.org.au:afd.taxon:91c90b44-e9dd-4ce1-a4b5-37d60d59b859</category>

Eww, that doesn't look so great, huh? Let's face it, almost nobody actually uses it this way. More likely, you'll have your own idiosyncratic and loosely-structured taxonomy that uses common words or phrases. In Eleventy, you can pretty easily create automatic pages for every tag you use, which means we have an inbuilt structure for taxonomy URIs. That allows us to do this:

{{#if data.tags}}
  {{#each data.tags}}
  <category domain="https://www.hughrundle.net/tag">
  {{this}}
  </category>
  {{/each}}
{{/if}}

Then if I have a tag called 'eleventy', you know you can find the canonical URL for that term in my *cough* highly structured taxonomy at https://www.hughrundle.net/tag/eleventy. There's an outstanding problem with this when it comes to multi-word terms due to a mis-use of the category element both by me and a lot of other people. According to the RSS Board's official advice:

The category's value should be a slash-delimited string that identifies a hierarchical position in the taxonomy.

So if I have a tag called 'GLAM Blog Club', the value of category should be glam-blog-club because the URL for that tag is at https://www.hughrundle.net/tag/glam-blog-club. I didn't realise this until doing some homework for this post, so I will probably rethink how Aus GLAM Blogs deals with tags, and I made a change like this for my RSS feed:

<category domain="https://www.hughrundle.net/tag">
  {{slug this}}
</category>
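To show the kind of transformation a slug helper performs, here is a rough stand-in. It is not Eleventy's own implementation, just a simple approximation that handles plain ASCII tag names; simpleSlug and categoryTag are names invented for this sketch.

```javascript
// Hypothetical stand-in for a slug helper. This is NOT Eleventy's own
// implementation, just a simple approximation for ASCII tag names.
function simpleSlug(text) {
  return String(text)
    .toLowerCase()
    .trim()
    .replace(/[^a-z0-9]+/g, "-") // runs of anything else become one hyphen
    .replace(/^-+|-+$/g, "");    // drop leading and trailing hyphens
}

// Build the category element with the slug as its value.
function categoryTag(tag) {
  return '<category domain="https://www.hughrundle.net/tag">' +
    simpleSlug(tag) + "</category>";
}

console.log(simpleSlug("GLAM Blog Club")); // glam-blog-club
console.log(categoryTag("eleventy"));
```

Joining the domain, a slash and the category value then gives the URL of the tag page.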

You may be tempted to think that author would be a useful element to add to each item. Unfortunately, the Web of 1999 looked a little different to the Web in 2018. The RSS spec tells us that author is for "the email address of the author of the item." 🙃 Hmm, maybe not. But surely it would be useful to have the author's name in the RSS feed. And you've probably seen items come through on an RSS feed that do have the author's name. So how do we do that? We do it with namespacing.

Extending RSS with namespaces

The major reason Atom was invented was as a result of an argument about whether or not RSS should be namespaced. In the end the RSS 2.0 specification settled on a compromise, whereby all the existing RSS elements are not namespaced, but RSS can be extended with new elements as long as they are namespaced - and in fact this is encouraged. As we'll see, this includes using Atom elements inside RSS2 feeds, which is confusing but perfectly valid. To complete our RSS feed we're going to add three additional elements from outside the RSS2 schema, and also make a couple of changes to clean things up. We're going to add:

  • atom:link
  • dc:creator
  • content:encoded
  • CDATA
  • another handlebars helper called deXMLify

The atom:link element is just the link element from Atom, namespaced for use in RSS. You may be wondering why we need to do this, given that RSS already has a link element. Technically we don't need to do it, but it's highly recommended by the RSS Board, because unlike the base RSS elements, Atom allows us to add a relationship attribute of "self" to the link. That allows us to identify the feed's own URL within itself - making the feed more portable. To add a namespace, we need to use a similar technique to the one I described in Going Static Part 1 when I wrote about adding metadata in the <head> element: in this case, we add a reference inside the opening tag of the rss element, using the XML namespace declaration:

<rss xmlns:atom="http://www.w3.org/2005/Atom" version="2.0">

Now we can add an atom:link element inside the channel element:

<atom:link href="{{site.root}}/rss" rel="self" type="application/rss+xml"/>

I mentioned the problem with the native RSS author element, so let's deal with that. In this case, what we want is to use the Dublin Core creator element:

<rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0">

...

<dc:creator>
  {{#if data.author}}{{data.author}}{{else}}{{data.site.author}}{{/if}}
</dc:creator>

Finally, remember we talked about how RSS 2.0 doesn't have an element designed for the actual content of an item in a feed? Well, prepare for your brain to melt. We're going to use the content namespace that was created in the ...RSS 1.0 specification:

<rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0">
...
<content:encoded>
<![CDATA[ {{{templateContent}}} ]]>
</content:encoded>

Cleaning up your XML

Woah, what is <![CDATA[]]> ‽ In XML markup, there are three characters that are explicitly disallowed: angular brackets4, and ampersands5. There are also a few other rules about escaping various things that could be interpreted as XML markup when you want them to be treated as literal text. The way to tell XML that everything in a chunk of text is content rather than markup, is to use CDATA. I use it in two places: one you saw above, in content:encoded. The other is in the item description (for a while it was unclear whether this was allowed, but the RSS 2.0 specification makes clear that it is). The final thing we need to clean up is the title and description for the channel. These could have content that might be interpreted as XML. For example, my last post had a subtitle of "Messing with your <head>". This will break the RSS feed if it's not dealt with. Originally I simply used CDATA, but technically you're not supposed to use any HTML markup in the channel or item titles - they should be 'plain text'. Escaped HTML also looks horrible. Whilst you're allowed to use escaped HTML in item descriptions, it's still somewhat ambiguous for channel descriptions. So for all titles, and the channel description, we need to remove the dangerous ampersands and angled brackets. We can use a filter again:

eleventyConfig.addFilter("deXMLify", function(text) {
  let newstring = text.replace(/&/g, 'and').replace(/[<>]/g, '')
  return newstring
})

This will change ampersands to 'and', and simply remove any angled brackets. Now in my RSS feed I just use the filter:

<title>
{{deXMLify data.title}}{{#if data.subtitle}} - {{deXMLify data.subtitle}}{{/if}}
</title>
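Stripping characters is one option. Two alternatives are sketched below, purely as illustrations rather than as the approach used on this blog: escaping the reserved characters as XML entities, and wrapping text in CDATA while guarding against the one sequence that CDATA cannot contain. The function names escapeXml and wrapCdata are made up for this example.

```javascript
// Alternative to stripping: replace the reserved characters with
// XML entities so the original text survives a round trip.
function escapeXml(text) {
  return String(text)
    .replace(/&/g, "&amp;") // must come first, or it would re-escape the others
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// Wrap text in CDATA. A literal "]]>" inside the text would end the
// section early, so it is split across two adjacent CDATA sections.
function wrapCdata(text) {
  return "<![CDATA[" + String(text).split("]]>").join("]]]]><![CDATA[>") + "]]>";
}

console.log(escapeXml("Messing with your <head>"));
console.log(wrapCdata("<p>Hello & welcome</p>"));
console.log(wrapCdata("a ]]> b"));
```

Whether an RSS reader displays an escaped title nicely is a separate question, which is one reason simply removing the characters can be the pragmatic choice.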

PURL

Wow, that was a lot to take in - if you're still reading, congratulations and trust me, it's not quite as complicated as it might sound. Before finishing, I thought I might share one last little thing I discovered. When I was looking at the html head metadata, I noticed that the Dublin Core schema URL starts with http://purl.org, but I didn't really think much about it - I just assumed it was a weird URL associated with Dublin Core for some reason. But then when I was checking my RSS feed again, I noticed that the RSS 1.0 spec (linked for content) also uses http://purl.org. It turns out that 'PURL' stands for "Persistent URL" and it's a service from our good friends the Internet Archive. As far as I can tell, it works basically like a DOI but is intended for exactly the thing we're using it for here: permanent addresses for schema descriptions.

So now you hopefully have learned more than you really wanted to know about making your own RSS feed. The last in this series about moving from Ghost to Eleventy will be about a little tool I made to generate markdown templates and automatically insert a URL to a free-to-use image. While you're waiting for that, why not go and build your own RSS feed from scratch?


1. which is really a 'backronym'

3. I actually think using a unique ID string is better practice, but ...it's also a pain in the neck and I'm lazy.

4. i.e. '<' and '>'

5. '&'