Indexing Australian GLAM blogs

Last week I did a big update to the code that drives the Australian GLAM Blogs website and Twitter account. Initially I was just intending to do a bit of interface cleanup, and add a simple search tool to find blogs. In thinking through how the search index would work, however, I realised that the most effective way to do it would be to download all the categories (i.e. subjects) each blog has used. If I was doing that, I may as well grab all the other post information (title, author, date...). Suddenly, I was creating an index of every article from every blog in the list.

The mechanics of doing this retrospectively are relatively simple, but the outcome is inconsistent. I’ve always used Feedparser to grab the blog information, but previously only used it to tweet the title of the post. Feedparser has a lot more functionality than that, but it’s limited by the RSS and Atom formats it’s reading. What that means is that how far back we can go depends on how each feed is set up - feedparser can only work with what it’s given. Generally this is the last ten posts regardless of their age, but it varies. To get the retrospective information, I wrote a simple Meteor application that pulled in the data from each feed and fed it into a MongoDB collection. I did the same thing for the categories/tags used, either adding the tag or incrementing the number associated with it. Here's the important bit:

while (item = stream.read()) {

  // add article data to the Articles and Tags collections

  // use Fibers so we can interact with the collections inside Feedparser
  // see https://stackoverflow.com/questions/21151202/async-call-generates-error-cant-wait-without-a-fiber-even-with-wrapasync
  // add any tags to the Tags collection
  Fiber(function(){
    var cats = item.categories;
    cats.forEach(function(x){
      var tag = x.toLowerCase();
      Tags.upsert({tag: tag}, {$inc: {total: 1}});
    });
    Fiber.yield();
  }).run();

  // add the article to the Articles collection
  Fiber(function(){
    var array = item.categories;
    var cats = [];
    array.forEach(function(x){
      var normalised = x.toLowerCase();
      cats.push(normalised);
    });
    Articles.upsert({link: item.link}, {$set: {title: item.title, author: item.author, categories: cats, blog: meta.title, blogLink: meta.link, date: item.pubdate}});
    Fiber.yield();
  }).run();
}

Fibers

There are a few things going on here. The first thing you’ll notice is my comment about using Fibers. This is a concept created to deal with the fact that JavaScript is naturally asynchronous, but when it is used as a server-side language in Node.js, sometimes you really need to do things synchronously. To be honest I only have a loose understanding of how this works, but getting to grips with what code was required to make Fibers work saved me several hours of frustration later. In Meteor, every time you want to interact with Mongo collections from inside a node package (in this case, feedparser) you need to wrap that request inside a ‘fiber’. This got slightly more complex later when I moved to the actual glamblogs app, where I had to wrap these two fibers inside another one.

Normalising

My initial tests for this process quickly highlighted the need to ‘normalise’ some of the data - specifically, the categories or tags associated with each post. Because I wanted to have a universal list of categories, I needed each instance of a particular word to be the same. For example, “Information Literacy”, “Information literacy” and “information literacy” are clearly all about the same thing. The simplest solution was to use the JavaScript method toLowerCase() and thereby eliminate any capitalisation problems. This is by no means perfect. For example, there is a category “blogjune” and a category “#blogjune”. Obviously these are about the same thing, but I didn’t strip out non-alphabetical characters in the normalisation process, so they are listed as two different categories. Likewise, Mitchell Whitelaw has written several posts using the category “generousinterfaces”. If someone were to write a post and give it the category “generous interfaces”, it would be listed separately. I can always adjust this in future and retrospectively update the index, but I’m loath to mess too much with what authors have listed - sometimes punctuation and ‘special’ characters matter.
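To make the trade-off concrete, here is a minimal sketch in plain JavaScript. normaliseTag mirrors the lowercase-only step described above; normaliseTagStrict is a hypothetical stricter variant (an assumption for illustration, not something the site does) that also strips everything except ASCII letters and digits.

```javascript
// Lowercase-only normalisation, as described in the post.
function normaliseTag(tag) {
  return tag.toLowerCase();
}

// Hypothetical stricter variant (not used on the site):
// also drops anything that is not an ASCII letter or digit.
function normaliseTagStrict(tag) {
  return tag.toLowerCase().replace(/[^a-z0-9]/g, '');
}

console.log(normaliseTag('Information Literacy'));      // information literacy
console.log(normaliseTag('#blogjune'));                  // #blogjune
console.log(normaliseTagStrict('#blogjune'));            // blogjune
console.log(normaliseTagStrict('generous interfaces'));  // generousinterfaces
console.log(normaliseTagStrict('C++'));                  // c
```

The last line shows the cost of the stricter version: a tag like “C++” collapses to “c”, which is exactly the kind of damage that makes wholesale stripping risky.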

Categories and tags

Feedparser refers to ‘categories’, but in most blogging systems what it’s actually picking up would usually be referred to as ‘tags’. I decided to collect categories twice - once in the ‘Articles’ collection and once in their own ‘Tags’ collection.¹ My reasoning for doing it this way is that we then have the categories associated with each particular post with the record of that post, but we can also count the cumulative total of posts associated with any particular category. It would be possible to do this on the fly, but creating a separate collection significantly reduces the processing required to find out how many articles include a particular category. Effectively we are caching that query inside a Mongo collection.
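The difference between counting on the fly and keeping a running total can be sketched without Mongo at all. In this plain JavaScript sketch the array and the Map are stand-ins for the two collections, and the names (addArticle, incrementTag, countOnTheFly) are made up for illustration; incrementTag only imitates what an upsert with $inc does.

```javascript
// In-memory stand-ins for the Articles and Tags collections (illustrative only).
const storedArticles = [];
const tagTotals = new Map();

// Imitates an upsert with $inc: start the counter at 1 or add 1 to it.
function incrementTag(tag) {
  tagTotals.set(tag, (tagTotals.get(tag) || 0) + 1);
}

// Store the article with its lowercased categories and update the running totals.
function addArticle(article) {
  const cats = article.categories.map(function (c) { return c.toLowerCase(); });
  storedArticles.push({ link: article.link, categories: cats });
  cats.forEach(incrementTag);
}

// Counting on the fly: scans every stored article each time it is asked.
function countOnTheFly(tag) {
  return storedArticles.filter(function (a) {
    return a.categories.indexOf(tag) !== -1;
  }).length;
}

addArticle({ link: 'http://example.com/a', categories: ['LODLAM', 'JSON'] });
addArticle({ link: 'http://example.com/b', categories: ['lodlam'] });

console.log(countOnTheFly('lodlam'));  // 2
console.log(tagTotals.get('lodlam'));  // 2
```

One thing to watch with a cached counter like this: if the same post is processed twice, upserting the article again is harmless, but the tag counter is incremented a second time, so the cached total and the real count can drift apart.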

Dealing with JSON objects

Once I had a local database with a Tags collection and an Articles collection, I was then able to export both of them out of the Mongo database as JSON objects. Here’s a snapshot of the Articles JSON file:

{"_id":"2G6ScHSF3XjJmuqoA","link":"https://inthemailbox.wordpress.com/2015/03/17/avoiding-binary-oppositions-between-digital-and-analogue/","title":"Avoiding binary oppositions between digital and analogue","author":"inthemailbox","categories":["uncategorized"],"blog":"In the mailbox","blogLink":"https://inthemailbox.wordpress.com","date":{"$date":"2015-03-17T08:42:59.000Z"}}
{"_id":"2KYzrjMZo8nTzY8Zk","link":"http://conaltuohy.com/blog/lod-from-custom-web-api/","title":"Linked Open Data built from a custom web API","author":"Conal","categories":["uncategorized","cidoc-crm","json","linked data","lodlam","proxy","rest","web api","xproc-z"],"blog":"Conal Tuohy’s blog","blogLink":"http://conaltuohy.com","date":{"$date":"2015-09-07T08:51:37.000Z"}}
{"_id":"2a7gWT5DgLprFdxYJ","link":"https://graemeo28librarianbiker.wordpress.com/2015/07/12/soldiers-and-revolutionaries/","title":"Soldiers and revolutionaries","author":"graemeo28","categories":["uncategorized","librarianship"],"blog":"GraemeO28 Librarian and biker","blogLink":"https://graemeo28librarianbiker.wordpress.com","date":{"$date":"2015-07-12T06:35:29.000Z"}}

If you’re familiar with JavaScript, you’ll notice the dates are also objects (in this case, date objects that Mongo exports wrapped around an ISO 8601 timestamp). It was a great relief once I realised that feedparser delivers them as objects and not strings, because it enabled me to write this on the way in to the database:

Articles.upsert({link: item.link}, {$set: {...date: item.pubdate}});

That is, we can push the date object into the database. On the way out, this allows us to sort date ascending (1) or descending (-1):

return Articles.find({},{sort: {date: -1},limit:10});

If the feedparser date was a string, I would have had to parse out every element and reconstruct it into an object in order to sort by date and find the latest ten articles. Ugh.
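Here is a small sketch of why that matters, in plain JavaScript with no Mongo involved. The titles and timestamps come from the snapshot above; latestArticles is a hypothetical helper that does in memory roughly what the sort and limit options do in the query. The two date strings use the RFC 822 style that RSS feeds commonly carry.

```javascript
// Three records using the titles and timestamps from the snapshot above.
const sampleArticles = [
  { title: 'Avoiding binary oppositions between digital and analogue', date: new Date('2015-03-17T08:42:59.000Z') },
  { title: 'Linked Open Data built from a custom web API', date: new Date('2015-09-07T08:51:37.000Z') },
  { title: 'Soldiers and revolutionaries', date: new Date('2015-07-12T06:35:29.000Z') }
];

// Hypothetical helper: newest first, cut off at `limit`.
function latestArticles(list, limit) {
  return list
    .slice() // copy, so the caller's array keeps its order
    .sort(function (a, b) { return b.date - a.date; })
    .slice(0, limit);
}

console.log(latestArticles(sampleArticles, 2).map(function (a) { return a.title; }));
// [ 'Linked Open Data built from a custom web API', 'Soldiers and revolutionaries' ]

// The same two dates as feed-style strings sort alphabetically, not by time.
const asStrings = [
  'Tue, 17 Mar 2015 08:42:59 +0000',
  'Mon, 07 Sep 2015 08:51:37 +0000'
];
const wronglyNewest = asStrings.slice().sort().reverse()[0];
console.log(wronglyNewest); // Tue, 17 Mar 2015 08:42:59 +0000
```

Sorted as text, the March post comes out as the “newest” simply because “Tue” sorts after “Mon”.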

So what’s new?

Historical data

With the exported JSON files for Articles and Tags, I was then able to import those files into the existing database which held the blogs collection. That means we now have over 600 articles and 900 tags in the database.

Future data

The new code adds each article and its tags to the database as it is encountered, before tweeting it out as before. Basically, the code I used to get the full archive of old data has been tweaked slightly to run over each new post as it is encountered, and add the post to the database. The Australian GLAM community now has an index of GLAM-related blog posts, which will grow over time.

Latest posts now available on the website

Whilst I’m excited about having an index of articles that could be used for future analysis, possibly the most important change I made last week is to create a page where the last ten posts are listed. This means that AusGLAMBlogs can now be used not just by Twitter users, but by any interested person with an internet connection. This page is reactive, so as soon as a new article is added, it will appear at the top and the oldest one will roll off.

Browse and search

There are multiple ways to explore the archive.

The Latest posts page lists the last ten posts. Each listing includes a link to the article, but also clickable categories. If you click on the category, you will see all the other posts in the Mongo database that have the same category.

The Search function uses the Meteor easy:search package to index and search over the articles collection. I had a bit of trouble with this, because I wanted to use the MongoTextIndex engine. This was great, but I didn’t initially appreciate that this requires all fields to be text strings. Remember how our date field is an object? Yeah... Since I wanted to be able to sort by date, I changed to the standard MongoDB engine.

The search page works basically how you’d expect - put in a term and if it appears in the title, author, blog name or tags of any particular article, you’ll get a result. The output is exactly the same as the ‘latest’ page, so you can click on the title to go to the post, or click on a category and it will give you a list of all the other posts using that category.
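The matching behaviour described here can be sketched in a few lines of plain JavaScript. This is not how easy:search works internally; matchesTerm and search are hypothetical names, and simple case-insensitive substring matching is an assumption made for the sake of the example. The sample records reuse fields from the snapshot earlier in the post.

```javascript
// Sample records with the fields the search page looks at.
const searchable = [
  { title: 'Avoiding binary oppositions between digital and analogue', author: 'inthemailbox',
    blog: 'In the mailbox', categories: ['uncategorized'] },
  { title: 'Linked Open Data built from a custom web API', author: 'Conal',
    blog: 'Conal Tuohy’s blog', categories: ['uncategorized', 'linked data', 'lodlam'] },
  { title: 'Soldiers and revolutionaries', author: 'graemeo28',
    blog: 'GraemeO28 Librarian and biker', categories: ['uncategorized', 'librarianship'] }
];

// True if the term appears in the title, author, blog name or any category.
function matchesTerm(article, term) {
  const needle = term.toLowerCase();
  const haystack = [article.title, article.author, article.blog]
    .concat(article.categories || []);
  return haystack.some(function (value) {
    return typeof value === 'string' && value.toLowerCase().indexOf(needle) !== -1;
  });
}

// Return the titles of every article matching the term.
function search(articles, term) {
  return articles
    .filter(function (a) { return matchesTerm(a, term); })
    .map(function (a) { return a.title; });
}

console.log(search(searchable, 'linked'));     // [ 'Linked Open Data built from a custom web API' ]
console.log(search(searchable, 'LIBRARIAN'));  // [ 'Soldiers and revolutionaries' ]
```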

Finally, I created a Tags browse page, sorted by frequency. Unsurprisingly, ‘uncategorized’ is the top result - note that these are posts with a category listed as ‘uncategorized’, not posts that don’t have any categories. None of the NLA blog posts, for example, have any categories, but they don’t appear under ‘uncategorized’. Metadata nerds might also note that my naive ‘normalisation’ process doesn’t merge uncategorized and uncategorised.
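Ordering by frequency is straightforward once each tag carries its own total. A minimal sketch in plain JavaScript follows; the totals are invented for illustration, and byFrequency is a hypothetical helper rather than the code behind the page.

```javascript
// Made-up totals, shaped like records in the Tags collection.
const sampleTags = [
  { tag: 'lodlam', total: 4 },
  { tag: 'uncategorized', total: 12 },
  { tag: 'uncategorised', total: 3 },
  { tag: 'blogjune', total: 7 }
];

// Highest total first; ties fall back to alphabetical order.
function byFrequency(tags) {
  return tags.slice().sort(function (a, b) {
    return b.total - a.total || a.tag.localeCompare(b.tag);
  });
}

console.log(byFrequency(sampleTags).map(function (t) { return t.tag; }));
// [ 'uncategorized', 'blogjune', 'lodlam', 'uncategorised' ]
```

Because the two spellings are separate records, they also sit at separate points in the list.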

UI

I also made a few other changes to the user interface. I made a small number of improvements to make the site more responsive, though this still needs some work. I also added a section on the homepage showing a running total of blogs, articles and tags held within the database. Interestingly, GLAM bloggers really like love notes to the future, because there are 150% as many tags as there are articles. Although a less generous explanation could be that we just don’t have a controlled vocabulary for blog post metadata...

What's next?

I still have a few things on my 'todo' list. Indeed, it seems that the more code I write for this project, the more there is to do! On the list at the moment are:

change front page design so it's a grid of 6
fix CSS for better mobile/responsive
mark on listing whether good feed or failing
check feed works before approval
process to add all previous articles and tags on approval
refactor for Meteor 1.3 NPM integration

A couple of these require some explanation. The first two are pretty obvious - the user interface needs some work. The third one is a little more complicated. Whilst I was testing the archives, I discovered some issues with four feeds. What I'd like is a clear indication in the interface if a feed is failing, and ideally some notification to site admins and the actual blogger. This is tied in with the fourth point - when a blog is registered, the system should do a quick check as to whether the 'feed' listed (e.g. myblog.com/rss) actually works. The next one relates to the fact that the historical data was manually imported by me after being collected outside the base app. I need to create a process whereby when you register a new blog, all the historical posts we can find are imported into the database. Finally, Meteor 1.3 brings in some really important changes to do with npm modules (greatly improving the ability to use npm modules within Meteor), and I really should rewrite some of the code to take account of that.
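As a rough idea of what the ‘check feed works before approval’ item could look like, here is a sketch in plain JavaScript. looksLikeFeed is a hypothetical helper, not code from the app. It assumes the response body has already been fetched from the registered URL, and it only checks that the start of the document contains an RSS, Atom or RDF root element; a real check would probably also run the feed through feedparser and confirm that at least one item comes back.

```javascript
// Hypothetical helper: does this response body look like an RSS or Atom feed?
function looksLikeFeed(body) {
  if (typeof body !== 'string') {
    return false;
  }
  // Only inspect the start of the document, where the root element lives.
  const head = body.slice(0, 2000).toLowerCase();
  return /<rss[\s>]/.test(head) ||
         /<feed[\s>]/.test(head) ||
         /<rdf:rdf[\s>]/.test(head);
}

console.log(looksLikeFeed('<?xml version="1.0"?><rss version="2.0"><channel></channel></rss>')); // true
console.log(looksLikeFeed('<feed xmlns="http://www.w3.org/2005/Atom"></feed>'));                 // true
console.log(looksLikeFeed('<!DOCTYPE html><html><body>Not found</body></html>'));                // false
```

A cheap test like this would catch the common case where the registered address returns an ordinary web page or an error page instead of a feed.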

If you want to look at the code (or even improve it and send me a pull request) it’s all on GitHub.

If you'd like to register your blog, or get your friends and colleagues to register their Australian GLAM-themed blogs, please register here.

¹ I’m not entirely sure how I ended up with ‘Articles’ and ‘Tags’ when ‘Posts’ and ‘Categories’ probably makes more sense, but that’s just how it is!