Better out than in

Using web applications to normalise metadata contextually

Sun Apr 28 2019 16:55:35 GMT+1000 (AEST)

Now that VALA Tech Camp is over I have a little more time, and I've made a start on the long-overdue rewrite of the software behind Aus GLAM Blogs. One of the things I noticed quickly after we launched GLAM Blog Club is that despite nearly all the participants being qualified or student Information Management professionals, compliance with the official tag ('GLAM Blog Club' on the post itself and '#GLAMBlogClub' on social media) was patchy at best. I count at least four different variations in the posts ingested into the Aus GLAM Blogs database. I found this mildly surprising, but the process of writing the software, observing how people interact with Blog Club, and now re-writing the software, has made me think more about how we manage metadata in collecting institutions.

My first and only experience of cataloguing in libraries was, ironically, in my very first library position before I was a qualified librarian. Due almost entirely to the fact that at 22 years of age I was at least a decade younger than any of my colleagues, I was given responsibility for purchasing and cataloguing all the compact discs purchased out of the 'teenage collections' budget. It didn't amount to much, but it did allow me to indulge my tastes in electronic and 'alternative' music using ratepayers' money. Having had approximately one hour of cataloguing training, I was one of the worst cataloguers in library history, but my primary problem was that I wanted our catalogue records to be useful to end-users, and the head of cataloguing wanted my records to be standards-compliant. The case that still sticks in my memory was when I was confronted with the Sigur Rós album (). In its original packaging, the album had a removable cover with cutouts of the two parentheses, with the insert completely blank and no title written on the CD itself. I knew that the album was called "()", but the cataloguing boss wouldn't have it, insisting that the catalogue record must list the title as 'untitled'. My protestations that this would make it impossible to find were given short shrift.

Never normalise

I've never stopped thinking about Jarret Drake's talk at the British Columbia Library Association's 2017 meeting since I first read it. In particular, Drake's exhortation to "never normalize" is shocking in its defiance of the norms of library practice. Drake meant it to be so - for us to wake up to the Fascist possibilities of fitting knowledge into easily connected, neat classifications. Drake explicitly called for library and archive workers to resist standardisation of metadata in order to make integration between different systems harder:

Local languages, taxonomies, and other forms of knowledge that only people within specific communities can decipher might well be a form of resistance in a country where a president not only advocates for a Muslim database but also for “a lot of systems… beyond databases.”
Jarret Drake - How libraries can trump the trend to make America hate again

Drake is coming at this from the Archiving tradition, which has always been more interested than librarianship in retaining metadata as it was at the point of accessioning. But this call to 'Never normalise' is both more radical and more progressive than the occasional moves to change 'offensive' Library of Congress Subject Headings.[1] Emily Drabinski gets to the heart of this in her April 2013 Library Quarterly article, Queering the catalogue: Queer Theory and the politics of correction:

... as we attempt to contain entire fields of knowledge or ways of being in accordance with universalizing systems and structures, we invariably cannot account for knowledges or ways of being that are excess to and discursively produced by those systems ... From a queer perspective, critiques of LCC and LCSH that seek to correct them concede the terms of the knowledge organization project: that a universalizing system of organization and naming is possible and desirable.
Emily Drabinski - Queering the catalogue: Queer Theory and the politics of correction

In other words: the problem isn't particular cataloguing terms, but rather the idea that the world can be described using a single, universal ontology. Patrick McKenzie's (in)famous 2010 blog post Falsehoods programmers believe about names describes the problem of metadata normalisation from a different perspective, dispensing with theory to simply describe all the ways humans can be wrong in their assumptions about personally naming other individual humans, assuming only that individual humans are the final arbiters of what their own name(s) is.

All data is cooked

Nick Barrowman reminds us in Issue 56 of The New Atlantis that far from ever being raw, "all data is cooked". If we return to the problem I initially outlined - tags for GLAM Blog Club blog posts - this is evident in several different ways. Firstly, these descriptive tags have been decided upon by the author of each post, for reasons particular to them. Some authors, like Nik McGrath, regularly use a large number of tags representing both the topic of the post and her own relationship to the topic. Nik blogs on Tumblr, where a large number of very specific tags helps to make posts visible to other Tumblr users. When I migrated my blog publishing software to Eleventy, on the other hand, I radically reduced the number of tags I use, because I wanted my tag pages to be meaningful with a reasonable number of posts per topic. Neither of these approaches is 'correct' - they are simply different metadata strategies to suit the needs and functions of each blogging platform and our particular personal tastes. Nik has her recipe and I have mine.

Blogging software also requires or changes topic tags. For example, Eleventy and some other blogging software use tags to distinguish between posts and pages, which means all of my posts have a tag 'post'. This is not particularly meaningful in the context of the Aus GLAM Blogs database, since everything in it is assumed to be a 'post', but it's needed by my system so that the item appears in the RSS feed. Likewise, due to an error in my understanding of the RSS specification for item categories, I initially set up my blogging system to hyphenate tags with spaces - so all my old posts about GLAM Blog Club have a tag of GLAM-Blog-Club. Given my exasperation about the inability of the Australian GLAM community to use a single, specified tag for the GLAM Blog Club, the irony is not lost on me. WordPress also notably creates an 'uncategorized' tag automatically for posts that don't have any tags or categories.

Better out than in

So what to do when designing an interface for searching and browsing blogs from the GLAM community? The approach I've ultimately decided upon is, in some ways, the inverse of a classic library Authority File. I haven't completely taken on Jarret Drake's advice to 'never normalise' because I will continue to downcase tags before ingesting them into the database. But that is the only change the system will make to blog data on the way into the database. Keeping tags intact within the database is important to me - it respects the choices of blog authors, and leaves the data unchanged for any future analysis or usage for reasons other than what I'm using it for. But at the same time, for the purpose Aus GLAM Blogs is designed for, 'system' tags like 'post' and 'uncategorized' are just noise, and glamblogclub, glam blog club and glam-blog-club are obviously equivalent. So rather than normalising and standardising tags on the way in to the database - which is essentially what an 'authority file' amounts to - the system will do some light standardisation on the way out of the database before hitting the search/browse results interface. This leaves the original-recipe tags in the database, whilst reheating them a little for the purposes of search and display.[2]

Most of this process lives in a single if statement:

// normalise tag if there is a tag
if (tag) {
  for (const x in settings.tag_transforms) {
    if (x === tag) { // if tag is in the special tags from settings.tag_transforms
      tag = settings.tag_transforms[x] // replace it with the specified replacement value
    }
  }
  // if tag includes any spaces or punctuation, replace with '.*'
  // this creates something akin to a LIKE search in SQL
  const punctuation = /[\s!@#$%^&*()_=+\\|\]\[\}\{\-?\/\.>,<;:~`'"]/g
  tag = `.*${tag.replace(punctuation, '.*')}.*`
}

The second part of this statement is a light normalisation of tags to effectively ignore most punctuation. This is primarily aimed at merging together things like multi word tag and multi-word-tag, but will also merge 'multi word tag' and multi 'word' tag? and so on. This is done with a simple filter using a regular expression (called 'punctuation'). I'm also trying to make the code re-usable, rather than completely specific to the Aus GLAM Blogs project. So rather than hard-coding things, I've included a couple of settings in a settings.json file:

"tag_transforms" : {
"glamblogclub" : "glam blog club",
"#glamblogclub" : "glam blog club",
"blogjune" : "blog june",
"#blogjune" : "blog june"
},
"filtered_tags" : ["uncategorized", "uncategorised", "post", "page"]

The tag_transforms object is a list of key-value pairs where any tag equal to the left-hand value will be changed to the right-hand value when run through the statement shown earlier. filtered_tags is an array of tags that will be suppressed from all tag views. As you can see, tag_transforms in particular is context-specific - but both can be easily adjusted with any installation of the software to match the needs of a particular blogging community.

tag_transforms is needed at all because the tag.replace method only works if there are spaces or punctuation between words. For a tag like glamblogclub humans who can read English will probably work out that it's equivalent to "glam blog club", but it's very difficult to programmatically identify whether arbitrary strings are a single word or several, and the aim is to keep normalisation as light-touch as possible. tag_transforms allows this to be done in a contextually relevant way dependent on the needs of the community aggregating their blogs.

There is also - as a notorious radical metadata librarian pointed out to me - a difference between the 'glam blog club' tag and other user-generated tags. This tag is mandated by a recognised (in this context) authority: newCardigan, and it is reasonable to assume that the slight variations seen in the wild are intended to match the standard for the purposes of identification by newCardigan, even though they don't actually match it. The Blog Club only exists because it was set up by newCardigan and the tags are only there so that the newCardigan community can associate the post with the Club, so in this case it's reasonable to normalise the tags to that standard.

The practical effect of this is that when you click on a tag in a listed post, if the tag says "#glamblogclub" the browse result will pick up anything that is tagged "glamblogclub", "#glamblogclub", "glam blog club", "'glam blog club'" and so on, treating them all as the same tag:

[Image: GIF of browsing tags in Rockpool software]
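
To make that matching behaviour concrete, here is a minimal sketch rather than the actual Rockpool code. It assumes the pattern built earlier ends up being used as a regular expression against stored tags (the step where the pattern reaches the database isn't shown in this post). The helper names tagToPattern and matchingTags and the sample list of stored tags are made up for illustration.

```javascript
// Hypothetical settings, mirroring part of the settings.json excerpt above
const settings = {
  tag_transforms: {
    'glamblogclub': 'glam blog club',
    '#glamblogclub': 'glam blog club'
  }
}

// Same character class as the punctuation filter shown earlier
const punctuation = /[\s!@#$%^&*()_=+\\|\]\[\}\{\-?\/\.>,<;:~`'"]/g

// Build a loose pattern from the tag a reader clicked on (illustrative helper)
function tagToPattern (tag) {
  const hasTransform = Object.prototype.hasOwnProperty.call(settings.tag_transforms, tag)
  const transformed = hasTransform ? settings.tag_transforms[tag] : tag
  return `.*${transformed.replace(punctuation, '.*')}.*`
}

// Return every stored tag the pattern picks up (assumption: a plain regex match)
function matchingTags (clicked, storedTags) {
  const pattern = new RegExp(tagToPattern(clicked))
  return storedTags.filter(stored => pattern.test(stored))
}

// Invented sample of tags as they might sit, unchanged, in the database
const storedTags = [
  'glamblogclub',
  '#glamblogclub',
  'glam blog club',
  "'glam blog club'",
  'glam-blog-club',
  'blog june'
]

console.log(matchingTags('#glamblogclub', storedTags)) // the five club variants
console.log(matchingTags('glam-blog-club', storedTags)) // the same five
```

One consequence of this design, under the same assumption: because the pattern is unanchored and wrapped in '.*', clicking a short tag such as 'blog' would also pick up 'glam blog club' and 'blog june'. Whether that looseness is welcome probably depends on the community using the software.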

Finally, before displaying tags for each post, we run the tags through a method to filter out anything in the filtered_tags array, and another method to make the listed date relative to the current time (e.g. 'four days ago' - this is another way to leave metadata untouched in the database but display it dynamically for each user in their given context):

x.categories = x.categories.filter(tag => !settings.filtered_tags.includes(tag)) // filter out system tags
x.relativeDate = moment(x.date).fromNow() // add a relative date on the fly
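
The same two steps can also be sketched as a function that returns a display copy and never touches the stored record. This is an illustrative variant, not the Rockpool code: toDisplay is a hypothetical helper, the sample post is invented, and the built-in Intl.RelativeTimeFormat stands in for moment so that the example has no dependencies. It only counts whole days, which is cruder than what moment does.

```javascript
// Hypothetical settings, mirroring the filtered_tags array shown earlier
const settings = {
  filtered_tags: ['uncategorized', 'uncategorised', 'post', 'page']
}

const MS_PER_DAY = 24 * 60 * 60 * 1000

// Return a display copy of a post; the original object is left untouched.
// `now` is a parameter so the relative date can be computed per viewer.
function toDisplay (post, now = new Date()) {
  const days = Math.round((new Date(post.date) - now) / MS_PER_DAY)
  const formatter = new Intl.RelativeTimeFormat('en', { numeric: 'always' })
  return {
    ...post,
    // drop system tags for display only
    categories: post.categories.filter(tag => !settings.filtered_tags.includes(tag)),
    // add a relative date on the fly
    relativeDate: formatter.format(days, 'day')
  }
}

// Invented record, frozen to show that nothing is written back to it
const storedPost = Object.freeze({
  title: 'Example post',
  date: '2019-04-24T00:00:00Z',
  categories: Object.freeze(['post', 'glam blog club', 'metadata'])
})

const shown = toDisplay(storedPost, new Date('2019-04-28T00:00:00Z'))
console.log(shown.categories, shown.relativeDate)
console.log(storedPost.categories) // still includes 'post'
```

Returning a copy is a design choice for the sketch: it makes the "leave the stored metadata alone" rule hold even if the object being displayed is later reused elsewhere.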

None of the processes described here change anything in the database - they are run on the fly and only affect the way the data is displayed. Using the same data, another interface could be designed to display and associate it quite differently. Linked data is supposedly the future answer to these sorts of challenges, but that requires sophisticated and complex markup at the publishing end - pretty unlikely to ever become the norm for self-published material like blogs. These processes to transform tags are not yet in place for the current incarnation of Aus GLAM Blogs, but will appear once I've finished rewriting the software (no promises on when that will be).

So as I stated up the top, this process has made me think a little more about how libraries deal with subject metadata. MARC, Library of Congress Subject Headings, and pretty much every widely-used classification system all ultimately date to and are based on the assumptions of hardcopy catalogues and linear storage. There is no "update dynamically for each viewer" in a card catalogue. Whilst I'm certainly not the first to have considered these issues and have barely scratched the surface here, there needs to be not just a lot more thought about them, but - importantly - some action at the local level. Decades of centralising data in federated catalogues, fiddling about with 'new' standards that are both impractical and fail to solve the core problems, ceding control of terminology to the weirdest library in the world, and deskilling the workforce clearly haven't resulted in a good outcome for library users. Cataloguing isn't some arcane irrelevance, and library catalogues are still the core tool of the trade. If you care about social justice or representation in libraries, you need to care about library metadata and how it is controlled.


  1. I have previously written about the absurdity of any institution other than the United States Library of Congress using LCSH.

  2. Ok I'll stop with the cooking metaphors now.