Identify. Not too much. Mostly collections.

30 July 2017

This week I’ve been using one of the tools I learned about at VALA Tech Camp to clean up membership data as part of a migration we’re doing at MPOW from the ageing Amlib system to Koha ILS. OpenRefine is a powerful data cleaning tool that can be thought of as Excel with regex. As we’ve been working on getting the data migration right, we’ve hit a bit of a snag. In Koha, each discrete part of a member’s address is stored in a different field:

  • street number
  • street address1
  • street address2
  • city (suburb)
  • state
  • postcode

etc...

This is quite typical of databases storing address data. Amlib, however, is a fairly old system with some data architecture that is... interesting. For whatever reason, the address data coming out of Amlib is all stored in a single field. Combine that with multiple possibilities for how addresses can appear, and literally twenty years of data entry into a single free text field, and I’m sure you can imagine how consistent the data is. Working out how to split it out in a sane way that minimises work later has been time-consuming. Part of the problem is having to consider all the possible ways an address may have been entered. Even when library staff have followed procedures correctly, the data is still inconsistent - for example, some records list a phone number first, then the street address, whereas others include no phone number. Consider all of the following possibilities, which are all ‘correct’:

0401 234 567, 1 Main St, Booksville, 3020
0401234567, 1 Main St, Booksville, 3020
1 Main St, Booksville, 3020
90123456, 1 Main St, Booksville, 3020
9012 3456, 1 Main St, Booksville, 3020
9012-3456, 1 Main St, Booksville, 3020

There are thousands of examples of all these types of records within our 130,000 member records. Initially, it looked like these were the major differences. Urban and suburban Australia tends to have very few street numbers above 999, partially because streets often change their name when they hit a new suburb. I wrote a quick regex query in OpenRefine to find every record where the first four characters didn’t include a space, and created a new column with the part before the first comma for records matching that query. That was fine until I realised that “P.O. Box 123” would appear to be a phone number under this rule, so I adjusted to exclude anything with a space or a full stop. That was the easy bit. Addresses aren’t as simple as you might think:

1 Main St, Booksville, 3020
Unit 1, 10 Main St, Booksville, 3020
1/10 Main St, Booksville, 3020
F1/10 Main St, Booksville, 3020
F1 10 Main St, Booksville, 3020
Unit 1, The Mews, 1 Main St, Booksville, 3020
1 Main St, Booksville, Vic, 3020

Welcome to regex hell. After a bit of trial and error, I eventually split out the ‘number’ from each address. There are some edge cases we will need to clean up manually - records where the address information somehow ended up with no commas at all, or was incomplete - but that’s probably about 4,000 out of 130,000, which isn’t so bad. I’ll post something on GitHub at some point with some of the formulas I used to clean the data up for import - for when all you Amlib libraries move over to Koha, amirite?
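In the meantime, here’s a rough sketch in Python of the kind of splitting logic described above. This is a hypothetical rendering of the approach, not the actual OpenRefine/GREL formulas from the migration, and the field names and patterns are illustrative assumptions:

```python
import re

# Hypothetical sketch of the splitting logic - not the actual migration formulas.

STATES = {"VIC", "NSW", "QLD", "SA", "WA", "TAS", "NT", "ACT"}

def split_amlib_address(raw):
    """Split a single free-text Amlib-style address field into rough Koha-style parts."""
    parts = [p.strip() for p in raw.split(",") if p.strip()]
    record = {"phone": "", "streetnumber": "", "address": "",
              "city": "", "state": "", "zipcode": ""}

    # First pass, per the heuristic above: if the first four characters contain
    # neither a space nor a full stop, treat the text before the first comma as
    # a phone number. This catches "0401234567", "9012 3456" and "9012-3456"
    # while leaving "1 Main St" and "P.O. Box 123" alone. The digits-only check
    # is an extra guard (my assumption) so "Unit 1" isn't mistaken for a phone.
    if (parts and not re.search(r"[ .]", parts[0][:4])
            and re.fullmatch(r"[\d\s\-]+", parts[0])):
        record["phone"] = parts.pop(0)

    # A trailing four-digit group is almost certainly an Australian postcode.
    if parts and re.fullmatch(r"\d{4}", parts[-1]):
        record["zipcode"] = parts.pop()

    # Some records spell out the state ("..., Vic, 3020"); most omit it.
    if parts and parts[-1].upper() in STATES:
        record["state"] = parts.pop()

    # Whatever is left is street + suburb; the last remaining chunk is the suburb.
    if len(parts) > 1:
        record["city"] = parts.pop()
    street = ", ".join(parts)

    # Peel a leading unit/street number off the street text: "1", "1/10", "F1/10".
    # Records like "Unit 1, The Mews, 1 Main St" or "F1 10 Main St" won't match
    # and stay whole, joining the manual clean-up pile.
    m = re.match(r"^([A-Za-z]?\d+(?:\s*/\s*\w+)?)\s+(.+)$", street)
    if m:
        record["streetnumber"], record["address"] = m.group(1), m.group(2)
    else:
        record["address"] = street
    return record

if __name__ == "__main__":
    for raw in (
        "0401 234 567, 1 Main St, Booksville, 3020",
        "F1/10 Main St, Booksville, 3020",
        "1 Main St, Booksville, Vic, 3020",
    ):
        print(split_amlib_address(raw))
```

OpenRefine itself does this kind of thing with GREL expressions over a column rather than a standalone script, but the regular expressions are essentially the same.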

A need-to-know basis

Going through this process has helped me to keep top of mind something that all librarians - indeed, everyone working with any database of personal information - need to constantly question:

What data are we storing about people, and do we need to store it?

For example, public libraries generally record a member’s sex or gender (no real distinction is made - it’s usually labelled ‘sex’ but actually means ‘gender’). Why? Do I need to know the gender of a member in order to provide information, advice or assistance? The only real argument I’ve heard for this is that it assists in finding members in the database when they don’t have their membership card, but that seems a fairly weak argument for storing a sometimes intensely personal data point that isn’t always readily ascertained, and can change over time. Of course, most public libraries, in Australia at least, aren’t necessarily able to make decisions like this alone. National, state and local government standards about ‘minimum data sets’ define what must, or at least should, be collected, sometimes seemingly in contradiction of privacy standards. Once we ask whether we need to store certain data at all, however, another question pops up, in some ways just as important.

How are we storing this data about people?

I don’t mean databases vs index cards here. What was frustrating me about migrating user address data was the process of normalising it. Koha wants address data to be chopped into discrete data points - street number, street name, city/suburb, and so on. Amlib just stores it all in one field, so I need to ‘normalise’ the Amlib data to fit Koha’s database model. These questions of course feed into each other. Why you want the data affects how you record it. How you record it affects how it can be used. In the case of postal addresses this is pretty innocuous. The fact that Koha chops it up like this makes it much easier to correctly format postal addresses on library notices, and allows the system to conform to many different postal service standards - whether the street number is listed first, or the state comes before or after the postcode, for example.
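As a trivial illustration - a hypothetical sketch, not Koha’s actual notice templates - the same record can be laid out to suit different postal conventions once the pieces are stored separately:

```python
# Hypothetical example: with discrete fields, a notice can be formatted to
# whatever convention the local postal service expects.
member = {"streetnumber": "1", "address": "Main St",
          "city": "Booksville", "state": "VIC", "zipcode": "3020"}

# Australian-style: street number first, state before the postcode.
print("{streetnumber} {address}\n{city} {state} {zipcode}".format(**member))

# A convention that lists the house number after the street name instead.
print("{address} {streetnumber}\n{zipcode} {city}".format(**member))
```

Doing the same from a single free-text field would mean re-parsing it every time a notice is printed.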

But normalising, by definition, smooths out inconvenient differences in how information is turned into data. Consider the gender data point - the overwhelming majority of systems (and the Australian national government data standards) allow, at most, three options: male, female, or ‘not known’. O’Reilly’s Introduction to SQL book even uses gender as an example of a data point that only has two possible options. Note the assumption here: if someone’s gender is known, then it must be binary - either male or female - so if it is known that you identify as something else, it has to be recorded incorrectly. This is why Tim Sherratt cautioned in A Map and Some Pins that even “open data” needs to be viewed critically and its biases interrogated:

...open data must always, to some extent, be closed. Categories have been determined, data has been normalised, decisions made about what is significant and why. There is power embedded in every CSV file, arguments in every API. This is inevitable. There is no neutral position.

There is no neutral position. This is the case whether we are describing people in our user databases or people within the books that line our shelves. Under pressure from librarians and the ALA, the Library of Congress decided in March 2016 to replace the term “Illegal aliens” with the terms “Noncitizens” and “Unauthorized immigration”. In the middle of a nasty Presidential election campaign, this was inevitably controversial.

When we classify items in our collections, we are deciding as much what terms not to use as we are deciding what terms to use. When Sherratt says we are determining categories, he is also pointing out that we have determined what categories are not used, not appropriate, not valid. When we decide what is significant, we also decide what is insignificant. Every act of classification is an act of erasure. Every ‘finding aid’ also helps to hide things.

Never normalise

Discussions about and changes in how people are described in library collections - whether due to their sexuality and gender, their ethnicity, or their religion - are important, but insufficient. The terms we use to classify people within our collections can affect the broader discourse. But it isn’t just in our collections that we classify, catalogue, and define. Every piece of data recorded about library users is a potential landmine. “People who are Jewish”, “People whose gender identity doesn’t match their biological sex”, and “People who read books about socialism” are all identities that have been a death sentence at various times and places. As former NSA chief General Michael Hayden put it so clearly: “We kill people based on metadata”. If you’re keeping data about users, you need to think about the worst-case scenario, and mitigate it.

Jarrett M Drake, addressing the British Columbia Library Association earlier this year, saw the danger and had a simple piece of advice: “Never normalize”:

...the rising tide of fascism should offer pause regarding the benefits of normalized data that can easily be piped from one system to another. Local languages, taxonomies, and other forms of knowledge that only people within specific communities can decipher might well be a form of resistance in a country where a president not only advocates for a Muslim database but also for “a lot of systems… beyond databases.” In other words, a world of normalized data is a best friend to the surveillance state that deploys the technologies used to further fascist aspirations.

Identities can be exciting, empowering, and comforting. But they can also be stifling, exclusive, or dangerous. An identity can be something you embrace, accept, or create, but it can just as easily be something that is given to or stamped upon you - sometimes literally. Identity is Pride March and St Patrick’s Day, but it’s also the Cronulla riots and Auschwitz tattoos. In libraries, as well as other cultural and memory institutions like archives and museums, we must take care in how we identify objects and people.

In these public institutions there is no neutral position. Every identity is dangerous. Every database is a site of erasure. Every act is a political act.