Well Known

Distributing open library data

11 April 2022

A few months ago I was talking with a relative[1] about my Library Map and he asked where the data comes from and how it is automatically updated. The answer is that it isn't. It's not really plausible even to use a web scraper to periodically check for up-to-date information from libraries, because there are so many unique ways different public libraries communicate information about lending policies. Compiling data on library management systems is easier, because Marshall Breeding has basically done the job for me.

In February Dave Rowe wrote on the Frictionless Data blog about his work with Libraries Hacked and his attempts to standardise UK public library data collection so that it is machine-readable (e.g. ensuring dates are published in a standardised date format). Libraries Hacked was actually one of the inspirations for my own map of Australian libraries. Dave notes the growing interest within the international library profession in open data, including data about libraries themselves, and references IFLA's Statement on Open Library Data. But IFLA's statement, like most writing I've seen on this topic, assumes that data should be collected, aggregated, and published by central government agencies, and then made "open". This increasingly seems a bit back-to-front to me.

Over the last few years I've gained a better understanding of how a website can "passively" provide a great deal of useful information about a service or organisation. This includes writing my own RSS feed, and implementing remote following on BookWyrm. Both of these looked daunting at first glance, but are actually quite simple conceptually.

The second of these uses a concept that would be perfect for my Library Map project specifically, and for simple distributed data publication more generally: Well-known URIs. The first time I encountered this concept was when setting up TLS website certificates using Let's Encrypt. Let's Encrypt uses the ACME protocol to confirm the relationship between a particular server and a domain name. But it wasn't until noodling around with Fediverse applications like Mastodon and BookWyrm that I realised that https://www.example.com/.well-known/ was an official IETF proposed standard, rather than specific to Let's Encrypt. The initial "Request for Comments" was published in 2010, so this is not a hugely new concept, though the current proposed standard (RFC 8615) was published in 2019.
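To make the remote following case concrete, here is a minimal sketch of the kind of lookup Fediverse software performs, using only Python's standard library. The account and domain are made up for illustration; the fixed path and query format come from WebFinger (RFC 7033), which Mastodon and BookWyrm use for this sort of discovery.

import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical account we want to discover information about
account = "acct:alice@example.social"

# WebFinger lives at a fixed, well-known path on every compliant server,
# so we never need to know anything about the site's internal URL structure.
url = "https://example.social/.well-known/webfinger?" + urlencode({"resource": account})

with urlopen(url) as response:
    profile = json.load(response)

# The response lists links for the account, e.g. its ActivityPub actor URL
for link in profile.get("links", []):
    print(link.get("rel"), link.get("href"))

The only thing the client needs to know in advance is the domain: the well-known path does the rest.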

What RFC 8615 enables is both incredibly simple and potentially very powerful. Essentially, all it does is define a standard URI path, registered as a suffix under /.well-known/, that every complying website follows, so that regardless of the domain you can always find particular data in the same place. To provide a more concrete example, let's say IFLA registers a standard with the Internet Assigned Numbers Authority under the suffix librarydata. Then any library service with a website can provide data at https://example.org/.well-known/librarydata. IFLA could define a core data schema, and allow national bodies (e.g. ALIA, CILIP, FEBAB) to create their own additional schemas if desired. Presented as a JSON file, it might look something like this:

"name": "Central Rivers Library Service",
"ifla": {
    "service_points": 3,
    "with_internet_access": 3,
    "staff": 8,
    "volunteers": 20,
    "registered_users": 15000,
    "visits": {
        "2019":90912,
        "2020":40684,
        "2021":61524,
        "2022": 39895
    },
    "physical_collection_size": 15098
}
"alia": {
    "circulation_software": "Koha ILS",
    "default_fine_amount": 0.00,
    "fines": {
        "children": false,
        "young_adults": false,
        "adults": false
    },
    "default_loan_days": 28,
    "mobile_stops": 6,
    "home_delivery": {
        "housebound": true,
        "general": false
    }
}

If every library service used these schemas, keeping a project like the Australian Library Map up to date would be a simple matter of collecting all the relevant website domains, and then periodically checking the well-known library data URL for each of them. Of course, it doesn't have to be IFLA and its member Associations that set this all up. Indeed, I probably wouldn't be happy with the data sets they agree to, and the whole point of this is to decentralise data collection. But for it to work, there does need to be some kind of widespread consensus on the data set and schemas to use, and willingness to implement them.
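As a rough sketch of what that periodic check might look like (the domains, the librarydata suffix, and the field names are all assumptions carried over from the example above), a harvester could be as simple as:

import json
from urllib.request import urlopen

# Hypothetical list of library service domains collected for the map
domains = [
    "centralrivers.example.org",
    "westerndowns.example.org",
]

libraries = {}
for domain in domains:
    # Every service publishes at the same well-known path,
    # so no per-site scraping logic is needed.
    url = f"https://{domain}/.well-known/librarydata"
    try:
        with urlopen(url, timeout=10) as response:
            libraries[domain] = json.load(response)
    except (OSError, ValueError) as error:
        # Network failure or malformed JSON: skip this service for now
        print(f"Could not fetch {url}: {error}")

# e.g. pull out each service's fines policy for a map overlay
for domain, data in libraries.items():
    print(domain, data.get("alia", {}).get("fines"))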

On that last point, the example above might look like I'm suggesting somebody at each library would have to manually publish a webpage regularly. Far from it! A lot of this kind of data is already stored within library management systems. It would mostly be a case of programming those systems to publish certain data in the right format at the given URL. Data on things like loans and fines could then update automatically in real time, and data on things like door counts could be stored within the library management system either automatically through a data feed, or manually on a daily or weekly basis.
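To illustrate how little is involved on the publishing side, here is a minimal sketch using Flask. It is not how any particular library management system works; the librarydata path and field names are the same assumptions as above, and in practice the values would be drawn from the system's own database rather than hard-coded.

from flask import Flask, jsonify

app = Flask(__name__)

def current_statistics():
    # In a real system this would query the library management system's
    # database; hard-coded here purely for illustration.
    return {
        "name": "Central Rivers Library Service",
        "alia": {
            "circulation_software": "Koha ILS",
            "default_fine_amount": 0.00,
            "fines": {"children": False, "young_adults": False, "adults": False},
            "default_loan_days": 28,
        },
    }

@app.route("/.well-known/librarydata")
def librarydata():
    # Serve the current figures as JSON at the agreed well-known path
    return jsonify(current_statistics())

if __name__ == "__main__":
    app.run()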

This is just a little thought experiment for now, but I'd love to talk to people who are interested in taking something like this further - perhaps a project for VALA or ALIA 🤔.


[1] Yes Chris, it's you!