Recursive

It was recently brought to my attention that the little RSS-to-Pocket application I re-wrote earlier this year (Empocketer) has not, in fact, been working. As I looked at the error log and the code, it was a mystery to me how I ever thought it would work. But it did take a day to sort out what exactly the problem was (turns out there were three different bugs).

The first bug manifested in the error logs. Rather naively, I had deliberately created a recursive loop with a timer to check the RSS feeds periodically for updates. The error log informed me that I had consequently created a "stack overflow", though initially I didn't understand why this was happening. Like most people who write software and don't know how to fix a bug, the first thing I did was use a search engine to try to work out what might be causing it. But here I ran into an unexpected dilemma. A typical workflow is to search using the error message and perhaps some keywords to give context, and then click through to the most relevant-looking result, which is almost always a page on the programming question-and-answer site Stack Overflow. As far as a search engine is concerned, searching for python stack overflow is identical to searching for python Stack Overflow, which was not particularly useful in my case.
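The Empocketer code itself isn't reproduced here, so what follows is only a minimal sketch of the general pattern, assuming the application is written in Python (as the search terms suggest). The names `check_feeds_recursive`, `check_feeds_loop` and `poll` are hypothetical. A function that polls, waits, and then calls itself never returns, so every cycle leaves another frame on the call stack; CPython normally reports this as a `RecursionError` once its recursion limit is reached. A plain loop keeps the same schedule at a constant stack depth.

```python
import time


def check_feeds_recursive(poll, interval=0.0):
    """Poll, wait, then call itself again: every cycle adds a stack frame."""
    poll()
    time.sleep(interval)
    check_feeds_recursive(poll, interval)  # never returns, so frames pile up


def check_feeds_loop(poll, interval=0.0, max_cycles=None):
    """Same schedule with a plain loop: stack depth stays constant."""
    cycles = 0
    # max_cycles is a hypothetical knob so the sketch can be stopped;
    # a real poller would presumably run until the process is killed.
    while max_cycles is None or cycles < max_cycles:
        poll()
        time.sleep(interval)
        cycles += 1
    return cycles
```

With a realistic interval the recursive version may run for a long time before failing, which is one way a bug like this can go unnoticed at first.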

Naturally my mind turned to what search operators I might be able to use to clarify what I was looking for, but this turns out to be not quite as simple as you might initially think. I could try

python stack overflow -site:stackoverflow.com

but this would simply mean that none of the results would be from Stack Overflow, which in a way was the opposite of what I wanted. Alternatively,

python stack overflow site:stackoverflow.com

would only return results from Stack Overflow (equivalent to simply using the site's own search tool). This is not particularly useful, because nearly all of those results refer to Stack Overflow the site, rather than stack overflow the computing term.

Essentially what I needed in this situation was a way to distinguish between a phrase and a proper noun. My colleagues who teach university students complex search techniques probably have a way to do this, but I'm not aware of any simple way to get standard web search engines to do it. Neither DuckDuckGo nor Google seems to distinguish between capitalised and non-capitalised terms (which would help a bit but not totally solve the problem). Off the top of my head the ultimate, if unrealistic, solution would probably be some kind of linked open data allowing disambiguation. Something like:
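To illustrate why case sensitivity would help a bit without solving the problem, here is a rough sketch that filters result snippets locally. The snippets and the `mentions_term_not_site` helper are made up for the example; this is not an operator that either search engine offers, as far as I know.

```python
import re

# Lower-case phrase = the computing term; capitalised = (probably) the website.
TERM = re.compile(r"\bstack overflow\b")
SITE = re.compile(r"\bStack Overflow\b")


def mentions_term_not_site(snippet):
    """True if the snippet uses the lower-case phrase and never the proper noun."""
    return bool(TERM.search(snippet)) and not SITE.search(snippet)


# Invented example snippets, standing in for search results.
snippets = [
    "Deep recursion in Python can cause a stack overflow.",
    "Ask your Python question on Stack Overflow.",
    "Stack overflow errors usually mean runaway recursion.",
]

matches = [s for s in snippets if mentions_term_not_site(s)]
```

Only the first snippet gets through. The third is about the computing term too, but its sentence-initial capital means a case-sensitive match misses it, which is exactly the sort of gap that stops this from being a complete solution.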

python AND term:https://www.wikidata.org/wiki/Q13218767

But of course, that would only work if the search engine itself is able to disambiguate the terms within page text in a largely error-free way. This was hard enough in a paper-based world where librarians and other trained information professionals read thousands of texts every year to determine their key subjects. In a computerised world where search engine crawlers read billions of texts every day, even the smartest machine learning algorithm is going to struggle to parse the differing meanings of identical phrases, let alone deal with the fact that language constantly evolves new meanings. Google has attempted to force web publishers to feed it unambiguous metadata in a standard format, but this is basically an impossible task. Jane Bloggs isn't going to fiddle around with references to Wikidata entities when she publishes a blog post about admiring a South African Springbok. So you might see it in your search for the rugby team, or the antelope, or perhaps even the town.
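To make the idea concrete, here is a toy sketch of a search restricted by entity identifier, assuming each page arrives with publisher-supplied entity tags. The `Page` class, the `search` function and the sample pages are all hypothetical. The first identifier is the one from the example query above; the second is a placeholder rather than a real Wikidata entry.

```python
from dataclasses import dataclass, field

# Identifier taken from the example query above; the second is a placeholder.
STACK_OVERFLOW_TERM = "https://www.wikidata.org/wiki/Q13218767"
STACK_OVERFLOW_SITE = "https://example.org/entity/stack-overflow-website"


@dataclass
class Page:
    title: str
    text: str
    entities: set = field(default_factory=set)  # hypothetical publisher-supplied tags


def search(pages, keyword, term=None):
    """Keyword match on text, optionally restricted to pages tagged with `term`."""
    hits = [p for p in pages if keyword.lower() in p.text.lower()]
    if term is not None:
        hits = [p for p in hits if term in p.entities]
    return hits


pages = [
    Page("Recursion limits", "Python recursion and the stack overflow it causes",
         {STACK_OVERFLOW_TERM}),
    Page("Where to ask", "Post your Python question on Stack Overflow",
         {STACK_OVERFLOW_SITE}),
    Page("Untagged post", "My Python script hit a stack overflow"),
]

tagged_hits = search(pages, "python", term=STACK_OVERFLOW_TERM)
```

The entity filter returns only "Recursion limits". The untagged post is just as relevant but disappears from the results entirely, which is the Jane Bloggs problem in miniature: the approach is only as good as the tagging behind it.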

Most of the time we don't notice this type of weakness in keyword searching of "whole texts", because the advantages compared to relying solely on a more limited but controlled set of metadata are usually pretty strong. But there will always be cases like this where the context of both the desired and non-desired meaning of a search term is so similar that a simple keyword string can't easily be used to filter the results usefully.