Looking for Bobby but found Paris instead

Go to any local attraction and you'll see people taking selfies with statues, towers, all sorts of sights. Once, walking past tourists taking pictures of Greyfriars Bobby in Edinburgh, I reckoned that some of those pictures would be posted on Twitter. I should be able to find these, and similar pictures, automatically. How difficult could it be?

Problem: Porn. If you search the raw firehose for tweets containing "selfie", you'll find that about 60% of the results are porn. I suspect the equivalent live search on Twitter is doing some sort of filtering.

Even apart from the problem of porn, there is no guarantee that people will even use the term "selfie", so we need a different kind of filter.

My hypothesis was that the selfies I was after would have one person in the foreground on the bottom left or right, with the object of interest in the background. I coded this up, and left it running for a bit.
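
As a rough illustration of that hypothesis, here is a minimal sketch of the position check in Python. It is not the code I actually ran: it assumes some face detector has already returned bounding boxes, and the `Box` type, the function name and the "outer thirds" threshold are all hypothetical choices made for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Box:
    """Face bounding box in pixels, with the origin at the image's top-left corner."""
    x: float
    y: float
    width: float
    height: float


def looks_like_landmark_selfie(faces: List[Box], image_width: int, image_height: int) -> bool:
    """Return True when exactly one face sits in the bottom-left or bottom-right."""
    # The hypothesis is about a single person in the foreground.
    if len(faces) != 1:
        return False

    face = faces[0]
    centre_x = face.x + face.width / 2
    centre_y = face.y + face.height / 2

    # y grows downwards, so a centre below the midline is in the bottom half.
    in_bottom_half = centre_y > image_height / 2
    # Assumed threshold: the face centre lies in the left or right third,
    # leaving the middle of the frame free for the object of interest.
    in_left_or_right = centre_x < image_width / 3 or centre_x > 2 * image_width / 3

    return in_bottom_half and in_left_or_right
```

A rule this simple only looks at where the face is, not at what surrounds it, so anything with a small face tucked into a lower corner will pass.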

Umm, ok.

I seem to have written a chat app detector. What else fits my description?

TV shows?

Framed quotes, inspirational or otherwise.

And headlines :-/

With the above, it's easy to see how it matched, but some others are a bit more obscure. For each of the following, have a look at the tweet image for a bit before clicking through to the highlighted face.

face

face

face

You can see more of these examples and other types in my GitHub repo.

But did I actually find anything? It took me about an hour of trawling through the thousand or so tweets it had found, but here's a perfect example of what I wanted:

A one-in-a-thousand hit rate is not great, but given this was barely a day's worth of implementation and tweaking, it's not bad. It's also obviously improvable. For example:

  • I'm using the most basic face detector available in OpenIMAJ, and I'm not even filtering that by confidence.
  • I can easily filter by relative size as well, e.g. only include those faces which take up around 20% to 25% of the image.
  • There is probably a text recogniser I can use to exclude images with mostly text outside the face.
  • I can use a more focussed source than the firehose. For example, I could look for geotagged tweets near tourist destinations, or use a larger set of keywords than just "selfie". However, the wide-open nature of the firehose is appealing, so I'd like to see how far I can get with only image analysis.
  • It's really a single-process app, even though I'm using Spark. If I had a bigger sample, and more limiting filters, then perhaps I could find more examples. I'm limited by the built-in firehose throttle, but I could probably scale out the image lookup, as long as that isn't then throttled.
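
The first two ideas in that list, filtering by confidence and by relative size, could look something like the sketch below. It is a guess at the shape of the filter rather than anything I've run: the `Detection` type is hypothetical, the confidence score is assumed to be on a 0 to 1 scale, and the 0.8 cut-off is an arbitrary placeholder. Only the 20% to 25% area range comes from the list above.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Detection:
    """One detected face: size in pixels plus a detector score (assumed 0 to 1)."""
    width: float
    height: float
    confidence: float


def filter_faces(
    detections: Iterable[Detection],
    image_width: int,
    image_height: int,
    min_confidence: float = 0.8,       # arbitrary placeholder threshold
    min_area_fraction: float = 0.20,   # lower bound from the list above
    max_area_fraction: float = 0.25,   # upper bound from the list above
) -> List[Detection]:
    """Keep confident detections whose face covers the wanted share of the image."""
    image_area = image_width * image_height
    if image_area <= 0:
        return []

    kept = []
    for detection in detections:
        area_fraction = (detection.width * detection.height) / image_area
        confident = detection.confidence >= min_confidence
        right_size = min_area_fraction <= area_fraction <= max_area_fraction
        if confident and right_size:
            kept.append(detection)
    return kept
```

Both thresholds would need tuning against real tweets; too tight and the rare good hits disappear along with the chat screenshots.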

This is a work in progress, and I'm quite happy with what I have so far, so might leave this for a bit. Always other projects on the go!

Finally, I'll end with one example I found which doesn't quite fit my original intent but which I like nonetheless: