Today, Jeroen Ooms announced the appearance on CRAN of an R package for language detection, wrapping the “CLD2” compact language detector. Obviously, given a tool like that on a holiday long weekend, my first reaction was to try to confuse it.
Two fun games to play with a language detector:
Find an obviously English sentence (ideally a quote) that it doesn’t recognise as English, and a very non-obviously English sentence that it does
Find two sentences with as few differences as possible, where one is recognised as English and the other not
CLD2 doesn’t recognise the famous telegram about platypuses “Monotremes oviparous, ovum meroblastic” as English, which I suppose is fair enough.
It didn’t recognise Gertrude Stein’s “Rose is a rose is a rose is a rose”, or even the shorter “Rose is a rose is a rose”, though it had no trouble with the start of Joyce’s “Finnegans Wake” or bits of “Howl”. Even better than the Stein, though:
There’s a linguistic discussion of this sort of sentence at Language Log (it’s not usual English in a lot of ways), but I think it’s going to be hard to beat as a false negative.
As a false positive I tried Jabberwocky (English), and then thought of Douglas Hofstadter’s self-referential example sentences
Ok, so how far can the second one be warped and still show up as English?
That’s English, but “See Spot run” isn’t!
For minimal changes: changing “a” to “the”
And as a sort of combination of the two: Chomsky’s obviously-English nonsense sentence “Colorless green ideas sleep furiously” is recognised as English, but so is every permutation of the same words.
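For anyone who wants to repeat the permutation check, here is a minimal sketch of how it could be run. It is written in Python and kept independent of any particular detector: `permutation_verdicts` and `toy_detect` are made-up names for illustration, and `toy_detect` is a placeholder, not CLD2 and not a claim about how CLD2 works. To try it against a real detector, pass in your own function that takes a string and returns a language label.

```python
from itertools import permutations


def permutation_verdicts(sentence, detect):
    """Map every word-order permutation of `sentence` to detect()'s verdict.

    `detect` is a hypothetical stand-in: any callable that takes a string
    and returns a language label (or None when it cannot decide).
    """
    words = sentence.split()
    verdicts = {}
    for order in permutations(words):
        candidate = " ".join(order)
        verdicts[candidate] = detect(candidate)
    return verdicts


def toy_detect(text):
    # Placeholder detector, NOT CLD2: it only checks which words occur,
    # never their order, so it cannot tell a sentence from a shuffle.
    english_words = {"colorless", "green", "ideas", "sleep", "furiously"}
    tokens = text.lower().split()
    hits = sum(token in english_words for token in tokens)
    return "en" if tokens and hits / len(tokens) > 0.5 else None


if __name__ == "__main__":
    verdicts = permutation_verdicts(
        "Colorless green ideas sleep furiously", toy_detect
    )
    print(len(verdicts), "orderings checked")
    print("all labelled English:", all(v == "en" for v in verdicts.values()))
```

Five distinct words give 5! = 120 orderings, so checking all of them is cheap. A detector that ignores word order, like the placeholder here, gives the same answer for every one by construction.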
So, is there a point to this (other than a fun way to waste half an hour)? Well, one of the important things to remember about automated classification algorithms is (as Zeynep Tufekci puts it) how alien they are. They can often imitate human decisions astonishingly well, but they don’t work the same way. If another person makes the same decisions as you, it’s a good bet there are some basically similar reasons underneath. It’s easy to believe the same is true for machines, but it isn’t.