Sunday, April 10, 2011

Language Recognition with gzip

So I'm watching a lecture in CS221, and Prof. Norvig shows a slide with this Unix command on it:
  cat | gzip | wc
and informs us that it can be used to identify what language a text is written in.  Basically the command is "combine some files, zip them up, then measure the size of the resulting zip file.  Jiggawhat?  Yes, there is a thing called "compression-based classification", and it's actually pretty effective.  What happens when you zip a file?  A compression algorithm finds repeated patterns in the file and replaces instances of them with shorter patterns.  The more and longer the repeated patterns, the more the file can be compressed.  What happens when you zip two files together?  If they have similar content, they will compress well.  If they have different content, they won't compress as well.  So, if you have some text and you want to find out what language it is, zip it with example texts for any language it might be.  The language it compresses best with is the language it is mostly likely to be.  If the languages you are considering aren't too similar (e.g. Dutch & German, not Norwegian & Swedish), this technique is actually very reliable.

People have made serious studies of this.  I tried it out informally, and it actually works.  Here's a Perl script that does compression-based classification:
Note the system call in size_when_zipped.  I threw together a "corpus" for each of several languages by picking text from books on Project Gutenberg; English, French, German and Dutch and found the script to work consistently for various texts.

Changing the topic: I rewrote the script in Python and Haskell, just to see how the code would compare.  The translation from Perl to Python was pretty direct.  Again, I use Unix utilities to zip the files together and measure their length.  The Python version is both fewer lines and fewer characters, but only because I used less whitespace and no lines were devoted beginnings and endings of code blocks.  The translation to Haskell was less direct.  I don't have a Unix or GNU system to run Haskell programs on, so, when I wrote the Haskell version,  I had to write functions to perform the globbing and concatenation done by cat and the zipping done by gzip.  It it weren't for that, the Haskell code would have been the smallest.  Haskell blows both Python and Perl away in terms of terseness, without being (IMO) less readable.
I must say that Python is ridiculously easy to code with.  I've only written a few hundred lines of Python, and I already feel like I've got the language figured out.  Still, Perl is more fun.  Probably because in Perl "There's more than one way to do it", and using the ins and outs of the language to find a way to do that is both elegant and readable is an interesting challenge.  With Python "There should be one—and preferably only one—obvious way to do it"  and there are no challenges if you already know how to write conventional C#/Java code.

Thanks to Job Vranish for the lovely definition of bestOf in the Haskell version. The version I wrote was several lines longer.  With Haskell "There is more than one way to do it, the obvious way to do it is with recursion, but you should never use recursion because somebody already wrote higher-order function to do it".