cat | gzip | wcand informs us that it can be used to identify what language a text is written in. Basically the command is "combine some files, zip them up, then measure the size of the resulting zip file. Jiggawhat? Yes, there is a thing called "compression-based classification", and it's actually pretty effective. What happens when you zip a file? A compression algorithm finds repeated patterns in the file and replaces instances of them with shorter patterns. The more and longer the repeated patterns, the more the file can be compressed. What happens when you zip two files together? If they have similar content, they will compress well. If they have different content, they won't compress as well. So, if you have some text and you want to find out what language it is, zip it with example texts for any language it might be. The language it compresses best with is the language it is mostly likely to be. If the languages you are considering aren't too similar (e.g. Dutch & German, not Norwegian & Swedish), this technique is actually very reliable.
People have made serious studies of this. I tried it out informally, and it actually works. Here's a Perl script that does compression-based classification:
- Perl: compare
Changing the topic: I rewrote the script in Python and Haskell, just to see how the code would compare. The translation from Perl to Python was pretty direct. Again, I use Unix utilities to zip the files together and measure their length. The Python version is both fewer lines and fewer characters, but only because I used less whitespace and no lines were devoted beginnings and endings of code blocks. The translation to Haskell was less direct. I don't have a Unix or GNU system to run Haskell programs on, so, when I wrote the Haskell version, I had to write functions to perform the globbing and concatenation done by cat and the zipping done by gzip. It it weren't for that, the Haskell code would have been the smallest. Haskell blows both Python and Perl away in terms of terseness, without being (IMO) less readable.
- Python: compare.py
- Haskell: compare.hs
Thanks to Job Vranish for the lovely definition of bestOf in the Haskell version. The version I wrote was several lines longer. With Haskell "There is more than one way to do it, the obvious way to do it is with recursion, but you should never use recursion because somebody already wrote higher-order function to do it".
4 comments:
Indeed, your bestOf function in Haskell can be simplified to just function composition,
bestOf = maximumBy . (compare `on`)
Even better:
bestOf = maximumBy . comparing
"comparing" is in Data.Ord.
bestOf = maximumBy . comparing ... I really like that one. It even reads well as English. Thanks!
Wait, "maximum by comparing", how else would you get a maximum? And doesn't an ordinary max function find a maximum by comparing? OK; Dave wins for elegance, Job wins for readability.
Post a Comment