Galois engineers write a lot of Haskell (in fact, our technology catalogue is built pretty much entirely on it). We find we’re able to build systems faster, with fewer errors, and in turn are able to apply techniques to increase assurance, helping us deliver value to our clients. We’ve successfully engineered large systems in the language for nearly a decade.

We also use and write a lot of open source Haskell code. Since 2004 we’ve been investing in improving packaging and distribution infrastructure for Haskell code, and since 2007 Galois has been hosting hackage.haskell.org: the central online database of open source Haskell libraries and applications. These packages are built via Cabal (dreamed up by Galois’ own Isaac Potoczny-Jones), and distributed via cabal-install. Hackage now hosts more than 1100 released libraries and tools, and has been growing rapidly (and, incidentally, Galois employees have released or been significant contributors to just shy of 10% of all Hackage projects).

We’ve wondered for a while now just how busy Hackage was becoming, and in turn, what other interesting information about Haskell was buried in the Hackage logs. This post answers those questions for the first time. We’ll see
- Total, and growing, Haskell source downloads
- The most popular Haskell projects hosted on Hackage
- The most popular development categories
- The most popular methods for distributing Haskell source
and speculate a little on where Hackage is heading.
We’ve known for a while that uploads to Hackage were growing. You might have seen this graph elsewhere (it’s derivable from the RSS logs of package uploads):
There’s a pretty clear trend upwards. Average daily Hackage releases have increased 4-fold since Hackage was launched, and Hackage now averages 10 package releases a day. The question is: is anyone using this code?
To measure downloads, we processed the Apache logs for Hackage going back to its launch (incidentally, the log processing script – in Haskell of course – uses the Haskell zlib, bytestring, filepath, containers and bytestring-csv libraries). This generates a two-dimensional map of downloads per project, per month. (You can find links to the raw data at the end of the article.) We can now play with the data to see some interesting trends.

Some cautionary notes on interpreting this data: we only process Hackage package downloads (i.e. “GET binary-0.5..tar.gz” requests). We are only able to measure source downloads – that is, someone downloads a package they will build from source, with GHC. We cannot measure downloads from open source mirrors (such as those provided by the major unix distributions), nor can we measure users of binary packages (such as on Debian, Ubuntu or Fedora), nor can we measure downloads of packages not hosted on Hackage (such as gtk2hs, pugs, darcs or ghc). So this doesn’t represent all Haskell downloads, only downloads of source packages.

We have complete data for Hackage, from the initial alpha tests in July 2006, through to launch in January 2007, until March 2009.
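To make the aggregation step concrete, here is a minimal sketch of the idea. It is not the script we ran (that one is in Haskell); it is written in Python for brevity, and it assumes several things this post does not specify: Apache's common log format, request paths that end in `<package>-<version>.tar.gz`, and that only requests answered with status 200 should count. The names `LOG_LINE`, `TARBALL`, `MONTHS` and `count_downloads` are made up for illustration.

```python
import re
from collections import Counter

# Assumed log layout: ... [14/Feb/2009:10:00:00 +0000] "GET <path> HTTP/1.1" 200 ...
LOG_LINE = re.compile(
    r'\[(?P<day>\d{2})/(?P<mon>[A-Za-z]{3})/(?P<year>\d{4}):[^\]]*\]\s+'
    r'"GET\s+(?P<path>\S+)[^"]*"\s+(?P<status>\d{3})'
)

# Assumed tarball naming: <package>-<version>.tar.gz, version = dotted digits.
# The greedy name match keeps hyphenated package names (e.g. HDBC-sqlite3) intact.
TARBALL = re.compile(r'^(?P<name>.+)-(?P<version>\d+(?:\.\d+)*)\.tar\.gz$')

MONTHS = {name: number for number, name in enumerate(
    "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split(), start=1)}


def count_downloads(lines):
    """Return a Counter mapping (package, 'YYYY-MM') to a download count."""
    counts = Counter()
    for line in lines:
        request = LOG_LINE.search(line)
        if request is None or request.group("status") != "200":
            continue  # not a successful GET request
        filename = request.group("path").rsplit("/", 1)[-1]
        tarball = TARBALL.match(filename)
        month = MONTHS.get(request.group("mon"))
        if tarball is None or month is None:
            continue  # not a package tarball, or an unparseable date
        key = (tarball.group("name"), f"{request.group('year')}-{month:02d}")
        counts[key] += 1  # the version is deliberately ignored
    return counts
```

The result is the two-dimensional map described above, flattened into a dictionary keyed by package and month.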
To begin with, here are the cumulative downloads from Hackage, over time:
As you can see, we’re just shy of 900 thousand package downloads, and from January 2008 to December 2008 – the first complete year of live operation – there were 500 thousand downloads, with a further 150 thousand downloads in the first 2 months of 2009. To visualize the growth trend, here is the same cumulative download line on a logarithmic scale:
In the alpha-testing period from July 2006 to July 2007, downloads grew exponentially (4 orders of magnitude in 4 quarters) as existing developers started to use the system. Since the middle of 2007, the rate of growth has slowed, increasing by an order of magnitude in a 15 month period.
Hackage is on target to reach its 1 millionth download next month. We’ll have a party.
Downloads by Month
Next up is the downloads per month, over time. I dropped a Bézier curve on top to give a sense of the trend. As we only have partial data for March 2009, I’m excluding that. This graph essentially confirms the same trend as we see in the cumulative graphs. Interestingly, download spikes roughly correspond with upload spikes in our first graph (presumably as people scramble to get the new code).
Hackage is currently seeing 100 package downloads an hour, and that value has been doubling every 4 months for the past year and a half.
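As a rough sanity check, the doubling time quoted here can be compared with the order-of-magnitude growth over 15 months seen in the cumulative graph. The sketch below assumes purely exponential growth, which the real data only approximates, and it compares two different quantities (the hourly download rate and the cumulative total), so treat the agreement as a ballpark figure only. The function names are hypothetical, and Python is used purely for illustration.

```python
import math


def doubling_time(growth_factor, months):
    """Months needed to double, if a quantity grows by growth_factor over months."""
    return months * math.log(2) / math.log(growth_factor)


def growth_over(doubling_months, months):
    """Growth factor over months, for a quantity doubling every doubling_months."""
    return 2 ** (months / doubling_months)


# Cumulative total: an order of magnitude in 15 months.
print(round(doubling_time(10, 15), 1))  # 4.5

# Hourly rate: doubling every 4 months, compounded over one year.
print(round(growth_over(4, 12), 1))     # 8.0
```

A doubling time of about 4.5 months for the cumulative total sits close to the 4 months observed for the hourly rate, which is what steady exponential growth would predict.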
Besides overall downloads, there’s a wealth of per-package information. In the following graphs we extract total downloads for each package (ignoring version numbers).
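Collapsing the per-month data into per-package totals is a small step. Here is a sketch in Python, assuming the downloads are held in a dictionary keyed by (package, month); the function names and the sample numbers are invented for illustration and are not taken from the Hackage data.

```python
from collections import Counter


def totals_by_package(downloads):
    """Sum a {(package, month): count} map into {package: total}."""
    totals = Counter()
    for (package, _month), count in downloads.items():
        totals[package] += count
    return totals


def ranked(totals):
    """Packages ordered by descending total; ties are broken by name."""
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))


# Made-up counts, only to show the shape of the data.
example = {
    ("xmonad", "2009-01"): 3,
    ("xmonad", "2009-02"): 5,
    ("binary", "2009-02"): 4,
}
print(ranked(totals_by_package(example)))  # [('xmonad', 8), ('binary', 4)]
```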
The popularity graph displays a classic “long tail”, where the download frequency of any package is inversely proportional to its rank in the frequency table.
This suggests that a good interface to a large Hackage database should behave something like the “long tail” sites like Amazon – where we rely on recommendations and other interlinking to ensure visibility of projects in the tail.
The frequency of downloads is even more visible on a double log scale, where the popularity pretty much matches Zipf’s law:
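One way to put a number on "pretty much matches" is to fit a straight line to log(downloads) against log(rank): under Zipf's law the slope is close to -1. The following is a sketch of that check in Python using an ordinary least-squares fit. It is not how the graph here was produced, and `loglog_slope` is a hypothetical name.

```python
import math


def loglog_slope(counts):
    """Least-squares slope of log(count) against log(rank), rank 1 = most downloaded."""
    ordered = sorted((c for c in counts if c > 0), reverse=True)
    if len(ordered) < 2:
        raise ValueError("need at least two positive counts to fit a slope")
    xs = [math.log(rank) for rank in range(1, len(ordered) + 1)]
    ys = [math.log(count) for count in ordered]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Synthetic, perfectly Zipfian counts (count = 1000 / rank), not real Hackage data.
synthetic = [1000 / rank for rank in range(1, 51)]
slope = loglog_slope(synthetic)  # very close to -1 for this synthetic input
```

On real data a single slope hides curvature, so a fit like this is best read alongside the plot rather than instead of it.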
The download frequency doesn’t quite match the classic distribution at the top and tail of the curve. There are two reasons for this (and maybe other factors at play). Firstly, the tail of the curve drops off, as the bottom 10% of popular packages tend to be the newest packages, and so are under-represented in the “market”.
The other interesting part of the popularity curve is at the top. The top 10 to 50 packages are, to varying degrees, less popular than we might predict. Why is this?
The Distribution Effect
Remember that Hackage is a source-only repository. So it is of primary interest to developers. As a Haskell package becomes more popular, it tends to get picked up by Linux and BSD distributions such as Debian or Ubuntu (and also distributed in binary form for Mac and Windows), removing the need to download the source. Popular packages are doomed to become less popular in source form if the distributions are doing their work!
This is particularly apparent for libraries distributed with GHC (the “extra libs”). Libraries such as containers, arrays, bytestring, parsec and network rarely need to be downloaded in source form, as they’re bundled with GHC forming a platform base. They should thus be under-represented in source downloads.
You can see this “distribution effect” in this overlay of xmonad and its Debian package installs, where source installs decline dramatically as soon as the binary packaging takes off.
Popular packages are doomed to be distributed through other channels.
Most Popular Packages
And here – for the first time – are the per-package popularity statistics for Hackage. First, the top 25 packages sorted by their cumulative total downloads. Executable applications are marked in blue.
And the next 75 Haskell packages, in order:
filepath, X11-xft, alex, happy, vty, cgi, terminfo, unix, GLUT, chunks, fingertree, OpenGL, time, pureMD5, regex-tdfa, xhtml, bzlib, Crypto, syb-with-class, hxt, tagsoup, HDBC, MissingH, SDL, haskell-src-exts, plugins, Stream, frag, curl, pcre-light, unix-compat, uniplate, wxcore, hinstaller, stm, html, Diff, polyparse, leksah, HUnit, hmp3, haskell-src, RJson, fastcgi, pandoc, arrows, YamlReference, parsedate, HGL, GLFW, process, extensible-exceptions, zip-archive, iconv, HDBC-sqlite3, TypeCompose, cpphs, hmatrix, HPDF, HAppS-Server, haskell98, hspread, HAppS-Util, rosezipper, gd, dlist, array, yi-gtk, haskeline, HStringTemplate, HAppS-Data, fgl, haskelldb, xml, cabal-rpm
Congratulations to authors of these packages – you made the top 10% most popular releases.
Most Popular Downloads in February
The previous table looked at the cumulative most popular projects. But that doesn’t necessarily reflect what is popular at the moment to download in source form. This table compares January downloads against February downloads, for the top 25 packages in February:
Most Popular Applications
The 25 most popular Haskell applications hosted on Hackage, by source downloads, are:
xmonad, cabal-install, haddock, xmobar, yi, hscolour, alex, happy, frag, leksah, hmp3, pandoc, cpphs, cabal-rpm, darcs, c2hs, hoogle, lambdabot, cabal2arch, hpodder, monadius, lhs2tex, mkcabal, pugs, ghc-core
Note pugs and darcs have been primarily distributed separately to Hackage, until recently.
Most Popular Libraries by Category
We can also determine the most popular libraries and tools in each semantic category on Hackage:
| Category | Package(s) | Downloads |
|---|---|---|
| Regex | regex-base + regex-posix | 7438 |
| 3D Graphics | GLUT + OpenGL | 7345 |
| 2D Graphics | SDL * | 3016 |
| Web Framework | HAppS-Server * | 1759 |
Note that SDL is listed here because gtk2hs (which provides many cairo-based 2D graphics functions) isn’t distributed on Hackage. Also, HAppS-Server has been superseded by happstack.
These packages didn’t quite take first place, but still have significant user support. I include happstack, although it is only a month old, as it replaces the previously popular HAppS-Server.
| Category | Package(s) | Downloads |
|---|---|---|
| Regex | regex-tdfa, regexpr, pcre-light | 8791 |
Finally, we can speculate on what would happen if the current download growth rate continued for a couple more years (projecting forward 18 months, using the trend of the last 18). We’d reach a cumulative total of 10 million source downloads around the end of 2010 (continuing the order of magnitude growth of the last 18 months).
Of course, a lot is unknown in this scenario. If everyone starts installing all their code via cabal-install, downloads will rocket, and increasing reuse (projects depending on more libraries) will push them up further. However, if the top Haskell applications and libraries become an order of magnitude more popular, the distros will take them up, slowing growth. Growth will also slow if we run out of resources in some form or another: no more easy libraries to bind to, for example, just as we ran out of existing things to cabalize in 2007.
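The back-of-envelope projection above can be reproduced in a few lines. This is only a sketch under the stated assumption that the cumulative total keeps multiplying by ten every 18 months, starting from roughly 900 thousand downloads in March 2009. `months_to_reach` is a hypothetical helper, written in Python for illustration.

```python
import math


def months_to_reach(current, target, factor=10.0, period_months=18.0):
    """Months until current grows to target, if it multiplies by factor every period_months."""
    return period_months * math.log(target / current) / math.log(factor)


months = months_to_reach(900_000, 10_000_000)
print(round(months, 1))  # 18.8
```

Just under 19 months from March 2009 lands in the last quarter of 2010, in line with the estimate of 10 million downloads around the end of 2010. Any of the effects described above would move that date.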
Get the Data Yourself
You can play with the data set yourself here:
You can also get the full data in a sqlite database, courtesy of mmorrow on #haskell.