One Million Haskell Downloads…

Galois engineers write a lot of Haskell (in fact, our technology catalogue is built pretty much entirely on it). We find we’re able to build systems faster, with fewer errors, and in turn are able to apply techniques to increase assurance, helping us deliver value to our clients. We’ve successfully engineered large systems in the language for nearly a decade.

We also use and write a lot of open source Haskell code. Since 2004 we’ve been investing in improving packaging and distribution infrastructure for Haskell code, and since 2007 Galois has been hosting hackage.haskell.org: the central online database of open source Haskell libraries and applications. These packages are built via Cabal (dreamed up by Galois’ own Isaac Potoczny-Jones), and distributed via cabal-install. Hackage now hosts more than 1100 released libraries and tools, and has been growing rapidly (and, incidentally, Galois employees have released or been significant contributors to just shy of 10% of all Hackage projects).

We’ve wondered for a while now just how busy Hackage was becoming, and in turn, what other interesting information about Haskell was buried in the Hackage logs. This post answers those questions for the first time. We’ll see:

  • Total, and growing, Haskell source downloads
  • The most popular Haskell projects hosted on Hackage
  • The most popular development categories
  • The most popular methods for distributing Haskell source

and speculate a little on where Hackage is heading.

Background

We’ve known for a while that uploads to Hackage were growing. You might have seen this graph elsewhere (it’s derivable from the RSS logs of package uploads):

There’s a pretty clear trend upwards. Average daily releases have increased 4-fold since Hackage was launched, and the site now averages 10 package releases a day. The question is: was anyone using this code?

Measuring Downloads

To measure downloads, we processed the Apache logs for Hackage going back to its launch (incidentally, the log processing script – in Haskell of course – uses the Haskell zlib, bytestring, filepath, containers, bytestring-csv libraries). This generates a two dimensional map of downloads per project, per month. (You can find links to the raw data at the end of the article.) We can now play with the data to see some interesting trends.

Some cautionary notes interpreting this data: we only process Hackage package downloads (i.e. “GET binary-0.5..tar.gz” requests). We are only able to measure source downloads – that is, someone downloads a package they will build from source, with GHC. We cannot measure downloads from open source mirrors (such as those provided by the major Unix distributions), nor can we measure users of binary packages (such as on Debian, Ubuntu or Fedora), nor can we measure downloads of packages not hosted on Hackage (such as gtk2hs, pugs, darcs or ghc). So this doesn’t represent all Haskell downloads, only downloads of source packages.

We have complete data for Hackage, from the initial alpha tests in July 2006, through to launch in January 2007, until March 2009.
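To give a flavour of the aggregation step, here is a minimal sketch, not the actual script. It assumes the log lines have already been reduced to (month, requested file) pairs; the names `packageName` and `downloadsByMonth`, the "YYYY-MM" month format, and the tarball naming scheme `<package>-<version>.tar.gz` are all assumptions made for illustration.

```haskell
import qualified Data.Map.Strict as Map
import Data.Char (isDigit)
import Data.List (isSuffixOf)

type Package = String
type Month   = String   -- assumed format, e.g. "2009-02"

-- | Strip the ".tar.gz" suffix and the trailing "-<version>" from a
--   tarball name. Returns Nothing for anything that does not look
--   like a package tarball.
packageName :: String -> Maybe Package
packageName file
  | ".tar.gz" `isSuffixOf` file
  , (revVersion, '-' : revName) <- break (== '-') (reverse base)
  , not (null revVersion)
  , not (null revName)
  , all isVersionChar revVersion
  = Just (reverse revName)
  | otherwise = Nothing
  where
    base = take (length file - length ".tar.gz") file
    isVersionChar c = isDigit c || c == '.'

-- | Count downloads per (package, month), ignoring version numbers
--   and skipping requests that are not package tarballs.
downloadsByMonth :: [(Month, String)] -> Map.Map (Package, Month) Int
downloadsByMonth reqs =
  Map.fromListWith (+)
    [ ((pkg, month), 1)
    | (month, file) <- reqs
    , Just pkg <- [packageName file]
    ]
```

Splitting on the last hyphen keeps hyphenated package names such as utf8-string intact, since only the final component is treated as the version.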

Total Downloads

To begin with, here’s the cumulative downloads from Hackage, over time:

As you can see, we’re just shy of 900 thousand package downloads, and from January 2008 to December 2008 – the first complete year of live operation – there were 500 thousand downloads, with a further 150 thousand downloads in the first 2 months of 2009. To visualize the growth trend, here is the same cumulative download line on a logarithmic scale:

In the alpha-testing period from July 2006 to July 2007, downloads grew exponentially (4 orders of magnitude in 4 quarters) as existing developers started to use the system. Since the middle of 2007, the rate of growth has slowed, increasing by an order of magnitude in a 15 month period.
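The two views above come down to a running total and a base-10 logarithm. A small sketch, assuming the monthly totals are available as a plain list (the function names are made up for illustration):

```haskell
-- | Running total of a monthly download series.
cumulative :: [Int] -> [Int]
cumulative = scanl1 (+)

-- | Orders of magnitude (base 10) between two cumulative totals.
--   Both arguments are assumed to be positive.
ordersOfMagnitude :: Int -> Int -> Double
ordersOfMagnitude from to =
  logBase 10 (fromIntegral to / fromIntegral from)
```

For example, `cumulative [10, 90, 900]` gives `[10,100,1000]`, a series that climbs one order of magnitude per step and so appears as a straight line on a logarithmic scale.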

Hackage is on target to reach its 1 millionth download next month. We’ll have a party.

Downloads by Month

Next up is the downloads per month, over time. I dropped a Bézier curve on top to give a sense of the trend. As we only have partial data for March 2009, I’m excluding that. This graph essentially confirms the same trend as we see in the cumulative graphs. Interestingly, download spikes roughly correspond with upload spikes in our first graph (presumably as people scramble to get the new code).

 

Hackage is currently seeing 100 package downloads an hour, and that value has been doubling every 4 months for the past year and a half.

Package Popularity

Besides overall downloads, there’s a wealth of per-package information. In the following graphs we extract total downloads for each package (ignoring version numbers).
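Collapsing the per-month table into per-package totals is a one-liner with Data.Map. This is a sketch that assumes the table is keyed by (package, month); that key layout and the name `totalsPerPackage` are illustrative assumptions.

```haskell
import qualified Data.Map.Strict as Map

type Package = String
type Month   = String

-- | Collapse a (package, month) table into one total per package
--   by dropping the month from the key and summing the collisions.
totalsPerPackage :: Map.Map (Package, Month) Int -> Map.Map Package Int
totalsPerPackage = Map.mapKeysWith (+) fst
```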

The popularity graph displays a classic “long tail”, where the download frequency of any package is inversely proportional to its rank in the frequency table.

This suggests that a good interface to a large Hackage database should behave something like “long tail” sites such as Amazon, which rely on recommendations and other interlinking to keep projects in the tail visible.

The frequency of downloads is even more visible on a double log scale, where the popularity pretty much matches Zipf’s law:

The download frequency doesn’t quite match the classic distribution at the top and tail of the curve. There are two reasons for this (and maybe other factors at play). Firstly, the tail of the curve drops off, as the least popular 10% of packages tend to be the newest packages, and so are under-represented in the “market”.
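One way to check how closely a list of download counts follows Zipf’s law is to compare it against the ideal curve and to fit a straight line on the double log scale. The sketch below assumes the counts are already sorted from most to least popular; the function names are made up for illustration, and this is not necessarily how the graphs here were produced.

```haskell
-- | Ideal Zipf counts for each rank, scaled so that rank 1 matches
--   the most popular package: count(r) = count(1) / r.
zipfPredicted :: [Double] -> [Double]
zipfPredicted []               = []
zipfPredicted counts@(top : _) = [ top / r | (r, _) <- zip [1 ..] counts ]

-- | Least-squares slope of log(count) against log(rank).
--   A value close to -1 indicates a Zipf-like distribution.
--   Needs at least two positive counts.
logLogSlope :: [Double] -> Double
logLogSlope counts =
  sum [ (x - mx) * (y - my) | (x, y) <- zip xs ys ]
    / sum [ (x - mx) * (x - mx) | x <- xs ]
  where
    xs = map log (take (length counts) [1 ..])
    ys = map log counts
    mx = mean xs
    my = mean ys
    mean vs = sum vs / fromIntegral (length vs)
```

Packages whose real count falls below the `zipfPredicted` value at their rank are the ones that are less popular in source form than the law would suggest.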

The other interesting part of the popularity curve is at the top. The top 10 to 50 packages are, to varying degrees, less popular than we might predict. Why is this?

The Distribution Effect

Remember that Hackage is a source-only repository. So it is of primary interest to developers. As a Haskell package becomes more popular, it tends to get picked up by Linux and BSD distributions such as Debian or Ubuntu (and also distributed in binary form for Mac and Windows), removing the need to download the source. Popular packages are doomed to become less popular in source form if the distributions are doing their work!

This is particularly apparent for libraries distributed with GHC (the “extra libs”). Libraries such as containers, array, bytestring, parsec and network rarely need to be downloaded in source form, as they’re bundled with GHC, forming a platform base. They should thus be under-represented in source downloads.

You can see this “distribution effect” in this overlay of xmonad and its Debian package installs, where source installs decline dramatically as soon as the binary packaging takes off.

Popular packages are doomed to be distributed through other channels.

Most Popular Packages

And here – for the first time – are the per-package popularity statistics for Hackage. First, the top 25 packages sorted by their cumulative total downloads. Executable applications are marked in blue.

1 xmonad 35428
2 HTTP 26203
3 zlib 24431
4 Cabal 23691
5 X11 21563
6 binary 15752
7 utf8-string 12633
8 mtl 12517
9 cabal-install 12274
10 regex-posix 11351
11 X11-extras 10509
12 xmonad-contrib 9794
13 haddock 9209
14 parsec 8468
15 bytestring 7473
16 regex-base 7438
17 HaXml 6307
18 network 6285
19 xmobar 6272
20 yi 6268
21 hscolour 6264
22 QuickCheck 5697
23 hslogger 5434
24 regex-compat 5266
25 ghc-paths 4653

And the next 75 Haskell packages, in order:

filepath, X11-xft, alex, happy, vty, cgi, terminfo, unix, GLUT, chunks, fingertree, OpenGL, time, pureMD5, regex-tdfa, xhtml, bzlib, Crypto, syb-with-class, hxt, tagsoup, HDBC, MissingH, SDL, haskell-src-exts, plugins, Stream, frag, curl, pcre-light, unix-compat, uniplate, wxcore, hinstaller, stm, html, Diff, polyparse, leksah, HUnit, hmp3, haskell-src, RJson, fastcgi, pandoc, arrows, YamlReference, parsedate, HGL, GLFW, process, extensible-exceptions, zip-archive, iconv, HDBC-sqlite3, TypeCompose, cpphs, hmatrix, HPDF, HAppS-Server, haskell98, hspread, HAppS-Util, rosezipper, gd, dlist, array, yi-gtk, haskeline, HStringTemplate, HAppS-Data, fgl, haskelldb, xml, cabal-rpm

Congratulations to authors of these packages – you made the top 10% most popular releases.

Most Popular Downloads in February

The previous table looked at the cumulative most popular projects. But that doesn’t necessarily reflect what is popular at the moment to download in source form. This table compares January downloads against February downloads, for the top 25 packages in February:

Package Downloads Rank Change
HTTP 2926  
zlib 2345  
Cabal 2148  
cabal-install 1490  
utf8-string 1352  
xmonad 1280  
binary 1174  
regex-posix 901 +8
parsec 842 +2
X11 834 -1
xmonad-contrib 754 +1
hscolour 739 -2
terminfo 713 +1
haddock 669 -6
ghc-paths 630 +2
HaXml 600 +12
extensible-exceptions 596 +16
QuickCheck 584 +9
regex-base 558 -4
time 529 -1
darcs 501 1
leksah 500 +18
regex-tdfa 496 +24
hslogger 441 -2
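The rank change column can be derived from two monthly totals. A minimal sketch, assuming each month is a map from package name to download count; the function names are illustrative, and ties are broken alphabetically, which may differ from the table above.

```haskell
import qualified Data.Map.Strict as Map
import Data.List (sortBy)
import Data.Ord (comparing, Down (..))

-- | Packages ordered by downloads, most downloaded first.
byDownloads :: Map.Map String Int -> [(String, Int)]
byDownloads = sortBy (comparing (Down . snd)) . Map.toList

-- | Rank of each package within a month (1 = most downloaded).
rankOf :: Map.Map String Int -> Map.Map String Int
rankOf m = Map.fromList (zip (map fst (byDownloads m)) [1 ..])

-- | For each package in the current month: its downloads and how many
--   places it moved since the previous month. Positive means it moved
--   up; Nothing means it had no downloads in the previous month.
rankChanges :: Map.Map String Int -> Map.Map String Int
            -> [(String, Int, Maybe Int)]
rankChanges prev cur =
  [ (pkg, n, fmap (subtract new) (Map.lookup pkg prevRank))
  | ((pkg, n), new) <- zip (byDownloads cur) [1 ..]
  ]
  where
    prevRank = rankOf prev
```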

 

Most Popular Applications

The 25 most popular Haskell applications hosted on Hackage, ranked by source downloads, are:

xmonad, cabal-install, haddock, xmobar, yi, hscolour, alex, happy, frag, leksah, hmp3, pandoc, cpphs, cabal-rpm, darcs, c2hs, hoogle, lambdabot, cabal2arch, hpodder, monadius, lhs2tex, mkcabal, pugs, ghc-core

Note that until recently, pugs and darcs were distributed primarily outside Hackage.

Most Popular Libraries by Category

We can also determine the most popular libraries and tools in each semantic category on Hackage:

Task Library Downloads
     
Client-side HTTP HTTP 26203
Database HDBC 3098
XML HaXml 6307
Control mtl 12517
Parsing parsec 8468
Binary Parsing binary 15752
Logging hslogger 5434
Testing QuickCheck 5697
Regex regex-base + regex-posix 7438
Lexing alex 4360
Codec zlib 24431
Unicode IO utf8-string 12633
Sockets network 6285
Build System Cabal 23691
Documentation haddock 9209
Syntax hscolour 6264
3D Graphics GLUT + OpenGL 7345
2D Graphics SDL * 3016
Hashing pureMD5 3460
HTML xhtml 3391
Cryptography Crypto 3243
Generics syb-with-class 3230
IDE leksah 2408
JSON RJson 2222
Markup pandoc 2210
Numerics hmatrix 1844
Web Framework HAppS-Server * 1759
Graphs fgl 1658
Parallelism parallel 1370
Charting chart 1300
Code generation llvm 970
RSS feed 726
Wiki gitit 759

Note that SDL is listed here because gtk2hs (which provides many cairo-based 2D graphics functions) isn’t distributed on Hackage. Also, HAppS-Server has been superseded by happstack.

Honorable Mentions

These packages didn’t quite take first place, but still have significant user support. I include happstack, although it is only a month old, as it replaces the previously popular HAppS-Server.

Parsing happy, polyparse 6685
Regex regex-tdfa, regexpr, pcre-light 8791
Codec bzlib 3275
XML hxt, xml 4871
HTML tagsoup 3141
Client-side HTTP curl 2678
Generics uniplate 2675
2D Graphics wxcore 2601
Testing HUnit 2325
Database haskelldb, hsql 3164
Network network-bytestring 1210
Control monadLib 1081
Web Framework happstack 549

 

Future

Finally, we can speculate on what would happen if the current download growth rate continued for a couple more years (projecting forward 18 months, using the trend of the last 18). We’d reach a cumulative total of 10 million source downloads around the end of 2010 (continuing the order of magnitude growth of the last 18 months).
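The projection is simple exponential arithmetic. A sketch, assuming growth continues at a fixed factor per period (here, tenfold every 18 months); the function name and argument order are made up for illustration.

```haskell
-- | Months needed to grow from the current total to a target total,
--   assuming the total multiplies by `factor` every `period` months.
monthsToReach :: Double  -- ^ growth factor per period, e.g. 10
              -> Double  -- ^ length of the period in months, e.g. 18
              -> Double  -- ^ current cumulative total
              -> Double  -- ^ target cumulative total
              -> Double
monthsToReach factor period current target =
  period * logBase factor (target / current)
```

Starting from one million downloads, `monthsToReach 10 18 1e6 1e7` comes out at 18 months, which is where the estimate of 10 million in the second half of 2010 comes from.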

Of course, a lot is unknown in this scenario. If everyone starts installing all their code via cabal-install, downloads will rocket, and they will rise further as code reuse increases and projects depend on more libraries. However, if the top Haskell applications and libraries become an order of magnitude more popular, the distros will take them up, slowing growth. Growth will also slow if we run out of resources in some form or another: no more easy libraries to bind to, for example, just as we ran out of existing things to cabalize in 2007.

Get the Data Yourself

You can play with the data set yourself here:

  • Monthly downloads per package (CSV, HTML)
  • Packages by download frequency (CSV, HTML)

You can also get the full data in a sqlite database, courtesy of mmorrow on #haskell.