Counting is a funny business. Some tribal languages count just one, two and then many. For most of our history we had to live with human counts, with all their concomitant problems. Fortunately at the height of the industrial revolution, a few people hit on the idea of using a machine to tabulate data. IBM went on to pioneer many concepts behind modern computing. Today we have very powerful computers that are, among other things, very good at counting.
However, the rise of massive data streams has complicated our calculus. What happens when we need to count through millions of packets streaming through an Ethernet switch? Or in Shareaholic’s case, how do we compute the number of page views a web page receives over a stream that comprises many thousands of URLs per second? Maintaining a count of every item exactly is very expensive.
We solve this problem with a count-min sketch. In exchange for some accuracy, we drastically reduce the computational requirements of real time counting over an enormous stream of data. This lets us surface the heavy-hitters among our stream of pageview data.