Counting is a funny business. Some tribal languages count just one, two and then many. For most of our history we had to live with human counts, with all their concomitant problems. Fortunately at the height of the industrial revolution, a few people hit on the idea of using a machine to tabulate data. IBM went on to pioneer many concepts behind modern computing. Today we have very powerful computers that are, among other things, very good at counting.
However, the rise of massive data streams has complicated our calculus. What happens when we need to count through millions of packets streaming through an Ethernet switch? Or in Shareaholic’s case, how do we compute the number of page views a web page receives over a stream that comprises many thousands of URLs per second? Maintaining a count of every item exactly is very expensive.
We solve this problem with a count-min sketch. In exchange for some accuracy, we drastically reduce the computational requirements of real time counting over an enormous stream of data. This lets us surface the heavy-hitters among our stream of pageview data.
The Count-min Sketch
Let’s begin by situating our problem in mathematics. We have a domain $U$ of items we want to count over. In the context of counting page views, an item $i\in U$ is a web page, and $c_i$ is the number of times page $i$ has been viewed. These pairs of items and their counts form a multiset $M$ that we track.
We could store the multiset $M$ explicitly, using an associative array that maps every $i$ as a key to its corresponding count $c_i$. But even with modern computational tools, storing millions or billions of individual mappings is expensive. For Shareaholic, it would require substantial computing resources to meet the outwardly simple goal of maintaining page view counts for millions of unique URLs. If all we need to do is maintain a list of the most-viewed pages, we need not be so precise.
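For contrast, the exact approach is nothing more than an associative array keyed by URL. A toy version (illustrative only, not Shareaholic’s actual pipeline) might look like the snippet below; it is simple, but its memory grows with the number of distinct pages.

```python
from collections import Counter

# Exact counting: one entry per distinct URL, so millions of unique
# pages mean millions of keys held in memory.
exact_counts = Counter()

def record_view(url):
    exact_counts[url] += 1
```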
Enter the count-min sketch. Rather than store an enormous set of mappings, we exploit the properties of pairwise-independent hash functions — namely that they do not collide too frequently — to drastically reduce the memory requirements of computing over such an enormous stream of data.
At a high level, we choose two small parameters, $\delta$ and $\varepsilon$, that control the precision of our sketch (more on these later). Remember, we sacrifice some precision in exchange for the sketch’s small computational footprint. Our choice of $\delta$ and $\varepsilon$ informs our creation of a matrix, $X$, and a corresponding set of pairwise-independent hash functions, $H$. These two components, along with update and query methods, form our sketch. Since the matrix $X$ and hash functions $H$ are typically much smaller than an explicit representation of $M$, we achieve huge memory savings.
More specifically, we use $\varepsilon$ and $\delta$ to choose $m=\lceil \ln{1/\delta} \rceil$ and $n=\lceil e / \varepsilon \rceil$, the dimensions of our matrix $X$. Intuitively, as we make $\delta$ and $\varepsilon$ smaller (thus upping the precision of the sketch), we have to store something closer to an explicit representation of $M$ by making $X$ larger. Because we begin counting having seen nothing, we initialize the entries of $X$ to zero. We then choose $m$ functions at random from a family of pairwise-independent hash functions to form

$$H = \{h_1, h_2, \ldots, h_m\},$$

where, for a given $h_j\in H$,

$$h_j : U \rightarrow \{1, 2, \ldots, n\}.$$
In words, each of the $m$ hash functions $h_j\in H$ corresponds to a row of $X$, and $h_j$ maps $i\in U$ (a web page) to one of the $n$ columns of $X$.
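To make the sizes concrete: with the values used later in this post, $\delta=10^{-7}$ and $\varepsilon=0.005$, we get $m=\lceil\ln 10^7\rceil=17$ rows and $n=\lceil e/0.005\rceil=544$ columns, fewer than ten thousand counters no matter how many distinct URLs we see. A rough Python sketch of this setup (the variable names are mine, and the $((ax+b)\bmod p)\bmod n$ construction is one standard way to draw a pairwise-independent family):

```python
import math
import random

delta, epsilon = 1e-7, 0.005
m = math.ceil(math.log(1 / delta))   # ceil(ln(1/delta)) -> 17 rows
n = math.ceil(math.e / epsilon)      # ceil(e/epsilon)   -> 544 columns

# The matrix X, initialized to zero because we have counted nothing yet.
X = [[0] * n for _ in range(m)]

# One hash function per row, drawn from the family
# h(x) = ((a*x + b) mod p) mod n, with p prime and a, b chosen at random.
p = 2 ** 61 - 1
H = [(random.randrange(1, p), random.randrange(0, p)) for _ in range(m)]

def h(j, item):
    """h_j(item): the column of X that row j assigns to this item."""
    a, b = H[j]
    # hash() is fine within a single process; a stable digest (e.g. hashlib)
    # would be needed if counts must survive restarts.
    return ((a * hash(item) + b) % p) % n
```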
When someone visits a page and we need to update our count for item $i$ with update $u_i\in\mathbb{N}$, we update according to

$$X_{j,k} \leftarrow X_{j,k} + u_i$$
for each row $j$ and its hash function $h_j\in H$, and the corresponding $k=h_j(i)$.
In words, the $j$th row of matrix $X$ receives an update by taking its corresponding hash function $h_j$, applying it to the item $i$ to obtain $k=h_j(i)$, which corresponds to a column of the matrix $X$, and then updating the $(j, k)$th position of $X$ by adding $u_i$.
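Continuing the little Python sketch from above (same $X$, $H$, and $h$), the update is one addition per row:

```python
def update(item, u=1):
    """Apply an update u_i for item i: add u to X[j][h_j(i)] in every row j."""
    for j in range(m):
        X[j][h(j, item)] += u
```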
To obtain the approximate count $\hat{c_i}$ for item $i$, we take the minimum of the estimates of $c_i$ from every row of $X$. This is essentially the update procedure in reverse. For a given item $i$, we obtain its estimate from the $j$th row of $X$ by using its corresponding hash function $h_j$ to obtain $k=h_j(i)$, and reading the value at position $j, k$ of $X$.
In the context of our page view counting, for a given row of $X$ we take its corresponding hash function, and then find the column to look up by applying the hash function to the page’s URL. Doing this over all the rows of $X$ gives us a set of estimates,

$$S = \{\hat{c}_{i,1}, \hat{c}_{i,2}, \ldots, \hat{c}_{i,m}\},$$
where $\hat{c}_{i,j}$ represents the estimated value for item $i$ with respect to row $j$ (or hash function $h_j$). We finally choose our best estimate by letting $\hat{c}_i=\min{S}$.
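Continuing the same sketch, the query walks the rows exactly as the update does, but reads instead of writes and then takes the minimum (the URL below is just a placeholder; any hashable item works):

```python
def estimate(item):
    """The count-min estimate: the minimum over rows j of X[j][h_j(item)]."""
    return min(X[j][h(j, item)] for j in range(m))

update("https://example.com/some-page")           # one page view
update("https://example.com/some-page", 4)        # four more views
print(estimate("https://example.com/some-page"))  # 5 here; in general >= the true count
```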
Why Does This Work?
The minimum of $S$ represents the best estimate of $c_i$. Because the range of each of the hash functions is much smaller than $|U|$ by design, we expect to have some hashing collisions. At the very least, the estimate $\hat{c}_i$ will include every update made to item $i$, though $\hat{c}_i$ will likely be higher because other items will hash to the same column. Thus, the estimate for a given item will satisfy $c_i\leq\hat{c}_i$. By taking the minimum estimate, then, we obtain the closest estimate possible of $c_i$.
That we get some notion of a “good” estimate requires a bit more work. But we can, in fact, put a specific bound on the precision of $\hat{c}_i$. I omit the proof here, but I will vouch for it (it appears in the Cormode and Muthukrishnan paper referenced below). By defining $m$ and $n$ as we did above, we can say that

$$\Pr\left[\hat{c}_i \leq c_i + \varepsilon Z\right] \geq 1-\delta,$$

where $Z$ is the total count over all items.
In words, for the $\delta$ we choose, we know that the probability of a given estimate $\hat{c}_i$ not exceeding the true value $c_i$ by more than $\varepsilon$ of the total count $Z$ is at least $1-\delta$. Since we typically choose very small values for $\delta$ and $\varepsilon$, e.g. $\delta=10^{-7}$ and $\varepsilon=0.005$, we can be reasonably confident in our estimates.
Knowing this gives us insight into what count-min sketches are good for, and what they are not so great for. Because the precision of the estimates is bounded in terms of the entire count of the system, counts for items with small true counts will tend to be more corrupted. This makes a count-min sketch useful for keeping track of which web pages receive the most traffic, but not for tracking every page across our network.
Looking at this in the context of the page view problem, suppose we see 10 million page views every day. That makes $Z=10^7$ in the theorem above. If we set up a count-min sketch with $\varepsilon=0.005$ (a value we have used in production), our probable upper bound on error is $(0.005)(10^7)=50{,}000$. In practice, the sketch will be more precise than that; the value $\varepsilon Z=50{,}000$ is an upper bound, after all. But it gets at the problem: the error is absolute, not relative to each page’s true count. Say 100 pages with two views each, out of 1 million different pages, all hash to the same cells. Our sketch would report each of those pages as having 200 views, rather than just two. Adding those 200 views to a page that has already received 200,000 views makes much less of a difference.
An Example in Python
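The listing that originally appeared here is not reproduced, but a minimal, self-contained count-min sketch along the lines described above might look like this. The class name, method names, and example URLs are mine, not Shareaholic’s production code.

```python
import hashlib
import math
import random


class CountMinSketch:
    """A minimal count-min sketch: an m-by-n table of counters plus one
    pairwise-independent hash function per row."""

    _PRIME = 2 ** 61 - 1  # a large Mersenne prime for the hash family

    def __init__(self, delta=1e-7, epsilon=0.005):
        if not (0 < delta < 1 and 0 < epsilon < 1):
            raise ValueError("delta and epsilon must lie in (0, 1)")
        self.m = math.ceil(math.log(1 / delta))  # rows
        self.n = math.ceil(math.e / epsilon)     # columns
        self.X = [[0] * self.n for _ in range(self.m)]
        # Each row j gets its own (a, b), defining h_j(x) = ((a*x + b) mod p) mod n.
        self.H = [
            (random.randrange(1, self._PRIME), random.randrange(0, self._PRIME))
            for _ in range(self.m)
        ]

    def _item_to_int(self, item):
        """Map an arbitrary item (e.g. a URL) to a stable integer."""
        digest = hashlib.md5(str(item).encode("utf-8")).digest()
        return int.from_bytes(digest, "big")

    def _column(self, j, item):
        """k = h_j(item): the column for this item in row j."""
        a, b = self.H[j]
        return ((a * self._item_to_int(item) + b) % self._PRIME) % self.n

    def update(self, item, count=1):
        """Add `count` to X[j][h_j(item)] in every row j."""
        for j in range(self.m):
            self.X[j][self._column(j, item)] += count

    def estimate(self, item):
        """Return the minimum over rows of X[j][h_j(item)]."""
        return min(self.X[j][self._column(j, item)] for j in range(self.m))


if __name__ == "__main__":
    sketch = CountMinSketch(delta=1e-7, epsilon=0.005)
    # A skewed stream: one heavily viewed page and many rarely viewed ones.
    for _ in range(100000):
        sketch.update("https://example.com/popular-page")
    for page_id in range(1000):
        sketch.update("https://example.com/page-%d" % page_id, 3)
    print(sketch.estimate("https://example.com/popular-page"))  # >= 100000
    print(sketch.estimate("https://example.com/page-42"))       # >= 3
```

With these parameters the sketch holds only $17 \times 544$ integers, regardless of whether it has seen a thousand distinct URLs or a hundred million.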
The Tip of the Proverbial Iceberg
This is but one facet of one method in the universe of probabilistic data structures. Count-min sketches can be adapted to compensate for the statistical bias in their estimates, or can be used to keep track of inner products rather than just a count. Or one might look at Bloom filters, essentially the binary equivalent of the sketch we have discussed here, asking “Has this page been viewed?” rather than “How many times has this page been viewed?”
The math establishing error bounds or other properties can be somewhat hairy, but the structures themselves are not too complicated. I would encourage the curious reader to explore this topic in more depth!
References
- G. Cormode and S. Muthukrishnan, An Improved Data Stream Summary: The Count-Min Sketch and Its Applications in: Journal of Algorithms, 2005, pp. 58-75.
- G. Cormode, Sketch Techniques for Approximate Query Processing in: Foundations and Trends in Databases.
- G. Cormode, F. Korn, S. Muthukrishnan and D. Srivastava, Finding Hierarchical Heavy Hitters in Streaming Data in: ACM Transactions on Database Systems, 2007.