Counting is a funny business. Some tribal languages count just one, two and then many. For most of our history we had to live with human counts, with all their concomitant problems. Fortunately at the height of the industrial revolution, a few people hit on the idea of using a machine to tabulate data. IBM went on to pioneer many concepts behind modern computing. Today we have very powerful computers that are, among other things, very good at counting.
However, the rise of massive data streams has complicated our calculus. What happens when we need to count through millions of packets streaming through an Ethernet switch? Or in Shareaholic’s case, how do we compute the number of page views a web page receives over a stream that comprises many thousands of URLs per second? Maintaining a count of every item exactly is very expensive.
We solve this problem with a count-min sketch. In exchange for some accuracy, we drastically reduce the computational requirements of real time counting over an enormous stream of data. This lets us surface the heavy-hitters among our stream of pageview data.
The Count-min Sketch
Let’s begin by situating our problem in mathematics. We have a domain $U$ of items we want to count over. In the context of counting page views, an item $i\in U$ is a web page, and $c_i$ is the number of times page $i$ has been viewed. These pairs of items and their counts form a multiset $M$ that we track.
We could store the multiset $M$ explicitly, using an associative array that maps every $i$ as a key to its corresponding count $c_i$. But even with modern computational tools, storing millions or billions of individual mappings is expensive. For Shareaholic, it would require substantial computing resources to meet the outwardly simple goal of maintaining page view counts for millions of unique URLs. If all we need to do is maintain a list of the most-viewed pages, we need not be so precise.
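For contrast, the exact approach is nothing more than an associative array keyed by URL. A toy version (illustrative only, not Shareaholic’s actual pipeline) might look like the snippet below; it is simple, but its memory grows with the number of distinct pages.

```python
from collections import Counter

# Exact counting: one entry per distinct URL, so millions of unique
# pages mean millions of keys held in memory.
exact_counts = Counter()

def record_view(url):
    exact_counts[url] += 1
```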
Enter the count-min sketch. Rather than store an enormous set of mappings, we exploit the properties of pairwise-independent hash functions — namely that they do not collide too frequently — to drastically reduce the memory requirements of computing over such an enormous stream of data.
At a high level, we choose two small parameters, $\delta$ and $\varepsilon$, that control the precision of our sketch (more on these later). Remember, we sacrifice some precision in exchange for the sketch’s small computational footprint. Our choice of $\delta$ and $\varepsilon$ informs our creation of a matrix, $X$, and a corresponding set of pairwise-independent hash functions, $H$. These two components, along with update and query methods, form our sketch. Since the matrix $X$ and hash functions $H$ are typically much smaller than an explicit representation of $M$, we achieve huge memory savings.
More specifically, we use $\varepsilon$ and $\delta$ to choose $m=\lceil \ln{1/\delta} \rceil$ and $n=\lceil e / \varepsilon \rceil$, the dimensions of our matrix $X$. Intuitively, as we make $\delta$ and $\varepsilon$ smaller (thus upping the precision of the sketch), we have to store something closer to an explicit representation of $M$ by making $X$ larger. Because we begin counting having seen nothing, we initialize the entries of $X$ to zero. We then choose $m$ functions at random from a family of pairwise-independent hash functions to form

$$H = \{h_1, h_2, \ldots, h_m\},$$

where, for a given $h_j\in H$,

$$h_j : U \rightarrow \{1, 2, \ldots, n\}.$$
In words, each of the $m$ hash functions $h_j\in H$ corresponds to a row of $X$, and $h_j$ maps $i\in U$ (a web page) to one of the $n$ columns of $X$.
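To make the sizes concrete: with the values used later in this post, $\delta=10^{-7}$ and $\varepsilon=0.005$, we get $m=\lceil\ln 10^7\rceil=17$ rows and $n=\lceil e/0.005\rceil=544$ columns, fewer than ten thousand counters no matter how many distinct URLs we see. A rough Python sketch of this setup (the variable names are mine, and the $((ax+b)\bmod p)\bmod n$ construction is one standard way to draw a pairwise-independent family):

```python
import math
import random

delta, epsilon = 1e-7, 0.005
m = math.ceil(math.log(1 / delta))   # ceil(ln(1/delta)) -> 17 rows
n = math.ceil(math.e / epsilon)      # ceil(e/epsilon)   -> 544 columns

# The matrix X, initialized to zero because we have counted nothing yet.
X = [[0] * n for _ in range(m)]

# One hash function per row, drawn from the family
# h(x) = ((a*x + b) mod p) mod n, with p prime and a, b chosen at random.
p = 2 ** 61 - 1
H = [(random.randrange(1, p), random.randrange(0, p)) for _ in range(m)]

def h(j, item):
    """h_j(item): the column of X that row j assigns to this item."""
    a, b = H[j]
    # hash() is fine within a single process; a stable digest (e.g. hashlib)
    # would be needed if counts must survive restarts.
    return ((a * hash(item) + b) % p) % n
```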
When someone visits a page and we need to update our count for item $i$ with update $u_i\in\mathbb{N}$, we update according to

$$X_{j,k} \leftarrow X_{j,k} + u_i$$
for each row $j$ and its hash function $h_j\in H$, and the corresponding $k=h_j(i)$.
In words, the $j$th row of matrix $X$ receives an update by taking its corresponding hash function $h_j$, applying it to the item $i$ to obtain $k=h_j(i)$, which corresponds to a column of the matrix $X$, and then updating the $(j, k)$th position of $X$ by adding $u_i$.
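Continuing the little Python sketch from above (same $X$, $H$, and $h$), the update is one addition per row:

```python
def update(item, u=1):
    """Apply an update u_i for item i: add u to X[j][h_j(i)] in every row j."""
    for j in range(m):
        X[j][h(j, item)] += u
```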
To obtain the approximate count $\hat{c_i}$ for item $i$, we take the minimum of the estimates of $c_i$ from every row of $X$. This is essentially the update procedure in reverse. For a given item $i$, we obtain its estimate from the $j$th row of $X$ by using its corresponding hash function $h_j$ to obtain $k=h_j(i)$, and reading the value at position $j, k$ of $X$.
In the context of our page view counting, for a given row of $X$ we take its corresponding hash function, and then find the column to look up by applying the hash function to the page’s URL. Doing this over all the rows of $X$ gives us a set of estimates,

$$S = \{\hat{c}_{i,1}, \hat{c}_{i,2}, \ldots, \hat{c}_{i,m}\},$$
where $\hat{c}_{i,j}$ represents the estimated value for item $i$ with respect to row $j$ (or hash function $h_j$). We finally choose our best estimate by letting $\hat{c}_i=\min{S}$.
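Continuing the same sketch, the query walks the rows exactly as the update does, but reads instead of writes and then takes the minimum (the URL below is just a placeholder; any hashable item works):

```python
def estimate(item):
    """The count-min estimate: the minimum over rows j of X[j][h_j(item)]."""
    return min(X[j][h(j, item)] for j in range(m))

update("https://example.com/some-page")           # one page view
update("https://example.com/some-page", 4)        # four more views
print(estimate("https://example.com/some-page"))  # 5 here; in general >= the true count
```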
Why Does This Work?
The minimum of $S$ represents the best estimate of $c_i$. Because the range of each of the hash functions is much smaller than $|U|$ by design, we expect to have some hashing collisions. At the very least, the estimate $\hat{c}_i$ will include every update made to item $i$, though $\hat{c}_i$ will likely be higher because other items will hash to the same column. Thus, the estimate for a given item will satisfy $c_i\leq\hat{c}_i$. By taking the minimum estimate, then, we obtain the closest estimate possible of $c_i$.
That we get some notion of a “good” estimate requires a bit more work. But we can, in fact, put a specific bound on the precision of $\hat{c}_i$. I omit the proof here, but I will vouch for it (it appears in the Cormode and Muthukrishnan paper referenced below). By defining $m$ and $n$ as we did above, we can say that

$$\Pr\left[\hat{c}_i \leq c_i + \varepsilon Z\right] \geq 1-\delta,$$

where $Z$ is the total count over all items.
In words, for the $\delta$ we choose, we know that the probability of a given estimate $\hat{c}_i$ not exceeding the true value $c_i$ by more than $\varepsilon$ of the total count $Z$ is at least $1-\delta$. Since we typically choose very small values for $\delta$ and $\varepsilon$, e.g. $\delta=10^{-7}$ and $\varepsilon=0.005$, we can be reasonably confident in our estimates.
Knowing this gives us insight into what count-min sketches are good for, and what they are not so great for. Because the precision of the estimates is bounded in terms of the entire count of the system, counts for items with small true counts will tend to be more corrupted. This makes a count-min sketch useful for keeping track of which web pages receive the most traffic, but not for tracking every page across our network.
Looking at this in the context of the page view problem, suppose we see 10 million page views every day. That makes $Z=10^7$ in the theorem above. If we set up a count-min sketch with $\varepsilon=0.005$ (a value we have used in production), our probable upper bound on error is $(0.005)(10^7)=50{,}000$. In practice, the sketch will be more precise than that; the value $\varepsilon Z=50{,}000$ is an upper bound, after all. But it gets at the problem: the error is absolute, not relative to each page’s true count. Say 100 pages with two views each, out of 1 million different pages, all hash to the same cells. Our sketch would report each of those pages as having 200 views, rather than just two. Adding those 200 views to a page that has already received 200,000 views makes much less of a difference.
An Example in Python
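The listing that originally appeared here is not reproduced, but a minimal, self-contained count-min sketch along the lines described above might look like this. The class name, method names, and example URLs are mine, not Shareaholic’s production code.

```python
import hashlib
import math
import random


class CountMinSketch:
    """A minimal count-min sketch: an m-by-n table of counters plus one
    pairwise-independent hash function per row."""

    _PRIME = 2 ** 61 - 1  # a large Mersenne prime for the hash family

    def __init__(self, delta=1e-7, epsilon=0.005):
        if not (0 < delta < 1 and 0 < epsilon < 1):
            raise ValueError("delta and epsilon must lie in (0, 1)")
        self.m = math.ceil(math.log(1 / delta))  # rows
        self.n = math.ceil(math.e / epsilon)     # columns
        self.X = [[0] * self.n for _ in range(self.m)]
        # Each row j gets its own (a, b), defining h_j(x) = ((a*x + b) mod p) mod n.
        self.H = [
            (random.randrange(1, self._PRIME), random.randrange(0, self._PRIME))
            for _ in range(self.m)
        ]

    def _item_to_int(self, item):
        """Map an arbitrary item (e.g. a URL) to a stable integer."""
        digest = hashlib.md5(str(item).encode("utf-8")).digest()
        return int.from_bytes(digest, "big")

    def _column(self, j, item):
        """k = h_j(item): the column for this item in row j."""
        a, b = self.H[j]
        return ((a * self._item_to_int(item) + b) % self._PRIME) % self.n

    def update(self, item, count=1):
        """Add `count` to X[j][h_j(item)] in every row j."""
        for j in range(self.m):
            self.X[j][self._column(j, item)] += count

    def estimate(self, item):
        """Return the minimum over rows of X[j][h_j(item)]."""
        return min(self.X[j][self._column(j, item)] for j in range(self.m))


if __name__ == "__main__":
    sketch = CountMinSketch(delta=1e-7, epsilon=0.005)
    # A skewed stream: one heavily viewed page and many rarely viewed ones.
    for _ in range(100000):
        sketch.update("https://example.com/popular-page")
    for page_id in range(1000):
        sketch.update("https://example.com/page-%d" % page_id, 3)
    print(sketch.estimate("https://example.com/popular-page"))  # >= 100000
    print(sketch.estimate("https://example.com/page-42"))       # >= 3
```

With these parameters the sketch holds only $17 \times 544$ integers, regardless of whether it has seen a thousand distinct URLs or a hundred million.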
The Tip of the Proverbial Iceberg
This is but one facet of one method in the universe of probabilistic data structures. Count-min sketches can be adapted to compensate for the statistical bias in their estimates, or can be used to keep track of inner products rather than just a count. Or one might look at Bloom filters, essentially the binary equivalent of the sketch we have discussed here, asking “Has this page been viewed?” rather than “How many times has this page been viewed?”
The math establishing error bounds or other properties can be somewhat hairy, but the structures themselves are not too complicated. I would encourage the curious reader to explore this topic in more depth!
References
- G. Cormode and S. Muthukrishnan, An Improved Data Stream Summary: The Count-Min Sketch and Its Applications in: Journal of Algorithms, 2005, pp. 58-75.
- G. Cormode, Sketch Techniques for Approximate Query Processing in: Foundations and Trends in Databases.
- G. Cormode, F. Korn, S. Muthukrishnan and D. Srivastava, Finding Hierarchical Heavy Hitters in Streaming Data in: ACM Transactions on Database Systems, 2007.