The Count-min Sketch: How to Count Over Large Keyspaces When ‘About Right’ Is Good Enough

Posted by Joseph Kibe on Dec 3rd, 2012

Counting is a funny business. Some tribal languages count just one, two and then many. For most of our history we had to live with human counts, with all their concomitant problems. Fortunately at the height of the industrial revolution, a few people hit on the idea of using a machine to tabulate data. IBM went on to pioneer many concepts behind modern computing. Today we have very powerful computers that are, among other things, very good at counting.

However, the rise of massive data streams has complicated our calculus. What happens when we need to count through millions of packets streaming through an Ethernet switch? Or in Shareaholic’s case, how do we compute the number of page views a web page receives over a stream that comprises many thousands of URLs per second? Maintaining a count of every item exactly is very expensive.

We solve this problem with a count-min sketch. In exchange for some accuracy, we drastically reduce the computational requirements of real time counting over an enormous stream of data. This lets us surface the heavy-hitters among our stream of pageview data.

How to Find the Image That Best Represents a Web Page

Posted by Zachary Broderick on Nov 2nd, 2012

We’ve been working a lot on products that contain visual previews of third party content. An example of this is Shareaholic Recommendations, which shows thumbnails of related pages at the bottom of a publisher’s blog post.

Finding the image that best represents a blog post is no trivial task. My research revealed that there isn’t much information floating around on this topic, and that the state of the art systems that do exist (e.g. Facebook’s thumbnail selection algorithm) are not all that sophisticated. I quickly realized that we could do better by rolling our own. That’s what we did, and here’s how we did it.

Using HAProxy to Build a More Featureful Elastic Load Balancer

Posted by Robby Grossman on Oct 26th, 2012

Though Shareaholic is hosted in the AWS cloud, we avoid depending on Amazon’s virtualized cloud services whenever possible. If we ever hit a bottleneck in AWS, I want to be able to switch providers without needing to rebuild a core piece of our architecture. I also don’t want our tech team to have to make product and infrastructure sacrifices just so that we conform to AWS standard practices.

Load balancing with HAProxy was the first example of a service that Amazon provides, that we felt was better to manage ourselves.

Here’s how we did it.

Blog Archives

Sharing the Tech

The Shareaholic Engineering Blog

The Count-min Sketch: How to Count Over Large Keyspaces When ‘About Right’ Is Good Enough

How to Find the Image That Best Represents a Web Page

Using HAProxy to Build a More Featureful Elastic Load Balancer