We’ve been working a lot on products that contain visual previews of third-party content. An example of this is Shareaholic Recommendations, which shows thumbnails of related pages at the bottom of a publisher’s blog post.
Finding the image that best represents a blog post is no trivial task. My research revealed that there isn’t much information floating around on this topic, and that the state-of-the-art systems that do exist (e.g. Facebook’s thumbnail selection algorithm) are not all that sophisticated. I quickly realized that we could do better by rolling our own. That’s what we did, and here’s how we did it.
Finding the Right Image
Given a web page, how do we select the best image to use as its thumbnail?
Modern web pages lack a common structure and are full of images in the form of ads, banners and icons. There’s no HTML tag called <featured_image> that tells us exactly which image to use…
Open Graph Meta Tags
Actually, that’s not entirely true. Facebook and Twitter have standardized tags for exactly this purpose, and many popular blogs use them. Here’s an example from the Shareaholic Blog:
<meta property='og:image' content='http://blog.shareaholic.com/wp-content/uploads/2012/09/DSC_0056-300x199.jpg' />
The W3 spec provides an alternate version that’s seldom used:
<meta property="http://ogp.me/ns#image" content="http://example.com/alice/bob-ugly.jpg" />
These examples are part of the Open Graph Protocol, which enables site owners to embed standardized metadata in their pages so that social media sites can more easily index their content. This benefits everybody: site owners are empowered to express their preferences directly, and social media sites don’t have to do any guessing while scraping thumbnails.
The less commonly used Twitter Cards follow a convention similar to the Open Graph tags, using
twitter:image tags instead. At Shareaholic we support sites that use any of these tags, as well as the small but growing number of publishers using the
There are some pitfalls to avoid when dealing with Open Graph tags, however, so we can’t just grab the image they point to and assume we’re done. Depending on where and how we are displaying thumbnails, the images may be of insufficient size or quality. Open Graph and Twitter Card tags are specifically designed for Facebook and Twitter sharing, respectively. The thumbnails Facebook shows in newsfeeds are small, and many sites format their Open Graph images accordingly. We’ve found this undesirable because we often display large thumbnails or full-sized images, and small photos look bad when you enlarge them.
Another problem we discovered is that sometimes an Open Graph image has nothing to do with its page, but is instead a site icon or some other irrelevant image. Some plugins and web frameworks insert Open Graph tags with the site’s favicon, or worse, their own product’s icon.
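As a rough illustration of supporting several tag families at once, here is a minimal sketch that picks a declared thumbnail from meta tags that have already been parsed into hashes. The method name, the hash layout and the preference order are assumptions made for this example, not our production code. The result still needs the size and relevance checks described in this post.

```ruby
# Returns the content of the highest-priority image meta tag, or nil.
#
# meta_tags: array of hashes, one per <meta> tag, holding whichever of the
#            'property', 'name' and 'content' attributes the tag carries.
# keys:      tag identifiers in order of preference (order is an assumption).
def declared_thumbnail(meta_tags,
                       keys: ['og:image', 'http://ogp.me/ns#image', 'twitter:image'])
  keys.each do |key|
    tag = meta_tags.find do |t|
      # Open Graph tags use "property"; Twitter Cards commonly use "name".
      (t['property'] == key || t['name'] == key) &&
        !t['content'].to_s.strip.empty?
    end
    return tag['content'].strip if tag
  end
  nil
end

tags = [
  { 'name' => 'twitter:image', 'content' => 'http://example.com/card.jpg' },
  { 'property' => 'og:image',  'content' => 'http://example.com/og.jpg' }
]
puts declared_thumbnail(tags)
```

Checking both attributes keeps the lookup tolerant of sites that mix the two conventions.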
This brings me to a repeated lesson I learned while building this library: web sites are not our allies in this process. We’d hope they’d make things easy by providing Open Graph tags, or by using large, relevant images, or by sensibly naming their <div>s, or by using properly-encoded image URLs that don’t have whitespace characters in them. The reality is that we cannot assume a web site will do any of these things. In fact, in order to conquer the edge cases, a guiding assumption we make is that the pages we crawl are doing everything in their power to hide their best images from us.
Finding the Largest Image
If the page in question doesn’t have Open Graph tags, we enumerate all of the image tags on the page and use some heuristics to select the one most likely to be relevant. There are also cases where we don’t select any image because there are no good options.
The question now becomes, “Of all the images on a web page, which one, if any, most represents the page?”
Humans can answer this question intuitively, and if you think about it, our intuition boils down to finding the largest image on the page that is closest to the top of it. This is conceptually simple, but made difficult by the fact that a page’s HTML obscures where images are placed visually.
Finding the Actual Image Size
Unfortunately, finding the size of an image is not as easy as looking for width and height attributes, because most of the time they are not present. We’re accustomed to our browsers telling us how large each image is, but remember that when our browser does this, it has already rendered the entire page and downloaded every image on it. For performance reasons, we’d like to avoid this. But can we? Sometimes the dimensions are specified via CSS. This is also applied by the browser during rendering, and is more complicated to extract than width and height attributes within an image tag.
It turns out that we do have to download images to determine their sizes, but the performance hit isn’t as bad as it sounds. There is a technique that involves starting to download an image and then cancelling it as soon as we have enough information to determine its size (usually from the header). This lets us determine an image’s dimensions by downloading only a small piece of it. We use the FastImage Ruby library to do this. Performance note: even with this optimization, we avoid fetching images mid-web request because we’re still at the mercy of the third-party web server.
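To show why a small piece of the file is enough, here is a minimal sketch that reads dimensions from the first 24 bytes of a PNG. It is an illustration of the idea only, handles just one format, and is not how FastImage is implemented. The method name is made up for this example.

```ruby
require 'stringio'

# Reads only the first 24 bytes of an IO and returns [width, height] if the
# stream looks like a PNG, or nil otherwise. A PNG starts with an 8-byte
# signature followed by the IHDR chunk, whose first two fields are the
# width and height as 32-bit big-endian integers.
def png_dimensions(io)
  signature = "\x89PNG\r\n\x1A\n".b
  header = io.read(24)
  return nil if header.nil? || header.bytesize < 24
  return nil unless header.byteslice(0, 8) == signature
  return nil unless header.byteslice(12, 4) == 'IHDR'

  header.byteslice(16, 8).unpack('NN')
end

# Stand-in for a network stream: a fake PNG header plus unread image data.
fake_png = "\x89PNG\r\n\x1A\n".b + [13].pack('N') + 'IHDR' +
           [640, 480].pack('NN') + ('x' * 1000)
p png_dimensions(StringIO.new(fake_png))
```

In a real crawler the IO would be an HTTP response body that is closed as soon as the header has been read. A library such as FastImage covers the other common formats.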
As with the Open Graph tags, we take a composition approach. First we look for width and height attributes. If that fails, we use the partial image downloading technique.
In addition to letting us skip downloading some images, this process often avoids an important gotcha. When we say we want the “largest” image, we mean “largest” in terms of the size it is displayed, not the size of the raw image file. Some web pages contain images that are 1000×1000 pixels, but are only displayed at 100×100 pixels. If we use only the image downloading approach, such images are given inflated importance. Most websites, in order to save bandwidth, will scale the source file down for these situations; but again, websites are not our friends in this endeavor.
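The attribute-first, download-second order can be sketched as follows. This assumes the image tag’s attributes are available as a hash and that the partial download is wrapped in a callable. Both the method name and the fetch_size hook are hypothetical.

```ruby
# Returns [width, height] for an image, preferring the displayed size.
#
# attrs:      hash of an <img> tag's attributes, e.g. { 'src' => ..., 'width' => '100' }
# fetch_size: callable taking a URL and returning [width, height] or nil
#             (hypothetical hook; it could wrap a partial image download)
def displayed_size(attrs, fetch_size)
  # Only plain positive integers count; values like "50%" or "auto" fall through.
  width  = Integer(attrs['width'].to_s, 10, exception: false)
  height = Integer(attrs['height'].to_s, 10, exception: false)
  return [width, height] if width && height && width > 0 && height > 0

  fetch_size.call(attrs['src'])
end

# Stand-in for the network: pretend every file is 1000x1000 on disk.
fetcher = ->(_url) { [1000, 1000] }
p displayed_size({ 'src' => 'a.jpg', 'width' => '100', 'height' => '100' }, fetcher)
p displayed_size({ 'src' => 'b.jpg' }, fetcher)
```

Because the attributes win when present, a 1000×1000 file displayed at 100×100 is scored at its displayed size, and no request is made for it.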
Banners, Sprites and Icons
Many images are large and high up on a page, but still unsuitable for thumbnails. Banners containing logos and titles are the primary example. As humans we discount them because they’re short and wide, but if we’re calculating image size by multiplying width by height, they’ll often be surfaced as the largest images. To work around this, we translate this visual intuition into an algorithm. Before calculating the area of an image, we divide its width by its height to ensure its aspect ratio is less than some threshold (3.0). Images that fail this test are classified as banners and discarded.
Another type of image that is large but unsuitable is the sprite. Sprites are collections of discrete smaller images. The smaller images are displayed on the page, referenced by coordinate offsets with respect to the large image. By merging several images into one, pages can load faster by reducing the number of network requests they require. When sprites are viewed as a single image, they look ridiculous and make poor thumbnails, so we filter them out of our list of candidates. We’ve found that sprites usually contain only small images like icons, and not large featured images, so they can be ignored safely. A simple string filter for the word “sprite” identifies the vast majority of them.
Finally, some pages do not have a suitable thumbnail image. In these cases it’s better to display no image than one that is completely wrong. We’ve established a minimum image size that we’ll accept to avoid pulling things like icons or 1×1 tracking pixels. We skip any image less than 5000 pixels in area, even if it comes from an Open Graph tag. This requirement is based on how big an image needs to be to look acceptable at the size we display it, typically 137×137.
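The banner, sprite and minimum-size rules from this section can be combined into one filter. The sketch below assumes each candidate is a hash with :src, :width and :height, and that the “sprite” string filter is applied to the image URL. The method names are made up for this example.

```ruby
# An image counts as a banner when width / height reaches the threshold.
def banner?(img, max_aspect_ratio: 3.0)
  img[:width].to_f / img[:height] >= max_aspect_ratio
end

# Simple string filter; applying it to the URL is an assumption.
def sprite?(img)
  img[:src].to_s.downcase.include?('sprite')
end

# Returns the largest remaining candidate by area, or nil if none qualify.
def best_candidate(images, min_area: 5000, max_aspect_ratio: 3.0)
  images
    .reject { |img| sprite?(img) }
    .reject { |img| banner?(img, max_aspect_ratio: max_aspect_ratio) }
    .select { |img| img[:width] * img[:height] >= min_area }
    .max_by { |img| img[:width] * img[:height] }
end

images = [
  { src: 'http://example.com/header-banner.png', width: 960, height: 120 },
  { src: 'http://example.com/img/sprites.png',   width: 800, height: 600 },
  { src: 'http://example.com/pixel.gif',         width: 1,   height: 1 },
  { src: 'http://example.com/photo.jpg',         width: 300, height: 199 }
]
p best_candidate(images)
```

This sketch ranks by area only. It does not model how close an image is to the top of the page, and it returns nil when nothing passes, which matches the preference for showing no image over a wrong one.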
Content Zones, Sidebars and Comments
Sprites and banners are fairly easy to identify, but what about the other unsuitable images that flood the average webpage, such as advertisements, related content thumbnails, user avatars in the comments section, and images in the header/footer like author photos? Fortunately the big one on this list, advertisements, is usually Flash-based or in an iframe and thus won’t be picked up by a parser looking for <img> tags. But for a more general answer, once again we turn to human behavior: our mind is able to categorize advertisements and other images as irrelevant because they reside outside of the main content, in headers, footers and sidebars.
As discussed earlier, web pages don’t have a standard layout. Determining which zone an image is in programmatically is difficult, but we can make pretty good guesses by searching for <div> tags with IDs and classes containing certain keywords, and then either excluding or giving preference to the images contained within them.
Many CMSs, including WordPress, encapsulate a blog post’s content inside a <div id="content">. In fact, if you look at the source for this article, you’ll find such a <div>. If you look inside it, you’ll see that it encompasses the article text but excludes the header, footer and comments section (though not the sidebar; again, this is an inexact science). Restricting our search for image tags to within this div will increase the chance of finding the right image. It will also enable us to skip downloading many of the images on the page. If we don’t find a suitable image within these content divs, or if no such div exists, we expand our search to the rest of the page, as not all sites follow this pseudo-convention.
To find these content divs, we use the Nokogiri HTML parser and search the page via XPaths. The following query finds img tags for images where an ancestor’s ID contains the word “content”:
//img[ancestor::*[contains(@id, 'content')]]
We also add other keywords to our query, such as “main” and some of the new HTML5 semantic tags like <article>. We can add additional clauses to our query using “and” and “or”:
//img[ancestor::*[contains(@id, 'content') or contains(@id, 'main')]]
Similarly, we exclude images in undesirable zones. In these queries we specify that we don’t want img tags whose ancestors have certain keywords like “sidebar”, “header”, “footer” and “comment” in their IDs:
//img[not(ancestor::*[contains(@id, 'sidebar') or contains(@id, 'comment') or contains(@id, 'footer') or contains(@id, 'header')]) and ancestor::*[contains(@id, 'content')]]
Notice the “not” clause. If we can’t find any suitable images, we expand our search beyond content divs. However, we never expand it into the known “bad” zones such as “comment”, “header” and “footer”.
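Since these queries are just strings, they can be assembled from keyword lists. The sketch below is one way to do that. The method names are hypothetical, and the keywords are assumed to be plain words with no quote characters. With its defaults it builds the same query shown above. Passing restrict_to_good: false gives the wider fallback query, which still keeps the “not” clause.

```ruby
# Builds "contains(@id, 'a') or contains(@id, 'b')" from a keyword list.
def contains_any(keywords)
  keywords.map { |keyword| "contains(@id, '#{keyword}')" }.join(' or ')
end

# Builds an XPath for <img> tags that avoids bad zones and, optionally,
# requires an ancestor in a good zone.
def image_xpath(good_zones: %w[content],
                bad_zones: %w[sidebar comment footer header],
                restrict_to_good: true)
  clauses = ["not(ancestor::*[#{contains_any(bad_zones)}])"]
  clauses << "ancestor::*[#{contains_any(good_zones)}]" if restrict_to_good
  "//img[#{clauses.join(' and ')}]"
end

puts image_xpath                           # strict: content zones only
puts image_xpath(restrict_to_good: false)  # fallback: anywhere but bad zones
```

With Nokogiri, either string can be passed to a parsed document’s xpath method, for example Nokogiri::HTML(html).xpath(image_xpath).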
Unfortunately, on many web pages these bad zones will exist within content zones, and vice versa. For example, the “content”
div may include both the blog post and the
div. We avoid this in the aforementioned XPath search, but I point it out because it’s something to be aware of if you’re implementing this elsewhere. I’ve seen many ridiculous layouts down in the trenches!
The above query is not exhaustive. Which zones to exclude and how to identify them is as much an art as a science. We consider HTML5 semantic tags such as <footer>, as well as zones labeled “nav” (for navigation). Most commenting engines use some form of the word “comment”. We include this in our search because we don’t want to be using some commenter’s Facebook photo as our page thumbnail. Trickier than comments are related content plugins, whose thumbnails are tempting targets. We work around this by researching the most popular plugins and adding the names of their container divs to our XPath search queries.
By combining the above techniques, we’ve developed an effective algorithm for finding the best image on a page from a list of large, reasonable candidates. This is usually sufficient, but to find the best thumbnail, we need to support videos, too.
Many blog posts have a featured video instead of a featured image. Often that’s all they have, meaning our thumbnail scraper will come up empty-handed if it’s just searching for the biggest picture on the page. It’s easy to extract a nice thumbnail from an embedded video, provided that it’s hosted on one of the major sites like YouTube or Vimeo. The key is a technology called OEmbed. Sites that implement OEmbed provide an API endpoint where we can pass in the URL of a piece of that site’s content (in this case, a video) and it will send back some metadata describing the video. This metadata includes a thumbnail image.
The first step is to find the embedded video in the HTML and extract its ID. Let’s use YouTube as an example.
There are two common ways to embed video on a page. The preferred way is to use an iframe, so we enumerate through the iframes on a given page and look for ones that contain “youtube” in the src attribute. Videos fall victim to the same problems as images with regard to zones, so we apply the same filtering from the previous section. Here’s an example XPath:
//iframe[not(ancestor::*[contains(@id, 'sidebar') or contains(@id, 'comment') or contains(@id, 'footer') or contains(@id, 'header')]) and contains(@src, 'youtube')]/@src
We rarely get more than one result from this query. When we do, we take the first one. Now that we have the iframe’s src attribute, we extract the ID of the video. Here is where the science becomes inexact. Most embedded YouTube URLs take one of the following forms:
The most straightforward way to extract the ID would be to split on the “/” character and take the last token, but in the latter case we must also remove the querystring params (which are not always present). YouTube URLs are case-sensitive (except for the domain name), so we’re careful not to case-normalize the URL.
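Here is a minimal sketch of that extraction, using Ruby’s standard URI class to drop the querystring before taking the last path segment. The method name is hypothetical, and the example URLs in the usage lines are illustrative embed forms rather than an exhaustive list.

```ruby
require 'uri'

# Returns the video ID from a YouTube embed src, or nil if the URL is not a
# recognizable YouTube embed. The path is never case-normalized, because
# video IDs are case-sensitive.
def youtube_video_id(src)
  uri = URI.parse(src.to_s.strip)
  # Only the host is compared case-insensitively.
  return nil unless uri.host.to_s.downcase.include?('youtube')

  # uri.path excludes the querystring, so "?rel=0" and friends are gone here.
  segments = uri.path.to_s.split('/').reject(&:empty?)
  # Expect at least a prefix segment (such as "embed") followed by the ID.
  return nil if segments.size < 2

  segments.last
rescue URI::InvalidURIError
  nil
end

p youtube_video_id('http://www.youtube.com/embed/dQw4w9WgXcQ')
p youtube_video_id('http://www.youtube.com/embed/dQw4w9WgXcQ?rel=0')
```

Letting the URI parser separate path from query avoids hand-rolled string splitting on “?”, and the same method works whether or not querystring params are present.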
Once we have the video ID we query YouTube’s OEmbed API to get the metadata for that video. We use ruby-oembed to do this, which makes a request to the API endpoint:
ENCODED_VIDEO_URL is equal to the encoded form of:
After calling the OEmbed API, we get a blob of JSON with a variety of fields. The one we are interested in is “thumbnail_url”, which points us to a still image from the video.
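For illustration, here is a sketch of the two ends of that exchange without the network call in between: building the request URL and reading the thumbnail out of the response. The endpoint and the watch-URL form used here are assumptions based on YouTube’s public OEmbed endpoint, the sample JSON is invented, and the method names are hypothetical. In practice, ruby-oembed takes care of the request itself.

```ruby
require 'json'
require 'uri'

# Builds the OEmbed request URL for a video ID (endpoint and watch-URL
# format are assumptions for this sketch).
def youtube_oembed_url(video_id)
  video_url = "http://www.youtube.com/watch?v=#{video_id}"
  encoded = URI.encode_www_form_component(video_url)
  "https://www.youtube.com/oembed?url=#{encoded}&format=json"
end

# Pulls "thumbnail_url" out of an OEmbed JSON response body, or returns nil
# if the body is not a JSON object or the field is missing.
def thumbnail_from_oembed(json_body)
  data = JSON.parse(json_body)
  data.is_a?(Hash) ? data['thumbnail_url'] : nil
rescue JSON::ParserError
  nil
end

puts youtube_oembed_url('dQw4w9WgXcQ')

sample = '{"type":"video","thumbnail_url":"http://example.com/thumb.jpg"}'
puts thumbnail_from_oembed(sample)
```

Returning nil on malformed responses lets the caller fall through to "no thumbnail" rather than raising in the middle of a crawl.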
I mentioned that there was another method of embedding video. It involves the older practice of using the <embed> tag. For this we use an XPath similar to the one above to extract the src attribute. We do this only if we are unable to find a suitable iframe video.
The above YouTube example can be applied to most of the major video hosts, as well as other multimedia hosts like SlideShare. We just need to investigate each host’s embed URL formats and their OEmbed endpoints and write cases for them in our algorithm. Unfortunately, OEmbed is not standardized sufficiently for this process to be abstracted easily across all hosts; however, there are services you can pay for (like Embed.ly) that will do this for you.
Putting It Together
Combining all of these methods, we get an algorithm that flows like this:
- Check for Open Graph/Twitter Card tags
- Find the largest suitable image on the page
- Look for a suitable video thumbnail if no image is found
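The flow above can be sketched as an ordered list of finders, where the first one to return a result wins. Everything named here is hypothetical: each finder is a stand-in for one of the techniques in this post, and the page is reduced to a hash for the sake of the example.

```ruby
# Runs each finder in order and returns the first thumbnail found, along with
# the step that produced it. Returns nil when every step comes up empty.
#
# steps: hash of step name => callable taking the page and returning a URL or nil
def select_thumbnail(page, steps)
  steps.each do |name, finder|
    url = finder.call(page)
    return { source: name, url: url } if url
  end
  nil
end

# Stand-in finders; real ones would parse HTML and apply the size checks.
steps = {
  meta_tags:     ->(page) { page[:og_image] },
  largest_image: ->(page) { page[:largest_image] },
  video:         ->(page) { page[:video_thumbnail] }
}

p select_thumbnail({ largest_image: 'http://example.com/photo.jpg' }, steps)
```

Ruby hashes preserve insertion order, so the order of the steps hash is the order of preference.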
These are the lessons I’ve learned and processes I’ve implemented after spending significant time trial-and-erroring. I’ve documented my findings in the hope of filling a small void in Google’s search results for “thumbnail scraping,” hopefully saving someone like yourself a considerable amount of time. We’ve been very happy with the results we get from these techniques, and I think you will be too should you implement them.